Ying-ping Chen and Meng-Hiot Lim (Eds.)
Linkage in Evolutionary Computation
Studies in Computational Intelligence, Volume 157

Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute, Polish Academy of Sciences
ul. Newelska 6, 01-447 Warsaw, Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com
Ying-ping Chen Meng-Hiot Lim (Eds.)
Linkage in Evolutionary Computation
Ying-ping Chen
Department of Computer Science
National Chiao Tung University
1001 Ta Hsueh Road, HsinChu City 300, Taiwan
Email: [email protected]

Meng-Hiot Lim
School of Electrical & Electronic Engineering, Block S1
Nanyang Technological University
Singapore 639798, Singapore
Email: [email protected]

ISBN 978-3-540-85067-0
e-ISBN 978-3-540-85068-7
DOI 10.1007/978-3-540-85068-7
Studies in Computational Intelligence ISSN 1860-949X
Library of Congress Control Number: 2008931425

© 2008 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com
Preface
For the past three decades, genetic and evolutionary algorithms (GEAs) have gradually established a strong foothold as powerful search methods and have been widely applied to solve problems in many disciplines. To improve their performance and applicability, numerous sophisticated mechanisms have been introduced and integrated into GEAs. One major category of these enhancing mechanisms is the concept of linkage. The motivation for linkage is to model the relationships or interactions between the decision variables, similar to the genetic linkage observed in biological systems, through linkage learning techniques. Linkage learning connects computational optimization methodologies with natural evolution mechanisms. Not only can learning and adapting natural mechanisms enable us to design better computational methodologies, but the insight gained by observing and analyzing algorithmic behavior also permits us to further understand the biological systems on which GEAs are based.

In recent years, the issue of linkage in GEAs has garnered greater attention and recognition from researchers. Conventional approaches that rely heavily on ad hoc tweaking of parameters to control the search by balancing exploitation and exploration are grossly inadequate. As shown in the work reported here, such parameter-tweaking approaches have their limits; they can easily be "fooled" by trivial or peculiar instances of the class of problems that the algorithms are designed to handle. Furthermore, these approaches are usually blind to the interactions between the decision variables, thereby disrupting the partial solutions that are being built up along the way.

The whole volume, consisting of 19 chapters, is divided into three parts: (i) Models and Theories; (ii) Operators and Frameworks; (iii) Applications. Part I consists of 7 chapters, dealing primarily with the theoretical foundations associated with linkage. Most of the chapters in this part focus on establishing models for formulating the linkage problem. One clear trend that emerges from these chapters is the general acceptance of estimation of distribution algorithms (EDAs) as a viable technique for linkage learning. However, the issue of computational complexity associated with EDAs needs to be addressed. The next 7 chapters, which are
grouped under Part II, discuss issues pertaining to search operators and computational frameworks that facilitate linkage learning. The need for operators designed to minimize the disruption of good partial solutions is emphasized in these chapters. Along with this, frameworks for extracting and analyzing solutions need to be in place so that learning can take place. Part III consists of five chapters, presenting applications that incorporate schemes for analyzing intermediate solutions so that linkage-based learning can be incorporated into problem solving. These are all interesting applications; one of the chapters, for example, incorporates linkage learning in the design synthesis of MEMS structures. We feel that more work on applying linkage to real-world problems should be encouraged, and our hope is for this edited volume to be a significant step in that direction.

We hope that this edited volume will serve as a useful guide and reference for researchers who are currently working in the area of linkage. For postgraduate research students, this volume will serve as a good source of reference. It is also suitable as a text for a graduate-level course focusing on linkage issues. Although it is unlikely to be complete, even when confined to the scope of evolutionary linkage, it is reasonably comprehensive; the collection of chapters can quickly expose practitioners to most of the important issues pertaining to linkage. For practitioners who are looking to put the concept of linkage into practice, the few chapters on applications will serve as a useful guide.

We are extremely fortunate and honored to have a list of distinguished contributors who are willing to share their findings and expertise in this edited volume. For this, we are truly grateful. Besides being cooperative and accommodating, the authors have done a marvelous job, judging from the high quality of the manuscripts submitted.
Taiwan & Singapore, May 2008
Ying-ping Chen Meng-Hiot Lim
Contents
Part I: Models and Theories

Parallel Bivariate Marginal Distribution Algorithm with Probability Model Migration
Josef Schwarz, Jiri Jaros ......... 3

Linkages Detection in Histogram-Based Estimation of Distribution Algorithm
Nan Ding, Shude Zhou ......... 25

Linkage in Island Models
Zbigniew Skolicki ......... 41

Real-Coded ECGA for Solving Decomposable Real-Valued Optimization Problems
Minqiang Li, David E. Goldberg, Kumara Sastry, Tian-Li Yu ......... 61

Linkage Learning Accuracy in the Bayesian Optimization Algorithm
Claudio F. Lima, Martin Pelikan, David E. Goldberg, Fernando G. Lobo, Kumara Sastry, Mark Hauschild ......... 87

The Impact of Exact Probabilistic Learning Algorithms in EDAs Based on Bayesian Networks
Carlos Echegoyen, Roberto Santana, Jose A. Lozano, Pedro Larrañaga ......... 109

Linkage Learning in Estimation of Distribution Algorithms
David Coffin, Robert E. Smith ......... 141

Part II: Operators and Frameworks

Parallel GEAs with Linkage Analysis over Grid
Asim Munawar, Mohamed Wahib, Masaharu Munetomo, Kiyoshi Akama ......... 159

Identification and Exploitation of Linkage by Means of Alternative Splicing
Philipp Rohlfshagen, John A. Bullinaria ......... 189

A Clustering-Based Approach for Linkage Learning Applied to Multimodal Optimization
Leonardo Emmendorfer, Aurora Pozo ......... 225

Studying the Effects of Dual Coding on the Adaptation of Representation for Linkage in Evolutionary Algorithms
Maroun Bercachi, Philippe Collard, Manuel Clergue, Sebastien Verel ......... 249

Symbiotic Evolution to Avoid Linkage Problem
Ramin Halavati, Saeed Bagheri Shouraki ......... 285

EpiSwarm, a Swarm-Based System for Investigating Genetic Epistasis
Thomas Goth, Chia-Ti Tsai, Fu-Tien Chiang, Clare Bates Congdon ......... 315

Real-Coded Extended Compact Genetic Algorithm Based on Mixtures of Models
Pier Luca Lanzi, Luigi Nichetti, Kumara Sastry, Davide Voltini, David E. Goldberg ......... 335

Part III: Applications

Genetic Algorithms for the Airport Gate Assignment: Linkage, Representation and Uniform Crossover
Xiao-Bing Hu, Ezequiel Di Paolo ......... 361

A Decomposed Approach for the Minimum Interference Frequency Assignment
Gualtiero Colombo, Stuart M. Allen ......... 389

Set Representation and Multi-parent Learning within an Evolutionary Algorithm for Optimal Design of Trusses
Amitay Isaacs, Tapabrata Ray, Warren Smith ......... 419

A Network Design Problem by a GA with Linkage Identification and Recombination for Overlapping Building Blocks
Miwako Tsuji, Masaharu Munetomo, Kiyoshi Akama ......... 441

Knowledge-Based Evolutionary Linkage in MEMS Design Synthesis
Corie L. Cobb, Ying Zhang, Alice M. Agogino, Jennifer Mangold ......... 461

Index ......... 485

Author Index ......... 487
List of Contributors
Alice M. Agogino, University of California, Berkeley, United States
Kiyoshi Akama, Hokkaido University, Japan
Stuart M. Allen, Cardiff University, United Kingdom
Maroun Bercachi, Universite de Nice-Sophia Antipolis, France
John A. Bullinaria, University of Birmingham, United Kingdom
Fu-Tien Chiang, National Taiwan University Hospital, Taiwan
Manuel Clergue, Universite de Nice-Sophia Antipolis, France
Corie L. Cobb, University of California, Berkeley, United States
David Coffin, University College London, United Kingdom
Philippe Collard, Universite de Nice-Sophia Antipolis, France
Gualtiero Colombo, Cardiff University, United Kingdom
Clare Bates Congdon, University of Southern Maine, United States
Nan Ding, Tsinghua University, China
Ezequiel Di Paolo, University of Sussex, United Kingdom
Carlos Echegoyen, University of the Basque Country, Spain
Leonardo Emmendorfer, University of Algarve, Portugal
David E. Goldberg, University of Illinois at Urbana-Champaign, United States
Thomas Goth, Colby College, United States
Ramin Halavati, Sharif University of Technology, Iran
Mark Hauschild, University of Missouri at St. Louis, United States
Xiao-Bing Hu, University of Sussex, United Kingdom
Amitay Isaacs, University of New South Wales, Australian Defence Force Academy, Australia
Jiri Jaros, Brno University of Technology, Czech Republic
Pier Luca Lanzi, Politecnico di Milano, Italy
Pedro Larrañaga, Technical University of Madrid, Spain
Minqiang Li, Tianjin University, China
Claudio F. Lima, University of Algarve, Portugal
Fernando G. Lobo, University of Algarve, Portugal
Jose A. Lozano, University of the Basque Country, Spain
Jennifer Mangold, University of California, Berkeley, United States
Asim Munawar, Hokkaido University, Japan
Masaharu Munetomo, Hokkaido University, Japan
Luigi Nichetti, Politecnico di Milano, Italy
Martin Pelikan, University of Missouri at St. Louis, United States
Aurora Pozo, Federal University of Paraná, Brazil
Tapabrata Ray, University of New South Wales, Australian Defence Force Academy, Australia
Philipp Rohlfshagen, University of Birmingham, United Kingdom
Roberto Santana, University of the Basque Country, Spain
Kumara Sastry, University of Illinois at Urbana-Champaign, United States
Josef Schwarz, Brno University of Technology, Czech Republic
Saeed Bagheri Shouraki, Sharif University of Technology, Iran
Zbigniew Skolicki, Google, Inc., United States
Robert E. Smith, University College London, United Kingdom
Warren Smith, University of New South Wales, Australian Defence Force Academy, Australia
Chia-Ti Tsai, National Taiwan University Hospital, Taiwan
Miwako Tsuji, Hokkaido University, Japan
Sebastien Verel, Universite de Nice-Sophia Antipolis, France
Davide Voltini, Politecnico di Milano, Italy
Mohamed Wahib, Hokkaido University, Japan
Tian-Li Yu, University of Illinois at Urbana-Champaign, United States
Ying Zhang, Georgia Institute of Technology, United States
Shude Zhou, China Academy of Electronics and Information Technology, China
[email protected] Parallel Bivariate Marginal Distribution Algorithm with Probability Model Migration Josef Schwarz and Jiri Jaros Department of Computer Systems, Faculty of Information Technology, Brno University of Technology, CZ {schwarz,jarosjir}@fit.vutbr.cz
Summary. This chapter presents a new concept of a parallel Bivariate Marginal Distribution Algorithm (BMDA) using the stepping stone communication model with a unidirectional ring topology. The traditional migration of individuals is compared with a newly proposed technique of probability model migration. The idea of the new adaptive BMDA (aBMDA) algorithm is to modify the classic learning of the probability model (applied in the sequential BMDA [24]). In the proposed strategy, adaptive learning of the resident probability model is used: the evaluation of pair dependencies, using Pearson's chi-square statistics, is influenced by the corresponding immigrant pair dependencies according to the quality of the resident and immigrant subpopulations. Experimental results show that the proposed aBMDA significantly outperforms the traditional concept of migration of individuals.
1 Introduction

The concept of the traditional parallel genetic algorithm (PGA) is well known. It stems from the idea that a large problem can be successfully solved by decomposing the original problem into smaller tasks, which can then be solved simultaneously using multiple processors. This divide-and-conquer technique can be applied to GAs in many distinct ways. Mostly, the population is divided into a few subpopulations, or demes, and each of these demes evolves separately on a different processor. Exchange of information among subpopulations is possible via a migration operator. In this context, the term island model is commonly used. Island populations are free to converge toward different sub-optima. The migration operator is supposed to mix good features that emerge locally in the different demes.

Many topologies can be defined for connecting the demes, such as mesh, torus, hypercube, or ring. The most common models are the island model and the stepping stones model. In the basic island model, migration can occur between any subpopulations, whereas in the stepping stone model, migration is restricted to neighboring demes. The theory providing rational guidance for the proper setting of the control parameters is published in [7]. An interesting survey of PGAs is given in [2]. An effective technique for the massive parallelization of the compact GA was published in [15], and a PGA capable of solving billion-variable optimization problems was recently presented in [10].
This chapter concerns the application of the stepping stone model (for simplicity we will use the term island-based model) to the Bivariate Marginal Distribution Algorithm (BMDA). This new approach, using probability model migration, is conceptually different from traditional parallel genetic algorithms with migration of individuals/solutions and also from EDAs that build pseudo-sequential probabilistic models in parallel. The remaining sections are organized as follows: Section 2 introduces the basic concept of EDAs and the current techniques used in their parallelization. Section 3 describes the sequential BMDA, including the factorization and graphical representation of the probability model. Section 4 presents the motivation and the new idea of learning the probability model using probability model migration. Experimental results are shown in Section 5, and Section 6 concludes the chapter.
2 Traditional EDAs

EDAs belong to the class of advanced evolutionary algorithms based on the estimation and sampling of graphical probabilistic models [4, 5, 6, 11, 13, 22, 23, 26]. They do not suffer from the disruption of building blocks typical of standard genetic algorithms. The canonical sequential EDA is described in Fig. 1.

EDAs often surpass classical EAs in the number of required fitness function evaluations. However, the absolute execution time is still a limiting factor which determines the size of practically tractable problems. Referring to Fig. 1, for many problems the most time-consuming task is the estimation of the probability model. Most papers on EDAs therefore concentrate on parallel construction and sampling of probabilistic models. Well-known algorithms employing parallel construction of Bayesian networks are published in [17, 20, 21]. In [25], the theory of population sizing and time to convergence is presented. The idea of multi-deme parallel estimation of distribution algorithms (PEDAs) based on the PBIL algorithm was published in [1].
Set t ← 0;
Generate initial population D(0);
While termination criterion is false do begin
    Select a set of promising solutions Ds(t);
    Construct a new probability model M from Ds(t) using a chosen metric;
    Sample offspring O(t) from M;
    Evaluate O(t);
    Create D(t+1) as a subset of O(t) ∪ D(t) with cardinality N;
    t ← t + 1;
end

Fig. 1. The pseudo code of the canonical EDA
In [16], mixtures of distributions with Bayesian inference are discussed. Parallel learning of belief networks in large domains is investigated in [27]. Using the concept of the PBIL algorithm [3, 12, 19], the classical phenomenon of migration in island-based EAs was carried over to the probability distributions of EDAs, and a new approach of probability vector crossover was implemented with very good performance.

2.1 Linkage Learning in EDA Algorithms

In competent genetic algorithms, various sophisticated linkage learning techniques must be implemented to discover Building Blocks (BBs). In EDA algorithms, linkage learning is automatically incorporated into a graphical probabilistic model. EDAs support effective detection, mixing and reproduction of BBs, so they are capable of solving complex optimization problems, including deceptive problems. It is, however, important to select a probabilistic model whose complexity reflects the complexity of the fitness function. We can recognize three categories of model complexity: no dependencies (Univariate Marginal Distribution Algorithm), pairwise dependencies (MIMIC, BMDA) and multivariate dependencies (BOA, EBNA).

2.2 Migration of Probabilistic Parameters for UMDA

The concept of migration of probabilistic parameters instead of individuals was first published in [8], where, on the UMDA platform, the convex combination of univariate probability models is investigated for various network topologies (ring, star, etc.). A further enhancement of this concept is described in [9], where local search methods are used to identify which parts of the immigrant model can improve the resident model.

In the following sections we describe the proposal of a new concept of island-based BMDA algorithm with unidirectional ring topology, based on the combination of two adjacent bivariate probability models.
3 Sequential BMDA

The well-known representative of bivariate EDAs is the Bivariate Marginal Distribution Algorithm (BMDA) proposed by Pelikan and Mühlenbein [19, 24]. This algorithm uses a factorization of the joint probability distribution that captures second-order dependencies. EDAs are also population-based algorithms, but unlike GAs, the new population is generated by sampling the recognized probability model. Let us denote:

D = (X0, X1, ..., XN-1), with X ∈ D, the population of strings/solutions/individuals,
X = (X0, X1, ..., Xn-1) a string/solution of length n with Xi as a variable,
x = (x0, x1, ..., xn-1) a string/solution with xi as a possible instantiation of variable Xi, xi ∈ {0, 1},
p(X) = p(X0, X1, ..., Xn-1) the n-dimensional probability distribution,
p(x0, x1, ..., xn-1) = p(X0 = x0, X1 = x1, ..., Xn-1 = xn-1) the probability of a concrete n-dimensional vector.
The probabilistic model used in BMDA can be formalized as M = (G, Θ), where G is a dependency graph and Θ = (θ_0, θ_1, ..., θ_{n-1}) is a set of parameters which are estimated by the local conditional or marginal probabilities for each node/variable of the dependency graph. A greedy algorithm for building the dependency graph is used: at the beginning, a root node is selected, and subsequently the node with the maximum dependency value is searched for among the remaining nodes and joined to the graph. The pairwise dependencies in BMDA are measured by Pearson's chi-square statistic:
$$\chi^2_{i,j} = N\left(\sum_{\forall x_i \in Dom(X_i)}\;\sum_{\forall x_j \in Dom(X_j)} \frac{m^2(x_i, x_j)}{m(x_i)\, m(x_j)} \;-\; 1\right) \qquad (1)$$
where N is the size of the parent population and m(x_i, x_j), m(x_i) and m(x_j) denote the numbers of individuals in the parent population with the concrete values of x_i and/or x_j. These values are stored in contingency tables. From the theoretical point of view, this metric can be seen as a statistical hypothesis test; for example, binary genes X_i and X_j are considered to be independent at the 95 percent confidence level if $\chi^2_{i,j} < 3.84$.

Like COMIT, BMDA uses a variant of the minimum spanning tree technique to learn a model. However, during the tree construction, if none of the remaining variables can be "rooted" to the existing tree, BMDA starts to form an additional tree from the remaining variables. The final probability distribution is thus a forest distribution (a set of mutually independent dependency trees):
$$p(X) = \prod_{X_r \in R} p(X_r) \prod_{X_i \in V \setminus R} p(X_i \mid X_{j(i)}) \qquad (2)$$
where V is the set of nodes of the dependency trees, R is the set of root nodes and X_{j(i)} denotes the parent node of X_i. Given the tree dependency structure, the univariate marginal probability distributions are estimated from the promising/parent population:

$$p(X_i = 1) = \frac{m(X_i = 1)}{N} \qquad (3)$$
and the bivariate conditional probability distributions $p(X_i \mid X_{j(i)})$ are estimated as

$$p(x_i \mid x_{j(i)}) = \frac{m(x_i, x_{j(i)})}{m(x_{j(i)})} \qquad (4)$$
For example, the joint probability distribution for the dependency graphs in Fig. 2 can be expressed by the factorizations:

1. p(X) = p(X_4) p(X_3 | X_4) p(X_2 | X_3) p(X_1 | X_2) p(X_0 | X_1)
2. p(X) = p(X_2) p(X_3 | X_2) p(X_0 | X_4) p(X_4) p(X_1 | X_4)
Fig. 2. Example of dependency graph for: a) COMIT, b) BMDA
The time complexity of the complete BMDA algorithm can be expressed by the formula O(n^3) + O(4Nn^2) + O(Nn), where the first component is the cubic time complexity of the dependency graph construction, the second component is the quadratic time complexity of collecting the contingency tables, and the third component reflects the linear complexity of sampling new solutions.
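To make the model-building step concrete, the following Python sketch (with illustrative names; not the authors' implementation) computes the pairwise chi-square table of Eq. (1) from a binary parent population and greedily grows a dependency forest in the spirit of BMDA; sampling then proceeds root-first using the estimates (3) and (4).

import numpy as np

def chi_square_table(pop):
    """Pairwise Pearson chi-square statistics (Eq. 1) for a binary parent population.
    pop: (N, n) array of 0/1 values."""
    N, n = pop.shape
    chi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s = 0.0
            for xi in (0, 1):
                for xj in (0, 1):
                    m_i = np.sum(pop[:, i] == xi)
                    m_j = np.sum(pop[:, j] == xj)
                    m_ij = np.sum((pop[:, i] == xi) & (pop[:, j] == xj))
                    if m_i > 0 and m_j > 0:
                        s += m_ij ** 2 / (m_i * m_j)
            chi[i, j] = chi[j, i] = N * (s - 1.0)
    return chi

def build_forest(chi, threshold=3.84):
    """Greedy forest construction: attach each variable to the already covered
    variable with the largest chi-square value; if no dependency is significant
    at the 95 percent level (3.84), the variable starts a new tree (root)."""
    n = chi.shape[0]
    parent = [-1] * n              # -1 marks a root node
    covered = [0]                  # start the first tree from variable 0
    remaining = set(range(1, n))
    while remaining:
        value, i, j = max((chi[i, j], i, j) for i in remaining for j in covered)
        if value >= threshold:
            parent[i] = j          # join i below its strongest covered neighbor
        covered.append(i)
        remaining.discard(i)
    return parent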
4 Island-Based BMDA

4.1 Migration of Individuals
In traditional island-based PGA algorithms, an infrequent migration of individuals among subpopulations is incorporated. The migration process is controlled by several parameters: it is necessary to determine the number and size of the subpopulations, the frequency and intensity of migration, and the method used for selecting candidate migrants. By analogy, it is possible to build an island-based parallel BMDA, where the GA demes are replaced by BMDA ones. In BMDA, and generally in EDAs, new individuals are generated by sampling the probabilistic model. Consequently, a question pops up: is it possible to replace the migration of individuals by the transfer of the probability model? This topic is investigated in the next subsection.

4.2 Migration of the Probabilistic Model
The principal motivation for the proposal of a new concept of BMDA parallelization is to examine the efficiency of transferring probabilistic parameters in comparison with the traditional transfer of individuals. The main goal is to find a robust computational tool for hard optimization problems. The approaches recently published in [1, 8, 9] use simpler probability models only (PBIL, UMDA). Consistent with the theoretical conclusions in [24] and on the basis of the experimental work done in [17], we used the island-based communication model with a synchronized unidirectional ring topology, see Fig. 3. We have simulated the island-based system partly on a single-processor computer and partly on a real parallel system composed of a cluster of eight Linux-based workstations.
Fig. 3. Ring topology of island-based BMDA
It is evident that we can simply decompose the migration process in the ring loop into pairwise interactions of two adjacent islands: one of them is considered to be a resident island, specified by the resident probabilistic model, and the second one an immigrant island, whose probabilistic model is transferred to participate in building up a new resident model at a predefined migration rate. We focused on the problem of how to compose the resident model with the incoming model belonging to the immigrant island. In general, the modification of the resident model by the immigrant model can be formalized by the adaptation rule [3, 19]:
$$M'_R = \beta\, M_R \circ (1 - \beta)\, M_I \qquad (5)$$
where the operator ∘ can be, for example, a summing operator, and the coefficient β in the range [0, 1] specifies the influence of the immigrant model.

4.3 Adaptive Learning of the Probabilistic Model
We applied adaptive learning to both parts of the probabilistic model M_R = (G_R, Θ_R): the dependency graph G_R and the parameter set Θ_R. The new dependency graph G'_R is not built by aggregating the original graph G_R and the incoming graph G_I, but by means of the combined Pearson's chi-square statistics:
$$\chi^2_{i,j} = \beta\, \chi^2_{i_R, j_R} + (1 - \beta)\, \chi^2_{i_I, j_I} \qquad (6)$$
The new parameters Θ'_R are calculated by the simple adaptation rule:

$$\Theta'_R = \beta\, \Theta_R + (1 - \beta)\, \Theta_I \qquad (7)$$
The adaptation coefficient β is defined by the formula:

$$\beta = \begin{cases} \dfrac{F_R}{F_I + F_R} & \text{if } F_I \ge F_R, \\[4pt] 0.9 & \text{otherwise} \end{cases} \qquad (8)$$
where F_R represents the mean fitness value of the resident subpopulation and F_I represents the mean fitness value of the immigrant subpopulation.
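A minimal Python sketch of the adaptation rules (6)–(8) is given below, assuming that the chi-square tables and the parameter vectors of both islands are available as NumPy arrays (function and variable names are illustrative). As in the procedure of Fig. 4 below, the parameter sets Θ_R(G'_R) and Θ_I(G'_R) must be re-estimated on the new dependency graph before they are blended according to Eq. (7).

import numpy as np

def adaptation_coefficient(f_resident, f_immigrant):
    """Eq. (8): weight of the resident model derived from the mean fitness values."""
    if f_immigrant >= f_resident:
        return f_resident / (f_immigrant + f_resident)
    return 0.9

def adapt_resident_model(chi_res, chi_imm, theta_res, theta_imm, f_res, f_imm):
    """Eqs. (6) and (7): convex combination of the chi-square tables (used to
    rebuild the dependency forest) and of the probabilistic parameters."""
    beta = adaptation_coefficient(f_res, f_imm)
    chi_new = beta * chi_res + (1.0 - beta) * chi_imm          # Eq. (6)
    theta_new = beta * theta_res + (1.0 - beta) * theta_imm    # Eq. (7)
    return chi_new, theta_new, beta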
Procedure (Output: M'_R; Input: SubPop_I, SubPop_R)
    Calculate F_R for the resident subpopulation;
    Calculate F_I for the immigrant subpopulation;
    Calculate β:  β = F_R / (F_I + F_R) if F_I ≥ F_R, otherwise β = 0.9;
    For i = 0 to n-1 do begin
        For j = 0 to n-1 do begin
            Calculate χ²_{i_R,j_R}, χ²_{i_I,j_I};
            Store in Chisqr_Table[i,j]:  χ²_{i,j} = β χ²_{i_R,j_R} + (1-β) χ²_{i_I,j_I};
        end
    end
    Build the new dependency graph G'_R according to Chisqr_Table;
    Calculate the sets of parameters Θ_R(G'_R), Θ_I(G'_R);
    Learn the parameters: Θ'_R = β Θ_R + (1-β) Θ_I;
    Store the new resident model: M'_R = (G'_R, Θ'_R);
    Sample the adapted model M'_R;
    Replace SubPop_R;
end

Fig. 4. Adaptive learning of the resident model
The major part of all experiments was implemented using this pseudo-parallel version of the aBMDA algorithm, see Fig. 4.

4.4 Parallel Implementation on a Cluster of Workstations
In the parallel version of aBMDA, it is necessary to transfer some components of the probability model from the immigrant node to the resident one. In the proposed version, the contingency tables are transferred. The spatial complexity of all transported tables is O(4n²), where n is the cardinality of the solved problem. Since the chi-square statistic is symmetric and dependencies of a variable on itself are irrelevant, the spatial complexity can be reduced to O(2(n² − n)). In contrast to probabilistic model migration, the migration of individuals used in iBMDA has spatial complexity O(nkN), where kN is the number of migrating individuals. Because the communication overhead in modern interconnection networks depends more strongly on the start-up latency of communication than on the transported message size, we can expect the communication overhead to be nearly the same for both approaches. Moreover, using an asynchronous or non-blocking type of migration [14], the communication overhead can simply be overlapped. Our parallel implementation of aBMDA benefits from overlapping communication and computation, based on the non-blocking MPI [18] communication subroutines MPI_Isend, MPI_Irecv and MPI_Wait. The basic idea is shown in Fig. 5.
The information exchange between the resident and the immigrant node begins with the initiation of a receive request. While the receive is pending, the resident node can compute its contingency tables and the mean fitness value F_R of the resident population. Next, all computed data are packed into a single send buffer using the standard C routine memcpy and sent to the neighboring node using non-blocking communication. The resident chisqr-table is computed from the resident contingency tables in the next step. Now the resident node has to wait until the immigrant data are completely received. After that, the data from the immigrant node are unpacked from the receive buffer and the immigrant chisqr-table is computed.

Now the probabilistic model composition can start. First, the resident and the immigrant chisqr-tables are combined using the beta parameter to produce a new chisqr-table. A new dependency graph is created according to the information stored in the learned chisqr-table. Second, the set of parameters Θ'_R is calculated using the new dependency graph and the original resident and immigrant contingency tables. As a result, the new probabilistic model M'_R = (G'_R, Θ'_R) is determined.
Procedure MakeExchangeIslandInformation()
    MPI_Irecv(receive buffer);
    Calculate the mean fitness value F_R of the resident island;
    Calculate the resident contingency tables;
    Pack the resident contingency tables and F_R into a send buffer;
    MPI_Isend(send buffer);
    Calculate Chisqr_Table_Resident[i,j] = χ²_{i_R,j_R};
    MPI_Wait(wait for the receive to finish);
    Unpack the immigrant contingency tables and F_I from the receive buffer;
    Calculate β;
    Calculate Chisqr_Table_Immigrant[i,j] = χ²_{i_I,j_I};
    Calculate the items of the composed Chisqr_Table[i,j]: χ²_{i,j} = β χ²_{i_R,j_R} + (1-β) χ²_{i_I,j_I};
    Build the new dependency graph G'_R according to the new Chisqr_Table;
    Calculate the sets of parameters Θ_R(G'_R), Θ_I(G'_R) using the contingency tables;
    Learn the parameters: Θ'_R = β Θ_R + (1-β) Θ_I;
    Compose the new resident model: M'_R = (G'_R, Θ'_R);
    MPI_Wait(wait for the send to finish);
end

Fig. 5. MPI communication between the resident and the immigrant node
The migration of individuals used in iBMDA can be realized in a similar way. The computation of the contingency tables and the mean fitness value is simply replaced by the selection of individuals intended for migration. In this case, only the selected individuals are packed into a send buffer and transported to the neighboring node. The received solutions are then unpacked at the resident node and incorporated into the resident population. Finally, a new population is created in the standard way.
Besides the described type of communication, the MPI_Gather [18] operation is employed after each generation. During this operation, all necessary information from all processing nodes is collected to compute global statistics, including the global mean fitness value, the best global solution, etc.
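As an illustration only, the exchange of Fig. 5 could be prototyped with non-blocking point-to-point calls as sketched below. The sketch uses mpi4py rather than the authors' C/MPI implementation, and the buffer layout (a flat float array holding the contingency tables followed by the mean fitness value) is an assumption made for brevity.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size   # unidirectional ring

def exchange_island_information(cont_tables, mean_fitness):
    """Non-blocking ring exchange of the contingency tables and the mean fitness.
    cont_tables: flat float64 array of the resident contingency tables."""
    send_buf = np.concatenate([cont_tables, [mean_fitness]])
    recv_buf = np.empty_like(send_buf)

    recv_req = comm.Irecv(recv_buf, source=left, tag=0)   # post the receive first
    send_req = comm.Isend(send_buf, dest=right, tag=0)

    # ... overlap: compute the resident chi-square table here ...

    recv_req.Wait()                          # immigrant data are now available
    imm_tables, imm_fitness = recv_buf[:-1], recv_buf[-1]
    send_req.Wait()                          # the send buffer may be reused
    return imm_tables, imm_fitness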
5 Experimental Results

In our experiments, we compared four different variants of the BMDA algorithm. The first group consists of two versions of parallel BMDA:

1. aBMDA, with adaptive learning of the dependency graph.
2. iBMDA, with migration of individuals.

These two parallel BMDA algorithms work with 8 islands of subpopulations, each consisting of 256 individuals, as portions of a full population of 2048 individuals. The second group used for the comparison includes two classical variants of BMDA:

3. sBMDA, the sequential BMDA with the full population of 2048 individuals (the size of the whole eight-island model).
4. oBMDA, the sequential BMDA with a reduced population of 256 individuals (the size of one island).

A fixed subpopulation size was used over the whole range of problem sizes. We deliberately did not use the possibility of adapting the subpopulation size to the problem size, as discussed in [25]; our goal was to compare, in particular, the parallel adaptive aBMDA version with the traditional iBMDA version under limited resources (subpopulation size). The population size of 2048 for sBMDA was derived partially from our experience and from the experimental results published in [25] for the 3-Deceptive problem.

In all BMDA variants, a truncation-based selection strategy was used, i.e., all individuals were ordered by their fitness value and the better half was used for model building. A truncation-based replacement strategy was also used for the replacement operator, i.e., the newly generated solutions (offspring) replace the worse half of the subpopulation. The probabilistic model is built in each generation. The frequency of model migration or individual migration was the same in both variants: once every five generations. In the case of the algorithm with migrating individuals, elitism is used; that is, the 13 best individuals of the immigrant subpopulation (i.e., about k = 5 percent of the subpopulation) replace the worst individuals of the resident subpopulation. The first stopping condition was met after 500 generations; the second was activated if there was no improvement within an interval of 50 generations.

5.1 Specification of Benchmarks
For our experimental study, four well-known benchmarks of various complexities and with known global optima were used. The OneMax and TwoMax problems served as basic benchmarks for testing performance. The Quadratic problem represents an adequate benchmark that should be solvable by any BMDA
algorithm. The 3-Deceptive task is a hard deceptive benchmark for BMDA and is often used for testing BOA algorithms.
OneMax:
$$f_{OneMax}(x) = \sum_{i=0}^{n-1} x_i \qquad (9)$$

TwoMax:
$$f_{TwoMax}(x) = \left|\sum_{i=0}^{n-1} x_i - \frac{n}{2}\right| + \frac{n}{2} \qquad (10)$$

Quadratic:
$$f_{Quadratic}(x) = \sum_{i=0}^{n/2 - 1} f_2\!\left(x_{\pi(2i)}, x_{\pi(2i+1)}\right) \qquad (11)$$

where $f_2(u, v) = 0.9 - 0.9(u + v) + 1.9uv$

3-Deceptive:
$$f_{3\text{-}Deceptive}(x) = \sum_{i=0}^{n/3 - 1} f_3\!\left(x_{\pi(3i)} + x_{\pi(3i+1)} + x_{\pi(3i+2)}\right) \qquad (12)$$

where
$$f_3(u) = \begin{cases} 0.9 & \text{if } u = 0 \\ 0.8 & \text{if } u = 1 \\ 0 & \text{if } u = 2 \\ 1 & \text{otherwise} \end{cases}$$
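For reference, the four objective functions (9)–(12) can be coded compactly as follows (a sketch; the permutation π is assumed to be supplied by the caller and defaults to the identity).

import numpy as np

def onemax(x):                                     # Eq. (9)
    return int(np.sum(x))

def twomax(x):                                     # Eq. (10)
    n = len(x)
    return abs(float(np.sum(x)) - n / 2) + n / 2

def f2(u, v):                                      # building block of Eq. (11)
    return 0.9 - 0.9 * (u + v) + 1.9 * u * v

def quadratic(x, pi=None):                         # Eq. (11)
    pi = np.arange(len(x)) if pi is None else pi
    return sum(f2(x[pi[2 * i]], x[pi[2 * i + 1]]) for i in range(len(x) // 2))

def f3(u):                                         # building block of Eq. (12)
    return {0: 0.9, 1: 0.8, 2: 0.0}.get(int(u), 1.0)

def deceptive3(x, pi=None):                        # Eq. (12)
    pi = np.arange(len(x)) if pi is None else pi
    return sum(f3(x[pi[3 * i]] + x[pi[3 * i + 1]] + x[pi[3 * i + 2]])
               for i in range(len(x) // 3))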
The four objective functions above were used as fitness functions (FF) without additional modification. We tested the four variants of BMDA using 30 independent runs. To have a baseline for the island-based versions, we first tested the classic sequential BMDA (sBMDA) with an ordinary population of 2048 individuals and the classical sequential BMDA with a reduced population (oBMDA). The first metric is the commonly used success rate of discovering the global optimum. The second metric is calculated as the average value of the best fitness function (FF) over the 30 runs. The third metric is computed as the mean number of correctly discovered building blocks over the 30 runs. These metrics/statistics are discussed in the next sections.

5.2 OneMax Problem
The sBMDA and aBMDA algorithms succeeded over the whole range of problem sizes, see Fig. 6. The classical iBMDA version produces comparable results only up to 260 variables; beyond this threshold a rapid drop follows. It is also evident that oBMDA was significantly outperformed by all the other algorithms.

5.3 TwoMax Problem
In the case of the TwoMax problem, see Fig. 7, the results of the tested algorithms are similar to those achieved for the OneMax problem. The aBMDA version outperformed all other versions and achieved the same results as sBMDA. The drop in success rate for the iBMDA version with migration of individuals is more significant than in the case of the OneMax problem.
Success rate for TwoMax problem 100 90 80
Success rate [%]
70 60 50 40 30
aBMDA iBMDA
20
sBMDA oBMDA
10 0 20
40
60
80
100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500
Problem size [variables]
Fig. 7. Success rate for TwoMax problem
5.4 Quadratic Problem
To achieve the global solution of this problem, a probability model with bivariate dependencies is required. This benchmark is thus perfectly suitable for testing and comparing all BMDA variants. In Fig. 8, the success rate of all compared algorithms can be seen. The best results were reached by sBMDA, which succeeded over nearly the whole range of problem sizes. Similar behavior can be observed for the aBMDA version, which achieved a 100 percent success rate for up to 260 variables.
Fig. 8. Success rate for Quadratic problem

Table 1. Statistics results (mean±std of FF) for Quadratic problem
Problem size   aBMDA          iBMDA          sBMDA          oBMDA          Optimum
60             30.00±0.00     30.00±0.00     30.00±0.00     29.96±0.06     30
80             40.00±0.00     39.98±0.30     40.00±0.00     39.88±0.10     40
100            50.00±0.00     49.87±0.80     50.00±0.00     49.65±0.14     50
120            60.00±0.00     59.67±0.13     60.00±0.00     59.37±0.24     60
140            70.00±0.00     67.71±3.14     70.00±0.00     69.08±0.25     70
260            130.00±0.00    127.10±0.37    130.00±0.00    126.32±0.57    130
280            139.92±0.24    136.53±0.30    139.99±0.02    135.79±0.54    140
300            149.75±0.46    146.09±0.27    149.99±0.02    145.33±0.64    150
Table 2. Statistics results (mean±std of BBs) for Quadratic problem
Problem size   aBMDA          iBMDA          sBMDA          oBMDA          Optimum
60             30.00±0.00     30.00±0.00     30.00±0.00     29.56±0.61     30
80             40.00±0.00     39.83±0.37     40.00±0.00     38.87±1.02     40
100            50.00±0.00     48.70±0.78     50.00±0.00     46.50±1.41     50
120            60.00±0.00     56.73±1.26     60.00±0.00     53.80±2.48     60
140            70.00±0.00     64.50±1.36     70.00±0.00     60.80±2.38     70
260            130.00±0.00    101.03±3.71    130.00±0.00    93.47±5.86     130
280            139.20±2.45    105.10±3.27    140.00±0.00    98.03±5.15     140
300            147.47±4.64    110.67±2.77    149.90±0.25    103.60±4.93    150
Besides the success rate metric, the second metric, represented by the mean±std statistics of the fitness function, is presented in Table 1. It is evident that the aBMDA and sBMDA versions provide the same results up to 260 variables; for a higher number of variables, sBMDA achieves better results. The best value achieved for each problem size is written in bold in the original tables. In Table 2, the third metric, represented by the mean±std statistics of the number of correctly recognized building blocks (BBs), is shown.

5.5 3-Deceptive Problem
The problem was investigated for 21 to 120 variables, see Fig. 9. For a higher number of variables, the drop in success rate is significant for all proposed algorithms. It is caused by the rather high complexity of the 3-Deceptive problem, which requires a more complex model and also a larger population size for efficient performance. The best success rate was achieved by the aBMDA version, with very similar values achieved by sBMDA. On the other hand, the worst results were obtained by oBMDA and iBMDA.
Fig. 9. Success rate for 3-Deceptive problem
In Table 3, the mean±std statistics of the fitness function are presented. The best results were obtained by the aBMDA version and by sBMDA. The worst mean fitness values were achieved by the oBMDA algorithm, followed by iBMDA. The mean values and standard deviations of the discovered BBs are presented in Table 4. From Tables 3 and 4, a significant correlation between the mean fitness value and the mean number of BBs can be recognized in the case of the aBMDA version. For the 99-variable problem, the mean number of BBs is 26.3, which is 80 percent of the total of 33 blocks; note that iBMDA discovered only 11.9 BBs (36 percent). It is interesting to compare these values with the experimental results published for the BOA algorithm in [25], where the achieved number of BBs for 99 variables and a population size of 250 equals 25 percent.
Table 3. Statistics results (mean±std of FF) for 3-Deceptive problem
Problem size   aBMDA         iBMDA         sBMDA         oBMDA         Optimum
21             7.00±0.00     7.00±0.00     7.00±0.00     6.90±0.05     7
30             10.00±0.00    9.92±0.06     10.00±0.00    9.75±0.12     10
39             13.00±0.00    12.76±0.11    13.00±0.00    12.06±1.99    13
51             16.98±0.04    16.36±0.09    16.99±0.03    16.12±0.16    17
60             19.96±0.06    19.13±0.13    19.96±0.06    18.78±0.22    20
72             23.91±0.15    22.73±0.14    23.90±0.09    22.41±0.24    24
81             26.80±0.15    25.43±0.16    26.83±0.10    25.07±0.23    27
90             29.58±0.20    28.11±0.15    29.78±0.16    27.84±0.57    30
99             32.36±0.35    30.87±0.15    32.63±0.18    30.51±0.22    33
120            38.67±0.29    37.10±0.14    39.21±0.20    38.58±0.59    40
Table 4. Statistics results (mean±std of BBs) for 3-Deceptive problem
Problem size   aBMDA         iBMDA         sBMDA         oBMDA         Optimum
21             7.00±0.00     7.00±0.00     7.00±0.00     6.47±0.50     7
30             10.00±0.00    9.23±0.56     10.00±0.00    7.60±1.11     10
39             13.00±0.00    10.63±1.11    13.00±0.00    7.33±1.72     13
51             16.83±0.45    10.60±0.92    16.90±0.39    8.27±1.67     17
60             19.70±0.59    11.33±1.49    19.56±0.56    7.23±2.39     20
72             23.13±1.02    11.53±1.56    23.00±0.93    8.23±2.29     24
81             25.00±1.51    11.50±1.62    25.33±0.98    7.92±2.39     27
90             25.87±2.08    11.27±1.41    27.83±1.61    8.50±2.26     30
99             26.30±2.99    11.86±1.83    29.36±1.85    8.67±2.95     33
120            26.23±3.72    11.63±1.80    32.10±2.09    9.03±1.74     40
5.6 Discussion on Pseudo-parallel Version of Algorithms
In our experiments two groups of algorithms were compared:

1. the newly proposed island-based aBMDA with probabilistic model learning and the traditional island-based iBMDA with individual migration;
2. the sequential sBMDA version with the full population size and the reduced sequential oBMDA version.

In the first experiment, the success rate metric was applied. Both the aBMDA and sBMDA versions are capable of finding the global optima with a 100 percent success rate for up to 500 variables in the case of the OneMax and TwoMax problems and up to 260 variables in the case of the Quadratic problem. For difficult problems like 3-Deceptive, the algorithms lack the ability to find the optimal solution repeatedly for problem sizes larger than 39. It is evident that aBMDA is an effective optimization tool outperforming the iBMDA version based on the traditional migration of individuals; from this point of view, the range of solvable problem sizes is at least two times larger for the aBMDA version.
In the second experiment, the statistics including the mean±std values of the fitness function (FF) were processed for the two harder problems: the Quadratic problem in Table 1 and the 3-Deceptive problem in Table 3. The best values are written in bold in the original tables. From Table 1, it is evident that for the Quadratic problem, aBMDA and sBMDA reached the global optima for up to 260 variables and very significantly outperformed the iBMDA version. In the case of the 3-Deceptive problem, aBMDA outperformed all other algorithms besides sBMDA, which is better for problem sizes exceeding 90 variables. Note that for the 120-variable problem, the mean FF value of aBMDA equals 38.6, which is close to the global optimum of 40.

In Table 2 and Table 4, the statistical results for BBs are shown for the Quadratic and 3-Deceptive problems. From Table 2 it is evident that, in the case of the Quadratic problem, aBMDA discovered all BBs for up to 260 variables while iBMDA was successful only up to 30 variables. In the case of the 3-Deceptive problem, see Table 4, aBMDA outperformed iBMDA over the whole range of problem sizes; note that aBMDA achieved approximately twice the mean number of BBs for problem sizes exceeding 60 variables.

The computational complexity of all algorithms, measured by the number of generations, is comparable. For example, in the case of the Quadratic problem with 60 variables, the average computational time is about 20 generations, see Fig. 10. Note that oBMDA was able to find the global optimum for this instance of the Quadratic problem in only 66 percent of the 30 runs, see Fig. 8.
Fig. 10. Time complexity of the proposed algorithms for Quadratic problem
5.7 Performance of Parallel Implementation
The parallel implementations of aBMDA and iBMDA were tested on a cluster of 8 Linux-based workstations, each equipped with an Intel E6550 processor and 2 GB of RAM, connected by a 1 Gb LAN. In the case of the sequential sBMDA, only one workstation was used.
First, a comparison of the mean execution time T_G per generation was performed. For the T_G calculation, five independent runs, each composed of 20 generations (including 4 migration cycles), were carried out. The algorithms were compared using the OneMax benchmark, see Fig. 11; note that convergence toward the global optimum was not checked in this case. From Fig. 11, a marked difference between the sequential and parallel approaches is evident, as expected. The values of T_G for the two parallel algorithms are comparable. The OneMax benchmark is relatively simple, so the fitness function evaluation does not influence the execution time very much. For more complex problems, the gap between sequential and parallel approaches will be deeper, because the computational complexity will dominate the communication complexity.
Fig. 11. The mean execution time T_G for OneMax
Fig. 12. Speed-up of iBMDA and aBMDA with regard to sBMDA for OneMax problem
The speed-up of the parallel implementations for the OneMax problem is displayed in Fig. 12. It varies between 5 and 9 for both parallel algorithms. The decrease of the speed-up for larger instances of the OneMax problem is caused by the increased parallelization overhead, which consists of the quadratic time complexity of transporting the contingency tables and of the subsequent model composition in the resident node. The performance of the aBMDA algorithm is slightly better. For simpler instances (say up to 200 variables), the achieved speed-up was larger than the number of processing nodes (8 in our experiment), which is known as the phenomenon of superlinear speed-up [3].

The extent of the speed-up was also investigated for the Quadratic problem, see Fig. 13. Both parallel algorithms achieved superlinear speed-up over the whole range of problem instances as a consequence of the more complex benchmark.
Fig. 13. Speed-up of iBMDA and aBMDA with regard to sBMDA for Quadratic problem
Let us note that knowledge of the concrete speed-up can be utilized for predicting the execution time of the optimization process and also for setting a proper population size and number of processing nodes. Finally, we investigated the speed-up of the optimization tasks which resulted in a 100% success rate. The value of the speed-up was calculated using the following scheme:

a) first, the number of generations required to reach the global optimum was measured for each version of BMDA and each instance of the problem and averaged over 30 independent runs;
b) this value was multiplied by the mean execution time of one generation T_G for the relevant version of BMDA;
c) finally, this value was normalized by the value obtained for the sequential sBMDA and plotted in Fig. 14 and Fig. 15.

From Fig. 14, it is evident that the iBMDA algorithm was not able to find optimal solutions over the whole range of OneMax instances; it succeeded only in the range from 20 to 220 variables. aBMDA solved the tasks reliably with a speed-up higher than 6, and even with superlinear speed-up for up to 280 variables.
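In code, the reported speed-up reduces to a simple ratio (the variable names below are illustrative):

def optimization_speedup(gen_seq, t_gen_seq, gen_par, t_gen_par):
    """Speed-up of a parallel BMDA variant with respect to the sequential sBMDA:
    (mean generations to the optimum x mean time per generation) of sBMDA
    divided by the same product measured for the parallel variant."""
    return (gen_seq * t_gen_seq) / (gen_par * t_gen_par)

# Example: if sBMDA needs 20 generations at 4.0 s each and aBMDA needs
# 22 generations at 0.5 s each, the speed-up is 80 / 11 ~= 7.3.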
Fig. 14. Speed-up of iBMDA and aBMDA with regard to sBMDA for OneMax problem in case of 100% success rate
Fig. 15. Speed-up of iBMDA and aBMDA with regard to sBMDA for Quadratic problem in case of 100% success rate
The speed-up capability for the Quadratic problem (Fig. 15) was investigated for up to 260 variables. aBMDA was still able to produce optimal solutions with a 100% success rate; its speed-up varied between 4.4 and 8.7. The iBMDA algorithm achieved a speed-up comparable to aBMDA, but only for up to 60 variables.

5.7.1 Discussion on Parallel Implementation

The message passing interface (MPI) used for the parallel implementation of the BMDA algorithms proved to be an efficient software tool. Two variants of parallel BMDA were investigated, aBMDA and iBMDA; they were compared mutually and also against the sequential sBMDA.
From Fig. 12 and Fig. 13 it is possible to conclude that the communication overhead is negligible: the achieved speed-up values were very close to the number of utilized processing nodes, and even higher in some cases. The ability to find the global optima, and the corresponding speed-up, were also examined for both tested parallel algorithms. From Fig. 14 and Fig. 15, it is evident that the aBMDA algorithm is much more robust than iBMDA. The speed-up of the optimization process with respect to sBMDA (with a 100 percent success rate) varies from 4.4 to 8.7, but for most of the problem instances it is higher than 6.
6 Conclusions

This chapter has presented a new idea of parallel BMDA algorithms using an island-based model with a unidirectional ring topology. The cooperation of demes was realized via the migration of probabilistic models instead of the traditional migration of individuals. Each pair of neighboring demes is organized as a resident and an immigrant deme. We have introduced an adaptive learning technique, based on the quality of the resident and immigrant subpopulations, which consists of adapting the resident probabilistic model using the incoming neighbor immigrant model. In this way we have obtained an efficient tool for learning the graphical probabilistic model and the associated probabilistic parameters.

The comparative experimental studies demonstrated that the proposed parallel algorithm aBMDA outperforms the traditional iBMDA using migration of individuals, mainly in terms of the solvable problem size and the achieved success rate. This holds for all the applied benchmarks, beginning with OneMax and ending with the Quadratic problem. The 3-Deceptive problem is hard for all compared algorithms, but aBMDA and sBMDA provide comparatively the best solutions. Note that the sequential sBMDA with the full population provides results competitive with aBMDA, but the time complexity of the aBMDA version can be significantly reduced by parallel processing.

Future work will be focused on more sophisticated modifications of the learning techniques with a limited size of the parameter transfer. We also aim to parallelize the Bayesian Optimization Algorithm (BOA) using a modified concept of probabilistic model migration.
Acknowledgement
This work was partially supported by the Grant Agency of the Czech Republic under No. 102/07/0850 "Design and hardware implementation of a patent-invention machine" and the Research Plan No. MSM 0021630528 "Security-Oriented Research in Information Technology".
Linkages Detection in Histogram-Based Estimation of Distribution Algorithm
Nan Ding¹ and Shude Zhou²
¹ Department of Electronic Engineering, Tsinghua University, Beijing, 100084 China
² China Academy of Electronics and Information Technology, Beijing, 100041 China
Abstract. In this chapter, we review two methods that deal with linkage detection in histogram-based estimation of distribution algorithms; one is based on probabilistic graphical models, and the other is based on space transformation. The two methods deal with the linkage in the optimization problem with different accuracy and different computational complexity. The probabilistic graphical model is generally more accurate but usually comes at a high cost, while the transformation is the opposite. In the following, we will mainly discuss how to reduce the complexity of the method based on probabilistic graphical models and how to obtain a transformation which captures the dominant linkage of the problem.
1 Introduction
Estimation of distribution algorithms (EDAs) are a class of evolutionary algorithms that use a probabilistic model of the promising solutions found so far to obtain new candidate solutions of the optimized problem. One of the appealing characteristics of this evolutionary paradigm is that the joint probability density can explicitly represent the correlation among variables [1,16]. It has been verified by several researchers [2] that multivariate-dependency EDAs have the potential to optimize hard problems with strong nonlinearity. However, it is also noted that efficiently learning the complex probabilistic model is a bottleneck. Therefore, obtaining a good balance between the complexity of the probabilistic model and the efficiency of the learning method is a key factor for designing new EDAs.
In the continuous domain, the predominant probabilistic model applied by EDAs is based on the Gaussian probability distribution. Continuous EDAs based on the multivariate Gaussian distribution have polynomial computational complexity. However, the inherent shortcoming of Gaussian-based EDAs is that the unimodal model is too rough and is thus likely to mislead the search to a local optimum when solving complex optimization problems. Although clustering techniques such as the Gaussian mixture model and the Gaussian kernel model have been considered in the literature to overcome this shortcoming, their complicated probabilistic distributions make it more difficult to estimate the linkage information, and thus the computational complexity increases remarkably.
The histogram model is another alternative probabilistic model in continuous EDAs. It can also be seen as a kind of discretization of the continuous problem. In comparison with
the unimodal Gaussian model, the histogram probabilistic model is able to represent multiple local optima by bins of different heights. The histogram model has already been used in previous work [3-11,13,14]. For example, marginal histogram models are applied in the FWH [7] by S. Tsutsui et al. and in the histogram-based EDA (HEDA) [8,9] by N. Ding et al. B. Yuan et al. [11] also proposed a HEDA as an extension of the PBIL [12]. Q. Zhang et al. introduced the EDA/L [14], in which several local search strategies are employed in a marginal histogram model. In all of the above algorithms, the complete probability is approximated by the product of the marginal probability of each variable, that is to say, the linkages between the variables are discarded. However, as we know, when optimizing problems with bounded epistasis, the linkage information should be given prior consideration in the process of the evolutionary algorithm. The IDEA [4,5] based on the histogram model (IDEA-H) by P.A.N. Bosman et al. used the multivariate histogram model to consider the variable linkage, but they also remarked that the complexity of the IDEA-H grows exponentially when expressing the joint probability of multiple random variables.
The aim of this chapter is to overcome the linkage problem of the HEDA from two aspects: one is based on probabilistic graphical models (PGM), where the multivariate histogram model is considered; the other is based on space transformation, where the marginal distribution is built in the transformed space. PGM takes the Markov properties into account, where a variable is independent of the others given its neighbors in the graphical model, and thus avoids estimating the complete probability. Space transformation works under the assumption that a decent transformation of the original space may cancel out some of the dominant linkages among the variables. The two methods deal with the variable linkages of the optimization problem with different accuracy and different computational complexity. The probabilistic graphical model is generally more accurate but usually comes at a high cost, while the transformation is the opposite. We also want to acknowledge that P. Pošík in [18] gave a general introduction to real-valued evolutionary algorithms on the use of probabilistic models and coordinate transforms, which was really helpful to our work. This chapter mostly focuses on the histogram-based EDA and contains the specific concerns about the histogram model.
This chapter is organized as follows. Section 2 briefly reviews the HEDA, especially the marginal HEDA. In Section 3, the HEDA based on probabilistic graphical models is introduced, and in particular we discuss how to reduce its computational complexity. Section 4 is about the HEDA based on space transformation.
2 Histogram-Based Estimation of Distribution Algorithm and Its Marginal Case
The histogram-based estimation of distribution algorithm (HEDA) has the following main framework:
1. Initialize the histogram model.
2. Generate population P(t) by sampling from the histogram model.
3. Evaluate and rank the population P(t).
4. Update the histogram model according to the selected individuals P'(t).
5. Return to step 2 if not terminated.
The marginal histogram model has the general form

P(Z_0, ..., Z_{l-1}) = \prod_{i=0}^{l-1} P(Z_i)      (1)
In Eq.(1), each P(Z_i) (i = 0, ..., l-1) denotes a 1-variate histogram model. The core of the marginal histogram-based estimation of distribution algorithm is to estimate the marginal distributions P(Z_i) (i = 0, ..., l-1) and then to generate new individuals by sampling from P(Z_i) for each variable. The FWH [7] and the sur-shr-HEDA [9] both belong to this class of HEDA. We now give a brief review of these two algorithms.
In the FWH, the height of each bin of each variable Z_i is proportional to the count of the selected individuals falling into it. To sample each variable of a new individual, first a bin is sampled according to the heights of the bins of that variable, and then the real value of that variable is uniformly sampled within the domain of the bin. Since the height of a bin can be normalized to be equal to the probability of the bin, in the later description we will no longer differentiate between the two phrases "the height of the bin" and "the probability of the bin".
In the sur-shr-HEDA, two specific strategies for the HEDA, the surrounding effect and the shrinking strategy, were developed to overcome two drawbacks of the HEDA: the initial population must be large enough to sample variables with many bins, otherwise many bins will never get a chance to be sampled; and the solution accuracy is greatly influenced by the width of the bins, so highly accurate solutions can only be achieved by setting a sufficient number of bins. With the surrounding effect, if an individual in a certain bin No. i is selected, not only does bin No. i get an improvement of its height, but the surrounding bins, i.e. bins No. (i+1) and No. (i-1), also get minor improvements of their heights (the improvement of bin No. i times the surrounding factor). Using the surrounding effect, bins with heights of zero have the opportunity to be sampled. Furthermore, it has been shown in our previous work that the HEDA with the surrounding effect can find the best bins near the currently sampled bins by hill-climbing during the search process. This allows the algorithm to find the optimal bin with a small population, even when the number of bins is large. With the shrinking strategy, if the height of the highest bin of a variable exceeds the threshold value, the domain of that variable shrinks to the domain of that highest bin, and the new domain is divided into bins as in the initial step of the algorithm. Since the search space gradually shrinks, the shrinking strategy makes the solution sufficiently accurate. Experimental results have already shown that the HEDA combining both the surrounding effect and the shrinking strategy performs excellently in continuous optimization, especially on problems with multiple local optima [9].
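As an illustration of the framework and of the two sur-shr-HEDA strategies just described, the following Python sketch updates a marginal histogram model with a surrounding effect and checks for shrinking; the array layout, the function names and the exact size of the neighbour increments are our own simplifications, not the published implementation.

import numpy as np

def update_model(heights, selected_bins, surround=0.1):
    """heights: (l, nb) bin heights; selected_bins: (n, l) bin indices of selected individuals.
    Returns the normalized bin probabilities after the update."""
    nb = heights.shape[1]
    for ind in selected_bins:
        for var, b in enumerate(ind):
            heights[var, b] += 1.0                  # the selected bin gets the full improvement
            if b > 0:
                heights[var, b - 1] += surround     # surrounding effect on the left neighbour
            if b < nb - 1:
                heights[var, b + 1] += surround     # surrounding effect on the right neighbour
    return heights / heights.sum(axis=1, keepdims=True)

def sample(probs, lo, hi, n):
    """Sample n individuals: pick a bin per variable, then a uniform value inside that bin."""
    l, nb = probs.shape
    width = (hi - lo) / nb
    bins = np.array([np.random.choice(nb, size=n, p=probs[v]) for v in range(l)]).T
    return lo + (bins + np.random.rand(n, l)) * width

def maybe_shrink(probs, lo, hi, threshold=0.9):
    """Shrinking strategy: if the highest bin exceeds the threshold, restrict the domain to it.
    After shrinking, the histogram of that variable would be re-initialized over the new domain."""
    nb = probs.shape[1]
    width = (hi - lo) / nb
    best = probs.argmax(axis=1)
    shrink = probs.max(axis=1) > threshold
    new_lo = np.where(shrink, lo + best * width, lo)
    new_hi = np.where(shrink, new_lo + width, hi)
    return new_lo, new_hi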
Marginal histogram-based estimation of distribution algorithms are frequently applied in practice for their simplicity and efficiency. However, the obvious drawback of the marginal probability estimation is that it loses the ability to detect any variable linkages of the problem. This drawback is serious in problems whose variables strongly depend on each other. In the following, we supply two solutions to this problem. The first applies a probabilistic graphical model; the second applies a space transformation.
3 HEDA Based on Probabilistic Graphical Models
Applying probabilistic graphical models (PGM) in the HEDA was first put forward by P.A.N. Bosman in his IDEA-H [4]. (If we regard the HEDA as a special case of discrete EDAs, the history of applying probabilistic graphical models is even longer; see, for example, [3].) In his analysis of the IDEA-H, Bosman noted that the computational complexity of the IDEA-H grows exponentially with the maximum number of variables that a variable conditionally depends on, which becomes the bottleneck of the IDEA-H. Recently, an accelerated algorithm called the dHEDA, with polynomial complexity dictated by the size of the population, was proposed in [10]. Since the computational complexity is the main concern for the HEDA based on PGM, we will discuss the dHEDA in detail.
3.1 General Framework
In general, the HEDA based on PGM shares the same framework as any other HEDA, except that it iteratively updates and samples from P(Z_0, ..., Z_{l-1}) under the assumption that:
P(Z_0, ..., Z_{l-1}) = P(Z_{j_{l-1}}) \cdot \prod_{i=0}^{l-2} P(Z_{j_i} | Z_{\pi(j_i)_0}, ..., Z_{\pi(j_i)_{|\pi(j_i)|-1}})      (2)

In Eq.(2), {Z_0, ..., Z_{l-1}} = {Z_{j_0}, ..., Z_{j_{l-1}}}, but the variables appear in a different order, while {Z_{\pi(j_i)_0}, ..., Z_{\pi(j_i)_{|\pi(j_i)|-1}}} \subseteq {Z_{j_{i+1}}, ..., Z_{j_{l-1}}} and |\pi(j_i)| \le k. Z_i (i = 1, ..., l) is the discrete random variable which represents the bin index and takes values from {1, ..., n_b}.
Fig. 1. P(x_1, ..., x_6) = P(x_1) P(x_2|x_1) P(x_3|x_1) P(x_4|x_1, x_2) P(x_5|x_2, x_3) P(x_6|x_5)
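For illustration, sampling a new individual from the factorization of Fig. 1 proceeds by ancestral sampling, drawing each variable after its parents; the conditional probability tables below (for n_b = 2 bins) are made up purely to show the mechanics and are not taken from any experiment.

import numpy as np

# Ancestral sampling from the factorization in Fig. 1 (hypothetical tables, nb = 2 bins).
nb = 2
p_x1 = [0.7, 0.3]
p_x2_given_x1 = {0: [0.6, 0.4], 1: [0.2, 0.8]}
p_x3_given_x1 = {0: [0.5, 0.5], 1: [0.9, 0.1]}
p_x4_given_x1x2 = {(0, 0): [0.8, 0.2], (0, 1): [0.3, 0.7], (1, 0): [0.4, 0.6], (1, 1): [0.1, 0.9]}
p_x5_given_x2x3 = {(0, 0): [0.7, 0.3], (0, 1): [0.2, 0.8], (1, 0): [0.5, 0.5], (1, 1): [0.6, 0.4]}
p_x6_given_x5 = {0: [0.9, 0.1], 1: [0.3, 0.7]}

def sample_once(rng):
    # draw each variable only after its parents have been drawn
    x1 = int(rng.choice(nb, p=p_x1))
    x2 = int(rng.choice(nb, p=p_x2_given_x1[x1]))
    x3 = int(rng.choice(nb, p=p_x3_given_x1[x1]))
    x4 = int(rng.choice(nb, p=p_x4_given_x1x2[(x1, x2)]))
    x5 = int(rng.choice(nb, p=p_x5_given_x2x3[(x2, x3)]))
    x6 = int(rng.choice(nb, p=p_x6_given_x5[x5]))
    return x1, x2, x3, x4, x5, x6

print(sample_once(np.random.default_rng(0)))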
That is to say, the probabilistic graphical model simplifies the complete factorized probability function by assuming the Markov properties of the variables; see Fig. 1 as an example. In the expansion of the complete probability, each variable is regarded as conditionally independent of any other variables given its parent variables (the nodes with an edge pointing to the current node in the graph; for example, x_1 and x_2 are both parent variables of x_4). Interested readers may consult [19] for more background on PGMs. The core of the HEDA based on PGM is then to estimate each item on the right-hand side of Eq.(2). In each generation, there are mainly two jobs in estimating the PGM:
1. The probability density structure (PDS) {Z_{j_i}, Z_{\pi(j_i)_0}, ..., Z_{\pi(j_i)_{|\pi(j_i)|-1}}} has to be found.
2. The probability density functions (PDF) P(Z_{j_i} | Z_{\pi(j_i)_0}, ..., Z_{\pi(j_i)_{|\pi(j_i)|-1}}) and P(Z_{j_{l-1}}) have to be calculated.
In the following discussion of this section, more details of the above two jobs in the dHEDA are presented.
3.2 PDS Search
In each generation, the PDS {Z_{j_i}, Z_{\pi(j_i)_0}, ..., Z_{\pi(j_i)_{|\pi(j_i)|-1}}} (i = 0, ..., l-1) is learnt from the selected individuals. To minimize the difference between Eq.(2) and the complete factorized probabilistic function, the dHEDA, like the IDEA-H [4], minimizes the following expression, using the Kullback-Leibler (K-L) divergence as a metric:

J(Z_0, ..., Z_{l-1}) = H(Z_{j_{l-1}}) + \sum_{i=0}^{l-2} H(Z_{j_i} | Z_{\pi(j_i)_0}, ..., Z_{\pi(j_i)_{|\pi(j_i)|-1}})      (3)

where H(.) denotes the entropy. For a variable i with k parents, the conditional entropy is calculated by:

H(Z_i | Z_{\pi(i)_0}, ..., Z_{\pi(i)_{k-1}}) = H(Z_i, Z_{\pi(i)_0}, ..., Z_{\pi(i)_{k-1}}) - H(Z_{\pi(i)_0}, ..., Z_{\pi(i)_{k-1}})      (4)

If the variables have n_b bins, the joint entropy of {Z_0, ..., Z_{n-1}} is calculated by:

H(Z_0, ..., Z_{n-1}) = - \sum_{z_0=0}^{n_b-1} ... \sum_{z_{n-1}=0}^{n_b-1} P_{Z_0,...,Z_{n-1}}(z_0, ..., z_{n-1}) \ln P_{Z_0,...,Z_{n-1}}(z_0, ..., z_{n-1})      (5)

where P(Z_0 = z_0, ..., Z_{n-1} = z_{n-1}) is equal to the sum of the probability factors of those selected individuals that fall into the super-bin (Z_0 = z_0, ..., Z_{n-1} = z_{n-1}) in the dHEDA. Note that we introduce the new term super-bin to avoid confusion with the concept of a bin in the marginal case; a super-bin is no different from a bin except that it denotes a bin in more than one dimension. We also use (z_0, ..., z_{n-1}) as a shorthand for (Z_0 = z_0, ..., Z_{n-1} = z_{n-1}) in the following.
In order to find a promising PDS, greedy methods can be employed to minimize J(Z_0, ..., Z_{l-1}). It has been verified that, leaving aside the calculation cost of Eq.(3), J(Z_0, ..., Z_{l-1}) can be minimized by greedy methods with polynomial computational complexity. For example, Bosman et al. proposed 3 methods for 3 different general structures (chain structure, tree structure and Bayesian network structure). These graphical models were successfully applied in Gaussian-based EDAs [4] and exhibit acceptable polynomial computational complexity.
3.3 Computation Issue on the K-L Divergence
The computation of the K-L divergence J(Z_0, ..., Z_{l-1}) is important because it is the metric that guides the PDS search. It is clear from Eqs.(3) and (4) that the core of calculating J(Z_0, ..., Z_{l-1}) is calculating joint entropies. In other words, if the entropy calculation is efficient, obtaining J(Z_0, ..., Z_{l-1}) will also be efficient. The obstacle in the histogram-based EDA, however, is that the computational complexity of directly calculating J(Z_0, ..., Z_{l-1}) using Eq.(5) is unaffordable, because it is exponential in the maximum number of parent nodes over all nodes.
It is noticed, though, that the probability of a super-bin P(Z_0 = z_0, ..., Z_{n-1} = z_{n-1}) is non-zero if and only if there are selected individuals in the current generation falling into this super-bin. Since only n_best selected individuals are used for the PGM estimation, for any given {Z_i, Z_{\pi(j_i)_0}, ..., Z_{\pi(j_i)_{|\pi(j_i)|-1}}} there exist at most n_best super-bins with non-zero probability. According to Eq.(5), the value of the joint entropy is only contributed by those non-zero super-bins. Thus, without altering the result, the expression of the joint entropy can be rewritten as:

H(Z_0, ..., Z_{n-1}) = - \sum_{i=1}^{N} P_{Z_0,...,Z_{n-1}}(z^i_0, ..., z^i_{n-1}) \ln P_{Z_0,...,Z_{n-1}}(z^i_0, ..., z^i_{n-1})      (6)
In Eq.(6), N \le n_best indicates the number of non-zero super-bins, and P_{Z_0,...,Z_{n-1}}(z^i_0, ..., z^i_{n-1}) is the probability (height) of a super-bin into which at least one of the selected individuals falls. The reason that N \le n_best is that more than one individual may fall into the same super-bin.
The process to calculate the entropy of a given set of joint variables is therefore as follows. Initially, H(Z_0, ..., Z_{n-1}) = 0. We pick an unpicked individual and obtain its super-bin (z^i_0, ..., z^i_{n-1}); we then check whether (z^i_0, ..., z^i_{n-1}) already exists in memory: if yes, we increase the height of super-bin (z^i_0, ..., z^i_{n-1}); if no, we create a new super-bin (z^i_0, ..., z^i_{n-1}) and increase its height. After all individuals have been picked, we sum up the probabilities (heights) of the super-bins in memory times their logarithms and obtain the entropy.
Now, let us analyze the computational complexity of the calculation of H(Z_i, Z_{\pi(i)_0}, ..., Z_{\pi(i)_{k-1}}). The step of finding the individuals that belong to the same super-bin, and of summing up their improvements to get the height of that super-bin, takes O[n_best^2 (k+1)].
Thus, a complexity of O[n_best^2 (k+1)] is needed to calculate H(Z_i, Z_{\pi(i)_0}, ..., Z_{\pi(i)_{k-1}}). Overall, a complexity of O[n_best^2 l^2] is needed to calculate J(Z_0, ..., Z_{l-1}). We can therefore conclude that the PDS can be found in polynomial time when the above method is used to calculate the entropies. This is much more tractable than directly applying Eq.(5) to calculate the entropies.
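As a sketch of the memory-based procedure just described (our own minimal Python rendering, not the original dHEDA code), the non-zero super-bins can be accumulated in a dictionary so that the cost depends only on the number of selected individuals:

from collections import defaultdict
from math import log

# individuals: list of bin-index tuples restricted to the variables {Z_i} of interest;
# weights: the probability factor contributed by each selected individual (e.g. 1/n_best).

def joint_entropy(individuals, weights):
    heights = defaultdict(float)
    for bins, w in zip(individuals, weights):     # accumulate heights of non-zero super-bins
        heights[tuple(bins)] += w
    return -sum(p * log(p) for p in heights.values())

# usage: entropy of (Z_0, Z_1) over 4 selected individuals, each with factor 0.25
print(joint_entropy([(3, 7), (3, 7), (2, 5), (9, 1)], [0.25] * 4))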
3.4 PDF Calculation
It is noticed that all the PDFs are byproducts of the entropy computation. Thus, there is no extra step for the PDF calculation if we save these results during the PDS search.
So far, we have introduced the main framework of the HEDA based on probabilistic graphical models, and especially the details of the entropy computation because of its importance for the computational complexity. One more point needs to be remarked: it may seem problematic that the probabilities of many super-bins are equal to zero. Does that mean the later-generation individuals will never get a chance to be sampled in those zero-probability bins (which would obviously be unreasonable, as it ignores general knowledge)? This is not the case. In the first step, the sampling process chooses the bins only according to the heights of the bins, which means those zero-height bins cannot be chosen; but in the second step, the real value of an individual can be sampled outside the chosen bin if we apply a variant of the surrounding effect. Interested readers may refer to [10] for details.
3.5 Experimental Results
We design two experiments: the first is to test the computational efficiency of the accelerated algorithm; the second is to test the performance of the HEDA based on PGM on several benchmark continuous optimization problems.
Fig. 2. The computation time to calculate the joint entropy by the dHEDA and the IDEA-H
First, let us examine the computational time of the dHEDA and that of the IDEA-H. The experimental results are shown in Fig. 2. The number of bins is 20, 30 and 40, respectively; the number of variables in the joint entropy ranges from 1 to 5; and the number of selected individuals is 100. The reason that we test at most 5 joint variables is that our PC runs out of memory when we try to run the test with 6 joint variables for the IDEA-H with 40 bins. Each setting is run 20 times before the average time is calculated. The experiments were run on an Intel Pentium 2.8 GHz CPU.
In Fig. 2, we notice that the time taken by the IDEA-H grows remarkably with an increasing number of joint variables. In the case of 40 bins, for example, the 5-joint-variable entropy calculation by the IDEA-H takes over 100 s. On the other hand, the time taken by the dHEDA increases slowly and remains small (approximately 10^-3 s). The other point is the great difference between them in computational time as the number of bins grows: the time taken by the IDEA-H grows distinctly, whereas the time taken by the dHEDA hardly rises. These experimental results are consistent with our earlier analysis. Recall that the computational complexity for calculating H(Z_i, Z_{\pi(i)_0}, ..., Z_{\pi(i)_{k-1}}) is O[(n_b)^{k+1}] for the IDEA-H, but only O[n_best^2 (k+1)] for the dHEDA.
Next, we examine the performance of the HEDA based on PGM, comparing it with three other continuous EDAs (the UMDAc, the IDEA-G [4], and the sur-shr-HEDA [9]) on 4 benchmark continuous problems: the Sphere Function, the Summation Cancellation Function, the Schwefel-1 Function, and the Schwefel-2 Function.
Table 1. Parameter Settings of the 4 Algorithms

UMDAc          Maximal evaluation number = 3x10^5, Population size = 375, Select rate = 20%
IDEA-G         Maximal evaluation number = 3x10^5, Population size = 375, Select rate = 20%
sur-shr-HEDA   Maximal evaluation number = 3x10^5, Select rate = 20%, Population size = 375, Mutation rate = 5%, Bins number = 99, Surrounding factor = 10%
dHEDA          Maximal evaluation number = 3x10^5, Select rate = 20%, Population size = 375, Mutation rate = 5%, Bins number = 99, Surrounding factor = 10%
Among the 4 algorithms, the UMDAc and the IDEA-G are based on the Gaussian distribution, while the sur-shr-HEDA and the dHEDA are based on the histogram model; the UMDAc and the sur-shr-HEDA are marginal EDAs, while the IDEA-G and the dHEDA are EDAs based on PGM. Note that we use the dHEDA in place of the IDEA-H because of the computational concerns discussed above. The settings of the 4 algorithms are listed in Table 1. Among the 4 problems, the Sphere Function and the Summation Cancellation Function are unimodal, while the other 2 are multimodal; the Sphere Function and the Schwefel-1 Function are separable, while the other 2 are non-separable. The problems are listed in Table 2.
Table 2. Problem Descriptions
Problem                   Representation
Sphere                    F(y) = \sum_{i=0}^{l-1} y_i^2
Summation Cancellation    F(y) = 1 / (10^{-5} + \sum_{i=0}^{l-1} |X_i|),  where X_0 = y_0, X_i = y_i + X_{i-1}
Schwefel-1                F(y) = \sum_{i=0}^{l-1} ( -y_i \sin(\sqrt{|y_i|}) )
Schwefel-2                F(y) = \sum_{i=0}^{l-1} [ (y_i^2 - y_0)^2 + (y_i - 1)^2 ]

Problem                   Domain        Dimension   Type   Optimum
Sphere                    [-100,100]    l = 30      Min    0
Summation Cancellation    [-3,3]        l = 10      Max    10^5
Schwefel-1                [-500,500]    l = 30      Min    -12569.5
Schwefel-2                [-5,5]        l = 20      Min    0
Each of the algorithms is run 20 times for each problem, and the mean value, the best case, and the standard deviation over the 20 runs are collected in Table 3. According to the results in Table 3, we can summarize two main points as follows. The first is that the histogram-based EDAs outperform the Gaussian-based EDAs on the multimodal problems. For example, on the Schwefel-1 Function (which is often regarded as a typical multimodal test problem), both of the histogram-based EDAs are able to reach the optimum of the problem, while the Gaussian-based EDAs both fail. The excellent performance of the histogram-based EDAs agrees with our earlier remarks: it is straightforward for the histogram model to estimate a multimodal distribution. Meanwhile, since the Gaussian-based EDAs are unimodal, their models are too rough to estimate the multimodal problems efficiently.
Table 3. Experimental Results

Problem                  Algorithm      Best Case     Mean Value    Standard Dev.
Sphere                   UMDAc          2.816e-84     1.276e-83     1.064e-83
                         IDEA-G         3.423e-161    2.562e-160    2.992e-160
                         sur-shr-HEDA   5.872e-15     6.753e-15     7.698e-16
                         dHEDA          5.104e-19     7.496e-19     1.304e-19
Summation Cancellation   UMDAc          8.635         5.798         2.403
                         IDEA-G         66199         12805         20704
                         sur-shr-HEDA   99998         93973         22016
                         dHEDA          10^5          10^5          2.031e-2
Schwefel-1               UMDAc          -11622.8      -10637.5      6.289e2
                         IDEA-G         -5277.4       -4719.6       2.976e2
                         sur-shr-HEDA   -12569.5      -12569.4      1.458e-1
                         dHEDA          -12569.5      -12569.5      1.415e-7
Schwefel-2               UMDAc          8.131e-5      2.576e-2      1.984e-2
                         IDEA-G         4.324e-5      1.543e-2      1.652e-1
                         sur-shr-HEDA   3.793e-19     3.145e-4      1.403e-2
                         dHEDA          1.987e-21     3.161e-8      1.426e-7
The second point is the contrast between the algorithms based on a PDS without linkage and the ones based on a PDS with linkage. In the experiments the dHEDA outperforms the sur-shr-HEDA on both of the non-separable problems. The IDEA-G outperforms the UMDAc on the unimodal non-separable problem, but the two algorithms obtain comparable results on the multimodal non-separable problem. On the Summation Cancellation Function, for example, the IDEA-G's mean value is above 10^4 while the UMDAc's is below 10^1; and the dHEDA succeeds every time whereas the sur-shr-HEDA performs unstably with large deviations. This fact clearly illustrates the value of learning the linkage information in the IDEA-G and the dHEDA.
4 HEDA Based on Space Transformation
Although the accelerated method based on probabilistic graphical models has already been introduced, its complexity burden in the HEDA is still comparatively large. A much simpler method is to deal with the linkages in the marginal case. The method is based on a space transformation, in particular a linear transformation. P. Pošík in [17,18] proposed several successful evolutionary algorithms based on space transformations. In this section we introduce the transformation defined by the covariance matrix of the samples and apply the method to the HEDA. We call this algorithm the Marginal Estimation of Distribution Algorithm in the Characteristic Space of the Covariance-Matrix (CM-MEDA).
Intrinsically, it is equivalent to preprocessing the samples by PCA within an EDA. Similar work can also be found in [15].
4.1 Covariance-Matrix of the Samples
We first review several basic and well-known properties of the Covariance-Matrix. Given a set of m samples x_i, where x_i = (x_{i1}, x_{i2}, ..., x_{il})^T, the Covariance-Matrix is defined as

C = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^T      (7)
And it has the following important properties:
1. The Covariance-Matrix is real and symmetric, that is, c_{ij} \in R (for all i, j = 1, ..., l) and C = C^T.
2. The Covariance-Matrix is diagonalizable, that is, there exists Q such that C = Q[\lambda]Q^{-1}, where [\lambda] = diag(\lambda_1, \lambda_2, ..., \lambda_l).
3. All eigenvectors with different eigenvalues are orthogonal to each other.
4. There exists a matrix P = (p_1, p_2, ..., p_l) which satisfies C = P[\lambda]P^{-1} and whose columns p_i are orthogonal to each other, that is, P^T = P^{-1}.
Finally, we fix some notation. The space defined by P is the characteristic space of the Covariance-Matrix C, and each p_i is a basis vector of this space. According to Property 2 and Property 4, we have C = P[\lambda]P^T, where [\lambda] = diag(\lambda_1, \lambda_2, ..., \lambda_l).
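These properties are easy to verify numerically; the following short NumPy check (with an arbitrary, made-up sample matrix) confirms that the eigenvector matrix of a covariance matrix is orthogonal and diagonalizes it.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                     # 200 made-up samples, l = 5
C = np.cov(X, rowvar=False)                       # Covariance-Matrix, Eq. (7)
lam, P = np.linalg.eigh(C)                        # columns of P are the eigenvectors p_i
print(np.allclose(P.T @ P, np.eye(5)))            # Property 4: P^T = P^-1
print(np.allclose(P @ np.diag(lam) @ P.T, C))     # C = P [lambda] P^T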
4.2 General Framework
The CM-MEDA is an algorithm which estimates the marginal distribution of the problem in a transformed space: the characteristic space of the Covariance-Matrix of the selected samples. The algorithm begins with an initialized population that is generated randomly. In each generation, it evaluates the fitness of each sample and selects several promising samples. These two steps are no different from those of any other evolutionary algorithm. Then the CM-MEDA calculates the Covariance-Matrix C of the selected samples and the matrix P, which defines the characteristic space of C. After P is obtained, the selected samples are transformed into the new space by multiplying by P^T, so that we obtain the value of each variable in the new coordinate space. For a selected sample x, this step is carried out by x' = P^T x.
The marginal distributions are estimated in the new space. After the marginal distributions of the variables in the new space have been estimated, new samples x_i' are generated according to the estimated distribution of each variable. At last, we transform the samples back into the initial coordinate space according to x = (P^T)^{-1} x_i' = P x_i'. Note that when we transform the samples back, some samples might fall outside the constraints. If that happens, we discard those "illegal" samples and resample new ones. After we have built the new population, we evaluate the fitness and go to the next generation. In short, the framework of the CM-MEDA is as follows:
1. Initialize the population.
2. Select the samples with high fitness.
3. Calculate the Covariance-Matrix C of the selected samples and build the matrix P.
4. Transform the selected samples into the characteristic space of the Covariance-Matrix by x' = P^T x.
5. Estimate the marginal models in the new space according to the distribution of the transformed selected samples.
6. Make new samples x_i' according to the marginal models in the transformed space.
7. Transform the new samples from the transformed space back to the original space by x = P x_i' and check whether all the new samples are legal. Resample the illegal samples.
8. Return to step 2 if not terminated.
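A compact NumPy sketch of one CM-MEDA generation (steps 3 to 7 above) is given below; it simplifies the histogram estimation to plain per-variable bin counts and, instead of discarding and resampling illegal samples, merely clips them to the domain, so it should be read as an illustration under those assumptions rather than as the authors' implementation.

import numpy as np

def cmmeda_step(selected, lo, hi, nb=100, n_new=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    C = np.cov(selected, rowvar=False)          # step 3: covariance matrix of the selected samples
    _, P = np.linalg.eigh(C)                    # columns of P are orthonormal eigenvectors
    z = selected @ P                            # step 4: z = P^T x (samples stored as rows)
    new_z = np.empty((n_new, selected.shape[1]))
    for v in range(selected.shape[1]):          # steps 5-6: marginal histogram model per variable
        counts, edges = np.histogram(z[:, v], bins=nb)
        probs = counts / counts.sum()
        b = rng.choice(nb, size=n_new, p=probs)
        new_z[:, v] = edges[b] + rng.random(n_new) * (edges[b + 1] - edges[b])
    x_new = new_z @ P.T                         # step 7: transform back, x = P z
    return np.clip(x_new, lo, hi)               # simplification: clip instead of resampling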
4.3 CM-MEDA in Histogram Case
In the histogram-based CM-MEDA, the marginal distribution in the transformed space is based on the histogram model. There is only one specific concern beyond the general framework of the CM-MEDA in the histogram case, namely how to decide the domain (i.e. the upper and lower bounds) of the histogram model in the transformed space. We solve this problem in a simple way: the domain of each variable is decided according to the range of the selected samples in that variable. Note that in this way the algorithm can naturally shrink the domain of the variables. Besides, to avoid a loss of generality, we make the domain of each variable a little larger than the range of the samples: there is one more bin to the left of the leftmost sample and one more bin to the right of the rightmost sample, and the leftmost and rightmost samples lie at the centers of the bins they belong to. See Fig. 3 as an illustration.
Fig. 3. The position of the selected samples and the domain of the bins. Note that the leftmost and the rightmost selected samples are in the centers of the bins they belong to, and there is one more bin to the left of the leftmost sample and one more bin to the right of the rightmost sample.
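The following is one possible way to realize the bin layout of Fig. 3 in Python, a sketch under the assumptions that the number of bins nb is at least 4 and that the leftmost and rightmost samples differ; the helper name is our own.

import numpy as np

def bin_edges(samples_1d, nb):
    left, right = float(samples_1d.min()), float(samples_1d.max())
    width = (right - left) / (nb - 3)                       # leftmost/rightmost samples sit at bin centers
    return np.linspace(left - 1.5 * width,                  # one spare bin left of the leftmost sample
                       right + 1.5 * width,                 # one spare bin right of the rightmost sample
                       nb + 1)                              # nb bins in total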
4.4 Relation to Principal Component Analysis
Principal component analysis (PCA) is a method that seeks a projection which best represents the samples in a least-squares sense [20]. It intrinsically provides a promising way of finding the projections which best capture the variance of the samples. For example, the first component (the principal component) is the projection along which the samples have the maximum variance; the second component is the projection with the maximum variance in the subspace orthogonal to the first component, and so on. Since the components given by PCA are the same as the eigenvectors of the Covariance-Matrix in the CM-MEDA, we can alternatively think that the core of the CM-MEDA is to find, iteratively, the projections which maximize the variance, because each eigenvector is the projection with the maximum variance of the samples in the subspace orthogonal to the eigenvectors with higher eigenvalues. The projection with the maximum variance of the samples is the one which most needs estimation in evolutionary optimization, because the ultimate aim is to find one optimal point, not a scatter of samples. Thus, we infer that, from the viewpoint of PCA, the CM-MEDA can improve the convergence of the EDA by estimating the distribution on the set of axes which maximize the variance of the samples. Similar work may also be found in [15,16,17].
According to the framework above, the CM-MEDA has only three more steps than the MEDA: calculation of the covariance matrix, calculation of the eigenvectors, and the transformation. The calculation of the covariance matrix takes O(n_best l^2), where n_best is the number of selected samples and l is the dimension of the samples. The eigenvectors can be efficiently calculated in O(l^3), and the transformation takes O(l^2). Therefore, it is clear that the CM-MEDA is much faster than the HEDA based on PGM, even the dHEDA. (Recall that it takes O(n_best^2 l^2) to calculate the K-L divergence in the dHEDA.) However, we also note that the CM-MEDA can only accurately capture linear linkages, while non-linear linkages are only approximated intrinsically. In contrast, the HEDA based on PGM is able to capture more complex relations among variables if fewer restrictions are placed on its PDS.
4.5 Experimental Results
We provide a brief experiment to compare the CM-MEDA and the MEDA in the histogram case on 4 continuous optimization problems. The general settings of the two algorithms are the same: the population size is 1000; 20% of the individuals are selected from the population to estimate the model; the number of bins for each variable is arbitrarily set to 100; and 40 iterations are run each time. The MEDA uses the mechanism of the sur-shr-HEDA to update the histogram model. The surrounding factor is set to 10%, and the mutation rate is set to 5%. The 4 test functions, Fun-1, Fun-2, Schwefel and Rosenbrock, are all non-separable and share a similar formulation. The reason that we choose several functions of this formulation is that it is usually hard for evolutionary algorithms to solve because of the strong linkages between variables. The problem descriptions are listed in Table 4.
Table 4. Problem Descriptions
Problem      Representation
Fun-1        Fun_1(x_1, ..., x_n) = \sum_{i=1}^{n-1} 100 (x_{i+1} - 1.3 x_i)^2 + (1 - x_1)^2
Fun-2        Fun_2(x_1, ..., x_n) = \sum_{i=1}^{n-1} [ (x_{i+1} - x_i^2)^2 + (1 - x_i)^2 ]
Schwefel     Schwefel(x_1, ..., x_n) = \sum_{i=1}^{n} [ (x_1 - x_i^2)^2 + (1 - x_i)^2 ]
Rosenbrock   Rosenbrock(x_1, ..., x_n) = \sum_{i=1}^{n-1} [ 100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2 ]

Problem      Domain    Dimension   Type   Optimum
Fun-1        [-5,5]    10          Min    0
Fun-2        [-5,5]    10          Min    0
Schwefel     [-5,5]    10          Min    0
Rosenbrock   [-5,5]    10          Min    0
Table 5. Experimental Results
Problem      Statistic   CM-MEDA     MEDA
Fun-1        Best        0.1775      1.7321
             Mean        0.2760      3.4524
             Std.        0.0801      0.7483
Fun-2        Best        3.879e-11   0.0319
             Mean        0.0015      0.0571
             Std.        0.0057      0.0112
Schwefel     Best        1.320e-10   0.0378
             Mean        0.0152      0.0682
             Std.        0.0359      0.0184
Rosenbrock   Best        6.5271      4.2003
             Mean        7.2160      8.0493
             Std.        0.3181      1.4204
Each of the algorithms is run 20 times on each problem, and the mean value, the best case, and the standard deviation over the 20 runs are collected in Table 5. The results in Table 5 verify the superiority of the CM-MEDA in the histogram case on these non-separable problems. For example, when solving the Schwefel function, the best solution obtained by the CM-MEDA in 20 runs is 1.320e-10, while the best solution of the MEDA is 0.0378; the mean value of the CM-MEDA is 0.0015, while the mean value of the MEDA is 0.0682.
5 Summary and Further Work
In this chapter, we have reviewed two methods for the HEDA which are able to detect linkage among variables in optimization problems. Using a probabilistic graphical model generally represents the relations more accurately but is also of high cost, while using a space transformation is quite fast but only roughly represents the major possible relationships. There is, however, still much further work to do on both topics. First, for the HEDA, the number of bins for a given problem is in most papers decided arbitrarily. An alternative method might place a prior on the bins, such as a Dirichlet process, and then use Bayesian analysis in estimating the model. Second, for the HEDA based on PGM, it is still interesting to investigate more deeply the case of a small number of samples with a great number of bins in a joint probability. Third, for the HEDA based on space transformation, finding a more efficient transformation, or defining a kernel space, might allow the algorithm to capture more complex and non-linear linkages. Fourth, there is little research directly comparing the two methods above. Besides, attempts to combine the two methods might yield even better performance, since the space transformation might reduce the number of parents of each node in the graphical model.
References
[1] Larranaga, P., Lozano, J.A.: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Dordrecht (2002)
[2] Pelikan, M.: Hierarchical Bayesian optimization algorithm: Toward a new generation of evolutionary algorithms. Springer, Heidelberg (2005)
[3] Pelikan, M., Goldberg, D.E., Tsutsui, S.: Getting the best of both worlds: Discrete and continuous genetic and evolutionary algorithms in concert. Information Science 156, 147–171 (2003)
[4] Bosman, P.A.N., Thierens, D.: An algorithmic framework for Density Estimation based Evolutionary Algorithms. Utrecht University technical report UU-CS-1999-46 (1999)
[5] Bosman, P.A.N., Thierens, D.: Numerical optimization with real-valued Estimation of Distribution Algorithms. Scalable Optimization via Probabilistic Modeling: From Algorithms to Applications, 91–120 (2006)
[6] Lu, Q., Yao, X.: Clustering and learning Gaussian distribution for continuous optimization. IEEE Trans. on Systems, Man and Cybernetics-Part C 35(2), 195–204 (2005)
[7] Tsutsui, S., Pelikan, M., Goldberg, D.E.: Evolutionary algorithm using marginal histogram models in continuous domain. In: 2001 Genetic and Evolutionary Computation Conference Workshop, San Francisco, CA, pp. 230–233 (2001)
[8] Ding, N., Zhou, S., Sun, Z.: Optimizing continuous problems using Estimation of Distribution Algorithms based on histogram model. In: The 6th International Conference of Simulated Evolution and Learning, Hefei, China, pp. 545–562 (2006)
[9] Ding, N., Zhou, S., Sun, Z.: Histogram-based Estimation of Distribution Algorithm: a competent method for continuous optimization. J. Computer Science and Technology 23(1), 35–42 (2008)
[10] Ding, N., Xu, J., Zhou, S., Sun, Z.: Reducing Computational Complexity of Estimating Multivariate Histogram-based Probabilistic Model. In: 2007 IEEE Congress on Evolutionary Computation, pp. 111–118 (2007)
[11] Yuan, B., Gallagher, M.: Playing in continuous spaces: Some analysis and extension of population-based incremental learning. In: IEEE Congress on Evolutionary Computation, Canberra, Australia, pp. 443–450 (2003)
[12] Baluja, S.: Population-based incremental learning. Carnegie Mellon University, Technical Report CMU-CS-94-163 (1994)
[13] Cantu-Paz, E.: Supervised and Unsupervised Discretization Methods for Evolutionary Algorithms. In: The Genetic and Evolutionary Computation Workshop, San Francisco, CA, pp. 213–216 (2001)
[14] Zhang, Q., Sun, J., Tsang, E., Ford, J.: Hybrid Estimation of Distribution Algorithm for Global Optimization. Engineering Computations 21(1), 91–107 (2004)
[15] Zhang, Q., Allinson, N.M., Yin, H.: Population Optimization Algorithm Based on ICA. In: The First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks (2000)
[16] Zhang, Q., Mühlenbein, H.: On the convergence of a class of estimation of distribution algorithms. IEEE Trans. Evolutionary Computation 8(2), 127–136 (2004)
[17] Pošík, P.: On the Utility of Linear Transformations for Population-Based Optimization Algorithms. In: The 16th World Congress of the International Federation of Automatic Control (2005)
[18] Pošík, P.: On the Use of Probabilistic Models and Coordinate Transforms in Real-Valued Evolutionary Algorithms. PhD Dissertation, Czech Technical University (2007)
[19] Lauritzen, S.L.: Graphical Models. Clarendon Press, Oxford (1996)
[20] Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley Interscience, Chichester (2000)
Linkage in Island Models
Zbigniew Skolicki
George Mason University, Fairfax, VA. Current affiliation: Google, Inc., 1600 Amphitheatre Parkway, Mountain View, CA
[email protected]
Most of the work presented here was done for the author's PhD dissertation.
Abstract. Island models (IMs) have multiple sub-populations which periodically exchange a fraction of individuals. This complex setup results in distinct dynamics of evolution, and IMs are therefore characterized by several interesting properties with regard to gene linkage, which are presented in this chapter. Traditional single-population evolutionary algorithms (EAs) suffer from relatively quick and often random gene fixation, due to evolutionary selection, drift and the physical linkage of recombination operators. In IMs, an additional, slower inter-island level of evolution, together with recurring local evolution inside islands, makes searching for optimal allele configurations more systematic. In this chapter, it is shown that IM dynamics may counteract hitch-hiking. Further, multiple building blocks may be better identified and combined in IMs, consequently supporting compositional evolution, which is studied on the H-IFF function. Finally, a discussion follows on how the repeated fixation of genes in islands might be treated as a linkage learning process.
1 Introduction
Searching for an optimal solution to a given problem requires mapping the solution space into a representation space. Solutions are encoded using multiple parameters, or genes in evolutionary computation. However, these parameters/genes are rarely independent with regard to the characteristics of solutions. Moreover, the mutual dependency between them, called linkage or epistasis, is very often inherent to the problem, and exists regardless of the representation used. In fact, linkage between genes is a fundamental reason to use evolutionary algorithms (EAs). If each gene could be optimized independently, the problem would be perfectly separable. Such a problem could easily be solved sequentially, without any need for evolutionary algorithms. The strength of evolutionary algorithms (at least according to the building block hypothesis) lies in their ability to simultaneously analyze multiple configurations of genes (building blocks, BBs), to focus on the more promising ones and to combine such blocks together.
Recombination has a profound impact in evolutionary systems [10]. One of the properties of recombination operators is that they copy certain genes together more often than others. The resulting dependence between genes is called a physical linkage. There is a difference between the inherent, true linkage between parameters (epistasis) existing in a given problem and the physical linkage existing due to recombination operators.
In fact, good recombination operators are designed so that their physical linkage approximates the epistasis of a given problem. Among those operators, more conservative ones focus on maintaining certain gene configurations, while more aggressive operators focus on mixing genes, potentially discovering better blocks. A lot of research has been directed at learning linkage, that is, adjusting recombination operators so that their physical linkage matches the problem epistasis. Examples here are the Messy GA [4] and the Linkage Learning GA [6]. Also SEAM [10] supports recombination of building blocks, regardless of their configuration on a genome, by initially operating on subsets of genes.
Unfortunately, regardless of the simultaneous analysis of multiple configurations, evolutionary algorithms are imperfect in searching for the best solution. It is well known that repetitive stochastic sampling and the selection operator in any evolutionary algorithm cause genes to lose alleles and a population to converge [1]. The smaller the population, the more visible the drift. Only a small portion of the usually large number of possible gene configurations have a chance to be verified and potentially remain as part of a genome before a quick fixation. Physical linkage and stochastic sampling result in coincidentally copying certain genes together. A process in which an allele of one gene increases its survival rate due to being randomly selected with good alleles of other genes is called hitch-hiking. Although this effect is stronger for more conservative recombination operators, we may observe it even for a uniform crossover, which copies random genes together.
There are two seemingly opposite goals of an effective evolutionary algorithm. One is preventing premature convergence and the side effects of stochasticity, and the other is being able to identify, converge to and propagate useful gene combinations. The observations suggest that IMs, thanks to their two-level evolution, may be able to facilitate both the identification of useful gene combinations and the maintenance of multiple search trajectories at a more general level. These properties should make solving problems with a high level of epistasis easier.
In this chapter we will see potential benefits from using IMs in the context of gene linkage. First, the interaction of the two levels of evolution in IMs is discussed in the context of the fixation of genes, highlighting several interesting implications. We will see that IMs reduce random gene fixation (and hitch-hiking). These observations lead to a more profound conclusion, namely that IMs support so-called compositional evolution [10]. Arguments suggesting that IM dynamics may lead to an adaptation of fixations to the linkage between genes conclude the chapter.
2 Island Models and Two-Level Evolution
Island models (IMs) are models of evolutionary algorithms in which multiple copies of the evolutionary process are run simultaneously in sub-populations called islands, and individuals are exchanged between them from time to time. Since standard operators are restricted to islands, IMs are an example of non-panmictic algorithms, as opposed to single-population panmictic EAs, where all individuals can interact freely. Island models were initially studied by evolutionary biologists, and later in computer science, both in theory and practice.
A good background introduction to island models was given by Gustafson [5]; a review was also included by Skolicki [7]. Unfortunately, there does not exist a good, widely accepted theory of island models. The classic Shifting Balance Theory by Wright [11] explains the benefits of using island models mainly by their ability to search multiple regions and "shift" the focus of all islands to the promising regions. Drift plays a major role in diversifying the search and thus exploring new regions, somewhat similarly to how it happens according to the theory of Neutral Evolution [9].
Let us confront two possible understandings of IM dynamics. The first one is more common, but there exists another one, which offers an explanation of certain IM phenomena. The common understanding of IMs is that they are a collection of independent EAs which exchange solutions after they have evolved them. In this framework islands "compete" to evolve the best possible solution, and the best island "wins." Such an understanding suggests a few big islands, so that diversity and evolvability are maintained independently inside each of them. A migration either serves to bring in a new best individual (and so one should always choose the best emigrants), or sometimes acts as a big mutation to maintain a high level of diversity.
Another understanding is achieved by taking a higher view of the islands to see that they may be treated as individuals in a higher-level population [8, 7]. Interaction between the islands creates a higher-level evolution, and therefore the whole IM can be seen as a much more compact entity, even if migrations seem to loosely connect the islands. Consequently, two levels of evolution can be identified. One is the local, intra-island evolution, which is what has been studied so far on islands. The other is the global, inter-island evolution, occurring between islands. It comes as no surprise that those two levels of evolution interact and mutually impact each other's dynamics. In this view, the global evolution uses the local evolution to carry out the actual computations and significantly changes its dynamics, while the local evolution creates the global evolution by "selecting", "mutating" and "mixing" islands.
In IMs, sub-populations are relatively smaller (compared to a single population), so genes get fixed faster due to selection and drift. However, the semi-independent evolution causes islands to fixate genes to different alleles. These blocks of different alleles are later shared by migration. IMs keep restoring the local diversity with the help of migrations, thus practically restarting local evolution when needed. Because migrations occur only every migration interval, the global interaction occurs at a slower pace than the local evolution. As a result, the whole evolution occurs at a slower pace compared to a single-population EA.
The alternative view of IMs requires rethinking the impact on performance of such standard IM parameters as island size and number, migration size, interval, topology and policy. However, the aim of this chapter is to present the relation between the dynamics of IMs and gene linkage. The analysis of the influence of IM parameters on IM dynamics, although related, is beyond the scope of this chapter; such an analysis of IM performance over a wide spectrum of IM parameters was performed in another publication [7]. Nevertheless, a few issues are worth commenting on here, based on these experiments.
With sub-populations of smaller size converging quickly, there is a chance that some important gene linkage (or local optimum) may not be discovered in any island. This disadvantage is to some degree alleviated by a larger number of islands, in which such a linkage/optimum may be found. The question is how to balance the size and number of islands to maximize the probability of finding good solutions. As will be shown in the next sections, the ability of inter-island evolution to stimulate intra-island evolution after local convergence is strong, and it is fine to converge on the local level. On the other hand, the diversity of inter-island evolution is not easily increased. This suggests that setups with a larger number of islands (the "individuals" of inter-island evolution) than the size of islands should be preferred. In fact, experiments on some simple functions suggest that the performance increases with an increasing number of islands, until rather extreme configurations are reached, when it drops again.
A second issue is whether random or elite individuals should be chosen as migrants. With smaller islands, which converge within migration intervals, the choice of migration policy is less important: all individuals in a given island will be very similar at the moment of migration, and therefore the performance should be similar regardless of the method of migrant choice. A similar behavior was observed in experiments with various migration policies.
Finally, the choice of migration size and interval is important. It turns out that the migration interval cannot be too short, so that the intra-island evolution is not interrupted prematurely. It was also shown that the mixing between islands can be highly sensitive to the migration size and other parameters. A proper choice of these parameters is required for a successful exchange of partial solutions between islands.
3 Experimental Setup

For the experiments in this chapter, the following model of IMs was used. A number of N = 10 identical islands of size M = 10 individuals is used (often described as an N×M model). Generations are synchronized and migrations occur every migration interval i = 10. Instead of specifying a constant number of migrants, a migration probability α = 0.1 is applied to each individual. Such an approach allows experimenting with migration ratios other than the fractions imposed by a small island size. A dynamic full topology was used, which means that each time a migration is about to occur, a single target population for the migrants is chosen randomly out of all populations. A similar approach was used in the literature [3]. In general, a dynamic topology allows comparing setups with different numbers of islands in a fair way and with a constant α, because the total number of migrants remains constant. A random migration policy was also used, choosing emigrants randomly and replacing random individuals in the target populations. In some experiments mutation is turned off to better focus on recombination and to make the effects of mixing genes more visible. Mutation operates on single genes, and is therefore less related to linkage than recombination. Two setups for the selection pressure inside islands are used, EA-1 and EA-2 [2]. EA-1 is weaker and uses binary tournament parent selection, no survival selection and a non-overlapping model (similar to GA setups). EA-2 is stronger and uses stochastic parent
selection, truncation survival selection and overlapping model with a brood ratio of 1.0 (similar to ES setups, in particular (μ + λ ), where μ = λ ). Those setups refer only to the selection pressure, in the way described above, and reproduction operators (recombination and mutation) are used independently (unlike traditional ES setups with no recombination). Either uniform or one-point crossover is used at a rate of 1.0 (for experiments with the OneMax, OneHitcher and ManyHitchers functions) and at a rate of 0.7 (for experiments with the H-IFF function). Mutation, if used, is performed at a 1/L ratio. It is either bit-flip mutation for binary representation, or a non-adaptive Gaussian mutation with σ = 0.01% of a domain range (quite small) for real-valued representation. Elitism is used for experiments with the H-IFF function. For all measures a mean from 60 iterations is reported, which results in rather small confidence intervals.
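As a concrete illustration of the setup above, the sketch below implements one migration event with the dynamic full topology, the per-individual migration probability α, and the random choose-and-replace policy. It is a minimal reading of the description rather than the author's code; the list-of-lists population layout and the choice of one target island per source island are assumptions.

```python
import random

def migrate(islands, alpha=0.1):
    """One migration event: each individual emigrates with probability alpha,
    a single target island is drawn at random (dynamic full topology), and each
    migrant replaces a randomly chosen resident of the target population."""
    for src_index, source in enumerate(islands):
        # dynamic full topology: pick one target population other than the source
        target = islands[random.choice([i for i in range(len(islands)) if i != src_index])]
        for individual in source:
            if random.random() < alpha:                                   # per-individual migration probability
                target[random.randrange(len(target))] = list(individual)  # random replacement policy

# usage sketch: ten islands of ten random 20-bit genomes, to be called every i = 10 generations
islands = [[[random.randint(0, 1) for _ in range(20)] for _ in range(10)] for _ in range(10)]
migrate(islands)
```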
4 Oscillating Gene Convergence and Avoidance of Random Fixations

EAs operate in a stochastic manner on a finite set of individuals. Therefore, some gene values can be carried from a parent to a child by accident, rather than being deterministically selected, and other alleles may be lost through generations. Additionally, an allele of one gene may survive because of being repetitively copied together with a good allele of another gene, due to the physical linkage of a recombination operator. Such a situation is called "hitch-hiking" and makes the survival of other alleles even less likely. In some cases, hitch-hiking may cause a fixation of the first gene to a nearly randomly chosen value. The effect will be stronger when the recombination operator is more conservative because of a higher chance of survival of physically linked alleles (e.g. it is stronger with a one-point than with a uniform crossover).
Fig. 1. Average gradual fixation in IMs compared to fast simultaneous fixation in one-population EA (number of fixed genes over generations; curves: 1x100 single population vs. 10x10 IM)
Fig. 2. Allele diversity for all 20 loci, in a single run with a single population (100 ind.) (locus diversity over generations)

Fig. 3. Allele diversity for all 200 loci, in a single run with a 10x10 IM (locus diversity over generations)
The two-level interaction has an impact on selecting good gene configurations and, as a result, on gene fixation, which proceeds much more slowly in IMs than in a single-population EA. Fig. 1 illustrates how the average number of fixed genes grows in IMs compared to a standard EA and how migrations counteract fixation on the OneMax function.¹ Whereas for a standard EA most of the genes converge relatively fast, for an IM the convergence oscillates due to migrations. This creates a situation in which some genes are already fixed while some still have multiple alleles across islands. Fig. 2 shows the diversity of alleles for each of 20 loci in a run of a single-population EA and Fig. 3 shows
α = 0.1, i = 10, policy = random, topology = full, EA-1, recombination = uniform crossover, L = 20, no mutation. The OneMax problem maximizes the number of 1s in a binary genome.
Fig. 4. Function TwoGenes, a = 1 (surface plot over x and y)

Fig. 5. Average x (first gene) value, a = 0, uniform crossover (curves: 1x100 panmictic EA-1/EA-2, 10x10 IM EA-1/EA-2, over generations)
each of the 20×10 = 200 loci in an IM. It is again visible that for a standard EA most of the genes converge at more or less the same time, whereas for an IM some converge much earlier than others, due to migrations that partially restore diversity on the islands. If some genes influence fitness more than others, it is more likely that their good values will push the fitness high enough to determine the selection of genomes containing these alleles. As a result, such genes will be optimized by an EA earlier, whereas the other genes will suffer from stochastic sampling and hitch-hiking and will keep losing alleles. If the less-contributing genes converge at the same time as the more-contributing ones, the former cannot be effectively optimized. As we saw, in IMs it is unlikely that all genes will get fixed at the same time. Only those genes that are not yet fixed are still being evolved; the others are not changed by recombination, since their alleles are the same for all individuals in an island. Such a situation in general allows weaker genes to evolve after the more influential ones have been optimized. In each island, the less important genes can get fixed to a different value, maintaining their diversity at the inter-island level.
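The fixation and diversity measures plotted in Figs. 1-3 can be computed directly from the population. The chapter does not spell out the exact formula for "locus diversity", so the sketch below uses one plausible choice, min(p, 1−p) of the allele frequency p, which is 0 for a fixed locus and 0.5 when both alleles are equally frequent; the binary list-of-lists representation is likewise an assumption.

```python
def locus_diversity(population):
    """Per-locus allele diversity of a binary population: min(p, 1 - p) for the
    frequency p of allele 1 at each locus (0 = fixed, 0.5 = maximally diverse)."""
    n, length = len(population), len(population[0])
    freqs = [sum(ind[j] for ind in population) / n for j in range(length)]
    return [min(p, 1.0 - p) for p in freqs]

def count_fixed_genes(population):
    """Number of loci at which every individual carries the same allele (cf. Fig. 1)."""
    return sum(1 for d in locus_diversity(population) if d == 0.0)
```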
Fig. 6. Average x (first gene) value, a = 0, one-point crossover (curves: 1x100 single population EA-1/EA-2, 10x10 IM EA-1/EA-2, over generations)

Fig. 7. Average x (first gene) value, a = 10, uniform crossover (curves: 1x100 panmictic EA-1/EA-2, 10x10 IM EA-1/EA-2, over generations)
Let us take a function TwoGenes,² defined over [0, 1] × [0, 1] by Equation (1) below and shown in Fig. 4:

TwoGenes(x, y) = max(1 − x, 10x − 8.9) + ay .   (1)

The x and y genes are independent. Two peaks exist for the first gene and one peak for the second gene. The parameter a regulates whether the second gene is stronger or weaker than the first one. If a is big, y will dominate the fitness calculation, and x will get fixed to a fairly random value. After y converges, evolution optimizes x, but since at that time its diversity is low, the process is similar to hill-climbing, often converging to the lower peak, which has a bigger basin of attraction.

² This function's name in [7] is IM1.
Fig. 8. Average x (first gene) value, a = 10, one-point crossover (curves: 1x100 single population EA-1/EA-2, 10x10 IM EA-1/EA-2, over generations)

Fig. 9. Function ManyGenes, gene contributions (fitness contribution of x_i for i = 1, 2, 5, 10, 20)
Since the two possible peaks for x are at 0.0 and 1.0, the average x value over multiple runs equals the fraction of runs in which the model converges to the higher peak. The results of the experiments confirm that the use of island models increases the average x value compared to the panmictic case. Figs. 5-8 compare the average x using two different values of a.³ With weak selection (EA-1), an IM performs better than a single-population EA, even with a = 0. With strong selection (EA-2), a single-population EA does not "lose" individuals from the good peak when a = 0, so it is performing
α = 0.1, i = 10, policy = random, topology = full, recombination = uniform crossover, L = 2, mutation rate = 1/L = 0.5.
Fig. 10. Maximum fitness averaged over runs with ManyGenes function, uniform crossover (curves: 1x100 panmictic EA-1/EA-2, 10x10 IM EA-1/EA-2, over generations)

Fig. 11. Maximum fitness averaged over runs with ManyGenes function, one-point crossover (curves: 1x100 single population EA-1/EA-2, 10x10 IM EA-1/EA-2, over generations)
comparably to an IM. However, when a = 10, the results for a single population are much worse than in the IM case, whose performance decreases only a little. We see that genes are analyzed more independently in an IM than they are in traditional EAs.

Let us test this hypothesis on one more example with more genes. Imagine a function ManyGenes,⁴ shown below in Equation (2), with L = 20 and defined over [0, 1]^L:

ManyGenes(x) = ∑_{i=1}^{L} (2x_i)^i .   (2)

⁴ This function's name in [7] is IM1.
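A one-line transcription of Equation (2), useful for checking the scale of the fitness values in Figs. 10 and 11 (again a sketch with names of my choosing):

```python
def many_genes(x):
    """ManyGenes(x) = sum over i of (2*x_i)^i, with x_i in [0, 1] and i = 1..L.
    For L = 20 the optimum sum(2**i for i in 1..20) is about 2.1e6, and the
    highest-indexed genes dominate once their values exceed 0.5 (cf. Fig. 9)."""
    return sum((2.0 * xi) ** i for i, xi in enumerate(x, start=1))
```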
Fig. 12. Trajectory of the 1st vs. the 20th gene average values, EA-2, uniform crossover (1x100 single population and 10x10 IM)

Fig. 13. Trajectory of the 1st vs. 20th gene average values, EA-2, one-point crossover (1x100 single population and 10x10 IM)
Selected components of this function are shown in Fig. 9. The highest-index (i = L) gene influences fitness the most, provided that the average value of the corresponding gene (x_L) is already above 0.5, which is quickly reached. The shape of the ManyGenes function causes a very fast convergence of this gene. In a single-population EA, this behavior leaves the other genes less of a chance of being optimized properly. A 100-individual standard EA was compared with a 10 × 10 IM, using EA-2 and turning off mutation to stress the effect of losing alleles. With mutation, both algorithms would clearly obtain results very close to the optimum, since ManyGenes is
Fig. 14. Trajectories of the 15th vs. the 20th gene average values, EA-2, uniform crossover (1x100 single population and 10x10 IM)

Fig. 15. Trajectories of the 15th vs. the 20th gene average values, EA-2, one-point crossover (1x100 single population and 10x10 IM)
a unimodal function. Figs. 10 and 11 show that the IM performs better than a single-population EA, both for EA-1 and EA-2.⁵ The analysis of the average change of gene values explains why an IM achieves better results. The value of the first gene (least important) plotted against the value of the last, 20th gene (most important) is shown in Fig. 12. For a single population, the 20th gene is optimized first (because of its bigger impact on fitness), and only then does the 1st gene change its average value. It is, however, too late to optimize it well, because many alleles have been lost. This asymmetry occurs between any two genes in the function, but the farther apart the genes are, the more visible the difference becomes. On the other
α = 0.1, i = 10, policy = random, topology = full, recombination = uniform crossover, L = 20, no mutation.
hand, in the IM we can clearly see that the convergence of the 20th gene slows down, allowing stepwise optimization of the 1st one between migrations. Toward the ends of these intervals evolution stagnates, and after migrations additional optimization is possible for both genes. For the case of one-point crossover the effect of using IMs is particularly visible (see Fig. 13). This is because of the linkage-preserving property of one-point crossover; as a result, the hitch-hiking effect is stronger in a single-population EA. Migration intervals enforce some balance between genes: the stronger ones have to "wait" for the weaker ones. In Figs. 14 and 15, we see analogous plots for the 15th and 20th genes, where this effect is even more visible.
5 Compositional Evolution

In this section IMs are shown to support compositional evolution, which was defined by Watson as "evolutionary processes involving the combination of systems or subsystems of semi-independently preadapted genetic material" [10]. In fact, the existence of two levels of evolution, and of the oscillating convergence of genes, suggests that local evolution might be able to construct building blocks, and that the global-level evolution may be efficient at mixing them. Generally, functions whose solutions are built from multiple sub-solutions should benefit from using IMs.

The examples in this section are based on the H-IFF function, which is the original function used by Watson to test compositional evolution. The domain of this function is binary. The evaluation procedure first splits the genome conceptually into pairs of genes and gives credit whenever the two genes in a pair are the same (either 00 or 11). Then four-tuples consisting of two pairs each are checked for having the same allele (either 0000 or 1111). This checking is repeated with longer and longer modules, until the whole genome (presumably of length being a power of 2) is tested. At each next level there is half the number of available blocks, but their weight is doubled, so each level makes the same contribution to the overall fitness. This is done on purpose, to balance the importance of intra-module and inter-module relations. As shown by Watson, the H-IFF can only be effectively solved by EAs using composition (recombination).

In Fig. 16, performance results of IM, single-population and isolated setups are compared on the H-IFF function.⁶ One-point recombination was chosen because in the H-IFF function close bits may belong to the same module. The average fitness obtained for an IM is significantly higher than the fitness values obtained for both the panmictic and isolated models. Because of the special construction of the H-IFF function, these results confirm that IMs support compositional evolution.

In order for inter-island evolution to work, locals and migrants must be effectively mixed. Assumptions about the survivability of blocks consisting of both locals and migrants can be confirmed directly by measuring this survivability, as described below. In each run, towards the end of migration intervals islands converge either back to locals, to migrants, or to some hybrids. For each island and each migration, the change in
α = 0.1, i = 10, policy = random, topology = full, EA-1, elitism, L = 64, two-point crossover rate 0.7, mutation rate = 0.015 ≈ 1/L.
Fig. 16. Island model obtains much better result than both single-population and isolated multipopulation models, on the H-IFF function (fitness over generations for 1x100 single population, 10x10 isolated, and 10x10 island model)

Fig. 17. Rejecting migrants with uniform crossover (between-migration Δf of migrants, locals and hybrids over generations)
fitness in the whole interval was credited to the final individual type. Results were averaged over multiple runs. In each migration interval, the change in fitness is measured with regard to its value at the beginning of the interval, so even when fitness grows continually, this is reflected as repeated periods of fitness increase from zero (or a value immediately following migration). The above procedure tells what type of internal behavior is responsible for fitness increase. If we see high fitness change for hybrids, then we know that mixing locals and migrants (or inter-island “recombination”) is the source of fitness increase. A high value of fitness change for migrants means that they are dominating and increasing average fitness. If we see a high value of fitness change for locals, then it means that rejecting migrants is the best behavior (fitness may increase as long as locals are able to maintain
Fig. 18. Accepting migrants with one-point crossover (between-migration Δf of migrants, locals and hybrids over generations)
evolvability without the help of migration). We see these measures in Fig. 17 for a uniform crossover, and in Fig. 18 for one-point crossover. There is a subtle difference between the two cases. If we look at the curves corresponding to hybrids, we see that for the one-point crossover this curve has the highest values most of the time, which means that this recombination operator successfully supports mixing locals and migrants. On the other hand, with uniform crossover, the curve stays closer to zero, and the cases where locals “regain” the island are characterized by a drop in fitness apparently due to unsuccessful mixing and the loss of good alleles.
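Returning to the H-IFF definition given at the start of this section, the recursive sketch below follows Watson's standard formulation: a block of length 2^k scores its length when all of its bits agree, and the nesting of blocks gives every level the same maximum contribution. Crediting length-1 blocks as well (so the maximum for L = 64 is 448) is part of that standard formulation rather than something stated explicitly in the description above.

```python
def hiff(bits):
    """Hierarchical If-and-Only-If: score a block by its length if all bits agree,
    then recurse into its two halves (genome length assumed to be a power of 2)."""
    if len(bits) == 1:
        return 1
    half = len(bits) // 2
    block_credit = len(bits) if (all(bits) or not any(bits)) else 0
    return block_credit + hiff(bits[:half]) + hiff(bits[half:])

# both all-zeros and all-ones genomes are global optima: hiff([1] * 64) == hiff([0] * 64) == 448
```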
6 Potential Adaptation to Existing Linkage

A gradual, repeated fixation of genes in IMs may adapt fixations to good building blocks and potentially serve as a linkage learning process. Repeated migrations and the exchange of blocks of linked genes are very important for this process. If islands converge totally before migrations, migrant groups will be homogeneous and in practice each island can be treated as a single individual in an inter-island evolution. However, if migrations occur earlier, the variable speed of gene fixation, as seen in Section 4, creates a situation in which some genes are already fixed and some still have multiple alleles in a population. Only those genes that are not yet fixed are still being evolved, and the others are not changed by recombination since they are the same for all individuals in a population. We can treat the fixed genes as linked, forming a "building block". Such building blocks would not be specific to any single genome, but would rather result from comparing all individuals in a given island. Migrants carry the fixed genes, and migration should correspond to an exchange of such building blocks, their recombination and re-fixing. Repeated fixations of such BBs should improve their average fitness, as described in the next paragraph.

In a single population, once genes get fixed, they remain so, and we have no guarantee that the newly formed building block is indeed a good one. In fact, unless selection
is very weak and recombination is characterized by a high mixing level, some genes will get fixed to suboptimal alleles. On the other hand, in IMs new individuals are inserted into islands with each migration. If the previously fixed building block was on average producing mediocre solutions, there is a chance that the genes of better migrants will fill the island, and a new fixation will replace the previous one. The difference compared to standard EAs is that in IMs an already pre-evolved block of genes may replace other pre-evolved blocks of genes, which may be more probable than a new good building block emerging in one generation with the help of mutation and recombination. Different fixations are formed inside different islands, and they do not compete with each other until migrants are sent between islands. If the fixation in an island was good, migrants have a smaller chance to change the dominating individuals. As a result, with time, better fixations should survive. Such a process could be treated as an example of linkage learning.

Even if migrations introduce new alleles into the target island, a fixation occurs system-wide too. After some time, more and more of the migrants' genes will be identical to the genes of the local individuals. This will also cause the average number of fixed genes in each island to increase. The new, larger "building blocks" should be composed of good smaller building blocks fixed earlier. This behavior is similar to the SEAM algorithm [10]. When the number of islands, N, is larger at the cost of a smaller island size, M, the inter-island BB competition might play a bigger role. In the special case when M = 1, fixations inside islands are created immediately and may be quite random. Such individuals would correspond to fixations of length L, and instead of a gradual linkage learning we would just have a case of choosing the best one from a set of N random ones.

Single runs of panmictic and IM setups on the H-IFF function were compared using the previously mentioned parameters. In the next three figures, we see exemplary runs for setups with a uniform crossover. Fig. 19 shows average values of each locus in a
Fig. 19. An exemplary run with a panmictic setup, uniform crossover
Fig. 20. An exemplary run with an IM setup (all loci shown), uniform crossover
Fig. 21. An exemplary run with an IM setup (corresponding loci from all islands averaged), uniform crossover
panmictic population, throughout generations. Because genomes were of length 64, there are 64 such loci on the x-axis and 500 generations on the y-axis. The shade of each cell graphically represents the average value. We see that after the initial generations, fixation quickly stabilizes the genomes and practically no change occurs afterwards. In Fig. 20 we see an analogous exemplary plot for an IM setup. Note that each locus in each of the 10 islands is shown separately, and therefore there are 640 loci on the x-axis, and the picture is squeezed horizontally. One can immediately see that fixation
Fig. 22. An exemplary run with a panmictic setup, one-point crossover
Fig. 23. An exemplary run with an IM setup (all loci shown), one-point crossover
of genes, even if occurring, is repeatedly disturbed by migrations. In this way, different configurations are tested and, as we know from the fitness charts (e.g., in Fig. 16), such a strategy leads to a better performance. In Fig. 21 corresponding loci from all islands are averaged, resulting again in 64 loci on the x-axis. Under this visualization, the IM is treated as one big population, analogously to the panmictic case, to which it can then be easily compared. We can see again that in the IM case the fixation of genes takes much longer and (usually) leads to longer building blocks. Fig. 22 shows a panmictic case with the one-point crossover. It is difficult to see any significant qualitative difference in behavior compared to the uniform crossover. In
Fig. 24. An exemplary run with an IM setup (corresponding loci from all islands averaged), one-point crossover
Fig. 23 the IM case with the one-point crossover is shown. This time, one can notice that the change of the recombination operator caused more abrupt changes after migrations. As we know, this is because the one-point crossover better matches the linkage in the H-IFF function. As a result, it copies more meaningful blocks of genes and creates better hybrids after migrations. Again, changes to the fixations occur throughout the whole run. In Fig. 24 we see the values of each locus averaged over islands, as before. At this level, not only do we see that changes to fixations can occur for much longer in IMs than in a standard EA, but also that the gene values converge to them very gradually, thanks to the exchange of genetic material between islands. This gives time for the proper selection of gene configurations.
7 Conclusions

Due to the slower pace of the global-level evolution in IMs, allele diversity is maintained much longer among sub-populations. This diversity helps later in optimizing genes that got locally fixed, often to random values. Therefore, using IMs reverts some effects of hitch-hiking. Sub-populations converge locally and evolution stalls until the next migration. Migration intervals therefore enforce some balance between genes, because they synchronize the optimization of different genes. We may say that stronger genes have to "wait" for the weaker ones.

Another property of island models is their two-level evolution. We saw that, thanks to the interaction between islands, island models can better exchange building blocks. Still, for this to happen, a good recombination operator must be used. However, in contrast to a standard single-population EA, in IMs the recombination operator has much more time and opportunity to effectively mix building blocks.
Finally, the gradual and repeated fixation of genes in island models might be treated as a linkage learning process. In this process fixations in islands are treated as building blocks, because recombination preserves fixed genes in a population. With time, new fixations with a better fitness contribution would be "learnt", and the fixations would also ultimately grow in size. Although a connection between linkage problems and IMs has been shown, more questions are posed than answered. Some of these questions might be answered based on research in genetics. Although the role of gene fixation was partially explained by the Shifting Balance Theory [11] and by the two-level view of IMs, these do not completely explain the interaction between islands. A more thorough analysis of the relation between single migrations and linkage, possibly by extending the models used in [7], could help to better understand linkage in IMs.
References

1. De Jong, K.A.: An analysis of the behavior of a class of genetic adaptive systems. Ph.D. thesis, University of Michigan, Ann Arbor (1975)
2. De Jong, K.A.: Evolutionary Computation: A Unified Approach. MIT Press, Cambridge (2006)
3. Fernández, F., Tomassini, M., Vanneschi, L.: An empirical study of multipopulation genetic programming. Genetic Programming and Evolvable Machines 4(1), 21-51 (2003)
4. Goldberg, D., Korb, B., Deb, K.: Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems 3(5), 493-530 (1989)
5. Gustafson, S.M.: An analysis of diversity in genetic programming. Ph.D. thesis, The University of Nottingham (2004)
6. Harik, G.R.: Learning gene linkage to efficiently solve problems of bounded difficulty using genetic algorithms. Ph.D. thesis, University of Michigan, Ann Arbor, MI, USA (1997)
7. Skolicki, Z.: An analysis of island models in evolutionary computation. Ph.D. thesis, George Mason University (2007)
8. Skolicki, Z., De Jong, K.: The importance of a two-level perspective for island model design. In: Proceedings of the IEEE Congress on Evolutionary Computation (CEC 2007), pp. 4623-4630 (2007)
9. Toussaint, M., Igel, C.: Neutrality: A necessity for self-adaptation. In: Proceedings of the IEEE Congress on Evolutionary Computation (CEC 2002), pp. 1354-1359 (2002)
10. Watson, R.A.: Compositional Evolution: The Impact of Sex, Symbiosis and Modularity on the Gradualist Framework of Evolution. Vienna Series in Theoretical Biology. MIT Press, Cambridge (2006)
11. Wright, S.: The roles of mutation, inbreeding, crossbreeding and selection in evolution. In: Jones, D.F. (ed.) Proceedings of the Sixth International Congress of Genetics, Brooklyn Botanic Garden, pp. 356-366 (1932)
Real-Coded ECGA for Solving Decomposable Real-Valued Optimization Problems

Minqiang Li¹,², David E. Goldberg², Kumara Sastry², and Tian-Li Yu²

¹ School of Management, Tianjin University, 92 Weijin Road, Nankai District, Tianjin 300072, People's Republic of China
² Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign, 117 Transportation Building, 104 S. Mathews Avenue, Urbana, IL 61801, USA
Abstract. We present the real-coded extended compact genetic algorithm (rECGA) for solving decomposable real-valued optimization problems. Mutual information between real-valued variables is employed to measure variable interaction or dependency, and the variable clustering and aggregation algorithms are proposed to identify the substructures of a problem through partitioning variables. Then, mixture Gaussian probability density function is estimated to model the promising individuals for each substructure, and the sampling of multivariate Gaussian probability density function is carried out by adopting Cholesky decomposition. Facet analyses are made about the population sizing and sampling of the factorization based on the Gaussian probability density function. Finally, experiments on decomposable test functions are conducted. The results illustrate that the rECGA is able to correctly identify the substructure of decomposable problems with linear or nonlinear correlations, and achieves a good scalability. Keywords: real-coded ECGA, probabilistic model building, real-valued decomposable optimization problem, population sizing and sampling, genetic and evolutionary computation.
1 Introduction

In the field of genetic and evolutionary computation (GEC), there has been growing interest in using probabilistic distribution methods to explore the search space. The common feature of these kinds of evolutionary algorithms, called probabilistic model-building genetic algorithms (PMBGAs) or estimation of distribution algorithms (EDAs), is to build probabilistic models of the promising solutions found so far in the population, and to generate new individuals by sampling the estimated distribution functions instead of conducting recombination and mutation operations (Pelikan et al. 2002; Goldberg 2002; Larranaga and Lozano 2002). Generally, PMBGAs consist of three major steps:

• Promising individuals are selected from the initial or current population.
• A probability distribution factorization model is estimated based on the selected solutions.
• Offspring are generated by sampling the estimated model, and a new population is created by using some kind of replacement policy.

Most of the PMBGAs were developed to solve binary optimization problems and have achieved successful results. Only a few algorithms were proposed to deal with
real-valued optimization problems with continuous variables, such as the estimation of Gaussian network algorithms (EGNA) (Larranaga and Lozano 2002), the mixed iterated density estimation algorithms (mIDEAs) (Bosman 2003), the real-coded EDA (RECEDA) (Paul and Iba 2003), and the real-coded Bayesian optimization algorithms (rBOA) (Ahn et al. 2004). There were also continuous optimization versions of the univariate marginal distribution algorithms (UMDA) (Mühlenbein 1997), the population based incremental learning (PBIL) (Baluja 1994), the mutual information maximization for input clustering (MIMIC) (De Bonet et al. 1997), the bivariate marginal distribution algorithms (BMDA) (Pelikan and Mühlenbein 1999), etc.

When PMBGAs are adopted to solve decomposable real-valued optimization problems, one issue arises: the structure and the parameters of the probability distribution model have to be estimated simultaneously when mixture Gaussian probability density functions are employed, so as to minimize the scoring metric, the most commonly adopted one being the Bayesian information criterion (BIC). This task is an NP-complete problem (Pelikan et al. 1998; Pelikan 2002). This issue does not exist in the binary ECGA and is unique to real-valued optimization tasks when the ECGA is employed. What we propose is to utilize a two-step procedure of building the factorization models to address this issue, by employing a mutual information criterion and heuristic decomposing methods. The proposed algorithm for solving decomposable optimization problems works in a similar logic as the ECGA for decomposable binary optimization problems, and is called the real-coded ECGA (rECGA).

The remainder of this chapter is organized as follows. In Section 2, we review the major algorithms among the real-coded PMBGAs and point out the critical issues associated with their application. In Section 3, we introduce the two-step factorization procedure that decomposes the optimization function by using a mutual information metric. Clustering methods are used to group the sample data into multiple subsets for each subproblem, and specific procedures are designed to estimate the parameters of the mixture Gaussian density functions in Section 4. Experiments on test functions are then conducted and the results are reported in Section 5. Section 6 summarizes the paper, and topics for further research are discussed. In the appendix, we make facet analyses of the population sizing and sampling of the factorization based on the Gaussian probability density function with specific functions.
2 Review of Related Work

For a global real-valued optimization function opt(X), X ∈ Ω ⊆ R^n, X = (X_1, X_2, …, X_n) (X_i ⊂ [a_i, b_i], a_i < b_i, i = 1, 2, …, n) denotes the multi-dimensional continuous random variables, and x = (x_1, x_2, …, x_n) represents one of its possible instances, or a solution. A joint probability density function is estimated over the selected individuals by a probabilistic model M(ς, θ), where ς is the structure of the model that describes the factorization of the probability density function (pdf) and θ is the parameter vector for all components of the joint pdf. In general, a joint pdf f(X) is factorized as (Larranaga and Lozano 2002):

f_{(ς,θ)}(X) = ∏_{i=1}^{n} f(X_i | Π_i, θ_i) ,   (2-1)

where Π_i is the set of variables that X_i depends on, and θ_i is the parameter vector for the conditional pdf of X_i. Given a set of data samples S = {x_1, x_2, …, x_q}, the Bayesian information criterion (BIC) is used to score the fitness of the joint pdf model:

BIC(f_{(ς,θ)}(X), S) = −∑_{j=1}^{q} ln(f_{(ς,θ)}(x_j)) + λ · ln(q) · |θ| ,   (2-2)
where the first part is the likelihood metric of the model and λ is the penalty on the model complexity.

It is assumed that the variable vector X = (X_1, X_2, …, X_n) of a problem can be partitioned into m sub-vectors X^1, X^2, …, X^m; then we say that the problem, or total structure, can be decomposed into a set of subproblems or substructures. Suppose that set(X) = {X_1, X_2, …, X_n} indicates the set of all variables; a partition is represented as set(X^1), set(X^2), …, set(X^m), with set(X) = ∪_{k=1}^{m} set(X^k) and set(X^{k_1}) ∩ set(X^{k_2}) = ∅ (k_1 ≠ k_2); X^k = (X^k_1, X^k_2, …, X^k_{m_k}), set(X^k) = {X^k_1, X^k_2, …, X^k_{m_k}}, and ∑_{k=1}^{m} m_k = n, k = 1, 2, …, m. Since the real distribution function of f(X_i, Π_i | θ_i) may be multimodal or nonlinear, a single Gaussian pdf is not able to capture the data feature for a substructure. So, a mixture Gaussian pdf should be adopted, and the factorization becomes a joint pdf of the form:

f_{(ς,θ)}(X) = ∏_{k=1}^{m} ∏_{i=1}^{m_k} ∑_{j=1}^{c_{k,i}} β_{k,i,j} f(X^k_i | Π_{k,i}, θ_{k,i,j}) ,   (2-3)

where β_{k,i,j} is the weight for the mixture Gaussian pdf of the variable X^k_i, and ∑_{j=1}^{c_{k,i}} β_{k,i,j} = 1; c_{k,i} is the number of clusters of data samples regarding X^k_i and its parental variables, and is also the number of Gaussian pdfs used for estimating f(X^k_i | Π_{k,i}, θ_{k,i,j}).

The rBOA (Ahn et al. 2004) used a factorization with mixture Gaussian pdfs. It adopted an identical number of mixture Gaussian pdfs for a subproblem. An incremental
M. Li et al.
greedy algorithm was used to find the optimal acyclic Bayesian network by starting from an empty one based on the BIC metric. Then maximally connected sub-graphs, or building-blocks with tight interaction between variables, were extracted by graph partition algorithms. The mIDEAs tried a different way to do factorization (Bosman 2003). It divided the problem in solution space instead of decomposing the objective function. The sample data of the selected individuals were clustered into multiple groups by leader algorithm, and the joint pdf for each group was modeled with a factorized Gaussian pdf. The mIDEAs resorted to the clustering procedure to break the nonlinearity or interaction of variables. Since it did not consider the decomposability of a high dimensional objective function, a huge population and numerous clusters are often required to do effective factorization. Moreover, it could not realize the probabilistic crossover of building-blocks (Ahn et al. 2004) because there was no cross breeding between different clusters. The RECEDA by Paul and Iba (2003) employed a multivariate Gaussian probability density function to model the selected individuals, which was quite similar to the estimation of the multivariate normal algorithms (EMNAs) (Larranaga and Lozano 2002). The RECEDA and EMNAs were efficient for unimodal or macroly unimodal optimization functions of low dimension with only linear dependency between variables. For complicated high dimensional optimization functions with multimodality and nonlinear interaction, they usually failed to find the global optimum. Therefore, the correct factorization of a real-valued optimization function is critical to the design of an effective algorithm. But for complicated and large scale realvalued optimization problems, we need an efficient way to do the factorization. Although the Bayesian network based on mixture Gaussian pdf is able to model various types of dependences and interactions between real-valued variables, it is pretty difficult to find the optimal structure and parameters. We have to minimize the BIC over m , ck ,i , β k ,i , j concurrently in formula (2-3) if we want to get the optimal factorization of the joint pdf, and the computation would be overwhelming on high dimensional optimization problems. Actually, what we find by heuristic methods is an approximate solution when we use the Bayesian network to model the problems structure in reality. So, we propose to divide the whole factorization procedure into two steps, and make use of proper metric measures instead of the BIC. The first step attempts to find the substructures of the problem through learning the linkage information or interaction between variables by using regular measures as mutual information, and the global structure of the problem is then decomposed into multiple substructures by partitioning the real-valued variables into multiple groups with regard to an objective function. The second step clusters the selected individuals into multiple exclusive groups regarding specific variable partitions or substructures, and then builds a mixture multivariate Gaussian pdf for each substructure. Thus, we can build a joint pdf with mixture Gaussian pdf which is a good factorization of the joint pdf for a decomposable real-valued optimization problem as: m
ck
f (ς ,θ ) ( X ) = ∏∑ β k , j f ( X k , θk , j ) , k =1 j =1
(2-4)
Real-Coded ECGA for Solving Decomposable Real-Valued Optimization Problems
65
which is a simplified form of the (2-3) for decomposable real-valued optimization problems. θk , j is the parameter for the j-th Gaussian pdf of the variable partition
X k in the mixture model. 3 Structure Decomposition of Real-Valued Optimization Function The structure decomposition of a real-valued optimization function is similar to the binary ECGA which uses the product of marginal distribution to model the joint pdf, where binary variables were partitioned into multiple clusters by a greedy search algorithm (Harik 1999). As to the global real-valued optimization problems, there are a lot of approaches to measure the dependency of real-valued variables in statistics and information theory (MacKay 2003). Tsutsui et al. (2001) adopted the piecewise interval correlation coefficient to measure the nonlinear correlation of variables, but it required the correct splitting of the solution space. We use the mutual information to measure the dependency among real-valued variables. The basic procedure on the partition of real-valued variables based on mutual information is presented as below. 3.1 Mutual Information of Real-Valued Variables In order to use mutual information to measure the dependency of continuous variables based on the selected individuals, we need to discretize the real-valued variables in advance. Suppose that the definition domain of a real-valued variable X∈[a, b] is diL
vided into
L partitions:
∪[a , b ] = [a, b] ( a l
l
l
< bl ), i = 1,2,…, n . The mutual in-
l =1
formation between two random variables X i , X j is defined as (MacKay 2003):
I ( Xi; X j ) = H ( Xi ) − H ( Xi | X j ) , where
(3-1)
H ( X i ) is the entropy of X i , H ( X i | X j ) is the conditional entropy of X i
given X j , and I ( X i ; X j ) = I ( X j ; X i ) . It measures the average reduction in uncertainty about
X i that results from learning the value of X j , that is the average
amount of information that
X i conveys about X j ; or vice versa (MacKay 2003).
By the definition of entropy, we get:
I ( Xi; X j ) =
P( xi , x j )
∑ P( x , x ) log P( x ) P( x ) , i
xi , x j
j
i
j
(3-2)
66
M. Li et al.
where I ( X i ; X j ) ≥ 0 , and I ( X i ; X j ) = 0 only when X i , X j are independent. Usually, we scale the mutual information by I ( X i ; X j ) / log( L ) , so that
I ( X i ; X i ) / log( L) ≤ 1 . Thus, we get the mutual information matrix for the continuous vector
X = ( X 1 , X 2 , …, X n )T of a real-valued optimization function: I ( X 1 , X 2 , … , X n ) = [I ( X i ; X j ) ] n ×n ,
(3-3)
which is a symmetrical matrix. 3.2 Partition of Real-Valued Variables The partition of variables is a typical unsupervised learning task in machine learning, and there are lots of algorithms to complete this task. For binary optimization problems, the minimum description length (MDL) was computed as the clustering metric to measure the partition of genes, and evolutionary algorithms or heuristic methods were adopted to search for the best partition by minimizing the metric (Yu et al. 2003). Theoretically, unsupervised machine learning tasks require fixing specific parameters a priori for a problem to yield rational and acceptable results. So, we design a heuristic greedy search algorithm to cluster variables based on their mutual information, which can easily incorporate the expertise of human experts interactively. Initially, we take the interaction between all variables as insignificant or all variables as independent if max I ( X i ; X j ) < α , where α is the threshold i , j =1, 2 ,..., n ;i ≠ j
{
}
parameter for the lowest bound of mutual information, and α = 0 .1 as default. Otherwise, a clustering procedure is implemented to partition the variables into multiple subsets. The major steps of the clustering algorithm is described as follows. Step 1: Sort the upper triangular entries of the mutual information matrix I ( X 1 , X 2 , … , X n ) into an entry list of mutual information of pair-wise
{
}
variables: I sort = I ( X i ; X j )
n ( n −1)/ 2
, where we do not consider diagonal
entries. Step 2: All entries are fetched in descending order. Pick the biggest one I ( X i ; X j ) ,
{ }
and place the two variables into cluster 1: C1 = X i1 2 . Step 3: Pick the next entry I ( X i ; X j ) in
{ }
K clusters: C1 = X i1
c1
I sort ; and suppose that currently we have
{ }
, C 2 = X i2
c2
{ }
, … , C K = X iK
ck
, where
ck ( k = 1,2,…, K ) is the number of variables in cluster k , then we proceed by:
Real-Coded ECGA for Solving Decomposable Real-Valued Optimization Problems
67
(1) if one of the pair-wise variables in I ( X i ; X j ) has been clustered:
X i or X j ∈ C k
,
the
other
is
placed
in
the
Ck ← Ck ∪ { X i , X j } ;
same
cluster:
{ }.
(2) otherwise, a new cluster is created: K ← K + 1 , C K = X iK
2
Step 4: Go to step 3, and repeat until all entries in
I sort are tested, and all variables
are clustered. Step 5: Output clusters: C1 = X i1
, … , C K = X iK
{ } , C = {X }
2 i c2
2
c1
{ }
cK
.
This algorithm usually outputs a lot of small clusters that have only a few variables. The partition is efficient for evidently decomposable optimization problems if the dependency between some variables is much stronger than that of others. But it would break the dependency between some variables when there is not a big difference among the mutual information of highly correlated ones. So, we suggest an aggregation algorithm to merge some clusters, which is implemented together with the clustering algorithm as described above. The major steps of the aggregation algorithm is described as follows.
{ } , C = {X }
Step 1: For clusters C1 = X i1
2
c1
2 i c2
{ }
, … , C K = X iK
cK
, the averaged
mutual information is calculated: ck 2 I (C k ) = I ( X ik ; X kj ) . ∑ ck ( ck − 1) i =1, j >i
Step 2: Find the mutual correlation between any pair of two clusters by:
I (C k1 , C k 2 ) = min {I ( X ik1 ; X kj 2 ) | X ik1 ∈ C k1 , X ik 2 ∈ C k2 }.
Step 3: Sort the mutual correlation in descending order into a list:
I clusters, sort = {I (C k1 , C k2 )}
K ( K − 1)/ 2
.
Step 4: Select the first mutual correlation in I clusters ,sort :
{
}
If I (Ck1 , Ck2 ) > βmerge × max I (Ck1 ), I (Ck2 ) , then merge clusters Ck1 ,Ck2 ,
β merge ( 0 < β merge < 1 ) is a threshold parameter, and β merge = 0.80 for default. The smaller the β merge , the fewer the clusters yielded. where
Step 5: Go to step 4, and repeat until all of the entries in I clusters ,sort are tested, or all clusters pairs are checked.
{ }
{ }
Step 6: Output clusters: C1 = Xi1 c , C2 = X i2 1
c2
{ }
, … , CK ' = XiK′
cK′
,
K′ ≤ K .
In real-world applications, the aggregation procedure can be repeated multiple times until there is no chance of merging any pair of clusters.
68
M. Li et al.
With the clustering algorithm and aggregation algorithm (called the partition algorithm of real-valued variables), the total real-valued variables are partitioned into exclusive clusters, so that the structure of the optimization problem is decomposed into multiple substructures. The final output of the partition algorithm is similar to the results of the rBOA (Ahn et al. 2004), but the procedure is much more efficient. It is fitted particularly to the case where the optimization function is decomposable, especially when there is significant difference between the mutual information values of closely correlated variables and loosely correlated variables. Although it employs a heuristic method to do partition, the approximate factorization is sufficiently good in practice.
4 Estimation and Sampling of Mixture Gaussian Pdf After the structure of an optimization function is decomposed, we proceed to build the mixture Gaussian pdf for each substructure. The identical sample data of the selected individuals are shared, only different coordinate values are used in estimating parameters for a substructure. Let’s consider the substructure k
X k ( k =1,2,…,m), | set(X k ) |= mk . Suppose that k
entries of X are nonlinearly correlated, and the function regarding X is multimodal. A single Gaussian pdf, a unimodal function to model linear correlation, is not able to model the samples. There are a lot of methods for complicated data clustering tasks (Han and Kamber 2006). The mIDEAs used the leader algorithm (Bosman 2003) which was fast but preferred yielding fewer clusters or nonlinear clusters. We propose to incorporate the density-based clustering method (DBSCAN) (Ester et al. 1996) and k-means clustering method to partition the sample data into multiple groups. The DBSCAN is used firstly to divide the data samples into multiple clusters that may probably be nonlinearly separable, and get the initial cluster number which is represented as ck ,min . Then we choose a cluster number for linearly separable samples, and fit a Gaussian pdf to each cluster of data samples. Empirically, we take ck ∈[ck ,min, ck ,max] ,
ck ,max = ck ,min × βcluster(1 + ⎣ mk ⎦) , where βcluster ≥ 2 is a regulating parameter that
controls the number of clusters to be created. Finally, the k-means algorithm divides the data samples into ck groups. Given a set of samples (l
S = { x1 , x2 , … , xq } , xl = ( xl1 , xl 2 ,…, xl n )
= 1,2,…, q ), we project it on X k , and thus get: S k = {x1k , x2k ,…, xqk } ,
xlk = ( xl1 , xl 2 ,…, xlmk ) . In the rECGA, q = τN , N denotes the population size, and τ ( τ = 0.5 ) indicates the proportion of promising individuals selected for building probabilistic models.
Real-Coded ECGA for Solving Decomposable Real-Valued Optimization Problems
Suppose that we have clustered samples into partition:
ck groups regarding the k-th variable
S1k = {x1k , x2k ,…, xqkk } , S2k = {x1k , x2k ,…, xqkk } , …, Sckk = {x1k , x2k ,…, xqkk } , 1
ck
∑q
k j
69
2
ck
=q.
j =1
For a sample cluster
S kj = {x1k , x2k ,…, xqkk } ( j = 1,2,…, ck ), the mean vector j
and
covariance
matrix
μkj = ( μkj,1, μkj,2 ,…, μkj,mk )T 1 μ = k qj k j ,i
q kj
∑x
k li
for
multivariate
[
Gaussian
Σ kj = cov j ( X ik1 , X ik2 )
,
]
pdf
are
estimated
,
mk ×mk
as:
where
i = 1,2,…, mk , covj ( Xik1 , Xik2 ) is the covariance of variables
,
l =1
X , X in partition k over sample cluster S kj ( j = 1,2,…, ck ). k i1
k i2
Thus, we get the factorization with mixture Gaussian pdf for a decomposable realvalued optimization problem: ck
m
f (ς ,θ ) ( X ) = ∏ ∑ β k , j × f ( X k | μkj , Σ kj ) ,
(4-1)
k =1 j =1
where
β k , j = qkj q .
The multivariate Gaussian pdf in a substructure can be sampled by the Cholesky decomposition and algebra transformation (Anderson 2003). Since the covariance matrix
Σ = [cov( X i, X j)] r×r ( r = q kj ) of a multivariate Gaussian pdf is symmetric
and positive definite, it can be decomposed into a lower and an upper triangular matrix:
⎡ a11 ⎢ a21 T ∑ = L ×L , ∑ = ⎢ ⎢ ⎢ ar1 ⎣ where lii = aii −
i −1
∑l
2 ik
a12 a22 ar 2
, l ji = (a ji −
i −1
∑l
…
⎡ l11 ⎤ ⎢ l21 … a2 r ⎥ ⎥ ,L = ⎢ 0⎥ ⎢ ⎢l ⎥ … arr ⎦ ⎣ r1 a1r
0
…
l22
…
0⎤ 0⎥
⎥,
0⎥
lr 2
…
lrr ⎥
⎦
l ) lii ; i = 1, 2, …, r; j = 1, 2, …, i .
jk ik
k =1
k =1
The sample data of a random vector variable Z = ( z1, z2 , … , zn ) that obeys standard
multivariate
Gaussian
distribution,
Z ~ N (0 , 1) ,
S = {z1 , z2 ,…, zs } , zi = ( zi1 , zi 2 ,… , zir ) , i = 1,2,…, s . Z
are
produced:
70
M. Li et al.
Then we get the samples S
X
= { x1 , x2 ,…, xs } of the random vector variable
X that obeys the linearly correlated multivariate Gaussian distribution X ~ N ( μ, Σ ) by transformation: xi = μ + Lzi′ , i = 1,2,…, s , where
(4-2)
xi is an instance of the random multidimensional variable X . For mixture
Gaussian pdf in the k-th variable partition, the prior probability for choosing a Gaussian pdf of the j-th cluster is decided by βk , j . Thus, the offspring can be generated from the joint mixture Gaussian pdf based on formula (4-1), and the dependencies between variables that have been modeled are inherited to the newly produced individuals. The most important is that multivariate building-blocks are preserved and mixed in the sampling of offspring, so there is a good probability to create higher-order building-blocks and to approach asymptotically the global optimum of a real-valued optimization problem.
5 Experiments 5.1 Test Functions
We mainly consider the functions without dependency or with dependency between only two variables, and test functions are designed as follows. (1) Multivariate real-valued deceptive function (MRDFi) This function is defined with no dependency between variables: n
max f MRDFi ( x ) = ∑ fURDF ( xi ) , i =1
f URDF
⎧ a (c − x ) ⎪⎪ c (x) = ⎨ − c) ( b x ⎪ ⎪⎩ 1 − c
xi ∈ [0,1] , i = 1,2,…, n , 0≤ x < c
,
(5-1)
x ∈ [0,1] ,
c≤ x ≤1
where {a , b, c} controls the shape of the univariate real-valued deceptive function (URDF), and a< b , 0.5 x3* | μˆ,σˆ ) for a univariate function of the MRDFi is approximated with 0.33< μˆ < 0.50, 0.30 x3* | μˆ B (1), σˆ B (1)) ,
(A4-1)
where pB (1) represents the prior probability of sampling offspring that belong to the cluster B based on the mixture Gaussian model in the first generation. The sampling of the rECGA with mixture Gaussian pdf model on the URDF is a discrete dynamic process that is modeled by:
p_sampling(t) = p_B(t) × p( x > x_3* | μ̂_B(t), σ̂_B(t) ) ,   (A4-2)

where t denotes the t-th generation; p_B(t), μ̂_B(t), σ̂_B(t) are estimated on the selected individuals of P(t − 1).
Second, we turn to the MRDFi. The sampling probability for all sub-functions of the MRDFi to have x_i ∈ (x_3*, 1] (i = 1, 2, …, n) in the sampling of the t-th generation is estimated as:

p_sampling(n, t) = p( ∧_{i=1,2,…,n} x_i ∈ (x_3*, 1] ) = (p_sampling(t))^n .   (A4-3)
The probability of sampling the global optimum in the t-th generation is:

p( ∧_{i=1,2,…,n} x_i ∈ (x_3*, 1], t ) = 1 − [ 1 − p_sampling(n, t) ]^{N/2} .   (A4-4)
Thus, the cumulative probability distribution function for successfully sampling the global optimum of the MRDFi over the whole evolution process is computed approximately as:

P_cdf( ∧_{i=1,2,…,n} x_i ∈ (x_3*, 1], t ) = 1 − (1 − p_initial) ∏_{τ=1}^{t} [ 1 − (p_sampling(τ))^n ]^{N/2} ,   (A4-5)

where p_initial = p( ∧_{i=1,2,…,n} x_i ∈ (x_3*, 1] ) = 1 − (1 − p(n))^N .
The formula (A4-5) is only descriptive, and it is not an easy task to estimate the exact model about the cumulative probability distribution function for a specific problem. Empirically, the rECGA with mixture Gaussian pdf achieves good performance on the MRDFi (see Fig.1(b)) because the probabilistic building-blocks recombination based on the estimated joint distribution of the selected individuals works efficiently.
Linkage Learning Accuracy in the Bayesian Optimization Algorithm

Claudio F. Lima¹, Martin Pelikan², David E. Goldberg³, Fernando G. Lobo¹, Kumara Sastry³, and Mark Hauschild²

¹ University of Algarve, Portugal  [email protected], [email protected]
² University of Missouri at St. Louis, USA  [email protected], [email protected]
³ University of Illinois at Urbana-Champaign, USA  [email protected], [email protected]

Summary. The Bayesian optimization algorithm (BOA) uses Bayesian networks to learn linkages between the decision variables of an optimization problem. This chapter studies the influence of different selection and replacement methods on the accuracy of linkage learning in BOA. Results on concatenated m-k deceptive trap functions show that the model accuracy depends to a large extent on the choice of selection method and to a lesser extent on the replacement strategy used. Specifically, it is shown that linkage learning in BOA is more accurate with truncation selection than with tournament selection. The choice of replacement strategy is important when tournament selection is used, but it is not relevant when using truncation selection. On the other hand, if performance is our main concern, tournament selection and restricted tournament replacement should be preferred. Additionally, the learning procedure of Bayesian networks in BOA is investigated to clarify the difference observed between tournament and truncation selection in terms of model quality. It is shown that if the metric that scores candidate networks is changed to take into account the nature of tournament selection, the linkage learning accuracy with tournament selection improves dramatically.
1 Introduction

Unlike traditional evolutionary algorithms (EAs), the Bayesian optimization algorithm (BOA) (23, 21) replaces the standard crossover and mutation operators by building a probabilistic model of promising solutions and sampling from the corresponding probability distribution. This feature allows BOA and other advanced estimation of distribution algorithms (EDAs) (15, 24) to automatically identify the problem decomposition and important problem substructures, leading to superior performance for many problems when compared with EAs with fixed, problem-independent variation operators. Although the main feature of BOA and other EDAs is to perform efficient mixing of key substructures or building blocks (BBs), they also provide additional information about the problem being solved. The probabilistic model of the
population, which represents (in)dependencies among decision variables, is an important source of information that can be exploited to enhance the performance of EDAs even more, or to assist the user in a better interpretation and understanding of the underlying structure of the problem. Examples of using structural information from the probabilistic model for another purpose besides mixing are fitness estimation (31, 25, 30), induction of global neighborhoods for mutation operators (29, 17), hybridization and adaptive time continuation (18, 17), substructural niching (28), and offline (34) and online (35) population size adaptation.

In this chapter we analyze the structural accuracy of the probabilistic models in BOA and their ability to represent underlying problem substructures. In particular, we use concatenated deceptive trap functions (where the optimal model is known and accurate linkage learning is critical) to investigate the influence of different selection and replacement strategies on model structural accuracy. The initial results show that, as far as model quality is concerned, truncation selection should be preferred over tournament selection, and that the choice of replacement strategy matters when tournament selection is used. These results also show that if the objective is to obtain near-optimal solutions with high reliability using a minimal number of function evaluations, then tournament selection with restricted tournament replacement is the best strategy for BOA. Additionally, we investigate in detail the Bayesian network learning in BOA to understand the difference observed between tournament and truncation selection in terms of model quality. Selection is analyzed as the mating-pool distribution generator, which turns out to have a great impact on Bayesian network learning. In fact, if the metric that scores networks takes into account the natural distribution of tournament selection, the model quality can be highly improved and made comparable to that of truncation selection. This reasoning is confirmed through additional experiments.

The chapter is structured as follows. The next section gives an outline of BOA. Section 3 motivates the importance of the research topic addressed in this chapter and gives a short survey of related work. Section 4 introduces the experimental setup used for measuring the structural accuracy of the probabilistic models in BOA. Section 5 analyzes the influence of selection on structural linkage learning, while the next section deals with the influence of the replacement method. In Section 7 Bayesian network learning is investigated in some detail, and in the subsequent section a new scoring metric is employed to improve the model quality obtained by tournament selection. The chapter ends with a summary and major conclusions.
2 Bayesian Optimization Algorithm

Estimation of distribution algorithms (15, 24) replace the traditional variation operators of EAs by building and sampling a probabilistic model of promising solutions to generate the offspring population. The Bayesian optimization algorithm (23, 21) uses Bayesian networks as the probabilistic model to capture the (in)dependencies between the decision variables of the optimization problem.
BOA starts with an initial population that is usually generated at random. In each iteration, selection is performed to obtain a population of promising solutions. This population is then used to build the probabilistic model for the current generation. After the model structure is learned and its parameters estimated, the offspring population is generated by sampling from the distribution of modeled individuals. The new solutions are then incorporated into the original population by using any standard replacement method. The next iteration proceeds again from the selection phase until some stopping criterion is satisfied.

Bayesian networks (BNs) (20) are powerful graphical models that combine probability theory with graph theory to encode probabilistic relationships between variables of interest. A BN is defined by its structure and corresponding parameters. The structure is represented by a directed acyclic graph where the nodes correspond to the variables of the data to be modeled and the edges correspond to conditional dependencies. The parameters are represented by the conditional probabilities for each variable given any instance of the variables that this variable depends on. More formally, a Bayesian network encodes the following joint probability distribution,

p(X) = \prod_{i=1}^{\ell} p(X_i \mid \Pi_i),   (1)
where X = (X_1, X_2, ..., X_ℓ) is a vector with the variables of the problem, Π_i is the set of parents of X_i (nodes from which there exists an edge to X_i), and p(X_i | Π_i) is the conditional probability of X_i given its parents Π_i. The parameters of a Bayesian network can be represented by a set of conditional probability tables (CPTs) specifying the conditional probabilities for each variable given all possible instances of the parent variables Π_i. Alternatively, these conditional probabilities can be stored in the form of local structures such as decision trees or decision graphs, allowing a more efficient and flexible representation of local conditional distributions. Therefore, we use BNs with decision trees. The scoring metric used to quantify the quality of a given network is the Bayesian-Dirichlet (BD) metric (5, 14, 4), which is given by

BD(B) = p(B) \prod_{i=1}^{\ell} \prod_{l \in L_i} \frac{\Gamma(m'_i(l))}{\Gamma(m_i(l) + m'_i(l))} \prod_{x_i} \frac{\Gamma(m_i(x_i, l) + m'_i(x_i, l))}{\Gamma(m'_i(x_i, l))},   (2)
where p(B) is the prior probability of the network structure B, L_i is the set of leaves in the decision tree T_i (corresponding to X_i), m_i(l) is the number of instances in the population that contain the traversal path in T_i ending in leaf l, m_i(x_i, l) is the number of instances in the population that have X_i = x_i and contain the traversal path in T_i ending in leaf l, and m'_i(l) and m'_i(x_i, l) represent prior knowledge about the values of m_i(l) and m_i(x_i, l). Here, we consider the K2 variant of the BD metric, which uses an uninformative prior that assigns m'_i(x_i, l) = 1.
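To make the metric concrete, the term contributed by a single decision-tree leaf in Equation (2) under the K2 prior (m'_i(x_i, l) = 1 for every value x_i, hence m'_i(l) equal to the number of values of X_i) can be evaluated in log-space with the log-gamma function. The following is only a minimal sketch under that assumption; the function name and the list-of-counts representation are ours, not part of BOA's implementation.

```python
from math import lgamma

def k2_leaf_log_score(value_counts):
    """Log of the per-leaf K2 term of the BD metric.

    value_counts[x] = m_i(x_i, l): number of instances that reach leaf l
    with X_i equal to its x-th value. With the K2 prior, m'_i(x_i, l) = 1
    and m'_i(l) = len(value_counts).
    """
    m_l = sum(value_counts)          # m_i(l): instances reaching this leaf
    r = len(value_counts)            # number of values of X_i
    log_score = lgamma(r) - lgamma(m_l + r)   # Gamma(m'_i(l)) / Gamma(m_i(l) + m'_i(l))
    for m_xl in value_counts:
        log_score += lgamma(m_xl + 1)         # Gamma(m_i + 1) / Gamma(1); lgamma(1) = 0
    return log_score
```

Summing this quantity over all leaves of all trees (and adding the log-prior of the structure) gives the log of the K2 score of a candidate network.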
To favor simpler networks over more complex ones, the prior probability of each network p(B) can be adjusted according to its complexity, which is given by the description length of the parameters required by the network (4, 9). Based on this principle, Pelikan (21) proposed the following penalty term for BOA,

p(B) = 2^{-0.5 \log_2(n) \sum_{i=1}^{\ell} |L_i|},   (3)
where n is the population size. To learn the most adequate structure for the BN, a greedy algorithm is usually used as a good compromise between search efficiency and model quality. We consider a simple learning algorithm that starts with an empty network and at each step performs the operation that improves the metric the most, until no further improvement is possible. The operator considered is the split, which splits a leaf on some variable and creates two new children under that leaf. Each time a split on X_j takes place in tree T_i, an edge from X_j to X_i is added to the network. For more details on BNs with local structures the reader is referred elsewhere (4, 9, 21). The hierarchical BOA (hBOA) was later proposed by Pelikan and Goldberg (22, 21) and results from combining BNs with local structures with a simple yet powerful niching method to maintain diversity in the population, known as restricted tournament replacement (RTR) (12). hBOA is able to solve hierarchical decomposable problems, in which the variable interactions are present at more than a single level.
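To summarize how the pieces described in this section fit together, the following skeleton shows one way the BOA iteration (selection, model building, sampling, replacement) could be organized. It is only an illustrative sketch with the model routines abstracted as callables; none of the names below come from the original implementation.

```python
def boa(fitness, random_individual, select, learn_bn, sample_bn, replace,
        n=1000, max_generations=50):
    """High-level BOA loop: select promising solutions, learn a Bayesian
    network from them, sample offspring from it, and fold the offspring
    back into the population with some replacement method."""
    population = [random_individual() for _ in range(n)]
    for _ in range(max_generations):
        promising = select(population, fitness)        # e.g. tournament or truncation
        model = learn_bn(promising)                    # greedy, split-based learning
        offspring = [sample_bn(model) for _ in range(n)]
        population = replace(population, offspring, fitness)  # FR, ER or RTR
        if all(ind == population[0] for ind in population):    # simple convergence check
            break
    return max(population, key=fitness)
```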
3 Motivation and Related Work

While BOA is able to solve a broad class of nearly decomposable and hierarchical problems in a reliable and scalable manner, its probabilistic models oftentimes do not exactly reflect the problem structure. Because the probabilistic models are learned from a sample of limited size (the population of individuals), particular features of the specific sample are also encoded, and these act as noise when seeking generalization. This is a well-known problem in machine learning, known as overfitting. Analyzing the dependency groups captured by the Bayesian network with decision trees, it can be observed that while all important linkages are detected, spurious linkages are also incorporated into the model. By spurious linkage we mean additional variables that are considered together with a correct linkage group. While the structure of the BN captures such excessive complexity, the corresponding conditional probabilities nearly express independence between the spurious variables and the correct linkage, so these variables can still be sampled almost as if they were independent. Although the performance of BOA is not greatly affected by this kind of overfitting, several model-based efficiency enhancement techniques for EDAs (31, 25, 30, 29, 17, 18, 28, 34, 35) crucially rely on the structural accuracy of the probabilistic models. One such example is the exploration of substructural neighborhoods for local search in BOA (17). While significant speedups were obtained by
incorporating model-based local search, the scalability of this speedup decreased for larger problem sizes due to overly complex model structures learned in BOA. Therefore, it is important to understand under which conditions the structural accuracy of the probabilistic models in BOA and other multivariate EDAs can be maximized. So far, only a small number of studies have been done in this direction (26, 33, 6, 8, 13). In the remainder of this section we take a brief look at these works.

Santana, Larrañaga, and Lozano (26) analyzed the effect of selection on the emergence of certain dependencies in EDAs for random functions and showed that for these functions, independence relationships not represented by the function structure are likely to appear in the probabilistic model. Additionally, they proposed an EDA that only used a subset of the dependencies that exist in the data, which they called malign interactions. Some preliminary experiments showed that these approximations of the probabilistic model can in certain cases be applied to EDAs. Wu and Shapiro (33) investigated the presence of overfitting when learning the probabilistic models in BOA and its consequences in terms of overall performance when solving random 3-SAT problems. CPTs (to encode the conditional probabilities) and the corresponding BIC metric were used. The authors concluded that overfitting does take place and that there is some correlation between this phenomenon and performance. To reduce overfitting, they proposed using an early stopping criterion during the BN learning process, which gave some improvement in performance. The trade-off between model complexity and performance in BOA was also studied recently (6). Correa and Shapiro looked at the performance achieved by BOA as a function of a parameter that determines the maximum number of incoming edges for each node. This parameter puts a limit on the number of parents for each variable, simplifying the search procedure for a model structure. This parameter was found to have a strong effect on the performance of the algorithm, with a limited set of values for which the performance is maximized. These results were obtained using CPTs and the corresponding K2 metric. We should note that this parameter is in fact crucial if CPTs are used with the K2 metric; however, this is not the case for more sophisticated metrics that efficiently incorporate a complexity term to introduce pressure toward simpler models. This can be done better with the BIC metric for CPTs, or with the K2 metric in the case of decision trees (21). More recently, Echegoyen et al. (8) applied recent developments in exact BN learning to the EDA framework to analyze the consequent gains in optimization. While in terms of convergence time the gain from using exact BN learning was marginal, the models learned by EBNA were more closely related to the underlying structure of the problem. However, the computational cost of learning exact BNs is only manageable for relatively small problem sizes (experiments were made for a maximum problem size of 20). Equally recently, Hauschild et al. (13) analyzed the probabilistic models built by hBOA for two common test problems: concatenated trap functions and 2D
Ising spin glasses with periodic boundary conditions. The authors verified that the learned models closely correspond to the structure of the underlying problem. In their analysis, Hauschild et al. used truncation selection and restricted tournament replacement. In this chapter, we will show that the results from (13) do not carry over to other combinations of selection and replacement methods. Before presenting these results, we discuss the details of our empirical analysis.
4 Experimental Setup for Measuring Structural Accuracy of Probabilistic Models

This section details the experimental setup and measurements used to investigate the structural accuracy of the probabilistic models in BOA.

4.1 Test Problem and Experimental Setup
To investigate the structural accuracy of linkage learning in BOA, we focus on solving a problem of known structure, where it is clear which dependencies must be discovered (for successful tractability) and which dependencies are unnecessary (reducing the interpretability of the models). In this way, we can evaluate the quality of the model structure in correctly detecting both dependencies and independencies. The test problem considered is the m-k trap function, where m is the number of concatenated k-bit trap functions. Trap functions (1, 7) are relevant to test-problem design because they bound an important class of nearly decomposable problems (10). The trap function used (7) is defined as follows:

f_{trap}(u) = \begin{cases} k & \text{if } u = k \\ k - 1 - u & \text{otherwise} \end{cases}   (4)

where u is the number of ones in the string and k is the size of the trap function. Note that for k ≥ 3 the trap function is fully deceptive (7), which means that statistics of order lower than k will mislead the search away from the optimum. In this problem the accurate identification and exchange of the building blocks (BBs) is critical to achieve success, because processing substructures of lower order will lead to exponential scalability (32). Thus, all variables corresponding to each trap function form a linkage group or BB partition and should be treated together by the probabilistic model. Note that no information about the problem is given to the algorithm; therefore, it is equally difficult for BOA whether the correlated variables are closely or randomly distributed. A trap function with size k = 5 is used in our experiments.

A bisection method is used to determine the minimal population size required to solve the problem (27). For each experiment, 10 independent bisection runs are performed. Each bisection run searches for the minimal population size required to find the optimum in 10 out of 10 independent runs. Therefore, the results for the minimal sufficient population size are averaged over 10 bisection runs, while the results for the number of function evaluations used are averaged over 100 (10 × 10) independent runs.
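For reference, the concatenated trap function of Equation (4) is straightforward to implement. The sketch below uses consecutive k-bit groups, although, as noted above, the placement of the groups is irrelevant to BOA; the function names are ours.

```python
def trap(u, k=5):
    """Single deceptive trap over the number of ones u in a k-bit group."""
    return k if u == k else k - 1 - u

def concatenated_traps(bits, k=5):
    """m concatenated k-bit traps; each consecutive group of k bits is one BB."""
    assert len(bits) % k == 0
    return sum(trap(sum(bits[i:i + k]), k) for i in range(0, len(bits), k))

# Example: the global optimum is the all-ones string.
# concatenated_traps([1] * 120) == 24 * 5 == 120
```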
4.2 Measuring Structural Accuracy of Probabilistic Models
For accurate linkage learning in BOA, at least one of the variables of each trap subfunction should depend on all remaining k − 1 variables, so that all k corresponding variables can be processed together by the probabilistic model. For example, the dependency relation X_1 ← {X_2, X_3, X_4, X_5} encodes a linkage group between all variables of the first 5-bit trap subfunction. In addition, the remaining ℓ − k variables should not be part of that same dependency relation. If they are, the extra variables act as spurious linkage. For example, in the dependency relation X_1 ← {X_2, X_3, X_4, X_5, X_6, X_11}, the variables X_6 and X_11 are spuriously linked. In essence, the dependencies between the groups of k bits corresponding to each subfunction must be discovered, while the remaining dependencies should be avoided to maximize mixing and minimize model complexity. At each generation, four different measures are analyzed, taking into account only dependency groups of order k or higher:

- Proportion of BBs with correct linkage group: the proportion of BB partitions or subfunctions (out of m) that have a dependency group in the model that contains only the corresponding k variables.
- Proportion of BBs with spurious linkage group: the proportion of BB partitions or subfunctions (out of m) that have a dependency group in the model that contains the corresponding k variables plus some additional spuriously linked variables.
- Proportion of BBs with a linkage group: simply the sum of the two previous statistics. This measure is useful to confirm whether every BB partition or subfunction is represented in the model, be it with only correct or with additional spurious dependencies.
- Average size of spurious linkage: the average maximal number of spurious variables in the dependency relations that have spurious linkage (only relations of order greater than k are considered).
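Under one possible representation of the model's dependency relations (a set of variable-index groups, each containing a variable together with its parents), the measures above can be computed as in the sketch below. The data structures and names are ours, and the averaging of the spurious size follows one plausible reading of the definition above, not necessarily the exact procedure used in the experiments.

```python
def structural_accuracy(dependency_groups, bb_partitions, k=5):
    """dependency_groups: sets of variable indices ({X_i} plus its parents),
    one per dependency relation of order >= k found in the model.
    bb_partitions: the m sets of k indices defining the trap subfunctions."""
    m = len(bb_partitions)
    correct = spurious = 0
    for bb in bb_partitions:
        covering = [g for g in dependency_groups if bb <= g]
        if any(len(g) == k for g in covering):
            correct += 1      # some relation contains exactly the k variables
        elif covering:
            spurious += 1     # covered only together with extra variables
    oversized = [len(g) - k for g in dependency_groups if len(g) > k]
    return {
        "correct_bbs": correct / m,
        "spurious_bbs": spurious / m,
        "covered_bbs": (correct + spurious) / m,
        "avg_spurious_size": sum(oversized) / len(oversized) if oversized else 0.0,
    }
```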
5 Influence of the Selection Method

In this section, the influence of the selection method used in BOA, as well as the corresponding selection pressure, is investigated from the standpoint of the structural accuracy of the linkage learned by the probabilistic models. Specifically, we consider two widely used ordinal selection schemes: tournament and truncation selection. The particular choice of these two operators is due to the fact that they are the most frequently used in EDAs. Also, a previous comparison of several selection schemes (2) has shown that these two schemes differ significantly in relevant features such as selection variance and loss of diversity.
Table 1. Equivalent tournament size (s) and truncation threshold (τ) for the same selection intensity (I) (2)

I      s    τ
0.56   2    66%
0.84   3    47%
1.03   4    36%
In tournament selection (11, 3), s individuals are randomly picked from the population and the best one is selected for the mating pool. This process is repeated n times, where n is the population size. There are two popular variations of the method, with and without replacement. With replacement, the individuals are drawn from the population following a discrete uniform distribution. Without replacement, individuals are also drawn randomly from the population, but every individual from the population is guaranteed to participate in exactly s tournaments. While the expected outcome of both alternatives is the same, the latter is a less noisy process. Therefore, in this study we use tournament selection without replacement. In truncation selection (19) the best τ% individuals in the population are selected for the mating pool. This method is equivalent to the standard (μ, λ)-selection procedure used in evolution strategies (ESs), where τ = μ/λ × 100. Note that increasing the tournament size s, or decreasing the threshold τ, increases the selection pressure (the ratio of maximum to average fitness in the population), that is, the selection strength. For the purpose of studying the influence of different selection strategies, the replacement strategy is kept as simple as possible: the offspring fully replace the parent population.

We now turn to our head-to-head comparison between tournament and truncation selection with the structural accuracy of the probabilistic models in mind. In order to compare these two methods on a fair basis, different configurations for both methods with equivalent selection intensity are tested. The relation between selection intensity I, tournament size s, and truncation threshold τ is taken from (2) and is shown in Table 1.

Figures 1 and 2 show the linkage information captured by BOA with tournament and truncation selection, respectively. The test problem considered is a concatenated trap function with k = 5 and m = 24, giving a total string length of ℓ = 120. For tournament selection, the proportion of BBs that have a correct linkage group (with exactly all corresponding k variables) represented in the model is quite low. Although for s = 2 nearly half of the BBs are still covered, for s = 3 and s = 4 this value approaches zero. Nevertheless, the BBs that are not covered by correct linkage groups are covered by spurious linkage groups, which have additional spuriously linked variables (other than the corresponding k variables). This can be observed in Figure 1 (c), where from the initial generation until the end of the run all BBs are represented in the probabilistic model (whether with only correct or with additional spurious dependencies), a necessary condition for solving the problem.
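As a concrete reference for the two operators compared in this section, the sketch below implements tournament selection without replacement (s shuffled passes, so every individual enters exactly s tournaments when n is a multiple of s) and truncation selection. It is a simplified illustration, not the original code; fitness values are assumed to be stored in a list aligned with the population.

```python
import random

def tournament_without_replacement(population, fitness, s=2):
    """Mating pool of size n; each individual takes part in exactly s tournaments."""
    n = len(population)
    pool = []
    for _ in range(s):                              # one pass per tournament round
        order = random.sample(range(n), n)          # random permutation of indices
        for i in range(0, n, s):
            group = order[i:i + s]                  # one tournament of size s
            winner = max(group, key=lambda j: fitness[j])
            pool.append(population[winner])
    return pool[:n]

def truncation_selection(population, fitness, tau=0.5):
    """Keep the best tau fraction of the population, one copy each."""
    ranked = sorted(range(len(population)), key=lambda j: fitness[j], reverse=True)
    return [population[j] for j in ranked[:max(1, int(tau * len(population)))]]
```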
Fig. 1. Linkage group information captured by the probabilistic model of BOA along the run (generation t) for different tournament sizes, s = {2, 3, 4}, when solving m = 24 concatenated traps of order k = 5 (ℓ = 120). Tournament selection and full replacement are used. Panels: (a) correct linkage groups, (b) spurious linkage groups, (c) all linkage groups, (d) average size of spurious linkage.
In Figure 1 (d), a drastic difference between binary tournament and higher tournament sizes can be noted for the average size of spurious linkages. In fact, the number of spurious variables added to the dependency groups is so high that one might wonder how BOA can sample new solutions efficiently. Analyzing the models in more detail, it can be seen that while the learned structure is much more complex than the underlying structure of the trap subfunctions, the parameters of the model nearly express independence between spurious and correlated variables. The detection of these weak dependencies is in part due to random fluctuations in the population (a sample of limited size), which act as noise on the process of learning the real dependencies. Note that selection is performed at the individual level rather than at the substructural or BB level. Therefore, a top individual does not necessarily have mostly good substructures, which introduces some uncertainty into the BN learning process, more pronounced in the initial generations. As the run proceeds and this source of noise is reduced by the iterative select-model-sample process, the spurious linkage size decreases significantly.
Fig. 2. Linkage group information captured by the probabilistic model of BOA along the run (generation t) for different truncation thresholds, τ = {66%, 47%, 36%}, when solving m = 24 concatenated traps of order k = 5 (ℓ = 120). Truncation selection and full replacement are used. Panels: (a) correct linkage groups, (b) spurious linkage groups, (c) all linkage groups, (d) average size of spurious linkage.
For truncation selection the results are significantly better. With this selection method, BOA is able to represent almost 100% of the BBs with accurate linkage groups of order k = 5, while the size of the spurious linkage is practically insignificant. Also, note that increasing the selection pressure (reducing the truncation threshold) hardly affects the accuracy of the linkage information. The completely different behavior of these two selection schemes led us to look at their scalability in terms of the average size of spurious linkage, population size, and number of function evaluations, which is shown in Figures 3 and 4. The computational requirements for truncation are higher than for tournament selection by a significant but constant factor. Nevertheless, if we compare tournament selection with s = 2 and truncation selection with τ = 36%, the requirements for truncation become smaller, while in terms of structural accuracy truncation is still much better.
Fig. 3. Average size of the spurious linkage for different selection strategies (s = {2, 3, 4}, τ = {0.66, 0.47, 0.36}) when solving concatenated 5-bit trap functions of varying total string length ℓ. Full replacement is used.
Fig. 4. (a) Population size and (b) number of function evaluations required for different selection strategies when solving concatenated 5-bit trap functions of varying total string length ℓ. Full replacement is used. Although truncation selection requires larger population sizes, using tournament selection with population sizes equivalent to those required by truncation does not significantly improve the linkage information.
Further experiments (not plotted), in which tournament selection was tested with the same population size used by truncation at the same selection pressure, showed that tournament selection improves the linkage information only by a small factor and is still much worse than truncation. This suggests that the observed difference is not simply a matter of having a large enough population, but rather a consequence of the different way these selection operators work. A detailed comparison of several selection schemes (2) showed that
truncation and tournament selection are in fact quite different in terms of selection variance and loss of diversity. Truncation selection has a higher loss of diversity and lower selection variance (for the same selection intensity) than tournament selection. Although we might expect that a lower loss of diversity and a higher selection variance would be desirable to avoid premature convergence, from the standpoint of EDAs, where the probabilistic models are learned at every generation, a faster and clearer distinction between good individuals and merely above-average individuals reduces the noise faced by the BN learning process. There is also another important difference between these two operators. While in tournament selection the number of copies of an individual is proportional to its rank (the best individual gets exactly s copies, while other top-ranked individuals get, on average, close to s copies), in truncation no particular relevance is given to very good individuals, because every selected individual gets exactly one copy in the mating pool. This is an interesting characteristic from the standpoint of BN learning, as will be discussed further in Section 7.
6 Influence of the Replacement Method

In this section, we analyze the influence of the replacement method used in BOA on the accuracy of the probabilistic models versus the overall performance. Three different replacement strategies are considered: full replacement (FR), elitist replacement (ER), and restricted tournament replacement (RTR). In full replacement, the offspring population completely replaces the parent population at the end of each generation, so there is no overlap between these populations. In elitist replacement, a given proportion of the worst individuals of the parent population is replaced by new individuals. A typical strategy is to replace the worst 50% of the parent population by offspring, keeping the best 50% for the next generation. Finally, a niching method called restricted tournament replacement (RTR) (12, 21) is also tested. With RTR, each new solution X is incorporated into the original population using the following procedure (see the sketch below):

1. Select a random subset of individuals W with size w from the original population.
2. Let Y be the solution from W that is most similar to X (in terms of genotypic distance).
3. Replace Y with X if the latter is better; otherwise discard X.

The window size w is set to w = min{ℓ, n/20} (21), where ℓ is the problem size and n is the population size. Note that RTR is the replacement method used in hBOA.

Figure 5 shows the results obtained for the different replacement strategies, using binary tournament selection. It can be seen that for all replacement methods a significant proportion of the BBs are not well represented in the models learned by BOA.
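A minimal sketch of the RTR incorporation step described above follows, using Hamming distance as the genotypic similarity and drawing the window indices uniformly at random (with w = min{ℓ, n/20} chosen by the caller). It is an illustration of the three steps, not the hBOA source code.

```python
import random

def rtr_insert(population, fitnesses, x, fx, w):
    """Insert solution x (with fitness fx) using restricted tournament replacement."""
    hamming = lambda a, b: sum(ai != bi for ai, bi in zip(a, b))
    window = random.sample(range(len(population)), w)          # step 1: random subset W
    y = min(window, key=lambda j: hamming(population[j], x))   # step 2: most similar to x
    if fx > fitnesses[y]:                                      # step 3: replace if better
        population[y], fitnesses[y] = x, fx
    # otherwise x is simply discarded
```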
Fig. 5. Linkage group information captured by the probabilistic model of BOA along the run (generation t) for different replacement methods when solving m = 24 concatenated traps of order k = 5 (ℓ = 120). ER 50% stands for replacement of the worst 50% of the parents, FR for full replacement, and RTR for restricted tournament replacement. Binary tournament selection is used. Panels: (a) correct linkage groups, (b) spurious linkage groups, (c) all linkage groups, (d) average size of spurious linkage.
Although all BBs have a linkage group that relates the k variables of interest, most of them have spurious linkage. This is equally true for RTR and ER 50%, while for the FR method the captured linkage information is slightly more accurate. Figure 5 (d) shows the average size of spurious linkage, where RTR is clearly the worst option with respect to the structural accuracy of the probabilistic models, while ER 50% performs better than RTR but still worse than FR, for which the average size is relatively constant and never goes beyond two. Note that the replacement strategy does not have the same impact on the spurious linkage size as the tournament size does.

In Figures 6 and 7, the scalability of the replacement methods is depicted. Additional values for ER were tested to investigate the progressive influence of the proportion of elitism on the structural accuracy of the learned models. RTR is clearly the strategy that requires the smallest population size and the fewest evaluations, however at the cost of higher spurious linkage sizes than the remaining replacement methods. This is due to the niching capabilities of RTR. By preserving diversity in the population, BOA can solve the given problem with smaller population sizes and consequently with fewer evaluations.
Fig. 6. Average size of the spurious linkage for different replacement strategies (ER 50%, ER 75%, ER 90%, ER 95%, ER 99%, FR, RTR) when solving concatenated 5-bit trap functions with varying total string length ℓ. Binary tournament selection is used.
Fig. 7. (a) Population size and (b) number of function evaluations required for different replacement strategies when solving concatenated 5-bit trap functions with varying total string length ℓ. Binary tournament selection is used.
However, the quality of the model is not the best, because the drawback of using tournament selection is aggravated by the smaller population sizes and the increased diversity due to niching. For ER, as the elitist proportion is reduced, the structure captured by the models gradually improves, up to the case of FR, where the best result in terms of model accuracy is obtained. Note that for ER, where the best proportion of individuals is always kept in the population, the model is not required to be as accurate as in FR, where the sampled individuals fully replace the original population and the quality of the sampled solutions therefore has a stronger influence on the probability of finding the optimum.
While these results were obtained for binary tournament selection, additional experiments (not plotted) were performed with truncation selection. In that case, the replacement strategies were found not to have a significant impact on model accuracy, and all methods performed very similarly to truncation selection with FR (see Figures 2 and 3).
7 A Closer Look at Model Building in BOA

This section takes a closer look at Bayesian network learning in BOA to understand the difference in model quality observed between tournament and truncation selection. First, we observe the learning procedure in detail to investigate when correct and spurious dependencies are inserted into the network. Second, we demonstrate how the selection operator can lead to model overfitting and consequently to inaccurate linkage learning.

Figure 8 shows the metric gain obtained in model building for a single run of BOA. For each learning step (edge addition), the corresponding gain in the scoring metric is presented, as well as whether the inserted edge is a correct one (upper dots) or a spurious one (lower dots).
Fig. 8. Metric gain vs. learning step in Bayesian network learning with BOA. The problem is the 5-bit trap with ℓ = 50. The first, middle, and last generations of the run (gen. 1, gen. 6, gen. 11) are plotted. Upper dots represent correct edges and lower ones represent spurious dependencies. Binary tournament selection and full replacement are used.
Fig. 9. Distribution of the expected number of copies c_i in the mating pool after tournament selection with s = 2, 3, 4. The expected number of copies varies with rank, which is expressed in percentile.
The first, middle, and last generations of the run are plotted. One can easily see that the metric gain is higher at the beginning of the learning process and decreases towards zero, which is the threshold for accepting a modification of the Bayesian network. Additionally, the magnitude of the metric gain increases towards the end of the run. This is due to the fact that, as the search focuses on specific regions of the search space (loss of diversity), the marginal likelihood also increases. In fact, for the last generation the metric gain has a clear shape when compared to the first and middle generations. With respect to the correctness of the edges added to the network, it is clear that the majority of the spurious edges are inserted into the network rather at the end of the learning procedure, when the very last few correct edges are being inserted. This suggests that an early stop or a higher acceptance threshold in the learning procedure (note that spurious edges have a rather small metric gain) could avoid the acceptance of such spurious linkage.

In BOA, the selection operator can also be viewed as the generator of the data set used to learn the Bayesian network at each generation. Since in EDAs we are interested in modeling the set of promising solutions, the selection operator indicates which individuals have relevant features to be modeled and propagated in the solution set (population of individuals). The selection operator can explicitly assign several copies of the same individual to the mating pool, where the number of copies is roughly proportional to its fitness rank. This is the case for tournament, ranking, and proportional selection. While for truncation selection the expected number of copies of each selected individual is one, following a uniform distribution, in tournament selection the distribution of the expected number of copies c_i can be approximated by a power distribution with p.d.f.

f(x) = \alpha x^{\alpha - 1}, \quad 0 < x < 1, \quad \alpha = s,   (5)

where x is the relative rank of each individual, expressed in percentile. The best solution has rank 1, while the worst has rank 0. For example, the 123rd-best individual in a population of n = 1000 has a percentile rank of 0.877. Figure 9 shows the distribution of the expected number of copies in the mating pool after tournament selection. The difference between the two selection methods is significant. While tournament selection assigns increasing relevance to top-ranked individuals according to a power distribution, truncation selection gives no particular preference to any of the selected individuals, all of which have the same frequency in the learning data set. In tournament selection, the best solutions get approximately s copies in the mating pool, which induces the learned models to focus on particular features of these individuals, which contain good substructures but also misleading components due to stochastic noise.
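In code, the approximation of Equation (5) gives the expected number of copies as a simple function of rank; the one-line sketch below reproduces the qualitative behavior shown in Figure 9. The function itself is our illustration of the approximation, not taken from the chapter.

```python
def expected_copies(x, s):
    """Approximate expected number of mating-pool copies for an individual at
    percentile rank x (0 = worst, 1 = best) under tournament selection of size s."""
    return s * x ** (s - 1)

# The best individual gets about s copies (expected_copies(1.0, 3) == 3), while
# the average over the population is 1 (the integral of s*x**(s-1) over [0, 1]).
```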
8 Adaptive Scoring Metric

Another way to look at tournament selection, in comparison with truncation selection, as the mating-pool generator is to recognize that this selection procedure acts as a biased data resampling of a uniform data set. The uniform data set is the set of unique selected solutions (solutions that win at least one tournament), similar to what happens in truncation, while the resampling occurs when individuals win more than one tournament. This sort of resampling is clearly biased by fitness. To counterbalance this resampling bias towards top-ranked individuals, a modification of the scoring metric has been proposed (16). More precisely, the complexity penalty component of the K2 metric (Equation 3) has been changed to

p(B) = 2^{-0.5\, s \log_2(n) \sum_{i=1}^{\ell} |L_i|},   (6)

where the factor s (the tournament size) has been incorporated. In this way, the greater the bias from resampling top-ranked individuals, the more demanding the scoring metric is in accepting new dependencies. Note that this is equivalent to changing the acceptance threshold from zero to a higher value that depends on s.

Figure 10 shows the linkage group information captured by the Bayesian network in BOA when using the adaptive complexity penalty. The improvement is significant with respect to the previous results obtained for tournament selection. In fact, the results are slightly better than those obtained for truncation selection. This is somewhat expected, since resampling top-ranked individuals can lead to a clearer detection of important substructures, as long as the learning procedure is conservative enough to avoid changes in the network guided by minor improvements in the metric (which is achieved by the new scoring metric). The number of function evaluations required by tournament selection with the s-penalty is now closer to that required by truncation selection, but, as was observed for the latter (Figure 4), the scalability behavior is not affected.
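In implementation terms, the change from Equation (3) to Equation (6) amounts to multiplying the log-complexity penalty by the tournament size. The following minimal sketch is our own formulation of the two priors in log2 space, not code from (16).

```python
from math import log2

def log2_structure_prior(leaf_counts, n, s=1):
    """log2 p(B): the standard BOA penalty of Equation (3) for s = 1 and the
    s-adjusted penalty of Equation (6) for tournament size s > 1.

    leaf_counts: the number of leaves |L_i| of each decision tree T_i.
    n: population size."""
    return -0.5 * s * log2(n) * sum(leaf_counts)
```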
Fig. 10. Linkage group information captured by the probabilistic model of BOA with the adaptive scoring metric for different tournament sizes, s = {2, 3, 4}, when solving m = 24 concatenated traps of order k = 5 (ℓ = 120). Tournament selection and full replacement are used. Panels: (a) correct linkage groups, (b) spurious linkage groups, (c) all linkage groups, (d) average size of spurious linkage.
9 Summary and Conclusions

In this chapter we have analyzed the influence of selection and replacement strategies on the structural accuracy of linkage learning in BOA for concatenated m-k deceptive trap functions. The empirical results obtained led to a better understanding of the model-building process in BOA and, consequently, to an improvement in model quality. Empirically, we have found that using truncation instead of tournament selection is much better for obtaining accurate structural linkage information. Although truncation selection requires larger population sizes, using tournament selection with population sizes equivalent to those required by truncation does not significantly improve the linkage information. For the same purpose, the replacement strategy was found to be relevant only if tournament selection is used, in which case full replacement of the parents by their offspring is the most appropriate strategy. On the other hand, if overall performance
(number of function evaluations) is our main concern, tournament selection and restricted tournament replacement are clearly the best options.

Intrigued by the empirical observations, we have investigated the learning process of Bayesian networks in BOA to better understand the difference between tournament and truncation selection. Looking at the distribution of the learning data set (mating pool), it has been shown that the nature of the selection operator can have a great impact on the accuracy of Bayesian network learning. Specifically, if we take into account the power distribution in the mating pool generated by tournament selection, the scoring metric that guides the search for an adequate model structure can be modified accordingly. In this case, the model structural accuracy obtained for tournament selection can be greatly improved and becomes comparable to that of truncation selection, which generates a more suitable uniform distribution for learning.

Overall, this chapter provides important information to practitioners about the trade-off between the parameters used in BOA (and the consequent computational cost) and the accuracy of the learned linkage information.
Acknowledgments

This work was sponsored by the Portuguese Foundation for Science and Technology (FCT/MCTES) under grants SFRH-BD-16980-2004 and PTDC-EIA-67776-2006, the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant FA9550-06-1-0096, the National Science Foundation under NSF CAREER grant ECS-0547013, and ITR grant DMR-03-25939 at the Materials Computation Center, UIUC. The work was also supported by the High Performance Computing Collaboratory sponsored by Information Technology Services, the Research Award and the Research Board at the University of Missouri in St. Louis. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Office of Scientific Research, the National Science Foundation, or the U.S. Government.
References

[1] Ackley, D.H.: A connectionist machine for genetic hill climbing. Kluwer Academic, Boston (1987)
[2] Blickle, T., Thiele, L.: A comparison of selection schemes used in genetic algorithms. Evolutionary Computation 4(4), 311–347 (1997)
[3] Brindle, A.: Genetic Algorithms for Function Optimization. PhD thesis, University of Alberta, Edmonton, Canada. Unpublished doctoral dissertation (1981)
[4] Chickering, D.M., Heckerman, D., Meek, C.: A Bayesian approach to learning Bayesian networks with local structure. Technical Report MSR-TR-97-07, Microsoft Research, Redmond, WA (1997)
[5] Cooper, G.F., Herskovits, E.H.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9, 309–347 (1992)
[6] Correa, E.S., Shapiro, J.L.: Model complexity vs. performance in the Bayesian optimization algorithm. In: Runarsson, T.P., Beyer, H.-G., Burke, E.K., Merelo-Guervós, J.J., Whitley, L.D., Yao, X. (eds.) PPSN 2006. LNCS, vol. 4193, pp. 998–1007. Springer, Heidelberg (2006)
[7] Deb, K., Goldberg, D.E.: Analyzing deception in trap functions. Foundations of Genetic Algorithms 2, 93–108 (1993)
[8] Echegoyen, C., Lozano, J.A., Santana, R., Larrañaga, P.: Exact Bayesian network learning in estimation of distribution algorithms. In: Proceedings of the IEEE Congress on Evolutionary Computation, pp. 1051–1058. IEEE Press, Los Alamitos (2007)
[9] Friedman, N., Goldszmidt, M.: Learning Bayesian networks with local structure. Graphical Models, 421–459 (1999)
[10] Goldberg, D.E.: The Design of Innovation - Lessons from and for Competent Genetic Algorithms. Kluwer Academic Publishers, Norwell (2002)
[11] Goldberg, D.E., Korb, B., Deb, K.: Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems 3(5), 493–530 (1989)
[12] Harik, G.R.: Finding multimodal solutions using restricted tournament selection. In: Proceedings of the Sixth International Conference on Genetic Algorithms, pp. 24–31 (1995)
[13] Hauschild, M., Pelikan, M., Lima, C.F., Sastry, K.: Analyzing probabilistic models in hierarchical BOA on traps and spin glasses. MEDAL Report No. 2007001, University of Missouri at St. Louis, St. Louis, MO (2007)
[14] Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: The combination of knowledge and statistical data. Technical Report MSR-TR-94-09, Microsoft Research, Redmond, WA (1994)
[15] Larrañaga, P., Lozano, J.A.: Estimation of distribution algorithms: a new tool for Evolutionary Computation. Kluwer Academic Publishers, Boston (2002)
[16] Lima, C.F., Lobo, F.G., Pelikan, M.: From mating-pool distributions to model overfitting. In: Accepted for the ACM SIGEVO Genetic and Evolutionary Computation Conference (GECCO 2008) (2008)
[17] Lima, C.F., Pelikan, M., Sastry, K., Butz, M., Goldberg, D.E., Lobo, F.G.: Substructural neighborhoods for local search in the Bayesian optimization algorithm. In: Runarsson, T.P., Beyer, H.-G., Burke, E.K., Merelo-Guervós, J.J., Whitley, L.D., Yao, X. (eds.) PPSN 2006. LNCS, vol. 4193, pp. 232–241. Springer, Heidelberg (2006)
[18] Lima, C.F., Sastry, K., Goldberg, D.E., Lobo, F.G.: Combining competent crossover and mutation operators: a probabilistic model building approach. In: Beyer, H., et al. (eds.) Proceedings of the ACM SIGEVO Genetic and Evolutionary Computation Conference (GECCO 2005), pp. 735–742. ACM Press, New York (2005)
[19] Mühlenbein, H., Schlierkamp-Voosen, D.: Predictive models for the breeder genetic algorithm: I. Continuous parameter optimization. Evolutionary Computation 1(1), 25–49 (1993)
[20] Pearl, J.: Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, San Mateo (1988)
[21] Pelikan, M.: Hierarchical Bayesian Optimization Algorithm: Toward a New Generation of Evolutionary Algorithms. Springer, Heidelberg (2005)
[22] Pelikan, M., Goldberg, D.E.: Escaping hierarchical traps with competent genetic algorithms. In: Spector, L., et al. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001), pp. 511–518. Morgan Kaufmann, San Francisco (2001)
[23] Pelikan, M., Goldberg, D.E., Cantu-Paz, E.: BOA: The Bayesian Optimization Algorithm. In: Banzhaf, W., et al. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference GECCO 1999, pp. 525–532. Morgan Kaufmann, San Francisco (1999)
[24] Pelikan, M., Goldberg, D.E., Lobo, F.: A survey of optimization by building and using probabilistic models. Computational Optimization and Applications 21(1), 5–20 (2002)
[25] Pelikan, M., Sastry, K.: Fitness inheritance in the Bayesian optimization algorithm. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 48–59. Springer, Heidelberg (2004)
[26] Santana, R., Larrañaga, P., Lozano, J.A.: Interactions and dependencies in estimation of distribution algorithms. In: Proceedings of the IEEE Congress on Evolutionary Computation, pp. 1418–1425. IEEE Press, Los Alamitos (2005)
[27] Sastry, K.: Evaluation-relaxation schemes for genetic and evolutionary algorithms. Master's thesis, University of Illinois at Urbana-Champaign, Urbana, IL (2001)
[28] Sastry, K., Abbass, H.A., Goldberg, D.E., Johnson, D.D.: Sub-structural niching in estimation distribution algorithms. In: Beyer, H., et al. (eds.) Proceedings of the ACM SIGEVO Genetic and Evolutionary Computation Conference (GECCO 2005). ACM Press, New York (2005)
[29] Sastry, K., Goldberg, D.E.: Designing competent mutation operators via probabilistic model building of neighborhoods. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 114–125. Springer, Heidelberg (2004)
[30] Sastry, K., Lima, C.F., Goldberg, D.E.: Evaluation relaxation using substructural information and linear estimation. In: Keijzer, M., et al. (eds.) Proceedings of the ACM SIGEVO Genetic and Evolutionary Computation Conference (GECCO 2006), pp. 419–426. ACM Press, New York (2006)
[31] Sastry, K., Pelikan, M., Goldberg, D.E.: Efficiency enhancement of genetic algorithms via building-block-wise fitness estimation. In: Proceedings of the IEEE International Conference on Evolutionary Computation, pp. 720–727 (2004)
[32] Thierens, D., Goldberg, D.E.: Mixing in genetic algorithms. In: Forrest, S. (ed.) Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 38–45. Morgan Kaufmann, San Mateo (1993)
[33] Wu, H., Shapiro, J.L.: Does overfitting affect performance in estimation of distribution algorithms. In: Keijzer, M., et al. (eds.) Proceedings of the ACM SIGEVO Genetic and Evolutionary Computation Conference (GECCO 2006), pp. 433–434. ACM Press, New York (2006)
[34] Yu, T.-L., Goldberg, D.E.: Dependency structure matrix analysis: Offline utility of the dependency structure matrix genetic algorithm. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 355–366. Springer, Heidelberg (2004)
[35] Yu, T.-L., Sastry, K., Goldberg, D.E.: Population size to go: Online adaptation using noise and substructural measurements. In: Lobo, F.G., et al. (eds.) Parameter Setting in Evolutionary Algorithms, pp. 205–224. Springer, Heidelberg (2007)
The Impact of Exact Probabilistic Learning Algorithms in EDAs Based on Bayesian Networks

Carlos Echegoyen 1, Roberto Santana 1, Jose A. Lozano 1, and Pedro Larrañaga 2

1 Intelligent Systems Group, Department of Computer Science and Artificial Intelligence, University of the Basque Country, Paseo Manuel de Lardizabal 1, 20018 Donostia - San Sebastian, Spain; {carlos.echegoyen,roberto.santana,ja.lozano}@ehu.es
2 Department of Artificial Intelligence, Technical University of Madrid, 28660 Boadilla del Monte, Madrid, Spain; [email protected]

Summary. This paper discusses exact learning of Bayesian networks in estimation of distribution algorithms. The estimation of Bayesian network algorithm (EBNA) is used to analyze the impact of learning the optimal (exact) structure on the search. By applying recently introduced methods that allow learning optimal Bayesian networks, we investigate two important issues in EDAs. First, we analyze the question of whether learning more accurate (exact) models of the dependencies implies better performance of EDAs. Secondly, we study the way in which the problem structure is translated into the probabilistic model when exact learning is accomplished. The results obtained reveal that the quality of the problem information captured by the probability model can improve when the accuracy of the learning algorithm employed is increased. However, improvements in model accuracy do not always imply a more efficient search.
1 Introduction

In estimation of distribution algorithms (EDAs) [21, 30] linkage learning, understood as the ability to capture the relationships between the variables of the optimization problem, is accomplished by detecting and representing probabilistic dependencies using probability models. EDAs are evolutionary algorithms that do not employ classical genetic operators such as mutation or crossover. Instead, machine learning methods are used to extract relevant features of the search space. The collected information is represented using a probabilistic model which is later employed to generate new points. During the sampling or generation step, the statistical dependencies between the variables are used for the construction of the new solutions. In EDAs, the ability to learn an accurate representation of the relationships between the variables is related to the class of probabilistic models used
and the methods employed to learn them. One class of model that has been extensively applied in EDAs is Bayesian networks [18]. Among the benefits of EDAs that use this type of model [14, 28, 33, 45] is that the complexity of the learned structure depends on the characteristics of the data (the selected individuals). Additionally, the Bayesian networks learned during the search are suitable for human interpretation, aiding the discovery of unknown information about the problem structure. Although, in the case of EDAs that use Bayesian networks, the role of the parameters that penalize the complexity of the networks has been studied [28, 32], a detailed analysis of the accuracy of the methods used for finding the best network and its influence on the behavior of EDAs has not been conducted. An initial attempt to investigate this problem was presented in [13], where exact Bayesian learning was introduced to EDAs. Methods that perform exact Bayesian structure learning [12, 20, 42, 43] compute, given a set of data and a prespecified score (in our case, the BIC score [41]), the network structure that optimizes the score. Since the problem of learning the optimal Bayesian network is NP-hard [6], these methods set constraints on the maximum number of variables and/or cases they can deal with. Usually, dynamic programming algorithms are used to learn the structure.

In this chapter, we extend the preliminary results presented in [13] and provide evidence that the methods for learning optimal (exact) Bayesian networks can be very useful to analyze the relationship between the search space and the structure of the learned probabilistic models. The advantage of using methods that learn exact models is that the approximate methods commonly applied to learn the models in EDAs are very often able to find only suboptimal solutions. Therefore, using exact learning makes it easier to investigate to what extent approximate learning algorithms are responsible for the loss in accuracy in the mapping between the problem structure and the model structure. In general, an exact learning algorithm can serve as a different framework to investigate the influence of the EDA components on the ability of the probability models to capture the problem structure.

The chapter is organized as follows. In the next section, Bayesian networks are presented and the general procedures to learn these networks from data are discussed. In Section 3, we focus on the type of search strategies used to find the Bayesian network structure; approximate and exact learning methods are analyzed. Section 4 introduces the EBNA algorithm. In Section 5, the experimental framework and the functions used to evaluate the exact and local learning methods used by EBNA are introduced. Sections 6 and 7 respectively present experimental results on the time complexity analysis and convergence reliability of the two EBNA variants. Section 8 analyzes ways of using the Bayesian networks learned by EBNA as a source of problem knowledge and presents experimental results for several functions. Work related to our proposal is analyzed in Section 9. The conclusions of our paper are presented in Section 10.
2 Bayesian Networks

2.1 Notation
Let X be a random variable. A value of X is denoted x. X = (X_1, ..., X_n) will denote a vector of random variables. We will use x = (x_1, ..., x_n) to denote an assignment to the variables. We will work with discrete variables. The joint probability mass function of x is represented as p(X = x) or p(x). p(x_S) will denote the marginal probability distribution for X_S. We use p(X_i = x_i | X_j = x_j) or, in a simplified form, p(x_i | x_j), to denote the conditional probability distribution of X_i given X_j = x_j.

Formally, a Bayesian network [5] is a pair (S, θ) representing a graphical factorization of a probability distribution. The structure S is a directed acyclic graph which reflects the set of conditional (in)dependencies among the variables. The factorization of the probability distribution is codified by S:

p(x) = \prod_{i=1}^{n} p(x_i \mid pa_i)
where pa_i denotes a value of variable Pa_i, the parent set of X_i (variables from which there exists an arc to X_i in the graph S), while θ is a set of parameters for the local probability distributions associated with each variable. If the variable X_i has r_i possible values, x_i^1, ..., x_i^{r_i}, the local distribution p(x_i | pa_i^j, θ_i) is an unrestricted discrete distribution:

p(x_i^k \mid pa_i^j, \theta_i) \equiv \theta_{ijk}

where pa_i^1, ..., pa_i^{q_i} denote the values of Pa_i and the term q_i denotes the number of possible different instances of the parent variables of X_i. In other words, the parameter θ_{ijk} represents the probability of variable X_i being in its k-th value, knowing that the set of its parent variables is in its j-th value. Therefore, the local parameters are given by θ_i = ((θ_{ijk})_{k=1}^{r_i})_{j=1}^{q_i}.
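As an illustration of the factorization, the joint probability of a configuration can be computed directly from the structure and the local parameters. The dictionary-based representation below is only a convenient assumption for the sketch, not the representation used by the algorithms discussed in this chapter.

```python
def joint_probability(x, parent_sets, theta):
    """p(x) = prod_i p(x_i | pa_i).

    parent_sets[i]: tuple of the indices of Pa_i.
    theta[i][(pa_values, x_i)]: conditional probability of X_i = x_i given that
    its parents take the values pa_values (a tuple in the same order)."""
    p = 1.0
    for i, xi in enumerate(x):
        pa_values = tuple(x[j] for j in parent_sets[i])
        p *= theta[i][(pa_values, xi)]
    return p
```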
2.2 Learning Bayesian Networks from Data
There are different strategies to learn the structure of a Bayesian network. We focus on a method called "score + search", which is the one used in the experiments presented in this paper. In this strategy, given a set of data D and a Bayesian network whose structure is denoted by S, a value (score) is assigned that evaluates how well the Bayesian network represents the probability distribution of the database D. Different scores can be used. In this work we have used the Bayesian Information Criterion score (BIC) [41] (based on penalized maximum likelihood).
A general formula for a penalized maximum likelihood score can be written as follows:

\log p(D \mid S, \hat{\theta}) - f(N)\,\dim(S)

where dim(S) is the dimension (the number of parameters needed to specify the model) of the Bayesian network with a structure given by S. Thus:

\dim(S) = \sum_{i=1}^{n} q_i (r_i - 1)
and f (N ) is a non negative penalization function. The Jeffreys-Schwarz criterion, sometimes called BIC [41], takes into account f (N ) = 12 log N . Thus the BIC score can be written as follows: BIC(S, D) = log
N n
1 log N qi (ri − 1) 2 i=1 n
ˆi) − p(xw,i |paSi , θ
w=1 i=1
(1)
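A minimal sketch of how the BIC score of Equation (1) can be computed from a data set of discrete cases is given below. It is our own illustration, not the authors' implementation; the representation of a structure as a list of parent index sets is an assumption made for the example.

```python
# Illustrative sketch: BIC score of a candidate structure for discrete data.
# r[i] is the number of values of X_i; parents[i] is a tuple of parent indices.
import math
from collections import Counter

def bic_score(data, parents, r):
    N = len(data)
    score = 0.0
    for i, pa in enumerate(parents):
        q_i = 1
        for j in pa:
            q_i *= r[j]
        # Counts N_ijk and N_ij for this variable.
        n_ijk = Counter((tuple(case[j] for j in pa), case[i]) for case in data)
        n_ij = Counter(tuple(case[j] for j in pa) for case in data)
        # Log-likelihood term evaluated at the maximum likelihood parameters.
        for (pa_conf, k), count in n_ijk.items():
            score += count * math.log(count / n_ij[pa_conf])
        # Complexity penalty: (1/2) log N per free parameter.
        score -= 0.5 * math.log(N) * q_i * (r[i] - 1)
    return score
```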
Finding the Bayesian network that optimizes the score implies solving an optimization problem. This can be done with exhaustive or heuristic search algorithms. In Section 3, we analyze two variants for finding the Bayesian network structures. Each structure is evaluated using the maximum likelihood parameters.

2.3 Learning of the Parameters
Once the structure has been learned, the parameters of the Bayesian network are calculated using the Laplace correction:

θ̂_ijk = (N_ijk + 1) / (N_ij + r_i)   (2)

where N_ijk denotes the number of cases in D in which the variable X_i has the value x_i^k and Pa_i takes its j-th value, and N_ij = Σ_{k=1}^{r_i} N_ijk.
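The following sketch (ours, not the authors' code) estimates the Laplace-corrected parameters of Equation (2) for a fixed structure represented, as above, by parent index sets.

```python
# Illustrative sketch: Laplace-corrected parameter estimation (Equation 2).
from collections import Counter
from itertools import product

def estimate_parameters(data, parents, r):
    theta = []
    for i, pa in enumerate(parents):
        n_ijk = Counter((tuple(case[j] for j in pa), case[i]) for case in data)
        n_ij = Counter(tuple(case[j] for j in pa) for case in data)
        cpt = {}
        for pa_conf in product(*(range(r[j]) for j in pa)):
            # Unseen parent configurations fall back to the uniform 1/r_i.
            cpt[pa_conf] = [(n_ijk[(pa_conf, k)] + 1) / (n_ij[pa_conf] + r[i])
                            for k in range(r[i])]
        theta.append(cpt)
    return theta
```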
3 Methods for Learning Bayesian Networks

Once we have defined a score to evaluate Bayesian networks, we have to set up a search process to find the Bayesian network that maximizes the score given the data. Approximate and exact methods can be used.

3.1 Learning an Approximate Model
In practical applications, we need to find an adequate model structure as quickly as possible. Therefore, a simple algorithm which returns a good structure, even if not optimal, is preferred. An algorithm that fulfills these criteria is Algorithm B [4], which is typically used by most Bayesian-network-based EDAs. Algorithm B is a greedy search which starts with an arc-less structure and, at each step, adds the arc with the maximum improvement in the score. The algorithm finishes when there is no arc whose addition improves the score.
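A sketch of this greedy strategy is shown below. It is our own reconstruction of an Algorithm B-style search, not the exact original: it assumes a decomposable local score and simply skips arc additions that would create a cycle.

```python
# Illustrative sketch of greedy arc-addition search with a decomposable score.

def creates_cycle(parents, child, parent):
    # Adding parent -> child creates a cycle iff child is an ancestor of parent.
    stack, seen = [parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_structure_search(n, local_score):
    """local_score(i, parent_set) returns the decomposed score of X_i."""
    parents = [set() for _ in range(n)]
    while True:
        best_gain, best_arc = 0.0, None
        for child in range(n):
            base = local_score(child, frozenset(parents[child]))
            for parent in range(n):
                if parent == child or parent in parents[child]:
                    continue
                if creates_cycle(parents, child, parent):
                    continue
                gain = local_score(child, frozenset(parents[child] | {parent})) - base
                if gain > best_gain:
                    best_gain, best_arc = gain, (parent, child)
        if best_arc is None:        # no arc improves the score any further
            return parents
        parent, child = best_arc
        parents[child].add(parent)
```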
3.2 Learning the Exact Model
Since learning the Bayesian network structure is an NP-hard problem, for a long time the goal of learning exact Bayesian networks was constrained to problems with a very small number of variables. In [20], an algorithm for learning the exact structure in less than super-exponential complexity with respect to n was introduced for the first time. The most efficient method so far is presented in [42], and it is the one used in our work. This algorithm is feasible for n < 33 and it was shown to learn a best network for a data set of 29 variables.

In that work, the Bayesian network structure S is defined as a vector S = (S_1, . . . , S_n) of parent sets, where S_i is the subset of X from which there are arcs to X_i; for example, S_1 is the parent set of X_1. Another necessary concept is the variable ordering, which is simply the variables of X in a particular order; the i-th element of an ordering ord is denoted ord_i. The structure S = (S_1, . . . , S_n) is said to be consistent with an ordering ord when all the parents of a node precede the node in the ordering. Another important concept in the algorithm is the sink node. Every DAG has at least one node with no outgoing arcs, so at least one node is not a parent of any other node. These nodes are called sinks of the network.

In this algorithm, the data set D is processed in a particular way, using two kinds of data tables. Given W ⊆ X, the contingency table CT(W) is first defined as a list of the frequencies of the different data vectors d_W in D_W, where D_W is the data set restricted to the variables W. The main task, however, is to calculate conditional frequency tables CFT(X_i, W) that record how many times the different values of the variable X_i occur together with the different vectors d_{W−{X_i}} in the data. On the other hand, many popular scores such as BIC, AIC and BDe can be decomposed into local scores:
score(S) = Σ_{i=1}^{n} score_i(S_i) = Σ_{i=1}^{n} score(CFT(X_i, S_i))
Thus, the score of the network is the sum of the local scores, which only depend on the conditional frequency table for one variable and its parents. Algorithm 1 presents the main steps of the method. The first step is the main procedure and the only one for which data is needed. It starts by calculating the contingency table for all the variables X and continues calculating contingency tables for all smaller variable subsets, marginalizing variables out of the contingency table. After that, for each contingency table, the conditional frequency table is calculated for each variable appearing in the contingency table. These conditional frequency tables can then be used to calculate the local scores for any parent set given a variable. All the n·2^{n−1} local scores are stored in a table which will be the basis of the algorithm.
Algorithm 1. Exact learning algorithm
1 Calculate the local scores for all n·2^{n−1} different (variable, variable set)-pairs
2 Using the local scores, find the best parents for all n·2^{n−1} (variable, variable set)-pairs
3 Find the best sink for all 2^n variable sets
4 Using the results from Step 3, find a best ordering of the variables
5 Find a best network using the results computed in Steps 2 and 4
Having calculated the local scores, the best parents for X_i given a candidate set C are either the whole candidate set C itself or one of the smaller candidate sets {C\{c} | c ∈ C}. This must be computed for all 2^{n−1} variable sets (parent candidate sets) related to X_i. Step 3 of the algorithm is based on the following observation: the best network G* for a variable set W must have a sink s. As G* is a network with the highest score, sink s must have incoming arcs from its best possible set of parents, and the rest of the nodes and arcs must form the best possible network for the variables W\{s}. Therefore, the best sink for W, sink*(W), is the node s that maximizes the sum of the local score for s and the score of the best network for W without node s. When we have the best sinks for all 2^n variable sets, the best ordering ord* can be produced in reverse order: for each position i from |X| down to 1, ord*_i is the best sink of the variable set that remains after removing {ord*_{i+1}, . . . , ord*_{|X|}}. Having a best ordering and a table with the best parents for any candidate set, it is possible to obtain a best network consistent with the given ordering: for the i-th variable in the optimal ordering, the best parents from its predecessors are picked. More details about the algorithm can be found in [42]. We use an implementation of Algorithm 1 given by the authors¹. The computational complexity of the algorithm is o(n² 2^{n−2}). The memory requirement of the method is 2^{n+2} bytes and the disk-space requirement is 12n·2^{n−1} bytes.
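The following sketch illustrates the dynamic programming idea behind Algorithm 1. It is our own simplified reconstruction, not the authors' C++ implementation: in particular, the best-parents table is filled by brute force for clarity, whereas the real algorithm computes it incrementally from the local score table.

```python
# Simplified illustrative sketch of exact structure search via best sinks.
# Variable sets are bit masks; local_score(i, parent_set) is any decomposable
# score such as BIC. Feasible only for small n because of the 2^n tables.
from itertools import combinations

def exact_structure_search(n, local_score):
    full = (1 << n) - 1
    # Steps 1-2: best parent set (score, set) for every (variable, candidate set).
    best_parents = [dict() for _ in range(n)]
    for i in range(n):
        others = [v for v in range(n) if v != i]
        for mask in range(1 << n):
            if mask & (1 << i):
                continue
            cand = [v for v in others if mask & (1 << v)]
            best = (local_score(i, frozenset()), frozenset())
            for size in range(1, len(cand) + 1):        # brute force for clarity
                for subset in combinations(cand, size):
                    s = local_score(i, frozenset(subset))
                    if s > best[0]:
                        best = (s, frozenset(subset))
            best_parents[i][mask] = best
    # Step 3: best sink and best network score for every variable set.
    best_net, best_sink = {0: 0.0}, {}
    for mask in range(1, 1 << n):
        best = None
        for i in range(n):
            if mask & (1 << i):
                rest = mask ^ (1 << i)
                s = best_net[rest] + best_parents[i][rest][0]
                if best is None or s > best[0]:
                    best = (s, i)
        best_net[mask], best_sink[mask] = best
    # Steps 4-5: recover the ordering in reverse and read off the best parents.
    parents, mask = [None] * n, full
    while mask:
        sink = best_sink[mask]
        mask ^= (1 << sink)
        parents[sink] = best_parents[sink][mask][1]
    return parents
```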
4 Estimation of Distribution Algorithms Based on Bayesian Networks

The estimation of Bayesian networks algorithm (EBNA) allows statistics of unrestricted order in the factorization of the joint probability distribution. This distribution is encoded by a Bayesian network that is learned from the database containing the selected individuals at each generation. It has been applied with good results to a variety of problems [3, 19, 23, 24, 25]. Other algorithms based on the use of Bayesian networks have been proposed in [28, 33, 45].
¹ The C++ code of this implementation is available from http://www.cs.helsinki.fi/u/tsilande/sw/bene/download/
Algorithm 2. EBNA_BIC
1  BN_0 ← (S_0, θ^0), where S_0 is an arc-less DAG and θ^0 is uniform
2  p_0(x) = ∏_{i=1}^{n} p(x_i) = ∏_{i=1}^{n} 1/r_i
3  D_0 ← Sample M individuals from p_0(x) and evaluate them
4  t ← 1
5  do {
6      D^{Se}_{t−1} ← Select N individuals from D_{t−1}
7      S*_t ← Using a search method, find one network structure according to the BIC score
8      θ^t ← Calculate θ^t_{ijk} using D^{Se}_{t−1} as the data set
9      BN_t ← (S*_t, θ^t)
10     D_t ← Sample M individuals from BN_t and evaluate them
11 } until Stopping criterion is met
A pseudocode of EBNA is shown in Algorithm 2. In the experiments presented in this paper, EBNA uses truncation selection and the number of selected individuals equals half of the population. The best solution at each generation is passed to the next population; therefore, at each generation N − 1 new solutions are sampled. The stopping criterion is changed according to the type of experiments conducted.
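A high-level sketch of this loop is given below (our paraphrase of Algorithm 2, not the original code). The structure learner, parameter estimator and sampler are passed in as functions so that either the approximate or the exact learning method of Section 3 could be plugged in.

```python
# Illustrative sketch of the EBNA loop for binary variables.
import random

def ebna(fitness, n, M, generations, learn_structure, estimate_parameters, sample):
    population = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        selected = population[: M // 2]               # truncation selection
        structure = learn_structure(selected)         # BIC-guided search
        theta = estimate_parameters(selected, structure)
        elite = max(population, key=fitness)          # elitism: keep best solution
        population = [elite] + [sample(structure, theta) for _ in range(M - 1)]
    return max(population, key=fitness)
```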
5 Experimental Framework and Function Benchmark

To investigate the impact of exact learning on the behavior of Bayesian-network-based EDAs, we compare the EBNA versions that use the two different Bayesian network learning schemes described in Section 3. We call them EBNA-Exact and EBNA-Local. We used three different criteria to compare the algorithms: the time complexity, the convergence reliability, and the way in which probabilistic dependencies are represented in the structure of the Bayesian network. In this section, we introduce a set of functions that represent different classes of problems and which are used in the following sections to test the behavior of EDAs.
5.1 Function Benchmark

Let u(x) = Σ_{i=1}^{n} x_i. f(x) is a unitation function if ∀x, y ∈ {0, 1}^n, u(x) = u(y) ⇒ f(x) = f(y). A unitation function is defined in terms of its unitation value u(x) or, in a simpler way, u.
Function OneMax:

OneMax(x) = Σ_{i=1}^{n} x_i = u(x)   (3)
Unitation functions are also useful for the definition of a class of functions where the difficulty is given by the interactions that arise among subsets of variables. One example of this class of deceptive functions is f3deceptive [15]:

f3deceptive(x) = Σ_{i=1}^{n/3} f_dec^3(x_{3i−2}, x_{3i−1}, x_{3i})   (4)

where f_dec^3 is defined, as a function of the unitation u of its three arguments, as:

f_dec^3(u) = 0.9 for u = 0, 0.8 for u = 1, 0.0 for u = 2, 1.0 for u = 3.
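Minimal sketches of OneMax and f3deceptive (Equations 3 and 4), written by us for illustration:

```python
# Illustrative sketches of the first two benchmark functions.
def onemax(x):
    return sum(x)

F3DEC = {0: 0.9, 1: 0.8, 2: 0.0, 3: 1.0}   # f^3_dec as a function of unitation

def f3deceptive(x):
    # n must be a multiple of 3; each consecutive block of 3 bits is deceptive.
    return sum(F3DEC[sum(x[i:i + 3])] for i in range(0, len(x), 3))
```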
Function SixPeaks is a modification of the FourPeaks problem [1] and it can be defined mathematically as:

SixPeaks(x, t) = max{tail(0, x), head(1, x), tail(1, x), head(0, x)} + R(x, t)   (5)

where

tail(b, x) = number of trailing b's in x
head(b, x) = number of leading b's in x
R(x, t) = n if (tail(0, x) > t and head(1, x) > t) or (tail(1, x) > t and head(0, x) > t), and 0 otherwise.
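A sketch of SixPeaks (Equation 5), written by us for illustration, with t defaulting to n/2 − 1 as in the experiments:

```python
# Illustrative sketch of the SixPeaks function.
def sixpeaks(x, t=None):
    n = len(x)
    if t is None:
        t = n // 2 - 1
    def head(b):
        c = 0
        for v in x:
            if v != b:
                break
            c += 1
        return c
    def tail(b):
        c = 0
        for v in reversed(x):
            if v != b:
                break
            c += 1
        return c
    reward = n if ((tail(0) > t and head(1) > t) or
                   (tail(1) > t and head(0) > t)) else 0
    return max(tail(0), head(1), tail(1), head(0)) + reward
```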
The goal is to maximize the function. For an even number of variables this function has 4 global optima, located at the points 0^{t+1}1^{n−t−1}, 0^{n−t−1}1^{t+1}, 1^{t+1}0^{n−t−1} and 1^{n−t−1}0^{t+1}, i.e., t + 1 copies of one symbol followed by n − t − 1 copies of the other. These points are very difficult to reach because they are isolated. On the other hand, two local optima, (0, 0, . . . , 0) and (1, 1, . . . , 1), are very easily reachable. The value of t was set to n/2 − 1.

The Parity function [8] is a simple k-bounded additively separable function that has been used to investigate the limitations of linkage learning by probabilistic modeling. It can be seen as a generalization of the XOR function and the Walsh transform. In this case we will work with the concatenated parity function
(CPF) [8]. It is said that this problem is hard for EDAs in general. The Parity function can be defined mathematically as:

parity(x) = C_even if u(x) is even, and C_odd otherwise,

where C_even and C_odd are parameters of the function. The CPF is defined as m concatenated parity sub-functions:

CPF(x) = Σ_{i=0}^{m−1} parity(x_{ik+1}, . . . , x_{ik+k})   (6)

where k is the size of each sub-function. As in [8], we use k = 5, C_odd = 5 and C_even = 0. Notice that there are 2^{n−m} solutions where the function reaches the global optimum.
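A sketch of the CPF (Equation 6), written by us for illustration, with the parameter values used in the experiments as defaults:

```python
# Illustrative sketch of the concatenated parity function (n must be m * k).
def cpf(x, k=5, c_odd=5, c_even=0):
    total = 0
    for i in range(0, len(x), k):
        block = x[i:i + k]
        total += c_odd if sum(block) % 2 else c_even
    return total
```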
Function Cuban5 [29] is a non-separable additive function. The second best value of this function is very close to the global optimum.

Cuban5(x) = F_cuban1^5(s_0) + Σ_{j=0}^{m−1} (F_cuban2^5(s_{2j+1}) + F_cuban1^5(s_{2j+2}))   (7)

where s_i = x_{4i} x_{4i+1} x_{4i+2} x_{4i+3} x_{4i+4} and n = 4(2m + 1) + 1.

F_cuban1^3(x) = 0.595 for x = 000, 0.200 for x = 001, 0.595 for x = 010, 0.100 for x = 011, 1.000 for x = 100, 0.050 for x = 101, 0.090 for x = 110, 0.150 for x = 111.

F_cuban1^5(x) = 4 F_cuban1^3(x_1, x_2, x_3) if x_2 = x_4 and x_3 = x_5, and 0 otherwise.   (8)

F_cuban2^5(x) = u(x) for x_5 = 0; 0 for x_1 = 0, x_5 = 1; u(x) − 2 for x_1 = 1, x_5 = 1.
The HP protein model

In our experiments we also use a coarse-grained protein folding model called the hydrophobic-polar (HP) model [11].
Fig. 1. An optimal solution of the HP model for sequence HPHPPHHPHPPHPHHPPHPH. The optimal energy corresponding to this sequence is −9.
Under specific conditions, a protein sequence folds into a native 3-d structure. The problem of determining the protein native structure from its sequence is known as the protein structure prediction problem. To solve this problem, a protein model is chosen and an energy is associated to each possible protein fold. The search for the protein structure is transformed into the search for the optimal protein configuration given the energy function. The HP model considers two types of residues: hydrophobic (H) residues and hydrophilic or polar (P) residues. In the model, a protein is considered as a sequence of these two types of residues, which are located in regular lattice models forming self-avoiding paths. Given a pair of residues, they are considered neighbors if they are adjacent either in the chain (connected neighbors) or in the lattice but not connected in the chain (topological neighbors). The total number of topological neighboring positions in the lattice (z) is called the lattice coordination number. Figure 1 shows one possible configuration of sequence HPHPPHHPHPPHPHHPPHPH in the HP model.

A solution x can be interpreted as a walk in the lattice, representing one possible folding of the protein. We use a discrete representation of the solutions. For a given sequence and lattice, X_i represents the relative move of residue i in relation to the previous two residues. Taking as a reference the location of the previous two residues in the lattice, X_i takes values in {0, 1, . . . , z − 2}, where z − 1 is the number of movements allowed in the given lattice. These values mean that the new residue will be located in one of the z − 1 possible directions with respect to the previous two locations. Values for X_1 and X_2 are therefore meaningless, and the locations of these two residues are fixed. If the encoded solution is self-intersecting, it can be repaired or penalized during the evaluation step using a recursive repairing procedure introduced in [9].

For the HP model, an energy function that measures the interaction between topological neighbor residues is defined as HH = −1 and HP = PP = 0. The HP problem consists of finding the solution that minimizes the total energy. More details about the representation and function can be found in [40].
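As an illustration of this representation, the sketch below evaluates the HP energy of a fold encoded by relative moves on the 2-d square lattice (z = 4). The concrete move encoding (0: turn left, 1: go straight, 2: turn right) and the penalty value for self-intersecting walks are our own assumptions; the chapter instead uses the repair procedure of [9] in one of its variants.

```python
# Minimal illustrative sketch of the HP energy on the 2-d square lattice.
def hp_energy(sequence, moves, penalty=1000.0):
    """sequence: string of 'H'/'P'; moves: values in {0,1,2} for residues 3..n."""
    positions = [(0, 0), (1, 0)]                 # first two residues are fixed
    direction = (1, 0)
    for m in moves:
        dx, dy = direction
        direction = {0: (-dy, dx), 1: (dx, dy), 2: (dy, -dx)}[m]
        last = positions[-1]
        positions.append((last[0] + direction[0], last[1] + direction[1]))
    if len(set(positions)) < len(positions):     # self-intersecting walk
        return penalty                            # assumed penalty variant
    occupied = {pos: i for i, pos in enumerate(positions)}
    energy = 0
    for i, (x, y) in enumerate(positions):
        if sequence[i] != 'H':
            continue
        # Checking only the +x and +y neighbors counts each contact once.
        for nb in ((x + 1, y), (x, y + 1)):
            j = occupied.get(nb)
            if j is not None and abs(i - j) > 1 and sequence[j] == 'H':
                energy -= 1                       # HH topological contact
    return energy

# A fully stretched chain has no topological contacts, so its energy is 0.
print(hp_energy("HPHPPHHPHPPHPHHPPHPH", [1] * 18))
```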
6 Time Complexity Analysis

In our case, the time complexity analysis refers to the study of the average number of generations needed by EBNA-Local and EBNA-Exact to find the optimum. Experiments were conducted for three functions: OneMax, f3deceptive and SixPeaks. In the first function, there are no interactions between the variables. In the others, interactions arise between variables that belong to the same definition set of the function. To determine the average number of generations needed to find the optimum, we start with a population of 10 individuals and increase the population size by 10 until a maximum population size of 150 is reached. For each possible combination of function, number of variables n, and population size N, 50 experiments are conducted. For each execution of the algorithm, a maximum of 10^5 evaluations is allowed.

For the OneMax function we conducted experiments for n ∈ {15, 20}. In order to increase the accuracy of the curves shown in the first figure, for n = 15 we exceptionally conducted 100 experiments. The original idea was to evaluate, under the dimension constraints imposed by the exact learning algorithm, the scalability of both EBNA versions with n ∈ {10, 12, 15, 20}. Nevertheless, in this work we only present the results achieved for the two largest sizes. The results of the experiments for n = 15 are shown in Figure 2 (a) and the average results for n = 20 are shown in Figure 2 (b).

The analysis of Figure 2 reveals that both algorithms exhibit the same time complexity pattern. However, EBNA-Exact needs, in general, a higher number of evaluations than EBNA-Local to find the optimal solution for the first time. The difference in the number of generations is less evident when the population size approaches 150. For this simple function, it seems that the error in the learning of the model introduced by the approximate learning algorithm is beneficial for the search.
Fig. 2. Time complexity analysis for function OneMax: (a) n = 15, (b) n = 20. (Axes: generation number vs. population size, for EBNA-Exact and EBNA-Local.)
Fig. 3. Time complexity analysis for function f3deceptive: (a) n = 15, (b) n = 18. (Axes: generation number vs. population size, for EBNA-Exact and EBNA-Local.)
Fig. 4. Time complexity analysis for function SixPeaks: (a) n = 14, (b) n = 16. (Axes: generation number vs. population size, for EBNA-Exact and EBNA-Local.)
In Figure 3 we can observe that both algorithms have an identical curve. For this function, and for the values of n investigated, the influence of the exact learning is not relevant. We can anticipate that the structures learned by both algorithms are similar. For this function, a small population size means that many generations are necessary to reach the optimum. For the SixPeaks function, the optimal value is reached in a significantly lower number of generations, as can be appreciated in Figure 4. It can also be observed that EBNA-Local is able to reach the optimum earlier than EBNA-Exact. From this analysis we deduce that, for the SixPeaks function, the structures learned by both algorithms could be different. We will later further analyze this behavior of the algorithms when discussing the structures of the models they learn for the f3deceptive and SixPeaks functions. To illustrate the complexity of the function and to study the algorithms in more detail, we introduce Figures 5 and 6, which show the total number of executions needed by each algorithm in order to succeed, i.e., to find the optimum. It should be noticed that in every run the maximum number of evaluations is bounded by 10^5. This constraint strongly influences the behavior of the algorithms.
Fig. 5. Number of executions, for each population size, needed to obtain one optimal value for the f3deceptive function: (a) n = 15, (b) n = 18. (Axes: executions vs. population size, for EBNA-Exact and EBNA-Local.)
Fig. 6. Number of executions, for each population size, needed to obtain one optimal value for the SixPeaks function: (a) n = 14, (b) n = 16. (Axes: executions vs. population size, for EBNA-Exact and EBNA-Local.)
7 Convergence Reliability

In the analysis of the convergence reliability, we focus on the critical population size needed by the EDAs to achieve a predefined convergence rate. In the experiments conducted, the goal was to determine the minimum population size needed by the two variants of EBNA to find the optimum in 20 consecutive experiments. We investigated the behavior of the algorithms for functions Cuban5 (n = 13), SixPeaks (n ∈ {10, 12, 14}) and f3deceptive (n ∈ {9, 12, 15}). The algorithm begins with a population size N = 16 which is doubled until the optimal solution has been found in 20 consecutive experiments. The maximum number of evaluations allowed is 10^4. For each function and value of n, 25 experiments are carried out. Table 1 shows the mean and standard deviation of the critical population size found.
Table 1. Mean, standard deviation and p-value of the critical population size for different functions and numbers of variables

Function      n    EBNA-Exact mean   EBNA-Exact std   EBNA-Local mean   EBNA-Local std   t-test p-value
Cuban5        13   118.40            53.07            109.44            57.26            0.57
SixPeaks      10   153.60            52.26            215.04            109.11           0.014
SixPeaks      12   209.92            110.11           389.12            249.19           0.019
SixPeaks      14   312.32            133.64           604.16            318.97           0.001
f3deceptive    9   135.68            38.40            168.96            60.94            0.025
f3deceptive   12   168.96            60.94            261.12            86.50            0.001
f3deceptive   15   220.16            58.66            296.96            95.79            0.0013
Table 1 shows that for function Cuban5, EBNA-Exact requires a slightly higher population size than EBNA-Local. The picture changes drastically for functions SixPeaks and f3deceptive, for which EBNA-Exact needs a much smaller population size. This difference is particularly evident for function SixPeaks. Another observation is that the standard deviation of EBNA-Local is always higher than that of EBNA-Exact. Since the only difference between EBNA-Exact and EBNA-Local is the class of algorithm used to learn the models, the difference in behavior is due to the ability of EBNA-Exact to learn a more accurate model of the dependencies. Therefore, at least for functions SixPeaks and f3deceptive, learning a more accurate model leads to a better performance of EBNA. To determine whether the population sizes obtained for each algorithm are significantly different, we carried out a Student's t-test over the two sets of 25 population sizes for each function and value of n. The probability values of the test are reported in the last column of Table 1. If we consider a significance level of 0.05, we would have to reject the null hypothesis for all cases except for Cuban5, where there are no significant differences. An explanation of the similar behavior achieved with both algorithms for Cuban5 will be presented in the next section, where the structures of the probabilistic models learned by the algorithms are studied. Moreover, for SixPeaks with the highest value of n, and for f3deceptive with the two highest problem sizes, the difference between the algorithms is statistically significant at the 1% level.
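The test reported in the last column of Table 1 can be reproduced along the following lines; this is a sketch with placeholder arrays, not the experimental values.

```python
# Illustrative sketch: two-sample Student's t-test over the 25 critical
# population sizes found per algorithm (placeholder data shown below).
from scipy import stats

ebna_exact_sizes = [128, 256, 128, 64, 256]   # hypothetical values
ebna_local_sizes = [256, 512, 256, 128, 512]  # hypothetical values

t_stat, p_value = stats.ttest_ind(ebna_exact_sizes, ebna_local_sizes)
print(p_value < 0.05)   # reject the null hypothesis of equal means at the 5% level
```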
8 Problem-Knowledge Extraction from Bayesian Networks

The objective of this section is to show a number of ways in which knowledge about the problem structure can be extracted from the analysis of the Bayesian networks learned by EBNA. In particular, we investigate the difference between the structures learned using exact and approximate learning algorithms. We also analyze the changes in the pattern (number and type) of the dependencies captured by the algorithms during their evolution.
8.1 Probabilistic Models as a Source of Knowledge About the Problem
Although the main objective in EDAs is to obtain a set of optimal solutions, the analysis of the models learned by the algorithms during the evolution can reveal previously unknown characteristics of the problem. There is a variety of information that can be obtained from the analysis of the models. Just to cite a few examples, it could be possible to extract:

• A description of sets of dependent or interacting variables.
• Probabilistic information about most likely configurations for subsets of the problem variables, which can be translated into most-probable partial solutions of the problem.
• Evidence on the existence of different types of problem symmetry.
• Identification of conflicting partial solutions in problems with frustration.
• In addition, by considering the change of the models during the evolution (a dynamical perspective), it is also possible to identify patterns in the formation of optimal structures.

A central problem in EDAs is the design of methods for extracting and interpreting this information from the models. There are a number of approaches that have been proposed to treat this issue for different classes of probabilistic models used in EDAs. We postpone a review of some of these approaches to the next section and focus now on the extraction of information from Bayesian networks. We identify three main sources of information:

1. The structure of the Bayesian network: by inspecting the topological characteristics of the graphs (e.g. most frequent arcs), we identify structural relationships between the variables.
2. The probabilistic tables of the Bayesian networks: by analyzing the probability associated to variables linked in the network, it is possible to identify promising and also poor configurations of the partial solutions.
3. Most probable configurations given the network: these are the solutions with the highest probability given the model. Thus, they condense the structural and parametrical information stored by the Bayesian network and have not necessarily been generated during the evolution of the EDA.

In this paper we focus on the analysis of network structures.

8.2 Analysis of the Bayesian Structures Learned by EBNA
To investigate the type of dependencies learned by EBNA-Exact and EBNA-Local, we saved the structures of the Bayesian networks learned during the evolutionary process by both variants of the algorithm for functions f3deceptive, SixPeaks, Cuban5, CPF and Protein. In all the following experiments, we start by running EBNA-Local and EBNA-Exact and choose 30 executions in which the optimum was found and the algorithms did not converge in the first generation. The stopping criterion is a
maximum number of 10^5 evaluations. In each of these experiments, the structures of the Bayesian networks learned in each generation are stored. From the structures, the frequency with which each arc appeared in the Bayesian network was calculated. Since we are not interested in the direction of the dependencies, we add the frequencies of the two arcs that involve the same pair of variables. The matrices that store this information are called frequency matrices. Two different ways of showing the information contained in the frequency matrices are used. The first way to represent the frequencies is using images where a lighter color indicates a higher frequency. As another means to visualize the patterns of interactions, we use contour maps in which dependencies with a similar frequency are joined with lines. In this way, it is possible to identify areas of similar strength of dependency. In addition, the number of contours is a parameter that can be tuned to focus the attention on the set of the strongest dependencies.

In the following, for each function and variant of EBNA employed, two figures are shown. The first figure shows the image graph of the dependencies learned by the model in the last generation and contained in the corresponding frequency matrix. The second figure shows the contour graph corresponding to a matrix that stores all the arcs learned by all the models during the evolution. We call this second matrix the cumulative frequency matrix. In order to fairly compare both algorithms using the contour figures, we normalized the frequencies of the arcs by the highest value among the two cumulative matrices learned by each algorithm. The normalized values are later discretized in ten levels. This way the contour lines refer to the same levels of frequencies.

Results for f3deceptive and SixPeaks functions

In the initial experiments we use functions f3deceptive (n = 15) and SixPeaks (n = 16). We analyze some patterns identified in the structures of the learned models and relate them to the behavior exhibited by the algorithms on these functions in the previous studies. We start by using a population size of 150, which was the highest population size used in the complexity experiments shown in previous sections. Figures 7 and 8 respectively show the frequency matrices corresponding to EBNA-Local and EBNA-Exact for function f3deceptive. Both algorithms are able to capture the dependencies corresponding to the problem interactions. This fact may explain the similar behavior exhibited in the time complexity experiments. It can be noticed that the models include a number of additional spurious correlations which are not determined by the function structure. This is particularly evident for the EBNA-Exact algorithm and is explained by the fact that exact learning is more sensitive to overfitting of the data when the population size is small. Therefore, we increase the population size to N = 500 and repeat the same experiment for this function. The frequency matrices obtained are shown in Figures 9 and 10. They reveal the effect of increasing the population size on the dependencies learned.
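The frequency matrices shown in the following figures could be assembled along these lines; this is our own reconstruction of the procedure, with the structure representation as parent index sets assumed for the example.

```python
# Illustrative sketch: building and normalizing (cumulative) frequency matrices.
import numpy as np

def frequency_matrix(structures, n):
    freq = np.zeros((n, n))
    for parents in structures:                 # one structure per generation/run
        for child, pa in enumerate(parents):
            for parent in pa:
                freq[parent, child] += 1
                freq[child, parent] += 1       # symmetrize: direction is ignored
    return freq

def normalize_pair(freq_a, freq_b, levels=10):
    # Normalize by the largest entry among both matrices, then discretize into
    # `levels` bins so the contour lines of both algorithms are comparable.
    top = max(freq_a.max(), freq_b.max()) or 1.0
    return np.ceil(levels * freq_a / top), np.ceil(levels * freq_b / top)
```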
Fig. 7. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Local for function f3deceptive with N = 150: (a) last generation, (b) all generations.
Fig. 8. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Exact for function f3deceptive with N = 150: (a) last generation, (b) all generations.
It can be appreciated that the spurious correlations have almost disappeared from the models. Both algorithms are able to learn an accurate model with a population size of N = 500.

We conduct a similar analysis for function SixPeaks. Figures 11 and 12 respectively show the frequency matrices calculated for EBNA-Local and EBNA-Exact with a population size N = 150. It can be seen that both algorithms are unable to learn the accurate structure. As in the case of the f3deceptive function, EBNA-Exact learns more spurious dependencies than EBNA-Local. This fact is especially evident in Figure 12. In this case the pattern of dependencies is spread across the matrix, while the dependencies learned by EBNA-Local are grouped around the diagonal. This fact may explain the better results achieved by EBNA-Local in the time complexity experiments done for this function. An insufficient population size might be the main reason for the poor quality of the mapping between the function structure and the model structure. The experiments presented in [13] showed that EBNA-Exact was able to learn an accurate model for function SixPeaks with a smaller number of variables.
Fig. 9. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Local for function f3deceptive with N = 500: (a) last generation, (b) all generations.
Fig. 10. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Exact for function f3deceptive with N = 500: (a) last generation, (b) all generations.
Fig. 11. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Local for function SixPeaks with N = 150: (a) last generation, (b) all generations.
Fig. 12. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Exact for function SixPeaks with N = 150: (a) last generation, (b) all generations.
Fig. 13. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Local for function SixPeaks with N = 500: (a) last generation, (b) all generations.
Fig. 14. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Exact for function SixPeaks with N = 500: (a) last generation, (b) all generations.
Therefore, we repeat the experiment using a population size N = 500. Results are shown in Figures 13 and 14. The image graph reveals that, by increasing the population size, EBNA-Exact is able to learn a very accurate structure. The model learned captures all the short-order dependencies of the function. This fact is corroborated by inspecting the contour graph in Figure 14 (b), where there is evidence that exact learning gains in accuracy with respect to a smaller population size. On the other hand, EBNA-Local does not achieve a similar improvement. Furthermore, the accuracy of the approximation is lower than when a population size N = 150 was used, as can be seen by comparing Figures 11 and 13.

Cuban5 function

We analyze the Cuban5 function (m = 1, n = 13). For m = 1, Cuban5 is equal to the sum of three subfunctions:

Cuban5(x) = F_cuban1^5(s_0) + F_cuban2^5(s_1) + F_cuban1^5(s_2)   (9)

The interactions are determined by two different functions, F_cuban1^5 and F_cuban2^5. Therefore, we expect Cuban5 to exhibit a different pattern of interactions than those previously analyzed. As in previous experiments, we start with a population size N = 150. The frequency matrices corresponding to EBNA-Local and EBNA-Exact are respectively shown in Figures 15 and 16. It can be seen in the images calculated from the frequency matrices of the last generation that only some of the dependencies determined by function F_cuban1^5 are captured by both algorithms. However, the cumulative frequencies clearly show the existence of dependencies related to function F_cuban2^5. There are no important differences between the two EDAs. The frequency matrices obtained by increasing the population size to N = 500 are shown in Figures 17 and 18. In this case, the dependencies determined by function F_cuban2^5 are easier to recognize in the frequency matrices of the last
Fig. 15. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Local for function Cuban5 with N = 150: (a) last generation, (b) all generations.
Fig. 16. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Exact for function Cuban5 with N = 150: (a) last generation, (b) all generations.
Fig. 17. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Local for function Cuban5 with N = 500: (a) last generation, (b) all generations.
Fig. 18. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Exact for function Cuban5 with N = 500: (a) last generation, (b) all generations.
generation. However, although the population sizes have grown, for this function both algorithms have learned a similar structure. This fact could explain the results achieved in the study of the convergence reliability presented in Section 7.

Results for the CPF function

The CPF function represents another interesting class of functions. It has been shown that Bayesian-network-based EDAs such as BOA are deceived by this function: although CPF is a decomposable function of bounded complexity, BOA scales exponentially on it. Furthermore, in [8] it is shown that increasing the population size does not always produce an improvement in the algorithm's behavior. The authors point to the fact that the learning algorithm used by BOA may fail to detect the higher-order type of interactions that occur in the CPF function. We will investigate whether there are differences between exact and local learning for the CPF function with parameters n = 15, k = 5, C_odd = 5 and C_even = 0. For these parameters, the optimum is reached at 2^12 different points. As a consequence, it is very likely that EBNA reaches the global solution in the first generation. On the other hand, the limitations of the exact learning algorithm do not allow us to deal with a higher number of variables. Therefore, we will only analyze the models learned in the first generations of the EDAs, disregarding whether the optimum has been found or not. The models have been calculated using 30 independent experiments.

We start with a population size N = 150. For this value and higher values of the population size, none of the algorithms was able to recover any type of structure. However, for N = 1000, EBNA-Exact was able to detect some structure. The frequency matrices calculated for EBNA-Local and EBNA-Exact are shown in Figure 19. Surprisingly, EBNA-Exact was able to recover an almost perfect structure while EBNA-Local was not. These results reveal that, for problems such as
Fig. 19. Frequency matrices (variable i vs. variable j) calculated from the models learned in the first generation of the EDAs for function CPF with N = 1000: (a) EBNA-Local, (b) EBNA-Exact.
CPF, an accurate learning of the model might be essential to recover the correct structure of the problem. It also shows that, even with exact learning, the population size required to discover the problem structure is higher than for the other additive functions considered.

HP protein model
The HP model has served as a benchmark for studying different issues related to the behavior of EDAs [35, 38, 40]. It is a non-binary, non-decomposable problem for which extensive investigation using evolutionary and other heuristic algorithms has been conducted (see [10, 40] and references therein). We use one instance of the HP model to investigate the impact of exact learning. Figure 1 shows one optimal folding for the chosen sequence HPHPPHHPHPPHPHHPPHPH. In the evaluation of the HP model, two variants are considered. In the first one, infeasible individuals are assigned a penalty. In the second variant, individuals are first repaired and after that the HP function (from now on the Protein function) is used to evaluate them. In all the experiments conducted for the Protein function, N = 200 and 50 independent experiments of EBNA-Local and EBNA-Exact were run. Since the Protein function is not decomposable, a detailed description of the problem structure is not available and we cannot contrast the dependencies learned with a perfect model of the interactions. However, previous research on the application of EDAs to the HP problem [40] has shown that important dependencies between adjacent variables arise. These dependencies are in part determined by the codification used, in which each residue's position depends on the position of the previous two. Thus, the objective of our experiments is twofold: firstly, to compare the class of models learned by EBNA-Local and EBNA-Exact; secondly, to investigate the effect that the application of the repair mechanism has on the number and patterns of the interactions learned by the EDAs.
Fig. 20. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Local for function Protein, with the repairing procedure and N = 200: (a) last generation, (b) all generations.
Fig. 21. Frequency matrices (variable i vs. variable j) calculated from the models learned by EBNA-Exact for function Protein, with the repairing procedure and N = 200: (a) last generation, (b) all generations.
50
20
18
45
18
16
40
16
14
35
14
12
30
10
25
8
20
6
15
4
10
2
5 5
10
15
Variable j
Variable j
Fig. 21. Frequency matrices calculated from the models learned by EBNA-Exact for function P rotein, repairing procedure with N = 200 (a) Last generation (b) All generations
12 10 8 6 4 2
20
5
Variable i
10
15
20
Variable i
a)
b)
20
50
20
18
45
18
16
40
16
14
35
14
12
30
10
25
8
20
6
15
4
10
2
5 5
10
Variable i
a)
15
20
Variable j
Variable j
Fig. 22. Frequency matrices calculated from the models learned by EBNA-Local for function P rotein, without repairing procedure with N = 200 (a) Last generation (b) All generations
12 10 8 6 4 2 5
10
15
20
Variable i
b)
Fig. 23. Frequency matrices calculated from the models learned by EBNA-Exact for function P rotein, without repairing procedure with N = 200 (a) Last generation (b) All generations
Fig. 24. Number of dependencies learned by EBNA-Exact and EBNA-Local at each generation for the Protein function: (a) with the repairing procedure, (b) without the repairing procedure. (Axes: dependencies vs. generation.)
Figures 20 and 21 respectively show the frequency matrices learned by EBNA-Local and EBNA-Exact when the repairing procedure is applied. Figures 22 and 23 show the frequencies corresponding to the variant in which the repairing procedure is not applied. An analysis of the figures reveals that EBNA-Exact learns a pattern of interactions more localized around the diagonal, representing the dependencies between adjacent variables. The dependencies found by EBNA-Local are more spread out, away from the diagonal. We also observe some differences due to the application of the repairing procedure. These differences are particularly noticeable from the analysis of the contour graphs. Taking as an accuracy criterion the connectedness of the adjacent variables in the problem representation, we see that repairing helps EBNA-Local to learn more accurate structures; without repairing, the pattern of interactions is more fragmented. However, repairing does not help EBNA-Exact, which is able to recover a more connected structure without the application of the repairing procedure. We also analyze the number of dependencies learned by the algorithms at each generation. Figures 24 (a) and (b) respectively show the sum, over the 50 experiments, of the number of dependencies learned by EBNA-Exact and EBNA-Local with and without the application of the repairing procedure. EBNA-Local and EBNA-Exact have a similar behavior: in the initial generations, the number of dependencies learned increases until a maximum is reached, and then the number of dependencies starts to diminish.
9 Related Work

In [3], an empirical comparison of EBNAs that use different learning algorithms has been presented. Also, in [44] different variants of learning algorithms have been evaluated in the context of EDAs that use polytree models (a constrained
class of Bayesian networks). The use of exact learning algorithms for Bayesian networks was introduced to EDAs in [13], where preliminary results were presented. Our work is part of an ongoing research trend that investigates the relationship between the problem structure and the class of structures learned during the search by the probabilistic models. A number of researchers have studied the most frequent dependencies learned by the probabilistic models in EDAs and analyzed their mapping to the function structure [2, 22, 27, 34, 37]. A promising related idea is the use of the dependency relationships represented by the probabilistic model to define functions with a desired degree of interaction [31]. The relationship between problem structure and dependencies is analyzed from two different perspectives in [36]. First, using Pearson's chi-square statistic as a measure of the strength of the interactions between pairs of variables in EDAs, the emergence of dependencies due to the selection operator is shown. Second, it is shown that for some problems only a subset of the dependencies may be needed to solve the problem. More recently, some work has been devoted to analyzing the way in which the different components of the EDA influence the emergence of dependencies [16] and to using the probabilistic models obtained by EDAs to speed up the solution of similar problems in the future [17]. However, the accuracy of the learning algorithm in recovering the problem structure from the data has not been investigated in these papers. For instance, for the spin glass problem used as a testbed in [16], most of the dependencies found by hBOA are short dependencies between neighbors in the grid, but some long-range interactions also appear. We point out that the approximate learning algorithm may produce models that are only an approximate representation of the actual dependencies that arise in the population. The error introduced by the learning method in the estimation of the dependencies should also be taken into account.
10 Conclusions

In this work we have carried out a detailed analysis of the use of exact learning of the Bayesian network structure in the study of EDAs. We have conducted systematic experiments for several functions. The results show that the type of learning algorithm (whether exact or approximate) may produce significant differences in the class of models learned and in the performance of EBNA. This fact is important because Bayesian network models learned using approximate algorithms are usually thought to accurately reflect the dependencies that arise in the population. As the example of the CPF function illustrates, this might not be the case for functions with a particular type of higher-order dependencies. On the other hand, we have shown that whenever the size of the problem is manageable, exact learning of Bayesian networks is a more appropriate option for theoretical analysis of the probabilistic dependencies. We have shown that the analysis of the probabilistic models can reveal the effect that some EDA
components, such as repairing procedures, have on the emergence of dependencies. By using exact learning, we have confirmed the critical effect that an inadequate population size may have on the ability to capture an accurate probabilistic model.

Among the trends for future research we identify the following:

• Design of feasible approaches to apply Bayesian network exact learning algorithms to problems with a higher number of variables. A possible alternative would treat these problems by initially identifying interacting sets of variables of manageable size and applying exact learning in each set to obtain a more accurate model of the interactions. Efficient methods for clustering the variables according to the mutual information have already been applied in EDAs [39].
• Application of more advanced techniques for extracting and visualizing the information contained in the models. As the importance of using the information contained in the probabilistic models learned by EDAs is acknowledged, it becomes more necessary to apply more advanced techniques for information extraction and data visualization.
• Use of the most probable configurations to investigate the influence of the learning algorithms and other EDA components. Procedures that use the probabilistic models learned by EDAs either take advantage of the problem structure or use the probabilistic tables corresponding to some sets of marginal and conditional probabilities. However, information contained in the models can, in many cases, be translated into a set of most probable configurations (with their associated probabilities), which are usually not generated during the evolution. Most probable configurations can help to improve EDA behavior [26], but they could also be used to investigate the algorithms and extract relevant problem information.
• Another way to improve the results of the learning algorithms, particularly of the exact variant, in the discovery of accurate models could be to increase the quality of the information contained in the population. Research in this direction has been reported in [7].
• Exact learning could be used to investigate the effects that the existence of constraints, such as those imposed by repairing procedures, has on the emergence of dependencies. Constrained problems remain an important challenge for EDAs. While it is generally difficult to represent the constraints in the probabilistic model, the use of repairing procedures may introduce undesired bias in the construction of solutions. By investigating the structure of the model, it could be possible to detect the bias introduced and conceive ways of correcting it.

Finally, we emphasize that the study of the relationship between the problem structure and the dependencies captured by the probabilistic model should provide answers to the fundamental question of how to select appropriate probabilistic models to optimize a given problem in the framework of EDAs.
Acknowledgments This work has been partially supported by the Etortek, Saiotek and Research Groups 2007-2012 (IT-242-07) programs (Basque Government), TIN2005-03824 and Consolider Ingenio 2010 - CSD2007-00018 projects (Spanish Ministry of Education and Science) and COMBIOMED network in computational biomedicine (Carlos III Health Institute).
References 1. Baluja, S., Davies, S.: Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space. In: Proceedings of the 14th International Conference on Machine Learning, pp. 30–38. Morgan Kaufmann, San Francisco (1997) 2. Bengoetxea, E.: Inexact Graph Matching Using Estimation of Distribution Algorithms. PhD thesis, Ecole Nationale Sup´erieure des T´el´ecommunications (2003) 3. Blanco, R., Lozano, J.A.: Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. In: Larra˜ naga, P., Lozano, J.A. (eds.) An empirical comparison of discrete Estimation of Distribution Algorithms, pp. 163–176. Kluwer Academic Publishers, Dordrecht (2002) 4. Buntine, W.: Theory refinement on Bayesian networks. In: Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pp. 52–60 (1991) 5. Castillo, E., Gutierrez, J.M., Hadi, A.S.: Expert Systems and Probabilistic Network Models. Springer, Heidelberg (1997) 6. Chickering, D.M., Geiger, D., Heckerman, D.: Learning Bayesian networks is NPhard. Technical Report MSR-TR-94-17, Microsoft Research, Redmond, WA (1994) 7. Chuang, C.Y., Chen, Y.P.: On the effectiveness of distribution estimated by probabilistic model building. Technical Report: NCL-TR-2008001, Natural Computing Laboratory NCLab. Department of Computer Science. National Chiao Tung University (February 2008) 8. Coffin, D.J., Smith, R.E.: The limitations of distribution sampling for linkage learning. In: Proceedings of the 2007 Congress on Evolutionary Computation CEC 2007, pp. 364–369. IEEE Press, Los Alamitos (2007) 9. Cotta, C.: Protein structure prediction using evolutionary algorithms hybridized ´ with backtracking. In: Mira, J., Alvarez, J.R. (eds.) IWANN 2003. LNCS, vol. 2687, pp. 321–328. Springer, Heidelberg (2003) 10. Cutello, V., Nicosia, G., Pavone, M., Timmis, J.: An immune algorithm for protein structure prediction on lattice models. IEEE Transactions on Evolutionary Computation 11(1), 101–117 (2007) 11. Dill, K.A.: Theory for the folding and stability of globular proteins. Biochemistry 24(6), 1501–1509 (1985) 12. Eaton, D., Murphy, K.: Exact Bayesian structure learning from uncertain interventions. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (2007) 13. Echegoyen, C., Lozano, J.A., Santana, R., Larra˜ naga, P.: Exact Bayesian network learning in estimation of distribution algorithms. In: Proceedings of the 2007 Congress on Evolutionary Computation CEC 2007, pp. 1051–1058. IEEE Press, Los Alamitos (2007)
14. Etxeberria, R., Larra˜ naga, P.: Global optimization using Bayesian networks. In: Ochoa, A., Soto, M.R., Santana, R. (eds.) Proceedings of the Second Symposium on Artificial Intelligence (CIMAF 1999), Havana, Cuba, pp. 151–173 (1999) 15. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989) 16. Hauschild, M., Pelikan, M., Lima, C., Sastry, K.: Analyzing probabilistic models in hierarchical BOA on traps and spin glasses. In: Thierens, D., et al. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference GECCO 2007, vol. I, pp. 523–530. ACM Press, London (2007) 17. Hauschild, M., Pelikan, M., Sastry, K., Goldberg, D.E.: Using previous models to bias structural learning in the hierarchical BOA. MEDAL Report No. 2008003, Missouri Estimation of Distribution Algorithms Laboratory (MEDAL) (2008) 18. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20, 197–243 (1995) 19. Inza, I., Larra˜ naga, P., Sierra, B.: Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. In: Larra˜ naga, P., Lozano, J.A. (eds.) Feature weighting for nearest neighbor by estimation of distribution algorithms, pp. 291– 308. Kluwer Academic Publishers, Dordrecht (2002) 20. Koivisto, M., Sood, K.: Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research 5, 549–573 (2004) 21. Larra˜ naga, P., Lozano, J.A. (eds.): Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Dordrecht (2002) 22. Lima, C.F., Pelikan, M., Goldberg, D.E., Lobo, F.G., Sastry, K., Hauschild, M.: Influence of selection and replacement strategies on linkage learning in BOA. In: Proceedings of the 2007 Congress on Evolutionary Computation CEC 2007, pp. 1083–1090. IEEE Press, Los Alamitos (2007) 23. Lozano, J.A., Larra˜ naga, P., Inza, I., Bengoetxea, E. (eds.): Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms. Springer, Heidelberg (2006) 24. Lozano, J.A., Sagarna, R., Larra˜ naga, P.: Parallel estimation of Bayesian networks algorithms. In: Evolutionary Computation and Probabilistic Graphical Models. Proceedings of the Third Symposium on Adaptive Systems (ISAS 2001), Havana, Cuba, pp. 137–144 (March 2001) 25. Mendiburu, A., Lozano, J., Miguel-Alonso, J.: Parallel implementation of EDAs based on probabilistic graphical models. IEEE Transactions on Evolutionary Computation 9(4), 406–423 (2005) 26. Mendiburu, A., Santana, R., Lozano, J.A.: Introducing belief propagation in estimation of distribution algorithms: A parallel framework. Technical Report EHU-KATIK-11/07, Department of Computer Science and Artificial Intelligence, University of the Basque Country (October 2007) 27. M¨ uhlenbein, H., H¨ ons, R.: The estimation of distributions and the minimum relative entropy principle. Evolutionary Computation 13(1), 1–27 (2005) 28. M¨ uhlenbein, H., Mahnig, T.: Evolutionary synthesis of Bayesian networks for optimization. In: Patel, M., Honavar, V., Balakrishnan, K. (eds.) Advances in Evolutionary Synthesis of Intelligent Agents, pp. 429–455. MIT Press, Cambridge (2001) 29. M¨ uhlenbein, H., Mahnig, T., Ochoa, A.: Schemata, distributions and graphical models in evolutionary optimization. Journal of Heuristics 5(2), 213–247 (1999)
30. M¨ uhlenbein, H., Paaß, G.: From recombination of genes to the estimation of distributions I. Binary parameters. In: Ebeling, W., Rechenberg, I., Voigt, H.-M., Schwefel, H.-P. (eds.) PPSN 1996. LNCS, vol. 1141, pp. 178–187. Springer, Heidelberg (1996) 31. Ochoa, A., Soto, M.R.: Linking entropy to estimation of distribution algorithms. In: Lozano, J.A., Larra˜ naga, P., Inza, I., Bengoetxea, E. (eds.) Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms, pp. 1–38. Springer, Heidelberg (2006) 32. Pelikan, M.: Hierarchical Bayesian Optimization Algorithm. Toward a New Generation of Evolutionary Algorithms. Studies in Fuzziness and Soft Computing. Springer, Heidelberg (2005) 33. Pelikan, M., Goldberg, D.E., Cant´ u-Paz, E.: BOA: The Bayesian optimization algorithm. In: Banzhaf, W., Daida, J., Eiben, A.E., Garzon, M.H., Honavar, V., Jakiela, M., Smith, R.E. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference GECCO 1999, Orlando, FL, vol. I, pp. 525–532. Morgan Kaufmann Publishers, San Francisco (1999) 34. Santana, R.: Estimation of distribution algorithms with Kikuchi approximations. Evolutionary Computation 13(1), 67–97 (2005) 35. Santana, R., Larra˜ naga, P., Lozano, J.A.: Protein folding in 2-dimensional lattices with estimation of distribution algorithms. In: Barreiro, J.M., Mart´ın-S´ anchez, F., Maojo, V., Sanz, F. (eds.) ISBMDA 2004. LNCS, vol. 3337, pp. 388–398. Springer, Heidelberg (2004) 36. Santana, R., Larra˜ naga, P., Lozano, J.A.: Interactions and dependencies in estimation of distribution algorithms. In: Proceedings of the 2005 Congress on Evolutionary Computation CEC 2005, Edinburgh, U.K, pp. 1418–1425. IEEE Press, Los Alamitos (2005) 37. Santana, R., Larra˜ naga, P., Lozano, J.A.: The role of a priori information in the minimization of contact potentials by means of estimation of distribution algorithms. In: Marchiori, E., Moore, J.H., Rajapakse, J.C. (eds.) EvoBIO 2007. LNCS, vol. 4447, pp. 247–257. Springer, Heidelberg (2007) 38. Santana, R., Larra˜ naga, P., Lozano, J.A.: Component weighting functions for adaptive search with EDAs. In: Proceedings of the 2008 Congress on Evolutionary Computation CEC 2008, Hong Kong. IEEE Press, Los Alamitos (accepted for publication, 2008) 39. Santana, R., Larra˜ naga, P., Lozano, J.A.: Estimation of distribution algorithms with affinity propagation methods. Technical Report EHU-KZAA-IK-1/08, Department of Computer Science and Artificial Intelligence, University of the Basque Country (January 2008), http://www.sc.ehu.es/ccwbayes/technical.htm 40. Santana, R., Larra˜ naga, P., Lozano, J.A.: Protein folding in simplified models with estimation of distribution algorithms. IEEE Transactions on Evolutionary Computation (to appear, 2008) 41. Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 7(2), 461– 464 (1978) 42. Silander, T., Myllymaki, P.: A simple approach for finding the globally optimal Bayesian network structure. In: Proceedings of the 22th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2006). Morgan Kaufmann Publishers, San Francisco (2006) 43. Singh, A., Moore, A.: Finding optimal Bayesian networks by dynamic programming. Technical report, Carnegie Mellon University (June 2005)
44. Soto, M.R., Ochoa, A.: A factorized distribution algorithm based on polytrees. In: Proceedings of the 2000 Congress on Evolutionary Computation CEC 2000, La Jolla Marriott Hotel La Jolla, California, USA, July 6-9, pp. 232–237. IEEE Press, Los Alamitos (2000) 45. Soto, M.R., Ochoa, A., Acid, S., Campos, L.M.: Bayesian evolutionary algorithms based on simplified models. In: Ochoa, A., Soto, M.R., Santana, R. (eds.) Proceedings of the Second Symposium on Artificial Intelligence (CIMAF 1999), Havana, Cuba, pp. 360–367 (March 1999)
Linkage Learning in Estimation of Distribution Algorithms
David Coffin and Robert E. Smith
University College London, Gower Street, London, WC1E 6BT
[email protected], [email protected]
Abstract. This chapter explains how the structural learning performed by multi-variate estimation of distribution algorithms (EDAs) while building their probabilistic models is a form of linkage learning. We then show how the linkage learning mechanisms of multi-variate EDAs can be misled with the help of two test problems: the concatenated parity function (CPF) and the concatenated parity/trap function (CP/TF). Although these functions are separable, with bounded complexity and uniformly scaled sub-function contributions, the hierarchical Bayesian Optimization Algorithm (hBOA) scales exponentially on both. We argue that test problems containing parity functions are hard for EDAs because there are no interactions in the contribution to fitness between any strict subset of a parity function's bits. This means that as population sizes increase, the dependency between variable values for any strict subset of a parity function's bits decreases. Unfortunately, most EDAs, including hBOA, search for their models by looking for dependencies between pairs of variables (at least at first). We make suggestions on how EDAs could be adjusted to handle parity problems, but also comment on the apparently inevitable computational cost.
1 Introduction Estimation of distribution algorithms (EDAs) [21, 14, 12] are a form of evolutionary algorithm (EA) similar to genetic algorithms (GAs), except that they generate new individuals by sampling a probabilistic model of a selection from the population, rather than through the application of genetic operators. Univariate EDAs such as PBIL [2], the compact GA [9] and UMDA [15] only model marginal probabilities, whereas multivariate EDAs such as the FDA [13], the ECGA [8], EBNA [7], BOA [20], hBOA [19] and the EDA based on maximum entropy presented in Wright et al. [28] also model joint probabilities. Where the joint probabilities differ from the product of the marginals, the EDA has incorporated dependency information into its model; it seems reasonable to say that the EDA has learnt linkage. We discuss the relationship between this and other definitions of linkage in Sect. 2. hBOA is a state-of-the-art, scalable EDA that uses a Bayesian network as its probabilistic model, and incorporates modifications to BOA that allow solution of many hierarchical problems in polynomial time, so it is a reasonable choice for testing a fitness function designed to be pathologically difficult for EDAs. In Sect. 6 we present results that show hBOA scales exponentially on the CPF for normal population sizes. We modified the CPF to incorporate deceptive traps, so that hBOA could no longer solve it with
very small population sizes, creating the concatenated parity/trap function (CP/TF). hBOA scales exponentially on the CP/TF. We believe that parity functions are hard for EDAs in general, and in Sect. 7 we analyze why this is so. In Sect. 8 we discuss two ways in which EDAs could be modified to optimize functions with separable parity components and in Sect. 9 we present our conclusions.
2 Linkage in EDAs Linkage was a concept borrowed from biology, where it was only loosely defined by geneticists to describe genes located near to each other on the same chromosome and which therefore had a greater than average chance of being inherited from the same parent. This sense of linkage continues in the EA literature; for example, Harik defines linkage as the probability of two variables being inherited from the same parent [10]. See also [3] for a survey of linkage in EDAs, which includes a discussion of the biologists' definition of linkage. In contrast to this definition of linkage in terms of inheritability, variables are sometimes called linked if their contribution to the fitness of an individual is positively interdependent, particularly in the perturbation-based linkage learning literature [16, 11], in which interdependencies in contribution to fitness are detected by explicitly measuring the fitness of local regions of the fitness landscape. Interdependence of genes' contributions to fitness and interdependence of genes' inheritability are clearly closely related in both biological and computational evolution. In terms of building blocks, interdependencies in the fitness landscape determine which variables are members of the same building block. The extent to which the interdependencies in the inheritability match this determines the rates of building block disruption and mixing. Linkage learning EAs attempt to manipulate the inheritability linkage so it reflects the salient interdependencies within the fitness landscape. We would like to apply the inheritability definition of linkage to EDAs (although we note that applying schema theory concepts to EDAs is controversial [23]), but in EDAs there are no parents, so there is no longer any sense in which we can say the values of two decision variables in a new individual did or did not come from the same parent. The values may be equal to the values in a single individual in the original selection, but that doesn't tell us anything about whether the EDA is disrupting those building blocks. Instead we have to consider the structure of the EDA's probabilistic model and how that affects which dependencies within the selection are preserved in the next generation. First, consider a partition containing two variables. In GAs without mutation the crossover operator alone determines the schema disruption, so the probability of those two variables being drawn from the same individual determines the probability of schema disruption for each of the schemata in that partition. The probability of schema creation, and so the total probability of schema transmission, depends on the distribution of the selection as well (e.g. if the selection has converged on the values for these variables, then the overall schema survival rate for crossover has to be 100%). So for GAs without mutation, linkage (the probability of both variables coming from the same parent) equals
the probability of a schema surviving crossover, for every schema in our two-variable partition. This can be thought of in terms of dependency preservation. By definition, two variables are dependent if their joint probability distribution does not equal the product of their marginal distributions. If two variables have a linkage of 1 (they will always be selected from the same parent), then the expected distribution of these two variables in the next generation will be the same as the distribution of these two variables in the current selection. If the linkage is 0 (the two variables will never be selected from the same parent), such as two variables at opposite ends of an individual in one-point crossover, then the expected distribution of these two variables will equal the product of their marginal frequencies. Any dependency between the two variables that was present in the selection is lost in the next generation. Thinking about linkage in terms of the capacity to preserve dependency is finally a level of abstraction general enough for EDAs. We use the phrase "capacity to preserve" carefully here: given a selection whose variables are independent (such as a random uniform distribution), the linkage of an EA cannot add dependency within the expected distribution of the next generation. We think of a set of variables' linkage as the expected proportion of their mutual information in the current selection that will be present in the next generation.
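To make the idea of linkage as dependency preservation concrete, the short sketch below (a hypothetical illustration in Python, not code from the original study) builds a selection in which two bits are perfectly correlated and compares offspring produced with linkage 1 (both bits copied from the same selected individual) against offspring produced with linkage 0 (each bit sampled independently from its marginal), measuring how much mutual information survives.

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) between two binary variables."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

random.seed(0)
# A selection in which the two bits are perfectly correlated (only 00 and 11 occur).
selection = [(b, b) for b in (random.randint(0, 1) for _ in range(10000))]

# Linkage 1: both bits are always taken jointly from the same selected individual.
linked_offspring = [random.choice(selection) for _ in range(10000)]

# Linkage 0: each bit is sampled independently from its marginal distribution.
marg0 = [a for a, _ in selection]
marg1 = [b for _, b in selection]
unlinked_offspring = [(random.choice(marg0), random.choice(marg1)) for _ in range(10000)]

print("MI in selection:            %.3f bits" % mutual_information(selection))
print("MI with linkage 1:          %.3f bits" % mutual_information(linked_offspring))
print("MI with linkage 0 (lost):   %.3f bits" % mutual_information(unlinked_offspring))
```

With linkage 1 the dependency is preserved (about one bit of mutual information), while with linkage 0 it is lost, which is exactly the distinction the definition above is meant to capture.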
3 The CPF Test Problem The concatenated parity function (CPF) was designed as a pathological test problem for EDAs [5]. It is a simple k-bounded, additively separable problem composed of parity sub-functions, where parity functions are the generalisation of the XOR function and the Walsh transform.

parity(S) = \begin{cases} c_{even} & \text{if } bc(S) \text{ is even} \\ c_{odd} & \text{otherwise} \end{cases} \qquad (1)

where bc is a bit count (unitation) function over bit string S. The CPF is defined below, where m is the number of concatenated sub-functions, k is the size, in bits, of each sub-function, and S_i denotes the i'th bit in bit string S.

CPF(S) = \sum_{i=0}^{m-1} parity(S_{ik} \ldots S_{(i+1)k-1}) \qquad (2)
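A direct implementation of definitions (1) and (2) may help fix the notation; the sketch below is an illustrative Python rendering in which c_even and c_odd are parameters (the experiments later in the chapter use c_odd = 5 and c_even = 0), and a candidate solution is simply a list of 0/1 values.

```python
def parity(block, c_even=0, c_odd=5):
    """Equation (1): c_even if the block contains an even number of 1s, else c_odd."""
    return c_even if sum(block) % 2 == 0 else c_odd

def cpf(bits, k=5, c_even=0, c_odd=5):
    """Equation (2): sum of parity sub-functions over consecutive, non-overlapping k-bit blocks."""
    assert len(bits) % k == 0
    return sum(parity(bits[i * k:(i + 1) * k], c_even, c_odd)
               for i in range(len(bits) // k))

# Example: a 10-bit string made of two odd-parity 5-bit blocks is optimal.
print(cpf([1, 0, 0, 0, 0, 1, 1, 1, 0, 0]))   # 10 = 2 * c_odd
```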
Although each parity function in the CPF contains a number of alternative building blocks, and any pair of variables within each parity function is interdependent, we expect an equal number of each schema in the selection made by an EDA. For example, consider an EDA trying to detect interactions between the first two variables in a five bit parity sub-function, and suppose our selection contains one of every odd parity individual; then each of the 4 possible schemata would contain 1/4 of the selection (Table 1). As we shall show in Sect. 7.1, the expected selection contains an equal proportion of individuals in each schema defining the variables in any 2, 3 or 4 bit partition.
Table 1. Expected proportion of selection in each schema over the first two bits of a 5-bit parity function

Schema | Proportion of selection | Members of selection
00###  | 1/4                     | 00001, 00010, 00100, 00111
01###  | 1/4                     | 01000, 01011, 01101, 00110
10###  | 1/4                     | 10001, 10010, 10100, 10111
11###  | 1/4                     | 11000, 11011, 11101, 10110
Since EDAs consider only the proportion of the selection contained within schemata when learning linkage, and typically only between pairs of variables (at least to begin with), the variables in a parity function should appear independent to the EDA. Clearly, for every parity sub-function half of the 2^k sub-strings that the function is defined over have odd parity. For the CPF as a whole there are still a large number of optimal strings, 2^{l-m}. However, this is 1/2^m of the entire search space, an exponentially decreasing proportion. This means that if an EDA were to select solutions to each sub-function at random (which is effectively what we expect), then the EDA would require O(2^m) fitness evaluations to find an optimum. Whilst we designed the CPF to be EDA-hard, it should be extremely easy for a hill-climber (in contrast to deceptive traps), since the values for each parity sub-function are at most one mutation away from a building block.
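The uniform schema proportions claimed here (and illustrated in Table 1) can be verified by enumeration; the following sketch, assuming k = 5, counts how the 2^{k-1} odd-parity strings fall into the schemata obtained by fixing any two positions.

```python
from itertools import combinations, product
from collections import Counter

k = 5
odd_parity_strings = [s for s in product((0, 1), repeat=k) if sum(s) % 2 == 1]  # 2^(k-1) = 16 strings

for positions in combinations(range(k), 2):                       # every 2-bit partition
    counts = Counter(tuple(s[p] for p in positions) for s in odd_parity_strings)
    # Each of the 4 settings of the two fixed bits covers exactly 1/4 of the selection.
    assert all(c == len(odd_parity_strings) // 4 for c in counts.values()), positions

print("Every 2-bit schema contains exactly 1/4 of the odd-parity strings.")
```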
4 Experimental Design We used non-overlapping concatenated test functions, with k = 5, c_odd = 5 and c_even = 0, for all experiments. For both the scalability and bisection experiments, we measured the number of evaluations required for hBOA to find any one of these optima.
5 Bisection Method We conducted scalability and population size control map experiments with hBOA, following the methodology and parameter settings of Pelikan et al. [22]. In particular, for the scalability experiments we used the bisection method to determine population size, with assumed convergence in a maximum of n generations, searching for a population size for which 10 out of 10 trials were successful, and averaging over 10 bisection sizing runs. The run time for the bisection method was limited to one week on a single processor, which in turn determined the maximum problem size for both scalability experiments. Once the population size was determined using the bisection method, we ran hBOA 100 times for each problem size. The bisection method is a simple algorithm for searching over unbounded inputs (population sizes have no upper bound). Here we describe how the bisection method is used to find the minimal population size for which a search algorithm optimizes a problem reliably. The bisection method has two phases, where the first
phase establishes bounds for the search, and the second phase performs a bisection search within those bounds. In the first phase, an initial population size n is tested. The test passes if the EA can find any of the optimal solutions to the test function within n generations, for 10 out of 10 trials. If this test fails, this suggests the population size is too small, and the population size is doubled. If the test is passed, this suggests the population size is too large, and the population size is halved. This continues until the bisection method finds two population sizes n_min and n_max = 2·n_min, where the test fails for n_min and passes for n_max. Having established this range, in the second phase the bisection method reverts to a standard bisection search. It considers the midpoint n_mid = (n_min + n_max)/2. If the test passes for n_mid, n_max is decreased to n_mid, otherwise n_min is increased to n_mid. This continues until n_min and n_max are within some threshold distance of each other, and n_max is returned. For population sizing this implies that n_max is the smallest population size (within a specified threshold) for which an EA can reliably optimize a test problem in the number of generations specified by convergence theory, as long as we assume that the number of generations required to converge is a monotonic, non-increasing function of population size. For clarity in the following discussion we will refer to any population size determined by this method as a bisection population size attractor. We shall see in Sect. 6.2 that if the monotonicity assumption is not true, as it is not for hBOA on the CPF, then we may get more than one bisection population size attractor.
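A sketch of the procedure just described is given below; run_ea(pop_size) is a placeholder assumed to return True when the EA finds an optimum within the generation limit, the 10-out-of-10 success test follows the description above, and the relative stopping tolerance for the second phase is a choice made only for illustration.

```python
def reliable(run_ea, pop_size, trials=10):
    """The test passes only if an optimum is found in every one of the trials."""
    return all(run_ea(pop_size) for _ in range(trials))

def bisection_population_size(run_ea, initial_size=100, tolerance=0.1):
    # Phase 1: find a failing size n_min and a passing size n_max = 2 * n_min.
    n = initial_size
    if reliable(run_ea, n):
        while n > 1 and reliable(run_ea, n // 2):   # too large: keep halving
            n //= 2
        n_min, n_max = n // 2, n
    else:
        while not reliable(run_ea, 2 * n):          # too small: keep doubling
            n *= 2
        n_min, n_max = n, 2 * n
    # Phase 2: standard bisection search between the two bounds.
    while (n_max - n_min) > max(1, tolerance * n_max):
        n_mid = (n_min + n_max) // 2
        if reliable(run_ea, n_mid):
            n_max = n_mid                           # passing sizes pull the upper bound down
        else:
            n_min = n_mid                           # failing sizes push the lower bound up
    return n_max
```

Because run_ea is stochastic, repeated tests of the same population size may disagree; the experiments described above additionally average the returned size over 10 independent bisection sizing runs.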
6 Results
6.1 CPF Scalability
Figure 1 is a semi-log plot of the mean number of evaluations hBOA required to find any individual with optimal fitness on the CPF problem. The associated bisection population size attractors are also plotted. Results indicate that hBOA scales exponentially (the fitted curve is approximately 1.49 e^{0.123 n} evaluations, R^2 = 0.942). However, there appears to be a phase transition at around 65 bits in both the number of evaluations and the bisection population size attractor. In particular, population size attractors increase from relatively small sizes (under 100) to sizes of 1000 or more.
6.2 CPF Population Size Control Map
To investigate the transition in Fig. 1, we measured the number of evaluations required to solve the 80 bit CPF. We also examined the 120 bit CPF, to illustrate how the effects we observed in the 80 bit problem extend as problem size increases. We chose to vary the population sizes from 10 to 1,000, limiting the maximum number of evaluations to 10,000,000. The numbers of evaluations required are plotted together with the mean number of generations in Fig. 2. For low population sizes many of the runs failed to find an optimal solution within the 10,000,000 evaluation limit. A scaled plot of the success rate is also included in Fig. 2.
Fig. 1. A semi-log scalability plot of hBOA on the CPF (mean evaluations to first hit versus problem size n; series: evaluations, bisection attractor, and exponential fit of the evaluations).
Fig. 2. A population size control map for hBOA on the 80 bit CPF (evaluations and generations versus population size; series: mean evaluations (successful runs), premature convergence (scaled), and mean generations (successful runs)).
Fig. 3. A population size control map for hBOA on the 120 bit CPF (same axes and series as Fig. 2).
The number of evaluations required is fairly constant at the upper bounds of this range. Figure 3 presents a similar control map for the 120 bit CPF. Note that while there is a downward slope in the mean number of evaluations with increasing population size, the algorithm fails to find an optimal solution reliably within the 10,000,000 evaluation limit. Discussion The control maps have several interesting features. Both control maps show large ranges of population sizes for which hBOA can reliably optimize the CPF, but only in a very large number of evaluations. This appears to contradict the convergence theory in [25], which states that an EA with appropriate selection pressure should converge within a number of generations that is linear in the problem size, and which Pelikan et al. [22] interpret to place an upper bound of n generations on hBOA's convergence. Both control maps contain regions where hBOA reliably solves the CPF, but taking far more than n generations. For example, on the 120 bit CPF hBOA with a population size of 300 takes on average 12,988 generations, around 108 times longer than the n generation interpretation of the convergence theory. The same feature is also present on the 80 bit CPF, where with a population size of 200, hBOA converges reliably to an optimum, but takes on average 250 generations, or around 3 times longer than the n generation interpretation of the convergence theory. Counterintuitively, both control maps show that hBOA can optimize the CPF reliably in fewer generations for small population sizes than for large population sizes.
For example, on the 120 bit CPF with a population size of 110 individuals hBOA reliably finds an optimum using on average 18,926 evaluations. This increases to 3,899,671 evaluations for a population size of 300 individuals, and with a population size of 400 individuals, 46 out of 100 runs failed to find an optimum in 10,000,000 evaluations. Together, the unusually long times taken to find an optimum, and the fact that hBOA finds those optima not only with fewer evaluations, but with fewer generations, for smaller population sizes, suggest that hBOA is solving the CPF in an unusual way. We discuss this further in Sect. 7. Further, the control maps suggest one explanation for the apparent transition in the CPF scalability results, as they show that hBOA's performance on the CPF violates the monotonicity assumption made by the bisection method. To explain this, we examine the bisection population size attractors found with the n generation limit used in the scalability experiments. We illustrated this on the 80 bit problem control map plot (Fig. 2). Examining the control map reveals two bisection population size attractors, at 100 and 960 individuals. In this plot, we assumed that if the population sizing method started with a population size for which hBOA was unable to optimize the CPF reliably within n generations, it would find the next highest attractor, and that if the population sizing method started with a population size for which hBOA was able to optimize the CPF reliably within n generations, it would find the next lowest attractor. For the 120 bit problem, the situation is even more counterintuitive. The control map contains only one bisection method attractor, at 110 individuals. If the bisection method started with a population size of 120 individuals or more, it is not clear that it will ever find a population size for which hBOA can reliably optimize the CPF within a reasonable number of generations. One possible cause of the phase transition is the bisection method switching from finding the lower, to the higher, of two bisection method attractors. We have determined that for problems of 100 bits and above, the bisection method was unable to find a basin of attraction in one week of computation time, given the initial population size of 100 individuals. One might assume that hBOA's performance would have been better had we chosen a higher convergence bound for the bisection population sizing method. The control maps show that this would increase the lower population size attractor's basin, and reduce the higher population size attractor. Both these changes would reduce the number of evaluations taken. However, if we assume the phase transition is caused by switching between bisection method attractors, then the scalability plot in Fig. 1 shows that hBOA scales exponentially when using population sizes from the lower bisection method attractor, and the control maps show that the number of evaluations taken is relatively invariant for higher population sizes. These two observations suggest that increasing the convergence bound would not alter the scalability of hBOA on the CPF.
6.3 The CP/TF Test Problem
The results from the CPF population size control map suggest that hBOA is solving parity problems in a potentially interesting manner, and we discuss this further in Sect. 7. Since our interest is in investigating the larger issue of what is hard for EDAs, we have also examined problems in which deceptive traps are concatenated with the CPF, creating the Concatenated Parity/Trap Function (CP/TF).
Fig. 4. A semi-log scalability plot of hBOA on the CP/TF (evaluations versus problem size in bits; series: mean (a), mean (b), and exponential fit).
In the CP/TF, half of the concatenated sub-functions are deceptive traps [1, 6].

CP/TF(S) = \sum_{i=0}^{m-1} \begin{cases} parity(S_{ik} \ldots S_{(i+1)k-1}) & \text{if } i \text{ is even} \\ trap(S_{ik} \ldots S_{(i+1)k-1}) & \text{otherwise} \end{cases} \qquad (3)

where

trap(S) = \begin{cases} bc(S) & \text{if } bc(S) = k \\ k - bc(S) - 1 & \text{otherwise} \end{cases} \qquad (4)
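Under the same conventions as the CPF sketch in Sect. 3 (and repeating the parity helper so the fragment stands alone), definitions (3) and (4) can be rendered as follows; the constants remain parameters rather than fixed experimental values.

```python
def parity(block, c_even=0, c_odd=5):
    # Equation (1), repeated here so this fragment is self-contained.
    return c_even if sum(block) % 2 == 0 else c_odd

def trap(block):
    """Equation (4): a fully deceptive trap over a block of k bits."""
    k, bc = len(block), sum(block)
    return bc if bc == k else k - bc - 1

def cp_tf(bits, k=5, c_even=0, c_odd=5):
    """Equation (3): parity on even-indexed blocks, deceptive trap on odd-indexed blocks."""
    assert (len(bits) // k) % 2 == 0            # only even numbers of sub-functions are used
    total = 0
    for i in range(len(bits) // k):
        block = bits[i * k:(i + 1) * k]
        total += parity(block, c_even, c_odd) if i % 2 == 0 else trap(block)
    return total

# A 10-bit example: an odd-parity block followed by an all-ones trap block is optimal.
print(cp_tf([1, 0, 0, 0, 0, 1, 1, 1, 1, 1]))    # 10
```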
Only even values of m were used. Since c_odd = 5 and c_even = 0 for the parity sub-functions, the range of fitness contributions for both parity sub-functions and trap functions is the same. For the CP/TF, 2^{(l-m)/2} strings are optimal. Our main motivation for examining the CP/TF was the hypothesis that for hBOA to solve the deceptive traps within these problems, it would require a larger population size than those used in the first phase based on the CPF scalability results. This will allow us to isolate the asymptotic performance of hBOA on the second phase of the bisection method. We also note that combinations of parity with other sub-functions are a more realistic possibility for EDAs in the real world. Figure 4 shows that for problems of 30 bits (three 5 bit parity functions and three 5 bit deceptive traps) and above, hBOA scales exponentially on the CP/TF (R^2 = 0.9872). For problems of 10 to 30 bits, the number of evaluations required is fitted well by a quadratic function. This may be because 2^{l-m-1} random individuals will have optimal
values for every parity function in the CP/TF, so for small values of m, solving the traps limits the number of evaluations.
7 Analysis In the previous section we showed how hBOA scales exponentially on the CP/TF. This is a significant result, since hBOA scales quadratically or better on a wide range of decomposable problems, including those containing overlapping or hierarchical elements [18]. In Sect. 7.1 we explain more formally why parity sub-functions are hard for EDAs. Our arguments suggest that EDAs should be unable to solve parity functions at all; as such, the fact that hBOA solves the problem even in exponential time bears discussion, which is contained in Sect. 7.2.
7.1 Why Parity Is Hard for EDAs
In this section we explain how the expected mutual information (MI) between any subset of the variables defining a parity sub-function is 0, for any reasonable selection of individuals in the first generation and, by induction, over all subsequent generations. If there is no MI between these lower order sets of variables, the EDA's search heuristic for an appropriate model structure must consider k-sized sets of variables. Assume that for all parity functions, odd parity has higher fitness than even parity, c_odd > c_even, so that for an order k parity function, 2^{k-1} bit strings have odd parity, with a fitness of c_odd, and the other 2^{k-1} bit strings have even parity and a lower fitness of c_even. Given a random, infinitely sized population, and 50% truncation selection (or less if we assume a random tie-breaking mechanism), the selection will contain equal proportions of all the odd parity bit strings, and none of the even bit strings. The distribution for a single parity block contains every odd parity bit string with probability p_odd = 1/2^{k-1}. Now assume a parity sub-function is part of some fitness function such that f(S) = parity(S_0 . . . S_k) + g(S_{k+1} . . . S_l), where g is any sub-function or set of sub-functions over the remaining bits. The distribution will now contain some sub-strings S_0 . . . S_k with even parity 'hitchhiking' on high values of g, but it will still contain more individuals with odd than even parity sub-strings (assuming the fitness contribution of the parity sub-function is significant for individuals with fitness values close to the mean). The proportion of every odd sub-string will be equal, and the proportion of every even sub-string will be equal and lower, p_even < p_odd. Consider a schema over a k-sized parity sub-function (where k > 2) from a length l bit string. For any schema with order o < k, half of the possible settings of the remaining k − o bits will have even parity, and half will have odd parity, so whatever the parity of the schema's defined bits, half the 2^{l-o} members of the schema will have odd parity and half even. The proportion of any schema defining less than k bits of a parity sub-function will be 2^{l-o-1} (p_even + p_odd). In other words, given the expected selection from a random population, for a k-sized parity function the joint probabilities equal the products of the marginal probabilities for any sub-set of less than k variables. This implies that there is no correlation of any
kind between any pair of variables within a parity function. In particular, there is no mutual information (since mutual information is the Kullback-Leibler divergence of the product of the marginal distributions from the joint distribution). On the other hand, for a schema defining all k bits in a k-sized parity sub-function, the proportion will be p_odd/2^{l-k} for schemata with odd parity, and p_even/2^{l-k} for schemata with even parity. Assuming the fitness contribution of the parity sub-function is significant and p_even < p_odd, the joint probabilities over the k bits of the parity function will differ from the product of the marginals. So although there are dependencies between the variables in a parity sub-function, an EDA will not be expected to detect them in a random population unless it considers modeling k-wise dependencies. If the EDA fails to model these dependencies when constructing its first probabilistic model, then it will assume each bit is independent, and those bits will be randomly distributed in the subsequent generation as well, and in all future generations by induction. All the EDAs we are aware of begin building their probabilistic models by looking for pair-wise dependencies.
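This argument is easy to check numerically. The sketch below (an illustration, not part of the original experiments) applies 50% truncation selection with a parity fitness to a large random population over a single 5-bit block, and estimates both the pairwise mutual information between two bits and the total (5-wise) dependency; the former is essentially zero while the latter is about one bit.

```python
import math
import random
from collections import Counter

def entropy(samples):
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

random.seed(1)
k, n = 5, 200000
population = [tuple(random.randint(0, 1) for _ in range(k)) for _ in range(n)]
# 50% truncation selection with fitness = parity (odd parity preferred).
selection = [s for s in population if sum(s) % 2 == 1]

# Pairwise mutual information between bits 0 and 1: I(X0;X1) = H(X0) + H(X1) - H(X0,X1).
b0 = [s[0] for s in selection]
b1 = [s[1] for s in selection]
pair_mi = entropy(b0) + entropy(b1) - entropy(list(zip(b0, b1)))

# k-wise dependency: total correlation = sum of marginal entropies minus the joint entropy.
total_corr = sum(entropy([s[i] for s in selection]) for i in range(k)) - entropy(selection)

print("pairwise MI        ~ %.4f bits (expected 0)" % pair_mi)
print("total correlation  ~ %.4f bits (expected 1)" % total_corr)
```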
7.2 How Does hBOA Solve Parity Functions?
The analysis in Sect. 7.1 suggests parity functions should be impossible for an EDA to solve given a large enough population, and yet hBOA does solve both the CPF and CP/TF reliably at least for some problem sizes, even if it does so in exponential time. This is a credit to hBOA's robustness, and it is interesting to speculate how it does this. hBOA uses a Bayesian network to model the selection distribution, where joint probabilities are modeled for a node and all of its parents using a decision tree. Which nodes have which parents is determined during the structural learning phase of network search, which is, in general, an NP-hard problem [4], so hBOA uses a greedy search heuristic, considering adding (or possibly removing or reversing) one edge at a time and selecting the resultant structure that best fits the selection distribution according to some metric. This is a reasonable approach, but for parity sub-functions with a large enough population size, adding edges between fewer than k nodes results in a network that models the selection distribution no better, since there is no difference between the joint probabilities and the product of the marginals to model. In other words, only the final parent that connects all the variables in a parity sub-function would improve the model, but adding one edge at a time, hBOA never gets that far. All this analysis raises the question of how hBOA optimizes the CPF and CP/TF at all, and specifically, how this is more possible for smaller population sizes. We suspect that biases are present in the initial population, if it is small enough to deviate significantly from the expected distribution of individuals. A similar effect may accumulate through drift and convergence in the case of larger populations. If all but 2 bits in a k-sized parity sub-function deviate significantly from the expected 1:1 ratio of 1s and 0s, this can result in a bias in their parity; let us say the population happens to contain more individuals with odd parity over those bits, and that c_odd > c_even. In this case the selection will contain more individuals with even parity over the remaining 2 bits of this parity sub-function, the joint probability distribution over those two bits will differ from the product of the marginals, and a network joining them will better match the distribution than one that does not.
We note that linkage discovered through population bias is undesirable, because the EDA never models the true linkage, and is unable to maintain multiple competing schema and so tackle hierarchical problems.
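One way to see how small populations can create the spurious pairwise signal discussed in this section is to repeat the mutual-information estimate of Sect. 7.1 on selections of different sizes; the sketch below is a hypothetical illustration of this sampling effect, drawing selections directly from the ideal odd-parity distribution.

```python
import math
import random
from collections import Counter
from itertools import product

def pairwise_mi(selection):
    """Estimated mutual information (bits) between the first two bits of the selection."""
    def entropy(samples):
        n = len(samples)
        return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())
    b0 = [s[0] for s in selection]
    b1 = [s[1] for s in selection]
    return entropy(b0) + entropy(b1) - entropy(list(zip(b0, b1)))

random.seed(2)
odd_strings = [s for s in product((0, 1), repeat=5) if sum(s) % 2 == 1]

for size in (10, 100, 10000):
    # A truncation selection of the given size, drawn from the ideal odd-parity distribution.
    selection = [random.choice(odd_strings) for _ in range(size)]
    print("selection size %6d: estimated pairwise MI = %.4f bits" % (size, pairwise_mi(selection)))
```

For tiny selections the estimated pairwise mutual information is often far from zero, giving a greedy model search something to latch onto, while for large selections it vanishes, in line with the control map results.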
8 Future Work In the previous section we explained how it is difficult for an EDA (assuming a large enough, random population) to learn that bits within a parity sub-function are linked using a pair-wise metric. In Sect. 6 we presented empirical results showing how hBOA scales exponentially on problems containing parity sub-functions. In this section we discuss two possible modifications to EDAs that might allow an EDA to solve these problems with quadratic performance. The first modification is an EDA that considers linking up to k variables at a time during its model search, and the second is hybridization with a pair-wise probing mechanism.
8.1 k-Wise Model Search
One option for an EDA designer is to bite the bullet and consider linking up to k-sized groups of variables within the model building phase. The EDA would then discover the interdependencies in misleading functions such as the CPF, but at a substantial cost. An EA's computational time complexity is normally approximated in terms of fitness evaluations; fitness evaluations are assumed to be overwhelmingly expensive in terms of the algorithm's cost. But clearly, even if an EA scales sub-quadratically with respect to the number of fitness evaluations it uses, if any other aspect of the algorithm has a higher computational complexity, then it determines the overall asymptotic scalability. Most EDAs learn the structure of their models by greedy search, in which they start by considering whether pairs of variables appear interdependent given the distribution of the current selection. This gives the EDA a maximum of l^2 interdependencies to consider in each iteration of the model structure search algorithm, which is O(l^2). The number of iterations of the model structure search algorithm is bounded by l, giving an EDA's model structure search an overall time complexity of O(l^3). In a sense, many EDAs are already cubic complexity optimization algorithms. It might be the case that, for practical purposes, the computational cost of evaluating a single change to the probabilistic model structure is so much cheaper than a single fitness evaluation that, at the limits of our computational resources, fitness evaluation still accounts for the majority of the computational cost. In this case we would say that though such algorithms have a cubic component, they are practically speaking sub-quadratic algorithms. For example, suppose we have an EDA for which fitness evaluations cost C_f l^{1.55} in total, and model evaluations cost C_m l^3 in total (where C_f and C_m are the costs of a single fitness and model change evaluation respectively). If model evaluations account for less than 1% of total costs, then C_f l^{1.55} ≥ 99 C_m l^3, i.e. C_f ≥ 99 C_m l^{1.45}. This may hold for some real world problems and problem sizes. Now consider how much less likely this is in the case of the EDA that has bitten the bullet and is considering whether groups of k variables appear interdependent given the distribution of the selection. There are l^k potential linkage groups to evaluate at the
first iteration of the model structure search, resulting in an O(l^{k+1}) structural search algorithm. Now, for the same 1% of total costs threshold, we require that C_f ≥ 99 C_m l^{k-0.55}. For typical values of k, this seems unlikely to hold even for low values of l.
8.2 EDA-Probe Hybrid
The second potential modification to an EDA would be hybridization with a pair-wise, probe-based non-linearity detection measure (an EDA-probe hybrid is presented in [26], but it uses single-variable probes). Interactions in parity functions are detectable using a pair-wise detection mechanism, with any probe context (just as interactions are pair-wise detectable by EDAs when the population is biased). The results from the probing algorithm would then be used to determine where the EDA's linkage learning mechanism should look for higher order linkage groups. This highlights a trade-off between fitness and model evaluations. Pair-wise perturbation would add a quadratic, rather than O(l^k), element to the model building, although it would require O(l^2) additional individuals in each generation. The EDA-probe hybrid trades O(l^2) fitness evaluations for O(l^k) structural model evaluations. It appears that by examining linkage from two alternative bases, the schema basis from the EDA and the Walsh basis from probing, we may be better able to generate a pair-wise linkage learning heuristic that is more robust with respect to the kinds of higher order functions whose linkage it can correctly optimize. But identifying the correct linkage is an optimization problem, and if all potential arrangements of linkage are possible, then we should consider seriously the No Free Lunch theorem [27]. An open question is whether all potential linkage landscapes are possible for EDAs.
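The pair-wise detectability claimed here is straightforward to demonstrate with a perturbation test in the spirit of [16, 11]; the sketch below flips pairs of bits of an arbitrary probe string and reports a non-linearity whenever the joint fitness change differs from the sum of the individual changes, which for a parity block happens for every pair.

```python
def nonlinear(f, x, i, j):
    """Perturbation-based non-linearity check between positions i and j of string x."""
    def flip(s, *positions):
        s = list(s)
        for p in positions:
            s[p] ^= 1
        return s
    d_ij = f(flip(x, i, j)) - f(x)          # joint effect of flipping i and j together
    d_i = f(flip(x, i)) - f(x)              # effect of flipping i alone
    d_j = f(flip(x, j)) - f(x)              # effect of flipping j alone
    return d_ij != d_i + d_j                # separable (linear) contributions would add up

def parity_fitness(bits, c_even=0, c_odd=5):
    return c_even if sum(bits) % 2 == 0 else c_odd

x = [0, 1, 1, 0, 1]                         # any probe context works for parity
print(all(nonlinear(parity_fitness, x, i, j)
          for i in range(5) for j in range(i + 1, 5)))   # True: every pair is detected
```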
9 Conclusion In this chapter we have discussed linkage learning in EDAs. We believe that the success of the EDA approach to linkage learning is akin to the canonical GA's success in optimisation. The canonical GA benefits from an implicit parallelism: the fitness value of one individual counts towards the implicit estimate of the fitness values of many schemata. The EDA benefits from another form of implicit parallelism: the frequency of one individual within the selection being modelled counts towards the implicit estimate of the interdependency of many partitions. Both optimisation in GAs and structural learning in EDAs are successful heuristics, but ultimately they are just heuristics and have their weaknesses. Just as the simple GA can be subverted by carefully designed test problems such as the deceptive trap, the structure learning mechanism of the EDA can be subverted by problems such as the CPF. Just as the deceptive trap misleads simple GAs because the fitness of its sub-schemata is a poor indication of the fitness of building blocks, the CPF misleads the structure learning mechanism of EDAs because the interdependency of sub-partitions is a poor indication of the location of building blocks. In general, EDAs use a pair-wise linkage detection mechanism to identify higher order linkage blocks (at least when beginning to construct their probabilistic models), and functions exist that
contain no lower order dependencies, although they contain high-order dependencies that significantly affect an individual's overall fitness. In the binary domain, the most extreme form of this problem is parity. We showed that there is no expected interdependency in the distribution of a selection between any sub-set of variables defined in a parity sub-function, even though the CPF is of k-bounded complexity and decomposable. We demonstrated this empirically with hBOA, as an example of a robust, scalable EDA, and showed that hBOA scales exponentially on the CP/TF when using bisection population sizing. Parity is not just hard for EDAs. Parity is hard for any linkage learning mechanism that begins building its linkage model using a pair-wise statistic averaged over a reasonably sized, uniformly distributed population. For example, unpublished results show that the Mutual Information Linkage Learning System (MILLS) [24] is unable to detect linkage between the bits in a parity function reliably. MILLS uses mutual information between pairs of bits to estimate linkage during a pre-processing stage. Parity is also hard for learning algorithms beyond EAs. As we argued earlier, structural learning in Bayesian networks is a form of linkage learning. The difficulty of modeling unstable distributions is well known in the Bayesian network literature. The expected distribution of the parity function violates the stability assumption (also known as the DAG-faithfulness assumption) of most Bayesian structural learning methods [17], in which it is assumed that all independencies in a Bayesian net are structural. In the expected distribution from the parity function, the pair-wise independencies are the result of a set of dependencies that cancel each other out. Whilst the chances of a practitioner encountering a problem with misleading linkage may be lower than those of encountering a problem that is fitness deceptive, it is ironic that the very mechanism designed to let EDAs deal with deceptive problems creates a 'meta-problem' of the same form. Although the CPFs seem fairly artificial, even by GA test problem standards, they are of value in the EDA literature for three reasons. First, they are simply an extreme case of conditional dependency, in which every pair-wise dependency is conditional on the parity of every other bit. Conditional dependency is not an unusual feature in optimization problems in itself, and it would be interesting to see how EDAs scale on problems with slightly lower degrees of conditional dependency when factors such as noise and thresholds are taken into account. Second, CPFs are the equivalent of a needle in a haystack for a distribution-based linkage learning mechanism. By better understanding why parity functions are hard, we better understand linkage and linkage learning in general. Third, they suggest linkage deception might be possible for EDAs. The CPF is like a concatenation of balanced trap functions; trap functions in which the fitness of all sub-schemata is equal. In a balanced trap, the fitness of sub-schemata is neither deceptive nor informative. A function that deceived EDA linkage learning would contain pair-wise mutual information in the expected distribution of the selection between variables that were not in fact dependent. Acknowledgement. We would like to thank Martin Pelikan for supplying hBOA's code.
References 1. Ackley, D.H.: A Connectionist Machine for Genetic Hillclimbing. Kluwer Academic Publishers, Boston (1987) 2. Baluja, S., Caruana, R.: Removing the Genetics from the Standard Genetic Algorithm. School of Computer Science, Carnegie Mellon University (1995) 3. Chen, Y.P., Yu, T.L., Sastry, K., Goldberg, D.E.: A survey of linkage learning techniques in genetic and evolutionary algorithms. IlliGAL Tech. Rep. 2007014 (2007) 4. Chickering, D., Geiger, D., Heckerman, D.: Learning Bayesian networks is NP-hard. Microsoft Research, 94–17 (1994) 5. Coffin, D., Smith, R.: The limitations of distribution sampling for linkage learning. In: Proceedings of the 2007 IEEE Congress on Evolutionary Computation (CEC 2007), pp. 364–369 (2007) 6. Deb, K., Goldberg, D.E.: Analyzing deception in trap functions. In: Foundations of Genetic Algorithms - 2, pp. 93–108. Morgan Kaufmann, San Francisco (1992) 7. Etxeberria, R., Larrañaga, P.: Global optimization using Bayesian networks. In: Ochoa, A., Soto, M.R., Santana, R. (eds.) Proceedings of the Second Symposium on Artificial Intelligence (CIMAF 1999), Havana, Cuba, pp. 151–173 (1999) 8. Harik, G.: Linkage Learning via Probabilistic Modeling in the ECGA. Tech. Rep. 99010, IlliGAL (1999) 9. Harik, G., Lobo, F., Goldberg, D.: The compact genetic algorithm. IEEE Transactions on Evolutionary Computation 3(4), 287–297 (1999) 10. Harik, G.R., Goldberg, D.E.: Learning linkage. In: Belew, R.K., Vose, M.D. (eds.) Foundations of Genetic Algorithms, vol. 4, pp. 247–262. Morgan Kaufmann, San Francisco (1997) 11. Heckendorn, R.B., Wright, A.H.: Efficient linkage discovery by limited probing. Evol. Comput. 12(4), 517–545 (2004) 12. Larrañaga, P., Lozano, J.A. (eds.): Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Boston (2002) 13. Mühlenbein, H., Mahnig, T., Ochoa, A.: Schemata, distributions and graphical models in evolutionary optimization. Journal of Heuristics 5(2), 213–247 (1999) 14. Mühlenbein, H., Paaß, G.: From recombination of genes to the estimation of distributions I. Binary parameters. In: Ebeling, W., Rechenberg, I., Voigt, H.M., Schwefel, H.P. (eds.) PPSN 1996. LNCS, vol. 1141, pp. 178–187. Springer, Heidelberg (1996) 15. Mühlenbein, H., Paaß, G.: From Recombination of Genes to the Estimation of Distributions I. Binary Parameters. In: Proceedings of the 4th International Conference on Parallel Problem Solving from Nature, pp. 178–187 (1996) 16. Munetomo, M., Goldberg, D.: Identifying linkage by nonlinearity check (1998), citeseer.ist.psu.edu/munetomo98identifying.html 17. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge (2000) 18. Pelikan, M.: Hierarchical Bayesian Optimization Algorithm: Toward a New Generation of Evolutionary Algorithms. Springer, Heidelberg (2005) 19. Pelikan, M., Goldberg, D.: Hierarchical Bayesian Optimization Algorithm = Bayesian Optimization Algorithm + Niching + Local Structures. In: Optimization by Building and Using Probabilistic Models, pp. 217–221 (2001) 20. Pelikan, M., Goldberg, D., Cantú-Paz, E.: BOA: The Bayesian optimization algorithm. In: Proceedings of the Genetic and Evolutionary Computation Conference GECCO 1999, vol. 1, pp. 525–532 (1999) 21. Pelikan, M., Goldberg, D.E., Lobo, F.G.: A survey of optimization by building and using probabilistic models. Computational Optimization and Applications 21(1), 5–20 (2002)
22. Pelikan, M., Sastry, K., Butz, M.V., Goldberg, D.E.: Hierarchical BOA on random decomposable problems. Tech. Rep. 2006002, IlliGAL (2006) 23. Santana, R., Larrañaga, P., Lozano, J.A.: Challenges and open problems in discrete EDAs. Tech. Rep. EHU-KZAA-IK-1/07, Department of Computer Science and Artificial Intelligence, University of the Basque Country (2007), http://www.sc.ehu.es/ccwbayes/technical.htm 24. Smith, R.E.: An iterative mutual information histogram technique for linkage learning in evolutionary algorithms. In: Proceedings of CEC 2005, pp. 2166–2173 (2005) 25. Thierens, D., Goldberg, D., Pereira, A.: Domino convergence, drift, and the temporal-salience structure of problems. In: The IEEE International Conference on Evolutionary Computation Proceedings, IEEE World Congress on Computational Intelligence, 1998, pp. 535–540 (1998) 26. Tsuji, M., Munetomo, M., Akama, K.: Modeling dependencies of loci with string classification according to fitness differences. In: GECCO (2), pp. 246–257 (2004) 27. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 67–82 (1997) 28. Wright, A., Poli, R., Stephens, C., Langdon, W., Pulavarty, S.: An Estimation of Distribution Algorithm based on maximum entropy. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 343–354. Springer, Heidelberg (2004)
Parallel GEAs with Linkage Analysis over Grid
Asim Munawar1, Mohamed Wahib1, Masaharu Munetomo2, and Kiyoshi Akama2
1 Graduate School of Information Science & Technology, Hokkaido University, Japan
[email protected], [email protected]
2 Information Initiative Center, Hokkaido University, Japan
[email protected], [email protected]
Summary. This chapter describes the latest trends in the field of parallel/distributed computing and the effect of these trends on Genetic and Evolutionary Algorithms (GEAs), especially linkage-based GEAs. We concentrate mainly on the Grid computing paradigm, which is widely accepted as the most distributed form of computing; due to the advent of Service Oriented Architecture (SOA) and other technologies, Grid has gained a lot of attention in recent years. We also present a framework that can help users in the implementation of metaheuristics-based optimization algorithms (including GEAs) over a Grid computing environment. We call this framework MetaHeuristics Grid (MHGrid). Moreover, we give a theoretical analysis of the maximum speedup achievable by using MHGrid. We also discuss our experience of working with Grids.
1 Introduction Over the last two decades, we have seen incredible development in the area of parallel computing. During this time, some excellent parallel algorithms were devised, mainly for key scientific problems, e.g., systems of linear equations, partial differential equations, numerical analysis, and prime number search. Various implementations of Genetic and Evolutionary Algorithms (GEAs) over different parallel computing environments were also attempted during this time. In GEAs, data and computations are clearly separated, i.e., a common space is used to store the population, whereas potentially a large number of threads can act concurrently on this data space, making GEAs ideal candidates for parallel implementation [1]. Even though GEAs have a tremendous potential for parallelization, it is still a very common practice to implement GEAs in a serial fashion for medium complexity problems. This trend prevailed due to a continuous increase in clock speeds over the last two decades, but as rising clock speeds mean rising power consumption, in this decade multicore technology with low clock speeds is becoming more prominent. On the other hand, we see ever increasing distributed resources, data deluge, and high-bandwidth/low-latency networks. All these parameters make parallel and distributed algorithms more vital than ever before, and require us to rethink GEA design and simulation strategies. In this chapter we will concentrate on modern distributed computing paradigms, namely Grid computing and its effects on GEAs, with an emphasis on linkage-based
GEAs. In recent years, Grid computing has gained widespread attention from both industry and research institutes, as it gives a notion of establishing an open standard platform for distributed computing and ubiquitous services (for further details on Grid computing see Sect. 2.1; for a comparison of Grid with conventional distributed paradigms see Sect. 2.3). Distributed technologies affect algorithm design in several different ways. Some basic points that a developer must keep in mind while developing an algorithm for a distributed environment can be summarized as follows. Algorithms: (1) should support interconnections of loosely coupled distributed applications; (2) must be able to tolerate communication delays of up to hundreds of milliseconds; (3) must be fault tolerant; (4) should support late binding (in modern distributed computing, the decision on a specific service instance is not made until it is needed; this is called late binding); (5) should support dynamic migration; and (6) must rely on external data resources whenever required (for further details see Sect. 3.1). Even though Grid middleware (Sect. 2.2) is responsible for hiding most of the complexities, the Grid can still be a challenging environment for management and simulations. The reader of this book may not be interested in all the challenges involved, but it is not appropriate to entirely ignore them. Given below is a list of some of the most notable challenges associated with a Grid computing environment: 1. Management of a Grid is difficult as no central administration exists. 2. There is a high probability of communication delays and errors. 3. Intensive testing and debugging of algorithms over such environments can be time consuming. 4. Heterogeneity of the environment poses certain challenges to the users. 5. Dynamic scheduling and dynamic migration of the application is required. 6. Communication and synchronization between different subtasks is typically one of the greatest challenges. 7. Providing security without any central authority is another challenge. In this chapter, we present our work, a framework for implementing GEAs over a Grid computing environment. We call this framework MetaHeuristics Grid or MHGrid (see Sect. 4 for details). The basic motivation behind MHGrid is to handle all the low-level complexities, including security, heterogeneity, and unreliability of the Grid, in an automatic fashion. MHGrid can be considered as an abstraction application that uses different technologies to provide an easy to use, robust environment for the implementation of GEAs over a Grid. The objective of this research is to provide a framework in which a user can solve complex optimization problems over a Grid with minimal effort. MHGrid provides the following services to the users: 1. Ability to write and add a new GEA (solver) to MHGrid. 2. Ability to write and add a new objective function (optimization problem) to the framework. 3. Auto parallelizes the GEAs for complex time consuming problems. 4. Allows the use of an existing GEA in MHGrid, to solve the problem at hand, with a minimal input.
5. Provide all functionalities as a Web service; this Web service can be consumed (i.e., used) directly by any user application, irrespective of the programming language, operating system, or architecture. Even though any ordinary Parallel GEA (PGEA) can run on MHGrid, a PGEA specially designed for a Grid environment will definitely outperform ordinary PGEAs. This chapter is intended for readers who want to implement GEAs over the Grid computing paradigm. As this book is about linkage-based GEAs, we expect the user to have a basic knowledge of GEAs and the importance of linkage. In this chapter, we will keep our focus on the Grid computing environment and the effect of Grids on algorithms (particularly GEAs). We have tried our best not to overload the chapter with impenetrable details by limiting the focus of discussion. Keeping that in mind, we have only provided a brief summary of the most important and notable work done in the past. Therefore, the chapter should not be considered a comprehensive overview of the previous work done in the area. We start this chapter with a discussion of different types of parallel and distributed computing paradigms, including Grid computing, in Sect. 2. We also discuss some of the basic differences between Grid computing and conventional distributed computing. In Sect. 3, we discuss the influence of Grid on algorithms and the recent trends in simulations. We also review previous work done in parallel and distributed GEAs, with an emphasis on linkage-based GEAs. Section 4 presents the proposed framework for the implementation of GEAs over a Grid. We also give some empirical results with a theoretical background, and discuss job submission to MHGrid from the user's perspective. Section 5 summarizes the whole chapter with conclusions.
2 Parallel/Distributed Computing The basic principle underlying parallel computing is that large problems can almost always be divided into smaller sub-problems which can be handled in parallel. Parallelism can be obtained in several different ways: instruction level parallelism, data level parallelism, and task level parallelism. Parallel computing has been in use for many years, mainly in High Performance Computing (HPC), but in recent years it has become more prominent due to the increasing distribution of resources and physical constraints preventing frequency scaling. Parallel computing environments are classified according to the level at which the hardware supports parallelism; this is roughly analogous to the distance between CPUs. Given below are some of the well-known examples of parallel computing paradigms: • Superscalar computing: can issue multiple instructions per cycle. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor, allowing faster CPU throughput than would otherwise be possible at the same clock rate. In this case each functional unit is not a separate CPU core but an
execution resource within a single CPU, such as an arithmetic logic unit, bit shifter, or multiplier. This kind of CPU has shared memory. • Multicore computing: combines two or more independent CPU cores in a single package. Multicore processors can be of two kinds: (1) symmetric, where all the cores are similar to each other (e.g., AMD Opteron, Intel Quad-Core); (2) asymmetric, consisting of general-purpose cores and application-specific cores (e.g., the Cell Broadband Engine). Each core in a multicore processor can potentially be a superscalar processor. This kind of CPU can either have shared memory or distributed memory; a hybrid of the two is most common. • Distributed computing: is a distributed memory computer system where the processing elements are connected by a network. Distributed computing is highly scalable. Such systems are ultimately limited by network communication speeds. – Cluster computing: A cluster is a group of loosely coupled computers that work together closely so that in many respects they can be viewed as though they are a single computer. The components of a cluster are commonly connected to each other through fast local area networks. Clusters are much more cost-effective than a single computer of comparable speed. A vast majority (approximately 74.6%) of the "Top 500" supercomputers in the world are clusters [2]. – Massive parallel processing: A massively parallel processor (MPP) is a single computer with a very large number of networked processors. MPPs have many of the same characteristics as clusters, but they are usually larger, typically having "far more" than 100 processors. In an MPP, "each CPU contains its own memory and copy of the operating system and application. Each subsystem communicates with the others via a high-speed interconnect." Approximately 21.4% of the "Top 500" supercomputers in the world are MPP based systems [2]. – Grid computing: is the most distributed form of parallel computing. Grid computing makes use of computers many miles apart, connected by the Internet, to work on a given problem. Because of the low bandwidth and extremely high latency on the Internet, grid computing typically deals only with embarrassingly parallel problems. Most grid computing applications use middleware that operates between the operating system and the application. Middleware is responsible for managing network resources and standardizing the software interface for grid computing applications.
• Specialized parallel computing: includes computing applicable only to a specific area, such as algorithms running over Field-Programmable Gate Arrays (FPGAs), vector processors, Graphics Processing Units (GPUs), etc. It is important to note that these classifications are not mutually exclusive and are often used in a hybrid fashion. However, in this chapter we will focus only on the Grid computing paradigm.
Grid Computing
There are several definitions for Grid computing. We will use the definition of Grid by “www.grid.org”, which says: “Grid computing is a form of distributed computing that
involves coordinating and sharing computing, application, data, storage, or network resources across dynamic and geographically dispersed organizations." This is in contrast to the traditional notion of a supercomputer, which has many processors connected by a local high-speed computer bus. The grand vision of a Grid is often presented as an analogy to power grids, where users (or electrical appliances) get access to electricity through wall sockets with no care or consideration for where or how the electricity is actually generated. In this view of Grid computing, computing becomes pervasive and individual users (or client applications) gain access to computing resources (processors, storage, data, applications, and so on) as needed, with little or absolutely no knowledge of where those resources are located or what the underlying technologies, hardware, operating system, and so on are [3]. The actual motivation for a Grid computing environment was the need for a distributed computing infrastructure that allows for coordinated resource sharing and problem solving in dynamic, multi-institutional environments to create a large virtual parallel machine. However, Grid computing has diverged quite far from its initial incarnation as meta-computing; the aim of Grid computing is now to build the distributed computing infrastructure to support so-called Virtual Organizations (VOs) [4].

Virtual Organizations (VOs)
Using VOs, individuals and organizations can share resources with each other in a controlled, secure, and flexible manner. A VO may consist of a research group in an institute, doctors from several hospitals, remote medical image databases, and a supercomputing center. As shown in Fig. 1, resources in VOs not only include computers or instruments, but also data, software, and human beings. Providing security is one of the most challenging parts of VO management. Security is implemented not only to prevent people outside of the VO from accessing the resources; the members of the same VO may also require special certificates to access some resources. Thus, authentication, authorization, specification, and access policies are important issues in managing VOs effectively.

Types of Grids and SOA
Grid computing is a general phrase that represents several types of distributed computing. In this section we will discuss the most common types of Grid computing.
• Computational Grids: these are traditional Grids that are designed to provide support for high performance computing resources. Such Grids are quite popular and valuable.
• Data Grids: these are used for the controlled sharing and management of large amounts of distributed data.
• Equipment Grids: these are used to control and analyze the data produced by a primary piece of equipment, e.g., a telescope.
It is important to note that all these diverse capabilities provided by different families of Grid can be combined into a common distributed computing environment (also known
Fig. 1. Virtual Organizations accessing different and overlapping sets of resources. Resources may include computing node, printer, storage, cluster, supercomputer, scientific instrument etc. Note that a user is also treated as a resource and can be shared between two or more VOs.
as "Grid of Grids" [5]), based on Service Oriented Architecture (SOA) principles [6]. The embrace of SOA by Grid computing is one of the key reasons for the success of Grid computing.

2.2 Grid Middleware and Other Tools
Grid middleware is a piece of software that resides between an application and the operating system. Ideally, the Grid middleware is responsible for hiding all the low-level details and complexities from the application developers. Global collaboration to define standards for the Grid has resulted in a standard called the Open Grid Services Architecture (OGSA) [7]. OGSA describes an architecture for a service-oriented Grid computing environment, developed within the Global Grid Forum (GGF). OGSA is based on several other Web service technologies, notably WSDL [8] and SOAP [9]. The core services offered by the OGSA infrastructure include:
1. Execution Management Services
2. Data Services
3. Resource Management Services
4. Security Services
5. Information Services
OGSA has been adopted as the Grid architecture by a number of low-level Grid middleware projects, including Globus [10]. However, it is not easy to work with low-level Grid middleware directly; therefore, it is important that abstraction tools be built on top of the low-level Grid middleware. These tools help in constructing a high-level Grid middleware. NAREGI [11] is an example of a high-level Grid middleware.
2.3 Grid vs. Distributed Computing
It is difficult to draw a clear line between Grid computing and other similar types of distributed computing paradigms like peer-to-peer, cluster of clusters etc. To call an environment a Grid, we use the checklist by Ian Foster [12]. The three point checklist is:
1. Resources are not administered centrally.
2. Open standard, general-purpose interfaces and protocols are used.
3. Non-trivial quality of service is achieved.
Any kind of distributed computing that satisfies the above checklist can be treated as a Grid.
3 GEAs over Grid

In Sect. 2, we discussed different types of parallel/distributed computing paradigms. We also gave our rationale for preferring Grid computing over other kinds of distributed computing. In this section we will give an overview of PGEAs over the Grid computing environment, with an emphasis on linkage-based PGEAs.

3.1 The Impact of Grids on Algorithms
Simply speaking, existing implementations of parallel algorithms can run untouched on distributed resources in a Grid. Running conventional parallel problems over a Grid will certainly remain an important use of computational Grids; however, this does not take full advantage of the Grid. Therefore, there is a dire need for algorithm development specifically for the Grid computing paradigm. The requirements for such algorithms can be listed as follows:
1. Algorithms should be developed to support the interconnection of loosely coupled distributed applications.
2. Algorithms must be able to tolerate communication delays of up to hundreds of milliseconds without affecting performance considerably.
3. Algorithms must be fault tolerant.
4. Algorithms should support late binding. This helps in the implementation of workflows and dynamic scheduling.
5. Algorithms must be able to rely on external remote data sources (both real-time data streams and archival data) whenever needed, unlike the conventional method of copying the data to local resources before starting the simulation.
We would like to encourage the development of Grid-based algorithms that support the interconnection of loosely coupled distributed applications. As is clear from Fig. 2, the Grid creates a hierarchical computing environment. In order to take full advantage of the computational resources, the algorithm should be implemented in a hierarchical fashion, such as GE-HPGA [13].
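Requirements 2 and 3 above can be made concrete with a small retry wrapper around a remote evaluation call. This is a generic sketch in Python: the remote_evaluate callable is a placeholder rather than part of any middleware named in this chapter, and the chosen delays are illustrative only.

```python
import time

def call_with_retry(remote_evaluate, args, retries=3, initial_delay_s=0.5):
    """Call a remote (possibly slow or failing) evaluation with exponential back-off.

    Returns None after repeated failures so the caller can reschedule the task
    instead of aborting the whole optimization run.
    """
    delay = initial_delay_s
    for _ in range(retries):
        try:
            return remote_evaluate(*args)     # may take hundreds of milliseconds
        except (TimeoutError, ConnectionError):
            time.sleep(delay)                 # tolerate transient network problems
            delay *= 2                        # back off before the next attempt
    return None
```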
Fig. 2. Algorithms may be broken into fine, medium, and coarse-grained parts for efficient implementation over a computational Grid. The fine, medium, and coarse-grained partition is done on the basis of the communication delays involved. Fine-grained systems use specialized networks, having negligible latency and modest time for communication of a single word. Medium-grained systems have latency in the 100-1000 microseconds range. Coarse-grained systems can have high inter-node bandwidth but the latency is typically 100 milliseconds or more. Similarly the network reliability also decreases as we move from fine towards coarse-grained systems.
Amdahl's Law & Gustafson's Law
It is important to note that Amdahl's and Gustafson's laws apply to all parallel algorithms, including those implemented in distributed environments. According to Amdahl's law, most of the time it is not possible to parallelize the whole algorithm. In such cases the serial part of the algorithm creates a serious bottleneck, and this defines an upper limit for the speed-up achievable by parallel algorithms [14]. Amdahl's law is given by:

$$S = \frac{1}{1 - P} \qquad (1)$$

where S is the speed-up of the program (as a factor of its original sequential runtime), and P is the fraction of the program that is parallelizable. If the sequential portion of a program is 10% of the runtime, we can get no more than a 10x speed-up, regardless of how many processors are added. This puts an upper bound on the usefulness of adding more parallel execution units. A more practical but slightly more complex law similar to Amdahl's law is known as Gustafson's law [15].
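The two laws are easy to explore numerically. The short Python sketch below evaluates the bound of Eq. (1), the standard finite-processor form of Amdahl's law, and Gustafson's scaled speed-up S = p - alpha*(p - 1); the latter two formulas are the usual textbook forms rather than equations quoted from this chapter.

```python
def amdahl_limit(parallel_fraction):
    """Upper bound of Eq. (1): speed-up with an unlimited number of processors."""
    return 1.0 / (1.0 - parallel_fraction)

def amdahl(parallel_fraction, p):
    """Finite-processor form of Amdahl's law (textbook form)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / p)

def gustafson(serial_fraction, p):
    """Gustafson's scaled speed-up: S = p - serial_fraction * (p - 1)."""
    return p - serial_fraction * (p - 1)

# A program whose sequential portion is 10% of the runtime can never exceed 10x.
print(amdahl_limit(0.9))                       # 10.0
for p in (8, 64, 512):
    print(p, round(amdahl(0.9, p), 2), gustafson(0.1, p))
```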
3.2 Recent Trends in Simulations over Distributed Computing
Figure 3 shows the trends in algorithm development over parallel/distributed computing paradigms [4]. It depicts the various technologies used by simulations performed over parallel computing environments within a rough timeline of the last three decades. In the recent history of parallel computing, we see four very prominent trends: (1) ever
Fig. 3. Timeline of trends in distributed computing
increasing distribution of computing power; (2) high-speed, low-latency networks; (3) multicore CPUs; and (4) the increasing importance of data-centric computing. It is clear from the figure that the whole field of parallel computing is moving from "parallel computing" to "distributed computing" to "data-deluged computing". However, instead of replacing them, each trend builds upon the previous trends.
3.3 GEAs over Parallel/Distributed Computing
GEAs usually require a bigger population for solving non-trivial problems, and this leads to a higher computational cost [16]. Parallel GEAs (PGEAs) are one of the techniques used to reduce this computational cost; this is also the basic motivation behind PGEAs. PGEAs usually show a linear speed-up with an increase in the number of computing nodes (super-linear speed-ups have also been observed in some cases). GEAs mimic the process of natural evolution. In nature, individuals do not operate as a single population in which a given individual can mate with any other partner in the population; rather, species live in, and reproduce within, subgroups or neighborhoods only. PGEAs introduce this concept of interconnected sub-populations into artificial evolution [17]. As a result, PGEAs often lead to superior numerical performance (not only to faster algorithms), even when executed in a single-processor environment [18]. It is very difficult to classify PGEAs; however, they are widely classified into two main classes according to the grain of parallelization:
1. Coarse-grain PGEAs (cgPGEAs), also known as distributed or island GEAs (dGEAs), assume a population on each of the computer nodes and migration of individuals among the nodes. [19, 20] show some of the earliest work done in this area. Peer-to-peer GEAs can also be included in this category [1].
2. Fine-grain PGEAs (fgPGEAs), also known as cellular GEAs (cGEAs), assume an individual on each processor node which interacts with neighboring individuals for selection and reproduction. [21, 22, 23] show some of the earliest work done in this area.
The above classification relies on the computation/communication ratio: if this ratio is high we call the PGEA a coarse-grain algorithm; if it is low we call it a fine-grain parallel GA [17]. Many models of PGEAs are based on hybrids of these two approaches. In PGEAs, "selection" is considered to be the main bottleneck, as it is performed on the whole population (or sub-population) and not at the individual level. As a result, local selection algorithms gained some attention in the last decade [24, 25, 26]. Some algorithms use techniques like gossiping to perform selection at the local level and avoid population explosion or implosion [1]. There are two other approaches that cannot be properly assigned to any of the above categories. They are as follows:
1. Single Population (Panmictic) GEAs distribute the population among the processors, but treat it as a single panmictic unit [27]. Despite frequent inter-processor communications, [27] shows an improvement in execution time. However, this kind of algorithm is not appropriate for distributed environments involving communication delays of hundreds of milliseconds.
2. Global Parallelization obtains parallelization of the genetic operators/evaluations either by a code parallelizer embedded in the compiler or by explicit parallelization (such as master-slave, with the master running the algorithm and the slaves computing fitness values in parallel); this kind of parallelization proceeds in the same way as a serial GEA. This is only viable for problems with a time-consuming function evaluation; otherwise the communication overhead is usually higher than the benefits of their parallel
Fig. 4. Different models of PGEAs: (a) global parallelization, (b) coarse grain, (c) coarse grain without controller, (d) fine grain, (e) coarse grain and global parallelization (hybrid), (f) coarse grain and coarse grain (hybrid), (g) coarse grain and fine grain (hybrid), (h) peer-to-peer (each node can have either a sub-population or a single individual), (i) coarse grain, fine grain and global parallelization (hybrid), suitable for modern distributed paradigms.
execution. MHGrid (Sect. 4) automatically uses this kind of parallelization for large problems.

Figure 4 shows some of the best-known models of PGEAs [28]. The model shown in Fig. 4(i) is particularly suitable for modern distributed computing paradigms (including the Grid). In the field of Grid computing, the term Grid Oriented Genetic Algorithm (GOGA) was first introduced by [29]. Since then, work has been done on GOGAs [30, 31], but the field is still far from mature. We suggest that decentralized hierarchical algorithms, as shown in Fig. 4(i), can be suitable for Grid computing paradigms. A similar hierarchical algorithm is presented in [13].

As far as "PGEAs with linkage analysis" are concerned, unfortunately not much work has been done in this area. However, some applications are using linkage-based parallel GEAs for solving complex optimization problems [32]. We have used the parallel linkage identification algorithm by Munetomo et al. (2003) to make a theoretical analysis and demonstrate empirical results in Sect. 4.9. The parallelization suggested in [33] can be used, with minor modifications, with the other linkage identification based algorithms proposed by Munetomo et al.; these algorithms include: (1) Linkage Identification by Nonlinearity Check (LINC) [34], (2) Linkage Identification by Monotonicity Detection (LIMD) [34], (3) Linkage Identification by Epistasis Measures (LIEM) [35], and (4) Linkage Identification with Epistasis Measure considering Monotonicity (LIEM2) [35]. The LINC algorithm seeks nonlinearity in a pair of loci by bitwise perturbations to obtain linkage groups, whereas LIMD checks monotonicity and LIEM/LIEM2 make their decisions by checking the strength of epistasis. These algorithms try to avoid Building Block (BB) disruptions over the generations by using the computed linkage sets.

We will now discuss the LINC algorithm in detail; the other algorithms mentioned above are quite similar to LINC, differing only in their operators. The LINC procedure identifies linkage from a population of strings. The basic idea is quite simple: nonlinearity should exist inside a linkage in at least one string; otherwise, the loci need not be tightly linked, because linearity is apparently easy for GAs. Therefore, LINC can identify linkage by sampling strings and checking whether perturbations in each pair of loci cause nonlinear effects or not. Linear interactions may exist inside a linkage group in some contexts, so it is necessary to check nonlinearity in $O(2^k)$ strings ($k$ is the maximum length of the BBs) to have an accurate linkage set. To check nonlinearity, suppose we have a string $s = s_1 s_2 s_3 \ldots s_l$ that represents the target chromosome; the changes in fitness values caused by bit-wise perturbations to $s$ are defined as:
$$\Delta f_i(s) = f(\ldots \bar{s}_i \ldots\ldots) - f(\ldots s_i \ldots\ldots) \qquad (2)$$

$$\Delta f_j(s) = f(\ldots\ldots \bar{s}_j \ldots) - f(\ldots\ldots s_j \ldots) \qquad (3)$$

$$\Delta f_{ij}(s) = f(\ldots \bar{s}_i \bar{s}_j \ldots) - f(\ldots s_i s_j \ldots) \qquad (4)$$

where $\bar{s}_i = 1 - s_i$ and $\bar{s}_j = 1 - s_j$ in binary strings. Then, $\Delta f_{ij}(s) = \Delta f_i(s) + \Delta f_j(s)$ means that the change in fitness value caused by perturbations on $s_i$ and $s_j$ is additive;
Fig. 5. Linkage identification PGEAs (a) by Munetomo et al. (2003), (b) by Michal et al. (2004)
this indicates a linear interaction between the genes. On the other hand, $\Delta f_{ij}(s) \neq \Delta f_i(s) + \Delta f_j(s)$ implies that they are not additive, which means nonlinearity. Perturbation-based methods like the one described above are good at detecting sub-functions with a small contribution to the overall fitness, but one of their disadvantages is that they require $O(l^2)$ fitness evaluations to detect linkage among a population, where $l$ is the string length. Therefore, a serial implementation of perturbation methods is not feasible in environments with hard time constraints. This fact leads to parallel implementations of these algorithms, like pLINC [33]. Figure 5 shows the comparison between Munetomo et al. (2003) [33] and a modification of Munetomo et al. by Michal et al. (2004) [36]. Note that linkage identification, being the main bottleneck, is performed in parallel to reduce the overall execution time; after this, the linkage sets are generated in a serial manner (in the case of Munetomo et al.). IntraGA is a step that finds candidates of BBs in each linkage set; parallelizing this step further reduces the execution time of the algorithm. However, InterGA, a step that mixes BB candidates to find an optimal solution, is not computationally intensive and can be safely performed in a serial fashion without affecting the overall performance of the algorithm. In the last few years we have seen a growing interest in Estimation of Distribution Algorithms (EDAs), where the crossover and mutation genetic operators are replaced by probability estimation and sampling techniques. The Bayesian Optimization Algorithm (BOA) [37] is an EDA that learns the structure of the problem, like linkage-based GEAs. A parallel implementation of BOA (pBOA) is proposed in [38]; it shows that for real problems pBOA needs a slightly higher population size and number of generations than BOA, but this is not critical, as the overall computation time is reduced linearly with the number of processors.
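To make the pairwise perturbation test of Eqs. (2)-(4) concrete, the following Python sketch flags a pair of loci as belonging to the same linkage group whenever at least one sampled string shows a nonlinear interaction. It is a simplified serial illustration of the idea behind LINC, not code from [33] or [34]; the sampled population, tolerance, and example fitness function are our own choices.

```python
import random

def linc_linkage(fitness, strings, eps=1e-9):
    """Pairwise LINC-style check: group loci (i, j) if the joint perturbation
    is not additive (Eqs. (2)-(4)) for at least one sampled string."""
    l = len(strings[0])
    linked = set()
    for s in strings:
        base = fitness(s)
        for i in range(l):
            s_i = s[:i] + [1 - s[i]] + s[i + 1:]
            df_i = fitness(s_i) - base                      # Eq. (2)
            for j in range(i + 1, l):
                s_j = s[:j] + [1 - s[j]] + s[j + 1:]
                df_j = fitness(s_j) - base                  # Eq. (3)
                s_ij = s_i[:j] + [1 - s_i[j]] + s_i[j + 1:]
                df_ij = fitness(s_ij) - base                # Eq. (4)
                if abs(df_ij - (df_i + df_j)) > eps:        # nonlinearity detected
                    linked.add((i, j))
    return linked

# Example: two independent 3-bit deceptive traps; only within-trap pairs get linked.
def trap3(bits):
    u = sum(bits)
    return 3 if u == 3 else 2 - u

def f(s):
    return trap3(s[0:3]) + trap3(s[3:6])

population = [[random.randint(0, 1) for _ in range(6)] for _ in range(20)]
print(sorted(linc_linkage(f, population)))
```

Each sampled string costs on the order of $l^2$ extra evaluations, which is exactly the cost that pLINC distributes across processors.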
4 MHGrid: A Grid Based Optimization Engine

In this section, we present a framework that facilitates the implementation of GEAs over a Grid computing environment with minimal effort from the user. We call this framework MetaHeuristics Grid (or MHGrid). MHGrid is a Grid application, not a Grid middleware. "Linkage-based GEAs" are a subset of "GEAs", which are ultimately a
subset of "metaheuristics algorithms"³; therefore, we will use the term "metaheuristics algorithms" to represent all kinds of metaheuristics-based algorithms, including GEAs.

4.1 Introduction to MHGrid
In Sects. 2 and 3 we discussed the recent trends in distributed computing and the need for designing Grid-based GEAs. In this section, we present a framework that can help a user to run his metaheuristics algorithm over a Grid without going into any low-level details of the Grid environment. It provides a unified platform for solving any global optimization problem⁴ using any metaheuristics algorithm. The first question that comes to mind is: why do we need such a framework? This question can be answered as follows:
1. The No Free Lunch (NFL) theorem [39] states that we cannot have a single algorithm/solver that can solve all known optimization problems with equally good precision: the average performance of any optimization algorithm over the entire problem space is the same. This forces us to develop a framework incorporating different kinds of algorithms, from which the user can select the algorithm of his choice to solve the problem at hand.
2. Not every person who wants to implement an algorithm over a Grid can be expected to be a Grid expert. Therefore, we need frameworks or Problem Solving Environments (PSEs) like MHGrid that can provide the required abstraction to their users.
MHGrid provides the following services to the user:
1. Ability to add a new metaheuristics algorithm/solver to MHGrid and make it available to other users.
2. Ability to add a new objective function to the framework.
3. Ability to use the existing solvers in MHGrid to solve the problem at hand with minimal input. The user can do this by using the customized portlets in the MHGrid portal.
4. Ability to consume the MHGrid's Web service directly to perform optimization without using the portal (to consume a Web service simply means to use it; a Web service can be consumed by any application, irrespective of the programming language, operating system, or architecture).
5. Ability to administer the jobs (only available to users with administrator privileges).
There are many research challenges involved in the construction of such a framework. First of all, we need to provide a simple (but flexible) interface to the user by hiding all the low-level complexities. Secondly, we need to ensure that MHGrid is neutral to all kinds
³ Metaheuristics is a heuristic method for solving a very general class of computational problems by combining user-given black-box procedures - usually heuristics themselves - in a hopefully efficient way. Most metaheuristics algorithms take their inspiration from nature.
⁴ Global optimization refers to finding the extreme value of a given non-convex function in a certain feasible region. Classical optimization techniques have problems in dealing with global optimization problems; one of the main reasons for their failure is that they can easily become entrapped in local minima.
of metaheuristics algorithms. Thirdly, we need to make sure that all the services are available as WSRF-compliant Web services. Fourthly, despite all the uncertainties of a Grid, we need to provide a certain level of Quality-of-Service (QoS) to the user. Our key contribution is in combining the computational power of a Grid with the optimization capabilities of metaheuristics algorithms to give a truly general-purpose, easy to use Problem Solving Environment (PSE) for global optimization problems. Moreover, we provide all the services as a WSRF-compliant Web service, i.e., users can consume it like a normal Web service directly from their code. We have used a unique, hybrid parallelization technique that employs a GridRPC + GridMPI approach [40] to break the problem into coarse-grain and medium-grain sub-problems. This parallelization method provides flexibility, fault tolerance, and efficiency. We have also developed an XML-based markup language that serves as an interface between the user and MHGrid's Web service. We call this markup language the MetaHeuristics Markup Language (or MHML).

4.2 Related Work
Several projects have targeted this kind of framework in recent years. Some of them are discussed below:
• NEOS Server [41, 42, 43] is a general client-server system that is dedicated to optimization problems. Its main objective is to allow users to submit optimization problems and get the results. Users can also add a new solver and make it available to other users (however, it must be approved by the management). It has no Grid-compliant authentication or authorization and uses plain sockets for communication. NEOS has no capability to spread out the job to multiple resources on a distributed infrastructure. Moreover, it is not a black-box optimizer, as it requires the objective function to be inputted in an algebraic form.
• GEODISE [44] is intended to carry out engineering design search and optimization involving computational fluid dynamics (CFD). GEODISE exploits Grid computing technology to flexibly couple computational and data resources. GEODISE uses OGSA and Web services along with other tools for workflow, knowledge management, Grid-enabled Problem Solving Environments, and CFD.
• Nimrod/O [45] allows a user to run an arbitrary computational model as the core of a non-linear optimization process. The current implementation of Nimrod/O has only 4 solvers with the package and it does not allow the user to add his own solvers.
• SETI@Home is another good example of a computationally intensive optimization framework implemented over a worldwide distributed computing cluster (similar to the Grid concept). It is a problem-specific environment for the Search for Extra-Terrestrial Intelligence (SETI). It uses the BOINC middleware but is not a general-purpose optimization application and is not available for public use. However, a user can volunteer his computational resources for the project.
• Others: Other projects, like GE-HPGA [13], also attempt a similar kind of framework.
Although most of the projects mentioned above do not have a one-to-one relation to MHGrid, we have tried to make a rough comparison in Table 1. It is important to note that most of the Grid-based optimization projects in the recent past talk about simulations
Table 1. Comparison of MHGrid with the work done in the past
or a specific algorithm performed on the Grid, and do not provide their work as a service to the users. These projects completely ignore the service-oriented aspect of the Grid. MHGrid, however, is a service-oriented framework and provides all its features as a service to the users, who can use the MHGrid services to solve their optimization problems on a real Grid with the ease of a few clicks on the MHGrid portal.

4.3 Architecture of MHGrid
Figure 6 shows the architecture of the MHGrid framework at an abstract level. The main parts of MHGrid are discussed as follows:
• MHGrid's Web service is the most important part of MHGrid. It is a WSRF-compliant Web service that runs in a Globus container and is responsible for gluing the other modules of MHGrid together, as shown in Fig. 6. A user can consume this Web service directly using the Web Services Description Language (WSDL) interface file. These services include:
  – Addition of a solver or an objective function to the database.
  – Query for the list of available solvers and objective functions, along with their specifications.
  – Job submission.
  – Job administration.

Fig. 6. Abstract level architecture of MHGrid (shows the most notable components of the system)
• MHGrid's Web portal is MHGrid's access point from the user perspective. It is a 2nd-generation Web portal, using GridSphere [46] as the portlet container. We provide portlets customized for MHGrid. The MHGrid portal can simply be thought of as a client application consuming the MHGrid Web service. The portal provides the easiest access to MHGrid via any standard web browser (without the need to install any extra software).
• Directory index is a database that maintains the objective functions and solvers registered with the framework. It also maintains logs for all the jobs, along with the results obtained. The Web service can query this database in several different ways.
• MHGrid's Scheduler is a simple scheduler that is responsible for scheduling the jobs to appropriate resources on our Grid test bed.

4.4 MHGrid's Middleware Stack
The services provided by a low-level Grid middleware, such as Globus [10], although powerful, are typically too low-level to be used in a constructive manner by most applications. This creates a significant semantic gap between the application and the middleware services. Therefore, we have used high-level tools that build upon the low-level middleware tools to provide an abstraction to the user. The use of multilayered middleware has several advantages, varying from ease of use to inherent robustness. This layered approach to Grid middleware is shown in Fig. 7. Some of the notable tools in the high-level middleware layer include the Web portal, GridRPC [47], and schedulers. We are using Ninf-G [48] as a reference implementation of GridRPC and Condor [49] as the job manager for MHGrid.
Fig. 7. MHGrid’s Middleware. Note that the middleware layer is divided into two distinct levels, low-level middleware and high-level middleware. Also note that MHGrid sits at the application layer.
4.5 Distributed Implementation of Algorithms
We have used GridRPC and GridMPI[50] to achieve parallelization of the algorithms/solvers. Conventional parallel implementations usually depend on MPI or OpenMP programming structure. However, using GridMPI is not a very good choice in the case of the Grid environment. This is due to the various reasons listed below:
1. Unlike conventional parallel computing paradigms, in a Grid environment scheduling takes place dynamically.
2. MPI is not a good option to use in environments with large network latencies.
3. It is not fault tolerant.
4. GridMPI works only with unique global IPs.
Therefore, we used GridMPI for intra-cluster parallelization and GridRPC for inter-cluster parallelism. This unique parallelization technique employing GridRPC and GridMPI is borrowed from [40], and is shown in Fig. 8.
Fig. 8. Unique hybrid GridRPC + GridMPI approach to parallelization
As shown in Table 1, MHGrid supports automatic parallelization of algorithms with heavy fitness functions. For an optimization problem with a heavy fitness function, MHGrid uses multiple instantiations of the fitness function to calculate the fitness in parallel. This parallelization is kept completely hidden from the developer of the solver and/or objective function.
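The master-slave pattern behind this automatic parallelization can be sketched with Python's standard process pool; the snippet below only conveys the control flow of farming heavy evaluations out to workers and does not use MHGrid's actual GridRPC/GridMPI machinery, and heavy_fitness is a stand-in for a genuinely expensive objective function.

```python
from concurrent.futures import ProcessPoolExecutor
import random

def heavy_fitness(individual):
    # Stand-in for an expensive objective function (e.g., a long simulation).
    return sum(individual)

def evaluate_population(population, workers=4):
    """Master: dispatch one evaluation per individual; workers compute fitness in parallel."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(heavy_fitness, population))

if __name__ == "__main__":
    pop = [[random.randint(0, 1) for _ in range(32)] for _ in range(16)]
    print(evaluate_population(pop))
```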
4.6 Data Flow
Figure 9 depicts the sequence of operations followed in the MHGrid framework for the submission of a new job. As we can see in this figure, whenever a new job is received, the Web service creates a Grid job and forwards it either to a Ninf-G client or to a scheduler for execution. In MHGrid the job can be deployed on the Grid in two different ways, as shown in Fig. 10. These two scenarios are discussed below:
1. Using the Scheduler for deployment: In this scenario the job is handed over to the scheduler, which allocates the resources for both the solver (Ninf-G client) and the objective functions (Ninf-G server). This scenario is depicted in Fig. 10(a). This method is efficient in terms of overheads, but does not scale well to large network environments. Therefore, it is not very flexible.
2. Using Ninf-G for deployment: Another method for deployment of the remote executable is to initiate a very light Ninf-G client. The Ninf-G client then instantiates
Fig. 9. Sequence of operations for a job submission case
Fig. 10. Deployment of objective function using two different approaches (a) Deployment using Scheduler (b) Deployment using Ninf-G
the Ninf-G server by using the scheduler. This can be done using the InvokeServer functionality of Ninf-G. This scenario is depicted in Fig. 10(b). This method has more overheads than the earlier method, but on the other hand it is more reliable, flexible, and robust. It provides an automatic mechanism to recover from errors using the checkpointing mechanism provided by most schedulers. A solver or an objective function can use MPI for parallelization within the cluster, as shown in Fig. 8.
Fig. 11. This figure shows the main functionality provided by the MHGrid’s API. For the solver developer the Grid is a black-box
4.7 Users of MHGrid
MHGrid provides services to potentially four distinct types of users, listed below; the MHGrid portal provides separate portlets for each type of user:
1. Administrators: have rights not available to any other kind of user. A user with administrative rights can monitor all the running jobs and interrupt them in the middle of processing. He can also set the rights of other users. The administrator does this by using a special portlet included in the Web portal.
2. Solver developers: are the users who write an MHGrid-compatible solver/algorithm and add it to the MHGrid framework. Solver developers are provided with a simple-to-use API (the MHGrid API) to help them write an MHGrid-compatible solver. The API provides the functions to read MHML (see Sect. 4.8) based configuration files. In addition, it allows the user to make synchronous or asynchronous calls to the objective function. The most important functions provided by the API are listed in Fig. 11. When submitting a solver, the developer is required to submit a corresponding MHML-based Service Level Description (SLD) file. This SLD is used by MHGrid to find an agreement between the solver, the objective function, and the Grid resources in use.
3. Objective function developers: write MHGrid-compatible objective functions and add them to the framework. Similar to a solver, an objective function, once added to the framework, can be used by any other user. The objective function must comply with the standard Ninf-IDL interfaces defined by MHGrid. Figure 12 shows an example Ninf-IDL file. The user is also required to submit an MHML-based SLD file along with the objective function submission.
Fig. 12. A simple example of Ninf-IDL file
4. Ordinary users: are the users who run optimization jobs over MHGrid. The user is required to select the desired solver and objective function, along with the MHML (see Sect. 4.8) based job submission file (this file contains the job specification, the configuration of the solver, and the configuration for the objective function).

4.8 MHML
All the communication between the user/application and the Web service is done using a proposed language that we call the MetaHeuristics Markup Language (MHML). Here, we give only a brief introduction to MHML; for a complete description see Munawar et al. (2007) [51].
Why Standardize the Interfaces?
Standardization of communication interfaces is often ignored by developers. A standard communication interface can lead to a flexible design: it provides interoperability and reduces the chance of human error. MHGrid forces the developer to use a standard interface for the algorithm as well as the objective function. The interface can be standardized using different methods; however, XML appears to be the most promising language for such a purpose.
Fig. 13. Top level hierarchy of MHML
Brief Introduction to MHML
MHML can be considered a modification/extension of the language proposed by E. Alba et al. (2003) [52]. E. Alba et al. (2003) proposed a language, in the form of an XML DTD, to configure optimization algorithms, but it fails to address some very important issues regarding the configuration of an optimization algorithm. MHML provides many advantages over [52] and can be applied to a greater number of cases. MHML is defined as an XML schema, and has the capability to represent: (1) job configuration, (2) solver description and configuration, (3) objective function description and configuration, (4) client information, and (5) results/errors. A top-level hierarchy of MHML is shown in Fig. 13.
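Purely as an illustration of what such a job-submission document might look like, the snippet below builds a tiny XML tree with Python's standard library. The element and attribute names (Job, Solver, ObjectiveFunction, Parameter) are hypothetical stand-ins; the real MHML schema is specified in [51] and sketched in Fig. 13, not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical MHML-like job description; element names are illustrative only.
job = ET.Element("Job", name="maxsat-demo")
solver = ET.SubElement(job, "Solver", name="pLINC-GA")
ET.SubElement(solver, "Parameter", name="populationSize").text = "512"
objective = ET.SubElement(job, "ObjectiveFunction", name="MAX-SAT")
ET.SubElement(objective, "Parameter", name="instanceFile").text = "instance.cnf"

print(ET.tostring(job, encoding="unicode"))
```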
4.9 Grid Test-Bed and Experiments
In this section we give a theoretical analysis of the achievable speed-up for a linkage-based PGEA over MHGrid. We also give some empirical results obtained by running a PGEA over MHGrid. In addition, we describe a test case running on MHGrid from a user's perspective.

Theoretical Analysis of MHGrid
Even though Grid computing is not only about reducing execution time, this still remains an important parameter for analyzing the efficiency of an algorithm over a Grid. The speed-up of a parallel algorithm over a parallel computing environment can be defined as:

$$S = \frac{T_s}{T_p} \qquad (5)$$

where $T_s$ and $T_p$ denote the execution time when the algorithm is executed in serial and in parallel, respectively. It is not easy to formulate a single equation for MHGrid, as it allows the user to run any kind of GEA on a Grid. Therefore, we will give a theoretical analysis of only one algorithm, i.e., the linkage identification PGEA (pLINC) by Munetomo et al. (2003) [33] (shown in Fig. 5). We assume that $\lambda(n)$ is the total time taken by the solver, $\gamma(n)$ is the total time taken for fitness evaluations, and $n$ is the problem size. For the pLINC algorithm, $\lambda(n)$ can be divided into three parts: a serial part $\lambda_{serial}(n)$ (the time taken by selection and InterGA) and two parallelizable parts, $\lambda_{linkage}(n)$ (the part of the solver evaluating linkages) and $\lambda_{IntraGA}(n)$ (the time taken by the IntraGA step). We can define the speed-up as:
$$S(n, p) = \frac{\lambda_{serial}(n) + \lambda_{linkage}(n) + \lambda_{IntraGA}(n) + \gamma(n)}{\lambda_{serial}(n) + \dfrac{\lambda_{linkage}(n) + \lambda_{IntraGA}(n) + \gamma(n)}{p} + O(n, p)} \qquad (6)$$
where $p$ is the total number of computational nodes, and $O(n, p)$ is the parallelization overhead. According to Amdahl's law [14] (see Sect. 3.1 for further details), the equation for the maximum speed-up $S^{max}(n, p)$ can be obtained by assuming $O(n, p) = 0$. Now we will try to find the speed-up achieved by running the same algorithm on multiple clusters as compared to a single-cluster implementation. In this case the single-cluster implementation will become $T_s$ and the multiple-cluster implementation will
become $T_p$. We can use Eq. (5) to compute the speed gain. For this purpose we define $\omega$ as

$$\omega(n) = \lambda_{linkage}(n) + \lambda_{IntraGA}(n) + \gamma(n) \qquad (7)$$

Now we can compute $T_s$ and $T_p$ as follows:
• A single-cluster implementation is suitable when $\lambda(n) \approx \gamma(n)$ or $\lambda(n) > \gamma(n)$. It is recommended that a single-cluster implementation be used when the fitness evaluation is very light. We can compute $T_s$ as:

$$T_s = \lambda_{serial} + m\left(\alpha\,\frac{\omega_c}{s} + O_{intra}\right) \qquad (8)$$

where $s$ is the number of computational nodes in a cluster, $\alpha$ is the parallelism factor of the cluster, $O_{intra}$ gives the parallelism overhead within the cluster, $m$ is the total number of computational subgroups (the total number of clusters in the multiple-cluster implementation), and $\omega_c$ is equal to $\omega/m$.
• A multiple-cluster implementation is suitable for the cases where $\gamma(n) \gg \lambda(n)$. For such cases MHGrid automatically runs the solver on one cluster and distributes the fitness evaluations among other real or virtual clusters. We can ignore the time taken by the solver (linkage identification, mutation, crossover, and selection), $\lambda(n)$, as it is much less than the total time used for fitness evaluation $\gamma(n)$ and the Grid overheads $O(n, p)$, but we keep it for comparison purposes with the single-cluster case. Therefore, using this information, $T_p$ can be computed as:
$$T_p = O_{inter} + \lambda_{serial} + \sum_{i=1}^{m}\left(\alpha\,\frac{\omega_c^{\,i}}{s} + O_{intra}\right) \qquad (9)$$
where $O_{inter}$ is the communication overhead between clusters, $m$ is the total number of clusters, $s$ is the number of nodes in a single cluster, and $\alpha$ is the parallelism factor of a cluster (it is a function of the problem size and CPU specifications). In Eqs. (8) and (9), we can define:

$$u = \alpha\,\frac{\omega_c}{s} + O_{intra} \qquad (10)$$
Therefore, using this definition, and neglecting the term $\lambda_{serial}$ (because it is usually too small compared to the other values), we can write the equation for the maximum speed-up offered by MHGrid as:

$$S^{max} = \frac{T_s^{max}}{T_p^{min}} = \frac{m\,u^{max}}{m\,O_{inter}^{min} + u^{min}} \qquad (11)$$

where $T_s^{max}$ is the maximum total time to execute the pLINC algorithm over a single cluster, $T_p^{min}$ is the minimum total time to execute the pLINC algorithm over $m$ clusters, $O_{inter}^{min}$ is the minimum inter-cluster communication overhead, $u^{max}$ is the maximum total time taken by the slowest cluster to execute a $(1/m)$-th part of the algorithm, and $u^{min}$ is the minimum total time taken by the fastest cluster to execute a $(1/m)$-th part of the algorithm.
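A quick numerical sketch of Eq. (6) shows how the overhead term caps the achievable speed-up. The Python snippet below uses illustrative timing values (not measurements from this chapter) and assumes, for simplicity, that O(n, p) grows linearly with the number of nodes.

```python
def speedup(lambda_serial, omega, p, overhead_per_node=0.05):
    """Eq. (6) with omega = lambda_linkage + lambda_IntraGA + gamma (Eq. (7))
    and O(n, p) modelled as overhead_per_node * p (an assumption)."""
    serial_time = lambda_serial + omega
    parallel_time = lambda_serial + omega / p + overhead_per_node * p
    return serial_time / parallel_time

# Illustrative values: 1 s of serial work, 100 s of parallelizable work.
for p in (1, 4, 16, 64, 256):
    print(p, round(speedup(1.0, 100.0, p), 2))
```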
Fig. 14. Speed-ups for pLINC algorithm on different number of Grid nodes for different test cases. TF is the time taken for a single fitness evaluation.[53]
Empirical Results
The simulation given in this section was performed on an experimental Grid consisting of 14 IBM x3455 AMD Dual-Core Opteron model servers; each server had 2 GB of memory. Each node was running Fedora Core 6 with Globus Toolkit 4.0.x installed on it and was treated as a Globus resource. The job was submitted as a GRAM job and files were transferred using GridFTP. The problems used for benchmarking included:
• MAX-SAT: an NP-hard problem that asks for the maximum number of clauses which can be satisfied by any assignment.
• Sum of trap functions: the sum of trap functions is considered to be a GEA-difficult problem and is often used for benchmarking purposes.
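For reference, a sum-of-traps benchmark of the kind mentioned above can be written in a few lines. The sketch below uses the standard order-k deceptive trap on non-overlapping blocks; the block size and string length are arbitrary choices, not the settings used for the experiments in Fig. 14.

```python
def trap(block, k):
    """Order-k deceptive trap: optimum at all ones, deceptive slope towards all zeros."""
    u = sum(block)
    return k if u == k else k - 1 - u

def sum_of_traps(bits, k=5):
    """Sum of trap sub-functions over consecutive, non-overlapping blocks of size k."""
    assert len(bits) % k == 0
    return sum(trap(bits[i:i + k], k) for i in range(0, len(bits), k))

print(sum_of_traps([1] * 20))   # global optimum: 4 blocks * 5 = 20
print(sum_of_traps([0] * 20))   # deceptive attractor: 4 blocks * 4 = 16
```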
Fig. 15. Current resources of MHGrid, MHGrid CP is a condor pool having 16 execute resources, whereas MHGrid LC1-LC3 are logical clusters each having 5 Globus based computational resources
Fig. 16. Snapshots of MHGrid’s Web portal (a) Retrieve Portlet: retrieves already registered solvers and objective functions and show their properties (b) Register Portlet: registers a new solver or objective function with MHGrid (c) Job Submit Portlet: helps the user to submit an optimization job.[54]. NOTE: The screen shots have been resized and slightly modified.
Figure 14 shows the speed-ups achieved for the pLINC [33] algorithm using the above-mentioned test functions. The effect of TF and the Grid overheads is clear from the figure. The graph suggests that the Grid overheads are negligible for complex problems, while they can be significant in the case of simple problems.

MHGrid Environment
Figure 15 shows the current state of MHGrid. As computational resources, we have three clusters with 5 nodes each, and a Condor pool with 16 execute nodes. Each of the three clusters behaves as a Globus-based Grid resource, while jobs are submitted to the Condor pool by using the invoke server functionality of Ninf-G. New resources can be added to MHGrid very easily by making a few configuration changes. Figure 16 shows some snapshots of the MHGrid Web portal. The MHGrid Web portal offers all MHGrid services using custom-built portlets. Each portlet offers a unique service to the user, e.g., the Job Submit portlet (shown in Fig. 16(c)) helps the user to submit a new optimization job to the framework. As mentioned earlier, MHGrid's Web portal is just a client that consumes MHGrid's Web service. Note that it is possible to consume the Web service directly if the user does not want to use the Web portal. However, MHGrid's Web portal provides the simplest and most convenient way to use the services offered by MHGrid.
5 Summary and Conclusions

In this chapter we discussed some of the classical and modern paradigms in parallel computing, with an emphasis on the Grid computing environment. We also argued that Grid with SOA seems to be the future of distributed computing. We noted that, due to the recent developments in distributed computing, data deluge, networks, and multicore CPUs, parallel computing seems to be more important than ever before. Distributed computing paradigms, including the Grid, however, remain a challenging environment for the implementation of algorithms. Relatively large communication delays, unreliable networks, no central authority, and heterogeneous resources make it even more challenging. The Grid affects algorithm development in several different ways. An algorithm designed for a Grid environment should (1) be fault tolerant, (2) allow communication delays of hundreds of milliseconds, (3) be able to depend on external data sources whenever required, and (4) support late binding. We saw how Grid middleware provides a layer to hide the complex details from the user, but it is still very difficult to use a low-level Grid middleware directly for application development. Therefore, certain tools are built for abstraction, which can be treated as upper-layer middleware tools. These tools are much easier to use, but using them still needs some expertise. We presented a Grid-based framework that helps the user to solve global optimization problems using metaheuristics algorithms (including GEAs) over a Grid computing environment. The user is allowed to add a new algorithm or use an existing algorithm for solving the problem at hand. We provide the user with a simple-to-use API for porting existing algorithms to MHGrid. MHGrid uses GridRPC + GridMPI to parallelize the code in an efficient manner over the Grid. MHGrid is the first framework of its kind
that uses an OGSA-based Grid environment and offers all its services as a WSRF-compliant Web service. MHGrid's Web portal can be used directly from any web browser. In addition, we presented MHML, a language used to interact with MHGrid. We also emphasized the importance of interface standardization. We showed a theoretical analysis of a linkage identification GA running over MHGrid and presented empirical results of a simulation running over an actual Grid. In our experience, a practical and stable Grid is difficult to construct and is very error prone. The heterogeneity of Grids also poses a lot of problems, especially in the deployment of executables (one of the least studied areas of the Grid). In addition, even the high-level tools are difficult to use for an average Grid user. As a result, we need Problem Solving Environments (PSEs) like MHGrid to enable the user to solve specific kinds of problems over the Grid without taking into consideration any complexity of the Grid. Using MHGrid, any kind of parallel or serial algorithm can be deployed and used over a Grid computing environment with minimal effort made by the user (as MHGrid takes care of all the complexities by itself). However, MHGrid gives the best performance for properly designed hierarchical parallel GEAs with heavy fitness functions. MHGrid can serve as a base design for other similar applications in different areas.
Acknowledgments We would like to acknowledge a number of people who have helped us with the problems at different parts of this project. We would like to thank Hidemoto Nakada, Masato Asou, and Yoshio Tanaka for their help, regarding the use of Ninf-G. We would also like to thank Hiroshi Takemiya for sharing his knowledge about the Grid middlewares and other tools. We also thank the anonymous users who replied to our queries on the mailing lists.
References 1. Wickramasinghe, W., Steen, M.V., Eiben, A.: Peer-to-peer evolutionary algorithms with adaptive autonomous selection. In: GECCO 2007: Proceedings of the 9th annual conference on Genetic and evolutionary computation, pp. 1460–1467. ACM, New York (2007) 2. http://www.top500.org (June 2007) 3. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. MorganKaufman, San Francisco (1999) 4. Fox, G., Aktas, M.S., Aydin, G., Gadgil, H., Pallickara, S., Pierce, E., Sayar, A.: Algorithms and the Grid. Computing and Visualization in Science (CVS) (2005) 5. Fox, G.: Grids of Grids of Simple Services. Computing in Science and Engg. 6(4), 84–87 (2004) 6. Booth, D., Haas, H., McCabe, F., Newcomer, E., Champion, M., Ferris, C., Orchard, D.: Web Service Architecture. In: W3C Working Group Note W3C (2004) 7. Foster, I., Kishimoto, H., Savva, A., Berry, D., Djaoui, A., Grimshaw, A., Horn, B., Maciel, F., Siebenlist, F., Subramaniam, R., Treadwell, J., Reich, J.V.: The Open Grid Services Architecture, Version 1.0. GGF informational document Global Grid Forum(GGF) (2005) 8. Christensen, E., Curbera, F., Meredith, G., Weerawarana, S.: Web Services Description Language (WSDL) 1.1. W3C Working Group Note W3C (2001)
9. Mitra, N., Lafon, Y.: SOAP Version 1.2 Part 0: Primer, 2nd edn. W3C Working Group Note W3C (2007) 10. Foster, I.: Globus Toolkit Version 4: Software for Service-Oriented Systems. In: Jin, H., Reed, D., Jiang, W. (eds.) NPC 2005. LNCS, vol. 3779, pp. 2–13. Springer, Heidelberg (2005) 11. Miura, K.: Overview of Japanese science Grid project: NAREGI. Technical Report 3 (2006) 12. Foster, I.: What is the Grid? - a three point checklist. GRIDtoday 1(6) (2002) 13. Lim, D., Ong, Y.-S., Jin, Y., Sendhoff, B., Lee, B.-S.: Efficient Hierarchical Parallel Genetic Algorithms using Grid computing. Future Gener. Comput. Syst. 23(4), 658–670 (2007) 14. Amdahl, G., Gene, M.: Validity of the single processor approach to achieving large scale computing capabilities. pp. 79–81 (2000) 15. Gustafson, L.: Reevaluating Amdahl’s law. Commun. ACM 31(5), 532–533 (1988) 16. Cant´u-Paz, E.: A summary of research on parallel genetic algorithms. Technical report IlliGAL 95007, University of Illinois at Urbana-Champaign (1995) 17. Alba, E., Troya, J.: A survey of parallel distributed genetic algorithms. Complex. 4(4), 31–52 (1999) 18. Gordon, V., Whitley, D.: Serial and Parallel Genetic Algorithms as Function Optimizers. In: Forrest, S. (ed.) Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 177–183. Morgan Kaufmann, San Mateo (1993) 19. Tanese, R.: Distributed Genetic Algorithms. In: Proceedings of the 3rd International Conference on Genetic Algorithms, pp. 434–439. Morgan Kaufmann Publishers Inc., San Francisco (1989) 20. Whitley, D., Starkweather, T.: GENITOR II: a distributed genetic algorithm. J. Exp. Theor. Artif. Intell. 2(3), 189–214 (1990) 21. Davidor, Y.: A Naturally Occurring Niche and Species Phenomenon: The Model and First Results. In: Proceedings of the 4th International Conference on Genetic Algorithms(ICGA), pp. 257–263. Morgan Kaufmann, San Diego (1991) 22. Gorges-Schleuter, M.: ASPARAGOS An Asynchronous Parallel Genetic Optimization Strategy. In: Proceedings of the 3rd International Conference on Genetic Algorithms, pp. 422– 427. Morgan Kaufmann Publishers Inc., San Francisco (1989) 23. Manderick, B., Spiessens, P.: Fine-grained parallel genetic algorithms. In: Proceedings of the third international conference on Genetic algorithms, pp. 428–433. Morgan Kaufmann Publishers Inc., San Francisco (1989) 24. De Jong, K., Sarma, J.: On Decentralizing Selection Algorithms. In: Eshelman, L. (ed.) Proceedings of the Sixth International Conference on Genetic Algorithms, pp. 17–23. Morgan Kaufmann, San Francisco (1995) 25. Gorges-Schleuter, M.: A Comparative Study of Global and Local Selection in Evolution Strategies. In: Eiben, A.E., B¨ack, T., Schoenauer, M., Schwefel, H.-P. (eds.) PPSN 1998. LNCS, vol. 1498, p. 367. Springer, Heidelberg (1998) 26. Eiben, A., Schoenauer, M., van Krevelen, D., Hobbelman, M., ten Hagen, M., van het Schip, R.: Autonomous selection in evolutionary algorithms. In: GECCO 2007: Proceedings of the 9th annual conference on Genetic and evolutionary computation, p. 1506. ACM, New York (2007) 27. Cant´u-Paz, E., Goldberg, D.: Parallel Genetic Algorithms with Distributed Panmictic Populations. Technical report IlliGAL 99006, University of Illinois at Urbana-Champaign (1999) 28. Cant´u-Paz, E.: A Survey of Parallel Genetic Algorithms. Technical report IlliGAL 97003, University of Illinois at Urbana-Champaign (1997) 29. Imade, H., Morishita, R., Ono, I., Ono, N., Okamoto, M.: A grid-oriented genetic algorithm for estimating genetic networks by S-systems. 
In: SICE 2003 Annual Conference, vol. 3(4-6), pp. 2750–2755 (2003)
30. Imade, H., Morishita, R., Ono, I., Ono, N., Okamoto, M.: A grid-oriented genetic algorithm framework for bioinformatics. New Gen. Comput. 22(2), 177–186 (2004) 31. Herrera, J., Huedo, E., Montero, R., Llorente, I.: A Grid-Oriented Genetic Algorithm. In: Sloot, P.M.A., Hoekstra, A.G., Priol, T., Reinefeld, A., Bubak, M. (eds.) EGC 2005. LNCS, vol. 3470, pp. 315–322. Springer, Heidelberg (2005) 32. Deerman, K.: Protein Structure Prediction Using Parallel Linkage Investigating Genetic Algorithms. Master’s thesis Air force Inst of Tech, Wright-Patterson AFB OH School of Engineering (1999) 33. Munetomo, M., Murao, N., Akama, K.: A Parallel Genetic Algorithm Based on Linkage Identification. In: Cant´u-Paz, E., Foster, J.A., Deb, K., Davis, L., Roy, R., O’Reilly, U.-M., Beyer, H.-G., Kendall, G., Wilson, S.W., Harman, M., Wegener, J., Dasgupta, D., Potter, M.A., Schultz, A., Dowsland, K.A., Jonoska, N., Miller, J., Standish, R.K. (eds.) GECCO 2003. LNCS, vol. 2724. Springer, Heidelberg (2003) 34. Munetomo, M., Goldberg, D.: Identifying Linkage Groups by Nonlinearity/Nonmonotonicity Detection. In: Proceedings of the Genetic and Evolutionary Computation Conference, Orlando, Florida, USA, 13-17 1999, vol. 1, pp. 433–440. Morgan Kaufmann, San Francisco (1999) 35. Munetomo, M.: Linkage Identification Based on Epistasis Measures to Realize Efficient Genetic Algorithms. In: Proceedings of the 2002 Congress on Evolutionary Computation, pp. 1332–1337 (2002) 36. Karpowicz, M., Niewiadomska-Szynkiewicz, E., Zientak, M.: A Modified Parallel Genetic Algorithm Based on Linkage Identification. In: KAEiOG, Kazimierz Dolny (2004) 37. Pelikan, M., Goldberg, D., Cant´u-Paz, E.: Linkage Problem, Distribution Estimation, and Bayesian Networks. Technical Report 98013 Urbana, IL (1998) 38. Ocen´asek, J., Schwarz, J.: The Parallel Bayesian Optimization Algorithm. In: Proceedings of the European Symposium on Computational Inteligence, pp. 61–67. Springer, Heidelberg (2000) 39. Wolpert, H.D., Macready, G.W.: No Free Lunch Theorems for Search. Technical Report SFI-TR-95-02-010 Santa Fe, NM (1995) 40. Takemiya, H., Tanaka, Y., Sekiguchi, S., Ogata, S., Kalia, R., Nakano, A., Vashishta, P.: Sustainable adaptive grid supercomputing: multiscale simulation of semiconductor processing across the pacific. In: L¨owe, W., S¨udholt, M. (eds.) SC 2006, p. 106. ACM, New York (2006) 41. Czyzyk, J., Mesnier, M., More, J.: The NEOS Server. IEEE Journal on Computational Science and Engineering 5, 68–75 (1998) 42. Gropp, W., Mor’e, J.: Optimization environments and the NEOS server (1997) 43. Dolan, E.: The NEOS Server 4.0 Administrative Guide. Technical Memorandum ANL/MCSTM-250 Mathematics and Computer Science Division, Argonne National Laboratory (2001) 44. Cox, S., Chen, L., Campobasso, S., Duta, M., Eres, M., Giles, M., Goble, C., Jiao, Z., Keane, A., Pound, G., Roberts, A., Shadbolt, N., Tao, F., Wason, J., Xu, F.: Grid Enabled Optimisation and Design Search (GEODISE). Technical report (2002) 45. Abramson, D., Lewis, A., Peachy, T.: Nimrod/O: A Tool for Automatic Design Optimization. In: The 4th International Conference on Algorithms & Architectures for Parallel Processing (ICA3PP 2000), Hong Kong (2000) 46. Novotny, J., Russell, M., Wehrens, O.: GridSphere: a portal framework for building collaborations: Research Articles. Concurr. Comput.: Pract. Exper. 16(5), 503–513 (2004) 47. Symour, K., Nakada, H., Matsuoka, S., Dongarra, J., Lee, C., Casanova, H.: Overview of GridRPC: A remote procedure call API for grid computing. In: Proc. 
3rd Int. Workshop Grid Computing, pp. 274–278 (2002) 48. Tanaka, Y., Nakada, H., Sekiguchi, S., Suzumura, T., Matsuoka, S.: Ninf-G: A Reference Implementation of RPC-based Programming Middleware for Grid Computing. Journal of Grid Computing 1(1), 41–51 (2003)
49. Frey, J., Tannenbaum, T., Foster, I., Livny, M., Tuecke, S.: Condor-G: A Computation Management Agent for Multi-Institutional Grids. In: Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC), San Francisco, California, pp. 7–9 (2001) 50. Ishikawa, Y., Kaneo, Y., Edamoto, M., Okazaki, F., Koie, H., Takano, R., Kudoh, T., Kodama, Y.: Overview of the GridMPI Version 1.0. In: SWoPP 2005 (2005) 51. Munawar, A., Wahib, M., Munetomo, M., Akama, K.: Standardization of Interfaces for MetaHeuristics based Problem Solving Framework over Grid Environment. In: Proccedings of HPCAsia 2007, Seoul, South Korea (2007) 52. Alba, E., Garc-Nieto, J., Nebro, A.: On the Configuration of Optimization Algorithms by Using XML Files (2003) 53. Munawar, A., Wahib, M., Munetomo, M., Akama, K.: Optimization Problem Solving Framework Employing GAs with Linkage Identification over a Grid Environment. In: CEC 2007: Proceedings of IEEE congress on Evolutionary Computation, Singapore (2007) 54. Wahib, M., Munawar, A., Munetomo, M., Akama, K.: A General Service-Oriented Grid Computing Framework For Global Optimization Problem Solving. In: SCC 2008: Proceedings of the 2008 IEEE International Conference on Services Computing, Honolulu, Hawaii, USA. IEEE, Los Alamitos (to appear, 2008)
Identification and Exploitation of Linkage by Means of Alternative Splicing Philipp Rohlfshagen and John A. Bullinaria School of Computer Science, University of Birmingham Edgbaston, Birmingham B15 2TT, United Kingdom
[email protected],
[email protected] Summary. Alternative splicing is an important cellular process that allows the expression of a large number of unique cell-specific proteins from the same underlying strand of DNA, and thereby drastically increases the organism’s phenotypic plasticity. Its emergence is facilitated by the modular composition of genes into numerous semiautonomous building blocks. In artificial evolution, such modular composition is usually unknown initially, but once learned may greatly increase the algorithm’s efficiency. In this paper, an abstract interpretation of alternative splicing is presented that emulates some of the properties of its natural counterpart. Two appoaches, both based upon a simple (1+1) evolutionary algorithm, are described and shown to work well on established benchmark problems. The first algorithm, eAS, is designed for cyclical dynamic optimisation problems: it systematically merges the problem variables into groups that capture the properties exhibited by a finite number of successive states and reuses that information when required. The second algorithm, iAS, employs a systematic search to identify a sub-set of variables for which simultaneous inversion affords an increase in fitness. This approach seems particularly useful for problems that have many local optima that are far apart in the search space. Results from a systematic series of experiments highlight the intrinsic attributes of each algorithm, and allow analysis in terms of the identification and exploitation of linkage.
1 Introduction

Evolutionary Algorithms (EAs) are abstract interpretations of biological evolutionary systems in which potential solutions to the problem of interest survive and reproduce according to some measure of their fitness. Repeated application of appropriate genetic operators and selection lead to increasing fitness from one generation to the next. EAs have become a popular choice of algorithm for a diverse range of optimisation problems, especially those where traditional methods tend to fail (e.g., those with highly rugged multi-modal search spaces). Numerous extensions have been suggested over recent years that improve the performance of the canonical framework in a variety of domains. In particular, understanding the identification and exploitation of linkage in these algorithms can lead to more efficient EAs [14]. The concept of linkage originates in genetics and describes the physical relationship between two genetic loci: genes located sufficiently close to one another
on the same chromosome will be inherited as a single unit, known as a linkage group, with high probability. As genes are flexible units that have the ability to 'move' around the chromosome, one would expect interdependent genes to 'attract' one another. This concept has been extended to evolutionary computation (EC) where linkage is commonly used to describe the interdependency amongst the variables of a problem. This interdependency means that the optimal state (e.g., 0 or 1) of one variable S_i depends on the state of another variable S_j. Quantifying linkage is difficult in general, but becomes more straightforward if the problem is decomposable. If S = {S_1, S_2, . . . , S_n} describes a set of n variables S_i, the fitness function is said to be an Additively Decomposable Function (ADF) if it can be written as a sum of lower-order sub-functions:

f(S) = \sum_{i=1}^{m} f_i(S_{V_i})   (1)
where m is the number of sub-functions, fi is the i-th sub-function, and SVi is the variable sub-set defined by index set Vi . For example, if Vi = {1, 2, 3, 5} then SVi = {S1 , S2 , S3 , S5 }. The set of variables specified by Vi is a set of interdependent variables, also known as a linkage set or, more commonly, a building block (BB). For most real world applications, the linkage sets will overlap, but they still form a useful basis for analysis. In evolutionary computation we are usually interested in optimising some objective fitness function Φ : X → R, where X represents the problem. The mapping from the problem’s parameters to the elements in X is usually static and predetermined. If no problem specific knowledge is used, the proximity of elements in X will not necessarily reflect the relationship between the parameters of the problem. Consequently, genetic operators acting on X are likely to be disruptive. The identification of linkage groups may allow this disruption to be limited, and hence not only enhances the algorithm’s performance, but also leads to more scalable algorithms. This idea has already been exploited in numerous different ways, some notable examples being the Messy GA and Fast Messy GA [15, 17], and the Linkage Learning Algorithm [21], which attempt to evolve appropriate representations that capture the relationship between the variables of the problem. Another approach has been to use statistical models known as Estimation of Distribution Algorithms (EDAs), such as the Compact and Extended Compact GA [22, 23]. Finally, another class of approaches uses perturbations to detect dependencies amongst variables, such as in the Gene Expression Messy GA [29]. In this chapter, we study the identification and utilization of linkage in two EAs inspired by the cellular process of alternative splicing (AS). AS is a posttranscriptional process that occurs in the majority of cells of higher eukaryotes and is partly responsible for the proteomic diversity observed in these organisms. The dominant property of this process is the ability to express meaningful sub-sequences of the underlying encoding (DNA in this case) alternatively. In nature, this ability is facilitated by the occurrence of readily available BBs, known as exons and introns. Exons (and, as will be shown in section 2, introns) may be viewed as a tightly linked BBs that, by themselves, contain meaningful
information about the search space (in effect, sub-solutions). In artificial systems, the structure of linkage sets is rarely available from the onset and needs to be identified during the execution of the algorithm. Therefore, in order to implement the mechanisms of AS, considerable effort has to be directed at the identification of modules in the search space. The first algorithm proposed here, called explicit Alternative Splicing (eAS), is an implicit memory approach for cyclic dynamic environments. It is an extension of an algorithm we presented previously [37]. This algorithm utilises a single encoding that represents a tree of variable depth; and each new state encountered results in a new level being added to that tree. The problem variables are systematically placed into nodes of the tree that correspond to their value across all states encountered thus far. Special care is taken to prevent the genetic operators disturbing the acquired information. In other words, throughout the execution of eAS, the memory of previous states is not acquired and stored directly, but instead, variables common to sub-sets of states are identified and brought together within a single node. This leads to a very concise representation of the search space and allows the algorithm to successfully identify and preserve common (shared) genetic material across a succession of different environmental states when tested on a simple dynamic problem. The second algorithm, called implicit Alternative Splicing (iAS), is proposed for binary encodings and relies upon a top-down search to identify sub-sets of binary variables that may be inverted successfully in their entirety. EAs often suffer from entrapment in local optima from which an escape is unlikely due to a general loss of diversity (i.e. premature convergence). iAS attempts to circumvent this issue by using large-scale perturbations of the encoding that are incrementally refined. In other words, iAS tries to perturb the underlying encoding without disrupting any of the tightly linked BBs discovered so far. The remainder of this chapter is structured as follows: first, the process of AS in biological systems is presented in section 2, followed by a literature review of implementations that resemble AS in EC (section 3). The first algorithm, eAS, is outlined in section 4, including an overview of dynamic optimisation, a description of the algorithm, the experimental setup and results. Section 5 presents the second algorithm, iAS, following a similar structure to the previous section. Finally, the chapter is concluded in section 6 with a summary of the two approaches and a brief discussion of prospects for future work in this area. Throughout, special emphasis is given to the underlying concept of linkage.
2 Alternative Splicing in Nature

The physical carrier of genetic information in almost all living organisms is DeoxyriboNucleic Acid (DNA). The DNA of an organism contains all the information required (viewed in the context of the cell) to allow the development of all bodily functions. The DNA of multi-cellular eukaryotes typically contains multiple different regions that are demarcated using well specified sequences of nucleotides. In general, regions that have a dedicated function are known
[Figure 1 schematic: promoter and terminator flank Exons I–III and Introns I–II on the DNA; the transcribed RNA carries start/stop codons and 5'/3' UTRs, the introns are removed during RNA processing, and the remaining strand is translated into amino acids.]
Fig. 1. Protein synthesis of a eukaryotic gene: DNA is transcribed to RNA, introns are spliced (RNA processing) and the resulting strand is translated to a polypeptide of amino acids (a protein)
as genes. In classical genetics, a gene is a conceptual entity, a hereditary unit that determines or influences a certain physical characteristic. In molecular genetics, a gene is a physical entity, a well defined strand of DNA that contains instructions to synthesise a protein or other functional constructs. In prokaryotes (bacteria), a gene exclusively contains instructions to fulfill its purpose (e.g., synthesise a protein). In higher eukaryotes (plants and animals), on the other hand, genes usually contain non-coding regions that do not directly contribute towards the final protein product. These interrupted genes are composed of short, coding sequences called exons, and longer, non-coding sequences known as introns (see figure 1). Prokaryotes usually have very condensed DNA that allows highly efficient replication. The DNA of eukaryotes, on the other hand, may contain numerous structural and regulatory elements that may affect gene expression: after transcription, but before translation, an intervening processing step is required to remove non-coding regions that do not contribute towards the protein product. This is known as RNA processing, and physically removes introns from the RNA before it is translated into a protein. Every gene is made up of at least one exon and always starts and ends with an exon, independent of the total number of exons and introns. Exons and introns always occur in an alternating fashion, with the introns usually occupying far larger regions of the gene than exons. The classical pathway of gene expression removes and discards all introns, but AS may affect the splicing event so that introns may be retained or exons may be skipped. This may lead to numerous different transcripts from a single template. AS frequently takes place in the human genome and is now recognised
Fig. 2. The five most common forms of AS (the boxes represent exons, the horizontal lines correspond to introns): (a) alternative 5' site, (b) alternative 3' site, (c) cassette exon, (d) mutually exclusive exons, and (e) retained intron. Adapted from [18].
as being one of the most fundamental sources of proteomic complexity. Extensive reviews of the mechanisms and developmental consequences of AS already exist [3, 34]. AS events are thought to be regulated by factors such as cell type or developmental stage [32] and occur in an estimated 60% of all genes in the human genome [24, 27]. It is therefore an important factor in accounting for the discrepancy between the size of the human genome and proteome. The five main splicing events are depicted and described in figure 2. The importance of AS is best illustrated by the fact that AS determines the sex in Drosophila melanogaster : the female splice variant includes exon 4 and the male one does not [2]. The following scenario provides a good illustration of an effect of AS that is highly relevant to the field of EC: AS is often associated with exon tandem duplication events [31]. Any exon that is tandem duplicated (that is, has an exact copy placed in parallel) may be regulated as an alternative element at first, and hence it does not interfere with the existing exons. The new exon is expressed only in a small fraction of transcripts (minor form), meaning that the original gene expression is largely preserved (major form). This effectively removes the selection pressure from the new exon and creates a neutral or nearneutral region which is free to accumulate mutations. If the minor form should yield an improvement in fitness, positive selection pressure will favour it, and it will subsequently increase in frequency. The alternative pathway then eventually
becomes the major form, or a tissue-specific expression (see [35]). In other words, the temporary suspension of selection pressure caused by AS may allow the gene to escape local optima in the fitness landscape. An interesting perspective is offered by Herbert and Rich [25] who classify prokaryotes as 'hard-wired' because they usually consist of a single chromosome containing almost exclusively protein coding material. DNA therefore serves as a true template which is translated faithfully into proteins without the need for significant modification. These attributes are very much evident in current EAs. Eukaryotes, on the other hand, are 'soft-wired' as they use relatively little content of their genome, yet result in a far more complex translation by means of post-transcriptional regulation. Here a single template may be used to create vast numbers of similar or even significantly distinct proteins. We believe these insights from genetics may have a significant impact on the development of novel EAs.
3 Alternative Splicing in Evolutionary Computation

The majority of EAs, and Genetic Algorithms (GAs) in particular, were originally inspired by the field of population genetics, and most notably Fisher's genetical theory of natural selection [11]. In recent years, however, advances in molecular genetics, including the genome sequencing projects, have triggered an interest in abstractions that are more directly inspired by the biochemical information processing architecture of organic cells. For example, properties of the genetic code have been exploited successfully by Kargupta and Ghosh [28], and a simple implementation of RNA editing, a post-transcriptional process that selectively modifies individual nucleotides, has been presented by Huang and Rocha [26]. It appears that there have been no direct implementations of AS in the literature other than our own work on this subject [37, 38]. However, there have been studies that explore the phenomenon of 'alternative expressions', one of the earliest being Levenick's Swappers [33]. Swappers are very simple encodings that consist of two parts, one of which is expressive (active) at any one time. This is somewhat similar to a dynamic exon-intron structure, and Levenick showed how such dynamic expression may be useful in accelerating the algorithm's rate of adaptation. A more elaborate and (more importantly) adaptive approach is the structured GA due to Dasgupta and McGregor [9] which uses a control sequence of meta-bits that determines the regions of the encoding to be expressed. Only one meta-bit (and thus one region) may be active at any one time. This encoding was proposed particularly to deal with cyclic environments: whenever a new environment is encountered, the control sequence is expected to express the part of the genome that implicitly stores that particular state. Similarly, Collard et al. [8] proposed the Dual GA (DGA) which uses a standard binary encoding, but with an extra bit to determine whether the encoding is expressed as its dual, where the dual is defined simply as the inverse of the binary encoding. This meta-bit is, like the rest of the encoding, subject to crossover
and mutation and thus adaptive. This work was extended by Gaspar et al. [12] to include multiple meta-bits that control different sections of the encoding: the Folding GA has several meta-genes that determine the state of all subsequent bits in the encoding until the next meta-gene. This encoding has been tested successfully in dynamic domains [13] and is discussed in further detail in section 5.1. Yang extends these concepts further and proposes a primal-dual encoding for use in static [41] and dynamic [40] environments. Here, primal chromosomes are defined as those individuals currently in the population, and a selection scheme is used that considers the individuals of least fitness to be chosen for dual mappings: whenever the dual produces a greater fitness than its primal counterpart, the primal is replaced by the dual. There are numerous further studies dealing with the concept of diploid or poly-ploid encodings, though that work is generally motivated by the concept of multiple alleles rather than the expression of alternative segments at the molecular level. We know of no further work that explicitly addresses the utility of AS or closely related concepts in artificial evolution. This may be due to the fact that AS has only recently gained increasing attention due to a better understanding (from the recent genome sequencing projects in particular). Finally, it should be noted that the concept of linkage used here differs slightly from its generally accepted meaning in genetics, but not necessarily from its usual meaning in EC. In general, the design of EAs in EC follow the principles of population genetics, attempting to obtain a favourable distribution of alleles by means of crossover and mutation under selection pressure. The two algorithms presented in the remainder of this chapter assume that the underlying encoding represents a strand of DNA (or, to be more precise, a single gene) and not a series of genes as found on a chromosome. Thus the linkage groups or sub-structures do not correspond to groups of genes, but groups of nucleotides / amino acids (i.e. exons and introns). This is a vital distinction from a biological perspective, but irrelevant given the definition of linkage in EC (outlined above) where we are only interested in the interdependencies of the variables, independently of how they have been represented conceptually.
4 Explicit Alternative Splicing

Historically, EAs have been applied mainly to classes of static optimisation problems. Many problems are, however, dynamic in the real world, and it is thus not surprising to see increasing efforts directed towards Dynamic Optimisation Problems (DOPs) [5]. A DOP, simply stated, is a problem that changes over time t = 1, 2, . . .. More formally:

F(X, t) = f_t(X) \quad \text{where} \quad f_t(X) = D(f_{t-1}(X), t-1)   (2)
and where D(f_i(X), t) encapsulates the dynamics of the problem. The reason for this recent interest is straightforward: problems of great complexity are either impossible or very costly to solve. It follows that in changing environments one should attempt to utilise the best solutions found so far to guide adaptation to
the shifted problem. This should in general be more efficient than a complete restart of the algorithm, which would be the simplest technique to deal with dynamics. In fact, such a brute force approach is essentially identical to static optimisation, except that the number of function evaluations (FEs) allowed is restricted by the dynamics of the search space. A more efficient approach is to increase the diversity of the population (e.g., via hyper-mutations [7] or random immigrants [19]), either throughout the execution of the algorithm, or whenever an environmental change has been detected. Another useful approach is to have multiple populations that keep track of the optima encountered in the past, and make use of them as appropriate. Alternatively, a central population could keep track of the overall search, while multiple smaller populations diverge to track promising regions in the search space [6]. Finally, the most recent trend attempts to exploit the concepts of anticipation and prediction, which is particularly useful if the current solution or action affects the dynamics of the search space (i.e. there is time-linkage [4]). Most relevant to this work, however, is the use of memory which has been implemented either implicitly or explicitly. In the former case, the memory is embedded in the encoding, usually in the form of diploidy (e.g., [16]) or polyploidy (e.g., [20]). This approach has been used successfully for small numbers of distinct states, but the space requirements and the memory’s loss of integrity over time prevent this technique from scaling to larger numbers of states. Subsequently, more research has focused on explicit memory schemes which attempt to maintain a diverse register of previously good solutions. In cases where the environment returns to a point similar or identical to a previously visited one, solutions from the register may be reused efficiently. In general, the dynamics of a problem may be characterised by their magnitude and frequency. Here we only consider dynamics that do not alter the actual structure of the search space, but only shift the search space by some distance. Then the magnitude ρ corresponds to the distance between the global optima at times t and t + 1. A value of ρ = 0.2, for example, means the global optimum has shifted a distance of 0.2n where n is the size of the problem. The frequency of change, τ , describes, in FEs, how often a change occurs. 4.1
Pseudo Rhythm
Prior to describing our proposed algorithm, it is important to stress a particular property of random (non-cyclical) DOPs. Randomly changing environments do not follow a strict rhythmic pattern. However, it is possible to formalise a notion of rhythm for acyclic environments using the principle of pseudo-periodicity [10], which was originally defined for pseudo-rhythmic random Boolean networks. Pseudo-periodicity allows one to estimate the correlation amongst a succession of M states. The correlation between any two binary states X(t) and X(t') at times t and t' is defined as

C(t, t') = \frac{1}{N} \sum_{i=1}^{N} X_i^*(t) X_i^*(t')   (3)
Fig. 3. Pseudo-rhythm of randomly changing environments: the y-axis shows the degree of correlation amongst states, the x-axis corresponds to the i-th successive state: (a) changes of magnitude 0.1-0.5; (b) changes of magnitude 0.6-1.0. The inlay in figure (b) shows a magnification of the period up to the fourth successive state, excluding ρ = 1.0
where X_i^*(t) is the mapping of X_i(t) onto [−1, 1]. The overall estimate for a succession of states is then given by the auto-correlation

AC(k) = \frac{1}{M} \sum_{t=1}^{M} C(t, t + k)   (4)
for k = 0, 1, 2, . . .. Further details have been discussed by Di Paolo [10] (also see [39]). In order to explore the rhythmic activity of randomly changing environments, we generated a succession of 2000 binary states using the framework described in section 4.3 using different degrees of change (ρ ∈ {0.1, 0.2, . . . , 1}) between successive states. The results of this basic analysis are shown in figure 3: there is no correlation amongst states for magnitudes of change equal to or below 0.5. However, once the magnitude of change affects the majority of bits, a rhythmic pattern emerges. This trend increases as the magnitude of change approaches a value of 1 (which corresponds to a cyclic environment). This implies that memory approaches may not only work well for strictly cyclical domains, but also for randomly changing domains where the change between successive states is sufficiently large. This property is investigated further in section 4.4.
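The pseudo-rhythm analysis is straightforward to reproduce. The following Python sketch (ours, not the authors' code) estimates equations 3 and 4 for a succession of random states; for simplicity it generates the succession by flipping a randomly chosen fraction ρ of the bits at each change rather than using the full generator of section 4.3:

import random

def generate_states(n, rho, num_states):
    # each state differs from its predecessor in round(rho * n) randomly chosen bits
    state = [random.randint(0, 1) for _ in range(n)]
    states = [state[:]]
    flips = round(rho * n)
    for _ in range(num_states - 1):
        for i in random.sample(range(n), flips):
            state[i] = 1 - state[i]
        states.append(state[:])
    return states

def correlation(x, y):
    # equation 3: states are mapped from {0, 1} onto {-1, +1}
    return sum((2 * a - 1) * (2 * b - 1) for a, b in zip(x, y)) / len(x)

def auto_correlation(states, k):
    # equation 4: average correlation between states t and t + k
    m = len(states) - k
    return sum(correlation(states[t], states[t + k]) for t in range(m)) / m

random.seed(0)
states = generate_states(n=100, rho=0.9, num_states=2000)
for k in range(6):
    print(k, round(auto_correlation(states, k), 3))

For large ρ the printed values oscillate between strongly negative and strongly positive correlations, which is the pseudo-rhythm visible in figure 3(b); for ρ ≤ 0.5 they stay close to zero.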
4.2 eAS: The Algorithm
The general idea of eAS is to have a single encoding that may be expressed in numerous different ways. This technique is essentially an implicit memory approach, because almost all the information required to reconstruct previously visited states is stored within a single encoding. The most significant difference from other implicit memory approaches is the way in which the memory
is constructed and reused. The states visited are not stored in their entirety, but rather the algorithm attempts to capture the relatedness of a succession of states. The algorithm presented here is a refinement of an algorithm we presented earlier [37]. For the sake of completeness, we will briefly describe the original algorithm, eAS I, first.

eAS I

This algorithm combines the principles of explicit and implicit memory to represent multiple states concisely within a single non-redundant encoding. The encoding consists of two parts, a memory of p splicing patterns Y ∈ {0, 1}^{p×q} and a virtual gene X ∈ {1, 2, . . . , q}^n which is divided into q segments. The splicing patterns control which segments of the virtual gene contribute towards the phenotype. There are p such splicing patterns, each of length q, only one of which is active at any one time (with the index of the active splice denoted by σ). Furthermore, inactive splicing patterns are shielded from mutation (explicit memory) while the active pattern is mutated with low probability using a standard binary mutation operator. The problem variables are able to 'move' between segments by means of a mutation operator that simply places a variable from one segment into a randomly chosen one: the mutation operator considers every element in X and mutates element i with probability p_m by reassigning the value of X_i to a randomly chosen one from {1, 2, . . . , X_i − 1, X_i + 1, . . . , q}. All variables within a segment that is 'expressed' are assigned a value of 1, all other variables are set to 0. A segment j is expressed if Y_{σj} = 1. The following simple example for n = 10, p = 4 and q = 5 will illustrate this. Row 1 shows the encoding (segment numbers for each variable) and rows 2-5 show the (active) splicing patterns on the left and the resultant phenotype on the right.

(2, 3, 0, 0, 0, 1, 3, 3, 3, 4)
(0_0, 0_1, 0_2, 0_3, 0_4) ∴ (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
(0_0, 1_1, 1_2, 0_3, 1_4) ∴ (1, 0, 0, 0, 0, 1, 0, 0, 0, 1)
(0_0, 0_1, 1_2, 1_3, 1_4) ∴ (1, 1, 0, 0, 0, 0, 1, 1, 1, 1)
(1_0, 1_1, 1_2, 1_3, 1_4) ∴ (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

Whenever a change in the environment occurs, all p splicing patterns are evaluated and the currently best one (as judged by the fitness of the resulting phenotypes) is activated. Such an approach effectively allows one to view cyclic dynamic optimisation problems as static. There is one globally optimal solution that solves the problem independent of its current state and, once that solution is found, no further adaptation is required unless noise and uncertainty is introduced into the system. The encoding bears some noticeable similarity to the Messy GA [15] and Linkage Learning Algorithms [21]. However, there is no under- or over-specification and no crossover. The latter is partly substituted by the use of splicing patterns which have a similar effect as they affect multiple variables simultaneously.
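To make the eAS I expression step concrete, the following small Python sketch (ours; the function name express and the list representation are illustrative assumptions) reproduces the phenotypes of the example above:

def express(gene, pattern):
    # eAS I expression: a variable is 1 iff its segment is switched on in the pattern
    return [1 if pattern[segment] == 1 else 0 for segment in gene]

# the n = 10, q = 5 example from the text: gene[i] is the segment of variable i
gene = [2, 3, 0, 0, 0, 1, 3, 3, 3, 4]
print(express(gene, [0, 0, 0, 0, 0]))  # (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
print(express(gene, [0, 1, 1, 0, 1]))  # (1, 0, 0, 0, 0, 1, 0, 0, 0, 1)
print(express(gene, [0, 0, 1, 1, 1]))  # (1, 1, 0, 0, 0, 0, 1, 1, 1, 1)
print(express(gene, [1, 1, 1, 1, 1]))  # (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)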
The problem with this approach is that one needs to specify the number of segments and splices in advance, which might pose a problem: the number of splices corresponds to the number of unique states in the environment and the number of segments should be chosen according to the relatedness of the states encountered. This information is not usually available a priori. Nevertheless, we confirmed empirically that approximations for p and q are sufficient for our algorithm to significantly outperform a simple GA, as well as a GA with hypermutations or random immigrants, in certain types of dynamics for the problems considered [37].

eAS II

The algorithm presented now was developed to improve upon the previous algorithm and to allow for a more simplified application. The most significant difference is the elimination of the splicing patterns, and the number of segments required no longer has to be set a priori. Now the encoding consists of a single vector X ∈ {0, 1, . . . , 2^d − 1}^n where d ≥ 1 is determined dynamically throughout the execution of the algorithm, and has an initial value of d = 1. This encoding, with fitness ψ = Φ(X), essentially represents a tree. At first, when the algorithm is initiated, all variables may belong to either one of two groups, one group for those variables equal to 1, and the other for variables equal to 0. The integer encoding X is translated to a binary vector Y ∈ {0, 1}^n by:

decode(X): Y_i = \lfloor X_i / 2^{d - d_c} \rfloor \bmod 2, \quad i = 1, 2, . . . , n   (5)
where d_c ≤ d is the level of the tree to be expressed. At this stage, the encoding is identical to the classical binary one. Once a change in the environment is detected, the number of groups is expanded by a factor of 2 and the variable d is incremented by 1. The four groups now class the variables into those that are 1 or 0 in both states encountered so far, and those that are either 1 or 0 in one state but not the other.
Fig. 4. The internal memory structure of eAS: each layer corresponds to a unique state encountered. The problem variables are sorted according to their state ({0, 1}) such that there is a strict and easily exploitable ordering. All variables in segment 4 of state 3, for example, are expressed as 1 in state 1, 0 in state 2 and 0 in state 3.
Figure 4 shows how the tree is incrementally built up: whenever a change occurs, all d levels of the current tree are decoded and evaluated. These scores are stored as φ^n and are compared against a register, φ^a, that contains the scores of each expression when it was last active. The expression whose new score matches its old score is made active, i.e. the active expression d_c corresponds to the level i of the tree for which φ^n_i = φ^a_i. If no match can be found, another level is added to the tree (expand) with a random assignment of variables. If there are multiple matches, one is chosen at random. It follows that a new level is only added to the tree if a new state is encountered. Otherwise, the (hopefully) correct expression is reused (see section 4.2). An upper limit d_max is used to prevent the tree from growing indefinitely. The mutation operator, like all the other operators acting on the encoding, ensures that only the active expression is affected by mutation (i.e. all other expressions are unaffected and thus preserved). The mutation operator considers every single element in X and mutates it with probability p_m to a new value as follows:

X_i \leftarrow \begin{cases} X_i + 2^{d - d_c} & \text{if } \lfloor X_i / 2^{d - d_c} \rfloor \bmod 2 = 0 \\ X_i - 2^{d - d_c} & \text{otherwise} \end{cases}   (6)

Further details of this algorithm may be found in the pseudo-code (algorithm 1). An example of the encoding adapting to a two state cyclical oneMax with optima 1 and 0 illustrates it:

X(t)   = (0, 1, 1, 1, 1, 1, 0, 1, 0, 1) ∴ Y = (0, 1, 1, 1, 1, 1, 0, 1, 0, 1), d = 1, d_c = 1
X(t+1) = (0, 2, 2, 3, 2, 2, 1, 2, 0, 3) ∴ Y = (0, 0, 0, 1, 0, 0, 1, 0, 0, 1), d = 2, d_c = 2
X(t+2) = (2, 2, 2, 3, 2, 2, 3, 2, 2, 3) ∴ Y = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), d = 2, d_c = 1
X(t+3) = (2, 2, 2, 2, 2, 2, 2, 2, 2, 2) ∴ Y = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0), d = 2, d_c = 2

The encoding is shown at four different environmental stages where f_t(X) = f_{t+2}(X) and f_{t+1}(X) = f_{t+3}(X). The adaptation of one binary encoding leaves the other binary encoding intact and the encoding eventually settles into the correct state. Some notable advantages of this algorithm are:

• concise representation of multiple states
• operators (decode, mutate, expand, reduce) all operate in linear time (O(n))
• memory grows as required

On the other hand, it is clear that the growth of the tree places a limit on the number of states that may be memorised: at any time, 2^y nodes are required to describe y states and it is clear that at least 2^y − n nodes do not actually contain any variables. It should be possible to use an actual tree that is constructed on the fly and where empty nodes are pruned as necessary. The disadvantage of this is that the operators would require more time to manipulate the data structure and an explicit memory structure might be a better option. Finally, note that there are two fundamental assumptions underlying the design of this algorithm: first, that the algorithm is told when change has occurred, and second, that the algorithm is able to reliably identify previously encountered states. These issues are addressed next.
Identification and Exploitation of Linkage by Means of Alternative Splicing
black box
fitness solution
algorithm
black box
fitness solution
algorithm
(a)
detect change
201
detector
notify algorithm
take action
deal with change
(b)
Fig. 5. EAs for DOPs require the ability to detect and react to changes in the fitness function. Two scenarios are shown, the case of static (a) and dynamic (b) optimisation.
State Detection and Identification

In general, and in true black-box fashion, any EA is required to detect changes in the environment to ensure that not only an appropriate action is taken, but also that no expired fitness values are used. It is possible to divide the task of dynamic optimisation into three fundamental aspects as shown in figure 5. Here we assume the algorithm is told when a change occurs. Nevertheless, it would be relatively simple to detect change within the problems considered here using a small register of stationary points placed systematically across the search space (see, for example, Morrison [36]). A periodic re-evaluation of these points should reveal any changes in the environment. In fact, the likelihood of detecting change using a single stationary point is identical to the likelihood that a previously encountered state is correctly identified as such: whenever a change in the environment has been signaled, eAS re-evaluates all d expressions to obtain the current fitness values. The algorithm assumes that a previous state has reoccurred if any of the d expressions produces a fitness value identical to the one produced when that expression was last active. The problem with this approach is that any given expression may produce identical fitness values for different environments and this may mislead the algorithm. Let us assume that the problem is one of dynamic pattern matching, namely the oneMax problem with a dynamically changing target. Given all types of change, any target pattern can change into any of 2^n − 1 different patterns. In this simple scenario, the fitness of any encoding equals the Hamming distance δ between the encoding and the target pattern. In general, the probability that a new state encountered is indeed a state encountered previously is

p(δ) = 1 − \frac{n!/(δ!(n−δ)!) − 1}{2^n − 1}   (7)
Algorithm 1. Pseudo-code for the part of eAS that deals with changes in the environment.

// converts the encoding from an integer encoding X to a binary one Y
decode(X): for i = 1, 2, . . . , n: Y_i = ⌊X_i / 2^{d − d_c}⌋ mod 2

// adds another level to the tree (to account for a novel state encountered)
expand(X): for i = 1, 2, . . . , n: X_i ← 2X_i if rand < 0.5, 2X_i + 1 otherwise;  d ← d + 1

// eliminates the lowest level of the tree if instead an existing level is to be reused
reduce(X): for i = 1, 2, . . . , n: X_i ← X_i/2 if X_i mod 2 = 0, (X_i − 1)/2 otherwise;  d ← d − 1

// act upon change
if change detected then
    φ^n_i ← Φ(decode(X, i)) for i = 1, 2, . . . , d   // evaluate all expressions and store them in φ^n
    φ^a_{d_c} = ψ   // update the fitness value of the splice active just prior to the change
    d_c = −1        // reset the pointer to the currently active splice
    // check if the new state has been encountered before
    for i = 1, 2, . . . , d do
        if φ^a_i = φ^n_i then
            d_c ← i
        end if
    end for
    // the state encountered has not been encountered previously
    if d_c = −1 then
        if d < d_max then
            expand(X)
            // test only if environment is acyclic, else this statement is always true
            if Φ(decode(X, d)) > max(φ^n) then
                d_c ← d
            else
                reduce(X)
                d_c ← index of max(φ^n)
            end if
        else
            d_c ← index of max(φ^n)
        end if
    end if
end if
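The operators decode, expand, reduce and mutate of algorithm 1 translate almost directly into code. The following Python sketch is our own illustrative rendering (names and representation are ours), with each integer X_i holding one tree-path bit per level:

import random

def decode(X, d, dc):
    # equation 5: expression at level dc of a tree of depth d
    return [(x >> (d - dc)) & 1 for x in X]

def expand(X, d):
    # add a new, randomly assigned level when a novel state is encountered
    return [2 * x + random.randint(0, 1) for x in X], d + 1

def reduce(X, d):
    # remove the lowest level again if an existing level is to be reused
    return [x >> 1 for x in X], d - 1

def mutate(X, d, dc, pm):
    # equation 6: toggle the bit of the active level only, with probability pm
    shift = d - dc
    return [x ^ (1 << shift) if random.random() < pm else x for x in X]

# the two-state cyclical oneMax example from the text, at stage X(t+2):
X = [2, 2, 2, 3, 2, 2, 3, 2, 2, 3]
print(decode(X, d=2, dc=1))  # all ones: the expression for the first state
print(decode(X, d=2, dc=2))  # [0, 0, 0, 1, 0, 0, 1, 0, 0, 1]: the level kept for the other state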
It is clear that if δ = 0 or δ = n, then p(δ) = 1. Therefore, although there is a certain probability that an expression is reused in the wrong context, the continuous approximation of the expression to the target pattern reduces this probability over time:

\lim_{δ \to 0} p(δ) = 1   (8)

As mentioned previously, in order to detect change, a small register of static random points may be used. The number of points required in this case depends upon the probability p(δ) given by equation 7. In order to guarantee the detection of change with probability of 0.995, we require

m = \frac{\log(1 − 0.995)}{\log(1 − p(n/2))}

random static points (we assume a random point has an expected distance of n/2 to the target pattern). In the case of the n = 100 oneMax, this corresponds to m ≈ 2 points.
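Both quantities are easy to check numerically; a minimal Python sketch (ours) of equation 7 and of the resulting number of detector points:

from math import comb, log

def p(delta, n):
    # equation 7: probability that a matching fitness value really indicates
    # a previously encountered state (Hamming distance delta to the target)
    return 1 - (comb(n, delta) - 1) / (2 ** n - 1)

def detectors(n, target=0.995):
    # number of random static points needed to detect a change with the given
    # probability, assuming an expected distance of n / 2 to the target
    return log(1 - target) / log(1 - p(n // 2, n))

print(round(p(50, 100), 3))      # ~0.92
print(round(detectors(100), 2))  # ~2.1, i.e. m ~ 2 points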
4.3 Experimental Setup
The algorithm has been tested on a well established benchmark problem that allows the modeling of different search space characteristics, some of which are known to be difficult for EAs. This problem is described next, followed by a brief discussion of the experimental settings.

Modular Base Function

This problem, which we shall call a Modular Base Function (MBF), is constructed using BBs of 4 bits, each of which contribute equally towards the total fitness according to some fitness vector. In this case, 25 BBs are used to construct a problem of size n = 100. The fitness value of any encoding is then given by:

\sum_{i=1}^{25} ξ_{x(i)} \quad \text{where} \quad x(i) = \sum_{j=1}^{4} X_{4i+j}   (9)

in which ξ_{x(i)} returns an integer value according to the fitness vector ξ. Here we use

ξ ← (0, 1, 2, 3, 4), (0, 2, 2, 4, 4), or (3, 2, 1, 0, 4)

which correspond to the classical oneMax (1), a neutral landscape (2) and a fully deceptive one (3; see [42]). In other words, for each BB, the unitation function indexes the corresponding fitness vector to return a value that contributes towards the overall fitness of an encoding. It is clear that this function is identical to the ADF presented earlier with non-overlapping BBs of size 4 bits.
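The MBF is simple to implement. The following Python sketch (ours) evaluates an encoding under each of the three fitness vectors, treating the BBs as consecutive, non-overlapping 4-bit blocks:

FITNESS_VECTORS = {
    1: (0, 1, 2, 3, 4),  # oneMax
    2: (0, 2, 2, 4, 4),  # neutral
    3: (3, 2, 1, 0, 4),  # fully deceptive
}

def mbf(X, xi):
    # each consecutive 4-bit building block contributes xi[u], where u is the
    # block's unitation (number of ones)
    total = 0
    for i in range(0, len(X), 4):
        total += xi[sum(X[i:i + 4])]
    return total

# n = 100: the all-ones string scores the maximum of 100 for all three vectors
X = [1] * 100
print([mbf(X, FITNESS_VECTORS[k]) for k in (1, 2, 3)])  # [100, 100, 100]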
Dynamics

In order to introduce dynamics, we make use of the dynamic benchmark generator proposed by Yang (see [42]) upon which the following description is based. The DOP-generator can construct a dynamic environment from any stationary binary-encoded problem f(X), X ∈ {0, 1}^n, using a bitwise exclusive-or (⊕) operation that involves X and a binary mask M(k), where k = t/τ is the index of the current environment. The dynamics are generated by performing the exclusive-or on each individual as follows: f(X, t) = f(X ⊕ M(k))
(10)
The mask M is incrementally generated using randomly or systematically constructed templates T(k) each of which contains ρ × n ones: M(k) = M(k − 1) ⊕ T(k)
(11)
The initial mask at time k = 1 is M(1) = 0. It is possible to generate cycles of size 2K by generating 2K masks systematically first. Individuals are subsequently evaluated according to: f (X, t) = f (X ⊕ M(k mod 2K))
(12)
The masks are generated as follows: K binary templates T(0), . . . , T(K − 1) are constructed randomly to form a partition of the search space. Each template contains n/K exclusively selected bits that are assigned a value of 1. The masks are then generated according to: M(i + 1) = M(i) ⊕ T(i mod K), for i = 0, 1, . . . , 2K − 1
(13)
The templates up to K − 1 are used to construct incrementally the mask M(K) = 1 and are subsequently reused to generate up to 2K states where M(2K) = M(0) = 0. In this case, the number of states, 2K, dictates the distance among successive states: n/K. Alternatively, the magnitude of change ρ dictates the number of successive states: K = n/(ρ × n). If, for example, n = 100 and ρ = 0.1, the cycle has a length of 2K = 20 states. This is, however, only the upper limit and it is possible to generate smaller cycles by restricting the partition of the search space to a sub-space (i.e. some bits never change). Thus, if a cycle of length 8 is required with a distance between successive states of ρ = 0.1, the partition of the search space is simply restricted to (0.1n · 8)/2, which, in the case of n = 100 is 40 bits. The maximum cycle that may be generated given n is 2n where all successive states have a distance of 1. The performance of the algorithm on this DOP may be calculated as follows:

F = \frac{1}{G} \sum_{i=1}^{G} \left( \frac{1}{N} \sum_{j=1}^{N} F_{BOG_{i,j}} \right)   (14)

where F_{BOG_{i,j}} corresponds to the fitness of the Best Of Generation in generation i and trial j, G is the number of generations, and N the number of trials. As eAS is implemented as a (1 + 1)-EA, the value of F_{BOG} corresponds to the fitness after every 100 FEs.
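The cyclical XOR generator and its use can be sketched as follows (an illustrative Python reconstruction of equations 10–13, not Yang's original code; the helper names are ours and the templates partition a randomly chosen sub-space of the required size):

import random

def make_masks(n, rho, cycle_length):
    # equations 11-13: K random templates, each with rho*n ones, partition a
    # sub-space of the search space; the 2K masks are built incrementally,
    # M(i+1) = M(i) XOR T(i mod K), and return to M(0) = 0 after a full cycle
    K = cycle_length // 2
    flips = int(rho * n)
    changing_bits = random.sample(range(n), K * flips)
    templates = [changing_bits[i * flips:(i + 1) * flips] for i in range(K)]
    masks, mask = [], [0] * n
    for i in range(2 * K):
        masks.append(mask[:])
        for b in templates[i % K]:
            mask[b] ^= 1
    return masks

def dynamic_fitness(f, X, t, tau, masks):
    # equations 10 and 12: evaluate X under the mask of environment k = t // tau
    k = (t // tau) % len(masks)
    return f([x ^ m for x, m in zip(X, masks[k])])

random.seed(1)
masks = make_masks(n=100, rho=0.1, cycle_length=8)    # 8 states, 40 changing bits
print(dynamic_fitness(sum, [1] * 100, t=0, tau=250, masks=masks))    # 100
print(dynamic_fitness(sum, [1] * 100, t=250, tau=250, masks=masks))  # 90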
Table 1. Experimental setup for eAS on cyclical (¹) and acyclical (²) dynamic environments

Parameter   Values
τ           250¹,², 500¹,², 1000¹,²
ρ           0.1¹,², 0.2¹,², 0.5¹,², 0.8², 1.0¹
c¹          4, 8, 12, 16, 20 if ρ = 0.1;  4, 6, 8, 10 if ρ = 0.2;  2, 4 if ρ = 0.5;  2 if ρ = 0.8
Settings

The majority of experiments will focus on cyclical domains. However, given the analysis of rhythm in acyclic domains (section 4.1) and the presumably stronger presence of such scenarios in the real world, it is interesting to investigate how eAS would fare in randomly changing environments. The algorithm is tested on the dynamic MBF for all three fitness matrices and a problem of size n = 100. For each setting as shown in table 1, we ran the algorithm 30 times, limiting the duration of each run to 30 cycles each of length c (the total number of FEs per run is thus 30cτ). For the acyclical experiments, the algorithm is executed for a total of 50 changes (50τ FEs). A summary of parameter settings may be found in table 1. We do not take into account the FEs required to evaluate all expressions after each change. This omission is justified on the grounds that we do not compare eAS to another algorithm and that the FEs required after a change are usually insignificant in regard to the period between environmental changes.
4.4 Results and Analysis
Cyclical Dynamics

The general performance of eAS is summarised in table 2 using as performance measure equation 14. Figure 6 shows how eAS is able to memorise the dynamics relatively independent of the length of the cycle. Only the initial period where the dynamics are 'learned' is affected. The performance for the oneMax and the neutral function is very good. In fact, as indicated by the table and the graphs, the average value would approximate the maximum of 100 if executed for sufficiently long. The performance on the deceptive problem is significantly worse, as expected. Nevertheless, the performance is again unperturbed by the length of the cycle. In all cases, the performance of eAS improves given longer periods between changes. Figure 7 shows the performance of eAS on the neutral and deceptive MBF. The neutral case is similar to the oneMax although it takes longer for the algorithm to fully encapsulate the dynamics. The final fitness value obtained in
Table 2. The performance of eAS on the dynamic oneMax for different magnitudes of change (ρ ∈ {0.1, 0.2, 0.5, 1.0}) and different durations between successive changes (τ ∈ {250, 500, 1000}). The maximum possible score is 100 in each case.
             ρ=0.1                               ρ=0.2                        ρ=0.5         ρ=1.0
 τ       ξ   c=4    c=8    c=12   c=16   c=20    c=4    c=6    c=8    c=10    c=2    c=4    c=2
 250    ξ=1  98.51  98.51  98.49  98.50  98.50   98.52  98.52  98.51  98.48   98.54  98.50  98.53
        ξ=2  96.02  96.11  96.10  96.13  96.07   96.12  96.20  96.05  96.11   96.07  96.06  95.99
        ξ=3  77.76  77.91  78.53  78.42  78.59   77.67  77.99  77.93  78.20   77.80  78.09  77.76
 500    ξ=1  99.15  99.15  99.15  99.15  99.15   99.15  99.16  99.15  99.15   99.16  99.15  99.19
        ξ=2  98.01  97.92  97.96  97.93  97.95   97.91  97.90  97.87  97.90   97.94  97.87  97.90
        ξ=3  78.72  78.50  78.53  78.42  78.59   78.61  78.46  78.52  78.46   78.25  78.23  78.59
 1000   ξ=1  99.57  99.57  99.58  99.58  99.58   99.58  99.58  99.57  99.58   99.58  99.58  99.58
        ξ=2  98.96  98.97  98.97  98.97  98.97   98.97  98.96  98.96  98.95   98.94  98.96  98.92
        ξ=3  78.68  78.78  78.81  78.77  78.72   78.69  78.79  78.70  78.62   78.75  78.80  78.52
the deceptive case is only locally optimal and oscillations are evident where the algorithm switches between solutions without further adaptation (figure 7(b) where FE > 5000). It is important to note that the memory only helps the algorithm to deal with the dynamics and not the base function. If other variation operators had been employed, a better overall performance could be achieved. In fact, it is possible to combine the algorithm presented in the second half of this chapter, iAS, with this memory scheme for improved performance. This will be left for future work.

Acyclical Dynamics

We would expect eAS to do well in acyclic environments if the change is sufficiently large to result in a reliable pseudo-rhythm. However, most changes in the real world are expected to be small on average. The most obvious problem encountered by eAS in this case would be the continuous accumulation of states added to the memory: no state is likely to be repeated within a short period of time and the encoding will consistently add new levels to the tree until the maximum has been reached. This is the reason for the guardian if statement as shown in the pseudo-code: a new level is only added to the tree if the corresponding expression, which is random initially, produces a higher fitness value than any of the stored encodings. One would expect continuous optimisation in environments with small change (i.e. trees with a single level) and a mixture of restart and re-use in environments with larger changes (which eventually should adapt a pure re-use strategy). For this experiment, we focus simply on the depth of the tree that evolved given different parameter settings. The results are shown in table 3 and, as expected, the algorithm simply continues to adapt without making use of memory
Identification and Exploitation of Linkage by Means of Alternative Splicing
90
90
80
80
Fitness
100
Fitness
100
70
70
60
60
50
50 0
50
100 150 200 250 Function Evaluations (x100)
300
0
50
(a)
100 150 200 250 Function Evaluations (x100)
300
(b) 100
90
90
80
80
Fitness
100
Fitness
207
70 60
70 60
50
50 0
50
100 150 200 250 Function Evaluations (x100)
300
0
50
(c)
100 150 200 250 Function Evaluations (x100)
300
(d)
Fig. 6. Performance of eAS on dynamic oneMax for ρ = 0.2 and 4 (a), 6 (b), 8 (c) and 10 (d) states per cycle
[Figure 7: two panels (a)–(b) plotting fitness against function evaluations (×100).]
Fig. 7. Performance of eAS on fitness matrices 1 (a) and 2 (b) for a change of ρ = 0.1 and a period of τ = 250
if the changes are very small. Conversely, if the changes are fairly large, memory is used and, more importantly, reused efficiently. A value for ρ = 0.8 has an approximate period of 2, which is reflected by the depth of the tree that evolved.
Table 3. Depth of tree constructed in acyclical environments for all fitness matrices and different magnitudes of change ρ ∈ {0.1, 0.2, 0.5, 0.8} and durations τ ∈ {250, 500, 1000}
              τ=250   τ=500   τ=1000
 ρ=0.1  ξ=1    1.0     1.0     1.0
        ξ=2    1.0     1.0     1.0
        ξ=3    1.0     1.0     1.0
 ρ=0.2  ξ=1    1.0     1.0     1.0
        ξ=2    1.2     1.0     1.0
        ξ=3    1.3     1.0     1.0
 ρ=0.5  ξ=1    9.1     9.0     9.4
        ξ=2    8.8     8.9     8.2
        ξ=3    9.2     9.5     9.4
 ρ=0.8  ξ=1    2.5     2.1     2.0
        ξ=2    4.6     3.2     2.3
        ξ=3    5.5     3.2     2.0
Finally, the most difficult scenario for eAS is a change of magnitude 0.5 in which case there is no rhythm. It is interesting to note that the algorithm converges to a memory of approximately 9 states for ρ = 0.5 independent of the fitness vector or period τ. This value may be explained as follows: whenever the environment changes, the new optimum shifts a distance of n/2. A randomly generated encoding is also expected to have a distance to the global optimum of n/2. The initial likelihood that a new level is added to the tree is thus 1/2. If, however, a new level is added and another change occurs, this likelihood changes to 1/3 (we assume that all encodings evolved so far have, on average, a distance of n/2 to the new optimum). In general, the probability of adding another level to the tree is thus 1/(d + 1). If we simulate this process over 50 changes, empirical results (averaged over 1000 trials) confirm the expected value of d to be roughly 9.5.
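This argument is easily verified by simulation; the following minimal Python sketch (ours) of the 1/(d + 1) growth process reproduces a depth of roughly 9.5 after 50 changes:

import random

def expected_depth(changes=50, trials=1000, seed=0):
    # Monte Carlo estimate: after each change a new level is added with
    # probability 1 / (d + 1), starting from a tree of depth d = 1
    random.seed(seed)
    total = 0
    for _ in range(trials):
        d = 1
        for _ in range(changes):
            if random.random() < 1.0 / (d + 1):
                d += 1
        total += d
    return total / trials

print(expected_depth())  # roughly 9.5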
4.5 Discussion
It is worth pointing out that the concept of linkage in this scenario, and possibly in other dynamic domains as well, differs from its traditional meaning. Here, we are not interested in the structural properties of the base state, but instead we are interested in the structural properties amongst a succession of states. The genetic operators need to be designed to find and preserve useful structures. It has been shown here how an abstract implementation of AS is able to find subsets of bits common to subsets of states, and to reuse that information when required. The encoding therefore provides a very compact representation of the entire dynamic landscape, and the dynamics are effectively removed from the problem as there is now a single static optimal solution. The performance of eAS has been demonstrated in a variety of different experiments, and the results seem to suggest that the ability of eAS to cope with cycles is independent of cycle
length, although there is obviously an upper limit as to how many different states may be fully captured. An unsigned 8 byte integer representation (long) could capture cycles with up to 64 states. It is important to note that the suggested encoding does not enhance the algorithm’s performance on the base problem (i.e. the static version of the problem), but only enhances the algorithm’s ability to cope with the dynamics imposed on top of the base problem. Nevertheless, the suggested methodology may be applied to other algorithms.
5 Implicit Alternative Splicing

The second algorithm, implicit Alternative Splicing (iAS), differs from the previous approach in that BBs are found on-the-fly using a systematic search procedure. Hence this algorithm is referred to as implicit because the encoding itself does not imply any modularity of the search space. It is clear that any artificial equivalent of AS relies upon a proper choice of segmentation and a meaningful definition of 'alternative expression'. In eAS, this choice was dictated naturally by the succession of states. This algorithm, on the other hand, requires additional definitions. We define an alternative expression of a binary segment simply as the segment that is at maximum Hamming distance from its source. More specifically, given a binary vector X ∈ {0, 1}^n, we define an inversion operator ⊥ such that X⊥B means: X_b = 1 ⊕ X_b, ∀b ∈ B ⊆ {1, 2, . . . , n}. In other words, B is a set of indices indicating which bits in X are to be inverted. If B is the set of all indices in X we simply write X⊥ and call it the dual of X. Any segment that may be inverted without negatively affecting the fitness of the underlying encoding will be called an exon. Subsequently, iAS attempts to find and invert (i.e. express alternatively) the largest exon possible. This objective is achieved by inverting randomly chosen segments of decreasing size in a top-down fashion, using intermediate fitness values as stepping stones for subsequent inversions (see below, and also [38]). This abstract implementation essentially reduces iAS to the search of the largest possible binary sequence that may be inverted successfully. The first step of iAS is to create the dual of the encoding with some probability p_d. If the dual has a higher fitness than the original encoding, the dual replaces the original encoding. This is followed by a phase of recursive divisions and inversions: the initial focus is directed at a randomly chosen subset of indices B ⊆ {1, 2, . . . , n} that is of size 0 < b_l ≤ |B| ≤ b_u ≤ n where b_l and b_u are lower and upper bounds specified by the user. At the beginning of each iteration of iAS, the size of B is chosen randomly within those bounds. As the set of indices B is dynamically reassigned throughout a single iteration of iAS, we explicitly denote the initial set as B^0. The chosen bits are randomly divided into two, equally sized disjoint sets, α and β such that α ∪ β = B and α ∩ β = ∅. Each of those sets is inverted in X, one at a time, to produce two fitness values. This step is repeated q times with different, randomly chosen, partitions of B. If any of the 2q fitness values should result in a fitness superior to the original one, the process is terminated and the changes are applied immediately. Otherwise, iAS
210
P. Rohlfshagen and J.A. Bullinaria
proceeds by taking the 'path of least resistance' and is executed repeatedly using as guidance those bits the inversion of which caused the least decline in fitness (and may thus contain promising inversions of smaller scale). Those bits are then again randomly divided q times and the above procedure is repeated until a better fitness value is found or the number of bits to be inverted is reduced to 1. If the inversion is neutral in regard to fitness, the process is terminated with some predetermined probability p_n. If q should be larger than the total number of unique divisions,

d_{unique} \leftarrow \begin{cases} 0 & \text{if } |B| < 2 \\ 0.5 \times |B|!/((|B|/2)!)^2 & \text{if } |B| \bmod 2 = 0 \\ 0.5 \times (|B| + 1)!/(((|B| + 1)/2)!)^2 & \text{otherwise} \end{cases}   (15)

an exhaustive search of unique divisions is performed. The following illustrates the workings of iAS on a simple n = 10 oneMax problem. Let us assume we have a solution X = (1, 1, 1, 1, 1, 1, 0, 1, 1, 0) with fitness Φ(X) = 8. First we generate the dual with probability p_d. The dual is X⊥ = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1), has a fitness of Φ(X⊥) = 2 and will be ignored. We then randomly generate an initial B such as B = {0, 2, 3, 4, 6, 7, 8, 9}. This subset of variables is then partitioned q = 2 times and each partition is used to determine which bits to invert:

α_1 = {0, 4, 6, 8} ∴ X⊥α_1 = (0, 1, 1, 1, 0, 1, 1, 1, 0, 0)
β_1 = {2, 3, 7, 9} ∴ X⊥β_1 = (1, 1, 0, 0, 1, 1, 0, 0, 1, 1)
α_2 = {2, 3, 4, 7} ∴ X⊥α_2 = (1, 1, 0, 0, 0, 1, 0, 0, 1, 0)
β_2 = {0, 6, 8, 9} ∴ X⊥β_2 = (0, 1, 1, 1, 1, 1, 1, 1, 0, 1)

In this case, X⊥β_2 has a fitness of 8 and the algorithm terminates with probability p_n. If execution continues, B ← {0, 6, 8, 9} and

α_3 = {8, 9} ∴ X⊥α_3 = (1, 1, 1, 1, 1, 1, 0, 1, 0, 1)
β_3 = {0, 6} ∴ X⊥β_3 = (0, 1, 1, 1, 1, 1, 1, 1, 1, 0)
α_4 = {6, 9} ∴ X⊥α_4 = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
β_4 = {0, 8} ∴ X⊥β_4 = (0, 1, 1, 1, 1, 1, 0, 1, 0, 0)

Again, the algorithm would terminate with probability p_n in cases X⊥α_3 and X⊥β_3. If not terminated, the correct solution would have been uncovered in the third iteration of the procedure with Φ(X⊥α_4) = n. More details may be found in the pseudo-code for this algorithm (see algorithm 2). The reasons for this particular implementation follow directly from the exposition of AS in section 2: AS allows the emergence of new proteins by temporarily suspending selection pressure for certain splice forms which are subsequently free to accumulate mutations. This accelerated rate of change may ultimately produce a new protein (see [35]). A purely neutral approach is too costly for our purposes (due to the general lack of parallelism) and selection pressure is upheld to a certain degree while allowing a temporary decrease in fitness.
Algorithm 2. Pseudo-code for a single iteration of the iAS algorithm

// Testing the dual
if rand([0, 1]) ≤ p_d then
    if Φ(X⊥) > Φ(X) then
        X ← X⊥
    end if
end if

// Generate initial set of bits considered for partitioning and inversion
randomly generate B ⊆ {1, 2, . . . , n} | 0 < b_l ≤ |B| ≤ b_u ≤ n

// Generate random partitions and inversions
while |B| ≥ 2 ∧ terminate = false do
    if d_unique(B) < q then
        P ← all possible pairs of disjoint partitions (α, β) of B
    else
        P ← q random, equally-sized pairs of disjoint partitions (α, β) of B
    end if
    // Select the bits that produced the best inversion and assign them to B
    B ← P_i | Φ(X⊥P_i) ≥ Φ(X⊥P_j), ∀j ∈ P
    // Apply the best inversion found so far if it does not decrease the fitness of X
    if Φ(X⊥B) ≥ Φ(X) then
        if Φ(X⊥B) = Φ(X) then
            // neutral inversion: accept and terminate only with probability p_n
            if rand([0, 1]) ≤ p_n then
                X ← X⊥B
                terminate = true
            end if
        else
            // strictly better inversion: accept and terminate immediately
            X ← X⊥B
            terminate = true
        end if
    end if
end while

The lack of problem specific knowledge implies that there is no evidence of modularity in the search space, and it is impossible to determine a priori what segments should be inverted. iAS thus systematically searches for such a segment and the top-down recursive technique allows for the largest segment to be found. There is no certainty that a successful inversion will be found, but testing q different divisions increases the likelihood of success. It is crucial that the initial number of bits, |B^0|, is chosen randomly (within bounds) as the size of all subsequent segments depends on this initial choice (as segments are always halved). The maximum number of FEs processed during a single iteration is:
    ∑_{i=1}^{log2(n)} min{ q, dunique(n/2^i) }                (16)
given that |B 0 | = n and pd = 0 and that the algorithm continues until |B| = 1. In the case of n = 100 and q = 10, this would be 2(4q + 3 + 1) = 88 FEs.
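To make the procedure concrete, the following is a minimal sketch of a single iAS iteration on a bit string. It is our own reading of Algorithm 2, not code from the authors: the fitness function phi, the bounds b_lower/b_upper, the random number generator and the tie handling are illustrative assumptions, and it reuses the hypothetical d_unique helper sketched earlier.

```python
import random

def ias_iteration(x, phi, q, p_n, p_d, b_lower, b_upper, rng=None):
    """One iteration of implicit alternative splicing (iAS): test the
    dual, then recursively halve a random subset B, keeping the half
    whose inversion decreases fitness the least."""
    rng = rng or random.Random()
    n = len(x)

    def invert(sol, bits):
        out = list(sol)
        for i in bits:
            out[i] = 1 - out[i]
        return out

    # Testing the dual
    if rng.random() <= p_d:
        dual = [1 - b for b in x]
        if phi(dual) > phi(x):
            x = dual

    # Initial set of bits considered for partitioning and inversion
    B = rng.sample(range(n), rng.randint(b_lower, b_upper))

    while len(B) >= 2:
        # q random equal-sized divisions (all of them if fewer exist)
        num_divisions = min(q, d_unique(len(B)))
        candidates = []
        for _ in range(num_divisions):
            shuffled = B[:]
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            for part in (shuffled[:half], shuffled[half:]):
                candidates.append((phi(invert(x, part)), part))
        best_fitness, best_part = max(candidates, key=lambda c: c[0])
        B = best_part

        # Apply the best inversion if it does not decrease fitness
        if best_fitness > phi(x):
            return invert(x, B)
        if best_fitness == phi(x) and rng.random() <= p_n:
            return invert(x, B)
    return x
```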
5.1 Comparison to the Folding GA
The ability of iAS to invert numerous bits simultaneously makes the need for crossover redundant, and some initial experiments have confirmed that there is indeed no significant advantage to using multiple individuals (we investigated the use of 2 individuals in [38]) in place of the (1+1) EA employed here. It is worth pointing out the importance of choosing a random distribution of bits to be inverted (and not contiguous segments), and it is possible to draw a meaningful comparison between n-point crossover and uniform crossover: the effectiveness of n-point crossover depends upon the total length of the BBs (distance between first and last defining bit), while the effectiveness of uniform crossover is only dependent upon the actual number of bits within the BBs. If the bias of uniform crossover is chosen randomly every time the operator is applied, uniform crossover has the same attributes as n-point crossover without the restrictions imposed by the static ordering of variables inherent in the algorithm’s encoding. A similar comparison may be drawn between iAS and the Folding GA (FGA, [12]), the most closely related approach in the literature. The FGA, an extension of the dual GA, includes multiple meta-bits that adaptively control different sections of the encoding. These so-called meta-genes determine the state of all subsequent bits in the encoding up to the next meta-gene. An example taken from [13] is as follows. A transcription step T is applied to the encoding that affects all genes between any two meta-bits {0̇, 1̇}, such that T(0̇ω) = ω and T(1̇ω) = ω̄, where ω̄ is the inverse of ω:

T([1̇ 0 1 0̇ 0]) = [100]
T([1̇ 0 1 1]) = [100]
T([1̇ 0 0̇ 0 0̇ 0]) = [100]

This technique is expected to work well on the fully deceptive MBF as a single mutation to any of the meta-bits allows the simultaneous inversion of multiple bits. However, as the distribution of BBs is generally unknown a priori, the meta-bits have to be inserted adaptively. If the deceptive BBs form consecutive groups of variables, this approach should fare well because only a few meta-bits are required to exert sufficient control over the locally optimal parts of the encoding. If, on the other hand, BBs are distributed randomly across the encoding, the problem may become intractable. If, for example, the MBF is constructed as follows:

((v1, v2, v3, v4)1, (v5, v6, v7, v8)2, . . . , (vn−3, vn−2, vn−1, vn)n/4)
each meta-bit exerts control over exactly one BB in the worst case scenario (locally and globally solved BBs in alternating fashion). If, on the other hand, the MBF is constructed randomly as follows:

((v5, v15, v8, v10)1, (v12, v1, v22, v23)2, . . . , (v2, v32, v21, v11)n/4)

a meta-bit may only affect a single variable in the worst case scenario. Thus, in order to overcome deception, four meta-bits have to be inverted simultaneously, which is just as hard as the original problem. The crucial difference between the FGA and iAS is, of course, that the folding GA adapts during the execution of the algorithm and hence does not rely upon the extensive amount of FEs required by iAS to locate meaningful structures in the search space.
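The FGA transcription step described above can be made concrete in a few lines. This is only an illustrative sketch of our reading of T, with meta-genes marked in a separate mask rather than by dotted symbols; it is not the authors' implementation.

```python
def fga_transcribe(genes, is_meta):
    """Transcription T for a Folding-GA-style encoding: a meta-gene of
    value 1 inverts all following ordinary genes up to the next
    meta-gene, while a meta-gene of value 0 leaves them unchanged."""
    expressed = []
    invert = False
    for bit, meta in zip(genes, is_meta):
        if meta:
            invert = bool(bit)  # remember the most recent meta-gene
        else:
            expressed.append(bit ^ 1 if invert else bit)
    return expressed

# The three examples from the text, e.g. [1̇ 0 1 0̇ 0] -> [1, 0, 0]:
print(fga_transcribe([1, 0, 1, 0, 0], [1, 0, 0, 1, 0]))        # [1, 0, 0]
print(fga_transcribe([1, 0, 1, 1], [1, 0, 0, 0]))              # [1, 0, 0]
print(fga_transcribe([1, 0, 0, 0, 0, 0], [1, 0, 1, 0, 1, 0]))  # [1, 0, 0]
```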
5.2 Experimental Setup
Here we present a modification of the basic MBF problem as well as two additional problems on which iAS has been tested. The extended version of the MBF is presented first, followed by a discussion of the NK fitness landscape and a brief outline of the multiple knapsack problem. Finally, the experimental settings are described.

MBF - Extended Version

Here we extend the basic MBF problem to allow exclusively selected parts of the encoding to refer to different fitness matrices. This allows one to mix different search space attributes within the same problem. An MBF of type 0.33 - 0.33 - 0.33, for example, indicates that the search space is composed of 1/3 of each of the 3 fitness vectors (in the order: maxOnes, neutral and deceptive). A type 1 - 0 - 0 would correspond to the classical maxOnes problem. Furthermore, the elements in X that belong to the same BB are distributed randomly across X (although there is still no overlap across all BBs).

NK Fitness Landscape

Kauffman developed the NK fitness landscape model (NK; [30]) to study the effects of genetic interactions (epistasis). Each bit in the encoding contributes towards the encoding's fitness. The contribution depends upon the state of the bit itself and the state of all k bits that are linked to it, and the more the bits are dependent on one another, the more rugged the search space becomes. The fitness values for each bit are generated randomly in the range [0, 1] for each of the 2^(k+1) possible states and are stored in a look-up table (for high values of k, these values may be generated on-the-fly). The final fitness is the average contribution of all bits:

    F(X) = (1/n) ∑_{i=1}^{n} Fi(Xi; Xi1, . . . , Xik)                (17)
where {i1, . . . , ik} ⊂ {1, . . . , i − 1, i + 1, . . . , n}. For random neighbourhoods, this problem has been shown to be NP-complete for values of k ≥ 2 (see [1]). This problem is again an example of an ADF where the component functions are defined by the relationship between variables. Unlike the MBF, there is overlap between the linkage groups proportional to k, making this problem increasingly difficult to solve.
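A compact way to see equation (17) in action is the sketch below. The random look-up table, the choice of neighbours and the function names are our own illustrative assumptions; they follow the description above rather than any reference implementation.

```python
import random

def make_nk_instance(n, k, seed=0):
    """Random NK instance: for each bit i, k random neighbours and a
    look-up table mapping each of the 2^(k+1) joint states to a
    contribution drawn uniformly from [0, 1]."""
    rng = random.Random(seed)
    neighbours = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    tables = [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n)]
    return neighbours, tables

def nk_fitness(x, neighbours, tables):
    """Equation (17): average contribution of all bits, each depending
    on its own state and on the states of its k linked bits."""
    total = 0.0
    for i in range(len(x)):
        state = x[i]
        for j in neighbours[i]:
            state = (state << 1) | x[j]
        total += tables[i][state]
    return total / len(x)

neigh, tabs = make_nk_instance(n=20, k=3)
print(nk_fitness([random.randint(0, 1) for _ in range(20)], neigh, tabs))
```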
Multiple Knapsack Problem

The Multiple Knapsack Problem (MKP) is a widely studied combinatorial optimisation problem that has several direct counterparts in industry, such as the cutting stock problem or resource allocation in distributed systems. Well-established benchmarks for this NP-hard problem are readily available, allowing a direct comparison to other approaches in the literature. The MKP is a generalisation of the single knapsack problem: the objective is to fill a series of m knapsacks, each of capacity cj, with any number of the n items, each with weight Wij and value Vi, in such a way that the combined value of all selected items is maximized without exceeding any of the knapsacks' capacities. More formally, the aim is to maximize the fitness

    max{ ∑_{i=1}^{n} Vi Xi }  |  ∑_{i=1}^{n} Wij Xi ≤ cj  ∀j                (18)

where Xi ∈ {0, 1}.
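For reference, a direct encoding of (18) looks as follows; the weights, values and capacities are placeholder data, not taken from the SAC'94 suite.

```python
def mkp_value(x, values):
    """Objective of (18): total value of the selected items."""
    return sum(v for v, xi in zip(values, x) if xi)

def mkp_feasible(x, weights, capacities):
    """Constraint of (18): every knapsack j must stay within its capacity,
    where weights[j][i] is the weight of item i in knapsack j."""
    return all(
        sum(w_j[i] for i in range(len(x)) if x[i]) <= c_j
        for w_j, c_j in zip(weights, capacities)
    )

values = [10, 7, 4, 9]
weights = [[3, 2, 4, 5],   # knapsack 0
           [1, 6, 2, 3]]   # knapsack 1
capacities = [8, 7]
x = [1, 0, 0, 1]
print(mkp_value(x, values), mkp_feasible(x, weights, capacities))  # 19 True
```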
We studied the widely used SAC'94 suite of benchmark MKP problems, which may be found online at http://elib.zib.de/pub/Packages/mp-testdata/ip/sac94-suite/. It contains 55 problem instances which range in size from 15 to 105 objects and from 2 to 30 knapsacks.

Settings

We tested the performance of iAS on three different problems. First, it was applied to the fully deceptive MBF in order to investigate the ability of iAS to reliably identify the BBs of the problem. Secondly, it was tested on a variety of other MBFs and Kauffman's NK fitness landscape. Here a comparison was made against the canonical GA (cGA) using a population size of 100, uniform crossover and tournament selection. The crossover and mutation probabilities were pc = 0.8 and pm = 1/n respectively. Finally, iAS was tested on several instances of the MKP. In each case, the algorithm was executed 20 times for each problem, with a maximum of 20,000, 50,000 and 100,000 FEs. Whenever the global optimum was found within the limit allowed, the algorithm's run was terminated and the number of FEs required noted. The parameter settings used for all problems are shown in table 4.
Table 4. Experimental setup for iAS for problems 1 (¹), 2 (²) and 3 (³)
Parameter   Values
q           10²,³; 20¹
|B 0 |      64¹; 100¹; {2, 3, . . . , 100}¹; {50, 51, . . . , 100}¹,²,³
pn          0.5¹,²,³
pd          0.5¹,²,³
5.3 Results
Results Problem 1: MBF

Table 5 shows the performance of iAS on the fully deceptive MBF: iAS is able to solve this problem if executed for a sufficiently large number of FEs (500,000 in this case). This type of problem may only be solved successfully if the algorithm is able to invert entire BBs instantaneously, and it is evident from table 5 that the choice of |B 0 | is crucial: values that are multiples or powers of 4 (the BB size) are most successful (64 in this case is the largest possible initial segment that is a power of 4). This shows how problem-specific knowledge may be employed in the choice of |B 0 |. The effect is shown more clearly in figure 8, where the frequencies of the segment sizes successfully inverted are plotted.
Table 5. Results for iAS on the fully deceptive problem (BB=4) for different values of |B 0 | and q. The % columns give the percentage of trials solved successfully.
      |B 0 | = 64          |B 0 | = 100         50 < |B 0 | < 100    2 < |B 0 | < 100
 q    %    avg    FE       %    avg    FE       %    avg    FE       %    avg    FE
 1    0    91.95  -        0    91.35  -        0    91.05  -        0    92.25  -
 5    10   98.4   423350   0    93.45  -        0    96.45  -        0    95.7   -
 10   70   99.6   330962   0    95.6   -        15   98.2   410294   5    98.05  355555
 20   100  100    223895   5    98.25  469648   65   99.6   321232   30   99.05  371357
 30   100  100    187486   30   99     330856   75   99.75  291386   70   99.65  368337
 40   100  100    90103    80   99.75  350168   90   99.9   257587   90   99.9   290079
 50   100  100    109580   80   99.75  329730   85   99.85  275812   65   99.6   278106
 60   100  100    107851   80   99.75  336884   90   99.9   273186   95   99.95  258828
 70   100  100    99680    65   99.65  347471   90   99.9   213651   95   99.95  248939
 80   100  100    107223   60   99.55  320236   90   99.9   275824   70   99.7   293830
 90   100  100    130937   85   99.75  330761   85   99.85  262796   75   99.7   337469
 100  100  100    110571   55   99.55  354448   90   99.9   259557   65   99.65  255182
Fig. 8. Results for iAS on fully deceptive problem (BB=4). Frequency of successful inversions (non-successful inversions not shown) for different values of |B 0 |: 64 (1), 100 (2), 50 < |B 0 | < 100 (3) and 2 < |B 0 | < 100 (4). The x-axes correspond to the number of bits inverted, the y-axes describe the number of successful inversions.
Table 6. Results for iAS on the MBF for different distributions. The differences in performance compared to the cGA are shown in the right-most columns (20,000 FEs). Column 'Sig' indicates if the differences are significant (*) and the % columns give the percentage of trials solved successfully.
Case                     20,000 FEs (1)       50,000 FEs (2)       100,000 FEs (3)           Sig          cGA
                         %    avg    FE       %    avg    FE       %    avg    FE       1v2  2v3  1v3    Diff    Sig
(1) 1.00 - 0.00 - 0.00   100  100    1542     100  100    1542     100  100    1542     -    -    -      1743    *
(2) 0.00 - 1.00 - 0.00   100  100    1804     100  100    1804     100  100    1804     -    -    -      2206.1  *
(3) 0.00 - 0.00 - 1.00   0    90.9   -        0    91.95  -        0    94     -        x    *    *      -8      *
(4) 0.60 - 0.20 - 0.20   30   99.05  7952     80   99.75  24582    100  100    33693    *    x    *      -2.4    *
(5) 0.20 - 0.60 - 0.20   30   98.8   11458    70   99.6   24547    100  100    39088    *    x    *      -2.4    *
(6) 0.20 - 0.20 - 0.60   0    94.35  -        0    96.25  -        0    97.45  -        *    *    *      -4.15   *
(7) 0.33 - 0.33 - 0.33   0    97.3   -        20   98.7   39884    45   99.4   60362    *    *    *      -3.75   *
The highest peak, irrespective of the initial segment size, corresponds to the BB size. In other words, iAS is able to successfully identify the BBs and invert them to find the global optimum. It is important that the initial segment size is as large as possible to guide the search, and it is evident that 50 < |B 0 | < 100 performs better than 2 < |B 0 | < 100. A large initial segment not only increases the frequency of finding large exons, but also increases the overall success rate of finding smaller exons in subsequent processing steps. The choice of q is also crucial. The more divisions are tested at each stage, the higher the success rate of finding an exon. On the other hand, high values of q are computationally expensive. There is thus a trade-off between the number of times iAS may be executed and q. In this case, 40 < q < 70 seems best. It should be stressed that a stochastic approach to exploring different divisions at each stage is the only appropriate choice for deceptive problems. Attempts to find the best possible division using local search, or indeed a GA, have failed because the search for a division seems to be as difficult as the original problem itself. Depending on the problem, however, different (problem-specific) heuristics may be employed.
The second set of experiments on the MBF compares iAS to the cGA on a variety of different search space properties, as shown in table 6. First, it is evident from the data that iAS, unlike the cGA, improves significantly in performance if given more resources (i.e. more FEs). Secondly, iAS performs significantly better than the cGA on all instances, and is able to solve all instances that have no more than 33% deception at least once. It should be noted that a high value of q is required to solve deceptive BBs but slows down the algorithm in the simpler cases. A re-run of iAS on cases 1 and 2 using q = 1 requires only 473 and 568 FEs respectively.

Table 7. Comparison of iAS and cGA on NK for different values of k given different limits on the number of FEs allowed. Column 'Sig' indicates if the differences are significant (*).
      20,000 FEs                      50,000 FEs                      100,000 FEs
 k    cGA     iAS     Diff    Sig     cGA     iAS     Diff    Sig     cGA     iAS     Diff    Sig
 1    0.7105  0.7119  0.0014  x       0.7108  0.7120  0.0012  x       0.7112  0.7120  0.0009  x
 2    0.7363  0.7403  0.0040  x       0.7374  0.7412  0.0038  x       0.7380  0.7421  0.0041  x
 3    0.7448  0.7581  0.0134  *       0.7469  0.7609  0.0140  *       0.7483  0.7619  0.0137  *
 4    0.7490  0.7634  0.0144  *       0.7517  0.7669  0.0152  *       0.7536  0.7700  0.0164  *
 5    0.7473  0.7645  0.0172  *       0.7521  0.7657  0.0136  *       0.7530  0.7683  0.0153  *
 6    0.7385  0.7535  0.0150  *       0.7419  0.7587  0.0169  *       0.7435  0.7615  0.0179  *
 7    0.7388  0.7500  0.0111  *       0.7417  0.7535  0.0117  *       0.7438  0.7552  0.0114  *
 8    0.7285  0.7457  0.0171  *       0.7308  0.7497  0.0189  *       0.7339  0.7508  0.0168  *
 9    0.7251  0.7382  0.0131  *       0.7302  0.7426  0.0124  *       0.7307  0.7455  0.0148  *
 10   0.7233  0.7370  0.0136  x       0.7287  0.7412  0.0124  x       0.7301  0.7442  0.0141  *
Results Problem 2: NK Fitness Landscape

The results for the first part of this experiment are shown in Table 7: iAS is significantly better than the cGA in at least 70% of the cases, especially those with higher values of k. Again, as the limit on the FEs is increased, the differences in performance increase as well. This indicates that iAS makes better use of the available resources and suffers less from local optima entrapment. The NK fitness landscape is an ADF where the size and overlap of linkage sets increases with k. High values of k thus imply a greater degree of difficulty, as the optimal assignment for each subset of variables XVi is likely to interfere with most other assignments. iAS initially inverts a large subset of variables and this inversion is likely to affect most, if not all, linkage sets (for high values of k at least). Nevertheless, the performance of iAS seems to indicate that the incremental refinements made to the set of inverted variables are able to locate and isolate a set of variables whose inversion is beneficial. In other words, the approach taken by iAS seems to work despite significant overlap (epistasis) of linkage sets in the problem space.

Results Problem 3: MKP

This section highlights the potential of iAS for other, more realistic, problems. In particular, it is shown how iAS may be used to solve constrained and permutation based optimisation problems. The constrained optimisation problem chosen is the MKP and iAS may be employed as usual. However, whenever a segment is inverted, the inversion is carried out as follows (starting with a feasible solution): first, all 1s are inverted to 0s. It is safe to do this because the exclusion of an item will never invalidate a solution. Secondly, all bits which were originally 0 are inverted (in random order), if possible. This approach ensures that the inverted segment is as close as possible to a fully inverted segment while obeying the constraints imposed by the problem (a sketch of this repair step is shown below). This is the simplest approach that ensures feasibility, and more sophisticated techniques may be developed that should produce superior results. In particular, it is clear that problem-specific knowledge, such as the average value-weight ratio of all items, may be used here to determine the order of bits considered for inclusion. Nevertheless, as the results in table 8 show, iAS is able to solve at least one trial in almost all instances and, again, iAS produces better results in almost all instances given a higher limit on the number of FEs allowed.
It is also possible to apply iAS to permutation based problems. The greatest difficulty in this case is the definition of what an 'alternative expression' is. This is straightforward in the binary case, but much less clear in the case of permutations because there is no single unique 'inversion'. In this case, the group membership as well as the order within the group matters. iAS may thus be extended to include another test at each step that evaluates a certain number of randomly generated permutations within each half. In other words, at each stage, q divisions are tested and for each test, z permutations within each half are tested as well.
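The following sketch shows one way to implement the partial repair described above for the MKP; the random re-insertion order is as stated in the text, while the feasibility check reuses the hypothetical mkp_feasible helper from the earlier MKP snippet and the function names are our own.

```python
import random

def repair_inversion(x, segment, weights, capacities, rng=None):
    """Invert a segment of a feasible MKP solution while keeping it
    feasible: drop every selected item in the segment, then try to add
    the originally unselected ones in random order."""
    rng = rng or random.Random()
    y = list(x)
    originally_zero = [i for i in segment if x[i] == 0]
    # Step 1: excluding items can never violate a capacity constraint.
    for i in segment:
        if x[i] == 1:
            y[i] = 0
    # Step 2: re-add the complementary items where capacities allow.
    rng.shuffle(originally_zero)
    for i in originally_zero:
        y[i] = 1
        if not mkp_feasible(y, weights, capacities):
            y[i] = 0
    return y
```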
Table 8. Results for iAS on the MKP using a partial repair function and two different limits on the number of FEs allowed. The % columns give the percentage of trials solved successfully.
           20,000 FEs          100,000 FEs
Instance   %    avg            %    avg
hp1        60   3405.15        75   3409.5
hp2        45   3151.85        70   3168.6
pb1        50   3069.45        65   3080.1
pb2        15   3142.1         45   3154.85
pb4        65   93882.55       85   94683.05
pb5        60   2132.2         85   2136.45
pb6        55   769.7          80   773.2
pb7        20   1027.8         50   1031.15
pet2       100  87061          100  87061
pet3       100  4015           100  4015
pet4       100  6120           100  6120
pet5       55   12394.5        100  12400
pet6       5    10565          40   10594.65
pet7       0    16448.35       5    16475.7
sent01     20   7755.7         20   7758.25
sent02     0    8701.45        0    8710.4
weing1     100  141278         100  141278
weing2     100  130883         100  130883
weing3     100  95677          100  95677
weing4     90   118986.4       100  119337
weing5     100  98796          100  98796
weing6     100  130623         100  130623
weing7     0    1094227.25     0    1095209.7
weing8     25   620728.1       35   621543.2
weish01    100  4554           100  4554
weish02    60   4534           60   4534
weish03    85   4105.55        100  4115
           20,000 FEs          100,000 FEs
Instance   %    avg            %    avg
weish04    100  4561           100  4561
weish05    100  4514           100  4514
weish06    35   5546.8         55   5550.15
weish07    80   5562.35        100  5567
weish08    55   5602.55        100  5605
weish09    95   5244.3         100  5246
weish10    55   6322.85        80   6332.6
weish11    50   5611.35        95   5637.75
weish12    70   6320.6         100  6339
weish13    100  6159           100  6159
weish14    60   6938.35        80   6947.8
weish15    100  7486           100  7486
weish16    55   7287.2         75   7288.2
weish17    30   8623.85        95   8632.3
weish18    5    9561.35        35   9572.9
weish19    35   7672.25        65   7686.6
weish20    55   9442.3         80   9446
weish21    70   9063.6         90   9070.35
weish22    25   8908.4         30   8927.8
weish23    10   8319.35        25   8332.95
weish24    20   10202.35       80   10216.25
weish25    10   9919.45        35   9927.7
weish26    35   9547           55   9566.65
weish27    60   9779.45        90   9806.7
weish28    50   9453.75        70   9485.1
weish29    30   9360.4         70   9397.55
weish30    50   11178.75       70   11189.55
The average fitness value of all z permutations is subsequently used to guide the search. This requires significantly more FEs, but may be beneficial for difficult problems that consist of relatively few variables, such as the quadratic assignment problem, or in cases where the number of FEs allowed is sufficiently large.
5.4 Discussion
iAS has been tested on three different problems and has been compared to the cGA on two of them. On both, iAS significantly outperformed the cGA in the majority of instances. The ability of iAS to invert large numbers of bits simultaneously allows the algorithm to escape from local optima. This is evident in the consistent increase in performance once the limit on the number of FEs
allowed is increased. Nevertheless, each iteration of iAS requires a significant number of FEs to successfully invert a segment, which could pose a problem if resources are very limited. The tests on the fully deceptive MBF indicate that iAS works especially well on problems where deception leads to sub-optimal solutions that are at maximum Hamming distance from the globally optimal sub-solution. However, this attribute is not necessarily restricted to fully deceptive problems, as any local optimum in a binary search space is some Hamming distance away from the global optimum, and it has been shown that iAS was able to solve at least some trials in almost all instances when tested on a constrained real-world problem. Further work is required to investigate in more depth the kinds of problems on which iAS may be expected to do well; this is left for future work.
6 Conclusions and Future Work

This chapter has presented two novel algorithms that are inspired by Alternative Splicing (AS), an important cellular process found in higher eukaryotes. Loosely speaking, AS alters the expression of individual genes by changing the choice of modules (exons and introns) that contribute towards the protein. This modular composition of genetic information is strongly related to the concept of linkage in Evolutionary Computation (EC), and the idea of tightly linked informational Building Blocks (BBs) is evident in both algorithms presented.
The first algorithm, explicit Alternative Splicing (eAS), has been applied to a dynamic optimisation problem and systematically groups the variables of a problem according to their value across a series of finite states. This systematic grouping allows for the efficient reuse of acquired information in cases when the dynamic environment returns to a previously visited state. eAS is essentially an implicit memory approach that attempts to capture the state of a variable across multiple different states. Linkage groups in this case do not describe the sub-structures of the base problem, but instead capture the properties of a succession of different instances of the base problem. The ability to recall states from memory allows eAS to deal efficiently with dynamics, at least in the cases considered here. It does not, however, help the algorithm to solve the base problem, and it is expected that other, more sophisticated, operators may have a significant impact on the overall performance of eAS.
The second algorithm, implicit alternative splicing (iAS), uses a top-down search process to locate a segment for which inversion has a non-negative impact on the encoding's fitness. The search starts with a large, randomly chosen initial segment which is systematically reduced in size. At each iteration, the current segment is halved, each half is inverted, and the resulting encodings are evaluated. These intermediate fitness values are used to guide the search towards a successful inversion. This algorithm is able to solve fully deceptive problems if given sufficient time. Furthermore, iAS significantly outperformed a canonical GA on two test problems, and managed to solve the majority of instances in the multiple knapsack problem using a partial random repair.
Both algorithms are based upon a simple (1+1) EA. The choice of this particular underlying framework was mainly driven by the fact that individual encodings require multiple function evaluations (FEs) to produce a candidate solution (although in the case of eAS, multiple FEs are only required if a change has occurred). In nature, most cellular processes rely upon the massive parallelism evident in animal populations, but in EC such luxuries are usually unavailable and compromises have to be made (i.e. population size reductions). If resources are available, we would expect either algorithm to work well in a population-based framework.
6.1 Future Work
We are currently expanding our work on the iAS algorithm. In particular, the algorithm seems highly suitable for use in dynamic domains, cyclic or not, as it should be able to deal efficiently with a variety of different transitions. Initial experiments have shown promise, but further testing is required to determine the full power of the approach. Furthermore, efforts are underway to analyse the behaviour of iAS analytically, and to draw some general conclusions regarding its running time for selected problems. The methodology employed in iAS is fairly general, and it should be possible to refine the performance further using local search or statistical processing to guide the choice of divisions. Finally, it may also be of interest to combine the two approaches to exploit the benefits each of them has to offer.
Acknowledgements This work was supported by a Paul and Yuanbi Ramsay scholarship.
References [1] Altenberg, L.: NK fitness landscapes. In: B¨ ack, T., Fogel, D.B., Michalewicz, Z. (eds.) The Handbook of Evolutionary Computation, pp. B2.7:2–B2.7:10. Oxford University Press, Oxford (1997) [2] Baker, B.S.: Sex in flies: The splice of life. Nature 340, 521–524 (1989) [3] Black, D.L.: Mechanisms of alternative pre-messenger RNA splicing. Annual Review of Biochemistry 72, 291–336 (2003) [4] Bosman, P.A.N., Poutr`e, H.L.: Learning and anticipation in online dynamic optimization with evolutionary algorithms: the stochastic case. In: Proceedings of the 2007 Genetic and Evolutionary Computation Conference, pp. 1165–1172 (2007) [5] Branke, J.: Evolutionary Optimization in Dynamic Environments. Kluwer, Dordrecht (2001) [6] Branke, J., Kaußler, T., Schmidt, C., Schmeck, H.: A multi-population approach to dynamic optimization problems. In: Parmee, I.C. (ed.) Adaptive Computing in Design and Manufacture 2000, pp. 299–308. Springer, Heidelberg (2000) [7] Cobb, H.G.: An investigation into the use of hypermutation as an adaptive operator in genetic algorithms having continuous, time-dependant nonstationary environments. Technical report, Naval Research Laboratory, Washington, USA (1990)
[8] Collard, P., Aurand, J.-P.: DGA: An efficient genetic algorithm. In: Proceedings of the Eleventh European Conference on Artificial Intelligence, pp. 487–491 (1994) [9] Dasgupta, D., McGregor, D.R.: Nonstationary function optimization using the structured genetic algorithm. In: M¨ anner, R., Manderick, B. (eds.) Parallel Problem Solving from Nature, vol. 2, pp. 145–154. Elsevier, Amsterdam (1992) [10] Di Paolo, E.A.: Rhythmic and non-rhythmic attractors in asynchronous random boolean networks. BioSystems 59, 185–195 (2001) [11] Fisher, R.A.: The Genetical Theory of Natural Selection. Clarendon Press, Oxford (1930) [12] Gaspar, A., Clergue, M., Collard, P.: Folding genetic algorithms: the royal road toward an optimal chromosome’s expressiveness. In: Second International ISCS Symposium on Soft Computing (1997) [13] Gaspar, A., Collard, P.: Time dependent optimization with a folding genetic algorithm. In: IEEE International Conference on Tools for Artificial Intelligence, pp. 207–214. IEEE Computer Society Press, Los Alamitos (1997) [14] Goldberg, D.E.: The Design of Innovation: Lessons from and for competent genetic algorithms. Kluwer Academic Publishers, Dordrecht (2002) [15] Goldberg, D.E., Korb, D.E., Deb, K.: Messy genetic algorithms: Motivation, analysis and first results. Complex Systems 3, 493–530 (1989) [16] Goldberg, D.E., Smith, R.E.: Nonstationary function optimization using genetic algorithms with dominance and diploidy. In: Grefenstette, J.J. (ed.) Second International Conference on Genetic Algorithms, pp. 59–68. Lawrence Erlbaum Associates (1987) [17] Goldberg, D.E., Deb, K., Kargupta, H., Harik, G.: Rapid, accurate optimization of difficult problems using fast messy genetic algorithms. In: Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 56–64 (1993) [18] Graveley, B.R.: Alternative splicing: increasing diversity in the proteomic world. TRENDS in Genetics 17(2), 100–107 (2001) [19] Grefenstette, J.J.: Genetic algorithms for changing environments. In: Manner, R., Manderick, B. (eds.) Proceedings of the Second International Conference on Parallel Problem Solving from Nature, vol. 2, pp. 137–144. Elsevier, Amsterdam (1992) [20] Hadad, B.S., Eick, C.F.: Supporting polyploidy in genetic algorithms using dominance vectors. In: Angeline, P.J., McDonnell, J.R., Reynolds, R.G., Eberhart, R. (eds.) EP 1997. LNCS, vol. 1213, pp. 223–234. Springer, Heidelberg (1997) [21] Harik, G.: Learning Gene Linkage to Efficiently solve problems of bounded difficulty using genetic algorithms. PhD thesis, University of Michigan, Ann Arbor (1997) [22] Harik, G.R.: Linkage learning via probabilistic modeling in the ECGA. Technical Report 99010, University of Illinois at Urbana-Champain (1999) [23] Harik, G.R., Lobo, F.G., Goldberg, D.E.: The compact genetic algorithm. IEEE 3(4), 287 (1999) [24] Harrington, E.D., Boue, S., Valcarcel, J., Reich, J.G., Bork, P.: Estimating rates of alternative splicing in mammals and invertebrates. Nature Genetics 36(9), 915– 917 (2004) [25] Herbet, A., Rich, A.: RNA processing and the evolution of eukaryotes. Nature Genetics 21, 265–269 (1999) [26] Huang, C.-F., Rocha, L.M.: Exploration of RNA editing and design of robust genetic algorithms. In: Proceedings of the 2003 IEEE Congress on Evolutionary Computation. IEEE Press, Los Alamitos (2003)
[27] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409, 860–921 (2001) [28] Kargupta, H., Gosh, S.: Towards machine learning through genetic code-like transformations. Genetic Programming and Evolvable Machine Journal 3(3), 231–258 (2002) [29] Karuptga, H.: The gene expression messy genetic algorithm. In: Proceedings of the IEEE International Conference on Evolutionary Computation (1996) [30] Kauffman, S.A.: The Origins of Order. Oxford University Press, Oxford (1993) [31] Kondrashov, F.A., Koonin, E.V.: Origin of alternative splicing by tandem exon duplicaton. Human Molecular Genetics 10(23), 2661–2669 (2001) [32] Ladd, A.N., Cooper, T.A.: Finding signals that regulate alternative splicing in the post-genomic era. Genome Biology 3(11), 1–16 (2002) [33] Levenick, J.R.: Swappers: Introns promote flexibility, diversity and invention. In: Orlando, F.L., Banzhaf, W., Daida, J., Eiben, A.E., Garzon, M.H., Jakiela, H., Smith, R.E. (eds.) Proceeding of Genetic and Evolutionary Computation Conference 1999, vol. 1, pp. 361–368. Morgan Kaufmann, San Francisco (1999) [34] Lopez, A.J.: Alternative splicing of pre-mRNA: Developmental consequences and mechanisms of regulation. Annual Review of Genetics 32, 279–305 (1998) [35] Modrek, B., Lee, C.J.: Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nature Genetics 34(2), 177–180 (2003) [36] Morrison, R.W.: Designing Evolutionary Algorithms for Dynamic Environments. Springer, Berlin (2004) [37] Rohlfshagen, P., Bullinaria, J.A.: Alternative splicing in evolutionary computation: Adaptation in dynamic environments. In: Proceedings of the 2006 IEEE Congress on Evolutionary Computing, pp. 8041–8048. IEEE, Piscataway (2006) [38] Rohlfshagen, P., Bullinaria, J.A.: Implicit alternative splicing for genetic algorithms. In: Proceedings of the 2007 IEEE Congress on Evolutionary Computing, pp. 47–54. IEEE Press, Los Alamitos (2007) [39] Rohlfshagen, P., Di Paolo, E.A.: The circular topology of rhythm in asynchronous random boolean networks. BioSystems 73, 141–152 (2004) [40] Yang, S.: Non-stationary problem optimization using the primal-dual genetic algorithm. In: Sarker, R., Reynolds, R., Abbass, H., Tan, K.-C., McKay, R., Essam, D., Gedeon, T. (eds.) Proceedings of the 2003 IEEE Congress on Evolutionary Computation, vol. 3, pp. 2246–2253 (2003) [41] Yang, S.: PDGA: the primal-dual genetic algorithm. In: Abraham, A., Koppen, M., Franke, K. (eds.) Design and Application of Hybrid Intelligent Systems, pp. 214–223. IOS Press, Amsterdam (2003) [42] Yang, S.: A comparative study of immune system based genetic algorithms in dynamic environments. In: Proceedings of the 8th annual conference on Genetic and evolutionary computation, pp. 1377–1384 (2006)
A Clustering-Based Approach for Linkage Learning Applied to Multimodal Optimization

Leonardo Emmendorfer¹ and Aurora Pozo²

¹ Informatics Laboratory, Dept. of Electronics and Computer Science Engineering, University of Algarve, Portugal
² Dept. of Computer Science, Federal University of Paraná, Brazil
[email protected], [email protected]

Summary. The schemata theorem is based on the notion that similarities among the individuals of the population of a genetic algorithm are related to the substructures present in the problem being solved. Similarities are currently taken into account in evolutionary computation since the preservation of diversity in the population is considered very important both to allow the identification of the problem structure and to find and maintain several global optima. This work presents an empirical investigation of an algorithm which is based on learning from similarities among individuals. Those similarities are shown to reveal information about the problem structure. Empirical evaluation and comparison show the effectiveness and scalability of this new approach when solving multimodal optimization problems, including several instances of graph bisection and a highly multimodal parity function.
1 Introduction

Genetic and evolutionary algorithms are known for their wide applicability and robustness despite their relatively simple conception and implementation. An evolutionary algorithm is able to solve optimization problems by evolving successive populations of solutions until convergence occurs. Two steps are usually present at each generation: selection of promising solutions and creation of new solutions in order to obtain a new population. The simple genetic algorithm (sGA) [1] adopts genetic recombination operators that extract and combine information from each parent. The schemata theorem is a theoretical foundation which explains how the simple mechanism of the sGA works [1]. It is based on the notion that similarities among individuals are related to the co-occurrence of important substructures, called building blocks, which represent a sub-specification for the genes considered in the codification of a problem. Individuals possessing the same building blocks belong to the same schema. When the average fitness of the individuals possessing a certain building block is high, that substructure is expected to proliferate in the population. The evolutionary process results from the proliferation of increasingly greater building blocks. The sGA with simple crossover operators depends on the knowledge of the best ordering of the genes in the codification to achieve success in the combination
of building blocks. In this kind of tight linkage, interacting genes are coded as nearby genes in the chromosome. Previous knowledge about the structure of the problem is not always available, but could be learned during the evolutionary process. Linkage learning techniques are related to the algorithms which attempt to learn about the structure of the problem being solved, in order to achieve success when combining information acquired about that problem. Evolutionary algorithms which infer probabilistic models from the population are called estimation of distribution algorithms (EDAs), a novel approach for linkage learning. It involves searching for an appropriate model structure, capturing dependencies among genes, and sampling from this model in order to generate the next population. The development of EDAs leads to the adoption of increasingly more complex and powerful probabilistic models. Currently, the most important EDAs are based on the induction of Bayesian networks at each generation. This approach is based on learning the interactions among the genes by inferring probabilistic models which are able to capture dependencies among those genes. EDAs based on Bayesian networks achieve good results for several problems in the literature but a high computational cost associated to the model induction stage is imposed. Finding all global optima for multimodal optimization problems is, generally, not as straightforward as desired; it demands the adoption of some form of diversity maintenance. A niching technique guarantees diversity by maintaining subpopulations and avoiding the recombination among different subpopulations. Niching has also been applied even when solving unimodal optimization problems [2]. It was verified that niching is able to guarantee an effective search by stably maintaining several suboptimal solutions until the optimal becomes clear. Clustering has been recognized as an attractive option among many niching techniques available [3]. K-means clustering algorithm [4] has recently been applied as a niching technique based on grouping genotypically similar solutions together. Here in this work we follow a similar approach, since individuals are also grouped. However, in contrast to other algorithms, the combination among different niches is encouraged, based on the belief that it is possible to combine information from the different clusters in order to effectively combine building blocks. This chapter shows how the similarities among individuals, already suggested as a central issue by the schemata theorem, can be explored in order to perform linkage learning. Actually, the algorithm adopted here is based on clustering the population and learning probabilistic models for each cluster. A very simple EDA [8] based on those ideas is shown to be able to solve hard hierarchical and globally multimodal optimization problems, which illustrate some of the most important hurdles found in the literature. The presented work extends a previous paper [5] including further empirical investigation and extending explanations and discussion about critical aspects of the algorithm’s behavior. The next two sections discuss some important concerns in evolutionary computation. Some competent algorithms [2][3][6][7][9][10][11] are briefly revised.
Section 4 revises ϕ-PBIL [8], which is based on clustering for both linkage learning and diversity preservation. Section 5 revises some benchmark problems and section 6 presents empirical verifications and comparisons, which were already present in [5]. Section 7 extends the empirical investigation, illustrating the mechanics of the algorithm through an actual run. A scalability experiment is also described there. Finally, Section 8 discusses results and implications of this work.
2 Linkage Learning

In genomics, genetic linkage is defined as the association of genes on the same chromosome. On the other hand, when two genes are independent, the Mendelian law of independent assortment states that the segregation of one gene is independent of the segregation of the other [12]. Genetic and evolutionary algorithms present a similar behavior since interactions among the variables of the problem may occur and should be taken into account. The identification and preservation of important interactions among genes can have a desirable effect on the evolutionary process. This is generally called linkage learning, and has recently gained increasing attention from the research community. Some of the linkage learning techniques are aimed at identifying substructures which should be conserved during recombination [13]. Interaction among variables is an irreducible whole; a dependency which cannot be broken [14]. In other words, the joint probability distribution function does not factorize as a product of marginal distributions when variables do interact. A dependency between two variables can be detected as a result of the interaction among them. A classic example related to the importance of variable interaction in machine learning is the XOR problem [15], which is not linearly separable and cannot be solved without capturing the interaction among the variables. The simple genetic algorithm (sGA) with one-point crossover relies on the ordering of the genes in the codification of the problem. In order to achieve success, interacting variables should be coded as nearby genes in the chromosome so that crossover is less likely to disrupt the substructure. More recently, other approaches adopted alternative codification schemes, ranging from the simple reordering of the genes to more complex mechanisms like sub-specification and super-specification of solutions, as in the Messy GA [11]. This algorithm adopts a two-stage evolutionary process where substructures are identified at a first stage and subsequently combined. This ensures that substructures are all correctly identified before going on to explore the combinations among them. A different situation occurs in EDAs. The probabilistic model adopted, when that model is powerful enough, can directly capture dependencies. These dependencies are expected to be able to represent and respect the interactions present in the problem. Some earlier EDAs, however, learn models which assume independence among genes and, therefore, are not able to detect any interactions.
One of the most important members of this class of EDAs is PBIL (Population Based Incremental Learning) [16]. Independently of the methodology adopted for capturing the problem structure, diversity preservation has been shown to be a crucial concern [2], because the detection of a substructure is only attainable if a critical number of individuals in the population possess that substructure.
3 Multimodal Optimization through Evolutionary Computation

Globally multimodal problems present several global optima or, in other words, several optimal solutions with the same fitness. These problems may be very tricky for an evolutionary algorithm to solve. Slow convergence to one of the optima (genetic drift) often occurs for such problems. The delayed convergence is explained by the recombination of solutions coming from different regions of the search space, which often results in poor solutions. Diversity preservation becomes even more important for globally multimodal problems, but the so-called niching techniques were primarily proposed for avoiding premature convergence on single-optimum problems. Later, some techniques were specifically designed for, and applied to, globally multimodal problems. For example, Clearing [7] performs the separation of the population into smaller subpopulations and applies a strong elitism inside each niche by assigning all resources to the "winners" – the best individual of each niche. A very different approach is proposed in [6], where diversity is not maintained at the level of the individuals, but at the level of the substructures, which are first identified by an EDA. Clustering techniques have also been recently applied to globally multimodal problems. In [3] k-means is applied to improve the efficiency of EDAs when solving this class of problems. A very simple EDA, which learns independent univariate probabilistic models, reached better efficiency after the adoption of k-means as a niching mechanism. Individuals are grouped together by their genotype and interbreeding can, therefore, be prevented. In [10] unsupervised learning is also explored, but from a different perspective. The UEBNA – unsupervised estimation of Bayesian network algorithm – is presented, which is based on the unsupervised learning of Bayesian networks at each generation. An unobserved variable c is included in the model, which represents the unknown subpopulation label. UEBNA adopts the algorithm Bayesian Structural Expectation Maximization (BSEM) [17] for the unsupervised learning of Bayesian networks. It performs a greedy search for the best structure, inferring at each step the values for the unknown subpopulation labels. The UEBNA algorithm is expected to detect the correct assignment of diverse global optima to diverse subpopulations and capture dependencies among variables of the problem through successive learning of Bayesian networks. Experimental validation confirms the effectiveness of the algorithm for a set of problems.
The beneficial effect claimed by these works is to control recombination among different subpopulations because better individuals are, usually, not expected to be generated from interbreeding, which is considered harmful for the evolutionary process. Exploration of the search space is done by an evolutionary algorithm, which operates on all subpopulations in a nearly parallel fashion. However, a reliable recombination mechanism which could allow an effective exploration of the search space by combining information gained from different subpopulations should also be considered. The next Section revises a recent algorithm which proposes the exploration of the search space by a kind of interbreeding.
4 ϕ-PBIL: A Clustering-Based Evolutionary Algorithm

Clustering has been receiving increasing attention as an effective niching technique for evolutionary computation [3][18]. Individuals are grouped by similarity of their genotype and – as pointed out in the previous Section – recombination among different groups is usually avoided. Linkage learning, in turn, is performed by some other mechanism like the probabilistic models of EDAs. In order to avoid premature convergence to local optima, a critical degree of diversity in the population is required. Therefore, diversity in the population occurs due to two major correlated factors: the existence of substructures in the problem and the existence of diverse global optima. When an evolutionary algorithm is relatively far from convergence we can conjecture that the main source of diversity in the population is the diversity of substructures, since global optima are not clear yet. This idea is supported by the notion of schemata, which denote similarities among individuals at every stage of the process. Thus, at some stage of the process, a clustering algorithm would group together individuals possessing the same substructures or, in other words, different substructures would characterize each subpopulation. This is exactly what the algorithm ϕ-PBIL (Concept-guided Population-Based Incremental Learning) [8] relies on. Algorithm 1 shows the pseudocode of ϕ-PBIL. ϕ-PBIL is an EDA which follows an incremental architecture where a single individual is generated at each iteration, without the adoption of a succession of "generations" as other EDAs often do. A prespecified and fixed number of k clusters are maintained and continuously updated. Whenever a new individual is generated it replaces the worst individual in the population (if the new individual is better) and, subsequently, the clustering hypothesis and the probabilistic models for each cluster are simply updated after a single typical k-means step, and not fully relearned. Each cluster defines a subpopulation and, since only binary variables are allowed, the probabilistic models for the subpopulations are the corresponding binomial proportions Π̂ = (π̂i,j), which denote the proportions of individuals with the value 1 for each gene j in each cluster i. As in other clustered EDAs, sampling from one of the probability vectors (PVs) will generate a new individual. The choice of a PV is done randomly and proportionally to the mean fitness of the individuals of each corresponding cluster.
Algorithm 1. ϕ-PBIL [8]

  Initialization: Generate an initial random population of size N0, compute the fitness of the individuals and select only the Nw (Nw ≤ N0) best.
  Learning: Learn clusters from the population. For each cluster, a probability vector (PV) of binomial proportions is obtained. This leads to two matrices: one for binomial proportions Π̂ = (π̂i,j) and another for information measures Ŵ = (ŵi,j).
  while convergence criteria are not met do
    Breeding: Generate a new individual H by randomly choosing one of the two following procedures: (i) generate an individual by sampling from one of the PVs, or (ii) apply interbreeding, performing a combination of the PVs of two parents, guided by their respective information measures stored in Ŵ, and then sample from the resulting temporary PV in order to obtain a new individual.
    Selection: Compute FH, the fitness of the new individual H. If H is not worse than the worst individual currently stored, then delete this worst individual and insert H in the population.
    Learning: Update the clusters and the matrices Π̂ and Ŵ.
  end while
An efficient interbreeding mechanism is a major feature of ϕ-PBIL, which differentiates it from the other algorithms described above. It allows for a careful combination of relevant information from two PVs, attempting to maintain the most informative part of each parent. During interbreeding, two parent clusters A and B are randomly selected, proportionally to the mean fitness. The key idea is to create a temporary PV by choosing from which parent each position j of the PV will be taken – π̂A,j or π̂B,j. A measure from information theory is used to guide that choice in order to select the most informative parent for each gene. The measure ŵi,j of how informative cluster i is for gene j is the difference in the entropy of the distribution of gene j before and after observing cluster i.
The computation of ŵi,j is described as follows. In matrix Π̂ = (π̂i,j) the columns are associated with the variables; therefore, π̂i,j represents the proportion of 1s in gene j, considering all the individuals belonging to cluster i. The entropy-reduction measure ŵi,j allows a comparison between its corresponding π̂i,j and the other π̂l,j, l ∈ [1, 2, ..., k], l ≠ i. It is the difference between hi,j, the entropy of the distribution of the binary variable j when cluster i is considered, and the entropy of the same variable when cluster i is excluded from the calculation. All clusters are assumed to have equal numbers of individuals, as if the population were equally distributed. The mechanism is illustrated with an experiment in Section 6, where the evolutionary process for an ADF is shown.
Besides the mechanism explained above, two features were added in order to increase the performance of the algorithm. The first feature was motivated by local search. A perturbation mechanism was added which changes slightly the estimated binomial proportions and, therefore, allows for an allele to be generated even if all individuals in that cluster possess the complementary allele.
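The entropy-reduction measure and the guided combination of two PVs can be sketched as follows. This is our own illustration of the description above (equal-sized clusters assumed); the function names, the sign convention for the entropy reduction and the tie handling are not taken from the ϕ-PBIL paper.

```python
from math import log2

def entropy(p):
    """Entropy of a Bernoulli variable with success probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def w_measure(pi, i, j):
    """Entropy reduction for gene j when cluster i is observed: entropy
    of gene j over the remaining clusters minus its entropy inside
    cluster i (clusters assumed to hold equal numbers of individuals)."""
    others = [pi[l][j] for l in range(len(pi)) if l != i]
    p_without_i = sum(others) / len(others)
    return entropy(p_without_i) - entropy(pi[i][j])

def interbreed(pi, a, b):
    """Temporary PV built gene by gene from the more informative of the
    two parent clusters a and b."""
    n_genes = len(pi[0])
    return [pi[a][j] if w_measure(pi, a, j) >= w_measure(pi, b, j) else pi[b][j]
            for j in range(n_genes)]
```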
The maximum likelihood estimator (MLE) of a binomial proportion is the fraction of successes. For the EDA proposed, this is the proportion of 1s in a gene for the individuals of each cluster, or

    π̂i,j = ( ∑_{c(x)=i} xj ) / ni                (1)

where ∑_{c(x)=i} xj is the number of individuals in cluster i which possess the value 1 in gene j and ni is the number of individuals in cluster i. This estimator may saturate at one of the extreme values 0% or 100% if all individuals in a cluster possess the same value for a gene (all 0s or all 1s). This wipes out the chance of the complementary value being generated, which limits local search. Uncertainty can be incorporated through the Wilson estimator, which was revised in [19]. The expression for the binomial proportion becomes

    π̂i,j = ( ∑_{c(x)=i} xj + 2 ) / (ni + 4)                (2)

instead of the MLE (1). Thus, for instance, even if all 100 individuals in a cluster possess the value 1 in a locus, there still remains a probability of 2% of a new individual being generated with a 0 in that same locus.
Another feature aims to improve solving hierarchical problems and/or problems with overlapping substructures. The memory of a previous clustering hypothesis is maintained, and some of the new individuals are generated from the model associated with that old hypothesis. Hence, previously detected substructures can be used for a longer period. The best clustering hypothesis ever seen is kept while it performs well. If the old hypothesis is outperformed by the continuously updated current hypothesis, then the current hypothesis replaces the old one.
That mechanism works as follows: matrices Ŵ = (ŵi,j) and Π̂ = (π̂i,j) corresponding to an old clustering hypothesis are kept, as {Π̂old, Ŵold}. Before each combination of PVs, one of the two hypotheses is randomly selected. The parameter pold is the probability of the old hypothesis being selected. The number of individuals generated and accepted represents the performance of each hypothesis. Those performance records are gold and gnew, for the old and current hypothesis respectively. Whenever the new individual is selected, the corresponding performance record is increased. Those records are used to control the replacement of the old hypothesis. If gnew overpasses gold, then Ŵold ← Ŵ, Π̂old ← Π̂, gold ← gnew and gnew ← 0. This means that the current hypothesis becomes the "novel" old hypothesis.
All parameters for ϕ-PBIL are described in table 1. Some default values for most parameters were set after empirical investigation. The set of default values is adopted as a pattern for a general problem but, eventually, an increase in performance can be attained after changing some of those parameters.
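The difference between the plain proportion (1) and the Wilson-style correction (2) is tiny in code; the sketch below is only meant to show the saturation effect, with illustrative numbers.

```python
def mle_proportion(ones, n):
    """Equation (1): plain fraction of 1s in a cluster."""
    return ones / n

def wilson_proportion(ones, n):
    """Equation (2): Wilson-style estimate that never reaches 0 or 1."""
    return (ones + 2) / (n + 4)

# A cluster of 100 individuals that all carry a 1 in some gene:
print(mle_proportion(100, 100))     # 1.0   -> a 0 can never be sampled
print(wilson_proportion(100, 100))  # ~0.98 -> a 0 is still sampled ~2% of the time
```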
Table 1. ϕ-PBIL parameters
Description
Default value
N0
initial population size
–
Nw
working population size
–
k
number of clusters
–
pc
probability of interbreeding
50%
pold
probability of an old clustering hyposthesis to be used
50%
pw
probability of the Wilson estimator to be used
50%
Section 6 illustrates sensitivity to pc (probability of interbreeding), pold (probability of an old clustering hypothesis to be used during breeding instead of the current one) and pw (probability of the Wilson estimator to be used instead of the sample mean), which are all set to 50% by default. The termination criterion was set to be the loss of diversity inside the PVs: the algorithm finishes when all π̂i,j saturate (reaching some value above 0.95 or below 0.05). The user must set the values for some parameters for which no default values are provided: the initial population size N0, the working population size Nw and the number of clusters k. The working population exists after the initialization and is created from the best individuals of the initial population, as described in algorithm 1.
5 Problems

This section briefly revises the problems used in the empirical evaluation. The reader is referred to [10][20][9][22] for a more detailed description of the problems. The primary interest here is in global multimodality, but other problems are also approached. Two classes of optimization problems are often adopted for empirical verification of GAs: additively decomposable functions (ADFs) and hierarchically decomposable functions (HDFs). In ADFs there is no dependence structure among the subproblems, since the contribution of each substructure to the overall fitness of the solution is independent of the values of the remaining variables; therefore, all subproblems can be solved independently. The concatenated k-trap function [9] is an example of an ADF with l/k subproblems, where l is the size of the problem and k is the size of each subproblem. HDFs, in turn, are much harder to optimize since the substructures interact. The fitness contribution of two substructures combined is different from the sum of their individual contributions. One of the most important HDFs is HIFF [20], where interactions up to the size of the problem may occur. HDFs illustrate a more general class of problems than ADFs and demand a more complex mechanism
for recombination. Substructures of each level must generally be identified first and recombined before the next level starts emerging. More recently, a class of symmetrical multimodal problems has been attracting attention. Symmetry occurs when some regularities on the landscape lead to the existence of complementary solutions with identical fitness and, consequently, to complementary global solutions. Later in this section ϕ-PBIL is evaluated when solving 12 instances of symmetrical globally multimodal optimization problems, which also present a structure of interactions among the variables.

Concatenated trap-5

The concatenated trap-5 [9] is an additively decomposable function which possesses a single global optimum (the string of 1s), with a fitness equal to the size of the problem. For each contiguous nonoverlapping block of five variables, two building blocks can be identified: (0, 0, 0, 0, 0) and (1, 1, 1, 1, 1), which contribute with 4 and 5 to the overall fitness, respectively. Single-variable statistics may lead away from the optimal value for each gene, which makes this problem deceptive. Several combinations of the building blocks lead to multiple suboptimal solutions. An instance of size 50 (Ptrapfive50) is considered later.

HIFF

Hierarchical if-and-only-if (HIFF) [20] is a hierarchically decomposable function. A binary string of size 2^p represents a solution, where p is the number of levels in the hierarchy. The fitness of a solution is given by

    fhiff(B) = 1,                            if |B| = 1
               |B| + f(B^L) + f(B^R),        if |B| > 1 and (∀i{bi = 0} or ∀i{bi = 1})
               f(B^L) + f(B^R),              otherwise

where B is a block of bits (b1, ..., b|B|), |B| is the size of the block and bi is the i-th element in the block. B^L and B^R are, respectively, the left and right halves of the block B. The evaluation starts with the chromosome as a block. Since tight linkage occurs, HIFF is relatively simple for a GA to solve. The shuffled version, however, is much harder and some mechanism for detecting the problem structure is recommended. In the shuffled version the variables are randomly rearranged in order to avoid tight linkage. An experiment with the shuffled version of HIFF with 64 variables (Pshuff64) is described later.

Twomax

Twomax is a simply defined function of n binary variables:

    ftwomax(Z) = | n/2 − ∑_{i=1}^{n} zi |
There are two global maxima: (0, 0, ..., 0) and (1, 1, ..., 1), both with fitness equal to n/2. Two instances are considered in the empirical evaluation, with n = 50 and n = 100, denoted here by Ptwomax50 and Ptwomax100 respectively.
Concatenated Parity Function
The concatenated parity function (CPF), as described in [22], generalizes the XOR function. It is composed of concatenated subfunctions, which are defined as the parity of each substring. The problem considered here is CPF5, which is the concatenation of m nonoverlapping subfunctions over 5-bit substrings:

  CPF5(S) = Σ_{i=0}^{m−1} parity(s_{5i} s_{5i+1} s_{5i+2} s_{5i+3} s_{5i+4})

The parity of each substring is 5 if the substring contains an odd number of 1 bits, and 0 otherwise.
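To make the benchmark definitions above concrete, the following Python sketch (ours, not the authors' implementation) evaluates them directly from the textual descriptions; the intermediate trap values (4 minus the number of 1s when a block is not all 1s) follow the usual deceptive trap of [9], and the absolute-value form of twomax is inferred from the two stated maxima, so both details should be read as assumptions.

def trap5(x):
    # Concatenated trap-5: each nonoverlapping block of five bits contributes
    # 5 if it is all 1s and (4 - number of 1s) otherwise (assumed standard trap).
    total = 0
    for i in range(0, len(x), 5):
        ones = sum(x[i:i + 5])
        total += 5 if ones == 5 else 4 - ones
    return total

def twomax(z):
    # Twomax: two global maxima, the all-0 and all-1 strings, both with value n/2.
    n = len(z)
    return abs(n / 2.0 - sum(z))

def hiff(b):
    # Hierarchical if-and-only-if on a string of length 2**p (tight-linkage version).
    if len(b) == 1:
        return 1
    half = len(b) // 2
    left, right = b[:half], b[half:]
    bonus = len(b) if (all(v == 0 for v in b) or all(v == 1 for v in b)) else 0
    return bonus + hiff(left) + hiff(right)

def cpf5(s):
    # Concatenated parity: each 5-bit substring contributes 5 if it has an odd
    # number of 1 bits, and 0 otherwise.
    return sum(5 if sum(s[i:i + 5]) % 2 == 1 else 0 for i in range(0, len(s), 5))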
Fig. 1. Topologies for Pgrid16, Pcatring28, Pcat28 and Pcatring84
This function is challenging because of its global multimodality: clearly, for every parity subfunction, half of the substrings are optimal, leading to a huge number of global optima (16^m) even for small instances of the problem.
Graph Bisection
The objective of the graph bisection problem is to obtain two equally sized sets of nodes from the original graph such that the number of arcs linking both sets is minimal. The fitness of a given solution is the number of nodes minus the number of links between both sets, which makes it a maximization problem. The codification adopted is based on a binary vector of size n where the i-th gene represents the label of the i-th node. All instances considered in [10] are also studied here. Figure 1 illustrates two of those instances: Pgrid36 and Pcatring42, which possess 2 and 6 global optima, respectively. The other instances considered are Pgrid16, Pgrid64, Pcat28, Pcat42, Pcat56, Pcatring28, Pcatring56 and Pcatring84, with the number of global optima ranging from 2 to 6.
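A minimal sketch of the bisection fitness just described, assuming the graph is given as a list of node-index pairs (how unbalanced partitions are handled is not specified in the text and is left out here):

def graph_bisection_fitness(labels, edges):
    # Fitness = number of nodes minus number of edges whose endpoints carry
    # different labels (a maximization problem, as stated above).
    cut = sum(1 for (u, v) in edges if labels[u] != labels[v])
    return len(labels) - cut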
6 Empirical Evaluation
The main goal in this section is to present an empirical evaluation of ϕ-PBIL when solving globally multimodal optimization problems which were already approached by other competent EDAs.
Sensitivity to the Parameters
This subsection presents an exploration of the parameter space, which allows for checking the robustness of the algorithm and its sensitivity to some of the parameters. Three representative test problems are used: HIFF, concatenated trap-5 and Twomax. The experiment was also used to find default values for some parameters of ϕ-PBIL. We start from the hypothesis that pc, pold and pw should all be set to 50%. The instances of the test problems are Pshuff64, Ptrapfive50 and Ptwomax50, all three relatively small (64, 50 and 50 variables, respectively), but large enough to reveal the influence of the parameters on the performance of the algorithm. The other three parameters are fixed for all runs of each problem in this experiment: N0 = 3,000, Nw = 300 and k = 15. Those values were previously verified to be near the minimal values at which the algorithm was able to find all global solutions for each of the three instances considered in this evaluation. Figure 2 shows the results of this investigation. Each graph shows the variation of one of the three parameters considered, while the other two are fixed. When nothing else is stated, the value of the remaining two parameters is 50%. Changing one of the parameters can dramatically affect the behavior of the algorithm for one or more of the problems. Setting pold to 0% or 100% causes
Fig. 2. Empirical verification of the sensitivity of ϕ-PBIL to three parameters: pc , pold and pw , as the mean number of global optima found and maintained and the mean number of fitness evaluations until convergence. Five values of each parameter (0%, 25%, 50%, 75% and 100%) are evaluated for each plot. When nothing else is stated, assume the value for the remaining parameters is pc = 50%, pw = 50% and pold = 50%. Each point represents the mean of 10 independent replications (runs). Lines in the graph result from spline interpolation. When the mean number of fitness evaluations exceeds 100,000 then statistics for the number of global optima and the respective point and line segment are not shown in the graph.
a reduction in the performance on Pshuff64, since both the old and the new clustering hypotheses store information about the subsequent levels of the problem structure. The performance on the other problems is, however, not so influenced by pold,
except when pold = 100%, where performance on Ptrapfive50 also drops. Using only the old clustering hypotheses seems to be harmful for the algorithm. Setting the probability of interbreeding pc to zero affects the performance of the algorithm for a wider range of problems, since this mechanism is responsible for recombination. Increasing pc slows down the convergence for all instances tested. The parameter pw is also negatively related to the convergence speed, but lower values of pw should be avoided, at least for HIFF, since no global solution for Pshuff64 is found when pw = 0. The effectiveness on the other problems was not affected by this parameter. After analyzing all results and recognizing that nearly extreme values for all of the parameters tested are often undesirable, we set pc, pold and pw all to 50%.
Multimodal Optimization
A thorough comparison between ϕ-PBIL and UEBNA is presented now. The latter was already shown to be competent at solving globally multimodal optimization problems. In [10], UEBNA is compared to other EDAs when solving 12 instances of globally multimodal problems. Experiments reported there clearly show that UEBNA achieves much better efficiency and efficacy when compared to EBNA, which is based on the supervised induction of Bayesian networks. All results for UEBNA presented here were extracted from [10]. The methodology adopted is also the same. In order to carry out a fair comparison, the default parameters of ϕ-PBIL are kept fixed for all 12 instances. The initial population size is set as N0 = 4000 for ϕ-PBIL, the same value adopted for UEBNA. Additionally, the working population size of ϕ-PBIL is set as Nw = 400. The only free parameter adopted here is k, the number of clusters. The value of k which maximizes the performance of the algorithm is chosen, for ϕ-PBIL, from the set k ∈ {2, 3, ..., 20} and, for UEBNA, from the results reported in [10]. Ten independent runs of UEBNA were performed in [10] for each instance; the same number is adopted here for ϕ-PBIL. Table 2 summarizes the results. The problems and algorithms are organized in rows and the data columns are organized in two sections: on the left, denoted by All runs, the results for all 10 runs are shown, and on the right, denoted by Successful runs, only those runs where at least one global optimum was correctly found are taken into account. There are columns for the mean ± standard deviation of the number of global optima found (Opt.±s.d.) and for the mean ± standard deviation (Fit. eval.±s.d.) of the number of fitness evaluations performed until convergence. For most of the instances, ϕ-PBIL attains convergence to all global optima with a smaller number of fitness evaluations when compared to UEBNA. However, for larger problems the advantage is not so clear. For Pcatring56 and Pcatring42, for example, the number of optima found is close to the expected values for both algorithms, and the number of fitness evaluations is similar. In Pcatring84, ϕ-PBIL attained inferior effectiveness and efficiency when compared to
Table 2. Effectiveness and efficiency of ϕ-PBIL and UEBNA for 12 instances of globally multimodal optimization problems, as the mean ± standard deviation of the number of global optima (peaks) found and of the number of fitness evaluations, after 10 independent runs of each algorithm for each problem. Data for UEBNA extracted from [10]

                                  All runs                          Successful runs
Problem          EDA              Opt.±s.d.   Fit. eval.±s.d.       Opt.±s.d.   Fit. eval.±s.d.
Ptwomax50        UEBNA  K=2       2.0±0.0     55000±0               2.0±0.0     55000±0
(2 peaks)        ϕ-PBIL K=2       2.0±0.0     11434±202             2.0±0.0     11434±202
Ptwomax100       UEBNA  K=2       2.0±0.0     76600±1265            2.0±0.0     76600±1265
(2 peaks)        ϕ-PBIL K=2       2.0±0.0     13776±310             2.0±0.0     13776±310
Pgrid16          UEBNA  K=4       2.0±0.0     51400±2366            2.0±0.0     51400±2366
(2 peaks)        ϕ-PBIL K=4       2.0±0.0     14463±1469            2.0±0.0     14463±1469
Pgrid36          UEBNA  K=2       2.0±0.0     85600±8462            2.0±0.0     85600±8462
(2 peaks)        ϕ-PBIL K=5       1.2±1.0     49455±8251            2.0±0.0     45600±5947
Pgrid64          UEBNA  K=4       2.0±0.0     124900±3479           2.0±0.0     124900±3479
(2 peaks)        ϕ-PBIL K=5       1.0±1.1     130317±20219          2.0±0.0     121773±22065
Pcat28           UEBNA  K=2       2.0±0.0     57100±2846            2.0±0.0     57100±2846
(2 peaks)        ϕ-PBIL K=4       2.0±0.0     26309±2020            2.0±0.0     26309±2020
Pcat42           UEBNA  K=2       2.0±0.0     73900±1449            2.0±0.0     73900±1449
(2 peaks)        ϕ-PBIL K=4       1.9±0.3     53034±4121            1.9±0.3     53034±4121
Pcat56           UEBNA  K=4       2.0±0.0     96400±2366            2.0±0.0     96400±2366
(2 peaks)        ϕ-PBIL K=4       1.3±0.9     82047±8205            1.9±0.4     81474±9199
Pcatring28       UEBNA  K=2       4.0±0.0     54700±949             4.0±0.0     54700±949
(4 peaks)        ϕ-PBIL K=8       4.0±0.0     26498±1275            4.0±0.0     26498±1275
Pcatring56       UEBNA  K=8       3.8±0.4     96400±1897            3.8±0.4     96400±1897
(4 peaks)        ϕ-PBIL K=10      3.2±1.0     94801±6635            3.2±0.1     94801±6635
Pcatring42       UEBNA  K=6       5.9±0.3     75700±3302            5.9±0.3     75700±3302
(6 peaks)        ϕ-PBIL K=20      5.9±0.3     61296±2926            5.9±0.3     61296±2926
Pcatring84       UEBNA  K=10      4.8±0.8     121000±3162           4.8±0.8     121000±3162
(6 peaks)        ϕ-PBIL K=10      2.9±1.0     165627±13280          2.9±0.1     165627±13280
UEBNA. After changing some parameters, the algorithm reaches better results and outperforms UEBNA [21]. The default parameter setting seems not adequate for the algorithm to achieve success on the largest instance. On the other hand, it is noticeable that ϕ-PBIL can attain good results with relatively small populations. It is also important to note that the computational complexity of the learning step of ϕ-PBIL is related to a k-means update, which is much less time-consuming than Bayesian network learning.
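The claim about the cost of the learning step can be illustrated with a sketch of a single incremental k-means update (our illustration; the exact update rule used by ϕ-PBIL may differ): each new selected individual triggers one nearest-centroid assignment and one running-mean adjustment, with no structure search.

import numpy as np

def kmeans_single_update(centroids, counts, x):
    # Assign the new individual x to its nearest centroid and move that
    # centroid towards x with a running mean: O(k * l) work per individual.
    x = np.asarray(x, dtype=float)
    i = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
    counts[i] += 1
    centroids[i] += (x - centroids[i]) / counts[i]
    return i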
7 Additional Experiments
Besides the experiments of the previous section, which were already presented in [5], further empirical evaluations are described here. The motivation for including those experiments is twofold: (i) to illustrate the behavior of ϕ-PBIL when detecting and preserving the structure of a problem, in order to allow a better understanding of how the mechanism of the algorithm works, and (ii) to verify the scalability through a systematic experiment, adopting a multimodal separable problem which is believed to be EDA-hard [22].
An Illustration of the Behavior of ϕ-PBIL
This simple experiment illustrates the behavior of ϕ-PBIL when finding and preserving substructures of a problem. In particular, it shows how the preservation of a substructure comes as a result of the interaction of the mechanisms of the algorithm. Figure 3 reports values of the matrix Π̂ during the evolutionary process for a single run of ϕ-PBIL on a small instance of the concatenated trap-5 problem.
Fig. 3. The average proportion of 1s, calculated for the three substructures (BB0, BB1 and BB2), in each of the 4 clusters. A single run of ϕ-PBIL on the concatenated trap-5 problem is shown.
Fig. 4. Values from the matrix Ŵ for each gene, for t = 700, considering the same run of figure 3
Fig. 5. Values from the matrix Ŵ for each gene, for t = 1900, considering the run of figure 3
The corresponding matrix Ŵ is shown at two moments of the process, after 700 and 1900 fitness evaluations, in figures 4 and 5 respectively. An instance of size 15 was adopted, which leads to 3 optimal building blocks (BB0, BB1 and BB2) to be found and combined. Time (t) in the graph corresponds to the number of fitness evaluations performed so far. The algorithm was run with the default parameters. Additionally, k = 4 clusters were used and the population was set to N0 = Nw = 100 individuals. The average π_{i,j} over all genes j of a substring is shown, for each cluster, in figure 3 as lines for BB0, BB1 and BB2, respectively. When a line reaches the value 1, it means that all individuals of that cluster possess the corresponding building block. Two "snapshots" are taken from each cluster, at t = 700 and t = 1900 fitness evaluations, which correspond to the number of evaluations performed after 7 and 19 generations respectively. The corresponding w_{i,j} values are shown as bar charts in figure 4 (t = 700) and figure 5 (t = 1900), one bar for each gene j on each cluster i. The same convention as in figure 3 is adopted to represent the genes of different substructures. The behavior of the algorithm during the initial stages of the process is clear. Increasingly better individuals are grouped together by similarity of genotype; each cluster corresponds to a certain building block after a few iterations. The snapshot for t = 700, in figure 4, shows that the greatest w_{i,j} values of clusters 1, 2 and 3 correspond to genes of the building block captured by each cluster (BB1, BB2 and BB0, respectively), since the cluster is more informative on those genes than on others. This implies that optimal building blocks are expected to be successfully combined during interbreeding. Cluster 0 is, initially, not strongly related to any building block. Another visual representation of Π̂ and Ŵ is shown in figure 6, for the same run reported here, corresponding to t = 360. At that time, the creation of a temporary PV for an interbreeding occurs as in figure 7. Clusters 1 and 3 are randomly selected to interbreed. The parent with the greatest value of ŵ_{i,j} (ŵ_{A,j} or ŵ_{B,j}), for each gene j, is chosen when assembling the new PV. Figure 7 illustrates substructure preservation. Notice that genes 5 to 9 in cluster 1 are influenced by the presence of BB1 in most individuals of that cluster. Cluster 1 is, therefore, highly informative for those genes: ŵ_{1,5} = 0.22, ŵ_{1,6} = 0.42, ŵ_{1,7} = 0.01, ŵ_{1,8} = 0.11 and ŵ_{1,9} = 0.23. The corresponding values for cluster 3 are lower: ŵ_{3,5} = −0.04, ŵ_{3,6} = −0.08, ŵ_{3,7} = −0.01, ŵ_{3,8} = 0 and ŵ_{3,9} = −0.02. Cluster 1 wins at all positions corresponding to the building block captured by it, preserving the model of that building block. As the process goes on, another important behavior becomes evident. At t = 1900, as figure 5 illustrates, the information measures for genes in the same substructure are identical at each cluster. This allows for the preservation of the building blocks during crossover in the later stages, for that problem. Selection pressure makes the population converge to individuals which possess only optimal or suboptimal building blocks. Optimal building blocks tend to be widespread as a result and, therefore, clusters possessing those building blocks
Fig. 6. A setting of binomial proportions π̂_{i,j} and their respective ŵ_{i,j} for an actual run of ϕ-PBIL on the trap-5 problem, after 360 fitness evaluations. The values inscribed are the π̂_{i,j}, while the grey tones denote the ŵ_{i,j} in the scale shown.
Fig. 7. Generating a temporary PV during an interbreeding operation, for the same run shown in figure 6. π̂_{A,·}, π̂_{B,·} and their respective ŵ_{i,j} are shown using the same convention as in figure 6, except for the resulting PV, which has no ŵ_{i,j} and is shown in white.
tend to be less informative than they were in the initial stages. Cluster 2 at t = 1900, for instance, is informative for the distribution of genes 0, 1, 2, 3 and 4 just because the individuals of cluster 2 are those remaining in the population which do not possess the optimal building block BB0. The race between selection and innovation is clear: crossovers involving cluster 2 result in individuals possessing a suboptimal building block at positions 0 to 4, but such individuals are not expected to survive long, as a result of their relatively low fitness. At the later stages, for each cluster i, all w_{i,j} within a given substructure are equal. Clusters differentiate themselves from each other by the different combinations of optimal and suboptimal building blocks, as opposed to the initial stages, where the presence of a single building block was the major motivation for the existence of each cluster. Notice that the interbreeding mechanism chooses the π_{i,j} corresponding to the greatest w_{i,j}. If all w_{i,j} corresponding to the same substructure are equal within each cluster, then w_{a,g} > w_{b,g} for a given gene g implies w_{a,j} > w_{b,j} for all genes j in the same substructure, and the whole building block is expected to be preserved. Since all information measures are equal inside the substructures, the whole block wins or loses during a recombination. This guarantees the preservation of whole building blocks (of their corresponding binomial proportions, to be more precise) during PV crossover.
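The gene-wise choice behind the interbreeding operator discussed above can be sketched as follows (our illustration; names and the tie-breaking rule are assumptions): for each gene, the binomial proportion of the parent cluster with the larger information measure is copied into the temporary PV, which is what lets a whole building block win or lose as a unit once the w values inside a substructure are equal within each cluster.

def temporary_pv(pi_a, w_a, pi_b, w_b):
    # Keep, gene by gene, the proportion of the more informative parent cluster;
    # ties arbitrarily favour parent B here.
    return [pa if wa > wb else pb for pa, wa, pb, wb in zip(pi_a, w_a, pi_b, w_b)]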
Fig. 8. Scalability of ϕ-PBIL on the CPF5 problem, for 3 fixed values of k. The average number of fitness evaluations until a global optimum is found is shown, over 100 independent runs. The minimal population sizes for each problem size are also shown.
Scalability on a Multimodal Problem
This experiment shows how the novel algorithm scales on the parity problem CPF5, which is a challenging globally multimodal problem, since the number of global optima increases exponentially as a function of the problem size. In a recent paper [22] this problem was shown to be hard for EDAs. That paper reports a scalability experiment which indicates that hBOA, a state-of-the-art EDA based on Bayesian networks, scales exponentially on that problem. That work adopted the following methodology to set up the population size for each problem size considered: first, a search for the minimal population size is performed. The minimal population corresponds to the size at which the algorithm successfully finds a global optimum in 10 out of 10 successive runs. This search is repeated 10 times and the average over those replications is adopted as the population size. The search is performed, for each problem size considered, by the bisection method. Once the population sizes are determined, hBOA is run 100 times for each problem size. A similar methodology was adopted here to check the scalability of ϕ-PBIL for the same problem CPF5. The default parameters were adopted, and the
Fig. 9. Scalability of ϕ-PBIL on the CPF5 problem, for k proportional to the problem size as k = λ/5. The average number of fitness evaluations until a global optimum is found is shown, over 100 independent runs. The minimal population sizes for each problem size are also shown.
initial population size N0 is set to N0 = 10Nw. The only difference is the search method used to obtain the minimal population sizes, because bisection can get trapped in local optima [22] when performing the search. This motivates the adoption of a more robust method like linear search, which is, however, less computationally efficient than bisection. The computational complexity of both methods is quite reasonable for the application considered here. The population size (which is initially equal to k) is increased by a constant factor until 10 out of 10 runs result in the algorithm successfully finding a global optimum. The search is repeated 10 times and the results are averaged, for each problem size. Figure 8 shows the results for the average, over 100 runs, of the number of fitness evaluations required until a global optimum is found, considering only the successful runs. Below, in the same figure, the minimal population sizes required for each problem size are shown. This first investigation on CPF5, illustrated in figure 8, is performed for 3 values of k (the number of clusters). It is evident that the scalability of the algorithm is influenced by the value of k. For k = 10 the curve is clearly exponential,
but increasing k seems to benefit the performance of the algorithm, since the problem becomes more tractable, at least for the problem sizes shown. Obviously, for even greater problem instances, k = 20 or 30 would not suffice. This would not come as a surprise, and suggests that k should be found automatically by the algorithm during the evolutionary process. A less robust but still legitimate approach would be to make k proportional to the problem size λ. Therefore, in another experiment adopting the same methodology, the number of clusters is set to k = λ/5. Figure 9 shows the results for that experiment. The curve for the number of fitness evaluations suggests that adequate, seemingly below-exponential, scalability for that problem was achieved using ϕ-PBIL. It is noticeable that a very simple EDA such as ϕ-PBIL can reach a good level of performance on a highly multimodal problem such as CPF5. However, comparisons to the results reported in [22], which reveal an exponential scalability of hBOA on CPF5, should be made very carefully, since the experiment in [22] was performed using the bisection method, and there were at least two alternative population size attractors which could be found by bisection, for most problem dimensions. After reaching a critical population size, the bisection method is attracted to the worse size, which degrades the performance of hBOA.
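A sketch of the population-sizing search described above (our illustration; run_once is a hypothetical routine returning True when a run of ϕ-PBIL with the given population size finds a global optimum, and the growth factor is an assumption):

def minimal_population_linear_search(run_once, k, growth_factor=1.1, runs=10):
    # Start from a population equal to k and increase it by a constant factor
    # until 10 out of 10 runs succeed; the outer averaging over 10 repetitions
    # of this search is omitted for brevity.
    size = float(k)
    while True:
        if sum(run_once(int(size)) for _ in range(runs)) == runs:
            return int(size)
        size *= growth_factor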
8 Conclusion
The main foundation for the algorithm evaluated here is related to the concept of schemata, which is based on the existence of similarities in the population that are related to the structure of the problem. Clustering plays a central role in the mechanics of ϕ-PBIL, grouping genotypically similar individuals together and capturing existing building blocks. Although the importance of similarities and their relation to the problem structure was, to some extent, already suggested by the schemata theorem, this aspect was never approached as explicitly as here. Clustering was already used in the GA and EDA context, but with a different perspective. It is usually applied to avoid the combination of information from different niches, as opposed to how ϕ-PBIL does it. For instance, [3] shows how clustering helps a single-order EDA to solve globally multimodal problems. The solution is intuitive: each cluster is expected to focus on a global optimum, allowing the algorithm to converge to several global solutions which do not interfere with each other during the process. The algorithm ϕ-PBIL accomplishes both linkage learning and niching. It is a very simple EDA which adopts single-order probabilistic models aided by clustering, which leads to a very parsimonious approach when compared to other competent EDAs. Its simple incremental approach is also noticeable: individuals are successively generated from continuously updated models. The probabilistic models used by ϕ-PBIL are much simpler to infer when compared to Bayesian networks. Actually, the probabilistic models are updated after each new individual is generated and selected, at the cost of a single k-means update.
Another difference between ϕ-PBIL and other EDAs lies in how linkage learning is performed. While Bayesian network-based EDAs search for dependencies among increasingly greater groups of variables, ϕ-PBIL directly detects values for interacting groups of variables of arbitrary sizes. This leads to a model limited to second-order statistics, since the interaction between a variable and the cluster label is of order 2. Model search is avoided, since all variables are considered to be dependent only upon the cluster label. The most sophisticated part of the algorithm is the interbreeding mechanism, which respects the problem structure and relies on the diversity of substructures as one of the causes for the existence of clusters in the population. One of the experiments gives some evidence that this mechanism is very important for solving problems which present interactions among variables. As the experiments illustrate, the linkage learning capability spontaneously emerges as a result of the mechanics of the algorithm: initially, each cluster "specializes" in a specific building block, so clusters can be considered as a representation for schemata. The combination of schemata results, therefore, from the concept-guided combination of clusters, which captures information about how much each gene is related to each cluster. Experiments show why this operator respects the problem structure. The scalability experiment in section 7 shows that ϕ-PBIL is able to solve a hard multimodal problem like CPF5 reliably. The success of the proposed approach in this and in other experiments may also be related to the kind of search performed. Local search is clearly performed by ϕ-PBIL, since each cluster is exploring a local region of the space. The adoption of the Wilson estimator, for instance, allows for local search even in situations where the sample mean would simply inhibit it. This aspect distinguishes the novel algorithm from other competent EDAs, which usually build a single global model for the whole search space. Some of the problems solved by the algorithm evaluated here are considered to be hard for EDAs and GAs. It would be very worthwhile to better understand how the algorithm solves problems with overlapping building blocks, as shown by some other illustrative experiments presented in section 7. Future work will also check scalability for a wider range of problems, including HDFs like HIFF [20]. Further empirical investigation of the behavior of the interbreeding mechanism should also be performed in order to better explain how exploration is actually done for several classes of problems. It is also important to understand the limitations of this novel approach, attempting to identify clearly which classes of problems are more or less amenable to it. This issue motivates analytical and further empirical investigations.
Acknowledgments The authors would like to thank Fernando Lobo for his valuable suggestions for this work.
References 1. Holland, J.H.: Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor (1975) 2. Mahfoud, S.W.: Niching methods for genetic algorithms, Doctoral dissertation, University of Illinois at Urbana-Champaign, Urbana, USA (1995) 3. Pelikan, M., Goldberg, D.E.: Genetic algorithms, clustering, and the breaking of symmetry. In: Parallel Problem Solving from Nature, vol. VI, pp. 385–394 (2000) 4. McQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkley symposium on Mathematics, Statistics and Probability, pp. 281–296 (1967) 5. Emmendorfer, L., Pozo, A.: An empirical evaluation of linkage learning strategies for multimodal optimization. In: Proceedings of IEEE Congress on Evolutionary Computation (CEC), pp. 326–333 (2007) 6. Sastry, K., Abbass, H.A., Goldberg, D.E., Johnson, D.D.: Sub-structural niching in estimation of distribution algorithms. In: GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation, pp. 671–678 (2005) 7. Petrowski, A.: A new selection operator dedicated to speciation. In: Proceedings of the 7th International Conference on Genetic Algorithms, pp. 144–451 (1997) 8. Emmendorfer, L., Pozo, A.: An incremental approach for niching and building block detection via clustering. In: Proceedings of the Seventh International Conference on Intelligent Systems Design and Applications, pp. 303–308 (2007) 9. Pelikan, M., Goldberg, D.E., Cantu-Paz, E.: BOA: The Bayesian optimization algorithm. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 1999), pp. 525–532 (1999) 10. Pe˜ na, J.M., Lozano, J.A., Larra˜ naga, P.: Globally multimodal problem optimization via an estimation of distribution algorithm based on unsupervised learning of bayesian networks. Evolutionary Compututation 13(1), 43–66 (2005) 11. Goldberg, D.E., Korb, G., Deb, G.: Messy genetic algorithms: Motivation, analysis and first results. Complex Systems 3, 493–530 (1989) 12. Liu, B.: Statistical Genomics: Linkage, Mapping, and QTL Analysis. CRC Press, Boca Raton (1998) 13. Harik, G.R.: Learning gene linkage to efficiently solve problems of bounded difficulty using genetic algorithms. PhD thesis, University of Michigan (1997) 14. Jakulin, A., Bratko, I.: Testing the significance of attribute interactions. In: ICML 2004: Proceedings of the twenty-first international conference on Machine learning, pp. 409–416 (2004) 15. Minsky, M., Papert, S.: Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge (1969) 16. Baluja, S., Caruana, R.: Removing the genetics from the standard genetic algorithm. In: International Conference on Machine Learning, pp. 38–46 (1995) 17. Friedman, N., Goldszmidt, M., Koller, D.: The Bayesian Structural EM algorithm. In: Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 129–138 (1998) 18. Hocaoglu, C., Sanderson, A.C.: Evolutionary speciation using minimal representation size clustering. In: Evolutionary Programming, pp. 187–203 (1995) 19. Agresti, A., Coull, B.: Approximate is better than exact for interval estimation of binomial proportions. The American Statistician 58, 119–126 (1998)
20. Watson, R.A., Hornby, G., Pollack, J.B.: Modeling building-block interdependency. In: PPSN V: Proceedings of the 5th International Conference on Parallel Problem Solving from Nature, pp. 97–108 (1998) 21. Emmendorfer, L., Pozo, A.: Effective linkage learning using low-order statistics and clustering. In: IEEE Transactions on Evolutionary Computation (accepted) 22. Coffin, D.J., Smith, R.E.: The limitations of distribution sampling for linkage learning. In: Proceedings of IEEE Congress on Evolutionary Computation (CEC), pp. 364–369 (2007)
Studying the Effects of Dual Coding on the Adaptation of Representation for Linkage in Evolutionary Algorithms Maroun Bercachi, Philippe Collard, Manuel Clergue, and Sebastien Verel Techniques pour l’Evolution Artificielle team (TEA) Laboratoire I3S - Universite De Nice-Sophia Antipolis / CNRS Les Algorithmes, 2000 Route Des Lucioles BAT. Euclide B - BP. 121 - 06903 Sophia Antipolis - France {bercachi,philippe.collard,clergue,verel}@i3s.unice.fr http://www.i3s.unice.fr/tea
Summary. For successful and efficient use of GAs, it is not enough to simply apply simple GAs (SGAs). In addition, it is necessary to find a proper representation for the problem and to integrate linkage information about the problem structure. Similarly, it is important to develop appropriate search operators that fit well to the properties of the genotype encoding and that can learn linkage information to assist in creating, and not in destroying, the building blocks. Besides, the representation must at least be able to encode all possible solutions of an optimization problem, and genetic operators such as crossover and mutation should be applicable to it. In this chapter, sequential alternation strategies between two coding schemes are formulated in the framework of dynamic change of genotype encoding in GAs for function optimization. Likewise, new variants of GAs for difficult optimization problems are developed using a parallel implementation of GAs and evolving a dynamic exchange of individual representation in the context of dual coding concepts. Numerical experiments show that the evolved proposals significantly outperform a SGA with static single coding. Keywords: Genetic Algorithms, Difficult Optimization, Performance, Dynamic Representation, Serialization, Parallelization, Agents, Steady State, Linkage.
1 Introduction
Evolutionary algorithms (EAs) are powerful search methodologies inspired by natural evolution. They typically include those forces of genetics deemed most influential in nature, such as selection, mutation, and crossover (mating); so they simulate the reproduction phenomenon and operate on the principle of the survival of the fittest [1, 2]. EAs have been widely used to solve problems in various fields. As the magnitude and complexity of the problems processed by EAs increase, investigators begin to realize that, for practical and efficient use, certain crucial mechanisms have to be incorporated into the framework of evolutionary computation. Among these crucial mechanisms suggested by practitioners is the
ability to learn linkage, which refers to the relationship between decision variables, and the role of best coding selection, referred to as the problem representation. The use of linkage information has been acknowledged to be of significant importance. Finding building blocks is the main issue. Once the building blocks are known, they can be searched more efficiently and mixed together. General or fixed recombination operators often break partial solutions, which can sometimes lead to the loss of potential solutions and convergence to a local optimum [31, 34, 35]. According to this theory, it is important that the crossover operation not be too disruptive of the building blocks, and this suggests that the crossover operation should be selected or designed so that linkage groups (groups of highly linked loci) are not overly disrupted by the crossover operator. Likewise, when applying EAs to optimization problems, we usually use strings of characters drawn from a finite alphabet as chromosomes and genetic operators to manipulate these artificial chromosomes [36, 39]. These EAs either explicitly or implicitly act on an assumption of a good coding scheme which can provide tight linkage for the genes of a building block on the chromosome [40, 41]. In order to resolve the issue which is raised because the knowledge of the relationship between variables is unavailable, and because encoding the solutions as fixed binary strings of characters is common in EA practice, we propose in this chapter a variety of serial and parallel strategies evolving a dynamic change of chromosomal representation, considered as a sort of "conversation" between the standard binary coding (SC) and the gray coding (GC). The goal of our strategies, based on manipulating the representation of solutions in the algorithm and changing the coding scheme from SC to GC or from GC to SC, is to make the interacting components of partial solutions less likely to be broken by recombination operators. The alternation between the two coding schemes is done according to some governing conditions described later. As has been discussed above, two crucial factors of EA success, a "proper growth" and a "good mixing of building blocks", are often not achieved. The problem of building block disruption is frequently referred to as the "linkage problem". Various attempts to prevent the disruption of important partial solutions have been made. The first class of techniques is based on extracting some information from the entire set of promising solutions in order to generate new solutions [25, 27, 28, 29, 30, 31]. The second class of techniques is based on changing the representation of solutions in the algorithm or evolving the recombination operators among individual solutions [22, 23, 24, 32, 33, 35]. Our research falls into the second class of methods. So, we will try to find a strategy to employ the linkage learning technique in our proposals to take advantage of this crucial mechanism in the EA process. We do not attempt to design the best algorithm for solving real-parameter optimization problems but rather to demonstrate that implementing an algorithm that combines several coding schemes is better than using one single coding. Precisely, we are investigating two enhancements to genetic algorithms (GAs). First, we add options to the representation selection "tool"; thus this tool incorporates a
dynamic representation which will cover a wider class of problem domains and will increase the flexibility of GA operations. Second, we use this tool, profiting from its adaptability, to improve GA performance, either by reducing the number of function evaluations or by improving the success rate. This chapter includes four main sections. In section 2 we introduce the individual representation issue and take a closer look at the characteristics of chromosomal encoding. In section 3, we investigate the implications of using new dual coding techniques on the performance of GAs. In section 4, we extend our research to test the influence of applying different dual coding proposals on the performance of the CHC algorithm. Finally, in section 5, we draw an overall conclusion about what kind of advantage linkage learning concepts could bring to the use of two or more coding schemes simultaneously in one algorithm.
2 Individual Representation
Representation is one of the key decisions to be made when applying a GA to a problem. How a problem is represented in a GA individual determines the shape of the solution space that a GA must search [11]. For example, the choice of a tree representation instead of a vector representation could help, depending on the tested problem [12]. For any function, there are multiple representations which make optimization trivial [13, 43]. Unfortunately, practitioners often report substantially different performances of GAs by simply changing the representation used. The difficulty of a specific problem, and with it the performance of GAs, can be affected dramatically by using various types of encodings. Indeed, an encoding can perform well for many diverse test functions, but fail for the one problem which we are interested in solving [3]. These observations were confirmed by empirical and theoretical investigations. In particular, there are kinds of GAs, like the messy GA (Goldberg et al. 1989), that use an adaptive encoding that adjusts the structure of the representation to the properties of the problem. This approach, however, burdens the GA not only with the search for promising solutions, but also with the search for a good representation. The usual method of applying GAs to optimization problems is to encode each parameter as a bit string using either SC or GC. The debate as to whether GC is better than SC has been a classic example where theory and practice clash [18]. Generally, according to some major studies, the use of GC has been found to enhance the performance of genetic search in some cases [8]. However, GC produces a different function mapping that may have fewer local optima and different relative hyperplane relationships than SC, which sometimes has been found to complicate the search, producing a large number of local optima [9]. Also, GC has been shown to change the number of local optima in the search space because two successive real gray-coded numbers differ only by one bit. Moreover, the use of GC is based on the belief that changes introduced by mutation do not have such a disruptive effect on the chromosome as when we use SC [7]. Besides, many studies have shown that SC has a high convergence speed to the
best solution and is effective for some classes of problems, due to the fact that it frequently locates the global optimum [6]. As a result, different encodings of the same problem are essentially different problems for a GA. Selecting a representation that correlates with a problem's fitness function can make that problem much easier for a GA to solve [8]. An interesting approach consists of incorporating good concepts about encodings and developing abstract models which describe the influence of representations on measurements of GA performance. After that, dynamic representation strategies can be used efficiently in a theory-guided manner to achieve significant advancement over existing GAs for certain classes of optimisation problems.
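Since the strategies developed below repeatedly translate chromosomes between the two codings, a short sketch of the standard integer conversions may help (ours; presumably what a value-preserving change of representation between SC and GC amounts to):

def binary_to_gray(n):
    # Adjacent integers differ in exactly one bit under the reflected Gray code.
    return n ^ (n >> 1)

def gray_to_binary(g):
    # Inverse mapping: XOR of all right shifts of g.
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n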
3 Sequential and Parallel Dual Coding Based on GAs
In a computer program, the search is dictated by the representation of the domain and the search operators on that representation. In this domain, search often proceeds by modifying the bits of previously evaluated points in the search space. Understanding how the bits in the representation interact with each other in defining the value of the function is critical for the understanding of the function to be optimized. This interaction is called epistatic linkage. Two loci (string positions) are epistatically linked if the effect of changing the allele (value) at one locus depends on the allele at the other locus. Similarly, a group of loci are epistatically linked if the effect of changing the allele at one locus depends on the alleles at all other loci of the group [38]. Like the natural world, GAs are forms of adaptive systems in which various chromosomes interact via sufficiently complicated elements [1, 2]. These elements include the selection method, crossover and mutation operators, the encoding mechanism (representation) of the problem, and many others. Many individual representations have been proposed and tested within a wide range of evolutionary models. An essential natural question that has to be answered in all these evolutionary models is: which genotype encoding is needed to make individuals evolve better in a GA application? To prevent making a bad choice of a coding that does not match a problem's fitness function, the research effort reported in this section focused on developing strategies for sequential and parallel implementations of a simple GA (SGA) evolving the use of two codings simultaneously, with the goal of revising GA behaviour, enhancing GA performance, and ultimately refining GA solution quality. Some previous works [10] proposed the use of dynamic representations to escape local optima. Their strategies focused on parameter optimization and involved switching the gray representation of individuals when the CHC algorithm had converged. In this section, we explore different ways and diverse criteria for conversation and interaction between two representations in one SGA. The first, a serial alternation strategy, sequentially switches between two coding schemes according to a specific criterion. The second is a new variant of GAs denoted the Split-and-Merge GA (SM-GA), which exploits two codings in parallel in a dynamic manner. So currently, we focus more on the matter of exploring variant strategies
of dynamic representation and we concentrate on the topic of enhancing the performance of a SGA.
3.1 Serial Dual Coding Strategies
As no theory of representations exists, the current design of proper representations is not based on theory, but is more a result of black art [3]. For this purpose, designing a new dynamic representation should not remain the black art of GA research but become a predictable engineering task. In this chapter, we encode minimization test problems with fixed binary strings and we refer specifically to the two most popular codings, SC and GC. Therefore, the important matter is to explore the best strategy of alternation between SC and GC in order to improve GA performance. At first, we studied the possibility of dual chromosomal encoding in GAs using sequential alternation strategies such as Periodic-GA, Aperiodic-GA, LocalOpt-GA, HomogPop-GA and SteadyGen-GA. The idea behind the Periodic-GA was to alternate between two given codings after a fixed number of generations (period). The parameter requires fine-tuning for a given problem. Aperiodic-GA differs from Periodic-GA by selecting, before each alternation, an arbitrary period (aperiod) from the [minP : maxP] interval. The parameter does not demand expensive tuning because an interval accommodation is easier and less sensitive when moving from one test function to another. LocalOpt-GA consists of changing the coding when the population's best individual is a local optimum. The idea was to try alternating between representations because a local optimum under one coding is not necessarily a local optimum under the other, a fact that will probably permit escaping the obstacle created by a local optimum and achieving better results. This proposal does not require any parameter tuning, but it significantly increases the execution time because of the need to process a huge number of function evaluations at each generation. In the framework of LocalOpt-GA, we studied the positions and the number of local optima, for the Schaffer function F6 (cf. subsection 3.3) and for the two codings SC and GC, by an exhaustive exploration of the search space. A double local optimum is a solution which is a local optimum under both codings. For function F6, the reported number of local optima for SC was 6652 and for GC was 7512. There were fewer double local optima shared between SC and GC; the reported number was 2048. The positions (x, y) of the local optima are given in figure 1. The idea for the HomogPop-GA was to change representation when the population attains a homogeneous phase that reveals its inability to enhance the results. The homogeneity criterion was measured by the standard deviation of the fitnesses in the population in comparison with a given real number (ε). Also, by alternating, this will maintain some degree of diversity between individuals, which will help in exploring the search space. The parameter is very sensitive and requires tuning for each problem. In SteadyGen-GA, the alternation is realized when the best fitness value is not modified for a given number of generations (steadyGen). This strategy is
(a) Local Optimum for SC
(b) Local Optimum for GC
(c) Double Local Optimum
Fig. 1. Positions of Local Optimum
helpful to enhance the fitness capacity during the search. The parameter is not sensitive and does not require fine-tuning for each problem. The main algorithm common to these strategies involves executing a SGA for one generation with a given coding. Then, a test of the coding alternation condition is required (cf. the serial dual coding proposals below); if the condition is true, then coding alternation is realized and the representation of all individuals is converted to the new coding. The coding alternation cycle continues until a given maximum number of generations maxGen is reached. To simplify the serial strategy algorithms, common procedures were used. For a given population pop, given representations coding, coding1 and coding2, and given settings of steadyGen and maxGen, these procedures can be summarized as follows:
• Generate Initial Population(): randomly generates an initial population.
• Run 1 SGA(pop, coding): executes a SGA for one generation with pop having coding as representation.
• Alternate Coding(coding1, coding2): switches the problem encoding between coding1 and coding2 and returns the coding corresponding to the last altered coding.
• Convert Population(pop, coding): converts the representation of the individuals of pop to coding.
• Is MaxGen(maxGen): a boolean function that returns true if an algorithm was executed entirely for maxGen generations and false otherwise.
The serial dual coding proposals can be defined as follows:
1. Periodic-GA (cf. algo 1): The alternation is realized if a SGA was executed for period generations with a given coding. A boolean procedure Is Period(period) is used and returns true if a SGA has run for period generations and false otherwise.
2. Aperiodic-GA (cf. algo 2): Same as Periodic-GA, with an arbitrary number aperiod chosen from the [minP : maxP] interval before each alternation.
3. Local Optimum GA (LocalOpt-GA) (cf. algo 3): The alternation is realized if the population's best individual is a local optimum. A boolean procedure
Algorithm 1. Periodic-GA
  period ← periodValue
  coding ← starterCoding
  pop ← Generate Initial Population()
  repeat
    repeat
      Run 1 SGA(pop, coding)
    until Is Period(period)
    coding ← Alternate Coding(coding1, coding2)
    Convert Population(pop, coding)
  until Is MaxGen(maxGen)
Algorithm 2. Aperiodic-GA
  coding ← starterCoding
  pop ← Generate Initial Population()
  repeat
    aperiod ← Random[minP : maxP]
    repeat
      Run 1 SGA(pop, coding)
    until Is Period(aperiod)
    coding ← Alternate Coding(coding1, coding2)
    Convert Population(pop, coding)
  until Is MaxGen(maxGen)
Algorithm 3. LocalOpt-GA
  coding ← starterCoding
  pop ← Generate Initial Population()
  repeat
    repeat
      Run 1 SGA(pop, coding)
    until Is Local Optima(Best Element(pop))
    coding ← Alternate Coding(coding1, coding2)
    Convert Population(pop, coding)
  until Is MaxGen(maxGen)
Is Local Optima(Best Element(pop)) is used and returns true if the best individual of pop is a local optimum and false otherwise. A predefined subroutine Best Element(pop) is utilized to get the best individual of pop.
4. Homogeneous Population GA (HomogPop-GA) (cf. algo 4): The alternation is realized if the population's standard deviation is less than or equal to ε. A boolean procedure Is Homogeneous Population(pop, ε) is used and returns true if the standard deviation of pop is less than or equal to ε and false otherwise.
5. Steady Generation GA (SteadyGen-GA) (cf. algo 5): The alternation is realized if the population's best fitness value has not been changed for steadyGen
Algorithm 4. HomogPop-GA
  ε ← epsilon
  coding ← starterCoding
  pop ← Generate Initial Population()
  repeat
    repeat
      Run 1 SGA(pop, coding)
    until Is Homogeneous Population(pop, ε)
    coding ← Alternate Coding(coding1, coding2)
    Convert Population(pop, coding)
  until Is MaxGen(maxGen)
Algorithm 5. SteadyGen-GA
  steadyGen ← steadyGeneration
  coding ← starterCoding
  pop ← Generate Initial Population()
  repeat
    repeat
      Run 1 SGA(pop, coding)
    until Is Steady Generation(pop, steadyGen)
    coding ← Alternate Coding(coding1, coding2)
    Convert Population(pop, coding)
  until Is MaxGen(maxGen)
generations. A boolean procedure Is Steady Generation(pop, steadyGen) is used and returns true if the best fitness value of pop has not been improved for steadyGen generations and false otherwise.
3.2 SM-GA Technique
Agents (units or sub-populations) are, in the literal meaning, the entities that act, or have the authority to act, on behalf of their designer. The basic and important features of agents can be listed as autonomy, proactivity and collaboration, especially when we are designing agents to be used for representation purposes. An autonomous agent works in such a way that it has its own activation mechanism and behaviour. Collaboration is a very important feature of an agent, which also makes an agent differ from an expert system. Collaboration allows the agent to communicate with other agents in the environment, either for satisfying its goals or for retrieving information from the environment [17]. This is the inspiration for the new formalism called SM-GA, a technique built on the notion of agents and implemented as a recurrent encoding engine, with the purpose of bringing some order into the unsettled situation caused by the influence of representations on the performance of GAs.
SM-GA Methodology of Work and Implementation
The SM-GA algorithm is based on the role of double agents (dual coding). It includes two main phases, whose functions can be summarized as follows. In the first phase, the technique generates a random initial population (the first agent). The basic population is then split into two sub-populations (units), each with a distinct representation. Two synchronous SGAs are first executed with these two units for a given number of generations (startGen). At this point, a steady-state value (the state of no improvement of the best fitness value for a given number of generations) is computed automatically for each coding. The steady-state measurement for each representation is taken to be equal to the average of all steady states encountered during SGA operation in that representation over the startGen generations. After the two units have completed startGen generations, all individuals are merged into one population under the best coding. The best coding is selected relative to the population that has the least average fitness. Next, a SGA is processed with the united population until a steady state is met. After estimating regular steady-state values for each representation, the second phase induces a re-splitting of the whole population into two sub-populations, each having a different coding. The two divided units are then operated in parallel with two SGAs. In this manner, the SGA benefits from the two representations at the same time, since this parallel codification of the genotype corresponds to proactivity appearing on two levels and evolution occurring on two scales simultaneously. After each generation, a test for a steady state is necessary. If at least one of the two units encounters its corresponding steady state, the collaboration property of the agents helps to preserve the fitness productivity during the search process. Thus, merging the two coexistent units into one unit under the best coding is an appropriate step to gather all the developed data. At this level, the best individuals spread within the population, and exchanges realized by crossover operators and minor mutational changes in chromosomes make it possible for better structures and building blocks to be generated and exploited. Next, a SGA runs with the integrated population until it again reaches a steady state, probably caused by the existence of one or more local optima, which temporarily reveals its inability to make individuals evolve better. The remedy for this is to re-split the entire agent into two sub-agents, a simple idea motivated by the fact that the newly created agents have sufficient autonomy to reshape their unvarying pattern. This way, possibly one of the two reduced populations will have the opportunity to escape the local optimum, which allows it to survive and recover an accurate direction to explore the search space. The split-and-merge cycle then continues until a given maximum number of generations maxGen is attained (cf. algo 6). The parameter of this algorithm does not require fine-tuning for each problem. The only requirement is that the startGen value must be large enough to estimate the steady-state measurements for each coding well enough. The schema representing the whole SM-GA process is shown in figure 2.
Fig. 2. SM-GA Schema
To optimize the SM-GA algorithm, standard procedures were utilized. For given populations pop, pop1 and pop2, given representations coding, coding1 and coding2, and given numbers steadyGen and maxGen, these procedures can be summarized as follows:
• Split(pop, pop1, pop2): takes pop and divides it into two sub-populations pop1 and pop2.
• Compute Steady State(coding, startGen): estimates the steady-state value for coding, corresponding to the average of all steady states encountered while executing a SGA with coding for startGen generations.
Algorithm 6. SM-GA
  startGen ← startGeneration
  pop ← Generate Initial Population()
  Split(pop, pop1, pop2)
  repeat
    Run 1 SGA(pop1, coding1)
    Run 1 SGA(pop2, coding2)
  until Is Period(startGen)
  steadyGen1 ← Compute Steady State(coding1, startGen)
  steadyGen2 ← Compute Steady State(coding2, startGen)
  bestCoding ← Select Best Coding(pop1, coding1, pop2, coding2)
  Convert Population(pop1, bestCoding)
  Convert Population(pop2, bestCoding)
  Merge(pop1, pop2, pop)
  repeat
    Run 1 SGA(pop, bestCoding)
  until Is Steady Generation(pop, steadyGenOf(bestCoding))
  repeat
    Split(pop, pop1, pop2)
    Convert Population(pop1, coding1)
    Convert Population(pop2, coding2)
    repeat
      Run 1 SGA(pop1, coding1)
      Run 1 SGA(pop2, coding2)
    until Is Steady Generation(pop1, steadyGen1) or Is Steady Generation(pop2, steadyGen2)
    bestCoding ← Select Best Coding(pop1, coding1, pop2, coding2)
    Convert Population(pop1, bestCoding)
    Convert Population(pop2, bestCoding)
    Merge(pop1, pop2, pop)
    repeat
      Run 1 SGA(pop, bestCoding)
    until Is Steady Generation(pop, steadyGenOf(bestCoding))
  until Is MaxGen(maxGen)
• Select Best Coding(pop1, coding1, pop2, coding2): computes the fitness averages of pop1 and pop2 and returns the coding corresponding to the population that has the lowest average fitness.
• Merge(pop1, pop2, pop): takes pop1 and pop2 and blends them into pop.
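To make the procedures above concrete, the following short Python sketch is our own illustration (not the authors' implementation); it assumes that a population is simply a list of (genotype, fitness) pairs and that lower fitness is better, as in the minimization problems of this study.

import random

def split(pop):
    # Shuffle, then divide the population into two halves (sub-populations).
    pop = list(pop)
    random.shuffle(pop)
    half = len(pop) // 2
    return pop[:half], pop[half:]

def merge(pop1, pop2):
    # Blend two sub-populations back into a single population.
    return pop1 + pop2

def select_best_coding(pop1, coding1, pop2, coding2):
    # Return the coding of the sub-population with the lowest average fitness
    # (all problems in this section are minimization problems).
    avg1 = sum(f for _, f in pop1) / len(pop1)
    avg2 = sum(f for _, f in pop2) / len(pop2)
    return coding1 if avg1 <= avg2 else coding2

# Toy usage with made-up individuals.
pop_a = [("0101", 3.0), ("0011", 1.0)]
pop_b = [("0110", 4.0), ("1100", 2.0)]
print(select_best_coding(pop_a, "SC", pop_b, "GC"))   # -> SC
print(len(merge(*split(pop_a + pop_b))))              # -> 4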
3.3 Setup of Experiments
Test Problems
Given the nature of our study, and taking the most problematic and challenging test functions into consideration, we committed to a total of five minimization functions. Table 1 summarizes these unconstrained real-valued functions.
Table 1. Objective Functions
Name   Expression                                                              Range               Dimension
F2     f2(x) = 100(x1² − x2)² + (1 − x1)²                                      [−2.048 : 2.048]    2
F6     f6(x) = 0.5 + (sin²(√(x² + y²)) − 0.5) / (1 + 0.001(x² + y²))²          [−100 : 100]        2
F7     f7(x) = 200 + Σ_{i=1}^{20} (x_i² − 10 cos(2π x_i))                      [−5.12 : 5.12]      20
F8     f8(x) = 1 + Σ_{i=1}^{10} (x_i² / 4000) − Π_{i=1}^{10} cos(x_i / √i)     [−600 : 600]        10
F9     f9(x) = 10V + Σ_{i=1}^{10} (−x_i sin(√|x_i|))                           [−500 : 500]        10
All these routines exhibit different degrees of complexity; they were selected because of their ease of computation and widespread use, which should facilitate the evaluation of the results. The first test function, Rosenbrock F2, was proposed by De Jong. It is unimodal (i.e. it contains only one optimum) and is considered difficult because it has a very narrow ridge with a very sharp tip that runs around a parabola; algorithms that are unable to discover good search directions underperform on this problem. Rosenbrock F2 has its global minimum at (1, 1) [4]. The second function, Schaffer F6, was conceived by Schaffer. It is an example of a multimodal function (i.e. one containing many local optima but only one global optimum) and is known to be a hard problem for GAs because of the number of local minima and the large search interval. Schaffer F6 has its global minimum at (0, 0), surrounded by many nuisance local minima [4]. The third function, Rastrigin F7, is a typical model of a highly multimodal non-linear function. It is a fairly difficult problem for GAs owing to the wide search space and the large number of local minima; it has a complexity of O(n ln(n)), where n is the number of function parameters, and contains millions of local optima in the interval of consideration. Rastrigin F7 has its global minimum at (0, ..., 0), i.e. at one corner of the search space [4]. The fourth function, Griewangk F8, is also a non-linear multimodal function with complexity O(n ln(n)), where n is the number of function parameters. The terms of the summation produce a parabola, while the local optima lie above the parabola level. Because of the product term, enlarging the search range reduces the local minima: the more we
increase the search range, the flatter the function becomes. Generally speaking, this is a difficult but good function for testing GA performance, mainly because the product term makes the variables strongly co-dependent, which is relevant for parallel GA models. Griewangk F8 has its global minimum at (0, ..., 0) [4]. The fifth function, Schwefel F9, is also a non-linear multimodal function. It is somewhat easier than Rastrigin F7 and is characterized by a second-best minimum that lies far away from the global optimum. In this function, the parameter V is the negative of the global minimum, added so as to move the global minimum to zero for convenience; the exact value of V depends on system precision, and for our experiments V = 418.9829101. Schwefel F9 has its global minimum at (420.9687, ..., 420.9687) [4]. Most algorithms have difficulty converging close to the minimum of such functions, especially under high dimensionality (i.e. in a black-box setting where the search algorithm should not assume independence of the dimensions), because the probability of making progress decreases rapidly as the minimum is approached.
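As an illustration of how the Table 1 benchmarks translate into code, here is a small Python sketch of our own (not part of the original experimental setup) of Rastrigin F7 and Griewangk F8 exactly as they are defined above.

import math

def rastrigin_f7(x):
    # F7: 200 + sum(x_i^2 - 10*cos(2*pi*x_i)), 20-dimensional x in [-5.12, 5.12].
    return 200.0 + sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x)

def griewangk_f8(x):
    # F8: 1 + sum(x_i^2 / 4000) - prod(cos(x_i / sqrt(i))), 10-dimensional x in [-600, 600].
    s = sum(xi * xi / 4000.0 for xi in x)
    p = 1.0
    for i, xi in enumerate(x, start=1):
        p *= math.cos(xi / math.sqrt(i))
    return 1.0 + s - p

print(rastrigin_f7([0.0] * 20))   # 0.0 at the global minimum
print(griewangk_f8([0.0] * 10))   # 0.0 at the global minimum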
Parameter Settings
In our proposals, a SGA was run with the standard parameter values of any GA application based on a fixed-length binary string representation. More specifically, the main common parameters are:
• Pseudorandom generator: Uniform Generator.
• Selection mechanism: Tournament Selection.
• Crossover mechanism: 1-Point Crossover.
• Mutation mechanism: Bit-Flip Mutation.
• Replacement models: a) Generational Replacement; b) Elitism Replacement.
• Algorithm ending criterion: the executions stop after the maximum number of generations is reached.
The remaining parameters are shown in table 2, where maxGen is the maximum number of generations before stopping, popSize the population size, vecSize the genotype size, tSize the tournament selection size, pCross the crossover rate, 1-PointRate the 1-point crossover rate, pMut the mutation rate, and pMutPerBit the bit-flip mutation rate (1/vecSize). The values of the parameters specific to the new proposals are roughly similar for each function, with small differences induced by problem complexity. These specific parameter values were determined by repeated trials within fixed interval lengths. In Periodic-GA and Aperiodic-GA, the period and aperiod ([minP : maxP]) values were varied within the interval [25 : 100] with a step of 5. In HomogPop-GA, the value of ε was varied from 0.1 to 5.0 with a step of 0.1. In SteadyGen-GA, the steadyGen value was varied within [5 : 50] with a step of 5. In SM-GA, the startGen value was varied within [100 : 500] with a step of 50. Sufficient tests were performed to set the values of each specific parameter adequately. As discussed, after a large number of tests we found that modifying these parameter values within coherent fixed interval lengths does not greatly affect the final results of each proposal, which simplified our experiments somewhat.
Table 2. Set of Used Parameters

Parameters      Objective Functions
                F2        F6        F7        F8        F9
maxGen          3500      3500      3500      3500      3500
popSize         100       100       100       100       100
vecSize         40        80        200       200       150
tSize           2         2         4         2         2
pCross          0.6       0.6       1.0       0.75      0.6
1-PointRate     1.0       1.0       1.0       1.0       1.0
pMut            1.0       1.0       1.0       1.0       1.0
pMutPerBit      0.025     0.0125    0.0077    0.0035    0.006
period          50        40        25        30        10
[minP : maxP]   [25:75]   [25:70]   [20:50]   [20:70]   [10:20]
ε               5.0       0.1       5.0       2.5       1.0
steadyGen       35        25        5         25        5
startGen        250       500       100       250       250
The best parameter settings among those tested are given in table 2.
3.4 Testing Description and Numerical Observations
Real Numbers and Fitness Computation
The real numbers are represented by binary bit strings of length n · N, where n is the problem dimension and N is the number of bits used to represent each function parameter; N is chosen so as to have sufficient precision on the majority of real numbers included in the specific search space. The first N bits represent the first parameter, the next N bits the second parameter, and so forth. Given a function parameter x represented by N binary bits in SC representation, the real value of x is computed as x = a + ((b − a)/(2^N − 1)) Σ_{i=0}^{N−1} x_i 2^i, where a and b are respectively the minimum and maximum bounds of the search interval. If we write the standard-binary-coded value of a real x as s_{k−1} ... s_1 s_0 and the gray-coded value as g_{k−1} ... g_1 g_0, then the relationships g_i = s_{i+1} ⊕ s_i and s_i = s_{i+1} ⊕ g_i (taking s_k = 0) allow conversion from one representation to the other. In all cases, after computing the real numbers, the fitness value was taken to be the corresponding function value, calculated according to the function expression given in table 1.
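The decoding and conversion rules above can be written down directly; the following Python sketch is ours and only illustrates the formulas, under the assumption that bit i of a genotype carries weight 2^i (so index 0 is the least significant bit).

def decode_sc(bits, a, b):
    # Standard binary coding: bits[i] has weight 2**i; the integer value is
    # mapped linearly onto the real interval [a, b].
    n = len(bits)
    value = sum(bit << i for i, bit in enumerate(bits))
    return a + (b - a) * value / (2 ** n - 1)

def binary_to_gray(s):
    # g_i = s_{i+1} XOR s_i, taking s_k = 0 beyond the most significant bit.
    k = len(s)
    return [(s[i + 1] if i + 1 < k else 0) ^ s[i] for i in range(k)]

def gray_to_binary(g):
    # s_i = s_{i+1} XOR g_i, computed from the most significant bit downwards.
    k = len(g)
    s = [0] * k
    carry = 0                       # plays the role of s_k = 0
    for i in range(k - 1, -1, -1):
        s[i] = carry ^ g[i]
        carry = s[i]
    return s

bits = [1, 0, 1, 1]                                  # the integer 13
assert gray_to_binary(binary_to_gray(bits)) == bits  # round trip is lossless
print(decode_sc(bits, -5.12, 5.12))                  # a point in [-5.12, 5.12]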
Experimental Results
The new algorithms were tested on the objective functions and the experimental results were recorded in order to decide which proposal is the best among them. Table 3 presents statistical results obtained over 200 runs, taken at the last generation (genNb 3500).
Table 3. Experimental Results Proposal
Proposal          F2              F6              F7              F8              F9
                  GNTO     SR%    GNTO     SR%    GNTO     SR%    GNTO     SR%    GNTO     SR%
SGASC             3500+    31     3500+    37     3500+    0      3500+    6      3500+    0
SM-GASC           3500+    30     3500+    48     3500+    2      3500+    11     3500+    0
SGAGC             3500+    99     3500+    43     2957     99     3500+    3      2395     94
SM-GAGC           3256     99     3500+    52     3362     98     3500+    8      2413     97
SM-GA             3139     100    3500+    57     2940     100    3500+    17     2025     100
Periodic-GASG     3500+    86     3500+    51     3191     98     3500+    5      3480     79
Periodic-GAGS     3500+    91     3500+    46     3500+    80     3500+    4      2279     92
Aperiodic-GASG    3500+    90     3500+    51     3500+    93     3500+    3      3009     89
Aperiodic-GAGS    3500+    92     3500+    52     3230     93     3500+    3      2818     90
LocalOpt-GASG     3500+    76     3500+    49     3500+    88     3500+    6      2480     93
LocalOpt-GAGS     3500+    80     3500+    48     3491     95     3500+    5      2480     94
HomogPop-GASG     3500+    31     3500+    41     3500+    0      3500+    2      3500+    0
HomogPop-GAGS     3500+    99     3500+    38     3001     95     3500+    3      2395     94
SteadyGen-GASG    3500+    87     3500+    46     3381     95     3500+    7      2453     86
SteadyGen-GAGS    3500+    88     3500+    51     3254     98     3500+    4      2894     82
Since all problems are minimization problems, the table shows the generation number to optimum (GNTO) and the success rate (SR) after 700000 (200 × 3500) executed generations for each proposal and each function, with the highest score in bold. The GNTO value corresponds to the maximum number of generations needed to reach the optimum over the entire set of runs (3500+ indicating that the optimum was not reached within the 3500 allotted generations in every run). The SR value is the percentage of runs in which the optimal solution is found by the minimum GNTO recorded for that function. For example, the minimum GNTO for function F9 was 2025, recorded by the SM-GA proposal; for the Periodic-GASG proposal the GNTO was 3480, and if we compute the success rate of Periodic-GASG at generation 2025 we obtain the value 79 reported as SR. In table 3, SC denotes an execution with SC, GC an execution with GC, SG means that SC was the starter coding, and GS means that GC was the starter coding.
Student's t-test
The Student's t-test compares the means of two samples and assesses whether they are statistically different from each other. In our experiments, the t-test was used to compare, across all runs, the success rate (SR) and the average best fitness (ABF) results of the different proposals; it helps to judge the difference between their averages relative to the spread, or variability, of their scores. Since table 3 clearly shows the performance of SM-GA with respect to the other proposals, the t-test results were computed by comparing SM-GA with each of the other algorithms. The computed results are displayed in table 4.
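As a pointer to how such a comparison can be carried out, the sketch below is our own illustration: it computes a pooled-variance two-sample Student t statistic for two sets of 200 runs (the per-run success indicators are invented here) and recalls the tabulated critical values cited later in the text.

import math
import statistics

def two_sample_t(sample1, sample2):
    # Pooled-variance Student t statistic for two independent samples.
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))

# Hypothetical per-run success indicators (1 = optimum found) for two proposals,
# 200 runs each, giving 200 + 200 - 2 = 398 degrees of freedom.
runs_a = [1] * 190 + [0] * 10    # 95% success rate
runs_b = [1] * 160 + [0] * 40    # 80% success rate
t = two_sample_t(runs_a, runs_b)
# Tabulated two-sided critical values at 398 df: 1.96 (p = 0.05), 2.58 (p = 0.01), 3.29 (p = 0.001).
print(round(t, 2))               # about 4.65, above all three thresholds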
Table 4. t-test Results: Comparison between SM-GA and other Algorithms

SM-GA compared to   F2            F6            F7            F8            F9
                    SR     ABF    SR     ABF    SR     ABF    SR     ABF    SR     ABF
SGASC               21     10     4.1    5.1    50     25     3.6    11     50     47
SM-GASC             21     8.7    1.9    2.6    98     24     1.8    0.5    50     42
SGAGC               1.5    1.1    2.9    4.5    1.5    1.1    4.8    9.3    3.6    2.5
SM-GAGC             1.5    1.1    1.1    3.1    2.1    1.6    2.8    3.1    2.5    1.5
Periodic-GASG       5.8    5.1    1.3    2.1    2.1    0.3    3.9    3.9    7.3    3.2
Periodic-GAGS       4.5    4.3    2.3    2.7    7.1    6.8    4.4    4.4    4.1    3.9
Aperiodic-GASG      4.8    4.2    1.3    2.1    3.9    3.2    4.8    5.1    4.9    2.2
Aperiodic-GAGS      4.2    1.7    1.1    1.7    3.9    2.7    4.8    3.3    4.8    1.9
LocalOpt-GASG       7.9    4.6    1.7    2.7    5.3    4.8    3.6    3.3    3.9    1.5
LocalOpt-GAGS       7.1    5.1    1.9    2.7    3.3    2.3    3.9    5.8    3.6    2.6
HomogPop-GASG       21     10     3.3    3.4    50     25     5.3    6.1    50     47
HomogPop-GAGS       1.5    1.1    3.9    4.7    3.3    1.1    4.8    5.5    3.6    2.5
SteadyGen-GASG      5.5    4.9    2.3    3.1    3.3    2.5    3.2    4.1    5.8    2.4
SteadyGen-GAGS      5.3    4.7    1.3    1.5    2.1    1.4    4.4    6.1    6.7    2.9
3.5 Discussions
Experiments were performed to search for the optimal proposal for the given set of minimization problems. Finding the single best strategy is not an easy task, since each proposal has particular parameters and specific criteria, so the characteristics and the particular combination of properties represented by any one suggestion do not allow for generalized performance statements. To facilitate an empirical comparison of the performance of each proposal, we measured the progress of the success rate over the generations, which gives a clear view of, and allows a sound judgement on, the efficiency and the evolution of each proposal. For the serial dual coding proposals, table 3 reports reasonably good results in terms of SR. Likewise, the plots on the left side of figure 3 show that each of these proposals enhanced the performance of the SGA for a given problem in relative terms; at the least, they produced results that compare favourably with those of a SGA executed with an unchangeable representation. Where they did not, their performance may have been affected by inexact values attributed to their specific parameters, or by the choice of the initial population, whose effect is sometimes dramatic. Table 3 also shows that SM-GA produced relatively higher results than the other algorithms according to the SR measurement for all tested functions. The experimental data in this table further suggest that, while each proposal can control its parameters accurately, very good performance can be obtained over a fairly wide range of SGA control parameter settings.
[Figure 3 appears here: ten panels, (a1)-(a2) for F2, (b1)-(b2) for F6, (c1)-(c2) for F7, (d1)-(d2) for F8 and (e1)-(e2) for F9, each plotting Success Rate (%) against Generation Number from 0 to 3500. The left panel of each pair compares SGA SC and SGA GC with the serial dual coding proposals; the right panel compares SGA SC, SGA GC, SM-GA SC, SM-GA GC and SM-GA.]
Fig. 3. Success Rate Evolution over Generations: Comparison between Different Proposals
The plots on the right side of figure 3 show, for each function, a comparison between SM-GA and SGA in terms of SR across the generations. In these plots, the SM-GA curves illustrate how the SR increases quickly after a small number of generations, meaning that the SGA performs better and improves its progress during the search for the optimum. In addition, SM-GA shows an advance over SM-GASC and SM-GAGC, which distinctly demonstrates the efficacy of blending and integrating two representations simultaneously. The experimental results were confirmed by the t-test results in table 4. Entering a t-table at 398 degrees of freedom (199 for n1 plus 199 for n2) at a significance level of 95% (p = 0.05) gives a tabulated t-value of 1.96; at the higher level of 99% (p = 0.01) the tabulated t-value is 2.58, and at the highest level of 99.9% (p = 0.001) it is 3.29. The calculated t-test values in table 4 exceed these in most cases, so the difference between the averages of the compared proposals is highly significant. Clearly, SM-GASG produced significantly better results than the other algorithms, because the coexistence of dual chromosomal encoding stimulated the production, multiplication and interchange of new structures concurrently between the synchronized populations, according to the split-and-merge cycle and its ordered functionality.
Genetic algorithms, as discussed, provide a very good conceptual framework for optimization inspired by nature, but the theoretical questions and algorithmic considerations examined in this chapter suggest that a SGA with one fixed coding sometimes fails to converge to the desired solution within a given number of generations (a situation related to GA deception in optimization), because selecting a representation that conflicts with a problem's fitness function can make that problem much harder for a GA to solve. In such a case, a bad choice of individual representation can lead to loose linkage information for an optimization task, whereas the same representation may contribute tight linkage information for another task. In this section we developed serial dual coding strategies in a dynamic manner in order to study the fundamental interaction that arises when alternating between two representations, and we presented a new implementation of GAs, SM-GA, as a new symmetric dual coding proposal. Such algorithms can "choose" and use the suitable representation for a specific optimization problem, which reflects the important role of dual coding for linkage in EAs.
4 Sequential and Parallel Dual Coding Based on CHC Algorithm
Any reasonable optimization problem has some structure. Knowing and using this structure can aid the search for the optimal solution. It is for this reason that the design of EAs has been focused on learning problem structure since the introduction of GAs. Most important in this quest for structure usage is the notion of linkage. In binary representation, linkage means the structural cohesion
of the bits in the coding string with respect to the search space. The linkage information must be properly incorporated into some structure, and the way in which this information is stored and used is very important. Many extensions have been made to the classical GA for different purposes; one of the most involved concerns the linkage concept and leads to variations in the selection-recombination phase during the evolution process. It is also clear that for simple GAs with fixed genetic operators and chromosome representations, one of the essential keys to success is a good coding scheme that puts genes belonging to the same building blocks together on the chromosome, so as to provide tight linkage of building blocks. The representation must be selected in such a way that it gives unambiguous information about the interaction of the decision variables involved in the search for the optimum; likewise, the coding scheme must be able to capture the information on the relationships between those variables. The CHC algorithm is a non-traditional GA which combines a conservative selection strategy, always preserving the best individuals found so far, with a radical (highly disruptive) recombination operator that produces offspring maximally different from both parents (cf. subsection 4.1). The representation of the problem is typically preset by the user before the actual operation of CHC begins. For any function there are multiple representations which make optimization trivial [13]; however, the ensemble of all possible representations is a larger search space than that of the function being optimized, and many researchers have shown that it is very difficult to predetermine the optimal genotype encoding for a test function before running a local search algorithm. In order to forestall a bad choice of coding, one that does not connect to the problem's fitness function, the investigation reported in this section focuses on developing dynamic-representation strategies that fit many optimization problems, with the goals of revising the behaviour of CHC, enhancing its performance, and ultimately refining its solution quality. This section provides two different contributions of dynamic representation applied to a simple CHC for difficult optimization problems. The first contribution is a sequential version of the CHC algorithm, called Steady-State CHC (SS-CHC), developed as an alternation strategy between two representations. The second contribution is a practical methodology proposed as a new variant of the CHC algorithm for function optimization, denoted Split-and-Merge CHC (SM-CHC); SM-CHC is conceived as a parallel implementation of a simple CHC, evolving a dynamic change of individual representation in the context of dual coding concepts.
4.1 The CHC Algorithm
The CHC algorithm is a nonclassical GA that maintains a parent population of size popSize [14]. CHC randomly pairs members of the parent population for reproduction. Once paired, reproduction is only permitted if the Hamming distance between the two parents is greater than some threshold value, resulting in a child population of size childSize. The HUX crossover operator is used which
ensures that each child is of maximal Hamming distance from the parents [14]. From the popSize + childSize individuals, the best popSize individuals are selected to form the parent population for the next generation, so CHC guarantees survival of the best individuals encountered during the search. CHC also uses a re-start mechanism if the parent population remains unchanged for some number of generations: during a re-start, a population containing popSize copies of the best individual is formed, and all but one copy undergo extensive mutation governed by a divergence rate divRate. The CHC stopping criterion is based on the user-defined maximum number of function evaluations maxEval.
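For readers unfamiliar with CHC's components, the following Python sketch is our own simplified illustration (not the reference implementation [14]) of the incest-prevention test and the HUX operator, which exchanges exactly half of the bit positions at which the two parents differ.

import random

def hamming(p1, p2):
    # Number of positions at which the two bit lists differ.
    return sum(a != b for a, b in zip(p1, p2))

def hux(p1, p2):
    # Exchange exactly half of the differing bit positions between the parents.
    diff = [i for i, (a, b) in enumerate(zip(p1, p2)) if a != b]
    to_swap = set(random.sample(diff, len(diff) // 2))
    c1, c2 = list(p1), list(p2)
    for i in to_swap:
        c1[i], c2[i] = p2[i], p1[i]
    return c1, c2

def maybe_mate(p1, p2, threshold):
    # Incest prevention: recombine only if the parents are far enough apart.
    if hamming(p1, p2) > threshold:
        return hux(p1, p2)
    return None

random.seed(1)
a = [random.randint(0, 1) for _ in range(20)]
b = [random.randint(0, 1) for _ in range(20)]
print(maybe_mate(a, b, threshold=5))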
4.2 Dynamic Representation Models
In this chapter the solutions are represented by binary strings of fixed length, and we again use SC and GC. The essential task is therefore to discover the best serial and parallel strategies for combining SC and GC in order to improve CHC performance.
Serialization Methodology
In this subsection we describe the basic behaviour of the Steady-State CHC (SS-CHC) and explain how we have serialized it as a sequence of consecutive CHCs based on the bivalent genotype encoding idea.
SS-CHC Algorithm
The main idea of the SS-CHC algorithm is to execute a simple CHC with a given representation for one generation and then to compare the best fitness value with that of the preceding generation. If the best fitness value has not changed for a given number of generations steadyGen, a phase called "steady state", the individuals of the whole population are converted to the other representation. The alternation cycle continues until a given maximum number of function evaluations maxEval is reached (cf. algo 1). The aim of alternating between representations is to enhance the fitness capacity during the search; alternation also maintains some degree of diversity among the individuals, which helps in exploring the search space and then in finding the desired results quickly and smoothly. Figure 5 (a) shows an example run illustrating the alternation cycle for a given initial population tested on the Schwefel Double Sum function P5 (cf. subsection 4.3). As can be seen, there was a sufficient number of alternations between SC and GC, which accelerated the convergence speed to the global optimum. In this particular run there were 6479 function evaluations with SC distributed over 6 phases and 9344 evaluations with GC distributed over 7 phases, for a total of 15823 evaluations. The steadyGen parameter is not sensitive and does not require fine-tuning for a given problem. The main schema for SS-CHC is shown in figure 4 (a).
(a) SS-CHC Schema
(b) SM-CHC Schema
Fig. 4. Algorithms Schema
(a) Run Example of SS-CHC
(b) Run Example of SM-CHC
Fig. 5. Algorithms Run Example
Parallelization Methodology
In this subsection we study the methodology of the Split-and-Merge CHC (SM-CHC), a synchronous strategy formulated as a new variant of the CHC algorithm, and we describe how we have parallelized it as a set of concurrent CHCs exploiting dual chromosomal encoding.
SM-CHC Algorithm
In theory as well as in practice, agents (units, or sub-populations) are excellent metaphors for devising a distributed environment, because they are able to encapsulate intelligence and tasks in a modular way while taking the other agents in the environment into account. Furthermore, agents need to communicate in an environment in which naming and roles are resolved, in order to allow
them to co-operate, co-ordinate and be controlled to a certain extent. The new technique, denoted SM-CHC, is inspired by these agent characteristics and is developed as a dynamic encoding model that may help to bring some order into the disturbed situation caused by the influence of representations on the performance of GAs.
SM-CHC Functioning
The SM-CHC algorithm is based on the role of double agents (dual coding), and its functioning can be summarized as follows. At first, the technique randomly generates an initial population (the first agent). This basic population is then split into two sub-populations (units), each given a distinct representation, and two synchronous CHCs are executed with these two units. In this way CHC benefits from the two representations at the same time, since this parallel genotype codification describes proactivity appearing on two levels and evolution occurring on two scales simultaneously. After each generation a steady-state test is performed. If at least one of the two units reaches its corresponding steady state for a given number of generations steadyGen, the collaboration between agents helps to support and preserve the fitness productivity landscape during the search, so the two coexistent units are merged into one unit expressed in the best coding representation in order to gather and assemble all the information acquired; the best coding is selected as that of the population with the lowest average fitness. At this level individual structures propagate within the population, and the data transfers caused by selecting better solutions and applying recombination operators to them make it possible for promising building blocks to be reproduced and combined. Next, a simple CHC runs with the integrated population, an organized agent acting regularly to guide the search towards the solution of the problem. This whole unit operates until it drifts into a steady state, probably caused by the existence of one or more local optima, which momentarily reveals its inability to make the individuals evolve further. In that case the entire agent is re-split into two sub-agents, a simple idea motivated by the fact that the newly created agents each have sufficient autonomy to reshape and invert their unvarying pattern; in this way, one of the two smaller populations may get the opportunity to escape the local optimum, survive, and recover an accurate direction for exploring the search space. The split-and-merge cycle then continues until a maximum number of function evaluations maxEval is reached (cf. algo 2). Figure 5 (b) shows an example run illustrating the multi-agent cycle for a given initial population tested on the Schwefel Double Sum function P5 (cf. subsection 4.3). As can be seen, there was a sufficient number of split-and-merge phases between SC and GC, which intensified the process of locating the global optimum during the search. More specifically, there were 10500 function evaluations with two segmented sub-populations separated over 6 phases and 5400 evaluations with one unified population separated over 5 phases; on the other side, there were 6950
evaluations with SC spread across 8 regions and 8950 evaluations with GC spread across 9 regions, for a total of 15900 evaluations. The SM-CHC parameter (steadyGen) is not sensitive and does not demand expensive tuning when moving from one test function to another. The schema representing the whole SM-CHC process is shown in figure 4 (b).
SS-CHC and SM-CHC Pseudo-Codes
To simplify the SS-CHC and SM-CHC pseudo-codes, several procedures are used. For given populations pop, pop1 and pop2, given representations coding, coding1 and coding2, and given numbers steadyGen and maxEval, these procedures can be summarized as follows:
• Generate Initial Population(): randomly generates an initial population.
• Run 1 CHC(pop, coding): executes a simple CHC for one generation with pop having coding as its representation.
• Is Steady Generation(pop, steadyGen): a boolean procedure that returns true if the best fitness value of pop has not changed for steadyGen generations, and false otherwise.
• Alternate Coding(coding1, coding2): switches the problem encoding between coding1 and coding2 and returns the coding corresponding to the last altered representation.
• Convert Population(pop, coding): converts the representation of the individuals of pop to coding.
• Is MaxEval(maxEval): a boolean procedure that returns true if the algorithm has been executed for maxEval evaluations in total, and false otherwise.
• Split(pop, pop1, pop2): takes pop and divides it into two sub-populations pop1 and pop2.
• Select Best Coding(pop1, coding1, pop2, coding2): computes the fitness averages of pop1 and pop2 and returns the coding corresponding to the population that has the lowest average fitness.
• Merge(pop1, pop2, pop): takes pop1 and pop2 and blends them into pop.
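Two of these helpers, Alternate Coding and Is Steady Generation, are easy to express in a few lines of Python; the sketch below is our own illustration and assumes that the best fitness value of the current generation is passed in after every generation.

def alternate_coding(coding):
    # Switch the problem encoding between the two codings.
    return "GC" if coding == "SC" else "SC"

class SteadyStateTracker:
    # Reports a steady state once the best fitness has failed to improve
    # for steady_gen consecutive generations (minimization assumed).
    def __init__(self, steady_gen):
        self.steady_gen = steady_gen
        self.best = float("inf")
        self.stalled = 0

    def update(self, best_fitness):
        if best_fitness < self.best:
            self.best = best_fitness
            self.stalled = 0
        else:
            self.stalled += 1
        return self.stalled >= self.steady_gen

tracker = SteadyStateTracker(steady_gen=3)
for f in [5.0, 4.0, 4.0, 4.0, 4.0]:
    stuck = tracker.update(f)
print(stuck, alternate_coding("SC"))   # True GC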
4.3 The Environment
Problems
A set of nine unconstrained real-valued benchmark functions was used to investigate the effect of dual coding concepts on the proposed techniques; their expressions are given in table 5. All of them are minimization problems with different degrees of complexity. Likewise, we analyzed the results of minimization experiments on a set of three real-world problems in order to better sustain our claims. The real-valued and real-world problems are described in the following subsections.
Algorithm 1. SS-CHC
steadyGen ← steadyGeneration
coding ← starterCoding
pop ← Generate Initial Population()
repeat
    repeat
        Run 1 CHC(pop, coding)
    until Is Steady Generation(pop, steadyGen)
    coding ← Alternate Coding(coding1, coding2)
    Convert Population(pop, coding)
until Is MaxEval(maxEval)
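The control flow of Algorithm 1 can be mirrored directly in Python. The driver below is our own sketch: the procedures run_one_chc, convert_population, is_steady and is_max_eval are assumptions standing in for the real CHC operations, and the toy stubs exist only to show that the alternation loop terminates.

def ss_chc(pop, run_one_chc, convert_population, is_steady, is_max_eval, coding="SC"):
    # Run a CHC until a steady state, then switch the coding of the whole
    # population and continue, until the evaluation budget is exhausted.
    while not is_max_eval():
        while not is_steady(pop) and not is_max_eval():
            pop = run_one_chc(pop, coding)
        coding = "GC" if coding == "SC" else "SC"
        pop = convert_population(pop, coding)
    return pop

# Toy demo with stub procedures, just to exercise the control flow.
state = {"evals": 0, "stalled": 0}
def run_one_chc(pop, coding):
    state["evals"] += len(pop)
    state["stalled"] += 1
    return pop
def is_steady(pop):
    return state["stalled"] >= 4          # pretend: no improvement for 4 generations
def convert_population(pop, coding):
    state["stalled"] = 0                  # a fresh coding gets a fresh chance
    return pop
def is_max_eval():
    return state["evals"] >= 200

print(len(ss_chc([[0, 1]] * 10, run_one_chc, convert_population, is_steady, is_max_eval)))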
Algorithm 2. SM-CHC
steadyGen ← steadyGenerations
pop ← Generate Initial Population()
repeat
    Split(pop, pop1, pop2)
    Convert Population(pop1, coding1)
    Convert Population(pop2, coding2)
    repeat
        Run 1 CHC(pop1, coding1)
        Run 1 CHC(pop2, coding2)
    until Is Steady Generation(pop1, steadyGen) or Is Steady Generation(pop2, steadyGen)
    bestCoding ← Select Best Coding(pop1, coding1, pop2, coding2)
    Convert Population(pop1, bestCoding)
    Convert Population(pop2, bestCoding)
    Merge(pop1, pop2, pop)
    repeat
        Run 1 CHC(pop, bestCoding)
    until Is Steady Generation(pop, steadyGen)
until Is MaxEval(maxEval)
Test Functions
We have considered nine classical and familiar test functions, summarized below:
• Ackley P1 [21]
This is a multimodal function with a high number of local optima and is scalable in the problem dimension. P1 has a minimal function value of 0 and its global minimum is located at (0, ..., 0) (cf. figure 6 (a)).
• Bohachevsky P2
This is a multimodal function and is separable, in that the global optimum can be located by optimizing each variable independently. P2 has a minimal function value of 0 and its global minimum is located at (0, ..., 0) (cf. figure 6 (b)).
Table 5. Test Problems
Name   Expression                                                                                    Range                 Dimension
P1     p1(x) = −20 exp(−0.2 √((1/n) Σ_{i=1}^{n} x_i²)) − exp((1/n) Σ_{i=1}^{n} cos(2π x_i)) + 20 + e  [−32.678 : 32.678]    10
P2     p2(x) = Σ_{i=1}^{n−1} (x_i² + 2x_{i+1}² − 0.3 cos(3π x_i) − 0.4 cos(4π x_{i+1}) + 0.7)         [−15.0 : 15.0]        10
P3     p3(x) = 1 / (0.002 + Σ_{j=1}^{25} 1 / (j + Σ_{i=1}^{n} (x_i − a_ij)⁶))                         [−65.536 : 65.536]    2
P4     p4(x) = Σ_{i=1}^{n−1} (100(x_{i+1} − x_i²)² + (x_i − 1)²)                                      [−2.048 : 2.048]      4
P5     p5(x) = Σ_{i=1}^{n} (Σ_{j=1}^{i} x_j)²                                                         [−500.0 : 500.0]      12
P6     p6(x) = x1 sin(x1) + 1.7 x2 sin(x1) − 1.5 x3 − 0.1 x4 cos(x4 + x5 − x1) + 0.2 x5² − x2 − 1     [−100.0 : 100.0]      5
P7     p7(x) = (x1 x2 x3 x4 x5) / (x6 x7 x8 x9 x10)                                                   [1.0 : 10.0]          10
P8     p8(x) = Σ_{i=1}^{n} i x_i⁴                                                                     [−1.28 : 1.28]        10
P9     p9(x) = Σ_{i=1}^{n} (x_i − 1)²                                                                 [−5.0 : 5.0]          10
• Foxholes P3 [15]
This is an example of a function with many local optima; many standard optimization algorithms get stuck in the first peak they find. P3 has a minimal function value of 0.998004 (cf. figure 6 (c)).
• Rosenbrock's Valley P4 [15]
Also known as the Banana function, this is a classical optimization problem. The global optimum lies inside a long, narrow, parabolic-shaped flat valley; finding the valley is trivial, but converging to the optimal solution is difficult, and hence this problem has repeatedly been used to assess the performance of optimization algorithms. P4 has a minimal function value of 0 and its global minimum is located at (1, ..., 1) (cf. figure 6 (d)).
• Schwefel Double Sum P5
This is a twofold summation function and is identified as a unimodal test problem. P5 has a minimal function value of 0 and its global minimum is located at (0, ..., 0) (cf. figure 6 (e)).
• Five Variable P6
This is an example of a multimodal objective function that is known to be hard to optimize.
• Ten Variable P7
The minimum is obviously attained when the first five variables equal 1 and the last five variables equal 10; the function value in that case is 1e−05. This function is trivial for us, but the computer sees no shortcuts and searches for the minimum blindly.
• Quartic P8 [15]
This is a simple unimodal function without noise. P8 has a minimal function value of 0 and its global minimum is located at (0, ..., 0) (cf. figure 6 (f)).
• Sphere P9
This is an example of a continuous, strongly convex, unimodal function. It serves as a test case for convergence velocity and is well known and widely used in all fields of GAs. P9 has a minimal function value of 0 and its global minimum is located at (1, ..., 1) (cf. figure 6 (g)).
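For reference, a small Python sketch of our own (not part of the original experimental code) of two of these benchmarks, P1 (Ackley) and P8 (quartic without noise), as they are commonly defined:

import math

def ackley_p1(x):
    # P1: -20*exp(-0.2*sqrt(mean(x_i^2))) - exp(mean(cos(2*pi*x_i))) + 20 + e
    n = len(x)
    term1 = -20.0 * math.exp(-0.2 * math.sqrt(sum(xi * xi for xi in x) / n))
    term2 = -math.exp(sum(math.cos(2.0 * math.pi * xi) for xi in x) / n)
    return term1 + term2 + 20.0 + math.e

def quartic_p8(x):
    # P8: sum(i * x_i^4), i = 1..n (quartic function without noise)
    return sum(i * xi ** 4 for i, xi in enumerate(x, start=1))

print(round(ackley_p1([0.0] * 10), 12))  # 0.0 at the global minimum
print(quartic_p8([0.0] * 10))            # 0.0 at the global minimum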
All these functions are known to be hard problems for GAs owing to the large number of local minima and the wide search space. Most algorithms have difficulty converging close to the minimum of such functions, especially under high dimensionality (i.e. in a black-box setting where the search algorithm should not assume independence of the dimensions), because the probability of making progress decreases rapidly as the minimum is approached.
Real-World Problems
We have chosen the following three real-world problems: Systems of Linear Equations P10 [19], Frequency Modulation Sounds Parameter Identification P11 [20], and Telecommunication Company Allocating P12 (a Steiner-type problem). They are described as follows:
• Systems of Linear Equations Problem P10
The problem may be stated as solving for the elements of a vector X, given the matrix A and the vector B in the expression AX = B. The evaluation function used for these experiments is
P10: p10(x) = Σ_{i=1}^{n} | Σ_{j=1}^{n} (a_ij · x_j) − b_i |.
(a) Ackley P1   (b) Bohachevsky P2   (c) Foxholes P3   (d) Rosenbrock P4   (e) Schwefel P5   (f) Quartic P8   (g) Sphere P9   (h) Matrices of Linear Equations Problem Instance P10
Fig. 6. Functions Graphical Representation
Clearly, the best value for this objective function is p10(x*) = 0. Furthermore, the range for the parameters is [−9.0 : 11.0]. Inter-parameter linkage (i.e. non-linearity) is easily controlled in systems of linear equations; their
non-linearity does not deteriorate as the number of parameters increases, and such systems have proven to be quite difficult. We considered a ten-parameter problem instance; its matrices are shown in figure 6 (h).
• Frequency Modulation Sounds Parameter Identification Problem P11
The problem is to specify six parameters a1, ω1, a2, ω2, a3, ω3 of the frequency modulation sound model represented by
y(t) = a1 sin(ω1 t θ + a2 sin(ω2 t θ + a3 sin(ω3 t θ)))
with θ = 2π/100. The fitness function is defined as the sum of squared errors between the evolved data and the model data:
P11: p11(a1, ω1, a2, ω2, a3, ω3) = Σ_{t=0}^{100} (y(t) − y0(t))²,
where the model data are given by
y0(t) = 1.0 sin(5.0 t θ − 1.5 sin(4.8 t θ + 2.0 sin(4.9 t θ))).
Each parameter is in the range [−6.4 : 6.35]. This is a highly complex multimodal problem with strong epistasis, whose minimum value is p11(x*) = 0.
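A direct Python transcription of this fitness function follows; it is our own illustration, and the ordering of the parameter vector as (a1, ω1, a2, ω2, a3, ω3) is an assumption of the sketch.

import math

THETA = 2.0 * math.pi / 100.0

def fm_wave(t, a1, w1, a2, w2, a3, w3):
    # Nested frequency-modulated waveform y(t).
    return a1 * math.sin(w1 * t * THETA + a2 * math.sin(w2 * t * THETA + a3 * math.sin(w3 * t * THETA)))

def p11(params):
    # Sum over t = 0..100 of the squared error between the evolved waveform
    # and the target waveform y0(t).
    a1, w1, a2, w2, a3, w3 = params
    target = (1.0, 5.0, -1.5, 4.8, 2.0, 4.9)
    return sum((fm_wave(t, a1, w1, a2, w2, a3, w3) - fm_wave(t, *target)) ** 2
               for t in range(101))

print(p11((1.0, 5.0, -1.5, 4.8, 2.0, 4.9)))      # 0.0 at the known optimum
print(p11((0.0, 0.0, 0.0, 0.0, 0.0, 0.0)) > 0)   # any mismatch scores worse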
• Telecommunication Company Allocating Problem P12
In a certain area there are n villages, and the j-th village Vj requires W(j) phone lines. The cost of a line is 1$ per km. Where should a telecommunication company locate a (unique) station to cover the demand at minimum cost? The problem can be formulated as follows: there are n points V1, ..., Vn on the 2D plane, and the goal is to minimize the weighted sum W1 · |X − V1| + ... + Wn · |X − Vn| of the Euclidean distances from the design point X to the given points Vi. The evaluation function used for these experiments is therefore
P12: p12(X) = Σ_{i=1}^{n} Wi · |X − Vi|,
where X is a two-dimensional vector to be optimized, the Wi are user-set non-negative weights, and the range for the parameters is [−71235.87651 : 71235.87651].
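The corresponding evaluation is just a weighted sum of Euclidean distances. A possible Python sketch (ours, with made-up village coordinates and weights, and requiring Python 3.8+ for math.dist) is:

import math

def p12(x, villages, weights):
    # Sum of W_i * |X - V_i| over all villages, where X = (x1, x2).
    return sum(w * math.dist(x, v) for v, w in zip(villages, weights))

villages = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]   # hypothetical V_i
weights = [5.0, 2.0, 1.0]                            # hypothetical W_i (phone lines)
print(p12((1.0, 1.0), villages, weights))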
General Parameter Values
Simple CHC, SS-CHC and SM-CHC were run with the standard parameter values recommended by Eshelman [14]. SS-CHC and SM-CHC additionally require a value for only one parameter (steadyGen), the number of generations with no improvement of the best fitness value. The steadyGen values used here were chosen as a result of prior experimentation, which showed that moderate and consistent values are preferable; the chosen values are roughly similar for each function, with small differences induced by problem complexity. More specifically, the main common parameters are:
• Pseudorandom generator: Uniform Generator.
• Algorithm ending criterion: the executions stop when the global optimum is found or when the maximum number of function evaluations is reached.
The remaining parameters are shown in table 6, where maxEval is the maximum number of function evaluations before stopping, popSize the population size, vecSize the genotype size, divRate the divergence rate, and steadyGen the steady-state number of generations.
Table 6. General Parameter Values

Parameter   P1     P2     P3     P4     P5     P6     P7     P8     P9     P10    P11    P12
maxEval     500K   500K   500K   500K   500K   500K   500K   500K   500K   500K   500K   500K
popSize     100    100    100    100    100    100    100    100    100    100    50     100
vecSize     100    100    20     40     120    50     100    100    100    50     60     200
divRate     0.35   0.35   0.35   0.35   0.35   0.35   0.35   0.35   0.35   0.35   0.22   0.35
steadyGen   50     50     25     35     50     30     35     50     50     75     50     25
4.4 The Results
Numerical Observations
Results were obtained for the two proposed algorithms, SS-CHC and SM-CHC, and for the simple CHC on multi-dimensional versions of the objective problems, averaged over 200 test runs for each algorithm/problem pair. All test runs were limited to a maximum of 500000 evaluations. At the end of each run, the global best fitness value and the consumed number of evaluations were recorded. We evaluated the algorithms by measuring their percent solved (the percentage of runs in which the algorithm succeeded in finding the global optimum) and their average evaluations (the average number of function evaluations required to find the global optimum in those runs where it was found). Table 7 lists the results on each problem, with the highest score in bold, for two versions of CHC, executed with SC as representation (CHCSC) and with GC as representation (CHCGC), two versions of SS-CHC, started with SC as initial coding (SS-CHCSG) and with GC as initial coding (SS-CHCGS), and one version of SM-CHC operated with SC and GC as two mixed representations (SM-CHCSG).
Student's t-test
In these experiments the t-test is used to compare the percent solved results when two algorithms have different percent solved values, or the average evaluations results when two methods have the same percent solved value but different average evaluations values. The t-test helps to judge the difference between the data averages of different algorithms relative to the spread, or variability, of their scores. Since table 7 clearly shows the performance of SS-CHC and SM-CHC with respect to the simple CHC, the t-test results were computed by comparing CHC with SS-CHC on the one hand, and CHC with SM-CHC on the other; for this purpose, the best recorded values were selected for each pair of compared algorithms. The computed results are displayed in table 8.
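The two measurements can be computed from raw run records in a few lines of Python; the sketch below is our own, with invented run data.

def summarize(runs):
    # runs: list of (solved, evaluations) pairs, one per independent run.
    # Returns (percent solved, average evaluations over the successful runs).
    solved = [evals for ok, evals in runs if ok]
    percent = 100.0 * len(solved) / len(runs)
    avg_evals = sum(solved) / len(solved) if solved else None
    return percent, avg_evals

runs = [(True, 5200), (True, 4800), (False, 500000), (True, 6100)]
print(summarize(runs))   # (75.0, about 5366.7)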
Table 7. Empirical Results

Problem  Measurement           CHCSC     CHCGC     SS-CHCSG   SS-CHCGS   SM-CHCSG
P1       Percent Solved        100%      93%       100%       100%       100%
         Average Evaluations   15974     4911      5363       5401       5356
P2       Percent Solved        100%      91.5%     100%       100%       100%
         Average Evaluations   16095     5376      5855       5996       5757
P3       Percent Solved        100%      100%      100%       100%       100%
         Average Evaluations   4709      1328      1304       1329       1344
P4       Percent Solved        99%       29.5%     100%       100%       100%
         Average Evaluations   159557    323429    140693     148869     144229
P5       Percent Solved        100%      80%       100%       100%       100%
         Average Evaluations   18548     6021      7421       7793       7688
P6       Percent Solved        47%       100%      100%       100%       100%
         Average Evaluations   13283     6476      6428       6473       6423
P7       Percent Solved        100%      100%      100%       100%       100%
         Average Evaluations   3721      5328      3700       3719       3648
P8       Percent Solved        100%      72%       100%       100%       100%
         Average Evaluations   19409     6554      9309       8975       9507
P9       Percent Solved        100%      100%      100%       100%       100%
         Average Evaluations   15780     5340      5200       5275       5300
P10      Percent Solved        35.5%     34%       47%        45%        44%
         Average Evaluations   245990    222003    202000     203500     204796
P11      Percent Solved        1.5%      28%       36%        35%        33%
         Average Evaluations   305092    328669    278853     290812     315788
P12      Percent Solved        13%       53.5%     96.5%      97.5%      97.5%
         Average Evaluations   5262      3268      3591       3578       3553
Table 8. t-test Results: Comparison between simple CHC and other Algorithms

Compared Algorithms   P1      P2      P3      P4     P5      P6     P7      P8      P9     P10    P11     P12
CHC vs SS-CHC         30.61   28.35   1.138   2.14   24.48   0.69   1.426   20.86   1.99   2.35   1.737   12.23
CHC vs SM-CHC         30.67   28.83   −0.71   1.71   22.71   0.72   4.95    18.2    0.57   1.74   1.13    12.23
4.5 Interpretation of the Results
We studied the performance of SS-CHC and SM-CHC relative to the simple CHC on the data set described above. The results are shown in part in table 7. This table focuses on the percent solved values and on the average number of evaluations needed to reach the global optimum, as this is an appropriate measure of the quality of the optimization process itself.
[Figure 7 appears here: two panels for the Quartic function P8, plotting Percent Solved (%) against Evaluation Number from 5000 to 35000. Panel (a) compares CHC-SC, CHC-GC and SS-CHC-GC; panel (b) compares CHC-SC, CHC-GC and SM-CHC.]
Fig. 7. Comparison of Percent Solved Progress over Number of Evaluations
The results presented in this table describe improvements in several aspects of the performance of the CHC algorithm, although they also raise some constructive issues regarding the functioning of the new proposals. For most test problems, the number of function evaluations required to find the correct solution was reduced by a factor of about 3 through the use of appropriate parameter settings. Likewise, we found that the percent solved improved for most of the problems on which the performance of CHC had degraded. These performance improvements are the result of applying dynamic representation in a resurgent mechanism that combines and manipulates the two most popular genotype encodings, standard binary coding and gray coding. Moreover, the results in table 7 show that SS-CHCSG outperforms the other algorithms on several problems, in particular SS-CHCGS, which indicates that starting the execution of SS-CHC with SC as the initial representation and then alternating to GC upon reaching a steady state is a good strategy, and supports the observation that SC frequently locates the global optimum. Figures 7 (a) and (b) show the progress of percent solved over the number of evaluations, for the Quartic function P8, for the simple CHC, SS-CHC and SM-CHC; the plots confirm that SS-CHC and SM-CHC reach a good solution, if not the best, very quickly compared with the simple CHC. The experimental results are confirmed by the t-test results in table 8: entering a t-table at 398 degrees of freedom (200 − 1 for n1 plus 200 − 1 for n2) at a significance level of 95% (p = 0.05) gives a tabulated t-value of 1.96, and the calculated t-test values in table 8 exceed this in most cases, so the difference between the averages of the compared proposals is clearly significant. SS-CHC and SM-CHC produce significantly better results than the simple CHC because the coexistence of dual chromosomal encoding stimulates the production, multiplication and interchange of new structures concurrently between the sequential and synchronized populations, according to the coding alternation, the split-and-merge cycle and their ordered functionality.
4.6 Discussions
Genetic algorithms are good at getting close to the right solution, but they do not tend to make the fine adjustments necessary to move from a near-optimal to an optimal solution; other algorithms, such as CHC, can be used to make these fine adjustments. The main drawback of CHC is that it takes more function evaluations, and therefore more generations, to attain the optimum, because it slows the pace of convergence in order to maintain diversity and to delay or avoid premature convergence. In this section we studied the possibility of enhancing the performance of CHC by making SC and GC interact with each other to transform the binary parameter representation of the problem without aggravating its difficulty, since both SC and GC can produce all possible representations and both offer several advantages. We first formulated a serial dual coding strategy, denoted SS-CHC, in a dynamic manner, to study the fundamental interaction that arises when alternating between two representations. Likewise, we proposed a practical implementation of CHC, the SM-CHC technique, as a new parallel dual coding strategy; here we tried to improve CHC convergence by operating two codings simultaneously in two units of work, so as to exploit the majority of the possible representations offered by the two codifications. Table 7 shows some interesting observations from these experiments. As can be seen, convergence speed is one of the main indicators of the advance of SS-CHC and SM-CHC over the simple CHC. The performance of SS-CHC and SM-CHC is encouraging: they reach the feasible area reasonably fast and consistently and produce relatively good results. This section also reached the conclusion that these new methods are more beneficial when the problem is difficult, complex and multimodal. In conclusion, SS-CHC and SM-CHC appear to be very useful techniques for solving difficult optimization problems, and good alternatives in cases where other techniques fail.
5 Conclusion
In this chapter two kinds of strategies were proposed, sequential and parallel, both belonging to the class of methods that change the representation of the solutions within the algorithm. These proposals are based on the concept of dual coding, which supplies a sort of dynamic change, a "conversation", between two coding schemes within a single classic GA. For the sequential strategies, the alternation between SC and GC offered a smooth and gentle communication between the two representations, allowing each coding scheme to "help" the other and replace it when the latter could no longer make progress towards better results during the search for the optimum; at this level, the chronological succession of codings was perhaps helpful in finding new promising solutions. For the parallel strategies, the "split and merge" phases, triggered after a steady state that is probably caused by the fact that the building blocks are still
not found and processed correctly, served to construct new distributions that are immediately used to produce possibly better individuals; at this stage, the exchangeability of the codings perhaps assisted in discovering the building blocks properly and advantageously. As a whole, the results are encouraging, and verification on other test problems is desirable. Using dual coding in optimization may be a good way to enhance the linkage learning domain or, vice versa, the linkage information may contribute to improving dual coding techniques. Therefore, understanding the bond and resemblance between the (natural) biological system and the (artificial) genetic and evolutionary algorithm is potentially helpful for appreciating the role and importance of representation selection and linkage learning. Linkage learning in GAs is the identification of building blocks to be conserved under crossover. This chapter dealt with a very strict definition based on the dual coding framework and was not intended to detect and study the linkage information between decision variables; we focused only on enhancing the performance of a classic GA by combining two or more coding schemes simultaneously in one standard GA. More research is needed to extend our strategies with one or more linkage learning procedures; in addition, the linkage information needs to be processed in order to identify the building blocks. Further research is also needed in applying our proposals to hard optimization problems. Certainly we will have to integrate techniques of linkage information processing, as well as to adapt our proposals, in order to exploit the structure of the problem and to detect and employ the real function of the building blocks; this approach seems to lead towards the future design of qualified and competent GAs. Based on this study, the following topics require further investigation. In this chapter only SC and GC were considered; however, other kinds of coding schemes, such as tree or linear representations from genetic programming, and indeed any number of coding schemes, could be applied within the sequential and parallel strategies in order to benefit from the representation best adapted to a specific problem. Further work is required to understand the essential role of representation in the GA process, and a proper knowledge of the specifics of each coding and of its interactions with the genetic operators is genuinely needed; we intend to explore such questions as we extend this work. Future research will also help towards a better and deeper understanding of the nature of the performance enhancements, which will serve to fully comprehend the dynamics and functioning of the new methods conceived within the dual coding framework. Future work could show that even better results might be obtained by using more advanced EAs such as Evolution Strategies (ES), the Covariance Matrix Adaptation ES algorithm (CMA-ES) or the messy GA; we can also explore different conditions for changing and exchanging codings in the sequential and parallel strategies. In this chapter the new proposals were executed with populations homogeneous in coding; further research should explore the incorporation of populations heterogeneous in coding, from whose diversity EAs could widely benefit.
Finally, these measurements leave us with valuable insights concerning the utility of combining various coding types for individual representation, in collaboration, within one SGA. In order to improve our algorithms, we need a deeper understanding of what GAs are really processing as they operate, which can help to enhance their optimal performance and provide us with greater insight into the evolutionary process of GAs.
References
1. Holland, J.H.: Adaptation in Natural and Artificial Systems. MIT Press, Cambridge (1975)
2. Rawlins, G.J.E. (ed.): Foundations of Genetic Algorithms - 1. Morgan Kaufmann Publishers, San Mateo (1991)
3. Rothlauf, F., Goldberg, D.E.: Representations for Genetic and Evolutionary Algorithms. Springer, New York (2002)
4. Digalakis, J.G., Margaritis, K.G.: An Experimental Study of Benchmarking Functions for Genetic Algorithms (2002)
5. Crutchfield, J.P., Schuster, P.: Evolutionary Dynamics, Exploring the Interplay of Selection, Accident, Neutrality and Function. Oxford University Press, New York (2003)
6. Caruana, R., Schaffer, J.D.: Representation and Hidden Bias: Gray vs. Binary Coding for Genetic Algorithms. In: Proceedings of the Fifth International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco (1988)
7. Mathias, E., Whitley, D.: Transforming the Search Space with Gray Coding. In: Proceedings of the 1994 International Conference on Evolutionary Computation (1994)
8. Whitley, D.: A Free Lunch Proof for Gray versus Binary Encodings. In: Proceedings of the Genetic and Evolutionary Computation Conference. Morgan Kaufmann Publishers, Orlando (1999)
9. Whitley, D., Rana, S., Heckendorn, R.B.: Representation Issues in Neighborhood Search and Evolutionary Algorithms. In: Genetic Algorithms in Engineering and Computer Science (1997)
10. Barbulescu, L., Watson, J.-P., Whitley, D.: Dynamic Representations and Escaping Local Optima: Improving Genetic Algorithms and Local Search. In: AAAI/IAAI (2000)
11. Toussaint, M.: Compact Representations as a Search Strategy: Compression EDAs. Elsevier Science Publishers Ltd, Essex (2006)
12. Rothlauf, F., Goldberg, D.E., Heinzl, A.: Network Random Keys: a Tree Representations Scheme for Genetic and Evolutionary Algorithms. MIT Press, Cambridge (2002)
13. Liepins, G., Vose, M.: Representations Issues in Genetic Algorithms. Experimental and Theoretical Artificial Intelligence Journal (1990)
14. Eshelman, L.J.: The CHC Adaptive Search Algorithm: How to Have Safe Search when Engaging in Non-Traditional Genetic Recombination. In: Foundations of Genetic Algorithms - 1. Morgan Kaufmann, San Francisco (1991)
15. De Jong, K.A.: An Analysis of the Behavior of a Class of Genetic Adaptive Systems, Ph.D. dissertation, University of Michigan (1975)
16. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)
17. Türkmen, B.S., Turan, O.: An Application Study of Multi-Agent Systems in Multicriterion Ship Design Optimisation. In: Proceedings of the Third International EuroConference on Computer and IT Applications in the Maritime Industries (COMPIT 2004), Siguenza, Madrid (2004)
18. Whitley, D., Rana, S.: Representation, Search and Genetic Algorithms. In: Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI 1997). AAAI Press/MIT Press (1997)
19. Eshelman, L.J., Mathias, K.E., Schaffer, J.D.: Convergence Controlled Variation. In: Foundations of Genetic Algorithms - 4. Morgan Kaufmann, San Francisco (1997)
20. Tsutsui, S., Fujimoto, Y.: Forking Genetic Algorithm with Blocking and Shrinking Modes. In: Proceedings of the Fifth International Conference on Genetic Algorithms. Morgan Kaufmann, San Francisco (1993)
21. Ackley, D.: An Empirical Study of Bit Vector Function Optimization. In: Genetic Algorithms and Simulated Annealing (1987)
22. Goldberg, D.E.: Genetic Algorithms and Walsh Functions: Part I, a Gentle Introduction. Complex Systems (1989)
23. Goldberg, D.E.: Genetic Algorithms and Walsh Functions: Part II, Deception and its Analysis. Complex Systems (1989)
24. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Publishing Co, Reading (1989)
25. Baluja, S.: Population-based Incremental Learning: A Method for Integrating Genetic Search Based Function Optimization and Competitive Learning, Tech. Report No. CMU-CS-94-163, Pittsburgh, PA: Carnegie Mellon University (1994)
26. Baluja, S., Davies, S.: Using Optimal Dependency-trees for Combinatorial Optimization: Learning the Structure of the Search Space. In: Proceedings of the Fourteenth International Conference on Machine Learning (1997)
27. Muhlenbein, H., Paass, G.: From Recombination of Genes to the Estimation of Distributions I. Binary Parameters. In: Proceedings of the Fourth International Conference on Parallel Problem Solving from Nature (1996)
28. Muhlenbein, H.: The Equation for Response to Selection and its Use for Prediction. Evolutionary Computation Journal (1997)
29. Muhlenbein, H., Mahnig, T.: Convergence Theory and Applications of the Factorized Distribution Algorithm. Computing and Information Technology Journal (1999)
30. Muhlenbein, H., Mahnig, T.: A Scalable Evolutionary Algorithm for the Optimization of Additively Decomposed Functions. Evolutionary Computation Journal (1999)
31. Muhlenbein, H., Mahnig, T., Ochoa, A.: Schemata, Distributions and Graphical Models in Evolutionary Optimization. Heuristics Journal (1999)
32. Kargupta, H.: The Gene Expression Messy Genetic Algorithm. In: Proceedings of IEEE International Conference on Evolutionary Computation (1996)
33. Bandyopadhyay, S., Kargupta, H., Wang, G.: Revisiting the GEMGA: Scalable Evolutionary Optimization through Linkage Learning. In: Proceedings of IEEE International Conference on Evolutionary Computation (1998)
34. Bosman, P.A.N., Thierens, D.: Linkage Information Processing in Distribution Estimation Algorithms. In: Proceedings of Genetic and Evolutionary Computation Conference (GECCO 1999) (1999)
284
M. Bercachi et al.
35. Harik, G.: Learning Gene Linkage to Efficiently Solve Problems of Bounded Difficulty Using Genetic Algorithms, Ph.D. dissertation, University of Michigan (1997) 36. Harik, G.: Linkage Learning via Probabilistic Modeling in the ECGA, IlliGAL Report No. 99010 (1999) 37. Chen, Y.-p., Yu, T.-L., Sastry, K., Goldberg, D.E.: A Survey of Linkage Learning Techniques in Genetic and Evolutionary Algorithms, IlliGAL Report No. 2007014 (2007) 38. Heckendorn, R.B., Wright, A.H.: Efficient Linkage Discovery by Limited Probing. In: Proceedings of Genetic and Evolutionary Computation Conference (GECCO 2003) (2003) 39. Munetomo, M., Goldberg, D.E.: Identifying Linkage Groups by Non-linearity/Nonmonotonicity Detection. In: Proceedings of Genetic and Evolutionary Computation Conference (GECCO 1999) (1999) 40. Munetomo, M., Goldberg, D.E.: Linkage Identification by Non-monotonicity Detection for Overlapping Functions. Evolutionary Computation Journal (1999) 41. Pelikan, M., Goldberg, D.E., Cantu-Paz, E.: Linkage Problem, Distribution Estimation, and Bayesian Networks. Evolutionary Computation Journal (2000) 42. Bercachi, M., Collard, P., Clergue, M., Verel, S.: Evolving Dynamic Change and Exchange of Genotype Encoding in Genetic Algorithms for Difficult Optimization Problems. In: Proceedings of IEEE International Congress on Evolutionary Computation CEC 2007 (2007) 43. Collard, P., Aurand, J.-P.: DGA: an efficient Genetic Algorithm. In: Proceedings of ECAI 1994: 11th European Conference on Artificial Intelligence (1994)
Symbiotic Evolution to Avoid Linkage Problem
Ramin Halavati and Saeed Bagheri Shouraki
Sharif University of Technology, Iran
In this chapter, we introduce the Symbiotic Evolutionary Algorithm (SEA) as a template for search and optimization based on partially specified chromosomes and the symbiotic combination operator. We show that, in contrast to genetic algorithms with traditional recombination operators, this template is not bound by linkage problems. We present three implementations of this template: first, as a pure algorithm for search and optimization; second, as an artificial immune system; and third, as an algorithm for classifier rule base evolution, and compare the results and features of these implementations with those of similar algorithms.
1 Introduction
The recombination operator in Genetic Algorithms (GAs) is supposed to extract the component characteristics from two parent chromosomes and reassemble them in different combinations, hopefully to produce an offspring that has a higher fitness value. However, this can only work if it is possible to identify which parts of each parent chromosome should be extracted so that the combination of these parts will produce a high-fitness offspring (Watson & Pollack, 2000). Another important problem regarding chromosome structure is that of hitch-hiker or garbage genes: bad genes that coexist with good genes in a generally good chromosome (Forrest & Mitchell, 1993a). These two problems are usually referred to by a common term, the linkage problem, and they have resulted in the development of a wide range of solutions that try to resolve them from different points of view. In this chapter, we first introduce the problem in more detail and make a quick survey of existing solutions and their pros and cons, and then move on to a new approach based on the natural process of symbiogenesis and group selection. In this approach, chromosomes are partially specified, i.e., they have some locations with missing values, and evaluation and replication are done on groups of chromosomes, instead of single individuals. The major advantages of this approach are the replacement of the recombination operator with the symbiotic combination operator, and hence not needing any prior knowledge for specific chromosome or operator design, and decreasing the effect of garbage genes through using partially specified chromosomes and having no strict binding between the different genes that make up a solution. This idea will be described in detail in Section 3 as a template algorithm and will be implemented by three different algorithms later in this chapter: In Section 4, a pure implementation of the idea for optimization will be presented and we will show how this algorithm can solve deceptive and combinatorial optimization problems much more efficiently when compared to rival algorithms; Section 5 will use this idea
for rule base generation and benchmark it on some classification problems with fuzzy If-Then rules; and Section 6 will exploit this idea in an artificial immune system algorithm and show how it can boost the performance of this algorithm on function and combinatorial optimization problems. Finally, Section 7 summarizes these experiences and presents the concluding remarks.
2 Linkage Problem and Current Solutions
Crossover in standard genetic algorithms (SGAs) (Holland, 1975) takes subsets of genes that are adjacent on the chromosome and recombines these subsets with those of another chromosome. This is done by cutting both chromosomes at a certain location and attaching the halves to the complementary ones of the other chromosome. Therefore, the algorithm designer must have prior knowledge about the problem structure and building blocks, so that the related parts can be put together in a way that the building blocks are not harmed during cutting and recombination. Early attempts to overcome this need for prior knowledge resulted in the design of more complicated crossover operators, like the ones with more cut points, random cut point positioning, uniform crossover, linear combination of genes, etc. (see (Mitchell, 1999) for an extensive list); regardless of the fact that some of these remedies totally neglect the idea behind schemata and building blocks, prior domain knowledge is still required for the selection of an appropriate operator. Another problem of the standard chromosome structure is the fact that once a chromosome gets enough credibility, all of its genes are reproduced regardless of their role in the good fitness. Therefore, a chromosome with high fitness may include genes that are not good sub-solutions and that only decrease the fitness value; but they still stay alive and spread because they are stuck to good genes. These genes are called garbage genes or hitch-hiker genes (Forrest & Mitchell, 1993a) and their spread through generations can have two negative effects: first, evolution speed is decreased as bad sub-solutions are promoted and spread, and processing power is consumed for search around these poor quality answers. Second, they can decrease the fitness of other chromosomes through further recombinations and result in the removal of good solutions from the gene pool. Many researchers have focused on these two weaknesses of SGAs and several approaches have been introduced. Three basic ideas lie behind the categories of these solutions. The first idea is to use partially specified chromosomes (PSCs), such as in Messy Genetic Algorithms (mGA) (Deb, 1991), (Goldberg et al, 1989), Cooperative Co-Evolutionary Algorithms (CCEA) (Potter & De Jong, 1994), the Symbiotic Evolutionary Adaptation Model (SEAM) (Watson & Pollack, 2000), and the Incremental Commitment Genetic Algorithm (ICGA) (Watson & Pollack, 1999). In these approaches, the chromosomes have missing values for some locations and the cooperation of several chromosomes (based on the algorithm's strategy) composes a solution. In mGA, this cooperation is minimal and each PSC is evaluated in the context of a template and is compared with other chromosomes which are similar enough to it. In CCEA, the genome is split into several sub-genomes by the algorithm designer and each pool is evolved in separation from the other pools, improving the content of one of these sub-genomes. The cooperation between pools takes place during evaluations whereby a
partial solution from one pool is concatenated to the best of the other pools for evaluation. In SEAM and ICGA, PSCs are evaluated in the context of other chromosomes. In this approach, a context is a combination of some members of the chromosome pool that fully specify all chromosome locations together. To compute the fitness of a chromosome, all unspecified positions of the chromosome are filled with the respective values from the context and then the fitness value is computed. The second idea is to use chromosome re-ordering operators and repositioning of genes inside the chromosome on the fly, such as the Inversion operator (Bagley, 1967) and the Linkage Learning Genetic Algorithm (Harik, 1997). In such algorithms, each gene has a location indicator along with its value and both of these parameters change and propagate during evolution. Thus, gene values and locations are optimized together and the algorithm is supposed to re-arrange gene locations so that those which have related effects on the phenotype move together. The third way to deal with linkage and building block problems is to estimate the distribution of good genes, such as the Estimation of Distribution Algorithm (Larrañaga & Lozano, 2002), the Population-Based Incremental Learning Algorithm (Baluja, 1994), the Compact Genetic Algorithm (Harik et al, 1998), the Extended Compact Genetic Algorithm (Sastry & Goldberg, 2000), the Factorized Distribution Algorithm (Mühlenbein & Mahnig, 1999), the Bayesian Optimization Algorithm (Pelikan et al, 1999), and the Hierarchical Bayesian Optimization Algorithm (Pelikan et al, 2003). In contrast to purely evolutionary approaches that put their direct focus on the search for good solutions, the algorithms in this group try to estimate the distribution of good genes and construct good solutions based on these estimations. Thus, although the target of this family is quite the same as that of the other evolutionary approaches, there is a fundamental difference in how this target is achieved and, due to this difference, these approaches require a pre-selected distribution model and the process estimates the parameters of this distribution to find good solutions. Each of these solutions has its own pros and cons and yet none is considered an ultimate remedy for the linkage problem. Some of the approaches require other sources of domain knowledge, like the estimation of the distribution of good genes in all algorithms of the third approach (EDA, PBIL, cGA, …) and appropriate gene groups in CCEA; some need excessive computation power for extensive search, as in mGA (Kargupta, 1995); some suffer from premature convergence of solutions before finding good rearrangements, as in LLGA (Newman, 2006), (Pelikan et al, 1999); and some are usable only for very specific purposes, such as SEAM; see (Halavati et al, 2007b) for more details.
3 Symbiotic Evolutionary Algorithm
3.1 Symbiotic Combination Operator
The natural process of symbiogenesis (Merezhkovsky, 1909) is the creation of new species from the genetic integration of organisms, called symbionts. Symbiogenesis has enabled some of the major transitions in evolution (Maynard Smith & Szathmary, 1995), including the origin of eukaryotes which include all plants and animals. This kind of genetic integration is quite different from the transfer of genetic information
in sexual reproduction. Sexual recombination occurs between similar organisms (i.e. of the same species) and involves the exchange of parts of the genome in a mutually exclusive manner so that every gene acquired from one parent is a gene that cannot be acquired from the other parent. In contrast, symbiotic combination may also occur between genetically unrelated organisms (i.e. different species) and involve the integration of whole genomes. The resultant composite may have all the genes from one symbiont and at the same time acquire any number of genes from the other symbiont (Watson & Pollack, 2000). Based on this idea, the symbiotic combination operator was introduced (Watson & Pollack, 1999), (Watson & Pollack, 2000) as an alternative to the sexual recombination operator. This operator takes two PSCs and makes an offspring with the sum of their characteristics; see Figure 1 for an example. Therefore, in contrast to the standard crossover operator that is applied to fully specified chromosomes, symbiotic combination runs over partially specified representations and advances them towards fully specified ones. In the original introduction of this operator ((Watson & Pollack, 1999) and (Watson & Pollack, 2000)), when two chromosomes with conflicting genes were to be merged (i.e. they both had values for one or more specific locations and the values contradicted), all these conflicts were resolved in favour of the first donor. In this text, we do not need this assumption and we simply do not combine two chromosomes that have conflicts.

Chromosome A:  1--1---0
Chromosome B:  --00-111
A + B:         1-01-110
Fig. 1. An example of symbiotic combination. Chromosomes A and B each have some unspecified locations, shown with the '-' mark. Their combination has specified values for all locations that are specified in at least one of the donors. If there is a conflict between the specified values, like in the last gene of the above chromosomes, all conflicts are resolved in favor of one donor, in this case, A.
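A minimal sketch of this operator, assuming partially specified chromosomes are stored as Python dictionaries that map specified locus indices to bit values (the representation and the function names are illustrative, not the authors' implementation):

def has_conflict(a, b):
    # Two partially specified chromosomes conflict if they specify
    # different values for the same locus.
    return any(locus in b and b[locus] != value for locus, value in a.items())

def symbiotic_combination(a, b):
    # Merge two non-conflicting partially specified chromosomes into an
    # offspring that carries every locus specified by either donor.
    if has_conflict(a, b):
        return None  # following this chapter's policy of not combining conflicting donors
    offspring = dict(a)
    offspring.update(b)
    return offspring

# Example (no conflicts): {0: 1, 3: 1} combined with {2: 0, 5: 1, 6: 1}
# yields {0: 1, 2: 0, 3: 1, 5: 1, 6: 1}.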
3.2 General Idea of Symbiotic Evolutionary Algorithm
The main idea behind using symbiotic combination in evolutionary algorithms is to use PSCs, evolve them using traditional selection and mutation operators, and sometimes combine them using the symbiotic combination operator to create chromosomes with more specified locations. One requirement of such a process is the evaluation of PSCs during the selection phase. In some specific problems, this is done using a direct evaluation function that can evaluate solutions with missing values; but when this is not available, the most common approaches are to evaluate a partially specified chromosome in the context of other members of the pool, such as in SEAM and ICGA, or to use a template for missing values, as in mGA. In the first approach, a context is a combination of some members of the pool that fully specify all chromosome locations. To compute the fitness of a chromosome in a context, all unspecified positions of the chromosome are filled with the respective values from the context and then the fitness value is computed. A major critique of this approach is the detachment between the context and the chromosome. When the combination of
several chromosomes (the context) and the chromosome under evaluation results in a good fitness value, it means that the entire group has made a good cooperation and their being together results in a good outcome. So it would be good if the entire composition got a higher chance of re-emergence and survival, but in this class of algorithms, only the single individual under evaluation gets the reward; see (Halavati et al, 2007b) for more details on this problem. The local template solution of mGA also has this problem, as selection is done at the individual level. Also, the process is limited to a local search based on the template that is used for evaluations and, although the template evolves through time, all evaluations are centered around its value in each generation. To overcome these problems, we propose the evaluation of chromosomes in the context of other chromosomes, but the reproduction and selection must also be done at the context level instead of the individual level. If some chromosomes that are grouped for evaluation gain a high fitness value, all of them get a higher survival and reproduction chance and not just one of them. To implement this idea, we will use the term assembly henceforth for a set of non-conflicting chromosomes that fully specify all locations; i.e., each location is specified by one and only one member of the assembly. Each member of an assembly will be called a symbiont. This way, whenever we need selection or evaluation, we can generate an assembly, evaluate it, and if it gains enough credit for reproduction, reproduce its entire content. But after replication (and possible modifications) it can break again into its symbionts. Based on this idea, Figure 2 presents a comparative diagram of a general genetic algorithm (GA) and a general symbiotic evolutionary algorithm (SEA). In the initialization phase of both algorithms, some random chromosomes are created. The only difference is that GA creates fully specified chromosomes and SEA creates partially specified ones. Thus, SEA chromosomes have some locations with unspecified values. In the evaluation phase, where GA evaluates the existing chromosomes, SEA creates some assemblies of chromosomes and evaluates the assemblies. Once evaluation is
Fig. 2. Comparative diagram of GA and SEA
done, both algorithms have the selection phase, which is done at the chromosome level in GA and at the assembly level in SEA. The modification phase has two sub-components: mutation is the same for both algorithms, but where GA performs recombination, SEA performs symbiotic combination. During symbiotic combination, some of the symbionts in each selected assembly are combined and form single individuals. Once this is done, these symbiotically combined symbionts are merged and they stay together in further steps. After the modification phase, both algorithms perform the population update. In GA, some or all of the previous generation chromosomes are replaced by the results of the modification phase, but in SEA, all selected assemblies are broken into their composing symbionts (note that those that are symbiotically combined stay together) and all of them are added to the population. Also, to decrease the chance of re-emergence of bad assemblies, some symbionts of bad assemblies are taken away from the pool. There are some considerations behind the recommended process:
3.2.1 Avoiding Premature Convergence
Creating schemata with more specified bits from the best assemblies in each iteration may result in a very fast creation of fully specified chromosomes and premature convergence of the search process. To prevent this, we limit the size of chromosomes that are created during the process to a value that gradually increases while the process goes on. This value will allow the creation of only single-gene chromosomes at the beginning of the process and gradually reaches fully specified chromosomes. This will be called gradually cooling the process, and the term is chosen to represent a force that prevents the emergence of big chromosomes at the beginning of the process, when the pool is hot, and gradually cools the pool based on a predefined schedule, so that bigger chromosomes can emerge. In all of our implementations, this parameter is simply computed by a linear equation based on the generation number, but more complicated functions can be used if the problem space is known better.
3.2.2 Avoiding Local Optima
The SEA algorithm creates schemata of the best assembly in each iteration by combining some symbionts of the best assembly. This is done to promote solutions similar to the best assembly, but if the best assembly is a local maximum, there would be a good chance that exactly the same assembly will be chosen again as the best assembly of later steps, because all symbionts of this solution still exist in the pool and some combinations of them are also added. In this case, the local maximum will fill up the chromosome pool very fast with more and more copies of its subcomponents and this increases the possibility of its later re-creation and may cause the search to get stuck there. By maintaining a list of the last best assemblies and avoiding them during later assembly creations, a local maximum cannot repeat itself and this prohibition makes the algorithm search around the local maximum instead of repeating the exact assembly. This approach is quite similar to the idea of Tabu Search (Battiti & Tecchiolli, 1994), (De Werra & Hertz, 1989), (Glover, 1989), (Glover, 1990), in which a visited state is banned from being revisited for a fixed number of iterations.
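A small sketch of these two mechanisms, assuming a linear cooling schedule and a fixed-length tabu queue (the function names and the exact linear form are illustrative assumptions, not the authors' exact settings):

from collections import deque

def max_chromosome_size(generation, cooling_rate, genome_length):
    # Linear cooling: allow only very small chromosomes at the beginning and
    # gradually raise the limit until fully specified chromosomes are allowed.
    return min(genome_length, 1 + int(generation * cooling_rate))

tabu_list = deque(maxlen=100)  # keeps only the most recent best assemblies

def is_tabu(assembly_values):
    # An assembly is tabu if it specifies exactly the same values as one of
    # the recently selected best assemblies.
    return assembly_values in tabu_list

def record_best(assembly_values):
    # Ban the current best assembly from being re-created for a while.
    tabu_list.append(assembly_values)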
3.2.3 Exploiting Partial Evaluation Function
If a partial evaluation function exists, SEA can use it during the assembly generation phase, by picking the first member of the assembly randomly and then choosing the next members using a tournament selection approach. If the problem does not permit partial chromosome evaluation, this part can be replaced with a random selection of members for assembly build-up.
3.2.4 Pruning Non-Fitting Chromosomes
Besides promoting and replicating good chromosomes, SEA prohibits and decreases the re-emergence of bad assemblies. To do so, removal of some chromosomes of assemblies with low fitness values is a good solution, but this must be done considering the fact that some of these chromosomes may also be in assemblies with high fitness values and they must be preserved. So, if a chromosome is part of both high fitness assemblies and low fitness assemblies, it must not be removed.
3.3 Feature Comparison of SEA with Some Other Algorithms
As stated before, there are generally three methods to deal with linkage or building block problems. One method is the family of estimation of gene distribution algorithms, such as EDA (Larrañaga & Lozano, 2002), PBIL (Baluja, 1994), cGA (Harik et al, 1998), ECGA (Sastry & Goldberg, 2000), FDA (Mühlenbein & Mahnig, 1999), BOA (Pelikan et al, 1999), and hBOA (Pelikan et al, 2003). As SEA does not belong to this category, its features cannot be easily compared with these algorithms, because the cited algorithms require a previously selected distribution model for good genes while SEA does not need one; but in case such a model exists, SEA cannot make use of it. In Table 1, we have compared SEA with some major algorithms of the partial chromosome specification and chromosome reordering schools.
Table 1. Feature Comparison of SEA and some other algorithms
[The table compares SGA, mGA, SEAM, ICGA, LLGA, CCEA, and SEA on the following features: Fully Specified Chromosomes?, Fixed Chromosome Locations?, Sensitive to Genome Order?, Carries Garbage?, Conflict Resolution Method?, Diversity Preservation Method?, Requires Size Penalty Function?, Requires Domain Knowledge?, Uses Substructures?, and Substructures Evolve?]
As stated there, SEA does not use fully specified chromosomes, so it does not have the garbage genes problem; it does not work under the bias of any conflict resolution method, as it does not combine conflicting chromosomes; it is not under the influence of a size penalty function, as it lets the chromosomes grow only if they prove to have high fitness values in cooperation with each other; it requires no prior domain knowledge or partial evaluation function, but it can make use of both if available; it uses substructures and can gradually evolve and reform them, and it is not bound by any specific definition of substructures.
4 Implementation of SEA for Optimization
This section presents an implementation of the SEA idea for optimization. The detailed diagram is presented in Figures 3 and 4. Implementation details are described in subsection 4.1. Benchmark problems and algorithms are presented in subsection 4.2. Next, subsection 4.3 shows the results.
4.1 Implementation Details
Figures 3 and 4 represent the implementation of SEA for the optimization task. Figure 3 is the main body of the algorithm and Figure 4 shows the assembly generation function, which is in charge of creating assemblies when required. The assembly generation function starts with a random chromosome and keeps adding chromosomes to the assembly with two restrictions: the added parts should not have any conflicts with previous members and some extra specified bits must be added to the assembly. Two alternatives are shown in the diagram: when a partial evaluation function does not exist, new members are randomly selected from the candidate members, and when there is such a function, the best of each group of candidates is selected and added. The main body of the algorithm has some minor differences from that of Figure 2. First, as stated in subsection 3.2.1, a size control mechanism is added using a MaxSize variable that is initialized to 2-bit chromosomes at the beginning and is updated at the end of each iteration. Also, as described in subsection 3.2.2, a tabu list and tabu control mechanism are added so that once an assembly is selected as the best of one iteration, its re-creation in some further steps is prohibited by putting it in a queue of tabu answers. Note that assemblies are checked for being tabu in the final stage of the assembly generation function and if an assembly with similar values for all locations of the generated assembly is found in the tabu list, the newly generated assembly is discarded. As represented in steps 3, 5, and 6 of the Figure 3 diagram, only the best assembly of each generation is selected for mutation and symbiotic combination. In the mutation step (5), with a certain probability, a mutated copy of each symbiont of the best assembly is created and added to the pool, and in the symbiotic combination step (6), combinations of each pair of symbionts of the best assembly are created and added to the pool. At the end of step 6, a test for the size of each combined symbiont is performed; if the size of a combined symbiont exceeds a maximum size threshold, it is randomly broken into some fragments of smaller sizes. After replicating the best assembly, we prohibit the re-emergence of the worst assemblies in step 7, as already mentioned in subsection 3.2.4. We separate the generated assemblies into two sets. The winners list includes all symbionts of all assemblies that stand in the highest 25% of ranks based on their fitness values and the losers list is made
up of all symbionts of the worst 25% of assemblies. Then, all symbionts of all members of the losers set are removed from the population, except the ones which are also members of the winners set. Being in both sets is quite possible and frequent, as the assembly generation phase may use a chromosome in several assemblies with different fitness values. Using the above approach, we only remove chromosomes which have not been able to take part in any good assembly. After going through all the steps stated above, if the replication phase has created any duplicate chromosome in the population, the extra copies are removed in step 8 and if the population exceeds a pre-specified threshold, some chromosomes are randomly selected and removed from the pool.
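A sketch of the winners/losers pruning of step 7, assuming chromosomes are hashable (e.g. stored as tuples of (locus, value) pairs) and that each generated assembly has already been evaluated (names are illustrative):

def prune_population(population, evaluated_assemblies):
    # evaluated_assemblies: list of (assembly, fitness) pairs, where an
    # assembly is a list of chromosomes.
    ranked = sorted(evaluated_assemblies, key=lambda pair: pair[1], reverse=True)
    quarter = max(1, len(ranked) // 4)
    winners = {chrom for assembly, _ in ranked[:quarter] for chrom in assembly}
    losers = {chrom for assembly, _ in ranked[-quarter:] for chrom in assembly}
    # Remove only chromosomes that never took part in any good assembly.
    removable = losers - winners
    return [chrom for chrom in population if chrom not in removable]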
Fig. 3. Diagram of SEA algorithm for optimization. The parameters are AC for Assemblies Count, MR for Mutation Rate, and MP for Maximum Population threshold.
Fig. 4. Diagram of Assembly Generation Function. The parameters are SR for Selection Rate and Population Size for the number of chromosomes in the pool.
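A condensed sketch of the assembly generation function of Figure 4, assuming chromosomes are dictionaries of specified loci and that an optional partial evaluation function may be supplied; the helper names and the tournament size are placeholders for the SR parameter named in the figure:

import random

def generate_assembly(pool, genome_length, partial_eval=None, tournament_size=6):
    # Start from one random chromosome and keep adding non-conflicting
    # chromosomes that specify new loci, until every locus is specified.
    first = random.choice(pool)
    assembly, specified = [first], dict(first)
    while len(specified) < genome_length:
        candidates = [c for c in pool
                      if all(specified.get(l, v) == v for l, v in c.items())  # no conflicts
                      and any(l not in specified for l in c)]                 # adds new loci
        if not candidates:
            return None  # the assembly cannot be completed from this pool
        if partial_eval is None:
            chosen = random.choice(candidates)
        else:
            # Tournament selection: take the best of a small random group.
            group = random.sample(candidates, min(tournament_size, len(candidates)))
            chosen = max(group, key=lambda c: partial_eval({**specified, **c}))
        assembly.append(chosen)
        specified.update(chosen)
    return assembly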
4.2 Benchmark Problems
We used three benchmark problem sets. The first one is the Hierarchical If and Only If (HIFF) function (Watson & Pollack, 1999) with fully deceptive behaviour. The function takes an N-bit input and computes the fitness as stated in equation (1).

F(B) = \begin{cases} 1, & \text{if } |B| = 1 \\ |B| + F(B_L) + F(B_R), & \text{if } |B| > 1 \text{ and } (\forall i,\ b_i = 0 \text{ or } \forall i,\ b_i = 1) \\ F(B_L) + F(B_R), & \text{otherwise} \end{cases} \qquad (1)
B_L and B_R are respectively the left and the right halves of the bit string B.
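A direct transcription of equation (1), assuming B is given as a Python list of bits whose length is a power of two (an illustrative sketch, not the authors' code):

def hiff(B):
    # Hierarchical If-and-Only-If fitness (equation (1)).
    if len(B) == 1:
        return 1
    left, right = B[:len(B) // 2], B[len(B) // 2:]
    # A block contributes its own length only when all of its bits agree.
    bonus = len(B) if all(b == 0 for b in B) or all(b == 1 for b in B) else 0
    return bonus + hiff(left) + hiff(right)

# hiff([1, 1, 1, 1]) == 12: four leaves contribute 1 each, the two agreeing
# 2-bit blocks contribute 2 each, and the agreeing 4-bit block contributes 4.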
The second benchmark is the concatenation of multiple 8-Queen problems (M8Q) (Eiben et al, 1995). In each instance, M separate problems of putting 8 queens on an 8×8 chessboard must be solved, so that no two queens on a board can attack each other. The chromosome includes the rows of the queens, and the columns are all assumed distinct and fixed. The third benchmark is the KN-Trap function (Kargupta, 1995). The chromosomes are concatenations of K sections of N bits. For each N-bit section, the fitness is computed as stated in equation (2) and the summation of all fitness values is assigned to the whole chromosome. The Trap function has a deceptive behaviour, with the global maximum for an all-zero bit string and a negative gradient towards this point. The chromosomes in all problems are shuffled (Watson & Pollack, 1990), so that adjacency data may not be used by any of the algorithms.

F(B) = \begin{cases} \mathrm{SizeOf}(B), & \text{if } \mathrm{Ones}(B) = 0 \\ \mathrm{Ones}(B) - 1, & \text{otherwise} \end{cases} \qquad (2)

where Ones(B) equals the number of bits in B with value 1.
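Under the same list-of-bits representation, equation (2) and the KN-Trap concatenation can be sketched as follows (illustrative names):

def trap_section(B):
    # Fitness of one N-bit section (equation (2)): the global optimum is the
    # all-zero string, with a deceptive gradient towards the all-ones string.
    ones = sum(B)
    return len(B) if ones == 0 else ones - 1

def kn_trap(bits, n):
    # Sum the section fitnesses over consecutive sections of n bits each.
    return sum(trap_section(bits[i:i + n]) for i in range(0, len(bits), n))

# kn_trap([0, 0, 0, 0, 1, 1, 1, 1], 4) == 7: one optimal section (4) plus one
# fully deceptive section (3).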
We implemented SEAM and SGA to compare with SEA. SEAM is selected as it is the most similar algorithm to SEA and SGA is chosen as it is the basic algorithm of this family. To adjust the appropriate parameters for each algorithm on each instance of the benchmark problems, we used a hill climbing (HC) (Russell & Norvig, 2002) algorithm with 100 steps and 20 random restarts. The best parameter settings, found by HC, for each algorithm/problem instance are specified in Tables 2-4.
Table 2. Parameters of benchmark problems for the SGA algorithm
[Population Size, Mutation Rate, Cross Over Rate, Elitism Percentage, and Selection Method (RW for Roulette Wheel / TS for Tournament Selection) for each of HIFF-5, HIFF-6, HIFF-7, 1x4 Trap, 4x4 Trap, 2x8 Trap, 1x8 Queens, 5x8 Queens, 10x8 Queens, and 15x8 Queens.]
Table 3. Parameters of benchmark problems for the SEA algorithm
[Number of Assemblies, Cooling Rate x 1000, Selection Rate, Mutation Rate, and Number of Tabus for each of HIFF-5, HIFF-6, HIFF-7, 1x4 Trap, 4x4 Trap, 2x8 Trap, 1x8 Queens, 5x8 Queens, 10x8 Queens, and 15x8 Queens.]
Table 4. Parameters of benchmark problems for the SEAM algorithm
[Number of Contexts for each of HIFF-5, HIFF-6, HIFF-7, 1x4 Trap, 4x4 Trap, 2x8 Trap, 1x8 Queens, 5x8 Queens, 10x8 Queens, and 15x8 Queens. N/A stands for Not Available; cells with no value belong to problems for which we could not find parameters that solve even one instance of that problem.]
4.3 Experimental Results of SEA
To compare the performance of the algorithms, each instance of the benchmark problems was solved with each algorithm in 30 independent runs. Each algorithm was given a maximum number of allowed fitness function calls based on problem complexity. These maximum values were chosen based on our initial experiments and were the same for all algorithms. The cooling function of SEA is implemented as a linear function, by dividing the iteration number by the CoolingRate parameter. The success rates (the number of times that each algorithm could find the optimum value (Forrest & Mitchell, 1993b)) are shown in Figure 5 and the average number of fitness computations for cases in which each algorithm has been able to solve the problem is depicted in the Figure 6 diagrams. As depicted in Figure 5 (Top Left), SEA was able to solve all instances of the HIFF problem up to 128-bit size while SEAM and SGA gradually fail after the 64-bit limit. The same is presented in Figure 5 bottom for the M8Q problem, where the SEA algorithm solves all instances of the problem up to 15 boards while SEAM fails after the 3-board limit and SGA fails after the 10-board limit. In Figure 5 top right, the success rates of the KN-Trap functions are depicted, where SEA again holds its superiority against SGA and SEAM. Figure 6 presents the computational time comparison between the three algorithms. In the HIFF problems, the higher success rate of SEA came at the cost of more fitness function calls in comparison to the two other algorithms. In KN-Trap, SEA has been faster than SGA but slower than SEAM, and in M8Q, SEA was faster than SGA while SEAM totally failed.
Fig. 5. Success rates of SGA, SEAM, and SEA algorithms for benchmarked problems. Horizontal Axis: Problem Sizes, Vertical Axis: Success Rates.
Fig. 6. Performance Comparison of SGA, SEAM, and SEA on benchmarked problems. Vertical Axis: Number of fitness function calls; Horizontal Axis: Problem Size.
5 SEA for Rule Base Generation
Genetic Algorithms are a widely used approach in predictive data mining where data mining output can be represented by If-Then rules and the process of discovering the
best rules is done by an evolutionary process. To mention some of these algorithms, a genetic weighted fuzzy rule-base is used in (Teng et al, 2004), (Hasanzade et al, 2004), (Chen & Linkens, 2004), (Cordon et al, 1998), where the parameters of membership functions, including the position and shape of the fuzzy rule sets, and the weights of rules are evolved using a genetic algorithm; (Gomez & Dasgupta, 2002) train a rule base in the form of a binary tree using an evolutionary process with a sophisticated crossover operator; (Mendes et al, 2001) use a co-evolutionary system for discovering fuzzy classification rules, where an evolutionary programming algorithm evolves fuzzy rules and a simple evolutionary algorithm evolves membership functions; in (Ishibuchi & Yamamoto, 2004), (de la Iglesia et al, 2003), and (Lopez et al, 1999), an evolutionary multiobjective algorithm evolves some fuzzy if-then rules, one objective being accuracy and the other rule base size; in (Ishibuchi & Yamamoto, 2002) and (Tsang et al, 2005), the evolutionary process is broken into two phases, candidate rule generation and final rule selection; the interested reader may see (Freitas, 2001) for a good survey and (Zhu & Guan, 2004), (Gopalan et al, 2006), (Gundo et al, 2004), (Riquelme et al, 2003), and (Eggermont et al, 2003) for some other variations of this idea. As stated in (Freitas, 2001), the main motivation for using genetic algorithms in the discovery of high-level prediction rules is that they perform a global search in the problem space and cope better with attribute interaction when compared to the greedy rule induction algorithms often used in data mining. However, the linkage problem is a very important problem that must be taken care of for this task. Each parent rule set may have a set of cooperating rules that can well classify a subspace of the entire problem space, and breaking them into two or more parts may result in the breakage of their classification integrity (the crossover design problem). Also, when the rule set is restricted to have a specific size from the beginning, good rules and bad rules are stuck together and the good/bad classification of each of them affects the ranking of the other one (the hitch-hiker genes problem). Based on these needs, in this section we will introduce the usage of the previously introduced SEA algorithm for rule base generation and will compare it with a standard GA on some benchmark classification problems. Both algorithms are used to evolve fuzzy rule bases and the fuzzy membership functions are defined exactly the same. We will use the abbreviation SEA-R for the customized version of the SEA algorithm for rule base generation.
5.1 Symbiotic Evolutionary Algorithm for Rule Base Generation (SEA-R)
We used (Hasanzade et al, 2003) as our baseline system, a rather recent report on developing a fuzzy rule base using a pure GA that is quite general and not restricted to its own rule base structure or fuzzy representations. In this approach, each rule is a Horn clause, with the If-part consisting of fuzzy membership functions for different features of the problem database, and the Then-part stating the class to which this rule belongs. A rule set is composed of one or more rules, with each rule having a weight value stating its role in the final decision. To classify an input by a rule set, each of the rules computes the degree of similarity between the input and its own If-part and, based on that, it states a degree of belief in its Then-part.
Then, a weighted sum of the degrees of belief for each class is computed and the class which gets the highest value is chosen. Figure 7 specifies the structure of the rule set, and more information about the decision making procedure can be found in (Hasanzade et al, 2003) and (Hasanzade et al, 2004).
<RULE SET>            → a set of <RULE>s
<RULE>                → <WEIGHT> + a set of <CONDITION>s + <CLASS>
<WEIGHT>              → a real value
<CONDITION>           → a <FEATURE> [IS / IS NOT] a <MEMBERSHIP FUNCTION>
<FEATURE>             → one of the features of the dataset
<MEMBERSHIP FUNCTION> → one of the possible fuzzy values for the feature
Fig. 7. Formal structure of the rule set (chromosome)
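To make the decision procedure concrete, the following sketch classifies an input with a weighted fuzzy rule set of the structure shown in Figure 7; the membership-degree helper and the use of the minimum as the matching degree of the If-part are illustrative assumptions, not necessarily the exact choices of (Hasanzade et al, 2003):

from collections import defaultdict

def classify(rule_set, sample, membership_degree):
    # rule_set: list of (weight, conditions, class_label) triples, where each
    # condition is a (feature, fuzzy_value, is_positive) triple.
    # membership_degree(sample, feature, fuzzy_value) returns a degree in [0, 1].
    votes = defaultdict(float)
    for weight, conditions, class_label in rule_set:
        degrees = []
        for feature, fuzzy_value, is_positive in conditions:
            d = membership_degree(sample, feature, fuzzy_value)
            degrees.append(d if is_positive else 1.0 - d)   # handle IS / IS NOT
        belief = min(degrees) if degrees else 0.0            # similarity to the If-part
        votes[class_label] += weight * belief                # weighted sum per class
    return max(votes, key=votes.get) if votes else None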
The fitness of each rule set is defined as the accuracy of the rule set in classification of training data. Accuracy is a measure combining the classification soundness with 99.9 percent effect and the simplicity of the rules with 0.1 percent effect. The simplicity measure is used to break the tie between two rule sets with different complexities and similar classification rate, in favor of the simpler rule set, as stated in equation (3).

\mathrm{Simplicity} = \frac{1 + \text{Number of rules with just one condition}}{\text{Total number of conditions in all rules}} \qquad (3)
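A minimal sketch of the resulting fitness, assuming the 99.9/0.1 percent combination is a weighted sum (the chapter does not spell out the exact combination, so the 0.999/0.001 weighting below is an assumption):

def simplicity(rule_set):
    # Equation (3): favours rule sets made of fewer, shorter rules.
    single_condition_rules = sum(1 for _, conditions, _ in rule_set if len(conditions) == 1)
    total_conditions = sum(len(conditions) for _, conditions, _ in rule_set)
    return (1 + single_condition_rules) / total_conditions

def rule_set_fitness(rule_set, training_data, classify_fn):
    # Classification soundness with a 99.9 percent effect, simplicity with 0.1 percent.
    correct = sum(1 for sample, label in training_data if classify_fn(rule_set, sample) == label)
    accuracy = correct / len(training_data)
    return 0.999 * accuracy + 0.001 * simplicity(rule_set)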
The rest of the algorithm is quite similar to the previous versions. The only problem-specific part is the mutation operator, which is depicted in detail along with the algorithm in Figure 8.
5.2 Experimental Results on Rule Base Generation
To compare the performance of the SEA algorithm with a traditional GA, we used six frequently used benchmarks. The first one is a 10% selection of the KDDCUP99 dataset (MIT Lincoln Labs, 2007), and the others are selected from the University of California, Irvine Machine Learning Repository (Blake & Merz, 1998); these datasets are gathered from real experiments, so they can show the efficiency of the algorithm in some real circumstances. The Credit Approval (CRX), Glass Identification (Glass), Iris Plant (Iris), 1984 United States Congressional Voting Records Database (Vote), and Wine Recognition (Wine) datasets are selected as the most frequently used datasets so as to compare the results with some other related works. The extensive information about these datasets is given in Table 5. Although the KDDCUP99 dataset has many classes of intrusion types, we consider its classes as Normal and Attack cases, similar to (Esposito et al, 2005), (Toosi & Kahani, 2007), and (Mill and Inoue, 2005). The GA algorithm is implemented as described in (Hassanzade, 2004) with exactly the same parameters (expressed in Table 6). Like the baseline system, Fuzzy C-Means clustering (Zimmermann, 1996) was used to define the fuzzy membership functions for continuous attributes, and fuzzy singletons were defined for non-parametric attributes. The number of fuzzy sets for the KDD99 features is 5 and for the other problems 3. The exact parameters of the SEA-R algorithm are presented in Table 7. The tests are done four-fold (Blake, 1996), i.e., the data was randomly divided into 4 sets and in each trial, one set was taken as the test set, and the other 3 were used as the training set. The tests are repeated 20 times, and the average, minimum and
Fig. 8. Diagram of Symbiotic Evolutionary Algorithm for Rule Base Generation (SEA-R). The parameters are SR for Selection Rate, TS for Tournament Size, RC for Random Rule Creation Rate and MP for Maximum Population Size.

Table 5. Benchmarks Specifications

Dataset | Features count | Numeric Features | Nominal Features | Classes | Instances
KDD99   | 41 | 34 | 7  | 2 | 494021
CRX     | 15 | 6  | 9  | 2 | 690
Glass   | 10 | 9  | 1  | 6 | 214
Iris    | 4  | 4  | 0  | 3 | 150
Vote    | 16 | 0  | 16 | 2 | 435
Wine    | 13 | 13 | 0  | 3 | 178
Table 6. GA Parameters (as in (Hassanzade, 2004))

Parameter | Value
Maximum Population | 200
Mutation Rate | 0.7
Elitism Rate | 0.2
Tournament Size | 4
Table 7. SEA-R Parameters

Parameter | Value
Population Size | 1000
Selection Rate | 6
Tournament Size | 8
Random Creation Rate | 4
maximum classification rates for the training and test results are depicted in Tables 8 and 9. The stopping criterion of each run is an unchanging best fitness value during 5000 fitness function calls. The computation progress of the algorithm is assumed to be approximately a linear function of the number of fitness function calls. Moreover, this number of fitness function calls has been chosen according to experience, and the optimum value may differ for different problems. As presented in Tables 8 and 9, SEA-R has found better rule sets than GA in all cases on the training sets and 4 of 5 on the test sets. Also, Table 10 presents the best classification results of some other approaches (Gomez et al, 2002), (Mendes et al, 2001), (Rouwhorst & Engelbrecht, 2000), and (Liu & Kwok, 2000), which are

Table 8. Average Classification Rate of GA and SEA-R, Different Data Sets, on Training Data

Data Sets | GA Min | GA Max | GA Average | SEA Min | SEA Max | SEA Average
CRX   | 0.87433 | 0.88937 | 0.8807 | 0.85199 | 0.90042 | 0.8885
Glass | 0.63921 | 0.72023 | 0.6942 | 0.66923 | 0.74812 | 0.7143
Iris  | 0.98139 | 0.99082 | 0.9863 | 0.97237 | 0.9991  | 0.9935
Vote  | 0.96528 | 0.98003 | 0.9732 | 0.96474 | 0.97976 | 0.9756
Wine  | 0.96189 | 0.99156 | 0.9768 | 0.99153 | 0.99910 | 0.9944
KDD99 | 0.87433 | 0.88937 | 0.8807 | 0.85199 | 0.90042 | 0.8885
Table 9. Average Classification Rate of GA and SEA-R, Different Data Sets, on Test Data

Data Sets | GA Min | GA Max | GA Average | SEA Min | SEA Max | SEA Average
CRX   | 0.83746 | 0.87654 | 0.8527 | 0.84888 | 0.86476 | 0.8558
Glass | 0.63377 | 0.71370 | 0.6862 | 0.67878 | 0.74008 | 0.7068
Iris  | 0.9185  | 0.99923 | 0.9495 | 0.91805 | 0.99909 | 0.9557
Vote  | 0.91789 | 0.98661 | 0.9531 | 0.92611 | 0.97972 | 0.9504
Wine  | 0.86293 | 0.99902 | 0.929  | 0.90821 | 0.97683 | 0.9459
KDD99 | 0.84263 | 0.87654 | 0.9436 | 0.85156 | 0.8516  | 0.9931
Table 10. Average Classification Rate of Different Algorithms compared to SEA-R

Dataset | GA | (Gomez et al, 2002) | (Mendes et al, 2001) | (Liu & Kwok, 2000) | (Rouwhorst & Engelbrecht, 2000) | SEA
CRX   | 85.27 | N/A   | 84.7 | 77.39 | N/A  | 85.58
Glass | 68.62 | N/A   | N/A  | 72.43 | N/A  | 70.68
Iris  | 94.95 | 94.84 | 95.3 | 95.33 | 94.1 | 95.57
Vote  | 95.31 | 95.42 | N/A  | N/A   | N/A  | 95.04
Wine  | 92.9  | 92.22 | N/A  | N/A   | N/A  | 94.59
KDD99 | 94.36 | N/A   | N/A  | N/A   | N/A  | 99.31
re-implemented and tested by (Hassanzade, 2003) with settings similar to ours. As stated there, in cases where we had sufficient comparison data, SEA-R is better than the other algorithms on almost all datasets. Figure 9 depicts the best fitness values over time for SEA-R and GA on the six stated datasets, averaged over all runs. As presented in the diagrams, SEA-R has found a better solution much faster than GA in all cases.

Table 11. Average time taken by SEA-R and GA to find the best classifier on different benchmarks, in seconds

Dataset | SEA | GA
CRX   | 357  | 4650
Glass | 164  | 280
Iris  | 40   | 633
Vote  | 89   | 1490
Wine  | 98   | 1710
KDD99 | 7012 | 54306
Fig. 9. Elite fitness of GA and SEA-R versus time for the 5 benchmarks
Table 11 summarizes these results, and presents the average time taken to find the best result by each algorithm on each benchmark. As stated there, SEA has reached its best result notably faster than GA in all cases.
5.3 Symbiogenesis as a New School for Rule Base Generation
While the suitability of evolutionary approaches for the generation of rule based classifier systems has been shown in many different contributions, the structure and elements of this process are important issues in the design of a system that works efficiently. Two general approaches for this task are the Michigan and Pittsburgh methods (Ishibuchi et al, 2001). In the Michigan approach, each individual in the evolutionary pool is a rule, and the whole pool is considered a rule set consisting of cooperating rules. In contrast, in the Pittsburgh school, each individual is a rule set and the pool is a collection of rival rule sets. There have been studies showing that Pittsburgh is less successful in the classification of high dimensional problems (Ishibuchi et al, 2000), but it is still widely used because, in many cases, the cooperation of single rules that are all evolved for better classification, regardless of the other rules' behavior, will not necessarily result in a generally good classifier, and some parts of the problem space might be neglected. This need has resulted in combinations of the two approaches and hybrid methods, such as (Ishibuchi et al, 1999) and (Tan et al, 2003), where in the former, the Michigan approach is used as the mutation operator of a Pittsburgh algorithm, performing a local optimization, and in the latter, a two-layer algorithm evolves a set of rules using the Michigan approach, and then combines them using the Pittsburgh method. Three problems must be dealt with in any Pittsburgh based evolutionary algorithm. First, how many rules must a rule set have? This makes the problem a two-way optimization in which rule set size and classifier accuracy are both important. Second, how should two rule sets get recombined? While the traditional sexual recombination operators split the two parents and merge their parts, how should one know which rules of either parent rule set must be extracted and recombined to make a good combination? The third question is what to do with hitch-hiker or garbage rules; in most evolutionary methods, when one uses chromosomes consisting of many genes, some bad genes might be grouped with some good genes inside one chromosome, and the fitness value associated with the good ones results in selection and reproduction of the bad ones, as parasites. The SEA algorithm uses the symbiotic combination operator instead of the common sexual recombination operator of GA, and provides a solution for the three above questions; it creates an offspring from two parents by combining all of their rules (genes), and adds the offspring to the gene pool only if it outperforms both its parents. Using this strategy, SEA avoids grouping separate rules before it makes sure that the group works better than the isolated ones, so it avoids garbage rules. It does not break any generated rule set; therefore, it does not require a method to identify good working subsets of two rule sets. Also, as it grows the rule sets only if growing results in better performance, the designer does not need to make a decision about chromosome sizes in advance.
Experimental results clearly comply with this hypothesis: SEA had better or similar results in comparison to GA from the accuracy point of view, and these results have always been reached much faster than with GA, under similar operating conditions.
6 Symbiotic Artificial Immune System
Over the last few years, there has been an ever-increasing interest in the area of Artificial Immune Systems (AIS) and their applications in pattern recognition and optimization, such as (Cortés & Coello Coello, 2003), (de Castro & Von Zuben, 1999), (de Castro & Von Zuben, 2000), (Hofmeyr & Forrest, 1999), (Hunt & Cooke, 1995), (Timmis et al, 2000), (Timmis et al, 2004) and many more. In traditional AIS algorithms, the generic solution is coded as an antibody and the system gradually matures its antibodies to find the best possible solution(s). During the maturation process, the antibodies are always assumed to be rivals and the only cooperation between antibodies happens in cross-reactive responses (Ada & Nossal, 1987), (Hodgkin, 1998), (Mason, 1998), (Smith et al, 1997) and (Sprent, 1994). This general policy has two problems. First, it has the well-known garbage genes problem and second, the cooperation between antibodies is minimal and the algorithms do not benefit from the advantages of schemata and exploiting sub-solutions (Holland, 1975). To remove these two weaknesses, this section will introduce the usage of partially specified antibodies and the symbiotic combination operator in a well-known AIS algorithm and we will show how this gadget can improve the performance of AIS. It is worth noting that although the original idea of AIS is taken from the vertebrates' immunity system, which has several hundred million years of evolution on its back, it has been shown that augmenting it with ideas that distance it from its natural form sometimes increases its performance on digital computers and this is not an unusual improvement (de Castro & Timmis, 2003), (de Castro & Von Zuben, 2002a), (de Castro & Von Zuben, 2002b), and (Hunt & Cooke, 1996). In the remainder of this section, a brief introduction to the CLONALG algorithm (de Castro & Von Zuben, 2002b) will be given in subsection 6.1, then the improved symbiotic AIS algorithm will be presented in subsection 6.2, followed by experimental results in subsection 6.3 and a summary and discussion in subsection 6.4.
6.1 CLONALG Artificial Immune System
The immunological process has been used for inspiration in AIS in several general purpose algorithms such as the negative selection algorithm (Forrest et al, 1994), the positive selection algorithm (Seiden & Celada, 1992), the clonal selection algorithm (de Castro & Von Zuben, 2002b), continuous immune models (Farmer et al, 1986), (Varela & Coutinho, 1991) and discrete immune network models (de Castro & Von Zuben, 2002a), (Timmis, 2000), and in many special purpose contributions such as multiobjective optimization (Coello & Cortes, 2005), (Cortes & Coello, 2003), (Cui et al, 2001), (Kurpati and Azarm, 2000), and (Yoo & Hajela, 1999) and multimodal optimization (Forrest & Perelson, 1991), (Smith et al, 1992), (Smith et al, 1993). Among the wide range of existing AIS algorithms, we chose CLONALG as it is a common abstraction of the clonal selection idea that has been shown to be able to learn and select patterns (de Castro & Von Zuben, 2002b) and perform multimodal optimization (de Castro & Timmis, 2002b), while it is still quite simple in comparison to algorithms that use memory cells and networks of activation/deactivation, such as aiNet (de Castro & Von Zuben, 2002a) and opt-aiNet (Timmis & Edmonds, 2004).
Fig. 10. Diagram of the CLONALG optimization algorithm
CLONALG starts by generating a population of N antibodies, each specifying a random solution for the optimization process. In each iteration of the algorithm, some percentage of the best existing antibodies are selected, cloned and mutated to construct a new candidate population. All new members are evaluated and a certain percentage of the best members are added to the original population. Finally, a percentage of the worst members of the previous generation of antibodies are replaced with new randomly created ones. Figure 10 presents the diagram of this process.
6.2 Symbiotic Artificial Immune System (SymbAIS)
To add the idea of symbiogenesis to CLONALG, the new algorithm uses partially specified antibodies and gradually grows them towards fully specified antibodies. The entire process is depicted in Figure 11 and briefly described here: The algorithm starts with a set of partially specified antibodies, each having just one specified bit. If the solution is not binary coded, each antibody may have just one specified field regardless of its type, but here, for the sake of simplicity and without loss of generality, we assume binary coding. The rest of the algorithm is quite similar to CLONALG, with the exception that evaluation, selection, mutation, and cloning are all done at the assembly level instead of the antibody level and, after cloning, the assemblies are again broken into their symbiont antibodies. Also, sometimes symbiotic combination occurs on cloned assemblies and two antibodies in one assembly merge. The cloning rate of each assembly is computed using its affinity value and the rank of its affinity among all other created assemblies, as in equation (4). The assembly with the highest affinity is ranked 1 and the worst assembly among N is ranked N. Each member of the selected assemblies, along with its required number of clones, is given to the maturation function, which performs cloning and mutation. The mutation rate is also computed based on affinity, as stated in equation (5).

\text{Number of Clones}(S) = \max\left(1,\ \frac{\mathrm{Affinity}(S)}{\sum_{x \in \text{Created Assemblies}} \mathrm{Affinity}(x)} \times \frac{\mathrm{CloningRate}}{\mathrm{Rank}(S)}\right) \qquad (4)

\text{Mutation Probability}(S) = [1 - \mathrm{Affinity}(S)] \times \mathrm{MutationRate} \qquad (5)
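Equations (4) and (5) translate directly into the following sketch, assuming affinity values are normalized to [0, 1] and assemblies are ranked from 1 (best) to N (worst); the names are illustrative:

def number_of_clones(affinity, total_affinity, rank, cloning_rate):
    # Equation (4): higher-affinity, better-ranked assemblies receive more clones.
    return max(1, round((affinity / total_affinity) * (cloning_rate / rank)))

def mutation_probability(affinity, mutation_rate):
    # Equation (5): low-affinity assemblies are mutated more aggressively.
    return (1.0 - affinity) * mutation_rate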
Fig. 11. Diagram of Symbiotic Artificial Immune System algorithm. Parameters are: AC (Assemblies Count): Number of assemblies that are created in each iteration for evaluation. SP (Selection Percentage): Percent of assemblies that are chosen for cloning. DP (Deletion Percentage): Percentage of worst assemblies that are removed from the population. CR (Cloning Rate): Number of clones created from the chosen assemblies. MP (Maximum Population): The upper limit of antibodies population.
Similar to the cooling mechanism of SEA, to avoid the fast emergence of fully specified antibodies, the maximum size of possible antibodies at a certain time is limited to a dynamic parameter. Here again we have computed this parameter by a simple linear equation based on the generation number, but more complicated functions can be used if the antibody space is known better. If the symbiotic combination operator picks two symbionts whose combination creates an antibody with more specified bits than what is allowed by this limit, the combination is discarded and the antibody breaks into its constituents.
6.3 Experimental Results
We compared the SymbAIS algorithm with CLONALG on two problem sets. The first set is composed of some multimodal optimization functions, depicted in Table 12, and the second set was the multiple 8-Queens combinatorial optimization problem which was introduced among the benchmark problems of SEA in subsection 4.2. For the multimodal optimization functions with real variables, each variable is implemented with 22 bits as in (de Castro & Von Zuben, 2002b), and for the multiple 8-Queens problems, the implementation is as in SEA. Again, similar to the previous benchmarks, the parameters of both algorithms were optimized using a hill climbing algorithm with first priority on the number of instances solved and second priority on optimization speed. Hill climbing was run for each
Table 12. Test functions for SymbAIS and CLONALG

Name | Function Definition | Problem Size in bits | Reference
F1 | sin^6(5πx) | 22 | (Goldberg & Richardson, 1987)
G1 | e^(−2((x−0.1)/0.9)^2) · sin^6(5πx) | 22 | (Goldberg & Richardson, 1987)
H1 | x·sin(4πx) − y·sin(4πy + π) + 1 | 44 | (Goldberg & Richardson, 1987)
F2 | sin^6(5πx) × sin^6(5πy) | 44 | (Halavati et al, 2007a)
H2 | [x·sin(4πx) − y·sin(4πy + π) + 1] × [z·sin(4πz) − t·sin(4πt + π) + 1] | 88 | (Halavati et al, 2007a)
G3 | e^(−2·Σ_i((x_i−0.1)/0.9)^2) · Π_i sin^6(5πx_i) | 66 | (Halavati et al, 2007a)
4×5 Trap Function | 4 instances of a 5-bit trap function; the fitness of each instance is F(B) = SizeOf(B) if Ones(B) = 0, and Ones(B) − 1 otherwise, where Ones(B) equals the number of 1 bits in B | 20 | (Deb, 1991)
3×8 Trap Function | Similar to above, but 3 instances of 8 bits | 24 | (Deb, 1991)
3-variable Rastrigin Function | 3.0n + Σ_{i=1}^{n} (x_i^2 − 3.0·cos(2πx_i)) | 66 | (Potter & de Jong, 1994)
4-variable Schwefel Function | 418.9829n − Σ_{i=1}^{n} x_i·sin(√|x_i|) | 88 | (Potter & de Jong, 1994)
Table 13. Parameters of the CLONALG algorithm for each benchmark problem. Parameter names are as in (de Castro & Von Zuben, 2002b). Empty columns belong to problems for which we could not find appropriate parameters to solve them.
[N, n, d, and β for each of F1,G1; H1,F2; G3,H2; 4x5 Trap; 3x8 Trap; Rastrigin; Schwefel; Q-1; Q-5; Q-10; and Q-15.]
problem instance on each algorithm with 100 steps and 20 restarts and the best parameters found are depicted in Tables 13 and 14. Each problem was run with both algorithms, using the best found parameters, in 50 independent runs. The stopping criterion of each run was either reaching the optimum solution or making 100 times more affinity function calls than the fastest result that we had already found by any of the algorithms on that problem. Figure 12 shows the average number of affinity function calls before each algorithm found the optimum solutions in 50 runs. It must be noted, in this experiment, that the success rate of the
Table 14. Parameters of the SymbAIS algorithm for each benchmark problem. Parameter names are: AC (Assemblies Count), CS (Cooling Speed), SP (Selection Percentage), DP (Deletion Percentage), CR (Cloning Rate), MR (Mutation Rate), MP (Maximum Population).
[Values of AC, CS, SP, DP, CR, MR, and MP for each of F1,G1; H1,F2; G3,H2; 4x5 Trap; 3x8 Trap; Rastrigin; Schwefel; Q-1; Q-5; Q-10; and Q-15.]
Fig. 12. Computation Time of SymbAIS and CLONALG on benchmark problems (Average number of affinity function calls to reach global optimum, in 50 runs). Columns with no value indicate that the algorithm was not successful in solving even one instance of that problem.
algorithms were either 0 or 100%; that is, each algorithm either solved a problem in all trials or in none. Therefore, the columns of Figure 12 that have no values indicate a success rate of 0%, and all other columns correspond to 100%. As depicted in Figure 12, CLONALG found the optimum of 8 of the 10 multimodal functions and 2 of the 4 combinatorial optimization problems, while SymbAIS successfully solved all instances of all problems. This better success rate was also accompanied by fewer affinity function calls (and, in turn, faster convergence) in most cases. SymbAIS was slower on three problems (4×5 Trap, Q-1, and Q-5), which are the smallest problems of each category; this may be because these problems are not hard enough to require the more complex algorithm, and the hitchhiker-gene problem is not severe there.

6.4 Summary of Findings on Symbiotic AIS

The Symbiotic Artificial Immune System (SymbAIS) was introduced as an extension of the well-known CLONALG with an extra gadget for removing unnecessary linkage between solution parts. SymbAIS performs its search using partially specified antibodies and gradually builds up building blocks from the classes of possible
solutions until it reaches fully specified antibodies. If a solution is not found using partially specified antibodies, progressively more specified antibodies are created, and in the limit the algorithm converges to a standard CLONALG. It can therefore be stated that SymbAIS is at least able to find any solution that CLONALG can find. The experimental results comply with this expectation: as shown in Section 6.3, the problems solved by SymbAIS were a superset of those solved by CLONALG, and SymbAIS required less computational effort except on the easier problems, where the linkage problem was not severe.
7 Summary and Conclusions

To overcome the linkage problem, we proposed the use of partially specified chromosomes (PSCs) together with evaluation and selection at the group level. The main point behind this idea is to select and replicate PSCs at the group level instead of the individual level: when a group of PSCs shows a good fitness value together, this does not imply that each of them is a good subsolution in general, but it does mean that they are good together and can form a good cooperation. So, once a group is evaluated and shows a good fitness value, all members of the group receive a higher chance of survival and replication. During replication, to increase the chance that some members of a successful group are regrouped, we stick some of them together using the symbiotic combination operator and build partial solutions with more specified bits. If these partial solutions show good fitness values in their future combinations with other members of the pool, they are replicated again and grow bigger; if not, they are destroyed. As this template is quite prone to getting stuck in local optima, it is augmented with a tabu prohibition mechanism to avoid this problem and perform a better global search. The major advantages of this idea are:

1. The evolutionary algorithm does not need prior domain knowledge for chromosome or recombination operator design to preserve building blocks during recombination, as there is no recombination any more.
2. The process uses schemata and substructures during evolution and can find and use building blocks, while the building blocks have no predefined structure or size limitations. Moreover, growth control is based only on fitness, so there is no bias due to a size penalty function or anything similar.
3. As the process uses PSCs and creates bigger chromosomes only when a group of chromosomes shows a high fitness value, fewer garbage genes are produced than in approaches based on fully specified chromosomes.
4. Tabu prohibition prevents getting stuck in local optima, and the cooling mechanism ensures a wide global search before narrowing down to areas limited by schemata.
5. The general idea is easily implementable in different evolutionary strategies.

The general idea (called the Symbiotic Evolutionary Algorithm, SEA) was introduced in Section 3 and its features were compared with some existing algorithms from similar classes; it was then implemented for combinatorial optimization and deceptive problems in Section 4. SEA was compared with the simple genetic algorithm and the Symbiotic Evolutionary Adaptation Model in Section 4.3 on several combinatorial optimization problems and deceptive functions.
It showed higher success rates than the rival algorithms and found the optimal solution faster in most cases, especially on the harder problems. Next, the SEA template was tailored to the evolutionary extraction of fuzzy rules for classification and data mining in Section 5; it was shown that, while achieving a higher average classification rate than the genetic algorithm on 4 of the 5 benchmarks, it did so with much less computation time, which is an important feature for data mining algorithms that usually deal with huge amounts of data. Finally, the SEA template was applied to an artificial immune system (the CLONALG algorithm), and our experimental results showed that augmenting CLONALG with the SEA gadgets gives the simple artificial immune system a major advantage, helping it solve more problems with less computational effort. Noting that the SymbAIS algorithm eventually converges to CLONALG, we concluded that if the problem has building blocks, SymbAIS finds the optimal solution faster than CLONALG, and if its extra gadgets do not help, it eventually becomes a CLONALG and finds any solution that CLONALG can find. Our test results fully supported our assertions about SEA and showed that SEA was considerably more successful in reaching optimum solutions with less computational effort, except in cases where the search for building blocks was unnecessary and the solutions were trivial. In those cases, the gadgets of SEA were simply unnecessary and time consuming, but SEA could still find the optimal solution, albeit with more computation. As a final statement, we believe that evaluation and selection at the group level can be a major advantage for search and optimization methods that are based on PSCs, and that symbiotic combination is an appropriate tool for building schemata in such algorithms. Our test results were quite satisfactory in all three tested fields, but the best results were obtained in the second experiment, rule-base generation for classification tasks. We can therefore recommend SEA most strongly for this purpose, as it can be used for it with no customization, essentially as presented here. The other two implementations can still be improved by reducing the number of user-defined parameters and making them automatically adjustable based on measures extracted from the process.
Acknowledgements

The authors wish to give their sincerest thanks to Professor Caro Lucas for his valuable comments during this work, and to Ms. Mojdeh Jalali Heravi, Ms. Bahareh Jafari Jashmi, Ms. Sima Lotfi, and Mr. Pooya Esfandiar for their help with the implementation and tests.
References Ada, G.L., Nossal, G.: The Clonal Selection Theory. Scientific American 257(2), 50–57 (1987) Bagley, J.D.: The Behaviour of Adaptive Systems Which Employ Genetic and Correlation Algorithms, PhD Dissertation, University of Michigan (1967) Baluja, S.: Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning, Tech. Rep. No. CMU-CS-94-163. Pittsburgh, PA, Carnegie Mellon University (1994) Battiti, R., Tecchiolli, G.: The Reactive Tabu Search. ORSA journal on computing 6(2), 126– 140 (1994)
Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases, Irvine, CA: University of California, Department of Information and Computer Science (1998), http://www.ics.uci.edu/~mlearn Chen, M.Y., Linkens, D.A.: Rule-base self-generation and simplification for data-driven fuzzy models. Fuzzy Sets and Systems 142(2,1), 243–265 (2004) Cordon, O., del Jesus, M.J., Herrera, F.: Genetic learning of fuzzy rule-based classification systems cooperating with fuzzy reasoning methods. International Journal of Intelligent Systems 13(10), 1025–1053 (1998) Coello, C., Cortes, N.: Solving Multiobjective Optimization Problems Using an Artificial Immune System. Genetic Programming and Evolvable Machines 6(2), 163–190 (2005) Cortés, N.C., Coello Coello, C.A.: Multiobjective Optimization Using Ideas from the Clonal Selection Principle. In: Cantú-Paz, E., Foster, J.A., Deb, K., Davis, L., Roy, R., O’Reilly, U.-M., Beyer, H.-G., Kendall, G., Wilson, S.W., Harman, M., Wegener, J., Dasgupta, D., Potter, M.A., Schultz, A., Dowsland, K.A., Jonoska, N., Miller, J., Standish, R.K. (eds.) GECCO 2003. LNCS, vol. 2724. Springer, Heidelberg (2003) Cui, X., Li, M., Fang, T.: Study of population diversity of multiobjective evolutionary algorithm based on immune and entropy principles. In: Proceedings of the Congress on Evolutionary Computation 2001 (CEC 2001), vol. 2, pp. 1316–1321. IEEE Service Center, Piscataway (2001) de la Iglesia, B., Philpott, M.S., Bagnall, A.J., Rayward-Smith, V.J.: Data Mining Rules Using Multi-Objective Evolutionary Algorithms. In: Proceedings of IEEE Congress on Evolutionary Computations, vol. 3, pp. 1552–1559 (2003) de Castro, L.N., Timmis, J.: An Artificial Immune Network for Multimodal Optimisation. In: Proceedings of the Congress on Evolutionary Computation. Part of the 2002 IEEE World Congress on Computational Intelligence, Honolulu, Hawaii, USA, pp. 699–704 (2002) de Castro, L.N., Timmis, J.: Artificial Immune Systems as a Novel Soft Computing Paradigm. Soft Computing 7(8), 526–544 (2003) de Castro, L.N., Von Zuben, F.J.: Artificial Immune Systems: Part I – Basic Theory and Applications, EEC/Unicamp, Campinas, SP, Tech. Rep. – RT DCA 01/99 (1999) de Castro, L.N., Von Zuben, F.J.: Artificial Immune Systems: Part II – A Survey of Applications. Tech. Rep. – RT DCA 02/00 (2000) de Castro, L.N., Von Zuben, F.J.: aiNet: An Artificial Immune Network for Data Analysis. In: Abbas, H., Sarker, R., Newton, C. (eds.) Data Mining: A Heuristic Approach, pp. 231–259. Idea Group Publishing (2002a) de Castro, L.N., Von Zuben, F.J.: Learning and optimization using the clonal selection principle. IEEE Transactions on Evolutionary Computation 6(3), 239–251 (2002b) de Werra, D., Hertz, A.: Tabu Search Techniques: A tutorial and an application to neural networks - OR Spektrum. 11, 131–141 (1989) Deb, K.: Binary and floating point function optimization using messy genetic algorithms (IlliGAL Report No. 91004). Urbana: University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory (1991) Eggermont, J., Kok, J.N., Koster, W.A.: Genetic Programming for Data Classification: Refining the Search Space. In: Proceedings of the Fifteenth Belgium/Netherlands Conference on Artificial Intelligence, pp. 123–130 (2003) Eiben, A.E., Raué, P.E., Ruttkay, Z.: GA-easy and GA-hard Constraint Satisfaction Problems. In: Meyer, M. (ed.) Constraint Processing. LNCS, vol. 923, pp. 267–283. Springer, Heidelberg (1995)
Esposito, M., Mazzariello, C., Oliviero, F., Romano, S.P., Sansone, C.: Evaluating Pattern Recognition Techniques in Intrusion Detection Systems. In: Proceedings of the 7th International Workshop on Pattern Recognition in Information Systems (PRIS 2005), Miami, FL USA, May 24-25, 2005, pp. 144–153 (2005) Farmer, J., Packard, N., Perelson, A.: The immune system, adaptation and machinen learning. Physica D archive 2, 187–204 (1986) Forrest, S., Mitchell, M.: Relative Building-block fitness and the Building-block Hypothesis. In: Whitley, L.D. (ed.) Foundations of Genetic Algorithms 2, pp. 109–126. Morgan Kaufmann, San Mateo (1993a) Forrest, S., Mitchell, M.: What Makes a Problem Hard for a Genetic Algorithm? Some Anomalous Results and Their Explanation. Machine Learning 13(2/3), 285–319 (1993b) Forrest, S., Perelson, A.: Genetic algorithms and the immune system. In: Schwefel, H.-P., Männer, R. (eds.) Parallel Problem Solving from Nature. LNCS, pp. 320–325. Springer, Heidelberg (1991) Forrest, S., Perelson, A., Allen, L., Cherukuri, R.: Self-nonself discrimination in a computer. In: Proceedings of 1994 IEEE Symposium on Research in Security and Privacy, pp. 132–143 (1994) Freitas, A.: A survey of evolutionary algorithms for data mining and knowledge discovery. In: Ghosh, A., Tsutsui, S. (eds.) Advances in Evolutionary Computation. Springer, Heidelberg (2001) Glover, F.: Tabu Search. Part I, ORSA Journal on Computing 1, 190–206 (1989) Glover, F.: Tabu Search. Part II, ORSA Journal on Computing 2, 4–32 (1990) Goldberg, D.E., Korb, B., Deb, K.: Messy Genetic Algorithms: Motivation, analysis, and first results. Computer Systems 3(5), 493–530 (1989) Goldberg, D.E., Richardson, J.: Genetic Algorithms With Sharing for Multimodal Function Optimization. In: Procceedings of the Second International Conference on Genetic Algorithms, pp. 41–49 (1987) Gomez, J., Dasgupta, D.: Evolving Fuzzy Classifiers for Intrusion Detection. In: Proceedings of the IEEE Workshop on Information Assurance (2002) Gomez, J., Gonzalez, F., Dasgupta, D.: Complete Expression Trees for Evolving Fuzzy Classifier Systems with Genetic Algorithms. In: Proceedings of the Evolutionary Computation Conference GECCO 2002 (2002) Gopalan, J., Alhajj, R., Barker, J.: Discovering Accurate and Interesting Classification Rules Using Genetic Algorithm. In: Proceedings of the 2006 International Conference on Data Mining, June 26-29, pp. 389–395 (2006) Gundo, K.K., Alatas, B., Karci, A.: Mining Classification Rules by Using Genetic Algorithms with Non-random Initial Population and Uniform Operator. Turkish Journal of Electrical Engineering and Computer Science 12(1) Halavati, R., Shouraki, S.B., Heravi, M.J., Jashmi, B.J.: Symbiotic Evolutionary Algorithm, A General Purpose Optimization Approach. In: Proceedings of IEEE Congress on Evolutionary Computations (CEC 2007), Singapore (2007a) Halavati, R., Shouraki, S.B., Jashmi, B.J., Heravi, M.J.: SEAM+ Evolutionary Optimization Algorithm. In: Proceedings of the 7th International Conference on Intelligent Systems Design and Applications, IEEE Computational Intelligence Society, Rio de Janiro (2007b) Hunt, J., Cooke, D.E.: An adaptive and distributed learning system based on the Immune system. In: Proceedings of IEEE International Conference on Systems Man and Cybernetics (SMC), pp. 2494–2499 (1995) Hunt, J.E., Cooke, D.E.: Learning Using an Artificial Immune System. Journal of Network and Computer Applications 19, 189–212 (1996)
Harik, G.R.: Learning Gene Linkage to Efficiently Solve Problems of Bounded Difficulty Using Genetic Algorithm, PhD Dissertation, University of Illinois at Urbana-Champaign, Urbana, Illinois (1997) Harik, G.R., Lobo, F.G., Goldberg, D.E.: The compact genetic algorithm. In: Proceedings of the IEEE Conference on Evolutionary Computation, pp. 523–528 (1998) Hasanzade, M.: Fuzzy Intrusion Detection, MS. Dissertation, Computer Engineering Department, Sharif University of Technology, Tehran, Iran (in Persian, 2003) Hasanzade, M., Bagheri, S.B., Lucas, C.: Discovering Fuzzy Classifiers by Genetic Algorithms. In: Proceedings of 4th international ICSC Symposium on Engineering of Intelligent Systems (EIS 2004), Island of Madeira, Portugal (2004) Hodgkin, P.D.: Role of Cross-Reactivity in the Development of Antibody Responses. The Immunologist 6(6), 223–226 (1998) Hofmeyr, A., Forrest, S.: Immunity by Design: An Artificial Immune System. In: Procceedings of GECCO 1999, pp. 1289–1296 (1999) Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975) Ishibuchi, H., Nakashima, T., Murata, T.: A hybrid fuzzy genetics-based machine learning algorithm: Hybridization of Michigan approach and Pittsburgh approach. In: Proceedings of IEEE Conference on Fuzzy Systems (1999) Ishibuchi, H., Nakashima, T., Murata, T.: A hybrid fuzzy GBML for designing compact fuzzy rule-based classification systems. In: Proceedings of IEEE Conference on Fuzzy Systems (2000) Ishibuchi, H., Nakashima, T., Murata, T.: Three objective genetics-based machine learning for linguistic rule extraction, Information Sciences, 109–133 (2001) Ishibuchi, H., Yamamoto, T.: Fuzzy rule selection by data mining criteria and genetic algorithms. In: Proceedings of Genetic and Evolutionary Computation Conference (GECCO 2002), New York, July 9-13 2002, pp. 399–406 (2002) Ishibuchi, H., Yamamoto, T.: Fuzzy Rule Selection by Multi-Objective Genetic Local Search Algorithms and Rule Evaluation Measures in Data Mining. Fuzzy Sets and Systems 141(1), 59–88 (2004) Kargupta, H.: SEARCH Polynomial Complexity And The Fast Messy Genetic, PhD Dissertation, University of Illinois at Urbana-Champaign, Urbana, IL (1995) Kurpati, A., Azarm, S.: Immune network simulation with multiobjective genetic algorithms for multidisciplinary design optimization. Engineering Optimization 33, 245–260 (2000) Larrañaga, P., Lozano, J.A.: Estimation of Distribution Algorithms. In: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Dordrecht (2002) Liu, J.J., Kwok, J.T.: An Extended Genetic Rule Induction Algorithm. In: Proceedings of IEEE Congress on Evolutionary Computation (CEC 2000), La Jolla, CA, USA (July 2000) Lopes, C., Pacheco, M., Vellasco, M., Passos, E.: Rule-Evolver: An Evolutionary Approach For Data Mining. In: Zhong, N., Skowron, A., Ohsuga, S. (eds.) RSFDGrC 1999. LNCS (LNAI), vol. 1711, pp. 458–462. Springer, Heidelberg (1999) Mason, D.: Antigen Cross-Reactivity: Essential in the Function of TCRs. The Immunologist 6(6), 220–222 (1998) Maynard Smith, J., Szathmary, E.: The Major Transitions in Evolution. WH Freeman, Oxford (1995) Mendes, R.R.F., Voznika, F.B., Freitas, A.A., Nievola, J.C.: Discovering Fuzzy Classification Rules with Genetic Programming and Co-Evolution. In: Proceedings of 5th European Conference PKDD 2001. LNCS (LNAI). Springer, Heidelberg (2001)
Merezhkovsky, K.S.: The Theory of Two Plasms as the Basis of Symbiogenesis, a New Study or the Origins of Organisms. In: Proceedings of the Studies of the Imperial Kazan University, Publishing Office of the Imperial University (in Russian) (1909) Mill, J., Inoue, A.: Support Vector Classifiers and Network Intrusion Detection. In: Proceedings of IEEE Conference on Fuzzy Systems, vol. 1, pp. 407–410 (2004) MIT Lincoln Labs, KDD CUP 99 DARPA Intrusion Detection Dataset(2007), http://kdd.ics.uci.edu/databases/kddcup99 Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, London (1999) Mühlenbein, H., Mahnig, T.: Convergence theory and application of the factorized distribution algorithm. Journal of Computing and Information Technology 7(1), 19–32 (1999) Newman, D.R.: The Use of Linkage Learning in Genetic Algorithms (September 2006), http://www.ecs.soton.ac.uk/~drn05r/ug/irp/ Pelikan, M., Goldberg, D.E., Cantu-Paz, E.: BOA: The Bayesian optimization algorithm. In: Proceedings of the Genetic and Evolutionary Computation Conference GECCO 1999, Orlando, FL, vol. I, pp. 525–532. Morgan Kaufmann Publishers, San Francisco (1999) Pelikan, M., Goldberg, D.E.: Hierarchical BOA solves using spin glasses and MAXSAT. In: Cantú-Paz, E., Foster, J.A., Deb, K., Davis, L., Roy, R., O’Reilly, U.-M., Beyer, H.-G., Kendall, G., Wilson, S.W., Harman, M., Wegener, J., Dasgupta, D., Potter, M.A., Schultz, A., Dowsland, K.A., Jonoska, N., Miller, J., Standish, R.K. (eds.) GECCO 2003. LNCS, vol. 2724, pp. 1275–1286. Springer, Heidelberg (2003) Potter, M.A., De Jong, K.A.: A Cooperative Coevolutionary Approach to Function Optimization. In: Davidor, Y., Schwefel, H.-P., Manner, R. (eds.) Parallel Problem Solving from Nature (PPSN III), pp. 249–257. Springer, Berlin (1994) Riquelme, J.S., Toro, J.C., Aguilar-Ruiz, M.: Evolutionary Learning of Hierarchical Decision Rules. IEEE Transactions on Systems, Man, and Cybernetics 33(2), 324–334 (2003) Rouwhorst, S.E., Engelbrecht, A.P.: Searching the Forest: Using Decision Tree as Building Blocks for Evolutionary Search in Classification. In: Proceedings of IEEE Congress on Evolutionary Computation (CEC 2000), La Jolla, CA, USA, July 2000, pp. 633–638 (2000) Russle, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn., pp. 111–112. Prentice-Hall, Englewood Cliffs (2002) Sastry, K., Goldberg, D.: On Extended Compact Genetic Algorithm (IlliGAL Report No. 2000026). Urbana, IL: University of Illinois at Urbana-Champaign (2000) Seiden, P.E., Celada, F.A.: Model for Simulating Cognate Recognition and Response the Immune System. Journal of Theoretical Biology 158, 329–357 (1992) Smith, D.J., Forrest, S., Hightower, R.R., Perelson, A.: Deriving Shape Space Parameters from Immunological Data. Journal of Theoretical Biology 189, 141–150 (1997) Smith, R.E., Forrest, S., Perelson, A.: Searching for diverse, cooperative populations with genetic algorithms. Technical Report TCGA No. 92002, University of Alabama, Tuscaloosa, AL (1992) Smith, R.E., Forrest, S., Perelson, A.: Population diversity in an immune system model: Implications for genetic search. In: Whitley, L.D. (ed.) Foundations of Genetic Algorithms, vol. 2, pp. 153–165. Morgan Kaufmann Publishers, San Mateo (1993) Sprent, J.: T and B Memory Cells. Cell 76(2), 315–322 (1994) Tan, K.C., Yu, Q., Heng, C.M., Lee, T.H.: Evolutionary computing for knowledge discovery in medical diagnosis. 
Artificial Intelligence in Medicine 27, 129–154 (2003) Teng, M., Xiong, F., Wang, R., Wu, Z.: Using genetic algorithm for weighted fuzzy rule-based system. In: Proceedings of Fifth World Congress on Intelligent Control and Automation (2004)
Timmis, J.: Artificial Immune Systems: A Novel Data Analysis Technique Inspired by the Immune Network Theory. Ph.D. Dissertation, Department of Computer Science, University of Wales (2000) Timmis, J., Edmonds, C.: A Comment on opt-AiNet: An Immune Network Algorithm for Optimization. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103. Springer, Heidelberg (2004) Timmis, J., Knight, T., de Castro, L.N., Hart, E.: An overview of artificial immune systems. In: Paton, R., Bolouri, H., Holcombe, M., Parish, J.H., Tateson, R. (eds.) Computation in Cells and Tissues: Perspectives and Tools for Thought. Natural Computation Series, pp. 51–86. Springer, Heidelberg (2004) Timmis, J., Neal, M., Hunt, J.: An Artificial Immune System for Data Analysis. Biosystems 55(1), 143–150 (2000) Toosi, A.N., Kahani, M.: A New Approach to Intrusion Detection Based on an Evolutionary Soft Computing Model Using Neuro-Fuzzy Classifiers. Computer Communications 30, 2201–2212 (2007) Tsang, C.H., Kwong, S., Wang, H.: Anomaly intrusion detection using multi-objective genetic fuzzy system and agent-based evolutionary computation framework. In: Proceedings of Fifth IEEE International Conference on Data Mining (2005) Varela, F.J., Coutinho, A.: Second Generation Immune Networks. Immunology Today 12(5), 159–166 (1991) Watson, R.A., Pollack, J.B.: Incremental Commitment in Genetic Algorithms. In: Proceedings of GECCO 1999, pp. 710–717. Morgan Kaufmann, San Francisco (1999) Watson, R.A., Pollack, J.B.: Symbiotic Combination as an Alternative to Sexual Recombination in Genetic Algorithms. In: Proceedings of Parallel Problem Solving from Nature (PPSN VI), pp. 425–436 (2000) Yoo, J., Hajela, P.: Immune network simulations in multicriterion design. Structural Optimization 18, 85–94 (1999) Zimmermann, H.J.: Fuzzy Set Theory and Its Application. Kluwer Academic Publishers, Dordrecht (1996) Zhu, F., Guan, S.U.: Ordered Incremental Training with Genetic Algorithms. International Journal of Intelligent Systems 19(12), 1239–1256 (2004)
EpiSwarm, a Swarm-Based System for Investigating Genetic Epistasis

Thomas Goth¹, Chia-Ti Tsai², Fu-Tien Chiang³, and Clare Bates Congdon⁴

¹ Department of Computer Science, Colby College, Waterville, ME 04901 USA, [email protected]
² Division of Cardiology, Department of Internal Medicine, National Taiwan University Hospital, Taipei, Taiwan, [email protected]
³ Division of Cardiology, Department of Internal Medicine, National Taiwan University Hospital, Taipei, Taiwan, [email protected]
⁴ Department of Computer Science, University of Southern Maine, Portland, ME 04104 USA, [email protected]

In this work, we explore the utility of a complex adaptive systems approach for studying epistasis, the nonlinear effects among genes that contribute to different disease outcomes. Due to the nonlinear interactions among the genes, data such as this is difficult to model using traditional epidemiological tools. Thus, we have developed EpiSwarm, a Swarm-based system for investigating complex genetic diseases. EpiSwarm applies genetic-algorithms-style evolution to agents that function as condition-action rules to explain the data in their vicinity. These agents migrate and evolve as they cluster data in a two-dimensional world. Thus, EpiSwarm uses a genetic algorithms component to model genetic data. Although this system does not embody linkage learning explicitly, the system's goal of identifying genetic epistasis is an important process in linkage learning research. In EpiSwarm, the decision variables are biological genes, and identifying the epistasis is the core task. This paper is an extended version of [9], extending and clarifying the explanations of the system and providing additional results.
1 Introduction

Many genetic diseases are not caused by the effects of a single gene, but rather, are due to multiple genes acting in concert. For complex diseases, looking at the effect of variation in a single gene may predict one disease outcome, while looking at the interactions of genetic variations across multiple genes gives us a richer understanding of the risk of disease, and may predict different outcomes. Various methods have been developed to study epistasis, but these are generally top-down analytical models; furthermore, due to the complexity of the task, they
do not yield conclusive results. EpiSwarm represents a bottom-up approach to studying epistasis, and yields novel insights into patterns of genetic epistasis. 1.1
Complex Adaptive Systems
A Complex Adaptive System (CAS) is a scientific model that consists of a large number of interacting agents operating within an environment (usually two dimensional) [7]. The collective state of the agents in a CAS depends on the previous state; the local environment of agents, which includes other agents, influences them and determines how they act [7]. There is no global control over the system; everything is determined by the starting conditions and the rules governing the agents’ behaviors and their interactions. The rules of agents in a CAS should be simple condition-action rules where the conditions are based on the local environment [7]. This is very similar to Conway’s Game of Life [6]; the rules governing whether cells live or die are simple but allow for complex global phenomena that are not trivial to predict. While Conway’s Life is not adaptive, the adaptive aspect of a CAS is the idea that agents either learn individually, or the population is subjected to some means of mutation and competitive selection [7]. The consequences of adaptation are broad because agents are adapting in an environment in which other agents are adapting. Therefore changes among agents will not just have consequences for individual agents; it will also affect the environment directly and indirectly through interactions with other agents. 1.2
Genetic Algorithms and Echo
Genetic algorithms are an approach to problem solving that incrementally evolve a solution to a problem. In this approach, the system works with a population of tentative solutions to the problem and the population “evolves” over a series of generations to gradually form better solutions to the problem. The genetic algorithms approach to problem solving has shown improvements to hillclimbing approaches on a wide variety of problems [8, 14], including data mining [5] and bioinformatics in particular [3, 4]. Echo [10, 11] is an example of a CAS that broadens the concept of genetic algorithms to model the interactions of adaptive agents in a two-dimensional world with constrained resources. 1.3
EpiSwarm
EpiSwarm is implemented using the Swarm system [13] and is inspired by CAS models such as Echo, with the explicit goal of developing a CAS model for clustering complex genetic disease data in two-dimensional space and for identifying rules that explain these clusters. EpiSwarm rule agents act both as rules to explain epistatic phenomena as well as the machinery to organize the data into clusters of similar etiologies. In our initial stage of development, we sought to provide the means for data to cluster and the means for evolving rules to explain those clusters. An example Swarm is shown in Figure 1, showing a clustering of data and rules that emerged
Fig. 1. A closeup of an example swarm that has converged into clusters from an initial random distribution
from an initial random distribution1 ; the mechanics and results of the system will be explained shortly. Our initial results indicate that while neither the clustering nor the rules are ideal in the current implementation, EpiSwarm is a promising approach for visualizing and understanding clusters of disease outcomes for complex genetic diseases in that it is able to yield stronger rules than found via conventional learning strategies.
2 Genetic Epistasis and Atrial Fibrillation This section provides background on genetic epistasis (the effect we are modeling), atrial fibrillation (the disease we are studying), and the difficulty of studying epidemiological tasks such as these. 2.1
Complex Diseases and Genetic Epistasis
Complex common diseases, such as Coronary Artery Disease (CAD), hypertension, diabetes, and cancer, are difficult to study because they aggregate, but 1
The illustrations of the swarms in this chapter have been redrawn for clarity, as will be explained in more detail later.
do not segregate in families [17, 18]. That is, these diseases tend to cluster in families, but none of these diseases is associated with a single Mendelian gene. The complex common diseases can be contrasted with many monogenic diseases, such as Duchenne muscular dystrophy and sickle cell anemia. Such diseases also cluster in families, but an alteration in a single gene is necessary and sufficient to cause disease. As our understanding of genetic machinery increases, it becomes apparent that many common diseases are products of genetic epistasis, where the effects of a gene are influenced by the effects of another (the second gene may dampen or enhance the first gene’s function). The approaches traditionally used to study genetic diseases were developed for the simple single-gene diseases, and tend to overlook the possibility of interactions of multiple genes. Furthermore, traditional approaches assume a common etiology for disease, that a single explanation is sufficient to explain all incidence of disease. Instead, a more accurate model can often be achieved by acknowledging that there are different combinations of genetic and environmental factors that may be associated with high (or low) incidences of disease. Thus, traditional approaches are insufficient for studying what is now known about the complexity of many genetic diseases. Alternative approaches can significantly contribute to our understanding of disease pathways. 2.2
Atrial Fibrillation
Atrial fibrillation (AF) disrupts the heart’s normal function, so that the upper chambers do not beat effectively. As a result, blood is not completely pumped out of these chambers, which can lead to clotting. AF affects roughly 2.2 million Americans and is associated with about 15 percent of strokes. The likelihood of developing atrial fibrillation increases with age. Three to five percent of people over 65 have atrial fibrillation. Identifying individuals with genetic susceptibility to AF allows for improved treatments and interventions. 2.3
The Difficulty of Classification with Medical Data
The epistasis task presented here is an example of a binary classification task, however, as is typical of medical data, the data here qualitatively differs from most data modeled by classification systems. There are several reasons for this: • Medical data is expensive (and often intrusive) to collect; thus, there are often not as many examples in the data set as we might like. • There is often some error in the measurement of the attributes in medical data. • There is often substantial noise in the class attribute in medical data, leading to a high percentage of inconsistent data. • Medical data often has relatively few attributes, again in part because of the expense of collecting the data.
In the data reported here, we have 500 examples, which is considered a large dataset for most medical studies. Noise in the attribute-values for this data corresponds to a sequencing error, with the result that an individual is identified as having one genotype but actually carries another. The class attribute for the AF data is expected to be particularly noisy, since AF incidents increase with age. Some of the negative examples (classified as “healthy”) may develop AF at a later age. One can reasonably assert that more attributes with few examples merely increases our ability to overfit the data. This is inarguably true, however it may be more successful to let the classifier determine an appropriate subset of attributes rather than omitting some available attributes from the learning process. However, with medical data, additional attributes are often simply not available. Other studies of the data used in this research can be found in [15, 21, 20]; related studies of other work with evolutionary computation for epidemiological classification include [1, 2].
3 Materials and Methods This section will describe the data and evaluation metric used in this work. 3.1
An Overview of the Data
The AF data used in this study is described in [19]. This data set contains 500 examples, describing 250 patients ("cases") with documented nonfamilial structural AF and 250 controls. Cases were matched to controls with regard to age, gender, presence of left ventricular dysfunction, and presence of significant valvular heart disease. Eight genetic polymorphisms provide the eight non-class attributes, as described in [19]; for each polymorphism there are three possible values. Each data point also has a ninth attribute, the class attribute, indicating whether the patient is a case of documented AF or not. Initial analyses of the data reveal the following:
1. Among the 500 examples, there are only 161 unique attribute-value combinations (including the class attribute).
2. Among the 250 positive examples, there are 86 unique attribute-value combinations.
3. Among the 250 negative examples, there are 75 unique attribute-value combinations.
4. There are 117 unique attribute-value combinations when the class attribute is excluded.
Thus, there are 44 combinations of non-class attribute values that are inconsistent with regard to the class attribute. These characteristics are typical of epidemiological data, including the sparseness of the data relative to the multi-dimensional space and the inconsistent data. The attributes and possible values of the data set are illustrated in Figure 2.
Attribute # | Gene Name | Description | Range of Values
0 | ACE | insertion/deletion polymorphism | II, ID, DD genotypes
1 | AT,R | A1166C polymorphism | AA, AC, CC genotypes
2 | AGT | polymorphism of the AGT gene | GG, GA, AA genotypes
3 | G-152A | polymorphism of the AGT gene | GG, GA, AA genotypes
4 | A-20C | polymorphism of the AGT gene | AA, AC, CC genotypes
5 | G-6A | polymorphism of the AGT gene | GG, GA, AA genotypes
6 | T174M | polymorphism of the AGT gene | T/T, T/M, M/M genotypes
7 | M235T | polymorphism of the AGT gene | M/M, M/T, T/T genotypes
8 | AF | (disease status) | True or False

Fig. 2. A description of the attributes and possible values represented in the data set. These are also represented in each rule agent, as described in Section 5. For a rule, each of the eight genetic polymorphisms can also take a "don't care" value (the disease status cannot).
3.2 Evaluation Metric (Odds Ratio)
With a binary classification task, the first step in evaluating a rule set is to partition the dataset into four groups of data points, based on which class the data point belongs to, and which class the rule set predicts for the data point, as illustrated in Figure 3.

                      Data Set
                      D+    D−
Rule Set    R+        a     b
            R−        c     d

Fig. 3. The partition of a binary data set into four subsets, a, b, c, and d, according to labeled class and predicted class
The data is divided into two groups (D+ and D-), based on whether each example is a positive or negative example of the class. It is also divided into two groups (R+ and R-) based on whether the rule set classifies it as a positive or negative example. The resulting four groups, a, b, c, and d, correspond respectively to true positives, false positives, false negatives, and true negatives. A rule set that describes the data set perfectly will have both b and c equal to 0. As an evaluation criterion, we use an odds ratio, a statistical measure of pairwise association widely used in epidemiology [12]. The “odds” of an example being in a particular class is the ratio of the probability that it belongs to that class to the probability that it does not belong to that class. For example, the odds that a given datapoint in D+ is described by rule R+ is a/c. Likewise, the odds that a given datapoint in D- is described by R+ is b/d. The odds ratio takes the ratio of these two odds: (a/c)/(b/d), or ad/bc, the ratio of the odds
that R+ describes a positive example to the odds that R+ describes a negative example. (To eliminate problems such as division by 0, 0.5 is added to each of a, b, c, and d.)
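As a concrete illustration of the metric just described, here is a small Python sketch (ours, not the authors' code) of the odds-ratio computation with the 0.5 correction; the function name and the rule-set interface are assumptions.

def odds_ratio(rule_set, examples):
    # `examples` is a list of (attributes, is_case) pairs; `rule_set(attributes)`
    # returns True when the rule set predicts the positive class (R+).
    a = b = c = d = 0
    for attributes, is_case in examples:
        predicted_positive = rule_set(attributes)
        if predicted_positive and is_case:        # true positive
            a += 1
        elif predicted_positive and not is_case:  # false positive
            b += 1
        elif is_case:                             # false negative
            c += 1
        else:                                     # true negative
            d += 1
    # Add 0.5 to every cell to avoid division by zero, as described above.
    a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)

A rule set that covers many cases and few controls yields a large odds ratio, while a rule set that is no better than chance yields a value near 1.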
4 A Preliminary Analysis of the Data As part of the initial analysis, C4.5 [16] was run on the data. C4.5 constructs decision trees, a classic approach in the machine learning literature. In the standard formulation as used here, a decision tree stores all the data in the root of the tree; each node branches on one of the attributes, forming a separate branch for each possible value in the data. C4.5 includes statistical tests that determine whether additional branching is sound at each node and provides options that may prune a tree accordingly. There is a stochastic component to the C4.5 tree-building process, so 100 trials were performed; all 100 trees were 70.6% correct (29.4% errors), ranging from 75 to 99 leaf nodes to explain the data (there was one tree at each extreme of 75 and 99 leaf nodes). This illustrates a characteristic of this data that is often true of medical data: It is largely inconsistent. In other words, there are several subsets of the data that are the same for all attribute-values except for the class attribute, and we cannot hope to achieve anywhere near 100% accuracy when classifying this data. The number of leaf nodes in a decision tree corresponds to the number of rules in a rule set. Thus, this analysis illustrates that although there is great redundancy in the data, the data does not neatly resolve to a small decision tree. Furthermore, while decision trees are a good first cut at understanding a data set, most formulations (including C4.5) build trees assuming a linear interaction among the attributes, and are thus thwarted by attributes with nonlinear
Fig. 4. The best pruned tree found by C4.5 for this data. Interior nodes split on the attribute inside the node, with the corresponding values labeled on the branches. For each leaf node (rule), the number of positive and negative examples and the odds ratio of that rule is shown. Shaded leaf nodes (rules) are considered explanations of positive examples and unshaded leaf nodes are considered explanations of negative examples.
interactions. C4.5 has a hedge against this in doing the multiple trials with different orderings of the data, but most decision tree systems are inherently linear, so this is only a minor hedge. When C4.5 was allowed to generate 100 trees and prune them to 10% confidence, tree sizes dropped to 5 leaf nodes, but with a maximum correctness of 59.8% (40.2% errors). C4.5 identified the best pruned tree as one with 7 leaf nodes and 62% correct. In other words, while it is possible to prune the trees, doing so results in a substantial decrease in accuracy. The 7-rule tree is illustrated in Figure 4. The 7-rule tree is a good illustration of the complexity of the data, highlighting the difficulty of classifying inconsistent data using standard tools. Note that in the tree above, 143 of the 250 patients with the disease (135 plus 8) are classified as being healthy.
5 System Design This section will describe the implementation of EpiSwarm, including the use of the Swarm libraries, and an overview and details of each of the major components of the system. 5.1
Swarm
EpiSwarm is implemented using Swarm [13], a collection of libraries to facilitate multi-agent simulation of complex systems. Swarm is not a CAS itself but it provides tools such as numerous random number generators and data structures frequently used in agent-based modeling, which streamlines the process of creating a CAS and conducting experiments with it. 5.2
EpiSwarm Overview
The idea of EpiSwarm is to use the CAS approach to study epistatic data. Our goals were to design a CAS that would: • Allow the data to form similarity-based clusters, and • Allow rules to form and adapt to explain clusters of data. The EpiSwarm agents live in a two-dimensional toroidal grid world and their behaviors are determined by condition-action rules based upon their local environment (which includes other agents). There are two types of agents in EpiSwarm. Rule agents evolve to explain the data near them and transport data agents from one location to another. This provides the simulation with the means to reposition the data into clusters and to evolve explanations of those clusters. • Data agents are created to match the examples in the data set. There is one data agent per example. Each data agent contains a specific attribute-value for the data point it represents, including the class attribute. These agents
Fig. 5. EpiSwarm alternates between a stable phase, in which the state of the world is captured to be used as input for the next cycle, and a dynamic phase, in which agents check conditions, update their state, and possibly reproduce
are static in that they do not evolve or otherwise change as the simulation runs, but are picked up and moved by the rule agents. Since there are 500 examples in the dataset, there are 500 data agents in the experiments shown here. Each data agent has nine attributes, one for each of the eight different polymorphisms among the renin-angiotensin system (RAS) genes and one for the status of the patient (AF or not). For the AF data, there are three possible alleles for each gene, and the class attribute is binary. • Rule agents contain attribute-value rules for describing the data. They have one attribute for each attribute in the data set, including the class attribute, as illustrated in Figure 2. The rule agents are similar to the data agents in that they have the same nine attributes. For the rule agents, each attribute may be one of the possible values or a "don't care" (except for the class attribute, which must be specified in the rule), as sketched in the code below. Behaviors and reproduction of the rule agents are described below. Figure 5 illustrates the EpiSwarm cycle in terms of the agents' behaviors. The condition set is evaluated during the stable state, the agents' actions are determined during the active state, and the world is then updated according to these actions for the next stable state.
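The following short Python sketch (our illustration, not the authors' implementation) shows one way the data agents and rule agents described above could be represented, with None standing in for a "don't care" position; all class and attribute names are hypothetical.

DONT_CARE = None  # a rule position that matches any genotype

class DataAgent:
    # One example from the data set: eight genotypes plus the AF class attribute.
    def __init__(self, genotypes, sick):
        self.genotypes = genotypes  # e.g. ["II", "AA", "GG", "GG", "AA", "GG", "T/T", "M/M"]
        self.sick = sick            # True for an AF case, False for a control

class RuleAgent:
    # A condition-action rule: genotype conditions (or don't-cares) plus a predicted class.
    def __init__(self, conditions, predicts_sick):
        self.conditions = conditions        # same length as the genotypes; None = don't care
        self.predicts_sick = predicts_sick  # the class attribute must always be specified

    def matches(self, data_agent):
        # True when every specified condition equals the corresponding genotype.
        return all(cond is DONT_CARE or cond == geno
                   for cond, geno in zip(self.conditions, data_agent.genotypes))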
5.3 Initialization
Data agents may be moved, but do not change in EpiSwarm, thus, when the EpiSwarm world is initialized, one data agent is created to correspond exactly to each datum. Rule agents are created to match random samples from the data. This ensures that they match at least one example initially. (In general, there will be more data agents than rule agents in a simulation, so not all data agents are represented by the initial rule agents.) Both types of agents are distributed
Fig. 6. An example initial swarm, with data agents and rule agents randomly distributed. The original EpiSwarm color screenshot has been used as the basis for this depiction, as the colors did not convert well to grayscale. The rule agents are denoted with asterisks and circles, with circles denoting rules that describe AF patients and asterisks denoting rules that describe the controls (negative for disease). The data agents are denoted with squares and plus signs; squares denote AF patients and plus signs denote controls (negative for disease).
at random in the 2-dimensional toroidal world, as illustrated in Figure 6, and move according to the behaviors described below. Thus, although each initial rule describes a data point, it is not guaranteed to be located in the vicinity of that data point.
5.4 Movement of Rule and Data Agents
Rule agents may carry up to one data agent at a time. During each cycle, each rule agent that is not already carrying a data agent first picks up a random data agent within its vicinity (determined by a parameter). After moving, the rule agent tests a random data agent in its vicinity and drops the data agent it is holding if the two data agents have similar attribute-values (the threshold for dropping is also determined by a parameter). During each generation of the Swarm simulation, each rule agent may move to a new location in the neighborhood of its current location. For each potential movement, the agent's x and y coordinates are each adjusted by a random value chosen within the range of the "moveDistance" parameter; for example, x will be altered by a random value in the range [-5..5] and y by another random value in the range [-5..5]. (Both numbers could be zero, so the agent need not move.) Using these mechanisms, both rule agents and data agents are able to relocate in the world, as sketched below.
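A minimal sketch of the per-generation pick-up / move / drop behavior just described (our Python illustration, reusing the agent classes sketched earlier; the parameter and attribute names move_distance, drop_threshold, carried, x, and y are assumptions).

import random

def move_and_carry(rule, nearby_data, move_distance, drop_threshold, world_size):
    # One movement cycle for a rule agent on the toroidal grid.
    # `nearby_data` is the list of data agents currently within the pick-up radius.

    # Pick up a random nearby data agent if not already carrying one.
    if rule.carried is None and nearby_data:
        rule.carried = random.choice(nearby_data)

    # Move by a random offset in [-move_distance, move_distance] on each axis, wrapping around.
    rule.x = (rule.x + random.randint(-move_distance, move_distance)) % world_size
    rule.y = (rule.y + random.randint(-move_distance, move_distance)) % world_size
    if rule.carried is not None:
        rule.carried.x, rule.carried.y = rule.x, rule.y

    # After moving, test one random nearby data agent and drop the carried agent if similar enough.
    candidates = [d for d in nearby_data if d is not rule.carried]
    if rule.carried is not None and candidates:
        probe = random.choice(candidates)
        held = rule.carried.genotypes + [rule.carried.sick]
        other = probe.genotypes + [probe.sick]
        matching = sum(h == o for h, o in zip(held, other))
        if matching >= drop_threshold:  # e.g. 7 of the 9 attributes, as in Figure 9
            rule.carried = None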
5.5 Fitness of Rule Agents
A rule agent’s fitness affects its chances of reproductive success, so having an appropriate fitness function in the simulation is important. An agent’s fitness is calculated using the odds ratio function using the data agents in the neighborhood of the agent. The rule agent contains an attribute to indicate whether it explains positive or negative examples, and for the data in the neighborhood, the rule agent matches or does not match each example. 5.6
5.6 Replication of Rule Agents (Copying with Mutation)
After all rule agents have moved, agents may replicate. When an agent replicates, it selects N other agents in its neighborhood (N and the neighborhood size are determined by parameters, as is the probability of replication). From among this set of agents (including the one that initiated the replication), the highest-fitness agent is copied with mutation, and the lowest-fitness agent in the group is killed. The mutation chooses one of the attributes at random and sets it to a random legal value for that attribute (possibly the same value as before). "Don't care" is among the legal values that may be chosen by a mutation for all attributes but the class attribute. A sketch of this step follows.
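A sketch of the copy-with-mutation step just described (our Python illustration; the fitness and alive attributes, and the legal_genotypes parameter listing the allowed values per gene, are assumptions).

import copy
import random

def replicate(initiator, neighbors, n_considered, legal_genotypes):
    # Copy the fittest member of a small local group with one mutated attribute,
    # and mark the least fit member of the group for removal.
    group = [initiator] + random.sample(neighbors, min(n_considered, len(neighbors)))
    best = max(group, key=lambda agent: agent.fitness)
    worst = min(group, key=lambda agent: agent.fitness)

    child = copy.deepcopy(best)
    i = random.randrange(len(child.conditions) + 1)  # any one of the nine attributes
    if i < len(child.conditions):
        # Genotype positions may also mutate to "don't care" (None).
        child.conditions[i] = random.choice(legal_genotypes[i] + [None])
    else:
        child.predicts_sick = random.choice([True, False])  # the class must stay specified
    worst.alive = False  # the least fit agent in the group is killed
    return child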
5.7 Reproduction of Rule Agents (Crossover without Mutation)
Each rule agent also has a probability, at every world tick, of being granted the ability to reproduce with another rule agent using a crossover operation. In crossover, a partner is randomly chosen from among the neighboring rules. Once the partner has been selected, a new rule agent is created; the attributes of the offspring are a combination of its parents'. More specifically, each of the offspring's nine attributes is taken from one parent or the other, the choice being made by a
flip” (a 50-50 chance). The parent that was granted permission to crossover is moved and its offspring is placed where it was just located. The parent with the lowest fitness of the two is subsequently killed after the crossover. 5.8
5.8 Death of Rule Agents
It may seem that the population should remain constant, because for each new rule agent that is created, one is killed in both replication and crossover. However, the addition and removal of rule agents through replication and crossover is not performed until the end of a time step, which leaves the potential for the population to grow: rule agents that were chosen to die through the reproduction process are still able to reproduce for the remainder of the time step. This may not seem significant, but it results in a steady growth of the population. To keep the population from growing endlessly, a small random death parameter is used; that is, each time step there is a small probability that a rule agent will die at random.

loop through all agents
    pickup
    MOVE
    drop
    CALCULATE FITNESS
    replicate
    crossover
    die

Fig. 7. Pseudocode for advancing one generation in the simulation. Moving and calculating fitness are capitalized because they always happen, whereas the probability of the other operations happening is determined by parameters, as illustrated in Figure 8.
Fig. 8. Major EpiSwarm data structures:
    Swarm Model: Rules, Data, MutationProbability, CrossoverProbability, DeathProbability, NeighborRadius, DataRadius, DropThreshold
    Rule: AttributeValues, SickP, OddsRatio, MyX, MyY
    Data: AttributeValues, SickP, MyX, MyY
5.9 Summary
The EpiSwarm simulation runs through all agents each generation, stepping through each of the operators described above and summarized in Figure 7; moving and calculating fitness always happen to an agent (though the "movement" operator has a small probability of not altering the agent's position). The other operators happen with a probability set via parameters to the system. Figure 8 illustrates the major data structures in EpiSwarm, including the major variables for each.
6 Experiments

The system design described above provides EpiSwarm with the basic abilities to move data in the two-dimensional world and to evolve rules to explain clusters of data. The primary question for these initial experiments is whether the system will be successful in arranging data agents and rule agents, enabling the system to capture patterns in the data that elude other approaches. System parameters for the runs reported here are provided in Figure 9. Successful parameter settings were determined based on a few exploratory runs; these should be understood as sufficient, but not necessary, parameter settings for the system. Future work will include an investigation of the effects of different parameter settings.

Parameter | Value
Grid size | 250×250
Number of rule agents | 200
Number of data agents | 500 (all the data)
Maximum move distance/turn | 2
Radius for reproduction | 5
Radius for picking up data | 6
Minimum similarity to drop data | 7/9 attributes
Probability of replication | 0.98
Probability of mutation during replication | 0.96
Probability of crossover | 0.98
Probability of death | 0.001
Number of neighbors to consider for mating | 3

Fig. 9. Example system parameters for the runs described here
6.1 Initial Results
As this is a stochastic system, different runs reach different qualitative endpoints. In most cases, runs tend to converge to having a majority of rule agents that explain one class or the other, as is illustrated in Figure 10. In this run, after 700 generations, the data agents (squares for AF patients and plus signs for
Fig. 10. An example evolved swarm after 700 generations
the negative controls) have been arranged into vague clusters, with rule agents (circles and asterisks) following a similar clustering. A majority of the rules are explaining positive examples (circles, for AF patients), with a small cluster in the lower righthand corner of rules explaining negative examples (controls). In this illustration, the only rules that are explanations of negative examples are in the lower righthand corner, and there are no rules that explain positive examples in that vicinity. Figure 11 illustrates in more detail a cluster of rules (from the lower lefthand corner of Figure 10) that explain positive examples (AF patients), in the vicinity of a cluster of data that is predominantly sick individuals. Figure 12 illustrates in more detail a second cluster of rules (more centrally located in Figure 10) that explain positive examples, in the vicinity of a cluster of data that is predominantly sick individuals. This cluster has a higher proportion of rule agents than the first cluster and also has a higher proportion of sick to healthy data agents.
Fig. 11. Closeup of a cluster explaining sick individuals in the example swarm
Fig. 12. Closeup of a second cluster explaining sick individuals in the example swarm
Fig. 13. Closeup of a cluster explaining healthy individuals in the example swarm
Fig. 14. A plot of odds ratio over time in an EpiSwarm run
Figure 13 illustrates a cluster of rules (from the lower righthand corner of Figure 10) that explain negative examples, in the vicinity of a cluster of data that is predominantly healthy individuals. Figure 14 is an illustration of the increasing fitness of the population of rule agents over time. The gray line is the maximum odds ratio in the population and the black line is the average odds ratio in the population.
Fig. 15. An EpiSwarm simulation converging on another set of clusters
In this stochastic system, different runs yield different qualitative endpoints. For example, Figure 15 illustrates an EpiSwarm run that converged to two strong clusters. One is the cluster on the center right with a predominance of AF rules (circles) explaining AF patients (squares); the second is the cluster in the upper left, with a predominance of negative-control rules (asterisks) explaining negative-control data points (plus signs). An example rule from the set of agents identifies the combination of ACE=II, AT,R=AA, AGT=GG, G-6A=AA, M235T=M/T (with the other polymorphisms as "don't cares") as having an exceptionally high odds ratio. Looking over the entire data set, this combination matches 14 individuals, all of whom are sick. (The
Fig. 16. Example rules found in EpiSwarm runs, showing the specified genotypes (over the genes ACE, AT,R, AGT, G-152A, A-20C, G-6A, T174M, and M235T), cases, controls, and odds ratios. Alleles that differ from the dominant value are circled. The six rules shown match (cases, controls) counts of (14, 0), (8, 1), (8, 2), (23, 3), (18, 3), and (12, 1), with odds ratios of 30.71, 5.83, 3.48, 7.30, 5.63, and 8.72, respectively.
odds ratio of this rule across the entire data set is 30.71.) Thus, this rule has identified an unusually strong pattern in the data. Figure 16 illustrates this rule as well as other strong rules found across multiple runs of the system. One point of interest across these rules is the alleles that differ from the dominant value for each gene (for AF patients). For the A-20C gene, 43 cases and 29 controls carried the AC allele; for the T174M gene, 47 cases and 44 controls carried the T/M allele; and for the M235T gene, 62 cases and 30 controls carried the M/T allele. (ACE illustrates two different alleles in the discovered rules, but these are both highly represented in the data.)
7 Conclusions

The current stage of development of EpiSwarm has achieved our goals in terms of the preliminary development of the system, specifically in its ability to cluster data and find rules to explain those clusters. We now have a solid framework with which we can explore these effects and the “results” that can be obtained from the system. While we are delighted that the system is able to identify high-fitness rules such as the one described above, there are obvious limitations to the current system and our understanding of it. There is obviously much to be done in terms of understanding what the system is finding and in investigating the strengths and limitations of this framework. In the clusters illustrated above, the first cluster is along the lines of what we had hoped and expected; the second cluster seems to be overpopulated with highly redundant agents. There is currently no selection pressure to prevent this effect. While we already know that the data is inconsistent and will not, therefore, segregate into monotonous clusters, we might prefer a system that would converge to larger clusters of rules explaining healthy individuals, to match the large clusters of rules explaining sick individuals.
8 Future Work

There is much room for exploration of the current system. For example, we need to better understand the effects of different parameter settings, such as those that control reproduction and death, and how these affect the quality of the clusters. At this point, we lack tools for investigating the quality of the clusters (beyond the visual images of the swarm and the explicit description of individual rules). Designing such tools is an essential step towards understanding the behavior of the system. While it is interesting to see that the data has the ability to cluster, since we know there are many redundant examples (data points that are exactly the same for all nine attributes), it would be interesting to explore the effects of a different representation of the data that would allow for a “mega” data point that incorporated the identical examples and always kept them together. There might not be any benefit to requiring the system to move identical data points from random starting locations into clusters. The redundancy of rule agents as illustrated in Figure 12 suggests that we should explore a variation of the system in which the rule agents must “compete” with each other for the “resource” of the data agents. Such a component of the system would encourage fewer rules with less redundancy. In addition to exploring variations of the existing system, we need to explore the effects of running EpiSwarm on a variety of data sets, to further investigate its strengths and limitations.
Acknowledgments

The authors would like to thank Ying-ping Chen for his assistance with the conversion of screen shots to the illustrations here. We would also like to thank Jason Moore for sharing insights on the AF data and related projects, and John Kuehne and John Hessler for technical support. This project was supported by NIH Grant Number P20 RR-016463 from the INBRE Program of the National Center for Research Resources.
References

1. Congdon, C.B.: A comparison of genetic algorithms and other machine learning systems on a complex classification task from common disease research. Technical report, PhD thesis, The University of Michigan (1995) 2. Congdon, C.B.: Classification of epidemiological data: A comparison of genetic algorithm and decision tree approaches. In: Proceedings of the 2000 Congress on Evolutionary Computation CEC 2000, La Jolla, California, USA, 6-9 2000, pp. 442–449. IEEE Press, Los Alamitos (2000) 3. Fogel, G.B., Corne, D.W.: Evolutionary Computation in Bioinformatics. Morgan Kaufmann, San Francisco (2002)
4. Fogel, G.B., Corne, D.W., Pan, Y.: Computational Intelligence in Bioinformatics. Wiley, IEEE Press (2008) 5. Freitas, A.A.: Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer, Secaucus (2002) 6. Gardner, M.: Mathematical Games: The fantastic combinations of John Conway’s new solitaire game ‘Life’. Sci. Am. 223(4), 120–123 (1970) 7. Gilbert, N., Conte, R.: Artificial Societies. The computer simulation of social life. UCL Press, London (1995) 8. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading (1989) 9. Goth, T., Tsai, C.-T., Chiang, F.-T., Congdon, C.B.: Initial results with episwarm, a swarm-based system for investigating genetic epistasis. In: Proc. 2007 IEEE Congress on Evolutionary Computation (CEC 2007), Piscataway, NJ, September 25-28, 2007, pp. 3855–3861. IEEE Press, Los Alamitos (2007) 10. Holland, J.H.: Hidden Order: How Adaptation Builds Complexity. Addison Wesley Publishing Company, Reading (1998) 11. Hraber, P., Jones, T., Forrest, S.: The ecology of echo. Artificial Life 3(3), 165–190 (1997) 12. Kramer, M.S.: Clinical Epidemiology and Biostatistics: A Primer for Clinical Investigators and Decision-Makers. Springer, Berlin (1988) 13. Minar, N., Burkhart, R., Langton, C., Askenazi, M.: The swarm simulation system, a toolkit for building multi-agent simulations. Technical Report 96-06-042, The Santa Fe Institute (1996) 14. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (1996) 15. Moore, J.H., Gilbert, J.C., Tsai, C.-T., Chiang, F.-T., Holden, T., Barney, N., White, B.C.: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J. Theor. Biol. 241(2), 252–261 (2006) 16. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993) 17. Sing, C.F., Moll, P.P.: Genetics of atherosclerosis. Annual Review of Genetics 24, 171–287 (1990) 18. Sing, C.F., Reilly, S.L.: Genetics of common diseases that aggregate, but do not segregate, in families. In: Sing, C.F., Hanis, C.L. (eds.) Genetics of cellular, individual, family, and population variability. Oxford University Press, New York (1993) 19. Tsai, C.-T., Lai, L.-P., Lin, J.-L., Chiang, F.-T., Hwang, J.-J., Ritchie, M.D., Moore, J.H., Hsu, K.-L., Tseng, C.-D., Liau, C.-S., Tseng, Y.-Z.: Renin-Angiotensin System Gene Polymorphism and Atrial Fibrillation. Circulation 109, 1640–1646 (2004) 20. White, B.C., Moore, J.: Systems biology thought experiments in human genetics using artificial life and grammatical evolution. In: Artificial Life, vol. IX, pp. 581– 586. MIT Press, Cambridge (2004) 21. White, B.C., Moore, J.: A complete bnf grammar for systems biology thought experiments in human genetics using artificial life and biologically inspired computing. In: Proc. 2005 IEEE Congress on Evolutionary Computation (CEC 2005), Piscataway, NJ, September 2-5. IEEE Press, Los Alamitos (2005)
Real-Coded Extended Compact Genetic Algorithm Based on Mixtures of Models

Pier Luca Lanzi1,2, Luigi Nichetti1, Kumara Sastry2, Davide Voltini1, and David E. Goldberg2

1 Dipartimento di Elettronica e Informazione, Politecnico di Milano, I-20133, Milano, Italy
[email protected]
2 Illinois Genetic Algorithm Laboratory (IlliGAL), University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
[email protected], [email protected]

Summary. This paper presents a real-coded estimation of distribution algorithm (EDA) inspired by the extended compact genetic algorithm (ECGA) and the real-coded Bayesian Optimization Algorithm (rBOA). Like ECGA, the proposed algorithm partitions the problem variables into a set of clusters that are manipulated as independent variables and estimates the population distribution using marginal product models (MPMs); like rBOA, it employs finite mixtures of models and it does not use any sort of discretization. Accordingly, the proposed real-coded EDA can be either viewed as the extension of the ECGA to real-valued domains by means of finite mixture models or as a simplification of the real-coded BOA to the marginal product models (MPMs). The results reported here show that the number of evaluations required by the proposed algorithm scales sub-quadratically with the problem size in additively separable problems.
1 Introduction

Estimation of distribution algorithms (EDAs) [17, 21, 25, 19] replace the traditional variation operators of genetic algorithms by building and sampling probabilistic models. EDAs have successfully solved boundedly-difficult single-level and hierarchical problems, oftentimes requiring only a sub-quadratic number of function evaluations [21]. Despite their demonstrated scalability, most EDAs operate on binary variables, and their success has not been extensively carried over to other encodings such as permutation, program and real codes. In recent years, several real-coded extensions of the extended compact genetic algorithm (ECGA) [13] have been developed as a way to provide simple real-coded EDAs [7, 9, 18]. Such real-coded ECGAs usually combine a discretization algorithm, which is used to map real values to discrete symbols, with a probabilistic model based on marginal product models (MPMs). In this paper, we follow a different approach and propose a native real-coded ECGA that directly works on the real variables and does not employ any sort of discretization. The proposed algorithm is inspired by the extended compact genetic algorithm
(ECGA) [13] and by the real-coded Bayesian Optimization Algorithm [1, 2]. Similarly to what is done in ECGA [13], the proposed algorithm works on probability distributions represented by marginal product models (MPMs) and, for this purpose, it partitions the (real-valued) problem variables into clusters that are then manipulated as independent variables. However, instead of mapping the real values into discrete symbols and using the usual MDL to discriminate between candidate models, the proposed method models each cluster using a joint probability distribution and guides the partitioning process using an MDL metric for continuous distributions. Similarly to what is done in rBOA [1, 2], the parameters of the probabilistic model are fitted using several mixtures of models, one for each cluster of variables identified during the model building step. Accordingly, the proposed real-coded ECGA can be either viewed as an extension of ECGA to real-valued domains by means of finite mixtures of models or as a simplification of the real-coded BOA to the case of a probabilistic model based on marginal product models (MPMs). We performed a scalability analysis and applied the proposed algorithm to problems taken from the literature, namely the sphere function and the real deceptive function [1, 2]. The results we present show that the number of evaluations required to solve additively separable problems scales at most sub-quadratically with the problem size. The paper is organized as follows. At first, we briefly overview related real-coded EDA designs. In Section 3, we give an outline of the extended compact genetic algorithm followed, in Section 4, by a brief description of the real-coded BOA. In Section 5, we describe the proposed real-coded ECGA and then we discuss the experimental design (Section 6) which we followed for the population-sizing and scalability analysis we present in Section 7.
2 Related Work

We now provide a brief overview of the works that are relevant to this paper; for more references, we refer the reader to the two recent books [25, 19]. In [31], Pelikan et al. introduced a real-coded PMBGA working in the continuous domain; they used marginal histograms to model promising solutions. Both the marginal models used, fixed-width and fixed-height histograms, performed fairly well on test functions with no or weak interaction among the variables, but failed to recognize linkage among variables in functions with a medium level of linkage. An evolution of this approach is presented in [30] where the focus is on the linkage identification in real-coded GAs. In [30], the authors applied the SPX operator [15] as recombination operator and tried to identify linkage information by observing the distribution of the individuals in the population; in particular they examined the correlation coefficient matrix of parameter values of the individuals in the population. Pelikan et al. [23, 24] combined the Bayesian Optimization Algorithm for recombination with evolutionary strategies (ES) for mutation. In [23, 24], a real population is first discretized, then BOA is applied to recombine individuals in the discrete population and the new population is mapped back to the real
domain so that adaptive mutation from ES can be applied to obtain the next real population. Three discretization strategies were presented: Fixed-Width Histograms, Fixed-Height Histograms and k-Means Clustering. The approach was tested on three functions and the reported results showed good scalability. Ahn [1] developed a real-coded Bayesian Optimization Algorithm, rBOA, which constructs the Bayesian factorization graph using finite mixture models. All the relevant substructures are extracted from the graph and each substructure is fit and sampled independently. Chen et al. [7] developed a real-valued version of the ECGA (rECGA) by combining a binary version of ECGA together with Split-On-Demand (SoD) discretization—an adaptive discretization technique that takes into account the distribution of the current population in creating discrete intervals. rECGA was tested on several problems and the accuracy of the computed solutions was shown. Fossati et al. [9] introduced a simple real-coded ECGA which combines basic discretization with a χ-ary ECGA [28, 8]. Minqiang et al. [18] also proposed another version of real-coded ECGA in which continuous variables are initially discretized and then clustered together to build the probabilistic model.
3 Extended Compact Genetic Algorithm (ECGA)

The extended compact genetic algorithm (ECGA) [11, 13] is an estimation of distribution algorithm (EDA) that replaces traditional variation operators of genetic and evolutionary algorithms by building a probabilistic model of promising solutions and sampling the model to generate new candidate solutions. ECGA starts with a population of random individuals and repeats the following steps until a stopping criterion is satisfied. First, the fitness of individuals is computed, selection is applied, and a probabilistic model is learned from the population of selected individuals. The probability distribution used in ECGA is a class of probability models known as marginal product models (MPMs). MPMs partition genes into mutually independent groups and specify marginal probabilities for each linkage group. To distinguish better model instances from worse ones, ECGA uses a minimum description length (MDL) metric [26]. The key concept behind MDL models is that, all things being equal, simpler models are better than more complex ones. The MDL metric used in ECGA is a sum of two components: model complexity, which quantifies the model representation size in terms of the number of bits required to store all the marginal probabilities, and compressed population complexity, which quantifies the data compression in terms of the entropy of the marginal distribution over all partitions. In ECGA, both the structure and the parameters of the model are searched and optimized to best fit the data. While the probabilities are learned based on the variable instantiations in the population of selected individuals, a greedy-search heuristic is used to find an optimal or near-optimal probabilistic model. The search method starts by treating each decision variable as independent and then it continues by merging the two partitions that yield the greatest improvement in the model-metric score. The subset merges are continued until no more improvement in the metric
value is possible. The probabilistic model is then sampled to create an offspring population which is combined with the original population to generate the next population.
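The greedy partition search is easy to express in code. The sketch below (Python) assumes an mdl_score(partition, population) function implementing the metric just described (model complexity plus compressed population complexity, lower is better); it is an illustrative outline rather than the reference ECGA implementation.

def greedy_mpm_search(num_genes, population, mdl_score):
    # Start with every gene in its own group (all variables treated as independent).
    partition = [frozenset([g]) for g in range(num_genes)]
    best_score = mdl_score(partition, population)
    while True:
        best_merge, best_merge_score = None, best_score
        # Evaluate every pairwise merge of the current groups.
        for i in range(len(partition)):
            for j in range(i + 1, len(partition)):
                candidate = [g for k, g in enumerate(partition) if k not in (i, j)]
                candidate.append(partition[i] | partition[j])
                score = mdl_score(candidate, population)
                if score < best_merge_score:
                    best_merge, best_merge_score = candidate, score
        if best_merge is None:          # no merge improves the metric: stop
            return partition
        partition, best_score = best_merge, best_merge_score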
4 Real-Coded BOA

The real-coded BOA has been developed by Ahn et al. [4, 2] as an extension of Pelikan’s BOA [25] to real-valued domains. rBOA constructs a Bayesian factorization graph using finite mixture models and it does not apply any sort of discretization. The real-coded BOA works as a typical EDA. It starts from an initial population of random individuals and then it applies the following steps until a stopping criterion is met. First, selection is applied to the current population and generates a new population containing the most promising individuals. Then, model building is applied to learn a probabilistic model of the population of selected individuals. Next, the model is sampled to generate a new population which is used together with the original population to generate the new population which will replace the original one. These four steps, selection, model building and sampling, and replacement, are repeated until a termination criterion is satisfied. To learn the probabilistic model M(ζ, θ), with structure ζ and parameters θ, for the current population, rBOA first applies k-means [14] to partition the population into a set of ks clusters. Then, it applies an incremental greedy search, guided by the Bayesian Information Criterion, to find the best Bayesian factorization graph that can model all the ks clusters. Then, model fitting is performed and, for this purpose, another clustering algorithm, the randomized leader algorithm (RLA) [14, 5], is applied to extract the kf substructures that are then separately fitted using different mixtures of models. All the kf mixtures are then combined together in an overall probabilistic model which is finally sampled to generate a new population. The sampled population and the original population are then combined to produce the next population.
5 Real-Coded ECGA

Our real-coded ECGA is largely inspired by the real-valued Bayesian Optimization Algorithm (rBOA) [4, 2] and by the χ-ary Extended Compact Genetic Algorithm (χ-ECGA) [8]. Like rBOA, our real-coded ECGA uses clustering and joint normal distributions to build and sample probabilistic models defined over the real domain, thus it does not employ any sort of discretization as other versions of real-coded ECGA do [7, 9, 6]. Like χ-ECGA [8], it uses marginal product models (MPMs) to partition variables into mutually independent groups and specifies marginal probabilities for each linkage group. The algorithm is structured as the typical EDA for real-valued domains, see Algorithm 1. At first, the population is randomly initialized (line 7) using a uniform distribution. Then the following steps are repeated until a stop criterion is verified (line 17). The fitness of individuals in the current population P(t) is
Algorithm 1. Pseudo-code of the real-coded ECGA
1: procedure recga
2:   var t;                          ▷ Time step
3:   var P(·);                       ▷ Population at time t
4:   var ζ;                          ▷ Structure of the probabilistic model
5:   var Θ;                          ▷ Parameters of the probabilistic model
6:   t ← 0;
7:   RandomInit(P(t));
8:   repeat
9:     EvaluateFitness(P(t));
10:    Psel ← Selection(P(t));
11:    ζ ← ComputeModelStructure(Psel);
12:    Θ ← ComputeModelParameters(Psel, ζ);
13:    Psam ← Sample(M(ζ, Θζ));      ▷ Sample the model
14:    Pnew ← Replacement(P(t), Psam);
15:    P(t+1) ← Pnew;
16:    t ← t + 1;
17:  until StopCriterionNotMet
18: end procedure
computed (line 9) and selection is applied to P(t), which generates Psel. Several selection strategies can be used; in this work we tested two of them: tournament selection, which has been used both in BOA [22] and χ-ECGA [8], and truncation selection, used in rBOA [4, 2]. Next, the structure ζ of the probabilistic model that best represents the selected population Psel is computed (line 11) and, from the structure ζ and the population Psel, the parameter vector Θ is computed. Then the probabilistic model M(ζ, Θ), with structure ζ and parameters Θ, is sampled to generate the sampled population Psam (line 13). At the end, restricted tournament replacement (RTR) [21, 12] is applied to the original population P(t) and to the sampled population Psam to generate the new population Pnew (line 14) which will replace the original one (line 15).
5.1 The Probabilistic Model
Our approach is characterized by a probabilistic model M(ζ, Θζ) described by the structure ζ and the parameter set Θζ. The former describes the relations among the problem variables as in the typical ECGA, that is, ζ = {S1, . . . , S|ζ|} where Sj is a subset of |Sj| genes, and |ζ| is the number of subsets. The parameter set Θζ contains the parameters of the |ζ| joint normal distributions which model the distribution for each variable subset Sj. Since the subsets are independent, M(ζ, Θ) is computed as,

M(ζ, Θ) = ∏_{j=1}^{|ζ|} f(S_j, ϑ_j)
where f(Sj, ϑj) is the joint normal distribution for the variables in the subset Sj, whose parameters ϑj are defined by an array of means (one for each variable) and a covariance matrix; overall, ϑj contains (1/2)|Sj|² + (3/2)|Sj| parameters; the model parameter set Θζ denotes the set of all the parameter vectors ϑj.
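Evaluating this model on an individual is just a product of multivariate normal densities, one per subset. A minimal sketch, assuming NumPy/SciPy and illustrative variable names, is:

import numpy as np
from scipy.stats import multivariate_normal

def mpm_density(x, structure, params):
    # structure: list of index tuples, e.g. [(0, 2), (1,), (3, 4)]
    # params: one (mean_vector, covariance_matrix) pair per subset
    x = np.asarray(x)
    density = 1.0
    for subset, (mean, cov) in zip(structure, params):
        density *= multivariate_normal.pdf(x[list(subset)], mean=mean, cov=cov)
    return density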
5.2 Model Selection
The goal of this step is to find the model structure ζ that best fits the current population. The search procedure is based on the same greedy-search heuristic used in ECGA [11, 13]. Initially, the decision variables are considered independent; in the next steps, the search continues by merging the two partitions that yield the greatest improvement in the model-metric score. The subset merges are continued until no more improvement in the metric value is possible.

Model Scoring. The evaluation of a model structure is based on the same MDL principle used in ECGA, which takes into account the accuracy of the model and its complexity. The score of a model represented by f(ζ, Θζ) is computed as,

Accuracy(f(ζ, Θζ)) + Complexity(f(ζ, Θζ)).

As done in rBOA [1, 2], we evaluate the accuracy of f(ζ, Θζ) using a mixture of ks models that are obtained by clustering the selected population into ks clusters and applying the same structure ζ to model each cluster. This step is performed, as in rBOA [1, 2], by applying the k-means algorithm with the Euclidean distance as the similarity measure among individuals. The model f(ζ, Θζ) is thus computed as,

f(ζ, Θζ) = ∑_{i=1}^{ks} α_i ∏_{j=1}^{|ζ|} f(S_j, ϑ_j^i)
where ks is the number of models used in the mixture, αi is the relative weight of cluster ki with respect to the entire population, αi = |ki|/|Psel|, and ϑij represents the parameters of the joint normal distribution which describes the variables in Sj for the individuals in cluster ki. Model complexity takes into account the number of parameters |Θζ|, computed as the sum of the parameters of each MPM model for each one of the ks clusters. However, since all the MPMs have the same structure ζ, each component has the same number of parameters. More precisely, each MPM is a product of |ζ| marginal distributions on independent subsets. Thus, each distribution f(Sj, ϑij), linked to subset Sj and the cluster ki, being a joint normal distribution on the variables in Sj, needs (1/2)|Sj|² + (3/2)|Sj| parameters ((1/2)|Sj|² + (1/2)|Sj| parameters for the covariance matrix and |Sj| means). The overall model complexity is therefore computed as,

Complexity(f(ζ, Θζ)) = λ × ln(|PopSel|) × K × ∑_{j=1}^{|ζ|} ( (1/2)|S_j|² + (3/2)|S_j| )    (1)
where the parameter λ determines how much the model complexity influences its evaluation, as in rBOA [1].
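The complexity term of Equation (1) depends only on the subset sizes, the number of clusters, and the size of the selected population, so it can be computed directly; the sketch below simply mirrors the formula (argument names are illustrative).

import math

def mpm_complexity(structure, pop_sel_size, n_clusters, lam):
    # Equation (1): lambda * ln(|PopSel|) * K * sum_j (|Sj|^2 / 2 + 3*|Sj| / 2)
    n_params = sum(0.5 * len(s) ** 2 + 1.5 * len(s) for s in structure)
    return lam * math.log(pop_sel_size) * n_clusters * n_params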
5.3 Model Fitting
Given the model structure ζ, the model parameters Θ are computed for each subset Sj separately. To improve the overall model accuracy, each marginal distribution for Sj is represented as a mixture of distributions that are obtained through a clustering step. As in the initial clustering, the similarity among individuals is measured using the Euclidean distance. However, instead of using a fixed number ks of clusters, as in rBOA [1], we apply an adaptive clustering algorithm, namely the BEND random leader algorithm (RLA) [5, 14] with the same threshold used in [1]. Given the subset Sj, to compute the model parameters, clustering is applied to extract Cj clusters. For each cluster, the parameters of the joint normal distribution f(Sj, ϑij) are computed. All the distributions are then mixed to build the overall model as follows,

M(ζ, Θζ) = ∏_{j=1}^{|ζ|} ∑_{i=1}^{C_j} β_{i,j} f(S_j, ϑ_j^i)    (2)
where Cj is the number of clusters extracted for the data identified by the variable subset Sj; βi,j is the weight of the i-th cluster of the j-th subset with respect to the whole population and it is computed as the ratio between the cluster size |kji| and the population size. The procedure is reported as Algorithm 2.

Algorithm 2. Estimation of the Model Parameters
1: procedure ComputeModelParameters(Psel, ζ)
2:   var Θ = {};                                  ▷ The set of model parameters is initially empty
3:   for Sj ∈ ζ do                                ▷ For each variable subset in the structure ζ
4:     kj1, . . . , kjCj ← Clustering(Psel, Sj);   ▷ Apply clustering on Psel using the variables in Sj
5:     for (i = 1; i ≤ Cj; i++) do
6:       μij ← ComputeAverages(Psel, Sj, kji);
7:       Σji ← ComputeCovarianceMatrix(Psel, Sj, kji);
8:       Θ ← Θ ∪ μij ∪ Σji;
9:     end for
10:  end for
11:  return Θ;
12: end procedure
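A compact NumPy rendering of Algorithm 2 might look as follows; here k-means (via scikit-learn) stands in for the BEND random leader algorithm purely to keep the sketch self-contained, so the fixed number of clusters per subset is an assumption rather than the adaptive behavior described above.

import numpy as np
from sklearn.cluster import KMeans

def compute_model_parameters(pop_sel, structure, n_clusters=3):
    # pop_sel: (N, n) array of selected individuals; structure: list of index tuples
    theta = []
    for subset in structure:
        data = pop_sel[:, list(subset)]
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(data)
        components = []
        for c in range(n_clusters):
            cluster = data[labels == c]
            weight = len(cluster) / len(data)                      # beta_{i,j}
            mean = cluster.mean(axis=0)                            # mu_j^i
            cov = np.atleast_2d(np.cov(cluster, rowvar=False))     # Sigma_j^i
            components.append((weight, mean, cov))
        theta.append(components)
    return theta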
5.4 Model Sampling
The model M(ζ, Θζ) is finally sampled to generate a new population. This step is performed as in the typical ECGA and χ-ECGA, but in our case separate joint normal distributions are sampled. The generation of a new individual from the
model M(ζ, Θζ) works as follows. At first, for every subset Sj, one of the model components f(Sj, ϑij) is selected with a probability proportional to its weight βi,j (Equation 2). Then, as in ECGA, each selected component is sampled and the new individual is generated by merging the different components corresponding to each one of the subsets that describe the model structure ζ.
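Sampling a new individual from M(ζ, Θζ) therefore amounts to choosing, for each subset, one mixture component in proportion to its weight and drawing from the corresponding joint normal. A sketch compatible with the parameter layout of the previous sketch:

import numpy as np

def sample_individual(structure, theta, n_vars, rng=None):
    rng = rng or np.random.default_rng()
    x = np.empty(n_vars)
    for subset, components in zip(structure, theta):
        weights = np.array([w for w, _, _ in components])
        i = rng.choice(len(components), p=weights / weights.sum())
        _, mean, cov = components[i]
        x[list(subset)] = rng.multivariate_normal(mean, cov)
    return x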
6 Design of Experiments

We performed a scalability analysis of our simple real-coded ECGA using a problem that does not require linkage learning and a deceptive problem that needs linkage learning, both taken from the literature [1, 2, 24]. The former is the sphere function (Figure 1),

f_s(x) = ∑_{i=0}^{n−1} x_i²,
where n is the number of variables and xi ∈ [0, 1]; the latter is the real deceptive function frdp [1, 2],

f_rdp(x) = ∑_{i=0}^{m−1} f_trap(x_{2i}, x_{2i+1})
where xi ∈ [0, 1], m is the number of subproblems (m = n/2), and ftrap(·, ·) is the two-dimensional trap (Figure 2),

f_trap(x_j, x_{j+1}) = { 1,                          if x_j, x_{j+1} ≥ 0.8
                       { 0.8 − (x_j + x_{j+1})/2,    otherwise                (3)
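In code, the two test problems take only a few lines; the sketch below follows the definitions above (variables in [0, 1], pairs (x_{2i}, x_{2i+1}) forming the building blocks of the deceptive function, which is to be maximized while the sphere function is minimized).

import numpy as np

def sphere(x):
    # f_s(x) = sum_i x_i^2, minimized at the origin
    return float(np.sum(np.asarray(x) ** 2))

def f_trap(xa, xb):
    # Two-dimensional trap of Equation (3)
    return 1.0 if (xa >= 0.8 and xb >= 0.8) else 0.8 - (xa + xb) / 2.0

def real_deceptive(x):
    x = np.asarray(x)
    return float(sum(f_trap(x[2 * i], x[2 * i + 1]) for i in range(len(x) // 2)))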
6.1 Design of Experiments
The analysis was performed by applying the typical bisection procedure used in [27, 21] to determine the smallest population size and the smallest number of evaluations which guarantee convergence to the optimum. The procedure is illustrated in Algorithm 3. Given a problem with n variables, the minimum population size PopMin, and the maximum population size PopMax, the following bisection procedure is run NoBis times. Initially, the lower and upper bounds for the population size, PopLow and PopUp, are set to PopMin and PopMax respectively; then the population size PopSize is randomly initialized between PopLow and PopUp. The real-coded ECGA is applied NoRuns times with a population size of PopSize individuals and the percentage of individuals which converged in each run is measured. If more than 99% of the population consists of optimal individuals the run is successful and the variable NoSuccess is incremented. When the NoRuns runs are completed, the number of converged runs, NoSuccess, is checked: if at most one run did not converge, it is assumed that the real-coded ECGA can solve the problem with PopSize individuals and
Fig. 1. The sphere function of two variables
Fig. 2. Bidimensional basis function for the real trap deceptive function
therefore the upper bound for the population size is decreased (line 20); otherwise, the lower bound for the population size is increased (line 22); then, the new population size is computed and the real-coded ECGA is run again NoRuns times. This process stops when the interval between the lower and upper bound for the population size is small enough (line 11). At the end of all the NoBis bisection runs, we determine the smallest population size for which the optimum is reached as,

BestPop = ( ∑_{i=1}^{NoBis} PopSize[i] ) / NoBis .
Algorithm 3. Pseudo-code for the bisection procedure
1: procedure bisection(numBisection)
2:   var NoBis;                           ▷ number of bisections
3:   var NoRuns;                          ▷ number of test runs for each interval
4:   var PopSize;                         ▷ current population size
5:   var PopMin, PopMax;                  ▷ min and max population size
6:   var PopLow, PopUp;                   ▷ lower and upper bounds for binary search
7:   for NoBis times do
8:     PopLow := PopMin;
9:     PopUp := PopMax;
10:    PopSize := random(PopLow, PopUp);  ▷ random population size between PopLow and PopUp
11:    while ((PopMax − PopMin) > 0.1 × PopMin) do  ▷ bisection continues while interval is large enough
12:      NoSuccess = 0;                   ▷ number of successful runs
13:      for NoRuns times do
14:        var pci; pci = rECGA(PopSize); ▷ pci: percentage of optimal individuals in the run
15:        if pci ≥ 0.99 then             ▷ 99% of individuals are optimal, success!
16:          NoSuccess := NoSuccess + 1;
17:        end if
18:      end for
19:      if NoSuccess ≥ NoRuns − 1 then   ▷ at most one failed
20:        PopUp := PopSize               ▷ population bound is decreased
21:      else
22:        PopLow := PopSize              ▷ population bound is increased
23:      end if
24:      PopSize := (PopUp + PopLow)/2;
25:    end while
26:    print PopSize, the convergence time, and the number of evaluations
27:  end for
28: end procedure
Note that this value can be reached only if it is contained in the interval [PopMin, PopMax]. In the experiments performed for this work, we ran ten bisections (NoBis was 10), and for each value of PopSize we ran the real-coded ECGA 30 times.
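Expressed as a small Python harness around an abstract run_recga(pop_size) call (a placeholder returning the fraction of optimal individuals in a run), the procedure reads as follows; note that, as the surrounding text describes, the loop contracts on the current lower and upper bounds.

import random

def bisection(pop_min, pop_max, run_recga, no_bis=10, no_runs=30):
    sizes = []
    for _ in range(no_bis):
        pop_low, pop_up = pop_min, pop_max
        pop_size = random.randint(pop_low, pop_up)
        while (pop_up - pop_low) > 0.1 * pop_low:
            successes = sum(1 for _ in range(no_runs) if run_recga(pop_size) >= 0.99)
            if successes >= no_runs - 1:
                pop_up = pop_size        # at most one failed run: decrease the upper bound
            else:
                pop_low = pop_size       # too many failures: increase the lower bound
            pop_size = (pop_up + pop_low) // 2
        sizes.append(pop_size)
    return sum(sizes) / len(sizes)       # BestPop: average over the bisection runs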
7 Experimental Results

We performed two sets of experiments, both to analyze how our rECGA scales up on a problem which does not require any linkage and on a problem which requires linkage, and to study how the type of clustering algorithm applied during model selection and model fitting influences the scalability.
7.1 Experiments with the Sphere Function
At first, we applied our real-coded ECGA to the sphere function [10], which does not require any linkage learning and can therefore be solved by a simple greedy search. The optimum is at the origin and, for bisection purposes, we considered that an individual of size n has converged when all the n variables lie between −10^{-5} and 10^{-5}. We applied our rECGA with the same settings used for the real-coded BOA by Ahn in [1]: during the model selection, k-means is applied with one cluster (ks = 1) so that the population is modeled using only one joint normal distribution; during the model fitting, the RLA algorithm is applied with an upper bound of 100 clusters, i.e., kf = 100. Figure 3 reports (a) the number of evaluations and (b) the population size as functions of the problem size for our rECGA when the probabilistic model is enabled (solid dots) and when the probabilistic model is disabled (empty dots), i.e., when all the variables are considered independent. When we fit the data in Figure 3a with a polynomial, we find that when the probabilistic model is used the number of fitness evaluations (Figure 3a) grows sub-quadratically as O(n^{1.75}), while, when all the variables are considered independent (similar to what is done by UMDA [20] for discrete variables and in [29, 16] for continuous variables), the number of function evaluations grows almost linearly, more precisely as O(n^{1.1}). The required population size (Figure 3b) grows sub-quadratically, as O(n^{1.4}), when the probabilistic model is enabled (solid dots), while when all the variables are modeled independently the population size grows with the square root of n (empty dots).

Model Selection. As in [1], the first step of model selection in our rECGA involves the clustering of the population aimed at identifying k subproblems that are then modeled using k joint distributions based on the same underlying structure. In the experiments reported in [1], the number of clusters is usually one, therefore only one distribution is used to model the entire population. However, a higher number of clusters (i.e., a higher number of distributions) should improve the model selection; accordingly, we performed another set of experiments aimed at studying how the clustering algorithm used in the model selection step influences the scalability of our algorithm. For this purpose, we applied our rECGA to the sphere function when the model selection is performed using k-means with 1, 5, and 10 clusters (i.e., ks ∈ {1, 5, 10}) or using the adaptive clustering (RLA) which in rBOA is employed for model fitting [1]. Figure 4a compares the number of evaluations as a function of the problem size for the different versions of our rECGA. When more distributions are used to model the initial population, that is, when ks increases, the number of evaluations decreases, from O(n^{1.75}) to O(n^{0.5}). The problem is very simple and the best improvement is obtained with just five clusters (ks = 5). Population size follows a similar behavior (Figure 4b).

Model Fitting. The number of clusters used during model fitting influences the accuracy of the approximation used to sample the target distribution. A higher number of clusters means that the distribution is approximated using
Fig. 3. Real-Coded ECGA applied to the sphere function: (a) number of evaluations and (b) population size as a function of the problem size. Clustering parameters are set as in rBOA [1]: model selection applies k-means with one cluster (ks = 1) and model fitting applies RLA clustering with at most 100 clusters.
Fig. 4. Our real-coded ECGA applied to the sphere function using different clustering algorithms for model selection: (a) number of evaluations and (b) population size as a function of the problem size
Fig. 5. Our real-coded ECGA applied to the sphere function using different clustering algorithms for model fitting: (a) number of evaluations and (b) population size as a function of the problem size
Fig. 6. Our real-coded ECGA applied to the sphere function using the model selection and fitting strategies of rBOA [1] and a fully adaptive strategy: (a) number of evaluations and (b) population size as a function of the problem size
more joint distributions and therefore it should be more accurate. However, if the population is small or the number of distributions is excessive, the obtained approximation might be misleading. In rBOA [1], the model fitting step applies the adaptive RLA clustering algorithm to find the number of joint distributions which can provide an accurate approximation. In the last set of experiments with the sphere function, we investigated how the choice of the clustering algorithm used for model fitting influences the scalability of our real-coded ECGA. Figure 5a compares the number of evaluations for different versions of our rECGA in which model selection is based on a single cluster (ks = 1) and the model fitting is performed using k-means with 1, 5, and 10 clusters (kf ∈ {1, 5, 10}) or the same adaptive clustering (RLA) used in rBOA [1]. The sphere function is simple since all the variables can converge to the same value independently, therefore we should expect no major advantage from using more clusters for model fitting. The results in Figure 5 confirm this in that the best scalability is obtained when only one cluster is used both for model selection and model fitting. As we should expect in such a simple problem, when the number of clusters used for model fitting increases, the number of evaluations increases. Noticeably, an adaptive clustering like RLA seems to provide a good trade-off between the different k-means settings, although it tends to grow with an order that is slightly higher than the one provided by simple k-means.

Selection and Fitting using Adaptive Clustering. Finally, we applied a version of our rECGA in which adaptive clustering, namely the RLA algorithm used in rBOA [1] only for model fitting, is applied both during model selection and model fitting. Figure 6 compares three versions of our rECGA: one without the probabilistic model (empty dots), one with the same clustering strategy used in rBOA [1], that is, model selection based on one cluster and model fitting based on the adaptive RLA clustering, while the last one is fully adaptive in that RLA is used both during model selection and model fitting. As can be noted, the fully adaptive version requires slightly more evaluations and slightly larger populations than the version with no probabilistic modeling. However, the two versions scale similarly in that the number of evaluations grows as O(n^{1.1}) in both cases.
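The scaling exponents quoted in this section (e.g., O(n^{1.75})) come from fitting a power law to the measured number of evaluations; a log-log least-squares fit of the kind sketched below (with synthetic data used only as an illustration) recovers such an exponent.

import numpy as np

def scaling_exponent(problem_sizes, evaluations):
    # Fit evaluations ~ c * n^k and return the exponent k from a log-log linear fit.
    k, _log_c = np.polyfit(np.log(problem_sizes), np.log(evaluations), 1)
    return k

# Synthetic example: data generated as n^1.75 yields an exponent close to 1.75.
n = np.array([5.0, 10.0, 20.0, 30.0, 40.0, 50.0])
print(scaling_exponent(n, 120.0 * n ** 1.75))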
7.2 Experiments with the Real Deceptive Function
We repeated the same set of experiments using the real deceptive function (RDP) [1, 2], which requires linkage. Figure 7 compares (i) our real-coded ECGA with the same clustering strategies used in the real-coded BOA in [1], and (ii) a version of the same algorithm without probabilistic model building. As should be expected, without linkage learning only problems of limited size can be solved, while, when the probabilistic model is used, the number of function evaluations and the population size both grow sub-quadratically, respectively as O(n^{1.7}) (Figure 7a) and O(n^{1.4}) (Figure 7b). In Figure 8, we report the percentage of converged building blocks for an RDP with 50 variables. When we compare our results on the RDP function with the ones available for rBOA in [1], we note
Fig. 7. Our real-coded ECGA applied to the real deceptive function using the model selection and fitting strategies of rBOA [1] (solid dots) and no probabilistic model (empty dots): (a) number of evaluations and (b) population size
Fig. 8. Our real-coded ECGA applied to the real deceptive function with 50 variables: percentage of converged building blocks. The vertical bars identify the minimum and the maximum percentages over 10 runs.
that our real-coded ECGA requires slightly larger populations but it appears to require fewer generations to converge, while overall the two methods scale up similarly [3]. Figure 9 compares three versions of our rECGA in which the model selection is based on k-means with 1, 5, and 10 clusters (ks ∈ {1, 5, 10}). In this case, a higher number of clusters during model selection does not improve the overall performance, either in terms of number of evaluations or in terms of population size. The building blocks are rather small in that they involve only two variables, therefore the influence of the initial clustering appears to be limited. Figure 10 compares three versions of our rECGA in which the model fitting is based on k-means with 1, 20, and 40 clusters (kf ∈ {1, 20, 40}) with the version using the clustering strategy of rBOA. One cluster for model fitting is insufficient to reach good scalability, and around 20 clusters are required to reach a performance similar to the one obtained with the usual clustering strategy. As in the previous case, further increasing the number of clusters has no effect on the scalability. Finally, Figure 11 compares the version of our rECGA without the probabilistic model, with the probabilistic model based on the clustering strategy used in rBOA [1], and with fully adaptive clustering. Again, the choice of the clustering for model selection and model fitting does not show a relevant effect, which basically explains the choice of Ahn in [1] where the same clustering strategy is always employed.
Fig. 9. Our real-coded ECGA applied to the real deceptive function using different clustering algorithms for model selection: (a) number of evaluations and (b) population size
Fig. 10. Our real-coded ECGA applied to the real deceptive function using a model selection based on one cluster and a model fitting based on 1, 20, and 40 clusters, or the adaptive RLA clustering: (a) number of evaluations and (b) population size
Fig. 11. Our real-coded ECGA applied to the real deceptive function using the model selection and fitting strategies of rBOA [1] and a fully adaptive strategy: (a) number of evaluations and (b) population size
7.3 Discussion
The scalability analysis we performed shows that our real-coded ECGA scales up well on the two simple problems we considered. The analysis also shows that the choice of the clustering algorithm used for model selection and model fitting may have some effect but it is most likely to be problem dependent and needs further investigations. Interestingly, a fully adaptive strategy, which applies adaptive clustering both for model selection and fitting, seems to provide a good trade-off and it actually scales up as the best configuration in both problems. Accordingly, it would also be interesting, and it will be a matter of future work, to investigate a fully adaptive version of Ahn’s real-coded BOA [1].
8 Summary

In this paper, we have presented a real-coded estimation of distribution algorithm inspired by ECGA and by the real-coded Bayesian Optimization Algorithm. Our algorithm, like ECGA [13], partitions the real-valued variables into clusters, that are manipulated as independent variables, and works on probability distributions represented by marginal product models (MPMs). However, instead of mapping the real values into discrete symbols, as other real-coded ECGAs do [7, 9, 18], the proposed method models each cluster using a joint probability distribution and uses an MDL metric for continuous distributions to decide between candidate models. Similarly to rBOA [1, 2], our algorithm estimates the model parameters using several mixtures of models, one for each cluster of variables identified during the model building step. Overall, our real-coded ECGA can be either viewed as an extension of ECGA for real domains based on mixtures of models or, alternatively, as a simplification of the real-coded BOA to the case of a probabilistic model based on marginal product models (MPMs). The results of the scalability analysis we performed show that the number of evaluations required by our algorithm to solve additively separable problems scales up sub-quadratically with the problem size.
Acknowledgements

The authors wish to thank the reviewers for their comments. This work was sponsored by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant FA9550-06-1-0096, the National Science Foundation under ITR grant DMR-03-25939 at Materials Computation Center and under grant ISS-02-09199 at the National Center for Supercomputing Applications, UIUC. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Office of Scientific Research, the National Science Foundation, or the U.S. Government.
References 1. Ahn, C.W.: Theory, Design, and Application of Efficient Genetic and Evolutionary Algorithms. PhD thesis, NWC Lab, Department of Information and Communications Gwangju Institute of Science and Technology (GIST) (Febraury 2005) 2. Ahn, C.W.: Advances in Evolutionary Algorithms: Theory, Design and Practice. Studies in Computational Intelligence. Springer, Heidelberg (2006) 3. Ahn, C.W., Ramakrishna, R.: On the Scalability of Real-coded Bayesian Optimization Algorithms. IEEE Transactions on Evolutionary Computation (in press, 2008) 4. Ahn, C.W., Ramakrishna, R.S., Goldberg, D.E.: Real-Coded Bayesian Optimization Algorithm: Bringing the Strength of BOA into the Continuous World. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3102, pp. 840–851. Springer, Heidelberg (2004) 5. Bosman, P.A.: Design and application of iterated density-estimation evolutionary algorithms. PhD thesis, Utrecht University, TB Utrecht, The Netherlands (2003) 6. Chen, C.-H., Chen, Y.-P.: Real-coded ECGA for economic dispatch. In: Lipson, H. (ed.) GECCO, pp. 1920–1927. ACM, New York (2007) 7. Chen, C.-H., Liu, W.-N., Chen, Y.-P.: Adaptive discretization for probabilistic model building genetic algorithms. In: Cattolico, M. (ed.) GECCO, pp. 1103–1110. ACM, New York (2006) 8. de la Ossa, L., Sastry, K., Lobo, F.G.: χ-ary extended compact genetic algorithm in C++. IlliGAL Report No. 2006013, University of Illinois at Urbana-Champaign, Urbana, IL (March 2006) 9. Fossati, L., Lanzi, P.L., Sastry, K., Goldberg, D.E., Gomez, O.: A simple realcoded extended compact genetic algorithm. In: Proceedings of the 2007 Congress on Evolutionary Computation (CEC2007), pp. 342–348. IEEE, Singapore (2007) 10. Grahl, J., Bosman, P.A., Minner, S.: Convergence phases, variance trajectories, and runtime analysis of continuous edas on the sphere function. In: Thierens, D., et al. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference GECCO 2007, vol. I, pp. 516–522. ACM Press, New York (2007) 11. Harik, G.: Linkage learning via probabilistic modeling in the ECGA. IlliGAL Report No. 99010, University of Illinois at Urbana-Champaign, Urbana, IL (January 1999) 12. Harik, G.R.: Finding multimodal solutions using restricted tournament selection. In: Proceedings of the Sixth International Conference on Genetic Algorithms, pp. 24–31 (1995); (Also IlliGAL Report No. 94002) 13. Harik, G.R., Lobo, F.G., Sastry, K.: Linkage learning via probabilistic modeling in the ECGA. In: Pelikan, M., Sastry, K., Cant´ u-Paz, E. (eds.) Scalable Optimization via Probabilistic Modeling: From Algorithms to Applications, ch. 3, pp. 39–61. Springer, Berlin (2006) 14. Hartigan, J.: Clustering algorithms. John Wiley & Sons, New York (1975) 15. Higuchi, T., Tsutsui, S., Yamamura, M.: Theoretical analysis of simplex crossover for real-coded genetic algorithms. In: PPSN VI: Proceedings of the 6th International Conference on Parallel Problem Solving from Nature, pp. 365–374. Springer, London (2000) 16. Inza, I., Larra˜ naga, P., Sierra, B.: Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. In: Feature subset selection by estimation of distribution algorithms, pp. 265–290. Kluwer Academic Publishers, Dordrecht (2002)
17. Larra˜ naga, P., Lozano, J.A. (eds.): Estimation of distribution algorithms. Kluwer Academic Publishers, Boston (2002) 18. Li, M., Goldberg, D.E., Sastry, K., Yu, T.-L.: Real-coded ECGA for solving decomposable real-valued optimization problems. In: Srinivasan, D., Wang, L. (eds.) 2007 IEEE Congress on Evolutionary Computation, 25-28 September. IEEE Computational Intelligence Society, pp. 2194–2201. IEEE Press, Singapore (2007) 19. Lozano, J.A., Larra˜ naga, P., Inza, I., Bengoetxea, E.: Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms. Studies in Fuzziness and Soft Computing. Springer, New York (2006) 20. M¨ uhlenbein, H., Paaß, G.: From recombination of genes to the estimation of distributions I. Binary parameters. Parallel Problem Solving from Nature 4, 178–187 (1996) 21. Pelikan, M.: Hierarchical Bayesian Optimization Algorithm (Toward a New Generation of Evolutionary Algorithms). Springer, Heidelberg (2005) 22. Pelikan, M., Goldberg, D.E., Cant´ u-Paz, E.: Linkage learning, estimation distribution, and Bayesian networks. Evolutionary Computation 8(3), 314–341 (2000) 23. Pelikan, M., Goldberg, D.E., Tsutsui, S.: Combining The Strengths Of Bayesian Optimization Algorithm And Adaptive Evolution Strategies, pp. 512–519 (2002) 24. Pelikan, M., Goldberg, D.E., Tsutsui, S.: Getting the best of both worlds: Discrete and continuous genetic and evolutionary algorithms in concert. Inf. Sci. 156(3-4), 147–171 (2003) 25. Pelikan, M., Sastry, K., Cant´ u-Paz, E.: Scalable Optimization via Probabilistic Modeling: From Algorithms to Applications. Studies in Computational Intelligence. Springer, New York (2006) 26. Rissanen, J.J.: Modelling by shortest data description. Automatica 14, 465–471 (1978) 27. Sastry, K.: Evaluation-relaxation schemes for genetic and evolutionary algorithms. Master’s thesis, University of Illinois at Urbana-Champaign, Urbana, IL (2001); (Also IlliGAL Report No 2002004) 28. Sastry, K., Goldberg, D.E.: Probabilistic model building and competent genetic programming. In: Riolo, R.L., Worzel, B. (eds.) Genetic Programming Theory and Practise, ch. 13, pp. 205–220. Kluwer, Dordrecht (2003) 29. Sebag, M., Ducoulombier, A.: Extending population-based incremental learning to continuous search spaces. In: Eiben, A.E., B¨ ack, T., Schoenauer, M., Schwefel, H.-P. (eds.) PPSN 1998. LNCS, vol. 1498, pp. 418–427. Springer, Heidelberg (1998) 30. Tsutsui, S., Goldberg, D.E.: Simplex crossover and linkage identification: Singlestage evolution vs. multi-stage evolution. In: Proceedings of the 2002 Congress on Evolutionary Computation, 2002 (CEC 2002), pp. 974–979 (2002) 31. Tsutsui, S., Pelikan, M., Goldberg, D.E.: Evolutionary algorithm using marginal histogram in continuous domain. In: Optimization by Building and Using Probabilistic Models (OBUPM 2001), San Francisco, California, USA, vol. 7, pp. 230–233 (2001)
Genetic Algorithms for the Airport Gate Assignment: Linkage, Representation and Uniform Crossover

Xiao-Bing Hu1 and Ezequiel Di Paolo2

1 Centre for Computational Neuroscience and Robotics, University of Sussex
[email protected]
2 Centre for Computational Neuroscience and Robotics, University of Sussex
[email protected]

Abstract. A successful implementation of Genetic Algorithms (GAs) largely relies on the degree of linkage of building blocks in chromosomes. This paper investigates a new matrix representation in the design of GAs to tackle the Gate Assignment Problem (GAP) at airport terminals. In the GAs for the GAP, a chromosome needs to record the absolute positions of aircraft in the queues to gates, and the relative positions between aircraft are the useful linkage information. The proposed representation is especially effective at handling these linkages in the case of the GAP. As a result, a powerful uniform crossover operator, free of feasibility problems, can be designed to identify, inherit and protect good linkages. To resolve the memory inefficiency problem caused by the matrix representation, a special representation transforming procedure is introduced in order to better trade off between computational efficiency and memory efficiency. Extensive comparative simulation studies illustrate the advantages of the proposed GA scheme.
1 Introduction

Since Genetic Algorithms (GAs) were proposed by Holland [1], they have been broadly and successfully applied to solve problems in numerous domains. As the scale and complexity of problems handled by GAs increase, researchers have begun to realize that, for practical use, certain crucial mechanisms have to be integrated into the framework of GAs. Among these crucial mechanisms suggested by practitioners is the ability to learn linkage, referred to as the relationship between variables or elements represented by a chromosome. In the past few decades, there has been growing recognition that effective GAs demand an understanding of linkage in order to tackle complicated, large-scale problems [1]-[5]. Studies have shown that easy problems can be solved by any ordinary GA, but when harder problems are considered, scalability has been elusive. As indicated by the results reported in [6], even separable problems could be exponentially hard if the knowledge of the variable groups were not available. Therefore, a successful implementation of GAs largely depends on a good understanding of the relationship between variables or elements, i.e., the linkage of building blocks
of a chromosome, an appropriate representation of linkage, and some effective methods to distinguish between good and bad linkage, and to store linkage information as well [5]. Linkage-related techniques are diverse, sophisticated, and highly integrated with each problem-specific implementation of GAs. This paper, bearing the linkage concept in mind, attempts to design an effective and efficient GA for tackling the Gate Assignment Problem (GAP) at airport terminals. To this end, we need to identify and analyze the useful linkage between the elements in a solution to the GAP, to choose a suitable chromosome structure to represent the useful linkage, and then to design effective evolutionary operators, particularly crossover, to take advantage of the useful linkage. As the bottleneck resource in the air transportation system, airports play an extremely important role in the campaign to meet the constantly increasing demands for air traffic services. The Gate Assignment Problem (GAP) at airport terminals is a major issue during the daily airport operations, which involves a set of aircraft with arrival and departure times specified in monthly or quarterly master flight schedules and a set of gates with considerations from airlines and the airport. In other words, the airport GAP aims to assign aircraft to terminal gates to meet operational requirements while minimizing both inconveniences to passengers and operating costs of airports and airlines. The term ’gate’ is used to designate not only the facility through which passengers pass to board or leave an aircraft but also the parking positions used for servicing a single aircraft. These station operations usually account for a smaller part of the overall cost of an airline’s operations than the flight operations themselves. However, they can have a major impact on the efficiency with which the flight schedules are maintained and on the level of passenger satisfaction with the service [7], [8]. Those considerations, or criteria, adopted in the GAP are mainly based on the availability and the compatibility of gates, their sizes and capacities, operational rules, service and managerial directions. Basically, aircraft should be assigned to gates in an optimal way such that some primary criteria can be best satisfied, e.g., the distance a passenger is required to walk in an airport to reach his departure gate, the baggage claim area, or his connecting flight can be minimized, the distance baggage is transported between aircraft and terminal or another aircraft can be significantly shortened, and the time an aircraft needs to wait before dwelling on a gate can be made as short as possible. Many optimization methods have been reported in the past few decades to address the GAP at airport terminals. For instance, passenger walking distance has been widely studied in the GAP research, and methods such as branch-and-bound algorithms [8], [9], integer programming [10], linear programming [11], expert systems [12], [13], heuristic methods [7], tabu search algorithms [14] and various hybrid methods [15], [16] were reported to minimize this distance. Baggage transport distance has been relatively less discussed in the GAP literature [7], [17]-[19], but the algorithms developed to solve the minimum passenger walking distance GAP can be easily extended to the case where baggage transport distance needs to be considered [7]. During the peak hours, it often
happens that, particularly at hub airports, the number of aircraft waiting to dwell exceeds the number of available gates. In this case, aircraft waiting time on the apron should also be minimized [15], [20], [21]. The gate idle time is a criterion often used to assess the efficiency of using gate capacity [22]. The multi-objective GAP is discussed in [23], where passenger walking distance and passenger waiting time were both considered, the GAP was modelled as a zero-one integer program, and a hybrid method was developed based on the weighting method, the column approach, the simplex method and the branch-and-bound technique.

As large-scale parallel stochastic search and optimization algorithms, GAs have a good potential for solving NP-hard problems such as the GAP. For instance, Reference [21] developed a GA to minimize the delay time during the gate reassignment process. Reference [22] proposed a unified framework to specifically treat gate idle time in the previous GAP models, and then developed a problem-specific knowledge-based GA. More recently, Reference [24] reported an efficient GA with Uniform Crossover (UC) for the multi-objective airport GAP, where a novel matrix representation of the GAP queues plays a crucial role in the effectiveness and efficiency of the proposed GA. Based on the work reported in [24], this paper aims to investigate the importance of linkage information in the design of efficient GAs for the multi-objective airport GAP. As will be shown later, identifying the useful linkage information in the GAP, i.e., the relative positions of aircraft in the GAP queues, is a key step in the success of the GA design. With this linkage information, the matrix representation and uniform crossover can work efficiently in the evolutionary process of the GA.

The remainder of this paper is organized as follows. In Section 2, a mathematical model of the multi-objective airport GAP is provided. The useful linkage information in the GAP and the associated representations are discussed in Section 3. The design of the GA is described in Section 4. Section 5 gives some simulation results, particularly analyzing the usefulness of the linkage information in the GA, and the paper ends with some conclusions in Section 6.
2 Problem Formulation of the Multi-objective GAP

The GAP in this paper focuses on the three most common considerations: passenger walking distance, baggage transport distance, and aircraft waiting time on the apron. Passenger walking distance has a direct impact on customer satisfaction. The typical walking distances considered in airports are: (I) the distance from check-in to gates for embarking or originating passengers, (II) the distance from gates to baggage claim areas (check-out) for disembarking or destination passengers, and (III) the distances from gate to gate for transfer or connecting passengers. Baggage transport distance occurs when baggage is transferred between aircraft and baggage claim areas. Basically, these distances can be reduced by improving the method by which scheduled flights are assigned to the airport terminal gates. Aircraft waiting time on the apron is the difference between the allocated entering time to gates and the planned entering time to gates. Due to
the shortage of gates at peak hours, some scheduled aircraft have to wait extra time on the apron, which can end up delaying departures and even cause passengers to miss connecting flights. Although this kind of ground delay is more tolerable than airborne delay in terms of safety and costs, it largely affects customer satisfaction. Besides, aircraft waiting time can help address another big issue in the GAP: the efficiency of using gate capacity, which is often represented by how even the distribution of idle times is. In the minimum distance GAP, some special constraints have to be included in order to avoid most aircraft being assigned to the same single gate, which, however, is automatically ensured by minimizing aircraft waiting time. Therefore, in this paper, we construct an objective function by combining the above three considerations.

A simple way to conduct gate assignment is the first-come-first-served (FCFS) principle according to planned entering time to gates, but the result is usually not optimal or even near-optimal, because the FCFS principle does not take into account the layout of airport terminals. Even for a queue at a single gate, the FCFS principle is not the first option, mainly because different aircraft may have different ground times and different numbers of passengers. Obviously, putting ahead an aircraft with more passengers and less ground time can bring benefits, even if its planned entering time is later. Fig. 1 gives a simple illustration of the GAP.

Suppose N_AC aircraft need to be assigned to N_G gates during a given time period [T_S, T_E]. Let P_i and G_i denote the planned entering time to gates and the ground time of the ith aircraft in the original set of arrival and departure aircraft, respectively. Assume P_i and G_i are known in advance. In this paper, the planned entering time to gates for arrival aircraft is assumed to be the scheduled arrival time to the airport (A_i), and the planned entering time for departing aircraft is the scheduled departure time (D_i) minus the ground time, i.e., P_i = D_i - G_i. Here we consider aircraft rather than flights because, normally, each individual aircraft can relate to two different flights (with different flight numbers): one is an arrival flight associated with an A_i, and the other is a departure flight, whose D_i is ideally determined as the A_i plus the G_i. Since the two flights associated with the same aircraft dwell at the same gate, for the sake of simplicity, we use aircraft rather than flights for the modelling. Sometimes, more than two commuter flights are actually related to the same physical aircraft. In this case, each physical aircraft is considered as one or several shadow aircraft, each of which relates to two successive flights. Therefore, N_AC is actually the number of all shadow aircraft. In real gate assignment operations, the P_i of an arrival aircraft can be different from the A_i, normally later. It often happens that some arrival aircraft have to wait on the apron due to the shortage of gates. Therefore there can be a waiting time before an aircraft can dwell at a gate. The G_i of an aircraft is assumed to be optimal and fixed, i.e., the airport and airlines have predetermined the minimum time span for each individual aircraft to dwell at gates according to various considerations, such as operational efficiency and redundancy.
[Figure: a batch of arrival and departing aircraft (AC1 to AC10) with planned entering times to gates is mapped by the MOGAP onto optimal aircraft queues (Queue 1 to Queue 4) at the terminal gates (Gate 1 to Gate 4), each with an allocated entering time; the entrance/exit of the terminal is also shown.]
Fig. 1. Illustration of airport GAP
For example, for the sake of redundancy, Reference [14] used a buffer time between an aircraft's departure time and the next aircraft's arrival time at the same gate. In this paper, the G_i is assumed to include this buffer time, which means the G_i is longer than the actual time span the aircraft physically dwells at a gate. Different aircraft may have different predetermined G_i, but, for the sake of simplicity, we assume that the same G_i always applies to a given aircraft no matter which gate it dwells at.

Most existing methods use binary variables in the modelling of the GAP, e.g., see [7], [9], and [22]. These binary variables are used for each possible assignment: if, at a certain time instant, aircraft i is assigned to gate g, then the associated binary variable is set to 1; otherwise it is set to 0. The usage of binary variables is a key technique for those IP, LP or QAP formulation based methods; without them, such formulations of the GAP would be impossible. However, from the programming point of view, the usage of binary variables can cause memory inefficiency problems, because for each aircraft at each time instant, besides the binary variable that indicates which gate it is assigned to, another N_G - 1 binary variables are required to show which gates it is not assigned to. These N_G - 1 extra binary variables are clearly redundant in practice, but they are crucial to formulate the GAP as an IP, LP or QAP. Reference [7] even introduced more binary variables in order to transform the QAP into an LP. As mentioned by [14], billions of binary variables would be required before those methods could be applied to a real GAP, which means there could be a scalability problem. GAs really do not need those binary variables for the modelling of the GAP. Actually, GAs can be designed based on almost any kind of formulation of the GAP. This paper attempts to model the GAP in a more straightforward way according to the physical process of gate assignment. Let Q_g denote the queue at gate g, Q_g(j) the jth aircraft in Q_g, g = 1, ..., N_G, j = 1, ..., H_g, and H_g the number of aircraft in Q_g, satisfying

$$\sum_{g=1}^{N_G} H_g = N_{AC}. \qquad (1)$$
Q_g(j) = i means the ith aircraft in the original set is assigned as the jth aircraft to dwell at gate g. The allocated entering time to gates (E_i) for the ith aircraft in the original set can then be calculated as

$$E_{Q_g(j)} = \begin{cases} P_{Q_g(j)}, & j = 1 \\ \max\bigl(P_{Q_g(j)},\; E_{Q_g(j-1)} + G_{Q_g(j-1)}\bigr), & j > 1 \end{cases} \qquad j = 1, \ldots, H_g,\; g = 1, \ldots, N_G. \qquad (2)$$

The waiting time on the apron for the ith aircraft in the original set is

$$W_i = E_i - P_i, \qquad i = 1, \ldots, N_{AC}. \qquad (3)$$
For the sake of simplicity of modelling, besides the N_G real gates, the entrance/exit of the airport terminal is usually considered as a dummy gate (e.g., see [7]), and we call it gate N_G + 1 in this paper. Associated with this dummy gate N_G + 1, we introduce a dummy aircraft N_AC + 1. Of course there is no real aircraft queue for this dummy gate, except the dummy aircraft, which dwells at the dummy gate all the time. Three data matrices, M_P ∈ R^((N_AC+1)×(N_AC+1)), M_PWD ∈ R^((N_G+1)×(N_G+1)), and M_BTD ∈ R^((N_G+1)×(N_G+1)), are used to record the number of passengers transferred between aircraft, the passenger walking distances between gates, and the baggage transferring distances between gates, respectively. Given i ≤ N_AC and j ≤ N_AC, the value of M_P(i, j) is the number of passengers transferred from aircraft i to aircraft j, M_P(i, N_AC + 1) records the number of arriving passengers from aircraft i to the exit, i.e., the dummy aircraft N_AC + 1, and M_P(N_AC + 1, j) the number of departing passengers from the entrance to aircraft j. For those passengers who just pass through the airport with a long-haul aircraft, we assume they do not leave the aircraft when the aircraft stops at the airport. Therefore, we always have M_P(i, i) = 0 for i = 1, ..., N_AC + 1. M_PWD(i, j) is the passenger walking distance from gate i to gate j, and M_BTD(i, j) the baggage transferring distance from gate i to gate j. Although M_PWD(N_G + 1, N_G + 1) = 0, we do not have M_PWD(i, i) = 0, i = 1, ..., N_G, because, even though passengers transfer between two aircraft which are successively assigned to the same gate, they still need to leave the first aircraft and wait in a terminal lounge before they can board the second aircraft. The same holds for M_BTD, i.e., M_BTD(N_G + 1, N_G + 1) = 0 but M_BTD(i, i) ≠ 0. Besides these three matrices, we still need a data vector V_G = [ν_1, ..., ν_(N_AC+1)], where 1 ≤ ν_i ≤ N_G + 1 indicates that the ith aircraft in the original set is assigned to gate ν_i, and ν_(N_AC+1) = N_G + 1 means the dummy aircraft N_AC + 1 is always assigned to the dummy gate N_G + 1. Now we can calculate the total passenger walking distance (TPWD), the total baggage transferring distance (TBTD), and the total passenger waiting time (TPWT) as
$$J_{TPWD} = \sum_{g=1}^{N_G+1} \sum_{j=1}^{H_g} \sum_{i=1}^{N_{AC}+1} M_P(Q_g(j), i)\, M_{PWD}(g, \nu_i), \qquad (4)$$

$$J_{TBTD} = \sum_{g=1}^{N_G+1} \sum_{j=1}^{H_g} \sum_{i=1}^{N_{AC}+1} M_P(Q_g(j), i)\, M_{BTD}(g, \nu_i), \qquad (5)$$

$$J_{TPWT} = \sum_{i=1}^{N_{AC}} W_i \sum_{j=1}^{N_{AC}+1} \bigl(M_P(i, j) + M_P(j, i)\bigr), \qquad (6)$$
respectively. In the MOGAP of this paper, the following weighted objective function is used to cover these three aspects:

$$J_{MOGAP} = \alpha J_{TPWD} + \beta J_{TBTD} + (1 - \alpha - \beta)\varphi J_{TPWT}, \qquad (7)$$

where α and β are tuneable weights to adjust the contributions of TPWD, TBTD and TPWT,

$$\alpha + \beta \le 1, \quad 0 \le \alpha \le 1, \quad 0 \le \beta \le 1, \qquad (8)$$

and φ is a system parameter that makes the waiting time comparable to the distances. In this paper, the distances are measured in meters and the times in minutes. Assuming an average passenger walking speed of 3 km/h, one minute of waiting time for a passenger can be considered equivalent to 50 meters of extra walking distance. In this paper, we take half of that, i.e., set φ = 25, because we assume that walking is physically more uncomfortable for passengers than waiting. The MOGAP can now be mathematically formulated as a minimization problem:

$$\min_{Q_1, \ldots, Q_{N_G}} J_{MOGAP}, \qquad (9)$$

subject to (1) to (8). Clearly, how to assign aircraft to different gates to form the N_G queues and how to organize the order of aircraft in each queue together compose a solution, i.e., Q_1, ..., Q_{N_G}, to the minimization problem (9). Unlike other existing GAP models, the above formulation of the MOGAP needs no binary variables, owing to the usage of Q_1, ..., Q_{N_G}.
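As a concrete illustration (not part of the original chapter), the following Python sketch evaluates one candidate solution against Eqs. (2)-(7). It assumes 0-based aircraft and gate indices, numpy arrays for the data matrices, that the last row/column of M_P is the dummy aircraft (entrance/exit) and the last index of M_PWD/M_BTD is the dummy gate, and that every real aircraft appears in exactly one queue; all function and parameter names are illustrative.

```python
import numpy as np

def allocated_times(queues, P, G):
    """Eqs. (2)-(3): allocated entering times E and apron waiting times W.
    queues[g] is the ordered list of aircraft indices dwelling at real gate g."""
    E = np.array(P, dtype=float)
    for q in queues:
        for pos in range(1, len(q)):
            prev, ac = q[pos - 1], q[pos]
            E[ac] = max(P[ac], E[prev] + G[prev])
    return E, E - np.array(P, dtype=float)

def mogap_objective(queues, P, G, M_P, M_PWD, M_BTD, alpha, beta, phi=25.0):
    """Eqs. (4)-(7): weighted cost of one gate assignment."""
    n_ac = len(P)
    dummy_gate = M_PWD.shape[0] - 1
    gate_of = np.full(n_ac + 1, dummy_gate)      # dummy aircraft sits at the dummy gate
    for g, q in enumerate(queues):
        for ac in q:
            gate_of[ac] = g
    _, W = allocated_times(queues, P, G)
    tpwd = tbtd = 0.0
    for ac in range(n_ac + 1):                   # each aircraft (incl. dummy) appears once
        for i in range(n_ac + 1):
            tpwd += M_P[ac, i] * M_PWD[gate_of[ac], gate_of[i]]
            tbtd += M_P[ac, i] * M_BTD[gate_of[ac], gate_of[i]]
    tpwt = sum(W[i] * (M_P[i, :].sum() + M_P[:, i].sum()) for i in range(n_ac))  # Eq. (6)
    return alpha * tpwd + beta * tbtd + (1.0 - alpha - beta) * phi * tpwt         # Eq. (7)
```

Here `queues` is simply a list of lists of aircraft indices, one list per real gate, so the formulation above translates directly into a fitness function for the GA described later.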
3 Linkage Information and Expressions in GAP

In this section we discuss the linkage information that is useful for the design of GAs for the airport GAP. The general definition of linkage is given first, the GAP-specific linkage information is then identified, and some possible expressions of the linkage information in the GAP are discussed.

3.1 General Definition of Linkage
As is well known, the GA is a powerful search methodology that imitates the procreation process and operates on the principle of the survival of the fittest. Understanding the bond and resemblance between the (natural) biological system and the (artificial) genetic and evolutionary algorithm may help clarify the role and importance of learning linkage.
Fig. 2. Linkage in biological systems
In biological systems, linkage refers to a level of association in the inheritance of two or more non-allelic genes that is higher than would be expected from independent assortment [31]. During meiosis, crossover events may occur between strands of the chromosome so that genetic material is recombined. Therefore, if two genes are closer to each other on a chromosome, there is a higher probability that they will be inherited by the offspring together. Genes are said to be linked when they reside on the same chromosome, and the distance between them determines the level of their linkage. Fig. 2 gives an illustrative example of different genetic linkage between two genes. The upper part shows that if the genes are closer, they are likely to maintain the allele configuration. The lower part shows that if the genes are far away from each other, a crossover event is likely to separate them and change the configuration. In summary, the closer together a set of genes is on a chromosome, the less likely it is to be split by chromosomal crossover during meiosis.

When applying genetic algorithms, we usually use strings of characters drawn from a finite alphabet as chromosomes, and genetic operators to manipulate these artificial chromosomes. Reference [1] suggested that genetic operators which can learn linkage information for recombining alleles might be necessary for genetic and evolutionary algorithms to succeed. Many well known and widely employed crossover operators, including one-point crossover and two-point crossover, are, like their biological counterparts, subject to the linkage embedded in the chromosome representation. For example, if we have a 6-bit function consisting of two independent 3-bit subfunctions, three possible coding schemes for the 6-bit chromosome can be constructed as shown in Fig. 3, where C_n(A) is coding scheme n for an individual A, and a_i^j is the ith gene of A belonging to the jth subfunction. Taking one-point crossover as an example, it is easy to see that genes belonging to the same subfunction of individuals encoded with C1 are unlikely to be separated by crossover events. However, if the individuals are encoded with C2, genes of the same subfunction are split in almost every crossover event.
Fig. 3. An example of linkage in GA
For C3, genes of subfunction 1 are easily disconnected, while genes of subfunction 2 are likely to stay together or to be transferred together. From the viewpoint of genetic algorithms, linkage is used to describe and measure how close the genes that belong to a building block are on a chromosome. In addition to pointing out the linkage phenomenon, Reference [1] also suggested that the chromosome representation should adapt during the evolutionary process to avoid the potential difficulty directly caused by the coding scheme, which was identified as coding traps, the combination of loose linkage and deception among lower order schemata [32]. Because encoding solutions as fixed strings of characters is common in genetic algorithm practice, it is easy to see that linkage can be identified with the ordering of the loci of genes, as in the examples given above. It is clear that for simple genetic algorithms with fixed genetic operators and chromosome representations, one of the essential keys to success is a good coding scheme that puts genes belonging to the same building blocks together on the chromosome to provide tight linkage of building blocks. The linkage of building blocks dominates all kinds of building-block processing, including creation, identification, separation, preservation, and mixing. However, in the real world, it is usually difficult to know such information a priori. As a consequence, handling linkage for genetic
algorithms to succeed is very important. For a good survey of linkage in GAs, readers may refer to [5].
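The effect of the coding schemes above can be checked with a toy Monte-Carlo experiment (not from the chapter); the loci and numbers below are purely illustrative. Under C1 the three genes of a subfunction sit on adjacent loci, while under C2 they are spread out, and one-point crossover separates the spread-out set far more often.

```python
import random

def split_probability(positions, length=6, trials=10000):
    """Monte-Carlo estimate of how often one-point crossover separates a set of loci."""
    hits = 0
    for _ in range(trials):
        cut = random.randint(1, length - 1)            # cut falls between locus cut-1 and cut
        left = [p for p in positions if p < cut]
        hits += 0 < len(left) < len(positions)          # some loci on each side -> split
    return hits / trials

# Tight coding (subfunction on loci 0-2) vs. loose coding (loci 0, 2, 4):
print(split_probability([0, 1, 2]), split_probability([0, 2, 4]))   # roughly 0.4 vs. 0.8
```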
3.2 Useful Linkage Information in GAP
Linkage information, or the relationship between the building blocks of a chromosome, is completely problem-dependent. Here we analyze and identify the useful linkage information in the airport GAP. Naturally, one may think the basic elements in a GAP solution are the mathematical characters which represent all the aircraft that need to be assigned to gates. Basically, grouping aircraft according to gates, i.e., assigning aircraft to different gates, is the most important step in the GAP, because it has a direct influence on both walking distances and the idle times of gates. If aircraft waiting time on the apron is not under consideration, then optimal grouping plus the FCFS principle can produce the best way of utilizing the gates at airport terminals. The GA proposed in [22] was designed based on aircraft grouping information. The encoding scheme adopted in [22] is illustrated in Fig. 4(b), where a gene C(i) = g means the ith aircraft in the original set of aircraft is assigned to dwell at gate g. Hereafter, we refer to aircraft grouping as gate assignment. However, it is difficult to find any useful linkage information here, because, according to the encoding scheme based on gate assignment, each gene in Fig. 4(b) can be considered as a subfunction, which means that, no matter what crossover is applied, no subfunction will be split. Therefore, the concept of linkage is of little use to the encoding scheme in Fig. 4(b). Clearly, the encoding scheme based on gate assignment is not concerned with the order of aircraft in the queue to each gate, which is crucial to the minimization of aircraft waiting time on the apron.
Fig. 4. Encoding schemes in GAP (see text)
Fig. 5. Common linkage information in GAP
As discussed before, different aircraft may have different numbers of passengers and different ground times, and therefore switching the positions of some aircraft in a FCFS-principle-based queue can reduce the total passenger waiting time, which is another criterion to assess the level of customer satisfaction with the service. The GA proposed for the arrival sequencing and scheduling problem in [25] can be modified and extended to handle the position switching in the GAP. The encoding scheme is illustrated in Fig. 4(c), where one can see that the absolute positions of aircraft in queues to gates are used to construct chromosomes, i.e., a gene C(g, j) = i means the ith aircraft in the original set of aircraft is assigned as the jth aircraft to dwell at gate g. Apparently, the underlying physical meaning of a chromosome, i.e., queues to gates, is expressed in a straightforward way by the absolute-position-based structure. Based on the position of aircraft in queues, one can easily define the linkage information as the distance between two aircraft in a queue, but is this linkage information useful to a GA for the GAP? Actually, it is not the absolute position of aircraft in queues, but the relative position of aircraft in queues that affects
passenger waiting times. Basically, if two aircraft are assigned to a gate successively, then a desirable order for this pair of aircraft is often that the aircraft with more passengers and less ground time should dwell first, regardless of the absolute position of the leading aircraft in the pair. Therefore, the useful linkage information in the GAP is the pairing of aircraft in queues, i.e., as illustrated in Fig. 4(a), every pair of successive aircraft in a queue composes a unit of useful linkage information in the GAP. An efficient GA for the GAP should be able to take advantage of this useful linkage information. It should efficiently identify good pairings of aircraft in queues, and then inherit and protect them.

There are two encoding schemes which can express the linkage information associated with pairings of aircraft. One has already been discussed above, i.e., the absolute-position-based encoding scheme in Fig. 4(c). This scheme can be considered as using physical linkage, since the linkage information emerges from the physical locations of two or more genes on the chromosome. However, under this encoding scheme, it is difficult to carry out genetic operations on common linkage information, i.e., to identify, inherit and protect it, in parent chromosomes, as shown in Fig. 5(a). The other encoding scheme adopts virtual linkage, as illustrated in Fig. 4(d), where a matrix with a dimension of (N_AC + 1) × N_AC is used to record the linkage information in the GAP queues directly. The first N_AC × N_AC genes, i.e., C(i, j), i = 1, ..., N_AC, j = 1, ..., N_AC, record the relative positions between aircraft in queues, i.e., the pairings of aircraft in queues, and the last N_AC genes, i.e., C(N_AC + 1, j), j = 1, ..., N_AC, record the gate assignment. If C(i, i) = 1 and C(N_AC + 1, i) = g, this means the ith aircraft in the original set of aircraft is assigned as the first aircraft to dwell at gate g; if C(i, j) = 1 and C(N_AC + 1, j) = g, this means aircraft j is assigned to follow aircraft i to dwell at gate g. As illustrated in Fig. 5(b), under this encoding scheme, common linkage information can easily be identified: if C1(i, j) & C2(i, j) = 1, then this gene represents common linkage information, i.e., a common pairing of aircraft.
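The following Python sketch (not from the chapter) shows one possible realization of the relative-position-based matrix of Fig. 4(d) and of the element-wise check for common genes. It assumes 0-based aircraft indices, gates numbered from 1, queues given as a list of lists (one per gate), and numpy integer matrices; the function names are illustrative.

```python
import numpy as np

def encode_relative(queues, n_ac):
    """Builds the (N_AC+1) x N_AC matrix of Fig. 4(d): C[i, j] = 1 if aircraft j is first
    at its gate (i == j) or directly follows aircraft i (i != j); the last row stores the
    gate (numbered 1..N_G) assigned to each aircraft."""
    C = np.zeros((n_ac + 1, n_ac), dtype=int)
    for g, q in enumerate(queues, start=1):
        for pos, ac in enumerate(q):
            C[n_ac, ac] = g
            C[ac if pos == 0 else q[pos - 1], ac] = 1
    return C

def common_genes(C1, C2):
    """Common pairings (C1(i,j) & C2(i,j) = 1) and common gate assignments of two parents."""
    n_ac = C1.shape[1]
    links = C1[:n_ac, :] & C2[:n_ac, :]
    gates = np.where(C1[n_ac, :] == C2[n_ac, :], C1[n_ac, :], 0)
    return links, gates
```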
4 Design of GA for GAP

This section discusses how to design an efficient GA that is particularly capable of taking advantage of the useful linkage information in the GAP.

4.1 Choice of Chromosome Structure
In the airport GAP, the total passenger waiting time is sensitive to the relative positions between aircraft in queues, i.e., the linkage information in the GAP. Therefore, a good chromosome structure should be able to capture this linkage information. Both the absolute-position-based encoding scheme and the relative-position-based encoding scheme discussed in Section 3.2 can meet this demand. Here we further analyze the merits and demerits of these two encoding schemes from the following aspects:

• A chromosome structure that is efficient in terms of handling common linkage information is preferable. As analyzed in Section 3.2, the relative-position-based
encoding scheme has an obvious advantage because the linkage information is directly recorded in each gene, while the absolute-position-based encoding scheme is much less friendly to efficient crossover operators unless some additional measures are taken in order to abstract the useful knowledge on pairings of aircraft in queues.

• Feasibility is a crucial issue in the design of an efficient chromosome structure. Due to the underlying physical meaning, some special constraints usually must be satisfied by chromosomes. For chromosomes based on absolute positions of aircraft, feasibility is defined by two constraints: (I) each aircraft appears once and only once in a chromosome, and (II) if C(g, j) > 0, then C(g, h) > 0 for all 1 ≤ h < j. Under the relative-position-based scheme, a feasible chromosome must satisfy the following five constraints according to the underlying physical meaning in the airport GAP:

$$\sum_{i=1}^{N_{AC}} \sum_{j=1}^{N_{AC}} C(i, j) = N_{AC}, \qquad (10)$$

$$\sum_{j=1}^{N_{AC}} C(i, j) \;\le\; \begin{cases} 2, & C(i, i) > 0 \\ 1, & C(i, i) = 0 \end{cases}, \qquad i = 1, \ldots, N_{AC}, \qquad (11)$$

$$\sum_{i=1}^{N_{AC}} C(i, j) = 1, \qquad j = 1, \ldots, N_{AC}, \qquad (12)$$

$$1 \le \sum_{i=1}^{N_{AC}} C(i, i) = \bar{N}_G \le N_G, \qquad (13)$$

$$\sum_{C(N_{AC}+1, j) = g,\; j = 1, \ldots, N_{AC}} C(j, j) = 1 \quad \text{for any } g \in \bar{\Phi}_G, \qquad (14)$$
where, without loss of generality, it is assumed that only N̄_G gates out of all N_G gates are assigned to aircraft, and Φ̄_G denotes the set of assigned gates. Constraints (10) to (14) are actually a new version of the two feasibility constraints for chromosomes based on absolute positions. From constraints (10) and (11), one can derive that there may often be some empty rows, at most N̄_G of them, in the matrix. If the ith row is empty, then aircraft i is the last aircraft to dwell at gate C(N_AC + 1, i).

• Memory efficiency is another concern when choosing the chromosome structure. According to Fig. 4, an absolute-position-based chromosome is composed of N_G × N_AC genes, while a relative-position-based one has (N_AC + 1) × N_AC genes. Clearly, the latter can confront an O((n + 1) × n) memory problem when too many aircraft need to be considered at one time.
[Figure: a generation of chromosomes in the absolute-position-based format undergoes mutation in that format; parent chromosomes are then transformed into their relative-position-based peers, crossover operates in the relative-position-based format, and the offspring are transformed back into the absolute-position-based format to form the new generation.]
Fig. 6. Combination of two encoding schemes and representation transforming procedure
In order to obtain the merits of both encoding schemes and avoid their demerits, in other words, in order to be efficient at handling linkage information, simple regarding feasibility, and cheap in terms of memory demand, we combine the above two encoding schemes to construct chromosomes by introducing a special representation transforming procedure in our GA for the airport GAP. As shown in Fig. 6, each generation of chromosomes is constructed under the absolute-position-based encoding scheme, which is more memory-efficient. The mutation operator is also designed to operate on the absolute-position-based structure, as it has fewer feasibility constraints. However, the relative-position-based scheme needs to be used to design an efficient crossover operator that identifies, inherits and protects common linkage information in parent chromosomes. To this end, before two absolute-position-based parent chromosomes can be crossed over, they need to be transformed into their corresponding relative-position-based peers, and after the crossover operation, the relative-position-based offspring chromosomes need to be transformed back to the absolute-position-based format in order to be consistent with the generation's format. Fortunately, transforming from the absolute-position-based format to the relative-position-based format, or in the reverse direction, is a rather straightforward process. In an absolute-position-based chromosome, if C(g, 1) = i, then in its relative-position-based peer one has C(i, i) = 1 and C(N_AC + 1, i) = g; if C(g, h) = i and C(g, h + 1) = j, then in its relative-position-based peer one has C(i, j) = 1 and C(N_AC + 1, j) = g. Vice versa.
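The reverse transformation just described can be sketched in Python as follows (not from the chapter; it is the inverse of the encode_relative() sketch given in Section 3.2, assumes a feasible chromosome satisfying constraints (10)-(14), and uses illustrative names).

```python
def decode_relative(C, n_gates):
    """Rebuild the per-gate queues (index g-1 for gate g) from a feasible
    relative-position-based chromosome produced by encode_relative()."""
    n_ac = C.shape[1]
    successor, firsts = {}, []
    for j in range(n_ac):
        i = next(k for k in range(n_ac) if C[k, j] == 1)   # exactly one entry per column, Eq. (12)
        if i == j:
            firsts.append(j)            # C(j, j) = 1: first aircraft at its gate
        else:
            successor[i] = j            # C(i, j) = 1: aircraft j follows aircraft i
    queues = [[] for _ in range(n_gates)]
    for head in firsts:
        g, ac = int(C[n_ac, head]) - 1, head   # gates are numbered from 1 in the last row
        while True:
            queues[g].append(ac)
            if ac not in successor:
                break
            ac = successor[ac]
    return queues
```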
4.2 Mutation Operator
Mutation is used by GAs to diversify chromosomes in order to explore the solution space as widely as possible. In the case of the airport GAP, the mutation operation should be able to reassign an aircraft to any gate in any order. Therefore, we need two mutation operators: (I) one to randomly shift the positions of two successive aircraft in the same queue, and (II) the other to randomly swap aircraft in two different queues, or to remove an aircraft from one queue and then append it to the end of another queue. The chromosome structure based on gate assignment
only supports the second mutation. The structure based on absolute positions supports both well, as denoted as follows:

Mutation I: C(g, j) ↔ C(g, j + 1), j = 1, ..., H_g - 1, g = 1, ..., N_G.

Mutation II: C(g_1, j) ↔ C(g_2, k), j = 1, ..., H_{g_1} and k = 1, ..., H_{g_2} + 1, g_1 ≠ g_2, g_1 = 1, ..., N_G, g_2 = 1, ..., N_G.
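A minimal Python sketch of the two mutations on the absolute-position-based (queue-list) structure is given below (not from the chapter). It assumes queues is a list of lists indexed by gate, that there are at least two gates, and that position k = H_{g2} + 1 means appending to the end of queue g_2; names are illustrative.

```python
import random

def mutation_I(queues):
    """Mutation I: swap two successive aircraft within one randomly chosen queue."""
    candidates = [g for g, q in enumerate(queues) if len(q) >= 2]
    if not candidates:
        return
    q = queues[random.choice(candidates)]
    j = random.randrange(len(q) - 1)
    q[j], q[j + 1] = q[j + 1], q[j]

def mutation_II(queues):
    """Mutation II: swap aircraft between two different queues, or move an aircraft
    to the end of another queue."""
    g1, g2 = random.sample(range(len(queues)), 2)
    if not queues[g1]:
        return
    j = random.randrange(len(queues[g1]))
    k = random.randrange(len(queues[g2]) + 1)
    if k == len(queues[g2]):
        queues[g2].append(queues[g1].pop(j))          # append to the end of queue g2
    else:
        queues[g1][j], queues[g2][k] = queues[g2][k], queues[g1][j]
```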
4.3 Crossover Operator
[Figure: using the gate-assignment chromosomes of Fig. 4(b), two parents are recombined by one-point crossover (a single split location, common genes kept) and by uniform crossover (common genes are identical to both parents, while each non-common gene is randomly inherited from parent 1 or parent 2).]
Fig. 7. Uniform crossover vs. one point crossover
Crossover is an effective evolutionary process, mainly operating on common genes, that speeds up a GA's convergence to optima or sub-optima. Uniform crossover is probably the most widely used crossover operator because of its efficiency in not only identifying, inheriting and protecting common genes, but also re-combining non-common genes [26]-[28]. Fig. 7, using the chromosome structure based on gate assignment in Fig. 4(b), compares uniform crossover with another widely used crossover operator: one-point crossover. From Fig. 7 one can see that uniform crossover is actually the ultimate multi-point crossover, which is obviously much more powerful than one-point crossover in terms of exploiting all possibilities of recombining non-common genes. Designing an effective and efficient uniform crossover operator to handle common linkage information in GAP queues is a major objective of this paper. Although the chromosome structure based on absolute positions of aircraft contains the information on relative positions of aircraft, due to the feasibility issue in chromosomes, it is very difficult to design an effective crossover operator for the absolute-position-based format that identifies, inherits and protects common relative positions. Fortunately, under the relative-position-based encoding scheme, both common gate assignments and common relative positions between aircraft in queues can easily be handled by uniform crossover, which can exploit all possibilities of re-combining non-common genes at the same time. The feasibility issue of the uniform crossover operator can also be addressed in a computationally cheap way.
[Figure: the uniform crossover on relative-position-based matrices, shown in five panels: (a) two parent chromosomes with some common genes; (b) identify common genes; (c) assign gates (each assigned gate has a first dwelling aircraft); (d) indicate infeasible genes related to relative positions; (e) set relative positions and produce an offspring.]
Fig. 8. Uniform crossover using relative-position-based chromosome structure
This novel uniform crossover operator is described by the following procedure, which is further illustrated by Fig. 8:

Step 1: Given two parent chromosomes C_1 and C_2, calculate C_3 to locate the common genes:

$$C_3(i, j) = C_1(i, j)\,\&\,C_2(i, j), \qquad C_3(N_{AC}+1, j) = \begin{cases} C_1(N_{AC}+1, j), & C_1(N_{AC}+1, j) = C_2(N_{AC}+1, j) \\ 0, & C_1(N_{AC}+1, j) \neq C_2(N_{AC}+1, j) \end{cases} \qquad (15)$$

for i = 1, ..., N_AC and j = 1, ..., N_AC, i.e., C_3(i, j) = 1 or C_3(N_AC + 1, j) > 0 means this location holds a common gene shared by C_1 and C_2.

Step 2: Assign gates to C_3 by referring to C_1 and C_2. Basically, C_3(N_AC + 1, j) is set as C_1(N_AC + 1, j) or C_2(N_AC + 1, j), and C_3(j, j) is set as C_1(j, j)
or C_2(j, j), j = 1, ..., N_AC, at a half-and-half chance, subject to Constraint (14). Let C_4 = C_3.

Step 3: Indicate infeasible genes related to relative positions in C_4: set C_4(i, i) = -1 for i = 1, ..., N_AC; if C_3(i, j) = 1 for i ≠ j, then set C_4(m, j) = -1 and C_4(i, m) = -1 for m = 1, ..., N_AC; if C_3(i, i) = 1, then set C_4(m, i) = -1 for m = 1, ..., N_AC. C_4(i, j) = -1 means this location will not be considered when a new relative position between aircraft needs to be set up.

Step 4: While Σ_{i=1}^{N_AC} Σ_{j=1}^{N_AC} C_3(i, j) < N_AC, do

Step 4.1: Randomly choose j such that C_3(i, j) = 0 for all i = 1, ..., N_AC.

Step 4.2: Suppose C_1(i_1, j) = 1 and C_2(i_2, j) = 1, i_1 = 1, ..., N_AC, and i_2 = 1, ..., N_AC. If C_3(N_AC + 1, i_1) = C_3(N_AC + 1, i_2) = C_3(N_AC + 1, j) and C_4(i_1, j) = C_4(i_2, j) = 0, then set i_3 = i_1 or i_3 = i_2 at a half-and-half chance; else if C_3(N_AC + 1, i_n) = C_3(N_AC + 1, j) and C_4(i_n, j) = 0, n = 1 or 2, then set i_3 = i_n; otherwise, randomly choose i_3 such that C_3(N_AC + 1, i_3) = C_3(N_AC + 1, j) and C_4(i_3, j) = 0.

Step 4.3: Set C_3(i_3, j) = 1, C_4(i_3, j) = 1, C_4(m, i_3) = -1 and C_4(i_3, m) = -1 for m = 1, ..., N_AC.

Clearly, with the above crossover procedure, all common genes, i.e., both common linkage information and common gate assignments, are efficiently identified, inherited and protected, and all possibilities of feasibly re-combining non-common genes can be exploited. As will be shown later, this uniform crossover is a very powerful searching operator in the proposed GA.
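As an illustration (not from the chapter), the following Python sketch implements only Step 1, i.e., Eq. (15), on relative-position-based chromosomes as produced by the earlier encode_relative() sketch; Steps 2-4, which assign the remaining gates and relative positions subject to constraints (10)-(14), are omitted here, and all names are illustrative.

```python
import numpy as np

def crossover_step1(C1, C2):
    """Step 1 / Eq. (15): mark the genes shared by both parents. In the returned C3,
    C3[i, j] = 1 marks a common pairing and a non-zero entry in the last row marks a
    common gate assignment; everything else is left at 0 for Steps 2-4 to fill in."""
    n_ac = C1.shape[1]
    C3 = np.zeros_like(C1)
    C3[:n_ac, :] = C1[:n_ac, :] & C2[:n_ac, :]
    same_gate = C1[n_ac, :] == C2[n_ac, :]
    C3[n_ac, same_gate] = C1[n_ac, same_gate]
    return C3
```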
4.4 Heuristic Rules
To further improve the performance of the GA, the following problem-specific heuristic rules are introduced:

• To help the algorithm converge fast, not all of the new chromosomes are initialized randomly; some are generated according to the FCFS principle.

• When initializing a chromosome randomly, we still follow the FCFS principle, but in a loose way, i.e., an aircraft with an earlier P_i is more likely to be assigned to the front of a queue.

• If two aircraft have the same P_i, or their P_i values are within a specified narrow time window, then the one with more passengers stands a better chance of being allowed to dwell first.

• For the sake of diversity, in each generation a certain proportion of the worst chromosomes are replaced by totally new ones.

• As in [25], the population in a generation, N_Population, and the maximum number of generations in the evolutionary process, N_Generation, are adjusted according to N_AC in order to roughly keep the level of solution quality:

$$N_{Population} = 30 + 10 \cdot \text{round}\bigl(\max(0, N_{AC} - 10)/5\bigr), \qquad (16)$$

$$N_{Generation} = 40 + 15 \cdot \text{round}\bigl(\max(0, N_{AC} - 10)/5\bigr). \qquad (17)$$
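Eqs. (16)-(17) translate directly into code; the small helper below (not from the chapter, using Python's built-in rounding) reproduces the population sizes used later in Table 6, e.g. N_AC = 30 gives a population of 70 and 100 generations.

```python
def population_schedule(n_ac):
    """Eqs. (16)-(17): population size and number of generations scaled with N_AC."""
    step = round(max(0, n_ac - 10) / 5)
    return 30 + 10 * step, 40 + 15 * step

print(population_schedule(30), population_schedule(60), population_schedule(90))
# (70, 100) (130, 190) (190, 280)
```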
5 Simulation Results

5.1 Simulation Setup
The terminal layout has a big influence on the cost-efficiency of daily airport operations [30]. In our study, a typical terminal layout, the two-sided parking terminal, is used, as illustrated in Fig. 9. The terminal is assumed to have 20 gates. The data matrices M_PWD and M_BTD, i.e., the distances for passenger walking and baggage transporting, are generated according to (18) to (21):

$$M_{PWD}(n, m) = M_{PWD}(m, n) = d_1 + d_2\,|nrem(n) - nrem(m)|, \qquad (18)$$

$$M_{PWD}(n, N_G + 1) = M_{PWD}(N_G + 1, n) = d_3 + d_2\,|nrem(n) - 5.5|, \qquad (19)$$

$$M_{BTD}(n, m) = M_{BTD}(m, n) = d_4 + d_5\,|nrem(n) - nrem(m)|, \qquad (20)$$

$$M_{BTD}(n, N_G + 1) = M_{BTD}(N_G + 1, n) = d_6 + d_5\,|nrem(n) - 5.5|, \qquad (21)$$

where n = 1, ..., N_G, m = 1, ..., N_G, and d_1 to d_6 are constant coefficients which roughly determine the terminal size and the gate locations,

$$nrem(n) = \begin{cases} rem(n, 11), & n < 11 \\ rem(n - 10, 11), & n \ge 11 \end{cases} \qquad (22)$$

and rem is a function that calculates the remainder after division. Traffic and passenger data are generated randomly under the assumptions that the capacity of an aircraft varies between 50 and 300, the ground time span at a gate is between 30 and 60 minutes, and all aircraft are planned to arrive or depart within a one-hour time window. The congestion condition is indicated by N_AC.
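A Python sketch of Eqs. (18)-(22) is given below (not from the chapter). The coefficient values d_1 to d_6 are illustrative placeholders, since the chapter does not list the values it used; only the structure of the matrices follows the equations above.

```python
import numpy as np

def distance_matrices(n_gates=20, d=(10.0, 20.0, 30.0, 10.0, 20.0, 30.0)):
    """Generate M_PWD and M_BTD per Eqs. (18)-(22) for the two-sided terminal of Fig. 9.
    The last row/column corresponds to the dummy gate (entrance/exit)."""
    d1, d2, d3, d4, d5, d6 = d
    def nrem(n):                                   # Eq. (22), gates numbered 1..20
        return (n % 11) if n < 11 else ((n - 10) % 11)
    M_PWD = np.zeros((n_gates + 1, n_gates + 1))
    M_BTD = np.zeros((n_gates + 1, n_gates + 1))
    for n in range(1, n_gates + 1):
        for m in range(1, n_gates + 1):
            M_PWD[n - 1, m - 1] = d1 + d2 * abs(nrem(n) - nrem(m))          # Eq. (18)
            M_BTD[n - 1, m - 1] = d4 + d5 * abs(nrem(n) - nrem(m))          # Eq. (20)
        M_PWD[n - 1, -1] = M_PWD[-1, n - 1] = d3 + d2 * abs(nrem(n) - 5.5)  # Eq. (19)
        M_BTD[n - 1, -1] = M_BTD[-1, n - 1] = d6 + d5 * abs(nrem(n) - 5.5)  # Eq. (21)
    return M_PWD, M_BTD
```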
[Figure: the terminal has two facing rows of ten gates each (Gate 1 to Gate 10 and Gate 11 to Gate 20) with an underground entrance/exit between them.]
Fig. 9. Two-sided parking terminal layout
[Figure: two panels plotting fitness (×10^6) against generation (0 to 100) for GA1 and GA2; the left panel shows the largest fitness of each generation and the right panel the average fitness level.]
Fig. 10. Fitness levels in a test
For comparative purposes, the GA reported in [25] is extended to solve the GAP. This extended GA, denoted as GA1 hereafter (the proposed new GA with uniform crossover is denoted as GA2), employs the chromosome structure based on absolute positions, and its crossover is actually a more complex mutation operator, as explained in [24]. The crossover probability and mutation probability are 0.5 and 0.3, respectively.

Due to limited space, we only give in Table 1 the results of a relatively simple case study, in order to illustrate how the different GAs optimize gate assignment. In this test, J_MOGAP under GA2 is about 5% smaller than under GA1, and Fig. 10 shows how the fitness, i.e., -J_MOGAP, changes in the evolutionary processes of GA1 and GA2. From Table 1, one may easily find that, under GA1, aircraft 14 is assigned to follow aircraft 27 to dwell at gate 9, and aircraft 20 to follow aircraft 13 to dwell at gate 11, while under GA2, aircraft 14 is assigned to follow aircraft 13 to dwell at gate 9, and aircraft 20 to follow aircraft 27 to dwell at gate 11. These different gate assignments cause no difference in total aircraft waiting time, i.e., aircraft 14 has to wait 6 more minutes under GA1, while aircraft 20 has to wait 6 more minutes under GA2. However, the passenger data show that aircraft 14 has more passengers than aircraft 20, and, more importantly, there are more transfer passengers within aircraft pairs (13,14) and (20,27) than within aircraft pairs (13,20) and (14,27). This means the above gate assignments under GA2, i.e., aircraft 20 following aircraft 27 and aircraft 14 following aircraft 13, stand a good chance of yielding a smaller total passenger waiting time and shorter passenger walking and baggage transferring distances. Therefore, they are useful linkage information in this GAP. GA1 fails to identify and protect this useful linkage information, while GA2 succeeds. From Fig. 10(a), one can see that the largest fitness of a generation in GA2 increases more quickly than that in GA1, which means GA2, owing to taking advantage of the useful linkage information in the GAP and employing uniform crossover, has a faster convergence speed than GA1. Actually, on average in this test, it takes GA2 2.7778 generations to make a breakthrough in the largest fitness, while for GA1 it takes 4.7619 generations. From Fig. 10(b) one can see that the average fitness of a generation in GA2 increases faster and stays larger than that in GA1. This implies GA2 can effectively improve the overall fitness level, which is probably because the
new uniform crossover proposed in this paper really works well in identifying, inheriting and protecting good common pairings of aircraft in queues.

5.2 General Performance
However, to get general conclusions about the different GAs, we need to conduct extensive simulation tests, where N_AC is set as 30, 60 or 90 to simulate the situation of under-congestion, congestion, or over-congestion, and one of the single-objective functions in (4) to (6) or the multi-objective function in (7) is used. For each N_AC and objective function, 100 simulation runs are conducted under each GA, and the average results are given in Table 2 to Table 5, from which we have the following observations:

• Overall, GA2 is about 3% to 10% better than GA1 in terms of the specific objective function, which illustrates the advantage of the proposed uniform crossover operator in handling the linkage information in the GAP.

• In the single-objective GAP cases, as given in Table 2 to Table 4, GA2 achieves a better performance at the cost of other, non-objective criteria. For instance, in Table 2, GA2 gets a smaller TPWD by sacrificing TAWT (total aircraft waiting time). TPWD and TBTD share a similar trend of change, i.e., if GA2 has a smaller/larger TPWD than GA1, then it also has a smaller/larger TBTD. This is probably because both TPWD and TBTD are, in a similar way, determined largely by the terminal layout.

• In the multi-objective GAP case, if the weights in the objective function are properly tuned (α = 0.5 and β = 0.1 in the associated tests), GA2 is better than GA1 not only in terms of the multi-objective function adopted, but also in terms of each single-objective function not adopted.

• In the minimum distance (passenger walking distance or baggage transporting distance) GAP, as shown in Table 2 and Table 3, we use no extra constraints to enforce assigning gates evenly to aircraft. As a result, the gap between the maximum queue length (MaxQL) and the minimum queue length (MinQL) is huge, which implies that many aircraft are assigned to one particular gate. In the minimum waiting time GAP, by contrast, as given in Table 4, the gap between MaxQL and MinQL is very small, which means that using gates evenly is automatically guaranteed during the minimization of waiting time. Therefore, since waiting time is considered in the multi-objective GAP, the gap between MaxQL and MinQL is also very small, as shown in Table 5.

• Basically, in a more congested case, i.e., with a larger N_AC, the operation of gate assignment is more expensive. Roughly speaking, the distances increase linearly with N_AC, while the waiting time goes up exponentially, mainly because of the heavy delay applied to aircraft during a congested period. This might suggest that, in a more congested case, waiting time should be given a larger weight.
Table 1. Result of gate assignment in a single test

AC     Pi      Gi      GA1                 GA2
Code   (min)   (min)   Ei (min)   Gate     Ei (min)   Gate
1      28      40      28         3        28         3
2      26      50      26         16       26         16
3      5       40      5          9        5          9
4      12      45      12         7        12         7
5      43      50      43         19       43         19
6      27      45      27         4        27         4
7      34      40      34         2        34         2
8      10      35      10         12       10         12
9      48      30      48         12       48         12
10     56      35      56         14       56         14
11     25      35      25         15       25         15
12     52      35      53         13       53         13
13     7       50      7          11       7          8
14     56      40      63         8        57         8
15     56      60      60         15       60         15
16     49      35      49         20       49         5
17     39      30      39         1        39         1
18     47      40      47         9        47         9
19     13      40      13         13       13         13
20     52      35      57         11       63         11
21     25      50      25         5        25         20
22     35      40      35         18       35         18
23     16      50      16         6        16         6
24     20      35      20         14       20         14
25     56      45      66         6        66         6
26     4       40      4          10       4          10
27     8       55      8          8        8          11
28     54      50      57         7        57         7
29     45      35      45         10       45         10
30     28      45      28         17       28         17
Table 2. J_TPWD is used as objective function (×10^5)

            J_TPWD    TPWD (m)  TBTD (m)  TAWT (min)  MaxQL  MinQL
N_AC = 30
  GA1       7.2656    7.2656    18.9600   43.4807     29.5   0.2
  GA2       7.0330    7.0330    18.4091   44.7622     29.6   0.2
N_AC = 60
  GA1       14.0606   14.0606   38.5475   201.1071    59.1   0.3
  GA2       13.2538   13.2538   37.2206   209.5881    59.3   0.3
N_AC = 90
  GA1       19.7178   19.7178   56.4425   442.9681    88.6   0.6
  GA2       18.8373   18.8373   55.0299   455.4340    88.5   0.5
Table 3. J_TBTD is used as objective function (×10^5)

            J_TBTD    TPWD (m)  TBTD (m)  TAWT (min)  MaxQL  MinQL
N_AC = 30
  GA1       18.5846   7.2739    18.5846   43.6277     29.5   0.3
  GA2       17.8939   7.1005    17.8939   45.1086     29.6   0.2
N_AC = 60
  GA1       38.1412   14.2805   38.1412   202.0039    59.3   0.3
  GA2       36.9374   13.1288   36.9374   210.1956    59.5   0.3
N_AC = 90
  GA1       55.8907   20.1136   55.8907   440.7336    88.8   0.7
  GA2       54.0407   18.9287   54.0407   451.1360    89.0   0.6
Table 4. J_TPWT is used as objective function (×10^5)

            J_TPWT     TPWD (m)  TBTD (m)  TAWT (min)  MaxQL  MinQL
N_AC = 30
  GA1       1.5273     18.8120   24.8046   0.0611      2.2    0.9
  GA2       1.4595     19.0023   24.9367   0.0583      2.2    0.9
N_AC = 60
  GA1       71.2180    36.9188   50.4188   2.8487      3.9    2.3
  GA2       64.2053    37.3578   51.9046   2.5774      3.8    2.3
N_AC = 90
  GA1       219.8557   51.6150   73.1843   8.7942      5.3    3.9
  GA2       208.5154   53.0487   75.5549   8.3508      5.3    4.0
Table 5. J_MOGAP is used as objective function (×10^5)

            J_MOGAP    TPWD (m)  TBTD (m)  TAWT (min)  MaxQL  MinQL
N_AC = 30
  GA1       11.9457    16.1300   23.5272   0.1528      2.0    0.9
  GA2       11.4672    15.5086   23.0442   0.1477      2.1    1.0
N_AC = 60
  GA1       53.0853    35.9684   49.8836   3.0112      3.9    2.2
  GA2       49.5900    34.1606   48.7724   2.8031      4.0    2.1
N_AC = 90
  GA1       120.2156   49.7772   72.3854   8.8088      5.3    4.0
  GA2       115.7206   47.8941   72.2129   8.4692      5.2    4.0

5.3 Further Analysis
As discussed before, the proposed uniform crossover is particularly good at identifying, inheriting and protecting common linkage information in parent chromosomes. Here we check whether or not this is the case in the simulation. To this end, before each crossover is carried out in either GA1 or GA2, we count how many common pairings of aircraft are shared by the parent chromosomes, and record these common pairings. After the crossover operation, we check how many of the common pairings shared by the parent chromosomes still exist in the resulting offspring chromosome(s) (in GA1, two parent chromosomes reproduce two offspring chromosomes, while in GA2, two parent chromosomes reproduce one offspring chromosome through uniform crossover).
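This measurement can be sketched in Python as follows (not from the chapter). It assumes chromosomes in the relative-position-based matrix format produced by the earlier encode_relative() sketch, and the function names are illustrative.

```python
def pairings(C):
    """The set of (predecessor, successor) aircraft pairs encoded in chromosome C."""
    n_ac = C.shape[1]
    return {(i, j) for i in range(n_ac) for j in range(n_ac)
            if i != j and C[i, j] == 1}

def survival_rate(parent1, parent2, offspring):
    """Fraction of the pairings common to both parents that survive in the offspring."""
    common = pairings(parent1) & pairings(parent2)
    if not common:
        return 1.0
    return len(common & pairings(offspring)) / len(common)
```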
Table 6. Memory efficiency and computational time

       N_AC = 30, Npopu = 70     N_AC = 60, Npopu = 130    N_AC = 90, Npopu = 190
       MD (b)     CT (s)         MD (b)     CT (s)         MD (b)      CT (s)
GA2    42000      11.9           156000     52.7           342000      123.3
GA3    65100      11.4           475800     53.1           1556100     121.9
Then, we can calculate the survival rate of common linkage information in chromosomes during the two different crossover operations. As explained in [24], the crossover operator in GA1 is designed based on the absolute-position-based encoding scheme, and it is actually equivalent to a complex combination of Mutation I and Mutation II given in Section 4.2. As is well known, mutation aims to diversify chromosomes, and therefore pays no attention to any common linkage information. After conducting 50 random runs of both GA1 and GA2 (here a random run means choosing N_AC and the objective function randomly), we found the average survival rate of common linkage information in GA1 is 0.8127, while in GA2 this rate is 1, which means common linkage information is fully protected during the uniform crossover operation in GA2. One may argue that (I) not all common linkage information in parent chromosomes is good in terms of fitness, and (II) even good common linkage information should be capable of being destroyed, given the stochastic nature of biological evolution. Actually, in GA2, unfit common linkage information is more likely to be eliminated during the competitive selection operation, and good common linkage information may still be destroyed by mutation. Here the point is: our proposed uniform crossover does exactly and perfectly what a crossover operator is expected to do, namely providing one hundred percent protection to common linkage information.

Unlike in [24], this paper adopts a combined encoding scheme, i.e., GA2 mainly uses the absolute-position-based chromosome structure, but the uniform crossover operator requires the relative-position-based one, as discussed in Section 4.1. Therefore, before and after each crossover operation in GA2, a representation transforming procedure needs to be executed on parents and offspring. This representation transforming procedure saves memory at the cost of computational time. Is this really a good trade-off between memory efficiency and computational burden? Here we compare GA2 with the GA reported in [24], which is denoted as GA3 hereafter. All evolutionary operators in GA3 are actually equivalent to those used in GA2, but GA3 is designed purely on the basis of the relative-position-based encoding scheme. In other words, GA3 needs no representation transforming procedure for uniform crossover. The memory demands (MD) and computational times (CT) of GA2 and GA3 in different congestion conditions are compared in Table 6, from which one can see clearly:

• Thanks to the combined encoding scheme, GA2 is much more memory-efficient than GA3, particularly when many aircraft need to be considered at one time. Basically, as N_AC goes up, the population of a generation and
the total number of generations to evolve have to increase correspondingly in order for GAs to maintain the level of solution quality. Assuming N_AC = 1000 and the population of a generation, N_popu, is 1000, then the memory demand for GA3 to store a generation of chromosomes is 1001 × 1000 × 1000 bytes, which already exceeds the memory capacity of most standard personal computers. In the case of GA2, the memory demand for a generation is 20 × 1000 × 1000 bytes, which is still tolerable and manageable.

• Regarding computational time, there is no obvious difference between GA2 and GA3; this is probably because, as discussed in Section 4.1, the representation transforming procedure in GA2 is rather straightforward, and therefore its time cost is almost negligible compared with the computational time consumed by the evolutionary operations in GAs.

• In summary, one can see that the combined encoding scheme in GA2 provides a good trade-off between the memory efficiency and the computational burden of the algorithm.
6 Conclusion

The Airport Gate Assignment Problem (GAP) is a major issue in air traffic control operations, and GAs have good potential for resolving this problem. For a successful implementation of GAs for the GAP, choosing a good representation of the aircraft queues to gates is a crucial step in the design of GAs, and the relative positions between aircraft, i.e., the useful linkage information in chromosomes for the GAP, should be represented in a computationally efficient way. This paper studies a new matrix representation for the GAP, which adopts the relative positions between aircraft, rather than the widely used absolute positions of aircraft in queues to gates, to construct chromosomes. Based on this new matrix representation, an effective uniform crossover operator is then designed in order to identify, inherit and protect those good linkages in chromosomes for the GAP. To better trade off computational efficiency against memory efficiency, a special representation transforming procedure is introduced to resolve the O((n + 1) × n) memory problem caused by the matrix representation. The advantages of the new GA are demonstrated in extensive simulation tests. Further research will be conducted, such as introducing more complex factors and investigating the weights in order to make the GAP model more realistic, and extending the reported work from static air traffic situations to dynamic environments based on real traffic data, which need to be collected and analyzed.
Acknowledgements This work was supported by the EPSRC Grant EP/C51632X/1. A previous version of this paper was presented at The 2007 IEEE Congress on Evolutionary Computation (CEC2007), 25-28 Sep 2007, Singapore.
References 1. Holland, J.H.: Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor (1975) 2. Goldberg, D.E.: The design of innovation: Lessons from and for competent genetic algorithms. In: Genetic Algorithms and Evoluationary Computation, vol. 7. Kluwer Academic Publishers, Dordrecht (2002) 3. Bosman, P.A., Thierens, D.: Linkage information processing in distribution estimation algorithms. In: Proceedings of Genetic and Evolutionary Computation Conference 1999 (GECCO 1999), pp. 60–67 (1999) 4. Chen, Y.P., Goldberg, D.E.: Introducing start expression genes to the linkage learning genetic algorithm. In: Proceedings of the Seventh International Conference on Parallel Problem Solving from Nature (PPSN VII), pp. 351–360 (2002) 5. Chen, Y.P., Yu, T.L., Sastry, K., Goldberg, D.E.: A Survey of Linkage Learning Techniques in Genetic and Evolutionary Algorithms. IlliGAL Report No 2007014 (2007) 6. Goldberg, D.E., Deb, K., Thierens, D.: Toward a better understanding of mixing in genetic algorithms. Journal of the Society of Instrument and Control Engineers 32, 10–16 (1993) 7. Haghani, A., Chen, M.C.: Optimizing gate assignments at airport terminals. Transportation Research A 32, 437–454 (1998) 8. Bolat, A.: Procedures for providing robust gate assignments for arriving aircraft. European Journal of Operations Research 120, 63–80 (2000) 9. Babic, O., Teodorovic, D., Tosic, V.: Aircraft stand assignment to minimize walking distance. Journal of Transportation Engineering 110, 55–66 (1984) 10. Mangoubi, R.S., Mathaisel, D.F.X.: Optimizing gate assignments at airport terminals. Transportation Science 19, 173–188 (1985) 11. Bihr, R.: A conceptual solution to the aircraft gate assignment problem using 0,1 linear programming. Computers & Industrial Engineering 19, 280–284 (1990) 12. Gosling, G.D.: Design of an expert system for aircraft gate assignment. Transportation Research A 24, 59–69 (1990) 13. Srihari, K., Muthukrishnan, R.: An expert system methodology for an aircraftgate assignment. Computers & Industrial Engineering 21, 101–105 (1991) 14. Xu, J., Bailey, G.: Optimizing gate assignments problem: Mathematical model and a tabu search algorithm. In: Proceedings of the 34th Hawaii International Conference on System Sciences. Island of Maui, Hawaii, USA (2001) 15. Ding, H., Lim, A., Rodrigues, B., Zhu, Y.: New heuristics for the over-constrained flight to gate assignments. Journal of the Operational Research Society 55, 760–768 (2004) 16. Ding, H., Lim, A., Rodrigues, B., Zhu, Y.: The over-constrained airport gate assignment problem. Computers & Operations Research 32, 1867–1880 (2005) 17. Robuste, F.: Analysis of baggage handling operations at airports. PhD thesis, University of California, Berkeley, USA (1988) 18. Chang, C.: Flight sequencing and gate assignment in airport hubs. PhD thesis, University of Maryland at College Park, USA (1994) 19. Robuste, F., Daganzo, C.F.: Analysis of baggage sorting schemes for containerized aircraft. Transportation Research A 26, 75–92 (1992) 20. Wirasinghe, S.C., Bandara, S.: Airport gate position estimation for minimum total costs-approximate closed form solution. Transportation Research B 24, 287–297 (1990)
21. Gu, Y., Chung, C.A.: Genetic algorithm approach to aircraft gate reassignment problem. Journal of Transportation Engineering 125, 384–389 (1999) 22. Bolat, A.: Models and a genetic algorithm for static aircraft-gate assignment problem. Journal of the Operational Research Society 52, 1107–1120 (2001) 23. Yan, S., Huo, C.M.: Optimization of multiple objective gate assignments. Transportation Research A 35, 413–432 (2001) 24. Hu, X.B., Di Paolo, E.: An Efficient Genetic Algorithm with Uniform Crossover for the Multi-Objective Airport Gate Assignment Problem. In: Proceedings of 2007 IEEE Congress on Evolutionary Computation. Singapore (2007) 25. Hu, X.B., Chen, W.H.: Genetic Algorithm Based on Receding Horizon Control for Arrival Sequencing and Scheduling. Engineering Applications of Artificial Intelligence 18, 633–642 (2005) 26. Sywerda, G.: Uniform crossover in genetic algorithms. In: Proceedings of the 3rd International Conference on Genetic Algorithms. USA (1989) 27. Page, J., Poli, P., Langdon, W.B.: Smooth uniform crossover with smooth point mutation in genetic programming: A preliminary study, Genetic Programming. In: Langdon, W.B., Fogarty, T.C., Nordin, P., Poli, R. (eds.) EuroGP 1999. LNCS, vol. 1598. Springer, Heidelberg (1999) 28. Falkenauer, E.: The worth of uniform crossover. In: Proceedings of the 1999 Congress on Evolutionary Computation. USA (1999) 29. Eiben, A.E., Schoenauer, M.: Evolutionary computing. Information Processing Letters 82, 1–6 (2002) 30. Bandara, S., Wirasinghe, S.C.: Walking distance minimization for airport terminal configurations. Transportation Research A 26, 59–74 (1992) 31. Hartl, D.L., Jones, E.W.: Genetics: principles and analysis, 4th edn. Jones and Bartlett Publishers, Sudbury (1998) 32. Goldberg, D.E.: Simple genetic algorithms and the minimal, deceptive problem. In: Davis, L. (ed.) Genetic Algorithms and Simulated Annealing, ch. 6, pp. 74–88. Morgan Kaufmann Publishers, Los Altos (1987)
Genetic Algorithms for the Airport Gate Assignment
387
Appendix: Notation
Ai: Scheduled arrival time to the airport
Di: Scheduled departure time
Ei: Allocated entering time to a gate
Gi: Ground time
Hg: Number of aircraft in Qg
JMOGAP: Performance index for the multi-objective GAP
JTBTD: Total baggage transferring distance
JTPWD: Total passenger walking distance
JTPWT: Total passenger waiting time
NAC: Number of aircraft which need to be assigned to gates
NG: Number of gates at the airport
NGeneration: The maximum number of generations in the evolutionary process
NPopulation: The population in a generation
MBTD: Baggage transferring distance matrix
MP: Passenger data matrix
MPWD: Passenger walking distance matrix
Pi: Planned entering time to a gate
Qg: The queue at gate g
TE: End point of the time period for the GAP
TS: Starting point of the time period for the GAP
Wi: Aircraft waiting time
A Decomposed Approach for the Minimum Interference Frequency Assignment Gualtiero Colombo and Stuart M. Allen School of Computer Science, Cardiff University
[email protected],
[email protected]
Abstract. The Minimum Interference Frequency Assignment Problem (MI-FAP) is an important optimization problem that arises in operational wireless networks. Solution techniques based on meta-heuristic algorithms have been shown to be successful for some test problems. However, they have not been demonstrated on the large-scale problems that occur in practice, and their performance is poor in these cases. We propose a decomposed assignment technique which divides the initial problem into a number of simpler subproblems that are solved either independently or in sequence. Partial subproblem solutions are recomposed into a solution of the original problem. Our results show that the proposed decomposed assignment approach can improve the outcomes, both in terms of solution quality and runtime. A number of partitioning methods are presented and compared, such as clique detection; partitioning based on sequential orderings; and novel applications of existing graph partitioning and clustering methods adapted for this problem.
1 Introduction

In a wireless network, transmitters and receivers communicate via signals encoded on specific frequency channels. Roughly speaking, the frequency planning of a radio network consists of assigning the base stations a signal which is powerful enough to guarantee adequate communication, without causing severe interference between transmitters. As a consequence (depending on the level of interference which can be considered acceptable), a required frequency separation can be specified for each pair of transmitters. The Frequency Assignment Problem (FAP) is an optimization problem which aims to assign frequencies to transmitters in as efficient a way as possible, either in terms of interference or of the amount of spectrum used. Since it is the most important problem occurring in operational cellular wireless networks, we will restrict our attention to the Minimum Interference FAP (MI-FAP). In this problem pairs of transmitters are assigned numerical values which represent the acceptable (but undesirable) interference that arises between them. In this case, the sum of the interference produced among all pairs of transmitters in the network should be minimized in the final frequency assignment. The FAP and its variants have been proven to be NP-hard [12] by reduction to a graph coloring problem. Consequently, exact methods are only able to solve the FAP for small instances composed of a limited number of transmitters. Solution techniques for the FAP are usually based on meta-heuristic algorithms, while lower bounding techniques have been developed that allow their quality to be assessed. This has been
shown to be successful for some test problems, producing assignments that are provably optimal. However, standard meta-heuristics do not appear capable of effectively solving hard FAP instances, that is, instances that are either large or that present a hard structure within the interference graph. The aim of this chapter is to show that, when combined with decomposition, standard meta-heuristics can notably improve their performance. To show the success of this approach independently of the underlying meta-heuristic, we apply our decomposed assignment approach to a Simulated Annealing algorithm (SA) and a Genetic Algorithm (GA) with two different assignment representations (direct and order-based). We briefly describe a number of decomposition methods first, then discuss their application to the FAP.
2 Formulation of the Problem

The FAP in its basic formulation considers only channel separations as constraints. Furthermore, it uses a binary constraint model as a measure of interference, in which constraints are expressed between pairs of transmitters and specify the minimum separation of frequency channels that guarantees acceptable interference (Figure 1). Formally, the FAP can be modelled by an undirected weighted graph G(V, E), called the interference graph, which consists of a finite set of vertices V, representing transmitters, and a finite set of edges E ⊆ {uv | u, v ∈ V} joining unordered distinct pairs of vertices. Each edge uv has an associated weight c_uv ∈ {0, 1, 2, ...}, an integer value giving the channel separation required for the transmitters represented by its end points u, v.

Definition 1. Given an allocation of allowed channels F = {1, 2, ..., k} and a frequency assignment f : V → F, we define f as a zero-violation assignment if

|f(v) − f(u)| ≥ c_uv   ∀ uv ∈ E

For the MI-FAP the constraints above are known as hard constraints and must be respected by any feasible solution of the problem. Another category of weights, known as soft constraints, is associated with every edge uv ∈ E. These weights are expressed in terms of penalties which represent the probabilistic acceptable interference
Fig. 1. Binary graph representation of the FAP
between pairs of transmitters which transmit on the same channel (c^coch_uv) or on adjacent channels (c^adj_uv). Note that these values can be zero. The network is represented by the 6-tuple N = (G, F, {B_v}_{v∈V}, {c_uv}_{uv∈E}, {c^adj_uv}_{uv∈E}, {c^coch_uv}_{uv∈E}).

Definition 2. Given a pair of transmitters u, v with the corresponding edge uv, we define the cost of a violation as:

ϕ_MI(f, uv) =  c^hard_uv   if |f(u) − f(v)| < c_uv
               c^adj_uv    if |f(u) − f(v)| = 1 ≥ c_uv
               c^coch_uv   if |f(u) − f(v)| = 0 = c_uv
               0           if |f(u) − f(v)| ≥ max{c_uv, 2}

Definition 3. Given an allocation of frequencies F = {1, 2, ..., K}, the Minimum Interference Frequency Assignment Problem (MI-FAP) aims to produce an assignment f : V → F which respects the blocked channel constraints, that is f(v) ∈ F \ B_v ∀ v ∈ V, does not violate any hard constraints and minimizes the soft constraints. This can be formulated as minimizing

O_MI(f) = ∑_{uv ∈ E} ϕ_MI(f, uv)

where only solutions with O_MI(f) < c^hard_uv are feasible. In Definition 3, c^hard_uv is a large value chosen so that an assignment f with O_MI(f) ≥ c^hard_uv is known to violate at least one of the hard constraints.
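For concreteness, the cost of Definitions 2 and 3 can be evaluated with the following minimal Python sketch; the dictionary-based data structures and function names are illustrative assumptions rather than part of the original formulation.

```python
def violation_cost(f, u, v, c, c_adj, c_coch, c_hard):
    """phi_MI(f, uv) of Definition 2 for the edge uv under assignment f.

    c[(u, v)]      : required channel separation (hard constraint)
    c_adj[(u, v)]  : adjacent-channel interference penalty
    c_coch[(u, v)] : co-channel interference penalty
    c_hard[(u, v)] : large penalty marking a hard-constraint violation
    """
    sep = abs(f[u] - f[v])
    if sep < c[(u, v)]:
        return c_hard[(u, v)]        # hard constraint violated
    if sep == 1 and c[(u, v)] <= 1:
        return c_adj[(u, v)]         # adjacent-channel interference
    if sep == 0 and c[(u, v)] == 0:
        return c_coch[(u, v)]        # co-channel interference
    return 0.0                       # separation >= max(c_uv, 2)


def total_cost(f, edges, c, c_adj, c_coch, c_hard):
    """Objective O_MI(f) of Definition 3: sum of the edge violation costs."""
    return sum(violation_cost(f, u, v, c, c_adj, c_coch, c_hard)
               for u, v in edges)
```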
3 Problem Decomposition and Frequency Assignment

Previously published works which have applied decomposition to the FAP can be grouped into three main categories. Most often, decomposition has been used either in combination with exact methods or with meta-heuristics as a second-phase optimization (either following the heuristic procedure or incorporated into it). A third approach is similar to that proposed in this paper and is based on constructively finding an efficient decomposition into subproblems. This leads to corresponding partial solutions which will then be recomposed into a solution of the initial problem.
3.1 Exact Methods
The most common application of problem decomposition techniques for the FAP has been within exact methods. A number of them are based on selecting an initial permutation of transmitters in order to consider the hard part of a data set first. In [18] this idea is further developed by identifying a hard subgraph, called the core, which is isolated
and solved first. The remaining part of the problem can then ideally be solved without influencing the global objective function. Only a few works have used exact methods to solve the MI-FAP. Koster et al. [17] observed that assigning frequencies to a cut-set of the interference graph decomposes the problem into two or more independent subproblems, thus they generated a sequence of such cut-sets using tree decomposition. This idea, in addition to the use of several further dominance and bounding techniques, led to the solution of some small and medium MI-FAP instances. However, for larger real-life instances, in which the proposed dynamic programming algorithm is impractical because of the width of the tree, the algorithm has been used iteratively to improve some known lower bounds. Finally, Eisenblatter [9] derived new lower bounds for the COST 259 MI-FAP instances by studying the semidefinite programming relaxation of the minimum k-partition problem. These bounds are based on the fact that the MI-FAP reduces to a minimum k-partition problem, which can be modeled as a semidefinite program with the restriction of considering only co-channel interference. However, the methods described above can produce solutions in a reasonable run time only for the smallest instances. As a consequence they have been primarily used either to produce lower bounds or as a preprocessing technique.
3.2 Heuristic Methods
Decomposition has rarely been proposed in combination with meta-heuristics. In these cases a decomposition of the whole set of transmitters into a number of subsets has been used to optimize the solution either after the heuristic method or incorporated into its procedure at a fixed number of iterations. Moreover, the partitioning adopted is similar to that used for distributed channel assignment in cellular problems, in order to optimize solutions locally (usually by applying an exact procedure) inside system clusters of several cells [11, 23]. In Montemanni et al. [20] a cell re-optimization is used in combination with Tabu Search after a fixed number of iterations. Each cell is selected in sequential order and its assignment is optimized by an exact method, while those in the other cells are kept fixed. In a similar way Mannino et al. [19] used SA combined with dynamic programming to compute local optima. In their approach, they optimize assignments in cliques of vertices of multiple demand by reducing this problem to finding fixed-cardinality stable sets in interval graphs. All the algorithms above obtained good results on the COST259 MI-FAP instances, although the addition of elaborate exact procedures considerably increases the computational complexity of the algorithm, thus requiring runtimes roughly one order of magnitude higher than those of the fast heuristic combinations [9]. In our proposed approach the decomposition strategy is extended to a larger scale by adopting a different approach which does not involve any exact local optimization algorithm and is therefore suitable for the application of standard algorithms. Here the decomposed approach aims to simplify a complicated problem by considering separate subproblems, obtained by removing some of the constraints between pairs of vertices representing transmitters, rather than increasing the algorithm complexity and solving the problem as a whole. This approach has seldom been used in the literature, by extending the standard and generalized clique bounds originally proposed for the Minimum
Span FAP (MS-FAP) and Minimum Order-FAP (MO-FAP) to a heuristic approach (see [10]). They start by finding a level-p clique, which is the largest clique having minimum weight edges of p, then produce a first assignment for the clique by applying a metaheuristic and evaluate its span. Subsequently, the clique assignment is kept fixed and an attempt is made to extend the assignment to the full interference graph. This procedure produced good results on some MS-FAP instances, which include some of the Philadelphia benchmarks and other test problems provided by Cardiff University, but has not yet been applied to the MI-FAP. In [4] an order-based steady-state genetic algorithm (GA) has been combined with two different decompositions, based on either the generalized degree of the corresponding graph or more sophisticated graph partitioning algorithms, to solve both the MS-FAP and simple instances of the FS-FAP. Finally, [5] presents preliminary results of applying the same procedure to the MI-FAP using a generational GA with direct representation. The remainder of this paper will introduce our proposed decomposed assignment procedure for the FAP and, subsequently, will outline the different algorithms used to obtain decompositions and the meta-heuristics used to produce assignments of the problem.
4 Decomposed Assignment Approach

Our procedure starts by partitioning the interference graph into one or more subsets. A meta-heuristic is applied to each of the subsets in turn to produce a sequence of partial solutions. When the current subset is considered, the algorithm keeps the assignment of transmitters in the previously assigned subsets fixed, and minimizes the constraint violations within them. Finally, the algorithm returns a final assignment of the whole problem. Pseudocode of the decomposed assignment is outlined in Algorithm 1. We need to distinguish between the first assignment loop over the subsets in the partition and the further ones. The first loop builds a sequence of partial assignments in which some vertices are unassigned and not considered in the cost function. At the end of the first loop, a complete assignment is obtained and subsequently the algorithm changes the assignment of a single subset during each iteration.

Definition 4. Given a partition of V into n subsets {V1, V2, ..., Vn}, we define the sets of intra- and inter-edges for a given subset Vj as

E_j^intra = E(G[Vj])   and   E_j^inter = {uv : u ∈ Vj, v ∉ Vj, uv ∈ E}

In Definition 4, G[Vj] indicates the subgraph induced by a subset Vj ⊆ V (that is, the subgraph H of G for which, for any pair of vertices u, v ∈ Vj, uv is an edge if and only if uv is also an edge of G). A slightly different procedure is required when the subsets are solved independently (see Algorithm 2), where the algorithm builds distinct partial assignments for each of the subsets. Further loops other than the first lose significance, since during a partial assignment the procedure only considers the internal edges E_j^intra of the current subset j.
Algorithm 1. Decomposed assignment
Input: G(V, E), number of loops numLoops, size of partition n
Output: frequency assignment f of V
1: Produce a partition {V1, V2, ..., Vn} of V using decomposition algorithms
2: for j = 1 to n do   // first loop
3:   Apply a meta-heuristic to determine f(v) ∀ v ∈ Vj to minimize the cost
       O_MI(f) = ∑_{uv ∈ E_j} ϕ_MI(f, uv)
4:   where E_j = (E_j^inter ∪ E_j^intra) ∩ E(G[V1 ∪ V2 ∪ ... ∪ Vj])
5: end for
6: for i = 2 to numLoops do
7:   for j = 1 to n do
8:     Apply a meta-heuristic to determine f(v) ∀ v ∈ Vj to minimize the cost
         O_MI(f) = ∑_{uv ∈ E_j} ϕ_MI(f, uv)
9:     where E_j = (E_j^inter ∪ E_j^intra)
10:  end for
11: end for
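As an illustration only, Algorithm 1 could be realised along the lines of the following Python sketch; `solve_subset` is a hypothetical placeholder for whatever meta-heuristic (SA or the GA) re-optimises one subset while the rest of the assignment stays fixed, and the data structures are assumptions.

```python
def decomposed_assignment(subsets, edges, solve_subset, num_loops=2):
    """Sketch of Algorithm 1: assign the subsets sequentially, then re-optimise.

    subsets      : list of vertex subsets V_1, ..., V_n (a partition of V)
    edges        : iterable of (u, v) pairs of the interference graph
    solve_subset : meta-heuristic that (re)assigns the vertices of one subset
                   while all other assigned vertices stay fixed
    """
    f = {}                                   # partial frequency assignment
    assigned = set()
    # First loop: only edges whose endpoints are both already assigned count.
    for V_j in subsets:
        assigned |= set(V_j)
        E_j = [(u, v) for u, v in edges
               if (u in V_j or v in V_j) and u in assigned and v in assigned]
        f = solve_subset(f, V_j, E_j)
    # Further loops: re-optimise each subset against the complete assignment.
    for _ in range(1, num_loops):
        for V_j in subsets:
            E_j = [(u, v) for u, v in edges if u in V_j or v in V_j]
            f = solve_subset(f, V_j, E_j)
    return f
```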
Note that only the vertices included in the current subset are considered during its channel assignment. Finally, all the partial assignments produced are recombined to generate a final assignment of the original problem. Since the subsets are solved independently, they can be solved in parallel, thus reducing the computational time proportionally to the number of subsets. However, its practical effectiveness depends on the (usually high) number of constraint violations between subsets. If clusters of highly connected vertices are present in the network, the decomposition strategy will aim to locate them in separate subsets, each including a single cluster. Note that this will allow the partial assignment produced in each subset to be largely reproduced, in terms of building blocks (here represented by 'cliques' of vertices within a same cluster), in the optimal final assignment including all vertices in the network. The decomposed technique will then preserve the structure of the problem when the heuristic solution is produced, for example by preventing the disruptive effect of the genetic operators in case an evolutionary algorithm is used.

Algorithm 2. Decomposed assignment - subsets solved independently
Input: G(V, E), number of loops numLoops, size of partition n
Output: frequency assignment f of V
1: Produce a partition {V1, V2, ..., Vn} of V using decomposition algorithms
2: for j = 1 to n do
3:   Apply a meta-heuristic to determine f(v) ∀ v ∈ Vj to minimize the cost
       O_MI(f) = ∑_{uv ∈ E_j^intra} ϕ_MI(f, uv)
4: end for
5 Decomposition Algorithms

Different decomposition techniques have been tested in order to compare their performance when the corresponding partitions are used to solve MI-FAP instances. To state the decomposition algorithms used at step 1 of Algorithms 1 and 2, we first define a weighted simple graph G^D that combines the hard and soft constraints.

Definition 5. Given the interference graph G(V, E) and the 3-tuple representing the constraints ({c^hard_uv}_{uv∈E}, {c^adj_uv}_{uv∈E}, {c^coch_uv}_{uv∈E}), we define the graph G^D(V, E) as the undirected binary graph having the vertex set V and edge set E, with the edge weights given by the linear combination

c_uv = max{λ1 · c^coch_uv + λ2 · c^adj_uv, λ3 · c^hard_uv}
in which the λi are assigned weights that reflect the relative importance of the constraints. In our experiments we set the parameters in Definition 5 to the values λ1 = λ2 = 0.5 and λ3 = 1. Therefore we slightly emphasize the importance of the hard constraints, while giving equal weight to both types of soft constraints. Simple decomposition methods consist of including in the subsets of the partition sets of vertices selected either at random (random decomposition) or according to geographical information, such as transmitter locations (geographical decomposition). In addition, we have tested two other decomposition methods which are both based on the same idea of solving first what roughly corresponds to the hardest part of the problem: the generalized-degree and clique decompositions. These methods have previously obtained very good results for the MS-FAP. Pseudocode of these decomposition algorithms can be found in [4, 14]. The maximum clique detection has been adapted to the MI-FAP by implementing the weighted version of the algorithm proposed by Pardalos et al. in [21]. In order to define the problem with more than two subsets, the cliques have been found sequentially after removing the subgraphs induced by those already found by the algorithm. The generalized-degree decomposition simply starts by ordering all the transmitters in the network by their generalized-degree and then includes in each subset those transmitters which produce almost equal values of the sums of generalized-degrees. So it appears as a natural extension of the sequential assignment heuristics originally proposed by Hale in [12]. Finally, we propose two further algorithms based on graph theory: graph clustering and graph partitioning.
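A minimal sketch of the combined edge weight of Definition 5, with the parameter values used in our experiments as defaults; the function name and form are assumptions for illustration.

```python
def combined_weight(c_coch, c_adj, c_hard, lam=(0.5, 0.5, 1.0)):
    """Edge weight c_uv of the graph G^D (Definition 5):
    max(lambda1 * c_coch + lambda2 * c_adj, lambda3 * c_hard)."""
    lam1, lam2, lam3 = lam
    return max(lam1 * c_coch + lam2 * c_adj, lam3 * c_hard)
```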
5.1 Graph Clustering Algorithms
The natural definition of graph clustering is the separation of sparsely connected dense subgraphs from each other. Given a cut c(G) = {V1, V \ V1}, we define the conductance index, which evaluates its quality, as

φ(c) = ( ∑_{uv ∈ E1^inter} c_uv ) / min( ∑_{uv ∈ (E1^inter ∪ E1^intra)} c_uv , ∑_{uv ∈ (E \ E1^intra)} c_uv )

We then extend the definition of the conductance to the whole graph G and a given partition of it.

Definition 6. Given a partition of G into n subsets C(G) = {V1, V2, ..., Vn}, we identify the cluster represented by the subset Vi with the induced subgraph of G: G[Vi] := (Vi, Ei^intra). The conductance of a graph, φ(G), is the minimum conductance value over all the possible cuts c(G) of G. The intra-cluster conductance α(C) of C(G) is the minimum conductance value over all induced subgraphs G[Vi]. The inter-cluster conductance δ(C) is the maximum conductance value over all induced cuts ci = (Vi, V \ Vi):

α(C) = min_i φ(G[Vi])   and   δ(C) = 1 − max_i φ(ci)

The larger the intra-cluster conductance of a clustering, the higher its quality, since a small intra-cluster conductance means that at least one of the clusters Vi contains a bottleneck and can be further decomposed into two subsets. A clustering with small inter-cluster conductance is also a low-quality one, since there is at least one cluster with strong external connections. Optimising the indexes above is generally NP-hard, as is calculating the conductance of a graph, see [3]. In this paper we have used the well-known Markov clustering algorithm, which is based on the intuition that dense regions in sparse graphs should correspond to regions in which the number of k-length paths is relatively large for small values of k ∈ N. As a consequence 'a random walk that visits a dense cluster will likely not leave the cluster until many of its vertices have been visited' [22]. Pseudocode of the Markov clustering and a more detailed description of it can be found in [22]. It is important to mention that, although this is one of the most commonly used algorithms for graph clustering, the quality and number of clusters produced depend strictly on the expansion and inflation parameters, and convergence is not always guaranteed.
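The conductance index can be computed directly from the weighted graph G^D; the following sketch assumes, for illustration, that the edges of G^D are stored as a dictionary mapping vertex pairs to the combined weights c_uv of Definition 5.

```python
def conductance(cut, weights):
    """Conductance phi(c) of the cut (V1, rest of V) of the weighted graph G^D.

    cut     : set of vertices forming V1
    weights : dict {(u, v): c_uv} with one entry per edge of G^D
    """
    inter = intra = rest = 0.0
    for (u, v), w in weights.items():
        in_u, in_v = u in cut, v in cut
        if in_u != in_v:
            inter += w          # edge crossing the cut
        elif in_u:
            intra += w          # both endpoints inside V1
        else:
            rest += w           # both endpoints outside V1
    denom = min(inter + intra, inter + rest)
    return inter / denom if denom > 0 else 0.0
```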
5.2 Graph Partitioning Decomposition
The graph partitioning problem is closely related to graph clustering but has a simpler formulation and has been more widely studied. In a simple undirected graph it is defined as dividing the vertices into disjoint subsets such that the number of edges whose endpoints are in different subsets is minimized. In order to use it to solve MI-FAP instances we have reformulated the problem as:

Definition 7. Given a partition of the undirected weighted graph G^D representing the network into n disjoint subsets C(G^D) = {V1, V2, ..., Vn}, with

⋂_{i=1}^{n} Vi = ∅   ∧   ⋃_{i=1}^{n} Vi = V,

we define the balanced graph partitioning problem for FAP (GPFAP) as selecting a partition C which minimizes

C^1_GPFAP = ∑_{uv ∈ E} c_uv − ∑_{uv ∈ E1^intra} c_uv + ∑_{uv ∈ E^inter} c_uv,   where E^inter = ⋃_{j=1}^{n} E_j^inter,

and such that the difference between the cardinalities of different subsets is as small as possible, i.e. ||Vi| − |Vj|| ≤ 1 ∀ i, j.

The cost function balances the sum of the weights of the external connections E_j^inter of each of the subsets with a term which maximizes the internal connections in the first subset only. Note that maximising the intra-edges in the first subset aims to prevent Algorithm 1 from producing for this subset the trivial solution of a partial assignment which does not present any interference. We have also formulated an unbalanced GPFAP which removes the balance constraint. For this problem better performance is obtained by introducing a different cost which is based on the conductance indexes introduced for graph clustering. In fact, given a partition C(G^D) we can easily compute the conductance φ(ci) over all its induced cuts. Since we aim to minimize this value for all the cuts in a cluster, we have subsequently defined the following objective:

Definition 8. Given a partition of the undirected weighted graph G^D representing the network into n disjoint subsets C(G^D) = {V1, V2, ..., Vn}, with

⋂_{i=1}^{n} Vi = ∅   ∧   ⋃_{i=1}^{n} Vi = V,

we define the unbalanced GPFAP as minimising the cost

O^2_GPFAP = ∑_i φ(ci),   where   φ(ci) = ( ∑_{uv ∈ Ei^inter} c_uv ) / min( ∑_{uv ∈ (Ei^inter ∪ Ei^intra)} c_uv , ∑_{uv ∈ (E \ Ei^intra)} c_uv )
To avoid trivial solutions for the first partial assignment, the subsets are reordered in decreasing order of size. This choice of costs has been preferred in our experiments since it shows the best performance in some preliminary tests. However, they could both be used either for the balanced or the unbalanced version of the FAP. To solve the GPFAP in all of the formulations introduced we have implemented a memetic GA. The aim here is to obtain near-optimal solutions in a reasonably short time rather than pursue the absolute optimum, since the partitioning only constitutes the preprocessing step of the subsequent procedure which solves the FAP. Details and pseudocode of the algorithm can be found in [5]. It is important to mention that for cellular problems (such as the COST259 benchmarks, see Section 6) the GPFAP procedure has been implemented in terms of single cells rather than single transmitters, that is, the vertices of the equivalent graph G^D are the network cells instead of the vertices of the graph representing transmitters. This automatically preserves the co-cell constraints for each of the subsets.
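Under the same assumptions as before, the unbalanced GPFAP objective of Definition 8 is simply the sum of the conductances of the induced cuts; the sketch below reuses the hypothetical `conductance` function from Section 5.1.

```python
def unbalanced_gpfap_cost(partition, weights):
    """O^2_GPFAP of Definition 8: sum of the conductances of the cuts
    induced by the subsets of the partition (to be minimized)."""
    return sum(conductance(set(V_i), weights) for V_i in partition)
```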
6 Benchmarks

The proposed decomposed assignment approach has been tested on a subset of the widely used COST-259 MI-FAP benchmarks for GSM 900 networks [7]. These test problems are widely recognized as the most important proposed in the last decade for the MI-FAP and are publicly available from [1]. They were explicitly designed with the purpose of comparing and improving existing assignment methods as well as motivating the development of new ones. A complete summary table of their characteristics can be found in [9]. The relative importance of hard and soft constraints in the interference graph plays a primary role in the effectiveness of a decomposition. In particular, a decomposition cannot disregard the distribution of the hard constraints, which must be satisfied in a feasible assignment. Figure 2 shows the distributions of the hard constraints for two of the Siemens instances. They generally present different characteristics: some of them (e.g. Siemens2) are more difficult to partition into subsets, whereas others show a more natural separation into clusters (e.g. Siemens3, which also presents a small disconnected component).
Fig. 2. Hard constraints distributions for the COST259 Siemens instances
Although the COST259 instances are currently the most realistic class of published benchmarks for the MI-FAP problem, their sizes, of about one thousand transmitters, are still smaller than those found in real networks. Therefore, it is important to test the effectiveness of the decomposed approach on larger data sets too, since the performance of standard meta-heuristics degrades rapidly with increased problem size. Two larger problems C1 and C2 have been additionally generated, with cardinality in the order of 10,000 transmitters, using the benchmark generator tool described in [2], which locates transmitters according to a given probability distribution (it has already been used in [4] to produce benchmarks for the MS-FAP). The idea behind their generation was to extend one of the COST259 test problems to a much larger size, with the aim of conserving its local graph structure characteristics but 'spreading' it over a much larger area. The traffic demand of each of the cells is generated by assigning between one and seven transmitters, drawn from a uniform distribution. Hard constraints consist of a three-channel separation between transmitters within the same cell, whereas the soft constraints are
Fig. 3. Cell locations for data sets C1 (left) and C2 (right), generated respectively with a 'single-town' and a 'two-towns' probability distribution
calculated as a function of the Euclidean distance between pairs of transmitters. Cell locations for data sets C1 and C2 are shown in Figure 3.
7 Experimental Results

A number of runs on the COST259 instances have been considered in order to compare the performance of different decomposition techniques. The aim of these experiments is to exclude from further experimentation any decomposition method which appears to be clearly ineffective for this problem. We conducted this first set of experiments using a standard implementation of SA for a limited number of total evaluations (see Table 1). This particular implementation has been used in [15] for the software FASOFT (which is one of the most complete tools for solving the FAP with heuristics), and also in [5] for some preliminary results on the MI-FAP. The heuristic loops through the partition twice. The GPFAP and the Markov clustering appear superior to the clique and generalized-degree decompositions, which for the hardest instances have difficulties in producing feasible solutions (that is, solutions may contain violated hard constraints). However, neither GPFAP nor Markov clustering appears to clearly outperform the other on all of the instances tested. Furthermore, there is no clear difference between the performance of the balanced and unbalanced versions. Note that in some instances of the unbalanced version one of the subsets includes the majority of the transmitters, making its behaviour very close to that of the problem solved as a whole. This can bring about some advantages in optimality but also increase the runtime of the decomposed approach. Geographical decomposition benefits from the fact that some of the Siemens instances decompose into natural clusters, e.g. Siemens3 in Figure 2. However, this method is penalised by the fact that it takes into account neither the number nor the weights of the inter-edges between different subsets. Clique-based decomposition only partially produces a competitive performance. This may be caused by the fact that, because of the small clique sizes, this method often finds a trivial interference-free solution for the first subset. As a consequence, fixing their assignment has the effect of restricting the assignments of subsequent subsets. Finally,
Table 1. Siemens1-4 for SA with decomposition (100,000 ∗ |V| evaluations per loop). † at least 1 invalid solution; * no valid solutions Bal. GPFAP
Unbal. GPFAP
Gen. Degree
Cliques
noSub
Geog
Markov
3.58 (3.62) 3.43 (3.54) 3.81 (4.01) 3.48 (3.61) 3.65 (3.72) 3.45 (3.56)
3.32 (3.33) 3.26 (3.29) 3.40 (3.61) 3.29 (3.41) 3.34 (3.42) 3.23 (3.27)
first loop second loop SIEMENS 1 [whole = 3.30 (3.37)]
2 3 4
3.16 (3.28) 3.14 (3.18) 3.36 (3.45) 3.32 (3.36) 3.15 (3.19) 3.52 (3.53)
3.38 (3.44) 3.30 (3.37) 3.35 (3.43) 3.29 (3.34) 3.99 (4.20) 3.30 (3.44)
17.91 (17.98) 17.83 (17.91) 19.91 (20.23) 18.30 (18.57) 20.99 (21.10) 18.60 (18.87)
17.93 (18.10) 17.59 (17.86) 19.79 (19.94) 17.99 (18.09) 17.72 (17.93) 17.31 (17.42)
7.66 (7.84) 7.58 (7.81) 7.30 (7.45) 7.24 (7.40) 8.09 (8.28) 7.98 (8.14)
7.59 (7.69) 7.31 (7.49) 7.82 (7.99) 7.79 (7.98) 8.43 (8.57) 7.92 (8.13)
90.65 (91.50) 90.65 (91.48) 94.13 (94.85) 92.88 (93.56) 94.72 (95.68) 93.19 (93.38)
90.62 (90.88) 89.54 (90.09) 88.63 (89.12) 88.23 (88.43) 92.40 (93.63) 92.07 (93.42)
5.46 (5.72) 4.24 (4.31) 5.89 (6.23) 4.99 (5.05) 6.85 (6.99) 5.35 (5.56)
4,57 (4.64) 3.49 (3.58) 3.51 (3.58) 3.42 (3.52) 3.59 (3.67) 3.58 (3.68)
SIEMENS 2 [whole = 16.75 (16.89)]
2 3 4
23.30 (24.37) 18.38 (20.02) 19.33 (19.54) 18.64 (19.04) 18.22 (18.42) 17.96 (18.14) 26.14 (26.26) 19.59 (2,018) † 20.47 (20.38) 21.26 (21.36) 19.35 (19.44) 18.20 (18.71) 27.47 (27.52) 19.82 (2,686) † 20.79 (2,687)† 22.61 (22.76) 19.18 (19.33) 18.95 (19.09)
21.49 (21.81) 19.22 (19.34) 24.09 (24.82) 20.18 (20.68) 22.91 (23.36) 20.45 (20.80)
SIEMENS 3 [whole = 8.14 (8.31)]
2 3 4
2,009 (3,350) * 9.14 (674.8) † 9.15 (1,342) † 8.76 (8.82) 2,016 (3,646) * 8.48 (675.3) † 11.30 (678.0) † 8.03 (8.27) 3,017 (3,684) * 8.61 (675.6) † 12.40 (1,345) † 8.43 (8.48)
9.06 (9.24) 8.83 (8.95) 8.49 (8.81) 8.42 (8.44) 9.09 (9.34) 8.40 (8.68)
8.04 (8.17) 7.77 (7.89) 9.17 (9.22) 7.54 (7.65) 7.55 (7.64) 7.42 (7.44)
96.33 (2,761) † 92.59 (92.75) 96.201 (2,096) † 93.52 (427.5) † 3,399 (4,099) * 95.24 (96.25)
92.22 (92.51) 90.10 (90.19) 94.04 (94.17) 91.43 (91.74) 1,215 (2,112) 94.25 (94.38)
SIEMENS 4 [whole = 92.12 (92.23)]
2 3 4
2,147 (3,145) * 2,120 (3,082) * 3,553 (3,823) * 3,131 (3,514) * 3,749 (3,983) * 3,535 (3,801) *
93.04 (723.3) † 92.62 (93.75) 2,254 (2,371) * 93.92 (95.54) 2,651 (3,001) * 96.39 (97.46)
the generalized-degree decomposition neither considers the inter-edges between subsets nor resembles any sort of clustering or unbalanced partitioning. As a consequence it produces the worst performance among all the decomposition methods tested. Note, for all decomposition algorithms, the effect of the second loop and how this is beneficial both in terms of mean and variance. In physical terms this is related to a locally different interference distribution in the subsets, as will be discussed in more detail at the end of this chapter. Table 2 shows the results for a number of runs in which the subsets have been run independently according to Algorithm 2. We present the best final cost over three runs together with the corresponding partial costs produced by each single subset. These are obtained by recomposing the partial assignments to produce a final complete assignment for all transmitters in the network. Results are shown for balanced and unbalanced GPFAP. The final costs always correspond to infeasible solutions, and similar results were obtained for the other decompositions tested. Consequently, this approach can only be used if followed by further local optimization procedures. For example, it can be used as a pre-processing step before the application of a generic meta-heuristic procedure, which could include the 'sequential' decomposition technique proposed in Algorithm 1.
Table 2. Siemens1-4 for SA with decomposition - subsets solved independently (100, 000 ∗ |V | evaluations per loop) * no valid solutions noSub.
Bal. GPFAP
1st sub 2nd sub 3rd sub 4th sub
Unbal. GPFAP
cost
1st sub 2nd sub 3rd sub 4th sub
cost
SIEMENS 1
2 3 4
1.953 1.171 26,004 * 2.253 1.067 1.008 0.049 0.867 204,017 * 1.128 0.001 1.205 0.5408 0.069 0.219 460,010 * 1.227 0.001
2 3 4
5.050 1.494 1.187 1.142 0.650 0.292
2,003.72 * 0.001 446,038 * 0.000 0.000 452,042 *
SIEMENS 2
82,027 * 7.212 0.927 0.013 198,037 * 1.240 0.176 0.252 0.008 176,041 * 1.017 0.018
58,024 0.017 756,042 * 0.000 0.000 900,044 *
Table 3. SA - Siemens1-4 with GPFAP decomposition (2, 000, 000 ∗ |V | evaluations per loop) † at least 1 invalid solution * no valid solutions SIEMENS1
noSub.
SIEMENS2
SIEMENS3
SIEMENS4
[whole = 2.68 (2.76)] [whole = 15.59 (15.64)] [whole = 6.59 (6.62)] [whole = 86.59 (87.12)] first loop second loop
2 3 Bal. GPFAP
4 5 2 3 Unbal. GPFAP
4 5
2.68 (2.75) 2.61 (2.66) 2.95 (3.01) 2.93 (2.95) 2.94 (3.01) 2.90(2.98) 3.34 (3.43) 3.04(3.13) 2.69 (2.74) 2.60 (2.69) 2.73 (2.75) 2.63 (2.67) 3.73 (3.85) 2.85 (2.86) 4.31 (4.38) 3.28 (3.43)
16.97 (17.04) 16.34 (16.73) 19.54 (19.77) 17.41 (17.60) 20.56 (20.87) 18.02 (18.27) 20.31 (20.46) 17.90 (18.53) 16.94 (17.13) 16.2 (16.37) 19.06 (19.17) 16.83 (16.97) 16.56 (16.72) 16.07 (16.27) 18.44 (18.57) 16.30 (16.37)
6.39 (6.46) 6.37 (6.44) 6.53 (6.81) 6.46 (6.58) 6.95 (7.02) 6.79 (6.89) 7.65 (7.79) 7.26 7.(41) 6.22 (6.23) 5.98 (6.13) 6.68 (6.77) 6.42 (6.56) 6.84 (6.88) 6.75 (6.80) 7.23 (7.24) 6.83 (6.94)
84.35 (84.80) 84.08 (84.39) 89.50 (90.59) 87.16 (88.51) 89.96 (90.43) 89.53 (90.17) 92.94 (93.5) 91.83 (92.77) 85.72 (85.90) 84.58 (85.13) 91.47 (91.94) 89.51 (90.35) 90.71 (91.52) 90.02 (90.68) 93.72 (94.04) 91.18 (92.13)
7.1 GPFAP Results
A number of longer runs were performed with the GPFAP partitioning for both the balanced and unbalanced versions. This decomposition algorithm produced the best outcomes in the comparison tests together with the Markov clustering, but has a much simpler and less expensive implementation. To add generality to the decomposition method we have also implemented a GA beside the standard implementation of SA. The GA used adopts a simple direct representation, in which chromosomes are constituted by a vector whose elements are the frequency values assigned to the corresponding transmitters. Furthermore, when solved by the GA, the FAP has been considered as a multiobjective optimization problem. We implemented a standard NSGA-II using two objectives represented by the co-channel and adjacent-channel interference respectively. Both choices were made because they showed the best performance in preliminary tests [5] (see also [6] for a comparison of different algorithms and representations for multiobjective GAs).
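As a rough, hypothetical illustration (not the actual implementation), a direct-representation chromosome is a vector with one channel per transmitter, and the two objective values used by NSGA-II can be evaluated as below; handling of hard-constraint violations is omitted and assumed to be dealt with separately.

```python
def two_objectives(chromosome, edges, c, c_adj, c_coch):
    """Co-channel and adjacent-channel interference of a direct-representation
    chromosome (one frequency per transmitter), used as the two objectives."""
    coch = adj = 0.0
    for u, v in edges:
        sep = abs(chromosome[u] - chromosome[v])
        if sep == 0 and c[(u, v)] == 0:
            coch += c_coch[(u, v)]       # co-channel interference
        elif sep == 1 and c[(u, v)] <= 1:
            adj += c_adj[(u, v)]         # adjacent-channel interference
    return coch, adj
```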
402
G. Colombo and S.M. Allen
Table 4. GA - Siemens1-4 GA Balanced GPFAP decomposition (2, 000, 000 ∗ |V | evaluations per loop) † at least 1 invalid solution * no valid solutions SIEMENS1
noSub.
SIEMENS2
SIEMENS3
SIEMENS4
[whole = 3.61 (3.72)] [whole = 17.64 (17.91)] [whole = 9.54 (9.77)] [whole = 101.05 (102.14)] first loop second loop
2 3 Bal. GPFAP
4 5 2 3 Unbal. GPFAP
4 5
3.73 (3.80) 3.40 (3.45) 3.7 (3.75) 3.30 (3.44) 4.63 (4.81) 3.37(3.49) 5.70 (6.28) 4.45 (4.61) 3.53 (3.57) 3.18 (3.28) 3.31 (4.52) 3.07 (3.14) 3.46 (3.49) 2.86 (3.04) 3.75 (3.83) 3.24 (3.27)
18.88 (19.03) 17.83 (17.86) 20.70 (20.95) 18.70 (18.89) 21.50 (21.79) 19.10 (19.21) 21.07 (21.23) 18.95 (18.97) 18.96 (19.38) 17.46 (18.21) 20.57 (20.85) 18.07 (18.44) 18.87 (19.11) 17.85 (18.23) 20.03 (21.55) 18.42 (18.79)
6.11 (7.65) 6.08 (7.21) 6.33 (7.09) 6.21 (7.05) 6.74 (7.34) 6.28 (7.19) 7.32 (8.06) 6.66 (7.14) 6.34 (6.62) 6.17 (6.41) 7.02 (7.49) 6.69 (6.77) 7.37 (7.83) 6.76 (6.88) 7.43 (7.52) 7.10 (7.06)
98.31 (99.93) 96.84 (97.22) 103.61 (104.40) 101.09 (101.93) 102.55 (103.60) 101.84 (102.19) 110.71 (768.1) † 106.81 (710.8) † 104.49 (104.88) 103.64 (103.68) 103.11 (103.53) 102.89 (103.02) 103.52 (104.15) 102.90 (103.24) 104.71 (105.28) 104.53 (104.84)
7.1.1 Siemens

Firstly, we present the results for the Siemens benchmarks for a higher number of subsets and total final evaluations. Tables 3 and 4 show the best and average costs over three runs obtained by SA and the GA with and without the decomposed assignment approach. For both the GA and the standard SA, the decomposed assignment approach produces better results than the problem solved as a whole in three out of four of the benchmarks tested. Note that the longer the runs, the smaller the percentage improvement produced by the decomposition. This result could, however, be predicted by considering that longer runs give the heuristic more opportunity to explore the search space, thus improving the actual performance of the whole approach. On the contrary, the GA still seems to show some significant improvements during the longest runs. However, a further increment in the total number of evaluations will have a considerable effect on computational efficiency, since the GA tends to be slower than SA for the same number of configurations explored. The decomposition usually produces better outcomes for a number of subsets between two and four, whereas further increments in the number of subsets degrade the heuristic performance considerably. Although in some of the instances the unbalanced GPFAP produces better results than the balanced version, neither of the two algorithms appears able to clearly outperform the other. Our results for the standard SA improve on those previously published for the same algorithm (see [1]), whereas no competitive results are known for GAs on the same benchmarks. However, this is not sufficient to improve on the solution produced by the whole approach in the case of Siemens2, which is the only benchmark in which the decomposed approach is ineffective on a pure quality basis. Another important aspect is that, although the GA produces an overall poorer performance than SA, its results become considerably better when the
[Figure 4 panels: Siemens1 - SA - balanced - 2 subsets; Siemens1 - SA - balanced - 3 subsets; Siemens2 - SA - balanced - 2 subsets; Siemens2 - SA - balanced - 3 subsets; Siemens2 - GA - balanced - 2 subsets; Siemens2 - GA - balanced - 3 subsets; Siemens1 - SA - unbalanced - 2 subsets; Siemens2 - SA - unbalanced - 2 subsets. Axes: cost vs. run time (seconds).]
Fig. 4. Cost-time plot for the COST259 Siemens benchmarks solved by SA and GA with GPFAP decomposition after 2, 000, 000 ∗ |V | evaluations
decomposed approach is applied. This reduces considerably the gap between the two algorithms. The histograms in Figure 5 compare the best cost produced by SA using the whole approach and by the decomposed approach using both SA and GA. For the decomposed approach the cost in the diagrams is that obtained with the best performing version of the GPFAP and number of subsets. Finally, for both algorithms we can note the positive effect of the second loop, which becomes particularly important when the decomposition is initially not effective, for instance in the case of five subsets. This tendency will be confirmed by the other test problems considered for the MI-FAP, in particular the Bradford benchmarks presented in Section 7.1.2. Figure 4 compares the cost-time curves for Siemens1 and Siemens2 produced by single runs of the meta-heuristics for a single loop with and without the decomposition technique. The solutions of the decomposed assignment approach can be considered valid, that is, consisting of a complete frequency assignment, only in the last of the subsets considered in the sequence (represented with a solid line). In these intervals, if we consider a fixed value of time, the cost obtained by the decomposed assignment is generally lower than that produced by the whole approach. Note that this happens also when the decomposed approach appears ineffective, for example for Siemens2. This also shows that a pure quality criterion may not be the most correct for evaluating its effectiveness. Both plots correspond to the same final number of total evaluations conducted by the meta-heuristic. With this constraint we can observe that the decomposition approach, independently of being more or less optimal, is able to produce very good approximations in a shorter time. The runtime gain can be empirically quantified as 10-15% on the experiments performed for this chapter, but this amount depends to a large extent on the complexity of the graph and the algorithm used. For instance, we may expect more advantages for the GA than for SA, because of the higher number of configurations (evaluations) explored by the evolutionary algorithm.

7.1.2 Bradford

This subsection presents the results obtained for the COST259 Bradford instances solved by GPFAP. All results are produced by SA looping twice through the subsets in a given partitioning. The contribution of the second loop is very important for most of the problems, as shown in Figure 6 for Bradford0, Bradford4, and Bradford10, in which the decomposed approach still produces worse results than the whole at the end of the first loop. This is a consequence of a different distribution of the interference within the network caused by the second loop, whose effect balances the constraint violations (and so the local interference) between different subsets. This will be discussed in more detail in Section 9. The decomposition technique is mostly effective with a minimum number of two subsets, confirming that these benchmarks are harder than the Siemens instances to solve with the decomposed approach.

7.1.3 K and Swisscom

This section presents the results of the last two COST259 benchmarks, namely K and Swisscom. Although different in their structure, these data sets have a comparably small
[Figure 5 panels: Siemens1 and Siemens3; cost vs. number of evaluations (10,000 to 2,000,000 ∗ |V|) for the whole approach and for the decomposed assignment with SA and GA.]
Fig. 5. Cost-evaluations histograms for the COST259 Siemens benchmarks solved by the whole approach (SA) and balanced\unbalanced GPFAP (SA and GA)
[Figure 6 panels: balanced - 2 subsets; unbalanced - 2 subsets. Axes: cost vs. run time (seconds).]
Fig. 6. Cost-time plot for the COST259 Bradford instances solved by SA with balanced GPFAP decomposition with two loops (1, 000, 000 ∗ |V | evaluations per loop)
size of about three hundred vertices. However, they can both be considered hard instances, since K presents a very high connectivity (simulating a dense urban environment) and Swisscom a high number of blocked channels, thus limiting considerably the spectrum of frequencies available. Table 7 shows the results obtained with GPFAP. For K the decomposed assignment approach with the unbalanced version of the GPFAP is effective either at the end of the first or the second loop (for a maximum improvement compared to the whole approach of 35% and 33% respectively). On the contrary, the balanced GPFAP improves on the whole approach only after the second loop. This is caused by the very high density of the graph, which makes a redistribution of the interference necessary in order to produce effective results. Note that, at least for a decomposition into two subsets, both the balanced and unbalanced GPFAP eventually produce comparable results after the two loops (for a maximum improvement of 28% with the balanced version). For Swisscom, because of the limited spectrum availability, the shortest runs have difficulty in producing feasible assignments with the decomposed assignment approach. The balanced decomposition produces infeasible solutions
Table 5. SA - Bradford - GPFAP decomposition (2,000,000 ∗ |V| evaluations per loop)
noSub.
BRADFORD0
BRADFORD2
[whole = 1.31 (1.43) ]
[whole = 4.46 (4.53)]
Bal. GPFAP
Unbal. GPFAP
Bal. GPFAP
Unbal. GPFAP
first loop second loop
2 3 4 5
1.31 (1.41) 1.16 (1.23) 1.18 (1.20) 1.07 (1.13) 1.49 (1.63) 1.26 (1.37) 2.07 (2.15) 1.47 (1.54)
1.37 (1.72) 1.09 (1.14) 1.11 (1.18) 1.09 (1.14) 2.17 (2.31) 1.24 (1.26) 2.62 (2.72) 1.26 (1.29)
5.17 (5.36) 4.27 (4.24) 6.45 (6.92) 5.34 (5.49) 6.97 (7.17) 5.75 (5.93) 7.25 (7.64) 6.13 (6.58)
4.79 (4.99) 3.83 (3.92) 6.15 (6.21) 5.05 (5.24) 6.25 (6.34) 5.14 (5.33) 6.79 (7.24) 5.96 (6.57)
Table 6. SA - Bradford - GPFAP decomposition (2, 000, 000 ∗ |V | evaluations per loop) BRADFORD4
BRADFORD1
[whole = 20.62 (20.79) ]
noSub.
Bal. GPFAP
BRADFORD10
[whole = 1.64 (1.75)] [whole = 187.13 (188.34)]
Unbal. GPFAP
Bal. GPFAP
Bal. GPFAP
first loop second loop
2 3 4 5
21.40 (21.83) 18.51 (18.78) 25.83 (26.27) 20.54 (21.01) 33.93 (34.29) 23.09 (24.65) 37.85 (38.14) 28.52 (29.83)
22.17 (22.77) 19.95 (20.44) 26.64 (27.81) 21.66 (22.21) 34.82 (35.29) 24.63 (25.50) 38.71 (39.31) 29.68 (30.56)
2.16 (2.39) 1.32 (1.21) 2.46 (2.80) 1.88 (2.11) 3.38 (3.65) 2.36 (2.56) 4.34 (4.97) 3.68 (3.81)
192.37 (192.98) 185.05 (185.39) 203.69 (204.60) 188.38 (189.51) 211.81 (212.94) 193.51 (194.43) 213.90 (215.56) 196.02 (197.23)
Table 7. SA - K and Swisscom with GPFAP decomposition (5, 000, 000 ∗ |V | evaluations per loop) † at least 1 invalid solution * no valid solutions noSub.
K
SWISSCOM
[whole = 0.84 (0.89) ]
[whole = 32.24 (32.34)]
Bal. GPFAP
Unbal. GPFAP
Bal. GPFAP
Unbal. GPFAP
first loop second loop
2 3 4
1.12 (1.19) 0.63 (0.65) 1.27 (1.40) 0.81 (0.88) 1.65 (1.70) 0.97 (1.01)
0.64 (0.66) 0.61 (0.57) 0.66 (0.70) 0.64 (0.69) 0.72 (0.74) 0.69 (0.71)
29.39 (30.30) 27.09 (27.80) 30.98 (2,032) † 29.12 (1,026) † 6,030 (4,030)* 5,031 (3,035)*
29.27 (30.14) 26.04 (26.78) 29.54 (29.86) 27.10 (27.51) 32.46 (32.57) 32.12 (32.20)
when decomposed into more than two subsets, whereas the unbalanced GPFAP always appears able to produce assignments belonging to the feasibility domain. When the solutions produced are feasible, the decomposed assignment outperforms the whole approach (for a best improvement of 18% with the unbalanced version).

7.1.4 Cardiff University Benchmarks

This section presents the results obtained by the GPFAP on the large Cardiff University data sets. Note that for very large benchmarks the decomposed approach is expected to
Table 8. SA - Cardiff University benchmarks with balanced GPFAP and geographical decomposition C5
noEvals.
noSub.
1 100, 000 ∗ |V |
2 3 4 1
2, 000, 000 ∗ |V |
2 3 4
C6 GPFAP Geog GPFAP first loop second loop 2.16 (2.31) 1.18 (1.30) 1.22 (1.32) 0.73 (0.89) 1.68 (1.82) 1.30 (1.75) 0.62 (0.75) 0.51 (0.63) 0.98 (1.14) 0.79 (0.93) 0.28 (0.42) 0.23 (0.31) 3.03 (3.46) 2.72 (3.01) 1.25 (1.52) 0.94 (0.99) 1.43 (1.76) 1.29 (1.38) 0.43 (0.65) 0.32 (0.36) 3.37 (3.57) 3.18 (3.25) 1.39 (1.85) 1.09 (1.13) 1.76 (2.11) 1.55 (1.68) 0.55 (0.67) 0.42 (0.56) 0.95 (1.04) 0.37 (0.42) 0.71 (0.74) 0.31 (0.36) 1.01 (1.25) 0.91 (1.16) 0.30 (0.34) 0.23 (0.29) 0.59 (0.68) 0.39 (0.42) 0.09 (0.12) 0.06 (0.10) 2.47 (2.71) 2.14 (2.25) 0.51 (0.57) 0.50 (0.53) 1.10 (1.60) 0.96 (1.09) 0.16 (0.21) 0.14 (0.19) 2.65 (2.84) 2.27 (2.34) 0.57 (0.77) 0.56 (0.63) 1.52 (1.66) 1.26 (1.41) 0.28 (0.46) 0.25 (0.35) Geog
be effective, since this represents a case in which the performance of the meta-heuristics starts degrading, at least in their standard versions. Table 8 reports the costs produced by SA with the balanced GPFAP. We have also considered the geographical decomposition, since it represents an intuitive and natural form of partitioning for these benchmarks, in which transmitters are located in 'towns'. There is a marked difference between the values produced by the shortest and longest runs, confirming that large benchmarks need to be run for a longer time. Furthermore, this tendency is more pronounced for the whole approach, whereas the decomposed approach (when effective) is also able to produce good results for the shortest runs. For the decomposed approach, results still improve as we increase the number of evaluations, although the percentage improvement in comparison with the whole approach is lower than for the shortest runs. This is particularly true for C6 with a decomposition into two subsets, for which the shortest run with 100,000 ∗ |V| evaluations outperforms that obtained by the whole approach with 2,000,000 ∗ |V| evaluations. In any case the shortest runs give an indication of whether or not the decomposed approach will be effective. As already observed for the COST259 benchmarks, for a decomposition into a larger number of subsets (three and four in our examples), the decomposed technique is generally not effective after the first loop through the subsets. However, the second loop brings about a remarkable improvement, leading in some cases to better results than the whole approach (see C6). Note that this happens for both the shortest and the longest runs, thus confirming the validity of the former as a test for the effectiveness of a given decomposition. The partition produced by geographical information appears inferior to the GPFAP for both of the problems. It is important to note that for the 'two-towns' problem the two decompositions found are very similar. This is shown in Figure 7, in which (with the GPFAP) transmitters belonging to each of the different towns have been included in distinct subsets. This can also be interpreted as a confirmation of the effectiveness
Fig. 7. Balanced GPFAP decomposition into two subsets of data set C6
of the GPFAP algorithm, which is able to find an effective decomposition without any extra knowledge (such as that given by the geographical information).
8 Markov Clustering

To complete the set of experiments for the MI-FAP instances, we have conducted a number of runs which use partitions generated by the Markov clustering algorithm (see Section 5.1). Graph clustering presents many similarities with graph partitioning but has the drawback of being more complex and elaborate. Moreover, the quality of the partition produced, in terms of size and number of clusters, depends strictly on the parameter settings used, and the algorithm is not able to produce a specified number of subsets. As a consequence, given the partition into a general number of k subsets produced by the Markov algorithm, we obtain the desired decomposition into n subsets by further applying a graph partitioning algorithm to the clustering (see Section 5.2). Note that the sizes of the resulting subsets represented by the clusters are then generally unbalanced, thus this method is expected to produce outcomes similar to the unbalanced GPFAP. Results are given in Table 9, which shows the outcomes produced by SA with the Markov clustering decomposition. For some of the benchmarks Markov clustering obtains better results than GPFAP (see Table 3). However, neither of the two methods appears to clearly outperform the other in all of the instances tested. Note that Markov clustering is not able to obtain positive results for the data sets with the highest graph density (i.e. K and Siemens2), for which it performs worse than GPFAP. On the contrary, its performance improves for the other Siemens benchmarks, which show a more 'natural' partition into clusters (see Figure 2).
9 Trade-Off between Quality and Runtime

From the discussion of the results in Section 7 it appears that neither an evaluation based on pure quality nor one based on computational complexity is able to define an absolute criterion for evaluating the effectiveness of a given decomposition strategy. In fact, a quality evaluation is unlikely to be successful with enough generality over the whole set of benchmarks, whereas a possible evaluation based only on runtime does not consider any issues about the optimality of the solution produced. In order to fulfill both these requirements
Table 9. COST259 for SA with decomposition - best (average) cost - 2,000,000 ∗ |V| evaluations per loop (* 5,000,000 ∗ |V| evaluations per loop)

noSub.    first loop        second loop
SIEMENS1 [whole = 2.68 (2.76)]
  2        2.76 (2.92)       2.53 (2.64)
  3        2.99 (3.14)       2.74 (2.83)
  4        3.08 (3.16)       2.77 (2.79)
  5        3.76 (3.99)       2.77 (2.94)
SIEMENS2 [whole = 15.59 (15.64)]
  2       19.71 (19.89)     17.32 (17.58)
  3       20.22 (20.45)     17.90 (18.47)
  4       21.97 (21.99)     19.59 (19.90)
  5       22.36 (22.97)     19.59 (20.04)
SIEMENS3 [whole = 6.59 (6.62)]
  2        6.52 (6.88)       6.24 (6.35)
  3        6.44 (6.63)       6.18 (6.39)
  4        7.12 (7.37)       7.01 (7.49)
  5        8.71 (9.15)       7.51 (8.24)
SIEMENS4 [whole = 86.59 (87.12)]
  2       83.67 (83.91)     82.37 (82.54)
  3       85.12 (86.40)     84.70 (85.27)
  4       93.20 (94.05)     91.32 (92.88)
  5       94.08 (95.12)     92.61 (93.45)
BRADFORD0 [whole = 1.31 (1.43)]
  2        1.04 (1.10)       0.94 (1.11)
  3        1.03 (1.08)       0.80 (1.06)
  4        2.03 (2.17)       1.39 (1.52)
  5        2.17 (2.55)       1.31 (1.52)
BRADFORD4 [whole = 20.62 (20.79)]
  2       22.90 (23.16)     19.55 (20.22)
  3       27.29 (27.54)     22.71 (23.16)
  4       34.98 (35.43)     25.23 (25.58)
  5       37.31 (37.98)     28.83 (29.59)
BRADFORD2 [whole = 4.46 (4.53)]
  2        5.32 (5.51)       4.09 (4.23)
  3        5.98 (6.30)       4.22 (4.85)
  4        5.69 (6.01)       4.46 (4.98)
  5        5.94 (6.15)       5.03 (5.65)
K [whole = 0.84 (0.89)]
  2        1.47 (1.98)       0.95 (1.02)
  3        2.51 (3.02)       1.08 (1.15)
  4        2.76 (3.27)       1.12 (1.53)
  5        2.74 (2.95)       1.12 (1.50)
other criteria may be more suitable, taking into account the trade-off between quality and runtime of an approximated solution produced by the decomposed assignment approach applied to the FAP. Given a specific benchmark and decomposition method we can evaluate this trade-off by plotting the interpolation curves between the pairs {cost (best or mean), runtime} for a number of different runs using the decomposed approach. We fix a number of total evaluations and then we run the decomposed approach with a different number of subsets, together with the whole, for the same number of evaluations (number of different configurations explored by the heuristic). Intuitively, this represents a fair criterion for comparison between the two approaches.

Definition 9. Let eval_{i,j} be the number of evaluations corresponding to the ith subset and the jth loop. Let eval_{tot} be the number of evaluations corresponding to the solution obtained with the non-decomposed approach. In order to have the same number of evaluations we have to respect the following condition:

\sum_{j=1}^{nLoops} \sum_{i=1}^{n} eval_{i,j} = eval_{tot}

By considering only one loop, if we want to keep the same number of evaluations per subset, we then explore eval_{i,0} = eval_{tot} ∗ |V_i|/|V| evaluations per subset, so that the number of evaluations in each of the subsets is reduced proportionally to its cardinality. This is expected to produce roughly the same runtime. With the decomposed approach the runtime is expected to be slightly lower than with the whole, with this depending essentially on the connectivity of the graph. We can then repeat the same procedure, for both the decomposed and the whole approach, for other runs corresponding to different numbers
Fig. 8. Trade-off between quality and runtime for a generic test problem decomposed into two and three subsets. N: number of subsets, Ne: number of total evaluations
of total evaluations, thus obtaining other pair values. Moreover, we can repeat the decomposed assignment procedure for an increasing number of subsets, obtaining further points in the diagram. Intuitively, we expect the best quality to be produced either by the non-decomposed approach, or by the solutions obtained with a decomposition into a small number of subsets. As the number of subsets increases, the approximations produced are likely to worsen rapidly (that is, more than proportionally), since it becomes increasingly harder to limit the number of inter-connections E_j^inter between different subsets. Note that if n subsets are used the decomposed assignment approach coincides with the sequential algorithms described in [12]. Ideally, we aim to obtain in the diagram the non-linear curve presented in Figure 8. Note that the point of minimum value would represent the ideal number of subsets for that particular benchmark problem. However, because the cost produced actually increases more sharply than expected (even for a limited number of subsets), it is more likely that the heuristic procedure used will produce a series of curves, each of them corresponding to a fixed total number of evaluations. This behaviour would make it more difficult to predict the most suitable number of subsets for a given data set. An example of this is discussed in Section 9.3.
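To make the budget accounting of Definition 9 concrete, the short sketch below splits a fixed total number of evaluations across subsets in proportion to their cardinalities and across a chosen number of loops. The subset sizes and the even split across loops are illustrative assumptions, not values taken from the experiments.

# Hypothetical split of the evaluation budget required by Definition 9.
subset_sizes = [220, 180, 100]            # |V_i| for an assumed 3-subset decomposition
V = sum(subset_sizes)                     # |V| of the whole benchmark
eval_tot = 2_000_000 * V                  # total budget of the non-decomposed run
n_loops = 2                               # even split across loops (one possible choice)

# eval_budget[i][j]: evaluations for subset i in loop j, proportional to |V_i| / |V|.
eval_budget = [[eval_tot * size / V / n_loops for _ in range(n_loops)]
               for size in subset_sizes]

total = sum(sum(row) for row in eval_budget)
assert abs(total - eval_tot) < 1e-6       # the condition of Definition 9 holds

Section 9.2 explores uneven splits of the same total, for example a short preprocessing loop followed by longer redistribution loops.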
9.1 Distribution of Interference
The beneficial effect of a second loop is present within all the experiments performed in this chapter and, in some of the instances (see Bradford), the decomposed approach is only effective at the end of this further loop (see Figure 6). This is essentially caused by a redistribution of the local interference inside each single subset at the end of each loop. After the first loop the heuristic produces good solutions in the first subset, which, however, constrains the second subset, leading to many violations. Finally, the second loop balances the interference between the two subsets. Investigating the local distribution of the interference is an alternative way of evaluating a given frequency assignment. Figure 9 shows (for the Cardiff University benchmark C6 with a balanced GPFAP decomposition into two subsets) the interference produced in terms of violations of the
[Figure 9: six scatter panels (first, second and third loop, for the first and second subset in each), plotting the intra- and inter-interference violations associated with the 1st and 2nd subsets on axes ranging from 0 to 45000.]
Fig. 9. Intra and Inter-interference between subsets during the three loops of C6 with GPFAP into two subsets
constraints represented by the intra- and inter-edges between subsets. We have here applied the decomposed assignment approach with the subsets considered in sequence for three loops through them (see Algorithm 1). During the first loop the first subset produces only the violations of the constraints represented by its intra-edges, thus ignoring the inter-edges with the other subset. Then the partial assignment of the second subset completes the assignment, but generally produces high interference values for both inter- and intra-edge violations. Subsequently, the second loop balances the interference between the subsets and reduces the violations of the inter-edge constraints between the two subsets. Eliminating this interference would extend the search space to the whole solution space, with ideally no loss of optimality between the solutions produced by the decomposed and the whole approach. However, this is only partially achieved by the redistribution loop and depends essentially on the connectivity of the graph and the 'quality' of the decomposition used. Moreover, further loops do not generally change significantly the balance reached between the inter- and intra-edge violations for any of the subsets.
9.2 Runs with Variable Number of Loops
Given a number of total evaluations for a run with the decomposed assignment approach, the introduction of a second loop through the subsets leads to a reduction of the number of evaluations computed for each subset. This can be done proportionally to the number of loops. For the same number of total evaluations we could add other loops and consequently reduce the number of evaluations per subset, thus producing further {cost, runtime} pairs. However, this has not shown any significant improvement in the quality of the solutions in a number of preliminary experiments in which more than two loops were applied. This is shown in Figure 10, which represents different runs of Siemens1 with a balanced GPFAP decomposition into two subsets in which the same number of total evaluations (4,000,000 ∗ |V|) is reached by the whole and by the decomposed assignment approach with two and three loops respectively. Alternatively, the number of evaluations computed for each subset can also vary between distinct loops (still keeping the desired number of total evaluations fixed, thus satisfying Definition 9). In particular, a number of tests have shown better performance when we reduce considerably the number of evaluations in the first loop (and correspondingly increase that of the others). Note that, since it is performed for a very small number of evaluations (for instance 10% of the total), the first loop assumes the significance of a preprocessing step producing partial assignments, which can be obtained either with the 'sequential' or the 'independent' procedure in Algorithms 1 and 2 respectively. Table 10 shows an example of runs of Siemens2 and Bradford10 (which, for size and connectivity, are two of the hardest COST259 data sets). Results correspond to a GPFAP decomposition for a total of 4,000,000 ∗ |V| evaluations. However, instead of this being computed by two loops of the same duration, the first loop has been further split into two parts. Firstly, we perform a preprocessing loop of 200,000 ∗ |V| evaluations (solved either sequentially or independently). Subsequently, a second loop of 1,800,000 ∗ |V| completes the first half of the evaluations, followed by a further loop of the remaining 2,000,000 ∗ |V|. Results are compared with those shown in Table 3
[Figure 10: two cost versus run time (seconds) panels, labelled 'two loops' and 'three loops', comparing the whole approach (1 subset) with the decomposed approach (2 subsets) over its successive loops for the same total number of evaluations.]
Fig. 10. Cost-time plot for Siemens1 with balanced GPFAP decomposition into two subsets for 4,000,000 ∗ |V| evaluations and different numbers of loops
Table 10. SA - Siem1-4 with GPFAP decomposition with preprocessing - 4,000,000 ∗ |V| evaluations in total (first loop 200,000 ∗ |V|, second loop 1,800,000 ∗ |V|, third loop 2,000,000 ∗ |V|). † at least 1 invalid solution, * no valid solutions

noSub.             Sequential prep.    Independent prep.   No preprocessing
SIEMENS2 [whole = 15.59 (15.64)]
2   first loop     17.60 (17.93)       5,050 (5,150)*      16.97 (17.04)
    second loop    16.38 (16.41)       16.49 (16.92)       16.11 (16.13)
    third loop     16.24 (16.35)       16.34 (16.73)       -
3   first loop     19.85 (20.51)       5,202 (5,234)*      19.54 (19.77)
    second loop    18.08 (18.55)       18.89 (19.02)       17.27 (17.52)
    third loop     17.51 (17.91)       17.41 (17.60)       -
4   first loop     20.95 (21.12)       5,472 (5,760)*      20.56 (20.87)
    second loop    19.35 (19.83)       18.56 (18.60)       18.38 (18.47)
    third loop     17.88 (18.19)       18.02 (18.27)       -
BRADFORD10 [whole = 187.13 (188.34)]
2   first loop     197.62 (197.98)     2,015 (2,022)*      192.37 (192.98)
    second loop    187.74 (188.22)     191.26 (191.54)     186.31 (186.42)
    third loop     190.19 (189.69)     185.05 (185.39)     -
3   first loop     225.90 (226.30)     2,023 (2,027)*      203.69 (204.60)
    second loop    206.70 (207.17)     205.01 (205.86)     203.53 (203.64)
    third loop     201.95 (202.01)     188.38 (189.51)     -
4   first loop     233.01 (236.56)     2,348 (2,354)*      211.81 (212.94)
    second loop    228.28 (228.84)     216.52 (217.09)     212.81 (215.27)
    third loop     213.34 (214.57)     193.51 (194.43)     -
for the same decomposition solved with two equal loops of 2,000,000 ∗ |V| evaluations each. At the end of the first loop both the 'sequential' and the 'independent' preprocessing approaches improve the results obtained in Tables 3 and 6 (2,000,000 ∗ |V| evaluations per loop). Hence the separation of the first loop into a preprocessing phase (which produces a quick approximation of an optimal assignment) and a further loop (which actually redistributes the interference among the subsets) performs better than a single loop performed sequentially through the subsets. However, this mainly happens for a decomposition into two subsets only, with 'sequential' preprocessing superior to 'independent'. Nevertheless, the subsequent final loop (the third loop in our experiments) does not further improve the results, which are in general only comparable with those in Table 3 at the end of the 4,000,000 ∗ |V| evaluations. Figure 11 shows an example of two single runs of Bradford10 with a decomposition into two subsets in which the preprocessing approach is compared with the same decomposition run for two loops of equal duration.
9.3 Cost-Runtime Trade-Off
Figure 12 shows the cost-runtime curves for Siemens1 and Bradford10 respectively, for a number of runs with and without the decomposed assignment approach (with the points corresponding to the whole approach joined by a dotted line). We have considered balanced and unbalanced GPFAP with different numbers of subsets, runs corresponding to one single loop or two loops, and runs with different numbers of evaluations per loop.
[Figure 11: two cost versus run time (seconds) panels, 'sequential preprocessing' and 'independent preprocessing', each comparing the whole run with the decomposed runs with and without the preprocessing loop.]
Fig. 11. Cost-time plot for Bradford10 with balanced GPFAP decomposition into two subsets for 4,000,000 ∗ |V| evaluations with the preprocessing approach (three loops) and without it (two loops)
[Figure 12: two quality versus run time (seconds) panels, 'Siemens1' (1 subset; 2 and 3 subsets, balanced and unbalanced) and 'Bradford10' (1 subset; 2, 3 and 4 subsets, with second-loop and preprocessing variants).]
Fig. 12. Trade-off between cost and runtime for Siemens1 and Bradford10
Three groups of runs have been conducted for different numbers of total evaluations (1,000,000 ∗ |V|, 2,000,000 ∗ |V|, and 4,000,000 ∗ |V| respectively). Note that the three plots produced recall the expected curves described in Figure 8. If we focus on the central curve (corresponding to a total of 2,000,000 ∗ |V| evaluations) we can observe that, for both of the benchmarks, the balanced decomposition into two subsets produces the best results when two loops are performed. By repeating the analysis for different numbers of total evaluations and decomposition methods, these diagrams can represent the basis for defining the optimal decomposition criterion for a specific benchmark. This involves the investigation of the best performing:
• number of subsets
• size of subsets (e.g. balanced and unbalanced decompositions)
• number of evaluations per loop and subset (e.g. with and without preprocessing)
10 Conclusion

The minimum interference FAP is an optimization problem which can be solved with exact methods only for small or tractable problems. Meta-heuristic techniques have been proposed to handle larger and more practical problems. However, only highly specialised algorithms (often incorporating exact procedures for local optimization) can solve the hardest instances. Furthermore, they have never been applied to very large networks. This chapter proposed a decomposed assignment approach to solve the MI-FAP with meta-heuristics. Our procedure is based on an initial partition of the interference graph representing the network into one or more subgraphs. A meta-heuristic procedure is then applied to each of the subsets in turn to produce a sequence of partial solutions. Subsets can be solved either sequentially or independently; in the sequential case, when the current subset is considered, the algorithm keeps the assignment of the transmitters in the previously assigned subsets fixed, respecting the constraints between them. Finally, the partial solutions are recomposed to give a complete assignment of the original problem. We have tested a range of decomposition methods to produce a partitioning of the interference graph representing the network. These include the generalization of methods previously used for the FAP, such as clique detection and partitions based on generalized degree, together with novel applications and modifications of existing graph partitioning and clustering methods. We then applied the decomposed approach to a standard implementation of SA and a multi-objective GA (NSGA-II) to solve a subset of the COST-259 instances and other larger benchmarks provided by Cardiff University. These have been specifically generated in order to simulate real practical networks. Two partitioning methods based on graph clustering (Markov clustering) and graph partitioning (GPFAP) outperform the other decomposition methods tested. The effectiveness of the decomposed approach appears to be independent of the specific algorithm used. This is shown by the COST-259 results, for which not only has this approach improved the previously published results produced by the standard implementation of SA, but it has also allowed the GA to be more competitive for this problem, where no published results are yet known for these benchmarks for this category of evolutionary meta-heuristics. This approach (when effective) also allows the use of standard meta-heuristics independently of the data set used, thus avoiding the use of algorithms specifically designed for a particular class of benchmarks. Moreover, the GPFAP appears able to effectively solve the large realistic Cardiff University benchmarks without the need for any further knowledge about the network (e.g. geographical information about transmitter locations). Finally, the contribution given by a further reassignment loop over the subsets of the partition has been shown to be effective in producing good approximations of the optimal solutions, even in the situations in which the decomposition approach is not completely successful on a pure quality basis. In these cases an evaluation of the effectiveness of this approach based on the trade-off between quality and runtime appears to be a more suitable alternative. Future enhancements will concern the application of the decomposed approach to more realistic and complex interference models (multiple interference model) and the
definition of more general criteria which, given a benchmark and a partitioning method, identify the optimal decomposition in terms of the size and number of subsets and the number of loops.
References

1. FAP web - A website about Frequency Assignment Problems (2007), http://fap.zib.de/ (accessed on June 1, 2007)
2. Allen, S.M., Dunkin, N., Hurley, S., Smith, D.: Frequency assignment problems: benchmarks and lower bounds. University of Glamorgan, UK (1998)
3. Brandes, U., Gaertler, M., Wagner, D.: Experiments on Graph Clustering Algorithms. In: Proc. of the 11th Annual European Symposium on Algorithms, Budapest (2003)
4. Colombo, G.: A Genetic Algorithm for frequency assignment with problem decomposition. International Journal of Mobile Network Design and Innovation 1-2, 102–112 (2006)
5. Colombo, G., Allen, S.M.: Problem decomposition for Minimum Interference Frequency Assignment. In: Proc. of the IEEE Congress on Evolutionary Computation, Singapore (2007)
6. Colombo, G., Mumford, C.L.: Comparing Algorithms, Representations and Operators for the Multi-objective Knapsack Problem. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC 2005), Edinburgh, Scotland, pp. 1268–1275 (2005)
7. Correia, L.M. (ed.): Wireless Flexible Personalised Communications. Wiley, Chichester (2001)
8. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. on Evolutionary Computation 6, 182–197 (2002)
9. Eisenblatter, A.: Frequency Assignment in GSM Networks: Models, Heuristics, and Lower Bounds. PhD thesis, Technische Universitat Berlin, Berlin, Germany (2001)
10. Gamst, A.: Some lower bounds for a class of frequency assignment problems. IEEE Transactions on Vehicular Technology 35, 8–14 (1986)
11. Grace, D., Burr, A.G., Tozer, T.C.: Comparison of Different Distributed Channel Assignment Algorithms for UFDMA. In: 2nd IEEE International Conference on Personal, Mobile and Spread Spectrum Communications, pp. 38–41 (1996)
12. Hale, W.K.: Frequency assignment: Theory and applications. Proc. IEEE 38, 1497–1514 (1980)
13. Hale, W.K.: New spectrum management tools. In: Proc. IEEE International Symposium on Electromagnetic Compatibility, pp. 47–53 (1981)
14. Hurley, S., Smith, D.: Meta-Heuristics and channel assignment. In: Hurley, S., Leese, R. (eds.) Methods and algorithms for radio channel assignment. Oxford University Press, Oxford (2002)
15. Hurley, S., Smith, D., Thiel, S.U.: Fasoft: a system for discrete channel frequency assignment. Radio Science 32(5), 1921–1939 (1997)
16. Karaoglu, N., Manderick, B.: FAPSTER - a genetic algorithm for frequency assignment problem. In: Proc. of the 2005 Genetic and Evolutionary Computation Conference, Washington D.C., USA (2005)
17. Koster, A.M.C.A., van Hoesel, C.P.M., Kolen, A.W.J.: Solving partial constraint satisfaction problems with tree decomposition. Networks 40(3), 170–180 (2002)
18. Mannino, C., Sassano, A.: An enumerative algorithm for the frequency assignment problem. Discrete Applied Mathematics 129(1), 155–169 (2003)
19. Mannino, C., Oriolo, G., Ricci, F.: Solving Stability Problems on a Superclass of Interval Graphs. T.R. n. 511, Vito Volterra (2002)
20. Montemanni, R., Moon, J.N., Smith, D.H.: An improved Tabu Search algorithm for the Fixed-Spectrum Frequency-Assignment problem. IEEE Transactions on Vehicular Technology 52(3), 891–901 (2003)
21. Pardalos, P., Rappe, J., Resende, M.: An exact parallel algorithm for the maximum clique problem. In: De Leone, P.P.R., Murl'i, A., Toraldo, G. (eds.) High Performance Algorithms and Software in Nonlinear Optimization. Kluwer, Dordrecht (1998)
22. van Dongen, S.: A cluster algorithm for graphs. Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam (2000)
23. Waharte, S., Boutaba, R.: Comparison of Distributed Frequency Assignment Algorithms for Wireless Sensor Network. Technical Report, University of Waterloo, ON, Canada
Set Representation and Multi-parent Learning within an Evolutionary Algorithm for Optimal Design of Trusses

Amitay Isaacs, Tapabrata Ray, and Warren Smith

University of New South Wales, Australian Defence Force Academy, Northcott Drive, ACT 2600, Australia
{a.isaacs,t.ray,w.smith}@adfa.edu.au
Summary. Trusses are common structures that comprise one or more triangular units constructed with straight slender members connected at joints. Outlined in this chapter is a novel scheme for the representation of truss geometry and an evolutionary optimization algorithm that operates on the new representation, guided by inheritance and learning. Trusses are represented as a set of elements having a collection of properties (e.g. cross-sectional area, type of material). Sets with varying cardinality represent truss structures with different numbers of elements and hence different topologies. The evolutionary algorithm generates topologies by inheriting common elements from the parents, and the corresponding element properties are generated via recombination. Depending on the physical problem being solved, specific recombination and mutation operators can be designed to aid the optimization process. One such mutation operator is used in this study to reduce the number of elements in an attempt to identify the smallest feasible topology. Another mutation operator is used to perturb the properties of the elements. Also provided in this chapter are useful insights about learning and inheriting topologies and element properties from 3, 5 and 8 parents. A number of case studies are included to highlight the benefits of our proposed set based representation and the effects of learning topologies from multiple parents.

Keywords: Linkage learning, set based representation, truss design, evolutionary algorithm.
1 Introduction

The design of trusses involves sizing, configuration and topology optimization. In sizing optimization of trusses, the cross-sectional areas of the truss members are the design variables, while the coordinates of the nodes and the connectivity of the elements are assumed to be fixed. In a configuration optimization problem, the coordinates of the nodes are the design variables while the element connectivity and their cross-sectional areas are held fixed. Topology optimization deals with varying the connectivity of the elements along with their cross-sectional areas. Simultaneous sizing, configuration and topology optimization have been attempted in the past using genetic algorithms. Binary coding has been used by Wu and Chow [1], Kaveh and Kalatjari [2], and Coello [3]. In binary coding,
discrete values that are not powers of 2 cannot be represented exactly and one needs to include redundancy or repetitions. Real-coded variables have been used by Deb [4, 5] to represent the cross-sectional areas of truss elements. A negative value of the cross-sectional area of an element would mean an absence of that truss member. For discrete values of cross-sectional area, discrete versions of the simulated binary crossover (SBX) and mutation operators were used. Tang, Tong and Gu [6] used a mixed representation in which an integer coding was used to represent the discrete values of the cross-sectional areas and a binary coding was used to represent the topological variables (one for each of the truss members). An interesting concept of the Variable String Length Genetic Algorithm (VGA) was introduced by Rajeev and Krishnamoorthy [7], using a cut-and-splice operator as in the messy GA [8] to design various topologies. More recently, a set representation based evolutionary algorithm [9] was proposed by the authors. The set representation allows the use of continuous and discrete variables without the need for binary or integer coding. The approach utilizes the fact that truss-like structures can be very easily represented as a collection of elements having certain properties. Identification of the elements of an optimum topology using the set representation has similarities with linkage identification problems. Linkage refers to the association among variables that are inter-dependent. The elements of an optimum topology can be identified as building blocks with an inter-dependence relationship. The mechanism of inheriting common elements from multiple parent designs is an attempt to capture the linkage between the elements. A number of optimization algorithms have been proposed in recent years that tend to exploit variable linkages in an attempt to sample promising solutions. A comprehensive survey of linkage learning techniques in evolutionary algorithms has been reported by Chen et al. [10]. These include the linkage learning genetic algorithm (LLGA) [11], the fast messy GA (FMGA) [12], the gene expression messy GA (GEMGA) [13] and estimation of distribution algorithms (EDAs) [14]. This chapter is organized as follows. The set based representation of truss structures is outlined in Sect. 2. The proposed evolutionary algorithm using the set based representation, the p-REDISTRIBUTE recombination operator and the mutation operators are described in Sect. 3. Case studies of truss design using continuous and discrete area variation and their results are presented in Sect. 4. The effects of learning and inheriting topologies from multiple parents are also studied and discussed in Sect. 4. A summary of our findings and concluding remarks are listed in Sect. 5.
2 Set Representation

The proposed representation uses the abstraction of a set (of elements) to represent the geometric entities made up of simpler elements. A truss is represented as a set with multiple link or beam elements. A structural entity or an object can be represented by a set S,
S := {E1, E2, . . . , En}

made up of finite elements E1 to En. The cardinality of a set, |S|, is the count of the number of elements (n = |S|) in that set. Sets with differing cardinality represent truss structures with different numbers of elements. If the ground structure (which is the collection of all possible elements) for a particular truss has 10 elements, the set SG can be defined as

SG := {E1, E2, . . . , E10}.

Elements are characterized by a finite set of properties, Pk. Each property can represent a physical quantity (e.g. length, cross-sectional area) or a material property (e.g. density, Young's modulus):

Ei := {P1, P2, . . .}

A property can either have a fixed or a variable value. Material properties (e.g. density) are represented using fixed values while physical properties can be defined by a range or a collection.
• Range – Properties such as the cross-sectional area can take values in a range of possible areas, or the number of gear teeth can take values from a set of integer values.
  Area ≡ P1 ∈ [10.0, 20.0] cm²
• Collection – Properties such as the type of material can be represented as a collection of density values. A collection is made up of discrete values, either real or integer.
  Density ≡ P2 ∈ {0.1, 0.12, 0.2, 0.22} lb/in³

A random collection of truss elements from the ground structure can generate invalid truss structures. To weed out such structures at the time of model construction, a number of constraints need to be specified. A constraint can be represented as a set of one or more elements, at least one of which needs to exist in any viable structure. Consider a constraint C1 as a set of elements E1, E3, E7. In a constructed set S1 one or more elements of C1 have to be present. Mathematically it can be represented as a non-null intersection of S1 and C1:

C1 := {E1, E3, E7},   S1 ∩ C1 ≠ ∅

Thus any truss structure can be represented as a set of elements (a subset of the elements of the ground structure) with properties – cross-sectional area and density. The cardinality of this set itself can be variable. Density can be a fixed property whereas cross-sectional area can be a variable, taking values from a range or a collection. A number of constraints can be specified to enforce connectivity requirements. A valid structure is one that satisfies all the constraints, such that

S1 ⊆ SG,   ∀j  Cj ∩ S1 ≠ ∅.
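The following short sketch illustrates this set based representation in code. The class and variable names, the constraint sets (taken from the 10-member ground structure used later in Table 2) and the property values are illustrative only, not the chapter's implementation.

import random

class Element:
    """A truss element with a fixed density and a variable cross-sectional area."""
    def __init__(self, name, area_range=(1.0, 35.0), density=0.1):
        self.name = name
        self.area_range = area_range   # variable property defined by a range (in^2)
        self.density = density         # fixed material property (lb/in^3)
        self.area = None

    def randomize(self):
        lo, hi = self.area_range
        self.area = random.uniform(lo, hi)

GROUND = {f"E{i}" for i in range(1, 11)}            # ground structure S_G
CONSTRAINTS = [{"E1", "E7"}, {"E3", "E8"},
               {"E5", "E7", "E10"}, {"E6", "E9"}]   # each C_j must be intersected

def is_valid(subset):
    """A structure S, a subset of S_G, is valid if it intersects every constraint set C_j."""
    return all(subset & c for c in CONSTRAINTS)

print(is_valid({"E1", "E3", "E5", "E6", "E9", "E10"}))   # True for this subset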
3 Evolutionary Algorithm

The main steps of an evolutionary algorithm are shown in Algorithm 1. The algorithm starts with initializing the population P1 of the first generation.

Algorithm 1. Top level Algorithm
Require: N > 1 {Number of Generations}
Require: Np > 0 {Population size}
 1: P1 = Initialize()
 2: Evaluate(P1)
 3: for i = 2 to N do
 4:   Rank(Pi−1)
 5:   repeat
 6:     p1, p2, . . . = Select(Pi−1)
 7:     c1, c2, . . . = Recombine(p1, p2, . . .)
 8:     Mutate(c1, c2, . . .)
 9:     Add c1, c2, . . . to Ci−1
10:   until Ci−1 has Np children
11:   Evaluate(Ci−1)
12:   Pi = Reduce(Pi−1 + Ci−1)
13: end for
The major steps of the evolutionary algorithm are:
• Evaluate – For all the individuals of a population calculate the objective and the constraint functions.
• Rank – Rank all the candidate solutions in the population to identify good candidate solutions.
• Select, Recombine, Mutate – Using Selection, Recombination and Mutation operators an offspring population is evolved from the parent population. The selection operator selects the individuals as parents (p1, p2, . . .) that undergo recombination to create offspring (c1, c2, . . .).
• Reduce – From the parent population (Pi) and child population (Ci) select better individuals (with better fitness) to form the population (Pi+1) for the next generation.

3.1 Initialization
Each individual of the population is a subset S of the ground set SG . The cardinality of this set can be constrained by prescribing limits as either a fixed number (|S| = 7) or an integer range (|S| = [6, 8]). The subset selection mechanism ensures that the selected subset S satisfies all the specified constraints. The procedure for subset selection is outlined in Algorithm 2. The algorithm requires the ground set and the size of subset required. Additional input is a set of preselected elements (i.e. the elements which have to be part of S). In the initialization
process, the set of preselected elements is empty. The algorithm starts with a random subset S (i.e. the elements are picked in random order) of the required size n from the ground structure set SG. The function VerifyConstraints(S) checks if the subset S satisfies all the specified constraints. If the subset S does not satisfy the constraints, a new subset is formed by exchanging an element from S with an element from the remaining elements of the ground structure set (SG − S). This iterative process methodically swaps elements and generates all the possible combinations of size n from the elements of the ground structure set. Once a valid subset S is found, the properties of all the elements in the subset S are initialized. Only variable properties need to be initialized. For each property, a value is randomly picked from a range or a collection depending on the type of the property.

Algorithm 2. Subset construction with construction constraints
Require: SG {Ground set}
Require: n1 ≤ n ≤ n2 {Required number of elements}
Require: Sp {Preselected elements}
 1: Sg = SG − Sp {Remove any preselected elements}
 2: Select S ⊂ Sg such that |S| = n − |Sp|
 3: if VerifyConstraints(S ∪ Sp) then
 4:   STOP
 5: end if
 6: for p ∈ S do
 7:   Sr = Sg − S {Remaining elements}
 8:   for q ∈ Sr do
 9:     S = (S − {p}) ∪ {q} {New subset formed by exchanging p with q}
10:     if VerifyConstraints(S ∪ Sp) then
11:       STOP
12:     end if
13:   end for
14: end for
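Continuing the Python sketch from Sect. 2 (reusing GROUND, CONSTRAINTS and is_valid), a rough equivalent of Algorithm 2 could look as follows; the single-swap repair mirrors the listing above and is illustrative rather than exhaustive.

import random

def construct_subset(ground, n, preselected=frozenset()):
    """Pick n elements from `ground` so that every construction constraint is
    intersected; if the first random pick is invalid, try one-element swaps."""
    pool = list(ground - set(preselected))
    random.shuffle(pool)
    subset = set(pool[: n - len(preselected)])
    if is_valid(subset | set(preselected)):
        return subset | set(preselected)
    remaining = [e for e in pool if e not in subset]
    for p in list(subset):
        for q in remaining:
            candidate = (subset - {p}) | {q}
            if is_valid(candidate | set(preselected)):
                return candidate | set(preselected)
    return None   # no valid subset of size n found by single swaps

# Initialization: the preselected set is empty and sizes are drawn from [5, 10].
population = [construct_subset(GROUND, random.randint(5, 10)) for _ in range(100)]

During recombination (Sect. 3.4) the preselected argument instead carries the common elements inherited from the parents.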
3.2 Ranking
The ranking process is used to identify the better individuals in the population. It is based on the values of the objective and the constraints. For unconstrained problems, the individual with better objective function value has a higher rank. For constrained problems a feasible solution has a higher rank than an infeasible solution. Within the set of feasible solutions, the ranks are computed based on the objective function value. Within the set of infeasible solutions, the ranks are based on the maximum constraint violation. Individuals with lower maximum constraint violation are ranked higher.
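A compact way to express this feasibility-first ranking is as a sort key, as in the illustrative snippet below; the (objective, violations) tuples are made-up examples, not results from the chapter.

def rank_key(individual):
    """Feasible solutions first (ordered by objective value, minimised here);
    infeasible ones after, ordered by their maximum constraint violation."""
    objective, violations = individual
    max_violation = max(violations, default=0.0)
    if max_violation <= 0.0:
        return (0, objective)
    return (1, max_violation)

population = [(5100.0, [0.0]), (4980.0, [3.2]), (5250.0, [])]
ranked = sorted(population, key=rank_key)   # best-ranked individual first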
3.3 Selection
The selection process is used to pick individuals from the population that undergo recombination and mutation to create offspring. The individuals are selected using roulette wheel selection, where the probability of an individual being selected is proportional to the fitness of that individual. Fitness based selection schemes usually suffer from premature convergence, so to alleviate this problem the recombination operator often samples offspring from the entire design space.
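A generic roulette wheel selection can be sketched as below. It assumes non-negative fitness values in which larger means better; the chapter does not spell out its exact fitness assignment, so this is only an illustration of the mechanism.

import random

def roulette_select(population, fitness, k):
    """Pick k parents with probability proportional to their fitness values."""
    total = sum(fitness)
    parents = []
    for _ in range(k):
        threshold = random.uniform(0.0, total)
        running = 0.0
        for individual, f in zip(population, fitness):
            running += f
            if running >= threshold:
                parents.append(individual)
                break
    return parents

parents = roulette_select(["A", "B", "C", "D"], [4.0, 3.0, 2.0, 1.0], k=3)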
3.4 Recombination
Our recombination uses the p-REDISTRIBUTE operator as outlined in Algorithm 3 to create p offspring from p parents. The number of individuals undergoing recombination (p) is a user defined parameter. The effect of the number of individuals undergoing recombination on the convergence is studied in this chapter. The Recombination Probability determines whether the offspring are generated from the parents using recombination or created randomly. The random creation step helps maintain diversity in the population and avoids the problems of premature convergence. The main step involves the redistribution of the elements of the p parents (S1, . . . , Sp) to create p offspring (S′1, . . . , S′p). The cardinality of each of the offspring is decided by the Resizing Probability (Ps). If the random number is less than the resizing probability, the cardinality of each offspring is set randomly from the allowed range; otherwise it is the same as that of the corresponding parent (|S′1| = |S1|, . . . , |S′p| = |Sp|). First, the common elements present (if any) in the parents are extracted (Sc). The properties of the common elements are grouped together (V) and the parent-centric recombination operator (PCX) [15] is used to create new values for those properties. This has the effect of perturbing the common elements (Sc,i) for each parent Si. The rest of the elements from each of the parents are pooled together (Sr). This pool will have multiple copies (less than p) of the elements, as the parents can have common elements (with the same or different values for properties). If an offspring cardinality is different from that of the parent, it may not be possible to create a valid structure using the elements in Sr, as there may not be sufficient distinct elements. In such a case, all the elements missing from Sr are added to the pool (add missing = 1). The new elements that are added to the pool Sr are initialized randomly. Offspring are now created by picking subsets from the pool using Algorithm 2. As the offspring inherit the common elements (Sc,i) from the parents, the subset selection algorithm is invoked with the common elements as the preselected elements. Finally, the elements selected using the subset selection algorithm are combined with the common elements (Sc,i) from each parent to create the offspring sets of elements (S′i).
Algorithm 3. p-REDISTRIBUTE Operator
Require: SG {Ground Structure set}
Require: Pr > 0 {Recombination Probability}
Require: Ps > 0 {Resizing Probability}
Require: S1, . . . , Sp {p parents chosen for recombination}
Require: [n1, n2] {Possible number of elements in a structure}
Require: 0 < n1 ≤ |Si| ≤ n2 ≤ |SG|, i = 1, . . . , p
 1: if random[0, 1] < Pr then
 2:   if random[0, 1] < Ps then
 3:     mi = random[n1, n2], i = 1, . . . , p
 4:     add missing = 1
 5:   else {Number of elements in offspring are same as parents}
 6:     mi = |Si|, i = 1, . . . , p
 7:     add missing = 0
 8:   end if
 9:   Sc = ∩ Si, i = 1, . . . , p {Common elements}
10:   Sa = ∪ Si, i = 1, . . . , p {All elements used by p parents}
11:   V = {∀k Pk ∈ Ej, ∀j Ej ∈ Sc} {Collect properties of common elements}
12:   PCX(V) {p-parent PCX operator}
13:   Sc,i = {∀j Ej ∈ Sc}, i = 1, . . . , p {Offspring subsets created using PCX}
14:   Sr = Sa − Sc {The rest of the elements}
15:   if add missing = 1 then
16:     Sr = Sr ∪ (SG − Sa) {Add any missing elements}
17:   end if
18:   S′i = Sc,i ∪ Subset(Sr, mi, Sc,i), i = 1, . . . , p {Construct offspring}
19: else {No Recombination}
20:   S′i = Subset(SG, mi, ∅)
21: end if
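The core bookkeeping of p-REDISTRIBUTE (inherit the intersection, pool and redistribute the rest) can be condensed into a few lines; the sketch below continues the earlier snippets (reusing construct_subset), keeps each parent's cardinality, and deliberately omits the PCX perturbation of the common elements' properties, so it is a simplification rather than a faithful implementation.

def redistribute(parents):
    """Reduced p-REDISTRIBUTE sketch operating on sets of element names."""
    common = set.intersection(*parents)      # S_c: elements shared by all parents
    pooled = set.union(*parents) - common    # S_r: the remaining pooled elements
    offspring = []
    for parent in parents:
        size = len(parent)                   # keep the parent's cardinality
        child = construct_subset(pooled | common, size, preselected=common)
        offspring.append(child if child is not None else set(parent))
    return offspring

children = redistribute([{"E1", "E3", "E5", "E6", "E9"},
                         {"E1", "E3", "E7", "E9", "E10"},
                         {"E1", "E4", "E8", "E9", "E10"}])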
3.5 Mutation
The mutation operator (MUTATE) perturbs the value of one of the properties of a randomly selected element from the set. For real continuous values, the polynomial mutation operator [16] is used to perturb the values. For properties with discrete values, the polynomial mutation operator is used on the index of the value to obtain a new index. For the problem of truss design, it is easier for the algorithm to pick up sets with a large number of elements, as they will satisfy the constraints easily. To ensure that the final topology has fewer elements, an operator is required that will keep removing elements from the structure. In addition to the standard mutation operator, a DELETE operator is used. This is a mutation operator which randomly removes an element from the structure. The removal of an element from the structure can lead to an infeasible structure or a structure having fewer elements than required. To avoid having fewer than the required elements, the
DELETE operator is only applied if the number of elements is more than the minimum required. In addition, after removing an element the structure is checked for constraint violation. If any constraint is violated, the element is added back into the structure. The process is repeated with the other elements in the structure. Thus the DELETE operator always ensures that the structure remains feasible subject to the specified constraints. Both operators, the DELETE operator and the MUTATE operator, are used as mutation operators. In the problem definition the probabilities for the DELETE and MUTATE operators are specified. This is in addition to the mutation probability. The mutation probability determines whether a particular candidate solution will undergo mutation or not. The operator probability determines which mutation operator (DELETE or MUTATE) is used for the mutation.
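A minimal sketch of the DELETE operator, again reusing is_valid from the representation sketch, could look as follows; the minimum-cardinality argument and the element names are illustrative.

import random

def delete_element(structure, min_elements=5):
    """Try to drop one element while staying feasible and above the minimum size."""
    if len(structure) <= min_elements:
        return structure                      # nothing can be removed
    for name in random.sample(sorted(structure), len(structure)):
        candidate = structure - {name}
        if is_valid(candidate):               # otherwise keep this element and try another
            return candidate
    return structure                          # no element could be removed feasibly

smaller = delete_element({"E1", "E3", "E5", "E6", "E9", "E10"}, min_elements=5)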
4 Case Studies

4.1 Problem Description
A 6-node, 10-member ground structure is shown in Fig. 1. The problem is the design of the minimum weight truss structure using simultaneous sizing and topology optimization.
Fig. 1. Ground structure for 6-Node, 10-Member Truss
The structure is loaded with a force P of 100000 lbf at nodes 2 and 4. The dimensions of the structure and the density of the material are given in Table 1. The maximum allowed stress in compression or tension is 25 kpsi. The maximum displacement allowed at nodes 2 and 4 is 2 in.

Table 1. 10-Member Truss Design Parameters

Parameter                     Value
a                             720 in
b                             360 in
P                             100000 lbf
ρ (density)                   0.1 lb/in³
E (Young's Modulus)           10⁷ psi
σmax                          ± 25 kpsi
dispmax (at nodes 2, 4)       2 in

For sizing optimization, member cross-sectional areas are the design variables. Depending on the values of the areas used, the following three separate problems are defined. The first problem uses a continuous variation in the cross-sectional area of each member; the other two problems use two different sets of discrete values for the allowable areas.
• Problem 1: Continuous area variation from 1 in² to 35 in²
• Problem 2: Discrete area variation from 1 in² to 35 in² in steps of 1 in²
• Problem 3: Discrete area variation from available member sizes taken from the American Institute of Steel Construction Manual [7]. A = (1.62, 1.80, 1.99, 2.13, 2.38, 2.62, 2.63, 2.88, 2.93, 3.09, 3.13, 3.38, 3.47, 3.55, 3.63, 3.84, 3.87, 3.88, 4.18, 4.22, 4.49, 4.59, 5.12, 5.74, 7.22, 7.97, 11.50, 13.50, 13.90, 14.20, 15.50, 16.00, 16.90, 18.80, 19.90, 22.00, 22.90, 26.50, 30.00, 33.50). All the values are in square inches.
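For orientation, the objective minimised in all three problems is simply the total member weight, the sum of ρ · A_i · L_i over the elements present in the structure. The snippet below illustrates this; the member names, areas and lengths are made-up values, not the optimal design.

def truss_weight(members, rho=0.1):
    """Total weight (lb) of a candidate structure.
    `members` maps an element name to an (area_in2, length_in) pair."""
    return sum(rho * area * length for area, length in members.values())

example = {"E1": (30.0, 720.0), "E3": (25.0, 720.0), "E9": (20.0, 509.1)}
print(truss_weight(example))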
4.2 Problem Representation
The truss design problem outlined in Sect. 4.1 above can be described by the set representation shown in Table 2. The problem description is organized in sections. The [Algorithm] section in Table 2 defines the parameters controlling the optimization process. The parameters Population and Generations determine the size of the population and the number of generations for which the evolutionary algorithm runs. Fitness defines the user-defined function to be used for calculating the objective function and the constraints. The Objectclasses parameter lists the object classes that define the physical problem. In this case there is a single object class, Truss, which defines the truss structure. The [Truss] section in Table 2 identifies that a set type representation is to be used for this object. The elements belonging to the ground structure are given as E1, . . ., E10, defined by the parameter Elements. Each object can have from 5 elements to 10 elements, as defined by the Count (Cardinality) parameter. This range can be shrunk to reduce the size of the design space. Constraints on the construction of the set are defined by the parameters Constraint 1, . . ., Constraint 4. These constraints ensure that the fixed nodes and the nodes where the load is applied have at least one truss element connected to them to transfer the applied loads. The elements are described next, in separate sections. In Table 2, the [E1] section defines the properties for element E1. The Area parameter lists the range of cross-sectional areas, from 1 in² to 35 in². The density is given as 0.1 lb/in³. If all the properties of each element are identical, then instead of defining all the properties for each element repeatedly, the inherit parameter can be used. This
Table 2. Optimization Problem Description of 10-member truss
[Algorithm]
Population = 100
Generations = 100
Fitness = Truss 6node
Objectclasses = {Truss}

[Truss]
Type = Set
Elements = {E1,E2,E3,E4,E5,E6,E7,E8,E9,E10}
Count = [5,10]
Constraint 1 = {E1, E7}
Constraint 2 = {E3, E8}
Constraint 3 = {E5, E7, E10}
Constraint 4 = {E6, E9}
Mutation Operators = {MUTATE:0.8, DELETE:0.2}
Recombination Operators = {REDISTRIBUTE:1.0}
Recombination Probability = 0.8
Mutation Probability = 0.2
MUTATE Distribution Index = 20
REDISTRIBUTE Parents = 5
...

[E1]
Area = [1.0, 35.0]
density = 0.1

[E2]
inherit: E1
...
allows the element [E2] to inherit all the properties (i.e. area and density) from the element [E1].

4.3 Experimental Setup
All the design problems (Problems 1, 2 & 3) are solved with a population size of 100 and a maximum of 400 generations. Each problem is solved using 10 runs with different random seeds. The recombination probability (Pr) is fixed at 0.9. The number of parents used for recombination is varied with values of 3, 5, and 8. The PCX parameters ση and σζ are fixed at 0.5 each. The resizing probability (Ps) is set to 0.2. The operator probability for the MUTATE operator and the DELETE operator
is kept fixed at 0.9 and 0.1 respectively. The Distribution Index for the mutation operator is set at 20. The objective function (weight of the truss) and the constraint functions (stress in the truss members, displacement at nodes 2 and 4) are evaluated using the finite element method. ANSYS (Version 11.0) is used as the finite element solver. The evolutionary algorithm generates an ANSYS macro file for each of the candidate solutions to be evaluated, invokes ANSYS, and extracts the stresses and displacements from the outputs generated by ANSYS.

4.4 Results
The best truss design obtained is shown in Fig. 2 and is made up of 6 members connected to 5 nodes. The algorithm is able to find the same topology (with members 1,3,4,7,8, and 9) for all the three problems. This topology is the best known solution for the 10-member truss design problem.
Fig. 2. Optimum topology for 6-Node, 10-Member Truss
Problem 1. The results obtained by the evolutionary algorithm using 3, 5, and 8 parents for recombination across 10 experimental runs are given in Table 3. The smallest weight obtained for the truss is 4985.4 lb, using 8 parents for recombination. The average performance for 8-parent recombination (5168.8 lb) is better than for 5-parent recombination (5215.7 lb), which is better than for 3-parent recombination (5417.9 lb). In each of the runs the best solution is obtained with the optimum topology using 6 elements. The numbers of individuals in the final population with the optimum topology across all the runs are listed in Table 4. It can be seen that with the increase in the number of parents for recombination, the average number of individuals having the optimum topology has reduced. The 3-parent recombination is much faster at obtaining the optimum topology as compared to 5-parent and 8-parent recombination. The same behavior is echoed in Fig. 3, which plots the average percentage of optimum topologies across generations using 3, 5 and 8 parents. However, it is interesting to observe that with the 8-parent recombination, a better local search capability is demonstrated compared to the 5-parent or the 3-parent models.
Table 3. The weight (in lb) of the best truss obtained for Problem 1 across 10 experimental runs using 3, 5, and 8 parents for recombination

Expt.       p = 3     p = 5     p = 8
1          5147.6    5108.5    5083.5
2          5400.1    5439.2    5005.3
3          5631.9    5164.8    5016.8
4          5402.7    5108.7    5415.2
5          5568.9    5064.9    5102.8
6          5182.5    5198.8    4985.4
7          5375.9    5352.7    5251.9
8          5071.6    5357.2    5141.6
9          5689.7    5056.6    5206.9
10         5708.4    5305.8    5478.9
Min.       5071.6    5056.6    4985.4
Max.       5708.4    5439.2    5478.9
Mean       5417.9    5215.7    5168.8
Std.Dev.    218.5     130.7     161.6
Table 4. Number of individuals in the final population with the optimum topology for Problem 1 across 10 experimental runs using 3, 5, and 8 parents for recombination

Expt.     p = 3   p = 5   p = 8
1           98      97      99
2           97      96      63
3          100      91      96
4           99      99      96
5           82      96      99
6          100     100      98
7           92      98      95
8          100      99      92
9          100      84     100
10          99      99      94
Avg.       96.7    95.9    93.2
The variation in the weight of the truss across the final generation for all the experimental runs is shown in Fig. 4. For the continuous variation of element areas, the diversity in the solutions obtained for all the runs is quite large. The results obtained using the proposed algorithm and the results published by Deb and Gulati [5] are given in Table 5. The results by Deb and Gulati are based on a population size of 220 and obtained after 225 generations. The SBX
Fig. 3. The average percentage of individuals with optimal topology across generations for Problem 1
Fig. 4. Weight variation in the final population across the 10 experimental runs for Problem 1. Panels (a), (b) and (c) show recombination with 3, 5 and 8 parents respectively
operator is used for recombination and polynomial probability based operator is used for mutation. The recombination probability is 0.9 and mutation probability used is 0.1.
Table 5. Comparison with published results for Problem 1

Member        Current Work     Deb & Gulati
              Area (in²)       Area (in²)
1             31.71            29.68
3             23.27            22.07
4             17.57            15.30
7              6.45             6.09
8             19.32            21.44
9             20.86            21.29
Weight (lb)   4985.4           4899.15
Problem 2. The results obtained by our proposed algorithm using 3, 5, and 8 parents for recombination across 10 experimental runs are given in Table 6. The smallest weight obtained for the truss is 5048.1 lb, using 8-parent recombination, similar to our observation in Problem 1. The worst design with 8-parent recombination has a weight of 5292.9 lb, as compared to 5364.9 lb and 5445.6 lb in the case of 5-parent and 3-parent recombination.
Table 6. The weight (in lb) of the best truss obtained for Problem 2 across 10 experimental runs using 3, 5, and 8 parents for recombination

Expt.       p = 3     p = 5     p = 8
1          5311.4    5248.1    5053.2
2          5095.4    5313.9    5292.9
3          5275.4    5364.9    5173.6
4          5337.6    5182.3    5220.9
5          5251.8    5250.7    5167.4
6          5445.6    5140.1    5179.8
7          5171.0    5122.7    5281.6
8          5332.5    5290.3    5167.4
9          5235.8    5248.1    5048.1
10         5224.5    5305.2    5250.7
Min.       5095.4    5122.7    5048.1
Max.       5445.6    5364.9    5292.9
Mean       5268.1    5246.7    5183.6
Std.Dev.     92.1      73.9      79.6
Table 7. Number of individuals in the final population with the optimum topology for Problem 2 across 10 experimental runs using 3, 5, and 8 parents for recombination

Expt.     p = 3   p = 5   p = 8
1           99      81      81
2           82      97      68
3           95      91      97
4           90      96      94
5           94      96      95
6           96      81      96
7           72      95      83
8           87      97      94
9           97      96      91
10          99      91      71
Avg.       91.1    92.1    87
Fig. 5. The average percentage of individuals with optimal topology across generations for Problem 2
The numbers of individuals in the final population with the optimum topology across all the runs are listed in Table 7, and the trend is similar to that for Problem 1 (Table 4). The 8-parent recombination has difficulty finding the optimum topology and is much slower compared to 3-parent and 5-parent recombination. The same behavior can also be observed from Fig. 5 across generations. The 3-parent recombination identifies the optimum topology faster as compared to the 5- and 8-parent forms. The diversity in truss weight across the final generation is much less for discrete area variation (see Fig. 6) than for continuous area variation (see Fig. 4).
Fig. 6. Weight variation in the final population across the 10 experimental runs for Problem 2. Panels (a), (b) and (c) show recombination with 3, 5 and 8 parents respectively
Listed in Table 8 are the results obtained by the proposed algorithm and the results published by Deb and Gulati [5]. The results by Deb and Gulati were obtained using the same parameters as mentioned earlier (a population size of 220, run for 225 generations).
Table 8. Comparison of results for Problem 2

Member        Current Work     Deb & Gulati
              Area (in²)       Area (in²)
1             32               30
3             25               24
4             21               16
7              6                6
8             21               20
9             17               21
Weight (lb)   5048.1           4912.85
Table 9. The weight (in lb) of the best truss obtained for Problem 3 across 10 experimental runs using 3, 5, and 8 parents for recombination

Expt.       p = 3     p = 5     p = 8
1          5239.3    5226.9    5484.1
2          5151.6    5304.9    5354.9
3          5413.8    5109.9    5386.7
4          5591.3    5569.4    5400.6
5          5684.6    5107.5    5381.7
6          5313.1    5401.2    5171.8
7          5110.8    5551.1    5370.6
8          5319.1    5245.2    5362.3
9          5478.6    5122.4    5496.1
10         5084.2    5453.4    5229.8
Min.       5084.2    5107.5    5171.8
Max.       5684.6    5569.4    5496.1
Mean       5338.6    5309.2    5363.9
Std.Dev.    192.7     167.8      94.4
Table 10. Number of individuals in the final population with the optimum topology for Problem 3 across 10 experimental runs using 3, 5, and 8 parents for recombination

Expt.     p = 3   p = 5   p = 8
1           85      92      64
2           82      78      72
3           82      94      65
4           82      83      74
5           84      93      90
6           92      74      84
7           85      81      66
8           97      76      79
9           93      84      30
10          93      76      54
Avg.       87.5    83.1    67.8
Problem 3. The results for Problem 3 using 3, 5, and 8 parents for recombination across 10 experimental runs are given in Table 9. The best weight obtained for the truss is 5084.2 lb, using 3-parent recombination. Compared to the results of Problem 1 and Problem 2, a similar trend is visible in the results of Problem 3.
Fig. 7. Weight variation in the final population across the 10 experimental runs for Problem 3. Panels (a), (b) and (c) show recombination with 3, 5 and 8 parents respectively
Fig. 8. The average percentage of individuals with optimal topology across generations for Problem 3
The worst design obtained with 8-parent recombination has a weight of 5496.1 lb, as compared to 5684.6 lb and 5569.4 lb in the case of 3-parent and 5-parent recombination. The average performance in all 3 cases is similar, but the variation in the performance is quite small for 8-parent recombination.
Table 11. Comparison of results for Problem 3

Member        Current Work     Kaveh & Kalatjari
              Areas (in²)      Areas (in²)
1             30.00            30.00
3             26.50            19.90
4             14.20            15.50
7              7.97             7.22
8             22.00            22.00
9             19.90            22.00
Weight (lb)   5084.2           4962.1
The difficulty of 8-parent recombination in finding the optimum topology is once again evident from the results of Problem 3, as shown in Table 10 and Fig. 8. Even though Problems 2 and 3 both have discrete area variation, the algorithm has more difficulty obtaining the optimum topology for Problem 3 than for Problem 2. This can be attributed to the fact that the discrete area variation in Problem 2 is much smoother than that in Problem 3. The variation in the weight of the truss across the final generation for all the experimental runs is shown in Fig. 7. Listed in Table 11 are the results for Problem 3 obtained using the proposed algorithm and those published by Kaveh and Kalatjari [2] using a binary-coded GA. They have used a recombination probability of 0.9, a mutation probability of 0.1 for the area variables and a mutation probability of 0.001 for the topological variables. Their algorithm runs in multiple stages, each stage composed of 100 generations of evolution using a population size of 50. The best structure reported is obtained in stage 4.
5 Conclusion

A novel set based scheme for the representation of truss geometry and an optimization algorithm embedded with inheritance and learning operating on the set representation is proposed in this chapter. The set based representation allows an easy definition of the ground structure, and the use of element properties provides the flexibility in defining the number and type of design variables. Continuous variables and discrete variables can be seamlessly handled using the representation. The new representation scheme requires the design of recombination operators that can operate on the sets. We have proposed a recombination operator that can create topologically different structures exploiting principles of learning and inheritance from multiple parent topologies, a context specific mutation operator that tries to obtain the smallest possible structure, and a polynomial mutation operator that perturbs element properties.
3-parent recombination can enhance the speed of convergence to optimal topologies, whereas 8-parent recombination is more effective in deriving optimal element properties. The underlying mechanism of roulette wheel selection of parents, coupled with the identification and inheritance of elements common to the parents, is the key driving force in obtaining optimal topologies. The greedy nature of 3-parent recombination, resulting in early convergence to optimal topologies, is clearly visible in all three problems. Although the thrust of this work is not the development of problem-specific operators for truss design, our results are very close to the best reported in the literature. An adaptive multi-parent recombination scheme is a promising way forward to combine the benefits of 3-parent and 8-parent recombination for faster identification of optimal topologies and element properties.
References
1. Wu, S.J., Chow, P.T.: Steady-state genetic algorithms for discrete optimization of trusses. Computers & Structures 56, 979–991 (1995)
2. Kaveh, A., Kalatjari, V.: Topology optimization of trusses using genetic algorithm, force method and graph theory. International Journal for Numerical Methods in Engineering 58, 771–791 (2003)
3. Coello, C.A.C., Rudnick, M., Christiansen, A.D.: Using genetic algorithms for optimal design of trusses, New Orleans, LA, USA, pp. 88–94 (1994)
4. Deb, K., Gulati, S., Chakrabarti, S.: Optimal truss-structure design using real-coded genetic algorithms, University of Wisconsin, Madison, Wisconsin, USA, pp. 479–486. Morgan Kaufmann, San Francisco (1998)
5. Deb, K., Gulati, S.: Design of truss-structures for minimum weight using genetic algorithms. Journal of Finite Elements in Analysis and Design 37, 447–465 (2001)
6. Tang, W.Y., Tong, L.Y., Gu, Y.X.: Improved genetic algorithm for design optimization of truss structures with sizing, shape and topology variables. International Journal for Numerical Methods in Engineering 62, 1737–1762 (2005)
7. Rajeev, S., Krishnamoorthy, C.S.: Genetic algorithms-based methodologies for design optimization of trusses. Journal of Structural Engineering 123, 350–358 (1997)
8. Goldberg, D.E., Korb, B., Deb, K.: Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems 3, 493–530 (1989)
9. Isaacs, A., Ray, T., Smith, W.: Novel evolutionary algorithm with set representation scheme for truss design. In: Proceedings of the IEEE Congress on Evolutionary Computation 2007 (CEC 2007), Singapore, pp. 3902–3908 (2007)
10. Chen, Y.-p., Yu, T.L., Sastry, K., Goldberg, D.E.: A survey of linkage learning techniques in genetic and evolutionary algorithms. Technical Report IlliGAL Report No. 2007014, Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign (2007)
11. Harik, G.R., Goldberg, D.E.: Learning linkage. Technical Report IlliGAL Report No. 96006, Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign (1996)
12. Goldberg, D.E., Deb, K., Kargupta, H., Harik, G.: Rapid, accurate optimization of difficult problems using fast messy genetic algorithms. In: International Conference on Genetic Algorithms, pp. 56–64. Morgan Kaufmann, San Mateo (1993)
13. Kargupta, H.: The gene expression messy genetic algorithm. In: Proceedings of the 1996 IEEE International Conference on Evolutionary Computation, pp. 814–819 (1996)
14. Larranaga, P., Lozano, J.A.: Estimation of Distribution Algorithms. Kluwer Academic Publishers, Dordrecht (2002)
15. Deb, K., Anand, A., Joshi, D.: A computationally efficient evolutionary algorithm for real-parameter optimization. Evolutionary Computation 10, 371–395 (2002)
16. Deb, K.: Multi-Objective Optimization using Evolutionary Algorithms. John Wiley and Sons Pvt. Ltd., Chichester (2001)
A Network Design Problem by a GA with Linkage Identification and Recombination for Overlapping Building Blocks
Miwako Tsuji, Masaharu Munetomo, and Kiyoshi Akama
Information Initiative Center, Hokkaido University, North 11, West 5, Sapporo, 060-0811, Japan
[email protected], [email protected], [email protected]

Summary. Efficient mixing of building blocks is important for genetic algorithms, and linkage identification methods that identify the interdependent variables tightly linked to form a building block have been proposed. However, they have not yet been widely applied to real-world problems. In this chapter, we apply a genetic algorithm incorporating a linkage identification method called D5 and a crossover method called CDC to a network design problem to verify its performance and examine the applicability of linkage identification genetic algorithms.
1 Introduction

It has been recognized that the efficient mixing of building blocks (BBs) is important for genetic algorithms (GAs) [1, 2], and there have been many studies attempting to identify variables that are tightly linked to form a building block [3, 4, 5, 6, 7, 8, 9]. Moreover, crossover methods to recombine the building blocks detected by such linkage identification have been proposed [10, 11]. The messy GA [12], one of the early approaches to combining building blocks effectively, explicitly generates and selects building block candidates. After the messy GA, the research focus shifted from building blocks themselves to linkage — the relationships between the variables within a building block. For example, the gene expression messy GA (GEMGA) [13], linkage identification by nonlinearity check (LINC) [3], and related methods [14, 4, 5] calculate fitness differences caused by perturbations of variables. In addition, estimation of distribution algorithms (EDAs) [9, 15] such as the Bayesian optimization algorithm (BOA) [16] and the estimation of Bayesian networks algorithm (EBNA) [17] estimate the distribution of promising strings to build probabilistic models encoding the relationships between variables. The dependency detection for distribution derived from fitness differences (D5) employed in this chapter combines the above mechanisms. However, such techniques have rarely been applied to real-world problems. In this chapter, we apply a genetic algorithm with D5 and Context Dependent Crossover to a network design problem to verify its performance and examine the applicability of linkage identification genetic algorithms. In the network design problem, it is difficult to encode strings appropriately in advance because of several conflicting conditions. Therefore, the network design problem, in which linkage identification should play an important role, is a suitable testbed for demonstrating the performance of linkage identification genetic algorithms.
This chapter is organized as follows. We briefly review the linkage identification method we employ in Section 2 and a crossover method for overlapping building blocks in Section 3. In Section 4, we describe our network design model and perform experiments. We conclude in Section 5.
2 Linkage Identification

2.1 Building Block and Linkage
In this section, we describe building blocks, linkage, and our notation. Most real-world functions are considered to be decomposable or quasi-decomposable into some sub-problems. Such decomposability is modeled by a function called an additively decomposable function:

f(s) = \sum_{j=0}^{m-1} f_j(s_{v_j}),   (1)

where m is the number of sub-functions, s = s_0 s_1 · · · s_{l-1} is a string, f_j is the j-th sub-function and s_{v_j} is its sub-solution. The v_j is a vector of identification numbers of variables, which defines s_{v_j}. For example, if v_j = (1, 9, 7, 8), then s_{v_j} = s_1 s_9 s_7 s_8. The V_j is the set of variables appearing in v_j. For example, if v_j = (1, 9, 7, 8), V_j = {1, 7, 8, 9}. Thus, V_j specifies a set of interdependent variables that construct a sub-solution. Because real-world problems cannot always be decomposed into strictly non-overlapping sets, some sets overlap, i.e. there are pairs j ≠ j' with V_j ∩ V_{j'} ≠ ∅. In terms of genetic algorithms, we call promising sub-solutions building blocks, a set V_j a linkage set, and the interaction of the variables in V_j linkage.

2.2 D5 — Linkage Identification Method with Fitness Difference Clustering
In this section, we review a linkage identification method called the dependency detection for distribution derived from fitness differences (D5) [7, 8], which is incorporated in a genetic algorithm to solve a network design problem in our experiments. D5 estimates clusters of strings classified according to fitness differences. While many linkage identification methods calculate fitness differences by perturbations (s_i = 1 → 0 or s_i = 0 → 1) [3, 14, 4, 13] or estimate the distribution of promising strings to build probabilistic models [9, 15], D5 is designed
• to identify linkage using fewer than the O(l²) fitness evaluations that other linkage identification methods such as the Linkage Identification by Nonlinearity Check (LINC) [3] perform, and
• to identify linkage in building blocks which make only a small contribution to fitness and are difficult for Estimation of Distribution Algorithms (EDAs) [9] to find.
These are realized based on the observation that the fitness difference df_i(s) caused by a perturbation at the i-th variable,

df_i(s) = f(s_0 s_1 · · · \bar{s}_i · · · s_{l-1}) − f(s_0 s_1 · · · s_i · · · s_{l-1}),
1. Initialize the population with n strings.
2. For each variable i:
   a) calculate the fitness difference df_i(s^p) caused by a perturbation (0 → 1 or 1 → 0) at i in string s^p (p = 0, 1, · · · , n − 1);
   b) cluster the strings into some clusters according to their fitness differences;
   c) estimate the clusters and construct linkage sets: for each cluster, start with a set V_i = {i} and incrementally add j*, where j* = argmin_j E(V_i ∪ {j}), until a termination criterion is met. Let W_i be the union of the variables detected from all the clusters.
Fig. 1. The algorithm of linkage identification in D5
where \bar{s}_i denotes the complement of s_i (i.e. \bar{s}_i = 1 − s_i), is defined only by the variables which depend on the i-th variable. Therefore, if we cluster randomly generated strings according to df_i(s), the variables which do not depend on the i-th variable still distribute uniformly within each cluster. On the other hand, the variables which depend on variable i should be biased.

Algorithm

Fig. 1 shows the algorithm of D5. The algorithm consists of three parts: (1) calculating fitness differences, (2) clustering strings according to the differences, and (3) estimating the clusters of strings. After initializing the population, these procedures are applied to each variable i. First, the i-th variable in each string s^p is perturbed and the fitness difference caused by the perturbation is calculated:

df_i(s^p) = f(\bar{s}^p) − f(s^p),   (2)
where the superscript p (p = 0, 1, · · · , n − 1) indicates each string and n is the number of strings. In the above equation, \bar{s}^p is the same as s^p except that the value at position i is perturbed. We omit p when it is clear from the context. Then, strings are classified into clusters based on their fitness differences df_i(s^p). While we employ a centroid method, any clustering method such as k-means is equally applicable. Finally, D5 examines each cluster, in which strings yield the same (or almost the same) fitness difference, to detect a linkage set for i. The linkage set for i is the set of variables which gives the minimum entropy measure. The entropy measure for a linkage set V is defined as

E(V) = − \sum_{x=0}^{2^{|V|}−1} p_x \log_2 p_x,   (3)

p_x = n_x / n_1,   (4)

where 2^{|V|} is the number of all possible sub-solutions defined over V, p_x is the appearance ratio of a sub-solution, n_1 is the number of strings in the cluster and n_x is the number of occurrences of sub-solution x in the cluster.

Fig. 2. Clusters of strings clustered according to df_0 (the figure tabulates, for each string s of an initial population, its fitness f(s), the perturbed string and its fitness, the resulting df_0(s), and the bias over the first three variables)
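To make these definitions concrete, the following Python sketch (an illustration of the ideas, not the authors' implementation) computes the fitness differences of Eq. (2), groups strings with equal differences as a crude stand-in for the centroid clustering, evaluates the entropy measure of Eqs. (3)-(4) for a candidate linkage set within one cluster, and grows the set greedily up to a pre-defined order k as in the original D5; the adaptive termination criterion introduced later in this section is omitted, and all function names are assumptions.

from collections import Counter, defaultdict
import math

def fitness_differences(strings, f, i):
    """Eq. (2): df_i(s^p) = f(s^p with bit i flipped) - f(s^p) for every string in the population."""
    return [f(s[:i] + (1 - s[i],) + s[i + 1:]) - f(s) for s in strings]

def cluster_by_difference(strings, diffs):
    """Group strings with equal fitness differences (a stand-in for the centroid clustering)."""
    clusters = defaultdict(list)
    for s, d in zip(strings, diffs):
        clusters[d].append(s)
    return list(clusters.values())

def entropy(cluster, V):
    """Eqs. (3)-(4): entropy of the sub-solutions over linkage set V within one cluster."""
    counts = Counter(tuple(s[j] for j in sorted(V)) for s in cluster)
    n1 = len(cluster)
    return -sum((n / n1) * math.log2(n / n1) for n in counts.values())

def grow_linkage_set(cluster, i, length, k):
    """Greedily add the variable minimizing the entropy until the set reaches order k."""
    V = {i}
    while len(V) < min(k, length):
        j_star = min((j for j in range(length) if j not in V),
                     key=lambda j: entropy(cluster, V | {j}))
        V.add(j_star)
    return V

A fully biased cluster — every string sharing the same sub-solution over V — gives E(V) = 0, while variables unrelated to i stay close to uniform and keep the entropy high; this is exactly the bias that D5 exploits.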
While we use a simple greedy algorithm to find the set which minimizes the entropy (3), any search algorithm can be applied. This procedure is performed repeatedly for each cluster; therefore, there can be one or more linkage sets for a single variable i. In our early work [7], we assumed non-overlapping linkage sets and chose one set from them based on their entropy. Here, however, we take the union of these sets as the linkage set for i in order to capture overlapping building blocks in real-world problems.

Example

We illustrate the algorithm of D5 with an example. As the example, we use a sum of order-3 deceptive functions, which was an opponent of the messy GA [12]:

f(s) = g(s_0 s_1 s_2) + g(s_3 s_4 s_5)
g(s_i s_j s_k) = 30 if 111, 0 if 110, 101, 011, 14 if 100, 22 if 010, 26 if 001, 28 if 000   (5)

where V = {{0, 1, 2}, {3, 4, 5}}. The left side of Fig. 2 shows an initial population. As shown in the right side of the figure, the 0-th variable in every string is perturbed, the fitness differences are calculated, and the strings are classified. Strings with the same fitness difference are placed in the same cluster. For instance, the strings with df_0 = 30 are clustered into one cluster. In this cluster of size 3, the linkage set {0, 1, 2} has only the sub-solution 011,
so from (3) its entropy is −(3/3 log(3/3) + 0/3 log(0/3) + 0/3 log(0/3) + · · ·) = 0 (taking 0 log 0 = 0). On the other hand, the linkage set {0, 3, 4} has sub-solutions 0∗∗00∗, 0∗∗11∗, 0∗∗10∗, so E({0, 3, 4}) should be relatively large. Although some clusters in Fig. 2 contain only a few strings, in an actual application of D5 enough strings should be used to avoid such small clusters [18]. If a problem is (quasi-)decomposable,

f(s) = \sum_{j=0}^{m-1} f_j(s_{v_j}),

the fitness difference caused by a perturbation at i is calculated as

df_i(s) = f(s_0 s_1 · · · \bar{s}_i · · · s_{l-1}) − f(s_0 s_1 · · · s_i · · · s_{l-1})
        = [ \sum_{j s.t. i ∈ V_j} f_j(\bar{s}_{v_j}) + \sum_{j s.t. i ∉ V_j} f_j(s_{v_j}) ] − [ \sum_{j s.t. i ∈ V_j} f_j(s_{v_j}) + \sum_{j s.t. i ∉ V_j} f_j(s_{v_j}) ]
        = \sum_{j s.t. i ∈ V_j} f_j(\bar{s}_{v_j}) − \sum_{j s.t. i ∈ V_j} f_j(s_{v_j}),   (6)
where \bar{s} = s_0 s_1 · · · \bar{s}_i · · · s_{l-1}. It is clear that the sub-solutions \bar{s}_{v_j} and s_{v_j} are equal for j such that i ∉ V_j. Thus df_i(s) depends only on the sets V_j such that i ∈ V_j and is independent of the sets V_j such that i ∉ V_j. Therefore, if we cluster strings according to the fitness difference df_i(s), the variables in V_j such that i ∉ V_j should be random, while those in V_j such that i ∈ V_j should take certain values in each cluster. A random distribution gives a large entropy and a biased one gives a small entropy; therefore, by detecting such bias through minimizing the entropy measure, D5 can learn the linkage set for i. It is clear that D5, which performs O(l) fitness evaluations, has an advantage over other linkage identification methods such as LINC, which performs O(l²) fitness evaluations. Moreover, D5 is insensitive to the scaling effects of building blocks, where some building blocks contribute significantly to the overall fitness while others do not. For example, consider two sub-functions defined over {0, 1, 2} and {3, 4, 5} with f(s) = g(s_0 s_1 s_2) + 10 × g(s_3 s_4 s_5), where g(s_i s_j s_k) is as in (5). If a string has a good sub-solution over {3, 4, 5}, it will be selected whether or not its sub-solution over {0, 1, 2} has high fitness. Therefore, from strings selected based on fitness evaluated with this function, it is difficult for EDAs to see which sub-solution is preferred and to estimate the problem structure over {0, 1, 2}. On the other hand, D5 can find the linkage among {0, 1, 2}, because from equation (6) the fitness difference df_0(s) is defined by g(\bar{s}_0 s_1 s_2) − g(s_0 s_1 s_2). Therefore, if we consider the first sub-function defined over {0, 1, 2}, the extent of the contribution by the second sub-function does not matter, and vice versa.

The sizes of linkage sets

The original D5 [7] considers non-overlapping problems and defines the order k of building blocks in advance. However, in real-world problems, there are varied sizes of
building blocks. Moreover, for problems with overlapping building blocks, the number of variables linked to i is the size of the union of the building blocks including i. Equation (6) shows that D5 can detect such a union for each i. We describe how to construct the final linkage sets from these unions afterwards. Even though the maximum order must be given in order to decide the population size, the size of the linkage set for each variable should be defined adaptively [10]. Because we use a greedy algorithm that adds new elements incrementally to a set V_i, the size of V_i can be defined by using an appropriate termination criterion for the greedy algorithm. As the termination criterion we consider the following equation and inequality:

(E_all(V_i ∪ {j*}) − E(V_i ∪ {j*})) − (E_all(V_i) − E(V_i)) ≤ E_t   (7)
and E(V_i ∪ {j*}) = 0   (8)

where E_t (0 ≤ E_t < 1) is a threshold, E(V) is the entropy of V in the cluster, E_all(V) is the entropy of V in the whole population, and j* is the variable such that j* = argmin_{j ∉ V_i} E(V_i ∪ {j}). If j* is irrelevant to i, the distributions in the cluster and in the whole population are not much different, because j* does nothing for the clustering. Then E_all(V_i ∪ {j*}) − E(V_i ∪ {j*}) should be small. The previous difference of entropies, E_all(V_i) − E(V_i), is subtracted to cancel the effect of the previous set. Therefore, if (7) and (8) are satisfied, we reject j* and return V_i. Otherwise, we let V_i = V_i ∪ {j*} and search for another dependent variable. Of course, if the size of V_i exceeds the maximum order given by the user, V_i is returned regardless of the conditions (7) and (8).

From interactions to final linkage sets

To form the final linkage sets used by the crossover operators from the interaction lists obtained by D5 for real-world problems, we modify the method in [10] as shown in Fig. 3. The algorithm shown in the figure assumes that all interactions for all variables can be detected correctly. If this is true,

W_i = \bigcup_{j s.t. i ∈ V_j} V_j,   (9)

where W_i (i = 0, 1, · · · , l − 1) is the set of interactions detected by D5 and V_j (j = 0, 1, · · · ) is a set of functionally interdependent variables. Because of (9), if

V_j ⊂ W_i,   (10)

then

V_j ⊂ W_h for every h ∈ V_j.   (11)
Therefore, for arbitrary subsets of Wi , the sets which satisfy the above conditions are extracted as linkage sets.
/* W_i (i = 0, 1, · · · , l − 1) : the set of variables which depend on the i-th variable;
   W_i is the union of the variables resulting from all clusters */
for all pairs i, j (i, j = 0, 1, · · · , l − 1)
    if i ∉ W_j or j ∉ W_i
        delete i from W_j and j from W_i
    endif
endfor
V = {}                       /* V : the set of linkage sets V_0, V_1, · · · */
// the initial candidates of the final linkage sets
for i = 0 to l − 1
    V_i = {i}
    V.add(V_i)
endfor
for k = 2 to K               /* K : pre-defined order */
    for all V_j such that |V_j| == k − 1
        for i = 1 to l such that i ∉ V_j
            X = V_j + {i}
            if X ⊂ W_h for all h ∈ X
                V.add(X)
                flag = 1
            endif
        endfor
        if flag
            V.del(V_j)       // a subset of a larger candidate is deleted
        endif
    endfor
endfor
Fig. 3. Algorithm: From Interactions to Linkage Sets
For example, consider the function f(s) = g(s_0 s_1 s_2) + g(s_2 s_3 s_4). If there are enough strings, the relationships obtained by D5 are

W_0 = {0, 1, 2}        // the variables linked to i = 0
W_1 = {0, 1, 2}
W_2 = {0, 1, 2, 3, 4}
W_3 = {2, 3, 4}
W_4 = {2, 3, 4}
The initial candidates of the final linkage sets are V0 = {0}, V1 = {1}, V2 = {2}, V3 = {3}, V4 = {4}.
For k = 2 in Fig. 3, the following candidates are tried:

V_0 + {1} = {0, 1}, V_0 + {2} = {0, 2}, V_0 + {3} = {0, 3}, · · ·
V_1 + {2} = {1, 2}, V_1 + {3} = {1, 3}, · · ·
· · ·

While {0, 1}, {0, 2}, {1, 2}, · · · are accepted, {0, 3}, {1, 3}, · · · are rejected because the condition {0, 3} ⊂ W_h for all h ∈ {0, 3} is not satisfied ({0, 3} ⊄ W_0 = {0, 1, 2}, · · · ). Then {0, 1}, {0, 2}, {1, 2}, {2, 3}, {2, 4}, {3, 4} are obtained and {0}, {1}, {2}, · · · are deleted. For k = 3,

{0, 1, 2}, {0, 1, 3}, {0, 1, 4}, · · ·
{0, 2, 3}, {0, 2, 4}, · · ·
· · ·

are tried. While {0, 1, 2} and {2, 3, 4} are accepted, the others such as {0, 1, 3}, · · · are rejected because the condition {0, 1, 3} ⊂ W_h for all h ∈ {0, 1, 3} is not satisfied ({0, 1, 3} ⊄ W_0 = {0, 1, 2}, · · · ). If the pre-defined order is K = 3, the algorithm returns {0, 1, 2} and {2, 3, 4} as the final linkage sets. If K > 3, the algorithm tries {0, 1, 2} + {3}, {0, 1, 2} + {4}, · · · , but they are rejected. Then, {0, 1, 2} and {2, 3, 4} are returned.
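The construction of Fig. 3 can also be written compactly. The sketch below is an illustrative re-implementation in Python (not the authors' code): it first keeps only mutually confirmed interactions, then accepts a candidate set X of size 2 to K whenever X ⊆ W_h for every h ∈ X, and finally discards accepted candidates that are subsumed by a larger accepted set. On the example above it returns {0, 1, 2} and {2, 3, 4} for K = 3.

from itertools import combinations

def linkage_sets(W, K):
    """W: dict mapping each variable i to its interaction set W_i; K: pre-defined maximum order."""
    original = {i: set(Wi) for i, Wi in W.items()}
    # Keep only mutually confirmed interactions (the first loop of Fig. 3).
    W = {i: {j for j in Wi if i in original.get(j, set())} for i, Wi in original.items()}
    variables = sorted(W)
    accepted = []
    # Grow candidates of size 2..K; accept X if X is contained in W_h for every h in X.
    for k in range(2, K + 1):
        for X in combinations(variables, k):
            Xset = set(X)
            if all(Xset <= W[h] for h in X):
                accepted.append(Xset)
    # Drop candidates subsumed by a larger accepted candidate.
    final = [X for X in accepted if not any(X < Y for Y in accepted)]
    covered = set().union(*final) if final else set()
    # Variables not covered by any accepted set remain as singletons.
    final += [{i} for i in variables if i not in covered]
    return final

W = {0: {0, 1, 2}, 1: {0, 1, 2}, 2: {0, 1, 2, 3, 4}, 3: {2, 3, 4}, 4: {2, 3, 4}}
print(linkage_sets(W, K=3))   # -> [{0, 1, 2}, {2, 3, 4}]

Unlike the literal pseudocode, this sketch enumerates all subsets of each size rather than extending the surviving candidates, which is equivalent for small K but less efficient.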
3 Crossover for Overlapping Linkage Sets

For additively decomposable functions, building blocks are considered to be candidate sub-solutions over linkage sets. If the linkage sets do not overlap, building blocks can be combined by a simple BB-wise uniform crossover. However, in real-world problems, variables cannot always be divided into non-overlapping linkage sets. It is difficult to recombine overlapping building blocks because the exchange of some building blocks causes the disruption of other building blocks. Therefore, crossover methods for overlapping building blocks have been proposed [10, 11] to exchange overlapping building blocks with minimum disruption. In this chapter, we describe a crossover method proposed for complexly overlapping building blocks in [10]. The crossover method is called Context Dependent Crossover (CDC), and it was developed by extending a crossover method by Yu et al. [11]. The crossover method by Yu et al. uses a graph whose nodes represent linkage sets (building blocks) and whose edges represent overlapping relationships between nodes. For each crossover, it finds the graph partitioning which minimizes the number of cut edges and separates two randomly chosen nodes. However, if building blocks overlap complexly, i.e. many edges appear in the graph, it cannot work well because the number of cut edges is not always equal to the number of building block disruptions and the pattern of graph partitionings is restricted.
1. Construct a graph G = (N, E), where the nodes are the linkage sets V_j and the edges are the overlapping relations between two nodes.
2. For each crossover of parent strings s = s_1 s_2 · · · s_i · · · s_l and t = t_1 t_2 · · · t_i · · · t_l:
   a) Remove the nodes V_j where s_{v_j} = t_{v_j}.
   b) Remove the edge between V_j and V_j' if the following conditions hold:
      • the exchange of BBs s_{v_j} and t_{v_j} does not disrupt BBs s_{v_j'} and t_{v_j'};
      • the exchange of BBs s_{v_j'} and t_{v_j'} does not disrupt BBs s_{v_j} and t_{v_j}.
   c) Choose two nodes n_1, n_2 randomly. Then partition the graph G into two sub-graphs G_1 = (N_1, E_1) and G_2 = (N_2, E_2) which satisfy the conditions: n_1 ∈ N_1, n_2 ∈ N_2 and |E| − |E_1| − |E_2| is minimal.
   d) Let V = ∪_{V_j ∈ N_1} V_j and exchange the variables in V.
Fig. 4. Context Dependent Crossover
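The following Python sketch is one possible reading of Fig. 4 (illustrative only, not the authors' implementation): bb extracts a building block, disrupts checks whether swapping one building block damages an overlapping one, cdc_graph performs steps (a)-(b), and cdc_partition performs step (c) by exhaustive bipartition, which is adequate only for the small graphs left after pruning; a practical implementation would use a proper min-cut routine.

from itertools import combinations
import random

def bb(s, V):
    """Building block (sub-solution) of string s over linkage set V."""
    return tuple(s[i] for i in sorted(V))

def exchange(s, t, V):
    """Swap the variables in V between parents s and t, giving the two offspring."""
    s2, t2 = list(s), list(t)
    for i in V:
        s2[i], t2[i] = t[i], s[i]
    return tuple(s2), tuple(t2)

def disrupts(s, t, Vj, Vk):
    """True if exchanging the BB over Vj yields, over Vk, a BB present in neither parent."""
    s2, t2 = exchange(s, t, Vj)
    parental = {bb(s, Vk), bb(t, Vk)}
    return bb(s2, Vk) not in parental or bb(t2, Vk) not in parental

def cdc_graph(s, t, linkage_sets):
    """Steps (a)-(b): drop nodes with identical BBs and edges along which no disruption can occur."""
    nodes = [j for j, V in enumerate(linkage_sets) if bb(s, V) != bb(t, V)]
    edges = []
    for a, b in combinations(nodes, 2):
        Vj, Vk = linkage_sets[a], linkage_sets[b]
        if set(Vj) & set(Vk) and (disrupts(s, t, Vj, Vk) or disrupts(s, t, Vk, Vj)):
            edges.append((a, b))
    return nodes, edges

def cdc_partition(nodes, edges):
    """Step (c): separate two random nodes while cutting as few of the remaining edges as possible."""
    if len(nodes) < 2:
        return set(nodes), set()
    n1, n2 = random.sample(nodes, 2)
    best, best_cut = {n1}, None
    for size in range(1, len(nodes)):
        for group in combinations(nodes, size):
            N1 = set(group)
            if n1 in N1 and n2 not in N1:
                cut = sum((a in N1) != (b in N1) for a, b in edges)
                if best_cut is None or cut < best_cut:
                    best, best_cut = N1, cut
    return best, set(nodes) - best   # variables in the linkage sets of N1 are exchanged (step d)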
Fig. 5. Example of parental strings
Fig. 6. Example of the proposed crossover method. Left: remove identical BBs; middle: remove edges where no BB disruption occurs; right: resulting graph
Context dependent crossover (CDC) is proposed to reduce building block disruptions and to increase the variety of graph partitionings. The algorithm of CDC is shown in Fig. 4. CDC introduces a concept based on the context of the strings and reconstructs the graph for each pair of parent strings s = s_1 s_2 · · · s_i · · · s_l and t = t_1 t_2 · · · t_i · · · t_l. Fig. 6 shows an example of CDC for the pair of parent strings shown in Fig. 5. BB1 over V_1 is 10010 in the first string of Fig. 5 and 10010 in the second string; these correspond to the top-left node in Fig. 6. BB2 over V_2 is 10101 in the first string and 10110 in the second, corresponding to the bottom-left node in Fig. 6, and so on. First, the nodes whose corresponding BBs are identical are removed (Fig. 6, left), because whether such BBs are exchanged or not has no effect on the offspring. This operation ensures that the resulting offspring always differ from their parent strings. In our example, because the BBs on V_1 are the same, the node for V_1 is removed. Then, the edges where no BB disruption can actually occur are removed (Fig. 6, middle). In our example, the edge between V_3 and V_4 is removed because, for the parent sub-strings 10011001 (with BBs 10011 and 11001) and 00011101 (with BBs 00011 and 11101), the obtained sub-strings are 00011001 (with BBs 00011 and 11001) and 10011101 (with BBs 10011 and 11101) when V_3 is exchanged, or 10011101 (with BBs 10011 and 11101) and 00011001 (with BBs 00011 and 11001) when V_4 is exchanged. After that, the resulting graph (Fig. 6, right) is divided into two sub-graphs so as to minimize the number of cut edges. The right graph of Fig. 6 obtained by CDC shows that there is a graph partitioning without BB disruption. On the other hand, if the context is not considered and only one graph is used for all crossovers, a partitioning is chosen from {V_1, V_2, V_3 | V_4} and {V_1, V_2 | V_3, V_4}, which cut one edge, and {V_1 | V_2, V_3, V_4}, which creates no new string. Moreover, CDC can give various crossover patterns even if the building blocks overlap complexly. The reason is as follows. For every crossover, CDC removes nodes and edges to simplify the graph G, and this reconstruction of G depends on the values of the parental strings. Therefore, different pairs give different graphs, the different graphs result in various sub-graphs, the various sub-graphs enhance the crossover patterns, and the various crossover patterns ensure the diversity of the offspring. Because the diversity of offspring can be maintained while making crossover sets with the least BB disruption, we can reduce the innovation time, which is the number of generations required to find a new solution better than all the strings in the population. The reduced surrogates crossover proposed by Booker [19] examines the non-matching alleles in the parental strings to make reduced strings and selects crossover points for the reduced strings. It always produces variants and maintains population diversity
to avoid premature convergence. While CDC ignores identical BBs in the same way that the reduced surrogates crossover ignores matching alleles, it performs crossover not allele-wise but BB-wise. In the next section, we solve a network design problem by a genetic algorithm with D5 and CDC.
4 Network Design Problem

In network design problems, one needs to allocate the links, nodes, etc. of a communication network subject to several constraints. The trade-off between construction costs and quality of service makes these problems difficult. The optimal allocation that satisfies the conflicting needs must be found from a set of candidate solutions whose number grows exponentially with the size of the network. There have been several approaches using genetic algorithms for network design problems. However, they have often suffered from building block disruptions, because the above-mentioned nature of the problem makes it difficult to detect the building blocks in advance. Although some of them [20, 21] assumed that elements which are located geographically close together should be tightly linked to construct a building block, interactions between elements are decided not only by geographical conditions but also by traffic, QoS, political reasons, and so on. Some approaches [22, 23] have tried to solve network design problems by employing linkage identification and have achieved better results than simple genetic algorithms. Here, we show the performance of the D5-GA with CDC on network design problems. The results indicate that not only the linkage identification technique but also the recombination operator applied after linkage identification is important for solving these problems.

4.1 Problem Definition
In our experiments, we employ the metropolitan area network design problems which were defined in [23] and elsewhere. The problem is as follows:

• Purpose
  – find the link allocation which minimizes the cost of network construction
  – satisfy the following constraints
• Constraints
  – all nodes must be able to communicate with all other nodes
  – all communication requests must be satisfied
  – geographical constraints
• Assumptions
  – communication requests are known; the requests are biased to represent hub nodes and normal nodes
  – the sites of the nodes are known
  – the candidate sites of the links are known

Fig. 7 shows the candidate sites of links. We consider two test problems with 162 and 234 candidate sites (variables). In these problems, whether a link is constructed or not
Fig. 7. Candidate sites of network links. Main roads in the Sapporo (a city in Japan) area are modeled. Example 1 (top) and 2 (bottom)
at each site must be decided. The circles are nodes; the black ones in the center of the figure represent hubs and the white ones represent the other nodes. The total cost of the network is defined as

C = \sum_{i=1}^{l} c_i d_i s_i,   (12)
where i is the index of a link candidate, l is the number of link candidates, c_i is the unit cost of a network link with a certain capacity, d_i is the distance of the i-th link candidate, and s_i defines whether a link is allocated at the i-th candidate site or not. The capacity of each link takes a value from a set of discrete candidates, and the value is determined automatically from the communication requests and the network topology defined by a string. The string of the GA is a sequence s_1 s_2 · · · s_l, where s_i ∈ {0, 1} and the i-th link is constructed if s_i = 1, and not constructed otherwise. The fitness of the problem is defined as the sum of the total cost C and the total penalty P. The total penalty is defined as

P = m × p,   (13)

where m is the number of traffic requests that cannot be satisfied and p is the unit penalty. Therefore, the fitness is

f = C + P.   (14)
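As a concrete illustration of Eqs. (12)-(14), a candidate string could be evaluated as in the following sketch; this is only an illustration, and the routing simulation that determines the number m of unsatisfied requests is problem-specific and therefore passed in here as a pre-computed value.

def network_cost(s, unit_costs, distances):
    """Eq. (12): C = sum over all link candidates of c_i * d_i * s_i."""
    return sum(c * d * x for c, d, x in zip(unit_costs, distances, s))

def network_fitness(s, unit_costs, distances, unsatisfied_requests, unit_penalty):
    """Eqs. (13)-(14): f = C + P with P = m * p; smaller is better in this minimization problem."""
    C = network_cost(s, unit_costs, distances)
    P = unsatisfied_requests * unit_penalty   # m would come from a routing simulation of the requests
    return C + P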
Table 1. Parameters for the experiments on the network design problems (* marks a D5-specific parameter)

Parameter                                      example 1    example 2
string length l                                162          234
population size                                300          400
population size for linkage identification*    800          1,200
window size for niching                        l/8          l/8
crossover probability (per pair of strings)    0.9          0.9
mutation probability (per variable)            0.07         0.07
termination criterion (# of evaluations)       ≥300,000     ≥1,000,000
# of runs                                      20           20
Table 2. Average fitness, minimum, variance, and standard deviation of 20 runs of each algorithm for example 1 (minimization problem)

Algorithm                    average    minimum    σ²          σ
GA with uniform crossover    94696.1    93687      57949.9     240.7
D5-GA without CDC            94002.4    93500      216947.4    465.7
D5-GA with CDC               93857.3    93530      653265.8    808.2
Table 3. Average fitness, minimum, variance, and standard deviation of 20 runs of each algorithm for example 2 (minimization problem)

Algorithm                    average       minimum    σ²           σ
GA with uniform crossover    145,656.65    144,079    59771.5      244.4
D5-GA without CDC            145,131.74    143,014    1350347.5    1162.0
D5-GA with CDC               142,815.20    142,374    764571.6     874.3

4.2 Settings
We perform experiments with the GA with uniform crossover, the D5-GA without CDC, and the D5-GA with CDC. In the D5-GA without CDC, we remove some weak interactions between variables by the tightness detection proposed by Munetomo et al. [24] in order to construct non-overlapping linkage sets. The main parameters are shown in Table 1. We perform 20 independent runs for each algorithm on a PC with an Intel(R) Pentium(R) D CPU at 3.00 GHz and 512 MB of memory.

4.3 Results
Tables 2 and 3 show the average, minimum, variance, and standard deviation of 20 independent runs of the D5-GA with CDC, the D5-GA without CDC, and the GA with uniform crossover. Note that this is a minimization problem, where we want to minimize the cost of the network. The D5-GA with CDC gives the smallest fitness of all the algorithms.
Fig. 8. Number of evaluations versus fitness for example 2 (D5-GA with CDC, D5-GA without CDC, and GA with uniform crossover)
Fig. 9. Number of generations versus fitness for example 2 (D5-GA with CDC, D5-GA without CDC, and GA with uniform crossover)
Even the average fitness of the D5-GA with CDC is smaller than the minimum fitnesses of the other algorithms for the larger problem (example 2). The D5-GA without CDC, which does not consider the overlapping relationships between building blocks, is better than the GA with uniform crossover but worse than the variant which does consider the overlapping relationships. Comparing Tables 2 and 3, the D5-GA with CDC improves the fitness further for the larger problem. The fact that CDC, which can combine complexly overlapping building blocks effectively, gives the best result suggests that the overlapping relationships are essential for network design problems.
Fig. 10. The resulting network for example 1 by D5 -GA with CDC
Fig. 11. The resulting network for example 1 by GA with uniform crossover
Figs. 10 and 11 show the resulting networks for example 1 obtained by the D5-GA with CDC and the GA with uniform crossover, and Figs. 15 and 16 show the resulting networks for example 2. For example 1, the final results of the D5-GA with CDC and the GA with uniform crossover are almost the same. However, for the larger problem, example 2, some differences can be seen in the middle and bottom areas. Figs. 12–14 show the best solutions obtained by the D5-GA with CDC for example 2 in the 25th, 100th, and 200th generations respectively. In these figures, some needless links are removed gradually.
Table 4. Average CPU time (min) for example 2. PCs with an Intel(R) Pentium(R) D CPU at 3.00 GHz and 512 MB of memory were used

Algorithm                    time (min)
GA with uniform crossover    351.45
D5-GA                        314.53
D5-GA with CDC               308.57
Fig. 12. The best solution for example 2 in 25-th generation by D5 -GA with CDC
Fig. 13. The best solution for example 2 in 100-th generation by D5 -GA with CDC
Fig. 14. The best solution for example 2 in 200-th generation by D5 -GA with CDC
Figs. 8 and 9 show the fitness against the number of evaluations and against the number of generations for example 2. Although the D5-GA and the D5-GA with CDC spend fitness evaluations on obtaining linkages, they converge faster than the GA with uniform crossover. Table 4 shows the average CPU time of 20 runs for example 2 for each algorithm. All algorithms evaluate the same objective function 1,000,000 times. Although the D5-GA without CDC and the D5-GA with CDC perform extra computation to detect linkage and
Fig. 15. The resulting network for example 2 from D5 -GA with CDC
Fig. 16. The resulting network for example 2 from GA with uniform crossover
recombine building blocks, their execution times are smaller than that of the GA with uniform crossover. In our test problems, the function evaluations include a simulation of the routing of the communication requests, and these simulations are computationally cheaper for better solutions, where the number of allocated links is relatively small.
5 Conclusion

In this chapter, we have solved a network design problem by a GA with a linkage identification method called D5 and context dependent crossover (CDC) to show the applicability of linkage identification genetic algorithms. Even for EDA-difficult problems, D5 identifies linkage efficiently. The relationships between variables identified by D5 are processed to construct the final overlapping linkage sets. By investigating the context of every pair of strings in addition to the linkage sets, CDC combines overlapping building blocks effectively. Our experiments show that the linkage identification genetic algorithm can solve the network design problem, which is difficult for simple GAs, especially when the overlapping structure of the building blocks is captured and respected.
References
1. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison Wesley, Reading (1989)
2. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press (1975)
3. Munetomo, M., Goldberg, D.E.: Identifying linkage groups by nonlinearity/nonmonotonicity detection. In: Proceedings of the Genetic and Evolutionary Computation Conference - GECCO 1999, pp. 433–440. Morgan Kaufmann Publishers, San Francisco (July 1999)
4. Heckendorn, R.B., Wright, A.H.: Efficient linkage discovery by limited probing. In: Cantú-Paz, E., Foster, J.A., Deb, K., Davis, L., Roy, R., O'Reilly, U.-M., Beyer, H.-G., Kendall, G., Wilson, S.W., Harman, M., Wegener, J., Dasgupta, D., Potter, M.A., Schultz, A., Dowsland, K.A., Jonoska, N., Miller, J., Standish, R.K. (eds.) GECCO 2003. LNCS, vol. 2723, pp. 1003–1014. Springer, Heidelberg (2003)
5. Kargupta, H., Park, B.H.: Gene expression and fast construction of distributed evolutionary representation. Evolutionary Computation 9(1), 43–69 (2001)
6. Streeter, M.J.: Upper bounds on the time and space complexity of optimizing additively separable functions. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 186–197. Springer, Heidelberg (2004)
7. Tsuji, M., Munetomo, M., Akama, K.: Modeling dependencies of loci with string classification according to fitness differences. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 246–257. Springer, Heidelberg (2004)
8. Tsuji, M., Munetomo, M., Akama, K.: Linkage identification by fitness difference clustering. Evolutionary Computation 14(4), 383–409 (2006)
9. Larrañaga, P., Lozano, J.A.: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Dordrecht (2001)
10. Tsuji, M., Munetomo, M., Akama, K.: A crossover for complex building blocks overlapping. In: Proceedings of the Genetic and Evolutionary Computation Conference - GECCO 2006, pp. 1337–1344. ACM Press, New York (2006)
11. Yu, T.L., Sastry, K., Goldberg, D.E.: Linkage learning, overlapping building blocks, and systematic strategy for scalable recombination. In: Proceedings of the Genetic and Evolutionary Computation Conference - GECCO 2005, pp. 1217–1224 (June 2005)
12. Goldberg, D.E., Korb, B., Deb, K.: Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems 3(5), 415–444 (1989)
13. Kargupta, H.: The gene expression messy genetic algorithm. In: Proceedings of the IEEE International Conference on Evolutionary Computation (CEC), pp. 631–636 (September 1996)
14. Munetomo, M.: Linkage identification based on epistasis measures to realize efficient genetic algorithms. In: Proceedings of the Congress on Evolutionary Computation - CEC 2002, pp. 445–452 (2002)
15. Pelikan, M., Goldberg, D.E., Lobo, F.G.: A survey of optimization by building and using probabilistic models. Computational Optimization and Applications 21(1), 5–20 (2002)
16. Pelikan, M., Goldberg, D.E., Cantú-Paz, E.: BOA: The Bayesian optimization algorithm. In: Proceedings of the Genetic and Evolutionary Computation Conference - GECCO 1999, pp. 525–532. Morgan Kaufmann Publishers, San Francisco (1999)
17. Etxeberria, R., Larrañaga, P.: Global optimization with Bayesian networks. In: Proceedings of the II Symposium on Artificial Intelligence CIMAF 1999, Special Session on Distributions and Evolutionary Optimization, pp. 332–339 (1999)
18. Tsuji, M., Munetomo, M., Akama, K.: Population sizing of dependency detection by fitness difference classification. In: Wright, A.H., Vose, M.D., De Jong, K.A., Schmitt, L.M. (eds.) FOGA 2005. LNCS, vol. 3469, pp. 282–299. Springer, Heidelberg (2005)
19. Booker, L.: Improving search in genetic algorithms. In: Davis, L. (ed.) Genetic Algorithms and Simulated Annealing, pp. 61–73. Morgan Kaufmann, San Francisco (1987)
20. Kumar, A., Pathak, R.M., Gupta, Y.P., Parsaei, H.R.: A genetic algorithm for distributed system topology design. Computers and Industrial Engineering 28(3), 659–670 (1995)
21. Sinclair, M.C.: Nomad: Initial architecture of an optical network optimisation, modelling and design tool. In: Proceedings of the 12th UK Performance Engineering Workshop, pp. 157–167 (September 1996)
22. Munetomo, M., Tsuji, M., Akama, K.: Metropolitan area network design using GA based on linkage identification with epistasis measures. In: Proceedings of the 4th Asia-Pacific Conference on Simulated Evolution and Learning, pp. 652–656 (2002)
23. Tsuji, M., Munetomo, M., Akama, K.: Metropolitan area network design using GA based on hierarchical linkage identification. In: Cantú-Paz, E., Foster, J.A., Deb, K., Davis, L., Roy, R., O'Reilly, U.-M., Beyer, H.-G., Kendall, G., Wilson, S.W., Harman, M., Wegener, J., Dasgupta, D., Potter, M.A., Schultz, A., Dowsland, K.A., Jonoska, N., Miller, J., Standish, R.K. (eds.) GECCO 2003. LNCS, vol. 2724, pp. 1616–1617. Springer, Heidelberg (2003)
24. Munetomo, M., Goldberg, D.E.: Linkage identification by non-monotonicity detection for overlapping functions. Technical Report IlliGAL Report No. 99005, University of Illinois at Urbana-Champaign (January 1999)
Knowledge-Based Evolutionary Linkage in MEMS Design Synthesis
Corie L. Cobb1, Ying Zhang2, Alice M. Agogino1, and Jennifer Mangold1
1 Mechanical Engineering Department at the University of California, Berkeley, CA 94720, USA
[email protected], [email protected], [email protected]
2 School of Electrical and Computer Engineering at the Georgia Institute of Technology, Savannah, GA 31407, USA
[email protected]

Abstract. Multi-objective Genetic Algorithms (MOGA) and Case-based Reasoning (CBR) have proven successful in the design of MEMS (Micro-electro-mechanical Systems) suspension systems. This work focuses on CBR, a knowledge-based algorithm, and MOGA to examine how biological analogs that exist between our evolutionary system and nature can be leveraged to produce new promising MEMS designs. Object-oriented data structures of primitive and complex genetic algorithm (GA) elements, using a component-based genotype representation, have been developed to restrict genetic operations to feasible design combinations as required by physical limitations or practical constraints. Through the use of this data structure, virtual linkage between genes and chromosomes is coded into the properties of predefined GA objects. The design challenge involves selecting the right primitive elements, associated data structures, and linkage information that promise to produce the best gene pool for new functional requirements. Our MEMS synthesis framework, with the integration of the MOGA and CBR algorithms, deals with the linkage problem by integrating a component-based genotype representation with a CBR automated knowledge base inspired by biomimetic ontology. Biomimetics is proposed as a means to examine and classify functional requirements so that case-based reasoning algorithms can be used to map design requirements to promising initial conceptual designs and appropriate GA primitives. CBR provides MOGA with good linkage information through past MEMS design cases, while MOGA inherits that linkage information through our component-based genotype representation. A MEMS resonator test case is used to demonstrate this methodology.
1 Introduction

Microelectromechanical Systems (MEMS) are small micro-machines or micron-scale electro-mechanical devices that are fabricated with processes adapted from Integrated Circuits (ICs). Although still a relatively new research field, MEMS devices are being developed and deployed in a broad range of application areas, including consumer electronics, biotechnology, automotive systems and aerospace. Example MEMS devices include accelerometers in automotive airbags and micro-mirrors for optical switching in data communication networks. As MEMS devices grow in complexity, there is a greater need to reduce the amount of time MEMS designers spend in the
initial conceptual stages of design by employing efficient computer-aided design (CAD) tools. Working with a multidisciplinary research team at the Berkeley Sensor and Actuator Center (BSAC), our work with Evolutionary Computation (EC) is focused on the conceptual design of MEMS devices. Zhou et al. [1] were the first to demonstrate that a multi-objective genetic algorithm (MOGA) can synthesize MEMS resonators and produce new design structures. SUGAR [2], a MEMS simulation tool, was used to perform function evaluations on constraints and fitness values. Kamalian et al. [3] extended Zhou's work and explored interactive evolutionary computation to integrate human design expertise into the synthesis process. They also fabricated and tested the emergent designs in order to characterize their mechanical properties and identify deviations between simulated and fabricated features [4]. Zhang et al. [5, 6] implemented a hierarchical MEMS synthesis and optimization architecture, using a component-based genotype representation and two levels of optimization: global genetic algorithms (GA) and local gradient-based refinement. Cobb et al. [7] created a case-based reasoning (CBR) tool to serve as an automated knowledge base for the synthesis of MEMS resonant structures, integrating CBR with MOGA [8] to select promising initial designs for MOGA and to increase the number of optimal design concepts presented to MEMS designers. In related research, Mukherjee et al. [9] conducted work on MEMS synthesis for accelerometers using parametric optimization of a pre-defined MEMS topology. They expanded the design exploration within a multidimensional grid in order to find the global optimal solution. Wang's [10] approach to MEMS synthesis utilized bond graphs and genetic programming with a tree-like structure of building blocks to incorporate knowledge into the evolutionary process, similar to work by Zhang [6]. Li et al. [11] concentrated on developing automated fabrication process planning for surface micromachined MEMS devices that relieves designers from the tedious work of process planning so they can concentrate on the design itself. MEMS CAD has matured to the point that there are now commercial CAD programs, such as Comsol® and IntelliSuite®, that offer MEMS designers pre-made modules and cell libraries, but there is little automatic reasoning in place for the user on how and when these components should be used. Our EC method employs a genetic algorithm as the evolutionary search and optimization method. GAs were introduced by Holland [12] to explain the adaptive processes of evolving natural systems and for creating new artificial systems in a similar way, and Goldberg [13] further demonstrated how to use them in search, optimization, and machine learning. Chen et al. [14] noted that traditional GAs require users to possess prior domain knowledge in order for genes on chromosomes to be correctly arranged with respect to the chosen operators. The performance of a GA is heavily dependent upon its encoding scheme. When prior domain knowledge is available, the design problem can be solved using traditional genetic algorithms. However, that is not always the case, and this is when methods such as linkage learning are needed. Chen [15] and Harik [16] both focused research efforts on the linkage learning genetic algorithm (LLGA) so that a GA, on its own, can detect associations among genes to form building blocks [15].
Linkage is an important part of GA performance. Tightly linked genes are synonymous with building blocks, but higher level linkage amongst building blocks is also necessary to ensure successful design solutions are reached. We propose an integrated MEMS design synthesis system which combines CBR with biologically inspired classifications and an evolutionary algorithm, MOGA, to help generate more varied conceptual MEMS design cases for a designer and her/his current design application. In this chapter, we will explain our micro-resonator test case which will be highlighted throughout our work to explain our linkage concept. Next, we discuss MOGA and CBR and explain how linkage is achieved through our knowledge-based evolutionary algorithm. Lastly, we present a review of symmetry patterns observed in nature, as they pertain to resonant frequency-sensitive biological creatures, and explore the role that symmetry plays in our evolutionary synthesis process for the resonator example.
2 Evolutionary Computation for Resonant MEMS Design

2.1 MEMS Resonator Test Case

To date, our MEMS design synthesis program has focused on the design of resonant MEMS. A schematic of a MEMS resonator and its component decomposition are shown in Fig. 1. These designs have consisted of a fixed center mass (either with or without electrostatic comb drives) connected to four 'legs', each made up of multiple beam segments. We evaluated our MOGA synthesis program for several sets of performance objectives, all calculated using the SUGAR simulation program.
Fig. 1. Schematic of example resonator synthesis problem. The geometry of the center mass is fixed, while the number of beam segments per leg and the size and angle of each segment is variable [3].
As we are designing resonators, the most significant performance objective for all structures is the resonant frequency (f0). Resonant frequency is the most critical requirement because if a resonator deviates too far from its frequency target it is essentially a useless design. Other performance objectives we have used for synthesis
include the stiffness of the structure in the x or y-direction as well as the device area (defined by a bounding rectangle around the device).

2.2 SUGAR: MEMS Simulation with Modified Nodal Analysis

SUGAR [2] is an open-source MEMS simulation tool based on modified nodal analysis (MNA), allowing a designer to quickly prototype and simulate several complex MEMS structures for preliminary design applications.1 Finite element analysis (FEA) calculations could take hours per simulation, making them infeasible for iterative design processes on complex systems. SUGAR and other similar lumped-parameter nodal analysis simulation tools can perform these functional calculations with reasonable accuracy in a fraction of the time and can therefore allow the MEMS designer to explore larger design spaces. FEA and parametric optimization can then be used to refine the most promising of the design concepts produced by the MOGA evolutionary process.

2.3 Linkage with Component-Based Genotype Representation

Genetic linkage, in biological terms, refers to the relative position of two genes on a chromosome. Two genes are linked if they are on the same chromosome and are tightly linked if they are physically close to each other on the same chromosome. Genes that are closely linked are usually inherited together from parent to offspring [14]. Our MOGA data structure can be classified as "linkage adaptation" if we use the same terminology as Chen [14]. Linkage adaptation refers to specifically designed representations, operators, and mechanisms for adapting genetic linkage along with the evolutionary process. Chen states that linkage adaptation techniques are closer to biological metaphors of evolutionary computation because of their representations, operators, and mechanisms.
Fig. 2. Gene representation examples for MEMS building blocks [6]
1 SUGAR can be accessed from: http://sourceforge.net/projects/mems/
Our component-based genotype representation for MEMS design synthesis is supported by a hierarchical extendible design component library developed by Zhang et al. [6]. Each MEMS design component type is represented by a gene. This gene carries all salient information about the component: its geometric layout parameters, as well as constraints on how the component can be modified and what genetic operations can be applied to it (see Fig. 2). Each gene has external nodes through which components are connected and registered to one another. Two genes are on the same chromosome, which represents a design cluster or a simple MEMS design, if one of them can be reached from the other through any linkage path in the chromosome. Two genes are tightly linked if they share the same external node. For example, in Fig. 3 gene types 10 and 9 are tightly linked because they share the same external node and gene types 10, 9, 5, and 1 are on the same chromosome because each gene can be found by tracing the linkage path in the design.
Fig. 3. MEMS resonator gene representation [6]
A designer can predefine what gene types are allowed to be closely linked to a specific gene type and whether a position on the chromosome is a crossover point during the evolutionary process by associating special properties to certain linkage nodes in the chromosome. Based on predefined rules, the mutation operation can be applied at either the gene level or the chromosome level, providing a probability of changing linkage with the mutation operation during the evolutionary process.
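A minimal, illustrative Python rendering of this component-based genotype (the actual component library of [6] is considerably richer) shows how external nodes carry the virtual linkage: two genes are tightly linked when they share an external node, and a chromosome is the set of genes reachable through shared nodes.

class Gene:
    """One MEMS design component: its type, geometric layout parameters, and external nodes."""
    def __init__(self, gene_type, params, external_nodes):
        self.gene_type = gene_type              # e.g. an identifier for a beam, anchor, or comb drive
        self.params = params                    # layout parameters such as length, width, angle
        self.external_nodes = set(external_nodes)

    def tightly_linked(self, other):
        """Tight linkage: the two genes share at least one external connection node."""
        return bool(self.external_nodes & other.external_nodes)

def chromosome(genes, start):
    """All genes reachable from `start` through shared external nodes form one chromosome."""
    members, frontier = {start}, [start]
    while frontier:
        g = frontier.pop()
        for other in genes:
            if other not in members and g.tightly_linked(other):
                members.add(other)
                frontier.append(other)
    return members

Crossover and mutation operators can then consult the properties attached to the shared nodes — which gene types may be attached there and whether the node is a legal crossover point — before acting, which is how infeasible design combinations are ruled out.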
3 Case-Based Reasoning and Biomimetic Inspired Ontology

Case-based Reasoning (CBR) is an artificial intelligence method that utilizes knowledge from a past situation to solve current problems. Shank's dynamic memory model [17] is regarded as the foundation for CBR. Kolodner used Shank's model to create the first CBR system called CYRUS which was a basic question and answer system [18]. CBR has been applied to a broad array of domains ranging from cooking recipes to the design of electro-mechanical devices. For example, Kritik [19], a
CBR system developed in the early 1980s, generated designs for physical systems such as electrical circuits. The first successful industry application of CBR was CLAVIER [20] which was used by Lockheed Martin for determining successful loads of composite material parts for curing in an autoclave. More recently, CAFixD [21] applied the principles of CBR to fixture design for various machining operations. CBR is analogous to human cognition and thought processes; cases can be regarded as "memories," while retrieval is similar to "reminding" one of a particular instance, and case representation is how one's memories are organized. CBR involves indexing past knowledge, in the form of "cases," to enable effective retrieval of solutions for a current problem. Indexing and case representation are the two initial and most important stages of CBR, determining the ultimate performance of a CBR program. In the context of our work, CBR takes advantage of previous human knowledge in the form of successful MEMS design cases to help guide humans and computational design tools towards more optimal design concepts. Previous work by Cobb et al. [8] has shown that the integration of a CBR knowledge base with a multi-objective genetic algorithm (MOGA) can increase the number of optimal solutions generated for a given MEMS design problem. CBR is used to help select the best candidates to be evolved in an evolutionary process such as MOGA. In the following sections, we will examine the biological analogs of case representation and indexing as well as how they can support linkage in MOGA.

3.1 Case Representation and Biological Taxonomy

Biological classification or taxonomy is a means by which biologists group and classify organisms. Taxonomy helps one identify evolutionary relationships and links between certain species and, in the case of MEMS, certain design structures. Classifying organisms based on shared physical traits is how taxonomy began, but these classifications have been modified over the years to reflect Darwinian evolutionary relationships. Spiders are of interest to our work due to the parallels their physical appearance has to our MEMS resonator example. Biologists have classified over 40,000 species of spiders, but they believe there are still thousands of species which have not yet been identified and named. As more species are discovered the current biological classification system can expand and change. The classification of animals and plants is inherently hierarchical, similar to the way our MEMS case library is hierarchical to demonstrate the relationships between different designs. The 40,000 species of classified spiders are further divided into three suborders with 38 families and 111 subfamilies. The groups described by taxonomy get more specific as one goes from the kingdom classification all the way down to the species group. Kingdom is the largest unit of classification (with approximately five kingdoms), phylum is the next unit of classification which further divides each kingdom, and this pattern continues down to the species level, forming a tree-like hierarchy of organism representation. No two species of spiders, or any plant or animal, will have the same scientific name (defined by the genus and species). The scientific name is a unique identifier just as each unique MEMS design component
has a distinct identification number and gene type to distinguish it from other designs and enable efficient case retrieval. MEMS is still an exploratory field, and new designs and pieces of the MEMS hierarchy are constantly being added, similar to the way newly discovered species of organisms expand the biological taxonomy system every day. Varadan [22] noted that it is still premature to create a robust categorization because many MEMS devices are still in the research phase of development and have not matured for every application. MEMS categorization has often focused on fabrication methods and materials selection, geometry, or application areas [23].

There is a broad array of MEMS sensors and actuators available today. Bell et al. [24] considered work-producing actuators, force sensors, and displacement sensors fabricated by surface or bulk micromachining and performed an in-depth classification of these devices. In MEMS, designs are often classified based on their performance and functional characteristics. Sensors and actuators are the two broadest and most commonly agreed-upon categories of MEMS, and each can be divided further into families and classes. Similar to the work of Bell et al. [24], we will have two kingdoms in our classification system: sensors and actuators. Sensors and actuators can each be further divided into phyla or classes based upon their operating domains. For our purposes, we will assume six operating domains based upon the input and output signals MEMS devices utilize: (1) Magnetic, (2) Thermal, (3) Electrical, (4) Mechanical, (5) Chemical, and (6) Optical. Imagine the aforementioned domains placed in a 6 by 6 matrix (with the six categories lined up on both the rows and the columns) to enable multiple input and output combinations. For example, a thermal-mechanical sensor might take a thermal input and have a mechanical deflection as its output. A piezoelectric sensor outputs a voltage in response to an applied mechanical stress and therefore falls into the mechanical-electrical class.

Because the user of our CBR program may be searching for designs based on input and output domains or on application areas, it is important to index cases by both. Our MEMS hierarchy starts with sensors and actuators and then branches out to the various input and output mechanisms; under each of these are specific application areas (RF MEMS, micro-fluidics, BioMEMS, optical MEMS, etc.), which are then divided into whole MEMS devices, which are in turn broken down into their various components and primitive elements. Currently, our work focuses on resonant structures, such as resonators, accelerometers, and micromechanical filters. Thus, in traversing the MEMS hierarchy, our work falls under the electrical input and output domain, where electrostatic actuation is primarily used. Fig. 4 is a condensed MEMS taxonomy graph and is not inclusive of all MEMS devices. The portion shown demonstrates how the classification leads to accelerometers, filters, and resonators, the focus of our work. Resonators, the basic components of filters, can be further decomposed into masses, springs, comb drives, and anchors. Each one of the aforementioned components has a unique identifier to distinguish it from others. Nguyen [25] classifies MEMS filters based on their ability to achieve a certain frequency range, an important consideration in developing RF communication devices.
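To make the indexing scheme concrete, the following minimal Python sketch shows how a taxonomy-style case record could be indexed by kingdom, operating domains, and application area and then retrieved for a query. The class and function names (MemsCase, retrieve_cases) and the field choices are illustrative assumptions for this discussion, not the actual implementation of our case library.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MemsCase:
    # Hypothetical MEMS design case indexed by taxonomy and operating domains.
    case_id: int                   # unique identifier, cf. a scientific name
    kingdom: str                   # "sensor" or "actuator"
    input_domain: str              # one of the six operating domains
    output_domain: str
    application_area: str          # e.g. "RF MEMS", "BioMEMS"
    components: List[str] = field(default_factory=list)
    performance: Dict[str, float] = field(default_factory=dict)

def retrieve_cases(library, kingdom, input_domain, output_domain, application_area=None):
    # Match on the taxonomy indices; the application-area filter is optional
    # because a user may search by either criterion.
    hits = [c for c in library
            if c.kingdom == kingdom
            and c.input_domain == input_domain
            and c.output_domain == output_domain]
    if application_area is not None:
        hits = [c for c in hits if c.application_area == application_area]
    return hits

library = [MemsCase(1, "sensor", "electrical", "electrical", "RF MEMS",
                    ["frame mass", "crab-leg suspension", "comb drive", "anchor"],
                    {"f0_kHz": 23.8, "Kx_over_Ky": 8.0})]
print(retrieve_cases(library, "sensor", "electrical", "electrical"))

In practice the taxonomy levels of Fig. 4 would supply additional index fields, but even this flat form illustrates how a unique identifier and a small set of indices support efficient retrieval.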
Fig. 4. MEMS hierarchy example with biological analogy. The condensed taxonomy graph proceeds from Sensors (kingdom) to Electrical (phylum) to Electrostatic (actuation class), which branches into Accelerometers and Filters (orders); the Filters order divides into Low Pass, High Pass, Band Pass, and Notch Filters (suborders), leading to Coupled Resonators and Single Resonators (families), with further decomposition based on device structure.
Fig. 5. MEMS database design
Just as a hierarchy was created for biological organisms, a hierarchy needs to be created for CBR in order to sort information and efficiently pull the most relevant primitives and designs for evolutionary computation. An ontology is a way to represent knowledge in a specific domain, helping an artificial intelligence (AI) program to define and retrieve objects. A general structure for an ontology is the following [26]: objects, classes of objects, attributes of objects, and relations between objects. Shown in Fig. 5 is our current MEMS case library ontology. Using entity-relationship diagram notation, one can observe how objects such as MEMS resonators and filters are related. In the diagram, 'd,p' indicates a disjoint/distinct and partial relationship between classes, in order to account for designs that have not yet been created or added to the library. Attributes of each object include indices for quick retrieval and overall device performance. Our current CBR hierarchy classifies designs based on their shared functionality and performance.

3.2 Creating Evolutionary Linkage with Case-Based Reasoning

Linkage, as defined by Chen [14], refers to placing related genes close together on a chromosome. The GA programmer seeds the GA with initial designs that carry implicit linkages, and in doing so may add her/his expertise to the codification. This may be difficult to do, however, on new design problems in which the programmer has limited experience. Applying this definition to MEMS synthesis, we use the concept of linkage to refer to how closely MEMS building blocks should be linked in an evolutionary process. With the integration of CBR and MOGA (see Fig. 6), CBR defines the linkages for the user with an automated case-based library of previous MEMS designs.
Fig. 6. MEMS design synthesis architecture
CBR relieves the user of the burden of defining the problem by automatically selecting and optimizing design structures based on a few input design requirements. In the absence of CBR or a good seed design, MOGA may not converge to a design solution. Zhang et al. [5] noted that seeding MOGA with a good initial design is essential to helping MOGA converge to better design solutions in a practical number of evolutions. CBR can retrieve the design cases close to local design optima for a given scenario. The designs are ranked according to the user's design requirements and are then encoded in the component-based genotype representation to enable the evolutionary process. Incorporating other powerful computational tools, such as CBR, with MOGA can help MOGA converge faster and more efficiently to optimal design concepts. The linkage problem is alleviated in our MOGA program because CBR inherently defines linkage for MOGA with its case examples. CBR assists MOGA by propagating the linkage of effective building blocks and selecting designs near local optima.

In a previous experiment [8], for each MOGA synthesis run we used a population of 400 for 50 generations. Using constraint cases of (1) no symmetry, (2) y-axis symmetry, and (3) x- and y-axis symmetry, five runs of the MOGA process were conducted for each constraint case in order to see a good spread of design solutions. We found that when MOGA is seeded with good starting designs from CBR, in some instances the y-axis symmetry and x- and y-axis symmetry constraints generate more pareto-optimal designs over 50 generations.
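As a rough illustration of how CBR-retrieved cases might seed the initial MOGA population, and thereby carry their building-block linkage into the evolutionary process, consider the sketch below. It assumes a simplified ranking metric (closeness to the target resonant frequency) and a flat list-of-gene-types genotype; both are stand-ins for the full set of design requirements and our component-based representation.

import random

def rank_cases(cases, target_f0_khz):
    # Rank retrieved cases by closeness to the requested resonant frequency
    # (a stand-in for ranking against the full design requirements).
    return sorted(cases, key=lambda c: abs(c["f0_khz"] - target_f0_khz))

def seed_population(cases, target_f0_khz, pop_size, rng=random.Random(0)):
    # Copy the best-ranked case genotypes intact (preserving their linkage)
    # and fill the rest of the population with lightly mutated copies.
    ranked = rank_cases(cases, target_f0_khz)
    population = [list(c["genotype"]) for c in ranked]
    while len(population) < pop_size:
        parent = rng.choice(ranked)["genotype"]
        child = [g if rng.random() > 0.2 else rng.randint(1, 20) for g in parent]
        population.append(child)
    return population[:pop_size]

cases = [{"f0_khz": 23.8, "genotype": [18, 14, 5, 1]},   # hypothetical retrieved cases
         {"f0_khz": 10.2, "genotype": [15, 2, 5, 1]}]
pop = seed_population(cases, target_f0_khz=24.9, pop_size=8)
print(pop[0])   # the best-matching seed appears unchanged at the front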
Fig. 7. Resonant frequency = 23.8 kHz for initial MOGA design (a); resonant frequency = 24.8 kHz for a pareto optimal y-symmetric design generated by MOGA (b). The figure also shows the MOGA design representation, mapping gene types to MEMS components: 18 = frame mass, 14 = crab-leg suspension, 5 = comb drive, 1 = anchor.
Shown in Fig. 7 is an example of tight linkage generated by our integrated CBR and MOGA program. The design requirements for this scenario were the following: f0 = 24.9 kHz, Kx/Ky ≥ 8, Area ≤ 2.1e-7 m2. Eight designs were selected by CBR for a MOGA synthesis process for this scenario. Because all eight CBR-retrieved designs had similar linkage properties, we highlight the best design here, which was a resonator with an enclosed frame mass and crab-leg suspensions (two beams with a local 90 degree angle). For the best design, shown in Fig. 7a, the mass and comb drives remained fixed while the crab-leg suspensions (which have the largest impact on the performance objectives) were allowed to change in width, length, and global orientation, but retained their local 90 degree angle.

As one can see in Fig. 7, the initial design in Fig. 7a generated an optimal design (Fig. 7b) in which the leg suspensions are rotated outside of the frame mass. One would assume that if the objective is to minimize area, the suspensions would remain inside the mass, similar to the initial design in Fig. 7a. However, because frequency and stiffness were also part of the optimization problem, MOGA determined that a design with the suspensions outside of the mass could produce a better resonant frequency and stiffness ratio. The resonator design in Fig. 7b may not have been considered by a human MEMS designer, but due to the linkage knowledge CBR gave MOGA, the design is a good candidate for further analysis and fabrication.

Fig. 8 shows another example of tight linkage in our MOGA process. The design requirements for this scenario are the following: f0 = 8.3 kHz, Kx/Ky ≥ 29, Area ≤ 3.7e-7 m2.
Fig. 8. Resonant frequency = 6969.3 Hz for initial MOGA design (a); resonant frequency = 8299.9 Hz for a pareto optimal y-symmetric design generated by MOGA (b). The figure also shows the MOGA design representation, mapping gene types to MEMS components: 15 = hollow ring mass, 2 = serpentine suspension, 5 = comb drive, 1 = anchor.
In this particular case there was only one design selected by CBR, which consisted of a hollow square-shaped mass with four serpentine springs. Again, the mass and comb drives remained fixed while the serpentine suspension blocks were free to mutate in length, width, number of loops, and global angle orientation. This scenario also generated designs that have a similar appearance to spiders and insects (aside from the inherent manhattan geometry in the building blocks). Minimizing area is our main design objective for all of the designs in this experiment. The y-symmetry (symmetry around the vertical axis) constraint cases had the smallest average design area (2.608e-7 m2) with a standard deviation of 7.031e-8 m2.
4 Biomimetics: Role of Symmetry and Resonance

Applying manhattan geometries (90º angles) and symmetry constraints greatly reduces the search space and allows MOGA to optimize its search over a more manageable space. If convergence can be achieved, however, fewer constraints are preferred in an optimization problem, as they broaden the search space to a wider selection of solutions. When MOGA runs unconstrained or with only symmetry constraints, the results produce designs that greatly differ from those designed by humans. Upon observation, these designs bear an uncanny resemblance to spiders, insects, and other organisms observed in nature. This prompted us to examine the biological analogies that exist between our EC-generated resonators and biological organisms, to help us understand which symmetry and geometric constraints might be an evolutionary advantage for natural life forms that use vibration or natural frequencies to survive.

4.1 Symmetry and Geometric Constraints

Symmetry is evident throughout the natural world: it appears in a butterfly's wings and a spider's web, and physicists even observe symmetry in distant galaxies. Symmetry has been used to try to understand the physical world since ancient times [27]. In the animal kingdom, bilateral symmetry is found in more complex species, where different parts of the animal's body perform different functions. Radial symmetry can be found in simpler life forms, such as starfish, where the entire body performs most of the life functions.

Symmetry has typically been regarded as a sign of quality in nature, and symmetry perception has been demonstrated in humans, animals, and insects. Many studies have concluded that humans and other species find symmetrical patterns more favorable than asymmetrical ones. It has been suggested that preferences for symmetry evolved for reasons related to mate choice. For several species, females prefer a mate that has more symmetrical characteristics [28]; experiments performed with insects and birds found that females prefer to mate with males who have the most symmetrical ornaments [29]. Enquist and Arak [30] suggest that the preference for symmetry has evolved from the need to recognize objects no matter what their position or orientation may be. This preference for symmetry is prominent in the MEMS world, where many designers highly favor symmetrical layouts and manhattan-style geometry. In previous work, some of our nontraditional asymmetric MEMS designs were fabricated and
characterized to help improve EC algorithms, and it was shown that the fabricated design behaved in reasonable agreement with simulation results [4].

Of the forms of symmetry in the animal world, bilateral symmetry is much more common than full symmetry. Even with bilateral symmetry, organisms often exhibit a behavioral asymmetry, asymmetric internal organs, or a tendency toward right- or left-handedness, as noted by Babcock [31]. Asymmetry is less prevalent in the natural world but can be observed in a select few organisms such as sponges (poriferans). In biology studies, Moller et al. [32] found that growth rate and fluctuating asymmetry are negatively correlated, meaning asymmetric animals grow less rapidly than symmetric ones. Although organisms may exhibit bilateral and radial symmetry, most organisms have some type of observable asymmetry.

In the MEMS world, designers are tasked with developing physical forms that satisfy multiple functional requirements. It is tempting to think that simple designs with 90 degree angles are better than designs with irregular or nontraditional layouts. This can be the case in macroscale designs, where non-perpendicular and non-parallel geometries can be time-consuming and expensive to manufacture. But in MEMS fabrication, lithography processes enable a designer to create almost any geometrical layout, and all layouts are equally easy to fabricate; the only obstacle is the resolution capability of the lithography process, which limits the minimum feature size that can be fabricated. Kamalian et al. [3] previously noted that optimal MEMS designs with multiple competing objectives need not have full symmetry or manhattan angles, but may benefit from symmetry about one axis, i.e., bilateral symmetry.

Similar to our EC-generated MEMS resonators, spiders have a large central mass and a similar number of legs on either side of their body. Spiders have evolved to have some degree of bilateral symmetry around the longitudinal axis, but none around the horizontal axis, similar to our y-symmetric resonator designs shown in Fig. 9. All species of spiders have a broad range of leg shapes, but none of them have manhattan geometries and most exhibit symmetry about only one axis.
Fig. 9. Examples of MEMS resonator designs with increasing constraints (panels: no symmetry; y-axis symmetry; x-y axis symmetry; 90º angles and x-y axis symmetry, ordered by increasing symmetry and angle constraints)
4.2 Purpose of Resonance and Vibration

We can further examine the spider as a biological analog to a resonator in its ability to detect prey by resonating with their vibrations. Vibration cues have been used by
insects and spiders to locate and kill their prey. Without the use of vibration recognition, it may be difficult for insects to find their prey, because dense vegetation may limit their visual abilities. Vibration signals are also important because many of an insect's or spider's prey produce vibrations through movement or feeding, which enables them to be located more easily [33]. Bolas spiders catch their prey by mimicry, emitting the pheromones of the prey species. The wing-beat vibrations of the moths that fall victim to the bolas spiders stimulate the spider to make a bolas in which to capture the moth [34].

Generally, all web-spinning spiders detect and find prey in their webs through the vibrations generated by their prey. This is especially important because most species of web spiders do not have a strong sense of smell or good vision. Peters (1931) found that a spider did not respond to a dead fly placed gently in its web. If, however, the fly arrived in the web with a jerk or if, once in the web, it was stimulated in some way, the spider responded [35]. There is a good deal of evidence that spiders discriminate between different types of signals. Several studies have demonstrated how spiders move towards vibrations of various frequencies, similar to the way MEMS resonators and bandpass filters attempt to home in on certain frequencies for communication purposes. Resonators, which are basic building blocks of MEMS filters, are designed to reject certain frequencies from a wide range of signals and only allow a particular frequency band to pass through.

An important aspect of resonance in MEMS and nature is movement. Blickhan and Full [36] conducted a study of multi-legged locomotion in animals as diverse as cockroaches and kangaroos in order to develop a model of "legged terrestrial locomotion." They found that the dynamics of movement depend on the number of legs an animal has and on its gait or movement pattern. Four- and six-legged creatures had greater whole-body stiffness than two-legged creatures. The greater whole-body stiffness in the four- and six-legged creatures resulted in higher natural frequencies, just as a higher overall stiffness results in a higher natural frequency in MEMS designs. Spiders generally have eight legs while insects have six legs. In MEMS, we mostly observe resonators with four main legs for stability. There are resonators with only two legs, but these tend to be slightly unstable, with a tendency towards out-of-plane movement. Eight legs enable spiders to move faster and give them the ability to travel easily in different directions. Insects with six legs tend to move forward readily but cannot move backwards and sideways as quickly as spiders. In our MEMS resonator design, we only want motion in one direction based on the comb drive actuation; hence four legs provide more balance and stability than two legs. Additional legs are not needed because, in these MEMS resonator designs, motion in multiple directions is undesirable. However, if we look more broadly at other MEMS designs, such as micro-robots, more legs can be desirable to enable quick and easy movement in multiple directions.

After 3.8 billion years of "research and development," nature has discovered what works, what does not, and what is life sustaining, optimizing natural designs to meet the necessary functional needs. These "successful designs" are ever-changing to meet environmental requirements and are driven by an ultimate challenge: survival. Nature's solutions are sometimes not perfect; however, they are as good as they need to be to serve their intended purpose.
5 MEMS Case Study: An Analysis of Symmetry Constraints and Impact on Resonance

In the previous section, we looked at symmetry and resonance in nature. Since our synthesis system focuses on the structural design of MEMS, it is important to examine the different types of constraints we can embed in our MOGA linkage structure in order to produce the best performing MEMS designs. To better understand what role the symmetry constraints we observe in nature play in our MOGA algorithm, we perform an experiment with our resonator test case to explore which combinations of symmetry and geometric constraints produce the best performing micro-resonator designs.

5.1 Experiment Setup

In this experiment we enforce four different sets of constraints on our micro-resonator test case. Each micro-resonator is constructed of a 2μm thick layer of polysilicon material. The comb drives and center mass for the micro-resonator design are fixed while the springs are free to mutate, subject to the following symmetry and angle constraints:

• C1: No symmetry or geometric constraints
• C2: Symmetry is enforced along the y-axis of the design (analogous to bilateral symmetry observed in organisms)
• C3: Symmetry is enforced about the x- and y-axis of the design
• C4: Symmetry is enforced about the x- and y-axis of the design and the suspensions (also known as 'legs') are restricted to 90º angles (analogous to how human designers traditionally create MEMS)

We place emphasis on symmetry constraints as these are the most common types of structural constraints observed in nature; a sketch of how such constraints can be imposed on a spring genotype follows Table 1. C4 includes a manhattan angle constraint and represents the typical constraints a human MEMS designer will impose upon the design of a resonant structure. Our goal is to better understand under what conditions symmetry that is found to be optimal in nature is also optimal in our MEMS resonator example using the MOGA algorithm.

Table 1. Polyline spring design parameters used for the MEMS resonator case study (*100μm only used for the 10kHz test case)

Mutation constraints for polyline spring    Parameter value
Max. number of beams                        7
Min. number of beams                        1
Max. beam length                            100μm/300μm*
Min. beam length                            10μm
Max. beam width                             10μm
Min. beam width                             2μm
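Below is a minimal sketch of how the constraint cases C2-C4 might be imposed on a polyline-spring genotype before evaluation. It assumes a spring is a list of (length, width, angle) beam segments and that mirroring a segment list is sufficient to build the symmetric legs; this simplifies our actual component-based encoding, and under C1 each leg would simply be mutated independently.

import copy

def mirror_y(spring):
    # Mirror a polyline spring about the y-axis: beam angles are reflected.
    return [(l, w, 180.0 - a) for (l, w, a) in spring]

def mirror_x(spring):
    # Mirror a polyline spring about the x-axis.
    return [(l, w, -a) for (l, w, a) in spring]

def snap_to_manhattan(spring):
    # Restrict beam angles to multiples of 90 degrees (constraint C4).
    return [(l, w, 90.0 * round(a / 90.0)) for (l, w, a) in spring]

def build_legs(case, free_spring):
    # Build the suspension set for constraint cases C2-C4 from one freely
    # mutated spring; C3 and C4 add the x-axis mirror images as well.
    if case == "C4":
        free_spring = snap_to_manhattan(free_spring)
    right = copy.deepcopy(free_spring)
    left = mirror_y(right)
    if case in ("C3", "C4"):
        return [right, left, mirror_x(right), mirror_x(left)]
    return [right, left]

# One freely mutated crab-leg-like spring: (length um, width um, angle deg)
legs = build_legs("C4", [(60.0, 4.0, 87.0), (40.0, 4.0, 3.0)])
print(legs[0])   # beam angles snapped to 90 and 0 degrees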
The resonator legs begin symmetrically from a center mass but are allowed to evolve with any number of joints in the legs (see Table 1 for leg design parameters). C1 has no symmetry constraints while C2 has the minimal bilateral constraints. C3 and C4 both have full symmetry, with C4 having the additional constraint of manhattan geometry. We wish to explore how these cases of minimal constraints (C1 and C2) compare to those with maximal constraints (C3 and C4).

The feasible design range for our initial resonator design is a resonant frequency (f0) between 5-15 kHz and a stiffness ratio (Kx/Ky) between 1-10. The main design objective is the minimization of device area while achieving the required stiffness ratio and keeping the resonant frequency deviation to less than 5%. To explore the range of possible designs, four sets of design requirements for the micro-resonator were randomly generated using the aforementioned bounds and then used in MOGA synthesis runs (see Table 2). For comparison purposes, these results are included with a previous design requirement test case used by Kamalian [3] and Zhang [5], where the resonant frequency target was 10 kHz and the stiffness in the x-direction only had to be greater than the stiffness in the y-direction.

Table 2. Randomly Generated Design Requirements

Name    Target Frequency (f0)    Stiffness Ratio (Kx/Ky)
DR1     10.0 kHz                 > 1 (x-axis stiffness greater than y-axis stiffness)
DR2     14.3 kHz                 3
DR3     9.5 kHz                  5
DR4     7.0 kHz                  8
DR5     13.5 kHz                 8
For each set of design requirements, we ran the MOGA process five times for each constraint case with a population of 400 designs for 50 generations. In our MOGA process, the inverse of the pareto rank is used as the fitness value of a design. Only the designs in the final pareto-optimal set that meet all of the initial design requirements are used in the analysis. Designs that are in the overall pareto set, have a frequency deviation within 5% of the target frequency, and satisfy the stiffness ratio requirement are tallied after each MOGA synthesis process.

5.2 Analysis of Results

Table 3 shows the best designs in terms of minimum and average area, as well as minimum and average frequency error, in the pareto sets for each of the design requirements (DR1-DR5). Note that C1 (no symmetry) appears to be favorable for achieving the best minimum area in the pareto set, whereas C4, the highly constrained full-symmetry case with manhattan geometry, is favored for minimizing the average area across the entire pareto set of designs. In contrast, when considering frequency, the best minimum and average error results occur with the least constrained cases, C1 and C2. To see if any of these competing trends are statistically significant, we apply a Wilcoxon rank sum test to the data.
The Wilcoxon rank sum test [37], a non-parametric statistical test, is used to determine whether or not the constraint cases produce similar performing designs with respect to design area and resonant frequency deviation. To begin the rank test, we form the appropriate null hypothesis (H0) and alternate hypothesis (Ha) using a significance level of 5% (or α = 0.05):

• H0: The distributions of the two compared constraint cases are identical
• Ha: The distributions of the two compared constraint cases are not identical and one distribution is shifted to the right or left of the other (implying one set of constraints generates better performing designs)
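The comparison itself can be reproduced with a standard statistics library. The sketch below applies SciPy's two-sided rank-sum test to two hypothetical samples of frequency deviations; the numbers are placeholders and are not drawn from our result files.

from scipy.stats import ranksums

# Hypothetical frequency deviations [Hz] pooled from the pareto sets of two
# constraint cases for one set of design requirements.
c1_deviations = [0.8, 12.5, 34.0, 56.1, 78.9, 101.2, 150.3]
c3_deviations = [12.4, 88.7, 120.5, 160.2, 181.3, 210.9, 240.4]

statistic, p_value = ranksums(c1_deviations, c3_deviations)
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0; the distributions differ")
else:
    print(f"p = {p_value:.4f} >= {alpha}: cannot reject H0")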
Table 3. Comparison of design area and frequency deviation

Design        Constraint  # of Pareto        Minimum    Average    Minimum Frequency  Average Frequency
Requirements  Case        Optimal Solutions  Area [m2]  Area [m2]  Deviation [Hz]     Deviation [Hz]
DR1           C1          20                 1.63E-07   1.92E-07   0.821              110.890
              C2          18                 1.63E-07   1.79E-07   0.147              96.972
              C3          16                 1.46E-07   1.96E-07   12.404             181.261
              C4          13                 1.60E-07   1.75E-07   1.582              150.116
DR2           C1          23                 1.14E-07   2.36E-07   0.143              80.415
              C2          31                 1.36E-07   2.43E-07   0.133              119.509
              C3          17                 1.30E-07   1.72E-07   1.604              121.910
              C4          14                 1.28E-07   1.53E-07   1.680              156.698
DR3           C1          32                 1.29E-07   4.31E-07   0.041              45.237
              C2          24                 1.57E-07   2.12E-07   0.244              81.716
              C3          22                 1.73E-07   2.79E-07   0.296              142.063
              C4          16                 1.58E-07   1.82E-07   3.991              99.145
DR4           C1          27                 1.72E-07   3.02E-07   0.004              45.618
              C2          21                 2.05E-07   2.93E-07   0.165              23.025
              C3          15                 2.05E-07   2.39E-07   2.869              69.650
              C4          23                 1.90E-07   2.22E-07   0.812              116.348
DR5           C1          51                 1.20E-07   3.50E-07   0.003              27.185
              C2          17                 1.43E-07   1.95E-07   0.487              130.417
              C3          33                 1.51E-07   1.86E-07   2.092              193.281
              C4          17                 1.52E-07   1.63E-07   3.850              177.751
The p-values generated by the rank test for the micro-resonator case study are shown in Tables 4 and 5. In the instances where the p-value is less than the significance level, α = 0.05, we can reject the null hypothesis (H0) and accept the alternate hypothesis (Ha), indicating that the populations are significantly different. Conversely, when the p-value is greater than the significance level, we cannot reject the null hypothesis (H0) within the context of this experiment.
Focusing on the frequency deviations (last column in Table 3), an analysis of the constraint cases demonstrates that, over the entire pareto set, MOGA produces statistically significantly different pareto sets of designs between constraint cases C1&C3 and C1&C4 in four out of five instances, and between C2&C3 in three out of five instances. This trend implies that asymmetry and bilateral symmetry are preferred to full symmetry. The p-values for this scenario ranged over 1.88E-9 ≤ p ≤ 0.03355. If we focus on frequency deviation for the best designs from each MOGA synthesis run (see Table 5), constraint cases C1&C4 have statistically different distributions in all instances while C1&C3 have statistically different distributions in four out of five instances (0.0079 ≤ p ≤ 0.0317).

Table 4. P-values for frequency deviation across the entire pareto set of designs
Design Requirements  C1&C2    C1&C3     C1&C4    C2&C3    C2&C4    C3&C4
DR1                  0.8493   0.0772    0.4071   0.0942   0.5349   0.2635
DR2                  0.1515   0.0128    0.0107   0.2532   0.1138   0.3934
DR3                  0.3328   0.0008    0.0295   0.0192   0.2755   0.1433
DR4                  0.9172   0.0335    0.0042   0.0135   0.0009   0.1888
DR5                  0.0506   1.88E-09  0.0001   0.0132   0.1296   0.2777
Table 5. P-values for best minimum frequency deviations across each constraint case
Design Requirements  C1&C2    C1&C3    C1&C4    C2&C3    C2&C4    C3&C4
DR1                  0.5476   0.0079   0.0317   0.0556   0.4206   0.0952
DR2                  0.8413   0.0556   0.0317   0.2222   0.1508   0.6905
DR3                  0.2222   0.0317   0.0079   0.0952   0.0317   1.0000
DR4                  0.0952   0.0079   0.0159   0.0556   0.2222   0.0556
DR5                  0.0317   0.0079   0.0079   0.2222   0.1508   0.5476
In addition to the frequency analysis, we also performed an analysis of design area. An analysis of the best performing designs from each synthesis run based on minimum area did not show a strong statistical difference. However, an analysis of area across the entire pareto set showed that C4 (full symmetry and manhattan angles) generates different pareto-optimal sets of designs for three out of five sets of design requirements for each possible constraint case combination (C1&C4, C2&C4, C3&C4). This supports results previously demonstrated by Kamalian [3]. Looking at Table 3, one can see that C4 had the best average design area overall for all of the design requirements. When considering frequency deviation, it appears C1 is statistically better than C3 or C4 in almost all of the cases and C2 is significantly better than C3 and C4 in a majority of instances. C1 is only statistically better than C2 in one instance (DR5, for the best minimum frequency deviation). The reverse is never the case: full symmetry, with or without manhattan geometry, shows no significant advantage for reducing frequency deviation. To highlight some of the designs generated by our MOGA constraint cases, Fig. 10 shows the best performing designs for the four constraint cases under DR4, and Fig. 11 shows the best performing designs based on frequency for the remaining design requirements (DR1, DR2, DR3, and DR5).
Fig. 10. Best designs based on resonant frequency for design requirement set DR4. Case 1 (no symmetry): f0 = 7.0000 kHz, Area = 3.1285e-007 m2; Case 2 (y-axis symmetry): f0 = 6.9998 kHz, Area = 6.2119e-007 m2; Case 3 (xy-axis symmetry): f0 = 6.9971 kHz, Area = 3.0076e-007 m2; Case 4 (xy-axis symmetry and 90º angles): f0 = 6.9992 kHz, Area = 2.9024e-007 m2.
It is interesting to note that, in our previous discussion of symmetry and resonance observed in nature, bilateral symmetry is the preferred evolutionary design for spiders and similar organisms based on their frequency needs for mating and catching prey. If we examine our results more closely, we must note that most of our design requirements favor asymmetry or bilateral symmetry if frequency is the major consideration, and full symmetry if average area minimization over the pareto set is the priority. However, the difference between asymmetry and bilateral symmetry is not statistically significant. We can hypothesize that bilateral symmetry provides the balance between the competing objectives, but further investigation is required in order to validate this.

Note that one of our design requirements involves a stiffness ratio, which is a measure of resonator movement in the x- and y-directions. For our particular micro-resonator design, it is highly desirable to have a high stiffness ratio (rigidity in the x-direction and compliance in the y-direction) for the purposes of device stability. Thus, as we increase the stiffness ratio from a low value, such as Kx/Ky > 1 (DR1), to
a high value such as Kx/Ky = 8 (DR4 and DR5), we are creating a bias against full symmetry in our optimization constraints. This bias in the stiffness ratio potentially forces the designs generated by MOGA to favor more asymmetrical layouts (C1 and C2) rather than fully symmetrical results (C3 and C4). This trend is shown in Table 4 where the bilateral symmetry C2 is statistically better than full symmetry C3 only for the higher stiffness cases DR3, DR4 and DR5.
Fig. 11. Best performing designs based on frequency deviation for DR1, DR2, DR3, and DR5. DR1: f0 = 9.9999 kHz, Area = 1.9474E-007 m2; DR2: f0 = 14.300 kHz, Area = 2.7782E-007 m2; DR3: f0 = 9.5000 kHz, Area = 4.6938E-007 m2; DR5: f0 = 13.500 kHz, Area = 5.9004E-007 m2.
Fig. 11 illustrates the best performing micro-resonator designs based on frequency deviation. As the results in Table 3 show, most of these designs have a very small deviation from the frequency goal. The designs with the smallest frequency deviation, however, typically have one of the largest design areas in the pareto set. This is due to the conflicting objectives in our multi-objective optimization problem. There are trade-offs between the frequency, area, and stiffness objectives, and at this point the human designer can decide which design in the pareto set is best suited for their MEMS design application.

In this section, we have presented an analysis of the role symmetry constraints play in our MOGA linkage structure. Increasing the level of symmetry constraints can further restrict the search space to a more manageable size
and enable our micro-resonator designs to achieve a smaller design area on average, but more asymmetrical designs are favored by MOGA for reducing frequency error and achieving the smallest design area. We hypothesize that the bilateral symmetry found in spiders and insects may be a compromise between frequency accuracy and compact size.
6 Summary and Conclusions

Our MEMS synthesis architecture, with the integration of MOGA and CBR, deals with the concept of linkage by using a component-based genotype representation and an automated design knowledge base. CBR provides MOGA with good linkage information through past design knowledge, while MOGA inherits linkage information through our component-based genotype representation. A MEMS micro-resonator test case was presented to show how symmetry constraints observed in nature can be embedded into our MOGA linkage structure to produce new promising MEMS design solutions. We found that when minimizing frequency error, asymmetry and bilateral symmetry are favored, while conversely, when minimizing device area, the maximum constraints of full symmetry and enforced 90º angles are favored.

As part of our future research plan, we will examine how linkage learning can be integrated with MOGA when CBR may not be able to select a good initial seed design. Further exploring biomimetic algorithms and biomimetic ties to MEMS synthesis algorithms is another area we plan to pursue, investigating how increasing the number of leg components on a MEMS design can create optimal solutions in other design areas such as micro-robots. We also want to further explore the role symmetry and angle constraints have on these types of new MEMS designs. Lastly, we are moving towards creating a broader MEMS classification scheme and building up a case library of MEMS filter designs and their accompanying components to further expand the range of designs covered by our program.
Acknowledgements

This work was supported in part by NSF grant CCR-DES/CC-0306557 and a Bell Labs Graduate Research Fellowship.
References

1. Zhou, N., Zhu, B., Agogino, A.M., Pister, K.S.J.: Evolutionary synthesis of MEMS (MicroElectronicMechanical Systems) design. In: Proc. of the Artificial Neural Networks in Engineering Conference, pp. 197–202 (2001)
2. Clark, J.V., Zhou, N., Bindel, D., Schenato, L., Wu, W., Demmel, J., Pister, K.S.J.: 3D MEMS simulation modeling using modified nodal analysis. In: Proceedings of the Microscale Systems: Mechanics and Measurements Symposium, pp. 68–75 (2000)
3. Kamalian, R.H., Agogino, A.M., Takagi, H.: The role of constraints and human interaction in evolving MEMS designs: microresonator case study. In: Proc. of 2004 ASME Design Engineering Technical Conferences and Design Automation Conference, #DETC2004-57462 (2004)
4. Kamalian, R., Zhang, Y., Agogino, A.M.: Microfabrication and characterization of evolutionary MEMS resonators. In: Proc. of the IEEE Robotics & Automation Society Symposium of Micro- and Nano-Mechatronics for Information-based Society, pp. 109–114 (2005)
5. Zhang, Y., Kamalian, R., Agogino, A.M., Séquin, C.H.: Hierarchical MEMS synthesis and optimization. In: Proc. SPIE Smart Structures and Materials 2005, Smart Electronics, MEMS, BioMEMS, and Nanotechnology (#5763-12), vol. 5763, pp. 96–106 (2005)
6. Zhang, Y., Kamalian, R., Agogino, A.M., Séquin, C.H.: Design synthesis of Microelectromechanical Systems using genetic algorithms with component-based genotype representation. In: Proc. of 2006 Genetic and Evolutionary Computation Conference, vol. 1, pp. 731–738 (2006)
7. Cobb, C.L., Agogino, A.M.: Case-based reasoning for the design of micro-electromechanical systems. In: Proc. of 2006 ASME International Design Engineering Technical Conferences & the Computers and Information in Engineering Conference, #DETC2006-99120 (2006)
8. Cobb, C.L., Zhang, Y., Agogino, A.M.: MEMS design synthesis: integrating case-based reasoning and multi-objective genetic algorithms. In: Proc. SPIE Int. Soc. Opt. Eng., Smart Structures, Devices, and Systems III, 6414: #641419 (invited paper, 2006)
9. Mukherjee, T., Zhou, Y., Fedder, G.: Automated optimal synthesis of microaccelerometers. In: Technical Digest of 12th IEEE International Conference on Micro Electro Mechanical Systems (MEMS 1999), pp. 326–331 (1999)
10. Wang, J., Fan, Z., Terpenny, J.P., Goodman, E.D.: Knowledge interaction with genetic programming in mechatronics systems design using bond graphs. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 35(2) (2005)
11. Li, J., Gao, S., Liu, Y.: Solid-based CAPP for surface micromachined MEMS devices. Computer-Aided Design 39, 190–201 (2007)
12. Holland, J.H.: Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor (1975)
13. Goldberg, D.E.: Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Longman, Boston (1989)
14. Chen, Y.-P., Goldberg, D.E.: Convergence time for the linkage learning genetic algorithm. Evolutionary Computation 13, 279–302 (2005)
15. Chen, Y.-P., Goldberg, D.E.: Introducing subchromosome representations to the linkage learning genetic algorithm. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3102, pp. 971–982. Springer, Heidelberg (2004)
16. Harik, G.R.: Learning gene linkage to efficiently solve problems of bounded difficulty using genetic algorithms. Ph.D. Thesis, University of Michigan, Ann Arbor (1997)
17. Schank, R.C.: Dynamic memory: a theory of reminding and learning in computers and people. Cambridge University Press, Cambridge (1982)
18. Kolodner, J.: Case-based reasoning. Morgan Kaufmann Publishers, Inc., San Mateo (1993)
19. Goel, A.K., Bhatta, S., Stroulia, E.: Kritik: an early case-based design system. In: Maher, M., Pu, P. (eds.) Issues and Applications of Case-Based Reasoning in Design, pp. 87–132. Lawrence Erlbaum Associates, Inc., Hillsdale (1997)
20. Hennessy, D., Hinkle, D.: Applying case-based reasoning to autoclave loading. IEEE Expert 7, 21–26 (1992)
21. Boyle, I.M., Rong, K., Brown, D.C.: CAFixD: a case-based reasoning fixture design method – framework and indexing mechanisms. In: Proc. of the 2004 ASME International Design Engineering Technical Conferences & the Computers and Information in Engineering Conference, #DETC2004-57689 (2004)
22. Varadan, V.K.: RF MEMS and their applications. John Wiley and Sons, Chichester (2002)
23. Walraven, J.A.: Introduction to applications and industries for microelectromechanical systems (MEMS). In: Proc. of 2003 International Test Conference (ITC), pp. 674–680 (2003)
24. Bell, D.J., Lu, T.J., Fleck, N.A.F., Spearing, S.M.: MEMS actuators and sensors: observations on their performance and selection for purpose. J. Micromech. Microeng. 15, 153–164 (2005)
25. Nguyen, C.T.-C.: RF MEMS in wireless architecture. In: Proc. of the 42nd Design Automation Conference, pp. 416–420 (2005)
26. Noy, N.F., McGuinness, D.L.: Ontology development 101: a guide to creating your first ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 (2001)
27. Stewart, I.: Why beauty is truth: a history of symmetry. Joat Enterprises, Basic Books (2007)
28. Johnstone, R.A.: Female preferences for symmetrical males as a by-product of selection for mate recognition. Nature 372, 172–175 (1994)
29. Swaddle, J.P., Cuthill, I.C.: Preference for symmetric males by female zebra finches. Nature 367, 165–166 (1994)
30. Enquist, M., Arak, A.: Symmetry, beauty, and evolution. Nature 372, 169–172 (1994)
31. Babcock, L.E.: Asymmetry in the fossil record. European Review 13, 135–143 (2005)
32. Moller, A.P., Manning, J.: Growth and developmental instability. The Veterinary Journal 166, 19–27 (2003)
33. Pfannenstiel, R.S., Hunt, R.E., Yeargan, K.V.: Orientation of a hemipteran predator to vibrations produced by feeding caterpillars. J. of Insect Behavior 8, 1–9 (1995)
34. Haynes, K.F., Yeargan, K.V., Gemenol, C.: Detection of prey by a spider that aggressively mimics pheromone blends. J. of Insect Behavior 14, 535–544 (2001)
35. Parry, D.A.: The signal generated by an insect in a spider's web. J. Exp. Biol. 43, 185–192 (1965)
36. Blickhan, R., Full, R.J.: Similarity in multilegged locomotion: bouncing like a monopode. J. Comp. Physiol. A 173, 509–517 (1993)
37. Mendenhall, W., Sincich, T.: Statistics for engineering and the sciences, 4th edn. Prentice-Hall, Inc., Upper Saddle River (1995)
Index
3-Deceptive 15
adaptive learning 3
adaptive score metric 103
additively decomposable function 442
agents 240
aggregation algorithm 67
allele survival 42
alternative splicing 189
Amdahl's Law 166
Bayesian information criterion 111
Bayesian network learning 88
Bayesian network 4
Bayesian optimization algorithm 87
Bayesian structure 110
bivariate marginal distribution algorithm (BMDA) 3
building blocks 442
case-based reasoning 465
cellular network 389
Chi-square 3
clustering 225
coding problem 249
combinatorial optimisation 214
complex system 315
compositional evolution 53
concatenated parity function 141
concatenated parity/trap function 141
context dependent crossover 448
contingency tables 6
covariance matrix 35
CP/TF 148
CPF 141
crossover 448
D5 442
design synthesis 461
Distributed computing timeline 166
Distributed Computing 161
dual coding 249
dynamic optimisation 189
dynamic representation 268
entropy 443
epistasis 315
estimation of Bayesian network algorithm 109
estimation of distribution algorithm (EDA) 442
evolution dynamics of rECGA 84
evolutionary algorithm 141
evolutionary computation 61
exact Bayesian network learning 110
fitness difference 442
frequency assignment 389
GEAs over Grid 165, 167
gene fixation 41
genetic algorithm 225
genetic operators 141
graph clustering 395
graph partitioning 396
graph theory 89
graph 448
Grid based Algorithms 165
Grid Computing 162
Grid middleware 164
Grid vs. Distributed computing 165
H-IFF function 41
hierarchical Bayesian optimisation algorithm (hBOA) 141
histogram-based estimation of distribution algorithm 25
hitch-hiking 41
HP protein model 117
incremental learning 62
information gain 229
inter-island evolution 55
interactions 446
intra-island evolution 43
island model 23
knowledge-based linkage 461
linkage learning accuracy 87
linkage learning 141
linkage 442
local optima 189
memory scheme 206
meta-heuristics 391
MetaHeuristics Grid (MHGrid) 170
MetaHeuristics Markup Language (MHML) 178
microelectromechanical systems (MEMS) 461
migration of individuals 7
minimum interference 389
mixture Gaussian probability density function 61
model migration 3
model overfitting 101
model structural accuracy 88
modularity 209
molecular genetics 194
multi-objective optimization 480
multimodal problems 228
multivariate real-valued deceptive function 70
mutual information 61
nature-inspired computation 194
network design 451
No Free Lunch Theorem 171
OneMax 12
Open Grid Services Architecture 164
panmictic 42
Parallel Computing 161
parallel genetic algorithm 3
parallelization 159
parameters estimation 341
partition variables 338
performance analyses of factorization 61
perturbation 442
physical linkage 41
probabilistic model building genetic algorithm (PMBGA) 61
probabilistic graphical model 25
probabilistic model building 78
probabilistic model 7
problem-knowledge 122
problem decomposition 71, 87, 391
problem structure 122
quadratic 13
rECGA 84
ring topology 5
sampling probabilistic models 335
scalability of algorithm 345
scalability 144
schemata 290
score + search 111
sequential genetic algorithm 252
Service Oriented Architecture 163
shifting balance theory 43
simulated annealing 390
steady state 268
stochastic optimisation 189
tournament selection 87
truncation selection 87
two-level evolution 42
TwoMax 12
Types of Grids 163
variables clustering 61
Virtual Organizations 163
wireless network 389
Author Index
Agogino, Alice M. 461
Akama, Kiyoshi 159, 441
Allen, Stuart M. 389
Bercachi, Maroun 249
Bullinaria, John A. 189
Chiang, Fu-Tien 315
Clergue, Manuel 249
Cobb, Corie L. 461
Coffin, David 141
Collard, Philippe 249
Colombo, Gualtiero 389
Congdon, Clare Bates 315
Di Paolo, Ezequiel 361
Ding, Nan 25
Echegoyen, Carlos 109
Emmendorfer, Leonardo 225
Goldberg, David E. 61, 87, 335
Goth, Thomas 315
Halavati, Ramin 285
Hauschild, Mark 87
Hu, Xiao-Bing 361
Isaacs, Amitay 419
Jaros, Jiri 3
Lanzi, Pier Luca 335
Larrañaga, Pedro 109
Li, Minqiang 61
Lima, Claudio F. 87
Lobo, Fernando G. 87
Lozano, Jose A. 109
Mangold, Jennifer 461
Munawar, Asim 159
Munetomo, Masaharu 159, 441
Nichetti, Luigi 335
Pelikan, Martin 87
Pozo, Aurora 225
Ray, Tapabrata 419
Rohlfshagen, Philipp 189
Santana, Roberto 109
Sastry, Kumara 61, 87, 335
Schwarz, Josef 3
Shouraki, Saeed Bagheri 285
Skolicki, Zbigniew 41
Smith, Robert E. 141
Smith, Warren 419
Tsai, Chia-Ti 315
Tsuji, Miwako 441
Verel, Sebastien 249
Voltini, Davide 335
Wahib, Mohamed 159
Yu, Tian-Li 61
Zhang, Ying 461
Zhou, Shude 25