Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
6471
Jaume Bacardit Will Browne Jan Drugowitsch Ester Bernadó-Mansilla Martin V. Butz (Eds.)
Learning Classifier Systems 11th International Workshop, IWLCS 2008 Atlanta, GA, USA, July 13, 2008 and 12th International Workshop, IWLCS 2009 Montreal, QC, Canada, July 9, 2009 Revised Selected Papers
Series Editors

Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors

Jaume Bacardit, University of Nottingham, Nottingham, NG8 1BB, UK
E-mail: [email protected]
Will Browne, Victoria University of Wellington, Wellington 6140, New Zealand
E-mail: [email protected]
Jan Drugowitsch, University of Rochester, Rochester, NY 14627, USA
E-mail: [email protected]
Ester Bernadó-Mansilla, Universitat Ramon Llull, 08022 Barcelona, Spain
E-mail: [email protected]
Martin V. Butz, University of Würzburg, 97070 Würzburg, Germany
E-mail: [email protected]

Library of Congress Control Number: 2010940267
CR Subject Classification (1998): I.2.6, I.2, H.3, D.2.4, D.2.8, F.1, H.4, H.2.8
LNCS Sublibrary: SL 7 – Artificial Intelligence
ISSN: 0302-9743
ISBN-10: 3-642-17507-4 Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-17507-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
Learning Classifier Systems (LCS) constitute a fascinating concept at the intersection of machine learning and evolutionary computation. LCS's genetic search, generally in combination with reinforcement learning techniques, can be applied to both temporal and spatial problem-solving and promotes powerful search in a wide variety of domains. The LCS concept allows many representations of the learned knowledge, from simple production rules to artificial neural networks to linear approximations, often in a human-readable form.

The concepts underlying LCS have been developed for over 30 years, with the International Workshop on Learning Classifier Systems supporting the field since 1992. From 1999 onwards the workshop has been held yearly, in conjunction with PPSN in 2000 and 2002 and with GECCO in 1999, 2001, and from 2003 onwards. This book is the continuation of the six volumes containing selected and revised papers from the previous workshops, published by Springer as LNAI 1813, LNAI 1996, LNAI 2321, LNAI 2661, LNCS 4399, and LNAI 4998.

The articles in this book have been loosely organized into four overlapping themes. Firstly, the breadth of research into LCS and related areas is demonstrated. Then the ability to approximate complex multidimensional function surfaces is shown by the latest research on computed predictions and piecewise approximations. This work leads on to LCS for complex domains, such as temporal decision-making and continuous domains, where traditional learning approaches often require problem-dependent manual tuning of the algorithms and discretization of problem spaces, resulting in a loss of information. Finally, diverse application examples are presented to demonstrate the versatility and broad applicability of the LCS approach.
Pier Luca Lanzi and Daniele Loiacono investigate the use of general-purpose Graphics Processing Units (GPUs), which are becoming increasingly common in evolutionary computation, for speeding up the matching of environmental states to rules in LCS. Depending on the problem investigated and representation scheme used, they find that the use of GPUs improves the matching speed by 3 to 50 times when compared with matching on standard CPUs. Association rule mining, where interesting associations in the occurrence of items in streams of unlabelled examples are to be extracted, is addressed by Albert Orriols-Puig and Jorge Casillas. Their novel CSar Michigan-style learning classifier system shows promising results when compared with the benchmark approach to this problem. Stewart Wilson shows that there is still much scope for generating novel approaches with the LCS concept. He proposes an automatic system for creating pattern generators and recognizers based on a three-cornered competitive co-evolutionary algorithm approach.
Patrick O. Stalph and Martin V. Butz investigate current capabilities and challenges facing XCSF, an LCS in which each rule builds a locally linear approximation to the payoff surface within its matching region. It is noted that the XCSF approach was the most popular branch of LCS research within the latest editions of this workshop. In a second paper the same authors investigate the impact of variable offspring set sizes, which show promise beyond the standard two offspring used in many genetics-based machine learning techniques. The model used in XCSF by Gerard David Howard, Larry Bull, and Pier Luca Lanzi uses an artificial neural network, instead of standard rules, for matching and action selection, thus illustrating the flexible nature of LCS techniques. Their method is compared with principles from the NEAT (NeuroEvolution of Augmenting Topologies) approach and augmented with previous LCS neural constructivism work to improve performance in continuous environments. Gilles Enée and Mathias Péroumalnaïk also examine how LCS copes with complex environments by introducing the Adapted Pittsburgh Classifier System and applying it to maze-type environments containing aliasing squares. This work shows that the LCS is capable of building accurate strategies in non-Markovian environments without the use of rules with memory. Ajay Kumar Tanwani and Muddassar Farooq compare three LCS-based data mining techniques to three benchmark algorithms on biomedical data sets, showing that, although not completely dominant, the GAssist LCS approach is in general able to provide the best classification results on the majority of datasets tested. Illustrating the diversity of application domains for LCS, supply chain management sales is investigated by María Franco, Ivette Martínez, and Celso Gorrin, showing that the set of generated rules solves the sales problem in a satisfactory manner.
Richard Preen uses the well-established XCS LCS to identify trade entry and exit timings for financial time-series forecasting. These results show the promise of LCS in this difficult domain, due to its noisy, dynamic, and temporal nature. In the final application paper, José G. Moreno-Torres, Xavier Llorà, David E. Goldberg, and Rohit Bhargava provide an approach to the homogenization of laboratory data through the use of a genetic-programming-based algorithm. As in the previous volumes, we hope that this book will be a useful support for researchers interested in learning classifier systems and will provide insights into the most relevant topics. Finally, we hope it will encourage new researchers, business, and industry to investigate the LCS concept as a method to discover solutions to their varied problems.

September 2010
Will Browne Jaume Bacardit Jan Drugowitsch
Organization
The postproceedings of the International Workshops on Learning Classifier Systems 2008 and 2009 were assembled by the organizing committee of IWLCS 2009.
IWLCS 2008 Organizing Committee

Jaume Bacardit (University of Nottingham, UK)
Ester Bernadó-Mansilla (Universitat Ramon Llull, Spain)
Martin V. Butz (Universität Würzburg, Germany)

Advisory Committee
Tim Kovacs (University of Bristol, UK)
Xavier Llorà (Univ. of Illinois at Urbana-Champaign, USA)
Pier Luca Lanzi (Politecnico di Milano, Italy)
Wolfgang Stolzmann (DaimlerChrysler AG, Germany)
Keiki Takadama (Tokyo Institute of Technology, Japan)
Stewart Wilson (Prediction Dynamics, USA)
IWLCS 2009 Organizing Committee

Jaume Bacardit (University of Nottingham, UK)
Will Browne (Victoria University of Wellington, New Zealand)
Jan Drugowitsch (University of Rochester, USA)

Advisory Committee

Ester Bernadó-Mansilla (Universitat Ramon Llull, Spain)
Martin V. Butz (Universität Würzburg, Germany)
Tim Kovacs (University of Bristol, UK)
Xavier Llorà (Univ. of Illinois at Urbana-Champaign, USA)
Pier Luca Lanzi (Politecnico di Milano, Italy)
Wolfgang Stolzmann (DaimlerChrysler AG, Germany)
Keiki Takadama (Tokyo Institute of Technology, Japan)
Stewart Wilson (Prediction Dynamics, USA)
Referees

Ester Bernadó-Mansilla
Lashon Booker
Will Browne
Larry Bull
Martin V. Butz
Jan Drugowitsch
Ali Hamzeh
Francisco Herrera
John Holmes
Tim Kovacs
Pier Luca Lanzi
Xavier Llorà
Daniele Loiacono
Drew Mellor
Luis Miramontes Hercog
Albert Orriols-Puig
Wolfgang Stolzmann
Keiki Takadama
Stewart W. Wilson
Past Workshops

1st IWLCS, October 1992, NASA Johnson Space Center, Houston, TX, USA
2nd IWLCS, July 1999, GECCO 1999, Orlando, FL, USA
3rd IWLCS, September 2000, PPSN 2000, Paris, France
4th IWLCS, July 2001, GECCO 2001, San Francisco, CA, USA
5th IWLCS, September 2002, PPSN 2002, Granada, Spain
6th IWLCS, July 2003, GECCO 2003, Chicago, IL, USA
7th IWLCS, June 2004, GECCO 2004, Seattle, WA, USA
8th IWLCS, June 2005, GECCO 2005, Washington, DC, USA
9th IWLCS, July 2006, GECCO 2006, Seattle, WA, USA
10th IWLCS, July 2007, GECCO 2007, London, UK
11th IWLCS, July 2008, GECCO 2008, Atlanta, GA, USA
12th IWLCS, July 2009, GECCO 2009, Montreal, Canada
13th IWLCS, July 2010, GECCO 2010, Portland, OR, USA
Table of Contents
LCS and Related Methods

Speeding Up Matching in Learning Classifier Systems Using CUDA
    Pier Luca Lanzi and Daniele Loiacono ..... 1
Evolution of Interesting Association Rules Online with Learning Classifier Systems
    Albert Orriols-Puig and Jorge Casillas ..... 21
Coevolution of Pattern Generators and Recognizers
    Stewart W. Wilson ..... 38

Function Approximation

How Fitness Estimates Interact with Reproduction Rates: Towards Variable Offspring Set Sizes in XCSF
    Patrick O. Stalph and Martin V. Butz ..... 47
Current XCSF Capabilities and Challenges
    Patrick O. Stalph and Martin V. Butz ..... 57

LCS in Complex Domains

Recursive Least Squares and Quadratic Prediction in Continuous Multistep Problems
    Daniele Loiacono and Pier Luca Lanzi ..... 70
Use of a Connection-Selection Scheme in Neural XCSF
    Gerard David Howard, Larry Bull, and Pier Luca Lanzi ..... 87
Building Accurate Strategies in Non Markovian Environments without Memory
    Gilles Enée and Mathias Péroumalnaïk ..... 107
Classification Potential vs. Classification Accuracy: A Comprehensive Study of Evolutionary Algorithms with Biomedical Datasets
    Ajay Kumar Tanwani and Muddassar Farooq ..... 127

Applications

Supply Chain Management Sales Using XCSR
    María Franco, Ivette Martínez, and Celso Gorrin ..... 145
Identifying Trade Entry and Exit Timing Using Mathematical Technical Indicators in XCS
    Richard Preen ..... 166
On the Homogenization of Data from Two Laboratories Using Genetic Programming
    José G. Moreno-Torres, Xavier Llorà, David E. Goldberg, and Rohit Bhargava ..... 185

Author Index ..... 199
Speeding Up Matching in Learning Classifier Systems Using CUDA

Pier Luca Lanzi and Daniele Loiacono
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy
{lanzi,loiacono}@elet.polimi.it
Abstract. We investigate the use of NVIDIA’s Compute Unified Device Architecture (CUDA) to speed up matching in classifier systems. We compare CUDA-based matching and CPU-based matching on (i) real inputs using interval-based conditions and on (ii) binary inputs using ternary conditions. Our results show that on small problems, due to the memory transfer overhead introduced by CUDA, matching is faster when performed using the CPU. As the problem size increases, CUDA-based matching can outperform CPU-based matching resulting in a 3-12× speedup when the interval-based representation is applied to match real-valued inputs and a 20-50× speedup for ternary-based representation.
1 Introduction
Learning classifier systems [10,8,17] combine evolutionary computation with methods of temporal difference learning to solve classification and reinforcement learning problems. A classifier system maintains a population of condition-action-prediction rules, called classifiers, which identifies its current knowledge about the problem to be solved. At each time step, the system receives the current state of the problem and matches it against all the classifiers in the population. The result is a match set containing the classifiers that can be applied to the problem in its current state. Based on the value of the actions in the match set, the classifier system selects an action to perform on the problem to progress toward its solution. As a consequence of the executed action, the system receives a numerical reward that is distributed to the classifiers accountable for it. While the classifier system is interacting with the problem, a genetic algorithm is applied to the population to discover better classifiers through selection, recombination, and mutation.

Matching is the main and most computationally demanding process of a classifier system [14,3] and can occupy up to 65%-85% of the overall computation time [14]. Accordingly, several methods have been proposed in the literature to speed up matching in learning classifier systems.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 1–20, 2010. © Springer-Verlag Berlin Heidelberg 2010

Llorà and Sastry [14] compared the typical encoding of classifier conditions for binary inputs, an encoding based on the underlying binary arithmetic, and a version of the
same encoding optimized via vector instructions. Their results show that binary encodings combined with optimizations based on the underlying integer arithmetic can speed up the matching process by up to 80 times. The analysis of Llorà and Sastry [14] did not consider the influence of classifier generality on the complexity of matching. As noted in [3], matching usually stops as soon as it is determined that the classifier cannot be applied to the current problem instance (e.g., [1,12]). Accordingly, matching a population of highly specific classifiers takes much less time than matching a population of highly general classifiers. Butz et al. [3] extended the analysis in [14] (i) by considering more encodings (the specificity-based encoding used in Butz's implementation [1] and the encoding used in some implementations of Alecsys [7]); and (ii) by taking into account classifiers' generality. Their results show that, overall, specificity-based matching can be 50% faster than character-based encoding when general populations are involved, but it can be slower than character-based encoding if more specific populations are considered. Binary encoding was confirmed to be the fastest option, with a reported improvement of up to 90% compared to the usual character-based encoding. Butz et al. [3] also proposed a specificity-based encoding for real-coded inputs which could halve the time required to match a population.
Our results show that on small problems, due to the memory transfer overhead introduced by GPUs, matching is faster when performed using the usual CPU. On larger problems, involving either more variables or more classifiers, GPU-based matching can outperform the CPU-based implementation with a 3-12× speedup when the interval-based representation is applied to match real-valued inputs and a 20-50× speedup for the ternary-based representation.
2 General-Purpose Computation on GPUs
Graphics Processing Units (GPUs) currently provide the best floating-point performance, with a throughput at least ten times higher than that of multi-core CPUs. Such a large performance gap has pushed developers to move several computationally intensive parts of their software onto GPUs. Many-core GPUs perform better than general-purpose multi-core CPUs on floating-point computation because they follow a different underlying design philosophy (see Figure 1). The design of a CPU is optimized for sequential code performance. It exploits sophisticated control logic to execute instructions from a single thread in parallel while maintaining the appearance of sequential
execution. In addition, large cache memories are provided to reduce the instruction and data access latencies required in large, complex applications. On the other hand, the GPU design is optimized for the execution of a massive number of threads. It exploits the large number of executed threads to find work to do during long-latency memory accesses, minimizing the control logic required for each thread. Small cache memories are provided so that, when multiple threads access the same data, they do not all need to access the DRAM. As a result, much more chip area is dedicated to floating-point calculations.
Fig. 1. An overview of CPU and GPU design philosophies
2.1 The CUDA Programming Model
NVIDIA's Compute Unified Device Architecture (CUDA)¹ allows developers to write computationally intensive applications on a GPU by using an extension of C which provides abstractions for parallel programming. In CUDA, GPUs are represented as devices that can run a large number of threads. Parallel tasks are represented as kernels mapped over a domain. Each kernel represents a sequential task to be executed as a thread on each point of the domain. The data to be processed by the GPU must be loaded into the board memory and, unless deallocated or overwritten, they remain available for subsequent kernels. Kernels have built-in variables to identify themselves in the domain and to access the data in the board memory. The domain is defined as a 5-dimensional structure consisting of a two-dimensional grid of three-dimensional thread blocks. Thread blocks are limited to 512 total threads; each block is assigned to a single processing element and runs as a unit until completion without preemption. Note that the resources used by a block are released only after the execution of all the threads in the same block has completed. Once a block is assigned to a streaming multiprocessor, it is further divided into groups of 32 threads, called warps. All threads within the same block are simultaneously live and are temporally multiplexed but, at any time, the processing element executes only one of its resident warps. When the number of thread blocks in a grid exceeds the hardware
¹ http://www.nvidia.com/object/cuda_home_new.html
resources, new blocks are assigned to processing elements as soon as previous ones complete their execution. In addition to the global memory of the device, GPUs also have a private memory, visible only to threads within the same block, called per-block shared memory (PBSM).

2.2 Performance Issues
Although CUDA is very intuitive, it requires a deep knowledge of the underlying hardware architecture. CUDA developers need to take into account the specific features of the GPU architecture, such as memory transfer overhead, shared memory bank conflicts, and the impact of control flow. In fact, in CUDA, it is necessary to manage the communication between main memory and GPU memory explicitly. Developers have to reduce the transfer overhead by avoiding frequent data transfers between the GPU and CPU. Accordingly, rather than increasing the amount of communication with the CPU, computation is usually duplicated on the GPU and typically overlapped with data communication. Once the memory transfer overhead has been optimized, developers must optimize the access to the global memory of the device, which represents one of the most important performance issues in CUDA. In general, CUDA applications exploit massive data parallelism in that they process a massive amount of data within a short period of time. Therefore, a CUDA kernel must be able to access a massive amount of data from the global memory within a very short period of time. As memory access is a slow process, modern DRAMs use a parallel process to increase their data access rate. When a memory location is accessed, many consecutive locations are also accessed. If an application exploits data from multiple, consecutive locations before moving on to other locations, the DRAMs can supply the data at a much higher rate than for accesses to a random sequence of locations. In CUDA, it is possible to take advantage of the fact that threads in a warp execute the same instruction at any given point in time. When all threads in a warp execute a load instruction, the hardware detects whether the threads access consecutive global memory locations.
The most favorable access pattern is achieved when the same instruction for all threads in a warp accesses consecutive global memory locations. In this case, the hardware combines, or coalesces, all these accesses into a consolidated access to the DRAMs that requests all the consecutive locations involved. Such coalesced access allows the DRAMs to deliver data at a rate close to the maximal global memory bandwidth. Finally, control flow instructions (e.g., the if or switch statements) can significantly affect the instruction throughput when threads within the same warp follow different branches. When executing different branches, either the execution of each path must be serialized or all threads within the warp must execute each instruction, with predication used to mask out the effects of instructions that should not be executed [19]. Thus, kernels should be optimized to avoid excessive use of control flow statements or to ensure that the branches executed will be the same across the whole warp.
3 The XCS Classifier System
XCS [17] maintains a population of condition-action-prediction rules (or classifiers), which represents the system's current knowledge about a problem solution. Each classifier represents a portion of the overall solution. The classifier's condition identifies a part of the problem domain; the classifier's action represents a decision on the part of the domain identified by its condition; the classifier's prediction p estimates the value of the action in terms of the problem solution. Classifier conditions are usually strings defined over the ternary alphabet {0,1,#}, in which the don't care symbol # indicates that the corresponding position can match either a 0 or a 1. Actions are usually binary strings. XCS applies supervised or reinforcement learning to evaluate the classifiers' prediction and a genetic algorithm to discover better classifiers by selecting, recombining, and mutating existing ones. To guide the evolutionary process, the classifiers keep three additional parameters: the prediction error ε, which estimates the average absolute error of the classifier prediction p; the fitness F, which estimates the average relative accuracy of the payoff prediction given by p and is a function of the prediction error ε; and the numerosity num, which indicates how many copies of classifiers with the same condition and the same action are present in the population. At time t, XCS builds a match set [M] containing the classifiers in the population [P] whose condition matches the current input s_t; for each classifier, the match procedure scans all the input bits to check whether the classifier condition contains a don't care symbol (#) or an input bit is equal to the corresponding character in the condition. If [M] contains fewer than θ_mna actions, covering takes place and creates a new classifier with a random action and a condition, with a proportion P_# of don't care symbols, that matches s_t.
For each possible action a in [M], XCS computes the system prediction P(s_t, a), which estimates the payoff that XCS expects if action a is performed in s_t. The system prediction P(s_t, a) is computed as the fitness-weighted average of the predictions of the classifiers in [M] that advocate action a:

$$P(s_t, a) = \frac{\sum_{cl_k \in [M](a)} p_k \times F_k}{\sum_{cl_i \in [M](a)} F_i} \qquad (1)$$
where [M](a) represents the subset of classifiers of [M] with action a, p_k identifies the prediction of classifier cl_k, and F_k identifies the fitness of classifier cl_k. Next, XCS selects an action to perform; the classifiers in [M] that advocate the selected action form the current action set [A]. The selected action a_t is performed, and a scalar reward r_{t+1} is returned to XCS together with a new
input s_{t+1}. The incoming reward r_{t+1} is used to compute the estimated payoff P(t) as,

$$P(t) = r_{t+1} + \gamma \max_{a \in [M]} P(s_{t+1}, a) \qquad (2)$$
Next, the parameters of the classifiers in [A] are updated [5]. At first, the prediction p is updated with learning rate β (0 ≤ β ≤ 1) as,

$$p \leftarrow p + \beta\,(P(t) - p) \qquad (3)$$
Then, the prediction error ε and the fitness are updated [17,5]. On a regular basis (dependent on the parameter θga ), the genetic algorithm is applied to the classifiers in [A]. It selects two classifiers, copies them, and with probability χ performs crossover on the copies; then, with probability μ it mutates each allele. The resulting offspring classifiers are inserted into the population and two other classifiers are deleted from the population to keep the population size N constant.
4 Matching Interval-Based Conditions Using GPUs
Learning classifier systems typically assume that inputs are encoded as binary strings and that classifier conditions are strings defined over the ternary alphabet {0,1,#} [9,8,16,17]. There are, however, several representations that can deal with real-valued inputs: center-based intervals [18], simple intervals [19,15], convex hulls [13], ellipsoids [2], and hyper-ellipsoids [4].

4.1 Interval-Based Conditions and Matching
In the interval-based case [19], a condition is represented by a concatenation of n real interval predicates, int_i = (l_i, u_i); given an input s consisting of n real numbers, a condition matches s if, for every i ∈ {1, ..., n}, the predicate l_i ≤ s_i ∧ s_i ≤ u_i is verified. The matching is straightforward and its pseudocode is reported as Algorithm 1: the condition (identified by the variable condition) is represented as a vector of intervals; the inputs are a vector of real values (in double precision); the n inputs (i.e., inputs.size()) are scanned and each input is tested against the corresponding interval; the process stops either when all the inputs have matched or as soon as one of the intervals does not match (when result in Algorithm 1 becomes false). Butz et al. [3] showed that this matching procedure can be sped up by changing the order in which the inputs are tested: if smaller (more specific) intervals are tested first, the match is more likely to fail early, which speeds up the matching process. Their results on matching alone showed that this specificity-based matching could produce a 60% speed increase when applied to populations containing classifiers with highly specific conditions. However, they reported no significant improvement when their specificity-based matching was applied to typical testbeds.
Algorithm 1. Matching for interval-based conditions in XCSLib.

// representation of the classifier condition
vector<interval> condition;
// representation of the inputs
vector<double> inputs;

// matching procedure
int pos = 0;
bool result = true;
while ((result) && (pos < inputs.size())) {
    result = ((inputs[pos] >= condition[pos].lower) &&
              (condition[pos].upper >= inputs[pos]));
    pos++;
}
return result;
4.2 Interval-Based Matching Using CUDA
Implementing interval-based matching using CUDA is straightforward and involves three simple design steps. First, we need to decide how to represent classifier conditions in the graphics board memory; then, we have to decide how parallelization is organized; finally, we need to implement the required kernel functions. Once these steps are performed, the matching of interval-based conditions on the GPU consists of (i) transferring the data to the board memory of the GPU, (ii) invoking the kernels that perform the matching, and finally (iii) retrieving the result from the board memory.

Condition Representation. An interval-based condition can be easily encoded using two arrays of float variables, one to store all the condition's lower bounds and one to store all the condition's upper bounds. Algorithm 2 reports the matching algorithm using the lower and upper bound vectors. We can apply the same principle to encode a population of N classifiers using two matrices of float variables lb and ub, which contain all the lower bounds and all the upper bounds of the conditions in the population. Given a problem with n real inputs, the matrices lb and ub can be organized either (i) by rows, putting in each row of the matrices the n lower/upper bounds of the same classifier (Figure 2a), or (ii) by columns, putting in each column of the matrices the n lower/upper bounds of the same classifier (Figure 2b). In both representations, the matrices lb and ub are then linearized into arrays to be stored in the GPU memory. In particular, when the representation by rows is used, the
Algorithm 2. Matching for interval-based conditions using arrays.

// representation of the classifier condition
float lb[n];
float ub[n];
// representation of the inputs
float inputs[n];

// matching procedure
int pos = 0;
bool result = true;
while ((result) && (pos < n)) {
    result = ((inputs[pos] >= lb[pos]) &&
              (ub[pos] >= inputs[pos]));
    pos++;
}
return result;
first n values of lb contain the lower bounds of the first classifier condition in the population; while the first n values of ub contain the upper bounds of the same condition. The next n values in lb and ub contain the lower and upper bounds of the second classifier condition, and so on for all the N classifiers in the population. In contrast, when the representation by columns is used, the first N values of lb contain the lower bounds associated to the first input of the N classifiers in the population; similarly the first N values of ub contain the corresponding upper bounds. The next N values in lb and ub contain the lower and upper bounds associated to the second input, and so on for all the n inputs of the problem.
Fig. 2. Classifier conditions in the GPU global memory are represented as two matrices lb and ub, which can be stored (a) by rows or (b) by columns; cli represents the variables in the classifier condition; si shows which variables are matched in parallel by the kernel
Speeding Up Matching in Learning Classifier Systems Using CUDA
9
Matching. To perform the matching, the classifier conditions in the population are stored (either by rows or by columns) in the GPU main memory as the two vectors lb and ub of n × N elements each; the current input is stored in the GPU memory as a vector s of n floats. A result vector matched of N integers in the GPU memory is used to store the result of the matching procedure: a 1 in position i means that the condition of classifier cli matched the current input; a 0 in the same position means that the condition of cli did not match. Matching is then performed by running the matching kernel on the data structures that have been loaded into the device memory.

Memory Organization. As previously noted, the vectors lb and ub can be stored in the device memory by rows (Figure 2a) or by columns (Figure 2b). To maximize the performance of a GPU implementation, at each clock cycle the threads must access memory positions that are very close to each other, since the GPU reads blocks of contiguous memory locations in a single transaction. Note that, while the representation of lb and ub by rows (Figure 2a) appears to be the most straightforward, it also allows the least parallelism in memory access. As an example, consider the first two classifiers in the population (cl0 and cl1), whose lower bounds are stored in positions 0 to n-1 for cl0 and n to 2n-1 for cl1. At the first clock cycle, one thread will start the matching of the first condition and will access the value in lb[0], while the second thread will access the value in lb[n] (i.e., the first lower-bound value of cl0 and of cl1, respectively). When n is large, these two memory positions will be far apart, so the two accesses cannot be served by the same memory transaction.
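The access pattern of the column layout can be simulated on the CPU with a small sketch (illustrative only, not the paper's kernel; the function name and the thread-as-loop simplification are ours): at matching step i, consecutive thread indices tidx read the consecutive addresses lb[i*N + tidx], which is what enables coalesced reads.

```c
#include <assert.h>
#include <stdbool.h>

/* CPU simulation of matching with the column layout: the bounds for input i
   occupy lb[i*N] .. lb[i*N + N-1], so at each step i, consecutive "threads"
   tidx touch consecutive memory addresses. */
static void match_by_columns(const float *lb, const float *ub,
                             const float *input, int *matched, int n, int N)
{
    for (int tidx = 0; tidx < N; tidx++) {   /* one iteration per simulated thread */
        bool result = true;
        for (int i = 0; result && i < n; i++) {
            float l = lb[i * N + tidx];       /* contiguous across tidx */
            float u = ub[i * N + tidx];
            result = (input[i] >= l) && (u >= input[i]);
        }
        matched[tidx] = result ? 1 : 0;
    }
}
```

With N = 2 classifiers and n = 2 inputs, a classifier covering [0,1] × [0,1] matches the input (0.2, 0.5), while one covering [0.5,1] × [0,1] does not.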
Algorithm 3. Kernel for interval-based matching in CUDA using a row-based representation.

// LB and UB represent the classifier conditions
// n is the size of the input
// N is the population size
__global__ void match(float* LB, float* UB, float* input, int* matched, int n, int N)
{
    // computes the position of the classifier condition in the arrays LB and UB
    const unsigned int tidx = threadIdx.x + blockIdx.x*blockDim.x;
    const unsigned int pos = tidx*n;
    if (tidx < N) {
        int i = 0;
        bool result = true;
        while (result && (i < n)) {
            result = ((input[i] >= LB[pos+i]) && (UB[pos+i] >= input[i]));
            i++;
        }
        matched[tidx] = result ? 1 : 0;
    }
}

d_k = as_k · num_k · F[P]/F_k   if exp_k > θdel and F_k < δF[P]   (10)
d_k = as_k · num_k              otherwise                          (11)
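The deletion vote can be sketched as follows, assuming an XCS-style scheme (the function and parameter names are illustrative, not CSar's code; as_size stands for the rule's association set size estimate and avg_fitness for the population's average fitness):

```c
#include <assert.h>

/* Illustrative XCS-style deletion vote: the vote grows with the rule's
   association set size estimate, and is further inflated for experienced
   rules whose fitness is well below the population average. */
static double deletion_vote(double as_size, double num, double fitness,
                            double experience, double avg_fitness,
                            double theta_del, double delta)
{
    double vote = as_size * num;
    if (experience > theta_del && fitness < delta * avg_fitness)
        vote *= avg_fitness / fitness;   /* penalize low-fitness rules */
    return vote;
}
```

With this scheme, an experienced rule whose fitness is a quarter of the average (and below the δ threshold) receives a vote four times larger than an otherwise identical rule of average fitness.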
Thus, the deletion algorithm balances the classifier allocation in the different association sets by pushing toward the deletion of rules belonging to large association sets. At the same time, it biases the search toward highly fit classifiers, since the deletion probability of rules whose fitness is much smaller than the average fitness is increased.

3.4
Rule Set Reduction
At the end of the learning process, the final rule set is processed to provide the user with only interesting rules. For this purpose, we apply the following reduction mechanism. First, we remove all rules whose experience is smaller than θexp (θexp is a user-set parameter). Then, each rule is checked against every other rule for subsumption, following the same procedure used for association rule subsumption but with the following exception: now, a rule ri is a candidate subsumer of rj if ri and rj have the same variables in their antecedent and consequent, ri is more general than rj, and ri has higher confidence than rj. Note that, during learning, the subsumption mechanism instead requires that the confidence of ri be greater than conf0. After applying the rule set reduction mechanism, we make sure that the final population consists of different rules. Other policies can easily be incorporated into this process, such as removing rules whose support and confidence are below a predefined threshold. Nonetheless, in our experiments we return all the experienced rules in the final population that are not subsumed by any other rule. This section has described the mechanisms that CSar uses to evolve a population of interesting association rules online. Unlike other quantitative association-rule miners, CSar has a maximum population size that limits the number of different interesting association rules that can exist in the final population. The system organizes rules in different association sets and uses a GA to make rules in the same association set compete. Therefore, CSar does not aim at returning all the possible association rules, but at providing the user with a population of limited size containing “phenotypically” different and interesting association rules.
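For rules over the same fixed set of variables, the reduction-time subsumption test can be sketched as follows (an illustrative reading of the conditions above for interval-based rules; the struct layout and names are ours, not CSar's code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy rule over two interval variables; confidence is the rule's estimated
   confidence. In this sketch, ri subsumes rj if every interval of ri contains
   the corresponding interval of rj and ri's confidence is strictly higher. */
typedef struct {
    double lb[2], ub[2];
    double confidence;
} rule_t;

static bool is_more_general(const rule_t *ri, const rule_t *rj, int nvars)
{
    for (int v = 0; v < nvars; v++)
        if (ri->lb[v] > rj->lb[v] || ri->ub[v] < rj->ub[v])
            return false;   /* some interval of ri does not contain rj's */
    return true;
}

static bool subsumes(const rule_t *ri, const rule_t *rj, int nvars)
{
    return is_more_general(ri, rj, nvars) && ri->confidence > rj->confidence;
}
```

A rule covering [0,1] × [0,1] with confidence 0.9 subsumes a rule covering [0.2,0.8] × [0.1,0.9] with confidence 0.7, but not vice versa.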
4
Experimental Methodology
After having carefully described the system, we are now in a position to experimentally analyze the behavior of CSar. The aim of the experimental analysis
Evolution of Interesting Association Rules
31
Table 1. Properties of the data sets. The columns describe: the identifier of the data set (Id.); the name of the data set (dataset); the number of instances (#Inst); the total number of features (#Fea); the number of real features (#Re); the number of integer features (#In); and the number of nominal features (#No).

Id.    dataset                          #Inst  #Fea  #Re  #In  #No
adl    Adult                            48841    15    0    6    9
ann    Annealing                          898    39    6    0   33
aud    Audiology                          226    70    0    0   70
aut    Automobile                         205    26   15    0   11
bpa    Bupa                               345     7    6    0    1
col    Horse colic                        368    23    7    0   16
gls    Glass                              214    10    9    0    1
h-s    Heart-s                            270    14   13    0    1
irs    Iris                               150     5    4    0    1
let    Letter recognition               20000    17    0   16    1
pim    Pima                               768     9    8    0    1
tao    Tao                               1888     3    2    0    1
thy    Thyroid                            215     6    5    0    1
wdbc   Wisc. diagnose breast-cancer       569    31   30    0    1
wne    Wine                               178    14   13    0    1
wpbc   Wisc. prognostic breast-cancer     198    34   33    0    1
was to (1) study whether CSar could actually evolve a set of interesting association rules and (2) examine the behavior of the system under different configurations. With these objectives in mind, we performed the following two experiments. Our first concern was to analyze whether CSar could evolve the most interesting association rules despite having a fixed population size. Therefore, we compared CSar with Apriori [3], probably the most influential association rule miner, on the zoo problem [4]. We selected the zoo problem for this analysis since Apriori only works on problems described by categorical attributes and the zoo problem satisfies this requirement. More specifically, the zoo problem is defined by (1) fifteen binary attributes which indicate whether the animal has each of fifteen characteristics, such as a tail or hair, and (2) two categorical attributes that can take more than two values and which represent the number of legs and the type of animal. Secondly, we studied the impact of using the two different procedures to create association rule candidates and of using progressively larger maximum intervals. For this purpose, we ran CSar (1) with both the antecedent- and consequent-grouping strategies to create association set candidates and (2) with different maximum interval lengths on a collection of real-world problems extracted from the UCI repository [4] and from local repositories [7]. The characteristics of these problems are reported in Table 1. In all runs, CSar employed the following configuration: num iterations = 100 000, popSize = 6 400, conf0 = 0.95, ν = 10, θmna = 10, {θdel, θGA} = 50, θexp = 1000, Pχ = 0.8, {PI/R, Pμ, PC} = 0.1, m0 = 0.2. Association set subsumption was activated in all runs.
32
A. Orriols-Puig and J. Casillas
5
Analysis of the Results
With the aims of the experiments in mind, in what follows we discuss the experimental results.

5.1
Ability of CSar to Discover Interesting Rules
In order to study the ability of CSar to extract interesting association rules, we first compared the system with Apriori on a problem with only categorical attributes, the zoo problem. CSar was run with both the antecedent-grouping and consequent-grouping strategies. As we wanted to analyze the interestingness of the rules created by the systems, we report the number of rules with different minimum supports and confidences obtained by CSar with the two grouping strategies (see Figure 1). The same information is reported for Apriori in Figure 2; however, in this case, the resulting rules of Apriori have been filtered.
Fig. 1. Number of rules evolved with minimum support and confidence for the zoo problem with (a) antecedent-grouping and (b) consequent-grouping strategies. The curves are averages over five runs with different random seeds.
Fig. 2. Number of rules created by Apriori with minimum support and confidence for the zoo problem. Lower confidence and support are not shown since Apriori creates all possible combinations of attributes, exponentially increasing the number of rules.
Table 2. Comparison of the number of rules evolved by CSar with antecedent- and consequent-grouping strategies to form the association set candidates with the number of rules evolved by Apriori at high support and confidence values

            antecedent grouping        consequent grouping      Apriori
Confidence  0.4      0.6      0.8      0.4    0.6    0.8        0.4   0.6   0.8
Support
0.40        275±30   271±27   230±23   65±10  63±9   59±9       2613  2514  2070
0.50        123±4    123±4    106±3    61±8   61±8   58±8       530   523   399
0.60        58±2     58±2     51±4     51±8   51±8   47±7       118   118   93
0.70        21±1     21±1     19±1     19±2   19±2   18±2       30    30    27
0.80        2±0      2±0      2±0      2±0    2±0    2±0        2     2     2
0.90        0±0      0±0      0±0      0±0    0±0    0±0        0     0     0
1.00        0±0      0±0      0±0      0±0    0±0    0±0        0     0     0
That is, Apriori is a two-phase algorithm that exhaustively explores the whole feature space, discovers all the itemsets with a minimum predefined support, and creates all the possible rules from these itemsets. Therefore, some of the rules supplied by Apriori are included in other rules. We consider that a rule r1 is included in another rule r2 if r1 has, at least, the same variables with the same values in the rule antecedent and the rule consequent as r2 (r1 may have more variables). In the results provided herein, we removed from the final population all the rules that were included in other rules. Thus, we provide an upper bound of the number of different rules that can be generated. Two important observations can be made from these results. Firstly, the results clearly show that Apriori can create a higher number of rules than CSar (for the sake of clarity, Table 2 specifies the number of rules for support values ranging from 0.4 to 1.0 and confidence values of {0.4, 0.6, 0.8}). This behavior was expected, since CSar has a limited population size, while Apriori returns all possible association rules. Nevertheless, it is worth noting that CSar and Apriori found exactly the same number of highly interesting rules; that is, both systems discovered two rules with both confidence and support higher than 0.8. This highlights the robustness of CSar, whose mechanisms guide the system to discover the most interesting rules. Secondly, focusing on the results reported in Figure 1, we can see that the populations evolved with the antecedent-grouping strategy are larger than those built with the consequent-grouping strategy. This behavior will also be present, and discussed in more detail, in the extended experimental analysis conducted in the next subsection.

5.2
Study of the Behavior of CSar
After showing that CSar can create highly interesting association rules in a case-study problem characterized by categorical attributes, we now extend the experimentation by running the system on 16 real-world data sets. We ran the system with (1) antecedent-grouping and consequent-grouping strategies and (2)
Table 3. Average (± standard deviation of the) number of rules with support and confidence greater than 0.60 created by CSar with antecedent- and consequent-grouping strategies and with maximum interval sizes of MI = {0.10, 0.25, 0.50}. The average and standard deviation are computed on five runs with different random seeds.

       antecedent grouping                 consequent grouping
       MI=0.10    MI=0.25    MI=0.50       MI=0.10    MI=0.25    MI=0.50
adl    135±3      294±15     567±66        46±1       74±3       147±23
ann    1736±133   1765±79    1702±135      478±86     525±112    489±34
aud    2206±80    2017±147   1999±185      1014±12    982±100    880±215
aut    84±14      192±7      710±106       25±6       58±3       188±6
bpa    11±4       174±15     365±42        17±2       100±4      123±22
col    134±14     188±7      377±64        180±13     191±7      198±8
gls    33±4       160±17     694±26        23±2       89±6       205±23
H-s    28±1       61±4       248±32        13±1       29±1       92±13
irs    0±0        0±0        50±5          0±0        0±0        28±8
let    0±0        113±17     991±40        0±0        103±6      205±13
pim    4±1        93±9       570±51        3±0        53±5       154±25
tao    0±0        0±0        8±1           0±0        0±0        5±2
thy    46±2       152±4      350±27        29±2       80±3       160±2
wdbc   0±0        419±43     1143±131      0±0        145±17     304±16
wne    116±9      273±48     536±34        26±3       65±9       137±17
wpbc   0±0        0±0        740±234       0±0        0±0        264±34
allowing intervals of maximum length maxInt = {0.1, 0.25, 0.5} for continuous variables. Note that by using different grouping strategies we change the way the system creates association set candidates; therefore, as competition is held among rules within the same association set, the resulting rules can differ in the two cases. On the other hand, allowing an increasingly larger maximum interval length for continuous variables enables the system to obtain more general rules. Table 3 reports the number of rules, with confidence and support greater than or equal to 0.6, created by the different configurations of CSar. All the reported results are averages over five runs with different random seeds. Comparing the results obtained with the two grouping schemes, we can see that the antecedent-grouping strategy yielded larger populations than the consequent-grouping strategy, on average. This behavior was expected, since the antecedent grouping creates smaller association sets and, thus, maintains more diversity in the population. Nonetheless, a closer examination of the final populations indicates that the difference in the final number of rules decreases if we only consider the rules with the highest confidence and support. For example, considering all the rules with confidence and support greater than or equal to 0.60, the antecedent-grouping strategy results in populations 2.16 times bigger than those of the consequent-grouping strategy. However, considering only the rules with confidence and support greater than or equal to 0.85, the average difference in population size is reduced to a factor of 1.12. This indicates that a large proportion of the most interesting rules are discovered by both strategies. It is worth
highlighting, therefore, that the lower number of rules evolved by the consequent-grouping strategy can be considered an advantage, since the strategy avoids creating and maintaining uninteresting rules in the population, which implies a lower computational time to evolve the population. Focusing on the impact of varying the interval length, the results indicate that for lower maximum interval lengths CSar tends to evolve rules with less support. This behavior can easily be explained as follows. Large maximum interval lengths enable the existence of highly general rules, which will have higher support. Moreover, if both antecedent and consequent variables are maximally general, rules will also have high confidence. Taking this idea to the extreme, rules that contain variables whose intervals range from the minimum value to the maximum value of the variable will have maximum confidence and support. Nonetheless, these rules will be uninteresting to human experts. On the other hand, small interval lengths may result in more interesting association rules, though too small lengths may result in rules that denote strong associations but have little support. This highlights a tradeoff in the setting of this parameter, which should be adjusted for each particular problem. As a rule of thumb, similarly to what can be done with other association rule miners, the practitioner may start with small interval lengths and increase them when the rules obtained do not have enough support for the particular domain at hand.
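The interval-length tradeoff can be illustrated with a small computation (the names, and the definition of support used here — the fraction of records matching both antecedent and consequent — are our assumptions for illustration, not CSar code):

```c
#include <assert.h>

/* Support and confidence of a quantitative rule "x in [xl,xu] => y in [yl,yu]"
   over m records: widening the antecedent interval can only raise support. */
static int in_interval(double v, double lo, double hi)
{
    return v >= lo && v <= hi;
}

static void rule_metrics(const double *x, const double *y, int m,
                         double xl, double xu, double yl, double yu,
                         double *support, double *confidence)
{
    int ante = 0, both = 0;
    for (int k = 0; k < m; k++) {
        if (in_interval(x[k], xl, xu)) {
            ante++;
            if (in_interval(y[k], yl, yu)) both++;
        }
    }
    *support = (double)both / m;
    *confidence = ante ? (double)both / ante : 0.0;
}
```

On four records, widening the antecedent interval from [0, 0.25] to [0, 0.35] raises the support of the example rule from 0.5 to 0.75 while confidence stays at 1.0.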
6
Summary, Conclusion, and Further Work
In this paper, we presented CSar, a Michigan-style LCS designed to evolve quantitative association rules. The experiments conducted in this paper have shown that the method holds promise for online extraction of both categorical and quantitative association rules. Results with the zoo problem indicated that CSar was able to create interesting categorical rules, which were similar to those built by Apriori. Experiments with a collection of real-world problems also pointed out the capabilities of CSar to extract quantitative association rules and served to analyze the behavior of different configurations of the system. These results encourage us to study the system further with the aim of applying CSar to mine quantitative association rules from new challenging real-world problems. Several future work lines can be followed in light of the present work. Firstly, we aim at comparing CSar with other quantitative association rule miners to see if the online architecture can extract knowledge similar to that obtained by other approaches that go several times through the learning data set. Actually, the online architecture of CSar makes the system suitable for mining association rules from changing environments with concept drift [1], and we think that the existence of concept drift may be a common trait in many real-world problems to which association rules have historically been applied, such as profile mining from customer information. Therefore, it would be interesting to analyze how CSar adapts to domains in which variable associations change over time.
Acknowledgements

The authors thank the support of the Ministerio de Ciencia y Tecnología under projects TIN2008-06681-C06-01 and TIN2008-06681-C06-05, the Generalitat de Catalunya under grant 2005SGR-00302, and the Andalusian Government under grant P07-TIC-3185.
References

1. Aggarwal, C. (ed.): Data Streams: Models and Algorithms. Springer, Heidelberg (2007)
2. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington D.C., pp. 207–216 (May 1993)
3. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Bocca, J.B., Jarke, M., Zaniolo, C. (eds.) Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, pp. 487–499 (September 1994)
4. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, University of California (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
5. Bacardit, J., Krasnogor, N.: Fast rule representation for continuous attributes in genetics-based machine learning. In: GECCO 2008: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, pp. 1421–1422. ACM, New York (2008)
6. Bernadó-Mansilla, E., Garrell, J.M.: Accuracy-based learning classifier systems: Models, analysis and applications to classification tasks. Evolutionary Computation 11(3), 209–238 (2003)
7. Bernadó-Mansilla, E., Llorà, X., Garrell, J.M.: XCS and GALE: A comparative study of two learning classifier systems on data mining. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 115–132. Springer, Heidelberg (2002)
8. Cai, C.H., Fu, A.W.-C., Cheng, C.H., Kwong, W.W.: Mining association rules with weighted items. In: International Database Engineering and Application Symposium, pp. 68–77 (1998)
9. Divina, F.: Hybrid Genetic Relational Search for Inductive Learning. PhD thesis, Department of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands (2004)
10. Fukuda, T., Morimoto, Y., Morishita, S., Tokuyama, T.: Mining optimized association rules for numeric attributes. In: PODS 1996: Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 182–191. ACM, New York (1996)
11. Goldberg, D.E.: Genetic Algorithms in Search, Optimization & Machine Learning, 1st edn. Addison-Wesley, Reading (1989)
12. Holland, J.H.: Adaptation in Natural and Artificial Systems. The University of Michigan Press (1975)
13. Hong, T.P., Kuo, C.S., Chi, S.C.: Trade-off between computation time and number of rules for fuzzy mining from quantitative data. International Journal of Uncertainty, Fuzziness, and Knowledge-Based Systems 9(5), 587–604 (2001)
14. Houtsma, M., Swami, A.: Set-oriented mining of association rules. Technical Report RJ 9567, Almaden Research Center, San Jose, California (October 1993)
15. Kaya, M., Alhajj, R.: Genetic algorithm based framework for mining fuzzy association rules. Fuzzy Sets and Systems 152(3), 587–601 (2005)
16. Lent, B., Swami, A.N., Widom, J.: Clustering association rules. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 220–231 (1997)
17. Mata, J., Alvarez, J.L., Riquelme, J.C.: An evolutionary algorithm to discover numeric association rules. In: SAC 2002: Proceedings of the 2002 ACM Symposium on Applied Computing, pp. 590–594. ACM, New York (2002)
18. Miller, R.J., Yang, Y.: Association rules over interval data. In: SIGMOD 1997: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pp. 452–461. ACM, New York (1997)
19. Núñez, M., Fidalgo, R., Morales, R.: Learning in environments with unknown dynamics: Towards more robust concept learners. Journal of Machine Learning Research 8, 2595–2628 (2007)
20. Salleb-Aouissi, A., Vrain, C., Nortet, C.: QuantMiner: A genetic algorithm for mining quantitative association rules. In: Veloso, M.M. (ed.) Proceedings of the 2007 International Joint Conference on Artificial Intelligence, pp. 1035–1040 (2007)
21. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining association rules in large databases. In: Proceedings of the 21st VLDB Conference, Zurich, Switzerland, pp. 432–443 (1995)
22. Srikant, R., Agrawal, R.: Mining quantitative association rules in large relational tables. In: Jagadish, H.V., Mumick, I.S. (eds.) Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, pp. 1–12 (1996)
23. Wang, C.-Y., Tseng, S.-S., Hong, T.-P., Chu, Y.-S.: Online generation of association rules under multidimensional consideration based on negative border. Journal of Information Science and Engineering 23, 233–242 (2007)
24. Wang, K., Tay, S.H.W., Liu, B.: Interestingness-based interval merger for numeric association rules. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, KDD, pp. 121–128. AAAI Press, Menlo Park (1998)
25. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995)
26. Wilson, S.W.: Generalization in the XCS classifier system. In: 3rd Annual Conf. on Genetic Programming, pp. 665–674. Morgan Kaufmann, San Francisco (1998)
27. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219. Springer, Heidelberg (2000)
Coevolution of Pattern Generators and Recognizers

Stewart W. Wilson

Prediction Dynamics, Concord MA 01742 USA
Department of Industrial and Enterprise Systems Engineering
The University of Illinois at Urbana-Champaign IL 61801 USA
[email protected]

Abstract. Proposed is an automatic system for creating pattern generators and recognizers that may provide new and human-independent insight into the pattern recognition problem. The system is based on a three-cornered coevolution of image-transformation programs.
1
Introduction
Pattern recognition is a very difficult problem for computer science. A major reason is that in many cases pattern classes are not well-specified, frustrating the design of algorithms (including learning algorithms) to identify or discriminate them. Intrinsic specification (via formal definition) is often impractical—consider the class consisting of hand-written letters A. Extrinsic specification (via finite sets of examples) has problems of generalization and over-fitting. Many interesting pattern classes are hard to specify because they exist only in relation to human or animal brains. Humans employ mental processes such as scaling, point-of-view adjustment, contrast and texture interpretation, saccades, etc., permitting classes to be characterized very subtly. It is likely that truly powerful computer pattern recognition methods will need to employ all such techniques, which is not generally the case today. In this paper we are concerned mainly with human-related pattern classes. A further challenge for pattern recognition research is to create problems with large sets of examples that can be learned from. An automatic pattern generator would be valuable, but it should be capable of producing examples of each class that are diverse and subtle as well as numerous. This paper proposes an automatic pattern generation and recognition process, and speculates that it would shed light on both the formal characterization problem and recognition techniques. The process would permit unlimited generation of examples and very great flexibility of methods, by relying on competitive and cooperative coevolution of pattern generators and recognizers. The paper is organized into a first part in which the pattern recognition problem is discussed in greater detail; a second part in which the competitive and cooperative method is explained in concept; and a third part containing suggestions for a specific implementation.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 38–46, 2010.
© Springer-Verlag Berlin Heidelberg 2010
2
Pattern Recognition Problem
The following is a viewpoint on the pattern recognition problem and what makes it difficult. Let us first see some examples of what are generally regarded as patterns.

Characters, such as letters and numerals. Members of a class can differ in numerous ways, including placement in the field of view, size, orientation, shape, thickness, contrast, constituent texture, distortion including angle of view, noise of construction, and masking noise, among others.

Patterns in time series, such as musical phrases, price data configurations, and event sequences. Members of a class can differ in time-scale, shape, intensity, texture, etc.

Natural patterns, such as trees, landscapes, terrestrial features, and cloud patterns. Members of a class can differ in size, shape, contrast, color, texture, etc.

Circumstantial patterns, such as situations, moods, plots. Members of a class can differ along a host of dimensions themselves often hard to define.

This sampling illustrates the very high diversity within even ordinary pattern classes and suggests that identifying a class member while differentiating it from members of other classes should be very difficult indeed. Yet human beings learn to do it, and apparently quite easily. While that of course has been pointed out before, we note two processes which may play key roles: transformation and context. Transformative processes would include, among others, centering an object of interest in the field of view via saccades, i.e., translation, and scaling it to a size appropriate for further steps. Contextual processes would include adjusting the effective brightness (of a visual object) relative to its background, and seeing a textured object as in fact a single object on a differently textured background. It is clear that contextual processes are also transformations, and that viewpoint will be taken here.

A transformational approach to pattern recognition would imply a sequence in which the raw stimulus is successively transformed to a form that permits it to be matched against standard or iconic exemplars, or produces a signal that is associated with a class. Human pattern recognition is generally rapid and its steps are not usually conscious, except in difficult cases or in initial learning. However, people when asked for reasons for a particular recognition will often cite transformational steps like those above that allow the object to be interpreted to some standard form. For this admittedly informal reason, transformations are emphasized in the algorithms proposed here. It is possible to provide a more formal framework. Pattern recognition can be viewed as a process in which examples are mapped to classes. But the mappings are complicated. They are unlike typical functions that map vectors of elements into, e.g., reals. In such a function, each element has a definite position in the
vector (its index). Each position can be thought of as a place, and there is a value there. An ordinary function is thus a mapping of “values in places” into an outcome. Call it a place/value (PV) mapping. If you slide the values along the places—or expand them from a point—the outcome is generally completely different. The function depends on just which values are in which places. Patterns, on the other hand, are relative place/relative value (RPRV) mappings. Often, a given instance can be transformed into another instance, but with the same outcome, by a transformation that maintains the relative places or values of the elements—for example, such transformations as scaling, translation, rotation, contrast, even texture. The RPRV property, however, makes pattern recognition very difficult for machine learning methods that attach absolute significance to input element positions and values. There is considerable work on relative-value, or relational, learning systems, e.g., in classifier systems [5,4], and in reinforcement learning generally [1]. But for human-related pattern classes, what seems to be required is a method that is intrinsically able to deal with both relative value and relative place. This suggests that the method must be capable of transformations, both of its input and in subsequent stages. The remainder of the paper lays out one proposal for achieving this.
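The place/value versus relative place/relative value distinction can be shown with a toy contrast (our illustration, not from the paper): a place/value feature reads the value at a fixed index, so sliding the pattern along the places changes it, whereas a feature built from adjacent differences survives the shift.

```c
#include <assert.h>

/* Place/value feature: the value at a fixed position. */
static double pv_feature(const double *x)
{
    return x[1];
}

/* Relative feature: the difference between adjacent values, anchored at a
   position i that may move with the pattern. */
static double rprv_feature(const double *x, int i)
{
    return x[i + 1] - x[i];
}
```

For the pattern (0, 5, 2, 0) and its left-shifted copy (5, 2, 0, 0), the place/value feature differs (5 versus 2), while the adjacent-difference feature, read at the shifted anchor, is unchanged (-3 in both cases).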
3
Let the Computer Do It
Traditionally, pattern recognition research involves choosing a domain, creating a source of exemplars, and trying learning algorithms that seem likely to work in that domain. Here, however, we are looking broadly at human-related pattern recognition, or relative place/relative value mappings (Sec. 2). Such a large task calls for an extensive source of pattern examples. It also calls for experimentation with a very wide array of transformation operators. Normally, for practicality, one would narrow the domain and the choice of operators. Instead, we want to leave both as wide as possible, in hopes of achieving significant generality. While it changes the problem somewhat, there fortunately appears to be a way of doing this by allowing the computer itself to pose and solve the problem. Imagine a kind of communication game (Figure 1). A sender, or source, S, wants to send messages to a friend F. The messages are in English, and the letters are represented in binary by ASCII bytes. As long as F can decode bytes to ASCII (and knows English), F will understand S's messages. But there is also an enemy E that sees the messages and is not supposed to understand them. S and F decide to encrypt the messages. But instead of encrypting prior to conversion to bits, or encrypting the resulting bit pattern, they decide to encrypt each bit. That is, E's problem is to tell which bits are 1s and which 0s. If E can do that, the messages will be understandable. Note that F also must decrypt the bits. For this peculiar setup, S and F agree that when S intends to send a 0, S will send a variant of the letter A; for a 1, S will send a variant of B. S will produce these variants using a generation program. Each variant of A created will in general be different; similarly for B. F will know that 0 and 1 are represented
Coevolution of Pattern Generators and Recognizers
S.W. Wilson

Fig. 1. S sends messages to F that are sniffed by E
by variants of A and B, respectively, and will use a recognition program to tell which is which. E, also using a recognition program, knows only that the messages are in a binary code but knows nothing about how 0s and 1s are represented. In this setup, S's objective is to send variants of As and Bs that F will recognize but E will not. The objectives of both F and E are to recognize the letters; for this, F has some prior information that E does not have. All the agents will require programs: S for generation and F and E for recognition. The programs will be evolved using evolutionary computation. Each agent will maintain its own population of candidate programs. The overall system will carry out a coevolution [2] in which each agent attempts to evolve the best program consistent with its objectives. Evolution requires a fitness measure, which we need to specify for each of the agents. For each bit transmitted by S, F either recognizes it or does not, and E either recognizes it or does not. S's aim is for F to recognize correctly but not E; call this a success for S. A simple fitness measure for an S program would be the number of its successes divided by a predetermined number of transmissions, T, assuming that S sends 0s and 1s with equal probability. A success for F, as well as for E, would be a correct recognition. A simple fitness measure for their programs would be the number of correct recognitions, again divided by T transmissions. S's population would consist of individuals, each of which is a generation program. To send a bit, S picks an individual, randomly¹ decides whether to send a 0 or a 1, then, as noted above, generates a variant of A for 0 or of B for 1, the variant differing each time the program is called. The system determines whether the transmission was a success (for S). After a total of T transmissions using a given S individual, its fitness is updated.
F and E each have populations of individual recognition programs. Like S, after T recognition attempts using a population individual, its fitness is updated based on its number of successes. The testing of individuals could be arranged so that, for each transmission, individuals from the S, F, and E populations are selected at random. Or an individual from S could be used for T successive transmissions, with F and E individuals still randomly picked on each transmission. Various testing schemes are possible. Selection, reproduction, and genetic operations would occur in a population at intervals long enough that the average individual is adequately evaluated. Will the coevolution work? It seems there should be pressure for improvement in each of the populations. Some initial programs in S should be better than others; similarly for F and E. The three participants should improve, but to an unknown extent. It could be that all three success rates end up not much above 50%. The best result would be 100% for S and F and 0% for E. But that is unlikely, since some degree of success by E would be necessary to push S and F toward higher performance.

¹ For our purposes, the bits need not encode natural language.
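The evaluation scheme just described can be written out as a hypothetical Python loop; `generate` and `recognize` are stand-ins for the evolved generation and recognition programs, and the arrangement shown is the one in which one S individual is used for T transmissions while F and E individuals are drawn at random each time.

```python
import random

# Hedged sketch (not from the paper) of one fitness-evaluation round of
# the S/F/E game. generate(s, bit) -> pattern variant of A (0) or B (1);
# recognize(prog, pattern) -> 0 or 1.

def evaluate_sender(s, f_pop, e_pop, generate, recognize, T=100):
    s_succ = 0
    for _ in range(T):
        bit = random.randint(0, 1)       # S sends 0 or 1 with equal probability
        variant = generate(s, bit)       # variant of A (for 0) or B (for 1)
        f = random.choice(f_pop)         # F and E individuals picked at random
        e = random.choice(e_pop)
        f_ok = recognize(f, variant) == bit
        e_ok = recognize(e, variant) == bit
        if f_ok and not e_ok:            # success for S: F recognizes, E does not
            s_succ += 1
    return s_succ / T                    # S's fitness estimate
```

Analogous loops would update the fitness of F and E programs, counting their own correct recognitions over T attempts.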
4
Some Implementation Suggestions
Having described a communications game in which patterns are generated and recognized, and a scheme for coevolving the corresponding programs, it remains to suggest the form of these programs. For concreteness, we consider generation and recognition of two-dimensional, gray-scale visual patterns and take the transformational viewpoint of Sec. 2. The programs would be compounds of operators that take an input image and transform it into an output image. The input of one of S's generating programs would be an image of an archetypical A or B, and its output would be, via transforms, a variant of the input. A recognition program would take such a variant as input and, via transforms, output a further variant. F would match its program's output against the same archetypes of A and B, picking the better match, and deciding 0 or 1 accordingly. E would simply compute the average gray level of its program's output image and compare that to a threshold to decide between 0 and 1. For a typical transformation, we imagine in effect a function that takes an image—an array of real numbers—as input and produces an image as output. The value at a point x, y of the output may depend on the value at a point (not necessarily the same point) of the input, or on the values of a collection of input points. As a simple example, in a translation transformation, the value at each output point would equal the value at an input point that is displaced linearly from the output point. In general, we would like the value at an output point potentially to be a rather complicated function of the points of the input image. Sims [6], partly with an artistic or visual design purpose, evolved images using fitnesses based on human judgements. In his system, a candidate image was generated by a Lisp-like tree of elementary functions taking as inputs x, y, and outputs of other elementary functions.
The elementary functions included standard Lisp functions as well as various image-processing operators such as blurs, convolutions, or gradients that use neighboring pixel values to calculate their outputs. Noise generating functions were also included. The inputs to the function tree were simply the coordinates x and y, so that the tree in effect performed a transformation of the “blank” x-y plane to yield the
output image. The results of evolving such trees of functions could be surprising and beautiful. Sims' article gives a number of examples of the images, including one (Figure 2) having the following symbolic expression: (round (log (+ y (color-grad (round (+ (abs (round (log (+ y (color-grad (round (+ y (log (invert y) 15.5)) x) 3.1 1.86 #(0.95 0.7 0.59) 1.35)) 0.19) x)) (log (invert y) 15.5)) x) 3.1 1.9 #(0.95 0.7 0.35) 1.35)) 0.19) x).
Fig. 2. Evolved image from Sims [6]. Gray-scale rendering of color original. © 1991 Association for Computing Machinery, Inc. Reprinted with permission.
Such an image-generating program is a good starting point for us, except for two missing properties. First, the program does not transform an input image; its only inputs are x and y. Second, the program is deterministic: it is not able to produce different outputs for the same image input, a property required in order to produce image variants. To transform an image, the program needs to take as input not only x and y, but also the input image values. A convenient way to do this appears to be to add the image to the function set. That is, add Im(x, y) to the function set, where Im is a function that maps image points to image values of the current input. For example, consider the expression (* k (Im (- x x0) (- y y0))). The effect is to produce an output that translates the input by x0 and y0 in the x and y directions and alters its contrast by the factor k. It seems fairly clear that adding the current input image, as a kind of function, to the function set (it could apply at any stage) is quite general and would permit a great variety of image transformations.
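As an illustration, the expression (* k (Im (- x x0) (- y y0))) can be mimicked directly in Python. This is an assumption-laden sketch, not code from the paper: `make_transform` is an invented helper, and reading out-of-range points as 0 is our own convention.

```python
# Sketch of a function tree that includes the current input image as a
# function Im(x, y), following (* k (Im (- x x0) (- y y0))): translate
# the input by (x0, y0) and scale its contrast by k.

def make_transform(im, k=0.5, x0=1, y0=0):
    h, w = len(im), len(im[0])
    def Im(x, y):
        # Points outside the image read as 0 (an assumption).
        return im[y][x] if 0 <= x < w and 0 <= y < h else 0.0
    def out(x, y):
        return k * Im(x - x0, y - y0)
    return out

img = [[0.0, 1.0],
       [0.5, 0.25]]
t = make_transform(img, k=2.0, x0=1, y0=0)
print(t(1, 0))   # reads img[0][0] scaled by 2 -> 0.0
print(t(1, 1))   # reads img[1][0] scaled by 2 -> 1.0
```

The same pattern extends to arbitrary compositions: any node in the function tree may call Im, so the transformation can mix input values from several displaced points.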
To allow different transformations from the same program is not difficult. One approach is to include a "switch" function, Sw, in the function set. Sw would have two inputs and would pass one or the other of them to its output depending on the setting of a random variable at evaluation time (i.e., set when a new image is to be processed and not reset until the next image). The random variable would be a component of a vector of random binary variables, one variable for each specific instance of Sw in the program. Then, at evaluation time, the random vector would be re-sampled and the resulting component values would define a specific path through the program tree. The number of distinct paths is 2 raised to the number of instances of Sw, and equals the number of distinct input image variants that the program can create. If that number turns out to be too small, other techniques for creating variation will be required. The transformation programs just described would be directly usable by S to generate variants of A and B starting with archetypes of each. F and E would also use such programs, but not alone. Recognition, in the present approach, reverses generation: it takes a received image and attempts to transform it back into an archetype. Since it does not know the identity of the received image, how does the recognizer know which transformations to apply? We suggest that a recognition program be a kind of "Pittsburgh" classifier system [7] in which each classifier has a condition part intended to be matched against the input, and an action part that is a transformation program of the kind used by S (but without Sw). In the simplest case, the classifier condition would be an image-like array of reals to be matched against the input image; the best-matching classifier's transformation program would then be applied to the image. The resulting output would then be matched (by F) against archetypes A and B and the better-matching character selected.
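The switch-function mechanism described above might be sketched as follows; `SwitchContext`, its method names, and the per-image re-sampling hook are all hypothetical details, not part of the paper's specification.

```python
import random

# Sketch of the "switch" function Sw: each Sw instance in a program
# tree is bound to one component of a random binary vector that is
# re-sampled once per input image, selecting one of 2^n paths.

class SwitchContext:
    def __init__(self, n_switches):
        self.n = n_switches
        self.bits = [0] * n_switches

    def resample(self):
        # Called once when a new image is to be processed; the vector
        # is not reset again until the next image.
        self.bits = [random.randint(0, 1) for _ in range(self.n)]

    def sw(self, index, a, b):
        # Sw passes one of its two inputs through, per its bound bit.
        return a if self.bits[index] == 0 else b

ctx = SwitchContext(n_switches=3)
ctx.resample()
# A program containing 3 Sw instances can realize 2**3 = 8 distinct
# paths, i.e., 8 distinct variants of the same input image.
variant_id = sum(bit << i for i, bit in enumerate(ctx.bits))
assert 0 <= variant_id < 2 ** ctx.n
```

In an actual program tree, each `sw` call site would be given a fixed `index`, so that one re-sampled vector deterministically selects the whole path for that evaluation.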
E, as noted earlier, would compare the average of the output image with a threshold. It might be desirable for recognition to take more than one match-transform step; steps could be chained up to a certain number, or until a sufficiently sharp A/B decision (or difference from threshold) occurred.²
5
Discussion and Conclusion
A coevolutionary framework has been proposed that, if it works, may create interesting pattern generators and recognizers. We must ask: is it relevant to the kinds of natural patterns noted in Section 2? Natural patterns are not ones created by generators to communicate with friends without informing enemies³. Instead, natural patterns seem to be clusters of variants that become as large as possible without confusing their natural recipients, and no intruder is involved. Perhaps that framework, which also may suggest a coevolution, ought to be explored. But the present framework should give insights, too. A basic hypothesis here is that recognition is a process of transforming a pattern into a standard or archetypical instance. Success by the present scheme—since it uses transformations—would tend to support that hypothesis. More important, the kinds of operators that are useful will be revealed (though extracting such information from symbolic expressions can be a chore). For instance, will the system evolve operators similar to human saccades, and will it size-normalize centered objects? It would also be interesting to observe what kinds of matching templates evolve in the condition parts of the recognizer classifiers. For instance, are large-area, relatively crude templates relied upon to get a rough idea of which transforms to apply? If so, it would be in contrast to recognition approaches that proceed from the bottom up—e.g., finding edges—instead of from the top down. Such autonomously created processes would seem of great interest to more standard studies of pattern recognition. The reason is that standard studies involve choices of method that are largely arbitrary, and if they work there is still a question of generality. In contrast, information gained from a relatively unconstrained evolutionary approach might, by virtue of its human-independence, have greater credibility and extensibility. It is unclear how well the present framework will work—for instance, whether F's excess of a priori information over E's will be enough to drive the coevolution. It is also unclear, even if it works, whether the results will have wider relevance. But the proposal is offered in the hope that its difference from traditional approaches will inspire new experiments and thinking about a central problem in computer science.

² Recognition will probably require a chain of steps, as the system changes its center of attention or other viewpoint. State memory from previous steps will likely be needed, which favors use of a Pittsburgh over a "Michigan" [3,8] classifier system, since the former is presently more adept at internal state.
³ There may be special cases!
References
1. Džeroski, S., de Raedt, L., Driessens, K.: Relational reinforcement learning. Machine Learning 43, 7–52 (2001)
2. Hillis, W.D.: Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D 42, 228–234 (1990)
3. Holland, J.H.: Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems. In: Mitchell, Michalski, Carbonell (eds.) Machine Learning, an Artificial Intelligence Approach, vol. II, ch. 20, pp. 593–623. Morgan Kaufmann, San Francisco (1986)
4. Mellor, D.: A first order logic classifier system. In: Beyer, H.-G., O'Reilly, U.-M., Arnold, D.V., Banzhaf, W., Blum, C., Bonabeau, E.W., Cantu-Paz, E., Dasgupta, D., Deb, K., Foster, J.A., de Jong, E.D., Lipson, H., Llora, X., Mancoridis, S., Pelikan, M., Raidl, G.R., Soule, T., Tyrrell, A.M., Watson, J.-P., Zitzler, E. (eds.) GECCO 2005: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, Washington DC, USA, June 25-29, vol. 2, pp. 1819–1826. ACM Press, New York (2005)
5. Shu, L., Schaeffer, J.: VCS: Variable Classifier System. In: Schaffer, J.D. (ed.) Proceedings of the 3rd International Conference on Genetic Algorithms (ICGA 1989), George Mason University, pp. 334–339. Morgan Kaufmann, San Francisco (June 1989), http://www.cs.ualberta.ca/~jonathan/Papers/Papers/vcs.ps
6. Sims, K.: Artificial evolution for computer graphics. Computer Graphics 25(4), 319–328 (1991), http://doi.acm.org/10.1145/122718.122752, also http://www.karlsims.com/papers/siggraph91.html
7. Smith, S.F.: A Learning System Based on Genetic Adaptive Algorithms. PhD thesis, University of Pittsburgh (1980)
8. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evolutionary Computation 3(2), 149–175 (1995)
How Fitness Estimates Interact with Reproduction Rates: Towards Variable Offspring Set Sizes in XCSF

Patrick O. Stalph and Martin V. Butz

Department of Cognitive Psychology III, University of Würzburg
Röntgenring 11, 97080 Würzburg, Germany
{patrick.stalph,butz}@psychologie.uni-wuerzburg.de
http://www.coboslab.psychologie.uni-wuerzburg.de
Abstract. Despite many successful applications of the XCS classifier system, a rather crucial aspect of XCS' learning mechanism has hardly ever been modified: exactly two classifiers are reproduced when XCSF's iterative evolutionary algorithm is applied in a sampled problem niche. In this paper, we investigate the effect of modifying the number of reproduced classifiers. In the investigated problems, increasing the number of reproduced classifiers increases the initial learning speed. In less challenging approximation problems, the final approximation accuracy is also unaffected. In harder problems, however, learning may stall, yielding worse final accuracies. In this case, over-reproduction of inaccurate, ill-estimated, over-general classifiers occurs. Since the quality of the fitness signal decreases if there is less time for evaluation, a higher reproduction rate can deteriorate the fitness signal, thus—dependent on the difficulty of the approximation problem—preventing further learning improvements. In order to speed up learning where possible while still assuring learning success, we propose an adaptive offspring set size that may depend on the current reliability of classifier parameter estimates. Initial experiments with a simple offspring set size adaptation show promising results.

Keywords: LCS, XCS, Reproduction, Selection Pressure.
1
Introduction
Learning classifier systems were introduced over thirty years ago [1] as cognitive systems. Over all these years, it has been clear that there is a strong interaction between parameter estimations—be it by traditional bucket brigade techniques [2], the Widrow-Hoff rule [3,4], or by recursive least squares and related linear approximation techniques [5,6]—and the genetic algorithm, in which the successful identification and propagation of better classifiers depends on the accuracy of these estimates. Various control parameters have been used to balance genetic reproduction with the reliability of the parameter estimation, but to the best of our knowledge, there is no study that addresses the estimation problem explicitly.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 47–56, 2010.
© Springer-Verlag Berlin Heidelberg 2010
In the XCS classifier system [4], reproduction takes place by means of a steady-state, niched GA. Reproductions are activated in current action sets (or match sets in function approximation problems, as well as in the original XCS paper). Upon reproduction, two offspring classifiers are generated, which are mutated and recombined with certain probabilities. Reproduction is balanced by the θGA threshold. It specifies that GA reproduction is activated only if the average time of the last GA activation in the set lies longer in the past than θGA. It has been shown that this threshold can delay learning, but it also prevents the neglect of rarely sampled problem niches in the case of unbalanced data sets [7]. Nonetheless, the reproduction of exactly two classifiers seems rather arbitrary—except for the fact that two offspring classifiers are needed for simple recombination mechanisms. Unless the learning classifier system has a hard time learning the problem, the reproduction of more than two classifiers could speed up learning. Thus, this study investigates the effect of modifying the number of offspring classifiers generated upon GA invocation. We further focus our study on the real-valued domain and thus on the XCSF system [8,9]. Besides, we use the rotating hyperellipsoidal representation for the evolving classifier condition structures [10]. This paper is structured as follows. Since we assume general knowledge of XCS¹, we immediately start by investigating the performance of XCSF on various test problems and with various offspring set sizes. Next, we discuss the results and provide some theoretical considerations. Finally, we propose a road-map for further studying the observed effects and for adapting the offspring set sizes according to the perceived problem difficulty and learning progress, as well as to the estimated reliability of available classifier estimates.
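The θGA timing check described above can be sketched roughly as follows. Field names are illustrative; the numerosity-weighted averaging of time stamps follows common XCS implementations, but treat the exact form as an assumption.

```python
# Sketch of the theta_GA timing check used by XCS(F)'s niched GA.
# Each classifier stores ts, the time of the last GA activation in a
# set containing it, and num, its numerosity.

def ga_should_run(match_set, t_now, theta_ga=50):
    # Numerosity-weighted average of the last GA activation times.
    num = sum(cl["num"] for cl in match_set)
    avg_ts = sum(cl["ts"] * cl["num"] for cl in match_set) / num
    # The GA fires only if that average lies further in the past
    # than theta_ga time steps.
    return t_now - avg_ts > theta_ga

mset = [{"ts": 100, "num": 2}, {"ts": 160, "num": 1}]
# avg_ts = (100*2 + 160*1) / 3 = 120
print(ga_should_run(mset, t_now=180))  # 60 > 50 -> True
print(ga_should_run(mset, t_now=150))  # 30 > 50 -> False
```

Raising `theta_ga` thus throttles reproduction in frequently sampled niches, which is the balancing effect referenced for unbalanced data sets [7].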
2
Increased Offspring Set Sizes
To study the effects of increased offspring set sizes, we chose four challenging functions defined on [0, 1]², each with rather distinct regularities:

f1(x1, x2) = sin(4π(x1 + x2))    (1)
f2(x1, x2) = exp(−8 Σi (xi − 0.5)²) cos(8π Σi (xi − 0.5)²)    (2)
f3(x1, x2) = max{ exp(−10(2x1 − 1)²), exp(−50(2x2 − 1)²), 1.25 exp(−5((2x1 − 1)² + (2x2 − 1)²)) }    (3)
f4(x1, x2) = sin(4π(x1 + sin(πx2)))    (4)
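For reference, the four test functions can be written directly in Python. Note that the exact grouping in f2 is our reconstruction of Eq. (2) as a radially symmetric damped cosine and should be treated as an assumption.

```python
import math

# The four test functions of Eqs. (1)-(4), defined on [0, 1]^2.

def f1(x1, x2):
    return math.sin(4 * math.pi * (x1 + x2))

def f2(x1, x2):
    # Reconstructed reading of Eq. (2): both exp and cos act on the
    # squared radial distance from the center (0.5, 0.5).
    r2 = (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2
    return math.exp(-8 * r2) * math.cos(8 * math.pi * r2)

def f3(x1, x2):
    u, v = 2 * x1 - 1, 2 * x2 - 1
    return max(math.exp(-10 * u * u),
               math.exp(-50 * v * v),
               1.25 * math.exp(-5 * (u * u + v * v)))

def f4(x1, x2):
    return math.sin(4 * math.pi * (x1 + math.sin(math.pi * x2)))
```

At the domain center, f2 and f3 attain their peaks (1 and 1.25, respectively), which matches the radial and crossed-ridge shapes visible in Fig. 1.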
Function f1 has been used in various studies [10] and has a diagonal regularity. It requires the evolution of stretched hyperellipsoids that are rotated by 45°. Function f2 is a radial sine function that requires a somewhat circular distribution of

¹ For details about XCS refer to [4,11].
Fig. 1. Final function approximations, including contour lines, are shown on the left-hand side for (a) the sine function, (b) the radial sine function, (c) the crossed ridge function, and (d) the sine-in-sine function. The corresponding population distributions after compaction are shown on the right-hand side. For visualization purposes, the conditions are drawn 80% smaller than their actual size.
Fig. 2. (a) Sine function; (b) radial sine function. Different selection strengths with fixed (left-hand side) or match-set-size-relative (right-hand side) offspring set sizes can speed up learning significantly but potentially increase the final error level reached. The vertical axis is log-scaled. Error bars represent one standard deviation, and the thin dashed line shows the target error ε0 = 0.01.
classifiers. Function f3 is a crossed ridge function, for which it has been shown that XCSF performs competitively in comparison with deterministic machine learning techniques [10]. Finally, function f4 twists two sine functions so that it becomes very hard for the evolutionary algorithm to receive enough signal from the parameter estimates in order to structure the problem space more effectively for an accurate function approximation. Figure 1 shows the approximation surfaces and spatial partitions generated by XCSF with a population size of N = 6400 and with compaction [10] activated after 90k learning iterations.² The graphs on the left-hand side show the actual function predictions and qualitatively confirm that XCSF is able to learn accurate approximations for all four functions. On the right-hand side, the corresponding condition structures of the final populations are shown. In XCS and

² Other parameters were set to the following values: β = .1, η = .5, α = 1, ε0 = .01, ν = 5, θGA = 50, χ = 1.0, μ = .05, r0 = 1, θdel = 20, δ = 0.1, θsub = 20. All results in this paper are averaged over 20 experiments.
Fig. 3. (a) Crossed ridge function; (b) sine-in-sine function. While in the crossed ridge function larger offspring set sizes mainly speed up learning, in the challenging sine-in-sine function larger offspring set sizes can strongly affect the final error level reached.
XCSF, two classifiers are selected for reproduction, crossover, and mutation. We now investigate the influence of modified reproduction sizes. Performance of the standard setting, where two classifiers are selected for reproduction (with replacement), is compared with four other reproduction size choices. In the first experiment, the offspring set size was set to four and eight classifiers, respectively. Thus, four (eight) classifiers are reproduced upon GA invocation, and crossover is applied twice (four times) before the mutation operator is applied. In a second, more aggressive setting, the offspring set size is set relative to the current match set size, namely to 10% and 50% of the match set size. Especially the last setting was expected to reveal that excessive reproduction can deteriorate learning. Learning progress is shown in Figure 2 for functions f1 and f2. It can be seen that in both cases standard XCSF with two offspring classifiers learns significantly slower than settings with a larger number of offspring classifiers. The number of distinct classifiers in the population (so-called macro classifiers), on the other hand, shows that larger offspring set sizes initially increase the population size much faster. Thus, an initially higher diversity due to larger offspring sets yields faster initial learning progress. However, towards the end of the run,
standard XCSF actually reaches a slightly lower error than the settings with larger offspring sets. The larger the offspring set, the more pronounced this effect. In the radial sine function, this effect is not as strong as in the sine function. Similar observations can also be made in the crossed ridge function, which is shown in Figure 3(a). In the sine-in-sine function f4 (Figure 3(b)), larger offspring set sizes degrade performance most severely. While a selection of four offspring classifiers, as well as a selection of 10% of the match set size, still shows slight error decreases, larger offspring set sizes completely stall learning—despite large and diverse populations. It appears that the larger offspring set sizes prevent the population from identifying relevant structures and thus prevent the development of accurate function approximations.
3
Theoretical Considerations
What is the effect of increasing the number of offspring generated upon GA invocation? The results indicate that, initially, faster learning can be induced. Later on, however, learning potentially stalls. Previously, learning in XCS was characterized as an interactive learning process in which several evolutionary pressures [12] foster learning progress: (1) A fitness pressure is induced since, on average, more accurate classifiers are usually selected for reproduction than for deletion. (2) A set pressure, which causes an intrinsic generalization pressure, is induced since, on average, more general classifiers are also selected for reproduction than for deletion. (3) Mutation pressure causes diversification of classifier conditions. (4) Subsumption pressure causes convergence to maximally accurate, general classifiers, if such are found. Since fitness and set pressure work on the same principle, increasing the number of reproductions increases both pressures equally. Thus, their balance is maintained. However, the fitness pressure only applies if there is a strong-enough fitness signal, which depends on the number of evaluations a classifier underwent before the reproduction process. The mutation pressure also depends on the number of reproductions; thus, a faster diversification can be expected given larger offspring set sizes. Another analysis estimated the reproductive opportunities a superior classifier might have before being deleted [13]. Moreover, a niche support bound was derived [14], which characterizes the probability that a classifier is sustained in the population, given that it represents an important problem niche for the final solution. Both of these bounds assume that the accuracy of the classifier is specified accurately.
However, the larger the offspring set size, the faster the classifier turnaround, thus the shorter the average time a classifier stays in the population, and thus the fewer iterations available to a classifier until it is deleted. The effect is that the GA in XCS has to work with classifier parameter estimates that are less reliable, since they underwent fewer updates on average. Thus, larger offspring set sizes induce larger noise in the selection process. As long as the fitness pressure leads in the right direction because the parameter estimates carry enough signal, learning proceeds faster. This latter reason
also relates to the estimated learning speed of XCS approximated elsewhere [15]. Since reproductions of more accurate classifiers are increased, learning speed increases as long as more accurate classifiers are detected. By the same reasoning, however, it can also be expected that learning may stall prematurely. This should be the case when the noise induced by an increased reproduction rate is so high that the identification of more accurate classifiers becomes impossible: better offspring classifiers get deleted before their fitness is sufficiently evaluated. In other words, the fitness signal is too weak for the selection process. This signal-to-noise ratio (fitness signal to selection noise) depends on (1) the problem structure at hand, (2) the solution representation given to XCS (condition and prediction structures), and (3) the population size. Thus, it is hard to specify the ratio exactly, and future research is needed to derive mathematical bounds on this problem. Nonetheless, these considerations explain the general observations in the considered functions: the more complex the function, the more problematic larger offspring sets become—even the traditional two offspring classifiers may be too fast to reach the target error ε0. To control the signal-to-noise problem, consequently, it is important to balance reproduction rates and offspring set sizes problem-dependently. A similar suggestion was made elsewhere for the control of the parameter θGA [7]. In the following, we investigate an approach that decreases the offspring set size over a learning experiment to get the best of both worlds: fast initial learning speed and maximally accurate final solution representations.
4
Adapting Offspring Set Sizes
As a first approach to determine whether it can be useful to start with larger offspring set sizes and to decrease them during the run, we linearly scale the offspring set size from 10% of the match set size down to two over the 100k learning iterations. Figure 4 shows the resulting performance in all four functions, comparing the linear scaling with the traditional two offspring classifiers and with a fixed 10% offspring set size. In Figures 4(a)-(c) we can see that the scaling technique reaches maximum accuracy. Particularly in Figure 4(a) we can see that the performance stalling is overcome and an error level is reached that is similar to the one reached with the traditional XCS setting. Performance in function f4, however, shows that the error initially stays at a high level but starts decreasing further, compared to a fixed 10% offspring set size, later in the run. Thus, the results show that a linear reduction of offspring set sizes can have positive effects on initial learning speed, while low reproduction rates at the end of a run allow for a refinement of the final solution structure. However, the results also suggest that the simple linear scheme is not necessarily optimal and that its success is highly problem-dependent. Future research needs to investigate flexible adaptation schemes that take the signal-to-noise ratio into account.
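The linear schedule can be sketched as follows. This is a hypothetical helper, not the paper's code; in particular, rounding down to an even, pair-able offspring count is our own assumption for compatibility with pairwise crossover.

```python
# Sketch of the linear offspring-set-size schedule of Sec. 4: start at
# 10% of the current match set size and shrink linearly to two over
# the 100k learning iterations.

def offspring_set_size(match_set_size, t, t_max=100_000,
                       start_frac=0.10, final_size=2):
    start = max(final_size, start_frac * match_set_size)
    frac_done = min(t / t_max, 1.0)
    size = start + (final_size - start) * frac_done
    # Round down to an even number so offspring can be paired for
    # crossover (an assumption, not specified in the paper).
    n = max(final_size, int(size))
    return n - (n % 2)

print(offspring_set_size(200, t=0))        # 20 at the start (10% of 200)
print(offspring_set_size(200, t=100_000))  # 2 at the end
```

A flexible alternative, as discussed above, would replace the fixed schedule with one driven by an estimate of the current signal-to-noise ratio.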
Fig. 4. (a) Sine function; (b) radial sine function; (c) crossed ridge function; (d) sine-in-sine function. When decreasing the number of generated offspring over the learning trial, learning speed is kept high while the error converges to the level reached by always generating two offspring classifiers (a, b, c). However, in the case of the challenging sine-in-sine function (d), further learning would be necessary to reach a similarly low error level.
5
Conclusions
This paper has shown that a fixed offspring set size does not necessarily yield the best learning speed that XCSF can achieve. Larger offspring set sizes can strongly increase the initial learning speed but do not necessarily reach maximum accuracy. Adaptive offspring set sizes, if scheduled appropriately, can get the best of both worlds, yielding high initial learning speed and a low final error. The results, however, also suggest that a simple adaptation scheme is not generally applicable. Furthermore, the theoretical considerations suggest that a signal-to-noise estimate could be used to control the GA offspring schedule and the offspring set sizes. Given a strong fitness signal, a larger set of offspring could be generated. Another consideration that needs to be taken into account in such an offspring generation scheme, however, is the fact that problem domains may be
strongly unbalanced, in that some subspaces may be very easy to approximate while others may be very hard. For such cases, it has been shown that the θGA threshold can be increased to ensure a representation of the complete problem [7]. Future research should consider adapting θGA hand-in-hand with the offspring set sizes. How exactly this may be accomplished still needs to be determined. Nonetheless, it is hoped that the results and considerations of this work provide clues in the right direction in order to speed up XCS(F) learning and to make learning even more robust in hard problems.
Acknowledgments
The authors acknowledge funding from the Emmy Noether program of the German Research Foundation (grant BU1335/3-1) and would like to thank their colleagues at the Department of Psychology and the COBOSLAB team.
References
1. Holland, J.H.: Adaptation. In: Progress in Theoretical Biology, vol. 4, pp. 263–293. Academic Press, New York (1976)
2. Holland, J.H.: Properties of the bucket brigade algorithm. In: Proceedings of the 1st International Conference on Genetic Algorithms, Hillsdale, NJ, USA, pp. 1–7. L. Erlbaum Associates Inc., Mahwah (1985)
3. Widrow, B., Hoff, M.E.: Adaptive switching circuits. Western Electronic Show and Convention, Convention Record, Part 4, 96–104 (1960)
4. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995)
5. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Prediction update algorithms for XCSF: RLS, Kalman filter, and gain adaptation. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1505–1512. ACM, New York (2006)
6. Drugowitsch, J., Barry, A.: A formal framework and extensions for function approximation in learning classifier systems. Machine Learning 70, 45–88 (2008)
7. Orriols-Puig, A., Bernadó-Mansilla, E.: Bounding XCS's parameters for unbalanced datasets. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1561–1568. ACM, New York (2006)
8. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219. Springer, Heidelberg (2000)
9. Wilson, S.W.: Classifiers that approximate functions. Natural Computing 1, 211–234 (2002)
10. Butz, M.V., Lanzi, P.L., Wilson, S.W.: Function approximation with XCS: Hyperellipsoidal conditions, recursive least squares, and compaction. IEEE Transactions on Evolutionary Computation 12, 355–376 (2008)
11. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 267–274. Springer, Heidelberg (2001)
12. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: Toward a theory of generalization and learning in XCS. IEEE Transactions on Evolutionary Computation 8, 28–46 (2004)
13. Butz, M.V., Goldberg, D.E., Tharakunnel, K.: Analysis and improvement of fitness exploitation in XCS: Bounding models, tournament selection, and bilateral accuracy. Evolutionary Computation 11, 239–277 (2003)
14. Butz, M.V., Goldberg, D.E., Lanzi, P.L., Sastry, K.: Problem solution sustenance in XCS: Markov chain analysis of niche support distributions and the impact on computational complexity. Genetic Programming and Evolvable Machines 8, 5–37 (2007)
15. Butz, M.V., Goldberg, D.E., Lanzi, P.L.: Bounding learning time in XCS. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 739–750. Springer, Heidelberg (2004)
Current XCSF Capabilities and Challenges
Patrick O. Stalph and Martin V. Butz
Department of Cognitive Psychology III, University of Würzburg
Röntgenring 11, 97080 Würzburg, Germany
{patrick.stalph,butz}@psychologie.uni-wuerzburg.de
http://www.coboslab.psychologie.uni-wuerzburg.de
Abstract. Function approximation is an important technique used in many different domains, including numerical mathematics, engineering, and neuroscience. The XCSF classifier system is able to approximate complex multi-dimensional function surfaces using a patchwork of simpler functions. Typically, locally linear functions are used due to the trade-off between expressiveness and interpretability. This work discusses XCSF's current capabilities, but also points out current challenges that can hinder learning success. A theoretical discussion on when XCSF works is intended to improve the comprehensibility of the system. Current advances with respect to scalability theory show that the system constitutes a very effective machine learning technique. Furthermore, the paper points out how to tune relevant XCSF parameters in actual applications and how to choose appropriate condition and prediction structures. Finally, a brief comparison to the Locally Weighted Projection Regression (LWPR) algorithm highlights positive as well as negative aspects of both methods.
Keywords: LCS, XCS, XCSF, LWPR.
1
Introduction
The increasing interest in Learning Classifier Systems (LCS) [1] has propelled research, and LCS have proven their capabilities in various applications, including multistep problems [2,3], data mining tasks [4,5], as well as robot applications [6,7]. The focus of this work is on the Learning Classifier System XCSF [8], which is a modified version of the original XCS [2]. XCSF is able to approximate multi-dimensional, real-valued function surfaces from samples by locally weighted, usually linear, models. While XCS theory has been investigated thoroughly in the binary domain [5], theory on real-valued input and output spaces remains sparse. There are two important questions: when does the system work at all, and how does it scale with increasing complexity? We will address these questions by first carrying over parts of the XCS theory and, second, showing the results of a scalability analysis, which suggests that XCSF scales optimally in the required population size.
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 57–69, 2010.
© Springer-Verlag Berlin Heidelberg 2010
However, even when theory tells us that a system is applicable to a specific problem type, the problem is still not solved. The practitioner has to choose appropriate parameters and has to decide on the solution representation, which for XCSF means condition and prediction structures. Therefore, we give a short guide on the system's relevant parameters and how to set them appropriately. Furthermore, a brief discussion of condition and prediction structures is provided to foster the understanding of how XCSF's generalization power can be fully exploited. Finally, we briefly compare XCSF with Locally Weighted Projection Regression (LWPR). LWPR is a statistics-based greedy algorithm for function approximation that also uses spatially localized linear models to predict the value of non-linear functions. A discussion of pros and cons points out the capabilities of each algorithm. The remainder of this article is structured as follows. Section 2 is concerned with theoretical aspects of XCSF, that is, (1) when the system works at all and (2) how XCSF scales with increasing problem complexity. In contrast, Section 3 discusses how to set relevant parameters given an actual, unknown problem. In Section 4, we briefly compare XCSF with LWPR, and the article ends with a short summary and concluding remarks.
2
Theory
We assume sufficient knowledge about the XCSF Learning Classifier System and directly start with a theoretical analysis. We carry over preconditions for successful learning known from binary XCS and propose a scalability model, which shows how the population size scales with increasing function complexity and dimensionality.
2.1 Preconditions - When It Works
In order to successfully approximate a function, XCSF has to overcome the same challenges that were identified for XCS in binary domains [5]. These challenges were described as the (1) covering challenge, (2) schema challenge, (3) reproductive opportunity challenge, (4) learning time challenge, and (5) solution sustenance challenge. The following paragraphs briefly summarize results from a recent study [9] that investigated the mentioned challenges in depth with respect to XCSF.
Covering Challenge. The initial population of XCSF should be able to cover the whole input space, because otherwise the deletion mechanism creates holes in the input space and local knowledge about these subspaces is lost (the so-called covering-deletion cycle [10]). Consequently, when successively sampled problem instances tend to be located in empty subspaces, the hole is covered with a default classifier and another hole is created due to the deletion mechanism. In analogy to results with binary XCS, the population size required to master the covering challenge grows inversely linearly with the initial classifier volume.
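The inverse-linear relation between initial classifier volume and required population size can be illustrated with a small Monte Carlo sketch. Randomly centered hypercube conditions are our own simplifying assumption, not the condition structure analyzed in the cited study:

```python
import random

def coverage_fraction(n_classifiers, side_length, n_dims=2,
                      samples=10_000, seed=42):
    """Estimate how much of [0,1]^n is matched by randomly centered
    hypercube conditions with the given side length."""
    rng = random.Random(seed)
    centers = [[rng.random() for _ in range(n_dims)]
               for _ in range(n_classifiers)]
    covered = 0
    for _ in range(samples):
        point = [rng.random() for _ in range(n_dims)]
        if any(all(abs(p - c) <= side_length / 2
                   for p, c in zip(point, center))
               for center in centers):
            covered += 1
    return covered / samples

# Shrinking the classifier volume without growing the population
# leaves uncovered holes, the precondition for covering-deletion cycles.
print(coverage_fraction(100, 0.4))  # near-complete coverage
print(coverage_fraction(100, 0.1))  # substantial holes remain
```

Restoring near-complete coverage for the smaller volume requires correspondingly more classifiers, in line with the inverse-linear bound.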
Schema and Reproductive Opportunity Challenge. When the covering challenge is met, the genetic algorithm is required to (a) discover better substructures and (b) reproduce these substructures. In binary genetic algorithms such substructures are often termed building blocks, as proposed in John H. Holland's schema theory [1]. However, the definition of real-valued schemata is non-trivial [11,12,13,14] and it is even more difficult to define building blocks for infinite input and output spaces [15,16]. While the stepwise character of binary functions emphasizes the processing of building blocks via crossover, the smooth character of real-valued functions emphasizes hill-climbing mechanisms. To the best of our knowledge, there is no consensus in the literature on this topic, and consequently it remains unclear how a building block can be defined for the real-valued XCSF Learning Classifier System. If XCSF's fitness landscape is neither flat nor deceptive, there remains one last problem: noise on the fitness signal due to a finite number of samples. Prediction parameter estimates rely on the samples seen so far, and so do the prediction error and the fitness. If the classifier turnaround (that is, reproduction and deletion of classifiers) is too high, the selection mechanism cannot identify better substructures and the learning process gets stuck [17]. This can be alleviated by slowing down the learning, e.g. by increasing θGA [18].
Learning Time Challenge. The learning time mainly depends on the number of mutations from initial classifiers to the target shape of accurate and maximally general classifiers. A too-small population size may prolong the learning time, because good classifiers get deleted and knowledge is lost. Furthermore, redundancy in the space of possible mutations (e.g. rotation for dimensions n > 3 is not unique) may increase the learning time.
A recent study estimated a linear relation between the number of required mutations and the learning time [9].
Solution Sustenance Challenge. Finally, XCSF has to ensure that the evolved accurate solution is sustained. This challenge is mainly concerned with the deletion probability. Given that the population size is high enough, the GA has enough "room" to work without destroying accurate classifiers. The resulting bound states that the population size needs to grow inversely linearly in the volume of the accurate classifiers to be sustained.
2.2 A Scalability Model
Given that all of the above challenges are overcome and the system is able to learn an accurate approximation of the problem at hand, it is important to know how changes in the function complexity or dimensionality affect XCSF's learning performance. In particular, we model the relation between
– function complexity (defined via the prediction error),
– input space dimensionality,
– XCSF's population size, and
– the target error ε0.
In order to simplify the model, we assume a uniform function structure and uniform sampling.¹ This also implies a uniform classifier structure, that is, uniform shape and size. Without loss of generality, let the n-dimensional input space be confined to [0, 1]^n. Furthermore, we assume that XCSF evolves an optimal solution [19]. This includes four properties, namely
1. completeness, that is, each possible input is covered in that at least one classifier matches.
2. correctness, that is, the population predicts the function surface accurately in that the prediction error is below the target error ε0.
3. minimality, that is, the population contains the minimum number of classifiers needed to represent the function completely and correctly.
4. non-overlappingness, that is, no input is matched by more than one classifier.
In sum, we assume a uniform patchwork of equally sized, non-overlapping, accurate, and maximally general classifiers. These assumptions reflect reality on uniform functions except for non-overlappingness, which is almost impossible to achieve for real-valued input spaces. We consider a uniformly sampled function of uniform structure

    f_Γ : [0, 1]^n → R,                        (1)

where n is the dimensionality of the input space and Γ reflects the function complexity. Since we fix neither the condition type nor the predictor used in XCSF, we have to define the complexity via the prediction error. We define Γ such that a linear increase in this value results in the same increase in the prediction error. Thus, saying that the function is twice as complex implies that the prediction error is twice as high for the same classifiers. Since the classifier volume V influences the prediction error ε in a polynomial fashion on uniform functions, we can summarize the assumptions in the following equation:

    ε = Γ · V^(1/n).                           (2)

We can now derive the optimal classifier volume and the optimal population size. Using the target error ε0, we get an optimal volume of

    V_opt = (ε0 / Γ)^n.                        (3)

The volume of the input space to be covered is one, and it follows that the optimal population size is

    N_opt = (Γ / ε0)^n.                        (4)

To sum up, the dimensionality n has an exponential influence on the population size, while the function complexity Γ and the target error ε0 have a polynomial influence. Increasing the function complexity will require a polynomial increase of the population size in the order n.
¹ Non-uniform sampling is discussed elsewhere [18].
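Equation (4) can be evaluated directly. A minimal sketch illustrating the exponential influence of the dimensionality n and the polynomial influence of Γ and ε0 (the function name and example values are ours):

```python
def optimal_population_size(gamma, epsilon_0, n):
    """N_opt = (Gamma / epsilon_0)^n from Equation (4)."""
    return (gamma / epsilon_0) ** n

# Dimensionality sits in the exponent: going from n = 2 to n = 4
# squares the required population size ...
print(round(optimal_population_size(0.1, 0.01, 2)))  # 100
print(round(optimal_population_size(0.1, 0.01, 4)))  # 10000
# ... while doubling the complexity Gamma scales it only by 2^n.
print(round(optimal_population_size(0.2, 0.01, 2)))  # 400
```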
[Figure 1 here: final number of macro classifiers (log-scale) plotted against the function gradient (log-scale) for dimensions 1D to 6D.]
Fig. 1. Comparative plots of the final population size after condensation (data points) and the developed scalability theory (solid lines) for dimensions n = 1 to n = 6. The number of macro classifiers is plotted against the function complexity, which is modeled via an increasing gradient. The order of each polynomial equals the dimension n; thus, an increasing dimension requires an exponential increase in population size, while an increasing function complexity results in a polynomial increase. Apart from an approximately constant overhead due to overlapping classifiers, the scalability model fits reality.
Note that no assumptions are made about the condition type or the predictor used. The intentionally simple equations 3 and 4 hide a complex geometric problem in the variable Γ . For example, assume a three-dimensional non-linear function that is approximated using linear predictions and rotating ellipsoidal conditions. Calculating the prediction error is non-trivial for such a setup. When the above bounds are required exactly, this geometric problem has to be solved for any condition-prediction-function combination anew. In order to validate the scalability model, we conducted experiments with interval conditions and constant predictions on a linear function2 . XCSF with constant predictions equals XCSR [20], however, only one dummy action is available. As done before in [19] with respect to XCS, we analyze a restricted class of problems for XCSF. On the one hand, the constant prediction makes this setup a worst case scenario in terms of required population size. On the other hand, the simple setup allows for solving the geometric problem analytically—thus, we can compare the theoretical population size bound from Equation 4 with the actual population size that is required to approximate the respective function. A so called bisection algorithm runs XCSF with different population size settings in a binary search fashion. On termination, the bisection procedure returns the approximately minimal population size N that is required for successful learning. 2
² Other settings: 500,000 iterations, ε0 = 0.01, β = 0.1, α = 1, δ = 0.1, ν = 5, χ = 1, μ = 0.05, r0 = 1, θGA = 50, θdel = 20, θsub = 20. GA subsumption and uniform crossover were applied.
For details of the bisection algorithm and how the geometric problem is solved, please refer to [9]. Figure 1 shows the results of the bisection experiments on the one- to six-dimensional linear function f_Γ(x1, . . . , xn) = Γ · Σ_{i=1}^{n} x_i, where solid lines represent the developed theory (Equation 4) and the data points represent the final population size after condensation [21]. For each dimension n, the function difficulty Γ was linearly increased by increasing the gradient of the linear function. The polynomials are shown as straight lines on a log-log-scale plot, where the gradient of a line equals the order of the corresponding polynomial. We observe an approximately constant overhead from scalability theory to actual population size. This overhead is expected, since the scalability model assumes non-overlappingness. Most importantly, the prediction of the model lies parallel to the actual data, which indicates that the dimension n fits the exponent of the theoretical model. Thus, the experiment confirms the scalability model: the problem dimensionality has an exponential influence on the required population size (given full problem space sampling). Furthermore, a linear increase in the problem difficulty (or a linear decrease of the target error ε0) induces a polynomial increase in the population size.
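The bisection procedure can be sketched as follows. The `learns_successfully` oracle, which in the actual experiments would run a full XCSF experiment with the given population size, is replaced here by a hypothetical stand-in:

```python
def bisect_min_population_size(learns_successfully, lower=50, upper=51200,
                               tolerance=50):
    """Binary search for the approximately minimal population size N
    for which a run still reaches the target error."""
    assert learns_successfully(upper), "upper bound must succeed"
    while upper - lower > tolerance:
        mid = (lower + upper) // 2
        if learns_successfully(mid):
            upper = mid   # success: try smaller populations
        else:
            lower = mid   # failure: more classifiers are needed
    return upper

# Toy oracle: pretend any population of at least 1337 classifiers succeeds.
print(bisect_min_population_size(lambda n: n >= 1337))  # 1348
```

In practice each oracle call is stochastic, so repeated runs per population size would be needed before declaring success or failure.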
3
How to Set XCSF’s Parameters
Although theory shows that XCSF can learn optimally, it is also important to understand the influence of XCSF's parameter settings, such as population size, condition structures, and prediction types. Besides the importance and the direct influence of a parameter, the interdependencies between parameters are also relevant for the practitioner. In the following, we give a brief overview of important parameters, their dependencies, and how to tune them in actual applications.
3.1 Important Parameters and Interdependencies
A long list of available parameters exists for both XCS and XCSF. Among obviously important parameters, such as the population size N, there are less frequently tuned parameters (e.g. θGA) and parameters that are rarely changed at all, such as the crossover rate χ or the accuracy scale ν. The most important parameters are summarized here.
Population Size N – This parameter specifies the available workspace for the evolutionary search. Therefore it is crucial to set this value high enough to prevent deletion of good classifiers (see Section 2.1).
Target Error ε0 – The error threshold defines the desired accuracy. Evolutionary pressures drive classifiers towards this threshold of accurate and maximally general classifiers.
Condition Type – The structuring capability of XCSF is defined by this setting. Various condition structures are available, including simple axis-parallel intervals [22], rotating ellipsoids [23], and arbitrary shapes using gene expression programming [24].
Prediction Type – Typically, linear predictors are used for a good balance of expressiveness and interpretability. However, others are possible, such as constant predictors [8] or polynomial ones [25].
Learning Time – The number of iterations should be set high enough to ensure that the prediction error converges to a value below the desired ε0.
GA Frequency Threshold θGA – This threshold specifies that GA reproduction is activated only if the average time since the last GA activation in the set exceeds θGA. Increasing this value delays learning, but may also prevent forgetting and overgeneralization in unbalanced data sets [18].
Mutation Rate μ – The probability of mutation is closely related to the available mutation options of the condition type and thus it is also connected to the dimensionality of the problem. It should be set according to the problem at hand, e.g. μ = 1/m, where m is the number of available mutation options.
Initial Classifier Size r0 – On the one hand, this value should be set high enough to meet the covering challenge, that is, such that simple covering with fewer than N classifiers suffices to cover the whole input space. On the other hand, the initial size should be small enough to yield a fitness signal upon crossover or mutation in order to prevent oversized classifiers from taking over the population.
The other parameters can be set to their default values, thus ensuring a good balance of the evolutionary pressures. The strongest interdependencies exist between population size N, target error ε0, condition structure, and prediction type, as indicated by the scalability model of Section 2.2. Changing any of these will affect XCSF's learning performance significantly. For example, with a higher population size a lower target error can be reached. An appropriate condition structure may turn a polynomial problem into a linear one, thus requiring fewer classifiers.
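As a rough summary of the guidance above, typical starting settings might be collected as follows. The helper and dict structure are purely illustrative; the numeric values echo the experimental settings reported in this paper (ε0 = 0.01, θGA = 50, r0 = 1) plus a population size of 6400 as used for the plots, and every problem may require further tuning:

```python
def suggested_parameters(mutation_options):
    """Illustrative XCSF starting parameters following the guidelines
    above; not prescriptions."""
    return {
        "population_size_N": 6400,       # generous GA workspace
        "target_error_eps0": 0.01,       # desired accuracy
        "condition_type": "rotating ellipsoids",
        "prediction_type": "linear",
        "theta_GA": 50,                  # GA frequency threshold
        # Rule of thumb from the text: mutate with probability 1/m.
        "mutation_rate_mu": 1.0 / mutation_options,
        # Large enough to cover the input space with covering alone.
        "initial_size_r0": 1.0,
    }

params = suggested_parameters(mutation_options=5)
print(params["mutation_rate_mu"])  # 0.2
```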
Advanced predictors are able to approximate more complex functions and thus enable a coarser structuring of the input space, again reducing the required population size. When tuning any of these settings, the related parameters should be kept in mind.
3.2 XCSF's Solution Representation
Before running XCSF with arbitrary settings on a particular problem, a few things have to be considered. This mainly concerns the condition and prediction structures, that is, XCSF's solution representation. The next two paragraphs highlight some issues regarding different representations. Selecting an Appropriate Predictor. The first step is to select the type of prediction to be used for the function approximation. Linear predictions have a reasonable computational complexity and good expressiveness, while the final solution remains well interpretable. In some cases, it might be required to invert the approximated function after learning, which is easily possible with a linear predictor. However, if prior knowledge suggests a special type of function (e.g. polynomials
or sinusoidal functions), this knowledge can be exploited by using corresponding predictors. The complexity of the prediction mainly influences the classifier updates, which is usually – depending on the dimensionality – a minor factor. Structuring Capabilities. Closely related to the predictor is the condition structure. The simplest formulation is intervals, that is, rectangles. Alternatively, spheres or ellipsoids (also known as radial basis functions or receptive fields) can be used. More advanced structures include rotation, which allows for exploiting interdimensional dependencies, but also increases the complexity of (1) the evolutionary search space and (2) the computational time for matching, which are major influences on the learning time. On the other hand, if interdependencies can be exploited, the required population size may shrink dramatically, effectively speeding up the whole learning process by orders of magnitude. Finally, it is also possible to use arbitrary structures such as gene expression programming or neural networks. However, the improved generalization capabilities can reduce the interpretability of the developed solutions, and learning success can usually not be guaranteed because the genetic operators may not yield a mainly local phenotypic search through the expressible condition structures.
3.3 When XCSF Fails
Even the best condition and prediction structures do not necessarily guarantee successful learning. This section discusses some issues where fine-tuning of some parameters may help to reach the desired accuracy. Furthermore, we point out when XCSF reaches its limits, so that simple parameter tuning cannot overcome learning failures. Ideally, given an unknown function, XCSF's prediction error quickly drops below ε0 (see Figure 2(a) for a typical performance graph). When XCSF is not able to accurately learn the function, there are four possible main reasons:
1. The prediction error has not yet converged.
2. The prediction error converged to an average error above the target error.
3. The prediction error stays on an initially very low level, but the function surface is not fully approximated.
4. The prediction error stays on an initially high level.
Given case 1, the learning time is too short to allow for an appropriate structuring of the input space. Increasing the number of iterations will solve this issue. In contrast, case 2 indicates that the function is too difficult to approximate with the given population size, target error, predictor, and condition structure. Figure 2(b) illustrates a problem in which the system does not reach the target error. Increasing the learning time allows for a settling of the prediction error, but the target error is only reached when the maximum population size is increased. While in the previous examples XCSF just does not reach the target error, in other scenarios the system completely fails to learn anything due to bad parameter choices. There are two major factors that may prevent learning completely: covering-deletion cycles and flat fitness landscapes. Although case 3
[Figure 2 here: two panels plotting prediction error, macro classifiers, and match set macro classifiers against the number of learning steps (1000s) for (a) the crossed ridge function in 2D and (b) the sine-in-sine function in 2D.]
Fig. 2. Typical performance measurements on two benchmark functions. The target error ε0 = 0.01 is represented by a dashed line. (a) The chosen settings are well suited for the crossed-ridge function and the prediction error converges to a value below the target error. (b) In contrast, the sine-in-sine function is too difficult for the same settings, and the system neither reaches the target error nor does the prediction error converge within the given learning time.
[Figure 3 here: two panels plotting prediction error, macro classifiers, and match set macro classifiers against the number of learning steps (1000s) for the 20D sine function with (a) a too small r0 and (b) a too large r0.]
Fig. 3. Especially on high-dimensional functions, it is crucial to set the initial classifier size r0 to a reasonable value. (a) A small initial size leads to a covering-deletion cycle. (b) When the fitness landscape is too flat, the evolutionary search is unable to identify better substructures and oversized classifiers prevent learning.
seems strange, there is a simple explanation. If the population size and initial classifier size are set such that the input space cannot be covered by the covering mechanism, the system continuously covers and deletes classifiers without any knowledge gain (the so-called covering-deletion cycle [10]). Typically, the average match set size is one, the population size quickly reaches the maximum, and the average prediction error is almost zero because the error during covering is zero. As an example, we equip XCSF with a small initial classifier size r0 and run the system on a 20-dimensional sine function as shown in Figure 3(a). Especially high-dimensional input spaces are prone to this problematic cycle, because (1)
the initial classifier volume has to be high enough to allow for a complete coverage, but (2) the initial volume must not exceed the size beyond which the GA no longer receives a sufficient fitness signal. The latter may be the case when a single mutation of the initial covering shape cannot produce a sufficiently small classifier that captures the (possibly fine-grained) structure of the underlying function. Thus, the GA is missing a fitness gradient and, due to higher reproductive opportunities, over-general classifiers take over the population, as shown in Figure 3(b). Typically, the prediction error does not drop at all. Here XCSF reaches its limits, and "simple" parameter tuning may not help to overcome the problem with a reasonable population size. A refined initial classifier size may yield a reasonable fitness signal and prevent over-general classifiers. Otherwise, it might be necessary to reconsider the condition structure or the corresponding evolutionary operators.
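A hypothetical back-of-the-envelope check of this trade-off, assuming hypercube conditions that must at least be able to tile the unit input space:

```python
def minimal_initial_side_length(population_size, n_dims):
    """Smallest hypercube side length r0 such that N classifiers can,
    in principle, tile the unit input space: N * r0^n >= 1."""
    return (1.0 / population_size) ** (1.0 / n_dims)

# In 2D, a population of 6400 tolerates tiny initial classifiers ...
print(round(minimal_initial_side_length(6400, 2), 4))   # 0.0125
# ... while in 20D even a side length of about 0.65 is the minimum,
# leaving little room between covering failure and a flat fitness signal.
print(round(minimal_initial_side_length(6400, 20), 2))  # 0.65
```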
4
A Brief Comparison with Locally Weighted Projection Regression
In contrast to traditional function fitting, where the general type of the underlying function has to be known before fitting the data, the so-called Locally Weighted Projection Regression (LWPR) algorithm [26,27] approximates functions iteratively by means of local linear models, as does XCSF. The following paragraphs highlight the main differences between LWPR and XCSF and sketch some theoretical thoughts on performance as well as on the applicability of both systems. The locality of each model is defined by so-called receptive fields, which correspond to XCSF's rotating hyperellipsoidal condition structures [23]. However, in contrast to the steady-state GA in XCSF, the receptive fields in LWPR are structured by means of a statistical gradient descent. The center, that is, the position of a receptive field, is never changed once it is created. Based on the prediction errors, the receptive fields can shrink in specific directions, which – theoretically – minimizes the error. Indefinite shrinking is prevented by introducing a penalty term, which penalizes small receptive fields. Thus, receptive fields shrink due to prediction errors and enlarge if the influence of prediction errors is less than the influence of the penalty term. However, the ideal statistics from batch learning can only be estimated in an iterative algorithm, and experimental validation is required to shed light on the actual performance of both systems when compared on benchmark functions. One disadvantage of LWPR is that all its statistics are based on linear predictions and the ellipsoidal shape of receptive fields. Thus, alternative predictions or conditions cannot be applied directly. In contrast, a wide variety of prediction types and condition structures are available for XCSF, allowing for a higher representational flexibility.
Furthermore, it is easily possible to decouple conditions and predictions in XCSF [6], in which case conditions cluster a contextual space for the predictions in another space. Since the fitness signal for the GA is only based on prediction errors, no coupling is necessary. It remains an open research challenge to realize similar mechanisms and modifications with LWPR.
Current XCSF Capabilities and Challenges
On the other hand, the disadvantage of XCSF is the higher population size it requires during learning, which is necessary for the niched evolutionary algorithm to work successfully. Different condition shapes have to be evaluated on several samples before a stable fitness value can be used in the evolutionary selection process. Nevertheless, it has been shown that both systems achieve comparable prediction errors in particular scenarios [23]. Future research will compare XCSF and LWPR in detail, including theoretical considerations as well as empirical evaluations on various benchmark functions.
5 Summary and Conclusions
This article discussed XCSF's current capabilities as well as scenarios that pose a challenge for the system. From a theoretical point of view, we analyzed the preconditions for successful learning and, if these conditions are met, how the system scales to higher problem complexities, including function structure and dimensionality. In order to successfully learn the surface of a given function, XCSF has to overcome the same challenges that were identified for XCS: the covering challenge, the schema challenge, the reproductive opportunity challenge, the learning time challenge, and the solution sustenance challenge. Given a uniform function structure and uniform sampling, the scalability model predicts an exponential influence of the input space dimensionality on the population size. Moreover, a polynomial increase in the required population size is expected when the function complexity is linearly increased or when the target error is linearly decreased. From a practitioner's viewpoint, we highlighted XCSF's important parameters and gave a brief guide on how to set them appropriately. Additional parameter tuning suggestions may help if initial settings fail to reach the desired target error. Examples illustrate when XCSF completely fails due to a covering-deletion cycle or due to flat fitness landscapes. Thus, failures in actual applications can be understood, and refined parameter choices can eventually resolve the problem. Finally, a brief comparison with a statistics-based machine learning technique, namely Locally Weighted Projection Regression (LWPR), discussed advantages and disadvantages of the evolutionary approach employed in XCSF. A current study, which also includes empirical experiments, supports the presented comparison with respect to several relevant performance measures [28].
Acknowledgments. The authors acknowledge funding from the Emmy Noether program of the German Research Foundation (grant BU1335/3-1) and would like to thank their colleagues at the Department of Psychology and the COBOSLAB team.
P.O. Stalph and M.V. Butz
References

1. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. The MIT Press, Cambridge (1992)
2. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995)
3. Butz, M.V., Goldberg, D.E., Lanzi, P.L.: Gradient descent methods in learning classifier systems: Improving XCS performance in multistep problems. Technical report, Illinois Genetic Algorithms Laboratory (2003)
4. Bernadó-Mansilla, E., Garrell-Guiu, J.M.: Accuracy-based learning classifier systems: Models, analysis, and applications to classification tasks. Evolutionary Computation 11, 209–238 (2003)
5. Butz, M.V.: Rule-Based Evolutionary Online Learning Systems: A Principled Approach to LCS Analysis and Design. Springer, Heidelberg (2006)
6. Butz, M.V., Herbort, O.: Context-dependent predictions and cognitive arm control with XCSF. In: GECCO 2008: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, pp. 1357–1364. ACM, New York (2008)
7. Stalph, P.O., Butz, M.V., Pedersen, G.K.M.: Controlling a four degree of freedom arm in 3D using the XCSF learning classifier system. In: Mertsching, B., Hund, M., Aziz, Z. (eds.) KI 2009. LNCS, vol. 5803, pp. 193–200. Springer, Heidelberg (2009)
8. Wilson, S.W.: Classifiers that approximate functions. Natural Computing 1, 211–234 (2002)
9. Stalph, P.O., Llorà, X., Goldberg, D.E., Butz, M.V.: Resource management and scalability of the XCSF learning classifier system. Theoretical Computer Science (in press), http://dx.doi.org/10.1016/j.tcs.2010.07.007
10. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: How XCS evolves accurate classifiers. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001), pp. 927–934 (2001)
11. Wright, A.H.: Genetic algorithms for real parameter optimization. In: Foundations of Genetic Algorithms, pp. 205–218. Morgan Kaufmann, San Francisco (1991)
12. Goldberg, D.E.: Real-coded genetic algorithms, virtual alphabets, and blocking. Complex Systems 5, 139–167 (1991)
13. Radcliffe, N.J.: Equivalence class analysis of genetic algorithms. Complex Systems 5, 183–205 (1991)
14. Mühlenbein, H., Schlierkamp-Voosen, D.: Predictive models for the breeder genetic algorithm – I. Continuous parameter optimization. Evolutionary Computation 1, 25–49 (1993)
15. Beyer, H.G., Schwefel, H.P.: Evolution strategies – a comprehensive introduction. Natural Computing 1(1), 3–52 (2002)
16. Bosman, P.A.N., Thierens, D.: Numerical optimization with real-valued estimation-of-distribution algorithms. In: Scalable Optimization via Probabilistic Modeling. SCI, vol. 33, pp. 91–120. Springer, Heidelberg (2006)
17. Stalph, P.O., Butz, M.V.: How fitness estimates interact with reproduction rates: Towards variable offspring set sizes in XCSF. In: Bacardit, J. (ed.) IWLCS 2008/2009. LNCS (LNAI), vol. 6471, pp. 47–56. Springer, Heidelberg (2010)
18. Orriols-Puig, A., Bernadó-Mansilla, E.: Bounding XCS's parameters for unbalanced datasets. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1561–1568. ACM, New York (2006)
19. Kovacs, T., Kerber, M.: What makes a problem hard for XCS? In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 251–258. Springer, Heidelberg (2001)
20. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219. Springer, Heidelberg (2000)
21. Wilson, S.W.: Generalization in the XCS classifier system. In: Genetic Programming 1998: Proceedings of the Third Annual Conference, pp. 665–674 (1998)
22. Stone, C., Bull, L.: For real! XCS with continuous-valued inputs. Evolutionary Computation 11(3), 299–336 (2003)
23. Butz, M.V., Lanzi, P.L., Wilson, S.W.: Function approximation with XCS: Hyperellipsoidal conditions, recursive least squares, and compaction. IEEE Transactions on Evolutionary Computation 12, 355–376 (2008)
24. Wilson, S.W.: Classifier conditions using gene expression programming. In: Bacardit, J., Bernadó-Mansilla, E., Butz, M.V., Kovacs, T., Llorà, X., Takadama, K. (eds.) IWLCS 2006 and IWLCS 2007. LNCS (LNAI), vol. 4998, pp. 206–217. Springer, Heidelberg (2008)
25. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Extending XCSF beyond linear approximation. In: GECCO 2005: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, pp. 1827–1834 (2005)
26. Vijayakumar, S., Schaal, S.: Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space. In: ICML 2000: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 1079–1086 (2000)
27. Vijayakumar, S., D'Souza, A., Schaal, S.: Incremental online learning in high dimensions. Neural Computation 17(12), 2602–2634 (2005)
28. Stalph, P.O., Rubinsztajn, J., Sigaud, O., Butz, M.V.: A comparative study: Function approximation with LWPR and XCSF. In: GECCO 2010: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (in press, 2010)
Recursive Least Squares and Quadratic Prediction in Continuous Multistep Problems

Daniele Loiacono and Pier Luca Lanzi
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy
{loiacono,lanzi}@elet.polimi.it
Abstract. XCS with computed prediction, namely XCSF, has recently been extended in several ways. In particular, a novel prediction update algorithm based on recursive least squares and the extension to polynomial prediction led to significant improvements of XCSF. However, these extensions have so far been studied only on single step problems, and it is currently not clear whether these findings extend also to multistep problems. In this paper we investigate this issue by analyzing the performance of XCSF with recursive least squares and with quadratic prediction on continuous multistep problems. Our results show that both extensions improve the convergence speed of XCSF toward an optimal performance. As shown by the analysis reported in this paper, these improvements are due to the capability of recursive least squares and of polynomial prediction to provide a more accurate approximation of the problem value function after the first few learning problems.
1 Introduction
Learning Classifier Systems are a genetics-based machine learning technique for solving problems through the interaction with an unknown environment. The XCS classifier system [16] is probably the most successful learning classifier system to date. It couples effective temporal difference learning, implemented as a modification of the well-known Q-learning [14], with a niched genetic algorithm guided by an accuracy-based fitness to evolve accurate, maximally general solutions. In [18], Wilson extended XCS with the idea of computed prediction to improve the estimation of the classifiers' prediction. In XCS with computed prediction, XCSF in brief, the classifier prediction is not stored in a parameter but computed as a linear combination of the current input and a weight vector associated with each classifier. Recently, in [11], the classifier weights update has been improved with a recursive least squares approach, and the idea of computed prediction has been further extended to polynomial prediction. Both the recursive least squares update and polynomial prediction have been effectively applied to solve function approximation problems as well as to learn Boolean functions. However, it is not currently clear whether these findings extend also to continuous multistep problems, where Wilson's XCSF has already been successfully applied [9]. In this paper we investigate this important issue. First, we extend the recursive least squares update algorithm to multistep problems with covariance resetting, a well-known approach to deal with a nonstationary target. Then, to test our approach, we compare the usual Widrow-Hoff update rule to the recursive least squares one (extended with covariance resetting) on a class of continuous multistep problems, the 2D gridworld problems [1]. Our results show that XCSF with recursive least squares outperforms XCSF with the Widrow-Hoff rule in terms of convergence speed, although both finally reach an optimal performance. Thus, the results confirm the findings of previous work on XCSF with recursive least squares applied to single step problems. In addition, we performed a similar experimental analysis to investigate the effect of polynomial prediction on the same set of problems. Also in this case, the results suggest that quadratic prediction results in faster convergence of XCSF toward the optimal performance. Finally, to explain why recursive least squares and polynomial prediction increase the convergence speed of XCSF, we show that they improve the accuracy of the payoff landscape learned in the first few learning problems.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 70–86, 2010. © Springer-Verlag Berlin Heidelberg 2010
2 XCS with Computed Prediction
XCSF differs from XCS in three respects: (i) classifier conditions are extended for numerical inputs, as done in XCSI [17]; (ii) classifiers are extended with a vector of weights w, which is used to compute the prediction; finally, (iii) the original update of the classifier prediction must be modified so that the weights are updated instead of the classifier prediction. These three modifications result in a version of XCS, XCSF [18,19], that maps numerical inputs into actions with an associated computed prediction. In the original paper [18], classifiers have no action and it is assumed that XCSF outputs the estimated prediction instead of the action itself. In this paper, we consider the version of XCSF with actions and linear prediction (named XCS-LP [19]) in which more than one action is available. As said before, throughout the paper we do not keep the (rather historical) distinction between XCSF and XCS-LP, since the two systems are basically identical except for the use of actions in the latter case.

Classifiers. In XCSF, classifiers consist of a condition, an action, and four main parameters. The condition specifies which input states the classifier matches; as in XCSI [17], it is represented by a concatenation of interval predicates, int_i = (l_i, u_i), where l_i ("lower") and u_i ("upper") are integers, though they might also be real. The action specifies the action for which the payoff is predicted. The four parameters are: the weight vector w, used to compute the classifier prediction as a function of the current input; the prediction error ε, which estimates the error affecting the classifier prediction; the fitness F, which estimates the accuracy of the classifier prediction; and the numerosity num, a counter used to represent different copies of the same classifier. Note that the size of the weight vector w depends on the type of approximation. In the case of piecewise-linear approximation, considered in this paper, the weight vector w has one weight w_i
for each possible input, and an additional weight w_0 corresponding to a constant input x_0, which is set as a parameter of XCSF.

Performance Component. XCSF works as XCS. At each time step t, XCSF builds a match set [M] containing the classifiers in the population [P] whose condition matches the current sensory input s_t; if [M] contains less than θ_mna actions, covering takes place and creates a new classifier that matches the current inputs and has a random action. Each interval predicate int_i = (l_i, u_i) in the condition of a covering classifier is generated as l_i = s_t(i) − rand(r_0) and u_i = s_t(i) + rand(r_0), where s_t(i) is the input value of state s_t matched by the interval predicate int_i, and the function rand(r_0) generates a random integer in the interval [0, r_0], with r_0 a fixed integer. The weight vector w of covering classifiers is randomly initialized with values from [−1, 1]; all the other parameters are initialized as in XCS (see [3]). For each action a_i in [M], XCSF computes the system prediction, which estimates the payoff that XCSF expects when action a_i is performed. As in XCS, in XCSF the system prediction of action a is computed as the fitness-weighted average over all matching classifiers that specify action a. However, in contrast with XCS, in XCSF the classifier prediction is computed as a function of the current state s_t and the classifier weight vector w. Accordingly, in XCSF the system prediction is a function of both the current state s_t and the action a. Following a notation similar to [2], the system prediction for action a in state s_t, P(s_t, a), is defined as:

    P(s_t, a) = Σ_{cl∈[M]|a} cl.p(s_t) × cl.F / Σ_{cl∈[M]|a} cl.F    (1)

where cl is a classifier, [M]|a represents the subset of classifiers in [M] with action a, cl.F is the fitness of cl, and cl.p(s_t) is the prediction of cl computed in the state s_t. In particular, when piecewise-linear approximation is considered, cl.p(s_t) is computed as:

    cl.p(s_t) = cl.w_0 × x_0 + Σ_{i>0} cl.w_i × s_t(i)    (2)

where cl.w_i is the weight w_i of cl and x_0 is a constant input. The values of P(s_t, a) form the prediction array. Next, XCSF selects an action to perform. The classifiers in [M] that advocate the selected action are put in the current action set [A]; the selected action is sent to the environment and a reward P is returned to the system.

Reinforcement Component. XCSF uses the incoming reward P to update the parameters of the classifiers in the action set [A]. The weight vector w of the classifiers in [A] is updated using a modified delta rule [15]. For each classifier cl ∈ [A], each weight cl.w_i is adjusted by a quantity Δw_i computed as:

    Δw_i = (η / |s_t|²) (P − cl.p(s_t)) s_t(i)    (3)

where η is the correction rate and |s_t|² is the norm of the input vector s_t (see [18] for details). Equation 3 is usually referred to as the "normalized" Widrow-Hoff
update or "modified delta rule", because of the presence of the normalizing term |s_t|² [5]. The values Δw_i are used to update the weights of classifier cl as:

    cl.w_i ← cl.w_i + Δw_i    (4)

Then the prediction error ε is updated as:

    cl.ε ← cl.ε + β(|P − cl.p(s_t)| − cl.ε)    (5)
Finally, classifier fitness is updated as in XCS.

Discovery Component. The genetic algorithm and subsumption deletion in XCSF work as in XCSI [17]. On a regular basis, depending on the parameter θ_ga, the genetic algorithm is applied to the classifiers in [A]. It selects two classifiers with probability proportional to their fitness, copies them, and with probability χ performs crossover on the copies; then, with probability μ, it mutates each allele. Crossover and mutation work as in XCSI [17,18]. The resulting offspring are inserted into the population and two classifiers are deleted to keep the population size constant.
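As a concrete illustration, the prediction and weight-update steps of Equations 1–4 can be sketched as follows. This is a minimal sketch, not the authors' implementation; the Classifier class and the names X0 and ETA are our own illustrative assumptions.

```python
import numpy as np

X0 = 1.0    # constant input x0
ETA = 0.2   # correction rate eta

class Classifier:
    def __init__(self, n_inputs):
        # one weight w0 for the constant input x0 plus one per input (Eq. 2)
        self.w = np.random.uniform(-1.0, 1.0, n_inputs + 1)
        self.fitness = 0.01

    def predict(self, s):
        # cl.p(s) = w0*x0 + sum_i w_i * s(i)  (Eq. 2)
        x = np.concatenate(([X0], s))
        return float(self.w @ x)

def system_prediction(action_set, s):
    # fitness-weighted average of classifier predictions (Eq. 1)
    num = sum(cl.predict(s) * cl.fitness for cl in action_set)
    den = sum(cl.fitness for cl in action_set)
    return num / den

def widrow_hoff_update(cl, s, P):
    # normalized delta rule (Eqs. 3-4): w += eta/|x|^2 * (P - p(s)) * x
    x = np.concatenate(([X0], s))
    error = P - cl.predict(s)
    cl.w += (ETA / (x @ x)) * error * x
```

Repeated updates on the same input shrink the prediction error geometrically by the factor (1 − η) per step, which is the usual behavior of the normalized Widrow-Hoff rule.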
3 Improving and Extending Computed Prediction
The idea of computed prediction, introduced by Wilson in [18], has recently been improved and extended in several ways [11,12,6,10]. In particular, Lanzi et al. extended computed prediction to polynomial functions [7] and introduced in [11] a novel prediction update algorithm based on recursive least squares. Although these extensions proved to be very effective in single step problems, both in function approximation [11,7] and in Boolean problems [8], they have never been applied to multistep problems so far. In the following, we briefly describe the classifier update algorithm based on recursive least squares and how it can be applied to multistep problems. Finally, we show how computed prediction can be extended to polynomial prediction.

3.1 XCSF with Recursive Least Squares
In XCSF with recursive least squares, the Widrow-Hoff rule used to update the classifier weights is replaced with a more effective update algorithm based on recursive least squares (RLS). At time step t, given the current state s_t and the target payoff P, recursive least squares updates the weight vector w as w_t = w_{t−1} + k_t [P − x_t^T w_{t−1}], where x_t = [x_0 s_t]^T and k_t, called the gain vector, is computed as:

    k_t = V_{t−1} x_t / (1 + x_t^T V_{t−1} x_t)    (6)

while the matrix V_t is computed recursively by:

    V_t = [I − k_t x_t^T] V_{t−1}    (7)
The matrix V_t is usually initialized as V_0 = δ_rls I, where δ_rls is a positive constant and I is the n × n identity matrix. A higher δ_rls denotes that the initial parametrization is uncertain; accordingly, the algorithm will initially use a higher, thus faster, update rate (k_t). A lower δ_rls denotes that the initial parametrization is rather certain; accordingly, the algorithm will use a slower update. It is worthwhile to note that the recursive least squares approach presented above involves two basic underlying assumptions [5,4]: (i) the noise on the target payoff P used for updating the classifier weights can be modeled as a unitary-variance white noise, and (ii) the optimal classifier weight vector does not change during the learning process, i.e., the problem is stationary. While the first assumption is often reasonable and usually has a small impact on the final outcome, the second assumption is not justified in many problems and may have a big impact on the performance. In the literature [5,4], many approaches have been introduced for relaxing this assumption. In particular, a straightforward approach is the resetting of the matrix V: every τ_rls updates, the matrix V is reset to its initial value δ_rls I. Intuitively, this prevents RLS from converging toward a fixed parameter estimate by continually restarting the learning process. We refer the interested reader to [5,4] for a more detailed analysis of recursive least squares and other related approaches, like the well-known Kalman filter. The extension of XCSF with recursive least squares is straightforward: we added to each classifier the matrix V as an additional parameter and we replaced the usual update of the classifier weights with the recursive least squares update described above and reported as Algorithm 1.

Algorithm 1. Update classifier cl with the RLS algorithm
procedure UPDATE_PREDICTION(cl, s, P)
    error ← P − cl.p(s)                         ▷ compute the current error
    x(0) ← x_0                                  ▷ build x by adding x_0 to s
    for i ∈ {1, ..., |s|} do
        x(i) ← s(i)
    end for
    if number of updates since last reset > τ_rls then
        cl.V ← δ_rls I                          ▷ reset cl.V
    end if
    η_rls ← (1 + x^T · cl.V · x)^(−1)
    cl.V ← cl.V − η_rls cl.V · x x^T · cl.V     ▷ update cl.V
    k ← cl.V · x
    for i ∈ {0, ..., |s|} do
        cl.w_i ← cl.w_i + k(i) · error          ▷ update the classifier's weights
    end for
end procedure
Computational Complexity. It is worth comparing the complexity of the Widrow-Hoff rule and recursive least squares, both in terms of the memory required per classifier and the time required by each classifier update. For each classifier, recursive least squares stores the matrix cl.V, which is n × n; thus its additional space complexity is O(n²), where n = |x| is the size of the input vector. With
respect to the time required for each update, the Widrow-Hoff update rule involves only n scalar multiplications and is thus O(n); instead, recursive least squares requires a matrix multiplication, which is O(n²). Therefore, recursive least squares is more demanding than the Widrow-Hoff rule in terms of both memory and time requirements.

3.2 Beyond Linear Prediction
Usually in XCSF the classifier prediction is computed as a linear function, so that piecewise-linear approximations of the action-value function are evolved. However, XCSF can easily be extended to evolve polynomial approximations as well. Let us consider a simple problem with a single-variable state space. At time step t, the classifier prediction is computed as cl.p(s_t) = w_0 x_0 + w_1 s_t, where x_0 is a constant input and s_t is the current state. Thus, we can introduce a quadratic term in the approximation evolved by XCSF:

    cl.p(s_t) = w_0 x_0 + w_1 s_t + w_2 s_t²    (8)

To learn the new set of weights we use the usual XCSF update algorithm (either RLS or Widrow-Hoff) applied to the input vector x_t, defined as x_t = ⟨x_0, s_t, s_t²⟩. When more variables are involved, so that s_t = ⟨s_t(1), ..., s_t(n)⟩, we define x_t = ⟨x_0, s_t(1), s_t(1)², ..., s_t(n), s_t(n)²⟩ and apply XCSF to the newly defined input space. The same approach can be generalized to the approximation of any polynomial of order k by extending the input vector x_t with higher-order terms. However, in this paper, for the sake of simplicity, we limit our analysis to quadratic prediction.
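The quadratic input mapping just described is straightforward to implement; a small sketch follows (the function name is ours, not from the paper):

```python
import numpy as np

X0 = 1.0  # constant input x0

def quadratic_features(s):
    """Build x_t = <x0, s(1), s(1)^2, ..., s(n), s(n)^2> from state s."""
    x = [X0]
    for v in s:
        x.extend([v, v * v])  # each variable contributes s(i) and s(i)^2
    return np.array(x)
```

A classifier's quadratic prediction is then simply the dot product of its (2n + 1)-dimensional weight vector with quadratic_features(s), so the Widrow-Hoff or RLS machinery is left unchanged.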
4 Experimental Design
To study how recursive least squares and quadratic prediction affect the performance of XCSF on continuous multistep problems, we considered a well-known class of problems: the 2D gridworld problems, introduced in [1]. They are two-dimensional environments in which the current state is defined by a pair of real-valued coordinates ⟨x, y⟩ in [0, 1]²; the only goal is in position ⟨1, 1⟩, and there are four possible actions (left, right, up, and down), coded with two bits; each action corresponds to a step of size s in the corresponding direction; actions that would take the system outside the domain [0, 1]² take it to the nearest position of the grid border. The system can start anywhere but in the goal position, and it reaches the goal when both coordinates are equal to or greater than one. When the system reaches the goal it receives 0; in all other cases it receives -0.5. We called the problem described above the empty gridworld,
Fig. 1. The 2D Continuous Gridworld problems: (a) the optimal value function of Grid(0.05) when γ = 0.95; (b) the Puddles(0.05) environment; (c) the optimal value function of Puddles(0.05) when γ = 0.95
dubbed Grid(s), where s is the agent step size. Figure 1a shows the optimal value function associated with the empty gridworld problem when s = 0.05 and γ = 0.95. A slightly more challenging problem can be obtained by adding some obstacles to the empty gridworld environment, as proposed in [1]: each obstacle represents an area in which there is an additional cost for moving. These areas are called "puddles" [1], since they actually create a sort of puddle in the optimal value function. Figure 1b depicts the Puddles(s) environment, which is derived from Grid(s) by adding two puddles (the gray areas). When the system is in a puddle, it receives an additional negative reward of -2, i.e., the action has an additional
cost of -2; in the area where the two puddles overlap (the darker gray region), the two negative rewards add up, i.e., the action has a total additional cost of -4. We called this second problem the puddle world, dubbed Puddles(s), where s is the agent step size. Figure 1c shows the optimal value function of the puddle world when s = 0.05 and γ = 0.95. The performance is computed as the average number of steps to reach the goal during the last 100 test problems. To speed up the experiments, problems can last at most 500 steps; when this limit is reached, the problem stops even if the system did not reach the goal. All the statistics reported in this paper are averaged over 20 experiments.
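The environment dynamics described above can be sketched as follows. This is an illustrative sketch: the class name is ours, and the puddle rectangles passed in are placeholders rather than the exact geometry of Figure 1b.

```python
import numpy as np

class Gridworld2D:
    # left, right, up, down as unit direction vectors
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}

    def __init__(self, step=0.05, puddles=()):
        self.s = step
        self.puddles = puddles  # list of (x1, y1, x2, y2) rectangles
        self.pos = None

    def reset(self):
        # start anywhere in [0,1]^2 (the exact goal corner has probability 0)
        self.pos = np.random.uniform(0.0, 1.0, 2)
        return self.pos

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        # moves outside [0,1]^2 are clipped to the grid border
        self.pos = np.clip(self.pos + self.s * np.array([dx, dy]), 0.0, 1.0)
        if self.pos[0] >= 1.0 and self.pos[1] >= 1.0:
            return self.pos, 0.0, True           # goal reached: reward 0
        reward = -0.5                            # every other step costs -0.5
        for (x1, y1, x2, y2) in self.puddles:    # each puddle adds -2
            if x1 <= self.pos[0] <= x2 and y1 <= self.pos[1] <= y2:
                reward -= 2.0
        return self.pos, reward, False
```

Overlapping puddle rectangles add their penalties, reproducing the -4 total cost in the darker overlap region.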
5 Experimental Results
Our aim is to study how the RLS update and quadratic prediction affect the performance of XCSF on continuous multistep problems. To this purpose, we applied XCSF with different types of prediction, i.e., linear and quadratic, and with different update rules, i.e., Widrow-Hoff and RLS, to the Grid(0.05) and Puddles(0.05) problems. In addition, we also compared the performance of XCSF to that obtained with tabular Q-learning [13], a standard reference in the RL literature. In order to apply tabular Q-learning to the 2D gridworld problems, we discretized the continuous problem space, using the step size s = 0.05 as the resolution for the discretization. In the first set of experiments we investigated the effect of the RLS update on the performance of XCSF, while in the second set of experiments we extended our analysis to quadratic prediction. Finally, we analyzed the results obtained and the accuracy of the action-value approximations learned by the different versions of XCSF.

5.1 Results with Recursive Least Squares
In the first set of experiments we compared Q-learning and XCSF with the two different updates on the 2D continuous gridworld problems. For XCSF we used the following parameter settings: N = 5000, ε_0 = 0.05, β = 0.2, α = 0.1, γ = 0.95, ν = 5, χ = 0.8, μ = 0.04, p_explr = 0.5, θ_del = 50, θ_GA = 50, and δ = 0.1; GA subsumption is on with θ_sub = 50, while action-set subsumption is off; the parameters for integer conditions are m_0 = 0.5, r_0 = 0.25 [17]; the parameter x_0 for XCSF is 1 [18]. In addition, with the RLS update we used δ_rls = 10 and τ_rls = 50. Accordingly, for Q-learning we set β = 0.2, γ = 0.95, and p_explr = 0.5. Figure 2a compares the performance of Q-learning and of the two versions of XCSF on the Grid(0.05) problem. All the systems are able to reach an optimal performance, and XCSF with the RLS update learns much faster than XCSF with the Widrow-Hoff update, although Q-learning is faster still. This is not surprising, as Q-learning is provided with the optimal state-space discretization to solve the problem, while XCSF has to search for it. However, it is worth noticing that when the RLS update rule is used, XCSF is able to learn almost as fast as Q-learning. Moving to the more difficult Puddles(0.05) problem, we find very similar results, as shown by Figure 2b.
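For reference, the tabular Q-learning baseline on the discretized state space can be sketched as follows. This is a hedged sketch under our own naming, with the parameters listed above (β = 0.2, γ = 0.95, p_explr = 0.5) and the s = 0.05 discretization resolution.

```python
import numpy as np

BETA, GAMMA, P_EXPLR = 0.2, 0.95, 0.5
S = 0.05                             # discretization resolution
N = int(round(1.0 / S)) + 1          # 21 cells per dimension

Q = np.zeros((N, N, 4))              # Q-table: discrete states x 4 actions

def discretize(pos):
    # map continuous <x, y> to the nearest grid cell index
    return int(round(pos[0] / S)), int(round(pos[1] / S))

def select_action(pos, rng):
    if rng.random() < P_EXPLR:
        return rng.integers(4)       # explore uniformly at random
    i, j = discretize(pos)
    return int(np.argmax(Q[i, j]))   # exploit the greedy action

def q_update(pos, action, reward, next_pos, done):
    # standard Q-learning backup with learning rate beta
    i, j = discretize(pos)
    target = reward if done else reward + GAMMA * np.max(Q[discretize(next_pos)])
    Q[i, j, action] += BETA * (target - Q[i, j, action])
```

Because the agent's step size coincides with the discretization resolution, every continuous position the agent can visit maps to exactly one table cell, which is what makes this discretization optimal for the baseline.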
Fig. 2. The performance of Q-learning (reported as QL), XCSF with the Widrow-Hoff update (reported as WH), and XCSF with the RLS update (reported as RLS) applied to: (a) the Grid(0.05) problem; (b) the Puddles(0.05) problem. In (a) the optimum (21 steps) is also marked. Curves are averages over 20 runs.
Also in this case, XCSF with the RLS update learns faster than XCSF with the usual Widrow-Hoff update rule, and the difference with Q-learning is even less evident. Therefore, our results suggest that the RLS update rule exploits the collected experience more effectively than the Widrow-Hoff rule, and they confirm the previous findings on XCSF with recursive least squares applied to single step problems [11].

5.2 Results with Quadratic Prediction
In the second set of experiments, we compared linear prediction to quadratic prediction on the Grid(0.05) and Puddles(0.05) problems, using both the Widrow-Hoff and RLS updates. Parameters are set as in the previous experiments. Table 1a reports the performance of the systems in the first 500 test problems as a measure of convergence speed. As found in the previous set of
Table 1. XCSF applied to Grid(0.05) and to Puddles(0.05) problems. (a) Average number of steps to reach the goal per episode in the first 500 test problems; (b) average number of steps to reach the goal per episode in the last 500 test problems; (c) size of the population evolved. Statistics are averages over 20 experiments.
experiments, the RLS update leads to faster convergence also when quadratic prediction is used. In addition, the results suggest that quadratic prediction also affects the learning speed: both with the Widrow-Hoff update and with the RLS update, quadratic prediction outperforms linear prediction. In particular, XCSF with quadratic prediction and the RLS update learns even faster than Q-learning on both the Grid(0.05) and Puddles(0.05) problems. However, as Table 1b shows, all the systems reach an optimal performance. Finally, it can be noticed that the number of macroclassifiers evolved (Table 1c) is very similar for all the systems, suggesting that XCSF with quadratic prediction does not evolve a more compact solution.

5.3 Analysis of Results
Our results suggest that, in continuous multistep problems, the RLS update and quadratic prediction do not give any advantage either in terms of final performance or in terms of population size. On the other hand, both extensions lead to an effective improvement of the learning speed, that is, they play an important role in the early stage of the learning process. However, this
80
D. Loiacono and P.L. Lanzi
[Plots: average absolute error vs. number of learning problems (0-5000) for the LINEAR WH, LINEAR RLS, QUADRATIC WH, and QUADRATIC RLS variants, panels (a) and (b).]
Fig. 3. Average absolute error of the value functions learned by XCSF on (a) the Grid(0.05) problem and (b) the Puddles(0.05) problem. Curves are averages over 20 runs.
result is not surprising: (i) the RLS update exploits the collected experience more effectively and learns an accurate approximation faster; (ii) quadratic prediction allows broader generalization in the early stages, which very quickly leads to a rough approximation of the payoff landscape. Figure 3 reports the error of the value function learned by the four XCSF versions during the learning process. The error of a learned value function is measured as the absolute error with respect to the optimal value function, computed as the average of the absolute errors over a uniform grid of 100 × 100 samples of the problem space. For each version of XCSF this error measure is computed at different stages of the learning process and then averaged over the 20 runs to generate the error curves reported in Figure 3. The results confirm our hypothesis: both quadratic prediction and the RLS update lead very quickly to accurate approximations of the optimal value function, although the final approximations are only as accurate as the one evolved by XCSF with the Widrow-Hoff rule and linear prediction. To better understand how the different versions of XCSF approximate the value function,
[Surface plots of the value function V(x, y) over the (x, y) unit square.]
Fig. 4. Examples of the value function evolved by XCSF with linear prediction and Widrow-Hoff update on the Grid(0.05) problem: (a) after 50 learning episodes, (b) after 500 learning episodes, (c) at the end of the experiment (after 5000 learning episodes)
[Surface plots of the value function V(x, y) over the (x, y) unit square.]
Fig. 5. Examples of the value function evolved by XCSF with linear prediction and RLS update on the Grid(0.05) problem: (a) after 50 learning episodes, (b) after 500 learning episodes, (c) at the end of the experiment (after 5000 learning episodes)
[Surface plots of the value function V(x, y) over the (x, y) unit square.]
Fig. 6. Examples of the value function evolved by XCSF with quadratic prediction and Widrow-Hoff update on the Grid(0.05) problem: (a) after 50 learning episodes, (b) after 500 learning episodes, (c) at the end of the experiment (after 5000 learning episodes)
[Surface plots of the value function V(x, y) over the (x, y) unit square.]
Fig. 7. Examples of the value function evolved by XCSF with quadratic prediction and RLS update on the Grid(0.05) problem: (a) after 50 learning episodes, (b) after 500 learning episodes, (c) at the end of the experiment (after 5000 learning episodes)
Figure 4, Figure 5, Figure 6, and Figure 7 show some examples of the value functions learned by XCSF at different stages of the learning process. In particular, Figure 4a and Figure 5a show the value function learned by XCSF with linear prediction after a few learning episodes, using respectively the Widrow-Hoff update and the RLS update. While the value function learned by XCSF with the Widrow-Hoff update is flat and rather uninformative, the one learned by XCSF with the RLS update provides a rough approximation of the slope of the optimal value function, although it is still far from accurate. Finally, Figure 6 and Figure 7 report similar examples of value functions learned by XCSF with quadratic prediction. Figure 7a shows how XCSF with both quadratic prediction and the RLS update can learn a rough approximation of the optimal value function after very few learning episodes. A similar analysis can be performed on the Puddles(0.05) problem, but it is not reported here for lack of space.
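The error measure used for these curves, the average absolute deviation from the optimal value function over a uniform 100 × 100 grid, can be sketched as follows. The two example value functions are hypothetical and serve only to exercise the measure.

```python
import numpy as np

def average_absolute_error(v_learned, v_optimal, resolution=100):
    """Mean absolute difference between a learned and the optimal value
    function, sampled on a uniform resolution x resolution grid of [0,1]^2."""
    xs = np.linspace(0.0, 1.0, resolution)
    total = sum(abs(v_learned(x, y) - v_optimal(x, y)) for x in xs for y in xs)
    return total / (resolution * resolution)

# Hypothetical value functions, just to exercise the measure:
optimal = lambda x, y: 5.0 * x + 5.0 * y - 10.0   # a sloped surface
flat = lambda x, y: -5.0                          # an uninformative estimate
err = average_absolute_error(flat, optimal)
```

A flat estimate like the one in Figure 4a scores a large average error against a sloped optimal surface, which is exactly what the early-episode curves show.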
6 Conclusions
In this paper we investigated the application of two successful extensions of XCSF, the recursive least squares update algorithm and quadratic prediction, to multistep problems. First, we extended the recursive least squares approach, originally devised for single-step problems, to multistep problems by means of covariance resetting, a technique for dealing with a non-stationary target. Second, we showed how the linear prediction used by XCSF can be extended to quadratic prediction in a very straightforward way. The recursive least squares update and the quadratic prediction were then compared to the usual XCSF on the 2D Gridworld problems. Our results suggest that the recursive least squares update, as well as the quadratic prediction, leads to faster convergence of XCSF toward optimal performance. The analysis of the accuracy of the value function estimate showed that recursive least squares and quadratic prediction play an important role in the early stage of the learning process. The capability of recursive least squares to exploit the collected experience more effectively, and the broader generalization allowed by quadratic prediction, lead to a more accurate estimate of the value function after a few learning episodes. In conclusion, we showed that the previous findings on recursive least squares and polynomial prediction applied to single-step problems extend also to continuous multistep problems. Further investigations will include the analysis of the generalizations evolved by XCSF with recursive least squares and quadratic prediction.
References
1. Boyan, J.A., Moore, A.W.: Generalization in reinforcement learning: Safely approximating the value function. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) Advances in Neural Information Processing Systems 7, pp. 369–376. The MIT Press, Cambridge (1995)
2. Butz, M.V., Pelikan, M.: Analyzing the evolutionary pressures in xcs. In: Spector, L., Goodman, E.D., Wu, A., Langdon, W.B., Voigt, H.-M., Gen, M., Sen, S., Dorigo, M., Pezeshk, S., Garzon, M.H., Burke, E. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001), July 7-11, pp. 935–942. Morgan Kaufmann, San Francisco (2001) 3. Butz, M.V., Wilson, S.W.: An algorithmic description of xcs. Journal of Soft Computing 6(3-4), 144–153 (2002) 4. Goodwin, G.C., Sin, K.S.: Adaptive Filtering: Prediction and Control, PrenticeHall information and system sciences series (March 1984) 5. Haykin, S.: Adaptive Filter Theory, 4th edn. Prentice-Hall, Englewood Cliffs (2001) 6. Lanzi, P.L., Loiacono, D.: Xcsf with neural prediction. In: IEEE Congress on Evolutionary Computation, CEC 2006, pp. 2270–2276 (2006) 7. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Extending XCSF beyond linear approximation. In: Genetic and Evolutionary Computation – GECCO-2005, Washington DC, USA, pp. 1859–1866. ACM Press, New York (2005) 8. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: XCS with computed prediction for the learning of boolean functions. In: Proceedings of the IEEE Congress on Evolutionary Computation – CEC 2005, Edinburgh, UK, pp. 588–595. IEEE, Los Alamitos (September 2005) 9. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: XCS with computed prediction in continuous multistep environments. In: Proceedings of the IEEE Congress on Evolutionary Computation – CEC 2005, Edinburgh, UK, pp. 2032– 2039. IEEE, Los Alamitos (September 2005) 10. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Prediction update algorithms for XCSF: RLS, kalman filter, and gain adaptation. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1505–1512. ACM Press, New York (2006) 11. 
Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Generalization in the XCSF classifier system: Analysis, improvement, and extension. Evolutionary Computation 15(2), 133–168 (2007) 12. Loiacono, D., Marelli, A., Lanzi, P.L.: Support vector regression for classifier prediction. In: GECCO 2007: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pp. 1806–1813. ACM Press, New York (2007) 13. Watkins, C.J.C.H.: Learning from delayed reward. PhD thesis (1989) 14. Watkins, C.J.C.H., Dayan, P.: Technical note: Q-Learning. Machine Learning 8, 279–292 (1992) 15. Widrow, B., Hoff, M.E.: Neurocomputing: Foundation of Research. In: Adaptive Switching Circuits, pp. 126–134. The MIT Press, Cambridge (1988) 16. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evolutionary Computation 3(2), 149–175 (1995), http://prediction-dynamics.com/ 17. Wilson, S.W.: Mining Oblique Data with XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (workshop organisers): Proceedings of the International Workshop on Learning Classifier Systems (IWLCS-2000), in the Joint Workshops of SAB 2000 and PPSN 2000, pp. 158–174 (2000) 18. Wilson, S.W.: Classifiers that approximate functions. Journal of Natural Computing 1(2-3), 211–234 (2002) 19. Wilson, S.W.: Classifier systems for continuous payoff environments. In: Deb, K., Poli, R., Banzhaf, W., Beyer, H.-G., Burke, E., Darwen, P., Dasgupta, D., Floreano, D., Foster, J., Harman, M., Holland, O., Lanzi, P.L., Spector, L., Tettamanzi, A., Thierens, D., Tyrrell, A. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 824–835. Springer, Heidelberg (2004)
Use of a Connection-Selection Scheme in Neural XCSF
Gerard David Howard1, Larry Bull1, and Pier-Luca Lanzi2
1 Department of Computer Science, University of the West of England, Bristol, UK
{gerard2.howard,larry.bull}@uwe.ac.uk
2 Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milan, Italy
[email protected]

Abstract. XCSF is a modern form of Learning Classifier System (LCS) that has proven successful in a number of problem domains. In this paper we exploit the modular nature of XCSF to include a number of extensions, namely a neural classifier representation, self-adaptive mutation rates and neural constructivism. It is shown that, via constructivism, appropriate internal rule complexity emerges during learning. It is also shown that self-adaptation allows this rule complexity to emerge at a rate controlled by the learner. We evaluate this system on both discrete and continuous-valued maze environments. The main contribution of this work is the implementation of a feature selection derivative (termed connection selection), which is applied to modify network connectivity patterns. We evaluate the effect of connection selection, in terms of both solution size and system performance, on both discrete and continuous-valued environments.

Keywords: feature selection, neural network, self-adaptation.
1 Introduction
Two main theories to explain the emergence of complexity in the brain are constructivism (e.g. [1]), where complexity develops by adding neural structure to a simple network, and selectionism [2], where an initial amount of over-complexity is gradually pruned over time through experience. We are interested in the feasibility of combining both approaches to realize flexible learning within Learning Classifier Systems (LCS) [3], exploiting their Genetic Algorithm (GA) [4] foundation in particular. In this paper we present a form of neural LCS [5] based on XCSF [6] which includes the use of self-adaptive search operators to exploit both constructivism and selectionism during reinforcement learning. The focus of this paper centres on the impact of a form of feature selection that we apply to the neural classifiers, allowing a more granular exploration of the network weight space. Unlike traditional feature selection, which acts only on input channels, we allow every connection in our networks to be enabled or disabled. We term this addition "connection selection", and evaluate in detail the effects of its inclusion in our LCS, in terms of solution size, internal knowledge representation and stability of evolved solutions in two evaluation environments; the first a discrete maze and the second a continuous maze.
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 87–106, 2010. © Springer-Verlag Berlin Heidelberg 2010
88
G.D. Howard, L. Bull, and P.-L. Lanzi
For clarity’s sake, we shall refer to the system without connection selection as N-XCSF, and the version with connection selection as N-XCSFcs. Applications of this type of learning system are varied, including (but not limited to) agent navigation, data mining and function approximation; we are interested in the field of simulated agent navigation. The rest of this paper is organized as follows: Section 2 details background research, Section 3 introduces the evaluation environments used, and Section 4 shows the implementation of neural XCSF. Section 5 describes "connection selection", Section 6 provides results of the experiments conducted, and Section 7 provides a brief discussion and suggests further avenues of research.
2 Background
2.1 Neural Classifier Systems
The benefits of Artificial Neural Network (ANN) representations mimic those of their real-life inspiration, including flexibility, robustness to noise and graceful performance degradation. The type of neural network used in our work is the Multi-Layer Perceptron (MLP) [7]. There are a number of neural LCS in the literature that are relevant to this paper. The initial work exploring artificial neural networks within LCS used traditional feedforward MLPs to represent the rules [5]. Recurrent MLPs were then shown able to provide memory for a simple maze task [8]. Radial Basis Function networks [9] were later used for both simulated [10] and real [11] robotics tasks. Both forms of neural representation have been shown amenable to a constructionist approach wherein the number of nodes within the hidden layer is under evolutionary control, along with the network connection weights [5][11]. Here a mutation operator either adds or removes nodes from the hidden layer. MLPs have also been used in LCS to calculate the predicted payoff [12][13][14], to compute only the action [15], and to predict the next sensory state [16].
2.2 Neural Constructivism
Heuristic approaches to neural constructivism include FAST [17]. Here, a learning agent is made to navigate a discrete maze environment using Q-learning [18]. The system begins with a single network, and more are added if the oscillation in Q value between two states is greater than a given threshold (e.g. there exist two states specifying different payoffs/actions, with only one network to cover both states). More networks are added until the solution space is fully covered by a number of neural networks, which allows the system to select optimal actions for each location within the environment.
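The FAST-style network-addition test described above can be sketched in a few lines. This is our paraphrase of the heuristic, not the original implementation; the function and parameter names are ours.

```python
def needs_new_network(q_history, threshold):
    """FAST-style heuristic: if the Q values a single network reports for a
    state oscillate between visits by more than a threshold, one network
    cannot cover that part of the state space and another should be added."""
    return max(q_history) - min(q_history) > threshold
```

For instance, a state whose recorded Q estimates swing between 1.0 and 5.0 under a threshold of 2.0 triggers the addition of a new network, while a swing of 0.5 does not.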
With regard to the use of constructivism in LCS, the first implementation is described in [5], where Wilson's Zeroth-level Classifier System (ZCS) [19] is used as a basis, the resulting system (NCS) being evaluated on the Woods1 environment. The author implements a constructivist approach to topology evolution using fully-connected MLPs to represent a classifier condition. Each classifier begins with one hidden-layer node. A constructivism event may be triggered during a GA cycle, and adds or
removes a single fully-connected hidden-layer neuron from the classifier condition. The author then proceeds to define the use of NCS in continuous-valued environments using a bounded-range representation, which reduces the number of neurons required by each MLP. This constructivist LCS was then modified to include parameter self-adaptation in [11]. The probabilities of constructivism events occurring are self-adaptive in the same way as the mutation rate in [20], where an Evolutionary Strategy-inspired implementation is used to control the amount of genetic mutation that occurs within each GA niche in a classifier system. This allows classifiers that match in suboptimal niches to search more broadly within the solution space when µ is large, and decreases the mutation rate when an optimal solution has been found, to maintain stability within the niche. In both cases it is reported that networks of different structure evolve to handle different areas of the problem space, thereby identifying the underlying structure of the task. Constructivism leads us to the field of variable-length neural representations. Traditional genetic crossover operators are of questionable utility when applied to the variable-length genomes that constructivism generates, as all rely on randomly picking points within the genome to perform crossover on. This can have the effect of breaking the genome in areas that rely on spatial proximity to provide high utility. A number of methods, notably Harvey's Species Adaptive Genetic Algorithm (SAGA) [21] and Hutt and Warwick's Synapsing Variable-Length Crossover (SVLC) [22], provide ways of crossing variable-length genetic strings, with SVLC reporting superior performance to SAGA on a variable-length test problem. SVLC also eliminates the main weakness of SAGA: that the initial crossover point on the first genome is still chosen randomly, with only the second subject to a selection heuristic.
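A minimal sketch of the ES-style self-adaptation just described: the offspring first perturbs its own mutation rate µ, then applies that rate to its genome. The log-normal perturbation, the bounds, and the weight-perturbation width are illustrative assumptions, not the exact scheme of [20].

```python
import math
import random

def self_adapt_and_mutate(weights, mu):
    """ES-style self-adaptation: the offspring perturbs its own mutation rate
    log-normally, then mutates each connection weight with that probability.
    The constants here are illustrative."""
    mu_child = mu * math.exp(random.gauss(0.0, 1.0))  # adapt the rate itself
    mu_child = min(max(mu_child, 1e-4), 1.0)          # keep it in a sane range
    child = [w + random.gauss(0.0, 0.1) if random.random() < mu_child else w
             for w in weights]
    return child, mu_child
```

Because µ is inherited and mutated alongside the weights, niches that have converged tend to retain small rates while poorly covered niches can keep exploring broadly.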
It should be noted that neither N-XCSF nor N-XCSFcs use any version of crossover during a GA cycle; the reasoning behind this omission is twofold. Firstly, directly addressing the problem would require increasing the complexity of the system (adding SVLC-like functionality, for example). Secondly, and more importantly, experimental evidence suggests that sufficient solution-space exploration can be obtained via a combination of GA mutation, self-adaptive mutation and neural constructivism to produce optimal solutions in both discrete and continuous environments. This view is reinforced elsewhere in the literature, e.g. [23]. Aside from GA-based crossover difficulties, there are also problems related to creating novel network structures of high utility. For example, the competing conventions problem (e.g. [24]) demonstrates how two networks of different structure but identical utility may compete with each other for fitness, despite being essentially the same network. NeuroEvolution of Augmenting Topologies (NEAT) [25] presents a method for addressing this problem under constructivism. Each gene under the NEAT scheme specifies a connection, giving the input neuron, the output neuron, the connection weight, and a Boolean flag indicating whether the connection is currently enabled or disabled. Each gene also carries a marker that corresponds to that gene's first appearance in the population; markers are passed down from parents to children during a GA event, based on the assumption that genes from the same origin are more likely to encode similar functions. The marker is retained to make it more likely that homologous genes will be selected during crossover. NEAT has been applied to evolve robot controllers [26].
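The constructivism event itself, triggered probabilistically during a GA cycle and either growing or pruning the hidden layer by one fully-connected node, can be sketched as follows. The parameter names (`psi`, `omega`) and bounds are illustrative, not the exact values used in the systems above.

```python
import random

def constructivism_event(n_hidden, psi=0.01, omega=0.5, max_hidden=10):
    """Neural constructivism sketch: with probability psi a structural change
    fires; it then adds a fully-connected hidden node with probability omega,
    otherwise removes one, keeping the layer size within bounds."""
    if random.random() < psi:
        if random.random() < omega and n_hidden < max_hidden:
            return n_hidden + 1   # grow the hidden layer
        if n_hidden > 1:
            return n_hidden - 1   # prune the hidden layer
    return n_hidden
```

In a self-adaptive variant, `psi` and `omega` would themselves be per-classifier parameters inherited and perturbed like the mutation rate.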
2.3 Feature Selection
Feature selection is a method of streamlining the data input to a process, where the input data can be imagined as a vector of inputs with dimension >1. This can be done manually (by a human with relevant domain knowledge), although this process can be error-prone, costly in terms of time and potentially money, and, of course, requires expert domain knowledge. A popular alternative in the machine learning community is automatic feature selection. The use of feature selection brings two major benefits: firstly, the amount of data being input to a process can be reduced (increasing computational efficiency), and secondly, noisy connections (or those otherwise inhibitory to the successful performance of the system) can be disabled. Useful features within the input vector are preserved, as the performance of the system can be expected to drop if they are disabled, with the converse being true for disabling noisy/low-fitness connections. This is especially useful when considering the case of mobile robot control, where sensors are invariably subject to a certain level of noise that can be automatically filtered out by the feature selection mechanism. This description of feature selection displays a strong relationship with the MLP (and indeed any connectionist neural) paradigm, which uses a collection of clearly discretised input channels to produce an output. It can be demonstrated that the disabling of connections within the input layer of an MLP can have a (sometimes drastic) effect on the output of the network [27]. Related work on the subject of feature selection in neural networks can be found in [28] and [29], which explore the use of feature selection in a variety of neural networks. Also especially pertinent is the implementation of feature selection within the NEAT framework (FS-NEAT) [30], applied to a double pole balancing task with 256 inputs.
FS-NEAT performs feature selection by giving each input feature a small chance (1/I, where I is the dimension of the input vector) to be connected to every output node. An unaltered NEAT mutation sequence then allows these connections to connect to nodes in the hidden layers of the networks, as well as providing the ability to add further input nodes to the networks, again with a small probability of input addition. The authors make the point that NEAT, following a constructivist methodology, tends to evolve small networks without superfluous connections. They observe both quicker convergence to optimality and networks with only around 32% of the available input nodes connected in the best-performing network, a reduction from 256 inputs to an average "useful" subset size of 83.6 enabled input nodes. Also highly relevant is the derivative FD-NEAT (Feature Deselection NEAT) [31], where all connections are enabled by default and pruning rather than growing of connections takes place (it should be noted that FS-NEAT and neural constructivism [1] are similar, as are FD-NEAT and Edelman's theory of neural Darwinism [2]). Consistent across all four papers mentioned above is that they perform input feature selection only (in other words, only input connections are viable candidates for enabling/disabling). A comparative study into neuroevolution for both classification and regression tasks (supervised) can be found in [32], where the authors compare purely heuristic approaches with an ensemble of evolutionary neural networks (ENNs), whose MLPs
are designed through evolutionary computing. In the former case, randomly-weighted fully-connected networks with hidden layer size N (determined experimentally) are used to solve the tasks. In the latter, each network begins with a boundedly-random number of hidden-layer nodes. A feature-selection derivative similar to our approach is then implemented, whereby each network connection is probabilistically enabled. Structural mutation is then applied so that, with each GA application, a random number of either nodes or connections are added or deleted. Also similar to our implementation, the authors disable crossover, citing [17] and its negligible impact on final solution performance. They then expand this work to evolve topologies and weights simultaneously, as evolving one without the other was revealed to be disruptive to the learning process. In their implementation, the non-adaptive rates of weight mutation and topological mutation are controlled by individual variables, each with a 50% chance of altering the network. Finally, it should be noted that this work builds on a previous publication [33], which introduces the design of the N-XCSF (and N-XCS [ibid.], which does not include function approximation). That research highlights the benefits of N-XCSF, mainly in terms of generalization capability and population size reduction. It is shown that the use of MLPs allows the same classifier to match in multiple locations within the same environmental payoff level, indicating differing actions thanks to action computation. It is also shown that the inclusion of function approximation allows the same classifier to match accurately in many payoff levels; combined, these two features allow the system to perform optimally with a degree of generalization (i.e. fewer total networks required in [P]).
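The connection-selection idea at the heart of this paper, a Boolean enable/disable flag on every network connection rather than on input channels only, can be sketched for a single neuron as follows. The function names and the mutation rate are ours, and this is a simplified illustration rather than the paper's implementation.

```python
import math
import random

def masked_neuron(inputs, weights, mask):
    """One sigmoidal neuron under connection selection: every connection has
    a Boolean flag, and disabled connections are skipped in the weighted sum."""
    s = sum(w * x for w, x, on in zip(weights, inputs, mask) if on)
    return 1.0 / (1.0 + math.exp(-s))

def mutate_mask(mask, rate=0.05):
    """Connection-selection mutation: each flag may flip, enabling or
    disabling that single connection (the rate here is illustrative)."""
    return [(not on) if random.random() < rate else on for on in mask]

# Disabling a connection removes that channel from the computation:
out_both = masked_neuron([1.0, 1.0], [2.0, -2.0], [True, True])
out_one = masked_neuron([1.0, 1.0], [2.0, -2.0], [True, False])
```

Because the mask evolves alongside the weights, noisy or low-utility connections anywhere in the network, not just at the inputs, can be pruned by selection pressure.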
3 Environments
Discrete maze experiments are conducted on a real-valued version of the Maze4 environment [34] (Figure 1). In the diagram, "O" represents an obstacle that the agent cannot traverse, "G" is the goal state, which the agent must reach to receive reward, and "*" is a free space that the agent can occupy. The environmental discount rate γ=0.71. The environmental representation was altered to loosely approximate a real robot's sensor readings: the binary string normally used to represent a given input state st is replaced with a real-valued counterpart in the same way as [5]. That is, each exclusive object type the agent could encounter is represented by a random real number within a specified range ([0.0, 0.1] for free space, [0.4, 0.5] for an obstacle and [0.9, 1.0] for the goal state). In the discrete environment, the input state st consists of the cell contents of the 8 cells directly surrounding the agent's current position, and the boundedly-random numeric representation attempts to emulate the sensory noise that real robots encounter. Performance is gauged by a "steps-to-goal" count, the number of discrete movements required to reach the goal state from a random starting position in the maze; in Maze4 this figure is 3.5. Upon reaching the goal state, the agent receives a reward of 1000. Action calculation is covered in Section 4. The test environment for the continuous experiments is the 2-D continuous grid world, Grid(0.05) (Figure 2) [35]. This is a two-dimensional environment where the agent's current state, st, consists of the x and y components of the agent's current location within the environment; to emulate sensory noise, both the x and y location of the
agent are subject to random noise of +/- [0%-5%] of the agent's true position. Both x and y are bounded in the range [0,1]; any movement outside of this range takes the agent to the nearest grid boundary. The environmental discount rate γ=0.95. The agent moves a predetermined step size (in this case 0.05) within this environment. The only goal state is in the top-right corner of the grid, where (x + y > 1.90). The agent can start anywhere except the goal state, and must reach a goal state in the fewest possible movements, where it receives a reward of 1000. Again, action calculation is covered in Section 4.

O O O O O O O O
O * * O * * G O
O O * * O * * O
O O * O * * O O
O * * * * * * O
O O * O * * * O
O * * * * O * O
O O O O O O O O

Fig. 1. The discrete Maze4 environment
[Diagram: the unit square with x and y axes ticked from 0.0 to 1.0.]
Fig. 2. The continuous grid (0.05) environment
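From the description above, the Grid(0.05) dynamics (multiplicative sensory noise of up to 5%, a fixed 0.05 step clipped to [0,1], and a reward of 1000 once x + y > 1.90) can be sketched as follows; the function names are ours.

```python
import random

STEP = 0.05    # fixed movement step size
NOISE = 0.05   # up to 5% sensory noise on each coordinate

def sense(x, y):
    """Noisy observation of the agent's true (x, y) position."""
    nx = x * (1.0 + random.uniform(-NOISE, NOISE))
    ny = y * (1.0 + random.uniform(-NOISE, NOISE))
    return min(max(nx, 0.0), 1.0), min(max(ny, 0.0), 1.0)

def step(x, y, dx, dy):
    """Move by (dx, dy), clip to the grid, and pay 1000 in the goal corner."""
    x = min(max(x + dx, 0.0), 1.0)
    y = min(max(y + dy, 0.0), 1.0)
    reward = 1000.0 if x + y > 1.90 else 0.0
    return x, y, reward
```

A movement that would leave the grid simply pins the agent to the nearest boundary, matching the description of the environment above.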
4 Neural XCSF (N-XCSF)
XCSF [6] is a form of classifier system in which a classifier's prediction (that is, the reward a classifier expects to gain from executing its action given the current input state) is computed. Like other classifier systems, XCSF evolves a population of classifiers, [P], to cover a problem space. Each classifier consists of a condition and an action, as well as a number of other parameters. In our case, a fully-connected Multi-Layer Perceptron neural network [7] is used in place of the traditional ternary condition, and is used to calculate the action. Prediction computation is unchanged, computed linearly using a separate series of weights. Each classifier is represented by a vector that details the connection weights of an MLP. Each connection weight is initialized uniformly randomly in the range [-1, 1]. In the discrete case, there are 8 input neurons, representing the contents of the cells in the 8 compass directions surrounding the agent's current location. For the continuous environment, each network comprises 2 input neurons (representing the noisy x and y location of the agent). Both network types also consist of a number of hidden-layer neurons under evolutionary control (see Section 4.2), and 3 output neurons. Each node (hidden and output) in the neural network has a sigmoidal activation function to constrain the range of output values. The first two output neurons represent the strength of action passed to the left and right motors of the robot respectively, and the third output neuron is a "don't-match" neuron that excludes the classifier from the
match set if it has activation greater than 0.5. This is necessary as the action of the classifier must be re-calculated for each state the classifier encounters, so each classifier "sees" each input. The outputs at the other two neurons (real numbers) are mapped to a single discrete movement, which varies between discrete and continuous environments. In the discrete case, the outputs at the other two neurons are mapped to a movement in one of eight compass directions (N, NE, E, etc.). This takes place in a way similar to [5], where three ranges of discrete output are possible for each node: 0.0<x

0.287]|satisfactory
4:AvgInfoGain is [0.6]|bad
5:Default rule -> good

The classification rules generated by both classifiers prove our thesis that a noise level greater than 0.25 severely degrades the classification potential of a dataset. As expected, GAssist is able to generate more generic and comprehensible rules. For example, if the noise level is above 0.667, the classification potential is bad irrespective of other parameters. The knowledge extracted by both algorithms provides the same generalization. Hence, our proposed meta-model can be effectively used in determining the true classification potential of a biomedical dataset. We believe this can prove to be a very effective tool for analyzing the inherent complexities and needs for pre-processing the dataset.

Decision Tree of J48
Noise 0.26297 | MissingValues 0.002235: bad
6 Conclusion
In this paper, we have quantified the complexity of biomedical datasets in terms of missing values, noise, imbalance ratio and information gain. The effect of complexity on classification accuracy is evaluated using six well-known evolutionary rule learning algorithms. The results of our experiments show that GAssist provides, in most of the datasets, better classification accuracy than the other algorithms. Our analysis reveals that the classification accuracy of a biomedical dataset is, however, a function
142
A.K. Tanwani and M. Farooq
of the nature of the biomedical dataset rather than the choice of a particular evolutionary learner. The major contribution of this paper is a unique methodology to determine the classification potential of a dataset using a meta-model framework. In the future, we would like to present the generated rules of different classifiers to medical experts for their feedback.
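As an illustration of the kind of measures such a meta-model builds on, the following sketch computes a missing-value ratio, a class imbalance ratio, and the class entropy (an ingredient of information-gain computations) for a toy dataset. It is not the authors' code, and it omits the noise estimation, which the paper derives with a dedicated mislabeling procedure [21].

```python
import math
from collections import Counter

def complexity_measures(instances, labels):
    """Simple dataset-complexity measures: fraction of missing cells,
    majority/minority class ratio, and class entropy in bits."""
    n = len(instances)
    cells = n * len(instances[0])
    missing_ratio = sum(v is None for row in instances for v in row) / cells
    counts = Counter(labels)
    imbalance_ratio = max(counts.values()) / min(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return missing_ratio, imbalance_ratio, entropy
```

A meta-model then treats such measures, computed per dataset, as the attributes from which rules about classification potential are learned.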
Acknowledgements
The authors of this paper are supported, in part, by the National ICT R&D Fund, Ministry of Information Technology, Government of Pakistan. The information, data, comments, and views detailed herein may not necessarily reflect the endorsements or views of the National ICT R&D Fund.
Supply Chain Management Sales Using XCSR
María Franco, Ivette Martínez, and Celso Gorrin
Departamento de Computación y Tecnología de la Información
Universidad Simón Bolívar, Caracas, Venezuela
[email protected],
[email protected]
Abstract. The Trading Agent Competition in its Supply Chain Management category (TAC SCM) is an international forum where teams develop agents that control a computer assembly company in a simulated environment. TAC SCM involves the following problems: determining when to send offers, deciding the final sales prices of the goods offered, and planning the factory and delivery schedules. In this work, we developed a TAC SCM agent called TicTACtoe that uses Wilson's XCSR classifier system to decide the final sales prices. In addition, we developed an adaptation of this classifier system, which we call the blocking classifiers technique, that allows XCSR to be used in environments with single-step tasks and delayed rewards. Our results show that XCSR generates a set of rules that solves the TAC SCM sales problem in a satisfactory way. Moreover, we found that the blocking mechanism improves the performance of the agent in the TAC SCM scenario.
1 Introduction
Supply chain management embodies the management of all the processes and information that move along the supply chain, from the supplier to the manufacturer and on to the retailer and the final customer. Nowadays, supply chain management is one of the most important industrial activities, and planning the activities along the supply chain is vital to the competitiveness of manufacturing enterprises. According to [6], "while today's supply chains are essentially static, relying on long-term relationships among key trading partners, more flexible and dynamic practices offer the prospect of better matches between suppliers and customers as market conditions change". The Trading Agent Competition in Supply Chain Management (TAC SCM) [6] was designed to expose the participants to the typical challenges of a dynamic supply chain. These challenges include competing for the components provided by the suppliers, managing the inventory, transforming components into final products, and competing for the customers. These problems can be classified into three main problems: purchases, production and sales. Pardoe and Stone experimented with different learning techniques for the sales decisions of TAC SCM agents [11]. One of their main conclusions was that winning offers in TAC SCM is a very complex problem because the winning prices
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 145–165, 2010.
© Springer-Verlag Berlin Heidelberg 2010
may vary very quickly. Therefore, they argue that making decisions based on previous states of the current game is inaccurate, while using information taken from many previous games yields better results. The goal of this work is to present an approach to the TAC SCM problem using an evolutionary reinforcement learning system. We specifically use XCSR to solve one of the most important sales problems: pricing the products in order to compete in the market while maximizing the profit.
2 TAC SCM
The TAC SCM competition [6] was designed by a team of researchers from the eSupply Chain Management Lab at Carnegie Mellon University in collaboration with the Swedish Institute of Computer Science (SICS). In this contest, each team has to develop an intelligent agent capable of handling the main supply chain management problems (which orders to accept, deciding the sale price for products, competing in the market, among others). Agents compete against each other in a simulation that lasts 220 days and includes customers and suppliers to deal with. The main goal of the competitors is to maximize the final profit by selling assembled computers to the customers. The profit of an agent is calculated by subtracting the production costs from the income. This profit is reflected in the amount of money the agents have at the end of the game, which determines the winner. Each TAC SCM simulation has three actors: customers, who buy computers; manufacturers (agents), who produce and sell computers; and suppliers, who provide the unassembled components to the manufacturers. A detailed description of these actors can be found in [6]. At the beginning of each day, the agent receives "requests for quotes" (also known as RFQs) from the customers. Afterwards, the agent decides which RFQs should be accepted and what the final offer price should be. After sending the offers, the agent waits for the orders from the customers. Only the best-priced offers are accepted and turn into orders. If the agent receives an order, it decides when to produce and deliver it and, even more important, how many components it should buy to fulfil the production schedules. In order to buy the components, the agent sends the suppliers RFQs for the spare parts. In response, the suppliers send offers to the agent, which has to decide whether or not to accept them.
Each team competing in a TAC SCM game must develop a manufacturer agent that deals with the main decisions of supply chain management: how many components to buy, when to produce an order, and which RFQs to accept. Moreover, when accepting an RFQ, the agent should decide the final price for the goods. In this work, these three problems are referred to as the purchase problem, the production problem and the sales problem. More than 30 agents participate in this competition each year. Among the most successful solutions to the TAC SCM problem we found: TacTex-06,
PhantAgent and CMieux. TacTex-06 [12] is an agent that uses a prediction model trained with the Additive Regression with Decision Stumps algorithm [17] within its purchase strategy. In addition, this agent uses another prediction model for the sales strategy, based on the idea that the winning prices follow a normal distribution. Another interesting approach to the problem is presented by PhantAgent [13], which uses heuristics to solve the purchase and sales problems. Furthermore, the CMieux agent [2] uses a forecasting module that predicts the sales prices of the components and products for the following days. However, the code for these solutions was not available at the time, which encouraged us to create our solution to the problem from scratch. In our solution, we addressed the sales problem using an evolutionary reinforcement learning technique. The other problems were solved using simple static strategies in order to evaluate the impact of the learning system on the sales problem. Our approach is explained in more detail in Section 4.
3 XCSR
XCS is a Michigan-style learning classifier system first described by Wilson [14]. This system is based on the work proposed by Holland [9] but uses the accuracy instead of the payoff as a measure of "goodness" of a classifier. In our implementation we used XCSR [15], a version of XCS that accepts real numbers as inputs. To do this, the features in the condition are represented by lower and upper bounds, while the action remains discrete. We decided to use this approach because all the inputs of the decision we wanted to make were real-valued and the decisive thresholds needed to be found dynamically. We also chose XCSR because the rule system can constantly adapt to new environments using a fixed rate of exploration [3] and because the rules it generates are interpretable by human beings [10].
4 TicTACtoe
TicTACtoe is our approach to the TAC SCM problem. TicTACtoe has three modules: Purchase, Production and Sales (see Figure 1). Each module manages one of the sub-problems of supply chain management. Every module takes its own decisions using information taken from the environment and the other modules. In the next subsections, we will focus on the details of these modules. In addition to these modules, we provided the agent with memory through an organizer structure. This structure keeps track of: orders scheduled for production, possible order commitments, actually produced orders (production schedules may vary due to the lack of components) and possible future inventory (based on the component orders placed by the agent). This memory allows the agent to record the decisions it takes each day
Fig. 1. TicTACtoe Architecture
and to consider events that will happen in the future, which are used to make further decisions.
4.1 Purchases
The purchase module is in charge of sending RFQs to the suppliers in order to buy the necessary components for production. This module has two tasks: (a) creating the RFQs to get the current component prices and (b) deciding which supplier offers to accept.
Supplier RFQ creation. First, the agent calculates how many components are needed for production within the next ten days. These calculations are based on the current inventory, the orders scheduled for the next ten days and the component orders that have already been placed. The agent always sends the RFQ to its favorite supplier for that particular component, which is the one that has given the best prices lately. There is only one favorite supplier for each component. However, the agent also asks the other suppliers for their current prices in order to update the favorite supplier if necessary. The favorite supplier is preferred in order to get lower prices. This is based on the assumption that the state of a supplier does not change drastically: if a supplier gives an agent the best price, it will probably continue giving good prices for some time.
Accepting the offers. When the agent asks for components, the suppliers might not be able to comply with the agent's requirements. When a supplier is not able to deliver the products the agent asks for, it sends two types of adjusted offers instead: offers that vary the quantity and offers with a later due date. If this happens, the priority of TicTACtoe is to accept the complete offers first and then the ones that vary the quantity. Once an order is placed, the agent adds a record of the components' arrival to calculate the future inventory.
Furthermore, the agent keeps a historical record of the base price for each component. The component base price is calculated every day as a weighted average, as shown in equation (1):

P_d^c = S_d^c · w + P_{d−1}^c · (1 − w)    (1)

where P_d^c is the base price for component c on day d, S_d^c is the supplier's price for component c on day d, and w is a weighting constant.
4.2 Production
The production module is in charge of scheduling the production of the active orders (the orders that are waiting to be produced and delivered). This module prioritizes the active orders with sooner due dates and higher penalties (in case the orders are behind schedule). The agent loops over the orders, checking whether there is enough inventory of products to deliver them. If the agent has enough products, the order is delivered. This strategy is used by PhantAgent [13] to avoid extra storage charges. If there are not enough components to deliver the order, the agent verifies whether the order is beyond the latest possible delivery day (which is determined when the customer sends the RFQ). If it is already too late, the customer will not receive the order anymore, so the agent cancels it and frees all the associated components and products in order to use them to fulfil other orders. If there are not enough products to fulfil the order but the customer can still wait for it, the agent tries to produce it. To produce an order scheduled for a specific day, the agent checks whether there are enough components. When there are not enough components to produce the desired quantity, the agent produces the maximum quantity allowed. If the agent cannot produce an order completely, it continues producing it the next day. At the end of the day, the production module determines the number of late orders and the number of active orders. This information is used by the sales module to adjust the quantity of free cycles the agent can offer, which forces the agent to save cycles for late order production.
4.3 Sales
The sales module is in charge of pricing the products and dealing with the customers. Every day, this module checks the customer RFQs and sends offers to the ones that meet the following characteristics: (a) a reserve price higher than the product's base price and (b) a due date earlier than the end of the simulation. The agent calculates the base price of a product as the sum of the estimated prices of all its spare parts. This estimates how profitable the order would be. Afterwards, the agent uses the set of rules generated by XCSR to determine the discount factor over the reserve price of each RFQ. The reserve price is the maximum price a customer is willing to pay for an order. The agent that offers the lowest price wins the bid. The implementation of XCSR is explained in greater detail in Section 5.
The final offer price is determined by equation (2), where BasePrice is the calculated cost of the product based on recent experience, ReservePrice is the reference price determined by the customer, and d is the discount factor determined by the XCSR:

OfferPrice = BasePrice + Revenue · (1 − d)    (2)

Revenue = ReservePrice − BasePrice    (3)
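Equations (2) and (3) amount to interpolating between the base price and the reserve price. A minimal sketch (function and variable names are ours):

```python
def offer_price(base_price, reserve_price, d):
    """Eqs. (2)-(3): the offer slides from the reserve price (d = 0, no
    discount) down to the base price (d = 1, full discount on the revenue)."""
    revenue = reserve_price - base_price       # Eq. (3)
    return base_price + revenue * (1.0 - d)    # Eq. (2)
```

With d = 0 the agent asks for the customer's reserve price; with d = 1 it offers at cost.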
Once the agent calculates the offer price for each RFQ, a production schedule is generated that includes these possible orders; this helps the agent calculate how many free cycles are left for the production of further orders. The orders that involve higher revenues have higher priority. In order to save production cycles for future orders that would need to be delivered earlier, the agent always tries to produce an order as late as possible according to its due date. This strategy is very similar to the one used in [12]. If there are no free cycles, the agent checks the inventory to see whether there are enough products to deliver these orders the next day. If none of this is possible, the least profitable RFQs are discarded. Moreover, the daily free cycles are multiplied by a factor between 0 and 1, inversely proportional to the number of late orders that the agent has. This helps the agent get back on schedule by leaving some cycles for the production of late orders. Our agent remembers all the placed offers as possible commitments. However, customers only accept the best-priced offers. If a customer rejects an offer, the commitment is removed and all the associated components and cycles are released.
5 XCSR Inside TicTACtoe
One of the most important decisions in supply chain management is deciding the final price for the products. This price should be low enough to win the order and, at the same time, high enough to maximize the agent's profit. The decision taken by the XCSR is the final price discount the agent should offer to win the bid. This decision is taken inside the Sales Module by accessing the XCSR library through two methods. The first one introduces the current state of the environment, finds the match set and the action that should take effect. Moreover, it associates the action set with the RFQ, in order to reward it later. The second one rewards the action set and saves the error information to compute further population statistics.
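A minimal sketch of this two-call pattern (names are ours, and the classifier internals are elided): the first call parks the action set under the RFQ's identifier; the second retrieves it when the delayed reward finally arrives.

```python
class XCSRInterface:
    """Sketch of the Sales Module's access to the XCSR library: act() chooses
    a discount and remembers the action set; credit() rewards it days later."""

    def __init__(self, select_action):
        self.select_action = select_action  # the XCSR decision procedure
        self.pending = {}                   # rfq_id -> action set awaiting reward

    def act(self, rfq_id, state):
        action, action_set = self.select_action(state)
        self.pending[rfq_id] = action_set   # keep it for the delayed reward
        return action

    def credit(self, rfq_id, reward, update):
        action_set = self.pending.pop(rfq_id)
        update(action_set, reward)          # distribute the reward to the set
```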
5.1 Classifiers Structure
In the following sections, we explain the structure used to represent the TAC SCM sales problem using real inputs and discrete actions.
Condition. There are simulation values known by the agent that provide important information for its decisions. Including all of these values in the classifier structure would decrease the efficiency of the GA in terms of execution time. To avoid this, we selected the most important features for the decision we wanted to make. Preliminary experiments showed that the most suitable features for the classifier are:

x1: Rate of late orders (the active orders that are producing penalties because they are going to be delivered after the due date) over the total of active orders. This determines how much work is late and how convenient it is to make a good offer when the agent is already behind schedule.

x1 = lateOrders / totalOrders    (4)

x2: Rate of the factory cycles that remained unused the day before. This helps the agent determine whether it should raise or lower the price discount. For example, if the factory is full, the agent should give low discounts in order to try to finish its active orders before getting new ones.

x2 = freeFactoryCapacity / totalFactoryCapacity    (5)

x3: Rate of the base price over the reserve price indicated by the customer. This represents how profitable an order would be. The agent discards the cases in which the base price is higher than the reserve price.

x3 = basePrice / reservePrice    (6)

x4: The number of days between the current date and the day the order should be delivered. This indicates how much time the agent has to produce and deliver the order. This value is scaled between 0 and 1, considering that due dates are, at most, 12 days after the current date.

x4 = (dueDate − day) / 12    (7)

x5: The current day of the simulation, normalized by the maximum number of days a game has. This value is very important because different situations arise as the days go by. For example, in the middle of a simulation, components start to become scarce and their prices start rising. This feature helps the agent distinguish the different stages of the simulation that require specific behaviours.

x5 = day / 220    (8)

All the features are normalized between 0 and 1 in order to use these values as upper and lower bounds. This aspect is explained further in Section 6.1.
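Assuming hypothetical variable names, the five sensors above can be packed into a normalized vector along these lines:

```python
def sensor_vector(late_orders, total_orders, free_cycles, total_cycles,
                  base_price, reserve_price, due_date, day,
                  max_lead=12, game_length=220):
    """Features x1..x5 of Eqs. (4)-(8), each normalized to [0, 1]."""
    return [
        late_orders / total_orders,    # x1: share of late orders
        free_cycles / total_cycles,    # x2: unused factory capacity
        base_price / reserve_price,    # x3: cost over reserve price
        (due_date - day) / max_lead,   # x4: time left to deliver
        day / game_length,             # x5: stage of the simulation
    ]
```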
Action. Our implementation of XCSR has 10 actions that represent the different discounts over the possible revenue. The revenue is computed as the difference between the base price and the reserve price determined by the customers. The discounts go from 0% to 90% in 10% steps.

Reward. The reward is determined by the profit obtained through an RFQ, scaled by the amount of money that its fabrication implied when the agent sent the offer. There are three different scenarios in which the agent can reward an action set.

When the offer is not accepted by the customer. In this case the RFQ did not make the agent earn or lose any money, and the reward is zero.

When the order is delivered. In this case we consider the money earned by the sale and the money lost because of the penalty (if the order was delivered late). The profit and loss are scaled by the investment of the agent, which is calculated based on the base price of the product. The reward in this case is calculated using equation (9):

reward = (profit)² − (loss)²    (9)

profit = offeredPrice / basePrice    (10)

loss = max(day − dueDate, 0) · penalty / (basePrice · quantity)    (11)

When we scale the profit and the loss using the base price, we obtain a percentage value of the money earned with the order. One could think that a good approximation for the reward function is to subtract the expenses from the profit; but the net earnings are not the same for the different products, since products produce different earnings depending on their production costs. If we used the earnings as the reward, the rules that obtain the highest rewards would only be the ones that sell the most expensive products. However, we want to learn how to sell different types of products, not only the most expensive ones. Therefore, it is more appropriate to reward a classifier based on the profit margin.

When the order is cancelled without being delivered. In this case the agent did not produce the order on time, so the order only produced losses for the agent, due to the penalties the agent had to pay to the customer. The money invested in the production is not considered an expense, because the products can be used to fulfil another order. In this situation, the reward is calculated by equation (12), which is very similar to equation (9) without the term corresponding to the profit:

reward = −(loss)²    (12)

In equations (9) and (12), the terms corresponding to the profit and the loss are squared in order to give the classifier a stronger reward when these quantities are significantly greater.
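The three reward cases can be folded into a single function. A sketch with our own parameter names, following equations (9)-(12):

```python
def action_set_reward(accepted, delivered, offered_price, base_price,
                      day, due_date, penalty, quantity):
    """Reward for one RFQ: zero if the offer lost the bid, Eq. (9) if the
    order was delivered, Eq. (12) if it was cancelled."""
    if not accepted:
        return 0.0                               # offer rejected: no money moved
    loss = max(day - due_date, 0) * penalty / (base_price * quantity)  # Eq. (11)
    if not delivered:
        return -loss ** 2                        # Eq. (12): cancelled order
    profit = offered_price / base_price          # Eq. (10)
    return profit ** 2 - loss ** 2               # Eq. (9): delivered order
```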
6 Implementation Details
The XCSR library implementation is based on Butz's XCS library (version 1.0) [4]. We adapted this library to XCSR using a lower- and upper-bound notation as in XCSI [16], but allowing real values for the bounds. In the following subsections, we explain some characteristics of the system relevant to our implementation.
6.1 Don't Care
The don’t care in our library was implemented as the absence of lower or upper bound, depending on the allele we wanted to modify. To implement this don’t care, we had to put a restriction to the data: all the features should be bounded between 0 and 1. Putting a don’t care in an allele is equivalent to put 0 or 1, depending if it is a lower or an upper bound. In this way, we open the range to the maximum limit so the allele classifies all the states. 6.2
6.2 Classifier Subsumption
Since Butz’s library was oriented to boolean features, we had to implement other subsumption rules so they adapt to our classifier structure, where all the features are bounded between 0 and 1. The rules used were the same rules used by Wilson in [16], where a classifier is more general than another if all the ranges of the first classifier contain the second one. For example, (li , ui ) subsumes (lj , uj ), if ui > uj ∧ li < lj . The actions of the classifier should be the same for the subsumption to occur. 6.3
6.3 Crossover Operators
We implemented a restricted two-point crossover operator between conditional ranges that generates new individuals with valid ranges. This means that only the points between an upper bound and the next lower bound can be chosen; the crossover only exchanges full conditions, making it equivalent to the boolean two-point crossover operator. The ranges of the new individuals are always valid, because they are combinations of the parents' ranges.
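A sketch of the restricted operator: cut points fall only on the boundaries between (lower, upper) pairs, so whole intervals are swapped and the children inherit only valid ranges.

```python
import random

def two_point_crossover(cond_a, cond_b, rng=random):
    """Exchange the middle slice of whole (lower, upper) pairs between two
    parent conditions; cut points never split an interval."""
    cut1, cut2 = sorted(rng.sample(range(len(cond_a) + 1), 2))
    child_a = cond_a[:cut1] + cond_b[cut1:cut2] + cond_a[cut2:]
    child_b = cond_b[:cut1] + cond_a[cut1:cut2] + cond_b[cut2:]
    return child_a, child_b
```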
6.4 Additional Adaptations
Additional adaptations were necessary to include XCSR in our TAC SCM agent due to the characteristics of the problem.
Blocking classifiers. In our classifier system, the reward of an action set is given based on the amount of money the agent wins or loses when making the corresponding offer. This value depends merely on the discount given by the agent in the offer, which is why the problem is modelled as a single-step problem. Nevertheless, the agent only knows the reward a few days after making the decision. This differs from the classic problems used as benchmarks (e.g., the boolean multiplexer), in which the reward arrives immediately after applying an action.
Considering the delayed reward, it is necessary to save the action set associated with the order, so these classifiers are given a reward when the agent gets the final result. Since we are interested in continuing to learn while a classifier waits for its reward, classifiers are used in multiple learning iterations parallel to each other. This aspect of the online learning, in addition to the delayed reward, presents a new problem. The problem occurs when a classifier that is waiting for a reward is selected for deletion or subsumption. Since these mechanisms can be executed by any learning iteration, they could erase this classifier based on information that is not up to date. Consequently, the knowledge represented by this classifier and its upcoming rewards would be lost. In order to avoid deleting classifiers that expect a reward based on incomplete information, we implemented a simple counting semaphore. Each classifier has a counter that indicates the number of rewards it is expecting. A single classifier participates in many decisions each day and needs to wait for a reward for each one of them. Therefore, we only consider for deletion the classifiers that are not blocked, i.e., the ones whose counter is at zero. We also had to add another important restriction to the subsumption mechanism: a classifier cannot be subsumed while it is blocked, because its information is not entirely up to date to become part of another classifier. The blocked classifiers may still participate in all the other mechanisms, like crossover and mutation.
Dynamic population generation. The version of XCS used has a dynamic population generation method, in which the population starts empty. Each time the algorithm generates a match set, it inserts new classifiers into the population until all the actions are covered. In other words, the algorithm guarantees that there is at least one classifier for each possible action.
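The counting-semaphore bookkeeping of the blocking technique described above is a plain counter per classifier. A minimal sketch (attribute and method names are ours):

```python
class BlockingClassifier:
    """Counting-semaphore guard for delayed rewards: a classifier may be
    deleted or subsumed only when no reward is still outstanding."""

    def __init__(self):
        self.expected_rewards = 0  # number of pending delayed rewards

    def block(self):               # an offer using this classifier went out
        self.expected_rewards += 1

    def unblock(self):             # its delayed reward finally arrived
        self.expected_rewards -= 1

    def can_be_deleted(self):
        return self.expected_rewards == 0

    can_be_subsumed = can_be_deleted  # the same guard applies to subsumption
```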
If there is no classifier in the population for a specific action, covering is performed and the new classifier is inserted into the population. The advantage of this technique is that the population grows dynamically as the states occur in the experiment, covering the whole search space. On the other hand, the population has a limited size, and covering all the actions leads to the loss of old classifiers when new ones are inserted. Moreover, different groups of classifiers activate themselves in different stages of the simulation. Considering that the blocking of classifiers places restrictions on the deletion of the active individuals, this increases the probability of deleting classifiers that activate themselves in other stages of the simulation. In order to avoid deleting good but inactive classifiers in advanced execution stages, we use the population generation method only until the population reaches its size limit. After that, when covering is necessary, we only generate one rule with a random action. However, we continue inserting and deleting individuals when applying the genetic algorithm over the action set.
Variable ε-greedy action selection policy. The base library interleaves exploitation and exploration, rewarding the classifiers only during exploration and taking learning statistics only during exploitation. In our problem, the classifier system learns while the agent competes in a simulation.
Since all the decisions taken by the XCSR affect the final result, regardless of whether they were determined by exploration or exploitation, we changed the algorithm in order to reward the classifiers in both cases. Considering the dynamic characteristics of the simulation, we decided to use an ε-greedy action selection policy. This consists of selecting the best possible action with probability 1 − ε and exploring the rest of the time. However, we made a slight modification so that ε starts at 1 and decreases linearly until it reaches a threshold. This forces the system to explore more at the beginning of the simulation and less towards the end. When ε reaches the threshold, its value remains constant, allowing the agent to perform some exploration that facilitates its adaptation to changes in the simulation.
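The decay schedule can be sketched as follows (the floor value 0.1 is our assumption for illustration, not the paper's parameter):

```python
import random

def epsilon(day, game_length=220, floor=0.1):
    """Exploration rate: starts at 1 and falls linearly to a constant floor."""
    return max(1.0 - (1.0 - floor) * day / game_length, floor)

def select_action(predictions, day, rng=random):
    """ε-greedy: explore with probability ε, otherwise exploit the best action."""
    if rng.random() < epsilon(day):
        return rng.randrange(len(predictions))                        # explore
    return max(range(len(predictions)), key=predictions.__getitem__)  # exploit
```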
7 Experiments and Results
We designed three experiments to test the effectiveness of the proposed mechanisms. First, we tested the performance of the XCSR against other, static solutions. Afterwards, we analysed the impact of the blocking classifiers technique and the application of different exploration and exploitation rates. During the execution of the experiments, each agent plays separately against five dummy agents. The source code for these dummy agents can be found in [1]. In each experiment, the agents ran for 40 games. In the first game, the XCSR population is empty. At the end of each game, the population is saved, and at the beginning of the next game it is recovered and established as the initial population. All data presented in the following figures correspond to the last 25 games; the first 15 games were taken as the training stage. However, during these 25 games, the agent still performs some exploratory actions due to the dynamic nature of the simulation (see Section 6.4). The length of each game is 220 simulation days. Each simulation day lasts 5 seconds, since none of the agents needs more time to complete its daily actions. The performance of the agents was evaluated using two main performance measures: (a) Final result: the final amount of money in the agent's bank account. This indicates how much money the agent earned and how profitable its investments were. (b) Received orders: the number of orders placed by the customers. This value indicates the percentage of the market the agent served. It is directly linked to the decisions taken by the XCSR, because if the agent offers a better price, it receives more orders.
These agents come along with the TAC SCM library; they are used for testing purposes and use simple but coherent strategies to handle the different problems. The standard game parameters are 220 simulation days with a duration of 15 seconds each.
M. Franco, I. Martínez, and C. Gorrin
For these two performance measures, we applied non-parametric tests to determine whether the differences between the agents were significant. Since the variables under study take integer values, we cannot assume normality. Therefore, we applied the Kruskal-Wallis test [7] to determine whether there are significant differences between the agents. After that, we used the Wilcoxon test [7] to perform pair-wise comparisons between the agents. Additional performance measures are considered in some experiments to gain more observational insight into the performance of the agents: (a) Factory usage: the percentage of usage of the factory capacity. This value indicates how many factory cycles are used on average. It represents the productive capacity of the agents, which should be used to the maximum. (b) Penalties: the amount of money paid to the customers for late deliveries. This indicates how many late orders the agent had. (c) Interests: the amount of money paid to the bank for having a negative balance in the bank account. (d) Total income: the total amount of money earned by the agents, without considering the losses. (e) Component costs: the amount of money spent buying components. (f) Storage costs: the amount of money spent storing components to be used in future production. The component costs, storage costs, penalties and interests are represented as percentages of the total revenue, while the final result and the total income are given in US dollars. The combination of these measures with the main ones shows how effective the learning was, considering that we wanted to learn a discount strategy that maximizes the revenue of the agent by winning profitable and manageable orders. However, these measures are shown only in support of the two main ones; therefore, no statistical tests were performed over them.
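For illustration, the Kruskal-Wallis H statistic underlying the first test can be computed as below (a hand-rolled sketch without tie correction; in practice a statistics library such as SciPy's `kruskal` would be used):

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction, for illustration).

    Ranks all observations jointly, then compares the mean rank of each
    group; a large H means the groups are unlikely to share one
    distribution. Assumes no tied observations.
    """
    pooled = sorted(x for g in groups for x in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # joint ranks
    n = len(pooled)
    total = 0.0
    for g in groups:
        r_mean = sum(rank[x] for x in g) / len(g)
        total += len(g) * r_mean ** 2
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
```

The p-value is then obtained from a chi-squared distribution with (number of groups − 1) degrees of freedom.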
The parameters used in our implementation of XCSR for the calculation of price discounts are α = 0.1, β = 0.2, δ = 0.1, ν = 5, θGA = 25, ε0 = 10, θdel = 20, χ = 0.8, μ = 0.04, p# = 0.1, pI = 10.0, εI = 0, FI = 0.01, θsub = 20, θmna = 1, s0 = 0.05 and N = 1000. The meaning of these parameters is explained in [5]. Moreover, the sources of TicTACtoe can be found at http://www.gia.usb.ve/~maria/tictactoe.

7.1 Experiment 1: TicTACtoe Performance
The goal of this experiment is to compare three different strategies to determine the price discount: learning using XCSR (L-TicTACtoe), Random and Static. All these strategies were tested using the base version of TicTACtoe. In this experiment, we also compare L-TicTACtoe with the dummy agent provided by the server. The learning version of TicTACtoe, L-TicTACtoe, uses an exploitation rate (1 − ε) of 30% and a population size of 1000 classifiers with the blocking mechanism. This configuration was the most favourable according to Sections 7.2 and
7.3. Moreover, preliminary experiments [8] showed that a population size of 1000 produces the best results for this problem. The other versions of TicTACtoe involved are Random and Static. The first one decides the price discount randomly, while the second one gives a discount on day d as follows:

    discount(d) = 80%  if freeFactoryCapacity(d−1) > 80%
                  10%  if freeFactoryCapacity(d−1) < 5%
                  30%  in all other cases

where freeFactoryCapacity(d−1) is the percentage of free factory capacity on simulation day d − 1. These "naive" rules try to avoid factory saturation, raising prices every time the number of free factory cycles goes below 5%, and try to attract customers when this value goes over 80%. Figure 2(a) shows the global performance of the agents. These results clearly show that L-TicTACtoe outperforms Random and Static, with an average result twice as high as Random's and four times higher than the dummy agent's. Considering that these agents only differ in their pricing strategy, it is evident that a change in this strategy affects the global performance of the agent.
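The static rule translates directly into code (a sketch; the function name and the fractional capacity representation are ours):

```python
def static_discount(free_factory_capacity_prev):
    """Static pricing rule: discount on day d from day d-1 free capacity.

    Capacity is a fraction in [0, 1]; the discount is returned as a
    fraction as well.
    """
    if free_factory_capacity_prev > 0.80:
        return 0.80   # nearly idle factory: attract customers
    if free_factory_capacity_prev < 0.05:
        return 0.10   # nearly saturated factory: raise prices
    return 0.30
```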
Fig. 2. Comparison of the dummy agent and the TicTACtoe agent using different pricing strategies in terms of final result and received orders
Table 1 shows the p-values of the statistical comparisons among the agents. It shows that L-TicTACtoe is significantly better than the other solutions presented. Moreover, the learning agent performs better than the Random agent in 99.9% of the cases, supporting the statements above. Even though we could expect L-TicTACtoe to manage more orders than the other agents, Figure 2(b) reveals that Random and Static win more offers. However, Table 2 indicates that Random and Static deliver more orders late and therefore incur more penalties. These results show that the pricing strategy of these agents is less advantageous, because they commit to orders which they cannot deliver on time and are hence penalized. We can also observe in this table that Static gets negative interest; in other words, this agent had to pay the bank for having a negative balance in its bank
Table 1. Statistical comparison of the TicTACtoe agent using different pricing strategies. Column Kt shows the p-value for the Kruskal-Wallis test and column Wilcox. test shows the p-values for the Wilcoxon tests.

Final Result
Agent    Avg ± Std                     Kt       Wilcox. test (p-values)
                                                Random   Dummy    Static
L-Tic     17684004.68 ±  3827810.34    0.0000   0.0013   0.0000   0.0000
Random    10157938.28 ± 10313748.01             –        0.0016   0.0000
Dummy      4800379.28 ±  3909834.59             –        –        0.0000
Static   -17317410.84 ± 18120669.01             –        –        –

Received Orders
L-Tic     5453.96 ±  649.66            0.0000   0.0000   0.0000   0.0000
Random    6924.28 ±  593.06                     –        0.0000   0.0000
Dummy     2676.88 ±  344.13                     –        –        0.0000
Static    7841.80 ±  223.25                     –        –        –
Table 2. Results in terms of penalties, interest and factory usage of the TicTACtoe agent using different pricing strategies and the dummy agent

Agent         Penalties (US$)              Interest (US$)           Fact. usage (%)
L-TicTACtoe     412209.08 ±   703824.06    241286.56 ±  93755.58    69.12 ± 8.05
Random         5596709.56 ±  6221754.37     66187.76 ± 248242.79    85.56 ± 6.90
Static        10535156.52 ± 10403092.31   -781983.80 ± 594421.05    93.44 ± 1.53
Dummy           882527.44 ±  1539109.47     37324.28 ±  94207.14    34.56 ± 4.28
account. This indicates that the strategy taken by Static is deficient, because it incurs negative balances on most of the simulation days. On the other hand, L-TicTACtoe is the agent that earns the most interest from the bank and presents the lowest variance, which shows that this agent has a more stable behaviour in terms of bank account balances. Regarding factory utilization, we can see that the agents Random and Static achieve a higher factory utilization. High factory utilization suggests proficient management of the productive capacity. However, the penalties obtained by these agents demonstrate that they are surpassing their production capacity. L-TicTACtoe does not use the factory as much as these agents, but still presents a better solution to this problem because it efficiently served a considerable portion of the market. Finally, through this experiment we can confirm that the strategy used by L-TicTACtoe improves the global performance of our solution to the TAC SCM problem. Furthermore, the static and random strategies show poor results as a consequence of their incapacity to adapt to new situations. These
results indicate that we have accomplished the goal of applying an evolutionary rule learning technique inside the sales strategy of a TAC SCM agent.

7.2 Experiment 2: Classifier Blocking
In this experiment, we compare the performance of the TicTACtoe agent with and without the blocking classifiers technique described in Section 6.4. The agent that does not block classifiers allows erasing classifiers freely (regardless of whether they are waiting for a reward), while the other one preserves these classifiers. In order to keep the agents as similar as possible, both versions of TicTACtoe used an exploitation rate of 70% and a population size of 1000 classifiers. The goal of this comparison was to determine the impact of the technique. The results of this experiment will show whether this simplification leads to information loss when we continue learning without waiting for the rewards.
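The blocking technique amounts to exempting reward-pending classifiers from deletion. A minimal sketch (data structures and names are ours; the real XCSR uses fitness-weighted roulette-wheel deletion rather than the uniform choice used here):

```python
import random

def delete_candidates(population, blocked_ids):
    """Restrict deletion to classifiers that are not awaiting a reward.

    blocked_ids holds the ids of classifiers whose last action set has
    not yet been rewarded; they are exempt from deletion until the
    reward arrives (the blocking-classifiers technique).
    """
    return [cl for cl in population if id(cl) not in blocked_ids]

def delete_one(population, blocked_ids, rng=random):
    candidates = delete_candidates(population, blocked_ids)
    if not candidates:
        return None  # everything is blocked; skip deletion this step
    victim = rng.choice(candidates)  # uniform stand-in for the roulette wheel
    population.remove(victim)
    return victim
```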
Fig. 3. Comparison of the performance of TicTACtoe with and without the blocking classifiers technique in terms of final result and received orders
Table 3 shows that t-block (L-TicTACtoe with blocking) receives 312 more orders than t-noblock (L-TicTACtoe without blocking). This difference is small and not strong enough to support any claims about the performance of the agents, as shown in Table 3. However, Figure 3(a) shows that t-block more frequently obtains a better final result than t-noblock. According to Table 3, this difference is not statistically significant at a significance level of 0.05. However, we could say that t-block behaves better than t-noblock in 94.4% of the cases. This difference in the final balance is explained by the high penalties obtained by t-noblock, as shown in Table 4. These penalties indicate that this agent does not develop an appropriate set of rules to determine the final sale price for an RFQ. Moreover, t-noblock makes offers at very low prices for orders that have a high penalty and are very difficult to produce because of the lack of the required components. When this agent offers products at low prices, it obtains plenty of orders, but most of them do not represent a profitable portion of the market considering its penalties. Moreover, we can observe in Table 4 that t-noblock gets negative interest from the bank, while t-block gets positive interest. This implies that, on
Table 3. Statistical results from the comparison of the agents using and not using the blocking classifier technique. The columns W (p-val) show the p-value of the Wilcoxon test between agents.

           Final Result                          Received Orders
Agent      Avg ± Std                  W (p-val)  Avg ± Std          W (p-val)
t-block    15206827.28 ± 11654157.53  0.0567     7448.44 ± 278.09   0.3159
t-noblock   7860693.72 ± 14530705.27  –          7136.20 ± 726.64   –
Table 4. Results in terms of interest, penalties, component costs and storage costs of the TicTACtoe agent using and not using the blocking classifiers technique.

Agent      Interests (US$)    Penalties (US$)      Comp. costs (%)   Storage (%)
t-block    156981 ± 250208    4791314 ± 6357927    85.08 ± 3.26      1.13 ± 0.20
t-noblock  -35459 ± 408030    8083767 ± 7679835    85.07 ± 4.83      1.28 ± 0.28
average, the agent that does not block classifiers incurs debts, while the other agent maintains a positive balance in its bank account. This factor, in addition to the penalties, explains why in Figure 3(a) the agent t-noblock ends with less money than agent t-block. To determine the impact of the blocking technique, it is also important to analyze the experience of the XCSR system in each agent. The experience is a measure of classifier usage; it indicates how many times a classifier has been used. In Figure 4, we can observe that the mean experience of the population of t-block is higher than that of t-noblock. This pattern occurs because t-noblock allows erasing classifiers at any time, based on incomplete and inaccurate information: these rules are still waiting for a reward that would determine whether they performed well. Consequently, classifiers that could lead to good decisions are erased before the reward arrives, and their knowledge is completely lost. It is interesting to notice that the ratio between the average experience and the day of the simulation is approximately 0.05. This means that each classifier is used during at most 5% of the simulation. Considering that the simulation has 220 days, 5% corresponds to 11 days. Our explanation for this behaviour is that the generated rules are, in fact, detecting different stages of the simulation, and not all the classifiers are used in the same stages. The blocking classifier technique increases the global experience of the population and the probability of survival of possibly good sub-solutions. Nevertheless, the trade-off of using this mechanism is that the system could also block bad solutions, and the probability of erasing good rules that have not been activated gets higher. The results of this experiment show that agents using the blocking classifiers technique inside XCSR preserve important information in the classifiers.
This might lead to better performance in environments with single-step tasks and delayed rewards. As further work, more experimentation will be carried out to validate this hypothesis and to determine further advantages and disadvantages this mechanism could have.

Fig. 4. Mean experience of the XCSR population during 8800 days (40 simulations)

7.3 Experiment 3: Exploitation Rate
The aim of this experiment is to determine the best exploitation rate, or value of (1 − ε), for this particular problem. We tested the performance of the agent using different exploitation rates (0.9, 0.7, 0.5, 0.3) to determine which one is the most suitable for the problem that we are trying to solve. We also included the two extreme exploitation rates 0 and 1 as controls. Afterwards, we analyse the two most interesting cases and compare them with the results of their dummy competitors. The rest of the parameters of the algorithm and the agent remained the same. For this experiment we used a population of 1000 individuals and the blocking mechanism. The TicTACtoe and dummy agents involved in this experiment will be referred to as tx and dx respectively, where x stands for the final threshold exploitation rate, 1 − ε (see Section 6.4). Figure 5 shows the results according to the main measures of performance: the final result and the number of received orders. In Figure 5(a) we can observe that the agents with the smallest final bank balance at the end of the game are t0, followed by t100. The same behaviour can be observed in Figure 5(b). This means that constant exploration (t0), i.e., always giving the price discount
In this experiment we ran our base agent only with dummy competitors, using different policies each time.
Fig. 5. Comparison of the performance of TicTACtoe using different exploitation rates in terms of final result and received orders.
in a random manner, produces the worst results. On the other hand, pure exploitation (t100) does not achieve good performance either, because it is incapable of adapting to new environments. Agents that combine exploitation and exploration during the whole learning process obtain the best results, due to the dynamic characteristics of the environment. According to Table 5, the agents t0 and t100 are significantly worse than the rest of the agents in terms of both final result and received orders. It is worth noticing the curve in these two figures: it suggests that the exploration rate does, in fact, affect the strategy developed, and that a balance between exploitation and exploration is necessary to achieve good performance. In Figure 5(b) we can see that the agent that serves the most orders is t70. According to Table 5, there are no significant differences between t30, t50 and t70 in terms of the final result, but there are differences in terms of the orders. Moreover, the agent with the highest final result on average turns out to be t30. This situation is clarified by Table 6, which compares the performance of these two agents against their dummy competitors. Despite the efforts of t70 to serve the largest portion of the market, this agent incurs plenty of penalties for late deliveries. Moreover, although agent t30 does not receive as many orders as t70, this helps the agent to fulfil the orders it already has. In the end, t30 does not accumulate as many penalties as t70, producing a steadier behaviour (lower variance). We could say that agent t30 is learning how to handle a number of orders that minimizes the obtained penalties and maximizes the final revenue. We can also notice in this table that the implementation of TicTACtoe, no matter the exploitation rate used (t30 or t70), obtains a higher final revenue and handles a larger portion of the market than the dummy competitors.
Furthermore, there is also a difference in the behaviour of the two dummy agents, since the performance of the agents is relative to the competitors' behaviour. We can notice that agent t70 makes it more difficult for the competitor d70 to obtain customers. Regarding the factory usage, a good agent is considered to use its factory capacity as much as possible to complete orders [13]. This helps the agent to obtain higher revenues at the end of the game. Even though both configurations
Table 5. Statistical results from the comparisons of the agents using different values for the exploitation rate. Column Kt shows the p-value for the Kruskal-Wallis test and column Wilcox. test shows the p-values for the Wilcoxon tests.

Final Result
Agent   Avg ± Std                Kt       Wilcox. test (p-values)
                                          t30      t50      t70
t0       9235641 ±  3638978      0.0000   0.0000   0.0000   0.0084
t10     14543996 ±  2925682               0.0006   0.0021   0.0593
t30     17684005 ±  3827810               –        0.5768   0.6581
t50     17486431 ±  5252774               –        –        0.7437
t70     15206827 ± 11654158               –        –        –
t90     13201893 ±  5666869               0.0000   0.1823   0.0000
t100    10945370 ±  3575705               0.0000   0.0000   0.0000

Received Orders
t0      4226.76 ±  574.72        0.0000   0.0000   0.0000   0.0000
t10     5301.56 ±  674.03                 0.4320   0.0004   0.0000
t30     5453.96 ±  649.66                 –        0.0012   0.0000
t50     6398.00 ± 1374.57                 –        –        0.0002
t70     7448.44 ±  278.09                 –        –        –
t90     6284.28 ±  362.91                 0.0000   0.0303   0.0000
t100    4586.12 ±  660.45                 0.0000   0.0000   0.0000
Table 6. Comparisons between the agents using 30% and 70% exploitation rates in terms of penalties, factory usage and total income

Agent   Penalties (US$)        Factory Usage (%)   Total income (US$)
t30      412209 ±  703824      69.12 ± 8.05        108718063 ± 12611823
d30      882527 ± 1539109      34.56 ± 4.28         59104351 ±  6999544
t70     4791314 ± 6357927      91.16 ± 2.95        142639167 ±  6966361
d70      807530 ± 1301921      29.52 ± 5.23         49645853 ±  8701695
t30 and t70 have the same production strategy, t70 makes more use of these resources than t30. This behaviour is explained by the fact that agent t70 has more orders to attend to. Consequently, with respect to this performance measure, agent t70 learns a better strategy. Nevertheless, the production and purchase strategies are still very simple, which makes it harder for this agent to deliver its orders on time. Regarding the total income, we can notice that both TicTACtoe agents have incomes proportional to the number of received orders. Also, both agents have higher incomes than their competitors. This shows that the developed strategies give competitive prices relative to the cost of the products and do not offer the products below the production costs.
8 Conclusion
We designed and implemented a supply chain management agent for the TAC SCM problem. Our agent solves the production and purchase sub-problems using static strategies, while it solves the sales sub-problem using a dynamic strategy. The purchase strategy is based on the acquisition of components considering production commitments for the next simulation days. The production strategy is based on manufacturing goods while prioritizing orders according to their expected profits and due dates. In addition, we implemented a dynamic sales strategy built on Wilson's XCSR classifier system. Through the XCSR mechanism, we obtained a suitable set of rules for the TAC SCM sales problem. This set of rules worked better than the strategies used as controls. As our initial solution to the TAC SCM sales problem encountered an issue when handling delayed rewards in a single-step environment, we introduced a blocking classifier technique. We showed that the use of this technique yields more experienced populations and improves the quality of the generated strategies in this scenario. However, more experimentation needs to be carried out regarding this matter.
References
1. Trading agent competition - TAC SCM game description, http://www.sics.se/tac/page.php?id=13
2. Benisch, M., Sardinha, A., Andrews, J., Sadeh, N.: CMieux: adaptive strategies for competitive supply chain trading. In: ICEC 2006: Proceedings of the 8th International Conference on Electronic Commerce, pp. 47–58. ACM Press, New York (2006)
3. Bull, L.: Applications of Learning Classifier Systems. Springer, Heidelberg (2004)
4. Butz, M.: Illigal Java-XCS - LCS Web (2006)
5. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, p. 253. Springer, Heidelberg (2001)
6. Collins, J., Arunachalam, R., Sadeh, N., Eriksson, J., Finne, N., Janson, S.: The Supply Chain Management Game for the 2007 Trading Agent Competition, Pittsburgh, Pennsylvania (2006)
7. Conover, W.J.: Practical Nonparametric Statistics. John Wiley & Sons, Chichester (December 1998)
8. Franco, M., Gorrin, C.: Diseño e implementación de un agente de corretaje en una cadena de suministros en un ambiente simulado, Universidad Simón Bolívar (2007)
9. Holland, J.H.: Adaptation. In: Rosen, R., Snell, F.M. (eds.) Progress in Theoretical Biology IV, pp. 263–293. Academic Press, New York (1976)
10. Lanzi, P.: Learning classifier systems: then and now. Evolutionary Intelligence 1(1), 63–82 (2008)
11. Pardoe, D., Stone, P.: Bidding for customer orders in TAC SCM. In: Faratin, P., Rodríguez-Aguilar, J.-A. (eds.) AMEC 2004. LNCS (LNAI), vol. 3435, pp. 143–157. Springer, Heidelberg (2006)
12. Pardoe, D., Stone, P.: An autonomous agent for supply chain management. In: Adomavicius, G., Gupta, A. (eds.) Handbooks in Information Systems Series: Business Computing, vol. 3, pp. 141–172. Emerald Group (2009)
13. Stan, M., Stan, B., Florea, A.M.: A dynamic strategy agent for supply chain management. In: Proceedings of the Eighth International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pp. 227–232. IEEE Computer Society, Los Alamitos (2006)
14. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995)
15. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, p. 209. Springer, Heidelberg (2000)
16. Wilson, S.W.: Mining oblique data with XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 158–174. Springer, Heidelberg (2001)
17. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco (2005)
Identifying Trade Entry and Exit Timing Using Mathematical Technical Indicators in XCS

Richard Preen
Department of Computer Science, University of the West of England, Bristol, BS16 1QY, UK
[email protected]

Abstract. This paper extends current LCS research into financial time series forecasting by analysing the performance of agents utilising mathematical technical indicators both for environment classification and in selecting actions to be executed. It compares these agents with traditional models, which only use such indicators to classify the environment and exit at the close of the next day. It is proposed that XCS agents utilising mathematical technical indicators for exit conditions will not only outperform similar agents which close the trade at the end of the next day, but also result in fewer trades and consequently lower commissions paid. The results show that in four of five assets, agents using indicator exit conditions outperformed those exiting at the close of the next day, before commissions were factored in. After commissions are factored in, the performance gap between the two agent classes widens further.

Keywords: Computational Finance, Learning Classifier Systems, XCS.
1 Introduction

The primary objective of this paper is to extend the current research into the use of the XCS Learning Classifier System [28] within the domain of financial time series forecasting. Recent work (e.g., [9], [21], [13], and [24]) has demonstrated the successful application of XCS in this area. However, in each of these studies, agents are trained on daily price data to evolve trade entry rules composed of mathematical technical indicators in conjunction with a fixed rule to close the trade the following day, i.e., the exit timing is not evolved. It is posited that by utilising mathematical technical indicators to identify the timing of the market exit, as opposed to simply exiting on the next day, not only are the associated transaction costs reduced, but the excess returns are also increased, owing to an inherent noise reduction: less prediction accuracy is required. Initially, several XCS agents are produced to replicate the traditional model and demonstrate its application to financial time series forecasting. In extending this work, the agents additionally evolve mathematical technical indicators to identify appropriate exit conditions. These two models are then compared, and the agents are furthermore benchmarked against a buy-and-hold strategy to evaluate whether market-beating excess returns can be generated.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 166–184, 2010. © Springer-Verlag Berlin Heidelberg 2010
Brock, Lakonishok and LeBaron [4] investigated two of the most popular trading rules from technical analysis (moving averages and trading-range breakout) on the Dow Jones Industrial Average over the period 1897-1986. The rules generated typical returns of 0.8% over a 10-day period, compared to a normal 10-day upward drift of 0.17%. After buy signals were generated, the market increased at a rate of 12% per year; following sell signals, a decrease of 7% per year was noted. Subsequently, Detry and Grégoire [6] successfully replicated the results of the moving-average tests on a series of formally selected European indexes. Moreover, technical analysis has been shown to be useful in the foreign exchange markets by Dooley and Schaffer [8], Sweeney [25], Levich and Thomas [12], Neely et al. [15], Dewachter [7], Okunev and White [16], and Olson [17]. The primary benefit of using mathematical technical indicators in financial time series forecasting is that the algorithms are precisely defined. This means that the signals they produce are free from errors of subjective human judgement and emotion, are replicable, and can easily be tested over large amounts of data and varying assets to quantify performance. Learning Classifier Systems (LCS) [10] can easily co-evolve different combinations of these indicators to form entry/exit rules for financial trading, and even evolve the technical indicators themselves.
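For concreteness, the simplest variable-length moving-average rule of the kind Brock et al. tested can be sketched as follows (window lengths are illustrative, not those of the original study):

```python
def sma(prices, n):
    """Simple moving average of the last n prices."""
    return sum(prices[-n:]) / n

def ma_signal(prices, short=5, long=20):
    """Buy when the short SMA is above the long SMA, sell otherwise."""
    return "buy" if sma(prices, short) > sma(prices, long) else "sell"
```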
2 Related Work

There has been widespread research on Artificial Neural Networks (ANN) and Genetic Programming (GP) for financial time series forecasting. GP examples include Neely et al. [15], Allen and Karjalainen [1], and Chen [5]. Examples of ANN forecasting of financial time series include Tsibouris and Zeidenberg [25], Steiner and Wittkemper [23], Kalyvas [11], and Srinivasa, Venugopal and Patnaik [22]. In contrast to ANN and GP, comparatively little research has been conducted into the use of LCS for financial time series forecasting. Early examples of LCS research in this area include Beltrametti et al. [2] using LCS to predict currencies, and Mahfoud and Mani [14] and Schulenburg and Ross ([18], [19], and [20]) predicting stocks. More recently, Stone and Bull [24] created a single-step ZCS [27] agent to forecast long or short positions on the Foreign Exchange (FX) market, trading with the full amount of the balance each time. The architecture was modified by utilising the NewBoole update mechanism, tweaking the covering algorithm, and introducing a new specialize operator. The agent was required to always be in the market. Daily price and interest rate data were used, covering the period of January 1974 to October 1995 for the U.S. Dollar (USD), German Deutsche Mark (DEM), British Pound (GBP), Japanese Yen (JPY), and Swiss Franc (CHF). These were then used to create the currency pairs USDGBP, USDDEM, USDCHF, USDJPY, DEMJPY, and GBPCHF. The mathematical technical indicators used were based on four primitive functions of the time series, which could return the average price over a specified period, the minimum price over a specified period, the maximum price over a specified period, or the price at a specified day. ZCS was used to generate the indicators, where an indicator is a ratio of two of the primitive functions. For example, a log indicator:
lag(4)/max(10) with a range [0.032, 0.457] and an action of 1 translates to 'go long if the price 4 days ago is 1.033 to 1.579 times the maximum price over the past 10 days'. The genetic search took place on the range and historical-period parameters. Crossover was applied by switching the period parameters; for example, the two initial indicators lag(8)/max(22) and min(12)/avg(40) result in the two indicators lag(12)/max(40) and min(8)/avg(22). Mutation was then used to modify the range and period parameters in the normal way. An 8-bit encoding was used, which limited parameters to the range [0, 255]. The reward given was based on the additional return of the next day's price over any interest potentially accrued on the margin. Commission was set at 2.5 basis points, and therefore an action taken could be directionally correct even though it produced a negative return; hence, a fixed reward of 1000 was given only for actions resulting in positive returns. The ZCS agent produced Annual Percentage Rate (APR) excess returns on 5 out of 6 currency pairs. Additionally, the number of ZCS runs with positive excess returns correlated well with the mean excess return achieved. However, while it was found that it is possible to achieve excess returns with ZCS, the performance (using the derived mathematical technical indicators) was not as good as a Genetic Programming benchmark. The most likely reason for this was the ZCS agent's high trading frequency and associated costs. It is suggested that this is due to the single-step model, and that using a multi-step model could reduce the frequency. However, it seems more likely that the reason is the requirement that the agent always have a presence in the market. The rationale for this requirement is unclear, particularly since a major advantage private traders have over institutions is being able to stay out of the market until the exact moment when a high-probability opportunity occurs.
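The log-indicator example can be made concrete as follows (a sketch; the primitive-function names are ours and the original 8-bit encoding is omitted):

```python
import math

def lag(prices, k):
    """Price k days before the current (last) day."""
    return prices[-1 - k]

def max_over(prices, n):
    """Maximum price over the past n days."""
    return max(prices[-n:])

def log_indicator_signals_long(prices, lo=0.032, hi=0.457):
    """Go long when log(lag(4)/max(10)) falls inside [lo, hi].

    Since exp(0.032) ~ 1.033 and exp(0.457) ~ 1.579, the condition matches
    the reading above: the price 4 days ago must be 1.033 to 1.579 times
    the 10-day maximum.
    """
    value = math.log(lag(prices, 4) / max_over(prices, 10))
    return lo <= value <= hi
```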
Moreover, the technical indicators used were extremely primitive. If the indicators had imposed tougher constraints on providing an entry signal, this could easily have reduced the trading frequency and perhaps provided superior performance.

Gershoff [9] investigated a hierarchical configuration of XCS agents (HXCS). Here, agents took inputs from technical indicators and attempted to learn profitable rules to trade the market data provided. The three mathematical technical indicators used were Rate of Change (ROC), Simple Moving Average (SMA), and Relative Strength Index (RSI). The HXCS comprised four Micro Agents: an RSI Agent, a Volume Agent, a Random Agent, and a Constant Agent. After viewing a state, each Micro Agent produced an action signal (‘0’ or ‘1’ to buy or sell). The signals were then sent to an Aggregate Agent and treated as votes for the action. The signal that received the majority of the vote was designated the aggregate signal for the collection of Micro Agents. From the votes, the Aggregate Agent could deduce the competitiveness of the vote for each action (i.e., a confidence value). The confidence value was then used either to simply select the action, or as an indication to a Meta Agent of whether the aggregate signal won the vote by more than a specified threshold. The Meta Agent received the set of majority signals and confidence indicators, and produced an action signal to execute in the environment.
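A minimal sketch of the Aggregate Agent's majority vote described above; the exact confidence formula of [9] is not given in the text, so the share of the vote won by the majority signal is used as a stand-in.

```python
def aggregate(signals):
    """Majority-vote a list of '0'/'1' Micro Agent signals.

    Returns the aggregate signal and a confidence value, here taken
    to be the fraction of agents voting for the winning signal
    (an assumption; [9] may define confidence differently).
    """
    ones = signals.count('1')
    majority = '1' if ones > len(signals) / 2 else '0'
    votes_for_majority = ones if majority == '1' else len(signals) - ones
    confidence = votes_for_majority / len(signals)
    return majority, confidence
```

A Meta Agent would then compare `confidence` against a threshold to decide whether to act on the aggregate signal.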
Identifying Trade Entry and Exit Timing Using Mathematical Technical Indicators
169
The payoff (or feedback) given to the agents for executing a particular action was decided by whether the following day’s price closed above or below the current day’s. A payoff of zero was awarded for executing a wrong action and a constant non-zero value for executing a correct action. The agents were assessed using daily price data for IBM, Exxon, Ford, CitiGroup, Coca-Cola, and Banco Santander Central Hispano. The training period ran from January 1990 to December 2003, and an evaluation phase then took place on data from January 2005 to June 2006. The results found that the Meta Agents usually outperformed the individual technical agents and that the Micro Agents could not outperform both the buy-and-hold and bank strategies. Further, the Meta Agent always outperformed the Random Agent. However, in terms of accuracy, the Meta Agents performed the same as or worse than the Micro Agents. In summary, the major finding of this model was that a hierarchical XCS using multiple agents can produce better results than a single-agent XCS. The fact that the Meta Agents always outperformed the Random Agent also illustrates that the system is capable of learning useful rules, even though in this case they were not able to outperform the relevant real-world benchmarks.

Schulenburg and Wong [21] explored portfolio allocation using an HXCS. Agents received inputs from technical indicators and attempted to learn profitable rules to trade the market data provided. In addition to a Technical Analysis (TA) Agent, a Market (Mkt) Agent and an Options Agent were created to provide further information to the decision-making process. The TA Agent incorporated rules based upon inputs from the following four mathematical technical indicators: Rate of Change (ROC), Relative Strength Index (RSI), Ultimate Oscillator (ULTOSC), and On Balance Volume (OBV).
The Mkt Agent integrated rules from the following four general market indicators: the daily percent return of the S&P 500 Index, the daily S&P 500 Index volume, the daily 10-year T-note yield, and the daily 3-month T-bill yield. The Options Agent included rules from the following five options-market indicators: Delta (the sensitivity of an option’s value to the underlying stock price), Gamma (the second-order sensitivity of the option’s value to the underlying stock price), Vega (the sensitivity of the option’s value to the stock-price volatility), Theta (the sensitivity of the option’s value to the passage of time), and implied volatility (the stock-volatility estimate given by the Black-Scholes formula). The daily stock data tested was for CitiGroup, IBM, General Motors, Eastman Kodak, and Exxon Mobil over the period 4th January 1996 to 28th April 2006. A commission fee of 0.5% of the transaction value was set.

In contrast to Gershoff’s HXCS, the agents attempted to predict the movement of tomorrow’s stock price and the percentage of total wealth to invest, instead of just buy or sell signals. The agents were given the choice between investing in the risky stock and investing in safe treasury bills, which returned a variable interest rate based on real-world values. The input data from the indicators was first divided into nine discrete cut points by using leave-one-out cross-validation. The target series then underwent two phases of discretization. The first phase quantized the data using the unsupervised method of histogram equalisation in order to add class-label information to the target series. Subsequently, the supervised method of entropy-based discretization was used to split
the series into intervals in order to maximise the information gain. Once quantization had been completed, a binary vector was mapped to the intervals so that it could be used by an XCS agent. Next, the cumulative performance of the Meta Agent was evaluated. If its prediction accuracy was less than the specified threshold value, all agents (including the Meta Agent itself) were destroyed and a new set of agents with a new discretization was launched. The new set of cut points was based on the preceding ten days. All new agents then started their training phases by exploring the new training environment; after completing training they were placed back into the real-world environment.

The best results of the agents were compared against four benchmarks: buy and hold, bank, price trend, and a Random Agent. In the case of CitiGroup, all of the XCS agents outperformed all four of the benchmark agents. Moreover, on all five stocks, all XCS agents outperformed the Random Agents. The authors suggest that there is a mere 0.00003% probability that this occurred by chance and that it provides solid proof that stock prices have a rational component. Further, the XCS agents discovered a ‘famous 1960s trading rule’1. This last discovery highlights one of the major benefits of using XCS (as opposed to alternatives such as an ANN) to forecast financial time series: having the rules in an easily human-readable form enables the researcher to evaluate the logic of any discovered rule and decide whether it makes sense. This is important because, if a rule does not make any logical sense to a trader, it is quite possible that the rule has been derived from over-fitting the data, and its future use is questionable. Interestingly, in contrast to Gershoff’s findings, the Meta Agents here did not perform well in comparison to the single agents: in 3 of the 5 stocks, the Meta Agents underperformed all three of the single agents.
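The first, unsupervised phase of the discretization above (histogram equalisation, i.e., equal-frequency binning) can be sketched as follows; the exact cut-point procedure of [21], including the leave-one-out cross-validation and the entropy-based second phase, is more involved than this minimal stand-in.

```python
from bisect import bisect_right

def equal_frequency_cuts(series, n_bins):
    """Interior cut points so that each bin holds roughly equal counts
    (a simple histogram-equalisation sketch)."""
    s = sorted(series)
    return [s[len(s) * k // n_bins] for k in range(1, n_bins)]

def discretize(value, cuts):
    """Bin index of `value` given ascending cut points."""
    return bisect_right(cuts, value)
```

Each bin index could then be mapped to a short binary vector, as the text describes, for consumption by an XCS agent.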
If we are to use the best results as indicative of performance (as suggested by [21]), this provides mixed information on the effectiveness of HXCS as opposed to standard XCS agents.

Liu and Nagao [13] conducted a further assay of the application of HXCS to financial time-series forecasting. Here, performance was evaluated on the prediction accuracy of the direction of the next day. Two Meta Agents were used, with their binary perceptions set solely according to comparisons between various moving averages. The moving averages used were of the form MA(t,m), where the average is calculated from time t back to time t-m. Agent1 consisted of a bitstring of length 24, where each bit was set according to the evaluation of 24 pairs of successive moving averages with an interval length of 20. Agent2 consisted of a bitstring of length 18, where the first 6 moving averages used an interval length of 10 and a further 12 moving averages used an interval length of 5; e.g., bit18 is set to logical 1 if MA(t-4,5) < MA(t,5). Furthermore, a fuzzy matching mechanism was used in which a classifier is said to have matched the environment state even if up to 10% of the bits are non-matching. For each environment state received by the HXCS, each Meta Agent receives the input, constructs a match set, and then calculates an average prediction value for the set. The agent with the highest average match-set prediction value is then chosen to advocate
1 If the ultimate oscillator is greater than 70, and the previous stock price change is within 2 to 3%, then tomorrow’s stock price change will be -2.5 to -3.5%.
an action in the normal XCS procedure, and parameters are updated for the action set of that Meta Agent. Experiments were run on four indices (NIKKEI, NASDAQ, TOPIX, HSI) and 11 other stocks selected from the NIKKEI, using daily closing-price data from January 2000 to December 2004. The direction hit-rate of both Meta Agents was always superior to that of a trend-following strategy that predicted the direction of the next day based on the change from the previous day. In addition, the HXCS outperformed the Meta Agents by 2-3%. For example, the trend-following strategy correctly predicted the direction 56% of the time for the NASDAQ, whereas Agent1 was correct 66.9% of the time, Agent2 70.8%, and the HXCS 73.8%.
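The moving-average bit construction and fuzzy matching described above can be sketched as follows; treating the MA(t,m) window as inclusive of both endpoints, and measuring the 10% tolerance over the condition's specified (non-'#') bits, are both assumptions.

```python
def moving_average(prices, t, m):
    """MA(t,m): mean of prices from time t-m through t (inclusive window
    is an assumption; [13] does not fix the endpoint convention here)."""
    return sum(prices[t - m:t + 1]) / (m + 1)

def fuzzy_match(condition, state, tolerance=0.10):
    """A ternary condition matches the state even if up to `tolerance`
    of its specified bits disagree, per the fuzzy mechanism of [13]."""
    specified = [(c, s) for c, s in zip(condition, state) if c != '#']
    if not specified:
        return True
    mismatches = sum(c != s for c, s in specified)
    return mismatches / len(specified) <= tolerance
```

For instance, bit18 of Agent2 would be set to '1' whenever `moving_average(prices, t - 4, 5) < moving_average(prices, t, 5)`.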
3 Learning Framework

Perhaps the biggest limitation shared by [9], [21], [13], and [24] is that they all attempt to use daily data to forecast the next day’s price. Since “the accuracy of agents’ predictions depends largely on how well the problem is represented” [21], we should adopt an approach that mimics how real trading is conducted as closely as possible. Figure 1 shows the daily price chart of the EURUSD currency pair, with the vertical dotted line in the centre marking August 15th 2007. At the close of this day, the Relative Strength Index indicator set to 14 periods (i.e., calculating the RSI over the previous fourteen daily open, high, low, close bars), RSI(14), produces a value of 31.2109. For Agent 1 in [9], this value would set bit6 (RSI(14)≤35) to ‘1’. On the following day the price closed lower (at 1.3426) than its open (of 1.3442). Supposing that the agent had identified this rule as part of a buy signal, it would have resulted in a loss under the model and negative feedback would have been given.
Fig. 1. EURUSD Daily Price Chart 01.08.2007 – 31.08.2007
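The RSI input discussed above can be computed as in the following sketch; this is the simple-average variant (Wilder's exponential smoothing is omitted for brevity), so values may differ slightly from those produced by charting packages.

```python
def rsi(closes, period=14):
    """Simple-average Relative Strength Index over the last `period` bars."""
    gains, losses = [], []
    for prev, cur in zip(closes[-period - 1:-1], closes[-period:]):
        change = cur - prev
        gains.append(max(change, 0.0))
        losses.append(max(-change, 0.0))
    avg_gain = sum(gains) / period
    avg_loss = sum(losses) / period
    if avg_loss == 0:       # no down days: RSI saturates at 100
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

A reading such as 31.2 would set bit6 (RSI(14)≤35) of Agent 1 in [9] to '1'.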
However, if we look at the bigger picture in Figure 2, we can see that this would in fact have been an excellent place to enter the market. In Figure 2 the vertical dotted line highlights the same day as in Figure 1, but illustrates that the EUR continued to climb in value against the USD during the months following the RSI signal. Clearly, this method of evaluating and providing feedback to the model is far too short-sighted and demands far too much accuracy. Real traders utilise Stop Losses (SL): triggers set a certain distance from the entry price that exit the market at a loss. This value exists in part because markets are infamous for swaying noisily whilst actually moving towards a logical target (as in the drunken-man analogy). Furthermore, most real traders would never attempt to predict the closing price of the next bar (e.g., the next day when using daily data), because that demands far too much accuracy within a widely acknowledged noisy system. They would simply exit the market at their SL, or attempt to exit the market in profit at some multiple of the initial risk (i.e., the SL). Through such a method, successful traders can lose half, or more, of their trades whilst still finishing profitably.
Fig. 2. EURUSD Daily Price Chart 01.08.2007 – 30.11.2007
If the models are intended to replicate real traders, we must adopt a more real-world approach. Such an approach must avoid pre-specifying the exact bar at which to exit the trade and provide feedback. One approach commonly used in real trading is to define the exit conditions in terms of fixed price levels. For example, if the agent discovers a buy signal, the SL is set $5 below the entry price and a Take Profit (TP) (i.e., a price level at which a trade is considered a winner and profit is taken) is set $10 above the entry price. A more sophisticated technique would be to search the combinations of SL and TP to find the optimal pair in addition to the entry signal. However, this might easily lead towards curve-fitting the model too specifically to the training set.
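The fixed SL/TP exit just described can be sketched as follows; checking the stop before the target within each daily bar is a conservative assumption, since the intrabar ordering of highs and lows is unknown from daily data.

```python
def simulate_long(entry, bars, sl=5.0, tp=10.0):
    """Walk forward through (high, low) bars after a long entry.

    Returns the profit per share: -sl if the stop is hit first,
    +tp if the target is hit first, otherwise the distance to the
    final bar's midpoint (a rough proxy for an open position).
    """
    stop, target = entry - sl, entry + tp
    for high, low in bars:
        if low <= stop:        # stop-loss triggered (checked first: conservative)
            return -sl
        if high >= target:     # take-profit triggered
            return tp
    high, low = bars[-1]
    return (high + low) / 2 - entry
```

With this scheme a trader risking $5 to make $10 can lose more than half of all trades and still finish profitably, as noted above.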
Perhaps the most widely used method of identifying when to exit a trade is the same as that used to enter the trade in the first place: technical analysis. For example, if a rule to buy an asset is ‘if RSI(14) < 30 then buy’, the corresponding exit rule is ‘if RSI(14) > 70 then exit’. This risks complicating the model and exponentially increasing the search space, but it is the only way to provide a real-world measurement of success.
4 Implementation

4.1 Data

The data used is the daily price/volume information over the period February 3rd 1992 to December 14th 2007 for Exxon Mobil Corp. (XOM) (Figure 3.a), the Dow Jones Industrial Average (DJI) (Figure 3.b), General Motors Corporation (GM) (Figure 3.c), and Intel Corp. (INTC) (Figure 3.d). In addition, data over the period December 26th 1991 to December 14th 2007 is used for 30-Year Treasury Bonds (TYX) (Figure 3.e). These were chosen to include one index (DJI), two ranging assets (GM and INTC), one falling asset (TYX), and one rising asset (XOM). Moreover, the assets represent diverse market sectors: automobiles, technology, bonds, oil, and an index average. For DJI, the adjusted closing price is divided by 1000 to enable the agents to purchase shares with a balance of $10,000 or less. In all cases, 4000 data points (i.e., days) are used: the first 3000 form a training set used to evolve new rules, and the most recent 1000 are used as a trading set to evaluate those rules.

4.2 XCS

The traditional ternary representation is used, with the environment inputs discretized as outlined in the following sections. A fixed reward of 1000 is given to profitable actions and 0 to actions which result in no profit or a loss. The XCS parameters used are as follows (taken from [3] and not further optimised, so as not to bias the results used to compare the models): α=1, β=0.2, δ=0.1, θGA=25, θdel=20, θsub=20, P#=0.6, v=5, χ=0.8, ε0=10, μ=0.04. Each agent is shown the training set only once before being evaluated on the trading set. The alternation between exploring and exploiting rules is modified as in [21]. Running this schedule over 1000 iterations (i.e., the length of the trading set) produced a range of 896 to 932 exploit steps; thus, over 1000 iterations, exploits are conducted approximately 89.6-93.2% of the time.
This produces an increasing bias towards exploiting the knowledge acquired as the rules become more evolved, which is important since the system will perform a single pass through the data.
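The exact explore/exploit equation of [21] is not reproduced above. As a purely illustrative stand-in, a hypothetical exploit probability p(t) = t/(t + 25) produces an expected number of exploit steps over 1000 iterations (roughly 908) that falls inside the reported 896-932 range, while rising towards pure exploitation as the run progresses:

```python
import random

def p_exploit(t):
    """Hypothetical increasing exploit probability (NOT the equation of [21])."""
    return t / (t + 25.0)   # rises from ~0.04 at t=1 towards 1.0

def choose_mode(t, rng=random):
    """Stochastically pick explore or exploit on iteration t."""
    return "exploit" if rng.random() < p_exploit(t) else "explore"

# Expected exploit steps over the 1000-step trading set: ~908.
expected = sum(p_exploit(t) for t in range(1, 1001))
```

Any schedule with this shape gives the same qualitative effect: early iterations favour exploration while later iterations overwhelmingly exploit the evolved rules.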
Fig. 3. Daily Adjusted Closing Price data used in experimentation: (a) XOM, (b) DJI, (c) GM, (d) INTC, (e) TYX
4.3 Agent 1 - Entries

Agent 1 utilises three stochastic indicators with the periods (8, 3, 3), (32, 12, 12), and (128, 48, 48). The (8, 3, 3) configuration was chosen simply because it is the most commonly used; the two subsequent combinations are each four times greater, thereby providing short-term, intermediate-term, and long-term trends. The direction of the stochastic indicators and their position (i.e., the value between 0 and 100) are used to classify the environment. The signal line was used for the (8,3,3) parameters to smooth the line and reduce noise, whereas the (32,12,12) and (128,48,48) main lines are already sufficiently smooth. The real-numbered indicators are discretized through a simple mechanism. A 9-bit binary string is composed where the first two bits classify the (8,3,3) signal line’s position, the third and fourth bits classify the (32,12,12) main line’s position, and the fifth and sixth bits classify the (128,48,48) main line’s position. The indicator-to-binary encoding for each indicator’s position is summarised in Figure 4.

| Indicator | 0-24 | 25-49 | 50-74 | 75-100 |
| Binary    | 00   | 01    | 10    | 11     |

Fig. 4. Indicator Value to Binary Encoding
Lastly, three bits are used to classify the direction of each of the stochastic lines, as in Figure 5.

Bit7 = ‘1’ if Stochastic (8,3,3) current signal line > previous signal line, else ‘0’
Bit8 = ‘1’ if Stochastic (32,12,12) current main line > previous main line, else ‘0’
Bit9 = ‘1’ if Stochastic (128,48,48) current main line > previous main line, else ‘0’

Fig. 5. Agent 1 Encoding
4.4 Agent 2 - Entries

The second agent is a trend-following agent comprised mostly of Exponential Moving Averages (EMA). 20-, 50-, and 100-period EMAs are constructed. Each EMA’s direction (i.e., rising or falling) and the position of the current price relative to the EMA (i.e., above or below) are used to classify the environment. In addition, the direction of the Moving Average Convergence Divergence (MACD) (12, 26, 9) main line and the direction of the Stochastic (32, 12, 12) main line provide additional trend information. The encoding is summarised in Figure 6.
Bit1 = ‘1’ if EMA (20) current > EMA (20) previous, else ‘0’
Bit2 = ‘1’ if EMA (50) current > EMA (50) previous, else ‘0’
Bit3 = ‘1’ if EMA (100) current > EMA (100) previous, else ‘0’
Bit4 = ‘1’ if price current > EMA (20) current, else ‘0’
Bit5 = ‘1’ if price current > EMA (50) current, else ‘0’
Bit6 = ‘1’ if price current > EMA (100) current, else ‘0’
Bit7 = ‘1’ if Stochastic (32,12,12) current main line > previous main line, else ‘0’
Bit8 = ‘1’ if MACD (12,26,9) current main line > previous main line, else ‘0’

Fig. 6. Agent 2 Encoding
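The EMAs underlying Agent 2's bits follow the standard recurrence with smoothing factor 2/(N+1); seeding the series with the first price is an assumption, as the paper does not state its initialisation.

```python
def ema(prices, n):
    """Exponential moving average: EMA[t] = a*p[t] + (1-a)*EMA[t-1],
    with a = 2/(n+1), seeded with the first price (assumed)."""
    alpha = 2.0 / (n + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out
```

Bits 1-3 then compare `ema(prices, n)[-1]` against the previous value, and bits 4-6 compare the current price against the current EMA.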
4.5 Agent 3 - Entries

Agent 3 is the first agent (Tt1) from [18]. The agent consists of comparisons between the current price and the previous price, a series of Simple Moving Averages (SMA), and the highest and lowest prices observed. The environment bit string consists of 7 binary digits, encoded as in Figure 7.

Bit1 = ‘1’ if price current > price previous, else ‘0’
Bit2 = ‘1’ if price current > 1.2 x SMA(5), else ‘0’
Bit3 = ‘1’ if price current > 1.1 x SMA(10), else ‘0’
Bit4 = ‘1’ if price current > 1.05 x SMA(20), else ‘0’
Bit5 = ‘1’ if price current > 1.025 x SMA(30), else ‘0’
Bit6 = ‘1’ if price current > highest price, else ‘0’
Bit7 = ‘1’ if price current < lowest price, else ‘0’

Fig. 7. Agent 3 Encoding
4.6 Agent Exits

There are three sets of exit conditions for each agent. Firstly, there is the traditional model where the next day is used as the only exit condition, meaning that any trade entered today is exited at tomorrow’s closing price. In addition, there are two sets of technical-indicator exit conditions: a simple set with only 4 exit conditions (see Figure 8) and a more advanced set comprising 16 exit conditions (see Figure 9). To keep the current study simple, the agents were only allowed to buy or hold; selling was not permitted. In both the 4- and 16-exit sets, one of the actions causes the agent to move to the next day without trading (i.e., hold for one day), where reward is given if the price remained unchanged or decreased. The executable actions in the set of four are:

1. Do not enter any trades today (i.e., hold for one day).
2. Buy today and exit when MACD (12,26,9) decreases.
3. Buy today and exit when EMA (20) decreases.
4. Buy today and exit when both MACD (12,26,9) and EMA (20) decrease.

Fig. 8. Four Technical Exit Conditions
This is implemented by moving forward each day in the index and comparing the indicator values with the exit conditions (as would happen in live trading). When a match is found, the result of the action is calculated, the balance updated, and reward given. The comparison of the indicator values was implemented by individually checking each rule. This was done for simplicity and to ensure that the rules were functioning correctly. However, with a bigger set of exit conditions to test (since every applicable combination is tested), one would assign bits to each condition in the same manner in which the environment conditions are constructed, and any invalid actions (e.g., EMA (20) cannot be rising and falling simultaneously) would be removed by forcing XCS to choose another action. The executable actions in the set of sixteen are:

1. Do not enter any trades today (i.e., hold for one day).
2. Buy today and exit when MACD (12,26,9) decreases.
3. Buy today and exit when EMA (20) decreases.
4. Buy today and exit when Stochastic (32,12,12) decreases.
5. Buy today and exit when EMA (50) decreases.
6. Buy today and exit when MACD (12,26,9) and EMA (20) decrease.
7. Buy today and exit when MACD (12,26,9) and Stochastic (32,12,12) decrease.
8. Buy today and exit when MACD (12,26,9) and EMA (50) decrease.
9. Buy today and exit when EMA (20) and Stochastic (32,12,12) decrease.
10. Buy today and exit when EMA (20) and EMA (50) decrease.
11. Buy today and exit when Stochastic (32,12,12) and EMA (50) decrease.
12. Buy today and exit when MACD (12,26,9), EMA (20), and Stochastic (32,12,12) decrease.
13. Buy today and exit when MACD (12,26,9), EMA (20), and EMA (50) decrease.
14. Buy today and exit when MACD (12,26,9), Stochastic (32,12,12), and EMA (50) decrease.
15. Buy today and exit when EMA (20), Stochastic (32,12,12), and EMA (50) decrease.
16. Buy today and exit when EMA (20), Stochastic (32,12,12), EMA (50), and MACD (12,26,9) decrease.

Fig. 9. Sixteen Technical Exit Conditions
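The forward exit scan described above might look like the following sketch, where an exit set is a list of indicator series that must all have decreased on the same day; the helper names and the fixed-1000 reward follow the scheme given in Section 4.2, but the representation is an assumption for illustration.

```python
def scan_exit(entry_day, closes, exit_set):
    """Walk forward from `entry_day` and close the trade on the first day
    on which every indicator series in `exit_set` is lower than its
    previous value. Returns (exit_day, profit, reward), or None if the
    trade is still open at the end of the data."""
    for day in range(entry_day + 1, len(closes)):
        if all(series[day] < series[day - 1] for series in exit_set):
            profit = closes[day] - closes[entry_day]
            reward = 1000 if profit > 0 else 0   # fixed reward scheme (Sect. 4.2)
            return day, profit, reward
    return None
```

Action 4 of Figure 8, for example, would pass the MACD and EMA(20) series together as the exit set.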
5 Experimentation

Tables 1 to 5 present a comparison between the agents using the next day as the exit condition, 4 technical-indicator exits, and 16 technical-indicator exits. Each agent starts with an initial balance of $10,000. The results presented are the best run and the average run of 100 experiments. The highest-performing result in each category is highlighted in bold. The results from the experiments comparing the next-day-exit agents with the agents using technical-indicator exit conditions, after being shown the training set
only once (Tables 1-5), show that for XOM, the agent with the highest balance ($25,648.75) and highest average balance ($15,899.56) was Agent 2 with 16 technical indicator exits. For DJI, Agent 1 with 4 technical indicator exits produced the highest balance ($15,120.46) and Agent 3 with 16 technical indicator exits achieved the highest average balance ($12,102.06). For INTEL, Agent 2 with 4 technical indicator exits produced the highest balance ($21,000.59) and the highest average balance ($10,522.50). In the case of GM, again Agent 2 with 4 technical indicator exits produced both the highest balance ($20,116.72) and the highest average balance ($9,645.54). Lastly, for TYX, Agent 1 with next-day-exit conditions produced both the highest balance ($15,671.20) and highest average balance ($11,389.56).

The results show that in all cases except TYX, an agent using technical indicator exits was superior to exiting at the next day, for both the highest achievable balance and the average balance over its experiments. Moreover, since commissions are not factored in at this stage, it is highly likely that the gap between the two agent classes would widen further.

Table 1. XOM

| Agent | Best ($) | Average ($) |
| Agent 3: Next Day Exit | 16,568.02 | 13,518.73 |
| Agent 2: Next Day Exit | 17,015.35 | 12,863.05 |
| Agent 1: Next Day Exit | 18,085.78 | 13,815.44 |
| Agent 3: 16 Technical Exits | 25,648.60 | 15,442.76 |
| Agent 2: 16 Technical Exits | 25,648.75 | 15,899.56 |
| Agent 1: 16 Technical Exits | 22,883.49 | 15,849.93 |
| Agent 3: 4 Technical Exits | 16,133.73 | 14,825.81 |
| Agent 2: 4 Technical Exits | 21,105.34 | 13,823.89 |
| Agent 1: 4 Technical Exits | 19,904.95 | 14,224.36 |
| Buy and Hold | 24,634.00 | 24,634.00 |
Table 2. DJI

| Agent | Best ($) | Average ($) |
| Agent 3: Next Day Exit | 13,180.21 | 11,314.48 |
| Agent 2: Next Day Exit | 13,664.05 | 11,338.99 |
| Agent 1: Next Day Exit | 12,782.90 | 11,280.55 |
| Agent 3: 16 Technical Exits | 14,589.01 | 12,102.06 |
| Agent 2: 16 Technical Exits | 14,068.26 | 11,835.86 |
| Agent 1: 16 Technical Exits | 14,443.68 | 12,027.56 |
| Agent 3: 4 Technical Exits | 13,701.04 | 11,975.34 |
| Agent 2: 4 Technical Exits | 14,664.57 | 11,868.51 |
| Agent 1: 4 Technical Exits | 15,120.46 | 12,033.45 |
| Buy and Hold | 12,918.69 | 12,918.69 |
Table 3. INTEL

| Agent | Best ($) | Average ($) |
| Agent 3: Next Day Exit | 12,672.98 | 9,512.07 |
| Agent 2: Next Day Exit | 14,240.27 | 9,727.86 |
| Agent 1: Next Day Exit | 13,476.69 | 9,731.87 |
| Agent 3: 16 Technical Exits | 12,889.49 | 8,391.51 |
| Agent 2: 16 Technical Exits | 13,736.25 | 8,860.61 |
| Agent 1: 16 Technical Exits | 15,759.57 | 8,481.99 |
| Agent 3: 4 Technical Exits | 16,511.56 | 9,504.32 |
| Agent 2: 4 Technical Exits | 21,000.59 | 10,522.50 |
| Agent 1: 4 Technical Exits | 16,568.16 | 9,924.76 |
| Buy and Hold | 8,894.74 | 8,894.74 |
Table 4. GM

| Agent | Best ($) | Average ($) |
| Agent 3: Next Day Exit | 13,505.11 | 8,251.02 |
| Agent 2: Next Day Exit | 14,324.42 | 7,927.37 |
| Agent 1: Next Day Exit | 16,789.67 | 8,579.46 |
| Agent 3: 16 Technical Exits | 15,605.10 | 8,827.06 |
| Agent 2: 16 Technical Exits | 18,114.27 | 9,254.52 |
| Agent 1: 16 Technical Exits | 17,338.24 | 9,153.40 |
| Agent 3: 4 Technical Exits | 15,804.40 | 9,226.62 |
| Agent 2: 4 Technical Exits | 20,116.72 | 9,645.54 |
| Agent 1: 4 Technical Exits | 14,565.23 | 8,362.22 |
| Buy and Hold | 5,970.25 | 5,970.25 |
Table 5. TYX

| Agent | Best ($) | Average ($) |
| Agent 3: Next Day Exit | 14,180.51 | 10,959.06 |
| Agent 2: Next Day Exit | 14,297.20 | 10,730.10 |
| Agent 1: Next Day Exit | 15,671.20 | 11,389.56 |
| Agent 3: 16 Technical Exits | 12,773.89 | 10,010.81 |
| Agent 2: 16 Technical Exits | 12,503.13 | 9,632.41 |
| Agent 1: 16 Technical Exits | 12,047.33 | 9,815.09 |
| Agent 3: 4 Technical Exits | 11,346.18 | 9,870.72 |
| Agent 2: 4 Technical Exits | 14,297.84 | 10,014.32 |
| Agent 1: 4 Technical Exits | 12,260.75 | 9,936.21 |
| Buy and Hold | 9,227.80 | 9,227.80 |
Table 6. t-Stats of Tech Exits vs. Next Day (N.D.) exits. Two-Sample Assuming Unequal Variances. Results in bold are statistically significant at the 95% confidence level.

| Stock | Agent 1, 4 Ex. vs. N.D. | Agent 1, 16 Ex. vs. N.D. | Agent 2, 4 Ex. vs. N.D. | Agent 2, 16 Ex. vs. N.D. | Agent 3, 4 Ex. vs. N.D. | Agent 3, 16 Ex. vs. N.D. |
| XOM | 1.90 | 6.48 | 3.15 | 9.40 | 4.10 | 5.80 |
| DJI | 5.60 | 6.19 | 3.73 | 4.05 | 3.73 | 5.82 |
| INTEL | 0.86 | -6.20 | 3.61 | -4.06 | -0.04 | -5.72 |
| GM | -0.69 | 1.93 | 5.13 | 4.09 | 2.73 | 1.87 |
| TYX | -8.34 | -9.60 | -4.08 | -6.90 | -7.96 | -6.30 |
However, in the case of TYX, the best-performing agent was Agent 1 with next-day-exit conditions. Furthermore, all next-day-exit agents surpassed the technical-indicator-exit agents in terms of both highest balance and average balance, showing that for some assets next-day exits can be best. Introducing commissions would likely reduce this gap, and the technical-indicator-exit agents might then even surpass the next-day-exit agents. Nevertheless, the fact that the next-day-exit agents beat the technical-indicator exits is perhaps explained by the split between the training and trading sets, since the training set for TYX primarily decreases while the trading set moves in a sideways range.

Table 6 presents the t-Stats for the three agent types, comparing exiting at the close of the next day with both the 4 and 16 technical-indicator exit sets. Almost all of the results are statistically significant at the 95% confidence level. In particular, for XOM and DJI, all agents utilising technical-indicator exits surpassed the same agents exiting at the close of the next day, and these results
were statistically significant. Additionally, Agent 2 with 4 indicator exits provided statistically significant, superior results compared to exiting at the close of the next day in all cases except TYX.

Finally, comparing the best-performing agents with a buy-and-hold strategy, we observe that for INTEL, GM, and TYX, all of the agents using technical-indicator exits beat this strategy. Further, the best-performing agents on all assets were always able to beat the buy-and-hold balance; however, the averages of the agents’ balances did not. Should commissions be introduced (the cost would vary from broker to broker), these results compared to a buy-and-hold strategy would deteriorate to some extent. However, the agents’ average balances only outperformed a buy-and-hold strategy when the stocks declined. An explanation for this is that when an agent exits the market wrongly, although there is no actual loss, there is an opportunity cost, because the market increases and the agent underperforms its benchmark. Thus, stocks which generally decline over the period analysed are much easier to beat, because the agents can choose to be in or out of the market, while stocks that are generally rising are much harder to beat.

Table 7 shows the average number of trades executed over 100 tests of each asset by Agent 2. Again, the agent is shown the training set only once before being assessed on the trading set. The table shows that when using 4 technical-indicator exits, the agent always trades fewer times than with next-day-exit conditions; this difference is statistically significant (as shown in Table 8). In some cases 40% fewer trades are executed, which would result in substantial transaction-fee savings. When utilising 16 technical-indicator exits, Agent 2 trades a similar number of times to the agents using next-day-exit conditions.
This is a result of adding more exit conditions, which increases the probability of closing the trade after a short period of time. Thus, the 16 technical-indicator-exit agents tested do not offer any transaction-fee savings in comparison to the traditional model.

Table 7. Average Number of Trades Executed by Agent 2

| Agent 2 | XOM | DJI | INTEL | GM | TYX |
| Next-day-exit | 243.25 | 267.20 | 266.83 | 154.37 | 160.89 |
| 4 Tech. Exits | 164.84 | 170.74 | 168.30 | 136.14 | 105.82 |
| 16 Tech. Exits | 241.17 | 255.23 | 255.55 | 144.69 | 158.54 |
Table 8. t-statistics of the number of trades executed by Agent 2 with technical exits vs. next-day (N.D.) exits. Two-sample test assuming unequal variances. Results in bold are statistically significant at the 95% confidence level.

Asset   4 Tech. Exits vs. N.D.   16 Tech. Exits vs. N.D.
XOM     4.63                     0.13
DJI     5.51                     0.51
INTEL   6.60                     0.60
GM      1.98                     1.36
TYX     3.58                     0.13
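The comparisons in Table 8 are two-sample t-tests assuming unequal variances (Welch's test). As a sketch of the statistic being reported (our own minimal implementation, not code from the paper):

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """t-statistic of a two-sample t-test assuming unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances
    return (mean(sample_a) - mean(sample_b)) / (va / na + vb / nb) ** 0.5

# Illustrative use on two made-up samples of per-test trade counts:
print(welch_t([243.0, 244.0, 243.5], [164.0, 165.0, 165.5]))
```

In the paper's setting, each sample would hold the trade counts observed over the 100 tests of one configuration.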
182
R. Preen
6 Conclusions

Agents utilising mathematical technical indicators for the exit conditions outperformed similar agents which used the next day as the exit condition in all cases except for TYX (30-Year Treasury bond), even before taking commissions into account, which would penalise the most active agents (i.e., the agents using next-day-exit). Moreover, these results were achieved with generic XCS parameters that were not tuned to improve performance. The anomalous TYX result is attributable either to the position of the cut-off point between the training and trading sets, or to the TYX data being inherently noisier than the other assets, which were all stocks. The cut point in this asset is particularly important because it resulted in a training set which primarily declined and a trading set that ranged sideways. Thus, the agents would have adapted rules to trade within this downward environment but were not prepared for the environment within which they were assessed.

An analysis of the number of trades executed by each agent showed that, on average, 31.73% fewer trades were executed when using 4 technical indicator exit conditions; this would result in substantial transaction savings and further boost the performance of these agents in comparison to the agents using next-day-exit conditions. However, the agents using 16 mathematical technical indicator exits traded with approximately the same frequency as the agents using next-day-exit conditions. This was a result of having more rules with different exit conditions that could be triggered, so the agents were closing trades with greater frequency.
Identifying Trade Entry and Exit Timing Using Mathematical Technical Indicators
Appendix: Mathematical Technical Indicators

Simple Moving Average: SMA(N)
SMA_t = (Close_t + Close_{t-1} + ... + Close_{t-N+1}) / N
where Close is the closing price being averaged and N is the number of days in the moving average.

Exponential Moving Average: EMA(N)
EMA_t = Close_t · K + EMA_{t-1} · (1 − K)
where K = 2/(N+1), N is the number of days in the EMA, Close_t is today's closing price, and EMA_{t-1} is yesterday's EMA.

Moving Average Convergence Divergence: MACD(a,b,c)
MACD main line = EMA(a) − EMA(b)
MACD signal line = EMA(c)
where EMA(c) is an exponential moving average of the MACD main line.

Stochastic Oscillator: Stochastic(FastK, SlowK, SlowD)
Stochastic main line: Stoch_t = Stoch_{t-1} + (Fast − Stoch_{t-1}) / SlowK
Stochastic signal line: Sig_t = Sig_{t-1} + (Stoch_t − Sig_{t-1}) / SlowD
where Stoch_t is today's stochastic main line; Stoch_{t-1} is yesterday's; Fast = 100 · (Close_t − L) / (H − L); Close_t is today's closing price; L is the lowest low price over the last FastK days; and H is the highest high price over the last FastK days.
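The averaging formulas above translate directly into code. The sketch below is ours; in particular, seeding the EMA with the first closing price is an assumption, since the appendix does not specify an initial value:

```python
def sma(closes, n):
    """Simple moving average of the most recent n closing prices."""
    return sum(closes[-n:]) / n

def ema(closes, n):
    """Exponential moving average with smoothing factor K = 2/(N+1).
    Seeded with the first close (an assumption; conventions vary)."""
    k = 2.0 / (n + 1)
    value = closes[0]
    for close in closes[1:]:
        value = close * k + value * (1.0 - k)  # EMA_t = Close_t*K + EMA_{t-1}*(1-K)
    return value
```

The MACD lines follow by composing `ema` over two window lengths and then smoothing the difference.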
On the Homogenization of Data from Two Laboratories Using Genetic Programming
Jose G. Moreno-Torres1, Xavier Llorà2, David E. Goldberg3, and Rohit Bhargava4
1
Department of Computer Science and Artificial Intelligence, Universidad de Granada, 18071 Granada, Spain
[email protected] 2 National Center for Supercomputing Applications (NCSA) University of Illinois at Urbana-Champaign 1205 W. Clark Street, Urbana, Illinois, USA
[email protected] 3 Illinois Genetic Algorithms Laboratory (IlliGAL) University of Illinois at Urbana-Champaign 104 S. Mathews Ave, Urbana, Illinois, USA
[email protected] 4 Department of Bioengineering University of Illinois at Urbana-Champaign 405 N. Mathews Ave, Urbana, Illinois, USA
[email protected] Abstract. In experimental sciences, diversity tends to hinder the proper generalization of predictive models across data provided by different laboratories. Thus, training on a data set produced by one lab and testing on data provided by another usually results in low classification accuracy. Despite the same protocols being followed, variability in the measurements can introduce unforeseen variations that affect the quality of the model. This paper proposes a Genetic Programming based approach, in which a transformation of the data from the second lab is evolved, driven by classifier performance. A real-world problem, prostate cancer diagnosis, is presented as an example in which the proposed approach was capable of repairing the fracture between the data of two different laboratories.
1
Introduction
The assumption that a properly trained classifier will be able to predict the behavior of unseen data from the same problem is at the core of any automatic classification process. However, this hypothesis tends to prove unreliable when dealing with biological data (or data from other experimental sciences), especially when such data are provided by more than one laboratory, even if the laboratories follow the same protocols to obtain them. This paper presents an example of such a case: a prostate cancer diagnosis problem where a classifier built using the data of the first laboratory performs very accurately on the test data from that same laboratory, but comparatively poorly on the data from the second one.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 185–197, 2010. © Springer-Verlag Berlin Heidelberg 2010

It is assumed that this behavior is due to a fracture between the data of the two laboratories, and a Genetic Programming (GP) method is developed to homogenize the data. We consider this method a form of feature extraction, because the new dataset is constructed from new features which are functional mappings of the old ones.

The method presented in this paper attempts to optimize a transformation of the data from the second laboratory in terms of classifier performance. That is, the data from the second lab are transformed into a new dataset on which the classifier, trained on the data from the first lab, performs as accurately as possible. If the performance achieved by the classifier on this new, transformed dataset is equivalent to that obtained on the data from the first lab, we consider the data homogenized.

More formally, the classifier f is trained on data from one laboratory (dataset A), such that y = f(x_A) is the class prediction for one instance x_A of dataset A. For the data from the other lab (dataset B), it is assumed that there exists a transformation T such that f(T(x_B)) is a good classifier for instances x_B of dataset B. The 'goodness' of the classifier is measured by the loss function l(f(T(x_B)), y), where y is the class associated with x_B and l(·,·) is a measure of distance between f(T(x_B)) and y. The aim is to find a transformation T such that the average loss over all instances in B is minimized.

The remainder of this paper is organized as follows: Section 2 presents some preliminaries about the techniques used and some approaches to similar problems in the literature. Section 3 describes the proposed algorithm. Section 4 details the real-world biological dataset that motivates this paper.
Section 5 includes the experimental setup, along with the results obtained, and an analysis. Finally, some concluding remarks are made in Section 6.
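The objective formalized above, finding a T that minimizes the average loss of f(T(x_B)) over dataset B, can be sketched as follows. The classifier, transformation, and data here are illustrative stand-ins, not the paper's actual components:

```python
def zero_one_loss(pred, label):
    """l(pred, label): 0 when the prediction matches the class, else 1."""
    return 0.0 if pred == label else 1.0

def average_loss(f, T, dataset_B):
    """Mean loss of the fixed classifier f on transformed instances of B."""
    total = sum(zero_one_loss(f(T(x)), y) for x, y in dataset_B)
    return total / len(dataset_B)

# Toy example: f predicts class 1 when the single feature exceeds 0.5,
# and dataset B's feature values are shifted by +1 relative to A's.
f = lambda x: 1 if x[0] > 0.5 else 0
dataset_B = [([1.2], 0), ([1.8], 1), ([1.1], 0), ([1.9], 1)]
identity = lambda x: x
shift = lambda x: [x[0] - 1.0]               # one candidate transformation T
print(average_loss(f, identity, dataset_B))  # 0.5: classifier fails on raw B
print(average_loss(f, shift, dataset_B))     # 0.0: loss repaired by T
```

The evolutionary search described in Section 3 explores the space of such transformations T, scoring each by this average loss (equivalently, by classifier accuracy).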
2
Preliminaries
This section is organized in the following way: In Section 2.1 we introduce the notation used in this paper. We then include a brief summary of work on feature extraction in Section 2.2, and a short review of the approaches we found in the specialized literature on the use of GP for feature extraction in Section 2.3.
2.1
Notation
When describing the problem, datasets A, B and S correspond to:
– A: The original dataset, provided by the first lab, that was used to build the classifier.
– B: The problem dataset, from the second lab. The classifier is not accurate on this dataset, which is what the proposed algorithm attempts to remedy.
– S: The solution dataset, the result of applying the evolved transformation to the samples in dataset B. The goal is for classifier performance on this dataset to be as high as possible.
2.2
Feature Extraction
Feature extraction is one form of pre-processing, which creates new features as functional mappings of the old ones. An early proposer of the term was probably Wyse in 1980 [1], in a paper about intrinsic dimensionality estimation. Multiple techniques have been applied to feature extraction over the years, ranging from principal component analysis (PCA) to support vector machines (SVMs) to GAs (see [2,3,4], respectively, for some examples). Among the foundational papers in the literature, Liu's book in 1998 [5] is one of the earlier compilations of the field. A workshop held in 2003 [6] led Guyon & Elisseeff to publish a book with an important treatment of the foundations of feature extraction [7].
2.3
Genetic Programming-Based Feature Extraction
Genetic Programming (GP) has been used extensively to optimize feature extraction and selection tasks. One of the first contributions in this line was the work published by Tackett in 1993 [8], who applied GP to feature discovery and image discrimination tasks. We can consider two main branches in the philosophy of GP-based feature extraction:
1. On one hand, there are proposals that focus only on the feature extraction procedure, of which there are multiple examples: Sherrah et al. [9] presented in 1997 the evolutionary pre-processor (EPrep), which searches for an optimal feature extractor by minimizing the misclassification error over three randomly selected classifiers. Kotani et al.'s work from 1999 [10] determined the optimal polynomial combinations of raw features to pass to a k-nearest neighbor classifier. In 2001, Bot [11] evolved transformed features, one at a time, again for a k-NN classifier, utilizing each new feature only if it improved the overall classification performance. Zhang & Rockett [12], in 2006, used multiobjective GP to learn optimal feature extraction in order to fold the high-dimensional pattern vector into a one-dimensional decision space where the classification would be trivial. Lastly, also in 2006, Guo & Nandi [13] optimized a modified Fisher discriminant using GP, and Zhang & Rockett [14] later extended their work by using a multiobjective approach to prevent tree bloat.
2. On the other hand, some authors have chosen to evolve a full classifier with an embedded feature extraction step. As an example, Harris [15] proposed in 1997 a co-evolutionary strategy involving the simultaneous evolution of the feature extraction procedure along with a classifier. More recently, Smith & Bull [16] developed a hybrid feature construction and selection method using GP together with a GA.
2.4
Finding and Repairing Fractures between Data
Among the proposals to quantify the fracture in the data, we would like to mention the one by Wang et al. [17], where the authors present the idea of
correspondence tracing. They propose an algorithm for discovering changes in classification characteristics, based on a comparison between two rule-based classifiers, one built from each dataset. Yang et al. [18] presented in 2008 the idea of conceptual equivalence as a method for contrast mining, which consists of discovering discrepancies between datasets. More recently, it is important to mention the work by Cieslak and Chawla [19], which presents a statistical framework to analyze changes in data distribution that result in fractures between the data. The fundamental difference between the mentioned works and this one is that we focus on repairing the fracture by modifying the data itself, using a general method that works with any kind of data fracture, whereas they propose methods to quantify the fracture that work only under certain conditions.
3
A Proposal for GP-Based Feature Extraction to Homogenize Data from Two Laboratories
The problem we are attempting to solve is the design of a method that can create a transformation from a dataset (dataset B) on which a classification model built using the data from a different dataset (dataset A) is not accurate, into a new dataset (dataset S) on which the classifier is more accurate. Said classifier is kept unchanged throughout the process. We decided to use GP to solve the problem for a number of reasons:
1. It is well suited to evolve arbitrary expressions because its chromosomes are trees. This is useful in our case because we want maximum flexibility in terms of the functional expressions of these transformations.
2. GP provides highly interpretable solutions. This is an advantage because our goal is not only to have a new dataset where the classifier works, but also to analyze what the problem was in the first dataset.
Once GP was chosen, we needed to decide which terminals and operators to use, how to calculate the fitness of an individual, and which evolutionary parameters (population size, number of generations, selection and mutation rates, etc.) are appropriate for the problem at hand.
3.1
Solutions Representation: Context-Free Grammar
The representation of the solutions was achieved by extending GP to evolve more than one tree per solution. Each individual is composed of n trees, where n is the number of attributes present in the dataset. We are trying to produce a new dataset with the same number of attributes as the old one, since this new dataset needs to be fed to the existing model. In the tree structure, the leaves are either constants (we use the Ephemeral Random Constant approach [20]) or attributes from the original dataset. The intermediate nodes are functions from the function set, which is specific to each problem.
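A minimal sketch of this multi-tree representation (the names and structure are ours, not the authors' implementation; only the four arithmetic operators are included for brevity):

```python
import random

# One expression tree per attribute; leaves are attribute indices or
# ephemeral random constants, internal nodes are arithmetic operators.
OPERATORS = ['+', '-', '*', '/']

def random_tree(n_attributes, depth=3, rng=random):
    """Grow a random expression tree over x_0..x_{n-1} and constants."""
    if depth == 0 or rng.random() < 0.3:
        # Terminal: an input attribute x_i or an ephemeral constant.
        if rng.random() < 0.5:
            return ('x', rng.randrange(n_attributes))
        return ('const', rng.uniform(-1.0, 1.0))
    op = rng.choice(OPERATORS)
    return (op, random_tree(n_attributes, depth - 1, rng),
                random_tree(n_attributes, depth - 1, rng))

def evaluate(tree, x):
    """Evaluate a tree on an input vector x."""
    tag = tree[0]
    if tag == 'x':
        return x[tree[1]]
    if tag == 'const':
        return tree[1]
    a, b = evaluate(tree[1], x), evaluate(tree[2], x)
    if tag == '+': return a + b
    if tag == '-': return a - b
    if tag == '*': return a * b
    return a / b if b != 0 else 1.0   # protected division (our choice)

def random_individual(n_attributes):
    """An individual: one tree per attribute of the transformed dataset."""
    return [random_tree(n_attributes) for _ in range(n_attributes)]

def transform(individual, x):
    """Map an instance x of dataset B to an instance of dataset S."""
    return [evaluate(t, x) for t in individual]
```

Applying `transform` to every instance of dataset B yields the candidate solution dataset S encoded by that individual.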
The attributes of the transformed dataset are represented by algebraic expressions. These expressions are generated according to the rules of a context-free grammar which allows the absence of some of the functions or terminals. The grammar corresponding to the example problem would look like this:

Start → Tree Tree
Tree → Node
Node → Node Operator Node
Node → Terminal
Operator → + | − | ∗ | ÷
Terminal → x0 | x1 | E
E → realNumber (represented by e)

3.2
Fitness Evaluation
The fitness evaluation procedure is probably the most discussed design aspect in the literature on GP-based feature extraction. As stated before, the idea is to have the provided classifier's performance drive the evolution. To achieve that, our method calculates fitness as the classifier's accuracy over the dataset obtained by applying the transformations encoded in the individual (training-set accuracy).
3.3
Genetic Operators
This section details the choices made for the selection, crossover and mutation operators. Since the objective of this work is not to squeeze the maximum possible performance from GP, but rather to show that it is an appropriate technique for the problem and that it can indeed solve it, we did not pay special attention to these choices, and picked the most common ones in the specialized literature.
– Tournament selection without replacement. To perform this selection, s individuals are first randomly picked from the population (where s is the tournament size), while avoiding using any member of the population more than once. The selected individual is then the one with the best fitness among those picked in the first stage.
– One-point crossover: A subtree from one of the parents is substituted by one from the other parent. This procedure is carried out in the following way:
1. Randomly select a non-root, non-leaf node on each of the two parents.
2. The first child is the result of swapping the subtree below the selected node in the father for that of the mother.
3. The second child is the result of swapping the subtree below the selected node in the mother for that of the father.
– Swap mutation: This is a conservative mutation operator that helps diversify the search within a close neighborhood of a given solution. It consists of exchanging the primitive associated with a node for one that has the same number of arguments.
– Replacement mutation: This is a more aggressive mutation operator that leads to diversification in a larger neighborhood. The procedure to perform this mutation is the following:
1. Randomly select a non-root, non-leaf node on the tree to mutate.
2. Create a random tree of depth no more than a fixed maximum depth. In this work, the maximum depth allowed was 5.
3. Swap the subtree below the selected node for the randomly generated one.
3.4
Function Set
Which functions to include in the function set is usually dependent on the problem. Since one of our goals is to have an algorithm as universal and robust as possible, where the user does not need to fine-tune any parameters to achieve good performance, we decided not to study the effect of different function set choices. We chose the default functions most authors use in the literature: {+, −, ∗, ÷, exp, cos}.
3.5
Parameters
Table 1 summarizes the parameters used for the experiments.

Table 1. Evolutionary parameters for an nv-dimensional problem

Parameter                                  Value
Number of trees                            nv
Population size                            400 · nv
Duration of the run                        100 generations
Selection operator                         Tournament without replacement
Tournament size                            log2(nv) + 1
Crossover operator                         One-point crossover
Crossover probability                      0.9
Mutation operators                         Replacement & swap mutations
Replacement mutation probability           0.001
Swap mutation probability                  0.01
Maximum depth of the swapped-in subtree    5
Function set                               {+, −, ∗, ÷, cos, exp}
Terminal set                               {x0, x1, ..., x_{nv−1}, e}
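The settings of Table 1 can be collected in a configuration mapping (a sketch; the names are ours, `nv` is the number of attributes, and rounding the tournament size down before adding one is our assumption):

```python
import math

def evolution_params(nv):
    """Evolutionary parameters of Table 1 for an nv-dimensional problem."""
    return {
        'n_trees': nv,
        'population_size': 400 * nv,
        'n_generations': 100,
        'tournament_size': int(math.log2(nv)) + 1,  # log2(nv) + 1, floored
        'crossover_prob': 0.9,
        'replacement_mutation_prob': 0.001,
        'swap_mutation_prob': 0.01,
        'max_subtree_depth': 5,
        'function_set': ['+', '-', '*', '/', 'cos', 'exp'],
    }
```

For the 93-attribute prostate cancer dataset this yields a population of 37,200 individuals, which explains the computational costs mentioned in Section 4.3.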
3.6
Execution Flow
Algorithm 1 contains a summary of the execution flow of the GP procedure, which follows a classical evolutionary scheme. It stops after a user-defined number of generations.
Algorithm 1. Execution flow of the GP method
1. Randomly create the initial population by applying the context-free grammar in Section 3.1.
2. Repeat Ng times (where Ng is the number of generations):
   2.1 Evaluate the current population, using the procedure seen in Section 3.2.
   2.2 Apply selection and crossover to create a new population that will replace the old one.
   2.3 Apply the mutation operators to the new population.
3. Return the best individual ever seen.
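Algorithm 1 is a classical generational scheme; a sketch with tournament selection without replacement as described in Section 3.3 (the helper signatures are ours, not the authors' implementation):

```python
import random

def tournament_select(population, fitnesses, s, rng=random):
    """Tournament without replacement: pick s distinct individuals
    and return the fittest of them."""
    picked = rng.sample(range(len(population)), s)
    return population[max(picked, key=lambda i: fitnesses[i])]

def run_gp(init_population, evaluate_fitness, crossover, mutate,
           n_generations, tournament_size, rng=random):
    """Generational loop of Algorithm 1, tracking the best-ever individual."""
    population = init_population()
    best, best_fit = None, float('-inf')
    for _ in range(n_generations):
        fitnesses = [evaluate_fitness(ind) for ind in population]
        for ind, fit in zip(population, fitnesses):
            if fit > best_fit:
                best, best_fit = ind, fit
        # Selection and crossover build the replacement population.
        new_population = []
        while len(new_population) < len(population):
            p1 = tournament_select(population, fitnesses, tournament_size, rng)
            p2 = tournament_select(population, fitnesses, tournament_size, rng)
            new_population.extend(crossover(p1, p2))   # yields two children
        population = mutate(new_population[:len(population)])
    return best
```

In the paper's setting the individuals would be the multi-tree transformations of Section 3.1 and the fitness the training-set accuracy of Section 3.2.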
4
Case Study: Prostate Cancer Diagnosis
Prostate cancer is the most common non-skin malignancy in the western world. The American Cancer Society estimated 192,280 new cases and 27,360 deaths related to prostate cancer in 2009 [21]. Recognizing the public health implications of this disease, men are actively screened through digital rectal examinations and/or serum prostate specific antigen (PSA) level testing. If these screening tests are suspicious, prostate tissue is extracted, or biopsied, from the patient and examined for structural alterations. Due to imperfect screening technologies and repeated examinations, it is estimated that more than one million people undergo biopsies in the US alone.
4.1
Diagnostic Procedure
Biopsy, followed by manual examination under a microscope, is the primary means to definitively diagnose prostate cancer as well as most internal cancers in the human body. Pathologists are trained to recognize patterns of disease in the architecture of tissue, local structural morphology and alterations in cell size and shape. Specific patterns of specific cell types distinguish cancerous and noncancerous tissues. Hence, the primary task of the pathologist examining tissue for cancer is to locate foci of the cell of interest and examine them for alterations indicative of disease. A detailed explanation of the procedure is beyond the scope of this paper and can be found elsewhere [22,23,24,25]. Operator fatigue is well documented, and guidelines limit the workload and rate of examination of samples by a single operator (examination speed and throughput). Importantly, inter- and intra-pathologist variation complicates decision making. For this reason, it would be extremely interesting to have an accurate automatic classifier to help reduce the load on the pathologists. This was partially achieved in [24], but some issues remain open.
4.2
The Generalization Problem
Llorà et al. [24] successfully applied a genetics-based approach to the development of a classifier that obtained human-competitive results based on FTIR
data. However, the classifier built from the data obtained from one laboratory proved remarkably inaccurate when applied to classify data from a different hospital. Since the entire experimental procedure was identical (the same machine, measurement and post-processing, and the exact same lab protocols for both tissue extraction and staining), there was no factor that could explain this discrepancy. What we attempt to do with this work is develop an algorithm that can evolve a transformation over the data from the second laboratory, creating a new dataset where the classifier built from the first lab is as accurate as possible.
4.3
Pre-processing of the Data
The biological data obtained from the laboratories has an enormous size (in the range of 14 GB of storage per sample), and parallel computing was needed to achieve better-than-human results. For this reason, feature selection was performed on the dataset obtained by FTIR. It was done by applying an evaluation of pairwise error and incremental increase in classification accuracy for every class, resulting in a subset of 93 attributes. This reduced dataset provided enough information for classifier performance to be rather satisfactory: a simple C4.5 classifier achieved ∼95% accuracy on the data from the first lab, but only ∼80% on the second one. The samples used represent 0.01% of the total data available for each data set, and were selected by applying stratified sampling without replacement. A detailed description of the data pre-processing procedure can be found in [22]. The experiments reported in this paper were performed using the reduced dataset, since the associated computational costs make it unfeasible to work with the complete one. The reduced dataset is made of 93 real attributes with two classes (positive and negative diagnosis), and consists of 789 samples from one laboratory and 665 from the other, with a 60%-40% class distribution.
5
Experimental Study
This section is organized in the following way: To begin with, a general description of the experimental procedure is presented in Section 5.1, along with the parameters used for the experiment. The results obtained are presented in Section 5.2, a statistical analysis is shown in Section 5.3, and lastly some sample transformations are shown in Section 5.4.
5.1
Experimental Framework
The experimental methodology can be summarized as follows:
1. Consider each of the provided datasets (one from each lab) to be datasets A and B, respectively.
2. From dataset A, build a classifier. We chose C4.5 [26], but any other classifier would work in exactly the same way, because the proposed method uses the learned classifier as a black box.
3. Apply our method to dataset B in order to evolve a transformation that creates a solution dataset S. Use 5-fold cross-validation over dataset S, so that both training and test set accuracy results can be obtained.
4. Check the performance of the classifier from step 2 on dataset S. Ideally, it should be close to its performance on dataset A, meaning the proposed method has successfully discovered the hidden transformation and inverted it.
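The protocol can be illustrated end to end with stand-ins: a trivial threshold classifier replaces C4.5, and a hand-picked shift replaces the evolved transformation (a toy sketch of the fracture-and-repair idea, not the paper's experiment):

```python
def train_threshold_classifier(data):
    """Stand-in for C4.5: split on the mean of the single feature."""
    thr = sum(x for x, _ in data) / len(data)
    return lambda x: 1 if x > thr else 0

def accuracy(clf, data):
    return sum(clf(x) == y for x, y in data) / len(data)

# Dataset A, and dataset B = A shifted by +10 (a synthetic "fracture").
A = [(1.0, 0), (2.0, 0), (4.0, 1), (5.0, 1)]
B = [(x + 10.0, y) for x, y in A]

clf = train_threshold_classifier(A)    # step 2: build classifier on A
print(accuracy(clf, A))                # 1.0: accurate on A
print(accuracy(clf, B))                # 0.5: degraded on B
S = [(x - 10.0, y) for x, y in B]      # step 3: apply a transformation
print(accuracy(clf, S))                # 1.0: step 4, restored on S
```

In the actual experiments the transformation applied in step 3 is the multi-tree individual evolved by the GP method, not a hand-picked shift.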
5.2
Performance Results
This section presents the results for the prostate cancer problem in terms of classifier accuracy. The results obtained can be seen in Table 2.

Table 2. Classifier performance results

A-training   A-test    B         S-training   S-test
0.95435      0.92015   0.83570   0.95191      0.92866
The performance results are promising. First and foremost, the proposed method was able to find a transformation over the data from the second laboratory that made the classifier work just as well as it did on the data from the first lab, effectively finding the fracture in the data (that is, the difference in data distribution between the data sets provided by the two labs) that prevented the classifier from working accurately.
5.3
Statistical Analysis
To complete the experimental study, we performed a statistical comparison of the classifier performance over datasets A, B and S. In [27,28,29,30], a set of simple, safe and robust non-parametric tests for statistical comparisons of classifiers is recommended. One of them is the Wilcoxon Signed-Ranks Test [31,32], which is the test we selected for the comparison. In order to perform the Wilcoxon test, we used the results from each partition of the 5-fold cross-validation procedure. We ran the experiment four times, resulting in 4 · 5 = 20 performance samples for the statistical test. R+ corresponds to the first algorithm in the comparison winning, R− to the second one. We can conclude that our method has proved capable of fully homogenizing the data from both laboratories with respect to classifier performance, both on the training and the test set.
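The R+ and R− rank sums reported in Table 3 can be computed as follows; this is a from-scratch sketch of the Wilcoxon signed-ranks statistic (without the p-value computation), not the statistical software used by the authors:

```python
def wilcoxon_signed_ranks(xs, ys):
    """Return (R+, R-): rank sums of positive and negative differences."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Group runs of tied |d| and give them their average rank.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return r_plus, r_minus

# With 20 paired samples where the second method always wins,
# R+ = 0 and R- = 1 + 2 + ... + 20 = 210, as in the "B vs S-test" row.
```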
Table 3. Wilcoxon signed-ranks test results

Comparison                 R+    R−    p-value     Null hypothesis of equality
A-test vs B                210   0     1.91E-007   rejected (A-test outperforms B)
B vs S-test                0     210   1.91E-007   rejected (S-test outperforms B)
A-training vs S-training   126   84    --          accepted
A-test vs S-test           84    126   --          accepted

5.4
Obtained Transformations
Figure 1 contains a sample of the evolved expressions for the best individual found by our method. Since the dataset has 93 attributes, the individual was composed of 93 trees, but for space reasons only the attributes relevant to the C4.5 classifier are included here.
Fig. 1. Tree representation of the expressions contained in a solution to the Prostate Cancer problem
6
Concluding Remarks
We have presented a new algorithm that approaches a common problem in real life for which not many solutions have been proposed in evolutionary computing. The problem in question is the repairing of fractures between data by adjusting the data itself, not the classifiers built from it.
We have developed a solution to the problem by means of a GP-based algorithm that performs feature extraction on the problem dataset, driven by the accuracy of the previously built classifier. We have applied our method to a real-world problem where data from two different laboratories regarding prostate cancer diagnosis was provided, and where the classifier learned from one did not perform well enough on the other. Our algorithm was capable of learning a transformation over the second dataset that made the classifier perform just as well as it did on the first one. The validation results with 5-fold cross-validation also support the idea that the algorithm obtains good results and has strong generalization power. We have applied a statistical analysis methodology that supports the claim that the classifier performance obtained on the solution dataset significantly outperforms that obtained on the problem dataset. Lastly, we have shown the learned transformations. Unfortunately, we have not been able to extract any useful information from them yet.
Acknowledgments. Jose García Moreno-Torres was supported by a scholarship from 'Obra Social la Caixa' and is currently supported by an FPU grant from the Ministerio de Educación y Ciencia of the Spanish Government and by the KEEL project. Rohit Bhargava would like to acknowledge collaborators over the years, especially Dr. Stephen M. Hewitt and Dr. Ira W. Levin of the National Institutes of Health, for numerous useful discussions and guidance. Funding for this work was provided in part by the University of Illinois Research Board and by the Department of Defense Prostate Cancer Research Program. This work was also funded in part by the National Center for Supercomputing Applications and the University of Illinois, under the auspices of the NCSA/UIUC faculty fellows program.
References
1. Wyse, N., Dubes, R., Jain, A.: A critical evaluation of intrinsic dimensionality algorithms. In: Gelsema, E.S., Kanal, L.N. (eds.) Pattern Recognition in Practice, Amsterdam, pp. 415–425. Morgan Kaufmann Publishers, Inc., San Francisco (1980)
2. Kim, K.A., Oh, S.Y., Choi, H.C.: Facial feature extraction using PCA and wavelet multi-resolution images. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, p. 439. IEEE Computer Society, Los Alamitos (2004)
3. Podolak, I.T.: Facial component extraction and face recognition with support vector machines. In: FGR 2002: Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, p. 83. IEEE Computer Society, Los Alamitos (2002)
4. Pei, M., Goodman, E.D., Punch, W.F.: Pattern discovery from data using genetic algorithms. In: Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery & Data Mining, PAKDD 1997 (1997)
5. Liu, H., Motoda, H.: Feature Extraction, Construction and Selection: A Data Mining Perspective. SECS, vol. 453. Kluwer Academic, Boston (1998)
6. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
7. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.): Feature Extraction, Foundations and Applications. Springer, Heidelberg (2006)
8. Tackett, W.A.: Genetic programming for feature discovery and image discrimination. In: Proceedings of the 5th International Conference on Genetic Algorithms, pp. 303–311. Morgan Kaufmann Publishers Inc., San Francisco (1993)
9. Sherrah, J.R., Bogner, R.E., Bouzerdoum, A.: The evolutionary pre-processor: Automatic feature extraction for supervised classification using genetic programming. In: Proc. 2nd International Conference on Genetic Programming (GP 1997), pp. 304–312. Morgan Kaufmann, San Francisco (1997)
10. Kotani, M., Ozawa, S., Nakai, M., Akazawa, K.: Emergence of feature extraction function using genetic programming. In: KES, pp. 149–152 (1999)
11. Bot, M.C.J.: Feature extraction for the k-nearest neighbour classifier with genetic programming. In: Miller, J., Tomassini, M., Lanzi, P.L., Ryan, C., Tettamanzi, A.G.B., Langdon, W.B. (eds.) EuroGP 2001. LNCS, vol. 2038, pp. 256–267. Springer, Heidelberg (2001)
12. Zhang, Y., Rockett, P.I.: A generic optimal feature extraction method using multiobjective genetic programming. Technical Report VIE 2006/001, Department of Electronic and Electrical Engineering, University of Sheffield, UK (2006)
13. Guo, H., Nandi, A.K.: Breast cancer diagnosis using genetic programming generated feature. Pattern Recognition 39(5), 980–987 (2006)
14. Zhang, Y., Rockett, P.I.: A generic multi-dimensional feature extraction method using multiobjective genetic programming. Evolutionary Computation 17(1), 89–115 (2009)
15. Harris, C.: An Investigation into the Application of Genetic Programming Techniques to Signal Analysis and Feature Detection. University College London (September 26, 1997)
16. Smith, M.G., Bull, L.: Genetic programming with a genetic algorithm for feature construction and selection. Genetic Programming and Evolvable Machines 6(3), 265–281 (2005)
17. Wang, K., Zhou, S., Fu, A.W.-C., Yu, J.X.: Mining changes of classification by correspondence tracing. In: Proceedings of the 2003 SIAM International Conference on Data Mining, SDM 2003 (2003)
18. Yang, Y., Wu, X., Zhu, X.: Conceptual equivalence for contrast mining in classification learning. Data & Knowledge Engineering 67(3), 413–429 (2008)
19. Cieslak, D.A., Chawla, N.V.: A framework for monitoring classifiers' performance: when and why failure occurs? Knowledge and Information Systems 18(1), 83–108 (2009)
20. Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge (1992)
21. American Cancer Society: How many men get prostate cancer? http://www.cancer.org/docroot/CRI/content/CRI_2_2_1X_How_many_men_get_prostate_cancer_36.asp
22. Fernandez, D.C., Bhargava, R., Hewitt, S.M., Levin, I.W.: Infrared spectroscopic imaging for histopathologic recognition. Nature Biotechnology 23(4), 469–474 (2005)
23. Levin, I.W., Bhargava, R.: Fourier transform infrared vibrational spectroscopic imaging: integrating microscopy and molecular recognition. Annual Review of Physical Chemistry 56, 429–474 (2005)
24. Llorà, X., Reddy, R., Matesic, B., Bhargava, R.: Towards better than human capability in diagnosing prostate cancer using infrared spectroscopic imaging. In: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation (GECCO 2007), pp. 2098–2105. ACM, New York (2007)
25. Llorà, X., Priya, A., Bhargava, R.: Observer-invariant histopathology using genetics-based machine learning. Natural Computing: An International Journal 8(1), 101–120 (2009)
26. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
27. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
28. García, S., Herrera, F.: An extension on 'statistical comparisons of classifiers over multiple data sets' for all pairwise comparisons. Journal of Machine Learning Research 9, 2677–2694 (2008)
29. García, S., Fernández, A., Luengo, J., Herrera, F.: A study of statistical techniques and performance measures for genetics-based machine learning: Accuracy and interpretability. Soft Computing 13(10), 959–977 (2009)
30. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180(10), 2044–2064 (2010)
31. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (1945)
32. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures, 4th edn. Chapman & Hall/CRC (2007)
Author Index

Bhargava, Rohit 185
Bull, Larry 87
Butz, Martin V. 47, 57
Casillas, Jorge 21
Enée, Gilles 107
Farooq, Muddassar 127
Franco, María 145
Goldberg, David E. 185
Gorrin, Celso 145
Howard, Gerard David 87
Lanzi, Pier Luca 1, 70, 87
Llorà, Xavier 185
Loiacono, Daniele 1, 70
Martínez, Ivette 145
Moreno-Torres, Jose G. 185
Orriols-Puig, Albert 21
Péroumalnaïk, Mathias 107
Preen, Richard 166
Stalph, Patrick O. 47, 57
Tanwani, Ajay Kumar 127
Wilson, Stewart W. 38