

Computational Intelligence

Computational Intelligence for Engineering and Manufacturing Edited by

Diego Andina Technical University of Madrid (UPM), Spain

Duc Truong Pham Manufacturing Engineering Center, Cardiff University, Cardiff

A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-10 0-387-37450-7 (HB)
ISBN-13 978-0-387-37450-5 (HB)
ISBN-10 0-387-37452-3 (e-book)
ISBN-13 978-0-387-37452-9 (e-book)

Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. www.springer.com

Printed on acid-free paper

All Rights Reserved © 2007 Springer No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

This book is dedicated to the memory of Roberto Carranza E., who inspired in the authors the enthusiasm to jointly prepare this book.

CONTENTS

Contributing Authors
Preface
Acknowledgements

1. Soft Computing and its Applications in Engineering and Manufacture
   D. T. Pham, P. T. N. Pham, M. S. Packianather, A. A. Afify

2. Neural Networks Historical Review
   D. Andina, A. Vega-Corona, J. I. Seijas, J. Torres-García

3. Artificial Neural Networks
   D. T. Pham, M. S. Packianather, A. A. Afify

4. Application of Neural Networks
   D. Andina, A. Vega-Corona, J. I. Seijas, M. J. Alarcón

5. Radial Basis Function Networks and their Application in Communication Systems
   Ascensión Gallardo Antolín, Juan Pascual García, José Luis Sancho Gómez

6. Biological Clues for Up-to-Date Artificial Neurons
   Javier Ropero Peláez, Jose Roberto Castillo Piqueira

7. Support Vector Machines
   Jaime Gómez Sáenz de Tejada, Juan Seijas Martínez-Echevarría

8. Fractals as Pre-Processing Tool for Computational Intelligence Application
   Ana M. Tarquis, Valeriano Méndez, Juan B. Grau, José M. Antón, Diego Andina

CONTRIBUTING AUTHORS

D. Andina, J. I. Seijas, J. Torres-García, M. J. Alarcón, A. Tarquis, J. B. Grau and J. M. Antón work for the Technical University of Madrid (UPM), Spain, where they form the Group for Automation and Soft Computing (GASC).

D. T. Pham, P. T. N. Pham, M. S. Packianather and A. A. Afify work for Cardiff University.

Javier Ropero Peláez and José Roberto Castillo Piqueira work for Escola Politecnica da Universidade de Sao Paulo, Departamento de Engenharia de Telecomunicaçoes e Controle, Brazil.

A. Gallardo Antolín, J. Pascual García and J. L. Sancho Gómez work for University Carlos III of Madrid, Spain.

A. Vega-Corona, V. Méndez and J. Gómez Sáenz de Tejada work for the University of Guanajuato, Mexico, the Technical University of Madrid and the Universidad Autónoma of Madrid, Spain, respectively.

PREFACE

This book presents a selected collection of contributions on a focused treatment of important elements of Computational Intelligence. Unlike traditional computing, Computational Intelligence (CI) is tolerant of imprecise information, partial truth and uncertainty. The principal components of CI that currently have frequent application in Engineering and Manufacturing are Neural Networks (NN), Fuzzy Logic (FL) and Support Vector Machines (SVM). In CI, NN and SVM are concerned with learning, while FL is concerned with imprecision and reasoning.

This volume mainly covers a key element of Computational Intelligence: learning. All the contributions in this volume have a direct relevance to neural network learning, from neural computing fundamentals to advanced networks such as Multilayer Perceptrons (MLP) and Radial Basis Function Networks (RBF), and their relations with fuzzy set and support vector machine theory. The book also discusses different applications in Engineering and Manufacturing. These are among the applications where CI has excellent potential for use.

Both novice and expert readers should find this book a useful reference in the field of Computational Intelligence. The editors and the authors hope to have contributed to the field by paving the way for learning paradigms to solve real-world problems.

D. Andina

ACKNOWLEDGEMENTS

This document has been produced with the financial assistance of the European Community, ALFA project II-0026-FA. The views expressed herein are those of the Authors and can therefore in no way be taken to reflect the official opinion of the European Community. The editors wish to thank Dr A. Afify of Cardiff University and Mr A. Jevtic of the Technical University of Madrid for their support and helpful comments during the revision of this text. The editors also wish to thank Nagib Callaos, President of the International Institute of Informatics and Systemics, IIIS, for his permission to reproduce, in Chapters 2 and 4 of this book, content from the book by D. Andina and F. Ballesteros (Eds), “Recent Advances in Neural Networks” Ed. IIIS press, ILL, USA (2000).

CHAPTER 1 SOFT COMPUTING AND ITS APPLICATIONS IN ENGINEERING AND MANUFACTURE

D. T. PHAM, P. T. N. PHAM, M. S. PACKIANATHER, A. A. AFIFY
Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 3AA, United Kingdom

D. Andina and D.T. Pham (eds.), Computational Intelligence, 1–38. © 2007 Springer.

INTRODUCTION

Soft computing is a recent term for a computing paradigm that has been in existence for almost fifty years. This chapter reviews five soft computing tools: knowledge-based systems, fuzzy logic, inductive learning, neural networks and genetic algorithms. All of these tools have found many practical applications. Examples of applications in engineering and manufacture are given in the chapter.

1. KNOWLEDGE-BASED SYSTEMS

Knowledge-based systems, or expert systems, are computer programs embodying knowledge about a narrow domain for solving problems related to that domain. An expert system usually comprises two main elements, a knowledge base and an inference mechanism. The knowledge base contains domain knowledge which may be expressed as any combination of “IF-THEN” rules, factual statements (or assertions), frames, objects, procedures and cases. The inference mechanism is that part of an expert system which manipulates the stored knowledge to produce solutions to problems. Knowledge manipulation methods include the use of inheritance and constraints (in a frame-based or object-oriented expert system), the retrieval and adaptation of case examples (in a case-based expert system) and the application of inference rules such as modus ponens (If A Then B; A Therefore B) and modus tollens (If A Then B; NOT B Therefore NOT A) according to “forward chaining” or “backward chaining” control procedures and “depth-first” or “breadth-first” search strategies (in a rule-based expert system).

With forward chaining or data-driven inferencing, the system tries to match available facts with the IF portion of the IF-THEN rules in the knowledge base. When matching rules are found, one of them is “fired”, i.e. its THEN part is made true, generating new facts and data which in turn cause other rules to “fire”. Reasoning stops when no more new rules can fire.

In backward chaining or goal-driven inferencing, a goal to be proved is specified. If the goal cannot be immediately satisfied by existing facts in the knowledge base, the system will examine the IF-THEN rules for rules with the goal in their THEN portion. Next, the system will determine whether there are facts that can cause any of those rules to fire. If such facts are not available, they are set up as subgoals. The process continues recursively until either all the required facts are found and the goal is proved, or any one of the subgoals cannot be satisfied, in which case the original goal is disproved.

Both control procedures are illustrated in Figure 1. Figure 1a shows how, given the assertion that a lathe is a machine tool and a set of rules concerning machine tools, a forward-chaining system will generate additional assertions such as “a lathe is power driven” and “a lathe has a tool holder”. Figure 1b details the backward-chaining sequence producing the answer to the query “does a lathe require a power source?”.

In the forward chaining example of Figure 1a, both rules R2 and R3 simultaneously qualify for firing when inferencing starts, as both their IF parts match the presented fact F1. Conflict resolution has to be performed by the expert system to decide which rule should fire. The conflict resolution method adopted in this example is “first come, first served”: R2 fires as it is the first qualifying rule encountered. Other conflict resolution methods include “priority”, “specificity” and “recency”.

The search strategies can also be illustrated using the forward chaining example of Figure 1a.
Suppose that, in addition to F1, the knowledge base also initially contains the assertion “a CNC turning centre is a machine tool”. Depth-first search involves firing rules R2 and R3 with X instantiated to “lathe” (as shown in Figure 1a) before firing them again with X instantiated to “CNC turning centre”. Breadth-first search will activate rule R2 with X instantiated to “lathe” and again with X instantiated to “CNC turning centre”, followed by rule R3 and the same sequence of instantiations. Breadth-first search finds the shortest line of inferencing between a start position and a solution if it exists. When guided by heuristics to select the correct search path, depth-first search might produce a solution more quickly, although the search might not terminate if the search space is infinite [Jackson, 1999].

For more information on the technology of expert systems, see [Pham and Pham, 1988; Durkin, 1994; Giarratano and Riley, 1998; Darlington, 1999; Jackson, 1999; Badiru and Cheung, 2002; Nurminen et al., 2003].

Most expert systems are nowadays developed using programs known as “shells”. These are essentially ready-made expert systems complete with inferencing and knowledge storage facilities but without the domain knowledge. Some sophisticated expert systems are constructed with the help of “development environments”. The latter are more flexible than shells in that they also provide means for users to implement their own inferencing and knowledge representation methods. More details on expert system shells and development environments can be found in [Price, 1990].


KNOWLEDGE BASE (Initial State)
Fact :
  F1 - A lathe is a machine tool
Rules :
  R1 - If X is power driven Then X requires a power source
  R2 - If X is a machine tool Then X has a tool holder
  R3 - If X is a machine tool Then X is power driven

    F1 & R2 match

KNOWLEDGE BASE (Intermediate State)
Fact :
  F1 - A lathe is a machine tool
  F2 - A lathe has a tool holder
Rules :
  R1 - If X is power driven Then X requires a power source
  R2 - If X is a machine tool Then X has a tool holder
  R3 - If X is a machine tool Then X is power driven

    F1 & R3 match

KNOWLEDGE BASE (Intermediate State)
Fact :
  F1 - A lathe is a machine tool
  F2 - A lathe has a tool holder
  F3 - A lathe is power driven
Rules :
  R1 - If X is power driven Then X requires a power source
  R2 - If X is a machine tool Then X has a tool holder
  R3 - If X is a machine tool Then X is power driven

    F3 & R1 match

KNOWLEDGE BASE (Final State)
Fact :
  F1 - A lathe is a machine tool
  F2 - A lathe has a tool holder
  F3 - A lathe is power driven
  F4 - A lathe requires a power source
Rules :
  R1 - If X is power driven Then X requires a power source
  R2 - If X is a machine tool Then X has a tool holder
  R3 - If X is a machine tool Then X is power driven

Figure 1a. An example of forward chaining
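The forward-chaining sequence of Figure 1a can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular shell: the tuple representation of facts and rules and the function name `forward_chain` are assumptions made for the example, and conflict resolution is the “first come, first served” method described in the text.

```python
# Facts are (subject, property) pairs; each rule maps an IF property to a
# THEN property. Rule names follow Figure 1a; the data layout is illustrative.
RULES = [
    ("R1", "is power driven", "requires a power source"),
    ("R2", "is a machine tool", "has a tool holder"),
    ("R3", "is a machine tool", "is power driven"),
]


def forward_chain(initial_facts, rules):
    """Fire rules until no new fact can be derived.

    Returns the final list of facts and the order in which rules fired.
    """
    facts = list(initial_facts)
    log = []
    while True:
        fired = False
        # "First come, first served": scan the rules in order and fire the
        # first one whose THEN part adds a fact that is not yet known.
        for name, if_part, then_part in rules:
            for subject, prop in facts:
                new_fact = (subject, then_part)
                if prop == if_part and new_fact not in facts:
                    facts.append(new_fact)
                    log.append(name)
                    fired = True
                    break
            if fired:
                break
        if not fired:
            return facts, log


facts, log = forward_chain([("a lathe", "is a machine tool")], RULES)
for subject, prop in facts:
    print(subject, prop)
print("firing order:", log)
```

Starting from F1 alone, the sketch fires R2, then R3, then R1, and stops when no rule can add a new fact, which is the sequence shown in the figure.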


KNOWLEDGE BASE (Initial State)
Fact :
  F1 - A lathe is a machine tool
Rules :
  R1 - If X is power driven Then X requires a power source
  R2 - If X is a machine tool Then X has a tool holder
  R3 - If X is a machine tool Then X is power driven
GOAL STACK (Goal : Satisfied)
  G1 - A lathe requires a power source : ?

    G1 & R1

KNOWLEDGE BASE (Intermediate State)
Fact :
  F1 - A lathe is a machine tool
Rules :
  R1 - If X is power driven Then X requires a power source
  R2 - If X is a machine tool Then X has a tool holder
  R3 - If X is a machine tool Then X is power driven
GOAL STACK (Goal : Satisfied)
  G1 - A lathe requires a power source : ?
  G2 - A lathe is power driven : ?

    G2 & R3

KNOWLEDGE BASE (Intermediate State)
Fact :
  F1 - A lathe is a machine tool
Rules :
  R1 - If X is power driven Then X requires a power source
  R2 - If X is a machine tool Then X has a tool holder
  R3 - If X is a machine tool Then X is power driven
GOAL STACK (Goal : Satisfied)
  G1 - A lathe requires a power source : ?
  G2 - A lathe is power driven : ?
  G3 - A lathe is a machine tool : ?

KNOWLEDGE BASE (Intermediate State)
Fact :
  F1 - A lathe is a machine tool
Rules :
  R1 - If X is power driven Then X requires a power source
  R2 - If X is a machine tool Then X has a tool holder
  R3 - If X is a machine tool Then X is power driven
GOAL STACK (Goal : Satisfied)
  G1 - A lathe requires a power source : ?
  G2 - A lathe is power driven : ?
  G3 - A lathe is a machine tool : Yes

    F1 & R3

KNOWLEDGE BASE (Intermediate State)
Fact :
  F1 - A lathe is a machine tool
  F2 - A lathe is power driven
Rules :
  R1 - If X is power driven Then X requires a power source
  R2 - If X is a machine tool Then X has a tool holder
  R3 - If X is a machine tool Then X is power driven
GOAL STACK (Goal : Satisfied)
  G1 - A lathe requires a power source : ?
  G2 - A lathe is power driven : Yes

    F2 & R1

KNOWLEDGE BASE (Final State)
Fact :
  F1 - A lathe is a machine tool
  F2 - A lathe is power driven
  F3 - A lathe requires a power source
Rules :
  R1 - If X is power driven Then X requires a power source
  R2 - If X is a machine tool Then X has a tool holder
  R3 - If X is a machine tool Then X is power driven
GOAL STACK (Goal : Satisfied)
  G1 - A lathe requires a power source : Yes

Figure 1b. An example of backward chaining
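The goal-driven sequence of Figure 1b can be sketched as a short recursive procedure. As with any such sketch, the names (`prove`, `trace`) and the tuple representation are assumptions made for illustration; the recursion also assumes the rule set contains no cycles, since a cyclic rule set would require an explicit check to terminate.

```python
RULES = [
    ("R1", "is power driven", "requires a power source"),
    ("R2", "is a machine tool", "has a tool holder"),
    ("R3", "is a machine tool", "is power driven"),
]


def prove(goal, facts, rules, trace):
    """Return True if goal, a (subject, property) pair, can be proved.

    Every goal and subgoal examined is appended to trace. Assumes the
    rules are acyclic; otherwise the recursion would not terminate.
    """
    trace.append(goal[1])
    if goal in facts:
        return True
    subject, prop = goal
    for _name, if_part, then_part in rules:
        if then_part == prop:
            # The IF part of a rule concluding the goal becomes a subgoal.
            if prove((subject, if_part), facts, rules, trace):
                facts.add(goal)  # a proved goal is stored as a new fact
                return True
    return False


facts = {("a lathe", "is a machine tool")}
trace = []
answer = prove(("a lathe", "requires a power source"), facts, RULES, trace)
print(answer)
print(trace)
```

The trace lists the goals in the order G1, G2, G3 of the figure, and the proved subgoal “a lathe is power driven” is added to the fact set on the way back up.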


Among the five tools considered in this chapter, expert systems are probably the most mature, with many commercial shells and development tools available to facilitate their construction. Consequently, once the domain knowledge to be incorporated in an expert system has been extracted, the process of building the system is relatively simple. The ease with which expert systems can be developed has led to a large number of applications of the tool. In engineering, applications can be found for a variety of tasks including selection of materials, machine elements, tools, equipment and processes, signal interpreting, condition monitoring, fault diagnosis, machine and process control, machine design, process planning, production scheduling and system configuring. Some recent examples of specific tasks undertaken by expert systems are:
• identifying and planning inspection schedules for critical components of an offshore structure [Peers et al., 1994];
• automating the evaluation of manufacturability in CAD systems [Venkatachalam, 1994];
• choosing an optimal robot for a particular task [Kamrani et al., 1995];
• monitoring the technical and organisational problems of vehicle maintenance in coal mining [Streichfuss and Burgwinkel, 1995];
• configuring paper feeding mechanisms [Koo and Han, 1996];
• training technical personnel in the design and evaluation of energy cogeneration plants [Lara Rosano et al., 1996];
• storing, retrieving and adapting planar linkage designs [Bose et al., 1997];
• designing additive formulae for engine oil products [Shi et al., 1997];
• carrying out automatic remeshing during a finite-elements analysis of forging deformation [Yano et al., 1997];
• designing of products and their assembly processes [Zha et al., 1998];
• modelling and control of combustion processes [Kalogirou, 2003];
• optimising the transient performances in the adaptive control of a planar robot [De La Sen et al., 2004].

2. FUZZY LOGIC

A disadvantage of ordinary rule-based expert systems is that they cannot handle new situations not covered explicitly in their knowledge bases (that is, situations not fitting exactly those described in the “IF” parts of the rules). These rule-based systems are completely unable to produce conclusions when such situations are encountered. They are therefore regarded as shallow systems which fail in a “brittle” manner, rather than exhibiting a gradual reduction in performance when faced with increasingly unfamiliar problems, as human experts would.

The use of fuzzy logic [Zadeh, 1965], which reflects the qualitative and inexact nature of human reasoning, can enable expert systems to be more resilient. With fuzzy logic, the precise value of a variable is replaced by a linguistic description, the meaning of which is represented by a fuzzy set, and inferencing is carried out based on this representation.

Fuzzy set theory may be considered an extension of classical set theory. While classical set theory is about “crisp” sets with sharp boundaries, fuzzy set theory is concerned with “fuzzy” sets whose boundaries are “grey”. In classical set theory, an element $u_i$ can either belong or not belong to a set $A$, i.e. the degree to which element $u_i$ belongs to set $A$ is either 1 or 0. However, in fuzzy set theory, the degree of belonging of an element $u_i$ to a fuzzy set $\tilde{A}$ is a real number between 0 and 1. This is denoted by $\mu_{\tilde{A}}(u_i)$, the grade of membership of $u_i$ in $\tilde{A}$. Fuzzy set $\tilde{A}$ is a fuzzy set in $U$, the “universe of discourse” or “universe” which includes all objects to be discussed. $\mu_{\tilde{A}}(u_i)$ is 1 when $u_i$ is definitely a member of $\tilde{A}$ and $\mu_{\tilde{A}}(u_i)$ is 0 when $u_i$ is definitely not a member of $\tilde{A}$. For instance, a fuzzy set defining the term “normal room temperature” might be:

$$
\begin{aligned}
\text{normal room temperature} \equiv{} & 0.0/\text{below } 10\,^{\circ}\mathrm{C} + 0.3/10\,^{\circ}\mathrm{C}\text{–}16\,^{\circ}\mathrm{C} \\
& + 0.8/16\,^{\circ}\mathrm{C}\text{–}18\,^{\circ}\mathrm{C} + 1.0/18\,^{\circ}\mathrm{C}\text{–}22\,^{\circ}\mathrm{C} \\
& + 0.8/22\,^{\circ}\mathrm{C}\text{–}24\,^{\circ}\mathrm{C} + 0.3/24\,^{\circ}\mathrm{C}\text{–}30\,^{\circ}\mathrm{C} \\
& + 0.0/\text{above } 30\,^{\circ}\mathrm{C}
\end{aligned}
\tag{1}
$$

The values 0.0, 0.3, 0.8 and 1.0 are the grades of membership to the given fuzzy set of the temperature ranges below 10 °C (above 30 °C), between 10 °C and 16 °C (24 °C–30 °C), between 16 °C and 18 °C (22 °C–24 °C) and between 18 °C and 22 °C, respectively. Figure 2(a) shows a plot of the grades of membership for “normal room temperature”. For comparison, Figure 2(b) depicts the grades of membership for a crisp set defining room temperatures in the normal range.

Figure 2. (a) Fuzzy set of “normal temperature” (b) Crisp set of “normal temperature” (grade of membership µ plotted against temperature in °C)

Knowledge in an expert system employing fuzzy logic can be expressed as qualitative statements (or fuzzy rules) such as “If the room temperature is normal, then set the heat input to normal”, where “normal room temperature” and “normal heat input” are both fuzzy sets. A fuzzy rule relating two fuzzy sets $\tilde{A}$ and $\tilde{B}$ is effectively the Cartesian product $\tilde{A} \times \tilde{B}$, which can be represented by a relation matrix $\tilde{R}$. Element $R_{ij}$ of $\tilde{R}$ is the membership to $\tilde{A} \times \tilde{B}$ of the pair $(u_i, v_j)$, with $u_i \in \tilde{A}$ and $v_j \in \tilde{B}$. $R_{ij}$ is given by:

$$
R_{ij} = \min\left\{ \mu_{\tilde{A}}(u_i),\ \mu_{\tilde{B}}(v_j) \right\}
\tag{2}
$$

For example, with “normal room temperature” defined as before and “normal heat input” described by:

$$
\text{normal heat input} \equiv 0.2/1\ \text{kW} + 0.9/2\ \text{kW} + 0.2/3\ \text{kW}
\tag{3}
$$

$\tilde{R}$ can be computed as:

$$
\tilde{R} =
\begin{bmatrix}
0.0 & 0.0 & 0.0 \\
0.2 & 0.3 & 0.2 \\
0.2 & 0.8 & 0.2 \\
0.2 & 0.9 & 0.2 \\
0.2 & 0.8 & 0.2 \\
0.2 & 0.3 & 0.2 \\
0.0 & 0.0 & 0.0
\end{bmatrix}
\tag{4}
$$
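The relation matrix of Equation (4) follows directly from applying the minimum of Equation (2) to every pair of membership grades. A minimal sketch, in which the list layout and the function name `relation_matrix` are assumptions made for illustration:

```python
# Grades of membership, listed in the order of Equations (1) and (3).
normal_temperature = [0.0, 0.3, 0.8, 1.0, 0.8, 0.3, 0.0]  # seven temperature ranges
normal_heat_input = [0.2, 0.9, 0.2]                        # 1 kW, 2 kW, 3 kW


def relation_matrix(mu_a, mu_b):
    """Return R with R[i][j] = min(mu_a[i], mu_b[j]) for the rule "If A Then B"."""
    return [[min(a, b) for b in mu_b] for a in mu_a]


R = relation_matrix(normal_temperature, normal_heat_input)
for row in R:
    print(row)
```

Each row corresponds to one temperature range and each column to one heat input level, so the fourth row (18 °C–22 °C, grade 1.0) simply reproduces the heat input grades.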

A reasoning procedure known as the compositional rule of inference, which is the equivalent of the modus-ponens rule in rule-based expert systems, enables conclusions to be drawn by generalisation (extrapolation or interpolation) from the qualitative information stored in the knowledge base. For instance, when the room temperature is detected to be “slightly below normal”, a temperature-controlling fuzzy expert system might deduce that the heat input should be set to “slightly above normal”. Note that this conclusion might not be contained in any of the fuzzy rules stored in the system.

A well-known compositional rule of inference is the max-min rule. Let $\tilde{R}$ represent the fuzzy rule “If $\tilde{A}$ Then $\tilde{B}$” and $\tilde{a} \equiv \sum_i \mu_i / u_i$ a fuzzy assertion. $\tilde{A}$ and $\tilde{a}$ are fuzzy sets in the same universe of discourse. The max-min rule enables a fuzzy conclusion $\tilde{b} \equiv \sum_j \lambda_j / v_j$ to be inferred from $\tilde{a}$ and $\tilde{R}$ as follows:

$$
\tilde{b} = \tilde{a} \circ \tilde{R}
\tag{5}
$$

$$
\lambda_j = \max_i \left[ \min\left( \mu_i,\ R_{ij} \right) \right]
\tag{6}
$$

For example, given the fuzzy rule “If the room temperature is normal, then set the heat input to normal”, where “normal room temperature” and “normal heat input” are as defined previously, and a fuzzy temperature measurement of

$$
\begin{aligned}
\text{temperature} \equiv{} & 0.0/\text{below } 10\,^{\circ}\mathrm{C} + 0.4/10\,^{\circ}\mathrm{C}\text{–}16\,^{\circ}\mathrm{C} + 0.8/16\,^{\circ}\mathrm{C}\text{–}18\,^{\circ}\mathrm{C} \\
& + 0.8/18\,^{\circ}\mathrm{C}\text{–}22\,^{\circ}\mathrm{C} + 0.2/22\,^{\circ}\mathrm{C}\text{–}24\,^{\circ}\mathrm{C} \\
& + 0.0/24\,^{\circ}\mathrm{C}\text{–}30\,^{\circ}\mathrm{C} + 0.0/\text{above } 30\,^{\circ}\mathrm{C}
\end{aligned}
\tag{7}
$$

the heat input will be deduced as:

$$
\begin{aligned}
\text{heat input} &= \text{temperature} \circ \tilde{R} \\
&= 0.2/1\ \text{kW} + 0.8/2\ \text{kW} + 0.2/3\ \text{kW}
\end{aligned}
\tag{8}
$$
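The max-min rule can be checked numerically on the room temperature example. The sketch below is illustrative only: the function name `max_min_composition` and the list representation are assumptions, and the relation matrix is rebuilt from the membership grades of “normal room temperature” and “normal heat input” so that the block is self-contained.

```python
normal_temperature = [0.0, 0.3, 0.8, 1.0, 0.8, 0.3, 0.0]
normal_heat_input = [0.2, 0.9, 0.2]  # 1 kW, 2 kW, 3 kW

# Relation matrix of the rule: R[i][j] = min(mu_A(u_i), mu_B(v_j)).
R = [[min(a, b) for b in normal_heat_input] for a in normal_temperature]


def max_min_composition(a, relation):
    """Return b with b[j] = max over i of min(a[i], relation[i][j])."""
    n_cols = len(relation[0])
    return [
        max(min(a_i, row[j]) for a_i, row in zip(a, relation))
        for j in range(n_cols)
    ]


# Fuzzy temperature measurement over the same seven temperature ranges.
measured_temperature = [0.0, 0.4, 0.8, 0.8, 0.2, 0.0, 0.0]
heat_input = max_min_composition(measured_temperature, R)
print(heat_input)  # grades for 1 kW, 2 kW and 3 kW
```

For the middle column, the largest of the pairwise minima is min(0.8, 0.8) = 0.8, which is why the inferred grade for 2 kW is 0.8 rather than the 0.9 of the rule itself.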

For further information on fuzzy logic, see [Kaufmann, 1975; Klir and Yuan, 1995; 1996; Ross, 1995; Zimmermann, 1996; Dubois and Prade, 1998].

Fuzzy logic potentially has many applications in engineering where the domain knowledge is usually imprecise. Notable successes have been achieved in the area of process and machine control although other sectors have also benefited from this tool. Recent examples of engineering applications include:
• controlling the height of the arc in a welding process [Bigand et al., 1994];
• controlling the rolling motion of an aircraft [Ferreiro Garcia, 1994];
• controlling a multi-fingered robot hand [Bas and Erkmen, 1995];
• analysing the chemical composition of minerals [Da Rocha Fernandes and Cid Bastos, 1996];
• monitoring of tool-breakage in end-milling operations [Chen and Black, 1997];
• modelling of the set-up and bend sequencing process for sheet metal bending [Ong et al., 1997];
• determining the optimal formation of manufacturing cells [Szwarc et al., 1997; Zülal and Arikan, 2000];
• classifying discharge pulses in electrical discharge machining [Tarng et al., 1997];
• modelling an electrical drive system [Costa Branco and Dente, 1998];
• improving the performance of hard disk drive final assembly [Zhao and De Souza, 1998; 2001];
• analysing chatter occurring during a machine tool cutting process [Kong et al., 1999];
• addressing the relationships between customer needs and design requirements [Sohen and Choi, 2001; Vanegas and Labib, 2001; Karsak, 2004];
• assessing and selecting advanced manufacturing systems [Karsak and Kuzgunkaya, 2002; Bozdağ et al., 2003; Beskese et al., 2004; Kulak and Kahraman, 2004];
• evaluating cutting force uncertainty in turning [Wang et al., 2002];
• reducing defects in automotive coating operations [Lou and Huang, 2003].

3. INDUCTIVE LEARNING

The acquisition of domain knowledge to build into the knowledge base of an expert system is generally a major task. In some cases, it has proved a bottleneck in the construction of an expert system. Automatic knowledge acquisition techniques have been developed to address this problem. Inductive learning is an automatic technique for knowledge acquisition. The inductive approach produces a structured representation of knowledge as the outcome of learning. Induction involves generalising a set of examples to yield a selected representation which can be in terms of a set of rules, concepts or logical inferences or a decision tree.

An inductive learning program usually requires as input a set of examples. Each example is characterised by the values of a number of attributes and the class to which it belongs. In one approach to inductive learning, through a process of “dividing-and-conquering” where attributes are chosen according to some strategy (for example, to maximise the information gain) to divide the original example set into subsets, the inductive learning program builds a decision tree that correctly classifies the given example set. The tree represents the knowledge generalised from the specific examples in the set. This can subsequently be used to handle situations not explicitly covered by the example set.

In another approach known as the “covering approach”, the inductive learning program attempts to find groups of attributes uniquely shared by examples in given classes and forms rules with the IF part as conjunctions of those attributes and the THEN part as the classes. The program removes correctly classified examples from consideration and stops when rules have been formed to classify all examples in the given set.

A new approach to inductive learning, “inductive logic programming”, is a combination of induction and logic programming. Unlike conventional inductive learning, which uses propositional logic to describe examples and represent new concepts, inductive logic programming (ILP) employs the more powerful predicate logic to represent training examples and background knowledge and to express new concepts. Predicate logic permits the use of different forms of training examples and background knowledge. It enables the results of the induction process, that is the induced concepts, to be described as general first-order clauses with variables and not just as zero-order propositional clauses made up of attribute-value pairs. There are two main types of ILP systems, the first based on the top-down generalisation/specialisation method, and the second on the principle of inverse resolution [Muggleton, 1992; Lavrac, 1994].

A number of inductive learning programs have been developed. Some of the well known programs are CART [Breiman et al., 1998], ID3 and its descendants C4.5 and C5.0 [Quinlan, 1983; 1986; 1993; ISL, 1998; RuleQuest, 2000], which are divide-and-conquer programs; the AQ family of programs [Michalski, 1969; 1990; Michalski et al., 1986; Cervone et al., 2001; Michalski and Kaufman, 2001], which follow the covering approach; the FOIL program [Quinlan, 1990; Quinlan and Cameron-Jones, 1995], which is an ILP system adopting the generalisation/specialisation method; and the GOLEM program [Muggleton and Feng, 1990], which is an ILP system based on inverse resolution. Although most programs only generate crisp decision rules, algorithms have also been developed to produce fuzzy rules [Wang and Mendel, 1992; Janikow, 1998; Hang and Chen, 2000; Baldwin and Martin, 2001; Wang et al., 2001; Baldwin and Karale, 2003; Wang et al., 2003].

Figure 3 shows the main steps in RULES-3 Plus, an induction algorithm in the covering category [Pham and Dimov, 1997] and belonging to the RULES family of rule extraction systems [Pham and Aksoy, 1994; 1995a; 1995b; Pham et al., 2000; Pham et al., 2003; Pham and Afify, 2005a]. The simple problem of detecting the state of a metal cutting tool is used to explain the operation of RULES-3 Plus. Three sensors are employed to monitor the cutting process and, according to the signals obtained from them (1 or 0 for sensors 1 and 3; −1, 0, or 1 for sensor 2), the tool is inferred as being “normal” or “worn”. Thus, this problem involves three attributes, which are the states of sensors 1, 2 and 3, and the signals that they emit constitute the values of those attributes. The example set for the problem is given in Table 1.

Table 1. Training set for the Cutting Tool problem

Example   Sensor_1   Sensor_2   Sensor_3   Tool State
1         0          −1         0          Normal
2         1          0          0          Normal
3         1          −1         1          Worn
4         1          0          1          Normal
5         0          0          1          Normal
6         1          1          1          Worn
7         1          −1         0          Normal
8         0          −1         1          Worn
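Before turning to the covering algorithm, the attribute selection step of the divide-and-conquer approach can be illustrated on the same training set. The text only names information gain as one possible strategy; the sketch below assumes the usual entropy-based definition associated with ID3, and the function names are chosen for illustration.

```python
from collections import Counter
from math import log2

# Table 1 as (Sensor_1, Sensor_2, Sensor_3, tool state).
EXAMPLES = [
    (0, -1, 0, "Normal"),
    (1, 0, 0, "Normal"),
    (1, -1, 1, "Worn"),
    (1, 0, 1, "Normal"),
    (0, 0, 1, "Normal"),
    (1, 1, 1, "Worn"),
    (1, -1, 0, "Normal"),
    (0, -1, 1, "Worn"),
]
ATTRIBUTES = ["Sensor_1", "Sensor_2", "Sensor_3"]


def entropy(examples):
    """Shannon entropy (in bits) of the class labels of a set of examples."""
    counts = Counter(example[-1] for example in examples)
    total = len(examples)
    return -sum(c / total * log2(c / total) for c in counts.values())


def information_gain(examples, attr_index):
    """Reduction in entropy obtained by splitting on one attribute."""
    total = len(examples)
    remainder = 0.0
    for value in {example[attr_index] for example in examples}:
        subset = [e for e in examples if e[attr_index] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder


gains = {name: information_gain(EXAMPLES, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
print("attribute chosen for the first split:", best)
```

Under this assumption, Sensor_2 gives the largest gain, because two of its three values (0 and 1) each select examples of a single class.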


Step 1. Take an unclassified example and form array SETAV.

Step 2. Initialise arrays PRSET and T_PRSET (PRSET and T_PRSET will consist of mPRSET expressions with null conditions and zero H measures) and set nco = 0.

Step 3. IF nco < na
        THEN nco = nco + 1 and set m = 0;
        ELSE the example itself is taken as a rule and STOP.

Step 4. DO
          m = m + 1;
          Specialise expression m in PRSET by appending to it a condition from SETAV that differs from the conditions already included in the expression;
          Compute the H measure for the expression;
          IF its H measure is higher than the H measure of any expression in T_PRSET
          THEN replace the expression having the lowest H measure with the newly formed expression;
          ELSE discard the new expression;
        WHILE m < mPRSET.

Step 5. IF there are consistent expressions in T_PRSET
        THEN choose as a rule the expression that has the highest H measure and discard the others;
        ELSE copy T_PRSET into PRSET; initialise T_PRSET and go to step 3.

Figure 3. Rule forming procedure of RULES-3 Plus
Notes: nco = number of conditions; na = number of attributes; mPRSET = number of expressions stored in PRSET (mPRSET is user-provided); T_PRSET = a temporary array of partial rules of the same dimension as PRSET

In step 1, example 1 is used to form the attribute-value array SETAV, which will contain the following attribute-value pairs: [Sensor_1 = 0], [Sensor_2 = −1] and [Sensor_3 = 0]. In step 2, the partial rule set PRSET and T_PRSET, the temporary version of PRSET used for storing partial rules in the process of rule construction, are initialised. This creates for each of these sets three expressions having null conditions and zero H measures. The H measure for an expression is defined as:

$$
H = \sqrt{\frac{E^{c}}{E}} \left[ 2 - 2\sqrt{\frac{E_i^{c}}{E^{c}} \, \frac{E_i}{E}} - 2\sqrt{\left(1 - \frac{E_i^{c}}{E^{c}}\right)\left(1 - \frac{E_i}{E}\right)} \right]
\tag{9}
$$

where $E^{c}$ is the number of examples covered by the expression (the total number of examples correctly classified and misclassified by a given rule), $E$ is the total number of examples, $E_i^{c}$ is the number of examples covered by the expression and belonging to the target class $i$ (the number of examples correctly classified by a given rule), and $E_i$ is the number of examples in the training set belonging to the target class $i$. In Equation (9), the first term

$$
G = \sqrt{\frac{E^{c}}{E}}
\tag{10}
$$

relates to the generality of the rule and the second term

$$
A = 2 - 2\sqrt{\frac{E_i^{c}}{E^{c}} \, \frac{E_i}{E}} - 2\sqrt{\left(1 - \frac{E_i^{c}}{E^{c}}\right)\left(1 - \frac{E_i}{E}\right)}
\tag{11}
$$

indicates its accuracy.

In steps 3 and 4, by specialising PRSET using the conditions stored in SETAV, the following expressions are formed and stored in T_PRSET (Alarm = OFF and Alarm = ON correspond to the tool states “Normal” and “Worn” of Table 1):

1. Sensor_3 = 0 ⇒ Alarm = OFF     H = 0.2565
2. Sensor_2 = −1 ⇒ Alarm = OFF    H = 0.0113
3. Sensor_1 = 0 ⇒ Alarm = OFF     H = 0.0012

In step 5, a rule is produced as the first expression in T_PRSET applies to only one class:

Rule 1: IF Sensor_3 = 0 THEN Alarm = OFF     H = 0.2565

Rule 1 can classify examples 2 and 7 in addition to example 1. Therefore, these examples are marked as classified and the induction proceeds.

In the second iteration, example 3 is considered. T_PRSET, formed in step 4 after specialising the initial PRSET, now consists of the following expressions:

1. Sensor_3 = 1 ⇒ Alarm = ON      H = 0.0406
2. Sensor_2 = −1 ⇒ Alarm = ON     H = 0.0079
3. Sensor_1 = 1 ⇒ Alarm = ON      H = 0.0005

As none of the expressions covers only one class, T_PRSET is copied into PRSET (step 5) and the new PRSET has to be specialised further by appending conditions from SETAV to the existing expressions. Therefore the procedure returns to step 3 for a new pass. The new T_PRSET formed at the end of step 4 contains the following three expressions:

1. Sensor_2 = −1 AND Sensor_3 = 1 ⇒ Alarm = ON     H = 0.3876
2. Sensor_1 = 1 AND Sensor_3 = 1 ⇒ Alarm = ON      H = 0.0534
3. Sensor_1 = 1 AND Sensor_2 = −1 ⇒ Alarm = ON     H = 0.0008
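The H measure can be evaluated directly from the training set of Table 1. The sketch below is an illustration under stated assumptions: H is taken as the product of a generality term sqrt(Ec/E) and an accuracy term 2 − 2·sqrt((Eic/Ec)(Ei/E)) − 2·sqrt((1 − Eic/Ec)(1 − Ei/E)), tool states “Normal” and “Worn” stand for Alarm = OFF and Alarm = ON, and the function name and dictionary-based condition format are hypothetical.

```python
from math import sqrt

# Table 1 as (Sensor_1, Sensor_2, Sensor_3, tool state).
EXAMPLES = [
    (0, -1, 0, "Normal"),
    (1, 0, 0, "Normal"),
    (1, -1, 1, "Worn"),
    (1, 0, 1, "Normal"),
    (0, 0, 1, "Normal"),
    (1, 1, 1, "Worn"),
    (1, -1, 0, "Normal"),
    (0, -1, 1, "Worn"),
]


def h_measure(conditions, target, examples):
    """H measure of the expression "conditions => target".

    conditions maps an attribute index (0, 1, 2) to the required value.
    """
    total = len(examples)                                     # E
    covered = [e for e in examples
               if all(e[i] == v for i, v in conditions.items())]
    n_covered = len(covered)                                  # Ec
    if n_covered == 0:
        return 0.0
    n_correct = sum(1 for e in covered if e[-1] == target)    # Eic
    n_target = sum(1 for e in examples if e[-1] == target)    # Ei
    generality = sqrt(n_covered / total)
    accuracy = (2
                - 2 * sqrt((n_correct / n_covered) * (n_target / total))
                - 2 * sqrt((1 - n_correct / n_covered) * (1 - n_target / total)))
    return generality * accuracy


# Candidate expressions formed from example 1 in the first iteration.
for conditions in ({2: 0}, {1: -1}, {0: 0}):
    print(conditions, round(h_measure(conditions, "Normal", EXAMPLES), 4))
```

With these definitions the three first-iteration expressions evaluate to approximately 0.2565, 0.0113 and 0.0012, so the expression on Sensor_3 is ranked first.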


As the first expression applies to only one class, the following rule is obtained:

Rule 2: IF Sensor_2 = −1 AND Sensor_3 = 1 THEN Alarm = ON     H = 0.3876

Rule 2 can classify examples 3 and 8, which again are marked as classified. In the third iteration, example 4 is used to obtain the next rule:

Rule 3: IF Sensor_2 = 0 THEN Alarm = OFF     H = 0.2565

This rule can classify examples 4 and 5 and so they are also marked as classified. In iteration 4, the last unclassified example 6 is employed for rule extraction, yielding:

Rule 4: IF Sensor_2 = 1 THEN Alarm = ON     H = 0.2741

There are no remaining unclassified examples in the example set and the procedure terminates at this point.

Due to its requirement for a set of examples in a rigid format (with known attributes and of known classes), inductive learning has found rather limited applications in engineering, as not many engineering problems can be described in terms of such a set of examples. Another reason for the paucity of applications is that inductive learning is generally more suitable for problems where attributes have discrete or symbolic values than for those with continuous-valued attributes, as in many engineering problems.
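Returning to the cutting tool example, the four extracted rules can be checked against the training set of Table 1 with a short sketch. The rule encoding and the function name `classify` are assumptions made for illustration, rules are tried in the order in which they were extracted, and Alarm = OFF / ON is mapped to the tool states “Normal” / “Worn”.

```python
# Table 1 as (Sensor_1, Sensor_2, Sensor_3, tool state).
EXAMPLES = [
    (0, -1, 0, "Normal"),
    (1, 0, 0, "Normal"),
    (1, -1, 1, "Worn"),
    (1, 0, 1, "Normal"),
    (0, 0, 1, "Normal"),
    (1, 1, 1, "Worn"),
    (1, -1, 0, "Normal"),
    (0, -1, 1, "Worn"),
]

# Each rule: (name, {attribute index: required value}, predicted tool state).
RULES = [
    ("Rule 1", {2: 0}, "Normal"),        # IF Sensor_3 = 0 THEN Alarm = OFF
    ("Rule 2", {1: -1, 2: 1}, "Worn"),   # IF Sensor_2 = -1 AND Sensor_3 = 1 THEN Alarm = ON
    ("Rule 3", {1: 0}, "Normal"),        # IF Sensor_2 = 0 THEN Alarm = OFF
    ("Rule 4", {1: 1}, "Worn"),          # IF Sensor_2 = 1 THEN Alarm = ON
]


def classify(example, rules):
    """Return (rule name, prediction) for the first rule that matches."""
    for name, conditions, prediction in rules:
        if all(example[i] == value for i, value in conditions.items()):
            return name, prediction
    return None, None


results = [classify(example, RULES) for example in EXAMPLES]
for number, (example, (name, prediction)) in enumerate(zip(EXAMPLES, results), start=1):
    print(number, example[:3], name, prediction, "expected:", example[-1])
```

Every example is matched by some rule and receives its recorded tool state, which is the condition under which the covering procedure stops.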
Some recent examples of applications of inductive learning are:
• controlling a laser cutting robot [Luzeaux, 1994];
• controlling the functional electrical stimulation of spinally-injured humans [Kostov et al., 1995];
• modelling job complexity in clothing production systems [Hui et al., 1997];
• analysing the constructability of a beam in a reinforced-concrete frame [Skibniewski et al., 1997];
• analysing the results of tests on portable electronic products to discover useful design knowledge [Zhou, 2001];
• accelerating rotogravure printing [Evans and Fisher, 2002];
• predicting JIT factory performance from past data that includes both good and poor factory performance [Mathieu et al., 2002];
• developing an intelligent monitoring system for improving the reliability of a manufacturing process [Peng, 2004];
• analysing data in a steel bar manufacturing company to help intelligent decision making [Pham et al., 2004].

More information on inductive learning techniques and their applications in engineering and manufacture can be found in [Pham et al., 2002; Pham and Afify, 2005b].

4. NEURAL NETWORKS

Like inductive learning programs, neural networks can capture domain knowledge from examples. However, they do not archive the acquired knowledge in an explicit form such as rules or decision trees, and they can readily handle both continuous and discrete data. Like fuzzy expert systems, they also have a good generalisation capability.

A neural network is a computational model of the brain. Neural network models usually assume that computation is distributed over several simple units called neurons, which are interconnected and operate in parallel (hence, neural networks are also called parallel-distributed-processing systems or connectionist systems). Figure 4 illustrates a typical model of a neuron. The output signal y_j is a function f of the sum of the weighted input signals x_i. The activation function f can be a linear, simple threshold, sigmoidal, hyperbolic tangent or radial basis function. Instead of being deterministic, f can be a probabilistic function, in which case y_j will be a binary quantity, for example, +1 or −1. The net input to such a stochastic neuron (that is, the sum of the weighted input signals x_i) will then give the probability of y_j being +1 or −1.

How the inter-neuron connections are arranged and the nature of the connections determine the structure of a network. How the strengths of the connections are adjusted or trained to achieve a desired overall behaviour of the network is governed by its learning algorithm. Neural networks can be classified according to their structures and learning algorithms.

In terms of their structures, neural networks can be divided into two types: feedforward networks and recurrent networks. Feedforward networks can perform a static mapping between an input space and an output space: the output at a given instant is a function only of the input at that instant.
The most popular feedforward neural network is the multi-layer perceptron (MLP): all signals flow in a single direction from the input to the output of the network. Figure 5 shows an MLP with three layers: an input layer, an output layer and an intermediate or hidden layer. Neurons in the input layer only act as buffers for distributing the input signals x_i to neurons in the hidden layer. Each neuron j in the hidden layer operates according to the model of Figure 4. That is, its output y_j is given by:

$$y_j = f\left(\sum_i w_{ji} x_i\right) \qquad (12)$$

Figure 4. Model of a neuron (inputs x_1, ..., x_i, ..., x_n are weighted by w_j1, ..., w_ji, ..., w_jn, summed and passed through f(.) to give the output y_j)
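The neuron model of Figure 4 and Equation (12) can be sketched in a few lines of Python. The function names are hypothetical, only four of the activation functions listed in the text are included, and the mapping from net input to the probability of a +1 output in the stochastic neuron is an assumed logistic form, since the text does not give a specific one.

```python
import math
import random


def neuron_output(weights, inputs, activation="sigmoid"):
    """Deterministic neuron: y_j = f(sum_i w_ji * x_i), as in Equation (12)."""
    net = sum(w * x for w, x in zip(weights, inputs))
    if activation == "linear":
        return net
    if activation == "threshold":
        return 1.0 if net >= 0.0 else 0.0
    if activation == "sigmoid":
        return 1.0 / (1.0 + math.exp(-net))
    if activation == "tanh":
        return math.tanh(net)
    raise ValueError(f"unknown activation: {activation}")


def stochastic_neuron_output(weights, inputs, rng=random):
    """Stochastic neuron returning +1 or -1.

    Assumption: P(y = +1) = 0.5 * (1 + tanh(net)), a logistic function of
    the net input; other probability mappings are equally possible.
    """
    net = sum(w * x for w, x in zip(weights, inputs))
    p_plus = 0.5 * (1.0 + math.tanh(net))
    return 1 if rng.random() < p_plus else -1
```

A hidden layer of an MLP is then simply a list of such neurons sharing the same input vector, each with its own weight vector.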

Figure 5. A multi-layer perceptron (input layer x_1, x_2, ..., x_m; hidden layer with weights w_11, w_12, ..., w_1m; output layer y_1, ..., y_n)

The outputs of neurons in the output layer are computed similarly. Other feedforward networks [Pham and Liu, 1999] include the learning vector quantisation (LVQ) network, the cerebellar model articulation controller (CMAC) network and the group-method of data handling (GMDH) network.

Recurrent networks are networks where the outputs of some neurons are fed back to the same neurons or to neurons in layers before them. Thus signals can flow in both forward and backward directions. Recurrent networks are said to have a dynamic memory: the output of such networks at a given instant reflects the current input as well as previous inputs and outputs. Examples of recurrent networks [Pham and Liu, 1999] include the Hopfield network, the Elman network and the Jordan network.

Figure 6 shows a well-known, simple recurrent neural network, the Grossberg and Carpenter ART-1 network. The network has two layers, an input layer and an output layer. The two layers are fully interconnected, with connections in both the forward (or bottom-up) direction and the feedback (or top-down) direction. The vector W_i of weights of the bottom-up connections to an output neuron i forms an exemplar of the class it represents. All the W_i vectors constitute the long-term memory of the network. They are employed to select the winning neuron, which is the neuron whose W_i vector is most similar to the current input pattern. The vector V_i of the weights of the top-down connections from an output neuron i is used for vigilance testing, that is, determining whether an input pattern is sufficiently close to a stored exemplar. The vigilance vectors V_i form the short-term memory of the network. V_i and W_i are related in that W_i is a normalised copy of V_i, viz.

$$W_i = \frac{V_i}{\varepsilon + \sum_j V_{ji}} \qquad (13)$$

where ε is a small constant and V_ji is the jth component of V_i (i.e. the weight of the connection from output neuron i to input neuron j).

Figure 6. An ART-1 network (an input layer and an output layer joined by bottom-up weights W and top-down weights V)

Implicit “knowledge” is built into a neural network by training it. Neural networks are trained and categorised according to two main types of learning algorithms: supervised and unsupervised. In addition, there is a third type, reinforcement learning, which is a special case of supervised learning.

In supervised training, the neural network is presented with typical input patterns and the corresponding expected output patterns. The error between the actual and expected outputs is used to modify the strengths, or weights, of the connections between the neurons. The backpropagation (BP) algorithm, a gradient descent algorithm, is the most commonly adopted MLP training algorithm. It gives the change Δw_ji in the weight of a connection between neurons i and j as follows:

$$\Delta w_{ji} = \eta\,\delta_j\,x_i \qquad (14)$$

where η is a parameter called the learning rate and δ_j is a factor depending on whether neuron j is an output neuron or a hidden neuron. For output neurons,

$$\delta_j = \frac{\partial f}{\partial net_j}\left(y_j^{(t)} - y_j\right) \qquad (15)$$

and for hidden neurons,

$$\delta_j = \frac{\partial f}{\partial net_j}\sum_q w_{qj}\,\delta_q \qquad (16)$$
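A minimal sketch of one pattern-based backpropagation update following Equations (14) to (16) is given below. It assumes a single hidden layer, sigmoidal activations (so that the derivative of f with respect to the net input equals y(1 − y)) and no separate bias terms; the function name `backprop_step` and the list-of-lists weight layout are choices made for this illustration only. Here the net input of a neuron is its weighted input sum and `target` holds the desired outputs.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def backprop_step(x, target, w_hidden, w_out, eta=0.5):
    """Apply one update to w_hidden and w_out in place; return the outputs
    computed before the update.

    w_hidden[j][i] connects input i to hidden neuron j.
    w_out[q][j] connects hidden neuron j to output neuron q.
    """
    # Forward pass, Equation (12) applied layer by layer.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]

    # Equation (15): delta for output neurons (sigmoid derivative is y * (1 - y)).
    d_out = [yq * (1.0 - yq) * (tq - yq) for yq, tq in zip(y, target)]

    # Equation (16): delta for hidden neurons, using the output deltas.
    d_hid = [
        hj * (1.0 - hj) * sum(w_out[q][j] * d_out[q] for q in range(len(w_out)))
        for j, hj in enumerate(h)
    ]

    # Equation (14): weight change = eta * delta_j * x_i.
    for q, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += eta * d_out[q] * h[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * d_hid[j] * x[i]
    return y
```

The hidden deltas are computed before any weight is changed, so that both layers are updated from the same forward pass.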


In Equation (15), net_j is the total weighted sum of input signals to neuron j and y_j^(t) is the target output for neuron j. As there are no target outputs for hidden neurons, in Equation (16), the difference between the target and actual output of a hidden neuron j is replaced by the weighted sum of the δ_q terms already obtained for neurons q connected to the output of j. Thus, iteratively, beginning with the output layer, the δ term is computed for neurons in all layers and weight updates are determined for all connections. The weight updating process can take place after the presentation of each training pattern (pattern-based training) or after the presentation of the whole set of training patterns (batch training). In either case, a training epoch is said to have been completed when all training patterns have been presented once to the MLP.

For all but the most trivial problems, several epochs are required for the MLP to be properly trained. A commonly adopted method to speed up the training is to add a “momentum” term to Equation (14), which effectively lets the previous weight change influence the new weight change, viz:

$$\Delta w_{ji}(k+1) = \eta\,\delta_j\,x_i + \mu\,\Delta w_{ji}(k) \qquad (17)$$

where Δw_ji(k + 1) and Δw_ji(k) are weight changes in epochs k + 1 and k respectively and μ is the “momentum” coefficient.

Some neural networks are trained in an unsupervised mode where only the input patterns are provided during training and the networks learn automatically to cluster them in groups with similar features. For example, training an ART-1 network involves the following steps:
(i) initialising the exemplar and vigilance vectors W_i and V_i for all output neurons by setting all the components of each V_i to 1 and computing W_i according to Equation (13). An output neuron with all its vigilance weights set to 1 is known as an uncommitted neuron in the sense that it is not assigned to represent any pattern classes;
(ii) presenting a new input pattern x;
(iii) enabling all output neurons so that they can participate in the competition for activation;
(iv) finding the winning output neuron among the competing neurons, i.e. the neuron for which x · W_i is largest; a winning neuron can be an uncommitted neuron, as is the case at the beginning of training or if there are no better output neurons;
(v) testing whether the input pattern x is sufficiently similar to the vigilance vector V_i of the winning neuron. Similarity is measured by the fraction r of bits in x that are also in V_i, viz.

$$r = \frac{x \cdot V_i}{\sum_j x_j} \qquad (18)$$

x is deemed to be sufficiently similar to V_i if r is at least equal to the vigilance threshold ρ (0 < ρ ≤ 1);


(vi) going to step (vii) if r ≥ ρ, the vigilance threshold (i.e. there is resonance); else disabling the winning neuron temporarily from further competition and going to step (iv), repeating this procedure until there are no further enabled neurons;
(vii) adjusting the vigilance vector V_i of the most recent winning neuron by logically ANDing it with x, thus deleting bits in V_i that are not also in x; computing the bottom-up exemplar vector W_i using the new V_i according to Equation (13); activating the winning output neuron;
(viii) going to step (ii).

The above training procedure ensures that if the same sequence of training patterns is repeatedly presented to the network, its long-term and short-term memories are unchanged (i.e. the network is stable). Also, provided there are sufficient output neurons to represent all the different classes, new patterns can always be learnt, as a new pattern can be assigned to an uncommitted output neuron if it does not match previously stored exemplars well (i.e. the network is plastic).

In reinforcement learning, instead of requiring a teacher to give target outputs and using the differences between the target and actual outputs directly to modify the weights of a neural network, the learning algorithm employs a critic only to evaluate the appropriateness of the neural network output corresponding to a given input. According to the performance of the network on a given input vector, the critic will issue a positive or negative reinforcement signal. If the network has produced an appropriate output, the reinforcement signal will be positive (a reward). Otherwise, it will be negative (a penalty). The intention is to strengthen the tendency to produce appropriate outputs and to weaken the propensity for generating inappropriate outputs. Reinforcement learning is a trial-and-error operation designed to maximise the average value of the reinforcement signal for a set of training input vectors.
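Returning to the ART-1 training procedure of steps (i) to (viii), the following Python sketch may help to make the sequence concrete. It assumes binary input patterns given as lists of 0s and 1s, breaks ties in the competition in favour of the lowest-numbered neuron, and uses illustrative default values for the vigilance threshold `rho` and the small constant `eps` of Equation (13); none of these choices is prescribed by the text.

```python
def art1_train(patterns, n_outputs, rho=0.7, eps=0.5, epochs=1):
    """Cluster binary patterns with a simplified ART-1 procedure.

    Returns (V, W, assignments), where assignments[k] is the index of the
    output neuron that resonated with pattern k in the last epoch, or None
    if every neuron failed the vigilance test.
    """
    n_inputs = len(patterns[0])

    def exemplar(v):
        # Equation (13): W_i is a normalised copy of V_i.
        total = eps + sum(v)
        return [vj / total for vj in v]

    # Step (i): all neurons start uncommitted (every vigilance weight is 1).
    V = [[1] * n_inputs for _ in range(n_outputs)]
    W = [exemplar(v) for v in V]

    assignments = []
    for _ in range(epochs):
        assignments = []
        for x in patterns:                       # step (ii)
            enabled = set(range(n_outputs))      # step (iii)
            winner = None
            while enabled:
                # Step (iv): the enabled neuron with the largest x . W_i wins.
                i = max(sorted(enabled),
                        key=lambda k: sum(xj * wj for xj, wj in zip(x, W[k])))
                # Step (v): fraction of bits in x that are also in V_i.
                ones = sum(x)
                r = sum(xj & vj for xj, vj in zip(x, V[i])) / ones if ones else 1.0
                if r >= rho:                     # step (vi): resonance
                    winner = i
                    break
                enabled.discard(i)               # disable and compete again
            if winner is not None:               # step (vii)
                V[winner] = [xj & vj for xj, vj in zip(x, V[winner])]
                W[winner] = exemplar(V[winner])
            assignments.append(winner)           # step (viii): next pattern
    return V, W, assignments
```

With three output neurons and the patterns `[1, 1, 0, 0]`, `[0, 0, 1, 1]` and `[1, 1, 0, 0]`, the first and third patterns resonate with the same neuron, the second pattern commits a new neuron, and the remaining neuron stays uncommitted.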
An example of a simple reinforcement learning algorithm is a variation of the associative reward-penalty algorithm [Hassoun, 1995]. Consider a single stochastic neuron j with inputs (x_1, x_2, x_3, ..., x_n). The reinforcement rule may be written as [Hassoun, 1995]

$$w_{ji}(k+1) = w_{ji}(k) + l\,r(k)\left[y_j(k) - E\left(y_j(k)\right)\right]x_i(k) \qquad (19)$$

w_ji is the weight of the connection between input i and neuron j, l is the learning coefficient, r (which is +1 or −1) is the reinforcement signal, y_j is the output of neuron j, E(y_j) is the expected value of the output, and x_i(k) is the ith component of the kth input vector in the training set. When learning converges, w_ji(k + 1) = w_ji(k) and so E(y_j(k)) = y_j(k) = +1 or −1. Thus, the neuron effectively becomes deterministic. Reinforcement learning is typically slower than supervised learning. It is more applicable to small neural networks used as controllers where it is difficult to determine the target network output.

For more information on neural networks, see [Michie et al., 1994; Hassoun, 1995; Pham and Liu, 1999; Yao, 1999; Jiang et al., 2002; Duch et al., 2004].

Neural networks can be employed as mapping devices, pattern classifiers or pattern completers (auto-associative content addressable memories and pattern associators). Like expert systems, they have found a wide spectrum of applications in almost all areas of engineering, addressing problems ranging from modelling, prediction, control, classification and pattern recognition, to data association, clustering, signal processing and optimisation. Some recent examples of such applications are:
• predicting the tensile strength of composite laminates [Teti and Caprino, 1994];
• controlling a flexible assembly operation [Majors and Richards, 1995];
• choosing sheet metal working conditions [Lin and Chang, 1996];
• determining suitable cutting conditions in operation planning [Park et al., 1996; Schultz et al., 1997];
• recognising control chart patterns [Pham and Oztemel, 1996];
• analysing vibration spectra [Smith et al., 1996];
• deducing velocity vectors in uniform and rotating flows by tracking the movement of groups of particles [Jambunathan et al., 1997];
• setting the number of kanbans in a dynamic JIT factory [Wray et al., 1997; Markham et al., 2000];
• generating knowledge for scheduling a flexible manufacturing system [Kim et al., 1998; Priore et al., 2003];
• modelling and controlling dynamic systems including robot arms [Pham and Liu, 1999];
• acquiring and refining operational knowledge in industrial processes [Shigaki and Narazaki, 1999];
• improving yield in a semiconductor manufacturing company [Shin and Park, 2000];
• identifying arbitrary geometric and manufacturing categories in CAD databases [Ip et al., 2003];
• minimising the makespan in a flowshop scheduling problem [Akyol, 2004].

5. GENETIC ALGORITHMS

Conventional search techniques, such as hill-climbing, are often incapable of optimising non-linear or multimodal functions. In such cases, a random search method is generally required. However, undirected search techniques are extremely inefficient for large domains. A genetic algorithm (GA) is a directed random search technique, invented by Holland [Holland, 1975], which can find the global optimal solution in complex multi-dimensional search spaces. A GA is modelled on natural evolution in that the operators it employs are inspired by the natural evolution process. These operators, known as genetic operators, manipulate individuals in a population over several generations to improve their fitness gradually. Individuals in a population are likened to chromosomes and usually represented as strings of binary numbers.

The evolution of a population is described by the “schema theorem” [Holland, 1975; Goldberg, 1989]. A schema represents a set of individuals, i.e. a subset of the population, in terms of the similarity of bits at certain positions of those individuals. For example, the schema 1∗0∗ describes the set of individuals whose first and third bits are 1 and 0, respectively. Here, the symbol ∗ means any value would be acceptable. In other words, the values of bits at positions marked ∗ could be either 0 or 1. A schema is characterised by two parameters: defining length and order. The defining length is the length between the first and last bits with fixed values. The order of a schema is the number of bits with specified values. According to the schema theorem, the distribution of a schema through the population from one generation to the next depends on its order, defining length and fitness.

GAs do not use much knowledge about the optimisation problem under study and do not deal directly with the parameters of the problem. They work with codes which represent the parameters. Thus, the first issue in a GA application is how to code the problem, i.e. how to represent its parameters. As already mentioned, GAs operate with a population of possible solutions. The second issue is the creation of a set of possible solutions at the start of the optimisation process as the initial population. The third issue in a GA application is how to select or devise a suitable set of genetic operators. Finally, as with other search algorithms, GAs have to know the quality of the solutions already found to improve them further. An interface between the problem environment and the GA is needed to provide this information. The design of this interface is the fourth issue.
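The schema notions introduced earlier in this section (membership of a schema, order and defining length) can be illustrated with a short Python sketch. Schemata and individuals are written as strings, with the ASCII character `*` standing for the "any value" symbol; the function names are hypothetical, and the defining length is taken here as the distance between the first and last fixed positions.

```python
def matches(schema, individual):
    """True if the individual belongs to the set described by the schema."""
    return len(schema) == len(individual) and all(
        s in ("*", b) for s, b in zip(schema, individual)
    )


def order(schema):
    """Number of positions with a specified (fixed) value."""
    return sum(1 for s in schema if s != "*")


def defining_length(schema):
    """Distance between the first and last fixed positions (0 if none)."""
    fixed = [k for k, s in enumerate(schema) if s != "*"]
    return fixed[-1] - fixed[0] if fixed else 0
```

For the schema `1*0*` used in the text, the individual `1101` is a member while `1110` is not; the schema has two fixed bits, and its first and last fixed positions are two places apart.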

5.1 Representation

The parameters to be optimised are usually represented in a string form since this type of representation is suitable for genetic operators. The method of representation has a major impact on the performance of the GA. Different representation schemes might lead to different performance in terms of accuracy and computation time. There are two common representation methods for numerical optimisation problems [Blickle and Thiele, 1995; Michalewicz, 1996]. The preferred method is the binary string representation method. This method is popular because the binary alphabet offers the maximum number of schemata per bit compared to other coding techniques. Various binary coding schemes can be found in the literature, for example, Uniform coding, Gray scale coding, etc. The second representation method is to use a vector of integers or real numbers, with each integer or real number representing a single parameter. When a binary representation scheme is employed, an important step is to decide the number of bits used to encode the parameters to be optimised. Each parameter should be encoded with the optimal number of bits covering all possible solutions in the solution space. When too few or too many bits are used, the performance can be adversely affected.
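To illustrate the binary representation and the effect of the number of bits, the sketch below maps a real-valued parameter in a range [lower, upper] onto an n-bit string and back, and converts between plain binary and reflected Gray code. The uniform quantisation and the function names are assumptions made for this example; with n bits the resolution is (upper − lower) / (2^n − 1).

```python
def encode(value, lower, upper, n_bits):
    """Quantise a real parameter onto an n-bit binary string."""
    levels = 2 ** n_bits - 1
    k = round((value - lower) / (upper - lower) * levels)
    return format(k, f"0{n_bits}b")


def decode(bits, lower, upper):
    """Map a binary string back to a real parameter value."""
    levels = 2 ** len(bits) - 1
    return lower + int(bits, 2) * (upper - lower) / levels


def binary_to_gray(bits):
    """Reflected Gray code: adjacent integers differ in exactly one bit."""
    k = int(bits, 2)
    return format(k ^ (k >> 1), f"0{len(bits)}b")


def gray_to_binary(bits):
    """Inverse of binary_to_gray."""
    g = int(bits, 2)
    k = 0
    while g:
        k ^= g
        g >>= 1
    return format(k, f"0{len(bits)}b")
```

Using too few bits makes the quantisation step coarse, whereas too many bits enlarge the search space without adding useful precision.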

5.2 Creation of Initial Population

At the start of optimisation, a GA requires a group of initial solutions. There are two ways of forming this initial population. The first consists of using randomly produced solutions created by a random number generator, for example. This method is preferred for problems about which no a priori knowledge exists or for assessing the performance of an algorithm. The second method employs a priori knowledge about the given optimisation problem. Using this knowledge, a set of requirements is obtained and solutions which satisfy those requirements are collected to form an initial population. In this case, the GA starts the optimisation with a set of approximately known solutions and therefore convergence to an optimal solution can take less time than with the previous method.

5.3 Genetic Operators

The flowchart of a simple GA is given in Figure 7. There are basically four genetic operators: selection, crossover, mutation and inversion. Some of these operators were inspired by nature. In the literature, many versions of these operators can be found. It is not necessary to employ all of these operators in a GA because each operates independently of the others. The choice or design of operators depends on the problem and the representation scheme employed. For instance, operators designed for binary strings cannot be directly used on strings coded with integers or real numbers.

5.3.1 Selection

The aim of the selection procedure is to reproduce more copies of individuals whose fitness values are high than of those whose fitness values are low. The selection procedure has a significant influence on driving the search towards a promising area and finding good solutions in a short time. However, the diversity of the population must be maintained to avoid premature convergence and to reach the global optimal solution. In GAs there are mainly two selection procedures: proportional selection, also called stochastic selection, and ranking-based selection [Whitely, 1989].

Proportional selection is usually called “Roulette Wheel” selection, since its mechanism is reminiscent of the operation of a Roulette Wheel. Fitness values of individuals represent the widths of slots on the wheel. After a random spinning of the wheel to select an individual for the next generation, slots with large widths, representing individuals with high fitness values, will have a higher chance of being selected.

One way to prevent premature convergence is to control the range of trials allocated to any single individual, so that no individual produces too many offspring. The ranking system is one such alternative selection algorithm. In this algorithm, each individual generates an expected number of offspring which is based on the rank of its performance and not on the magnitude [Baker, 1985].

Figure 7. Flowchart of a basic genetic algorithm (boxes: Initial Population, Evaluation, Selection, Crossover, Mutation, Inversion)

5.3.2 Crossover

This operation is considered the one that makes the GA different from other algorithms, such as dynamic programming. It is used to create two new individuals (children) from two existing individuals (parents) picked from the current population by the selection operation. There are several ways of doing this. Some common crossover operations are one-point crossover, two-point crossover, cycle crossover and uniform crossover. One-point crossover is the simplest crossover operation. Two individuals are randomly selected as parents from the pool of individuals formed by the selection procedure and cut at a randomly selected point. The tails, which are the parts after the cutting point, are swapped and two new individuals (children) are produced. Note that this operation does not change the values of bits. An example of one-point crossover is shown in Figure 8.

5.3.3 Mutation

In this procedure, all individuals in the population are checked bit by bit and the bit values are randomly reversed according to a specified rate. Unlike crossover, this is a monadic operation. That is, a child string is produced from a single parent string. The mutation operator forces the algorithm to search new areas. Eventually, it helps the GA to avoid premature convergence and find the global optimal solution. An example is given in Figure 9.

Parent 1: 100|010011110
Parent 2: 001|011000110
New string 1: 100|011000110
New string 2: 001|010011110
Figure 8. Crossover

Old string: 1100|0|1011101
New string: 1100|1|1011101
Figure 9. Mutation

5.3.4 Inversion

This operator is employed for a group of problems, such as the cell placement problem, layout problem and travelling salesman problem. It also operates on one individual at a time. Two points are randomly selected from an individual and the part of the string between those two points is reversed (see Figure 10).
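The four operators of Sections 5.3.1 to 5.3.4 can be sketched for binary strings as follows. This is a minimal illustration rather than a complete GA: the function names are hypothetical, the cut points are passed in explicitly instead of being drawn at random so that the examples of Figures 8 and 10 can be reproduced, and roulette-wheel selection assumes non-negative fitness values with a positive sum.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Proportional selection: slot width equals fitness (assumed >= 0)."""
    pick = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point round-off


def one_point_crossover(parent1, parent2, point):
    """Swap the tails of two parents after the cutting point."""
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]


def mutate(individual, rate, rng):
    """Reverse each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in individual
    )


def invert(individual, start, end):
    """Reverse the segment individual[start:end]."""
    return individual[:start] + individual[start:end][::-1] + individual[end:]
```

Calling `one_point_crossover("100010011110", "001011000110", 3)` gives the two new strings of Figure 8, and `invert("10110011101", 2, 6)` gives the new string of Figure 10.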

5.4 Control Parameters

Important control parameters of a simple GA include the population size (number of individuals in the population), crossover rate, mutation rate and inversion rate. Several researchers have studied the effect of these parameters on the performance of a GA [Schaffer et al., 1989; Grefenstette, 1986; Fogarty, 1989; Mahfoud, 1995; Smith and Fogarty, 1997]. The main conclusions are as follows. A large population size means the simultaneous handling of many solutions and increases the computation time per iteration; however, since many samples from the search space are used, the probability of convergence to a global optimal solution is higher than with a small population size. The crossover rate determines the frequency of the crossover operation. It is useful at the start of optimisation to discover promising regions in the search space. A low crossover frequency decreases the speed of convergence to such areas. If the frequency is too high, it can lead to saturation around one solution. The mutation operation is controlled by the mutation rate. A high mutation rate introduces high diversity in the population and might cause instability. On the other hand, it is usually very difficult for a GA to find a global optimal solution with too low a mutation rate.

Old string: 10|1100|11101
New string: 10|0011|11101
Figure 10. Inversion of a binary string segment

5.5 Fitness Evaluation Function

The fitness evaluation unit in a GA acts as an interface between the GA and the optimisation problem. The GA assesses solutions for their quality according to the information produced by this unit and not by directly using information about their structure. In engineering design problems, functional requirements are specified to the designer who has to produce a structure which performs the desired functions within predetermined constraints. The quality of a proposed solution is usually calculated depending on how well the solution performs the desired functions and satisfies the given constraints. In the case of a GA, this calculation must be automatic and the problem is how to devise a procedure which computes the quality of solutions. Fitness evaluation functions might be complex or simple depending on the optimisation problem at hand. Where a mathematical equation cannot be formulated for this task, a rule-based procedure can be constructed for use as a fitness function or in some cases both can be combined. Where some constraints are very important and cannot be violated, the structures or solutions which do so can be eliminated in advance by appropriately designing the representation scheme. Alternatively, they can be given low probabilities by using special penalty functions. For further information on genetic algorithms, see [Holland, 1975; Goldberg, 1989; Davis, 1991; Mitchell, 1996; Pham and Karaboga, 2000; Freitas, 2002]. Genetic algorithms have found applications in engineering problems involving complex combinatorial or multi-parameter optimisation. 
Some recent examples of those applications are:
• configuring transmission systems [Pham and Yang, 1993];
• designing the knowledge base of fuzzy logic controllers [Pham and Karaboga, 1994];
• generating hardware description language programs for high-level specification of the function of programmable logic devices [Seals and Whapshott, 1994];
• planning collision-free paths for mobile and redundant robots [Ashiru et al., 1995; Wilde and Shellwat, 1997; Nearchou and Aspragathos, 1997];
• scheduling the operations of a job shop [Cho et al., 1996; Drake and Choudhry, 1997; Lee et al., 1997; Chryssolouris and Subramaniam, 2001; Pérez et al., 2003];
• generating dynamic schedules for the operation and control of a flexible manufacturing cell [Jawahar et al., 1998];
• optimising the performance of an industrially designed inventory control system [Disney, 2000];
• forming manufacturing cells and determining machine layout information for cellular manufacturing [Wu et al., 2002];
• optimising assembly process plans to improve productivity [Li et al., 2003];
• improving the convergence speed and reducing the computational complexity of neural networks [Öztürk and Öztürk, 2004].
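Returning to the fitness evaluation function of Section 5.5, the idea of giving constraint-violating solutions a low selection probability through a penalty function can be sketched as follows. The multiplicative form of the penalty, the convention that a constraint function returns a positive value when violated, and the small item-selection problem (`VALUES`, `WEIGHTS`, `CAPACITY`) are all assumptions introduced for this illustration. A multiplicative penalty is used so that fitness stays non-negative when the objective is non-negative, which suits proportional selection.

```python
def penalised_fitness(solution, objective, constraints, penalty_weight=10.0):
    """Objective value scaled down according to the total constraint violation.

    Each constraint g is satisfied when g(solution) <= 0; positive values
    measure by how much it is violated.
    """
    violation = sum(max(0.0, g(solution)) for g in constraints)
    return objective(solution) / (1.0 + penalty_weight * violation)


# Hypothetical example: choose items (bit = 1) without exceeding a capacity.
VALUES = [10, 20, 30]
WEIGHTS = [1, 2, 3]
CAPACITY = 4


def total_value(bits):
    return sum(v for v, b in zip(VALUES, bits) if b)


def overweight(bits):
    return sum(w for w, b in zip(WEIGHTS, bits) if b) - CAPACITY
```

Here the feasible solution `(1, 0, 1)` keeps its full objective value, whereas the infeasible solution `(1, 1, 1)` has a higher raw value but receives a much lower fitness.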

6. SOME APPLICATIONS IN ENGINEERING AND MANUFACTURE

This section briefly reviews five engineering applications of the aforementioned soft computing tools.

6.1 Expert Statistical Process Control

Statistical process control (SPC) is a technique for improving the quality of processes and products through closely monitoring data collected from those processes and products and using statistically-based tools such as control charts. XPC is an expert system for facilitating and enhancing the implementation of statistical process control [Pham and Oztemel, 1996]. A commercially available shell was employed to build XPC. The shell allows a hybrid rule-based and pseudo object-oriented method of representing the standard SPC knowledge and process-specific diagnostic knowledge embedded in XPC. The amount of knowledge involved is extensive, which justifies the adoption of a knowledge-based systems approach.

XPC comprises four main modules. The construction module is used to set up a control chart. The capability analysis module is for calculating process capability indices. The on-line interpretation and diagnosis module assesses whether the process is in control and determines the causes of possible out-of-control situations. It also provides advice on how to remedy such situations. The modification module updates the parameters of a control chart to maintain true control over a time-varying process. XPC has been applied to the control of temperature in an injection moulding machine producing rubber seals. It has recently been enhanced by integrating a neural network module with the expert system modules to detect abnormal patterns in the control chart (see Figure 11).

6.2 Fuzzy Modelling of a Vibratory Sensor for Part Location

Figure 12 shows a six-degree-of-freedom vibratory sensor for determining the coordinates (x_G, y_G) of the centre of mass and the orientation γ of bulky rigid parts. The sensor is designed to enable a robot to pick up parts accurately for machine feeding or assembly tasks. The sensor consists of a rigid platform (P) mounted on a flexible column (C). The platform supports one object (O) to be located at a time. O is held firmly with respect to P. The static deflections of C under the weight of O and the natural frequencies of vibration of the dynamic system comprising O, P and C are measured and processed using a mathematical model of the system to determine x_G, y_G and γ for O.

In practice, the frequency measurements have low repeatability, which leads to inconsistent location information. The problem worsens when γ is in the region 80°-90° relative to a reference axis of the sensor because the mathematical model becomes ill-conditioned. In this “ill-conditioning” region, an alternative to using a mathematical model to compute γ is to adopt an experimentally derived fuzzy model. Such a fuzzy model has to be obtained for each specific object through calibration. A possible calibration procedure involves placing the object at different positions (x_G, y_G) and orientations γ and recording the periods of vibration T of the sensor. Following calibration, fuzzy rules relating x_G, y_G and T to γ could be constructed to form a fuzzy model of the behaviour of the sensor for the given object.

Figure 11. XPC output screen (a range chart and a mean chart with their control limits and process state messages, an out-of-control warning, and the percentage match of the current pattern to the normal, increasing trend, decreasing trend, upward shift, downward shift and cyclic classes)

Figure 12. Schematic diagram of a vibratory sensor mounted on a robot wrist (platform P, object O, column C, orientation γ, end of robot arm)

A simpler fuzzy model is achieved by observing that x_G and y_G only affect the reference level of T and, if x_G and y_G are employed to define that level, the trend in the relationship between T and γ is the same regardless of the position of the object. Thus, a simplified fuzzy model of the sensor consists of rules such as “IF T − T_ref is small THEN γ − γ_ref is small”, where T_ref is the value of T when the object is at position (x_G, y_G) and orientation γ_ref. γ_ref could be chosen as 80°, the point at which the fuzzy model is to replace the mathematical model. T_ref could be either measured experimentally or computed from the mathematical model.

To counteract the effects of the poor repeatability of period measurements, which are particularly noticeable in the “ill-conditioning” region, the fuzzy rules are modified so that they take into account the variance in T. An example of a modified fuzzy rule is: “IF T − T_ref is small AND σ_T is small, THEN γ − γ_ref is small”. In the above rule, σ_T denotes the standard deviation in the measurement of T. Fuzzy modelling of the vibratory sensor is detailed in Pham and Hafeez (1992). Using a fuzzy model, the orientation γ can be determined to ±2° accuracy in the region 80°-90°. The adoption of fuzzy logic in this application has produced a compact and transparent model from a large amount of noisy experimental data.

6.3 Induction of Feature Recognition Rules in a Geometric Reasoning System for Analysing 3D Assembly Models

Pham et al. (1999) have described a concurrent engineering approach involving generating assembly strategies for a product directly from its 3D CAD model. A feature-based CAD system is used to create assembly models of products. A geometric reasoning module extracts assembly-oriented data for a product from the CAD system after creating a virtual assembly tree that identifies the components and sub-assemblies making up the given product (Figure 13a). The assembly information extracted by the module includes: placement constraints and dimensions used to specify the relevant position of a given component or sub-assembly; geometric entities (edges, surfaces, etc.) used to constrain the component or sub-assembly; and the parents and children of each entity employed as a placement constraint. An example of the information extracted is shown in Figure 13b.

Feature recognition is applied to the extracted information to identify each feature used to constrain a component or sub-assembly. The rule-based feature recognition process has three possible outcomes:
1. The feature is recognised as belonging to a unique class.
2. The feature shares attributes with more than one class (see Figure 13c).
3. The feature does not belong to any known class.

Cases 2 and 3 require the user to decide the correct class of the feature and the rule base to be updated. The updating is implemented via a rule induction program. The program employs RULES-3 Plus, which automatically extracts new feature recognition rules from examples provided to it in the form of characteristic vectors representing different features and their respective class labels. Rule induction is very suitable for this application because of the complexity of the characteristic vectors and the difficulty of defining feature classes manually.
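The three recognition outcomes and the rule-base update can be illustrated with a minimal sketch. It is not RULES-3 Plus: the attribute names (`kind`, `through`, `flat_faces`, `round_faces`), the two starting rules and the helper names `recognise` and `add_user_rule` are hypothetical, and the update step simply stores the user-labelled characteristic vector as a fully specific rule, whereas an induction algorithm would generalise from several labelled examples. Only the class labels BSL_1 and BSL_2 are taken from Figure 13c.

```python
from typing import Dict, List, Tuple

Feature = Dict[str, object]
Rule = Dict[str, object]

# Hypothetical rule base; the attribute names and values are assumptions.
RULES: List[Rule] = [
    {"if": {"kind": "slot", "through": False, "flat_faces": 3}, "then": "BSL_1"},
    {"if": {"kind": "slot", "through": False, "round_faces": 1}, "then": "BSL_2"},
]


def recognise(feature: Feature, rules: List[Rule]) -> Tuple[str, List[str]]:
    """Return one of the three outcomes together with the candidate classes."""
    # A rule fires when every one of its conditions holds for the feature.
    matches = [r for r in rules
               if all(feature.get(k) == v for k, v in r["if"].items())]
    if not matches:
        return "unknown", []                      # outcome 3
    # Assumption: the most specific matching rules take precedence.
    best = max(len(r["if"]) for r in matches)
    classes = sorted({r["then"] for r in matches if len(r["if"]) == best})
    if len(classes) == 1:
        return "unique", classes                  # outcome 1
    return "ambiguous", classes                   # outcome 2


def add_user_rule(feature: Feature, label: str, rules: List[Rule]) -> None:
    """Naive stand-in for rule induction: store the labelled vector as a rule."""
    rules.append({"if": dict(feature), "then": label})


if __name__ == "__main__":
    rules = [dict(r) for r in RULES]
    new_feature = {"kind": "slot", "through": False,
                   "flat_faces": 3, "round_faces": 1}
    print(recognise(new_feature, rules))      # before the user supplies a class
    add_user_rule(new_feature, "BSL_2", rules)
    print(recognise(new_feature, rules))      # after the rule base is updated
```

In this sketch the user's decision in cases 2 and 3 becomes a new rule that is more specific than the existing ones, so the same feature is recognised uniquely the next time it is encountered.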

Figure 13a. An assembly model

Bolt:
• Child of Block
• Placement constraints:
  1: alignment of two axes
  2: mating of the bottom surface of the bolt head and the upper surface of the block
• No child part in the assembly hierarchy

Block:
• No parents
• No constraints (root component)
• Next part in the assembly: Bolt

Figure 13b. An example of assembly information

Figure 13c. An example of feature recognition [labels: New Form Feature; Detected Similar Feature Classes; Rectangular Nonthrough Slot (BSL_1); Partial Round Nonthrough Slot (BSL_2)]

6.4 Neural-network-based Automotive Product Inspection

Figure 14 depicts an intelligent inspection system for engine valve stem seals [Pham and Oztemel, 1996]. The system comprises four CCD cameras connected to a computer that implements neural-network-based algorithms for detecting and classifying defects in the seal lips. Faults on the lip aperture are classified by a multilayer perceptron. The input to the network is a 20-component vector, where the value of each component is the number of times a particular geometric feature is found on the aperture being inspected. The outputs of the network indicate the type of defect on the seal lip aperture. A similar neural network is used to classify defects on the seal lip surface.

The accuracy of defect classification in both perimeter and surface inspection is in excess of 80%. Note that this figure is not the same as that for the accuracy in detecting defective seals, that is, differentiating between good and defective seals. The latter task is also implemented using a neural network, which achieves an accuracy of almost 100%. Neural networks are necessary for this application because of the difficulty of describing precisely the various types of defects and the differences between good and defective seals. The neural networks are able to learn the classification task automatically from examples.

Figure 14. Valve stem seal inspection system [labels: Ethernet link; vision system; host PC; 4 CCD cameras, 512 × 512 resolution; databus; lighting ring; seal; chute; material handling and lighting controller; bowl feeder; indexing machine; good, reject and rework outlets]

6.5 GA-based Conceptual Design

TRADES is a system using GA techniques to produce conceptual designs of transmission units [Pham and Yang, 1993]. The system has a set of basic building blocks, such as gear pairs, belt drives and mechanical linkages, and generates conceptual designs to satisfy given specifications by assembling the building blocks into different configurations. The crossover, mutation and inversion operators of the GA are employed to create new configurations from an existing population of configurations.

Configurations are evaluated for their compliance with the design specifications. Potential solutions should provide the required speed reduction ratio and motion transformation while not containing incompatible building blocks or exceeding specified limits on the number of building blocks to be adopted. A fitness function codifies the degree of compliance of each configuration. The maximum fitness value is assigned to configurations that satisfy all functional requirements without violating any constraints. As in a standard GA, information concerning the fitness of solutions is employed to select solutions for reproduction, thus guiding the process towards increasingly fitter designs as the population evolves.

In addition to the usual GA operators, TRADES incorporates new operators to avert premature convergence to non-optimal solutions and facilitate the generation of a variety of design concepts. Essentially, these operators reduce the chances of any one configuration or family of configurations dominating the solution population by avoiding crowding around very fit configurations and preventing multiple copies of a configuration, particularly after it has been identified as a potential solution.

TRADES is able to produce design concepts from building blocks without requiring much additional a priori knowledge. The manipulation of the building blocks to generate new concepts is carried out by the GA in a stochastic but guided manner. This enables good conceptual designs to be found without the need to search the design space exhaustively. Due to the very large size of the design space and the quasi-random operation of the GA, novel solutions not immediately evident to a human designer are sometimes generated by TRADES. On the other hand, impractical configurations could also arise. TRADES incorporates a number of heuristics to filter out such design proposals.

7. CONCLUSION

Over the past fifty years, the field of soft computing has produced a number of powerful tools. This chapter has reviewed five of those tools, namely, knowledge-based systems, fuzzy logic, inductive learning, neural networks and genetic algorithms. Applications of the tools in engineering and manufacture have become more widespread due to the power and affordability of present-day computers. It is anticipated that many new applications will emerge and that, for demanding tasks, greater use will be made of hybrid tools combining the strengths of two or more of the tools reviewed here [Michalski and Tecuci, 1994; Medsker, 1995]. Other technological developments in soft computing that will have an impact in engineering include data mining, or the extraction of information and knowledge from large databases [Limb and Meggs, 1994; Witten and Frank, 2000; Braha, 2001; Han and Kamber, 2001; Pham and Afify, 2002; Klösgen and Żytkow, 2002; Giudici, 2003], and multi-agent systems, or distributed self-organising systems employing entities that function autonomously in an unpredictable environment concurrently with other entities and processes [Wooldridge and Jennings, 1994; Rzevski, 1995; Márkus et al., 1996; Tharumarajah et al., 1996; Bento and Feijó, 1997; Monostori, 2002]. The appropriate deployment of these new soft computing tools and of the tools presented in this chapter will contribute to the creation of more competitive engineering systems.

8. ACKNOWLEDGEMENTS

This work was carried out within the ALFA project “Novel Intelligent Automation and Control Systems II” (NIACS II), the ERDF (Objective One) projects “Innovation in Manufacturing Centre” (IMC), “Innovative Technologies for Effective Enterprises” (ITEE) and “Supporting Innovative Product Engineering and Responsive Manufacturing” (SUPERMAN) and within the project “Innovative Production Machines and Systems” (I*PROMS).

REFERENCES

Akyol D E, (2004), “Application of neural networks to heuristic scheduling algorithms”, Computers Ind. Engng, 46, 679–696.
Ashiru I, Czanecki C and Routen T, (1995), “Intelligent operators and optimal genetic-based path planning for mobile robots”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, 1018–1023.
Badiru A B and Cheung J Y, (2002), Fuzzy Engineering Expert Systems with Neural Network Applications, John Wiley & Sons, New York.
Baker J E, (1985), “Adaptive selection methods for genetic algorithms”, Proc. 1st Int. Conf. on Genetic Algorithms and Their Applications, Pittsburgh, PA, 101–111.
Baldwin J F and Karale S B, (2003), “New concepts for fuzzy partitioning, defuzzification and derivation of probabilistic fuzzy decision trees”, Proc. 22nd Int. Conf. of the North American Fuzzy Information Processing Society (NAFIPS-03), Chicago, Illinois, USA, 484–487.
Baldwin J F and Martin T P, (2001), “Towards inductive support logic programming”, Proc. Joint 9th IFSA World Congress and 20th NAFIPS Int. Conf., Vancouver, Canada, 4, 1875–1880.
Bas K and Erkmen A M, (1995), “Fuzzy preshape and reshape control of Anthrobot-III 5-fingered robot hand”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, 673–677.
Bento J and Feijó B, (1997), “An agent-based paradigm for building intelligent CAD systems”, Artificial Intelligence in Engineering, 11 (3), 231–244.
Beskese A, Kahraman C and Irani Z, (2004), “Quantification of flexibility in advanced manufacturing systems using fuzzy concepts”, Int. J. Production Economics, 89 (1), 45–56.
Bigand A, Goureau P and Kalemkarian J, (1994), “Fuzzy control of a welding process”, Proc. IMACS Int. Symp. on Signal Processing, Robotics and Neural Networks (SPRANN 94), Villeneuve d’Ascq, France, 379–342.
Blickle T and Thiele L, (1995), “A comparison of selection schemes used in genetic algorithms”, Computer Engineering and Communication Networks Lab (TIK)-Report, No. 11, Version 1.1b, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland.
Bose A, Gini M and Riley D, (1997), “A case-based approach to planar linkage design”, Artificial Intelligence in Engineering, 11 (2), 107–119.
Bozdağ C E, Kahraman C and Ruan D, (2003), “Fuzzy group decision making for selection among computer integrated manufacturing systems”, Computers in Industry, 15 (1), 13–29.
Braha D, (2001), Data Mining for Design and Manufacturing: Methods and Applications, Kluwer Academic Publishers, Boston.
Breiman L, Friedman J H, Olshen R A and Stone C J, (1984), Classification and Regression Trees, Belmont, Wadsworth.
Cervone G, Panait L A and Michalski R S, (2001), “The development of the AQ20 learning system and initial experiments”, Proc. 10th Inter. Symposium on Intelligent Information Systems, Poland.
Chen J C and Black J T, (1997), “A fuzzy-nets in-process (FNIP) system for tool-breakage monitoring in end-milling operations”, Int. J. Machine Tools and Manufacture, 37 (6), 783–800.
Cho B J, Hong S C and Okoma S, (1996), “Job shop scheduling using genetic algorithm”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, 351–358.
Chryssolouris G and Subramaniam V, (2001), “Dynamic scheduling of manufacturing job shops using genetic algorithms”, J. Intelligent Manufacturing, 12, 281–293.
Costa Branco P J and Dente J A, (1998), “An experiment in automatic modelling an electrical drive system using fuzzy logic”, IEEE Trans on Systems, Man, and Cybernetics, 28 (2), 254–262.
Da Rocha Fernandes A M and Cid Bastos R, (1996), “Fuzzy expert systems for qualitative analysis of minerals”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 673–680.
Darlington K W, (1999), The Essence of Expert Systems, Prentice Hall.
Davis L, (1991), Handbook of Genetic Algorithms, Van Nostrand, New York, NY.
De La Sen M, Miñambres J J, Garrido A J, Almansa A and Soto J C, (2004), “Basic theoretical results for expert systems: Application to the supervision of adaptation transients in planar robots”, Artificial Intelligence, 152 (2), 173–211.
Disney S M, Naim M M and Towill D R, (2000), “Genetic algorithm optimisation of a class of inventory control systems”, Int. J. Production Economics, 68, 259–278.
Drake P R and Choudhry I A, (1997), “From apes to schedules”, Manufacturing Engineer, 76 (1), 43–45.
Dubois D and Prade H, (1998), “An introduction to fuzzy systems”, Clinica Chimica Acta, 270 (1), 3–29.
Duch W, Setiono R and Zurada J M, (2004), “Computational intelligence methods for rule-based data understanding”, Proc. IEEE, 92 (5), 771–805.
Durkin J, (1994), Expert Systems Design and Development, Macmillan, New York.
Evans B and Fisher D, (2002), “Using decision tree induction to minimize process delays in printing industry”, In: Handbook of Data Mining and Knowledge Discovery (W. Klösgen and J. M. Żytkow (Eds.)), Oxford University Press.
Kong F, Yu J and Zhou X, (1999), “Analysis of fuzzy dynamic characteristics of machine cutting process: Fuzzy stability analysis in regenerative-type-chatter”, Int. J. Machine Tools and Manufacture, 39 (8), 1299–1309.
Ferreiro Garcia R, (1994), “FAM rule as basis for poles shifting applied to the roll control of an aircraft”, SPRANN 94 (ibid), 375–378.
Fogarty T C, (1989), “Varying the probability of mutation in the genetic algorithm”, Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 104–109.
Freitas A A, (2002), Data Mining and Knowledge Discovery with Evolutionary Algorithms, Springer-Verlag, Berlin, New York.
Giarratano J C and Riley G D, (1998), Expert Systems: Principles and Programming, 3rd Edition, PWS Publishing Company, Boston, MA.
Giudici P, (2003), Applied Data Mining: Statistical Methods for Business and Industry, John Wiley & Sons, England.
Goldberg D E, (1989), Genetic Algorithms in Search, Optimisation and Machine Learning, Addison Wesley, Reading, MA.
Grefenstette J J, (1986), “Optimization of control parameters for genetic algorithms”, IEEE Trans on Systems, Man and Cybernetics, 16 (1), 122–128.
Han J and Kamber M, (2001), Data Mining: Concepts and Techniques, Academic Press, USA.
Hassoun M H, (1995), Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA.
Holland J H, (1975), Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI.
Hong T P and Chen J B, (2000), “Processing individual fuzzy attributes for fuzzy rule induction”, Fuzzy Sets and Systems, 112 (1), 127–140.
Hui P C L, Chan K C K and Yeung K W, (1997), “Modelling job complexity in garment manufacturing by inductive learning”, Int. J. Clothing Science and Technology, 9 (1), 34–44.
Ip C Y, Regli W C, Sieger L and Shokoufandeh A, (2003), “Automated learning of model classification”, Proc. 8th ACM Symposium on Solid Modeling and Applications, Seattle, Washington, USA, ACM Press, 322–327.
ISL, (1998), Clementine Data Mining Package, SPSS UK Ltd., 1st Floor, St. Andrew’s House, West Street, Woking, Surrey GU21 1EB, United Kingdom.
Jackson P, (1999), Introduction to Expert Systems, 3rd Edition, Addison-Wesley, Harlow, Essex.
Jambunathan K, Fontama V N, Hartle S L and Ashforth-Frost S, (1997), “Using ART 2 networks to deduce flow velocities”, Artificial Intelligence in Engineering, 11 (2), 135–141.
Janikow C Z, (1998), “Fuzzy decision trees: Issues and methods”, IEEE Trans on Systems, Man, and Cybernetics, 28 (1), 1–14.
Jawahar N, Aravindan P, Ponnambalam S G and Karthikeyan A A, (1998), “A genetic algorithm-based scheduler for setup-constrained FMC”, Computers in Industry, 35, 291–310.
Jiang Y, Zhou Z-H and Chen Z-Q, (2002), “Rule learning based on neural network ensemble”, Proc. Inter. Joint Conf. on Neural Networks, Honolulu, HI, 1416–1420.
Kalogirou S A, (2003), “Artificial Intelligence for modelling and control of combustion processes: A review”, Progress in Energy and Combustion Science, 29 (6), 515–566.
Kamrani A K, Shashikumar S and Patel S, (1995), “An intelligent knowledge-based system for robotic cell design”, Computers Ind. Engng, 29 (1–4), 141–145.
Karsak E E, (2004), “Fuzzy multiple objective programming framework to prioritize design requirements in quality function deployment”, Computers Ind. Engng, (Submitted and accepted).
Karsak E E and Kuzgunkaya O, (2002), “A fuzzy multiple objective programming approach for the selection of a flexible manufacturing system”, Int. J. Production Economics, 79 (2), 101–111.
Kaufmann A, (1975), Introduction to the Theory of Fuzzy Subsets, Vol. 1, Academic Press, New York.
Kim C-O, Min Y-D and Yih Y, (1998), “Integration of inductive learning and neural networks for multi-objective FMS scheduling”, Int. J. Production Research, 36 (9), 2497–2509.
Klir G J and Yuan B, (1995), Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall, Upper Saddle River, NJ.
Klir G J and Yuan B, (Eds.), (1996), Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems – selected papers by L A Zadeh, World Scientific, Singapore.
Klösgen W and Żytkow J M, (2002), Handbook of Data Mining and Knowledge Discovery, Oxford University Press, New York.
Koo D Y and Han S H, (1996), “Application of the configuration design methods to a design expert system for paper feeding mechanism”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 49–56.
Kostov A, Andrews B, Stein R B, Popovic D and Armstrong W W, (1995), “Machine learning in control of functional electrical stimulation systems for locomotion”, IEEE Trans on Biomedical Engineering, 44 (6), 541–551.
Kulak O and Kahraman C, (2004), “Multi-attribute comparison of advanced manufacturing systems using fuzzy vs. crisp axiomatic design approach”, Int. J. Production Economics, (Submitted and accepted).
Lara Rosano F, Kemper Valverde N, De La Paz Alva C and Alcántara Zavala J, (1996), “Tutorial expert system for the design of energy cogeneration plants”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 300–305.
Lavrac N and Dzeroski S, (1994), Inductive Logic Programming: Techniques and Applications, Ellis Horwood, New York.
Lee C-Y, Piramuthu S and Tsai Y-K, (1997), “Job shop scheduling with a genetic algorithm and machine learning”, Int. J. Production Research, 35 (4), 1171–1191.
Li J R, Khoo L P and Tor S B, (2003), “A Tabu-enhanced genetic algorithm approach for assembly process planning”, J. Intelligent Manufacturing, 14, 197–208.
Limb P R and Meggs G J, (1994), “Data mining tools and techniques”, British Telecom Technology Journal, 12 (4), 32–41.
Lin Z-C and Chang D-Y, (1996), “Application of a neural network machine learning model in the selection system of sheet metal bending tooling”, Artificial Intelligence in Engineering, 10, 21–37.
Lou H H and Huang Y L, (2003), “Hierarchical decision making for proactive quality control: System development for defect reduction in automotive coating operations”, Engineering Applications of Artificial Intelligence, 16, 237–250.
Luzeaux D, (1994), “Process control and machine learning: rule-based incremental control”, IEEE Trans on Automatic Control, 39 (6), 1166–1171.
Mahfoud S W, (1995), Niching Methods for Genetic Algorithms, Ph.D. Thesis, Department of General Engineering, University of Illinois at Urbana-Champaign.
Majors M D and Richards R J, (1995), “A topologically-evolving neural network for robotic flexible assembly control”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, August, 894–899.
Markham I S, Mathieu R G and Wray B A, (2000), “Kanban setting through artificial intelligence: A comparative study of artificial neural networks and decision trees”, Integrated Manufacturing Systems, 11 (4), 239–246.
Márkus A, Kis T, Váncza J and Monostori L, (1996), “A market approach to holonic manufacturing”, CIRP Annals, 45 (1), 433–436.
Mathieu R G, Wray B A and Markham I S, (2002), “An approach to learning from both good and poor factory performance in a kanban-based just-in-time production system”, Production Planning & Control, 13 (8), 715–724.
Medsker L R, (1995), Hybrid Intelligent Systems, Kluwer Academic Publishers, Boston, 298 pp.
Michalewicz Z, (1996), Genetic Algorithms + Data Structures = Evolution Programs, 3rd Edition, Springer-Verlag, Berlin.
Michalski R S, (1990), “A theory and methodology of inductive learning”, in Readings in Machine Learning, Eds. Shavlik J W and Dietterich T G, Kaufmann, San Mateo, CA, 70–95.
Michalski R S and Kaufman K A, (2001), “The AQ19 system for machine learning and pattern discovery: A general description and user guide”, Reports of the Machine Learning and Inference Laboratory, MLI 01-2, George Mason University, Fairfax, VA, USA.
Michalski R S and Larson J B, (1983), “Incremental generation of VL1 hypotheses: The underlying methodology and the descriptions of program AQ11”, ISG 83–5, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois.
Michalski R S, Mozetic I, Hong J and Lavrac N, (1986), “The multi-purpose incremental learning system AQ15 and its testing application to three medical domains”, American Association of Artificial Intelligence, Los Altos, CA, Morgan Kaufmann, 1041–1045.
Michalski R and Tecuci G, (1994), Machine Learning: A Multistrategy Approach, 4, Morgan Kaufmann Publishers, San Francisco, CA, USA.
Michie D, Spiegelhalter D J and Taylor C C, (1994), Machine Learning, Neural and Statistical Classification, Ellis Horwood, New York.
Mitchell M, (1996), An Introduction to Genetic Algorithms, MIT Press.
Monostori L, (2002), “AI and machine learning techniques for managing complexity, changes and uncertainties in manufacturing”, Proc. 15th Triennial World Congress, Barcelona, Spain, 119–130.
Muggleton S (ed), (1992), Inductive Logic Programming, Academic Press, London, 565 pp.
Muggleton S and Feng C, (1990), “Efficient induction of logic programs”, Proc. 1st Conf. on Algorithmic Learning Theory, Tokyo, Japan, 368–381.
Nearchou A C and Aspragathos N A, (1997), “A genetic path planning algorithm for redundant articulated robots”, Robotica, 15 (2), 213–224.
Nurminen J K, Karonen O and Hatonen K, (2003), “What makes expert systems survive over 10-years – empirical evaluation of several engineering applications”, Expert Systems with Applications, 24 (2), 199–211.
Ong S K, De Vin L J, Nee A Y C and Kals H J J, (1997), “Fuzzy set theory applied to bend sequencing for sheet metal bending”, Int. J. Materials Processing Technology, 69, 29–36.
Öztürk N and Öztürk F, (2004), “Hybrid neural network and genetic algorithm based machining feature recognition”, J. Intelligent Manufacturing, 15, 278–298.
Park M-W, Rho H-M and Park B-T, (1996), “Generation of modified cutting conditions using neural networks for an operation planning system”, Annals of the CIRP, 45 (1), 475–478.
Peers S M C, Tang M X and Dharmavasan S, (1994), “A knowledge-based scheduling system for offshore structure inspection”, Artificial Intelligence in Engineering IX (AIEng 9), Eds. Rzevski G, Adey R A and Russell D W, Computational Mechanics, Southampton, 181–188.
Peng Y, (2004), “Intelligent condition monitoring using fuzzy inductive learning”, J. Intelligent Manufacturing, 15 (3), 373–380.
Pérez E, Herrera F and Hernández C, (2003), “Finding multiple solutions in job shop scheduling by niching genetic algorithms”, J. Intelligent Manufacturing, 14, 323–339.
Pham D T and Afify A A, (2002), “Machine learning: Techniques and trends”, Proc. 9th Inter. Workshop on Systems, Signals and Image Processing (IWSSIP-02), Manchester Town Hall, UK, World Scientific, 12–36.
Pham D T and Afify A A, (2005a), “RULES-6: A simple rule induction algorithm for handling large data sets”, Proc. of the Institution of Mechanical Engineers, Part C, 219 (10), 1119–1137.
Pham D T and Afify A A, (2005b), “Machine learning techniques and their applications in manufacturing”, Proc. of the Institution of Mechanical Engineers, Part B, 219 (5), 395–412.
Pham D T, Afify A A and Dimov S S, (2002), “Machine learning in manufacturing”, Proc. 3rd CIRP Inter. Seminar on Intelligent Computation in Manufacturing Engineering (ICME 2002), Ischia, Italy, III–XII.
Pham D T and Aksoy M S, (1994), “An algorithm for automatic rule induction”, Artificial Intelligence in Engineering, 8, 277–282.
Pham D T and Aksoy M S, (1995a), “RULES: A rule extraction system”, Expert Systems with Applications, 8, 59–65.
Pham D T and Aksoy M S, (1995b), “A new algorithm for inductive learning”, Journal of Systems Engineering, 5, 115–122.
Pham D T, Bigot S and Dimov S S, (2003), “RULES-5: A rule induction algorithm for problems involving continuous attributes”, Proc. of the Institution of Mechanical Engineers, 217 (Part C), 1273–1286.
Pham D T and Dimov S S, (1997), “An efficient algorithm for automatic knowledge acquisition”, Pattern Recognition, 30 (7), 1137–1143.
Pham D T, Dimov S S and Salem Z, (2000), “Technique for selecting examples in inductive learning”, ESIT 2000 European Symposium on Intelligent Techniques, Erudit Aachen Germany, 119–127.
Pham D T, Dimov S S and Setchi R M, (1999), “Concurrent engineering: a tool for collaborative working”, Human Systems Management, 18, 213–224.
Pham D T and Hafeez K, (1992), “Fuzzy qualitative model of a robot sensor for locating three-dimensional objects”, Robotica, 10, 555–562.
Pham D T and Karaboga D, (1994), “Some variable mutation rate strategies for genetic algorithms”, SPRANN 94 (ibid), 73–96.
Pham D T and Karaboga D, (2000), Intelligent Optimisation Techniques: Genetic Algorithms, Tabu Search, Simulated Annealing and Neural Networks, Springer-Verlag, London, Berlin and Heidelberg, 2nd printing, 302 pp.
Pham D T and Liu X, (1999), Neural Networks for Identification, Prediction and Control, Springer-Verlag, London, Berlin and Heidelberg, 4th printing, 238 pp.
Pham D T and Oztemel E, (1996), Intelligent Quality Systems, Springer-Verlag, London, Berlin and Heidelberg, 201 pp.
Pham D T, Packianather M S, Dimov S, Soroka A J, Girard T, Bigot S and Salem Z, (2004), “An application of data mining and machine learning techniques in the metal industry”, Proc. 4th CIRP Inter. Seminar on Intelligent Computation in Manufacturing Engineering (ICME-04), Sorrento (Naples), Italy.
Pham D T and Pham P T N, (1988), “Expert systems in mechanical and manufacturing engineering”, Int. J. Adv. Manufacturing Technology, Special Issue on Knowledge Based Systems, 3 (3), 3–21.
Pham D T and Yang Y, (1993), “A genetic algorithm based preliminary design system”, Proc. IMechE, Part D: J. Automobile Engineering, 207, 127–133.
Price C J, (1990), Knowledge Engineering Toolkits, Ellis Horwood, Chichester.
Priore P, Fuente D, Pino R and Puente J, (2003), “Dynamic scheduling of manufacturing systems using neural networks and inductive learning”, Integrated Manufacturing Systems, 14 (2), 160–168.
Quinlan J R, (1983), “Learning efficient classification procedures and their application to chess endgames”, In: Machine Learning: An Artificial Intelligence Approach (Michalski R S, Carbonell J G and Mitchell T M (Eds.)), I, Tioga Publishing Co., 463–482.
Quinlan J R, (1986), “Induction of decision trees”, Machine Learning, 1, 81–106.
Quinlan J R, (1990), “Learning logical definitions from relations”, Machine Learning, 5, 239–266.
Quinlan J R, (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
Quinlan J R and Cameron-Jones R M, (1995), “Induction of logic programs: FOIL and related systems”, New Generation Computing, 13, 287–312.
Ross T J, (1995), Fuzzy Logic with Engineering Applications, McGraw-Hill, New York.
RuleQuest, (2001), Data Mining Tools C5.0, Pty Ltd, 30 Athena Avenue, St Ives NSW 2075, Australia. Available from: http://www.rulequest.com/see5-info.html.
Rzevski G, (1995), “Artificial intelligence in engineering: past, present and future”, Artificial Intelligence in Engineering X, Eds. Rzevski G, Adey R A and Tasso C, Computational Mechanics, Southampton, 3–16.
Schaffer J D, Caruana R A, Eshelman L J and Das R, (1989), “A study of control parameters affecting on-line performance of genetic algorithms for function optimisation”, Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 51–61.
Schultz G, Fichtner D, Nestler A and Hoffmann J, (1997), “An intelligent tool for determination of cutting values based on neural networks”, Proc. 2nd World Congress on Intelligent Manufacturing Processes and Systems, Budapest, Hungary, 66–71.
Seals R C and Whapshott G F, (1994), “Design of HDL programmes for digital systems using genetic algorithms”, AI Eng 9 (ibid), 331–338.
Shi Z Z, Zhou H and Wang J, (1997), “Applying case-based reasoning to engine oil design”, Artificial Intelligence in Engineering, 11 (2), 167–172.
Shigaki I and Narazaki H, (1999), “A machine-learning approach for a sintering process using a neural network”, Production Planning & Control, 10 (8), 727–734.
Shin C K and Park S C, (2000), “A machine learning approach to yield management in semiconductor manufacturing”, Int. J. Production Research, 38 (17), 4261–4271.
Skibniewski M, Arciszewski T and Lueprasert K, (1997), “Constructability analysis: machine learning approach”, ASCE J. of Computing in Civil Engineering, 12 (1), 8–16.
Smith J E and Fogarty T C, (1997), “Operator and parameter adaptation in genetic algorithms”, Soft Computing, 1 (2), 81–87.
Smith P, MacIntyre J and Husein S, (1996), “The application of neural networks in the power industry”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 321–326.
Sohen S Y and Choi I S, (2001), “Fuzzy QFD for supply chain management with reliability consideration”, Reliability Eng. and Systems Safety, 72, 327–334.
Streichfuss M and Burgwinkel P, (1995), “An expert-system-based machine monitoring and maintenance management system”, Control Eng. Practice, 3 (7), 1023–1027.
Szwarc D, Rajamani D and Bector C R, (1997), “Cell formation considering fuzzy demand and machine capacity”, Int. J. Advanced Manufacturing Technology, 13 (2), 134–147.
Tarng Y S, Tseng C M and Chung L K, (1997), “A fuzzy pulse discriminating systems for electrical discharge machining”, Int. J. Machine Tools and Manufacture, 37 (4), 511–522.
Teti R and Caprino G, (1994), “Prediction of composite laminate residual strength based on a neural network approach”, AI Eng 9 (ibid), 81–88.
Tharumarajah A, Wells A J and Nemes L, (1996), “Comparison of the bionic, fractal and holonic manufacturing system concepts”, Int. J. Computer Integrated Manufacturing, 9 (3), 217–226.
Vanegas L V and Labib A W, (2001), “A fuzzy quality function deployment (FQFD) model for deriving optimum targets”, Int. J. Production Research, 39 (1), 99–120.
Venkatachalam A R, (1994), “Automating manufacturability evaluation in CAD systems through expert systems approaches”, Expert Systems with Applications, 7 (4), 495–506.
Wang L X and Mendel M, (1992), “Generating fuzzy rules by learning from examples”, IEEE Trans on Systems, Man and Cybernetics, 22 (6), 1414–1427.
Wang W P, Peng Y H and Li X Y, (2002), “Fuzzy-grey prediction of cutting force uncertainty in turning”, J. Materials Processing Technology, 129, 663–666.
Wang C-H, Tsai C-J, Hong T-P and Tseng S-S, (2003), “Fuzzy Inductive Learning Strategies”, Applied Intelligence, 18 (2), 179–193.

Wang X Z, Wang Y D, Xu X F, Ling W D and Yeung D S, (2001), “A new approach to fuzzy rule generation: Fuzzy extension matrix”, Fuzzy Sets and Systems, 123 (3), 291–306. Whitely D, (1989), “The GENITOR algorithm and selection pressure: why rank-based allocation of reproductive trials is best”, Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 116–123. Wilde P and Shellwat H, (1997), “Implementation of a genetic algorithm for routing an autonomous robot”, Robotica, 15 (2), 207–211. Witten I H and Frank E, (2000), Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, USA. Wooldridge M J and Jennings N R, (1994), “Agent theories, architectures and languages : a survey”, Proc. ECAI 94 Workshop on Agent Theories, Architectures and Languages, Amsterdam, 1–32. Wray B A, Rakes T R and Rees L, (1997), “Neural network identification of critical factors in dynamic just-in-time kanban environment”, J. Intelligent Manufacturing, 8, 83–96. Wu X, Chu C-H, Wang Y and Yan W, (2002), “A genetic algorithm for integrated cell formation and layout decisions”, Proc. of the 2002 Congress on Evolutionary Computation (CEC-02), 2, 1866–1871. Yano H, Akashi T, Matsuoka N, Nakanishi K, Takata O and Horinouchi N, (1997), “An expert system to assist automatic remeshing in rigid plastic analysis”, Toyota Technical Review, 46 (2), 87–92. Yao X, (1999), “Evolving artificial neural networks”, Proceedings of the IEEE, 87 (9), 1423–1447. Zadeh L A, (1965), “Fuzzy Sets”, Information Control, 8, 338–353. Zha X F, Lim S Y E and Fok S C, (1998), “Integrated knowledge-based assembly sequence planning”, Int. J. Adv. Manufacturing Technology, 12 (3), 211–237. Zha X F, Lim S Y E and Fok S C, (1999), “Integrated knowledge-based approach and system for product design and assembly”, Int. J. Computer Integrated Manufacturing, 14, 50–64. 
Zhao Z Y and De Souza R, (1998), “On improving the performance of hard disk drive final assembly via knowledge intensive simulation”, J. Electronics Manufacturing, 1, 23–25. Zhao Z Y and De Souza R, (2001), “Fuzzy rule learning during simulation of manufacturing resources”, Fuzzy Sets and Systems, 122, 469–485. Zhou C, Nelson P C, Xiao W, Tirpak T M and Lane S A, (2001), “An intelligent data mining system for drop test analysis of electronic products”, IEEE Trans on Electronics Packaging Manufacturing, 24 (3), 222–231. Zimmermann H-J, (1996), Fuzzy Set Theory and its Applications, 3rd Edition, Kluwer Academic Publishers, Boston. Zülal G and Arikan F, (2000), “Application of fuzzy decision making in part-machine grouping”, Int. J. Production Economics, 63, 181–193.

CHAPTER 2 NEURAL NETWORKS HISTORICAL REVIEW

D. ANDINA1, A. VEGA-CORONA2, J. I. SEIJAS3, J. TORRES-GARCÍA

1 Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España.
2 Facultad de Ingeniería, Mecánica, Eléctrica y Electrónica (FIMEE), Universidad de Guanajuato (UG), Salamanca, Gto., México.
3 Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España.

Abstract:

This chapter starts with a historical summary of the evolution of Neural Networks, from the first models, which were very limited in their application capabilities, to the present ones, which make it possible to consider automating tasks that were formerly reserved for human intelligence. After the historical review, Neural Networks are treated from a computational point of view. This perspective helps to compare Neural Systems with classical Computing Systems and leads to a formal and common presentation that will be used throughout the book.

INTRODUCTION

Computers in use today can perform a great variety of well-defined tasks at a higher speed and with more reliability than human beings. None of us, for example, can solve complex mathematical equations at the speed of a personal computer. Nevertheless, the mental capacity of human beings still exceeds that of machines in a wide variety of tasks. No artificial image recognition system can compete with the capacity of a human being to discern between objects of diverse forms and orientations; in fact, it cannot even compete with the capacity of an insect. In the same way, whereas a computer needs an enormous amount of computation and restrictive conditions to recognize, for example, phonemes, an adult human effortlessly recognizes words pronounced by different people, at different speeds, accents and intonations, even in the presence of environmental noise.

It is observed that, by means of rules learned from experience, the human being is much more effective than computers at solving imprecise (ambiguous) problems, or problems that require a great amount of information. Our brain achieves this by means of thousands of millions of simple cells, called neurons, which are interconnected with each other. However, it is estimated that operational amplifiers and logic gates can perform operations several orders of magnitude faster than neurons. If the processing technique of biological elements were implemented with operational amplifiers and logic gates, one could construct relatively cheap machines able to process at least as much information as a biological brain. Of course, we are still far from knowing whether such machines will ever be built.

Therefore, there are strong reasons to consider tackling certain problems by means of parallel systems that process information and learn according to principles taken from the brains of living beings. Such systems are called Artificial Neural Networks, connectionist models or parallel distributed processing models. Artificial Neural Networks (ANNs or, simply, NNs) thus arise from the intention of simulating the biological brain in an artificial way.

D. Andina and D.T. Pham (eds.), Computational Intelligence, 39–65. © 2007 Springer.

1. HISTORICAL PERSPECTIVE

The science of Artificial Neural Networks made its first significant appearance during the 1940s. Researchers who tried to emulate the functions of the human brain developed physical models (later, simulations by means of programs) of biological neurons and their interconnections. As neurobiologists deepened their knowledge of the human neural system, these first models came to be regarded as increasingly rudimentary approximations. Nevertheless, some of the results obtained in these early days were impressive, which encouraged further research and the development of sophisticated and powerful Artificial Neural Networks.

1.1 First Computational Model of Nervous Activity: The Model of McCulloch and Pitts

McCulloch and Pitts published the first systematic studies of artificial neural networks [McCulloch and Pitts, 1943] [Pitts and McCulloch, 1947]. These studies were framed as a computational model of the nervous activity of the cells of the human nervous system. Most of their work focuses on the behavior of a single neuron, whose mathematical or computational model is shown in Figure 1. Inside the artificial neuron, each input $x_i$ is multiplied by a scale factor (or weight $w_i$) and the products are summed. The inputs emulate the excitations received by biological neurons. The weights represent the strength of the synaptic junction: a positive weight represents an excitatory effect, and a negative weight an inhibitory effect. If the sum exceeds a certain threshold value or bias (represented by the weight $w_0$), the cell activates and provides a positive value (normally +1); otherwise, the output takes a negative value (usually −1) or zero. The output is therefore binary. In general, the model follows neurobiological behavior: nerve cells produce nonlinear responses when excited by a certain input. In particular, McCulloch and Pitts proposed an activation function, representing the nonlinearity of the model, called the hard threshold function (see Figure 1).

Figure 1. Artificial model [McCulloch and Pitts, 1943] of a biological neuron. The components of the input vector $X = [x_1, x_2, \ldots, x_M]^T$ are weighted by $w_1, \ldots, w_M$ and summed together with $w_0$ to give $Z$; the output is $O = f(Z)$. As can be observed, the relation between input and output follows a nonlinear function called the activation function. In the first model shown in this figure, the activation is a hard threshold function.

Although this first model can only perform very simple tasks, as will be described below, the potential of neural systems lies essentially in the interconnection of neurons to form networks. This interconnection is normally arranged in layers of nodes (artificial neurons). Networks of this kind are called Multi-Layer Perceptrons (MLP). In general, one can distinguish Feed Forward Neural Networks, in which information is always transmitted in the direction from the input layer to the output layer, from Feedback Neural Networks, in which information can be transmitted in both directions; that is, connections from nodes of higher layers to nodes of lower layers are allowed. Figure 2 shows a Feed Forward Neural Network of two layers: a hidden layer (located right after the input layer) and the output layer. The input layer is usually not counted as a proper layer of the network. Each component of the input vector $\mathbf{x} = [1, x_1, \ldots, x_M]^T$ is connected to all the nodes of the first hidden layer. The strengths of these connections are determined by the weights associated with each of them. When the same scheme is applied to the remaining layers of the network, full connectivity is said to exist.

Proceeding chronologically, we leave Multi-Layer Perceptrons aside for the moment. The first artificial neural network proposals were single-layer networks, like the one shown in Figure 2 but without the output layer.

Figure 2. Two Layer Feed Forward Neural Network

This arrangement of several neurons of the first model (see Figure 1) in parallel was suggested in order to overcome the limitations of a neuron acting alone. It is easy to verify that the model of McCulloch and Pitts divides the input space into two parts by means of the hyperplane described by the equation

$$h(\mathbf{x}) = \sum_{j=1}^{M} w_j x_j + w_0 = 0 \qquad (1)$$

This effect can be observed in Figure 3, which shows this hyperplane for the particular case of M = 2. A single neuron can solve two-class classification problems on M-dimensional data, provided that the classes are linearly separable. That is, it can assign an output equal to 1 to all the data of class “A” (which fall on the same side of the hyperplane), and a value equal to −1 to the rest of the data, which fall on the opposite side. Mathematically, we can express this classification as

$$\sum_{j=1}^{M} w_j x_j \; \underset{C_A}{\overset{C_B}{\gtrless}} \; -w_0 \qquad (2)$$

where $C_A$ and $C_B$ denote class A and class B, respectively. We now have a very simple model of neuron behavior, which does not consider many of the biological characteristics that it tries to emulate. For example, it does not consider the real delays present in all inter-neural transmission, which have an important effect on the dynamics of the system, or, more importantly, it does not include synchronism or frequency modulation effects, which many researchers consider crucial.

Figure 3. The hyperplane determined by the McCulloch and Pitts neuron model for the case of two-dimensional inputs (axes $x_1$ and $x_2$). The line $h(\mathbf{x}) = 0$ separates the regions $h(\mathbf{x}) > 0$ and $h(\mathbf{x}) < 0$, with Class 1 and Class 2 on opposite sides. This hyperplane depends on the neuron’s parameters (weights $w_j$ and threshold value $w_0$) according to the mathematical expression $\sum_{j=1}^{M} w_j x_j + w_0 = 0$.
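The decision rule of Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration: the weights, the threshold and the two test points are hypothetical values chosen so that the separating line is $x_1 + x_2 - 1 = 0$, and the function name is not taken from the original text.

```python
import numpy as np

def mcculloch_pitts(x, w, w0):
    """Hard-threshold neuron: +1 if sum_j w_j x_j + w0 > 0, otherwise -1."""
    z = np.dot(w, x) + w0
    return 1 if z > 0 else -1

# Hypothetical parameters for M = 2: the line x1 + x2 - 1 = 0.
w = np.array([1.0, 1.0])
w0 = -1.0

# One point on each side of the line.
points = [np.array([2.0, 2.0]), np.array([0.0, 0.0])]
labels = [mcculloch_pitts(p, w, w0) for p in points]
print(labels)
```

Points for which $h(\mathbf{x}) > 0$ receive the output +1 and the remaining points receive −1, so the neuron acts as a linear discriminator.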

In spite of their limitations, networks designed in this way exhibit characteristics classically restricted to biological systems. Perhaps researchers have managed to capture the main operations of the biological neuron in this model, or perhaps the similarities in some applications are mere coincidence. Only time and further research will settle this question.

1.2 Training of Neural Networks: Hebbian Learning

The equation of the hyperplane boundary that characterizes the operation of the artificial neuron depends on the synaptic weights $w_1, \ldots, w_M$ and on the threshold value $w_0$, which is normally treated as one more weight of the network. The remaining problem is how to choose or find the values of these weights that solve the problem at hand. This task is called learning or training of the network. From a historical point of view, Hebbian learning is the oldest and one of the most studied learning procedures. In 1949, Hebb proposed a learning model that has given rise to many of the learning systems used today for training neural networks. Hebb proposed that the strength of the synaptic junction would increase whenever the input and the output of a neuron were simultaneously active. In this way, the connections of the network that are used most frequently are reinforced, emulating the biological phenomena of habit and learning by repetition. A neural network is said to use Hebbian learning when it increases the value of its weights according to the product of the excitation levels of the source and destination neurons. Hebbian learning is performed through successive iterations using only the information of the network input and output; it never uses the desired output or target. For this reason, this type of learning is called unsupervised learning. This distinguishes it from other learning models that use the additional information of the desired output values, acting as a teacher, which we present next.
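A minimal sketch of a Hebbian update is given below. It assumes a single neuron with a linear output and a fixed step size `eta`; both are illustrative assumptions, as are the function name and the numerical values, since the text only states that each weight grows with the product of input and output activity.

```python
import numpy as np

def hebbian_update(w, x, y, eta=0.1):
    """One unsupervised Hebbian step: each weight grows with input * output."""
    return w + eta * y * x

w = np.array([0.1, 0.1, 0.1])   # hypothetical initial weights
x = np.array([1.0, 0.0, 1.0])   # the second input is never active

for _ in range(5):
    y = float(np.dot(w, x))     # assumed linear output of the neuron
    w = hebbian_update(w, x, y) # no desired output (target) is used

print(w)
```

The weights attached to the active inputs are reinforced at every presentation, while the weight of the inactive input keeps its initial value. No target value appears anywhere in the update, which is what makes the rule unsupervised.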

1.3 Supervised Learning: Rosenblatt and Widrow

Although many learning methods following the Hebbian model have been developed, it seems logical to expect that the most efficient results will be achieved by methods that use information about the desired network output (supervised learning). Learning is thus guided towards performing a given function. Around 1960, Rosenblatt [Rosenblatt, 1962] devoted his efforts to developing supervised learning algorithms to train a neural network that he called the perceptron. A perceptron is a Feed Forward neural network like the one shown in Figure 2, where the nonlinearities of the neurons are of the hard threshold type. Some common functions used as alternatives to the hard threshold function will be shown later on. In this sense, the McCulloch and Pitts model can be considered the simplest kind of hard threshold perceptron.


Specifically, Rosenblatt showed that a one-layer perceptron is able to learn many practical functions. He proposed a learning rule for the perceptron called the perceptron rule. Let us consider the simplest case of a one-layer perceptron composed of a single neuron, that is, the model proposed by McCulloch and Pitts. If a set of pairs of inputs and corresponding desired outputs is known, $D_N = \{(\mathbf{x}_1, d_1), (\mathbf{x}_2, d_2), \ldots, (\mathbf{x}_N, d_N)\}$, then, for a given input pattern $\mathbf{x}_k$ of the input data set, the perceptron rule updates the network weights $\mathbf{w} = [w_0, w_1, \ldots, w_M]^T$ in the following way

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\,[d_k - o_k]\,\mathbf{x}_k \qquad (3)$$
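The update of Equation (3) can be sketched as follows. The training set (the logical AND with inputs and outputs coded as ±1), the value of the learning-rate factor `eta`, the epoch limit and all function names are assumptions made for this illustration only. Each input is extended with a constant 1 so that $w_0$ is handled like any other weight.

```python
import numpy as np

def hard_threshold(z):
    """Hard threshold activation with outputs +1 / -1."""
    return 1 if z > 0 else -1

def train_perceptron(X, d, eta=0.5, max_epochs=100, seed=0):
    """Perceptron rule: w <- w + eta * (d_k - o_k) * x_k, pattern by pattern."""
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1 for w0
    w = rng.uniform(-0.05, 0.05, size=X_aug.shape[1])  # small random start
    for _ in range(max_epochs):
        errors = 0
        for x_k, d_k in zip(X_aug, d):
            o_k = hard_threshold(np.dot(w, x_k))
            if d_k != o_k:                              # update only on error
                w = w + eta * (d_k - o_k) * x_k
                errors += 1
        if errors == 0:                                 # every pattern correct
            break
    return w

# Assumed example: logical AND, a linearly separable problem.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
d = np.array([1, -1, -1, -1])

w = train_perceptron(X, d)
outputs = [hard_threshold(np.dot(w, np.r_[1.0, x])) for x in X]
print(outputs)
```

Because the AND problem is linearly separable, the loop stops after a finite number of corrections. On a non-separable set the same loop would run until `max_epochs` is reached.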

The parameter $\eta$ controls the magnitude of the updates, and therefore the convergence speed of the algorithm. It is called the learning rate and usually takes values between 0 and 1. The set $D_N$ is called the learning set and, as it includes the values of the desired outputs, the learning is of the supervised type. If the training data set is linearly separable, Rosenblatt showed that the algorithm always converges in a finite number of steps, independently of the value of $\eta$. On the contrary, if the problem is not linearly separable, the algorithm has to be forced to stop, as there will always be at least one erroneously classified pattern. Usually, training starts by giving small random values to the perceptron weights. In each step of the algorithm, a new input $\mathbf{x}_k$ is applied to the network, the corresponding output $o_k$ is calculated, and the weights are updated only if the error $d_k - o_k$ is not equal to zero. It is interesting to note that if the learning rate is close to 0, the weights vary little with each new input and learning is slow; if the value is close to 1, there can be large differences between the weight values of one iteration and the next, which reduces the influence of past iterations, and the algorithm may fail to converge. This problem is called instability. Therefore, the learning rate should be adapted to the changes in the distribution of the input patterns, balancing the conflict between training time and stable updating of the weights.

Also in the early 1960s, Widrow and Hoff [Widrow and Hoff, 1960] performed several demonstrations on perceptron-like systems, which they called ADALINE (“ADAptive LINear Elements”), and proposed a learning rule called the LMS (“Least Mean Square”) algorithm or Widrow-Hoff algorithm. This rule minimizes the sum-of-squares error (SSE) between the desired output and the output given by the perceptron before the hard threshold activation function.
That is, it minimizes the error function

$$E(\mathbf{w}) = \frac{1}{2} \sum_{j=1}^{N} (d_j - z_j)^2 \qquad (4)$$

through a gradient algorithm. The linear output $z$ can be observed in Figure 1. When the gradient of Equation (4) with respect to $\mathbf{w}$ is computed and the weights are updated in the direction opposite to the gradient, the LMS rule is obtained:

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta \sum_{j=1}^{N} [d_j - z_j(k)]\,\mathbf{x}_j \qquad (5)$$

where $z_j(k) = \mathbf{w}^T(k)\,\mathbf{x}_j$ and $\eta$ is the learning rate. This “block” version of LMS (in the sense that it uses all training patterns in each iteration) is usually replaced by a “stochastic approximation” (pattern by pattern), as shown in the equation

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\,[d_k - z_k]\,\mathbf{x}_k \qquad (6)$$
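The pattern-by-pattern rule of Equation (6), together with the error function of Equation (4), can be sketched as below. The data set (logical AND with ±1 coding), the learning rate `eta = 0.05`, the number of epochs and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to every input so that w0 is a regular weight."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def sse(w, X, d):
    """Sum-of-squares error of Equation (4), computed on the linear output z."""
    z = augment(X) @ w
    return 0.5 * np.sum((d - z) ** 2)

def train_lms(X, d, eta=0.05, epochs=200, seed=0):
    """Stochastic LMS (Widrow-Hoff): w <- w + eta * (d_k - z_k) * x_k."""
    rng = np.random.default_rng(seed)
    X_aug = augment(X)
    w = rng.uniform(-0.05, 0.05, size=X_aug.shape[1])
    for _ in range(epochs):
        for x_k, d_k in zip(X_aug, d):
            z_k = np.dot(w, x_k)            # linear output, before the threshold
            w = w + eta * (d_k - z_k) * x_k
    return w

# Assumed example: logical AND with +1 / -1 coding.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
d = np.array([1.0, -1.0, -1.0, -1.0])

w = train_lms(X, d)
error_before = sse(np.zeros(3), X, d)       # SSE with all weights at zero
error_after = sse(w, X, d)
decisions = np.where(augment(X) @ w > 0, 1, -1)
print(error_before, error_after, decisions)
```

Note that the error is measured on the linear output, so it does not have to reach zero even when every pattern is classified correctly after thresholding. With a constant learning rate the weights keep fluctuating slightly around the least-squares solution.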

Unlike the perceptron rule, LMS delivers reasonable results (the best that can be achieved by a linear discriminator in the SSE sense) even when the training set is not linearly separable. During these years, researchers all around the world became enthusiastic about the application possibilities that these systems promised.

1.4 Partial eclipse of Neural Networks: Minsky and Papert

The initial euphoria of the early sixties gave way to disappointment when Minsky and Papert [Minsky and Papert, 1969] rigorously analyzed the problem and showed that there are severe restrictions on the class of functions that a perceptron can perform. One of their results shows that a one-layer perceptron with two inputs and one output is unable to perform a function as simple as the exclusive-or (Xor). The inputs of this function take the values 1 or −1, and the output is −1 when the two inputs are different and 1 when they are equal. This problem is illustrated in Figure 4, where it can be observed that a linear discriminator is unable to separate the patterns of the two classes.

Figure 4. The or-exclusive (Xor) problem. Input patterns X and desired outputs d: (1, 1) → 1; (1, −1) → −1; (−1, 1) → −1; (−1, −1) → 1. Each of the two classes, Class A and Class B, occupies two diagonally opposite corners of the ($x_1$, $x_2$) plane.

This limitation was well known by the end of the sixties, and it was also known that the problem could be solved by adding more layers to the system. As an example, let us analyze a two-layer perceptron. The first layer can classify input vectors separated by hyperplanes. The second layer can implement the logical functions AND and OR, because both problems are linearly separable. In this way, a perceptron like the one shown in Figure 5 (a) can implement boundaries like the one shown in Figure 5 (b) and so solve the Xor problem. In the general case, it can be shown that a two-layer perceptron can implement convex and connected regions; a region is said to be convex if any straight line segment that joins two points of its boundary goes only through points included in the region enclosed by the boundary. Convex regions are bounded by the (hyper)planes implemented by the nodes of the first layer, and can be open or closed.

It has to be noted that the capabilities of Multi-Layer Perceptrons rely on the nonlinearities of their neurons. If the activation function of these neurons were linear, the capabilities of the MLP would be the same as those of the single-layer perceptron. For example, consider a two-layer perceptron with zero threshold value, $w_0 = 0$, and a linear activation function, $f(z) = z$ (see Figure 1). In this case, the outputs of the first layer can be easily expressed in matrix form as $O_1 = W_1^T X$, and those of the second layer as $O_2 = W_2^T O_1$. Then, the output as a function of the input is obtained as

$$O_2 = W_2^T O_1 = W_2^T W_1^T X = W_{total}^T X \qquad (7)$$

Figure 5. (a) Two-layer perceptron able to solve the Xor problem, implementing a boundary as shown in (b). In (b), the decision boundaries of node 1 and node 2 delimit the regions of Class A and Class B in the ($x_1$, $x_2$) plane.
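A two-layer perceptron of the kind shown in Figure 5 can be sketched with hand-picked weights. The values below are not taken from the figure; they are one hypothetical choice in which each first-layer node isolates one corner of the input square and the output node combines the two with a logical OR.

```python
import numpy as np

def hard_threshold(z):
    """Hard threshold with outputs +1 / -1 (works on scalars and arrays)."""
    return np.where(z > 0, 1, -1)

# First layer (assumed weights): one row per node.
# Node 1 fires only for (1, 1); node 2 fires only for (-1, -1).
W1 = np.array([[1.0, 1.0],
               [-1.0, -1.0]])
b1 = np.array([-1.0, -1.0])

# Second layer (assumed weights): logical OR of the two first-layer outputs.
w2 = np.array([1.0, 1.0])
b2 = 1.0

def two_layer_perceptron(x):
    hidden = hard_threshold(W1 @ x + b1)
    return int(hard_threshold(w2 @ hidden + b2))

patterns = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
outputs = [two_layer_perceptron(np.array(p, dtype=float)) for p in patterns]
print(outputs)
```

The two first-layer decision lines bound a strip containing the patterns with different inputs, which is a region that no single hyperplane could isolate.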

The mapping of Equation (7) could be performed by a single-layer perceptron whose layer weights were $W_{total}$. Therefore, if the nodes are linear elements, the performance of the structure is not improved by adding new layers, as an equivalent one-layer perceptron can be found.

In spite of the possibilities opened up by the MLP, Minsky and Papert, prestigious scientists of their time, emphasized that algorithms to train such structures were not known, and expressed their scepticism about whether they would ever be developed. The book by Minsky and Papert [Minsky and Papert, 1969], which showed some critical examples of the disadvantages of NNs versus classical computers in terms of their capability for storing information, was a heavy blow to the enthusiasm for NN research, eclipsing its development for the next twenty years.
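The equivalence between two stacked linear layers and a single linear layer, expressed by Equation (7), can be checked numerically. The layer sizes and the random weights below are arbitrary assumptions; the only point of the sketch is that $W_{total} = W_1 W_2$ reproduces the two-layer output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes: 3 inputs, 4 first-layer nodes, 2 output nodes.
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))
X = rng.normal(size=(3, 5))      # five input vectors stored as columns

# Two linear layers with zero thresholds, f(z) = z.
O1 = W1.T @ X
O2 = W2.T @ O1

# Equivalent single layer: (W1 W2)^T = W2^T W1^T.
W_total = W1 @ W2
O_single = W_total.T @ X

same = np.allclose(O2, O_single)
print(same)
```

Any number of linear layers can be collapsed in the same way, which is why the nonlinearity of the nodes is essential to the added capability of the MLP.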

1.5 Backpropagation algorithm: Werbos, Rumelhart et al. and Parker

It is true that the single-layer perceptron has the limitation of being a simple linear discriminator. There are reasons to affirm that it is only able to solve “toy” problems. Although these limitations are reduced as the number of layers increases, it is difficult to find the adequate weights to solve a given problem. This problem was solved by incorporating “soft”, differentiable nonlinearities in the neurons in place of the classical hard threshold. Specifically, the sigmoidal function is very appropriate (Figure 6). Among other results, there is an especially relevant theorem on the capabilities of the MLP with soft activation functions, Cybenko’s Theorem [Cybenko, 1989]: a two-layer perceptron whose first-layer nodes (in unlimited number) have sigmoidal activation functions is sufficient to approximate any continuous correspondence between $\mathbb{R}^{N_0}$ and $[-1, 1]^{N_L}$ (therefore, it is also possible to establish any classification). For a first review of the capabilities of perceptrons as “approximators” in the case of soft nonlinearities, the work of Hornik et al. [Hornik et al., 1989, Hornik et al., 1990] is worth mentioning.

Figure 6. Multilayer Perceptron with sigmoidal nonlinearities.

But, again, we must return to the question of how to train the network weights. In complete analogy with the LMS algorithm previously described, the backpropagation algorithm updates the network weights (in this case, those of an MLP) in the direction opposite to the gradient of the error function that we aim to minimize (e.g. the SSE). For that purpose, the chain rule is applied as many times as required to calculate that gradient for all the weights in the network. As the output is a differentiable function, this calculation is relatively easy [Haykin, 1994]. The backpropagation algorithm was proposed independently and successively by Werbos [Werbos, 1974], Rumelhart et al. [Rumelhart et al., 1986] and Parker [Parker, 1985]. It can be said that the pessimism aroused by the book by Minsky and Papert found its counterpart twenty years later in the development of the backpropagation algorithm.

2. NEURAL NETWORKS VS CLASSICAL COMPUTERS

Classical digital computers process information at two basic levels: hardware and software. The computations performed are algorithmic and sequential. Each problem is solved by an algorithm coded in a program that is physically located in the computer memory. Problems are solved one after the other. Algorithms are executed as many times as needed, with the same reliability and at electronic speed. Nevertheless, there are many real problems to which computers cannot yet be successfully applied. For example, consider a little mosquito finding its way to survive in the world. Such a problem is an unsolved challenge for any automatic device. The difference probably lies in the fact that living beings do not follow the processing scheme of computers.

Biological brains process information in a massively parallel, non-sequential way. Problems are solved by the cooperative participation of millions of highly interconnected elementary processors, called neurons. The neurons do not need to be programmed. From the stimuli they receive from other neurons, they are able to modify, adapt or learn their functioning. The system does not need a central processing unit to control its activities. It is interesting to note that biological neural systems work at a speed several orders of magnitude lower than electronic systems. The brain is therefore an adaptive, non-linear, sophisticated processing system. Knowledge is distributed in the activation state of the neurons, and memory is not addressed through fixed labels.

The architecture of Artificial Neural Networks tries to emulate these basic neural features of brains, and the networks are designed by learning from examples. They could be defined as networks that massively connect simple units (usually adaptive units), hierarchically organized, that try to interact with real-world objects in the fashion of biological systems. The advantages of NNs over classical computers are:

1 Adaptive Learning: they are able to learn and to perform tasks by an adaptive procedure.
2 Self-organization: they are able to build their own internal organization or representation of the information provided in a learning phase.
3 Fault Tolerance: they can perform the same function despite the partial destruction of the network.
4 Real-time operation: their hardware architecture is oriented to massively parallel processing of information.
5 Simplicity of integration with present technology: these systems can be easily simulated on present computers and can also be implemented in specific neural hardware, which allows their modular integration in present systems.

3. BIOLOGICAL AND ARTIFICIAL NEURONS

3.1 The Biological Neuron

The biological neuron, whose basic operation is not yet completely known or understood, is composed of a cell body and a series of branching ramifications called dendrites. Among all these branches, one is particularly long and receives the name of axon. It starts from the cell body and ends in another series of dendrites. These nerve terminals are used by the neurons to make contact with each other by means of synaptic connections. When a cell receives signals from other cells (these can be excitatory or inhibitory signals) and the global effect is an excitation that exceeds a certain threshold value, it responds by transmitting a nerve signal through the axon to the adjacent cells by means of the synapses of its nerve terminals. The human nervous system is made up of these cells and is of fascinating complexity. It is estimated that $10^{11}$ neurons participate in more than $10^{15}$ interconnections over channels that can measure more than a meter. Studies of the anatomy of the human brain conclude that there are more than 1000 synapses at the input and output of each neuron. It is important to note that, although the switching time of a neuron (a few milliseconds) is almost a million times slower than that of current computer elements, biological neurons have a much higher connectivity (thousands of times) than current supercomputers.

Neurons are composed of the cell core, or soma, and several branches called the axon and the dendrites. The dendrites of different neurons are connected at what are called synapses, which establish the connection with neighboring neurons in order to make communication among them possible. Each neuron has two basic states: activation and rest. When a neuron is activated, it emits through the axon a train of electrical impulses, whose frequency depends on its level of activation. Information is coded in the frequency of the impulses and not in their amplitude.
The signal produced in the neuron body propagates from the axon to other neurons through chemical exchanges that take place in the synapses of the dendrites. The chemical compounds released by the dendrites are called neurotransmitters and contribute to increasing or inhibiting the activation level of the neuron that receives them. Due to the action of the neurotransmitters, which are basically chemical signals, ionic channels are opened in the receiving neuron and ions are received, contributing to the overall electrical charge of the neuron, or excitation level. When the excitation level surpasses a certain activation threshold, the neuron is activated. The efficiency of the synapse depends on several factors: the number of neurotransmitter glands, the concentration in the membrane of the neighboring neuron, the efficiency of the ionic channels and other physical and chemical variables.

As for the learning procedure, recent discoveries suggest that it is also of electrochemical nature, taking place among neighboring neurons that are hierarchically close in a layered structure. The chemical released in the learning process seems to be nitric oxide (NO). Its molecules are able to cross the membrane and travel to the neighboring neurons, controlling the efficiency of the connection through reactions with other chemicals in the receiving neuron. This regulation of the efficiency of the electrochemical connection among neighboring neurons is responsible for the learning procedure.

3.2 The Artificial Neuron

The simplest model of the artificial neuron, as presented in Figure 1, is obtained by approximating the action of all the neuron inputs by a linear function. Let us call this function the Base Function, $u(\cdot)$. In this case, the Base Function is a weighted sum

$$u = w_0 + \sum_{j=1}^{n_i} w_j x_j$$

where $w_0$ is a threshold and $w_j$ are the synaptic weights, which correspond to the effect of the inputs on the activation function. The output function of an artificial neuron can be expressed as

$$y = f(\mathbf{x}) = f\left(w_0 + \sum_{j=1}^{n_i} w_j x_j\right)$$
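The two expressions above translate directly into code. In the sketch below, the logistic function is used as the activation $f$ purely as an assumed example, and the function names, the parameter `beta` and the numerical values are hypothetical.

```python
import math

def base_function(x, w, w0):
    """Base function u = w0 + sum_j w_j * x_j."""
    return w0 + sum(w_j * x_j for w_j, x_j in zip(w, x))

def logistic(u, beta=1.0):
    """Assumed activation function: 1 / (1 + exp(-beta * u))."""
    return 1.0 / (1.0 + math.exp(-beta * u))

def neuron_output(x, w, w0, f=logistic):
    """Output of the artificial neuron: y = f(w0 + sum_j w_j * x_j)."""
    return f(base_function(x, w, w0))

# Hypothetical values chosen so that the base function is exactly zero.
y = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0)
print(y)
```

Passing a different function as `f`, for example the identity, turns the same code into a linear node, so the activation is the only place where the nonlinearity enters.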

In an artificial neuron, this function can be computed in three steps: the calculation of the base function value, u·, as the sum of the input values xj weighted by the synaptic weights wij plus the threshold value w0 and a non-linear activation function fu. Typical activation functions are explained in Figure 7: • Step function 0 si t < 0 ut = 1 si t ≥ 0 • Sign function sgnt =

−1 1

• Gaussian function x2

fx = ae− 2

si t < 0 si t ≥ 0

NEURAL NETWORKS HISTORICAL REVIEW

51

f (x)

f (x−a)

1

1

a

x

−1 Sign function

⎧1, if x ≥ a f (x−a) = ⎨ ⎩–1, if x ≤ a

x −1 Hyperbolic function f (x) = tanh(βx), β > 0

Figure 7. Some typical activation functions

• Exponential function
$$f(x) = \frac{1}{1 + e^{-\beta x}}, \quad \beta > 0$$
• Hyperbolic function
$$f(x) = \tanh(\beta x), \quad \beta > 0$$

Hyperbolic and exponential functions are classified as sigmoids or sigmoidal functions. They are real-valued, bounded and monotonic functions ($f'(x) > 0$). For sigmoidal functions, the slope at the origin is called the gain, and it measures the steepness of the transition. Therefore, as the gain tends to infinity, the Hyperbolic function tends to a Sign function (and the Exponential function to a Step function). Accordingly, the Exponential and Hyperbolic functions have gains of $\beta/4$ and $\beta$, respectively.

As assumed in the previous point, the activation function of a neuron is nonlinear. If the function $f(u)$ is linear, $f(u) = u$, then the artificial neuron is called a Linear Neuron or Linear Node of the NN.
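The definitions in this section can be tried out numerically. The following is a minimal Python sketch, assuming NumPy is available. The function names (`neuron_output`, `slope_at_origin`) and the example weights are illustrative choices, not part of the original text. It evaluates the base function, applies the activation functions listed above, and estimates the gain of the two sigmoids with a central finite difference.

```python
import numpy as np

def step(t):
    """Step function: 0 for t < 0, 1 for t >= 0."""
    return np.where(t >= 0, 1.0, 0.0)

def sign(t):
    """Sign function: -1 for t < 0, 1 for t >= 0."""
    return np.where(t >= 0, 1.0, -1.0)

def gaussian(x, a=1.0):
    """Gaussian function a * exp(-x^2 / 2)."""
    return a * np.exp(-x ** 2 / 2.0)

def exponential(x, beta=1.0):
    """Exponential (logistic) sigmoid 1 / (1 + exp(-beta * x))."""
    return 1.0 / (1.0 + np.exp(-beta * x))

def hyperbolic(x, beta=1.0):
    """Hyperbolic sigmoid tanh(beta * x)."""
    return np.tanh(beta * x)

def neuron_output(x, w, w0, f):
    """Base function u = w0 + sum_j w_j x_j followed by the activation f."""
    u = w0 + np.dot(w, x)
    return f(u)

def slope_at_origin(f, h=1e-6):
    """Central-difference estimate of the gain (slope of f at the origin)."""
    return (f(h) - f(-h)) / (2.0 * h)

# Hypothetical example values, chosen only for illustration.
x = np.array([1.0, 0.0])
w = np.array([0.5, 0.5])
y = neuron_output(x, w, -0.2, step)          # u = 0.3, so the step fires

beta = 2.0
gain_exp = slope_at_origin(lambda t: exponential(t, beta))   # close to beta / 4
gain_tanh = slope_at_origin(lambda t: hyperbolic(t, beta))   # close to beta
```

With `beta = 2.0`, the two estimates should come out close to $\beta/4$ and $\beta$, which matches the gains quoted in the text.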

4. NEURAL NETWORKS: CHARACTERISTICS AND TAXONOMY

A Neural Network can be represented as an ordered pair $(G, E)$, composed of a set $G$ of basic processing elements, also called processing units, artificial neurons or nodes, and a set $E$ of interconnections among them. The node set $G$ is partitioned into subsets called layers. Each processing unit always has a transfer function and may also have a local memory. The output $y$ is computed by applying this function to the weighted input values and the values stored in the local memory. There are four main aspects that characterize all NNs:

a) Data Representation. According to the input-output form, ANNs can be classified as continuous type NNs, digital NNs or hybrid NNs. In the continuous type, input-output data are analog: their values are real and continuous. In digital NNs, input-output data are digital. In the hybrid case, inputs are analog and outputs are binary.

b) Topology. The architecture or topology of the NN refers to the way the nodes are physically arranged in the network. The nodes form layers, or groups of nodes that share a common input and feed their output to common nodes. Only neurons in the input and output layers interact with external systems. The remaining nodes have only internal connections and form what are called hidden layers. Therefore, the topology of a NN is characterized by the number of layers, the number of neurons in each layer, the degree of connectivity and the type of connections among the nodes.

c) Input-Output Association. With respect to the type of input-output association, NNs can be classified as heteroassociative or autoassociative. Heteroassociative NNs implement a certain function, frequently one that is difficult to express analytically. They associate a set of inputs with a set of outputs in such a way that each input has a corresponding output. In autoassociative networks, the outputs are meant to rebuild input information that has been corrupted, by associating each input with the most similar stored pattern.

d) Learning Procedure. All the connections or synapses of the nodes in a NN have an associated synaptic weight or efficiency factor. Each connection or synapse between node $i$ and node $j$ is weighted by $w_{ji}$. These weights are responsible for the learning of the neural network. In the learning phase, the NN modifies its weights in response to new input information. Weights are modified following a convergent algorithm, so that when all the weight values have stabilized and the learning phase ends, the NN is said to have “learnt”. For the learning process it is crucial to establish the weight updating algorithm so that the NN learns the new input information correctly. According to the learning criteria, NNs can be classified as supervised learning or unsupervised learning NNs.

Figure 8 represents the most common way of NNs classification.

5. FEED FORWARD NEURAL NETWORKS: THE PERCEPTRON

Figure 8. Neural Networks basic taxonomy

First presented in Section 1.1, Feed Forward Neural Networks are generally defined as networks composed of one or more layers whose nodes are connected in such a way that their inputs come only from nodes in the previous layer and their outputs connect exclusively to neurons of the following layer. Their name comes from the fact that the output of each layer feeds the units of the following layer. The most popular of all feed forward NNs is the Multilayer Perceptron, developed as an extension of the Perceptron proposed by Rosenblatt in 1962 [Rosenblatt, 1962].

In this type of network, the learning is supervised because it uses information about the output that the network must provide for the current input. The learning phase or training phase consists in presenting to the network input-output pairs, called training patterns,

$$D_N = \{(\mathbf{x}_1, \mathbf{d}_1), (\mathbf{x}_2, \mathbf{d}_2), \ldots, (\mathbf{x}_N, \mathbf{d}_N)\}$$

with $\mathbf{x}_i \in \mathbb{R}^p$ and $\mathbf{d}_i \in \mathbb{R}^k$, $i = 1, 2, \ldots, N$, in such a way that the weights are adjusted. Once the training phase is completed, the network is designed and ready to work in what is called the direct mode phase. In this phase, the network classifies the inputs by the following binary decision rule

$$g = \begin{cases} 1 & \text{if } \varphi(\mathbf{x}, \mathbf{w}) > 0 \\ 0 & \text{if } \varphi(\mathbf{x}, \mathbf{w}) < 0 \end{cases}$$

where $\varphi(\mathbf{x}, \mathbf{w})$ is the discriminating function; that is, the space $\mathbb{R}^p$ is divided into two regions by the decision boundary $\varphi(\mathbf{x}, \mathbf{w}) = 0$. Logically, the choice of the discriminating function $\varphi(\mathbf{x}, \mathbf{w})$ depends on the distribution of the training patterns.

5.1 One Layer Perceptron

It basically consists of a set of nodes whose activation is produced by the weighted sums of the input values and, consequently, the discriminating function takes the form

$$\varphi(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{p} w_i x_i + \theta = 0 \qquad (8)$$

Also, if we set $\theta = w_0$ and consider the inputs in the space $\mathbb{R}^{p+1}$, such that $\mathbf{x} = (x_1, x_2, \ldots, x_p, 1)$ and $\mathbf{w} = (w_1, w_2, \ldots, w_p, w_0)$, Equation (8) can be expressed as

$$\varphi(\mathbf{x}, \mathbf{w}) = \mathbf{w}\mathbf{x}^T = 0$$

Among other things, it serves to perform the pattern classification task through a discriminating function of the form [Karayiannis and Venetsanopoulos, 1993], [Hush and Horne, 1993]:

$$u_k(\mathbf{x}_n) = \sum_{j=0}^{N} w_{kj} x_{nj}$$

The classification rule assigns class $k$ to the input pattern if the $k$-th network output is the highest of all outputs:

$$u_k(\mathbf{x}_n) \ge u_j(\mathbf{x}_n) \quad \forall j \ne k \ \longrightarrow \ \mathbf{x}_n \in W_k$$

The network must be trained following an appropriate algorithm to produce the desired output for each pattern. This decision rule is sometimes replaced by a binary decision rule with a decision threshold. The Perceptron is a system that operates in such a way. After training, the Perceptron structure can separate the classification space into regions, one region for each class. The decision boundaries are composed of hyperplane segments defined as

$$u_k(\mathbf{x}_n) - u_j(\mathbf{x}_n) = 0$$

The Perceptron was initially proposed by Rosenblatt and a group of his students, whose work showed its versatility. Unfortunately, the linear separability problem caused interest in its use to decline.
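The highest-output classification rule above can be sketched in a few lines of Python, assuming NumPy. The weight matrix `W` below is a hypothetical, hand-picked example for three classes and two inputs, not a trained network. The sketch also assumes that the component $x_{n0}$ is the constant input 1, so that the first column of `W` carries the thresholds.

```python
import numpy as np

def discriminants(W, x):
    """Return u_k(x) = sum_j w_kj x_j for every class k, with x_0 = 1 prepended."""
    x_aug = np.concatenate(([1.0], x))   # assumption: x_0 is the constant input
    return W @ x_aug

def classify(W, x):
    """Assign the class whose discriminant is the highest of all outputs."""
    return int(np.argmax(discriminants(W, x)))

# Hypothetical weights: one row per class, columns = (threshold, w_k1, w_k2).
W = np.array([[0.0,  1.0,  0.0],
              [0.0,  0.0,  1.0],
              [0.5, -1.0, -1.0]])

label_a = classify(W, np.array([2.0, 0.0]))   # class 0 has the highest output
label_b = classify(W, np.array([0.0, 3.0]))   # class 1 has the highest output
label_c = classify(W, np.array([0.0, 0.0]))   # class 2 has the highest output
```

The boundary between any two classes $k$ and $j$ in this sketch is the set of points where the two corresponding entries of `discriminants` are equal, which is the hyperplane segment described in the text.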

5.1.1 Perceptron Training

It can be summarized in five steps:

1. Weights and threshold initialization. Each of the weights $w_i$ is initialized to a small random value, with $w_0 = \theta$.
2. For $i = 1, 2, \ldots, N$, presentation of a training pattern: a new input/output training pair is composed of a new input $X_p = (x_1, x_2, \ldots, x_N)$ and its corresponding desired output $d(t)$.
3. Computation of the present output, where the threshold is absorbed in the second sum as the weight of a constant input:
$$y_i(t) = f\left(\sum_{j=1}^{M} w_{ij} x_j(t) - \theta_i\right) = f\left(\sum_{j=0}^{M} w_{ij} x_j(t)\right) = f(Net_i)$$
4. Weight adaptation: $\Delta W_i = \eta\,(d(t) - y(t))\,x_i(t)$.
• $\eta$: learning rate, $0 < \eta < 1$.
• $d(t)$: desired output; $y(t)$: present output.
• This process is repeated until the error $e(t) = d(t) - y(t)$ for each one of the patterns is zero or less than a preset value.
5. Back to step 2.

The convergence of the perceptron training is established by the following theorem: if the training set of a multiple classification problem is linearly separable, then the perceptron training algorithm converges to a correct solution in a finite number of iterations. The mathematical proof of this theorem can be found in [Rosenblatt, 1962], and its significance lies in the fact that a multiple-class problem can be reduced to binary classification. Two typical examples of this situation are shown in Figure 9.
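The five training steps can be sketched as follows in Python, assuming NumPy. This is an illustrative sketch rather than a reference implementation: the function names, the learning rate of 0.5, the range of the initial random weights and the epoch limit are all assumptions. The threshold is handled as the weight of a constant input equal to 1, so that weight plays the role of $-\theta$. The OR and AND functions of Figure 9 are used as linearly separable examples.

```python
import numpy as np

def step(t):
    """Step activation: 1 if t >= 0, otherwise 0."""
    return 1.0 if t >= 0 else 0.0

def augment(X):
    """Append a constant input of 1 so the threshold becomes an ordinary weight."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def train_perceptron(X, d, lr=0.5, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    Xa = augment(X)
    # Step 1: small random initial weights (the range is an assumption).
    w = rng.uniform(-0.05, 0.05, Xa.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        # Step 2: present each training pair in turn.
        for x, target in zip(Xa, d):
            y = step(np.dot(w, x))               # Step 3: present output
            if y != target:
                w += lr * (target - y) * x       # Step 4: weight adaptation
                mistakes += 1
        if mistakes == 0:                        # every pattern has zero error
            break                                # otherwise Step 5: repeat
    return w

def predict(w, X):
    return np.array([step(np.dot(w, x)) for x in augment(X)])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d_or = np.array([0, 1, 1, 1], dtype=float)
d_and = np.array([0, 0, 0, 1], dtype=float)

w_or = train_perceptron(X, d_or)
w_and = train_perceptron(X, d_and)
```

Because both training sets are linearly separable, the convergence theorem quoted above applies, and `predict` should reproduce the OR and AND truth tables after training.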

6. LMS LEARNING RULE

Figure 9. Logical functions OR and AND reduced to a binary classification problem (panels: OR function and AND function; axes $x_1$ and $x_2$; input points 00, 01, 10 and 11)

Nevertheless, even with the simple Perceptron structure, a reasonable solution can be achieved for a set that does not satisfy the linear separability property, by using the Least Mean Square (LMS) convergence algorithm to update the NN weights during the learning phase. In general, the error function (Equation (4)), also called cost function or objective function, to be minimized by the LMS algorithm can be expressed as follows [Hush and Horne, 1993]:

$$E = \sum_{k=1}^{M} \sum_{\mathbf{x}_n \in W_k} \left\| \mathbf{u}(\mathbf{x}_n) - \boldsymbol{\delta}_k \right\|^2$$

where $\boldsymbol{\delta}_k$ is a vector whose components are all zero except the $k$-th, which corresponds to the correct classification. Therefore, for a given training set $D_N$, if $d_k$ is the desired output for the $k$-th input vector and $y_k$ is the computed value, then the Mean Square Error (MSE) over the input-output pairs is given by

$$\langle \varepsilon_k^2 \rangle = \frac{1}{N}\sum_{k=1}^{N} \varepsilon_k^2 = \frac{1}{N}\sum_{k=1}^{N} (d_k - y_k)^2$$

or, in vector notation, with $y_k = \mathbf{w}^T \mathbf{x}_k$,

$$\langle \varepsilon_k^2 \rangle = \langle d_k^2 \rangle - 2\,\mathbf{w}^T \langle d_k \mathbf{x}_k \rangle + \mathbf{w}^T \langle \mathbf{x}_k \mathbf{x}_k^T \rangle \mathbf{w}$$

The minimum square error corresponds to the weight vector $\mathbf{w}$ that satisfies the equation

$$\frac{\partial \langle \varepsilon_k^2 \rangle}{\partial \mathbf{w}} = 0$$

In the two-dimensional case ($N = 2$), the equation describes an error paraboloid, as shown in Figure 10. From Figure 10 it can be observed that the optimum value for the weights of the network is the one that makes the gradient zero. A possible search procedure is steepest descent. The gradient direction is perpendicular to the contour lines at each point of the error surface. At the starting point of the algorithm, the update direction does not point directly toward the minimum, except in the case of spherical level curves. The weight updates in each iteration step must be small; otherwise the weight vector could wander over the hypersurface without ever reaching the minimum sought.
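The descent on the error paraboloid described above can be sketched in Python, assuming NumPy. The sketch uses a single linear node, batch steepest descent on the mean square error, and the XOR targets as an example of a set that is not linearly separable. The function names, the step size and the iteration count are illustrative assumptions, and the batch form is used here only because it converges to the exact minimum, which makes the result easy to check.

```python
import numpy as np

def augment(X):
    """Append a constant input of 1 so the threshold is learned as a weight."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def mse(w, Xa, d):
    """Mean square error <e^2> = (1/N) sum_k (d_k - y_k)^2 with y_k = w^T x_k."""
    return float(np.mean((d - Xa @ w) ** 2))

def lms_train(X, d, lr=0.5, n_iter=500):
    """Steepest descent on the MSE surface of a single linear node."""
    Xa = augment(X)
    w = np.zeros(Xa.shape[1])
    for _ in range(n_iter):
        error = d - Xa @ w
        # The MSE gradient is -(2/N) Xa^T error; the factor 2 is folded into lr.
        w += lr * (Xa.T @ error) / len(d)
    return w

# XOR targets: not linearly separable, yet the MSE still has a unique minimum.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d_xor = np.array([0, 1, 1, 0], dtype=float)

w_lms = lms_train(X, d_xor)
final_error = mse(w_lms, augment(X), d_xor)
```

For this data set the minimum of the paraboloid is the least-squares solution, which can be obtained independently with `np.linalg.lstsq` and compared with `w_lms`. With a step size that is too large the iteration would diverge instead of settling at the minimum, which is the behaviour the text warns about.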

6.1 The Multilayer Perceptron

A Perceptron of $n$ layers is composed of $n + 1$ layers $L_l$, $l = 0, 1, \ldots, n$, each containing several processing units, where $L_0$ is the input layer, $L_n$ is the output layer and $L_l$, $l = 1, \ldots, n - 1$, are the hidden layers. The nodes in the hidden and output layers are individual processing units. The output of each of these units is obtained by adding all its weighted inputs and passing the result through a non-linear function of sigmoidal type (see Figure 6).

Figure 10. Error Paraboloid of the LMS learning (surface plotted over axes $x$ and $y$ from −2 to 2; error values from 0 to 80)

Usually, in a Multilayer Perceptron, the nodes in each layer are fully interconnected with the neurons in the adjacent layers. This pattern is repeated layer by layer throughout the network.

6.1.1 Learning Algorithm (“Backpropagation”)

Before detailing the learning algorithm, let us introduce the following nomenclature:

$u_{lj}$: output of the $j$-th node in layer $l$.
$w_{lji}$: weight that connects the $i$-th node in layer $l - 1$ to the $j$-th node in layer $l$.
$\mathbf{x}_p$: $p$-th training pattern.
$u_{0i}$: $i$-th component of the input vector.
$d_j(\mathbf{x}_p)$: desired output of the $j$-th node in the output layer when the $p$-th pattern is presented at the network input.
$N_l$: number of nodes in layer $l$.
$L$: number of layers.
$P$: number of training patterns.

Obviously, in a Perceptron-like structure, outputs depend upon the synaptic weights that connect neurons in the different layers. Such weights are updated in the following way:

1. Associating a set of input patterns with a set of desired outputs. In a pattern classification problem, this is the same as having the designer make a primary classification of the patterns (supervised training).
2. Presenting all training patterns to the network. The network then processes all patterns and presents an output. The classification offered by the network may be erroneous, and the error can then be quantified.
3. Defining an objective function. For example, the Mean Square Error (MSE) between the desired and real outputs of the units in the output layer [Hush and Horne, 1993]:

$$J_p(\mathbf{w}) = \frac{1}{2}\sum_{q=1}^{N_L} \left(u_{Lq}(\mathbf{x}_p) - d_q(\mathbf{x}_p)\right)^2$$

This objective function represents an error surface in a parametric hyperspace. Training or learning then consists in searching for the minimum of that surface through a gradient descent algorithm, moving in the direction opposite to the surface gradient to find a set of weights that minimizes the error. In each iteration step, each weight is modified by an amount proportional to the partial derivative of the function with respect to that weight:

$$w_{lji}(k+1) = w_{lji}(k) - \eta\,\frac{\partial J_p(\mathbf{w})}{\partial w_{lji}} \qquad (9)$$

In Equation (9), the constant $\eta$ is the learning rate. The speed of convergence of the algorithm depends on $\eta$, because the weight modification in each iteration step is proportional to the gradient in the weight direction, scaled by the constant value of the learning rate. At this point, the training algorithm can be designed if we know how to calculate the partial derivative with respect to each weight of the network. This derivative can be easily calculated using the chain rule:

$$\frac{\partial J_p(\mathbf{w})}{\partial w_{lji}} = \frac{\partial J_p(\mathbf{w})}{\partial u_{lj}}\,\frac{\partial u_{lj}}{\partial w_{lji}}$$

that is,

$$\frac{\partial J_p(\mathbf{w})}{\partial w_{lji}} = \frac{\partial J_p(\mathbf{w})}{\partial u_{lj}}\; f'\!\left(\sum_{m=0}^{N_{l-1}-1} w_{ljm}\,u_{(l-1)m}\right) u_{(l-1)i}$$

where $f(\cdot)$ represents the sigmoidal function previously defined. This function has a very simple derivative:

$$f'(x) = \frac{df(x)}{dx} = f(x)\,(1 - f(x))$$

when the parameter $\beta$ is of unit value. In this expression we can observe that the sensitivity of the objective function to each weight depends on the sensitivity of this function to the output of the neuron that is fed by that synaptic weight.

This last sensitivity can in turn be calculated from the sensitivities of the objective function with respect to the node outputs of the following layer, and so on [Hush and Horne, 1993]. This process is repeated until the output layer is reached. The sensitivity of the objective function to each node output can thus be calculated recursively, starting from the output layer. The sensitivities to the outputs of nodes in hidden layers are also called “errors”, although, strictly speaking, they do not represent a real error. In order to calculate the error in the hidden layers, the error in the output layer must be computed and backpropagated to the previous layers. That is performed by the Backpropagation algorithm.

In this algorithm, training usually starts with small random values of the synaptic weights in order to provide a safe starting point for the backpropagation algorithm. Once the structure of the network is chosen, the key parameter to be controlled is the learning rate. Too small a value will slow the learning process. Too high a value will accelerate the learning but can cause the minimum of the error surface to be missed. The optimal value of this parameter has to be found empirically.

Once the learning has started, it must continue until a minimum error is found or until the weight values no longer change. At that point, the network is said to have finished learning. It is not always practical to wait until this point, and several other stopping criteria are adopted, among them:

1. When the gradient of the error surface is sufficiently small, the gradient learning algorithm has found a set of weight values at a local minimum of the error surface.
2. When the error between the real network output and the desired one is below a tolerable value for the application. Obviously, this case requires knowing the maximum tolerable error for the given application.
3. In pattern classification problems, when all the learning patterns have been correctly classified, the training procedure can be stopped.
4. Training can be stopped after a fixed number of iterations.
5. Finally, a more appropriate and developed procedure is to train the network with one set of patterns and monitor the error over a different set, called the test set. The training phase is stopped when a minimum error on the test set is found. This last method prevents the overspecialization of the network on the training set, a phenomenon that happens when the error on the training set is lower than the error over another set of patterns from the same application, showing that the network has lost generalization capabilities. The method requires roughly twice as many patterns, which can be expensive or even impossible.

Therefore, in order to apply neural networks efficiently to real problems, it is very important to have a sufficient number of patterns.
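The backpropagation recursion and some of the stopping criteria listed above can be sketched in Python, assuming NumPy. The sketch is illustrative only: it uses a single hidden layer, the exponential sigmoid with unit parameter, batch updates, and hypothetical names and default values (`train`, `n_hidden`, `tol_err`, `tol_grad`, `max_iter`). Criteria 1, 2 and 4 are implemented; monitoring a separate test set (criterion 5) is left out to keep the example short.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(W1, W2, X):
    """Forward pass; a column of ones carries the threshold weights."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    H = sigmoid(Xb @ W1.T)                          # hidden-layer outputs
    Hb = np.hstack([H, np.ones((H.shape[0], 1))])
    Y = sigmoid(Hb @ W2.T)                          # output-layer outputs
    return Xb, H, Hb, Y

def loss(W1, W2, X, D):
    """Objective: half the squared error, summed over outputs and patterns."""
    Y = forward(W1, W2, X)[3]
    return 0.5 * np.sum((Y - D) ** 2)

def backprop(W1, W2, X, D):
    """Gradients of the objective with respect to W1 and W2 (chain rule)."""
    Xb, H, Hb, Y = forward(W1, W2, X)
    delta_out = (Y - D) * Y * (1.0 - Y)                    # output-layer "error"
    delta_hid = (delta_out @ W2[:, :-1]) * H * (1.0 - H)   # backpropagated "error"
    return delta_hid.T @ Xb, delta_out.T @ Hb

def train(X, D, n_hidden=3, lr=0.5, max_iter=5000,
          tol_err=1e-2, tol_grad=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Small random initial weights (the range is an assumption).
    W1 = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1] + 1))
    W2 = rng.uniform(-0.5, 0.5, (D.shape[1], n_hidden + 1))
    for _ in range(max_iter):
        g1, g2 = backprop(W1, W2, X, D)
        if loss(W1, W2, X, D) < tol_err:
            return W1, W2, "error below tolerance"      # criterion 2
        if np.sqrt(np.sum(g1 ** 2) + np.sum(g2 ** 2)) < tol_grad:
            return W1, W2, "gradient small"             # criterion 1
        W1 -= lr * g1                                   # gradient descent step
        W2 -= lr * g2
    return W1, W2, "iteration limit"                    # criterion 4

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2, reason = train(X, D)
```

A useful sanity check for any hand-written backpropagation routine is to compare the output of `backprop` with a finite-difference estimate of the gradient of `loss`. Which stopping criterion fires first depends on the learning rate, the tolerances and the random initial weights.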

6.2 Acceleration of the training procedure

The training procedure described in the previous section presents two main problems: on the one hand, the convergence of the training phase is very slow and, on the other hand, it is not easy to choose the appropriate learning rate precisely. A simple solution to accelerate network training is the use of second-order methods, which exploit the information contained in the matrix of second derivatives (Hessian). These methods reduce the number of iterations needed in the training phase to reach a local or global minimum of the error surface. Nevertheless, they require more computation per iteration, which increases the training time. For this reason, only the diagonal of the Hessian is usually used.

Another solution is to add to the gradient step a term that is a fraction of the previous change in the weights. This term, usually known as the momentum term, is weighted by a new constant, usually designated by $\alpha$:

$$w_{kj}(k+1) = w_{kj}(k) - \eta\,\frac{\partial J(\mathbf{w})}{\partial w_{kj}} + \alpha\,\Delta w_{kj}(k)$$

where $\eta$ is the learning rate and $\Delta w_{kj}(k)$ is the previous weight change. This term tends to smooth the changes in the weights, increasing the learning speed by avoiding divergent fluctuations. It has also been shown that adding noise to the training patterns decreases the training time and helps to avoid local minima in the learning process.

Another way to decrease the training time is to use alternative transfer functions in the network nodes. When a function is allowed to take positive and negative values in a symmetric dynamic range, it is probable that several activations will be close to zero and their corresponding weights will not need to be updated. An example of this type of activation function is the hyperbolic one. Table 1 summarizes typical parameters of this kind of network and their influence on the processing.
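The momentum update can be illustrated with a short Python sketch, assuming NumPy. The objective below is a hypothetical elongated quadratic bowl, standing in for the error surface, and the values of the learning rate and momentum constant are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def momentum_descent(grad, w_start, lr=0.05, alpha=0.9, steps=300):
    """Gradient descent where each step adds alpha times the previous weight change."""
    w = np.asarray(w_start, dtype=float).copy()
    delta = np.zeros_like(w)                     # previous weight change
    for _ in range(steps):
        delta = -lr * grad(w) + alpha * delta    # new change = gradient step + momentum
        w = w + delta
    return w

# Hypothetical error surface J(w) = 0.5 * w^T A w with its minimum at the origin.
A = np.diag([1.0, 10.0])

def grad_J(w):
    return A @ w

w_final = momentum_descent(grad_J, [2.0, -1.0])
distance_to_minimum = float(np.linalg.norm(w_final))
```

Setting `alpha=0.0` in the same function gives plain gradient descent, which makes it easy to compare the two behaviours on the same surface.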

6.3 On-Line and Off-Line training

Table 1. Design Properties of NNs

Transfer Function | Derivative of Transfer Function | Learning rate | Effects on the NN | Momentum
Sign | $f'(x) =$ (missing) | $\eta = 1$ | Learning not guaranteed | With a small value, the weight increment vectors take very divergent directions
Exponential, $\beta = 1$ | $f'(x) = f(x)(1 - f(x))$ | $\eta = 0.1$ | Quick but not precise convergence | With a big value, the weight increment vectors take similar directions, helping the convergence of the training
Hyperbolic, $\beta = 1$ | $f'(x) = 1 - f^2(x)$ | $\eta = 0.01$ | Precise and slow convergence |

During training, the weight update can be carried out in two different ways [Bourland and Morgan, 1994]:

• Off-line or “Block training”: in this case, the modifications to the weights are accumulated over the whole training set. The weights are modified only when all the training patterns have been presented to the network.
• On-line training: the network weights are modified each time a new training pattern is presented to the network. It can be proved that this method leads to the same result as off-line training [Widrow and Stearns, 1985]. In practice, this method shows some advantages that make it much more attractive: it converges much more quickly to a minimum of the error surface and usually avoids local minima. A possible explanation is that on-line training introduces some “noise” over the set of training patterns.
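The difference between the two update schedules can be sketched in Python, assuming NumPy. A single linear node trained with the delta rule is used to keep the example small. The function names, the learning rate and the synthetic data (targets generated from a hypothetical weight vector `w_true`) are assumptions made for illustration.

```python
import numpy as np

def offline_epoch(w, X, d, lr):
    """Block training: accumulate the corrections over all patterns, update once."""
    total = np.zeros_like(w)
    for x, target in zip(X, d):
        total += (target - np.dot(w, x)) * x
    return w + lr * total

def online_epoch(w, X, d, lr):
    """On-line training: update the weights after every single pattern."""
    w = w.copy()
    for x, target in zip(X, d):
        w += lr * (target - np.dot(w, x)) * x
    return w

# Synthetic, exactly realizable data; the last column is a constant input.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
w_true = np.array([1.0, -2.0, 0.5])     # hypothetical generating weights
d = X @ w_true

w_off = np.zeros(3)
w_on = np.zeros(3)
for _ in range(2000):
    w_off = offline_epoch(w_off, X, d, lr=0.1)
    w_on = online_epoch(w_on, X, d, lr=0.1)
```

On this small, noise-free problem both schedules should approach the same weights, in line with the equivalence mentioned above. The two schedules follow different paths, however, because the on-line version already uses the updated weights when it processes the next pattern.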

6.4 Selection of the Network size

The selection of the appropriate network size is a task of the utmost importance: if the network is too small, it will not be able to reach an efficient solution for the problem it represents, while if it is too big, the network may be able to represent many solutions that fit the training patterns, none of which is optimal for the application problem. If there is no preliminary experience, choosing the network size is a matter of trial and error. One option is to start with a small network and increase its size progressively until an efficient dimension is found. The other option is to start with a big network and reduce its size progressively, removing the nodes or weights that have no significant effect on the overall output of the NN.

Several studies have established size limits that should not be exceeded. In this sense, one proposal is that the number of nodes in the hidden layer should not exceed the number of training patterns. In practice, this condition is normally met, since the number of nodes is usually much lower than the number of training patterns. In fact, big networks may memorize the whole training set, losing generalization capabilities.

7. KOHONEN NETWORKS

A main principle in biological brain organization is that neurons group in such a way that those that are physically close collaborate in processing the same stimulus. That is the way nerve connections are organized. For example, at each level of the auditory pathway, nerve cells and fibers are arranged according to the frequency that elicits the highest output from each neuron [Lipmann, 1987]. Therefore, the physical arrangement of the neurons in the brain structure is somehow related to the function they perform.

Kohonen [Kohonen, 1984] proposed an algorithm to adjust the weights of a network whose input is a vector of $N$ components and whose output is another vector of a different dimension $M$ ($M < N$). In this way, the dimension of the input subspace is reduced, physically grouping the data. Vectors defined over a continuous variable are used as input to the network. The network is trained without supervision, so that the network itself establishes the input data grouping criteria, extracting regularities and correlations. When a sufficient number of input vectors has been presented, the weights self-organize in such a way that, topologically speaking, close nodes are sensitive to similar inputs. Nodes that are physically far away remain completely inactive. Clusters that have their topological equivalent in the network are produced. For this reason, this kind of network is known as a Self-Organized Feature Map (SOFM).

The algorithm that assigns values to the synaptic weights is based on the concepts of neighborhood and competitive learning. The distance between the input and the weights of every node is computed, and the closest node is declared the winner. The weights are updated for this node and its neighbors. The rest are not updated, which favors a definite physical organization. This kind of network always has two layers: the input layer and the output layer. The dimension of the input vector establishes the number of nodes of the input layer: one node for each component of the input vector. The input neurons pass the input vectors on to the output layer through the connection weights. In this type of network it is very important to establish a neighborhood and a distance measure in the network. In the example of Figure 11, the nodes are arranged in a two-dimensional structure.

Figure 11. Structure of the Kohonen Network (labels: input layer, outputs)

The algorithm used to compute the output is designed in such a way that only one output neuron is activated when an input vector is applied to the network. The fired node corresponds to the classification category of the input vector. Similar input vectors activate the same output, while different vectors activate different neurons. Only the neuron with the minimum difference between the input vector and its weight vector is activated. When the training algorithm starts, the adjustment is made in a wide zone surrounding the fired node or winner node. As the training progresses, the neighborhood is progressively reduced. Through these small adjustments, the network follows any systematic change in the input vectors: the network self-organizes.

Therefore, this algorithm behaves as a vector quantizer when the number of desired clusters can be specified a priori and a sufficient amount of data relative to the number of desired clusters is known. However, the results depend on the order of presentation of the input data, especially when the amount of input data is small.

7.1 Training

Training of the SOFM network can be summarized in six steps:

1. Weights initialization: The network structure is $N$ input nodes and $M$ output nodes. Random values are assigned to each of the connection weights $w_{ij}$. An initial neighborhood radius is fixed for the neighborhood mask.
2. Presentation of a new input pattern: A new pattern is presented at the input, $X_p(t) = (x_1(t), x_2(t), \ldots, x_N(t))$.
3. Computation of the distance $d_j$ between the input and each one of the output nodes:
$$d_j = \sum_{i=0}^{N-1} \left(x_i(t) - w_{ij}(t)\right)^2$$
where $x_i(t)$ is the input to node $i$ at iteration $t$, and $w_{ij}(t)$ is the weight from input $i$ to output $j$ at iteration $t$.
4. Selection of the output node with the minimum distance: node $j^*$ is selected as the node with the minimum distance $d_j$.
5. Updating node $j^*$ and its neighbors: weights are updated for node $j^*$ and all its neighbors in the neighborhood defined by $NE_{j^*}(t)$. The new weights are:
$$w_{ij}(t+1) = w_{ij}(t) + \eta(t)\left(x_i(t) - w_{ij}(t)\right)$$
for $j \in NE_{j^*}(t)$, $0 \le i \le N - 1$. The term $\eta(t)$ is a gain term ($0 < \eta(t) < 1$) that decreases with time.
6. Back to step 2.

A standard example introduced by Kohonen illustrates the capacity of self-organized networks to learn random distributions of the input vectors presented to the network. For example, if the input is a two-component vector with uniformly distributed components and the output layer is designed as two-dimensional, then the network weights will organize in a grid-like fashion, as shown in Figure 12.
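The six training steps can be sketched in Python, assuming NumPy. The choices below are illustrative assumptions, not taken from the text: a square neighborhood on a two-dimensional grid, a gain term and radius that decrease linearly with the iteration count, patterns drawn at random from the data set, and hypothetical names such as `train_som` and `winner_node`.

```python
import numpy as np

def winner_node(weights, x):
    """Steps 3 and 4: index of the node with the minimum squared distance to x."""
    distances = np.sum((x - weights) ** 2, axis=1)
    return int(np.argmin(distances))

def train_som(data, grid_shape=(5, 5), n_iter=2000, eta0=0.5, radius0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Step 1: random initial weights, one weight vector per output node.
    weights = rng.uniform(0.0, 1.0, (rows * cols, data.shape[1]))
    # Grid coordinates of each output node, used to define the neighborhood.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]        # Step 2: present a new pattern
        j_star = winner_node(weights, x)         # Steps 3 and 4: winner node
        shrink = 1.0 - t / n_iter
        eta = eta0 * shrink                      # gain term, decreasing with time
        radius = radius0 * shrink                # neighborhood radius, decreasing
        grid_distance = np.max(np.abs(coords - coords[j_star]), axis=1)
        neighbors = grid_distance <= radius      # winner and its neighbors
        # Step 5: move the selected weight vectors toward the input.
        weights[neighbors] += eta * (x - weights[neighbors])
    return weights.reshape(rows, cols, data.shape[1])   # Step 6 is the loop itself

# Two-component inputs, uniformly distributed in the unit square.
data = np.random.default_rng(42).uniform(0.0, 1.0, (200, 2))
som = train_som(data)
```

Plotting the rows and columns of `som` as connected lines in the input plane is a common way to visualize how the weights spread over the input distribution, in the spirit of the Kohonen map example discussed in the text.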

Figure 12. Kohonen Map for the two-dimensional case

8. FUTURE PERSPECTIVES

Artificial neural networks are inspired by the biological performance of the human brain, which they attempt to emulate. This is the main link between biological and artificial neural networks. From this starting point, both disciplines follow separate ways. The present understanding of brain mechanisms is so limited that the systems designer does not have sufficient data to emulate its operation. Therefore, the engineer has to go one step beyond biological knowledge, searching for and devising useful algorithms and structures that efficiently solve given problems. In the vast majority of cases, this search delivers a result that diverges completely from biological reality, and the brain similarities become metaphors.

Despite this faint and often nonexistent analogy between biology and artificial neural networks, the results of the latter frequently evoke comparisons with the former, because they are reminiscent of the operation of the brain. Unfortunately, these comparisons are not benign and produce unrealistic expectations that lead to disappointment. Research based on false expectations can evaporate when illuminated by the light of reality, as happened in the sixties. This promising research field could be eclipsed again if we do not resist the temptation of comparing our results with those of the brain.

It has been said that NNs can be applied to all activities specific to the human brain. Currently, they are considered an alternative for all those tasks where conventional computation does not achieve satisfactory results. There has been speculation about a near future in which NNs will take a place alongside classical computation. However, this will only happen if researchers achieve sufficient knowledge for that development. Currently, the theoretical knowledge is not robust enough to justify such predictions.

REFERENCES

W.S. McCulloch and W. Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics, 5:115–133, 1943.
W. Pitts and W.S. McCulloch, How We Know Universals, Bulletin of Mathematical Biophysics, 9:127–147, 1947.
D.O. Hebb, Organization of Behaviour, Science Editions, New York, 1961.
F. Rosenblatt, Principles of Neurodynamics, Science Editions, New York, 1962.
B. Widrow and M.E. Hoff, Adaptive Switching Circuits, In IRE WESCON Convention Record, pages 96–104, 1960.
M. Minsky and S. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.
G. Cybenko, Approximation by Superposition of a Sigmoidal Function, Mathematics of Control, Signals, and Systems, 2:303–314, 1989.
K. Hornik, M. Stinchcombe and H. White, Multilayer Feedforward Networks are Universal Approximators, Neural Networks, 2(5):359–366, 1989.
K. Hornik, M. Stinchcombe and H. White, Universal Approximation of an Unknown Mapping and Its Derivatives using Multilayer Feedforward Networks, Neural Networks, 3:551–560, 1990.
S. Haykin, Neural Networks. A Comprehensive Foundation, Macmillan College Publishing, Ontario, 1994.
P.J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioural Sciences, PhD thesis, Harvard University, Boston, 1974.
D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning Internal Representations by Error Propagation, In D.E. Rumelhart, J.L. McClelland and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations, pages 318–362, MIT Press, Cambridge, MA, 1986.
D.B. Parker, Learning Logic, Technical Report TR-47, Cambridge, MA: MIT Center for Research in Computational Economics and Management Science, 1985.
D.R. Hush and B.G. Horne, Progress in Supervised Neural Networks. What’s new since Lippman?, IEEE Signal Processing Magazine, 2:721–729, January, 1993.
N.B. Karayiannis and A.N. Venetsanopoulos, Artificial Neural Networks, Learning Algorithms, Performance Evaluation and Applications, Kluwer Academic Publishers, Boston, MA, 1993.
H.A. Bourland and N. Morgan, Connectionist Speech Recognition. A Hybrid Approach, Kluwer Academic Publishers, Boston, MA, 1994.
B. Widrow and S.D. Stearns, Adaptive Signal Processing, Prentice-Hall, Signal Processing Series, Englewood Cliffs, NJ, 1985.
R.P. Lipmann, An Introduction to Computing with Neural Nets, IEEE ASSP Magazine, 328–339, April, 1987.
T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1984.

CHAPTER 3 ARTIFICIAL NEURAL NETWORKS

D. T. PHAM, M. S. PACKIANATHER, A. A. AFIFY Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 3AA, United Kingdom

INTRODUCTION

Artificial neural networks are computational models of the brain. There are many types of neural networks representing the brain’s structure and operation with varying degrees of sophistication. This chapter provides an introduction to the main types of networks and presents examples of each type.

1. TYPES OF NEURAL NETWORKS

Neural networks generally consist of a number of interconnected processing elements (PEs) or neurons. How the inter-neuron connections are arranged and the nature of the connections determine the structure of a network. How the strengths of the connections are adjusted or trained to achieve a desired overall behaviour of the network is governed by its learning algorithm. Neural networks can be classified according to their structures and learning algorithms.

1.1 Structural Categorisation

In terms of their structures, neural networks can be divided into two types: feedforward networks and recurrent networks.

Feedforward networks: In a feedforward network, the neurons are generally grouped into layers. Signals flow from the input layer through to the output layer via unidirectional connections, the neurons being connected from one layer to the next, but not within the same layer. Examples of feedforward networks include the multi-layer perceptron (MLP) [Rumelhart and McClelland, 1986], the radial basis function (RBF) network [Broomhead and Lowe, 1988; Moody and Darken, 1989], the learning vector quantization (LVQ) network [Kohonen, 1989], the cerebellar model articulation control (CMAC) network [Albus, 1975a], the group-method of data handling (GMDH) network [Hecht-Nielsen, 1990] and some spiking neural networks [Maass, 1997]. Feedforward networks can most naturally perform static mappings between an input space and an output space: the output at a given instant is a function only of the input at that instant.

Recurrent networks: In a recurrent network, the outputs of some neurons are fed back to the same neurons or to neurons in preceding layers. Thus, signals can flow in both forward and backward directions. Examples of recurrent networks include the Hopfield network [Hopfield, 1982], the Elman network [Elman, 1990] and the Jordan network [Jordan, 1986]. Recurrent networks have a dynamic memory: their outputs at a given instant reflect the current input as well as previous inputs and outputs.

D. Andina and D.T. Pham (eds.), Computational Intelligence, 67–92. © 2007 Springer.

1.2 Learning Algorithm Categorisation

Neural networks are trained by two main types of learning algorithms: supervised and unsupervised learning algorithms. In addition, there exists a third type, reinforcement learning, which can be regarded as a special form of supervised learning.

Supervised learning: A supervised learning algorithm adjusts the strengths or weights of the inter-neuron connections according to the difference between the desired and actual network outputs corresponding to a given input. Thus, supervised learning requires a teacher or supervisor to provide desired or target output signals. Examples of supervised learning algorithms include the delta rule [Widrow and Hoff, 1960], the generalised delta rule or backpropagation algorithm [Rumelhart and McClelland, 1986] and the LVQ algorithm [Kohonen, 1989].

Unsupervised learning: Unsupervised learning algorithms do not require the desired outputs to be known. During training, only input patterns are presented to the neural network, which automatically adapts the weights of its connections to cluster the input patterns into groups with similar features. Examples of unsupervised learning algorithms include the Kohonen [Kohonen, 1989] and Carpenter-Grossberg Adaptive Resonance Theory (ART) [Carpenter and Grossberg, 1988] competitive learning algorithms.

Reinforcement learning: As mentioned before, reinforcement learning is a special case of supervised learning. Instead of using a teacher to give target outputs, a reinforcement learning algorithm employs a critic only to evaluate the goodness of the neural network output corresponding to a given input. An example of a reinforcement learning algorithm is the genetic algorithm (GA) [Holland, 1975; Goldberg, 1989].

2. NEURAL NETWORK EXAMPLES

This section briefly describes the example neural networks and associated learning algorithms cited previously.

2.1 Multi-layer Perceptron (MLP)

MLPs are perhaps the best known type of feedforward networks. Figure 1a shows an MLP with three layers: an input layer, an output layer and an intermediate or hidden layer. Neurons in the input layer only act as buffers for distributing the input signals $x_i$ to neurons in the hidden layer. Each neuron $j$ (Figure 1b) in the hidden layer sums up its input signals $x_i$ after weighting them with the strengths of the respective connections $w_{ji}$ from the input layer and computes its output $y_j$ as a function $f$ of the sum, viz.

(1)  $y_j = f\left(\sum_i w_{ji} x_i\right)$

$f$ can be a simple threshold function or a sigmoidal, hyperbolic tangent or radial basis function (see Table 1). The output of neurons in the output layer is computed similarly.

Figure 1a. A multi-layer perceptron (input layer $x_1, \ldots, x_m$; hidden layer with weights $w_{11}, w_{12}, \ldots, w_{1m}$; output layer $y_1, \ldots, y_n$)

Figure 1b. Details of a neuron (inputs $x_1, \ldots, x_n$ weighted by $w_{j1}, \ldots, w_{jn}$, summed and passed through $f(\cdot)$ to give $y_j$)

Table 1. Activation functions

Type of function        Function
Linear                  $f(s) = s$
Threshold               $f(s) = +1$ if $s > s_t$, $-1$ otherwise
Sigmoid                 $f(s) = 1/(1 + \exp(-s))$
Hyperbolic tangent      $f(s) = (1 - \exp(-2s))/(1 + \exp(-2s))$
Radial basis function   $f(s) = \exp(-s^2/2)$

The backpropagation (BP) algorithm, a gradient descent algorithm, is the most commonly adopted MLP training algorithm. It gives the change $\Delta w_{ji}$ in the weight of a connection between neurons $i$ and $j$ as follows:

(2)  $\Delta w_{ji} = \eta \delta_j x_i$

where $\eta$ is a parameter called the learning rate and $\delta_j$ is a factor depending on whether neuron $j$ is an output neuron or a hidden neuron. For output neurons,

(3)  $\delta_j = \frac{\partial f}{\partial net_j}\left(y_j^{(t)} - y_j\right)$

and for hidden neurons,

(4)  $\delta_j = \frac{\partial f}{\partial net_j} \sum_q w_{qj} \delta_q$

In Equation (3), $net_j$ is the total weighted sum of input signals to neuron $j$ and $y_j^{(t)}$ is the target output for neuron $j$. As there are no target outputs for hidden neurons, in Equation (4), the difference between the target and actual output of a hidden neuron $j$ is replaced by the weighted sum of the $\delta_q$ terms already obtained for neurons $q$ connected to the output of $j$. Thus, iteratively, beginning with the output layer, the $\delta$ term is computed for neurons in all layers and weight updates determined for all connections.

The weight updating process can take place after the presentation of each training pattern (pattern-based training) or after the presentation of the whole set of training patterns (batch training). In either case, a training epoch is said to have been completed when all training patterns have been presented once to the MLP. For all but the most trivial problems, several epochs are required for the MLP to be properly trained.

A commonly adopted method to speed up the training is to add a “momentum” term to Equation (2) which effectively lets the previous weight change influence the new weight change, viz:

(5)  $\Delta w_{ji}(k + 1) = \eta \delta_j x_i + \mu \Delta w_{ji}(k)$

where $\Delta w_{ji}(k + 1)$ and $\Delta w_{ji}(k)$ are weight changes in epochs $k + 1$ and $k$ respectively and $\mu$ is the “momentum” coefficient.
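The update rules in Equations (1) to (5) can be sketched for a small three-layer MLP as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the sigmoid activation, the function names (`forward`, `train_pattern`), the parameter values and the example weights are all choices made here, and pattern-based training is used.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, w_hidden, w_out):
    """Equation (1) applied to the hidden layer and then to the output layer."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]
    return h, y

def train_pattern(x, target, w_hidden, w_out, prev_h, prev_o, eta=0.5, mu=0.9):
    """One pattern-based update (Equations (2)-(5)); weights are changed in place.

    Returns the squared output error measured before the update.
    """
    h, y = forward(x, w_hidden, w_out)
    # Equation (3): for a sigmoid, df/dnet = y * (1 - y).
    d_out = [yj * (1 - yj) * (t - yj) for yj, t in zip(y, target)]
    # Equation (4): weighted sum of the output deltas (computed before w_out changes).
    d_hid = [hj * (1 - hj) * sum(w_out[q][j] * d_out[q] for q in range(len(y)))
             for j, hj in enumerate(h)]
    for weights, prev, deltas, inputs in ((w_out, prev_o, d_out, h),
                                          (w_hidden, prev_h, d_hid, x)):
        for j, dj in enumerate(deltas):
            for i, xi in enumerate(inputs):
                # Equation (5): eta * delta * x plus mu times the previous change.
                prev[j][i] = eta * dj * xi + mu * prev[j][i]
                weights[j][i] += prev[j][i]
    return sum((t - yj) ** 2 for yj, t in zip(y, target))

# Hypothetical example: 3 inputs, 2 hidden neurons, 1 output neuron.
w_hidden = [[0.1, -0.2, 0.05], [0.3, 0.1, -0.1]]
w_out = [[0.2, -0.3]]
prev_h = [[0.0] * 3 for _ in w_hidden]   # previous weight changes (momentum memory)
prev_o = [[0.0] * 2 for _ in w_out]
errors = [train_pattern([1.0, 0.0, 1.0], [1.0], w_hidden, w_out, prev_h, prev_o)
          for _ in range(50)]
```

Repeating the call over every pattern in the training set gives one epoch; batch training would instead accumulate the changes over all patterns before applying them.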


Another learning method suitable for training MLPs is the genetic algorithm (GA). This is an optimisation algorithm based on evolution principles. The weights of the connections are considered genes in a chromosome. The goodness or fitness of the chromosome is directly related to how well trained the MLP is. The algorithm starts with a randomly generated population of chromosomes and applies genetic operators to create new and fitter populations.

The most common genetic operators are the selection, crossover and mutation operators. The selection operator chooses chromosomes from the current population for reproduction. Usually, a biased selection procedure is adopted which favours the fitter chromosomes. The crossover operator creates two new chromosomes from two existing chromosomes by cutting them at a random position and exchanging the parts following the cut. The mutation operator produces a new chromosome by randomly changing the genes of an existing chromosome. Together, these operators simulate a guided random search method which can eventually yield the optimum set of weights to minimise the differences between the actual and target outputs of the neural network. Further details of genetic algorithms can be found in the chapter on Soft Computing and its Applications in Engineering and Manufacture.
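The three operators can be sketched as below. This is an illustrative sketch only: tournament selection stands in for the biased selection procedure, Gaussian perturbation stands in for mutation, and copying the best chromosome unchanged into the next population (elitism) is an extra assumption not stated in the text. The error function `sse` uses a single linear neuron as a stand-in for a full MLP, and all names and parameter values are hypothetical.

```python
import random

def evolve(error, n_genes, pop_size=20, generations=40, p_mut=0.1, seed=0):
    """Search for a low-error weight vector (chromosome); needs n_genes >= 2."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=error)                    # fittest (lowest error) first
        history.append(error(pop[0]))
        new_pop = [pop[0][:]]                  # elitism: an assumption made here
        while len(new_pop) < pop_size:
            # Biased selection: keep the better of two randomly drawn chromosomes.
            p1 = min(rng.sample(pop, 2), key=error)
            p2 = min(rng.sample(pop, 2), key=error)
            cut = rng.randrange(1, n_genes)    # random cut position for crossover
            for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
                # Mutation: randomly perturb some genes.
                child = [g + rng.gauss(0, 0.3) if rng.random() < p_mut else g
                         for g in child]
                new_pop.append(child)
        pop = new_pop[:pop_size]
    return min(pop, key=error), history

# Hypothetical training set for a two-input linear neuron.
patterns = [([0.0, 1.0], 1.0), ([1.0, 0.0], -1.0), ([1.0, 1.0], 0.0)]

def sse(weights):
    """Sum-squared error between target and actual outputs."""
    return sum((t - sum(w * x for w, x in zip(weights, xs))) ** 2
               for xs, t in patterns)

best, history = evolve(sse, n_genes=2)
```

Because the fitness is evaluated only through `error`, the same loop applies to any network whose weights can be flattened into one list.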

2.2 Radial Basis Function (RBF) Network

Large multi-layer perceptron (MLP) networks take a long time to train. This has led to the construction of alternative networks such as the Radial Basis Function (RBF) network [Cichocki and Unbehauen, 1993; Hassoun, 1995; Haykin, 1999]. After the MLP, the RBF network is the most widely used network. Figure 2 shows the structure of an RBF network, which consists of three layers. The input layer neurons receive the inputs $x_1, \ldots, x_M$. The hidden layer neurons provide a set of activation functions that constitute an arbitrary “basis” for the input patterns in the input space to be expanded into the hidden space by way of non-linear transformation. At the input of each hidden neuron, the distance between the centre of each activation or basis function and the input vector is calculated. Applying the basis function to this distance produces the output of the hidden neuron. The RBF network output $y$ is formed by the neuron in the output layer as a weighted sum of the hidden layer neuron activations.

Figure 2. The RBF network (input layer $x_1, \ldots, x_k, \ldots, x_M$; hidden layer; output layer with weights $w_1, \ldots, w_i, \ldots, w_N$ and output $y$)

Figure 3. The Radial Basis Function ($K(x)$ plotted against $x$, with peak value 1.0 at $x = 0$ and standard deviation $\sigma = 1$)

The basis function is generally chosen to be a standard function which is positive at its centre $x = 0$ and then decreases uniformly to zero on either side as shown in Figure 3. A common choice is the Gaussian distribution function:

(6)  $K(x) = \exp\left(-\frac{x^2}{2}\right)$

This function can be shifted to an arbitrary centre, $x = c$, and stretched by varying its standard deviation $\sigma$ as follows:

(7)  $K\left(\frac{x - c}{\sigma}\right) = \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right)$

The output of the RBF network $y$ is given by:

(8)  $y = \sum_{i=1}^{N} w_i K\left(\frac{\|x - c_i\|}{\sigma_i}\right) \quad \forall x$

where $w_i$ is the weight of the hidden neuron $i$, $c_i$ the centre of basis function $i$ and $\sigma_i$ the standard deviation of the function. $\|x - c_i\|$ is the norm of $(x - c_i)$. There are various ways to calculate the norm. The most common is the Euclidean norm given by:

(9)  $\|x - c_i\| = \sqrt{(x_1 - c_{i1})^2 + (x_2 - c_{i2})^2 + \cdots + (x_M - c_{iM})^2}$

This norm gives the distance between the two points $x$ and $c_i$ in $M$-dimensional space. All points $x$ that are the same radial distance from $c_i$ give the same value for the norm and hence the same value for the basis function. Hence the basis functions are called Radial Basis Functions.

Obtaining the values for $w_i$, $c_i$ and $\sigma_i$ requires training the RBF network. Because the basis functions are differentiable, back-propagation could be used as with MLP networks. Training of a multiple-input single-output RBF network can proceed as follows:

(i) choose the number $N$ of hidden units; There is no firm guidance available for this. The selection of $N$ is normally made by trial and error. In general, the smallest $N$ that gives the RBF network an acceptable performance is adopted.


(ii) choose the centres, $c_i$; Centre selection could be performed in three different ways [Haykin, 1999]:
a) Trial and error: Centres can be selected by trial and error. This is not always easy if little is known about the underlying functional behaviour of the data. Usually, the centres are spread evenly or randomly over the $M$-dimensional input space.
b) Self-organized selection: An adaptive unsupervised method can be used to learn where to place the centres.
c) Supervised selection: A supervised learning process, commonly error correction learning, can be deployed to fix the centres.

(iii) choose stretch constants, $\sigma_i$; Several heuristics are available. A popular way is to set $\sigma_i$ equal to the distance to the nearest neighbour. First the distances between centres are computed, then the nearest distance is chosen to be the value of $\sigma_i$.

(iv) calculate weights, $w_i$. When $c_i$ and $\sigma_i$ are known, the outputs of hidden units $(O_1, \ldots, O_N)^T$ can be calculated for any pattern of inputs $x = (x_1, \ldots, x_M)$. Assuming there are $P$ input patterns $x$ in the training set, there will be $P$ sets of hidden unit outputs that can be calculated. These can be assembled in an $N \times P$ matrix:

(10)  $O = \begin{bmatrix} O_1^{(1)} & O_1^{(2)} & \cdots & O_1^{(P)} \\ O_2^{(1)} & O_2^{(2)} & \cdots & O_2^{(P)} \\ \vdots & \vdots & & \vdots \\ O_N^{(1)} & O_N^{(2)} & \cdots & O_N^{(P)} \end{bmatrix}$

If the output $y_i$ of the RBF network corresponding to training input pattern $x_i$ is $y_i = O_1^{(i)} w_1 + O_2^{(i)} w_2 + \cdots + O_N^{(i)} w_N$, the following equation can be obtained:

(11)  $y = \begin{bmatrix} y_1 \\ \vdots \\ y_P \end{bmatrix} = \begin{bmatrix} O_1^{(1)} & \cdots & O_N^{(1)} \\ \vdots & & \vdots \\ O_1^{(P)} & \cdots & O_N^{(P)} \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = O^T \cdot w$

$y$ is the vector of actual outputs corresponding to the training inputs $x$. Ideally, $y$ should be equal to $d$, the desired/target outputs. Unknown coefficients $w_i$ can be chosen to minimise the sum-squared-error of $y$ compared with $d$. It can be shown that this is achieved when:

(12)  $w = (O O^T)^{-1} O d$
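Steps (iii) and (iv) can be sketched in plain Python as follows. The example is an assumption-laden illustration: the centres are simply placed on the four training patterns of an XOR-like problem (one possible outcome of step (ii)), the helper names are hypothetical, and a small Gauss-Jordan routine is used in place of a library matrix inverse to evaluate Equation (12).

```python
import math

def hidden_outputs(x, centres, sigmas):
    """Gaussian basis of Equation (7) applied to the Euclidean norm of Equation (9)."""
    return [math.exp(-math.dist(x, c) ** 2 / (2 * s ** 2))
            for c, s in zip(centres, sigmas)]

def nearest_neighbour_sigmas(centres):
    """Step (iii): sigma_i is the distance from centre i to its nearest neighbour."""
    return [min(math.dist(c, o) for k, o in enumerate(centres) if k != i)
            for i, c in enumerate(centres)]

def solve(a, b):
    """Solve a . w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def train_weights(patterns, targets, centres, sigmas):
    """Step (iv), Equation (12): w = (O O^T)^-1 O d, solved as (O O^T) w = O d."""
    cols = [hidden_outputs(x, centres, sigmas) for x in patterns]  # columns of O
    n = len(centres)
    oot = [[sum(col[i] * col[j] for col in cols) for j in range(n)] for i in range(n)]
    od = [sum(col[i] * d for col, d in zip(cols, targets)) for i in range(n)]
    return solve(oot, od)

def predict(x, centres, sigmas, w):
    """Equation (8): weighted sum of the hidden-unit outputs."""
    return sum(wi * oi for wi, oi in zip(w, hidden_outputs(x, centres, sigmas)))

patterns = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [0.0, 1.0, 1.0, 0.0]
centres = [p[:] for p in patterns]            # assumed centre placement
sigmas = nearest_neighbour_sigmas(centres)
w = train_weights(patterns, targets, centres, sigmas)
```

With as many centres as patterns the fit is exact; with fewer centres, Equation (12) gives the least-squares compromise.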

2.3 Learning Vector Quantization (LVQ) Network

Figure 4 shows an LVQ network which comprises three layers of neurons: an input buffer layer, a hidden layer and an output layer. The network is fully connected between the input and hidden layers and partially connected between the hidden and output layers, with each output neuron linked to a different cluster of hidden neurons. The weights of the connections between the hidden and output neurons are fixed to 1. The weights of the input-hidden neuron connections form the components of reference vectors (one reference vector is assigned to each hidden neuron). They are modified during the training of the network. Both the hidden neurons (also known as Kohonen neurons) and the output neurons have binary outputs. When an input pattern is supplied to the network, the hidden neuron whose reference vector is closest to the input pattern is said to win the competition for being activated and thus allowed to produce a “1”. All other hidden neurons are forced to produce a “0”. The output neuron connected to the cluster of hidden neurons that contains the winning neuron also emits a “1” and all other output neurons a “0”. The output neuron that produces a “1” gives the class of the input pattern, each output neuron being dedicated to a different class.

Figure 4. Learning Vector Quantization network (input vector; input layer; hidden (Kohonen) layer with reference vectors; output layer)

The simplest LVQ training procedure is as follows:
(i) initialise the weights of the reference vectors;
(ii) present a training input pattern to the network;
(iii) calculate the (Euclidean) distance between the input pattern and each reference vector;
(iv) update the weights of the reference vector that is closest to the input pattern, that is, the reference vector of the winning hidden neuron. If the latter belongs to the cluster connected to the output neuron in the class that the input pattern is known to belong to, the reference vector is brought closer to the input pattern. Otherwise, the reference vector is moved away from the input pattern;
(v) return to (ii) with a new training input pattern and repeat the procedure until all training patterns are correctly classified (or a stopping criterion is met).

For other LVQ training procedures, see for example [Pham and Oztemel, 1994].
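Steps (ii) to (v) can be sketched as follows. The text only says that the winning reference vector is brought closer to, or moved away from, the input pattern; moving it by a fixed fraction `rate` of the difference is an assumption made here, as are the function names, the data and the initial reference vectors.

```python
import math

def winner(x, refs):
    """Step (iii): index of the reference vector closest to x (Euclidean distance)."""
    return min(range(len(refs)), key=lambda k: math.dist(x, refs[k]))

def lvq_step(x, label, refs, ref_labels, rate=0.1):
    """Step (iv): attract the winner if its class matches, otherwise repel it."""
    win = winner(x, refs)
    sign = 1.0 if ref_labels[win] == label else -1.0
    refs[win] = [r + sign * rate * (xi - r) for r, xi in zip(refs[win], x)]
    return win

def classify(x, refs, ref_labels):
    """Class of the output neuron linked to the winning hidden neuron."""
    return ref_labels[winner(x, refs)]

# Hypothetical two-class training set and one reference vector per class.
data = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([1.0, 0.9], 1), ([0.8, 1.0], 1)]
refs = [[0.4, 0.4], [0.6, 0.6]]
ref_labels = [0, 1]

for _ in range(10):                       # step (v): repeat over the training set
    for x, label in data:
        lvq_step(x, label, refs, ref_labels)
    if all(classify(x, refs, ref_labels) == label for x, label in data):
        break                             # stopping criterion: all patterns correct
```

A cluster of several hidden neurons per class is obtained simply by repeating a class label in `ref_labels`.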

2.4 CMAC Network

CMAC (Cerebellar Model Articulation Control) [Albus, 1975a, 1975b, 1979a, 1979b; An et al., 1994] can be considered a supervised feedforward neural network with the characteristics of a fuzzy associative memory. A basic CMAC module is shown in Figure 5. CMAC consists of a series of mappings:

(13)  $S \xrightarrow{e} M \xrightarrow{f} A \xrightarrow{g} u$

where
$S$ = {input vectors}
$M$ = {intermediate variables}
$A$ = {association cell vectors}
$u$ = output of CMAC $\equiv h(S)$
$h \equiv g \cdot f \cdot e$

(a) Input encoding ($S \to M$ mapping)

The $S \to M$ mapping is a set of submappings, one for each input variable:

(14)  $S \to M = \begin{bmatrix} s_1 \to m_1 \\ s_2 \to m_2 \\ \vdots \\ s_n \to m_n \end{bmatrix}$

Figure 5. A basic CMAC module (input $S$; input encoding $S \to M$; address computing $M \to A$; weight table $A \to u$; actual output $u$ compared with the desired output)

The range of $s_i$ is coarsely discretised using the quantising functions $q_1, q_2, \ldots, q_k$. Each function divides the range into $k$ intervals. The intervals produced by function $q_{j+1}$ are offset by one $k$th of the range compared to their counterparts produced by function $q_j$. $m_i$ is a set of $k$ intervals generated by $q_1$ to $q_k$ respectively.

An example is given in Figure 6 to illustrate the internal mappings within a CMAC module. The $S \to M$ mapping is shown in the leftmost part of the figure. In Figure 6, two input variables $s_1$ and $s_2$ are represented with unity resolution in the range of 0 to 8. The range of each input variable is described using three quantising functions. For example, the range of $s_1$ is described by functions $q_1$, $q_2$ and $q_3$. $q_1$ divides the range into intervals A, B, C and D. $q_2$ gives intervals E, F, G and H and $q_3$ provides intervals I, J, K and L. That is,

$q_1 = \{A, B, C, D\}$
$q_2 = \{E, F, G, H\}$
$q_3 = \{I, J, K, L\}$

For every value of $s_1$, there exists a set of elements, $m_1$, which are the intersection of the functions $q_1$ to $q_3$, such that the value of $s_1$ uniquely defines set $m_1$ and vice versa. For example, value $s_1 = 5$ maps to set $m_1 = \{B, G, K\}$ and vice versa. Similarly, value $s_2 = 4$ maps to set $m_2 = \{b, g, j\}$ and vice versa.

The $S \to M$ mapping gives CMAC two advantages. The first is that a single precise variable $s_i$ can be transmitted over several imprecise information channels. Each channel carries only a small part of the information of $s_i$. This increases the reliability of the information transmission. The other advantage is that small changes in the value of $s_i$ have no influence on most of the elements in $m_i$. This leads to the property of input generalisation, which is important in an environment where random noise exists.

Figure 6. Internal mappings within a CMAC module (inputs $s_1$ and $s_2$ on axes from 0 to 8; quantising intervals A to L for $s_1$ and a to l for $s_2$ giving $m_1$ and $m_2$; weight tables holding $X_1$, $X_2$ and $X_3$; summed output $u$)

(b) Address computing ($M \to A$ mapping)

$A$ is a set of address vectors associated with weight tables. $A$ is obtained by combining the elements of $m_i$. For example, in Figure 6, the sets $m_1 = \{B, G, K\}$ and $m_2 = \{b, g, j\}$ are combined to give the set of elements $A = \{a_1, a_2, a_3\} = \{Bb, Gg, Kj\}$.

(c) Output computing ($A \to u$ mapping)

This mapping involves looking up the weight tables and adding the contents of the addressed locations to yield the output of the network. The following formula is employed:

(15)  $u = \sum_i w_i(a_i)$

That is, only the weights associated with the addresses $a_i$ in $A$ are summed. For this given example, these weights are:

$w(Bb) = x_1$, $w(Gg) = x_2$, $w(Kj) = x_3$

Thus the output is:

(16)  $u = x_1 + x_2 + x_3$

Training a CMAC module consists of adjusting the stored weights. Assuming that $f$ is the function that CMAC has to learn, the following training steps could be adopted:
(i) select a point $S$ in the input space and obtain the current output $u$ corresponding to $S$;
(ii) let $\bar{u}$ be the desired output of CMAC, that is, $\bar{u} = f(S)$;
(iii) if $|\bar{u} - u| \le \varepsilon$, where $\varepsilon$ is an acceptable error, then do nothing; the desired value is already stored in CMAC. However, if $|\bar{u} - u| > \varepsilon$, then add to every weight which contributed to $u$ the quantity

(17)  $\Delta = \alpha \frac{\bar{u} - u}{|A|}$

where $|A|$ = the number of weights which contributed to $u$ and $\alpha$ is the learning rate.
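The three mappings and the training rule of Equation (17) can be sketched as below. This is a hedged illustration: the interval layout (`(s + j) // k`, that is, `k` quantisations each shifted by one resolution unit) is an assumption chosen for brevity and does not reproduce the exact interval boundaries of Figure 6; the dictionary used as a weight table and all function names are likewise hypothetical.

```python
def encode(s, k=3):
    """S -> M: one interval index from each of k offset quantising functions."""
    return tuple((s + j) // k for j in range(k))

def addresses(inputs, k=3):
    """M -> A: combine the j-th interval of every input into one address."""
    m = [encode(s, k) for s in inputs]
    return [(j,) + tuple(mi[j] for mi in m) for j in range(k)]

def output(inputs, weights, k=3):
    """A -> u, Equation (15): sum the weights stored at the active addresses."""
    return sum(weights.get(a, 0.0) for a in addresses(inputs, k))

def train(inputs, desired, weights, k=3, alpha=0.5, tol=1e-3):
    """Training steps (i)-(iii) with the correction of Equation (17)."""
    u = output(inputs, weights, k)
    if abs(desired - u) > tol:
        active = addresses(inputs, k)
        delta = alpha * (desired - u) / len(active)
        for a in active:
            weights[a] = weights.get(a, 0.0) + delta
    return u

weights = {}                       # empty weight table: every output starts at 0
for _ in range(20):
    train((5, 4), 2.0, weights)    # teach the value 2.0 at the point S = (5, 4)

u_trained = output((5, 4), weights)
u_neighbour = output((6, 4), weights)   # shares two of three addresses with (5, 4)
u_far = output((0, 0), weights)         # shares none
```

The neighbouring point receives part of the learnt value without ever being trained, which is the input generalisation property described above.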

2.5 Group Method of Data Handling (GMDH) Network

Figure 7 shows a GMDH network and the details of one of its neurons. Unlike the feedforward neural networks previously described which have a fixed structure, a GMDH network has a structure which grows during training. Each neuron in a GMDH network usually has two inputs $x_1$ and $x_2$ and produces an output $y$ that is a quadratic combination of these inputs, viz.

(18)  $y = w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_1 x_2 + w_4 x_2^2 + w_5 x_2$

Figure 7a. A trained GMDH network (inputs $x_1$ to $x_4$, layers of N-Adaline neurons, output $y$). Note: Each GMDH neuron is an N-Adaline, which is an Adaptive Linear Element with a nonlinear preprocessor.

Figure 7b. Details of a GMDH neuron (nonlinear processor forming $x_1$, $x_1^2$, $x_1 x_2$, $x_2^2$ and $x_2$; weights $w_0$ to $w_5$ with a constant input of +1; the output is compared with the desired output $y^d$ to give the error $e$)

Training a GMDH network consists of configuring the network starting with the input layer, adjusting the weights of each neuron, and increasing the number of layers until the accuracy of the mapping achieved with the network deteriorates.


The number of neurons in the first layer depends on the number of external inputs available. For each pair of external inputs, one neuron is used. Training proceeds with presenting an input pattern to the input layer and adapting the weights of each neuron according to a suitable learning algorithm, such as the delta rule (see for example [Pham and Liu, 1994]), viz.

(19)  $W_{k+1} = W_k + \alpha \frac{X_k}{\|X_k\|^2}\left(y_k^d - W_k^T X_k\right)$

where $\alpha$ is the learning rate, and $W_k$, the weight vector of a neuron at time $k$, and $X_k$, the modified input vector to the neuron at time $k$, are defined as

(20)  $W_k = [w_0 \; w_1 \; w_2 \; w_3 \; w_4 \; w_5]^T$

(21)  $X_k = [1 \; x_1 \; x_1^2 \; x_1 x_2 \; x_2^2 \; x_2]^T$
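A single GMDH neuron and one delta-rule update can be sketched as follows. The function names and the value of the learning rate `alpha` are illustrative assumptions; the formulas follow Equations (18) to (21).

```python
def features(x1, x2):
    """Modified input vector X_k of Equation (21)."""
    return [1.0, x1, x1 * x1, x1 * x2, x2 * x2, x2]

def neuron_output(w, x1, x2):
    """Equation (18): quadratic combination of the two inputs."""
    return sum(wi * xi for wi, xi in zip(w, features(x1, x2)))

def delta_rule_step(w, x1, x2, desired, alpha=0.5):
    """Equation (19): move W along X_k, scaled by the error and by 1 / ||X_k||^2."""
    x = features(x1, x2)
    err = desired - neuron_output(w, x1, x2)
    norm_sq = sum(xi * xi for xi in x)
    return [wi + alpha * err * xi / norm_sq for wi, xi in zip(w, x)]

w0 = [0.0] * 6
w1 = delta_rule_step(w0, 1.0, 2.0, desired=3.0, alpha=0.5)
y_after = neuron_output(w1, 1.0, 2.0)   # error on this pattern shrinks by (1 - alpha)
```

Because of the normalisation by $\|X_k\|^2$, one update removes a fraction `alpha` of the error on the pattern just presented, whatever the size of the inputs.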

and $y_k^d$ is the desired network output at time $k$. Note that, for this description, it is assumed that the GMDH network only has one output. Equation (19) shows that the desired network output is presented to each neuron in the input layer and an attempt is made to train each neuron to produce that output. When the sum of the mean square errors $SE$ over all the desired outputs in the training data set for a given neuron reaches the minimum for that neuron, the weights of the neuron are frozen and its training halted. When the training has ended for all neurons in a layer, the training for the layer stops.

Neurons that produce $SE$ values below a given threshold when another set of data (known as the selection data set) is presented to the network are selected to grow the next layer. At each stage, the smallest $SE$ value achieved for the selection data set is recorded. If the smallest $SE$ value for the current layer is less than that for the previous layer (that is, the accuracy of the network is improving), a new layer is generated, the size of which depends on the number of neurons just selected. The training and selection processes are repeated until the $SE$ value deteriorates. The best neuron in the immediately preceding layer is then taken as the output neuron for the network.

2.6 Hopfield Network

Figure 8 shows one version of a Hopfield network. This network normally accepts binary or bipolar inputs (+1 or −1). It has a single “layer” of neurons, each connected to all the others, giving it a recurrent structure, as mentioned earlier. The training of a Hopfield network takes only one step, the weights $w_{ij}$ of the network being assigned directly as follows:

(22)  $w_{ij} = \begin{cases} \frac{1}{N} \sum_{c=1}^{P} x_i^c x_j^c & i \ne j \\ 0 & i = j \end{cases}$

where $w_{ij}$ is the connection weight from neuron $i$ to neuron $j$, and $x_i^c$ (which is either +1 or −1) is the $i$th component of the training input pattern for class $c$, $P$ the number of classes and $N$ the number of neurons (or the number of components in the input pattern). Note from Equation (22) that $w_{ij} = w_{ji}$ and $w_{ii} = 0$, a set of conditions that guarantee the stability of the network.

Figure 8. A Hopfield network (inputs $x_1, x_2, x_3, \ldots, x_N$; a single Hopfield layer with weights $w_{12}, w_{13}, \ldots, w_{1N}$; outputs $y_1, y_2, y_3, \ldots, y_N$)

When an unknown pattern is input to the network, its outputs are initially set equal to the components of the unknown pattern, viz.

(23)  $y_i(0) = x_i, \quad 1 \le i \le N$

Starting with these initial values, the network iterates according to the following equation until it reaches a minimum energy state, i.e. its outputs stabilise to constant values:

(24)  $y_i(k + 1) = f\left(\sum_{j=1}^{N} w_{ij} y_j(k)\right), \quad 1 \le i \le N$

where $f$ is a hard limiting function defined as

(25)  $f(x) = \begin{cases} -1 & x < 0 \\ 1 & x > 0 \end{cases}$
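Equations (22) to (25) can be sketched as follows. Two points are assumptions made here rather than statements from the text: all neurons are updated together at each iteration, and an output is left unchanged when its weighted sum is exactly zero, a case Equation (25) does not define. The function names and the stored patterns are illustrative.

```python
def train(patterns):
    """Equation (22): one-step weight assignment from bipolar (+1/-1) patterns."""
    n = len(patterns[0])
    return [[0.0 if i == j else sum(p[i] * p[j] for p in patterns) / n
             for j in range(n)] for i in range(n)]

def recall(w, x, max_iters=20):
    """Equations (23)-(25): iterate until the outputs stop changing."""
    y = list(x)                                   # Equation (23)
    for _ in range(max_iters):
        new_y = []
        for i, row in enumerate(w):
            s = sum(wij * yj for wij, yj in zip(row, y))   # Equation (24)
            # f(0) is undefined in Equation (25); keeping the old value is an assumption.
            new_y.append(1 if s > 0 else -1 if s < 0 else y[i])
        if new_y == y:
            break
        y = new_y
    return y

stored = [[1, 1, 1, 1, -1, -1, -1, -1],
          [1, -1, 1, -1, 1, -1, 1, -1]]
w = train(stored)
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]             # first stored pattern, one bit flipped
recalled = recall(w, noisy)
```

The iteration cap `max_iters` is a safeguard for this simultaneous-update variant; it is not part of the description above.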

2.7 Elman and Jordan Nets

Figures 9a and b show an Elman net and a Jordan net, respectively. These networks have a multi-layered structure similar to the structure of MLPs. In both nets, in addition to an ordinary hidden layer, there is another special hidden layer sometimes called the context or state layer. This layer receives feedback signals from the ordinary hidden layer (in the case of an Elman net) or from the output layer (in the case of a Jordan net). The Jordan net also has connections from each neuron in the context layer back to itself. With both nets, the outputs of neurons in the context layer are fed forward to the hidden layer. If only the forward connections are to be adapted and the feedback connections are preset to constant values, these networks can be considered ordinary feedforward networks and the BP algorithm used to train them. Otherwise, a GA could be employed [Pham and Karaboga, 1993b; Karaboga, 1994]. For improved versions of the Elman and Jordan nets, see [Pham and Liu, 1992; Pham and Oh, 1992].

Figure 9a. An Elman network (inputs; input units and context units; hidden units; output units; outputs)

Figure 9b. A Jordan network (input; input unit and context unit with self feedback; hidden layer; output unit; output, with output feedback to the context unit)
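The role of the context layer can be sketched for one time step as follows. The tanh hidden activation, the linear output layer, the feedback weights of 1 and the self-feedback gain `alpha` are all assumptions chosen for illustration; the text only states that the feedback connections may be preset to constant values.

```python
import math

def elman_step(x, context, w_in, w_ctx, w_out):
    """One time step: the hidden layer sees the input and the context units."""
    hidden = [math.tanh(sum(a * xi for a, xi in zip(wi, x)) +
                        sum(b * ci for b, ci in zip(wc, context)))
              for wi, wc in zip(w_in, w_ctx)]
    output = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return output, hidden[:]        # Elman: the new context copies the hidden outputs

def jordan_context(context, output, alpha=0.5):
    """Jordan: context receives the network output plus its own previous value."""
    return [alpha * c + o for c, o in zip(context, output)]

# Hypothetical net: 1 input, 2 hidden (and 2 context) units, 1 output.
w_in = [[0.5], [-0.3]]
w_ctx = [[0.2, 0.1], [0.0, 0.4]]
w_out = [[1.0, 1.0]]

context = [0.0, 0.0]
out1, context = elman_step([1.0], context, w_in, w_ctx, w_out)
out2, context = elman_step([1.0], context, w_in, w_ctx, w_out)
```

The same input presented twice gives two different outputs because the second step also sees the stored hidden state, which is the dynamic memory mentioned in Section 1.1.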

2.8 Kohonen Network

A Kohonen network or a self-organising feature map has two layers, an input buffer layer to receive the input pattern and an output layer (see Figure 10). Neurons in the output layer are usually arranged into a regular two-dimensional array. Each output neuron is connected to all input neurons. The weights of the connections form the components of the reference vector associated with the given output neuron.

Training a Kohonen network involves the following steps:
(i) initialise the reference vectors of all output neurons to small random values;
(ii) present a training input pattern;
(iii) determine the winning output neuron, i.e. the neuron whose reference vector is closest to the input pattern. The Euclidean distance between a reference vector and the input vector is usually adopted as the distance measure;
(iv) update the reference vector of the winning neuron and those of its neighbours. These reference vectors are brought closer to the input vector. The adjustment is greatest for the reference vector of the winning neuron and decreased for reference vectors of neurons further away. The size of the neighbourhood of a neuron is reduced as training proceeds until, towards the end of training, only the reference vector of a winning neuron is adjusted.

In a well-trained Kohonen network, output neurons that are close to one another have similar reference vectors. After training, a labelling procedure is adopted where input patterns of known classes are fed to the network and class labels are assigned to output neurons that are activated by those input patterns. As with the LVQ network, an output neuron is activated by an input pattern if it wins the competition against other output neurons, that is, if its reference vector is closest to the input pattern.

Figure 10. A Kohonen network (input vector; input neurons; reference vector; output neurons)
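Steps (iii) and (iv) can be sketched as a single update function. The Gaussian fall-off of the adjustment with distance on the output array is an assumption made here (the text only requires the adjustment to decrease away from the winner), and the names, the 2 x 2 array and the parameter values are illustrative.

```python
import math

def som_step(x, refs, grid, rate, radius):
    """Find the winner, then pull it and its array neighbours towards x."""
    win = min(range(len(refs)), key=lambda k: math.dist(x, refs[k]))
    for k, ref in enumerate(refs):
        d = math.dist(grid[k], grid[win])          # distance on the output array
        if d <= radius:
            # Largest adjustment at the winner (d = 0), smaller further away.
            h = rate * math.exp(-d * d / (2 * max(radius, 1e-9) ** 2))
            refs[k] = [r + h * (xi - r) for r, xi in zip(ref, x)]
    return win

# Hypothetical 2 x 2 output array with one reference vector per output neuron.
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
refs = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
win = som_step([0.9, 0.9], refs, grid, rate=0.5, radius=1.0)
```

In a full training run, `rate` and `radius` would be reduced from one presentation to the next until only the winner is updated (`radius` equal to 0).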

2.9 ART Networks

There are different versions of the ART network. Figure 11 shows the ART-1 version for dealing with binary inputs. Later versions, such as ART-2, can also handle continuous-valued inputs.

ART-1

As illustrated in Figure 11, an ART-1 network has two layers, an input layer and an output layer. The two layers are fully interconnected; the connections are in both the forward (or bottom-up) direction and the feedback (or top-down) direction. The vector $W_i$ of weights of the bottom-up connections to an output neuron $i$ forms an exemplar of the class it represents. All the $W_i$ vectors constitute the long-term memory of the network. They are employed to select the winning neuron, the latter again being the neuron whose $W_i$ vector is most similar to the current input pattern. The vector $V_i$ of the weights of the top-down connections from an output neuron $i$ is used for vigilance testing, that is, determining whether an input pattern is sufficiently close to a stored exemplar. The vigilance vectors $V_i$ form the short-term memory of the network. $V_i$ and $W_i$ are related in that $W_i$ is a normalised copy of $V_i$, viz.

(26)  $W_i = \frac{V_i}{\varepsilon + \sum_j V_{ji}}$

where $\varepsilon$ is a small constant and $V_{ji}$ the $j$th component of $V_i$ (i.e. the weight of the connection from output neuron $i$ to input neuron $j$).

Figure 11. An ART-1 network (input layer; bottom-up weights $W$; top-down weights $V$; output layer)
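How the two weight vectors are used can be sketched as follows. The sketch follows Equation (26) and the ART-1 training steps given in the remainder of this subsection; the function name, the values of the vigilance threshold `rho` and the constant `eps`, and the tie-breaking rule (lowest-numbered neuron wins) are assumptions made for illustration. Input patterns are assumed to be non-zero lists of 0s and 1s.

```python
def art1_present(x, V, rho=0.7, eps=0.5):
    """Present one binary pattern x; return the index of the resonating neuron.

    V holds one top-down vigilance vector per output neuron and is updated in place.
    """
    def exemplar(v):
        # Equation (26): W_i = V_i / (eps + sum_j V_ji)
        total = eps + sum(v)
        return [vj / total for vj in v]

    enabled = set(range(len(V)))
    while enabled:
        # Winner: the enabled neuron for which x . W_i is largest.
        win = max(sorted(enabled),
                  key=lambda i: sum(a * b for a, b in zip(x, exemplar(V[i]))))
        # Vigilance test: fraction of the bits in x that are also in V_win.
        r = sum(a * b for a, b in zip(x, V[win])) / sum(x)
        if r >= rho:
            V[win] = [a & b for a, b in zip(V[win], x)]   # AND V_win with x
            return win
        enabled.discard(win)       # disable temporarily and search again
    return None                    # no output neuron could accept the pattern

# Three uncommitted output neurons (all vigilance weights set to 1), six inputs.
V = [[1] * 6 for _ in range(3)]
winners = [art1_present(x, V) for x in ([1, 1, 1, 0, 0, 0],
                                        [0, 0, 0, 1, 1, 1],
                                        [1, 1, 0, 0, 0, 0])]
```

The third pattern is a subset of the first, passes the vigilance test against neuron 0 and trims its vigilance vector, while the second pattern shares no bits with it and is assigned to an uncommitted neuron.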

Training an ART-1 network occurs continuously when the network is in use and involves the following steps:
(i) initialise the exemplar and vigilance vectors $W_i$ and $V_i$ for all output neurons, setting all the components of each $V_i$ to 1 and computing $W_i$ according to Equation (26). An output neuron with all its vigilance weights set to 1 is known as an uncommitted neuron in the sense that it is not assigned to represent any pattern classes;
(ii) present a new input pattern $x$;
(iii) enable all output neurons so that they can participate in the competition for activation;
(iv) find the winning output neuron among the competing neurons, i.e. the neuron for which $x \cdot W_i$ is largest; a winning neuron can be an uncommitted neuron as is the case at the beginning of training or if there are no better output neurons;
(v) test whether the input pattern $x$ is sufficiently similar to the vigilance vector $V_i$ of the winning neuron. Similarity is measured by the fraction $r$ of bits in $x$ that are also in $V_i$, viz.

(27)  $r = \frac{x \cdot V_i}{\sum_j x_j}$

$x$ is deemed to be sufficiently similar to $V_i$ if $r$ is at least equal to the vigilance threshold $\rho$ ($0 < \rho \le 1$);
(vi) go to step (vii) if $r \ge \rho$ (i.e. there is resonance); else disable the winning neuron temporarily from further competition and go to step (iv), repeating this procedure until there are no further enabled neurons;
(vii) adjust the vigilance vector $V_i$ of the most recent winning neuron by logically ANDing it with $x$, thus deleting bits in $V_i$ that are not also in $x$; compute the bottom-up exemplar vector $W_i$ using the new $V_i$ according to Equation (26); activate the winning output neuron;
(viii) go to step (ii).

The above training procedure ensures that if the same sequence of training patterns is repeatedly presented to the network, its long-term and short-term memories are unchanged (i.e. the network is stable). Also, provided there are sufficient output neurons to represent all the different classes, new patterns can always be learnt, as a new pattern can be assigned to an uncommitted output neuron if it does not match previously stored exemplars well (i.e. the network is plastic).

ART-2

The architecture of an ART-2 network [Carpenter and Grossberg, 1987; Pham and Chan, 1998; 2001] is depicted in Figure 12. In this particular configuration, the “feature representation” field F1 consists of 4 loops. An input pattern will be circulated in the lower two loops first. Inherent noise in the input pattern will be suppressed (this is controlled by the parameters $a$ and $b$ and the feedback function $f(\cdot)$) and prominent features in it will be accentuated. Then the enhanced input pattern will be passed to the upper two F1 loops and will excite the neurons in the “category representation” field F2 via the bottom-up weights. The “established class” neuron in F2 that receives the strongest stimulation will fire. This neuron will read out a “top-down expectation” in the form of a set of top-down weights sometimes referred to as class templates. This top-down expectation will be compared against the enhanced input pattern by the vigilance mechanism.

If the vigilance test is passed, the top-down and bottom-up weights will be updated and, along with the enhanced input pattern, will circulate repeatedly in the two upper F1 loops until stability is achieved. The time taken by the network to reach a stable state depends on how close the input pattern is to passing the vigilance test. If it passes the test comfortably, i.e. the input pattern is quite similar to the top-down expectation, stability will be quick to achieve. Otherwise, more iterations are required. After the top-down and bottom-up weights have been updated, the current firing neuron will become an established class neuron.

If the vigilance test fails, the current firing neuron will be disabled. Another search within the remaining established class neurons in the F2 layer will be conducted. If none of the established class neurons has a top-down expectation similar to the input pattern, an unoccupied F2 neuron will be assigned to classify the input pattern. This procedure repeats itself until either all the patterns are classified or the memory capacity of F2 has been exhausted.
The basic ART-2 training algorithm can be summarised as follows:
(i) initialising the top-down and bottom-up long-term memory traces;
(ii) presenting an input pattern from the training data set to the network;
(iii) triggering the neuron with the highest total input in the category representation field;
(iv) checking the match between the input pattern and the exemplar in the top-down filter (long-term memory) using a vigilance parameter;
(v) starting the learning process if the mismatch is within the tolerance level defined by the vigilance parameter and then going to step (viii); otherwise, moving to the next step;
(vi) disabling the current active neuron in the category representation field and returning to step (iii); going to step (vii) if all the established classes have been tried;
(vii) establishing a new class for the given input pattern;
(viii) repeating (ii) to (vii) until the network stabilises or a specified number of iterations is completed.
In the recall mode, only steps (ii), (iii), (iv) and (viii) will be utilised.

Figure 12. Architecture of an ART-2 network

Dynamics of ART-2: The dynamics of the ART-2 network illustrated in Figure 12 are controlled by a set of mathematical equations. Primed quantities belong to the lower two F1 loops and unprimed quantities to the upper two loops, following the labels of Figure 12. The equations are as follows:

(28) $w'_i = I_i + a u'_i$

(29) $x'_i = \dfrac{w'_i}{\|W'\|}$

(30) $v'_i = f(x'_i) + b f(q'_i)$

(31) $u'_i = \dfrac{v'_i}{\|V'\|}$

(32) $p'_i = u'_i$

(33) $q'_i = \dfrac{p'_i}{\|P'\|}$

(34) $w_i = q'_i$

(35) $x_i = \dfrac{w_i}{\|W\|}$

(36) $v_i = f(x_i) + b f(q_i)$

(37) $u_i = \dfrac{v_i}{\|V\|}$

(38) $p_i = u_i + \sum_j g(Y_j) z_{ji}$

(39) $q_i = \dfrac{p_i}{\|P\|}$

The symbol $\|X\|$ represents the L2 norm of the vector X. If $X = (x_1, x_2, \ldots, x_n)$, then $\|X\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$. The output of the jth neuron in the classification layer is denoted by $g(Y_j)$. The L2 norm is used in the equations for the purpose of normalising the input data. The function f(·) used in Equations (30) and (36) is a non-linear function whose purpose is to suppress the noise in the input pattern down to a prescribed level. The definition of f(·) is

(40) $f(x) = \begin{cases} 0 & \text{if } 0 \le x < \theta \\ x & \text{if } x \ge \theta \end{cases}$

where θ is a user-defined parameter with a value between 0 and 1.

Learning Mechanism of ART-2: When an input pattern is applied to the ART-2 network, it will pass through the 4 loops comprising F1 and then stimulate the classification neurons in F2. The total excitation received by the jth neuron in the classification layer is equal to $T_j$, where

(41) $T_j = \sum_i p_i z_{ij}$
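As a rough illustration of Equations (28) to (33), (40) and (41), the following Python sketch circulates one input pattern through the lower pair of F1 loops and then computes the excitation of each F2 neuron. It is only a sketch under stated assumptions: the function names, the zero initial values of the loop variables, the fixed number of sweeps and the parameter values in the usage note are illustrative choices that do not come from the text.

```python
import math


def l2_norm(vec):
    """L2 norm of a vector, as used in Equations (29), (31) and (33)."""
    return math.sqrt(sum(x * x for x in vec))


def normalise(vec):
    """Divide a vector by its L2 norm; an all-zero vector is returned unchanged."""
    n = l2_norm(vec)
    return [x / n for x in vec] if n > 0 else list(vec)


def f(x, theta):
    """Noise-suppression function of Equation (40)."""
    return x if x >= theta else 0.0


def f1_lower_loops(inputs, a, b, theta, sweeps=10):
    """Iterate Equations (28)-(33) for a fixed number of sweeps.

    Assumptions (not from the text): u' and q' start at zero and the
    number of sweeps is fixed rather than tested for convergence.
    """
    u = [0.0] * len(inputs)
    q = [0.0] * len(inputs)
    for _ in range(sweeps):
        w = [i + a * ui for i, ui in zip(inputs, u)]                    # (28)
        x = normalise(w)                                                # (29)
        v = [f(xi, theta) + b * f(qi, theta) for xi, qi in zip(x, q)]   # (30)
        u = normalise(v)                                                # (31)
        p = list(u)                                                     # (32)
        q = normalise(p)                                                # (33)
    return q


def total_excitation(p, bottom_up):
    """Equation (41): T_j = sum_i p_i * z_ij, with bottom_up[j][i] = z_ij."""
    return [sum(pi * zij for pi, zij in zip(p, row)) for row in bottom_up]
```

For example, `f1_lower_loops([0.9, 0.05, 0.4], a=10.0, b=10.0, theta=0.2)` returns a unit-length pattern whose small second component has been suppressed to zero, which is the noise-removal behaviour the text attributes to the parameters a, b and θ.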

The neuron which is stimulated by the strongest total input signal will fire by generating an output with the constant value d. Therefore, for the winning neuron, $g(Y_j)$ equals d. When a winning neuron is determined, all the other neurons will be prohibited from firing. The value d will be used to multiply the top-down expectation of the firing class before the top-down expectation pattern is read out for comparison in the vigilance test. Since all the other neurons are inhibited when the winning neuron fires, it can be inferred that when there is a firing neuron (say j), Equation (38) becomes:

(42) $p_i = u_i + d z_{ji}$

Otherwise, if there is no winning neuron, it can be simplified as:

(43) $p_i = u_i$

The top-down expectation pattern is merged with the enhanced input pattern at point $r_i$ before they enter the vigilance test (see Figure 12). $r_i$ is defined by

(44) $r_i = \dfrac{q_i + c p_i}{\|Q\| + \|cP\|}$


The vigilance test is failed and the firing neuron will be reset if the following condition is true:

(45) $\dfrac{\rho}{\|R\|} > 1$

where ρ is the vigilance parameter. On the other hand, if the vigilance test is passed (in other words, the current input pattern can be accepted as a member of the class of the firing neuron), the top-down and the bottom-up weights are updated so that the special features present in the current input pattern can be incorporated into the class exemplar represented by the firing neuron. The updating equations are as follows:

(46) $\dfrac{d}{dt} z_{ji} = d\,(p_i - z_{ji})$

(47) $\dfrac{d}{dt} z_{ij} = d\,(p_i - z_{ij})$
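The vigilance test of Equations (44) and (45) and the weight updates of Equations (46) and (47) can be sketched in Python as below. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the differential equations are integrated with a single explicit Euler step of size `dt`, which is an assumption made only for illustration.

```python
import math


def l2_norm(vec):
    """L2 norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def match_vector(q, p, c):
    """Equation (44): r_i = (q_i + c*p_i) / (||Q|| + ||cP||), for c >= 0."""
    denom = l2_norm(q) + c * l2_norm(p)
    return [(qi + c * pi) / denom for qi, pi in zip(q, p)]


def vigilance_failed(r, rho):
    """Equation (45): the firing neuron is reset when rho / ||R|| > 1."""
    norm_r = l2_norm(r)
    if norm_r == 0.0:
        return True  # assumption: an all-zero match vector counts as a failure
    return rho / norm_r > 1.0


def update_weights(z, p, d, dt):
    """One Euler step of dz/dt = d * (p - z), the form shared by (46) and (47).

    The Euler discretisation and the step size dt are illustrative
    assumptions; the text only gives the differential equations.
    """
    return [zi + dt * d * (pi - zi) for zi, pi in zip(z, p)]
```

When q and p point in the same direction the norm of r is 1 and the test is passed for any ρ ≤ 1; when they are orthogonal the norm drops below 1, so a sufficiently high ρ triggers a reset.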

The bottom-up weights are denoted by $z_{ij}$ and the top-down weights by $z_{ji}$. According to the recommendations in [Carpenter and Grossberg, 1987], all the top-down weights should be initialised with the value 0 at the beginning of the learning process. This can be expressed by the following equation:

(48) $z_{ji}(0) = 0$

This measure is designed to prevent a neuron from being reset when it is allocated to classify an input pattern for the first time. The bottom-up weights are initialised using the equation:

(49) $z_{ij}(0) = \dfrac{1}{(1 - d)\sqrt{M}}$

where M is the number of neurons in the input layer. This number is equal to the dimension of the input vectors. This arrangement ensures that after all the neurons with top-down expectations similar to the input pattern have been searched, it is easy for the input pattern to access a previously uncommitted neuron.

2.10 Spiking Neural Network

Experiments with biological neural systems have shown that they use the timing of electrical pulses or “spikes” to encode and transmit information. Spiking neural networks, also known as pulsed neural networks, are attempts at modelling the operation of biological neural systems more closely than is the case with other artificial neural networks. An example of a spiking neural network is shown in Figure 13. Each connection between neurons i and j can consist of multiple sub-connections, each associated with a weight value and a delay [Natschläger and Ruf, 1998].

Figure 13. Spiking neural network topology showing a single connection composed of multiple weights $w_{ij}^k$ with corresponding delays $d_{ij}^k$

Figure 14. Different shapes of response functions $\varepsilon_{ij}(t - s)$: (a) excitatory postsynaptic potential (EPSP) function; (b) inhibitory postsynaptic potential (IPSP) function

In the leaky integrate-and-fire model proposed by Maass [Maass, 1997], a neuron is regarded as a homogeneous unit that generates spikes when the total excitation exceeds a threshold value. Consider a network that consists of a finite set V of such spiking neurons, a set E ⊆ V × V of synapses, a set of weights $w_{u,v} \ge 0$, a response function $\varepsilon_{u,v}: R^+ \to R$ for each synapse (u, v) ∈ E, where $R^+ = \{x \in R : x \ge 0\}$, and a threshold function $\Theta_v: R^+ \to R$ for each neuron v ∈ V. If $F_u \subseteq R^+$ is the set of firing times of a neuron u, then the potential at the trigger zone of each neuron v at time t is given by:

(50) $P_v(t) = \sum_{u:\,(u,v) \in E} \;\; \sum_{s \in F_u,\; s < t} w_{u,v}\, \varepsilon_{u,v}(t - s)$

Figure 6. Some types of neurons, such as thalamo-cortical neurons, present a dual firing behaviour: in their tonic firing mode the frequency of their response is proportional to the stimulus (10–165 Hz). However, when they are stimulated and afterwards inhibited for at least 100 msec, their response changes to burst firing with much higher frequency rates (150–320 Hz)
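Returning to the spiking neuron model, the potential of Equation (50) can be sketched in Python as a double sum over incoming synapses and past presynaptic firing times. The sketch makes several assumptions that are not in the text: the dictionary-based data layout, the function names, and in particular the alpha-shaped response function `epsp` with time constant `tau`, which is only one common choice of EPSP shape.

```python
import math


def epsp(t, tau=5.0):
    """Hypothetical alpha-shaped response function (an assumption).

    It is zero for t <= 0, rises to a peak of 1.0 at t = tau and then decays.
    """
    if t <= 0.0:
        return 0.0
    return (t / tau) * math.exp(1.0 - t / tau)


def potential(v, t, weights, firing_times, response=epsp):
    """Potential of neuron v at time t, in the spirit of Equation (50).

    weights:      dict mapping a synapse (u, v) to its weight w_uv >= 0
    firing_times: dict mapping a neuron u to the list of its firing times F_u
    Only spikes emitted strictly before t contribute.
    """
    total = 0.0
    for (u, target), w in weights.items():
        if target != v:
            continue
        for s in firing_times.get(u, []):
            if s < t:
                total += w * response(t - s)
    return total
```

A neuron v would then be taken to fire when `potential(v, t, ...)` reaches its threshold; the threshold function itself is left out of the sketch.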

thalamus, at the core of the brain, are able to fire either in tonic or in burst mode as shown in Figure 6. The main characteristic of the tonic mode is that the spiking frequency is proportional to the stimulus, in the range of 10 to 165 Hz. In the burst mode, however, the frequency is not related to the input activation and lies in the range of 150 to 320 Hz. The burst mode is very interesting because it takes place only after a precise sequence of preliminary events. For the burst mode to happen, the thalamo-cortical neuron needs to be positively stimulated and afterwards inhibited for at least 100 msec. After these two events, burst firing is produced when a slight positive stimulation is given to the neuron. For a deeper study of these mechanisms see [Llinas and Jahnsen, 1982], [Llinas, 1994], [Steriade and Llinas, 1988]. The purpose of this dual behaviour is still a matter of controversy. Ropero [Ropero, 1997] proposed that the tonic mode serves for intrathalamic operations. When these intrathalamic operations are concluded, the result is relayed to the cortex via the burst firing mode.

1.3 Network Properties

1.3.1 Synchronization among neurons

Some types of neurons, for example the granule cells of the olfactory bulb and the reticular cells in the thalamus, are able to synchronize their activity and, afterwards, oscillate together [McCormick and Pape, 1990], [Steriade et al., 1987]. One of the causes of this behaviour is that these neurons possess dendro-dendritic [Deschenes et al., 1985] electric contacts in which the potential is communicated directly from one neuron to the other without any kind of neurotransmitter in between. The situation is as if we had a set of ping-pong balls tied together by fine cords and we used two very big bats to play with them. The movement of the balls becomes more and more uniform and synchronized during play. The kinetic energy given by each of the bats to the balls corresponds to the electric energy of ions entering the neurons. One type of ion increases the inner potential of the neurons when it is below a certain threshold, and another type of ion reduces the potential when it is above an upper voltage threshold.


This interplay between ions and the potential sharing of dendro-dendritically connected neurons generates the synchronized oscillations. This behaviour was modelled and programmed in Matlab [Ropero, 2003] with the results shown in Figure 7.

1.3.2 Normalizing inhibition

Inhibitory neurons were supposed to perform only subtraction [Carandini and Heeger, 1994] over other neurons, and this property was used for biasing the neurons in conventional neural network models like backpropagation or radial basis networks. The operation of biasing the neurons was equivalent to shifting the activation function of these neurons to the right or to the left, in a similar way to the one explained in section 2.2. In real neurons this kind of subtractive or biasing inhibition is performed by means of the GABA-B (gamma-aminobutyric acid) neurotransmitter. In many cases, however, inhibition is performed by means of the GABA-A neurotransmitter instead of GABA-B, and the effect of GABA-A inhibition is divisive rather than subtractive. We postulate that this GABA-A inhibition could perform a scaling or normalizing effect on the input patterns arriving at a certain layer of the brain. Many structures in the brain have a layered organization. The input to each layer goes to two types of neurons: (A) neurons that perform an excitatory projection onto the following layer; (B) GABA-A neurons that produce inhibition inside their own layer, thereby creating an inhibitory divisive field in the layer (see Figure 8).

Figure 7. The height of each intersection of lines over the surface represents the activation of one neuron in a 7 × 7 net of neurons. If each of the neurons in this net has an oscillatory activity and the potential of each of them is partially shared with the other neurons, a synchronization of the activities takes place. From top to bottom and from left to right, a computer simulation of the synchronization of a 7 × 7 net of neurons is shown

Figure 8. Normalization of synaptic inputs due to an inhibitory field of GABA-A inhibitory interneurons. The six neurons of the lower layer of the figure form an excitatory input I = (I1, I2, I3, I4, I5, I6) impinging on a second layer of neurons (middle). This pattern produces an excitation (+) over the six neurons in the middle layer and over GABA-A inhibitory interneurons that are not shown. Once these inhibitory interneurons are activated, they create an inhibitory field that divides the activation of the middle-layer neurons by n(I) = n(I1) + n(I2) + n(I3) + n(I4) + n(I5) + n(I6). In this way the neuron at the top (O) receives a normalized input i = (i1, i2, i3, i4, i5, i6) that is the result of dividing each of the components of pattern I by the constant n(I)
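The divisive normalization described for Figure 8 can be sketched in a few lines of Python. The function name and the handling of an all-zero pattern are assumptions made for illustration; the only operation taken from the text is the division of every component of the input pattern by the sum of its components.

```python
def divisive_normalisation(pattern):
    """Divide every component of the input pattern by the sum of its components.

    This stands in for the inhibitory field of GABA-A interneurons, whose
    strength is taken to be proportional to the total input activity.
    An all-zero pattern is returned as zeros (an assumption).
    """
    total = sum(pattern)
    if total == 0:
        return [0.0] * len(pattern)
    return [x / total for x in pattern]
```

With this operation, a pattern and a scaled copy of it (for instance the same stimulus at twice the intensity) produce the same normalized input, and the normalized components always sum to 1, which is what keeps the downstream neuron from saturating.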

The activation of excitatory and inhibitory neurons in each layer has almost the same absolute value, because the input pattern impinges on excitatory and inhibitory neurons at the same time. Therefore the inhibitory divisive field is proportional to this activation. This divisive inhibition is able to produce a sort of normalization over input patterns (see Figure 8 for more details).

2. UPDATING THE McCULLOCH-PITTS MODEL

Up to this point we have introduced several properties of real neurons that are of remarkable interest for computational purposes. Using some of them, we have tried to update some of the characteristics of the McCulloch-Pitts paradigm of neural computation.

2.1 Up-to-date Synaptic Model

The classical model of synaptic weight alteration due to Hebb lacked many of the properties that were mentioned in previous sections. Here we propose another


model that not only mimics the way biological reinforcement and depression are produced but also exhibits the property of metaplasticity [Ropero and Simões, 1999]. In our model the synaptic weight between the presynaptic neuron A and the postsynaptic neuron B is calculated as:

(2) $w_{AB} = P(B/A)$

where B is a postsynaptic activation above a specific threshold and A a presynaptic action potential. As shown, the synaptic weight is calculated as a conditional probability. The above expression can also be written as:

(3) $w_{AB} = P(B/A) = \dfrac{n(A \cap B)}{n(A)}$

in which the operator “n( )”, number of times, quantifies how many times a certain event takes place, for example how many times event A, event B or the intersection of A and B occurs. Starting with different values of the numerator and denominator, i.e. different initial weights, and allowing the postsynaptic neuron to fire according to a non-linear squashing function (logistic), a 3-D version of Figure 2 is obtained in Figure 9. In this figure a continuous line drawn on the surface shows the evolution of the LTP threshold as a function of the initial weight. It can be noticed that a very simple statistical expression is able to account for a wide variety of properties like reinforcement, depression and metaplasticity. Therefore, taking into account that conditional probabilities can be computed in synapses, the most obvious question that arises here is: are real synapses the tiniest pieces of the most fascinating statistical computer ever imagined?

Figure 9. The computer simulation shows that metaplasticity takes place when the synaptic weight is calculated using the conditional probability P(B/A), where B is a suprathreshold activation of the postsynaptic neuron and A is the presynaptic action potential. A line joins the different LTP thresholds, each one corresponding to a different initial synaptic weight. (Axes: initial weight, normalized postsynaptic activity (voltage), change in synaptic strength.)

2.2 Up-to-date Neuron Model

We propose a neuron model (see Figure 10) using the equation just presented for modelling the synaptic weights [Ropero, 1996]. In the soma the excitatory postsynaptic potentials (EPSPs) are summed. An EPSP is obtained by multiplying the probability P(Ii) of an action potential in the presynaptic neuron by the corresponding weight P(O/Ii). Although in the presynaptic space there are no probabilities but action potentials at different frequencies, the product P(Ii)P(O/Ii) can approximate each of the EPSPs. Each EPSP is formed by the sum of the voltage humps produced by successive presynaptic action potentials, in a process known as temporal summation. When these humps are close together, they ride over previous humps, creating a taller EPSP. When they are far apart, as for example when the frequency of presynaptic action potentials is low, they can hardly ride over each other and the resulting EPSP is low. Given that the maximal frequency of presynaptic action potentials is limited, the height of the resulting EPSP is also limited. This maximal height corresponds to a value of 1 for P(Ii)P(O/Ii). All the EPSPs travel from the dendrites to the soma, where they are summed. This sum is the so-called activation of the neuron, which is afterwards transformed into a frequency of action potentials by means of a logistic or sigmoidal function. To prevent the saturation of the weights, a normalization of the input pattern by means of divisive inhibition is commonplace in the brain.
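The synaptic rule of Equation (3) and the neuron model just described can be sketched together in Python. This is only an illustrative sketch: the class and function names, the default initial counts, and the `gain` and `offset` parameters of the logistic function are assumptions that do not appear in the text.

```python
import math


class ProbabilisticSynapse:
    """Synapse whose weight is the conditional probability P(B/A) = n(A and B) / n(A)."""

    def __init__(self, n_a=0, n_ab=0):
        # Initial counts set the initial weight (hypothetical defaults).
        self.n_a = n_a    # times the presynaptic neuron fired (event A)
        self.n_ab = n_ab  # times A coincided with suprathreshold postsynaptic activity (B)

    def observe(self, pre_fired, post_active):
        """Update the counts with one observation; only presynaptic spikes count."""
        if pre_fired:
            self.n_a += 1
            if post_active:
                self.n_ab += 1

    @property
    def weight(self):
        """Current weight P(B/A); defined as 0.0 before any presynaptic spike."""
        return self.n_ab / self.n_a if self.n_a > 0 else 0.0


def neuron_activation(input_probs, weights):
    """Sum of EPSPs, each approximated by P(I_i) * P(O/I_i)."""
    return sum(p * w for p, w in zip(input_probs, weights))


def firing_rate(activation, gain=4.0, offset=0.5):
    """Logistic squashing of the activation (gain and offset are assumptions)."""
    return 1.0 / (1.0 + math.exp(-gain * (activation - offset)))
```

Because the denominator n(A) grows with every presynaptic spike while the numerator grows only when the postsynaptic neuron is also active, the same rule produces reinforcement when the two events coincide and depression when they do not.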

Figure 10. Model of a neuron based on conditional probabilities for calculating the synaptic weights. In each synapse the probability of a presynaptic action potential is multiplied by the synaptic weight, and the result gives the postsynaptic activation in that synapse. The sum of the postsynaptic activations gives the activation of the neuron, which is calculated as P(O/I) = P(O/I1)P(I1) + P(O/I2)P(I2) + P(O/I3)P(I3)

2.3 Up-to-date Network Model

If the same pattern is input to several neurons, instead of only one, a competitive process can take place so that only one neuron, the one whose activation is maximal, becomes the winner of the competition. When the winner fires, the remaining neurons are kept silent. Silencing the non-winning neurons is usually done by inhibitory feedback or lateral inhibition. To prevent a single neuron from becoming the winner for every pattern, the probabilistic synapses should be normalized over time (see Figure 11). This is one of the possible roles of biological synaptic normalization: giving every neuron the same opportunity to fire. But what biological mechanisms are involved in the selection of this winning neuron? In section 1.3.1 it was explained that the synchronized oscillation of neurons is a mechanism found at least in the thalamus and the olfactory bulb. This synchronized oscillation can allow the neuron with maximal activation to be found: if a common oscillatory potential is added to the activations of a layer of neurons, the neuron whose total activation first reaches a certain firing threshold is also the one with the largest activation [Ropero, 2003].
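The competition just described can be illustrated with the values shown in Figure 11. The Python sketch below computes the activation of three neurons for the input T = (0.2, 0.4, 0.5) and selects the winner in two ways: directly, and by adding a common rising potential and finding which neuron reaches the threshold first. Several things here are assumptions: the function names, the weights of neuron c (read from the figure so that its activation matches the stated 0.42), and the linear ramp that stands in for the rising phase of the common oscillation.

```python
def activation(weights, inputs):
    """Activation y_i = sum_j w_ij * t_j of one neuron."""
    return sum(w * t for w, t in zip(weights, inputs))


def winner(weight_rows, inputs):
    """Index of the neuron with maximal activation."""
    ys = [activation(row, inputs) for row in weight_rows]
    return max(range(len(ys)), key=lambda j: ys[j])


def first_to_threshold(activations, threshold=1.0, slope=1.0):
    """Index of the neuron that reaches the threshold first.

    A common potential slope * time is added to every activation; a linear
    ramp is an illustrative stand-in for the shared oscillation.
    """
    times = [(threshold - y) / slope for y in activations]
    return min(range(len(times)), key=lambda j: times[j])


# Values from Figure 11; every weight vector sums to 1.2 (normalized synapses).
T = [0.2, 0.4, 0.5]
W = [
    [0.6, 0.4, 0.2],  # neuron a
    [0.2, 0.4, 0.6],  # neuron b
    [0.4, 0.6, 0.2],  # neuron c (weights inferred from the figure)
]
ys = [activation(row, T) for row in W]
```

Both selection rules pick neuron b, whose weight distribution is closest to the input pattern. The equal weight sums are what make the comparison fair: without normalization, a neuron with uniformly larger weights would win for every pattern.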

Figure 11. Synaptic normalization allows a competitive process among neurons. The neuron whose synaptic weight distribution wij is most similar to the input pattern of frequencies T = (t1, t2, t3) is also the one with maximal activation. This is the case of neuron b, whose weights [0.2, 0.4, 0.6] are most similar to the vector T = (0.2, 0.4, 0.5). Therefore the sum of the products of the input frequencies and its weights yields the maximal value (y1 = 0.38 for neuron a, y2 = 0.50 for neuron b and y3 = 0.42 for neuron c, with yi = Σj wij tj). Notice that, due to synaptic normalization, the number of ionic channels is the same in the three neurons. In summary, synaptic normalization is the property that ensures that the neuron whose weight distribution is most similar to the input pattern also exhibits the maximal activation

3. JOINING THE BLOCKS: A NEURAL NETWORK MODEL OF THE THALAMUS

Probabilistic synapses, synchronized oscillations, weight normalization and normalizing inhibition are all properties that were used to implement a realistic computational model of the thalamus. The thalamus is a structure at the core of the brain that relays sensory information from the senses to the cortex. The function of this structure was unknown. The model we propose helps to understand the role of the thalamus in the computation performed by the brain [Ropero, 1996], [Ropero, 1997], [Pelaez, 2000]. The thalamus is basically a two-layered brain structure. The first layer is formed by thalamo-cortical neurons that receive sensory patterns and, after approximately 100 msec, send the result of the inner thalamic computation to the cortex. The second layer, formed by reticular neurons that oscillate synchronously, performs a competitive process by which each of the neurons fires in the presence of specific characteristics of the input patterns. When several of these neurons fire, they produce several inhibitory masks that, when superposed, create over the first layer a negative replica of the input pattern (shown in Figure 12). If the input patterns are damaged or noisy, the negative replica recreates a perfect version of the input without defects or noise. Pattern reconstruction and noise rejection are two of the tasks that we postulate the thalamus is able to perform. For these tasks, a process of learning must take place at the level of the thalamus. Our computer model of the thalamus, programmed in Matlab, has these two layers, each of 9 × 9 = 81 neurons. The two layers are completely interconnected with each other, giving 2 × 81 × 81 = 13122 connections. The model learned 36 characters over several epochs and is able to recognize and complete damaged or noisy patterns (see Figure 12).
The learning capability of the model shows that the real thalamus also has learning capabilities, a fact that has been completely ignored until now in thalamus research.

4. CONCLUSIONS

In this review we have presented several properties of synapses, neurons and networks that were not considered in previous neural network models but that have interesting computational potential. The McCulloch-Pitts neuron model was based on the restricted knowledge about neurons that existed in the forties. Nowadays a more comprehensive knowledge about the amazing properties of neurons can be used to update the McCulloch-Pitts model. In the case of synaptic plasticity we presented several properties of synaptic weights like directionality, the existence of both potentiation and depression thresholds, metaplasticity and normalization. Regarding neurons, relevant properties were introduced to the reader, like spike threshold adaptation and the dual frequency behaviour of some types of neurons. Finally, concerning networks of neurons, we studied the synchronization of a set of neurons and the normalizing inhibition produced by a set of GABA-A neurons over the input pattern of another neuron.


Figure 12. A biologically realistic computer model of the thalamus constituted by two layers of 9 × 9 = 81 neurons each. An example of the pattern reconstruction capability of the thalamus model is shown (a) After being trained with 36 different characters (letters and numbers) a very noisy and damaged testing pattern is input which vaguely resembles a B. (b) An “I” shaped sustained feedback inhibition over the first layer is produced by a reticular neuron in the second layer. After firing, the reticular neuron rests in refractoriness. This inhibition reduces the subsequent activation in the first layer. (c) Another neuron fires and immediately enters in the refractory period producing another sustained inhibition that is superposed over the previous one. Both inhibitions are shaped like an E. (d) Finally, another reticular neuron fires and the total inhibition completely reconstructs letter B showing the reconstruction capability of the thalamic model. The central figure of each screen gives the value of the activations of a net of reticular neurons

With all these elements in mind we proposed a new equation for synaptic reinforcement based on conditional probabilities. The paradigm of a neuron was also modified, taking into account that the neuron is always integrated in a network. For example, if the neuron were detached from the inhibitory field that normalizes its inputs, its active synaptic weights would increase without bound and the neuron would be saturated most of the time. It was also shown that the normalization of synaptic weights is an important condition for allowing a competitive process between neurons. An example of such competition, and of all the mentioned properties working together, is the model of the thalamus that we programmed in Matlab. It learned 36 characters and exhibits the property of completing damaged or noisy patterns.


We expect that the reader will benefit from this paper’s account of recently found neural properties when creating new artificial neural networks or trying to emulate the functioning of the brain.

REFERENCES

Abraham, W.C., and Bear, M.F. (1996) Metaplasticity: the plasticity of synaptic plasticity. Trends in Neuroscience 19:126–130.
Abraham, W.C., and Tate, W.P. (1997) Metaplasticity: a new vista across the field of synaptic plasticity. Progress in Neurobiology 52:303–323.
Artola, A., Brocher, S., and Singer, W. (1990) Different voltage-dependent threshold for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature 347:69–72.
Bear, M.F., Connors, B.W., and Paradiso, M.A. (2001) Neuroscience: Exploring the Brain. Lippincott, Williams & Wilkins, USA.
Bienenstock, E.L., Cooper, L.N., and Munro, P.W. (1982) Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. The Journal of Neuroscience 2(1):32–48.
Carandini, M., and Heeger, D.J. (1994) Summation and division by neurons in primate visual cortex. Science 264(5163):1333–6.
Carpenter, G., and Grossberg, S. (1988) The ART of adaptive pattern recognition by a self-organizing neural network. Computer 21(3):77–88.
Deschenes, M., Madariaga-Domich, A., and Steriade, M. (1985) Dendrodendritic synapses in the cat reticularis thalami nucleus: a structural basis for thalamic spindle synchronization. Brain Research 334:165–168.
Desai, N.S., Rutherford, L.C., and Turrigiano, G.G. (1999) Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nature Neuroscience 2:515–520.
Hopfield, J.J. (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79:2554–2558.
Kohonen, T. (1982) Self-organized formation of topologically correct feature maps. Biological Cybernetics 43:59–69.
Llinás, R., and Jahnsen, H. (1982) Electrophysiology of mammalian thalamic neurones in vitro. Nature 297:406–408.
Llinás, R., Ribary, U., Joliot, M., and Wang, X.J. (1994) Content and context in temporal thalamocortical binding. In G. Buzsaki et al. (Eds.), Temporal Coding in the Brain (pp. 151–72). Berlin: Springer-Verlag.
McClelland, J.L., Rumelhart, D.E., and The PDP Research Group (1986) Parallel distributed processing: Exploration in the microstructure of cognition. Cambridge, MA: MIT Press.
McClelland, J.L., and Rumelhart, D.E. (1988) Explorations in parallel distributed processing. Cambridge, MA: MIT Press.
McCormick, D.A., and Pape, H.-C. (1990) Properties of a hyperpolarization activated cation current and its role in rhythmic oscillation in thalamic relay neurons. Journal of Physiology (London) 431:291–318.
McCulloch, W., and Pitts, W. (1943) A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 1943.
Ropero Peláez, J. (1996) A formal representation of thalamus and cortex computation. Proceedings of the International Conference of Brain Processes, Theories and Models. Edited by Roberto Moreno-Díaz and José Mira-Mira. MIT Press.
Ropero Peláez, J. (1997) Plato’s theory of ideas revisited. Neural Networks, 1997 Special Issue 10(7):1269–1288.
Ropero Peláez, J., and Godoy Simões, M. (1999) A computational model of synaptic metaplasticity. Proceedings of the International Joint Conference of Neural Networks 1999. Washington DC.
Ropero Peláez, J. (2000) Towards a neural network based therapy for hallucinatory disorders. Neural Networks, 2000 Special Issue 13(2000):1047–1061.


Ropero Peláez, J. (2003) PhD Thesis in Neuroscience: Aprendizaje en un modelo computacional del tálamo. Faculty of Medicine, Autónoma University of Madrid.
Rosenblatt, F. (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65:386–408.
Steriade, M., Domich, L., Oakson, G., and Deschenes, M. (1987) The deafferented reticular thalamic nucleus generates spindle rhythmicity. The Journal of Neurophysiology 57:260–273.
Steriade, M., and Llinas, R.R. (1988) The functional state of the thalamus and the associated neuronal interplay. Physiological Review 68(3):649–739.
Tompa, P., and Friedrich, P. (1998) Synaptic metaplasticity and the local charge effect in postsynaptic densities. Trends in Neuroscience 21(3):97–101.
Turrigiano, G.G., Leslie, K.R., Desai, N.S., Rutherford, L.C., and Nelson, S.B. (1998) Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature 391:892–896.

CHAPTER 7 SUPPORT VECTOR MACHINES

JAIME GÓMEZ SÁENZ DE TEJADA (1), JUAN SEIJAS MARTÍNEZ-ECHEVARRÍA (2)

(1) Escuela Politécnica Superior, Universidad Autónoma de Madrid
(2) Escuela Técnica Superior de Ingenieros de Telecomunicaciones, Universidad Politécnica de Madrid

Abstract:

Support Vector Machines are the most recent algorithm in the Machine Learning community. After a bit less than a decade of life, they have displayed many advantages with respect to the best older methods: generalization capacity, ease of use and solution uniqueness. They have also shown some disadvantages: a limit on the amount of data that can be handled and low speed in the training phase. However, these disadvantages will be overcome in the near future as computer power increases, leaving an all-purpose learning method that is both cheap to use and gives the best performance. This chapter provides an overview of the main SVM configuration, its mathematical applications and the easiest implementation

Keywords:

Support Vector Machines, Machine Learning

D. Andina and D.T. Pham (eds.), Computational Intelligence, 147–191. © 2007 Springer.

INTRODUCTION

Machine Learning has become one of the main fields in artificial intelligence today. Whether in pattern recognition or in function estimation, statistical Machine Learning tries to find a numerical hypothesis which adapts correctly to the given data, that is, to build machines able to generalize the statistical distribution of a representative data set. Once we have generated the hypothesis, all future unknown patterns following the same distribution will be correctly classified. From the principles of statistical mechanics, a handful of algorithms have been devised to solve the classification problem, such as decision trees, k-nearest neighbour, neural networks, Bayesian classifiers, radial basis function classifiers and, as a newcomer, support vector machines (from now on SVM).

The basic SVM is a supervised classification algorithm introduced by Vladimir Vapnik, motivated by VC (Vapnik-Chervonenkis) theory [Vapnik, 1995], from which the Structural Risk Minimization concept was derived. In the late 70’s, Vapnik studied the numeric solution of convex quadratic problems applied to Machine Learning, and defined an immediate ancestor of SVM called ‘Generalization portrait’. In the early 90’s, Vapnik joined Bell Laboratories, where his ideas evolved until the creation of the term ‘support vector machines’ in 1995. Nevertheless, the basic mathematics behind SVM were developed much earlier. The concept of generating a hyperplane in a space other than the input space in order to separate data in input space, the heart of SVM, was settled in 1964. The study of convex quadratic problems gave the Karush-Kuhn-Tucker optimality conditions in 1936, while the definition of valid kernel functions for the transformation described above was formulated by Mercer in 1909.

This chapter provides an introductory view of SVM, so that any computer scientist or engineer can develop his or her own SVM implementation and apply it to any real-world machine-learning problem. For that purpose, we will sacrifice some mathematical completeness for the sake of clarity. The chapter has four sections: first, the SVM will be defined and analysed; second, the main mathematical uses of the SVM principle will be developed; third, a comparison between SVM and neural networks will be studied; last, the best current implementation approach will be shown. Support Vector Machines are easy to understand, not too difficult to implement, and child’s play to use. If you need a generic Machine Learning method, forget about neural networks or any other method you previously learnt: the SVM family globally outperforms them all.

1. SVM DEFINITION

1.1 Structural Risk

Classifiers with a large number of adjustable parameters (and so, great capacity) will most probably overfit: they learn the training data set without errors, but generalize poorly. On the contrary, a classifier with insufficient capacity will not be able to generate a hypothesis complex enough to model the data. A mid-point must be found where the adjustable parameters are neither too many nor too few, both for the training and the test set. For that reason, it is essential to choose the kind of functions a learning machine can implement. For a given problem, the machine must have a low classification error and also a small capacity. Capacity is defined as the ability of a given machine to learn any training set without errors. For example, the 1-nearest neighbour classifier has infinite capacity, but is a poor classifier for unseen test data with complex distributions and noisy sets. A machine with great capacity will tend to overfit the data, making it no longer useful because it does not learn. For extended information about these issues, see [Burges, 1998].

There are a handful of mathematical bounds that define the relation between the learning ability of a machine and its performance. The underlying theory tries to find under which circumstances, and how fast, the performance measure converges as the number of training input data increases.

SUPPORT VECTOR MACHINES

In the limit, with an infinite number of points, we could have a correct performance value, better than just an

estimation. With respect to the SVM, we will use one bound in particular, which leads to the Structural Risk Minimization (SRM) principle [Vapnik, 1995].

Suppose we have $l$ observations as input data in the training phase. Each observation is a pair of values $(x_i, y_i)$, where $x_i \in \mathbb{R}^n$, $i = 1, \dots, l$, is a vector and $y_i \in \{+1, -1\}$ is its fixed associated label, given by a consistent data source. We assume there is an unknown probability distribution $P(x, y)$ from which the data points have been drawn. Data are always assumed to be independently drawn and identically distributed.

Suppose we have a machine whose task is to learn the mapping $x_i \to y_i$. This generic machine is really defined by a set of possible mappings $x \to f(x, \alpha)$, where the functions $f(x, \alpha)$ are generic, defined by the set of adjustable parameters $\alpha$. This machine is by definition deterministic, that is, for a given input vector $x_i$ and a parameter set $\alpha$, we will always obtain the same output $f(x_i, \alpha)$. Choosing the parameter set $\alpha$ gives a trained machine. For example, a neural network with a fixed architecture and fixed weights (parameter set $\alpha$) would be a trained machine as defined in these paragraphs. Thus, the expected error in the test phase for a trained machine is:

(1)    $R(\alpha) = \int \frac{1}{2} \, |y - f(x, \alpha)| \, dP(x, y)$

The value $R(\alpha)$ is called the expected risk. Nevertheless, this expression is difficult to use because the probability distribution $P(x, y)$ is unknown. Thus, a variation of the formula is developed that uses the finite number of available observations. It is called the empirical risk, and is defined as the measured mean error rate on the training set:

(2)    $R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)|$

$R_{emp}(\alpha)$ is a fixed number for a given parameter set $\alpha$ and training set. It has been shown in [Vapnik, 1995] that the following condition holds:

(3)    $R(\alpha) \le R_{emp}(\alpha) + g(h)$

where $g(h)$ is a real number directly related to the VC dimension $h$. Again, a learning machine is defined as a set of parameterised functions (called a family of functions) having a similar structure. The Vapnik-Chervonenkis (VC) dimension is a non-negative integer that measures the capacity previously defined. The VC dimension of a given learning machine is defined as the maximum number of points that can be correctly classified, for every possible assignment of labels, using functions belonging to the family. In other words: if the VC dimension is $h$, then there exists a set of $h$ points that can be classified with functions of the family regardless of the point labels. Note that, first, there cannot exist a set of $h + 1$ points satisfying the constraint; second, only one set of $h$ points is needed for the definition to apply (it does not say “for all sets of $h$ points”).
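The empirical risk of equation (2) is simple to compute once a trained machine is available. The following is a minimal sketch, assuming a hypothetical trained machine `threshold_machine` and a made-up training set; neither comes from the chapter.

```python
def empirical_risk(f, samples):
    """Equation (2): sum of |y_i - f(x_i)| divided by 2l.

    With labels and outputs in {+1, -1}, each term |y_i - f(x_i)| / 2 is
    0 for a correct answer and 1 for an error, so the result is the
    error rate measured on the training set.
    """
    return sum(abs(y - f(x)) for x, y in samples) / (2 * len(samples))


def threshold_machine(x):
    # Hypothetical trained machine: the sign of the first coordinate.
    return 1 if x[0] > 0 else -1


# Illustrative (x_i, y_i) pairs. The last pair is labelled against the rule
# on purpose, so the machine makes exactly one error out of four.
training_set = [
    ((1.0, 0.0), 1),
    ((2.0, 1.0), 1),
    ((-1.0, 3.0), -1),
    ((-2.0, 0.5), 1),
]

risk = empirical_risk(threshold_machine, training_set)
print(risk)
```

Because the function only needs the outputs of the machine, the same code applies to any deterministic classifier with outputs in {+1, −1}.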


Figure 1. Three wisely chosen points

Let’s try an example. Suppose we are in $\mathbb{R}^2$ and the learning machine L1 is defined as the set of “one straight line” classifiers. In figure 1 we choose three points. We see (you can try it) that, for every combination of labels (8 possible combinations using 3 points with two labels), the points can be separated using one straight line. Each combination may need a different straight line, but every such line is still a member of the family. Therefore the VC dimension of the analysed learning machine is at least 3. If we try 4 points (any set of 4 points), we will not be able to satisfy all the constraints, so we can state that “one straight line” classifiers in $\mathbb{R}^2$ have VC dimension equal to 3.

Another example. Suppose we are in $\mathbb{R}^2$ and the learning machine L2 is defined as the set of “two-segment line” classifiers (continuous but non-differentiable at the joint point). In figure 2 we choose five points. Again, try all possible label combinations (now 32). Using a two-segment line you can separate all 32 cases, but it would not be possible to separate 6 points, however well chosen (any 6 points). Therefore the VC dimension of this learning machine is 5.

Figure 2. Five wisely chosen points
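The first example can be checked by brute force. The sketch below tries every labelling of a point set and tests whether a straight line separates it. As an assumption of this sketch (not something taken from the chapter), the separability test is a plain perceptron with a cap on the number of passes: for a separable labelling it stops with zero errors, and a labelling that still fails after the cap is treated as non-separable. The point coordinates are illustrative.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Heuristic test: run a perceptron and report whether it reaches zero errors."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                # Misclassified (or on the line): standard perceptron update.
                w[0] += y * x1
                w[1] += y * x2
                b += y
                errors += 1
        if errors == 0:
            return True
    return False


def shattered(points):
    """True if every one of the 2^n labellings can be separated by one line."""
    return all(
        linearly_separable(points, labels)
        for labels in product((1, -1), repeat=len(points))
    )


three_points = [(0, 0), (1, 0), (0, 1)]
four_points = [(0, 0), (1, 1), (1, 0), (0, 1)]  # corners of a square

print(shattered(three_points))  # all 8 labellings are separable
print(shattered(four_points))   # the two "diagonal" labellings are not
```

The second call only examines one particular arrangement of four points; the statement in the text that no set of four points can be shattered is not proved by this check.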


Figure 3. Training set and two valid classifiers, “straight-line”(dashed line) and “two-segment-line” (solid line)

When a problem can be solved using different learning machines, as in figure 3, which one is better? The SRM principle tries to find the learning machine with the lowest VC dimension that correctly classifies all data points. The consequences are analysed in section 2.12. As regards the SVM definition, the SRM principle and the VC dimension concept require that the chosen classifier be the one with the largest margin, defined in the next section (linear SVM use the family of linear hyperplanes in input space).

1.2 Linear SVM for Separable Data

The simplest case for a SVM is that of linear machines trained with a separable data set (see Figure 4a).

Figure 4a. Linear separable training set


Suppose we have a training data set made of pairs $(x_i, y_i)$, $i = 1, \dots, l$, such that $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$. Suppose there exists a hyperplane in $\mathbb{R}^d$ which separates positive from negative examples (according to their $y_i$ value). Points that lie exactly on the hyperplane satisfy the condition:

(4)    $w \cdot x + b = 0$

where $w$ is the vector perpendicular to the hyperplane (regardless of its norm), $|b| / \|w\|$ (absolute value of the term $b$ divided by the norm of the vector $w$) is the distance from the origin to the hyperplane, and the operator $\cdot$ (printed as • in the running text) is the dot product in the Euclidean space to which the data belong (we will use the scalar product between two $d$-dimensional vectors). Let $d_+$ ($d_-$) be the shortest distance between the hyperplane and a positive (negative) example; the margin of the hyperplane is defined as $d_+ + d_-$. By maximizing the classifier margin, we decrease the risk bound defined in (3). This is the basis for the following SVM mathematical development.

For the linear and separable case, the SVM algorithm calculates the separator hyperplane that maximizes the classifier margin. Thus, all training data must satisfy the following constraints:

(5)    $w \cdot x_i + b \ge +1$    for $y_i = +1$

(6)    $w \cdot x_i + b \le -1$    for $y_i = -1$

which can be formulated in one expression:

(7)    $y_i (w \cdot x_i + b) - 1 \ge 0 \quad \forall i$

All points for which equality holds in (5) lie on the hyperplane $H_1: w \cdot x_i + b = 1$, parallel to the separator hyperplane and at distance $|1 - b| / \|w\|$ from the origin. In much the same way, those points for which equality holds in (6) lie on the hyperplane $H_2: w \cdot x_i + b = -1$, parallel to $H_1$ and to the separator hyperplane and at distance $|-1 - b| / \|w\|$ from the origin. Thus, $d_+ = d_- = 1 / \|w\|$, and so the margin is $2 / \|w\|$. We must find the pair of planes $H_1$, $H_2$ that maximizes the margin, by minimizing $\|w\|^2$ subject to the constraints defined in inequality (7). Note that, in the training phase, no data point will lie between $H_1$ and $H_2$ or on the wrong side of its class plane (that is the reason for calling it the separable case).

Those points that satisfy the equality in (7) (those placed on $H_1$ or $H_2$) and that, if eliminated from the training set, would give a different solution (by definition they would change $d_+$ or $d_-$) are called support vectors. The name comes from the fact that the learning machine is completely defined by these points and their weight on the hyperplane. All other training points, which are at a greater distance from the hyperplane than the support vectors, serve no purpose: if we had begun the training without them, the solution would have remained the same (see Figure 4b).


Figure 4b. Linear SVM classifier. Support vectors are encircled, the margin is shown with two dashed lines and the separator hyperplane is shown with a solid line
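Constraint (7) and the margin are easy to evaluate for a candidate hyperplane. The following is a minimal sketch with hypothetical helper names; the three points and the hyperplane are the ones reached at the optimum of the optimisation example given later in this chapter.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def margin(w):
    """Distance between H1 and H2, that is 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))


def constraint_values(w, b, points, labels):
    """y_i (w . x_i + b) - 1 for every point; constraint (7) needs all >= 0."""
    return [y * (dot(w, x) + b) - 1 for x, y in zip(points, labels)]


points = [(1, 1), (2, 1), (3, 1)]
labels = [1, -1, -1]
w, b = (-2, 0), 3

values = constraint_values(w, b, points, labels)
# Points with value 0 lie exactly on H1 or H2 and are candidate support vectors.
on_margin = [x for x, v in zip(points, values) if v == 0]

print(values, margin(w), on_margin)
```

The third point has a strictly positive constraint value, so it lies beyond its class plane and could be removed without changing the hyperplane.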

The problem can be reformulated using Lagrange multipliers. This makes it easier to add constraints to the problem, and lets the training data appear only in the form of dot products between vectors, which will allow us to generalize the SVM algorithm to the non-linear case. The general rule for creating the Lagrange formulation is: for constraints of the type $c \ge 0$, the constraint equation is multiplied by a Lagrange multiplier and subtracted from the objective function. Thus, we introduce non-negative Lagrange multipliers $\alpha_i$, $i = 1, \dots, l$, one for each constraint in inequality (7), that is, one for each training point. The Lagrangian we obtain is:

(8)    $L_P = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{l} \alpha_i$

We want to minimize $L_P$ with respect to $w$ and $b$ (the variables that define the plane), and require that the partial derivatives of $L_P$ with respect to the $\alpha_i$ be 0. This is a convex quadratic optimisation problem, because the objective function is convex and the constraints also define a convex set [Burges, 1998]. This means we can solve the problem using the dual formulation [Fletcher, 1987]. This Wolfe dual formulation has the following property: the maximization of $L_D$ (in contrast with the primal formulation $L_P$) under the defined constraints occurs at the same values of $w$ and $b$ as the minimization of $L_P$ described above. All partial derivatives must be zero at the optimum. Calculating the partial derivatives of $L_P$ with respect to $w$ and $b$ and setting them to zero, we obtain the following conditions:

(9)    $w = \sum_{i=1}^{l} \alpha_i y_i x_i$

(10)    $\sum_{i=1}^{l} \alpha_i y_i = 0$


which, substituted into equation (8), give:

(11)    $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
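The substitution can be written out step by step. The following sketch of the intermediate algebra writes $\alpha_i$ for the Lagrange multipliers and uses only the primal Lagrangian (8) together with conditions (9) and (10):

```latex
\begin{align*}
\tfrac{1}{2}\,\|w\|^2
  &= \tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
  && \text{by (9)} \\
\sum_{i=1}^{l} \alpha_i y_i \,(w \cdot x_i + b)
  &= \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
     + b \sum_{i=1}^{l} \alpha_i y_i
  && \text{by (9)} \\
  &= \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
  && \text{by (10)} \\
L_P
  &= \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
     - \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
     + \sum_{i=1}^{l} \alpha_i \\
  &= \sum_{i=1}^{l} \alpha_i
     - \tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
\end{align*}
```

The term in $b$ disappears because of condition (10), which is why the dual depends only on the multipliers and on dot products between training points.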

Therefore, the problem is now written as “Maximize $L_D$ with respect to all $\alpha_i \ge 0$, satisfying conditions (7) and (10)”. There is a Lagrange multiplier for each training point, but only those having $\alpha_i > 0$ are of any importance in calculating the separator hyperplane with equation (9). These are the support vectors, which were defined in previous paragraphs.

The geometric interpretation of (11) is easier if the second term is rewritten using (9). Suppose we are in an intermediate optimisation state, and we want to calculate the second term at step $i = 0$. The term is:

$\sum_{j=1}^{l} \alpha_0 \alpha_j y_0 y_j \, x_0 \cdot x_j = \alpha_0 y_0 \sum_{j=1}^{l} \alpha_j y_j \, x_0 \cdot x_j = \alpha_0 y_0 \, x_0 \cdot \sum_{j=1}^{l} \alpha_j y_j x_j = \alpha_0 y_0 \, x_0 \cdot w$

The scalar product of a point and a vector normal to the hyperplane gives the projection of the point onto the vector, that is, the relative distance between the point and the hyperplane. The concept underlying the formula is:

term = $\alpha_0$ × (correctness of classification) × (distance between the point and the currently defined separator plane)

At each step $i$, the relation between $x_i$ and the current-state $w$ is calculated. Therefore, we can deduce some hand-made optimisation rules:

A) If the classification is correct, the contribution of the term to $L_D$ is negative, so $\alpha_i$ should decrease, and thus reduce its weight (its importance) in the calculation of the current $w$, in case the optimum has not been reached.
B) If the distance is big with respect to other points of the same class, and the point is correctly classified, $\alpha_i$ should decrease, while the multiplier $\alpha_k$ of another same-class point closer to the margin should increase.

Note that when evaluating the correctness of a point during the training phase, the point itself is used. If a point is misclassified, the algorithm will increase its multiplier as much as needed, forcing the hyperplane definition until the condition for this point is satisfied. For the linearly separable case this strategy is valid, because sooner or later the point must be correctly classified. But for non-linear or non-separable cases, this strategy may give poor results. If there is some noise in the training data, the algorithm will try to force the hyperplane definition to classify points that are wrong. This will overfit the data, so the performance will be poorer.

Therefore, the SVM training algorithm consists of the following basic steps:


1. Identify all training data points and their labels.
2. Optimize (maximize) the dual Lagrangian, maintaining the constraints defined in (7) and (10). For that purpose, there are many optimisation methods for convex quadratic problems described in the mathematical literature [Fletcher, 1987]. The result of the optimisation phase is the set of all Lagrange multiplier values $\alpha_i$. Basic optimization methods have important limits regarding the resources (time and memory) needed for big problems (more than 10,000 patterns). Thus, at the beginning of SVM history, efficient optimization algorithms were the basic research line. In section 5 the best SVM algorithm will be shown: SMO.
3. Throw away all those points which are not support vectors after the training process (i.e. those having $\alpha_i = 0$), and calculate the value of $w$ and $b$ from the support vectors and formulas (9) and (7). We then have a completely defined optimum separator hyperplane.

1.3 Karush-Kuhn-Tucker Conditions

Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient conditions for a solution to the problem defined in step 2 of the previous algorithm. This solution identifies the optimum value of the objective function $L_P$ with respect to all available parameters (all $\alpha_i$). Many SVM algorithms use these KKT conditions to check whether the machine’s current state is the optimum and, if not, which points violate the optimality conditions the most. For the basic SVM definition given in this chapter, the optimality conditions are:

(9)    $w = \sum_{i=1}^{l} \alpha_i y_i x_i$

(10)    $\sum_{i=1}^{l} \alpha_i y_i = 0$

(7)    $y_i (w \cdot x_i + b) - 1 \ge 0$

(7 bis)    $\alpha_i \, [\, y_i (w \cdot x_i + b) - 1 \,] = 0, \quad \alpha_i \ge 0$

Most of them have been introduced in previous sections of this chapter, but they are repeated here for better comprehension of the optimisation process. The new equation (7 bis) is easy to interpret. It concerns the points for which equality must hold in inequality (7). It could be stated in the following words: “Any training point either satisfies equality in inequality (7), or its Lagrange multiplier is cancelled, i.e. $\alpha_i = 0$”. If equality holds in (7) and $\alpha_i > 0$, then the point is on the margin hyperplane and is a support vector. It can also happen that both conditions hold, that is, equality holds in (7) and $\alpha_i = 0$. In that case, the point is on the margin hyperplane of its class but it is not needed for the hyperplane definition, therefore it is not a support vector.
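A direct way to use these conditions is to test a candidate set of multipliers. The sketch below is one possible check, with hypothetical function names: it builds $w$ from the multipliers, derives the value of $b$ implied by each point with a positive multiplier, and accepts the state only if those values coincide and every training point respects its constraint. The data are the three points of the optimisation example that follows.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, labels, points):
    """w = sum_i alpha_i y_i x_i."""
    dim = len(points[0])
    return [
        sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
        for k in range(dim)
    ]


def kkt_report(alphas, labels, points, tol=1e-9):
    w = weight_vector(alphas, labels, points)
    nonnegative = all(a >= -tol for a in alphas)
    balanced = abs(sum(a * y for a, y in zip(alphas, labels))) <= tol
    # A positive multiplier forces y_i (w . x_i + b) = 1, so b = y_i - w . x_i
    # (labels are +1 or -1, hence 1 / y_i = y_i).
    biases = [y - dot(w, x) for a, y, x in zip(alphas, labels, points) if a > tol]
    consistent = bool(biases) and max(biases) - min(biases) <= tol
    feasible = consistent and all(
        y * (dot(w, x) + biases[0]) - 1 >= -tol for x, y in zip(points, labels)
    )
    return {
        "w": w,
        "biases": biases,
        "optimal": nonnegative and balanced and consistent and feasible,
    }


points = [(1, 1), (2, 1), (3, 1)]
labels = [1, -1, -1]

first_guess = kkt_report([2, 1, 1], labels, points)
final_state = kkt_report([2, 2, 0], labels, points)
print(first_guess)
print(final_state)
```

In the first state the three implied values of $b$ disagree, so the conditions fail; in the second state the two support vectors agree on $b = 3$ and the third point satisfies its constraint.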

1.4 Optimisation Example

To show the optimisation process more clearly, we will introduce an example. Suppose we have 3 points $(1, 1)$, $(2, 1)$, $(3, 1) \in \mathbb{R}^2$ with labels $+1$, $-1$, $-1$ respectively (see figure 5). Suppose the initialisation routine sets the Lagrange multipliers to $\alpha_1 = 2$, $\alpha_2 = 1$, $\alpha_3 = 1$ (so that condition (10) holds). We use formulas (9) and (11) to calculate the following:

$w_1 = 2(1, 1) - 1(2, 1) - 1(3, 1) = (-3, 0)$
$L_{D1} = 4 - \frac{1}{2}(-6 + 6 + 9) = -0.5$

Then, we check whether this is a valid solution for our SVM. For that purpose, we use the KKT conditions, especially condition (7). Note that all three points would be support vectors, so they must give the same value of $b$ when substituted in condition (7). At this optimisation stage this is not true for $w_1$, because we obtain $b = 4$, $b = 5$ and $b = 8$. Thus we can say, without doubt, that this is not a solution. Now we must find another set of Lagrange multipliers that brings an increase of $L_D$. Point 3 is farthest from the current pseudo-hyperplane (being correctly classified), so it is a good candidate for decreasing its weight in the definition of $w$ (see section 2.2). Suppose that the new Lagrange multiplier values are $\alpha_1 = 1$, $\alpha_2 = 1$, $\alpha_3 = 0$ (condition (10) must always hold).

$w_2 = 1(1, 1) - 1(2, 1) - 0(3, 1) = (-1, 0)$
$L_{D2} = 2 - \frac{1}{2}(-1 + 2) = 1.5$

We made a good choice because $L_D$ has increased. Nevertheless, we still do not satisfy the KKT conditions. When we substitute in equation (7), we obtain $b = 2$ and $b = 1$ for the two points respectively (we have only two support vectors).

Figure 5. A linear separable set with margin and separator hyperplane


Now that we have two support vectors of different classes, their Lagrange multipliers must change in the same way for condition (10) to hold. We increase them, for instance, to $\alpha_1 = 2$, $\alpha_2 = 2$, $\alpha_3 = 0$.

$w_3 = 2(1, 1) - 2(2, 1) - 0(3, 1) = (-2, 0)$
$L_{D3} = 4 - \frac{1}{2}(-4 + 8) = 2$

Again, $L_D$ has increased, so we have chosen wisely. Moreover, at this optimisation step the KKT conditions hold, with the same value of $b$ for all support vectors, $b = 3$. We can assert without any doubt that the optimum has been reached. For instance, if we continued to increase the multipliers to $\alpha_1 = 3$, $\alpha_2 = 3$, $\alpha_3 = 0$, the result would not be valid. We would obtain:

$w_4 = 3(1, 1) - 3(2, 1) - 0(3, 1) = (-3, 0)$
$L_{D4} = 6 - \frac{1}{2}(-9 + 18) = 1.5$

The convexity required of the objective function holds: $L_{D1} < L_{D2} < L_{D3} > L_{D4}$. Moreover, as the example is so small, some degree of uniform quadratic convexity can be seen, as $L_{D2} = L_{D4}$, below the optimum.

During the optimisation process, while the KKT conditions do not hold, a unique separator hyperplane does not exist. At each new step (new set of values of $\alpha$), there is only one hyperplane direction, but as many separator hyperplanes as support vectors in the training set (different values of $b$). These hyperplanes do not need to have a geometric meaning; they do not try to separate the data, even though they could. As we get closer to the optimum (increasing $L_D$), all the hyperplanes defined by the support vectors come closer to each other (less difference in the value of $b$). The limit is reached when $L_D$ gets to its optimum value and all hyperplanes match up with only one value of $b$: the separator hyperplane.

This concept differs greatly from the search process followed by other similar methods, like the perceptron. The perceptron always defines a separator hyperplane that evolves at each training step, trying to classify all training data correctly. For that reason, it can reach a state in which all data points are correctly classified, but whose margin is not the optimum. That is called a local minimum, where the perceptron will be trapped and will not be able to continue. The SVM algorithm performs a quadratic optimisation in which no intermediate state can be considered a valid solution. There will be only one solution, it will be global, and it will be the best you can have.

Even though the soft-margin SVM is defined in the next sections, this is a good place to see what happens when the optimisation algorithm is applied to a non-separable data set. Suppose we have again the 3 points used before, $(1, 1)$, $(2, 1)$, $(3, 1) \in \mathbb{R}^2$, but now with different labels $+1$, $-1$, $+1$ (see figure 6). We have changed the label of the third point, so the training set becomes non-separable with a linear machine. Nevertheless, this information is not given to the SVM algorithm.


Figure 6. A linear non-separable set

Suppose we initialise the values as $\alpha_1 = 1$, $\alpha_2 = 2$, $\alpha_3 = 1$ (condition (10) holds).

$w_1 = 1(1, 1) - 2(2, 1) + 1(3, 1) = (0, 0)$
$L_{D1} = 4 - \frac{1}{2}(0 - 0 + 0) = 4$

Of course, this cannot be a solution. We do not need to check the KKT conditions, because $w = (0, 0)$ does not define a hyperplane. At this stage we cannot guess which multipliers are better candidates for change, so we choose randomly. Suppose we define a new state $\alpha_1 = 1.5$, $\alpha_2 = 2$, $\alpha_3 = 0.5$ (there are not many more alternatives).

$w_2 = 1.5(1, 1) - 2(2, 1) + 0.5(3, 1) = (-1, 0)$
$L_{D2} = 4 - \frac{1}{2}(-1.5 + 4 - 1.5) = 3.5$

We obtain $L_{D2} < L_{D1}$, so we can be sure this is not a solution and, even more, that this way will take us nowhere. We choose another possible set, $\alpha_1 = 2$, $\alpha_2 = 4$, $\alpha_3 = 2$.

$w_3 = 2(1, 1) - 4(2, 1) + 2(3, 1) = (0, 0)$
$L_{D3} = 8 - \frac{1}{2}(0 - 0 + 0) = 8$

As in the first case, this cannot be a solution. But $L_D$ has increased quite a lot, and we could think this is getting us closer to the solution. However, we could keep increasing the multipliers in this way, with $L_{Dn} = \alpha_1 + \alpha_2 + \alpha_3$, so the objective function increases without limit (note that along this direction the problem is not characterized by a quadratic function but by a linear one, so there cannot be an optimum). Therefore, if the objective function increases without limit, we are applying a linear separable machine to a linearly non-separable training set.
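The behaviour of $L_D$ in both examples can be reproduced with a few lines of code. This is a minimal sketch with a hypothetical helper name; it evaluates the dual objective as the sum of the multipliers minus half the squared norm of $w$, which equals the double sum of equation (11) once $w$ is written as in (9).

```python
def dual_objective(alphas, labels, points):
    """L_D = sum(alpha_i) - 0.5 * ||w||^2, with w = sum_i alpha_i y_i x_i."""
    dim = len(points[0])
    w = [
        sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
        for k in range(dim)
    ]
    return sum(alphas) - 0.5 * sum(c * c for c in w)


points = [(1, 1), (2, 1), (3, 1)]

# Separable labelling: the four multiplier sets visited in the example.
separable_labels = [1, -1, -1]
separable_steps = [(2, 1, 1), (1, 1, 0), (2, 2, 0), (3, 3, 0)]
separable_values = [
    dual_objective(a, separable_labels, points) for a in separable_steps
]

# Non-separable labelling: multipliers (t, 2t, t) satisfy condition (10)
# and give w = 0, so L_D equals 4t and grows without limit.
nonseparable_labels = [1, -1, 1]
growing_values = [
    dual_objective((t, 2 * t, t), nonseparable_labels, points)
    for t in (1, 2, 4, 8)
]

print(separable_values)
print(growing_values)
```

In the separable case the values rise to a maximum and then fall again, while in the non-separable case they simply keep growing with the multipliers.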

1.5 Test Phase

As has been said, once we have trained an SVM we obtain the values $w$ and $b$. With these values we define a separating hyperplane, $w \cdot x + b = 0$, parallel to $H_1$ and $H_2$ and placed midway between them. To classify an unseen pattern $x$, we just need to know on which side of the separator hyperplane the point lies, i.e., the sign of $w \cdot x + b$. Note that in the test phase we may find data points placed between $H_1$ and $H_2$ which, had they been used during training, would have changed the solution found. This concept may be useful when developing SVM training algorithms, because it could identify support vectors a priori, before the whole training, saving computational power.

Up until now we have mentioned only the binary case, that is, data with only two classes. SVM classifiers can be easily extended to the multiple class case: for n classes, we just need to generate n-1 binary classifiers, each of which separates one class from the rest. Nevertheless, this multiple classifier is O(n) more complex in time (memory resources are more difficult to estimate) than one binary classifier, in the training as well as in the test phase. As this extension does not bring major new advances, it will not be mentioned in the rest of this chapter.

1.6 Non-Separable Linear Case

Now that we know everything that is needed to create and use a simple SVM, we will upgrade its definition so that it will be able to deal with any real-life problem. When the above-described algorithm for separable data is used over non-separable data (see figure 7), no solution will be found, as the value of LD will grow without limit (see section 2.5). For the non-separable data to satisfy initial constraints, we have to introduce the concept of soft margin. This means that the algorithm will allow some training points to violate those constraints, and so, the rest of training data will be correctly classified (regardless of violating points). For that purpose we

Figure 7. A linear non-separable set, which needs a soft-margin classifier. The distribution is defined as class = 1 if $x_1 + x_2 > 7.5$; class = −1 otherwise. The distribution has some noise


introduce positive slack variables $\xi_i$, one for each point, such that the following inequalities hold [Cortes and Vapnik, 1995]:

(12)    $w \cdot x_i + b \ge +1 - \xi_i$    for $y_i = +1$

(13)    $w \cdot x_i + b \le -1 + \xi_i$    for $y_i = -1$

The values $\xi_i$ are not fixed prior to training; they are calculated during the optimisation process. And because they are not fixed, we can be certain that all points will satisfy inequalities (12) and (13): just increase the $\xi_i$ of a point until its inequality holds. We have solved our troubles: now there will always be a solution. But it may be that the solution is not close enough to the true distribution underlying the data. If that is so, the solution is useless, so we have just changed the name of our worries. The introduction of these variables must therefore be accompanied by an increase of the primal Lagrangian $L_P$ that penalizes them, so that classification errors during training are minimized. For a classification error on a training pattern to take place, its associated $\xi_i$ must be greater than 1, so

$\sum_{i=1}^{l} \xi_i$

is an upper bound on the number of training errors over the complete training set. Therefore, the objective function to be minimized changes from $\frac{1}{2} \|w\|^2$ to

$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i$

where $C$ is a tunable non-negative real value. This value corresponds to the global penalty given to training errors. This new objective function could have been different. We could have devised other methods for forcing the $\xi_i$ values to be as small as possible. This particular function was chosen for simplicity: the problem continues to be convex quadratic, and neither the $\xi_i$ nor the Lagrange multipliers associated with these new constraints appear in the dual formulation of the problem. Therefore, we have to maximize $L_D$:

(14)    $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

with constraints:

(15)    $0 \le \alpha_i \le C$

(16)    $w = \sum_{i=1}^{l} \alpha_i y_i x_i$

(17)    $\sum_{i=1}^{l} \alpha_i y_i = 0$
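To make the role of the slack variables and of $C$ concrete, the sketch below evaluates the soft-margin primal objective for a fixed hyperplane. The function names, the fourth (noisy) point and the values of $C$ are assumptions of this illustration; the hyperplane is the one found for the separable example of section 1.4. For a given $w$ and $b$, the smallest slack compatible with (12) and (13) is $\xi_i = \max(0, 1 - y_i (w \cdot x_i + b))$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, points, labels):
    """Smallest non-negative xi_i that satisfy inequalities (12) and (13)."""
    return [max(0.0, 1 - y * (dot(w, x) + b)) for x, y in zip(points, labels)]


def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * sum(xi_i)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, points, labels))


# Three clean points plus one hypothetical noisy point labelled +1
# although it lies on the negative side.
points = [(1, 1), (2, 1), (3, 1), (2.5, 1)]
labels = [1, -1, -1, 1]
w, b = (-2, 0), 3

xi = slacks(w, b, points, labels)
training_errors = sum(1 for s in xi if s > 1)  # xi_i > 1 means misclassified
cheap = soft_margin_objective(w, b, points, labels, C=1)
expensive = soft_margin_objective(w, b, points, labels, C=10)

print(xi, training_errors, cheap, expensive)
```

The sum of the slacks (3) is indeed above the number of training errors (1), and the same violation weighs much more in the objective when $C$ is large.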


The only difference between the previous algorithm and this last one is that the $\alpha_i$ now have an upper bound $C$. The training algorithm will not allow any point to increase its weight indefinitely, and so a solution will eventually be found. The error term in the optimisation process comes from those points that have $\xi_i > 0$, either because they are incorrectly classified or because they lie inside the margin. For any point that satisfies $\xi_i > 0$, it can be stated that $\alpha_i = C$. It is still a support vector, and it is treated as such in the calculation of $w$, but in the optimisation process its weight will grow no more.

The soft margin philosophy (as opposed to the hard margin defined in section 2.2) is not to forbid training errors, nor even to minimize them alone. The idea is to minimize the whole objective function, in which errors exert some pressure, as does the robustness of the hypothesis, identified with the maximization of the margin between the well-classified points on each side of the separating hyperplane (characterised by constraint (7)).

Suppose, for instance, the case shown in figure 7. A hard margin classifier cannot be found, but many soft margin classifiers will satisfy the constraints, and the only difference between them will be the value of $C$. The first approach for newcomers is usually the hardest soft margin possible, one that looks like figure 8. It is a valid solution, but it has a very small margin. By definition of structural risk minimization, if we increase the margin, test errors should decrease (better generalization performance). On the other hand, training errors should be avoided (or, at least, limited), so a balance must be found between margin maximization and error permissibility. A small quantity of noise may be accepted without harming the generalization performance, by creating a hypothesis built on common properties satisfied by the data (the internal, true data distribution). In the case of figure 9, it is easily seen that more training points become errors, but the classifier is much closer to the underlying distribution concept.

The new parameter $C$ becomes the only value (until now) that must be provided in the SVM architecture. As has been said, $C$ serves as a balance between error permissibility and generalization goodness.

Figure 8. The figure 7 set, with a rather hard soft margin classifier


Figure 9. The figure 7 set, with a softer margin classifier

– If $C$ is small, then errors are cheap. The margin will grow, and so will the number of training patterns that violate the margin.
– If $C$ is big, then the value of $w$ has small relevance in the optimisation of the objective function compared with the training errors. We are approaching the hard margin philosophy. Because the value of $w$ is closely related to margin maximization, decreasing the relevance of $w$ will lead to a smaller margin and, maybe, to a worse generalization ability.

To choose a good $C$ value, model complexity and expected data noise must be evaluated as a whole.

1.7 Non-Linear Case

In most real-life cases, data cannot be separated using a linear hyperplane in input space. Even the use of slack variables could lead to a poor classifier, in case the deviations from linearity are caused by the structure of the hypothesis and not by noisy data. The next step is to introduce non-linear separating surfaces, instead of hyperplanes, in the SVM algorithm (see figure 10). For that purpose, we map the input data into another Euclidean space $H$, whose dimension is higher than that of the input space. We use a mapping function $\Phi$ such that:

$\Phi: \mathbb{R}^d \to H$

In the dual formulation of the problem, input data vectors appear only as inner products $x_i \cdot x_j$ in the space they belong to. Now they will only appear as $\Phi(x_i) \cdot \Phi(x_j)$ in space $H$. Space $H$ will usually have a very high dimension. It could even be an infinite-dimensional space. Therefore, performing operations in this space could be too costly. But if we could find a kernel function $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, then we would not need to map data vectors explicitly into space $H$; we would not even


Figure 10. Non-linear distribution set

need to know what $\Phi$ is. Now we just have to define a valid kernel function $K$, and substitute $K(x_i, x_j)$ everywhere $x_i \cdot x_j$ appears in the algorithm.

When we use a space of much higher dimension, many new data features, linear and non-linear, arise. Each new dimension offers a new possible view of the correlations, a new attribute with which we can separate the data, a new factor with which to create the hypothesis. It is the responsibility of the training process to discriminate those attributes that contain useful information for defining the hyperplane from those that do not, by assigning the former a bigger weight in the linear combination of all features. For those cases in which the user has some information about the data correlation, an explicit mapping can be generated. Nevertheless this is not usual, and could lead to an inefficient implementation, depending on the credibility of the previous knowledge. Using generic mapping functions (we will see them later) offers the possibility of generating an enormous number of new features, without taking care of the meaning of each one. In fact, these spaces usually have thousands, millions or even an infinite number of dimensions. It is difficult to picture such a big geometrical space. It seems easier to identify it with a set of non-linear relations between input attributes, which can be assembled with linear relations in the optimisation process to create a surface (a hyperplane in feature space, an indefinable curve in input space) capable of separating the input data of one class from the other.

If we replace $x_i \cdot x_j$ by $K(x_i, x_j)$ everywhere in the training phase formulas, the algorithm defined in section 2.2 will generate a linear SVM in a high-dimensional space (specified by the mapping function). And most important, it will do so with roughly the same time complexity as a simple linear SVM created in input space (without mapping). All further development stays the same, as we are still creating a linear separator, although in a different space. In the linear case, the output of the training phase was the value of $w$ and $b$, with which the hyperplane was completely defined, so the test phase just had to see on which side of the hyperplane the new pattern was. Now, we cannot calculate $w$ explicitly, because it is defined in space $H$ only and we do not know exactly how the mapping is made.


Through the support vector expansion, the value of $w$ can be written as:

(18)    $w = \sum_{i=1}^{N} \alpha_i y_i \, \Phi(s_i)$

so we can write the classification function as:

(19)    $f(x) = \sum_{i=1}^{N} \alpha_i y_i \, \Phi(s_i) \cdot \Phi(x) + b = \sum_{i=1}^{N} \alpha_i y_i \, K(s_i, x) + b$

where $s_i$ are the $N$ support vectors, identified in the training phase as those patterns whose Lagrange multiplier is not zero. With this definition we avoid calculating the mapping function $\Phi$ once more.

Note that the soft margin concept still applies to a non-linear classifier. Actually, its implementation remains very simple: the Lagrange multipliers have an upper limit. In this case the soft margin applies to the linear classifier in the high-dimensional space. The clearest advantage is that we can still assert there is a solution. The use of a non-linear surface as separator does not by itself guarantee that a solution will be found, even though it is more probable. Moreover, using the soft margin alternative makes the classifier more robust against noisy training patterns.

The time complexity of the training phase does not change, but the test phase is different. In the linear case, having calculated $w$ explicitly, the complexity of the algorithm is O(1), using the inner product as the basic operation (which is O(d) if multiply-add is the basic operation). In the non-linear case, we need to perform O(N) operations, where $N$ was previously defined as the number of support vectors. Because of this relation between the number of support vectors and complexity, algorithms have been devised that try to minimize, or even replace, support vectors during and after training, so that this phase may be competitive enough with other machine learning methods, such as neural networks.

1.8 Mapping Function Example

To better understand how new useful features are generated, we will show an example. Suppose we have a data set $\{x_i, c_i\}$ in $\mathbb{R}^2 \times \{+1, -1\}$ as shown in figure 11. It can be seen that this is not a linearly separable case, and the soft margin linear separator is not enough. In this example, the training data has no noise. We define a mapping function $\Phi: \mathbb{R}^2 \to \mathbb{R}^3$ of the form:

$\Phi(x_1, x_2) = (x_1, x_2, x_1 x_2)$

We have therefore added a new feature to the input definition, which gives us information about a specific kind of relation between the two initial variables. Thus, we can calculate the kernel function:

$K(x, x') = \Phi(x) \cdot \Phi(x') = (x_1, x_2, x_1 x_2) \cdot (x'_1, x'_2, x'_1 x'_2) = x_1 x'_1 + x_2 x'_2 + x_1 x_2 x'_1 x'_2$
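As a quick check of the identity above, the following minimal Python sketch compares the explicit inner product in the mapped space with the kernel computed from input-space quantities only. The function names (`phi`, `dot`, `kernel`) and the two sample points are illustrative assumptions, not taken from the text.

```python
def phi(x):
    """Explicit mapping R^2 -> R^3: (x1, x2) -> (x1, x2, x1*x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def dot(u, v):
    """Ordinary inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def kernel(x, xp):
    """K(x, x') written with input-space coordinates only."""
    return x[0] * xp[0] + x[1] * xp[1] + x[0] * x[1] * xp[0] * xp[1]

# Two arbitrary sample points (assumed values, for illustration only).
a, b = (2.0, 3.0), (1.0, 5.0)
print(dot(phi(a), phi(b)), kernel(a, b))  # 47.0 47.0
```

Both expressions give the same value for any pair of points, which is what lets the SVM work with K alone and never build the mapped vectors.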


Figure 11. Non-linear distribution set. The distribution is defined as class = 1 if $x_1 x_2 < 14.5$; class = −1 otherwise

Figure 12. Feature space view for the main points from figure 11. The margin h1 − h2 is partially shown using solid lines

We have defined the mapping function and the new space implicitly, using the inner product in input space as the only valid operator. Figure 12 shows the most important points from the training data set, together with the separator hyperplane the SVM algorithm would find and the points that become support vectors. The separator hyperplane is z = 14.5. Note that the final hypothesis needs only one of the three available features to create the hyperplane (it is defined using just the third component). This will be a very common case in non-linear SVM: just a few features will form the linear combination defining the separator hyperplane. To represent the curve in input space that describes the generated hyperplane we need to use the inverse mapping:

$\Phi^{-1}: (x_1, x_2, x_1 x_2) \to (x_1, x_2)$


Figure 13. Non-linear classifier for the figure 11 set. Support vectors are encircled, margin is shown using dashed lines and the separator curve is shown with a solid line

As the new axis z was defined as $z = x_1 x_2$ in the high dimensional space, the points that lie on the hyperplane satisfy $x_1 x_2 = 14.5$, so the curve in input space can be written as $x_2 = 14.5/x_1$. Figure 13 shows the final result, with the hyperbola $x_2 = 14.5/x_1$ as the non-linear class separator. The support vectors in this figure are those identified during training and highlighted in figure 12. Points that lie near the non-linear separator in input space do not necessarily become support vectors, although they usually do. The mapping function does not necessarily preserve any relation between input data points, but the concept behind the support vector is that of a “significant point”, and the points that carry most information are those that lie near points of the other class in input space. In real world cases this particular function will not be useful, unless clear and simple a priori information is available to the SVM engineer. Nevertheless, it is a valid mapping function and generates a valid kernel function. For a kernel to be valid, the function K(x,y) must satisfy some constraints, known as Mercer conditions.
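The separator of this example can be written out directly. The sketch below assumes the hyperplane z = 14.5 is oriented so that class +1 lies on the side where the product of the two inputs is below 14.5 (as in figure 11); the scale of the weight vector and the helper names are illustrative assumptions.

```python
def phi(x):
    """Mapping of the example: (x1, x2) -> (x1, x2, x1*x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

# Assumed hyperplane parameters in feature space: only the third
# feature (z = x1*x2) carries a non-zero weight.
W = (0.0, 0.0, -1.0)
B = 14.5

def classify(x):
    """Return +1 if x1*x2 < 14.5, otherwise -1."""
    score = sum(w * z for w, z in zip(W, phi(x))) + B
    return 1 if score > 0 else -1

def separator_x2(x1):
    """Input-space curve x2 = 14.5 / x1 (defined for x1 != 0)."""
    return 14.5 / x1

print(classify((2.0, 7.0)), classify((2.0, 7.5)), separator_x2(2.0))
# 1 -1 7.25
```

The two test points sit just below and just above the curve at x1 = 2, so they fall on opposite sides of the hyperplane in feature space.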

1.9 Mercer Conditions

Not all kernel functions are valid, that is, not all of them describe a Euclidean space with the properties required in previous sections. It is enough to satisfy Mercer conditions [Vapnik, 1995], which can be written as: there exists a mapping $\Phi$ such that $K(x, y) = \Phi(x) \cdot \Phi(y)$ if and only if, for all g(x) such that $\int g(x)^2 \, dx$ is finite, the following inequality holds:

$\int\!\!\int K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0$
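A finite-sample analogue of this inequality can be tried numerically: for any set of points and any weights $g_i$, the double sum $\sum_{ij} K(x_i, x_j) g_i g_j$ must be non-negative for a valid kernel. The sketch below is only a spot check under that discrete assumption, not a proof; the function names and the sample points are hypothetical.

```python
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def quad_form(kernel, points, g):
    """Discrete version of the Mercer integral: sum_ij K(x_i, x_j) g_i g_j."""
    return sum(kernel(xi, xj) * gi * gj
               for xi, gi in zip(points, g)
               for xj, gj in zip(points, g))

def poly2(x, y):
    """Homogeneous polynomial kernel of degree 2."""
    return dot(x, y) ** 2

def not_a_kernel(x, y):
    """Deliberately invalid choice: the negated inner product."""
    return -dot(x, y)

points = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ones = [1.0, 1.0, 1.0]
print(quad_form(poly2, points, ones))         # 10.0
print(quad_form(not_a_kernel, points, ones))  # -8.0

# Random weights never drive the valid kernel below zero
# (up to floating-point rounding).
rng = random.Random(0)
worst = min(quad_form(poly2, points, [rng.uniform(-1, 1) for _ in points])
            for _ in range(1000))
print(worst >= -1e-9)  # True
```

A single negative value is enough to reject a candidate function, whereas passing the spot check for some samples only suggests, and does not establish, that the condition holds.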


In most cases this is a very complicated condition to check, because it must hold ‘for all g(x)’. It has been demonstrated for

$K(x, y) = \sum_{p=1}^{P} C_p (x \cdot y)^p$

when each $C_p$ is a positive real number and p is a positive integer.

1.10 Kernel Examples

The first (and only) basic kernels used to develop pattern recognition, as well as non-linear regression and principal component analysis, with SVM are (for any pair of vectors $x, y \in \mathbb{R}^d$):

(20) $K(x, y) = (x \cdot y + 1)^p$

(21) $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$

(22) $K(x, y) = \tanh(\kappa \, x \cdot y - \delta)$

Kernel (20) is a non-homogeneous polynomial classifier of degree p (another common variation is the homogeneous polynomial kernel, without the term ‘+1’). It creates a space H with as many dimensions (data features) as there are p-combinations of the input attributes. All possible relations between input attributes up to degree p appear in the new space. The margin maximization algorithm will discriminate the features that carry information from those that do not (which should be most of them), so the number of adjustable parameters required to obtain a good solution decreases.

Kernel (21) is a Gaussian radial basis function (RBF). The dimension of the new space is not fixed: it depends on the actual data distribution and can be infinite. The visual effect of this kernel is that nearby patterns form class clusters that are as big as they can be. Clusters have the support vectors as centres (in feature space), and their radius is given by the value of $\sigma$ and by the support vector weight obtained during training.

Kernel (22) is similar to a two-layer sigmoidal neural network. Using the neural network kernel, the first layer is composed of N sets of weights, each set consisting of d weights; the second layer is composed of N weights (the $\alpha_i$), so that an evaluation requires a weighted sum of sigmoids evaluated on dot products. The structure and weights (which define the related neural network architecture) are given automatically by the training process. Not all values of $\kappa$ and $\delta$ satisfy Mercer conditions [Vapnik, 1995].

We say (20), (21) and (22) are basic functions because new kernel functions can be formulated by combining them while still satisfying Mercer conditions. A linear combination of two Mercer kernels with non-negative coefficients is a Mercer kernel. This can be easily demonstrated knowing that the integral operator distributes over addition. Other slight changes can also be made to the basic functions, in search of a kernel function that incorporates a priori information about the internal distribution.
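The three basic kernels and the classification function (19) can be sketched in a few lines of Python. The parameter names `sigma`, `kappa` and `delta`, the default values, and the two made-up support vectors with their multipliers are assumptions for illustration; a real machine would obtain them from training.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, p=2):
    """Kernel (20): non-homogeneous polynomial of degree p."""
    return (dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    """Kernel (21): Gaussian radial basis function of width sigma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """Kernel (22): tanh(kappa * x.y - delta)."""
    return math.tanh(kappa * dot(x, y) - delta)

def decision(x, support_vectors, alphas, labels, b, kernel):
    """Classification function (19): sum_i alpha_i y_i K(s_i, x) + b."""
    return sum(a * c * kernel(s, x)
               for s, a, c in zip(support_vectors, alphas, labels)) + b

# Hypothetical trained machine with two support vectors (made-up values).
svs = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, 0.5]
labels = [+1, -1]

for k in (poly_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, decision((0.8, 1.2), svs, alphas, labels, 0.0, k))
```

Each evaluation loops over the N support vectors, which is the O(N) test-phase cost of the non-linear machine; only the `kernel` argument changes between the three variants.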


Nevertheless, it has been observed experimentally that, in many cases, kernel choice is not a determining factor in the machine performance. For a real world problem whose internal distribution is not particularly fitted to some kind of kernel, the support vector sets tend to be very similar, no matter what non-linear function is used. Of course the weights are fairly different, as the evaluating function differs too. But the result, the separating surface, tends to have a very similar geometrical shape, especially where data density is high. As was said in previous sections, the reason could be that the patterns that are important because they lie near other-class patterns continue to be important regardless of the mapping function, so they become support vectors.

Last, we define the kernel matrix as a symmetric square matrix of order M (where M is the number of training patterns), where position (i,j) holds the kernel function value $K(x_i, x_j)$.

1.11 Global Solutions and Uniqueness

As has been shown in previous sections, the result of SVM training is a global solution of the optimisation process, i.e., the parameter set (values for w, b and $\alpha_i$) that gives the objective function its optimum. This is in contrast to a ‘local solution’, defined as a parameter set whose objective function value is optimal only when compared with its vicinity. In the SVM algorithm, any local solution is also a global solution because the problem is a convex quadratic problem. Nevertheless, the global solution may not be unique. There could be more than one parameter set for which the objective function takes the same, optimal value. This is not inconsistent with the definition of a global solution. Uniqueness is guaranteed only if the problem is strictly convex. The SVM training definition ensures that the problem is convex, but it is the training data that determines whether it is strictly convex or not. Non-uniqueness occurs in two different ways:
• When the w and b values are not unique. In this case all w and b values between two solutions are also global solutions. This is easy to accept, as the problem is convex.
• When the w and b values are unique, but the w value comes from different sets of $\alpha_i$ values. Reaching one solution or the other depends on the randomness of the training algorithm.
Remember that there can be training data points that lie on the margin hyperplanes but are not support vectors. Just as three collinear points define a single straight line, and discarding any one of the three gives the same line, it is easy to create a training set that generates different hard margin support vector sets depending on the order in which the patterns are listed, although the separator hyperplane remains unchanged.

1.12 Generalization Performance Analysis

The Mercer condition tells us whether or not a kernel function defines a new Euclidean space, but it does not define how the mapping function must be applied or the morphology of the new space. For easy cases, the feature space dimension can be deduced. For instance, the homogeneous polynomial kernel of degree p has $\binom{d+p-1}{p}$ new features or dimensions. For a degree-4 polynomial kernel using 16 × 16 pixel images (256 initial features), the new space dimension is 183181376. In real world cases we will never have training sets that big. A classification machine with a huge ‘features over data’ ratio would undoubtedly produce overfitting.

Let us use an easier example: a degree-3 polynomial with 8 × 8 pixel data. The new space dimension is 45760. If you are using a simple multi-layer perceptron neural net, the ratio between the number of weights and the number of data points should not be greater than around 15%. Suppose you are generating a hidden layer with 45760 units (new features), 64 units in the input layer and one unit as output. The number of weights in the net is 2974400 (almost 3 million). Therefore, the minimum training data set should have 19829333 patterns (almost 20 million). Now, that is an awfully big data set. Of course, not all 45760 new features are important. Many of them will have a null weight. But you cannot know at first which features will be needed and which will not. Some algorithms have been designed to prune the neural net while training, but even in this case the difference between a useful feature and a disturbing feature is not easy to tell.

A separator hyperplane in feature space H must have dim(H) + 1 parameters. Any classification system needing so many parameters to create a discrimination function will be resource and time inefficient. Nevertheless SVM have good classification and generalization performance, in spite of treating data in an enormous space, which could even be infinite. The reason has not been formally demonstrated, although the maximum margin requirement has much to say about it.
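The figures quoted in this example can be reproduced with a few lines of Python. This is a small sketch of the arithmetic only; the function name is hypothetical, and the weight count ignores bias terms, as the text's figure does.

```python
from math import comb

def poly_feature_dim(d, p):
    """Dimension of the space induced by a homogeneous polynomial
    kernel of degree p on d input features: C(d + p - 1, p)."""
    return comb(d + p - 1, p)

dim_16x16 = poly_feature_dim(256, 4)   # 16 x 16 pixel images, degree 4
dim_8x8 = poly_feature_dim(64, 3)      # 8 x 8 pixel images, degree 3

# Perceptron with 64 inputs, dim_8x8 hidden units and 1 output
# (bias weights not counted).
weights = dim_8x8 * 64 + dim_8x8 * 1

# Weights should be at most about 15% of the number of patterns.
min_patterns = weights * 100 // 15

print(dim_16x16, dim_8x8, weights, min_patterns)
# 183181376 45760 2974400 19829333
```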
Within the SVM, the solution has at most l + 1 adjustable parameters, l being the number of training patterns. After training, the solution has N + 1 parameters, N being the number of support vectors, which is much less than the number of new features.

In section 2.1 we left a question about which classifier is better out of two possible choices. The answer is “the one having the lowest VC dimension”, which is the same as saying “the simplest”. It was shown that the bound on the risk is related to the VC dimension: the lower the VC dimension, the lower the risk bound. However, this does not tell you which classifier will have the lowest actual risk. There is no way to know it beforehand. This approach is not only mathematically motivated; it can also be supported philosophically. The 14th-century English philosopher William of Ockham is credited with the principle known as Ockham’s razor: given some evidence and two hypotheses, one simple and one complex, both satisfying the evidence, the simpler hypothesis is the more likely to be true. It does not say which one is true, but if you had to bet and had no additional knowledge or evidence, you should go for the simpler hypothesis. That is what learning is all about, be it machine or human: choose the hypothesis that seems most probable given the current evidence. Whenever you make a new assumption (using an unnecessarily complex hypothesis) you are most probably farther from the truth.

That answers the big question: why is the generalization performance of support vector machines good even when using a high dimension feature space? Because SVM performance is not related to the dimension of the space where data is separated, but to the VC dimension of the classifier. Therefore, the SVM classifier depends on the simplicity of the hypothesis underlying the data, not on the number of available features. If a simple hypothesis can do the separating job, the SVM will use it, with no overfitting. There is no magic any more.

The SVM algorithm gives the simplest hypothesis, that is, the most probable one. But this does not mean there cannot be a better answer for a given problem. In spite of our staunch SVM militancy, we do not deny that SVM have been slightly outperformed (mostly by specific neural networks) in some experimental benchmarks. The answer is simple: luck. The SVM gave the most probable answer after one general-purpose execution. But the true internal distribution may have been slightly more complex, even though it did not show in the training data. If you are trying a neural network architecture with a bit more complexity, which way will you go? You cannot say unless you have additional information. The successful architect engineer would most probably try all possible ways. That means trying hundreds of different architectures and finally using the one with the best error rate on the test set.
But that approach falls down in several places: first, the engineer must decide how much complexity the answer should have (not an easy task at all); second, the training set must deviate slightly from the internal distribution for the SVM to fall behind; third, if you generate many classifiers and use the test set to decide which one is better, then the test set is no longer a good validation set, because you are using it as a secondary training set (even though it is used as a validation set for publishing the results); last, the engineer spends a lot of time in the training phase. And in spite of all this extra work, in cases where SVM are outperformed, they are still very near the top results in this scientific ranking. This means that in the real world it is difficult to find the SVM outperformed.

Support Vector Machines are not easy to implement, but they are very easy to use. Nevertheless, their use has some limits. As SVM is a statistical method, it is not well suited to symbolic learning. For instance, in the parity problem with few data the SVM decides that all points are support vectors. This is a clear hint of bad generalization performance, because it means: “one point has no relation with any other point”. In those cases an SVM is no better than a simple Nearest Neighbour classification algorithm. Other Machine Learning paradigms, for instance C4.5, are able to work with input data in which some attributes have an unknown value (C4.5 uses the ‘?’ symbol). The algorithm identifies this value and treats the information accordingly. However, the SVM algorithm does not allow unknown values, which limits its applicability to some data sets.


Within the previously defined scope, SVM has a very light bias. It is a true general-purpose machine learning method. Although a priori information can be included in the kernel function, the number of new features is so large that, regardless of the internal data distribution, there will always be a nearby hypothesis model using those new features. The training algorithm has embedded in it a balance between using too few features (too simple a hypothesis) and using too many (overfitting). The basic achievement in using SVM is that you just choose a generic kernel function (we won’t say “any kernel will do”, but it is not too far from the truth) and the confidence degree C (up until now mostly heuristics are used, but you will soon find it is quite easy). Then you push the button, and after some time you will have the best classification machine. No need for an experienced engineer or scientist. No complicated architectures. No tailoring. No second thoughts. Child’s play.

2. SVM MATHEMATICAL APPLICATIONS

The initial mathematical development for SVM has been applied to different approaches within the scope of Machine Learning. All of them are based on the structural risk minimization principle, on the Lagrange formulation of the problem, and on the generalization to the non-linear case. For each approach you only need to define the requirements all points must satisfy, their effect on the objective function, and the mathematical steps through the Lagrange formulation.

2.1 Pattern Recognition

The first approach to SVM was in the pattern recognition field. In fact, the search for a new statistical paradigm able to optimise the class separation problem was what drove V. Vapnik in his quadratic programming research. For that reason, the SVM definition developed in the previous sections, and the implementation shown in the next sections, apply specifically to pattern recognition. Nevertheless, most concepts also apply to the other approaches defined in this section.

2.2 Regression

Historically, the second application of SVM was non-linear regression and function estimation, called SVRM (Support Vector Regression Machines) [Vapnik et al, 1997]. This field can be divided into two parts: first, ‘function approximation’, which tries to find the curve that best fits training data acquired without noise (which makes it very similar to the usual interpolation methods); second, ‘function estimation’ (regression), where the data is noisy with an unknown distribution and the method tries to estimate unseen data points, including extrapolation, in the simplest possible way. The SVRM algorithm treats both cases in a very similar way. For each case, the cost function can be slightly changed.

2.2.1 Definition

Suppose we have a training set with M data pairs $\{x_i, y_i\}$, where $x_i \in \mathbb{R}^d$, $i = 1, \ldots, M$ (up until now, just the same as the pattern recognition case), and where $y_i \in \mathbb{R}$ is no longer a label but a real number representing the value of the function we want to estimate at $x_i$, i.e. $y_i = f_r(x_i) + n_i$, $n_i$ being the noise associated with point i. We want to find a function f(x) whose deviation from every training $y_i$ is at most $\varepsilon$. In the basic case, no training point may lie farther than $\varepsilon$ from the expected value, so the resulting curve must fit all points. This case can be used only when the data describe a linear function with a noise level $|n_i| < \varepsilon \ \forall i$. The estimating function has the form:

(23) $f(x) = w \cdot x + b$

w being the vector defining the curve in input space, and b the free term (the bias). As in the pattern recognition case, the structural risk minimization principle demands the greatest possible simplicity of the approximating function. We will try to minimize $\|w\|^2$, which gives the flattest linear function among those satisfying the constraints (unlike the margin maximization definition in pattern recognition). Therefore, the optimisation problem is written as: minimize $\frac{1}{2}\|w\|^2$ subject to the constraints:

(24) $y_i - w \cdot x_i - b \le \varepsilon$
     $w \cdot x_i + b - y_i \le \varepsilon$

Nevertheless, following the same reasoning as in section 2, this inflexible formulation is only valid when there is at least one solution satisfying conditions (24). Because this is usually an unrealistic case, with no noise in the data (it could be used for an interpolation approach), the soft margin idea must be introduced. We define positive slack variables $\xi_i$ that measure how far the expected value is from the true value for point i. Thus, we give the algorithm the ability to admit errors (points not satisfying the constraints), while keeping the ability to find a solution that represents the data distribution well enough. Likewise, a new cost function must be defined, giving a balance between the number of allowed errors and the simplicity (and usefulness) of the final estimating function. This cost function, c(x, y, f), must fulfil some properties, discussed in the next section.
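Constraints (24) and the slack variables can be illustrated for a linear estimator. The sketch below assumes a candidate (w, b) is already given (nothing is optimised here) and measures, for each training point, how far it lies outside the band of half-width `eps`; the function names and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, eps, xs, ys):
    """For each point return (xi, xi_star): the amount by which
    y - f(x) exceeds eps, and by which f(x) - y exceeds eps."""
    out = []
    for x, y in zip(xs, ys):
        r = y - (dot(w, x) + b)          # signed residual y - f(x)
        out.append((max(0.0, r - eps), max(0.0, -r - eps)))
    return out

def satisfies_constraints_24(w, b, eps, xs, ys):
    """True when every point lies within eps of the estimate."""
    return all(s == 0.0 and t == 0.0 for s, t in slacks(w, b, eps, xs, ys))

# Toy one-dimensional data (assumed values): the last point is too far
# above the candidate line f(x) = x.
xs = [(0.0,), (1.0,), (2.0,)]
ys = [0.2, 1.0, 3.0]

print(slacks((1.0,), 0.0, 0.5, xs, ys))
# [(0.0, 0.0), (0.0, 0.0), (0.5, 0.0)]
print(satisfies_constraints_24((1.0,), 0.0, 0.5, xs, ys))  # False
print(satisfies_constraints_24((1.0,), 0.0, 1.0, xs, ys))  # True
```

At most one element of each pair is non-zero, since a point cannot lie above and below the band at the same time.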


To continue the formulation in this section we will use the $\varepsilon$-insensitive cost function [Vapnik et al, 1997], partly because it was the first one proposed, and partly because it is the simplest to interpret and optimise. This cost function is continuous but not differentiable, so the variables must be duplicated (the formulation gets longer but no more difficult). Every $\alpha$ and $\xi$ now becomes a pair $\alpha, \alpha^*$ and $\xi, \xi^*$, where the variable without asterisk is associated with the case $y_i \ge f(x_i)$ and the one with asterisk with the case $y_i < f(x_i)$. Note that both cases cannot be true for any one point, so for every training point at least one of the duplicated variables will be zero. Thus, the objective function becomes:

(25) $L_P = \frac{1}{2}\|w\|^2 + C \frac{1}{M} \sum_{i=1}^{M} c(x_i, y_i, f)$

After developing the primal and dual formulations (just as in the pattern recognition case), the problem can be written as: maximise

$L_D = -\frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} (\alpha_i - \alpha^*_i)(\alpha_j - \alpha^*_j)\, x_i \cdot x_j - \varepsilon \sum_{i=1}^{M} (\alpha_i + \alpha^*_i) + \sum_{i=1}^{M} y_i (\alpha_i - \alpha^*_i)$

subject to:

$\sum_{i=1}^{M} (\alpha_i - \alpha^*_i) = 0$

(26) $\alpha_i, \alpha^*_i \in [0, C]$

C having the same meaning as in section 2: a parameter that balances error permissibility. This development remains a convex quadratic optimisation problem, which has to satisfy the Karush-Kuhn-Tucker conditions at optimality. Therefore, the implementation methods defined for pattern recognition are applicable, although with some differences caused by the cost function. In the case of the $\varepsilon$-insensitive cost function, the duplicated Lagrange multipliers must be treated specifically. This will happen with all non-differentiable cost functions.

Again, support vectors are those training points whose Lagrange multipliers are not zero (in the case of duplication, this means one of the two multipliers is non-zero). Moreover, the points having a non-zero slack variable are considered training errors, and have the corresponding multiplier set to the maximum, $\alpha^{(*)}_i = C$ (where the symbol $(*)$ means “either of the duplicated items, the applicable one”). Support vectors having a Lagrange multiplier not at bound, $0 < \alpha^{(*)} < C$, are placed on the margin (they are needed to define the margin) and have a zero slack variable. Basically, the concept behind the support vectors, their weights and their geometrical meaning remain the same as in the pattern recognition case (see figure 14).
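The roles just described can be read off the multipliers once training has finished. The sketch below assumes the pairs $(\alpha_i, \alpha^*_i)$ come from some solver that is not shown; the function names, the tolerance `tol` and the sample values are hypothetical.

```python
def point_role(alpha, alpha_star, C, tol=1e-9):
    """Classify a training point from its pair of Lagrange multipliers."""
    a = max(alpha, alpha_star)   # at most one of the pair is non-zero
    if a <= tol:
        return "not a support vector"
    if a >= C - tol:
        return "error support vector (multiplier at bound C)"
    return "margin support vector (0 < multiplier < C)"

def equality_constraint_holds(alphas, alpha_stars, tol=1e-9):
    """Check the dual constraint sum_i (alpha_i - alpha*_i) = 0."""
    return abs(sum(a - s for a, s in zip(alphas, alpha_stars))) <= tol

# Made-up multipliers for four training points, with C = 1.
C = 1.0
alphas = [0.0, 0.4, 1.0, 0.0]
alpha_stars = [0.0, 0.0, 0.0, 1.4]   # last value is infeasible on purpose

for a, s in zip(alphas[:3], alpha_stars[:3]):
    print(point_role(a, s, C))
print(equality_constraint_holds(alphas, alpha_stars))  # True
```

The last pair is included only to show that the equality constraint alone does not guarantee feasibility: the box constraint (26) must be checked as well.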


Figure 14. Linear regression machine. Support vectors are encircled, the margin/tube is shown with dashed lines and the estimated function is shown with a solid line

The value of the bias b can be calculated from a non-bound support vector, i.e. a non-error support vector. The equations to use are constraints (24) taken as equalities:

(27) $b = y_i - w \cdot s_i - \varepsilon$  if $\alpha_i \ne 0$ and $\alpha_i \ne C$

(28) $b = y_i - w \cdot s_i + \varepsilon$  if $\alpha^*_i \ne 0$ and $\alpha^*_i \ne C$
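Equations (27) and (28) translate directly into code for the linear case. In the sketch below w, the multipliers and the sample points are assumed values chosen so that the underlying function is f(x) = 2x + 1 with a band of half-width 0.5; the function name and the tolerance are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bias_from_support_vector(w, s, y, eps, alpha, alpha_star, C, tol=1e-9):
    """Compute b from one non-bound support vector s with target y."""
    if tol < alpha < C - tol:          # point on the upper edge, eq. (27)
        return y - dot(w, s) - eps
    if tol < alpha_star < C - tol:     # point on the lower edge, eq. (28)
        return y - dot(w, s) + eps
    raise ValueError("not a non-bound support vector")

w, eps, C = (2.0,), 0.5, 10.0

# Support vector on the upper edge: y = f(1) + eps = 3.5
b_upper = bias_from_support_vector(w, (1.0,), 3.5, eps, 0.3, 0.0, C)
# Support vector on the lower edge: y = f(2) - eps = 4.5
b_lower = bias_from_support_vector(w, (2.0,), 4.5, eps, 0.0, 0.7, C)

print(b_upper, b_lower)  # 1.0 1.0
```

In practice the values obtained from different non-bound support vectors agree only up to numerical tolerance, so averaging them is a common precaution.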

In case all support vectors are errors (very unusual and, in any case, most probably a bad solution), the calculation of b is much more complex, and can be done during the optimisation itself.

Likewise, we can define a non-linear mapping from input space to a feature space, where the algorithm will try to find the flattest function approximating the data well enough. The ‘flat’ property can usually be seen in the corresponding non-linear curve in input space: its shape is the one with the smallest slope through the point set. The mapping concept and its development are similar to those described in previous sections: a kernel function K(x,y) performs all operations implicitly in feature space, usually a space of much higher dimension (see figures 15a and 15b). Therefore, the non-linear problem is defined as: maximize

$L_D = -\frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} (\alpha_i - \alpha^*_i)(\alpha_j - \alpha^*_j) K(x_i, x_j) - \varepsilon \sum_{i=1}^{M} (\alpha_i + \alpha^*_i) + \sum_{i=1}^{M} y_i (\alpha_i - \alpha^*_i)$

subject to:

$\sum_{i=1}^{M} (\alpha_i - \alpha^*_i) = 0$

(29) $\alpha_i, \alpha^*_i \in [0, C]$

Figure 15a. Non-linear regression machine. The dots follow the sinc(x) function, the dashed lines are the $\varepsilon$-tube, and the solid line is the function estimated by the SVRM. Note that the support vectors are those corresponding to the 3 tangent points (1 in the middle, x = 0, and the other two at the limits)

Figure 15b. Non-linear regression machine. The dots follow the same sinc(x) function, and the other elements follow the figure 15a notation. Note that as the $\varepsilon$-tube narrows, the function estimation gets more accurate. At the limit, if the noise allowance approaches 0, the function estimation error will also be 0 in this example

The support vector expansion of w and the estimated function are written:

(30) $w = \sum_{i=1}^{N} (\alpha_i - \alpha^*_i) \Phi(s_i)$

(31) $f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha^*_i) K(s_i, x) + b$

where $s_i$ are the resulting N support vectors and M is the size of the complete training set. It has been observed, through a number of experiments [Osuna and Girosi, 1998], that SVRM tend to use a relatively low number of support vectors compared to other similar machine learning processes. The reason could be the flexibility allowed while errors are below a threshold, which generates simpler surfaces that need fewer support vectors to define them. Moreover, it has been shown that the algorithm works well when a non-linear kernel is applied in spite of having few training data. Other well-known methods will easily overfit the data, while the SVRM dynamically controls its generalization ability, generating a hypothesis simple enough to model the training data distribution better.

2.2.2 Cost Functions and $\nu$-SVRM

The cost function is one of the key elements in SVRM. As was said in the previous section, real data is usually acquired with a certain amount of noise of unknown distribution. The cost function is in charge of accepting small noise deviations and penalizing wide deviations, whether they are caused by noise or by a hypothesis that is currently too simple. The point is how to tell noise apart from hypothesis complexity. Nevertheless, this function must satisfy certain properties. For the problem to remain usefully solvable, the cost function must be convex, thus maintaining problem convexity and assuring solution existence, uniqueness and globality. Moreover, for the mathematical development to remain simple, it is required to be symmetric and to have at most two discontinuities in the first derivative, at $\pm\varepsilon$, with $\varepsilon \ge 0$. Therefore, even if we knew the noise distribution, it would be too complex to introduce that additional information into the algorithm. We would then have to find a convex cost function that adjusts to the noise distribution, and we would still be using an approximation. Not to mention the mathematical development for the new cost function, notably difficult for non-expert mathematicians. The conclusion is: just use a general purpose cost function and let the SVRM automatic learning do the engineering job. The development described in the previous subsection refers to the $\varepsilon$-insensitive cost function, which is the most commonly used, and is defined as:

(32) $c(\xi) = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{if } |\xi| > \varepsilon \end{cases}$
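A minimal Python sketch of cost function (32), and of the way it enters an objective of the form (25) for a linear estimator, may help to see the trade-off controlled by C. The candidate (w, b), the data and all function names are assumptions for illustration; nothing is optimised here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def eps_insensitive(deviation, eps):
    """Cost (32): zero inside the band, linear outside it."""
    return max(0.0, abs(deviation) - eps)

def primal_objective(w, b, C, eps, xs, ys):
    """Regularisation term plus C times the average cost, as in (25)."""
    M = len(xs)
    flatness = 0.5 * dot(w, w)
    avg_cost = sum(eps_insensitive(y - (dot(w, x) + b), eps)
                   for x, y in zip(xs, ys)) / M
    return flatness + C * avg_cost

xs = [(0.0,), (1.0,), (2.0,)]
ys = [0.2, 1.0, 3.0]

# A steeper line fits the last point better but pays more in the
# flatness term; which candidate scores lower depends on C.
for C in (1.0, 100.0):
    flat = primal_objective((1.0,), 0.0, C, 0.5, xs, ys)
    steep = primal_objective((1.4,), 0.0, C, 0.5, xs, ys)
    print(C, flat, steep)
```

With a small C the flatter candidate has the lower objective, while a large C favours the candidate that leaves no point outside the band.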

This kind of function has an additional parameter $\varepsilon$, which adjusts the maximum allowable deviation for any given point. A validation process is required to tune this parameter, even though its value can be approximated from any additional knowledge about the noise or data distributions.

To finish the SVRM section, we will summarize a variation on the $\varepsilon$-SVRM (which uses the $\varepsilon$-insensitive cost function), called $\nu$-SVRM [Schölkopf et al, 1998]. The difference lies not in the cost function itself (which remains $\varepsilon$-insensitive), but in the objective function. The $\varepsilon$-SVRM gave the objective function as:

(33) $L_P = \frac{1}{2}\|w\|^2 + C \frac{1}{M} \sum_{i=1}^{M} (\xi_i + \xi^*_i)$

and now, in $\nu$-SVRM, the objective function is:

(34) $L_P = \frac{1}{2}\|w\|^2 + C \left( \nu\varepsilon + \frac{1}{M} \sum_{i=1}^{M} (\xi_i + \xi^*_i) \right)$

subject to the same constraints as in (24). The resulting dual formulation is: maximize

$L_D = \sum_{i=1}^{M} y_i (\alpha_i - \alpha^*_i) - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} (\alpha_i - \alpha^*_i)(\alpha_j - \alpha^*_j) K(x_i, x_j)$

subject to:

$\sum_{i=1}^{M} (\alpha_i - \alpha^*_i) = 0$

$\sum_{i=1}^{M} (\alpha_i + \alpha^*_i) \le C\nu$

(35) $\alpha_i, \alpha^*_i \in \left[0, \frac{C}{M}\right]$

leaving the estimating function in the same form as in (31). The values of b and $\varepsilon$ can be calculated after training using constraints (24) for non-bound support vectors. If the value of $\varepsilon$ increases, then the first term of the cost in (34), $\nu\varepsilon$, increases proportionally, while the second term decreases, as some points benefit from the softer constraints and fall inside the bound (it also decreases in proportion to the number of these newly lucky points). For the objective function to attain the optimum, the value of $\varepsilon$ must increase until the fraction of error points (out of bounds) is less than or equal to the value of $\nu$. Therefore the new parameter $\nu$ is an upper limit on the fraction of training errors (which is related to the number of support vectors). Obviously it must satisfy $\nu \in [0, 1]$. It seems easier to pick a good $\nu$ value than a good $\varepsilon$ value. Moreover, $\nu$-SVRM is a superset of $\varepsilon$-SVRM: after training with the first method we can calculate the value of the $\varepsilon$ parameter, which can then be used in an $\varepsilon$-SVRM algorithm, giving exactly the same solution obtained in the first place.

2.3 Principal Component Analysis

Support Vector Machines (regression included) and non-linear Principal Component Analysis (PCA) were the first applications in Machine Learning developed under the idea of mapping to a high dimension space using Mercer kernels. They differ in the problem to be solved even though they use similar means. SVM is a supervised algorithm, i.e. the system state changes depending on whether the output for a given pattern is equal to the expected correct value or not. On the other hand, kernel PCA is an unsupervised algorithm, i.e. there are no labels, and the output is a covariance analysis of the training data distribution [Schölkopf et al, 1998]. PCA is an efficient method for extracting structure from the input data, and can be achieved by calculating the eigenvalues and eigenvectors of the system. Formally speaking, kernel PCA is a change of basis in input space that diagonalizes the estimate of the covariance matrix of the normalized input data, which has the form:

(36) $C = \frac{1}{M} \sum_{i=1}^{M} x_i x_i^T$

where M is the number of patterns $x_i$. The principal components are the new coordinates in the basis described by the eigenvectors, i.e. the orthogonal projections onto the eigenvectors. The eigenvalues $\lambda$ and eigenvectors V must be non-zero and satisfy $\lambda V = CV$. We introduce the usual non-linearity concept, with the mapping function $\Phi$ and its corresponding kernel. We assume there exist coefficients $\alpha_1, \ldots, \alpha_M$ such that:

(37) $V = \sum_{i=1}^{M} \alpha_i \Phi(x_i)$

and the corresponding kernel matrix K (as defined in previous sections). Then we arrive at the problem:

(38) $M \lambda \alpha = K \alpha$

$\lambda$ being the eigenvalues and $\alpha = (\alpha_1, \ldots, \alpha_M)^T$ the eigenvector coefficients. To extract the principal components for a given pattern, the projections of the data in feature space are calculated in the following form:

(39) $V^k \cdot \Phi(x) = \sum_{i=1}^{M} \alpha^k_i \, \Phi(x_i) \cdot \Phi(x) = \sum_{i=1}^{M} \alpha^k_i K(x_i, x)$

In this notation, k is a superscript denoting the k-th eigenvector and its k-th set of coefficients. Note that after the previous calculation process, k non-zero eigenvalues and eigenvectors are obtained, each of them with a set of M coefficients. To implement the kernel PCA algorithm, the following steps must be taken:

1. The kernel matrix must be calculated, of size M × M: $K_{ij} = k(x_i, x_j)$ for all $i, j \in 1, \ldots, M$. Here comes the first problem when using kernel PCA. The resources needed for any matrix calculation grow at least with the square of its size, so with current algorithms and hardware no more than 5000 data points should be used. If you are provided with more data for training (which should never be seen as an unfortunate case), a representative subset must be created heuristically.

2. Diagonalize matrix K, to calculate the eigenvalues and eigenvectors from equation (38) using traditional methods, and normalize these vectors. After this calculation we obtain the coefficients $\alpha^k = (\alpha^k_1, \ldots, \alpha^k_M)$, to be used in the projection phase.

3. To extract the non-linear principal components of a given pattern, the projections of the point onto the eigenvectors must be calculated using equation (39). The number of principal components (non-zero eigenvectors) to be used is the designer’s choice. But not all of them should be used: if they were, the process would be useless. Just choose the first k principal components, those with a significant amount of information and very little noise.

After this simple process, you get a change of data space. From input space we have moved to a k-dimensional space (k being a fraction of M), in which each dimension gives a useful feature taken from non-linear correlations in the training data set. This is a conceptual difference between SVM training and kernel PCA training: in the first case new features are implicitly generated, and many are dropped after training; in the second case k new features are explicitly generated, all of them carrying a lot of information, ordered from most important downwards. The value k has an upper bound of M, the number of training patterns. The new features are made explicit so, obviously, the number k must not be too high or the computational effort would be inefficient. Only the first components should therefore be used, those having the greatest possible variance, i.e. the largest eigenvalues, i.e. the most discriminant information.
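The three steps above can be sketched in plain Python for the first principal component only. This is a minimal illustration under several assumptions that are not in the text: an RBF kernel with `sigma = 2.0`, a five-point toy data set, power iteration as the "traditional method" for the leading eigenvector, and no centring of the data in feature space (which complete treatments of kernel PCA add). All names are hypothetical.

```python
import math

def rbf(x, y, sigma=2.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_matrix(X, k):
    """Step 1: M x M matrix with entries k(x_i, x_j)."""
    return [[k(xi, xj) for xj in X] for xi in X]

def matvec(K, v):
    return [sum(kij * vj for kij, vj in zip(row, v)) for row in K]

def top_eigenpair(K, iters=300):
    """Step 2 (leading pair only): power iteration on K."""
    n = len(K)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = matvec(K, v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    mu = sum(a * b for a, b in zip(v, matvec(K, v)))  # mu = M * lambda
    return mu, v

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (3.0, 3.0)]
K = kernel_matrix(X, rbf)
mu, a = top_eigenpair(K)

# Normalise so that the eigenvector V in feature space has unit length,
# i.e. alpha^T K alpha = 1.
alpha = [c / math.sqrt(mu) for c in a]

def project(x):
    """Step 3: projection (39) of a pattern onto the first component."""
    return sum(ai * rbf(xi, x) for ai, xi in zip(alpha, X))

print([round(project(x), 4) for x in X])
```

Further components could be obtained by deflating K and repeating the iteration, or by calling a library eigen-solver; either way the projection step stays the same.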
The usefulness of non-linear PCA in pattern recognition has been tested thoroughly, attaining classification performance as good as the best non-linear SVM and well above neural networks. The process is very simple: first calculate the projection coefficients and select the best ones; then transform all patterns (training, validation and test) explicitly into the new space; afterwards use these data to train a linear or non-linear classification machine (SVM, neural networks, decision trees, …; any of them will do), which will be the true supervised classification process. When kernel PCA is used for classification, a linear SVM is usually used for supervised training, giving enough flexibility to solve any non-linear problem. The described process is very much like a one-hidden-layer neural network in which the architecture and the first-layer weights are obtained by optimised means: the eigenvectors of the variance matrix. Also, because the new features are calculated explicitly, a multiclass SVM can be trained easily: the first layer would be common (as in neural networks), and the second layer (linear discriminant) can be calculated using the hyperplane value w (which can now be calculated because the feature space is no longer implicit), giving O(1) complexity.
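The kernel PCA procedure described above (kernel matrix, diagonalization, projection) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the parameter `gamma`, all function names, and the centring of K in feature space are choices made for this example (centring is common practice in kernel PCA but is not spelled out in the steps above).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B (assumed kernel)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_pca_fit(X, n_components, gamma=1.0):
    """Steps 1 and 2: build the M x M kernel matrix and diagonalize it."""
    M = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((M, M), 1.0 / M)
    Kc = K - one @ K - K @ one + one @ K @ one   # centring (assumption of this sketch)
    eigvals, eigvecs = np.linalg.eigh(Kc)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[order], eigvecs[:, order]
    if np.any(lam <= 1e-12):
        raise ValueError("n_components exceeds the number of non-zero eigenvalues")
    alpha = alpha / np.sqrt(lam)                 # unit-norm eigenvectors in feature space
    return {"X": X, "K": K, "alpha": alpha, "gamma": gamma}

def kernel_pca_transform(model, Z):
    """Step 3: project patterns Z onto the retained eigenvectors."""
    X, K, alpha = model["X"], model["K"], model["alpha"]
    M = X.shape[0]
    Kz = rbf_kernel(Z, X, model["gamma"])
    one = np.full((M, M), 1.0 / M)
    one_z = np.full((Z.shape[0], M), 1.0 / M)
    Kzc = Kz - one_z @ K - Kz @ one + one_z @ K @ one
    return Kzc @ alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 4))          # synthetic data, for illustration only
    model = kernel_pca_fit(X_train, n_components=10, gamma=0.5)
    features = kernel_pca_transform(model, X_train)
    print(features.shape)                        # explicit features for a linear classifier
```

The explicit feature matrix returned by `kernel_pca_transform` is what would then be passed to a linear classifier, as the text describes; the M × M matrices make the quadratic memory growth mentioned in step 1 directly visible.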

3. SVM VERSUS NEURAL NETWORKS

Neural networks (NN) have led the Machine Learning field since the 1980s thanks to the simplicity of their development and interpretation, together with a very competitive generalization ability. Nevertheless, after 20 years, design and development complexity has increased considerably in the attempt to solve secondary problems such as convergence speed, new error calculation concepts, new activation functions with additional constraints, local minimum avoidance, and so on. So many years of active research have turned NN from their initial simplicity into their current complexity, fit only for specialised engineers. Probably, as time goes by, SVM will follow a similar path, from current simplicity to some degree of complexity, needing a human expert to realise their full potential. Complexity by itself is not bad: higher method complexity usually leads to better performance or classification rates. But basic research on NN is currently scarce. For the most part, it concerns new applications in which NN give better results for specific architectures, so a qualitative jump is needed: SVM. This is quite a natural step in research, and many examples can be shown. When a technology reaches its limit, a new approach must be introduced. At first, the performance of both methods may be similar, but the new one will eventually outperform the old one. We believe we are currently at the beginning of a technology jump, so it is a good time to change sides. Throughout this chapter, the relation between SVM and NN has been widely established. It can be stated that the topology of an SVM can be represented as a one-hidden-layer perceptron. It has been demonstrated in the NN literature that the family of one-hidden-layer perceptrons can act as a universal discriminator, i.e. it can approximate any function. For a sigmoidal activation function, the similarity between NN and SVM with kernel (22) is complete (see figure 16).
Figure 16. A neural network approach for the SVM implicit architecture. Note that the layers are completely connected (although not explicitly shown for figure clarity). Also, all weights are equal to 1 except the ones connecting the hidden layer and the output layer which are equal to the corresponding support vector’s

After the SVM optimisation process, we obtain a network having d units in the first layer (input space dimension), N units in the hidden layer (number of support vectors), and one unit in the output layer (binary classifier), with weights connecting the last two layers. For other kernels the similarity is somewhat weaker, although the resulting topology is still that of a (d, N, 1) one-hidden-layer perceptron; only the kernel function makes the difference. On the other hand, the SVM with an RBF kernel is like an RBF classification network in which the clusters and their characteristics have been calculated by an automatic optimal algorithm. When a similar kernel and activation function are used, an important difference can be observed. SVM tend to have a larger number of support vectors (hidden-layer units) when facing a complex or noisy training set. Neural networks can attain similar classification performance with far fewer internal units. This is essential to the speed of the test phase, because it depends directly on the number of elements in the hidden layer: the number of multiply-add operations done in the test phase of either method is N(d + 1). The reason for this difference is mainly that the support vectors defining the hidden layer are constrained to be training points. Neural networks have no such constraint, so they need fewer elements to model the same function (they have more degrees of freedom). This does not mean that the NN solution is better; it is quicker in the test phase, and its topology is less complex, but generalization performance is not affected. Moreover, note that the errors allowed in the training phase (including those points lying inside the margin) become support vectors. When optimising complex or noisy training sets with loose error penalization, the number of training errors can be very large. But this problem was also solved during the first steps of SVM research. In [Burges, 1996] the “reduced support vector set” method is described. Given a trained SVM, this method creates a smaller support vector set representing approximately the same information as the whole support vector set.
But in this case the former constraint is eliminated, because the new virtual support vectors need not be training points. The result is very much like the NN topology. This new expansion solves the classification speed problem, making the SVM competitive against other Machine Learning methods. Nevertheless, it is seldom used because of its considerable development difficulty. Even more similarity can be found between NN and SVM classifiers that use kernel PCA as the feature extraction step. Units in the hidden layer are calculated explicitly using the eigenvector projection instead of the kernel calculation. These units are not significant training points but true features, all of which capture the concepts underlying the internal data distribution. Thus, the classifier topology should be very similar to the one generated by an experienced NN architect, because the units have heavy statistical meaning. The only flexibility a NN offers that the SVM cannot reach is the multiple-hidden-layer approach (using kernel PCA plus a non-linear SVM could give up to 2 hidden layers, but this is seldom used). In spite of the fact that a one-hidden-layer


topology is a universal discriminator, having more hidden layers can make the training process much more efficient. With that capability, there may be fewer units in the net, or convergence may be faster. But the training algorithms grow more complex, and the overfitting and local minimum problems will still be there. Therefore, the main differences between both methods are:
• Training one SVM requires much more computation resources than training one NN.
• Classification speed is usually slower in SVM.
• The SVM result is the optimum, while NN can get stuck in local minima. Therefore SVM usually outperforms NN in classification performance.
• SVM parameters are few and easy to use, while NN requires an experienced engineer to create and try the right architecture.
• SVM usually needs only one execution to give its best results, while NN usually requires many tries to give its best.

Outside the scientific community, money rules. Expert engineer time is much more expensive than computing resources, and the difference will keep growing. If Machine Learning algorithms are to be introduced massively in commercial products such as knowledge management or data mining, automatic methods must be used. In the real world, new data is always coming; new profiles arise while others are no longer valid. Neural network flexibility must be tailored by an expert to fit the current state. But, for a company, it may not be worth the cost of tailoring a Machine Learning system that will become obsolete within some months. It is unavoidable: craftsmen will eventually be replaced by machines.

4. SVM OPTIMISATION METHODS

4.1 Optimisation Methods Overview

SVM development tries to solve the problem described in (14): maximize $L_D$ with respect to the Lagrange multipliers, subject to constraints (15) and (16). When SVM appeared, the first approach to this problem was to use standard optimisation methods, such as gradient descent or quasi-Newton. These methods, long established in the mathematical literature, mainly apply complex operators over the Hessian matrix (the matrix of second partial derivatives). These one-step processes are intensive in both computation and memory. Memory requirements for the matrices are $O(M^2)$, while computational requirements are $O(M^3)$, M being the number of patterns in the training set. For instance, a 5000-point set requires 5000 × 5000 × 4 bytes = 100 Mbytes of memory using single-precision floating-point numbers. Any process over such an enormous data structure will be very inefficient, beyond the ability of many machines. The main line of research in the first years of life of SVM was the search for alternative optimisation methods, developed explicitly for SVM. Many new approaches were published before one of them pleased all researchers for its simplicity and efficiency. The main methods, in chronological order of appearance, are the following:


• The chunking method, developed by Vapnik. Points that are not support vectors do not affect the Hessian matrix calculation; therefore, if they are removed before the matrix calculation begins, the resulting Lagrange multipliers remain the same. At the same time, the matrix calculation itself becomes easier: its complexity is now $O(N^3)$, N being the number of support vectors, and N 1 r r

where f is the fraction of the image occupied by black pixels and L is the length of the image.


Case q = 1. The smallest value of the entropy corresponds to the case in which each grid block covering the pore phase is entirely filled by pore phase. The largest value corresponds to a uniform distribution. Lower and upper bounds are then as follows:

(14)  $2\ln\dfrac{L}{r} + \ln f < -\sum_{i=1}^{n(r)} \mu_i \ln \mu_i < 2\ln\dfrac{L}{r}, \qquad q = 1$

Case 0 ≤ q < 1. The smallest value χ(q, r) can take corresponds to the case in which each grid block covering the pore phase is entirely filled by pore. The largest value corresponds to a uniform distribution of pore phase over the image. Lower and upper bounds are as follows:

(15)  $\left(\dfrac{L}{r}\right)^{2(1-q)} f^{\,1-q} < \chi(q, r) < \left(\dfrac{L}{r}\right)^{2(1-q)}, \qquad 0 \le q < 1$

Case q < 0. The smallest value χ(q, r) can take corresponds to the case in which each grid block covering the pore phase is entirely filled by pore. The function is monotonically decreasing. Therefore, the value corresponding to r = 1 pixel can be selected as an upper bound.

(16)  $\left(\dfrac{L}{r}\right)^{2(1-q)} f^{\,1-q} < \chi(q, r) < L^{2(1-q)} f^{\,1-q}, \qquad q < 0$
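The bounds for q < 1 can be checked numerically on any binary image. The sketch below is an illustration only: it assumes the usual box-counting definition χ(q, r) = Σ μ_i^q over non-empty boxes of a regular grid, with μ_i the fraction of pore pixels falling in box i, and it uses a synthetic random image rather than the soil images of the case study. The function names are hypothetical.

```python
import numpy as np

def partition_function(img, r, q):
    """chi(q, r): sum of mu_i**q over the non-empty r x r boxes of a regular grid."""
    n = img.shape[0] // r
    # Sum the pore pixels (value 1) inside each r x r box.
    mass = img[:n * r, :n * r].reshape(n, r, n, r).sum(axis=(1, 3))
    mu = mass[mass > 0] / mass.sum()
    return float(np.sum(mu ** q))

def chi_bounds(L, r, f, q):
    """(lower, upper) bounds on chi(q, r) for q < 0 and for 0 <= q < 1."""
    filled = (L / r) ** (2 * (1 - q)) * f ** (1 - q)   # every occupied box full of pore
    if q < 0:
        return filled, L ** (2 * (1 - q)) * f ** (1 - q)
    if q < 1:
        return filled, (L / r) ** (2 * (1 - q))
    raise ValueError("this sketch only covers q < 1")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L = 256
    img = (rng.random((L, L)) < 0.2).astype(int)       # synthetic image, about 20% pore
    f = img.mean()
    for q in (-2.0, 0.0, 0.5):
        for r in (2, 8, 32):
            lo, hi = chi_bounds(L, r, f, q)
            chi = partition_function(img, r, q)
            print(f"q={q:+.1f} r={r:3d}  {lo:.4g} <= {chi:.4g} <= {hi:.4g}")
```

Because both bounds depend only on f, L and r, any image with the same porosity is confined to the same band, whatever its geometry.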

Having defined these bounds, we now examine their significance for extracting generalized dimensions from image data. For q > 1 and for 0 ≤ q < 1, the bounding functions, when plotted on the log-log plot used to extract the dimension, yield two parallel lines with a vertical separation of (1 − q) ln f. For q = 1, the bounding functions, when included in the plot of entropy against ln r, again yield two parallel lines of slope 2 with a separation of ln f. Thus, in these cases we reach the same impasse as with the fractal analysis: depending on f, and independently of the actual geometry considered, the data can be so constrained as to yield convincing straight-line fits with associated derived dimensions.

3.2 Gliding Box Method

The gliding-box method was originally used for lacunarity analysis (Allain and Cloitre, 1991). Later, it was modified by Cheng (1997a, 1997b) for estimating τ(q) as follows:

(17)  $\langle \tau(q) \rangle + D = -\dfrac{\log \langle M(q, r) \rangle}{\log (r / r_{min})}$

where D is the dimension of the Euclidean space in which the image is embedded (in this case D = 2) and M represents the multiplier measured on each pixel as:

(18)  $M(q, r) = \left[ \dfrac{\mu(r_{min})}{\mu(r)} \right]^{q}$

For further details see Grau et al. (2006). The advantage of using Equation (17) in comparison with Equation (9) is that the estimation is independent of box size r, which allows the use of only two successive box sizes to estimate τ(q). Equation (18) imposes that $\mu(r_{min})$ should not be null. Once this estimation is done, Equation (8) can be applied to estimate Dq. For the case of q = 1 the following relationship is applied, based on the work given in (Saucier and Muller, 1999):

(19)  $\hat{D}_1 = 2D_2 - D_3$

4. IMAGES FOR THE CASE STUDY

Three soil samples were selected to represent different void pattern distributions in soils and a wide range of porosity values, from 5% to 47%. Each sample was prepared for image analysis following the procedure described by Protz and VandenBygaart (1998). The data were obtained by imaging thin sections with a Kodak 460 RGB camera using transmitted and circularly polarized illumination. The data were cropped from 3060 × 2036 pixels to 3000 × 2000 pixels. Then the data were classified with the EASI/PACE software and the void bitmap was separated; each individual pixel measured 186 × 186 microns. The images of these soils are shown in Figure 3. To avoid any interference from edge effects in the calculations using the box-counting method, an area of 1024 × 1024 pixels in the upper left corner of the original images was selected.

5. RESULTS OF THE CASE STUDY AND DISCUSSION

5.1 Generating Function with the Box-counting Method

For the three binary images, χ(q, r) was calculated and then a bi-log plot of χ(q, r) versus r was made to observe its behavior. All plots showed a clear pattern in the data. In Figure 2, for example, at negative q there were two distinct regions, one where there was a linear relationship between log r and log χ(q, r) and another where the value of log χ(q, r) was almost constant with respect to log r. The box size at which the behavior changes is around 64 pixels for the three images. These two phases were not evident with positive q values (see Figure 4). The existence of a plateau phase of log χ(q, r) can be explained by the nature of the measure under consideration. At r values close to 1, the variation in the number of black pixels is based on a few pixels, the simplest case being r = 1, where the measure can only take the value 0 or 1. Thus, for small boxes of size r the proportions among their values are mainly constant. However, when the box size exceeds a certain size a scaling pattern begins.

Figure 3. Soil binary images, pore phase in black pixels, of: (a) ADS, (b) BUSO and (c) EHV1. The images have 5.65%, 19.17% and 46.67% porosity, respectively. (Panels labelled Sample A, Sample B and Sample C.)

Figure 4. Bi-log plot of χ(q, r) versus box size r at different mass exponents q (from −10 to 10): A) ADS; B) BUSO; C) EVH1. (Axes: Log(r) horizontal, Log χ(q, r) vertical.)

5.2 Generalized Dimensions Using the Box-counting Method

If all of the regression points are considered, the Dq values obtained, mainly for q < 0, are quite different from those obtained if only the regression points in the linear range are chosen (Figure 5). Between both criteria, any Dq can be obtained, but for q ≥ 0 the differences are not significant. Many authors have pointed out this fact since the first applications of multifractal analysis to experimental results (Tarquis et al., 2005). The changes in Dq, very noticeable in this case, make impossible any comparison or calculation of the amplitude of the dimensions $D_{-10} - D_{+10}$, as has been used in several works. The differences found among the Dq curves (Figure 5, filled circles) lie mainly in the negative part. In particular, comparing ADS (Figure 5A, filled circles) with the rest, it is evident that it does not show multifractal behavior. All the D0 values obtained are equal to 2 (the dimension of the plane). This overestimation is due to the fact that the studied range was selected to give an optimum fit for all the q values. However, looking at the lower and upper bounds of the box-counting plots for q = 0 (Figure 6), it is quite clear that regardless of the structure in the image a linear fit will be obtained with a high r². The standard errors (data not shown) of the Dq values obtained in the linear phase are minimal and the r² of the regression analysis is very high. However, this is not surprising if we realize that only three points are being used. In addition, the number of boxes of each size is very low: for an image of 1024 × 1024 pixels, which is considered a representative elementary area (VandenBygaart and Protz, 1999), there are 64 boxes of size 128 × 128 pixels and 16 boxes of size 256 × 256 pixels. This size restriction is avoided by using the gliding box method, whose results are discussed in the next section.
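The box counts quoted above follow directly from the image and box sizes, and the same arithmetic shows why gliding boxes ease the restriction. The short sketch below is illustrative; the function names are hypothetical, and the gliding count assumes one box per admissible top-left pixel, which is one common convention and not necessarily the exact one used by the authors.

```python
def grid_boxes(L, r):
    """Number of non-overlapping r x r boxes in an L x L image (box-counting grid)."""
    return (L // r) ** 2

def gliding_boxes(L, r):
    """Number of r x r boxes when the box glides one pixel at a time (assumed convention)."""
    return (L - r + 1) ** 2

if __name__ == "__main__":
    L = 1024
    for r in (64, 128, 256):
        print(f"r = {r:3d}: grid = {grid_boxes(L, r):4d}, gliding = {gliding_boxes(L, r)}")
```

For r = 128 and r = 256 the grid counts are 64 and 16, as stated in the text, whereas the gliding counts remain in the hundreds of thousands.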

5.3 Generalized Dimensions Using the Gliding Box Method

For the three binary images, ⟨M(q, r)⟩ was calculated and then a bi-log plot of ⟨M(q, r)⟩ versus r/rmin was made. All plots showed a linear relationship, as expected, with a substantial number of points from which to calculate a linear regression and estimate Dq from the slope of the line (Figure 4). In the case of EHV1 for q < −6 (Figure 4A), the linear relationship is not as clear as in the rest of the images. Finally, the Dq values obtained by both methods can be compared in Figure 5. In all of the graphs, Dq appears again with a value of 2, imposed by the gliding box method as explained in section 3.2. For ADS (Figure 5A) both curves are similar. The range of values for Dq has been changed on purpose, to show that the visual effect could lead to an error in our conclusions, when in Figure 3 it was evident that Dq was an almost constant value.
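The gliding-box estimate described in this section can be sketched as follows. This is a hedged illustration, not the procedure of Cheng (1997a, 1997b) or Grau et al. (2006) in detail: it assumes that the multiplier is the ratio of the mass in a box of size rmin to the mass in the box of size r anchored at the same pixel, raised to the power q, that the average runs over positions whose small box is non-empty, and that the slope of log⟨M⟩ against log(r/rmin) equals −(τ(q) + D). All function names are hypothetical.

```python
import numpy as np

def box_sums(img, r):
    """Mass of every r x r box, one box per top-left pixel (gliding boxes)."""
    c = np.pad(img.astype(float), ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    return c[r:, r:] - c[:-r, r:] - c[r:, :-r] + c[:-r, :-r]

def mean_multiplier(img, r, r_min, q):
    """<M(q, r)> with M = (mu(r_min) / mu(r))**q over non-empty small boxes."""
    big = box_sums(img, r)
    # Keep only the anchors for which the larger box also fits in the image.
    small = box_sums(img, r_min)[:big.shape[0], :big.shape[1]]
    ok = small > 0
    return float(np.mean((small[ok] / big[ok]) ** q))

def tau_plus_D(img, sizes, q):
    """Estimate tau(q) + D as minus the slope of log<M> versus log(r / r_min)."""
    r_min = sizes[0]
    x = np.log(np.array(sizes[1:], dtype=float) / r_min)
    y = np.log([mean_multiplier(img, r, r_min, q) for r in sizes[1:]])
    return float(-np.polyfit(x, y, 1)[0])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = (rng.random((128, 128)) < 0.3).astype(int)   # synthetic binary image
    for q in (-2, 2, 4):
        print(q, tau_plus_D(img, [1, 2, 4, 8, 16], q))
```

For a completely filled image the mass of a box is r², so the sketch returns τ(q) + D = 2q, i.e. τ(q) = 2(q − 1), consistent with a plane of dimension 2.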

Figure 5. Generalized dimensions (Dq) from q = −10 to q = +10 for all points of the regression line (filled square) and for the three selected points based on bi-log plot of χ(q, r) (filled circles) of each image: A) ADS; B) BUSO and C) EVH1. (Axes: q horizontal, Dq vertical.)


The differences between both methods for BUSO and EVH1 (Figure 5B and 5C respectively) are larger at negative q values, although at positive values Dq shows a stronger decay (Grau et al., 2006).

6. CONCLUSIONS

Over the last years, the concepts of fractals and multifractals have been increasingly applied in the analysis of porous materials, including soils, and in the development of fractal models of porous media. In terms of modeling, it is important to characterize the multiscale heterogeneity of soil structure in a useful way, but the blind application of these analyses does not achieve this.

Figure 6. Box counting plots for EHV1 soil images, q = 0, with upper and lower bounds: (a) solid phase, (b) pore phase. (Axes: log r horizontal, log N vertical.) (From Bird et al., J. of Hydrol., 322, 211, 2006. With permission)

Figure 7. Bi-log plot of ⟨M(r, q)⟩ versus box size ratio r/rmin at different mass exponents q (from −10 to 10): A) ADS; B) BUSO; C) EVH1. (Axes: Log(r/rmin) horizontal, Log(⟨M(r, q)⟩) vertical.)

Figure 8. Generalized dimensions (Dq) from q = −10 to q = +10 based on the box-gliding method (empty square) and based on the box-counting method (filled circles) using the same range of box sizes: A) ADS; B) BUSO; C) EVH1. (Axes: q horizontal, Dq vertical.)

The results obtained by the “box-counting” and “gliding-box” methods for multifractal modeling of soil pore images show that “gliding-box” provides more consistent results, as it creates a greater number of large boxes than the box-counting method and avoids the restriction that the box-counting method imposes on the partition function.

7. ACKNOWLEDGEMENTS

We thank Dr Richard Heck of Guelph University for the soil images. We are very indebted to Dr. N. Bird, Dr. Q. Cheng and Dr. D. Gimenez for helpful discussions. This work was supported by Technical University of Madrid (UPM) and Madrid Autonomous Community (CAM), Project No. M050020163.

REFERENCES

Aharony, A., 1990, Multifractals in physics – successes, dangers and challenges, Physica A. 168: 479–489.
Ahammer, H., De Vaney, T.T.J. and Tritthart, H.A., 2003, How much resolution is enough? Influence of downscaling the pixel resolution of digital images on the generalised dimensions, Physica D. 181 (3–4):147–156.
Allain, C. and Cloitre, M., 1991, Characterizing the lacunarity of random and deterministic fractal sets, Physical Review A. 44:3552–3558.
Anderson, A.N., McBratney, A.B. and FitzPatrick, E.A., 1996, Soil Mass, Surface, and Spectral Fractal Dimensions Estimated from Thin Section Photographs, Soil Sci. Soc. Am. J. 60:962–969.
Anderson, A.N., McBratney, A.B. and Crawford, J.W., Applications of fractals to soil studies. Adv. Agron., 63:1, 1998.
Barnsley, M.F., Devaney, R.L., Mandelbrot, B.B., Peitgen, H.O., Saupe, D. and Voss, R.F., 1988, The Science of Fractal Images. Edited by H.O. Peitgen and D. Saupe, Springer-Verlag, New York.
Bartoli, F., Philippy, R., Doirisse, S., Niquet, S. and Dubuit, M., 1991, Structure and self-similarity in silty and sandy soils; the fractal approach, J. Soil Sci. 42:167–185.
Bartoli, F., Bird, N.R., Gomendy, V., Vivier, H. and Niquet, S., 1999, The relation between silty soil structures and their mercury porosimetry curve counterparts: fractals and percolation, Eur. J. Soil Sci., 50(9).
Bartoli, F., Dutartre, P., Gomendy, V., Niquet, S. and Vivier, H., 1998. Fractal and soil structures. In: Fractals in Soil Science, Baveye, Parlange and Stewart, Eds., CRC Press, Boca Raton, 203–232.
Baveye, P. and Boast, C.W. Fractal Geometry, Fragmentation Processes and the Physics of Scale-Invariance: An Introduction. In Fractals in Soil Science, Baveye, Parlange and Stewart, Eds., CRC Press, Boca Raton, 1998, 1.
Baveye, P., Boast, C.W., Ogawa, S., Parlange, J.Y. and Steenhuis, T., 1998. Influence of image resolution and thresholding on the apparent mass fractal characteristics of preferential flow patterns in field soils, Water Resour. Res. 34, 2783–2796.
Bird, N., Díaz, M.C., Saa, A. and Tarquis, A.M., 2006. Fractal and Multifractal Analysis of Pore-Scale Images of Soil. J. Hydrol, 322, 211–219.
Bird, N.R.A., Perrier, E. and Rieu, M., 2000. The water retention function for a model of soil structure with pore and solid fractal distributions. Eur. J. Soil Sci. 51, 55–63.
Bird, N.R.A. and Perrier, E.M.A., 2003. The pore-solid fractal model of soil density scaling. Eur. J. Soil Sci. 54, 467–476.
Booltink, H.W.G., Hatano, R. and Bouma, J., 1993. Measurement and simulation of bypass flow in a structured clay soil; a physico-morphological approach. J. Hydrol. 148, 149–168.


Brakensiek, D.L., W.J. Rawls, S.D. Logsdon and Edwards, W.M., 1992. Fractal description of macroporosity. Soil Sci. Soc. Am. J. 56, 1721–1723.
Buczkowski, S., Hildgen, P. and Cartilier, L. 1998. Measurements of fractal dimension by box-counting: a critical analysis of data scatter. Physica A 252, 23–34.
Cheng, Q. and Agterberg, F.P. (1996). Comparison between two types of multifractal modeling. Mathematical Geology, 28(8), 1001–1015.
Cheng, Q. (1997a). Discrete multifractals. Mathematical Geology, 29(2), 245–266.
Cheng, Q. (1997b). Multifractal modeling and lacunarity analysis. Mathematical Geology, 29(7), 919–932.
Crawford, J.W., Baveye, P., Grindrod, P. and Rappoldt, C. Application of Fractals to Soil Properties, Landscape Patterns, and Solute Transport in Porous Media, in Assessment of Non-Point Source Pollution in the Vadose Zone. Geophysical Monograph 108, Corwin, Loague and Ellsworth, Eds., American Geophysical Union, Washington, DC, 1999, 151.
Crawford, J.W., Ritz, K. and Young, I.M. Quantification of fungal morphology, gaseous transport and microbial dynamics in soil: an integrated framework utilising fractal geometry. Geoderma, 56, 1578, 1993.
Crawford, J.W., Matsui, N. and Young, I.M., 1995. The relation between the moisture-release curve and the structure of soil. Eur. J. Soil Sci. 46, 369–375.
Dathe, A., Eins, S., Niemeyer, J. and Gerold, G. The surface fractal dimension of the soil-pore interface as measured by image analysis. Geoderma, 103, 203, 2001.
Dathe, A., Tarquis, A.M. and Perrier, E., 2006. Multifractal analysis of the pore- and solid-phases in binary two-dimensional images of natural porous structures. Geoderma, doi:10.1016/j.geoderma.2006.03.024, in press.
Dathe, A. and Thullner, M., 2005. The relationship between fractal properties of solid matrix and pore space in porous media. Geoderma, 129, 279–290.
Feder, J., 1989. Fractals. Plenum Press, New York. 283pp.
Flury, M. and Fluhler, H., 1994. Brilliant blue FCF as a dye tracer for solute transport studies – A toxicological overview. J. Environ. Qual. 23, 1108–1112.
Flury, M. and Fluhler, H., 1995. Tracer characteristics of brilliant blue. Soil Sci. Soc. Am. J. 59, 22–27.
Flury, M., Fluhler, H., Jury, W.A. and Leuenberger, J., 1994. Susceptibility of soils to preferential flow of water: A field study, Water Resour. Res. 30, 1945–1954.
Giménez, D., R.R. Allmaras, E.A. Nater and Huggins, D.R., 1997a. Fractal dimensions for volume and surface of interaggregate pores – scale effects. Geoderma 77, 19–38.
Giménez, D., Perfect, E., Rawls, W.J. and Pachepsky, Y., 1997b. Fractal models for predicting soil hydraulic properties: a review. Eng. Geol. 48, 161–183.
Gouyet, J.G. Physics and Fractal Structures. Masson, Paris, 1996.
Grau, J., Méndez, V., Tarquis, A.M., Díaz, M.C. and A. Saa, 2006. Comparison of gliding box and box-counting methods in soil image analysis. Geoderma, doi:10.1016/j.geoderma.2006.03.009, in press.
Griffith, D.A. Advanced Spatial Statistics. Kluwer Academic Publishers, Boston, 1988.
Hallett, P.D., Bird, N.R.A., Dexter, A.R. and Seville, P.K., 1998. Investigation into the fractal scaling of the structure and strength of soil aggregates. Eur. J. Soil Sci. 49, 203–211.
Hatano, R. and Booltink, H.W.G., 1992. Using Fractal Dimensions of Stained Flow Patterns in a Clay Soil to Predict Bypass Flow. J. Hydrol. 135, 121–131.
Hatano, R., Kawamura, N., Ikeda, J. and Sakuma, T. Evaluation of the effect of morphological features of flow paths on solute transport by using fractal dimensions of methylene blue staining patterns. Geoderma 53, 31, 1992.
Hentschel, H.G.R. and Procaccia, I. (1983). The infinite number of generalized dimensions of fractals and strange attractors. Physica D, 8, 435, 1983.
Kaye, B.G. A Random Walk through Fractal Dimensions. VCH Verlagsgesellschaft, Weinheim, Germany, 1989, 297.
Mandelbrot, B.B. The Fractal Geometry of Nature. W.H. Freeman, San Francisco, CA, 1982.
McCauley, J.L., 1992. Models of permeability and conductivity of porous media. Physica A 187, 18–54.


Moran, C.J., McBratney, A.B. and Koppi, A.J., 1989. A rapid method for analysis of soil macropore structure. I. Specimen preparation and digital binary production. Soil Sci. Soc. Am. J. 53, 921–928.
Muller, J., 1996. Characterization of pore space in chalk by multifractal analysis. J. Hydrology, 187, 215–222.
Muller, J. and McCauley, J.L., 1992. Implication of Fractal Geometry for Fluid Flow Properties of Sedimentary Rocks. Transp. Porous Media 8, 133–147.
Muller, J., Huseby, O.K. and Saucier, A., 1995. Influence of Multifractal Scaling of Pore Geometry on Permeabilities of Sedimentary Rocks. Chaos, Solitons & Fractals 5, 1485–1492.
Ogawa, S., Baveye, P., Boast, C.W., Parlange, J.Y. and Steenhuis, T. Surface fractal characteristics of preferential flow patterns in field soils: evaluation and effect of image processing. Geoderma, 88, 109, 1999.
Oleschko, K., Fuentes, C., Brambila, F. and Alvarez, R. Linear fractal analysis of three Mexican soils in different management systems. Soil Technol., 10, 185, 1997.
Oleschko, K. Delesse principle and statistical fractal sets: 1. Dimensional equivalents. Soil & Tillage Research, 49, 255, 1998a.
Oleschko, K., Brambila, F., Aceff, F. and Mora, L.P. From fractal analysis along a line to fractals on the plane. Soil & Tillage Research, 45, 389, 1998b.
Orbach, R. Dynamics of fractal networks. Science (Washington, DC) 231, 814, 1986.
Pachepsky, Y.A., Yakovchenko, V., Rabenhorst, M.C., Pooley, C. and Sikora, L.J. Fractal parameters of pore surfaces as derived from micromorphological data: effect of long term management practices. Geoderma, 74, 305, 1996.
Pachepsky, Y.A., Giménez, D., Crawford, J.W. and Rawls, W.J. Conventional and fractal geometry in soil science. In Fractals in Soil Science, Pachepsky, Crawford and Rawls, Eds., Elsevier Science, Amsterdam, 2000, 7.
Persson, M., Yasuda, H., Albergel, J., Berndtsson, R., Zante, P., Nasri, S. and Öhrström, P., 2001. Modeling plot scale dye penetration by a diffusion limited aggregation (DLA) model. J. Hydrol. 250, 98–105.
Peyton, R.L., Gantzer, C.J., Anderson, S.H., Haeffner, B.A. and Pfeifer, P. Fractal dimension to describe soil macropore structure using X ray computed tomography. Water Resource Research, 30, 691, 1994.
Posadas, A.N.D., Giménez, D., Quiroz, R. and Protz, R., 2003. Multifractal Characterization of Soil Pore Spatial Distributions. Soil Sci. Soc. Am. J. 67, 1361–1369.
Protz, R. and VandenBygaart, A.J., 1998. Towards systematic image analysis in the study of soil micromorphology. Science Soils, 3. (available online at http://link.springer.de/link/service/journals/).
Ripley, B.D. Statistical Inference for Spatial Processes, Cambridge Univ. Press, Cambridge, 1988.
Saucier, A. Effective permeability of multifractal porous media. Physica A, 183, 381, 1992.
Saucier, A. and Muller, J. Remarks on some properties of multifractals. Physica A, 199, 350, 1993.
Saucier, A. and Muller, J. Textural analysis of disordered materials with multifractals. Physica A, 267, 221, 1999.
Saucier, A., Richer, J. and Muller, J., 2002. Statistical mechanics and its applications. Physica A, 311 (1–2): 231–259.
Takayasu, H. Fractals in the Physical Sciences. Manchester University Press, Manchester, 1990.
Tarquis, A.M., Giménez, D., Saa, A., Díaz, M.C. and Gascó, J.M., 2003. Scaling and Multiscaling of Soil Pore Systems Determined by Image Analysis. In: Scaling Methods in Soil Physics, Pachepsky, Radcliffe and Selim Eds., CRC Press, 434 pp.
Tarquis, A.M., McInnes, K.J., Keys, J., Saa, A., García, M.R. and Díaz, M.C., 2006. Multiscaling Analysis In A Structured Clay Soil Using 2D Images. J. Hydrol, 322, 236–246.
Tel, T. and Vicsek, T., 1987. Geometrical multifractality of growing structures, J. Physics A. General, 20, L835–L840.
VandenBygaart, A.J. and Protz, R., 1999. The representative elementary area (REA) in studies of quantitative soil micromorphology. Geoderma 89, 333–346.

PRE-PROCESSING TOOL FOR COMPUTATIONAL INTELLIGENCE APPLICATION

213

Vicsek, T. 1990. Mass multifractals. Physica A, 168, 490–497. Vogel, H.J. and Kretzschmar, A., 1996. Topological characterization of pore space in soil-sample preparation and digital image-processing. Geoderma 73, 23–38.


CONTRIBUTING AUTHORS

D. Andina, J. I. Seijas, J. Torres-García, M. J. Alarcón, A. Tarquis, J. B. Grau and J. M. Antón work for the Technical University of Madrid (UPM), Spain, where they form the Group for Automation and Soft Computing (GASC).

D. T. Pham, P. T. N. Pham, M. S. Packianather and A. A. Afify work for Cardiff University.

Javier Ropero Peláez and José Roberto Castillo Piqueira work for Escola Politecnica da Universidade de Sao Paulo, Departamento de Engenharia de Telecomunicaçoes e Controle, Brazil.

A. Gallardo Antolín, J. Pascual García and J. L. Sancho Gómez work for University Carlos III of Madrid, Spain.

A. Vega-Corona, V. Méndez and J. Gómez Sáenz de Tejada work for the University of Guanajuato, Mexico, the Technical University of Madrid and the Universidad Autónoma of Madrid, Spain, respectively.

PREFACE

This book presents a selected collection of contributions on a focused treatment of important elements of Computational Intelligence. Unlike traditional computing, Computational Intelligence (CI) is tolerant of imprecise information, partial truth and uncertainty. The principal components of CI that currently have frequent application in Engineering and Manufacturing are: Neural Networks (NN), fuzzy logic (FL) and Support Vector Machines (SVM). In CI, NN and SVM are concerned with learning, while FL is concerned with imprecision and reasoning.

This volume mainly covers a key element of Computational Intelligence: learning. All the contributions in this volume have a direct relevance to neural network learning, from neural computing fundamentals to advanced networks such as Multilayer Perceptrons (MLP) and Radial Basis Function Networks (RBF), and their relations with fuzzy set and support vector machine theory. The book also discusses different applications in Engineering and Manufacturing. These are among the applications where CI has excellent potential for use.

Both novice and expert readers should find this book a useful reference in the field of Computational Intelligence. The editors and the authors hope to have contributed to the field by paving the way for learning paradigms to solve real-world problems.

D. Andina

ACKNOWLEDGEMENTS

This document has been produced with the financial assistance of the European Community, ALFA project II-0026-FA. The views expressed herein are those of the Authors and can therefore in no way be taken to reflect the official opinion of the European Community.

The editors wish to thank Dr A. Afify of Cardiff University and Mr A. Jevtic of the Technical University of Madrid for their support and helpful comments during the revision of this text. The editors also wish to thank Nagib Callaos, President of the International Institute of Informatics and Systemics, IIIS, for his permission and freedom to reproduce in Chapters 2 and 4 of this book contents from the book by D. Andina and F. Ballesteros (Eds), “Recent Advances in Neural Networks”, Ed. IIIS press, ILL, USA (2000).

CHAPTER 1 SOFT COMPUTING AND ITS APPLICATIONS IN ENGINEERING AND MANUFACTURE

D. T. PHAM, P. T. N. PHAM, M. S. PACKIANATHER, A. A. AFIFY Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 3AA, United Kingdom

INTRODUCTION

Soft computing is a recent term for a computing paradigm that has been in existence for almost fifty years. This chapter reviews five soft computing tools. They are: knowledge-based systems, fuzzy logic, inductive learning, neural networks and genetic algorithms. All of these tools have found many practical applications. Examples of applications in engineering and manufacture will be given in the chapter.

1. KNOWLEDGE-BASED SYSTEMS

Knowledge-based systems, or expert systems, are computer programs embodying knowledge about a narrow domain for solving problems related to that domain. An expert system usually comprises two main elements, a knowledge base and an inference mechanism. The knowledge base contains domain knowledge which may be expressed as any combination of “IF-THEN” rules, factual statements (or assertions), frames, objects, procedures and cases. The inference mechanism is that part of an expert system which manipulates the stored knowledge to produce solutions to problems. Knowledge manipulation methods include the use of inheritance and constraints (in a frame-based or object-oriented expert system), the retrieval and adaptation of case examples (in a case-based expert system) and the application of inference rules such as modus ponens (If A Then B; A Therefore B) and modus tollens (If A Then B; NOT B Therefore NOT A) according to “forward chaining” or “backward chaining” control procedures and “depth-first” or “breadth-first” search strategies (in a rule-based expert system).

D. Andina and D.T. Pham (eds.), Computational Intelligence, 1–38. © 2007 Springer.

With forward chaining or data-driven inferencing, the system tries to match available facts with the IF portion of the IF-THEN rules in the knowledge base. When matching rules are found, one of them is “fired”, i.e. its THEN part is made true, generating new facts and data which in turn cause other rules to “fire”. Reasoning stops when no more new rules can fire.

In backward chaining or goal-driven inferencing, a goal to be proved is specified. If the goal cannot be immediately satisfied by existing facts in the knowledge base, the system will examine the IF-THEN rules for rules with the goal in their THEN portion. Next, the system will determine whether there are facts that can cause any of those rules to fire. If such facts are not available they are set up as subgoals. The process continues recursively until either all the required facts are found and the goal is proved or any one of the subgoals cannot be satisfied, in which case the original goal is disproved.

Both control procedures are illustrated in Figure 1. Figure 1a shows how, given the assertion that a lathe is a machine tool and a set of rules concerning machine tools, a forward-chaining system will generate additional assertions such as “a lathe is power driven” and “a lathe has a tool holder”. Figure 1b details the backward-chaining sequence producing the answer to the query “does a lathe require a power source?”.

In the forward chaining example of Figure 1a, both rules R2 and R3 simultaneously qualify for firing when inferencing starts as both their IF parts match the presented fact F1. Conflict resolution has to be performed by the expert system to decide which rule should fire. The conflict resolution method adopted in this example is “first come, first served”: R2 fires as it is the first qualifying rule encountered. Other conflict resolution methods include “priority”, “specificity” and “recency”.

The search strategies can also be illustrated using the forward chaining example of Figure 1a. Suppose that, in addition to F1, the knowledge base also initially contains the assertion “a CNC turning centre is a machine tool”. Depth-first search involves firing rules R2 and R3 with X instantiated to “lathe” (as shown in Figure 1a) before firing them again with X instantiated to “CNC turning centre”. Breadth-first search will activate rule R2 with X instantiated to “lathe” and again with X instantiated to “CNC turning centre”, followed by rule R3 and the same sequence of instantiations. Breadth-first search finds the shortest line of inferencing between a start position and a solution if it exists. When guided by heuristics to select the correct search path, depth-first search might produce a solution more quickly, although the search might not terminate if the search space is infinite [Jackson, 1999].

For more information on the technology of expert systems, see [Pham and Pham, 1988; Durkin, 1994; Giarratano and Riley, 1998; Darlington, 1999; Jackson, 1999; Badiru and Cheung, 2002; Nurminen et al., 2003].

Most expert systems are nowadays developed using programs known as “shells”. These are essentially ready-made expert systems complete with inferencing and knowledge storage facilities but without the domain knowledge. Some sophisticated expert systems are constructed with the help of “development environments”. The latter are more flexible than shells in that they also provide means for users to implement their own inferencing and knowledge representation methods. More details on expert system shells and development environments can be found in [Price, 1990].


KNOWLEDGE BASE (Initial State)
  Fact:   F1 - A lathe is a machine tool
  Rules:  R1 - If X is power driven Then X requires a power source
          R2 - If X is a machine tool Then X has a tool holder
          R3 - If X is a machine tool Then X is power driven

      ↓ F1 & R2 match

KNOWLEDGE BASE (Intermediate State)
  Facts:  F1 - A lathe is a machine tool
          F2 - A lathe has a tool holder
  Rules:  R1, R2 and R3 as in the initial state

      ↓ F1 & R3 match

KNOWLEDGE BASE (Intermediate State)
  Facts:  F1 - A lathe is a machine tool
          F2 - A lathe has a tool holder
          F3 - A lathe is power driven
  Rules:  R1, R2 and R3 as in the initial state

      ↓ F3 & R1 match

KNOWLEDGE BASE (Final State)
  Facts:  F1 - A lathe is a machine tool
          F2 - A lathe has a tool holder
          F3 - A lathe is power driven
          F4 - A lathe requires a power source
  Rules:  R1, R2 and R3 as in the initial state

Figure 1a. An example of forward chaining
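The forward-chaining sequence of Figure 1a can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any particular expert system shell: the tuple encoding of facts and rules and the function name `forward_chain` are assumptions made for the example, and conflict resolution is the “first come, first served” method described in the text.

```python
# Hypothetical encoding of the knowledge base in Figure 1a.
# Each rule is (name, IF-predicate, THEN-predicate) over a single variable X.
RULES = [
    ("R1", "is power driven", "requires a power source"),
    ("R2", "is a machine tool", "has a tool holder"),
    ("R3", "is a machine tool", "is power driven"),
]


def forward_chain(facts, rules):
    """Fire rules until no rule can add a new fact.

    facts is a list of (subject, predicate) pairs. On every cycle the first
    rule, in listed order, that can derive a new fact is fired
    ("first come, first served" conflict resolution).
    """
    facts = list(facts)
    trace = []  # names of the rules in the order they fired
    while True:
        fired = False
        for name, condition, conclusion in rules:
            for subject, predicate in facts:
                new_fact = (subject, conclusion)
                if predicate == condition and new_fact not in facts:
                    facts.append(new_fact)
                    trace.append(name)
                    fired = True
                    break
            if fired:
                break
        if not fired:
            return facts, trace


facts, trace = forward_chain([("lathe", "is a machine tool")], RULES)
for label, (subject, predicate) in enumerate(facts, start=1):
    print(f"F{label} - A {subject} {predicate}")
print("Firing order:", trace)
```

With the single initial fact F1, the sketch fires R2, then R3, then R1, and ends with the four facts of the final state in Figure 1a.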


KNOWLEDGE BASE (Initial State)
  Fact:   F1 - A lathe is a machine tool
  Rules:  R1 - If X is power driven Then X requires a power source
          R2 - If X is a machine tool Then X has a tool holder
          R3 - If X is a machine tool Then X is power driven
GOAL STACK
  Goal                                    Satisfied
  G1 - A lathe requires a power source    ?

      ↓ G1 & R1

KNOWLEDGE BASE (Intermediate State)
  Fact:   F1 - A lathe is a machine tool
  Rules:  R1, R2 and R3 as in the initial state
GOAL STACK
  Goal                                    Satisfied
  G1 - A lathe requires a power source    ?
  G2 - A lathe is power driven            ?

      ↓ G2 & R3

KNOWLEDGE BASE (Intermediate State)
  Fact:   F1 - A lathe is a machine tool
  Rules:  R1, R2 and R3 as in the initial state
GOAL STACK
  Goal                                    Satisfied
  G1 - A lathe requires a power source    ?
  G2 - A lathe is power driven            ?
  G3 - A lathe is a machine tool          ?

      ↓

KNOWLEDGE BASE (Intermediate State)
  Fact:   F1 - A lathe is a machine tool
  Rules:  R1, R2 and R3 as in the initial state
GOAL STACK
  Goal                                    Satisfied
  G1 - A lathe requires a power source    ?
  G2 - A lathe is power driven            ?
  G3 - A lathe is a machine tool          Yes

      ↓ F1 & R3

KNOWLEDGE BASE (Intermediate State)
  Facts:  F1 - A lathe is a machine tool
          F2 - A lathe is power driven
  Rules:  R1, R2 and R3 as in the initial state
GOAL STACK
  Goal                                    Satisfied
  G1 - A lathe requires a power source    ?
  G2 - A lathe is power driven            Yes

      ↓ F2 & R1

KNOWLEDGE BASE (Final State)
  Facts:  F1 - A lathe is a machine tool
          F2 - A lathe is power driven
          F3 - A lathe requires a power source
  Rules:  R1, R2 and R3 as in the initial state
GOAL STACK
  Goal                                    Satisfied
  G1 - A lathe requires a power source    Yes

Figure 1b. An example of backward chaining
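The goal-driven sequence of Figure 1b can be sketched as a small recursive procedure. The encoding below is an assumption made for illustration (the function name `prove`, the tuple format and the depth limit are not from the text); it handles only single-condition rules over one variable, as in the lathe example.

```python
# Hypothetical encoding of the rules in Figure 1b:
# (name, IF-predicate, THEN-predicate) over a single variable X.
RULES = [
    ("R1", "is power driven", "requires a power source"),
    ("R2", "is a machine tool", "has a tool holder"),
    ("R3", "is a machine tool", "is power driven"),
]


def prove(goal, facts, rules, depth=0, max_depth=20):
    """Try to prove goal = (subject, predicate) by backward chaining.

    A goal is satisfied if it is already a known fact, or if some rule has
    the goal's predicate in its THEN part and the rule's IF part can itself
    be proved as a subgoal. Proved goals are added to facts (a set).
    """
    if goal in facts:
        return True
    if depth >= max_depth:  # guard against endless recursion on cyclic rules
        return False
    subject, predicate = goal
    for _name, condition, conclusion in rules:
        if conclusion == predicate:
            subgoal = (subject, condition)
            if prove(subgoal, facts, rules, depth + 1, max_depth):
                facts.add(goal)
                return True
    return False


facts = {("lathe", "is a machine tool")}
answer = prove(("lathe", "requires a power source"), facts, RULES)
print("Does a lathe require a power source?", "Yes" if answer else "Not proved")
```

Starting from G1, the sketch sets up “a lathe is power driven” and then “a lathe is a machine tool” as subgoals, which mirrors the goal stack of Figure 1b; a goal that no rule or fact supports simply returns `False`.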


Among the five tools considered in this chapter, expert systems are probably the most mature, with many commercial shells and development tools available to facilitate their construction. Consequently, once the domain knowledge to be incorporated in an expert system has been extracted, the process of building the system is relatively simple. The ease with which expert systems can be developed has led to a large number of applications of the tool. In engineering, applications can be found for a variety of tasks including selection of materials, machine elements, tools, equipment and processes, signal interpreting, condition monitoring, fault diagnosis, machine and process control, machine design, process planning, production scheduling and system configuring. Some recent examples of specific tasks undertaken by expert systems are:
• identifying and planning inspection schedules for critical components of an offshore structure [Peers et al., 1994];
• automating the evaluation of manufacturability in CAD systems [Venkatachalam, 1994];
• choosing an optimal robot for a particular task [Kamrani et al., 1995];
• monitoring the technical and organisational problems of vehicle maintenance in coal mining [Streichfuss and Burgwinkel, 1995];
• configuring paper feeding mechanisms [Koo and Han, 1996];
• training technical personnel in the design and evaluation of energy cogeneration plants [Lara Rosano et al., 1996];
• storing, retrieving and adapting planar linkage designs [Bose et al., 1997];
• designing additive formulae for engine oil products [Shi et al., 1997];
• carrying out automatic remeshing during a finite-elements analysis of forging deformation [Yano et al., 1997];
• designing of products and their assembly processes [Zha et al., 1998];
• modelling and control of combustion processes [Kalogirou, 2003];
• optimising the transient performances in the adaptive control of a planar robot [De La Sen et al., 2004].

2. FUZZY LOGIC

A disadvantage of ordinary rule-based expert systems is that they cannot handle new situations not covered explicitly in their knowledge bases (that is, situations not fitting exactly those described in the “IF” parts of the rules). These rule-based systems are completely unable to produce conclusions when such situations are encountered. They are therefore regarded as shallow systems which fail in a “brittle” manner, rather than exhibit a gradual reduction in performance when faced with increasingly unfamiliar problems, as human experts would. The use of fuzzy logic [Zadeh, 1965] which reflects the qualitative and inexact nature of human reasoning can enable expert systems to be more resilient. With fuzzy logic, the precise value of a variable is replaced by a linguistic description, the meaning of which is represented by a fuzzy set, and inferencing is carried out based on this representation.

Fuzzy set theory may be considered an extension of classical set theory. While classical set theory is about “crisp” sets with sharp boundaries, fuzzy set theory is concerned with “fuzzy” sets whose boundaries are “grey”. In classical set theory, an element $u_i$ can either belong or not belong to a set $A$, i.e. the degree to which element $u_i$ belongs to set $A$ is either 1 or 0. However, in fuzzy set theory, the degree of belonging of an element $u_i$ to a fuzzy set $\tilde{A}$ is a real number between 0 and 1. This is denoted by $\mu_{\tilde{A}}(u_i)$, the grade of membership of $u_i$ in $\tilde{A}$. Fuzzy set $\tilde{A}$ is a fuzzy set in $U$, the “universe of discourse” or “universe” which includes all objects to be discussed. $\mu_{\tilde{A}}(u_i)$ is 1 when $u_i$ is definitely a member of $\tilde{A}$ and $\mu_{\tilde{A}}(u_i)$ is 0 when $u_i$ is definitely not a member of $\tilde{A}$. For instance, a fuzzy set defining the term “normal room temperature” might be:

$$
\begin{aligned}
\text{normal room temperature} \equiv{} & 0.0/\text{below } 10\,^{\circ}\mathrm{C} + 0.3/10\,^{\circ}\mathrm{C}\text{–}16\,^{\circ}\mathrm{C} + 0.8/16\,^{\circ}\mathrm{C}\text{–}18\,^{\circ}\mathrm{C} \\
& + 1.0/18\,^{\circ}\mathrm{C}\text{–}22\,^{\circ}\mathrm{C} + 0.8/22\,^{\circ}\mathrm{C}\text{–}24\,^{\circ}\mathrm{C} + 0.3/24\,^{\circ}\mathrm{C}\text{–}30\,^{\circ}\mathrm{C} \\
& + 0.0/\text{above } 30\,^{\circ}\mathrm{C}
\end{aligned}
\tag{1}
$$

The values 0.0, 0.3, 0.8 and 1.0 are the grades of membership to the given fuzzy set of temperature ranges below 10 °C (above 30 °C), between 10 °C and 16 °C (24 °C–30 °C), between 16 °C and 18 °C (22 °C–24 °C) and between 18 °C and 22 °C. Figure 2(a) shows a plot of the grades of membership for “normal room temperature”. For comparison, Figure 2(b) depicts the grades of membership for a crisp set defining room temperatures in the normal range.

Knowledge in an expert system employing fuzzy logic can be expressed as qualitative statements (or fuzzy rules) such as “If the room temperature is normal, then set the heat input to normal”, where “normal room temperature” and “normal heat input” are both fuzzy sets. A fuzzy rule relating two fuzzy sets $\tilde{A}$ and $\tilde{B}$ is effectively the Cartesian product $\tilde{A} \times \tilde{B}$, which can be represented by a relation matrix $\tilde{R}$. Element $R_{ij}$ of $\tilde{R}$ is the membership to $\tilde{A} \times \tilde{B}$ of pair $(u_i, v_j)$, $u_i \in \tilde{A}$ and $v_j \in \tilde{B}$. $R_{ij}$ is given by:

$$
R_{ij} = \min\left(\mu_{\tilde{A}}(u_i),\ \mu_{\tilde{B}}(v_j)\right)
\tag{2}
$$

Figure 2. (a) Fuzzy set of “normal temperature” (b) Crisp set of “normal temperature” (both panels plot the grade of membership µ against temperature in °C)

For example, with “normal room temperature” defined as before and “normal heat input” described by:

$$
\text{normal heat input} \equiv 0.2/1\,\text{kW} + 0.9/2\,\text{kW} + 0.2/3\,\text{kW}
\tag{3}
$$

$\tilde{R}$ can be computed as:

$$
\tilde{R} =
\begin{bmatrix}
0.0 & 0.0 & 0.0 \\
0.2 & 0.3 & 0.2 \\
0.2 & 0.8 & 0.2 \\
0.2 & 0.9 & 0.2 \\
0.2 & 0.8 & 0.2 \\
0.2 & 0.3 & 0.2 \\
0.0 & 0.0 & 0.0
\end{bmatrix}
\tag{4}
$$
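The relation matrix of Equation (4) follows directly from Equation (2): each element is the minimum of one temperature grade and one heat-input grade. The short Python sketch below reproduces it; the list names and the helper `relation_matrix` are hypothetical, and the grades are taken from Equations (1) and (3).

```python
# Grades of membership taken from Equations (1) and (3).
# Rows of R follow the seven temperature ranges, columns the three heat inputs.
normal_room_temperature = [0.0, 0.3, 0.8, 1.0, 0.8, 0.3, 0.0]
normal_heat_input = [0.2, 0.9, 0.2]  # 1 kW, 2 kW, 3 kW


def relation_matrix(mu_a, mu_b):
    """Return R with R[i][j] = min(mu_a[i], mu_b[j]), as in Equation (2)."""
    return [[min(a, b) for b in mu_b] for a in mu_a]


R = relation_matrix(normal_room_temperature, normal_heat_input)
for row in R:
    print(row)
```

Each printed row corresponds to one row of the matrix in Equation (4).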

A reasoning procedure known as the compositional rule of inference, which is the equivalent of the modus-ponens rule in rule-based expert systems, enables conclusions to be drawn by generalisation (extrapolation or interpolation) from the qualitative information stored in the knowledge base. For instance, when the room temperature is detected to be “slightly below normal”, a temperature-controlling fuzzy expert system might deduce that the heat input should be set to “slightly above normal”. Note that this conclusion might not be contained in any of the fuzzy rules stored in the system.

A well-known compositional rule of inference is the max-min rule. Let $\tilde{R}$ represent the fuzzy rule “If $\tilde{A}$ Then $\tilde{B}$” and $\tilde{a} \equiv \sum_i \mu_i/u_i$ a fuzzy assertion. $\tilde{A}$ and $\tilde{a}$ are fuzzy sets in the same universe of discourse. The max-min rule enables a fuzzy conclusion $\tilde{b} \equiv \sum_j \lambda_j/v_j$ to be inferred from $\tilde{a}$ and $\tilde{R}$ as follows:

$$
\tilde{b} = \tilde{a} \circ \tilde{R}
\tag{5}
$$

$$
\lambda_j = \max_i \left[ \min\left(\mu_i,\ R_{ij}\right) \right]
\tag{6}
$$

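For example, given the fuzzy rule “If the room temperature is normal, then set the heat input to normal” where “normal room temperature” and “normal heat input” are as defined previously, and a fuzzy temperature measurement of

$$
\begin{aligned}
\text{temperature} \equiv{} & 0.0/\text{below } 10\,^{\circ}\mathrm{C} + 0.4/10\,^{\circ}\mathrm{C}\text{–}16\,^{\circ}\mathrm{C} + 0.8/16\,^{\circ}\mathrm{C}\text{–}18\,^{\circ}\mathrm{C} \\
& + 0.8/18\,^{\circ}\mathrm{C}\text{–}22\,^{\circ}\mathrm{C} + 0.2/22\,^{\circ}\mathrm{C}\text{–}24\,^{\circ}\mathrm{C} + 0.0/24\,^{\circ}\mathrm{C}\text{–}30\,^{\circ}\mathrm{C} \\
& + 0.0/\text{above } 30\,^{\circ}\mathrm{C}
\end{aligned}
\tag{7}
$$

the heat input will be deduced as:

$$
\text{heat input} = \text{temperature} \circ \tilde{R} = 0.2/1\,\text{kW} + 0.8/2\,\text{kW} + 0.2/3\,\text{kW}
\tag{8}
$$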
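The deduction in Equation (8) can be checked numerically. The sketch below rebuilds the relation matrix from the grades in Equations (1) and (3) and then applies the max-min rule of Equation (6) to the measurement of Equation (7); the function and variable names are hypothetical choices for this example.

```python
# Grades from Equations (1), (3) and (7); the ranges are in the same order
# as in the text (below 10 C, 10-16, 16-18, 18-22, 22-24, 24-30, above 30).
normal_room_temperature = [0.0, 0.3, 0.8, 1.0, 0.8, 0.3, 0.0]
normal_heat_input = [0.2, 0.9, 0.2]  # 1 kW, 2 kW, 3 kW
measured_temperature = [0.0, 0.4, 0.8, 0.8, 0.2, 0.0, 0.0]

# Relation matrix of the fuzzy rule, R[i][j] = min(mu_A(u_i), mu_B(v_j)).
R = [[min(a, b) for b in normal_heat_input] for a in normal_room_temperature]


def max_min_composition(a, relation):
    """Return b with b[j] = max over i of min(a[i], relation[i][j])."""
    n_columns = len(relation[0])
    return [
        max(min(a_i, row[j]) for a_i, row in zip(a, relation))
        for j in range(n_columns)
    ]


heat_input = max_min_composition(measured_temperature, R)
print(heat_input)  # grades for 1 kW, 2 kW and 3 kW
```

The three resulting grades are 0.2, 0.8 and 0.2, in agreement with Equation (8). Because only `min` and `max` are used, the grades are selected from the inputs rather than computed, so no rounding is involved.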

For further information on fuzzy logic, see [Kaufmann, 1975; Klir and Yuan, 1995; 1996; Ross, 1995; Zimmermann, 1996; Dubois and Prade, 1998].

Fuzzy logic potentially has many applications in engineering where the domain knowledge is usually imprecise. Notable successes have been achieved in the area of process and machine control although other sectors have also benefited from this tool. Recent examples of engineering applications include:
• controlling the height of the arc in a welding process [Bigand et al., 1994];
• controlling the rolling motion of an aircraft [Ferreiro Garcia, 1994];
• controlling a multi-fingered robot hand [Bas and Erkmen, 1995];
• analysing the chemical composition of minerals [Da Rocha Fernandes and Cid Bastos, 1996];
• monitoring of tool-breakage in end-milling operations [Chen and Black, 1997];
• modelling of the set-up and bend sequencing process for sheet metal bending [Ong et al., 1997];
• determining the optimal formation of manufacturing cells [Szwarc et al., 1997; Zülal and Arikan, 2000];
• classifying discharge pulses in electrical discharge machining [Tarng et al., 1997];
• modelling an electrical drive system [Costa Branco and Dente, 1998];
• improving the performance of hard disk drive final assembly [Zhao and De Souza, 1998; 2001];
• analysing chatter occurring during a machine tool cutting process [Kong et al., 1999];
• addressing the relationships between customer needs and design requirements [Sohen and Choi, 2001; Vanegas and Labib, 2001; Karsak, 2004];
• assessing and selecting advanced manufacturing systems [Karsak and Kuzgunkaya, 2002; Bozdağ et al., 2003; Beskese et al., 2004; Kulak and Kahraman, 2004];
• evaluating cutting force uncertainty in turning [Wang et al., 2002];
• reducing defects in automotive coating operations [Lou and Huang, 2003].

3. INDUCTIVE LEARNING

The acquisition of domain knowledge to build into the knowledge base of an expert system is generally a major task. In some cases, it has proved a bottleneck in the construction of an expert system. Automatic knowledge acquisition techniques have been developed to address this problem. Inductive learning is an automatic technique for knowledge acquisition. The inductive approach produces a structured representation of knowledge as the outcome of learning. Induction involves generalising a set of examples to yield a selected representation which can be in terms of a set of rules, concepts or logical inferences or a decision tree. An inductive learning program usually requires as input a set of examples. Each example is characterised by the values of a number of attributes and the class to which it belongs. In one approach to inductive learning, through a process of “dividing-and-conquering” where attributes are chosen according to some strategy (for example, to maximise the information gain) to divide the original example set into subsets, the inductive learning program builds a decision tree that correctly classifies the given example set. The tree represents the knowledge generalised from the specific examples in the set. This can subsequently be used to handle situations not explicitly covered by the example set. In another approach known as the “covering approach”, the inductive learning program attempts to find groups of attributes uniquely shared by examples in given classes and forms rules with the IF part as conjunctions of those attributes and the THEN part as the classes. The program removes correctly classified examples from consideration and stops when rules have been formed to classify all examples in the given set. A new approach to inductive learning, “inductive logic programming”, is a combination of induction and logic programming. 
Unlike conventional inductive learning which uses propositional logic to describe examples and represent new concepts, inductive logic programming (ILP) employs the more powerful predicate logic to represent training examples and background knowledge and to express new concepts. Predicate logic permits the use of different forms of training examples and background knowledge. It enables the results of the induction process, that is the induced concepts, to be described as general first-order clauses with variables and not just as zero-order propositional clauses made up of attribute-value pairs. There are two main types of ILP systems: the first is based on the top-down generalisation/specialisation method, and the second on the principle of inverse resolution [Muggleton, 1992; Lavrac, 1994].

A number of inductive learning programs have been developed. Some of the well known programs are CART [Breiman et al., 1998], ID3 and its descendants C4.5 and C5.0 [Quinlan, 1983; 1986; 1993; ISL, 1998; RuleQuest, 2000] which are divide-and-conquer programs, the AQ family of programs [Michalski, 1969; 1990; Michalski et al., 1986; Cervone et al., 2001; Michalski and Kaufman, 2001] which follow the covering approach, the FOIL program [Quinlan, 1990; Quinlan and Cameron-Jones, 1995] which is an ILP system adopting the generalisation/specialisation method and the GOLEM program [Muggleton and Feng, 1990] which is an ILP system based on inverse resolution. Although most programs only generate crisp decision rules, algorithms have also been developed to produce fuzzy rules [Wang and Mendel, 1992; Janikow, 1998; Hang and Chen, 2000; Baldwin and Martin, 2001; Wang et al., 2001; Baldwin and Karale, 2003; Wang et al., 2003].

Figure 3 shows the main steps in RULES-3 Plus, an induction algorithm in the covering category [Pham and Dimov, 1997] and belonging to the RULES family of rule extraction systems [Pham and Aksoy, 1994; 1995a; 1995b; Pham et al., 2000; Pham et al., 2003; Pham and Afify, 2005a]. The simple problem of detecting the state of a metal cutting tool is used to explain the operation of RULES-3 Plus. Three sensors are employed to monitor the cutting process and, according to the signals obtained from them (1 or 0 for sensors 1 and 3; −1, 0, or 1 for sensor 2), the tool is inferred as being “normal” or “worn”. Thus, this problem involves three attributes, which are the states of sensors 1, 2 and 3, and the signals that they emit constitute the values of those attributes. The example set for the problem is given in Table 1.

Table 1. Training set for the Cutting Tool problem

Example   Sensor_1   Sensor_2   Sensor_3   Tool State
1         0          −1         0          Normal
2         1          0          0          Normal
3         1          −1         1          Worn
4         1          0          1          Normal
5         0          0          1          Normal
6         1          1          1          Worn
7         1          −1         0          Normal
8         0          −1         1          Worn
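As an aside on the divide-and-conquer approach mentioned earlier, where attributes are chosen to maximise the information gain, the sketch below computes the information gain of each sensor on the training set of Table 1. It is an illustration of that selection criterion only, not of RULES-3 Plus; the entropy-based definition of gain used here is the usual one from decision tree induction and is an assumption as far as this chapter is concerned, as are all function and variable names.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("Sensor_1", "Sensor_2", "Sensor_3")

# Training set of Table 1: (Sensor_1, Sensor_2, Sensor_3, tool state).
EXAMPLES = [
    (0, -1, 0, "Normal"),
    (1, 0, 0, "Normal"),
    (1, -1, 1, "Worn"),
    (1, 0, 1, "Normal"),
    (0, 0, 1, "Normal"),
    (1, 1, 1, "Worn"),
    (1, -1, 0, "Normal"),
    (0, -1, 1, "Worn"),
]


def entropy(examples):
    """Shannon entropy (in bits) of the class labels of the examples."""
    counts = Counter(example[-1] for example in examples)
    total = len(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(examples, attribute_index):
    """Reduction in entropy obtained by splitting on one attribute."""
    subsets = {}
    for example in examples:
        subsets.setdefault(example[attribute_index], []).append(example)
    total = len(examples)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(examples) - remainder


gains = {name: information_gain(EXAMPLES, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: gain = {gain:.4f}")
print("Attribute chosen for the first split:", best)
```

On this small data set a divide-and-conquer program using this criterion would split first on Sensor_2, since two of its three values already isolate examples of a single class.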

Step 1. Take an unclassified example and form array SETAV.
Step 2. Initialise arrays PRSET and T_PRSET (PRSET and T_PRSET will consist of mPRSET expressions with null conditions and zero H measures) and set nco = 0.
Step 3. IF nco < na
        THEN nco = nco + 1 and set m = 0;
        ELSE the example itself is taken as a rule and STOP.
Step 4. DO
          m = m + 1;
          Specialise expression m in PRSET by appending to it a condition from SETAV that differs from the conditions already included in the expression;
          Compute the H measure for the expression;
          IF its H measure is higher than the H measure of any expression in T_PRSET
          THEN replace the expression having the lowest H measure with the newly formed expression;
          ELSE discard the new expression;
        WHILE m < mPRSET.
Step 5. IF there are consistent expressions in T_PRSET
        THEN choose as a rule the expression that has the highest H measure and discard the others;
        ELSE copy T_PRSET into PRSET; initialise T_PRSET and go to step 3.

Figure 3. Rule forming procedure of RULES-3 Plus
Notes: nco – number of conditions; na – number of attributes; mPRSET – number of expressions stored in PRSET (mPRSET is user-provided); T_PRSET – a temporary array of partial rules of the same dimension as PRSET

In step 1, example 1 is used to form the attribute-value array SETAV which will contain the following attribute-value pairs: [Sensor_1 = 0], [Sensor_2 = −1] and [Sensor_3 = 0]. In step 2, the partial rule set PRSET and T_PRSET, the temporary version of PRSET used for storing partial rules in the process of rule construction, are initialised. This creates for each of these sets three expressions having null conditions and zero H measures. The H measure for an expression is defined as:

$$
H = \sqrt{\frac{E^c}{E}} \left[ 2 - 2\sqrt{\frac{E_i^c}{E^c}\,\frac{E_i}{E}} - 2\sqrt{\left(1 - \frac{E_i^c}{E^c}\right)\left(1 - \frac{E_i}{E}\right)} \right]
\tag{9}
$$

where $E^c$ is the number of examples covered by the expression (the total number of examples correctly classified and misclassified by a given rule), $E$ is the total number of examples, $E_i^c$ is the number of examples covered by the expression and belonging to the target class $i$ (the number of examples correctly classified by a given rule), and $E_i$ is the number of examples in the training set belonging to the target class $i$. In Equation (9), the first term

$$
G = \sqrt{\frac{E^c}{E}}
\tag{10}
$$

relates to the generality of the rule and the second term

$$
A = 2 - 2\sqrt{\frac{E_i^c}{E^c}\,\frac{E_i}{E}} - 2\sqrt{\left(1 - \frac{E_i^c}{E^c}\right)\left(1 - \frac{E_i}{E}\right)}
\tag{11}
$$

indicates its accuracy. In steps 3 and 4, by specialising PRSET using the conditions stored in SETAV, the following expressions are formed and stored in T_PRSET:

1. Sensor_3 = 0 ⇒ Alarm = OFF     H = 0.2565
2. Sensor_2 = −1 ⇒ Alarm = OFF    H = 0.0113
3. Sensor_1 = 0 ⇒ Alarm = OFF     H = 0.0012
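The H values of these three expressions can be reproduced from Equation (9) and the training set of Table 1. The sketch below is a minimal illustration under two assumptions that are not stated explicitly in the text: “Alarm = OFF” is taken to correspond to the tool state “Normal” (and “Alarm = ON” to “Worn”), and the counts are taken over all eight examples. The function name `h_measure` and the dictionary encoding of conditions are hypothetical.

```python
from math import sqrt

# Training set of Table 1: (Sensor_1, Sensor_2, Sensor_3, tool state).
EXAMPLES = [
    (0, -1, 0, "Normal"),
    (1, 0, 0, "Normal"),
    (1, -1, 1, "Worn"),
    (1, 0, 1, "Normal"),
    (0, 0, 1, "Normal"),
    (1, 1, 1, "Worn"),
    (1, -1, 0, "Normal"),
    (0, -1, 1, "Worn"),
]


def h_measure(conditions, target, examples=EXAMPLES):
    """H measure of Equation (9) for the expression 'conditions => target'.

    conditions maps an attribute index (0, 1 or 2 for Sensor_1..Sensor_3)
    to the value required by the expression.
    """
    covered = [
        ex for ex in examples
        if all(ex[index] == value for index, value in conditions.items())
    ]
    e_total = len(examples)                                       # E
    e_covered = len(covered)                                      # E^c
    e_class = sum(1 for ex in examples if ex[3] == target)        # E_i
    e_covered_class = sum(1 for ex in covered if ex[3] == target) # E_i^c
    if e_covered == 0:
        return 0.0
    generality = sqrt(e_covered / e_total)                        # Equation (10)
    accuracy = (                                                  # Equation (11)
        2
        - 2 * sqrt((e_covered_class / e_covered) * (e_class / e_total))
        - 2 * sqrt((1 - e_covered_class / e_covered) * (1 - e_class / e_total))
    )
    return generality * accuracy


for label, conditions in [("Sensor_3 = 0", {2: 0}),
                          ("Sensor_2 = -1", {1: -1}),
                          ("Sensor_1 = 0", {0: 0})]:
    print(f"{label} => Alarm = OFF   H = {h_measure(conditions, 'Normal'):.4f}")
```

Rounded to four decimal places, the three values agree with those listed above, which supports the reading of Equation (9) with square roots on the generality and accuracy terms.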

In step 5, a rule is produced as the first expression in T_PRSET applies to only one class:

Rule 1   IF Sensor_3 = 0 THEN Alarm = OFF    H = 0.2565

Rule 1 can classify examples 2 and 7 in addition to example 1. Therefore, these examples are marked as classified and the induction proceeds. In the second iteration, example 3 is considered. T_PRSET, formed in step 4 after specialising the initial PRSET, now consists of the following expressions:

1. Sensor_3 = 1 ⇒ Alarm = ON     H = 0.0406
2. Sensor_2 = −1 ⇒ Alarm = ON    H = 0.0079
3. Sensor_1 = 1 ⇒ Alarm = ON     H = 0.0005

As none of the expressions covers only one class, T_PRSET is copied into PRSET (step 5) and the new PRSET has to be specialised further by appending the existing expressions with conditions from SETAV. Therefore the procedure returns to step 3 for a new pass. The new T_PRSET formed at the end of step 4 contains the following three expressions:

1. Sensor_2 = −1 AND Sensor_3 = 1 ⇒ Alarm = ON    H = 0.3876
2. Sensor_1 = 1 AND Sensor_3 = 1 ⇒ Alarm = ON     H = 0.0534
3. Sensor_1 = 1 AND Sensor_2 = −1 ⇒ Alarm = ON    H = 0.0008

As the first expression applies to only one class, the following rule is obtained:

Rule 2   IF Sensor_2 = −1 AND Sensor_3 = 1 THEN Alarm = ON    H = 0.3876

Rule 2 can classify examples 3 and 8, which again are marked as classified. In the third iteration, example 4 is used to obtain the next rule:

Rule 3   IF Sensor_2 = 0 THEN Alarm = OFF    H = 0.2565

This rule can classify examples 4 and 5 and so they are also marked as classified. In iteration 4, the last unclassified example 6 is employed for rule extraction, yielding:

Rule 4   IF Sensor_2 = 1 THEN Alarm = ON    H = 0.2741

There are no remaining unclassified examples in the example set and the procedure terminates at this point.

Due to its requirement for a set of examples in a rigid format (with known attributes and of known classes), inductive learning has found rather limited applications in engineering as not many engineering problems can be described in terms of such a set of examples. Another reason for the paucity of applications is that inductive learning is generally more suitable for problems where attributes have discrete or symbolic values than for those with continuous-valued attributes as in many engineering problems.
Some recent examples of applications of inductive learning are:
• controlling a laser cutting robot [Luzeaux, 1994];
• controlling the functional electrical stimulation of spinally-injured humans [Kostov et al., 1995];
• modelling job complexity in clothing production systems [Hui et al., 1997];
• analysing the constructability of a beam in a reinforced-concrete frame [Skibniewski et al., 1997];
• analysing the results of tests on portable electronic products to discover useful design knowledge [Zhou, 2001];
• accelerating rotogravure printing [Evans and Fisher, 2002];
• predicting JIT factory performance from past data that includes both good and poor factory performance [Mathieu et al., 2002];
• developing an intelligent monitoring system for improving the reliability of a manufacturing process [Peng, 2004];
• analysing data in a steel bar manufacturing company to help intelligent decision making [Pham et al., 2004].

More information on inductive learning techniques and their applications in engineering and manufacture can be found in [Pham et al., 2002; Pham and Afify, 2005b].

4.

NEURAL NETWORKS

Like inductive learning programs, neural networks can capture domain knowledge from examples. However, they do not archive the acquired knowledge in an explicit form such as rules or decision trees, and they can readily handle both continuous and discrete data. Like fuzzy expert systems, they also have a good generalisation capability.

A neural network is a computational model of the brain. Neural network models usually assume that computation is distributed over several simple units called neurons, which are interconnected and operate in parallel (hence, neural networks are also called parallel-distributed-processing systems or connectionist systems). Figure 4 illustrates a typical model of a neuron. The output signal y_j is a function f of the sum of weighted input signals x_i. The activation function f can be a linear, simple threshold, sigmoidal, hyperbolic tangent or radial basis function. Instead of being deterministic, f can be a probabilistic function, in which case y_j will be a binary quantity, for example, +1 or −1. The net input to such a stochastic neuron (that is, the sum of weighted input signals x_i) will then give the probability of y_j being +1 or −1.

How the inter-neuron connections are arranged and the nature of the connections determine the structure of a network. How the strengths of the connections are adjusted or trained to achieve a desired overall behaviour of the network is governed by its learning algorithm. Neural networks can be classified according to their structures and learning algorithms.

In terms of their structures, neural networks can be divided into two types: feedforward networks and recurrent networks. Feedforward networks can perform a static mapping between an input space and an output space: the output at a given instant is a function only of the input at that instant.
The most popular feedforward neural network is the multi-layer perceptron (MLP): all signals flow in a single direction from the input to the output of the network. Figure 5 shows an MLP with three layers: an input layer, an output layer and an intermediate or hidden layer. Neurons in the input layer only act as buffers for distributing the input signals x_i to neurons in the hidden layer. Each neuron j in the hidden layer operates according to the model of Figure 4. That is, its output y_j is given by:

$$y_j = f\Big(\sum_i w_{ji}\, x_i\Big) \qquad (12)$$

Figure 4. Model of a neuron (inputs x_1, ..., x_i, ..., x_n weighted by w_j1, ..., w_ji, ..., w_jn, summed and passed through f(.) to give the output y_j)
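The neuron model of Equation (12) and the layer-by-layer evaluation of a three-layer MLP can be sketched as follows. This is a minimal illustration: the function names, the choice of a sigmoid as the default activation and the numerical values are assumptions, and the output layer is assumed to apply the same neuron model to the hidden-layer outputs.

```python
import math


def sigmoid(net):
    """Sigmoidal activation function."""
    return 1.0 / (1.0 + math.exp(-net))


def threshold(net):
    """Simple threshold activation: 1 for a non-negative net input, else 0."""
    return 1.0 if net >= 0.0 else 0.0


def neuron_output(weights, inputs, f=sigmoid):
    """Equation (12): y_j = f(sum_i w_ji * x_i)."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return f(net)


def mlp_forward(hidden_weights, output_weights, inputs, f=sigmoid):
    """Forward pass: input neurons only buffer the signals, so the
    hidden layer works directly on `inputs`; the output layer then
    works on the hidden-layer outputs."""
    hidden = [neuron_output(w, inputs, f) for w in hidden_weights]
    return [neuron_output(w, hidden, f) for w in output_weights]


x = [0.5, -1.0, 2.0]   # illustrative input signals
w = [0.4, 0.3, 0.1]    # illustrative weights of one neuron
print(neuron_output(w, x))                # sigmoidal neuron
print(neuron_output(w, x, f=math.tanh))   # hyperbolic tangent neuron
print(neuron_output(w, x, f=threshold))   # threshold neuron
```

Each row of `hidden_weights` and `output_weights` holds the incoming weights of one neuron, which mirrors the w_ji indexing used in Equation (12).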

Figure 5. A multi-layer perceptron (input layer x_1, x_2, ..., x_m; hidden layer; output layer y_1, ..., y_n; connection weights w_11, w_12, ..., w_1m)

The outputs of neurons in the output layer are computed similarly. Other feedforward networks [Pham and Liu, 1999] include the learning vector quantisation (LVQ) network, the cerebellar model articulation control (CMAC) network and the group-method of data handling (GMDH) network.

Recurrent networks are networks where the outputs of some neurons are fed back to the same neurons or to neurons in layers before them. Thus signals can flow in both forward and backward directions. Recurrent networks are said to have a dynamic memory: the output of such networks at a given instant reflects the current input as well as previous inputs and outputs. Examples of recurrent networks [Pham and Liu, 1999] include the Hopfield network, the Elman network and the Jordan network.

Figure 6 shows a well-known, simple recurrent neural network, the Grossberg and Carpenter ART-1 network. The network has two layers, an input layer and an output layer. The two layers are fully interconnected, with connections in both the forward (or bottom-up) direction and the feedback (or top-down) direction. The vector W_i of weights of the bottom-up connections to an output neuron i forms an exemplar of the class it represents. All the W_i vectors constitute the long-term memory of the network. They are employed to select the winning neuron, that is, the neuron whose W_i vector is most similar to the current input pattern. The vector V_i of the weights of the top-down connections from an output neuron i is used for vigilance testing, that is, determining whether an input pattern is sufficiently close to a stored exemplar. The vigilance vectors V_i form the short-term memory of the network. V_i and W_i are related in that W_i is a normalised copy of V_i, viz.

$$W_i = \frac{V_i}{\varepsilon + \sum_j V_{ji}} \qquad (13)$$

Figure 6. An ART-1 network (an input layer and an output layer linked by bottom-up weights W and top-down weights V)

where ε is a small constant and V_ji is the jth component of V_i (i.e. the weight of the connection from output neuron i to input neuron j).

Implicit “knowledge” is built into a neural network by training it. Neural networks are trained and categorised according to two main types of learning algorithms: supervised and unsupervised. In addition, there is a third type, reinforcement learning, which is a special case of supervised learning.

In supervised training, the neural network can be trained by being presented with typical input patterns and the corresponding expected output patterns. The error between the actual and expected outputs is used to modify the strengths, or weights, of the connections between the neurons. The backpropagation (BP) algorithm, a gradient descent algorithm, is the most commonly adopted MLP training algorithm. It gives the change Δw_ji in the weight of a connection between neurons i and j as follows:

$$\Delta w_{ji} = \eta\, \delta_j\, x_i \qquad (14)$$

where η is a parameter called the learning rate and δ_j is a factor depending on whether neuron j is an output neuron or a hidden neuron. For output neurons,

$$\delta_j = \frac{\partial f}{\partial net_j}\,\big(y_j^{(t)} - y_j\big) \qquad (15)$$

and for hidden neurons,

$$\delta_j = \frac{\partial f}{\partial net_j}\,\sum_q w_{qj}\, \delta_q \qquad (16)$$
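Equations (14) to (16) can be sketched for a three-layer MLP as shown below. Here net_j stands for the weighted sum of inputs to neuron j and the target vector supplies the desired outputs. The sigmoid activation (for which the derivative of f with respect to net_j equals y_j(1 − y_j)), the function name `backprop_step`, the learning rate and all numerical values are assumptions made for illustration only.

```python
import math


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


def backprop_step(W_hid, W_out, x, target, eta=0.5):
    """One weight update after a single training pattern (weights changed in place).

    W_hid[j][i]: weight from input i to hidden neuron j.
    W_out[q][j]: weight from hidden neuron j to output neuron q.
    Returns the network outputs computed before the update.
    """
    # Forward pass: every hidden and output neuron follows Equation (12).
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_hid]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W_out]

    # Equation (15), output neurons: derivative times (target - actual).
    delta_out = [yq * (1 - yq) * (tq - yq) for yq, tq in zip(y, target)]

    # Equation (16), hidden neurons: derivative times the weighted sum of
    # the deltas of the output neurons q that hidden neuron j feeds.
    delta_hid = [
        hj * (1 - hj) * sum(W_out[q][j] * delta_out[q] for q in range(len(W_out)))
        for j, hj in enumerate(h)
    ]

    # Equation (14): change in w_ji = eta * delta_j * (input i of neuron j).
    for q, row in enumerate(W_out):
        for j in range(len(row)):
            row[j] += eta * delta_out[q] * h[j]
    for j, row in enumerate(W_hid):
        for i in range(len(row)):
            row[i] += eta * delta_hid[j] * x[i]
    return y


# Illustrative weights and a single training pattern with target output 1.
W_hid = [[0.1, -0.2], [0.3, 0.4]]
W_out = [[0.2, -0.1]]
x, t = [1.0, 0.0], [1.0]
first = backprop_step(W_hid, W_out, x, t)[0]
for _ in range(200):
    last = backprop_step(W_hid, W_out, x, t)[0]
print(first, last)  # the output moves towards the target
```

The hidden-layer deltas are computed before the output weights are modified, so that Equation (16) uses the same weights as the forward pass.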


In Equation (15), net_j is the total weighted sum of input signals to neuron j and y_j^(t) is the target output for neuron j. As there are no target outputs for hidden neurons, in Equation (16), the difference between the target and actual output of a hidden neuron j is replaced by the weighted sum of the δ_q terms already obtained for neurons q connected to the output of j. Thus, iteratively, beginning with the output layer, the δ term is computed for neurons in all layers and weight updates determined for all connections.

The weight updating process can take place after the presentation of each training pattern (pattern-based training) or after the presentation of the whole set of training patterns (batch training). In either case, a training epoch is said to have been completed when all training patterns have been presented once to the MLP. For all but the most trivial problems, several epochs are required for the MLP to be properly trained. A commonly adopted method to speed up the training is to add a “momentum” term to Equation (14) which effectively lets the previous weight change influence the new weight change, viz:

$$\Delta w_{ji}(k+1) = \eta\, \delta_j\, x_i + \mu\, \Delta w_{ji}(k) \qquad (17)$$

where Δw_ji(k+1) and Δw_ji(k) are weight changes in epochs k+1 and k respectively and μ is the “momentum” coefficient.

Some neural networks are trained in an unsupervised mode where only the input patterns are provided during training and the networks learn automatically to cluster them in groups with similar features. For example, training an ART-1 network involves the following steps:

(i) initialising the exemplar and vigilance vectors W_i and V_i for all output neurons by setting all the components of each V_i to 1 and computing W_i according to Equation (13). An output neuron with all its vigilance weights set to 1 is known as an uncommitted neuron in the sense that it is not assigned to represent any pattern classes;
(ii) presenting a new input pattern x;
(iii) enabling all output neurons so that they can participate in the competition for activation;
(iv) finding the winning output neuron among the competing neurons, i.e. the neuron for which x · W_i is largest; a winning neuron can be an uncommitted neuron as is the case at the beginning of training or if there are no better output neurons;
(v) testing whether the input pattern x is sufficiently similar to the vigilance vector V_i of the winning neuron. Similarity is measured by the fraction r of bits in x that are also in V_i, viz.

$$r = \frac{x \cdot V_i}{\sum_j x_j} \qquad (18)$$

x is deemed to be sufficiently similar to V_i if r is at least equal to the vigilance threshold ρ (0 < ρ ≤ 1);


(vi) going to step (vii) if r ≥ ρ (i.e. there is resonance); else disabling the winning neuron temporarily from further competition and going to step (iv), repeating this procedure until there are no further enabled neurons;
(vii) adjusting the vigilance vector V_i of the most recent winning neuron by logically ANDing it with x, thus deleting bits in V_i that are not also in x; computing the bottom-up exemplar vector W_i using the new V_i according to Equation (13); activating the winning output neuron;
(viii) going to step (ii).

The above training procedure ensures that if the same sequence of training patterns is repeatedly presented to the network, its long-term and short-term memories are unchanged (i.e. the network is stable). Also, provided there are sufficient output neurons to represent all the different classes, new patterns can always be learnt, as a new pattern can be assigned to an uncommitted output neuron if it does not match previously stored exemplars well (i.e. the network is plastic).

In reinforcement learning, instead of requiring a teacher to give target outputs and using the differences between the target and actual outputs directly to modify the weights of a neural network, the learning algorithm employs a critic only to evaluate the appropriateness of the neural network output corresponding to a given input. According to the performance of the network on a given input vector, the critic will issue a positive or negative reinforcement signal. If the network has produced an appropriate output, the reinforcement signal will be positive (a reward). Otherwise, it will be negative (a penalty). The intention of this is to strengthen the tendency to produce appropriate outputs and to weaken the propensity for generating inappropriate outputs. Reinforcement learning is a trial-and-error operation designed to maximise the average value of the reinforcement signal for a set of training input vectors.
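Returning to the ART-1 training procedure, steps (i) to (viii) above can be sketched as follows. This is an illustrative reading of the steps rather than a reference implementation: the value of the small constant in Equation (13), the vigilance threshold, the stopping rule (training ends after a pass in which no vigilance vector changes) and the sample patterns are all assumptions. Input patterns are taken to be binary lists containing at least one 1.

```python
EPS = 0.5  # small constant of Equation (13); the value is an assumption


def exemplar(v, eps=EPS):
    """Equation (13): W_i = V_i / (eps + sum_j V_ji)."""
    total = eps + sum(v)
    return [vj / total for vj in v]


def art1_train(patterns, n_outputs, rho=0.7, eps=EPS, max_passes=10):
    n_inputs = len(patterns[0])
    # (i) all vigilance weights set to 1: every output neuron is uncommitted.
    V = [[1] * n_inputs for _ in range(n_outputs)]
    W = [exemplar(v, eps) for v in V]
    assignment = {}
    for _ in range(max_passes):
        changed = False
        for p, x in enumerate(patterns):        # (ii) present a pattern
            enabled = set(range(n_outputs))     # (iii) enable all neurons
            while enabled:
                # (iv) winner: enabled neuron with the largest x . W_i
                win = max(sorted(enabled),
                          key=lambda i: sum(a * b for a, b in zip(x, W[i])))
                # (v) vigilance test, Equation (18)
                r = sum(a & b for a, b in zip(x, V[win])) / sum(x)
                if r >= rho:                    # (vi) resonance
                    new_v = [a & b for a, b in zip(x, V[win])]  # (vii) AND
                    if new_v != V[win]:
                        changed = True
                    V[win] = new_v
                    W[win] = exemplar(new_v, eps)
                    assignment[p] = win
                    break
                enabled.discard(win)            # (vi) disable, back to (iv)
        if not changed:                         # memories unchanged: stable
            break
    return V, W, assignment


patterns = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0]]
V, W, assignment = art1_train(patterns, n_outputs=3)
print(assignment)
print(V)
```

A pattern is left out of `assignment` if every output neuron fails the vigilance test, which can only happen when no uncommitted neurons remain.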
An example of a simple reinforcement learning algorithm is a variation of the associative reward-penalty algorithm [Hassoun, 1995]. Consider a single stochastic neuron j with inputs x_1, x_2, x_3, ..., x_n. The reinforcement rule may be written as [Hassoun, 1995]

$$w_{ji}(k+1) = w_{ji}(k) + l\, r(k)\, \big[y_j(k) - E(y_j(k))\big]\, x_i(k) \qquad (19)$$
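A minimal sketch of the update in Equation (19) is given below, with w the weight vector of the neuron, l the learning coefficient, r the reinforcement signal (+1 or −1), y the neuron output and E[y] its expected value. The text does not specify the probability law of the stochastic neuron, so the sketch assumes that the output is +1 with probability 1/(1 + exp(−2 net)), which gives E[y] = tanh(net). The critic used in the example (reward when the output equals the first input) is purely hypothetical.

```python
import math
import random


def stochastic_output(net, rng):
    """Return +1 with probability 1 / (1 + exp(-2 * net)), else -1 (assumed law)."""
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * net))
    return 1 if rng.random() < p_plus else -1


def reinforce_step(w, x, r, y, l=0.1):
    """Equation (19): w_ji(k+1) = w_ji(k) + l * r * (y - E[y]) * x_i."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    expected_y = math.tanh(net)  # E[y] under the assumed probability law
    return [wi + l * r * (y - expected_y) * xi for wi, xi in zip(w, x)]


rng = random.Random(0)
w = [0.0, 0.0]
for k in range(2000):
    x = [rng.choice([-1.0, 1.0]), 1.0]   # second input acts as a bias
    net = sum(wi * xi for wi, xi in zip(w, x))
    y = stochastic_output(net, rng)
    r = 1 if y == x[0] else -1           # hypothetical critic
    w = reinforce_step(w, x, r, y)
print(w)
```

With this critic, both a rewarded and a penalised output push the first weight in the positive direction, so the neuron becomes increasingly likely to reproduce the sign of its first input.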

In Equation (19), w_ji is the weight of the connection between input i and neuron j, l is the learning coefficient, r (which is +1 or −1) is the reinforcement signal, y_j is the output of neuron j, E(y_j) is the expected value of the output, and x_i(k) is the ith component of the kth input vector in the training set. When learning converges, w_ji(k+1) = w_ji(k) and so E(y_j(k)) = y_j(k) = +1 or −1. Thus, the neuron effectively becomes deterministic. Reinforcement learning is typically slower than supervised learning. It is more applicable to small neural networks used as controllers where it is difficult to determine the target network output.

For more information on neural networks, see [Michie et al., 1994; Hassoun, 1995; Pham and Liu, 1999; Yao, 1999; Jiang et al., 2002; Duch et al., 2004].

Neural networks can be employed as mapping devices, pattern classifiers or pattern completers (auto-associative content addressable memories and pattern associators). Like expert systems, they have found a wide spectrum of applications in


almost all areas of engineering, addressing problems ranging from modelling, prediction, control, classification and pattern recognition, to data association, clustering, signal processing and optimisation. Some recent examples of such applications are:
• predicting the tensile strength of composite laminates [Teti and Caprino, 1994];
• controlling a flexible assembly operation [Majors and Richards, 1995];
• choosing sheet metal working conditions [Lin and Chang, 1996];
• determining suitable cutting conditions in operation planning [Park et al., 1996; Schultz et al., 1997];
• recognising control chart patterns [Pham and Oztemel, 1996];
• analysing vibration spectra [Smith et al., 1996];
• deducing velocity vectors in uniform and rotating flows by tracking the movement of groups of particles [Jambunathan et al., 1997];
• setting the number of kanbans in a dynamic JIT factory [Wray et al., 1997; Markham et al., 2000];
• generating knowledge for scheduling a flexible manufacturing system [Kim et al., 1998; Priore et al., 2003];
• modelling and controlling dynamic systems including robot arms [Pham and Liu, 1999];
• acquiring and refining operational knowledge in industrial processes [Shigaki and Narazaki, 1999];
• improving yield in a semiconductor manufacturing company [Shin and Park, 2000];
• identifying arbitrary geometric and manufacturing categories in CAD databases [Ip et al., 2003];
• minimising the makespan in a flowshop scheduling problem [Akyol, 2004].

5.

GENETIC ALGORITHMS

Conventional search techniques, such as hill-climbing, are often incapable of optimising non-linear or multimodal functions. In such cases, a random search method is generally required. However, undirected search techniques are extremely inefficient for large domains. A genetic algorithm (GA) is a directed random search technique, invented by Holland [Holland, 1975], which can find the global optimal solution in complex multi-dimensional search spaces. A GA is modelled on natural evolution in that the operators it employs are inspired by the natural evolution process. These operators, known as genetic operators, manipulate individuals in a population over several generations to improve their fitness gradually. Individuals in a population are likened to chromosomes and usually represented as strings of binary numbers.

The evolution of a population is described by the “schema theorem” [Holland, 1975; Goldberg, 1989]. A schema represents a set of individuals, i.e. a subset of the population, in terms of the similarity of bits at certain positions of those individuals. For example, the schema 1∗0∗ describes the set of individuals whose first and third bits are 1 and 0, respectively. Here, the symbol ∗ means any value would be acceptable. In other words, the values of bits at positions marked ∗ could be either 0 or 1. A schema is characterised by two parameters: defining length and order. The defining length is the length between the first and last bits with fixed values. The order of a schema is the number of bits with specified values. According to the schema theorem, the distribution of a schema through the population from one generation to the next depends on its order, defining length and fitness.

GAs do not use much knowledge about the optimisation problem under study and do not deal directly with the parameters of the problem. They work with codes which represent the parameters. Thus, the first issue in a GA application is how to code the problem, i.e. how to represent its parameters. As already mentioned, GAs operate with a population of possible solutions. The second issue is the creation of a set of possible solutions at the start of the optimisation process as the initial population. The third issue in a GA application is how to select or devise a suitable set of genetic operators. Finally, as with other search algorithms, GAs have to know the quality of the solutions already found to improve them further. An interface between the problem environment and the GA is needed to provide this information. The design of this interface is the fourth issue.
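The schema notions introduced above (membership of a schema, its order and its defining length) can be made concrete with a short sketch. The function names are illustrative, the wildcard is written as an ASCII `*`, and the defining length is taken here as the distance between the positions of the first and last fixed bits.

```python
def matches(schema, individual):
    """True if the individual agrees with the schema at every fixed position."""
    return len(schema) == len(individual) and all(
        s == "*" or s == b for s, b in zip(schema, individual)
    )


def order(schema):
    """Number of positions with a specified (non-wildcard) value."""
    return sum(1 for s in schema if s != "*")


def defining_length(schema):
    """Distance between the first and last positions with fixed values."""
    fixed = [i for i, s in enumerate(schema) if s != "*"]
    return fixed[-1] - fixed[0] if fixed else 0


schema = "1*0*"
population = ["1000", "1101", "1110", "0001"]
print([ind for ind in population if matches(schema, ind)])
print(order(schema), defining_length(schema))
```

For the schema 1∗0∗ used in the text, the first and third bits are fixed, so the order is 2 and the defining length is 2.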

5.1

Representation

The parameters to be optimised are usually represented in a string form since this type of representation is suitable for genetic operators. The method of representation has a major impact on the performance of the GA. Different representation schemes might cause different performances in terms of accuracy and computation time. There are two common representation methods for numerical optimisation problems [Blickle and Thiele, 1995; Michalewicz, 1996]. The preferred method is the binary string representation method. The reason for this method being popular is that the binary alphabet offers the maximum number of schemata per bit compared to other coding techniques. Various binary coding schemes can be found in the literature, for example, Uniform coding, Gray scale coding, etc. The second representation method is to use a vector of integers or real numbers with each integer or real number representing a single parameter. When a binary representation scheme is employed, an important step is to decide the number of bits to encode the parameters to be optimised. Each parameter should be encoded with the optimal number of bits covering all possible solutions in the solution space. When too few or too many bits are used, the performance can be adversely affected.
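As an illustration of binary string representation, the sketch below maps a real-valued parameter onto a fixed number of bits and back, and converts between plain binary and Gray codes. The function names, the linear mapping onto the interval [lower, upper] and the example range are assumptions; the text does not prescribe a particular coding formula.

```python
def encode(value, lower, upper, n_bits):
    """Quantise a value in [lower, upper] to an n_bits binary string."""
    level = round((value - lower) / (upper - lower) * (2 ** n_bits - 1))
    return format(level, "0{}b".format(n_bits))


def decode(bits, lower, upper):
    """Map a binary string back onto a real value in [lower, upper]."""
    return lower + int(bits, 2) * (upper - lower) / (2 ** len(bits) - 1)


def binary_to_gray(bits):
    """Gray code: adjacent integer levels differ in exactly one bit."""
    n = int(bits, 2)
    return format(n ^ (n >> 1), "0{}b".format(len(bits)))


def gray_to_binary(gray):
    n = int(gray, 2)
    mask = n >> 1
    while mask:
        n ^= mask
        mask >>= 1
    return format(n, "0{}b".format(len(gray)))


bits = encode(5.0, 0.0, 15.0, 4)
print(bits, decode(bits, 0.0, 15.0))
print(binary_to_gray(bits), gray_to_binary(binary_to_gray(bits)))
```

With this mapping the resolution is (upper − lower) / (2^n − 1), which shows the trade-off mentioned in the text: too few bits give a coarse grid of candidate solutions, while each extra bit doubles the size of the search space.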

5.2

Creation of Initial Population

At the start of optimisation, a GA requires a group of initial solutions. There are two ways of forming this initial population. The first consists of using randomly produced solutions created by a random number generator, for example. This method is preferred for problems about which no a priori knowledge exists or for assessing the performance of an algorithm. The second method employs a priori knowledge about the given optimisation problem. Using this knowledge, a set of requirements is obtained and solutions which satisfy those requirements are collected to form an initial population. In this case, the GA starts the optimisation with a set of approximately known solutions and therefore convergence to an optimal solution can take less time than with the previous method.

5.3

Genetic Operators

The flowchart of a simple GA is given in Figure 7. There are basically four genetic operators: selection, crossover, mutation and inversion. Some of these operators were inspired by nature. In the literature, many versions of these operators can be found. It is not necessary to employ all of these operators in a GA because each operates independently of the others. The choice or design of operators depends on the problem and the representation scheme employed. For instance, operators designed for binary strings cannot be directly used on strings coded with integers or real numbers.

5.3.1

Selection

The aim of the selection procedure is to reproduce more copies of individuals whose fitness values are high than of those whose fitness values are low. The selection procedure has a significant influence on driving the search towards a promising area and finding good solutions in a short time. However, the diversity of the population must be maintained to avoid premature convergence and to reach the global optimal solution. In GAs there are mainly two selection procedures: proportional selection, also called stochastic selection, and ranking-based selection [Whitely, 1989]. Proportional selection is usually called “Roulette Wheel” selection, since its mechanism is reminiscent of the operation of a Roulette Wheel. Fitness values of individuals represent the widths of slots on the wheel. After a random spinning of the wheel to select an individual for the next generation, slots with large widths representing individuals with high fitness values will have a higher chance to be selected. One way to prevent premature convergence is to control the range of trials allocated to any single individual, so that no individual produces too many offspring. The ranking system is one such alternative selection algorithm. In this algorithm, each individual generates an expected number of offspring which is based on the rank of its performance and not on the magnitude [Baker, 1985].

Figure 7. Flowchart of a basic genetic algorithm (boxes: Initial Population, Evaluation, Selection, Crossover, Mutation, Inversion)

5.3.2

Crossover

This operation is considered the one that makes the GA different from other algorithms, such as dynamic programming. It is used to create two new individuals (children) from two existing individuals (parents) picked from the current population by the selection operation. There are several ways of doing this. Some common crossover operations are one-point crossover, two-point crossover, cycle crossover and uniform crossover. One-point crossover is the simplest crossover operation. Two individuals are randomly selected as parents from the pool of individuals formed by the selection procedure and cut at a randomly selected point. The tails, which are the parts after the cutting point, are swapped and two new individuals (children) are produced. Note that this operation does not change the values of bits. An example of one-point crossover is shown in Figure 8.

5.3.3

Mutation

In this procedure, all individuals in the population are checked bit by bit and the bit values are randomly reversed according to a specified rate. Unlike crossover, this is a monadic operation. That is, a child string is produced from a single parent string. The mutation operator forces the algorithm to search new areas. Eventually, it helps the GA to avoid premature convergence and find the global optimal solution. An example is given in Figure 9.

Parent 1:      100|010011110
Parent 2:      001|011000110
New string 1:  100|011000110
New string 2:  001|010011110
Figure 8. Crossover

Old string:  1100|0|1011101
New string:  1100|1|1011101
Figure 9. Mutation
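The three operators described so far in this section (roulette wheel selection, one-point crossover and mutation) can be sketched for binary strings as follows. The function names and signatures are illustrative assumptions. Fitness values are assumed to be non-negative and not all zero. The crossover and single-bit examples reuse the strings of Figures 8 and 9.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    spin = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit          # each slot is as wide as the fitness value
        if spin < running:
            return individual
    return population[-1]       # guard against floating-point round-off


def one_point_crossover(parent1, parent2, point):
    """Swap the tails of two parents after the cutting point."""
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])


def flip_bit(individual, position):
    """Reverse the value of a single bit."""
    flipped = "1" if individual[position] == "0" else "0"
    return individual[:position] + flipped + individual[position + 1:]


def mutate(individual, rate, rng):
    """Check the string bit by bit, reversing each bit with probability `rate`."""
    for position in range(len(individual)):
        if rng.random() < rate:
            individual = flip_bit(individual, position)
    return individual


rng = random.Random(42)
print(one_point_crossover("100010011110", "001011000110", 3))  # Figure 8
print(flip_bit("110001011101", 4))                             # Figure 9
print(roulette_select(["0011", "0111", "1111"], [2, 3, 4], rng))
print(mutate("110001011101", 0.05, rng))
```

In a complete GA the cutting point and the parents would themselves be chosen at random; they are fixed here so that the figures can be reproduced.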

5.3.4

Inversion

This operator is employed for a group of problems, such as the cell placement problem, layout problem and travelling salesman problem. It also operates on one individual at a time. Two points are randomly selected from an individual and the part of the string between those two points is reversed (see Figure 10).
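A minimal sketch of the inversion operator is given below. The function names are illustrative; the first call reproduces the binary string example of Figure 10, and the second applies the same operation to a hypothetical tour of six cities, as might arise in a travelling salesman problem.

```python
import random


def invert(individual, start, end):
    """Reverse the part of the individual between positions start and end."""
    return individual[:start] + individual[start:end][::-1] + individual[end:]


def random_inversion(individual, rng):
    """Select two points at random and reverse the segment between them."""
    start, end = sorted(rng.sample(range(len(individual) + 1), 2))
    return invert(individual, start, end)


print(invert("10110011101", 2, 6))          # binary string of Figure 10
print(invert([0, 1, 2, 3, 4, 5], 1, 4))     # hypothetical tour of six cities
print(random_inversion("10110011101", random.Random(0)))
```

Because inversion only reorders existing elements, a tour remains a valid permutation afterwards, which is one reason the operator suits ordering problems.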

5.4

Control Parameters

Important control parameters of a simple GA include the population size (number of individuals in the population), crossover rate, mutation rate and inversion rate. Several researchers have studied the effect of these parameters on the performance of a GA [Schaffer et al., 1989; Grefenstette, 1986; Fogarty, 1989; Mahfoud, 1995; Smith and Fogarty, 1997]. The main conclusions are as follows. A large population size means the simultaneous handling of many solutions and increases the computation time per iteration; however, since many samples from the search space are used, the probability of convergence to a global optimal solution is higher than with a small population size. The crossover rate determines the frequency of the crossover operation. It is useful at the start of optimisation to discover promising regions in the search space. A low crossover frequency decreases the speed of convergence to such areas. If the frequency is too high, it can lead to saturation around one solution. The mutation operation is controlled by the mutation rate. A high mutation rate introduces high diversity in the population and might cause instability. On the other hand, it is usually very difficult for a GA to find a global optimal solution with too low a mutation rate.

Old string:  10|1100|11101
New string:  10|0011|11101
Figure 10. Inversion of a binary string segment
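To show where the control parameters of Section 5.4 enter a simple GA, the following sketch runs a basic loop of evaluation, roulette wheel selection, one-point crossover and mutation on binary strings. It is a hedged illustration only: the function `run_ga`, the parameter defaults, the fixed number of generations and the test fitness function (the number of 1 bits in the string) are assumptions, the inversion operator is omitted for brevity, and the population size is assumed to be even.

```python
import random


def run_ga(fitness, n_bits, pop_size=20, crossover_rate=0.8,
           mutation_rate=0.01, generations=50, seed=0):
    rng = random.Random(seed)
    # Initial population: randomly produced solutions.
    pop = ["".join(rng.choice("01") for _ in range(n_bits))
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        fits = [fitness(ind) for ind in pop]
        # Roulette wheel selection of a mating pool; the tiny offset keeps
        # the total weight positive even if every fitness value is zero.
        pool = rng.choices(pop, weights=[f + 1e-9 for f in fits], k=pop_size)
        nxt = []
        for a, b in zip(pool[0::2], pool[1::2]):
            if rng.random() < crossover_rate:       # one-point crossover
                cut = rng.randint(1, n_bits - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            nxt += [a, b]
        # Mutation: reverse each bit with probability mutation_rate.
        pop = ["".join(("1" if c == "0" else "0")
                       if rng.random() < mutation_rate else c for c in ind)
               for ind in nxt]
        best = max(pop + [best], key=fitness)       # best solution seen so far
    return best


def count_ones(bits):
    """Hypothetical fitness function: number of 1 bits in the string."""
    return bits.count("1")


best = run_ga(count_ones, n_bits=16)
print(best, count_ones(best))
```

Re-running the sketch with different values of `pop_size`, `crossover_rate` and `mutation_rate` gives a simple way to observe the trade-offs described above.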

5.5

Fitness Evaluation Function

The fitness evaluation unit in a GA acts as an interface between the GA and the optimisation problem. The GA assesses solutions for their quality according to the information produced by this unit and not by directly using information about their structure. In engineering design problems, functional requirements are specified to the designer who has to produce a structure which performs the desired functions within predetermined constraints. The quality of a proposed solution is usually calculated depending on how well the solution performs the desired functions and satisfies the given constraints. In the case of a GA, this calculation must be automatic and the problem is how to devise a procedure which computes the quality of solutions. Fitness evaluation functions might be complex or simple depending on the optimisation problem at hand. Where a mathematical equation cannot be formulated for this task, a rule-based procedure can be constructed for use as a fitness function or in some cases both can be combined. Where some constraints are very important and cannot be violated, the structures or solutions which do so can be eliminated in advance by appropriately designing the representation scheme. Alternatively, they can be given low probabilities by using special penalty functions. For further information on genetic algorithms, see [Holland, 1975; Goldberg, 1989; Davis, 1991; Mitchell, 1996; Pham and Karaboga, 2000; Freitas, 2002]. Genetic algorithms have found applications in engineering problems involving complex combinatorial or multi-parameter optimisation. 
Some recent examples of those applications are:
• configuring transmission systems [Pham and Yang, 1993];
• designing the knowledge base of fuzzy logic controllers [Pham and Karaboga, 1994];
• generating hardware description language programs for high-level specification of the function of programmable logic devices [Seals and Whapshott, 1994];
• planning collision-free paths for mobile and redundant robots [Ashiru et al., 1995; Wilde and Shellwat, 1997; Nearchou and Aspragathos, 1997];
• scheduling the operations of a job shop [Cho et al., 1996; Drake and Choudhry, 1997; Lee et al., 1997; Chryssolouris and Subramaniam, 2001; Pérez et al., 2003];
• generating dynamic schedules for the operation and control of a flexible manufacturing cell [Jawahar et al., 1998];
• optimising the performance of an industrially designed inventory control system [Disney, 2000];
• forming manufacturing cells and determining machine layout information for cellular manufacturing [Wu et al., 2002];
• optimising assembly process plans to improve productivity [Li et al., 2003];
• improving the convergence speed and reducing the computational complexity of neural networks [Öztürk and Öztürk, 2004].
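Returning to the fitness evaluation function of Section 5.5, the use of a penalty function to discourage constraint violations can be sketched as follows. The design problem (maximising the area of a rectangle whose two sides must not sum to more than 10), the function names and the penalty weight are hypothetical and serve only to show the mechanism.

```python
def penalised_fitness(solution, objective, violations, penalty_weight=100.0):
    """Objective value minus a penalty proportional to the total violation.

    Each function in `violations` returns 0 when its constraint is
    satisfied and a positive amount when it is violated.
    """
    total_violation = sum(v(solution) for v in violations)
    return objective(solution) - penalty_weight * total_violation


def area(sides):
    """Hypothetical objective: area of a rectangle with the given sides."""
    return sides[0] * sides[1]


def side_sum_excess(sides):
    """Hypothetical constraint: the two sides must not sum to more than 10."""
    return max(0.0, sides[0] + sides[1] - 10.0)


print(penalised_fitness((4.0, 5.0), area, [side_sum_excess]))  # feasible
print(penalised_fitness((8.0, 8.0), area, [side_sum_excess]))  # penalised
```

A penalised value can be negative, so it would need to be shifted or ranked before being used with roulette wheel selection, which assumes non-negative fitness values.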

6.

SOME APPLICATIONS IN ENGINEERING AND MANUFACTURE

This section briefly reviews five engineering applications of the aforementioned soft computing tools.

6.1

Expert Statistical Process Control

Statistical process control (SPC) is a technique for improving the quality of processes and products through closely monitoring data collected from those processes and products and using statistically-based tools such as control charts. XPC is an expert system for facilitating and enhancing the implementation of statistical process control [Pham and Oztemel, 1996]. A commercially available shell was employed to build XPC. The shell allows a hybrid rule-based and pseudo object-oriented method of representing the standard SPC knowledge and process-specific diagnostic knowledge embedded in XPC. The amount of knowledge involved is extensive, which justifies the adoption of a knowledge-based systems approach.

XPC comprises four main modules. The construction module is used to set up a control chart. The capability analysis module is for calculating process capability indices. The on-line interpretation and diagnosis module assesses whether the process is in control and determines the causes for possible out-of-control situations. It also provides advice on how to remedy such situations. The modification module updates the parameters of a control chart to maintain true control over a time-varying process. XPC has been applied to the control of temperature in an injection moulding machine producing rubber seals. It has recently been enhanced by integrating a neural network module with the expert system modules to detect abnormal patterns in the control chart (see Figure 11).

6.2

Fuzzy Modelling of a Vibratory Sensor for Part Location

Figure 12 shows a six-degree-of-freedom vibratory sensor for determining the coordinates of the centre of mass (x_G, y_G) and orientation γ of bulky rigid parts. The sensor is designed to enable a robot to pick up parts accurately for machine feeding or assembly tasks. The sensor consists of a rigid platform (P) mounted on a flexible column (C). The platform supports one object (O) to be located at a time. O is held firmly with respect to P. The static deflections of C under the weight of O and the natural frequencies of vibration of the dynamic system comprising O, P and C are measured and processed using a mathematical model of the system to determine x_G, y_G and γ for O.

Figure 11. XPC output screen (range chart and mean chart with their control limits, messages on the state of the process, and the percentage assigned to each control chart pattern: normal, increasing trend, decreasing trend, upward shift, downward shift, cyclic)

Figure 12. Schematic diagram of a vibratory sensor mounted on a robot wrist (platform P carrying object O, column C, end of robot arm)

In practice, the frequency measurements have low repeatability, which leads to inconsistent location information. The problem worsens when γ is in the region 80°-90° relative to a reference axis of the sensor because the mathematical model becomes ill-conditioned. In this “ill-conditioning” region, an alternative to using a mathematical model to compute γ is to adopt an experimentally derived fuzzy model. Such a fuzzy model has to be obtained for each specific object through calibration. A possible calibration procedure involves placing the object at different positions (x_G, y_G) and orientations γ and recording the periods of vibration T of the sensor. Following calibration, fuzzy rules relating x_G, y_G and T to γ could be constructed to form a fuzzy model of the behaviour of the sensor for the given object.

A simpler fuzzy model is achieved by observing that x_G and y_G only affect the reference level of T and, if x_G and y_G are employed to define that level, the trend in the relationship between T and γ is the same regardless of the position of the object. Thus, a simplified fuzzy model of the sensor consists of rules such as “IF T − T_ref is small THEN γ − γ_ref is small”, where T_ref is the value of T when the object is at position (x_G, y_G) and orientation γ_ref. γ_ref could be chosen as 80°, the point at which the fuzzy model is to replace the mathematical model. T_ref could be either measured experimentally or computed from the mathematical model.

To counteract the effects of the poor repeatability of period measurements, which are particularly noticeable in the “ill-conditioning” region, the fuzzy rules are modified so that they take into account the variance in T. An example of a modified fuzzy rule is: “IF T − T_ref is small and σ_T is small, THEN γ − γ_ref is small”. In the above rule, σ_T denotes the standard deviation in the measurement of T. Fuzzy modelling of the vibratory sensor is detailed in Pham and Hafeez (1992). Using a fuzzy model, the orientation γ can be determined to ±2° accuracy in the region 80°-90°. The adoption of fuzzy logic in this application has produced a compact and transparent model from a large amount of noisy experimental data.

6.3

Induction of Feature Recognition Rules in a Geometric Reasoning System for Analysing 3D Assembly Models

Pham et al. (1999) have described a concurrent engineering approach involving generating assembly strategies for a product directly from its 3D CAD model.

A feature-based CAD system is used to create assembly models of products. A geometric reasoning module extracts assembly-oriented data for a product from the CAD system after creating a virtual assembly tree that identifies the components and sub-assemblies making up the given product (Figure 13a). The assembly information extracted by the module includes: placement constraints and dimensions used to specify the relevant position of a given component or sub-assembly; geometric entities (edges, surfaces, etc.) used to constrain the component or sub-assembly; and the parents and children of each entity employed as a placement constraint. An example of the information extracted is shown in Figure 13b. Feature recognition is applied to the extracted information to identify each feature used to constrain a component or sub-assembly. The rule-based feature recognition process has three possible outcomes:
1. The feature is recognised as belonging to a unique class.
2. The feature shares attributes with more than one class (see Figure 13c).
3. The feature does not belong to any known class.
In cases 2 and 3, the user must decide the correct class of the feature, and the rule base must then be updated. The updating is implemented via a rule induction program. The program employs RULES-3 Plus, which automatically extracts new feature recognition rules from examples provided to it in the form of characteristic vectors representing different features and their respective class labels. Rule induction is very suitable for this application because of the complexity of the characteristic vectors and the difficulty of defining feature classes manually.
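The three outcomes of rule-based recognition can be illustrated with a minimal Python sketch. It is not the system described by Pham et al. (1999): the `Rule` class, the `recognise` function and the attribute names (`through`, `profile`, `end`) are assumptions introduced here for illustration, and only the class labels BSL_1 and BSL_2 are taken from Figure 13c. A characteristic vector is modelled as a dictionary, and a rule fires when all of its conditions match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """IF every (attribute == value) condition holds THEN the feature is in `label`."""
    conditions: tuple  # tuple of (attribute, value) pairs
    label: str

    def fires(self, feature: dict) -> bool:
        return all(feature.get(attr) == val for attr, val in self.conditions)


def recognise(feature: dict, rules: list) -> tuple:
    """Return (outcome, candidate_classes) for one characteristic vector."""
    classes = sorted({r.label for r in rules if r.fires(feature)})
    if len(classes) == 1:
        return "unique", classes      # case 1: a single class is recognised
    if len(classes) > 1:
        return "ambiguous", classes   # case 2: the user picks the class, rules are updated
    return "unknown", classes         # case 3: the user names the class, rules are updated


# Hypothetical rule base; attribute names and values are illustrative only.
RULES = [
    Rule((("through", False), ("profile", "rectangular")), "BSL_1"),
    Rule((("through", False), ("end", "round")), "BSL_2"),
]

if __name__ == "__main__":
    print(recognise({"through": False, "profile": "rectangular", "end": "flat"}, RULES))
    print(recognise({"through": False, "profile": "rectangular", "end": "round"}, RULES))
    print(recognise({"through": True, "profile": "circular", "end": "flat"}, RULES))
```

In the ambiguous and unknown cases, the user-supplied class label together with the characteristic vector would form a new training example for the rule induction step; that step is not shown here.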

Figure 13a. An assembly model

Bolt:
• Child of Block
• Placement constraints:
  1: alignment of two axes
  2: mating of the bottom surface of the bolt head and the upper surface of the block
• No child part in the assembly hierarchy

Block:
• No parents
• No constraints (root component)
• Next part in the assembly: Bolt

Figure 13b. An example of assembly information
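As a rough illustration of how assembly information of this kind might be held in a program, the following Python sketch stores the Block and Bolt entries of Figure 13b in a small tree structure. The `Component` class, its field names and the helper functions are assumptions made for this example; they do not reproduce the data format of the geometric reasoning module.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Component:
    """One node of the assembly tree (field names are illustrative)."""
    name: str
    parent: Optional[str] = None                           # None marks the root component
    constraints: List[str] = field(default_factory=list)   # placement constraints
    children: List[str] = field(default_factory=list)      # filled in by build_assembly


def build_assembly(components: List[Component]) -> Dict[str, Component]:
    """Index components by name and derive each component's children."""
    index = {c.name: c for c in components}
    for c in components:
        c.children = []                  # reset so repeated calls stay consistent
    for c in components:
        if c.parent is not None:
            index[c.parent].children.append(c.name)
    return index


def roots(index: Dict[str, Component]) -> List[str]:
    """Return the names of components that have no parent."""
    return [c.name for c in index.values() if c.parent is None]


# The Block/Bolt example of Figure 13b, transcribed by hand.
ASSEMBLY = build_assembly([
    Component("Block"),
    Component("Bolt", parent="Block", constraints=[
        "alignment of two axes",
        "mating of the bottom surface of the bolt head and the upper surface of the block",
    ]),
])

if __name__ == "__main__":
    for part in ASSEMBLY.values():
        print(part.name, "| parent:", part.parent, "| children:", part.children)
```

Storing only the parent link and deriving the children keeps the two views of the hierarchy from drifting apart; a real system would also record the geometric entities behind each constraint.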

Figure 13c. An example of feature recognition [Labels: New Form Feature; Detected Similar Feature Classes: Rectangular Nonthrough Slot (BSL_1), Partial Round Nonthrough Slot (BSL_2)]

6.4

Neural-network-based Automotive Product Inspection

Figure 14 depicts an intelligent inspection system for engine valve stem seals [Pham and Oztemel, 1996]. The system comprises four CCD cameras connected to a computer that implements neural-network-based algorithms for detecting and classifying defects in the seal lips. Faults on the lip aperture are classified by a multilayer perceptron. The input to the network is a 20-component vector, where the value of each component is the number of times a particular geometric feature is found on the aperture being inspected. The outputs of the network indicate the type of defect on the seal lip aperture. A similar neural network is used to classify defects on the seal lip surface. The accuracy of defect classification in both perimeter and surface inspection is in excess of 80%. Note that this figure is not the same as that for the accuracy in detecting defective seals, that is, differentiating between good and defective seals. The latter task is also implemented using a neural network, which achieves an accuracy of almost 100%. Neural networks are necessary for this application because of the difficulty of describing precisely the various types of defects and the differences between good and defective seals. The neural networks are able to learn the classification task automatically from examples.

Figure 14. Valve stem seal inspection system [Labels: Host PC; Ethernet link; Vision system; 4 CCD cameras, 512 × 512 resolution; Databus; Lighting ring; Material handling and lighting controller; Bowl Feeder; Indexing machine; Seal; Chute; Good; Rework; Reject]

6.5

GA-based Conceptual Design

TRADES is a system using GA techniques to produce conceptual designs of transmission units [Pham and Yang, 1993]. The system has a set of basic building blocks, such as gear pairs, belt drives and mechanical linkages, and generates conceptual designs to satisfy given specifications by assembling the building blocks into different configurations. The crossover, mutation and inversion operators of the GA are employed to create new configurations from an existing population of configurations. Configurations are evaluated for their compliance with the design specifications. Potential solutions should provide the required speed reduction ratio and motion transformation while not containing incompatible building blocks or exceeding specified limits on the number of building blocks to be adopted. A fitness function codifies the degree of compliance of each configuration. The maximum fitness value is assigned to configurations that satisfy all functional requirements without violating any constraints. As in a standard GA, information concerning the fitness of solutions is employed to select solutions for reproduction, thus guiding the process towards increasingly fitter designs as the population evolves. In addition to the usual GA operators, TRADES incorporates new operators to avert premature convergence to non-optimal solutions and facilitate the generation of a variety of design concepts. Essentially, these operators reduce the chances of any one configuration or family of configurations dominating the solution population by avoiding crowding around very fit configurations and preventing multiple copies of a configuration, particularly after it has been identified as a potential solution. TRADES is able to produce design concepts from building blocks without requiring much additional a priori knowledge. The manipulation of the building blocks to generate new concepts is carried out by the GA in a stochastic but guided manner. This enables good conceptual designs to be found without the need to search the design space exhaustively. Due to the very large size of the design space and the quasi-random operation of the GA, novel solutions not immediately evident to a human designer are sometimes generated by TRADES. On the other hand, impractical configurations could also arise. TRADES incorporates a number of heuristics to filter out such design proposals.

7.

CONCLUSION

Over the past fifty years, the field of soft computing has produced a number of powerful tools. This chapter has reviewed five of those tools, namely, knowledge-based systems, fuzzy logic, inductive learning, neural networks and genetic algorithms. Applications of the tools in engineering and manufacture have become more widespread due to the power and affordability of present-day computers. It is anticipated that many new applications will emerge and that, for demanding tasks, greater use will be made of hybrid tools combining the strengths of two or more of the tools reviewed here [Michalski and Tecuci, 1994; Medsker, 1995]. Other technological developments in soft computing that will have an impact in engineering include data mining, or the extraction of information and knowledge from large databases [Limb and Meggs, 1994; Witten and Frank, 2000; Braha, 2001; Han and Kamber, 2001; Pham and Afify, 2002; Klösgen and Żytkow, 2002; Giudici, 2003], and multi-agent systems, or distributed self-organising systems employing entities that function autonomously in an unpredictable environment concurrently with other entities and processes [Wooldridge and Jennings, 1994; Rzevski, 1995; Márkus et al., 1996; Tharumarajah et al., 1996; Bento and Feijó, 1997; Monostori, 2002]. The appropriate deployment of these new soft computing tools and of the tools presented in this chapter will contribute to the creation of more competitive engineering systems.

8.

ACKNOWLEDGEMENTS

This work was carried out within the ALFA project “Novel Intelligent Automation and Control Systems II” (NIACS II), the ERDF (Objective One) projects “Innovation in Manufacturing Centre” (IMC), “Innovative Technologies for Effective Enterprises” (ITEE) and “Supporting Innovative Product Engineering and Responsive Manufacturing” (SUPERMAN) and within the project “Innovative Production Machines and Systems” (I*PROMS).

REFERENCES

Akyol D E, (2004), “Application of neural networks to heuristic scheduling algorithms”, Computers Ind. Engng, 46, 679–696.
Ashiru I, Czanecki C and Routen T, (1995), “Intelligent operators and optimal genetic-based path planning for mobile robots”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, 1018–1023.
Badiru A B and Cheung J Y, (2002), Fuzzy Engineering Expert Systems with Neural Network Applications, John Wiley & Sons, New York.
Baker J E, (1985), “Adaptive selection methods for genetic algorithms”, Proc. 1st Int. Conf. on Genetic Algorithms and Their Applications, Pittsburgh, PA, 101–111.
Baldwin J F and Karale S B, (2003), “New concepts for fuzzy partitioning, defuzzification and derivation of probabilistic fuzzy decision trees”, Proc. 22nd Int. Conf. of the North American Fuzzy Information Processing Society (NAFIPS-03), Chicago, Illinois, USA, 484–487.
Baldwin J F and Martin T P, (2001), “Towards inductive support logic programming”, Proc. Joint 9th IFSA World Congress and 20th NAFIPS Int. Conf., Vancouver, Canada, 4, 1875–1880.
Bas K and Erkmen A M, (1995), “Fuzzy preshape and reshape control of Anthrobot-III 5-fingered robot hand”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, 673–677.
Bento J and Feijó B, (1997), “An agent-based paradigm for building intelligent CAD systems”, Artificial Intelligence in Engineering, 11 (3), 231–244.
Beskese A, Kahraman C and Irani Z, (2004), “Quantification of flexibility in advanced manufacturing systems using fuzzy concepts”, Int. J. Production Economics, 89 (1), 45–56.
Bigand A, Goureau P and Kalemkarian J, (1994), “Fuzzy control of a welding process”, Proc. IMACS Int. Symp. on Signal Processing, Robotics and Neural Networks (SPRANN 94), Villeneuve d’Ascq, France, 379–342.
Blickle T and Thiele L, (1995), “A comparison of selection schemes used in genetic algorithms”, Computer Engineering and Communication Networks Lab (TIK)-Report, No. 11, Version 1.1b, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland.
Bose A, Gini M and Riley D, (1997), “A case-based approach to planar linkage design”, Artificial Intelligence in Engineering, 11 (2), 107–119.
Bozdağ C E, Kahraman C and Ruan D, (2003), “Fuzzy group decision making for selection among computer integrated manufacturing systems”, Computers in Industry, 15 (1), 13–29.
Braha D, (2001), Data Mining for Design and Manufacturing: Methods and Applications, Kluwer Academic Publishers, Boston.
Breiman L, Friedman J H, Olshen R A and Stone C J, (1984), Classification and Regression Trees, Belmont, Wadsworth.
Cervone G, Panait L A and Michalski R S, (2001), “The development of the AQ20 learning system and initial experiments”, Proc. 10th Inter. Symposium on Intelligent Information Systems, Poland.
Chen J C and Black J T, (1997), “A fuzzy-nets in-process (FNIP) system for tool-breakage monitoring in end-milling operations”, Int. J Machine Tools Manufacturing, 37 (6), 783–800.
Cho B J, Hong S C and Okoma S, (1996), “Job shop scheduling using genetic algorithm”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, 351–358.

Chryssolouris G and Subramaniam V, (2001), “Dynamic scheduling of manufacturing job shops using genetic algorithms”, J. Intelligent Manufacturing, 12, 281–293.
Costa Branco P J and Dente J A, (1998), “An experiment in automatic modelling an electrical drive system using fuzzy logic”, IEEE Trans on Systems, Man, and Cybernetics, 28 (2), 254–262.
Da Rocha Fernandes A M and Cid Bastos R, (1996), “Fuzzy expert systems for qualitative analysis of minerals”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 673–680.
Darlington K W, (1999), The Essence of Expert Systems, Prentice Hall.
Davis L, (1991), Handbook of Genetic Algorithms, Van Nostrand, New York, NY.
De La Sen M, Miñambres J J, Garrido A J, Almansa A and Soto J C, (2004), “Basic theoretical results for expert systems: Application to the supervision of adaptation transients in planar robots”, Artificial Intelligence, 152 (2), 173–211.
Disney S M, Naim M M and Towill D R, (2000), “Genetic algorithm optimisation of a class of inventory control systems”, Inter. J. Production Economics, 68, 259–278.
Drake P R and Choudhry I A, (1997), “From apes to schedules”, Manufacturing Engineer, 76 (1), 43–45.
Dubois D and Prade H, (1998), “An introduction to fuzzy systems”, Clinica Chimica Acta, 270 (1), 3–29.
Duch W, Setiono R and Zurada J M, (2004), “Computational intelligence methods for rule-based data understanding”, Proc. IEEE, 92 (5), 771–805.
Durkin J, (1994), Expert Systems Design and Development, Macmillan, New York.
Evans B and Fisher D, (2002), “Using decision tree induction to minimize process delays in printing industry”, In: Handbook of Data Mining and Knowledge Discovery (W. Klösgen and J.M. Zytkow (Eds.)), Oxford University Press.
Kong F, Yu J and Zhou X, (1999), “Analysis of fuzzy dynamic characteristics of machine cutting process: Fuzzy stability analysis in regenerative-type-chatter”, Int. J. Machine Tools and Manufacture, 39 (8), 1299–1309.
Ferreiro Garcia R, (1994), “FAM rule as basis for poles shifting applied to the roll control of an aircraft”, SPRANN 94 (ibid), 375–378.
Fogarty T C, (1989), “Varying the probability of mutation in the genetic algorithm”, Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 104–109.
Freitas A A, (2002), Data mining and knowledge discovery with evolutionary algorithms, Springer-Verlag, Berlin, New York.
Giarratano J C and Riley G D, (1998), Expert Systems: Principles and Programming, 3rd Edition, PWS Publishing Company, Boston, MA.
Giudici P, (2003), Applied Data Mining: Statistical Methods for Business and Industry, John Wiley & Sons, England.
Goldberg D E, (1989), Genetic Algorithms in Search, Optimisation and Machine Learning, Addison Wesley, Reading, MA.
Grefenstette J J, (1986), “Optimization of control parameters for genetic algorithms”, IEEE Trans on Systems, Man and Cybernetics, 16 (1), 122–128.
Han J and Kamber M, (2001), Data Mining: Concepts and Techniques, Academic Press, USA.
Hassoun M H, (1995), Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA.
Holland J H, (1975), Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI.
Hong T P and Chen J B, (2000), “Processing individual fuzzy attributes for fuzzy rule induction”, Fuzzy Sets and Systems, 112 (1), 127–140.
Hui P C L, Chan K C K and Yeung K W, (1997), “Modelling job complexity in garment manufacturing by inductive learning”, Inter. J. Clothing Science and Technology, 9 (1), 34–44.
Ip C Y, Regli W C, Sieger L and Shokoufandeh A, (2003), “Automated learning of model classification”, Proc. 8th ACM Symposium on Solid Modeling and Applications, Seattle, Washington, USA, ACM Press, 322–327.
ISL, (1998), Clementine Data Mining Package, SPSS UK Ltd., 1st Floor, St. Andrew’s House, West Street, Woking, Surrey GU21 1EB, United Kingdom.
Jackson P, (1999), Introduction to Expert Systems, 3rd Edition, Addison-Wesley, Harlow, Essex.

Jambunathan K, Fontama V N, Hartle S L and Ashforth-Frost S, (1997), “Using ART 2 networks to deduce flow velocities”, Artificial Intelligence in Engineering, 11 (2), 135–141.
Janikow C Z, (1998), “Fuzzy decision trees: Issues and methods”, IEEE Trans on System, Man, and Cybernetic, 28 (1), 1–14.
Jawahar N, Aravindan P, Ponnambalam S G and Karthikeyan A A, (1998), “A genetic algorithm-based scheduler for setup-constrained FMC”, Computers in Industry, 35, 291–310.
Jiang Y, Zhou Z-H and Chen Z-Q, (2002), “Rule learning based on neural network ensemble”, Proc. Inter. Joint Conf. on Neural Networks, Honolulu, HI, 1416–1420.
Kalogirou S A, (2003), “Artificial Intelligence for modelling and control of combustion processes: A review”, Progress in Energy and Combustion Science, 29 (6), 515–566.
Kamrani A K, Shashikumar S and Patel S, (1995), “An intelligent knowledge-based system for robotic cell design”, Computers Ind. Engng, 29 (1–4), 141–145.
Karsak E E, (2004), “Fuzzy multiple objective programming framework to prioritize design requirements in quality function deployment”, Computers Ind. Engng, (Submitted and accepted).
Karsak E E and Kuzgunkaya O, (2002), “A fuzzy multiple objective programming approach for the selection of a flexible manufacturing system”, Int. J. Production Economics, 79 (2), 101–111.
Kaufmann A, (1975), Introduction to the Theory of Fuzzy Subsets, Vol. 1, Academic Press, New York.
Kim C-O, Min Y-D and Yih Y, (1998), “Integration of inductive learning and neural networks for multi-objective FMS scheduling”, Inter. J. Production Research, 36 (9), 2497–2509.
Klir G J and Yuan B, (1995), Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall, Upper Saddle River, NJ.
Klir G J and Yuan B, (Eds.), (1996), Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems – selected papers by L A Zadeh, World Scientific, Singapore.
Klösgen W and Żytkow J M, (2002), Handbook of Data Mining and Knowledge Discovery, Oxford University Press, New York.
Koo D Y and Han S H, (1996), “Application of the configuration design methods to a design expert system for paper feeding mechanism”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 49–56.
Kostov A, Andrews B, Stein R B, Popovic D and Armstrong W W, (1995), “Machine learning in control of functional electrical stimulation systems for locomotion”, IEEE Trans on Biomedical Engineering, 44 (6), 541–551.
Kulak O and Kahraman C, (2004), “Multi-attribute comparison of advanced manufacturing systems using fuzzy vs. crisp axiomatic design approach”, Int. J. Production Economics, (Submitted and accepted).
Lara Rosano F, Kemper Valverde N, De La Paz Alva C and Alcántara Zavala J, (1996), “Tutorial expert system for the design of energy cogeneration plants”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 300–305.
Lavrac N and Dzeroski S, (1994), Inductive Logic Programming: Techniques and Applications, Ellis Horwood, New York.
Lee C-Y, Piramuthu S and Tsai Y-K, (1997), “Job shop scheduling with a genetic algorithm and machine learning”, Inter. J. Production Research, 35 (4), 1171–1191.
Li J R, Khoo L P and Tor S B, (2003), “A Tabu-enhanced genetic algorithm approach for assembly process planning”, J. Intelligent Manufacturing, 14, 197–208.
Limb P R and Meggs G J, (1994), “Data mining tools and techniques”, British Telecom Technology Journal, 12 (4), 32–41.
Lin Z-C and Chang D-Y, (1996), “Application of a neural network machine learning model in the selection system of sheet metal bending tooling”, Artificial Intelligence in Engineering, 10, 21–37.
Lou H H and Huang Y L, (2003), “Hierarchical decision making for proactive quality control: System development for defect reduction in automotive coating operations”, Engineering Applications of Artificial Intelligence, 16, 237–250.
Luzeaux D, (1994), “Process control and machine learning: rule-based incremental control”, IEEE Trans on Automatic Control, 39 (6), 1166–1171.

Mahfoud S W, (1995), Niching Methods for Genetic Algorithms, Ph.D. Thesis, Department of General Engineering, University of Illinois at Urbana-Champaign.
Majors M D and Richards R J, (1995), “A topologically-evolving neural network for robotic flexible assembly control”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, August, 894–899.
Markham I S, Mathieu R G and Wray B A, (2000), “Kanban setting through artificial intelligence: A comparative study of artificial neural networks and decision trees”, Integrated Manufacturing Systems, 11 (4), 239–246.
Márkus A, Kis T, Váncza J and Monostori L, (1996), “A market approach to holonic manufacturing”, CIRP Annals, 45 (1), 433–436.
Mathieu R G, Wray B A and Markham I S, (2002), “An approach to learning from both good and poor factory performance in a kanban-based just-in-time production system”, Production Planning & Control, 13 (8), 715–724.
Medsker L R, (1995), Hybrid Intelligent Systems, Kluwer Academic Publishers, Boston, 298 pp.
Michalewicz Z, (1996), Genetic Algorithms + Data Structures = Evolution Programs, 3rd Edition, Springer-Verlag, Berlin.
Michalski R S, (1990), “A theory and methodology of inductive learning”, in Readings in Machine Learning, Eds. Shavlik J W and Dietterich T G, Kaufmann, San Mateo, CA, 70–95.
Michalski R S and Kaufman K A, (2001), “The AQ19 system for machine learning and pattern discovery: A general description and user guide”, Reports of the Machine Learning and Inference Laboratory, MLI 01-2, George Mason University, Fairfax, VA, USA.
Michalski R S and Larson J B, (1983), “Incremental generation of VL1 hypotheses: The underlying methodology and the descriptions of program AQ11”, ISG 83–5, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois.
Michalski R S, Mozetic I, Hong J and Lavrac N, (1986), “The multi-purpose incremental learning system AQ15 and its testing application to three medical domains”, American Association of Artificial Intelligence, Los Altos, CA, Morgan Kaufmann, 1041–1045.
Michalski R and Tecuci G, (1994), Machine Learning: A Multistrategy Approach, 4, Morgan Kaufmann Publishers, San Francisco, CA, USA.
Michie D, Spiegelhalter D J and Taylor C C, (1994), Machine Learning, Neural and Statistical Classification, Ellis Horwood, New York.
Mitchell M, (1996), An Introduction to Genetic Algorithms, MIT Press.
Monostori L, (2002), “AI and machine learning techniques for managing complexity, changes and uncertainties in manufacturing”, Proc. 15th Triennial World Congress, Barcelona, Spain, 119–130.
Muggleton S (ed), (1992), Inductive Logic Programming, Academic Press, London, 565 pp.
Muggleton S and Feng C, (1990), “Efficient induction of logic programs”, Proc. 1st Conf. on Algorithmic Learning Theory, Tokyo, Japan, 368–381.
Nearchou A C and Aspragathos N A, (1997), “A genetic path planning algorithm for redundant articulated robots”, Robotica, 15 (2), 213–224.
Nurminen J K, Karonen O and Hatonen K, (2003), “What makes expert systems survive over 10-years – empirical evaluation of several engineering applications”, Expert Systems with Applications, 24 (2), 199–211.
Ong S K, De Vin L J, Nee A Y C and Kals H J J, (1997), “Fuzzy set theory applied to bend sequencing for sheet metal bending”, Int. J. Materials Processing Technology, 69, 29–36.
Öztürk N and Öztürk F, (2004), “Hybrid neural network and genetic algorithm based machining feature recognition”, J. Intelligent Manufacturing, 15, 278–298.
Park M-W, Rho H-M and Park B-T, (1996), “Generation of modified cutting conditions using neural networks for an operation planning system”, Annals of the CIRP, 45 (1), 475–478.
Peers S M C, Tang M X and Dharmavasan S, (1994), “A knowledge-based scheduling system for offshore structure inspection”, Artificial Intelligence in Engineering IX (AIEng 9), Eds. Rzevski G, Adey R A and Russell D W, Computational Mechanics, Southampton, 181–188.
Peng Y, (2004), “Intelligent condition monitoring using fuzzy inductive learning”, J. Intelligent Manufacturing, 15 (3), 373–380.

Pérez E, Herrera F and Hernández C, (2003), “Finding multiple solutions in job shop scheduling by niching genetic algorithms”, J. Intelligent Manufacturing, 14, 323–339.
Pham D T and Afify A A, (2002), “Machine learning: Techniques and trends”, Proc. 9th Inter. Workshop on Systems, Signals and Image Processing (IWSSIP – 02), Manchester Town Hall, UK, World Scientific, 12–36.
Pham D T and Afify A A, (2005a), “RULES-6: A simple rule induction algorithm for handling large data sets”, Proc. of the Institution of Mechanical Engineers, Part (C), 219 (10), 1119–1137.
Pham D T and Afify A A, (2005b), “Machine learning techniques and their applications in manufacturing”, Proc. of the Institution of Mechanical Engineers, Part B, 219 (5), 395–412.
Pham D T, Afify A A and Dimov S S, (2002), “Machine learning in manufacturing”, Proc. 3rd CIRP Inter. Seminar on Intelligent Computation in Manufacturing Engineering – (ICME 2002), Ischia, Italy, III–XII.
Pham D T and Aksoy M S, (1994), “An algorithm for automatic rule induction”, Artificial Intelligence in Engineering, 8, 277–282.
Pham D T and Aksoy M S, (1995a), “RULES: A rule extraction system”, Expert Systems with Applications, 8, 59–65.
Pham D T and Aksoy M S, (1995b), “A new algorithm for inductive learning”, Journal of Systems Engineering, 5, 115–122.
Pham D T, Bigot S and Dimov S S, (2003), “RULES-5: A rule induction algorithm for problems involving continuous attributes”, Proc. of the Institution of Mechanical Engineers, 217 (Part C), 1273–1286.
Pham D T and Dimov S S, (1997), “An efficient algorithm for automatic knowledge acquisition”, Pattern Recognition, 30 (7), 1137–1143.
Pham D T, Dimov S S and Salem Z, (2000), “Technique for selecting examples in inductive learning”, ESIT 2000 European Symposium on Intelligent Techniques, Erudit Aachen Germany, 119–127.
Pham D T, Dimov S S and Setchi R M, (1999), “Concurrent engineering: a tool for collaborative working”, Human Systems Management, 18, 213–224.
Pham D T and Hafeez K, (1992), “Fuzzy qualitative model of a robot sensor for locating three-dimensional objects”, Robotica, 10, 555–562.
Pham D T and Karaboga D, (1994), “Some variable mutation rate strategies for genetic algorithms”, SPRANN 94 (ibid), 73–96.
Pham D T and Karaboga D, (2000), Intelligent Optimisation Techniques: Genetic Algorithms, Tabu Search, Simulated Annealing and Neural Networks, Springer-Verlag, London, Berlin and Heidelberg, 2nd printing, 302 pp.
Pham D T and Liu X, (1999), Neural Networks for Identification, Prediction and Control, Springer Verlag, London, Berlin and Heidelberg, 4th printing, 238 pp.
Pham D T and Oztemel E, (1996), Intelligent Quality Systems, Springer Verlag, London, Berlin and Heidelberg, 201 pp.
Pham D T, Packianather M S, Dimov S, Soroka A J, Girard T, Bigot S and Salem Z, (2004), “An application of data mining and machine learning techniques in the metal industry”, Proc. 4th CIRP Inter. Seminar on Intelligent Computation in Manufacturing Engineering (ICME-04), Sorrento (Naples), Italy.
Pham D T and Pham P T N, (1988), “Expert systems in mechanical and manufacturing engineering”, Int. J. Adv. Manufacturing Technology, Special Issue on Knowledge Based Systems, 3 (3), 3–21.
Pham D T and Yang Y, (1993), “A genetic algorithm based preliminary design system”, Proc. IMechE, Part D: J. Automobile Engineering, 207, 127–133.
Price C J, (1990), Knowledge Engineering Toolkits, Ellis Horwood, Chichester.
Priore P, Fuente D, Pino R and Puente J, (2003), “Dynamic scheduling of manufacturing systems using neural networks and inductive learning”, Integrated Manufacturing Systems, 14 (2), 160–168.
Quinlan J R, (1983), “Learning efficient classification procedures and their application to chess endgames”, In: Machine Learning: An Artificial Intelligence Approach (Michalski R S, Carbonell J G and Mitchell T M (Eds.)), I, Tiogo Publishing Co., 463–482.
Quinlan J R, (1986), “Induction of decision trees”, Machine Learning, 1, 81–106.

Quinlan J R, (1990), “Learning logical definitions from relations”, Machine Learning, 5, 239–266.
Quinlan J R, (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
Quinlan J R and Cameron-Jones R M, (1995), “Induction of logic programs: FOIL and related systems”, New Generation Computing, 13, 287–312.
Ross T J, (1995), Fuzzy Logic with Engineering Applications, McGraw-Hill, New York.
RuleQuest, (2001), Data Mining Tools C5.0, Pty Ltd, 30 Athena Avenue, St Ives NSW 2075, Australia. Available from: http://www.rulequest.com/see5-info.html.
Rzevski G, (1995), “Artificial intelligence in engineering: past, present and future”, Artificial Intelligence in Engineering X, Eds Rzevski G, Adey R A and Tasso C, Computational Mechatronics, Southampton, 3–16.
Schaffer J D, Caruana R A, Eshelman L J and Das R, (1989), “A study of control parameters affecting on-line performance of genetic algorithms for function optimisation”, Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 51–61.
Schultz G, Fichtner D, Nestler A and Hoffmann J, (1997), “An intelligent tool for determination of cutting values based on neural networks”, Proc. 2nd World Congress on Intelligent Manufacturing Processes and Systems, Budapest, Hungary, 66–71.
Seals R C and Whapshott G F, (1994), “Design of HDL programmes for digital systems using genetic algorithms”, AI Eng 9 (ibid), 331–338.
Shi Z Z, Zhou H and Wang J, (1997), “Applying case-based reasoning to engine oil design”, Artificial Intelligence in Engineering, 11 (2), 167–172.
Shigaki I and Narazaki H, (1999), “A machine-learning approach for a sintering process using a neural network”, Production Planning & Control, 10 (8), 727–734.
Shin C K and Park S C, (2000), “A machine learning approach to yield management in semiconductor manufacturing”, Inter. J. Production Research, 38 (17), 4261–4271.
Skibniewski M, Arciszewski T and Lueprasert K, (1997), “Constructability analysis: machine learning approach”, ASCE J of Computing in Civil Engineering, 12 (1), 8–16.
Smith J E and Fogarty T C, (1997), “Operator and parameter adaptation in genetic algorithms”, Soft Computing, 1 (2), 81–87.
Smith P, MacIntyre J and Husein S, (1996), “The application of neural networks in the power industry”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 321–326.
Sohen S Y and Choi I S, (2001), “Fuzzy QFD for supply chain management with reliability consideration”, Reliability Eng. and Systems Safety, 72, 327–334.
Streichfuss M and Burgwinkel P, (1995), “An expert-system-based machine monitoring and maintenance management system”, Control Eng. Practice, 3 (7), 1023–1027.
Szwarc D, Rajamani D and Bector C R, (1997), “Cell formation considering fuzzy demand and machine capacity”, Int. J. Advanced Manufacturing Technology, 13 (2), 134–147.
Tarng Y S, Tseng C M and Chung L K, (1997), “A fuzzy pulse discriminating systems for electrical discharge machining”, Int. J. Machine Tools and Manufacture, 37 (4), 511–522.
Teti R and Caprino G, (1994), “Prediction of composite laminate residual strength based on a neural network approach”, AI Eng 9 (ibid), 81–88.
Tharumarajah A, Wells A J and Nemes L, (1996), “Comparison of the bionic, fractal and holonic manufacturing system concepts”, Int. J. Computer Integrated Manufacturing, 9 (3), 217–226.
Vanegas L V and Labib A W, (2001), “A fuzzy quality function deployment (FQFD) model for deriving optimum targets”, Int. J. Production Research, 39 (1), 99–120.
Venkatachalam A R, (1994), “Automating manufacturability evaluation in CAD systems through expert systems approaches”, Expert Systems with Applications, 7 (4), 495–506.
Wang L X and Mendel M, (1992), “Generating fuzzy rules by learning from examples”, IEEE Trans on Systems, Man and Cybernetics, 22 (6), 1414–1427.
Wang W P, Peng Y H and Li X Y, (2002), “Fuzzy-grey prediction of cutting force uncertainty in turning”, J Materials Processing Technology, 129, 663–666.
Wang C-H, Tsai C-J, Hong T-P and Tseng S-S, (2003), “Fuzzy Inductive Learning Strategies”, Applied Intelligence, 18 (2), 179–193.


Wang X Z, Wang Y D, Xu X F, Ling W D and Yeung D S, (2001), “A new approach to fuzzy rule generation: Fuzzy extension matrix”, Fuzzy Sets and Systems, 123 (3), 291–306. Whitely D, (1989), “The GENITOR algorithm and selection pressure: why rank-based allocation of reproductive trials is best”, Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 116–123. Wilde P and Shellwat H, (1997), “Implementation of a genetic algorithm for routing an autonomous robot”, Robotica, 15 (2), 207–211. Witten I H and Frank E, (2000), Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, USA. Wooldridge M J and Jennings N R, (1994), “Agent theories, architectures and languages : a survey”, Proc. ECAI 94 Workshop on Agent Theories, Architectures and Languages, Amsterdam, 1–32. Wray B A, Rakes T R and Rees L, (1997), “Neural network identification of critical factors in dynamic just-in-time kanban environment”, J. Intelligent Manufacturing, 8, 83–96. Wu X, Chu C-H, Wang Y and Yan W, (2002), “A genetic algorithm for integrated cell formation and layout decisions”, Proc. of the 2002 Congress on Evolutionary Computation (CEC-02), 2, 1866–1871. Yano H, Akashi T, Matsuoka N, Nakanishi K, Takata O and Horinouchi N, (1997), “An expert system to assist automatic remeshing in rigid plastic analysis”, Toyota Technical Review, 46 (2), 87–92. Yao X, (1999), “Evolving artificial neural networks”, Proceedings of the IEEE, 87 (9), 1423–1447. Zadeh L A, (1965), “Fuzzy Sets”, Information Control, 8, 338–353. Zha X F, Lim S Y E and Fok S C, (1998), “Integrated knowledge-based assembly sequence planning”, Int. J. Adv. Manufacturing Technology, 12 (3), 211–237. Zha X F, Lim S Y E and Fok S C, (1999), “Integrated knowledge-based approach and system for product design and assembly”, Int. J. Computer Integrated Manufacturing, 14, 50–64. 
Zhao Z Y and De Souza R, (1998), “On improving the performance of hard disk drive final assembly via knowledge intensive simulation”, J. Electronics Manufacturing, 1, 23–25. Zhao Z Y and De Souza R, (2001), “Fuzzy rule learning during simulation of manufacturing resources”, Fuzzy Sets and Systems, 122, 469–485. Zhou C, Nelson P C, Xiao W, Tirpak T M and Lane S A, (2001), “An intelligent data mining system for drop test analysis of electronic products”, IEEE Trans on Electronics Packaging Manufacturing, 24 (3), 222–231. Zimmermann H-J, (1996), Fuzzy Set Theory and its Applications, 3rd Edition, Kluwer Academic Publishers, Boston. Zülal G and Arikan F, (2000), “Application of fuzzy decision making in part-machine grouping”, Int. J. Production Economics, 63, 181–193.

CHAPTER 2 NEURAL NETWORKS HISTORICAL REVIEW

D. ANDINA¹, A. VEGA-CORONA², J. I. SEIJAS³, J. TORRES-GARCÍA

¹ Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España.
² Facultad de Ingeniería, Mecánica, Eléctrica y Electrónica (FIMEE), Universidad de Guanajuato (UG), Salamanca, Gto., México.
³ Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España.

Abstract:

This chapter starts with a historical summary of the evolution of Neural Networks, from the first models, whose range of application was very limited, to the present ones, which make it possible to consider automating tasks formerly reserved to human intelligence. After the historical review, Neural Networks are treated from a computational point of view. This perspective helps to compare Neural Systems with classical Computing Systems and leads to a formal and common presentation that will be used throughout the book.

D. Andina and D.T. Pham (eds.), Computational Intelligence, 39–65. © 2007 Springer.

INTRODUCTION

Computers in use today can perform a great variety of well-defined tasks at higher speed and with more reliability than human beings. None of us can, for example, solve complex mathematical equations at the speed of a personal computer. Nevertheless, the mental capacity of human beings is still superior to that of machines in a wide variety of tasks. No artificial image recognition system is able to compete with the capacity of a human being to discern between objects of diverse forms and orientations; in fact, it would not even be able to compete with the capacity of an insect. In the same way, whereas a computer needs an enormous amount of computation and restrictive conditions to recognize, for example, phonemes, an adult human effortlessly recognizes words pronounced by different people, at different speeds, accents and intonations, even in the presence of environmental noise.

It is observed that, by means of rules learned from experience, the human being is much more effective than computers at solving imprecise (ambiguous) problems, or problems that require a great amount of information. Our brain achieves this by means of thousands of millions of simple cells, called neurons, which are interconnected with each other. However, it is estimated that operational amplifiers and logic gates can perform operations several orders of magnitude faster than neurons. If the processing technique of biological elements were implemented with operational amplifiers and logic gates, one could build relatively cheap machines able to process at least as much information as a biological brain. Of course, we are still far from knowing whether such machines will ever be built.

Therefore, there are strong reasons to consider tackling certain problems by means of parallel systems that process information and learn using principles taken from the brains of living beings. Such systems are called Artificial Neural Networks, connectionist models or parallel distributed processing models. Artificial Neural Networks (ANNs or, simply, NNs) thus arise from the intention of simulating the biological brain in an artificial way.

1. HISTORICAL PERSPECTIVE

The science of Artificial Neural Networks made its first significant appearance during the 1940s. Researchers trying to emulate the functions of the human brain developed physical models (later, simulations by means of programs) of biological neurons and their interconnections. As neurobiologists deepened their knowledge of the human neural system, these first models came to be considered increasingly rudimentary approximations. Nevertheless, some of the results obtained in those early days were impressive, which encouraged further research and the development of sophisticated and powerful Artificial Neural Networks.

1.1 First Computational Model of Nervous Activity: The Model of McCulloch and Pitts

McCulloch and Pitts published the first systematic studies of artificial neural networks [McCulloch and Pitts, 1943] [Pitts and McCulloch, 1947]. Their work was formulated as a computational model of the activity of the cells of the human nervous system. Most of it focuses on the behavior of a single neuron, whose mathematical or computational model is shown in Figure 1. Inside the artificial neuron, each input $x_i$ is multiplied by a scale factor (or weight $w_i$) and the products are summed. The inputs emulate the excitations received by biological neurons. The weights represent the strength of the synaptic connection: a positive weight represents an excitatory effect, and a negative weight an inhibitory effect. If the result of the sum is higher than a certain threshold value or bias (represented by the weight $w_0$), the cell activates, providing a positive value (normally +1); otherwise, the output takes a negative value (usually −1) or zero. The output is therefore binary. In general, the model follows neurobiological behavior: nerve cells produce nonlinear responses when excited by a given input. In particular, McCulloch and Pitts proposed an activation function, which represents the nonlinearity of the model, called the hard threshold function (see Figure 1).

Figure 1. Artificial model [McCulloch and Pitts, 1943] of a biological neuron. As can be observed, the relation between the input and the output follows a nonlinear function called the activation function. In the first model shown in this figure, the activation is a hard threshold function. [Diagram: input vector $X = [x_1, x_2, \ldots, x_M]^T$, weights $w_0, w_1, \ldots, w_M$, weighted sum $Z = \sum w_i x_i$, and activation function $O = f(Z)$ with output levels 1 and −1.]

Although this first model can only perform very simple tasks, as described below, the potential of neural systems lies essentially in the interconnection of neurons to form networks. This interconnection is normally arranged in layers of nodes (artificial neurons). Networks of this kind are called Multi-Layer Perceptrons (MLP). In general, one speaks of Feed Forward Neural Networks when information is always transmitted in the direction from the input layer to the output layer, and of Feedback Neural Networks when information can be transmitted in both directions, that is, when connections from nodes of higher layers to nodes of lower layers are allowed. Figure 2 shows a Feed Forward Neural Network of two layers: a hidden layer (located right after the input layer) and the output layer. The input layer is usually not counted as a proper layer of the network. Each component of the input vector $\mathbf{x} = [1, x_1, \ldots, x_M]^T$ is connected to all the nodes of the first hidden layer. The strength of each connection is determined by its associated weight. When the same scheme is applied to the rest of the network's layers, the network is said to be fully connected. To proceed chronologically, we will set Multilayer Neural Networks (MLP) aside for the moment. The first artificial neural network proposals were single-layer networks, like the one shown in Figure 2 but without the output layer.
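As a minimal sketch of the model just described (not code from the chapter), the following Python computes the output of one hard-threshold neuron and of a fully connected single layer of such neurons. The function names and the example weights are illustrative assumptions.

```python
def hard_threshold(z):
    """Hard threshold activation: +1 if z > 0, otherwise -1."""
    return 1 if z > 0 else -1


def neuron_output(weights, bias, inputs):
    """Weighted sum of the inputs plus the bias w0, followed by the threshold."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return hard_threshold(z)


def layer_output(weight_rows, biases, inputs):
    """Fully connected layer: every input component feeds every neuron."""
    return [neuron_output(w, b, inputs) for w, b in zip(weight_rows, biases)]


# Hypothetical example: a layer of two neurons with two inputs each.
outputs = layer_output([[1.0, 1.0], [-1.0, 1.0]], [-1.5, 0.5], [0.0, 1.0])
print(outputs)
```

Each row of `weight_rows` holds the weights of one neuron, so adding a neuron to the layer only adds a row and a bias.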

Figure 2. Two Layer Feed Forward Neural Networks

This parallel arrangement of the first neuron model (see Figure 1) was suggested in order to overcome the limitations of a neuron acting alone. It is easy to verify that the model of McCulloch and Pitts divides the input space into two parts by means of the hyperplane described by the equation

$$h(\mathbf{x}) = \sum_{j=1}^{M} w_j x_j + w_0 = 0 \tag{1}$$

This effect can be observed in Figure 3, which shows this hyperplane for the particular case of $M = 2$. A single neuron can solve two-class classification problems on M-dimensional data, provided the classes are linearly separable. That is, it can assign an output equal to 1 to all the data of class “A” (which fall on the same side of the hyperplane), whereas it assigns a value equal to −1 to the rest of the data, which fall on the opposite side. Mathematically, we can express this classification as

$$\sum_{j=1}^{M} w_j x_j \; \underset{C_A}{\overset{C_B}{\gtrless}} \; -w_0 \tag{2}$$

where $C_A$ and $C_B$ denote class A and class B, respectively.

We now have a very simple model of neuron behavior that does not consider many of the biological characteristics it tries to emulate. For example, it does not consider the real delays present in all inter-neural transmission, which have an important effect on the system dynamics, or, more importantly, it does not include synchronism or frequency modulation effects, which many researchers consider crucial.
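To make the role of the hyperplane concrete, the sketch below evaluates $h(\mathbf{x})$ for the case $M = 2$ and labels each pattern by the side of the boundary on which it falls. The weights are an illustrative assumption: they realize the logical AND of inputs coded as +1 and −1, which is a linearly separable problem.

```python
def h(weights, w0, x):
    """Hyperplane function h(x) = sum_j w_j * x_j + w0."""
    return w0 + sum(w * xj for w, xj in zip(weights, x))


def classify(weights, w0, x):
    """Return +1 on the side where h(x) > 0 and -1 on the other side."""
    return 1 if h(weights, w0, x) > 0 else -1


# Assumed weights (not from the text): logical AND with +1 / -1 coding.
weights, w0 = [1.0, 1.0], -1.0
patterns = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
labels = [classify(weights, w0, p) for p in patterns]
print(labels)

# For M = 2 the boundary h(x) = 0 is a straight line that crosses the
# axes at x1 = -w0 / w1 and x2 = -w0 / w2.
intercepts = (-w0 / weights[0], -w0 / weights[1])
print(intercepts)
```

Changing $w_0$ shifts the line without rotating it, while changing the ratio of the weights rotates it.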

Figure 3. The hyperplane determined by the McCulloch and Pitts neuron model for the case of two-dimensional inputs. This hyperplane depends on the neuron’s parameters (weights $w_j$ and threshold value $w_0$) according to the mathematical expression $\sum_{j=1}^{M} w_j x_j + w_0 = 0$. [Plot in the $(x_1, x_2)$ plane: the line $h(\mathbf{x}) = 0$ separates Class 1, where $h(\mathbf{x}) > 0$, from Class 2, where $h(\mathbf{x}) < 0$.]

In spite of their limitations, networks designed in this way show characteristics classically restricted to biological systems. Perhaps researchers have managed to capture the main operations of the biological neuron in this model, or perhaps the similarities observed in some applications are mere coincidence. Only further research will settle this question.

1.2 Training of Neural Networks: Hebbian Learning

The equation of the hyperplane boundary that characterizes the operation of the artificial neuron depends on the synaptic weights $w_1, \ldots, w_M$ and on the threshold value $w_0$, which is normally treated as one more weight of the network. The remaining problem is how to choose, determine or search for the values of these weights that solve the problem at hand. This task is called learning or training of the network. From a historical point of view, Hebbian Learning is the oldest and one of the most studied learning procedures. In 1949, Hebb proposed a learning model that has given rise to many of the learning systems used today for training neural networks. Hebb proposed that the strength of a synaptic connection increases whenever the input and the output of a neuron are simultaneously active. In this way, the network connections that are used most frequently are reinforced, emulating the biological phenomena of habit and learning by repetition. A neural network is said to use Hebbian learning when it increases the value of its weights according to the product of the excitation levels of the source and destination neurons. Hebbian learning is performed through successive iterations using only the input and the output of the network; it never uses the desired output or target. For this reason, this type of learning is called unsupervised learning. This distinguishes it from other learning models that use the additional information of the desired output values, acting as a teacher, which we present next.
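The Hebbian idea can be sketched in a few lines. The version below is the plain product rule, in which each weight grows by a rate times the product of its input and the neuron output; the rate value, the pattern and the fixed output level are assumptions chosen only for illustration.

```python
def hebbian_step(weights, inputs, output, rate=0.1):
    """Plain Hebbian update: each weight grows with input * output."""
    return [w + rate * x * output for w, x in zip(weights, inputs)]


# Illustrative values (assumptions): inputs 0 and 2 are active together
# with the output, input 1 is silent.
weights = [0.0, 0.0, 0.0]
pattern = [1.0, 0.0, 1.0]
output = 1.0
for _ in range(3):
    weights = hebbian_step(weights, pattern, output)
print(weights)
```

Only the connections whose input is active at the same time as the output are reinforced, and no target value appears anywhere in the update, which is what makes the rule unsupervised.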

1.3 Supervised Learning: Rosenblatt and Widrow

Although many learning methods following the Hebbian model have been developed, it seems logical to expect that the most efficient results are achieved by methods that use information about the desired network output (supervised learning). Learning is thus guided towards performing a given function. Around 1960, Rosenblatt [Rosenblatt, 1962] devoted his efforts to developing supervised learning algorithms to train a neural network that he called the perceptron. A perceptron is a Feed Forward neural network like the one shown in Figure 2, where the nonlinearities of the neurons are of the hard threshold type. Some common functions used as alternatives to the hard threshold function will be shown later on. In this way, the McCulloch and Pitts model can be considered the simplest kind of hard threshold perceptron.

Concretely, Rosenblatt showed that a one-layer perceptron is able to learn many practical functions. He proposed a learning rule for the perceptron called the perceptron rule. Let us consider the simplest case of a one-layer perceptron composed of a single neuron, that is, the model proposed by McCulloch and Pitts. If a set of input patterns with their corresponding desired outputs is known, $D_N = \{(\mathbf{x}_1, d_1), (\mathbf{x}_2, d_2), \ldots, (\mathbf{x}_N, d_N)\}$, then, for a given input pattern $\mathbf{x}_k$ of the data set, the perceptron rule updates the network weights $\mathbf{w} = [w_0, w_1, \ldots, w_M]^T$ in the following way

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\,(d_k - o_k)\,\mathbf{x}_k \tag{3}$$
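A minimal sketch of the perceptron rule of Equation (3) for a single neuron follows. The names, the value of the step size `eta` (the learning rate) and the zero initial weights are assumptions made so that the run is repeatable; the learning set is the logical OR with +1 and −1 coding, which is linearly separable.

```python
def predict(w, x):
    """Hard-threshold output; w[0] is the bias w0 and x is the raw input."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if z > 0 else -1


def train_perceptron(data, eta=0.5, max_epochs=100):
    """Apply w(k+1) = w(k) + eta * (d_k - o_k) * x_k until no errors remain."""
    n_inputs = len(data[0][0])
    w = [0.0] * (n_inputs + 1)           # zero start, for reproducibility
    for _ in range(max_epochs):
        errors = 0
        for x, d in data:
            o = predict(w, x)
            if d != o:                   # weights change only on an error
                errors += 1
                x_ext = [1.0] + list(x)  # constant input 1 pairs with w0
                w = [wj + eta * (d - o) * xj for wj, xj in zip(w, x_ext)]
        if errors == 0:
            break
    return w


# Logical OR with +1 / -1 coding: a linearly separable learning set.
or_data = [((1, 1), 1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]
w = train_perceptron(or_data)
print([predict(w, x) for x, _ in or_data])
```

The `max_epochs` cap is the forced stop that a non-separable learning set would require.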

The parameter $\eta$ controls the magnitude of the updates, and hence the speed of convergence of the algorithm. It is called the learning rate and it usually takes values in the range between 0 and 1. The set $D_N$ is called the learning set and, as it includes the values of the desired outputs, the learning is of the supervised type. If the training data set is linearly separable, Rosenblatt showed that the algorithm always converges in a finite number of steps, independently of the value of $\eta$. On the contrary, if the problem is not linearly separable, the algorithm has to be forced to stop, as there will always be at least one misclassified pattern. Usually, training starts by giving small random values to the perceptron weights. In each step of the algorithm, a new input $\mathbf{x}_k$ is applied to the network, the corresponding output $o_k$ is calculated, and the weights are updated only if the error $d_k - o_k$ is not zero. It is interesting to note that if the learning rate is close to 0, the weights change little with each new input and learning is slow; if it is close to 1, the weights can differ greatly between one iteration and the next, reducing the influence of past iterations, and the algorithm may fail to converge. This problem is called instability. Therefore, the learning rate should be adapted to changes in the distribution of the input patterns, balancing training time against stable updating of the weights.

Also in the early 1960s, Widrow and Hoff [Widrow and Hoff, 1960] performed several demonstrations on perceptron-like systems that they called ADALINE (“ADAptive LINear Elements”), proposing a learning rule called the LMS (“Least Mean Square”) algorithm or Widrow-Hoff algorithm. This rule minimizes the Sum of Square Errors (SSE) between the desired output and the output given by the perceptron before the hard threshold activation function.
That is, it minimizes the error function

$$E(\mathbf{w}) = \frac{1}{2} \sum_{j=1}^{N} (d_j - z_j)^2 \tag{4}$$

through a gradient algorithm. The linear output $z$ can be observed in Figure 1. When the gradient of Equation (4) with respect to $\mathbf{w}$ is computed and the weights are updated in the direction opposite to the gradient, the LMS rule is obtained

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta \sum_{j=1}^{N} \big(d_j - z_j(k)\big)\,\mathbf{x}_j \tag{5}$$

where $z_j(k) = \mathbf{w}^T(k)\,\mathbf{x}_j$. This “block” version of the LMS (in the sense that it uses all training patterns in each iteration) is usually replaced by a “stochastic approximation” (pattern by pattern), as shown in the equation

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\,(d_k - z_k)\,\mathbf{x}_k \tag{6}$$
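The pattern-by-pattern rule of Equation (6) can be sketched as follows. The one-dimensional learning set, the step size and the zero initial weights are assumptions for illustration; the set is deliberately not linearly separable, since the pattern at 0 and the pattern at 1 carry labels in the "wrong" order.

```python
def lms_train(data, eta=0.02, epochs=200):
    """Pattern-by-pattern LMS: w <- w + eta * (d_k - z_k) * x_k, z_k linear."""
    n_inputs = len(data[0][0])
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, d in data:
            x_ext = [1.0] + list(x)                       # 1 pairs with w0
            z = sum(wj * xj for wj, xj in zip(w, x_ext))  # linear output
            w = [wj + eta * (d - z) * xj for wj, xj in zip(w, x_ext)]
    return w


def sse(w, data):
    """Error function E(w) = 1/2 * sum_j (d_j - z_j)^2 of Equation (4)."""
    total = 0.0
    for x, d in data:
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
        total += (d - z) ** 2
    return 0.5 * total


# Assumed 1-D learning set that no single threshold can separate.
data = [((-2.0,), -1), ((-1.0,), -1), ((0.0,), 1),
        ((1.0,), -1), ((2.0,), 1), ((3.0,), 1)]
w = lms_train(data)
print(sse([0.0, 0.0], data), sse(w, data))
```

Because the error is measured on the linear output $z$ rather than on the thresholded output, the update keeps shrinking the SSE even though some patterns remain misclassified.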

Unlike the perceptron rule, LMS delivers reasonable results (the best that can be achieved by a linear discriminator in the SSE sense) when the training set is not linearly separable. During these years, researchers all around the world became enthusiastic about the application possibilities that these systems promised.

1.4 Partial eclipse of Neural Networks: Minsky and Papert

The initial euphoria of the early sixties gave way to disappointment when Minsky and Papert [Minsky and Papert, 1969] rigorously analyzed the problem and showed that there are severe restrictions on the class of functions that a perceptron can perform. One of their results shows that a one-layer perceptron with two inputs and one output is unable to perform a function as simple as the exclusive-or (Xor). The inputs of this function take the values 1 or −1, and the output is −1 when the two inputs are different and 1 when they are equal. This problem is illustrated in Figure 4. It can be observed that a linear discriminator is unable to separate the patterns of the two classes. This limitation was well known by the end of the sixties, and it was also known that the problem could be solved by adding more layers to the system. As an example, let us analyze a two-layer perceptron. The first layer can classify input vectors separated by hyperplanes. The second layer can implement the logical functions AND and OR, because both problems are linearly separable. In this way, a perceptron like the one shown in Figure 5 (a) can implement boundaries like the one shown in Figure 5 (b) and thus solve the Xor problem.

Figure 4. The or-exclusive (Xor) problem. [Input patterns $X = (1, 1), (1, -1), (-1, 1), (-1, -1)$ with desired outputs $d = 1, -1, -1, 1$, plotted in the $(x_1, x_2)$ plane as Class A and Class B.]

In the general case, it can be shown that a two-layer perceptron can implement convex and connected regions (a region is said to be convex if any straight line segment that joins two points of its boundary passes only through points of the region enclosed by the boundary). Convex regions are bounded by the (hyper)planes implemented by the nodes of the first layer, and can be open or closed. It has to be noted that the capabilities of Multi Layer Perceptrons rely on the nonlinearities of their neurons. If the activation function of these neurons were linear, the MLP capabilities would be the same as those of the single-layer perceptron. For example, consider a two-layer perceptron with threshold value $w_0 = 0$ and a linear activation function, $f(z) = z$ (see Figure 1). In this case, the outputs of the first layer can be expressed in matrix form as $O_1 = W_1^T X$, and those of the second layer as $O_2 = W_2^T O_1$. Then, the output as a function of the input is obtained as

$$O_2 = W_2^T O_1 = W_2^T W_1^T X = W_{total}^T X \tag{7}$$
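Equation (7) can be checked numerically with a small sketch. The matrices below are arbitrary illustrative values (assumptions), stored directly in transposed form so that each row holds the weights of one node.

```python
def matvec(a, v):
    """Matrix-vector product; each row of a is one node's weight vector."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def matmul(a, b):
    """Matrix-matrix product a * b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


# Assumed example values: W1^T maps 2 inputs to 3 linear hidden nodes,
# W2^T maps the 3 hidden outputs to 1 linear output node.
w1_t = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
w2_t = [[1.0, -2.0, 0.5]]
x = [2.0, -1.0]

two_layer = matvec(w2_t, matvec(w1_t, x))  # O2 = W2^T (W1^T X)
w_total_t = matmul(w2_t, w1_t)             # W_total^T = W2^T W1^T
one_layer = matvec(w_total_t, x)           # single equivalent layer
print(two_layer, one_layer)
```

Both computations give the same output for any input, because a composition of linear maps is itself a linear map.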

Figure 5. (a) Two-layer perceptron able to solve the Xor problem, implementing a boundary as shown in (b). [Panel (b): the decision boundaries of first-layer nodes 1 and 2 in the $(x_1, x_2)$ plane separate Class A from Class B.]
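The two-layer construction for the Xor problem can be written out explicitly. The weights below are one hand-picked choice (an assumption, not the values of Figure 5): each first-layer node isolates one corner of the input square, and the second layer takes the logical OR of the two nodes.

```python
def step(z):
    """Hard threshold with +1 / -1 output levels."""
    return 1 if z > 0 else -1


def two_layer_xor(x1, x2):
    """Two-layer hard-threshold perceptron with hand-picked weights."""
    o1 = step(x1 + x2 - 1.0)    # first-layer node 1: +1 only for (1, 1)
    o2 = step(-x1 - x2 - 1.0)   # first-layer node 2: +1 only for (-1, -1)
    return step(o1 + o2 + 1.0)  # second layer: logical OR of o1 and o2


patterns = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
outputs = [two_layer_xor(x1, x2) for x1, x2 in patterns]
print(outputs)
```

With the convention used in the text (output 1 when the inputs are equal, −1 when they differ), the outputs match the desired values listed for Figure 4.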

The mapping in Equation (7) could be performed by a single-layer perceptron whose weights were $W_{total}$. Therefore, if the nodes are linear elements, the performance of the structure is not improved by adding new layers, as an equivalent one-layer perceptron can always be found.

In spite of the possibilities opened by the MLP, Minsky and Papert, prestigious scientists of their time, emphasized that algorithms to train such structures were not known, and expressed their scepticism that they would ever be developed. The book of Minsky and Papert [Minsky and Papert, 1969], which showed some critical examples of the disadvantages of NNs versus classical computers in terms of their capabilities for storing information, was a heavy blow to the enthusiasm for NN research, eclipsing its development for the next twenty years.

1.5 Backpropagation algorithm: Werbos, Rumelhart et al. and Parker

It is true that the single-layer perceptron has the limitation of being a simple linear discriminator. There are reasons to affirm that it is only able to solve “toy” problems. Although these limitations are reduced as the number of layers grows, it is difficult to find adequate weights to solve a given problem. This problem was solved by introducing “soft”, differentiable nonlinearities in the neurons in place of the classical hard threshold. Concretely, the sigmoidal function is very appropriate (Figure 6). Among others, there is an especially relevant theorem on the capabilities of the MLP with soft activation functions, Cybenko’s Theorem [Cybenko, 1989]: a two-layer perceptron whose first-layer nodes (in unlimited number) use sigmoidal activation functions is sufficient to approximate any continuous correspondence between $\mathbb{R}^{N_0}$ and $[-1, 1]^{N_L}$ (therefore, it is also possible to establish any classification). For a first review of perceptron capabilities as “approximators” in the case of soft nonlinearities, it is worth mentioning the work of Hornik et al. [Hornik et al., 1989, Hornik et al., 1990].

Figure 6. Multilayer Perceptron with sigmoidal nonlinearities

But, again, we must return to the question of how to train the network weights. In a form completely analogous to the LMS algorithm previously described, the backpropagation algorithm updates the network weights (in this case of an MLP) in the direction opposite to the gradient of the error function to be minimized (e.g. SSE). For that purpose, the chain rule is applied as many times as required to calculate that gradient for all the weights in the network. As the output is a differentiable function, this calculation is relatively easy [Haykin, 1994]. The backpropagation algorithm was proposed independently and successively by Werbos [Werbos, 1974], Rumelhart et al. [Rumelhart et al., 1986] and Parker [Parker, 1985]. It can be said that the pessimism aroused by the book of Minsky and Papert found its counterpart twenty years later in the development of the backpropagation algorithm.
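The following is a small sketch of the gradient computation described above, not the formulation of any of the cited authors. It assumes a two-layer perceptron with tanh nodes, the SSE error, batch updates, three hidden nodes and a fixed random seed; all of these choices, and every name in the code, are illustrative assumptions.

```python
import math
import random


def forward(params, x):
    """Two-layer perceptron: tanh hidden nodes and one tanh output node."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(b + sum(w * xi for w, xi in zip(row, x)))
              for row, b in zip(w1, b1)]
    out = math.tanh(b2 + sum(w * h for w, h in zip(w2, hidden)))
    return hidden, out


def sse(params, data):
    """Sum of square errors, E = 1/2 * sum (d - o)^2."""
    return 0.5 * sum((d - forward(params, x)[1]) ** 2 for x, d in data)


def backprop_step(params, data, eta):
    """One batch update in the direction opposite to the SSE gradient."""
    w1, b1, w2, b2 = params
    g_w1 = [[0.0] * len(row) for row in w1]
    g_b1 = [0.0] * len(b1)
    g_w2 = [0.0] * len(w2)
    g_b2 = 0.0
    for x, d in data:
        hidden, out = forward(params, x)
        # Chain rule at the output node: dE/dz2 = -(d - o) * (1 - o^2).
        delta_out = -(d - out) * (1.0 - out ** 2)
        g_b2 += delta_out
        for j, hj in enumerate(hidden):
            g_w2[j] += delta_out * hj
            # Chain rule once more, back through hidden node j.
            delta_h = delta_out * w2[j] * (1.0 - hj ** 2)
            g_b1[j] += delta_h
            for i, xi in enumerate(x):
                g_w1[j][i] += delta_h * xi
    new_w1 = [[w - eta * g for w, g in zip(row, g_row)]
              for row, g_row in zip(w1, g_w1)]
    new_b1 = [b - eta * g for b, g in zip(b1, g_b1)]
    new_w2 = [w - eta * g for w, g in zip(w2, g_w2)]
    return new_w1, new_b1, new_w2, b2 - eta * g_b2


random.seed(0)  # fixed seed so that the run is repeatable
n_hidden = 3
params = (
    [[random.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)],
    [random.uniform(-1, 1) for _ in range(n_hidden)],
    [random.uniform(-1, 1) for _ in range(n_hidden)],
    random.uniform(-1, 1),
)
xor_data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), 1)]
initial_error = sse(params, xor_data)
for _ in range(2000):
    params = backprop_step(params, xor_data, eta=0.1)
final_error = sse(params, xor_data)
print(initial_error, final_error)
```

The two `delta` lines are where the chain rule is applied, once per layer; a deeper network would repeat the second one for each additional layer.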

2. NEURAL NETWORKS VS CLASSICAL COMPUTERS

Classical digital computers process information at two basic levels: hardware and software. The computations performed are algorithmic and sequential. Each problem is solved through an algorithm coded in a program that is physically located in the computer memory. Problems are solved one after the other. Algorithms are executed as many times as needed, with the same reliability and at electronic speed. Nevertheless, there are many real problems where computers cannot yet be applied successfully. For example, think of a little mosquito finding its way to survive in the world. Such a problem is an unsolved challenge for any automatic device. The difference probably lies in the fact that living beings do not follow the computer processing scheme. Biological brains process information in a massively parallel, non-sequential way. Problems are solved by the cooperative participation of millions of highly interconnected elementary processors, called neurons. The neurons do not need to be programmed. From the stimuli they receive from other neurons, they are able to modify, adapt or learn their functioning. The system does not need a central processing unit to control its activities. It is interesting to note that biological neural systems work at a speed several orders of magnitude lower than electronic systems. The brain is therefore an adaptive, non-linear, sophisticated processing system. Knowledge is distributed in the activation state of the neurons, and memory is not addressed through fixed labels.

The architecture of Artificial Neural Networks tries to emulate the basic neural features of brains, and they are designed by learning from examples. They could be defined as networks that massively connect simple units (usually adaptive), hierarchically organized, that try to interact with real-world objects in the manner of biological systems. Advantages of NNs over classical computers are:

1 Adaptive Learning: they are able to learn and to perform tasks by an adaptive procedure.
2 Self-organization: they are able to build their own internal organization or representation of the information provided in a learning phase.
3 Fault Tolerance: they are able to perform the same function despite the partial destruction of the network.
4 Real time operation: their hardware architecture is oriented to massively parallel processing of information.
5 Simplicity of integration with present technology: these systems can be easily simulated on present computers and are also implemented in specific neural hardware, which allows their modular integration in present systems.

3. BIOLOGICAL AND ARTIFICIAL NEURONS

3.1 The Biological Neuron

The biological neuron, whose basic operation is not yet completely known or understood, is composed of a cell body and a series of branching ramifications called dendrites. One of these branches is particularly long and is called the axon. It starts from the cell body and ends in another series of branches. These nerve terminals are used by neurons to contact each other through synaptic connections. When a cell receives signals from other cells (these can be excitatory or inhibitory) and their global effect is an excitation that exceeds a certain threshold value, it responds by transmitting a nerve signal through the axon to the adjacent cells by means of the synapses at the nerve terminals. The human nervous system is made up of these cells and is of fascinating complexity. It is estimated that $10^{11}$ neurons participate in more than $10^{15}$ interconnections over channels that can measure more than a meter. Studies of human brain anatomy conclude that there are more than 1000 synapses at the input and output of each neuron. It is important to note that, although the switching time of a neuron (a few milliseconds) is almost a million times longer than that of current computer elements, biological neurons have a much higher connectivity (thousands of times) than current supercomputers.

Neurons are composed of the cell core, or soma, and several branches called the axon and the dendrites. The dendrites of different neurons are connected at what are called synapses, which establish the connection with neighboring neurons in order to make communication among them possible. Each neuron has two basic states: activation and rest. When a neuron is activated, it emits through the axon a train of electrical impulses whose frequency depends on its level of activation. Information is coded in the frequency of the impulses and not in their amplitude.
The signal produced in the neuron body propagates along the axon to other neurons through chemical exchanges that take place in the synapses. The chemical components released at these terminals are called neurotransmitters and contribute to increasing or inhibiting the activation level of the neuron that receives them. Due to the action of the neurotransmitters, which are basically chemical signals, ionic channels are opened in the receiving neuron and ions enter it, contributing to the overall electrical charge of the neuron, or excitation level. When the excitation level surpasses a certain activation threshold, the neuron is activated. The efficiency of the synapse depends on several factors: the number of neurotransmitter glands, the concentration in the membrane of the neighboring neuron, the efficiency of the ionic channels and other physical and chemical variables. As for the learning procedure, recent discoveries suggest that it is also of electrochemical nature, taking place among neighboring neurons that are hierarchically close in a layered structure. The chemical released in the learning process seems to be nitric oxide (NO). Its molecules are able to pass through the membrane and travel to neighboring neurons, controlling the efficiency of the connection through reactions with other chemicals in the receiving neuron. This regulation of the efficiency of the electrochemical connection among neighboring neurons is responsible for the learning procedure.

3.2 The Artificial Neuron

The simplest model of artificial neuron, as presented in Figure 1, is obtained by approximating the action of all neuron inputs by a linear function. Let us call this function the Base Function, $u(\cdot)$. In this case, the Base Function is a weighted sum

$$u = w_0 + \sum_{j=1}^{n_i} w_j x_j$$

where $w_0$ is a threshold and $w_j$ are the synaptic weights, which determine the effect of the inputs on the activation function. The output function of an artificial neuron can be expressed as

$$y = f(\mathbf{x}) = f\Big(w_0 + \sum_{j=1}^{n_i} w_j x_j\Big)$$

In an artificial neuron, this function can be computed in three steps: the input values $x_j$ are weighted by the synaptic weights $w_j$ and summed, the threshold value $w_0$ is added to give the base function value $u(\cdot)$, and a non-linear activation function $f(u)$ is applied. Typical activation functions, some of which are shown in Figure 7, are:

• Step function: $u(t) = 0$ if $t < 0$, and $u(t) = 1$ if $t \ge 0$
• Sign function: $\mathrm{sgn}(t) = -1$ if $t < 0$, and $\mathrm{sgn}(t) = 1$ if $t \ge 0$
• Gaussian function: $f(x) = a\,e^{-x^2/2}$
• Exponential function: $f(x) = \dfrac{1}{1 + e^{-\beta x}}$, $\beta > 0$
• Hyperbolic function: $f(x) = \tanh(\beta x)$, $\beta > 0$

Figure 7. Some typical activation functions. [Panels: Sign function, $f(x - a) = 1$ if $x \ge a$ and $-1$ if $x < a$; Hyperbolic function, $f(x) = \tanh(\beta x)$, $\beta > 0$.]

Hyperbolic and exponential functions are classified as sigmoids or sigmoidal functions. They are real-valued, bounded and monotonic ($f'(x) > 0$). In the case of sigmoidal functions, the value of the slope at the origin is called the gain, and it measures the steepness of the transition. Therefore, if the gain tends to infinity, the sigmoid tends to a hard threshold function (Step or Sign). Accordingly, the Exponential and Hyperbolic functions have gains of $\beta/4$ and $\beta$, respectively. As assumed so far, the activation function of a neuron is nonlinear. If the function $f(u)$ is linear, $f(u) = u$, then the artificial neuron is called a Linear Neuron or Linear Node of the NN.

NEURAL NETWORKS: CHARACTERISTICS AND TAXONOMY

A Neural Network can be represented as an ordered pair $(G, E)$, composed of a set $G$ of basic processing elements, also called processing units, artificial neurons or nodes, and a set $E$ of interconnections among them. The node set $G$ is partitioned into subsets called layers. Each processing unit always has a transfer function and can also have a local memory. The output $y$ is computed by this function from the weighted input values and the values stored in the local memory. There are four main aspects that characterize all NNs:

a) Data Representation. According to the input-output form, ANNs can be classified as continuous type NNs, digital NNs or hybrid NNs. In the continuous type, input-output data are of an analog nature: their values are real and continuous. In digital NNs, input-output data are of a digital nature. In the hybrid case, inputs are analog and outputs are binary.

b) Topology. The Architecture or Topology of the NN refers to the way the nodes are physically arranged in the network. The nodes form layers, or groups of nodes that share a common input and feed their output to common nodes. Only neurons in the input and output layers interact with external systems. The rest of the nodes in the network have only internal connections, forming what are called hidden layers. Therefore, the topology of a NN is characterized by the number of layers, the number of neurons in each layer, the degree of connectivity and the type of connections among the nodes.

c) Input-Output Association. With respect to the type of input-output association, NNs can be classified as heteroassociative or autoassociative. Heteroassociative NNs implement a certain function, frequently one that is difficult to express analytically. They associate a set of inputs with a set of outputs in such a way that each input has a corresponding output. In autoassociative networks, the outputs have the purpose of rebuilding input information that has been corrupted, by associating each input with the most similar stored data.

d) Learning Procedure. All the connections or synapses of the nodes in a NN have an associated synaptic weight or efficiency factor. Each connection or synapse between node $i$ and node $j$ is weighted by $w_{ji}$. These weights are responsible for the learning of the neural network. In the learning phase, the NN modifies its weights in response to new input information. The weights are modified following a convergent algorithm, and when all the weight values have stabilized and the learning phase ends, it is said that the NN has "learnt". For the learning process it is crucial to establish the weight updating algorithm with which the NN correctly learns the new input information. According to the learning criterion, NNs can be classified as supervised learning NNs or unsupervised learning NNs.

Figure 8 shows the most common classification of NNs.

5. FEED FORWARD NEURAL NETWORKS: THE PERCEPTRON

First presented in Section 1.1, Feed Forward Neural Networks are generally defined as those networks composed of one or more layers whose nodes are connected in such a way that their inputs come only from nodes in the previous layer and their outputs connect exclusively to neurons of the following layer. Their name comes from the fact that the output of each layer feeds the units of the following layer. Of all feed forward NNs, the most popular is the Multilayer Perceptron, developed as an extension of the Perceptron proposed by Rosenblatt in 1962 [Rosenblatt, 1962]. In this type of network, learning is supervised because it uses information about the output that the network must provide for the current input.

Figure 8. Neural Networks basic taxonomy

The learning phase or training phase consists in presenting to the network a set of input-output pairs, called training patterns,

$$D_N = \{(x_1, d_1), (x_2, d_2), \ldots, (x_N, d_N)\}$$

with $x_i \in \mathbb{R}^p$ and $d_i \in \mathbb{R}^k$, $i = 1, 2, \ldots, N$, in such a way that the weights are adjusted. Once the training phase is completed, the network is designed and ready to work in what is called the direct mode phase. In this phase, the network classifies the inputs by the following binary decision rule

$$g = \begin{cases} 1 & \text{if } \phi(x, w) > 0 \\ 0 & \text{if } \phi(x, w) < 0 \end{cases}$$

where $\phi(x, w)$ is the discriminating function, that is, the space $\mathbb{R}^p$ is divided into two regions by the decision boundary $\phi(x, w) = 0$. Logically, the choice of the discriminating function $\phi(x, w)$ depends on the distribution of the training patterns.

5.1 One Layer Perceptron

It basically consists of a set of nodes whose activation is produced by the weighted sums of the input values and, consequently, the discriminating function takes the form

$$\phi(x, w) = \sum_{i=1}^{p} w_i x_i + \theta = 0 \qquad (8)$$

Also, if we set $\theta = w_0$ and consider the inputs in the space $\mathbb{R}^{p+1}$, such that $x = [x_1, x_2, \ldots, x_p, 1]$ and $w = [w_1, w_2, \ldots, w_p, w_0]$, Equation (8) can be expressed as

$$\phi(x, w) = w x^T = 0$$

Among other things, it serves to perform the pattern classification task through a discriminating function of the form [Karayiannis and Venetsanopoulos, 1993], [Hush and Horne, 1993]:

$$u_k(x_n) = \sum_{j=0}^{N} w_{kj} x_{nj}$$

The classification rule assigns class $k$ to the input pattern if the $k$-th network output is the highest of all outputs. The network must be trained, following an appropriate algorithm, to produce the desired output for each pattern:

$$u_k(x_n) \ge u_j(x_n) \quad \forall j \ne k \;\longrightarrow\; x_n \in W_k$$

This decision rule is sometimes replaced by a binary decision rule with a decision threshold. The Perceptron is a system that operates in this way. After learning or training, the Perceptron structure can separate the classification space into regions, one region for each class. The decision boundaries are composed of hyperplane segments defined as

$$u_k(x_n) - u_j(x_n) = 0$$

The Perceptron was initially proposed by Rosenblatt and a group of his students. Their work showed the versatility of the Perceptron. Unfortunately, the problem of linear separability reduced interest in its use.
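The classification rule described above (assign the class whose discriminant output is the highest) can be illustrated with a short Python sketch. The weight matrix and the input vectors below are hypothetical values chosen only for the example, and the constant input that carries the threshold is placed at index 0 by assumption.

```python
def discriminants(W, x):
    """u_k(x) = sum_j w_kj * x_j for every class k.

    Assumption of this sketch: x[0] = 1 is the constant input that
    carries the threshold weight w_k0.
    """
    return [sum(w_kj * x_j for w_kj, x_j in zip(w_k, x)) for w_k in W]


def classify(W, x):
    """Return the index k of the largest discriminant output."""
    u = discriminants(W, x)
    return max(range(len(u)), key=lambda k: u[k])


# Hypothetical weights for three classes and two input features
W = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, -1.0, -1.0],
]

print(classify(W, [1.0, 2.0, 0.5]))
print(classify(W, [1.0, 0.2, 0.9]))
print(classify(W, [1.0, -1.0, -1.0]))
```

The boundary between any two classes in this sketch is the set of points where their two discriminants are equal, which is a hyperplane in the input space.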

5.1.1 Perceptron Training

It can be summarized in five steps:

1. Weights and threshold initialization. Each of the weights $w_i$ is initialized to a small random value, with the threshold denoted $w_0 = \theta$.

2. For $i = 1, 2, \ldots, N$, presentation of the training pattern: a new input/output (I/O) training pair is composed of a new input $X_p = (x_1, x_2, \ldots, x_N)$ and its corresponding desired output $d(t)$.

3. Computation of the present output

$$y_i(t) = f\left(\sum_{j=1}^{M} w_{ij} x_j(t) - \theta_i\right) = f\left(\sum_{j=0}^{M} w_{ij} x_j(t)\right) = f(Net_i)$$

where the second sum absorbs the threshold as an additional weight $w_{i0} = \theta_i$ with constant input $x_0 = -1$.

4. Weight adaptation: $\Delta W_i = \eta\,[d(t) - y(t)]\,x_i(t)$.
• $\eta$: learning rate, $0 < \eta < 1$.
• $d(t)$: desired output; $y(t)$: present output.
• This process is repeated until the error $e(t) = d(t) - y(t)$ for each of the patterns is zero or less than a preset value.

5. Back to step 2.

The convergence of the perceptron training is established by the following theorem: if the training set of a multiple classification problem is linearly separable, then the perceptron training algorithm converges to a correct solution in a finite number of iterations. The mathematical proof of this theorem can be found in [Rosenblatt, 1962], and its significance lies in the fact that a multiple class problem can be reduced to a binary classification. Two typical examples of this situation are shown in Figure 9.
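The five training steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: a Step activation is used, the threshold is handled as an extra weight on a constant input of 1, and the function names, learning rate, epoch limit and random seed are hypothetical choices. The OR and AND patterns are the two linearly separable cases mentioned in the text.

```python
import random


def step(u):
    """Step activation: 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0


def predict(w, x):
    """Present output for input x; w[0] acts as the threshold weight."""
    xa = [1.0] + list(x)
    return step(sum(wi * xi for wi, xi in zip(w, xa)))


def train_perceptron(patterns, eta=0.5, max_epochs=1000, seed=0):
    """Perceptron training on a list of (input, desired output) pairs."""
    rng = random.Random(seed)
    n_inputs = len(patterns[0][0])
    # Step 1: small random initial weights (threshold included as w[0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(max_epochs):
        errors = 0
        for x, d in patterns:          # Step 2: present each training pair
            y = predict(w, x)          # Step 3: present output
            e = d - y
            if e != 0:
                errors += 1
                xa = [1.0] + list(x)
                # Step 4: delta w_i = eta * (d - y) * x_i
                w = [wi + eta * e * xi for wi, xi in zip(w, xa)]
        if errors == 0:                # all patterns correct: stop
            break
    return w                           # Step 5 is the outer loop


OR_PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
AND_PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w_or = train_perceptron(OR_PATTERNS)
w_and = train_perceptron(AND_PATTERNS)
print([predict(w_or, x) for x, _ in OR_PATTERNS])
print([predict(w_and, x) for x, _ in AND_PATTERNS])
```

Because both pattern sets are linearly separable, the convergence theorem guarantees that the loop ends with every pattern correctly classified; for a set that is not linearly separable the loop would simply run until `max_epochs`.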

6. LMS LEARNING RULE

Figure 9. Logical functions OR and AND reduced to a binary classification problem (panels: OR function and AND function; axes $x_1$, $x_2$; input points 00, 01, 10 and 11)

Nevertheless, even with the simple Perceptron structure, a reasonable solution can be achieved for a set that does not satisfy the linear separability property, by the use of the Least Mean Square (LMS) convergence algorithm to update the NN weights during the learning phase. In general, the error function of Equation (4), also called cost function or objective function, to be minimized by the LMS algorithm can be expressed as follows [Hush and Horne, 1993]:

$$E = \sum_{k=1}^{M} \sum_{x_n \in W_k} \left\| u(x_n) - \delta_k \right\|^2$$

where $\delta_k$ is an $M$-element vector with all its components of zero value, except the $k$-th one, which corresponds to the correct classification. Therefore, for a given training set $D_N$, if $d_k$ is the desired output for the $k$-th input vector and $y_k$ is the computed value, then the Mean Square Error (MSE) over the input-output pairs is given by

$$\langle \varepsilon_k^2 \rangle = \frac{1}{N} \sum_{k=1}^{N} \varepsilon_k^2 = \frac{1}{N} \sum_{k=1}^{N} (d_k - y_k)^2$$

or, in vectorial notation, with $y_k = w^T x_k$,

$$\langle \varepsilon_k^2 \rangle = \langle d_k^2 \rangle - 2 \langle d_k x_k^T \rangle w + w^T \langle x_k x_k^T \rangle w$$

The minimum square error corresponds to the weight vector $w$ that satisfies the equation

$$\frac{\partial \langle \varepsilon_k^2 \rangle}{\partial w} = 0$$

In the case of two weights ($N = 2$), the error surface is a paraboloid, as shown in Figure 10. From Figure 10 it can be observed that the optimum value for the weights of the network is the one that makes the gradient null. A possible search procedure is steepest descent. The gradient direction is perpendicular to the contour lines at each point of the error surface. From the starting point of the algorithm, the weight vector does not head directly toward the minimum, except in the case of circular contour lines. The weight updates in each iteration step must be small, or the weight vector could wander over the hypersurface without ever reaching the minimum sought.
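A minimal sketch of LMS learning for a single linear unit is given below, assuming the common on-line form in which each pattern produces a small step against the gradient of its squared error. The function names, the learning rate, the number of epochs and the synthetic training set are illustrative assumptions and are not taken from the text.

```python
def lms_train(patterns, eta=0.1, epochs=200):
    """LMS rule for a linear unit y = w^T x (w[0] is the threshold weight).

    After each pattern the weights move a small step against the gradient
    of the squared error: w <- w + eta * (d - y) * x.
    """
    n_inputs = len(patterns[0][0])
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, d in patterns:
            xa = [1.0] + list(x)               # constant input for w[0]
            y = sum(wi * xi for wi, xi in zip(w, xa))
            e = d - y
            w = [wi + eta * e * xi for wi, xi in zip(w, xa)]
    return w


def mean_square_error(w, patterns):
    """MSE = (1/N) * sum_k (d_k - y_k)^2 over the training set."""
    total = 0.0
    for x, d in patterns:
        xa = [1.0] + list(x)
        y = sum(wi * xi for wi, xi in zip(w, xa))
        total += (d - y) ** 2
    return total / len(patterns)


# Hypothetical training set generated by d = 0.5 + 2*x1 - x2
patterns = [((x1, x2), 0.5 + 2.0 * x1 - x2)
            for x1 in (-1.0, 0.0, 1.0) for x2 in (-1.0, 0.0, 1.0)]

w = lms_train(patterns)
print(w, mean_square_error(w, patterns))
```

The small learning rate reflects the remark above that the weight updates in each iteration must be small; with a large value the weight vector can oscillate over the error surface instead of settling at the minimum.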

6.1 The Multilayer Perceptron

A Perceptron of $n$ layers is composed of $n + 1$ layers $L_l$, $l = 0, 1, \ldots, n$, with several processing units in each one, where $L_0$ is the input layer, $L_n$ is the output layer and $L_l$, $l = 1, \ldots, n - 1$, are the hidden layers. The nodes in the hidden and output layers are individual processing units. The output of each of these nodes is obtained by adding all its weighted inputs and passing the result through a non-linear function of sigmoidal type (see Figure 6).
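The layer-by-layer computation just described can be sketched as follows. The data layout (one list of weight rows per layer, with the threshold stored as the first element of each row), the logistic sigmoid and the sample weights are assumptions made for this illustration.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, assumed here as the sigmoidal non-linearity."""
    return 1.0 / (1.0 + math.exp(-a))


def layer_output(weights, prev_outputs):
    """Outputs of one layer.

    weights[j][0] is the threshold of node j and weights[j][1:] are the
    weights from the nodes of the previous layer.
    """
    return [
        sigmoid(w_j[0] + sum(w * u for w, u in zip(w_j[1:], prev_outputs)))
        for w_j in weights
    ]


def mlp_forward(layers, x):
    """Propagate the input vector x through every layer in turn."""
    u = list(x)                      # layer L0: the input vector itself
    for weights in layers:           # hidden layers, then the output layer
        u = layer_output(weights, u)
    return u


# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output node
layers = [
    [[0.0, 1.0, 1.0], [0.0, 1.0, -1.0]],   # hidden layer
    [[0.0, 2.0, -2.0]],                    # output layer
]

print(mlp_forward(layers, [0.0, 0.0]))
print(mlp_forward(layers, [1.0, 0.5]))
```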

Figure 10. Error Paraboloid of the LMS learning

Usually, in a Multilayer Perceptron, the nodes in each layer are fully interconnected with the neurons in the adjacent layer. This pattern is repeated layer by layer throughout the network.

6.1.1 Learning Algorithm (“Backpropagation”)

Before detailing the learning algorithm, let us introduce the following nomenclature:

$u_{l,j}$: output of the $j$-th node in layer $l$.
$w_{l,j,i}$: weight that connects the $i$-th node in layer $l - 1$ to the $j$-th node in layer $l$.
$x_p$: $p$-th training pattern.
$u_{0,i}$: $i$-th component of the input vector.
$d_j(x_p)$: desired output of the $j$-th node in the output layer when the $p$-th pattern is presented at the network input.
$N_l$: number of nodes in layer $l$.
$L$: number of layers.
$P$: number of training patterns.

Obviously, in a Perceptron-like structure, the outputs depend on the synaptic weights that connect neurons in the different layers. Such weights are updated in the following way:

1. Associating a set of input patterns with a set of desired outputs. In a pattern classification problem this amounts to a primary classification of the patterns by the designer (supervised training).

2. Presenting all training patterns to the network. The network then processes all patterns and presents an output. The classification offered by the network may be erroneous, and the error can then be easily quantified.

3. Defining an objective function. For example, the Mean Square Error (MSE) between the desired and real outputs of the units in the output layer [Hush and Horne, 1993]:

$$J_p(w) = \frac{1}{2} \sum_{q=1}^{N_L} \left[ u_{L,q}(x_p) - d_q(x_p) \right]^2$$

This objective function represents an error surface in a parametric hyperspace. The training or learning then consists in searching for the minimum of that surface with a gradient descent algorithm, which moves in the direction opposite to the surface gradient to find a set of weights that minimizes the error. Each weight is modified or adapted in each iteration step by an amount that is proportional to the partial derivative of the function with respect to that weight:

$$w_{l,j,i}(k + 1) = w_{l,j,i}(k) - \eta \frac{\partial J_p(w)}{\partial w_{l,j,i}} \qquad (9)$$

In Equation (9), the constant $\eta$ is the learning rate. The speed of convergence of the algorithm depends on $\eta$, because the amount of weight modification in each iteration step is proportional to the gradient in the weight direction, scaled by the constant value of the learning rate. At this point, the training algorithm can be designed if we know how to calculate the partial derivative with respect to each weight of the network. This derivative can be easily calculated using the chain rule:

$$\frac{\partial J_p(w)}{\partial w_{l,j,i}} = \frac{\partial J_p(w)}{\partial u_{l,j}} \frac{\partial u_{l,j}}{\partial w_{l,j,i}}$$

that is,

$$\frac{\partial J_p(w)}{\partial w_{l,j,i}} = \frac{\partial J_p(w)}{\partial u_{l,j}} \, f'\!\left( \sum_{m=0}^{N_{l-1}-1} w_{l,j,m} u_{l-1,m} \right) u_{l-1,i}$$

where $f(\cdot)$ represents the sigmoidal function previously defined. This function has a very simple derivative:

$$f'(a) = \frac{d f(a)}{d a} = f(a)\,[1 - f(a)]$$

when the parameter $\beta$ is of unit value. In this expression we can observe that the sensitivity of the objective function to each weight depends on the sensitivity of this function to the output of the neuron that is fed by the synaptic weight input.
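The chain-rule expression above, applied layer by layer from the output back to the input, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the logistic sigmoid with unit gain, the squared-error objective $J_p$ with the factor 1/2, thresholds stored as the first weight of each node, and hypothetical names and sample weights.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(layers, x):
    """Return the outputs of every layer; outputs[0] is the input vector."""
    outputs = [list(x)]
    for W in layers:
        prev = [1.0] + outputs[-1]      # constant input for the threshold
        outputs.append(
            [sigmoid(sum(w * u for w, u in zip(w_j, prev))) for w_j in W]
        )
    return outputs


def objective(layers, x, d):
    """J_p = 1/2 * sum_q (u_Lq - d_q)^2 for one training pattern."""
    y = forward(layers, x)[-1]
    return 0.5 * sum((yq - dq) ** 2 for yq, dq in zip(y, d))


def backprop_gradient(layers, x, d):
    """Partial derivatives of J_p with respect to every weight."""
    outputs = forward(layers, x)
    # Sensitivity of J_p to the outputs of the output layer
    dJ_du = [yq - dq for yq, dq in zip(outputs[-1], d)]
    grads = [None] * len(layers)
    for l in range(len(layers) - 1, -1, -1):
        u_l = outputs[l + 1]
        prev = [1.0] + outputs[l]
        # dJ/du_j times f'(net_j), using f' = f * (1 - f)
        delta = [g * u * (1.0 - u) for g, u in zip(dJ_du, u_l)]
        grads[l] = [[dj * ui for ui in prev] for dj in delta]
        # Sensitivities propagated back to the previous layer outputs
        dJ_du = [
            sum(delta[j] * layers[l][j][i + 1] for j in range(len(delta)))
            for i in range(len(outputs[l]))
        ]
    return grads


def gradient_step(layers, grads, eta):
    """Equation (9): w(k+1) = w(k) - eta * dJ/dw for every weight."""
    return [
        [[w - eta * g for w, g in zip(w_j, g_j)] for w_j, g_j in zip(W, G)]
        for W, G in zip(layers, grads)
    ]


# Hypothetical network (2 inputs, 2 hidden nodes, 1 output) and pattern
layers = [[[0.1, 0.4, -0.3], [-0.2, 0.25, 0.6]], [[0.05, -0.7, 0.5]]]
x, d = [1.0, -1.0], [1.0]

grads = backprop_gradient(layers, x, d)
updated = gradient_step(layers, grads, eta=0.1)
print(objective(layers, x, d), objective(updated, x, d))
```

A useful sanity check for any such implementation is to compare the computed derivatives with finite differences of the objective, since the two should agree to several decimal places.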

This last sensitivity can in turn be calculated from the sensitivities of the objective function with respect to the node outputs of the following layer, and so on [Hush and Horne, 1993]. This process is repeated until the output layer is reached. The sensitivity of the objective function to each node output can therefore be calculated recursively, starting from the output layer. The sensitivities to the outputs of nodes in hidden layers are also called "errors", although, strictly speaking, they do not represent a real error. In order to calculate the error in the hidden layers, the error in the output layer must be computed and backpropagated to the previous layers. That is what the Backpropagation algorithm does.

In this algorithm, training is usually started with small random values of the synaptic weights in order to give the backpropagation algorithm a safe starting point. Once the structure of the network is chosen, the key parameter to be controlled is the learning rate. A value that is too small will slow the learning process. A value that is too high will accelerate the learning, but can cause the algorithm to miss the minimum of the error surface. The optimal value of this parameter has to be found empirically. Once the learning has started, it must continue until a minimum error is found, or until the weight values no longer change. At that point, the network is said to have finished learning. It is not always practical to wait for this point, and several other stopping criteria are adopted, among them:

1. When the gradient of the error surface is sufficiently small, which means that the gradient learning algorithm has found a set of weight values at a local minimum of the error surface.

2. When the error between the real network output and the desired one is below a tolerable value for the application. Obviously, this case requires knowledge of the maximum tolerable error for the given application.

3. In pattern classification problems, when all the learning patterns have been correctly classified, the training procedure can be stopped.

4. Training can be stopped after a fixed number of iterations.

5. Finally, a more appropriate and developed procedure is to train the network with one set of patterns and monitor the error over a different set, called the test set. The training phase is stopped when a minimum error on the test set is found. This last method prevents the overspecialization of the network on the training set, a phenomenon that occurs when the error on the training set is lower than the error over another set of patterns of the same application, showing that the network has lost generalization capabilities. The method requires twice as many patterns, which can be expensive or even impossible. Therefore, in order to apply neural networks efficiently to real problems, it is very important to have a sufficient number of patterns.
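The fifth criterion, stopping when the error on a separate test set reaches its minimum, can be sketched as a small generic loop. The function name, the callable arguments and the `patience` parameter (how many iterations without improvement are tolerated before stopping) are assumptions introduced for this illustration; the toy error curve simply stands in for a real network.

```python
def train_with_early_stopping(step, test_error, max_iterations=1000, patience=10):
    """Run training iterations until the test-set error stops improving.

    step()        performs one training iteration on the training set
    test_error()  returns the current error on the separate test set
    Returns the iteration with the lowest test error and that error.
    """
    best_error = float("inf")
    best_iteration = 0
    for k in range(1, max_iterations + 1):
        step()
        err = test_error()
        if err < best_error:
            best_error, best_iteration = err, k
        elif k - best_iteration >= patience:
            break                      # no improvement for `patience` steps
    return best_iteration, best_error


# Toy stand-in for a network: the test error falls, then rises again
state = {"k": 0}


def toy_step():
    state["k"] += 1


def toy_test_error():
    return 0.1 + (state["k"] - 20) ** 2 / 400.0


best_k, best_err = train_with_early_stopping(
    toy_step, toy_test_error, max_iterations=100, patience=5
)
print(best_k, best_err, state["k"])
```

In a real application the weights at the best iteration would also be saved, so that the network can be restored to the state with the lowest test error.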

6.2 Acceleration of the training procedure

The training procedure described in the previous section presents two main problems: on the one hand, the convergence or training phase is very slow and, on the other hand, it is not easy to select the appropriate learning rate precisely. A simple way to accelerate the network training is to use second order methods, which exploit the information contained in the matrix of second derivatives (Hessian). These methods reduce the number of iterations needed in the training phase to reach a local or global minimum of the error surface. Nevertheless, they require a larger amount of computation, and this increases the training time. For this reason, only the diagonal of the Hessian is usually used. Another solution is to increase the gradient value by adding a term that is a fraction of the past changes in the weights. This term, usually known as the momentum term, is weighted by a new constant, usually designated by $\alpha$:

$$w_{kj}(k + 1) = w_{kj}(k) - \eta \frac{\partial J(w)}{\partial w_{kj}} + \alpha \, \Delta w_{kj}(k)$$
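The update with a momentum term can be sketched as follows. The function name, the parameter values and the quadratic objective used to exercise it are hypothetical and serve only to show the mechanics of the equation above.

```python
def momentum_update(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """One update w(k+1) = w(k) - eta * dJ/dw + alpha * delta_w(k).

    Returns the new weights and the weight change just applied, which
    becomes the previous change for the next iteration.
    """
    delta = [-eta * g + alpha * pd for g, pd in zip(grad, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta


def objective(w):
    """Hypothetical error surface J(w) = 0.5 * (w1^2 + 10 * w2^2)."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def gradient(w):
    return [w[0], 10.0 * w[1]]


w = [1.0, 1.0]
delta = [0.0, 0.0]
for _ in range(200):
    w, delta = momentum_update(w, gradient(w), delta, eta=0.05, alpha=0.8)
print(w, objective(w))
```

On this elongated surface the momentum term accumulates the steps along the shallow direction, where the plain gradient is small, while successive steps along the steep direction partly cancel.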

This term tends to smooth the changes in the weights, which increases the learning speed by avoiding divergent learning fluctuations. It has been shown that adding noise to the training patterns decreases the training time and helps to avoid local minima in the learning process. Another way to decrease the training time is to use alternative transfer functions in the network nodes. When a function is allowed to take positive and negative values in a symmetric dynamic range, it is probable that several activations will be close to zero and their corresponding weights will not need to be updated. An example of this type of activation function is the hyperbolic one. Table 1 summarizes typical parameters of this kind of network and their influence on the processing.

6.3 On-Line and Off-Line training

During training, the weight update can be carried out in two different ways [Bourland and Morgan, 1994]:

• Off-line or “Block training”: in this case, the modifications of the weights over the whole training set are accumulated. The weights are modified only when all the training patterns have been presented to the network.

• On-line training: the network weights are modified each time a new training pattern is presented to the network. It can be proved that this method leads to the same result as off-line training [Widrow and Stearns, 1985]. In practice, this method shows some advantages that make it much more attractive: it converges much more quickly to a minimum of the error surface and usually avoids local minima. A possible explanation is that on-line training introduces some “noise” over the set of training patterns.

Table 1. Design Properties of NNs

Transfer Function | Sign | Exponential ($\beta = 1$) | Hyperbolic ($\beta = 1$)
Derivative of Transfer Function | $f'(x) =$ | $f'(x) = f(x)[1 - f(x)]$ | $f'(x) = 1 - f^2(x)$
Learning rate | $\eta = 1$ | $\eta = 0.1$ | $\eta = 0.01$
Effects on the NN | Learning not guaranteed | Quick but not precise convergence | Precise and slow convergence
Momentum | With a small value, the weight increment vectors take very divergent directions | With a big value, the weight increment vectors take similar directions, helping the convergence of the training |

6.4 Selection of the Network size

The selection of the appropriate network size is a task of the utmost importance: if the network is too small, it will not be able to achieve an efficient solution for the problem it represents, while if it is too big, the network may be able to represent many solutions that solve the problem over the training patterns, none of which is optimal for the application problem. If there is no preliminary experience, dimensioning the network is a trial and error problem. One option is to start with a small network and increase its size progressively until an efficient dimension is found. The other option is to start with a big network and reduce its size progressively, removing the nodes or weights that have no significant effect on the overall output of the NN. Several studies have established size limits that should not be exceeded. In this sense, one proposal is that the number of nodes in the hidden layer should not exceed the number of training patterns. In practice, this is always satisfied, as the number of nodes will always be much lower than the number of training patterns. In fact, big networks can memorize the whole training set, losing generalization capabilities.

7. KOHONEN NETWORKS

A main principle in the organization of the biological brain is that neurons are grouped in such a way that those that are physically close collaborate in processing the same stimulus. That is the way nerve connections are organized. For example, at each level of the auditory pathway, nerve cells and fibers are arranged according to the frequency that produces the highest output in each neuron [Lipmann, 1987]. Therefore, the physical arrangement of the neurons in the brain structure is somehow related to the function they perform. Kohonen [Kohonen, 1984] proposed an algorithm to adjust the weights of a network whose input is a vector of $N$ components and whose output is another vector of a different dimension $M$ ($M < N$). In this way, the dimension of the input subspace is reduced, physically grouping the data. Vectors defined over a continuous variable are used as input to the network. The network is trained without supervision, in such a way that the network itself establishes the criteria for grouping the input data, extracting regularities and correlations. When a sufficient number of input vectors has been presented, the weights are self-organized in such a way that, topologically speaking, close nodes are sensitive to similar inputs. Nodes that are physically far away remain completely inactive. Clusters that have their topological equivalent in the network are produced. For this reason, this kind of network is known as a Self-Organized Feature Map (SOFM).

The algorithm that assigns values to the synaptic weights is based on the concepts of neighborhood and competitive learning. The distance between the input and the weights of every node is computed, and the closest node is established as the winner node. The weights are updated for this node and its neighbor nodes. The rest are not updated, which favors a concrete physical organization. This kind of network always has two layers: the input layer and the output layer. The dimension of the input vector establishes the number of nodes of the input layer: one node for each component of the input vector. The input neurons drive the input vectors to the output layer through the connection weights. In this type of network it is very important to establish a neighborhood and a distance measure in the network. In the example of Figure 11, the nodes are configured in a two-dimensional structure.

Figure 11. Structure of the Kohonen Network (labels: Input layer, Outputs)

The algorithm used to compute the output is designed in such a way that only one output neuron is activated when an input vector is applied to the network. The fired node corresponds to the classification category of the input vector. Similar input vectors activate the same output, while different vectors activate different neurons. Only the neuron with the minimum difference between the input vector and its weight vector is activated. When the training algorithm starts, the adjustment is made in a wide zone surrounding the fired node or winner node. As the training progresses, the neighbor area is progressively reduced. Through these small adjustments, the network follows any systematic change in the input vectors: the network self-organizes. Therefore, this algorithm behaves as a vector quantizer when the number of desired clusters can be specified a priori and a sufficient amount of data relative to the number of desired clusters is known. However, the results depend on the order of presentation of the input data, especially when the amount of input data is small.

7.1 Training

Training of the SOFM network can be summarized in six steps:

1. Weights initialization: the network structure has $N$ input nodes and $M$ output nodes. Random values are assigned to each of the connection weights $w_{ij}$. An initial neighbor radius is fixed for the neighbor mask.

2. Presentation of a new input: a new pattern is presented at the input, $X_p(t) = (x_0(t), x_1(t), \ldots, x_{N-1}(t))$.

3. Computation of the distance $d_j$ between the input and each one of the output nodes

$$d_j = \sum_{i=0}^{N-1} \left[ x_i(t) - w_{ij}(t) \right]^2$$

where $x_i(t)$ is the input to node $i$ in iteration $t$, and $w_{ij}(t)$ is the weight from input $i$ to output $j$ in iteration $t$.

4. Selection of the output node with the minimum distance: node $j^*$ is selected as the node with the minimum distance $d_j$.

5. Updating node $j^*$ and its neighbors: the weights are updated for node $j^*$ and all its neighbors in the vicinity defined by $NE_{j^*}(t)$. The new weights are

$$w_{ij}(t + 1) = w_{ij}(t) + \eta(t) \left[ x_i(t) - w_{ij}(t) \right]$$

for $j \in NE_{j^*}(t)$, $0 \le i \le N - 1$. The term $\eta(t)$ is a gain term ($0 < \eta(t) < 1$) that decreases with time.

6. Back to step 2.

A standard example introduced by Kohonen illustrates the capacity of self-organized networks to learn random distributions of the input vectors presented to the network. For example, if the input is a two-dimensional vector with uniformly distributed components and the output is designed as two-dimensional, then the network weights will organize in a grid-like fashion, as shown in Figure 12.
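The training steps above can be sketched in Python for a two-dimensional grid of output nodes. The square neighborhood, the linear decay of the gain and of the radius, and all names and parameter values are assumptions made for this illustration, since the text does not fix them.

```python
import random


def winner(weights, x):
    """Steps 3-4: index j* of the node whose weights are closest to x."""
    def distance(w):
        return sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return min(range(len(weights)), key=lambda j: distance(weights[j]))


def neighbors(j_star, rows, cols, radius):
    """Nodes inside a square neighborhood of the given radius around j*."""
    r0, c0 = divmod(j_star, cols)
    return [
        r * cols + c
        for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1))
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1))
    ]


def train_sofm(data, rows, cols, iterations=2000, eta0=0.5, radius0=2, seed=0):
    """Train a rows x cols map; returns one weight vector per output node."""
    rng = random.Random(seed)
    n = len(data[0])
    # Step 1: random initial weights
    weights = [[rng.random() for _ in range(n)] for _ in range(rows * cols)]
    for t in range(iterations):
        x = data[rng.randrange(len(data))]        # Step 2: new input
        j_star = winner(weights, x)               # Steps 3-4
        frac = 1.0 - t / iterations               # decays from 1 toward 0
        eta = eta0 * frac                         # gain, 0 < eta < 1
        radius = int(round(radius0 * frac))       # shrinking neighborhood
        for j in neighbors(j_star, rows, cols, radius):   # Step 5
            weights[j] = [w + eta * (xi - w) for w, xi in zip(weights[j], x)]
    return weights                                # Step 6 is the loop


# Two-dimensional inputs with uniformly distributed components
rng = random.Random(1)
data = [[rng.random(), rng.random()] for _ in range(200)]
som = train_sofm(data, rows=4, cols=4, iterations=500)
print(som[0], som[-1])
```

Plotting the weight vectors of adjacent nodes joined by lines would give a picture comparable to the grid described for the two-dimensional example.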

Figure 12. Kohonen Map for the two-dimensional case

8. FUTURE PERSPECTIVES

Artificial neural networks are inspired by the biological performance of the human brain, which they attempt to emulate. This is the main link between biological and artificial neural networks. From this starting point, the two disciplines follow separate ways. The present understanding of the brain mechanisms is so limited that the systems designer does not have sufficient data to emulate its performance. Therefore, the engineer has to go one step beyond the biological knowledge, searching for and devising useful algorithms and structures that efficiently solve given problems. In the vast majority of cases, this search delivers a result that diverges completely from the biological reality, and the similarities with the brain become metaphors. Despite this faint and usually nonexistent analogy between biology and artificial neural networks, the results of the latter frequently evoke comparisons with the former, because they are often reminiscent of the performance of the brain. Unfortunately, these comparisons are not benign and produce unrealistic expectations that lead to disappointment. Research based on false expectations can evaporate when exposed to the light of reality, as happened in the sixties. This promising research field could be eclipsed again if we do not resist the temptation of comparing our results with those of the brain. It has been said that NNs can be applied to all activities specific to the human brain. Currently, they are considered an alternative for all those tasks where conventional computation does not achieve satisfactory results. There has been speculation about a near future in which NNs will take their place alongside classical computation. However, this will only happen if researchers achieve sufficient knowledge for that development. Currently, the theoretical knowledge is not robust enough to justify such predictions.

REFERENCES

W.S. McCulloch and W. Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics, 5:115–133, 1943.
W. Pitts and W.S. McCulloch, How We Know Universals, Bulletin of Mathematical Biophysics, 9:127–147, 1947.
D.O. Hebb, Organization of Behaviour, Science Editions, New York, 1961.
F. Rosenblatt, Principles of Neurodynamics, Science Editions, New York, 1962.
B. Widrow and M.E. Hoff, Adaptive Switching Circuits, in IRE WESCON Convention Record, pages 96–104, 1960.
M. Minsky and S. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.
G. Cybenko, Approximation by Superposition of a Sigmoidal Function, Mathematics of Control, Signals, and Systems, 2:303–314, 1989.
K. Hornik, M. Stinchcombe and H. White, Multilayer Feedforward Networks are Universal Approximators, Neural Networks, 2(5):359–366, 1989.
K. Hornik, M. Stinchcombe and H. White, Universal Approximation of an Unknown Mapping and Its Derivatives using Multilayer Feedforward Networks, Neural Networks, 3:551–560, 1990.
S. Haykin, Neural Networks. A Comprehensive Foundation, Macmillan College Publishing, Ontario, 1994.
P.J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioural Sciences, PhD thesis, Harvard University, Boston, 1974.
D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning Internal Representations by Error Propagation, in D.E. Rumelhart, J.L. McClelland and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations, pages 318–362, MIT Press, Cambridge, MA, 1986.
D.B. Parker, Learning Logic, Technical Report TR-47, Cambridge, MA: MIT Center for Research in Computational Economics and Management Science, 1985.
D.R. Hush and B.G. Horne, Progress in Supervised Neural Networks. What’s new since Lippman?, IEEE Signal Processing Magazine, 2:721–729, January, 1993.
N.B. Karayiannis and A.N. Venetsanopoulos, Artificial Neural Networks, Learning Algorithms, Performance Evaluation and Applications, Kluwer Academic Publishers, Boston, MA, 1993.
H.A. Bourland and N. Morgan, Connectionist Speech Recognition. A Hybrid Approach, Kluwer Academic Publishers, Boston, MA, 1994.
B. Widrow and S.D. Stearns, Adaptive Signal Processing, Prentice-Hall, Signal Processing Series, Englewood Cliffs, NJ, 1985.
R.P. Lipmann, An Introduction to Computing with Neural Nets, IEEE ASSP Magazine, 328–339, April, 1987.
T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1984.

CHAPTER 3 ARTIFICIAL NEURAL NETWORKS

D. T. PHAM, M. S. PACKIANATHER, A. A. AFIFY Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 3AA, United Kingdom

INTRODUCTION

Artificial neural networks are computational models of the brain. There are many types of neural networks representing the brain’s structure and operation with varying degrees of sophistication. This chapter provides an introduction to the main types of networks and presents examples of each type.

1. TYPES OF NEURAL NETWORKS

Neural networks generally consist of a number of interconnected processing elements (PEs) or neurons. How the inter-neuron connections are arranged and the nature of the connections determine the structure of a network. How the strengths of the connections are adjusted or trained to achieve a desired overall behaviour of the network is governed by its learning algorithm. Neural networks can be classified according to their structures and learning algorithms.

1.1 Structural Categorisation

In terms of their structures, neural networks can be divided into two types: feedforward networks and recurrent networks.

Feedforward networks: In a feedforward network, the neurons are generally grouped into layers. Signals flow from the input layer through to the output layer via unidirectional connections, the neurons being connected from one layer to the next, but not within the same layer. Examples of feedforward networks include the multi-layer perceptron (MLP) [Rumelhart and McClelland, 1986], the radial basis function (RBF) network [Broomhead and Lowe, 1988; Moody and Darken, 1989], the learning vector quantization (LVQ) network [Kohonen, 1989], the cerebellar model articulation control (CMAC) network [Albus, 1975a], the group-method of data handling (GMDH) network [Hecht-Nielsen, 1990] and some spiking neural networks [Maass, 1997]. Feedforward networks can most naturally perform static mappings between an input space and an output space: the output at a given instant is a function only of the input at that instant.

Recurrent networks: In a recurrent network, the outputs of some neurons are fed back to the same neurons or to neurons in preceding layers. Thus, signals can flow in both forward and backward directions. Examples of recurrent networks include the Hopfield network [Hopfield, 1982], the Elman network [Elman, 1990] and the Jordan network [Jordan, 1986]. Recurrent networks have a dynamic memory: their outputs at a given instant reflect the current input as well as previous inputs and outputs.

D. Andina and D.T. Pham (eds.), Computational Intelligence, 67–92. © 2007 Springer.

1.2 Learning Algorithm Categorisation

Neural networks are trained by two main types of learning algorithms: supervised and unsupervised learning algorithms. In addition, there exists a third type, reinforcement learning, which can be regarded as a special form of supervised learning.

Supervised learning: A supervised learning algorithm adjusts the strengths or weights of the inter-neuron connections according to the difference between the desired and actual network outputs corresponding to a given input. Thus, supervised learning requires a teacher or supervisor to provide desired or target output signals. Examples of supervised learning algorithms include the delta rule [Widrow and Hoff, 1960], the generalised delta rule or backpropagation algorithm [Rumelhart and McClelland, 1986] and the LVQ algorithm [Kohonen, 1989].

Unsupervised learning: Unsupervised learning algorithms do not require the desired outputs to be known. During training, only input patterns are presented to the neural network, which automatically adapts the weights of its connections to cluster the input patterns into groups with similar features. Examples of unsupervised learning algorithms include the Kohonen [Kohonen, 1989] and Carpenter-Grossberg Adaptive Resonance Theory (ART) [Carpenter and Grossberg, 1988] competitive learning algorithms.

Reinforcement learning: As mentioned before, reinforcement learning is a special case of supervised learning. Instead of using a teacher to give target outputs, a reinforcement learning algorithm employs a critic only to evaluate the goodness of the neural network output corresponding to a given input. An example of a reinforcement learning algorithm is the genetic algorithm (GA) [Holland, 1975; Goldberg, 1989].

2. NEURAL NETWORK EXAMPLES

This section briefly describes the example neural networks and associated learning algorithms cited previously.

2.1 Multi-layer Perceptron (MLP)

MLPs are perhaps the best known type of feedforward network. Figure 1a shows an MLP with three layers: an input layer, an output layer and an intermediate or hidden layer. Neurons in the input layer only act as buffers for distributing the input signals $x_i$ to neurons in the hidden layer. Each neuron $j$ (Figure 1b) in the hidden layer sums up its input signals $x_i$ after weighting them with the strengths of the respective connections $w_{ji}$ from the input layer and computes its output $y_j$ as a function $f$ of the sum, viz.

(1)  $y_j = f\left(\sum_i w_{ji} x_i\right)$

$f$ can be a simple threshold function or a sigmoidal, hyperbolic tangent or radial basis function (see Table 1). The output of neurons in the output layer is computed similarly.

Figure 1a. A multi-layer perceptron (input layer $x_1, x_2, \ldots, x_m$; hidden layer; output layer $y_1, \ldots, y_n$)

Figure 1b. Details of a neuron (inputs $x_1, \ldots, x_i, \ldots, x_n$ weighted by $w_{j1}, \ldots, w_{ji}, \ldots, w_{jn}$, summed and passed through $f(\cdot)$ to give $y_j$)

Table 1. Activation functions

Type of function | Function
Linear | $f(s) = s$
Threshold | $f(s) = +1$ if $s > s_t$, $-1$ otherwise
Sigmoid | $f(s) = 1 / (1 + \exp(-s))$
Hyperbolic tangent | $f(s) = (1 - \exp(-2s)) / (1 + \exp(-2s))$
Radial basis function | $f(s) = \exp(-s^2 / 2)$

The backpropagation (BP) algorithm, a gradient descent algorithm, is the most commonly adopted MLP training algorithm. It gives the change $\Delta w_{ji}$ in the weight of a connection between neurons $i$ and $j$ as follows:

(2)  $\Delta w_{ji} = \eta \, \delta_j \, x_i$

where $\eta$ is a parameter called the learning rate and $\delta_j$ is a factor depending on whether neuron $j$ is an output neuron or a hidden neuron. For output neurons,

(3)  $\delta_j = \frac{\partial f}{\partial net_j} \left( y_j^{(t)} - y_j \right)$

and for hidden neurons,

(4)  $\delta_j = \frac{\partial f}{\partial net_j} \sum_q w_{qj} \, \delta_q$

In Equation (3), $net_j$ is the total weighted sum of input signals to neuron $j$ and $y_j^{(t)}$ is the target output for neuron $j$. As there are no target outputs for hidden neurons, in Equation (4), the difference between the target and actual output of a hidden neuron $j$ is replaced by the weighted sum of the $\delta_q$ terms already obtained for neurons $q$ connected to the output of $j$. Thus, iteratively, beginning with the output layer, the $\delta$ term is computed for neurons in all layers and weight updates determined for all connections.

The weight updating process can take place after the presentation of each training pattern (pattern-based training) or after the presentation of the whole set of training patterns (batch training). In either case, a training epoch is said to have been completed when all training patterns have been presented once to the MLP. For all but the most trivial problems, several epochs are required for the MLP to be properly trained. A commonly adopted method to speed up the training is to add a “momentum” term to Equation (2) which effectively lets the previous weight change influence the new weight change, viz:

(5)  $\Delta w_{ji}(k+1) = \eta \, \delta_j \, x_i + \mu \, \Delta w_{ji}(k)$

where $\Delta w_{ji}(k+1)$ and $\Delta w_{ji}(k)$ are weight changes in epochs $k+1$ and $k$ respectively and $\mu$ is the “momentum” coefficient.
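As a minimal sketch of Equations (1) to (5), the following Python code computes the backpropagation weight changes for one training pattern in a three-layer MLP. It assumes sigmoid activations in both layers and no bias terms; the function names, the list-of-lists weight layout and the default values of the learning rate and momentum coefficient are illustrative choices, not part of the original description.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(w_hid, w_out, x):
    """Apply Equation (1) to the hidden layer, then to the output layer."""
    hid = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    out = [sigmoid(sum(w * h for w, h in zip(row, hid))) for row in w_out]
    return hid, out


def bp_weight_changes(w_hid, w_out, x, target, eta=0.5, mu=0.0,
                      prev_hid=None, prev_out=None):
    """Return (dw_hid, dw_out), the weight changes for one training pattern."""
    hid, out = forward(w_hid, w_out, x)
    # Equation (3); for a sigmoid, df/dnet = y * (1 - y).
    d_out = [y * (1.0 - y) * (t - y) for y, t in zip(out, target)]
    # Equation (4): weighted sum of the delta terms of the output neurons.
    d_hid = [h * (1.0 - h) * sum(w_out[q][j] * d_out[q] for q in range(len(out)))
             for j, h in enumerate(hid)]
    # Equation (5); with mu = 0 this reduces to Equation (2).
    dw_out = [[eta * d_out[q] * hid[j] + mu * (prev_out[q][j] if prev_out else 0.0)
               for j in range(len(hid))] for q in range(len(out))]
    dw_hid = [[eta * d_hid[j] * x[i] + mu * (prev_hid[j][i] if prev_hid else 0.0)
               for i in range(len(x))] for j in range(len(hid))]
    return dw_hid, dw_out
```

In pattern-based training the returned changes would be added to the weights after every pattern; in batch training they would be accumulated over the epoch first.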


Another learning method suitable for training MLPs is the genetic algorithm (GA). This is an optimisation algorithm based on evolution principles. The weights of the connections are considered genes in a chromosome. The goodness or fitness of the chromosome is directly related to how well trained the MLP is. The algorithm starts with a randomly generated population of chromosomes and applies genetic operators to create new and fitter populations.

The most common genetic operators are the selection, crossover and mutation operators. The selection operator chooses chromosomes from the current population for reproduction. Usually, a biased selection procedure is adopted which favours the fitter chromosomes. The crossover operator creates two new chromosomes from two existing chromosomes by cutting them at a random position and exchanging the parts following the cut. The mutation operator produces a new chromosome by randomly changing the genes of an existing chromosome. Together, these operators simulate a guided random search method which can eventually yield the optimum set of weights to minimise the differences between the actual and target outputs of the neural network. Further details of genetic algorithms can be found in the chapter on Soft Computing and its Applications in Engineering and Manufacture.
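The three operators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: real-valued genes, binary tournament selection as the biased selection procedure, Gaussian mutation, and elitism (the best chromosome is copied unchanged into the next population). None of these specific choices comes from the text, and the fitness task (`DATA`, `neg_sse`) is a hypothetical single linear neuron rather than a full MLP.

```python
import random


def crossover(a, b, cut):
    """One-point crossover: exchange the parts following the cut."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(chrom, rng, rate=0.1, scale=0.3):
    """Randomly perturb some genes (Gaussian noise is an assumption)."""
    return [g + rng.gauss(0.0, scale) if rng.random() < rate else g
            for g in chrom]


def select(pop, fit, rng):
    """Biased selection: the fitter of two randomly drawn chromosomes."""
    a, b = rng.sample(range(len(pop)), 2)
    return pop[a] if fit[a] >= fit[b] else pop[b]


def evolve(fitness, n_genes, pop_size=20, generations=30, seed=0):
    """Return (best chromosome, best fitness per generation); n_genes >= 2."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(n_genes)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        fit = [fitness(c) for c in pop]
        best = pop[fit.index(max(fit))]
        history.append(max(fit))
        nxt = [best[:]]  # elitism (an assumption): keep the best unchanged
        while len(nxt) < pop_size:
            p1, p2 = select(pop, fit, rng), select(pop, fit, rng)
            cut = rng.randrange(1, n_genes)
            nxt.extend(mutate(c, rng) for c in crossover(p1, p2, cut))
        pop = nxt[:pop_size]
    fit = [fitness(c) for c in pop]
    history.append(max(fit))
    return pop[fit.index(max(fit))], history


# Hypothetical task: weights of one linear neuron that should output x1 - x2.
DATA = [((1.0, 0.0), 1.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 0.0)]


def neg_sse(w):
    """Fitness: negative sum-squared error between actual and target outputs."""
    return -sum((t - (w[0] * x[0] + w[1] * x[1])) ** 2 for x, t in DATA)
```

Calling `evolve(neg_sse, n_genes=2)` returns the fittest weight vector found; because of the elitism assumption, the recorded best fitness never decreases from one generation to the next.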

2.2 Radial Basis Function (RBF) Network

Large multi-layer perceptron (MLP) networks take a long time to train. This has led to the construction of alternative networks such as the Radial Basis Function (RBF) network [Cichocki and Unbehauen, 1993; Hassoun, 1995; Haykin, 1999]. The RBF network is the most used network after MLPs. Figure 2 shows the structure of an RBF network which consists of three layers. The input layer neurons receive the inputs $x_1, \ldots, x_M$. The hidden layer neurons provide a set of activation functions that constitute an arbitrary “basis” for the input patterns in the input space to be expanded into the hidden space by way of non-linear transformation. At the input of each hidden neuron, the distance between the centre of each activation or basis function and the input vector is calculated. Applying the basis function to this distance produces the output of the hidden neuron. The RBF network output $y$ is formed by the neuron in the output layer as a weighted sum of the hidden layer neuron activations.

Figure 2. The RBF network (input layer $x_1, \ldots, x_k, \ldots, x_M$; hidden layer; output layer with weights $w_1, \ldots, w_i, \ldots, w_N$ and output $y$)

Figure 3. The Radial Basis Function $K(x)$ (peak value 1.0 at $x = 0$, standard deviation $\sigma = 1$)

The basis function is generally chosen to be a standard function which is positive at its centre $x = 0$ and then decreases uniformly to zero on either side as shown in Figure 3. A common choice is the Gaussian distribution function:

(6)  $K(x) = \exp\left(-\frac{x^2}{2}\right)$

This function can be shifted to an arbitrary centre, $x = c$, and stretched by varying its standard deviation $\sigma$ as follows:

(7)  $K\left(\frac{x - c}{\sigma}\right) = \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right)$

The output of the RBF network $y$ is given by:

(8)  $y = \sum_{i=1}^{N} w_i \, K\left(\frac{\|x - c_i\|}{\sigma_i}\right) \quad \forall x$

where $w_i$ is the weight of the hidden neuron $i$, $c_i$ the centre of basis function $i$ and $\sigma_i$ the standard deviation of the function. $\|x - c_i\|$ is the norm of $(x - c_i)$. There are various ways to calculate the norm. The most common is the Euclidean norm given by:

(9)  $\|x - c_i\| = \sqrt{(x_1 - c_{i1})^2 + (x_2 - c_{i2})^2 + \cdots + (x_M - c_{iM})^2}$

This norm gives the distance between the two points $x$ and $c_i$ in M-dimensional space. All points $x$ that are the same radial distance from $c_i$ give the same value for the norm and hence the same value for the basis function. Hence the basis functions are called Radial Basis Functions.

Obtaining the values for $w_i$, $c_i$ and $\sigma_i$ requires training the RBF network. Because the basis functions are differentiable, back-propagation could be used as with MLP networks. Training of a multiple-input single-output RBF network can proceed as follows:

(i) choose the number N of hidden units; There is no firm guidance available for this. The selection of N is normally made by trial and error. In general, the smallest N that gives the RBF network an acceptable performance is adopted.


(ii) choose the centres, $c_i$; Centre selection could be performed in three different ways [Haykin, 1999]:
a) Trial and error: Centres can be selected by trial and error. This is not always easy if little is known about the underlying functional behaviour of the data. Usually, the centres are spread evenly or randomly over the M-dimensional input space.
b) Self-organized selection: An adaptive unsupervised method can be used to learn where to place the centres.
c) Supervised selection: A supervised learning process, commonly error correction learning, can be deployed to fix the centres.

(iii) choose stretch constants, $\sigma_i$; Several heuristics are available. A popular way is to set $\sigma_i$ equal to the distance to the nearest neighbour. First the distances between centres are computed, then the nearest distance is chosen to be the value of $\sigma_i$.

(iv) calculate weights, $w_i$. When $c_i$ and $\sigma_i$ are known, the outputs of hidden units $(O_1, \ldots, O_N)^T$ can be calculated for any pattern of inputs $x = (x_1, \ldots, x_M)$. Assuming there are P input patterns $x$ in the training set, there will be P sets of hidden unit outputs that can be calculated. These can be assembled in an N × P matrix:

(10)  $O = \begin{bmatrix} O_1^{(1)} & O_1^{(2)} & \cdots & O_1^{(P)} \\ O_2^{(1)} & O_2^{(2)} & \cdots & O_2^{(P)} \\ \vdots & \vdots & & \vdots \\ O_N^{(1)} & O_N^{(2)} & \cdots & O_N^{(P)} \end{bmatrix}$

If the output $y_i$ of the RBF network corresponding to training input pattern $x_i$ is $y_i = O_1^{(i)} w_1 + O_2^{(i)} w_2 + \cdots + O_N^{(i)} w_N$, the following equation can be obtained:

(11)  $y = \begin{bmatrix} y_1 \\ \vdots \\ y_P \end{bmatrix} = \begin{bmatrix} O_1^{(1)} & \cdots & O_N^{(1)} \\ \vdots & & \vdots \\ O_1^{(P)} & \cdots & O_N^{(P)} \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = O^T \cdot w$

$y$ is the vector of actual outputs corresponding to the training inputs $x$. Ideally, $y$ should be equal to $d$, the desired/target outputs. Unknown coefficients $w_i$ can be chosen to minimise the sum-squared-error of $y$ compared with $d$. It can be shown that this is achieved when:

(12)  $w = (O \, O^T)^{-1} \, O \, d$
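Steps (iii) and (iv) can be sketched as follows. This is a dependency-free illustration, assuming the Gaussian basis function of Equation (7) and centres that have already been chosen; the helper names and the small Gaussian-elimination solver (used in place of an explicit matrix inverse in Equation (12)) are assumptions made for the example.

```python
import math


def gaussian(r, sigma):
    """Equation (7) evaluated at radial distance r."""
    return math.exp(-r * r / (2.0 * sigma * sigma))


def dist(a, b):
    """Euclidean norm of Equation (9)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def nearest_neighbour_sigmas(centres):
    """Step (iii): sigma_i = distance from centre i to its nearest neighbour."""
    return [min(dist(c, o) for j, o in enumerate(centres) if j != i)
            for i, c in enumerate(centres)]


def hidden_matrix(patterns, centres, sigmas):
    """Equation (10): O[i][p] is the output of hidden unit i for pattern p."""
    return [[gaussian(dist(x, c), s) for x in patterns]
            for c, s in zip(centres, sigmas)]


def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def train_weights(o, d):
    """Equation (12), computed by solving (O O^T) w = O d."""
    n, p = len(o), len(d)
    oot = [[sum(o[i][k] * o[j][k] for k in range(p)) for j in range(n)]
           for i in range(n)]
    od = [sum(o[i][k] * d[k] for k in range(p)) for i in range(n)]
    return solve(oot, od)


def predict(x, centres, sigmas, w):
    """Equation (8)."""
    return sum(wi * gaussian(dist(x, c), s)
               for wi, c, s in zip(w, centres, sigmas))
```

When the centres are placed on the training patterns themselves (N = P, all patterns distinct), the trained network reproduces the target outputs up to rounding error; with fewer hidden units it gives the least-squares fit.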

2.3 Learning Vector Quantization (LVQ) Network

Figure 4 shows an LVQ network which comprises three layers of neurons: an input buffer layer, a hidden layer and an output layer. The network is fully connected between the input and hidden layers and partially connected between the hidden and output layers, with each output neuron linked to a different cluster of hidden neurons. The weights of the connections between the hidden and output neurons are fixed to 1. The weights of the input-hidden neuron connections form the components of reference vectors (one reference vector is assigned to each hidden neuron). They are modified during the training of the network. Both the hidden neurons (also known as Kohonen neurons) and the output neurons have binary outputs. When an input pattern is supplied to the network, the hidden neuron whose reference vector is closest to the input pattern is said to win the competition for being activated and thus allowed to produce a “1”. All other hidden neurons are forced to produce a “0”. The output neuron connected to the cluster of hidden neurons that contains the winning neuron also emits a “1” and all other output neurons a “0”. The output neuron that produces a “1” gives the class of the input pattern, each output neuron being dedicated to a different class.

Figure 4. Learning Vector Quantization network (input vector, input layer, hidden (Kohonen) layer with reference vectors, output layer)

The simplest LVQ training procedure is as follows:
(i) initialise the weights of the reference vectors;
(ii) present a training input pattern to the network;
(iii) calculate the (Euclidean) distance between the input pattern and each reference vector;
(iv) update the weights of the reference vector that is closest to the input pattern, that is, the reference vector of the winning hidden neuron. If the latter belongs to the cluster connected to the output neuron in the class that the input pattern is known to belong to, the reference vector is brought closer to the input pattern. Otherwise, the reference vector is moved away from the input pattern;
(v) return to (ii) with a new training input pattern and repeat the procedure until all training patterns are correctly classified (or a stopping criterion is met).

For other LVQ training procedures, see for example [Pham and Oztemel, 1994].
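Steps (ii) to (iv) of the procedure above can be sketched as a single update function. The text does not state by how much the winning reference vector is moved; the proportional step `alpha * (x - ref)` used here is the usual LVQ1 form and should be read as an assumption, as should the function names and the use of a `labels` list to record which class each hidden neuron is connected to.

```python
def nearest(refs, x):
    """Index of the reference vector closest to x (squared Euclidean distance)."""
    d2 = [sum((r - xi) ** 2 for r, xi in zip(ref, x)) for ref in refs]
    return d2.index(min(d2))


def lvq1_step(refs, labels, x, x_class, alpha=0.1):
    """Update the winning reference vector in place and return its index."""
    win = nearest(refs, x)
    # Move towards x if the winner belongs to the correct class, away otherwise.
    sign = 1.0 if labels[win] == x_class else -1.0
    refs[win] = [r + sign * alpha * (xi - r) for r, xi in zip(refs[win], x)]
    return win


def classify(refs, labels, x):
    """Class of the output neuron connected to the winning hidden neuron."""
    return labels[nearest(refs, x)]
```

Step (v) then amounts to calling `lvq1_step` repeatedly over the training set until `classify` returns the known class for every pattern or a stopping criterion is met.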

2.4 CMAC Network

CMAC (Cerebellar Model Articulation Control) [Albus, 1975a, 1975b, 1979a, 1979b; An et al., 1994] can be considered a supervised feedforward neural network with the characteristics of a fuzzy associative memory. A basic CMAC module is shown in Figure 5. CMAC consists of a series of mappings:

(13)  $S \xrightarrow{\;e\;} M \xrightarrow{\;f\;} A \xrightarrow{\;g\;} u$

where
S = {input vectors}
M = {intermediate variables}
A = {association cell vectors}
u = output of CMAC ≡ h(S)
h ≡ g · f · e

(a) Input encoding (S → M mapping)

The S → M mapping is a set of submappings, one for each input variable:

(14)  $S \rightarrow M = \begin{bmatrix} s_1 \rightarrow m_1 \\ s_2 \rightarrow m_2 \\ \vdots \\ s_n \rightarrow m_n \end{bmatrix}$

Figure 5. A basic CMAC module (input S, input encoding S → M, address computing M → A, weight table A → u, actual output u compared with the desired output)

The range of $s_1$ is coarsely discretised using the quantising functions $q_1, q_2, \ldots, q_k$. Each function divides the range into k intervals. The intervals produced by function $q_{j+1}$ are offset by one kth of the range compared to their counterparts produced by function $q_j$. $m_i$ is a set of k intervals generated by $q_1$ to $q_k$ respectively.

An example is given in Figure 6 to illustrate the internal mappings within a CMAC module. The S → M mapping is shown in the leftmost part of the figure. In Figure 6, two input variables $s_1$ and $s_2$ are represented with unity resolution in the range of 0 to 8. The range of each input variable is described using three quantising functions. For example, the range of $s_1$ is described by functions $q_1$, $q_2$ and $q_3$. $q_1$ divides the range into intervals A, B, C and D. $q_2$ gives intervals E, F, G and H and $q_3$ provides intervals I, J, K and L. That is,

$q_1 = \{A, B, C, D\}$
$q_2 = \{E, F, G, H\}$
$q_3 = \{I, J, K, L\}$

For every value of $s_1$, there exists a set of elements, $m_1$, which are the intersection of the functions $q_1$ to $q_3$, such that the value of $s_1$ uniquely defines set $m_1$ and vice versa. For example, value $s_1 = 5$ maps to set $m_1 = \{B, G, K\}$ and vice versa. Similarly, value $s_2 = 4$ maps to set $m_2 = \{b, g, j\}$ and vice versa.

The S → M mapping gives CMAC two advantages. The first is that a single precise variable $s_i$ can be transmitted over several imprecise information channels. Each channel carries only a small part of the information of $s_i$. This increases the reliability of the information transmission. The other advantage is that small changes in the value of $s_i$ have no influence on most of the elements in $m_i$. This leads to the property of input generalisation which is important in an environment where random noise exists.

Figure 6. Internal mappings within a CMAC module (S → M: inputs $s_1$ and $s_2$, each on a 0 to 8 axis, quantised into intervals A to L and a to l; M → A: address table; A → u: weights X1, X2 and X3 summed to give the output u)

(b) Address computing (M → A mapping)

A is a set of address vectors associated with weight tables. A is obtained by combining the elements of $m_i$. For example, in Figure 6, the sets $m_1 = \{B, G, K\}$ and $m_2 = \{b, g, j\}$ are combined to give the set of elements $A = \{a_1, a_2, a_3\} = \{Bb, Gg, Kj\}$.

(c) Output computing (A → u mapping)

This mapping involves looking up the weight tables and adding the contents of the addressed locations to yield the output of the network. The following formula is employed:

(15)  $u = \sum_i w_i(a_i)$

That is, only the weights associated with the addresses $a_i$ in A are summed. For this given example, these weights are:

$w(Bb) = x_1$, $w(Gg) = x_2$, $w(Kj) = x_3$

Thus the output is:

(16)  $u = x_1 + x_2 + x_3$

Training a CMAC module consists of adjusting the stored weights. Assuming that f is the function that CMAC has to learn, the following training steps could be adopted:
(i) select a point S in the input space and obtain the current output u corresponding to S;
(ii) let $\bar{u}$ be the desired output of CMAC, that is, $\bar{u} = f(S)$;
(iii) if $|\bar{u} - u| \le \varepsilon$, where $\varepsilon$ is an acceptable error, then do nothing; the desired value is already stored in CMAC. However, if $|\bar{u} - u| > \varepsilon$, then add to every weight which contributed to u the quantity

(17)  $\Delta = \beta \, \frac{\bar{u} - u}{|A|}$

where $|A|$ = the number of weights which contributed to u and $\beta$ is the learning rate.
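The three mappings and the training rule of Equation (17) can be sketched for the two-input example. The interval boundaries below (`OFFSETS`) are an assumption chosen only so that the code reproduces the two mappings quoted in the text, $s_1 = 5 \rightarrow \{B, G, K\}$ and $s_2 = 4 \rightarrow \{b, g, j\}$; the exact boundaries drawn in Figure 6 may differ. The dictionary used as a weight table and all function names are likewise illustrative.

```python
K = 3                  # number of quantising functions per input
OFFSETS = (0, 2, 1)    # assumed shift of q1, q2, q3 (reproduces the text's example)
UPPER = ("ABCD", "EFGH", "IJKL")   # interval names for s1
LOWER = ("abcd", "efgh", "ijkl")   # interval names for s2


def encode(s):
    """S -> M: index of the interval that s falls into for each q_j."""
    return tuple((s + off) // K for off in OFFSETS)


def interval_names(s, names):
    """Same mapping, reported with the interval letters used in the text."""
    return [names[j][idx] for j, idx in enumerate(encode(s))]


def addresses(s1, s2):
    """M -> A: pair the j-th element of m1 with the j-th element of m2."""
    m1, m2 = encode(s1), encode(s2)
    return [(j, m1[j], m2[j]) for j in range(K)]


def output(weights, s1, s2):
    """A -> u, Equation (15): sum the addressed weights (unset weights are 0)."""
    return sum(weights.get(a, 0.0) for a in addresses(s1, s2))


def train_point(weights, s1, s2, desired, beta=1.0, eps=1e-6):
    """Training steps (i) to (iii) with the correction of Equation (17)."""
    u = output(weights, s1, s2)
    if abs(desired - u) > eps:
        addr = addresses(s1, s2)
        delta = beta * (desired - u) / len(addr)
        for a in addr:
            weights[a] = weights.get(a, 0.0) + delta
    return output(weights, s1, s2)
```

After training on the single point (5, 4), the neighbouring point (6, 4) shares two of its three addresses with it and therefore already produces two thirds of the stored output, which illustrates the input generalisation property described earlier.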

2.5 Group Method of Data Handling (GMDH) Network

Figure 7 shows a GMDH network and the details of one of its neurons. Unlike the feedforward neural networks previously described which have a fixed structure, a GMDH network has a structure which grows during training. Each neuron in a GMDH network usually has two inputs $x_1$ and $x_2$ and produces an output $y$ that is a quadratic combination of these inputs, viz.

(18)  $y = w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_1 x_2 + w_4 x_2^2 + w_5 x_2$

Figure 7a. A trained GMDH network (inputs $x_1$ to $x_4$, layers of N-Adaline neurons, single output $y$). Note: Each GMDH neuron is an N-Adaline, which is an Adaptive Linear Element with a nonlinear preprocessor.

Figure 7b. Details of a GMDH neuron (a nonlinear processor forms $x_1$, $x_1^2$, $x_1 x_2$, $x_2^2$ and $x_2$ from the two inputs; these and a constant +1 are weighted by $w_1$ to $w_5$ and $w_0$ and summed to give the output, which is compared with the desired output $y^d$ to give the error e)

Training a GMDH network consists of configuring the network starting with the input layer, adjusting the weights of each neuron, and increasing the number of layers until the accuracy of the mapping achieved with the network deteriorates.


The number of neurons in the first layer depends on the number of external inputs available. For each pair of external inputs, one neuron is used. Training proceeds with presenting an input pattern to the input layer and adapting the weights of each neuron according to a suitable learning algorithm, such as the delta rule (see for example [Pham and Liu, 1994]), viz.

(19)  $W_{k+1} = W_k + \alpha \, \frac{X_k}{\|X_k\|^2} \left( y_k^d - W_k^T X_k \right)$

where $\alpha$ is the learning rate, and $W_k$, the weight vector of a neuron at time k, and $X_k$, the modified input vector to the neuron at time k, are defined as

(20)  $W_k = [w_0 \;\; w_1 \;\; w_2 \;\; w_3 \;\; w_4 \;\; w_5]^T$

(21)  $X_k = [1 \;\; x_1 \;\; x_1^2 \;\; x_1 x_2 \;\; x_2^2 \;\; x_2]^T$

and $y_k^d$ is the desired network output at time k. Note that, for this description, it is assumed that the GMDH network only has one output. Equation (19) shows that the desired network output is presented to each neuron in the input layer and an attempt is made to train each neuron to produce that output. When the sum of the mean square errors SE over all the desired outputs in the training data set for a given neuron reaches the minimum for that neuron, the weights of the neuron are frozen and its training halted. When the training has ended for all neurons in a layer, the training for the layer stops.

Neurons that produce SE values below a given threshold when another set of data (known as the selection data set) is presented to the network are selected to grow the next layer. At each stage, the smallest SE value achieved for the selection data set is recorded. If the smallest SE value for the current layer is less than that for the previous layer (that is, the accuracy of the network is improving), a new layer is generated, the size of which depends on the number of neurons just selected. The training and selection processes are repeated until the SE value deteriorates. The best neuron in the immediately preceding layer is then taken as the output neuron for the network.
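A single GMDH neuron and its delta-rule update can be sketched as below. The sketch assumes the normalised form of the rule, in which the correction is divided by the squared norm of the modified input vector and scaled by a learning rate `alpha`; the function names are illustrative, and layer growth and neuron selection are not shown.

```python
def features(x1, x2):
    """Modified input vector X of Equation (21)."""
    return [1.0, x1, x1 * x1, x1 * x2, x2 * x2, x2]


def neuron_output(w, x1, x2):
    """Quadratic combination of Equation (18), i.e. W^T X."""
    return sum(wi * fi for wi, fi in zip(w, features(x1, x2)))


def delta_rule_step(w, x1, x2, y_desired, alpha=0.5):
    """One weight update in the form of Equation (19); returns the new weights."""
    x = features(x1, x2)
    error = y_desired - sum(wi * fi for wi, fi in zip(w, x))
    norm2 = sum(fi * fi for fi in x)   # always >= 1 because X contains a constant 1
    return [wi + alpha * error * fi / norm2 for wi, fi in zip(w, x)]
```

With `alpha = 1.0` a single step makes the neuron reproduce the desired output for the presented pattern exactly (up to rounding); smaller values correct only a fraction of the error, which is usually preferable when the patterns are noisy.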

2.6 Hopfield Network

Figure 8 shows one version of a Hopfield network. This network normally accepts binary and bipolar inputs (+1 or −1). It has a single “layer” of neurons, each connected to all the others, giving it a recurrent structure, as mentioned earlier. The training of a Hopfield network takes only one step, the weights $w_{ij}$ of the network being assigned directly as follows:

(22)  $w_{ij} = \begin{cases} \dfrac{1}{N} \sum_{c=1}^{P} x_i^c \, x_j^c & i \ne j \\ 0 & i = j \end{cases}$

where $w_{ij}$ is the connection weight from neuron i to neuron j, and $x_i^c$ (which is either +1 or −1) is the ith component of the training input pattern for class c, P the number of classes and N the number of neurons (or the number of components in the input pattern). Note from Equation (22) that $w_{ij} = w_{ji}$ and $w_{ii} = 0$, a set of conditions that guarantee the stability of the network.

Figure 8. A Hopfield network (inputs $x_1, x_2, x_3, \ldots, x_N$; Hopfield layer with weights $w_{12}, w_{13}, \ldots, w_{1N}$; outputs $y_1, y_2, y_3, \ldots, y_N$)

When an unknown pattern is input to the network, its outputs are initially set equal to the components of the unknown pattern, viz.

(23)  $y_i(0) = x_i, \quad 1 \le i \le N$

Starting with these initial values, the network iterates according to the following equation until it reaches a minimum energy state, i.e. its outputs stabilise to constant values:

(24)  $y_i(k+1) = f\left( \sum_{j=1}^{N} w_{ij} \, y_j(k) \right), \quad 1 \le i \le N$

where f is a hard limiting function defined as

(25)  $f(x) = \begin{cases} -1 & x < 0 \\ 1 & x > 0 \end{cases}$
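Equations (22) to (25) translate almost directly into code. The sketch below assumes bipolar patterns and updates all neurons together at each iteration; since Equation (25) does not define f(0), a zero net input is taken to leave the output unchanged, which is an assumption, as are the function names and the iteration cap.

```python
def train_hopfield(patterns):
    """Equation (22): one-step weight assignment from bipolar patterns."""
    n = len(patterns[0])
    return [[0.0 if i == j else sum(p[i] * p[j] for p in patterns) / n
             for j in range(n)] for i in range(n)]


def recall(w, x, max_iter=100):
    """Iterate Equation (24) from the initial state of Equation (23)."""
    y = list(x)
    for _ in range(max_iter):
        nets = [sum(wij * yj for wij, yj in zip(row, y)) for row in w]
        # Hard limiter of Equation (25); a zero net input keeps the old output.
        new = [1 if s > 0 else -1 if s < 0 else yi for s, yi in zip(nets, y)]
        if new == y:        # outputs have stabilised
            return new
        y = new
    return y                # cap reached without stabilising
```

For example, after storing the single pattern `[1, -1, 1, -1]`, presenting the corrupted pattern `[1, -1, 1, 1]` restores the stored pattern after one iteration.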

2.7 Elman and Jordan Nets

Figures 9a and b show an Elman net and a Jordan net, respectively. These networks have a multi-layered structure similar to the structure of MLPs. In both nets, in addition to an ordinary hidden layer, there is another special hidden layer sometimes called the context or state layer. This layer receives feedback signals from the ordinary hidden layer (in the case of an Elman net) or from the output layer (in the case of a Jordan net). The Jordan net also has connections from each neuron in the context layer back to itself. With both nets, the outputs of neurons in the context layer are fed forward to the hidden layer. If only the forward connections are to be adapted and the feedback connections are preset to constant values, these networks can be considered ordinary feedforward networks and the BP algorithm used to train them. Otherwise, a GA could be employed [Pham and Karaboga, 1993b; Karaboga, 1994]. For improved versions of the Elman and Jordan nets, see [Pham and Liu, 1992; Pham and Oh, 1992].

Figure 9a. An Elman network (input units, context units, hidden units, output units)

Figure 9b. A Jordan network (input units, context units with self feedback, hidden layer, output units with output feedback)
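The role of the context layer can be illustrated with one forward step of each net. The sketch assumes tanh hidden units, linear output units, feedback connections fixed at 1 and, for the Jordan net, a self-feedback strength `alpha`; these choices and the function names are assumptions made for the example rather than details given in the text.

```python
import math


def elman_step(x, context, w_in, w_ctx, w_out):
    """One time step of an Elman net: returns (outputs, new_context)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(w_in[j], x)) +
                        sum(w * c for w, c in zip(w_ctx[j], context)))
              for j in range(len(w_in))]
    outputs = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    # Elman net: the context layer stores a copy of the hidden-layer outputs.
    return outputs, hidden


def jordan_context(context, outputs, alpha=0.5):
    """Jordan net: the context receives the outputs plus its own self-feedback."""
    return [alpha * c + o for c, o in zip(context, outputs)]
```

Presenting the same input at two successive time steps gives different outputs because the context has changed in between, which is the dynamic memory referred to in Section 1.1.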

2.8 Kohonen Network

A Kohonen network or a self-organising feature map has two layers, an input buffer layer to receive the input pattern and an output layer (see Figure 10). Neurons in the output layer are usually arranged into a regular two-dimensional array. Each output neuron is connected to all input neurons. The weights of the connections form the components of the reference vector associated with the given output neuron.

Training a Kohonen network involves the following steps:
(i) initialise the reference vectors of all output neurons to small random values;
(ii) present a training input pattern;
(iii) determine the winning output neuron, i.e. the neuron whose reference vector is closest to the input pattern. The Euclidean distance between a reference vector and the input vector is usually adopted as the distance measure;
(iv) update the reference vector of the winning neuron and those of its neighbours. These reference vectors are brought closer to the input vector. The adjustment is greatest for the reference vector of the winning neuron and decreased for reference vectors of neurons further away. The size of the neighbourhood of a neuron is reduced as training proceeds until, towards the end of training, only the reference vector of a winning neuron is adjusted.

In a well-trained Kohonen network, output neurons that are close to one another have similar reference vectors. After training, a labelling procedure is adopted where input patterns of known classes are fed to the network and class labels are assigned to output neurons that are activated by those input patterns. As with the LVQ network, an output neuron is activated by an input pattern if it wins the competition against other output neurons, that is, if its reference vector is closest to the input pattern.

Figure 10. A Kohonen network (input vector, input neurons, reference vectors, output neurons)
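The training steps of the Kohonen network can be sketched as follows. The text only says that the adjustment decreases with distance from the winner and that the neighbourhood shrinks over time; the Gaussian neighbourhood function and the linear decay schedules used here are common choices adopted as assumptions, as are the function and parameter names. `grid` holds the (row, column) position of each output neuron in the two-dimensional array.

```python
import math


def winner(refs, x):
    """Step (iii): index of the reference vector closest to x."""
    d2 = [sum((w - xi) ** 2 for w, xi in zip(ref, x)) for ref in refs]
    return d2.index(min(d2))


def som_step(refs, grid, x, alpha, radius):
    """Step (iv): pull the winner and its grid neighbours towards x."""
    win = winner(refs, x)
    for n, ref in enumerate(refs):
        g2 = sum((a - b) ** 2 for a, b in zip(grid[n], grid[win]))
        h = math.exp(-g2 / (2.0 * radius * radius))   # 1 for the winner itself
        refs[n] = [w + alpha * h * (xi - w) for w, xi in zip(ref, x)]
    return win


def train_som(refs, grid, patterns, epochs=20, alpha0=0.5, radius0=1.5):
    """Repeat steps (ii) to (iv) while shrinking the rate and the neighbourhood."""
    for e in range(epochs):
        frac = 1.0 - e / epochs
        for x in patterns:
            som_step(refs, grid, x, alpha0 * frac, max(radius0 * frac, 1e-3))
    return refs
```

Once the radius has become very small, the neighbourhood factor is effectively zero for every neuron except the winner, so only the winning reference vector is adjusted, as described in step (iv).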

2.9 ART Networks

There are different versions of the ART network. Figure 11 shows the ART-1 version for dealing with binary inputs. Later versions, such as ART-2, can also handle continuous-valued inputs.

ART-1

As illustrated in Figure 11, an ART-1 network has two layers, an input layer and an output layer. The two layers are fully interconnected, with connections in both the forward (or bottom-up) direction and the feedback (or top-down) direction. The vector $W_i$ of weights of the bottom-up connections to an output neuron i forms an exemplar of the class it represents. All the $W_i$ vectors constitute the long-term memory of the network. They are employed to select the winning neuron, the latter again being the neuron whose $W_i$ vector is most similar to the current input pattern. The vector $V_i$ of the weights of the top-down connections from an output neuron i is used for vigilance testing, that is, determining whether an input pattern is sufficiently close to a stored exemplar. The vigilance vectors $V_i$ form the short-term memory of the network. $V_i$ and $W_i$ are related in that $W_i$ is a normalised copy of $V_i$, viz.

(26)  $W_i = \dfrac{V_i}{\varepsilon + \sum_j V_{ji}}$

where $\varepsilon$ is a small constant and $V_{ji}$ is the jth component of $V_i$ (i.e. the weight of the connection from output neuron i to input neuron j).

Figure 11. An ART-1 network (input layer, output layer, bottom-up weights W, top-down weights V)

Training an ART-1 network occurs continuously when the network is in use and involves the following steps:
(i) initialise the exemplar and vigilance vectors $W_i$ and $V_i$ for all output neurons, setting all the components of each $V_i$ to 1 and computing $W_i$ according to Equation (26). An output neuron with all its vigilance weights set to 1 is known as an uncommitted neuron in the sense that it is not assigned to represent any pattern classes;
(ii) present a new input pattern $x$;
(iii) enable all output neurons so that they can participate in the competition for activation;
(iv) find the winning output neuron among the competing neurons, i.e. the neuron for which $x \cdot W_i$ is largest; a winning neuron can be an uncommitted neuron as is the case at the beginning of training or if there are no better output neurons;
(v) test whether the input pattern $x$ is sufficiently similar to the vigilance vector $V_i$ of the winning neuron. Similarity is measured by the fraction r of bits in $x$ that are also in $V_i$, viz.

(27)  $r = \dfrac{x \cdot V_i}{\sum_j x_j}$

$x$ is deemed to be sufficiently similar to $V_i$ if r is at least equal to the vigilance threshold $\rho$ ($0 < \rho \le 1$);
(vi) go to step (vii) if $r \ge \rho$ (i.e. there is resonance); else disable the winning neuron temporarily from further competition and go to step (iv), repeating this procedure until there are no further enabled neurons;
(vii) adjust the vigilance vector $V_i$ of the most recent winning neuron by logically ANDing it with $x$, thus deleting bits in $V_i$ that are not also in $x$; compute the bottom-up exemplar vector $W_i$ using the new $V_i$ according to Equation (26); activate the winning output neuron;
(viii) go to step (ii).

The above training procedure ensures that if the same sequence of training patterns is repeatedly presented to the network, its long-term and short-term memories are unchanged (i.e. the network is stable). Also, provided there are sufficient output neurons to represent all the different classes, new patterns can always be learnt, as a new pattern can be assigned to an uncommitted output neuron if it does not match previously stored exemplars well (i.e. the network is plastic).

ART-2

The architecture of an ART-2 network [Carpenter and Grossberg, 1987; Pham and Chan, 1998; 2001] is depicted in Figure 12. In this particular configuration, the “feature representation” field F1 consists of 4 loops. An input pattern will be circulated in the lower two loops first. Inherent noise in the input pattern will be suppressed (this is controlled by the parameters a and b and the feedback function $f(\cdot)$) and prominent features in it will be accentuated. Then the enhanced input pattern will be passed to the upper two F1 loops and will excite the neurons in the “category representation” field F2 via the bottom-up weights. The “established class” neuron in F2 that receives the strongest stimulation will fire. This neuron will read out a “top-down expectation” in the form of a set of top-down weights sometimes referred to as class templates. This top-down expectation will be compared against the enhanced input pattern by the vigilance mechanism. If the vigilance test is passed, the top-down and bottom-up weights will be updated and, along with the enhanced input pattern, will circulate repeatedly in the two upper F1 loops until stability is achieved. The time taken by the network to reach a stable state depends on how close the input pattern is to passing the vigilance test. If it passes the test comfortably, i.e. the input pattern is quite similar to the top-down expectation, stability will be quick to achieve. Otherwise, more iterations are required. After the top-down and bottom-up weights have been updated, the current firing neuron will become an established class neuron. If the vigilance test fails, the current firing neuron will be disabled. Another search within the remaining established class neurons in the F2 layer will be conducted. If none of the established class neurons has a top-down expectation similar to the input pattern, an unoccupied F2 neuron will be assigned to classify the input pattern. This procedure repeats itself until either all the patterns are classified or the memory capacity of F2 has been exhausted.
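The search cycle just described (the strongest neuron fires, its expectation is tested against the vigilance threshold, a failed winner is disabled, and an unoccupied neuron is used as a fallback) is easiest to follow in the binary case. The following sketch implements steps (ii) to (vii) of the ART-1 procedure with Equations (26) and (27); it is not an ART-2 implementation. The function name, the default values of `rho` and `eps`, and the convention of returning `None` when every neuron has been disabled are assumptions made for illustration.

```python
def art1_present(v, x, rho=0.7, eps=0.5):
    """Present binary pattern x (at least one bit set) to an ART-1 network.

    v is the list of binary vigilance vectors V_i (all ones = uncommitted);
    it is updated in place. Returns the index of the resonating neuron,
    or None if every neuron failed the vigilance test.
    """
    enabled = list(range(len(v)))          # step (iii)
    ones_in_x = sum(x)
    while enabled:
        # Step (iv): the winner maximises x . W_i, with W_i from Equation (26).
        def score(i):
            return sum(a * b for a, b in zip(x, v[i])) / (eps + sum(v[i]))
        win = max(enabled, key=score)
        # Step (v): vigilance test of Equation (27).
        r = sum(a * b for a, b in zip(x, v[win])) / ones_in_x
        if r >= rho:
            # Step (vii): AND the vigilance vector with x (resonance).
            v[win] = [a & b for a, b in zip(v[win], x)]
            return win
        enabled.remove(win)                # step (vi): disable and search again
    return None
```

Starting from two uncommitted neurons, presenting `[1, 1, 0, 0]` and then `[0, 0, 1, 1]` commits one neuron to each pattern; presenting either pattern again selects the same neuron and leaves the vigilance vectors unchanged, which is the stability property noted for ART-1.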
The basic ART-2 training algorithm can be summarised as follows: (i) initialising the top-down and bottom-up long term memory traces; (ii) presenting an input pattern from the training data set to the network; (iii) triggering the neuron with the highest total input in the category representation field; (iv) checking the match between the input pattern and the exemplar in the top-down filter (long term memory) using a vigilance parameter; (v) starting the learning process if the mismatch is within the tolerance level defined by the vigilance parameter and then going to step (viii); otherwise, moving to the next step; (vi) disabling the current active neuron in the category representation field and returning to step (iii); go to step (vii) if all the established classes have been tried; (vii) establishing a new class for the given input pattern; (viii) repeating (ii) to (vii) until the network stabilises or a specified number of iterations are completed. In the recall mode, only steps (ii), (iii), (iv) and (viii) will be utilised.

Figure 12. Architecture of an ART-2 network

Dynamics of ART-2: The dynamics of the ART-2 network illustrated in Figure 12 is controlled by a set of mathematical equations. The primed variables belong to the lower two F 1 loops and the unprimed variables to the upper two loops. They are as follows:

(28) $w'_i = I_i + a\,u'_i$

(29) $x'_i = \dfrac{w'_i}{\|W'\|}$

(30) $v'_i = f(x'_i) + b\,f(q'_i)$

(31) $u'_i = \dfrac{v'_i}{\|V'\|}$

(32) $p'_i = u'_i$

(33) $q'_i = \dfrac{p'_i}{\|P'\|}$

(34) $w_i = q'_i + a\,u_i$

(35) $x_i = \dfrac{w_i}{\|W\|}$

(36) $v_i = f(x_i) + b\,f(q_i)$

(37) $u_i = \dfrac{v_i}{\|V\|}$

(38) $p_i = u_i + \sum_j g(Y_j)\,z_{ji}$

(39) $q_i = \dfrac{p_i}{\|P\|}$

The symbol $\|X\|$ represents the L2 norm of the vector X. If $X = (x_1, x_2, \ldots, x_n)$, then $\|X\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$. The output of the jth neuron in the classification layer is denoted by $g(Y_j)$. The L2 norm is used in the equations for the purpose of normalising the input data. The function f(·) used in Equations (30) and (36) is a non-linear function whose purpose is to suppress the noise in the input pattern down to a prescribed level. The definition of f(·) is

(40) $f(x) = \begin{cases} 0 & \text{if } 0 \le x < \theta \\ x & \text{if } x \ge \theta \end{cases}$

where θ is a user-defined parameter with a value between 0 and 1.

Learning Mechanism of ART-2: When an input pattern is applied to the ART-2 network, it will pass through the 4 loops comprising F 1 and then stimulate the classification neurons in F 2. The total excitation received by the jth neuron in the classification layer is equal to $T_j$ where

(41) $T_j = \sum_i p_i\,z_{ij}$

The neuron which is stimulated by the strongest total input signal will fire by generating an output with the constant value d. Therefore, for the winning neuron, $g(Y_j)$ equals d. When a winning neuron is determined, all the other neurons will be prohibited from firing. The value d will be used to multiply the top-down expectation of the firing class before the top-down expectation pattern is read out for comparison in the vigilance test. Since all the other neurons are inhibited when the winning neuron fires, it can be inferred that when there is a firing neuron (say j), Equation (38) becomes:

(42) $p_i = u_i + d\,z_{ji}$

Otherwise, if there is no winning neuron, it simplifies to:

(43) $p_i = u_i$
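The loop dynamics and the choice of the winning neuron can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed iteration count `n_iter` and the small constant `eps` that guards the normalisation are assumptions made for illustration, and only one pair of F 1 loops is iterated (the other pair follows the same pattern with a different input).

```python
import math

def f(x, theta):
    """Noise suppression of Equation (40): zero below theta, identity otherwise."""
    return x if x >= theta else 0.0

def normalise(vec, eps=1e-12):
    """Divide by the L2 norm; eps is an assumed guard against division by zero."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / (norm + eps) for c in vec]

def f1_loop(inp, a, b, theta, top_down=None, n_iter=20):
    """Iterate one pair of F1 loops (w -> x -> v -> u -> p -> q).

    top_down, if given, is the term d * z_ji that Equation (42) adds to u_i.
    """
    m = len(inp)
    u, q, p = [0.0] * m, [0.0] * m, [0.0] * m
    for _ in range(n_iter):
        w = [inp[i] + a * u[i] for i in range(m)]
        x = normalise(w)
        v = [f(x[i], theta) + b * f(q[i], theta) for i in range(m)]
        u = normalise(v)
        p = list(u) if top_down is None else [u[i] + top_down[i] for i in range(m)]
        q = normalise(p)
    return p, q

def choose_winner(p, z_bottom_up, enabled):
    """Equation (41): T_j = sum_i p_i z_ij; the enabled neuron with the largest T_j fires.

    z_bottom_up[j] holds the bottom-up weights leading to F2 neuron j.
    """
    totals = [sum(pi * zi for pi, zi in zip(p, z_j)) for z_j in z_bottom_up]
    candidates = [j for j in range(len(totals)) if enabled[j]]
    return max(candidates, key=lambda j: totals[j]), totals

# Two small components (0.05 and 0.02) play the role of noise.
p, q = f1_loop([0.8, 0.05, 0.6, 0.02], a=10.0, b=10.0, theta=0.2)
winner, totals = choose_winner(p, [[0.0, 1.0, 0.0, 0.0], [0.4, 0.0, 0.3, 0.0]], [True, True])
```

In this example the components below θ are driven to zero while the remaining ones are renormalised, which is the noise suppression and feature enhancement the text describes; disabling a neuron through `enabled` mimics the reset used during the search.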

The top-down expectation pattern is merged with the enhanced input pattern at point ri before they enter the vigilance test (see Figure 12). ri is defined by (44)

$r_i = \dfrac{q'_i + c\,p_i}{\|Q'\| + \|cP\|}$


The vigilance test is failed and the firing neuron will be reset if the following condition is true:

(45) $\dfrac{\rho}{\|R\|} > 1$

where ρ is the vigilance parameter. On the other hand, if the vigilance test is passed (in other words, the current input pattern can be accepted as a member of the class represented by the firing neuron), the top-down and the bottom-up weights are updated so that the special features present in the current input pattern can be incorporated into the class exemplar represented by the firing neuron. The updating equations are as follows:

(46) $\dfrac{d}{dt} z_{ji} = d\,(p_i - z_{ji})$

(47) $\dfrac{d}{dt} z_{ij} = d\,(p_i - z_{ij})$
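As a rough illustration of the vigilance test and of the weight update, the sketch below evaluates the reset condition of Equation (45) and integrates Equations (46) and (47) with a forward-Euler step. The function names, the step size `dt` and the decision to hold p fixed during the update are assumptions made for the example, not part of the original algorithm description.

```python
import math

def l2(vec):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in vec))

def vigilance_reset(enhanced, p, c, rho):
    """Merge the enhanced input with the top-down pattern (Equation (44))
    and return True when rho / ||R|| > 1, i.e. when the neuron must be reset."""
    denom = l2(enhanced) + l2([c * pi for pi in p])
    r = [(qi + c * pi) / denom for qi, pi in zip(enhanced, p)]
    return rho / l2(r) > 1.0

def update_weights(z, p, d, dt=0.1, steps=1):
    """Forward-Euler integration of dz/dt = d (p_i - z_i), with p held fixed."""
    z = list(z)
    for _ in range(steps):
        z = [zi + dt * d * (pi - zi) for zi, pi in zip(z, p)]
    return z

# Identical patterns give ||R|| = 1, so the test is passed for any rho <= 1.
match_reset = vigilance_reset([0.8, 0.0, 0.6, 0.0], [0.8, 0.0, 0.6, 0.0], c=0.1, rho=0.9)
# Orthogonal patterns shrink ||R||, so a strict vigilance value triggers a reset.
mismatch_reset = vigilance_reset([1.0, 0.0], [0.0, 1.0], c=0.1, rho=0.95)
learnt = update_weights([0.0, 0.0], [1.0, 0.5], d=0.9, steps=200)
```

With p held fixed, the weights relax exponentially towards p, which is how the features of the accepted pattern become part of the class exemplar.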

The bottom-up weights are denoted by $z_{ij}$ and the top-down weights by $z_{ji}$. According to the recommendations in [Carpenter and Grossberg, 1987], all the top-down weights should be initialised with the value 0 at the beginning of the learning process. This can be expressed by the following equation:

(48) $z_{ji}(0) = 0$

This measure is designed to prevent a neuron from being reset when it is allocated to classify an input pattern for the first time. The bottom-up weights are initialised using the equation:

(49) $z_{ij}(0) = \dfrac{1}{(1 - d)\sqrt{M}}$

where M is the number of neurons in the input layer. This number is equal to the dimension of the input vectors. This arrangement ensures that, after all the neurons with top-down expectations similar to the input pattern have been searched, it is easy for the input pattern to access a previously uncommitted neuron.

2.10 Spiking Neural Network

Experiments with biological neural systems have shown that they use the timing of electrical pulses or “spikes” to encode and transmit information. Spiking neural networks, also known as pulsed neural networks, are attempts at modelling the operation of biological neural systems more closely than is the case with other artificial neural networks. An example of a spiking neural network is shown in Figure 13. Each connection between neurons i and j can consist of multiple sub-connections, each associated with a weight value and a delay [Natschläger and Ruf, 1998].

Figure 13. Spiking neural network topology showing a single connection between input neuron i and output neuron j composed of multiple weights $w_{ij}^k$ with corresponding delays $d_{ij}^k$

Figure 14. Different shapes of response functions $\varepsilon_{ij}(t - s)$: a) excitatory postsynaptic potential (EPSP) function; b) inhibitory postsynaptic potential (IPSP) function

In the leaky integrate-and-fire model proposed by Maass [Maass, 1997], a neuron is regarded as a homogeneous unit that generates spikes when the total excitation exceeds a threshold value. Consider a network that consists of a finite set V of such spiking neurons, a set E ⊆ V × V of synapses, a set of weights $W_{u,v} \ge 0$, a response function $\varepsilon_{u,v} : \mathbb{R}^+ \to \mathbb{R}$ for each synapse (u, v) ∈ E, where $\mathbb{R}^+ = \{x \in \mathbb{R} : x \ge 0\}$, and a threshold function $\Theta_v : \mathbb{R}^+ \to \mathbb{R}$ for each neuron v ∈ V. If $F_u \subseteq \mathbb{R}^+$ is the set of firing times of a neuron u, then the potential at the trigger zone of each neuron v at time t is given by:

(50) $P_v(t) = \sum_{u : (u,v) \in E} \; \sum_{s \in F_u :\, s < t} W_{u,v}\, \varepsilon_{u,v}(t - s)$

Figure 6. Some types of neurons, such as thalamo-cortical neurons, present a dual firing behaviour: in their tonic firing mode the frequency of their response is proportional to the stimulus (10–165 Hz). However, when they are stimulated and afterwards inhibited for at least 100 msec, their response changes to burst firing with much higher frequency rates (150–320 Hz)

thalamus, at the core of the brain, are able to fire either in tonic or in burst mode as shown in Figure 6. The main characteristic of the tonic mode is that the spiking frequency is proportional to the stimulus, being in the range of 10 to 165 Hz. In the burst mode, however, the frequency is not related to the input activation and lies in the range of 150 to 320 Hz. This burst mode is very interesting because it takes place after a precise sequence of preliminary events. For the burst mode to happen, the thalamo-cortical neuron needs to be positively stimulated and afterwards inhibited for at least 100 msec. After these two events, burst firing is produced when a slight positive stimulation is given to the neuron. For a deeper study of these mechanisms see [Llinas and Jahnsen, 1982], [Llinas, 1994], [Steriade and Llinas, 1988]. The purpose of this dual behaviour is still a matter of controversy. Ropero [Ropero, 1997] proposed that the tonic mode serves for intrathalamic operations. When these intrathalamic operations are concluded, the result is relayed to the cortex via the burst firing mode.

1.3 Network Properties

1.3.1 Synchronization among neurons

Some types of neurons, for example the granule cells of the olfactory bulb and the reticular cells in the thalamus, are able to synchronize their activity and, afterwards, oscillate together [McCormick and Pape, 1990], [Steriade et al., 1987]. One of the causes of this behaviour is that these neurons possess dendro-dendritic electric contacts [Deschenes et al., 1985] in which the potential is communicated directly from one neuron to the other without any kind of neurotransmitter in between. The situation is as if we had a set of ping-pong balls tied together by fine cords and we used two very big bats to play with them. The movement of the balls becomes more and more uniform and synchronized during play. The kinetic energy given to the balls by each of the bats corresponds to the electric energy of ions entering the neurons. One type of ion increases the inner potential of the neurons when it is below a certain threshold, and another type of ion reduces the potential when it is above an upper voltage threshold.


CHAPTER 6

This interplay between the ions and the potential sharing of dendro-dendritically connected neurons generates the synchronized oscillations. This behaviour was modelled and programmed in Matlab [Ropero, 2003] with the results shown in Figure 7.

1.3.2 Normalizing inhibition

Inhibitory neurons were supposed to perform only subtraction [Carandini and Heeger, 1994] over other neurons, and this property was used for biasing the neurons in conventional neural network models such as backpropagation or radial basis networks. The operation of biasing the neurons was equivalent to shifting the activation function of these neurons to the right or to the left in a similar way to the one explained in section 2.2. This kind of subtracting or biasing inhibition is performed by means of the GABA-B (gamma-aminobutyric acid) neurotransmitter in real neurons. However, in many cases inhibition is performed by means of the GABA-A neurotransmitter instead of GABA-B, and the effect of GABA-A inhibition is divisive rather than subtractive. We postulate that this GABA-A inhibition could perform a scaling or normalizing effect on the input patterns arriving at a certain layer of the brain. Many structures in the brain have a layered organization. The input to each layer goes to two types of neurons: (A) neurons that perform an excitatory projection onto the following layer; (B) GABA-A neurons that produce inhibition inside their own layer, thereby creating an inhibitory divisive field in the layer (see Figure 8).

Figure 7. The height of each intersection of lines over the surface represents the activation of one neuron in a 7 × 7 net of neurons. If each of the neurons in this net has an oscillatory activity and the potential of each of them is partially shared with the other neurons, a synchronization of the activities takes place. From top to bottom and from left to right, a computer simulation of the synchronization of a 7 × 7 net of neurons is shown

BIOLOGICAL CLUES FOR UP-TO-DATE ARTIFICIAL NEURONS

Figure 8. Normalization of synaptic inputs due to an inhibitory field of GABA-A inhibitory interneurons. The six neurons of the lower layer of the figure form an excitatory input I = (I1, I2, I3, I4, I5, I6) impinging on a second layer of neurons (middle). This pattern produces an excitation (+) over the six neurons in the middle layer and over GABA-A inhibitory interneurons that are not shown. Once these inhibitory interneurons are activated, they create an inhibitory field that divides the activation of the middle layer neurons by n(I) = n(I1) + n(I2) + n(I3) + n(I4) + n(I5) + n(I6). In this way the neuron at the top (O) receives a normalized input i = (i1, i2, i3, i4, i5, i6) that is the result of dividing each of the components of pattern I by the constant n(I)
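The divisive effect described in Figure 8 can be sketched in a few lines. This is only an illustration under the assumption that the inhibitory field equals the plain sum of the six input components; the function name is hypothetical.

```python
def divisive_normalisation(inputs):
    """Divide every component of the input pattern by the summed input n(I)."""
    total = sum(inputs)              # plays the role of n(I) = n(I1) + ... + n(I6)
    if total == 0:
        return [0.0 for _ in inputs]
    return [x / total for x in inputs]

# The same pattern presented at two different intensities.
weak = divisive_normalisation([1, 2, 3, 1, 2, 1])
strong = divisive_normalisation([10, 20, 30, 10, 20, 10])
```

Both calls return the same normalized pattern, which shows why a divisive field makes the next layer sensitive to the shape of the pattern rather than to its overall intensity.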

The activation of excitatory and inhibitory neurons in each layer has almost the same absolute value because the input pattern impinges on excitatory and inhibitory neurons at the same time. Therefore this inhibitory divisive field is proportional to this activation. This divisive inhibition is able to produce a sort of normalization over input patterns (see Figure 8 for more details).

2. UPDATING MCCULLOCH-PITTS MODEL

Up to this point we have introduced several properties of real neurons that are of remarkable interest for computational purposes. Using some of them, we tried to update some of the characteristics of the McCulloch-Pitts paradigm of neural computation.

2.1 Up-to-date Synaptic Model

The classical model of synaptic weight alteration due to Hebb lacked many of the properties that were mentioned in previous sections. Here we propose another model that not only mimics the way biological reinforcement and depression are produced but also exhibits the property of metaplasticity [Ropero and Simões, 1999]. In our model the synaptic weight between the presynaptic neuron A and the postsynaptic neuron B is calculated as:

(2) $w_{AB} = P(B/A)$

where B is a postsynaptic activation above a specific threshold and A a presynaptic action potential. As shown, the synaptic weight is calculated as a conditional probability. The above expression can also be written as:

(3) $w_{AB} = P(B/A) = \dfrac{n(A \cap B)}{n(A)}$

in which the operator “n( )”, number of times, quantifies how many times a certain event takes place, for example how many times event A, event B or the intersection of A and B occurs. Starting with different values of the numerator and denominator, i.e. different initial weights, and allowing the postsynaptic neuron to fire according to a non-linear squashing (logistic) function, a 3-D version of Figure 2 is obtained in Figure 9. In this figure a continuous line drawn on the surface shows the evolution of the LTP threshold as a function of the initial weight. It can be noticed that a very simple statistical expression is able to account for a wide variety of properties such as reinforcement, depression and metaplasticity. Therefore, taking into account that conditional probabilities can be computed in synapses, the obvious question that arises here is: Are real synapses the tiniest pieces of the most fascinating statistical computer ever imagined?

Figure 9. The computer simulation shows that metaplasticity takes place when the synaptic weight is calculated using the conditional probability P(B/A), B being a suprathreshold activation of the postsynaptic neuron and A the presynaptic action potential. A line joins the different LTP thresholds, each one of them corresponding to a different initial synaptic weight. Axes: initial weight, normalized postsynaptic activity (voltage) and change in synaptic strength

2.2 Up-to-date Neuron Model

We propose a neuron model (see Figure 10) using the equation just presented for modelling the synaptic weights [Ropero, 1996]. In the soma, the excitatory postsynaptic potentials (EPSPs) are summed. An EPSP is obtained by multiplying the probability P(Ii) of an action potential in the presynaptic neuron by the corresponding weight P(O/Ii). Although there are no probabilities in the presynaptic space, only action potentials at different frequencies, the product P(Ii)P(O/Ii) can approximate each of the EPSPs. Each EPSP is formed by the sum of the voltage humps produced by successive presynaptic action potentials, in a process known as temporal summation. When these humps are close together, they ride over previous humps, creating a taller EPSP. When they are far apart, as for example when the frequency of presynaptic action potentials is low, they can hardly ride over each other and the resulting EPSP is low. Given that the maximal frequency of presynaptic action potentials is limited, the height of the resulting EPSP is also limited. This maximal height corresponds to a P(Ii)P(O/Ii) of value 1. All the EPSPs travel from the dendrites to the soma, where they are summed. This sum is the so-called activation of the neuron, which is afterwards transformed into a frequency of action potentials by means of a logistic or sigmoidal function. To prevent the saturation of the weights, a normalization of the input pattern by means of divisive inhibition is commonplace in the brain.

Figure 10. Model of a neuron based on conditional probabilities for calculating the synaptic weights. In each synapse the probability of a presynaptic action potential is multiplied by the synaptic weight and the result gives the postsynaptic activation in each synapse. The sum of postsynaptic activations gives the activation of the neuron, which is calculated as P(O/I) = P(O/I1)P(I1) + P(O/I2)P(I2) + P(O/I3)P(I3)
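A minimal sketch of the synaptic and neuron models of sections 2.1 and 2.2 is given below. The class and function names, the event-counting interface and the `gain` and `offset` parameters of the logistic function are assumptions introduced for illustration; the chapter specifies only the conditional-probability weight and the weighted sum followed by a logistic function.

```python
import math

class ProbabilisticSynapse:
    """Synaptic weight w_AB = P(B/A) = n(A and B) / n(A)."""

    def __init__(self, n_a=1, n_ab=0):
        self.n_a = n_a      # times the presynaptic neuron fired (event A)
        self.n_ab = n_ab    # times A coincided with suprathreshold postsynaptic activity (event B)

    def observe(self, pre_fired, post_active):
        """Update the counters; nothing changes unless the presynaptic neuron fired."""
        if pre_fired:
            self.n_a += 1
            if post_active:
                self.n_ab += 1

    @property
    def weight(self):
        return self.n_ab / self.n_a

def neuron_output(p_inputs, weights, gain=1.0, offset=0.0):
    """Activation sum_i P(O/I_i) P(I_i), then a logistic squashing function."""
    activation = sum(w * p for w, p in zip(weights, p_inputs))
    return activation, 1.0 / (1.0 + math.exp(-gain * (activation - offset)))

syn = ProbabilisticSynapse(n_a=4, n_ab=2)   # initial weight 0.5
w0 = syn.weight
syn.observe(pre_fired=True, post_active=True)    # reinforcement: 3/5
w1 = syn.weight
syn.observe(pre_fired=True, post_active=False)   # depression: 3/6
w2 = syn.weight
activation, rate = neuron_output([0.4, 0.8], [0.5, 0.25])
```

The initial counters stand for the initial weight: a synapse with large counters changes more slowly than one with small counters, so the same conditional-probability rule behaves differently depending on the history of the synapse.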

2.3 Up-to-date Network Model

If the same pattern is input to several neurons, instead of only one, a competitive process can take place so that only one neuron, the one whose activation is maximal, becomes the winner of the competition. When the winner fires, the remaining neurons are kept silent. Silencing the non-winning neurons is usually done by inhibitory feedback or lateral inhibition. To prevent a single neuron from becoming the winner for every pattern, the probabilistic synapses should be normalized over time (see Figure 11). This is one of the possible roles of biological synaptic normalization: giving every neuron the same opportunity to fire. But what biological mechanisms are involved in the selection of this winning neuron? In section 1.3.1 it was shown that the synchronized oscillation of neurons is a mechanism found at least in the thalamus and the olfactory bulb. This synchronized oscillation can allow the neuron with maximal activation to be found: if a common oscillatory potential is added to the activations of a layer of neurons, the neuron whose total activation first reaches a certain firing threshold is also the one with the biggest activation [Ropero, 2003].

Figure 11. Synaptic normalization allows a competitive process among neurons. The neuron whose synaptic weight distribution $w_{ij}$ is most similar to the input pattern of frequencies T = (t1, t2, t3) is also the one with maximal activation $y_i = \sum_{j=1}^{3} w_{ij} t_j$. For the input T = (0.2, 0.4, 0.5), neuron a with weights [0.6, 0.4, 0.2] yields y1 = 0.38, neuron b with weights [0.2, 0.4, 0.6] yields y2 = 0.50 and neuron c with weights [0.4, 0.6, 0.2] yields y3 = 0.42. Neuron b, whose weights are most similar to T, therefore exhibits the maximal value of the sum of the products of the input frequencies and its weights. Notice that due to the synaptic normalization the number of ionic channels is the same in the three neurons. In summary, synaptic normalization is the property that allows the neuron whose weight distribution is most similar to the input pattern to exhibit the maximal activation
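The competition of Figure 11 can be reproduced with a short sketch. The input pattern and the weights of neurons a and b are those given in the figure; the weights of neuron c are read from the figure as [0.4, 0.6, 0.2], which reproduces y3 = 0.42. The helper `first_to_threshold`, its threshold and its step size are hypothetical and only illustrate the idea of adding a common rising potential to all neurons.

```python
def activations(weight_rows, t):
    """y_i = sum_j w_ij * t_j for every neuron i."""
    return [sum(w * x for w, x in zip(row, t)) for row in weight_rows]

def first_to_threshold(y, threshold=1.0, step=0.01, max_steps=10000):
    """Add a common, slowly rising potential to all neurons and return the
    index of the neuron that reaches the firing threshold first."""
    for k in range(max_steps):
        common = k * step
        crossed = [i for i, yi in enumerate(y) if yi + common >= threshold]
        if crossed:
            # If several neurons cross within the same step, keep the most activated one.
            return max(crossed, key=lambda i: y[i])
    return None

T = [0.2, 0.4, 0.5]
W = [[0.6, 0.4, 0.2],   # neuron a
     [0.2, 0.4, 0.6],   # neuron b
     [0.4, 0.6, 0.2]]   # neuron c
y = activations(W, T)
winner = first_to_threshold(y)
```

Because every row of W sums to the same value (1.2), no neuron wins simply by having larger weights; the winner is decided by how well its weight distribution matches the input, and the first neuron to reach the threshold is the one with the largest activation.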

3. JOINING THE BLOCKS: A NEURAL NETWORK MODEL OF THE THALAMUS

Probabilistic synapses, synchronized oscillations, weight normalization and normalizing inhibition are all properties that were used to implement a realistic computational model of the thalamus. The thalamus is a structure at the core of the brain that relays sensory information from the senses to the cortex. The function of this structure was unknown. The model we propose helps to understand the role of the thalamus in the computation performed by the brain [Ropero, 1996], [Ropero, 1997], [Pelaez, 2000]. The thalamus is basically a two-layered brain structure. The first layer is formed by thalamo-cortical neurons that receive sensory patterns and, after approximately 100 msec, send the result of the inner thalamic computation to the cortex. The second layer, formed by reticular neurons that oscillate synchronously, performs a competitive process by which each one of the neurons fires in the presence of specific characteristics of the input patterns. When several of these neurons fire, they produce several inhibitory masks that, when superposed, create a negative replica of the input pattern over the first layer, as shown in Figure 12. If the input patterns were damaged or noisy, the negative replica recreates a perfect version of the input without defects or noise. Pattern reconstruction and noise rejection are two of the tasks that we postulate the thalamus is able to perform. For these tasks, a process of learning must take place at the level of the thalamus. Our computer model of the thalamus, programmed in Matlab, has these two layers, each one of 9 × 9 = 81 neurons. The two layers are completely interconnected with each other, giving 2 × 81 × 81 = 13122 connections. It learned 36 characters during several epochs and is able to recognize and complete damaged or noisy patterns (see Figure 12).
The learning capability of the model suggests that the real thalamus also has learning capabilities, a fact that was completely ignored until now in thalamus research.

4. CONCLUSIONS

In this review we have presented several properties of synapses, neurons and networks that were not considered in previous neural network models but that have interesting computational potential. The McCulloch-Pitts neuron model was based on the restricted knowledge about neurons that existed in the forties. Nowadays a more comprehensive knowledge of the amazing properties of neurons can be used to update the McCulloch-Pitts model. In the case of synaptic plasticity we presented several properties of synaptic weights such as directionality, the existence of both potentiation and depression thresholds, metaplasticity and normalization. Regarding neurons, relevant properties were introduced to the reader, such as spike threshold adaptation and the dual behaviour in frequency of some types of neurons. Finally, concerning networks of neurons, we studied the synchronization of a set of neurons and the normalizing inhibition produced by a set of GABA-A neurons over the input pattern of another neuron.


Figure 12. A biologically realistic computer model of the thalamus constituted by two layers of 9 × 9 = 81 neurons each. An example of the pattern reconstruction capability of the thalamus model is shown (a) After being trained with 36 different characters (letters and numbers) a very noisy and damaged testing pattern is input which vaguely resembles a B. (b) An “I” shaped sustained feedback inhibition over the first layer is produced by a reticular neuron in the second layer. After firing, the reticular neuron rests in refractoriness. This inhibition reduces the subsequent activation in the first layer. (c) Another neuron fires and immediately enters in the refractory period producing another sustained inhibition that is superposed over the previous one. Both inhibitions are shaped like an E. (d) Finally, another reticular neuron fires and the total inhibition completely reconstructs letter B showing the reconstruction capability of the thalamic model. The central figure of each screen gives the value of the activations of a net of reticular neurons

With all these elements in mind we proposed a new equation for synaptic reinforcement based on conditional probabilities. The paradigm of a neuron was also modified, taking into account that the neuron is always integrated in a network. For example, if the neuron were detached from the inhibitory field that normalizes its inputs, its active synaptic weights would increase without bound and the neuron would be saturated most of the time. It was also shown that the normalization of synaptic weights is an important condition for allowing a competitive process between neurons. An example of such competition, and of all the mentioned properties working together, is the model of the thalamus that we programmed in Matlab. It learned 36 characters and exhibits the property of completing damaged or noisy patterns.


We expect that the reader benefits from this paper’s account of recently found neural properties when creating new artificial neural networks or trying to emulate the functioning of the brain.

REFERENCES

Abraham, W.C., and Bear, M.F. (1996) Metaplasticity: the plasticity of synaptic plasticity. Trends in Neurosciences 19:126–130.
Abraham, W.C., and Tate, W.P. (1997) Metaplasticity: a new vista across the field of synaptic plasticity. Progress in Neurobiology 52:303–323.
Artola, A., Brocher, S., and Singer, W. (1990) Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature 347:69–72.
Bear, M.F., Connors, B.W., and Paradiso, M.A. (2001) Neuroscience: Exploring the Brain. Lippincott, Williams & Wilkins, USA.
Bienenstock, E.L., Cooper, L.N., and Munro, P.W. (1982) Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. The Journal of Neuroscience 2(1):32–48.
Carandini, M., and Heeger, D.J. (1994) Summation and division by neurons in primate visual cortex. Science 264(5163):1333–1336.
Carpenter, G., and Grossberg, S. (1988) The ART of adaptive pattern recognition by a self-organizing neural network. Computer 21(3):77–88.
Deschenes, M., Madariaga-Domich, A., and Steriade, M. (1985) Dendrodendritic synapses in the cat reticularis thalami nucleus: a structural basis for thalamic spindle synchronization. Brain Research 334:165–168.
Desai, N.S., Rutherford, L.C., and Turrigiano, G.G. (1999) Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nature Neuroscience 2:515–520.
Hopfield, J.J. (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79:2554–2558.
Kohonen, T. (1982) Self-organized formation of topologically correct feature maps. Biological Cybernetics 43:59–69.
Llinás, R., and Jahnsen, H. (1982) Electrophysiology of mammalian thalamic neurones in vitro. Nature 297:406–408.
Llinas, R., Ribary, U., Joliot, M., and Wang, X.J. (1994) Content and context in temporal thalamocortical binding. In G. Buzsaki et al. (Eds.), Temporal Coding in the Brain (pp. 151–72). Berlin: Springer-Verlag.
McClelland, J.L., Rumelhart, D.E., and The PDP Research Group (1986) Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.
McClelland, J.L., and Rumelhart, D.E. (1988) Explorations in parallel distributed processing. Cambridge, MA: MIT Press.
McCormick, D.A., and Pape, H.-C. (1990) Properties of a hyperpolarization-activated cation current and its role in rhythmic oscillation in thalamic relay neurons. Journal of Physiology (London) 431:291–318.
McCulloch, W., and Pitts, W. (1943) A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 1943.
Ropero Peláez, J. (1996) A formal representation of thalamus and cortex computation. Proceedings of the International Conference of Brain Processes, Theories and Models. Edited by Roberto Moreno-Díaz and José Mira-Mira. MIT Press.
Ropero Peláez, J. (1997) Plato’s theory of ideas revisited. Neural Networks, 1997 Special Issue 10(7):1269–1288.
Ropero Peláez, J., and Godoy Simões, M. (1999) A computational model of synaptic metaplasticity. Proceedings of the International Joint Conference on Neural Networks 1999. Washington DC.
Ropero Peláez, J. (2000) Towards a neural network based therapy for hallucinatory disorders. Neural Networks, 2000 Special Issue 13(2000):1047–1061.
Ropero Peláez, J. (2003) PhD Thesis in Neuroscience: Aprendizaje en un modelo computacional del tálamo. Faculty of Medicine, Autónoma University of Madrid.
Rosenblatt, F. (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65:386–408.
Steriade, M., Domich, L., Oakson, G., and Deschenes, M. (1987) The deafferented reticular thalamic nucleus generates spindle rhythmicity. The Journal of Neurophysiology 57:260–273.
Steriade, M., and Llinas, R.R. (1988) The functional states of the thalamus and the associated neuronal interplay. Physiological Reviews 68(3):649–739.
Tompa, P., and Friedrich, P. (1998) Synaptic metaplasticity and the local charge effect in postsynaptic densities. Trends in Neurosciences 21(3):97–101.
Turrigiano, G.G., Leslie, K.R., Desai, N.S., Rutherford, L.C., and Nelson, S.B. (1998) Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature 391:892–896.

CHAPTER 7 SUPPORT VECTOR MACHINES

JAIME GÓMEZ SÁENZ DE TEJADA¹, JUAN SEIJAS MARTÍNEZ-ECHEVARRÍA²

¹ Escuela Politécnica Superior, Universidad Autónoma de Madrid
² Escuela Técnica Superior de Ingenieros de Telecomunicaciones, Universidad Politécnica de Madrid

Abstract:

The Support Vector Machine is the most recent algorithm in the Machine Learning community. After a bit less than a decade of life, it has displayed many advantages with respect to the best older methods: generalization capacity, ease of use and uniqueness of the solution. It has also shown some disadvantages: a limit on the amount of data it can handle and slow speed in the training phase. However, these disadvantages will be overcome in the near future as computer power increases, leaving an all-purpose learning method that is both cheap to use and gives the best performance. This chapter provides an overview of the main SVM configuration, its mathematical applications and the easiest implementation.

Keywords:

Support Vector Machines, Machine Learning

D. Andina and D.T. Pham (eds.), Computational Intelligence, 147–191. © 2007 Springer.

INTRODUCTION

Machine Learning has become one of the main fields in artificial intelligence today. Whether in the pattern recognition field or in function estimation, statistical Machine Learning tries to find a numerical hypothesis which adapts correctly to the given data, that is, machines able to generalize the statistical distribution of a representative data set. Once we have generated the hypothesis, all future unknown patterns following the same distribution will be correctly classified. From the principles of statistical mechanics, a handful of algorithms have been devised to solve the classification problem, such as decision trees, k-nearest neighbour, neural networks, Bayesian classifiers, radial basis function classifiers and, as a newcomer, support vector machines (from now on SVM).

The basic SVM is a supervised classification algorithm introduced by Vladimir Vapnik, motivated by VC (Vapnik-Chervonenkis) theory [Vapnik, 1995], from which the Structural Risk Minimization concept was derived. In the late 70’s, Vapnik studied the numeric solution of convex quadratic problems applied to Machine Learning, and defined an immediate ancestor of SVM called ‘Generalization portrait’. In the early 90’s, Vapnik joined Bell Laboratories, where his ideas evolved until the creation of the term ‘support vector machines’ in 1995. Nevertheless, the basic mathematics behind SVM were developed much earlier. The concept of generating a hyperplane in a space other than the input space in order to separate data in input space, the heart of SVM, was settled in 1964. The study of convex quadratic problems gave the Karush-Kuhn-Tucker optimality conditions in 1936, while the definition of valid kernel functions for the transformation described above was formulated by Mercer in 1909.

This chapter provides an introductory view of SVM, so that any computer scientist or engineer can develop his own SVM implementation and apply it to any real-world machine-learning problem. For that purpose, we will sacrifice some mathematical completeness for the sake of clarity. The chapter has four sections: first, the SVM will be defined and analysed; second, the main mathematical uses of the SVM principle will be developed; third, a comparison between SVM and neural networks will be studied; last, the best current implementation approach will be shown. Support Vector Machines are easy to understand, not too difficult to implement, and child’s play to use. If you need a generic Machine Learning method, forget about neural networks or any other method you previously learnt: the SVM family globally outperforms them all.

1. SVM DEFINITION

1.1 Structural Risk

Classifiers with a large number of adjustable parameters (and therefore great capacity) will most probably overfit: they learn the training data set without errors, but generalize poorly. Conversely, a classifier with insufficient capacity will not be able to generate a hypothesis complex enough to model the data. A mid-point must be found where the adjustable parameters are neither too many nor too few, for both the training and test sets. For that reason, it is essential to choose the kind of functions a learning machine can implement. For a given problem, the machine must have a low classification error and also a small capacity. Capacity is defined as the ability of a given machine to learn any training set without errors. For example, the 1-nearest neighbour classifier has infinite capacity, but is a poor classifier for unseen test data with complex distributions and noisy sets. A machine with great capacity will tend to overfit the data, making it no longer useful because it does not learn. For extended information about these issues, see [Burges, 1998]. There are a handful of mathematical bounds that relate the learning ability of a machine to its performance. The underlying theory tries to find under which circumstances, and how fast, the performance measure converges as the number of training input data increases. In the limit, with an infinite number of points, we could have a correct performance value, better than just an

SUPPORT VECTOR MACHINES

estimation. With respect to the SVM, we will use one bound in particular, which will take us to the Structural Risk Minimization (SRM) principle [Vapnik, 1995]. Suppose we have $l$ observations as input data in the training phase. Each observation consists of a pair $(x_i, y_i)$, where $x_i \in \mathbb{R}^n$, $i = 1, \dots, l$, is a vector and $y_i \in \{1, -1\}$ is the fixed associated label, given by a consistent data source. We assume there is an unknown probability distribution $P(x, y)$ from which the data points have been drawn. Data are always assumed to be independently drawn and identically distributed. Suppose we have a machine whose task is to learn the mapping $x_i \to y_i$. This generic machine is really defined by a set of possible mappings $x \to f(x, \alpha)$, where the functions $f(x, \alpha)$ are generic and indexed by the set of adjustable parameters $\alpha$. The machine is by definition deterministic, that is, for a given input vector $x_i$ and a parameter set $\alpha$, we will always obtain the same output $f(x_i, \alpha)$. Choosing the parameter set $\alpha$ gives a trained machine. For example, a neural network with a fixed architecture and fixed weights (parameter set $\alpha$) would be a trained machine as defined in these paragraphs. Thus, the expected error in the test phase for a trained machine is:

(1)  $R(\alpha) = \int \frac{1}{2} |y - f(x, \alpha)| \, dP(x, y)$

The value $R(\alpha)$ is called the expected risk. Nevertheless, this expression is difficult to use because the probability distribution $P(x, y)$ is unknown. Thus, a variation of the formula is developed to use the finite number of available observations. It is called the empirical risk, and is defined as the measured mean error rate on the training set:

(2)  $R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)|$

$R_{emp}(\alpha)$ is a fixed number for a given parameter set $\alpha$ and training set. It has been shown in [Vapnik, 1995] that the following bound holds with high probability:

(3)  $R(\alpha) \le R_{emp}(\alpha) + g(h)$

where g(h) is a real number which is directly related to the VC dimension. Again, a learning machine is defined as a set of parameterised functions (called a family of functions) having a similar structure. The term Vapnik-Chervonenkis (VC) dimension is a non-negative integer that measures the generalization capacity previously defined. The VC dimension for a given learning machine is defined as the maximum number of points that can be correctly classified using functions belonging to the family set. In other words: if VC dimension = h, then there exists a set of h points that can be classified with family functions regardless of the point labels. Note that, first, there cannot exist a set of h + 1 points satisfying the constraint; second, you only need one set of h points for the definition to be applicable (it did not say “for all h-points sets”).


Figure 1. Three wisely chosen points

Let’s try an example. Suppose we are in $\mathbb{R}^2$ and the learning machine $L_1$ is defined as the set of “one straight line” classifiers. In figure 1 we choose three points. We see (you can try) that, for every combination of labels (8 possible combinations using 3 points with two labels), they can be separated using one straight line. Each combination may need a different straight line, but every such line is a member of the family set. Therefore the VC dimension of the analysed learning machine is at least 3. If we try 4 points (any 4-point set) we will not be able to satisfy all label combinations, so we can state that “one straight line” classifiers in $\mathbb{R}^2$ have VC dimension equal to 3. Another example: suppose we are in $\mathbb{R}^2$ and the learning machine $L_2$ is defined as the set of “two-segment line” classifiers (continuous but non-differentiable at the joint point). In figure 2 we choose five points. Again, try all possible label combinations (now 32). Using a two-segment line you can separate all 32 cases, but it would not be possible to do the same with 6 points, however well chosen. Therefore the VC dimension for this learning machine is 5.

Figure 2. Five wisely chosen points
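The first example can be checked numerically. The sketch below is only illustrative: the point coordinates, the helper names `separable` and `shattered`, and the use of a perceptron with an iteration cap as the separability test are all assumptions made here, not part of the chapter. A `True` result is a genuine separating line; a `False` result only means that no line was found within the budget, and only one particular 4-point set is tried.

```python
from itertools import product

def separable(points, labels, max_epochs=10_000):
    """Try to find a separating line with a perceptron (with bias term)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            # A point is wrong (or on the line) when y * (w.x + b) <= 0.
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True   # every point is strictly on its own side
    return False          # no line found within the iteration budget

def shattered(points):
    """True if every +1/-1 labelling of the points can be separated."""
    return all(separable(points, labels)
               for labels in product((1, -1), repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]          # illustrative, non-collinear
four = [(0, 0), (1, 1), (1, 0), (0, 1)]   # illustrative, contains the XOR case

print(shattered(three))  # True: all 8 labellings are separated
print(shattered(four))   # False: the XOR labelling is never separated
```

The 4-point result does not prove the VC dimension by itself, since the definition quantifies over every 4-point set; it only shows one set that a straight line cannot shatter.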


Figure 3. Training set and two valid classifiers, “straight-line”(dashed line) and “two-segment-line” (solid line)

When facing a problem that can be solved using different learning machines, as in figure 3, which one is better? The SRM principle will try to find the learning machine with the lowest VC dimension that correctly classifies all data points. The consequences are analysed in section 2.12. As regards the SVM definition, the SRM principle and the VC dimension concept require that the chosen classifier be the one with the largest margin (linear SVM use the family of linear hyperplanes in input space), which is defined in the next section.

1.2 Linear SVM for Separable Data

The simplest case for a SVM is that of linear machines trained with a separable data set (see Figure 4a).

Figure 4a. Linear separable training set


Suppose we have a training data set made of pairs $(x_i, y_i)$, $i = 1, \dots, l$, such that $x_i \in \mathbb{R}^d$ and $y_i \in \{1, -1\}$. Suppose there exists a hyperplane in $\mathbb{R}^d$ which separates positive from negative examples (according to their $y_i$ value). Points that are exactly on the hyperplane satisfy the condition:

(4)  $w \cdot x + b = 0$

where $w$ is a vector perpendicular to the hyperplane (regardless of its norm), $|b| / \|w\|$ (the absolute value of $b$ divided by the norm of $w$) is the distance from the origin to the hyperplane, and the operator $\cdot$ is the dot product in the Euclidean space to which the data belong (we will use the scalar product between two $d$-dimensional vectors). Let $d_+$ ($d_-$) be the shortest distance between the hyperplane and a positive (negative) example; the margin of the hyperplane is defined as $d_+ + d_-$. By maximizing the classifier margin, we decrease the risk bound defined in (3). This is the basis for the following SVM mathematical development. For the linear and separable case, the SVM algorithm calculates the separator hyperplane that maximizes the classifier margin. Thus, all training data must satisfy the following constraints:

(5)  $w \cdot x_i + b \ge +1$  for $y_i = +1$

(6)  $w \cdot x_i + b \le -1$  for $y_i = -1$

which can be combined into one expression:

(7)  $y_i (w \cdot x_i + b) - 1 \ge 0 \quad \forall i$

All points for which equality holds in (5) lie on the hyperplane $H_1: w \cdot x_i + b = 1$, parallel to the separator hyperplane and at distance $|1 - b| / \|w\|$ from the origin. In much the same way, the points for which equality holds in (6) lie on the hyperplane $H_2: w \cdot x_i + b = -1$, parallel to $H_1$ and to the separator hyperplane and at distance $|-1 - b| / \|w\|$ from the origin. Thus, $d_+ = d_- = 1 / \|w\|$, and so the margin is $2 / \|w\|$. We must find the pair of planes $H_1$, $H_2$ that maximizes the margin by minimizing $\|w\|^2$, subject to the constraints defined in inequality (7). Note that, in the training phase, no data point will lie between $H_1$ and $H_2$ or on the wrong side of its class plane (that is the reason for calling it the separable case). Those points that satisfy the equality in (7) (those placed on $H_1$ or $H_2$) and that, if eliminated from the training set, would give a different solution (by definition they would change $d_+$ or $d_-$), are called support vectors. The name comes from the fact that the learning machine is completely defined by these points and their weight in the hyperplane. All other training points, which are at a greater distance from the hyperplane than the support vectors, serve no purpose: if we had begun the training without them, the solution would have remained the same (see Figure 4b).
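As a small numerical illustration of constraint (7) and of the margin $2 / \|w\|$, the following sketch evaluates a candidate hyperplane on three points. The points are those used in the optimisation example later in this chapter; the candidate values of `w` and `b`, the tolerance and the helper names are assumptions chosen here for illustration.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def functional_margins(points, labels, w, b):
    """Left-hand side of (7) plus one: y_i * (w . x_i + b) for each point."""
    return [y * (dot(w, x) + b) for x, y in zip(points, labels)]

def geometric_margin(w):
    """Distance between H1 and H2, that is 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

points = [(1, 1), (2, 1), (3, 1)]
labels = [1, -1, -1]
w, b = (-2.0, 0.0), 3.0   # candidate hyperplane (assumed for the sketch)

margins = functional_margins(points, labels, w, b)
# Points with y_i * (w . x_i + b) == 1 lie exactly on H1 or H2.
on_margin = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-9]

print(margins)              # [1.0, 1.0, 3.0]: constraint (7) holds for all
print(on_margin)            # [0, 1]: candidates for support vectors
print(geometric_margin(w))  # 1.0
```

The third point is further from the hyperplane than the other two, so removing it would leave this candidate unchanged.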


Figure 4b. Linear SVM classifier. Support vectors are encircled, the margin is shown with two dashed lines and the separator hyperplane is shown with a solid line

The problem can be reformulated using Lagrange multipliers. This will help us to add constraints to the problem more easily, and will let the training data appear only in the form of dot products between vectors, which in turn will let us generalize the SVM algorithm to the non-linear case. The general rule for creating the Lagrange formulation is: for constraints of the type $c \ge 0$, the constraint equation is multiplied by a Lagrange multiplier and subtracted from the objective function. Thus, we introduce non-negative Lagrange multipliers $\alpha_i$, $i = 1, \dots, l$, one for each constraint in inequality (7), that is, one for each training point. The Lagrangian we obtain is:

(8)  $L_P = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{l} \alpha_i$

We want to minimize $L_P$ with respect to $w$ and $b$ (the variables that define the plane), and require that the partial derivatives of $L_P$ with respect to the $\alpha_i$ be 0. By definition, this is a convex quadratic optimisation problem, because the objective function is convex and the constraints also define a convex set [Burges, 1998]. This means we can solve the problem using the dual formulation [Fletcher, 1987]. This Wolfe dual formulation has the following property: the maximization of $L_D$ (in contrast with the primal formulation $L_P$) under the defined constraints occurs at the same values of $w$ and $b$ as the minimization of $L_P$ described in the previous paragraph. All partial derivatives must be zero at the optimum. Calculating the partial derivatives of $L_P$ with respect to $b$ and $w$, we obtain the following conditions:

(9)  $w = \sum_{i=1}^{l} \alpha_i y_i x_i$

(10)  $\sum_{i=1}^{l} \alpha_i y_i = 0$


which, substituted into equation (8), gives:

(11)  $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

Therefore, the problem is now written as “Maximize $L_D$ with respect to all $\alpha_i \ge 0$, satisfying conditions (7) and (10)”. There is a Lagrange multiplier for each training point, but only those with $\alpha_i > 0$ matter when calculating the separator hyperplane with equation (9). These are the support vectors defined in previous paragraphs. The geometric interpretation of (11) is easier if the second term is rewritten using (9). Suppose we are in an intermediate optimisation state, and we want to calculate the second term for the point $i = 0$. The term is:

$\sum_{j=1}^{l} \alpha_0 \alpha_j y_0 y_j \, x_0 \cdot x_j = \alpha_0 y_0 \sum_{j=1}^{l} \alpha_j y_j \, x_0 \cdot x_j = \alpha_0 y_0 \, x_0 \cdot \sum_{j=1}^{l} \alpha_j y_j x_j = \alpha_0 y_0 \, x_0 \cdot w$

The scalar product of a point and a vector normal to the hyperplane gives the projection of the point onto that vector, that is, the relative distance between point and hyperplane. The concept underlying the formula is:

$\alpha_0$ × (correctness of classification) × (distance between the point and the currently defined separator plane)

At each step $i$, the relation between $x_i$ and the current-state $w$ is calculated. Therefore, we can deduce some hand-made optimisation rules:

A) If the classification is correct, the contribution of the term to $L_D$ is negative, so $\alpha_i$ should decrease, reducing the weight (the importance) of the point in the calculation of the current $w$, in case the optimum has not been reached.

B) If the distance is big with respect to other points of the same class, and the point is correctly classified, $\alpha_i$ should decrease, while the multiplier $\alpha_k$ of another same-class point $k$ closer to the margin should increase.

Note that when evaluating the correctness of a point during the training phase, the point itself is used. If a point is misclassified, the algorithm will increase its multiplier as much as needed, forcing the hyperplane definition until the condition for this point is satisfied. For the linearly separable case this strategy is valid, because sooner or later the point must be correctly classified. But for non-linear or non-separable cases, this strategy may give poor results. If we have some noise in the training data, the algorithm will try to force the hyperplane definition to classify points that are wrong. This will overfit the data, so the performance will be poorer. Therefore, the SVM training algorithm consists of the following basic steps:


1. Identify all training data points and their labels.

2. Optimize (maximize) the dual Lagrangian, maintaining the constraints defined in (7) and (10). For that purpose, there are many optimisation methods for convex quadratic problems described in the mathematical literature [Fletcher, 1987]. The result of the optimisation phase is the set of all Lagrange multiplier values $\alpha_i$. Basic optimization methods are severely limited by the resources (time and memory) they need in big problems (more than 10,000 patterns). Thus, at the beginning of SVM history, efficient optimization algorithms were the basic research line. In section 5 the best SVM algorithm will be shown: SMO.

3. Throw away all those points which are not support vectors after the training process (i.e. those having $\alpha_i = 0$), and calculate the values of $w$ and $b$ from the support vectors and formulas (9) and (7). We then have a completely defined optimum separator hyperplane.

1.3 Karush-Kuhn-Tucker Conditions

Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient conditions for a point to be a solution of the problem defined in step 2 of the previous algorithm. This solution identifies the optimum value of the objective function $L_P$ with respect to all available parameters (all $\alpha_i$). Many SVM algorithms use these KKT conditions to check whether the current state of the machine is the optimum and, if it is not, which points violate the optimality conditions the most. For the basic SVM definition given in this chapter, the optimality conditions are:

(9)  $w = \sum_{i=1}^{l} \alpha_i y_i x_i$

(10)  $\sum_{i=1}^{l} \alpha_i y_i = 0$

(7)  $y_i (w \cdot x_i + b) - 1 \ge 0$

(7 bis)  $\alpha_i \, (y_i (w \cdot x_i + b) - 1) = 0, \quad \alpha_i \ge 0$

Most of them have been introduced in previous sections of this chapter, but they are repeated here for a better understanding of the optimisation process. The new equation (7 bis) is easy to interpret. It concerns the points that must hold equality in inequality (7), and can be stated as follows: “Any training point either holds equality in inequality (7), or its Lagrange multiplier is cancelled, i.e. $\alpha_i = 0$”. If equality holds in (7) and $\alpha_i > 0$, then the point is on the margin hyperplane and is a support vector. It can also happen that both conditions hold, that is, equality holds in (7) and $\alpha_i = 0$. In that case, the point is on the margin hyperplane of its class but it is not needed for the hyperplane definition, and therefore it is not a support vector.
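The four conditions can be turned into a simple check of a candidate state. The sketch below is a minimal illustration, not a training algorithm: the function name `kkt_report`, the tolerance and the dictionary keys are assumptions introduced here, and the data are the three points of the optimisation example that follows in this chapter.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def kkt_report(points, labels, alphas, b, tol=1e-9):
    """Evaluate conditions (9), (10), (7) and (7 bis) for a candidate state."""
    dim = len(points[0])
    # Condition (9): w is built from the multipliers.
    w = [sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
         for k in range(dim)]
    # Condition (10): the weighted labels must cancel out.
    balance = sum(a * y for a, y in zip(alphas, labels))
    # Left-hand side of (7) for every training point.
    lhs = [y * (dot(w, x) + b) - 1 for x, y in zip(points, labels)]
    return {
        "w": w,
        "eq10": abs(balance) <= tol,
        "eq7": all(v >= -tol for v in lhs),
        "eq7bis": all(a >= -tol and abs(a * v) <= tol
                      for a, v in zip(alphas, lhs)),
    }

points = [(1, 1), (2, 1), (3, 1)]
labels = [1, -1, -1]

good = kkt_report(points, labels, [2.0, 2.0, 0.0], b=3.0)
bad = kkt_report(points, labels, [1.0, 1.0, 0.0], b=2.0)
print(good)  # all three checks are True, w == [-2.0, 0.0]
print(bad)   # eq7 and eq7bis are False: not an optimum
```

In a real solver the points that fail these checks by the largest amount would be the natural candidates to update next, as described above.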

1.4 Optimisation Example

To show the optimisation process more clearly, we will introduce an example. Suppose we have 3 points $(1, 1)$, $(2, 1)$, $(3, 1) \in \mathbb{R}^2$ with labels $+1$, $-1$, $-1$ respectively (see figure 5). Suppose the initialisation routine sets the Lagrange multipliers to $\alpha_1 = 2$, $\alpha_2 = 1$, $\alpha_3 = 1$ (so that condition (10) holds). We use formulas (9) and (11) to calculate the following:

$w_1 = 2(1, 1) - 1(2, 1) - 1(3, 1) = (-3, 0)$

$L_{D1} = 4 - \frac{1}{2}(-6 + 6 + 9) = -0.5$

Then, we check whether this is a valid solution for our SVM. For that purpose, we use the KKT conditions, especially condition (7). Note that all three points would be support vectors, so they must give the same value of $b$ when substituted in condition (7). At this optimisation stage this is not true for $w_1$, because we obtain $b = 4$, $b = 5$ and $b = 8$. Thus we can say, without doubt, that this is not a solution. Now we must find another set of Lagrange multipliers that brings an increase of $L_D$. Point 3 is farthest from the current pseudo-hyperplane (being correctly classified), so it is a good candidate for decreasing its weight in the definition of $w$ (see section 2.2). Suppose that the new Lagrange multiplier values are $\alpha_1 = 1$, $\alpha_2 = 1$, $\alpha_3 = 0$ (condition (10) must always hold).

$w_2 = 1(1, 1) - 1(2, 1) - 0(3, 1) = (-1, 0)$

$L_{D2} = 2 - \frac{1}{2}(-1 + 2) = 1.5$

We made a good choice because $L_D$ has increased. Nevertheless, we still do not satisfy the KKT conditions. When we substitute in equation (7), we obtain $b = 2$ and $b = 1$ for the two points respectively (we have only two support vectors).

Figure 5. A linear separable set with margin and separator hyperplane


Now that we have two support vectors of different classes, their Lagrange multipliers must change by the same amount for condition (10) to hold. We increase them, for instance, to $\alpha_1 = 2$, $\alpha_2 = 2$, $\alpha_3 = 0$.

$w_3 = 2(1, 1) - 2(2, 1) - 0(3, 1) = (-2, 0)$

$L_{D3} = 4 - \frac{1}{2}(-4 + 8) = 2$

Again, $L_D$ has increased, so we have chosen wisely. Moreover, at this optimisation step the KKT conditions hold, with the same value of $b$ for all support vectors, $b = 3$. We can assert without any doubt that the optimum has been reached. For instance, if we continued to increase the multipliers to $\alpha_1 = 3$, $\alpha_2 = 3$, $\alpha_3 = 0$, the result would not be valid. We would obtain:

$w_4 = 3(1, 1) - 3(2, 1) - 0(3, 1) = (-3, 0)$

$L_{D4} = 6 - \frac{1}{2}(-9 + 18) = 1.5$

The convexity required by the objective function definition holds: $L_{D1} < L_{D2} < L_{D3} > L_{D4}$. Moreover, as the example is so small, some degree of uniform quadratic convexity can be seen, as $L_{D2} = L_{D4}$ on either side of the optimum.

During the optimisation process, while the KKT conditions do not hold, the unique separator hyperplane does not exist. At each new step (new set of values of $\alpha$), there is only one hyperplane direction, but as many separator hyperplanes as support vectors in the training set (different values of $b$). These hyperplanes do not need to have a geometric meaning; they do not try to separate the data, even though they could. As we get closer to the optimum (increasing $L_D$), all the hyperplanes defined by the support vectors come closer to each other (less difference in the value of $b$). The limit is reached when $L_D$ gets to the optimum value, and all hyperplanes match up with only one value of $b$: the separator hyperplane.

This concept differs greatly from the search process followed by other similar methods, like the perceptron. The perceptron always defines a separator hyperplane that evolves at each training step, trying to classify all training data correctly. For that reason, it can reach a state in which all data points are correctly classified, but whose margin is not the optimum. That is called a local minimum, where the perceptron will be trapped and will not be able to continue. The SVM algorithm performs a quadratic optimisation in which no intermediate state can be considered a valid solution. There will be one solution only, it will be global, and it will be the best you can have.

Even though the soft-margin SVM is defined in later sections, this is a good place to see what happens when the optimisation algorithm is applied to a non-separable data set. Suppose we have again the 3 points used before, $(1, 1)$, $(2, 1)$, $(3, 1) \in \mathbb{R}^2$, but now with labels $+1$, $-1$, $+1$ (see figure 6). We have changed the label of the third point, so the training set becomes non-separable with a linear machine. Nevertheless, this information is not given to the SVM algorithm.


Figure 6. A linear non-separable set

Suppose we initialise the values as $\alpha_1 = 1$, $\alpha_2 = 2$, $\alpha_3 = 1$ (condition (10) holds).

$w_1 = 1(1, 1) - 2(2, 1) + 1(3, 1) = (0, 0)$

$L_{D1} = 4 - \frac{1}{2}(0 - 0 + 0) = 4$

Of course, this cannot be a solution. We do not need to check the KKT conditions, because $w = (0, 0)$ does not define a hyperplane. At this stage we cannot guess which points are better to change, so we do it randomly. Suppose we define a new state $\alpha_1 = 1.5$, $\alpha_2 = 2$, $\alpha_3 = 0.5$ (there are not many more alternatives).

$w_2 = 1.5(1, 1) - 2(2, 1) + 0.5(3, 1) = (-1, 0)$

$L_{D2} = 4 - \frac{1}{2}(-1.5 + 4 - 1.5) = 3.5$

We obtain $L_{D2} < L_{D1}$, so we can be sure this is not a solution and, even more, that this direction will take us nowhere. We choose another possible set, $\alpha_1 = 2$, $\alpha_2 = 4$, $\alpha_3 = 2$.

$w_3 = 2(1, 1) - 4(2, 1) + 2(3, 1) = (0, 0)$

$L_{D3} = 8 - \frac{1}{2}(0 - 0 + 0) = 8$

As in the first case, this cannot be a solution. But $L_D$ has increased quite a lot, and we could think this is getting us closer to the solution. However, we can keep increasing the multipliers in the same proportion, with $L_{Dn} = \alpha_1 + \alpha_2 + \alpha_3$, so the objective function increases without limit (note that along this direction the problem is not characterized by a quadratic function but by a linear one, so there cannot be an optimum). Therefore, if the objective function increases without limit, we are applying a linear separable machine to a linearly non-separable training set.
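The arithmetic of both examples can be reproduced with a few lines of code. This is a sketch for checking the numbers only: the helper names `w_from`, `dual` and `offsets` are hypothetical, the multiplier sets are exactly those tried in the text, and $L_D$ is evaluated as $\sum_i \alpha_i - \frac{1}{2}\|w\|^2$, which is equation (11) after substituting (9).

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def w_from(alphas, labels, points):
    """Equation (9): w = sum_i alpha_i * y_i * x_i."""
    dim = len(points[0])
    return tuple(sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
                 for k in range(dim))

def dual(alphas, labels, points):
    """Equation (11), using sum_ij a_i a_j y_i y_j x_i.x_j = ||w||^2."""
    w = w_from(alphas, labels, points)
    return sum(alphas) - 0.5 * dot(w, w)

def offsets(alphas, labels, points):
    """Value of b implied by equality in (7) for each point with alpha > 0."""
    w = w_from(alphas, labels, points)
    return [y - dot(w, x)
            for a, y, x in zip(alphas, labels, points) if a > 0]

points = [(1, 1), (2, 1), (3, 1)]
sep_labels = [1, -1, -1]      # separable example
nonsep_labels = [1, -1, 1]    # non-separable example

for alphas in [(2, 1, 1), (1, 1, 0), (2, 2, 0), (3, 3, 0)]:
    print(alphas, w_from(alphas, sep_labels, points),
          dual(alphas, sep_labels, points),
          offsets(alphas, sep_labels, points))
# L_D takes the values -0.5, 1.5, 2.0, 1.5; only (2, 2, 0) gives a single b.

for alphas in [(1, 2, 1), (1.5, 2, 0.5), (2, 4, 2)]:
    print(alphas, w_from(alphas, nonsep_labels, points),
          dual(alphas, nonsep_labels, points))
# L_D takes the values 4.0, 3.5, 8.0 and keeps growing along w = (0, 0).
```

The disagreement between the values returned by `offsets` is a convenient measure of how far a state is from satisfying the KKT conditions.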

1.5 Test Phase

As has been said, once we have trained an SVM, we obtain the values $w$ and $b$. With these values, we define a separating hyperplane, $w \cdot x + b = 0$, parallel to $H_1$ and $H_2$ and placed in the middle, at the same distance from both. To classify an unseen pattern $x$, we just need to know on which side of the separator hyperplane the point lies, i.e., the sign of $w \cdot x + b$. Note that in the test phase we may find data points placed between $H_1$ and $H_2$ which, if used during training, would have changed the solution somehow. This concept may be useful when developing SVM training algorithms, because it could find support vectors a priori, before the whole training, saving computational power.

Up until now we have mentioned only the binary case, that is, data can only have two classes. SVM classifiers can be easily extended to the multiple-class case: for $n$ classes, we just need to generate $n - 1$ binary classifiers, each of which separates one class from the rest. Nevertheless, this multiple classifier is $O(n)$ times more complex in time (memory resources are more difficult to estimate) than one binary classifier, in the training as well as in the test phase. As this extension does not bring major new advances, it will not be mentioned in the rest of this chapter.

1.6 Non-Separable Linear Case

Now that we know everything that is needed to create and use a simple SVM, we will upgrade its definition so that it can deal with any real-life problem. When the above-described algorithm for separable data is used on non-separable data (see figure 7), no solution will be found, as the value of $L_D$ will grow without limit (see section 2.5). For the non-separable data to satisfy the initial constraints, we have to introduce the concept of soft margin. This means that the algorithm will allow some training points to violate those constraints, so that the rest of the training data will be correctly classified (regardless of the violating points). For that purpose we introduce positive slack variables $\xi_i$, one for each point, such that the following inequalities hold [Cortes and Vapnik, 1995]:

(12)  $w \cdot x_i + b \ge +1 - \xi_i$  for $y_i = +1$

(13)  $w \cdot x_i + b \le -1 + \xi_i$  for $y_i = -1$

Figure 7. A linear non-separable set, which needs a soft-margin classifier. The distribution is defined as class = 1 if $x_1 + x_2 > 7.5$; class = −1 otherwise. The distribution has some noise

The values $\xi_i$ are not fixed prior to the training; they are calculated during the optimisation process. And because they are not fixed, we can be certain that all points will satisfy inequalities (12) and (13): just increase the $\xi_i$ of a point until its inequality holds. We have solved our troubles: now, there will always be a solution. But it may be that the solution is not close enough to the true distribution underlying the data. If that is so, then the solution is useless; so we have just changed the name of our worries. The introduction of these variables must be accompanied by an increase of the primal objective $L_P$, so that classification errors during training are minimized. For a training pattern to be a classification error, its associated $\xi_i$ must be greater than 1, so

$\sum_{i=1}^{l} \xi_i$

is an upper bound on the number of training errors over the complete training set. Therefore, the objective function to be minimized changes from $\frac{1}{2} \|w\|^2$ to

$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i$

where $C$ is a user-chosen non-negative real parameter. This value corresponds to the global penalization given to training errors. This new objective function could have been different. We could have devised other methods for forcing the $\xi_i$ values to be as small as possible. This particular function is chosen for simplicity: the problem continues to be convex quadratic, and neither the $\xi_i$ nor the Lagrange multipliers associated with these new constraints appear in the dual formulation of the problem. Therefore, we have to maximize $L_D$:

(14)  $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

with constraints:

(15)  $0 \le \alpha_i \le C$

(16)  $w = \sum_{i=1}^{l} \alpha_i y_i x_i$

(17)  $\sum_{i=1}^{l} \alpha_i y_i = 0$


The only difference between the previous algorithm and this last one is that now the $\alpha_i$ have an upper bound $C$. The training algorithm will not allow any point to increase its weight indefinitely, and so a solution will eventually be found. The error term in the optimisation process comes from those points that have $\xi_i > 0$, either because they are incorrectly classified or because they lie inside the margin. For any point that satisfies $\xi_i > 0$, it can be stated that $\alpha_i = C$. It is still a support vector, and it will be treated as such in the calculation of $w$, but in the optimisation process its weight will grow no more. The soft margin philosophy (as opposed to the hard margin defined in section 2.2) is not to forbid training errors, nor even to minimize them alone. The idea is to minimize the whole objective function, in which errors exert some pressure, as does the robustness of the hypothesis, identified with the maximization of the margin between the well-classified points on each side of the separating hyperplane (characterised by constraint (7)). Suppose, for instance, the case shown in figure 7. A hard margin classifier cannot be found, but many soft margin classifiers will satisfy the constraints, and the only difference will be the value of $C$. The first approach for newcomers is usually the hardest soft margin possible, one that looks like figure 8. It is a valid solution, but it has a very small margin. By definition of structural risk minimization, if we increase the margin, test errors should decrease (better generalization performance). On the other hand, training errors should be avoided (or, at least, limited), so a balance must be found between margin maximization and error permissibility. A small quantity of noise may be accepted without modifying the generalization performance, by creating a hypothesis that follows the common properties satisfied by the data (the internal, true data distribution).
In the case of figure 9, it is easily seen that more training points become errors, but the classifier is much closer to the underlying distribution concept. The new parameter C becomes the only value (until now) that must be provided in the SVM architecture. As it has been said, C serves as a balance between error permissibility and generalization goodness.
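To make the role of the slack variables and of $C$ concrete, the sketch below evaluates the soft-margin objective for a fixed candidate hyperplane. It is not a training routine: the data are the three points of the non-separable example of section 1.4, and the candidate `w`, `b`, the values of `C` and the helper names are assumptions chosen for illustration. Each slack is taken as the smallest non-negative value that satisfies inequalities (12) and (13).

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def slacks(points, labels, w, b):
    """Smallest xi_i >= 0 such that y_i * (w . x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b))
            for x, y in zip(points, labels)]

def soft_objective(points, labels, w, b, C):
    """Primal soft-margin objective: 0.5 * ||w||^2 + C * sum(xi_i)."""
    return 0.5 * dot(w, w) + C * sum(slacks(points, labels, w, b))

points = [(1, 1), (2, 1), (3, 1)]
labels = [1, -1, 1]          # the third point makes the set non-separable
w, b = (-2.0, 0.0), 3.0      # candidate hyperplane (assumed for the sketch)

xi = slacks(points, labels, w, b)
errors = sum(1 for v in xi if v > 1.0)   # points with xi_i > 1 are misclassified

print(xi)                                           # [0.0, 0.0, 4.0]
print(errors, sum(xi))                              # 1 4.0: the sum bounds the error count
print(soft_objective(points, labels, w, b, C=1.0))  # 6.0
print(soft_objective(points, labels, w, b, C=10.0)) # 42.0
```

With a larger `C` the same violation costs more, so an optimiser would accept a smaller margin to reduce it; with a smaller `C` the margin term dominates.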

Figure 8. The figure 7 set, with a rather hard soft margin classifier


Figure 9. The figure 7 set, with a softer margin classifier

– If $C$ is small, then errors are cheap. The margin will grow, and so will the number of training patterns that violate the margin.

– If $C$ is big, then the value of $\|w\|$ has little relevance in the optimisation of the objective function compared with the training errors. We are approaching the hard margin philosophy. Because the value of $\|w\|$ is closely related to margin maximization, decreasing the relevance of $\|w\|$ will take us to a smaller margin and, maybe, to a worse generalization ability.

To choose a good value of $C$, model complexity and expected data noise must be evaluated as a whole.

1.7 Non-Linear Case

In most real-life cases, data cannot be separated using a linear hyperplane in input space. Even the use of slack variables could lead to a poor classifier when the deviations from linearity are caused by the structure of the hypothesis and not by noisy data. The next step is to introduce non-linear separating surfaces instead of hyperplanes into the SVM algorithm (see figure 10). For that purpose, we map the input data into another Euclidean space $H$, whose dimension is higher than that of the input space. We use a mapping function $\Phi$, such that:

$\Phi: \mathbb{R}^d \to H$

In the dual formulation of the problem, input data vectors appear only as inner products $x_i \cdot x_j$ in the space to which they belong. Now they will only appear as $\Phi(x_i) \cdot \Phi(x_j)$ in space $H$. Space $H$ will usually have a very high dimension. It could even have infinite dimension. Therefore, performing operations in this space could be too costly. But if we could find a kernel function $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, then we would not need to map data vectors into space $H$ explicitly; we would not even need to know what $\Phi$ is. Now we just have to define a valid kernel function $K$, and substitute $K(x_i, x_j)$ everywhere $x_i \cdot x_j$ appears in the algorithm.

Figure 10. Non-linear distribution set

When we use a space of much higher dimension, many new data features, linear and non-linear, arise. Each new dimension offers a new possible correlation view, a new attribute with which we can separate the data, a new factor with which to create the hypothesis. It is the responsibility of the training process to discriminate those attributes that contain useful information for defining the hyperplane from those that do not, by assigning them a bigger weight in the linear combination of all features. For those cases in which the user has some information about data correlation, an explicit mapping can be generated. Nevertheless this is not usual, and could lead to an inefficient implementation, depending on the credibility of the previous knowledge. Using generic mapping functions (we will see them later) offers the possibility of generating an enormous number of new features, without taking care of the meaning of each one. In fact, these spaces usually have thousands, millions or even infinitely many dimensions. It is difficult to picture such a big geometrical space. It seems easier to identify it with a set of non-linear relations between input attributes, which can be assembled with linear relations in the optimisation process to create a surface (a hyperplane in feature space, an indefinable curve in input space) capable of separating the input data of one class from the other. If we replace $x_i \cdot x_j$ by $K(x_i, x_j)$ everywhere in the training phase formulas, the algorithm defined in section 2.2 will generate a linear SVM in a high-dimensional space (specified by the mapping function). And most importantly, it will do so with roughly the same time complexity as a simple linear SVM created in input space (without mapping). All further development stays the same, as we are still creating a linear separator, although in a different space.
In the linear case, the training phase output was the value of w and b, with which the hyperplane was completely defined, and so the test phase had just to see at which hyperplane side the new pattern was. Now, we cannot explicitly calculate w, because it is defined in space H only and we do not know exactly how the mapping is made.


Through the support vector expansion, the value of $w$ can be written as:

(18)  $w = \sum_{i=1}^{N} \alpha_i y_i \Phi(s_i)$

so we can write the classification function as:

(19)  $f(x) = \sum_{i=1}^{N} \alpha_i y_i \, \Phi(s_i) \cdot \Phi(x) + b = \sum_{i=1}^{N} \alpha_i y_i K(s_i, x) + b$

where the $s_i$ are the $N$ support vectors, identified in the training phase as those patterns whose Lagrange multiplier is not zero. With this definition we once more avoid calculating the mapping function $\Phi$. Note that the soft margin concept still applies to a non-linear classifier. Actually, its implementation remains very simple: the Lagrange multipliers have an upper limit. In this case the soft margin applies to the linear classifier in the high-dimensional space. The clearest advantage is that we can still assert there is a solution. The use of a non-linear surface as a separator does not by itself guarantee that a solution will be found, even though it is more probable. Moreover, using the soft margin alternative makes the classifier more robust against noisy training patterns.

The time complexity of the training phase does not change, but the test phase is different. In the linear case, having calculated $w$ explicitly, the complexity is $O(1)$ using the inner product as the basic operation (which is $O(d)$ if multiply-add is the basic operation). In the non-linear case, we need to perform $O(N)$ operations, where $N$ was previously defined as the number of support vectors. Because of this relation between the number of support vectors and complexity, algorithms have been devised that try to minimize, or even replace, support vectors during and after training, so that this phase may be competitive with other machine learning methods, such as neural networks.

1.8 Mapping Function Example

To make the idea of generating useful new features clearer, we will show an example. Suppose we have a data set $(x_i, c_i)$ in $\mathbb{R}^2 \times \{+1, -1\}$ as shown in figure 11. It can be seen that this is not a linearly separable case, and the soft margin linear separator is not enough. In this example, the training data has no noise. We define a mapping function $\Phi: \mathbb{R}^2 \to \mathbb{R}^3$ of the form:

$$(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$$

We have therefore added a new feature to the input definition, which gives us information about a specific kind of relation between the two initial variables. Thus, we can calculate the kernel function:

$$K(x, x') = \Phi(x) \cdot \Phi(x') = (x_1, x_2, x_1 x_2) \cdot (x'_1, x'_2, x'_1 x'_2) = x_1 x'_1 + x_2 x'_2 + x_1 x_2 \, x'_1 x'_2$$
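As a quick sanity check of the identity above, the following minimal sketch compares the inner product computed explicitly in feature space with the kernel computed from input-space quantities only. The function names `phi`, `dot` and `kernel` and the two sample points are illustrative choices, not part of the original text.

```python
def phi(x):
    """Explicit mapping from R^2 to R^3: (x1, x2) -> (x1, x2, x1*x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def kernel(x, y):
    """Kernel of the example, written with input-space quantities only."""
    return x[0] * y[0] + x[1] * y[1] + (x[0] * x[1]) * (y[0] * y[1])


# Two arbitrary sample points (hypothetical values, for illustration only).
a = (2.0, 3.0)
b = (-1.0, 4.0)

explicit = dot(phi(a), phi(b))  # inner product computed in feature space
implicit = kernel(a, b)         # same value without ever building phi
```

Both routes give the same number, which is the point of the kernel: the feature space never has to be constructed.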

Figure 11. Non-linear distribution set. The distribution is defined as class = 1 if $x_1 x_2 < 14.5$; class = −1 otherwise

Figure 12. Feature space view for the main points from figure 11. The margin h1 − h2 is partially shown using solid lines

We have defined the mapping function and the new space implicitly, using the inner product in input space as the only valid operator. Figure 12 shows the most important points from the training data set, together with the separator hyperplane the SVM algorithm would find and those points that become support vectors. The separator hyperplane is z = 14.5. Note that in the final hypothesis only one of the three available features is required to create the hyperplane (it is defined using just the third component). This will be a very common case in non-linear SVM: just a few features will form the linear combination defining the separator hyperplane. To represent the curve in input space that describes the generated hyperplane we need to use the inverse mapping:

$$\Phi^{-1}: (x_1, x_2, x_1 x_2) \mapsto (x_1, x_2)$$

Figure 13. Non-linear classifier for the figure 11 set. Support vectors are encircled, margin is shown using dashed lines and the separator curve is shown with a solid line

As the new axis z was defined as $z = x_1 x_2$ in the high-dimensional space, the points that lie on the hyperplane satisfy $x_1 x_2 = 14.5$, so the curve in input space can be defined as $x_2 = 14.5/x_1$. Figure 13 shows the final result, with the hyperbola $x_2 = 14.5/x_1$ as the non-linear class separator. The support vectors in this figure are those identified during training and highlighted in figure 12. It should not be assumed that the points lying near the non-linear separator in input space will necessarily become support vectors, although this is usually the case. The mapping function does not necessarily preserve any relation between input data, but the concept behind the support vector is that of a “significant point”, and the points that carry most information are those that lie near points of the other class in input space. In real world cases, this mapping function will not be useful unless clear and simple a priori information is available to the SVM engineer. Nevertheless, it is a valid mapping function and generates a valid kernel function. For this to happen, the function K(x,y) must satisfy some constraints, known as Mercer conditions.
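The separator of this example can be sketched in a few lines. The code below is only an illustration under the assumptions stated in the text (threshold 14.5 on the third feature); the names `classify` and `separator_curve` are hypothetical.

```python
THRESHOLD = 14.5  # separator hyperplane z = 14.5 in feature space


def classify(x1, x2):
    """Return +1 if the mapped point lies below the hyperplane, else -1."""
    z = x1 * x2  # third feature produced by the mapping
    return 1 if z < THRESHOLD else -1


def separator_curve(x1):
    """Input-space curve x2 = 14.5 / x1 (undefined at x1 = 0)."""
    return THRESHOLD / x1


# Points just below and just above the curve at x1 = 2 (curve value 7.25).
below = classify(2.0, 7.0)
above = classify(2.0, 7.5)
```

For positive x1 the test z < 14.5 is the same as x2 < 14.5/x1; for negative x1 the inequality on x2 reverses, which is why the decision is taken on z rather than on the curve itself.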

1.9 Mercer Conditions

Not every function K(x, y) is a valid kernel, that is, one that describes an inner product in a Euclidean space with the properties required in previous sections. It is enough to satisfy Mercer conditions [Vapnik, 1995], which can be written as follows. There exists a mapping $\Phi$ with $K(x, y) = \Phi(x) \cdot \Phi(y)$ if and only if, for all g(x) such that $\int g(x)^2 \, dx$ is finite, the following inequality holds:

$$\int\!\!\int K(x, y) \, g(x) \, g(y) \, dx \, dy \ge 0$$

In most cases this is a very complicated condition to check, because it must hold ‘for all g(x)’. It has been demonstrated for

$$K(x, y) = \sum_{p=1}^{P} C_p \, (x \cdot y)^p$$

when $C_p$ is a positive real number and p is a positive integer.

1.10 Kernel Examples

The first (and only) basic kernels used to develop pattern recognition as well as non-linear regression and principal component analysis with SVM are (for any pair of vectors $x, y \in \mathbb{R}^d$):

$$K(x, y) = (x \cdot y + 1)^p \qquad (20)$$

$$K(x, y) = \exp\left(-\|x - y\|^2 / 2\sigma^2\right) \qquad (21)$$

$$K(x, y) = \tanh(\kappa \, x \cdot y - \delta) \qquad (22)$$

Kernel (20) is a non-homogeneous polynomial classifier of degree p (another variation in use is the homogeneous polynomial kernel, without the ‘+1’ term). It creates a space H with as many dimensions (data features) as there are products of up to p input attributes. All possible relations between input attributes up to degree p appear in the new space. The margin maximization algorithm will discriminate the features that carry information from those that do not (which should be most of them), so the number of adjustable parameters required to obtain a good solution decreases.

Kernel (21) is a Gaussian radial basis function (RBF). The dimension of the new space is not fixed: it depends on the actual data distribution and can be infinite. The visual effect of this kernel is that nearby patterns form class clusters, as big as they can be. Clusters have the support vectors as centres (in feature space), and their radius is given by the value of σ and the support vector weight obtained during training.

Kernel (22) is similar to a two-layer sigmoidal neural network. Using the neural network kernel, the first layer is composed of N sets of weights, each set consisting of d weights; the second layer is composed of N weights (the $\alpha_i$), so that an evaluation requires a weighted sum of sigmoids evaluated on dot products. The structure and weights (which define the related neural network architecture) are given automatically by the training process. Not all values of κ and δ satisfy Mercer conditions [Vapnik, 1995].

We say (20), (21) and (22) are basic functions because new kernel functions can be formulated by combining them while still satisfying Mercer conditions. A linear combination of two Mercer kernels with non-negative coefficients is a Mercer kernel. This can be easily demonstrated from the linearity of the integral with respect to addition. Other slight changes can also be applied to the basic functions, in search of a kernel function that incorporates a priori information about the internal distribution.
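The three basic kernels and the classification function (19) can be sketched as follows. This is a minimal illustration, not a trained machine: the support vectors, labels, multipliers and bias at the bottom are hand-picked hypothetical values, and the default parameter values are arbitrary.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y, p=2):
    """Kernel (20): non-homogeneous polynomial of degree p."""
    return (dot(x, y) + 1.0) ** p


def rbf_kernel(x, y, sigma=1.0):
    """Kernel (21): Gaussian radial basis function."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """Kernel (22): hyperbolic tangent; not Mercer for every kappa, delta."""
    return math.tanh(kappa * dot(x, y) - delta)


def decision(x, support_vectors, labels, alphas, b, kernel):
    """Equation (19): f(x) = sum_i alpha_i * y_i * K(s_i, x) + b."""
    return sum(a * y * kernel(s, x)
               for s, y, a in zip(support_vectors, labels, alphas)) + b


# Hypothetical "trained" values, chosen by hand for illustration only.
svs = [(0.0, 0.0), (2.0, 2.0)]
ys = [+1, -1]
alphas = [1.0, 1.0]
bias = 0.0

# A pattern close to the first support vector gets a positive score.
value = decision((0.1, 0.0), svs, ys, alphas, bias, rbf_kernel)
```

Note that the cost of `decision` grows with the number of support vectors, one kernel evaluation per support vector, whichever kernel is plugged in.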

Nevertheless, it has been observed experimentally that, in many cases, kernel choice is not a determining factor in machine performance. For a real world problem whose internal distribution is not particularly suited to some kind of kernel, the support vector sets tend to be very similar, no matter what non-linear function is used. Of course the weights are fairly different, as the evaluating function is different too. But the result, the separating surface, tends to have a very similar geometrical shape, especially where data density is high. As was said in previous sections, the reason could be that those patterns that are important because they lie near other-class patterns continue to be important regardless of the mapping function, so they become support vectors.

Last, we define the kernel matrix as a symmetric square matrix of order M (where M is the number of training patterns), whose entry (i,j) is the kernel function value $K(x_i, x_j)$.

1.11 Global Solutions and Uniqueness

As has been shown in previous sections, the result of SVM training is a global solution of the optimisation process, i.e., the parameter set (values for w, b and $\alpha_i$) that gives the maximum of the objective function. This term is opposed to ‘local solution’, defined as a parameter set whose objective function value is optimal only when compared with its vicinity. In the SVM algorithm, any local solution is also a global solution because training is a convex quadratic problem.

Nevertheless, the global solution may not be unique. There could be more than one parameter set where the objective function takes the same, optimal value. This is not inconsistent with the definition of global solution. Solution uniqueness is guaranteed only if the problem is strictly convex. The SVM training definition ensures that the problem is convex, but it is the training data that make the problem strictly convex or not. Non-uniqueness occurs in two different ways:

• When the w and b values are not unique. In this case all w and b values between two solutions are also global solutions. This is easy to accept, as the problem is convex.

• When the w and b values are unique, but the w value comes from different sets of $\alpha_i$ values. Reaching one solution or the other depends on the randomness of the training algorithm.

Remember that there can be training data points that lie on the hyperplane but are not support vectors. Much as three points in a row define just one straight line, and discarding any one of the three gives the same result, it is easy to create a training set that generates different hard margin support vector sets depending on the listing order, although the separator hyperplane remains unchanged.

1.12 Generalization Performance Analysis

The Mercer condition tells us whether a kernel function defines a new Euclidean space or not, but it does not define how the mapping function must be applied or the morphology of the new space. For easy cases, the feature space dimension can be deduced. For instance, the p-degree homogeneous polynomial kernel has $\binom{d+p-1}{p}$ new features or dimensions. For a 4-degree polynomial kernel using 16 × 16 pixel images (256 initial features), the new space dimension is 183181376. In real world cases we will never have training sets that big. A classification machine with a huge ‘features over data’ ratio would undoubtedly produce overfitting.

Let us use an easier example: a 3-degree polynomial with 8 × 8 pixel data. The new space dimension is 45760. If you are using a simple multi-layer perceptron neural net, the ratio between the number of weights and the number of data points should not be greater than around 15%. Suppose you are generating a hidden layer with 45760 units (new features), 64 units in the input layer and one unit as output. The number of weights in the net is around 2974400 (almost 3 million). Therefore, the minimum training data set should have 19829333 patterns (almost 20 million). Now, that is an awfully big data set. Of course, not all 45760 new features are important. Many of them will have a null weight. But you cannot know at first which features will be needed and which ones will not. Some algorithms have been designed to prune the neural net while training, but even in this case the distinction between a useful feature and a disturbing feature is not easy to make.

A separator hyperplane in feature space H must have dim(H)+1 parameters. Any classification system needing so many parameters to create a discrimination function will be resource and time inefficient. Nevertheless SVM have good classification and generalization performance, in spite of treating data in an enormous space, which could even be infinite. The reason has not been formally demonstrated, although the maximum margin requirement has much to say about it.
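The figures quoted in this example can be reproduced with a few lines of integer arithmetic. The sketch below assumes Python 3.8 or later (for `math.comb`); the helper name `homogeneous_poly_dim` is an illustrative choice.

```python
from math import comb


def homogeneous_poly_dim(d, p):
    """Number of monomials of degree exactly p in d variables: C(d+p-1, p)."""
    return comb(d + p - 1, p)


dim_16x16 = homogeneous_poly_dim(256, 4)  # 16 x 16 pixel images, degree 4
dim_8x8 = homogeneous_poly_dim(64, 3)     # 8 x 8 pixel images, degree 3

# Weights of the hypothetical perceptron: 64 inputs -> dim_8x8 hidden -> 1 output.
weights = 64 * dim_8x8 + dim_8x8 * 1

# Patterns needed so that weights / patterns stays at about 15%,
# truncated to an integer to match the figure quoted in the text.
min_patterns = weights * 100 // 15
```

The values obtained are 183181376 and 45760 for the two feature space dimensions, 2974400 weights, and 19829333 training patterns.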
Within the SVM, the solution has at most l + 1 adjustable parameters, l being the number of training patterns. After training, the solution has N + 1 parameters, N being the number of support vectors, which is much less than the number of new features. In section 2.1 we left open a question about which classifier is better out of two possible choices. The answer is “the one having the lowest VC dimension”, which is the same as saying “the simplest”. It was shown that the bound on the risk is related to the VC dimension: the lower the VC dimension, the lower the risk bound. However, this does not tell you which one will have the lowest actual risk. There is no way to know it beforehand.

This approach is not only mathematically motivated; we could also support it with some philosophy. A 14th-century English philosopher, William of Ockham, stated the principle known as Ockham’s razor: given some evidence and two hypotheses, one simple and one complex, both satisfying the evidence, the simplest hypothesis is the most likely to be true. It does not say which one is true, but if you had to bet and you had no additional knowledge or evidence, you should go for the simple hypothesis. That is what learning is all about, be it machine or human: choose the one hypothesis which seems most probable with the current evidence. Whenever you make a new assumption (using an unnecessarily complex hypothesis) you are most probably farther from the truth.

That answers the big question: why is the generalization performance of support vector machines good even when using a high-dimensional feature space? Because SVM performance is not related to the dimension of the space where data is separated, but to the VC dimension of the classifier. Therefore, the SVM classifier depends on the simplicity of the data hypothesis, not on the number of available features. If a simple hypothesis can do the separating job, the SVM will use it, with no overfitting. There is no magic any more.

The SVM algorithm gives the simplest hypothesis, that is, the most probable one. But this does not mean there cannot be a better answer for a given problem. In spite of our hard SVM militancy we do not deny that SVM have been slightly outperformed (mostly by specific neural networks) in some experimental benchmarks. The answer is simple: luck. The SVM gave the most probable answer after one general-purpose execution. But the true internal distribution may have been slightly more complex, even though this did not show in the training data. If you are trying a neural network architecture with a bit more complexity, which way will you go? You cannot say unless you have additional information. The successful architecture engineer would most probably try all possible ways. That means trying hundreds of different architectures and finally using the one with the best error rate on the test set.
But that approach falls down in many places: first, the engineer must decide how much complexity the answer should have (not an easy task at all); second, the training set must deviate slightly from the internal distribution for the SVM to fall behind; third, if you generate many classifiers and use the test set to decide which one is better, then the test set is no longer a good validation set, because you are using it as a secondary training set (even though it is used as a validation set for publishing the results); last, the engineer spends a lot of time in the training phase. And in spite of all this extra work, in cases where SVM are outperformed, they are still very near the highest results in this scientific ranking. This means that in the real world it is difficult to find the SVM outperformed.

Support Vector Machines are not easy to implement, but they are very easy to use. Nevertheless, their use has some limits. As it is a statistical method, it does not suit symbolic learning too well. For instance, the parity problem with few data makes the SVM decide that all points are support vectors. This is a clear sign of bad generalization performance, because it means: “one point has no relation with any other point”. In those cases an SVM is no better than a simple Nearest Neighbour classification algorithm. Other Machine Learning paradigms, for instance C4.5, are able to work with input data in which some attributes have an unknown value (C4.5 uses the ‘?’ symbol). The algorithm identifies this value and treats the information accordingly. However, the SVM algorithm does not allow unknown values, which reduces its applicability to some data sets.

Inside the previously defined scope, SVM has a very light bias. It is a true general-purpose machine learning method. Although a priori information can be included inside the kernel function, the number of new features is so large that, regardless of the internal data distribution, there will always be a nearby hypothesis model using those new features. The training algorithm has embedded some sort of balance between using too few features (too simple a hypothesis) and using too many (overfitting).

The basic achievement in using SVM is that you just choose a generic kernel function (we won’t say “any kernel will do”, but it is not too far from the truth) and the confidence degree C (up until now, mostly heuristics are used, but you will soon find it is quite easy). Then you push the button, and after some time you will have the best classification machine. No need for an experienced engineer or scientist. No complicated architectures. No tailoring. No second thoughts. Child’s play.

2. SVM MATHEMATICAL APPLICATIONS

The initial mathematical development for SVM has been applied to different approaches within the scope of Machine Learning. All of them are based on the structural risk minimization principle, on the Lagrange formulation of the problem, and on the generalization to the non-linear case. For each approach you only need to define the requirements all points must satisfy, their effect on the objective function and the mathematical steps through the Lagrange formulations.

2.1 Pattern Recognition

The first approach to SVM was in the pattern recognition field. In fact, the search for a new statistical paradigm able to optimise the class separation problem was what drove V. Vapnik in his quadratic programming research. For that reason, the SVM definition developed in the previous sections and the implementation shown in the next sections apply specifically to pattern recognition. Nevertheless, most concepts also apply to the other approaches defined in this section.

2.2 Regression

Historically, the second application of SVM was non-linear regression and function estimation, called SVRM (Support Vector Regression Machines) [Vapnik et al, 1997]. This field can be divided into two parts: first, ‘function approximation’, which tries to find the curve that best fits training data acquired without noise (which makes it very similar to the usual interpolation methods); second, ‘function estimation’ (regression), where the data is noisy with an unknown distribution and the method tries to estimate unseen data points as simply as possible, including extrapolation. The SVRM algorithm treats both cases in a very similar way. For each case, the cost function can be slightly changed.

2.2.1 Definition

Suppose we have a training set with M data pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$, $i = 1, \dots, M$ (up until now, just the same as the pattern recognition case), and where $y_i \in \mathbb{R}$ is no longer a label but a real number which represents the value of the function we want to estimate at $x_i$, i.e. $y_i = f_r(x_i) + n_i$, $n_i$ being the noise associated with point i. We want to find a function f(x) having a maximum deviation of ε with respect to all training $y_i$. In the basic case, no training point may lie farther than ε from the expected value, so the resulting curve must fit all points. This case can be used only when the data describe a linear function with a noise level $|n_i| < \varepsilon \;\; \forall i$. The estimating function has the form:

$$f(x) = w \cdot x + b \qquad (23)$$

w being the vector defining the function in input space, and b the free term (the bias). As in the pattern recognition case, the structural risk minimization principle demands the greatest possible simplicity of the approximation function. We will try to minimize $\|w\|^2$, which gives us the flattest linear function among those satisfying the constraints (unlike the margin maximization definition in pattern recognition). Therefore, the optimisation problem is written as:

Minimize $\frac{1}{2}\|w\|^2$ subject to the constraints:

$$y_i - w \cdot x_i - b \le \varepsilon$$

$$w \cdot x_i + b - y_i \le \varepsilon \qquad (24)$$

Nevertheless, following the same reasoning as in section 2, this inflexible formulation is only valid when there is at least one solution satisfying conditions (24). Because this is usually an unrealistic case, with no noise in the data (it could be used for an interpolation approach), the soft margin idea must be introduced. We define positive slack variables $\xi_i$ that measure how far the expected value is from the true value for point i. Thus, we give the algorithm the ability to admit errors (points not satisfying the constraints) while keeping the ability to find a solution that represents the data distribution well enough. Likewise, a new cost function must be defined that balances the number of allowed errors against the simplicity (and usefulness) of the final estimating function. This cost function, c(x, y, f), must fulfil some properties, discussed in the next section.

To continue with the formulation development through this section we will use the ε-insensitive cost function [Vapnik et al, 1997], partly because it was the first one proposed, and partly because it is the simplest to interpret and optimise. This cost function is continuous but not differentiable, so the variables must be duplicated (the formulation gets longer but no more difficult). Now all α and ξ become $\alpha^{(*)}$ and $\xi^{(*)}$, where the one without asterisk is associated with the $y_i \ge f(x_i)$ case, and the one with asterisk is associated with the $y_i < f(x_i)$ case. Note that both cases cannot be true for any one point, so for all training points at least one of the duplicated variables will be zero. Thus, the objective function becomes:

$$L_P = \frac{1}{2}\|w\|^2 + C \, \frac{1}{M} \sum_{i=1}^{M} c(x_i, y_i, f) \qquad (25)$$

After the primal and dual formulation development (just like the pattern recognition case), the problem can be written as:

Maximise

$$L_D = -\frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \, x_i \cdot x_j - \varepsilon \sum_{i=1}^{M} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y_i (\alpha_i - \alpha_i^*)$$

subject to:

$$\sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$$

$$\alpha_i, \alpha_i^* \in [0, C] \qquad (26)$$

C having the same meaning as in section 2: a parameter balancing error permissibility. This development remains a convex quadratic optimisation problem, which has to satisfy the Karush-Kuhn-Tucker conditions at optimality. Therefore, the implementation methods defined for pattern recognition are applicable, although with some differences caused by the cost function. In the case of the ε-insensitive cost function, the duplicated Lagrange multipliers must be treated specifically. This will happen with all non-differentiable cost functions.

Again, support vectors are those training points whose Lagrange multipliers are not zero (in the case of duplication, this means one of the multipliers is non-zero). Moreover, those points having a non-zero slack variable are considered training errors, and have the corresponding multiplier set to the maximum, $\alpha_i^{(*)} = C$ (where the symbol (∗) means “either of the duplicated items, the applicable one”). Support vectors having a Lagrange multiplier not at bound, $0 < \alpha^{(*)} < C$, are placed on the margin (they are needed to define the margin) and have a zero slack variable. Basically, the concepts behind the support vectors, the weights and their geometrical meaning remain the same as in the pattern recognition case (see figure 14).

Figure 14. Linear regression machine. Support vectors are encircled, the margin/tube is shown with dashed lines and the estimated function is shown with a solid line

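The value of the bias b can be calculated from a non-bound support vector, i.e. a non-error support vector. The equalities to be used are those obtained from inequalities (24):

$$b = y_i - w \cdot s_i - \varepsilon \quad \text{if } \xi_i = 0 \text{ and } 0 < \alpha_i < C \qquad (27)$$

$$b = y_i - w \cdot s_i + \varepsilon \quad \text{if } \xi_i^* = 0 \text{ and } 0 < \alpha_i^* < C \qquad (28)$$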
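Equations (27) and (28) can be sketched as a small helper for the linear case. The function name `bias_from_free_sv` and the `starred` flag are hypothetical; the caller is assumed to know which of the two duplicated multipliers of the chosen support vector is the non-bound one.

```python
def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def bias_from_free_sv(w, s, y, eps, starred):
    """Bias b from one non-bound support vector s with target value y.

    starred=False: the multiplier without asterisk is non-bound, so the
                   point sits on the upper edge of the tube, equation (27).
    starred=True:  the multiplier with asterisk is non-bound, so the
                   point sits on the lower edge of the tube, equation (28).
    """
    if starred:
        return y - dot(w, s) + eps
    return y - dot(w, s) - eps


# One-dimensional illustration: f(x) = 2x + 1 with a tube of half-width 0.5.
w = [2.0]
eps = 0.5
b_upper = bias_from_free_sv(w, [3.0], 7.5, eps, starred=False)  # y = f(3) + eps
b_lower = bias_from_free_sv(w, [3.0], 6.5, eps, starred=True)   # y = f(3) - eps
```

Both support vectors recover the same bias, b = 1, as expected for points lying exactly on the two edges of the tube.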

In case all support vectors are errors (very unusual and, in any case, most probably a bad solution) the calculation of b is much more complex, and can be done during the optimisation itself.

Likewise, we can define a non-linear mapping from input space to a feature space, where the algorithm will try to find the flattest function approximating the data well enough. The ‘flat’ property can usually be seen in the corresponding non-linear curve in input space: its shape is the one having the smallest slope through the point set. The mapping concept and development are similar to those described in previous sections: a kernel function K(x,y) performs all operations implicitly in feature space, usually a space of much higher dimension (see figures 15a and 15b). Therefore, the non-linear problem is defined as:

Maximize

$$L_D = -\frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \, K(x_i, x_j) - \varepsilon \sum_{i=1}^{M} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y_i (\alpha_i - \alpha_i^*)$$

subject to:

$$\sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$$

Figure 15a. Non-linear regression machine. The dots follow the sinc(x) function, the dashed lines are the ε-tube, and the solid line is the SVRM function estimation. Note that support vectors are those corresponding to the 3 tangent points (1 in the middle, x = 0, and the other two at the limits)

Figure 15b. Non-linear regression machine. The dots follow the same sinc(x) function, and the other elements follow the figure 15a notation. Note that as the ε-tube decreases, the function estimation gets more accurate. At the limit, if the noise allowance ε approaches 0, the function estimation error will also be 0 in this example

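$$\alpha_i, \alpha_i^* \in [0, C] \qquad (29)$$

and the support vector expansion of w and the estimated function are written:

$$w = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \, \Phi(s_i) \qquad (30)$$

$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \, K(s_i, x) + b \qquad (31)$$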

$s_i$ being the resulting N support vectors, and M being the size of the complete training set. It has been observed, through a number of experiments [Osuna and Girosi, 1998], that SVRM tend to use a relatively low number of support vectors, compared to other similar machine learning processes. The reason could be the flexibility allowed while errors are below a threshold, which generates simpler surfaces and thus needs fewer support vectors to define them. Moreover, it has been shown that the algorithm works well when a non-linear kernel is applied in spite of having few training data. Other well-known methods will easily overfit the data, while the SVRM dynamically controls its generalization ability, generating a hypothesis simple enough to model the training data distribution better.

2.2.2 Cost Functions and ν-SVRM

The cost function is one of the key elements in SVRM. As was said in the previous section, real data is usually acquired with a certain noise level of unknown distribution. The cost function is in charge of accepting noise deviations and penalizing wide deviations, whether they are caused by noise or by a current hypothesis that is too simple. The point is how to tell noise apart from hypothesis complexity.

Nevertheless, this function must satisfy certain properties. For the problem to remain usefully solvable, the cost function must be convex, thus maintaining problem convexity and ensuring solution existence, uniqueness and globality. Moreover, for the mathematical development to remain simple, it is required to be symmetric and to have at most two discontinuities in the first derivative, at ±ε, with ε ≥ 0. Therefore, even if we knew the noise distribution, it would be too complex to introduce that additional information into the algorithm. We would have to find a convex cost function that adjusts to the noise distribution, and we would still be using an approximation. Not to mention the mathematical development for the new cost function, notably difficult for non-expert mathematicians. The conclusion is: just use a general-purpose cost function and let the SVRM automatic learning do the engineering job.

The development described in the previous subsection refers to the ε-insensitive cost function, which is the most commonly used, and is defined as:

$$c(\xi) = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{if } |\xi| > \varepsilon \end{cases} \qquad (32)$$
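A minimal sketch of cost function (32), together with the average cost of an estimator over a training set, might look as follows. The names `eps_insensitive` and `empirical_cost` are illustrative, and the argument called `residual` plays the role of the deviation between a target value and the estimate.

```python
def eps_insensitive(residual, eps):
    """Equation (32): zero inside the tube, linear growth outside it."""
    return max(0.0, abs(residual) - eps)


def empirical_cost(xs, ys, f, eps):
    """Average eps-insensitive cost of estimator f over the training pairs."""
    return sum(eps_insensitive(y - f(x), eps) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: two points inside the tube, one well outside it.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.2, 4.0]
cost = empirical_cost(xs, ys, lambda x: x, eps=0.5)
```

Writing the function as a single `max` makes its two properties visible: it is symmetric in the sign of the deviation, and its derivative is discontinuous only at plus and minus eps.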

This kind of function has an additional parameter ε, which helps to adjust the maximum allowable deviation for any given point. A validation process is required to adjust this parameter, even though its value can be approximated from any additional knowledge about the noise or data distributions.

To finish the SVRM section, we will summarize a variation of the ε-SVRM (using the ε-insensitive cost function), called ν-SVRM [Schölkopf et al, 1998]. The difference lies not in the cost function itself (which remains the ε-insensitive one), but in the objective function. The ε-SVRM gave the objective function as:

$$L_P = \frac{1}{2}\|w\|^2 + C \, \frac{1}{M} \sum_{i=1}^{M} (\xi_i + \xi_i^*) \qquad (33)$$

and now, in ν-SVRM, the objective function is:

$$L_P = \frac{1}{2}\|w\|^2 + C \left( \nu\varepsilon + \frac{1}{M} \sum_{i=1}^{M} (\xi_i + \xi_i^*) \right) \qquad (34)$$

subject to the same constraints as in (24). The resulting dual formulation of the problem is:

Maximize

$$L_D = \sum_{i=1}^{M} y_i (\alpha_i - \alpha_i^*) - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \, K(x_i, x_j)$$

subject to:

$$\sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$$

$$\sum_{i=1}^{M} (\alpha_i + \alpha_i^*) \le C\nu$$

$$\alpha_i, \alpha_i^* \in \left[0, \frac{C}{M}\right] \qquad (35)$$

and leaving the estimating function in the same form as in (31). The values of b and ε can be calculated after training using constraints (24) for non-bound support vectors. If the value of ε increases, then the first term of the cost in (34), νε, will increase proportionally, while the second term will decrease as some points benefit from the softer constraints and fall inside the bound (it also decreases in proportion to the number of these newly favoured points). For the objective function to attain the optimum, the value of ε must increase until the fraction of error points (out of bounds) is less than or equal to the value of ν. Therefore the new parameter ν is an upper limit on the fraction of training errors (which is related to the number of support vectors). Obviously it must satisfy $\nu \in [0, 1]$. It seems easier to pick a good ν value than a good ε value. Moreover, ν-SVRM is a superset of ε-SVRM: after training with the first method we can calculate the ε parameter value, which can then be used in an ε-SVRM algorithm giving exactly the same solution obtained in the first place.

2.3 Principal Component Analysis

Support Vector Machines (regression included) and non-linear Principal Component Analysis (PCA) were the first applications developed under the idea of mapping to a high-dimensional space using Mercer kernels in Machine Learning. They differ in the problem to solve even though they use similar means. SVM is a supervised algorithm, i.e. the system state changes depending on whether the output for a given pattern is equal to the expected correct value or not. On the other hand, kernel PCA is an unsupervised algorithm, i.e. there are no labels, and the output is a covariance analysis of the training data distribution [Schölkopf et al, 1998].

PCA is an efficient method to extract structure from the input data, and can be carried out by calculating the eigenvalues and eigenvectors of the system. Formally speaking, PCA is a change of basis in input space that diagonalizes the estimate of the covariance matrix of the normalized input data, which has the form:

$$C = \frac{1}{M} \sum_{i=1}^{M} x_i x_i^T \qquad (36)$$

where M is the number of patterns $x_i$. The new coordinates described by taking the eigenvectors as basis, i.e. the orthogonal projections of the data vectors onto the eigenvectors, are called principal components. Eigenvalues λ and eigenvectors V must be non-zero and satisfy λV = CV. We introduce the usual non-linearity concept, with the mapping function Φ and its corresponding kernel. We assume there exist coefficients $\alpha_1, \dots, \alpha_M$ such that:

$$V = \sum_{i=1}^{M} \alpha_i \, \Phi(x_i) \qquad (37)$$

and the corresponding kernel matrix K (as defined in previous sections). Then we arrive at the problem:

$$M \lambda \alpha = K \alpha \qquad (38)$$

λ being the eigenvalues and $\alpha = (\alpha_1, \dots, \alpha_M)$ the eigenvector coefficients. To extract the principal components for a given pattern, the data projections in feature space are calculated in the following form:

$$V^k \cdot \Phi(x) = \sum_{i=1}^{M} \alpha_i^k \, \Phi(x_i) \cdot \Phi(x) = \sum_{i=1}^{M} \alpha_i^k \, K(x_i, x) \qquad (39)$$

In this notation, k is a superscript denoting the k-th eigenvector and its k-th set of coefficients. Note that after the previous calculation process, k non-zero eigenvalues and eigenvectors are obtained, each one of them with a set of M coefficients. To implement the kernel PCA algorithm, the following steps must be taken:

1. The kernel matrix must be calculated, of size M × M:

$$K_{ij} = k(x_i, x_j) \quad \text{for all } i, j \in \{1, \dots, M\}$$

Here comes the first problem when using kernel PCA. The resources needed for any matrix calculation grow at least with the square of its size, so with current algorithms and hardware no more than 5000 data points should be used. If you are provided with more data for training (which should never be seen as an unfortunate case), a representative subset must be created heuristically.

2. Diagonalize matrix K to calculate the eigenvalues and eigenvectors from equation (38), using traditional methods, and normalize the eigenvectors. After this calculation we obtain the coefficients $\alpha^k = (\alpha_1^k, \dots, \alpha_M^k)$, to be used in the projection phase.

3. To extract the non-linear principal components of a given pattern, the projections of the point onto the eigenvectors must be calculated using equation (39).

The number of principal components (non-zero eigenvectors) to be used is the designer’s choice. But not all of them should be used: otherwise the process would be useless. Just choose the first k principal components, those with a significant amount of information and very little noise.

After this simple process, you get a change of data space. From input space we move to a k-dimensional space (k being a fraction of M), in which each dimension gives a useful feature taken from non-linear correlations in the training data set. That is a conceptual difference between SVM training and kernel PCA training: in the first case new features are implicitly generated, and many are dropped after training; in the second case k new features are explicitly generated, all of them with a lot of information, ordered from most important downwards. The value k has an upper bound of M, the number of training patterns. New features are made explicit, so, obviously, k must not be too high or the computation would be inefficient. So only the first components should be used, those having the greatest variance, i.e. the largest eigenvalues, i.e. the most discriminant information.
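The three steps can be sketched with the standard library alone for the first principal component. This is an illustration under several assumptions that are not in the text: power iteration stands in for the “traditional methods” of diagonalization, only the leading eigenvector is computed, a Gaussian kernel with σ = 1 is used, and centring of the data in feature space is omitted. All function names and the sample data are hypothetical.

```python
import math


def rbf(x, y, sigma=1.0):
    """Gaussian kernel, used here only as an example of a Mercer kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matrix(data, kernel):
    """Step 1: M x M matrix with K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in data] for xi in data]


def leading_eigenpair(K, iterations=500):
    """Step 2 (simplified): power iteration for the largest eigenvalue of K."""
    m = len(K)
    u = [1.0 / math.sqrt(m)] * m
    mu = 0.0
    for _ in range(iterations):
        v = [sum(K[i][j] * u[j] for j in range(m)) for i in range(m)]
        mu = math.sqrt(sum(c * c for c in v))  # converges to the eigenvalue
        u = [c / mu for c in v]                # unit-length eigenvector
    return mu, u


def first_component(data, kernel):
    """Coefficients alpha of the first eigenvector V, scaled so V . V = 1."""
    K = kernel_matrix(data, kernel)
    mu, u = leading_eigenpair(K)
    # V . V = alpha^T K alpha = mu * |alpha|^2, hence alpha = u / sqrt(mu).
    return [c / math.sqrt(mu) for c in u], mu


def project(x, data, alpha, kernel):
    """Step 3, equation (39): sum_i alpha_i * K(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, data))


# Four hypothetical training patterns forming two small groups.
data = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9)]
alpha, mu = first_component(data, rbf)
z = project((0.1, 0.0), data, alpha, rbf)  # first non-linear feature of a new point
```

In the notation of equation (38), the value `mu` returned here is the eigenvalue of K itself, that is, Mλ. A full implementation would extract the first k eigenvectors with a proper eigenvalue routine and would normally centre the kernel matrix first.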
The usefulness of non-linear PCA in pattern recognition has been tested thoroughly, attaining classification performances as good as the best non-linear SVM and well above neural networks. The process is very simple: first calculate the projection coefficients and select the best ones; then transform all patterns (training, validation and test) explicitly into the new space; afterwards use these data to train a linear or non-linear classification machine (SVM, neural networks, decision trees, …; any will do), which will be the true supervised classification process. When using kernel PCA for classification, a linear SVM is usually used for supervised training, giving enough flexibility to solve any non-linear problem. The described process is very much like a one-hidden-layer neural network in which the architecture and the first-layer weights are obtained by optimised means: the eigenvectors of the variance matrix. Also, because the new features are calculated explicitly, a multiclass SVM can be trained easily: the first layer would be common (as in neural networks), and the second layer (linear discriminant) can be calculated using the hyperplane vector w (which can now be calculated because the feature space is no longer implicit), giving O(1) complexity.

3. SVM VERSUS NEURAL NETWORKS

Neural networks (NN) have led the Machine Learning field since the 1980s thanks to their simplicity of development and interpretation, together with a very competitive generalization ability. Nevertheless, 20 years later, design and development complexity has increased considerably in the attempt to solve secondary problems such as convergence speed, new error calculation concepts, new activation functions with additional constraints, local minimum avoidance, and so on. So many years of active research have turned NN from their initial simplicity to their current complexity, fit only for specialised engineers. Probably, as time goes by, SVM will follow a similar path, from current simplicity to some degree of complexity, needing a human expert to bring out their full potential. Complexity by itself is not bad: higher method complexity usually leads to better performance or classification rates. But basic research on NN is currently scarce. For the most part, it concerns new applications where NN give better results for specific architectures, so a qualitative jump is needed: SVM. This is quite a natural step in human research, and many examples can be given. When a technology reaches its limit, a new approach must be put forward. At first, the performance of both methods may be similar, but the new one will eventually outperform the old. We believe we are currently at the beginning of a technology jump, so it is a good time to change sides.

Throughout this chapter, the relation between SVM and NN has been widely established. It can be stated that the topology of an SVM can be developed as a one-hidden-layer perceptron. It has been demonstrated in the NN literature that the family of one-hidden-layer perceptrons can act as a universal discriminator, i.e. it can approximate any function. For the sigmoidal activation function, the similarity between NN and SVM with kernel (22) is complete (see figure 16).
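The equivalence between a trained SVM and a one-hidden-layer perceptron can be made concrete with a minimal sketch. Everything named here is an assumption made for illustration: `sigmoid_kernel` stands in for kernel (22), whose exact form is not reproduced in this section, and `out_weights` plays the role of the hidden-to-output weights obtained from the optimisation; the support vectors and weights in the usage lines are made-up values, not the result of any training.

```python
import numpy as np

def sigmoid_kernel(u, v, kappa=1.0, delta=0.0):
    """Hypothetical sigmoidal kernel: tanh(kappa * <u, v> - delta)."""
    return np.tanh(kappa * np.dot(u, v) - delta)

def svm_network_output(x, support_vectors, out_weights, bias, kernel=sigmoid_kernel):
    # Hidden layer: one unit per support vector, each evaluating the kernel
    # between that support vector and the input pattern x.
    hidden = np.array([kernel(sv, x) for sv in support_vectors])
    # Output layer: a single linear unit followed by a sign (binary classifier).
    return np.sign(np.dot(out_weights, hidden) + bias)

# Two made-up support vectors in a 2-dimensional input space.
svs = np.array([[1.0, 0.0], [-1.0, 0.0]])
w = np.array([1.0, -1.0])
print(svm_network_output(np.array([2.0, 0.0]), svs, w, 0.0))    # 1.0
print(svm_network_output(np.array([-2.0, 0.0]), svs, w, 0.0))   # -1.0
```

Swapping `kernel` for a Gaussian function would turn the same structure into an RBF-style network, which is why only the kernel distinguishes the resulting topologies.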
Figure 16. A neural network approach for the SVM implicit architecture. Note that the layers are completely connected (although not explicitly shown for figure clarity). Also, all weights are equal to 1 except the ones connecting the hidden layer and the output layer which are equal to the corresponding support vector’s

After the SVM optimisation process, we obtain a network having d units in the first layer (input space dimension), N units in the hidden layer (number of support vectors), and one unit in the output layer (binary classifier), with weights connecting the last two layers. For other kernels the similarity is somewhat weaker, although the resulting topology is still that of a (d, N, 1) one-hidden-layer perceptron. Only the kernel function makes the difference. On the other hand, the SVM with RBF kernel is like an RBF classification network in which the clusters and their characteristics have been calculated using an automatic optimal algorithm.

When a similar kernel and activation function are used, an important difference can be observed. SVM tend to have a larger number of support vectors (hidden-layer units) when facing a complex or noisy training set. Neural networks can attain similar classification performance with far fewer internal units. This is essential to the speed of the test phase, because it depends directly on the number of elements in the hidden layer. The number of multiply-add operations done in the test phase of either method is $N(d + 1)$. The reason for this difference is mainly that the support vectors defining the hidden layer are constrained to be training points. Neural networks have no such constraint, so they need fewer elements to model the same function (they have more degrees of freedom). This does not mean that the NN solution is better; it is quicker in the test phase and its topology is less complex, but generalization performance is not affected. Moreover, note that the errors allowed in the training phase (including those points lying inside the margin) become support vectors. When optimising complex or noisy training sets with loose error penalization, the number of training errors can be very large.

But this problem was also solved during the first steps of SVM research. In [Burges, 1996] the “reduced support vector set” method is described. Given a trained SVM, this method creates a smaller support vector set representing approximately the same information as the whole support vector set.
But in this case the former constraint is eliminated, because the new virtual support vectors need not be training points. The result is very much like the NN topology. This new expansion solves the classification speed problem, making the SVM competitive with other Machine Learning methods. Nevertheless, it is seldom used because of its considerable development difficulty.

Even more similarity can be found between NN and SVM classifiers that use kernel PCA as the feature extraction step. Units in the hidden layer are calculated explicitly using the eigenvector projection instead of the kernel calculation. These units are not significant training points but true features, all of which capture the concepts underlying the internal data distribution. Thus, the classifier topology should be very similar to the one generated by an experienced NN architect, because these features have strong statistical meaning.

The only flexibility that a NN offers and the SVM cannot reach is the multiple-hidden-layer approach (kernel PCA plus a non-linear SVM could provide up to 2 hidden layers, but this is seldom used). Although a one-hidden-layer topology is a universal discriminator, having more hidden layers can make the training process much more efficient. With that capability, the net may need fewer units, or convergence may be faster. But the training algorithms grow more complex, and the overfitting and local minimum problems will still be there.

Therefore, the main differences between both methods are:
• Training one SVM requires much more computation resources than training one NN.
• Classification speed is usually slower in SVM.
• The SVM result is the optimum, while NN can get stuck in local minima. Therefore SVM usually outperforms NN in classification performance.
• SVM parameters are few and easy to use, while NN requires an experienced engineer to create and try the right architecture.
• SVM usually needs only one execution to give the best results, while NN usually requires many tries to give its best.

Outside the scientific community, money rules. Expert engineer time is much more expensive than computing resources, and the difference will keep growing. If Machine Learning algorithms are to be introduced massively in commercial products such as knowledge management or data mining, automatic methods must be used. In the real world, new data is always coming; new profiles arise while others are no longer valid. Neural network flexibility must be tailored by an expert to fit the current state. But, for a company, it may not be worth the cost of tailoring a Machine Learning system that will become obsolete within some months. It is unavoidable: craftsmen will eventually be replaced by machines.

4. SVM OPTIMISATION METHODS

4.1 Optimisation Methods Overview

SVM development tries to solve the problem described in (14): maximize $L_D$ with respect to the Lagrange multipliers, subject to constraints (15) and (16). When SVM appeared, the first approach to this problem was to use standard optimisation methods, such as gradient descent or quasi-Newton. These methods, long established in the mathematical literature, mainly apply complex operators over the Hessian matrix (the matrix of partial derivatives). These one-step processes are intensive in both computation and memory. Memory requirements for the matrices are $O(M^2)$, while computational requirements are $O(M^3)$, M being the number of patterns in the training set. For instance, a 5000-point set will require 100 Mbytes of memory using single-precision floating-point numbers ($5000^2$ entries of 4 bytes each). Any process over such an enormous data structure will be very inefficient, beyond the ability of many machines.

The main line of research in the first years of life of SVM was the search for alternative optimisation methods, developed explicitly for SVM. Many new approaches were published before one of them satisfied all researchers with its simplicity and efficiency. The main methods, in chronological order of appearance, are the following:


• The chunking method, developed by Vapnik. Points that are not support vectors do not affect the Hessian matrix calculation; therefore, if we remove them before the matrix calculation begins, the resulting Lagrange multipliers remain the same. At the same time, the matrix calculation itself becomes easier: its complexity is now $O(N^3)$, N being the number of support vectors, and N 1 r r

where f is the fraction of the image occupied by black pixels and L is the length of the image.

PRE-PROCESSING TOOL FOR COMPUTATIONAL INTELLIGENCE APPLICATION


Case q = 1. The smallest value of the entropy corresponds to the case in which each grid block covering the pore phase is entirely filled by pore phase. The largest value corresponds to a uniform distribution. The lower and upper bounds are then as follows:

(14) $2\ln\frac{L}{r} + \ln f < -\sum_{i=1}^{n(r)} \mu_i \ln \mu_i < 2\ln\frac{L}{r}, \qquad q = 1$

Case 0 ≤ q < 1. The smallest value χ(q, r) can take corresponds to the case in which each grid block covering the pore phase is entirely filled by pore. The largest value corresponds to a uniform distribution of pore phase over the image. The lower and upper bounds are as follows:

(15) $\left(\frac{L}{r}\right)^{2(1-q)} f^{1-q} < \chi(q, r) < \left(\frac{L}{r}\right)^{2(1-q)}, \qquad 0 \le q < 1$

Case q < 0. The smallest value χ(q, r) can take corresponds to the case in which each grid block covering the pore phase is entirely filled by pore. The function is monotonically decreasing. Therefore, the value corresponding to r = 1 pixel can be selected as an upper bound.

(16) $\left(\frac{L}{r}\right)^{2(1-q)} f^{1-q} < \chi(q, r) < L^{2(1-q)} f^{1-q}, \qquad q < 0$
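The bounds for q < 1 can be checked numerically on any square binary image. The sketch below is an illustration under assumptions, not the authors' code: the function names are hypothetical, the measure of a box is taken as its share of the total pore pixels, the sums run only over boxes that contain pore, and the box size r is assumed to divide the image length L.

```python
import numpy as np

def box_measures(img, r):
    """Share of the pore pixels falling in each non-empty r x r box."""
    L = img.shape[0]
    counts = img.reshape(L // r, r, L // r, r).sum(axis=(1, 3)).ravel()
    counts = counts[counts > 0]
    return counts / counts.sum()

def chi(img, q, r):
    """Partition function: sum of mu_i**q over the non-empty boxes."""
    return float(np.sum(box_measures(img, r) ** q))

def entropy(img, r):
    mu = box_measures(img, r)
    return float(-np.sum(mu * np.log(mu)))

def chi_bounds(L, f, q, r):
    """(lower, upper) bounds for 0 <= q < 1 and for q < 0."""
    filled = (L / r) ** (2 * (1 - q)) * f ** (1 - q)   # every covering box full
    if 0 <= q < 1:
        return filled, (L / r) ** (2 * (1 - q))        # uniform distribution
    if q < 0:
        return filled, L ** (2 * (1 - q)) * f ** (1 - q)  # value at r = 1 pixel
    raise ValueError("this sketch only covers q < 1")

def entropy_bounds(L, f, r):
    """(lower, upper) bounds for the q = 1 case."""
    upper = 2 * np.log(L / r)
    return upper + np.log(f), upper

rng = np.random.default_rng(1)
img = rng.random((64, 64)) < 0.2          # synthetic image, about 20% pore
L, f = img.shape[0], img.mean()
for q in (-1.0, 0.0, 0.5):
    print(q, chi_bounds(L, f, q, 8), chi(img, q, 8))
```

Because the bounds depend only on f, L and r, a random image such as this one falls between them just as a structured soil image would.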

Having defined these bounds, we now seek to examine their significance in terms of extracting generalized dimensions from image data. For q > 1 and for 0 ≤ q < 1, the bounding functions, when plotted on the log-log plot used to extract the dimension, yield two parallel lines with a vertical separation of $(1 - q)\ln f$. For q = 1, the bounding functions, when included in the plot of entropy against $\ln r$, again yield two parallel lines of slope 2 with a separation of $\ln f$. Thus, in these cases we reach the same impasse as with the fractal analysis: depending on f, and independently of the actual geometry considered, the data can be so constrained as to yield convincing straight-line fits with associated derived dimensions.

3.2 Gliding Box Method

The gliding-box method was originally used for lacunarity analysis (Allain and Cloitre, 1991). Later, it was modified by Cheng (1997a, 1997b) for estimating τ(q) as follows:

(17) $\langle\tau(q)\rangle + D = -\dfrac{\log\langle M^q(r)\rangle}{\log(r/r_{\min})}$

where D is the dimension of the Euclidean space in which the image is embedded (in this case D = 2) and M represents the multiplier measured on each pixel as:

(18) $M^q(r) = \left[\dfrac{\mu(r_{\min})}{\mu(r)}\right]^q$

For further details see Grau et al. (2006). The advantage of using Equation (17) in comparison with Equation (9) is that the estimation is independent of box size r, which allows the use of only two successive box sizes to estimate τ(q). Equation (18) imposes that rmin should not be null. Once this estimation is done, Equation (8) can be applied to estimate $D_q$. For the case q = 1, the following relationship is applied, based on the work of Saucier and Muller (1999):

(19) $\hat{D}_1 = 2D_2 - D_3$

4. IMAGES FOR THE CASE STUDY

Three soil samples were selected to represent a range of void pattern distributions in soils and a wide range of porosity values, from 5% to 47%. Each sample was prepared for image analysis following the procedure described by Protz and VandenBygaart (1998). The data were obtained by imaging thin sections with a Kodak 460 RGB camera using transmitted and circularly polarized illumination. The data were cropped from 3060 × 2036 pixels to 3000 × 2000 pixels. EASI/PACE software was then used to classify the data and separate the void bitmap; each individual pixel measured 186 × 186 microns. The images of these soils are shown in Figure 3. To avoid any interference of the edge effect in the calculations using the box-counting method, an area of 1024 × 1024 pixels in the upper left corner of the original images was selected.

5. RESULTS OF THE CASE STUDY AND DISCUSSION

5.1 Generating Function with the Box-counting Method

For the three binary images, χ(q, r) was calculated and a bi-log plot of χ(q, r) versus r was made to observe its behavior. All plots showed a clear pattern in the data. In Figure 2, for example, at negative q there were two distinct regions, one with a linear relationship between log r and log χ(q, r) and another where log χ(q, r) was almost constant with respect to log r. The box size at which the behavior changes is around 64 pixels for the three images. These two phases were not evident at positive q values (see Figure 4). The existence of a plateau phase of log χ(q, r) can be explained by the nature of the measure under consideration. At r values close to 1, the variation in the number of black pixels is based on a few pixels, the simplest case being r = 1, where the measure can only take the value 0 or 1. Thus, for small boxes of size r the proportions among their values are mainly constant. However, once the box size exceeds a certain value, a scaling pattern begins.

Figure 3. Soil binary images (Sample A, Sample B, Sample C), pore phase in black pixels, of: (a) ADS, (b) BUSO and (c) EHV1. Each image has 5.65%, 19.17% and 46.67% of porosity, respectively

Figure 4. Bi-log plot of χ(q, r) versus box size r at different mass exponent q: A) ADS; B) BUSO; C) EVH1. [Panels A to C: Log χ(q, r) versus Log(r), one curve per q from –10 to 10.]

5.2 Generalized Dimensions Using the Box-counting Method

If all of the regression points are considered, the $D_q$ values obtained, mainly for q < 0, are quite different from those obtained if only the regression points in the linear region are chosen (Figure 5). Between the two criteria any $D_q$ can be obtained, although for q ≥ 0 the differences are not significant. Many authors have pointed out this fact since the first applications of multifractal analysis to experimental results (Tarquis et al., 2005). The changes in $D_q$, very noticeable in this case, make impossible any comparison or calculation of the amplitude of the dimensions, $D_{-10} - D_{+10}$, as has been used in several works. The differences among the $D_q$ curves (Figure 5, filled circles) are mainly found in the negative part. In particular, comparing ADS (Figure 5A, filled circles) with the rest, it is evident that it does not show multifractal behavior. All the $D_0$ values obtained are equal to 2 (the dimension of the plane). This overestimation is due to the fact that the studied range was selected to give an optimum fit for all q values. However, looking at the lower and upper bounds of the box-counting plots for q = 0 (Figure 6), it is quite clear that, regardless of the structure in the image, a linear fit with a high $r^2$ will be obtained. The standard errors (data not shown) of the $D_q$ obtained in the linear phase are minimal and the $r^2$ of the regression analysis is very high. However, this is not surprising considering that only three points are being used. In addition, the number of boxes of each size is very low: in an image of 1024 × 1024 pixels, which is considered a representative elementary area (VandenBygaart and Protz, 1999), there are 64 boxes of size 128 × 128 pixels and 16 boxes of size 256 × 256 pixels. This size restriction is avoided by using the gliding box method, whose results are discussed in the next section.
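The dependence of $D_q$ on the chosen regression points can be explored with a short sketch of the box-counting estimate. It is a hedged illustration only: the function names are hypothetical, the partition function is assumed to scale as a power of r whose exponent is the mass exponent τ(q), and the relation $D_q = \tau(q)/(q-1)$ for q ≠ 1 is assumed to be the one the text refers to as Equation (8), which is not reproduced in this section.

```python
import numpy as np

def n_boxes(L, r):
    """Number of non-overlapping r x r boxes covering an L x L image."""
    return (L // r) ** 2

def chi(img, q, r):
    """Box-counting partition function over the non-empty boxes."""
    L = img.shape[0]
    counts = img.reshape(L // r, r, L // r, r).sum(axis=(1, 3)).ravel()
    mu = counts[counts > 0] / counts.sum()
    return float(np.sum(mu ** q))

def generalized_dimension(img, q, sizes):
    """Fit log chi(q, r) against log r; the slope is taken as tau(q)."""
    log_r = np.log(sizes)
    log_chi = np.log([chi(img, q, r) for r in sizes])
    tau, _intercept = np.polyfit(log_r, log_chi, 1)
    return tau / (q - 1)                 # assumed relation, valid for q != 1

print(n_boxes(1024, 128), n_boxes(1024, 256))   # 64 16

# A completely filled image behaves as a plane whatever points are selected.
full = np.ones((64, 64), dtype=bool)
print(round(generalized_dimension(full, 2.0, [2, 4, 8, 16]), 6))   # 2.0
```

Calling `generalized_dimension` with all box sizes and then with only the largest three reproduces, for a given image, the two criteria compared above.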

5.3 Generalized Dimensions Using the Gliding Box Method

For the three binary images, $\langle M^q(r)\rangle$ was calculated and a bi-log plot of $\langle M^q(r)\rangle$ versus $r/r_{\min}$ was made. All plots showed a linear relationship, as expected, with a substantial number of points available to fit a linear regression and estimate $D_q$ from the slope of the line (Figure 7). In the case of EHV1 for q < −6 (Figure 7C), the linear relationship is not as clear as in the rest of the images. Finally, the $D_q$ values obtained by both methods are compared in Figure 8. In all of the graphs, $D_q$ appears again with a value of 2 imposed by the gliding-box method, as explained in section 3.2. For ADS (Figure 8A) both curves are similar. The range of $D_q$ values on the vertical axis has been changed on purpose, to show that the scale of the plot could lead to an error in our conclusions, when it was evident in Figure 5 that $D_q$ was almost constant.
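A minimal sketch of the gliding-box estimate used in this section is given below. Several points are assumptions made for illustration rather than details confirmed by the text: the multiplier is taken as the ratio of the pore mass in a box of size $r_{\min}$ to the mass in a box of size r sharing the same upper-left corner, positions where the small box is empty are skipped, boxes glide one pixel at a time, and $D_q = \tau(q)/(q-1)$ is assumed for q ≠ 1. All function names are hypothetical.

```python
import numpy as np

def box_sums(img, r):
    """Pore mass of every r x r box, gliding one pixel at a time."""
    s = np.pad(img.astype(float), ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    return s[r:, r:] - s[:-r, r:] - s[r:, :-r] + s[:-r, :-r]

def tau_gliding(img, q, r, r_min=1, D=2):
    """Mass exponent from <tau(q)> + D = -log<M^q(r)> / log(r / r_min)."""
    big = box_sums(img, r)
    small = box_sums(img, r_min)[:big.shape[0], :big.shape[1]]
    ok = small > 0                       # the small-box mass must not be null
    M = small[ok] / big[ok]              # assumed form of the multiplier
    return -np.log(np.mean(M ** q)) / np.log(r / r_min) - D

def dq_gliding(img, q, r, r_min=1):
    return tau_gliding(img, q, r, r_min) / (q - 1)   # assumed, q != 1

# A completely filled image should give D_q = 2 for any q.
full = np.ones((32, 32), dtype=bool)
print(round(dq_gliding(full, 2.0, r=4, r_min=2), 6))    # 2.0
print(round(dq_gliding(full, -3.0, r=4, r_min=2), 6))   # 2.0
```

Because the boxes overlap, an L × L image yields (L − r + 1)² boxes of size r instead of (L/r)², which is the source of the larger number of regression points mentioned above.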

Figure 5. Generalized dimensions (Dq) from q = −10 to q = +10 for all points of the regression line (filled square) and for the three selected points based on bi-log plot of χ(q, r) (filled circles) of each image: A) ADS; B) BUSO and C) EVH1. [Panels A to C: Dq versus q.]

The differences between both methods for BUSO and EVH1 (Figures 8B and 8C, respectively) are larger at negative q values, although at positive values $D_q$ shows a stronger decay (Grau et al., 2006).

6. CONCLUSIONS

In recent years, the concepts of fractals and multifractals have been increasingly applied to the analysis of porous materials, including soils, and to the development of fractal models of porous media. In terms of modeling, it is important to characterize the multiscale heterogeneity of soil structure in a useful way, but the blind application of these analyses does not achieve this.

Figure 6. Box counting plots for EHV1 soil images, q = 0, with upper and lower bounds (a) solid phase (b) pore phase. (From Bird et al., J. of Hydrol., 322, 211, 2006. With permission) [Panels (a) and (b): log N versus log r.]

Figure 7. Bi-log plot of ⟨M(r, q)⟩ versus box size rate r/rmin at different mass exponent (q): A) ADS; B) BUSO; C) EVH1. [Panels A to C: Log(⟨M(r, q)⟩) versus Log(r/rmin), one curve per q from –10 to 10.]

Figure 8. Generalized dimensions (Dq) from q = −10 to q = +10 based on the box-gliding method (empty square) and based on the box-counting method (filled circles) using the same box sizes range: A) ADS; B) BUSO; C) EVH1. [Panels A to C: Dq versus q.]

The results obtained by the “box-counting” and “gliding-box” methods for multifractal modeling of soil pore images show that “gliding-box” provides more consistent results, as it generates a greater number of large boxes than the box-counting method and avoids the restriction that the box-counting method imposes on the partition function.

7. ACKNOWLEDGEMENTS

We thank Dr. Richard Heck of Guelph University for the soil images. We are very indebted to Dr. N. Bird, Dr. Q. Cheng and Dr. D. Gimenez for helpful discussions. This work was supported by Technical University of Madrid (UPM) and Madrid Autonomous Community (CAM), Project No. M050020163.

REFERENCES

Aharony, A., 1990, Multifractals in physics – successes, dangers and challenges, Physica A. 168: 479–489.
Ahammer, H., De Vaney, T.T.J. and Tritthart, H.A., 2003, How much resolution is enough? Influence of downscaling the pixel resolution of digital images on the generalised dimensions, Physica D. 181 (3–4):147–156.
Allain, C. and Cloitre, M., 1991, Characterizing the lacunarity of random and deterministic fractal sets, Physical Review A. 44:3552–3558.
Anderson, A.N., McBratney, A.B. and FitzPatrick, E.A., 1996, Soil Mass, Surface, and Spectral Fractal Dimensions Estimated from Thin Section Photographs, Soil Sci. Soc. Am. J. 60:962–969.
Anderson, A.N., McBratney, A.B. and Crawford, J.W., Applications of fractals to soil studies. Adv. Agron., 63:1, 1998.
Barnsley, M.F., Devaney, R.L., Mandelbrot, B.B., Peitgen, H.O., Saupe, D. and Voss, R.F., 1988, The Science of Fractal Images. Edited by H.O. Peitgen and D. Saupe, Springer-Verlag, New York.
Bartoli, F., Philippy, R., Doirisse, S., Niquet, S. and Dubuit, M., 1991, Structure and self-similarity in silty and sandy soils; the fractal approach, J. Soil Sci. 42:167–185.
Bartoli, F., Bird, N.R., Gomendy, V., Vivier, H. and Niquet, S., 1999, The relation between silty soil structures and their mercury porosimetry curve counterparts: fractals and percolation, Eur. J. Soil Sci., 50(9).
Bartoli, F., Dutartre, P., Gomendy, V., Niquet, S. and Vivier, H., 1998. Fractal and soil structures. In: Fractals in Soil Science, Baveye, Parlange and Stewart, Eds., CRC Press, Boca Raton, 203–232.
Baveye, P. and Boast, C.W. Fractal Geometry, Fragmentation Processes and the Physics of Scale-Invariance: An Introduction. In Fractals in Soil Science, Baveye, Parlange and Stewart, Eds., CRC Press, Boca Raton, 1998, 1.
Baveye, P., Boast, C.W., Ogawa, S., Parlange, J.Y. and Steenhuis, T., 1998. Influence of image resolution and thresholding on the apparent mass fractal characteristics of preferential flow patterns in field soils, Water Resour. Res. 34, 2783–2796.
Bird, N., Díaz, M.C., Saa, A. and Tarquis, A.M., 2006. Fractal and Multifractal Analysis of Pore-Scale Images of Soil. J. Hydrol, 322, 211–219.
Bird, N.R.A., Perrier, E. and Rieu, M., 2000. The water retention function for a model of soil structure with pore and solid fractal distributions. Eur. J. Soil Sci. 51, 55–63.
Bird, N.R.A. and Perrier, E.M.A., 2003. The pore-solid fractal model of soil density scaling. Eur. J. Soil Sci. 54, 467–476.
Booltink, H.W.G., Hatano, R. and Bouma, J., 1993. Measurement and simulation of bypass flow in a structured clay soil; a physico-morphological approach. J. Hydrol. 148, 149–168.
Brakensiek, D.L., W.J. Rawls, S.D. Logsdon and Edwards, W.M., 1992. Fractal description of macroporosity. Soil Sci. Soc. Am. J. 56, 1721–1723.
Buczkowski, S., Hildgen, P. and Cartilier, L. 1998. Measurements of fractal dimension by box-counting: a critical analysis of data scatter. Physica A 252, 23–34.
Cheng, Q. and Agterberg, F.P. (1996). Comparison between two types of multifractal modeling. Mathematical Geology, 28(8), 1001–1015.
Cheng, Q. (1997a). Discrete multifractals. Mathematical Geology, 29(2), 245–266.
Cheng, Q. (1997b). Multifractal modeling and lacunarity analysis. Mathematical Geology, 29(7), 919–932.
Crawford, J.W., Baveye, P., Grindrod, P. and Rappoldt, C. Application of Fractals to Soil Properties, Landscape Patterns, and Solute Transport in Porous Media, in Assessment of Non-Point Source Pollution in the Vadose Zone. Geophysical Monograph 108, Corwin, Loague and Ellsworth, Eds., American Geophysical Union, Washington, DC, 1999, 151.
Crawford, J.W., Ritz, K. and Young, I.M. Quantification of fungal morphology, gaseous transport and microbial dynamics in soil: an integrated framework utilising fractal geometry. Geoderma, 56, 1578, 1993.
Crawford, J.W., Matsui, N. and Young, I.M., 1995. The relation between the moisture-release curve and the structure of soil. Eur. J. Soil Sci. 46, 369–375.
Dathe, A., Eins, S., Niemeyer, J. and Gerold, G. The surface fractal dimension of the soil-pore interface as measured by image analysis. Geoderma, 103, 203, 2001.
Dathe, A., Tarquis, A.M. and Perrier, E., 2006. Multifractal analysis of the pore- and solid-phases in binary two-dimensional images of natural porous structures. Geoderma, doi:10.1016/j.geoderma.2006.03.024, in press.
Dathe, A. and Thullner, M., 2005. The relationship between fractal properties of solid matrix and pore space in porous media. Geoderma, 129, 279–290.
Feder, J., 1989. Fractals. Plenum Press, New York. 283 pp.
Flury, M. and Fluhler, H., 1994. Brilliant blue FCF as a dye tracer for solute transport studies – A toxicological overview. J. Environ. Qual. 23, 1108–1112.
Flury, M. and Fluhler, H., 1995. Tracer characteristics of brilliant blue. Soil Sci. Soc. Am. J. 59, 22–27.
Flury, M., Fluhler, H., Jury, W.A. and Leuenberger, J., 1994. Susceptibility of soils to preferential flow of water: A field study, Water Resour. Res. 30, 1945–1954.
Giménez, D., R.R. Allmaras, E.A. Nater and Huggins, D.R., 1997a. Fractal dimensions for volume and surface of interaggregate pores – scale effects. Geoderma 77, 19–38.
Giménez D., Perfect E., Rawls W.J. and Pachepsky, Y., 1997b. Fractal models for predicting soil hydraulic properties: a review. Eng. Geol. 48, 161–183.
Gouyet, J.G. Physics and Fractal Structures. Masson, Paris, 1996.
Grau, J., Méndez, V., Tarquis, A.M., Díaz, M.C. and A. Saa, 2006. Comparison of gliding box and box-counting methods in soil image analysis. Geoderma, doi:10.1016/j.geoderma.2006.03.009, in press.
Griffith, D.A. Advanced Spatial Statistics. Kluwer Academic Publishers, Boston, 1988.
Hallett, P.D., Bird, N.R.A., Dexter, A.R. and Seville, P.K., 1998. Investigation into the fractal scaling of the structure and strength of soil aggregates. Eur. J. Soil Sci. 49, 203–211.
Hatano, R. and Booltink, H.W.G., 1992. Using Fractal Dimensions of Stained Flow Patterns in a Clay Soil to Predict Bypass Flow. J. Hydrol. 135, 121–131.
Hatano, R., Kawamura, N., Ikeda, J. and Sakuma, T. Evaluation of the effect of morphological features of flow paths on solute transport by using fractal dimensions of methylene blue staining patterns. Geoderma 53, 31, 1992.
Hentschel, H.G.R. and Procaccia, I. (1983). The infinite number of generalized dimensions of fractals and strange attractors. Physica D, 8, 435, 1983.
Kaye, B.G. A Random Walk through Fractal Dimensions. VCH Verlagsgesellschaft, Weinheim, Germany, 1989, 297.
Mandelbrot, B.B. The Fractal Geometry of Nature. W.H. Freeman, San Francisco, CA, 1982.
McCauley, J.L. 1992. Models of permeability and conductivity of porous media. Physica A 187, 18–54.
Moran, C.J., McBratney, A.B. and Koppi, A.J., 1989. A rapid method for analysis of soil macropore structure. I. Specimen preparation and digital binary production. Soil Sci. Soc. Am. J. 53, 921–928.
Muller, J., 1996. Characterization of pore space in chalk by multifractal analysis. J. Hydrology, 187, 215–222.
Muller, J. and McCauley, J.L., 1992. Implication of Fractal Geometry for Fluid Flow Properties of Sedimentary Rocks. Transp. Porous Media 8, 133–147.
Muller, J., Huseby, O.K. and Saucier, A., 1995. Influence of Multifractal Scaling of Pore Geometry on Permeabilities of Sedimentary Rocks. Chaos, Solitons & Fractals 5, 1485–1492.
Ogawa, S., Baveye, P., Boast, C.W., Parlange, J.Y. and Steenhuis, T. Surface fractal characteristics of preferential flow patterns in field soils: evaluation and effect of image processing. Geoderma, 88, 109, 1999.
Oleschko, K., Fuentes, C., Brambila, F. and Alvarez, R. Linear fractal analysis of three Mexican soils in different management systems. Soil Technol., 10, 185, 1997.
Oleschko, K. Delesse principle and statistical fractal sets: 1. Dimensional equivalents. Soil & Tillage Research, 49, 255, 1998a.
Oleschko, K., Brambila, F., Aceff, F. and Mora, L.P. From fractal analysis along a line to fractals on the plane. Soil & Tillage Research, 45, 389, 1998b.
Orbach, R. Dynamics of fractal networks. Science (Washington, DC) 231, 814, 1986.
Pachepsky, Y.A., Yakovchenko, V., Rabenhorst, M.C., Pooley, C. and Sikora, L.J. Fractal parameters of pore surfaces as derived from micromorphological data: effect of long term management practices. Geoderma, 74, 305, 1996.
Pachepsky, Y.A., Giménez, D., Crawford, J.W. and Rawls, W.J. Conventional and fractal geometry in soil science. In Fractals in Soil Science, Pachepsky, Crawford and Rawls, Eds., Elsevier Science, Amsterdam, 2000, 7.
Persson, M., Yasuda, H., Albergel, J., Berndtsson, R., Zante, P., Nasri, S. and Öhrström, P., 2001. Modeling plot scale dye penetration by a diffusion limited aggregation (DLA) model. J. Hydrol. 250, 98–105.
Peyton, R.L., Gantzer, C.J., Anderson, S.H., Haeffner, B.A. and Pfeifer, P. Fractal dimension to describe soil macropore structure using X ray computed tomography. Water Resource Research, 30, 691, 1994.
Posadas, A.N.D., Giménez, D., Quiroz, R. and Protz, R., 2003. Multifractal Characterization of Soil Pore Spatial Distributions. Soil Sci. Soc. Am. J. 67, 1361–1369.
Protz, R. and VandenBygaart, A.J. 1998. Towards systematic image analysis in the study of soil micromorphology. Science Soils, 3. (available online at http://link.springer.de/link/service/journals/).
Ripley, B.D. Statistical Inference for Spatial Processes, Cambridge Univ. Press, Cambridge, 1988.
Saucier, A. Effective permeability of multifractal porous media. Physica A, 183, 381, 1992.
Saucier, A. and Muller, J. Remarks on some properties of multifractals. Physica A, 199, 350, 1993.
Saucier, A. and Muller, J. Textural analysis of disordered materials with multifractals. Physica A, 267, 221, 1999.
Saucier, A., Richer, J. and Muller, J., 2002. Statistical mechanics and its applications. Physica A, 311 (1–2): 231–259.
Takayasu, H. Fractals in the Physical Sciences. Manchester University Press, Manchester, 1990.
Tarquis, A.M., Giménez, D., Saa, A., Díaz, M.C. and Gascó, J.M., 2003. Scaling and Multiscaling of Soil Pore Systems Determined by Image Analysis. In: Scaling Methods in Soil Physics, Pachepsky, Radcliffe and Selim Eds., CRC Press, 434 pp.
Tarquis, A.M., McInnes, K.J., Keys, J., Saa, A., García, M.R. and Díaz, M.C., 2006. Multiscaling Analysis In A Structured Clay Soil Using 2D Images. J. Hydrol, 322, 236–246.
Tel, T. and Vicsek, T., 1987. Geometrical multifractality of growing structures, J. Physics A. General, 20, L835–L840.
VandenBygaart, A.J. and Protz, R., 1999. The representative elementary area (REA) in studies of quantitative soil micromorphology. Geoderma 89, 333–346.
Vicsek, T. 1990. Mass multifractals. Physica A, 168, 490–497.
Vogel, H.J. and Kretzschmar, A., 1996. Topological characterization of pore space in soil-sample preparation and digital image-processing. Geoderma 73, 23–38.
