Encyclopedia of Artificial Intelligence


Author: Juan Ramon Rabunal Dopico | Julian Dorado De La Calle | Alejandro Pazos Sierra



Encyclopedia of Artificial Intelligence

Juan Ramón Rabuñal Dopico, University of A Coruña, Spain
Julián Dorado de la Calle, University of A Coruña, Spain
Alejandro Pazos Sierra, University of A Coruña, Spain

Information Sci

Hershey • New York

Director of Editorial Content: Kristin Klinger
Managing Development Editor: Kristin Roth
Development Editorial Assistant: Julia Mosemann, Rebecca Beistline
Senior Managing Editor: Jennifer Neidig
Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Typesetter: Jennifer Neidig, Amanda Appicello, Cindy Consonery
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com/reference

and in the United Kingdom by
Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanbookstore.com

Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Encyclopedia of artificial intelligence / Juan Ramon Rabunal Dopico, Julian Dorado de la Calle, and Alejandro Pazos Sierra, editors.
p. cm.
Includes bibliographical references and index.
Summary: "This book is a comprehensive and in-depth reference to the most recent developments in the field covering theoretical developments, techniques, technologies, among others"--Provided by publisher.
ISBN 978-1-59904-849-9 (hardcover) -- ISBN 978-1-59904-850-5 (ebook)
1. Artificial intelligence--Encyclopedias. I. Rabunal, Juan Ramon, 1973- II. Dorado, Julian, 1970- III. Pazos Sierra, Alejandro.
Q334.2.E63 2008
006.303--dc22
2008027245

British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. All work contributed to this encyclopedia set is new, previously-unpublished material. The views expressed in this encyclopedia set are those of the authors, but not necessarily of the publisher.

If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating the library's complimentary electronic access to this publication.

Editorial Advisory Board

Juan Ríos Carrión Polytechnical University of Madrid, Spain

Peter Smith University of Sunderland, UK

Anselmo del Moral University of Deusto, Spain

Paul M. Chapman University of Hull, UK

Daniel Manrique Gamo Polytechnical University of Madrid, Spain

Ana Belén Porto Pazos University of A Coruña, Spain

Juan Pazos Sierra Polytechnical University of Madrid, Spain

Javier Pereira University of A Coruña, Spain

Jose Crespo del Arco Polytechnical University of Madrid, Spain

Stefano Cagnoni Università degli Studi di Parma, Italy

Norberto Ezquerra Georgia Institute of Technology, USA

Jose María Barreiro Sorrivas Polytechnical University of Madrid, Spain

Lluís Jofre Polytechnical University of Catalunya, Spain

List of Contributors

Adorni, Giovanni / Università degli Studi di Genova, Italy ... 840, 848
Akkaladevi, Somasheker / Virginia State University, USA ... 940, 945, 1330
Al-Ahmadi, Mohammad Saad / King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia ... 1323
Aliaga, Ramón J. / Universidad Politécnica de Valencia, Spain ... 1576
Alías, Francesc / Universitat Ramon Llull, Spain ... 541, 788
Alonso-Betanzos, Amparo / University of A Coruña, Spain ... 632
Alonso Hernández, Jesús Bernardino / University of Las Palmas de Gran Canaria, Spain ... 1266, 1439
Alonso-Weber, Juan Manuel / Universidad Carlos III de Madrid, Spain ... 554
Alsina Pagès, Rosa Maria / Universitat Ramon Llull, Spain ... 719
Alvarellos González, Alberto / University of A Coruña, Spain ... 167
Amarger, Véronique / University of Paris, France ... 131
Amari, Shun-ichi / Brain Science Institute, Japan ... 318
Ambrósio, Paulo Eduardo / Santa Cruz State University, Brazil ... 157
Anagnostou, Miltiades / National Technical University of Athens, Greece ... 1429, 1524
Andrade, Javier / University of A Coruña, Spain ... 975
Andrade, José Manuel / University of A Coruña, Spain ... 581
Ang, Kai Keng / Institute for Infocomm Research, Singapore ... 1396
Ang Jr., Marcelo H. / National University of Singapore, Singapore ... 1072, 1080
Angulo, Cecilio / Technical University of Catalonia, Spain ... 1095, 1518
Anselma, Luca / Università di Torino, Italy ... 396
Arcay, Bernardino / University of A Coruña, Spain ... 710
Ares, Juan / University of A Coruña, Spain ... 982
Armstrong, Alice J. / The George Washington University, USA ... 65
Arquero, Águeda / Technical University of Madrid, Spain ... 781
Aunet, Snorre / University of Oslo, Norway & Centers for Neural Inspired Nano Architectures, Norway ... 1474, 1555
Azzini, Antonia / University of Milan, Italy ... 575
Badidi, Elarbi / United Arab Emirates University, UAE ... 31
Bagchi, Kallol / University of Texas at El Paso, USA ... 51
Bajo, Javier / Universidad Pontificia de Salamanca, Spain ... 1327
Barajas, Sandra E. / Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico ... 867
Barron, Lucia / Instituto Tecnologico de Culiacan, Mexico ... 860
Barták, Roman / Charles University in Prague, Czech Republic ... 404
Barton, Alan J. / National Research Council Canada, Canada ... 1205, 1589
Becerra, J. A. / University of A Coruña, Spain ... 603
Bedia, Manuel G. / University of Zaragoza, Spain ... 256

Beiu, Valeriu / United Arab Emirates University, UAE ... 471
Bel Enguix, Gemma / Rovira i Virgili University, Spain ... 1173
Belanche Muñoz, Lluís A. / Universitat Politècnica de Catalunya, Spain ... 639, 1004, 1012
Berge, Hans Kristian Otnes / University of Oslo, Norway ... 1485
Bernier, Joel / SAGEM REOSC, France ... 131
Berrones, Arturo / Universidad Autónoma de Nuevo León, Mexico ... 1462
Bershtein, Leonid S. / Taganrog Technological Institute of Southern Federal University, Russia ... 704
Bessalah, Hamid / Center de Développement des Technologies Avancées (CDTA), Algérie ... 831
Beynon, Malcolm J. / Cardiff University, UK ... 443, 696
Bhatnagar, Vasudha / University of Delhi, India ... 76, 172
Blanco, Ángela / Universidad Pontificia de Salamanca, Spain ... 561
Blanco, Francisco J. / Juan Canalejo Hospital, Spain ... 1583
Blasco, X. / Polytechnic University of Valencia, Spain ... 1296
Boonthum, Chutima / Hampton University, USA ... 1253
Bouridene, Ahmed / Queens University of Belfast, Ireland ... 831
Boyer-Xambeu, Marie-Thérèse / Université de Paris VII – LED, France ... 996
Bozhenyuk, Alexander V. / Taganrog Technological Institute of Southern Federal University, Russia ... 704
Brest, Janez / University of Maribor, Slovenia ... 488
Bueno, Raúl Vicen / University of Alcalá, Spain ... 933, 956
Bueno García, Gloria / University of Castilla – La Mancha, Spain ... 367, 547
Buruncuk, Kadri / Near East University, Turkey ... 1596
Cadenas, José M. / Universidad de Murcia, Spain ... 480
Cagnoni, Stefano / Università degli Studi di Parma, Italy ... 840, 848, 1303
Çakıcı, Ruket / ICCS School of Informatics, University of Edinburgh, UK ... 449
Canto, Rosalba Cuapa / Benemérita Universidad Autónoma de Puebla, Mexico ... 1370, 1426
Carballo, Rodrigo / University of Santiago de Compostela, Spain ... 1603
Carbonero, M. / INSA – ETEA, Spain ... 1136
Cardot, Hubert / University François-Rabelais of Tours, France ... 520
Castillo, Luis F. / National University, Colombia ... 256
Castro Ponte, Alberte / University of Santiago de Compostela, Spain ... 144, 759
Castro, Alfonso / University of A Coruña, Spain ... 710
Castro-Bleda, María José / Universidad Politécnica de Valencia, Spain ... 231
Cepero, M. / University of Granada, Spain ... 910
Chapman, Paul M. / University of Hull, UK ... 536
Charrier, Christophe / University of Caen Basse-Normandie, France ... 520
Chen, Qiyang / Montclair State University, USA ... 418, 963, 1036
Chen, Guanrong / City University of Hong Kong, Hong Kong, China ... 688, 734
Chen, Sherry Y. / Brunel University, UK ... 437
Chikhi, Nassim / Center de Développement des Technologies Avancées (CDTA), Algérie ... 831
Chiong, Raymond / Swinburne University of Technology, Sarawak Campus, Malaysia ... 1562
Chrysostomou, Kyriacos / Brunel University, UK ... 437
Colomo, Ricardo / Universidad Carlos III de Madrid, Spain ... 1064
Corchado, Juan M. / University of Salamanca, Spain ... 256, 1316
Coupland, Sarah / Royal Liverpool University Hospital, UK ... 390
Crespo, Jose / Universidad Politécnica de Madrid, Spain ... 1102
Cruz-Corona, Carlos / Universidad de Granada, Spain ... 480
Cuéllar, M. P. / Universidad de Granada, Spain ... 1152
Culhane, Aedín C. / Harvard School of Public Health, USA ... 65

Curra, Alberto / University of A Coruña, Spain ... 110
Damato, Bertil / Royal Liverpool University Hospital, UK ... 390
Danciu, Daniela / University of Craiova, Romania ... 1212
Danielson, Mats / Stockholm University, Sweden & Royal Institute of Technology, Sweden ... 431
Das, Sanjoy / Kansas State University, USA ... 1145, 1191
Davis, Darryl N. / University of Hull, UK ... 536
de la Mata Moya, David / University of Alcalá, Spain ... 933
de la Rosa Turbides, Tomás / Universidad Carlos III de Madrid, Spain ... 1024
Deleplace, Ghislain / Université de Paris VIII – LED, France ... 996
Delgado, M. / Universidad de Granada, Spain ... 1152
Delgado, Soledad / Technical University of Madrid, Spain ... 781
Del-Moral-Hernandez, Emilio / University of São Paulo, Brazil ... 275
Deng, Pi-Sheng / California State University at Stanislaus, USA ... 748, 1504
Déniz Suárez, Oscar / University of Las Palmas de Gran Canaria, Spain ... 367
Dhurandher, Sanjay Kumar / University of Delhi, India ... 589, 1530
di Pierro, Francesco / University of Exeter, UK ... 1042
Díaz Martín, José Fernando / University of Deusto, Spain ... 344
Díaz Pernas, F. J. / University of Valladolid, Spain ... 1490, 1497
Díez Higuera, J. F. / University of Valladolid, Spain ... 1490, 1497
Diuk, Carlos / Rutgers University, USA ... 825
Djebbari, Amira / National Research Council Canada, Canada ... 65
Dorado de la Calle, Julián / University of A Coruña, Spain ... 377, 1273
Dornaika, Fadi / Institut Géographique National, France ... 625
Douglas, Angela / Liverpool Women’s Hospital, UK ... 390
Duro, R. J. / University of A Coruña, Spain ... 603
Edelkamp, Stefan / University of Dortmund, Germany ... 501, 1549
Ein-Dor, Phillip / Tel-Aviv University, Israel ... 327, 334
Ekenberg, Love / Stockholm University, Sweden & Royal Institute of Technology, Sweden ... 431
Eleuteri, Antonio / Royal Liverpool University Hospital, UK ... 390
Encheva, Sylvia / Haugesund University College, Norway ... 1610
Erdogmus, Deniz / Northeastern University, USA ... 902
Esmahi, Larbi / Athabasca University, Canada ... 31
España-Boquera, Salvador / Universidad Politécnica de Valencia, Spain ... 231
Ezquerra, Norberto / Georgia Institute of Technology, USA ... 1290
Fan, Liwei / National University of Singapore, Singapore ... 879
Farah, Ahcene / Ajman University, UAE ... 831
Faundez-Zanuy, Marcos / Escola Universitària Politècnica de Mataró, Spain ... 262
Fernández, J. Álvaro / University of Extremadura, Badajoz, Spain ... 45, 218
Fernandez-Blanco, Enrique / University of A Coruña, Spain ... 377, 744, 1583
Ferrer, Miguel A. / University of Las Palmas de Gran Canaria, Spain ... 270, 1232
Figueiredo, Karla / UERJ, Brazil ... 808, 817
Flauzino, Rogerio A. / University of São Paulo, Brazil ... 1121
Flores, Dionicio Zacarías / Benemérita Universidad Autónoma de Puebla, Mexico ... 1370, 1426
Flores, Fernando Zacarías / Benemérita Universidad Autónoma de Puebla, Mexico ... 1370, 1426
Flores-Badillo, Marina / CINVESTAV Unidad Guadalajara, Mexico ... 1615
Flórez-Revuelta, Francisco / University of Alicante, Spain ... 1363
Fontenla-Romero, Oscar / University of A Coruña, Spain ... 667
Formiga, Lluís / Universitat Ramon Llull, Spain ... 788
Fornarelli, Girolamo / Politecnico di Bari, Italy ... 206, 211

Fuster-Garcia, E. / Polytechnic University of Valencia, Spain ... 1296
Gadea, Rafael / Universidad Politécnica de Valencia, Spain ... 1576
Garanina, Natalia / Russian Academy of Science, Institute of Informatics Systems, Russia ... 1089
García, Ángel / Universidad Carlos III de Madrid, Spain ... 1064
García, Rafael / University of A Coruña, Spain ... 982
García González, Antonio / University of Alcalá, Spain ... 956
García-Chamizo, Juan Manuel / University of Alicante, Spain ... 1363
García-Córdova, Francisco / Polytechnic University of Cartagena (UPCT), Spain ... 1197
Garcia-Raffi, L. M. / Polytechnic University of Valencia, Spain ... 1296
García-Rodríguez, José / University of Alicante, Spain ... 1363
Garrido, Mª Carmen / Universidad de Murcia, Spain ... 480
Garro, Alfredo / University of Calabria, Italy ... 1018
Gaubert, Patrice / Université de Paris 12 – ERUDITE, France ... 996
Gavrilova, M. L. / University of Calgary, Canada ... 9
Geem, Zong Woo / Johns Hopkins University, USA ... 803
Gelbard, Roy / Bar-Ilan University, Israel ... 796
George, E. Olusegun / University of Memphis, USA ... 304, 312
Gerek, Ömer Nezih / Anadolu University Eskisehir, Turkey ... 1433
Gestal, Marcos / University of A Coruña, Spain ... 581, 647
Giaquinto, Antonio / Politecnico di Bari, Italy ... 206, 211
Gil Pita, Roberto / University of Alcalá, Spain ... 933, 956
Gillard, Lucien / CNRS – LED, France ... 996
Giret, Jean-Francois / CEREQ, France ... 1029
Gómez, Gabriel / University of Zurich, Switzerland ... 464
Gómez, Juan M. / Universidad Carlos III de Madrid, Spain ... 1064
Gómez-Carracedo, Mari Paz / University of A Coruña, Spain ... 647
González-Fonteboa, Belén / University of A Coruña, Spain ... 526
González, Evelio J. / University of La Laguna, Spain ... 917
González, Roberto / University of Castilla – La Mancha, Spain ... 547
Gonzalez-Abril, Luis / Technical University of Catalonia, Spain ... 1518
González Bedia-Fonteboa, Manuel / University of Zaragoza, Spain ... 256
González-Castolo, Juan Carlos / CINVESTAV Unidad Guadalajara, Mexico ... 677
González de la Rosa, Juan J. / Universities of Cádiz-Córdoba, Spain ... 1226
González Ortega, D. / University of Valladolid, Spain ... 1490, 1497
Gonzalo, Consuelo / Technical University of Madrid, Spain ... 781
Graesser, Art / The University of Memphis, USA ... 1179
Grošek, Otokar / Slovak University of Technology, Slovakia ... 179, 186
Guerin-Dugue, Anne / GIPSA-lab, France ... 1244
Guerrero-González, Antonio / Polytechnic University of Cartagena (UPCT), Spain ... 1197
Guijarro-Berdiñas, Bertha / University of A Coruña, Spain ... 667
Guillen, A. / University of Granada, Spain ... 910
Gupta, Anamika / University of Delhi, India ... 76
Gutiérrez, P.A. / University of Córdoba, Spain ... 1136
Gutiérrez Sánchez, Germán / Universidad Carlos III de Madrid, Spain ... 554
Halang, Wolfgang A. / Fernuniversitaet in Hagen, Germany ... 1049
Hammer, Barbara / Technical University of Clausthal, Germany ... 1337
Hee, Lee Gim / DSO National Laboratories, Singapore ... 1072, 1080
Herrador, Manuel F. / University of A Coruña, Spain ... 118

Herrera, Carlos / Intelligent Systems Research Centre, University of Ulster, North Ireland ... 1376
Herrera, L. J. / University of Granada, Spain ... 910
Herrero, J. M. / Polytechnic University of Valencia, Spain ... 1296
Hervás, C. / University of Córdoba, Spain ... 1136
Hocaoğlu, Fatih Onur / Anadolu University Eskisehir, Turkey ... 1433
Hong, Wei-Chiang / Oriental Institute of Technology, Taiwan ... 410
Hopgood, Adrian A. / De Montfort University, UK ... 989
Ho-Phuoc, Tien / GIPSA-lab, France ... 1244
Huang, Xiaoyu / University of Shanghai for Science & Technology, China ... 51
Huber, Franz / California Institute of Technology, USA ... 1351
Ibáñez, Óscar / University of A Coruña, Spain ... 383, 759
Ibrahim, Walid / United Arab Emirates University, UAE ... 471
Iftekharuddin, Khan M. / University of Memphis, USA ... 304, 312
Ingber, Lester / Lester Ingber Research, USA ... 58
Iglesias, Gergorio / University of Santiago de Compostela, Spain ... 1603
Ionescu, Laurenţiu / University of Pitesti, Romania ... 609
Ip, Horace H. S. / City University of Hong Kong, Hong Kong ... 1
Iriondo, Ignasi / Universitat Ramon Llull, Spain ... 541
Islam, Atiq / University of Memphis, USA ... 304, 312
Izeboudjen, Nouma / Center de Développement des Technologies Avancées (CDTA), Algérie ... 831
Jabbar, Shahid / University of Dortmund, Germany ... 501
Jabr, Samir / Near East University, Turkey ... 1596
Janković-Romano, Mario / University of Belgrade, Serbia ... 950
Jarabo Amores, María Pilar / University of Alcalá, Spain ... 933
Jaspe, Alberto / University of A Coruña, Spain ... 873
Jiang, Jun / City University of Hong Kong, Hong Kong ... 1
Jiménez Celorrio, Sergio / Universidad Carlos III de Madrid, Spain ... 1024
Jiménez López, M. Dolores / Rovira i Virgili University, Spain ... 1173
Joo, Young Hoon / Kunsan National University, Korea ... 688, 734
Kaburlasos, Vassilis G. / Technological Educational Institution of Kavala, Greece ... 1238
Kačič, Zdravko / University of Maribor, Slovenia ... 1467
Kärnä, Tuomas / Helsinki University of Technology, Finland ... 661
Katangur, Ajay K. / Texas A&M University – Corpus Christi, USA ... 1330
Khashman, Adnan / Near East University, Turkey ... 1596
Khu, Soon-Thiam / University of Exeter, UK ... 1042
Kleinschmidt, João H. / State University of Campinas, Brazil ... 755
Klimanek, David / Czech Technical University in Prague, Czech Republic ... 567
Kochhar, Sarabjeet / University of Delhi, India ... 172
Kovács, Szilveszter / University of Miskolc, Hungary ... 728
Kovács, László / University of Miskolc, Hungary ... 654, 1130
Krčadinac, Uroš / University of Belgrade, Serbia ... 950
Kroc, Jiří / Section Computational Science, The University of Amsterdam, The Netherlands ... 353
Kumar, Naveen / University of Delhi, India ... 76
Kurban, Mehmet / Anadolu University Eskisehir, Turkey ... 1433
Lama, Manuel / University of Santiago de Compostela, Spain ... 138, 1278
Law, Ngai-Fong / The Hong Kong Polytechnic University, Hong Kong ... 289
Lazarova-Molnar, Sanja / United Arab Emirates University, UAE ... 471
Lebrun, Gilles / University of Caen Basse-Normandie, France ... 520
Ledezma Espino, Agapito / Universidad Carlos III de Madrid, Spain ... 554

Lee, Man Wai / Brunel University, UK ..... 437
Lendasse, Amaury / Helsinki University of Technology, Finland ..... 661
Leung, C. W. / The Hong Kong Polytechnic University, Hong Kong ..... 1568
Levinstein, Irwin B. / Old Dominion University, USA ..... 1253
Levy, Simon D. / Washington and Lee University, USA ..... 514
Lezoray, Olivier / University of Caen Basse-Normandie, France ..... 520
Liang, Faming / Texas A&M University, USA ..... 1482
Liew, Alan Wee-Chung / Griffith University, Australia ..... 289
Lisboa, Paulo J.G. / Liverpool John Moores University, UK ..... 71
Littman, Michael / Rutgers University, USA ..... 825
Liu, Xiaohui / Brunel University, UK ..... 437
Lopes, Heitor Silvério / Federal University of Technology, Brazil ..... 596
López, M. Gloria / University of A Coruña, Spain ..... 110
López-Mellado, Ernesto / CINVESTAV Unidad Guadalajara, Mexico ..... 677, 1615
López-Rodríguez, Domingo / University of Málaga, Spain ..... 1112
Losada Rodriguez, Miguel Ángel / University of Granada, Spain ..... 144
Loula, Angelo / State University of Feira de Santana, Brazil & State University of Campinas (UNICAMP), Brazil ..... 1543
Loureiro, Javier Pereira / University of A Coruña, Spain ..... 1283, 1290
Lukomski, Robert / Wroclaw University of Technology, Poland ..... 1356
Lungarella, Max / University of Zurich, Switzerland ..... 464
Luo, Xin / The University of New Mexico, USA ..... 940, 945, 1330
Madani, Kurosh / University of Paris, France ..... 131
Madureira, Ana Marie / Polytechnic Institute of Porto, Portugal ..... 853
Magliano, Joseph P. / Northern Illinois University, USA ..... 1253
Magoulas, George D. / University of London, UK ..... 1411
Magro, Diego / Università di Torino, Italy ..... 396
Maitra, Anutosh / Dhirubhai Ambani Institute of Information and Communication Technology, India ..... 494
Mandl, Thomas / University of Hildesheim, Germany ..... 151
Manrique, Daniel / Inteligencia Artificial, Facultad de Informatica, UPM, Spain ..... 767
Marichal, G. Nicolás / University of La Laguna, Spain ..... 917
Marín-García, Fulgencio / Polytechnic University of Cartagena (UPCT), Spain ..... 1197
Martínez, Antonio / University of Castilla – La Mancha, Spain ..... 547
Martínez, Elisa / Universitat Ramon Llull, Spain ..... 541
Martínez, Estíbaliz / Technical University of Madrid, Spain ..... 781
Martínez, Jorge D. / Universidad Politécnica de Valencia, Spain ..... 1576
Martínez, Mª Isabel / University of A Coruña, Spain ..... 118
Martínez-Abella, Fernando / University of A Coruña, Spain ..... 526
Martínez Carballo, Manuel / University of A Coruña, Spain ..... 532
Martínez-Estudillo, F.J. / INSA – ETEA, Spain ..... 1136
Martínez-Feijóo, Diego / University of A Coruña, Spain ..... 1583
Martínez Romero, Marcos / University of A Coruña, Spain ..... 1283, 1290
Martínez-Zarzuela, M. / University of Valladolid, Spain ..... 1490, 1497
Martín-Guerrero, José D. / University of Valencia, Spain ..... 71
Martín-Merino, Manuel / Universidad Pontificia de Salamanca, Spain ..... 561
Mateo, Fernando / Universidad Politécnica de Valencia, Spain ..... 1576
Mateo Segura, Clàudia / Universitat Ramon Llull, Spain ..... 719
Mato, Virginia / University of A Coruña, Spain ..... 110
Maučec, Mirjam Sepesy / University of Maribor, Slovenia ..... 1467

Mazare, Alin / University of Pitesti, Romania ..... 609
McCarthy, Philip / The University of Memphis, USA ..... 1179
McGinnity, Thomas M. / Intelligent Systems Research Centre, University of Ulster, Northern Ireland ..... 1376
McNamara, Danielle S. / The University of Memphis, USA ..... 1253
Meged, Avichai / Bar-Ilan University, Israel ..... 796
Méndez Salgueiro, José Ramón / University of A Coruña, Spain ..... 532
Meng, Hai-Dong / Inner Mongolia University of Science and Technology, China ..... 297
Mérida-Casermeiro, Enrique / University of Málaga, Spain ..... 1112
Mesejo, Pablo / University of A Coruña, Spain ..... 1583
Michalewicz, Zbigniew / The University of Adelaide, Australia ..... 16
Miguélez Rico, Mónica / University of A Coruña, Spain ..... 236, 241, 1273
Millis, Keith K. / The University of Memphis, USA ..... 1253
Misra, Sudip / Yale University, USA ..... 589, 1530
Mohammadian, M. / University of Canberra, Australia ..... 456, 1510
Monzó, José Mª / Universidad Politécnica de Valencia, Spain ..... 1576
Morales Moreno, Aythami / University of Las Palmas de Gran Canaria, Spain ..... 1259
Mordonini, Monica / Università degli Studi di Parma, Italy ..... 840, 848, 1303
Moreno-Muñoz, A. / Universities of Cádiz-Córdoba, Spain ..... 1226
Muñoz, Enrique / Universidad de Murcia, Spain ..... 480
Muñoz, Luis Miguel Guzmán / Benemérita Universidad Autónoma de Puebla, Mexico ..... 1370, 1426
Mussi, Luca / Università degli Studi di Perugia, Italy ..... 840, 848
Mutihac, Radu / University of Bucharest, Romania ..... 22, 223, 1056
Narula, Prayag / University of Delhi, India ..... 589, 1530
Neto, João José / Universidade de São Paulo, Brazil ..... 37
Nitta, Tohru / AIST, Japan ..... 361
Nóvoa, Francisco J. / University of A Coruña, Spain ..... 110
Oja, Erkki / Helsinki University of Technology, Finland ..... 1343
Olteanu, Madalina / Université de Paris I – CES SAMOS, France ..... 996
Ortiz-de-Lazcano-Lobato, Juan M. / University of Málaga, Spain ..... 1112
Pacheco, Marco / PUC-Rio, Brazil ..... 808, 817
Panigrahi, Bijaya K. / Indian Institute of Technology, India ..... 1145
Papaioannou, Ioannis / National Technical University of Athens, Greece ..... 1418, 1524
Pazos Montañés, Félix / University of A Coruña, Spain ..... 167
Pazos Sierra, Alejandro / University of A Coruña, Spain ..... 1283
Pedreira, Nieves / University of A Coruña, Spain ..... 532
Pegalajar, M. C. / University of Granada, Spain ..... 1152
Pelta, David A. / Universidad de Granada, Spain ..... 480
Peña, Dexmont / Universidad Autónoma de Nuevo León, Mexico ..... 1462
Peng, Chun-Cheng / University of London, UK ..... 1411
Pérez, Juan L. / University of A Coruña, Spain ..... 118, 526
Pérez, Óscar / Universidad Autónoma de Madrid, Spain ..... 282
Pérez-Sánchez, Beatriz / University of A Coruña, Spain ..... 667
Periscal, David / University of A Coruña, Spain ..... 618
Perl, Juergen / University of Mainz, Germany ..... 1212
Peters, Georg / Munich University of Applied Sciences, Germany ..... 774
Piana, Michele / Università di Verona, Italy ..... 372
Planet, Santiago / Universitat Ramon Llull, Spain ..... 541
Poggi, Agostino / Università di Parma, Italy ..... 1404
Poh, Kim Leng / National University of Singapore, Singapore ..... 879

Porto Pazos, Ana Belén / University of A Coruña, Spain ..... 167
Principe, Jose C. / University of Florida, USA ..... 902
Puntonet, Carlos G. / University of Granada, Spain ..... 1226
Quackenbush, John / Harvard School of Public Health, USA ..... 65
Queiroz, João / State University of Campinas (UNICAMP), & Federal University of Bahia, Brazil ..... 1543
Quek, Chai / Nanyang Technological University, Singapore ..... 1396
Rabuñal Dopico, Juan Ramón / University of A Coruña, Spain ..... 125, 383
Raducanu, Bogdan / Computer Vision Center, Spain ..... 625
Ramos, Carlos / Polytechnic of Porto, Portugal ..... 92
Rashid, Shaista / University of Bradford, UK ..... 337
Răsvan, Vladimir / University of Craiova, Romania ..... 1212
Reyes-Galaviz, Orion Fausto / Universidad Autónoma de Tlaxcala, Mexico ..... 860, 867
Reyes-García, Carlos Alberto / Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico ..... 860, 867
Riaño Sierra, Jesús M. / University of Deusto, Spain ..... 344
Rigas, Dimitris / University of Bradford, UK ..... 337
Ríos, Juan / Inteligencia Artificial, Facultad de Informatica, UPM, Spain ..... 767
Rivero, Daniel / University of A Coruña, Spain ..... 125, 618
Rodrigues, Ernesto / Federal University of Technology, Brazil ..... 596
Rodríguez, Gregorio Iglesias / University of Santiago de Compostela, Spain ..... 144, 1614
Rodríguez, M. Antón / University of Valladolid, Spain ..... 1490, 1497
Rodríguez, Patricia Henríquez / University of Las Palmas de Gran Canaria, Spain ..... 1266, 1439
Rodríguez, Santiago / University of A Coruña, Spain ..... 975
Rodríguez, Sara / Universidad de Salamanca, Spain ..... 1316
Rodríguez-Patón, Alfonso / Inteligencia Artificial, Facultad de Informatica, UPM, Spain ..... 767
Rojas, F. / University of Granada, Spain ..... 910
Rojas, F. J. / University of Granada, Spain ..... 910
Rojas, I. / University of Granada, Spain ..... 910
Rokach, Lior / Ben Gurion University, Israel ..... 884
Romero, Carlos F. / University of Las Palmas de Gran Canaria, Spain ..... 1447
Romero, Enrique / Technical University of Catalonia, Spain ..... 1205
Romero-García, V. / Polytechnic University of Valencia, Spain ..... 1296
Rosa Zurera, Manuel / University of Alcalá, Spain ..... 933, 956
Roussaki, Ioanna / National Technical University of Athens, Greece ..... 1418, 1524
Rousset, Patrick / CEREQ, France ..... 1029
Roy, Shourya / IBM Research, India Research Lab, India ..... 99, 105
Ruano, Marcos / Universidad Carlos III de Madrid, Spain ..... 1064
Rus, Vasile / The University of Memphis, USA ..... 1179
Rusiecki, Andrzej / Wroclaw University of Technology, Poland ..... 1389
Russomanno, David J. / University of Memphis, USA ..... 304, 312
Sadri, Fariba / Imperial College London, UK ..... 85
Salazar, Addisson / iTEAM, Polytechnic University of Valencia, Spain ..... 192, 199
Sanchez, Rodrigo Carballo / University of Santiago de Compostela, Spain ..... 144, 1614
Sánchez, Eduardo / University of Santiago de Compostela, Spain ..... 138, 1278
Sánchez, Ricardo / Universidad Autónoma de Nuevo León, Mexico ..... 1462
Sánchez-Maroño, Noelia / University of A Coruña, Spain ..... 632
Sánchez-Montañés, Manuel / Universidad Autónoma de Madrid, Spain ..... 282, 561
Sánchez-Pérez, J. V. / Polytechnic University of Valencia, Spain ..... 1296
Sanchis, J. / Polytechnic University of Valencia, Spain ..... 1296
Sanchis de Miguel, Araceli / Universidad Carlos III de Madrid, Spain ..... 554

Sarathy, Rathindra / Oklahoma State University, USA ..... 1323
Savić, Dragan A. / University of Exeter, UK ..... 1042
Schleif, Frank-M. / University of Leipzig, Germany ..... 1337
Seoane, Antonio / University of A Coruña, Spain ..... 873
Seoane, María / University of A Coruña, Spain ..... 975, 982
Seoane Fernández, José Antonio / University of A Coruña, Spain ..... 236, 241, 744, 1273
Serantes, J. Andrés / University of A Coruña, Spain ..... 744
Şerban, Gheorghe / University of Pitesti, Romania ..... 609
Sergiadis, George D. / Aristotle University of Thessaloniki, Greece ..... 967
Serrano, Arturo / iTEAM, Polytechnic University of Valencia, Spain ..... 192, 199
Serrano-López, Antonio J. / University of Valencia, Spain ..... 71
Sesmero Lorente, M. Paz / Universidad Carlos III de Madrid, Spain ..... 554
Shambaugh, Neal / West Virginia University, USA ..... 1310
Sharkey, Amanda J.C. / University of Sheffield, UK ..... 161, 1537
Shilov, Nikolay V. / Russian Academy of Science, Institute of Informatics Systems, Russia ..... 1089
Sieber, Tanja / University of Miskolc, Hungary ..... 1130
Silaghi, Marius C. / Florida Institute of Technology, USA ..... 507
Silva, Ivan N. / University of São Paulo, Brazil ..... 1121
Sloot, Peter M.A. / Section Computational Science, The University of Amsterdam, The Netherlands ..... 353
Socoró Carrié, Joan-Claudi / Universitat Ramon Llull, Spain ..... 541, 719
Sofron, Emil / University of Pitesti, Romania ..... 609
Song, Yu-Chen / Inner Mongolia University of Science and Technology, China ..... 297
Sorathia, Vikram / Dhirubhai Ambani Institute of Information and Communication Technology, India ..... 494
Soria-Olivas, Emilio / University of Valencia, Spain ..... 71
Sossa, Humberto / Center for Computing Research, IPN, Mexico ..... 248
Souza, Flavio / UERJ, Brazil ..... 808, 817
Stanković, Milan / University of Belgrade, Serbia ..... 950
Stathis, Kostas / Royal Holloway, University of London, UK ..... 85
Suárez, Sonia / University of A Coruña, Spain ..... 975, 982
Subramaniam, L. Venkata / IBM Research, India Research Lab, India ..... 99, 105
Sulc, Bohumil / Czech Technical University in Prague, Czech Republic ..... 567
Szenher, Matthew / University of Edinburgh, UK ..... 1185
Taktak, Azzam / Royal Liverpool University Hospital, UK ..... 390
Tang, Zaiyong / Salem State College, USA ..... 51
Tapia, Dante I. / Universidad de Salamanca, Spain ..... 1316
Taveira Pinto, Francisco / University of Santiago de Compostela, Spain ..... 1603
Tejera Santana, Aday / University of Las Palmas de Gran Canaria, Spain ..... 270
Téllez, Ricardo / Technical University of Catalonia, Spain ..... 1095
Tettamanzi, Andrea G. B. / University of Milan, Italy ..... 575
Tikk, Domonkos / Budapest University of Technology and Economics, Hungary ..... 654
Tlelo-Cuautle, Esteban / Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico ..... 867
Tomaiuolo, Michele / Università di Parma, Italy ..... 1404
Torijano Gordo, Elena / University of Alcalá, Spain ..... 956
Torres, Manuel / University of Castilla – La Mancha, Spain ..... 547
Travieso González, Carlos M. / University of Las Palmas de Gran Canaria, Spain ..... 1259, 1447
Tumin, Sharil / University of Bergen, Norway ..... 1610
Turgay, Safiye / Abant İzzet Baysal University, Turkey ..... 924
Valdés, Julio J. / National Research Council Canada, Canada ..... 1205, 1589
Valenzuela, O. / University of Granada, Spain ..... 910

Vargas, J. Francisco / University of Las Palmas de Gran Canaria, Spain & Universidad de Antioquia, Colombia ..... 1232
Vazquez, Roberto A. / Center for Computing Research, IPN, Mexico ..... 248
Vázquez Naya, José Manuel / University of A Coruña, Spain ..... 1283, 1290
Vellasco, Marley / PUC-Rio, Brazil ..... 808, 817
Verdegay, José L. / Universidad de Granada, Spain ..... 480
Villmann, Thomas / University of Leipzig, Germany ..... 1337
Vlachos, Ioannis K. / Aristotle University of Thessaloniki, Greece ..... 967
Voiry, Matthieu / University of Paris, France & SAGEM REOSC, France ..... 131
Wang, John / Montclair State University, USA ..... 418, 424, 974, 963, 1036
Wilkosz, Kazimierz / Wroclaw University of Technology, Poland ..... 1356
Williamson, Kristian / Statistics Canada, Canada ..... 31
Wong, T. T. / The Hong Kong Polytechnic University, Hong Kong ..... 1568
Xu, Lei / Chinese University of Hong Kong, Hong Kong & Peking University, China ..... 318, 892, 1343
Yaman, Fahrettin / Abant İzzet Baysal University, Turkey ..... 924
Yan, Hong / City University of Hong Kong, Hong Kong & University of Sydney, Australia ..... 289
Yan, Yan / Tsinghua University, Beijing, China ..... 1455
Yao, James / Montclair State University, USA ..... 418, 424
Yokoo, Makoto / Kyushu University, Japan ..... 507
Yousuf, Muhammad Ali / Tecnologico de Monterrey – Santa Fe Campus, Mexico ..... 1383
Zajac, Pavol / Slovak University of Technology, Slovakia ..... 179, 186
Zamora-Martínez, Francisco / Universidad Politécnica de Valencia, Spain ..... 231
Zarri, Gian Piero / LaLIC, University Paris 4-Sorbonne, France ..... 1159, 1167
Zatarain, Ramon / Instituto Tecnologico de Culiacan, Mexico ..... 860
Zhang, Yu-Jin / Tsinghua University, Beijing, China ..... 1455
Zhao, Yi / Fernuniversitaet in Hagen, Germany ..... 1079
Ziemke, Tom / University of Skovde, Sweden ..... 1376

Contents by Volume

Volume I

Active Learning with SVM / Jun Jiang, City University of Hong Kong, Hong Kong; and Horace H. S. Ip, City University of Hong Kong, Hong Kong ... 1
Adaptive Algorithms for Intelligent Geometric Computing / M. L. Gavrilova, University of Calgary, Canada ... 9
Adaptive Business Intelligence / Zbigniew Michalewicz, The University of Adelaide, Australia ... 16
Adaptive Neural Algorithms for PCA and ICA / Radu Mutihac, University of Bucharest, Romania ... 22
Adaptive Neuro-Fuzzy Systems / Larbi Esmahi, Athabasca University, Canada; Kristian Williamson, Statistics Canada, Canada; and Elarbi Badidi, United Arab Emirates University, UAE ... 31
Adaptive Technology and Its Applications / João José Neto, Universidade de São Paulo, Brazil ... 37
Advanced Cellular Neural Networks Image Processing / J. Álvaro Fernández, University of Extremadura, Badajoz, Spain ... 45
Agent-Based Intelligent System Modeling / Zaiyong Tang, Salem State College, USA; Xiaoyu Huang, University of Shanghai for Science & Technology, China; and Kallol Bagchi, University of Texas at El Paso, USA ... 51
AI and Ideas by Statistical Mechanics / Lester Ingber, Lester Ingber Research, USA ... 58
AI Methods for Analyzing Microarray Data / Amira Djebbari, National Research Council Canada, Canada; Aedín C. Culhane, Harvard School of Public Health, USA; Alice J. Armstrong, The George Washington University, USA; and John Quackenbush, Harvard School of Public Health, USA ... 65
AI Walk from Pharmacokinetics to Marketing, An / José D. Martín-Guerrero, University of Valencia, Spain; Emilio Soria-Olivas, University of Valencia, Spain; Paulo J.G. Lisboa, Liverpool John Moores University, UK; and Antonio J. Serrano-López, University of Valencia, Spain ... 71
Algorithms for Association Rule Mining / Vasudha Bhatnagar, University of Delhi, India; Anamika Gupta, University of Delhi, India; and Naveen Kumar, University of Delhi, India ... 76

Ambient Intelligence / Fariba Sadri, Imperial College London, UK; and Kostas Stathis, Royal Holloway, University of London, UK ... 85
Ambient Intelligence Environments / Carlos Ramos, Polytechnic of Porto, Portugal ... 92
Analytics for Noisy Unstructured Text Data I / Shourya Roy, IBM Research, India Research Lab, India; and L. Venkata Subramaniam, IBM Research, India Research Lab, India ... 99
Analytics for Noisy Unstructured Text Data II / L. Venkata Subramaniam, IBM Research, India Research Lab, India; and Shourya Roy, IBM Research, India Research Lab, India ... 105
Angiographic Images Segmentation Techniques / Francisco J. Nóvoa, University of A Coruña, Spain; Alberto Curra, University of A Coruña, Spain; M. Gloria López, University of A Coruña, Spain; and Virginia Mato, University of A Coruña, Spain ... 110
ANN Application in the Field of Structural Concrete / Juan L. Pérez, University of A Coruña, Spain; Mª Isabel Martínez, University of A Coruña, Spain; and Manuel F. Herrador, University of A Coruña, Spain ... 118
ANN Development with EC Tools: An Overview / Daniel Rivero, University of A Coruña, Spain; and Juan Ramón Rabuñal Dopico, University of A Coruña, Spain ... 125
ANN-Based Defects’ Diagnosis of Industrial Optical Devices / Matthieu Voiry, University of Paris, France & SAGEM REOSC, France; Véronique Amarger, University of Paris, France; Joel Bernier, SAGEM REOSC, France; and Kurosh Madani, University of Paris, France ... 131
Artificial Intelligence and Education / Eduardo Sánchez, University of Santiago de Compostela, Spain; and Manuel Lama, University of Santiago de Compostela, Spain ... 138
Artificial Intelligence and Rubble-Mound Breakwater Stability / Gregorio Iglesias Rodriquez, University of Santiago de Compostela, Spain; Alberte Castro Ponte, University of Santiago de Compostela, Spain; Rodrigo Carballo Sanchez, University of Santiago de Compostela, Spain; and Miguel Ángel Losada Rodriguez, University of Granada, Spain ... 144
Artificial Intelligence for Information Retrieval / Thomas Mandl, University of Hildesheim, Germany ... 151
Artificial Intelligence in Computer-Aided Diagnosis / Paulo Eduardo Ambrósio, Santa Cruz State University, Brazil ... 157
Artificial Neural Networks and Cognitive Modelling / Amanda J.C. Sharkey, University of Sheffield, UK ... 161
Artificial NeuroGlial Networks / Ana Belén Porto Pazos, University of A Coruña, Spain; Alberto Alvarellos González, University of A Coruña, Spain; and Félix Montañés Pazos, University of A Coruña, Spain ... 167
Association Rule Mining / Vasudha Bhatnagar, University of Delhi, India; and Sarabjeet Kochhar, University of Delhi, India ... 172
Automated Cryptanalysis / Otokar Grošek, Slovak University of Technology, Slovakia; and Pavol Zajac, Slovak University of Technology, Slovakia ... 179

Automated Cryptanalysis of Classical Ciphers / Otokar Grošek, Slovak University of Technology, Slovakia; and Pavol Zajac, Slovak University of Technology, Slovakia ... 186
Automatic Classification of Impact-Echo Spectra I / Addisson Salazar, iTEAM, Polytechnic University of Valencia, Spain; and Arturo Serrano, iTEAM, Polytechnic University of Valencia, Spain ... 192
Automatic Classification of Impact-Echo Spectra II / Addisson Salazar, iTEAM, Polytechnic University of Valencia, Spain; and Arturo Serrano, iTEAM, Polytechnic University of Valencia, Spain ... 199
AVI of Surface Flaws on Manufactures I / Girolamo Fornarelli, Politecnico di Bari, Italy; and Antonio Giaquinto, Politecnico di Bari, Italy ... 206
AVI of Surface Flaws on Manufactures II / Girolamo Fornarelli, Politecnico di Bari, Italy; and Antonio Giaquinto, Politecnico di Bari, Italy ... 211
Basic Cellular Neural Networks Image Processing / J. Álvaro Fernández, University of Extremadura, Badajoz, Spain ... 218
Bayesian Neural Networks for Image Restoration / Radu Mutihac, University of Bucharest, Romania ... 223
Behaviour-Based Clustering of Neural Networks / María José Castro-Bleda, Universidad Politécnica de Valencia, Spain; Salvador España-Boquera, Universidad Politécnica de Valencia, Spain; and Francisco Zamora-Martínez, Universidad Politécnica de Valencia, Spain ... 231
Bio-Inspired Algorithms in Bioinformatics I / José Antonio Seoane Fernández, University of A Coruña, Spain; and Mónica Miguélez Rico, University of A Coruña, Spain ... 236
Bio-Inspired Algorithms in Bioinformatics II / José Antonio Seoane Fernández, University of A Coruña, Spain; and Mónica Miguélez Rico, University of A Coruña, Spain ... 241
Bioinspired Associative Memories / Roberto A. Vazquez, Center for Computing Research, IPN, Mexico; and Humberto Sossa, Center for Computing Research, IPN, Mexico ... 248
Bio-Inspired Dynamical Tools for Analyzing Cognition / Manuel G. Bedia, University of Zaragoza, Spain; Juan M. Corchado, University of Salamanca, Spain; and Luis F. Castillo, National University, Colombia ... 256
Biometric Security Technology / Marcos Faundez-Zanuy, Escola Universitària Politècnica de Mataró, Spain ... 262
Blind Source Separation by ICA / Miguel A. Ferrer, University of Las Palmas de Gran Canaria, Spain; and Aday Tejera Santana, University of Las Palmas de Gran Canaria, Spain ... 270
Chaotic Neural Networks / Emilio Del-Moral-Hernandez, University of São Paulo, Brazil ... 275
Class Prediction in Test Sets with Shifted Distributions / Óscar Pérez, Universidad Autónoma de Madrid, Spain; and Manuel Sánchez-Montañés, Universidad Autónoma de Madrid, Spain ... 282

Cluster Analysis of Gene Expression Data / Alan Wee-Chung Liew, Griffith University, Australia; Ngai-Fong Law, The Hong Kong Polytechnic University, Hong Kong; and Hong Yan, City University of Hong Kong, Hong Kong & University of Sydney, Australia ... 289
Clustering Algorithm for Arbitrary Data Sets / Yu-Chen Song, Inner Mongolia University of Science and Technology, China; and Hai-Dong Meng, Inner Mongolia University of Science and Technology, China ... 297
CNS Tumor Prediction Using Gene Expression Data Part I / Atiq Islam, University of Memphis, USA; Khan M. Iftekharuddin, University of Memphis, USA; E. Olusegun George, University of Memphis, USA; and David J. Russomanno, University of Memphis, USA ... 304
CNS Tumor Prediction Using Gene Expression Data Part II / Atiq Islam, University of Memphis, USA; Khan M. Iftekharuddin, University of Memphis, USA; E. Olusegun George, University of Memphis, USA; and David J. Russomanno, University of Memphis, USA ... 312
Combining Classifiers and Learning Mixture-of-Experts / Lei Xu, Chinese University of Hong Kong, Hong Kong & Peking University, China; and Shun-ichi Amari, Brain Science Institute, Japan ... 318
Commonsense Knowledge Representation I / Phillip Ein-Dor, Tel-Aviv University, Israel ... 327
Commonsense Knowledge Representation II / Phillip Ein-Dor, Tel-Aviv University, Israel ... 334
Comparative Study on E-Note-Taking, A / Shaista Rashid, University of Bradford, UK; and Dimitris Rigas, University of Bradford, UK ... 337
Comparison of Cooling Schedules for Simulated Annealing, A / José Fernando Díaz Martín, University of Deusto, Spain; and Jesús M. Riaño Sierra, University of Deusto, Spain ... 344
Complex Systems Modeling by Cellular Automata / Jiří Kroc, Section Computational Science, The University of Amsterdam, The Netherlands; and Peter M.A. Sloot, Section Computational Science, The University of Amsterdam, The Netherlands ... 353
Complex-Valued Neural Networks / Tohru Nitta, AIST, Japan ... 361
Component Analysis in Artificial Vision / Oscar Déniz Suárez, University of Las Palmas de Gran Canaria, Spain; and Gloria Bueno García, University of Castilla-La Mancha, Spain ... 367
Computational Methods in Biomedical Imaging / Michele Piana, Universita’ di Verona, Italy ... 372
Computer Morphogenesis in Self-Organizing Structures / Enrique Fernández-Blanco, University of A Coruña, Spain; and Julián Dorado, University of A Coruña, Spain ... 377
Computer Vision for Wave Flume Experiments / Óscar Ibáñez, University of A Coruña, Spain; and Juan Rabuñal Dopico, University of A Coruña, Spain ... 383
Conditional Hazard Estimating Neural Networks / Antonio Eleuteri, Royal Liverpool University Hospital, UK; Azzam Taktak, Royal Liverpool University Hospital, UK; Bertil Damato, Royal Liverpool University Hospital, UK; Angela Douglas, Liverpool Women’s Hospital, UK; and Sarah Coupland, Royal Liverpool University Hospital, UK ... 390

Configuration / Luca Anselma, Università di Torino, Italy; and Diego Magro, Università di Torino, Italy ... 396
Constraint Processing / Roman Barták, Charles University in Prague, Czech Republic ... 404
Continuous ACO in a SVR Traffic Forecasting Model / Wei-Chiang Hong, Oriental Institute of Technology, Taiwan ... 410
Data Mining Fundamental Concepts and Critical Issues / John Wang, Montclair State University, USA; Qiyang Chen, Montclair State University, USA; and James Yao, Montclair State University, USA ... 418
Data Warehousing Development and Design Methodologies / James Yao, Montclair State University, USA; and John Wang, Montclair State University, USA ... 424
Decision Making in Intelligent Agents / Mats Danielson, Stockholm University, Sweden & Royal Institute of Technology, Sweden; and Love Ekenberg, Stockholm University, Sweden & Royal Institute of Technology, Sweden ... 431
Decision Tree Applications for Data Modelling / Man Wai Lee, Brunel University, UK; Kyriacos Chrysostomou, Brunel University, UK; Sherry Y. Chen, Brunel University, UK; and Xiaohui Liu, Brunel University, UK ... 437
Dempster-Shafer Theory, The / Malcolm J. Beynon, Cardiff University, UK ... 443
Dependency Parsing: Recent Advances / Ruket Çakıcı, University of Edinburgh, UK ... 449
Designing Unsupervised Hierarchical Fuzzy Logic Systems / M. Mohammadian, University of Canberra, Australia ... 456
Developmental Robotics / Max Lungarella, University of Zurich, Switzerland; and Gabriel Gómez, University of Zurich, Switzerland ... 464
Device-Level Majority von Neumann Multiplexing / Valeriu Beiu, United Arab Emirates University, UAE; Walid Ibrahim, United Arab Emirates University, UAE; and Sanja Lazarova-Molnar, United Arab Emirates University, UAE ... 471
Different Approaches for Cooperation with Metaheuristics / José M. Cadenas, Universidad de Murcia, Spain; Mª Carmen Garrido, Universidad de Murcia, Spain; Enrique Muñoz, Universidad de Murcia, Spain; Carlos Cruz-Corona, Universidad de Granada, Spain; David A. Pelta, Universidad de Granada, Spain; and José L. Verdegay, Universidad de Granada, Spain ... 480
Differential Evolution with Self-Adaptation / Janez Brest, University of Maribor, Slovenia ... 488
Discovering Mappings Between Ontologies / Vikram Sorathia, Dhirubhai Ambani Institute of Information and Communication Technology, India; and Anutosh Maitra, Dhirubhai Ambani Institute of Information and Communication Technology, India ... 494
Disk-Based Search / Stefan Edelkamp, University of Dortmund, Germany; and Shahid Jabbar, University of Dortmund, Germany ... 501

Distributed Constraint Reasoning / Marius C. Silaghi, Florida Institute of Technology, USA; and Makoto Yokoo, Kyushu University, Japan ... 507
Distributed Representation of Compositional Structure / Simon D. Levy, Washington and Lee University, USA ... 514
EA Multi-Model Selection for SVM / Gilles Lebrun, University of Caen Basse-Normandie, France; Olivier Lezoray, University of Caen Basse-Normandie, France; Christophe Charrier, University of Caen Basse-Normandie, France; and Hubert Cardot, University François-Rabelais of Tours, France ... 520
EC Techniques in the Structural Concrete Field / Juan L. Pérez, University of A Coruña, Spain; Belén González-Fonteboa, University of A Coruña, Spain; and Fernando Martínez Abella, University of A Coruña, Spain ... 526
E-Learning in New Technologies / Nieves Pedreira, University of A Coruña, Spain; José Ramón Méndez Salgueiro, University of A Coruña, Spain; and Manuel Martínez Carballo, University of A Coruña, Spain ... 532
Emerging Applications in Immersive Technologies / Darryl N. Davis, University of Hull, UK; and Paul M. Chapman, University of Hull, UK ... 536
Emulating Subjective Criteria in Corpus Validation / Ignasi Iriondo, Universitat Ramon Llull, Spain; Santiago Planet, Universitat Ramon Llull, Spain; Francesc Alías, Universitat Ramon Llull, Spain; Joan-Claudi Socoró, Universitat Ramon Llull, Spain; and Elisa Martínez, Universitat Ramon Llull, Spain ... 541

Volume II

Energy Minimizing Active Models in Artificial Vision / Gloria Bueno García, University of Castilla – La Mancha, Spain; Antonio Martínez, University of Castilla – La Mancha, Spain; Roberto González, University of Castilla – La Mancha, Spain; and Manuel Torres, University of Castilla – La Mancha, Spain ... 547
Ensemble of ANN for Traffic Sign Recognition / M. Paz Sesmero Lorente, Universidad Carlos III de Madrid, Spain; Juan Manuel Alonso-Weber, Universidad Carlos III de Madrid, Spain; Germán Gutiérrez Sánchez, Universidad Carlos III de Madrid, Spain; Agapito Ledezma Espino, Universidad Carlos III de Madrid, Spain; and Araceli Sanchis de Miguel, Universidad Carlos III de Madrid, Spain ... 554
Ensemble of SVM Classifiers for Spam Filtering / Ángela Blanco, Universidad Pontificia de Salamanca, Spain; and Manuel Martín-Merino, Universidad Pontificia de Salamanca, Spain ... 561
Evolutionary Algorithms in Discredibility Detection / Bohumil Sulc, Czech Technical University in Prague, Czech Republic; and David Klimanek, Czech Technical University in Prague, Czech Republic ... 567
Evolutionary Approaches for ANNs Design / Antonia Azzini, University of Milan, Italy; and Andrea G.B. Tettamanzi, University of Milan, Italy ... 575

Evolutionary Approaches to Variable Selection / Marcos Gestal, University of A Coruña, Spain; and José Manuel Andrade, University of A Coruña, Spain ... 581
Evolutionary Computing Approach for Ad-Hoc Networks / Prayag Narula, University of Delhi, India; Sudip Misra, Yale University, USA; and Sanjay Kumar Dhurandher, University of Delhi, India ... 589
Evolutionary Grammatical Inference / Ernesto Rodrigues, Federal University of Technology, Brazil; and Heitor Silvério Lopes, Federal University of Technology, Brazil ... 596
Evolutionary Robotics / J. A. Becerra, University of A Coruña, Spain; and R. J. Duro, University of A Coruña, Spain ... 603
Evolved Synthesis of Digital Circuits / Laurenţiu Ionescu, University of Pitesti, Romania; Alin Mazare, University of Pitesti, Romania; Gheorghe Şerban, University of Pitesti, Romania; and Emil Sofron, University of Pitesti, Romania ... 609
Evolving Graphs for ANN Development and Simplification / Daniel Rivero, University of A Coruña, Spain; and David Periscal, University of A Coruña, Spain ... 618
Facial Expression Recognition for HCI Applications / Fadi Dornaika, Institut Géographique National, France; and Bogdan Raducanu, Computer Vision Center, Spain ... 625
Feature Selection / Noelia Sánchez-Maroño, University of A Coruña, Spain; and Amparo Alonso-Betanzos, University of A Coruña, Spain ... 632
Feed-Forward Artificial Neural Network Basics / Lluís A. Belanche Muñoz, Universitat Politècnica de Catalunya, Spain ... 639
Finding Multiple Solutions with GA in Multimodal Problems / Marcos Gestal, University of A Coruña, Spain; and Mari Paz Gómez-Carracedo, University of A Coruña, Spain ... 647
Full-Text Search Engines for Databases / László Kovács, University of Miskolc, Hungary; and Domonkos Tikk, Budapest University of Technology and Economics, Hungary ... 654
Functional Dimension Reduction for Chemometrics / Tuomas Kärnä, Helsinki University of Technology, Finland; and Amaury Lendasse, Helsinki University of Technology, Finland ... 661
Functional Networks / Oscar Fontenla-Romero, University of A Coruña, Spain; Bertha Guijarro-Berdiñas, University of A Coruña, Spain; and Beatriz Pérez-Sánchez, University of A Coruña, Spain ... 667
Fuzzy Approximation of DES State / Juan Carlos González-Castolo, CINVESTAV Unidad Guadalajara, Mexico; and Ernesto López-Mellado, CINVESTAV Unidad Guadalajara, Mexico ... 677
Fuzzy Control Systems: An Introduction / Guanrong Chen, City University of Hong Kong, Hong Kong; and Young Hoon Joo, Kunsan National University, Korea ... 688
Fuzzy Decision Trees / Malcolm J. Beynon, Cardiff University, UK ... 696

Fuzzy Graphs and Fuzzy Hypergraphs / Leonid S. Bershtein, Taganrog Technological Institute of Southern Federal University, Russia; and Alexander V. Bozhenyuk, Taganrog Technological Institute of Southern Federal University, Russia ... 704
Fuzzy Logic Applied to Biomedical Image Analysis / Alfonso Castro, University of A Coruña, Spain; and Bernardino Arcay, University of A Coruña, Spain ... 710
Fuzzy Logic Estimator for Variant SNR Environments / Rosa Maria Alsina Pagès, Universitat Ramon Llull, Spain; Clàudia Mateo Segura, Universitat Ramon Llull, Spain; and Joan-Claudi Socoró Carrié, Universitat Ramon Llull, Spain ... 719
Fuzzy Rule Interpolation / Szilveszter Kovács, University of Miskolc, Hungary ... 728
Fuzzy Systems Modeling: An Introduction / Young Hoon Joo, Kunsan National University, Korea; and Guanrong Chen, City University of Hong Kong, Hong Kong, China ... 734
Gene Regulation Network Use for Information Processing / Enrique Fernandez-Blanco, University of A Coruña, Spain; and J. Andrés Serantes, University of A Coruña, Spain ... 744
Genetic Algorithm Applications to Optimization Modeling / Pi-Sheng Deng, California State University at Stanislaus, USA ... 748
Genetic Algorithms for Wireless Sensor Networks / João H. Kleinschmidt, State University of Campinas, Brazil ... 755
Genetic Fuzzy Systems Applied to Ports and Coasts Engineering / Óscar Ibáñez, University of A Coruña, Spain; and Alberte Castro Ponte, University of Santiago de Compostela, Spain ... 759
Grammar-Guided Genetic Programming / Daniel Manrique, Inteligencia Artificial, Facultad de Informatica, UPM, Spain; Juan Ríos, Inteligencia Artificial, Facultad de Informatica, UPM, Spain; and Alfonso Rodríguez-Patón, Inteligencia Artificial, Facultad de Informatica, UPM, Spain ... 767
Granular Computing / Georg Peters, Munich University of Applied Sciences, Germany ... 774
Growing Self-Organizing Maps for Data Analysis / Soledad Delgado, Technical University of Madrid, Spain; Consuelo Gonzalo, Technical University of Madrid, Spain; Estíbaliz Martínez, Technical University of Madrid, Spain; and Águeda Arquero, Technical University of Madrid, Spain ... 781
GTM User Modeling for aIGA Weight Tuning in TTS Synthesis / Lluís Formiga, Universitat Ramon Llull, Spain; and Francesc Alías, Universitat Ramon Llull, Spain ... 788
Handling Fuzzy Similarity for Data Classification / Roy Gelbard, Bar-Ilan University, Israel; and Avichai Meged, Bar-Ilan University, Israel ... 796
Harmony Search for Multiple Dam Scheduling / Zong Woo Geem, Johns Hopkins University, USA ... 803
Hierarchical Neuro-Fuzzy Systems Part I / Marley Vellasco, PUC-Rio, Brazil; Marco Pacheco, PUC-Rio, Brazil; Karla Figueiredo, UERJ, Brazil; and Flavio Souza, UERJ, Brazil ... 808

Hierarchical Neuro-Fuzzy Systems Part II / Marley Vellasco, PUC-Rio, Brazil; Marco Pacheco, PUC-Rio, Brazil; Karla Figueiredo, UERJ, Brazil; and Flavio Souza, UERJ, Brazil ... 817
Hierarchical Reinforcement Learning / Carlos Diuk, Rutgers University, USA; and Michael Littman, Rutgers University, USA ... 825
High Level Design Approach for FPGA Implementation of ANNs / Nouma Izeboudjen, Center de Développement des Technologies Avancées (CDTA), Algérie; Ahcene Farah, Ajman University, UAE; Hamid Bessalah, Center de Développement des Technologies Avancées (CDTA), Algérie; Ahmed Bouridene, Queens University of Belfast, Ireland; and Nassim Chikhi, Center de Développement des Technologies Avancées (CDTA), Algérie ... 831
HOPS: A Hybrid Dual Camera Vision System / Stefano Cagnoni, Università degli Studi di Parma, Italy; Monica Mordonini, Università degli Studi di Parma, Italy; Luca Mussi, Università degli Studi di Perugia, Italy; and Giovanni Adorni, Università degli Studi di Genova, Italy ... 840
Hybrid Dual Camera Vision System / Stefano Cagnoni, Università degli Studi di Parma, Italy; Monica Mordonini, Università degli Studi di Parma, Italy; Luca Mussi, Università degli Studi di Perugia, Italy; and Giovanni Adorni, Università degli Studi di Genova, Italy ... 848
Hybrid Meta-Heuristics Based System for Dynamic Scheduling / Ana Maria Madureira, Polytechnic Institute of Porto, Portugal ... 853
Hybrid System for Automatic Infant Cry Recognition I, A / Carlos Alberto Reyes-García, Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico; Ramon Zatarain, Instituto Tecnologico de Culiacan, Mexico; Lucia Barron, Instituto Tecnologico de Culiacan, Mexico; and Orion Fausto Reyes-Galaviz, Universidad Autónoma de Tlaxcala, Mexico ... 860
Hybrid System for Automatic Infant Cry Recognition II, A / Carlos Alberto Reyes-García, Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico; Sandra E. Barajas, Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico; Esteban Tlelo-Cuautle, Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico; and Orion Fausto Reyes-Galaviz, Universidad Autónoma de Tlaxcala, Mexico ... 867
IA Algorithm Acceleration Using GPUs / Antonio Seoane, University of A Coruña, Spain; and Alberto Jaspe, University of A Coruña, Spain ... 873
Improving the Naïve Bayes Classifier / Liwei Fan, National University of Singapore, Singapore; and Kim Leng Poh, National University of Singapore, Singapore ... 879
Incorporating Fuzzy Logic in Data Mining Tasks / Lior Rokach, Ben Gurion University, Israel ... 884
Independent Subspaces / Lei Xu, Chinese University of Hong Kong, Hong Kong & Peking University, China ... 892
Information Theoretic Learning / Deniz Erdogmus, Northeastern University, USA; and Jose C. Principe, University of Florida, USA ... 902

Intelligent Classifier for Atrial Fibrillation (ECG) / O. Valenzuela, University of Granada, Spain; I. Rojas, University of Granada, Spain; F. Rojas, University of Granada, Spain; A. Guillen, University of Granada, Spain; L.J. Herrera, University of Granada, Spain; F.J. Rojas, University of Granada, Spain; and M. Cepero, University of Granada, Spain
Intelligent MAS in System Engineering and Robotics / G. Nicolás Marichal, University of La Laguna, Spain; and Evelio J. González, University of La Laguna, Spain
Intelligent Query Answering Mechanism in Multi Agent Systems / Safiye Turgay, Abant İzzet Baysal University, Turkey; and Fahrettin Yaman, Abant İzzet Baysal University, Turkey
Intelligent Radar Detectors / Raúl Vicen Bueno, University of Alcalá, Spain; Manuel Rosa Zurera, University of Alcalá, Spain; María Pilar Jarabo Amores, University of Alcalá, Spain; Roberto Gil Pita, University of Alcalá, Spain; and David de la Mata Moya, University of Alcalá, Spain
Intelligent Software Agents Analysis in E-Commerce I / Xin Luo, The University of New Mexico, USA; and Somasheker Akkaladevi, Virginia State University, USA
Intelligent Software Agents Analysis in E-Commerce II / Xin Luo, The University of New Mexico, USA; and Somasheker Akkaladevi, Virginia State University, USA
Intelligent Software Agents with Applications in Focus / Mario Janković-Romano, University of Belgrade, Serbia; Milan Stanković, University of Belgrade, Serbia; and Uroš Krčadinac, University of Belgrade, Serbia
Intelligent Traffic Sign Classifiers / Raúl Vicen Bueno, University of Alcalá, Spain; Elena Torijano Gordo, University of Alcalá, Spain; Antonio García González, University of Alcalá, Spain; Manuel Rosa Zurera, University of Alcalá, Spain; and Roberto Gil Pita, University of Alcalá, Spain
Interactive Systems and Sources of Uncertainties / Qiyang Chen, Montclair State University, USA; and John Wang, Montclair State University, USA
Intuitionistic Fuzzy Image Processing / Ioannis K. Vlachos, Aristotle University of Thessaloniki, Greece; and George D. Sergiadis, Aristotle University of Thessaloniki, Greece
Knowledge Management Systems Procedural Development / Javier Andrade, University of A Coruña, Spain; Santiago Rodríguez, University of A Coruña, Spain; María Seoane, University of A Coruña, Spain; and Sonia Suárez, University of A Coruña, Spain
Knowledge Management Tools and Their Desirable Characteristics / Juan Ares, University of A Coruña, Spain; Rafael García, University of A Coruña, Spain; María Seoane, University of A Coruña, Spain; and Sonia Suárez, University of A Coruña, Spain
Knowledge-Based Systems / Adrian A. Hopgood, De Montfort University, UK
Kohonen Maps and TS Algorithms / Marie-Thérèse Boyer-Xambeu, Université de Paris VII – LED, France; Ghislain Deleplace, Université de Paris VIII – LED, France; Patrice Gaubert, Université de Paris 12 – ERUDITE, France; Lucien Gillard, CNRS – LED, France; and Madalina Olteanu, Université de Paris I – CES SAMOS, France

Learning in Feed-Forward Artificial Neural Networks I / Lluís A. Belanche Muñoz, Universitat Politècnica de Catalunya, Spain
Learning in Feed-Forward Artificial Neural Networks II / Lluís A. Belanche Muñoz, Universitat Politècnica de Catalunya, Spain
Learning Nash Equilibria in Non-Cooperative Games / Alfredo Garro, University of Calabria, Italy
Learning-Based Planning / Sergio Jiménez Celorrio, Universidad Carlos III de Madrid, Spain; and Tomás de la Rosa Turbides, Universidad Carlos III de Madrid, Spain
Longitudinal Analysis of Labour Market Data with SOM, A / Patrick Rousset, CEREQ, France; and Jean-Francois Giret, CEREQ, France
Managing Uncertainties in Interactive Systems / Qiyang Chen, Montclair State University, USA; and John Wang, Montclair State University, USA
Many-Objective Evolutionary Optimisation / Francesco di Pierro, University of Exeter, UK; Soon-Thiam Khu, University of Exeter, UK; and Dragan A. Savić, University of Exeter, UK
Mapping Ontologies by Utilising Their Semantic Structure / Yi Zhao, Fernuniversitaet in Hagen, Germany; and Wolfgang A. Halang, Fernuniversitaet in Hagen, Germany
Mathematical Modeling of Artificial Neural Networks / Radu Mutihac, University of Bucharest, Romania
Microarray Information and Data Integration Using SAMIDI / Juan M. Gómez, Universidad Carlos III de Madrid, Spain; Ricardo Colomo, Universidad Carlos III de Madrid, Spain; Marcos Ruano, Universidad Carlos III de Madrid, Spain; and Ángel García, Universidad Carlos III de Madrid, Spain
Mobile Robots Navigation, Mapping, and Localization Part I / Lee Gim Hee, DSO National Laboratories, Singapore; and Marcelo H. Ang Jr., National University of Singapore, Singapore
Mobile Robots Navigation, Mapping, and Localization Part II / Lee Gim Hee, DSO National Laboratories, Singapore; and Marcelo H. Ang Jr., National University of Singapore, Singapore
Modal Logics for Reasoning about Multiagent Systems / Nikolay V. Shilov, Russian Academy of Science, Institute of Informatics Systems, Russia; and Natalia Garanina, Russian Academy of Science, Institute of Informatics Systems, Russia
Modularity in Artificial Neural Networks / Ricardo Téllez, Technical University of Catalonia, Spain; and Cecilio Angulo, Technical University of Catalonia, Spain
Morphological Filtering Principles / Jose Crespo, Universidad Politécnica de Madrid, Spain
MREM, Discrete Recurrent Network for Optimization / Enrique Mérida-Casermeiro, University of Málaga, Spain; Domingo López-Rodríguez, University of Málaga, Spain; and Juan M. Ortiz-de-Lazcano-Lobato, University of Málaga, Spain

Volume III

Multilayer Optimization Approach for Fuzzy Systems / Ivan N. Silva, University of São Paulo, Brazil; and Rogerio A. Flauzino, University of São Paulo, Brazil
Multi-Layered Semantic Data Models / László Kovács, University of Miskolc, Hungary; and Tanja Sieber, University of Miskolc, Hungary
Multilogistic Regression by Product Units / P.A. Gutiérrez, University of Córdoba, Spain; C. Hervás, University of Córdoba, Spain; F.J. Martínez-Estudillo, INSA – ETEA, Spain; and M. Carbonero, INSA – ETEA, Spain
Multi-Objective Evolutionary Algorithms / Sanjoy Das, Kansas State University, USA; and Bijaya K. Panigrahi, Indian Institute of Technology, India
Multi-Objective Training of Neural Networks / M. P. Cuéllar, Universidad de Granada, Spain; M. Delgado, Universidad de Granada, Spain; and M. C. Pegalajar, University of Granada, Spain
“Narrative” Information and the NKRL Solution / Gian Piero Zarri, LaLIC, University Paris 4-Sorbonne, France
“Narrative” Information Problems / Gian Piero Zarri, LaLIC, University Paris 4-Sorbonne, France
Natural Language Processing and Biological Methods / Gemma Bel Enguix, Rovira i Virgili University, Spain; and M. Dolores Jiménez López, Rovira i Virgili University, Spain
Natural Language Understanding and Assessment / Vasile Rus, The University of Memphis, USA; Philip McCarthy, University of Memphis, USA; Danielle S. McNamara, The University of Memphis, USA; and Art Graesser, University of Memphis, USA
Navigation by Image-Based Visual Homing / Matthew Szenher, University of Edinburgh, UK
Nelder-Mead Evolutionary Hybrid Algorithms / Sanjoy Das, Kansas State University, USA
Neural Control System for Autonomous Vehicles / Francisco García-Córdova, Polytechnic University of Cartagena (UPCT), Spain; Antonio Guerrero-González, Polytechnic University of Cartagena (UPCT), Spain; and Fulgencio Marín-García, Polytechnic University of Cartagena (UPCT), Spain
Neural Network-Based Visual Data Mining for Cancer Data / Enrique Romero, Technical University of Catalonia, Spain; Julio J. Valdés, National Research Council Canada, Canada; and Alan J. Barton, National Research Council Canada, Canada
Neural Network-Based Process Analysis in Sport / Juergen Perl, University of Mainz, Germany
Neural Networks and Equilibria, Synchronization, and Time Lags / Daniela Danciu, University of Craiova, Romania; and Vladimir Răsvan, University of Craiova, Romania

Neural Networks and HOS for Power Quality Evaluation / Juan J. González De la Rosa, Universities of Cádiz-Córdoba, Spain; Carlos G. Puntonet, University of Granada, Spain; and A. Moreno-Muñoz, Universities of Cádiz-Córdoba, Spain
Neural Networks on Handwritten Signature Verification / J. Francisco Vargas, University of Las Palmas de Gran Canaria, Spain & Universidad de Antioquia, Colombia; and Miguel A. Ferrer, University of Las Palmas de Gran Canaria, Spain
Neural/Fuzzy Computing Based on Lattice Theory / Vassilis G. Kaburlasos, Technological Educational Institution of Kavala, Greece
New Self-Organizing Map for Dissimilarity Data, A / Tien Ho-Phuoc, GIPSA-lab, France; and Anne Guerin-Dugue, GIPSA-lab, France
NLP Techniques in Intelligent Tutoring Systems / Chutima Boonthum, Hampton University, USA; Irwin B. Levinstein, Old Dominion University, USA; Danielle S. McNamara, The University of Memphis, USA; Joseph P. Magliano, Northern Illinois University, USA; and Keith K. Millis, The University of Memphis, USA
Non-Cooperative Facial Biometric Identification Systems / Carlos M. Travieso González, University of Las Palmas de Gran Canaria, Spain; and Aythami Morales Moreno, University of Las Palmas de Gran Canaria, Spain
Nonlinear Techniques for Signals Characterization / Jesús Bernardino Alonso Hernández, University of Las Palmas de Gran Canaria, Spain; and Patricia Henríquez Rodríguez, University of Las Palmas de Gran Canaria, Spain
Ontologies and Processing Patterns for Microarrays / Mónica Miguélez Rico, University of A Coruña, Spain; José Antonio Seoane Fernández, University of A Coruña, Spain; and Julián Dorado de la Calle, University of A Coruña, Spain
Ontologies for Education and Learning Design / Manuel Lama, University of Santiago de Compostela, Spain; and Eduardo Sánchez, University of Santiago de Compostela, Spain
Ontology Alignment Overview / José Manuel Vázquez Naya, University of A Coruña, Spain; Marcos Martínez Romero, University of A Coruña, Spain; Javier Pereira Loureiro, University of A Coruña, Spain; and Alejandro Pazos Sierra, University of A Coruña, Spain
Ontology Alignment Techniques / Marcos Martínez Romero, University of A Coruña, Spain; José Manuel Vázquez Naya, University of A Coruña, Spain; Javier Pereira Loureiro, University of A Coruña, Spain; and Norberto Ezquerra, Georgia Institute of Technology, USA
Optimization of the Acoustic Systems / V. Romero-García, Polytechnic University of Valencia, Spain; E. Fuster-Garcia, Polytechnic University of Valencia, Spain; J. V. Sánchez-Pérez, Polytechnic University of Valencia, Spain; L. M. Garcia-Raffi, Polytechnic University of Valencia, Spain; X. Blasco, Polytechnic University of Valencia, Spain; J. M. Herrero, Polytechnic University of Valencia, Spain; and J. Sanchis, Polytechnic University of Valencia, Spain

Particle Swarm Optimization and Image Analysis / Stefano Cagnoni, Università degli Studi di Parma, Italy; and Monica Mordonini, Università degli Studi di Parma, Italy
Personalized Decision Support Systems / Neal Shambaugh, West Virginia University, USA
Planning Agent for Geriatric Residences / Javier Bajo, Universidad Pontificia de Salamanca, Spain; Dante I. Tapia, Universidad de Salamanca, Spain; Sara Rodríguez, Universidad de Salamanca, Spain; and Juan M. Corchado, Universidad de Salamanca, Spain
Privacy-Preserving Estimation / Mohammad Saad Al-Ahmadi, King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia; and Rathindra Sarathy, Oklahoma State University, USA
Protein Structure Prediction by Fusion, Bayesian Methods / Somasheker Akkaladevi, Virginia State University, USA; Ajay K. Katangur, Texas A&M University – Corpus Christi, USA; and Xin Luo, The University of New Mexico, USA
Prototype Based Classification in Bioinformatics / Frank-M. Schleif, University of Leipzig, Germany; Thomas Villmann, University of Leipzig, Germany; and Barbara Hammer, Technical University of Clausthal, Germany
Randomized Hough Transform / Lei Xu, Chinese University of Hong Kong, Hong Kong & Peking University, China; and Erkki Oja, Helsinki University of Technology, Finland
Ranking Functions / Franz Huber, California Institute of Technology, USA
RBF Networks for Power System Topology Verification / Robert Lukomski, Wroclaw University of Technology, Poland; and Kazimierz Wilkosz, Wroclaw University of Technology, Poland
Representing Non-Rigid Objects with Neural Networks / José García-Rodríguez, University of Alicante, Spain; Francisco Flórez-Revuelta, University of Alicante, Spain; and Juan Manuel García-Chamizo, University of Alicante, Spain
Roadmap on Updates, A / Fernando Zacarías Flores, Benemérita Universidad Autónoma de Puebla, México; Dionicio Zacarías Flores, Benemérita Universidad Autónoma de Puebla, Mexico; Rosalba Cuapa Canto, Benemérita Universidad Autónoma de Puebla, Mexico; and Luis Miguel Guzmán Muñoz, Benemérita Universidad Autónoma de Puebla, Mexico
Robot Model of Dynamic Appraisal and Response, A / Carlos Herrera, Intelligent Systems Research Centre University of Ulster, North Ireland; Tom Ziemke, University of Skovde, Sweden; and Thomas M. McGinnity, Intelligent Systems Research Centre University of Ulster, North Ireland
Robots in Education / Muhammad Ali Yousuf, Tecnologico de Monterrey – Santa Fe Campus, México
Robust Learning Algorithm with LTS Error Function / Andrzej Rusiecki, Wroclaw University of Technology, Poland
Rough Set-Based Neuro-Fuzzy System / Kai Keng Ang, Institute for Infocomm Research, Singapore; and Chai Quek, Nanyang Technological University, Singapore

Rule Engines and Agent-Based Systems / Agostino Poggi, Università di Parma, Italy; and Michele Tomaiuolo, Università di Parma, Italy
Sequence Processing with Recurrent Neural Networks / Chun-Cheng Peng, University of London, UK; and George D. Magoulas, University of London, UK
Shortening Automated Negotiation Threads via Neural Nets / Ioanna Roussaki, National Technical University of Athens, Greece; Ioannis Papaioannou, National Technical University of Athens, Greece; and Miltiades Anagnostou, National Technical University of Athens, Greece
Signed Formulae as a New Update Process / Fernando Zacarías Flores, Benemérita Universidad Autónoma de Puebla, Mexico; Dionicio Zacarías Flores, Benemérita Universidad Autónoma de Puebla, Mexico; Rosalba Cuapa Canto, Benemérita Universidad Autónoma de Puebla, Mexico; and Luis Miguel Guzmán Muñoz, Benemérita Universidad Autónoma de Puebla, Mexico
Solar Radiation Forecasting Model / Fatih Onur Hocaoğlu, Anadolu University Eskisehir, Turkey; Ömer Nezih Gerek, Anadolu University Eskisehir, Turkey; and Mehmet Kurban, Anadolu University Eskisehir, Turkey
Speech-Based Clinical Diagnostic Systems / Jesús Bernardino Alonso Hernández, University of Las Palmas de Gran Canaria, Spain; and Patricia Henríquez Rodríguez, University of Las Palmas de Gran Canaria, Spain
State of the Art in Writer’s Off-Line Identification / Carlos M. Travieso González, University of Las Palmas de Gran Canaria, Spain; and Carlos F. Romero, University of Las Palmas de Gran Canaria, Spain
State-of-the-Art on Video-Based Face Recognition / Yan Yan, Tsinghua University, Beijing, China; and Yu-Jin Zhang, Tsinghua University, Beijing, China
Stationary Density of Stochastic Search Processes / Arturo Berrones, Universidad Autónoma de Nuevo León, México; Dexmont Peña, Universidad Autónoma de Nuevo León, Mexico; and Ricardo Sánchez, Universidad Autónoma de Nuevo León, Mexico
Statistical Modelling of Highly Inflective Languages / Mirjam Sepesy Maučec, University of Maribor, Slovenia; and Zdravko Kačič, University of Maribor, Slovenia
Statistical Simulations on Perceptron-Based Adders / Snorre Aunet, University of Oslo, Norway & Centers for Neural Inspired Nano Architectures, Norway; and Hans Kristian Otnes Berge, University of Oslo, Norway
Stochastic Approximation Monte Carlo for MLP Learning / Faming Liang, Texas A&M University, USA
Stream Processing of a Neural Classifier I / M. Martínez-Zarzuela, University of Valladolid, Spain; F. J. Díaz Pernas, University of Valladolid, Spain; D. González Ortega, University of Valladolid, Spain; J. F. Díez Higuera, University of Valladolid, Spain; and M. Antón Rodríguez, University of Valladolid, Spain

Stream Processing of a Neural Classifier II / M. Martínez-Zarzuela, University of Valladolid, Spain; F. J. Díaz Pernas, University of Valladolid, Spain; D. González Ortega, University of Valladolid, Spain; J. F. Díez Higuera, University of Valladolid, Spain; and M. Antón Rodríguez, University of Valladolid, Spain
Study of the Performance Effect of Genetic Operators, A / Pi-Sheng Deng, California State University at Stanislaus, USA
Supervised Learning of Fuzzy Logic Systems / M. Mohammadian, University of Canberra, Australia
Support Vector Machines / Cecilio Angulo, Technical University of Catalonia, Spain; and Luis Gonzalez-Abril, Technical University of Catalonia, Spain
Survey on Neural Networks in Automated Negotiations, A / Ioannis Papaioannou, National Technical University of Athens, Greece; Ioanna Roussaki, National Technical University of Athens, Greece; and Miltiades Anagnostou, National Technical University of Athens, Greece
Swarm Intelligence Approach for Ad-Hoc Networks / Prayag Narula, University of Delhi, India; Sudip Misra, Yale University, USA; and Sanjay Kumar Dhurandher, University of Delhi, India
Swarm Robotics / Amanda J.C. Sharkey, University of Sheffield, UK
Symbol Grounding Problem / Angelo Loula, State University of Feira de Santana, Brazil & State University of Campinas (UNICAMP), Brazil; and João Queiroz, State University of Campinas (UNICAMP), Brazil & Federal University of Bahia, Brazil
Symbolic Search / Stefan Edelkamp, University of Dortmund, Germany
Synthetic Neuron Implementations / Snorre Aunet, University of Oslo, Norway & Centers for Neural Inspired Nano Architectures, Norway
Teaching Machines to Find Names / Raymond Chiong, Swinburne University of Technology, Sarawak Campus, Malaysia
Thermal Design of Gas-Fired Cooktop Burners Through ANN / T.T. Wong, The Hong Kong Polytechnic University, Hong Kong; and C.W. Leung, The Hong Kong Polytechnic University, Hong Kong
2D Positioning Application in PET Using ANNs, A / Fernando Mateo, Universidad Politécnica de Valencia, Spain; Ramón J. Aliaga, Universidad Politécnica de Valencia, Spain; Jorge D. Martínez, Universidad Politécnica de Valencia, Spain; José Mª Monzó, Universidad Politécnica de Valencia, Spain; and Rafael Gadea, Universidad Politécnica de Valencia, Spain
2D-PAGE Analysis Using Evolutionary Computation / Pablo Mesejo, University of A Coruña, Spain; Enrique Fernández-Blanco, University of A Coruña, Spain; Diego Martínez-Feijóo, University of A Coruña, Spain; and Francisco J. Blanco, Juan Canalejo Hospital, Spain
Visualizing Cancer Databases Using Hybrid Spaces / Julio J. Valdés, National Research Council Canada, Canada; and Alan J. Barton, National Research Council Canada, Canada

Voltage Instability Detection Using Neural Networks / Adnan Khashman, Near East University, Turkey; Kadri Buruncuk, Near East University, Turkey; and Samir Jabr, Near East University, Turkey
Wave Reflection at Submerged Breakwaters / Alberte Castro Ponte, University of Santiago de Compostela, Spain; Gregorio Iglesias Rodriguez, University of Santiago de Compostela, Spain; Francisco Taveira Pinto, University of Santiago de Compostela, Spain; and Rodrigo Carballo Sanchez, University of Santiago de Compostela, Spain
Web-Based Assessment System Applying Many-Valued Logic / Sylvia Encheva, Haugesund University College, Norway; and Sharil Tumin, University of Bergen, Norway
Workflow Management Based on Mobile Agent Technology / Marina Flores-Badillo, CINVESTAV Unidad Guadalajara, Mexico; and Ernesto López-Mellado, CINVESTAV Unidad Guadalajara, Mexico


Preface

Throughout history, humankind has sought to enhance three main capacities: the physical, the metaphysical and the intellectual. From the physical viewpoint, people invented and developed all kinds of tools (levers, wheels, cams, pistons, etc.), culminating in the sophisticated machines that exist today. Regarding the metaphysical aspect, the early celebration of magical-animistic rituals led to attempts, either real or literary, to create life ex nihilo: life from inert substance. The most recent approaches involve the cryopreservation of deceased people so that they may be returned to life in the future; another current aim is the generation of life in the laboratory by means of cells, tissues, organs, systems or individuals created from previously frozen stem cells.

The third aspect, the intellectual one, is the most interesting here. There have been many contributions, from devices that increased calculating ability, such as the abacus, to later theoretical proposals for solving problems, such as the Ars Magna by Ramon Llull. The first known written reference to Artificial Intelligence is The Iliad, where Homer describes the visit of the goddess Thetis, mother of Achilles, to the workshop of Hephaestus, god of smiths:

At once he was helped along by female servants made of gold, who moved to him. They look like living servant girls, possessing minds, hearts with intelligence, vocal chords, and strength.

However, the first reference to Artificial Intelligence as it is currently understood can be found in the proposal that J. McCarthy made to the Rockefeller Foundation in 1955; the proposal sought funds to support a two-month meeting of ten researchers, the Dartmouth Summer Research Project of 1956, in order to establish the basis of what McCarthy named Artificial Intelligence.

Although the precursors of Artificial Intelligence (S. Ramón y Cajal, N. Wiener, D. Hebb, C. Shannon and W. McCulloch, among many others) come from many scientific disciplines, the true driving forces (A. Turing, J. von Neumann, M. Minsky, K. Gödel, ...) emerged in the second third of the twentieth century with the appearance of certain tools, computers, capable of handling fairly complex problems. Other scientists, such as J. Hopfield and J. Holland, proposed in the last third of the century biology-inspired approaches that made it possible to tackle complex real-world problems, including those that require a certain adaptive ability.

This long and productive history of Artificial Intelligence called for an encyclopaedia that would capture the current state of this multidisciplinary topic, where researchers from fields such as neuroscience, computer science, cognitive science, the exact sciences and various engineering areas converge. This work intends to provide a wide and well-balanced coverage of all the points of interest that currently exist in the field of Artificial Intelligence, from the most theoretical fundamentals to the most recent industrial applications. Many researchers were contacted, and calls for contributions were announced in different forums of the scientific field dealt with here. All the proposals were carefully reviewed by the editors in order to balance the contributions as far as possible, with the intention of producing a suitably broad document that exemplifies this field.


A first selection was made after all the proposals had been received, and it was then sent to three external expert reviewers for a double-blind peer review. As a result of this strict and complex process, and before final acceptance, a large number of contributions (approximately 80%) were rejected or had to be modified. The editors believe the effort of the last two years to have been worthwhile; with the invaluable help of the many people mentioned in the acknowledgements, they have managed to get this complete encyclopaedia off the ground. The numbers speak for themselves: 233 published articles, written by 442 authors from 38 different countries and reviewed by 238 scientific reviewers.

The diverse and comprehensive coverage of the disciplines directly related to Artificial Intelligence should also contribute to a better understanding of all the research related to this important field of study. The contributions compiled in this work are also intended to have a considerable impact on the expansion and development of the body of knowledge related to this wide field, so that the work becomes an important reference source for researchers and system developers in this area. It is hoped that the encyclopaedia will be an effective aid to a better understanding of the concepts, problems, trends, challenges and opportunities related to this field of study; it should be useful to fellow researchers, teaching staff, students, etc.
The editors will be happy to know that this work could inspire the readers for contributing to new advances and discoveries in this fantastic work area that might themselves also contribute to a better life quality of different society aspects: productive processes, health care or any other area where a system or product developed by techniques and procedures of Artificial Intelligence might be used.


About the Editors

Juan Ramón Rabuñal Dopico is associate professor in the Department of Information and Communications Technologies, University of A Coruña (Spain). He graduated in computer science in 1996; in 2002 he received a PhD in computer science with his thesis “Methodology for the Development of Knowledge Extraction Systems in ANNs”, and in 2008 he received a PhD in civil engineering. He has worked on several Spanish and European projects and has published many books and papers in several international journals. He is currently working in the areas of evolutionary computation, artificial neural networks, and knowledge extraction systems.

Julian Dorado is associate professor in the Faculty of Computer Science, University of A Coruña (Spain). He graduated in computer science in 1994. In 1999, he received a PhD, with a special mention of European doctor. In 2004, he graduated in biology. He has worked as a university teacher for more than 8 years. He has published many books and papers in several journals and international conferences. He is presently working on bioinformatics, evolutionary computing, artificial neural networks, computer graphics, and data mining.

Alejandro Pazos is professor in computer science, University of A Coruña (Spain). He was born in Padron in 1959. He received his MD from the Faculty of Medicine, University of Santiago de Compostela, in 1987. He obtained a Master of Knowledge Engineering in 1989 and a PhD in computer science in 1990 from the Polytechnic University of Madrid. He also obtained a PhD in medicine in 1996 from the Complutense University of Madrid. He has worked with research groups at Georgia Institute of Technology, Harvard Medical School, Stanford University, Polytechnic University of Madrid, etc. He founded and is the director of the research laboratory Artificial Neural Networks and Adaptive Systems in the Computer Science Faculty, and is co-director of the Medical Informatics and Radiology Diagnostic Center at the University of A Coruña.

Active Learning with SVM Jun Jiang City University of Hong Kong, Hong Kong Horace H. S. Ip City University of Hong Kong, Hong Kong

INTRODUCTION

With the increasing demand for multimedia information retrieval, such as image and video retrieval from the Web, there is a need to train classifiers when the training dataset combines a small number of labelled samples with a large number of unlabelled ones. Traditional supervised or unsupervised learning methods are not suited to such problems, particularly when the data lie in a high-dimensional space. In recent years, many methods have been proposed that can be broadly divided into two groups: semi-supervised learning and active learning (AL). Because the Support Vector Machine (SVM) has been recognized as an efficient tool for high-dimensional problems, a number of researchers have proposed algorithms for Active Learning with SVM (ALSVM) since the turn of the century. Considering their rapid development, we review in this chapter the state of the art of ALSVM for solving classification problems.

BACKGROUND

The general framework of AL is described in Figure 1. The name active learning comes from the fact that the learner improves the classifier by actively choosing the “optimal” data from the potential query set Q and adding them, once labeled, to the current labeled training set L. The key point of AL is its sample selection criterion. In the past, AL was mainly used together with neural networks and other learning algorithms. Statistical AL is one classical approach, in which the sample minimizing either the variance (D. A. Cohn, Ghahramani, & Jordan, 1996), the bias (D. A. Cohn, 1997) or the generalisation error (Roy & McCallum, 2001) is queried to the oracle. Although these methods have a strong theoretical foundation, two common problems limit their application: one is how to estimate the posterior distribution of the samples, and the other is their prohibitively high computational cost. To deal with these two problems, a series of version space based AL methods have been proposed. They assume that the target function can be perfectly expressed by one hypothesis in the version space, and they choose the sample that reduces the volume of the version space. Examples are query by committee (Freund, Seung, Shamir, & Tishby, 1997) and SG AL (D. Cohn, Atlas, & Ladner, 1994). However, the complexity of the version space made them intractable until version space based ALSVMs emerged. The success of SVM in the 1990s prompted researchers to combine AL with SVM to deal with semi-supervised learning problems, giving rise to distance-based (Tong & Koller, 2001), RETIN (Gosselin & Cord, 2004) and multi-view (Cheng & Wang, 2007) ALSVMs. In the following sections, we summarize existing well-known ALSVMs under the framework of version space theory, and then briefly describe some mixed strategies. Lastly, we discuss the research trends for ALSVM and conclude the chapter.

VERSION SPACE BASED ACTIVE LEARNING WITH SVM

The idea of almost all existing heuristic ALSVMs is, explicitly or implicitly, to find the sample which can reduce the volume of the version space. In this section, we first introduce their theoretical foundation and then review some typical ALSVMs.

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.


Figure 1. Framework of active learning

Initialize step: A classifier h is trained on the initial labeled training set L.
Step 1: The learner evaluates each sample x in the potential query set Q (a subset of, or the whole, unlabeled data set U), queries the sample x* that has the lowest EvalFun(x, L, h, H) to the oracle, and gets its label y*.
Step 2: The learner updates the classifier h with the enlarged training set L ∪ {(x*, y*)}.
Step 3: Repeat steps 1 and 2 until training stops.

where
EvalFun(x, L, h, H): the function for evaluating a potential query x (the lowest value is the best here)
L: the current labeled training set
H: the hypothesis space
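The framework of Figure 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the names `active_learning_loop`, `train`, `eval_fun` and `oracle`, the fixed query budget used as the stopping rule, and the one-dimensional threshold classifier in the toy part are all assumptions introduced here, not part of the chapter.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, ...]
Labeled = List[Tuple[Sample, int]]


def active_learning_loop(
    labeled: Labeled,
    pool: Sequence[Sample],
    train: Callable[[Labeled], object],
    eval_fun: Callable[[Sample, Labeled, object], float],
    oracle: Callable[[Sample], int],
    budget: int,
):
    """Run the query loop of Figure 1 for at most `budget` queries."""
    labeled = list(labeled)
    pool = list(pool)
    h = train(labeled)  # initialize step
    for _ in range(budget):  # step 3 (assumed stopping rule: fixed budget)
        if not pool:
            break
        # Step 1: query the sample with the lowest evaluation score.
        x_star = min(pool, key=lambda x: eval_fun(x, labeled, h))
        y_star = oracle(x_star)
        pool.remove(x_star)
        # Step 2: retrain on the enlarged training set.
        labeled.append((x_star, y_star))
        h = train(labeled)
    return h, labeled


# Toy (hypothetical) learner: a threshold on a single feature.
def train_threshold(labeled: Labeled) -> float:
    neg = [x[0] for x, y in labeled if y == -1]
    pos = [x[0] for x, y in labeled if y == 1]
    return (max(neg) + min(pos)) / 2.0


def distance_to_threshold(x: Sample, labeled: Labeled, h: float) -> float:
    return abs(x[0] - h)


def toy_oracle(x: Sample) -> int:
    # The "true" boundary 0.37 is made up for the demonstration.
    return 1 if x[0] >= 0.37 else -1
```

With the toy learner, each query lands next to the current threshold, so a few labels are enough to bracket the boundary; any other `eval_fun` (for instance one of the criteria reviewed in the following sections) can be plugged in without changing the loop.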



Version Space Theory

Based on the Probably Approximately Correct (PAC) learning model, the goal of machine learning is to find a consistent classifier which has the lowest generalization error bound. The Gibbs generalization error bound (McAllester, 1998) is defined as

$$\varepsilon_{Gibbs}(m, P_H, z, \delta) = \frac{1}{m}\left(\ln\frac{e\,m^2}{\delta} + \ln\frac{1}{P_H(V(z))}\right)$$

where $P_H$ denotes a prior distribution over the hypothesis space H, V(z) denotes the version space of the training set z, m is the number of samples in z, and δ is a constant in [0, 1]. It follows that the generalization error bound of the consistent classifiers is controlled by the volume of the version space if the distribution over the version space is uniform. This provides a theoretical justification for version space based ALSVMs.
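To make the dependence on m and on the prior mass of the version space concrete, the bound can be evaluated numerically. The sketch below assumes the bound has the form (1/m)(ln(e·m²/δ) + ln(1/P_H(V(z)))); the function name `gibbs_bound` and its argument names are hypothetical.

```python
import math


def gibbs_bound(m: int, version_space_mass: float, delta: float) -> float:
    """Evaluate (1/m) * (ln(e * m^2 / delta) + ln(1 / P_H(V(z)))).

    `version_space_mass` stands for P_H(V(z)), the prior mass of the version
    space; this form of the bound is an assumption stated in the lead-in.
    """
    if m <= 0:
        raise ValueError("m must be a positive number of training samples")
    if not 0.0 < version_space_mass <= 1.0:
        raise ValueError("P_H(V(z)) must lie in (0, 1]")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    return (
        math.log(math.e * m * m / delta) + math.log(1.0 / version_space_mass)
    ) / m


# Example values (made up): 200 samples, prior mass 0.01, delta = 0.05.
example_bound = gibbs_bound(200, 0.01, 0.05)
```

For fixed δ and P_H(V(z)), the value decays roughly like (ln m)/m, and the version space enters only through the single term ln(1/P_H(V(z))).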

Query by Committee with SVM

This algorithm was proposed by Freund et al. (1997). In it, 2k classifiers are randomly sampled from the version space, and the sample on which these classifiers disagree most, which approximately halves the version space, is queried to the oracle. However, the complex structure of the version space makes random sampling within it difficult.

Warmuth, Ratsch, Mathieson, Liao, and Lemmem (2003) successfully applied the billiard algorithm to randomly sample classifiers in the SVM version space, and their experiments showed that its performance was comparable to that of the standard distance-based ALSVM (SD-ALSVM), which is introduced later. Its drawback is that the sampling process is time-consuming.
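The selection step of query by committee can be illustrated as follows. The sketch takes the committee as given and does not attempt the hard part, sampling classifiers from the version space; the names `committee_disagreement` and `query_by_committee` and the four threshold classifiers used as committee members are assumptions made for the example.

```python
from typing import Callable, Sequence


def committee_disagreement(votes: Sequence[int]) -> float:
    """Return a score in [0, 1]; 1.0 means the +1/-1 votes are split evenly."""
    positive = sum(1 for v in votes if v > 0)
    negative = len(votes) - positive
    return 1.0 - abs(positive - negative) / len(votes)


def query_by_committee(
    pool: Sequence[float],
    committee: Sequence[Callable[[float], int]],
) -> float:
    """Pick the pool sample on which the committee disagrees most."""
    return max(
        pool,
        key=lambda x: committee_disagreement([h(x) for h in committee]),
    )


# Hypothetical committee of 2k = 4 threshold classifiers, all consistent
# with a labeled set in which 0.0 is negative and 1.0 is positive.
committee = [
    (lambda x, t=t: 1 if x >= t else -1) for t in (0.2, 0.4, 0.6, 0.8)
]
chosen = query_by_committee([0.1, 0.5, 0.9], committee)
```

Here the sample 0.5 splits the committee two against two, so whichever label the oracle returns, half of the committee members become inconsistent; this is the sense in which such a query roughly halves the version space.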

Standard Distance Based Active Learning with SVM

For SVM, the version space can be defined as

$$V = \{\, w \in W \mid \|w\| = 1,\; y_i\,(w \cdot \Phi(x_i)) > 0,\; i = 1, \ldots, m \,\}$$

where Φ(·) denotes the function which maps the original input space X into a high-dimensional space Φ(X), and W denotes the parameter space. SVM has two properties which make it tractable with AL. The first is its duality property: each point w in V corresponds to one hyperplane in Φ(X) which divides Φ(X) into two parts, and vice versa. The other property is that the SVM solution w* is the center of the version space when the version space is symmetric, and is near its center when it is asymmetric. Based on these two properties, Tong and Koller (2001) inferred a lemma that the sample nearest to the decision boundary makes the expected size of the version space decrease fastest. Thus the sample nearest to the decision boundary is queried to the oracle (Figure 2). This is the so-called SD-ALSVM, which requires little additional computation to select the queried sample and performs well in real applications.

Figure 2. Illustration of standard distance-based ALSVM. (a) The projection of the parameter space around the version space, showing the hyperplanes induced by the support vectors, the hyperplanes induced by the candidate samples a, b and c, the SVM solution w*, and the largest inscribed hypersphere. (b) The induced feature space, showing the support vectors, the margin, the −1 and +1 classes, and the candidate unlabeled samples a, b and c.
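The selection rule of SD-ALSVM can be sketched as below. The sketch assumes an SVM has already been trained and is available through its support vectors, its dual coefficients α_i·y_i and its bias; the function names, the RBF kernel and the two-support-vector model are illustrative assumptions. Since the distance of Φ(x) to the hyperplane is |f(x)|/‖w‖ and ‖w‖ is the same for every candidate, ranking by |f(x)| is enough.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Tuple[float, ...]
Kernel = Callable[[Vector, Vector], float]


def rbf_kernel(a: Vector, b: Vector, gamma: float = 1.0) -> float:
    """Gaussian kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def decision_value(
    x: Vector,
    support_vectors: Sequence[Vector],
    dual_coefs: Sequence[float],
    bias: float,
    kernel: Kernel = rbf_kernel,
) -> float:
    """f(x) = sum_i (alpha_i * y_i) * k(x_i, x) + b."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias


def select_nearest_to_boundary(
    pool: Sequence[Vector],
    support_vectors: Sequence[Vector],
    dual_coefs: Sequence[float],
    bias: float,
    kernel: Kernel = rbf_kernel,
) -> Vector:
    """Return the pool sample with the smallest |f(x)|."""
    return min(
        pool,
        key=lambda x: abs(decision_value(x, support_vectors, dual_coefs, bias, kernel)),
    )


# Hypothetical trained model: one negative and one positive support vector.
svs = [(0.0, 0.0), (2.0, 0.0)]
coefs = [-1.0, 1.0]
query = select_nearest_to_boundary([(0.2, 0.0), (1.0, 0.5), (1.9, 0.0)], svs, coefs, 0.0)
```

In this toy model the candidate (1.0, 0.5) is equidistant from both support vectors, so its decision value is zero and it is the one queried.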

Batch Running Mode Distance Based Active Learning with SVM

When utilizing batch queries, Tong and Koller (2001) simply selected the multiple samples nearest to the decision boundary. However, adding a batch of such samples cannot ensure the largest reduction of the size of the version space, as the example in Figure 3 shows. Although each sample alone can nearly halve the version space, the three samples together still reduce its size by only about 1/2 instead of 7/8. This is due to the small angles between their induced hyperplanes. To overcome this problem, Brinker (2003) proposed a new selection strategy that incorporates a diversity measure considering the angles between the induced hyperplanes. Let the labeled set be L and the pool query set be Q in the current round; then, based on the diversity criterion, the next sample $x_q$ to be added should be

$$x_q = \arg\min_{x_j \in Q}\; \max_{x_i \in L}\; \frac{|k(x_j, x_i)|}{\sqrt{k(x_j, x_j)\,k(x_i, x_i)}}$$

where $\frac{|k(x_j, x_i)|}{\sqrt{k(x_j, x_j)\,k(x_i, x_i)}}$ denotes the cosine of the angle between the two hyperplanes induced by $x_j$ and $x_i$; the criterion is therefore known as the angle diversity criterion. The reduced volume of the version space in Figure 4 is larger than that in Figure 3.

Figure 3. One example of simple batch querying with samples “a”, “b” and “c” using pure SD-ALSVM. The panel shows the version space, the hyperplanes induced by the support vectors and by the candidate samples, the SVM solution w*, and the largest inscribed hypersphere.

Figure 4. One example of batch querying with samples “a”, “b” and “c” by incorporating diversity into SD-ALSVM. The panel shows the same elements as Figure 3.
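The angle diversity criterion can be sketched as a greedy batch builder. The sketch assumes the criterion is the min-max of |k(x_j, x_i)| / sqrt(k(x_j, x_j) k(x_i, x_i)); adding each chosen sample to the reference set before picking the next one is an assumption of this sketch (so that batch members also stay diverse with respect to each other), and the names `hyperplane_cosine`, `select_diverse_batch` and `linear_kernel` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]
Kernel = Callable[[Vector, Vector], float]


def linear_kernel(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


def hyperplane_cosine(a: Vector, b: Vector, kernel: Kernel) -> float:
    """|k(a, b)| / sqrt(k(a, a) * k(b, b)); assumes k(x, x) > 0."""
    return abs(kernel(a, b)) / math.sqrt(kernel(a, a) * kernel(b, b))


def select_diverse_batch(
    pool: Sequence[Vector],
    reference: Sequence[Vector],
    batch_size: int,
    kernel: Kernel = linear_kernel,
) -> List[Vector]:
    """Greedily pick samples whose largest cosine to the reference set is smallest.

    `reference` must be non-empty (for example the current labeled set).
    """
    reference = list(reference)
    candidates = list(pool)
    chosen: List[Vector] = []
    while candidates and len(chosen) < batch_size:
        best = min(
            candidates,
            key=lambda xj: max(hyperplane_cosine(xj, xi, kernel) for xi in reference),
        )
        chosen.append(best)
        reference.append(best)  # assumption: later picks also avoid earlier picks
        candidates.remove(best)
    return chosen


batch = select_diverse_batch(
    pool=[(1.0, 0.1), (0.0, 1.0), (1.0, 1.0)],
    reference=[(1.0, 0.0)],
    batch_size=2,
)
```

In the example, (1.0, 0.1) is almost parallel to the reference sample and is never selected, whereas the orthogonal sample (0.0, 1.0) is picked first. In practice this score would be combined with the distance to the decision boundary rather than used alone.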

RETIN Active Learning

Let $(I_j)_{j \in [1 \ldots n]}$ be the samples in a potential query set Q, and let r(i, k) be the function that, at iteration i, codes the position k in the relevance ranking according to the distance to the current decision boundary. Then the following sequence can be obtained:

$$\underbrace{I_{r(i,1)},\, I_{r(i,2)},\, \ldots}_{\text{most relevant}},\; \underbrace{I_{r(i,s(i))},\, \ldots,\, I_{r(i,s(i)+m-1)}}_{\text{queried data}},\; \underbrace{\ldots,\, I_{r(i,n)}}_{\text{least relevant}}$$

In SD-ALSVM, s(i) is chosen such that $I_{r(i,s(i))}, \ldots, I_{r(i,s(i)+m-1)}$ are the m samples closest to the SVM boundary. This strategy implicitly relies on a strong assumption: an accurate estimate of the SVM boundary. However, the decision boundary is usually unstable during the initial iterations. Gosselin and Cord (2004) noticed that, even if the decision boundary changes a lot during the earlier iterations, the ranking function r() is quite stable. They therefore proposed a balanced selection criterion that is independent of the frontier and in which an adaptive method tunes s during the feedback iterations. It was expressed by

$$s(i + 1) = s(i) + h(r_{rel}(i), r_{irr}(i))$$

where h(x, y) = k × (x − y) characterizes the system dynamics (k is a positive constant), and $r_{rel}(i)$ and $r_{irr}(i)$ denote the numbers of relevant and irrelevant samples in the queried set at the ith iteration. In this way, the numbers of relevant and irrelevant samples in the queried set become roughly equal.

Mean Version Space Criterion

He, Li, Zhang, Tong, and Zhang (2004) proposed a selection criterion that minimizes the mean version space, defined as

$$C_{MVS}(x_k) = Vol(V_i^{+}(x_k))\,P(y_k = 1 \mid x_k) + Vol(V_i^{-}(x_k))\,P(y_k = -1 \mid x_k)$$

where $Vol(V_i^{+}(x_k))$ ($Vol(V_i^{-}(x_k))$) denotes the volume of the version space after adding the unlabelled sample $x_k$, labelled as positive (negative), into the ith-round training set. The mean version space includes both the volume of the version space and the posterior probabilities. The authors therefore considered the criterion to be better than SD-ALSVM. However, the computation of this method is time-consuming.

Multi-View Based Active Learning

Unlike the algorithms which are based on one whole feature set, multi-view methods are based on multiple feature subsets. Several classifiers are first trained on different feature subsets. Then the samples on which the classifiers disagree most form the contention set, from which the queried samples are selected. Multi-view learning was first applied to AL by Muslea, Minton, and Knoblock (2000), and Cheng and Wang (2007) implemented it with ALSVM to produce the Co-SVM algorithm, which was reported to perform better than SD-ALSVM. Multiple classifiers can find rare samples because they observe the samples from different views. This property is very useful for finding the diverse parts that belong to the same category. However, multi-view based methods demand that each relevant classifier can classify the samples well and that all feature sets are uncorrelated. It is difficult to ensure this condition in real applications.

MIXED ACTIVE LEARNING

Instead of the single AL strategies of the former sections, we discuss two mixed AL modes in this section: one combines different selection criteria, and the other incorporates semi-supervised learning into AL.

Hybrid Active Learning

Rather than developing a new AL algorithm that works well in all situations, some researchers have argued that combining different methods, which are usually complementary, is a better way, since each method has its own advantages and disadvantages. The most intuitive structure for a hybrid strategy is the parallel mode. The key point is how to set the weights of the different AL methods. The simplest way, used by most existing methods, is to set fixed weights according to experience. The Most Relevant/Irrelevant strategies (L. Zhang, Lin, & Zhang, 2001) can help to stabilize the decision boundary but have low learning rates, while standard distance-based methods have high learning rates but unstable frontiers during the initial feedback rounds. Considering this, Xu, Xu, Yu, and Tresp (2003) combined these two strategies to achieve better performance than either strategy alone. As stated before, the diversity and distance-based strategies are also complementary: Brinker (2003), Ferecatu, Crucianu, and Boujemaa (2004), and Dagli, Rajaram, and Huang (2006) combined the angle, inner product and entropy diversity strategies, respectively, with the standard distance-based one. However, fixed weights cannot fit all datasets and all learning iterations well, so the weights should be set dynamically. In Baram, El-Yaniv, and Luz (2004), all the weights were initialized with the same value and were modified in later iterations using the EXP4 algorithm. The resulting AL algorithm was shown empirically to perform almost as well as, and sometimes better than, the best algorithm in the ensemble.
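A fixed-weight parallel combination can be sketched as a weighted sum of per-criterion scores, with the lowest combined score being queried. The function `select_hybrid`, the two toy criteria and the weight values are assumptions made for illustration; the dynamic reweighting with EXP4 mentioned above is not implemented here.

```python
from typing import Callable, Sequence


def select_hybrid(
    pool: Sequence[float],
    criteria: Sequence[Callable[[float], float]],
    weights: Sequence[float],
) -> float:
    """Return the pool sample minimizing the weighted sum of criterion scores.

    Every criterion is assumed to be scaled so that lower means "better query".
    """
    if len(criteria) != len(weights):
        raise ValueError("one weight per criterion is required")
    return min(
        pool,
        key=lambda x: sum(w * c(x) for w, c in zip(weights, criteria)),
    )


# Toy one-dimensional setting (all numbers are made up):
# the current decision boundary sits at 0.5 and 0.45 was queried previously.
def distance_score(x: float) -> float:
    return abs(x - 0.5)            # closeness to the decision boundary


def redundancy_score(x: float) -> float:
    return 1.0 - abs(x - 0.45)     # similarity to the earlier query


pool = [0.44, 0.6, 0.9]
distance_only = select_hybrid(pool, [distance_score, redundancy_score], (1.0, 0.0))
mixed = select_hybrid(pool, [distance_score, redundancy_score], (0.7, 0.3))
```

With the distance criterion alone, the sample 0.44 is chosen even though it nearly duplicates the earlier query; with weights (0.7, 0.3), the choice moves to 0.6, which is still close to the boundary but less redundant. The same data can therefore favour different queries depending on the weights, which is why fixed weights may not suit every dataset or iteration.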



Semi-Supervised Active Learning

1. Active Learning with Transductive SVM

In the first stages of SD-ALSVM, having only a few labeled data may cause the current solution to deviate greatly from the true solution, whereas if unlabeled samples are also considered, the solution may be closer to the true one. Wang, Chan, and Zhang (2003) showed that the closer the current solution is to the true one, the more the size of the version space is reduced. They incorporated Transductive SVM (TSVM) to produce more accurate intermediate solutions. However, several studies (T. Zhang & Oles, 2000) questioned whether TSVM benefits much from unlabeled data, in theory or in practice. Hoi and Lyu (2005) instead applied semi-supervised learning techniques based on Gaussian fields and harmonic functions, and the improvements were reported to be significant.

2. Incorporating EM into Active Learning

McCallum and Nigam (1998) combined Expectation Maximization (EM) with the query by committee strategy. Muslea, Minton, and Knoblock (2002) integrated a multi-view AL algorithm with EM to obtain the Co-EMT algorithm, which can work well in situations where the views are incompatible and correlated.

FUTURE TRENDS

How to Start the Active Learning

AL can be regarded as the problem of searching for the target function in the version space, so a good initial classifier is important. When the objective category is diverse, the initial classifier becomes even more important, because a bad one may lead to convergence to a locally optimal solution, i.e., some parts of the objective category may not be correctly covered by the final classifier. Two-stage (Cord, Gosselin, & Philipp-Foliguet, 2007), long-term learning (Yin, Bhanu, Chang, & Dong, 2005), and pre-cluster (Engelbrecht & Brits, 2002) strategies are promising.

Feature-Based Active Learning

In AL, the feedback from the oracle can also help to identify the important features, and Raghavan, Madani, and Jones (2006) showed that such feedback can improve the performance of the final classifier significantly. In Su, Li, and Zhang (2001), Principal Components Analysis was used to identify important features. To our knowledge, there are few reports addressing this issue.

The Scaling of Active Learning

The scaling of AL to very large databases has not yet been studied extensively, although it is an important issue for many real applications. Some approaches have been proposed for indexing the database (Lai, Goh, & Chang, 2004) and for overcoming the concept complexity that accompanies the scaling of the dataset (Panda, Goh, & Chang, 2006).

CONCLUSION

In this chapter, we have summarized the techniques of ALSVM, which has been an area of active research since 2000. We first focused on the description of heuristic ALSVM approaches within the framework of version space minimization. Mixed methods, which can compensate for the deficiencies of single ones, were then introduced. Finally, the future research trends discussed were techniques for selecting the initial labeled training set, feature-based AL, and the scaling of AL to very large databases.

REFERENCES

Baram, Y., El-Yaniv, R., & Luz, K. (2004). Online Choice of Active Learning Algorithms. Journal of Machine Learning Research, 5, 255-291.

Brinker, K. (2003). Incorporating Diversity in Active Learning with Support Vector Machines. Paper presented at the International Conference on Machine Learning.

Cheng, J., & Wang, K. (2007). Active learning for image retrieval with Co-SVM. Pattern Recognition, 40(1), 330-334.

Cohn, D., Atlas, L., & Ladner, R. (1994). Improving Generalization with Active Learning. Machine Learning, 15, 201-221.

Cohn, D. A. (1997). Minimizing Statistical Bias with Queries. In M. Mozer et al. (Eds.), Advances in Neural Information Processing Systems 9. Also appears as AI Lab Memo 1552, CBCL Paper 124.

Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active Learning with Statistical Models. Journal of Artificial Intelligence Research, 4, 129-145.

Cord, M., Gosselin, P. H., & Philipp-Foliguet, S. (2007). Stochastic exploration and active learning for image retrieval. Image and Vision Computing, 25(1), 14-23.

Dagli, C. K., Rajaram, S., & Huang, T. S. (2006). Utilizing Information Theoretic Diversity for SVM Active Learning. Paper presented at the International Conference on Pattern Recognition, Hong Kong.

Engelbrecht, A. P., & Brits, R. (2002). Supervised Training Using an Unsupervised Approach to Active Learning. Neural Processing Letters, 15, 14.

Ferecatu, M., Crucianu, M., & Boujemaa, N. (2004). Reducing the redundancy in the selection of samples for SVM-based relevance feedback.

Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997). Selective Sampling Using the Query by Committee Algorithm. Machine Learning, 28, 133-168.

Gosselin, P. H., & Cord, M. (2004). RETIN AL: an active learning strategy for image category retrieval. Paper presented at the International Conference on Image Processing.

He, J., Li, M., Zhang, H.-J., Tong, H., & Zhang, C. (2004). Mean version space: a new active learning method for content-based image retrieval. Paper presented at the International Multimedia Conference, Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval.

Hoi, S. C. H., & Lyu, M. R. (2005). A semi-supervised active learning framework for image retrieval. Paper presented at the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Lai, W.-C., Goh, K., & Chang, E. Y. (2004, June). On Scalability of Active Learning for Formulating Query Concepts (long version of the ICME invited paper). Paper presented at the Workshop on Computer Vision Meets Databases (CVDB) in cooperation with ACM International Conference on Management of Data (SIGMOD), Paris.

McAllester, D. A. (1998). Some PAC Bayesian Theorems. Paper presented at the Proceedings of the 11th Annual Conference on Computational Learning Theory, Madison, Wisconsin.

McCallum, A. K., & Nigam, K. (1998). Employing EM and Pool-Based Active Learning for Text Classification. Paper presented at the Proceedings of 15th International Conference on Machine Learning.

Muslea, I., Minton, S., & Knoblock, C. A. (2000). Selective Sampling with Redundant Views. Paper presented at the Proceedings of the 17th National Conference on Artificial Intelligence.

Muslea, I., Minton, S., & Knoblock, C. A. (2002). Active+Semi-Supervised Learning = Robust Multi-View Learning. Paper presented at the Proceedings of the 19th International Conference on Machine Learning.

Panda, N., Goh, K., & Chang, E. Y. (2006). Active Learning in Very Large Image Databases. Journal of Multimedia Tools and Applications, Special Issue on Computer Vision Meets Databases.

Raghavan, H., Madani, O., & Jones, R. (2006). Active Learning with Feedback on Both Features and Instances. Journal of Machine Learning Research, 7, 1655-1686.

Roy, N., & McCallum, A. (2001). Toward Optimal Active Learning Through Sampling Estimation of Error Reduction. Paper presented at the Proceedings of 18th International Conference on Machine Learning.

Su, Z., Li, S., & Zhang, H. (2001). Extraction of Feature Subspaces for Content-based Retrieval Using Relevance Feedback. Paper presented at the ACM Multimedia, Ottawa, Ontario, Canada.

Tong, S., & Koller, D. (2001). Support Vector Machine Active Learning with Application to Text Classification. Journal of Machine Learning Research, 45-66.

Wang, L., Chan, K. L., & Zhang, Z. (2003). Bootstrapping SVM active learning by incorporating unlabelled images for image retrieval. Paper presented at the Proceeding of IEEE Computer Vision and Pattern Recognition.

Warmuth, M. K., Ratsch, G., Mathieson, M., Liao, J., & Lemmem, C. (2003). Active Learning in the Drug Discovery Process. Journal of Chemical Information Sciences, 43(2), 667-673.

Xu, Z., Xu, X., Yu, K., & Tresp, V. (2003). A Hybrid Relevance-feedback Approach to Text Retrieval. Paper presented at the Proceedings of the 25th European Conference on Information Retrieval Research, Lecture Notes in Computer Science.

Yin, P., Bhanu, B., Chang, K., & Dong, A. (2005). Integrating Relevance Feedback Techniques for Image Retrieval Using Reinforcement Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1536-1551.

Zhang, L., Lin, F., & Zhang, B. (2001). Support Vector Machine Learning for Image Retrieval. Paper presented at the International Conference on Image Processing.

Zhang, T., & Oles, F. (2000). A Probability Analysis on The Value of Unlabeled Data for Classification Problems. Paper presented at the Proceeding of 17th International Conference of Machine Learning, San Francisco, CA.

KEY TERMS

Heuristic Active Learning: The set of active learning algorithms in which the sample selection criterion is based on some heuristic objective function. For example, version space based active learning selects the sample which can reduce the size of the version space.

Hypothesis Space: The set of all hypotheses in which the objective hypothesis is assumed to be found.

Semi-Supervised Learning: The set of learning algorithms in which both labelled and unlabelled data in the training dataset are directly used to train the classifier.

Statistical Active Learning: The set of active learning algorithms in which the sample selection criterion is based on some statistical objective function, such as minimization of generalisation error, bias and variance. Statistical active learning is usually statistically optimal.

Supervised Learning: The set of learning algorithms in which the samples in the training dataset are all labelled.

Unsupervised Learning: The set of learning algorithms in which the samples in the training dataset are all unlabelled.

Version Space: The subset of the hypothesis space which is consistent with the training set.

Adaptive Algorithms for Intelligent Geometric Computing M. L. Gavrilova University of Calgary, Canada

INTRODUCTION

This chapter spans topics from such important areas as Artificial Intelligence, Computational Geometry and Biometric Technologies. The primary focus is on the proposed Adaptive Computation Paradigm and its applications to surface modeling and biometric processing. The availability of much more affordable storage and high-resolution image capturing devices has contributed significantly over the past few years to the accumulation of very large datasets (such as GIS maps, biometric samples, videos, etc.). On the other hand, it has also created significant challenges, driven by the higher than ever volume and complexity of the data, that can no longer be resolved by acquiring more memory or faster processors, or by optimizing existing algorithms. These developments justify the need for radically new concepts for massive data storage, processing and visualization. To address this need, the current chapter presents an original methodology based on the paradigm of Adaptive Geometric Computing. The methodology enables storing complex data in a compact form, providing efficient access to it, preserving a high level of detail and visualizing dynamic changes in a smooth and continuous manner. The first part of the chapter discusses adaptive algorithms in real-time visualization, specifically in GIS (Geographic Information Systems) applications. Data structures such as Real-time Optimally Adapting Mesh (ROAM) and Progressive Mesh (PM) are briefly surveyed. The adaptive method Adaptive Spatial Memory (ASM), developed by R. Apu and M. Gavrilova, is then introduced. This method allows fast and efficient visualization of complex data sets representing terrains, landscapes and Digital Elevation Models (DEM). Its advantages are briefly discussed. The second part of the chapter presents the application of the adaptive computation paradigm and evolutionary computing to missile simulation. As a result, patterns of complex behavior can be developed and analyzed.

The final part of the chapter combines the concept of adaptive computation with topology-based techniques and discusses their application to the challenging area of biometric computing.

BACKGROUND

For a long time, researchers have been pressed with the question of how to model real-world objects (such as terrain, facial structure or particle systems) realistically while preserving rendering efficiency and space. As a solution, grid, mesh, TIN, Delaunay triangulation-based and other methods for model representation were developed over the last two decades. Most of these are static methods, not suitable for rendering dynamic scenes or preserving a higher level of detail. In 1997, the first methods for dynamic model representation were developed: Real-time Optimally Adapting Mesh (ROAM) (Duchaineau et al., 1997; Lindstrom and Koller, 1996) and Progressive Mesh (PM) (Hoppe, 1997). Various methods have been proposed to reduce a fine mesh to an optimized representation, so that the optimized mesh contains fewer primitives and yields maximum detail. However, this approach had two major limitations. Firstly, the optimization is very expensive (several minutes to optimize one medium-sized mesh). Secondly, the generated non-uniform mesh is still static. As a result, it yields poor quality when only a small part of the mesh is being observed. Thus, even with further improvements, these methods were not capable of dealing with large amounts of complex data or significantly varied levels of detail. They were soon replaced by a different computational model for rendering geometric meshes (Li Sheng et al., 2003; Shafae and Pajarola, 2003). The model employs a continuous refinement criterion based on an error metric to optimally adapt to a more accurate representation. Therefore, given a mesh representation and a small change in the viewpoint, the optimized mesh for the next viewpoint can be computed by refining the existing mesh.

ADAPTIVE GEOMETRIC COMPUTING

This chapter presents the Adaptive Multi-Resolution Technique for real-time terrain visualization, which optimizes the mesh dynamically to achieve smooth and continuous visualization with very high efficiency (frame rate) (Apu and Gavrilova, 2005, 2007). Our method is characterized by an efficient representation of the massive underlying terrain, utilizes efficient transitions between detail levels, and achieves frame rate constancy, ensuring visual continuity. At the core of the method is adaptive processing: a formalized hierarchical representation that exploits the subsequent refinement principle. This gives us full control over the complexity of the feature space. An error metric is assigned by a higher-level process in which objects (or features) are initially classified into different labels. Thus, this adaptive method is highly useful for feature space representation. In 2006, Gavrilova and Apu showed that such methods can act as a powerful tool not only for terrain rendering, but also for motion planning and adaptive simulations (Apu and Gavrilova, 2006). They introduced the Adaptive Spatial Memory (ASM) model, which utilizes an adaptive approach in a real-time online algorithm for multi-agent collaborative motion planning. They demonstrated that the powerful notion of adaptive computation can be applied to the perception and understanding of space. An extension of this method to 3D motion planning, part of collaborative research with the group of Prof. I. Kolingerova, has been reported to be significantly more efficient than conventional methods (Broz et al., 2007).

Figure 1. Split and merge operations in ASM model

We now move on to discuss evolutionary computing. We demonstrate the power of adaptive computation by developing an adaptive computational model and applying it to missile simulation (Apu and Gavrilova, 2006). The adaptive algorithms described above have the property that spatial memory units can form, refine and collapse to simulate learning, adapting and responding to stimuli. The result is a complex multi-agent learning algorithm that clearly demonstrates organic behaviors, such as a sense of territory, trails and tracks, observed in flocks or herds of wild animals and in insects. This motivates exploring the mechanism in application to swarm behavior modeling. Swarm Intelligence (SI) is the property of a system whereby the collective behaviors of unsophisticated agents interacting locally with their environment cause coherent functional global patterns to emerge (Bonabeau, 1999). Swarm intelligence provides a basis for exploring the collective (distributed) behavior of a group of agents without centralized control or the provision of a global model. Agents in such a system have limited perception (or intelligence) and cannot individually carry out complex tasks. According to Bonabeau, by regulating the behavior of the agents in the swarm, one can demonstrate emergent behavior and intelligence as a collective phenomenon. Although the swarming phenomenon is largely observed in biological organisms such as an ant colony or a flock of birds, it has recently been used to simulate complex dynamic systems focused on accomplishing a well-defined objective (Kennedy, 2001; Raupp and Thalmann, 2001).


Let us now investigate the application of the adaptive computational paradigm and the swarm intelligence concept to missile behavior simulation (Apu and Gavrilova, 2006). First of all, let us note that complex strategic behavior can be observed by means of a task-oriented artificial evolutionary process in which the behaviors of individual missiles are described with surprising simplicity. Secondly, the global effectiveness and behavior of the missile swarm is relatively unaffected by the disruption or destruction of individual units. From a strategic point of view, this adaptive behavior is a strongly desired property in military applications, which motivates our interest in applying it to missile simulation. Note that this problem was chosen because it presents a complex challenge for which an optimum solution is very hard to obtain using traditional methods. The dynamic and competitive relationship between missiles and turrets makes it extremely difficult to model using a deterministic approach. It should also be noted that the problem has an easy evaluation metric that allows fitness values to be determined precisely.

Now, let us summarize the idea of evolutionary optimization by applying a genetic algorithm to evolve the missile genotype. We are particularly interested in observing the evolution of complex 3D formations and tactical strategies that the swarm learns in order to maximize its effectiveness during an attack simulation run. The simulation is based on attack, evasion and defense. While the missile sets a strategy to strike the target, the battleship prepares to shoot down as many missiles as possible (Figure 2 illustrates the basic missile maneuvers). Each attempt to destroy the target is called an attack simulation run. Its effectiveness equals the number of missiles hitting the target. Therefore the outcome of the simulation is easily quantifiable. On the other hand, the interaction between missiles and the battleship is complex and nontrivial. As a result, war strategies may emerge in which a local penalty (e.g., sacrificing a missile) can optimize global efficiency (e.g., a deception strategy).

Figure 2. Basic maneuvers for a missile using the Gene String

The simplest form of information known to each missile is its position and orientation and the location of the target. This information is augmented with information about the missile's neighborhood and environment, which influences the missile navigation pattern. For the actual missile behavior simulation, we use a strategy based on a modified version of the Boids flocking technique. This provides the necessary set of actions to reach the target or interact with the environment, and is the basic building block of missile navigation. The gene string is another important part, reflecting the complexity with which such courses of action can be chosen. It contains a unique combination of maneuvers (such as attack, evasion, etc.) that evolve to create complex combined intelligence. We describe the fitness of the missile gene in terms of collective performance. After investigating various possibilities, we developed and used a two-dimensional adaptive fitness function to evolve the missile strains in one evolutionary system. Details on this approach can be found in (Apu and Gavrilova, 2006). After extensive experimentation, we found many interesting characteristics, such as geometric attack formations and organic behaviors observed among swarms, in addition to the highly anticipated strategies such as simultaneous attack, deception and retreat (see Figure 3).
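The evolutionary loop over gene strings can be sketched as follows. This is only an illustration under stated assumptions: the maneuver labels, the selection scheme, and above all `toy_fitness` are hypothetical stand-ins, since the chapter's actual fitness is the number of missiles hitting the target in a simulated attack run and its two-dimensional adaptive fitness function is described elsewhere (Apu and Gavrilova, 2006).

```python
import random

# Hypothetical maneuver labels; the chapter names attack and evasion among others.
MANEUVERS = ["attack", "evade", "climb", "dive", "spread", "regroup"]


def random_gene(length, rng):
    return [rng.choice(MANEUVERS) for _ in range(length)]


def toy_fitness(gene):
    """Stand-in for an attack simulation run (NOT the authors' fitness function).

    Counts each 'evade' that is immediately followed by 'attack'.
    """
    return sum(1 for a, b in zip(gene, gene[1:]) if (a, b) == ("evade", "attack"))


def mutate(gene, rate, rng):
    """Replace each maneuver by a random one with probability `rate`."""
    return [rng.choice(MANEUVERS) if rng.random() < rate else m for m in gene]


def evolve(fitness, pop_size=30, length=8, generations=40, rate=0.1, seed=0):
    rng = random.Random(seed)
    population = [random_gene(length, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]  # keep the better half unchanged
        children = [mutate(rng.choice(elite), rate, rng)
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=fitness)


best = evolve(toy_fitness)
```

Because the better half of each generation is carried over unchanged, the best fitness in the population can never decrease. Replacing `toy_fitness` with a function that runs a full attack simulation and counts hits would bring the sketch closer to the setting described in the text.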
We also examined adaptability by randomizing the simulation coordinates, distance, initial formation, attack rate, and other parameters of the missiles, and measured the mean and variance of the fitness function. The results show that many of the genotypes that evolved are highly adaptive to the environment. We have just reviewed the application of the adaptive computational paradigm to swarm intelligence and briefly described the efficient tactical swarm simulation method (Apu and Gavrilova, 2006). The results clearly demonstrate that the swarm is able to develop complex strategy through the evolutionary process of genotype mutation. This contribution, among other works on adaptive computational intelligence, will be profiled in detail in the upcoming book that is part of the Springer-Verlag book series on Computational Intelligence (Gavrilova, 2007).

As stated in the introduction, adaptive computation is based on a variable-complexity level-of-detail paradigm, in which a physical phenomenon can be simulated by the continuous process of local adaptation of spatial complexity. As presented by M. Gavrilova in the Plenary Lecture at the 3IA Eurographics Conference, France, in 2006, the adaptive paradigm is a powerful computational model that can also be applied to the vast area of biometric research. This section therefore reviews methods and techniques based on adaptive geometric methods in application to biometric problems. It emphasizes the advantages that an intelligent approach to geometric computing brings to the area of complex biometric data processing (Gavrilova, 2007).

In information technology, biometrics refers to the study of physical and behavioral characteristics for the purpose of person identification (Yanushkevich, Gavrilova, Wang and Srihari, 2007). In recent years, the area of biometrics has witnessed tremendous growth, partly as a result of a pressing need for increased security, and partly as a response to new technological advances that are literally changing the way we live. The availability of much more affordable storage and of high-resolution biometric image capturing devices has contributed to the accumulation of very large datasets of biometric data.

In the earlier sections, we studied the background of adaptive mesh generation. Let us now look at the background research in topology-based data structures and its application to biometric research. This information is highly relevant to the goals of modeling and visualizing complex biometric data. At the same time as adaptive methodology was developing in GIS, interest in topology-based data structures, such as Voronoi diagrams and Delaunay triangulations, grew significantly. Some preliminary results on the use of these topology-based data structures in biometrics began to appear. For instance, research on image processing using Voronoi diagrams was presented in (Liang and Asano, 2004; Asano, 2006), studies of utilizing Voronoi diagrams for fingerprint synthesis were conducted by (Bebis et al., 1999; Cappelli et al., 2002), and various surveys of methods for modeling human faces using triangular meshes appeared in (Wen and Huang, 2004; Li and Jain, 2005; Wayman et al., 2005). Some interesting results were recently obtained in the BTLab, University of Calgary, through the development of topology-based feature extraction algorithms for fingerprint matching (Wang et al., 2006, 2007; an illustration is found in Figure 4), 3D facial expression modeling (Luo et al., 2006) and iris synthesis (Wecker et al., 2005). A comprehensive review of topology-based approaches in biometric modeling and synthesis can be found in a recent book chapter on the subject (Gavrilova, 2007).

In this chapter, we propose to manage the challenges arising from large volumes of complex biometric data through the innovative utilization of the adaptive paradigm. We suggest a combination of topology-based and hierarchy-based methodology to store and search for biometric data, as well as to optimize such a representation based on data access and usage. Namely, retrieval of the data, or the creation of a real-time visualization, can be based on the dynamic pattern of data usage (how often, what type of data, how much detail, etc.), recorded and analyzed while the biometric system is being used for recognition and identification purposes.

Figure 3. Complex formation and attack patterns evolved

(a) Deception pattern

(b) Distraction pattern

(c) Organic motion pattern


Figure 4. Delaunay triangulation based technique for fingerprint matching

In addition to using this information for optimized data representation and retrieval, we also propose to incorporate intelligent learning techniques to predict the most likely patterns of system usage and to represent and organize data accordingly. On the practical side, to achieve our goal, we propose a novel way to represent complex biometric data by organizing the data in a hierarchical tree-like structure. Such an organization is similar in principle to the Adaptive Memory Subdivision (AMS), which is capable of representing and retrieving varying amounts of information at the level of detail that needs to be represented. A spatial quad-tree is used to hold the information about the system, as well as the instructions on how to process this information. Expansion is realized through the spatial subdivision technique, which refines the data and increases the level of detail, and collapsing is realized through the merge operation, which simplifies the data representation and makes it more compact. A greedy strategy is used to adapt to the best representation based on the user requirements, the amount of available data and resources, the required resolution and so on. This powerful technique enables us to achieve the goal of compact biometric data representation, which makes it possible, for instance, to efficiently store minor details of the modeled face (e.g. scars, wrinkles) or detailed patterns of the iris.
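The split and merge operations on a spatial quad-tree can be sketched as below. This is a minimal illustration, not the authors' implementation: the class `QuadNode`, the use of 2D feature points as the stored data, and the point-count criterion in `adapt` (split a leaf holding more than `capacity` points, merge a subtree holding `capacity` points or fewer) are all assumptions standing in for the error metric and usage statistics described in the text.

```python
class QuadNode:
    """Square region [x, x+size) x [y, y+size) holding 2D feature points."""

    def __init__(self, x, y, size, depth=0):
        self.x, self.y, self.size, self.depth = x, y, size, depth
        self.points = []      # points stored here while the node is a leaf
        self.children = None  # four sub-quadrants once the node is split

    def is_leaf(self):
        return self.children is None

    def _child_for(self, p):
        half = self.size / 2
        col = 1 if p[0] >= self.x + half else 0
        row = 1 if p[1] >= self.y + half else 0
        return self.children[row * 2 + col]

    def split(self):
        """Expansion: subdivide and push the points down one level."""
        half = self.size / 2
        self.children = [QuadNode(self.x + dx * half, self.y + dy * half, half, self.depth + 1)
                         for dy in (0, 1) for dx in (0, 1)]
        for p in self.points:
            self._child_for(p).points.append(p)
        self.points = []

    def merge(self):
        """Collapse: pull all descendant points back into this node."""
        self.points = list(self.all_points())
        self.children = None

    def all_points(self):
        if self.is_leaf():
            yield from self.points
        else:
            for child in self.children:
                yield from child.all_points()

    def leaf_count(self):
        return 1 if self.is_leaf() else sum(c.leaf_count() for c in self.children)

    def adapt(self, capacity, max_depth):
        """Greedy pass: split overloaded leaves, merge sparse subtrees."""
        if self.is_leaf():
            if len(self.points) > capacity and self.depth < max_depth:
                self.split()
                for child in self.children:
                    child.adapt(capacity, max_depth)
        else:
            for child in self.children:
                child.adapt(capacity, max_depth)
            if sum(1 for _ in self.all_points()) <= capacity:
                self.merge()


root = QuadNode(0.0, 0.0, 1.0)
root.points = [(0.1, 0.1), (0.12, 0.15), (0.2, 0.05), (0.9, 0.9)]
root.adapt(capacity=1, max_depth=4)   # detail grows where points cluster
detailed_leaves = root.leaf_count()
root.adapt(capacity=10, max_depth=4)  # a looser requirement collapses the tree
compact_leaves = root.leaf_count()
```

Calling `adapt` with a different `capacity` plays the role of changing the required resolution: the same data is kept, but the tree expands around the dense corner or collapses back to a single node.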

FUTURE TRENDS

In addition to data representation, the adaptive technique can be highly useful in biometric feature extraction, for the purpose of fast and reliable retrieval and matching of biometric data, and in implementing dynamic changes to the model. The methodology has a high potential of becoming one of the key approaches in biometric data modeling and synthesis.

CONCLUSION The chapter reviewed the adaptive computational paradigm in application to surface modeling, evolutionary computing and biometric research. Some of the key future developments in the upcoming years will undoubtedly highlight the area, inspiring new generations of intelligent biometric systems with adaptive behavior.

REFERENCES

Apu, R. & Gavrilova, M. (2005) Geo-Mass: Modeling Massive Terrain in Real-Time, GEOMATICA J. 59(3), 313-322.

Apu, R. & Gavrilova, M. (2006) Battle Swarm: An Evolutionary Approach to Complex Swarm Intelligence, 3IA Int. C. Comp. Graphics and AI, Limoges, France, 139-150.

Apu, R. & Gavrilova, M. (2007) Fast and Efficient Rendering System for Real-Time Terrain Visualization, IJCSE Journal, 2(2), 5/6.

Apu, R. & Gavrilova, M. (2006) An Efficient Swarm Neighborhood Management for a 3D Tactical Simulator, IEEE-CS proceedings, ISVD 2006, 85-93.

Asano, T. (2006) Aspect-Ratio Voronoi Diagram with Applications, ISVD 2006, IEEE-CS proceedings, 32-39.

Bebis, G., Deaconu, T. & Georgiopoulos, M. (1999) Fingerprint Identification using Delaunay Triangulation, ICIIS 99, Maryland, 452-459.

Bonabeau, E., Dorigo, M. & Theraulaz, G. (1999) Swarm Intelligence: From Natural to Artificial Systems, NY: Oxford Univ. Press.

Broz, P., Kolingerova, I., Zitka, P., Apu, R. & Gavrilova, M. (2007) Path planning in dynamic environment using an adaptive mesh, SCCG 2007, Spring Conference on Computer Graphics 2007, ACM SIGGRAPH.

Cappelli, R., Maio, D. & Maltoni, D. (2002) Synthetic Fingerprint-Database Generation, ICPR 2002, Canada, vol 3, 369-376.

Duchaineau, M. et al. (1997) ROAMing Terrain: Real-Time Optimally Adapting Meshes, IEEE Visualization ’97, 81-88.

Gavrilova, M.L. (2007) Computational Geometry and Image Processing in Biometrics: on the Path to Convergence, in Book Image Pattern Recognition: Synthesis and Analysis in Biometrics, Book Chapter 4, 103-133, World Scientific Publishers.

Gavrilova, M.L. Computational Intelligence: A Geometry-Based Approach, in book series Studies in Computational Intelligence, Springer-Verlag, Ed. Janusz Kacprzyk, to appear.

Gavrilova, M.L. (2006) IEEE-CS Book of the 3rd International Symposium on Voronoi Diagrams in Science and Engineering, IEEE-CS, Softcover, 2006, 270 pages.

Gavrilova, M.L. (2006) Geometric Algorithms in 3D Real-Time Rendering and Facial Expression Modeling, 3IA’2006 Plenary Lecture, Eurographics, Limoges, France, 5-18.

Hoppe, H. (1997) View-Dependent Refinement of Progressive Meshes, SIGGRAPH ’97 Proceedings, 189-198.

Kennedy, J., Eberhart, R. C. & Shi, Y. (2001) Swarm Intelligence, San Francisco: Morgan Kaufmann Publishers.

Li Sheng, Liu Xuehui & Wu Enhua (2003) Feature-Based Visibility-Driven CLOD for Terrain, In Proc. Pacific Graphics 2003, 313-322, IEEE Press.

Li, S. & Jain, A. (2005) Handbook of Face Recognition. Springer-Verlag.

Liang, X.F. & Asano, T. (2004) A fast denoising method for binary fingerprint image, IASTED, Spain, 309-313.

Lindstrom, P. & Koller, D. (1996) Real-time continuous level of detail rendering of height fields, SIGGRAPH 1996 Proceedings, 109-118.

Luo, Y., Gavrilova, M. & Sousa, M.C. (2006) NPAR by Example: line drawing facial animation from photographs, CGIV’06, IEEE, Computer Graphics, Imaging and Visualization, 514-521.

Raupp, S. & Thalmann, D. (2001) Hierarchical Model for Real Time Simulation of Virtual Human Crowds, IEEE Trans. on Visualization and Computer Graphics 7(2), 152-164.

Shafae, M. & Pajarola, R. (2003) Dstrips: Dynamic Triangle Strips for Real-Time Mesh Simplification and Rendering, Pacific Graphics 2003, 271-280.

Wang, C., Luo, Y., Gavrilova, M. & Rokne, J. (2007) Fingerprint Image Matching Using a Hierarchical Approach, in Book Computational Intelligence in Information Assurance and Security, Springer SCI Series, 175-198.

Wang, H., Gavrilova, M., Luo, Y. & Rokne, J. (2006) An Efficient Algorithm for Fingerprint Matching, ICPR 2006, Int. C. on Pattern Recognition, Hong Kong, IEEE-CS, 1034-1037.

Wayman, J., Jain, A., Maltoni, D. & Maio, D. (2005) Biometric Systems: Technology, Design and Performance Evaluation, Book, Springer.

Wecker, L., Samavati, F. & Gavrilova, M. (2005) Iris Synthesis: A Multi-Resolution Approach, GRAPHITE 2005, ACM Press, 121-125.

Wen, Z. & Huang, T. (2004) 3D Face Processing: Modeling, Analysis and Synthesis, Kluwer.

Yanushkevich, S., Gavrilova, M., Wang, P. & Srihari, S. (2007) Image Pattern Recognition: Synthesis and Analysis in Biometrics, Book, World Scientific.


KEY TERMS

Adaptive Geometric Model (AGM): A new approach to geometric computing utilizing the adaptive computation paradigm. The model employs a continuous refinement criterion based on an error metric to optimally adapt to a more accurate representation.

Adaptive Multi-Resolution Technique (AMRT): A method for real-time terrain visualization that optimizes the mesh dynamically for smooth and continuous visualization with high efficiency.

Adaptive Spatial Memory (ASM): A hybrid method based on the combination of a traditional hierarchical tree structure with the concept of expanding or collapsing tree nodes.

Biometric Technology (BT): An area of study of physical and behavioral characteristics with the purpose of person authentication and identification.

Delaunay Triangulation (DT): A computational geometry data structure dual to the Voronoi diagram.

Evolutionary Paradigm (EP): The collective name for a number of problem solving methods utilizing principles of biological evolution, such as natural selection and genetic inheritance.

Swarm Intelligence (SI): The property of a system whereby the collective behaviors of unsophisticated agents interacting locally with their environment cause coherent functional global patterns to emerge.

Topology-Based Techniques (TBT): A group of methods using geometric properties of a set of objects in space and their proximity.

Voronoi Diagram (VD): A fundamental computational geometry data structure that stores topological information for a set of objects.


Adaptive Business Intelligence

Zbigniew Michalewicz
The University of Adelaide, Australia

INTRODUCTION

Since the computer age dawned on mankind, one of the most important areas in information technology has been that of “decision support.” Today, this area is more important than ever. Working in dynamic and ever-changing environments, modern-day managers are responsible for an assortment of far-reaching decisions: Should the company increase or decrease its workforce? Enter new markets? Develop new products? Invest in research and development? The list goes on. But despite the inherent complexity of these issues and the ever-increasing load of information that business managers must deal with, all these decisions boil down to two fundamental questions:

• What is likely to happen in the future?
• What is the best decision right now?

Whether we realize it or not, these two questions pervade our everyday lives — both on a personal and professional level. When driving to work, for instance, we have to make a traffic prediction before we can choose the quickest driving route. At work, we need to predict the demand for our product before we can decide how much to produce. And before investing in a foreign market, we need to predict future exchange rates and economic variables. It seems that regardless of the decision being made or its complexity, we first need to make a prediction of what is likely to happen in the future, and then make the best decision based on that prediction. This fundamental process underpins the basic premise of Adaptive Business Intelligence.

BACKGROUND

Simply put, Adaptive Business Intelligence is the discipline of combining prediction, optimization, and adaptability into a system capable of answering these two fundamental questions: What is likely to happen in the future? and What is the best decision right now? (Michalewicz et al. 2007). To build such a system, we first need to understand the methods and techniques that enable prediction, optimization, and adaptability (Dhar and Stein, 1997). At first blush, this subject matter is nothing new, as hundreds of books and articles have already been written on business intelligence (Vitt et al., 2002; Loshin, 2003), data mining and prediction methods (Weiss and Indurkhya, 1998; Witten and Frank, 2005), forecasting methods (Makridakis et al., 1988), optimization techniques (Deb 2001; Coello et al. 2002; Michalewicz and Fogel, 2004), and so forth. However, none of these has explained how to combine these various technologies into a software system that is capable of predicting, optimizing, and adapting. Adaptive Business Intelligence addresses this very issue.

Clearly, the future of the business intelligence industry lies in systems that can make decisions, rather than tools that produce detailed reports (Loshin 2003). As most business managers now realize, there is a world of difference between having good knowledge and detailed reports, and making smart decisions. Michael Kahn, a technology reporter for Reuters in San Francisco, makes a valid point in the January 16, 2006 story entitled “Business intelligence software looks to future”:

“But analysts say applications that actually answer questions rather than just present mounds of data is the key driver of a market set to grow 10 per cent in 2006 or about twice the rate of the business software industry in general. ‘Increasingly you are seeing applications being developed that will result in some sort of action,’ said Brendan Barnacle, an analyst at Pacific Crest Equities. ‘It is a relatively small part now, but it is clearly where the future is. That is the next stage of business intelligence.’”



MAIN FOCUS OF THE CHAPTER

“The answer to my problem is hidden in my data … but I cannot dig it up!” This popular statement has been around for years as business managers gathered and stored massive amounts of data in the belief that they contain some valuable insight. But business managers eventually discovered that raw data are rarely of any benefit, and that their real value depends on an organization’s ability to analyze them. Hence, the need emerged for software systems capable of retrieving, summarizing, and interpreting data for end-users (Moss and Atre, 2003).

This need fueled the emergence of hundreds of business intelligence companies that specialized in providing software systems and services for extracting knowledge from raw data. These software systems would analyze a company’s operational data and provide knowledge in the form of tables, graphs, pies, charts, and other statistics. For example, a business intelligence report may state that 57% of customers are between the ages of 40 and 50, or that product X sells much better in Florida than in Georgia.1

Consequently, the general goal of most business intelligence systems was to: (1) access data from a variety of different sources; (2) transform these data into information, and then into knowledge; and (3) provide an easy-to-use graphical interface to display this knowledge. In other words, a business intelligence system was responsible for collecting and digesting data, and presenting knowledge in a friendly way (thus enhancing the end-user’s ability to make good decisions). The diagram in Figure 1 illustrates the processes that underpin a traditional business intelligence system. Although different texts have illustrated the relationship between data and knowledge in different ways (e.g., Davenport and Prusak, 2006; Prusak, 1997; Shortliffe and Cimino, 2006), the commonly accepted distinction between data, information, and knowledge is:

• Data are collected on a daily basis in the form of bits, numbers, symbols, and “objects.”
• Information is “organized data,” which are preprocessed, cleaned, arranged into structures, and stripped of redundancy.
• Knowledge is “integrated information,” which includes facts and relationships that have been perceived, discovered, or learned.
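The distinction above can be made concrete with a small sketch. The records, field names, and the two functions `to_information` and `to_knowledge` are hypothetical and chosen only to mirror the three levels: raw records are cleaned and de-duplicated into information, and a simple summary of the kind a business intelligence report might contain is derived as knowledge.

```python
from collections import Counter

# Hypothetical raw records, including a duplicate and a malformed age.
raw_data = [
    {"customer": "c1", "age": "44", "state": "FL"},
    {"customer": "c1", "age": "44", "state": "FL"},
    {"customer": "c2", "age": "47", "state": "GA"},
    {"customer": "c3", "age": "n/a", "state": "FL"},
    {"customer": "c4", "age": "31", "state": "FL"},
]


def to_information(records):
    """Data preparation: clean, convert types, and strip redundancy."""
    seen, clean = set(), []
    for r in records:
        if r["customer"] in seen or not r["age"].isdigit():
            continue  # skip duplicates and rows with an unusable age
        seen.add(r["customer"])
        clean.append({"customer": r["customer"], "age": int(r["age"]), "state": r["state"]})
    return clean


def to_knowledge(info):
    """A trivial stand-in for data mining: summary facts about the customers."""
    share = sum(1 for r in info if 40 <= r["age"] <= 50) / len(info)
    by_state = Counter(r["state"] for r in info)
    return {"share_aged_40_to_50": share, "customers_by_state": dict(by_state)}


information = to_information(raw_data)
knowledge = to_knowledge(information)
```

Real systems replace the summary step with data mining models, but the direction of the pipeline is the same: each stage discards noise and adds structure.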

Because knowledge is such an essential component of any decision-making process (as the old saying goes, “Knowledge is power!”), many businesses have viewed knowledge as the final objective. But it seems that knowledge is no longer enough. A business may “know” a lot about its customers (it may have hundreds of charts and graphs that organize its customers by age, preferences, geographical location, and sales history), but management may still be unsure of what decision to make! And here lies the difference between “decision support” and “decision making”: all the knowledge in the world will not guarantee the right or best decision. Moreover, recent research in psychology indicates that widely held beliefs can actually hamper the decision-making process. For example, common beliefs like “the more knowledge we have, the better our decisions will be,” or “we can distinguish between useful and irrelevant knowledge,” are not supported by empirical evidence. Having more knowledge merely increases our confidence, but it does not improve the accuracy of our decisions. Similarly, people supplied with “good” and “bad” knowledge often have trouble distinguishing between the two, indicating that irrelevant knowledge decreases our decision-making effectiveness.

Figure 1. The processes that underpin a traditional business intelligence system (diagram labels: DATA, Data Preparation, INFORMATION, Data Mining, KNOWLEDGE)

Today, most business managers realize that a gap exists between having the right knowledge and making the right decision. Because this gap affects management’s ability to answer fundamental business questions (such as “What should be done to increase profits? Reduce costs? Or increase market share?”), the future of business intelligence lies in systems that can provide answers and recommendations, rather than mounds of knowledge in the form of reports. The future of business intelligence lies in systems that can make decisions! As a result, there is a new trend emerging in the marketplace called Adaptive Business Intelligence. In addition to performing the role of traditional business intelligence (transforming data into knowledge), Adaptive Business Intelligence also includes the decision-making process, which is based on prediction and optimization as shown in Figure 2.

While business intelligence is often defined as “a broad category of application programs and technologies for gathering, storing, analyzing, and providing access to data,” the term Adaptive Business Intelligence can be defined as “the discipline of using prediction and optimization techniques to build self-learning ‘decisioning’ systems” (as Figure 2 shows). Adaptive Business Intelligence systems include elements of data mining, predictive modeling, forecasting, optimization, and adaptability, and are used by business managers to make better decisions.

This relatively new approach to business intelligence is capable of recommending the best course of action (based on past data), but it does so in a very special way: An Adaptive Business Intelligence system incorporates prediction and optimization modules to recommend near-optimal decisions, and an “adaptability module” for improving future recommendations. Such systems can help business managers make decisions that increase efficiency, productivity, and competitiveness. Furthermore, the importance of adaptability cannot be overemphasized. After all, what is the point of using a software system that produces subpar schedules, inaccurate demand forecasts, and inferior logistic plans, time after time? Would it not be wonderful to use a software system that could adapt to changes in the marketplace? A software system that could improve with time?
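The interplay of the three modules can be illustrated with a toy decision loop. Everything in it is an assumption made for the sake of the example: the class name `AdaptiveDecisionLoop`, the production-quantity decision, the profit and waste figures, and the use of exponential smoothing as the adaptation rule. It is a sketch of the prediction, optimization, and adaptability cycle, not a description of any deployed system.

```python
class AdaptiveDecisionLoop:
    """Toy predict -> optimize -> adapt cycle for a daily production decision."""

    def __init__(self, initial_forecast, alpha=0.5, unit_profit=5.0, unit_waste_cost=2.0):
        self.forecast = initial_forecast  # current demand prediction
        self.alpha = alpha                # how strongly new observations correct the forecast
        self.unit_profit = unit_profit
        self.unit_waste_cost = unit_waste_cost

    def predict(self):
        """Prediction module: what is likely to happen?"""
        return self.forecast

    def optimize(self, predicted_demand, candidates):
        """Optimization module: which candidate quantity is best under the prediction?"""
        def profit(quantity):
            sold = min(quantity, predicted_demand)
            wasted = max(0, quantity - predicted_demand)
            return self.unit_profit * sold - self.unit_waste_cost * wasted
        return max(candidates, key=profit)

    def adapt(self, actual_demand):
        """Adaptability module: pull the forecast toward what actually happened."""
        self.forecast += self.alpha * (actual_demand - self.forecast)


loop = AdaptiveDecisionLoop(initial_forecast=80.0)
decisions = []
for actual in [100, 100, 100]:
    quantity = loop.optimize(loop.predict(), range(0, 201, 10))
    decisions.append(quantity)
    loop.adapt(actual)
```

With a true demand of 100 and an initial forecast of 80, the recommended quantity moves from 80 to 90 to 100 over three rounds as the forecast is corrected, which is the sense in which recommendations improve with time.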

FUTURE TRENDS

The concept of adaptability is certainly gaining popularity, and not just in the software sector. Adaptability has already been introduced in everything from automatic car transmissions (which adapt their gear-change patterns to a driver’s driving style), to running shoes (which adapt their cushioning level to a runner’s size and stride), to Internet search engines (which adapt their search results to a user’s preferences and prior search history). These products are very appealing for individual consumers, because, despite their mass production, they are capable of adapting to the preferences of each unique owner after some period of time. The growing popularity of adaptability is also underscored by a recent publication of the US Department of Defense. This lists 19 important research topics for the next decade and many of them include the term “adaptive”: Adaptive Coordinated Control in the Multi-agent 3D Dynamic Battlefield, Control for Adaptive and Cooperative Systems, Adaptive System Interoperability, Adaptive Materials for Energy-Absorbing Structures, and Complex Adaptive Networks for Cooperative Control.

Figure 2. Adaptive business intelligence system (diagram labels: Adaptability, DATA, Data Preparation, INFORMATION, Data Mining, KNOWLEDGE, Optimization, Prediction, DECISION)

To be sure, adaptability was recognized as an important component of intelligence quite some time ago: Alfred Binet (born 1857), French psychologist and inventor of the first usable intelligence test, defined intelligence as “... judgment, otherwise called good sense, practical sense, initiative, the faculty of adapting one’s self to circumstances.” Adaptability is a vital component of any intelligent system, as it is hard to argue that a system is “intelligent” if it does not have the capacity to adapt. For humans, the importance of adaptability is obvious: our ability to adapt was a key element in the evolutionary process. In psychology, a behavior or trait is adaptive when it helps an individual adjust and function well within a changing social environment. In the case of artificial intelligence, consider a chess program capable of beating the world chess champion: Should we call this program intelligent? Probably not. We can attribute the program’s performance to its ability to evaluate the current board situation against a multitude of possible “future boards” before selecting the best move. However, because the program cannot learn or adapt to new rules, it will lose its effectiveness if the rules of the game are changed or modified. Consequently, because the program is incapable of learning or adapting to new rules, it is not intelligent. The same holds true for any expert system.
No one questions the usefulness of expert systems in some environments (which are usually well defined and static), but expert systems that are incapable of learning and adapting should not be called “intelligent.” Some expert knowledge was programmed in, that is all. So, what are the future trends for Adaptive Business Intelligence? In the words of Jim Goodnight, the CEO of SAS Institute (Collins et al. 2007): “Until recently, business intelligence was limited to basic query and reporting, and it never really provided that much intelligence ….”

However, this is about to change. Keith Collins, the Chief Technology Officer of SAS Institute (Collins et al. 2007), believes that: “A new platform definition is emerging for business intelligence, where BI is no longer defined as simple query and reporting. […] In the next five years, we’ll also see a shift in performance management to what we’re calling predictive performance management, where analytics play a huge role in moving us beyond just simple metrics to more powerful measures.” Further, Jim Davis, the VP Marketing of SAS Institute (Collins et al. 2007), stated: “In the next three to five years, we’ll reach a tipping point where more organizations will be using BI to focus on how to optimize processes and influence the bottom line ….” Finally, it will be important to incorporate adaptability into the prediction and optimization components of future Adaptive Business Intelligence systems. Some recent, successful implementations of Adaptive Business Intelligence systems have been reported (e.g., Michalewicz et al. 2005), which provide daily decision support for large corporations and result in multi-million-dollar returns on investment. There are also companies (e.g., www.solveitsoftware.com) that specialize in the development of Adaptive Business Intelligence tools. However, further research effort is required. For example, most of the research in machine learning has focused on using historical data to build prediction models. Once the model is built and evaluated, the goal is accomplished. However, because new data arrive at regular intervals, building and evaluating a model is just the first step in Adaptive Business Intelligence. Because these models need to be updated regularly (something that the adaptability module is responsible for), we expect to see more emphasis on this updating process in machine learning research.
Also, the frequency of updating the prediction module, which can vary from seconds (e.g., in real-time currency trading systems) to weeks and months (e.g., in fraud detection systems), may require different techniques and methodologies. In general, Adaptive Business Intelligence systems would include the research results from control theory, statistics, operations research, machine learning, and modern heuristic methods, to name a few. We also expect that major advances will continue to be made in modern optimization techniques. In the years to come, more and more research papers will be published on constrained and multi-objective optimization problems, and on optimization problems set in dynamic environments. This is essential, as most real-world business problems are constrained, multi-objective, and set in a time-changing environment.
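The updating process discussed above can be illustrated with a small sketch. It compares a prediction model that is fitted once on historical data with one that is corrected after every new observation. The data stream, the single-coefficient linear model, the learning rate, and all variable names are assumptions made for illustration only; they are not taken from any system described in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stream (an assumption): y = a_t * x + noise, where the true
# coefficient a_t jumps from 1.0 to 3.0 halfway through.
n_steps = 1000
x = rng.normal(size=n_steps)
a_true = np.where(np.arange(n_steps) < 500, 1.0, 3.0)
y = a_true * x + 0.1 * rng.normal(size=n_steps)

# Static model: least-squares slope fitted once on the first 500 samples.
w_static = np.dot(x[:500], y[:500]) / np.dot(x[:500], x[:500])

# Adaptive model: same starting point, but corrected after every observation.
w_adaptive = w_static
learning_rate = 0.05
static_sq_err, adaptive_sq_err = [], []
for t in range(500, n_steps):
    # Predict first, then learn from the realized outcome.
    static_sq_err.append((y[t] - w_static * x[t]) ** 2)
    err = y[t] - w_adaptive * x[t]
    adaptive_sq_err.append(err ** 2)
    w_adaptive += learning_rate * err * x[t]

# Compare accuracy once the adaptive model has had 100 steps to react.
static_mse = float(np.mean(static_sq_err[100:]))
adaptive_mse = float(np.mean(adaptive_sq_err[100:]))
print(f"static MSE={static_mse:.3f}, adaptive MSE={adaptive_mse:.3f}")
```

In this toy setting the static model keeps predicting with the outdated coefficient, whereas the adaptive one tracks the change. A real adaptability module would also have to decide how often to update, which is the open question raised in the text.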

CONCLUSION

It is not surprising that the fundamental components of Adaptive Business Intelligence are already emerging in other areas of business. For example, the Six Sigma methodology is a great example of a well-structured, data-driven methodology for eliminating defects, waste, and quality-control problems in many industries. This methodology recommends the sequence of steps shown in Figure 3. Note that the above sequence is very close “in spirit” to part of the previous diagram, as it describes (in more detail) the adaptability control loop. Clearly, we have to “measure,” “analyze,” and “improve,” as we operate in a dynamic environment, so the process of improvement is continuous. The SAS Institute proposes another methodology, which is more oriented towards data mining activities. Their methodology recommends the sequence of steps shown in Figure 4. Again, note that the above sequence is very close to another part of our diagram, as it describes (in more detail) the transformation from data to knowledge. It is not surprising that businesses are placing considerable emphasis on these areas, because better decisions usually translate into better financial performance. And better financial performance is what Adaptive Business Intelligence is all about. Systems based on Adaptive Business Intelligence aim at solving real-world business problems that have complex constraints, are set in time-changing environments, have several (possibly conflicting) objectives, and where the number of possible solutions is too large to enumerate. Solving these problems requires a system that incorporates modules for prediction, optimization, and adaptability.

REFERENCES

Coello, C.A.C., Van Veldhuizen, D.A., and Lamont, G.B. (2002). Evolutionary algorithms for solving multi-objective problems. Kluwer Academic. Collins, K., Goodnight, J., Hagström, M., and Davis, J. (2007). The future of business intelligence: Four questions, four views. SASCOM, First quarter, 2007. Davenport, T.H. and Prusak, L. (2006). Working knowledge. Academic Internet Publishers. Deb, K. (2001). Multi-objective optimization using evolutionary algorithms. Wiley. Dhar, V. and Stein, R. (1997). Seven methods for transforming corporate data into business intelligence. Prentice Hall. Loshin, D. (2003). Business intelligence: The savvy manager’s guide. Morgan Kaufmann. Makridakis, S., Wheelwright, S.C., and Hyndman, R.J. (1998). Forecasting: Methods and applications. Wiley. Michalewicz, Z. and Fogel, D.B. (2004). How to solve it: Modern heuristics, 2nd edition. Springer.

Figure 3. Six Sigma methodology sequence: Define → Measure → Analyze → Improve → Control

Figure 4. SAS Institute recommended methodology sequence: Sample → Explore → Modify → Model → Assess

Michalewicz, Z., Schmidt, M., Michalewicz, M., and Chiriac, C. (2005). A decision-support system based on computational intelligence: A case study. IEEE Intelligent Systems, 20(4), 44-49. Michalewicz, Z., Schmidt, M., Michalewicz, M., and Chiriac, C. (2007). Adaptive business intelligence. Springer. Moss, L. T. and Atre, S. (2003). Business intelligence roadmap. Addison Wesley. Prusak, L. (1997). Knowledge in organizations. Butterworth-Heinemann. Shortliffe, E. H. and Cimino, J. J. Eds (2006). Biomedical informatics: Computer applications in health care and biomedicine. Springer. Vitt, E., Luckevich, M., and Misner, S. (2002). Business intelligence: Making better decisions faster. Microsoft Press. Weiss, S. M. and Indurkhya, N., (1998). Predictive data mining. Morgan Kaufmann. Witten, I. H. and Frank, E. (2005). Data mining: Practical machine learning tools and techniques, 2nd edition. Morgan Kaufmann.

TERMS AND DEFINITIONS

Adaptive Business Intelligence: The discipline of using prediction and optimization techniques to build self-learning “decisioning” systems.

Business Intelligence: A collection of tools, methods, technologies, and processes needed to transform data into actionable knowledge.

Data: Pieces collected on a daily basis in the form of bits, numbers, symbols, and “objects.”

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships, or obtaining systems that perform useful tasks such as classification, prediction, estimation, or affinity grouping.

Information: “Organized data,” which are preprocessed, cleaned, arranged into structures, and stripped of redundancy.

Knowledge: “Integrated information,” which includes facts and relationships that have been perceived, discovered, or learned.

Optimization: Process of finding the solution that is the best fit to the available resources.

Prediction: A statement or claim that a particular event will occur in the future.

ENDNOTE

1. Note that business intelligence can be defined both as a “state” (a report that contains knowledge) and a “process” (software responsible for converting data into knowledge).


Adaptive Neural Algorithms for PCA and ICA Radu Mutihac University of Bucharest, Romania

INTRODUCTION

Artificial neural networks (ANNs) (McCulloch & Pitts, 1943; Haykin, 1999) were developed as models of their biological counterparts, aiming to emulate real neural systems and mimic the structural organization and function of the human brain. Their applications were based on the ability of self-designing to solve a problem by learning the solution from data. A comparative study of neural implementations running principal component analysis (PCA) and independent component analysis (ICA) was carried out. Artificially generated data, additively corrupted with white noise in order to enforce randomness, were employed to critically evaluate and assess the reliability of data projections. Analysis in both time and frequency domains showed the superiority of the estimated independent components (ICs) relative to principal components (PCs) in faithful retrieval of the genuine (latent) source signals.

Neural computation belongs to information processing dealing with adaptive, parallel, and distributed (localized) signal processing. In data analysis, a common task consists in finding an adequate subspace of multivariate data for subsequent processing and interpretation. Linear transforms are frequently employed in data model selection due to their computational and conceptual simplicity. Some common linear transforms are PCA, factor analysis (FA), projection pursuit (PP), and, more recently, ICA (Comon, 1994). The latter emerged as an extension of nonlinear PCA (Hotelling, 1933) and developed in the context of blind source separation (BSS) (Cardoso, 1998) in signal and array processing. ICA is also related to recent theories of the visual brain (Barlow, 1991), which assume that consecutive processing steps lead to a progressive reduction in the redundancy of representation (Olshausen and Field, 1996).

This contribution is an overview of the PCA and ICA neuromorphic architectures and their associated algorithmic implementations increasingly used as exploratory techniques. The discussion is conducted on artificially generated sub- and super-Gaussian source signals.

BACKGROUND

In neural computation, transforming methods amount to unsupervised learning, since the representation is only learned from data without any external control. Irrespective of the nature of learning, the neural adaptation may be formally conceived as an optimization problem: an objective function describes the task to be performed by the network and a numerical optimization procedure allows adapting network parameters (e.g., connection weights, biases, internal parameters). This process amounts to search or nonlinear programming in a quite large parameter space. However, any prior knowledge available on the solution might be efficiently exploited to narrow the search space. In supervised learning, the additional knowledge is incorporated in the net architecture or learning rules (Gold, 1996). Less extensive research has focused on unsupervised learning. In this respect, the mathematical methods usually employed are drawn from classical constrained multivariate nonlinear optimization and rely on the Lagrange multipliers method, the penalty or barrier techniques, and the classical numerical algebra techniques, such as deflation/renormalization (Fiori, 2000), the Gram-Schmidt orthogonalization procedure, or the projection over the orthogonal group (Yang, 1995).

PCA and ICA Models

Mathematically, the linear stationary PCA and ICA models can be defined on the basis of a common data model. Suppose that some stochastic processes are represented by three random (column) vectors $x(t), n(t) \in \mathbb{R}^N$ and $s(t) \in \mathbb{R}^M$ with zero mean and finite covariance, with the components of $s(t) = \{s_1(t), s_2(t), \ldots, s_M(t)\}$ being statistically independent and at most one Gaussian. Let $A$ be a rectangular constant full column rank $N \times M$ matrix with at least as many rows as columns ($N \ge M$), and denote by $t$ the sample index (i.e., time or sample point) taking the discrete values $t = 1, 2, \ldots, T$. We postulate the existence of a linear relationship among these variables like:

$$x(t) = A s(t) + n(t) = \sum_{i=1}^{M} s_i(t)\, a_i + n(t) \qquad (1)$$

Here $s(t)$, $x(t)$, $n(t)$, and $A$ are the sources, the observed data, the (unknown) noise in data, and the (unknown) mixing matrix, respectively, whereas $a_i$, $i = 1, 2, \ldots, M$ are the columns of $A$. Mixing is supposed to be instantaneous, so there is no time delay between a (latent) source variable $s_i(t)$ mixing into an observable (data) variable $x_j(t)$, with $i = 1, 2, \ldots, M$ and $j = 1, 2, \ldots, N$.

Consider that the stochastic vector process $\{x(t)\} \in \mathbb{R}^N$ has the mean $E\{x(t)\} = 0$ and the covariance matrix $C_x = E\{x(t)\, x(t)^T\}$. The goal of PCA is to identify the dependence structure in each dimension and to come out with an orthogonal transform matrix $W$ of size $L \times N$ from $\mathbb{R}^N$ to $\mathbb{R}^L$, $L < N$, such that the L-dimensional output vector $y(t) = W x(t)$ sufficiently represents the intrinsic features of the input data, and where the covariance matrix $C_y$ of $\{y(t)\}$ is a diagonal matrix $D$ with the diagonal elements arranged in descending order, $d_{i,i} \ge d_{i+1,i+1}$. The restoration of $\{x(t)\}$ from $\{y(t)\}$, say $\{\hat{x}(t)\}$, is consequently given by $\hat{x}(t) = W^T W x(t)$ (Figure 1). For a given $L$, PCA aims to find an optimal value of $W$, such as to minimize the error function $J = E\{\|x(t) - \hat{x}(t)\|^2\}$. The rows in $W$ are the PCs of the stochastic process $\{x(t)\}$ and the eigenvectors $c_j$, $j = 1, 2, \ldots, L$ of the input covariance matrix $C_x$. The subspace spanned by the principal eigenvectors $\{c_1, c_2, \ldots, c_L\}$ with $L < N$ is called the PCA subspace of dimensionality $L$.

Figure 1. Schematic of the PCA model

The ICA problem can be formulated as follows: given $T$ realizations of $x(t)$, estimate both the matrix $A$ and the corresponding realizations of $s(t)$. In BSS the task is somewhat relaxed to finding the waveforms $\{s_i(t)\}$ of the sources knowing only the (observed) mixtures $\{x_j(t)\}$. If no suppositions are made about the noise, the additive noise term is omitted in (1). A practical strategy is to include noise in the signals as supplementary term(s): hence the ICA model (Fig. 2) becomes:

$$x(t) = A s(t) = \sum_{i=1}^{M} a_i\, s_i(t) \qquad (2)$$

The source separation consists in updating an unmixing matrix $B(t)$, without resorting to any information about the spatial mixing matrix $A$, so that the output vector $y(t) = B(t) x(t)$ becomes an estimate $y(t) = \hat{s}(t)$ of the original independent source signals $s(t)$. The separating matrix $B(t)$ is divided in two parts dealing with dependencies in the first two moments, i.e., the whitening matrix $V(t)$, and the dependencies in higher-order statistics, i.e., the orthogonal separating matrix $W(t)$ in the whitened space (Fig. 2). If we assume zero-mean observed data $x(t)$, then we get by whitening a vector $v(t) = V(t) x(t)$ with decorrelated components. The subsequent linear transform $W(t)$ seeks the solution by an adequate rotation in the space of component densities and yields $y(t) = W(t) v(t)$ (Fig. 2). The total separation matrix between the input and the output layer turns out to be $B(t) = W(t) V(t)$. In the standard stationary case, the whitening and the orthogonal separating matrices converge to some constant values after a finite number of iterations during learning, that is, $B(t) \to B = W V$.

Figure 2. Schematic of the ICA model

Figure 3. A simple feed-forward ANN performing PCA and ICA
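As a point of comparison for the adaptive rules that follow, the PCA model and the whitening step can be computed in batch form from the sample covariance matrix. The sketch below is only an illustration under stated assumptions: the sources, the mixing matrix `A`, the noise level, and the choice of an eigen-decomposition (rather than a neural rule) are all hypothetical choices, not the procedure used in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data (assumed): M = 2 independent uniform sources mixed into
# N = 3 observed channels, following x(t) = A s(t) + n(t).
T = 5000
s = rng.uniform(-1.0, 1.0, size=(2, T))
A = np.array([[1.0, 0.5], [0.3, 1.0], [0.8, -0.6]])
x = A @ s + 0.05 * rng.normal(size=(3, T))
x = x - x.mean(axis=1, keepdims=True)          # enforce E{x(t)} = 0

# Sample covariance C_x and its eigen-decomposition, sorted in descending order.
C_x = (x @ x.T) / T
eigvals, eigvecs = np.linalg.eigh(C_x)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# PCA transform: the L principal eigenvectors form the rows of W (L x N).
L = 2
W = eigvecs[:, :L].T
y = W @ x                                       # y(t) = W x(t)
x_hat = W.T @ y                                 # x_hat(t) = W^T W x(t)

C_y = (y @ y.T) / T                             # diagonal, descending entries
J = np.mean(np.sum((x - x_hat) ** 2, axis=0))   # mean squared reconstruction error

# Whitening matrix V such that v(t) = V x(t) has identity covariance.
V = np.diag(eigvals ** -0.5) @ eigvecs.T
v = V @ x
```

With this construction the reconstruction error `J` equals the sum of the discarded eigenvalues of `C_x`, which is why keeping the principal eigenvectors minimizes it.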

NEURAL IMPLEMENTATIONS

A neural approach to BSS entails a network that has mixtures of the source signals as input and produces approximations of the source signals as output (Figure 3). As a prerequisite, the input signals must be mutually uncorrelated, a requirement usually fulfilled by PCA. The output signals must nevertheless be mutually independent, which leads in a natural way from PCA to ICA. The higher order statistics required by source separation can be incorporated into computations either explicitly or by using suitable nonlinearities. ANNs better fit the latter approach (Karhunen, 1996). The core of the large class of neural adaptive algorithms consists in a learning rule and its associated optimization criterion (objective function). These two items differentiate the algorithms, which are actually families of algorithms parameterized by the nonlinear function used. An update rule is specified by the iterative incremental change $\Delta W$ of the rotation matrix $W$, which gives the general form of the learning rule:

$$W \to W + \Delta W \qquad (3)$$

Neural PCA

First, consider a single artificial neuron receiving an M-dimensional input vector $x$. It gradually adapts its weight vector $w$ so that the function $E\{f(w^T x)\}$ is maximized, where $E$ is the expectation with respect to the (unknown) probability density of $x$ and $f$ is a continuous objective function. The function $f$ is bounded by setting constant the Euclidean norm of $w$. A constrained gradient ascent learning rule based on a sequence of sample functions for relatively small learning rates $\alpha(t)$ is then (Oja, 1995):

$$w(t+1) = w(t) + \alpha(t) \left(I - w(t)\, w(t)^T\right) x(t)\, g\!\left(w(t)^T x(t)\right) \qquad (4)$$

where $g = f'$. PCA learning rules tend to find the direction in the input space along which the data has maximal variance. If all directions in the input space have equal variance, the one-unit case with a suitable nonlinearity is approximately minimizing the kurtosis of the neuron input. It means that the weight vector of the unit will be determined by the direction in the input space on which the projection of the input data is mostly clustered and deviates significantly from normality. This task is essentially the goal in the PP technique.

In the case of single layer ANNs consisting of $L$ parallel units, with each unit $i$ having the same M-element input vector $x$ and its own weight vector $w_i$ that together comprise an $M \times L$ weight matrix $W = [w_1, w_2, \ldots, w_L]$, the following training rule obtained from (4) is a generalization of the linear PCA learning rule (in matrix form):

$$W(t+1) = W(t) + \alpha(t) \left(I - W(t)\, W(t)^T\right) x(t)\, g\!\left(x(t)^T W(t)\right) \qquad (5)$$

Due to the instability of the above nonlinear Hebbian learning rule for the multi-unit case, a different approach based on optimizing two criteria simultaneously was introduced (Oja, 1982):

$$W(t+1) = W(t) + \mu(t)\, x(t)\, g\!\left(y(t)^T\right) + \gamma(t)\, W(t) \left(I - W(t)^T W(t)\right) \qquad (6)$$

where $y(t) = W(t)^T x(t)$ is the output vector. Here $\mu(t)$ is chosen positive or negative depending on our interest in maximizing or minimizing, respectively, the objective function $J_1(w_i) = E\{f(x^T w_i)\}$. Similarly, $\gamma(t)$ is another gain parameter that is always positive and constrains the weight vectors to orthonormality, which is imposed by an appropriate penalty function such as:

$$J_2(w_i) = \frac{1}{2}\left(1 - w_i^T w_i\right)^2 + \frac{1}{2} \sum_{j=1,\, j \ne i}^{M} \left(w_i^T w_j\right)^2$$

This is the bigradient algorithm, which is iterated until the weight vectors have converged with the desired accuracy. This algorithm can use normalized Hebbian or anti-Hebbian learning in a unified formula. Starting from the one-unit rule, the multi-unit bigradient algorithm can simultaneously extract several robust counterparts of the principal or minor eigenvectors of the data covariance matrix (Wang, 1996).

In the case of multilayered ANNs, the transfer functions of the hidden nodes can be expressed by radial basis functions (RBF), whose parameters could be learnt by a two-stage gradient descent strategy. A new growing RBF-node insertion strategy with different RBF is used in order to improve the net performances. The learning strategy is reported to save computational time and memory space in approximation of continuous and discontinuous mappings (Esposito et al., 2000).
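The one-unit rule (4) is easy to try numerically. The following sketch assumes the simplest case, a linear nonlinearity g(u) = u (so that f(u) = u²/2), a constant learning rate, and synthetic two-dimensional Gaussian data; these choices and all variable names are illustrative assumptions, not the settings used by the authors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Zero-mean 2-D data (assumed) with standard deviations 2 and 0.5 along axes
# rotated by 30 degrees, so the principal direction is the first column of R.
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = R @ (np.array([[2.0], [0.5]]) * rng.normal(size=(2, 5000)))

w = rng.normal(size=2)
w /= np.linalg.norm(w)            # start from a random unit vector
alpha = 0.01                      # small constant learning rate (assumed)

for x in X.T:
    u = w @ x                     # neuron output w^T x
    # Rule (4) with g(u) = u: w <- w + alpha (I - w w^T) x g(w^T x)
    w += alpha * (x - w * u) * u

c1 = R[:, 0]                      # true principal eigenvector
alignment = abs(w @ c1)           # close to 1 when w has found that direction
```

The sign of `w` is arbitrary, which is why the alignment is measured through an absolute value. The term `- w * u * u` is what keeps the norm of `w` close to one without an explicit normalization step.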

Neural ICA

Various forms of unsupervised learning have been implemented in ANNs beyond standard PCA, like nonlinear PCA and ICA. Data whitening can be neurally emulated by PCA with a simple iterative algorithm that updates the sphering matrix $V(t)$:

$$V(t+1) = V(t) - \mu(t)\left(v v^T - I\right) \qquad (7)$$

After getting the decorrelation matrix $V(t)$, the basic task for ICA algorithms remains to come out with an orthogonal matrix $W(t)$, which is equivalent to a suitable rotation of the decorrelated data $v(t) = V(t) x(t)$ aiming to maximize the product of the marginal densities of its components. There are various neural approaches to estimate the rotation matrix $W(t)$. An important class of algorithms is based on maximization of network entropy (Bell, 1995). The Bell-Sejnowski (BS) nonlinear information maximization (infomax) algorithm performs online stochastic gradient ascent in mutual information (MI) between outputs and inputs of a network. By minimizing the MI between outputs, the network factorizes the inputs into independent components. Considering a network with the input vector $x(t)$, a weight matrix $W(t)$, and a monotonically transformed output vector $y = g(Wx + w_0)$, the resulting learning rules for the weights and bias-weights, respectively, are:

$$\Delta W = \left[W^T\right]^{-1} + (1 - 2y)\, x^T \quad \text{and} \quad \Delta w_0 = 1 - 2y \qquad (8)$$

In the case of bounded variables, the interplay between the anti-Hebbian term $(1 - 2y)\, x^T$ and the antidecay term $\left[W^T\right]^{-1}$ produces an output density that is close to the flat constant distribution, which corresponds to the maximum entropy distribution. Amari, Cichocki, and Yang (Amari, 1996) altered the BS infomax algorithm by using the natural gradient instead of the stochastic gradient to reduce the complexity of neural computations and significantly improve the speed of convergence. The update rule proposed for the separating matrix is:

$$\Delta W = \left[I - g(Wx)\,(Wx)^T\right] W \qquad (9)$$

Lee et al. (Lee, 2000) extended to both sub- and super-Gaussian distributions the learning rule developed from the infomax principle, satisfying a general stability criterion and preserving the simple initial architecture of the network. Applying either natural or relative gradient (Cardoso, 1996) for optimization, their learning rule yields results that compete with fixed-point batch computations.

The equivariant adaptive separation via independence (EASI) algorithm introduced by Cardoso and Laheld (1996) is a nonlinear decorrelation method. The objective function $J(W) = E\{f(Wx)\}$ is subject to minimization with the orthogonal constraint imposed on $W$ and the nonlinearity $g = f'$ chosen according to data kurtosis. Its basic update rule equates to:

$$\Delta W = -\mu \left[y y^T - I + g(y)\, y^T - y\, g(y^T)\right] W \qquad (10)$$

Fixed-point (FP) algorithms search for the ICA solution by minimizing mutual information (MI) among the estimated components (Hyvärinen, 1997). The FastICA learning rule finds a direction $w$ so that the projection $w^T x$ maximizes a contrast function of the form $J_G(w) = \left[E\{f(w^T x)\} - E\{f(v)\}\right]^2$, with $v$ standing for the standardized Gaussian variable. The learning rule is basically a Gram-Schmidt-like decorrelation method.
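The natural-gradient update (9) can be tried on a toy separation problem. The sketch below is a minimal illustration under explicit assumptions: two Laplacian (super-Gaussian) sources, a hypothetical 2 × 2 mixing matrix, the nonlinearity g = tanh, a fixed learning rate, and a batch average in place of the online expectation. None of these settings come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two super-Gaussian (Laplacian) sources and a mixing matrix, both assumed.
T = 5000
s = rng.laplace(size=(2, T))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
x = A @ s                                    # noiseless model (2)

W = np.eye(2)                                # separating matrix, initial guess
mu = 0.1                                     # learning rate (assumed)
for _ in range(500):
    y = W @ x
    # Batch-averaged form of rule (9): dW = [I - g(y) y^T] W, with g = tanh.
    dW = (np.eye(2) - (np.tanh(y) @ y.T) / T) @ W
    W += mu * dW

# Overall transform; for a good separation it is close to a scaled permutation.
Q = W @ A
Q_rel = np.abs(Q) / np.abs(Q).max(axis=1, keepdims=True)
```

Each row of `Q_rel` should then contain one entry equal to one and a much smaller second entry, meaning that every output is dominated by a single source. The tanh nonlinearity suits super-Gaussian sources only; sub-Gaussian sources would need a different choice, which is the point of the extended rule of Lee et al.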

ALGORITHM ASSESSMENT

We ran both PCA and ICA neural algorithms comparatively, using synthetically generated time series additively corrupted with some white noise to alleviate strict determinism (Table 1 and Fig. 4). Neural PCA was implemented using the bigradient algorithm since it works for both minimization and maximization of the criterion $J_1$ under the normality constraints enforced by the penalty function $J_2$. The neural ICA algorithms were the extended infomax of Bell and Sejnowski, a semi-adaptive fixed-point fast ICA algorithm (Hyvärinen & Oja, 1997), an adapted variant of the EASI algorithm optimized for real data, and the extended generalized lambda distribution (EGLD) maximum likelihood-based algorithm.

Table 1. The analytical form of the signal sources

Modulated sinusoid: S(1) = 2 * sin(t/149) * cos(t/8)
Square waves: S(2) = sign(sin(12 * t + 9 * cos(2/29)))
Saw-tooth: S(3) = (rem(t, 79) - 17)/23
Impulsive curve: S(4) = ((rem(t, 23) - 11)/9)^5
Exponential decay: S(5) = 5 * exp(-t/121) * cos(37 * t)
Spiky noise: S(6) = ((rand(1, T) < .5) * 2 - 1) * log(rand(1, T))

Figure 4. Sub-Gaussian (left) and super-Gaussian (right) source signals and their corresponding histograms (bottom)

In the case of artificially generated sources, the accuracy of separating the latent sources by an algorithm performing ICA can be measured by means of some quantitative indexes. The first we used was defined as the signal-to-interference ratio (SIR):

$$SIR = \frac{1}{N} \sum_{i=1}^{N} 10 \cdot \log_{10} \frac{\max(Q_i)^2}{Q_i^T Q_i - \max(Q_i)^2} \qquad (11)$$

where $Q = BA$ is the overall transforming matrix of the latent source components, $Q_i$ is the i-th column of $Q$, $\max(Q_i)$ is the maximum element of $Q_i$, and $N$ is the number of the source signals. The higher the SIR is, the better the separation performance of the algorithm.

A second index was the distance between the overall transforming matrix $Q$ and an ideal permutation matrix, which is interpreted as the cross-talking error (CTE):

$$CTE = \sum_{i=1}^{N} \left( \sum_{j=1}^{N} \frac{|Q_{ij}|}{\max |Q_i|} - 1 \right) + \sum_{j=1}^{N} \left( \sum_{i=1}^{N} \frac{|Q_{ij}|}{\max |Q_j|} - 1 \right) \qquad (12)$$

Above, $Q_{ij}$ is the ij-th element of $Q$, $\max |Q_i|$ is the maximum absolute valued element of the row $i$ in $Q$, and $\max |Q_j|$ is the maximum absolute valued element of the column $j$ in $Q$. A permutation matrix is defined so that on each of its rows and columns, only one of the elements equals unity while all the other elements are zero. It means that the CTE attains its minimum value zero for an exact permutation matrix (i.e., perfect decomposition) and goes positively higher the more $Q$ deviates from a permutation matrix (i.e., decomposition of lower accuracy).

We defined the relative signal retrieval error (SRE) as the Euclidean distance between the source signals and their best matching estimated components, normalized to the number of source signals, times the number of time samples, and times the module of the source signals:

$$SRE = \frac{1}{TN} \cdot \frac{\sum_{i=1}^{N} \left( \sum_{t=1}^{T} \left[x_i(t) - y_i(t)\right]^2 \right)}{\sum_{i=1}^{N} \left( \sum_{t=1}^{T} \left[x_i(t)\right]^2 \right)}, \quad t = 1, 2, \ldots, T \qquad (13)$$

The lower the SRE is, the better the estimates approximate the latent source signals. The stabilized version of FastICA algorithm is attractive by its fast and reliable convergence, and by the lack of parameters to be tuned. The natural gradient incorporated in the BS extended infomax performs better than the original gradient ascent and is computationally less demanding. Though the BS algorithm is theoretically optimal in the sense of dealing with mutual information as objective function, like all neural unsupervised algorithms, its performance heavily depends on the learning rates and its convergence is rather slow. The EGLD algorithm separates skewed distributions, even for zero kurtosis. In terms of computational time, the BS extended infomax algorithm was the fastest, FastICA more faithfully retrieved the sources among all algorithms under test, while the EASI algorithm came out with a full transform matrix Q that is the closest to unity.
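The SIR and CTE indexes of (11) and (12) depend only on the overall transform Q = BA, so they are straightforward to compute once Q is known. The sketch below is one possible reading of those formulas, taking absolute values before locating the maxima; the function names and the two example matrices are hypothetical and serve only to illustrate how the indexes behave.

```python
import numpy as np

def signal_to_interference_ratio(Q):
    """SIR in dB following (11): average over the columns Q_i of Q = BA."""
    Q = np.abs(np.asarray(Q, dtype=float))
    peak_sq = Q.max(axis=0) ** 2             # max(Q_i)^2 for each column
    total_sq = (Q ** 2).sum(axis=0)          # Q_i^T Q_i
    return float(np.mean(10.0 * np.log10(peak_sq / (total_sq - peak_sq))))

def cross_talking_error(Q):
    """CTE following (12): zero for an exact permutation matrix."""
    Q = np.abs(np.asarray(Q, dtype=float))
    rows = (Q / Q.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    cols = (Q / Q.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return float(rows.sum() + cols.sum())

Q_good = np.array([[0.1, 1.0], [1.0, 0.1]])  # near-permutation (illustrative)
Q_bad = np.array([[0.6, 1.0], [1.0, 0.7]])   # stronger cross-talk (illustrative)

for name, Q in (("good", Q_good), ("bad", Q_bad)):
    print(name, signal_to_interference_ratio(Q), cross_talking_error(Q))
```

Note that the SIR of (11) diverges for an exact permutation matrix, because the interference term in the denominator vanishes, so in practice it is evaluated on estimated transforms that retain some residual cross-talk.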

FUTURE TRENDS

Neuromorphic methods in exploratory analysis and data mining are rapidly emerging applications of unsupervised neural training. In recent years, new learning algorithms have been proposed, yet their theoretical properties, range of optimal applicability, and comparative assessment have remained largely unexplored. No convergence theorems are associated with the training algorithms in use. Moreover, algorithm convergence heavily depends on the proper choice of the learning rate(s) and, even when convergence is accomplished, the neural algorithms are relatively slow compared with batch-type computations. Nonlinear and nonstationary neural ICA is expected to be developed due to ANNs' nonalgorithmic processing and their ability to learn nonanalytical relationships if adequately trained.

CONCLUSION

Both PCA and ICA share some common features, like aiming at building generative models that are likely to have produced the observed data and performing information preservation and redundancy reduction. In a neuromorphic approach, the model parameters are treated as network weights that are changed during the learning process. The main difficulty in function approximation stems from choosing the network parameters that have to be fixed a priori, and those that must be learnt by means of an adequate training rule. PCA and ICA have major applications in data mining and exploratory data analysis, such as signal characterization, optimal feature extraction, and data compression, as well as the basis of subspace classifiers in pattern recognition. ICA is much better suited than PCA to perform BSS, blind deconvolution, and equalization.

REFERENCES

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind source separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.

Barlow, H. B. (1991). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217-234). Cambridge, MA: MIT Press.

Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.

Cardoso, J.-F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44, 3017-3030.

Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proceedings of the IEEE, 86(10), 2009-2025.

Comon, P. (1994). Independent component analysis, A new concept? Signal Processing, 36, 287-314.

Esposito, A., Marinaro, M., & Scarpetta, S. (2000). Approximation of continuous and discontinuous mappings by a growing neural RBF-based algorithm. Neural Networks, 13(6), 651-665.

Fiori, S., & Piazza, F. (2000). A general class of APEX-like PCA neural algorithms. IEEE Transactions on Circuits and Systems - Part I, 47, 1394-1398.

Gold, S., Rangarajan, A., & Mjolsness, E. (1996). Learning with preknowledge: Clustering with point and graph matching distance. Neural Computation, 8, 787-804.

Haykin, S. (1999). Neural networks (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417-441 and 498-520.

Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for ICA. Neural Computation, 9, 1483-1492.

Karhunen, J. (1996). Neural approaches to independent component analysis and source separation. Proceedings ESANN’96, Bruges, Belgium, 249-266.

Lee, T.-W., Girolami, M., Bell, A. J., & Sejnowski, T. J. (2000). A unifying information-theoretic framework for ICA. Computers and Mathematics with Applications, 39, 1-21.

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.

Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267-273.

Oja, E., Karhunen, J., Wang, L., & Vigario, R. (1995). Principal and independent components in neural networks - Recent developments. Proceedings VIIth Workshop on Neural Nets, Vietri, Italy.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607-609.

Wang, L., & Karhunen, J. (1996). A unified neural bigradient algorithm for robust PCA and MCA. International Journal of Neural Systems, 7, 53-67.

Yang, B. (1995). Projection approximation subspace tracking. IEEE Transactions on Signal Processing, 43, 1247-1252.

KEY TERMS

Artificial Neural Networks (ANNs): An information-processing synthetic system made up of several simple nonlinear processing units connected by elements that have information storage and programming functions adapting and learning from patterns, which mimics a biological neural network.

Blind Source Separation (BSS): Separation of latent nonredundant (e.g., mutually statistically independent or decorrelated) source signals from a set of linear mixtures, such that the regularity of each resulting signal is maximized, and the regularity between the signals is minimized (i.e., statistical independence is maximized) without (almost) any information on the sources.

Confirmatory Data Analysis (CDA): An approach which, subsequent to data acquisition, proceeds with the imposition of a prior model and analysis, estimation, and testing model parameters.

Exploratory Data Analysis (EDA): An approach based on allowing the data itself to reveal its underlying structure and model, heavily using the collection of techniques known as statistical graphics.

Independent Component Analysis (ICA): An exploratory method for separating a linear mixture of latent signal sources into independent components as optimal estimates of the original sources on the basis of their mutual statistical independence and non-Gaussianity.

Learning Rule: Weight change strategy in a connectionist system aiming to optimize a certain objective function. Learning rules are iteratively applied to the training set inputs with error gradually reduced as the weights are adapting.

Principal Component Analysis (PCA): An orthogonal linear transform based on singular value decomposition that projects data to a subspace that preserves maximum variance.

Adaptive Neuro-Fuzzy Systems Larbi Esmahi Athabasca University, Canada Kristian Williamson Statistics Canada, Canada Elarbi Badidi United Arab Emirates University, UAE

INTRODUCTION

Fuzzy logic became the core of a different approach to computing. Whereas traditional approaches to computing were precise, or hard edged, fuzzy logic allowed for the possibility of a less precise or softer approach (Klir et al., 1995, pp. 212-242). An approach where precision is not paramount is not only closer to the way humans think, but may in fact be easier to create as well (Jin, 2000). Thus was born the field of soft computing (Zadeh, 1994). Other techniques were added to this field, such as Artificial Neural Networks (ANN) and genetic algorithms, both modeled on biological systems. It was soon realized that these tools could be combined: mixed together, they could cover their respective weaknesses while generating something greater than the sum of its parts, in short, synergy.

Adaptive Neuro-fuzzy is perhaps the most prominent of these admixtures of soft computing technologies (Mitra et al., 2000). The technique was first created when artificial neural networks were modified to work with fuzzy logic, hence the Neuro-fuzzy name (Jang et al., 1997, pp. 1-7). This combination provides fuzzy systems with adaptability and the ability to learn. It was later shown that adaptive fuzzy systems could be created with other soft computing techniques, such as genetic algorithms (Yen et al., 1998, pp. 469-490), rough sets (Pal et al., 2003; Jensen et al., 2004; Ang et al., 2005) and Bayesian networks (Muller et al., 1998), but the Neuro-fuzzy name was already widely used, so it stayed. This chapter follows the most widely used terminology in the field.

Neuro-fuzzy is a blanket description of a wide variety of tools and techniques used to combine any aspect of fuzzy logic with any aspect of artificial neural networks. For the most part, these combinations are just extensions of one technology or the other. For example, neural networks usually take binary inputs, but use weights that vary in value from 0 to 1. Adding fuzzy sets to an ANN to convert a range of input values into values that can be used as weights is considered a Neuro-fuzzy solution. This chapter pays particular attention to the sub-field where the fuzzy logic rules are modified by the adaptive aspect of the system.

The rest of this chapter is organized as follows: Section 1 examines models and techniques used to combine fuzzy logic and neural networks to create Neuro-fuzzy systems. Section 2 provides an overview of the main steps involved in the development of adaptive Neuro-fuzzy systems. Section 3 concludes this chapter with some recommendations and future developments.
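The idea mentioned in the introduction, using fuzzy sets to convert a range of input values into values between 0 and 1, can be illustrated with a minimal sketch. The triangular membership shape, the temperature variable and the three set names below are assumptions chosen for illustration; the chapter does not prescribe them.

```python
def triangular(x, left, peak, right):
    """Degree of membership (0.0 to 1.0) of x in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets over a temperature input (degrees Celsius),
# each given as (left foot, peak, right foot).
FUZZY_SETS = {
    "cold": (-10.0, 0.0, 15.0),
    "warm": (5.0, 20.0, 30.0),
    "hot": (25.0, 35.0, 50.0),
}


def fuzzify(x):
    """Map a crisp input to one membership degree per fuzzy set."""
    return {name: triangular(x, *params) for name, params in FUZZY_SETS.items()}
```

For instance, `fuzzify(20.0)` gives full membership in "warm" and none in the other two sets, while an input of 10.0 belongs partially to both "cold" and "warm". Values of this kind are what a neural network layer could then consume.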

NEURO-FUZZY TECHNOLOGY

Neuro-fuzzy Technology is a broad term used to describe a field of techniques and methods used to combine fuzzy logic and neural networks (Jin, 2003, pp. 111-140). Fuzzy logic and neural networks each have their own sets of strengths and weaknesses, and most attempts to combine these two technologies have the goal of using each technique's strengths to cover the other's weaknesses.

Neural networks are capable of self-learning, classification and associating inputs with outputs. Neural networks can also become a universal function approximator (Kosko, 1997, pp. 299; Nauck et al., 1998; Nauck et al., 1999). Given enough information about an unknown continuous function, such as its inputs and outputs, the neural network can be trained to approximate it. The disadvantages of neural networks are that they are not guaranteed to converge, that is, to be trained properly, and that after they have been trained they cannot give any information about why they take a particular course of action when given a particular input.

Fuzzy logic inference systems can give human-readable and understandable information about why a particular course of action was taken, because they are governed by a series of IF THEN rules. Fuzzy logic systems can adapt in that their rules and the parameters of the fuzzy sets associated with those rules can be changed to meet some criteria. However, fuzzy logic systems lack the capability for self-learning, and must be modified by an external entity. Another salient feature of fuzzy logic systems is that they are, like artificial neural networks, capable of acting as universal approximators.

The common feature of being able to act as a universal approximator is the basis of most attempts to merge these two technologies. Not only can this ability be used to approximate a function, it can also be used by neural networks and fuzzy logic systems to approximate each other (Pal et al., 1999, pp. 66). Universal approximation is the ability of a system to replicate a function to some degree. Both neural networks and fuzzy logic systems do this by using a non-mathematical model of the system (Jang et al., 1997, pp. 238; Pal et al., 1999, pp. 19). The term approximate is used because the model does not have to match the simulated function exactly, although it is sometimes possible to do so if enough information about the function is available. In most cases it is not necessary or even desirable to simulate a function perfectly, as this takes time and resources that may not be available, and close is often good enough.

Categories of Neuro-Fuzzy Systems

Efforts to combine fuzzy logic and neural networks have been underway for several years, and many methods have been attempted and implemented. These methods fall into two major categories:

• Fuzzy Neural Networks (FNN): neural networks that can use fuzzy data, such as fuzzy rules, sets and values (Jin, 2003, pp. 205-220).
• Neural-Fuzzy Systems (NFS): fuzzy systems “augmented” by neural networks (Jin, 2003, pp. 111-140).

There are also four main architectures used for implementing neuro-fuzzy systems:

• Fuzzy Multi-layer networks (Jang, 1993; Mitra et al., 1995; Mitra et al., 2000; Mamdani et al., 1999; Sugeno et al., 1988; Takagi et al., 1985).
• Fuzzy Self-Organizing Map networks (Drobics et al., 2000; Kosko, 1997, pp. 98; Haykin, 1999, pp. 443).
• Black-Box Fuzzy ANN (Bellazzi et al., 1999; Qiu, 2000; Monti, 1996).
• Hybrid Architectures (Zatwarnicki, 2005; Borzemski et al., 2003; Marichal et al., 2001; Rahmoun et al., 2001; Koprinska et al., 2000; Wang et al., 1999; Whitfort et al., 1995).

DEVELOPMENT OF ADAPTIVE NEURO-FUZZY SYSTEMS

Developing an Adaptive Neuro-fuzzy system is a process similar to the procedures used to create fuzzy logic systems and neural networks. One advantage of this combined approach is that it is usually no more complicated than either approach taken individually. As noted above, there are two methods of creating a Neuro-fuzzy system: integrating fuzzy logic into a neural network framework (FNN), and implementing neural networks into a fuzzy logic system (NFS). A fuzzy neural network is just a neural network with some fuzzy logic components; hence it is generally trained like a normal neural network.

Training Process: The training regimen for an NFS differs slightly from that used to create a neural network or a fuzzy logic system in some key ways, while at the same time incorporating many improvements over those training methods. The training process of a Neuro-fuzzy system has five main steps (Von Altrock, 1995, pp. 71-75):

• Obtain Training Data: The data must cover all possible inputs and outputs, and all the critical regions of the function, if the system is to model it appropriately.
• Create a Fuzzy Logic System: The fuzzy system may be an existing system which is known to work, such as one that has been in production for some time or one that has been created by following expert system development methodologies.
• Define the Neural Fuzzy Learning: This phase deals with defining what you want the system to learn. This allows greater control over the learning process while still allowing for rule knowledge discovery.
• Training Phase: Run the training algorithm. The algorithm may have parameters that can be adjusted to control how the system is modified during training.
• Optimization and Verification: Validation can take many forms, but will usually involve feeding the system a series of known inputs to determine whether the system generates the desired output and/or stays within acceptable parameters. Furthermore, the rules and membership functions may be extracted so they can be examined by human experts for correctness.
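The five training steps listed above can be illustrated with a deliberately small sketch. Everything specific in it is an assumption made for illustration rather than the procedure of the cited sources: the target function y = x², zero-order Sugeno-style rules of the form "IF x is near c THEN y = q", Gaussian membership functions with fixed centers and width, and a plain gradient-descent update that adapts only the rule consequents q.

```python
import math


def normalized_strengths(x, centers, width):
    """Normalized firing strength of each rule 'IF x is near center'."""
    raw = [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]
    total = sum(raw)
    return [r / total for r in raw]


def predict(x, centers, width, consequents):
    """Output of the fuzzy system: strength-weighted average of consequents."""
    strengths = normalized_strengths(x, centers, width)
    return sum(s * q for s, q in zip(strengths, consequents))


def train(data, centers, width, consequents, rate=0.3, epochs=300):
    """Adapt the rule consequents by gradient descent on the squared error."""
    consequents = list(consequents)
    for _ in range(epochs):
        for x, target in data:
            strengths = normalized_strengths(x, centers, width)
            error = sum(s * q for s, q in zip(strengths, consequents)) - target
            for i, s in enumerate(strengths):
                consequents[i] -= rate * error * s
    return consequents


def max_error(data, centers, width, consequents):
    return max(abs(predict(x, centers, width, consequents) - y) for x, y in data)


# Step 1: obtain training data (hypothetical target: y = x^2 on [0, 1]).
data = [(i / 10.0, (i / 10.0) ** 2) for i in range(11)]

# Step 2: create an initial fuzzy system with five rules.
centers = [0.0, 0.25, 0.5, 0.75, 1.0]
width = 0.15
initial = [0.0] * len(centers)

# Step 3: define what is learned: only the consequents, not the fuzzy sets.
# Step 4: run the training algorithm.
trained = train(data, centers, width, initial)

# Step 5: verify against known inputs and extract the rules for inspection.
error_before = max_error(data, centers, width, initial)
error_after = max_error(data, centers, width, trained)
rules = [f"IF x is near {c} THEN y = {q:.3f}" for c, q in zip(centers, trained)]
```

Restricting learning to the consequents keeps the fuzzy sets readable while the system adapts; a fuller adaptive scheme might also tune the centers and widths, at the cost of a more involved update rule.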

CONCLUSION AND FUTURE DEVELOPMENTS

Advantages of ANF systems: Although there are many ways to implement a Neuro-fuzzy system, the advantages described for these systems are remarkably uniform across the literature. The advantages attributed to Neuro-fuzzy systems as compared to ANNs are usually related to the following aspects:

• Faster to train: This is due to the massive number of connections present in the ANN, and the non-trivial number of calculations associated with each. As well, most neural fuzzy systems can be trained by going through the data once, whereas a neural network may need to be exposed to the same training data many times before it converges.
• Less computational resources: A neural fuzzy system is smaller in size and contains fewer internal connections than a comparable ANN, hence it is faster and uses significantly fewer resources.
• Offer the possibility to extract the rules: This is a major advantage over ANNs in that the rules governing a system can be communicated to the human users in an easily understandable form.

Limitation of ANF systems: The greatest limitation in creating adaptive systems is known as the “Curse of Dimensionality”, named after the exponential growth in the number of features that the model has to keep track of as the number of input attributes increases. Each attribute in the model is a variable in the system, which corresponds to an axis in a multidimensional graph that the function is mapped into. The connections between different attributes correspond to the number of potential rules in the system, as given by the formula (Gorrostieta et al., 2006):

$$N_{rules} = (L_{linguistic\_terms})^{variables}$$

This formula becomes more complicated if there are different numbers of linguistic terms (fuzzy sets) covering each attribute dimension. Fortunately there are ways around this problem. As the neural fuzzy system is only approximating the function being modeled, the system may not need all the attributes to achieve the desired results.

Another area of criticism in the Neuro-fuzzy field is related to aspects that cannot be learned or approximated. One of the best known is the caveat attached to universal approximation: the function being approximated has to be continuous, that is, it must have no jumps and no singularities (points where it goes to infinity). Other functions that Adaptive Neuro-fuzzy systems may have problems learning are things like encryption algorithms, which are purposely designed to be resistant to this type of analysis.

Future developments: Predicting the future has always been hard; however, for ANF technology future expansion has been made easy by the widespread use of its underlying technologies (neural networks and fuzzy logic). Mixing these technologies creates synergies as they remedy each other's weaknesses. ANF technology allows complex systems to be grown instead of someone having to build them. One of the most promising areas for ANF systems is System Mining. There exist many cases where we wish to automate a system that cannot be systematically described in a mathematical manner. This means there is no way of creating a system using classical development methodologies (i.e. programming a simulation). If we have an adequately large set of examples of inputs and their corresponding outputs, ANF can be used to get a model of the system. The rules and their associated fuzzy sets can then be extracted from this system and examined for details about how the system works. This knowledge can be used to build the system directly. One interesting application of this technology is to audit existing complex systems. The extracted rules could be used to determine whether the rules match the expectations of what the system is supposed to do, and even to detect fraudulent actions. Alternatively, the extracted model may show an alternative and/or more efficient manner of implementing the system.
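As a small numerical illustration of the rule-count formula given under the limitations of ANF systems, the sketch below counts the potential rules of a system whose rule base covers every combination of linguistic terms. That full-grid rule base is an assumption; under it, the case with a different number of terms per attribute is simply the product of the per-attribute counts.

```python
from math import prod


def rule_count(terms_per_variable):
    """Potential rules when every combination of linguistic terms gets a rule.

    terms_per_variable holds one entry per input attribute: the number of
    linguistic terms (fuzzy sets) covering that attribute.
    """
    return prod(terms_per_variable)


same_terms_2 = rule_count([3] * 2)     # 3 terms on each of 2 attributes
same_terms_10 = rule_count([3] * 10)   # 3 terms on each of 10 attributes
mixed_terms = rule_count([3, 5, 7])    # different term counts per attribute
```

With three terms per attribute, two attributes give 9 potential rules while ten attributes already give 59049, which is the exponential growth the text refers to.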

REFERENCES Ang, K. K. & Quek, C. (2005). RSPOP: Rough SetBased Pseudo Outer-Product Fuzzy Rule Identification Algorithm. Neural Computation, (17) 1, 205-243. Bellazzi, R., Guglielmann, R. & Ironi L. (1999). A qualitative-fuzzy framework for nonlinear black-box system identification. In Dean T., editor, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI 99), volume 2, pages 1041—1046. Morgan Kaufmann Publishers. Borzemski, L. & Zatwarnicki, K. (2003). A fuzzy adaptive request distribution algorithm for cluster-based Web systems. In the Proceedings Eleventh Euromicro Conference on Parallel, Distributed and NetworkBased Processing, 119 - 126. Institute of Electrical & Electronics Engineering Publisher. Chavan, S., Shah, K., Dave, N., Mukherjee, S., Abraham, A., & Sanyal, S. (2004). Adaptive neuro-fuzzy intrusion detection systems. In Proceedings of the International Conference on Information Technology: Coding and Computing, ITCC 2004, 70 - 74 Vol.1. Institute of Electrical & Electronics Engineering Publisher. Drobics, M., Winiwater & W., Bodenhofer, U. (2000). Interpretation of self-organizing maps with fuzzy rules. In Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’00), p. 0304. IEEE Computer Society Press. Gorrostieta, E. & Pedraza, C. (2006). Neuro Fuzzy Modeling of Control Systems. In Proceedings of the 16th IEEE International Conference on Electronics, Communications and Computers (CONIELECOMP 2006), 23 – 23. IEEE Computer Society Publisher.

Haykin, S. (1999). Neural Networks: A Comprehensive Foundation. Prentice Hall Publishers, 2nd edition Jang, J. S. R., Sun C. T. & Mizutani E. (1997). NeuroFuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice Hall Publishers, US Ed edition. Jang, J.-S.R. (1993). ANFIS: adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man and Cybernetics, (23) 3, 665 – 685. Jensen, R. & Shen, Q. (2004). Semantics-preserving dimensionality reduction: rough and fuzzy-rough-based approaches. IEEE Transactions on Knowledge and Data Engineering, (16) 12, 1457 – 1471. Jin Y. (2000). Fuzzy modeling of high-dimensional systems: Complexity reduction and interpretability improvement. IEEE Transactions on Fuzzy Systems, (8) 2, 212-221. Jin, Y. (2003). Advanced Fuzzy Systems Design and Applications. Physica-Verlag Heidelberg Publishers; 1 edition. Klir, G. J. & Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall PTR Publishers; 1st edition. Koprinska, L. & Kasabov, N. (2000). Evolving fuzzy neural network for camera operations recognition. In Proceedings of the 15th International Conference on Pattern Recognition, 523 - 526 vol.2. IEEE Computer Society Press Publisher. Kosko, B. (1997). Fuzzy Engineering. Prentice Hall Publishers, 1st edition. Mamdani, E. H. & Assilian, S. (1999). An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller. International Journal of Human-Computer Studies, (51) 2, 135-147. Marichal, G.N., Acosta, L., Moreno, L., Mendez, J.A. & Rodrigo, J. J. (2001). Obstacle Avoidance for a Mobile Robot: A neuro-fuzzy approach. Fuzzy Sets and Systems, (124) 2, 171- 179. Mitra, S. & Hayashi Y. (2000). Neuro-fuzzy rule generation: survey in soft computing framework. IEEE Transactions on Neural Networks, (11) 3, 748 – 768.


Mitra, S. & Pal, S. K. (1995). Fuzzy multi-layer perceptron, inferencing and rule generation. IEEE Transactions on Neural Networks, (6) 1, 51-63. Monti, A. (1996). A fuzzy-based black-box approach to IGBT modeling. In Proceedings of the Third IEEE International Conference on Electronics, Circuits, and Systems: ICECS ‘96. 1147 - 1150 vol.2. Institute of Electrical & Electronics Engineering Publisher. Muller, P. & Insua, D.R. (1998). Issues in Bayesian Analysis of Neural Network Models. Neural Computation (10) 3, 749-770. Nauck, D. & Kruse R. (1999). Neuro-fuzzy systems for function approximation. Fuzzy Sets and Systems (101) 261-271. Nauck, D. & Kruse, R. (1998). A neuro-fuzzy approach to obtain interpretable fuzzy systems for function approximation. In Wcci 98: Proceedings of Fuzz-IEEE ‘98, 1106 - 1111 vol.2. IEEE World Congress on Computational Intelligence. Institute of Electrical & Electronics Engineering Publisher. Pal, S. K. & Mitra S. (1999). Neuro-Fuzzy Pattern Recognition: Methods in Soft Computing. John Wiley & Sons Publishers, 1st edition. Pal, S.K., Mitra, S. & Mitra, P. (2003). Rough-fuzzy MLP: modular evolution, rule generation, and evaluation. IEEE Transactions on Knowledge and Data Engineering, (15) 1, 14 – 25. Qiu F. (2000). Opening the black box of neural networks with fuzzy set theory to facilitate the understanding of remote sensing image processing. In Proceedings of the IEEE 2000 International Geoscience and Remote Sensing Symposium: Taking the Pulse of the Planet: The Role of Remote Sensing in Managing the Environment, IGARSS 2000. 1531 - 1533 vol.4. Institute of Electrical & Electronics Engineering Publisher. Rahmoun, A. & Berrani, S. (2001). A genetic-based neuro-fuzzy generator: NEFGEN. ACS/IEEE International Conference on Computer Systems and Applications, 18 – 23. Institute of Electrical & Electronics Engineering Publisher. Sugeno, M. & Kang, G. T. (1988). Structure identification of fuzzy model. Fuzzy Sets and Systems, (28) 1, 15-33.

Takagi T. & Sugeno M. (1985). Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, (15), 116-132. Von Altrock, C. (1995). Fuzzy Logic and Neuro Fuzzy Applications Explained. Prentice Hall Publishers. Wang L. & Yen J. (1999). Extracting Fuzzy Rules for System Modeling Using a Hybrid of Genetic Algorithms and Kalman Filter. Fuzzy Sets Systems, (101) 353–362. Whitfort, T., Matthews, C. & Jagielska, I. (1995). Automated knowledge acquisition for a fuzzy classification problem. In Kasabov, N. K. & Coghill, G. (Editors), Proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, 227 – 230. IEEE Computer Society Press Publisher. Yen, J. & Langari, R. (1998). Fuzzy Logic: Intelligence, Control, and Information. Prentice Hall Publishers. Zadeh, L. A. (1994). Fuzzy Logic, Neural Networks, and Soft Computing. Communications of the ACM (37) 3, 77-84. Zatwarnicki, K. (2005). Proposal of a neuro-fuzzy model of a WWW server. Proceedings of the Fifth International Conference on Intelligent Systems Design and Applications ISDA ‘05, 141 – 146. Institute of Electrical & Electronics Engineering Publisher.

KEY TERMS

Artificial Neural Networks (ANN): An artificial neural network, often just called a “neural network” (NN), is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing based on a connectionist approach to computation. Knowledge is acquired by the network from its environment through a learning process, and interneuron connection strengths (synaptic weights) are used to store the acquired knowledge.

Evolving Fuzzy Neural Network (EFuNN): An Evolving Fuzzy Neural Network is a dynamic architecture where the rule nodes grow if needed and shrink by aggregation. New rule units and connections can be added easily without disrupting existing nodes. The learning scheme is often based on the concept of “winning rule node”.

Fuzzy Logic: Fuzzy logic is an application area of fuzzy set theory dealing with uncertainty in reasoning. It utilizes concepts, principles, and methods developed within fuzzy set theory for formulating various forms of sound approximate reasoning. Fuzzy logic allows for set membership values to range (inclusively) between 0 and 1, and in its linguistic form, imprecise concepts like “slightly”, “quite” and “very”. Specifically, it allows partial membership in a set.

Fuzzy Neural Networks (FNN): Neural networks that are enhanced with fuzzy logic capability, such as using fuzzy data, fuzzy rules, sets and values.

Neuro-Fuzzy Systems (NFS): A neuro-fuzzy system is a fuzzy system that uses a learning algorithm derived from or inspired by neural network theory to determine its parameters (fuzzy sets and fuzzy rules) by processing data samples.

Self-Organizing Map (SOM): The self-organizing map is a subtype of artificial neural networks. It is trained using unsupervised learning to produce a low-dimensional representation of the training samples while preserving the topological properties of the input space. The self-organizing map is a single-layer feed-forward network where the output neurons are arranged in a low-dimensional (usually 2D or 3D) grid. Each input is connected to all output neurons. Attached to every neuron there is a weight vector with the same dimensionality as the input vectors. The number of input dimensions is usually a lot higher than the output grid dimension. SOMs are mainly used for dimensionality reduction rather than expansion.

Soft Computing: Soft Computing refers to a partnership of computational techniques in computer science, artificial intelligence, machine learning and some engineering disciplines, which attempt to study, model, and analyze complex phenomena. The principal partners at this juncture are fuzzy logic, neuro-computing, probabilistic reasoning, and genetic algorithms. Thus the principle of soft computing is to exploit the tolerance for imprecision, uncertainty, and partial truth to achieve tractability, robustness, low solution cost, and better rapport with reality.

Adaptive Technology and Its Applications João José Neto Universidade de São Paulo, Brazil

INTRODUCTION

Before the advent of software engineering, the lack of memory space in computers and the absence of established programming methodologies led early programmers to use self-modification as a regular coding strategy. Although unavoidable and valuable for that class of software, solutions using self-modification proved inadequate as programs grew in size and complexity, and security and reliability became major requirements. Software engineering, in the 1970s, almost led to the vanishing of self-modifying software, whose occurrence was afterwards limited to small low-level machine-language programs with very special requirements. Nevertheless, recent research developed in this area, and the modern needs for powerful and effective ways to represent and handle complex phenomena in high-technology computers, are leading self-modification to be considered again as an implementation choice in several situations. Artificial intelligence strongly contributed to this scenario by developing and applying non-conventional approaches, e.g. heuristics, knowledge representation and handling, inference methods, evolving software/hardware, genetic algorithms, neural networks, fuzzy systems, expert systems, machine learning, etc.

In this publication, another alternative is proposed for developing Artificial Intelligence applications: the use of adaptive devices, a special class of abstractions whose practical application in the solution of current problems is called Adaptive Technology. The behavior of adaptive devices is defined by a dynamic set of rules. In this case, knowledge may be represented, stored and handled within that set of rules by adding and removing rules that represent the addition or elimination of the information they represent. Because of the explicit way adopted for representing and acquiring knowledge, adaptivity provides a very simple abstraction for the implementation of artificial learning mechanisms: knowledge may be comfortably gathered by inserting and removing rules, and handled by tracking the evolution of the set of rules and by interpreting the collected information as the representation of the knowledge encoded in the rule set.

MAIN FOCUS OF THIS ARTICLE This article provides concepts and foundations on adaptivity and adaptive technology, gives a general formulation for adaptive abstractions in use and indicates their main applications. It shows how rule-driven devices may turn into adaptive devices to be applied in learning systems modeling, and introduces a recently formulated kind of adaptive abstractions having adaptive subjacent devices. This novel feature may be valuable for implementing meta-learning, since it enables adaptive devices to change dynamically the way they modify their own set of defining rules. A significant amount of information concerning adaptivity and related subjects may be found at the (LTA Web site).

BACKGROUND

This section summarizes the foundations of adaptivity and establishes a general formulation for adaptive rule-driven devices (Neto, 2001), non-adaptivity being the only restriction imposed on the subjacent device. Some theoretical background is desirable for the study and research on adaptivity and Adaptive Technology: formal languages, grammars, automata, computation models, rule-driven abstractions and related subjects. Nevertheless, either for programming purposes or for a first contact with the theme, the basics of adaptivity can be grasped even without prior expertise in computer-theoretical subjects.

In adaptive abstractions, adaptivity may be achieved by attaching adaptive actions to selected rules chosen from the rule set defining some subjacent non-adaptive device. Adaptive actions enable adaptive devices to dynamically change their behavior without external help, by modifying their own set of defining rules whenever their subjacent rule is executed. For practical reasons, up to two adaptive actions are allowed: one to be performed prior to the execution of its underlying rule, and the other, after it.

An adaptive device behaves just as if it were piecewise non-adaptive: starting with the configuration of its initial underlying device, it iterates the following two steps, until reaching some well-defined final configuration:

• While no adaptive action is executed, run the underlying device;
• Modify the set of rules defining the device by executing an adaptive action.

Rule-Driven Devices

A rule-driven device is any formal abstraction whose behavior is described by a rule set that maps each possible configuration of the device into a corresponding next one. A device is deterministic when, for any configuration and any input, a single next configuration is possible. Otherwise, it is said to be non-deterministic. Non-deterministic devices allow multiple valid possibilities for each move, and require backtracking, so deterministic equivalents are usually preferable in practice. Assume that:

• $D$ is some rule-driven device, defined as $D = (C, R, S, c_0, A)$.
• $C$ is its set of possible configurations.
• $R \subseteq C \times (S \cup \{\varepsilon\}) \times C$ is the set of rules describing its behavior, where $\varepsilon$ denotes the empty stimulus, representing no events at all.
• $S$ is its set of valid input stimuli.
• $c_0 \in C$ is its initial configuration.
• $A \subseteq C$ is its set of final configurations.

Let $c_i \Rightarrow^{(r)} c_{i+1}$ (for short, $c_i \Rightarrow c_{i+1}$) denote the application of some rule $r = (c_i, s, c_{i+1}) \in R$ to the current configuration $c_i$ in response to some input stimulus $s \in S \cup \{\varepsilon\}$, yielding its next configuration $c_{i+1}$. Successive applications of rules in response to a stream $w \in S^*$ of input stimuli, starting from the initial configuration $c_0$ and leading to some final configuration $c \in A$, are denoted $c_0 \Rightarrow^{*}_{w} c$ (the star postfix operator in the formulae denotes the Kleene closure: its preceding element may be re-instantiated or reapplied an arbitrary number of times). We say that $D$ defines a sentence $w$ if, and only if, $c_0 \Rightarrow^{*}_{w} c$ holds for some $c \in A$. The collection $L(D)$ of all such sentences is called the language defined by $D$:

$$L(D) = \{\, w \in S^* \mid c_0 \Rightarrow^{*}_{w} c,\ c \in A \,\}.$$

Adaptive (Rule-Driven) Devices

An adaptive rule-driven device $AD = (ND_0, AM)$ associates an initial subjacent rule-driven device $ND_0 = (C, NR_0, S, c_0, A)$ to some adaptive mechanism $AM$ that can dynamically change its behavior by modifying its defining rules. That is accomplished by executing non-null adaptive actions chosen from a set $AA$ of adaptive actions, which includes the null adaptive action $a^0$. A built-in counter $t$ starts at 0 and is self-incremented upon any adaptive action's execution. Let $X_j$ denote the value of $X$ after $j$ executions of adaptive actions by $AD$. Adaptive actions in $AA$ call functions that map the current set $AR_t$ of adaptive rules of $AD$ into $AR_{t+1}$ by inserting adaptive rules $ar$ into $AM$ and removing them from it. Let $\mathcal{AR}$ be the set of all possible sets of adaptive rules for $AD$. Any $a^k \in AA$ maps the current set of rules $AR_t \in \mathcal{AR}$ into $AR_{t+1} \in \mathcal{AR}$:

$$a^k : \mathcal{AR} \rightarrow \mathcal{AR}$$

$AM$ associates to each rule $nr^p \in NR$ of the underlying device $ND$ of $AD$ a pair of adaptive actions $ba^p, aa^p \in AA$:

$$AM \subseteq AA \times NR \times AA$$
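As an illustration of the non-adaptive rule-driven device $D = (C, R, S, c_0, A)$ on which the adaptive device is built, the following sketch checks whether a stream of stimuli belongs to $L(D)$. It assumes that configurations are simple hashable values, that the input stream is kept outside the configuration, and that the empty stimulus is encoded as an empty string; the example device and its state names are hypothetical. Non-determinism is handled by tracking all reachable configurations at once instead of backtracking.

```python
EPSILON = ""  # encoding chosen here for the empty stimulus


def closure(configs, rules):
    """All configurations reachable from configs through empty-stimulus rules."""
    reached = set(configs)
    frontier = list(configs)
    while frontier:
        current = frontier.pop()
        for (src, stimulus, dst) in rules:
            if src == current and stimulus == EPSILON and dst not in reached:
                reached.add(dst)
                frontier.append(dst)
    return reached


def accepts(device, w):
    """True if the stream w leads from c0 to some final configuration."""
    C, R, S, c0, A = device
    current = closure({c0}, R)
    for s in w:
        if s not in S:
            return False  # not a valid input stimulus
        moved = {dst for (src, stim, dst) in R if src in current and stim == s}
        current = closure(moved, R)
    return bool(current & A)


# Hypothetical device defining the language a*b.
device = (
    {"q0", "q1"},                            # C: configurations
    {("q0", "a", "q0"), ("q0", "b", "q1")},  # R: rules (c, s, c')
    {"a", "b"},                              # S: valid stimuli
    "q0",                                    # c0: initial configuration
    {"q1"},                                  # A: final configurations
)
```

Under these assumptions `accepts(device, "aab")` holds, whereas `accepts(device, "aba")` does not, since no rule applies to the last stimulus.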


Notation

When writing elementary adaptive actions, $?[ar]$, $+[ar]$ and $-[ar]$ respectively denote searching, inserting and eliminating adaptive rules that follow template $ar$. Note that $ar$ may contain references to parameters, variables and generators, in order to allow cross-referencing among elementary adaptive actions inside an adaptive function. Given an underlying rule $nr^p \in NR$, we define an adaptive rule $ar^p \in AM$ as:

$$ar^p = (ba^p, nr^p, aa^p)$$

For each $AD$ move, $AM$ applies some $ar^p$ in three steps:

a. execution of adaptive action $ba^p$ before applying the subjacent rule $nr^p$;
b. application of the underlying non-adaptive rule $nr^p$;
c. execution of adaptive action $aa^p$.

The following algorithm sketches the overall operation of $AD$:

1. Initialize $c_0$, $w$;
2. If $w$ is exhausted, go to 7, else get next event $s_t$;
3. For the current configuration $c_t$, determine the set $CR$ of $c_t$-compatible rules;
   a. if $CR = \emptyset$, reject $w$.
   b. if $CR = \{(c_t, s, c')\}$, apply $(c_t, s, c')$ as in steps 4-6, leading $AD$ to $c_{t+1} = c'$.
   c. if $CR = \{r^k = (c_t, s, c^k) \mid c^k \in C,\ k = 1, \dots, n,\ n > 1\}$, apply all rules $r^k$ in parallel, as in steps 4-6, leading $AD$ to $c^1, c^2, \dots, c^n$, respectively.
4. If $ba^p = a^0$, go to 5, else apply first $ba^p$. If rule $ar^p$ was removed by $ba^p$, go to 3, aborting $ar^p$; else $AD$ has reached an intermediate configuration, then go to 5.
5. Apply $nr^p$ to the current (intermediate) configuration, yielding a new intermediate configuration;
6. Apply $aa^p$, yielding the next (stable) configuration for $AD$; go to 2.
7. If some $c_{t+1} \in A$ was reached, then $AD$ accepts $w$, otherwise $AD$ rejects $w$; stop.

Hierarchical Multi-Level Adaptive Devices

Let us define a more elaborate adaptive device by generalizing the definition above. Call non-adaptive devices level-0 devices; define as level-1 devices those having subjacent level-0 devices, to each of whose rules a pair of level-1 adaptive actions is attached. Let the subjacent device be some level-k adaptive device. One may construct a level-(k+1) device by attaching a pair of level-(k+1) adaptive actions to each of its rules. This is the induction step for the definition of hierarchically structured multi-level adaptive devices. Besides the set of rules defining the subjacent level-k device, for k > 0, adaptive functions' subjacent device performs at its own level, which may use level-(k+1) adaptive actions to modify the behavior of level-k adaptive functions. So, for k > 0, level-(k+1) devices can change the way their subjacent level-k devices modify themselves. That also holds for k = 1, since even for k = 0 the (empty) set of adaptive functions still exists.

Notation

The absence of adaptive actions in non-adaptive rules $nr$ is explicitly expressed by stating all level-0 rules $r_0$ in the form $(a^0\ nr\ a^0)$. Therefore, level-k rules $r_k$ take the general format $(b_k\ r_{k-1}\ a_k)$, with both $b_k$ and $a_k$ level-k adaptive actions for any adaptive level $k \ge 0$. So, level-k adaptive devices have all their defining rules stated in the standard form

$$(b_k\,(b_{k-1}\,(\dots(b_1\,(a^0\,(c, s, c')\,a^0)\,a_1)\dots)\,a_{k-1})\,a_k),$$

with

$$(b_{k-1}\,(\dots(b_1\,(a^0\,(c, s, c')\,a^0)\,a_1)\dots)\,a_{k-1})$$

representing one of the rules defining the subjacent level-(k – 1) adaptive device.

Hence, level-i adaptive actions can modify both the set of level-i adaptive rules and the set of elementary adaptive actions defining level-(i – 1) adaptive functions.

A SIMPLE ILLUSTRATIVE EXAMPLE

In the following example, graphical notation is used for clarity and conciseness. When drawing automata, circles (as usual) represent states; double-line circles indicate final states; arrows indicate transitions; labels on the arrows indicate tokens consumed by the transition and (optionally) an associated adaptive action. When representing adaptive functions, automata fragments in brackets stand for a group of transitions to be added (+) or removed (-) when the adaptive action is applied. Figure 1 shows the starting shape of an adaptive automaton that accepts $a^{n}b^{2n}c^{3n}$, n ≥ 0. At state 1, it includes a transition consuming a, which performs adaptive action A( ).

Figure 1. Initial configuration of the illustrative adaptive automaton

Figure 2 defines how A( ) operates:

• Using state 2 as reference, eliminate empty transitions using states x and y.
• Add a sequence starting at x, with two transitions consuming b.
• Append the sequence of two empty transitions sharing state 2.
• Append a sequence with three transitions consuming c, ending at y.

Figure 2. Adaptive function A( )

Figure 3 shows the first two shape changes of this automaton after consuming the first two symbols a (at state 1) in sentence $a^{2}b^{4}c^{6}$. In its last shape, the automaton trivially consumes the remaining $b^{4}c^{6}$, and does not change any more. There are many other examples of adaptive devices in the references. This almost trivial and intuitive case was shown here for illustration purposes only.

Figure 3. Configurations of the adaptive automaton after executing A( ) once and twice

Knowledge Representation

The preceding example illustrates how adaptive devices use the set of rules as their only element for representing and handling knowledge. A rule (here, a transition) may handle parametric information in its components (here, the transition's origin and destination states, the token labeling the transition, the adaptive function it calls, etc.). Rules may be combined in order to represent some non-elementary information (here, the sequences of transitions consuming tokens "b" and "c" keep track of the value of n in each particular sentence). This way, rules and their components may work and may be interpreted as low-level elements of knowledge. Although it is impossible to impose rules on how to represent and handle knowledge in systems represented with adaptive devices, the details of the learning process may be chosen according to the particular needs of each system being modeled. In practice, the learning behavior of an adaptive device may be identified and measured by tracking the progress of the set of rules during its operation and interpreting the dynamics of its changes. In the above example, when transitions are added to the automaton by executing adaptive action A( ), one may interpret the length of the sequence of transitions consuming "b" (or "c") as a manifestation of the knowledge that is being gathered by the adaptive automaton on the value of n (its exact value becomes available after the sub-string of tokens "a" is consumed).
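The behaviour of the illustrative automaton can be traced with a small simulation. The Python sketch below is not taken from the referenced works: the dictionary-based rule set, the empty string used for empty transitions, the integer names given to newly created states, and the choice of running A( ) right after the transition fires are all assumptions made for illustration only.

```python
from itertools import count

EPS = ""  # label assumed here for an empty transition


def accepts(sentence):
    """Simulate the adaptive automaton for a^n b^2n c^3n (n >= 0)."""
    fresh = count(4)  # hypothetical generator of new state names (1..3 are taken)
    rules = {}        # (state, token) -> (next state, adaptive action or None)

    def add_path(start, labels, end):
        # Chain one transition per label, creating intermediate states.
        states = [start] + [next(fresh) for _ in labels[1:]] + [end]
        for src, label, dst in zip(states, labels, states[1:]):
            rules[(src, label)] = (dst, None)

    def action_a():
        # Query: find x and y such that x -eps-> 2 -eps-> y.
        x = next(s for (s, t), (d, _) in rules.items() if t == EPS and d == 2)
        y = rules[(2, EPS)][0]
        # Remove both empty transitions.
        del rules[(x, EPS)], rules[(2, EPS)]
        # Add x -b-> . -b-> . -eps-> 2 -eps-> . -c-> . -c-> . -c-> y.
        add_path(x, ["b", "b", EPS], 2)
        add_path(2, [EPS, "c", "c", "c"], y)

    # Initial shape (Figure 1): 1 -eps-> 2 -eps-> 3, plus 1 -a/A()-> 1.
    add_path(1, [EPS], 2)
    add_path(2, [EPS], 3)
    rules[(1, "a")] = (1, action_a)

    state, i = 1, 0
    while True:
        token = sentence[i] if i < len(sentence) else None
        if (state, token) in rules:        # consume one token
            state, action = rules[(state, token)]
            i += 1
            if action:
                action()                   # the rule set changes here
        elif (state, EPS) in rules:        # otherwise follow an empty transition
            state = rules[(state, EPS)][0]
        else:
            break
    return i == len(sentence) and state == 3  # state 3 is the final state


if __name__ == "__main__":
    for s in ["", "abbccc", "aabbbbcccccc", "abbcc"]:
        print(repr(s), accepts(s))
```

Each consumed a lengthens the paths of b and c transitions around state 2, so the rule set itself records the value of n, which is the point made in the Knowledge Representation discussion.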

FUTURE TRENDS

Adaptive abstractions represent a significant theoretical advance in Computer Science, by introducing and exploring powerful non-classical concepts such as time-varying behavior, autonomously dynamic rule sets, multi-level hierarchy, and static and dynamic adaptive actions. Those concepts allow establishing a modeling style suitable for describing complex learning systems, for efficiently solving traditionally hard problems, for dealing with self-modifying learning methods, and for providing computer languages and environments for the comfortable elaboration of quality programs with dynamically variant behavior.

All those features are vital for conceiving, modeling, designing and implementing applications in Artificial Intelligence, which benefits from adaptivity while expressing traditionally difficult-to-describe Artificial Intelligence facts. Listed below are features Adaptive Technology offers to several fields of Computation, especially to Artificial Intelligence-related ones, indicating their main impacts and applications.

• Adaptive Technology provides a true computation model, constructed around formal foundations. Most Artificial Intelligence techniques in use are very hard to express and follow, since the connection between elements of the models and the information they represent is often implicit, so their reasoning is difficult for a human to track and plan. Adaptive rule-driven devices concentrate all stored knowledge in their rules, and the whole logic that handles such information in their adaptive actions. Such properties open for Artificial Intelligence the possibility to observe, understand and control adaptive-device-modeled phenomena. By following and interpreting how and why changes occur in the device's set of rules, and by tracking the semantics of adaptive actions, one can infer the reasoning behind the model's reactions to its input.

• Adaptive devices have enough processing power to model complex computations. In (Neto, 2000) some successful use cases are shown with simple and efficient adaptive devices used instead of complex traditional formulations.

• Adaptive Devices are Turing Machine-equivalent computation models that may be used in the construction of single-notation full specifications of programming languages, including lexical, syntactical, context-dependent static-semantic issues, language built-in features such as arithmetic operations, libraries, semantics, code generation and optimization, run-time code interpreting, etc.

• Adaptive devices are well suited for representing complex languages, including idioms. Natural language in particular requires several features to be expressed and handled, such as word inflexions, orthography, multiple syntax forms, phrase ordering, ellipsis, permutation, ambiguities, anaphora and others. A few simple techniques allow adaptive devices to deal with such elements, strongly simplifying the effort of representing and processing them. Applications are wide, including machine translation, data mining, text-voice and voice-text conversion, etc.

• Computer art is another fascinating potential application of adaptive devices. Music and other artistic expressions are forms of human language. Given some language descriptions, computers can capture human skills and automatically generate interesting outputs. Successful experiments were carried out in the field of music, with excellent results (Basseto, 1999).

• Decision-taking systems may use Adaptive Decision Tables and Trees for constructing intelligent systems that accept training patterns, learn how to classify them, and therefore classify unknown patterns. Successful experiments include classifying geometric patterns, decoding sign languages, locating patterns in images, generating diagnoses from symptoms and medical data, etc.

• Language inference uses Adaptive Devices to generate formal descriptions of languages from samples, by identifying and collecting structural information and generalizing on the evidence of repetitive or recursive constructs (Matsuno, 2006).

• Adaptive Devices can be used for learning purposes by storing as rules the gathered information on some monitored phenomenon. In educational systems, the behavior of both students and trainers can be inferred and used to decide how to proceed.

• One can construct Adaptive Devices whose underlying abstraction is a computer language. Statements in such languages may be considered as rules defining the behavior of a program. By attaching adaptive rules to statements, the program becomes self-modifiable. Adaptive languages are needed for adaptive applications to be expressed naturally. For adaptivity to become a true programming style, techniques and methods must be developed to construct good adaptive software, since adaptive applications developed so far were usually produced in a strictly ad hoc way.

CONCLUSION

Adaptive Technology concerns techniques, methods and subjects referring to the actual application of adaptivity. Adaptive automata (Neto, 1994) were first proposed for practical representation of context-sensitive languages (Rubinstein, 1995). Adaptive grammars (Iwai, 2000) were employed as their generative counterpart (Burshteyn, 1990), (Christiansen, 1990), (Cabasino, 1992), (Shutt, 1993), (Jackson, 2006). For specification and analysis of real-time reactive systems, works were developed based on adaptive versions of statecharts (Almeida Jr., 1995), (Santos, 1997). An interesting confirmation of the power and usability of adaptive devices for modeling complex systems (Neto, 2000) was the successful use of Adaptive Markov Chains in a computer music-generating device (Basseto, 1999). Adaptive Decision Tables (Neto, 2001) and Adaptive Decision Trees (Pistori, 2006) are currently being tested in decision-taking applications. Experiments have been reported that explore the potential of adaptive devices for constructing language inference systems (Neto, 1998), (Matsuno, 2006). An important area in which adaptive devices show their strength is the specification and processing of natural languages (Neto, 2003). Many other results are being achieved while representing syntactical context-dependencies of natural language. Simulation and modeling of intelligent systems are other concrete applications of adaptive formalisms, as illustrated in the description of the control mechanism of an intelligent autonomous vehicle which collects information from its environment and builds maps for navigation. Many other applications for adaptive devices are possible in several fields.

REFERENCES

(* or ** - downloadable from LTA Website; ** - in Portuguese only)

Almeida Jr., J.R. (1995)**. STAD - Uma ferramenta para representação e simulação de sistemas através de statecharts adaptativos. São Paulo, 202p. Doctoral Thesis. Escola Politécnica, Universidade de São Paulo.

Basseto, B.A., Neto, J.J. (1999)*. A stochastic musical composer based on adaptive algorithms. Anais do XIX Congresso Nacional da Sociedade Brasileira de Computação. SBC-99, Vol. 3, pp. 105-13.

Burshteyn, B. (1990). Generation and recognition of formal languages by modifiable grammars. ACM SIGPLAN Notices, v.25, n.12, p.45-53, 1990.

Cabasino, S.; Paolucci, P.S.; Todesco, G.M. (1992). Dynamic parsers and evolving grammars. ACM SIGPLAN Notices, v.27, n.11, p.39-48, 1992.

Christiansen, H. (1990). A survey of adaptable grammars. ACM SIGPLAN Notices, v.25, n.11, p.33-44.

Iwai, M.K. (2000)**. Um formalismo gramatical adaptativo para linguagens dependentes de contexto. São Paulo 2000, 191p. Doctoral Thesis. Escola Politécnica, Universidade de São Paulo.

Jackson, Q.T. (2006). Adapting to Babel – Adaptivity and context-sensitivity parsing: from anbncn to RNA – A Thotic Technology Partners Research Monograph.

LTA Website: http://www.pcs.usp.br/~lta

Matsuno, I.P. (2006)**. Um Estudo do Processo de Inferência de Gramáticas Regulares e Livres de Contexto Baseados em Modelos Adaptativos. M.Sc. Dissertation, Escola Politécnica, Universidade de São Paulo.

Neto, J.J. (1994)*. Adaptive automata for context-dependent languages. ACM SIGPLAN Notices, v.29, n.9, p.115-24, 1994.

Neto, J.J. (2000)*. Solving Complex Problems Efficiently with Adaptive Automata. CIAA 2000 - Fifth International Conference on Implementation and Application of Automata - London, Ontario, Canada.

Neto, J.J. (2001)*. Adaptive Rule-Driven Devices General Formulation and Case Study. Lecture Notes in Computer Science. Watson, B.W. and Wood, D. (Eds.): Implementation and Application of Automata - 6th International Conference, CIAA 2001, Vol.2494, Pretoria, South Africa, July 23-25, Springer-Verlag, 2001, pp. 234-250.

Neto, J.J., Iwai, M.K. (1998)*. Adaptive automata for syntax learning. XXIV Conferencia Latinoamericana de Informática CLEI'98, Quito - Ecuador, tomo 1, pp.135-146.

Neto, J.J.; Moraes, M.de. (2003)*. Using Adaptive Formalisms to Describe Context-Dependencies in Natural Language. Computational Processing of the Portuguese Language 6th International Workshop, PROPOR 2003, LNAI Volume 2721, Faro, Portugal, June 26-27, Springer-Verlag, 2003, pp 94-97.

Pistori, H.; Neto, J.J.; Pereira, M.C. (2006)*. Adaptive Non-Deterministic Decision Trees: General Formulation and Case Study. INFOCOMP Journal of Computer Science, Lavras, MG.

Rubinstein, R.S.; Shutt, J.N. (1995). Self-modifying finite automata: An introduction. Information Processing Letters, v.56, n.4, 24, p.185-90.

Santos, J.M.N. (1997)**. Um formalismo adaptativo com mecanismos de sincronização para aplicações concorrentes. São Paulo, 98p. M.Sc. Dissertation. Escola Politécnica, Universidade de São Paulo.

Shutt, J.N. (1993). Recursive adaptable grammar. M.S. Thesis, Computer Science Department, Worcester Polytechnic Institute, Worcester MA.

KEY TERMS

Adaptivity: Property exhibited by structures that dynamically and autonomously change their own behavior in response to input stimuli.

Adaptive Computation Model: Turing-powerful abstraction that mimics the behavior of potentially self-modifying complex systems.

Adaptive Device: Structure with dynamic behavior, with some subjacent device and an adaptive mechanism.

Adaptive Functions and Adaptive Actions: Adaptive actions are calls to adaptive functions, which can determine changes to perform on their layer's rule set and on their immediately subjacent layer's adaptive functions.

Adaptive Mechanism: Alteration discipline associated with an adaptive device's rule set that changes the behavior of its subjacent device by performing adaptive actions.

Adaptive Rule-Driven Device: Adaptive device whose behavior is defined by a dynamically changing set of rules, e.g. adaptive automata, adaptive grammars, etc.

Context-Dependency: Reinterpretation of terms, due to conditions occurring elsewhere in a sentence, e.g. agreement rules in English, type-checking in Pascal.

Context-Sensitive (-Dependent) Formalism: Abstraction capable of representing Chomsky type-1 or type-0 languages. Adaptive Automata and Adaptive Context-free Grammars are well suited to express such languages.

Hierarchical (Multilevel) Adaptive Device: Stratified adaptive structure whose involving layer's adaptive actions can modify both its own layer's rules and its underlying layer's adaptive functions.

Subjacent (or Underlying) Device: Any device used as a basis to formulate adaptive devices. The innermost of a multilevel subjacent device must be non-adaptive.

Advanced Cellular Neural Networks Image Processing

J. Álvaro Fernández, University of Extremadura, Badajoz, Spain

INTRODUCTION

Since its introduction to the research community in 1988, the Cellular Neural Network (CNN) (Chua & Yang, 1988) paradigm has become fertile ground for engineers and physicists, producing over 1,000 published scientific papers and books in less than 20 years (Chua & Roska, 2002), mostly related to Digital Image Processing (DIP). This Artificial Neural Network (ANN) offers a remarkable ability to integrate complex computing processes into compact, real-time programmable analogic VLSI circuits such as the ACE16k (Rodríguez et al., 2004) and, more recently, into FPGA devices (Perko et al., 2000). CNN is the core of the revolutionary Analogic Cellular Computer (Roska et al., 1999), a programmable system based on the so-called CNN Universal Machine (CNN-UM) (Roska & Chua, 1993). Analogic CNN computers mimic the anatomy and physiology of many sensory and processing biological organs (Chua & Roska, 2002). This article continues the review started in this Encyclopaedia under the title Basic Cellular Neural Network Image Processing.

BACKGROUND

The standard CNN architecture consists of an M × N rectangular array of cells C(i,j) with Cartesian coordinates (i,j), i = 1, 2, …, M, j = 1, 2, …, N. Each cell or neuron C(i,j) is bound to a sphere of influence Sr(i,j) of positive integer radius r, defined by:

$$S_r(i,j) = \left\{ C(k,l) \;\middle|\; \max_{1 \le k \le M,\, 1 \le l \le N} \left\{ |k-i|,\, |l-j| \right\} \le r \right\} \qquad (1)$$

This set is referred to as a (2r +1) × (2r +1) neighbourhood. The parameter r controls the connectivity of a cell. When r > N/2 and M = N, a fully connected CNN is obtained, a case that corresponds to the classic Hopfield ANN model. The state equation of any cell C(i,j) in the M × N array structure of the standard CNN may be described by:

$$C \frac{dz_{ij}(t)}{dt} = -\frac{1}{R} z_{ij}(t) + \sum_{C(k,l) \in S_r(i,j)} \left[ A(i,j;k,l) \cdot y_{kl}(t) + B(i,j;k,l) \cdot x_{kl} \right] + I_{ij} \qquad (2)$$

where C and R are values that control the transient response of the neuron circuit (just like an RC filter), I is generally a constant value that biases the state matrix Z = {zij}, and Sr is the local neighbourhood defined in (1), which controls the influence of the input data X = {xij} and the network output Y = {yij} at time t. This means that both input and output planes interact with the state of a cell through the definition of a set of real-valued weights, A(i, j; k, l) and B(i, j; k, l), whose size is determined by r. The cloning templates A and B are called the feedback and feed-forward operators, respectively.

An isotropic CNN is typically defined with constant values for r, I, A and B, implying that for an input image X, a neuron C(i, j) is provided for each pixel (i, j), with constant weighted circuits defined by the feedback and feed-forward templates A and B. The neuron state value zij is adjusted with the bias parameter I, and passed as input to an output function of the form:

$$y_{ij} = \frac{1}{2} \left( \left| z_{ij}(t) + 1 \right| - \left| z_{ij}(t) - 1 \right| \right) \qquad (3)$$

The vast majority of the templates defined in the CNN-UM template compendium of (Chua & Roska, 2002) are based on this isotropic scheme, using r = 1 and binary images in the input plane. If no feedback (i.e. A = 0) is used, then the CNN behaves as a convolution network, using B as a spatial filter, I as a threshold and the piecewise linear output (3) as a limiter. Thus, virtually any spatial filter from DIP theory can be implemented on such a feed-forward CNN, ensuring binary output stability via the definition of a central feedback absolute value greater than 1.
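The feed-forward case can be illustrated with a short sketch. The Python code below is an illustration under stated assumptions, not the implementation used in the cited works: it takes A = 0 and r = 1, sets dz/dt = 0 in (2) so that the steady state is z = R(Σ B·x + I), and treats cells outside the array as zero (a boundary condition chosen here for simplicity). The function names are hypothetical.

```python
def output(z):
    """Piecewise linear output function (3)."""
    return 0.5 * (abs(z + 1) - abs(z - 1))


def feedforward_cnn(x, B, I, R=1.0):
    """Steady-state output of a CNN with A = 0 and r = 1.

    x: input image as a list of rows; B: 3x3 feed-forward template;
    I: bias. Cells outside the array contribute 0 (assumed boundary).
    """
    M, N = len(x), len(x[0])
    y = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = I
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    if 0 <= k < M and 0 <= l < N:
                        acc += B[di + 1][dj + 1] * x[k][l]
            # With no feedback, the state settles at R * (B*x + I);
            # the output function then acts as a limiter.
            y[i][j] = output(R * acc)
    return y


if __name__ == "__main__":
    identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    print(feedforward_cnn([[2.0, -0.5], [0.25, -3.0]], identity, I=0.0))
```

With the identity template and zero bias, the network simply clips the input to the [–1, +1] range, which shows the limiter role of (3) in isolation.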

ADVANCED CNN IMAGE PROCESSING

This section describes more complex CNN models, including multi-layer structures and nonlinear templates, in order to provide a deeper insight into CNN design and to illustrate its powerful DIP capabilities.

Nonlinear Templates

A problem often addressed in DIP edge detection is robustness against noise (Jain, 1989). In this sense, the EDGE CNN detector for grey-scale images given by

$$A = 2, \quad B_{EDGE} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}, \quad I = -0.5 \qquad (4)$$

is a typical example of a filter that is weak against noise, as a result of a fixed linear feed-forward template combined with excitatory feedback. One way to make the detector more robust against noise is to define a nonlinear B template of the form:

$$B_{CONTOUR} = \begin{bmatrix} b & b & b \\ b & 0 & b \\ b & b & b \end{bmatrix}, \quad \text{where } b = \begin{cases} 0.5 & |x_{ij} - x_{kl}| > th \\ -1 & |x_{ij} - x_{kl}| \le th \end{cases} \qquad (5)$$

This nonlinear template defines different coefficients for the surrounding pixels before the spatial filtering of the input image X is performed. Thus, a CNN defined with nonlinear templates is generally dependent on X, and cannot be treated as an isotropic model. Just two values are allowed for the surrounding coefficients of B: an excitatory one for pixels whose luminance difference with the central pixel exceeds a threshold th (i.e. edge pixels), and an inhibitory one, doubled in absolute value, for similar pixels, where th is usually set around 0.5. The feedback template A = 2 remains unchanged, but the value for the bias I must be chosen from the following analysis. For a given state element zij, the contribution wij of the feed-forward nonlinear filter of (5) may be expressed as:

$$w_{ij} = -1.0 \cdot p_s + 0.5 \cdot p_e = -(8 - p_e) + 0.5 \cdot p_e = -8 + 1.5 \cdot p_e \qquad (6)$$

where ps is the number of similar pixels in the 3 × 3 neighbourhood and pe is the number of remaining (edge) pixels. For example, if the central pixel has 8 edge neighbours, wij = 12 – 8 = 4, whereas if all its neighbours are similar to it, then wij = –8. Thus, a pixel will be selected as edge depending on the number of its edge neighbours, providing the possibility of noise reduction. For instance, edge detection for pixels with at least 3 edge neighbours forces that I ∈ (4, 5). The main result is that the inclusion of nonlinearities in the definition of B coefficients and, by extension, the pixel-wise definition of the main CNN parameters gives rise to more powerful and complex DIP filters (Chua & Roska, 1993).
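The contribution in (6) can be checked numerically. The following Python sketch follows (5) and (6) literally, summing the coefficient values selected by the threshold test; the function name and the list-of-rows window representation are assumptions made for this illustration.

```python
def contour_contribution(window, th=0.5):
    """Return (p_e, w_ij) for a 3x3 window, following (5) and (6).

    window: 3x3 list of rows with the pixel under study at the centre.
    th: luminance-difference threshold separating edge from similar pixels.
    """
    centre = window[1][1]
    neighbours = [window[i][j]
                  for i in range(3) for j in range(3)
                  if (i, j) != (1, 1)]
    p_e = sum(1 for x in neighbours if abs(centre - x) > th)  # edge pixels
    p_s = 8 - p_e                                             # similar pixels
    w = -1.0 * p_s + 0.5 * p_e                                # equation (6)
    return p_e, w


if __name__ == "__main__":
    flat = [[0.2, 0.2, 0.2], [0.2, 0.2, 0.2], [0.2, 0.2, 0.2]]
    isolated = [[-1, -1, -1], [-1, 1, -1], [-1, -1, -1]]
    print(contour_contribution(flat))      # all neighbours similar
    print(contour_contribution(isolated))  # all neighbours are edge pixels
```

The two windows reproduce the extreme cases discussed in the text, wij = –8 for a flat region and wij = 4 for a pixel surrounded by 8 edge neighbours.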

Morphologic Operators

Mathematical Morphology is an important contributor to the DIP field. In the classic approach, every morphologic operator is based on a series of simple concepts from Set Theory. Moreover, all of them can be decomposed into combinations of two basic operators: erosion and dilation (Serra, 1982). Both operators take two pieces of data as input: the binary input image and the so-called structuring element, which is usually represented by a 3×3 template. A pixel belongs to an object if it is active (i.e. its value is 1 or black), whereas the rest of the pixels are classified as background, zero-valued elements. Basic morphologic operators are defined using only object pixels, marked as 1 in the structuring element. If a pixel is not used in the match, it is left blank. Both dilation and erosion operators may be defined by the structuring elements

$$\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} & 1 & \\ 1 & 1 & 1 \\ & 1 & \end{bmatrix} \qquad (7)$$

for 8 or 4-neighbour connectivity, respectively. In dilation, the structuring element is placed over each input pixel. If any of the 9 (or 5) pixels considered in (7) is active, then the output pixel will also be active (Jain, 1989). The erosion operator can be defined as the dual of dilation, i.e. a dilation performed over the background.

More complex morphologic operators are based on structuring elements that also contain background pixels. This is the case of the Hit and Miss Transform (HMT), a generalized morphologic operator used to identify certain local pixel configurations. For instance, the structuring elements defined by

0 1 0 1 1 0 0 0 and 1 1 0 1 0 0 0 0 (8)

are used to find 90º convex corner object pixels within the image. A pixel will be selected as active in the output image if its local neighbourhood exactly matches that defined by the structuring element. However, in order to calculate a full, non-orientated corner detector it is necessary to perform 8 HMTs, one for each rotated version of (8), OR-ing the 8 intermediate output images to obtain the final image (Fisher et al., 2004). In the CNN context, the HMT may be obtained in a straightforward manner by:

$$A = 2, \quad B_{HMT}: b_{ij} = \begin{cases} 1 & s_{ij} = 1 \\ 0 & \text{otherwise} \end{cases}, \quad I = 0.5 - p_s \qquad (9)$$

where S = {sij} is the structuring element and ps is the total number of active pixels in it. Since the input template B of the HMT CNN is defined via the structuring element S, and given that there are $2^9 = 512$ distinct 3 × 3 possible structuring elements, there will also be 512 different hit-and-miss erosions. For achieving the opposite result, i.e. hit-and-miss dilation, the threshold must be the opposite of that in (9) (Chua & Roska, 2002).
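The template in (9) can be tried on a small binary image. The Python sketch below rests on several assumptions that are not stated in the source: pixels are coded as +1 (active) and –1 (background), the border is padded with background, and the cell state starts at zero, in which case the feedback A = 2 drives the output to the sign of B·x + I. The function name is hypothetical.

```python
def hmt_erosion(image, S):
    """Apply template (9): b_ij = 1 where s_ij = 1, else 0; I = 0.5 - p_s.

    image: rows of +1 (active) / -1 (background) values (assumed coding).
    S: 3x3 structuring element with 1 for used pixels and 0 elsewhere.
    """
    p_s = sum(sum(row) for row in S)  # number of active pixels in S
    bias = 0.5 - p_s
    M, N = len(image), len(image[0])
    out = [[-1] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            u = bias
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    inside = 0 <= k < M and 0 <= l < N
                    x = image[k][l] if inside else -1  # assumed background border
                    u += S[di + 1][dj + 1] * x
            # Assumed steady state: the output saturates to the sign of u.
            out[i][j] = 1 if u > 0 else -1
    return out


if __name__ == "__main__":
    cross = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
    block = [[-1, -1, -1, -1, -1],
             [-1, 1, 1, 1, -1],
             [-1, 1, 1, 1, -1],
             [-1, 1, 1, 1, -1],
             [-1, -1, -1, -1, -1]]
    for row in hmt_erosion(block, cross):
        print(row)
```

When every pixel selected by S is active, the sum equals ps and the total is 0.5, so the cell turns active; a single mismatch lowers the sum by 2 and pushes the total to –1.5 or below, which is why the bias 0.5 – ps implements an exact match.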

Dynamic Range Control CNN and Piecewise Linear Mappings

DIP techniques can be classified by the domain where they operate: the image or spatial domain or the transform domain (e.g. the Fourier domain). Spatial domain techniques are those that operate directly over the pixels within an image (e.g. its intensity level). A generic spatial operator can be defined by

$$Y(i,j) = T\left[ X(i,j) \right]_{S_r} \qquad (10)$$

where X and Y are the input and output images, respectively, and T is a spatial operator defined over a neighbourhood Sr around each pixel X(i, j), as defined in (1). Based on this neighbourhood, spatial operators can be grouped into two types: Single Point Processing Operators, also known as Mapping Operators, and Local Processing Operators, which can be defined by a spatial filter (i.e. 2D-discrete convolution) mask (Jain, 1989). The simplest form of T is obtained when Sr is 1 pixel in size. In this case, Y only depends on the intensity value of X for every pixel and T becomes an intensity level transformation function, or mapping, of the form

$$s = T(r) \qquad (11)$$

where r and s are variables that represent grey level in X and Y, respectively.

According to this formulation, mappings can be achieved by direct application of a function over a range of input intensity levels. By properly choosing the form of T, a number of effects can be obtained, such as grey-level inversion, dynamic range compression or expansion (i.e. contrast enhancement), and threshold binarization for obtaining binary masks used in analysis and morphologic DIP. A mapping is linear if its function T is also linear. Otherwise, T is not linear and the mapping is also nonlinear. An example of nonlinear mapping is the CNN output function (3). It consists of three linear segments: two saturated levels, –1 and +1, and the central linear segment with unitary slope that connects them. This function is said to be piecewise linear and is closely related to the well-known sigmoid function utilized in the Hopfield ANN (Chua & Roska, 1993). It performs a mapping of intensity values stored in Z in the [–1, +1] range. The bias I controls the average point of the input range, where the output function gives a zero-valued outcome.

Starting from the original CNN cell or neuron (1)-(3), a brief review of the Dynamic Range Control (DRC) CNN model first defined in (Fernández et al., 2006) follows. This network is designed to perform a piecewise linear mapping T over X, with input range [m–d, m+d] and output range [a, b]. Thus,

$$T\left[ X(i,j) \right] = \begin{cases} a & -\infty < X(i,j) \le m-d \\ \dfrac{b-a}{2d} \left( X(i,j) - m \right) + \dfrac{b+a}{2} & m-d < X(i,j) \le m+d \\ b & m+d < X(i,j) < +\infty \end{cases} \qquad (12)$$

In order to be able to implement this function in a multi-layer CNN, the following constraints must be met:

$$b - a \le 2 \quad \text{and} \quad d \le 1 \qquad (13)$$

A CNN cell which controls the desired input range can be defined with the following parameters:

$$A_1 = 0, \quad B_1 = 1/d, \quad I_1 = -m/d \qquad (14)$$

This network performs a linear mapping between [m–d, m+d] and [–1,+1]. Its output is the input of a second CNN whose parameters are:

$$A_2 = 0, \quad B_2 = (b-a)/2, \quad I_2 = (b+a)/2 \qquad (15)$$

The output of this second network is exactly the mapping T defined in (12) bounded by the constraints of (13). One of the simplest techniques used in grey-scale image contrast enhancement is contrast stretching or normalization. This technique maximizes the dynamic range of the intensity levels within the image from suitable estimates of the maximum and minimum intensity values (Fisher et al., 2004). Thus, in the case of normalized grey-scale images, where the minimum (i.e. black) and maximum (i.e. white) intensity levels are represented by 0 and 1 values, respectively, if such an image with dynamic intensity range [f, g] ⊆ [0, +1] is fed into the input of the 2-layer CNN defined by (14) and (15), the following parameters will achieve the desired linear dynamic range maximization:

$$a = 0, \quad b = 1, \quad m = (g+f)/2, \quad d = (g-f)/2 \qquad (16)$$

The DRC network can be easily applied to a first order piecewise polynomial approximation of nonlinear, continuous mappings. One of the valid possibilities is the multi-layer DRC CNN implementation of error-controlled Chebyshev polynomials, as described in (Fernández et al., 2006). The possible mappings include, among many others, the absolute value, logarithmic, exponential, radial basis and integer and real-valued power functions.
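The two-layer construction can be verified on single pixel values. The Python sketch below chains the steady-state responses of the layers defined by (14) and (15), each followed by the output function (3), and then applies the contrast-stretching parameters a = 0, b = 1, m = (g + f)/2, d = (g – f)/2. The function names are hypothetical, and the per-pixel, steady-state treatment is an assumption made for illustration.

```python
def sat(z):
    """Piecewise linear CNN output function (3)."""
    return 0.5 * (abs(z + 1) - abs(z - 1))


def drc(x, m, d, a, b):
    """Two-layer DRC mapping of one pixel value, following (14) and (15)."""
    if not (b - a <= 2 and d <= 1):
        raise ValueError("constraints (13) are not met")
    y1 = sat(x / d - m / d)                      # layer 1: B1 = 1/d, I1 = -m/d
    return sat((b - a) / 2 * y1 + (b + a) / 2)   # layer 2: B2, I2 from (15)


def contrast_stretch(x, f, g):
    """Stretch the intensity range [f, g] to [0, 1] with the DRC network."""
    return drc(x, m=(g + f) / 2, d=(g - f) / 2, a=0.0, b=1.0)


if __name__ == "__main__":
    for value in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(value, contrast_stretch(value, f=0.25, g=0.75))
```

Inside [f, g] the composition reduces to the middle branch of (12), while values below f or above g saturate in the first layer and are mapped to a and b, respectively.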

FUTURE TRENDS

Engineers and specialists continually seek to compete with and imitate nature, especially some "smart" animals. Vision is one particular area of interest to computer engineers. In this context, the so-called Bionic Eye (Werblin et al., 1995) embedded in the CNN-UM architecture is ideal for implementing many spatio-temporal neuromorphic models. With its powerful image processing toolbox and a compact VLSI implementation (Rodríguez et al., 2004), the CNN-UM can be used to program or mimic different models of retinas and even combinations of them. Moreover, it can combine biologically based models, biologically inspired models, and analogic artificial image processing algorithms. This combination is expected to bring a broader range of applications and developments.

CONCLUSION

A number of other advances in the definition and characterization of CNN have been researched in the past decade. These include the definition of methods for designing and implementing larger than 3×3 neighbourhoods in the CNN-UM (Kék & Zarándy, 1998), the CNN implementation of some image compression techniques (Venetianer et al., 1995) and the design of a CNN-based Fast Fourier Transform algorithm over analogic signals (Perko et al., 1998), among many others. This article has presented a general review of the main properties and features of the Cellular Neural Network model, focusing on its DIP applications. The CNN is now a fundamental and powerful toolkit for real-time nonlinear image processing tasks, mainly due to its versatile programmability, which has powered its hardware development for visual sensing applications (Roska et al., 1999).

REFERENCES

Chua, L.O., & Roska, T. (2002). Cellular Neural Networks and Visual Computing. Foundations and Applications. Cambridge, UK: Cambridge University Press.

Chua, L.O., & Roska, T. (1993). The CNN Paradigm. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 40, 147–156.

Chua, L.O., & Yang, L. (1988). Cellular Neural Networks: Theory and Applications. IEEE Transactions on Circuits and Systems, 35, 1257–1290.

Fernández, J.A., Preciado, V.M., & Jaramillo, M.A. (2006). Nonlinear Mappings with Cellular Neural Networks. Lecture Notes in Computer Science, 4177, 350–359.

Fisher, R., Perkins, S., Walker, A., & Wolfart, E. (2004). Hypermedia Image Processing Reference (HIPR2). Website: http://homepages.inf.ed.ac.uk/rbf/HIPR2, University of Edinburgh, UK.

Jain, A.K. (1989). Fundamentals of Digital Image Processing. Englewood Cliffs, NJ, USA: Prentice-Hall.

Kék, L., & Zarándy, A. (1998). Implementation of Large Neighborhood Non-Linear Templates on the CNN Universal Machine. International Journal of Circuit Theory and Applications, 26, 551–566.

Perko, M., Fajfar, I., Tuma, T., & Puhan, J. (1998). Fast Fourier Transform Computation Using a Digital CNN Simulator. 5th IEEE International Workshop on Cellular Neural Network and Their Applications Proceedings, 230–236.

Perko, M., Fajfar, I., Tuma, T., & Puhan, J. (2000). Low-Cost, High-Performance CNN Simulator Implemented in FPGA. 6th IEEE International Workshop on Cellular Neural Network and Their Applications Proceedings, 277–282.

Rodríguez, A., Liñán, G., Carranza, L., Roca, E., Carmona, R., Jiménez, F., Domínguez, R., & Espejo, S. (2004). ACE16k: The Third Generation of Mixed-Signal SIMD-CNN ACE Chips Toward VSoCs. IEEE Transactions on Circuits and Systems I: Regular Papers, 51, 851–863.

Roska, T., & Chua, L.O. (1993). The CNN Universal Machine: An Analogic Array Computer. IEEE Transactions on Circuits and Systems II: Analog and Digital Processing, 40, 163–173.

Roska, T., Zarándy, Á., Zöld, S., Földesy, P., & Szolgay, P. (1999). The Computational Infrastructure of Analogic CNN Computing – Part I: The CNN-UM Chip Prototyping System. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 46, 261–268.

Serra, J. (1982). Image Analysis and Mathematical Morphology. London, UK: Academic Press.

Venetianer, P.L., Werblin, F., Roska, T., & Chua, L.O. (1995). Analogic CNN Algorithms for Some Image Compression and Restoration Tasks. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 42, 278–284.

Werblin, F., Roska, T., & Chua, L.O. (1995). The Analogic Cellular Neural Network as a Bionic Eye. International Journal of Circuit Theory and Applications, 23, 541–569.

KEY TERMS

Bionics: The application of methods and systems found in nature to the study and design of engineering systems. The word seems to have been formed from "biology" and "electronics" and was first used by J. E. Steele in 1958.

Chebyshev Polynomial: An important type of polynomial used in data interpolation, providing the best approximation of a continuous function under the maximum norm.

Dynamic Range: A term used to describe the ratio between the smallest and largest possible values of a variable quantity.

FPGA: Acronym that stands for Field-Programmable Gate Array, a semiconductor device invented in 1984 by R. Freeman that contains programmable interfaces and logic components called "logic blocks" used to perform the function of basic logic gates (e.g. XOR) or more complex combination functions such as decoders.

Piecewise Linear Function: A function f(x) that can be split into a number of linear segments, each of which is defined for a non-overlapping interval of x.

Spatial Convolution: A term used to identify the linear combination of a series of discrete 2D data (a digital image) with a few coefficients or weights. In Fourier theory, a convolution in space is equivalent to (spatial) frequency filtering.

Template: Also known as kernel, or convolution kernel, it is the set of coefficients used to perform a spatial filter operation over a digital image via the spatial convolution operator.

VLSI: Acronym that stands for Very Large Scale Integration. It is the process of creating integrated circuits by combining thousands (nowadays hundreds of millions) of transistor-based circuits into a single chip. A typical VLSI device is the microprocessor.


Agent-Based Intelligent System Modeling

Zaiyong Tang, Salem State College, USA
Xiaoyu Huang, University of Shanghai for Science & Technology, China
Kallol Bagchi, University of Texas at El Paso, USA

INTRODUCTION

An intelligent system is a system that has, similar to a living organism, a coherent set of components and subsystems working together to engage in goal-driven activities. In general, an intelligent system is able to sense and respond to its changing environment; gather and store information in its memory; learn from earlier experiences; adapt its behaviors to meet new challenges; and achieve its predetermined or evolving objectives. The system may start with a set of predefined stimulus-response rules. Those rules may be revised and improved through learning. Whenever the system encounters a situation, it evaluates and selects the most appropriate rules from its memory to act upon.

Most human organizations, such as nations, governments, universities, and business firms, can be considered intelligent systems. In recent years, researchers have developed frameworks for building organizations around intelligence, as opposed to traditional approaches that focus on products, processes, or functions (e.g., Liang, 2002; Gupta and Sharma, 2004). Today’s organizations must go beyond traditional goals of efficiency and effectiveness; they need organizational intelligence in order to adapt and survive in a continuously changing environment (Liebowitz, 1999). The intelligent behaviors of those organizations include monitoring operations, listening and responding to stakeholders, watching the markets, gathering and analyzing data, creating and disseminating knowledge, learning, and effective decision making.

Modeling intelligent systems has been a challenge for researchers. Intelligent systems, in particular those that involve multiple intelligent players, are complex systems whose dynamics do not follow clearly defined rules. Traditional system dynamics approaches or statistical modeling approaches rely on rather restrictive assumptions, such as homogeneity of individuals in the system. Many complex systems have components or units that are themselves complex systems. This fact significantly increases the difficulty of modeling intelligent systems.

Agent-based modeling of complex systems such as ecological systems, stock markets, and disaster recovery has recently garnered significant research interest from a wide spectrum of fields, including politics, economics, sociology, mathematics, computer science, management, and information systems. Agent-based modeling is well suited for intelligent systems research as it offers a platform to study system behavior based on individual actions and interactions. In the following, we present the concepts and illustrate how intelligent agents can be used in modeling intelligent systems. We start with basic concepts of intelligent agents. Then we define agent-based modeling (ABM) and discuss its strengths and weaknesses. The next section applies ABM to intelligent system modeling, using an example of technology diffusion for illustration. Research issues and directions are discussed next, followed by conclusions.

INTELLIGENT AGENT

Intelligent agents, also known as software agents, are computer applications that autonomously sense and respond to their environment in the pursuit of certain designed objectives (Wooldridge and Jennings, 1995). Intelligent agents exhibit some level of intelligence. They can be used to assist the user in performing non-repetitive tasks, such as seeking information, shopping, scheduling, monitoring, control, negotiation, and bargaining. Intelligent agents come in various shapes and forms, such as knowbots, softbots, taskbots, personal agents, shopbots, and information agents. No matter what shape or form they have, intelligent agents exhibit one or more of the following characteristics:

• Autonomous: Being able to exercise control over their own actions.
• Adaptive/Learning: Being able to learn and adapt to their external environment.
• Social: Being able to communicate, bargain, collaborate, and compete with other agents on behalf of their masters (users).
• Mobile: Being able to migrate themselves from one machine/system to another in a network, such as the Web.
• Goal-oriented: Being able to act in accordance with built-in goals and objectives.
• Communicative: Being able to communicate with people or other agents through protocols such as agent communication language (ACL).
• Intelligent: Being able to exhibit intelligent behavior such as reasoning, generalizing, learning, dealing with uncertainty, using heuristics, and natural language processing.
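The autonomous, adaptive, and goal-oriented characteristics listed above can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the class names `Rule` and `RuleAgent`, the `strength` score, and the learning-rate update are hypothetical and are not taken from any agent framework cited in this article. It shows an agent that selects the strongest stimulus-response rule for a situation and revises that rule after feedback.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A stimulus-response rule with a learned strength (hypothetical design)."""
    stimulus: str
    response: str
    strength: float = 1.0


@dataclass
class RuleAgent:
    """Agent that picks the strongest matching rule and learns from rewards."""
    rules: list = field(default_factory=list)
    learning_rate: float = 0.5

    def act(self, stimulus):
        # Autonomy: the agent itself chooses among the rules that match.
        matching = [r for r in self.rules if r.stimulus == stimulus]
        return max(matching, key=lambda r: r.strength) if matching else None

    def learn(self, rule, reward):
        # Adaptation: move the rule's strength toward the observed reward.
        rule.strength += self.learning_rate * (reward - rule.strength)


agent = RuleAgent(rules=[
    Rule("price_drop", "buy", strength=1.0),
    Rule("price_drop", "wait", strength=0.8),
])
first = agent.act("price_drop")    # the "buy" rule is strongest at the start
agent.learn(first, reward=0.0)     # feedback: buying turned out badly
second = agent.act("price_drop")   # the "wait" rule now outranks "buy"
```

Keeping the rule store separate from the selection and learning steps mirrors the description of an agent that evaluates rules held in memory and revises them through experience; a real agent would typically use richer matching and a less simplistic update.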

AGENT-BASED MODELING

Using intelligent agents and their actions and interactions in a given environment to simulate the complex dynamics of a system is referred to as agent-based modeling. ABM research is closely related to research in complex systems, emergence, computational sociology, multi-agent systems, evolutionary programming, and intelligent organizations. In ABM, system behavior results from the individual and collective behaviors of the agents. Researchers of ABM are interested in how macro phenomena emerge from micro-level behaviors among a heterogeneous set of interacting agents (Holland, 1992). Every agent has its own attributes and behavior rules. When agents encounter one another in the agent society, each agent individually assesses the situation and makes decisions on the basis of its behavior rules. In general, individual agents do not have global awareness in the multi-agent system.

Agent-based modeling allows a researcher to set different parameters and behavior rules for individual agents. The modeler makes the assumptions that are most relevant to the situation at hand, and then watches phenomena emerge from the interactions of the agents. Various hypotheses can be tested by changing agent parameters and rules. The emergent collective pattern of the agent society often leads to results that may not have been predicted.

One of the main advantages of ABM over traditional mathematical equation-based modeling is the ability to model individual styles and attributes, rather than assuming homogeneity of the whole population. Traditional models based on analytical techniques often become intractable as the systems reach real-world levels of complexity. ABM is particularly suitable for studying system dynamics that are generated from interactions of heterogeneous individuals. In recent years, ABM has been used in studying many real-world systems, such as stock markets (Castiglione, 2000), group selection (Pepper, 2000), and workflow and information diffusion (Neri, 2004). Bonabeau (2002) presents a good summary of ABM methodology and the scenarios where ABM is appropriate.

ABM is, however, not immune from criticism. Per Bonabeau (2002), “an agent-based model will only be as accurate as the assumptions and data that went into it, but even approximate simulations can be very valuable”. It has also been observed that ABM relies on simplified models of rule-based human behavior that often fail to take into consideration the complexity of human cognition. In addition, it suffers from the “unwrapping” problem: the solution is built into the program, which prevents the occurrence of new or unexpected events (Macy, 2002).
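To illustrate how a macro-level pattern can emerge from local encounters among heterogeneous agents, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function name `run_encounters`, the "opinion" attribute, and the "openness" behavior parameter are hypothetical and do not come from any of the studies cited above. Each agent knows only its own state and that of the agent it happens to meet, yet the population-wide spread of opinions, which no agent observes, shrinks over time.

```python
import random


def run_encounters(n_agents=50, n_meetings=5000, seed=1):
    """Pairwise encounters between heterogeneous agents (illustrative only)."""
    rng = random.Random(seed)
    # Attribute: each agent holds an opinion in [0, 1].
    opinion = [rng.random() for _ in range(n_agents)]
    # Behavior rule parameter: each agent has its own openness in [0, 0.5].
    openness = [rng.uniform(0.0, 0.5) for _ in range(n_agents)]
    # Macro-level observable, invisible to the individual agents.
    spread = [max(opinion) - min(opinion)]
    for _ in range(n_meetings):
        a, b = rng.sample(range(n_agents), 2)  # a local, chance encounter
        oa, ob = opinion[a], opinion[b]
        # Each agent moves toward the other according to its own rule.
        opinion[a] = oa + openness[a] * (ob - oa)
        opinion[b] = ob + openness[b] * (oa - ob)
        spread.append(max(opinion) - min(opinion))
    return spread


spread = run_encounters()
```

Changing the distribution of `openness`, or replacing the update rule, is the kind of experiment the text describes: hypotheses are tested by varying agent parameters and rules and observing the resulting collective pattern.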

ABM FOR INTELLIGENT SYSTEMS

An intelligent system is a system that can sense and respond to its environment in pursuing its goals and objectives. It can learn and adapt based on past experience. Examples of intelligent systems include, but are not limited to, the following: biological life such as human beings, artificial intelligence applications, robots, organizations, nations, projects, and social movements.

Walter Fritz (1997) suggests that the key components of an intelligent system include objectives, senses, concepts, growth of a concept, present situation, response rules, mental methods, selection, actions, reinforcement, memory and forgetting, sleeping, and patterns (high-level concepts). It is apparent that traditional analytical modeling techniques are not able to model many of the components of intelligent systems, let alone the complete system dynamics. However, ABM lends itself well to such a task. All those components can be modeled as agents (albeit some in an abstract sense). An intelligent system is thus made of interrelated and interacting agents. ABM is especially suitable for intelligent systems that consist of a large number of heterogeneous participants, such as a human organization.

Modeling Processes

Agent-based modeling for intelligent systems starts with a thorough analysis of the intelligent system. Since the system under consideration may exhibit complex behaviors, we need to identify one or a few key features to focus on. Given a scenario of the target intelligent system, we first establish a set of objectives that we aim to achieve via the simulation of the agent-based representation of the intelligent system. The objectives of the research can be expressed as a set of questions to which we seek answers (Doran, 2006).

A conceptual model is created to lay out the requirements for achieving the objectives. This includes defining the entities, such as agents, environment, resources, processes, and relationships. The conceptual modeling phase answers the question of what is needed. The design model determines how the requirements can be implemented, including defining the features and relevant behaviors of the agents (Brown, 2006).

Depending on the goals of a particular research project, a model may involve the use of designed or empirically grounded agents. Designed agents are those endowed with characteristics and behaviors that represent conditions for testing specific hypotheses about the intelligent system. When the agents are empirically grounded, they are used to represent real-world entities, such as individuals or processes in an organization. Empirically grounded agents are feasible only when data about the real-world entities are available. Similarly, the environment within which the agents act can be designed or empirically grounded.

In practice, a study may start with simple models, often with designed agents and environments, to explore certain specific dynamics of the system. The design model is refined through the calibration process, in which design parameters are modified to improve the desired characteristics of the model. The final step in the modeling process is validation, where we check the individual agent behavior, interactions, and emergent properties of the system against expected design features. Validation usually involves comparison of model outcomes, often at the macro level, with comparable outcomes in the real world (Midgley et al., 2007). Figure 1 shows the complete modeling process. A general tutorial on ABM is given by Macal and North (2005).

Figure 1. Agent-based modeling process: START → Set Objectives → Conceptual Model → Design Model → Calibration → Validation → END

ABM for Innovation Diffusion

We present an example of using agent-based intelligent system modeling for studying the acceptance and diffusion of innovative ideas or technology. Diffusion of innovation has been studied extensively over the last few decades (Rogers, 1995). However, traditional research in innovation diffusion has been grounded on case-based analysis and analytical systems modeling (e.g., using differential and difference equations). Agent-based modeling for diffusion of innovation is relatively new. Our example is adapted from a model created by Michael Samuels (2007), implemented with a popular agent modeling system, NetLogo.

The objective of innovation diffusion modeling is to answer questions such as how an idea or technology is adopted in a population, how different people (e.g., innovators, early adopters, and change agents) influence each other, and under what conditions an innovation will be accepted or rejected by the population.

In the conceptual modeling, we identify various factors that influence an individual’s propensity for adopting the innovation. Those factors are broadly divided into two categories: internal influences (e.g., word-of-mouth) and external influences (e.g., mass media). Any factor that exerts its influence through individual contact is considered an internal influence. Individuals in the target population are divided into four groups: adopter, potential (adopter), change agent, and disrupter. Adopters are those who have adopted the innovation, while potentials are those who have a certain likelihood of adopting the innovation. Change agents are the champions of the innovation. They are very knowledgeable and enthusiastic about the innovation, and often play a critical role in facilitating its diffusion. Disrupters play the opposite role to change agents. They are against the current innovation, oftentimes because they favor an even newer innovation that they perceive to be better. The four groups of agents and their relationships are depicted in Figure 2. It is common, although not necessary, to assume that those four groups make up the entire population.

Figure 2. Agents and influences. The diagram shows the four agent groups (Change Agent, Potential, Adopter, Disrupter) and the Environment; dotted lines denote external influence and solid lines denote internal influence.

In a traditional diffusion model, such as the Bass model (Bass, 1969), the diffusion rate depends only on the number of adopters (and potential adopters, given a fixed population size). Characteristics of individuals in the population are ignored. Even in those models where it is assumed that potential adopters have varying thresholds for adopting an innovation (Abrahamson and Rosenkopf, 1997), the individuality is very limited. In agent-based modeling, however, the types of individuals and individual characteristics are essentially unbounded. For example, we can easily divide adopters into innovators, early adopters, late adopters, and so on. If necessary, various demographic and socioeconomic features can be bestowed on individual agents. Furthermore, both internal and external influence can be attributed to more specific causes. For example, internal influence through social networks can be divided into traditional social networks that consist of friends and acquaintances and virtual social networks formed online. Table 1 lists typical factors that affect the propensity of adopting an innovation.

Table 1. Typical internal and external influences

Internal influence                   External influence
Word-of-mouth                        Newspapers
Telephone                            Television
Email                                Laws, policies and regulations
Instant message                      Culture
Chat                                 Internet/Web
Blog                                 Online communities
Social networks (online/offline)     RSS

An initial study of innovation diffusion, such as the one in Samuels (2007), can simply aggregate all internal influences into “word-of-mouth” and all external influences into “mass media”. Each potential adopter’s tendency to convert to an adopter is influenced by chance encounters with other agents. If a potential adopter meets a change agent, who is an avid promoter of the innovation, he or she becomes more knowledgeable about the advantages of the innovation and more likely to adopt. An encounter with a disrupter creates the opposite effect, as a disrupter favors a different type of innovation.

In order for the simulated model to accurately reflect a real-world situation, the model structure and parameter values should be carefully selected. For example, we need to decide how much influence each encounter exerts, what the probability of encountering a change agent or a disrupter is, and how much influence comes from the mass media. We can obtain these values through surveys, statistical analysis of empirical data, or experiments specifically designed to elicit data from real-world situations.
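The encounter mechanism described in this example can be sketched in a few lines of Python. This is not the NetLogo model by Samuels (2007); it is a simplified illustration in which the function name `simulate`, the parameter names (`wom`, `champion`, `disrupt`, `media`), their default values, and the per-agent adoption threshold are all assumptions chosen for demonstration. Potential adopters accumulate propensity from chance encounters (internal influence) and from mass media (external influence), and convert once their propensity reaches their individual threshold.

```python
import random


def simulate(n_potential=200, n_adopter=5, n_change=5, n_disrupter=5,
             steps=50, wom=0.02, champion=0.10, disrupt=0.10, media=0.005,
             seed=0):
    """Return the number of adopters after each step (illustrative sketch)."""
    rng = random.Random(seed)
    kinds = (["potential"] * n_potential + ["adopter"] * n_adopter +
             ["change_agent"] * n_change + ["disrupter"] * n_disrupter)
    propensity = [0.0] * len(kinds)
    # Heterogeneity: every agent gets its own adoption threshold (assumed).
    threshold = [rng.uniform(0.2, 1.0) for _ in kinds]
    history = [kinds.count("adopter")]
    for _ in range(steps):
        for i, kind in enumerate(kinds):
            if kind != "potential":
                continue
            other = kinds[rng.randrange(len(kinds))]  # chance encounter
            if other == "adopter":
                propensity[i] += wom                  # word-of-mouth
            elif other == "change_agent":
                propensity[i] += champion             # avid promoter
            elif other == "disrupter":
                propensity[i] = max(0.0, propensity[i] - disrupt)
            propensity[i] += media                    # external influence
            if propensity[i] >= threshold[i]:
                kinds[i] = "adopter"                  # adoption is permanent here
        history.append(kinds.count("adopter"))
    return history


adoption_curve = simulate()
no_influence = simulate(wom=0.0, champion=0.0, media=0.0)
```

Agents are updated one after another within a step, so an agent that adopts early in a sweep can already influence agents visited later in the same sweep. Setting all positive influences to zero, as in `no_influence`, leaves the adopter count unchanged, which is a quick sanity check before calibrating the parameters against survey or empirical data.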

TRENDS AND RESEARCH ISSUES

As illustrated through the example of modeling the diffusion of innovation in an organization, industry, or society, agent-based modeling can be used to model the adaptation of intelligent systems that consist of intelligent individuals. As most intelligent systems are complex in both structure and system dynamics, traditional modeling tools that require too many unrealistic assumptions have become less effective in modeling intelligent systems. In recent years, agent-based modeling has found a wide spectrum of applications, such as business strategic solutions, supply chain management, stock markets, power economy, social evolution, military operations, security, and ecology (North and Macal, 2007). As ABM tools and resources become more accessible, research and applications of agent-based intelligent system modeling are expected to increase in the near future.

Some challenges remain, though. Using ABM to model intelligent systems is a research area that draws theories from other fields, such as economics, psychology, and sociology, but lacks its own well-established theoretical foundation. ABM has four key assumptions (Macy and Willer, 2002): agents act locally with little or no central authority; agents are interdependent; agents follow simple rules; and agents are adaptive. However, some of those assumptions may not be applicable to intelligent system modeling. Central authorities, or central authoritative information such as mass media in the innovation diffusion example, may play an important role in intelligent organizations. Not all agents are alike in an intelligent system. Some may be independent, non-adaptive, or following complex behavior rules.

ABM uses a “bottom-up” approach, creating emergent behaviors of an intelligent system through “actors” rather than “factors”. However, macro-level factors have a direct impact on macro behaviors of the system. Macy and Willer (2002) suggest that bringing those macro-level factors back will make agent-based modeling more effective, especially in intelligent systems such as social organizations.

Recent intelligent systems research has developed the concept of integrating human and machine-based data, knowledge, and intelligence. Kirn (1996) postulates that the organization of the 21st century will involve artificial agent-based systems highly intertwined with the human intelligence of the organization. Thus, a new challenge for agent-based intelligent system modeling is to develop models that account for interaction, aggregation, and coordination of intelligent agents and human agents. The ABM will represent not only the human players in an intelligent system, but also the intelligent agents that are developed in real-world applications in those systems.

CONCLUSION

Modeling intelligent systems involving multiple intelligent players has been difficult using traditional approaches. We have reviewed recent developments in agent-based modeling and suggest that agent-based modeling is well suited for studying intelligent systems, especially those systems with sophisticated and heterogeneous participants. Agent-based modeling allows us to model system behaviors based on the actions and interactions of individuals in the system. Although most ABM research focuses on local rules and behaviors, it is possible to integrate global influences in the models. ABM represents a novel approach to modeling intelligent systems. Combined with traditional modeling approaches (for example, micro-level simulation as proposed in MoSeS), ABM offers researchers a promising tool to solve complex and practical problems and to broaden research endeavors (Wu, 2007).



REFERENCES

Abrahamson, E. and Rosenkopf, L. (1997). Social Network Effects on the Extent of Innovation Diffusion: A Computer Simulation. Organization Science. 8(3), 289-309.

Bass, F. M. (1969). A New Product Growth Model for Consumer Durables. Management Science. 15(5), 215-227.

Bonabeau, E. (2002). Agent-based modeling: Methods and techniques for simulating human systems. PNAS, May 14, 2002. 99, suppl. 3, 7280-7287.

Brown, D. G. (2006). Agent-based models. In H. Geist, Ed. The Earth’s Changing Land: An Encyclopedia of Land-Use and Land-Cover Change. Westport CT: Greenwood Publishing Group. 7-13.

Castiglione, F. (2000). Diffusion and aggregation in an agent based model of stock market fluctuations. International Journal of Modern Physics C. 11(5), 1-15.

Doran, J. E. (2006). Agent Design for Agent Based Modeling. In Agent Based Computational Modelling: Applications in Demography, Social, Economic and Environmental Sciences, eds. F. C. Billari, T. Fent, A. Prskawetz, and J. Scheffran. Physica-Verlag (Springer). 215-223.

Fritz, W. (1997). Intelligent Systems and their Societies. First version: Jan 27, 1997. http://www.intelligent-systems.com.ar/intsyst/index.htm

Gupta, J. N. D. and Sharma, S. K. (2004). Editors. Intelligent Enterprises for the 21st Century. Hershey, PA: Idea Group Publishing.

Holland, J. H. (1992). Complex adaptive systems. Daedalus. 121(1), 17-30.

Kirn, S. (1996). Organizational intelligence and distributed artificial intelligence. In Foundations of Distributed Artificial Intelligence, G. M. O’Hare and N. R. Jennings, Eds. John Wiley Sixth-Generation Computer Technology Series. John Wiley & Sons, New York, NY. 505-526.

Liang, T. Y. (2002). The Inherent Structure and Dynamic of Intelligent Human Organizations. Human Systems Management. 21(1), 9-19.

Liebowitz, J. (1999). Building Organizational Intelligence: A Knowledge Primer. New York: CRC Press.

Macal, C. M. and North, M. J. (2005). Tutorial on Agent-Based Modeling and Simulation. Proceedings of the 37th Winter Simulation Conference, Orlando, Florida. 2-15.

Macy, M. W. (2002). Social Simulation. In N. Smelser and P. Baltes, eds., International Encyclopedia of the Social and Behavioral Sciences, Elsevier, The Netherlands.

Macy, M. W. and Willer, R. (2002). From Factors to Actors: Computational Sociology and Agent-Based Modeling. Annual Review of Sociology. 28, 143-166.

McMaster, M. D. (1996). The Intelligence Advantage: Organizing for Complexity. Burlington MA: Butterworth-Heinemann.

Midgley, D. F., Marks, R. E., and Kunchamwar, D. (2007). The Building and Assurance of Agent-Based Models: An Example and Challenge to the Field. Journal of Business Research. 60(8), 884-893.

Neri, F. (2004). Agent Based Simulation of Information Diffusion in a Virtual Market Place. IEEE/WIC/ACM International Conference on Intelligent Agent Technology (IAT'04). 333-336.

North, M. J. and Macal, C. M. (2007). Managing Business Complexity: Discovering Strategic Solutions with Agent-based Modeling and Simulation. Oxford University Press, New York.

Pepper, J. W. (2000). An Agent-Based Model of Group Selection. Santa Fe Institute. Retrieved June 16, 2007 at: http://www.santafe.edu/~jpepper/papers/ALIFE7_GS.pdf

Rogers, E. M. (1995). Diffusion of Innovations. The Free Press, New York.

Samuels, M. L. (2007). Innovation model. Last updated: 01/08/2007. http://ccl.northwestern.edu/netlogo/models/community/Innovation

Wooldridge, M. and Jennings, N. R. (1995). Intelligent Agents: Theory and Practice. Knowledge Engineering Review. 10(2), 115-152.

Wu, B. (2007). A Hybrid Approach for Spatial MSM. NSF/ESRC Agenda Setting Workshop on Agent-Based Modeling of Complex Spatial Systems: April 14-16, 2007.

KEY TERMS

Agent-Based Modeling: Using intelligent agents and their actions and interactions in a given environment to simulate the complex dynamics of a system.

Diffusion of Innovation: Popularized by Everett Rogers, it is the study of the process by which an innovation is communicated and adopted over time among the members of a social system.

Intelligent Agent: An autonomous software program that is able to learn and adapt to its environment in order to perform certain tasks delegated to it by its master.

Intelligent System: A system that has a coherent set of components and subsystems working together to engage in goal-driven activities.

Intelligent System Modeling: The process of construction, calibration, and validation of models of intelligent systems.

Multi-Agent System: A distributed system with a group of intelligent agents that communicate, bargain, compete, and cooperate with other agents and the environment to achieve goals designated by their masters.

Organizational Intelligence: The ability of an organization to perceive, interpret, and select the most appropriate response to the environment in order to advance its goals.


AI and Ideas by Statistical Mechanics

Lester Ingber, Lester Ingber Research, USA

INTRODUCTION

A briefing (Allen, 2004) demonstrates the breadth and depth of complexity required to address real diplomatic, information, military, economic (DIME) factors for the propagation/evolution of ideas through defined populations. An open mind would conclude that multiple approaches may be required for multiple decision makers in multiple scenarios. However, it is in the interests of multiple decision makers to rely as much as possible on the same generic model for actual computations. Many users would have to trust that the coded model processes their inputs faithfully.

Similar to DIME scenarios, sophisticated competitive marketing requires assessments of responses of populations to new products.

Many large financial institutions are now trading at speeds barely limited by the speed of light. They co-locate their servers close to exchange floors to be able to turn quotes into orders to be executed within milliseconds. Clearly, trading at these speeds requires automated algorithms for processing and making decisions. These algorithms are based on "technical" information derived from price, volume, and quote (Level II) information. The next big hurdle to automated trading is to turn "fundamental" information into technical indicators, e.g., to include new political and economic news in such algorithms.

BACKGROUND

The concept of “memes” is an example of an approach to deal with DIME factors (Situngkir, 2004). The meme approach, using a reductionist philosophy of evolution among genes, is reasonably contrasted to approaches emphasizing the need to include relatively global influences of evolution (Thurtle, 2006).

Multiple other efforts being conducted worldwide must at least be kept in mind while developing and testing models of evolution/propagation of ideas in defined populations. A study on a simple algebraic model of opinion formation concluded that the only final opinions are extremal ones (Aletti et al., 2006). A study of the influence of chaos on opinion formation, using a simple algebraic model, concluded that contrarian opinion could persist and be crucial in close elections, although the authors were careful to note that most real populations probably do not support chaos (Borghesi & Galam, 2006). A limited review of work in social networks illustrates that there are about as many phenomena to be explored as there are disciplines ready to apply their network models (Sen, 2006).

Statistical Mechanics of Neocortical Interactions (SMNI)

A class of AI algorithms that has not yet been developed in this context takes advantage of information known about real neocortex. It seems appropriate to base an approach for propagation of ideas on the only system so far demonstrated to develop and nurture ideas, i.e., the neocortical brain. A statistical mechanical model of neocortical interactions, developed by the author and tested successfully in describing short-term memory (STM) and electroencephalography (EEG) indicators, is the proposed bottom-up model.

Ideas by Statistical Mechanics (ISM) is a generic program to model evolution and propagation of ideas/patterns throughout populations subjected to endogenous and exogenous interactions (Ingber, 2006). ISM develops subsets of macrocolumnar activity of multivariate stochastic descriptions of defined populations, with macrocolumns defined by their local parameters within specific regions and with parameterized endogenous inter-regional and exogenous external connectivities. Parameters of subsets of macrocolumns will be fit to patterns representing ideas. Parameters of external and inter-regional interactions will be determined that promote or inhibit the spread of these ideas. Fitting such nonlinear systems requires the use of sampling techniques.

The author's approach uses guidance from his statistical mechanics of neocortical interactions (SMNI), developed in a series of about 30 published papers from 1981 to 2001 (Ingber, 1983; Ingber, 1985; Ingber, 1992; Ingber, 1994; Ingber, 1995; Ingber, 1997). These papers also address long-standing issues of whether information measured by electroencephalography (EEG) arises from bottom-up local interactions (clusters of thousands to tens of thousands of neurons interacting via short-ranged fibers) or from top-down influences of global interactions (mediated by long-ranged myelinated fibers). SMNI does this by including both local and global interactions as being necessary to develop neocortical circuitry.

Statistical Mechanics of Financial Markets (SMFM)

Tools of financial risk management, developed to process correlated multivariate systems with differing non-Gaussian distributions using modern copula analysis, enable bona fide correlations and uncertainties of success and failure to be calculated. Since 1984, the author has published about 20 papers developing a Statistical Mechanics of Financial Markets (SMFM), many available at http://www.ingber.com. These are relevant to ISM, to properly deal with real-world distributions that arise in such varied contexts.

Gaussian copulas are developed in the project Trading in Risk Dimensions (TRD) (Ingber, 2006). Other copula distributions are possible, e.g., Student-t distributions. These alternative distributions can be quite slow because inverse transformations typically are not as quick as for the present distribution. Copulas are cited as an important component of risk management not yet widely used by risk management practitioners (Blanco, 2005).
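As a generic illustration of how a Gaussian copula couples variables that have different non-Gaussian marginal distributions, the following Python sketch draws correlated pairs. It is not the TRD code; the function names `gaussian_copula_pairs`, `exponential_inv_cdf`, and `laplace_inv_cdf`, the choice of marginals, and the correlation value are assumptions made for this example. Correlated standard normal variables are mapped to uniform variables through the normal CDF, and each uniform variable is then mapped through the inverse CDF of its own marginal.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_pairs(rho, inv_cdf_x, inv_cdf_y, n, seed=0):
    """Draw n pairs whose dependence is a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    std = NormalDist()   # standard normal, used for its CDF
    eps = 1e-12          # keep uniforms strictly inside (0, 1)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Second normal variable with correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u1 = min(max(std.cdf(z1), eps), 1.0 - eps)
        u2 = min(max(std.cdf(z2), eps), 1.0 - eps)
        # Each marginal is imposed through its own inverse CDF.
        pairs.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return pairs


def exponential_inv_cdf(u, rate=1.0):
    """Inverse CDF of an exponential distribution."""
    return -math.log(1.0 - u) / rate


def laplace_inv_cdf(u, scale=1.0):
    """Inverse CDF of a zero-mean Laplace (double-exponential) distribution."""
    if u < 0.5:
        return scale * math.log(2.0 * u)
    return -scale * math.log(2.0 * (1.0 - u))


sample = gaussian_copula_pairs(0.7, exponential_inv_cdf, laplace_inv_cdf, n=1000)
```

The speed remark in the text shows up here as the cost of the transformations: a copula built on another distribution, such as Student-t, would need that distribution's CDF and inverse in place of the normal ones, and these are generally slower to evaluate.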

Sampling Tools

Computational approaches developed to process different approaches to modeling phenomena must not be confused with the models of these phenomena. For example, the meme approach lends itself well to a computational scheme in the spirit of genetic algorithms (GA). The cost/objective function that describes the phenomena could of course be processed by any other sampling technique, such as simulated annealing (SA). One comparison (Ingber & Rosen, 1992) demonstrated the superiority of SA over GA on cost/objective functions used in a GA database. That study used Very Fast Simulated Reannealing (VFSR), created by the author for military simulation studies (Ingber, 1989), which has evolved into Adaptive Simulated Annealing (ASA) (Ingber, 1993). However, it is the author's experience that the art and science of sampling complex systems requires tuning expertise of the researcher as well as good codes, and GA or SA likely would do as well on cost functions for this study.

If there are no analytic or relatively standard math functions for the transformations required, then these transformations must be performed explicitly numerically in code such as TRD. Then, the ASA_PARALLEL OPTIONS already existing in ASA (developed as part of the 1994 National Science Foundation Parallelizing ASA and PATHINT Project (PAPP)) would be very useful to speed up real-time calculations (Ingber, 1993).

Below, only a few topics relevant to ISM are discussed. More details are in a previous report (Ingber, 2006).
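The separation between a cost/objective function and the sampling technique applied to it can be illustrated with a textbook simulated annealing loop in Python. This sketch is not ASA or VFSR: the function name `simulated_annealing`, the Gaussian proposal, the exponential cooling schedule, and the Rastrigin test function are all assumptions chosen for illustration, and ASA's own temperature schedules and reannealing are not reproduced here.

```python
import math
import random


def simulated_annealing(cost, x0, step=0.5, t0=1.0, cooling=0.995,
                        iters=5000, seed=0):
    """Minimize cost(x) with a basic Metropolis acceptance rule (generic SA)."""
    rng = random.Random(seed)
    x, fx = list(x0), cost(x0)
    best, fbest = list(x), fx
    t = t0
    for _ in range(iters):
        # Propose a nearby point by perturbing every coordinate.
        cand = [xi + rng.gauss(0.0, step) for xi in x]
        fc = cost(cand)
        # Always accept improvements; accept uphill moves with
        # probability exp(-(increase in cost) / temperature).
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = list(x), fx
        t = max(t * cooling, 1e-12)  # cool down, but never reach zero
    return best, fbest


def rastrigin(x):
    """Multimodal test function with its global minimum of 0 at the origin."""
    return sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) + 10.0 for xi in x)


start = [3.0, -2.5]
best, fbest = simulated_annealing(rastrigin, start)
```

Because the loop only ever calls `cost`, the same objective could be handed to a genetic algorithm or any other sampler without change, which is the point made in the text. How well any of them performs on a given problem depends heavily on tuning choices such as `step`, `t0`, and `cooling`.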

SMNI AND SMFM APPLIED TO ARTIFICIAL INTELLIGENCE

Neocortex has evolved to use minicolumns of neurons interacting via short-ranged interactions in macrocolumns, and interacting via long-ranged interactions across regions of macrocolumns. This common architecture processes patterns of information within and among different regions of sensory, motor, associative cortex, etc. Therefore, the premise of this approach is that this is a good model to describe and analyze evolution/propagation of ideas among defined populations.

Relevant to this study is that a spatial-temporal lattice-field short-time conditional multiplicative-noise (nonlinear in drifts and diffusions) multivariate Gaussian-Markovian probability distribution is developed faithful to neocortical function/physiology. Such probability distributions are a basic input into the approach used here. The SMNI model was the first physical application of a nonlinear multivariate calculus developed by other mathematical physicists in the late 1970s to define a statistical mechanics of multivariate nonlinear nonequilibrium systems (Graham, 1977; Langouche et al., 1982).
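As a generic illustration of what a short-time conditional Gaussian-Markovian distribution with multiplicative noise looks like, consider a single variable M with drift f(M) and diffusion g²(M). This one-variable prepoint (Itô) form is a textbook simplification assumed here for illustration only; the symbols f, g, and Δt are not taken from the SMNI papers, whose distributions are multivariate and defined on a spatial lattice.

```latex
P\left[ M_{t+\Delta t} \mid M_t \right]
  = \frac{1}{\sqrt{2\pi\, g^2(M_t)\, \Delta t}}
    \exp\!\left(
      -\frac{\left[ M_{t+\Delta t} - M_t - f(M_t)\, \Delta t \right]^2}
            {2\, g^2(M_t)\, \Delta t}
    \right)
```

The noise is called multiplicative because the diffusion g² depends on the state M, and the process is Markovian because the distribution at time t + Δt depends only on the state at time t.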



SMNI Tests on STM and EEG

SMNI Description of STM

SMNI builds from synaptic interactions to minicolumnar, macrocolumnar, and regional interactions in neocortex. Since 1981, a series of SMNI papers has been developed to model columns and regions of neocortex, spanning mm to cm of tissue. Most of these papers have dealt explicitly with calculating properties of STM and scalp EEG in order to test the basic formulation of this approach (Ingber, 1983; Ingber, 1985; Ingber & Nunez, 1995). The SMNI modeling of local mesocolumnar interactions (convergence and divergence between minicolumnar and macrocolumnar interactions) was tested on STM phenomena. The SMNI modeling of macrocolumnar interactions across regions was tested on EEG phenomena.

SMNI studies have detailed that maximal numbers of attractors lie within the physical firing space of both excitatory and inhibitory minicolumnar firings, consistent with experimentally observed capacities of auditory and visual STM, when a "centering" mechanism is enforced by shifting background noise in synaptic interactions, consistent with experimental observations under conditions of selective attention (Ingber, 1985; Ingber, 1994). These calculations were further supported by high-resolution evolution of the short-time conditional-probability propagator using PATHINT (Ingber & Nunez, 1995). SMNI correctly calculated the stability and duration of STM, the primacy versus recency rule,

Figure 1. Illustrated are three biophysical scales of neocortical interactions: (a)-(a*)-(a') microscopic neurons; (b)-(b') mesocolumnar domains; (c)-(c') macroscopic regions (Ingber, 1983). SMNI has developed appropriate conditional probability distributions at each level, aggregating up from the smallest levels of interactions. In (a*) synaptic inter-neuronal interactions, averaged over by mesocolumns, are phenomenologically described by the mean and variance of a distribution Ψ. Similarly, in (a) intraneuronal transmissions are phenomenologically described by the mean and variance of Γ. Mesocolumnar averaged excitatory (E) and inhibitory (I) neuronal firings M are represented in (a'). In (b) the vertical organization of minicolumns is sketched together with their horizontal stratification, yielding a physiological entity, the mesocolumn. In (b') the overlap of interacting mesocolumns at locations r and r′ from times t and t + τ is sketched. In (c) macroscopic regions of neocortex are depicted as arising from many mesocolumnar domains. (c') sketches how regions may be coupled by long-ranged interactions.



random access to memories within tenths of a second as observed, and the observed 7±2 capacity rule of auditory memory and the observed 4±2 capacity rule of visual memory. SMNI also calculates how STM patterns (e.g., from a given region or even aggregated from multiple regions) may be encoded by dynamic modification of synaptic parameters (within experimentally observed ranges) into long-term memory patterns (LTM) (Ingber, 1983).

SMNI Description of EEG

Using the power of this formal structure, sets of EEG and evoked potential data from a separate NIH study, collected to investigate genetic predispositions to alcoholism, were fitted to an SMNI model on a lattice of regional electrodes to extract brain "signatures" of STM (Ingber, 1997). Each electrode site was represented by an SMNI distribution of independent stochastic macrocolumnar-scaled firing variables, interconnected by long-ranged circuitry with delays appropriate to long-fiber communication in neocortex. The global optimization algorithm ASA was used to perform maximum likelihood fits of Lagrangians defined by path integrals of multivariate conditional probabilities. Canonical momenta indicators (CMI) were thereby derived for each individual's EEG data. The CMI give better signal recognition than the raw data, and were used to advantage as correlates of behavioral states. In-sample data was used for training (Ingber, 1997), and out-of-sample data was used for testing these fits. The architecture of ISM is modeled using scales similar to those used for local STM and global EEG connectivity.

Generic Mesoscopic Neural Networks

SMNI was applied to a parallelized generic mesoscopic neural network (MNN) (Ingber, 1992), adding computational power to a similar paradigm proposed for target recognition. "Learning" takes place by presenting the MNN with data, and parametrizing the data in terms of the firings, or multivariate firings. The "weights," or coefficients of functions of firings appearing in the drifts and diffusions, are fit to incoming data, considering the joint "effective" Lagrangian (including the logarithm of the prefactor in the probability distribution) as a dynamic

Figure 2. Scales of interactions among minicolumns are represented, within macrocolumns, across macrocolumns, and across regions of macrocolumns

cost function. This program of fitting coefficients in the Lagrangian uses methods of ASA. "Prediction" takes advantage of a mathematically equivalent representation of the Lagrangian path-integral algorithm, i.e., a set of coupled Langevin rate-equations. A coarse deterministic estimate to "predict" the evolution can be applied using the most probable path, but PATHINT has been used. PATHINT, even when parallelized, typically can be too slow for "predicting" evolution of these systems. However, PATHTREE is much faster.
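For orientation, the equivalence between a Langevin rate-equation and a short-time conditional probability can be written out in its simplest one-variable form. This is the standard Itô (prepoint) textbook statement, not the full multivariate SMNI Lagrangian; the symbols f (drift), g (diffusion amplitude) and η (white noise) are notation chosen here for illustration.

```latex
\dot{M} = f(M) + g(M)\,\eta(t), \qquad
\langle \eta(t) \rangle = 0, \quad
\langle \eta(t)\,\eta(t') \rangle = \delta(t - t')

P[M(t+\Delta t) \mid M(t)] = \frac{1}{\sqrt{2\pi\, g^{2}\, \Delta t}}\,
\exp(-\Delta t\, L), \qquad
L = \frac{\left( \dfrac{M(t+\Delta t) - M(t)}{\Delta t} - f \right)^{2}}{2 g^{2}}
```

Here f and g are evaluated at the prepoint M(t). Setting L = 0 recovers the deterministic rate equation with the noise dropped, which is the sense in which the most probable path gives a coarse "prediction".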

Architecture for Selected ISM Model

The primary objective is to deliver a computer model that contains the following features:
(1) A multivariable space will be defined to accommodate populations.
(2) A cost function over the population variables in (1) will be defined to explicitly define a pattern that can be identified as an Idea. A very important issue for this project is to develop cost functions, not only how to fit or process them.
(3) Subsets of the population will be used to fit parameters (e.g., coefficients of variables, connectivities to patterns, etc.) to an Idea, using the cost function in (2).
(4) Connectivity of the population in (3) will be made to the rest of the population. Investigations will be made to determine what endogenous connectivity is required to stop or promote the propagation of the Idea into other regions of the population.
(5) External forces, e.g., acting only on specific regions of the population, will be introduced, to determine how these exogenous forces may stop or promote the propagation of an Idea.

Application of SMNI Model

The approach is to develop subsets of Ideas/macrocolumnar activity of multivariate stochastic descriptions of



defined populations (of reasonable but small population samples, e.g., of 100-1000), with macrocolumns defined by their local parameters within specific regions (larger samples of populations) and with parameterized long-ranged inter-regional and external connectivities. Parameters of a given subset of macrocolumns will be fit using ASA to patterns representing Ideas, akin to acquiring hard-wired long-term (LTM) patterns. Parameters of external and inter-regional interactions will be determined that promote or inhibit the spread of these Ideas, by determining the degree of fits and overlaps of probability distributions relative to the seeded macrocolumns. That is, the same Ideas/patterns may be represented in other than the seeded macrocolumns by local confluence of macrocolumnar and long-ranged firings, akin to STM, or by different hard-wired parameter LTM sets that can support the same local firings in other regions (possible in nonlinear systems). SMNI also calculates how STM can be dynamically encoded into LTM (Ingber, 1983). Small populations in regions will be sampled to determine if the propagated Idea(s) exists in its pattern space where it did exist prior to its interactions with the seeded population. SMNI derives nonlinear functions as arguments of probability distributions, leading to multiple STM, e.g., 7±2 for auditory memory capacity. Some investigation will be made into nonlinear functional forms other than those derived for SMNI, e.g., to have capacities of tens or hundreds of patterns for ISM.

Application of TRD Analysis

This approach includes application of methods of portfolio risk analysis to such statistical systems, correcting two kinds of errors committed in multivariate risk analyses: (E1) Although the distributions of variables being considered are not Gaussian (or not tested to see how close they are to Gaussian), standard statistical calculations appropriate only to Gaussian distributions are employed. (E2) Either correlations among the variables are ignored, or the mistakes committed in (E1) (incorrectly assuming variables are Gaussian) are compounded by calculating correlations as if all variables were Gaussian. It should be understood that any sampling algorithm processing a huge number of states can find many multiple optima. ASA's MULTI_MIN OPTIONS are

used to save multiple optima during sampling. Some algorithms might label these states as "mutations" of optimal states. It is important to be able to include them in final decisions, e.g., to apply additional metrics of performance specific to applications. Experience with risk-managing portfolios shows that all criteria are not best considered by lumping them all into one cost function, but rather good judgment should be applied to multiple stages of pre-processing and post-processing when performing such sampling, e.g., adding additional metrics of performance.
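The errors (E1) and (E2) can be illustrated with a small sketch of a Gaussian-copula style transformation. This is not the TRD code: it assumes the marginal distributions are handled by empirical ranks (rather than fitted analytic forms), it ignores ties, and the data are hypothetical. Each variable is mapped through its empirical cumulative distribution and the inverse normal distribution, and the correlation is then computed in the resulting Gaussian space.

```python
from statistics import NormalDist, mean


def to_gaussian_scores(values):
    """Map a sample to normal scores via its empirical CDF (ranks, no ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    std_normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        u = rank / (n + 1)            # empirical CDF value, strictly inside (0, 1)
        scores[i] = std_normal.inv_cdf(u)
    return scores


def pearson(x, y):
    """Ordinary second-moment (Pearson) correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical skewed data: y is a monotone but nonlinear function of x.
x = [0.1, 0.4, 0.9, 1.5, 2.2, 3.0, 4.1, 5.5, 7.0, 9.0]
y = [v ** 3 for v in x]

raw_corr = pearson(x, y)                                          # computed as if Gaussian
copula_corr = pearson(to_gaussian_scores(x), to_gaussian_scores(y))
```

In this toy case the two variables are perfectly dependent, which the correlation in the transformed space reports, while the raw second-moment correlation understates it.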

FUTURE TRENDS

Given financial and political motivations to merge information discussed in the Introduction, it is inevitable that many AI algorithms will be developed, and many current AI algorithms will be enhanced, to address these issues.

CONCLUSION

It seems appropriate to base an approach for propagation of generic ideas on the only system so far demonstrated to develop and nurture ideas, i.e., the neocortical brain. The proposed model, Ideas by Statistical Mechanics (ISM) (Ingber, 2006), builds on a statistical mechanical model of neocortical interactions developed by the author and tested successfully in describing short-term memory and EEG indicators. ISM develops subsets of macrocolumnar activity of multivariate stochastic descriptions of defined populations, with macrocolumns defined by their local parameters within specific regions and with parameterized endogenous inter-regional and exogenous external connectivities. Tools of financial risk management, developed to process correlated multivariate systems with differing non-Gaussian distributions using modern copula analysis, importance-sampled using ASA, will enable bona fide correlations and uncertainties of success and failure to be calculated.


REFERENCES

Aletti, G., Naldi, G. & Toscani, G. (2006) First-order continuous models of opinion formation. Report. U Milano. [URL http://lanl.arxiv.org/abs/cond-mat/0605092]
Allen, J. (2004) Commander's automated decision support tools. Report. DARPA. [URL http://www.darpa.mil/ato/solicit/IBC/allen.ppt]
Blanco, C. (2005) Financial Risk Management: Beyond Normality, Volatility and Correlations. Financial Economics Network, Waltham, MA. [URL http://www.fenews.com/fen46/front-sr/blanco/blanco.html]
Borghesi, C. & Galam, S. (2006) Chaotic, staggered and polarized dynamics in opinion forming: the contrarian effect. Report. Service de Physique de l'Etat Condens. [URL http://lanl.arxiv.org/abs/physics/0605150]
Graham, R. (1977) Covariant formulation of nonequilibrium statistical thermodynamics. Zeitschrift für Physik. B26, 397-405.
Ingber, L. (1983) Statistical mechanics of neocortical interactions. Dynamics of synaptic modification. Physical Review A. 28, 395-416. [URL http://www.ingber.com/smni83_dynamics.pdf]
Ingber, L. (1985) Statistical mechanics of neocortical interactions: Stability and duration of the 7±2 rule of short-term-memory capacity. Physical Review A. 31, 1183-1186. [URL http://www.ingber.com/smni85_stm.pdf]
Ingber, L. (1989) Very fast simulated re-annealing. Mathematical Computer Modelling. 12(8), 967-973. [URL http://www.ingber.com/asa89_vfsr.pdf]
Ingber, L. (1992) Generic mesoscopic neural networks based on statistical mechanics of neocortical interactions. Physical Review A. 45(4), R2183-R2186. [URL http://www.ingber.com/smni92_mnn.pdf]
Ingber, L. (1993) Adaptive Simulated Annealing (ASA). Global optimization C-code. Caltech Alumni Association. [URL http://www.ingber.com/#ASACODE]

Ingber, L. (1994) Statistical mechanics of neocortical interactions: Path-integral evolution of short-term memory. Physical Review E. 49(5B), 4652-4664. [URL http://www.ingber.com/smni94_stm.pdf]
Ingber, L. (1995) Statistical mechanics of multiple scales of neocortical interactions, In: Neocortical Dynamics and Human EEG Rhythms, ed. P.L. Nunez. Oxford University Press, 628-681. [ISBN 0-19-505728-7. URL http://www.ingber.com/smni95_scales.pdf]
Ingber, L. (1997) Statistical mechanics of neocortical interactions: Applications of canonical momenta indicators to electroencephalography. Physical Review E. 55(4), 4578-4593. [URL http://www.ingber.com/smni97_cmi.pdf]
Ingber, L. (2006) Ideas by statistical mechanics (ISM). Report 2006:ISM. Lester Ingber Research. [URL http://www.ingber.com/smni06_ism.pdf]
Ingber, L. & Nunez, P.L. (1995) Statistical mechanics of neocortical interactions: High resolution path-integral calculation of short-term memory. Physical Review E. 51(5), 5074-5083. [URL http://www.ingber.com/smni95_stm.pdf]
Ingber, L. & Rosen, B. (1992) Genetic algorithms and very fast simulated reannealing: A comparison. Mathematical Computer Modelling. 16(11), 87-100. [URL http://www.ingber.com/asa92_saga.pdf]
Langouche, F., Roekaerts, D. & Tirapegui, E. (1982) Functional Integration and Semiclassical Expansions. Reidel, Dordrecht, The Netherlands.
Sen, P. (2006) Complexities of social networks: A physicist's perspective. Report. U Calcutta. [URL http://lanl.arxiv.org/abs/physics/0605072]
Situngkir, H. (2004) On selfish memes: Culture as complex adaptive system. Journal Social Complexity. 2(1), 20-32. [URL http://cogprints.org/3471/]
Thurtle, P.S. (2006) "The G Files": Linking "The Selfish Gene" And "The Thinking Reed". Stanford Presidential Lectures and Symposia in the Humanities and Arts. Stanford U. [URL http://prelectur.stanford.edu/lecturers/gould/commentary/thurtle.html]



KEY TERMS

Copula Analysis: This transforms non-Gaussian probability distributions to a common appropriate space (usually a Gaussian space) where it makes sense to calculate correlations as second moments.

DIME: Represents diplomatic, information, military, and economic aspects of information that must be merged into a coherent pattern.

Global Optimization: Refers to a collection of algorithms used to statistically sample a space of parameters or variables to optimize a system, but also often used to sample a huge space for information. There are many variants, including simulated annealing, genetic algorithms, ant colony optimization, hill-climbing, etc.

ISM: An acronym for Ideas by Statistical Mechanics, in the context of the noun defined as: A belief (or system of beliefs) accepted as authoritative by some group or school. A doctrine or theory; especially, a wild or visionary theory. A distinctive doctrine, theory, system, or practice.

Meme: Alludes to a technology originally defined to explain social evolution, which has been refined to mean a gene-like analytic tool to study cultural evolution.

Memory: This may have many forms and mechanisms. Here, two major processes of neocortical memory are used for AI technologies, short-term memory (STM) and long-term memory (LTM).

Simulated Annealing (SA): A class of algorithms for sampling a huge space, which has a mathematical proof of convergence to global optimal minima. Most SA algorithms applied to most systems do not fully take advantage of this proof, but the proof often is useful to give confidence that the system will avoid getting stuck for a long time in local optimal regions.

Statistical Mechanics: A branch of mathematical physics dealing with systems with a large number of states. Applications of nonequilibrium nonlinear statistical mechanics are now common in many fields, ranging from physical and biological sciences, to finance, to computer science, etc.

AI Methods for Analyzing Microarray Data

Amira Djebbari
National Research Council Canada, Canada

Aedín C. Culhane
Harvard School of Public Health, USA

Alice J. Armstrong
The George Washington University, USA

John Quackenbush
Harvard School of Public Health, USA

INTRODUCTION

Biological systems can be viewed as information management systems, with a basic instruction set stored in each cell’s DNA as “genes.” For most genes, their information is enabled when they are transcribed into RNA which is subsequently translated into the proteins that form much of a cell’s machinery. Although details of the process for individual genes are known, more complex interactions between elements are yet to be discovered. What we do know is that diseases can result if there are changes in the genes themselves, in the proteins they encode, or if RNAs or proteins are made at the wrong time or in the wrong quantities. Recent advances in biotechnology led to the development of DNA microarrays, which quantitatively measure the expression of thousands of genes simultaneously and provide a snapshot of a cell’s response to a particular condition. Finding patterns of gene expression that provide insight into biological endpoints offers great opportunities for revolutionizing diagnostic and prognostic medicine and providing mechanistic insight in data-driven research in the life sciences, an area with a great need for advances, given the urgency associated with diseases. However, microarray data analysis presents a number of challenges, from noisy data to the curse of dimensionality (large number of features, small number of instances) to problems with no clear solutions (e.g. real world mappings of genes to traits or diseases that are not yet known). Finding patterns of gene expression in microarray data poses problems of class discovery, comparison, prediction, and network analysis which are often approached with AI methods. Many of these methods have been successfully applied to microarray data analysis in a variety of applications ranging from clustering of yeast gene expression patterns (Eisen et al., 1998) to classification of different types of leukemia (Golub et al., 1999). Unsupervised learning methods (e.g. hierarchical clustering) explore clusters in data and have been used for class discovery of distinct forms of diffuse large B-cell lymphoma (Alizadeh et al., 2000). Supervised learning methods (e.g. artificial neural networks) utilize a previously determined mapping between biological samples and classes (i.e. labels) to generate models for class prediction. A k-nearest neighbor (k-NN) approach was used to train a gene expression classifier of different forms of brain tumors, and its predictions were able to distinguish biopsy samples with different prognoses, suggesting that microarray profiles can predict clinical outcome and direct treatment (Nutt et al., 2003). Bayesian networks constructed from microarray data hold promise for elucidating the underlying biological mechanisms of disease (Friedman et al., 2000).

BACKGROUND

Cells dynamically respond to their environment by changing the set and concentrations of active genes by altering the associated RNA expression. Thus “gene expression” is one of the main determinants of a cell’s state, or phenotype. For example, we can investigate the differences between a normal cell and a cancer cell by examining their relative gene expression profiles. Microarrays quantify gene expression levels in various conditions (such as disease vs. normal) or across time points. For n genes and m instances (biological



Table 1. Some public online repositories of microarray data

Name of the repository                                          URL
ArrayExpress at the European Bioinformatics Institute           http://www.ebi.ac.uk/arrayexpress/
Gene Expression Omnibus at the National Institutes of Health    http://www.ncbi.nlm.nih.gov/geo/
Stanford microarray database                                    http://smd.stanford.edu/
Oncomine                                                        http://www.oncomine.org/main/index.jsp

samples), microarray measurements are stored in an n by m matrix where each row is a gene, each column is a sample and each element in the matrix is the expression level of a gene in a biological sample, where samples are instances and genes are features describing those instances. Microarray data is available through many public online repositories (Table 1). In addition, the Kent-Ridge repository (http://sdmc.i2r.a-star.edu.sg/rp/) contains pre-formatted data ready to use with the well-known machine learning tool Weka (Witten & Frank, 2000). Microarray data presents some unique challenges for AI such as a severe case of the curse of dimensionality due to the scarcity of biological samples (instances). Microarray studies typically measure tens of thousands of genes in only tens of samples. This low case to variable ratio increases the risk of detecting spurious relationships. This problem is exacerbated because microarray data contains multiple sources of within-class variability, both technical and biological. The high levels of variance and low sample size make feature selection difficult. Testing thousands of genes creates a multiple testing problem, which can result in underestimating the number of false positives. Given data with these limitations, constructing models becomes under-determined and therefore prone to over-fitting. From biology, it is also clear that genes do not act independently. Genes interact in the form of pathways or gene regulatory networks. For this reason, we need models that can be interpreted in the context of pathways. Researchers have successfully applied AI methods to microarray data preprocessing, clustering, feature selection, classification, and network analysis.
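The multiple testing problem mentioned above can be made concrete: if 20,000 genes with no real effect are each tested at a significance level of 0.05, about 1,000 are expected to pass by chance. One common remedy, not named in this article and shown here only as an illustrative assumption, is the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The p-values below are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of tests declared significant by the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / m:
            cutoff_rank = rank          # largest rank that passes its threshold
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for 10 genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.9]

naive = [i for i, v in enumerate(p) if v <= 0.05]   # per-gene threshold only
bh = benjamini_hochberg(p, fdr=0.05)                # corrected for multiplicity
```

With these values the uncorrected threshold keeps five genes, while the corrected procedure keeps only the two with the smallest p-values.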

MINING MICROARRAY DATA: CURRENT TECHNIQUES, CHALLENGES AND OPPORTUNITIES FOR AI

Data Preprocessing

After obtaining microarray data, normalization is performed to account for systematic measurement biases and to facilitate between-sample comparisons (Quackenbush, 2002). Microarray data may contain missing values that may be replaced by mean replacement or k-NN imputation (Troyanskaya et al., 2001).
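The two imputation strategies can be sketched as follows. This is a simplified illustration, not the published algorithm of Troyanskaya et al.: it assumes missing values are marked with None, uses an unweighted average of the k nearest genes, and the function names and the small matrix are hypothetical.

```python
def row_mean_impute(matrix):
    """Replace each missing value (None) by the mean of its gene's observed values."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        row_mean = sum(observed) / len(observed)
        filled.append([row_mean if v is None else v for v in row])
    return filled


def knn_impute(matrix, k=2):
    """Fill each missing entry with the average of the k most similar genes."""
    def distance(a, b):
        # Mean squared difference over samples observed in both genes.
        pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not pairs:
            return float("inf")
        return sum((x - y) ** 2 for x, y in pairs) / len(pairs)

    filled = [list(row) for row in matrix]
    for g, row in enumerate(matrix):
        for s, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other genes that were measured in sample s.
            candidates = [(distance(row, other), other[s])
                          for h, other in enumerate(matrix)
                          if h != g and other[s] is not None]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            filled[g][s] = sum(nearest) / len(nearest)
    return filled


# Hypothetical 4 genes x 4 samples; gene 0 is missing its value in sample 2.
expr = [
    [1.0, 2.0, None, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [0.9, 1.9, 2.9, 3.9],
    [9.0, 8.0, 7.0, 6.0],
]
mean_filled = row_mean_impute(expr)
knn_filled = knn_impute(expr, k=2)
```

In this example the k-NN estimate borrows from the two genes whose profiles track gene 0 closely, whereas mean replacement ignores the trend across samples.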

Feature Selection

The goal of feature selection is to find genes (features) that best distinguish groups of instances (e.g. disease vs. normal) to reduce the dimensionality of the dataset. Several statistical methods including t-test, significance analysis of microarrays (SAM) (Tusher et al., 2001), and analysis of variance (ANOVA) have been applied to select features from microarray data. In classification experiments, feature selection methods generally aim to identify relevant gene subsets to construct a classifier with good performance (Inza et al., 2004). Features are considered to be relevant when they can affect the class; the strongly relevant are indispensable to prediction and the weakly relevant may only sometimes contribute to prediction. Filter methods evaluate feature subsets regardless of the specific learning algorithm used. The statistical methods for feature selection discussed above as well as rankers like information gain rankers are filters for the features to be included. These methods ignore the fact that there may be redundant features (features that are highly correlated with each other and as such one can be used to replace the other) and so do not seek to find a set of features which could perform similarly with fewer variables while retaining the same predictive power (Guyon & Elisseeff, 2003). For this reason multivariate methods are more appropriate. As an alternative, wrappers consider the learning algorithm as a black-box and use prediction accuracy to evaluate feature subsets (Kohavi & John, 1997). Wrappers are more direct than filter methods but depend on the particular learning algorithm used. The computational complexity associated with wrappers is prohibitive due to the curse of dimensionality, so typically filters are used with forward selection (starting with an empty set and adding features one by one) instead of backward elimination (starting with all features and removing them one by one). Dimension reduction approaches are also used for multivariate feature selection.
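A univariate filter of the kind described above can be sketched by ranking genes on a two-sample t-statistic. The choice of Welch's unequal-variance form, the function names, and the data are assumptions made for illustration; note that, as the text points out, such a ranking does nothing about redundancy among the selected genes.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5


def rank_genes(matrix, labels, top=2):
    """Rank genes (rows) by |t| between the two classes in labels (0 or 1)."""
    scored = []
    for g, row in enumerate(matrix):
        group0 = [v for v, y in zip(row, labels) if y == 0]
        group1 = [v for v, y in zip(row, labels) if y == 1]
        scored.append((abs(welch_t(group0, group1)), g))
    scored.sort(reverse=True)
    return [g for _, g in scored[:top]]


# Hypothetical data: 4 genes x 6 samples, three normal (0) and three disease (1).
labels = [0, 0, 0, 1, 1, 1]
expr = [
    [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],   # uninformative
    [2.0, 2.1, 1.9, 6.0, 6.2, 5.9],   # strongly up in disease
    [7.0, 6.5, 7.4, 6.9, 7.2, 6.6],   # uninformative
    [4.0, 4.3, 3.8, 2.9, 3.1, 2.7],   # down in disease
]
selected = rank_genes(expr, labels, top=2)
```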

Dimension Reduction Approaches

Principal component analysis (PCA) is widely used for dimension reduction in machine learning (Wall et al., 2003). The idea behind PCA is quite intuitive: correlated objects can be combined to reduce data “dimensionality”. Relationships between gene expression profiles in a data matrix can be expressed as a linear combination such that collinear variables are regressed onto a new set of coordinates. PCA, its underlying method Singular Value Decomposition (SVD), related approaches such as correspondence analysis (COA), and multidimensional scaling (MDS) have been applied to microarray data and are reviewed by Brazma & Culhane (2005). Studies have reported that COA or other dual scaling dimension reduction approaches such as spectral map analysis may be more appropriate than PCA for decomposition of microarray data (Wouters et al., 2003). While PCA considers the variance of the whole dataset, clustering approaches examine the pairwise distance between instances or features. Therefore, these methods are complementary and are often both used in exploratory data analysis. However, difficulties in interpreting the results in terms of discrete genes limit the application of these methods.
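The intuition that correlated variables can be combined into a single coordinate can be sketched for the first principal component. To stay dependency-free this example uses power iteration on the covariance matrix rather than a full SVD, which is a substitution made here for illustration; the data are hypothetical, and rows are samples (the transpose of the gene-by-sample layout used elsewhere in this article).

```python
def first_principal_component(samples, n_iter=200):
    """Leading principal axis of row-wise samples, found by power iteration."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]          # keep the axis at unit length
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores


# Hypothetical samples with three genes; gene 1 is exactly twice gene 0.
samples = [
    [1.0, 2.0, 0.1],
    [2.0, 4.0, -0.1],
    [3.0, 6.0, 0.1],
    [4.0, 8.0, -0.1],
]
axis, scores = first_principal_component(samples)
```

The two collinear genes enter the leading axis in a fixed 1:2 ratio, so a single score per sample summarizes both of them.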

Clustering

What we see as one disease is often a collection of disease subtypes. Class discovery aims to discover these subtypes by finding groups of instances with similar expression patterns. Hierarchical clustering is an agglomerative method which starts with each data point as a singleton cluster and groups similar data points using some distance measure such that two data points that are most similar are grouped together in a cluster by making them children of a parent node in the tree. This process is repeated in a bottom-up fashion until all data points belong to a single cluster (corresponding to the root of the tree). Hierarchical and other clustering approaches, including K-means, have been applied to microarray data (Causton et al., 2003). Hierarchical clustering was applied to study gene expression in samples from patients with diffuse large B-cell lymphoma (DLBCL) resulting in the discovery of two subtypes of the disease. These groups were found by analyzing microarray data from biopsy samples of patients who had not been previously treated. These patients continued to be studied after chemotherapy, and researchers found that the two newly discovered disease subtypes had different survival rates, confirming the hypothesis that the subtypes had significantly different pathologies (Alizadeh et al., 2000). While clustering simply groups the given data based on pair-wise distances, when information is known a priori about some or all of the data, i.e. labels, a supervised approach can be used to obtain a classifier that can predict the label of new instances.
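The bottom-up procedure described above can be sketched in a few lines. The choices of Euclidean distance and average linkage are assumptions (the text says only "some distance measure"), and the profiles are hypothetical.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(points):
    """Average-linkage agglomerative clustering; returns the merge history."""
    clusters = {i: [i] for i in range(len(points))}   # every point starts alone
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average pairwise distance between members of the two clusters.
                dist = sum(euclidean(points[i], points[j])
                           for i in clusters[a] for j in clusters[b])
                dist /= len(clusters[a]) * len(clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)   # new parent node
        merges.append((a, b, next_id, dist))
        next_id += 1
    return merges


# Hypothetical expression profiles for five samples (two genes each).
profiles = [(1.0, 1.1), (1.2, 0.9), (5.0, 5.2), (5.1, 4.9), (9.0, 0.5)]
history = agglomerate(profiles)
```

Each entry of the history records the two children, the identifier of the new parent node, and the distance at which they were joined; the last entry corresponds to the root of the tree.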

Classification (Supervised Learning)

The large dimensionality of microarray data means that all classification methods are susceptible to over-fitting. Several supervised approaches have been applied to microarray data including Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), and k-NNs among others (Hastie et al., 2001). A very challenging and clinically relevant problem is the accurate diagnosis of the primary origin of metastatic tumors. Bloom et al. (2004) applied ANNs to the microarray data of 21 tumor types with 88% accuracy to predict the primary site of origin of metastatic cancers with unknown origin. A classification accuracy of 84% was obtained on an independent test set with important implications for diagnosing cancer origin and directing therapy. In a comparison of different SVM approaches, multicategory SVMs were reported to outperform other popular machine learning algorithms such as k-NNs and ANNs (Statnikov et al., 2005) when applied to 11 publicly available microarray datasets related to cancer.



It is worth noting that feature selection can significantly improve classification performance.

Cross-Validation

Cross-validation (CV) is appropriate in microarray studies which are often limited by the number of instances (e.g. patient samples). In k-fold CV, the training set is divided into k subsets of equal size. In each iteration k-1 subsets are used for training and one subset is used for testing. This process is repeated k times and the mean accuracy is reported. Unfortunately, some published studies have applied CV only partially, by applying CV on the creation of the prediction rule while excluding feature selection. This introduces a bias in the estimated error rates and over-estimates the classification accuracy (Simon et al., 2003). As a consequence, results from many studies are controversial due to methodological flaws (Dupuy & Simon, 2007). Therefore, models must be evaluated carefully to prevent selection bias (Ambroise & McLachlan, 2002). Nested CV is recommended, with an inner CV loop to perform the tuning of the parameters and an outer CV to compute an estimate of the error (Varma & Simon, 2006). Several studies which have examined similar biological problems have reported poor overlap in gene expression signatures. Brenton et al. (2005) compared two gene lists predictive of breast cancer prognosis and found only 3 genes in common. Even though the intersection of specific gene lists is poor, the highly correlated nature of microarray data means that many gene lists may have similar prediction accuracy (Ein-Dor et al., 2004). Gene signatures identified from different breast cancer studies with few genes in common were shown to have comparable success in predicting patient survival (Buyse et al., 2006). Commonly used supervised learning algorithms yield black box models prompting the need for interpretable models that provide insights about the underlying biological mechanism that produced the data.
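The selection bias discussed above is avoided when feature selection is repeated inside every training fold, so that the held-out samples never influence which genes are chosen. The sketch below illustrates that structure only; the mean-difference filter, the nearest-centroid classifier, the deterministic fold assignment, and the data are all assumptions, and here rows are samples and columns are genes.

```python
def select_genes(X, y, top):
    """Rank features by absolute difference of class means (a simple filter)."""
    d = len(X[0])
    scores = []
    for j in range(d):
        m0 = [row[j] for row, lab in zip(X, y) if lab == 0]
        m1 = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append((abs(sum(m0) / len(m0) - sum(m1) / len(m1)), j))
    return [j for _, j in sorted(scores, reverse=True)[:top]]


def nearest_centroid_predict(X_train, y_train, x, genes):
    """Assign x to the class whose centroid (on the selected genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in genes]
        dist = sum((x[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cross_validate(X, y, k=3, top=1):
    """k-fold CV in which gene selection is repeated inside every training fold."""
    n = len(X)
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        genes = select_genes(X_tr, y_tr, top)      # uses training samples only
        for i in test_idx:
            if nearest_centroid_predict(X_tr, y_tr, X[i], genes) == y[i]:
                correct += 1
    return correct / n


# Hypothetical data: 6 samples x 3 genes; only gene 0 separates the classes.
X = [
    [1.0, 5.0, 3.0],
    [4.0, 5.1, 2.9],
    [1.2, 4.9, 3.2],
    [3.8, 5.2, 3.1],
    [0.9, 5.0, 2.8],
    [4.1, 4.8, 3.0],
]
y = [0, 1, 0, 1, 0, 1]
accuracy = cross_validate(X, y, k=3, top=1)
```

The flawed alternative would call `select_genes` once on all six samples before the loop; with tens of thousands of genes and few samples, that shortcut is what inflates the reported accuracy.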

Network Analysis

Bayesian networks (BNs), derived from an alliance between graph theory and probability theory, can capture dependencies among many variables (Pearl, 1988; Heckerman, 1996).

Friedman et al. (2000) introduced a multinomial model framework for BNs to reverse-engineer networks and showed that this method differs from clustering in that it can discover gene interactions other than correlation when applied to yeast gene expression data. Spirtes et al. (2002) highlight some of the difficulties of applying this approach to microarray data. Nevertheless, many extensions of this research direction have been explored. Correlation is not necessarily a good predictor of interactions, and weak interactions are essential to understand disease progression. Identifying the biologically meaningful interactions from the spurious ones is challenging, and BNs are particularly well-suited for modeling stochastic biological processes. The exponential growth of data produced by microarray technology as well as other high-throughput data (e.g. protein-protein interactions) calls for novel AI approaches as the paradigm shifts from a reductionist to a mechanistic systems view in the life sciences.

FUTURE TRENDS

Uncovering the underlying biological mechanisms that generate these data is harder than prediction and has the potential to have far reaching implications for understanding disease etiologies. Time series analysis (Bar-Joseph, 2004) is a first step to understanding the dynamics of gene regulation, but, eventually, we need to use the technology not only to observe gene expression data but also to direct intervention experiments (Pe’er et al., 2001; Yoo et al., 2002) and develop methods to investigate the fundamental problem of distinguishing correlation from causation.

CONCLUSION
We have reviewed AI methods for pre-processing, clustering, feature selection, classification and mechanistic analysis of microarray data. The clusters, gene lists, molecular fingerprints and network hypotheses produced by these approaches have already shown impact, from discovering new disease subtypes and biological markers, to predicting clinical outcome for directing treatment, to unraveling gene networks. From the AI perspective, this field offers challenging problems and may have a tremendous impact on biology and medicine.


REFERENCES
Alizadeh A.A., Eisen M.B., Davis R.E., Ma C., Lossos I.S., Rosenwald A., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769), 503-11.
Ambroise C., & McLachlan G.J. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences, 99(10), 6562-6.
Bar-Joseph Z. (2004). Analyzing time series gene expression data. Bioinformatics, 20(16), 2493-503.
Bloom G., Yang I.V., Boulware D., Kwong K.Y., Coppola D., Eschrich S., et al. (2004). Multi-platform, multi-site, microarray-based human tumor classification. American Journal of Pathology, 164(1), 9-16.
Brazma A., & Culhane A.C. (2005). Algorithms for gene expression analysis. In Jorde L.B., Little P.F.R., Dunn M.J., Subramaniam S. (Eds.) Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics (3148-3159). London: John Wiley & Sons.
Brenton J.D., Carey L.A., Ahmed A.A., & Caldas C. (2005). Molecular classification and molecular forecasting of breast cancer: ready for clinical application? Journal of Clinical Oncology, 23(29), 7350-60.
Buyse M., Loi S., Van’t Veer L., Viale G., Delorenzi M., Glas A.M., et al. (2006). Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. Journal of the National Cancer Institute, 98, 1183-92.
Causton H.C., Quackenbush J., & Brazma A. (2003). Microarray Gene Expression Data Analysis: A Beginner’s Guide. Oxford: Blackwell Science Limited.
Dupuy A., & Simon R.M. (2007). Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. Journal of the National Cancer Institute, 99(2), 147-57.
Ein-Dor L., Kela I., Getz G., Givol D., & Domany E. (2004). Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2), 171-8.
Eisen M.B., Spellman P.T., Brown P.O., & Botstein D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95, 14863-14868.
Friedman N., Linial M., Nachman I., & Pe’er D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4), 601-20.
Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., et al. (1999). Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439), 531.
Guyon I., & Elisseeff A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.
Hastie T., Tibshirani R., & Friedman J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer Series in Statistics.
Heckerman D. (1996). A Tutorial on Learning with Bayesian Networks. Technical Report MSR-TR-95-06. Microsoft Research.
Inza I., Larrañaga P., Blanco R., & Cerrolaza A.J. (2004). Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine, special issue in “Data mining in genomics and proteomics”, 31(2), 91-103.
Kohavi R., & John G.H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2), 273-324.
Nutt C.L., Mani D.R., Betensky R.A., Tamayo P., Cairncross J.G., Ladd C., et al. (2003). Gene Expression-based Classification of Malignant Gliomas Correlates Better with Survival than Histological Classification. Cancer Research, 63, 1602-1607.
Pe’er D., Regev A., Elidan G., & Friedman N. (2001). Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17 S1, S215-24.
Pearl J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo: Morgan Kaufmann Publishers.
Quackenbush J. (2002). Microarray data normalization and transformation. Nature Genetics, 32, 496–501.
Quackenbush J. (2006). Microarray Analysis and Tumor Classification. The New England Journal of Medicine, 354(23), 2463-72.

Simon R., Radmacher M.D., Dobbin K., & McShane L.M. (2003). Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute, 95(1), 14-8.
Spirtes P., Glymour C., Scheines R., Kauffman S., Aimale V., & Wimberly F. (2001). Constructing Bayesian Network Models of Gene Expression Networks from Microarray Data. Proceedings of the Atlantic Symposium on Computational Biology, Genome Information Systems and Technology.
Statnikov A., Aliferis C.F., Tsamardinos I., Hardin D., & Levy S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5), 631-643.
Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T., Tibshirani R., et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520-5.
Tusher V.G., Tibshirani R., & Chu G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9), 5116-5121.
Varma S., & Simon R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91.
Wall M., Rechtsteiner A., & Rocha L. (2003). Singular value decomposition and principal component analysis. In D.P. Berrar, W. Dubitzky, M. Granzow (Eds.) A Practical Approach to Microarray Data Analysis (91-109). Norwell: Kluwer.
Witten I.H., & Frank E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers Inc.
Wouters L., Gohlmann H.W., Bijnens L., Kass S.U., Molenberghs G., & Lewi P.J. (2003). Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics, 59, 1131-1139.
Yoo C., Thorsson V., & Cooper G.F. (2002). Discovery of causal relationships in a gene-regulation pathway from a mixture of experimental and observational DNA microarray data. Biocomputing: Proceedings of the Pacific Symposium, 7, 498-509.

KEY TERMS
Curse of Dimensionality: A situation where the number of features (genes) is much larger than the number of instances (biological samples), known in statistics as the p >> n problem.
Feature Selection: The problem of finding a subset (or subsets) of features so as to improve the performance of learning algorithms.
Microarray: An experimental assay that measures the abundances of mRNA (the intermediary between DNA and proteins) corresponding to gene expression levels in biological samples.
Multiple Testing Problem: A problem that occurs when a large number of hypotheses are tested simultaneously using a user-defined α cut-off p-value, which may lead to rejecting a non-negligible number of null hypotheses by chance.
Over-Fitting: A situation where a model learns spurious relationships and, as a result, can predict training data labels but cannot generalize to predict future data.
Supervised Learning: A learning algorithm that is given a training set consisting of feature vectors associated with class labels and whose goal is to learn a classifier that can predict the class labels of future instances.
Unsupervised Learning: A learning algorithm that tries to identify clusters based on similarity between features, between instances, or both, without taking into account any prior knowledge.

An AI Walk from Pharmacokinetics to Marketing José D. Martín-Guerrero University of Valencia, Spain Emilio Soria-Olivas University of Valencia, Spain Paulo J.G. Lisboa Liverpool John Moores University, UK Antonio J. Serrano-López University of Valencia, Spain

INTRODUCTION

This work is intended to provide a review of real-life practical applications of Artificial Intelligence (AI) methods. We focus on the use of Machine Learning (ML) methods applied to real problems rather than to synthetic problems with standard and controlled environments. In particular, we will describe the following problems in the next sections:

• Optimization of Erythropoietin (EPO) dosages in anaemic patients with Chronic Renal Failure (CRF).
• Optimization of a recommender system for citizen web portal users.
• Optimization of a marketing campaign.

These problems were chosen for their relevance and their heterogeneity. This heterogeneity shows the capabilities and versatility of ML methods to solve real-life problems in very different fields of knowledge. The following methods will be mentioned during this work:

• Artificial Neural Networks (ANNs): Multilayer Perceptron (MLP), Finite Impulse Response (FIR) Neural Network, Elman Network, Self-Organizing Maps (SOMs) and Adaptive Resonance Theory (ART).
• Other clustering algorithms: K-Means, Expectation-Maximization (EM) algorithm, Fuzzy C-Means (FCM), Hierarchical Clustering Algorithms (HCA).
• Generalized Auto-Regressive Conditional Heteroskedasticity (GARCH).
• Support Vector Regression (SVR).
• Collaborative filtering techniques.
• Reinforcement Learning (RL) methods.

BACKGROUND The aim of this communication is to emphasize the capabilities of ML methods to deliver practical and effective solutions in difficult real-world applications. In order to make the work easy to read we focus on each of the three separate domains, namely, Pharmacokinetics (PK), Web Recommender Systems and Marketing.

Pharmacokinetics
Clinical decision-making support systems have used Artificial Intelligence (AI) methods since the end of the fifties. Nevertheless, it was only during the nineties that decision support systems were routinely used in clinical practice on a significant scale. In particular, ANNs have been widely used in medical applications over the last two decades (Lisboa, 2002). One of the first relevant studies involving ANNs and Therapeutic Drug Monitoring was (Gray, Ash, Jacobi, & Michel, 1991). In this work, an ANN-based drug interaction warning system was developed with a computerized real-time entry medical records system. A reference work in this field is found in (Brier, Zurada, & Aronoff, 1995), in which the capabilities of ANNs and NONMEM are benchmarked.



Focusing on problems that are closer to the real-life application described in the next section, there are also a number of recent works involving the use of ML for drug delivery in kidney disease. For instance, a comparison of renal-related adverse drug reactions between rofecoxib and celecoxib, based on the WHO/Uppsala Monitoring Centre safety database, was carried out by (Zhao, Reynolds, Lefkowith, Whelton, & Arellano, 2001). Disproportionality in the association between a particular drug and renal-related adverse drug reactions was evaluated using a Bayesian confidence propagation neural network method. A study of prediction of cyclosporine dosage in patients after kidney transplantation using neural networks and kernel-based methods was carried out in (Camps et al., 2003). In (Gaweda, Jacobs, Brier, & Zurada, 2003), a pharmacodynamic population analysis in CRF patients using ANNs was performed. Such models allow for adjusting the dosing regime. Finally, in (Martín et al., 2003), the use of neural networks was proposed for the optimization of EPO dosage in patients with anaemia associated with CRF.

Web Recommender Systems
Recommender systems are widely used in web sites, including Google. The main goal of these systems is to recommend objects in which a user might be interested. Two main approaches have been used: content-based and collaborative filtering (Zukerman & Albrecht, 2001), although other kinds of techniques have also been proposed (Burke, 2002). Collaborative recommenders aggregate ratings or recommendations of objects, find user similarities based on their ratings, and finally provide new recommendations based on inter-user comparisons. Some of the most relevant systems using this technique are GroupLens/NetPerceptions and Recommender. The main advantage of collaborative techniques is that they are independent of any machine-readable representation of the objects, and that they work well for complex objects where subjective judgements are responsible for much of the variation in preferences. Content-based learning is used when a user’s past behaviour is a reliable indicator of his/her future behaviour. It is particularly suitable for situations in which users tend to exhibit idiosyncratic behaviour. However, this approach requires a system to collect relatively large amounts of data from each user in order to enable the formulation of a statistical model. Examples of systems of this kind are text recommendation systems like the newsgroup filtering system NewsWeeder, which uses words from its texts as features.
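The collaborative approach described above (aggregate ratings, compare users, recommend from similar users) can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity over co-rated objects is one common choice rather than the method of any system named above, and the users, objects and ratings are hypothetical.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity computed over the objects both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(ratings, user, top_n=2):
    """Rank objects the user has not rated by similarity-weighted mean rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]

# Hypothetical ratings (1 = low interest, 5 = high interest).
ratings = {
    "ann": {"news": 5, "taxes": 1},
    "bob": {"news": 5, "taxes": 1, "sports": 5, "permits": 1},
    "cat": {"news": 1, "taxes": 5, "sports": 1, "permits": 5},
}
print(recommend(ratings, "ann"))
```

Because "ann" rates the shared objects the way "bob" does, bob's preferences dominate the weighted scores and "sports" is ranked ahead of "permits". Note that no description of the objects themselves is needed, which is the independence from machine-readable representations mentioned above.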

Marketing
The latest marketing trends are more concerned with maintaining current customers and optimizing their behaviour than with getting new ones. For this reason, relational marketing focuses on what a company must do to achieve this objective. The relationships between a company and its customers follow a sequence of action-response steps, in which the customers can modify their behaviour in accordance with the marketing actions developed by the company. The development of a good and individualized policy is not easy because there are many variables to take into account. Applications of this kind can be viewed as a Markov chain problem, in which a company decides what action to take once the customer properties in the current state (time t) are known. Reinforcement Learning (RL) can be used to solve this task, since previous applications have demonstrated its suitability in this area. In (Sun, 2003), RL was applied to analyse mailing by studying how an action at time t influences actions at subsequent times. In (Abe et al., 2002) and (Pednault, Abe, & Zadrozny, 2002), several RL algorithms were benchmarked in mailing problems. In (Abe et al., 2004), RL was used to optimize cross-channel marketing.

AI CONTRIBUTIONS IN REAL-LIFE APPLICATIONS
The previous section reviewed related work. In this section, we focus on the authors’ experience in using AI to solve real-life problems. In order to show the versatility of AI methods, we focus on particular applications from three different fields of knowledge, the same ones that were reviewed in the previous section.

Pharmacokinetics
Although we have also worked on other pharmacokinetic problems, in this work we focus on what is perhaps the most relevant one: the optimization of EPO dosages in patients within a haemodialysis program. Patients who suffer from CRF tend to suffer from an associated anaemia as well. EPO is the treatment of choice for this kind of anaemia. The use of this drug has greatly reduced cardiovascular problems and the need for multiple transfusions. However, EPO is expensive, making the already costly CRF program even more so. Moreover, there are significant risks associated with EPO, such as thrombo-embolisms and vascular problems, if Haemoglobin (Hb) levels are too high or increase too fast. Consequently, optimizing dosage is critical to ensure adequate pharmacotherapy as well as a reasonable treatment cost.

Population models, widely used by pharmacokinetics researchers, are not suitable for this problem since the response to treatment with EPO is highly dependent on the patient. The same dosages may produce very different responses in different patients, most notably in the so-called EPO-resistant patients, who do not respond to EPO treatment even after receiving high dosages. Therefore, it is preferable to focus on an individualized treatment.

Our first approach to this problem was based on predicting the Hb level given a certain administered dose of EPO. Although the final goal is to individualize EPO doses, we did not predict the EPO dose but the Hb level. The reason is that EPO predictors would model the physician’s protocol, whereas Hb predictors model the body’s response to the treatment, hence being a more “objective” approach. In particular, the following models were used: GARCH (Hamilton, 1994), MLP, FIR neural network, Elman’s recurrent neural network and SVR (Haykin, 1999). Accurate prediction models were obtained, especially when using ANNs and SVR. Dynamic neural networks (i.e., FIR and recurrent) did not notably outperform the static MLP, probably due to the short length of the time series (Martín et al., 2003). An easy-to-use software application was developed for clinicians, in which, after filling in the patient’s data and a certain EPO dose, the predicted Hb level for the next month was shown.

Although the prediction models were accurate, we realized that this prediction approach had a major flaw. We had not yet achieved a straightforward way to transfer the extracted knowledge to daily clinical practice, because clinicians had to “play” with different doses to find the best one to attain a certain Hb level. It would be better to have an automatic model that suggests the actions to be taken in order to attain the targeted range of Hb, rather than this “indirect” approach. This reflection led us to research new models, and we came up with the use of RL (Sutton & Barto, 1998). We are currently working on this topic, but we have already achieved promising results, finding policies (sequences of actions) that appear to be better than those followed in the hospital, i.e., a higher number of patients are within the desired Hb target at the end of the treatment (Martín et al., 2006a).

Web Recommender Systems
A completely different application is described in this subsection, namely, the development of web recommender systems. The authors proposed a new approach to develop recommender systems based on collaborative filtering, which also includes an analysis of the feasibility of the recommender by means of a prediction stage (Martín et al., 2006b). The basic idea was to use clustering algorithms in order to find groups of similar users. The following clustering algorithms were taken into account: K-Means, FCM, HCA, EM algorithm, SOMs and ART. New users were assigned to one of the groups found by these clustering algorithms, and they were then recommended web services that were usually accessed by other users of the same group but had not yet been accessed by the new users themselves (in order to maximize the usefulness of the approach). Using controlled data sets, the study concluded that ART and SOMs behaved very well with data sets of very different characteristics, whereas HCA and EM behaved acceptably provided that the dimensionality of the data set was not too high and the overlap was slight. Algorithms based on K-Means achieved the most limited success in the acceptance of offered recommendations. Even though the use of RL was studied only briefly, it seems to be a suitable choice for this problem, since the internal dynamics of the problem are easily tackled by RL and, moreover, the interference between the recommendation interface and the user can be minimized with an adequate definition of the rewards (Hernández, Gaudioso, & Boticario, 2004).
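The recommendation step described above (assign a new user to a group, then suggest services that the group uses but the user has not yet accessed) can be sketched as follows. The centroids are assumed to come from any of the clustering algorithms listed. Their values, the service names and the function names are hypothetical and are not taken from Martín et al. (2006b).

```python
def nearest_cluster(profile, centroids):
    """Index of the centroid closest (squared Euclidean) to an access profile."""
    dists = [sum((p - c) ** 2 for p, c in zip(profile, centroid))
             for centroid in centroids]
    return dists.index(min(dists))

def recommend_services(profile, centroids, services, top_n=2):
    """Suggest the group's most accessed services that this user has not used."""
    centroid = centroids[nearest_cluster(profile, centroids)]
    unseen = [j for j, used in enumerate(profile) if not used]
    unseen.sort(key=lambda j: centroid[j], reverse=True)
    return [services[j] for j in unseen[:top_n]]

services = ["taxes", "library", "sports", "permits"]
# Each centroid holds the fraction of group members that accessed each service.
centroids = [[0.9, 0.1, 0.2, 0.8], [0.1, 0.9, 0.7, 0.2]]
new_user = [0, 1, 0, 0]  # binary profile: has accessed only "library" so far
print(recommend_services(new_user, centroids, services))
```

Here the new user falls into the second group, so the suggestions are the services that group uses most among those the user has not yet visited.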



Marketing
The last application that will be mentioned in this communication is related to marketing. One way to increase the loyalty of customers is to offer them the opportunity to obtain gifts as a result of their purchases from a certain company. The company can give virtual credits to anyone who buys certain articles, typically those that the company is interested in promoting. After a certain number of purchases, the customers can exchange their virtual credits for the gifts offered by the company. The problem is to establish the appropriate number of virtual credits for each promoted item. In accordance with the company policy, it is expected that the higher the credit assignment, the higher the number of purchases. However, the company’s profits are lower, since the marketing campaign adds an extra cost to the company. The goal is to achieve a trade-off by establishing an optimal policy.

We proposed an RL approach to optimize this marketing campaign. This particular application, whose characteristics are described below, is much more difficult than the other RL approaches to marketing mentioned in the Background section, basically because there are many more different actions that can be taken. The information used for the study corresponds to five months of the campaign, involving 1,264,862 transactions, 1,004 articles and 3,573 customers. RL can deal with intrinsic dynamics and, besides, it has the attractive advantage that it is able to maximize the so-called long-term reward. This is especially relevant in this application, since the company is interested in maximizing the profits at the end of the campaign, and a customer who does not produce much profit in the first months of the campaign may nevertheless make many profitable transactions in the future. Our first results showed that a policy based on RL, instead of the policy followed by the company so far, could even double long-term profits at the end of the campaign (Gómez et al., 2005).
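The preference for long-term over immediate reward can be illustrated with a minimal sketch of tabular Q-learning with epsilon-greedy exploration (Sutton & Barto, 1998). This is not the method of Gómez et al. (2005): the two-state customer model, its rewards and all parameter values are hypothetical and chosen only to keep the example small.

```python
import random

# Hypothetical customer model: state 0 = occasional buyer, state 1 = loyal buyer.
# Actions: 0 = grant no credits, 1 = grant credits.
# TRANSITIONS[state][action] = (next_state, immediate_reward)
TRANSITIONS = {
    0: {0: (0, 1.0), 1: (1, 0.0)},  # credits cost profit now but build loyalty
    1: {0: (0, 3.0), 1: (1, 2.0)},  # dropping credits pays now but loses loyalty
}

def q_learning(steps=50000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values Q[state][action] from simulated interaction."""
    rng = random.Random(seed)
    q = {s: [0.0, 0.0] for s in TRANSITIONS}
    state = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)               # explore a random action
        else:
            action = q[state].index(max(q[state]))  # exploit current knowledge
        next_state, reward = TRANSITIONS[state][action]
        # Move Q towards the immediate reward plus the discounted future value.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q

q = q_learning()
policy = {s: q[s].index(max(q[s])) for s in q}
print(policy)
```

In this toy model, granting credits never gives the higher immediate reward in either state, yet the learned policy grants them in both states, because keeping the customer loyal yields the larger discounted return.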

CONCLUSION AND FUTURE TRENDS
This paper has shown the capabilities and versatility of different AI methods applied to real-life problems, illustrated with three specific applications in different domains. Clearly, the methodology is generic and applies equally well to many other fields, provided that the information contained in the data is sufficiently rich to require non-linear modelling and is capable of supporting a predictive performance that is of practical value. As a future trend, it should be emphasized that AI methods have become increasingly popular for business applications in recent years, challenging classical business models. In the particular case of RL, the commercial potential of this powerful methodology has been significantly underestimated, as it is applied almost exclusively to Robotics. We feel that it is a methodology still to be exploited in many real applications, as we have shown in this paper.

REFERENCES
Abe, N., Pednault, E., Wang, H., Zadrozny, B., Wei, F., & Apte, C. (2002). Empirical comparison of various reinforcement learning strategies for sequential targeted marketing. Proceedings of the ICDM 2002, 315-321.
Abe, N., Verma, N., Schroko, R., & Apte, C. (2004). Cross-channel optimized marketing by reinforcement learning. Proceedings of the KDD 2004, 767-772.
Brier, M. E., Zurada, J. M., & Aronoff, G. R. (1995). Neural network predicted peak and trough gentamicin concentrations. Pharmaceutical Research, 12 (3), 406-412.
Burke, R. (2002). Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12, 331-370.
Camps, G., Porta, B., Soria, E., Martín, J. D., Serrano, A. J., Pérez, J. J., & Jiménez, N. V. (2003). Prediction of cyclosporine dosage in patients after kidney transplantation using neural networks. IEEE Transactions on Biomedical Engineering, 50 (4), 442-448.
Gaweda, A. E., Jacobs, A. A., Brier, M. E., & Zurada, J. M. (2003). Pharmacodynamic population analysis in chronic renal failure using artificial neural networks – a comparative study. Neural Networks, 16 (5-6), 841-845.
Gómez, G., Martín, J. D., Soria, E., Palomares, A., Balaguer, E., Casariego, N., & Paglialunga, D. (2005). An approach based on reinforcement learning and Self-Organizing Maps to design a marketing campaign. Proceedings of the 2nd International Conference on Machine Intelligence ACIDCA-ICMI 2005, 259-265.
Gray, D. L., Ash, S. R., Jacobi, J., & Michel, A. N. (1991). The training and use of an artificial neural network to monitor use of medication in treatment of complex patients. Journal of Clinical Engineering, 16 (4), 331-336.
Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press, Princeton, NJ, USA.
Haykin, S. (1999). Neural Networks (2nd ed.). Prentice Hall, Englewood Cliffs, NJ, USA.
Hernández, F., Gaudioso, E., & Boticario, J. G. (2004). A reinforcement approach to achieve unobtrusive and interactive recommendation systems for web-based communities. Proceedings of Adaptive Hypermedia 2004, 409-412.
Lisboa, P. J. G. (2002). A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Networks, 15 (1), 11-39.
Martín, J. D., Soria, E., Camps, G., Serrano, A. J., Pérez, J. J., & Jiménez, N. V. (2003). Use of neural networks for dosage individualisation of erythropoietin in patients with secondary anemia to chronic renal failure. Computers in Biology and Medicine, 33 (4), 361-373.
Martín, J. D., Soria, E., Chorro, V., Climente, M., & Jiménez, N. V. (2006a). Reinforcement Learning for anemia management in hemodialysis patients treated with erythropoietic stimulating factors. Proceedings of the Workshop “Planning, Learning and Monitoring with uncertainty and dynamic worlds”, European Conference on Artificial Intelligence 2006, 19-24.
Martín, J. D., Palomares, A., Balaguer, E., Soria, E., Gómez, J., & Soriano, A. (2006b). Studying the feasibility of a recommender in a citizen web portal based on user modeling and clustering algorithms. Expert Systems with Applications, 30 (2), 299-312.
Pednault, E., Abe, N., & Zadrozny, B. (2002). Sequential cost-sensitive decision making with reinforcement learning. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2002, 259-268.
Sun, P. (2003). Constructing learning models from data: The dynamic catalog mailing problem. Ph.D. Dissertation, Tsinghua University, China.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA.
Zhao, S. Z., Reynolds, M. W., Lefkowith, J., Whelton, A., & Arellano, F. M. (2001). A comparison of renal-related adverse drug reactions between rofecoxib and celecoxib, based on World Health Organization/Uppsala Monitoring Centre safety database. Clinical Therapeutics, 23 (9), 1478-1491.
Zukerman, I., & Albrecht, D. (2001). Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction, 11, 5-18.

KEY TERMS
Agent: In RL terms, the entity responsible for making decisions according to observations of its environment.
Environment: In RL terms, every condition external to the agent.
Exploration-Exploitation Dilemma: A classical RL dilemma, in which a trade-off solution must be achieved. Exploration means a random search for new actions in order to achieve a likely (but as yet unknown) better reward than all the known ones, while exploitation focuses on using the current knowledge to maximize the reward (greedy approach).
Life-Time Value: A measure widely used in marketing applications that expresses the long-term result to be maximized.
Reward: In RL terms, the immediate reward is the value returned by the environment to the agent depending on the action taken. The long-term reward is the sum of all the immediate rewards throughout a complete decision process.
Sensitivity: Success rate measure analogous to specificity: the ratio of positives that are correctly classified by the model. (Refer to Specificity.)
Specificity: Success rate measure in a classification problem. If there are two classes (namely, positive and negative), specificity measures the ratio of negatives that are correctly classified by the model.

Algorithms for Association Rule Mining Vasudha Bhatnagar University of Delhi, India Anamika Gupta University of Delhi, India Naveen Kumar University of Delhi, India

INTRODUCTION
Association Rule Mining (ARM) is one of the important data mining tasks that has been extensively researched by the data-mining community and has found wide application in industry. An association rule is a pattern that implies co-occurrence of events or items in a database. Knowledge of such relationships in a database can be employed in strategic decision making in both commercial and scientific domains. A typical application of ARM is market basket analysis, where associations between different items are discovered in order to analyze customers’ buying habits. The discovery of such associations can help to develop better marketing strategies. ARM has also been used extensively in other applications, such as spatial-temporal data, health care, bioinformatics and web data (Hipp J., Güntzer U., Nakhaeizadeh G. 2000).

An association rule is an implication of the form X → Y, where X and Y are disjoint sets of attributes/items. An association rule indicates that if a set of items X occurs in a transaction record, then the set of items Y also occurs in the same record. X is called the antecedent of the rule and Y is called the consequent of the rule. Processing massive datasets to discover co-occurring items and generate interesting rules in reasonable time is the objective of all ARM algorithms. The task of discovering co-occurring sets of items cannot be easily accomplished using SQL, as a little reflection will reveal. A ‘Count’ aggregate query requires the condition to be specified in the where clause, so it finds the frequency of only one set of items at a time. In order to find all sets of co-occurring items in a database with n items, the number of queries that need to be written is exponential in n. This is the prime motivation for designing algorithms for efficient discovery of co-occurring sets of items, which are required to find the association rules. In this article we focus on the algorithms for association rule mining (ARM) and the scalability issues in ARM. We assume that the reader is familiar with the motivation and applications of association rule mining.

BACKGROUND
Let I = {i1, i2, …, in} denote a set of items and D denote a database of N transactions. A typical transaction T ∈ D may contain a subset X of the entire set of items I and is associated with a unique identifier TID. An item-set is a set of one or more items, i.e. X is an item-set if X ⊆ I. A k-item-set is an item-set of cardinality k. A transaction T is said to contain an item-set X if X ⊆ T. Support of an item-set X, also called Coverage, is the fraction of transactions that contain X. It denotes the probability that a transaction contains X.

$$\mathrm{Support}(X) = P(X) = \frac{\text{No. of transactions containing } X}{N}$$

An item-set having support greater than the user-specified support threshold (ms) is known as a frequent item-set. An association rule is an implication of the form X → Y [Support, Confidence], where X ⊂ I, Y ⊂ I and X ∩ Y = ∅, and where Support and Confidence are rule evaluation metrics. Support of a rule X → Y in D is ‘S’ if S% of transactions in D contain X ∪ Y. It is computed as:

$$\mathrm{Support}(X \rightarrow Y) = P(X \cup Y) = \frac{\text{No. of transactions containing } X \cup Y}{N}$$



Support indicates the prevalence of a rule. In a typical market basket analysis application, rules with very low support values represent rare events and are likely to be uninteresting or unprofitable. Confidence of a rule measures its strength and provides an indication of the reliability of the prediction made by the rule. A rule X → Y has a confidence ‘C’ in D if C% of the transactions in D that contain X also contain Y. Confidence is computed as the conditional probability of Y occurring in a transaction, given that X is present in the same transaction, i.e.

$$\mathrm{Confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{P(X \cup Y)}{P(X)} = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}$$

A rule generated from frequent item-sets is strong if its confidence is greater than the user specified confidence threshold (mc). Fig. 1 shows an example database of five transactions and shows the computation of support and confidence of a rule. The objective of Association Rule Mining algorithms is to discover the set of strong rules from a given database as per the user specified ms and mc thresholds. Algorithms for ARM essentially perform two distinct tasks: (1) Discover frequent item-sets. (2) Generate strong rules from frequent item-sets. The first task requires counting of item-sets in the database and filtering against the user specified threshold (ms). The second task of generating rules from frequent item-sets is a straightforward process of generating subsets and checking for the strength. We describe below the general approaches for finding frequent item-sets in association rule mining algorithms. The second task is trivial as explained in the last section of the article.
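To make these definitions concrete, the following Python sketch computes support and confidence on the five-transaction example database of Fig. 1. The function names (`support_count`, `support`, `confidence`, `is_strong`) are illustrative choices and do not come from the article. The thresholds are applied with `>=`, which is an assumption: the text speaks of values "greater than" ms and mc, while Exhibit B uses `>=`.

```python
# Example database of Fig. 1: transaction id -> set of items.
DB = {
    1: {"B", "C", "D"},
    2: {"B", "C", "D", "E"},
    3: {"A", "C"},
    4: {"B", "D", "E"},
    5: {"A", "B"},
}


def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for items in db.values() if itemset <= items)


def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, db) / len(db)


def confidence(antecedent, consequent, db):
    """Confidence of antecedent -> consequent.

    Computed from raw counts rather than from two rounded support
    fractions, which keeps the floating-point result exact for small cases.
    """
    both = set(antecedent) | set(consequent)
    return support_count(both, db) / support_count(antecedent, db)


def is_strong(antecedent, consequent, db, ms, mc):
    """True if the rule meets both thresholds (assumed inclusive, see lead-in)."""
    both = set(antecedent) | set(consequent)
    return support(both, db) >= ms and confidence(antecedent, consequent, db) >= mc


if __name__ == "__main__":
    print(support({"B", "D"}, DB))        # support of the rule B -> D
    print(confidence({"B"}, {"D"}, DB))   # confidence of the rule B -> D
    print(is_strong({"B"}, {"D"}, DB, ms=0.40, mc=0.70))
```

For the rule B → D this reproduces the figures quoted in Fig. 1 (support 3/5, confidence 3/4).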

APPROACHES FOR GENERATING FREQUENT ITEM-SETS


If we apply a brute force approach to discover frequent item-sets, the algorithm needs to maintain counters for all 2^n - 1 non-empty item-sets. For the large values of n that are common in the datasets targeted for mining, maintaining such a large number of counters is a daunting task. Even if we assume the availability of such large memory, indexing these counters also presents a challenge. Data mining researchers have developed numerous algorithms for efficient discovery of frequent item-sets. The earlier algorithms for ARM discovered all frequent item-sets. Later it was shown by three independent groups of researchers (Pasquier N., Bastide Y., Taouil R. & Lakhal L. 1999), (Zaki M.J. 2000), (Stumme G., 1999) that it is sufficient to discover frequent closed item-sets (FCI) instead of all frequent item-sets (FI). FCI are the frequent item-sets whose support is not equal to the support of any of their proper supersets. FCI form a reduced, complete and lossless representation of the frequent item-sets. Since FCI are far fewer in number than FI, the computational expense of ARM is drastically reduced. Figure 2 summarizes the different approaches used for ARM. We briefly describe these approaches.

Figure 1. Computation of support and confidence of a rule in an example database

TID   Items
1     B C D
2     B C D E
3     A C
4     B D E
5     A B

Let ms = 40%, mc = 70%. Consider the association rule B → D:
support(B → D) = 3/5 = 60%
confidence(B → D) = support(B ∪ D)/support(B) = 3/4 = 75%
The rule B → D is a strong rule.

Figure 2. Approaches for ARM algorithms

ARM Algorithms
    Frequent Item-sets
        Level-wise
        Tree Based
    Frequent Closed Item-sets
        Level-wise
        Tree Based
        Lattice Based

Discovery of Frequent Item-Sets

Level-Wise Approach

Level-wise algorithms start by finding the item-sets of cardinality one and gradually work up to the frequent item-sets of higher cardinality. These algorithms use the anti-monotonic property of frequent item-sets, according to which no superset of an infrequent item-set can be frequent. Agrawal et al. (Agrawal, R., Imielinski T., & Swami A. 1993), (Agrawal, R., & Srikant R., 1994) proposed the Apriori algorithm, which is the most popular iterative algorithm in this category. It starts by finding the frequent item-sets of size one and goes up level by level, finding candidate item-sets of size k by joining frequent item-sets of size k-1. Two item-sets, each of size k-1, join to form an item-set of size k if and only if they have their first k-2 items in common. At each level the algorithm prunes the candidate item-sets using the anti-monotonic property and subsequently scans the database to find the support of the pruned candidate item-sets. The process continues as long as the set of frequent item-sets found at the current level is non-empty. Since each iteration requires a database scan, the maximum number of database scans required is the same as the size of the maximal item-set. Fig. 3 and Fig. 4 give the pseudo code of the Apriori algorithm and a running example respectively.

Two of the major bottlenecks in the Apriori algorithm are i) the number of passes and ii) the number of candidates generated. The first is likely to cause an I/O bottleneck and the second causes heavy load on memory and CPU usage. Researchers have proposed solutions to these problems with considerable success. Although detailed discussion of these solutions is beyond the scope of this article, a brief mention is necessary. Hash techniques reduce the number of candidates by building a hash table and discarding a bucket if it has support less than ms. Thus at each level the memory requirement is reduced because of the smaller candidate set. The reduction is most significant at lower levels. Maintaining a list of transaction ids for each candidate set reduces database access. The Dynamic Item-set Counting algorithm reduces the number of scans by counting candidate sets of different cardinality in a single scan (Brin S., Motwani R., Ullman J.D., & Tsur S. 1997). The Pincer Search algorithm uses a bi-directional strategy to prune the candidate set from the top (maximal) and the bottom (1-itemset) (Lin D. & Kedem Z.M. 1998). Partitioning and sampling strategies have also been proposed to speed up the counting task. An excellent comparison of the Apriori algorithm and its variants is given in (Hipp J., Güntzer U., Nakhaeizadeh G. 2000).
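The level-wise procedure described above (join, prune, scan) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pseudo code of Fig. 3: the function name `apriori`, the representation of item-sets as sorted tuples, and the inclusive comparison `>= ms` are choices made here for the example.

```python
from itertools import combinations


def apriori(transactions, ms):
    """Return {item-set (sorted tuple): support} for all frequent item-sets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # One pass over the database counts every candidate of this level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= ms}

    # Level 1: frequent item-sets of size one.
    items = sorted({i for t in transactions for i in t})
    level = frequent_among([(i,) for i in items])
    frequent = dict(level)

    k = 2
    while level:
        candidates = []
        for a, b in combinations(sorted(level), 2):
            # Join step: both (k-1)-item-sets share their first k-2 items.
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # Prune step (anti-monotonic property): every (k-1)-subset
                # of a candidate must itself be frequent.
                if all(s in level for s in combinations(cand, k - 1)):
                    candidates.append(cand)
        level = frequent_among(candidates)
        frequent.update(level)
        k += 1
    return frequent


if __name__ == "__main__":
    # Example database of Fig. 1 with ms = 40%.
    db = [{"B", "C", "D"}, {"B", "C", "D", "E"}, {"A", "C"},
          {"B", "D", "E"}, {"A", "B"}]
    for itemset, supp in sorted(apriori(db, 0.4).items()):
        print(itemset, supp)
```

On the database of Fig. 1 the candidate BCE is generated by the join step but removed by the prune step, because its subset CE is infrequent; it is therefore never counted against the database.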

Figure 3. Apriori algorithm

Tree Based Algorithms

Tree based algorithms have been proposed to overcome the problem of multiple database scans. These algorithms compress the database (sometimes lossily) into a tree data structure and reduce the number of database scans appreciably. Subsequently the tree is used to mine for the support of all frequent item-sets. The Set-Enumeration tree used in the Max Miner algorithm (Bayardo R.J. 1998) orders the candidate sets while searching for maximal frequent item-sets. The data structure facilitates quick identification of long frequent item-sets based on the information gathered during each pass. The algorithm is particularly suitable for dense databases with maximal item-sets of high cardinality. Han et al. (Han, J., Pei, J., & Yin, Y. 2000) proposed the Frequent Pattern (FP)-growth algorithm, which performs a database scan and finds the frequent item-sets of cardinality one. It arranges all frequent items in a (header) table in descending order of their supports. During the second database scan, the algorithm constructs an in-memory data structure called the FP-Tree by inserting each transaction after rearranging its items in descending order of support. A node in the FP-Tree stores a single attribute, so that each path in the tree represents and counts the corresponding records in the database. A link from the header connects all the nodes of an item. This structural information is used while mining the FP-Tree. The FP-Growth algorithm recursively generates sub-trees from FP-Trees corresponding to each frequent item-set. Coenen et al. (Coenen F., Leng P., & Ahmed S. 2004) proposed the Total Support Tree (T-Tree) and Partial Support Tree (P-Tree) data structures, which offer significant advantages in terms of storage and execution. These data structures are compressed set enumeration trees; they are constructed after one scan of the database and store all the item-sets as distinct records in the database.
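The two-scan construction of the FP-Tree described above can be sketched as follows. This is only an illustration of the tree-building phase (the recursive mining step of FP-Growth is not shown). The names `Node` and `build_fp_tree`, the use of an absolute count `min_count` instead of a percentage, the alphabetical tie-break between items of equal support, and the representation of the header links as a plain list of nodes per item are all assumptions made for this example.

```python
from collections import Counter, defaultdict


class Node:
    """One FP-Tree node: a single item, a count, and links to parent/children."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count single items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Header order: descending support, ties broken alphabetically (assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes carrying that item

    # Second scan: insert each transaction, reordered by the header order.
    for t in transactions:
        node = root
        ordered = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # shared prefixes are counted, not duplicated
            node = child
    return root, header, order


if __name__ == "__main__":
    db = [{"B", "C", "D"}, {"B", "C", "D", "E"}, {"A", "C"},
          {"B", "D", "E"}, {"A", "B"}]
    root, header, order = build_fp_tree(db, min_count=2)
    print(order)
    for item in order:
        print(item, [n.count for n in header[item]])
```

Summing the counts of all nodes linked from one header entry gives the support count of that item, which is the information the mining phase starts from.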

Figure 4. Running example of apriori algorithm for finding frequent itemsets (ms = 40%)

Discovery of Frequent Closed Item-Sets

Level Wise Approach

Pasquier et al. (Pasquier N., Bastide Y., Taouil R. & Lakhal L. 1999) proposed the Close method to find Frequent Closed Item-sets (FCI). This method finds closures based on Galois closure operators and computes the generators. The Galois closure operator h(X) for some X ⊆ I is defined as the intersection of the transactions in D containing item-set X. An item-set X is a closed item-set if and only if h(X) = X. One of the smallest item-sets p, arbitrarily chosen, such that h(p) = X is known as a generator of X. The Close method is based on the Apriori algorithm. It starts from 1-item-sets, finds their closures using the Galois closure operator, and goes up level by level, computing generators and their closures (i.e. FCI) at each level. At each level, candidate generator item-sets of size k are found by joining generator item-sets of size k-1 using the combinatorial procedure used in the Apriori algorithm. The candidate generators are pruned using two strategies: i) remove candidate generators that have an infrequent subset; ii) remove a candidate generator if the closure of one of its subsets is a superset of the generator. Subsequently the algorithm finds the support of the pruned candidate generators. Each iteration requires one pass over the database to construct the set of FCI and count their support.
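The closure operator h(X) used by the Close method is simple to state in code. The sketch below applies it to the example database of Fig. 1; the function names `closure` and `is_closed` are illustrative, and the treatment of an item-set that occurs in no transaction (its closure is taken to be the full item set) is a convention assumed here rather than something stated in the article.

```python
# Example database of Fig. 1: transaction id -> set of items.
DB = {
    1: {"B", "C", "D"},
    2: {"B", "C", "D", "E"},
    3: {"A", "C"},
    4: {"B", "D", "E"},
    5: {"A", "B"},
}


def closure(itemset, db):
    """h(X): intersection of all transactions that contain `itemset`."""
    itemset = set(itemset)
    covering = [items for items in db.values() if itemset <= items]
    if not covering:
        # Assumed convention: an item-set contained in no transaction
        # closes to the set of all items.
        return frozenset().union(*db.values())
    return frozenset(set.intersection(*covering))


def is_closed(itemset, db):
    """X is closed if and only if h(X) = X."""
    return closure(itemset, db) == frozenset(itemset)


if __name__ == "__main__":
    for x in [{"B"}, {"D"}, {"E"}, {"B", "C"}]:
        print(sorted(x), "->", sorted(closure(x, DB)), is_closed(x, DB))
```

Here {D} is not closed because every transaction containing D also contains B, so h({D}) = {B, D}; in the terminology above, {D} is a generator of the closed item-set {B, D}.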

Tree Based Approach

Wang et al. (Wang J., Han J. & Pei J. 2003) proposed the Closet+ algorithm to compute FCI and their supports using the FP-tree structure. The algorithm is based on a divide-and-conquer strategy and computes the local frequent items of a certain prefix by building and scanning its projected database.


Concept Lattice Based Approach

The concept lattice is a core structure of Formal Concept Analysis (FCA). FCA is a branch of mathematics based on Concepts and Concept hierarchies. A Concept (A,B) is defined as a pair consisting of a set of objects A (known as the extent) and a set of attributes B (known as the intent) such that the set of all attributes shared by the objects of extent A is B, and the set of all objects possessing the attributes of intent B is A. In other words, no object other than the objects of set A contains all attributes of B, and no attribute other than the attributes in set B is contained in all objects of set A. The concept lattice is a complete lattice of all Concepts. Stumme G. (1999) discovered that the intent B of the Concept (A,B) represents a closed item-set, which implies that all algorithms for finding Concepts can be used to find closed item-sets. Kuznetsov S.O., & Obiedkov S.A. (2002) provide a comparison of the performance of various algorithms for generating concepts. The naïve method to compute Concepts, proposed by Ganter, is given in Exhibit A.

Exhibit A.

add extent {all transactions} to the list of extents
For each item i ∈ I
    For each set X in the list of extents
        find X ∩ {set of transactions containing i}
        include it in the list of extents if not included earlier
    EndFor
EndFor

This method generates all the Concepts, i.e. all closed item-sets. The closed item-sets generated using this method for the example database of Figure 1 are {A}, {B}, {C}, {A,B}, {A,C}, {B,D}, {B,C,D}, {B,D,E}, {B,C,D,E}. The frequent closed item-sets are {A}, {B}, {C}, {B,D}, {B,C,D}, {B,D,E}. The concept lattice for the frequent closed item-sets is given in Figure 5.
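The extent-intersection procedure of Exhibit A translates almost line by line into Python. The sketch below is a hedged illustration: the helper names (`all_extents`, `intent`, `closed_itemsets`) are invented for this example, the empty extent and the empty intent are skipped so that the output matches the list given in the text, and the support threshold is applied inclusively.

```python
# Example database of Fig. 1: transaction id -> set of items.
DB = {
    1: {"B", "C", "D"},
    2: {"B", "C", "D", "E"},
    3: {"A", "C"},
    4: {"B", "D", "E"},
    5: {"A", "B"},
}


def all_extents(db):
    """Exhibit A: close the set of extents under intersection, item by item."""
    items = sorted(set().union(*db.values()))
    extents = [frozenset(db)]  # start with the extent {all transactions}
    for item in items:
        tids = frozenset(t for t, its in db.items() if item in its)
        for extent in list(extents):  # iterate over a snapshot of the list
            candidate = extent & tids
            if candidate not in extents:
                extents.append(candidate)
    return extents


def intent(extent, db):
    """Items common to all transactions of a non-empty extent."""
    return frozenset(set.intersection(*[db[t] for t in extent]))


def closed_itemsets(db, ms=0.0):
    """Map each non-empty closed item-set to its support, keeping support >= ms."""
    n = len(db)
    result = {}
    for extent in all_extents(db):
        if not extent:
            continue  # the empty extent is skipped (assumption, see lead-in)
        items = intent(extent, db)
        if items and len(extent) / n >= ms:
            result[items] = len(extent) / n
    return result


if __name__ == "__main__":
    for items, supp in sorted(closed_itemsets(DB, ms=0.4).items(),
                              key=lambda kv: sorted(kv[0])):
        print(sorted(items), supp)
```

Because each extent is stored together with its intent, the support of a closed item-set is available without a further database scan: it is the size of the extent divided by N.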

Figure 5. Concept lattice

Generating Association Rules

Once all frequent item-sets are known, association rules can be generated in a straightforward manner by finding all subsets of each item-set and testing their strength (Han J., & Kamber M., 2006). The pseudo code for this algorithm is given in Exhibit B. Based on this algorithm, the strong rules generated from the frequent item-set BCD in the example of Figure 1, with mc = 70%, are:

BC → D, conf = 100%
CD → B, conf = 100%

There are two ways to find association rules from frequent closed item-sets:

i) compute the frequent item-sets from the FCI and then find the association rules;
ii) generate rules directly using the FCI.
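As a brief aside on the first approach, which relies on the subset-and-test procedure of Exhibit B, the following Python sketch generates the strong rules of a single frequent item-set. The function names are illustrative. Only proper non-empty subsets are used as antecedents, which is an assumption made here so that the consequent I - s is never empty.

```python
from itertools import combinations

# Example database of Fig. 1.
DB = [{"B", "C", "D"}, {"B", "C", "D", "E"}, {"A", "C"},
      {"B", "D", "E"}, {"A", "B"}]


def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in db if itemset <= t)


def rules_from_itemset(itemset, db, mc):
    """Return (antecedent, consequent, confidence) for each strong rule."""
    itemset = frozenset(itemset)
    whole = support_count(itemset, db)
    rules = []
    for size in range(1, len(itemset)):  # proper, non-empty subsets only
        for s in combinations(sorted(itemset), size):
            s = frozenset(s)
            # support(I) / support(s), computed from counts.
            conf = whole / support_count(s, db)
            if conf >= mc:
                rules.append((s, itemset - s, conf))
    return rules


if __name__ == "__main__":
    for lhs, rhs, conf in rules_from_itemset({"B", "C", "D"}, DB, mc=0.70):
        print("".join(sorted(lhs)), "->", "".join(sorted(rhs)), conf)
```

For the item-set BCD the other four candidate rules (B → CD, C → BD, D → BC, BD → C) have confidence 50% or about 67% and are therefore rejected at mc = 70%.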

The Close method uses the first approach, which generates a lot of redundant rules, while the method proposed by Zaki (Zaki M.J., 2000), (Zaki, M.J., & Hsiao C.J., 2005) uses the second approach and derives rules directly from the Concept lattice. The association rules thus derived are non-redundant rules. For example, the set of strong rules generated using the Close method in the example of Figure 1 is {BC → D, CD → B, D → B, E → B, E → D, E → BD, BE → D, DE → B}. For the same example, the set of non-redundant strong rules generated using the Concept Lattice approach is {D → B, E → BD, BC → D, CD → B}. We can observe here that all rules can be derived from the reduced non-redundant set of rules.

Exhibit B.

For each frequent item-set I
    generate all non-empty subsets of I
    For every non-empty subset s of I
        Output the rule s → (I-s) if support(I) / support(s) >= mc
    EndFor
EndFor

Scalability Issues in Association Rule Mining

Scalability issues in ARM have motivated the development of incremental and parallel algorithms. Incremental algorithms for ARM preserve the counts of selected item-sets and reuse this knowledge later to discover frequent item-sets from the augmented database. The Fast Update algorithm (FUP) is the earliest algorithm based on this idea. Later, different algorithms based on sampling were presented (Hipp J., Güntzer U., & Nakhaeizadeh G., 2000). Parallel algorithms partition either the dataset for counting or the set of counters across different machines to achieve scalability (Hipp J., Güntzer U., & Nakhaeizadeh G., 2000). Algorithms that partition the dataset exchange counters, while algorithms that partition the counters exchange datasets, incurring a high communication cost.

FUTURE TRENDS

The discovery of Frequent Closed Item-sets (FCI) is a major advance in ARM algorithms. With the current growth rate of databases and the increasing use of ARM in various scientific and commercial applications, we envisage tremendous scope for research in parallel, incremental and distributed algorithms for FCI. The use of a lattice structure for FCI offers the promise of scalability. Online mining of streaming datasets using the FCI approach is an interesting direction to work on.

CONCLUSION

The article presents the basic approach to Association Rule Mining, focusing on some common algorithms for finding frequent item-sets and frequent closed item-sets. Various approaches for finding such item-sets have been discussed, including the Formal Concept Analysis approach for finding frequent closed item-sets. The generation of rules from frequent item-sets and frequent closed item-sets is briefly discussed. The article also addresses the scalability issues involved in the various algorithms.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993), Mining Association Rules Between Sets of Items in Large Databases, Proceedings of the 1993 ACM International Conference on Management of Data, 207-216, Washington, D.C.

Agrawal, R., & Srikant, R. (1994), Fast Algorithms for Mining Association Rules in Large Databases, Proceedings of the Twentieth International Conference on VLDB, 487-499, Santiago, Chile.

Bayardo, R.J. (1998), Efficiently Mining Long Patterns From Databases, Proceedings of the ACM International Conference on Management of Data.

Brin, S., Motwani, R., Ullman, J.D., & Tsur, S. (1997), Dynamic Item-set Counting and Implication Rules for Market Basket Data, ACM Special Interest Group on Management of Data, 26(2):255.

Coenen, F., Leng, P., & Ahmed, S. (2004), Data Structure for Association Rule Mining: T-Trees and P-Trees, IEEE TKDE, Vol. 16, No. 6.

Han, J., Pei, J., & Yin, Y. (2000), Mining Frequent Patterns Without Candidate Generation, Proceedings of the ACM International Conference on Management of Data, ACM Press, 1-12.

Han, J., & Kamber, M. (2006), Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers.

Hipp, J., Güntzer, U., & Nakhaeizadeh, G. (2000), Algorithms for Association Rule Mining: A General Survey and Comparison, SIGKDD Explorations.

Kuznetsov, S.O., & Obiedkov, S.A. (2002), Comparing Performance of Algorithms For Generating Concept Lattices, Journal of Experimental and Theoretical Artificial Intelligence.

Lin, D., & Kedem, Z.M. (1998), Pincer Search: A New Algorithm for Discovering the Maximum Frequent Sets, Proceedings of the 6th International Conference on Extending Database Technology (EDBT), Valencia, Spain.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999), Efficient Mining of Association Rules Using Closed Item-set Lattices, Information Systems, 24(1):25-46.

Stumme, G. (1999), Conceptual Knowledge Discovery with Frequent Concept Lattices, FB4-Preprint 2043, TU Darmstadt.

Wang, J., Han, J., & Pei, J. (2003), Closet+: Searching for the Best Strategies for Mining Frequent Closed Itemsets, Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 236-245, New York, USA, ACM Press.

Zaki, M.J. (2000), Generating Non-Redundant Association Rules, Proceedings of the International Conference on Knowledge Discovery and Data Mining.

Zaki, M.J., & Hsiao, C.J. (2005), Efficient Algorithms for Mining Closed Item-sets and Their Lattice Structure, IEEE Transactions on Knowledge and Data Engineering, 17(4):462-478.

KEY TERMS

Association Rule: An association rule is an implication of the form X → Y where X ⊂ I, Y ⊂ I and X ∩ Y = ∅; I denotes the set of items.

Data Mining: Extraction of interesting, non-trivial, implicit, previously unknown and potentially useful information or patterns from data in large databases.

Formal Concept: A formal context K = (G,M,I) consists of two sets G (objects) and M (attributes) and a relation I between G and M. For a set A ⊆ G of objects, A' = {m ∈ M | gIm for all g ∈ A} (the set of all attributes common to the objects in A). Correspondingly, for a set B of attributes we define B' = {g ∈ G | gIm for all m ∈ B} (the set of objects common to the attributes in B). A formal concept of the context (G,M,I) is a pair (A,B) with A ⊆ G, B ⊆ M, A' = B and B' = A. A is called the extent and B is the intent of the concept (A,B).

Frequent Closed Item-Set: An item-set X is a closed item-set if there exists no item-set X' such that:
i. X' is a proper superset of X,
ii. every transaction containing X also contains X'.
A closed item-set X is frequent if its support exceeds the given support threshold.

Galois Connection: Let $D = (\mathcal{O}, \mathcal{I}, R)$ be a data mining context where $\mathcal{O}$ and $\mathcal{I}$ are finite sets of objects (transactions) and items respectively. $R \subseteq \mathcal{O} \times \mathcal{I}$ is a binary relation between objects and items. For $O \subseteq \mathcal{O}$ and $I \subseteq \mathcal{I}$, we define f and g as shown in Exhibit C.

Exhibit C.

$$f: 2^{\mathcal{O}} \rightarrow 2^{\mathcal{I}}, \qquad f(O) = \{\, i \in \mathcal{I} \mid \forall o \in O,\ (o,i) \in R \,\}$$

$$g: 2^{\mathcal{I}} \rightarrow 2^{\mathcal{O}}, \qquad g(I) = \{\, o \in \mathcal{O} \mid \forall i \in I,\ (o,i) \in R \,\}$$

f(O) associates with O the items common to all objects o ∈ O, and g(I) associates with I the objects related to all items i ∈ I. The couple of applications (f,g) is a Galois connection between the power set of $\mathcal{O}$ (i.e. $2^{\mathcal{O}}$) and the power set of $\mathcal{I}$ (i.e. $2^{\mathcal{I}}$). The operators $h = f \circ g$ in $2^{\mathcal{I}}$ and $h' = g \circ f$ in $2^{\mathcal{O}}$ are Galois closure operators. An item-set $C \subseteq \mathcal{I}$ from D is a closed item-set iff h(C) = C.

Generator Item-Set: A generator p of a closed item-set c is one of the smallest item-sets such that h(p) = c.

Non-Redundant Association Rules: Let Ri denote the rule X1^i → X2^i, where X1^i, X2^i ⊆ I. Rule R1 is more general than rule R2 provided R2 can be generated by adding additional items to either the antecedent or the consequent of R1. Rules having the same support and confidence as more general rules are the redundant association rules. The remaining rules are non-redundant rules.

Ambient Intelligence


Fariba Sadri
Imperial College London, UK

Kostas Stathis
Royal Holloway, University of London, UK

INTRODUCTION

In recent years much research and development effort has been directed towards the broad field of ambient intelligence (AmI), and this trend is set to continue for the foreseeable future. AmI aims at seamlessly integrating services within smart infrastructures to be used at home, at work, in the car, on the move, and generally in most environments inhabited by people. It is a relatively new paradigm rooted in ubiquitous computing, which calls for the integration and convergence of multiple disciplines, such as sensor networks, portable devices, intelligent systems, human-computer and social interactions, as well as many techniques within artificial intelligence, such as planning, contextual reasoning, speech recognition, language translation, learning, adaptability, and temporal and hypothetical reasoning.

The term AmI was coined by the European Commission, when in 2001 one of its Programme Advisory Groups launched the AmI challenge (Ducatel et al., 2001), later updated in 2003 (Ducatel et al., 2003). But although the term AmI originated in Europe, the goals of the work have been adopted worldwide; see for example (The Aware Home, 2007), (The Oxygen Project, 2007), and (The Sony Interaction Lab, 2007).

The foundations of AmI infrastructures are based on the impressive progress we are witnessing in wireless technologies, sensor networks, display capabilities, processing speeds and mobile services. These developments help provide much useful (raw) information for AmI applications. Further progress is needed in taking full advantage of such information in order to provide the degree of intelligence, flexibility and naturalness envisaged. This is where artificial intelligence and multi-agent techniques have important roles to play. In this paper we will review the progress that has been made in intelligent systems, discuss the role of artificial intelligence and agent technologies and focus on the application of AmI for independent living.

BACKGROUND

Ambient intelligence is a vision of the information society where normal working and living environments are surrounded by embedded intelligent devices that can merge unobtrusively into the background and work through intuitive interfaces. Such devices, each specialised in one or more capabilities, are intended to work together within an infrastructure of intelligent systems, to provide a multitude of services aimed at generally improving safety and security and improving quality of life in ordinary living, travelling and working environments. The European Commission identified four AmI scenarios (Ducatel et al. 2001, 2003) in order to stimulate imagination and initiate and structure research in this area. We summarise two of these to provide the flavour of AmI visions.

AmI Scenarios:

1. Dimitrios is taking a coffee break and prefers not to be disturbed. He is wearing on his clothes or body a voice activated digital avatar of himself, known as Digital Me (D-Me). D-Me is both a learning device, learning about Dimitrios and his environment, and an acting device offering communication, processing and decision-making functionalities. During the coffee break D-Me answers the incoming calls and emails of Dimitrios. It does so smoothly in the necessary languages, with a reproduction of Dimitrios' voice and accent. Then D-Me receives a call from Dimitrios' wife, recognises its urgency and passes it on to Dimitrios. At the same time it catches a message from an older person's D-Me, located nearby. This person has left home without his medication and would like to find out where to access similar drugs. He has asked his D-Me, in natural language, to investigate this. Dimitrios happens to suffer from a similar health problem and uses the same drugs. His D-Me processes the incoming request for information, and decides neither to reveal Dimitrios' identity nor to offer direct help, but to provide the elderly person's D-Me with a list of the closest medicine shops and potential contact with a self-help group.

2. Carmen plans her journey to work. She asks AmI, by voice command, to find her someone with whom she can share a lift to work in half an hour. She then plans the dinner party she is to give that evening. She wishes to bake a cake, and her e-fridge flashes a recipe on the e-fridge screen and highlights the ingredients that are missing. Carmen completes her shopping list on the screen and asks for it to be delivered to the nearest distribution point in her neighbourhood. All goods are smart tagged, so she can check the progress of her virtual shopping from any enabled device anywhere, and make alterations. Carmen makes her journey to work in a car with dynamic traffic guidance facilities and traffic systems that dynamically adjust speed limits depending on congestion and pollution levels. When she returns home the AmI welcomes her and suggests that on the next day she should telework, as a big demonstration is planned downtown.

The demands that drive AmI and provide opportunities are for improvement of safety and quality of life, enhancements of productivity and quality of products and services, including public services such as hospitals, schools, military and police, and industrial innovation. AmI is intended to facilitate human contact and community and cultural enhancement, and ultimately it should inspire trust and confidence. Some of the technologies required for AmI are summarised in Figure 1. AmI work builds on ubiquitous computing and sensor network and mobile technologies. To provide the intelligence and naturalness required, it is our view that significant contributions can come from advances in artificial intelligence and agent technologies. Artificial intelligence has a long history of research on planning, scheduling, temporal reasoning, fault diagnosis, hypothetical reasoning, and reasoning with incomplete and uncertain information. All of these are techniques that can contribute to AmI, where actions and decisions have to be taken in real time, often with dynamic and uncertain knowledge about the environment and the user. Agent technology research has concentrated on agent architectures that combine several, often cognitive, capabilities, including reactivity and adaptability, as well as the formation of agent societies through communication, norms and protocols.

Figure 1. Components of Ambient Intelligence

Components (Ambient):
• Very unobtrusive hardware
• Embedded systems
• Dynamic distributed networks
• Seamless mobile/fixed ubiquitous communication
• Sensor technology
• I/O devices
• Adaptive software

Components (Intelligence):
• Computational intelligence
• Contextual awareness
• Natural interaction
• Adaptability
• Robustness
• Security
• Fault tolerance

Software platform

Recent work has attempted to exploit these techniques for AmI. In (Augusto and Nugent 2004) the use of temporal reasoning combined with active databases is explored in the context of smart homes. In (Sadri 2007) the use of temporal reasoning together with agents is explored to deal with similar scenarios, where information observed in a home environment is evaluated, deviations from normal behaviour and risky situations are recognised and compensating actions are recommended. The relationship of AmI to cognitive agents is motivated by (Stathis and Toni 2004), who argue that computational logic elevates the level of the system to that of a user. They advocate the KGP agent model (Kakas et al 2004) to investigate how to assist a traveller to act independently and safely in an unknown environment using a personal communicator. (Augusto et al 2006) address the process of taking decisions in the presence of conflicting options. (Li and Ji 2005) offer a new probabilistic framework based on Bayesian Networks for dealing with ambiguous and uncertain sensory observations and users' changing states, in order to provide correct assistance. (Amigoni et al 2005) address the goal-oriented aspect of AmI applications, and in particular the planning problem within AmI. They conclude that a combination of centralised and distributed planning capabilities is required, due to the distributed nature of AmI and the participation of heterogeneous agents with different capabilities. They offer an approach based on Hierarchical Task Networks, taking the perspective of a multi-agent paradigm for AmI. The paradigm of embedded agents for AmI environments, with a focus on developing learning and adaptation techniques for the agents, is discussed in (Hagras et al 2004, and Hagras and Callaghan 2005). Each agent is equipped with sensors and effectors and uses a learning system based on fuzzy logic. A real AmI environment in the form of an "intelligent dormitory" is used for experimentation.
Privacy and security in the context of AmI applications at home, at work, and in the health, shopping and mobility domains are discussed in (Friedewald et al 2007). For such applications they consider security threats such as surveillance of users, identity theft and malicious attacks, as well as the potential of the digital divide amongst communities and social pressures.

AMBIENT INTELLIGENCE FOR INDEPENDENT LIVING


One major use of AmI is to support services for independent living, to prolong the time people can live decently in their own homes by increasing their autonomy and self-confidence. This may involve the elimination of monotonous everyday activities, monitoring and caring for the elderly, provision of security, or saving resources. The aim of such AmI applications is to help:

• maintain safety of a person by monitoring his environment and recognizing and anticipating risks, and taking appropriate actions,
• provide assistance in daily activities and requirements, for example, by reminding and advising about medication and nutrition, and
• improve quality of life, for example by providing personalized information about entertainment and social activities.

This area has attracted a great deal of attention in recent years, because of increased longevity and the aging population in many parts of the world. For such an AmI system to be useful and accepted it needs to be versatile, adaptable, capable of dealing with changing environments and situations, transparent and easy, and even pleasant, to interact with. We believe that it would be promising to explore an approach based on providing an agent architecture consisting of a society of heterogeneous, intelligent, embedded agents, each specialised in one or more functionalities. The agents should be capable of sharing information through communication, and their dialogues and behaviour should be governed by context-dependent and dynamic norms. The basic capabilities for intelligent agents include:

• Sensing: to allow the agent to observe the environment
• Reactivity: to provide context-dependent dynamic behaviour and the ability to adapt to changes in the environment
• Planning: to provide goal-directed behaviour
• Goal Decision: to allow dynamic decisions about which goals have higher priorities
• Action execution: to allow the agent to affect the environment.

All of these functionalities also require reasoning about spatio-temporal constraints reflecting the environment in which an AmI system operates. Most of these functionalities have been integrated in the KGP model (Kakas et al, 2004), whose architecture is shown in Figure 2, and implemented in the PROSOCS system (Bracciali et al, 2006). The use of reactivity for communication and dialogue policies has also been discussed in, for example, (Sadri et al, 2003). The inclusion of normative behaviour has been discussed in (Sadri et al, 2006), where we also consider how to choose amongst different types of goals, depending on the governing norms. For a general discussion on the importance of norms in artificial societies see (Pitt, 2005). KGP agents are situated in the environment via their physical capabilities. Information received from the environment (including other agents) updates the agent's state and provides input to its dynamic cycle theory, which, in turn, determines the next steps in terms of its transitions, using its reasoning capabilities.

Figure 2. The architecture of a KGP agent

FUTURE TRENDS

Like most other information and communication technologies, AmI is not likely to be good or bad on its own, but its value will be judged from the different ways the technology will be used to improve people's lives. In this section we discuss new opportunities and challenges for the integration of AmI with what people do in ordinary settings. We abstract away from hardware trends and we focus on areas that are software related and are likely to play an important role in the adoption of AmI technologies. A focal point is the observation that people discover and understand the world through visual and conversational interactions. As a result, in the coming years we expect to see the design of AmI systems focus on ways that will allow humans to interact naturally, using their common skills such as speaking, gesturing and glancing. This kind of natural interaction (Leibe et al 2000) will complement existing interfaces and will require that AmI systems be capable of representing virtual objects, possibly in 3D, as well as capturing people's moves in the environment and identifying which of these moves are directed to virtual objects. We also expect to see new research directed towards processing of sensor data with different information (Massaro and Friedman 1990) and different kinds of formats such as audio, video, and RFID. Efficient techniques to index, search, and structure these data, and ways to transform them into the higher-level semantic information required by cognitive agents, will be an important area for future work. Similarly, the reverse of this process is likely to be of equal importance, namely, how to translate high-level information to the lower-level signals required by actuators that are situated in the environment. Given that sensors and actuators will provide the link with the physical environment, we also anticipate further research to address the general linking of AmI systems to already existing computing infrastructures such as the semantic web.
This work will create hybrid environments that will need to combine useful information from existing wired technologies with information from wireless ones (Stathis et al 2007). To enable the creation of such environments we imagine the need to build new frameworks and middleware to facilitate integration of heterogeneous AmI systems and make their interoperation more flexible. Another important issue is how the human experience in AmI will be managed in a way that is as unobtrusive as possible. Here we foresee that developments in cognitive systems will play a very important role. Although there will be many areas of cognitive system behaviour that will need to be addressed, we anticipate that the development of agent models that adapt and learn (Sutton and Barto 1998) will be of great importance. The challenge here will be how to integrate the output of these adaptive and learning capabilities into the reasoning and decision processes of the agent. The resulting cognitive behaviour must differentiate between newly learned concepts and existing ones, as well as discriminate between normal behaviour and exceptions.

We expect that AmI will emerge with the formation of user communities who live and work in a particular locality (Stathis et al 2006). The issue then becomes how to manage all the information that is provided and captured as the system evolves. We foresee research to address issues such as semantic annotations of content, and partitioning and ownership of information. Linking local communities with smart homes, e-healthcare, mobile commerce, and transportation systems will eventually give rise to a global AmI system. For applications in such a system to be embraced by people we will need to see specific human factors studies to decide how unobtrusive, acceptable and desirable the actions of the AmI environment seem to the people who use it. Some human factors studies should focus on issues of presentation of objects and agents in a 3D setting, as well as on the important issues of privacy, trust and security.

To make possible the customization of system interactions to different classes of users, it is necessary to acquire and store information about these users. Thus for people to trust AmI interactions in the future we must ensure that the omnipresent intelligent environment maintains privacy in an ethical manner. Ethical or, better, normative behaviour must be ensured not only at the cognitive level (Sadri et al 2006), but also at the lower, implementation level of the AmI platform. In this context, ensuring that communicated information is encrypted, certified, and follows transparent security policies will be required to build systems less vulnerable to malicious attacks. Finally, we also envisage changes to business models that would characterise AmI interactions (Hax and Wilde 2001).
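The step from raw sensor data to the higher-level semantic information required by cognitive agents, discussed earlier in this section, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Reading` type, the sensor names, the 30-degree threshold and the fact vocabulary (`too_hot`, `occupied`, `tag_present`) are hypothetical and are not taken from the KGP model or from any AmI platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One raw sensor reading (hypothetical format, for illustration only)."""
    sensor: str    # sensor identifier, e.g. "kitchen_temp"
    kind: str      # "temperature", "motion" or "rfid"
    value: object  # float, bool or tag string depending on kind


def to_facts(readings, hot_threshold=30.0):
    """Translate raw readings into symbolic (predicate, subject) facts."""
    facts = set()
    for r in readings:
        if r.kind == "temperature" and r.value > hot_threshold:
            facts.add(("too_hot", r.sensor))
        elif r.kind == "motion" and r.value:
            facts.add(("occupied", r.sensor))
        elif r.kind == "rfid":
            facts.add(("tag_present", r.value))
    return facts


readings = [
    Reading("kitchen_temp", "temperature", 34.5),
    Reading("hall_motion", "motion", True),
    Reading("door_reader", "rfid", "tag-42"),
    Reading("bedroom_temp", "temperature", 21.0),
]
facts = to_facts(readings)
```

A cognitive agent would reason over `facts` rather than over the raw values; the reverse direction, from high-level decisions to actuator signals, would need a similar mapping.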

CONCLUSION

The successful adoption of AmI is predicated on the suitable combination of ubiquitous computing, artificial intelligence and agent technologies. A useful class of applications that can test such a combination is AmI supporting independent living. For such applications we have identified the trends that are likely to play an important role in the future.

REFERENCES

Augusto, J.C., Liu, J., Chen L. (2006). Using ambient intelligence for disaster management. In the Proceedings of the 10th International Conference on Knowledge-based Intelligent Information and Engineering Systems (KES 2006), Springer Verlag.
Augusto, J.C., Nugent, C. D. (2004). The use of temporal reasoning and management of complex events in smart homes. In Proceedings of the European Conference on Artificial Intelligence (ECAI), 778-782.
Bracciali, A., Endriss, U., Demetriou, N., Kakas, A.C., Lu, L., Stathis, K. (2006). Crafting the mind of PROSOCS agents. Applied Artificial Intelligence 20(2-4), 105-131.
Ducatel, K., Bogdanowicz, M., Scapolo, F., Leijten, J., Burgelman J.-C. (2001). Scenarios for ambient intelligence in 2010. IST Advisory Group Final Report, European Commission.
Ducatel, K., Bogdanowicz, M., Scapolo, F., Leijten, J., Burgelman J.-C. (2003). Ambient intelligence: from vision to reality. IST Advisory Group Draft Report, European Commission.
Dutton, W. H. (1999). Society on the line: information politics in the digital age, Oxford, Oxford University Press.
Friedewald M., Vildijiounaite, E., Punie, Y., Wright, D. (2007). Privacy, identity and security in ambient intelligence: a scenario analysis. Telematics and Informatics, 24, 15-29.
Hagras, H., Callaghan, V., Colley, M., Clarke, G., Pounds-Cornish, A., Duman, H. (2004). Creating an ambient intelligence environment using embedded agents. IEEE Intelligent Systems, 19(6), 12-20.
Hagras, H. and Callaghan, V. (2005). An intelligent fuzzy agent approach for realizing ambient intelligence in intelligent inhabited environments. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 35(1), 55-65.



Hax, A., and Wilde, D., II. (2001). The Delta Model – discovering new sources of profitability in a networked economy. European Management Journal. 9, 379-391.
Kakas, A., Mancarella, P., Sadri, F., Stathis, K., Toni, F. (2004). The KGP model of agency. In Proceedings of European Conference on Artificial Intelligence, 33-37.
Li, X. and Ji, Q. (2005). Active affective state detection and user assistance with dynamic bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics: Special Issue on Ambient Intelligence, 35(1), 93-105.
Leibe, B., Starner, T., Ribarsky, W., Wartell, Z., Krum, D., Singletary, B., and Hodges, L. (2000). The Perceptive Workbench: towards spontaneous and natural interaction in semi-immersive virtual environments. IEEE Virtual Reality Conference (VR’2000). 13-20.
Massaro, D. W., and D. Friedman. (1990). Models of integration given multiple sources of information. Psychological Review. 97, 225-252.
Pitt, J. (2005). The open agent society as a platform for the user-friendly information society. AI Soc. 19(2), 123-158.
Sadri, F., Stathis, K., and Toni, F. (2006). Normative KGP agents. Journal of Computational and Mathematical Organizational Theory. 12(2-3), 101-126.
Sadri, F. (2007). Ambient intelligence for care of the elderly in their homes. In Proceedings of the 2nd Workshop on Artificial Intelligent Techniques for Ambient Intelligence (AITAmI ‘07), 62-67.
Sadri, F., Toni, F., Torroni, P. (2003). Minimally intrusive negotiating agents for resource sharing. In Proceedings of the 8th International Joint Conference on Artificial Intelligence (IJCAI 03), 796-801.
Stathis, K., de Bruijn, O., Spence, R. and Purcell, P. (2006). Ambient intelligence: human-agent interactions in a networked community. In Purcell, P. (ed) Networked Neighbourhoods: The Connected Community in Context (Springer), 279-304.


Stathis, K., Kafetzoglou, S., Papavasiliou, S., and Bromuri, S. (2007). Sensor network grids: agent environments combined with QoS in wireless sensor networks. In Proceedings of the 3rd International Conference on Autonomic and Autonomous Systems, IEEE. 47-52.
Stathis, K. and Toni, F. (2004). Ambient intelligence using KGP agents. Workshop at the Second European Symposium on Ambient Intelligence, Lecture Notes in Computer Science 3295, 351-362.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: an introduction. MIT Press.
The Aware Home Initiative (2007), http://www.cc.gatech.edu/fce/house/house.html.
The Oxygen Project (2007), http://www.oxygen.lcs.mit.edu.
The Sony Interaction Lab (2007), http://www.sonycsl.co.jp/IL/index.html.

TERMS AND DEFINITIONS

Artificial Societies: Complex systems consisting of a, possibly large, set of agents whose interactions are constrained by norms and by the roles the agents are responsible for playing.

Cognitive Agents: Software agents endowed with high-level mental attitudes, such as beliefs, goals and plans.

Context Awareness: Refers to the idea that computers can both sense and react according to the state of the environment in which they are situated. Devices may have information about the circumstances under which they are able to operate and react accordingly.

Natural Interaction: The investigation of the relationships between humans and machines aiming to create interactive artifacts that respect and exploit the natural dynamics through which people communicate and discover the real world.

Smart Homes: Homes equipped with intelligent sensors and devices within a communications infrastructure that allows the various systems and devices to communicate with each other for monitoring and maintenance purposes.


Ubiquitous Computing: A model of human-computer interaction in which information processing is integrated into everyday objects and activities. Unlike the desktop paradigm, in which a single user chooses to interact with a single device for a specialized purpose, with ubiquitous computing a user interacts with many computational devices and systems simultaneously, in the course of ordinary activities, and may not even be aware of doing so.


Wireless Sensor Networks: Wireless networks consisting of spatially distributed autonomous devices using sensors to cooperatively monitor physical or environmental conditions, such as temperature, sound, vibration, pressure, motion or pollutants, at different locations.

Ambient Intelligence Environments

Carlos Ramos
Polytechnic of Porto, Portugal

INTRODUCTION

The trend towards hardware cost reduction and miniaturization makes it possible to include computing devices in many objects and environments (embedded systems). Ambient Intelligence (AmI) deals with a new world where computing devices are spread everywhere (ubiquity), allowing the human being to interact in physical world environments in an intelligent and unobtrusive way. These environments should be aware of the needs of people, customizing requirements and forecasting behaviours. AmI environments may be very diverse, such as homes, offices, meeting rooms, schools, hospitals, control centers, transports, tourist attractions, stores, sports facilities, and music devices. Ambient Intelligence involves many different disciplines, like automation (sensors, control, and actuators), human-machine interaction and computer graphics, communication, ubiquitous computing, embedded systems, and, obviously, Artificial Intelligence. From the Artificial Intelligence perspective, research aims to include more intelligence in AmI environments, giving better support to the human being and access to the knowledge essential for making better decisions when interacting with these environments.

BACKGROUND

Ambient Intelligence (AmI) is a concept developed by the European Commission’s IST Advisory Group, ISTAG (ISTAG, 2001)(ISTAG, 2002). ISTAG believes that it is necessary to take a holistic view of Ambient Intelligence, considering not just the technology, but the whole of the innovation supply-chain from science to end-user, and also the various features of the academic, industrial and administrative environment that facilitate or hinder realisation of the AmI vision (ISTAG, 2003). Due to the large number of technologies involved in the Ambient Intelligence concept, we may find several works that appeared even before the ISTAG vision and already pointed in the direction of Ambient Intelligence.

As far as Artificial Intelligence (AI) is concerned, Ambient Intelligence is a new meaningful step in the evolution of AI (Ramos, 2007). AI has walked side-by-side with the evolution of Computer Science and Engineering. The building of the first artificial neural models and hardware, with the work of Walter Pitts and Warren McCulloch (McCulloch & Pitts, 1943) and the SNARC system of Marvin Minsky and Dean Edmonds, corresponds to the first step. Computer-based Intelligent Systems, like the MYCIN Expert System (Shortliffe, 1976), and network-based Intelligent Systems, like AUTHORIZER’s ASSISTANT (Rothi, 1990), used by American Express for authorizing transactions by consulting several databases, are the kind of systems of the second step of AI. From the 1980s, Intelligent Agents and Multi-Agent Systems established the third step, leading more recently to Ontologies and the Semantic Web. From hardware to the computer, from the computer to the local network, from the local network to the Internet, and from the Internet to the Web, Artificial Intelligence has been at the state of the art of computing, most of the time a little ahead of the technology limits. Now the centre is no longer in the hardware, or in the computer, or even in the network. Intelligence must be provided to the environments we use daily. We are aware of the push in the direction of Intelligent Homes, Intelligent Vehicles, Intelligent Transportation Systems, Intelligent Manufacturing Systems, even Intelligent Cities. This is the reason why the Ambient Intelligence concept is so important nowadays (Ramos, 2007).

Ambient Intelligence is not possible without Artificial Intelligence. On the other hand, AI researchers must be aware of the need to integrate their techniques with those of other scientific communities (e.g. Automation, Computer Graphics, Communications). Ambient Intelligence is a tremendous challenge, needing the best effort of different scientific communities.



There is a miscellany of concepts and technologies related to Ambient Intelligence. Ubiquitous Computing, Pervasive Computing, Embedded Systems, and Context Awareness are the most common. However, these concepts are different from Ambient Intelligence.

The concept of Ubiquitous Computing (UbiComp) was introduced by Mark Weiser during his tenure as Chief Technologist of the Palo Alto Research Center (PARC) (Weiser, 1991). Ubiquitous Computing means that we have access to computing devices anywhere in an integrated and coherent way. Ubiquitous Computing was mainly driven by the Communications and Computing devices scientific communities, but it now involves other research areas. Ambient Intelligence differs from Ubiquitous Computing because sometimes the environment where Ambient Intelligence is considered is simply local. Another difference is that Ambient Intelligence places more emphasis on intelligence than Ubiquitous Computing does. However, ubiquity is a real need today and Ambient Intelligence systems are considering this feature.

A concept that is sometimes seen as a synonym of Ubiquitous Computing is Pervasive Computing. According to Teresa Dillon, Ubiquitous Computing is best considered as the underlying framework, the embedded systems, networks and displays which are invisible and everywhere, allowing us to ‘plug-and-play’ devices and tools. Pervasive Computing, on the other hand, is related to all the physical parts of our lives: mobile phone, hand-held computer or smart jacket (Dillon, 2006).

Embedded Systems means that electronic and computing devices are embedded in everyday objects or goods. Today goods like cars are equipped with microprocessors; the same is true for washing machines, refrigerators, and toys. The Embedded Systems community is driven more by the electronics and automation scientific communities. Current efforts aim to include electronic and computing devices in the most usual and simple objects we use, like furniture or mirrors. Ambient Intelligence differs from Embedded Systems since computing devices may be clearly visible in AmI scenarios. However, there is a clear trend to involve more embedded systems in Ambient Intelligence.

Context Awareness means that the system is aware of the current situation we are dealing with. An example is the automatic detection of the current situation in a Control Centre. Are we in the presence of a normal situation, or are we dealing with a critical situation, or even an emergency? In this Control Centre the intelligent alarm processor will exhibit different outputs according to the identified situation (Vale, Moura, Fernandes, Marques, Rosado, Ramos, 1997). The automobile industry is also investing in Context Aware systems, like near-accident detection. The Human-Computer Interaction scientific community is paying a lot of attention to Context Awareness. Context Awareness is one of the most desired concepts to include in Ambient Intelligence, since the identification of the context is important for deciding how to act in an intelligent way.

There are different views of the importance of other concepts and technologies in the Ambient Intelligence field. Usually these differences derive from the scientific community to which the authors belong. ISTAG sees the technology research requirements from different points of view (Components, Integration, System, and User/Person). In (ISTAG, 2003) the following ambient components are mentioned: smart materials; MEMS and sensor technologies; embedded systems; ubiquitous communications; I/O device technology; adaptive software. In the same document ISTAG refers to the following intelligence components: media management and handling; natural interaction; computational intelligence; context awareness; and emotional computing.

Recently Ambient Intelligence has been receiving significant attention from the Artificial Intelligence community. We may refer to the Ambient Intelligence Workshops organized by Juan Augusto and Daniel Shapiro at ECAI’2006 (European Conference on Artificial Intelligence) and IJCAI’2007 (International Joint Conference on Artificial Intelligence), and the Special Issue on Ambient Intelligence, coordinated by Carlos Ramos, Juan Augusto and Daniel Shapiro, to appear in the March/April’2008 issue of the IEEE Intelligent Systems magazine.
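The control-centre example of context awareness given in this section, where an alarm processor produces different outputs depending on whether the situation is normal, critical or an emergency, can be sketched as follows. This is only an illustration under stated assumptions: the numeric severity scale (1 to 3), the classification rule and the filtering policy are hypothetical and do not describe how the cited alarm processor actually works.

```python
def classify_situation(active_alarms):
    """Classify the situation from a list of alarm severities (1, 2 or 3).

    Hypothetical rule: the worst active alarm decides the context.
    """
    if not active_alarms:
        return "normal"
    worst = max(active_alarms)
    if worst >= 3:
        return "emergency"
    if worst == 2:
        return "critical"
    return "normal"


def present_alarms(active_alarms):
    """Return (situation, alarms to show); the output depends on the context."""
    situation = classify_situation(active_alarms)
    if situation == "normal":
        shown = list(active_alarms)                    # show everything
    elif situation == "critical":
        shown = [a for a in active_alarms if a >= 2]   # hide minor alarms
    else:
        shown = [a for a in active_alarms if a >= 3]   # only the most severe
    return situation, shown
```

The same list of alarms is therefore presented differently once the identified context changes, which is the behaviour the text attributes to a context-aware system.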

AMBIENT INTELLIGENCE PROTOTYPES AND SYSTEMS

Here we will analyse some examples of Ambient Intelligence prototypes and systems, divided by the area of application.



AmI at Home

Domotics is a consolidated area of activity. After the first experiences with Domotics in homes there was a trend to refer to the Intelligent Home concept. However, Domotics is too centred on automation, giving the user the capability to control the house devices from everywhere. We are still far from real Ambient Intelligence in homes, at least at the commercial level. In (Wichert, Hellenschmidt, 2006) there is an interesting example from the EMBASSI project: by gesture, a woman commands the TV to be brighter. However, the TV is already at its brightest level, so the lights are dimmed and the windows close, showing an example of context awareness in the environment. Several organizations are doing experiments to achieve the Intelligent Home concept. Some examples are HomeLab from Philips, MIT House_n, Georgia Tech Aware Home, Microsoft Concept Home, and e2 Home from Electrolux and Ericsson.
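The TV brightness scenario described above can be sketched as a small decision rule in which the environment pursues the user's goal (a picture that appears brighter) rather than the literal command. The state keys (`tv_brightness`, `light_level`, `windows_open`) and action names are hypothetical and chosen only for this illustration; they are not taken from the EMBASSI project.

```python
def brighter_picture(env):
    """Handle a 'make the TV brighter' gesture.

    Returns the updated environment state and the list of actions taken.
    All state keys are assumptions made for this sketch.
    """
    env = dict(env)  # work on a copy so the caller's state is untouched
    actions = []
    if env["tv_brightness"] < env["tv_max_brightness"]:
        # The literal command can be satisfied directly.
        env["tv_brightness"] += 1
        actions.append("raise_tv_brightness")
    else:
        # The TV is at its limit: act on the surroundings instead.
        if env["light_level"] > 0:
            env["light_level"] -= 1
            actions.append("dim_lights")
        if env["windows_open"]:
            env["windows_open"] = False
            actions.append("close_windows")
    return env, actions


state = {"tv_brightness": 10, "tv_max_brightness": 10,
         "light_level": 3, "windows_open": True}
new_state, taken = brighter_picture(state)
```

With the TV already at its maximum, `taken` lists the two environmental actions instead of a TV adjustment.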

AmI in Vehicles and Transports

Since the first experiences with NAVLAB 1 (Thorpe, Hebert, Kanade, Shafer, 1988), Carnegie Mellon University has developed several prototypes for Autonomous Vehicle Driving and Assistance. The latest one, NAVLAB 11, is an autonomous Jeep. Most car manufacturers are doing research in the area of Intelligent Vehicles for tasks like car parking assistance or pre-collision detection. Another example of AmI application is related to Transports, namely in connection with Intelligent Transportation Systems (ITS). The ITS Joint Program of the US Department of Transportation identified several areas of application, namely: arterial management; freeway management; transit management; incident management; emergency management; electronic payment; traveller information; information management; crash prevention and safety; roadway operations and management; road weather management; commercial vehicle operations; and intermodal freight. In all these application areas Ambient Intelligence can be used.

AmI in Elderly and Health Care

Several studies point to the aging of the population during the next decades. While this is a good result of increasing life expectancy, it also implies some problems. The percentage of the population with health problems will increase and it will be very difficult for hospitals to accommodate all patients. Our society is faced with the responsibility to care for these people in the best possible social and economical ways. So, there is a clear interest in creating Ambient Intelligence devices and environments that allow patients to be followed in their own homes or during their day-to-day life. The medical control support devices may be embedded in clothes, like T-shirts, collecting vital-sign information from sensors (e.g. blood pressure, temperature). Patients will be monitored remotely. The surrounding environment, for example the patient's home, may be aware of the results from the clinical data and even place emergency calls to order an ambulance service. For instance, we may refer to the IST Vivago® system (IST International Security Technology Oy, Helsinki, Finland), an active social alarm system, which combines intelligent social alarms with continuous remote monitoring of the user’s activity profile (Särelä, Korhonen, Lötjönen, Sola, Myllymäki, 2003).

AmI in Tourism and Cultural Heritage

Tourism and Cultural Heritage are good application areas for Ambient Intelligence. Tourism is a growing industry. In the past tourists were satisfied with pre-defined tours, the same for everyone. However, there is a trend towards customization, and the same tour can be designed to adapt to tourists according to their preferences. Immersive tour post is an example of such an experience (Park, Nam, Shi, Golub, Van Loan, 2006). MEGA is a user-friendly virtual guide to assist visitors in the Parco Archeologico della Valle dei Templi in Agrigento, an archaeological area with ancient Greek temples in Sicily, Italy (Pilato, Augello, Santangelo, Gentile, Gaglio, 2006). DALICA has been used for constructing and updating the user profile of visitors of Villa Adriana in Tivoli, near Rome, Italy (Constantini, Inverardi, Mostarda, Tocchio, Tsintza, 2007).

AmI at Work

The human being spends considerable time in working places like offices, meeting rooms, manufacturing plants, and control centres.

SPARSE is a project initially created for helping Power Systems Control Centre Operators in the diagnosis and restoration of incidents (Vale, Moura, Fernandes, Marques, Rosado, Ramos, 1997). It is a good example of context awareness since the developed system is aware of the on-going situation, acting in different ways according to the normal or critical situation of the power system. This system is evolving into an Ambient Intelligence framework applied to Control Centres.

Decision Making is one of the most important activities of the human being. Nowadays decisions require many different points of view to be considered, so decisions are commonly taken by formal or informal groups of persons. Groups exchange ideas or engage in a process of argumentation and counter-argumentation, negotiate, cooperate, collaborate or even discuss techniques and/or methodologies for problem solving. Group Decision Making is a social activity in which the discussion and results consider a combination of rational and emotional aspects. ArgEmotionAgents is a project on the application of Ambient Intelligence to group argumentation and decision support considering emotional aspects. It runs in the Laboratory of Ambient Intelligence for Decision Support (LAID), seen in Figure 1 (Marreiros, Santos, Ramos, Neves, Novais, Machado, Bulas-Cruz, 2007), a kind of Intelligent Decision Room. This work also has a part involving ubiquity support.

AmI in Sports

Sports involve high-level athletes and many more practitioners. Many sports are practised without the help of any associated devices, which opens a clear opportunity for Ambient Intelligence to create sports assistance devices and environments.

FlyMaster NAV+ is an on-board pilot assistant for free flight (e.g. gliding, paragliding), using the FlyMaster F1 module with access to GPS and sensor information. FlyMaster Avionics S.A., a spin-off, was created to commercialize these products (see Figure 2).

AMBIENT INTELLIGENCE PLATFORMS

Some companies and academic institutions are investing in the creation of Ambient Intelligence generation platforms. The Endeavour project is developed by the University of California, Berkeley (http://endeavour.cs.berkeley.edu/). The project aims to specify, design, and implement prototypes at a planet scale, self-organized and involving an adaptive “Information Utility”. Oxygen enables pervasive human-centred computing through a combination of specific user and system technologies (http://www.oxygen.lcs.mit.edu/). This project provides speech and vision technologies enabling us to communicate with Oxygen as if we were interacting with another person, saving much time and effort (Rudolph, 2001). The Portolano project was developed at the University of Washington and seeks to create a testbed for research into the emerging field of invisible computing (http://portolano.cs.washington.edu/). Invisible computing is possible with devices so highly optimized to particular tasks that they blend into the world and require little technical knowledge from the users (Esler, Hightower, Anderson, Borrielo, 1999). The EasyLiving project of the Microsoft Research Vision Group corresponds to a prototype architecture and associated technologies for building intelligent environments (Brumitt, Meyers, Krumm, Kern, Shafer, 2000). The goal of EasyLiving is to facilitate the interaction of people with other people, with computers, and with devices (http://research.microsoft.com/easyliving/).

Figure 1. Ambient Intelligence for decision support, LAID Laboratory

Figure 2. FlyMaster Pilot Assistant device, from FlyMaster Avionics S.A.

FUTURE TRENDS

Ambient Intelligence deals with a futuristic notion for our lives. Most of the practical experiences concerning Ambient Intelligence are still at a very early stage, because the concept is so recent. Today, the separation between the computer and the environment is not clear. However, for new generations things will be more transparent, and environments with Ambient Intelligence will be more widely accepted.

In the area of transport, AmI will cover several aspects. The first is related to the vehicle itself. Several capabilities are becoming available, like the automatic identification of the situation (e.g. pre-collision identification, identification of the driver's condition). Other aspects are related to traffic information. Today, GPS devices are widespread, but they deal with static information. Adding on-line traffic conditions will enable the driver to avoid roads with accidents. Technology is making good progress towards automatic vehicle driving. But in the near future the developed systems will be seen more as driver assistants than as autonomous driving systems.

Another area where AmI will experience strong development is Health Care, especially Elderly Care. Patients will receive this support to allow a more autonomous life in their homes. In addition, automatic acquisition of vital signs (e.g. blood pressure, temperature) will allow automatic emergency calls to be made when the patient's health is in significant trouble. The person will also be monitored in his/her home, trying to detect deviations from expected situations and habits.

Home support will extend to normal personal and family life. Intelligent Homes will be a reality. Home residents will pay less attention to routine home management aspects, for example, how many bottles of red wine are available for the week's meals or whether all the ingredients for a cake are available.

AmI for job support is also expected. Decision Support Systems will be oriented to on-the-job environments. This will be clear in offices, meeting rooms, call centres, control centres, and plants.
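The idea of placing an automatic emergency call when acquired vital signs indicate significant trouble can be sketched as a simple rule over a stream of samples. This is a minimal illustration only: the sign names, the limits and the rule that requires two consecutive out-of-range samples are assumptions made for the example, not clinical guidance and not the behaviour of any system mentioned in this article.

```python
def check_vitals(sample, limits):
    """Return the sorted names of vital signs outside their (low, high) limits."""
    return sorted(
        name for name, value in sample.items()
        if name in limits and not (limits[name][0] <= value <= limits[name][1])
    )


def monitor(samples, limits, consecutive=2):
    """Return the index of the sample at which an emergency call would be
    triggered, or None if it never is.

    A call is triggered after `consecutive` abnormal samples in a row, a
    hypothetical rule meant to ignore isolated sensor glitches.
    """
    run = 0
    for i, sample in enumerate(samples):
        run = run + 1 if check_vitals(sample, limits) else 0
        if run >= consecutive:
            return i
    return None


# Illustrative limits only; real thresholds would be set by clinicians.
limits = {"systolic_bp": (90, 160), "temperature_c": (35.0, 38.5)}
samples = [
    {"systolic_bp": 120, "temperature_c": 36.8},
    {"systolic_bp": 170, "temperature_c": 36.9},
    {"systolic_bp": 175, "temperature_c": 39.0},
]
call_at = monitor(samples, limits)
```

Detecting deviations from a person's expected habits would need a learned model of normal behaviour rather than fixed limits; the sketch covers only the threshold case.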

CONCLUSION

This article presents the state of the art of the Ambient Intelligence field. After the history of the concept, we defined some related concepts and illustrated them with examples. There is a long way to go in order to achieve the Ambient Intelligence concept; however, in the future this concept will be regarded as one of the landmarks in the development of Artificial Intelligence.


REFERENCES

Brumitt, B., Meyers, B., Krumm, J., Kern, A., Shafer, S. (2000). EasyLiving: Technologies for Intelligent Environments. Lecture Notes in Computer Science, vol. 1927, pp. 97-119.
Constantini, S., Inverardi, P., Mostarda, L., Tocchio, A., Tsintza, P. (2007). User Profile Agents for Cultural Heritage fruition. Artificial and Ambient Intelligence. Proc. of the Artificial Intelligence and Simulation of Behaviour Annual Convention, pp. 30-33.
Dillon, T. (2006). Pervasive and Ubiquitous Computing. Futurelab. Available at http://www.futurelab.org.uk/viewpoint/art71.htm.
Esler, M., Hightower, J., Anderson, T., Borrielo, J. (1999). Next century challenges: data-centric networking for invisible computing: the Portolano project at the University of Washington. Proceedings of the 5th annual ACM/IEEE international conference on Mobile computing and networking, pp. 256-262.
ISTAG (2001). Scenarios for Ambient Intelligence in 2010, European Commission Report.
ISTAG (2002). Strategic Orientations & Priorities for IST in FP6, European Commission Report.
ISTAG (2003). Ambient Intelligence: from vision to reality, European Commission Report.
Marreiros, G., Santos, R., Ramos, C., Neves, J., Novais, P., Machado, J., Bulas-Cruz, J. (2007). Ambient Intelligence in Emotion Based Ubiquitous Decision Making. Proc. Artificial Intelligence Techniques for Ambient Intelligence, IJCAI’07 – Twentieth International Joint Conference on Artificial Intelligence. Hyderabad, India.
McCulloch, W.S., & Pitts, W. (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics. (5) 115-133.
Park, D., Nam, T., Shi, C., Golub, G., Van Loan, C. (2006). Designing an immersive tour experience system for cultural tour sites. ACM Press. New York, NY. pp. 1193-1198.
Pilato, G., Augello, A., Santangelo, A., Gentile, A., Gaglio S. (2006). An intelligent multimodal site-guide for the Parco Archeologico della Valle dei Templi in Agrigento. Proc. of the First Workshop in Intelligent Technologies for Cultural Heritage Exploitation. European Conference on Artificial Intelligence.
Ramos, C. (2007). Ambient Intelligence – a State of the Art from Artificial Intelligence perspective. Proceedings of EPIA’2007 – the Portuguese Conference on Artificial Intelligence.
Rothi J., Yen D. (1990). Why American Express Gambled on an Expert Data Base. Information Strategy: The Executive´s Journal, 6(3), pp. 16-22.
Rudolph, L. (2001). Project Oxygen: Pervasive, Human-Centric Computing - An Initial Experience. Lecture Notes in Computer Science, vol. 2068.
Särelä A., Korhonen I., Lötjönen L., Sola M., Myllymäki M. (2003). IST Vivago® - an intelligent social and remote wellness monitoring system for the elderly. In: Proceedings of the 4th Annual IEEE EMBS Special Topic Conference on Information Technology Applications in Biomedicine, pp. 362-365.
Shortliffe, E. (1976). Computer-Based Medical Consultations: MYCIN; Elsevier - North Holland.
Thorpe, C., Hebert, M.H., Kanade, T., Shafer, S.A. (1988). Vision and navigation for the Carnegie-Mellon Navlab, IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(3), 362-373.
Vale, Z., Moura, A., Fernandes, M., Marques, A., Rosado, A., Ramos, C. (1997). SPARSE: An Intelligent Alarm Processor and Operator Assistant, IEEE Expert - Special Track on AI Applications in the Electric Power Industry, 12(3), pp. 86-93.
Weiser, M. (1991). The Computer for the Twenty-First Century. Scientific American. September 1991. pp. 94-104.
Wichert R., Hellenschmidt M. (2006). Intelligent Systems. Ambient Intelligence solutions for Intelligent Environments. Thematic Brochure of INI-GraphicsNet, pp. 12-13, n.1.



TERMS AND DEFINITIONS

Ambient Intelligence: Ambient Intelligence (AmI) deals with a new world where computing devices are spread everywhere, allowing human beings to interact with physical-world environments in an intelligent and unobtrusive way. These environments should be aware of the needs of people, customizing requirements and forecasting behaviours.

Context Awareness: Context Awareness means that the system is aware of the current situation it is dealing with.

Embedded Systems: Embedded Systems means that electronic and computing devices are embedded in everyday objects or goods.

Intelligent Decision Room: A decision-making space, e.g. a meeting room or a control center, equipped with intelligent devices and/or systems to support decision-making processes.

Intelligent Home: A home equipped with several electronic and interactive devices to help residents manage conventional home decisions.

Intelligent Transportation Systems: Intelligent Systems applied to the area of transport, namely to traffic and travelling issues.

Intelligent Vehicles: A vehicle equipped with sensors and decision support components.

Pervasive Computing: Pervasive Computing relates to all the physical parts of our lives; the user may have no notion of the computing devices and details related to these physical parts.

Ubiquitous Computing: Ubiquitous Computing means that we have access to computing devices anywhere in an integrated and coherent way.

Analytics for Noisy Unstructured Text Data I

Shourya Roy
IBM Research, India Research Lab, India

L. Venkata Subramaniam
IBM Research, India Research Lab, India

INTRODUCTION

Accdrnig to rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer be at the rghit pclae. Tihs is bcuseae the human mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe (see endnote 1).

Unfortunately, computing systems are not yet as smart as the human mind. Over the last couple of years, a significant number of researchers have been focussing on noisy text analytics. Noisy text data is found in informal settings (online chat, SMS, e-mails, message boards, among others) and in text produced by automated speech recognition or optical character recognition systems. Noise can degrade the performance of other information processing algorithms such as classification, clustering, summarization and information extraction. We will identify some of the key research areas for noisy text and give a brief overview of the state of the art. These areas are (i) classification of noisy text, (ii) correcting noisy text, and (iii) information extraction from noisy text. We cover the first in this chapter and the latter two in the next chapter.

We define noise in text as any kind of difference in the surface form of an electronic text from the intended, correct or original text. We see such noisy text every day in various forms. Each of them has unique characteristics and hence requires special handling. We introduce some such forms of noisy textual data in this section.

Online Noisy Documents: E-mails, chat logs, scrapbook entries, newsgroup postings, threads in discussion fora, blogs, etc., fall under this category. People are typically less careful about the correctness of written content in such informal modes of communication. These documents are characterized by frequent misspellings, commonly and not so commonly used abbreviations, incomplete sentences, missing punctuation and so on. Noisy documents are almost always human interpretable, if not by everyone, at least by their intended readers.

SMS: Short Message Services are becoming more and more common. Language usage in SMS text differs significantly from the standard form of the language. The urge towards shorter messages, which facilitate faster typing, and the need for semantic clarity shape the structure of this non-standard form, known as the texting language (Choudhury et al., 2007).

Text Generated by ASR Devices: Automatic speech recognition (ASR) is the process of converting a speech signal to a sequence of words. An ASR system takes a speech signal such as monologues, discussions between people, telephone conversations, etc. as input and produces a string of words, typically not demarcated by punctuation, as the transcript. An ASR system consists of an acoustic model, a language model and a decoding algorithm. The acoustic model is trained on speech data and the corresponding manual transcripts. The language model is trained on a large monolingual corpus. The ASR system converts audio into text by searching the acoustic model and language model space using the decoding algorithm. Most conversations between agents and customers at contact centers today are recorded. To process this data and obtain customer intelligence, it is necessary to convert the audio into text.

Text Generated by OCR Devices: Optical character recognition, or ‘OCR’, is a technology that allows digital images of typed or handwritten text to be converted into an editable text document. It takes a picture of text and translates the text into Unicode or ASCII. For handwritten optical character recognition, the rate of recognition is 80% to 90% with clean handwriting.
Call Logs in Contact Centers: Today’s contact centers (also known as call centers, BPOs, KPOs) produce huge amounts of unstructured data in the form of call logs, apart from emails, call transcriptions, SMS, chat transcripts, etc. Agents are expected to summarize an interaction as soon as they are done with it and before picking up the next one. Because agents work under immense time pressure, the summary logs are very poorly written and are sometimes difficult even for humans to interpret. Analysis of such call logs is important for identifying problem areas, agent performance, evolving problems, etc.

In this chapter we focus on automatic classification of noisy text. Automatic text classification refers to segregating documents into different topics depending on their content, for example categorizing customer emails according to topics such as billing problem, address change, product enquiry, etc. It has important applications in email categorization, building and maintaining web directories (e.g. DMoz), spam filtering, automatic call and email routing in contact centers, filtering of pornographic material and so on.

NOISY TEXT CATEGORIZATION

The text classification task is one of learning models for a given set of classes and applying these models to new unseen documents for class assignment. This is an important component in many knowledge extraction tasks: real-time sorting of email or files into folder hierarchies, topic identification to support topic-specific processing operations, structured search and/or browsing, or finding documents corresponding to long-term standing interests or more dynamic task-based interests. Two types of classifiers are commonly found, viz. statistical classifiers and rule based classifiers. In statistical techniques a model is typically trained on a corpus of labelled data, and once trained the system can be used for automatic class assignment of unseen data. A survey of text classification can be found in Aas & Eikvil (1999).

Given a training document collection D = {d1, d2, ..., dM} with true classes {y1, y2, ..., yM}, the task is to learn a model. This model is used for categorizing a new unlabelled document du. Typically, words appearing in the text are used as features. Other applications, including search, rely heavily on taking the markup or link structure of documents into account, but classifiers depend only on the content of the documents or the collection of words present in them. Once features are extracted from documents, each document is converted into a document vector. Documents are represented in a vector space; each dimension of this space represents a single feature, and the importance of that feature in the document gives the distance from the origin along that dimension. The simplest representation of document vectors uses the binary event model, where if a feature j ∈ V (the set of features) appears in document di, then the jth component of di is 1; otherwise it is 0.

One of the most popular statistical classification techniques is naive Bayes (McCallum, 1998). In the naive Bayes technique, the probability of a document d belonging to class c is computed as follows, where d_j denotes the jth component of d:

\[
\Pr(c \mid d) = \frac{\Pr(c, d)}{\Pr(d)} = \frac{\Pr(c)\,\Pr(d \mid c)}{\Pr(d)} \propto \Pr(c)\,\Pr(d \mid c) \propto \Pr(c) \prod_{j} \Pr(d_j \mid c)
\]

The final approximation of the above equation refers to the naive part of such a model, i.e., the assumption of word independence, which means that the features are assumed to be conditionally independent given the class variable.

Rule-based learning systems have been adopted for the document classification problem since they have considerable appeal. They perform well at finding simple axis-parallel frontiers. A typical rule-based classification scheme for a category, say C, has the form:

Assign category C if antecedent
or
Do not assign category C if antecedent

The antecedent in the premise of a rule usually involves some kind of feature value comparison. A rule is said to cover a document, or a document is said to satisfy a rule, if all the feature value comparisons in the antecedent of the rule are true for the document. One of the well known works in the rule based text classification domain is RIPPER. Like a standard separate-and-conquer algorithm, it builds a rule set incrementally. When a rule is found, all documents covered by the rule, both positive and negative, are discarded. The rule is then added to the rule set. The remaining documents are used to build other rules in the next iteration.

In both statistical and rule based text classification techniques, the content of the text is the sole determiner of the category to be assigned. However, noise in the text distorts the content, and hence one can expect categorization performance to be affected by noise. Classifiers are essentially trained to identify correlations between extracted features (words) and different categories, which can later be used to categorize new documents. For example, words like "exciting offer get a free laptop" might have a stronger correlation with the category spam emails than with non-spam emails. Noise in text distorts this feature space: "excitinng ofer get frree lap top" will be a new set of features, and the categorizer will not be able to relate it to the spam emails category. The feature space explodes, as the same feature can appear in different forms due to spelling errors, poor recognition, wrong transcription, etc. In the remaining part of this section we give an overview of how people have approached the problem of categorizing noisy text.
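As a rough illustration of the statistical approach described in this section, the following Python sketch trains a naive Bayes classifier under the binary event model on a toy corpus, and then shows that most tokens of the misspelt example fall outside the learnt feature set. The toy documents, the class names "spam" and "ham", and the use of Laplace (add-one) smoothing are assumptions made for this example; they are not taken from the cited work.

```python
import math
from collections import defaultdict

def train(docs, labels):
    """Estimate Pr(c) and Pr(feature present | c) with Laplace smoothing."""
    vocab = {w for d in docs for w in d.split()}
    classes = sorted(set(labels))
    n_docs = defaultdict(int)                        # documents per class
    n_word = {c: defaultdict(int) for c in classes}  # docs of class c containing w
    for d, c in zip(docs, labels):
        n_docs[c] += 1
        for w in set(d.split()):
            n_word[c][w] += 1
    prior = {c: n_docs[c] / len(docs) for c in classes}
    cond = {c: {w: (n_word[c][w] + 1) / (n_docs[c] + 2) for w in vocab}
            for c in classes}
    return vocab, prior, cond

def log_posterior(doc, vocab, prior, cond):
    """Unnormalised log Pr(c) + sum_j log Pr(d_j | c) over all features."""
    present = set(doc.split()) & vocab   # unseen tokens carry no information
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for w in vocab:
            p = cond[c][w]
            s += math.log(p if w in present else 1.0 - p)
        scores[c] = s
    return scores

def classify(doc, model):
    scores = log_posterior(doc, *model)
    return max(scores, key=scores.get)

# Toy training data (hypothetical).
train_docs = ["exciting offer get a free laptop",
              "free offer click now",
              "meeting agenda for monday",
              "please review the project report"]
train_labels = ["spam", "spam", "ham", "ham"]
model = train(train_docs, train_labels)

clean = "exciting offer get a free laptop"
noisy = "excitinng ofer get frree lap top"
known_in_noisy = set(noisy.split()) & model[0]   # features that survive the noise
print(classify(clean, model), sorted(known_in_noisy))
```

In this toy setting only one of the six noisy tokens is still a known feature, so the classifier has to decide almost entirely from the absence of features, which is the distortion of the feature space described above.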

Categorization of OCRed Documents

Electronically recognized handwritten documents and documents generated by an OCR process are typical examples of noisy text because of the errors introduced by the recognition process. Vinciarelli (2005) has studied the characteristics of the noise present in such data and its effects on categorization accuracy. A subset of documents from the Reuters-21578 text classification dataset was taken and noise was introduced using two methods. In the first, a subset of documents was manually written and recognized using an offline handwriting recognition system. In the second, the OCR based extraction process was simulated by randomly changing a certain percentage of characters. According to this study, for recall values up to 60-70 percent (depending on the source), the categorization system is robust to noise even when the Term Error Rate is higher than 40 percent. It was also observed that the results on the handwritten data were lower than those obtained from the OCR simulations.

Generic systems for text categorization based on statistical analysis of representative text corpora have been proposed (Bayer et al., 1998). Features are extracted from training texts by selecting substrings from actual word forms and applying statistical information and general linguistic knowledge, followed by dimensionality reduction by linear transformation. The actual categorization system is based on a minimum least-squares approach. The system is evaluated on the tasks of categorizing abstracts of paper-based German technical reports and business letters concerning complaints. Approximately 80% classification accuracy is obtained, and the system is seen to be very robust against recognition or typing errors. Issues with categorizing OCRed documents are also discussed by other authors (Brooks & Teahan, 2007), (Hoch, 1994) and (Taghva et al., 2001).
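The simulated-OCR setup mentioned above (randomly changing a certain percentage of characters) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, only letters are substituted, and the term error rate is taken here to be the fraction of whitespace-separated terms that differ from the original, which may not be the exact measure used in the cited study.

```python
import random
import string

def simulate_ocr_noise(text, char_error_rate, seed=0):
    """Replace roughly `char_error_rate` of the letters with a different letter."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < char_error_rate:
            # Substitute a different lowercase letter; spaces are left intact,
            # so the number of terms does not change.
            choices = [c for c in string.ascii_lowercase if c != ch.lower()]
            out.append(rng.choice(choices))
        else:
            out.append(ch)
    return "".join(out)

def term_error_rate(original, noisy):
    """Fraction of terms that differ (assumes both texts have the same number of terms)."""
    orig_terms, noisy_terms = original.split(), noisy.split()
    assert len(orig_terms) == len(noisy_terms)
    wrong = sum(o != n for o, n in zip(orig_terms, noisy_terms))
    return wrong / len(orig_terms)

text = "stock prices rose sharply in early trading"
noisy = simulate_ocr_noise(text, char_error_rate=0.1)
print(noisy, term_error_rate(text, noisy))
```

Because one wrong character is enough to corrupt a whole term, even a modest character error rate can translate into a much higher term error rate.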

Categorization of ASRed Documents

Automatic Speech Recognition (ASR) is the process of converting an acoustic signal to a sequence of words. Researchers have proposed different techniques for speech recognition tasks based on Hidden Markov models (HMM), neural networks and dynamic time warping (DTW) (Trentin & Gori, 2001). The performance of an ASR system is typically measured in terms of Word Error Rate (WER), which is derived from the Levenshtein distance, working at the word level instead of the character level. WER can be computed as

\[
\mathrm{WER} = \frac{S + D + I}{N}
\]

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the number of words in the reference. Bahl et al. (1995) have built an ASR system and demonstrated its capability on benchmark datasets. ASR systems give rise to word substitutions, deletions and insertions, while OCR systems produce essentially word substitutions. Moreover, ASR systems are constrained by a lexicon and can output only words belonging to it, while OCR systems can work without a lexicon (this corresponds to the possibility of transcribing any character string) and can output sequences of symbols that do not necessarily correspond to actual words. Such differences are expected to have a strong influence on the performance of systems designed for categorizing ASRed documents in comparison to the categorization of OCRed documents.

A lot of work has appeared on automatic call type classification for the purpose of categorizing calls (Tang et al., 2003), call routing (Kuo and Lee, 2003; Haffner et al., 2003), obtaining call log summaries (Douglas et al., 2005), and agent assisting and monitoring (Mishne et al., 2005). Here calls are classified based on the transcription from an ASR system. One interesting study of the effect of ASR noise on text classification was done on a subset of the benchmark text classification dataset Reuters-21578 (see endnote 2) (Agarwal et al., 2007). The authors read out and automatically transcribed 200 documents and applied a text classifier trained on the clean Reuters-21578 training corpus (see endnote 3). Surprisingly, in spite of the high degree of noise, they did not observe much degradation in accuracy.
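The Word Error Rate defined in this section can be computed with the standard dynamic-programming form of the Levenshtein distance applied to word sequences. The sketch below is one straightforward way to do it; the function name and the example sentences are assumptions made for illustration, and it returns only the combined count S + D + I divided by N rather than the individual counts.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, using word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1 when the hypothesis contains many insertions, since the denominator counts only the words of the reference.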

Effect of Spelling Errors on Categorization

Spelling errors are an integral part of written text, electronic as well as non-electronic. Every reader of this book must have been scolded by a teacher in school for spelling words wrongly! In this era of electronic text, people have become less careful while writing, resulting in poorly written text containing abbreviations, short forms, acronyms and wrong spellings. Such electronic text documents, including emails, chat logs, postings and SMSs, are sometimes difficult to interpret even for human beings. It goes without saying that text analytics on such noisy data is a non-trivial task.

Wrong spellings can affect automatic classification performance in multiple ways, depending on the nature of the classification technique being used. In the case of statistical techniques, spelling differences distort the feature space. If both the training and the test corpus are noisy, the classifier will treat variants of the same word as different features while learning the model. As a result, the observed joint probability distribution will differ from the actual distribution. If the proportion of wrongly spelt words is high, the distortion can be significant and will hurt the accuracy of the resulting classifier. However, if the classifier is trained on a clean corpus and the test documents are noisy, then wrongly spelt words will be treated as unseen words and will not help in classification. In an unlikely situation, a wrongly spelt word present in a test document may coincide with a different valid feature and, worse, may be a valid indicative feature of a different class.

A standard technique in the text classification process is feature selection, which happens after feature extraction and before training. Feature selection typically employs some statistical measure over the training corpus and ranks features in order of the amount of information (correlation) they have with respect to the class labels of the classification task at hand. After the feature set has been ranked, the top few features are retained (typically of the order of hundreds or a few thousand) and the others are discarded. Feature selection should be able to eliminate wrongly spelt words present in the training data provided (i) the proportion of wrongly spelt words is not very large and (ii) there is no regular pattern in the spelling errors (see endnote 4). However, it has been observed that even at a high degree of spelling errors the classification accuracy does not suffer much (Agarwal et al., 2007).

Rule based classification techniques are also negatively affected by spelling errors. If the training data contains spelling errors, then some of the rules may not attain the required statistical significance. Due to spelling errors present in the test data, a valid rule may not fire and, worse, an invalid rule may fire, leading to a wrong categorization. Suppose RIPPER has learnt a rule set like:

Assign category “sports” IF
(the document contains sports) OR
(the document contains exercise AND outdoor) OR
(the document contains exercise but not homework, exam) OR
(the document contains play AND rule) OR
……

A hypothetical test document containing repeated occurrences of exercise, each time wrongly spelt as exarcise, will not be assigned to the sports category, leading to a misclassification.
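The failure mode of the hypothetical rule set above can be made concrete with a small Python sketch. The function below hard-codes the four visible disjuncts of the example rule set (the elided rules are not reproduced), and it reads "but not homework, exam" as "contains neither homework nor exam", which is an assumption. It is hand-written for illustration and is not output of RIPPER.

```python
def assign_sports(document):
    """Return True if the hand-coded example rule set assigns category 'sports'."""
    words = set(document.lower().split())
    return (
        "sports" in words
        or {"exercise", "outdoor"} <= words
        # Assumed reading: exercise present, and neither homework nor exam present.
        or ("exercise" in words and not ({"homework", "exam"} & words))
        or {"play", "rule"} <= words
    )

clean_doc = "regular exercise improves stamina and exercise builds strength"
noisy_doc = clean_doc.replace("exercise", "exarcise")

print(assign_sports(clean_doc))   # a rule fires on the clean text
print(assign_sports(noisy_doc))   # no rule fires once the keyword is misspelt
```

Every comparison in the antecedent is an exact match on the surface form, so a single consistent misspelling is enough to prevent any rule from covering the document.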

CONCLUSION

In this chapter we have looked at noisy text analytics. This topic is gaining in importance as more and more noisy data is generated and needs processing. In particular we have looked at techniques for correcting noisy text and for doing classification. We have presented a survey of existing techniques in the area and have shown that even though it is a difficult problem it is possible to address it with a combination of new and existing techniques.


REFERENCES

K. Aas & L. Eikvil (1999). Text Categorisation: A Survey. Technical report, Norwegian Computing Center.

S. Agarwal, S. Godbole, D. Punjani & S. Roy (2007). How Much Noise is too Much: A Study in Automatic Text Classification. In Proceedings of the IEEE International Conference on Data Mining series (ICDM), Omaha, Nebraska (To Appear).

L. R. Bahl, S. Balakrishnan-Aiyer, J. Bellegarda, M. Franz, P. Gopalakrishnan, D. Nahamoo, M. Novak, M. Padmanabhan, M. Picheny & S. Roukos (1995). Performance of the IBM Large Vocabulary Continuous Speech Recognition System on the ARPA Wall Street Journal Task. In Proc. ICASSP ’95, Detroit, MI, (41-44).

T. Bayer, U. Kressel, H. Mogg-Schneider & Renz (1998). Categorizing Paper Documents. Computer Vision and Image Understanding, 70(3), (299-306).

R. Brooks & L. J. Teahan (2007). A Practical Implementation of Automatic Text Categorization and Correction of the Conversion of Noisy OCR Documents into Braille and Large Print. Proceedings of Workshop on Analytics for Noisy Unstructured Text Data (at IJCAI 2007), Jan, Hyderabad, India.

S. Douglas, D. Agarwal, T. Alonso, R. M. Bell, M. Gilbert, D. F. Swayne & C. Volinsky (2005). Mining Customer Care Dialogs for “Daily News”. IEEE Trans. on Speech and Audio Processing, 13(5), (652-660).

P. Haffner, G. Tur & J. H. Wright (2003). Optimizing SVMs for Complex Call Classification. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing.

R. Hoch (1994). Using IR Techniques for Text Classification in Document Analysis. In Proceedings of 17th ACM SIGIR Conference on Research and Development in Information Retrieval, (31-40).

H.-K. J. Kuo & C.-H. Lee (2003). Discriminative Training of Natural Language Call Routers. IEEE Trans. on Speech and Audio Processing, 11(1), (24-35).

A. McCallum & K. Nigam (1998). A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization.

G. Mishne, D. Carmel, R. Hoory, A. Roytman & A. Soffer (2005). Automatic Analysis of Call-center Conversations. Conference on Information and Knowledge Management, October 31-November 5, Bremen, Germany.

K. Taghva, T. Nartker, J. Borsack, S. Lumos, A. Condit & Young (2001). Evaluating Text Categorization in the Presence of OCR Errors. In Proceedings of IS&T SPIE 2001 International Symposium on Electronic Imaging Science and Technology, (68-74).

M. Tang, B. Pellom & K. Hacioglu (2003). Calltype Classification and Unsupervised Training for the Call Center Domain. Automatic Speech Recognition and Understanding Workshop, November 30-December 4, St. Thomas, U.S. Virgin Islands.

E. Trentin & M. Gori (2001). A Survey of Hybrid ANN/HMM Models for Automatic Speech Recognition. Neurocomputing, Volume 37, (91-126).

A. Vinciarelli (2005). Noisy Text Categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, no. 12, (1882-1895).

Vlachos (2006). Active Annotation. In Proceedings of the EACL 2006 Workshop on Adaptive Text Extraction and Mining, Trento, Italy.

KEY TERMS

Automatic Speech Recognition: Machine recognition and conversion of spoken words into text.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns, relationships or obtaining systems that perform useful tasks such as classification, prediction, estimation, or affinity grouping.

Information Extraction: Automatic extraction of structured knowledge from unstructured documents.

Noisy Text: Text with any kind of difference in the surface form from the intended, correct or original text.

Optical Character Recognition: Translation of images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text.

Rule Induction: Process of learning, from cases or instances, if-then rule relationships that consist of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).

Text Analytics: The process of extracting useful and structured knowledge from unstructured documents to find useful associations and insights.

Text Classification (or Text Categorization): The task of learning models for a given set of classes and applying these models to new unseen documents for class assignment.

ENDNOTES

1. According to http://www.mrc-cbu.cam.ac.uk/%7Emattd/Cmabrigde/, this is an internet hoax. However, we found it interesting and hence included it here.
2. http://www.daviddlewis.com/resources/testcollections/
3. This dataset is available from http://kdd.ics.uci.edu/databases/reuters_transcribed/reuters_transcribed.html
4. Note: this assumption may not hold true in the case of cognitive errors.

Analytics for Noisy Unstructured Text Data II

L. Venkata Subramaniam
IBM Research, India Research Lab, India

Shourya Roy
IBM Research, India Research Lab, India

INTRODUCTION

The importance of text mining applications is growing in proportion to the exponential growth of electronic text. Along with the growth of the internet, many other sources of electronic text have become popular. With the increasing penetration of the internet, many forms of communication and interaction, such as email, chat, newsgroups, blogs, discussion groups and scraps, have become increasingly popular. These generate huge amounts of noisy text data every day. Apart from these, the other big contributors to the pool of electronic text documents are call centres and customer relationship management organizations, in the form of call logs, call transcriptions, problem tickets, complaint emails, etc.; electronic text generated by the Optical Character Recognition (OCR) process from handwritten and printed documents; and mobile text such as Short Message Service (SMS) messages. Though the nature of each of these documents is different, there is a common thread between all of them: the presence of noise.

An example of information extraction is the extraction of instances of corporate mergers, more formally MergerBetween(company1,company2,date), from an online news sentence such as: “Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp.” Another example is the extraction of opinions, Opinion(product1,good), from a blog post such as: “I absolutely liked the texture of SheetK quilts.”

At a superficial level, there are two ways to extract information from noisy text. The first is to clean the text by removing noise and then apply existing state of the art techniques for information extraction. Therein lies the importance of techniques for automatically correcting noisy text. In this chapter, we first review some work in the area of noisy text correction. The second approach is to devise extraction techniques that are robust with respect to noise. Later in this chapter, we will see how the task of information extraction is affected by noise.

NOISY TEXT CORRECTION

Before moving on to techniques for processing noisy text, we briefly introduce methods for correcting noisy text. One of the most common forms of noise in text is wrong spelling. Kukich provides a comprehensive survey of techniques for detecting and correcting spelling errors (Kukich, 1992). According to this survey, three types of nonword misspellings are typically found, viz. typographic such as teh, speel; cognitive such as recieve, conspeeracy; and phonetic such as abiss, nacherly. A distinction must be made between automatically detecting such errors and automatically correcting them. The latter is a much harder problem. Most of the recent work in this area is about correcting spelling mistakes automatically. Golding and Roth (Golding & Roth, 1999) proposed a combination of a variant of Winnow, a multiplicative weight-update algorithm, and weighted majority voting for context sensitive spelling correction. Mangu and Brill (Mangu & Brill, 1997) have shown that a small set of human understandable rules is more meaningful than a large set of opaque features and weights. Hybrid methods capturing the context using trigrams of parts-of-speech tags together with a feature based method have also been proposed to handle context sensitive spelling correction (Golding & Schabes, 1996). There is a lot of work related to automatic correction of spelling errors (Agirre et al., 1998), (Zamora et al., 1983), (Golding, 1995). A complete bibliography of the work related to spelling error detection and correction can be found in (Beebe, 2005). On a related note, automatic spelling error correction techniques have been applied to other applications such as semantic role labelling (Sang et al., 2005). There is also recent work on correcting SMS text (Aw et al., 2006) (Choudhury et al., 2007), OCR errors (Nartker et al., 2003) and ASR errors (Sarma & Palmer, 2004).
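To make the difference between detecting and correcting a nonword error concrete, the following Python sketch performs the simplest kind of isolated-word correction: a word missing from a lexicon is detected as an error and replaced by the nearest lexicon entry under character-level edit distance. This is only a baseline sketch and is not any of the context-sensitive methods cited above; the tiny lexicon, the function names and the distance threshold of 2 are assumptions chosen for the example.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def correct(word, lexicon, max_distance=2):
    """Return the closest lexicon word, or the word itself if none is close enough."""
    if word in lexicon:
        return word                       # detection: a known word is left alone
    best = min(lexicon, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_distance else word

# Hypothetical lexicon covering the misspelling examples from the text.
LEXICON = ["the", "spell", "receive", "conspiracy", "abyss", "naturally"]

for noisy in ["teh", "speel", "recieve", "abiss"]:
    print(noisy, "->", correct(noisy, LEXICON))
```

A purely distance-based corrector of this kind has no way of choosing between equally close candidates or of catching real-word errors, which is where the context-sensitive approaches cited above come in.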

INFORMATION EXTRACTION FROM NOISY TEXT

The goal of Information Extraction (IE) is to automatically extract structured information from unstructured documents. The extracted structured information has to be contextually and semantically well-defined data from a given domain. A typical application of IE is to scan a set of documents written in natural language and populate a database with the extracted information. The Message Understanding Conference (MUC) was one effort at codifying the IE task and expanding it (Chinchor, 1998).

There are two basic approaches to the design of IE systems. The first is the knowledge engineering approach, where a domain expert writes a set of rules to extract the sought-after information. Typically the process of building the system is iterative: a set of rules is written, the system is run, and the output is examined to see how the system is performing. The domain expert then modifies the rules to overcome any under- or over-generation in the output. The second is the automatic training approach. This approach is similar to classification, in that the texts are appropriately annotated with the information to be extracted. For example, if we would like to build a city name extractor, then the training set would include documents with all the city names marked. An IE system would be trained on this annotated corpus to learn the patterns that help in extracting the necessary entities.

An information extraction system typically consists of natural language processing steps such as morphological processing, lexical processing and syntactic analysis. These include stemming to reduce inflected forms of words to their stem, parts of speech tagging to assign labels such as noun, verb, etc. to each word, and parsing to determine the grammatical structure of sentences.
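As a small illustration of the knowledge engineering approach, the sketch below hand-writes a single extraction rule, expressed as a regular expression, for the merger example from the introduction of this chapter. The pattern, the function name and the restriction to company names ending in "Inc." or "Corp." are assumptions made for this example, and the date argument of MergerBetween is left out for brevity. The second call shows how the same rule produces nothing on a noisy version of the sentence.

```python
import re

# A company name: one or more capitalised words followed by "Inc." or "Corp."
COMPANY = r"[A-Z]\w*(?: [A-Z]\w*)* (?:Inc|Corp)\."

MERGER_RULE = re.compile(
    rf"(?P<acquirer>{COMPANY}) announced (?:its|their) acquisition of "
    rf"(?P<target>{COMPANY})"
)

def extract_merger(sentence):
    """Return (acquirer, target) if the hand-written rule matches, else None."""
    match = MERGER_RULE.search(sentence)
    if match is None:
        return None
    return match.group("acquirer"), match.group("target")

clean = "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."
noisy = "yesterday foo inc anounced their acquisision of bar corp"

print(extract_merger(clean))
print(extract_merger(noisy))
```

The rule depends on capitalisation, punctuation and exact spelling, all of which are typically lost or distorted in noisy text, so an expert would have to keep widening the rules to cover each new variant.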


Named Entity Annotation of Web Posts

Extraction of named entities is a key IE task. It seeks to locate and classify atomic elements in the text into predefined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc. Entity recognition systems use either rule based techniques or statistical models. Typically a parser or a parts of speech tagger identifies elements such as nouns, noun phrases, or pronouns. These elements, along with surface forms of the text, are used to define templates for extracting the named entities. For example, to tag company names it would be desirable to look at noun phrases that contain the words company or incorporated. These rules can be learnt automatically from a tagged corpus or defined manually. Most known approaches do this on clean, well-formed text. However, named entity annotation of web posts such as online classifieds, product listings, etc. is harder because these texts are not grammatical or well written. In such cases reference sets have been used to annotate parts of the posts (Michelson & Knoblock, 2005). The reference set is thought of as a relational set of data with a defined schema and consistent attribute values. Posts are then matched to their nearest records in the reference set. In the biological domain, gene name annotation, even though it is performed on well written scientific articles, can be thought of in the context of noise, because many gene names overlap with common English words or biomedical terms. There have been studies on the performance of a gene name annotator when trained on noisy data (Vlachos, 2006).
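The idea of matching a noisy post to its nearest record in a reference set can be sketched as follows. This is only an illustration of the general idea and not the method of the cited work: the reference records, the attribute names, and the scoring (the average, over attributes, of the best string similarity between the attribute value and any token of the post, computed with Python's standard difflib.SequenceMatcher) are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical reference set with a fixed schema and consistent values.
REFERENCE_SET = [
    {"make": "honda", "model": "accord"},
    {"make": "honda", "model": "civic"},
    {"make": "toyota", "model": "camry"},
]

def token_similarity(a, b):
    """Similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()

def record_score(post_tokens, record):
    """Average, over attributes, of the best match between the value and any post token."""
    best = [max(token_similarity(value, tok) for tok in post_tokens)
            for value in record.values()]
    return sum(best) / len(best)

def annotate(post, reference_set):
    """Return the reference record nearest to the post, or None for an empty post."""
    tokens = post.lower().split()
    if not tokens:
        return None
    return max(reference_set, key=lambda rec: record_score(tokens, rec))

post = "hnda acord 2004 low miles gd cond"
print(annotate(post, REFERENCE_SET))
```

Once the nearest record has been found, its clean attribute values can be used to annotate the post, so that the misspelt tokens are labelled as make and model without the post itself having to be grammatical.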

Information Extraction from OCRed Documents

Documents obtained from OCR may contain not only unknown words and compound words, but also incorrect words due to OCR errors. Miller et al. (2000) have measured the effect of OCR noise on IE performance. Many IE methods work directly on the document image to avoid errors resulting from conversion to text. They adopt keyword matching by searching for string patterns and then use global document models consisting of keyword models and their logical relationships to achieve robustness in matching (Lu & Tan, 2004). The presence of OCR errors has a detrimental effect on information access from these documents (Taghva et al., 2004). However, post-processing techniques that correct these errors exist and have been shown to give large improvements.
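As an illustration of what such post-processing can look like, the sketch below replaces each out-of-vocabulary token with its closest vocabulary word. This is an assumption made for the example, not the method of the cited works: it uses the similarity ratio from Python's standard `difflib` module, and the vocabulary and the cutoff value of 0.8 are hypothetical.

```python
import difflib

def correct_token(token, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself
    when no vocabulary word is similar enough."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_text(text, vocabulary, cutoff=0.8):
    """Correct a whitespace-tokenized text token by token."""
    return " ".join(correct_token(t, vocabulary, cutoff) for t in text.split())

# Hypothetical vocabulary and OCR output ("0" for "o", "rn" for "m").
VOCABULARY = ["information", "extraction", "from", "documents"]
cleaned = correct_text("inf0rmation extracti0n from docurnents", VOCABULARY)
```

A context-free correction of this kind cannot repair errors that produce another valid word; that case calls for context-sensitive methods.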

Information Extraction from ASRed Documents

The output of an ASR system does not contain case information or punctuation. It has been shown that, in the absence of punctuation, the extraction of syntactic entities such as parts of speech and noun phrases is not accurate (Nasukawa et al., 2007), so IE from ASRed documents becomes harder. Miller et al. (2000) have shown how IE performance varies with ASR noise. It has also been shown that it is possible to build aggregate models from ASR data (Roy & Subramaniam, 2006). In that work, topical models are constructed by exploiting inter-document redundancy to overcome the noise; only a few natural language processing steps are used, and phrases are aggregated over the noisy collection to recover the clean underlying text.

FUTURE TRENDS

More and more data from sources such as chat, conversations, blogs, and discussion groups need to be mined to capture opinions, trends, issues, and opportunities. These forms of communication encourage informal language, which can be considered noisy due to spelling errors, grammatical errors, and informal writing styles. Companies are interested in mining such data to observe customer preferences and improve customer satisfaction. Online agents need to be able to understand web posts in order to take actions and communicate with other agents. Customers are interested in collated product reviews drawn from the web posts of other users. The nature of noisy text warrants moving beyond traditional text analytics techniques. There is a need for natural language processing techniques that are robust to noise, and for techniques that tackle textual noise both implicitly and explicitly.

CONCLUSION

In this chapter we have looked at information extraction from noisy text. This topic is gaining in importance as more and more noisy data is generated and useful information needs to be obtained from it. We have presented a survey of existing information extraction techniques, as well as some of the future trends in noisy text analytics.

REFERENCES

E. Agirre, K. Gojenola, K. Sarasola & A. Voutilainen (1998). Towards a Single Proposal in Spelling Correction. Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics (22-28).

Aw, M. Zhang, J. Xiao & J. Su (2006). A Phrase-Based Statistical Model for SMS Text Normalization. In Proceedings of the Joint conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics (ACL-COLING 2006), Sydney, Australia.

N. H. F. Beebe (2005). A Bibliography of Publications on Computer Based Spelling Error Detection and Correction. http://www.math.utah.edu/pub/tex/bib/spell.ps.gz.

N. Chinchor (1998). Overview of MUC-7. http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_proceedings/overview.html

M. Choudhury, R. Saraf, V. Jain, S. Sarkar & A. Basu (2007). Investigation and Modeling of the Structure of Texting Language. In Proceedings of the IJCAI 2007 Workshop on Analytics for Noisy Unstructured Text Data (AND 2007), Hyderabad, India.

R. Golding (1995). A Bayesian Hybrid Method for Context-Sensitive Spelling Correction. Proceedings of the Third Workshop on Very Large Corpora (39-53).

R. Golding & D. Roth (1999). A Winnow-Based Approach to Context-Sensitive Spelling Correction. Journal of Machine Learning. Volume 34 (1-3) (107-130).

R. Golding & Y. Schabes (1996). Combining Trigram-Based and Feature-Based Methods for Context-Sensitive Spelling Correction. Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics (71-78).

K. Kukich (1992). Techniques for Automatically Correcting Words in Text. ACM Computing Surveys. Volume 24 (4) (377-439).

Y. Lu & C. L. Tan (2004). Information Retrieval in Document Image Databases. IEEE Transactions on Knowledge and Data Engineering. Vol 16, No. 11. (1398-1410).

L. Mangu & E. Brill (1997). Automatic Rule Acquisition for Spelling Correction. Proc. 14th International Conference on Machine Learning. (187-194).

M. Michelson & C. A. Knoblock (2005). Semantic Annotation of Unstructured and Ungrammatical Text. In Proceedings of the International Joint Conference on Artificial Intelligence.

D. Miller, S. Boisen, R. Schwartz, R. Stone & R. Weischedel (2000). Named Entity Extraction from Noisy Input: Speech and OCR. Proceedings of the Sixth Conference on Applied Natural Language Processing.

T. Nartker, K. Taghva, R. Young, J. Borsack & A. Condit (2003). OCR Correction Based On Document Level Knowledge. In Proc. IS&T/SPIE 2003 Intl. Symp. on Electronic Imaging Science and Technology, volume 5010, Santa Clara, CA.

T. Nasukawa, D. Punjani, S. Roy, L. V. Subramaniam & H. Takeuchi (2007). Adding Sentence Boundaries to Conversational Speech Transcriptions Using Noisily Labeled Examples. In Proceedings of the IJCAI 2007 Workshop on Analytics for Noisy Unstructured Text Data (AND 2007), Hyderabad, India.

S. Roy & L. V. Subramaniam (2006). Automatic Generation of Domain Models for Call-Centers from Noisy Transcriptions. In Proceedings of the Joint conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics (ACL-COLING 2006), Sydney, Australia.

E. T. K. Sang, S. Canisius, A. van den Bosch & T. Bogers (2005). Applying Spelling Error Correction Techniques for Improving Semantic Role Labelling. In Proceedings of CoNLL.

Sarma & D. Palmer (2004). Context-based Speech Recognition Error Detection and Correction. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004.

K. Taghva, T. Nartker, J. Borsack, S. Lumos, A. Condit & Young (2001). Evaluating Text Categorization in the Presence of OCR Errors. In Proceedings of IS&T SPIE 2001 International Symposium on Electronic Imaging Science and Technology, (68-74).

K. Taghva, T. Nartker & J. Borsack (2004). Information Access in the Presence of OCR Errors. ACM Hardcopy Document Processing Workshop, Washington, DC, USA. (1-8).

E. M. Zamora, J. J. Pollock & A. Zamora (1983). The Use of Trigram Analysis for Spelling Error Detection. Information Processing and Management 17. 305-316.

KEY TERMS

Automatic Speech Recognition: Machine recognition and conversion of spoken words into text.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships, or of obtaining systems that perform useful tasks such as classification, prediction, estimation, or affinity grouping.

Information Extraction: Automatic extraction of structured knowledge from unstructured documents.

Knowledge Extraction: Making explicit the internal knowledge of a system or set of data in a way that is easily interpretable by the user.

Noisy Text: Text with any kind of difference in the surface form from the intended, correct, or original text.

Optical Character Recognition: Translation of images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text.

Rule Induction: Process of learning, from cases or instances, if-then rule relationships that consist of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).

Text Analytics: The process of extracting useful and structured knowledge from unstructured documents to find useful associations and insights.

Angiographic Images Segmentation Techniques Francisco J. Nóvoa University of A Coruña, Spain Alberto Curra University of A Coruña, Spain M. Gloria López University of A Coruña, Spain Virginia Mato University of A Coruña, Spain

INTRODUCTION

Heart-related pathologies are among the most frequent health problems in western society. Symptoms that point towards cardiovascular diseases are usually diagnosed with angiographies, which allow the medical expert to observe the blood flow in the coronary arteries and detect severe narrowing (stenosis). According to the severity, extension, and location of these narrowings, the expert pronounces a diagnosis, defines a treatment, and establishes a prognosis. The current modus operandi is for clinical experts to observe the image sequences and make decisions on the basis of their empirical knowledge. Various techniques and segmentation strategies now aim at making this process more objective by extracting quantitative and qualitative information from the angiographies.

BACKGROUND

Segmentation is the process that divides an image into its constituent parts or objects. In the present context, it consists of separating the pixels that compose the coronary tree from the remaining “background” pixels. None of the currently applied segmentation methods is able to extract the vasculature of the heart completely and perfectly, because the images present complex morphologies and their background is inhomogeneous due to the presence of other anatomic elements and artifacts such as catheters. The literature presents a wide array of coronary tree extraction methods: some apply pattern recognition techniques based on pure intensity, such as thresholding followed by an analysis of connected components, whereas others apply explicit vessel models to extract the vessel contours. Depending on the quality and noise of the image, some segmentation methods may require image preprocessing prior to the segmentation algorithm; others may need postprocessing operations to eliminate the effects of a possible oversegmentation. The techniques and algorithms for vascular segmentation can be categorized as follows (Kirbas & Quek, 2004):

1. Techniques for “pattern-matching” or pattern recognition
2. Techniques based on models
3. Techniques based on tracking
4. Techniques based on artificial intelligence

MAIN FOCUS

This section describes the main features of the most commonly accepted coronary tree segmentation techniques. These techniques automatically detect objects and their characteristics, which is an easy and immediate task for humans, but an extremely complex process for artificial computational systems.

Techniques Based on Pattern Recognition The pattern recognition approaches can be classified into four major categories:



Figure 1. Regions growth applied to an angiography

Multiscale Methods

The multiscale method extracts the vessels by means of images of varying resolutions. The main advantage of this technique resides in its high speed. Larger structures such as main arteries are extracted by segmenting low-resolution images, whereas smaller structures are obtained from high-resolution images.
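The set of images at varying resolutions can be pictured as a pyramid. The sketch below is a minimal illustration that assumes each coarser level is obtained by averaging 2x2 pixel blocks; this choice, and the function names, are assumptions made for the example and not part of any cited method.

```python
def downsample(image):
    """Halve the resolution by averaging each 2x2 block of pixels."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def build_pyramid(image, levels):
    """Return [full resolution, half, quarter, ...] with `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

# Tiny 4x4 grey-level image (invented values).
img = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 0, 0],
    [4, 4, 0, 0],
]
pyramid = build_pyramid(img, 3)
```

A segmentation routine would then be run on the coarsest level first to locate the main arteries cheaply, and on finer levels to recover the small branches.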

Methods Based on Skeletons

The purpose of these methods is to obtain a skeleton of the coronary tree: a structure of smaller dimensions than the original that preserves the topological properties and the general shape of the detected object. Skeletons based on curves are generally used to reconstruct vascular structures (Nyström, Sanniti di Baja & Svensson, 2001). Skeletonizing algorithms are also called “thinning algorithms”. The first step of the process is to detect the central axis of the vessels or “centerline”. This axis is an imaginary line that follows the middle of each vessel, i.e. two normal segments that leave the axis in opposite directions should reach the vessel’s edges at the same distance. The set of these lines constitutes the skeleton of the coronary tree. The methods that are used to detect the central axes can be classified into three categories:

Methods Based on Crests

One of the first methods to segment angiographic images on the basis of crests was proposed by Guo and Richardson (Guo & Richardson, 1998). This method treats angiographies as topographic maps in which the detected crests constitute the central axes of the vessels. The image is preprocessed by means of a median filter and smoothed with non-linear diffusion. The region of interest is then selected through thresholding, a process that eliminates the crests that do not correspond to the central axes. Finally, the candidate central axes are joined with curve relaxation techniques.

Methods Based on Region Growth

Taking a known point as seed point, these techniques segment images through the incremental inclusion of pixels in a region on the basis of an a priori established criterion. There are two especially important criteria: similarity in value and spatial proximity (Jain, Kasturi & Schunck, 1995). Pixels that are sufficiently near to others with similar grey levels are assumed to belong to the same object. The main disadvantage of this method is that it requires the intervention of the user to determine the seed points. O’Brien and Ezquerra (O’Brien & Ezquerra, 1994) propose the automatic extraction of the coronary vessels in angiograms on the basis of temporal, spatial, and structural constraints. The algorithm starts with a low-pass filter and the user’s definition of a seed point. The system then starts to extract the central axes by means of the “globe test” mechanism, after which the detected regions are linked by means of graph theory. The applied test also makes it possible to discard the regions that are detected incorrectly and do not belong to the vascular tree.
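The region-growth principle described above (a seed point, similarity in grey level, and spatial proximity) can be sketched as follows. This is a minimal illustration and not the algorithm of O’Brien and Ezquerra; it assumes 4-connectivity as the proximity criterion and a fixed tolerance around the seed's grey level as the similarity criterion, and the sample image is invented.

```python
from collections import deque

def region_grow(image, seed, tolerance):
    """Grow a region from `seed` (row, col), adding 4-connected neighbours
    whose grey level differs from the seed's by at most `tolerance`."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(image[nr][nc] - seed_value) <= tolerance):
                region.add((nr, nc))
                frontier.append((nr, nc))
    return region

# Dark "vessel" pixels (about 50) on a bright background (200).
img = [
    [200, 200,  50, 200],
    [200,  55,  52, 200],
    [200, 200,  48, 200],
    [ 49, 200, 200, 200],
]
vessel = region_grow(img, seed=(0, 2), tolerance=10)
```

The pixel at (3, 0) has a vessel-like grey level but is not connected to the seed, so it is left out: both criteria must hold for a pixel to join the region.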


Methods Based on Differential Geometry

The methods that are based on differential geometry treat images as hypersurfaces and extract their features using curvature and surface crests. The crest points of the hypersurface correspond to the central axis of a vessel. This method can be applied to two-dimensional as well as three-dimensional images; angiograms are two-dimensional images and are therefore modelled as three-dimensional hypersurfaces. Examples of reconstructions can be found in Prinet et al. (Prinet, Mona & Rocchisani, 1995), who treat the images as parametric surfaces and extract their features by means of surfaces and crests.

Correspondence Filters Methods

The correspondence filter approach convolves the image with multiple correspondence filters so as to extract the regions of interest. The filters are designed to detect different sizes and orientations. Poli and Valli (Poli & Valli, 1997) apply this technique with an algorithm that uses a series of multi-orientation linear filters obtained as linear combinations of Gaussian “kernels”. These filters are sensitive to different vessel widths and orientations. Mao et al. (Mao, Ruan, Bruno, Toumoulin, Collorec & Haigron, 1992) also use this type of filter in an algorithm based on visual perception models, which state that the relevant parts of objects in noisy images normally appear grouped.

Morphological Mathematical Methods

Mathematical morphology defines a series of operators that apply structuring elements to the images so that their morphological features can be preserved and irrelevant elements eliminated. The main morphological operations are the following:

• Dilatation: Expands objects, fills up empty spaces, and connects disjoint regions.
• Erosion: Contracts objects, separates regions.
• Closure: Dilatation + Erosion.
• Opening: Erosion + Dilatation.
• “Top hat” transformation: Extracts the structures with a linear shape.
• “Watershed” transformation: “Inundates” the image, which is taken as a topographic map, and extracts the parts that are not “flooded”.

Eiho and Qian (Eiho & Qian, 1997) use a purely morphological approach to define an algorithm that consists of the following steps:

1. Application of the “top hat” operator to emphasize the vessels
2. Erosion to eliminate the areas that do not correspond to vessels
3. Extraction of the tree from a point provided by the user and on the basis of grey levels
4. Thinning of the tree
5. Extraction of edges through “watershed” transformation

Figure 2. Morphological operators applied to an angiography

MODEL-BASED TECHNIQUES

These approaches use explicit vessel models to extract the vascular tree. They can be divided into four categories: deformable models, parametric models, template correspondence models, and generalized cylinders.

Deformable Models

Strategies based on deformable models can be classified in terms of the work by McInerney and Terzopoulos (McInerney & Terzopoulos, 1997). Algorithms that use deformable models (Merle, Finet, Lienard, & Magnin, 1997) are based on the progressive refining of an initial skeleton built with curves from a series of reference points:

• Root points: Starting points for the coronary tree.
• Bifurcation points: Points where a main branch divides into a secondary branch.
• End points: Points where a tree branch ends.

These points have to be marked manually.

Deformable Parametric Models: Active Contours

These models use a set of parametric curves that adjust to the object’s edges and are modified by both external forces, which promote deformation, and internal forces, which resist change. The active contour models or “snakes” are a special case of a more general technique that aims to fit deformable models by minimizing energy. Klein et al. (Klein, Lee & Amini, 1997) propose an algorithm that uses “snakes” for 4D reconstruction: they trace the position of each point of the central axis of a skeleton in a sequence of angiograms.
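The energy that a snake minimizes is commonly written in the form below. This is the classical active-contour formulation, given here as a sketch and not as the specific functional used by Klein et al.; the symbols $v(s)$ (the parametric curve), $\alpha$, $\beta$ (weights of the internal terms), and $E_{\mathrm{ext}}$ (the image-derived external energy) are introduced only for this illustration.

```latex
\[
E_{\mathrm{snake}} = \int_{0}^{1}
  \left[ \tfrac{1}{2}\left( \alpha\,\lvert v'(s)\rvert^{2}
       + \beta\,\lvert v''(s)\rvert^{2} \right)
       + E_{\mathrm{ext}}\bigl(v(s)\bigr) \right] ds
\]
```

The first two terms are the internal energy, which penalizes stretching and bending and thus resists change, while $E_{\mathrm{ext}}$ is typically low near strong edges and therefore pulls the curve towards the vessel boundary.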

Deformable Geometric Models

These models are based on topographic models that are adapted for shape recognition. Malladi et al. (Malladi, Sethian & Vemuri, 1995), for instance, adapt the “Level Set Method” (LSM) by representing an edge as the zero level set of a higher-dimensional hypersurface; the model evolves to reduce a metric defined by edge and curvature constraints, but less rigidly than in the case of the “snakes”. This edge, which constitutes the zero level of the hypersurface, evolves by adjusting to the edges of the vessels, which are the structures to be detected.
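In its general form, the level set representation can be written as follows. This is a sketch of the standard level set evolution equation, not the exact speed function of Malladi et al.; $\phi$ (the hypersurface), $F$ (the speed in the normal direction), and $C(t)$ (the evolving edge) are symbols introduced here for illustration.

```latex
\[
\frac{\partial \phi}{\partial t} + F\,\lvert \nabla \phi \rvert = 0,
\qquad
C(t) = \{\, x : \phi(x, t) = 0 \,\}
\]
```

In edge-driven variants, $F$ is made to depend on curvature and to decrease where the image gradient is large, so that the zero level set slows down and settles on the vessel edges.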

Propagation Methods

Quek and Kirbas (Quek & Kirbas, 2001) developed a system of wave propagation combined with a backtracking mechanism to extract the vessels from angiographic images. This method labels each pixel according to its likelihood of belonging to a vessel and then propagates a wave through the pixels that are labeled as belonging to the vessel; it is this wave that definitively extracts the vessels according to the local features it encounters.

Approaches Based on the Correspondence of Deformable Templates

This approach tries to recognize structural models (templates) in an image by using a template as context, i.e. as an a priori model. This template is generally represented as a set of nodes connected by segments. The initial structure is deformed until it adjusts optimally to the structures that are observed in the image. Petrocelli et al. (Petrocelli, Manbeck, & Elion, 1993) describe a method based on deformable templates that also incorporates additional prior knowledge into the deformation process.

Parametric Models

These models are based on a priori knowledge of the artery’s shape and are used to build models whose parameters depend on the profiles of the entire vessel; as such, they consider the global information of the artery instead of merely the local information. The value of these parameters is established after a learning process. The literature shows the use of models with circular sections (Shmueli, Brody, & Macovski, 1983) and elliptical sections (Pappas & Lim, 1984), because various studies (Brown, Bolson, Frimer, & Dodge, 1977; Brown, Bolson, Frimer & Dodge, 1982) show that sections of healthy arteries tend to be circular and sections with stenosis are usually elliptical. However, both circular and elliptical shapes fail to approximate the irregular shapes caused by pathologies or bifurcations. This model has been applied to the reconstruction of vascular structures from two angiograms (Pellot, Herment, Sigelle, Horain, Maitre & Peronneau, 1994), in which both healthy and stenotic sections are modeled by means of ellipses. This model is subsequently deformed until it corresponds to the shape associated with the birth of a new branch or with a pathology.



Figure 3. “Snakes” applied to a blood vessel. http://vislab.cs.vt.edu/review/extraction.html

Generalized Cylinder Models

A generalized cylinder (GC) is a solid whose central axis is a 3D curve. Each point of that axis has a bounded, closed cross-section that is perpendicular to it. A GC is therefore defined in space by a spatial curve or axis and a function that defines the section along that axis. The section is usually an ellipse. Technically, GCs should be included in the parametric methods section, but the work that has been done in this field is so extensive that it deserves its own category. The construction of the coronary tree model requires one single view to build the 2D tree and estimate the sections. However, a single view provides no information on the depth or the area of the sections, so a second projection is required.

ARTERIAL TRACKING

Contrary to the approaches based on pattern recognition, where local operators are applied to the entire image, techniques based on arterial tracking apply local operators only in an area that presumably belongs to a vessel, following it along its length. From a given starting point, the operators detect the central axis and, by analyzing the pixels that are orthogonal to the tracking direction, the vessel’s edges. There are various methods to determine the central axis and the edges: some methods carry out a sequential tracking and incorporate connectivity information after a simple edge detection operation, whereas other methods use this information to track the contours sequentially. There are also approaches based on the intensity of the crests, on fuzzy sets, or on the representation of


Figure 4. Tracking applied to an angiography

graphs, where the purpose lies in finding the optimal path in the graph that represents the image. Lu and Eiho (Lu & Eiho, 1993) have described a tracking algorithm for the vascular edges in angiographies that considers the inclusion of branches and consists of three steps:

1. Edge detection
2. Branch search
3. Tracking of sequential contours

The user must provide the starting point, the direction, and the search range. The edge points are evaluated with a differential smoothing operator along a line that is perpendicular to the direction of the vessel. This operator also serves to detect the branches.
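The edge detection step on a perpendicular line can be illustrated with a simple one-dimensional sketch. It is not the operator of Lu and Eiho: it assumes that the vessel appears darker than the background and uses a difference of local means (a hypothetical choice, with window size `half_width`) as the smoothed derivative; the intensity profile is invented.

```python
def smoothed_derivative(profile, half_width=1):
    """Difference between the mean of the `half_width` samples to the
    right and to the left of each position (0.0 near the borders)."""
    n = len(profile)
    deriv = [0.0] * n
    for i in range(half_width, n - half_width):
        right = sum(profile[i + 1:i + 1 + half_width]) / half_width
        left = sum(profile[i - half_width:i]) / half_width
        deriv[i] = right - left
    return deriv

def vessel_edges(profile, half_width=1):
    """Return (left_edge, right_edge) indices for a dark vessel on a
    bright background: steepest intensity fall, then steepest rise."""
    d = smoothed_derivative(profile, half_width)
    left_edge = min(range(len(d)), key=d.__getitem__)
    right_edge = max(range(len(d)), key=d.__getitem__)
    return left_edge, right_edge

# Grey levels sampled along a line perpendicular to the vessel direction.
profile = [200, 200, 198, 120, 60, 55, 58, 130, 199, 200, 200]
left, right = vessel_edges(profile)
center = (left + right) // 2
```

The midpoint between the two edges gives an estimate of the central axis at this position, from which the tracker can step forward along the vessel direction and repeat the measurement.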

TECHNIQUES BASED ON ARTIFICIAL INTELLIGENCE

Approaches based on Artificial Intelligence use high-level knowledge to guide the segmentation and delineation of vascular structures and sometimes use different types of knowledge from various sources. One possibility (Smets, Verbeeck, Suetens, & Oosterlinck, 1988) is to use rules that codify knowledge on the morphology of blood vessels; these rules are then used to formulate a hierarchy with which to create the model. This type of system does not give good results at arterial bifurcations or in arteries with occlusions. Another approach (Stansfield, 1986) consists of formulating a rule-based Expert System to identify the arteries. During the first phase, the image is processed without making use of domain knowledge to extract segments of the vessels. It is only in the second phase that domain knowledge on cardiac anatomy and physiology is applied. The latter approach is more robust than the former, but it has the drawback of not combining all the segments into one vascular structure.

FUTURE TRENDS

It cannot be said that one technique has a more promising future than another, but the current tendency is to move away from the abovementioned classical segmentation algorithms towards 3D and even 4D reconstructions of the coronary tree. Other lines of research focus on obtaining angiographic images by means of new acquisition technologies such as Magnetic Resonance, High-Speed Computerized Tomography, or two-armed angiographic devices that achieve two simultaneous projections in combination with the use of intravascular ultrasound devices. This type of acquisition simplifies the creation of three-dimensional structures, either directly from the acquisition or after simple processing of the two-dimensional images.

REFERENCES

Brown, B. G., Bolson, E., Frimer, M., & Dodge, H. T. (1977). Quantitative coronary arteriography. Circulation, 55:329-337.

Brown, B. G., Bolson, E., Frimer, M., & Dodge, H. T. (1982). Arteriographic assessment of coronary atherosclerosis. Arteriosclerosis, 2:2-15.

Eiho, S., & Qian, Y. (1997). Detection of coronary artery tree using morphological operator. In Computers in Cardiology 1997, 525-528.

Gonzalez, R. C., & Woods, R. E. (1996). Digital Image Processing. Addison-Wesley Publishing Company, Inc. Reading, Massachusetts, USA.

Greenes, R. A., & Brinkley, K. F. (2001). Imaging Systems. In Medical informatics: computer applications in health care and biomedicine, 485-538. Second Edition. Springer-Verlag, New York, USA.

Guo, D., & Richardson, P. (1998). Automatic vessel extraction from angiogram images. In Computers in Cardiology 1998, 441-444.

Jain, R. C., Kasturi, R., & Schunck, B. G. (1995). Machine Vision. McGraw-Hill.

Kirbas, C., & Quek, F. (2004). A review of vessel extraction techniques and algorithms. ACM Comput. Surv., 36(2):81-121.

Klein, A. K., Lee, F., & Amini, A. A. (1997). Quantitative coronary angiography with deformable spline models. IEEE Transactions on Medical Imaging, 16(5):468-482.

Lu, S., & Eiho, S. (1993). Automatic detection of the coronary arterial contours with sub-branches from an x-ray angiogram. In Computers in Cardiology 1993. Proceedings, 575-578.

Malladi, R., Sethian, J. A., & Vemuri, B. C. (1995). Shape modeling with front propagation: a level set approach. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17:158-175.

Mao, F., Ruan, S., Bruno, A., Toumoulin, C., Collorec, R., & Haigron, P. (1992). Extraction of structural features in digital subtraction angiography. Biomedical Engineering Days, 1992, Proceedings of the 1992 International, 166-169.

McInerney, T., & Terzopoulos, D. (1997). Medical image segmentation using topologically adaptable surfaces. In CVRMed-MRCAS ’97: Proceedings of the First Joint Conference on Computer Vision, Virtual Reality and Robotics in Medicine and Medical Robotics and Computer-Assisted Surgery, 23-32, London, UK, Springer-Verlag.

Nyström, I., Sanniti di Baja, G., & Svensson, S. (2001). Representing volumetric vascular structures using curve skeletons. In Edoardo Ardizzone and Vito Di Gesù, editors, Proceedings of 11th International Conference on Image Analysis and Processing (ICIAP 2001), 495-500, Palermo, Italy, IEEE Computer Society.

O’Brien, J. F., & Ezquerra, N. F. (1994). Automated segmentation of coronary vessels in angiographic image sequences utilizing temporal, spatial and structural constraints. (Technical report), Georgia Institute of Technology.

Pappas, T. N., & Lim, J. S. (1984). Estimation of coronary artery boundaries in angiograms. Appl. Digital Image Processing VII, 504:312-321.

Pellot, C., Herment, A., Sigelle, M., Horain, P., Maitre, H., & Peronneau, P. (1994). A 3d reconstruction of vascular structures from two x-ray angiograms using an adapted simulated annealing algorithm. Medical Imaging, IEEE Transactions on, 13:48-60.

Petrocelli, R. R., Manbeck, K. M., & Elion, J. L. (1993). Three dimensional structure recognition in digital angiograms using gauss-markov methods. In Computers in Cardiology 1993. Proceedings, 101-104.

Poli, R., & Valli, G. (1997). An algorithm for real-time vessel enhancement and detection. Computer Methods and Programs in Biomedicine, 52:1-22.

Prinet, V., Mona, O., & Rocchisani, J. M. (1995). Multi-dimensional vessels extraction using crest lines. In Engineering in Medicine and Biology Society, 1995. IEEE 17th Annual Conference, 1:393-394.

Quek, F. H. K., & Kirbas, C. (2001). Simulated wave propagation and traceback in vascular extraction. In Medical Imaging and Augmented Reality, 2001. Proceedings. International Workshop, 229-234.

Shmueli, K., Brody, W. R., & Macovski, A. (1983). Estimation of blood vessel boundaries in x-ray images. Opt. Eng., 22:110-116.

Smets, C., Verbeeck, G., Suetens, P., & Oosterlinck, A. (1988). A knowledge-based system for the delineation of blood vessels on subtraction angiograms. Pattern Recogn. Lett., 8(2):113-121.

Stansfield, S. A. (1986). Angy: A rule-based expert system for automatic segmentation of coronary vessels from digital subtracted angiograms. PAMI, 8(3):188-199.

KEY TERMS

Angiography: Image of blood vessels obtained by any possible procedure.

Artery: Each of the vessels that carry blood from the heart to the other parts of the body.

Computerized Tomography: X-ray examination that produces detailed images of axial sections of the body. A CT scanner obtains many images by rotating around the body. A computer combines all these images into a final image that represents a cross-section of the body, like a slice.

Expert System: Computer or computer program that can give responses that are similar to those of an expert.

Segmentation: In computer vision, segmentation refers to the process of partitioning a digital image into multiple regions. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (structures) in images, in this case the coronary tree in digital angiography frames.

Stenosis: An abnormal narrowing in a blood vessel or other tubular organ or structure. A coronary artery that is constricted or narrowed is called stenosed. Buildup of fat, cholesterol, and other substances over time may clog the artery. Many heart attacks are caused by a complete blockage of a vessel in the heart, called a coronary artery.

Thresholding: A technique for the processing of digital images that consists of applying a certain property or operation to those pixels whose intensity value exceeds a defined threshold.

ANN Application in the Field of Structural Concrete Juan L. Pérez University of A Coruña, Spain Mª Isabel Martínez University of A Coruña, Spain Manuel F. Herrador University of A Coruña, Spain

INTRODUCTION

Artificial Intelligence (AI) mechanisms are more and more frequently applied to all sorts of civil engineering problems. New methods and algorithms which allow civil engineers to use these techniques in different ways on diverse problems are available or being made available. One AI technique stands out over the rest: Artificial Neural Networks (ANN). Their most remarkable traits are their ability to learn, their capacity for generalization, and their tolerance to errors. These characteristics make their use viable and cost-efficient in any field in general, and in Structural Engineering in particular. The most widely used construction material nowadays is concrete, mainly because of its high strength and its adaptability to formwork during its fabrication process. Throughout this chapter we will find different applications of ANNs to structural concrete.

Artificial Neural Networks Warren McCulloch and Walter Pitts are credited for the origin of Artificial Networks in the 1940s, since they were the first to design an artificial neuron (McCulloch & Pitts, 1943). They proposed the binary mode (active or inactive) neuron model with a fixed threshold which must be surpassed for it to change state. Some of the concepts they introduced still hold useful today. Artificial Neural Networks intend to simulate the properties found in biological neural systems through mathematical models by the way of artificial mechanisms. A neuron is considered a formal element, or module, or basic network unit which receives

information from other modules or the environment; it then integrates and computes this information to emit a single output which will be identically transmitted to subsequent multiple neurons (Wasserman, 1989). The output of an artificial neuron is determined by its propagation or excitation, activation and transfer functions. The propagation function is generally the summation of each input multiplied by the weight of its interconnection (net value):

$$ n_i = \sum_{j=0}^{N-1} W_{ij}\, p_j \qquad (1) $$

The activation function modifies the latter, relating the neural input to the next activation state.

$$ a_i(t) = F_A\left[a_i(t-1),\, n_i(t-1)\right] \qquad (2) $$

The transfer function is applied to the result of the activation function. It is used to bound the neuron’s output and is generally given by the interpretation intended for the output. Some of the most commonly used transfer functions are the sigmoid (to obtain values in the [0,1] interval) and the hyperbolic tangent (to obtain values in the [-1,1] interval).

$$ \mathrm{out}_i = F_T\left(a_i(t)\right) \qquad (3) $$
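The chain formed by Eqs. (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the activation function is assumed to be the identity (it ignores the previous activation state), which the text does not prescribe.

```python
import math

def propagation(weights, inputs):
    """Eq. (1): net value, the sum of each input times its connection weight."""
    return sum(w * p for w, p in zip(weights, inputs))

def activation(previous_activation, net):
    """Eq. (2): F_A. Assumed here to be the identity on the net value,
    ignoring the previous activation state (an assumption of this sketch)."""
    return net

def transfer_sigmoid(a):
    """Transfer function bounding the output to the [0, 1] interval."""
    return 1.0 / (1.0 + math.exp(-a))

def transfer_tanh(a):
    """Transfer function bounding the output to the [-1, 1] interval."""
    return math.tanh(a)

def neuron_output(weights, inputs, transfer=transfer_sigmoid,
                  previous_activation=0.0):
    """Eq. (3): apply F_T to the activation obtained from the net value."""
    net = propagation(weights, inputs)
    a = activation(previous_activation, net)
    return transfer(a)

# Two inputs whose weighted contributions cancel out give a net value of 0,
# the midpoint of the sigmoid.
print(neuron_output([0.5, -0.5], [1.0, 1.0]))
```

Swapping `transfer_sigmoid` for `transfer_tanh` changes only the range of the output, which is why the choice usually follows from the interpretation intended for that output.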

Once each element in the process is defined, the type of network (network topology) to use must be chosen. Topologies can be divided into feed-forward networks, where



information moves in one direction only (from input to output), and networks with partial or total feedback, where information can flow in any direction. Finally, learning rules and training type must be defined. Learning rules are divided into supervised and non-supervised (Brown & Harris, 1994) (Lin & Lee, 1996); the latter include self-organizing learning and reinforcement learning (Hoskins & Himmelblau, 1992). The type of training will be determined by the type of learning chosen.

An Introduction to Concrete (Material and Structure)

Structural concrete is a construction material created from a mixture of cement, water, aggregates and additions or admixtures with diverse functions. The goal is to create a material with a rock-like appearance, with sufficient compressive strength and the ability to adopt adequate structural shapes. Concrete is moldable during its preparation phase, once the components have been mixed together to produce a fluid mass which conveniently fills the cavities of a mould named formwork. After a few hours, concrete hardens thanks to the chemical hydration reaction undergone by the cement, generating a paste which envelops the aggregates and gives the ensemble the appearance of an artificial rock somewhat similar to a conglomerate.

Hardened concrete offers good compressive strength, but very low tensile strength. This is why structures created with this material must be reinforced with steel rebars: rods placed (before pouring the concrete) along the lines where calculation predicts the highest tensile stresses. Cracking, which reduces the durability of the structure, is thus hindered, and sufficient resistance is guaranteed with a very low probability of failure. The whole formed by concrete and rebar is referred to as Structural Concrete (Shah, 1993).

Two phases thus characterize the evolution of concrete in time. In the first phase, concrete must be fluid enough to ensure ease of placement, with a time to initial set long enough to allow transportation from plant to worksite. Flowability depends basically on the type and quantity of the ingredients in the mixture. Special chemical admixtures (such as plasticizers and superplasticizers) guarantee flowability without grossly increasing the amount of water, whose ratio relative to the amount of cement (the water/cement ratio, w/c) is inversely related to the strength attained. The science of rheology deals with the study of the behavior of fresh concrete. A variety of tests can be used to determine the flowability of fresh concrete, the most popular among them being the Abrams cone (Abrams, 1922) or slump cone test (Domone, 1998).

The second phase (the longest over time) is the hardened phase of concrete, which determines the behavior of the structure it gives shape to, from the point of view of serviceability (by imposing limitations on cracking and deformation) and resistance to failure (by imposing limitations on the minimal loads that can be resisted, as compared to the internal forces produced by external loading), always within the frame of sufficient durability for the service life foreseen.

The study of structural concrete has been undertaken from many different perspectives. The experimental path has been very productive, generating over the past 50 years a database (with a tendency to scatter) which has been used to validate studies carried out along the second and third paths described next. The analytical path also constitutes a fundamental tool to approach concrete behavior, from both the material and the structural point of view. The development of theoretical behavior models goes back to the early 20th century, and the theoretical equations developed since have been corrected through testing (as mentioned above) before becoming part of codes and specifications. This method of analysis has been reinforced by the development of numerical methods and computational systems capable of solving a great number of simultaneous equations. In particular, the Finite Element Method (and other methods in the same family) and optimization techniques have brought a remarkable capacity to approximate the behavior of structural concrete, with their results benchmarked in many applications against the aforementioned experimental testing.

Three basic lines of study are thus available. Being complementary, they have played a decisive role in the production of national and international codes and rules which guide or regulate the design, execution and maintenance of structural concrete works. Concrete is a complex material which presents a number of problems for analytical study, and so is an adequate field for the development of analysis techniques based on neural networks (González, Martínez & Carro, 2006).



Application of Artificial Neural Networks to problems in the field of structural concrete has unfolded in the past few years in two ways. On one hand, analytical and structural optimization systems faster than traditional (usually iterative) methods have been generated starting from expressions and calculation rules. On the other, the numerous databases created from the large number of tests published in the scientific community have allowed for the development of very powerful ANNs which have shed light on various complex phenomena. In a few cases, specific design codes have been improved through the use of these techniques; some examples follow.

Application of Artificial Neural Networks to Optimization Problems

Design of concrete structures is based on the determination of two basic parameters: member thickness (effective depth d, the depth of a beam or slab section measured from the compression face to the centroid of the reinforcement) and amount of reinforcement (established as the total area As of steel in a section, materialized as rebars, or as the reinforcement ratio, the ratio between steel area and concrete area in the section). Calculation methods are iterative, since a large number of conditions must be verified in the structure, and the aforementioned parameters are fixed as a function of three basic conditions which are followed sequentially: structural safety, maximum ductility at failure and minimal cost. Design rules, expressed through equations, provide a first solution which is corrected to meet all calculation scenarios, finally converging when the difference between input and output parameters is negligible. In some cases it is possible to develop optimization algorithms, whose analytical formulation opens the way to the generation of a database. Hadi (2003) has performed this work for simply supported reinforced concrete beams, and the expressions obtained after the optimization process determine the parameters specified above, while simultaneously giving the cost associated with the optimal solution (related to the cost of materials and formwork). With these expressions, Hadi develops a database with the following variables: applied flexural moment (M), compressive strength of concrete (fc), steel strength (fy), section width (b), section depth (h), and unit costs of concrete (Cc), steel (Cs) and formwork (Cf).

Network parameters used are as follows: the number of training samples is 550; the number of input layer neurons is 8; the number of hidden layer neurons is 10; the number of output layer neurons is 4; the type of backpropagation is Levenberg–Marquardt backpropagation; the activation function is the sigmoidal function; the learning rate is 0.01; the number of epochs is 3000; the sum-square error achieved is 0.08. The network was tested with 50 samples and yielded an average error of 6.1%. Hadi studies various factors when choosing the network architecture and backpropagation algorithm type. When two layers of hidden neurons are used, precision is not improved while computation time is increased. The number of samples depends on the complexity of the problem and the number of input and output parameters. If a value is fixed for the input costs, there are no noticeable precision improvements between training the network with 200 or 1000 samples. When costs are introduced as input parameters, 100 samples are not enough to achieve convergence in training. Finally, the training algorithm is also checked, studying the range between pure backpropagation (too slow for training), backpropagation with momentum and with adaptive learning, backpropagation with the Levenberg–Marquardt updating rule and fast learning backpropagation. The latter is finally retained since it requires less time to get the network to converge while providing very good results (Demuth & Beale, 1995).
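To give a sense of the size of the network described above, the following sketch builds an 8-10-4 feed-forward network with sigmoidal neurons and runs one forward pass. It is not Hadi's implementation: the random weights, the bias terms, the normalized input values and all function names are assumptions made for illustration, and no training is performed.

```python
import math
import random

# Input, hidden and output layer sizes reported in the text.
LAYER_SIZES = [8, 10, 4]

def init_network(layer_sizes, rng):
    """Create one (weights, biases) pair per layer with small random values.
    Bias terms are an assumption; the text does not mention them."""
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, inputs):
    """Propagate the inputs through every layer in turn."""
    signal = list(inputs)
    for weights, biases in layers:
        signal = [sigmoid(sum(w * s for w, s in zip(row, signal)) + b)
                  for row, b in zip(weights, biases)]
    return signal

def count_parameters(layers):
    """Number of adjustable weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

rng = random.Random(0)
network = init_network(LAYER_SIZES, rng)
# Hypothetical, already normalized values for M, fc, fy, b, h, Cc, Cs, Cf.
sample = [0.5] * 8
outputs = forward(network, sample)
print(len(outputs), count_parameters(network))
```

Under these assumptions the network has 134 adjustable parameters, which helps put the reported 550 training samples in perspective.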

Application of Artificial Neural Networks to Prediction of Concrete Physical Parameters Measurable Through Testing: Concrete Strength and Consistency

Other neural network applications are supported by large experimental databases, created through years of research, which allow for the prediction of phenomena with complex analytical formulation. One of these cases is the determination of two basic concrete parameters: its workability when mixed, necessary for ease of placement, and its compressive strength once hardened, which is basic to the evaluation of the capacity of the structure. The variables that necessarily determine these two parameters are the components of concrete: amounts of cement, water, fine aggregate (sand), coarse aggregate (small gravel and large gravel), and other components such as pozzolanic additions (which bring soundness and delayed strength increase, especially in the case of fly ash and silica fume) and admixtures (which fluidify the fresh mixture, allowing the use of reduced amounts of water). There are still no analytical or numerical models that faithfully predict fresh concrete consistency (related to flowability, and usually evaluated by the slump of a molded concrete cone) or compressive strength (determined by crushing prismatic specimens in a press).

Öztaş et al. (Öztaş, Pala, Özbay, Kanca, Çağlar & Bhatti, 2006) have developed a neural network from 187 concrete mixes for which all parameters are known, using 169 of them for training and 18, randomly selected, for verification. Database variables are sometimes taken as ratios between them, since there is available knowledge about the dependency of slump and strength on such parameters. The established range for the 7-parameter set is shown in Table 1. The network architecture consists of 7 input neurons and two hidden layers of 5 and 3 neurons respectively. The back-propagation learning algorithm was used in a feed-forward network with two hidden layers. The learning algorithm used in the study is the scaled conjugate gradient algorithm (SCGA), the activation function is the sigmoidal function, and the number of epochs is 10,000. The prediction capacity of the network is better in the “Compressive Strength” output (maximum error of 6%) than in the

Table 1. Input parameter range (the minimum and maximum values are illegible in the source)

Input parameter: meaning
W/B (ratio, %): [water]/[binder] ratio, considering binder as the lump sum of cement, fly ash and silica fume
W (kg/m³): amount of water
s/a (ratio, %): [amount of sand]/[total aggregate (sand + small gravel + large gravel)]
FA (ratio, %): percentage of cement substituted by fly ash
AE (kg/m³): amount of air-entraining agent
SF (ratio, %): percentage of cement substituted by silica fume
SP (kg/m³): amount of superplasticizer

“Slump” output (errors up to 25%). This is because the relation between the chosen variables and strength is much stronger than in the case of slump, which is influenced by other variables not considered (e.g. type and power of concrete mixer, mixing order of components, aggregate moisture) and by the method used to measure consistency, whose adequacy for the particular type of concrete used in the database is questioned by some authors.
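The verification procedure described above (a random hold-out of 18 mixes out of 187, followed by a maximum percentage error) can be sketched as follows. The helper names and the strength values are hypothetical and serve only to show the arithmetic; they are not data from Öztaş et al.

```python
import random

def holdout_split(n_samples, n_test, seed=0):
    """Randomly reserve n_test sample indices for verification."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

def max_relative_error(measured, predicted):
    """Largest absolute prediction error, as a percentage of the measured value."""
    return max(abs(p - m) / abs(m) * 100.0
               for m, p in zip(measured, predicted))

# 187 mixes, 18 of them held out for verification, as in the text.
train_idx, test_idx = holdout_split(187, 18)
print(len(train_idx), len(test_idx))

# Made-up compressive strengths (MPa), purely to illustrate the error measure.
measured_strength = [60.0, 75.0, 90.0]
predicted_strength = [62.0, 73.5, 93.6]
print(max_relative_error(measured_strength, predicted_strength))
```

With only 18 verification mixes, a single poorly predicted slump value can dominate the maximum error, which is one reason to report it alongside an average.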

Application of Artificial Neural Networks to the Development of Design Formulae and Codes

The last application presented in this article is the analysis of the response of concrete beams to shear forces. These forces generate transverse tensile stresses in concrete beams which require the placement of rebars perpendicular to the beam axis, known as hoops or ties. Analytical determination of the failure load from the variables that intervene in this problem is very complex, and in general most of the formulae used today are based on experimental interpolations with no dimensional consistency. Cladera and Marí (2004) have studied the problem through laboratory testing, developing a neural network for the strength analysis of beams with no shear reinforcement. They rely on a database compiled by Bentz (2000) and Kuchma (2002), where the variables are effective depth (d), beam width (b, though introduced as d/b), shear span (a/d, see Figure 1), longitudinal reinforcement ratio (ρl = As/bd) and compressive strength of concrete (fc). The failure load is provided for each of the 177 tests found in the database. They use 147 tests to train the network and 30 for verification, on a one-layer architecture with 10 hidden neurons and a backpropagation learning mechanism. The ranges

Table 2. Input parameter ranges (the minimum and maximum values are illegible in the source)

Parameters: d (mm), d/b, ρl (%), fc (MPa), a/d, Vfail (kN)



Figure 1. Span loading a of a beam. (González, 2002)

Table 3. Comparison between available codes and proposed equations for shear strength (the numerical values are illegible in the source)

Procedures compared: ACI (two provisions, identifiers illegible), MC-90, EC-2, AASHTO, Eq. (7), Eq. (8)
Statistics reported for each procedure: average, median, standard deviation, CoV (%), minimum, maximum

for the variables are shown in Table 2. Almost 8000 iterations were required to attain the best results. The fit provided by training presents an average ratio Vtest/Vpred of 0.99, and 1.02 in validation. The authors have effectively created a laboratory with a neural network, in which they “test” (within the parameter range) new beams by changing exactly one parameter at a time. Finally, they come up with two alternative design formulae that noticeably improve on the formulae developed up to that moment. Table 3 presents a comparison between those two expressions (named Eq. 7 and Eq. 8) and others found in a series of international codes.
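The two ideas in this paragraph, summarizing the fit through the ratio Vtest/Vpred and "testing" virtual beams by varying one parameter at a time, can be sketched as below. The function names are hypothetical, the use of the sample standard deviation is an assumption, and `model` stands for any callable that maps a dictionary of beam parameters to a predicted failure load; the numbers shown are made up and have no physical meaning.

```python
import statistics

def ratio_statistics(v_test, v_pred):
    """Summary statistics of the ratio between tested and predicted loads.
    The sample standard deviation is used here (an assumption)."""
    ratios = [t / p for t, p in zip(v_test, v_pred)]
    mean = statistics.mean(ratios)
    stdev = statistics.stdev(ratios)
    return {
        "average": mean,
        "median": statistics.median(ratios),
        "standard deviation": stdev,
        "CoV (%)": 100.0 * stdev / mean,
        "minimum": min(ratios),
        "maximum": max(ratios),
    }

def one_at_a_time_sweep(model, base_case, name, values):
    """Evaluate the model on copies of base_case in which only the
    parameter called `name` changes."""
    results = []
    for value in values:
        case = dict(base_case)   # copy, so the base case is left untouched
        case[name] = value
        results.append((value, model(case)))
    return results

# Made-up failure loads (kN), purely to show the calculation.
stats = ratio_statistics([100.0, 110.0, 90.0], [100.0, 100.0, 100.0])
print(stats)

# Placeholder model with no physical meaning; a trained network would go here.
placeholder_model = lambda case: 0.2 * case["d"]
base_case = {"d": 300.0, "d/b": 1.5, "a/d": 3.0, "rho_l": 1.0, "fc": 40.0}
print(one_at_a_time_sweep(placeholder_model, base_case, "d", [200.0, 400.0, 600.0]))
```

A sweep of this kind is only meaningful inside the range covered by the training data, which is why the text stresses "within the parameter range".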

CONCLUSION

• The field of structural concrete shows great potential for the application of neural networks. Successful approaches to optimization, prediction of complex physical parameters and design formulae development have been presented.

• The network topology used in most cases for structural concrete is feed-forward, multilayer with backpropagation, typically with one or two hidden layers. The most commonly used training algorithms are gradient descent with momentum and adaptive learning, and Levenberg-Marquardt.

• The biggest potential of ANNs is their capacity to generate virtual testing laboratories which accurately substitute for expensive real laboratory tests within the proper range of values. A methodical “testing” program sheds light on the influence of the different variables in complex phenomena at reduced cost.

• The field of structural concrete can count upon extensive databases, generated through the years, that can be analyzed with this technique. An effort should be made to compile and homogenize these databases to extract the maximum possible knowledge, which has great influence on structural safety.

ACKNOWLEDGMENT

This work was partially supported by the Spanish Ministry of Education and Science (Ministerio de Educación y Ciencia) (Ref. BIA2005-09412-C03-01), grants (Ref. 111/2006/2-3.2) funded by the Spanish Ministry of Environment (Ministerio de Medio Ambiente) and grants from the General Directorate of Research, Development and Innovation (Dirección Xeral de Investigación, Desenvolvemento e Innovación) of the Xunta de Galicia (Ref. PGIDT06PXIC118137PN). The work of Juan L. Pérez is supported by an FPI grant (Ref. BES-2006-13535) from the Spanish Ministry of Education and Science (Ministerio de Educación y Ciencia).

REFERENCES

Abrams, D.A. (1922). Proportioning Concrete Mixtures. Proceedings of the American Concrete Institute, 174-181.

Bentz, E.C. (2000). Sectional analysis of reinforced concrete members. PhD thesis, Department of Civil Engineering, University of Toronto.

Brown, M. & Harris, C. (1994). Neurofuzzy adaptive modelling and control. Prentice-Hall.

Cladera, A. & Marí, A.R. (2004). Shear design procedure for reinforced normal and high-strength concrete beams using artificial neural networks. Part I: beams without stirrups. Engineering Structures (26), 917-926.

Demuth, H. & Beale, M. (1995). Neural network toolbox for use with MATLAB. MA: The Mathworks, Inc.

Domone, P. (1998). The Slump Flow Test for High-Workability Concrete. Cement and Concrete Research (28-2), 177-182.

González, B. (2002). Hormigones con áridos reciclados procedentes de demoliciones: dosificaciones, propiedades mecánicas y comportamiento estructural a cortante. PhD thesis, Department of Construction Technology, University of A Coruña.

González, B., Martínez, I. & Carro, D. (2006). Prediction of the consistency of concrete by means of the use of ANN. Artificial Neural Networks in Real-Life Applications. Ed. Idea Group Inc. 188-200.

Hadi, M. (2003). Neural networks applications in concrete structures. Computers and Structures (81), 373-381.

Hoskins, J.C. & Himmelblau, D.M. (1992). Process control via artificial neural networks and reinforcement learning. Computers and Chemical Engineering, vol. 16(4), 241-251.

Kuchma, D. (1999-2002). Shear data bank. University of Illinois, Urbana-Champaign.

Lin, C.T. & Lee, C.S. (1996). Neural Fuzzy Systems: A neuro-fuzzy synergism to intelligent systems. Prentice-Hall.

McCulloch, W.S. & Pitts, W. (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics (5), 115-133.

Öztaş, A., Pala, M., Özbay, E., Kanca, E., Çağlar, N. & Bhatti, M.A. (2006). Predicting the compressive strength and slump of high strength concrete using neural network. Construction and Building Materials (20), 769-775.

Shah, S.P. (1993). Recent trends in the science and technology of concrete, concrete technology, new trends, industrial applications. Proceedings of the international RILEM workshop, London, E & FN Spon. 1-18.

Wasserman, P. (1989). Neural Computing. Ed. Van Nostrand Reinhold, New York.

KEY TERMS

Compression: Stress generated by pressing or squeezing.

Consistency: The relative mobility or ability of freshly mixed concrete or mortar to flow; the usual measurement for concrete is slump, equal to the subsidence measured to the nearest 1/4 in. (6 mm) of a molded specimen immediately after removal of the slump cone.

Ductility: That property of a material by virtue of which it may undergo large permanent deformation without rupture.

Formwork: Total system of support for freshly placed concrete including the mold or sheathing that contacts the concrete as well as supporting members, hardware, and necessary bracing; sometimes called shuttering in the UK.

Shear Span: Distance between a reaction and the nearest load point.

Structural Safety: Structural response stronger than the internal forces produced by external loading.

Tension: Stress generated by stretching.

ANN Development with EC Tools: An Overview

Daniel Rivero, University of A Coruña, Spain
Juan Rabuñal, University of A Coruña, Spain

INTRODUCTION

Among all of the Artificial Intelligence techniques, Artificial Neural Networks (ANNs) have proven to be a very powerful tool (McCulloch & Pitts, 1943) (Haykin, 1999). This technique is very versatile and has therefore been successfully applied to many different tasks (classification, clustering, regression, modelling, etc.) (Rabuñal & Dorado, 2005). However, one of the greatest problems when using ANNs is the great manual effort required for their development. A big myth of ANNs is that they are easy to work with and that their development is almost automatic. This development process can be divided into two parts: architecture development, and training and validation. As the network architecture is problem-dependent, its design has usually been performed manually, meaning that the expert had to test different architectures and train them until finding the one that achieved the best results after the training process. The manual nature of this process makes it slow, although the training part is completely automated thanks to the existence of several algorithms that perform it. With the creation of Evolutionary Computation (EC) tools, researchers have worked on the application of these techniques to the development of algorithms for automatically creating and training ANNs, so that the whole process (or, at least, a great part of it) can be performed automatically by computers and little human effort is required.

BACKGROUND

EC is a set of tools based on the imitation of the natural behaviour of living beings for solving optimization problems. One of the most typical subsets of tools within EC is called Evolutionary Algorithms (EAs), which are based on natural evolution and its implementation on computers. All of these tools work on the same basis: a population of solutions to a particular problem is randomly created and an evolutionary process is applied to it. From this initial random population, evolution proceeds by means of selection and combination of the best individuals (although the worst ones also have a small probability of being chosen) to create new solutions. This process is carried out by selection, crossover and mutation operators, which are modelled on the mechanisms that drive adaptation and survival in biological evolution. After several generations, it is hoped that the population contains a good solution to the problem.

The first EA to appear was the Genetic Algorithm (GA), in 1975 (Holland, 1975). Working as explained above, GAs use a binary codification (i.e., each solution is codified as a string of bits). Later, in the early 1990s, a new technique appeared, called Genetic Programming (GP). It is based on the evolution of trees, i.e., each individual is codified as a tree instead of a binary string. This allows its application to a wider set of environments. Although GAs and GP are the two most used techniques in EAs, more tools can be classified as part of this world, such as Evolutionary Programming or Evolution Strategies, all of them with the same basis: the evolution of a population following the rules of natural evolution.

DEVELOPMENT OF ANNS WITH EC TOOLS

The development of ANNs is a topic that has been extensively addressed with very diverse techniques. The world of evolutionary algorithms is no exception, and proof of that is the great number of works that have



been published about different techniques in this area (Cantú-Paz & Kamath, 2005). These techniques follow the general strategy of an evolutionary algorithm: an initial population is randomly created, consisting of different genotypes, each one of them codifying different parameters (typically, the weights of the connections and/or the architecture of the network and/or the learning rules). This population is evaluated in order to determine the fitness of each individual. Afterwards, the population is repeatedly made to evolve by means of different genetic operators (replication, crossover, mutation, etc.) until a given termination criterion is fulfilled (for example, a sufficiently good individual is obtained, or a predetermined maximum number of generations is reached). Essentially, ANN generation by means of evolutionary algorithms is divided into three main groups: evolution of the weights, of the architectures, and of the learning rules.
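The general strategy just described can be sketched as a minimal genetic algorithm over binary genotypes. This is an illustrative sketch rather than any specific published method: binary tournament selection, one-point crossover, bit-flip mutation, elitist replication and all parameter values are assumptions, and the fitness function (here simply the number of ones in the genotype) stands in for the evaluation of a decoded network.

```python
import random

def evolve(fitness, genome_length, pop_size=30, max_generations=200,
           target_fitness=None, crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Minimal generational genetic algorithm over bit-string genotypes."""
    rng = random.Random(seed)
    # Random initial population.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(pop_size)]

    def tournament():
        # Binary tournament: fitter individuals are favoured, but weaker
        # ones can still be selected when drawn against an even weaker rival.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    best = max(population, key=fitness)
    for _ in range(max_generations):
        # Termination criterion 1: a sufficiently good individual exists.
        if target_fitness is not None and fitness(best) >= target_fitness:
            break
        offspring = [best[:]]                      # replication of the best
        while len(offspring) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:      # one-point crossover
                cut = rng.randrange(1, genome_length)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied gene by gene.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        population = offspring
        best = max(population, key=fitness)
    # Termination criterion 2: the maximum number of generations was reached.
    return best, fitness(best)

# Stand-in fitness: count of ones in the genotype.
best, score = evolve(fitness=sum, genome_length=20, target_fitness=20)
print(best, score)
```

In an ANN setting, `fitness` would decode the genotype into weights, an architecture or learning parameters, build the network and return a measure of its performance.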

Evolution of Weights

The evolution of the weights begins with a network with a predetermined topology. In this case, the problem is to establish, by means of training, the values of the network connection weights. This is generally conceived as a problem of minimization of the network error, taken, for example, as the Mean Square Error between the desired outputs and the ones achieved by the network. Most training algorithms, such as the backpropagation algorithm (BP) (Rumelhart, Hinton & Williams, 1986), are based on gradient minimization. This has several drawbacks (Whitley, Starkweather & Bogart, 1990), the most important being that quite frequently the algorithm becomes stuck in a local minimum of the error function and is unable to find the global minimum, especially if the error function is multimodal and/or non-differentiable. One way of overcoming these problems is to carry out the training by means of an Evolutionary Algorithm (Whitley, Starkweather & Bogart, 1990); i.e., to formulate the training process as the evolution of the weights in an environment defined by the network architecture and the task to be done (the problem to be solved). In these cases, the weights can be represented in the individuals’ genetic material as a string of binary values (Whitley, Starkweather & Bogart, 1990) or a string of real numbers (Greenwood, 1997).

Traditional genetic algorithms (Holland, 1975) use a genotypic codification method in the form of binary strings. Accordingly, much work has emerged that codifies the values of the weights by means of a concatenation of the binary values which represent them (Whitley, Starkweather & Bogart, 1990). The big advantage of these approaches is their generality and their simplicity of application: it is very easy and quick to apply the operators of uniform crossover and mutation to a binary string. The disadvantage of this type of codification is the permutation problem: the order in which the weights are placed in the string means that equivalent networks may correspond to totally different individuals, which makes the crossover operator very inefficient. The weight values have also been codified as a concatenation of real numbers, each one of them associated with a given weight (Greenwood, 1997). Using genetic operators designed to work with this type of codification, given that the existing ones for bit strings cannot be used here, several studies (Montana & Davis, 1989) showed that this type of codification produces better results, with more efficiency and scalability, than the BP algorithm.
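As an illustration of the real-number codification, the sketch below evolves the nine weights of a fixed 2-2-1 network on the XOR problem. It does not reproduce the operators of any of the cited works: truncation selection, uniform crossover, Gaussian mutation, elitism, the XOR task and every parameter value are assumptions chosen to keep the example short.

```python
import math
import random

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
       ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
N_WEIGHTS = 9  # 2-2-1 network: 2 x (2 inputs + bias) + (2 hidden + bias)

def sigmoid(x):
    # Overflow-safe form of 1 / (1 + exp(-x)).
    return 0.5 * (1.0 + math.tanh(0.5 * x))

def predict(w, x):
    """Fixed topology; the genotype w is just the list of its weights."""
    h1 = sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = sigmoid(w[3] * x[0] + w[4] * x[1] + w[5])
    return sigmoid(w[6] * h1 + w[7] * h2 + w[8])

def mse(w, data=XOR):
    """Mean square error between desired and obtained outputs."""
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def evolve_weights(generations=300, pop_size=40, sigma=0.4, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(N_WEIGHTS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=mse)
        parents = pop[: pop_size // 2]          # truncation selection
        children = [pop[0][:]]                  # elitism: keep the best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover on real-valued genes.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Gaussian mutation replaces the bit flip of binary strings.
            child = [g + rng.gauss(0.0, sigma) if rng.random() < 0.2 else g
                     for g in child]
            children.append(child)
        pop = children
    best = min(pop, key=mse)
    return best, mse(best)

weights, error = evolve_weights()
print(error)
```

No gradient is computed anywhere, so the same loop would work with a non-differentiable error measure; the price is a much larger number of error evaluations than BP typically needs.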

Evolution of the Architectures

The evolution of the architectures includes the generation of the topological structure, i.e., the topology and connectivity of the neurons, and the transfer function of each neuron of the network. The architecture of a network is of great importance for the successful application of ANNs, as it has a very significant impact on the processing capacity of the network. On one hand, a network with few connections and a linear transfer function may not be able to solve a problem that another network with other characteristics (a different number of neurons, connections or types of functions) would be able to solve. On the other hand, a network with a high number of non-linear connections and nodes could overfit and learn the noise present in the training data as an inherent part of it, without being able to discriminate between them, and in the end not have a good generalization capacity. Therefore, the design of a network is crucial, and this task is classically carried out by human experts using their own experience, based on “trial and error”, experimenting with different sets of architectures. The evolution of architectures has


been possible thanks to the appearance of constructive and destructive algorithms (Sietsma & Dow, 1991). In general terms, a constructive algorithm begins with a minimum network (with a small number of layers, neurons and connections) and successively adds new layers, nodes and connections, if they are necessary, during the training. A destructive algorithm carries out the opposite operation, i.e., it begins with a maximum network and eliminates unnecessary nodes and connections during the training. However, these methods, based on hill-climbing algorithms, are quite prone to falling into a local minimum (Angeline, Suders & Pollack, 1994).

In order to develop ANN architectures by means of an evolutionary algorithm, it is necessary to decide how to codify a network inside the genotype so it can be used by the genetic operators. For this, different types of network codifications have emerged. In the first codification method, direct codification, there is a one-to-one correspondence between the genes and the phenotypic representation (Miller, Todd & Hedge, 1989). The most typical codification method consists of a matrix C=(cij) of size NxN which represents an architecture of N nodes, where cij indicates the presence or absence of a connection between nodes i and j. It is possible to use cij=1 to indicate a connection and cij=0 to indicate the absence of a connection. In fact, cij could take real values instead of Booleans to represent the value of the connection weight between neurons i and j, and in this way architecture and connection weights can be developed simultaneously (Alba, Aldana & Troya, 1993). The restrictions required in the architectures can easily be incorporated into this representational scheme. For example, a feedforward network would have non-zero coefficients only in the upper right-hand triangle of the matrix. These types of codification are generally very simple and easy to implement.
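A direct codification of this kind can be sketched with a plain list of lists. The helper names are hypothetical, and the convention that cij = 1 means a connection from node i to node j (so that a feedforward network is strictly upper triangular) is an assumption of this sketch.

```python
import random

def random_feedforward_genotype(n_nodes, density=0.5, seed=0):
    """Random connectivity matrix restricted to the upper right-hand triangle."""
    rng = random.Random(seed)
    return [[1 if j > i and rng.random() < density else 0
             for j in range(n_nodes)]
            for i in range(n_nodes)]

def is_feedforward(c):
    """True when no connection lies on or below the main diagonal."""
    n = len(c)
    return all(c[i][j] == 0 for i in range(n) for j in range(n) if j <= i)

def connections(c):
    """Decode the genotype: list every (source, target) pair that is connected."""
    return [(i, j) for i, row in enumerate(c) for j, v in enumerate(row) if v]

def flatten(c):
    """Concatenate the rows into the linear chromosome used by the operators."""
    return [gene for row in c for gene in row]

# Three nodes: node 0 feeds nodes 1 and 2, node 1 feeds node 2.
c = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
print(is_feedforward(c), connections(c), len(flatten(c)))
```

The chromosome length grows with the square of the number of nodes, which is the scalability limitation commonly attributed to direct codifications.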
Direct codifications, however, have many disadvantages, such as poor scalability, the impossibility of codifying repeated structures, or permutation (i.e., different networks which are functionally equivalent can correspond to different genotypes) (Yao & Liu, 1998). As an alternative to direct codification, there are also indirect codification methods. With the objective of reducing the length of the genotypes, only some of the characteristics of the architecture are codified in the chromosome. Within this type of codification, there are various types of representation.

First, the parametric representations have to be mentioned. The network can be represented by a set of parameters such as the number of hidden layers, the number of connections between two layers, etc. There are several ways of codifying these parameters inside the chromosome (Harp, Samad & Guha, 1989). Although parametric representations can reduce the length of the chromosome, the evolutionary algorithm then searches only a limited portion of the space of all possible architectures. Another type of indirect codification is based on a representational system in the form of grammatical rules (Yao & Shi, 1995). In this system, the network is represented by a set of rules, in the form of production rules, which build a matrix that represents the network. Other types of codification, more inspired by the world of biology, are the ones known as “growing methods”. With them, the genotype no longer codifies the network, but instead contains a set of instructions. Decoding the genotype consists of executing these instructions, which leads to the construction of the phenotype (Husbands, Harvey, Cliff & Miller, 1994). These instructions usually include neural migrations, neuronal duplication or transformation, and neuronal differentiation. Finally, within the indirect codification methods, there are other methods which are very different from the ones already described. Andersen describes a technique in which each individual of a population represents a hidden node instead of the architecture (Andersen & Tsoi, 1993). Each hidden layer is constructed automatically by means of an evolutionary process which uses a genetic algorithm. This method has the limitation that only feed-forward networks can be constructed, and there is also a tendency for various nodes with a similar functionality to emerge, which introduces some redundancy into the network that must be eliminated.
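A parametric representation can be sketched as a short bit string that is decoded into layer sizes. The particular layout below (2 bits for the number of hidden layers, then 4 bits per layer for its number of neurons) is a hypothetical scheme invented for illustration and is not taken from the cited works.

```python
LAYER_BITS = 2                      # encodes 1 to 4 hidden layers
NEURON_BITS = 4                     # encodes 1 to 16 neurons per hidden layer
MAX_LAYERS = 2 ** LAYER_BITS
CHROMOSOME_LENGTH = LAYER_BITS + MAX_LAYERS * NEURON_BITS   # 18 bits in total

def bits_to_int(bits):
    """Interpret a list of bits as an unsigned integer (most significant first)."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def decode(chromosome, n_inputs, n_outputs):
    """Turn the chromosome into the layer sizes of a fully connected
    feed-forward network. Genes of unused hidden layers are ignored."""
    if len(chromosome) != CHROMOSOME_LENGTH:
        raise ValueError("unexpected chromosome length")
    n_hidden = bits_to_int(chromosome[:LAYER_BITS]) + 1
    sizes = [n_inputs]
    for k in range(n_hidden):
        start = LAYER_BITS + k * NEURON_BITS
        sizes.append(bits_to_int(chromosome[start:start + NEURON_BITS]) + 1)
    sizes.append(n_outputs)
    return sizes

# Two hidden layers (bits 01) with 5 (0100) and 3 (0010) neurons.
chromosome = [0, 1] + [0, 1, 0, 0] + [0, 0, 1, 0] + [0] * 8
print(decode(chromosome, n_inputs=7, n_outputs=2))
```

Eighteen bits are enough here regardless of how many connections the decoded network has, but only fully connected layered networks with at most 4 hidden layers of at most 16 neurons can ever be reached, which illustrates the restricted search space mentioned above.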
One important characteristic is that, in general, these methods only develop architectures, which is the most common case, or else architectures and weights together. The transfer function of each node is assumed to have been previously determined by a human expert and to be the same for all of the network nodes (or at least for all of the nodes of the same layer), although the transfer function has been shown to have a great influence on the behaviour of the network (Lovell & Tsoi, 1992). Few methods have been developed which cause the transfer function to evolve, and these have therefore had little repercussion in the world of ANNs with EC.

Evolution of the Learning Rule

Another interesting approach to the development of ANNs by means of EC is the evolution of the learning rule. This idea emerges because a training algorithm works differently when it is applied to networks with different architectures. In fact, given that the expert usually has very little a priori knowledge about a network, it is preferable to develop an automatic system that adapts the learning rule to the architecture and to the problem to be solved. There are several approaches to the evolution of the learning rule (Crosher, 1993) (Turney, Whitley & Anderson, 1996), although most of them are based only on how learning can modify or guide the evolution, and on the relation between the architecture and the connection weights. Actually, there are few works that focus on the evolution of the learning rule itself (Bengio, Bengio, Cloutier & Gecsei, 1992) (Ribert, Stocker, Lecourtier & Ennaji, 1994). One of the most common approaches is based on setting the parameters of the BP algorithm: learning rate and momentum. Some authors propose methods in which an evolutionary process is used to find these parameters while keeping the architecture constant (Kim, Jung, Kim & Park, 1996). Other authors propose codifying these BP algorithm parameters together with the network architecture inside the individuals of the population (Harp, Samad & Guha, 1989).
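The idea of evolving the learning rate and momentum for a fixed architecture can be sketched as follows. This is a toy under stated assumptions, not the method of any cited paper: `train_loss` replaces real BP training with gradient descent with momentum on a small quadratic, and the selection scheme, mutation widths and parameter bounds are arbitrary illustrative choices.

```python
import random

def train_loss(lr, momentum, steps=30):
    """Stand-in for BP training of a fixed architecture (assumption).

    Runs gradient descent with momentum on the ill-conditioned quadratic
    f(w) = 0.5 * (w0^2 + 25 * w1^2) and returns the final loss.
    """
    w = [1.0, 1.0]
    v = [0.0, 0.0]
    curv = [1.0, 25.0]
    for _ in range(steps):
        for i in range(2):
            grad = curv[i] * w[i]
            v[i] = momentum * v[i] - lr * grad
            w[i] += v[i]
        if abs(w[0]) > 1e6 or abs(w[1]) > 1e6:  # training diverged
            return float("inf")
    return 0.5 * sum(c * wi * wi for c, wi in zip(curv, w))

def evolve(pop_size=20, generations=25, seed=0):
    """Evolve (learning rate, momentum) pairs; lower final loss is fitter."""
    rng = random.Random(seed)
    pop = [(rng.uniform(0.0, 0.1), rng.uniform(0.0, 0.99)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda ind: train_loss(*ind))
        history.append(train_loss(*pop[0]))
        parents = pop[: pop_size // 2]  # truncation selection, keeps the best
        children = []
        for lr, mom in parents:
            child_lr = min(max(lr + rng.gauss(0, 0.01), 1e-4), 0.2)
            child_mom = min(max(mom + rng.gauss(0, 0.05), 0.0), 0.99)
            children.append((child_lr, child_mom))
        pop = parents + children
    pop.sort(key=lambda ind: train_loss(*ind))
    return pop[0], history

best, history = evolve()
print("best (lr, momentum):", best, "final loss:", train_loss(*best))
```

Because the parents survive unchanged, the best loss recorded in `history` never gets worse from one generation to the next. Codifying the architecture in the same individual would simply add more genes to each tuple.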

FUTURE TRENDS

The evolution of ANNs has been a research topic for several decades. The creation of new EC techniques and, in general, new AI techniques, together with the evolution and improvement of the existing ones, allows the development of new methods for automatically developing ANNs. Although there are methods that (more or less) automatically develop ANNs, they are usually not very efficient, since evolving architectures, weights and learning rules at once leads to a very large search space. This aspect still has to be improved.

CONCLUSION

The world of EC has provided a set of tools that can be applied to optimization problems. In this case, the problem is to find an optimal architecture and/or set of weight values and/or learning rule. The development of ANNs is therefore converted into an optimization problem. As the described techniques show, the use of EC techniques has made possible the development of ANNs without human intervention or, at least, with minimal participation of the expert in this task. As has been explained, these techniques have some problems. One of them is the permutation problem described above. Another problem is the loss of efficiency: the more complicated the structure to evolve (weights, learning rule, architecture), the less efficient the system will be, because the search space becomes much bigger. If the system has to evolve several things at a time (for example, architecture and weights, so that the ANN development is completely automated), this loss of efficiency increases. However, these systems still work faster than the whole manual process of repeatedly designing and training an ANN.

REFERENCES

Alba E., Aldana J.F. & Troya J.M. (1993) Fully automatic ANN design: A genetic approach. Proc. Int. Workshop Artificial Neural Networks (IWANN’93), Lecture Notes in Computer Science. (686) 399-404.
Andersen H.C. & Tsoi A.C. (1993) A constructive algorithm for the training of a multilayer perceptron based on the genetic algorithm. Complex Systems 7 (4) 249-268.
Angeline P.J., Saunders G.M. & Pollack J.B. (1994) An evolutionary algorithm that constructs recurrent neural networks. IEEE Trans. Neural Networks. (5) 54-65.
Bengio S., Bengio Y., Cloutier J. & Gecsei J. (1992) On the optimization of a synaptic learning rule. Preprints of the Conference on Optimality in Artificial and Biological Neural Networks.
Cantú-Paz E. & Kamath C. (2005) An Empirical Comparison of Combinations of Evolutionary Algorithms and Neural Networks for Classification Problems. IEEE Transactions on Systems, Man and Cybernetics – Part B: Cybernetics. 915-927.
Crosher D. (1993) The artificial evolution of a generalized class of adaptive processes. Preprints of AI’93 Workshop on Evolutionary Computation. 18-36.
Greenwood G.W. (1997) Training partially recurrent neural networks using evolutionary strategies. IEEE Trans. Speech Audio Processing. (5) 192-194.
Harp S.A., Samad T. & Guha A. (1989) Toward the genetic synthesis of neural networks. Proc. 3rd Int. Conf. Genetic Algorithms and Their Applications. 360-369.
Haykin S. (1999) Neural Networks (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.
Holland J.H. (1975) Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press.
Husbands P., Harvey I., Cliff D. & Miller G. (1994) The use of genetic algorithms for the development of sensorimotor control systems. From Perception to Action. (P. Gaussier and JD Nicoud, eds.). Los Alamitos, CA: IEEE Press.
Kim H., Jung S., Kim T. & Park K. (1996) Fast learning method for backpropagation neural network by evolutionary adaptation of learning rates. Neurocomputing, 11(1) 101-106.
Lovell D.R. & Tsoi A.C. (1992) The Performance of the Neocognitron with various S-Cell and C-Cell Transfer Functions, Intell. Machines Lab., Dep. Elect. Eng., Univ. Queensland, Tech. Rep.
McCulloch W.S. & Pitts W. (1943) A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics. (5) 115-133.
Miller G.F., Todd P.M. & Hegde S.U. (1989) Designing neural networks using genetic algorithms. Proceedings of the Third International Conference on Genetic Algorithms. San Mateo, CA: Morgan Kaufmann. 379-384.
Montana D. & Davis L. (1989) Training feed-forward neural networks using genetic algorithms. Proc. 11th Int. Joint Conf. Artificial Intelligence. San Mateo, CA: Morgan Kaufmann. 762-767.
Rabuñal J.R. & Dorado J. (2005) Artificial Neural Networks in Real-Life Applications. Idea Group Inc.
Ribert A., Stocker E., Lecourtier Y. & Ennaji A. (1994) Optimizing a Neural Network Architecture with an Adaptive Parameter Genetic Algorithm. Lecture Notes in Computer Science. Springer-Verlag. (1240) 527-535.
Rumelhart D.E., Hinton G.E. & Williams R.J. (1986) Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. D.E. Rumelhart & J.L. McClelland, Eds. Cambridge, MA: MIT Press. (1) 318-362.
Sietsma J. & Dow R.J.F. (1991) Creating Artificial Neural Networks that generalize. Neural Networks. (4) 1: 67-79.
Turney P., Whitley D. & Anderson R. (1996) Special issue on the baldwinian effect. Evolutionary Computation. 4(3) 213-329.
Whitley D., Starkweather T. & Bogart C. (1990) Genetic algorithms and neural networks: Optimizing connections and connectivity. Parallel Comput., Vol. 14, No 3. 347-361.
Yao X. & Shi Y. (1995) A preliminary study on designing artificial neural networks using co-evolution. Proc. IEEE Singapore Int. Conf. Intelligence Control and Instrumentation. 149-154.
Yao X. & Liu Y. (1998) Toward designing artificial neural networks by evolution. Appl. Math. Computation. vol. 91, no. 1, 83-90.

KEY TERMS

Artificial Neural Networks: Interconnected set of many simple processing units, commonly called neurons, that use a mathematical model that represents an input/output relation.

Back-Propagation Algorithm: Supervised learning technique used by ANNs that iteratively modifies the weights of the connections of the network so that the error obtained by comparing the network outputs with the desired ones decreases.

Evolutionary Computation: Set of Artificial Intelligence techniques used in optimization problems, which are inspired by biological mechanisms such as natural evolution.

Genetic Programming: Machine learning technique that uses an evolutionary algorithm in order to optimise a population of computer programs according to a fitness function which determines the capability of a program for performing a given task.

Genotype: The representation of an individual as an entire collection of genes to which the crossover and mutation operators are applied.

Phenotype: Expression of the properties coded by the individual’s genotype.

Population: Pool of individuals exhibiting equal or similar genome structures, which allows the application of genetic operators.

Search Space: Set of all possible situations that the problem we want to solve could ever be in.


ANN-Based Defects’ Diagnosis of Industrial Optical Devices Matthieu Voiry University of Paris, France SAGEM REOSC, France Véronique Amarger University of Paris, France Joel Bernier SAGEM REOSC, France Kurosh Madani University of Paris, France

INTRODUCTION

A major step in the fault diagnosis of high-quality optical devices is the detection and characterization of scratch and dig defects in products. These kinds of aesthetic flaws, created during different manufacturing steps, can harm the functional specificities of optical devices as well as their optical performance by generating undesirable scattered light, which can seriously damage the expected optical features. A reliable diagnosis of these defects is therefore crucial to ensure the products’ nominal specification. Moreover, such diagnosis is strongly motivated by the need to correct the manufacturing process in order to guarantee mass production quality while maintaining an acceptable production yield. Unfortunately, detecting and measuring such defects is still a challenging problem in production conditions, and the few available automatic control solutions remain ineffective. That is why, in most cases, the diagnosis relies on visual inspection of the whole production by a human expert. However, this conventional solution suffers from several acute restrictions related to the human operator’s intrinsic limitations (reduced sensitivity to very small defects, loss of detection exhaustiveness due to declining attentiveness, and tiredness and weariness caused by the repetitive nature of fault detection and fault diagnosis tasks).

To construct an effective automatic diagnosis system, we propose an approach based on four main operations: defect detection, data extraction, dimensionality reduction and neural classification. The first operation is based on imaging by Nomarski microscopy. The resulting images contain several items which have to be detected and then classified in order to discriminate between “false” defects (correctable defects) and “abiding” (permanent) ones. Indeed, in an industrial environment, a number of correctable defects (such as dust or cleaning marks) are usually present alongside the potential “abiding” defects. The extraction of relevant features is a key issue for the accuracy of the neural classification system: first because raw data (images) cannot be exploited directly and, moreover, because dealing with high-dimensional data can affect the learning performance of a neural network. This article presents the automatic diagnosis system and describes the operations of its different phases. An implementation on real industrial optical devices is carried out, and an experiment investigates item classification based on an MLP artificial neural network.

BACKGROUND

Today, the only existing solution for detecting and classifying defects on optical surfaces is a visual one, carried out by a human expert. The first original aspect of this work is the sensor used: Nomarski microscopy. Three main advantages distinguishing Nomarski microscopy (also known as “Differential Interference Contrast microscopy”, DIC) (Bouchareine, 1999) (Chatterjee, 2003) from other microscopy techniques have motivated our preference for this imaging technique. The first is the higher sensitivity of this technique compared with the other classical microscopy techniques (Dark Field, Bright Field) (Flewitt & Wild, 1994). Furthermore, DIC microscopy is robust to lighting non-homogeneity. Finally, this technology provides information related to depth (the third dimension), which can be exploited to characterize roughness or the depth of a defect. This last advantage offers valuable additional potential for characterizing scratch and dig flaws in high-tech optical devices. Therefore, Nomarski microscopy seems to be a suitable technique for detecting surface imperfections. On the other hand, since they have shown many attractive features in complex pattern recognition and classification tasks (Zhang, 2000) (Egmont-Petersen, de Ridder, & Handels, 2002), artificial neural network based techniques are used to solve difficult problems. In our particular case, the problem is the classification of small defects on a large observation surface. These promising techniques can, however, encounter difficulties when dealing with high-dimensional data. That is why we are also interested in methods for reducing data dimensionality.

DEFECTS’ DETECTION AND CLASSIFICATION

The suggested diagnosis process is outlined in the diagram of Figure 1. Each step is presented in turn: first the detection and data extraction phases, and then the classification phase coupled with dimensionality reduction. In a second part, investigations on real industrial data are carried out and the results obtained are presented.

Detection and Data Extraction

The aim of the defect detection stage is to extract defect images from the digital image produced by the DIC detector. The proposed method (Voiry, Houbre, Amarger, & Madani, 2005) includes four phases:

• Pre-processing: transformation of the DIC digital image in order to reduce the influence of lighting heterogeneity and to enhance the visibility of the targeted defects.
• Adaptive matching: adaptive process to match defects.
• Filtering and segmentation: noise removal and characterization of the defects’ outlines.
• Defect image extraction: construction of a correct defect representation.

Finally, the image associated with a given detected item gives an isolated (from other items) representation of the defect (i.e. it depicts the defect in its immediate environment), as shown in Figure 2. However, the information contained in such generated images is highly redundant, and these images do not necessarily have the same dimension (typically this dimension can turn out to be a hundred times as high). That is why this raw data (images) cannot be directly processed and first has to be appropriately encoded using some transformations. Such transformations must naturally be invariant with regard to geometric transformations (translation, rotation and scaling) and robust to different perturbations (noise, luminance variation and background variation). The Fourier-Mellin transformation is used as it provides invariant descriptors, which are considered to have good coding capacity in classification tasks (Choksuriwong, Laurent, & Emile, 2005) (Derrode, 1999) (Ghorbel, 1994). Finally, the processed features have to be normalized using the centring-reducing transformation. Providing a set of 13 features using this transform is a first acceptable compromise between the real-time processing constraints of the industrial environment and the quality of the defect image representation (Voiry, Madani, Amarger, & Houbre, 2006).
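The centring-reducing normalization of the extracted features can be sketched as below, assuming it denotes the usual standardization of each feature to zero mean and unit standard deviation. The function name `centre_reduce` and the sample values are illustrative only; real inputs would be the 13 Fourier-Mellin features of each detected item.

```python
import statistics

def centre_reduce(samples):
    """Column-wise centring (subtract the mean) and reducing (divide by std).

    samples: list of feature vectors of equal length.
    Constant features get a divisor of 1.0 to avoid division by zero.
    """
    columns = list(zip(*samples))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in samples
    ]

# Three items described by two features (toy values).
normalized = centre_reduce([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
for row in normalized:
    print(row)
```

After this step every non-constant feature has the same scale, so no single descriptor dominates the distances used by the dimensionality reduction and classification stages.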

Figure 1. Block diagram of the proposed defect diagnosis system

Figure 2. Images of characteristic items: (a) scratch; (b) dig; (c) dust; (d) cleaning marks

Dimensionality Reduction

To obtain a correct description of defects, we must consider a more or less large number of Fourier-Mellin invariants. But dealing with high-dimensional data poses problems, known as the “curse of dimensionality” (Verleysen, 2001). First, the number of samples required to reach a predefined level of precision in approximation tasks increases exponentially with the dimension. Thus, intuitively, the number of samples needed to properly learn a problem quickly becomes much too large to be collected by real systems as the dimension of the data increases. Moreover, surprising phenomena appear when working in high dimension (Demartines, 1994): for example, the variance of distances between vectors remains fixed while their average increases with the space dimension, and the local properties of the Gaussian kernel are also lost. These last points explain why the behaviour of a number of artificial neural network algorithms can be affected when dealing with high-dimensional data. Fortunately, most real-world problem data are located in a manifold of dimension p (the intrinsic dimension of the data) much smaller than its raw dimension. Reducing data dimensionality to this smaller value can therefore decrease the problems related to high dimension. In order to reduce the problem dimensionality, we use Curvilinear Distance Analysis (CDA). This technique is related to Curvilinear Component Analysis (CCA), whose goal is to reproduce the topology of an n-dimensional original space in a new p-dimensional space (where p

} Sort S by key score(X) descending. Return S.

Table 1. Performance of the three-layer decryption of a table-transposition cipher using a brute-force search. The first filter was negative predicate-based, removing all decrypts whose first 4 letters did not form a valid n-gram (about 90% of texts were removed). The score was then computed as the count of valid tetragrams in the whole text. If this count was lower than a given threshold (12), the text was removed in the score-based filter. Finally, the remaining texts were scored using dictionary words.

Key-space size   Negative filter   Score-based filter   Remaining texts   Total time [s]
9!               89.11%            10.82%               254               1.2
10!              89.15%            10.78%               2903              5.8
11!              88.08%            8.92%                239501            341
12!              90.10%            9.85%                1193512           746

The algorithm integrates the three layers of plaintext recognition, namely the negative test predicate, the fast scoring function and the precise scoring function, as a three-layer filter. The final scoring function is also used to sort the outputs. The first filter should be very fast, with a very low error probability. The fast score should be easy to compute, but it is not required to precisely identify the correct plaintext. Correct plaintext recognition is the role of the precise scoring function. In the algorithm, the best score is the highest one. If the score is computed in the opposite sense, the algorithm must be rewritten accordingly. In some cases, we can integrate a fast scoring function within the negative test or with the precise scoring, leading to two-layer filters, as in (Zajac, 2006a). It is also possible to use even more steps of predicate-based and score-based filtering. However, experiments show that the proposed three-layer architecture is the most flexible, and more layers can even lead to a performance decrease. Experimental results are shown in Table 1.
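The three-layer filter described above can be sketched as follows. This is a minimal illustration, not the article's exact algorithm: the toy block-transposition cipher, the tiny digram and word lists, the threshold of 3 and all function names are assumptions made so that the example is self-contained.

```python
from itertools import permutations

def three_layer_search(ciphertext, keys, decrypt, is_rejected, fast_score,
                       fast_threshold, precise_score):
    """Return (score, key, text) triples that survive all layers, best first."""
    survivors = []
    for key in keys:
        text = decrypt(ciphertext, key)
        if is_rejected(text):                    # layer 1: negative predicate
            continue
        if fast_score(text) < fast_threshold:    # layer 2: fast score + threshold
            continue
        survivors.append((precise_score(text), key, text))  # layer 3
    survivors.sort(key=lambda item: item[0], reverse=True)
    return survivors

# Toy cipher (assumption): permute letters inside blocks of len(key).
def encrypt(text, key):
    n = len(key)
    return "".join(text[b + key[i]] for b in range(0, len(text), n) for i in range(n))

def decrypt(text, key):
    n = len(key)
    out = []
    for b in range(0, len(text), n):
        block = [""] * n
        for i in range(n):
            block[key[i]] = text[b + i]
        out.append("".join(block))
    return "".join(out)

# Tiny illustrative language model (assumption, far too small for real use).
VALID_START = {"TH", "HE", "AN", "IN"}
COMMON_DIGRAMS = {"TH", "HE", "QU", "CK", "BR", "RO", "OW", "OX"}
WORDS = ("THE", "QUICK", "BROWN", "FOX")

secret_key = (2, 0, 3, 1)
ciphertext = encrypt("THEQUICKBROWNFOX", secret_key)

results = three_layer_search(
    ciphertext,
    permutations(range(4)),
    decrypt,
    is_rejected=lambda t: t[:2] not in VALID_START,
    fast_score=lambda t: sum(t[i:i + 2] in COMMON_DIGRAMS for i in range(len(t) - 1)),
    fast_threshold=3,
    precise_score=lambda t: sum(len(w) for w in WORDS if w in t),
)
print(results[0])  # (16, (2, 0, 3, 1), 'THEQUICKBROWNFOX')
```

The expensive dictionary score is evaluated only for candidates that pass the two cheaper layers, which is where the time savings reported in Table 1 come from.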

Negative Filtering

The goal of the negative test predicate is to identify candidate texts that are NOT plaintext (with very high probability, ideally with certainty). People can clearly recognize a wrong text just by looking at it, and implementing this ability in computers belongs to the area of artificial intelligence. However, most current AI methods (e.g. neural networks) seem to be too slow to be applicable at this stage of a brute-force algorithm, as every text must be evaluated with this predicate. Most of the methods for fast negative text filtering are based on prohibited n-grams. As an n-gram we consider only a sequence of n consecutive letters. If the alphabet size is N, then it is possible to create N^n possible n-grams. For higher n, only a small fraction of them can appear in valid text in a given language (Zajac, 2006b). By using a lexical tree or lookup table, it is easy (and fast) to verify whether a given n-gram is valid or not. Thus a natural test is to check whether every n-gram in the text is valid. Two basic problems arise with this approach: the real plaintext can contain (intentionally) misspelled, uncommon or foreign words, and our n-gram database can therefore be incomplete. We can limit our test to some specific patterns, e.g. an overly long run of consecutive vowels or consonants. These patterns can be checked in time dependent on the length of the plaintext candidate. A filter can also be based on checking only a few n-grams at a fixed or random position in the text, e.g. the first four letters. The rule for rejecting texts should be based on the exact type of the cipher we are trying to decipher. For example, if the first four letters of the decrypted plaintext do not depend on some part of the key, a filter based only on their validity would not be effective. An interesting question is whether it is possible to create a system which can effectively learn its filter rules from existing decrypted texts, even during the process of decryption.
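A negative predicate of the kind described above might look like the following sketch. The sample text used to build the n-gram lookup table, the run-length limits and the function names are assumptions for illustration; a real filter would be built from a large corpus of the target language. For scale, with N = 26 and n = 4 there are 26^4 = 456,976 possible tetragrams.

```python
VOWELS = set("AEIOUY")

def build_valid_ngrams(sample_text, n):
    """Collect every n-gram occurring in a sample of valid text (lookup table)."""
    return {sample_text[i:i + n] for i in range(len(sample_text) - n + 1)}

def longest_run(text, in_class):
    """Length of the longest run of consecutive characters satisfying in_class."""
    best = current = 0
    for ch in text:
        current = current + 1 if in_class(ch) else 0
        best = max(best, current)
    return best

def is_rejected(text, valid_ngrams, n, prefix_len=4, max_consonants=5, max_vowels=4):
    """Negative predicate: True means 'almost certainly not plaintext'."""
    prefix = text[:prefix_len]
    # Check only the n-grams at a fixed position: the first prefix_len letters.
    for i in range(len(prefix) - n + 1):
        if prefix[i:i + n] not in valid_ngrams:
            return True
    # Pattern checks: overly long runs of consonants or vowels.
    if longest_run(text, lambda c: c not in VOWELS) > max_consonants:
        return True
    return longest_run(text, lambda c: c in VOWELS) > max_vowels

# Toy "corpus" (assumption); far too small for practical use.
valid_digrams = build_valid_ngrams("THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG", 2)
print(is_rejected("THEQUICKBROWNFOX", valid_digrams, 2))   # False
print(is_rejected("QTHEKUICWBROXNFO", valid_digrams, 2))   # True
```

With such a small lookup table many valid texts would be rejected, which is the incompleteness problem noted in the text; the limits and table size trade speed against that error probability.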

Scoring Functions

With the negative filter step we can eliminate around 90% of candidate texts or more. The number of texts to be verified is still very large, and thus we need to apply more precise methods of plaintext recognition. We use a scoring function that assigns a quantity, the score, to every text that has survived elimination in the previous steps. Here a higher score means a higher likelihood that a given text is a valid plaintext. To each scoring function we can assign a threshold, and it should be very improbable that a valid plaintext has a score under this threshold. The actual threshold value can either be found experimentally (by evaluating a large number of real texts) or be based on a statistical analysis. The speed of the scoring function can be determined by using classical algorithm complexity estimates. The precision of the scoring can be defined by means of the separation of valid and invalid plaintexts. There is a trade-off involved in scoring, as faster scoring functions are less precise and vice versa. Thus we apply two scoring functions: one that is fast but less precise, with a lower threshold value, and one that is very precise but harder to compute. An example of scoring function distributions can be found in Figures 1 and 2. The scoring function in Figure 1 is much more precise than the one in Figure 2, but the computational time required for its evaluation is doubled. Moreover, the scoring function in Figure 1 was created from a reduced dictionary fitted to a given ciphertext. Evaluation based on a complete dictionary is slower, more difficult to implement, and can even be less precise. Scoring functions can be based on dictionary words, n-gram statistics, or other specific statistics. It is difficult to provide a one-fits-all scoring function, as the decryption process for different cipher types has an impact on the actual scoring function results. For example, when trying to decrypt a transposition cipher, we already know which letters appear with which frequency, and thus letter-frequency-based statistics do not play any role in scoring. On the other hand, they are quite significant for substitution ciphers.

Figure 1. Distribution of score among 9! possible decrypts (table transposition cipher with key given by a permutation of 9 columns), ciphertext size is 90 characters. Score was computed as a weighted sum of lengths of (reduced) dictionary words found in the text. Single highest score belongs to the correct plaintext.

The most common universal scoring functions are (see also Ganesan & Sherman, 1993):

1. Number of dictionary words in the text / fraction of meaningful text. Scoring based on dictionary words is very precise if we have a large enough dictionary. Even if not every word in the hidden message is in our dictionary, it is very improbable that an incorrect decryption contains a large fraction of text composed of dictionary words. Removing short words from the dictionary can increase the precision. Another possibility is to use weights based on word length, as in (Russell, Clark & Stepney, 2003). Dictionary words can be found using lexical trees. In some languages, we should use a dictionary of word stems instead of whole words. The speed of evaluation depends on the length of the text and the average length of the words in the dictionary.

2. Rank distance of n-grams. The rank of an n-gram is its position in an ordering based on n-gram frequencies (merged for all n up to a given bound dependent on the language). This method is used in fast automated language recognition (Cavnar & Trenkle, 1994): compute the ranks of the n-grams of a given text and compare them with the ranks of significant n-grams obtained from large corpora of different languages. The correct language should have the smallest distance. Even if this method can be adapted for plaintext recognition, e.g. by creating a “random” corpus or an encrypted corpus, it does not seem practical.

3. Frequency distance of n-gram statistics. The score of the text can be estimated from the difference between the measured frequencies of n-grams and the frequencies estimated from a large corpus (Clark, 1998; Spillman, Janssen, Nelson & Kepner, 1993). We suppose that the correct plaintext has the smallest distance from the corpus statistics. However, due to the statistical properties of text, this is not always true for short or specific texts. Thus the precision of this scoring function is higher for longer texts. The speed of the evaluation depends on the text size and on n.

4. Scoring tables for n-grams. If we consider the statistics of all n-grams, we will see that most n-grams contribute a very small value to the final score. We can consider the contribution of only the most common n-grams. For a given language (and a given ciphertext) we can prepare a table of n-gram scores with a fixed number of entries. The score is evaluated as the sum of the scores assigned to the n-grams in a given candidate text (Clark & Dawson, 1998). We can assign both positive scores for common valid n-grams and negative scores for common invalid or supposedly rare n-grams. However, the precision of this method is very low, especially for transposition ciphers. In our experiments the scores have a normal distribution, and usually the correct plaintext does not have the highest possible value. On the other hand, this scoring function is easy to evaluate and can be customized for a given ciphertext. Thus it can be used as a fast scoring function fastScore(X).

5. Index of coincidence (and other similar statistics). The index of coincidence (Friedman, 1920), denoted by IC, is a suitable and fast statistic applicable to ciphers that modify the letter frequency. The notion comes from the probability of the same letter occurring in two different texts at the same position. Encrypted texts are considered random and have a (normalized) index of coincidence near 1.0. On the other hand, a plaintext has a much higher IC, near the value expected for a given language and alphabet. For the English language the expected value is 1.73. As with all statistics-based scoring functions, its precision is influenced by the length of the text. The index of coincidence is most suitable for polyalphabetic ciphers, where encryption depends on the position of the letter in the text (e.g. the Vigenère cipher). It can be adapted to other cipher types (Bagnall, McKeown & Rayward-Smith, 1997), e.g. to transposition ciphers by considering an alphabet formed by all possible n-grams for some n > 1.
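The normalized index of coincidence can be computed with a few lines of code. The sketch below assumes the common formulation IC = N · Σ f_i (f_i − 1) / (n (n − 1)), where f_i are the letter counts, n is the text length and N the alphabet size; the function name and the restriction to the letters A-Z are illustrative choices.

```python
from collections import Counter

def index_of_coincidence(text, alphabet_size=26):
    """Normalized index of coincidence of the letters A-Z in text.

    Values near 1.0 suggest uniformly random letters; English plaintext
    is expected to be near 1.73.
    """
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    coincidences = sum(f * (f - 1) for f in counts.values())
    return alphabet_size * coincidences / (n * (n - 1))

print(index_of_coincidence("HELLO WORLD"))
# Reordering the letters leaves the value unchanged:
print(index_of_coincidence("DLROW OLLEH"))
```

Since the statistic depends only on letter counts, it cannot distinguish between decrypts of a transposition cipher, which is why the text suggests moving to an alphabet of n-grams in that case.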


Figure 2. Distribution of score among 9! possible decrypts (table transposition cipher with key given by a permutation of 9 columns), ciphertext size is 90 characters. Score is the count of digrams with frequency higher than 1% (in Slovak language) in a given text. The correct plaintext has scored 73.

FUTURE TRENDS

Language processing and recognition have applications in various areas outside cryptanalysis (OCR, automatic translation, etc.). Some cryptanalytic techniques can be generalized to these fields. For example, some letters or groups of letters are often replaced by others in scanned documents, so correcting these documents is similar to the cryptanalysis of randomized substitution ciphers. Artificial Intelligence research can provide new insights into the structure of natural language that can further help cryptanalysis. Cryptanalysis is also strongly related to automatic translation efforts. Some open problems that need to be addressed by language recognition suitable for the cryptanalysis of classical ciphers are the following:

• How text recognition should be integrated with the decryption process to give feedback, e.g. on partially decrypted words, to estimate a new key, etc. This is especially relevant if we use a more advanced search heuristic than a brute-force search through the key-space. This can also be viewed as a generalization of the results of Peleg & Rosenfeld (1979).
• How the syntax and semantics of the language can help in text recognition and key search, respectively.
• How various encodings and writing systems influence cryptanalysis. Specific issues arise when dealing with different writing systems (Atkinson, 1985; August, 1989 and 1990).
• How to correctly recognize text with intentional misspellings and special code words.

Another set of problems arises when different natural languages are used, such as language recognition, specific alphabets, the impact of diacritical marks, etc. Our research shows that the language of a message encrypted by a substitution cipher can be recognized even without decryption (Zajac, 2006b). It is even possible to use a dictionary of a different (although similar) language in the decryption process. It is an interesting research question whether it is possible to create a completely general language recognition function (or one restricted to some family of languages) usable for cryptanalysis. Plaintext recognition in cryptanalysis can also be seen as a specific information retrieval problem (Manning, Raghavan & Schütze, 2008). Multilanguage information retrieval targets problems similar to those presented above (see e.g. McNamee, 2006). The research in these areas can clearly influence each other in the future.



CONCLUSION

This article summarizes the usage and restrictions of language processing in the context of the cryptanalysis of classical ciphers. The application of these techniques usually differs according to the character of the analyzed cipher system, although we have presented some common techniques that can be easily adapted to a specific situation. Most cryptanalytic attacks require very fast language recognition, but great speed often causes inaccurate results, up to the point of unrecognizable decrypts. The role of Artificial Intelligence research is to find faster and more precise language predicates and to combine them into a useful plaintext recognition system.

REFERENCES

Atkinson, R. (1985). Ciphers in Oriental Languages. Cryptologia, 9(4), 373-380.
August, D. A. (1989). Cryptography and Exploitation of Chinese Manual Cryptosystems - Part I: The Encoding Problem. Cryptologia, 13(4), 289-302.
August, D. A. (1990). Cryptography and Exploitation of Chinese Manual Cryptosystems - Part II: The Encrypting Problem. Cryptologia, 14(1), 61-78.
Bagnall, T. & McKeown, G. P. & Rayward-Smith, V. J. (1997). The cryptanalysis of a three rotor machine using a genetic algorithm. In Thomas Back, editor, Proceedings of the Seventh International Conference on Genetic Algorithms (ICGA97), San Francisco, CA. Morgan Kaufmann.
Cavnar, W.B., & Trenkle, J.M. (1994). N-gram-based text categorization. Proceedings of the Third Symposium on Document Analysis and Info, 161-175.
Clark, A. J. (1998). Optimisation Heuristics for Cryptology. PhD thesis, Information Security Research Center, Faculty of Information Technology, Queensland University of Technology.
Clark, A. & Dawson, E. (1998). Optimisation Heuristics for the Automated Cryptanalysis of Classical Ciphers. Journal of Combinatorial Mathematics and Combinatorial Computing, vol. 28, 63-86.
Friedman, W. F. (1920). The Index of Coincidence and Its Applications in Cryptography, Riverbank Publication No. 22, Riverbank Labs., Geneva, Ill.
Ganesan, R. & Sherman, A. (1993). Statistical techniques for language recognition: An introduction and guide for cryptanalysts. Cryptologia, 17(4), 321-366.
Manning, C.D. & Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
McMahon, J. & Smith, F.J. (1998). A Review of Statistical Language Processing Techniques. Artificial Intelligence Review 12 (5), 347-391.
McNamee, P. (2006). Why You Should Use N-grams for Multilingual Information Retrieval. UMBC eBiquity Talk. http://www.umiacs.umd.edu/research/CLIP/colloq/abstracts/2006-10-18-slides.pdf
Peleg, S. & Rosenfeld, A. (1979). Breaking Substitution Ciphers Using a Relaxation Algorithm. Communications of the ACM 22(11), 598-605.
Russell, M. D. & Clark, J. A. & Stepney, S. (2003). Making the most of two heuristics: Breaking transposition ciphers with ants. Proceedings of IEEE Congress on Evolutionary Computation (CEC 2003). IEEE Press, 2653-2658.
Spillman, R. & Janssen, M. & Nelson, B. & Kepner, M. (1993). Use of a genetic algorithm in the cryptanalysis of simple substitution ciphers. Cryptologia, 17(1), 31-44.
Zajac, P. (2006a). Automated Attacks on Transposition Ciphers. Begabtenförderung im MINT Bereich 14, 61-76.
Zajac, P. (2006b). Ciphertext language identification. Journal of Electrical Engineering, 57 (7/s), 26-29.

KEY TERMS
Brute-Force Attack: Exhaustive cryptanalytic technique that searches the whole key-space to find the correct key.
Candidate Text: The text obtained by applying the decryption algorithm to the ciphertext using some key k ∈ K. If k is the correct key K (or a key equivalent to K), then the candidate text is a valid plaintext x; otherwise it is the text dk(eK(x)), i.e. the plaintext encrypted with K and then decrypted with the wrong key k.
Ciphertext: The encrypted text, a string of letters from the alphabet C of a given cryptosystem, produced with a given key K ∈ K.
Classical Cipher: A classical cipher system is a five-tuple (P,C,K,E,D), where P and C define the plaintext and ciphertext alphabets, K is the set of possible keys, and for each K ∈ K there exists an encryption algorithm eK ∈ E and a corresponding decryption algorithm dK ∈ D such that dK(eK(x)) = x for every input x ∈ P and K ∈ K.
Cryptanalysis: The process of trying to decrypt a given ciphertext and/or find the key without, or with only partial, knowledge of the key. It is also the research area studying the techniques of cryptanalysis.
Key-Space: The set of all possible keys for a given ciphertext. The key-space can be limited to a subspace of the whole K by some prior knowledge.
Plaintext: The unencrypted text, a string of letters from the alphabet P of a given cryptosystem.
Plaintexts Filter: An algorithm, or predicate, used to determine which texts are not valid plaintexts. An ideal plaintexts filter never produces the answer INVALID for a correct plaintext.
Scoring Function: A function used to evaluate the fitness of a candidate text for a key k ∈ K. An ideal scoring function has its global extreme at the correct plaintext, i.e. when k = K.

Automated Cryptanalysis of Classical Ciphers Otokar Grošek Slovak University of Technology, Slovakia Pavol Zajac Slovak University of Technology, Slovakia

INTRODUCTION Classical ciphers are used to encrypt plaintext messages written in a natural language in such a way that they are readable for sender or intended recipient only. Many classical ciphers can be broken by brute-force search through the key-space. Methods of artificial intelligence, such as optimization heuristics, can be used to narrow the search space, to speed-up text processing and text recognition in the cryptanalytic process. Here we present a broad overview of different AI techniques usable in cryptanalysis of classical ciphers. Specific methods to effectively recognize the correctly decrypted text among many possible decrypts are discussed in the next part Automated cryptanalysis – Language processing.

BACKGROUND Cryptanalysis can be seen as an effort to translate a ciphertext (an encrypted text) to a human language. Cryptanalysis can thus be related to the computational linguistics. This area originated with efforts in the United States in the 1950s to have computers automatically translate texts from foreign languages into English, particularly Russian scientific journals. Nowadays it is a field of study devoted to developing algorithms and software for intelligently processing language data. Systematic (public) efforts to automate cryptanalysis using computers can be traced to first papers written in late ’70s (see e.g. Schatz, 1977). However, the research area has still many open problems, closely connected to an area of Artificial Intelligence. It can be concluded from the current state-of-the-art, that although computers are very useful in many cryptanalytic tasks, a human intelligence is still essential in complete cryptanalysis.

For the convenience of the reader we recall some basic notions from cryptography. A very thorough survey of classical ciphers is given by Kahn (1974). A message to be encrypted (plaintext) is written in the lowercase alphabet P = {a, b, c… x, y, z}. The encrypted message (ciphertext) is written in the uppercase alphabet C = {A, B, C… X, Y, Z}. Different alphabets are used only in order to better distinguish plaintext from ciphertext; in fact these alphabets are the same. There is a reversible encryption rule (algorithm) that transforms the plaintext into the ciphertext, and vice versa. These algorithms depend on a secret parameter K called the key. The set of possible keys K is called the key-space. The input and output of these algorithms are strings of letters from the respective alphabets, P* and C*. Both the sender and the receiver use the same secret key, and the same encryption and decryption algorithms. There are three basic classical systems to encrypt a message, namely a substitution, a transposition, and a running key. In a substitution cipher a string of letters is replaced by another string of letters using a prescribed substitution of single letters, e.g. replacing letter ‘a’ by letter ‘A’, letter ‘b’ by letter ‘N’, letter ‘c’ by letter ‘G’, etc. A transposition cipher rearranges the order of letters according to a secret key K. Unlike in substitution ciphers, the frequency of letters in the plaintext and in the ciphertext remains the same. This characteristic is used to recognize that a text was encrypted by some transposition cipher. A typical running key cipher derives from a main key K the running key K0 K1 K2…Kn. If P = C = K is a group, then simply yi = eK(xi) = xi + Ki. Thus it is convenient to define a ciphering algorithm for classical ciphers as follows: Definition 1: A classical cipher system is a five-tuple (P,C,K,E,D), where the following conditions are satisfied:



1. P is a finite set, the plaintext alphabet, and P* is the set of all finite strings of symbols from P.
2. C is a finite set, the ciphertext alphabet, and C* is the set of all finite strings of symbols from C.
3. K is a finite set of possible keys.
4. For each K ∈ K, there is an encryption algorithm eK ∈ E and a corresponding decryption algorithm dK ∈ D such that dK(eK(x)) = x for every input x ∈ P and K ∈ K.
5. The ciphering algorithm assigns to any finite string x0 x1 x2…xn from P* the resulting ciphertext string y0 y1 y2…yn from C*, where yi = eK(xi). The actual key may or may not depend on the index i.
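The three basic systems named in the background can be illustrated with a minimal Python sketch. The function names, the block-permutation form of the transposition key, and the representation of the running key as a list of integers in Z_26 are assumptions made for illustration only; they are not taken from the article.

```python
import string

P = string.ascii_lowercase   # plaintext alphabet
C = string.ascii_uppercase   # ciphertext alphabet


def substitution_encrypt(x, key):
    """key is a 26-letter uppercase permutation; key[i] replaces P[i]."""
    return "".join(key[P.index(ch)] for ch in x)


def substitution_decrypt(y, key):
    return "".join(P[key.index(ch)] for ch in y)


def transposition_encrypt(x, perm):
    """Rearrange each block of len(perm) letters: output position j takes input position perm[j]."""
    r = len(perm)
    assert len(x) % r == 0, "sketch assumes the message fills whole blocks"
    blocks = (x[i:i + r] for i in range(0, len(x), r))
    return "".join("".join(b[j] for j in perm) for b in blocks).upper()


def transposition_decrypt(y, perm):
    r = len(perm)
    out = []
    for i in range(0, len(y), r):
        block, plain = y[i:i + r], [""] * r
        for j, src in enumerate(perm):
            plain[src] = block[j].lower()
        out.append("".join(plain))
    return "".join(out)


def running_key_encrypt(x, keystream):
    """y_i = x_i + K_i computed in the group Z_26."""
    return "".join(C[(P.index(a) + k) % 26] for a, k in zip(x, keystream))


def running_key_decrypt(y, keystream):
    return "".join(P[(C.index(b) - k) % 26] for b, k in zip(y, keystream))


# A key consistent with the example in the text: a -> A, b -> N, c -> G.
SUBST_KEY = "ANGBCDEFHIJKLMOPQRSTUVWXYZ"
```

In each case decryption inverts encryption, which is condition 4 of Definition 1. The transposition ciphertext contains exactly the letters of the plaintext, so single-letter frequencies are unchanged.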

Another typical case for P and C is r-tuples of the Latin alphabet. For transposition ciphers, the key is periodically repeated for r-tuples. For substitution ciphers of r-tuples, the key is an r-tuple of keys. In the case of running keys, there is an additional key stream generator g: K × P → K which generates the actual key Kn from the initial key K, and possibly from the plaintext x0 x1 x2…xn-1. For classical ciphers, there are two typical situations when we try to recover the plaintext:

1. The input to the decryption algorithm dK ∈ D with unknown key K is a ciphertext string y0 y1 y2…yn from C*, where yi = eK(xi). Our aim is to find the plaintext string x0 x1 x2…xn from P*. Thus in each execution the algorithm searches through the key-space K.
2. Both the decryption algorithm dK ∈ D and the key K are unknown. Our aim is to find, for the ciphertext string y0 y1 y2…yn from C*, where yi = eK(xi), the plaintext string x0 x1 x2…xn from P*. This requires a different algorithm than the actual dK ∈ D, as well as some additional information. Usually another ciphertext is available, say z0 z1 z2…zn from C*. Thus in each execution the algorithm searches through possible substitutions which are suitable for both ciphertexts.

In both cases we need a plaintext recognition subroutine which evaluates a candidate substring of length v for a possible plaintext, say ct c1+t c2+t…cv+t := xt x1+t x2+t…xv+t . Such automated text recognition needs an adequate model of a used language.

AUTOMATED CRYPTANALYSIS There are two straightforward methods for automated cryptanalysis. Unfortunately, neither of them is applicable in practice to longer strings. The first one is for transposition ciphers. When no other information about the cipher is known, we can use a general method, called anagramming, to decipher the message. In this method we try to assemble a meaningful string (anagram) from the ciphertext. This is accomplished by arranging the letters into words from the dictionary. When we find a meaningful word, we process the rest of the message in the same way. When we are not able to create more meaningful words, we retrace our steps and try other possible words until a whole meaningful anagram is found. The second, very similar, method is for substitution ciphers. Here we try to assemble a meaningful string from the ciphertext by searching through all possible substitutions of letters to get words from a dictionary of the language used. Although the size of the key-space is large, automated cryptanalysis can use many other methods, based e.g. on the frequency distribution of letters. Automated cryptanalysis of simple substitution ciphers can decrypt most messages both with known word boundaries (Carrol & Martin, 1986) and without this information (Ramesh, Athithan & Thiruvengadam, 1993; Jakobsen, 1995). There are other classical ciphers where the transposition or substitution depends not only on the actual key, but also on the position within a block of letters of the string. For effective automated cryptanalysis at least two layers of plaintext candidate processing, filtering and scoring, are required. Better results are achieved by additional filtering layers. This of course increases computational complexity. Below we give an overview of these filtering layers.

Automated Brute Force Attacks The basic type of algorithm suitable for automated cryptanalysis is a brute force attack. As we have to search the whole key-space, this attack is only feasible when key-space is “not too large”. Exact quantification of the searchable key-space depends on computational resources available to an attacker, and the average time needed to verify a candidate for decrypted text. Thus, the plaintext recognition is the most critical part of the algorithm from the performance point of view.



On the other hand, only the most complex algorithms achieve really high accuracy of plaintext recognition. Thus a careful balance between the complexity of the plaintext recognition algorithm and its accuracy is required. It is unlikely that automated cryptanalysis produces only one possible result, but it is possible to limit the set of possible decrypts to a manageable size. Reported results should be sorted according to their probability of being the true plaintext. A generic brute force algorithm with plaintext recognition can be described by the pseudo-code in Exhibit A. We have identified three layers of plaintext recognition, namely a negative test predicate, a fast scoring function and a precise scoring function. All three functions are used as a three-layer filter, and the final scoring function is also used to sort the outputs. The first filter should be very fast, and should have a very low error probability. The fast score should be easy to compute, but it is not required to identify the correct plaintext precisely. Correct plaintext recognition is the role of the precise scoring function. In the algorithm, the best score is the highest one. If the score is computed in the opposite sense (lower is better), the comparisons and the sort order must be reversed accordingly. In some cases, we can integrate the fast scoring function with the negative test or with the precise scoring, leading to two-layer filters, as in (Zajac, 2006a). It is also possible to use even more steps of predicate-based and score-based filtering, respectively. However, experiments show that the proposed three-layer architecture is the most flexible, and more layers can even lead to a performance decrease. Scoring and filtering is described in depth in the article Automated cryptanalysis – Language processing.
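The three-layer brute-force search described above can be sketched in Python as follows. This is only an illustration under strong assumptions: the cipher is a simple shift cipher with 26 keys, and the predicate, both scoring functions, the word list and the two limits are toy choices invented for this example rather than the functions used by the authors.

```python
import string

P = string.ascii_lowercase
COMMON = set("etaoinshr")                        # assumed frequent English letters
WORDS = ("the", "and", "attack", "dawn", "at")   # toy dictionary (assumption)


def encrypt(x, k):
    """Shift cipher used only to produce a test ciphertext."""
    return "".join(chr(65 + (P.index(c) + k) % 26) for c in x)


def decrypt(y, k):
    return "".join(P[(ord(c) - 65 - k) % 26] for c in y)


def negative_filter(x):
    """Negative test predicate: True means x is certainly not a plaintext (no vowel at all)."""
    return not any(c in "aeiouy" for c in x)


def fast_score(x):
    """Cheap score: fraction of letters that are among the assumed common ones."""
    return sum(c in COMMON for c in x) / len(x)


def score(x):
    """More precise (and slower) score: dictionary-word coverage of the text."""
    return sum(x.count(w) * len(w) for w in WORDS) / len(x)


def brute_force(y, keys, limit_fast=0.4, limit=0.5):
    """Return (score, key, candidate) triples that pass all three layers, best first."""
    results = []
    for k in keys:
        x = decrypt(y, k)
        if negative_filter(x):
            continue
        if fast_score(x) < limit_fast:
            continue
        s = score(x)
        if s < limit:
            continue
        results.append((s, k, x))
    results.sort(key=lambda item: item[0], reverse=True)
    return results


candidates = brute_force(encrypt("attackatdawn", 3), range(26))
```

The ordering of the layers matters for speed: the expensive `score` is computed only for candidates that survive the two cheaper tests, which is the balance between complexity and accuracy discussed in the text.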

Exhibit A.

INPUT: ciphertext string Y = y0 y1 y2…yn
OUTPUT: ordered sequence S of possible plaintexts with their scores
1. Let S = { }
2. For each key K ∈ K do
   2.1. Let X = dK(Y) be a candidate plaintext.
   2.2. Compute negative test predicate filter(X). If predicate is true, continue with step 2.
   2.3. Compute fast scoring function fastScore(X). If fastScore(X) < LIMITF, continue with step 2.
   2.4. Compute precise scoring function score(X). If score(X) < LIMIT, continue with step 2.
   2.5. Let S = S ∪ {<score(X), X>}
3. Sort S by key score(X) descending.
4. Return S.

Applications of Artificial Intelligence Methods Artificial Intelligence (AI) methods can be used in four main areas of the automated cryptanalysis:

1. Plaintext recognition: The goal of the AI is to supply negative predicates that filter out wrong decrypts, and scoring functions that assess the text’s likeness to natural language.
2. Key-search heuristics: The goal of the AI is to provide heuristics to speed up the decryption process either by constraining the key-space, or by guiding the selection of next keys to be tried in the decryption. This area is most often researched, as it can provide clear experimental results and meaningful evaluation.
3. Plaintext estimation: The goal of the AI is to estimate the meaning of the plaintext from the partial decryption, or to estimate some parts of the plaintext based on external data (e.g. the sender of a ciphertext, historical and geographic context, specific grammatical rules, etc.). Estimated parts of the plaintext can then lead to a much easier complete decryption. This area of research is mainly unexplored, and plaintext estimation is done by the cryptanalyst.
4. Automatic security evaluation: The goal of cryptanalysis is not only to break ciphers and to learn secrets; it is also used when creating new ciphers to evaluate their security. Although most classical ciphers are already “outdated”, their cryptanalysis is still important, e.g. in teaching modern computer security principles. When teaching classical ciphers, it is useful to have an AI tool (e.g. an expert system) that can automate the evaluation of cipher security (at least under some weaker assumptions). Although much work has been done on automatic evaluation of modern security protocols, we are unaware of any tools to evaluate “classical” cipher designs.

The best researched area is that of key-search heuristics. This follows immediately from the fact that brute force search through the whole key-space can be considered a very crude method of decryption. Most classical ciphers were not designed with careful consideration of the text statistics. We can assign to each key in the key-space a score that is correlated with the probability that the text decrypted by the given key is the plaintext. The score, when considered over the key-space, certainly has some local maxima, which can lead either immediately to a meaningful plaintext, or to a text from which the plaintext is easily guessed. Thus it can be useful to consider various relaxation techniques to search through the key-space with the goal of maximizing the scoring function. Some of the earliest demonstrations of relaxation techniques for breaking substitution ciphers were presented by Peleg & Rosenfeld (1979) and Hunter & McKenzie (1983). Successful attacks applicable to many classical ciphers can be implemented using techniques ranging from basic hill climbing, through tabu search and simulated annealing, to genetic/evolutionary algorithms (Clark & Dawson, 1998). Genetic algorithms have achieved many successes in breaking classical ciphers, as demonstrated by Matthews (1993) or Clark (1994), and can even break a rotor machine (Bagnall, McKeown & Rayward-Smith, 1997).
Russell, Clark & Stepney (2003) present an anagramming attack using a solver based on an ant colony optimisation algorithm. These types of attack try to converge to the correct key by small changes of the actual key. The success rate of the attacks is usually measured by the fraction of the key and/or text that is reconstructed. Relaxation methods can find the keys, or plaintext approximations, with high probability even if it is not feasible to search the whole key-space. The success mainly depends on the ciphertext size, since the scoring is usually statistics-based. One of the unexplored challenges is to consider the application of multiple relaxation techniques: a first heuristic can be used to shrink the key-space, and then either a brute-force search or another heuristic is used with more precision to finish the decryption.
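The idea of converging to the key by small changes can be sketched as a basic hill climb over substitution keys. Everything specific here is an assumption for illustration: the scoring function is a plain bigram count built from a reference text supplied by the caller, a step swaps two letters of the key, and the function names are hypothetical. With a short ciphertext and such a crude score the search will often stop at a local maximum, in line with the remark that success depends on ciphertext size.

```python
import random
import string
from collections import Counter

P = string.ascii_lowercase


def bigram_model(reference):
    """Bigram counts of a reference text, a stand-in for real language statistics."""
    return Counter(reference[i:i + 2] for i in range(len(reference) - 1))


def encrypt(x, key):
    """key[i] is the ciphertext letter that replaces P[i]."""
    return "".join(key[P.index(c)] for c in x)


def decrypt(y, key):
    table = {c: p for p, c in zip(P, key)}
    return "".join(table[c] for c in y)


def score(x, model):
    """Statistics-based score: how many bigrams of x are frequent in the model."""
    return sum(model[x[i:i + 2]] for i in range(len(x) - 1))


def hill_climb(y, model, steps=2000, seed=0):
    """Start from a random key and keep a letter swap only if the score improves."""
    rng = random.Random(seed)
    key = list(string.ascii_uppercase)
    rng.shuffle(key)
    best = score(decrypt(y, key), model)
    for _ in range(steps):
        i, j = rng.sample(range(26), 2)
        key[i], key[j] = key[j], key[i]        # small change of the actual key
        s = score(decrypt(y, key), model)
        if s > best:
            best = s                           # accept the improvement
        else:
            key[i], key[j] = key[j], key[i]    # undo the swap
    return "".join(key), best
```

Tabu search, simulated annealing and genetic algorithms replace the strict "accept only improvements" rule with strategies that can escape local maxima; the scoring function can stay the same.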

FUTURE TRENDS The results obtained strongly depend on the size of the ciphertext, and decryptions are usually only partial. Techniques of automated cryptanalysis also need to be fitted to a given problem. For example, attacks on substitution ciphers can use individual letter statistics, but for transposition ciphers these statistics are invariant, so using them makes no sense. Automated cryptanalysis is usually studied only in the context of these two main types of ciphers, but there is a broad area of unexplored problems concerning different classical cipher types, such as running key type ciphers. Specific uses of AI techniques can fail for some cryptosystems, as pointed out by Wagner, Affenzeller & Schragl (2004). Cryptanalysis also depends on the language (Zajac, 2006b), although there are some notable exceptions when considering similar languages. As computational power increases, even ciphers used until recently, like the Data Encryption Standard (DES), are becoming subject to automated cryptanalysis (e.g. Nalini & Raghavendra Rao, 2007). Besides the application of heuristics to cryptanalysis, a lot of further research is required in the areas of plaintext estimation and automatic security evaluation. An expert system that would cover these areas and connect them with AI for plaintext recognition and search heuristics could be a strong tool to teach computer security, or to help forensic analysis or historical studies involving encrypted materials.

CONCLUSION This article is concerned with the automated cryptanalysis of classical ciphers, where classical ciphers are understood as ciphers from before WW2, or pencil-and-paper ciphers. Optimization heuristics are quite successful in attacks targeting these ciphers, but they usually cannot be made fully automatic. Their application usually differs according to the character of the analysed cipher systems. An important research direction is extending the techniques from classical cryptanalysis to automated decryption of modern digital cryptosystems. Another important problem is to create a set of fully automatic cryptanalytic tools, or a complete expert system, that can be adapted to various types of ciphers and languages.

REFERENCES
Bagnall, T. & McKeown, G. P. & Rayward-Smith, V. J. (1997). The cryptanalysis of a three rotor machine using a genetic algorithm. In Thomas Back, editor, Proceedings of the Seventh International Conference on Genetic Algorithms (ICGA97), San Francisco, CA. Morgan Kaufmann.
Carrol, J. & Martin, S. (1986). The automated cryptanalysis of substitution ciphers. Cryptologia, 10(4), 193-209.
Clark, A. (1994). Modern optimisation algorithms for cryptanalysis. In Proceedings of the 1994 Second Australian and New Zealand Conference on Intelligent Information Systems, November 29 - December 2, 258-262.
Clark, A. & Dawson, E. (1998). Optimisation Heuristics for the Automated Cryptanalysis of Classical Ciphers. Journal of Combinatorial Mathematics and Combinatorial Computing, vol. 28, 63-86.
Hunter, D.G.N. & McKenzie, A. R. (1983). Experiments with Relaxation Algorithms for Breaking Simple Substitution Ciphers. Comput. J., 26(1), 68-71.
Jakobsen, T. (1995). A fast method for cryptanalysis of substitution ciphers. Cryptologia, 19(3), 265-274.
Kahn, D. (1974). The codebreakers. Wiedenfeld and Nicolson, London.
Matthews, R.A.J. (1993). The use of genetic algorithms in cryptanalysis. Cryptologia, 17(4), 187-201.
Nalini, N. & Raghavendra Rao, A. (2007). Attacks of simple block ciphers via efficient heuristics. Information Sciences: an International Journal, 177(12), 2553-2569.
Peleg, S. & Rosenfeld, A. (1979). Breaking Substitution Ciphers Using a Relaxation Algorithm. Communications of the ACM, 22(11), 598-605.
Ramesh, R.S. & Athithan, G. & Thiruvengadam, K. (1993). An automated approach to solve simple substitution ciphers. Cryptologia, 17(2), 202-218.
Russell, M. D. & Clark, J. A. & Stepney, S. (2003). Making the most of two heuristics: Breaking transposition ciphers with ants. Proceedings of IEEE Congress on Evolutionary Computation (CEC 2003). IEEE Press, 2653-2658.
Schatz, B. (1977). Automated analysis of cryptograms. Cryptologia, 1(2), 265-274. Also in: Cryptology: yesterday, today, and tomorrow, Artech House 1987, ISBN: 0-89006-253-6.
Wagner, S. & Affenzeller, M. & Schragl, D. (2004). Traps and Dangers when Modelling Problems for Genetic Algorithms. Cybernetics and Systems, pp. 79-84.
Zajac, P. (2006a). Automated Attacks on Transposition Ciphers. Begabtenförderung im MINT Bereich 14, 61-76.
Zajac, P. (2006b). Ciphertext language identification. Journal of Electrical Engineering, 57(7/s), 26-29.

KEY TERMS
Brute-Force Attack: Exhaustive cryptanalytic technique that searches the whole key-space to find the correct key.
Ciphertext: The encrypted text, a string of letters from the alphabet C of a given cryptosystem, produced with a given key K ∈ K.
Classical Cipher: A classical cipher system is a five-tuple (P,C,K,E,D), where P and C define the plaintext and ciphertext alphabets, K is the set of possible keys, and for each K ∈ K there exists an encryption algorithm eK ∈ E and a corresponding decryption algorithm dK ∈ D such that dK(eK(x)) = x for every input x ∈ P and K ∈ K.
Cryptanalysis: The process of trying to decrypt a given ciphertext and/or find the key without, or with only partial, knowledge of the key. It is also the research area studying the techniques of cryptanalysis.
Key-Space: The set of all possible keys for a given ciphertext. The key-space can be limited to a subspace of the whole K by some prior knowledge.
Plaintext: The unencrypted text, a string of letters from the alphabet P of a given cryptosystem.
Relaxation Attack: Cryptanalytic technique that searches the key-space by incremental updates of the candidate key(s). It usually applies the knowledge of previous trial decryption(s) to change some parts of the key.

Automatic Classification of Impact-Echo Spectra I Addisson Salazar iTEAM, Polytechnic University of Valencia, Spain Arturo Serrano iTEAM, Polytechnic University of Valencia, Spain

INTRODUCTION We investigate the application of artificial neural networks (ANNs) to the classification of spectra from impact-echo signals. In this paper we provide analyses of simulated signals; the second part of the paper details the results of lab experiments. The data set for this research consists of sonic and ultrasonic impact-echo signal spectra obtained from 100 3D finite element models. These spectra, along with a categorization of the materials into homogeneous and defective classes depending on the kind of material defects, were used to develop supervised neural network classifiers. Four levels of complexity were proposed for the classification of materials: material condition, kind of defect, defect orientation and defect dimension. Results from Multilayer Perceptron (MLP) and Radial Basis Function (RBF) neural networks are compared with those from Linear Discriminant Analysis (LDA) and k-Nearest Neighbours (kNN) algorithms (Duda, Hart, & Stork, 2000), (Bishop C.M., 2004). Suitable results for LDA and RBF were obtained. Impact-echo is a technique for non-destructive evaluation based on monitoring the surface motion resulting from a short-duration mechanical impact. It has been widely used on concrete structures in civil engineering. Cross-sectional resonant modes in impact-echo signals have been analyzed in elements of different shapes, such as circular and square beams, beams with empty ducts or cement fillings, etc. In addition, frequency analyses of the displacement of the fundamental frequency to lower values for the detection of cracks have been studied (Sansalone & Street, 1997), (Carino, 2001). Impact-echo wave propagation can be analyzed in terms of transient and stationary behaviour. The excitation signal (the impact) produces a short transient stage in which the first P (normal stress), S (shear stress) and Rayleigh (superficial) waves arrive at the sensors; afterward the wave propagation phenomenon becomes stationary, and a manifold of different mixtures of waves, including various changes of S-wave to P-wave propagation mode and vice versa, arrive at the sensors. Patterns of waveform displacements in this latter stage are known as the resonant modes of the material. The spectra of impact-echo signals provide information for classification based on the resonant modes of the inspected materials. The classification tree used in this paper has four levels, from global to detailed classes, with up to 12 classes in the lowest level. The levels are: (i) Material condition: homogeneous, one defect, multiple defects, (ii) Kind of defect: homogeneous, hole, crack, multiple defects, (iii) Defect orientation: homogeneous, hole in axis X or axis Y, crack in planes XY, ZY, or XZ, multiple defects, and (iv) Defect dimension: homogeneous, passing through and half passing through types of holes and cracks of level iii, multiple defects. Some examples of defective models are shown in Figure 1.

BACKGROUND Neural network applications in impact-echo testing include: detection of flaws in concrete slabs, combining spectra of numerical simulations and real signals for network training (Pratt & Sansalone, 1992); identification of unilaterally working sublayer cracks using numerically generated waveforms as network inputs (Stavroulakis, 1999); classification of concrete slabs as solid or defective (containing a void or delamination), using training features extracted from many repetitions of impact-echo experiments on three specimens to be classified in three classes (Xiang & Tso, 2002); and prediction of shallow crack depths in asphalt pavements using features from an extensive real signal dataset (Mei, 2004). All these studies used multilayer perceptron neural networks and monosensor impact-echo systems. In a recent work, we classified impact-echo data by neural networks using temporal and frequency features extracted from the signals, finding that the better features were the frequency features (Salazar, Unió, Serrano, & Gosalbez, 2007). Thus the present work focuses on exploiting only the spectral information of the impact-echo signals. These spectra contain a large amount of redundant information. We applied Principal Component Analysis (PCA) to the spectra for compression and noise removal. The proposed classification problem and the use of spectra PCA components as classification features are a new proposal in the application of neural networks to impact-echo testing. There is evidence that the first components of PCA retain essentially all of the useful information, and that this compression optimally removes noise and can be used to identify unusual spectra (Bailer-Jones, 1996), (Bailer-Jones, Irwin, & Hippel, 1998), (Xu et al., 2004). The principal components represent sources of variance in the data. The projection of the pth spectrum onto the kth principal component is known as the admixture coefficient $a_{k,p}$. The most significant principal components contain those features which are most strongly correlated in many of the spectra. It follows that noise (which is uncorrelated with any other features by definition) will be represented in the less significant components. Thus by retaining only the more significant components to represent the spectra we achieve a data compression that preferentially removes noise. The reduced reconstruction $y_p$ of the pth spectrum $x_p$ is obtained by using only the first r principal components to reconstruct the spectrum, i.e.

$$ y_p = \bar{x} + \sum_{k=1}^{r} a_{k,p}\, u_k, \qquad r < N, \qquad (1) $$

where $\bar{x}$ is the mean spectrum, which is subtracted from the spectra before the eigenvectors are calculated, and $u_k$ is the kth principal component. $\bar{x}$ can be considered as the zeroth eigenvector, although the degree of variance it explains depends on the specific data set and may be much less than that explained by the first eigenvectors.

Let $e_p$ be the error incurred in using this reduced reconstruction. By definition $x_p = y_p + e_p$, so

$$ e_p = \sum_{k=r+1}^{N} a_{k,p}\, u_k. \qquad (2) $$
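Equations (1) and (2) can be checked numerically with a short NumPy sketch. The data below are random placeholders rather than impact-echo spectra, the function names are hypothetical, and computing the principal components through a singular value decomposition of the mean-centred data is an implementation choice assumed here, not a detail given in the text.

```python
import numpy as np


def principal_components(X):
    """Return the mean spectrum and the principal components u_k as rows, most significant first."""
    x_mean = X.mean(axis=0)
    _, _, U = np.linalg.svd(X - x_mean, full_matrices=False)
    return x_mean, U


def reduced_reconstruction(X, r):
    """Return Y (equation 1) and E (equation 2) for every spectrum in the rows of X."""
    x_mean, U = principal_components(X)
    A = (X - x_mean) @ U.T            # A[p, k-1] is the admixture coefficient a_{k,p}
    Y = x_mean + A[:, :r] @ U[:r]     # first r components only
    E = A[:, r:] @ U[r:]              # discarded components r+1 .. N
    return Y, E


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 16))         # 40 placeholder "spectra" with N = 16 values each
Y, E = reduced_reconstruction(X, r=5)
```

For any r the identity X = Y + E holds up to rounding, and the norm of E cannot grow as r increases, which is the sense in which the discarded components carry the least variance.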

RECOGNITION OF DEFECT PATTERNS IN IMPACT-ECHO SPECTRA - SIMULATIONS Impact-Echo Signals Simulated signals came from a full transient dynamic analysis of 100 3D finite element models of a simulated parallelepiped-shaped material of 0.07 x 0.05 x 0.22 m (width, height and length), supported at one third and two thirds of the block length (direction z). Figure 1 shows different examples of the models of defective pieces. The transient analysis estimates the dynamic response of the material structure (time-varying displacements in the structure) under the action of a transient load. The transient load, i.e. the hammer impact, was simulated by applying a force-time history of a half sine wave with a period of 64 µs as a uniform pressure load on two elements at the centre of the model front face. The elastic material constants for the simulated material were: density 2700 kg/m³, elasticity modulus 69500 MPa and Poisson’s ratio 0.22. Elements with dimensions of about 0.01 m were used in the models. These elements can accurately capture the frequency response up to 40 kHz. Surface displacement waveforms were taken from the simulation results at 7 nodes in different locations on the material surface, see Figure 1a. Signals consisted of 5000 samples recorded at a sampling frequency of 100 kHz. To make it possible to compare simulations with experiments, the second derivative of the displacement was calculated in order to work with accelerations, since the sensors available for the experiments were mono-axial accelerometers. These accelerations were measured in the direction normal to the plane of the material surface, according to the configuration of the sensors in Figure 1a.
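The conversion from displacement to acceleration can be sketched with NumPy as below. The sampling frequency and number of samples come from the text; the rest is assumed for illustration: the half sine lobe is taken to last 64 µs, the second derivative is approximated by applying central differences twice, and the function names are hypothetical. The finite element simulation itself is not reproduced.

```python
import numpy as np

FS = 100_000.0       # sampling frequency from the text, in Hz
N_SAMPLES = 5000     # samples per signal, from the text


def half_sine_impact(duration=64e-6, fs=FS, n=N_SAMPLES, amplitude=1.0):
    """Force-time history: one half sine lobe lasting `duration` seconds, zero afterwards."""
    t = np.arange(n) / fs
    return np.where(t < duration, amplitude * np.sin(np.pi * t / duration), 0.0)


def acceleration(displacement, fs=FS):
    """Approximate the second time derivative of a sampled displacement signal."""
    dt = 1.0 / fs
    return np.gradient(np.gradient(displacement, dt), dt)
```

At 100 kHz the assumed impact spans only a handful of samples, so a real study would rely on the simulation's own time step for the load and use the 100 kHz grid only for the recorded responses.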

Figure 1. Finite element models with different defects and 7-sensor configuration
1a. Half-passing through crack oriented in plane ZY
1b. Passing through hole oriented in axis Y

Feature Extraction and Selection

We investigated whether the changes in the spectra, particularly in the zones of the fundamental frequencies, are related to the shape, orientation and dimension of the defects. The spectrum information for each channel consists of n/2 values, i.e. half the number of points used to calculate the Fast Fourier Transform (FFT). With the 7-channel impact-echo setup applied, the number of values available for each impact-echo test was 7*n/2; e.g. for an FFT calculated with 256 points, 896 values would be available as entries for the classifiers. Such a high number of entries could be unsuitable for the training stage of neural networks. Considering the redundancy of impact-echo signal spectra, PCA was applied in two steps. In the first step, PCA was applied to the spectra of each channel as a feature extraction method. In the second step, PCA was applied to the component set (compressed spectra) obtained in the first step for all the channels and records, as a dimensionality reduction and feature selection method. Thus, a compressed and representative pattern of the spectra for the multichannel impact-echo inspection was obtained. The size of the FFT employed was 1024 points, since with fewer points the resolution was not good enough for classification. Once the spectra were estimated for all the models, they were grouped and normalized by the maximum per channel. Three options were considered to establish the number of components at the first PCA step: selecting a number of components that explains a minimum of the variance in the data, selecting a number of components such that the variance increment is minimum, or using a fixed number of components. The first two options could yield a variable number of components per channel, and they could select more components for the channels with the 'worst' signals, i.e. signals with a low signal-to-noise ratio (SNR) due to measurement problems (e.g. bad contact at the interface between sensor and material). Thus we selected a fixed number of 20 components per channel, which explained more than 95% of the data variance for each of the channels, so the total number of components was 7*20=140 for one model.

The initial entries for the classification stage were then 140 features (spectra components) for the 100 simulation models. For simulations, 20 replicates were added for each model, corresponding to the repetitions performed in the experiments. The replicates were generated using random Gaussian noise with 0.1 standard deviation of the original signals; the total number of records for simulations was then 2000, with 140 spectra components each. PCA was applied again to reduce the dimensionality of the classification space and to select the best spectra features for classification. After some preliminary tests, the number of components for classification was set to 50. With this number of components, the explained variance was 98%. With the 50 sorted components obtained, an iterative classification process varying the number of components was applied using LDA and kNN as classifiers. The curve of classification error versus number of components (5, 10, 15, …, 50) has an inflection point at which the information provided by the components gives the best classification. Following this feature selection process, a reduced set of features ('better' spectra components) was obtained. Those features were used as entries for the ANNs, which improved classification performance compared with using all the spectra components. The number of selected components for ANN classification varied from 20 to 30, depending on the classification level (material condition, kind of defect, defect orientation, defect dimension).

Classification was carried out with the leave-one-out method, ensuring that no replicates or repetitions of a test piece were included in the training stage for that piece, so that generalization of pattern learning was forced. Thus some of the records used in training and test corresponded to models or specimens with the same kind of defect but located in different positions, and the rest of the records corresponded to other kinds of defective pieces. The results presented in the next sections refer to the mean error in the testing stage.
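The two-step PCA described above can be sketched as follows. This is only an illustration under stated assumptions: PCA is computed through an SVD of the mean-centred data, the input is synthetic random data rather than impact-echo spectra, and the function names `pca_scores` and `two_step_pca` are hypothetical. The 7 channels, 512 bins (half of a 1024-point FFT), 20 components per channel and 50 final components follow the text.

```python
import numpy as np


def pca_scores(data, n_components):
    """Project mean-centred rows of `data` onto the leading principal axes.

    Returns the scores and the fraction of variance they explain.
    """
    centred = data - data.mean(axis=0)
    # SVD of the centred data: the rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return centred @ vt[:n_components].T, explained


def two_step_pca(spectra, per_channel=20, final=50):
    """spectra: array of shape (records, channels, bins)."""
    # Step 1: compress the spectra of each channel separately.
    compressed = [pca_scores(spectra[:, c, :], per_channel)[0]
                  for c in range(spectra.shape[1])]
    stacked = np.hstack(compressed)  # (records, channels * per_channel)
    # Step 2: reduce the stacked components across all channels and records.
    return pca_scores(stacked, final)


# Synthetic stand-in data (not the study's signals): 200 records,
# 7 channels, 512 spectral bins, normalized by the maximum per channel.
rng = np.random.default_rng(0)
spectra = np.abs(rng.normal(size=(200, 7, 512)))
spectra /= spectra.max(axis=(0, 2), keepdims=True)

features, explained = two_step_pca(spectra)
print(features.shape, round(float(explained), 3))
```

With random input the explained variance is far lower than the values reported for real spectra, which are highly redundant; the sketch only shows the data flow from 7 x 512 values per record to 140 and then to 50 features.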

Simulation Results

Figure 2a shows the results of classification by kNN and by LDA with linear, Mahalanobis and quadratic distances for simulations at level 4 of the classification tree. The best percentage of classification success (75.9) is obtained by LDA-quadratic and LDA-Mahalanobis with 25 components. Those components were selected and used as inputs for the input layer of the networks. One hidden layer was used (different numbers of neurons were tried to obtain the best configuration for this layer), and the number of neurons at the output layer was set to the number of classes, depending on the classification level. A validation stage and the resilient propagation training method were used in classifications with MLP. The spread parameter was tuned for RBF; Figure 2b shows how the spread affects the classification results at the "defect dimension" level, where the minimum error (0.31) is obtained for a spread value of 1.6. Summarised general results for the different classification methods for simulations are shown in Table 1. The best classification performance is obtained by LDA with quadratic distance, but the results of RBF are fairly comparable. Because the classes are not equally probable at each level, the general results are weighted by class probability (see Figure 3). The homogeneous class was completely distinguishable, and the multiple-defects class was the worst classified at every classification level. The percentage of success could be much higher if the classification success for the multiple-defects class were increased. This occurred because the multiple-defects models consisted of models with various cracks, which led to confusion between the crack and multiple-defects classes. The percentage of success decreases for more complex classifications, with the lowest RBF performance being 69% for 12 classes.
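The evaluation protocol behind these results (leave-one-out, with all replicates of the held-out piece excluded from training) can be sketched as below. The data are synthetic, the plain Euclidean kNN is a stand-in for the classifiers compared in the text, and the names `knn_predict` and `leave_one_piece_out_error` are hypothetical.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k=3):
    """Classify each test row by majority vote among its k nearest training rows."""
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_y[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])


def leave_one_piece_out_error(features, labels, piece_ids, k=3):
    """Mean test error when every record of one piece is held out at a time."""
    errors = []
    for piece in np.unique(piece_ids):
        held_out = piece_ids == piece
        # No replicate of the held-out piece is allowed in the training set.
        predicted = knn_predict(features[~held_out], labels[~held_out],
                                features[held_out], k)
        errors.append(np.mean(predicted != labels[held_out]))
    return float(np.mean(errors))


# Synthetic stand-in data: 10 pieces, 2 classes, 20 noisy replicates per piece.
rng = np.random.default_rng(1)
piece_class = np.array([0, 1] * 5)
centres = piece_class[:, None] * 5.0 + rng.normal(scale=0.5, size=(10, 4))
features = np.repeat(centres, 20, axis=0) + rng.normal(scale=0.1, size=(200, 4))
labels = np.repeat(piece_class, 20)
piece_ids = np.repeat(np.arange(10), 20)

error = leave_one_piece_out_error(features, labels, piece_ids)
```

Holding out whole pieces rather than single records is what forces the classifier to generalize from other pieces of the same class, as the text requires.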

FUTURE TRENDS

The proposed methodology was tested with a particular kind of material, defects and multichannel testing configuration. It could be tested using models and specimens of different materials, sizes, sensor configurations and signal processing parameters. Several classification techniques and algorithms remain to be explored for the proposed problem. Recently a model of independent component analysis (ICA) was proposed for impact-echo (Salazar, Vergara, Igual, Gosalbez, & Miralles, 2004), and new classifiers based on mixtures of ICAs have been proposed (Salazar, Vergara, Igual, & Gosalbez, 2005; Salazar, Vergara, Igual, & Serrano, 2007) that include issues such as semi-supervision in the training stage. The use of prior knowledge in the training stage is critical in order to obtain suitable models for different kinds of classification. Those kinds of techniques could give more understanding of how labelled and unlabelled data change the model learned by the classifier. In addition, more research is needed on the shape of the classification space (impact-echo signal spectra), outlier probability and the decision regions of the classes for the proposed problem.

CONCLUSION

We demonstrate the feasibility of using neural networks to extract patterns of different kinds of defects from impact-echo signal spectra in simulations. The methodology used was very restrictive, because there was only one piece for a defect in a certain location in the bulk and it was not included in the training stage, so the classifier had to assign the right class using the patterns of pieces of the same class with defects in other locations. The results could be used to implement the proposed method in real applications of quality evaluation of materials; in those applications the database collected over a reasonable time could contain samples similar to the tested piece, making the classification process easier.

Figure 2. LDA, kNN results and tuning of RBF parameter at Simulations, level 4 of classification
2a. LDA, kNN results for simulations at fourth level of classification
2b. RBF spread tuning for simulations at fourth level of classification

Table 1. Summarised classification results for simulations

Error (%)   Level 1       Level 2       Level 3       Level 4
            (3 classes)   (4 classes)   (7 classes)   (12 classes)
LDA-L       6             13            30            29
LDA-Q       8             9             19            24.1
LDA-M       11.6          9             19            24.1
kNN         8             14            25            29
MLP         9             16            31            39
RBF         8             17            26            31

REFERENCES

Bailer-Jones, C. (1996). Neural Network Classification of Stellar Spectra. University of Cambridge. Bailer-Jones, C., Irwin, M., & Hippel, T. (1998). Automated classification of stellar spectra - II. Two-dimensional classification with neural networks and principal components analysis. Monthly Notices of the Royal Astronomical Society, 298, 361-377.

Bishop, C. M. (2004). Neural networks for pattern recognition. Oxford: Oxford University Press. Carino, N. J. (2001). The impact-echo method: an overview. In Structures Congress and Exposition (Ed.), (pp. 1-18). Duda, R., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience. Mei, X. (2004). Neural network for rapid depth evaluation of shallow cracks in asphalt pavements. Computer-aided civil and infrastructure engineering, 19, 223-230. Pratt, D. & Sansalone, M. (1992). Impact-echo signal interpretation using artificial intelligence. ACI Materials Journal, 89, 178-187.


Figure 3. Percentages of success in classifications by RBF (classification tree; node labels include quality of material, homogeneous, one defect, multiple defects, axis Y, plane XY and plane ZY)

Salazar, A., Unió, J., Serrano, A., & Gosalbez, J. (2007). Neural Networks for Defect Detection in Non-Destructive Evaluation by Sonic Signals. Lecture Notes in Computer Science, 4507, 638-645. Salazar, A., Vergara, L., Igual, J., & Gosalbez, J. (2005). Blind source separation for classification and detection of flaws in impact-echo testing. Mechanical Systems and Signal Processing, 19, 1312-1325. Salazar, A., Vergara, L., Igual, J., Gosalbez, J., & Miralles, R. (2004). ICA Model Applied to Multichannel Non-destructive Evaluation by Impact-echo. Lecture Notes in Computer Science, 3195, 470-477. Salazar, A., Vergara, L., Igual, J., & Serrano, A. (2007). Learning Hierarchies from ICA Mixtures. In I. 2. 20th International Joint Conference on Neural Networks (Ed.). Sansalone, M. & Street, W. (1997). Impact-echo: Nondestructive evaluation of concrete and masonry. New York: Bullbrier Press. Stavroulakis, G. E. (1999). Impact-echo from a unilateral interlayer crack. LCP-BEM modelling and neural identification. Engineering Fracture Mechanics, 62, 165-184. Xiang, Y. & Tso, S. K. (2002). Detection and classification of flaws in concrete structure using bispectra and neural networks. NDT&E International, 35, 19-27.


Xu, R., Nguyen, H., Sobol, P., Wang, S. L., Wu, A., & Johnson, K. E. (2004). Application of Principal Component Analysis to the FTIR Spectra of Disk Lubricant to Study Lube–Carbon Interactions. IEEE Transactions on Magnetics, 40, 3186-3189.

KEY TERMS

Artificial Neural Network (ANN): A mathematical model inspired by biological neural networks. Its units, called neurons, are connected in input, hidden and output layers. For a specific stimulus (numerical data at the input layer), some neurons are activated following an activation function, producing a numerical output. The ANN is trained in this way, storing the learned model in the weight matrices of the neurons. This kind of processing has proved suitable for finding nonlinear relationships in data, being more flexible in some applications than models extracted by linear decomposition techniques.

Finite Element Method (FEM): A numerical analysis technique for obtaining solutions to the differential equations that describe, or approximately describe, a wide variety of problems. The underlying premise of FEM is that a complicated domain can be subdivided into a series of smaller regions (the finite elements) in which the differential equations are approximately solved. By assembling the set of equations for each region, the behavior over the entire problem domain is determined.

Impact-Echo Testing: A non-destructive evaluation procedure based on monitoring the surface motion resulting from a short-duration mechanical impact. From analyses of the vibrations measured by sensors, a diagnosis of the material condition can be obtained.

Non-Destructive Evaluation (NDE): NDE, ND Testing or ND Inspection techniques are used in quality control of materials. These techniques extract information on the internal structure of the test object without destroying it. To detect different defects such as cracking and corrosion, different testing methods are available, such as X-ray (where cracks show up on the film), ultrasound (where cracks show up as an echo blip on the screen) and impact-echo (where cracks are detected by changes in the resonance modes of the object).

Pattern Recognition: An important area of research concerned with automatically discovering or identifying figures, characters, shapes, forms and patterns without active human participation in the decision process. It is also related to classifying data into categories. Classification consists of learning a model for separating the data categories; this kind of machine learning can be approached using statistical (parametric or non-parametric models) or heuristic techniques. If some prior information is given in the learning process, it is called supervised or semi-supervised; otherwise it is called unsupervised.

Principal Component Analysis (PCA): A method for achieving dimensionality reduction. It represents a set of N-dimensional data by means of their projections onto a set of r optimally defined axes (principal components). As these axes form an orthogonal set, PCA yields a linear transformation of the data. Principal components represent sources of variance in the data. Thus the most significant principal components show those data features which vary the most.

Signal Spectra: Set of frequency components decomposed from an original signal in the time domain. Several techniques exist to map a function from the time domain to the frequency domain, such as the Fourier and wavelet transforms, together with their inverse transforms, which allow the original signal to be reconstructed.

Automatic Classification of Impact-Echo Spectra II

Addisson Salazar
iTEAM, Polytechnic University of Valencia, Spain

Arturo Serrano
iTEAM, Polytechnic University of Valencia, Spain

INTRODUCTION

We study the application of artificial neural networks (ANNs) to the classification of spectra from impact-echo signals. In this paper we focus on analyses from experiments; simulation results are covered in paper I. Impact-echo is a Non-Destructive Evaluation procedure in which a material is excited by a hammer impact that produces a response from the material microstructure. This response is sensed by a set of transducers located on the material surface. The measured signals contain backscattering from the grain microstructure and information on flaws in the inspected material (Sansalone & Street, 1997). The physical phenomenon of impact-echo corresponds to wave propagation in solids. When a disturbance (stress or displacement) is applied suddenly at a point on the surface of a solid, such as by impact, the disturbance propagates through the solid as three different types of stress waves: a P-wave, an S-wave and an R-wave. The P-wave is associated with the propagation of normal stress and the S-wave with shear stress; both propagate into the solid along spherical wave fronts. In addition, a surface wave, or Rayleigh wave (R-wave), travels along the material surface with a circular wave front (Carino, 2001). After a transient period in which the first waves arrive, wave propagation becomes stationary in resonant modes of the material that vary depending on the defects inside the material. In defective materials, propagated waves have to surround the defects and their energy decreases, and multiple reflections and diffraction at the defect borders give rise to reflected waves (Sansalone, Carino, & Hsu, 1998). Depending on the observation time and the sampling frequency used in the experiments, we may be interested in analysing the transient or the stationary stage of the wave propagation in impact-echo tests. Usually, with high resolution in time, analyses of wave propagation velocity can give useful information, for instance to build a tomography of a material inspected from different locations. Given the sampling frequency used in the experiments (100 kHz), a feature extracted from the signal such as the wave propagation velocity is not accurate enough to discern between homogeneous and different kinds of defective materials. The data set for this research consists of sonic and ultrasonic impact-echo signal (1-27 kHz) spectra obtained from 84 parallelepiped-shaped (7 x 5 x 22 cm width, height and length) lab specimens of aluminium alloy series 2000. These spectra, along with a categorization of the quality of materials into homogeneous, one-defect and multiple-defect classes, were used to develop supervised neural network classifiers. We show that neural networks yield good classifications (

Basic Cellular Neural Networks Image Processing

N /2 and M = N, a fully connected CNN is obtained, where every neuron is connected to every other cell in the network and Sr(i,j) is the entire array. This extreme case corresponds to the classic Hopfield ANN model (Chua & Roska, 2002). The state equation of any cell C(i,j) in the M × N array structure of the standard CNN may be described mathematically by:

$$ C \frac{dz_{ij}(t)}{dt} = -\frac{1}{R} z_{ij}(t) + \sum_{C(k,l) \in S_r(i,j)} \left[ A(i,j;k,l) \cdot y_{kl}(t) + B(i,j;k,l) \cdot x_{kl} \right] + I_{ij} \qquad (2) $$

where C and R are values that control the transient response of the neuron circuit (just like an RC filter; they are typically set to unity for the sake of simplicity), I is generally a constant value that biases or thresholds the state matrix Z = {zij}, and Sr is the local neighbourhood of cell C(i, j) defined in (1), which controls the influence of the input data X = {xij} and the network output Y = {yij} at time t. This means that both input and output planes interact with the state of a cell through the definition of a set of real-valued weights, A(i, j; k, l) and B(i, j; k, l), whose size is determined by the neighbourhood radius r. The matrices or cloning templates A and B are called the feedback and feed-forward (or control) operators, respectively. A standard CNN is typically defined with constant values for r, I, A and B, thus implying that for a fixed input image X, a neuron C(i, j) is provided for each pixel (i, j), with constant weighted circuits defined by the feedback template A, which connects the cell with the output plane Y, and by the control template B, which connects the neuron to the neighbouring pixels of input xij ∈ X. The value of the neuron state zij is then adjusted with the bias parameter I, and passed as input to a piecewise-linear function in order to determine the output value yij. This function may be expressed as

$$ y_{ij} = \frac{1}{2} \left( \left| z_{ij}(t) + 1 \right| - \left| z_{ij}(t) - 1 \right| \right) \qquad (3) $$

In the Image Processing context, a grey-scale input image X can be represented pixel-wise using a linear map between a pixel value (e.g. an 8-bit integer luminance matrix with 256 grey-scale levels) and the CNN input interval [–1, +1], where the lower limit is used to implement full luminance (i.e. white) and the upper limit black pixels (Chua & Yang, 1988).

BASIC CNN IMAGE PROCESSING

The main application of the CNN model, due to its convolution-like scheme, has been DIP modelling and design. In the next subsections a number of basic DIP approaches are introduced, underlining the importance of the network parameters by giving illustrative examples of application. Starting from the standard model described in the previous section, the definition of the standard isotropic CNN follows. Then, an example of application in logic DIP processing is given in order to introduce the nonlinear effects implied by the use of a non-zero feedback template.

The Isotropic CNN Model

For a still image, X will be invariant with time, and for video, X = X(t). In the most general case, r, A, B and I may vary with position and time, and the cloning templates are defined as nonlinear, with the possibility of integrating inhibitory signals for the state matrix and even nonlinear templates that interact with mixed input-output-state data (Chua & Roska, 2002). These possible extensions motivate the definition of a special (and simpler) class of CNN, called isotropic or space-invariant, in which r, A, B and I are fixed for the whole network and where linear synaptic operators are utilized. In other words,

$$ \sum_{C(k,l) \in S_r(i,j)} A(i,j;k,l) \cdot y_{kl} = \sum_{|k-i| \le r} \sum_{|l-j| \le r} A(i-k, j-l) \cdot y_{kl}, $$

$$ \sum_{C(k,l) \in S_r(i,j)} B(i,j;k,l) \cdot x_{kl} = \sum_{|k-i| \le r} \sum_{|l-j| \le r} B(i-k, j-l) \cdot x_{kl}, \quad \text{and } I_{ij} = I. \qquad (4) $$
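For numerical experiments, the isotropic state equation can be integrated with a fixed-step scheme. As an illustrative sketch only (the step size h and the iteration index n are assumptions introduced here, not part of the original formulation), a forward Euler discretization of (2) with C = R = 1 and the space-invariant templates of (4) reads:

$$ z_{ij}[n+1] = z_{ij}[n] + h \left( -z_{ij}[n] + \sum_{|k-i| \le r} \sum_{|l-j| \le r} \left[ A(i-k, j-l) \cdot y_{kl}[n] + B(i-k, j-l) \cdot x_{kl} \right] + I \right), $$

$$ y_{ij}[n] = \frac{1}{2} \left( \left| z_{ij}[n] + 1 \right| - \left| z_{ij}[n] - 1 \right| \right). $$

Setting the bracketed increment to zero with A = 0 gives the fixed point $z_{ij} = \sum_{|k-i| \le r} \sum_{|l-j| \le r} B(i-k, j-l) \cdot x_{kl} + I$, i.e. a spatially filtered and biased copy of the input, which the output function (3) then saturates to the interval [–1, +1].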

The vast majority of the templates defined in the template compendium of Chua & Roska (2002) for the CNN-UM are based on this isotropic scheme, using r = 1 and binary images in the input plane. If no feedback (i.e. A = 0) is used, then the CNN behaves as a convolution network, using B as a spatial filter, I as a threshold and the piecewise-linear output (3) as a limiter or saturated output filter. In this way, virtually any spatial filter from DIP theory (Jain, 1989) can be implemented on such a feed-forward driven CNN, which ensures its output stability. For instance, the EDGE template defined by

$$ A = 0, \quad B_{EDGE} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}, \quad I = -1 \qquad (5) $$

is designed to work correctly for binary inputs, giving black (+1) output pixels in the input locations where a black edge pixel exists (i.e. if a black pixel has at least one white neighbour), and white (–1) pixels elsewhere. However, when a grey-scale input image is fed to this CNN, the output may not be a binary image. To solve this potential problem, the EDGE CNN is modified as follows:

$$ A = 2, \quad B = B_{EDGE}, \quad I = -0.5 \qquad (6) $$

The definition of a centre feedback absolute value greater than 1 in (6) ensures a binary output and thus output network stability. The B template used in these CNNs is of the Laplacian type, having the important property that all surrounding input synaptic weights are inhibitory (i.e. negative) and identical, but the centre synaptic weight is excitatory, and the average of all input synaptic weights is zero.

Apart from edges, convex corners (i.e. black pixels with at least five white neighbours) may also be detected with the following modification of the parameters:

$$ A = 2, \quad B = B_{EDGE}, \quad I = -8.5 \qquad (7) $$

This example illustrates the important role played by the threshold parameter I. This parameter may be viewed as a bias index that relocates the origin z0 of the output function (3) (Fernández et al., 2006).
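The effect of the threshold on the edge and corner detectors can be checked with a small simulation. The sketch below is not a CNN-UM implementation: it assumes forward Euler integration with C = R = 1, an initial state Z(0) = 0, a white (–1) boundary around the image and centre-only feedback, and all function names are hypothetical. The Laplacian template and the parameter sets (6) and (7) are taken from the text; because the template is symmetric, the indexing used here agrees with the convention of (4).

```python
import numpy as np

B_EDGE = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=float)


def to_cnn_input(grey):
    """Map 8-bit luminance to [-1, +1]: white (255) -> -1, black (0) -> +1."""
    return 1.0 - 2.0 * np.asarray(grey, dtype=float) / 255.0


def output(z):
    """Piecewise-linear output function (3)."""
    return 0.5 * (np.abs(z + 1.0) - np.abs(z - 1.0))


def neighbourhood_sum(template, plane, boundary=-1.0):
    """Template-weighted sum over each 3 x 3 neighbourhood (fixed boundary value)."""
    padded = np.pad(plane, 1, constant_values=boundary)
    rows, cols = plane.shape
    total = np.zeros_like(plane, dtype=float)
    for di in range(3):
        for dj in range(3):
            total += template[di, dj] * padded[di:di + rows, dj:dj + cols]
    return total


def run_cnn(x, a_centre, b_template, bias, steps=400, h=0.05):
    """Forward Euler integration of (2) with C = R = 1 and centre-only feedback."""
    drive = neighbourhood_sum(b_template, x) + bias  # constant for a still image
    z = np.zeros_like(x)  # assumed initial state Z(0) = 0
    for _ in range(steps):
        z = z + h * (-z + a_centre * output(z) + drive)
    return output(z)


# 5 x 5 test image: a black 3 x 3 square on a white background.
grey = np.full((5, 5), 255, dtype=np.uint8)
grey[1:4, 1:4] = 0
x = to_cnn_input(grey)

edges = run_cnn(x, a_centre=2.0, b_template=B_EDGE, bias=-0.5)    # parameters (6)
corners = run_cnn(x, a_centre=2.0, b_template=B_EDGE, bias=-8.5)  # parameters (7)
print((edges > 0).astype(int))
print((corners > 0).astype(int))
```

For this test image, the first output marks the eight outline pixels of the square and the second marks only its four corner pixels, each of which has five white neighbours.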

Basic Logic Operators

In order to perform pixel-wise logic operations between two binary images X1 and X2, the initial state Z(0) of the network is also utilized as a variable (Chua & Roska, 2002). In a standard feed-forward driven CNN, this variable Z(0) is usually set to zero, but it can also be used to obtain results valid for other applications, such as motion detection and estimation (Torralba & Hérault, 1999). For example, for a binary set union (logic OR), the following templates are defined:

$$ X = X_1, \quad Z(0) = X_2, \quad A = 3, \quad B = 3, \quad I = 2 \qquad (8) $$

whereas for set intersection (logic AND), these variables are defined as

$$ X = X_1, \quad Z(0) = X_2, \quad A = 1.5, \quad B = 1.5, \quad I = -1.5 \qquad (9) $$

Once again, the use of excitatory feedback ensures output stability through the saturation output function (3), and the threshold properly biases the final result.
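Because the templates in (8) and (9) have only a centre element, every pixel evolves independently and the logic operators can be checked on the four rows of a truth table. The following is a minimal sketch assuming forward Euler integration with C = R = 1; the function name `run_pixelwise`, the step size and the number of steps are assumptions.

```python
import numpy as np


def output(z):
    """Piecewise-linear output function (3)."""
    return 0.5 * (np.abs(z + 1.0) - np.abs(z - 1.0))


def run_pixelwise(x1, x2, a, b, bias, steps=600, h=0.05):
    """Euler integration of dz/dt = -z + a*y + b*x + I with X = X1 and Z(0) = X2."""
    x = np.asarray(x1, dtype=float)
    z = np.asarray(x2, dtype=float).copy()
    for _ in range(steps):
        z = z + h * (-z + a * output(z) + b * x + bias)
    return output(z)


# Truth table in CNN coding: +1 is black (logic 1), -1 is white (logic 0).
x1 = np.array([-1, -1, 1, 1])
x2 = np.array([-1, 1, -1, 1])

logic_or = run_pixelwise(x1, x2, a=3.0, b=3.0, bias=2.0)     # parameters (8)
logic_and = run_pixelwise(x1, x2, a=1.5, b=1.5, bias=-1.5)   # parameters (9)
print(logic_or, logic_and)
```

The first result is +1 whenever at least one input is +1, and the second is +1 only when both inputs are +1, which matches set union and set intersection in this coding.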

Feedback-Driven Standard CNN

The feedback templates used in all the previously exemplified CNN utilize (if any) only the central element of the template. A standard CNN with off-centre nonzero feedback elements is a special class that exhibits more complex dynamics than those treated so far (Chua & Roska, 1993). The use of a centre element in A, a00 > 1, means that the output will be binary, i.e. network output will never be stable in the linear region of the saturation function (3) (Chua & Roska, 2002). With this restriction, if another element is set in the feedback template, 0 then two possible situations may occur: the activation of cells in the opposite part of only one of the saturation regions (partial inversion), or wave propagating cell inversions in both binary states. The first kind of these feedback-driven CNN is said to have the mono-activation property if cells in only one saturated region can enter the linear region. Thus, if cells can enter the linear region from the positive saturation region, then those cells saturated in the negative part must fulfil that the overall contribution of A, B and I in its sphere of influence Sr must be less than –1. That is,

$$ w_{ij}(t) = \sum_{S_r(i,j)} \left[ a_{kl} \cdot y_{kl}(t) + b_{kl} \cdot x_{kl} \right] + I_{ij} < -1 \qquad (10) $$

On the other hand, if cells enter the linear region only from the negative saturation region, then the contribution for positive stable cells must be wij(t) > 1. It can be demonstrated that in a mono-activated CNN with positive A coefficients, with a00 > 1 and saturated initial values, all the cells that enter the linear region change their state monotonically from (only) one saturated area to the other, and therefore it is a stable nonlinear network (Chua & Roska, 2002). If, for instance, one element in A is negative, the transient will not be monotonic, which does not necessarily imply network instability. An example of a non-monotonic but stable CNN is the Connected Component Detector (CCD) (Matsumoto et al., 1990 a), whose templates (for the horizontal case) are the following:

$$ A_{CCD} = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 2 & -1 \\ 0 & 0 & 0 \end{bmatrix}, \quad B = 0, \quad I = 0 \qquad (11) $$

For designing a unidirectional wave propagating mono-activated CNN, a binary activation pattern is defined, which will trigger the transient until output stability is reached (Chua & Roska, 2002). An example of this type of stable feedback-driven CNN is the (horizontal) Shadow Detector (Matsumoto et al., 1990 b) whose parameters are:


$$ A_{Shadow} = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad B = 0, \quad I = 0 \qquad (12) $$

FUTURE TRENDS

There is a continuous quest among engineers and specialists to compete with and imitate nature, especially some "smart" animals. Vision is one particular area in which computer engineers are interested. In this context, the so-called Bionic Eye (Werblin et al., 1995) embedded in the CNN-UM architecture is ideal for implementing many spatio-temporal neuromorphic models. With its powerful image processing toolbox and a compact VLSI implementation (Rodríguez et al., 2004), the CNN-UM can be used to program or mimic different models of retinas and even combinations of them (Lázár et al., 2004). Moreover, it can combine biologically based models, biologically inspired models and analogic artificial image processing algorithms. This combination will surely bring a broader range of applications and developments.

CONCLUSION

A number of other advances in the definition and characterization of CNN have been researched in the past decade. These include the definition of methods for designing and implementing larger than 3 × 3 neighbourhoods in the CNN-UM (Kék & Zarándy, 1998), the efficient implementation of halftoning techniques (Crounse et al., 1993), the CNN implementation of some image compression techniques (Venetianer et al., 1995) and the design of a CNN-based Fast Fourier Transform algorithm over analogic signals (Perko et al., 1998), among many others. Some of them are also described in this book in the article entitled Advanced Cellular Neural Networks Image Processing. In this article, a general review of the main properties and features of the Cellular Neural Network model has been presented, focusing on its DIP capabilities from a basic viewpoint. CNN is now a fundamental and powerful toolkit for real-time nonlinear image processing tasks, mainly due to its versatile programmability, which has powered its hardware development for visual sensing applications.

REFERENCES

Chua, L.O., & Roska, T. (2002). Cellular Neural Networks and Visual Computing. Foundations and Applications. Cambridge, UK: Cambridge University Press. Chua, L.O., & Roska, T. (1993). The CNN Paradigm. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 40, 147–156. Chua, L.O., & Yang, L. (1988). Cellular Neural Networks: Theory and Applications. IEEE Transactions on Circuits and Systems, 35, 1257–1290. Crounse, K.R., Roska, T., & Chua, L.O. (1993). Image Halftoning with Cellular Neural Networks. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 40, 267-283. Fernández, J.A., Preciado, V.M., & Jaramillo, M.A. (2006). Nonlinear Mappings with Cellular Neural Networks. Lecture Notes in Computer Science, 4177, 350–359. Jain, A.K. (1989). Fundamentals of Digital Image Processing. Englewood Cliffs, NJ, USA: PrenticeHall. Kék, L., & Zarándy, A. (1998). Implementation of Large Neighborhood Non-Linear Templates on the CNN Universal Machine. International Journal of Circuit Theory and Applications, 26, 551-566. Lázár, A.K., Wagner, R., Bálya, D., & Roska, T. (2004). Functional Representations of Retina Channels via the RefineC Retina Simulator. International Workshop on Cellular Neural Networks and their Applications CNNA 2004, 333-338. Matsumoto, T., Chua, L.O., & Suzuki, H. (1990 a). CNN Cloning Template: Connected Component Detector. IEEE Transactions on Circuits and Systems, 37, 633-635. Matsumoto, T., Chua, L.O., & Suzuki, H. (1990 b). CNN Cloning Template: Shadow Detector. IEEE Transactions on Circuits and Systems, 37, 1070-1073. Perko, M., Iztok Fajfar, I., Tuma, T., & Puhan, J. (1998). Fast Fourier Transform Computation Using a Digital CNN Simulator. Fifth IEEE International Workshop on Cellular Neural Network and Their Applications Proceedings, 230-236.

Rodríguez, A., Liñán, G., Carranza, L., Roca, E., Carmona, R., Jiménez, F., Domínguez, R., & Espejo, S. (2004). ACE16k: The Third Generation of MixedSignal SIMD-CNN ACE Chips Toward VSoCs. IEEE Transactions on Circuits and Systems I: Regular Papers, 51, 851–863. Roska, T., & Chua, L.O. (1993). The CNN Universal Machine: An Analogic Array Computer. IEEE Transactions on Circuits and Systems II: Analog and Digital Processing, 40, 163–173. Torralba, A.B., & Hérault, J. (1999). An Efficient Neuromorphic Analog Network for Motion Estimation. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 46, 269-280. Venetianer, P.L., Werblin, F., Roska, T., & Chua, L.O. (1995). Analogic CNN Algorithms for Some Image Compression and Restoration Tasks. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 42, 278-284. Werblin, F., Roska, T., & Chua, L.O. (1995). The Analogic Cellular Neural Network as a Bionic Eye. International Journal of Circuit Theory and Applications, 23, 541-569.

KEY TERMS

Artificial Neural Network (ANN): A system made up of interconnected artificial neurons or nodes (usually simplified neurons) which may share some properties of biological neural networks. They may be used either to gain an understanding of biological neural networks or to solve traditional artificial intelligence tasks without necessarily attempting to model a real biological system. Well-known examples of ANN are the Hopfield, Kohonen and Cellular (CNN) models.

Feedback: The signal that is looped back to control a system within itself. When the output of the system is fed back as a part of the system input, it is called a feedback loop. A simple electronic device based on feedback is the electronic oscillator. The Phase-Locked Loop (PLL) is an example of a complex feedback system.

Neuromorphic: A term coined by Carver Mead in the late 1980s to describe VLSI systems containing electronic analogue circuits that mimic neuro-biological architectures present in the nervous system. More recently, its definition has been extended to include analogue, digital and mixed-mode A/D VLSI systems that implement models of neural systems, as well as software algorithms.

Piecewise Linear Function: A function f(x) that can be split into a number of linear segments, each of which is defined for a non-overlapping interval of x.

Spatial Convolution: A term used to identify the linear combination of a series of discrete 2D data (a digital image) with a few coefficients or weights. In Fourier theory, a convolution in space is equivalent to (spatial) frequency filtering.

Template: Also known as kernel or convolution kernel, it is the set of coefficients used to perform a spatial filter operation over a digital image via the spatial convolution operator.

Transient: In electronics, a transient is a short-lived oscillation in a system caused by a sudden change of voltage, current or load. Transients are mostly found as the result of the operation of switches. The signal produced by the transient process is called the transient signal or simply the transient. The transient of a dynamic system can also be viewed as its path to a stable final output.

VLSI: Acronym that stands for Very Large Scale Integration. It is the process of creating integrated circuits by combining thousands (nowadays hundreds of millions) of transistor-based circuits into a single chip. A typical VLSI device is the microprocessor.

Bayesian Neural Networks for Image Restoration Radu Mutihac University of Bucharest, Romania

INTRODUCTION

Numerical methods employed to convert experimental data into interpretable images and spectra commonly rely on straightforward transforms, such as the Fourier transform (FT), or on more elaborate emerging classes of transforms, like wavelets (Meyer, 1993; Mallat, 2000), wedgelets (Donoho, 1996), ridgelets (Candes, 1998), and so forth. Yet experimental data are incomplete and noisy due to the limiting constraints of digital data recording and the finite acquisition time. The pitfall of most transforms is that imperfect data are directly transferred into the transform domain along with the signals of interest. The traditional approach to data processing in the transform domain is to ignore any imperfections in data, set to zero any unmeasured data points, and then proceed as if data were perfect. In contrast, the maximum entropy (ME) principle needs to proceed from frequency domain to space (time) domain. The ME techniques are used in data analysis mostly to reconstruct positive distributions, such as images and spectra, from blurred, noisy, and/or corrupted data. The ME methods may be developed on axiomatic foundations based on the probability calculus that has a special status as the only internally consistent language of inference (Skilling, 1989; Daniell, 1994). Within its framework, positive distributions ought to be assigned probabilities derived from their entropy. Bayesian statistics provides a unifying and self-consistent framework for data modeling. Bayesian modeling deals naturally with uncertainty in data, which is accounted for by marginalization in predictions of other variables. Data overfitting and poor generalization are alleviated by incorporating the principle of Occam's razor, which controls model complexity and sets the preference for simple models (MacKay, 1992).
Bayesian inference satisfies the likelihood principle (Berger, 1985) in the sense that inferences depend only on the probabilities assigned to data that were measured and not on the properties of some admissible data that had never been acquired.

Artificial neural networks (ANNs) can be conceptualized as highly flexible non-linear models for multivariate regression and multiclass classification. However, over-flexible ANNs may discover non-existent correlations in data. Bayesian decision theory provides means to infer how flexible a model is warranted by data and suppresses the tendency to assess spurious structure in data. Any probabilistic treatment of images depends on the knowledge of the point spread function (PSF) of the imaging equipment, and the assumptions on noise, image statistics, and prior knowledge. In contrast, the neural approach only requires relevant training examples where true scenes are known, irrespective of our inability or bias to express prior distributions. Trained ANNs restore images much faster, especially in the case of strong implicit priors in the data, nonlinearity, and nonstationarity. The most remarkable work in Bayesian neural modeling was carried out by MacKay (1992, 2003) and Neal (1994, 1996), who theoretically set up the framework of Bayesian learning for adaptive models.

BACKGROUND

The Bayesian approach to image restoration is based on the assumption that all of the relevant image information may be stated in probabilistic terms and prior probabilities are known. The ME principle sets prior probabilities optimally for positive additive distributions. Yet Bayes' theorem and the ME principle share one common feature: the updating of a state of knowledge. In some cases, running Bayes' theorem in one hypothesis space and applying the ME principle in another lead to similar calculations. Neuromorphic and Bayesian modeling may apparently look like extremes of the data modeling spectrum. ANNs are non-linear parallel computational devices endowed with gradient descent algorithms trained by example to solve prediction and classification problems. In contrast, Bayesian statistics is based on coherent inference and clear axioms. Yet both approaches aim to create models in agreement with data. Bayesian decision theory provides intrinsic means for model ranking. Bayesian inference for ANNs can be implemented numerically by deterministic methods involving Gaussian approximations (MacKay, 1992), or by Monte Carlo methods (Neal, 1996). Two features distinguish the Bayesian approach to learning models from data. First, beliefs derived from background knowledge are used to select a prior probability distribution for model parameters. Second, predictions of future observations are performed by integrating the model's predictions with respect to the posterior parameter distribution obtained by updating this prior with new data. Both aspects are difficult in neural modeling: the prior over network parameters has no obvious relation to prior knowledge, and integration over the posterior is computationally demanding. The properties of priors can be elucidated by defining classes of prior distributions for net parameters that reach sensible limits as the net size goes to infinity (Neal, 1994). The problem of integrating over the posterior can be solved using Markov chain Monte Carlo methods (Neal, 1996).

Bayesian Image Modeling

The fundamental concept of Bayesian analysis is that the plausibility of alternative hypotheses $\{H_i\}_{i \in I}$ is represented by probabilities $\{P_i\}_{i \in I}$, and inference is performed by evaluating these probabilities. Inference may operate on various propositions related in neural modeling to different paradigms. Bayes' theorem makes no reference to any sample or hypothesis space, nor does it determine the numerical value of any probability directly from available information. As a prerequisite to applying Bayes' theorem, a principle to cast available information into numerical values is needed.

In statistical restoration of gray-level digital images, the basic assumption is that there exists a scene adequately represented by an orderly array of N pixels. The task is to infer reliable statistical descriptions of images, which are gray-scale digitized pictures stored as an array of integers representing the intensity of gray level in each pixel. Then the shape of any positive, additive image can be directly identified with a probability distribution. The image is conceived as an outcome of a random vector $\mathbf{f} = \{f_1, f_2, \ldots, f_N\}$, given in the form of a positive, additive probability density function. Likewise, the measured data $\mathbf{g} = \{g_1, g_2, \ldots, g_M\}$ are expressed in the form of a probability distribution (Fig. 1). A further assumption is that the image data are a linear function of physical intensity, and that the errors (noise) $\mathbf{b}$ are data independent, additive, and Gaussian with zero mean and known standard deviation $\sigma_m$, $m = 1, 2, \ldots, M$, in each pixel.

The concept of image entropy and the alternative entropy expressions used in image restoration are discussed by Gull and Skilling (1985). A brief review of different approaches based on the ME principle, as well as a full Bayesian approach for solving inverse problems, are due to Djafari (1995).

Image models are derived on the basis of intuitive ideas and observations of real images, and have to comply with certain criteria of invariance, that is, operations on images should not affect their likelihood. Each model comprises a hypothesis H with some free parameters $\mathbf{w} = (w_1, w_2, \ldots)$ that assign a probability density $P(\mathbf{f} \mid \mathbf{w}, H)$ over the entire image space, normalized to integrate to unity. Prior beliefs about the validity of H before data acquisition are embedded in $P(H)$. Only extreme choices for $P(H)$ may outweigh the evidence $P(\mathbf{f} \mid H)$, thus the plausibility $P(H \mid \mathbf{f})$ of H is given essentially by the evidence $P(\mathbf{f} \mid H)$ of the image $\mathbf{f}$. Consequently, objective means for comparing various hypotheses exist.

Initially, the free parameters $\mathbf{w}$ are either unknown or they are assigned very wide prior distributions. The task is to search for the best fit parameter set $\mathbf{w}_{MP}$, which has the largest likelihood given the image. Following Bayes' theorem:

$$P(\mathbf{w} \mid \mathbf{f}, H) = \frac{P(\mathbf{f} \mid \mathbf{w}, H) \cdot P(\mathbf{w} \mid H)}{P(\mathbf{f} \mid H)} \qquad (1)$$

where $P(\mathbf{f} \mid \mathbf{w}, H)$ is the likelihood of the image $\mathbf{f}$ given $\mathbf{w}$, $P(\mathbf{w} \mid H)$ is the prior distribution of $\mathbf{w}$, and $P(\mathbf{f} \mid H)$ is the evidence for H. A prior $P(\mathbf{w} \mid H)$ has to be assigned quite subjectively based on our beliefs about images. Since $P(\mathbf{w} \mid \mathbf{f}, H)$ is normalized to 1, the denominator in (1) ought to satisfy $P(\mathbf{f} \mid H) = \int_{\mathbf{w}} P(\mathbf{f} \mid \mathbf{w}, H) \cdot P(\mathbf{w} \mid H) \, d\mathbf{w}$. The integrand is often dominated by the likelihood at $\mathbf{w}_{MP}$, so that the evidence of H is approximated by the best fit likelihood $P(\mathbf{f} \mid \mathbf{w}_{MP}, H)$ times the Occam's factor (MacKay, 1992):

$$P(\mathbf{f} \mid H) \cong P(\mathbf{f} \mid \mathbf{w}_{MP}, H) \cdot P(\mathbf{w}_{MP} \mid H) \cdot \Delta \mathbf{w} \qquad (2)$$

Assuming uniform prior parameter distributions $P(\mathbf{w} \mid H)$ over all admissible parameter sets, then $P(\mathbf{w}_{MP} \mid H) = 1 / \Delta^0 \mathbf{w}$, and the evidence becomes:

$$P(\mathbf{f} \mid H) \cong P(\mathbf{f} \mid \mathbf{w}_{MP}, H) \cdot \frac{\Delta \mathbf{w}}{\Delta^0 \mathbf{w}} \qquad (3)$$

The ratio $\Delta \mathbf{w} / \Delta^0 \mathbf{w}$ between the posterior accessible volume of the model's parameter space and the prior accessible volume prevents data overfitting by favoring simpler models. Further, Bayes' theorem gives the probability of H up to a constant:

$$P(H \mid \mathbf{f}) \propto P(\mathbf{f} \mid H) \cdot P(H) \qquad (4)$$

Figure 1. Flowchart summarizing the forward and inverse problems

Maximum Entropy Methods

Applying the ME principle amounts to assigning a distribution {P1 , P2 , ..., Pn } on some hypothesis space by the criterion that it shall maximize some form of entropy subject to constraints that express properties we wish the distribution to have, but are not sufficient to determine it. The ME methods require specifying in advance a definite hypothesis space which sets down the possibilities to be taken into consideration. They come out with a probability distribution, rather than a probability. The ME probability of a single hypothesis H that is not embedded in a space of alternative hypotheses does not make any sense. The ME methods do not require for input the numerical values of any probabilities on that space, rather they assign numerical values to available information as expressed by the choice of hypothesis space and constraints.
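As an illustration of assigning a distribution $\{P_1, P_2, \ldots, P_n\}$ by maximizing entropy under a constraint that does not determine it, the following minimal Python sketch considers a small hypothesis space with a single prescribed-mean constraint. The function name `maxent_distribution`, the die-like hypothesis space, and the bisection bounds are illustrative assumptions, not part of the original text. The sketch relies on the standard result that the maximizer of Shannon entropy under a mean constraint has the exponential form $P_i \propto \exp(\lambda x_i)$, with the Lagrange multiplier $\lambda$ fixed by the constraint.

```python
import math

def maxent_distribution(values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Maximum-entropy distribution over `values` with a prescribed mean.

    The solution has the form p_i proportional to exp(lam * x_i). The mean is a
    non-decreasing function of lam, so lam can be found by bisection.
    The bounds lo/hi are an assumption suited to small integer values.
    """
    def distribution(lam):
        shift = max(lam * x for x in values)        # avoid overflow in exp
        weights = [math.exp(lam * x - shift) for x in values]
        z = sum(weights)                            # normalizing factor
        return [w / z for w in weights]

    probs = distribution(0.0)
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        probs = distribution(lam)
        mean = sum(p * x for p, x in zip(probs, values))
        if mean < target_mean:
            lo = lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    return probs

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

faces = [1, 2, 3, 4, 5, 6]
uniform = maxent_distribution(faces, 3.5)   # no information beyond the fair mean
skewed = maxent_distribution(faces, 4.5)    # constraint pulls mass to high faces
```

With the unconstrained mean 3.5 the procedure recovers the uniform distribution, whereas a mean of 4.5 yields probabilities that increase with the face value and have lower entropy. In both cases the output is a whole distribution over the chosen hypothesis space rather than the probability of one isolated hypothesis.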

LINEAR IMAGING EXPERIMENTS

In the widespread linear case, the N-dimensional image vector $\mathbf{f}$ consists of the pixel values of an unobserved image, and the M-dimensional data vector $\mathbf{g}$ is made of the pixel values of an observed image supposed to be a degraded version of $\mathbf{f}$. Assuming zero-mean Gaussian additive errors:

$$\mathbf{g} = R\,\mathbf{f} + \mathbf{b} \qquad (5)$$

where the $M \times N$ matrix $R$ stands for the PSF (transfer function or instrumental response) of the imaging system, the likelihood of the data is:

$$P(\mathbf{g} \mid \mathbf{f}, C, H) = \frac{1}{(2\pi)^{M/2} \det^{1/2} C} \cdot \exp\left[-\frac{1}{2} (\mathbf{g} - R\mathbf{f})^T C^{-1} (\mathbf{g} - R\mathbf{f})\right] \qquad (6)$$

where $C$ is the covariance matrix of the error vector $\mathbf{b}$. If there is no correlation among the pixels and each pixel has the standard deviation $\sigma_m$, $m = 1, 2, \ldots, M$, then the symmetric full rank covariance matrix becomes diagonal with the elements $C_{mm} = \sigma_m^2$, $m = 1, 2, \ldots, M$. Hence the probability of the data $\mathbf{g}$ given the image $\mathbf{f}$ may be written as:

$$P(\mathbf{g} \mid \mathbf{f}, C, H) = \frac{1}{(2\pi)^{M/2} \prod_{m=1}^{M} \sigma_m} \cdot \exp\left[-\frac{1}{2} \sum_{m=1}^{M} \frac{\left(g_m - \sum_{n=1}^{N} R_{mn} f_n\right)^2}{\sigma_m^2}\right] \qquad (7)$$

The full joint posterior $P(\mathbf{f}, \theta \mid \mathbf{g}, H)$ of the image $\mathbf{f}$ and the unknown PSF parameters denoted generically by $\theta$ should be evaluated. Then the required inference about the posterior probability $P(\mathbf{f} \mid \mathbf{g}, H)$ is obtained as a marginal integral of this joint posterior over the uncertainties in the PSF:

$$P(\mathbf{f} \mid \mathbf{g}, H) = \int P(\mathbf{f}, \theta \mid \mathbf{g}, H) \, d\theta = \int P(\mathbf{f} \mid \theta, \mathbf{g}, H) \cdot P(\theta \mid \mathbf{g}, H) \, d\theta \qquad (8)$$

Now applying Bayes' theorem for the parameters $\theta$:

$$P(\theta \mid \mathbf{g}, H) = \frac{P(\mathbf{g} \mid \theta, H) \cdot P(\theta \mid H)}{P(\mathbf{g} \mid H)} \qquad (9)$$

and substituting in (8):

$$P(\mathbf{f} \mid \mathbf{g}, H) = \int P(\mathbf{f}, \theta \mid \mathbf{g}, H) \, d\theta \propto \int P(\mathbf{f} \mid \theta, \mathbf{g}, H) \cdot P(\mathbf{g} \mid \theta, H) \cdot P(\theta \mid H) \, d\theta \qquad (10)$$

If the evidence $P(\mathbf{g} \mid \theta, H)$ is sharply peaked around some value $\hat{\theta}$ and the prior $P(\theta \mid H)$ is fairly flat in that region, then $P(\mathbf{f} \mid \mathbf{g}, H) \cong P(\mathbf{f} \mid \hat{\theta}, \mathbf{g}, H)$. Otherwise, if the marginal integrand is not well approximated at the modal value of the evidence, then misleadingly narrow posterior probability densities may result.

If the errors have uniform standard deviation $\sigma_b$, then the symmetric covariance matrix has full rank M with $C = \sigma_b^2 I$, and the probability of data (7) becomes:

$$P(\mathbf{g} \mid \mathbf{f}, \beta, H) = \frac{1}{Z_b(\beta)} \cdot \exp\left(-E_b(\mathbf{g} \mid \mathbf{f}, H)\right) \qquad (11)$$

where $\beta = 1/\sigma_b^2$ is a measure of the noise in each pixel,

$$E_b(\mathbf{g} \mid \mathbf{f}, H) = \frac{1}{2} \cdot \frac{\mathbf{b}^T \mathbf{b}}{\sigma_b^2} = \frac{1}{2} \sum_{m=1}^{M} \frac{\left(g_m - \sum_{n=1}^{N} R_{mn} f_n\right)^2}{\sigma_b^2}$$

is the error function, and $Z_b(\beta) = (2\pi/\beta)^{M/2}$ is the noise partition function. More complex models use the intrinsic correlation function $C = [G G^T]^{-1}$, where $G$ is a convolution from an imaginary hidden image, which is uncorrelated, to the real correlated image.

If the prior probability of the image $\mathbf{f}$ is also Gaussian:

$$P(\mathbf{f} \mid F_0, H) = \frac{1}{(2\pi)^{N/2} \det^{1/2} F_0} \cdot \exp\left(-\frac{1}{2} \mathbf{f}^T F_0^{-1} \mathbf{f}\right) \qquad (12)$$

where $F_0$ is the prior covariance matrix of $\mathbf{f}$, and assuming a uniform standard deviation $\sigma_f$ of the image, then its prior probability distribution becomes:

$$P(\mathbf{f} \mid \alpha, H) = \frac{1}{Z_f(\alpha)} \cdot \exp\left(-E_f(\mathbf{f} \mid F_0)\right) \qquad (13)$$

where the parameter $\alpha = 1/\sigma_f^2$ measures the expected smoothness of $\mathbf{f}$, $Z_f(\alpha) = (2\pi/\alpha)^{N/2}$ is the partition function of $\mathbf{f}$, and $E_f(\mathbf{f} \mid F_0) = \frac{1}{2} \mathbf{f}^T F_0^{-1} \mathbf{f}$.

The posterior probability of image $\mathbf{f}$ given data $\mathbf{g}$ is derived from Bayes' theorem:

$$P(\mathbf{f} \mid \mathbf{g}, \alpha, \beta, H) = \frac{P(\mathbf{g} \mid \mathbf{f}, \beta, H) \cdot P(\mathbf{f} \mid \alpha, H)}{P(\mathbf{g} \mid \alpha, \beta, H)} \qquad (14)$$

where the evidence $P(\mathbf{g} \mid \alpha, \beta, H)$ is the normalizing factor. Since the numerator in (14) is a product of Gaussian functions of $\mathbf{f}$, we may rewrite:

$$P(\mathbf{f} \mid \mathbf{g}, \alpha, \beta, H) = \frac{\exp\left(-E_f - E_b\right)}{Z_M(\alpha, \beta)} = \frac{\exp\left(-M(\mathbf{f})\right)}{Z_M(\alpha, \beta)} \qquad (15)$$

where $M(\mathbf{f}) = E_f + E_b$ and $Z_M(\alpha, \beta) = \int_{\mathbf{f}} \exp\left(-M(\mathbf{f})\right) d\mathbf{f}$, with the integral covering the space of all admissible images in the partition function. Therefore, minimizing the objective function $M(\mathbf{f})$ corresponds to finding the most probable image $\mathbf{f}_{MP}$, which is the mean value of the Gaussian posterior distribution. Its covariance matrix $A^{-1}$ that defines the joint error bars on $\mathbf{f}$ can be obtained from the Hessian matrix $A = -\nabla\nabla \log P(\mathbf{f} \mid \mathbf{g}, \alpha, \beta, H)$ evaluated at $\mathbf{f}_{MP}$. The image $\mathbf{f}_{MP}$ is obtained by differentiating $\log P(\mathbf{f} \mid \mathbf{g}, \alpha, \beta, H)$ and solving for the derivative being zero:

$$\mathbf{f}_{MP} = \left[R^T R + \frac{\sigma_b^2}{\sigma_f^2} C\right]^{-1} R^T \mathbf{g} \qquad (16)$$

The term $\frac{\sigma_b^2}{\sigma_f^2} C$ regularizes the ill-conditioned inversion. When the term $\frac{\sigma_b^2}{\sigma_f^2} C$ is negligible, the optimal linear filter equates to the pseudoinverse $R^{-1} = \left[R^T R\right]^{-1} R^T$.

Entropic Prior of Images

Invoking the ME principle requires that the prior knowledge be stated as a set of constraints on $\mathbf{f}$, although this affects the amount by which the image reconstruction is offset from reality. The prior information about $\mathbf{f}$ may be expressed as a probability distribution (Djafari, 1995):

$$P(\mathbf{f} \mid \alpha, H) = \frac{1}{Z(\alpha)} \cdot \exp\left(-\alpha \cdot \Phi(\mathbf{f})\right) \qquad (17)$$

where $\alpha$ is generally a positive parameter and $Z(\alpha)$ is the normalizing factor. The entropic prior in the discrete case may correspond to potential functions like:

$$\Phi(\mathbf{f}) = \sum_{n=1}^{N} f_n \cdot \ln \frac{f_n}{U} \qquad (18)$$

where $U$ is the total number of quanta in the image $\mathbf{f}$ (Mutihac et al., 1997). The posterior probability of an image $\mathbf{f}$ drawn from some measured data $\mathbf{g}$ is given by Bayes' theorem:

$$P(\mathbf{f} \mid \mathbf{g}, \alpha, C, H) \propto \exp\left(-\alpha \sum_{n=1}^{N} f_n \cdot \ln \frac{f_n}{U}\right) \cdot \exp\left[-\frac{1}{2} \sum_{m=1}^{M} \frac{\left(g_m - \sum_{n=1}^{N} R_{mn} f_n\right)^2}{\sigma_m^2}\right] \qquad (19)$$

An estimation rule, such as posterior mean or maximum a posteriori (MAP), is needed in order to choose an optimal, unique, and stable solution $\tilde{\mathbf{f}}$ for the estimated image. The posterior probability is assumed to summarize the full state of knowledge on a given scene. Producing a single image as the best restoration naturally leads to the most likely one, which maximizes the posterior probability $P(\mathbf{f} \mid \mathbf{g}, \alpha, C, H)$, along with some statement of reliability derived from the spread of all admissible images. In variational problems with linear constraints, Agmon et al. (1979) showed that the potential function associated with a positive, additive image is always concave for any set of Lagrange multipliers, and it possesses a unique minimum which coincides with the solution of the nonlinear system of constraints. As a prerequisite, the linear independence of the constraints is checked and then the necessary and sufficient conditions for a feasible solution are formulated. Wilczek and Drapatz (1985) suggested the Newton-Raphson iteration method as offering high accuracy results. Ortega and Rheinboldt (1970) adopted a continuation technique for the very few cases where Newton's method fails to converge. These techniques are nevertheless successful in practice for relatively small data sets only and assume a symmetric positive definite Hessian matrix of the potential function.
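For the purely Gaussian case treated in this section, the most probable image has a closed form: a regularized least-squares solution in which the ratio of noise variance to image variance weights the regularizing term. The sketch below, using NumPy, simulates the linear model $\mathbf{g} = R\mathbf{f} + \mathbf{b}$ for a one-dimensional signal and solves the corresponding normal equations. The helper names `blur_matrix` and `map_restore`, the three-tap kernel, the zero-padded boundary, the test signal, and the choice of the identity as regularizing matrix are all illustrative assumptions rather than choices made in the article.

```python
import numpy as np

def blur_matrix(n, kernel):
    """Build an n x n matrix R applying a 1-D blur kernel (zero-padded borders)."""
    half = len(kernel) // 2
    R = np.zeros((n, n))
    for m in range(n):
        for k, weight in enumerate(kernel):
            col = m + k - half
            if 0 <= col < n:
                R[m, col] = weight
    return R

def map_restore(R, g, sigma_b, sigma_f, C=None):
    """Most probable image for Gaussian noise and a Gaussian prior.

    Solves [R^T R + (sigma_b^2 / sigma_f^2) C] f = R^T g.
    C defaults to the identity (an assumption: uncorrelated prior).
    """
    n = R.shape[1]
    if C is None:
        C = np.eye(n)
    lam = (sigma_b / sigma_f) ** 2          # regularization weight
    A = R.T @ R + lam * C
    return np.linalg.solve(A, R.T @ g)      # avoids forming an explicit inverse

# Hypothetical 1-D "scene": a plateau plus an isolated bright pixel.
rng = np.random.default_rng(0)
n = 64
f_true = np.zeros(n)
f_true[20:30] = 1.0
f_true[45] = 2.0

R = blur_matrix(n, [0.25, 0.5, 0.25])       # assumed PSF
sigma_b = 0.01                              # assumed noise level
g = R @ f_true + rng.normal(0.0, sigma_b, size=n)

f_map = map_restore(R, g, sigma_b=sigma_b, sigma_f=1.0)
```

Letting `sigma_b / sigma_f` tend to zero removes the regularizing term, and the solution approaches the pseudoinverse applied to the data, which amplifies noise when $R^T R$ is ill-conditioned. The entropic prior leads instead to a nonlinear problem that has no such closed form and is not covered by this sketch.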

Quality Assessment of Image Restoration

In all digital imaging systems, quality degradation is inevitable due to various sources like photon shot noise, finite acquisition time, readout noise, dark current noise, and quantization noise. Some noise sources can be effectively suppressed, yet some cannot. The combined effect of these degradation sources is often modeled by Gaussian additive noise (Pham et al., 2005). In order to quantitatively estimate the restoration quality in the case of similar size (M = N) for both the measured data $\mathbf{g}$ and the restored image $\tilde{\mathbf{f}}$, the mean energy of restoration error:

$$D = \frac{1}{N} \sum_{n=1}^{N} \left(g_n - \tilde{f}_n\right)^2 \qquad (20)$$

may be used as a merit factor. Yet too high a value for D may set the restored image far from the original scene and raise questions on introducing spurious features for which there is no clear evidence in measurements, complicating the subsequent inference and plausibility. A more realistic degradation measure of image blurring by additive noise is referred to in terms of a metric called blurred signal-to-noise ratio, redefined here by using the noise variance in each pixel such as:

$$BSNR = 10 \cdot \lg \left\{ \frac{1}{N} \sum_{n=1}^{N} \frac{\left[y_n - \bar{y}_n\right]^2}{\sigma_n^2} \right\} \qquad (21)$$

where $\mathbf{y} = \mathbf{g} - \mathbf{b}$ is the difference between the measured data $\mathbf{g}$ and the noise $\mathbf{b}$. In simulations, where the original image $\mathbf{f}$ of the measured data $\mathbf{g}$ is available, the objectivity of testing the performance of image restoration algorithms may be assessed by the improvement of signal-to-noise ratio metric defined as:

$$ISNR = 10 \cdot \lg \frac{\sum_{n=1}^{N} \left[f_n - g_n\right]^2}{\sum_{n=1}^{N} \left[f_n - \tilde{f}_n\right]^2} \qquad (22)$$

where $\tilde{\mathbf{f}}$ is the best statistical estimation of the correct solution $\mathbf{f}$. While mean squared error metrics like ISNR do not always reflect the perceptual properties of the human visual system, they may provide an objective standard by which to compare different image processing techniques. Nevertheless, it is of major significance that the behavior of various algorithms be analyzed from the point of view of ringing and noise amplification, which can be a key indicator of improvement in quality for subjective comparisons of restoration algorithms (Banham and Katsaggelos, 1997).
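The three merit factors of this section are simple to compute once the arrays are available. The following NumPy sketch is one possible reading of them, with the function names `restoration_error`, `bsnr`, and `isnr` chosen here for illustration. Two points are assumptions: the decimal logarithm is used for "lg", and the barred quantity in the BSNR is taken to be the mean of the noise-free blurred image, which the text does not define explicitly.

```python
import numpy as np

def restoration_error(g, f_est):
    """Mean energy of the restoration error D between data g and estimate f_est."""
    g = np.asarray(g, dtype=float)
    f_est = np.asarray(f_est, dtype=float)
    return float(np.mean((g - f_est) ** 2))

def bsnr(y, sigma):
    """Blurred signal-to-noise ratio in dB.

    y     : noise-free blurred image (measured data minus noise)
    sigma : per-pixel noise standard deviation (scalar or array like y)
    Assumption: the barred term is the mean of y over all pixels.
    """
    y = np.asarray(y, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(10.0 * np.log10(np.mean((y - y.mean()) ** 2 / sigma ** 2)))

def isnr(f, g, f_est):
    """Improvement in signal-to-noise ratio in dB (needs the true image f)."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    f_est = np.asarray(f_est, dtype=float)
    return float(10.0 * np.log10(np.sum((f - g) ** 2) / np.sum((f - f_est) ** 2)))

# Toy data: the estimate is ten times closer to the truth than the measurement.
f = np.array([0.0, 1.0, 2.0, 3.0])
g = f + 1.0
f_est = f + 0.1
gain_db = isnr(f, g, f_est)
```

A positive ISNR indicates that the restored image is closer to the original scene than the measured data are. Since the true image is required, the metric is usable only in simulations, as noted above.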

FUTURE TRENDS

A practical Bayesian framework for neural-inspired modeling aims to develop probabilistic models that fit data and perform optimal predictions. The link between Bayesian inference and neural models gives new perspectives to the assumptions and approximations made on ANNs when used as associative memories. Evolutionary optimization algorithms capable of discovering the absolute minimum (maximum) of a function are needed. A statistically biased redefinition of the concept of pattern existence, used in a quantitative manner to assess the overall quality of digital images with domain-specific relevance, would increase the accuracy of ranking the image restoration methods.

An efficient MAP procedure has to be implemented in a recursive, supervised trained neural net in order to restore (reconstruct) the best image in compliance with the existing constraints and with the measuring and modeling errors.

CONCLUSION

A major intrinsic difficulty in Bayesian image restoration resides in the determination of a prior law for images. The ME principle solves this problem in a self-consistent way. The ME model for image deconvolution forces the restored image to be positive. The spurious negative areas and complementary spurious positive areas are removed and the dynamic range of the restored image is substantially enhanced. Image restoration based on image entropy is effective even in the presence of significant noise and of missing or corrupted data. This is due to the appropriate regularization of the inverse problem of image restoration introduced in a coherent way by the ME principle. It satisfies all consistency requirements when combining the prior knowledge and the information contained in experimental data. A major result is that no artifacts are added since no structure is enforced by entropic priors. The Bayesian ME approach is a statistical method which operates directly in the spatial domain, thus eliminating the inherent errors arising from numerical direct and inverse Fourier transformations and from the truncation of signals.

REFERENCES

Agmon, N., Alhassid, Y., & Levine, R. D. (1979). An algorithm for finding the distribution of maximal entropy. Journal of Computational Physics, 30, 250-258.

Banham, M. R. & Katsaggelos, A. K. (1997, March). Digital image restoration. IEEE Signal Processing Magazine, 24-41.

Berger, J. (1985). Statistical decision theory and Bayesian analysis. Springer-Verlag.

Candes, E. J. (1998). Ridgelets: Theory and applications. PhD Thesis. Department of Statistics, Stanford University.

Daniell, G. J. (1994). Of maps and monkeys: An introduction to the maximum entropy method. In B. Buck & V. A. Macaulay (Eds.), Maximum entropy in action (pp. 1-18). Oxford: Clarendon Press.

Djafari, A. M. (1995). A full Bayesian approach for inverse problems. In K. M. Hanson & R. N. Silver (Eds.), Maximum entropy and Bayesian methods (pp. 135-144).

Donoho, D. L. (1996). Unconditional bases and bit-level compression. Applied and Computational Harmonic Analysis, 1(1), 100-105.

Gull, S. F. & Skilling, J. (1985). The entropy of an image. In C. R. Smith & W. T. Grandy Jr. (Eds.), Maximum entropy and Bayesian methods in inverse problems (pp. 287-302). Dordrecht: Kluwer Academic Publishers.

MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448-472.

MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.

Mallat, S. (2000). Une exploration des signaux en ondelettes. Editions de l'Ecole Polytechnique.

Meyer, Y. (1993). Review of "An introduction to wavelets and ten lectures on wavelets." Bulletin of the American Mathematical Society, 28, 350-359.

Mutihac, R., Colavita, A. A., Cicuttin, A., & Cerdeira, A. E. (1997). Bayesian modeling of feed-forward neural networks. Fuzzy Systems & Artificial Intelligence, 6(1-3), 31-40.

Myers, K. J. & Hanson, K. M. (1990). Comparison of the algebraic reconstruction technique with the maximum entropy reconstruction technique for a variety of detection tasks. Proceedings of SPIE, 1231, 176-187.

Neal, R. M. (1994). Priors for infinite networks. Technical Report CRG-TR-94-1, Department of Computer Science, University of Toronto.

Neal, R. M. (1996). Bayesian learning for neural networks. In Lecture Notes in Statistics, 118. New York: Springer-Verlag.

Ortega, J. M. & Rheinboldt, W. B. (1970). Iterative solution of nonlinear equations in several variables. New York: Academic Press.

Pham, T. Q., van Vliet, L. J., & Schutte, K. (2005). Influence of SNR and PSF on limits of super-resolution. Proceedings of SPIE-IS&T Electronic Imaging, 5672, 169-180.

Skilling, J. (1989). Classic maximum entropy. In J. Skilling (Ed.), Maximum entropy and Bayesian methods (pp. 45-52). Dordrecht: Kluwer Academic Publishers.

Wilczek, R. & Drapatz, S. (1985). A high accuracy algorithm for maximum entropy image restoration in the case of small data sets. Astronomy and Astrophysics, 142, 9-12.

KEY TERMS

Artificial Neural Networks (ANNs): Highly parallel nets of interconnected simple computational elements, which perform elementary operations like summing the incoming inputs (afferent signals) and amplifying/thresholding the sum.

Bayesian Inference: An approach to statistics in which all forms of uncertainty are expressed in terms of probability.

Deconvolution: An algorithmic method for eliminating noise and improving the resolution of digital data by reversing the effects of convolution on recorded data.

Digital Image: A representation of a 2D/3D image as a finite set of digital values called pixels/voxels, typically stored in computer memory as a raster image or raster map.

Entropy: A measure of the uncertainty associated with a random variable. Entropy quantifies information in a piece of data.

Image Restoration: A blurred image can be significantly improved by deconvolving its PSF in such a way that the result is a sharper and more detailed image.

Point Spread Function (PSF): The output of the imaging system for an input point source.

Probabilistic Inference: An effective approach to approximate reasoning and empirical learning in AI.

Behaviour-Based Clustering of Neural Networks María José Castro-Bleda Universidad Politécnica de Valencia, Spain Salvador España-Boquera Universidad Politécnica de Valencia, Spain Francisco Zamora-Martínez Universidad Politécnica de Valencia, Spain

INTRODUCTION

The field of off-line optical character recognition (OCR) has been a topic of intensive research for many years (Bozinovic, 1989; Bunke, 2003; Plamondon, 2000; Toselli, 2004). One of the first steps in the classical architecture of a text recognizer is preprocessing, where noise reduction and normalization take place. Many systems do not require a binarization step, so the images are maintained in gray-level quality. Document enhancement not only influences the overall performance of OCR systems, but it can also significantly improve document readability for human readers. In many cases, the noise of document images is heterogeneous, and a technique fitted for one type of noise may not be valid for the overall set of documents. One possible solution to this problem is to use several filters or techniques and to provide a classifier to select the appropriate one. Neural networks have been used for document enhancement (see Egmont-Petersen, 2002, for a review of image processing with neural networks). One advantage of neural network filters for image enhancement and denoising is that a different neural filter can be automatically trained for each type of noise. This work proposes the clustering of neural network filters to avoid having to label training data and to reduce the number of filters needed by the enhancement system. An agglomerative hierarchical clustering algorithm of supervised classifiers is proposed to do this. The technique has been applied to filter out the background noise from an office (coffee stains and footprints on documents, folded sheets with degraded printed text, etc.).

BACKGROUND

Multilayer Perceptrons (MLPs) have been used in previous works for image restoration: the input to the MLP is the pixels in a moving window, and the output is the restored value of the current pixel (Egmont-Petersen, 2002; Hidalgo, 2005; Stubberud, 1995; Suzuki, 2003). We have also used neural network filters to estimate the gray level of one pixel at a time (Hidalgo, 2005): the input to the MLP consisted of a square of pixels that was centered at the pixel to be cleaned, and there were four output units to gain resolution (see Figure 1). Given a set of noisy images and their corresponding clean counterparts, a neural network was trained. With the trained network, the entire image was cleaned by scanning all the pixels with the MLP. The MLP, therefore, functions like a nonlinear convolution kernel. The universal approximation property of an MLP guarantees the capability of the neural network to approximate any continuous mapping (Bishop, 1996). This approach clearly outperforms other classic spatial filters for reducing or eliminating noise from images (the mean filter, the median filter, and the closing/opening filter (Gonzalez, 1993)) when applied to enhance and clean a homogeneous background noise (Hidalgo, 2005).
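The scanning procedure described above, in which a trained network acts as a nonlinear convolution kernel, can be sketched in a few lines of NumPy. This is only a schematic illustration: `apply_window_filter` and `tiny_mlp` are hypothetical names, the network has a single output unit rather than the four used by the authors, borders are handled by edge replication (an assumption), and the weights would normally come from training on pairs of noisy and clean images rather than being supplied by hand.

```python
import numpy as np

def apply_window_filter(image, predict, half=1):
    """Scan `image` and replace each pixel by predict(window around it).

    `predict` maps a flattened (2*half+1)^2 window to one gray level.
    Borders are handled by edge replication (an assumption).
    """
    image = np.asarray(image, dtype=float)
    padded = np.pad(image, half, mode="edge")
    size = 2 * half + 1
    out = np.empty(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            window = padded[r:r + size, c:c + size]
            out[r, c] = predict(window.ravel())
    return out

def tiny_mlp(w_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer perceptron with tanh units and a single linear output."""
    def predict(x):
        hidden = np.tanh(w_hidden @ x + b_hidden)
        return float(w_out @ hidden + b_out)
    return predict

# Untrained placeholder weights, only to show the calling convention.
rng = np.random.default_rng(0)
net = tiny_mlp(rng.normal(size=(4, 9)) * 0.1, np.zeros(4),
               rng.normal(size=4) * 0.1, 0.0)
noisy = rng.random((8, 8))
cleaned = apply_window_filter(noisy, net, half=1)
```

Because `predict` is an arbitrary function of the window, the same scanning loop reproduces classical filters as special cases, for example `np.mean` or `np.median` passed in place of the network.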

BEHAVIOUR-BASED CLUSTERING OF NEURAL NETWORKS

Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering is considered to be a more convenient approach than other clustering algorithms, mainly because it makes very few assumptions about the data (Jain, 1999; Mollineda, 2000). Instead of looking for a single partition (based on finding a local minimum), this clustering algorithm constructs a hierarchical structure by iteratively merging clusters according to a certain dissimilarity measure, starting from singletons until no further merging is possible (one general cluster). The hierarchical clustering process can be illustrated with a tree that is called a dendrogram, which shows how the samples are merged and the degree of dissimilarity of each union (see Figure 2). The dendrogram can be easily broken at a given level to obtain clusters of the desired cardinality or with a specific dissimilarity measure. A general hierarchical clustering algorithm can be informally described as follows:

1. Initialization: M singletons as M clusters.
2. Compute the dissimilarity distances between every pair of clusters.
3. Iterative process:
   a) Determine the closest pair of clusters i and j.
   b) Merge the two closest clusters into a new cluster i+j.
   c) Update the dissimilarity distances from the new cluster i+j to all the other clusters.
   d) If more than one cluster remains, go to step a).
4. Select the number N of clusters for a given criterion.

Behaviour-Based Clustering of Supervised Classifiers

When the points of the set to be clustered are supervised classifiers, both a dissimilarity distance and the way to merge two classifiers must be defined (see Figure 2):

1. The dissimilarity distance between two clusters can be based on the behaviour of the classifiers with respect to a validation dataset. The more similar the output of two classifiers is, the closer they are.
2. To merge the closest pair of clusters, a new classifier is trained with the associated training data of both clusters. Another possibility is to build an ensemble of the two classifiers.

Figure 2. Behaviour-based clustering of supervised classifiers. An example of the dendrogram obtained for M=5 points: A, B, C, D, E. If N=3, three clusters are selected: A+B, C, D+E. In this work, to merge two clusters, a new classifier is trained. For example, cluster D+E is trained with the data used to train the classifiers D and E.

An Application of Behaviour-based Clustering of MLPs to Document Enhancement

In this work, MLPs are used as supervised classifiers. When two clusters are merged, a new MLP is trained with the associated training data of the two merged MLPs. This behaviour-based clustering algorithm has been applied to enhance printed documents with typical noises from an office (folded sheets, wrinkled sheets, coffee stains, ...). Figure 1 shows an example of a noisy printed document (wrinkled sheet) from the corpus. A set of MLPs is trained as neural filters for different types of noise and then clustered into groups to obtain a reduced set of neural clustered filters. In order to automatically determine which clustered filter is the most suitable to clean and enhance a real noisy image, an image classifier is also trained using MLPs. Experiments with this enhancement system show excellent results in cleaning noisy documents (Zamora-Martínez, 2007).

Figure 1. An example of document enhancement with an artificial neural network. A cleaned image (right) is obtained by scanning the entire noisy image (left) with the neural network.
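The clustering procedure of this section can be sketched as follows. To keep the example short and runnable, least-squares linear models stand in for the MLPs, which is an assumption made purely for illustration, as are the names `train`, `dissimilarity`, and `cluster_models` and the synthetic data. The dissimilarity is the mean squared difference between the outputs of two models on a shared validation set, and merging retrains a model on the union of the training data of both clusters. The loop stops as soon as N clusters remain, which corresponds to cutting the dendrogram at that level.

```python
import numpy as np

def train(X, y):
    """Stand-in for MLP training: a least-squares linear model (an assumption)."""
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

def dissimilarity(w_a, w_b, X_val):
    """Mean squared difference of two models' outputs on the validation inputs."""
    return float(np.mean((X_val @ w_a - X_val @ w_b) ** 2))

def cluster_models(datasets, X_val, n_clusters):
    """Agglomerative clustering of models by their behaviour on X_val."""
    clusters = [{"members": [i], "X": X, "y": y, "model": train(X, y)}
                for i, (X, y) in enumerate(datasets)]
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dissimilarity(clusters[i]["model"], clusters[j]["model"], X_val)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge: retrain on the union of the training data of both clusters.
        X = np.vstack([clusters[i]["X"], clusters[j]["X"]])
        y = np.concatenate([clusters[i]["y"], clusters[j]["y"]])
        merged = {"members": clusters[i]["members"] + clusters[j]["members"],
                  "X": X, "y": y, "model": train(X, y)}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Five synthetic "noise types": A and B behave alike, as do D and E.
rng = np.random.default_rng(0)
true_weights = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
datasets = []
for w in true_weights:
    X = rng.normal(size=(50, 2))
    y = X @ np.array(w) + rng.normal(0.0, 0.01, size=50)
    datasets.append((X, y))
X_val = rng.normal(size=(100, 2))

clusters = cluster_models(datasets, X_val, n_clusters=3)
groups = sorted(sorted(c["members"]) for c in clusters)
```

On this synthetic data the three remaining clusters group models 0 and 1, model 2 alone, and models 3 and 4, mirroring the A+B, C, D+E partition of Figure 2. Replacing the retraining step by an ensemble of the two merged models would give the alternative merging strategy mentioned in the text.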

FUTURE TRENDS
Document enhancement is becoming more and more relevant due to the huge number of scanned documents. Moreover, it not only influences the overall performance of OCR systems, but it can also significantly improve document readability for human readers.

The method proposed in this work can be improved in two ways: by using ensembles of MLPs when two MLPs are merged, and by improving the method that selects the neural clustered filter that is the most suitable to enhance a given noisy image.

CONCLUSION
An agglomerative hierarchical clustering of supervised-learning classifiers has been proposed that uses a measure of similarity among classifiers based on their behaviour on a validation dataset. As an application of this clustering procedure, we have designed an enhancement system for document images using neural network filters. Both objective and subjective evaluations of the cleaning method show excellent results in cleaning noisy documents. This method could also be used to clean and restore other types of images, such as scanned documents with noisy backgrounds, stained paper of historical documents, images for vehicle license plate recognition, etc.

REFERENCES
Bishop, C.M. (1996). Neural Networks for Pattern Recognition. Oxford University Press.
Bozinovic, R.M., & Srihari, S.N. (1989). Off-Line Cursive Script Word Recognition. IEEE Trans. on PAMI, 11(1), 68–83.
Bunke, H. (2003). Recognition of Cursive Roman Handwriting – Past, Present and Future. In: Proc. ICDAR. 448–461.
Egmont-Petersen, M., de Ridder, D., & Handels, H. (2002). Image processing with neural networks – a review. Pattern Recognition 35(10). 2279–2301.
Gonzalez, R., & Woods, R. (1993). Digital Image Processing. Addison-Wesley Pub. Co.
Hidalgo, J.L., España, S., Castro, M.J., & Pérez, J.A. (2005). Enhancement and cleaning of handwritten data by using neural networks. In: Pattern Recognition and Image Analysis. Volume 3522 of LNCS. Springer-Verlag. 376–383.
Jain, A.K., Murty, M.N., & Flynn, P.J. (1999). Data clustering: a review. ACM Comput. Surv. 31(3). 264–323.
Kanungo, T., & Zheng, Q. (2004). Estimating Degradation Model Parameters Using Neighborhood Pattern Distributions: An Optimization Approach. IEEE Trans. on PAMI 26(4). 520–524.
Mollineda, R.A., & Vidal, E. (2000). A relative approach to hierarchical clustering. In: Pattern Recognition and Applications. Volume 56. IOS Press. 19–28.
Plamondon, R., & Srihari, S.N. (2000). On-line and off-line handwriting recognition: A comprehensive survey. IEEE Trans. on PAMI 22(1). 63–84.
Stubberud, P., Kanai, J., & Kalluri, V. (1995). Adaptive Image Restoration of Text Images that Contain Touching or Broken Characters. In: Proc. ICDAR. Volume 2. 778–781.
Suzuki, K., Horiba, I., & Sugie, N. (2003). Neural Edge Enhancer for Supervised Edge Enhancement from Noisy Images. IEEE Trans. on PAMI 25(12). 1582–1596.
Toselli, A.H., Juan, A., González, J., Salvador, I., Vidal, E., Casacuberta, F., Keysers, D., & Ney, H. (2004). Integrated Handwriting Recognition and Interpretation using Finite-State Models. Int. Journal of Pattern Recognition and Artificial Intelligence 18(4). 519–539.
Zamora-Martínez, F., España-Boquera, S., & Castro-Bleda, M.J. (2007). Behaviour-based Clustering of Neural Networks applied to Document Enhancement. In: Computational and Ambient Intelligence. Volume 4507 of LNCS. Springer-Verlag. 144–151.
http://en.wikipedia.org

KEY TERMS
Artificial Neural Network: An artificial neural network (ANN), often just called a “neural network” (NN), is an interconnected group of artificial neurons that uses a mathematical model or computational model for information processing based on a connectionist approach to computation.

Backpropagation Algorithm: A supervised learning technique used for training artificial neural networks. It was first described by Paul Werbos in 1974, and further developed by David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams in 1986. It is most useful for feed-forward networks (networks that have no feedback, or simply, that have no connections that loop).

Clustering: The classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure. Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.

Document Enhancement: Accentuation of certain desired features, which may facilitate later processing steps such as segmentation or object recognition.

Hierarchical Agglomerative Clustering: Hierarchical clustering algorithms find successive clusters using previously established clusters. Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.

Multilayer Perceptron (MLP): This class of artificial neural networks consists of multiple layers of computational units, usually interconnected in a feedforward way. Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications the units of these networks apply a sigmoid function as an activation function.

Optical Character Recognition (OCR): A type of computer software designed to translate images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text, or to translate pictures of characters into a standard encoding scheme representing them (e.g. ASCII or Unicode). OCR began as a field of research in pattern recognition, artificial intelligence and machine vision.

Supervised Learning: A machine learning technique for creating a function from training data. The training data consist of pairs of input objects (typically vectors), and desired outputs. The output of the function can be a continuous value (called regression), or can predict a class label of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output).

Bio-Inspired Algorithms in Bioinformatics I
José Antonio Seoane Fernández
University of A Coruña, Spain
Mónica Miguélez Rico
University of A Coruña, Spain

INTRODUCTION
Large worldwide projects like the Human Genome Project, which in 2003 successfully concluded the sequencing of the human genome, and the recently completed HapMap Project, have opened new perspectives in the study of complex multigene illnesses: they have provided us with new information to tackle the complex mechanisms and relationships between genes and environmental factors that generate complex illnesses (Lopez-Bigas, 2004; Domínguez, 2006). Thanks to these new genomic and proteomic data, it becomes increasingly possible to develop new medicines and therapies, establish early diagnoses, and even discover new solutions for old problems. These tasks, however, inevitably require the analysis, filtering, and comparison of a large amount of data generated in a laboratory with an enormous amount of data stored in public databases, such as those of the NCBI and the EBI. Computer science equips biomedicine with an environment that simplifies our understanding of the biological processes that take place at every organizational level of living matter (molecular level, genetic level, cell, tissue, organ, individual, and population) and of the intrinsic relationships between them. Bioinformatics can be described as the application of computational methods to biological discoveries (Baldi, 1998). It is a multidisciplinary area that includes computer science, biology, chemistry, mathematics, and statistics. The three main tasks of bioinformatics are the following: develop algorithms and mathematical models to test the relationships between the members of large biological datasets, analyze and interpret heterogeneous data types, and implement tools that allow the storage, retrieval, and management of large amounts of biological data.

BACKGROUND
The following section describes some of the problems that are most commonly found in bioinformatics.

Interpretation of Gene Expression
Gene expression is the process by which the information encoded in a gene is transformed into the proteins needed for the development and functioning of the cell. In the course of this process, small RNA sequences, called messenger RNA (mRNA), are formed by transcription and subsequently translated into proteins. The amount of expressed mRNA can be measured with various methods, such as gel electrophoresis, but large numbers of simultaneous expression analyses are usually carried out with microarrays (Quackenbush, 2001), which make it possible to measure the expression of tens of thousands of genes simultaneously; such an amount of data can only be analyzed computationally. Among the most common tasks in this type of analysis are finding the differences in expression between, for instance, a patient and a control, and testing whether a gene is expressed or not. These tasks can be cast as classical problems of classification and clustering. Clustering not only identifies groups of genes with similar expression in microarray experiments, but also suggests functional relationships between the members of a cluster.

Alignment of DNA, RNA, and Protein Sequences
Sequence alignment consists in superposing two or more sequences of nucleotides (DNA and RNA) or amino acids (proteins) in order to compare them and analyze the parts of the sequences that are alike and unalike. The optimal alignment is the one that shows the most correspondences between the nucleotides or amino acids and therefore has the highest score. This alignment may or may not have a biological meaning. There are two types of alignment: the global alignment, which maximizes the number of coincidences over the entire sequence, and the local alignment, which looks for similar regions in large sequences that are normally highly divergent. The most commonly used technique to implement alignments is dynamic programming by means of the Smith-Waterman algorithm (Smith, 1981), which explores all the possible comparisons in the sequences. Another problem in sequence alignment is multiple alignment (Wallace, 2005), which consists in aligning three or more sequences of DNA, RNA, or proteins, and is generally used to search for evolutionary relationships between these sequences. The problem is equivalent to that of pairwise sequence alignment, but takes into consideration the n sequences that are to be compared. The complexity of the algorithm increases exponentially with the number of sequences to compare.
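As an illustration of the dynamic-programming approach mentioned above, the following Python sketch implements a basic Smith-Waterman local alignment with a linear gap penalty. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are assumptions chosen for the example; practical tools use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTTT", "GGACGGG"))  # prints (6, 'ACG', 'ACG')
```

The table has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths, which is why the same idea becomes impractical when extended directly to many sequences.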

Identification of the Gene Regulatory Network
All the information of a living organism’s genome is stored in each and every one of its cells. Whereas the genome holds the information needed to build all the cells of the body, the regulatory network is in charge of guiding the expression of a given set of genes in one cell rather than another, so as to form certain types of cells (cellular differentiation) or carry out specific functions related to spatial and temporal localization; in other words, it makes the genes express themselves when and where necessary. The role of a gene regulatory network therefore consists in integrating the dynamic behaviour of the cell with the external signals from its environment, and in guiding the interaction of all the cells so as to control the process of cellular differentiation (Geard, 2004). Inferring this regulatory network from cellular expression data is considered to be one of the most complex problems in bioinformatics (Akutsu, 1999).

Construction of Phylogenetic Trees
A phylogenetic tree (Setúbal, 1999) is a tree that shows the evolutionary relationships between various species that are believed to share a common ancestor. Whereas morphological characteristics are traditionally used to carry out such analyses, in the present case we will study molecular phylogenetic trees, which use sequences of nucleotides or amino acids for classification. The construction of these trees is initially based on algorithms for multiple sequence alignment, which allow us to classify the evolutionary relationships between homologous genes present in various species. In a second phase, we must calculate the genetic distance between each pair of sequences in order to represent them correctly in the tree.
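The second phase described above, computing a genetic distance for each pair of aligned sequences, can be sketched as follows. The article does not say which distance is used; this example assumes the proportion of differing sites (p-distance) with the Jukes-Cantor correction, a common textbook choice, and the function names are hypothetical.

```python
import math
from itertools import combinations


def p_distance(s, t):
    """Proportion of differing sites between two aligned sequences,
    ignoring columns where either sequence has a gap ('-')."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(s, t) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in compared) / len(compared)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for nucleotides: -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("correction undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(aligned):
    """Pairwise corrected distances for a dict {name: aligned sequence}."""
    return {
        (m, n): jukes_cantor(p_distance(aligned[m], aligned[n]))
        for m, n in combinations(sorted(aligned), 2)
    }


aligned = {
    "sp1": "ACGTACGT",
    "sp2": "ACGTACGA",
    "sp3": "ACCTAGGA",
}
for pair, d in distance_matrix(aligned).items():
    print(pair, round(d, 4))
```

A matrix of this kind is the usual input of distance-based tree-building methods.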

Gene Finding and Mapping
Gene finding (Fickett, 1996) basically consists in identifying genes in a DNA strand by recognizing the sequence that initiates the coding region of the gene, i.e. the gene promoter. When the protein that will read the gene finds the sequence of that promoter, we know that the next step is the recognition of the gene. Gene mapping (Setúbal, 1999) consists in creating a genetic map by assigning genes to a position inside the chromosome and by indicating the relative distance between them. There are two types of mapping. Physical or cytogenetic mapping, on the one hand, consists in dividing the chromosome into small labelled fragments. Once divided, they must be ordered and placed in their correct position in the chromosome. Linkage mapping, on the other hand, shows the position of some genes with respect to others. The latter mapping type has two drawbacks: it does not provide the physical distance between the genes, and it is unable to provide the correct order if the genes are very close to each other.

Prediction of DNA, RNA, and Protein Structure
DNA and RNA sequences fold into a three-dimensional structure that is determined by the order of the nucleotides within the sequence. Under the same environmental conditions, sequences with different three-dimensional structures behave differently. Since the secondary structure of nucleic acids is a factor that affects the binding of both DNA and RNA molecules, it is essential to know these structures in order to analyze a sequence. The prediction of the folds that determine the RNA structure is an important factor in the understanding of many biological processes, such as translation of messenger RNA, replication of RNA chains in viruses, and the function of structural RNA and RNA/protein complexes. The three-dimensional structure of proteins is extremely diverse, ranging from completely fibrous to globular. Predicting the folds of proteins is important, because a protein’s structure is closely related to its function. The experimental determination of the protein structure as such helps us to find the protein function and allows us to design synthetic proteins that can be used as medicines.

BIO-INSPIRED ALGORITHMS
The basic principle of bio-inspired algorithms is to use analogies with natural systems in order to solve problems. By simulating the behaviour of natural systems, these algorithms provide heuristic, non-deterministic methods for searching, learning, behaviour, etc. (Forbes, 2004).

Artificial Neural Networks
Artificial neural networks (ANNs) (McCulloch, 1943)(Hertz, 1991)(Bishop, 1995)(Rumelhart, 1986) are computational models inspired by the behaviour of the nervous system. Even though their development is based on the modelling of biological processes in the brain, there are considerable differences between the processing elements of ANNs and actual neurons. ANNs consist of networks of interconnected units that are organized in layers and evolve in the course of time. The main features of these systems are the following:

Self-organization and adaptability: allow robust and adaptive processing, adaptive training, and self-organizing networks.
Non-linear processing: increases the network’s capacity to approximate, classify, and be immune to noise.
Parallel processing: uses a large number of processing units with a high level of interconnectivity.

ANNs can be classified according to their learning type:

Supervised learning neural networks: the network learns relationships between the input and output data. The input data are passed on to the input layer and propagate through the network architecture until they reach the output layer. The output obtained in this output layer is compared to the expected output, and subsequently the weights of the interconnections are modified so as to minimize the error between the obtained and the expected output.
Non-supervised learning networks: in this type of learning, no expected output is passed on to the network; instead, the network itself searches for the differences between the inputs and separates the data accordingly.
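The supervised learning cycle described above (propagate the input, compare the obtained output with the expected one, adjust the weights to reduce the error) can be illustrated with the simplest possible case: a single sigmoid unit trained with the delta rule. This is a minimal sketch under assumptions; the learning rate, the number of epochs, and the logical OR data are chosen for the example, and a real multilayer network would propagate the error back through its hidden layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Forward pass: weighted sum of the inputs followed by the sigmoid."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_neuron(samples, epochs=5000, lr=0.5):
    """Train one sigmoid unit by gradient descent on the squared error."""
    w, b = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, target in samples:
            out = predict(w, b, x)
            # Error between expected and obtained output, scaled by the
            # derivative of the sigmoid (delta rule).
            delta = (target - out) * out * (1.0 - out)
            w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
            b += lr * delta
    return w, b


# Logical OR as a toy training set: (inputs, expected output).
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_neuron(OR_DATA)
print([round(predict(w, b, x)) for x, _ in OR_DATA])  # prints [0, 1, 1, 1]
```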

Evolutionary Computation
Evolutionary computation (Rechenberg, 1971)(Holland, 1975) is a technique inspired by biological evolutionary strategies: genetic algorithms, for example, use the biological mechanisms of cross-over, mutation, and selection to solve search and optimization problems. Each of these operators acts on one or more chromosomes, i.e. possible solutions to the problem, and generates another series of chromosomes, i.e. the following generation of solutions. The algorithm is executed iteratively and as such takes the population through the generations until it finds an optimal solution. Another strategy of evolutionary computation is genetic programming (Koza, 1990), which uses the same operators as genetic algorithms to develop the optimal program to solve a problem.
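The generational loop of a genetic algorithm can be sketched as follows. The problem (maximizing the number of ones in a bit string), the tournament selection, the one-point cross-over, the elitism, and all parameter values are assumptions made for this illustration; they are one common configuration, not a prescribed one.

```python
import random


def fitness(chromosome):
    """Toy objective: number of ones in the bit string."""
    return sum(chromosome)


def evolve(n_bits=20, pop_size=30, generations=40, p_mut=0.02, seed=0):
    """Return (best chromosome, best fitness recorded at each generation)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_population = [population[0][:]]  # elitism: keep the best unchanged
        while len(next_population) < pop_size:
            # Selection: the fitter of three random individuals becomes a parent.
            parent1 = max(rng.sample(population, 3), key=fitness)
            parent2 = max(rng.sample(population, 3), key=fitness)
            # Cross-over: splice the parents at a random cut point.
            cut = rng.randrange(1, n_bits)
            child = parent1[:cut] + parent2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best individual is always copied into the next generation, the recorded best fitness never decreases from one generation to the next.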

Swarm Intelligence
Swarm intelligence (Beni, 1989)(Bonabeau, 2001)(Engelbrecht, 2005) is a recent family of bio-inspired techniques based on the social or collective behaviour of groups of insects such as ants or bees, which have very limited capacities as individuals but form groups to carry out complex tasks.

Artificial Immune System
The artificial immune system (Farmer, 1986)(Dasgupta, 1999) is a new computational paradigm that has appeared in recent years and is based on the immune system of vertebrates. The biological immune system is a parallel and distributed adaptive system that uses learning, memory, and associative retrieval to solve problems of recognition and classification. In particular, it learns to recognize patterns, remember them, and use their combinations to build efficient pattern detectors. From the point of view of information processing, these interesting features are used in the artificial immune system to successfully solve complex problems.

CONCLUSION
This article describes the main problems that are presently found in the field of bioinformatics. It also presents some of the bio-inspired computation techniques that provide solutions for problems related to classification, clustering, minimization, modelling, etc. The following article will describe a series of techniques that allow researchers to solve the above problems with bio-inspired models.

ACKNOWLEDGMENT
This work was partially supported by the Spanish Ministry of Education and Culture (Ref TIN200613274) and the European Regional Development Funds (ERDF), grant (Ref. PIO61524) funded by the Carlos III Health Institute, grant (Ref. PGIDIT 05 SIN 10501PR) from the General Directorate of Research of the Xunta de Galicia and grants (File 2006/60, 2007/127 and 2007/144) from the General Directorate of Scientific and Technologic Promotion of the Galician University System of the Xunta de Galicia.

REFERENCES
Akutsu T, Miyano S and Kuhara S. (1999). Identification of genetic networks from a small number of gene expression patterns under the boolean network model. Proceedings of Pacific Symposium on Biocomputing 99:17-28.
Baldi P and Brunak S. (1998). Bioinformatics: The Machine Learning Approach. MIT Press.
Beni G and Wang J. (1989). Swarm intelligence in cellular robotic systems. NATO Advanced Workshop on Robots and Biological Systems. Il Ciocco, Tuscany, Italy.
Bishop CM. (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
Bonabeau E, Dorigo M and Theraulaz G. (2001). Swarm intelligence: From natural to artificial systems. Journal of Artificial Societies and Social Simulation 4(1).
Dasgupta D. (1999). Artificial immune systems and their applications. Springer-Verlag, Berlin.
Domínguez E, Loza MI, Padín JF, Gesteira A, Paz E, Páramo M, Brenlla J, Pumar E, Iglesias F, Cibeira A, Castro M, Caruncho H, Carracedo A and Costas J. (2006). Extensive linkage disequilibrium mapping at HTR2A and DRD3 for schizophrenia susceptibility genes in Galician population. Schizophrenia Research, 2006.
Engelbrecht AP. (2005). Fundamentals of computational swarm intelligence. Wiley.
Farmer J, Packard N and Perelson A. (1986). The immune system, adaptation and machine learning. Physica D 2:189-204.
Fickett JW. (1996). Finding genes by computer: The state of the art. Trends in Genetics 12(8):316-320.
Forbes N. (2004). Imitation of Life: How Biology Is Inspiring Computing. MIT Press.
Geard N. (2004). Modelling Gene Regulatory Networks: Systems Biology to Complex Systems. ACCS Draft Technical Report. ITEE, University of Queensland.
Hertz J, Krogh A and Palmer RG. (1991). Introduction to the theory of neural computation. Addison-Wesley, Redwood City.
Holland J. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press.
Korf I, Yandell M and Bedell J. (2003). BLAST. O’Reilly.
Koza J. (1990). Genetic Programming: A paradigm for genetically breeding populations of computer programs to solve problems. Stanford University Computer Science Department Technical Report.
Lopez-Bigas N and Ouzounis CA. (2004). Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Res. 32, 3108-14.
McCulloch WS and Pitts W. (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics 5:226-33.
Mullis K. (1990). The unusual origin of the polymerase chain reaction. Scientific American 262(4):56-61.
Quackenbush J. (2001). Computational analysis of microarray data. Nature Reviews Genetics 2:418-427.
Rechenberg I. (1973). Evolutionsstrategie – Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. PhD Thesis.
Rumelhart DE, Hinton GE and Williams RJ. (1986). Learning representations by back-propagating errors. Nature 323(99):533-6.
Setubal J and Meidanis J. (1999). Introduction to Computational Molecular Biology. PWS Publishing.
Smith TF and Waterman MS. (1981). Identification of common molecular subsequences. Journal of Molecular Biology 24(8):195-197.
Wallace IM, Blackshields G and Higgins DG. (2005). Multiple sequence alignments. Current Opinion in Structural Biology 15(3):231-267.

KEY TERMS
Amino Acid: One of the 20 chemical building blocks that are joined by amide (peptide) linkages to form a polypeptide chain of a protein.

Artificial Immune System: Biologically inspired computer algorithms that can be applied to various domains, including fault detection, function optimization, and intrusion detection. Also called computer immune system.

Electrophoresis: The use of an external electric field to separate large biomolecules on the basis of their charge by running them through acrylamide or agarose gel.

Messenger RNA: The complementary copy of DNA formed from a single-stranded DNA template during transcription, which migrates from the nucleus to the cytoplasm where it is processed into a sequence carrying the information to code for a polypeptide domain.

Microarray: A 2D array, typically on a glass, filter, or silicon wafer, upon which genes or gene fragments are deposited or synthesized in a predetermined spatial order, allowing them to be made available as probes in a high-throughput, parallel manner.

Nucleotide: A nucleic acid unit composed of a five-carbon sugar joined to a phosphate group and a nitrogen base.

Swarm Intelligence: An artificial intelligence technique based on the study of collective behaviour in decentralised, self-organised systems.

Transcription: The assembly of complementary single-stranded RNA on a DNA template.

Translation: The process of converting RNA to protein by the assembly of a polypeptide chain from an mRNA molecule at the ribosome.

Bio-Inspired Algorithms in Bioinformatics II
José Antonio Seoane Fernández
University of A Coruña, Spain
Mónica Miguélez Rico
University of A Coruña, Spain

INTRODUCTION
Our previous article presented several computational models inspired by biological models, such as neural networks, evolutionary computation, swarm intelligence, and the artificial immune system. It also explained the most common problems in bioinformatics to which these models can be applied. The present article presents a series of approaches to bioinformatics tasks that were developed by means of artificial intelligence techniques, with a focus on bio-inspired algorithms such as artificial neural networks and evolutionary computation.

BACKGROUND
Previous publications have focused on the use of bio-inspired and other artificial intelligence techniques. Keedwell (2005) has summarized the foundations of molecular biology, the main problems in bioinformatics, and the existing solutions based on artificial intelligence. Baldi (2001) also describes various techniques for problem-solving in bioinformatics. Other general works on this subject can be found in (Larrañaga, 2006), whereas more specialized works focus on solutions based on evolutionary computation (Pal, 2006) or artificial life (Das, 2007).

Bio-Inspired Techniques
The following section describes how the techniques that were mentioned in our article Bio-Inspired Algorithms in Bioinformatics I have been used to solve the main problems in bioinformatics.

Gene Expression
We start by describing how artificial intelligence techniques have contributed to the interpretation of gene expression. Artificial neural networks (ANNs) have been applied extensively to the classification of genetic data. One of the most commonly used architectures for the classification of this type of data is the multilayer perceptron. Many works use this architecture for diagnosis (Wang, 2006)(Wei, 2005)(Narayanan, 2004) and obtain very good results; most of these approaches use artificial neural networks to discover and classify interactions between variables (gene expression values). Statnikov (2005) and Lee (2005) compare several classification techniques, such as ANNs trained with backpropagation, probabilistic ANNs, Support Vector Machines (SVM), K-Nearest Neighbour (KNN), and other statistical methods, for the classification of data from microarray expression experiments. In this type of gene expression data classification, we can also find combinations of ANNs and evolutionary techniques: Ritchie (2004) encodes the architecture and weights of the network in each individual of the population, so that genetic programming optimizes the network to minimize the error between the output layer and the expected output, while Kim (2004) and Keedwell (2005) propose hybrids between ANNs and genetic algorithms (GAs). Genetic programming (GP) as such has also been used (Gilbert, 2000; Hong, 2004; Langdon, 2004; Hong, 2006) to classify the results of an expression analysis. One advantage of GP is that it performs the classification while selecting the relevant genes (Muni, 2006). The training set, made up of expression data from patients and controls, is the input to the GP algorithm, which evaluates whether or not each example is a control. The result is one classification rule or a set of them. A further advantage of GP over other techniques such as SVM is that it is transparent: the mechanism used to classify the patients' examples can be inspected (Driscoll, 2003).

Whereas the above studies all classify by means of supervised learning, the following paragraph presents various expression analysis methods for clustering that use non-supervised learning. This type of analysis is very useful to discover groups of genes that are potentially related to or associated with the illness. A comparison between the most commonly applied methods, using both real and simulated data, can be found in the works of Thalamuthu (2006), Handl (2005), and Sheng (2005). Even though these methods have provided good results in certain cases (Spellman, 1998; Tamayo, 1999; Mavroudi, 2002), some of their inherent problems, such as the identification of the number of clusters, the clustering of the “outliers”, and the complexity associated with the large amount of data being analysed, often complicate their use for expression analysis (Sherlock, 2001). These deficiencies were tackled in a series of second-generation clustering algorithms, among them the self-organising trees (Herrero, 2001; Hsu, 2003). Another interesting approach to expression analysis is the use of the artificial immune system, which can be observed in the work of Ando (2003), who applies immune recognition to classification by making the system select the most significant genes and optimize their weights in order to obtain classification rules. Finally, de Sousa, de Castro, and Bezerra apply this technique to clustering (de Sousa, 2004)(de Castro, 2001)(Bezerra, 2003).

Sequence Alignment
Solutions based on genetic algorithms, such as SAGA (Notredame, 1996), RAGA, PRAGA (Notredame, 1997, 2002), and others (O’Sullivan, 2004; Nguyen, 2002; Yokohama, 2001), have been applied to sequence alignment since the very beginning. The most common method consists in encoding the alignments as individuals of the genetic algorithm. There are also hybrid solutions that use not only GAs but also dynamic programming (Zhang, 1997, 1998); and finally, there is the application of artificial life algorithms, in particular the ant colony algorithm (Chen, 2006; Moss, 2003).
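To illustrate what encoding an alignment as an individual can look like, the sketch below represents an individual as a list of gapped sequences of equal length and evaluates it with a sum-of-pairs score, which a genetic algorithm could use as its fitness. The representation, the scoring values, and the function name are assumptions made for the example; the cited systems use more elaborate objective functions and operators.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Fitness of a candidate alignment (list of equal-length gapped rows):
    the column-wise score summed over every pair of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    score = 0
    for row1, row2 in combinations(alignment, 2):
        for x, y in zip(row1, row2):
            if x == "-" and y == "-":
                continue  # a gap aligned with a gap carries no information
            if x == "-" or y == "-":
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score


# Two individuals encoding different alignments of the same three sequences.
good = ["ACGT-", "ACGTA", "AC-TA"]
bad = ["-ACGT", "ACGTA", "AC-TA"]
print(sum_of_pairs(good), sum_of_pairs(bad))  # prints 3 -11
```

Mutation and cross-over operators then only need to move, insert, or remove gap characters while keeping the underlying sequences unchanged.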

Genetic Networks
In order to solve the problem of inferring genetic networks, the structure of the regulatory network and the interactions between the participating genes must be predicted. The expression of the genes is regulated by state transitions in which the expression levels of the involved genes are updated simultaneously.

ANNs have been used to model these networks. Examples of such approaches can be found in the works of Krishna, Keedwell, and Narayanan (Keedwell, 2003)(Krishna, 2005). Genetic algorithms (Ando, 2001)(Tominaga, 2001) and hybrid ANN-genetic approaches (Keedwell, 2005) have also been used for the same purpose.

Phylogenetic Trees
Normally, exhaustive search techniques for the creation of phylogenetic trees are computationally infeasible for more than 10 comparisons, because the number of possible solutions increases exponentially with the number of objects in the comparisons. In order to optimize these searches, researchers have used heuristics based on genetic algorithms (Skourikhine, 2000)(Katoh, 2001)(Lemmon, 2002) that allow the reconstruction of the optimal trees with a lower computational load. Other techniques, such as the ant colony algorithm, have also been used to reconstruct phylogenetic trees (Ando, 2002)(Kummorkaew, 2004)(Perretto, 2005).
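The growth of the search space can be made concrete with a standard counting result: assuming unrooted, strictly bifurcating trees with labelled leaves (an assumption not stated in the article), the number of distinct topologies for n sequences is the product 3 × 5 × ... × (2n − 5). A short sketch:

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating tree topologies for n labelled leaves:
    the product of the odd numbers 3, 5, ..., 2n - 5 (1 when n == 3)."""
    if n < 3:
        raise ValueError("at least three leaves are required")
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
# For n = 10 the count is already 2027025.
```

Each additional sequence multiplies the count by a larger factor than the previous one, which is why exhaustive enumeration stops being practical after a handful of sequences.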

Gene Finding and Mapping
Gene mapping has been approached by methods that use only genetic algorithms (Fickett, 1996)(Murao, 2002) as well as by hybrid methods that combine genetic algorithms and statistical techniques (Gaspin, 1997). The problem of gene finding, and in particular promoter finding, has been approached by means of neural networks (Liu, 2006), neural networks optimized with genetic algorithms (Knudsen, 1999), conventional genetic algorithms (Kel, 1998)(Levitsky, 2003), and fuzzy genetic algorithms (Jacob, 2005).

Structure Prediction
The three-dimensional structure of DNA was predicted with genetic algorithms (Beckers, 1997) by encoding the torsional angles between the atoms of the DNA molecule as solutions of the genetic algorithm. Another approach was the development of hybrid strategies of ANNs and GAs (Parbhane, 2000), in which the network approximates the non-linear relations between the inputs and outputs of the data set, and the genetic algorithm searches the input space of the network to optimize the output. In order to predict the secondary structure of RNA, the system calculates the minimum free energy of the structure over all the different combinations of hydrogen bonds. There are approaches that use genetic algorithms (Shapiro, 2001)(Wiese, 2003) and artificial neural networks (Steeg, 1997). Artificial neural networks have been applied to the prediction of protein structures (Qian, 1988)(Sasagawa, 1992), and so have genetic algorithms. A compilation of the applications of evolutionary computation to protein structure prediction can be found in (Schulze-Kremer, 2000). Swarm intelligence, and ant colony optimization in particular, has been applied to structure prediction (Shmygelska, 2005)(Chu, 2005), and so has the artificial immune system (Nicosia, 2004)(Cutello, 2007).

CONCLUSION

This article presents a compendium of the most recent references on the application of bio-inspired solutions, such as evolutionary computation, artificial neural networks, swarm intelligence, and artificial immune systems, to the most common problems in bioinformatics.

ACKNOWLEDGMENT This work was partially supported by the Spanish Ministry of Education and Culture (Ref TIN200613274) and the European Regional Development Funds (ERDF), grant (Ref. PIO61524) funded by the Carlos III Health Institute, grant (Ref. PGIDIT 05 SIN 10501PR) from the General Directorate of Research of the Xunta de Galicia and grants (File 2006/60, 2007/127 and 2007/144) from the General Directorate of Scientific and Technologic Promotion of the Galician University System of the Xunta de Galicia.

REFERENCES

Ando S and Iba H. (2001). Inference of gene regulatory model by genetic algorithms. Proceedings of Congress of Evolutionary Computation. 1:712-719.
Ando S and Iba H. (2002). Ant algorithms for construction of evolutionary tree. Proceedings of Congress of Evolutionary Computation (CEC 2002).
Ando S and Iba H. (2003). Artificial Immune System for Classification of Gene Expression Data. Genetic and Evolutionary Computation (GECCO 2003). LNCS 2724/2003 pp 205. Springer Berlin.
Baldi P and Brunak S. (2001). Bioinformatics: The Machine Learning Approach. MIT Press.
Beckers ML, Muydens LM, Pikkermaat JA and Altona C. (1997). Applications of a genetic algorithm in the conformational analysis of methylene acetal-linked thymine dimers in DNA: Comparison with distance geometry calculations. Journal of Biomolecular NMR 9(1):25-34.
Bezerra GB and De Castro LN. (2003). Bioinformatics data analysis using an artificial immune network. In International Conference on Artificial Immune Systems 2003. LNCS 2787/2003 pp. 22-33. Springer Berlin.
Chen Y, Pan Y, Chen L and Chen J. (2006). Partitioned Optimization Algorithms for multiple sequence alignment. Second IEEE Workshop on High Performance Computing in Medicine and Biology (HiPCoMB2006).
Chu D, Till M and Zomaya A. (2006). Parallel ant colony optimization for 3D Protein Structure Prediction using HP Lattice Model. 19th Congress on Evolutionary Computation (CEC 2006).
Cutello V, Nicosia G, Pavone M and Timmis J. (2007). An immune algorithm for protein structure prediction on Lattice Models. IEEE Transactions on Evolutionary Computation 11(1):101-117.
Das S, Abraham A and Konar A. (2007). Swarm Intelligence Algorithms in Bioinformatics. Computational Intelligence in Bioinformatics. Arpad, Keleman et al., editors. Springer Verlag Berlin.
De Smet F, Mathys J, Marchal K. (2002). Adaptive quality based clustering of gene expression profiles. Bioinformatics 20(5):660-667.
De Sousa JS, Gomes L, Bezerra GB, de Castro LN and Von Zuben FJ. (2004). An immune evolutionary algorithm for multiple rearrangements of gene expression data. Genetic Programming and Evolvable Machines. Vol 5 pp. 157-179.
De Castro LN and Von Zuben FJ. (2001). aiNet: An artificial Immune Network for Data Analysis. Data Mining: A Heuristic Approach. Idea Group Publishing.
Driscol JA, Worzel B and MacLean D. (2003). Classification of gene expression data with genetic programming. Genetic Programming Theory and Practice. Kluwer Academic Publishers pp 25-42.
Fickett J and Cinkosky M. (1993). A genetic algorithm for assembling chromosome physical maps. Proceedings 2nd international conference in Bioinformatics, Supercomputing and Complex Genome Analysis 2:272-285.
Gaspin C and Schiex T. (1997). Genetic Algorithms for genetic mapping. Proceedings of 3rd European Conference in Artificial Evolution pp. 145-156.
Gilbert RJ, Rowland JJ and Kell DB. (2000). Genomic computing: Explanatory modelling for functional genomics. Proceedings of the Genetic and Evolutionary Computation conference (GECCO 2000). Morgan Kaufmann pp 551-557.
Handl J, Knowles J and Kell D. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15):3201-3212.
Herrero J, Valencia A and Dopazo J. (2001). A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 17(2):126-162.
Hong JH and Cho SB. (2004). Lymphoma cancer classification using genetic programming with SNR features. Proceedings of EuroGP 2004. Coimbra pp. 78-88.
Hong JH and Cho SB. (2006). The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming. Artificial Intelligence in Medicine 36(1):43-58.
Hsu AL, Tang S and Halgamuge SK. (2003). An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data. Bioinformatics 19(16):2131-2140.
Jacob E, Sasikumar and Fair KNR. (2005). A fuzzy guided genetic algorithm for operon prediction. Bioinformatics 21(8):1403-1410.
Katoh K, Kuma K and Miyata T. (2005). Genetic algorithm-based maximum likelihood analysis for molecular phylogeny. Journal of Molecular Biology. 53(4-5):477-484.
Keedwell E and Narayanan A. (2005). Intelligent Bioinformatics. Wiley.

Keedwell E and Narayanan A. (2005). Discovering gene regulatory networks with a neural genetic hybrid. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2(3):231-243.
Kel A, Ptitsyn A, Babenko V, Meier-Ewert S and Lehrach H. (1998). A genetic algorithm for designing gene family-specific oligonucleotide sets used for hybridization: The G protein-coupled receptor protein superfamily. Bioinformatics 14(3):259-270.
Kim KJ and Cho SB. (2004). Prediction of colon cancer using an evolutionary neural network. Neurocomputing 61:361-79.
Korf I, Yendel M and Bedell J. (2003). Blast. O’Reilly.
Knudsen S. (1999). Promoter 2.0: for the recognition of Pol II promoter sequences. Bioinformatics 15(5):356-417.
Krishna A, Narayanan A and Keedwell EC. (2005). Neural networks and temporal gene expression data. Applications of Evolutionary Computing (EVOBIO05) LNCS 3449 Springer Verlag.
Kummorkaew M, Ku K and Ruenglertpanyakul P. (2004). Application of ant colony optimization to evolutionary tree construction. Proceedings of 15th Annual Meeting of the Thai Society for Biotechnology. Thailand.
Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano J, Armañazas R, Santafé G, Pérez A and Robles V. (2006). Machine Learning in Bioinformatics. Briefings in Bioinformatics 7(1):86-112.
Langdon W and Buxton B. (2004). Genetic programming for mining DNA chip data from cancer patients. Genetic Programming and Evolvable Machines. 5(3).
Lemmon AR and Milinkovitch MC. (2002). The metapopulation genetic algorithm: An efficient solution for the problem of large phylogeny estimation. Proceedings of the National Academy of Sciences. 99(16):10516-10521.
Lee JW, Lee JB, Park M and Song SH. (2005). An extensive comparison of recent classification tools applied to microarray data. Journal of Computational Statistics and Data Analysis. 48(4):869-885.
Levitsky VG, Katokhin AV. (2003). Recognition of eukaryotic promoters using genetic algorithm based on interactive discriminant analysis. In Silico Biology. 3(1-2):81-87.
Liu DR, Xiong X, DasGupta B, Zhang HG. (2006). Motif discoveries in unaligned molecular sequences using self-organizing neural networks. IEEE Transactions on Neural Networks 17(4):919-928.
Mavroudi S, Papadimitriou S and Bezerianos A. (2002). Gene expression data analysis with a dynamically extended self-organized map that exploits class information. Bioinformatics 18(11):1446-1453.
Moss J and Johnson C. (2003). An ant colony algorithm for multiple sequence alignment in bioinformatics. Artificial Neural Networks and Genetic Algorithms, pp 182-186. Springer.
Muni DP, Pal NR and Das J. (2006). Genetic programming for simultaneous feature selection and classifier design. System, Man and Cybernetics 36(1):106-117.
Murao H, Tamaki H and Kitamura S. (2002). A coevolutionary approach to adapt the genotype-phenotype map in genetic algorithms. Proceedings of Congress of Evolutionary Computation 2:1612-1617.
Narayanan A, Keedwell E, Tatineni SS. (2004). Single-layer artificial neural networks for gene expression analysis. Neurocomputing 61:217-240.
Nguyen H, Yoshihara I, Yamamori K and Yusanaga M. A parallel hybrid genetic algorithm for multiple protein sequence alignment. Congress of Evolutionary Computation 1:309-314.
Nicosia G. (2004). Immune Algorithms for Optimization and Protein Structure Prediction. PhD Thesis. Department of Mathematics and Computer Science. University of Catania, Italy.
Notredame C and Higgins D. (1996). SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Research. 24(8):1515-1524.
Notredame C, O’Brien EA and Higgins DG. (1997). RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Research 25(22):4570-4580.
Notredame C. (2002). Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics 31(1):131-144.

O’Sullivan O, Suhre K, Abergel C, Higgins D and Notredame C. (2004). 3DCoffee: Combining protein sequences and structures within multiple sequence alignments. Journal of Molecular Biology 340(2):385-395.
Pal S, Bandyopadhyay S and Ray S. (2006). Evolutionary Computation in Bioinformatics: A Review. IEEE Transactions on System, Man and Cybernetics 36(5):601-615.
Parbhane R, Unniraman S, Tambe S, Nagaraja V and Kulkarni B. (2000). Optimum DNA curvature using a hybrid approach involving an artificial neural network and genetic algorithm. Journal of Biomolecular Structural Dynamics 17(4):665-672.
Perretto M and Lopes HS. (2005). Reconstruction of phylogenetic trees using the ant colony optimization paradigm. Genetic and Molecular Research 4(3):581-589.
Prelic A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, Henning L, Thiele L and Zitzler E. (2006). A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22(9):1122-1129.
Qian N, Sejnowski TJ. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology 202:865-884.
Ritchie MD, Coffey CS and Moore JH. (2004). Genetic Programming Neural Networks as a Bioinformatics Tool for Human Genetics. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2004). LNCS 3012 Vol 2 pp. 438-448.
Sasegawa F and Tajima K. (1992). Prediction of protein secondary structures by a neural network. Bioinformatics 9(2):147-152.
Schulze-Kremer S. (2000). Genetic algorithms and protein folding. Methods in Molecular Biology. Protein Structure Prediction: Methods and Protocols 143:175-222.
Shapiro BA, Wu JC, Bengali D and Potts MJ. (2001). The massively parallel genetic algorithm for RNA folding: MIMD implementation and population variation. Bioinformatics 17(2):137-148.
Sheng Q, Moreau Y, De Smert G and Zhang MQ. (2005). Advances in cluster analysis of microarray data. Data Analysis and Visualization in Genomics and Proteomics, John Wiley pp. 153-226.
Sherlock G. (2001). Analysis of large-scale gene expression data. Briefings in Bioinformatics 2(4):350-412.
Shmygelska A and Hoos H. (2005). An ant colony optimization algorithm for the 2D and 3D hydrophobic polar protein folding problem. BMC Bioinformatics 6:30.
Skourikhine A. (2000). Phylogenetic tree reconstruction using self-adaptive genetic algorithm. IEEE International Symposium in Bioinformatics and Biomedical Engineering pp. 129-134.
Spellman PT, Sherlock G, Zhang MQ. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology Cell 9:3271-3378.
Statnikov A, Aliferis CF, Tsamardinos I. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21(5):631-43.
Steeg E. (1997). Neural networks, adaptive optimization and RNA secondary structure prediction. Artificial Intelligence and Molecular Biology. MIT Press.
Tamayo P, Slonim D, Maserov J. (1999). Interpreting patterns on gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences. 96:2907-2929.
Thalamuthu A, Mukhopadhyay I, Zheng X and Tseng G. (2006). Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22(19):2405-2412.
Tominaga D, Okamoto M, Maki Y, Watanabe S and Eguchi Y. (1999). Non-linear numeric optimization technique based on genetic algorithm for inverse problems: Towards the inference of genetic networks. Computational Science and Biology (Proceedings of German Conference of Bioinformatics) pp 127-140.
Wang Z, Wang Y, Xuan J, Dong Y, Bakay M, Feng Y, Clarke R and Hoffman E. (2006). Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data. Bioinformatics 22(6):755-761.

Wei JS. (2004). Prediction of clinical outcome using gene expression profiling and artificial neural networks for patients with neuroblastoma. Cancer Research 65:374.
Wiese KC and Glen E. (2003). A permutation-based genetic algorithm for the RNA folding problem: A critical look at selection strategies, crossover operators, and representation issues. Biosystems 72(1-2):29-41.
Yokohama T, Watanabe T, Teneda A and Shimizu T. (2001). A web server for multiple sequence alignment using genetic algorithm. Genome Informatics. 12:382-283.
Zhang C and Wong AK. (1997). Toward efficient multiple molecular sequence alignments: A system of genetic algorithm and dynamic programming. IEEE Transactions on System, Man and Cybernetics B. 27(6):918-932.
Zhang C and Wong AK. (1998). A technique of genetic algorithm and sequence synthesis for multiple molecular sequence alignment. Proc. of IEEE Transactions on System, Man and Cybernetics. 3:2442-2447.

KEY TERMS

Bioinformatics: The use of applied mathematics, informatics, statistics, and computer science to study biological systems.

Gene Expression: The conversion of information from gene to protein via transcription and translation.

Gene Mapping: Any method used for determining the location of, and the relative distance between, genes on a chromosome.

Gene Regulatory Network: Genes that regulate or circumscribe the activity of other genes; specifically, genes that code for proteins (repressors or activators) that regulate the genetic transcription of the structural genes and/or regulatory genes.

Phylogeny: The evolutionary relationships among organisms. The patterns of lineage branching produced by the true evolutionary history of the organism that is being considered.

Sequence Alignment: The result of comparing two or more gene or protein sequences in order to determine their degree of base or amino acid similarity. Sequence alignments are used to determine the similarity, homology, function, or other degrees of relatedness between two or more genes or gene products.

Structure Prediction: Algorithms that predict the 2D or 3D structure of proteins or DNA molecules from their sequences.

Bioinspired Associative Memories

Roberto A. Vazquez
Center for Computing Research, IPN, Mexico

Humberto Sossa
Center for Computing Research, IPN, Mexico

INTRODUCTION

An associative memory (AM) is a special kind of neural network that recalls an output pattern given an input pattern as a key, where the key may be altered by some kind of noise (additive, subtractive or mixed). Most of these models have several constraints that limit their applicability to complex problems such as face recognition (FR) and 3D object recognition (3DOR). Despite the power of these approaches, they cannot reach their full potential without new mechanisms based on current and future studies of biological neural networks. In this direction, we present a brief summary of a new associative model based on some neurobiological aspects of the human brain. In addition, we describe how this dynamic associative memory (DAM), combined with some aspects of the infant vision system, can be applied to two of the most important problems in pattern recognition: FR and 3DOR.

BACKGROUND

Humans possess several capabilities such as learning, recognition and memorization. In the last 60 years, scientists from different communities have been trying to implement these capabilities in computers. Over these years, several approaches have emerged; one common example is neural networks (McCulloch & Pitts, 1943) (Hebb, 1949) (Rosenblatt, 1958). Since the rebirth of neural networks, several models inspired by neurobiological processes have emerged. Among these models, perhaps the most popular is the feed-forward multilayer perceptron trained with the back-propagation algorithm (Rumelhart & McClelland, 1986). Other neural models are associative memories, for example (Anderson, 1972) (Hopfield, 1982) (Sussner, 2003) (Sossa, Barron & Vazquez, 2004). On the other hand, the brain is not a huge fixed neural network, as had been previously thought, but a dynamic, changing neural network. In this direction, several models have emerged, for example (Grossberg, 1967) (Hopfield, 1982). In most of these classical neural network approaches, synapses are only adjusted during the training phase. After this phase, synapses are no longer adjusted. Modern brain theory uses continuous-time models based on current studies of biological neural networks (Hecht-Nielse, 2003). In this direction, the next section describes a new dynamic model based on some aspects of biological neural networks.

Dynamic Associative Memories (DAMs)

The dynamic associative model is not an iterative model like Hopfield’s model. It emerges as an improvement of the model and results presented in (Sossa, Barron & Vazquez, 2007). Let $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ be an input and an output pattern, respectively. An association between input pattern x and output pattern y is denoted as $(x^k, y^k)$, where k is the index of the association. An associative memory W is represented by a matrix whose components $w_{ij}$ can be seen as the synapses of the neural network. If $x^k = y^k \; \forall k = 1, \ldots, p$, then W is auto-associative; otherwise it is hetero-associative. A distorted version of a pattern x to be recalled is denoted as $\tilde{x}$. If an associative memory W is fed with a distorted version of $x^k$ and the output obtained is exactly $y^k$, we say that recall is robust.

Because several regions of the brain interact in the process of learning and recognition (Laughlin & Sejnowski, 2003), the dynamic model defines several interacting areas; it also integrates the capability to adjust synapses in response to an input stimulus. It is hypothesized that, before the brain processes an input pattern, the pattern is transformed and codified by the brain. This process is simulated using the procedure introduced in (Sossa, Barron & Vazquez, 2004). This procedure computes coded patterns and de-coding patterns from the input and output patterns allocated in the different interacting areas of the model. In addition, a simplified version of $x^k$, denoted by $s_k$, is obtained as:

$$s_k = s(x^k) = \operatorname{mid} x^k \qquad (1)$$

where the mid operator is defined as $\operatorname{mid} x = x_{(n+1)/2}$. When the brain is stimulated by an input pattern, some regions of the brain (interacting areas) are stimulated and the synapses belonging to these regions are modified. In this model, the most excited interacting area is called the active region (AR) and can be estimated as follows:

$$ar = r(x) = \arg\left( \min_{i=1}^{p} \left| s(x) - s_i \right| \right) \qquad (2)$$
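Equations 1 and 2 can be sketched in a few lines of Python. This is only an illustrative reading: the function names are hypothetical, indices are 0-based, equation 2 is read as the index of the stored simplified value closest to s(x), and floor division is assumed when the pattern length is even (the definition of mid only fixes the odd case).

```python
def mid(x):
    """Equation 1: the middle component x_{(n+1)/2} (1-based) of pattern x.

    Floor division for even n is an assumption; the text only covers odd n.
    """
    return x[(len(x) + 1) // 2 - 1]


def active_region(x, simplified):
    """Equation 2: index i minimising |s(x) - s_i| over the stored values s_i."""
    s_x = mid(x)
    return min(range(len(simplified)), key=lambda i: abs(s_x - simplified[i]))


# Three stored input patterns and their simplified versions s_k.
patterns = [[1, 2, 3], [10, 20, 30], [5, 50, 5]]
simplified = [mid(p) for p in patterns]      # [2, 20, 50]

# A distorted key whose middle value is closest to that of the second pattern.
key = [9, 18, 33]
ar = active_region(key, simplified)          # 0-based index 1
```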

Once the coded patterns, the de-coding patterns and $s_k$ have been computed, we can build the associative memory.

Let $\{(x^k, y^k) \mid k = 1, \ldots, p\}$, $x^k \in \mathbb{R}^n$, $y^k \in \mathbb{R}^m$, be a fundamental set of associations (coded patterns). Synapses of associative memory W are defined as:

$$w_{ij} = y_i - x_j \qquad (3)$$

In short, building of the associative memory can be performed in three stages:

1. Transform the fundamental set of associations into coded and de-coding patterns.
2. Compute simplified versions of input patterns by using equation 1.
3. Build W in terms of coded patterns by using equation 3.

There are synapses that can be drastically modified without altering the behavior of the associative memory. On the contrary, there are synapses that can only be slightly modified if the behavior of the associative memory is not to be altered; we call this set of synapses the kernel of the associative memory and denote it by $K_W$.

Let $K_W \in \mathbb{R}^n$ be the kernel of an associative memory W. A component of vector $K_W$ is defined as:

$$kw_i = \operatorname{mid}(w_{ij}), \quad j = 1, \ldots, m \qquad (4)$$

Synapses that belong to $K_W$ are modified as a response to an input stimulus. Input patterns stimulate some ARs, interact with these regions and then, according to those interactions, the corresponding synapses are modified. An adjusting factor denoted by $\Delta w$ can be computed as:

$$\Delta w = \Delta(x) = s(x^{ar}) - s(x) \qquad (5)$$

where ar is the index of the AR. Finally, synapses belonging to $K_W$ are modified as:

$$K_W = K_W \oplus (\Delta w - \Delta w_{old}) \qquad (6)$$

where the operator $\oplus$ is defined as $x \oplus e = x_i + e \; \forall i = 1, \ldots, m$. Once the synapses of the associative memory have been modified in response to an input pattern, every component of vector y can be recalled by using its corresponding input vector x as:

$$y_i = \operatorname{mid}(w_{ij} + x_j), \quad j = 1, \ldots, n \qquad (7)$$

In short, pattern y can be recalled by using its corresponding key vector x or its distorted version $\tilde{x}$ in six stages:

1. Obtain the index of the active region ar by using equation 2.
2. Transform $\tilde{x}^k$ using de-coding pattern $\hat{x}^{ar}$ by applying the following transformation: $\breve{x}^k = \tilde{x}^k + \hat{x}^{ar}$.
3. Compute the adjusting factor $\Delta w = \Delta(\breve{x}^k)$ by using equation 5.
4. Modify the synapses of associative memory W that belong to $K_W$ by using equation 6.
5. Recall pattern $\breve{y}^k$ by using equation 7.
6. Obtain $y^k$ by transforming $\breve{y}^k$ using de-coding pattern $\hat{y}^{ar}$ by applying the transformation: $y^k = \breve{y}^k - \hat{y}^{ar}$.


The formal set of propositions that support the correct functioning of this dynamic model, its main advantages over other classical models and some interesting applications of the model are described in (Vazquez, Sossa & Garro, 2006) and (Vazquez & Sossa, 2007). In general, we distinguish two main parts in this model: a part concerning the determination of the AR (PAR) and a part concerning pattern recall (PPR). PAR (the first step of the recall procedure) sends a signal to PPR (the remaining steps of the recall procedure) that indicates the region activated by the input pattern.
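The core of the building and recall procedures (equations 3 to 7) can be illustrated with a minimal Python sketch for a single association. Several simplifications are assumptions made here for illustration and are not part of the model description: the coding and de-coding transformations of (Sossa, Barron & Vazquez, 2004) are omitted, the active region is taken as known, the kernel of equation 4 is read as the middle column of W, and all function names are hypothetical.

```python
def mid(v):
    """Middle component x_{(n+1)/2} (1-based); floor division assumed for even n."""
    return v[(len(v) + 1) // 2 - 1]


def build_memory(x, y):
    """Equation 3: w_ij = y_i - x_j (len(y) rows, len(x) columns)."""
    return [[yi - xj for xj in x] for yi in y]


def adjust_kernel(W, delta_w, delta_w_old=0):
    """Equation 6, with the kernel of equation 4 read as the middle column of W."""
    c = (len(W[0]) + 1) // 2 - 1
    for row in W:
        row[c] += delta_w - delta_w_old


def recall(W, x):
    """Equation 7: y_i = mid_j(w_ij + x_j)."""
    return [mid([w + xj for w, xj in zip(row, x)]) for row in W]


x = [3, 5, 8, 2, 7]          # stored input pattern
y = [10, 20, 30]             # associated output pattern
x_noisy = [3, 5, 6, 2, 7]    # key whose middle component was altered

W = build_memory(x, y)
exact = recall(W, x)                 # undistorted key
before = recall(W, x_noisy)          # distorted key, synapses not yet adjusted

delta_w = mid(x) - mid(x_noisy)      # equation 5 with x^{ar} = x
adjust_kernel(W, delta_w)
after = recall(W, x_noisy)           # distorted key, kernel synapses adjusted
```

In this toy case the unadjusted memory returns an output shifted by the distortion of the middle component, while the adjusted kernel compensates for it and the stored output is recovered.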

FACE AND 3D OBJECT RECOGNITION USING SOME ASPECTS OF THE INFANT VISION SYSTEM AND DAMS

Several computationally expensive statistical techniques (dimension reduction techniques), such as principal component analysis (PCA) and factor analysis, have been proposed for solving the FR and 3DOR problems. Instead of using the complete version of the describing pattern X of any face or object, a simplified version of describing pattern X can be used to recognize a face or an object. Many authors have used PCA to perform FR and other tasks; see for example (Turk & Pentland, 1991).

During early developmental stages, there are communication pathways between the visual and other sensory areas of the cortex, showing how the biological network is self-organizing. Within a few months of birth, the baby is able to differentiate one face or object (toy) from others. Barlow hypothesized that, for a neural system, one possible way of capturing the statistical structure was to remove the redundancy in the sensory outputs (Barlow, 2001). Taking the theory of Barlow into account, we propose a novel method for FR and 3DOR based on some biological aspects of infant vision. The biological hypotheses of this proposal are based on the role of the response to low frequencies at early stages, and on some conjectures concerning how an infant detects subtle features (stimulating points (SP)) in a face or object (Mondloch et al., 1999; Acerra, Burnod, & Schonen, 2002).

The proposal consists of several DAMs used to recognize different images of faces and objects. As infant vision responds to low frequencies of the signal, a low-pass filter is first used to remove high-frequency components from the image. After that, we divide the image into different parts (sub-patterns). Then, over each sub-pattern, we detect subtle features by means of a random selection of SPs. The image preprocessing used to remove high frequencies and the random selection of SPs contribute to eliminating redundant information and help the DAMs to learn the faces or the objects efficiently. Finally, each DAM is fed with these sub-patterns for training and recognition.

Response to Low Frequencies

Instead of using a filter that exactly simulates the behavior of the infant vision system at any stage, we use a low-pass filter to remove high frequencies. This kind of filter can be seen as a rough approximation of the infant vision system because it eliminates high-frequency components from the pattern; see Figure 1.
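As an illustration of this preprocessing step, the sketch below smooths a grayscale image with a square averaging mask. The text does not specify which low-pass filter is used, so the averaging (box) mask, the border handling and the function name are all assumptions chosen for simplicity.

```python
def box_filter(image, size):
    """Smooth a 2D grayscale image (list of rows) with a size x size averaging mask.

    size is assumed to be odd. Positions outside the image are ignored,
    so border pixels are averaged over fewer values.
    """
    half = size // 2
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


# A single bright pixel is spread over its neighbourhood by a 3 x 3 mask.
image = [[0, 0, 0],
         [0, 9, 0],
         [0, 0, 0]]
smoothed = box_filter(image, 3)
```

Larger masks remove more high-frequency detail, which corresponds to the different mask sizes mentioned in the caption of Figure 1.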

Random Selection

In the DAM model, the simplified version of an input pattern is the middle value of the input pattern. In order to simulate the random selection of the infant vision system, we have substituted the mid operator with the rand operator, defined as follows:

$$\operatorname{rand} x = x_{sp} \qquad (8)$$

where sp = random(n) is a random number between zero and the length of the input pattern. sp is a constant value computed at the beginning of the building phase and represents a SP. During the recalling phase sp takes the same value. The rand operator uses a uniform random generator to select a component over each part of the pattern. We adopt this operator based on the hypothesis that infants are interested in sets of features, where each set is different but has some intersection with the others. By selecting features at random, we conjecture that we select at least one feature belonging to these sets.
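A minimal sketch of the rand operator of equation 8 follows. The names are hypothetical, and the seeded generator is only there to make the example reproducible; the key point taken from the text is that sp is drawn once during building and reused unchanged during recall.

```python
import random


def choose_stimulating_point(n, rng):
    """Draw one index sp in [0, n) once, at the beginning of the building phase."""
    return rng.randrange(n)


def rand_op(x, sp):
    """Equation 8: the simplified version of x is its component at position sp."""
    return x[sp]


rng = random.Random(0)                       # seeded only for reproducibility
patterns = [[4, 8, 15, 16, 23], [42, 7, 1, 9, 3]]

# sp is fixed once and then applied to every pattern, in building and in recall.
sp = choose_stimulating_point(len(patterns[0]), rng)
simplified = [rand_op(x, sp) for x in patterns]
```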

Figure 1. Images filtered with masks of different size. Each group could be associated with different stages of infant vision system.

Implementation of the Proposal

During recalling, each DAM recovers a part of the image based on the AR of each DAM. However, a part of the image could be wrongly recalled because its corresponding AR could be wrongly determined, since some patterns do not satisfy the propositions that guarantee perfect recall. To avoid this, we use an integrator. Each DAM determines an AR and sends the index of that AR to the integrator; the integrator determines which region received the most votes and sends the index of the most voted region (the new AR) back to the DAMs. Let $[I_x^k]_{a \times b}$ and $[I_y^k]_{c \times d}$ be an association of images and r be the number of DAMs. Building of the nDAMs is done as follows:

1. Select filter size and apply it to the images.
2. Transform the images into vectors $(x^k, y^k)$ by means of the standard image scan method, where the vectors are of size a × b and c × d, respectively.
3. Decompose $x^k$ and $y^k$ into r sub-patterns of the same size.
4. Take each sub-pattern (from the first one to the last one (r)), then take at random a SP $sp_i$, $i = 1, \ldots, r$, and extract the value at that position.
5. Train r DAMs as in the building procedure, taking each sub-pattern (from the first one to the last one (r)) and using the rand operator.

Pattern $I_y^k$ can be recalled by using its corresponding key image $I_x^k$ or its distorted version $\tilde{I}_x^k$ as follows:

1. Select filter size and apply it to the images.
2. Transform the images into a vector by means of the standard image scan method and decompose $x^k$ into r sub-patterns of the same size.
3. Use the SPs $sp_i$, $i = 1, \ldots, r$, computed during the building phase and extract the value of each sub-pattern.
4. Determine the most voted active region using the integrator.
5. Substitute mid with the rand operator in the recalling procedure and apply steps two to six of the recalling procedure on each DAM.
6. Finally, put together the recalled sub-patterns to form the output pattern.

A schematic representation of the building and recalling phases is shown in Figure 2.
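The decomposition into sub-patterns and the voting performed by the integrator can be sketched as follows. This is a simplified illustration under stated assumptions: each DAM is reduced to its active-region estimate (closest stored value at its stimulating point), the stimulating points are fixed by hand instead of being drawn at random, and all names are hypothetical.

```python
from collections import Counter


def split_into_subpatterns(x, r):
    """Decompose vector x into r contiguous sub-patterns of the same size."""
    if len(x) % r != 0:
        raise ValueError("pattern length must be divisible by r")
    size = len(x) // r
    return [x[i * size:(i + 1) * size] for i in range(r)]


def local_active_region(sub, sp, stored_values):
    """Active region seen by one DAM: the stored value closest to sub[sp]."""
    return min(range(len(stored_values)),
               key=lambda k: abs(sub[sp] - stored_values[k]))


def integrate(votes):
    """Integrator: return the most voted active-region index."""
    return Counter(votes).most_common(1)[0][0]


r = 3
sps = [0, 1, 0]                              # one stimulating point per DAM
stored = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]]
stored_subs = [split_into_subpatterns(x, r) for x in stored]

# Value kept by DAM i for each stored pattern k (building phase).
stored_values = [[stored_subs[k][i][sps[i]] for k in range(len(stored))]
                 for i in range(r)]

# Key: first stored pattern with one corrupted component (4 replaced by 45).
key_subs = split_into_subpatterns([1, 2, 3, 45, 5, 6], r)
votes = [local_active_region(key_subs[i], sps[i], stored_values[i])
         for i in range(r)]
new_ar = integrate(votes)
```

Here the second DAM is misled by the corrupted component and votes for the wrong region, but the majority vote restores the correct active region for all DAMs.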

Figure 2. (a) Schematic representation of building phase. (b) Schematic representation of the recalling phase.

Some Experimental Results

To test the accuracy of the proposal, we performed two experiments. In experiment 1, we used a benchmark (Spacek, 1996) of faces of 15 different people. In experiment 2, we used a benchmark (Nene, 1996) of 100 objects. During the training process in both experiments, the DAM performed with 100% accuracy using only one image of each person and object. During testing, the DAM performed on average with 99% accuracy for the remaining 285 images of faces (experiment 1) and 95% accuracy for the remaining 1900 images of objects (experiment 2), using filters of different sizes and different SPs. Through several experiments we have tested the accuracy and stability of the proposal using different numbers of stimulating points; see Figure 3 and Figure 4. Because the SPs (pixels) were randomly selected, we decided to test the stability of the proposal by running the same configuration 20 times. An extra experiment was performed with partially occluded images. On average, the accuracy of the proposal diminished to 80%.

While PCA dimension reduction techniques require the covariance matrix to build an eigenspace and then project patterns onto this space to eliminate redundant information, our proposal only requires removing high frequencies by using a filter and a random selection of stimulating points. This approach contributes to eliminating redundant information; it is less computationally expensive than PCA, and it helps the DAMs or other classification tools to learn the faces or objects efficiently.

FUTURE TRENDS

The image preprocessing used to remove high frequencies and the random selection of SPs contribute to eliminating unnecessary information and help the DAM to learn faces and objects efficiently. We now need to study new mechanisms based on evolutionary techniques in order to select the most important SPs. In addition, we need to test different types of filters that more faithfully simulate the behavior of the infant vision system. In the near future, we intend to use this proposal as a biological model to explain the learning process in the infant brain for FR and 3DOR. One step in this direction can be found in (Vazquez & Sossa, 2007).

CONCLUSION In this paper, we have proposed a novel method for FR and 3DOR based on some biological aspects of infant vision. We have shown that by applying some aspects of the infant vision system it is possible to enhance the performance of an associative memory (or other


Figure 3. Accuracy of the proposal using different filter size. The reader can verify the accuracy of the proposal diminish after apply a filter of size greater than 25.

Figure 4. Average accuracy of the proposal. Maximum, average and minimum accuracy are sketched.

distance classifiers) and make possible its application to complex problems such as FR and 3DOR. In order to recognize different images of face or objects we have used several DAMs. As the infant vision responds to low frequencies of the signal, a low-filter is first used to remove high frequency components from the image. Then we detected subtle features in the image by means of a random selection of SPs. At last, each DAM was fed with this information for training and recognition.

Through several experiments, we have shown the accuracy and the stability of the proposal even under occlusions. On average, the accuracy of the proposal oscillates between 95% and 99%. The results obtained with the proposal were comparable with those obtained by means of a PCA-based method (99%). Although PCA is a powerful technique, it consumes a lot of time to reduce the dimensionality of the data. Our proposal, because of the simplicity of its operations, is not a computationally expensive technique, and the results obtained are comparable to those provided by PCA.
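The preprocessing summarised in this conclusion (low-pass filtering followed by a random selection of stimulating points) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the average filter, the function names, the fixed random seed and the toy 8 x 8 image are all hypothetical choices, not the authors' implementation, and the DAM itself is not modelled.

```python
import random


def average_filter(image, size):
    """Smooth a 2-D grayscale image (list of rows) with a size x size mean filter."""
    height, width = len(image), len(image[0])
    radius = size // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Average over the window, clipped at the image borders.
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            out[y][x] = sum(window) / len(window)
    return out


def select_stimulating_points(height, width, count, seed=0):
    """Pick `count` distinct pixel coordinates at random (assumed to be reused for every image)."""
    rng = random.Random(seed)
    coords = [(y, x) for y in range(height) for x in range(width)]
    return rng.sample(coords, count)


def extract_pattern(image, points, filter_size):
    """Low-pass filter the image, then read the pixel values at the stimulating points."""
    smooth = average_filter(image, filter_size)
    return [smooth[y][x] for (y, x) in points]


# Toy usage: an 8 x 8 synthetic image reduced to a 10-component pattern.
image = [[(x * y) % 256 for x in range(8)] for y in range(8)]
points = select_stimulating_points(8, 8, count=10)
pattern = extract_pattern(image, points, filter_size=3)
```

The resulting `pattern` vector is what would be handed to a classifier; it has as many components as stimulating points, regardless of the image size.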

REFERENCES

Acerra, F., Burnod, Y. & Schonen, S. (2002). Modelling aspects of face processing in early infancy. Developmental Science, 5(1), 98-117.

Anderson, J. A. (1972). A simple neural network generating an interactive memory. Mathematical Biosciences, 14(3-4), 197-220.

Barlow, H. B. (2001). Redundancy reduction revisited. Network: Computation in Neural Systems, 12, 241-253.

Grossberg, S. (1967). Nonlinear difference-differential equations in prediction and learning theory. Proceedings of the National Academy of Sciences, 58(4), 1329-1334.

Hebb, D. O. (1949). The Organization of Behavior. New York: Wiley.

Hecht-Nielsen, et al. (2003). A theory of the thalamocortex. Computational models for neuroscience, pp. 85-124, Springer-Verlag, London.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), 2554-2558.

Laughlin, S. B. & Sejnowski, T. J. (2003). Communication in neuronal networks. Science, 301(5641), 1870-1874.

McCulloch, W. S. & Pitts, W. H. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(1-2), 115-133.

Mondloch, C. J. et al. (1999). Face perception during early infancy. Psychological Science, 10(5), 419-422.

Nene, S. A. et al. (1996). Columbia Object Image Library (COIL 100). Technical Report No. CUCS-006-96. Department of Computer Science, Columbia University.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.

Rumelhart, D. & McClelland, J. (1986). Parallel distributed processing group. MIT Press.

Sossa, H., Barron, R. & Vazquez, R. A. (2004). Transforming fundamental set of patterns to a canonical form to improve pattern recall. Lecture Notes in Artificial Intelligence, 3315, 687-696.

Sossa, H., Barron, R. & Vazquez, R. A. (2007). Study of the influence of noise in the values of a median associative memory. Lecture Notes in Computer Sciences, 4432, 55-62.

Spacek, L. (1996). Collection of facial images: Grimace. Available from http://cswww.essex.ac.uk/mv/allfaces/grimace.html

Sussner, P. (2003). Generalizing operations of binary auto-associative morphological memories using fuzzy set theory. Journal of Mathematical Imaging and Vision, 19(2), 81-93.

Turk, M. & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71-86.

Vazquez, R. A., Sossa, H. & Garro, B. A. (2006). A new bi-directional associative memory. Lecture Notes in Artificial Intelligence, 4293, 367-380.

Vazquez, R. A. & Sossa, H. (2007). A computational approach for modeling the infant vision system in object and face recognition. BMC Neuroscience, 8(Suppl 2), P204.

KEY TERMS

Associative Memory: A mathematical device specially designed to recall output patterns from input patterns that might be altered by noise.

Dynamic Associative Memory: A special type of associative memory composed of dynamical synapses. This memory adjusts the values of its synapses during the recall phase in response to input stimuli.

Dynamical Synapses: Synapses that modify their values in response to an input stimulus, including during the recall phase.

Low-Pass Filter: A filter that removes high frequencies from an image or signal. This type of filter is used to simulate the infant vision system at early stages. Examples of these filters are the average filter and the median filter.

PCA: Principal component analysis is a technique used to reduce multidimensional data sets to lower dimensions for analysis. PCA involves computing the eigenvalue decomposition of the covariance matrix of a data set, usually after mean centering the data for each attribute.

Random Selection: Selection of one or more components of a vector at random. Random selection techniques are used to reduce multidimensional data sets to lower dimensions for analysis.

Stimulating Points: Characteristic points of an object in an image used during learning and recognition, which capture the attention of a child. These stimulating points are used to train the dynamic associative memory.


Bio-Inspired Dynamical Tools for Analyzing Cognition Manuel G. Bedia University of Zaragoza, Spain Juan M. Corchado University of Salamanca, Spain Luis F. Castillo National University, Colombia

INTRODUCTION Knowledge about higher brain centres in insects and how they affect the insect’s behaviour has increased significantly in recent years through theoretical and experimental investigations. A large body of evidence now suggests that higher brain centres of insects are important for learning and for short-term and long-term memory, and play an important role in context generalisation (Bazhenof et al., 2001). In this regard, one of the most interesting goals would be to understand the relationship between sequential memory encoding processes and the higher brain centres in insects, in order to develop a general “insect-brain” control architecture to be implemented on simple robots. This contribution reviews the most important recent results related to spatio-temporal coding and suggests using continuous recurrent neural networks (CRNNs), which can model non-linear systems and in particular Lotka-Volterra systems, as a way to model simple cognitive systems from an abstract viewpoint. After showing the typical and interesting behaviors that emerge in appropriate Lotka-Volterra systems (in particular, winnerless competition processes), the following sections briefly discuss intelligent systems inspired by studies from biology.

BACKGROUND What do we call “computation”? Let us say that a system shows the capability to compute if it has memory (or some form of internal plasticity) and is able to determine the appropriate decision (or behavior, or action), given a criterion, by making calculations using what it senses from the outside world. Some biological systems, such as several insects, have brains that show a type of computation that may be described functionally by a specific type of non-linear dynamical system called a Lotka-Volterra system (Rabinovich et al., 2000). Given our objectives, a first point of interest is how an artificial recurrent neural network could model a non-linear system, in particular a Lotka-Volterra system (Afraimovich et al., 2004), and what typical processes emerge in Lotka-Volterra systems (Rabinovich et al., 2000). If this could be understood, the relationship between sequential memory encoding processes and the higher brain centres in insects would become clearer. Regarding higher brain centers (and how they affect an insect’s behaviour), it is possible to stop the functioning of particular neurons under investigation during phases of experiments and gradually re-establish the functioning of the neural circuit (Gerber et al., 2004). At present, it is known that higher brain centers in insects are related to autonomous navigation, multi-modal sensory integration, and an insect’s behavioral complexity in general; evidence also suggests an important role in context generalization and in short-term and long-term memory (McGuire et al., 2001). For a long time, insects have inspired robotic research in a qualitative way, but insect nervous systems have been under-exploited as a source of potential robot control architectures. In particular, it often seems to be assumed that insects only perform ‘reactive’ behavior, and that more complex control will need to be modeled on ‘higher’ animals.



SPATIO-TEMPORAL NEURAL CODING GENERATOR The ability to process sequential information has long been seen as one of the most important functions of “intelligent” systems (Huerta et al., 2004). As will be shown below, the winnerless competition principle appears as a major mechanism of sequential memory processing. The underlying concept is that sequential memory can be encoded in a (multidimensional) dynamical system by means of heteroclinic trajectories connecting several saddle points. Each of the saddle points is assumed to be remembered for further action (Afraimovich et al., 2004).

Computation over Neural Networks Digital computers are considered universal in the sense that they can implement any symbolic algorithm. If artificial neural networks, which have had a great influence on the field of computation, are considered as a paradigm of computation, one may ask how neural networks relate to the classical computing paradigm. To answer this question it is necessary to consider, on the one hand, discrete (digital) computation and, on the other hand, non-discrete (analog) computation. For the former, the traditional paradigm is the Turing machine with the Von Neumann architecture. A decade ago it was shown that artificial neural networks of analog neurons and rational weights are computationally equivalent to Turing machines. In terms of analog computation, it was also shown that three-layer feedforward nets can approximate any smooth function with arbitrary precision (Hornik et al., 1990). This result was extended to show how continuous recurrent neural nets (CRNN) can approximate an arbitrary dynamical system given by a system of n coupled first-order differential equations (Tsung, 1994; Chow and Li, 2000).

Neural Network Computation from a Dynamical-System Viewpoint Modern dynamical systems theory is concerned with the qualitative understanding of the asymptotic behaviors of systems that evolve in time. With complex non-linear systems, defined by coupled differential, difference or functional equations, it is often impossible to obtain closed-form (or asymptotically closed-form) solutions. Even if such solutions are obtained, their functional forms are usually too complicated to give an understanding of the overall behavior of the system. In such situations, qualitative analysis of the limit sets (fixed points, cycles or chaos) of the system can often offer better insights. Qualitative means that this type of analysis is not concerned with the quantitative changes but rather with what the limiting behavior will be (Tsung, 1994).

Spatio-Temporal Neural Coding and Winnerless Competition Networks It is important to understand how information is processed by computation from a dynamical viewpoint (in terms of steady states, limit cycles and strange attractors) because it gives us the possibility of managing sequential processes (Freeman, 1990). This section presents a new direction in information dynamics, namely Winnerless Competition (WLC) behavior. The main point of this principle is the transformation of the incoming spatial inputs into identity-temporal output based on the intrinsic switching dynamics of a dynamical system. In the presence of stimuli, the sequence of the switching, whose geometrical image in the phase space is a heteroclinic contour, depends uniquely on the incoming information. Consider the generalized Lotka-Volterra system (N=3):

$$
\begin{aligned}
\dot{a}_1 &= a_1\,[1 - (a_1 + \rho_{12} a_2 + \rho_{13} a_3)] \\
\dot{a}_2 &= a_2\,[1 - (a_2 + \rho_{21} a_1 + \rho_{23} a_3)] \\
\dot{a}_3 &= a_3\,[1 - (a_3 + \rho_{31} a_1 + \rho_{32} a_2)]
\end{aligned}
$$
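To make the switching behavior of the three-variable system above more concrete, the following minimal sketch integrates it with a forward-Euler scheme and records which variable is largest over time. Everything specific here is an assumption made for illustration: the symbol `rho` for the coupling matrix, the cyclic coupling values (0.5 and 2.0), the initial state, the step size and the function name are not taken from the source, whose own matrix and parameter conditions are not reproduced here.

```python
def simulate_wlc(rho, a0, dt=0.01, steps=10000):
    """Forward-Euler integration of da_i/dt = a_i * (1 - sum_j rho[i][j] * a_j)."""
    n = len(a0)
    a = list(a0)
    winners = []  # indices of the currently largest variable, without repeats
    for _ in range(steps):
        growth = [1.0 - sum(rho[i][j] * a[j] for j in range(n)) for i in range(n)]
        a = [a[i] + dt * a[i] * growth[i] for i in range(n)]
        leader = max(range(n), key=lambda i: a[i])
        if not winners or winners[-1] != leader:
            winners.append(leader)
    return a, winners


# Assumed coupling matrix: unit self-inhibition, one weak (0.5) and one
# strong (2.0) competitor per variable, arranged cyclically.
RHO = [[1.0, 0.5, 2.0],
       [2.0, 1.0, 0.5],
       [0.5, 2.0, 1.0]]

final_state, switching_sequence = simulate_wlc(RHO, a0=[0.6, 0.3, 0.1])
```

With these assumed values each single-variable state is a saddle with one growing and one decaying competitor, so the leading variable is expected to change repeatedly instead of settling on a permanent winner; `switching_sequence` lists the order in which the variables take the lead.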

If the following matrix and parameter conditions are satisfied,

$(\rho_{ij}) = \ldots$

[Figure: hardware genetic algorithm modules: EVALUATION, CROSSOVER, MUTATION, PSEUDO-RANDOM NUMBER GENERATOR]

Evolved Synthesis of Digital Circuits

$f_{\text{total}} = 0.9\, f_{\text{eval.correct}} + 0.1\, f_{\text{minim}}$

Three dynamically reconfigurable circuits were designed and tested. All are based on the hardware reconfigurable structure presented in Figure 1. The first scheme, the min-max terms reconfigurable circuit, uses the same principles as a programmable logic array. It is composed of three layers: an INV layer, an AND layer and an OR layer. The genetic algorithm controls the connections between the INV layer and the AND layer. This reconfigurable circuit has fast convergence and the smallest individuals, but it explores only the traditional solution space, and its size grows exponentially with the number of inputs and linearly with the number of outputs. The second circuit is the reconfigurable INV-AND-OR circuit. Like the first circuit, it has three layers: an INV layer, an AND layer and an OR layer. In this case the genetic algorithm configures the connections between the INV and AND layers and between the AND and OR layers. This scheme reduces the growth in size with the number of outputs, but the growth remains exponential with the number of inputs, and the individuals are larger than in the first circuit. The last reconfigurable circuit is the elementary functions reconfigurable circuit (e-reconfigurable). It contains more layers, and each layer contains a number of generic gates. A generic gate can implement elementary Boolean functions (AND, OR, XOR) and more complex circuits such as a MUX. This solution increases the size of the individuals and the complexity of the reconfigurable circuit, but it is almost invariant to the number of inputs and outputs. This last reconfigurable circuit explores the largest solution space, beyond the bounds of traditional design methods.

The evolvable hardware is used in three applications. In the first, the target function is static and the algorithm must find a hardware solution that implements it. Each individual represents a potential solution for the hardware implementation of the target function. The evolution loop is repeated until the optimal solution is found, at which point the loop is stopped. The hardware solution found in this way is called evolved hardware. In the second application the target function is also static, but here the individual encodes only a sub-circuit: one generic gate or one layer of gates. The individuals evolve, the offspring replace the parents in different positions, and the evaluation is performed on the entire circuit until the new solution is better than the old one. The evolution loop is repeated until the optimal solution is found. This approach is used to design circuits with a large number of inputs, outputs and sub-circuits.
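As a rough software illustration of the layered generic-gate idea described above, the sketch below evaluates a small circuit whose gates are selected by configuration values. The encoding is purely hypothetical: the 2-bit function codes, the `(op, index_a, index_b)` gate format and the function names are assumptions made for this example and do not describe the actual configuration bit layout of the hardware.

```python
# Hypothetical function-select codes for one generic gate.
GATE_OPS = {
    0b00: lambda a, b: a & b,   # AND
    0b01: lambda a, b: a | b,   # OR
    0b10: lambda a, b: a ^ b,   # XOR
    0b11: lambda a, b: 1 - a,   # NOT of the first input (second input ignored)
}


def eval_layer(signals, layer_config):
    """Evaluate one layer; each gate is (op_code, index_a, index_b) into `signals`."""
    return [GATE_OPS[op](signals[ia], signals[ib]) for (op, ia, ib) in layer_config]


def eval_circuit(inputs, genotype):
    """Feed the inputs through every configured layer and return the last layer's outputs."""
    signals = list(inputs)
    for layer_config in genotype:
        signals = eval_layer(signals, layer_config)
    return signals


# Example individual: a two-layer circuit computing the parity of three inputs.
parity_genotype = [
    [(0b10, 0, 1), (0b01, 2, 2)],  # layer 1: x1 XOR x2, and x3 passed through via OR
    [(0b10, 0, 1)],                # layer 2: XOR of the two layer-1 outputs
]
outputs = eval_circuit([1, 0, 1], parity_genotype)
```

In this picture a genetic algorithm would search over the contents of `parity_genotype`, while `eval_circuit` plays the role of the reconfigurable circuit being tested against the target function.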

Figure 4. Reconfigurable elementary functions circuit (inputs x1, x2, x3; each generic gate f(x) selects among the operations &, |, ^, ~)


Figure 5. Application schema: Finding optimal solution for target function implementation (blocks: Hardware genetic algorithm; Eval. Indiv. 1 to 8; Circuit 1 to 8; Optimal solution; Final circuit)

The last application uses dynamic target functions. Here each individual represents a complete solution for the circuit. The evolution loop has two steps. The first step is the same as in the first application: loop until a solution is found. After the solution is found in an individual, called the main individual, the evolution continues for the other individuals. The goal of the second step of the evolution loop is to obtain individuals that differ from the main individual. When the target function changes, the evolution loop returns to the first step, and the individuals, which have a high degree of dispersion, evolve toward the new solution.

CONCLUSIONS In this paper we have presented the concept of evolvable hardware and shown a practical implementation of a hardware genetic algorithm and a reconfigurable hardware structure. The hardware genetic algorithm increases the speed of convergence to solutions that represent configurations for the reconfigurable circuit. It can be used for the evolvable synthesis of digital circuits in intrinsic evolvable hardware. The bit-string solutions given by the genetic algorithm can serve as connection configurations for a dynamically reconfigurable hardware circuit.

We have presented three architectures of reconfigurable circuits which can be dynamically programmed by the same hardware genetic algorithm module. The structure was implemented on a Xilinx Spartan-3 FPGA.

FUTURE TRENDS This paper opens several directions of research. The first is the design of reconfigurable circuits using FPGA primitives. The new generation of FPGAs (Virtex-5) allows dynamic reconfiguration using primitives. In this case the generic gate is replaced by physical cells of the FPGA. Another direction is the implementation of a hybrid neuro-genetic structure. A hardware-implemented neural network can be used to store the best solutions from the genetic algorithm. This configuration can be used to improve the convergence of the genetic algorithm. Evolved hardware can also be used to design analog circuits. In this case, the Boolean reconfigurable circuit can be replaced by an analog reconfigurable circuit (such as a Field Programmable Transistor Array).


KEY TERMS

Evolvable Hardware: A reconfigurable circuit that is programmed by an evolutionary algorithm such as a GA. In extrinsic evolvable hardware, the evolutionary algorithm runs on a host station (a PC) outside the reconfigurable circuit. In intrinsic evolvable hardware, the evolutionary algorithm runs inside the same system as the reconfigurable circuit (even on the same chip).

Genetic Algorithms (GA): A genetic algorithm (or GA) is a search technique used in computing to find true or approximate solutions to optimization and search problems. Genetic algorithms are categorized as a stochastic local search technique. Genetic algorithms are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover (also called recombination). Individuals, encoded according to a coding scheme, are initialized with random values. In the first step of the algorithm, all individuals are evaluated to obtain their fitness values. In the next step, they are sorted by fitness and selected for the genetic operators. The parents are the individuals involved in genetic operators such as crossover or mutation. The resulting offspring are evaluated together with the parents, and the algorithm resumes from the first step. The loop is repeated until the solution is found or the number of generations reaches the limit set by the programmer.

Genotype: Describes the genetic constitution of an individual, that is, the specific allelic makeup of an individual. In evolvable hardware it consists of a vector of configuration bits.

HGA: A hardware genetic algorithm is a hardware implementation of a genetic algorithm. The hardware implementation increases the performance of the algorithm by replacing serial software modules with parallel hardware.

Microstructure: Integration of the structure on the same chip. An evolvable hardware microstructure is intrinsic evolvable hardware with all modules on the same chip.

Phenotype: Describes one of the traits of an individual that is measurable and that is expressed in only a subset of the individuals within the population. In evolvable hardware the phenotype consists of the circuit encoded by an individual.

Reconfigurable Circuit: A hardware structure consisting of a network of logic cells that allows the interconnections between cells to be configured.
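The loop described under Genetic Algorithms (evaluate, sort by fitness, select parents, apply crossover and mutation, then evaluate offspring together with parents) can be sketched in software as follows. This is an illustrative sketch only: the function name, the population size, one-point crossover with a single bit-flip mutation, keeping the top half as parents, and the bit-counting fitness used in the example are all assumptions, not the hardware implementation discussed in the article.

```python
import random


def run_ga(fitness, n_bits, pop_size=20, max_generations=100, target=None, seed=0):
    """Minimal generational GA over bit strings; returns the best individual found."""
    rng = random.Random(seed)
    # Individuals start as random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(max_generations):
        # Evaluate and sort by fitness (best first).
        population.sort(key=fitness, reverse=True)
        if target is not None and fitness(population[0]) >= target:
            break  # solution found
        # The better half become parents and survive into the next generation.
        parents = population[: pop_size // 2]
        offspring = []
        while len(offspring) < pop_size - len(parents):
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            child[rng.randrange(n_bits)] ^= 1   # single bit-flip mutation
            offspring.append(child)
        # Offspring compete together with their parents in the next evaluation.
        population = parents + offspring
    return max(population, key=fitness)


# Example: maximise the number of 1 bits in a 16-bit string.
best = run_ga(sum, n_bits=16, target=16)
```

Because the parents are kept alongside the offspring, the best fitness in the population never decreases from one generation to the next in this sketch.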


Evolving Graphs for ANN Development and Simplification Daniel Rivero University of A Coruña, Spain David Periscal University of A Coruña, Spain

INTRODUCTION One of the most successful tools in the Artificial Intelligence (AI) world is the Artificial Neural Network (ANN). This technique is a powerful tool used in many different environments and for many different purposes, such as classification, clustering, signal modelling, or regression (Haykin, 1999). Although ANNs are very easy to use, creating them is not a simple task, because the expert has to devote much effort and time to it. The development of ANNs can be divided into two parts: architecture development, and training and validation. Architecture development determines not only the number of neurons of the ANN, but also the type of the connections among those neurons. Training determines the connection weights for that architecture. The architecture design task is usually performed manually, meaning that the expert has to test different architectures to find the one able to achieve the best results. Each architecture trial means training and validating it, which can require many computational resources, depending on the complexity of the problem. Therefore, the expert is heavily involved in the whole ANN development, although techniques for the relatively automatic creation of ANNs have recently been developed.

BACKGROUND ANN development is a research topic that has attracted many researchers from the world of evolutionary algorithms (Nolfi & Parisi D., 2002) (Cantú-Paz & Kamath, 2005). These techniques follow the general strategy of an evolutionary algorithm: an initial population with different types of genotypes, which in turn encode different parameters (commonly the connection weights and/or the architecture of the network and/or the learning rules), is randomly created and repeatedly induced to evolve.

The most direct application of EC tools in the ANN world is the evolution of the connection weights. This process starts from an ANN with an already determined topology. In this case, the problem to be solved is the training of the connection weights, attempting to minimise the network error. Most training algorithms, such as the backpropagation (BP) algorithm (Rumelhart, Hinton & Williams, 1986), are based on gradient minimisation, which has several drawbacks. The main one is that, quite frequently, the algorithm gets stuck in a local minimum of the fitness function and is unable to reach a global minimum. One option for overcoming this situation is to use an evolutionary algorithm, so that training is carried out by evolving the connection weights within the environment defined by both the network architecture and the task to be solved. In such cases, the weights can be represented in a genetic algorithm (GA) either as a concatenation of binary values or as real numbers (Greenwood, 1997).

The evolution of architectures consists of generating the topological structure, i.e., establishing the connectivity and the transfer function of each neuron. To achieve this goal with an evolutionary algorithm, it is necessary to choose how to encode the genotype of a given network so that it can be used by the genetic operators. The most typical approach is called direct encoding. In this technique there is a one-to-one correspondence between each gene and a given part of the network. A binary matrix represents an architecture in which every element indicates the presence or absence of a connection between two nodes (Alba, Aldana & Troya, 1993).

In contrast to direct encoding, there are also indirect encoding methods, in which only some characteristics of the architecture are encoded in the chromosome. These methods use several types of representation. First, parametric representations describe the network as a set of parameters such as the number of hidden layers, the number of nodes in each layer, the number of connections between two layers, etc. (Harp, Samad & Guha, 1989). Another indirect representation is based on grammatical rules (Kitano, 1990), shaped as production rules that generate a matrix representing the network. A further type of encoding is the growing methods, in which the genotype contains a set of instructions for building up the network (Nolfi & Parisi, 2002).

All of these methods evolve architectures, either alone (most commonly) or together with the weights. The transfer function for every node of the architecture is assumed to have been fixed beforehand by a human expert and is the same for all the nodes of the network or, at least, for all the nodes of the same layer. Only a few methods that also evolve the transfer function have been developed (Hwang, Choi & Park, 1997).
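The direct encoding mentioned above, where a binary matrix marks the presence or absence of each connection, can be illustrated with a short sketch. The flat row-major bit string, the convention that entry (i, j) means a connection from node i to node j, and the function names are assumptions made for this example rather than details given by the cited works.

```python
def decode_direct(genotype, n_nodes):
    """Reshape a flat bit string into an n_nodes x n_nodes connectivity matrix."""
    if len(genotype) != n_nodes * n_nodes:
        raise ValueError("genotype length must be n_nodes squared")
    return [genotype[row * n_nodes:(row + 1) * n_nodes] for row in range(n_nodes)]


def connections(matrix):
    """List (source, destination) pairs for every entry set to 1."""
    return [(i, j)
            for i, row in enumerate(matrix)
            for j, bit in enumerate(row)
            if bit == 1]


# Example: nodes 0 and 1 are inputs, node 2 is hidden, node 3 is the output.
genotype = [0, 0, 1, 0,
            0, 0, 1, 0,
            0, 0, 0, 1,
            0, 0, 0, 0]
matrix = decode_direct(genotype, n_nodes=4)
links = connections(matrix)
```

Each gene corresponds to exactly one potential connection, which is what makes the encoding direct; it also means the genotype length grows with the square of the number of nodes.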

ANN DEVELOPMENT WITH GENETIC PROGRAMMING This section very briefly shows an example of how to develop ANNs using an AI tool, Genetic Programming (GP), which performs an evolutionary algorithm, and how it can be applied to Data Mining tasks.

The GP encoding for the solutions is tree-shaped, so the user must specify which are the terminals (leaves of the tree) and the functions (nodes capable of having descendants) for being used by the evolutionary algorithm in order to build complex expressions. The wide application of GP to various environments and its consequent success are due to its capability for being adapted to numerous different problems. Although the main and more direct application is the generation of mathematical expressions (Rivero, Rabuñal, Dorado & Pazos, 2005), GP has been also used in other fields such as filter design (Rabuñal, Dorado, Puertas, Pazos, Santos & Rivero D., 2003), knowledge extraction, image processing (Rivero, Rabuñal, Dorado & Pazos, 2004), etc.

Model Overview This work will use a graph-based codification to represent ANNs in the genotype. These graphs will not contain any cycles. Due to this type of codification the genetic operators had to be changed in order to be able to use the GP algorithm. The operators were changed in this way: •

•

Genetic Programming GP (Koza, 92) is based on the evolution of a given population. Its working is similar to a GA. In this population, every individual represents a solution for a problem that is intended to be solved. The evolution is achieved by means of the selection of the best individuals – although the worst ones have also a little chance of being selected – and their mutual combination for creating new solutions. After several generations, the population is expected to contain some good solutions for the problem.

•

The creation algorithm must allow the creation of graphs. This means that, at the moment of the creation of a node’s child, this algorithm must allow not only the creation of this node, but also a link to an existing one in the same graph, without making cycles inside the graph. The crossover algorithm must allow the crossing of graphs. This algorithm works very similar to the existing one for trees, i.e. a node is chosen on each individual to change the whole subgraph it represents to the other individual. Special care has to be taken with graphs, because before the crossover there may be links from outside this subgraph to any nodes on it. In this case, after the crossover these links are updated and changed to point to random nodes in the new subgraph. The mutation algorithm has been changed too, and also works very similar to the GP tree-based mutation algorithm. A node is chosen from the individual and its subgraph is deleted and replaced with a new one. Before the mutation occurs, there may be nodes in the individual pointing to other nodes in the subgraph. These links are updated

619

E


Table 1. Summary of the operators to be used in the tree Node ANN n-Neuron

Type ANN NEURON

Num. Children Num. outputs 2*n

n-Input +,-,*,% [-4.4]

NEURON REAL REAL

0 2 0

and made to point to random nodes in the new subgraph.

Children type NEURON n NEURON n REAL (weights) REAL -

In order to be able to use GP to develop any kind of system, it is necessary to specify the set of operators that will be in the tree. With them, the evolutionary system must be able to build correct trees that represent ANNs. An overview of the operators used can be seen on Table 1. This table shows a summary of the operators that can be used in the tree. This set of terminals and functions are used to build a tree that represents an ANN.

The creation, crossover and mutation algorithms must also respect two GP restrictions: typing and maximum height. The typing property (Montana, 1995) means that each node has a type and also specifies the type required of each of its children. This property makes it possible to develop structures that follow a specific grammar.

Figure 1. GP graph and its resulting network

Although these sets are not explained in the text, Fig. 1 shows an example of how they can be used to represent an ANN. These operators are used to build GP trees. Once a tree has been evaluated, the genotype turns into the phenotype: it is converted into an ANN whose weights are already set (so it does not need to be trained) and which can therefore be evaluated directly.

The evolutionary process requires a fitness value to be assigned to every genotype. This value is obtained by evaluating the network on the pattern set that represents the problem, and is the Mean Square Error (MSE) between the network outputs and the desired outputs. However, the value is modified in order to induce the system to generate simple networks: a penalization value multiplied by the number of neurons of the network is added to it. Since the evolutionary system is designed to minimise the error, a larger network receives a worse fitness value, and simple networks are preferred because the added penalty is proportional to the number of neurons of the ANN. The final fitness is calculated as follows:

fitness = MSE + N * P

where N is the number of neurons of the network and P is the penalization value per neuron.
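The penalised fitness can be illustrated with a minimal sketch. The function names and the flat list representation of outputs are assumptions made for illustration only; the default penalty is the penalization value listed among the experimental parameters of this article.

```python
def mean_square_error(outputs, targets):
    """Mean of the squared differences between network outputs and desired outputs."""
    if not outputs or len(outputs) != len(targets):
        raise ValueError("outputs and targets must be non-empty and of equal length")
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


def penalised_fitness(outputs, targets, num_neurons, penalty=0.00001):
    """fitness = MSE + N * P (lower is better).

    num_neurons is N, penalty is P. The names are hypothetical; only the
    formula itself comes from the text.
    """
    return mean_square_error(outputs, targets) + num_neurons * penalty
```

With equal MSE, a network with more neurons always receives a larger (worse) fitness, which is the bias towards simple networks described above.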

Example of Applications

This technique has been used for solving problems of different complexity taken from the UCI repository (Mertz & Murphy, 2002). All of them are knowledge-extraction problems on databases in which, taking certain features as a basis, the aim is to predict another attribute of the database. A short description of the problems can be seen in Table 2, along with other ANN parameters used later in this work. All these databases have been normalised between 0 and 1 and divided into two parts: 70% of each database is used for training and the remaining 30% for testing.
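The preprocessing described above (scaling to [0, 1] and a 70/30 split) could look roughly like the following sketch. Column-wise min-max scaling, the random shuffle before splitting and the function names are assumptions; the text does not say how the normalisation or the partition was actually carried out.

```python
import random


def min_max_normalise(rows):
    """Scale every column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column has no spread, so it is mapped to 0.0 (assumption).
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and return (training part, test part)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```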

Results and Comparison with Other Methods


Several experiments have been performed in order to evaluate the system performance. The parameter values used in these experiments were the following:

• Population size: 1000 individuals.
• Crossover rate: 95%.
• Mutation probability: 4%.
• Selection algorithm: 2-individual tournament.
• Graph maximum height: 5.
• Maximum inputs for each neuron: 9.
• Penalization value: 0.00001.
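As an illustration of the selection algorithm named in the list above, a 2-individual tournament can be sketched as follows. This is not the authors' implementation: the function name is hypothetical, fitness is assumed to be minimised (as stated for this system), and contestants are assumed to be drawn without replacement.

```python
import random


def tournament_select(population, fitness, rng, size=2):
    """Draw `size` distinct individuals at random and return the fittest one.

    `fitness` maps an individual to an error value, so lower is better.
    """
    contestants = rng.sample(population, size)
    return min(contestants, key=fitness)


# Usage sketch: individuals are represented here by their own error values.
rng = random.Random(42)
population = [0.30, 0.10, 0.25, 0.05]
parent = tournament_select(population, fitness=lambda ind: ind, rng=rng)
```

Because only two individuals compete each time, poor individuals still win whenever they are paired with an even poorer one, which keeps some diversity in the population. Under the without-replacement assumption, the single worst individual never wins.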

These values were obtained after several preliminary experiments aimed at finding parameters that give good results on all of the problems. Since the problems differ considerably in complexity, the same parameters are expected to give good results on many other problems.

In order to evaluate its performance, the system presented here has been compared with other ANN generation and training methods. The 5x2cv method was used by Cantú-Paz and Kamath (2005) to compare different ANN generation and training techniques based on EC tools. That work reports the average precision obtained over the 10 test results generated by this method. These values are the basis for comparing the technique described here with other well-known ones, described in detail by Cantú-Paz and Kamath (2005). That work also reports the average times needed to achieve the results. Since the same processor was not available, the computational effort needed to achieve the results was estimated instead. This effort is the number of times that the pattern file was evaluated. For every technique it can be computed from the population size, the number of generations, the number of times that the BP algorithm was applied, etc.; the calculation differs for every algorithm.

All the techniques compared with this work use evolutionary algorithms for ANN design. Five iterations of a 5-fold cross-validation test were performed for all these techniques in order to evaluate the accuracy of the networks. The techniques are connectivity matrix, pruning, parameter search and graph-rewriting grammar.


Table 2 summarises the number of neurons used by Cantú-Paz and Kamath (2005) to solve the problems with the connectivity matrix and pruning techniques. The number of epochs of the BP algorithm, when used, is also indicated. Table 3 shows the parameter configuration used by these techniques. The execution was stopped after 5 generations with no improvement or after 50 total generations. The results obtained with these 4 methods are shown in Table 4. Every cell of the table contains 3 values: the precision obtained by Cantú-Paz and Kamath (2005) (left), the precision obtained with the technique described here for the same computational effort (right), and the computational effort needed to obtain the former value with that technique (in parentheses).

Table 2. Summary of the problems to be solved (description of each problem and ANN configuration)

Problem | Number of inputs | Number of instances | Number of outputs | ANN inputs | ANN hidden | ANN outputs | BP epochs
Breast Cancer | 9 | 699 | 1 | 9 | 5 | 1 | 20
Iris Flower | 4 | 150 | 3 | 4 | 5 | 3 | 80
Heart Disease | 13 | 303 | 1 | 26 | 5 | 1 | 40
Ionosphere | 34 | 351 | 1 | 34 | 10 | 1 | 40

Table 3. Parameters of the techniques used for the comparison

Parameter | Matrix | Pruning | Parameters | Grammar
Chromosome length (L) | N | N | 36 | 256
Population size | ⌊3√L⌋ | ⌊3√L⌋ | 25 | 64
Crossover points | L/10 | L/10 | 2 | L/10
Mutation rate | 1/L | 1/L | 0.04 | 0.004

N = (hidden+output)*input + output*hidden

Table 4. Comparison with other methods. Each cell shows: precision obtained by Cantú-Paz and Kamath (2005) / precision obtained with the technique described here (computational effort)

Problem | Matrix | Pruning | Parameters | Grammar
Breast Cancer | 96.77 / 96.27 (92000) | 96.31 / 95.79 (4620) | 96.69 / 96.27 (100000) | 96.71 / 96.31 (300000)
Iris Flower | 92.40 / 95.49 (320000) | 92.40 / 81.58 (4080) | 91.73 / 95.52 (400000) | 92.93 / 95.66 (1200000)
Heart Cleveland | 76.78 / 81.11 (304000) | 89.50 / 78.28 (7640) | 65.89 / 81.05 (200000) | 72.8 / 80.97 (600000)
Ionosphere | 87.06 / 88.34 (464000) | 83.66 / 82.37 (11640) | 85.58 / 87.81 (200000) | 88.03 / 88.36 (600000)
Average | 88.25 / 90.30 | 90.46 / 84.50 | 84.97 / 90.16 | 87.61 / 90.32

Looking at this table, the results obtained with the method proposed here are not only similar to those presented by Cantú-Paz and Kamath (2005), but better in most cases. The reason is that those methods carry a high computational load, since training is necessary for every network (individual) evaluation, which makes them time-consuming. In the work described here, design and training are performed simultaneously, and therefore the times needed for designing and for evaluating the network are combined.
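As a worked check of the chromosome-length formula given with Table 3, N = (hidden+output)*input + output*hidden, the short sketch below applies it to the ANN configurations listed in Table 2. The function and variable names are hypothetical; only the formula and the neuron counts come from the tables.

```python
def chromosome_length(inputs, hidden, outputs):
    """N = (hidden + output) * input + output * hidden, as given with Table 3."""
    return (hidden + outputs) * inputs + outputs * hidden


# (inputs, hidden, outputs) of the ANN configurations in Table 2.
ann_configurations = {
    "Breast Cancer": (9, 5, 1),
    "Iris Flower": (4, 5, 3),
    "Heart Disease": (26, 5, 1),
    "Ionosphere": (34, 10, 1),
}

lengths = {name: chromosome_length(*cfg) for name, cfg in ann_configurations.items()}
for name, n in lengths.items():
    print(f"{name}: N = {n}")
```

For example, the Breast Cancer configuration gives (5 + 1) * 9 + 1 * 5 = 59, so the matrix and pruning techniques would work with chromosomes of length 59 on that problem.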

FUTURE TRENDS

A future line of work in this area would be the study of the system parameters in order to evaluate their impact on the results for different problems. Another interesting line consists of combining this graph evolution algorithm with a GA that performs an optimization process on the weight values. With this modification, the whole system will have two levels:

1. The graph evolution algorithm explained in this work performs the evolution of the architectures.
2. The GA takes those architectures and optimizes the weights of the connections.

With this architecture, the evolution of ANNs can be seen as a Lamarckian strategy.

CONCLUSION

This work describes a technique in which an evolutionary algorithm is used to automatically develop ANNs. This evolutionary algorithm performs graph evolution and is based on the GP algorithm, although it had to be modified in order to operate with graphs instead of trees. Results show that the networks returned by this algorithm give, in most cases, a lower error than the rest of the ANN development systems used for the comparison. Only one technique (pruning) performs better than the one described here. However, that technique still needs some work from the expert to design the initial network. Most of the techniques used for ANN development are quite costly, in some cases because they combine training with architecture evolution. The technique described here achieves good results at a low computational cost, with the added advantage that not only the architecture and the connectivity of the network are evolved, but the network itself also undergoes an optimization process.

ACKNOWLEDGMENT

The experiments described in this work were performed with equipment belonging to the Super Computation Center of Galicia (CESGA). The Cleveland heart disease database was available thanks to Robert Detrano, M.D., Ph.D., V.A. Medical Center, Long Beach and Cleveland Clinic Foundation.

REFERENCES

Alba E., Aldana J.F. & Troya J.M. (1993) Fully automatic ANN design: A genetic approach. Proc. Int. Workshop Artificial Neural Networks (IWANN’93), Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag, 686, 399-404.
Cantú-Paz E. & Kamath C. (2005) An Empirical Comparison of Combinations of Evolutionary Algorithms and Neural Networks for Classification Problems. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics. 915-927.
Greenwood G.W. (1997) Training partially recurrent neural networks using evolutionary strategies. IEEE Trans. Speech Audio Processing, 5, 192-194.
Harp S.A., Samad T. & Guha A. (1989) Toward the genetic synthesis of neural networks. Proc. 3rd Int. Conf. Genetic Algorithms and Their Applications, J.D. Schafer, Ed. San Mateo, CA: Morgan Kaufmann. 360-369.
Haykin, S. (1999). Neural Networks (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.
Hwang M.W., Choi J.Y. & Park J. (1997) Evolutionary projection neural networks. Proc. 1997 IEEE Int. Conf. Evolutionary Computation, ICEC’97. 667-671.
Jung-Hwan Kim, Sung-Soon Choi & Byung-Ro Moon (2005) Normalization for neural network in genetic search. Genetic and Evolutionary Computation Conference. 1-10.
Kitano H. (1990) Designing neural networks using genetic algorithms with graph generation system. Complex Systems, 4, 461-476.
Koza, J. R. (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press.
Mertz C.J. & Murphy P.M. (2002). UCI repository of machine learning databases. http://www-old.ics.uci.edu/pub/machine-learning-databases.
Montana D.J. (1995) Strongly typed genetic programming. Evolutionary Computation, 3(2), 199-200.
Nolfi S. & Parisi D. (2002) Evolution of Artificial Neural Networks. Handbook of brain theory and neural networks, Second Edition. Cambridge, MA: MIT Press. 418-421.
Rabuñal J.R., Dorado J., Puertas J., Pazos A., Santos A. & Rivero D. (2003) Prediction and Modelling of the Rainfall-Runoff Transformation of a Typical Urban Basin using ANN and GP. Applied Artificial Intelligence.
Rivero D., Rabuñal J.R., Dorado J. & Pazos A. (2004) Using Genetic Programming for Character Discrimination in Damaged Documents. Applications of Evolutionary Computing, EvoWorkshops2004: EvoBIO, EvoCOMNET, EvoHOT, EvoIASP, EvoMUSART, EvoSTOC. 349-358.
Rivero D., Rabuñal J.R., Dorado J. & Pazos A. (2005) Time Series Forecast with Anticipation using Genetic Programming. IWANN 2005. 968-975.
Rumelhart D.E., Hinton G.E. & Williams R.J. (1986) Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructures of Cognition. D. E. Rumelhart & J.L. McClelland, Eds. Cambridge, MA: MIT Press. 1, 318-362.

KEY TERMS

Artificial Neural Networks: Interconnected set of many simple processing units, commonly called neurons, that use a mathematical model representing an input/output relation.
Back-Propagation Algorithm: Supervised learning technique used by ANNs that iteratively modifies the weights of the connections of the network so that the error obtained when comparing the network outputs with the desired ones decreases.
Evolutionary Computation: Set of Artificial Intelligence techniques used in optimization problems, which are inspired by biological mechanisms such as natural evolution.
Genetic Programming: Machine learning technique that uses an evolutionary algorithm to optimise a population of computer programs according to a fitness function which determines the capability of a program for performing a given task.
Genotype: The representation of an individual as an entire collection of genes, to which the crossover and mutation operators are applied.
Phenotype: Expression of the properties coded by the individual’s genotype.
Population: Pool of individuals exhibiting equal or similar genome structures, which allows the application of genetic operators.
Search Space: Set of all possible situations that the problem we want to solve could ever be in.


Facial Expression Recognition for HCI Applications


Fadi Dornaika Institut Géographique National, France Bogdan Raducanu Computer Vision Center, Spain

INTRODUCTION

Facial expression plays an important role in the cognition of human emotions (Fasel, 2003; Yeasin, 2006). The recognition of facial expressions in image sequences with significant head movement is a challenging problem. It is required by many applications such as human-computer interaction and computer graphics animation (Cañamero, 2005; Picard, 2001). To classify expressions in still images, many techniques have been proposed, such as Neural Nets (Tian, 2001), Gabor wavelets (Bartlett, 2004), and active appearance models (Sung, 2006). Recently, more attention has been given to modeling facial deformation in dynamic scenarios. Still image classifiers use feature vectors related to a single frame to perform classification. Temporal classifiers try to capture the temporal pattern in the sequence of feature vectors related to each frame; examples are the Hidden Markov Model based methods (Cohen, 2003; Black, 1997; Rabiner, 1989) and Dynamic Bayesian Networks (Zhang, 2005).

The main contributions of the paper are as follows. First, we propose an efficient recognition scheme based on the detection of keyframes in videos, where the recognition is performed using a temporal classifier. Second, we use the proposed method for extending the human-machine interaction functionality of a robot whose response is generated according to the user’s recognized facial expression. Our proposed approach has several advantages. First, unlike most expression recognition systems, which require a frontal view of the face, our system is view- and texture-independent. Second, its learning phase is simple compared to other techniques (e.g., Hidden Markov Models and Active Appearance Models): we only need to fit second-order Auto-Regressive models to sequences of facial actions. As a result, even when the imaging conditions change, the learned Auto-Regressive models need not be recomputed.

The rest of the paper is organized as follows. Section 2 summarizes our developed appearance-based 3D face tracker that we use to track the 3D head pose as well as the facial actions. Section 3 describes the proposed facial expression recognition based on the detection of keyframes. Section 4 provides some experimental results. Section 5 describes the proposed human-machine interaction application that is based on the developed facial expression recognition scheme.

SIMULTANEOUS HEAD AND FACIAL ACTION TRACKING

In our study, we use the Candide 3D face model (Ahlberg, 2001). This 3D deformable wireframe model is given by the 3D coordinates of n vertices. Thus, the 3D shape can be fully described by the 3n-vector g, the concatenation of the 3D coordinates of all vertices. The vector g can be written as:

$$ g = g_s + A\,\tau_a \qquad (1) $$

where gs is the static shape of the model, τa is the facial action vector, and the columns of A are the Animation Units. In this study, we use six modes for the facial Animation Units (AUs) matrix A, that is, the dimension of τa is 6. These modes are all included in the Candide model package. We have chosen the six following AUs: lower lip depressor, lip stretcher, lip corner depressor, upper lip raiser, eyebrow lowerer and outer eyebrow raiser. A cornerstone problem in facial expression recognition is the ability to track the local facial actions/deformations. In our work, we track the head and facial actions using our face tracker (Dornaika & Davoine, 2006). This appearance-based tracker simultaneously computes the 3D head pose and the facial actions τa by minimizing a distance between the incoming warped frame and the current appearance of the face. Since the facial actions, encoded by the vector τa, are highly correlated with the facial expressions, their time series representation can be used for inferring the facial expression in videos, as explained in the next section.
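The shape equation g = gs + A τa can be made concrete with a small sketch. The function name and the toy numbers are hypothetical and do not come from the Candide package; the example uses a single vertex (3 coordinates) and two animation units instead of the six used in the text.

```python
def deform_shape(static_shape, animation_units, facial_actions):
    """Return g = g_s + A * tau_a.

    static_shape:    list of 3n coordinates (g_s)
    animation_units: 3n rows, one column per Animation Unit (A)
    facial_actions:  one intensity per Animation Unit (tau_a)
    """
    if len(animation_units) != len(static_shape):
        raise ValueError("A must have one row per shape coordinate")
    if any(len(row) != len(facial_actions) for row in animation_units):
        raise ValueError("A must have one column per facial action")
    return [
        g + sum(a * t for a, t in zip(row, facial_actions))
        for g, row in zip(static_shape, animation_units)
    ]


# Toy usage: one vertex, two animation units (illustrative values only).
g_s = [0.0, 1.0, 2.0]
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
tau_a = [0.5, 2.0]
g = deform_shape(g_s, A, tau_a)
```

Setting all facial actions to zero returns the static shape, which corresponds to a neutral face in this formulation.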

EFFICIENT FACIAL EXPRESSION DETECTION AND RECOGNITION

In (Dornaika & Raducanu, 2006), we proposed a facial expression recognition method based on the time-series representation of the tracked facial actions τa. An analysis-synthesis scheme based on learned auto-regressive models was proposed. In this paper, we introduce a process able to detect keyframes in videos. Once a keyframe is detected, the temporal recognition scheme described in (Dornaika & Raducanu, 2006) is invoked on it. The proposed scheme has two advantages. First, the CPU time spent on recognition is considerably reduced, since only a few keyframes are considered. Second, since a keyframe and its neighboring frames characterize the expression, the discrimination performance of the recognition scheme is improved. In our case, keyframes are defined as the frames where the facial actions change abruptly. Thus, a keyframe can be detected by looking for a positive local maximum in the temporal derivative of the facial actions. To this end, two quantities are computed from the sequence of facial actions τa, which arrive sequentially: (i) the L1 norm ||τa||1, and (ii) the temporal derivative given by:

$$ D_t = \frac{\partial \|\tau_a\|_1}{\partial t} = \sum_{i=1}^{6} \frac{\partial \tau_{a(i)}}{\partial t} \qquad (2) $$

In the above equation, we have used the fact that the facial actions are positive. Let W be the size of a temporal segment defining the temporal granularity of the system. In other words, the system will detect and recognize at most one expression every W frames. In practice, W corresponds to between 0.5 s and 1 s of video. The whole scheme is depicted in Figure 1. In this figure, we can see that the system has three levels: the tracking level, the keyframe detection level, and the recognition level. The tracker provides the facial actions for every frame. Whenever the current video segment size reaches W frames, the keyframe detection is invoked to select a keyframe in the current segment, if any. A given frame is considered a keyframe if it meets three conditions: (1) the corresponding Dt is a positive local maximum (within the segment), (2) the corresponding norm ||τa||1 is greater than a predefined threshold, and (3) it is at least W frames away from the previous keyframe. Once a keyframe is found in the current segment, the dynamical classifier described in (Dornaika & Raducanu, 2006) is invoked. Figure 2 shows the results of applying the proposed detection scheme to a 1600-frame sequence containing 23 played expressions. Some images are shown in Figure 4. The solid curve corresponds to the norm ||τa||1, the dotted curve to the derivative Dt, and the vertical bars to the detected keyframes. In this example, the value of W is set to 30 frames. As can be seen, out of 1600 frames only 23 keyframes are processed by the expression classifier.

Figure 1. Efficient facial expression detection and recognition based on keyframes

Figure 2. Keyframe detection and recognition applied on a 1600-frame sequence

EXPERIMENTAL RESULTS

Recognition results: We used a 300-frame video sequence, for which we asked a subject to display several expressions arbitrarily (see Figure 3). The middle of this figure shows the normalized similarities associated with each universal expression when recognition is performed for every frame in the sequence. As can be seen, the temporal classifier (Dornaika & Raducanu, 2006) has correctly detected the presence of the surprise, joy, and sadness expressions. Note that the mixture of expressions at transitions is normal, since the recognition is performed in a framewise manner. The lower part of the figure shows the results of applying the proposed keyframe detection scheme. On a 3.2 GHz PC, non-optimized C code implementing the approach carries out the tracking and recognition in about 60 ms.

Performance study: In order to quantify the recognition rate, we used 35 test videos retrieved from the CMU database. Table 1 shows the confusion matrix associated with the 35 test videos, featuring 7 persons. The recognition rate was good (80%) but fell short of 100%, which can be explained by the fact that expression dynamics are highly subject-dependent. Recall that the auto-regressive models used are built from data associated with one subject. Notice that the human ‘ceiling’ in correctly classifying facial expressions into the six basic emotions has been established at 91.7%.
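The keyframe test used by the detection scheme (a positive local maximum of Dt, a norm ||τa||1 above a threshold, and a distance of at least W frames from the previous keyframe) can be sketched as follows. This is an illustrative reading, not the authors' C implementation: the derivative is approximated by a frame-to-frame difference, the local maximum is checked against the two neighbouring frames, and when several frames of a segment qualify the one with the largest Dt is kept. All three choices, and the function names, are assumptions.

```python
def l1_norm(tau):
    """L1 norm of one facial action vector."""
    return sum(abs(x) for x in tau)


def detect_keyframes(actions, window, norm_threshold):
    """Return the indices of detected keyframes, at most one per segment of `window` frames."""
    norms = [l1_norm(tau) for tau in actions]
    # Finite-difference approximation of D_t (assumption).
    deriv = [0.0] + [norms[t] - norms[t - 1] for t in range(1, len(norms))]
    keyframes = []
    for start in range(0, len(actions), window):
        end = min(start + window, len(actions))
        best = None
        for t in range(max(start, 1), min(end, len(actions) - 1)):
            is_local_max = (deriv[t] > 0
                            and deriv[t] >= deriv[t - 1]
                            and deriv[t] >= deriv[t + 1])
            far_enough = not keyframes or t - keyframes[-1] >= window
            if is_local_max and norms[t] > norm_threshold and far_enough:
                if best is None or deriv[t] > deriv[best]:
                    best = t
        if best is not None:
            keyframes.append(best)
    return keyframes
```

Only the frames returned by such a test would be passed to the temporal classifier, which is where the saving in recognition time comes from.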

Table 1. Confusion matrix for the facial expression classifier associated with 35 test videos (CMU data). The model is built using one unseen person

Expression (no. of videos) | Surprise | Sadness | Joy | Disgust | Anger
Surprise (7) | 7 | 0 | 0 | 0 | 0
Sadness (7) | 0 | 7 | 0 | 0 | 0
Joy (7) | 0 | 0 | 7 | 0 | 0
Disgust (7) | 0 | 5 | 0 | 2 | 0
Anger (7) | 0 | 0 | 0 | 2 | 5
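The 80% recognition rate quoted in the text can be checked directly from the confusion matrix: it is the sum of the diagonal divided by the total number of test videos. The sketch below uses the entries of Table 1 as printed; the variable and function names are illustrative.

```python
EXPRESSIONS = ["Surprise", "Sadness", "Joy", "Disgust", "Anger"]

# One row per labelled expression (7 videos each), entries as listed in Table 1.
CONFUSION = [
    [7, 0, 0, 0, 0],
    [0, 7, 0, 0, 0],
    [0, 0, 7, 0, 0],
    [0, 5, 0, 2, 0],
    [0, 0, 0, 2, 5],
]


def recognition_rate(matrix):
    """Fraction of videos on the diagonal, i.e. correctly recognized."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


rate = recognition_rate(CONFUSION)
per_class = {name: row[i] / sum(row)
             for i, (name, row) in enumerate(zip(EXPRESSIONS, CONFUSION))}
```

The diagonal sums to 28 of the 35 videos, which gives the 80% reported. Most of the errors involve disgust, which is frequently confused with sadness.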



Figure 3. Top: Four frames (50, 110, 150, and 250) associated with a 300-frame test sequence. Middle: The similarity measure computed for each universal expression and for each non-neutral frame of the sequence (the framewise recognition). Bottom: The recognition based on keyframe detection.

HUMAN-MACHINE INTERACTION

Interpreting non-verbal face gestures is used in a wide range of applications. An intelligent user interface should interpret not only the face movements but also the user’s emotional state (Breazeal, 2002). Knowing the emotional state of the user lets machines communicate and interact with humans in a natural way: intelligent entertainment systems for kids, interactive computers, intelligent sensors and social robots, to mention a few. In what follows, we show how our proposed technique lends itself to such applications. Without loss of generality, we use the AIBO robot, which has the advantage of being especially designed for Human Computer Interaction. The input to the system is a video stream capturing the user’s face.

The AIBO robot: AIBO is a biologically-inspired robot that is able to show its emotions through an array of LEDs situated in the frontal part of the head. In addition to the LED configuration, the robot response contains some small head and body movements. By design, AIBO’s affective states are triggered by the Emotion Generator engine. This occurs as a response to its internal state representation, captured through multi-modal interaction (vision, audio and touch). For instance, it can display the ‘happiness’ feeling when it detects a face (through the vision system) or hears a voice. However, it does not possess a built-in system for vision-based automatic facial expression recognition. For this reason, using the scheme proposed in this paper (see Section 3), we created an application for AIBO that provides it with this capability. The application is a very simple one, in which the robot just imitates the expression of a human subject. Usually, the response of the robot occurs slightly after the apex of the human expression. The results of this application were recorded in a 2-minute video which can be downloaded from the following address: http://www.cvc.uab.es/~bogdan/AIBO-emotions.avi. In order to display the correspondence between the subject’s and the robot’s expressions simultaneously in the video, we put them side by side. Figure 4 illustrates five detected keyframes from the 1600-frame video depicted in Figure 2. These are shown in correspondence with the robot’s response. The middle row shows the recognized expression. The bottom row shows a snapshot of the robot head when it interacts with the detected and recognized expression.

Figure 4. Top: Some detected keyframes associated with the 1600-frame video. Middle: The recognized expression. Bottom: The corresponding robot’s response.

CONCLUSION

This paper described a view- and texture-independent approach to facial expression analysis and recognition. The paper presented two contributions. First, we proposed an efficient facial expression recognition scheme based on the detection of keyframes in videos. Second, we applied the proposed method in a Human Computer Interaction scenario, in which an AIBO robot mirrors the user’s recognized facial expression.

ACKNOWLEDGMENT

This work has been partially supported by MCYT Grant TIN2006-15308-C02, Ministerio de Educación y Ciencia, Spain. Bogdan Raducanu is supported by the Ramon y Cajal research program, Ministerio de Educación y Ciencia, Spain. The authors thank Dr. Franck Davoine from CNRS, Compiegne, France, for providing the video sequence shown in Figure 4.

REFERENCES

Ahlberg, J. (2001). CANDIDE-3 - An Updated Parameterized Face. Technical Report LiTH-ISY-R-2326, Dept. of Electrical Engineering, Linköping University, Sweden.
Bartlett, M., Littleworth, G., Lainscsek, C., Fasel I. & Movellan, J. (2004). Machine Learning Methods for Fully Automatic Recognition of Facial Expressions and Facial Actions. Proc. of IEEE Conference on Systems, Man and Cybernetics, Vol. I, The Hague, The Netherlands, pp. 592-597.
Black, M.J. & Yacoob, Y. (1997). Recognizing facial expressions in image sequences using local parameterized models of image motion. International Journal of Computer Vision, 25(1):23-48.
Breazeal, C. & Scassellati, B. (2002). Robots that Imitate Humans. Trends in Cognitive Science, Vol. 6, pp. 481-487.
Cañamero, L. & Gaussier, P. (2005). Emotion Understanding: Robots as Tools and Models. In Emotional Development: Recent Research Advances, pp. 235-258.
Cohen, I., Sebe, N., Garg, A., Chen, L. & Huang, T. (2003). Facial Expression Recognition from Video Sequences: Temporal and Static Modeling. Computer Vision and Image Understanding, 91(1-2):160-187.
Dornaika, F. & Davoine, F. (2006). On Appearance Based Face and Facial Action Tracking. IEEE Transactions on Circuits and Systems for Video Technology, 16(9):1107-1124.
Dornaika, F. & Raducanu, B. (2006). Recognizing Facial Expressions in Videos Using a Facial Action Analysis-Synthesis Scheme. Proc. of IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. N/A, Australia.
Fasel, B. & Luettin, J. (2003). Automatic Facial Expression Analysis: A Survey. Pattern Recognition, 36(1):259-275.
Picard, R., Vyzas, E. & Healy, J. (2001) Toward Machine Emotional Intelligence: Analysis of Affective Psychological State. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10):1175-1191.
Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of IEEE, 77(2):257-286.
Sung, J., Lee, S. & Kim, D. (2006). A Real-Time Facial Expression Recognition Using the STAAM. Proc. of International Conference on Pattern Recognition, Vol. I, pp. 275-278, Hong-Kong.
Tian, Y., Kanade T. & Cohn, J. (2001). Recognizing Action Units for Facial Expression Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, pp. 97-115.
Yeasin M., Bullot, B. & Sharma, R. (2006). Recognition of Facial Expressions and Measurement of Levels of Interest from Video. IEEE Transactions on Multimedia 8(3):500-508.
Zhang, Y. & Ji, Q. (2005). Active and Dynamic Information Fusion for Facial Expression Understanding from Image Sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):699-714.

KEY TERMS

3D Deformable Model: A model which is able to modify its shape while being acted upon by an external influence. In consequence, the relative position of any point on a deformable body can change.
Active Appearance Models (AAM): Computer Vision algorithm for matching a statistical model of object shape and appearance to a new image. The approach is widely used for matching and tracking faces.
AIBO: One of several types of robotic pets designed and manufactured by Sony. Able to walk, “see” its environment via camera, and recognize spoken commands, they are considered to be autonomous robots, since they are able to learn and mature based on external stimuli from their owner or environment, or from other AIBOs.
Autoregressive Models: Group of linear prediction formulas that attempt to predict the output of a system based on the previous outputs and inputs.
Facial Expression Recognition System: Computer-driven application for automatically identifying a person’s facial expression from a digital still or video image. It does so by comparing selected facial features in the live image with a facial database.
Hidden Markov Model (HMM): Statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications.
Human–Computer Interaction (HCI): The study of interaction between people (users) and computers. It is an interdisciplinary subject, relating computer science with many other fields of study and research (Artificial Intelligence, Psychology, Computer Graphics, Design).
Social Robot: An autonomous robot that interacts and communicates with humans by following the social rules attached to its role. This definition implies that a social robot has a physical embodiment. A consequence of the previous statements is that a robot that only interacts and communicates with other robots would not be considered to be a social robot.
Wireframe Model: The representation of all surfaces of a three-dimensional object in outline form.


Feature Selection

Noelia Sánchez-Maroño University of A Coruña, Spain Amparo Alonso-Betanzos University of A Coruña, Spain

INTRODUCTION

Many scientific disciplines use modelling and simulation processes and techniques in order to implement a non-linear mapping between the input and the output variables of a given system under study. Any variable that helps to solve the problem may be considered as an input. Ideally, any classifier or regressor should be able to detect important features and discard irrelevant ones, so that a pre-processing step to reduce dimensionality would not be necessary. Nonetheless, in many cases, reducing the dimensionality of a problem has certain advantages (Alpaydin, 2004; Guyon & Elisseeff, 2003), as follows:

• Performance improvement. The complexity of most learning algorithms depends on the number of samples and features (curse of dimensionality). By reducing the number of features, dimensionality is also decreased, and this may save on computational resources (such as memory and time) and shorten training and testing times.
• Data compression. There is no need to retrieve and store a feature that is not required.
• Data comprehension. Dimensionality reduction facilitates the comprehension and visualisation of data.
• Simplicity. Simpler models tend to be more robust when small datasets are used.

There are two main methods for reducing dimensionality: feature extraction and feature selection. This chapter reviews different feature selection (FS) algorithms, covering the main approaches: filter, wrapper and hybrid (a filter/wrapper combination).

BACKGROUND

Feature extraction and feature selection are the main methods for reducing dimensionality. In feature extraction, the aim is to find a new set of r dimensions that are a combination of the n original ones. The best known and most widely used unsupervised feature extraction method is principal component analysis (PCA); commonly used supervised methods are linear discriminant analysis (LDA) and partial least squares (PLS). In feature selection, a subset of r relevant features is selected from the original set of n features, and the remaining features are ignored. According to the evaluation function used, FS approaches can be broadly classified as filter or wrapper models (Kohavi & John, 1997). Filter models rely on the general characteristics of the training data to select features, whereas wrapper models require a predetermined learning algorithm to identify the features to be selected. Wrapper models tend to give better results, but when the number of features is large, filter models are usually chosen because of their computational efficiency. In order to combine the advantages of both models, hybrid algorithms have recently been proposed (Guyon et al., 2006).

FEATURE SELECTION

The advantages described in the Introduction section denote the importance of dimensionality reduction. Feature selection is also useful when the following assumptions are made:

• There are inputs that are not required to obtain the output.
• There is a high correlation between some of the input features.

A feature selection algorithm (FSA) looks for an optimal set of features, and consequently, a paradigm that describes the FSA is heuristic search. Since each state of the search space is a subset of features, an FSA can be characterised in terms of the following four properties (Blum & Langley, 1997):

• The initial state. This can be the empty set of features, the whole set or any random state.
• The search strategy. Although an exhaustive search leads to an optimal set of features, the associated computational and time costs are high when the number of features is high. Consequently, different search strategies are used so as to identify a good set of features within a reasonable time.
• The evaluation function used to determine the quality of each set of features. The goodness of a feature subset depends on the measure used. According to the literature, the following measures have been employed: information measures, distance measures, dependence measures, consistency measures, and accuracy measures.
• The stop criterion. An end point needs to be established; for example, the process should finish if the evaluation function has not improved after a new feature has been added/removed.

In terms of search method complexity, there are three main sub-groups (Salappa et al., 2007):

• Exponential strategies, involving an exhaustive search of all feasible solutions. Exhaustive search guarantees identification of an optimal feature subset but has a high computational cost. Examples are the branch and bound algorithms.
• Sequential strategies, based on a local search for solutions defined by the current solution state. Sequential search does not guarantee an optimal result, since the optimal solution could be in a region of the search space that is not searched. However, compared with exponential searching, sequential strategies have a considerably reduced computational cost. The best known strategies are sequential forward selection and sequential backward selection (SFS and SBS, respectively). SFS starts with an empty set of features and adds features one by one, while SBS begins with a full set and removes features one by one. Features are added or removed on the basis of improvements in the evaluation function. These approaches do not consider interactions between features; i.e., a feature may not reduce error by itself, but improvement may be achieved through the feature's link to another feature. Floating search (Pudil et al., 1994) partially solves this problem, in that the number of features included and/or removed at each stage is not fixed. Another approach (Sánchez et al., 2006) uses sensitivity indices (the importance of each feature is given in terms of the variance) to guide a backward elimination process, with several features discarded in one step.
• Random algorithms, which employ randomness to avoid local optimal solutions and enable temporary transition to other states with poorer solutions. Examples are simulated annealing and genetic algorithms.
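The four properties of an FSA and the sequential forward strategy described above can be sketched as follows. This is only a minimal illustration: the function name `sequential_forward_selection`, the `evaluate` callback and the toy scoring function are assumptions made for this example and do not come from the cited works. Note that an exhaustive search over n features would have to score 2^n subsets, whereas this greedy loop scores at most n(n+1)/2.

```python
from typing import Callable, FrozenSet, Sequence


def sequential_forward_selection(
    features: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Greedy SFS: start from the empty set and add one feature per step.

    `evaluate` is a hypothetical scoring function (higher is better); it
    stands in for any filter measure or wrapper accuracy estimate.
    """
    selected: FrozenSet[str] = frozenset()   # initial state: the empty set
    best_score = evaluate(selected)
    while len(selected) < len(features):
        # Search strategy: try every single-feature extension of the current set.
        candidates = [selected | {f} for f in features if f not in selected]
        scores = [evaluate(c) for c in candidates]
        top = max(range(len(candidates)), key=scores.__getitem__)
        # Stop criterion: no extension improves the evaluation function.
        if scores[top] <= best_score:
            break
        selected, best_score = candidates[top], scores[top]
    return selected


# Toy evaluation function (an assumption for illustration only): features
# "a" and "b" are useful, and every selected feature costs a small penalty.
def toy_score(subset: FrozenSet[str]) -> float:
    useful = {"a": 0.5, "b": 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.05 * len(subset)


chosen = sequential_forward_selection(["a", "b", "c", "d"], toy_score)
print(sorted(chosen))
```

Sequential backward selection would follow the same pattern, starting from the full set and removing one feature per step.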

The most popular FSA classification, which refers to the evaluation function, considers the three (Blum & Langley, 1997) or last two (Kohavi & John, 1997) groups, as follows:

• Embedded methods. The induction algorithm is simultaneously an FSA. Examples of this method are decision trees, such as classification and regression trees (CART), and artificial neural networks (ANN).
• Filter methods. Selection is carried out as a preprocessing step with no induction algorithm (Figure 1). The general characteristics of the training data are used to select features (for example, distances between classes or statistical dependencies). This model is faster than the wrapper approach (described below) and results in a better generalisation because it acts independently of the induction algorithm. However, it tends to select subsets with a high number of features (even all the features) and so a threshold is required to choose a subset.
• Wrapper methods. Wrapper models use the induction algorithm to evaluate each subset of features, i.e., the induction algorithm is part of the evaluation function in the wrapper model, which means this model is more precise than the filter model. It also takes account of techniques, such as cross-validation, that avoid over-fitting. However, wrapper models are very time consuming, which restricts application with some datasets. Moreover, although they may obtain good results with the inherent induction algorithm, they may perform poorly with an alternative algorithm.

Figure 1. Filter algorithm (diagram elements: feature selection, reduced set of features, induction algorithm, accuracy)

Hybrid methods that combine filter and wrapper methods have recently been attracting a great deal of attention in the FS literature (Liu & Motoda, 1998; Guyon et al., 2006). Although the following sections of this chapter are mainly devoted to filter and wrapper methods, a brief review of the most recent hybrid methods is also included.

Filter Methods

A number of representative filter algorithms are described in the literature, such as the χ² statistic, information gain, or correlation-based feature selection (CFS). For the sake of completeness, we will refer to two classical algorithms (FOCUS and RELIEF) and will describe very recently developed filter methods (FCBF and INTERACT). An exhaustive discussion of filter methods, including methods such as Random Forests (RF), an ensemble of tree classifiers, is provided in Guyon et al. (2006).

FOCUS

In FOCUS (Almuallim & Dietterich, 1991) all feature subsets of increasing size are evaluated until a suitable subset is encountered. Feature subset q is said to be suitable if there is no pair of examples that have different class values and the same values for all the features in q. The successor of this algorithm is FOCUS_2 (Almuallim & Dietterich, 1992), which prunes the search space, thereby evaluating only promising subsets. FOCUS_2 is therefore much faster than FOCUS. However, using both algorithms in domains with a large number of features may be computationally unfeasible. Consequently, search heuristics are used in different versions of the algorithm, resulting in good but not necessarily optimal solutions.

RELIEF

The RELIEF algorithm (Kira & Rendell, 1992) estimates the quality of attributes according to how well their values distinguish between instances that are near to each other. For this purpose, given a randomly selected instance, $x_s = \{x_{1s}, x_{2s}, \ldots, x_{ns}\}$, RELIEF searches for its two nearest neighbours: one from the same class, called nearest hit H, and the other from a different class, called nearest miss M. It then updates the quality estimate for all the features, depending on the values for $x_s$, M, and H. RELIEF can deal with discrete and continuous features but is limited to two-class problems. An extension, ReliefF, not only deals with multiclass problems but is also more robust and capable of dealing with incomplete and noisy data. ReliefF was subsequently adapted for continuous class (regression) problems, resulting in the RReliefF algorithm (Robnik-Sikonja & Kononenko, 2003).

FCBF and INTERACT

The fast correlation-based filter (FCBF) method (Yu & Liu, 2003) is based on symmetrical uncertainty (SU), which is defined as the ratio between the information gain (IG) and the entropy (H) of two features, x and y:

$$SU(x, y) = 2\,\frac{IG(x \mid y)}{H(x) + H(y)}$$

This method was designed for high-dimensionality data and has been shown to be effective in removing both irrelevant and redundant features. However, it fails to take into consideration the interaction between features. The INTERACT algorithm (Zhao & Liu, 2007) uses the same goodness measure, SU, but also includes the consistency contribution (c-contribution). It can thus handle feature interaction, and efficiently selects relevant features.
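The symmetrical uncertainty measure used by FCBF and INTERACT can be computed for discrete features as in the sketch below. It assumes the usual identity IG(x | y) = H(x) - H(x | y) = H(x) + H(y) - H(x, y) with entropies measured in bits; the function names and the convention of returning 0 for two constant features are choices made for this example, not part of the cited algorithms.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy H(x), in bits, of a discrete sample."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """SU(x, y) = 2 * IG(x | y) / (H(x) + H(y))."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0.0:
        return 0.0  # convention assumed here for two constant features
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(x, y)
    info_gain = h_x + h_y - h_xy      # equals H(x) - H(x | y)
    return 2.0 * info_gain / (h_x + h_y)


x = [0, 0, 1, 1]
print(symmetrical_uncertainty(x, [0, 0, 1, 1]))  # identical features
print(symmetrical_uncertainty(x, [0, 1, 0, 1]))  # independent features
```

SU is normalised to the range [0, 1]: it reaches 1 when either feature completely predicts the other and 0 when the two are independent, which is what makes it usable both for relevance (feature versus class) and redundancy (feature versus feature).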

Wrapper Methods

The idea of the wrapper approach is to select a feature subset using a learning algorithm as part of the evaluation function (Figure 2). Instead of using subset sufficiency, entropy or another explicitly defined evaluation function, a kind of "black box" function is used to guide the search. The evaluation function for each candidate feature subset returns an estimate of the quality of the model that is induced by the learning algorithm. This can be rather time consuming, since, for each candidate feature subset evaluated during the search, the target learning algorithm is usually applied several times (e.g., when 10-fold cross-validation is used to estimate model quality).

Figure 2. Wrapper algorithm (diagram elements: training data, feature search, set of features, feature evaluation, measure of goodness, induction algorithm, hypothesis, test data, evaluation, accuracy)

Here we briefly describe several feature subset selection algorithms, developed in machine learning, that are based on the wrapper approach. The literature is vast in this area and so we will just focus on the most representative wrapper models.

An interesting study of the wrapper approach was conducted by Kohavi & John (1997). Besides introducing the notion of strong and weak feature relevance, these authors showed the results achieved by different induction algorithms (ID3, C4.5, and naïve Bayes) with several search methods (best first, hill-climbing, etc.). Aha & Bankert (1995) used a wrapper approach in instance-based learning and proposed a new search strategy that performs beam search using a kind of backward elimination; that is, instead of starting with an empty feature subset, the search randomly selects a fixed number of feature subsets and starts with the best among them. Caruana & Freitag (1994) developed a wrapper feature subset selection method for decision tree induction, proposing bidirectional hill-climbing for the feature space as more effective than either forward or backward selection.

Genetic algorithms have been broadly adopted to search for the best subset of features in a wrapper fashion (Liu & Motoda, 1998; Huang et al., 2007). Feature selection methods using support vector machines (SVMs) have obtained satisfactory results (Weston et al., 2001). SVMs are also combined with other techniques to implement feature selection (different approaches are described in Guyon et al., 2006). Kim et al. (2003) use artificial neural networks (ANNs) for customer prediction and ELSA (Evolutionary Local Selection Algorithm) to search for promising subsets of features.
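The "black box" evaluation at the heart of the wrapper approach can be sketched as follows. This is an illustration under stated assumptions rather than any of the cited methods: the induction algorithm is a 1-nearest-neighbour classifier, folds are formed by taking every k-th example, and the names `wrapper_score` and `one_nn_predict` as well as the toy data are invented for the example.

```python
from typing import Sequence


def one_nn_predict(train_x, train_y, query):
    """1-nearest-neighbour prediction with squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in train_x]
    return train_y[dists.index(min(dists))]


def wrapper_score(X: Sequence[Sequence[float]], y: Sequence[int],
                  subset: Sequence[int], k: int = 5) -> float:
    """Estimate the quality of `subset` (column indices) by k-fold cross-validation."""
    if not subset:
        return 0.0  # convention assumed here: no features, no model
    # Project the data onto the candidate feature subset.
    Xs = [[row[j] for j in subset] for row in X]
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(len(Xs)) if i % k == fold]
        train_idx = [i for i in range(len(Xs)) if i % k != fold]
        train_x = [Xs[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        # The induction algorithm is applied once per fold: the costly part.
        correct += sum(one_nn_predict(train_x, train_y, Xs[i]) == y[i]
                       for i in test_idx)
    return correct / len(Xs)


# Toy data (hypothetical): column 0 separates the classes, column 1 is noise.
X = [[0.0, 5.0], [0.1, -4.0], [0.2, 3.0], [0.3, -2.0], [0.4, 1.0],
     [1.0, 4.5], [1.1, -3.5], [1.2, 2.5], [1.3, -1.5], [1.4, 0.5]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

print(wrapper_score(X, y, [0]), wrapper_score(X, y, [1]))
```

Any search strategy (forward, backward, best first, genetic) can call such a function to score candidate subsets; the cost of one call is k training runs, which is why wrappers become expensive when many subsets are explored.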

Hybrid Methods

The computational cost associated with the wrapper model makes it unfeasible when the number of features is high, whereas the performance of the filter model is often less than satisfactory. The hybrid model is a good combination of the two approaches that overcomes these problems. Hybrid methods use a filter to generate a ranked list of features. On the basis of the order thus defined, nested subsets of features are generated and computed by a learning machine, i.e., following a wrapper approach (Guyon et al., 2006). The main features of the hybrid model are depicted in Figure 3. One of the first hybrid approaches proposed was that of Yuan et al. (1999). Since then, the hybrid model has attracted the attention of the research community and, by now, numerous hybrid models have been developed to solve a variety of problems, such as intrusion detection, text categorisation, etc.

Because they combine filter and wrapper models, a great number of hybrid methods exist; it is not possible to include all of them, so we will refer only to some interesting ones. Some hybrid methods involving SVMs are presented in Guyon et al. (2006), chapters 20 and 22. Shazzad & Park (2005) investigate a fast hybrid method, a fusion of Correlation-based Feature Selection, Support Vector Machine and Genetic Algorithm, to determine an optimal feature set. A feature selection model based both on information theory and statistical tests is presented by Sebban & Nock (2002). Zhu et al. (2007) incorporate a filter ranking method in a genetic algorithm to improve classification performance and accelerate the search process.
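The generic hybrid scheme described in this section (a filter produces a ranking, and a learning machine then evaluates the nested subsets) can be sketched as below. The specific choices are assumptions made for the example and are not taken from the cited papers: the filter score is the absolute Pearson correlation with the class label, the learner is a leave-one-out 1-nearest-neighbour classifier, and the data are a hypothetical toy set.

```python
import math
from typing import List, Sequence, Tuple


def abs_correlation(col: Sequence[float], y: Sequence[float]) -> float:
    """Filter score: absolute Pearson correlation between a feature and the target."""
    n = len(col)
    mx, my = sum(col) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
    var_x = sum((a - mx) ** 2 for a in col)
    var_y = sum((b - my) ** 2 for b in y)
    return abs(cov) / math.sqrt(var_x * var_y) if var_x > 0 and var_y > 0 else 0.0


def loo_1nn_accuracy(X, y, subset) -> float:
    """Wrapper score: leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    hits = 0
    for i, row in enumerate(X):
        best_j = min((j for j in range(len(X)) if j != i),
                     key=lambda j: sum((row[f] - X[j][f]) ** 2 for f in subset))
        hits += y[best_j] == y[i]
    return hits / len(X)


def hybrid_select(X, y) -> Tuple[List[int], float]:
    """Rank features with the filter, then score the nested subsets with the wrapper."""
    n_features = len(X[0])
    ranking = sorted(range(n_features),
                     key=lambda f: abs_correlation([r[f] for r in X], y),
                     reverse=True)
    best_subset, best_acc = [], -1.0
    for size in range(1, n_features + 1):   # only n subsets instead of 2**n
        subset = ranking[:size]
        acc = loo_1nn_accuracy(X, y, subset)
        if acc > best_acc:                  # ties keep the smaller subset
            best_subset, best_acc = subset, acc
    return best_subset, best_acc


# Toy data (hypothetical): column 0 separates the classes, column 1 is noise.
X = [[0.0, 5.0], [0.1, -4.0], [0.2, 3.0], [0.3, -2.0], [0.4, 1.0],
     [1.0, 4.5], [1.1, -3.5], [1.2, 2.5], [1.3, -1.5], [1.4, 0.5]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

subset, accuracy = hybrid_select(X, y)
print(subset, accuracy)
```

The design point is that the learner is trained on only n nested subsets, so the cost stays close to that of a filter while the final choice is still validated by the induction algorithm.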

Figure 3. Hybrid algorithm (diagram elements: complete set of features, filter, reduced set of features, wrapper with feature subset search and induction algorithm)

FUTURE TRENDS

Feature selection is too broad a topic to be discussed fully in a short chapter. To pinpoint new topics in this area we refer the reader to the suggestions given by Guyon et al. (2006), summarised as follows:

• Unsupervised variable selection. Although this chapter has focused on supervised feature selection, several authors have attempted to implement feature selection for clustering applications (see, for example, Dy & Brodley, 2004). For supervised learning tasks, one may want to pre-filter a set of the most significant variables with respect to a criterion which does not make use of y to minimise the problem of over-fitting.
• Selection of examples. Mislabelled examples may induce a choice of wrong variables, so it may be preferable to jointly select both variables and examples.
• System reverse engineering. This chapter focuses on the problem of selecting features useful to build a good predictor. Unravelling the causal dependencies between variables and reverse engineering the system that produced the data is a far more challenging task that is beyond the scope of this chapter (but see, for example, Pearl, 2000).

CONCLUSION

Feature selection for classification and regression is a major research topic in machine learning. It covers many different fields, such as text categorisation, intrusion detection, and micro-array data. This study reviews key algorithms used for feature selection, including filter, wrapper and hybrid approaches. The review is not exhaustive and is merely designed to give an idea of the state of the art in the field. Most feature selection algorithms lead to significant reductions in the dimensionality of the data without sacrificing the performance of the resulting models. Choosing between approaches depends on the problem at hand. Adopting a filtering approach is computationally acceptable, but the more complex wrapper approach tends to produce greater accuracy in the final result. The filtering approach is very flexible, since any target learning algorithm can be used, and it is also faster than the wrapper approach. The latter, on the other hand, is more dependent on the learning algorithm, but its selection process is better. The hybrid approach offers promise of improving both classification accuracy and the identification of relevant attributes for the analysis.

REFERENCES

Aha, D. W. & Bankert, R. L. (1995). A comparative evaluation of sequential feature selection algorithms. Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, 1-7. Springer-Verlag.

Almuallim, H. & Dietterich, T. G. (1991). Learning with many irrelevant features. Proceedings of the 9th National Conference on Artificial Intelligence, 547-552. AAAI Press.

Almuallim, H. & Dietterich, T. G. (1992). Efficient algorithms for identifying relevant features. Proceedings of the 9th Canadian Conference on Artificial Intelligence, 38-45, Vancouver.

Alpaydin, E. (2004). Introduction to Machine Learning. MIT Press.

Blum, A. L. & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, (97) 1-2, 245-271.

Caruana, R. & Freitag, D. (1994). Greedy attribute selection. Proceedings of the Eleventh International Conference on Machine Learning, 28-36. Morgan Kaufmann Publishers, Inc.

Dy, J. G. & Brodley, C. E. (2004). Feature selection for unsupervised learning. Journal of Machine Learning Research, (5), 845-889.

Guyon, I. & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, (3), 1157-1182.

Guyon, I., Gunn, S., Nikravesh, M. & Zadeh, L. A. (2006). Feature Extraction. Foundations and Applications. Springer.

Huang, J., Cai, Y. & Xu, X. (2007). A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recognition Letters, (28) 13, 1825-1844.

Kim, Y., Street, W. N. & Menczer, F. (2003). Feature selection in data mining. Data mining: opportunities and challenges, 80-105. IGI Publishing.

Kira, K. & Rendell, L. (1992). The feature selection problem: traditional methods and new algorithm. Proc. AAAI'92, San Jose, CA.

Kohavi, R. & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, (97) 1-2, 273-324.

Liu, H. & Motoda, H. (1998). Feature extraction, construction and selection. A data mining perspective. Kluwer Academic Publishers.

Pearl, J. (2000). Causality. Cambridge University Press.

Pudil, P., Novovicova, J. & Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, (15) 11, 1119-1125.

Robnik-Sikonja, M. & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, (53), 23-69. Kluwer Academic Publishers.

Salappa, A., Doumpos, M. & Zopounidis, C. (2007). Feature selection algorithms in classification problems: an experimental evaluation. Optimization Methods and Software, (22) 1, 199-212.

Sánchez-Maroño, N., Caamaño-Fernández, M., Castillo, E. & Alonso-Betanzos, A. (2006). Functional networks and analysis of variance for feature selection. Proceedings of International Conference on Intelligent Data Engineering and Automated Learning, 1031-1038.

Sebban, M. & Nock, R. (2002). A hybrid filter/wrapper approach of feature selection using information theory. Pattern Recognition, (35) 4, 835-846.

Shazzad, K. M. & Park, J. S. (2005). Optimization of intrusion detection through fast hybrid feature selection. International Conference on Parallel and Distributed Computing, Applications and Technologies, 264-267.

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T. & Vapnik, V. (2001). Feature selection for SVMs. Advances in Neural Information Processing Systems, (13). MIT Press.

Yu, L. & Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. Proceedings of the Twentieth International Conference on Machine Learning, 856-863.

Yuan, H., Tseng, S. S., Gangshan, S. & Fuyan, Z. (1999). Two-phase feature selection method using both filter and wrapper. Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, (2) 132-136.

Zhao, Z. & Liu, H. (2007). Searching for interacting features. Proceedings of International Joint Conference on Artificial Intelligence, 1157-1161.

Zhu, Z., Ong, Y. & Dash, M. (2007). Wrapper-filter feature selection algorithm using a memetic framework. IEEE Transactions on Systems, Man and Cybernetics, Part B, (37) 1, 70-76.

KEY TERMS

Dimensionality Reduction: The process of reducing the number of features under consideration. The process can be classified in terms of feature selection and feature extraction.

Feature Extraction: A dimensionality reduction method that finds a reduced set of features that are a combination of the original ones.

Feature Selection: A dimensionality reduction method that consists of selecting a subset of relevant features from a complete set while ignoring the remaining features.

Filter Method: A feature selection method that relies on the general characteristics of the training data to select and discard features. Different measures can be employed: distance between classes, entropy, etc.

Hybrid Method: A feature selection method that combines the advantages of wrapper and filter methods to deal with high-dimensionality data.

Sequential Backward (Forward) Selection (SBS/SFS): A search method that starts with all the features (an empty set of features) and removes (adds) a single feature at each step with a view to improving, or minimally degrading, the cost function.

Wrapper Method: A feature selection method that uses a learning machine as a "black box" to score subsets of features according to their predictive value.

Feed-Forward Artificial Neural Network Basics Lluís A. Belanche Muñoz Universitat Politècnica de Catalunya, Spain

The answer to the theoretical question: “Can a machine be built capable of doing what the brain does?” is yes, provided you specify in a finite and unambiguous way what the brain does. Warren S. McCulloch

INTRODUCTION

The class of adaptive systems known as Artificial Neural Networks (ANN) was motivated by the amazing parallel processing capabilities of biological brains (especially the human brain). The main driving force was to re-create these abilities by constructing artificial models of the biological neuron. The power of biological neural structures stems from the enormous number of highly interconnected simple units. The simplicity comes from the fact that, once the complex electro-chemical processes are abstracted, the resulting computation turns out to be conceptually very simple.

In the ANN paradigm, these artificial neurons nowadays have little in common with their biological counterparts. Rather, they are primarily used as computational devices, clearly intended for problem solving: optimization, function approximation, classification, time-series prediction and others. In practice few elements are connected and their connectivity is low. This chapter focuses on supervised feed-forward networks. The field has become so vast that a complete and clear-cut description of all the approaches is an enormous undertaking; we refer the reader to Fiesler & Beale (1997) for a comprehensive exposition.

BACKGROUND

Artificial Neural Networks (Bishop, 1995; Haykin, 1994; Hertz, Krogh & Palmer, 1991; Hecht-Nielsen, 1990) are information processing structures without global or shared memory, in which each computing element operates only when all its incoming information is available: a kind of data-flow architecture. Each element is a simple processor with internal, adjustable parameters. The interest in ANN is primarily related to finding satisfactory solutions for problems cast as function approximation tasks, for which there is little or no knowledge about the process itself but (limited) access to examples of its response. They have been widely and most fruitfully used in a variety of applications (see Fiesler & Beale, 1997, for a comprehensive review), especially after the works of Hopfield (1982), Rumelhart, Hinton & Williams (1986) and Fukushima (1980), which boosted the field.

The most general form for an ANN is a labelled directed graph, where each of the nodes (called units or neurons) has a certain computing ability and is connected to and from other nodes in the network via labelled edges. The edge label is a real number expressing the strength with which the two involved units are connected. These labels are called weights. The architecture of a network refers to the number of units, their arrangement and connectivity. In its basic form, the computation of a unit i is expressed as a function $F_i$ of its input (the transfer function), parameterized by its weight vector or local information. The whole system is thus a collection of interconnected elements, and the transfer function performed by a single one (i.e., the neuron model) is the most important fixed characteristic of the system.

There are two basic types of neuron models used in practice in the literature. Both express the overall computation of the unit as the composition of two functions, as has classically been done since the early model proposed by McCulloch & Pitts (1943):

$$F_i(x) = \{\, g(h(x, w_i)),\ w_i \in \mathbb{R}^n \,\}, \qquad x \in \mathbb{R}^n \tag{1}$$

where $w_i$ is the weight vector of neuron i, $h: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is called the net input or aggregation function, and $g: \mathbb{R} \to \mathbb{R}$ is called the activation function. All neuron parameters are included in its weight vector.

The choice $h(x, w_i) = x \cdot w_i + \theta$, where $\theta \in \mathbb{R}$ is an offset term that may be included in the weight vector, leads to one of the most widely used neuron models. When neurons of this type are arranged in a feed-forward architecture, the resulting neural network is called a MultiLayer Perceptron (MLP) (Rumelhart, Hinton & Williams, 1986). Usually, a smooth, non-linear and monotonic function is used as activation. Among these, the sigmoids are a preferred choice.

The choice $h(x, w_i) = \|x - w_i\| / \theta$ (or another distance measure), with $\theta > 0$ a smoothing term, plus an activation g with a monotonically decreasing response from the origin, leads to the wide family of localized Radial Basis Function (RBF) networks (Poggio & Girosi, 1989). Localized means that the units give a significant response only in a neighbourhood of their centre $w_i$. A Gaussian $g(z) = \exp(-z^2/2)$ is a preferred choice for the activation function.

The previous choices can be extended to take into account extra correlations between input variables. The inner product (containing no cross-product terms) can be generalized to a real quadratic form (a homogeneous polynomial of second degree with real coefficients) or even further to higher degrees, leading to the so-called higher-order units (or Σ-Π units). A higher-order unit of degree k includes all possible cross-products of at most k input variables, each with its own weight. Similarly, basic Euclidean distances can be generalized to completely weighted distance measures, where all the (quadratic) cross-products are included. These full expressions are not commonly used because of the high number of free parameters they involve.

These two basic neuron models have traditionally been regarded as completely separate, both from a mathematical and a conceptual point of view. To a certain degree, this is true: the local vs. global approaches to function approximation that they carry out make them apparently quite opposite methods (see Fig. 1). Mathematically, under certain conditions, they can be shown to be related (Dorffner, 1995). These conditions (basically, that both input and weight vectors are normalized to unit norm) are difficult to fulfil in practice.

Figure 1. A classification problem. Left: Separation by spherical RBF units (R-neurons). Right: Separation by straight lines (P-neurons) in the MLP.

A layer is defined as a collection of independent units (not connected with one another) sharing the same input and having the same functional form (same $F_i$ but different $w_i$). Multilayer feed-forward networks take the form of directed acyclic graphs obtained by concatenation of a number of layers. All the layers but the last (called the output layer) are labelled as hidden. This kind of network (shown in Fig. 2) computes a parameterized function $F_w(x)$ of its input vector x by evaluating the layers in order, giving as final outcome the output of the last layer. The vector w represents the collection of all the weights (free parameters) in the network. For simplicity, we do not consider connections between non-adjacent layers (skip-layer connections) and otherwise assume total connectivity. The set of input variables is not counted as a layer.

Output neurons take the form of a scalar product (a linear combination), possibly followed by an activation function g. For example, assuming a single output neuron, a one-hidden-layer neural network with h hidden units computes a function $F: \mathbb{R}^n \to \mathbb{R}$ of the form:

$$F_w(x) = g\left(\sum_{i=1}^{h} c_i F_i(x) - \theta\right) \tag{2}$$

where $\theta \in \mathbb{R}$ is an offset term (called the bias term), $c_i \in \mathbb{R}$, and g can be set as desired, including the choice $g(z) = z$. Such a feed-forward network has $\dim(w) = (n+1)h + h + 1$ parameters to be adjusted.

Figure 2. A two-hidden-layer example of ANN, mapping a three-dimensional input space $x = (x_1, x_2, x_3)$ to a two-dimensional output space $(y_1, y_2) = F_w(x)$. The network has four and three units in the first and second hidden layers, respectively, and two output neurons. The vector w represents the collection of all the weights in the network.
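The one-hidden-layer function of equation (2) and its parameter count can be illustrated with the short sketch below. It assumes logistic hidden units of the scalar-product type and the identity output activation g(z) = z; the function and argument names are hypothetical and chosen only for this example.

```python
import math
from typing import List, Sequence


def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def one_hidden_layer(x: Sequence[float],
                     hidden_w: List[List[float]],   # h weight vectors w_i in R^n
                     hidden_b: List[float],         # h offsets, one per hidden unit
                     c: List[float],                # output coefficients c_i
                     theta: float) -> float:
    """Evaluate F_w(x) = g(sum_i c_i F_i(x) - theta) with g(z) = z at the output."""
    hidden = [logistic(sum(wj * xj for wj, xj in zip(w, x)) + b)
              for w, b in zip(hidden_w, hidden_b)]
    return sum(ci * fi for ci, fi in zip(c, hidden)) - theta


def parameter_count(n: int, h: int) -> int:
    """dim(w) = (n + 1) h + h + 1 for n inputs, h hidden units and one output."""
    return (n + 1) * h + h + 1


n, h = 3, 4
hidden_w = [[0.0] * n for _ in range(h)]   # all-zero weights: every hidden unit outputs 0.5
hidden_b = [0.0] * h
c = [1.0] * h
theta = 0.0

output = one_hidden_layer([1.0, 2.0, 3.0], hidden_w, hidden_b, c, theta)
stored = sum(len(w) for w in hidden_w) + len(hidden_b) + len(c) + 1
print(output, stored, parameter_count(n, h))
```

Counting the stored numbers directly (the hidden weights, the hidden offsets, the output coefficients and the output bias) gives the same total as the closed-form expression for dim(w).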

FEED-FORWARD NEURAL NETWORKS

The RBF and MLP networks provide parameterized families of functions suitable for function approximation on multidimensional spaces. A sigmoid neuron defines a hyperplane that divides its input space into two halves. In other words, the points of equal neuron activation (with fixed weights) are hyperplanes. This behaviour is not caused by the sigmoid, but by the scalar product. The isoactivation contours for an RBF unit (in the case of an unweighted Euclidean norm) are hyperspheres. The radially symmetric and centred response is not caused by the activation function (e.g., Gaussian or exponential) but by the norm. In both cases, the activation function acts as a non-linear monotonic distortion of its argument as computed by the aggregation function.

Definition (Isoactivation set). Given a real function $f: \mathbb{R}^n \to (a, b)$, define $I_f^{\alpha}$ for $\alpha \in (a, b)$ as the set of isoactivation points $I_f^{\alpha} = \{x \in \mathbb{R}^n \mid f(x) = \alpha\}$.
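The claim that the isoactivation sets are determined by the aggregation function, and not by the activation, can be checked numerically. The sketch below assumes a logistic activation for the scalar-product unit and the Gaussian $g(z) = \exp(-z^2/2)$ with aggregation $\|x - w\|/\theta$ for the RBF unit; the function names and the sample points are illustrative choices.

```python
import math
from typing import Sequence


def scalar_product_unit(x: Sequence[float], w: Sequence[float], theta: float) -> float:
    """Scalar-product aggregation followed by a logistic activation."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + theta
    return 1.0 / (1.0 + math.exp(-net))


def rbf_unit(x: Sequence[float], w: Sequence[float], theta: float) -> float:
    """Euclidean-distance aggregation followed by a Gaussian activation."""
    net = math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w))) / theta
    return math.exp(-net ** 2 / 2.0)


w = [1.0, 2.0]
# Two points with the same scalar product w.x = 5 lie on one hyperplane (a line in 2-D).
p1 = scalar_product_unit([5.0, 0.0], w, 0.0)
p2 = scalar_product_unit([1.0, 2.0], w, 0.0)
# Two points at the same distance 1 from the centre w lie on one hypersphere (a circle).
r1 = rbf_unit([2.0, 2.0], w, 1.0)
r2 = rbf_unit([1.0, 3.0], w, 1.0)
print(p1 == p2, r1 == r2)
```

Replacing either activation by any other function of the net input leaves these equalities intact, since the two points in each pair already share the same aggregated value.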

Definition (P-neuron). A neuron model $F_i$ of the form:

$$F_i(x) = \{\, g(w_i \cdot x + \theta_i),\ w_i \in \mathbb{R}^n,\ \theta_i \in \mathbb{R} \,\} \tag{3}$$

with g a bounded, non-linear and increasing function for which $\lim_{z \to \infty} g(z) = g_{max} \in \mathbb{R}$ and $\lim_{z \to -\infty} g(z) = g_{min} \in \mathbb{R}$ will be denoted P-neuron (from Perceptron). For these neurons, the sets $I_{F_i}^{\alpha}$ are (n-1)-dimensional hyperplanes for constant values of α, parallel with one another for different α. In practice, the g are usually the well-behaved sigmoids, though other activation functions are sometimes found in the literature (e.g., sinusoid). The latter are not included in the above Definition.

Definition (R-neuron). A neuron model $F_i$ of the form:

$$F_i(x) = \{\, \tfrac{1}{\theta_i}\, g(\|x - w_i\|_q),\ w_i \in \mathbb{R}^n,\ \theta_i > 0,\ q \ge 1 \,\} \tag{4}$$

where $\|\cdot\|$ is a norm and g is a symmetric function such that $g(|z|)$ is monotonic, with a maximum $g_{max}$ at $F_i(w_i)$ and a (possibly asymptotically reached) minimum $g_{min} = 0$, will be denoted R-neuron (from Radial). For these neurons, the sets $I_{F_i}^{\alpha}$ are (n-1)-dimensional hypersurfaces (centred at $w_i$) for constant values of α (e.g., cross-polytopes, i.e. diamond shapes, for q=1; hyperspheres for q=2), concentric with one another for different α. The norm used can be any Minkowskian norm of the form:

$$\|z\|_q = \left(\sum_{i=1}^{n} |z_i|^q\right)^{1/q}, \qquad q \ge 1 \tag{5}$$

In practice, typical choices are q=2 and g a Gaussian function.

Figure 3. The logistic function $l(z) = g_{\log}^{1.5}(z)$ and its first derivative $l'(z) > 0$. The derivative is maximum at the origin, corresponding to a medium activation at $l(0) = 0.5$. This point acts as an initial "neutral" value around a quasi-linear slope.

Due to their widespread use, we present two of the most popular sigmoids and show how they are tightly related. A sigmoid function g can be defined as a monotonically increasing function exhibiting smoothness and asymptotic properties. The two most commonly found representatives are the logistic:

1 ∈ (0,1) 1 + exp(− ( z − )) and the hyperbolic tangent:

(6)

gtanhβ(z)=

exp( ( z − )) − exp(− ( z − )) ∈ (-1,1) exp( ( z − )) + exp(− ( z − )) (7)

The offset θ is in practice set to zero, because its function is the same as that of the bias term in the aggregation function in (3). These two families of functions can be made exactly the same shape (assuming θ=0) by making the β in (6) be twice the value of the β in (7). For instance, for β=0.5:

$$g_{\tanh}^{0.5}(z) = g_{\tanh}^{1}(z/2) = \frac{1 - \exp(-z)}{1 + \exp(-z)} = 2\, g_{\log}^{1}(z) - 1 \qquad (8)$$

is the bipolar version of $g_{\log}^{1}(z) = \frac{1}{1 + \exp(-z)}$. These functions are chosen because of their simple analytic behaviour, especially as regards differentiability, which is of great importance for learning algorithms relying on derivative information (Fletcher, 1980). In particular,

$$(g_{\log}^{\beta})'(z) = \beta\, g_{\log}^{\beta}(z)\, (1 - g_{\log}^{\beta}(z)) \qquad (9)$$
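The relation between the two sigmoid families and the derivative formula (9) can be checked numerically. The following is a small sketch; the function names and the finite-difference step are illustrative choices, not part of the original text.

```python
import math

def g_log(z, beta=1.0, theta=0.0):
    """Logistic sigmoid (6)."""
    return 1.0 / (1.0 + math.exp(-beta * (z - theta)))

def g_tanh(z, beta=1.0, theta=0.0):
    """Hyperbolic tangent sigmoid (7)."""
    return math.tanh(beta * (z - theta))

def g_log_prime(z, beta=1.0, theta=0.0):
    """Derivative of the logistic, written as in (9)."""
    s = g_log(z, beta, theta)
    return beta * s * (1.0 - s)

def numeric_derivative(func, z, h=1e-6):
    """Central finite difference, used only to cross-check (9)."""
    return (func(z + h) - func(z - h)) / (2.0 * h)

for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    # (8): tanh with beta = 0.5 is the bipolar version of the logistic with beta = 1
    assert math.isclose(g_tanh(z, 0.5), 2.0 * g_log(z, 1.0) - 1.0, abs_tol=1e-12)
    # (9): analytic derivative agrees with the finite-difference estimate
    numeric = numeric_derivative(lambda t: g_log(t, 1.5), z)
    assert math.isclose(numeric, g_log_prime(z, 1.5), rel_tol=1e-5)
```

The derivative evaluated at z=0 with β=1.5 is 1.5·0.5·0.5 = 0.375, the largest value it takes anywhere.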


The interest in sigmoid functions also lies in the behaviour of their derivatives. Consider, for example, (6) with β=1.5 and θ=0, plotted in Fig. 3. The derivative of a sigmoid is always positive. For θ=0, all the functions are centred at z=0. At this point, the function has a medium activation and its derivative is maximum, allowing for maximum weight updates.

Types of Artificial Neural Networks

A fundamental distinction to categorize a neural network relies on the kind of architecture, basically divided in feed-forward (for which the graph contains no cycles) and recurrent (the rest of situations). A very common feed-forward architecture contains no intra-layer connections and all possible inter-layer connections between adjacent layers.

Definition (Feed-forward neural network: structure). A bipartitioned graph is a graph G whose nodes V can be partitioned in two disjoint and proper sets V1 and V2, V1∪V2=V, in such a way that no pair of nodes in V1 is joined by an edge, and the same property holds for V2. We write then $G_{n_1,n_2}$, with n1=|V1|, n2=|V2|. A bipartitioned graph $G_{n_1,n_2}$ is complete if every node in V1 is connected to every node in V2. These concepts can be generalized to an arbitrary number of partitions, as follows: a k-partitioned graph $G_{n_1,\ldots,n_k}$ is a graph whose nodes V can be partitioned in k disjoint and proper sets V1,...,Vk, such that

$$\bigcup_{i=1}^{k} V_i = V,$$

in a way that no pair of nodes in Vi is joined by an edge, for all 1≤ i≤ k. In these conditions, a feed-forward fully connected neural network with c hidden layers and hl units per layer l, 1≤ l≤ c+1, takes the form of a directed complete (c+1)-partitioned graph $G_{h_1,\ldots,h_{c+1}}$.

Definition (Feed-forward neural network: function). A feed-forward neural network consisting of c hidden layers, denoted FFNN(n,c,m), is a function Fw: Rn→Rm made up of pieces of the form $y^{(l)} = (F^l_1(y^{(l-1)}), \ldots, F^l_{h_l}(y^{(l-1)}))$, representing the output of layer l, for 1≤ l≤ c+1. The $F^l$ denote the neuron model of layer l and $h_l \in \mathbb{N}^+$ their number, and each neuron $F^l_i$ has its own parameters $w^{(l)}_i$ as in Definitions 2 and 3, which are collectively grouped in the network parameters w. The first output is defined as $y^{(0)} = x$. For the last (output) layer, $h_{c+1} = m$ and the $F^{c+1}_l$, $1 \le l \le h_{c+1}$, are P-neurons or linear units (obtained by removing the activation function in a P-neuron). The final outcome for Fw(x) is the value of $y^{(c+1)}$.

Definition (MLPNN). A MultiLayer Perceptron Neural Network is a FFNN(n,c,m) for which c≥1 and all the $F^l$ are P-neurons, 1 ≤ l ≤ c.

Definition (RBFNN). A Radial Basis Function Neural Network is a FFNN(n,c,m) for which c=1 and all the $F^c$ are R-neurons.

LEARNING IN ARTIFICIAL NEURAL NETWORKS

A system can be said to learn if its performance on a given task improves with respect to some measure as a result of experience (Rosenblatt, 1962). In ANNs the "experience" is the result of exposure to a training set of data, accompanied with weight modifications. The main problem tackled in supervised learning is regression, the approximation of an n-dimensional function f: X⊂ Rn→Rm by finite superposition (composition and addition) of known parameterized base functions, like those in (3) or (4). Their combination gives rise to expressions of the form Fw(x). The interest is in finding a parameter vector w* of size s such that Fw*(x) optimizes a cost functional L(f, Fw) called the loss:

$$w^* = \arg\min_{w \in \mathbb{R}^s} L(f, F_w) \qquad (10)$$

The only information available is a finite set D of p noisy samples of f, D={<xi,yi>, f(xi)+εi=yi}, where xi∈Rn is the stimulus, yi∈Rm is the target, εi is the noise (assumed additive) and |D|=p. An estimation of L(f, Fw) can be obtained as $\tilde{L}(D, F_w)$, the apparent loss, computed separately for each sample in D:

$$\tilde{L}(D, F_w) = \sum_{(x_i, y_i) \in D} \lambda(y_i, F_w(x_i)) \qquad (11)$$



A common form for λ is an error function, such as the squared error λ(a,b)=(a-b)^2. This choice results from the assumption that the noise follows a homoscedastic Gaussian distribution with zero mean. When using this error, the expression (11) can be viewed as the (squared) Euclidean norm in Rp of the p-dimensional error vector e=(e1,...,ep), known as the sum-of-squares error, with ei=yi-Fw(xi):

$$\tilde{L}(D, F_w) = \sum_{(x_i, y_i) \in D} (y_i - F_w(x_i))^2 = e \cdot e = \|e\|^2 \qquad (12)$$
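As an illustration of the layered function Fw and the sum-of-squares error (12), the following sketch evaluates a small multilayer perceptron with logistic hidden units and linear output units. The data layout (a list of weight-matrix and bias-vector pairs) and all names are hypothetical choices made for this example.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    return z

def layer_output(y_prev, weights, biases, activation):
    """Output y(l) of one layer: every unit applies g to w.y(l-1) + bias."""
    return [activation(sum(w * y for w, y in zip(w_row, y_prev)) + b)
            for w_row, b in zip(weights, biases)]

def mlp_forward(x, layers):
    """Evaluate Fw(x); hidden layers are P-neurons, the last layer is linear."""
    y = list(x)  # y(0) = x
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        y = layer_output(y, weights, biases, identity if is_output else logistic)
    return y  # y(c+1)

def sum_of_squares(data, layers):
    """Apparent loss (12) over a data set of (input, target) pairs."""
    total = 0.0
    for x, target in data:
        output = mlp_forward(x, layers)
        total += sum((t - o) ** 2 for t, o in zip(target, output))
    return total

# A FFNN(2,1,1) whose two hidden units have zero weights, so each outputs 0.5.
layers = [([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # hidden layer, h1 = 2
          ([[1.0, 1.0]], [0.0])]                   # linear output layer, m = 1
data = [([0.0, 0.0], [1.0]), ([1.0, 1.0], [2.0])]
mse = sum_of_squares(data, layers) / len(data)
```

With these weights the network outputs 1.0 for every input, so the errors on the two samples are 0 and 1 and the mean square error is 0.5.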

The usually reported quantity ||e||²/p is called mean square error (MSE), and is a measure of the empirical error (as opposed to the unknown true error). We shall denote the error function $E(w) = \tilde{L}(D, F_w)$. In a training process, the network builds an internal representation of the target function by finding ways to combine the set of base functions {Fi(x)}i. The validity of a solution is mainly determined by an acceptably low and balanced $\tilde{L}(D, F_w)$ and $\tilde{L}(D_{out}, F_w)$, for any Dout ⊂ X\D (where Dout is not used in the learning process), to ensure that f has been correctly estimated from the data. Network models that are too inflexible or simple or, on the contrary, too flexible or complex will generalize inadequately. This is reflected in the bias-variance tradeoff: the expected loss for finite samples can be decomposed in two opposing terms called error bias and error variance (Geman, Bienenstock & Doursat, 1992). The expectation for the sum-of-squares error function, averaged over the complete ensemble of data sets D, is written as (Bishop, 1995):

$$E(w) = E_D\{(F_w(x) - \langle y|x \rangle)^2\} = (E_D\{F_w(x)\} - \langle y|x \rangle)^2 + E_D\{(F_w(x) - E_D\{F_w(x)\})^2\} \qquad (13)$$

where $E_D$ is the expectation operator taken over every data set of the same size as D and $\langle y|x \rangle$ denotes the conditional average of the target y=f(x) (which expresses the optimal network mapping), given by:

$$\langle y|x \rangle = \int y\, p(y|x)\, dy$$

The first term in the right hand side of (13) is the (squared) bias and the second is the variance. The bias measures the extent to which the average (over all D) of Fw(x) differs from the desired target function $\langle y|x \rangle$. The variance measures the sensitivity of Fw(x) to the particular choice of D. Too inflexible or simple models will have a large bias, while too flexible or complex ones will have a large variance. These are complementary quantities that have to be minimized simultaneously; both can be shown to decrease with increasing availability of larger data sets D. The expressions in (13) are functions of an input vector x. The average values for bias and variance can be obtained by weighting with the corresponding density p(x):

$$\int E_D\{(F_w(x) - \langle y|x \rangle)^2\}\, p(x)\, dx = \int (E_D\{F_w(x)\} - \langle y|x \rangle)^2\, p(x)\, dx + \int E_D\{(F_w(x) - E_D\{F_w(x)\})^2\}\, p(x)\, dx \qquad (14)$$
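The decomposition in (13) can be estimated by simulation at a single input x0. The sketch below is only illustrative: the target sin(2πx), the noise level, and the two deliberately extreme models (a constant predictor and a nearest-neighbour predictor standing in for "too simple" and "too flexible" networks) are assumptions and do not come from the text.

```python
import math
import random

def f(x):
    """Assumed target function; with zero-mean noise, <y|x> = f(x)."""
    return math.sin(2.0 * math.pi * x)

def sample_dataset(p, sigma, rng):
    """Draw one data set D of p noisy samples of f."""
    xs = [rng.random() for _ in range(p)]
    return [(x, f(x) + rng.gauss(0.0, sigma)) for x in xs]

def fit_constant(data):
    """Very inflexible model: always predicts the mean target."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def fit_nearest(data):
    """Very flexible model: copies the target of the closest sample."""
    return lambda x: min(data, key=lambda d: abs(d[0] - x))[1]

def bias_variance(fit, x0, trials=2000, p=20, sigma=0.3, seed=0):
    """Monte Carlo estimate of squared bias, variance and expected error at x0."""
    rng = random.Random(seed)
    preds = [fit(sample_dataset(p, sigma, rng))(x0) for _ in range(trials)]
    mean_pred = sum(preds) / trials
    bias2 = (mean_pred - f(x0)) ** 2
    variance = sum((q - mean_pred) ** 2 for q in preds) / trials
    expected_error = sum((q - f(x0)) ** 2 for q in preds) / trials
    return bias2, variance, expected_error
```

Calling `bias_variance(fit_constant, 0.25)` and `bias_variance(fit_nearest, 0.25)` should show the constant model dominated by bias and the nearest-neighbour model dominated by variance, while in both cases the expected error equals the sum of the two terms up to rounding.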

Key conditions for acceptable performance on novel data are given by a training set D as large and representative as possible of the underlying distribution, and a set Dout of previously unseen data which should not contain examples exceedingly different from those in D. An important consideration is the use of a net with minimal complexity, given by the number of free parameters (the number of components in w). This requirement can be realized in various ways. In regularization theory, the solution is obtained from a variational principle including the loss and prior smoothness information, defining a smoothing functional φ such that lower values correspond to smoother functions. A solution of the approximation problem is then given by minimization of the functional (Girosi, Jones & Poggio, 1993):


$$H(F_w) = \tilde{L}(D, F_w) + \eta\, \varphi(F_w) \qquad (15)$$

where η is a positive scalar controlling the tradeoff between fitness to the data and smoothness of the solution. A common choice is the second derivative P(f) = f'' of which the (squared) Euclidean norm is taken:

$$\varphi(F_w) = \|P(F_w)\|^2 = \int \{F_w''(t)\}^2\, dt \qquad (16)$$
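A discrete analogue of (15) and (16) can make the tradeoff concrete. In the sketch below the function is represented only by its values on an equally spaced grid and the second derivative is replaced by second differences; this discretization, the grid spacing h and the example data are assumptions made for illustration, not the variational treatment of the cited work.

```python
def smoothness(u, h=1.0):
    """Discrete stand-in for (16): squared second differences summed over the grid."""
    return h * sum(((u[j - 1] - 2.0 * u[j] + u[j + 1]) / h ** 2) ** 2
                   for j in range(1, len(u) - 1))

def regularized_functional(u, y, eta, h=1.0):
    """Discrete stand-in for (15): sum-of-squares loss plus eta times smoothness."""
    loss = sum((yj - uj) ** 2 for yj, uj in zip(y, u))
    return loss + eta * smoothness(u, h)

y = [0.0, 1.2, 1.8, 3.1, 4.0]          # noisy samples of a roughly linear trend
interpolant = list(y)                  # fits the data exactly but wiggles
straight_line = [0.0, 1.0, 2.0, 3.0, 4.0]  # perfectly smooth, small data misfit
```

With η=0 the interpolant has the lower value of the functional, while with η=1 the straight line is preferred, since its second differences vanish and its misfit is small.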

CONCLUSION

Artificial Neural Networks are information processing structures evolved as an abstraction of known principles of how the brain might work. The computing elements, called neurons, are linked to one another with a certain strength, called weight. In their simplest form, each unit computes a function of its inputs (which are either the outputs from other units or external signals), influenced by the weights of the links conveying these inputs. The network is said to learn when the weights of all the units are adapted to represent the information present in a sample, in an optimal sense given by an error function. The network relies upon the representation capacity of the neuron model as the cornerstone for a good approximation.

REFERENCES

Bishop, C. (1995). Neural Networks for Pattern Recognition. Clarendon Press.

Dorffner, G. (1995). A generalized view on learning in feedforward neural networks. Technische Universität Cottbus, Reihe Mathematik M-01/1995, pp. 34-54.

Fiesler, E., Beale, R. (Eds., 1997). Handbook of Neural Computation. IOP Publishing & Oxford Univ. Press.

Fletcher, R. (1980). Practical methods of optimization. Wiley.

Fukushima, K. (1980). Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, pp. 193-202.

Geman, S., Bienenstock, E., Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4 (1): 1-58.

Girosi, F., Jones, M., Poggio, T. (1993). Priors, Stabilizers and Basis Functions: from regularization to radial, tensor and additive splines. AI Memo No. 1430, AI Laboratory, MIT.

Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. MacMillan.

Hecht-Nielsen, R. (1990). Neurocomputing. Addison-Wesley.

Hertz, J., Krogh, A., Palmer, R.G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley.

Hopfield, J.J. (1982). Neural Networks and Physical Systems with Emergent Collective and Computational Abilities. In Proceedings of the National Academy of Sciences, USA, Vol. 79, pp. 2554-2558.

McCulloch, W., Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5: 115-133.

Poggio, T., Girosi, F. (1989). A Theory of Networks for Approximation and Learning. AI Memo No. 1140, AI Laboratory, MIT.

Rosenblatt, F. (1962). Principles of neurodynamics. Spartan Books, NY.

Rumelhart, D., Hinton, G., Williams, R. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vol. 1: Foundations). Rumelhart, McClelland (eds.), MIT Press, Cambridge, MA.

KEY TERMS

Architecture: The number of artificial neurons, their arrangement and connectivity.

Artificial Neural Network: Information processing structure without global or shared memory that takes the form of a directed graph where each of the computing elements ("neurons") is a simple processor with internal and adjustable parameters, that operates only when all its incoming information is available.

Bias-Variance Tradeoff: The mean square error (to be minimized) decomposes in a sum of two non-negative terms, the squared bias and the variance. When an estimator is modified so that one term decreases, the other term will typically increase.

Feed-Forward Artificial Neural Network: Artificial Neural Network whose graph has no cycles.

Learning Algorithm: Method or algorithm by virtue of which an Artificial Neural Network develops a representation of the information present in the learning examples, by modification of the weights.

Neuron Model: The computation of an artificial neuron, expressed as a function of its input and its weight vector and other local information.

Weight: A free parameter of an Artificial Neural Network, that can be modified through the action of a Learning Algorithm to obtain desired responses to input stimuli.


Finding Multiple Solutions with GA in Multimodal Problems

Marcos Gestal
University of A Coruña, Spain

Mari Paz Gómez-Carracedo
University of A Coruña, Spain

INTRODUCTION

Traditionally, Evolutionary Computation (EC) techniques, and more specifically Genetic Algorithms (GAs) (Goldberg & Wang, 1989), have proved to be efficient when solving various problems; however, as a possible shortcoming, GAs tend to provide a unique solution for the problem to which they are applied. Some non-global solutions discarded during the search for the best one could be acceptable under certain circumstances. The majority of real-world problems involve a search space with one or more global solutions and multiple local solutions; this means that they are multimodal problems (Harik, 1995). Therefore, if multiple solutions are to be obtained by using GAs, their classic functioning has to be modified to adapt them correctly to the multimodality of such problems.

MOTIVATION

This chapter tries to establish the basis for the understanding of multimodality. Firstly, the characterisation of multimodal problems is attempted. The chapter also offers a global view of some of the approaches proposed for adapting the classic functioning of GAs to the search for multiple solutions. Lastly, the contributions of the authors are presented.

BACKGROUND: CHARACTERIZATION OF MULTIMODAL PROBLEMS

Multimodal problems can be briefly defined as those problems that have multiple global optimums or multiple local optimums. For this type of problem, it is interesting to obtain the greatest number of solutions for several reasons. On one hand, when there is not a total knowledge of the problem, the solution obtained might not be the best one, as it cannot be stated that no better solution could be found in the part of the search space that has not been explored yet. On the other hand, even when it is certain that the best solution has been achieved, there might be other equally fit or slightly worse solutions that might be preferred due to different factors (easier application, simpler interpretation, etc.) and therefore considered globally better.

One of the most characteristic multimodal functions used in lab problems is the Rastrigin function (see Fig. 1), which offers an excellent graphical view of what multimodality means.

Figure 1. Rastrigin function

Providing multiple optimal (and valid) solutions, and not only the unique global solution, is crucial in many environments. Usually, the best solution is very complex to implement in practice, since it can present multiple problems: an excessive computational cost, a complex interpretation, etc. In these situations it is useful to have a range of valid solutions from which one can be chosen that, while not being the best solution to the problem, offers an acceptable level of fit and is simpler to implement or to understand than the ideal global one.
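For reference, the Rastrigin function mentioned above can be written in a few lines. The text does not give its formula; the sketch below uses the commonly cited definition f(x) = A·n + Σ(x_i² − A·cos(2πx_i)) with A = 10, which is an assumption about the variant plotted in Fig. 1.

```python
import math

def rastrigin(x, a=10.0):
    """Rastrigin function in its usual form; global minimum 0 at the origin."""
    return a * len(x) + sum(xi * xi - a * math.cos(2.0 * math.pi * xi) for xi in x)
```

The function has its global optimum at the origin and a regular grid of local optimums close to the integer points: for example, in one dimension the value near x=1 is about 1, lower than at x=0.8 or x=1.2, yet worse than the global value 0.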

EVOLUTIONARY TECHNIQUES AND MULTIMODAL PROBLEMS

As has been mentioned, the difficulty of applying EC techniques to multimodal problems is that these techniques tend to provide only the best of the solutions found and to discard possible local optimums that might have been found throughout the search. Many modifications have been made to the traditional operation of the GA in order to achieve good results with multimodal problems. A crucial aspect when obtaining multiple solutions is keeping the diversity of the genetic population, distributing the genetic individuals as much as possible throughout the search space.

CLASSICAL APPROACHES

Niching methods allow GAs to maintain a genetic population of diverse individuals, so that it is possible to locate multiple optimal solutions within a single population. In order to minimise the impact of homogenisation, or to ensure that it only affects later stages of the search, several alternatives have been designed, most of them based on heuristics. One of the first alternatives for promoting diversity was the application of scaling methods to the population in order to emphasize the differences among individuals. Another direct route for avoiding diversity loss focuses on the elimination of duplicate partial high-fitness solutions (Bersano, 1997) (Langdon, 1996).

Other approaches try to solve this problem by dynamically varying the crossover and mutation rates (Ursem, 2002). When diversity decreases, more mutations are performed in order to increase the exploration of the search space; when diversity increases, mutations decrease and crossovers increase with the aim of improving exploitation in the search for the optimal solution.

There are also proposals of new genetic operators or variations of the existing ones. For example, some of the crossover algorithms that improve diversity and that should be highlighted are BLX (Blend Crossover) (Eshelman & Schaffer, 1993), SBX (Simulated Binary Crossover) (Deb & Agrawal, 1995), PCX (Parent Centric Crossover) (Deb, Anand & Joshi, 2002), CIXL2 (Confidence Interval Based Crossover using L2 Norm) (Ortiz, Hervás & García, 2005) and UNDX (Unimodal Normally Distributed Crossover) (Ono & Kobayashi, 1999).

Regarding replacement algorithms, schemes that preserve population diversity have also been sought. An example of this type of scheme is crowding (DeJong, 1975) (Mengshoel & Goldberg, 1999). Here, a newly created individual is compared to a randomly chosen subset of the population and the most similar individual is selected for replacement. Crowding techniques are inspired by Nature, where similar members of natural populations compete for limited resources. Likewise, dissimilar individuals tend to occupy different niches and are unlikely to compete for the same resource, so different solutions are provided.

Fitness sharing was first implemented by Goldberg & Richardson for use on multimodal functions (Goldberg & Richardson, 1987). The basic idea involves determining, from the fitness of each solution, the maximum number of individuals that can remain around it, rewarding the individuals that exploit unique areas of the domain. Dynamic fitness sharing (Miller & Shaw, 1995), with two components, was proposed in order to correct the dispersion of the final distribution of the individuals into niches: the distance function, which measures the overlapping of individuals, and the comparison function, which returns "1" if the individuals are identical and values closer to "0" the more different they are. The clearing method (Petrowski, 1996) is quite different from the previous ones, as the resources are not shared but assigned to the best individuals, which are then kept in every niche.

The main drawback of the techniques previously described lies in the fact that they add new parameters that must be configured for each execution of the GA. The search process may be disturbed by the interactions among those parameters (Ballester & Carter, 2003).
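The idea behind fitness sharing can be sketched as follows. The text does not give a formula, so this example assumes the commonly used formulation in which the raw fitness of an individual is divided by a niche count computed with a triangular sharing function; the names `sigma_share` and `alpha`, and the one-dimensional real-valued individuals, are illustrative assumptions.

```python
def sharing(distance, sigma_share, alpha=1.0):
    """Sharing function: 1 for identical individuals, 0 beyond the niche radius."""
    if distance < sigma_share:
        return 1.0 - (distance / sigma_share) ** alpha
    return 0.0

def shared_fitness(population, raw_fitness, sigma_share):
    """Divide each raw fitness by the niche count of the individual."""
    shared = []
    for xi in population:
        # The niche count is at least 1 because every individual shares with itself.
        niche_count = sum(sharing(abs(xi - xj), sigma_share) for xj in population)
        shared.append(raw_fitness(xi) / niche_count)
    return shared

# Three individuals crowd the same point; one explores a region alone.
population = [0.0, 0.0, 0.0, 5.0]
result = shared_fitness(population, lambda x: 1.0, sigma_share=1.0)
```

With equal raw fitness, the isolated individual keeps its full value while each crowded individual is reduced to one third, which is how individuals exploiting unique areas of the domain are rewarded.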

OWN PROPOSALS

Once the existing problems have been detected, they should be resolved or, at least, minimized. With this goal, the Artificial Neural Network and Adaptive System (RNASA) group has developed two proposals that use EC techniques for this type of problem. Both proposals try to find the final solution while keeping partial solutions within the final population. The main ideas of the two proposals, together with the problems used for the tests, are explained in the following sections.

Hybrid Two-Population Genetic Algorithm

Introduction

To force a homogeneous search throughout the search space, the approach proposed here is based on the addition of a new population (the genetic pool) to a traditional GA (the secondary population). The genetic pool divides the search space into sub-regions. Each individual of the genetic pool has its own fenced range for gene variation, so each of these individuals represents a specific sub-region within the global search space. On the other hand, the union of the individual ranges in which a gene may take its value covers all the possible values that the gene may have. Therefore, the genetic pool samples the whole of the search space. It should be borne in mind that a traditional GA performs its search considering only one sub-region (the whole of the search space). Here the search space is divided into different sub-regions or intervals according to the number of individuals in the genetic pool.

Since the individuals in the genetic pool have restrictions on their viable gene values, none of these individuals can provide a valid solution by itself. So, another population (the secondary population) is used in addition to the genetic pool. Here, a classical GA develops its individuals in an interactive fashion with the individuals of the genetic pool. Unlike in the genetic pool, the genes of the individuals of the secondary population may adopt values throughout the whole of the search space. The secondary population therefore provides the solutions, whereas the genetic pool acts as a support, keeping the search space homogeneously explored. Next, both populations, which are graphically represented in Fig. 2, will be described in detail.

Figure 2. Structure of populations of hybrid two-population genetic algorithm. The genetic pool contains P individuals (Ind1, ..., IndP), each with N genes whose valid range is one of the n sub-ranges [0, d/n], [d/n, 2·d/n], ..., [(n-1)d/n, d]. The secondary population contains S individuals (Ind1, ..., IndS), each with N genes whose valid range is [0, d]. Legend: N: number of variables selected; S: secondary population individuals; P: genetic pool individuals; n: number of subregions to divide the search space; d: number of original variables.

The Genetic Pool

As previously mentioned, each individual of the genetic pool represents a sub-region of the global search space. Therefore, these individuals have the same structure or gene sequence as in a traditional GA. The difference lies in the range of values that these genes may have. When offering a solution, a gene in a traditional GA may have any valid value, whereas in the proposed GA the range of possible values is restricted. The total value range is divided into as many parts as there are individuals in the genetic pool, so that a sub-range of values is allotted to each individual. The values that a given gene may have will remain within its range for the whole execution of the proposed GA.

In addition, every individual of the genetic pool keeps track of which of its genes correspond to the best solution found so far (meaning whether they belong to the best individual of the secondary population). This Boolean value is used to avoid the modification of those genes that, in a given phase of the execution, are part of the best solution to the problem. Furthermore, every gene in an individual has an associated value I which indicates the relative increment that will be applied to the gene during a mutation operation based only on increments and applied solely to individuals of the genetic pool. Obviously, this incremental value has to be lower than the maximum range in which gene values may vary. The structure of the individuals of the genetic pool is shown in Fig. 3.

Figure 3. Structure of the genetic pool individuals

As these individuals do not represent global solutions to the problem that has to be solved, their fitness value is not required. This reduces the complexity of the algorithm and increases the computational efficiency of the final implementation.
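The structure of a genetic pool individual described above might be represented as in the following sketch. The class and field names, the random initialization of the genes and the choice of an initial increment equal to one tenth of the sub-range are hypothetical; the text only requires a fenced sub-range per individual, a Boolean flag and an increment I per gene, and an increment smaller than the range.

```python
import random
from dataclasses import dataclass

@dataclass
class PoolIndividual:
    low: float        # lower limit of the sub-range allotted to this individual
    high: float       # upper limit of the sub-range
    genes: list       # gene values, always kept inside [low, high]
    increments: list  # I value of each gene, used by the incremental mutation
    is_best: list     # B flag of each gene: True if it belongs to the best solution

def build_genetic_pool(n_individuals, n_genes, lim_inf, lim_sup, rng):
    """Split [lim_inf, lim_sup] into equal sub-ranges, one per pool individual."""
    width = (lim_sup - lim_inf) / n_individuals
    pool = []
    for k in range(n_individuals):
        low = lim_inf + k * width
        high = low + width
        genes = [rng.uniform(low, high) for _ in range(n_genes)]
        increments = [width / 10.0] * n_genes  # assumed starting increment
        pool.append(PoolIndividual(low, high, genes, increments, [False] * n_genes))
    return pool

pool = build_genetic_pool(4, 3, 0.0, 8.0, random.Random(0))
```

No fitness field is stored, in line with the remark that pool individuals are not global solutions and need not be evaluated.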

The Secondary Population

The individuals of the secondary population are quite different from the previous ones. In this case, the genes of the individuals of the secondary population can take any value throughout the whole space of possible solutions. This allows all the individuals of the secondary population to offer global solutions to the problem. This is not possible in the genetic pool because its genes are restricted to different sub-ranges. The evolution of the individuals of the secondary population is carried out by the rules of a traditional GA. The main difference lies in the crossover operator: a modified crossover is used. Because the information is stored in isolated populations, the two parents that produce the new offspring do not belong to the same population. Instead, the genetic pool and the secondary population are combined. In this way, information from both populations is merged to produce the fittest offspring.

The Crossover Operator

As pointed out before, the crossover operator recombines the genetic material of the individuals of both populations. This recombination involves a random individual from the secondary population and a representative of the genetic pool. This representative stands for a potential solution offered by the genetic pool. As a single individual cannot meet this requirement, the representative is formed by a subset of genes from different individuals of the genetic pool. Gathering information from different partial solutions allows a valid global solution to be produced. Therefore, the value for every gene of the representative is randomly chosen among all the individuals in the genetic pool. After a value has been assigned to all the genes, this new individual represents a global solution, unlike each of the pool individuals separately, which represent only partial ones. Then the crossover operator is applied. This crossover keeps the diversity of the secondary population, since the offspring contain values from the genetic pool. Therefore, the genetic algorithm is able to maintain multiple solutions in the same population. The crossover operator does not change the genetic pool, because the pool only acts as an engine to keep the diversity. This process is summarized in Fig. 4.

Figure 4. Hybrid two-population genetic algorithm: Crossover
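The construction of the representative and its recombination with a secondary-population individual can be sketched as below. The text does not state which recombination scheme is applied once the representative is built, so the uniform crossover used here is an assumption, as are all function names and the plain list-of-lists representation of both populations.

```python
import random

def build_representative(pool_genes, rng):
    """For every gene position, copy the value of a randomly chosen pool individual."""
    n_genes = len(pool_genes[0])
    return [rng.choice(pool_genes)[g] for g in range(n_genes)]

def uniform_crossover(parent_a, parent_b, rng):
    """Assumed recombination: each gene is taken from either parent with equal chance."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def hybrid_crossover(pool_genes, secondary, rng):
    """Cross the pool representative with a random secondary-population individual."""
    representative = build_representative(pool_genes, rng)
    mate = rng.choice(secondary)
    return uniform_crossover(representative, mate, rng)

pool_genes = [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3], [2.1, 2.2, 2.3]]
secondary = [[9.0, 9.1, 9.2], [8.0, 8.1, 8.2]]
child = hybrid_crossover(pool_genes, secondary, random.Random(1))
```

The functions only read from the genetic pool and never write to it, matching the statement that the crossover operator leaves the pool unchanged.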

The Mutation Operator

The mutation operator increments the value of individual genes in the genetic pool. It introduces new information into the genetic pool, so that the representative can use it and finally, by means of the crossover operator, introduce it into the secondary population. It should be noted that the new value has an upper limit, so when this limit is reached the gene value is reset to the lower limit. As generations advance the increment amount is reduced, so the increment applied to the individuals in the genetic pool takes lower values. The different increments between iterations are calculated taking into account the lower value for a gene (LIM_INF_IND), the upper value for that gene (LIM_SUP_IND) and the total number of individuals in the genetic pool (IND_POOL), as Fig. 5 summarizes. In this way, the first generations explore the search space briefly (a coarse-grain search), and a more exhaustive route through all the values that a given gene may have (a fine-grain search) is performed as the search process advances.
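The incremental mutation summarized in Fig. 5 can be transcribed into Python as follows. This is a direct reading of the pseudocode for a single gene; the function names and the decision to return the updated gene and increment as a pair are choices made for this sketch.

```python
def initial_delta(lim_inf_ind, lim_sup_ind, ind_pool):
    """Delta as initialized in Fig. 5: range width divided by the pool size."""
    return (lim_sup_ind - lim_inf_ind) / ind_pool

def mutate_gene(gene, increment, is_best, lim_inf_gen, lim_sup_gen, delta):
    """Apply the incremental mutation of Fig. 5 to one gene of a pool individual."""
    if is_best:
        # Genes that belong to the best solution found so far are left untouched.
        return gene, increment
    gene += increment
    if gene > lim_sup_gen:
        # Wrap around to the lower limit and shrink the increment for later passes.
        gene = lim_inf_gen
        increment -= delta
    return gene, increment
```

For instance, with limits 0 and 2, a gene at 1.0 with increment 0.5 simply moves to 1.5, whereas a gene at 1.75 would exceed the upper limit, so it is reset to 0 and its increment is reduced by Delta.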

Genetic Algorithm with Division into Species

Another proposed solution is an adaptation of the niching technique. This adaptation consists of dividing the genetic population into different and independent subspecies. In this case, the species to which a specific individual belongs is determined according to genotype similarities (similar genotypes form isolated species). This classical concept has been provided with some improvements in order not only to decrease the number of iterations needed for obtaining solutions, but also to increase the number of solutions kept within the genetic population. Several iterations of the GA are executed on every species of the genetic population to speed up the convergence towards the solution that exists near each species. The individuals generated during this execution that have a genotype of a different species are discarded. Crossover operations between species are then applied, similarly to what happens in biology. As a result, on one hand, crossovers between similar individuals are preferred (as was done in the previous step using GAs) and, on the other, crossovers between different species are enabled, although at a lower rate.


Figure 5. Pseudocode for mutation and Delta initialization

    IF (not Bi)
        Gi = Gi + Ii
        IF (Gi > LIM_SUP_GEN)
            Gi = LIM_INF_GEN
            Ii = Ii - Delta
        ENDIF
    ENDIF

    Delta = (LIM_SUP_IND - LIM_INF_IND) / IND_POOL

The individuals generated after these crossovers can either be incorporated into an already existing species or, if they explore a new area of the search space, create a new species themselves. Finally, the GA provides as many solutions as there are species remaining active over the search space.

FUTURE TRENDS

Since no method provides the best results in all possible situations, new approaches will continue to be developed. New fitness functions would help to locate a great number of valid solutions within the search space. In the approaches described here, these functions remain constant during the execution of the method. Another option would be to allow dynamic fitness functions that vary during the execution. This kind of function would try to adapt its output to the knowledge extracted from the search space while the crossover and mutation operators explore new areas. If different techniques offer acceptable solutions, another interesting approach consists of combining them. For example, such hybrid models would integrate statistical methods (with a strong mathematical background) with other heuristics.

CONCLUSION

This article gives an overview of the different methods related to evolutionary techniques that are used to address the problem of multimodality. It has shown several approaches that provide not only a global solution but multiple solutions to the same problem. This helps the final user to decide which of them is the most suitable in any particular case. The final decision will depend on several factors, not only the global error reached by a particular method but also the economic impact, the difficulty of implementation, the quality of the knowledge provided for analysis, and so on.

REFERENCES

Ballester, P.J., & Carter, J.N. (2003). Real-Parameter Genetic Algorithms for Finding Multiple Optimal Solutions in Multimodal Optimization. Proceedings of Genetic and Evolutionary Computation, pp. 706-717.

Bersano-Beguey, T. (1997). Controlling Exploration, Diversity and Escaping from Local Optima in GP. Proceedings of Genetic Programming. MIT Press. Cambridge, MA.

Deb, K., & Agrawal, S. (1995). Simulated binary crossover for continuous search space. Complex Systems 9(2), pp. 115-148.

Deb, K., Anand, A., & Joshi, D. (2002). A Computationally Efficient Evolutionary Algorithm for Real Parameter Optimization. KanGAL report: 2002003.

DeJong, K.A. (1975). An Analysis of the Behaviour of a Class of Genetic Adaptive Systems. PhD Thesis, University of Michigan, Ann Arbor.

Eshelman, L.J., & Schaffer, J.D. (1994). Real coded genetic algorithms and interval schemata. Foundations of Genetic Algorithms (2), pp. 187-202.

Goldberg, D.E., & Richardson, J. (1987). Genetic algorithms with Sharing for Multimodal Function Optimization. Proceedings of 2nd International Conference on Genetic Algorithms (ICGA), pp. 41-49.

Goldberg, D.E., & Wang, L. (1989). Genetic algorithms in Search Optimization & Machine Learning. Addison-Wesley.

Harik, G. (1995). Finding multimodal solutions using restricted tournament selection. Proceedings of the Sixth International Conference on Genetic Algorithms (ICGA), pp. 24-31.

Langdon, W. (1996). Evolution & Genetic Programming Populations. University College. Technical Report RN/96/125. London.

Mengshoel, O.J., & Goldberg, D.E. (1999). Probabilistic Crowding: Deterministic Crowding with Probabilistic Replacement. Proceedings of Genetic and Evolutionary Computation, pp. 409-416.

Miller, B., & Shaw, M. (1995). Genetic algorithms with Dynamic Niche Sharing for Multimodal Function Optimization. IlliGAL Report 95010. University of Illinois. Urbana Champaign.

Ono, I., & Kobayashi, S. (1999). A real-coded genetic algorithm for function optimization using unimodal normal distribution. Proceedings of International Conference on Genetic Algorithms, pp. 246-253.

Ortiz, D., Hervás, C., & García, N. (2005). CIXL2: A crossover operator for evolutionary algorithms based on population features. Journal of Artificial Intelligence Research.

Petrowski, A. (1996). A Clearing Procedure as a Niching Method for Genetic Algorithms. Proceedings of International Conference on Evolutionary Computation. IEEE Press. Nagoya, Japan.

Ursem, R.K. (2002). Diversity-Guided Evolutionary Algorithms. Proceedings of VII Parallel Problem Solving from Nature, pp. 462-471.

KEY TERMS

Crossover: Genetic operation used in evolutionary techniques to generate offspring from the current population. There are many different ways to perform crossover, but the general idea is to merge the genetic information of the parents in the offspring, with the aim of producing better solutions as generations advance.

Evolutionary Technique: A technique that tries to solve a problem guided by biological principles such as the survival of the fittest. Techniques of this kind start from a randomly generated population, which evolves by means of crossover and mutation operations to provide the final solution.

Genetic Algorithm: A special type of evolutionary technique that represents the potential solutions of a problem as chromosomes (usually a collection of binary, natural or real values).

Multimodal Problems: A special kind of problem in which a unique global solution does not exist. Several global optima, or one global optimum with several local optima (or peaks), can be found in the search space.

Mutation: The other genetic operation used in evolutionary techniques during the reproduction stage. The mutation operator introduces new information into the system through random changes applied to the genetic individuals.

Search Space: The set of all possible situations that the problem to be solved could ever be in; the combination of all possible values of all the variables related to the problem.

Species: In the context of genetic algorithms, a subset of genetic individuals with similar genotypes (genetic values) that explore the same, or a similar, area of the search space.


Full-Text Search Engines for Databases

László Kovács
University of Miskolc, Hungary

Domonkos Tikk
Budapest University of Technology and Economics, Hungary

INTRODUCTION

Current databases are able to store several terabytes of free-text documents. From the user’s viewpoint, the main purpose of a database is efficient information retrieval. In the case of textual data, information retrieval mostly concerns the selection and ranking of documents. The selection criteria can contain elements that apply to the content or to the grammar of the language. In traditional database management systems (DBMS), text manipulation is restricted to the usual string manipulation facilities, i.e. the exact matching of substrings. Although the SQL:1999 standard enables the use of more powerful regular expressions, this traditional approach has major drawbacks: string-level operations are very costly for large documents, as they work without task-oriented index structures. The required full-text management operations belong to text mining, an interdisciplinary field of natural language processing and data mining. As the traditional DBMS engine is inefficient for these operations, database management systems are usually extended with a special full-text search (FTS) engine module. We present here the particular solution of Oracle: to make full-text querying more efficient, a special engine was developed that prepares full-text queries and provides a set of language-specific and semantics-specific query operators.

BACKGROUND

Traditional DBMS engines are not adequate to meet users’ requirements for the management of free-text data, as they handle the whole text field as an atom (Codd, 1985). A special extension to the DBMS engine is needed for the efficient implementation of text manipulation operations. There is significant market demand for free-text and text mining operations, since information is often stored as free text. Typical application areas include text analysis in medical systems, analysis of customer feedback, and bibliographic databases. In these cases, simple character-level string matching would retrieve only a fraction of the related documents, so an FTS engine is required that can identify the semantic similarities between terms.

There are several alternatives for implementing an FTS engine. In some DBMS products, such as Oracle, Microsoft SQL Server, PostgreSQL, and MySQL, a built-in FTS engine module is implemented. Other vendors extend the DBMS configuration with a DBMS-independent FTS engine. In this segment the main products are SPSS LexiQuest (SPSS, 2007), SAS Text Miner (SAS, 2007), dtSearch (dtSearch, 2007), and Statistica Text Miner (Statsoft, 2007).

The market for FTS engines is very promising, since the amount of textual information stored in databases rises steadily. According to a Merrill Lynch study (Blumberg & Arte, 2003), 85% of business information consists of text documents (e-mails, business and research reports, memos, presentations, advertisements, news, etc.), and this proportion is still increasing. In 2006, there were more than 20 billion documents available on the Internet (Chang, 2006). The estimated size of the pool increases to 550 billion documents when the documents of the hidden (or deep) web, e.g. dynamically generated ones, are also considered.

TEXT MINING

The subfield of document management that aims at processing, searching, and analyzing text documents is text mining. The goal of text mining is to discover the non-trivial or hidden characteristics of individual documents or document collections. Text mining is an application-oriented interdisciplinary field of machine learning which exploits tools and resources from computational linguistics, natural language processing, information retrieval, and data mining. The general application schema of text mining is depicted in Figure 1 (Fan, Wallace, Rich & Zhang, 2006). To give a brief summary of text mining, four main areas are presented here: information extraction, text categorization/classification, document clustering, and summarization.

Figure 1. The text mining module (components: document collection; document retrieval and preprocessing; text analysis, comprising extraction, categorization, clustering and summarization; knowledge; decision support)

Information Extraction

The goal of information extraction (IE) is to collect from documents the text fragments (facts, places, people, etc.) that are relevant to the given application. The extracted information can be stored in structured databases. IE is typically applied in processes where statistics, analyses, summaries, etc. are to be derived from texts. IE includes the following subtasks:

• named entity recognition – recognition of specified types of entities in free text (see e.g. Borthwick, 1999; Sibanda & Uzuner, 2006);
• co-reference resolution – identification of text fragments referring to the same entity (see e.g. Ponzetto & Strube, 2006);
• identification of roles and their relations – determination of roles defined in event templates (see e.g. Ruppenhofer et al., 2006).

Text Categorization

Text categorization (TC) techniques aim at sorting documents into a given category system (see Sebastiani, 2002 for a good survey). In TC, a classifier model is usually built from the content of a set of sample documents, and the model is then used to classify unseen documents. Typical applications of TC include, among many others:

• document filtering – e.g. spam filtering or newsfeed filtering (Lewis, 1995);
• patent document routing – determination of experts in the given fields (Larkey, 1999);
• assisted categorization – helping domain experts in manual categorization with valuable suggestions (Tikk et al., 2007);
• automatic metadata generation (Liddy et al., 2002).

Document Clustering

Document clustering (DC) methods group the elements of a document collection based on their similarity. Here again, documents are usually clustered based on their content. Depending on the nature of the results, one distinguishes partitioning and hierarchical clustering methods. In the former case there is no explicit relation among the clusters, while in the latter case a hierarchy of clusters is created. DC is applied, for example, for:

• clustering the results of (internet) search to help users locate information (Zamir et al., 1997);
• improving the speed of vector-space-based information retrieval (Manning et al., 2007);
• providing a navigation tool when browsing a document collection (Käki, 2005).
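As an illustration of content-based similarity, the sketch below represents each document as a bag of words and groups documents with a greedy single-pass partitioning scheme based on cosine similarity. It is only a minimal example under stated assumptions: the function names, the whitespace tokenizer and the threshold of 0.5 are illustrative choices, not taken from any of the cited systems.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as a bag of lower-cased words with counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(docs, threshold=0.5):
    """Greedy partitioning: a document joins the first cluster whose seed
    document is similar enough, otherwise it starts a new cluster."""
    clusters = []  # lists of document indices; the first index is the seed
    vectors = [term_vector(d) for d in docs]
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

The result is a flat partition with no relation among the clusters; a hierarchical method would instead merge or split clusters repeatedly and record the resulting tree.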



Summarization

Text summarization aims at the automatic generation of short and comprehensible summaries of documents. Text extraction algorithms create a summary by extracting relevant descriptive phrases (typically sentences) from the original text, while summaries generated by abstraction methods may also contain synthesized text. The typical application areas of summarization range from internet search to arbitrary document management systems (Ganapathiraju, 2002; Radev et al., 2001).

FULL-TEXT SEARCH (FTS) ENGINES

Full-Text Search

Based on the literature (Maier, 2001; Curtmola, 2005), an effective FTS engine should support several query functionalities. The simplest operation is the string-based query, which retrieves texts that exactly match the query string. In some cases, the position of the keywords within the document is also an important factor. The simplest form of similarity-based matching uses the edit-distance function. The next operation is the content-based query, where similarity is defined on the semantic level. An FTS engine should also support grammar-specific (and therefore language-specific) operators, e.g. stemming. The highest level of text search operates with semantic-based matching (thesaurus-based neighborhood, generalization of a word, specialization, synonyms).

From the practical viewpoint, the efficient execution of queries is also very important. Due to the heterogeneity of the source pool, support for different document formats is a key requirement. Minimal reliance on other resources provides an independent, flexible solution. From the aspect of software development, an open, standardized interface is a good investment. To provide a manageable, easy-to-understand response, the efficient ranking of the result set is crucial (Chakrabarti, 2006). The products and test systems currently available meet the above requirements only partially.
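The edit-distance function mentioned above can be sketched as follows. This is the standard Levenshtein distance computed by dynamic programming; the helper `fuzzy_match` and its `max_distance` parameter are hypothetical names introduced only for this illustration and do not correspond to the interface of any particular FTS product.

```python
def edit_distance(source, target):
    """Levenshtein distance: the minimum number of single-character insertions,
    deletions and substitutions needed to turn `source` into `target`."""
    previous = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def fuzzy_match(query, terms, max_distance=1):
    """Return the terms whose edit distance to the query is at most max_distance."""
    return [term for term in terms if edit_distance(query, term) <= max_distance]
```

Only two rows of the distance table are kept at a time, so the memory use is proportional to the length of the target string rather than to the product of both lengths.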

Structure of a General FTS Engine

FTS engines are structurally similar to database systems: they store data and metadata, and their purpose is to provide efficient information retrieval (Microsoft, 2007; Oracle Text, 2007). As the processing of a full-text query requires several distinct steps, FTS engines typically have a modular structure (see also Figure 2). The loader module loads the documents into a common staging area, in a common representation. In further steps, data items are also transformed into a common format. The loaded documents are stored in the datastore unit.

Document processing has several steps. The sectioner unit discovers the larger internal logical structure of the documents. The word-breaker parses the text into smaller syntactical units such as paragraphs, sentences and terms (words). To reduce the length and complexity of the text, several preprocessing steps are executed. First, a filter module is applied that discards irrelevant words (stop-words, noise words). Next, the stemmer unit generates the stem form of every word. In the background, the language lexicon supports the language-specific reduction steps. This lexicon contains the grammar of the supported languages and the list of stop-words. The thesaurus is a special lexicon, which stores the terms organized in a graph based on their semantic relationships. To provide efficient term management, several kinds of indexes are created. The indexer unit manages the different document-term indices that enable efficient access to term occurrences.

On the front-end side, the query preprocessor transforms the user’s query into an internal format. This format is processed by the query matcher, resulting in a set of matching documents. The search engine may be extended with a text mining module that performs data mining operations such as clustering or classification. To provide a more accurate response, the query refinement engine processes relevance feedback. The list of matching documents is pipelined to the ranking module. The exporter module generates the final format of the ranked document set.

As mentioned, database systems use indices for fast access to data items. For full-text search, the inverted index is the most efficient index structure (Zobel, 2006). In the simple inverted index, the key of the index is the term. Each key is associated with a pair (df, dl), where df is the number of documents containing the key and dl is the list of documents that contain the key. Each entry in the list contains a document identifier and the frequency of the term in that document. The position-based inverted index differs from the simple version in that the list entry for each document also contains the positions of the given term in the text.
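The preprocessing steps and the position-based inverted index described above can be sketched in a few lines. The code is a simplified illustration under stated assumptions: the stop-word list, the toy suffix-stripping stemmer `naive_stem` and all function names are hypothetical, and a real engine would use a language lexicon instead. Each term maps to a pair (df, dl), where dl maps a document identifier to the list of term positions, so the in-document frequency is the length of that list.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative list only


def naive_stem(word):
    """Toy stemmer (an assumption of this sketch): strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Word-breaker, stop-word filter and stemmer; yields (position, term) pairs."""
    for position, raw in enumerate(text.lower().split()):
        word = raw.strip(".,;:!?()\"'")
        if word and word not in STOP_WORDS:
            yield position, naive_stem(word)


def build_index(documents):
    """Build a position-based inverted index: term -> (df, dl),
    where dl maps a document id to the positions of the term."""
    postings = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in preprocess(text):
            postings[term].setdefault(doc_id, []).append(position)
    return {term: (len(dl), dl) for term, dl in postings.items()}
```

Positions are counted before stop-words are removed, so the distance between two indexed terms still reflects their distance in the original text.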


Figure 2. Modules of an FTS engine (panels: LOADER with documents, staging area, sectioner and word-breaker; PRE-PROCESSING with stop-word filter, stemmer and dimension reduction; INDEXER with document-terms index, term-document inverted index and phase-index; QUERY ENGINE with parsing, optimization and executor; POST-PROCESSING with query refinement and ranking; shared elements: indices, lexicon, query, output)

FTS Engine Interface in Oracle Text

The FTS functionality in Oracle Text (Oracle, 2007) can be activated with some extensions to SQL and with procedural SQL packages. Oracle Text supports four index types:

• CONTEXT-type index: inverted index for long documents;
• CTXCAT-type index: supports content- and attribute-based indexing for shorter documents;
• CTXRULE-type index: rules for document classification;
• CTXXPATH-type index: indexing of XML documents.

The stemming module supports only two languages: English and French. In queries, the CONTAINS operator supports the following matching modes:

• keyword: exact matching;
• AND, OR, NOT: Boolean operators;
• NEAR (keyword1, keyword2): the keywords should occur at nearby positions in the document;
• BT(keyword): generalization of the keyword;
• NT(keyword): specialization of the keyword;
• REL(keyword): words in the thesaurus that are related to the keyword;
• SYN(keyword): synonyms of the keyword;
• $keyword: words having the same stem;
• !keyword: words having the same pronunciation;
• ABOUT keywords: words belonging to the given topic;
• FUZZY(keyword): words that are close to the keyword in terms of edit distance;
• WITHIN (section): the matching is restricted to a given section of the documents.
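To make the idea behind a proximity operator such as NEAR concrete, the following sketch evaluates it over a position-based inverted index. It does not reproduce Oracle Text's implementation or its exact semantics; the index layout (term mapped to a dictionary of document identifiers and position lists), the function name `near` and the `max_gap` parameter are assumptions made for this example.

```python
def near(index, term_a, term_b, max_gap=3):
    """Return the ids of documents in which the two terms occur at most
    `max_gap` positions apart. `index` maps term -> {doc_id: [positions]}."""
    docs_a = index.get(term_a, {})
    docs_b = index.get(term_b, {})
    hits = []
    # Only documents that contain both terms can match.
    for doc_id in sorted(docs_a.keys() & docs_b.keys()):
        close = any(
            abs(pos_a - pos_b) <= max_gap
            for pos_a in docs_a[doc_id]
            for pos_b in docs_b[doc_id]
        )
        if close:
            hits.append(doc_id)
    return hits
```

The pairwise comparison is quadratic in the number of occurrences per document; since position lists are normally sorted, a production engine could merge them in linear time instead.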

The example below retrieves the documents whose description contains words that are specializations (narrower terms) of “food”:

    SELECT description FROM books WHERE CONTAINS(description, 'NT(food,1)') > 0;

Oracle Text supports three methods for document partitioning (categorization and clustering). Manual categorization allows the user to enter keyword-category pairs. Automatic categorization works if a training set of document-category pairs is given. The clustering method automatically determines the clusters in a set of documents based on their similarity. To provide semantic-based matching for an arbitrary domain, users can create their own thesaurus.
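A user-defined thesaurus of the kind mentioned above can be pictured as a small graph of terms. The sketch below is a hypothetical in-memory model, not Oracle Text's thesaurus API: the class `Thesaurus` and its methods `nt`, `bt` and `syn` are named after the query operators only for readability, and the interpretation of `levels` as the number of steps followed downwards is an assumption of this example.

```python
class Thesaurus:
    """Minimal term graph with narrower-term and synonym relations."""

    def __init__(self):
        self.narrower = {}  # term -> set of directly narrower terms
        self.synonyms = {}  # term -> set of synonyms

    def add_narrower(self, broad, narrow):
        self.narrower.setdefault(broad, set()).add(narrow)

    def add_synonym(self, a, b):
        self.synonyms.setdefault(a, set()).add(b)
        self.synonyms.setdefault(b, set()).add(a)

    def nt(self, term, levels=1):
        """Specializations of `term`, following at most `levels` steps down."""
        found, frontier = set(), {term}
        for _ in range(levels):
            frontier = {n for t in frontier for n in self.narrower.get(t, ())} - found
            found |= frontier
        return found

    def bt(self, term):
        """Direct generalizations of `term`."""
        return {broad for broad, narrow in self.narrower.items() if term in narrow}

    def syn(self, term):
        """Synonyms of `term`."""
        return set(self.synonyms.get(term, ()))
```

A query engine could use such a structure to expand a keyword into the set of related terms before looking them up in the inverted index.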

FUTURE TRENDS

In our view, there are three main areas where the role of FTS engines should be strengthened in the future: web search engines, ontology-based information retrieval, and the management of XML documents.

The main standard for querying XML documents is currently the XQuery language. This standard is very flexible for selecting structured data elements, but it has no special features for the unstructured part. Botev (2004) and Curtmola (2005) propose extensions of XQuery with full-text functionality, called TeXQuery and GalaTex respectively. The language contains a rich set of composite full-text primitives such as phrase matching, proximity distance, stemming and thesauri. The combination of structure- and content-based queries is investigated in depth from a theoretical viewpoint in (Amer-Yahia, 2004).

The efficiency of information retrieval can be improved by adding semantic information. The ALVIS project (Luu, 2006) aims at building a distributed, peer-to-peer semantic search engine. The peer-to-peer network is a self-organizing system for decentralized data management in distributed environments. During a query operation, a peer broadcasts search requests in the network. A peer may be assigned to a subset of data items. The key element in cost reduction is the application of a special index type at the nodes: in addition to single-keyword entries, the index also contains entries for compound keys with high discriminative value.

A very important application area of full-text search is the Web. A special feature of Web search is that users mostly apply simple queries. Only 10% of queries use complex full-text primitives such as Boolean operators, stemming or fuzzy matching. Eastman (2003) investigated why complex operators are omitted and concluded that the application of complex full-text operators does not significantly improve the search results. Efficiency is a key factor in web search engines (Silvestri, 2004). The goal of this research is to upgrade the indexing mechanism of web search engines to provide efficient full-text search operators.

CONCLUSION

Information on the web and in computers is stored mostly in free-text format. Current databases are able to store and manage huge document collections. Free-text data sources require specific search operations. Database management systems usually contain a separate full-text search engine to perform full-text search primitives. In general, current FTS engines support the following functionalities: exact matching, position-based matching, similarity-based matching (fuzzy matching), grammar-based matching (stemming) and semantic-based matching (synonym- and thesaurus-based matching). It has been shown that the average user requires additional help to exploit the benefits of these extra operators. Current research focuses on covering new document formats, adapting the query to the user’s behavior, and providing an efficient FTS engine implementation.

REFERENCES

Amer-Yahia, S., Lakshmanan, L. & Pandit, S. (2004). FlexPath: Flexible Structure and Full-Text Querying for XML. In Proc. of ACM SIGMOD (pp. 83–94), Paris, France.
Blumberg, R. & Arte, S. (2003). The problem with unstructured data. DM Review (February).
Borthwick, A. (1999). A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, New York University, USA.
Botev, C., Amer-Yahia, S. & Shanmugasundaram, J. (2004). A TexQuery-based XML full-text search engine. In Proc. of ACM SIGMOD (pp. 943–944), Paris, France.
Chakrabarti, K., Ganti, V., Han, J. & Xin, D. (2006). Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation. In Proc. of ACM SIGMOD (pp. 371–382), Chicago, IL, USA.
Chang, K. & Cho, J. (2006). Accessing the Web: From Search to Integration. In Proc. of ACM SIGMOD (pp. 804–805), Chicago, IL, USA.
Codd, E.F. (1985). Is Your DBMS Really Relational? (Codd’s 12 rules). Computerworld Magazine.
Curtmola, E., Amer-Yahia, S., Brown, P. & Fernandez, M. (2005). GalaTex: A Conformant Implementation of the XQuery Full-Text Language. In Proc. of WWW 2005 (pp. 1024–1025), Chiba, Japan.
dtSearch (2007). Text Retrieval / Full Text Search Engine, http://www.dtsearch.com
Eastman, C. & Jansen, B. (2003). Coverage, Relevance, and Ranking: The Impact of Query Operators on Web Search Engine Results. ACM Transactions on Information Systems, 21(4), 383–411.
Fan, W., Wallace, L., Rich, S., & Zhang, Z. (2006). Tapping the power of text mining. Communications of the ACM, 49(9), 76–82.
Ganapathiraju, M. K. (2002). Relevance of cluster size in MMR based summarizer. Technical Report 11-742, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA.
Käki, M. (2005). Findex: search result categories help users when document ranking fails. In CHI-05: Proc. of the SIGCHI Conference on Human Factors in Computing Systems (pp. 131–140), Portland, OR, USA.
Larkey, L. S. (1999). A patent search and classification system. In Proc. of DL-99, 4th ACM Conference on Digital Libraries (pp. 179–187), Berkeley, CA, USA.
Lewis, D. D. (1995). The TREC-4 filtering track: description and analysis. In Proc. of TREC-4, 4th Text Retrieval Conference (pp. 165–180), Gaithersburg, MD, USA.
Liddy, E.D., Sutton, S., Allen, E., Harwell, S., Corieri, S., Yilmazel, O., Ozgencil, N.E., Diekema, A., McCracken, N., & Silverstein, J. (2002). Automatic metadata generation and evaluation. In Proc. of ACM SIGIR (pp. 401–402), Tampere, Finland.
Luu, T., Klemm, F., Podnar, I., Rajman, M. & Aberer, K. (2006). ALVIS Peers: A Scalable Full-text Peer-to-Peer Retrieval Engine. In Proc. of ACM P2PIR’06 (pp. 41–48), Arlington, VA, USA.
Maier, A. & Simmen, D. (2001). DB2 Optimization in Support of Full Text Search. Bulletin of IEEE on Data Engineering.
Manning, Ch. D., Raghavan, P., & Schütze, H. (2007). Introduction to Information Retrieval. Cambridge University Press.
Microsoft (2007). SQL Server Full Text Search Engine, http://technet.microsoft.com/en-us/library/ms345119.aspx
Oracle Text (2007). Oracle Text Product Description, homepage: http://www.oracle.com/technology/products/text/index.html
Ponzetto, S. P., & Strube, M. (2006). Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proc. of HLT-NAACL, Human Language Technology Conf. of the NAACL (pp. 192–199), New York, USA.
Radev, D., Blair-Goldensohn, S., & Zhang, Z. (2001). Experiments in single and multi-document summarization using MEAD. In Proc. of DUC-01, Document Understanding Conf., Workshop on Text Summarization, New Orleans, USA.
Ruppenhofer, J., Ellsworth, M., Petruck, M. R. L., Johnson, Ch. R., & Scheffczyk, J. (2006). FrameNet II: Extended Theory and Practice. International Computer Science Institute, Berkeley, USA.
SAS (2007). SAS Text Miner, http://www.sas.com/technologies/analytics/datamining/textminer/
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Sibanda, T., & Uzuner, Ö. (2006). Role of local context in automatic deidentification of ungrammatical, fragmented text. In Proc. of HLT-NAACL, Human Language Technology Conf. of the NAACL (pp. 65–73), New York, USA.
Silvestri, F., Orlando, S. & Perego, R. (2004). WINGS: A Parallel Indexer for Web Contents. Lecture Notes in Computer Science, 3036, pp. 263–270.
SPSS (2007). Predictive Text Analysis, http://www.spss.com/predictive_text_analytics/
Statsoft (2007). STATISTICA Text Miner, http://www.statsoft.com
Tikk, D., Biró, Gy., & Törcsvári, A. (2007). A hierarchical online classifier for patent categorization. In do Prado, H. A. & Ferneda, E., editors, Emerging Technologies of Text Mining: Techniques and Applications. Idea Group Inc. (in press).
Zamir, O., Etzioni, O., Madani, O., & Karp, R. M. (1997). Fast and intuitive clustering of web documents. In Proc. of SIGKDD-97, 3rd Int. Conf. on Knowledge Discovery and Data Mining (pp. 287–290), Newport Beach, USA.
Zobel, J. & Moffat, A. (2006). Inverted Files for Text Search Engines. ACM Computing Surveys, 38(2), Article 6.

KEY TERMS

Full-Text Search (FTS) Engine: A module within a database management system that supports efficient search in free texts. The main operations supported by the FTS engine are exact matching, position-based matching, similarity-based matching, grammar-based matching and semantic-based matching.

Fuzzy Matching: A special type of matching where the similarity of two terms is calculated as the cost of transforming one into the other. The most widely used cost calculation method is the edit distance.

Indexer: A module that builds one or more indices to speed up information retrieval from free text. These indices usually contain the following information: terms (words), occurrences of the terms, and format attributes.

Inverted Index: An index structure where every key value (term) is associated with a list of object identifiers (representing documents). The list contains the objects that include the given key value.

Query Refinement Engine: A component of the FTS engine that generates new, refined queries from the initial query in order to improve the efficiency of the retrieval. The refined queries can be generated using the users’ responses or some typical patterns in the query history.

Ranking Engine: A module within the FTS engine that ranks the documents of the result set based on their relevance to the query.

Sectioner: A component of the FTS engine that breaks the text into larger units called sections. The types of extracted sections are usually determined by the document type.

Stemmer: A language-dependent module that determines the stem form of a given word. The stem form is usually identical to the morphological root. It requires a language dictionary.

Thesaurus: A special repository of terms, which contains not only the words themselves but also their similarity, generalization and specialization relationships. It describes the context of a word but does not give an explicit definition of the word.

Word-Breaker: A component of the full-text engine whose function is to break the text into words and phrases.


Functional Dimension Reduction for Chemometrics

Tuomas Kärnä
Helsinki University of Technology, Finland

Amaury Lendasse
Helsinki University of Technology, Finland

INTRODUCTION

High-dimensional data are becoming more and more common in data analysis. This is especially true in fields related to spectrometric data, such as chemometrics. Owing to the development of more accurate spectrometers, one can obtain spectra of thousands of data points. Such high-dimensional data are problematic in machine learning because of increased computational time and the curse of dimensionality (Haykin, 1999; Verleysen & François, 2005; Bengio, Delalleau, & Le Roux, 2006). It is therefore advisable to reduce the dimensionality of the data. In chemometrics, the spectra are usually rather smooth and low in noise, so function fitting is a convenient tool for dimensionality reduction. The fit is obtained by fixing a set of basis functions and computing the fitting weights according to the least squares error criterion.

This article describes an unsupervised method for finding a good function basis that is specifically built to suit the data set at hand. The basis consists of a set of Gaussian functions that are optimized for an accurate fit. The obtained weights are further scaled using the Delta Test (DT) to improve the prediction performance. A Least Squares Support Vector Machine (LS-SVM) model is used for estimation.

BACKGROUND

The approach in which multivariate data are treated as functions instead of traditional discrete vectors is called Functional Data Analysis (FDA) (Ramsay & Silverman, 1997). A crucial part of FDA is the choice of the basis functions that allow the functional representation. Commonly used bases are B-splines (Alsberg & Kvalheim, 1993), Fourier series and wavelets (Shao, Leung, & Chau, 2003). However, it is appealing to build a problem-specific basis that employs the statistical properties of the data at hand.

In the literature, there are examples of finding the optimal set of basis functions that minimize the fitting error, such as Functional Principal Component Analysis (Ramsay et al., 1997). The basis functions obtained by Functional PCA usually have global support (i.e. they are non-zero throughout the data interval). Thus these functions are not well suited to encoding the spatial information of the data. Spatial information, however, may play a major role in many fields, such as spectroscopy. For example, the measured spectra often contain spikes at certain wavelengths that correspond to certain substances in the sample. These areas are therefore bound to be relevant for estimating the quantity of those substances.

We propose that locally supported functions, such as Gaussian functions, can be used to encode this sort of spatial information. In addition, variable selection can be used to separate the relevant functions from the irrelevant ones. Selecting important variables directly on the raw data is often difficult because of the high dimensionality of the data; the computational cost of variable selection methods, such as Forward-Backward Selection (Benoudjit, Cools, Meurens, & Verleysen, 2004; Rossi, Lendasse, François, Wertz, & Verleysen, 2006), grows exponentially with the number of variables. Therefore, wisely placed Gaussian functions are proposed as a tool for encoding spatial information while reducing data dimensionality, so that other, more powerful information processing tools become feasible. Delta Test (DT) (Jones, 2004) based scaling of variables is suggested for improving the prediction performance.

A typical problem in chemometrics deals with predicting some chemical quantity directly from a measured spectrum. Owing to the additivity of absorption spectra, the problem is assumed to be linear, and therefore linear models such as Partial Least Squares (Härdle, Liang, & Gao, 2000) have been widely used for the prediction task. However, it has been shown that the additivity assumption does not always hold, and environmental conditions may introduce further non-linearity into the data (Wülfert, Kok, & Smilde, 1998). We therefore propose that, in order to address a general prediction problem, a non-linear method should be used. LS-SVM is a relatively fast and reliable non-linear model which has also been applied to chemometrics (Chauchard, Cogdill, Roussel, Roger, & Bellon-Maurel, 2004).

USING GAUSSIAN BASIS WITH SPECTROMETRIC DATA

Consider a problem where the goal is to estimate a certain quantity $p \in \mathbb{R}$ from a measured absorption spectrum $X$ based on a set of $N$ training examples $(X_j, p_j)_{j=1}^{N}$. In practice, the spectrometric data $X_j$ is a set of discretized measurements $(x_i^j, y_i^j)_{i=1}^{m}$, where $x_i^j \in [a, b] \subset \mathbb{R}$ is the observation wavelength and $y_i^j \in \mathbb{R}$ is the response.

Adopting the FDA framework (Ramsay et al., 1997), our goal is to build a prediction model $F$ so that $\hat{p} = F(X)$. Here, the argument $X$ is a real-world spectrum, i.e. a continuous function that maps wavelengths to responses. Without much loss of generality it can be assumed that $X$ belongs to $L^2([a, b])$, the space of square integrable functions on the interval $[a, b]$. However, since the spectrum $X$ is unknown and infinite-dimensional, it is impossible to build the model $F(X)$ in practice. Therefore $X$ must be approximated with a $q$-dimensional representation $\omega = P(X)$, $P : L^2 \to \mathbb{R}^q$, and our prediction model becomes $\hat{p} = F(\omega)$. Naturally, in order to obtain dimensionality reduction, we require that $q$ is smaller than the number of points in the spectra.

Figure 1 presents a graph of the overall prediction method. Gaussian fitting is used for the approximation of $X$. The obtained vectors $\omega$ are further scaled by a diagonal matrix $A$ before the final LS-SVM modeling. The following sections explain these steps in greater detail.

Figure 1. Outline of the prediction method

Gaussian Fitting: Approximating Spectral Function X

Because the space $L^2([a, b])$ is an infinite dimensional function space, it is necessary to consider some finite dimensional subspace $V \subset L^2([a, b])$ in order to obtain a feasible function approximation. We define $V$ by a set of Gaussian functions

$$\varphi_k(x) = e^{-\frac{\|x - t_k\|^2}{\sigma_k^2}}, \quad k = 1, \ldots, q, \qquad (1)$$

where $t_k$ is the center and $\sigma_k$ is the width parameter. The set $\varphi_k(x)$ spans a $q$ dimensional normed vector space and we can write $V = \mathrm{span}\{\varphi_k(x)\}$. A natural choice for the norm is the $L^2$ norm:

$$\|\hat{f}\|_V = \left( \int_a^b \hat{f}(x)^2 \, dx \right)^{1/2}.$$

Now $X$ can be approximated using the basis representation $\hat{X}(x) = \omega^T \boldsymbol{\varphi}(x)$, where $\boldsymbol{\varphi}(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_q(x)]^T$.

The weights $\omega$ are chosen to minimize the square error:

$$\min_{\omega} \sum_{i=1}^{m} \left( y_i - \omega^T \boldsymbol{\varphi}(x_i) \right)^2. \qquad (2)$$

In other words, we simply fit a function to the points $(x_i, y_i)_{i=1}^{m}$ using the basis functions $\varphi_k(x)$. Now, any function $\hat{X} \in V$ is uniquely determined by the weight vector $\omega$. This suggests that it is equivalent to analyze the discrete weight vectors $\omega$ instead of the continuous functions $\hat{X}$.
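The fit in Eq. (2) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the synthetic "spectrum", and the fixed centers and widths are assumptions made here for the example.

```python
import numpy as np

def gaussian_design_matrix(x, centers, widths):
    """G[i, k] = exp(-(x_i - t_k)^2 / sigma_k^2), following Eq. (1)."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-(diff ** 2) / widths[None, :] ** 2)

def fit_weights(x, y, centers, widths):
    """Least-squares weights minimising the square error of Eq. (2)."""
    G = gaussian_design_matrix(x, centers, widths)
    omega, *_ = np.linalg.lstsq(G, y, rcond=None)
    return omega

# Synthetic data (assumed for illustration): a curve that lies exactly in V.
x = np.linspace(0.0, 1.0, 100)          # observation "wavelengths"
centers = np.linspace(0.0, 1.0, 8)      # t_k, fixed here
widths = np.full(8, 0.1)                # sigma_k, fixed here
true_omega = np.array([0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.3, 1.0])
y = gaussian_design_matrix(x, centers, widths) @ true_omega

omega = fit_weights(x, y, centers, widths)  # 8 numbers now represent 100 samples
```

Each 100-point curve is reduced to a weight vector of length q = 8, which is the dimension reduction the section describes.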

Orthonormalization

Radially symmetric models (such as the LS-SVM) depend only on the distance metric $d(\cdot,\cdot)$ in the input space. Thus, we require that the mapping from $V$ to $\mathbb{R}^q$ is isometric, i.e. $d_V(\hat{f}, \hat{g}) = d_q(\alpha, \beta)$ for any functions $\hat{f}(x) = \alpha^T \boldsymbol{\varphi}(x)$ and $\hat{g}(x) = \beta^T \boldsymbol{\varphi}(x)$. The first distance is calculated in the function space and the latter one in $\mathbb{R}^q$. In the space $V$, distances are defined by the norm $d(\hat{f}, \hat{g}) = \|\hat{f} - \hat{g}\|_V$. Now a simple calculation gives

$$\|\hat{f} - \hat{g}\|_V^2 = \int_a^b \left( \sum_{k=1}^{q} (\alpha_k - \beta_k) \varphi_k(x) \right)^2 dx = (\alpha - \beta)^T \Phi (\alpha - \beta),$$

where

$$\Phi_{i,j} = \int_a^b \varphi_i(x) \varphi_j(x) \, dx.$$

This implies that if the basis is orthonormal, the matrix $\Phi$ becomes an identity matrix and the distances become equal, i.e.

$$\|\hat{f} - \hat{g}\|_V = \|\alpha - \beta\|_q = \left( (\alpha - \beta)^T (\alpha - \beta) \right)^{1/2}.$$

Unfortunately this is not the case with the Gaussian basis, and a linear transformation $\tilde{\omega} = U\omega$ needs to be applied. Here the matrix $U$ is the Cholesky factor of $\Phi = U^T U$. In fact, the transformed weights $\tilde{\omega}$ are related to a set of new basis functions $\tilde{\boldsymbol{\varphi}} = U^{-T}\boldsymbol{\varphi}$ (so that $\tilde{\omega}^T \tilde{\boldsymbol{\varphi}} = \omega^T \boldsymbol{\varphi}$) that are both optimized to fit the data and orthonormal.

Finding an Optimal Gaussian Basis

When the basis functions are fixed, the weights $\omega$ are obtained easily by solving the problem (2). The solution is given by the pseudoinverse, $\omega = (G^T G)^{-1} G^T y$ (Haykin, 1999), where $y = [y_1, y_2, \ldots, y_m]^T$ are the values to be fitted and $[G]_{i,j} = \varphi_j(x_i)$. Since the Gaussian functions are differentiable, the locations and widths can be optimized for a better fit. The average fitting error of all functions is obtained by averaging Eq. (2) over all of the sample inputs $j = 1, \ldots, N$. Using the matrix notation given above, it can be formulated as

$$E = \frac{1}{2N} \sum_{j=1}^{N} (G\omega_j - y_j)^T (G\omega_j - y_j),$$

which can be differentiated with respect to $t_k$ and $\sigma_k$ (Kärnä & Lendasse, 2007). Knowing the partial derivatives, the locations and the widths can be optimized using unconstrained nonlinear optimization. In this article, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method with line search is suggested. The formulation of the BFGS algorithm can be found in Bazaraa, Sherali and Shetty (1993).

An example of spectral data and an optimized set of basis functions is presented in Figure 2. This application is related to prediction of fat content in meat samples using NIR absorption spectra (Kärnä et al., 2007; Rossi et al., 2006; Thodberg, 1996). It can be seen that the basis has adapted to the data: there are narrow functions in the center where there is more variance in the data.

Figure 2. Above: NIR absorption spectra. Below: 13 optimized basis functions
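The orthonormalization step described in this part of the article can be illustrated numerically. The sketch below is an assumption-laden example rather than the authors' code: it approximates the Gram matrix Φ by trapezoidal quadrature on a grid (the article does not say how the integrals are evaluated), and all function names and parameter values are hypothetical.

```python
import numpy as np

def gaussian_basis(x, centers, widths):
    """Columns are the Gaussian basis functions evaluated on the grid x."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-(diff ** 2) / widths[None, :] ** 2)

def trapezoid_weights(x):
    """Quadrature weights of the composite trapezoidal rule on grid x."""
    w = np.zeros_like(x)
    dx = np.diff(x)
    w[:-1] += dx / 2.0
    w[1:] += dx / 2.0
    return w

def gram_matrix(x, centers, widths):
    """Phi[i, j] ~ integral of phi_i(x) * phi_j(x) over the grid range."""
    B = gaussian_basis(x, centers, widths)
    return B.T @ (trapezoid_weights(x)[:, None] * B)

x = np.linspace(0.0, 1.0, 2001)
centers = np.linspace(0.0, 1.0, 6)
widths = np.full(6, 0.2)

Phi = gram_matrix(x, centers, widths)
U = np.linalg.cholesky(Phi).T            # NumPy returns L with Phi = L L^T, so U = L^T

rng = np.random.default_rng(0)
alpha, beta = rng.normal(size=6), rng.normal(size=6)

# Distance between transformed weight vectors ...
dist_weights = np.linalg.norm(U @ alpha - U @ beta)
# ... versus the L2 distance between the two functions (same quadrature).
diff_fn = gaussian_basis(x, centers, widths) @ (alpha - beta)
dist_function = np.sqrt(np.sum(trapezoid_weights(x) * diff_fn ** 2))
```

After the transformation, Euclidean distances between weight vectors agree with the function-space distances, which is the isometry that distance-based models such as the LS-SVM rely on.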

Variable Scaling

Variable scaling can be seen as a generalization of variable selection; in variable selection variables are either included in the training set (corresponding to multiplication by 1) or excluded from it (corresponding to multiplication by 0), while in variable scaling the entire range [0,1] of scalars is allowed. In this article, we present a method for choosing the scaling using the Delta Test (DT) (Lendasse, Corona, Hao, Reyhani, & Verleysen, 2006).

The scalars are generated by iterative Forward-Backward Selection (FBS) (Benoudjit et al., 2004; Rossi et al., 2006). FBS is usually used for variable selection, but it can be extended to scaling as well; instead of turning scalars from 0 to 1 or vice versa, increases by 1/h (in the case of forward selection) or decreases by 1/h (in the case of backward selection) are allowed. The integer h is a constant grid parameter. Starting from an initial scaling, the FBS algorithm changes each of the scalars by ±1/h and accepts the change that results in the best improvement. The process is repeated until no improvement is found. The process is initialized with several sets of random scalars.

DT is a method for estimating the variance of the noise within a data set. Having a set of general input-output pairs $(x_i, y_i)_{i=1}^{N} \in \mathbb{R}^m \times \mathbb{R}$ and denoting the nearest neighbor of $x_i$ by $x_{NN(i)}$, the DT variance estimate is

$$\delta = \frac{1}{2N} \sum_{i=1}^{N} \left| y_{NN(i)} - y_i \right|^2,$$

where $y_{NN(i)}$ is the output of $x_{NN(i)}$. Thus, $\delta$ is equivalent to the residual (i.e. prediction error) of a first-nearest-neighbor model. DT is useful in evaluating the dependence of random variables and therefore it can be used for scaling: the set of scalars that gives the smallest $\delta$ is selected.
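A minimal sketch of the Delta Test and of a forward-backward search over scalars on the grid {0, 1/h, ..., 1} is given below. It is an illustration under stated assumptions, not the authors' code: the names `delta_test` and `fbs_scaling` are hypothetical, Euclidean nearest neighbors are assumed, and a single starting point is used, whereas the article restarts the search from several random initializations.

```python
import numpy as np

def delta_test(X, y):
    """Delta Test: (1 / 2N) * sum_i (y_NN(i) - y_i)^2 with Euclidean neighbours."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbour
    nn = np.argmin(d2, axis=1)
    return np.mean((y[nn] - y) ** 2) / 2.0

def fbs_scaling(X, y, h=4, start=None):
    """Greedy search: move one scalar by +-1/h, keep the best improving move."""
    n_var = X.shape[1]
    # Scalars are stored as integer grid levels 0..h to avoid rounding drift.
    level = np.full(n_var, h) if start is None else np.array(start, dtype=int)
    best = delta_test(X * (level / h), y)
    while True:
        best_trial = None
        for k in range(n_var):
            for step in (1, -1):           # forward and backward moves
                trial = level.copy()
                trial[k] += step
                if trial[k] < 0 or trial[k] > h:
                    continue
                score = delta_test(X * (trial / h), y)
                if score < best:
                    best, best_trial = score, trial
        if best_trial is None:             # no move improves the estimate
            return level / h, best
        level = best_trial

# Synthetic example (assumed): only the first variable carries information.
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 2))
y = np.sin(3.0 * X[:, 0])
scalars, delta = fbs_scaling(X, y, h=4)
```

Because every accepted move strictly lowers the Delta Test value and the grid is finite, the loop terminates; the returned estimate can never exceed the estimate for the initial scaling.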

LS-SVM

LS-SVM is a least squares modification of the Support Vector Machine (SVM) (Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2002). The quadratic optimization problem of SVM is simplified so that it reduces to a linear set of equations. Moreover, regression SVM usually involves three unknown parameters while LS-SVM has only two: the regularization parameter $\gamma$ and the width parameter $\theta$.

Given a set of $N$ training examples $(x_i, y_i)_{i=1}^{N} \in \mathbb{R}^m \times \mathbb{R}$, the LS-SVM model is $\hat{y} = w^T \psi(x) + b$, where $\psi : \mathbb{R}^m \to \mathbb{R}^n$ is a mapping from the input space onto a higher dimensional hidden space, $w \in \mathbb{R}^n$ is a weight vector and $b$ is a bias term. The optimization problem is formulated as

$$\min_{w, b} J(w, b) = \frac{1}{2} \|w\|^2 + \frac{1}{2} \gamma \sum_{i=1}^{N} e_i^2 \quad \text{so that} \quad y_i = w^T \psi(x_i) + b + e_i,$$

where $e_i$ is the prediction error and $\gamma \geq 0$ is a regularization parameter. The dual problem is derived using Lagrange multipliers, which leads to a linear KKT system that is easy to solve (Suykens et al., 2002). Using the dual solution, the original model can be reformulated as

$$\hat{y}(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b,$$

where the kernel $K(x, x_i) = \psi(x)^T \psi(x_i)$ is a continuous and symmetric mapping from $\mathbb{R}^m \times \mathbb{R}^m$ to $\mathbb{R}$ and $\alpha_i$ are the Lagrange multipliers. A widely used choice for $K$ is the standard Gaussian kernel $K(x_1, x_2) = e^{-\|x_1 - x_2\|_2^2 / \theta^2}$. The LS-SVM prediction is the final step in the proposed method, where spectral data is compressed by the Gaussian fitting and the fitting weights are normalized and scaled before the prediction. A more elaborate discussion and applications to real-world data are presented in Kärnä et al. (2007).
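As an illustration of how the dual problem reduces to one linear solve, the following sketch builds the standard LS-SVM regression KKT system from Suykens et al. (2002) with the Gaussian kernel given above. The function names and the toy data are assumptions made for this example; in practice γ and θ would be selected by validation.

```python
import numpy as np

def gaussian_kernel(A, B, theta):
    """K(a, b) = exp(-||a - b||^2 / theta^2) for all rows of A and B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / theta ** 2)

def lssvm_fit(X, y, gamma, theta):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = X.shape[0]
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0
    system[1:, 0] = 1.0
    system[1:, 1:] = gaussian_kernel(X, X, theta) + np.eye(n) / gamma
    solution = np.linalg.solve(system, np.concatenate(([0.0], y)))
    return solution[0], solution[1:]       # bias b, multipliers alpha

def lssvm_predict(X_train, b, alpha, theta, X_new):
    """y_hat(x) = sum_i alpha_i K(x, x_i) + b."""
    return gaussian_kernel(X_new, X_train, theta) @ alpha + b

# Toy regression problem (assumed for illustration only).
X = np.linspace(0.0, 2.0 * np.pi, 40)[:, None]
y = np.sin(X[:, 0])
gamma, theta = 100.0, 1.0
b, alpha = lssvm_fit(X, y, gamma, theta)
y_hat = lssvm_predict(X, b, alpha, theta, X)
```

The training residuals equal α_i/γ, so a larger γ fits the training data more closely at the cost of less regularization.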

FUTURE TRENDS

The only unknown parameter in the proposed method is the number of basis functions, which is selected by validation. In the future, other methods for determining a good basis size should be developed in order to speed up the process. Moreover, the methodology should be tested with various data sets, including data other than spectral data. The LS-SVM predictor could also be replaced with another model. Although the proposed Gaussian fitting combined with the LS-SVM model seems to be fairly robust, the relation between the basis functions and the prediction performance should be studied in detail. It would be desirable to optimize the basis directly for the best possible prediction performance (instead of good data fitting), although this seems difficult due to over-fitting and high computational costs.

CONCLUSION This article deals with the problem of finding a good set of basis functions for dimension reduction of spectral data. We have proposed a method based on Gaussian basis functions where the locations and the widths of the functions are optimized to fit the data as accurately as possible. The basis indeed tends to follow the nature of the data and provides a good tool for dimension reduction. Other methods, such as the proposed DT scaling, will benefit from the smaller data dimension and help to achieve even better data compression. The LS-SVM model is a robust and fast method to be used in the final prediction.

REFERENCES

Alsberg, B. K., & Kvalheim, O. M. (1993). Compression of nth-order data arrays by B-splines. I: Theory. Journal of Chemometrics 7, 61–73.

Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear Programming, Theory and Algorithms. John Wiley and Sons.

Bengio, Y., Delalleau, O., & Le Roux, N. (2006). The Curse of Highly Variable Functions for Local Kernel Machines. Y. Weiss and B. Schölkopf and J. Platt (editors), Neural Information Processing Systems (NIPS 2005), Advances in Neural Information Processing Systems 18, 107–114.

Benoudjit, N., Cools, E., Meurens, M., & Verleysen, M. (2004). Chemometric calibration of infrared spectrometers: selection and validation of variables by non-linear models. Chemometrics and Intelligent Laboratory Systems 70, 47–53.

Chauchard, F., Cogdill, R., Roussel, S., Roger, J. M., & Bellon-Maurel, V. (2004). Application of LS-SVM to non-linear phenomena in NIR spectroscopy: development of a robust and portable sensor for acidity prediction in grapes. Chemometrics and Intelligent Laboratory Systems 71, 141–150.

Haykin, S. (1999). Neural Networks: A Comprehensive Foundation (2nd ed.). Prentice Hall.

Härdle, W., Liang, H., & Gao, J. T. (2000). Partially Linear Models. Physica-Verlag.

Jones, A. J. (2004). New tools in non-linear modeling and prediction. Computational Management Science 1, 109–149.

Kärnä, T., & Lendasse, A. (2007). Gaussian Fitting Based FDA for Chemometrics. F. Sandoval, A. Prieto, J. Cabestany, M. Graña (editors), 9th International Work-Conference on Artificial Neural Networks (IWANN’2007), Lecture Notes in Computer Science 4507, 186–193.

Lendasse, A., Corona, F., Hao, J., Reyhani, N., & Verleysen, M. (2006). Determination of the Mahalanobis matrix using nonparametric noise estimations. M. Verleysen (editor), 14th European Symposium on Artificial Neural Networks (ESANN 2006), d-side publi., 227–232.

Ramsay, J., & Silverman, B. (1997). Functional Data Analysis. Springer Series in Statistics. Springer.

Rossi, F., Lendasse, A., François, D., Wertz, V., & Verleysen, M. (2006). Mutual information for the selection of relevant variables in spectrometric nonlinear modeling. Chemometrics and Intelligent Laboratory Systems 80 (2), 215–226.

Shao, X. G., Leung, A. K., & Chau, F. T. (2003). Wavelet: A New Trend in Chemistry. Accounts of Chemical Research 36, 276–283.

Suykens, J., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2002). Least Squares Support Vector Machines. World Scientific Publishing.

Thodberg, H. (1996). A Review of Bayesian Neural Networks with an Application to Near Infrared Spectroscopy. IEEE Transactions on Neural Networks 7, 56–72.

Verleysen, M., & François, D. (2005). The Curse of Dimensionality in Data Mining and Time Series Prediction. J. Cabestany, A. Prieto, and D.F. Sandoval (editors), 8th International Work-Conference on Artificial Neural Networks (IWANN’2005), Lecture Notes in Computer Science 3512, 758–770.

Wülfert, F., Kok, W. T., & Smilde, A. K. (1998). Influence of temperature on vibrational spectra and consequences for the predictive ability of multivariate models. Analytical Chemistry 70, 1761–1767.

KEY TERMS

Chemometrics: Application of mathematical or statistical methods to chemical data. Closely related to monitoring of chemical processes and instrument design.

Curse of Dimensionality: A theoretical result in machine learning that states that the lower bound of error that an adaptive machine can achieve increases with data dimension. Thus performance will degrade as data dimension grows.

Delta Test: A non-parametric noise estimation method. Estimates the amount of noise within a data set, i.e. the amount of information that cannot be explained by any model. Therefore Delta Test can be used to obtain a lower bound of learning error which can be achieved without risk of over-fitting.

Functional Data Analysis: A statistical approach where multivariate data are treated as functions instead of discrete vectors.

Least Squares Support Vector Machine: A least squares modification of the Support Vector Machine which leads to solving a linear set of equations. Also bears close resemblance to Gaussian Processes.

Machine Learning: An area of Artificial Intelligence dealing with adaptive computational methods such as Artificial Neural Networks and Genetic Algorithms.

Over-Fitting: A common problem in Machine Learning where the training data can be explained well but the model is unable to generalize to new inputs. Over-fitting is related to the complexity of the model: any data set can be modelled perfectly with a model complex enough, but the risk of learning random features instead of meaningful causal features increases.

Support Vector Machine: A kernel based supervised learning method used for classification and regression. The data points are projected into a higher dimensional space where they are linearly separable. The projection is determined by the kernel function and a set of specifically selected support vectors. The training process involves solving a Quadratic Programming problem.

Variable Selection: Process where unrelated input variables are discarded from the data set. Variable selection is usually based on correlation or noise estimators of the input-output pairs and can lead to significant improvement in performance.


Functional Networks


Oscar Fontenla-Romero University of A Coruña, Spain Bertha Guijarro-Berdiñas University of A Coruña, Spain Beatriz Pérez-Sánchez University of A Coruña, Spain

INTRODUCTION

Functional networks are a generalization of neural networks, which is achieved by using multiargument and learnable functions, i.e., in these networks the transfer functions associated with neurons are not fixed but learned from data. In addition, there is no need to include parameters to weight links among neurons since their effect is subsumed by the neural functions. Another distinctive characteristic of these models is that the specification of the initial topology for a functional network can be based on the features of the problem we are facing. Therefore knowledge about the problem can guide the development of a network structure, although in the absence of this knowledge a general model can always be used. In this article we present a review of the field of functional networks, which will be illustrated with practical examples.

BACKGROUND

Artificial Neural Networks (ANN) are a powerful tool to build systems able to learn and adapt to their environment, and they have been successfully applied in many fields. Their learning process consists of adjusting the values of their parameters, i.e., the weights connecting the network’s neurons. This adaptation is carried out through a learning algorithm that tries to fit some training data representing the problem to be learnt. This algorithm is guided by the minimization of some error function that measures how well the ANN fits the training data (Bishop, 1995). This process is called parametric learning. One of the most popular neural network models is the Multilayer Perceptron (MLP), for which many learning algorithms can be used: from the brilliant backpropagation (Rumelhart, Hinton & Williams, 1986) to the more complex and efficient Scaled Conjugate Gradient (Möller, 1993) or Levenberg-Marquardt algorithms (Hagan & Menhaj, 1994). In addition, the topology of the network (number of layers, neurons, connections, activation functions, etc.) also has to be determined. This is called structural learning and it is carried out mostly by trial and error. As a result, there are two main drawbacks in dealing with neural networks:

1. The resulting function lacks the possibility of a physical or engineering interpretation. In this sense, neural networks act as black boxes.
2. There is no guarantee that the weights provided by the learning algorithm correspond to a global optimum of the error function; it can be a local one.

Models like Generalized Linear Networks (GLN) present a unique global optimum that can be obtained by solving a set of linear equations. However, their mapping function is limited, as this model consists of a single layer of adaptive weights ($w_j$) to produce a linear combination of non-linear functions ($\varphi_j$):

$$y(\mathbf{x}) = \sum_{j=0}^{M} w_j \varphi_j(\mathbf{x}).$$

Some other popular models are Radial Basis Function Networks (RBF), whose hidden units use distances to a prototype vector ($\boldsymbol{\mu}_j$) followed by a transformation with a localized function like the Gaussian:

$$y(\mathbf{x}) = \sum_{j=0}^{M} w_j \varphi_j(\mathbf{x}) = \sum_{j=0}^{M} w_j \exp\left( -\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2} \right).$$

The resulting architecture is simpler than that of the MLP, therefore reducing the complexity of structural learning and facilitating physical interpretation. However, they present some other limitations, like their inability to distinguish non-significant input variables (Bishop, 1995) or to learn some logic transformations (Moody & Darken, 1989), or the need for a large number of nodes even for a linear map if the precision requirement is high (Youssef, 1993).

Due to these limitations, several models that extend the original ANN have appeared, such as fuzzy neural networks (Gupta & Rao, 1994), growing neural networks, or probabilistic neural networks (Specht, 1990). Nowadays, the majority of these models still act as black boxes. Functional networks (Castillo, 1998; Castillo, Cobo, Gutiérrez, & Pruneda, 1998), a relatively new extension of neural networks, take into account the functional structure and properties of the process being modeled, which naturally determine the initial network’s structure. Moreover, the estimation of the network’s weights is often based on an error function that can be minimized by solving a system of linear equations, therefore leading more quickly to a unique and global solution.

DESCRIPTION OF FUNCTIONAL NETWORKS

Functional networks (FN) are a generalization of neural networks, which is achieved by using multiargument and learnable functions (Castillo, 1998; Castillo, Cobo, Gutiérrez, & Pruneda, 1998), i.e., the shapes of the functions associated with neurons are not fixed but learned from data. In this case, it is not necessary to include weights on the links among neurons since their effect is subsumed by the neural functions. Figure 1 shows an example of a general FN for $I = N_0$ explanatory variables. Functional networks consist of the following elements:

a. Several layers of storing units (represented in Figure 1 by small filled circles). These units are used to store both the input and the output of the network, or to store intermediate information (see units $y_i^{(k)}$ in Figure 1).

b. One or more layers of functional units or neurons (represented by open circles with the name of each of the functional units inside). These neurons include a function that can be multivariate and that can have as many arguments as inputs. These arguments, and therefore the form of the neural functions, are learnt during training. By applying their functions, neurons evaluate a set of input values in order to return a set of output values to the next layer of storing units. In this general model each neural function $f_i^{(m)}$ is defined as the following composition:

$$f_i^{(m)}\left( y_1^{(m-1)}, \ldots, y_{N_{m-1}}^{(m-1)} \right) = g_i^{(m)}\left( h_{i1}^{(m)}\left( y_1^{(m-1)} \right), \ldots, h_{iN_{m-1}}^{(m)}\left( y_{N_{m-1}}^{(m-1)} \right) \right)$$

where the superscript $(m)$ is the number of the layer. The functions $g_i^{(m)}$ are known and fixed before training, for example to be the sum or product. In contrast, the functions $h_{ij}^{(m)}$ are linear combinations of other known functions $\varphi_{ijz}^{(m)}$ (for example, polynomials, cosines, etc.), i.e.

$$h_{ij}^{(m)}\left( y_j^{(m-1)} \right) = \sum_{z=1}^{n_{ij}^{(m)}} a_{ijz}^{(m)} \varphi_{ijz}^{(m)}\left( y_j^{(m-1)} \right),$$

where the coefficients $a_{ijz}^{(m)}$ involved in this linear combination are the model parameters to be learned. As can be observed, MLPs, GLNs and RBFs are particular cases of this generalized model.

c. A set of directed links that connect the functional units and the storing units. These connections indicate the direction of the flow of information. The general FN in Figure 1 does not have arrows that converge in the same storing unit, but if it did, this would indicate that the neurons from which they emanate must produce identical outputs. This is an important feature of FNs that is not available for neural networks. These converging arrows represent constraints which can arise from physical and/or theoretical characteristics of the problem under consideration.

Figure 1. Generalized model for functional networks
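To make the composition in element (b) concrete, the sketch below evaluates a single functional unit whose inner functions h are linear combinations of monomials and whose outer function g is fixed in advance. The names `h` and `functional_unit`, and the choice of a monomial family, are assumptions for illustration only.

```python
import numpy as np

def h(y, coeffs):
    """h(y) = a_0 + a_1 * y + a_2 * y^2 + ...: a linear combination of monomials."""
    return np.polynomial.polynomial.polyval(y, coeffs)

def functional_unit(inputs, coeff_list, g=np.sum):
    """f(y_1, ..., y_N) = g(h_1(y_1), ..., h_N(y_N)); g is fixed, the a's are learned."""
    transformed = [h(y, a) for y, a in zip(inputs, coeff_list)]
    return g(transformed, axis=0)

# Two inputs: h_1(y) = 1 + y and h_2(y) = y^2 (coefficients chosen arbitrarily).
coeffs = [[1.0, 1.0], [0.0, 0.0, 1.0]]
out_sum = functional_unit([2.0, 3.0], coeffs, g=np.sum)    # (1 + 2) + 3^2
out_prod = functional_unit([2.0, 3.0], coeffs, g=np.prod)  # (1 + 2) * 3^2
```

Only the coefficient lists would change during training; swapping `np.sum` for `np.prod` changes the fixed outer function without touching the learnable part.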

Learning in Functional Networks

Functional networks combine knowledge about the problem, to determine the network, and training data, to estimate the unknown neural functions. Therefore, in contradistinction to neural networks, FNs include two types of learning:

1. Structural learning. The specification of the initial topology for a FN can be based on the features of the problem we are facing (Castillo, Cobo, Gutiérrez, & Pruneda, 1998). Usually knowledge about the problem can be used in order to develop a network structure. An important feature of FNs is that they allow managing functional restrictions determined by some known properties of the model to be estimated. These restrictions can be represented by forcing the outputs of some neurons to coincide in a unique storage unit. Later on, the network can be translated into a system of functional equations that usually can be simplified in order to obtain a simpler but equivalent architecture. Finally, in the absence of knowledge about the problem, the general model shown in Figure 1 can always be used.

2. Parametric learning. This second stage refers to the estimation of the neurons’ functions. Often these neural functions are considered to be linear combinations of functional families, and therefore the parametric learning consists of estimating both the arguments of the neural functions and the parameters of the linear combination using the available training data. It is important to remark that this type of learning generalizes the idea of estimating the weights of a neural network.

An Example Of A Functional Network

In this section the use of FNs is illustrated by means of a simple artificial example. Let us suppose a problem of engine diagnosis for which three continuous variables (x = ’vibrations’, y = ’oil density’, z = ’temperature’) are being monitored. The problem is to estimate the probability P of a given diagnosis based on these variables, i.e., P(x, y, z). Moreover, we know that the information provided by the monitored variables is accumulative. Therefore, it is possible, for example, to calculate first the probability P1(x, y) of a diagnosis based only on variables x and y, and later on, when variable z is available, combine the value provided by P1 with the new information z to obtain P(x, y, z). That is, there exist some functions such that:

P(x, y, z) ≡ F[P1(x, y), z] = K[P2(y, z), x] = L[P3(x, z), y]     (1)

This situation suggests the structure of the FN shown in Figure 2a, where I is the identity function. The coincident connections in the output storing unit, or equivalently eq. 1, establish strong restrictions on the functions P1, P2, P3, F, K, L. The use of methods for functional equations allows us to deal with eq. 1 in order to obtain the corresponding functional conditions, from which it is possible to derive a new equation for function P: P(x, y, z) = k[p(x) + q(y) + g(z)]. This leads to the new, simpler FN represented in Figure 2b, which is equivalent to that of Figure 2a.

Figure 2. Functional network for the diagnosis example

A Comparison Between Functional and Neural Networks

Although FNs are extensions of neural networks, there are some main features that distinguish the two models:

1. Neural networks are derived only from data about the problem. However, FNs can also use knowledge about the problem to derive their topology, incorporating properties of the function to be modeled.
2. During learning in neural networks the shape of the neural functions is fixed, usually to be a sigmoid type function, and only the weights can be adapted. In FNs, neural functions are also learnt.
3. Neural functions that can be employed in neural networks are limited and belong to some known family. Also, for each layer the same function is used for every neuron. In FNs any arbitrary function can be used for each neuron.
4. These functions can be multiargument and multivariate. In neural networks activation functions have only one argument (a combination of several input data).
5. In FNs it is possible to force the output of some neurons to coincide by connecting them to the same storing unit. These connections are restrictions on the model that sometimes can be used to derive a simpler model.

Some Functional Network Models

In this section some typical FN models are presented, which allow several real problems to be solved.

The Uniqueness Model

This is a simple but very powerful model for which the output z of the corresponding FN architecture can be written as a function of the inputs x and y,

$$z = F(x, y) = f_3^{-1}\left( f_1(x) + f_2(y) \right) \qquad (2)$$

Uniqueness of Representation. For this model to have uniqueness of solution it is only required to fix the functions $f_1, f_2, f_3$ at a point (see explanation in Castillo, Cobo, Gutiérrez, & Pruneda, 1998).

Learning the model. Learning the function F(x, y) in eq. 2 is equivalent to learning the functions $f_1, f_2, f_3$ from a data set $\{(x_j, y_j, z_j): j = 1, \ldots, n\}$, where $z_j$ is the desired output for the given inputs. To estimate $f_1, f_2, f_3$ we can employ the non-linear and linear methods:

1. The Non-Linear Method. We approximate each of the functions $f_1, f_2, f_3^{-1}$ in $z = F(x, y) = f_3^{-1}(f_1(x) + f_2(y))$ by considering them to be a linear combination of known functions from a given family (e.g., polynomial). Finally, the following sum of squared errors is minimized,

$$Q = \sum_{j=1}^{n} e_j^2 = \sum_{j=1}^{n} \left( z_j - \sum_{k=1}^{m_3} a_{3k} \varphi_{3k}\left( \sum_{i=1}^{m_1} a_{1i} \varphi_{1i}(x_j) + \sum_{i=1}^{m_2} a_{2i} \varphi_{2i}(y_j) \right) \right)^2$$

2. The Linear Method. A simplification of the non-linear method can be done by considering the following equivalence:

$$z_j = f_3^{-1}\left( f_1(x_j) + f_2(y_j) \right) \Leftrightarrow f_3(z_j) = f_1(x_j) + f_2(y_j); \quad j = 1, \ldots, n$$

Again the functions $f_s$ can be approximated as a linear combination of known functions from a given family. Finally, the following sum of squared errors,

$$Q = \sum_{j=1}^{n} e_j^2 = \sum_{j=1}^{n} \left( \sum_{i=1}^{m_1} a_{1i} \varphi_{1i}(x_j) + \sum_{i=1}^{m_2} a_{2i} \varphi_{2i}(y_j) - \sum_{i=1}^{m_3} a_{3i} \varphi_{3i}(z_j) \right)^2$$

can be minimized by solving a system of linear equations, where the unknowns are the coefficients $a_{si}$, as is demonstrated in (Castillo, Cobo, Gutiérrez, & Pruneda, 1998).
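The linear method for the uniqueness model can be sketched as an equality-constrained least-squares problem. The code below is one possible illustration, not the procedure from Castillo et al. (1998): it assumes a monomial family, imposes the uniqueness conditions as three point constraints f_s(point) = value, and solves the resulting Lagrange-multiplier (KKT) linear system. All names and the synthetic data are hypothetical.

```python
import numpy as np

def poly_features(t, degree):
    """Columns 1, t, t^2, ..., t^degree (the assumed functional family)."""
    return np.vander(np.atleast_1d(t), degree + 1, increasing=True)

def fit_uniqueness_linear(x, y, z, degree, fixed_points):
    """Minimise sum_j (f1(x_j) + f2(y_j) - f3(z_j))^2 subject to f_s(point_s) = value_s.

    fixed_points holds three (point, value) pairs, one per function f1, f2, f3.
    Returns a 3 x (degree + 1) array of coefficients a[s, i].
    """
    m = degree + 1
    # The residual is linear in the stacked coefficient vector [a1, a2, a3].
    A = np.hstack([poly_features(x, degree),
                   poly_features(y, degree),
                   -poly_features(z, degree)])
    C = np.zeros((3, 3 * m))
    d = np.zeros(3)
    for s, (point, value) in enumerate(fixed_points):
        C[s, s * m:(s + 1) * m] = poly_features(point, degree)[0]
        d[s] = value
    # Stationarity of the Lagrangian plus the constraints: one linear system.
    kkt = np.block([[2.0 * A.T @ A, C.T],
                    [C, np.zeros((3, 3))]])
    rhs = np.concatenate([np.zeros(3 * m), d])
    solution = np.linalg.solve(kkt, rhs)
    return solution[:3 * m].reshape(3, m)

# Synthetic data (assumed): z = sqrt(x^2 + y^2), i.e. f1(t) = f2(t) = f3(t) = t^2.
rng = np.random.default_rng(1)
x = rng.uniform(0.5, 2.0, size=50)
y = rng.uniform(0.5, 2.0, size=50)
z = np.sqrt(x ** 2 + y ** 2)
coeffs = fit_uniqueness_linear(x, y, z, degree=2,
                               fixed_points=[(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)])
```

Without the three point constraints the minimizer would not be unique (the zero function is always a solution), which is why the functions must be fixed at a point before the linear system is solved.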

The Generalized Associativity Model

Figure 3a shows a generalized associativity FN of three inputs, where the nodes I represent the identity function. This model is based on the generalized associative property, that is, the output of this network can be obtained as a function of G(x, y) and the input z, or as a function of the input x and N(y, z). This property is represented with the links convergent to the output node u, which leads to the functional equation

$$F[G(x, y), z] = K[x, N(y, z)] \qquad (3)$$

Figure 3. The generalized associativity functional network

Simplification of the model. It can be shown that the general solution of eq. 3 is:

$$F(x, y) = k[f(x) + r(y)]; \quad G(x, y) = f^{-1}[p(x) + q(y)]$$
$$K(x, y) = k[p(x) + n(y)]; \quad N(x, y) = n^{-1}[q(x) + r(y)] \qquad (4)$$

where f, r, k, n, p, q are arbitrary continuous and strictly monotonic functions. Substituting eq. 4 in eq. 3, the following result is obtained

$$F[G(x, y), z] = K[x, N(y, z)] = u = k[p(x) + q(y) + r(z)] \qquad (5)$$

Thus, the FN in Figure 3b is equivalent to the FN in Figure 3a.

Uniqueness of Representation. By employing functional equations for the generalized associativity model it can be demonstrated that uniqueness of solution requires fixing the functions k, p, q, r at a point (see Castillo, Cobo, Gutiérrez, & Pruneda, 1998).

Learning the model. The problem of learning the FN in Figure 3b involves estimating the functions k, p, q, r in eq. 5, which can be rewritten as:

$$k^{-1}(u) = p(x) + q(y) + r(z)$$

With $\{(x_{1i}, x_{2i}, x_{3i}, x_{4i}) \mid i = 1, \ldots, n\}$, where $(x_1, x_2, x_3, x_4) \equiv (x, y, z, u)$, being the observed training sample of size n, we can define the error

$$e_i = \hat{p}(x_{1i}) + \hat{q}(x_{2i}) + \hat{r}(x_{3i}) - \hat{k}^{-1}(x_{4i}); \quad i = 1, \ldots, n$$

Suppose that each of the functions is a linear combination of known functions from given families (e.g. polynomial). Then, the sum of squared errors is defined as

$$Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( \sum_{k=1}^{4} \sum_{j=1}^{m_k} a_{kj} \varphi_{kj}(x_{ki}) \right)^2,$$

where the minus sign of the fourth term is absorbed into the coefficients $a_{4j}$. Employing the Lagrange multipliers technique, the minimum is obtained by solving the following system of linear equations:

$$\frac{\partial Q_L}{\partial a_{kr}} = 2 \sum_{i=1}^{n} e_i \varphi_{kr}(x_{ki}) + \lambda_k \varphi_{kr}(\alpha_k) = 0; \quad \forall k, r,$$

$$\frac{\partial Q_L}{\partial \lambda_k} = \sum_{j=1}^{m_k} a_{kj} \varphi_{kj}(\alpha_k) - \beta_k = 0; \quad \forall k,$$

where the second set of equations fixes each function at a point, as required for uniqueness, and the unknowns are the multipliers $\lambda_1, \ldots, \lambda_4$ and the coefficients in the set $\{a_{kj} \mid j = 1, \ldots, m_k;\ k = 1, 2, 3, 4\}$, which are the parameters of the FN.

The Separable Model

Consider the equation

$$z = F(x, y) = \sum_{i=1}^{n} f_i(x) r_i(y) = \sum_{j=1}^{m} h_j(x) k_j(y) \qquad (6)$$

which can be written as

$$\sum_{i=1}^{k} f_i(x) g_i(y) = 0,$$

where $k = n + m$, $g_i(y) = r_i(y)$ for $i = 1, \ldots, n$, and $f_i(x) = h_{i-n}(x)$, $g_i(y) = -k_{i-n}(y)$; $i = n+1, \ldots, n+m$. This suggests the FN in Figure 4a.

Figure 4. Separable functional network architecture

Simplification of the model. Assuming that $\{f_1(x), \ldots, f_r(x)\}$ and $\{g_{r+1}(y), \ldots, g_k(y)\}$ are two sets of linearly independent functions, the general solution of eq. 6 is

$$f_j(x) = \sum_{s=1}^{r} a_{js} f_s(x); \quad j = r+1, \ldots, k,$$

$$g_s(y) = -\sum_{j=1}^{k-r} a_{r+j,s}\, g_{r+j}(y); \quad s = 1, \ldots, r.$$

By replacing these terms in eq. 6 we obtain

$$z = F(x, y) = \sum_{i=1}^{r} \sum_{j=1}^{k-r} c_{ij} f_i(x) g_j(y), \qquad (7)$$

where $c_{ij}$ are the parameters of the model, and which leads to the simplified FN in Figure 4b.

Uniqueness of Representation. In this case the uniqueness of representation is given without the need to fix the functions involved at any point.

Learning the model. In this case a simple least squares method allows obtaining the optimal coefficients $c_{ij}$ using the available data $\{(x_{0l}, x_{1l}, x_{2l}) \mid l = 1, \ldots, n\}$ with $(x_0, x_1, x_2) \equiv (z, x, y)$. In this way, the error can be obtained as

$$e_l = x_{0l} - \sum_{i=1}^{r} \sum_{j=1}^{k-r} c_{ij} f_i(x_{1l}) g_j(x_{2l}); \quad l = 1, \ldots, n.$$

Thus, to find the optimum coefficients we minimize the sum of squared errors $Q = \sum_{l=1}^{n} e_l^2$. In this case, the parameters are not constrained by extra conditions, so the minimum can be obtained by solving the following system of linear equations, where the unknowns are the coefficients $c_{ij}$:

$$\frac{\partial Q}{\partial c_{pq}} = -2 \sum_{l=1}^{n} e_l f_p(x_{1l}) g_q(x_{2l}) = 0; \quad p = 1, \ldots, r; \quad q = 1, \ldots, k-r.$$

Examples of Applications

In this section, illustrative examples for two different models of FN are presented. These models were applied to a regression and a classification problem. In all cases, the functions of each layer were approximated by considering a linear combination of known functions from a polynomial family.
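The least-squares learning of the separable model with a polynomial family can be sketched as follows. This is a minimal illustration, not the authors' implementation: the basis functions, the synthetic data and the helper names (`design_row`, `solve_linear`, `fit_separable`) are assumptions made for the example. It builds the normal equations obtained by setting the derivatives of Q with respect to each coefficient to zero and solves them by Gaussian elimination.

```python
from itertools import product


def design_row(x, y, fs, gs):
    """Products f_p(x) * g_q(y) for every pair (p, q), in row-major order."""
    return [f(x) * g(y) for f, g in product(fs, gs)]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_separable(samples, fs, gs):
    """Return the coefficient matrix c[p][q] minimizing the sum of squared errors."""
    rows = [design_row(x, y, fs, gs) for x, y, _ in samples]
    zs = [z for _, _, z in samples]
    m = len(fs) * len(gs)
    # Normal equations: (A^T A) c = A^T z
    AtA = [[sum(r[p] * r[q] for r in rows) for q in range(m)] for p in range(m)]
    Atz = [sum(r[p] * z for r, z in zip(rows, zs)) for p in range(m)]
    flat = solve_linear(AtA, Atz)
    return [flat[i * len(gs):(i + 1) * len(gs)] for i in range(len(fs))]


# Hypothetical polynomial bases {1, x} and {1, y}, and synthetic noise-free data
fs = [lambda x: 1.0, lambda x: x]
gs = [lambda y: 1.0, lambda y: y]
samples = [(x, y, 1 + 2 * x + 3 * x * y) for x in range(4) for y in range(4)]
c = fit_separable(samples, fs, gs)
```

Because the problem is linear in the coefficients, a single linear solve gives the global minimum, which is the property the article highlights for this model.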

Classification Problem
The first example shows the performance of an FN solving a classification problem: the Wine data set. This database can be obtained from the UCI Machine Learning Repository (see endnote 1). The aim of this problem is to determine the origin of wines using a chemical analysis of 13 continuous attributes. The set contains 178 instances that must be classified into three different classes. For this case, the Separable Model (Figure 4) with three output units was employed. Moreover, its performance is compared to other standard methods: a Multilayer Perceptron (MLP), a Radial Basis Function Network (RBF) and Support Vector Machines (SVM). Figure 5 shows the comparative results. The first subfigure contains the mean accuracy obtained using a leave-one-out cross-validation method. As can be observed, the FN obtains a very good performance for the test set. Regarding the time required for the learning process, the second subfigure shows that the FNs are comparable with the other methods.

Figure 5. Accuracy and training time obtained by different models for the wine data set. [First panel: train and test accuracy per model (MLP, RBF-1, RBF-2, RBF-3, SVM, RF). Second panel: training time in seconds per model, mean and mean plus standard deviation.]

Regression Problem
In this case the aim of the network is to predict the failure shear effort in concrete beams based on several geometrical, longitudinal and transversal parameters of the beam (Alonso-Betanzos, Castillo, Fontenla-Romero, & Sánchez-Maroño, 2004). An FN corresponding to the Associative Model (Figure 3) and an MLP were trained employing ten-fold cross-validation, running 30 simulations with different initial parameter values. A set of 12 samples was kept for further validation of the trained systems. The mean normalized Mean Squared Error over the 30 simulations obtained by the FN was 0.1789 and 0.8460 for test and validation, respectively, while the MLP obtained 0.1361 and 2.9265.

FUTURE TRENDS
Functional networks are being successfully employed in many different real applications. In engineering problems they have been applied, for instance, to surface reconstruction (Iglesias, Gálvez, & Echevarría, 2006). Other works have used these networks for recovering missing data (Castillo, Sánchez-Maroño, Alonso-Betanzos, & Castillo, 2003) and for general regression and classification problems (Lacruz, Pérez-Palomares & Pruneda, 2006). Another recent research line is the investigation of measures of fault tolerance (Fontenla-Romero, Castillo, Alonso-Betanzos, & Guijarro-Berdiñas, 2004), in order to develop new learning methods.

CONCLUSION
This article presents a review of functional networks. Functional networks are inspired by neural networks and functional equations. This model offers all the advantages of ANNs, such as noise tolerance and generalisation capacity, and adds new ones. One of them is the possibility of using knowledge about the problem to be modeled to derive the initial network topology, thus resulting in a model that can be interpreted in physical or engineering terms. Another main advantage is that the initially proposed model can be simplified, using functional equations, and learnt by solving a system of linear equations, which speeds up the learning process and prevents it from getting stuck in a local minimum. Finally, the shape of the neural functions does not have to be fixed; they can be fitted from data during training, which widens the modeling ability of the network.

REFERENCES
Alonso-Betanzos, A., Castillo, E., Fontenla-Romero, O., & Sánchez-Maroño, N. (2004). Shear Strength Prediction using Dimensional Analysis and Functional Networks. Proceedings of European Symposium on Artificial Neural Networks, 251-256.
Bishop, C.M. (1995). Neural Networks for pattern recognition. Oxford University Press.
Castillo, E. (1998). Functional networks. Neural Processing Letters, 7, 151-159.
Castillo, E., Cobo, A., Gutiérrez, J., & Pruneda, R. (1998). Functional networks with applications. A neural-Based Paradigm. Kluwer Academic Publishers.
Castillo, E., & Gutiérrez, J.M. (1998). A comparison of functional networks and neural networks. Proceedings of the IASTED International Conference on Artificial Intelligence and Soft Computing, 439-442.
Castillo, E., Iglesias, A., & Ruiz-Cobo, R. (2004). Functional Equations in Applied Sciences. Elsevier.
Castillo, E., Sánchez-Maroño, N., Alonso-Betanzos, A., & Castillo, C. (2003). Recovering missing data with Functional and Bayesian Networks. Lecture Notes in Computer Science, 2687, part II, 489-496.
Fontenla-Romero, O., Castillo, E., Alonso-Betanzos, A., & Guijarro-Berdiñas, B. (2004). A measure of fault tolerance for functional networks. Neurocomputing, 62, 327-347.
Gupta, M., & Rao, D. (1994). On the principles of fuzzy neural networks. Fuzzy Sets and Systems, 61, 1-18.
Hagan, M.T. & Menhaj, M. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6), 989-993.
Iglesias, A., Gálvez, A. & Echevarría, G. (2006). Surface reconstruction via functional networks. Proceedings of the International Conference on Mathematical and Statistical Modeling. CDROM ISBN:84-689-8577-5.
Lacruz, B., Pérez-Palomares, A. & Pruneda, R.E. (2006). Functional Networks for classification and regression problems. Proceedings of the International Conference on Mathematical and Statistical Modeling. CDROM ISBN:84-689-8577-5.
Moller, M.F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525-533.
Moody, J. & Darken, C.J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2), 281-294.
Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Specht, D. (1990). Probabilistic neural networks. Neural Networks, 3, 109-118.
Youssef, H. M. (1993). Multiple Radial Basis Function Networks in Modeling and Control. AIAA/GNC.

KEY TERMS
Error Function: When talking about learning, this is a function that quantifies how much a system has learnt. One of the most popular error functions is the mean squared error, which measures the differences between the answers provided by a system and the correct answers.
Functional Equation: An equation whose unknowns are expressed in terms of both independent variables and functions.
Functional Network: A structure consisting of processing units and storing units. These units are organized in layers and linked by connections. Each processing unit contains a multivariate and multiargument function to be learnt during a training process.
Lagrange Multiplier: Given the function $f(x_1, x_2, \ldots, x_n)$, the Lagrange multiplier $\lambda$ is used to find the extremum of f subject to a constraint $g(x_1, x_2, \ldots, x_n)$ by solving $\frac{\partial f}{\partial x_k} + \lambda \frac{\partial g}{\partial x_k} = 0, \ \forall k = 1,\ldots,n$.
Learning Algorithm: A process that, based on some training data representing the problem to be learnt, adapts the free parameters of a given model, such as a neural network, in order to obtain a desired functionality.
Linear Equation: An algebraic equation involving only a constant and first-order (linear) terms.
Uniqueness: Property of being the only possible solution.

ENDNOTE
1 Web page: www.ics.uci.edu/~mlearn/MLRepository.html

Fuzzy Approximation of DES State
Juan Carlos González-Castolo, CINVESTAV Unidad Guadalajara, Mexico
Ernesto López-Mellado, CINVESTAV Unidad Guadalajara, Mexico

INTRODUCTION
State estimation of dynamic systems is a resort often used when only a subset of the state variables can be directly measured; observers are the entities computing the system state from the knowledge of its internal structure and its (partially) measured behaviour. The problem of discrete event systems (DES) estimation has been addressed in (Ramirez, 2003) and (Giua 2003); in these works the marking of a Petri net (PN) model of a partially observed event driven system is computed from the evolution of its inputs and outputs. The state of a system can also be inferred using knowledge of the duration of activities. However, this task becomes complex when, besides the absence of sensors, the durations of the operations are uncertain; in this situation the observer obtains and revises a belief that approximates the current system state. Consequently, this approach is useful for non-critical applications of state monitoring and feedback in which an approximate computation is allowed.

The uncertainty of activity durations in DES can be handled using fuzzy PN (FPN) (Murata, 1996), (Cardoso, 1999), (Hennequin, 2001), (Pedrycz, 2003), (Ding, 2005); this PN extension has been applied to knowledge modelling (Chen, 1990), (Koriem, 2000), (Shen, 2003), planning (Cao, 1996), reasoning (Gao, 2003) and controller design (Andreu, 1997), (Leslaw, 2004). In these works the proposed techniques include the computation of imprecise markings; however, the class of models dealt with does not include strongly connected PN for the modelling of cyclic behaviour.

In this article we address the problem of state estimation of DES by calculating the fuzzy marking of a Fuzzy Timed Petri Net (FTPN); for this purpose a set of matrix expressions for the recursive computation of the current fuzzy marking is developed. The article focuses on FTPN whose structure is a Marked Graph (called Fuzzy Timed Marked Graph, FTMG) because it shows intuitively the problems of marking estimation in systems exhibiting cyclic behaviour.

BACKGROUND
Possibility Theory
In possibility theory, a fuzzy set $\tilde{a}$ is used for delimiting ill-known values or for representing values characterized by symbolic expressions. The set is defined as $\tilde{a} = (a_1, a_2, a_3, a_4)$ such that $a_1, a_2, a_3, a_4 \in \mathbb{R}$, $a_1 \le a_2$ and $a_3 \le a_4$. The fuzzy set $\tilde{a}$ delimits the run time as follows:

• The values $\tau_b$, $\tau_a$ in the ranges $(a_1, a_2)$, $(a_3, a_4)$, respectively, indicate that the activity is possibly executed with $\mu_{\tilde{a}}(\tau) \in (0,1)$. When $\tau \in \tau_b$ the function $\mu_{\tilde{a}}(\tau)$ grows towards 1, which means that the possibility of stopping increases. When $\tau \in \tau_a$, the membership function $\mu_{\tilde{a}}(\tau)$ decreases towards 0, representing a reduction of the possibility of stopping.
• The values $(0, a_1]$ mean that the activity is running.
• The values $[a_4, +\infty)$ mean that the activity is stopped.
• The values $\tau \in [a_2, a_3] \mid a_2 \le a_3$ represent full possibility, that is $\mu_{\tilde{a}}(\tau) = 1$; this represents that it is certain that the activity is stopped.
• The support of $\tilde{a}$ is the range $\tau \in [a_1, a_4]$ where $\mu_{\tilde{a}}(\tau) > 0$.

A fuzzy set $\tilde{a}$ is referred to indistinctly by the function $\mu_{\tilde{a}}(\tau)$ or the characterization $(a_1, a_2, a_3, a_4)$. For simplicity, in this work the fuzzy possibility distribution of the time is described with trapezoidal or triangular forms. For example, Fig.1 shows the fuzzy set that represents, in natural language, "the activity will stop about 2.5".

Figure 1. Fuzzy set

Fuzzy extension principle. The fuzzy extension principle plays a fundamental role because it allows functions defined on crisp sets to be extended to functions on fuzzy sets. An important application of this principle is a mechanism to operate arithmetically with fuzzy numbers.

Definition. Let $X_1, \ldots, X_n$ be crisp sets and let f be a function such that $f : X_1 \times \cdots \times X_n \to Y$. If $\tilde{a}_1, \ldots, \tilde{a}_n$ are fuzzy sets on $X_1, \ldots, X_n$, respectively, then $f(\tilde{a}_1, \ldots, \tilde{a}_n)$ is the fuzzy set on Y such that:

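$$f(\tilde{a}_1, \ldots, \tilde{a}_n) = \bigcup_{(x_1,\ldots,x_n) \in X_1 \times \cdots \times X_n} \left\{ \mu_{\tilde{a}_1}(x_1) \wedge \cdots \wedge \mu_{\tilde{a}_n}(x_n) \,/\, f(x_1, \ldots, x_n) \right\}$$

If $\tilde{b} = f(\tilde{a}_1, \ldots, \tilde{a}_n)$ then $\tilde{b}$ is the fuzzy set on Y such that:

$$\mu_{\tilde{b}}(y) = \bigvee_{(x_1,\ldots,x_n) \in X_1 \times \cdots \times X_n :\ f(x_1,\ldots,x_n) = y} \mu_{\tilde{a}_1}(x_1) \wedge \cdots \wedge \mu_{\tilde{a}_n}(x_n)$$

The fuzzy set was characterized as:

$$\tilde{a} = \left\{ \mu_{\tilde{a}}(x_1)/x_1, \ldots, \mu_{\tilde{a}}(x_n)/x_n \right\}.$$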

With the extension principle we can define a simplified addition operation for fuzzy sets.

Definition. Let $\tilde{a} = (a_1, a_2, a_3, a_4)$ and $\tilde{b} = (b_1, b_2, b_3, b_4)$ be two trapezoidal fuzzy sets. The fuzzy sets addition operation is: $\tilde{a} \oplus \tilde{b} = (a_1 + b_1, a_2 + b_2, a_3 + b_3, a_4 + b_4)$ (Klir, 1995).
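The trapezoidal characterization and the addition operation can be sketched in a few lines. This is only an illustration: the linear rising and falling edges, the function names `membership` and `add`, and the numeric values are assumptions chosen for the example, not taken from the article's figures.

```python
from typing import Tuple

Trapezoid = Tuple[float, float, float, float]  # (a1, a2, a3, a4)


def membership(a: Trapezoid, tau: float) -> float:
    """Possibility of the value tau for a trapezoidal fuzzy set with linear edges."""
    a1, a2, a3, a4 = a
    if a2 <= tau <= a3:          # core: full possibility
        return 1.0
    if a1 < tau < a2:            # rising edge
        return (tau - a1) / (a2 - a1)
    if a3 < tau < a4:            # falling edge
        return (a4 - tau) / (a4 - a3)
    return 0.0                   # outside the support


def add(a: Trapezoid, b: Trapezoid) -> Trapezoid:
    """Simplified fuzzy addition: component-wise sum of the four parameters."""
    return tuple(x + y for x, y in zip(a, b))


# Hypothetical triangular set read as "the activity will stop about 2.5"
about_2_5 = (2.0, 2.5, 2.5, 3.0)
half_way = membership(about_2_5, 2.25)
total = add((1, 2, 2, 3), (0.5, 1, 1.5, 2))
```

A triangular set is simply the special case in which the two middle parameters coincide.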

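Definition. The intersection and union of fuzzy sets are defined in terms of min and max operators:

$$(\tilde{a} \cap \tilde{b}) = \min(\tilde{a}, \tilde{b}) = \min\left(\mu_{\tilde{a}}(\tau), \mu_{\tilde{b}}(\tau)\right) \mid \tau \in \text{support of } \tilde{a} \wedge \tilde{b}$$

and

$$(\tilde{a} \cup \tilde{b}) = \max(\tilde{a}, \tilde{b}) = \max\left(\mu_{\tilde{a}}(\tau), \mu_{\tilde{b}}(\tau)\right) \mid \tau \in \text{support of } \tilde{a} \vee \tilde{b}$$

We use these operators, intersection and union, as a t-norm and an s-norm, respectively.

Definition. The distributions of possibility before and after $\tilde{a}$ are the fuzzy sets $\tilde{a}^b = (-\infty, a_2, a_3, a_4)$ and $\tilde{a}^a = (a_1, a_2, a_3, +\infty)$, respectively; they are defined in (Andreu, 1997) as the functions $\mu_{(-\infty, \tilde{a}]}(\tau) = \sup_{\tau' \ge \tau} \mu_{\tilde{a}}(\tau')$ and $\mu_{(\tilde{a}, +\infty]}(\tau) = \sup_{\tau' \le \tau} \mu_{\tilde{a}}(\tau')$, respectively.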

Petri Nets Theory
Definition. An ordinary PN structure G is a bipartite digraph represented by the 4-tuple $G = (P, T, I, O)$ where $P = \{p_1, p_2, \ldots, p_n\}$ and $T = \{t_1, t_2, \ldots, t_m\}$ are finite sets of vertices called respectively places and transitions, and $I\,(O) : P \times T \to \{0,1\}$ is a function representing the arcs going from places to transitions (transitions to places). Pictorially, places are represented by circles, transitions are represented by rectangles, and arcs are depicted as arrows. The symbol ${}^{\bullet}t_j$ ($t_j^{\bullet}$) denotes the set of all places $p_i$ such that $I(p_i, t_j) \ne 0$ ($O(p_i, t_j) \ne 0$). Analogously, ${}^{\bullet}p_i$ ($p_i^{\bullet}$) denotes the set of all transitions $t_j$ such that $O(p_i, t_j) \ne 0$ ($I(p_i, t_j) \ne 0$). The pre-incidence matrix of G is $C^- = [c_{ij}^-]$ where $c_{ij}^- = I(p_i, t_j)$; the post-incidence matrix of G is $C^+ = [c_{ij}^+]$ where $c_{ij}^+ = O(p_i, t_j)$; the incidence matrix of G is $C = C^+ - C^-$. A marking function $M : P \to \mathbb{Z}^+$ represents the number of tokens or marks (depicted as dots) residing inside each place. The marking of a PN is usually expressed as an n-entry vector.

Definition. A Petri Net system or Petri Net (PN) is the pair $N = (G, M_0)$, where G is a PN structure and $M_0$ is an initial token distribution. In a PN system, a transition $t_j$ is enabled at the marking $M_k$ if $\forall p_i \in P,\ M_k(p_i) \ge I(p_i, t_j)$; an enabled transition $t_j$ can be fired reaching a new marking $M_{k+1}$ which can be computed using the PN state equation:

$$M_{k+1} = M_k + C^+ v_k - C^- v_k \qquad (1)$$

where $v_k(i) = 0,\ i \ne j$, $v_k(j) = 1$. The reachability set of a PN is the set of all possible markings reachable from $M_0$ by firing only enabled transitions; this set is denoted by $R(G, M_0)$. A structural conflict is a PN sub-structure in which two or more transitions share one or more input places; such transitions are simultaneously enabled and the firing of one of them may disable the others, Fig.3(b).

Definition. A transition $t_k \in T$ is live, for a marking $M_0$, if $\forall M_k \in R(G, M_0)$, $\exists M_n \in R(G, M_0)$ such that $t_k$ is enabled ($M_n \xrightarrow{t_k}$). A PN is live if all its transitions are live.

Definition. A PN is said to be 1-bounded, or safe, for a marking $M_0$, if $\forall p_i \in P$ and $\forall M_j \in R(G, M_0)$, it holds that $M_j(p_i) \le 1$. In this work we deal with live and safe PN.

Definition. A p-invariant $Y_i$ (t-invariant $X_i$) of a PN is a positive integer solution of the equation $Y_i^T C = 0$ ($C X_i = 0$). The support of the p-invariant $Y_i$ (t-invariant $X_i$) is the set $\langle Y_i \rangle = \{p_j \mid Y_i(p_j) \ne 0\}$ ($\langle X_i \rangle = \{t_j \mid X_i(t_j) \ne 0\}$).

Definition. Let $Y_i$ be a p-invariant of a Petri net $(G, M_0)$ and $\langle Y_i \rangle$ the support of $Y_i$; then the subnet induced by $\langle Y_i \rangle$ is

$$PC_i = \left(P_i = \langle Y_i \rangle,\ T_i = \bigcup \{t_k \in {}^{\bullet}p_j,\ t_l \in p_j^{\bullet} \mid p_j \in \langle Y_i \rangle\},\ I_i,\ O_i\right)$$

named p-component, where $I_i = P_i \times T_i \cap I$ and $O_i = P_i \times T_i \cap O$.

Definition. Let $X_i$ be a t-invariant of a PN and $\langle X_i \rangle$ the support of $X_i$; then the subnet induced by $\langle X_i \rangle$ is

$$TC_i = \left(P_i = \bigcup \{p_k \in {}^{\bullet}t_j,\ p_l \in t_j^{\bullet} \mid t_j \in \langle X_i \rangle\},\ T_i = \langle X_i \rangle,\ I_i,\ O_i\right)$$

named t-component, where $I_i = P_i \times T_i \cap I$ and $O_i = P_i \times T_i \cap O$.

Definition. An invariant $Z_i$ is minimal if no invariant $Z_j$ satisfies $\langle Z_j \rangle \subset \langle Z_i \rangle$, where $Z_i, Z_j$ are p-invariants or t-invariants and $\forall z \in Z_i : z \ge 0$.

Definition. Let $Z = \{Z_1, \ldots, Z_q\}$ be the set of minimal invariants (Silva, 1982) of a PN; then Z is called the invariants base. The cardinality of Z is represented as $|Z|$.
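The enabling rule and the state equation (1) of this section can be sketched as follows. The two-place cyclic net used here is a hypothetical example, and the function names `enabled` and `fire` are chosen only for illustration. Since the firing vector has a single 1 in position j, the products with the incidence matrices reduce to taking column j.

```python
def enabled(M, C_minus, j):
    """Transition t_j is enabled if every place holds at least I(p_i, t_j) tokens."""
    return all(M[i] >= C_minus[i][j] for i in range(len(M)))


def fire(M, C_plus, C_minus, j):
    """State equation (1) with a unit firing vector selecting transition t_j."""
    if not enabled(M, C_minus, j):
        raise ValueError(f"transition index {j} is not enabled")
    return [M[i] + C_plus[i][j] - C_minus[i][j] for i in range(len(M))]


# Hypothetical cycle p1 -> t1 -> p2 -> t2 -> p1 (rows: places, columns: transitions)
C_minus = [[1, 0],
           [0, 1]]
C_plus = [[0, 1],
          [1, 0]]
M0 = [1, 0]
M1 = fire(M0, C_plus, C_minus, 0)   # fire t1
M2 = fire(M1, C_plus, C_minus, 1)   # fire t2, back to the initial marking
```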

FUZZY TIMED PETRI NETS
Basic Operators
We first introduce some useful operators.

Definition. In order to get the fuzzy set between $\tilde{f}$ and $\tilde{g}$, the lmax function is defined as:

$$lmax(\tilde{f}, \tilde{g}) = \min(\tilde{f}^a, \tilde{g}^b) \qquad (2)$$

Definition. The latest (earliest) operation selects the latest (earliest) fuzzy set among n fuzzy sets; they are calculated as follows:

$$latest(\tilde{f}_1, \ldots, \tilde{f}_n) = \min\left(\max(\tilde{f}_1^b, \ldots, \tilde{f}_n^b),\ \min(\tilde{f}_1^a, \ldots, \tilde{f}_n^a)\right) \qquad (3)$$

$$earliest(\tilde{f}_1, \ldots, \tilde{f}_n) = \min\left(\min(\tilde{f}_1^b, \ldots, \tilde{f}_n^b),\ \max(\tilde{f}_1^a, \ldots, \tilde{f}_n^a)\right) \qquad (4)$$

Definition. The fuzzy_conjugation-operator is defined as $arg_1 \overset{oper}{\bullet} arg_2$, where $arg_1$, $arg_2$ are arguments that can be matrices of fuzzy sets; $\bullet$ is the fuzzy and operation and oper is any operation such as +, -, latest, min, etc. For some row $i = 1,\ldots,m$ and some column $j = 1,\ldots,n$, the products $and(f_{ik}, g_{kj}) \mid k = 1,\ldots,r$ are combined as $oper\ and(f_{ik}, g_{kj})$. For example:

$$\begin{pmatrix} f_{11} & \cdots & f_{1r} \\ \vdots & & \vdots \\ f_{m1} & \cdots & f_{mr} \end{pmatrix} \overset{+}{\bullet} \begin{pmatrix} g_{11} & \cdots & g_{1n} \\ \vdots & & \vdots \\ g_{r1} & \cdots & g_{rn} \end{pmatrix} = \begin{pmatrix} \sum_{k=1}^{r} f_{1k}\,g_{k1} & \cdots & \sum_{k=1}^{r} f_{1k}\,g_{kn} \\ \vdots & & \vdots \\ \sum_{k=1}^{r} f_{mk}\,g_{k1} & \cdots & \sum_{k=1}^{r} f_{mk}\,g_{kn} \end{pmatrix}$$

Formalism Description of the FTPN
Definition. A fuzzy timed Petri net structure is a 3-tuple $FTPN = (N, \Gamma, \xi)$, where $N = (G, M_0)$ is a PN, $\Gamma = \{\tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_n\}$ is a collection of fuzzy sets, and $\xi : P \to \Gamma$ is a function that associates a fuzzy set $\tilde{a}_i \in \Gamma$ to each place $p_i \in P$.

• Fuzzy timing of places
The fuzzy set $\tilde{a} = (a_1, a_2, a_3, a_4)$, Fig.2(b), represents the static possibility distribution $\mu_{\tilde{a}}(\tau) \in [0,1]$ of the instant at which a token leaves a place $p \in P$, starting from the instant when p is marked. This set does not change during the FTPN execution.

• Fuzzy timing of tokens
The fuzzy set $\tilde{b} = (b_1, b_2, b_3, b_4)$, Fig.2(c), represents the dynamic possibility distribution $\mu_{\tilde{b}}(\tau) \in [0,1]$ associated to a token residing within a place $p \in P$; it also represents the instant at which such a token leaves the place, starting from the instant when p is marked. $\tilde{b}$ is computed from $\tilde{a}$ every time the place is marked during the marking evolution of the FTPN. A token begins to be available for enabling output transitions at $b_1$. Thus $\tilde{b}^a = (b_1, b_2, b_3, +\infty)$ represents the possibility distribution of available tokens. The fuzzy set $\tilde{c} = (c_1, c_2, c_3, c_4)$, known as fuzzy timestamp, Fig.2(d), is a dynamic possibility distribution $\mu_{\tilde{c}}(\tau) \in [0,1]$ that represents the duration of a token within a place $p \in P$.

Figure 2. (a) Fuzzy timed Petri net. (b) The fuzzy set associated to places. (c) Fuzzy set associated to a place or mark. (d) Fuzzy timestamp

Enabling and Firing of Transitions
• Fuzzy enabling date
The fuzzy enabling date $e_{t_k}(\tau)$ of the transition $t_k$ at the instant $\tau$ is a possibility distribution of the latest leaving instant among the leaving instants $b_{p_i}$ of all tokens within the places $p_i \in {}^{\bullet}t_k$, Fig.3(a).

$$e_{t_k}(\tau) = latest\left(b_{p_i}\right)\ \forall p_i \in {}^{\bullet}t_k \qquad (5)$$

The latest operation obtains the latest date at which the input places $p_i$ of $t_k$ have a token.

• Fuzzy firing date
The firing date $o_{t_k}(\tau)$ of a transition $t_k$ is determined with respect to the set of transitions $\{t_j\}$ simultaneously enabled, Fig.3(b). This date, expressed as a possibility distribution, is computed as follows:

$$o_{t_k}(\tau) = \min\left(e_{t_k}(\tau),\ earliest\left(e_{t_j}(\tau)\right)\ \forall t_k \in p_n^{\bullet};\ p_n \in {}^{\bullet}t_j\right) \qquad (6)$$

The earliest operation obtains the earliest date at which the transitions in a structural conflict are enabled.

• Fuzzy timestamp
For a given place $p_s$, the possibility distribution $b_{p_s}$ may be computed from $a_{p_s}$ and the firing dates $o_{t_j}(\tau)$ of the transitions $t_j \in {}^{\bullet}p_s$ using the following expression:

$$b_{p_s} = lmax\left(o_{t_j}(\tau)\right) \oplus a_{p_s}\ \forall t_j \in {}^{\bullet}p_s \qquad (7)$$

The token does not disappear from ${}^{\bullet}t$ and appear in $t^{\bullet}$ instantaneously. The fuzzy timestamp $c_{p_s}$ is the possibility, over the elapsed time, that a token is in a place $p_s \in P$. The possibility distribution $c_{p_s}$ is computed from the occurrence dates of both ${}^{\bullet}p_s$ and $p_s^{\bullet}$, see Fig.3(c).

$$c_{p_s} = lmax\left(earliest\left(o_{t_i}(\tau)\right),\ latest\left(o_{t_j}(\tau)\right)\right)\ \forall t_i \in {}^{\bullet}p_s,\ t_j \in p_s^{\bullet} \qquad (8)$$

Actually, $c_{p_s}$ represents the fuzzy marking at the instant $\tau$.

Figure 3. (a) Conjunction transition. (b) Structural conflict. (c) Attribution and selection place

Matrix Formulation
Now, we reformulate the expressions (5), (6), (7) and (8) to allow a more general and compact representation.

$$B = \left(C^+ \overset{lmax}{\bullet} O\right) \oplus A \qquad (9)$$

$$E = \left[C^-\right]^T \overset{latest}{\bullet} B \qquad (10)$$

$$O = \left[C^-\right]^T \overset{min}{\bullet} \left(C^- \overset{earliest}{\bullet} E\right) \qquad (11)$$

$$C = lmax\left(C^+ \overset{earliest}{\bullet} O,\ C^- \overset{latest}{\bullet} O\right) \qquad (12)$$

where B, E, O and C denote vectors composed of $b_{p_s}$, $e_{t_k}$, $o_{t_k}$ and $c_{p_s}$, respectively.

Modeling Example
Now we illustrate the previous matrix formulation through a simple example.

Example. Consider the system shown in Fig.4(a); it consists of two cars, car1 and car2, which move along independent and dependent ways executing the set of activities Op = {Right_car1, Right_car2, Charge_car1, Charge_car2, Left_car1,2, Discharge_car1,2}. The operation of the system is automated following the sequence described in the FTPN of Fig.4(b), in which the activities are associated to places. The ending time possibility $a_{p_i}$ for every activity is given in the model. We consider that there are no sensors detecting the tokens in the system; thus the behavior is analyzed through the estimated state.

Figure 4. (a) Two cars system. (b) Fuzzy timed Petri net model

a. Initial conditions: Initially, $M_0 = \{p_1\}$; therefore the enabling date $e_{t_1}(\tau)$ of transition $t_1$ is immediate, i.e., (0,0,0,0). Since $|{}^{\bullet}t_1| = 1$, then $o_{t_1}(\tau) = e_{t_1}(\tau)$.

b. Matrix equations: To obtain the fuzzy sets we solve (9)-(12) as follows:

$$B = \begin{pmatrix} b_{p_1} \\ b_{p_2} \\ b_{p_3} \\ b_{p_4} \\ b_{p_5} \\ b_{p_6} \end{pmatrix} = \begin{pmatrix} o_{t_5} \oplus a_{p_1} \\ o_{t_1} \oplus a_{p_2} \\ o_{t_1} \oplus a_{p_3} \\ o_{t_2} \oplus a_{p_4} \\ o_{t_3} \oplus a_{p_5} \\ o_{t_4} \oplus a_{p_6} \end{pmatrix} \qquad (13)$$

$$E = \begin{pmatrix} e_{t_1} \\ e_{t_2} \\ e_{t_3} \\ e_{t_4} \\ e_{t_5} \end{pmatrix} = \begin{pmatrix} b_{p_1} \\ b_{p_2} \\ b_{p_3} \\ latest(b_{p_4}, b_{p_5}) \\ b_{p_6} \end{pmatrix} \qquad (14)$$

$$O = \begin{pmatrix} o_{t_1} \\ o_{t_2} \\ o_{t_3} \\ o_{t_4} \\ o_{t_5} \end{pmatrix} = \begin{pmatrix} e_{t_1} \\ e_{t_2} \\ e_{t_3} \\ e_{t_4} \\ e_{t_5} \end{pmatrix} \qquad (15)$$

$$C = \begin{pmatrix} c_{p_1} \\ c_{p_2} \\ c_{p_3} \\ c_{p_4} \\ c_{p_5} \\ c_{p_6} \end{pmatrix} = \begin{pmatrix} lmax(o_{t_5}, o_{t_1}) \\ lmax(o_{t_1}, o_{t_2}) \\ lmax(o_{t_1}, o_{t_3}) \\ lmax(o_{t_2}, o_{t_4}) \\ lmax(o_{t_3}, o_{t_4}) \\ lmax(o_{t_4}, o_{t_5}) \end{pmatrix} \qquad (16)$$

c. Firing t1: When $t_1$ is fired, the token is removed from $p_1$; $p_2$ and $p_3$ get one token each.

$$B = [\,0 \quad (0.9,1,1,1.1) \quad (0.8,1,1,1.2) \quad 0 \quad 0 \quad 0\,]^T$$

The possibility sets $b_{p_2}$, $b_{p_3}$ coincide with $a_{p_2}$ and $a_{p_3}$, respectively.

d. Firing t2: The fuzzy enabling time and the fuzzy occurrence time are computed by (14) and (15), respectively.

$$O, E = [\,0 \quad (0.9,1,1,1.1) \quad 0 \quad 0 \quad 0\,]^T$$

The set $c_{p_2}$ is the possibility distribution of the time at which $p_2$ is marked. So, we compute (16).

$$C = [\,0 \quad (0,0,1,1.1) \quad 0 \quad 0 \quad 0 \quad 0\,]^T$$

The set $b_{p_4}$ is the possibility distribution of the instant at which place $p_4$ loses the token, and it can be calculated by (13).

$$B = [\,0 \quad 0 \quad 0 \quad (2.6,3,3,3.4) \quad 0 \quad 0\,]^T$$

e. Firing t3: Again, using (14), (15) and (16) we obtain:

$$O, E = [\,0 \quad 0 \quad (0.8,1,1,1.2) \quad 0 \quad 0\,]^T$$

$$C = [\,0 \quad 0 \quad (0,0,1,1.2) \quad 0 \quad 0 \quad 0\,]^T$$

$$B = [\,0 \quad 0 \quad 0 \quad 0 \quad (2.6,3,3,3.4) \quad 0\,]^T$$

Figure 5(a) presents the marking evolution of one cycle and some more steps. C is represented by the dashed line and B is represented by the shadowed area. Notice that O sometimes coincides with B.

FUZZY STATE EQUATION
We analyze equation (1) in order to obtain the fuzzy marking equation. $C^+ v_k$ provides information about the places that get tokens. Also, we must consider that in an FTPN the transition firing possibility evolves continuously. The variation of $O(\tau_b)$ during $\tau \in \tau_b$ modifies the possibility of tokens residing in the output places of the firing transitions; thus the term corresponding to $v_k$ in an FTPN is rather a variation denoted by $\Delta O(\tau_b)$, and the marking variation is $C^+ \Delta O(\tau_b)$. By a similar reasoning, the term $C^- v_k$ corresponds to $C^- \Delta O(\tau_a)$ in an FTPN. The operation $C^+ \Delta O(\tau_b) - C^- \Delta O(\tau_a)$ represents the possible marking change. Considering the marking after a time elapse $\Delta\tau$ we obtain:

$$M(\tau) = M(\tau - \Delta\tau) + C^+ \Delta O(\tau_b) - C^- \Delta O(\tau_a) \qquad (17)$$

Here

$$\Delta O(\tau_b) = O(\tau) - O(\tau - \Delta\tau) \mid (\tau - \Delta\tau, \tau) \in \tau_b$$

and

$$\Delta O(\tau_a) = O(\tau - \Delta\tau) - O(\tau) \mid (\tau - \Delta\tau, \tau) \in \tau_a .$$

The marking possibility obtained in (17) can be greater than 1; since FTPN are safe, we use the min function to obtain $M(\tau) \le 1$. The new marking is denoted by $\hat{M}(\tau)$, i.e.,

$$\hat{M}(\tau) = \min\left(M(\tau - \Delta\tau) + C^+ \Delta O(\tau_b) - C^- \Delta O(\tau_a),\ \vec{1}\right) \qquad (18)$$

where $\vec{1}$ is an n-entry vector containing 1 in each entry. Initially $M(0) = M_0$. If $\tau \ne 0$ then (18) is solved in three steps:

• $\Delta M(\tau) = C^+ \Delta O(\tau_b) - C^- \Delta O(\tau_a)$
• $M(\tau) = M(\tau - \Delta\tau) + \Delta M(\tau)$
• $\hat{M}(\tau) = \min\left(M(\tau), \vec{1}\right)$

Remark. If $\Delta O(\tau_b), \Delta O(\tau_a) \in \{0,1\}$ the behaviour is that of an ordinary timed Petri net.

Example. For the system shown in Fig.4, we obtain the marking at some instants. The initial marking is $M(0) = [1\ 0\ 0\ 0\ 0\ 0]$. The transition $t_1$ fires at $\tau = 0^+$; therefore $M(0^+) = [0\ 1\ 1\ 0\ 0\ 0]$. For $\tau \in (0^+, 0.8)$ the marking does not change. For $\tau = 1$ we obtain $\Delta M(1) = C^+ \Delta O(\tau_b) - C^- \Delta O(\tau_a)$ with

$$C^+ = \begin{pmatrix} 0&0&0&0&1 \\ 1&0&0&0&0 \\ 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&1&0 \end{pmatrix}, \qquad C^- = \begin{pmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&1&0 \\ 0&0&0&1&0 \\ 0&0&0&0&1 \end{pmatrix}$$

$$M(1) = M(0.8) + \Delta M(1) = [0\ 1\ 1\ 1\ 0\ 0]^T$$

$$\hat{M}(1) = \min\left(M(1), \vec{1}\right) = [0\ 1\ 1\ 1\ 0\ 0]^T$$
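The three steps used to solve (18) can be sketched as follows. The incidence matrices below are inferred from equations (13)-(16) of the modeling example, and the firing variations passed to the function are hypothetical values chosen so that the result matches the marking stated for the instant 1; neither is given explicitly in this form in the article.

```python
def fuzzy_marking_step(M_prev, C_plus, C_minus, dO_b, dO_a):
    """One update of the fuzzy state equation (18), solved in three steps."""
    n_places, n_trans = len(M_prev), len(dO_b)
    # Step 1: possible marking change C+ dO(tau_b) - C- dO(tau_a)
    dM = [sum(C_plus[i][j] * dO_b[j] - C_minus[i][j] * dO_a[j]
              for j in range(n_trans))
          for i in range(n_places)]
    # Step 2: accumulate on the previous marking
    M = [m + d for m, d in zip(M_prev, dM)]
    # Step 3: the net is safe, so the possibility is bounded by 1
    return [min(m, 1) for m in M]


# Incidence matrices inferred from (13)-(16): rows p1..p6, columns t1..t5
C_plus = [[0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0],
          [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
C_minus = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0],
           [0, 0, 0, 1, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]

M_08 = [0, 1, 1, 0, 0, 0]
dO_b = [0, 1, 0, 0, 0]    # hypothetical: firing possibility of t2 has risen to 1
dO_a = [0, 0, 0, 0, 0]    # hypothetical: no decrease yet
M_hat_1 = fuzzy_marking_step(M_08, C_plus, C_minus, dO_b, dO_a)
```

Note that p2 and p4 are both marked with possibility 1 in the result: the token has possibly arrived at p4 while it has not yet certainly left p2.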

The marking evolution at some relevant instants is shown below:

τ          0     0.8   1     2     2.8   3     3.2   4     5
M̂p1(τ)    1     0     0     0     0     0     0     0     1
M̂p2(τ)    0     1     1     0     0     0     0     0     0
M̂p3(τ)    0     1     1     0     0     0     0     0     0
M̂p4(τ)    0     0     1     1     1     1     0.5   0     0
M̂p5(τ)    0     0     1     1     1     1     0.5   0     0
M̂p6(τ)    0     0     0     0     0     0.66  1     1     1

Notice that during $\tau \in (0, 5)$, $\hat{M}(\tau)$ coincides with the fuzzy timestamp; it is shown in Fig.5(a).

STATE APPROXIMATION OF THE FTPN
Marking Estimation
Definition. The marking estimation $\Xi$ at the instant $\tau$ is described by the function $\Xi_{Y_i}(\tau) \in [0,1]$, which recognizes the possible marked place $p_u \in \langle Y_i \rangle \mid i \in \{1,\ldots,|Y|\}$ among the other possible places $p_v \in \langle Y_i \rangle \mid v \ne u$. The function $\Xi_{Y_i}(\tau)$ evaluates the minimal difference that exists between the bigger possibility $\hat{M}_u(\tau)$ that the token is in a place u and the possibility $\hat{M}_v(\tau)$ that the token is in any other place. The function $\Xi_{Y_i}(\tau)$ is then calculated as

$$\Xi_{Y_i}(\tau) = \min\left(\hat{M}_{p_u}(\tau) - \hat{M}_{p_v}(\tau)\right) \qquad (19)$$

such that $\forall \{p_u, p_v\} \in \langle Y_i \rangle;\ v \ne u;\ Y_i \in Y$.

Example. The FTPN in Fig.4 has two p-invariants with supports $\langle Y_1 \rangle = \{p_1, p_2, p_4, p_5\}$ and $\langle Y_2 \rangle = \{p_1, p_3, p_5\}$. Figure 6 shows the fuzzy sets C obtained from the evolution of the marking in the p-component induced by $\langle Y_1 \rangle$, Fig.5(b).

Definition. The state estimation S at the instant $\tau$ is described by the function $s(\tau) \in [0,1]$, which determines the possible state of the system among other possible states; it is calculated by:

$$s(\tau) = \min\left(\Xi_{Y_i}(\tau)\right) \mid i = 1,\ldots,|Y|;\ Y_i \in Y \qquad (20)$$

Discrete State From the FTPN
In order to obtain a possible discrete marking $\bar{M}(\tau)$ of the FTPN it is necessary to perform a "defuzzification" of $M(\tau)$. This can be accomplished taking into account the possible discrete marking $\bar{M}_i(\tau)$ of every p-component induced by $Y_i$. Before describing the procedure to obtain $\bar{M}(\tau)$, we define it as:

$$\bar{M}(\tau) = \left[\,m_{p_1}(\tau) \cdots m_{p_n}(\tau)\,\right]^T \mid n = |P| \qquad (21)$$

where $m_{p_k}(\tau) \mid k = 1,\ldots,n$ is the estimated marking of the place $p_k \in P$. Now, the discrete marking can be obtained with the following procedure.

Algorithm: Defuzzification
See Algorithm A.

Example. Following the previous example, the marking $M(\tau)$ during $\tau \in (0^+, 0.8]$ does not change, that is $M(\tau) = M(0^+) = [0\ 1\ 1\ 0\ 0\ 0]$. For $\tau = 0.95$ the new fuzzy marking is

$$M(0.95) = [0\ 1\ 1\ 0.5\ 0.25\ 0]^T,$$

therefore

$$\bar{M}_1(0.95) = [0\ 1\ 0\ 0\ 0\ 0]^T$$

$$\bar{M}_2(0.95) = [0\ 0\ 1\ 0\ 0\ 0]^T$$

$$\hat{M}(0.95) = [0\ 1\ 0\ 0\ 0\ 0]^T + [0\ 0\ 1\ 0\ 0\ 0]^T = [0\ 1\ 1\ 0\ 0\ 0]^T$$

$$\bar{M}(0.95) = [0\ 1\ 1\ 0\ 0\ 0]^T$$

Figure 5. (a) Fuzzy marking evolution. (b) Marking estimation. (c) Discrete state

Figure 5(c) shows the marking obtained at different instants.
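The marking estimation (19) and the state estimation (20) can be sketched as follows, assuming that the place u is the one with the highest possibility inside each p-invariant support. The function names are chosen for illustration; the fuzzy marking and the supports are those of the example at the instant 0.95, with places indexed from 0.

```python
def marking_estimation(M, support):
    """Expression (19): minimal gap between the most possible place and the others."""
    u = max(support, key=lambda k: M[k])
    return min(M[u] - M[v] for v in support if v != u)


def state_estimation(M, supports):
    """Expression (20): the weakest marking estimation over all p-invariants."""
    return min(marking_estimation(M, s) for s in supports)


M_095 = [0, 1, 1, 0.5, 0.25, 0]       # fuzzy marking at the instant 0.95
supports = [[0, 1, 3, 4], [0, 2, 4]]  # supports of Y1 and Y2 (0-based indices)
xi_1 = marking_estimation(M_095, supports[0])
xi_2 = marking_estimation(M_095, supports[1])
s = state_estimation(M_095, supports)
```

A value close to 1 means that one place clearly dominates inside the p-component; a value close to 0 means that two places are almost equally possible, so the estimated state is less reliable.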

FUTURE TRENDS
Previous results on the estimation of Fuzzy Timed State Machines and those included in this article are going to be integrated to address a larger class of PN. Another issue currently being addressed is the study of FTPN including measurable places, for dealing with sensors or detectable activities within the system; this will allow a bound to be established on the uncertainty of the estimated state. The optimal placement of sensors is an interesting matter of research.

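Algorithm A.
Input: $M(\tau)$, Y
Output: $\bar{M}(\tau)$
Step 1: $\bar{M}(\tau) \leftarrow \vec{0}$
Step 2: $\forall Y_i \mid i = 1,\ldots,|Y|$
  Step 2.1: $\forall p_k \in \langle Y_i \rangle$: $\hat{m}_q = \max\left(M(p_k)\right)$
  Step 2.2: $\bar{M}(p_q) = 1$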
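Algorithm A can be sketched as a short function. This reading of the steps is an interpretation: for each p-invariant support, the place with the highest fuzzy marking receives the token in the discrete marking. Ties are resolved in favour of the first place listed, which is an assumption of this sketch, and places are indexed from 0.

```python
def defuzzify(M, supports):
    """Discrete marking from a fuzzy marking, one token per p-invariant support."""
    discrete = [0] * len(M)                     # Step 1: empty discrete marking
    for support in supports:                    # Step 2: every p-invariant
        q = max(support, key=lambda k: M[k])    # Step 2.1: most possible place
        discrete[q] = 1                         # Step 2.2: mark it
    return discrete


M_095 = [0, 1, 1, 0.5, 0.25, 0]       # fuzzy marking of the example at 0.95
supports = [[0, 1, 3, 4], [0, 2, 4]]  # supports of Y1 and Y2 (0-based indices)
discrete_095 = defuzzify(M_095, supports)
```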

The aim of this research has been to use the state estimation methodology for monitoring the behavior of a discrete event system and diagnosing faults. An FTPN is going to be used as a reference model, and its outputs (measurable marking) have to be compared with the outputs of the monitored system; the analysis of residuals should provide an early detection of system malfunctioning and a plausible location of the faulty behavior.

CONCLUSION
This article addressed the state estimation problem of DES in which the duration of activities is ill known; fuzzy sets represent the uncertainty of the ending of activities. Several novel notions have been introduced in the FTPN definition, and a new matrix formulation for computing the fuzzy marking of Marked Graphs has been proposed. The extreme situation in which no activity of a system can be detected by sensors has been dealt with to illustrate the degradation of the marking estimation when a cyclic execution is performed. Current research addresses the topics mentioned in the above section.

REFERENCES
Andreu, D., Pascal, J-C., Valette, R. (1997). Fuzzy Petri Net-Based Programmable Logic Controller. IEEE Trans. Syst. Man. Cybern, Vol. 27, No. 6, 952-961.
Cao, T., Sanderson, A. C. (1996). Intelligent Task Planning Using Fuzzy Petri Nets. Intelligent Control and Intelligent Automation, Vol. 3. World Scientific.
Cardoso, J., Camargo, H. (1999). Fuzziness in Petri Nets. Physica-Verlag.
Chen, S., Ke, J., Chang, J. (1990). Knowledge representation using Fuzzy Petri nets. IEEE Trans. Knowledge Data Eng., Vol. 2, No. 3, 311-319.
Ding, Z., Bunke, H., Schneider, M., Kandel, A. (2005). Fuzzy Timed Petri Net, Definitions, Properties, and Applications. Elsevier, Mathematical and Computer Modelling 41, 345-360.
Gao, M., Zhou, M., Guang, X., Wu, Z. (2003). Fuzzy Reasoning Petri Nets. IEEE Trans. Syst., Man & Cybern., Part A: Syst. and Humans, Vol. 33, No. 3, 314-324.
Giua, A., Julvez, C., Seatzu, C. (2003). Marking Estimation of Petri Nets based on Partial Observation. Proc. of the American Control Conference, Denver, Colorado, June 4-6, 326-331.
González-Castolo, J. C., López-Mellado, E. (2006). Fuzzy State Estimation of Discrete Event Systems. Proc. of MICAI 2006: Advances in Artificial Intelligence, Vol. 4293, 90-100.
Hennequin, S., Lefebvre, D., El Moudni, A. (2001). Fuzzy Multimodel of Timed Petri Nets. IEEE Trans. on Syst., Man, Cybern., Vol. 31, No. 2, 245-250.
Klir, G. and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic. Theory and Applications. Prentice Hall, NJ, USA.
Koriem, S.M. (2000). A Fuzzy Petri Net Tool For Modeling and Verification of Knowledge-Based Systems. The Computer Journal, Vol. 43, No. 3, 206-223.
Leslaw, G., Kluska, J. (2004). Hardware Implementation of Fuzzy Petri Net as a Controller. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 34, No. 3, 1315-1324.
Martinez, J. & Silva, M. (1982). A Simple and fast algorithm to obtain all invariants of a generalized Petri nets. Proc. of Second European Workshop on Application and Theory of Petri Nets, Informatik-Fachberichte Vol. 52, 301-310.
Murata, T. (1996). Temporal uncertainty and fuzzy-timing high-level Petri nets. Lect. Notes Comput. Sci., Vol. 1091, 29-58.
Pedrycz, W., Camargo, H. (2003). Fuzzy timed Petri Nets. Elsevier, Fuzzy Sets and Systems, 140, 301-330.
Ramirez-Treviño, A., Rivera-Rangel, A., López-Mellado, E. (2003). Observability of Discrete Event Systems Modeled by Interpreted Petri Nets. IEEE Transactions on Robotics and Automation, Vol. 19, No. 4, 557-565.
Shen, R. V. L. (2003). Reinforcement Learning for High-Level Fuzzy Petri Nets. IEEE Trans. on Syst., Man, & Cybern., Vol. 33, No. 2, 351-362.


KEY TERMS
Discrete Event Systems: The class of systems whose behavior is characterized by successions of states delimited by asynchronous events. Most of these systems are man-made.
Fuzzy Logic: A knowledge representation technique and computing framework whose approach is based on degrees of truth rather than the usual “true” or “false” of classical logic.
Fuzzy Petri Nets: A family of formalisms extending Petri nets by the inclusion of fuzzy sets, usually representing the uncertainty of elapsed times.
Imprecise Marking: The imprecise localization of tokens within places of an FTPN; it is computed as a possibility distribution.
Marked Graph: A Petri net subclass in which every place has only one input transition and one output transition.
Petri Nets: A family of formalisms for modeling and analysis of concurrent DES, allowing intuitive graphical descriptions and providing a simple but sound mathematical support. A timed Petri net includes information about the duration of the modeled activities.
State Estimation: The inference process that determines the current state of a system from the knowledge of sequences of inputs and outputs.
State Machine: A Petri net subclass in which every transition has only one input place and one output place.
System Monitoring: A surveillance process on measurable events and/or outputs of a system; a reference model that specifies reasonably good behavior is often used. Deviations from the reference are analyzed to determine whether a fault exists. This process is part of a fault diagnosis process.


Fuzzy Control Systems: An Introduction
Guanrong Chen
City University of Hong Kong, China
Young Hoon Joo
Kunsan National University, Korea

INTRODUCTION
Fuzzy control systems are developed based on fuzzy set theory, attributed to Lotfi A. Zadeh (Zadeh, 1965, 1973). Classical set theory describes the membership of elements by a characteristic function (an element either “is” or “is not” a member of the set); fuzzy set theory extends this to allow partial membership described by a membership function (an element both “is” and “is not” a member of the set at the same time, with a certain degree of belonging to the set). Thus, fuzzy set theory offers great capability and flexibility in solving many real-world problems that classical set theory does not intend to handle or fails to handle. Fuzzy set theory was applied to control systems theory and engineering almost immediately after its birth. Advances in modern computer technology continuously back up the fuzzy framework for coping with a broad spectrum of engineering systems, including many control systems that are too complex or too imprecise to tackle with conventional control theories and techniques.

BACKGROUND: FUZZY CONTROL SYSTEMS
The main signature of fuzzy logic technology is its ability to suggest an approximate solution to an imprecisely formulated problem. From this point of view, fuzzy logic is closer to human reasoning than classical logic, which attempts to precisely formulate and exactly solve a mathematical or technical problem whenever possible.

Motivations for Fuzzy Control Systems Theory
Conventional control systems theory, developed based on classical mathematics and two-valued logic, is relatively mature and complete. This theory has its solid foundation built on classical mathematics, electrical engineering, and computer technology. It can provide rigorous analysis and often perfect solutions when a system is precisely defined mathematically. Within this framework, some relatively advanced control techniques such as adaptive, robust and nonlinear control theories have developed rapidly over the last three decades. However, conventional control theory is quite limited in modeling and controlling complex dynamical systems, particularly ill-formulated and partially described physical systems. Fuzzy logic control theory, on the contrary, has shown potential in these kinds of non-traditional applications. Fuzzy logic technology allows designers to build controllers even when their understanding of the system is still in a vague, incomplete, and developing phase, and such situations are quite common in industrial control practice.

General Structure of Fuzzy Control Systems
Just like other mathematical tools, fuzzy logic, fuzzy set theory, fuzzy modeling, fuzzy control methods, etc., have been developed for solving practical problems. In control systems theory, if the fuzzy interpretation of a real-world problem is correct and if fuzzy theory is developed appropriately, then fuzzy controllers can be suitably designed and can work quite well. The entire process is then returned to the original real-world setting, to accomplish the desired system automation. This is the so-called “fuzzification, fuzzy operation, defuzzification” routine in fuzzy control design. The key step, fuzzy operation, is executed by a logical rule base consisting of some IF-THEN rules established by using fuzzy logic and human knowledge (Chen & Pham, 1999, 2006; Driankov, Hellendoorn & Reinfrank, 1993; Passino & Yurkovich, 1998; Tanaka, 1996; Tanaka & Wang, 1999; Wang, 1994; Ying, 2000).

Fuzzification
Fuzzy set theory allows partial membership of an element with respect to a set: an element can partially belong to a set and meanwhile partially not belong to the same set. For example, an element, x, belonging to the set, X, is specified by a (normalized) membership function, μX : X → [0,1]. There are two extreme cases: μX(x) = 0 means x ∉ X and μX(x) = 1 means x ∈ X in the classical sense. But μX(x) = 0.2 means x belongs to X only with grade 0.2, or equivalently, x does not belong to X with grade 0.8. Moreover, an element can have more than one membership value at the same time, such as μX(x) = 0.2 and μY(x) = 0.6, and they need not sum to one. The entire setting depends on how large the set X is (or the sets X and Y are) for the associated members, and what kind of shape a membership function should have in order to make sense of the real problem at hand. A set, X, along with a membership function defined on it, μX(·), is called a fuzzy set and is denoted (X, μX). More examples of fuzzy sets can be seen below, as the discussion continues. This process of transforming a crisp value of an element (say x = 0.3) to a fuzzy set (say x = 0.3 ∈ X = [0,1] with μX(x) = 0.2) is called fuzzification.

Given a set of real numbers, X = [–1,1], a point x ∈ X assumes a real value, say x = 0.3. This is a crisp number without fuzziness. However, if a membership function μX(·) is introduced to associate with the set X, then (X, μX) becomes a fuzzy set, and the (same) point x = 0.3 has a membership grade quantified by μX(·) (for instance, μX(x) = 0.9). As a result, the point has not one but two values associated with it: x = 0.3 and μX(x) = 0.9. In this sense, x is said to have been fuzzified. For convenience, instead of saying that “x is in the set X with a membership value μX(x),” in common practice it is usually said “x is X,” while one should keep in mind that there is always a well-defined membership function associated with the set X. If a member, x, belongs to two fuzzy sets, one says “x is X1 AND x is X2,” and so on. Here, the relation AND needs a logical operation to perform. As a result, this statement eventually yields only one membership value for the element x, denoted by μX1×X2(x). There are several logical operations to implement the logical AND; they are quite different but all valid within their individual logical systems. A commonly used one is

$$\mu_{X_1 \times X_2}(x) = \min\{\mu_{X_1}(x),\ \mu_{X_2}(x)\}.$$
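As a rough illustration of fuzzification and of the minimum operator as a logical AND, the following Python sketch may help. The triangular shape and the two fuzzy sets X1 and X2 are assumptions chosen only for illustration; the article does not prescribe any particular membership function, and the names `triangular`, `fuzzy_and`, `mu_x1` and `mu_x2` are hypothetical.

```python
def triangular(x, left, peak, right):
    """Triangular membership function: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzy_and(*grades):
    """Logical AND implemented as the minimum of the membership grades."""
    return min(grades)


# Hypothetical fuzzy sets on X = [-1, 1]; the shapes are illustrative assumptions.
def mu_x1(x):
    return triangular(x, -1.0, 0.0, 1.0)


def mu_x2(x):
    return triangular(x, 0.0, 0.5, 1.0)


x = 0.3                      # crisp value
grade_x1 = mu_x1(x)          # grade of "x is X1"
grade_x2 = mu_x2(x)          # grade of "x is X2"
grade_and = fuzzy_and(grade_x1, grade_x2)   # grade of "x is X1 AND x is X2"
print(grade_x1, grade_x2, grade_and)
```

The crisp value x = 0.3 ends up carrying one grade per fuzzy set, and the AND statement collapses them into a single grade. Other AND operators (for instance, the product) could be swapped in for `min` without changing the rest of the sketch.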

Fuzzy Logic Rule Base
The majority of fuzzy logic control systems are knowledge-based systems. This means that either their fuzzy models or their fuzzy logic controllers are described by fuzzy logic IF-THEN rules. These rules have to be established based on human experts’ knowledge about the system, the controller, the performance specifications, etc., and they must be implemented by performing rigorous logical operations. For example, a car driver knows that if the car moves straight ahead then he does not need to do anything; if the car turns to the right then he needs to steer the car to the left; if the car turns to the right by too much then he needs to take a stronger action to steer the car to the left much more, and so on. Here, “much” and “more” etc. are fuzzy terms that cannot be described by classical mathematics but can be quantified by membership functions (see Fig. 1, where part (a) is an example of the description “to the left”). The collection of all such “if … then …” principles constitutes a fuzzy logic rule base for the problem under investigation. To this end, it is helpful to briefly summarize the experience of the driver in the following simplified rule base: Let X = [–180°, 180°], x be the position of the car, μleft(·) be the membership function for the moving car turning “to the left,” μright(·) the membership function for the car turning “to the right,” and μ0(·) the membership function for the car “moving straight ahead.” Here, simplified statements are used; for instance, “x is Xleft” means “x belongs to X with a membership value μleft(x),” etc. Similar notation is employed for the control action u of the driver. Then, a simple typical rule base for this car-driving task is

R(1): IF x is Xleft THEN u is Uright
R(2): IF x is Xright THEN u is Uleft
R(3): IF x is X0 THEN u is U0



where X0 means moving straight ahead (neither left nor right), as described by the membership function shown in Fig. 1(c), and “u is U0” means u = 0 (no control action) with a certain grade (if this grade is 1, it means absolutely no control action). Of course, this description only illustrates the basic idea; it is by no means a complete and effective design for a real car-driving application. In general, a rule base of r rules has the form

R(k): IF x1 is Xk1 AND ··· AND xm is Xkm THEN u is Uk    (1)

where m ≥ 1 and k = 1,···, r.

Defuzzification
An element of a fuzzy set may have more than one membership value. In Fig. 1, for instance, if x = 5° then it has two membership values: μright(x) = 5/180 ≈ 0.028 and μ0(x) = 0.5. This means that the car is moving to the right by a little. According to the above-specified rule base, the driver will take two control actions simultaneously, which is unnecessary and physically impossible.

Thus, what should the control action be? To simplify this discussion, suppose that the control action is simply u = –x with the same membership functions μX = μU for all cases. Then, a natural and realistic control action for the driver to take is a compromise between the two required actions. Among several possible compromise (or average) formulas for this purpose, the most commonly adopted one that works well in most cases is the following weighted average formula:

$$u = \frac{\mu_{\text{right}}(u)\cdot u + \mu_{0}(u)\cdot u}{\mu_{\text{right}}(u) + \mu_{0}(u)} = \frac{0.028 \times (-5^{\circ}) + 0.5 \times (-5^{\circ})}{0.028 + 0.5} = -5^{\circ}$$

Here, the result is interpreted as “the driver should turn the car to the left by 5°.” This averaging of the outputs is called defuzzification. It yields a single crisp value for the control, and different averaging formulas may actually yield similar results in general. The result of defuzzification usually is a physical quantity acceptable by the original real system.

Figure 1. Membership functions for directions of a moving car

Figure 2. A typical fuzzy logic controller (controller input e → Fuzzification → Fuzzy Rule Base → Defuzzification → controller output u; the three blocks together form the Fuzzy Logic Controller, FLC)

Whether or not this defuzzification result works well depends on the correctness and effectiveness of the rule base, while the latter depends on the designer’s knowledge and experience about the physical system or process to be controlled. Just like any of the classical design problems, there is generally no unique solution for a problem; an experienced designer usually comes up with a better design. A general weighted average formula for defuzzification is the following convex combination of the individual outputs:

$$\text{output} = \sum_{i=1}^{r} \alpha_i u_i := \sum_{i=1}^{r} \frac{w_i}{\sum_{j=1}^{r} w_j} \cdot u_i \qquad (2)$$

with notation referred to the rule base (1), where

$$w_i := \mu_{U_i}(u_i), \qquad \alpha_i := \frac{w_i}{\sum_{j=1}^{r} w_j} \ge 0, \quad i = 1, \ldots, r, \qquad \sum_{i=1}^{r} \alpha_i = 1.$$

Sometimes, depending on the design or application, the weights are

$$w_i = \prod_{j=1}^{m} \mu_{X_{ij}}(x_j), \qquad i = 1, \ldots, r.$$

The overall structure of a fuzzy logic controller is shown in Fig. 2.
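The weighted average formula (2) can be sketched in a few lines of Python. This is only a minimal illustration: the function names `defuzzify` and `product_weight` are hypothetical, and the numbers reproduce the car example under the stated simplification that both fired rules recommend u = –x = –5°.

```python
def defuzzify(weights, outputs):
    """Weighted average (convex combination) of the individual rule outputs."""
    total = sum(weights)
    if total == 0:
        raise ValueError("no rule fired: all weights are zero")
    return sum(w * u for w, u in zip(weights, outputs)) / total


def product_weight(grades):
    """Alternative rule weight: product of the antecedent membership grades."""
    result = 1.0
    for grade in grades:
        result *= grade
    return result


# Car example: x = 5 degrees fires two rules with grades 5/180 and 0.5,
# and (by the simplifying assumption u = -x) both recommend u = -5 degrees.
u_car = defuzzify([5 / 180, 0.5], [-5.0, -5.0])

# Two rules that disagree: the crisp output lies between the two proposals.
u_mixed = defuzzify([0.25, 0.75], [-10.0, 0.0])
print(u_car, u_mixed)
```

Because the coefficients are non-negative and sum to one, the crisp output always lies between the smallest and largest individual rule outputs, which is why it remains physically meaningful.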

MAIN FOCUS OF THE CHAPTER: SOME BASIC FUZZY CONTROL APPROACHES

A Model-Free Approach
This general approach of fuzzy logic control works for trajectory tracking for a conventional dynamical system that does not have a precise mathematical model. The basic setup is shown in Fig. 3, where the plant is a conventional system without a mathematical description and all the signals (the reference set-point sp, output y(t), control u(t), and error e(t) = sp – y(t)) are crisp. The objective is to design a controller to achieve the goal e(t) → 0 as t → ∞, assuming that the system inputs and outputs are measurable by sensors online. If the mathematical formulation of the plant is unknown, how can one develop a controller to control this plant? The fuzzy logic approach turns out to be advantageous in this situation: it uses only the plant inputs and outputs, not the state variables or any other information. After the design is completed, the entire dashed block in Fig. 2 is used to replace the “controller” block in Fig. 3.

Figure 3. A typical reference set-point tracking control system (reference signal r → summing junction (+/−) → error signal e → Controller → control signal u → Plant → output signal y, fed back to the summing junction)

As an example, suppose that the physical reference set-point is a temperature, say 40°F, and that the designer knows the range of the error signal, e(t) = 40° – y(t), is within X = [–25°, 45°], and assume that the scale of control is required to be in the unit of 1°. Then, the membership functions for the error signal to be “negative large” (NL), “positive large” (PL), and “zero” (ZO) may be chosen as shown in Fig. 4. Using these membership functions, the controller is expected to drive the output temperature to be within the allowable range: 40° ± 1°. With these membership functions, when the error signal e(t) = 5°, for instance, it is considered to be “positive large” with membership value one, meaning that the set-point (40°) is higher than y(t) by too much.

Figure 4. Membership function for the error temperature signal

The output from the fuzzification module is a fuzzy set consisting of the interval X and three membership functions, μNL, μPL and μZO, in this example. The output from fuzzification will be the input to the next module, the fuzzy logic rule base, which only takes fuzzy set inputs to be compatible with the logical IF-THEN rules.

Figure 5 is helpful for establishing the rule base. If e > 0 at a moment, then the set-point is higher than the output y (since e = 40° – y), which corresponds to two possible situations, marked by a and d respectively. To further distinguish these two situations, one may use the rate of change of the error, ė = −ẏ. Here, since the set-point is a constant, its derivative is zero. Using information from both e and ė, one can completely characterize the changing situation of the output temperature at all times. If, for example, e > 0 and ė > 0, then the temperature is currently at situation d rather than situation a, since ė > 0 means ẏ < 0 which, in turn, signifies that the curve is moving downward.

Figure 5. Temperature set-point tracking example (temperature versus time t, showing the set-point r, the output curve y(t), the error e(t), and the situations a, b, c and d)

Based on the above observation of the physical situations of the current temperature against the set-point, a simple rule base can be established as follows:

R1: IF e > 0 AND ė > 0 THEN u(t+) = –C · u(t);
R2: IF e > 0 AND ė < 0 THEN u(t+) = C · u(t);
R3: IF e < 0 AND ė > 0 THEN u(t+) = C · u(t);
R4: IF e < 0 AND ė < 0 THEN u(t+) = –C · u(t);

otherwise (e.g., e = 0 or ė = 0), u(t+) = u(t), till the next step, where C > 0 is a constant control gain and t+ can be just t + 1 in discrete time. In the above, the first two rules are understood as follows (other rules can be similarly interpreted):

1. R1: e > 0 and ė > 0. As analyzed above, the temperature curve is currently at situation d, so the controller has to change its moving direction to the opposite by changing the current control action to the opposite (since the current control action is driving the output curve downward).
2. R2: e > 0 and ė < 0. The current temperature curve is at situation a, so the controller does not need to do anything (since the current control action is driving the output curve up toward the set-point).
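The four-rule switching law described above can be sketched as a small Python function. This is an illustrative reading of the rule base, not a complete controller: `e_dot` stands for the rate of change of the error, `gain` plays the role of the constant C, and the function name `next_control` is hypothetical.

```python
def next_control(e, e_dot, u, gain=1.0):
    """Return u(t+) from the error e, its rate of change e_dot, and u(t)."""
    if e == 0 or e_dot == 0:
        return u                     # "otherwise" branch: keep the control
    if (e > 0) == (e_dot > 0):
        return -gain * u             # R1 (both positive) or R4 (both negative)
    return gain * u                  # R2 or R3: error is already shrinking


# Situation d: output below the set-point and still falling, so reverse.
u_reverse = next_control(e=5.0, e_dot=1.0, u=2.0)
# Situation a: output below the set-point but rising, so keep the direction.
u_keep = next_control(e=5.0, e_dot=-1.0, u=2.0)
print(u_reverse, u_keep)
```

The rule base only inspects the signs of e and ė, so no plant model is needed; the control direction is reversed exactly when the error and its rate of change have the same sign, that is, when the error magnitude is growing.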

The switching control actions may take different forms, depending on the design. One example is u(t + 1) = u(t) + Δu(t), among others (Chen & Pham, 1999, 2006). Furthermore, to distinguish “positive large” from just “positive” for e > 0, one may use those membership functions shown in Fig. 4. Since the error signal e(t) is fuzzified in the fuzzification module, one can similarly fuzzify the auxiliary signal ė(t) in the fuzzification module. Thus, there are two fuzzified inputs, e and ė, for the controller, and they both have corresponding membership functions describing their properties as “positive large” (µPL), “negative large” (µNL), or “zero” (µZO), as shown in Fig. 4. Thus, the rule base may be replaced by a set of more detailed rules as follows, where PS and NS denote “positive small” and “negative small”:

R1: IF e = PL AND ė > 0 THEN u(t+1) = −µPL(e) · u(t);
R2: IF e = PS AND ė > 0 THEN u(t+1) = −(1−µPS(e)) · u(t);
R3: IF e = PL AND ė < 0 THEN u(t+1) = µPL(e) · u(t);
R4: IF e = PS AND ė < 0 THEN u(t+1) = (1−µPS(e)) · u(t);
R5: IF e = NL AND ė > 0 THEN u(t+1) = µNL(e) · u(t);
R6: IF e = NS AND ė > 0 THEN u(t+1) = (1−µNS(e)) · u(t);
R7: IF e = NL AND ė < 0 THEN u(t+1) = −µNL(e) · u(t);
R8: IF e = NS AND ė < 0 THEN u(t+1) = −(1−µNS(e)) · u(t);

otherwise, u(t+1) = u(t). Here and below, “= PL” means “is PL,” etc. In this way, the initial rule base is enhanced and extended.

In the defuzzification module, new membership functions are needed for the change of the control action, u(t + 1) or Δu(t), if the enhanced rule base described above is used. This is because both the error and the rate of change of the error have been fuzzified to be “positive large” or “positive small,” so the control actions have to be fuzzified accordingly (to be “large” or “small”). Now, suppose that a complete, enhanced fuzzy logic rule base has been established. Then, in the defuzzification module, the weighted average formula can be used to obtain a single crisp value as the control action output from the controller (see Fig. 2):

$$u(t+1) = \frac{\sum_{i=1}^{N} \mu_i \cdot u_i(t+1)}{\sum_{i=1}^{N} \mu_i}$$

This is an average value of the multiple (N = 8 in the above rule base) control signals at step t + 1, and is physically meaningful to the given plant.

A Model-Based Approach
If a mathematical model of the system, or a fairly good approximation of it, is available, one may be able to design a fuzzy logic controller with better results, such as meeting performance specifications and guaranteeing stability. This constitutes a model-based fuzzy control approach (Chen & Zhang, 1997; Malki, Li & Chen, 1994; Malki, Feigenspan, Misir & Chen, 1997; Sooraksa & Chen, 1998; Ying, Siler & Buckley, 1990). For instance, a locally linear fuzzy system model is described by a rule base of the following form:

$$R_S^{(k)}:\ \text{IF } x_1 \text{ is } X_{k1} \text{ AND } \cdots \text{ AND } x_m \text{ is } X_{km} \text{ THEN } \dot{x} = A_k x + B_k u \qquad (3)$$

where {Ak} and {Bk} are given constant matrices, x = [x1,...,xm]T is the state vector, and u = [u1,...,un]T is a controller to be designed, with m ≥ n ≥ 1, and k = 1,···,r. The fuzzy system model (3) may be rewritten in a more compact form as follows:

$$\dot{x} = \sum_{k=1}^{r} \alpha_k (A_k x + B_k u) = A(\mu(x))\,x + B(\mu(x))\,u \qquad (4)$$

where $\mu(x) = \{\mu_{X_{ij}}(x)\}_{i,j=1}^{m}$.

Based on this fuzzy model, (3) or (4), a fuzzy controller u(t) can be designed by using some conventional techniques. For example, if a negative state-feedback controller is preferred, then one may design a controller described by the following rule base:

$$R_C^{(k)}:\ \text{IF } x_1 \text{ is } X_{k1} \text{ AND } \cdots \text{ AND } x_m \text{ is } X_{km} \text{ THEN } u = -K_k x \qquad (5)$$

where $\{K_k\}_{k=1}^{r}$ are constant control gain matrices to be determined, k = 1,···,r. Thus, the closed-loop controlled system (4) together with (5) becomes

$$R_{SC}^{(k)}:\ \text{IF } x_1 \text{ is } X_{k1} \text{ AND } \cdots \text{ AND } x_m \text{ is } X_{km} \text{ THEN } \dot{x} = [A_k - B_k K_k]\,x \qquad (6)$$

For this feedback controlled system, the following is a typical stability condition [1,2,10]: If there exists a common positive definite and symmetric constant matrix P such that $A_k^T P + P A_k = -Q$ for some Q > 0 for all k = 1,···,r, then the fuzzy controlled system (6) is asymptotically stable about zero.

This theorem provides a basic (sufficient) condition for the global asymptotic stability of the fuzzy control system, which can also be viewed as a criterion for tracking control of the system trajectory to the zero set-point. Clearly, stable control gain matrices $\{K_k\}_{k=1}^{r}$ may be determined according to this criterion in a design.

FUTURE TRENDS
This topic will be discussed elsewhere in the near future.

CONCLUSION
The essence of systems control is to achieve automation. For this purpose, a combination of fuzzy control technology and the advanced computer facilities available in industry provides a promising approach that can mimic human thinking and linguistic control ability, so as to equip control systems with a certain degree of artificial intelligence. It has now been realized that fuzzy control systems theory offers a simple, realistic and successful addition, or sometimes an alternative, for controlling various complex, imperfectly modeled, and highly uncertain engineering systems, with great potential in many real-world applications.

REFERENCES
G. Chen & T. T. Pham (1999). Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy Control Systems. CRC Press.
G. Chen & T. T. Pham (2006). Introduction to Fuzzy Systems. CRC Press.

G. Chen & D. Zhang (1997). Back-driving a truck with suboptimal distance trajectories: A fuzzy logic control approach. IEEE Trans. on Fuzzy Systems, 5: 369-380.
D. Driankov, H. Hellendoorn & M. Reinfrank (1993). An Introduction to Fuzzy Control. Springer-Verlag.
H. Malki, D. Feigenspan, D. Misir & G. Chen (1997). Fuzzy PID control of a flexible-joint robot arm with uncertainties from time-varying loads. IEEE Trans. on Contr. Sys. Tech., 5: 371-378.
H. Malki, H. Li & G. Chen (1994). New design and stability analysis of fuzzy proportional-derivative control systems. IEEE Trans. on Fuzzy Systems, 2: 345-354.
K. M. Passino & S. Yurkovich (1998). Fuzzy Control. Addison-Wesley.
P. Sooraksa & G. Chen (1998). Mathematical modeling and fuzzy control of flexible robot arms. Math. Comput. Modelling, 27: 73-93.
K. Tanaka (1996). An Introduction to Fuzzy Logic for Practical Applications. Springer.
K. Tanaka & H. O. Wang (1999). Fuzzy Control Systems Design and Analysis: A Linear Matrix Inequality Approach. IEEE Press.
L. X. Wang (1994). Adaptive Fuzzy Systems and Control: Design and Stability Analysis. Prentice-Hall.
H. Ying (2000). Fuzzy Control and Modeling: Analytical Foundations and Applications. IEEE Press.
H. Ying, W. Siler & J. J. Buckley (1990). Fuzzy control theory: a nonlinear case. Automatica, 26: 513-520.
L. A. Zadeh (1965). Fuzzy sets. Information and Control, 8: 338-353.
L. A. Zadeh (1973). Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans. on Systems, Man, and Cybernetics, 3: 28-44.

KEY TERMS
Defuzzification: A process that converts fuzzy terms to conventional expressions quantified by real-valued functions.
Fuzzification: A process that converts conventional expressions to fuzzy terms quantified by fuzzy membership functions.
Fuzzy Control: A control method based on fuzzy set and fuzzy logic theories.
Fuzzy Logic: A logic that takes on continuous values between 0 and 1.
Fuzzy Membership Function: A function defined on a fuzzy set that assumes continuous values between 0 and 1.
Fuzzy Set: A set of elements with a real-valued membership function describing their grades.
Fuzzy System: A system formulated and described by fuzzy set-based real-valued functions.


Fuzzy Decision Trees
Malcolm J. Beynon
Cardiff University, UK

INTRODUCTION
The inductive learning methodology known as decision trees concerns the ability to classify objects based on their attribute values, using a tree-like structure from which decision rules can be accrued. In this article, a description of decision trees is given, with the main emphasis on their operation in a fuzzy environment.

A first reference to decision trees is made in Hunt et al. (1966), who proposed the Concept Learning System to construct a decision tree that attempts to minimize the score of classifying chess endgames. The example problem concerning chess offers early evidence supporting the view that decision trees are closely associated with artificial intelligence (AI). Over ten years later, Quinlan (1979) developed the early work on decision trees to introduce the Iterative Dichotomiser 3 (ID3). The important feature of this development was the use of an entropy measure to aid the decision tree construction process (again using the chess game as the considered problem). It is ID3, and techniques like it, that define the hierarchical structure commonly associated with decision trees; see, for example, the recent theoretical and application studies of Pal and Chakraborty (2001), Bhatt and Gopal (2005) and Armand et al. (2007). Moreover, starting from an identified root node, paths are constructed down to leaf nodes, where the attributes associated with the intermediate nodes are identified through the use of an entropy measure to preferentially gauge the classification certainty down that path. Each path down to a leaf node forms an ‘if .. then ..’ decision rule used to classify the objects.

The introduction of fuzzy set theory in Zadeh (1965) offered a general methodology that allows notions of vagueness and imprecision to be considered. Moreover, Zadeh’s work allowed previously defined techniques to be considered within a fuzzy environment. It was over ten years later that the area of decision trees benefited from this fuzzy environment opportunity (see Chang and Pavlidis, 1977). Since then there has been a steady stream of research studies that have developed or applied fuzzy decision trees (FDTs) (see recently, for example, Li et al., 2006 and Wang et al., 2007). The expectations that come with the utilisation of FDTs are succinctly stated by Li et al. (2006, p. 655): “Decision trees based on fuzzy set theory combines the advantages of good comprehensibility of decision trees and the ability of fuzzy representation to deal with inexact and uncertain information.” Chiang and Hsu (2002) highlight that decision trees have been successfully applied to problems in artificial intelligence, pattern recognition and statistics. They go on to outline a positive development that FDTs offer, namely that they are better placed to estimate the degree to which an object is associated with each class, which is often desirable in areas like medical diagnosis (see Quinlan (1987) for the alternative view with respect to crisp decision trees). The remainder of this article looks in more detail at FDTs, including a tutorial example showing the rudiments of how an FDT can be constructed.

BACKGROUND
The background section of this article concentrates on a brief description of fuzzy set theory pertinent to FDTs, followed by a presentation of one FDT technique. In fuzzy set theory (Zadeh, 1965), the grade of membership of a value x to a set S is defined through a membership function (MF) μS(x) that can take a value in the range [0, 1]. The accompanying numerical attribute domain can be described by a finite series of MFs that each offer a grade of membership to describe x, which collectively form its concomitant fuzzy number. In this article, MFs are used to formulate linguistic variables for the considered attributes. These linguistic variables are made up of sets of linguistic terms which are defined by the MFs (see later).



Figure 1. Example membership function and their use in a linguistic variable

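Surrounding the notion of MFs is the issue of their structure (Dombi and Gera, 2005). Here, piecewise linear MFs are used to define the linguistic terms presented, see Figure 1. In Figure 1(top), a single piecewise linear MF is shown along with the values that define it, namely, α1,1, α1,2, α1,3, α1,4 and α1,5. The associated mathematical structure of this specific form of MF is given below:

$$\mu(x) = \begin{cases}
0 & \text{if } x \le \alpha_{j,1} \\
0.5\,\dfrac{x - \alpha_{j,1}}{\alpha_{j,2} - \alpha_{j,1}} & \text{if } \alpha_{j,1} < x \le \alpha_{j,2} \\
0.5 + 0.5\,\dfrac{x - \alpha_{j,2}}{\alpha_{j,3} - \alpha_{j,2}} & \text{if } \alpha_{j,2} < x \le \alpha_{j,3} \\
1 & \text{if } x = \alpha_{j,3} \\
1 - 0.5\,\dfrac{x - \alpha_{j,3}}{\alpha_{j,4} - \alpha_{j,3}} & \text{if } \alpha_{j,3} < x \le \alpha_{j,4} \\
0.5 - 0.5\,\dfrac{x - \alpha_{j,4}}{\alpha_{j,5} - \alpha_{j,4}} & \text{if } \alpha_{j,4} < x \le \alpha_{j,5} \\
0 & \text{if } \alpha_{j,5} < x
\end{cases}$$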
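The piecewise linear MF with five defining values can be sketched in Python as follows. This is a minimal illustration that assumes five finite, strictly increasing defining values; the function name `piecewise_mf` and the example values (2, 4, 6, 8, 10) are hypothetical, and the treatment of infinite end values is deliberately left out.

```python
def piecewise_mf(x, a1, a2, a3, a4, a5):
    """Piecewise linear MF: 0 at a1, 0.5 at a2, 1 at a3, 0.5 at a4, 0 at a5.

    Assumes finite defining values with a1 < a2 < a3 < a4 < a5.
    """
    if x <= a1:
        return 0.0
    if x <= a2:
        return 0.5 * (x - a1) / (a2 - a1)
    if x <= a3:
        return 0.5 + 0.5 * (x - a2) / (a3 - a2)
    if x <= a4:
        return 1.0 - 0.5 * (x - a3) / (a4 - a3)
    if x <= a5:
        return 0.5 - 0.5 * (x - a4) / (a5 - a4)
    return 0.0


# Illustrative defining values (an assumption, not taken from the article).
defining_values = (2.0, 4.0, 6.0, 8.0, 10.0)
grades = [piecewise_mf(x, *defining_values) for x in (1, 3, 5, 6, 7, 9, 11)]
print(grades)
```

The defining values mark where the grade equals 0, 0.5, 1, 0.5 and 0 in turn, so the second and fourth values are the points at which a value is as much inside the linguistic term as outside it.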

As mentioned earlier, MFs of this type are used to define the linguistic terms which make up linguistic variables. An example of a linguistic variable X based on two linguistic terms, X1 and X2, is shown in Figure 1(bottom), where the overlap of the defining values for each linguistic term is evident. Moreover, using left and right limits of the X domain as −∞ and ∞, respectively, the sets of defining values are (in list form): X1 - [−∞, −∞, α1,3, α1,4, α1,5] and X2 - [α2,1, α2,2, α2,3, ∞, ∞], where α1,3 = α2,1, α1,4 = α2,2 and α1,5 = α2,3.

This section now goes on to outline the technical details of the fuzzy decision tree approach introduced in Yuan and Shaw (1995). With an inductive fuzzy decision tree, the underlying knowledge related to a decision outcome can be represented as a set of fuzzy ‘if .. then ..’ decision rules, each of the form:

If (A1 is $T^1_{i_1}$) and (A2 is $T^2_{i_2}$) … and (Ak is $T^k_{i_k}$) then C is Cj,

where A1, A2, .., Ak and C are linguistic variables for the multiple antecedents (Ai’s) and consequent (C) statements used to describe the considered objects, and $T(A_k) = \{T^k_1, T^k_2, \ldots, T^k_{S_k}\}$ and {C1, C2, …, CL} are their respective linguistic terms, defined by the MFs $\mu_{T^k_j}(x)$ etc. The MFs $\mu_{T^k_j}(x)$ and $\mu_{C_j}(y)$ represent the grade of membership of an object’s antecedent Ak being $T^k_j$ and consequent C being Cj, respectively.

An MF µ(x) from the set describing a fuzzy linguistic variable Y defined on X can be viewed as a possibility distribution of Y on X, that is π(x) = µ(x), for all x ∈ X, the values taken by the objects in U (also normalized so that $\max_{x \in X} \pi(x) = 1$). The possibility measure Eα(Y) of ambiguity is defined by

$$E_\alpha(Y) = g(\pi) = \sum_{i=1}^{n} (\pi^*_i - \pi^*_{i+1}) \ln[i],$$

where π* = {π1*, π2*, …, πn*} is the permutation of the normalized possibility distribution π = {π(x1), π(x2), …, π(xn)}, sorted so that πi* ≥ πi+1* for i = 1, .., n, and πn+1* = 0. The ambiguity of attribute A (over the objects u1, .., um) is given as:

$$E_\alpha(A) = \frac{1}{m} \sum_{i=1}^{m} E_\alpha(A(u_i)),$$

where

$$E_\alpha(A(u_i)) = g\!\left(\mu_{T_s}(u_i) \Big/ \max_{1 \le j \le s} \mu_{T_j}(u_i)\right),$$

with T1, …, Ts the linguistic terms of an attribute (antecedent) with m objects.

The fuzzy subsethood S(A, B) measures the degree to which A is a subset of B, and is given by

$$S(A, B) = \frac{\sum_{u \in U} \min(\mu_A(u), \mu_B(u))}{\sum_{u \in U} \mu_A(u)}.$$

Given fuzzy evidence E, the possibility of classifying an object to the consequent Ci can be defined as

$$\pi(C_i \mid E) = \frac{S(E, C_i)}{\max_j S(E, C_j)},$$

where the fuzzy subsethood S(E, Ci) represents the degree of truth for the classification rule (‘if E then Ci’). With a single piece of evidence (a fuzzy number for an attribute), the classification ambiguity based on this fuzzy evidence is defined as G(E) = g(π(C| E)), which is measured using the possibility distribution π(C| E) = (π(C1| E), …, π(CL| E)).

The classification ambiguity with fuzzy partitioning P = {E1, …, Ek} on the fuzzy evidence F, denoted as G(P| F), is the weighted average of the classification ambiguity with each subset of the partition:

$$G(P \mid F) = \sum_{i=1}^{k} w(E_i \mid F)\, G(E_i \cap F),$$

where G(Ei ∩ F) is the classification ambiguity with fuzzy evidence Ei ∩ F, and where w(Ei| F) is the weight which represents the relative size of subset Ei ∩ F in F:

$$w(E_i \mid F) = \frac{\sum_{u \in U} \min(\mu_{E_i}(u), \mu_F(u))}{\sum_{j=1}^{k} \left( \sum_{u \in U} \min(\mu_{E_j}(u), \mu_F(u)) \right)}.$$

In summary, attributes are assigned to nodes based on the lowest level of classification ambiguity. A node becomes a leaf node if the level of subsethood is higher than some truth value β assigned to the whole of the fuzzy decision tree. The classification from the leaf node is to the decision group with the largest subsethood value. The truth level threshold β controls the growth of the tree; lower β may lead to a smaller tree (with lower classification accuracy), higher β may lead to a larger tree (with higher classification accuracy).

The main thrust of this article is a detailed example of the construction of a fuzzy decision tree. The description includes the transformation of a small data set into a fuzzy data set where the original values are described by their degrees of membership to certain linguistic terms. The small data set considered, consists of five objects, described by three condition attributes T1, T2 and T3, and classified by a single decision attribute C, see Table 1. If these values are considered imprecise, fuzzy, there is the option to transform the data values in

Table 1. Example data set

698

Ej

mAIN THRUST

π(Ci|E) = S ( E , Ci ) / max S ( E , C j ) ,

Object u1 u2 u3 u4 u5

k  ∑  ∑ min( j =1 u∈U

T1 112 85 130 93 132

T2 45 42 58 54 39

T3 205 192 188 203 189

C 7 17 22 29 39


Figure 2. Membership functions defining the linguistic terms, CL and CH, for the decision attribute C

fuzzy values. Here, an attribute is transformed into a linguistic variable, each described by two linguistic terms, see Figure 2. In Figure 2, the decision attribute C is shown to be described by the linguistic terms, CL and CH (possibly denoting the terms low and high). These linguistic terms are themselves defined by MFs (µC (·) and µC (·). The L H hypothetical MFs shown have the respective defining terms of , µC (·): [−∞, −∞, 9, 25, 32] and µC (·): [9, 25, L

H

F

32, ∞, ∞]. To demonstrate their utilisation, for the object u2, with a value C = 17, its fuzzification creates the two values µC (17) = 0.750 and µC (17) = 0.250, the larger L H of which is associated with the high linguistic term. A similar series of membership functions can be constructed for the three condition attributes, T1, T2 and T3, Figure 3. In Figure 3, the linguistic variable version of each condition attribute is described by two linguistic terms

Figure 3. Membership functions defining the linguistic terms for the condition attributes, T1, T2 and T3

699


Table 2. Fuzzified version of the example data set Object o1 o2 o3 o4 o5

T1 = [T1L, T1H] [0.433, 0.567] - H [1.000, 0.000] - L [0.000, 1.000] - H [1.000, 0.000] - L [0.000, 1.000] - H

T2 = [T2L, T2H] [1.000, 0.000] - L [1.000, 0.000] - L [0.227, 0.773] - H [0.409, 0.591] - H [1.000, 0.000] - L

T3 = [T3L, T3H] [0.000, 1.000] - H [0.750, 0.250] - L [0.917, 0.083] - L [0.000, 1.000] - H [0.875, 0.125] - L

C = [CL, CH] [1.000, 0.000] - L [0.750, 0.250] - L [0.594, 0.406] - L [0.214, 0.786] - H [0.000, 1.000] - H

(possibly termed as low and high), themselves defined by MFs. The use of these series of MFs is the ability to fuzzify the example data set, see Table 2. In Table 2, each object is described by a series of fuzzy values, two fuzzy values for each attribute. Also shown in Table 2, in bold, are the larger of the values in each pair of fuzzy values, with the respective linguistic term this larger value is associated with. Beyond the fuzzification of the data set, attention turns to the construction of the concomitant fuzzy decision tree for this data. Prior to this construction process, a threshold value of β = 0.75 for the minimum required truth level was used throughout. The construction process starts with the condition attribute that is the root node. For this, it is necessary to calculate the classification ambiguity G(E) of each condition attribute. The evaluation of a G(E) value is shown for the first attribute T1 (i.e. g(π(C| T1))), where it is broken down to the fuzzy labels L and H, for L;

$$\pi(C_i | T_{1L}) = S(T_{1L}, C_i) \Big/ \max_j S(T_{1L}, C_j),$$

considering CL and CH with the information in Table 2;

$$S(T_{1L}, C_L) = \sum_{u \in U} \min(\mu_{T_{1L}}(u), \mu_{C_L}(u)) \Big/ \sum_{u \in U} \mu_{T_{1L}}(u) = 1.398 / 2.433 = 0.574,$$

whereas S(T1L, CH) = 0.426. Hence π = {0.574, 0.426}, giving the ordered normalized form of π* = {1.000, 0.741}, with $\pi^*_3 = 0$, then

$$G(T_{1L}) = g(\pi(C | T_{1L})) = \sum_{i=1}^{2} (\pi^*_i - \pi^*_{i+1}) \ln[i] = 0.514,$$

along with G(T1H) = 0.572, then G(T1) = (0.514 + 0.572)/2 = 0.543. Compared with G(T2) = 0.579 and G(T3) = 0.583, the condition attribute T1, with the least classification ambiguity, forms the root node for the desired fuzzy decision tree. The subsethood values in this case are, for T1: S(T1L, CL) = 0.574 and S(T1L, CH) = 0.426, and S(T1H, CL) = 0.452 and S(T1H, CH) = 0.548. For T1L and T1H, the larger subsethood value defines the possible classification for that path. In both cases these values are less than the threshold truth value 0.75 employed, so neither of these paths can be terminated to a leaf node; instead further augmentation of them is considered.

With three condition attributes included in the example data set, the possible augmentation to T1L is with either T2 or T3. Concentrating on T2, where with G(T1L) = 0.514, the ambiguity with partition evaluated for T2 (G(T1L and T2| C)) has to be less than this value, where

$$G(T_{1L} \text{ and } T_2 | C) = \sum_{i=1}^{k} w(T_{2i} | T_{1L})\, G(T_{1L} \cap T_{2i}).$$

Starting with the weight values, in the case of T1L and T2L, it follows

$$w(T_{2L} | T_{1L}) = \sum_{u \in U} \min(\mu_{T_{2L}}(u), \mu_{T_{1L}}(u)) \Big/ \sum_{j=1}^{k} \Big( \sum_{u \in U} \min(\mu_{T_{2j}}(u), \mu_{T_{1L}}(u)) \Big) = 1.842 / 2.433 = 0.757.$$

Similarly w(T2H| T1L) = 0.243, hence

G(T1L and T2| C) = 0.757 × G(T1L ∩ T2L) + 0.243 × G(T1L ∩ T2H) = 0.757 × 0.327 + 0.243 × 0.251 = 0.309.

A concomitant value is G(T1L and T3| C) = 0.487; the lower of these (G(T1L and T2| C)) is lower than the concomitant G(T1L) = 0.514, so less ambiguity would be found if the T2 attribute was augmented to the path T1 = L. The subsequent subsethood values in this case for each new path are; T2L: S(T1L ∩ T2L, CL) = 0.759 and S(T1L ∩ T2L, CH) = 0.358; T2H: S(T1L ∩ T2H, CL) = 0.363 and S(T1L ∩ T2H, CH) = 1.000. With each suggested classification path, the largest subsethood value is above the truth level threshold, therefore they are both leaf nodes leading from the T1 = L path. The construction process continues in a similar vein for the path T1 = H, with the resultant fuzzy decision tree in this case presented in Figure 4.

The fuzzy decision tree in Figure 4 shows that five rules (leaf nodes), R1, R2, …, R5, have been constructed. There are a maximum of four levels to the tree shown, indicating that a maximum of three condition attributes are used in the rules constructed. In each non-root node shown, the subsethood levels to the decision attribute terms C = L and C = H are given. On the occasions when the larger of the subsethood values is above the defined threshold value of 0.75, it is shown in bold and the node becomes a leaf node.

The interpretative power of FDTs is shown by consideration of the rules constructed. The rule R5 can be written down as: 'If T1 = H and T3 = H then C = L with truth level 0.839.' The rules can be considered in a more linguistic form, namely: 'If T1 is high and T3 is high then C is low with truth level 0.839.' It is rules like this one that allow the clearest interpretability of results in classification problems when using FDTs.
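The root-node calculations of the worked example can be checked with a short Python sketch. It is an illustration only: the function names (`subsethood`, `ambiguity`, `class_ambiguity`, `partition_ambiguity`) are hypothetical, and the membership lists are copied from Table 2, so the results agree with the quoted figures only up to the rounding of those table entries.

```python
import math

# Fuzzified memberships from Table 2, one entry per object u1..u5.
T1L = [0.433, 1.000, 0.000, 1.000, 0.000]
T1H = [0.567, 0.000, 1.000, 0.000, 1.000]
T2L = [1.000, 1.000, 0.227, 0.409, 1.000]
T2H = [0.000, 0.000, 0.773, 0.591, 0.000]
CL = [1.000, 0.750, 0.594, 0.214, 0.000]
CH = [0.000, 0.250, 0.406, 0.786, 1.000]

def f_and(a, b):
    """Fuzzy intersection of two membership lists (element-wise minimum)."""
    return [min(x, y) for x, y in zip(a, b)]

def subsethood(a, b):
    """S(A, B): degree to which fuzzy set A is a subset of fuzzy set B."""
    return sum(f_and(a, b)) / sum(a)

def ambiguity(pi):
    """g(pi): normalise by the maximum, sort descending, weight gaps by ln(i)."""
    top = max(pi)
    ranked = sorted((p / top for p in pi), reverse=True) + [0.0]
    return sum((ranked[i] - ranked[i + 1]) * math.log(i + 1)
               for i in range(len(pi)))

def class_ambiguity(evidence, classes):
    """G(E) = g(pi(C|E)), with pi(C_i|E) proportional to S(E, C_i)."""
    return ambiguity([subsethood(evidence, c) for c in classes])

def partition_ambiguity(partition, f, classes):
    """G(P|F): weighted average of G(E_i and F) over the partition P."""
    sizes = [sum(f_and(e, f)) for e in partition]
    total = sum(sizes)
    return sum(size / total * class_ambiguity(f_and(e, f), classes)
               for size, e in zip(sizes, partition))

print(round(subsethood(T1L, CL), 3))                         # S(T1L, CL)
print(round(class_ambiguity(T1L, [CL, CH]), 3))              # G(T1L)
print(round(partition_ambiguity([T2L, T2H], T1L, [CL, CH]), 3))  # G(T1L and T2|C)
```

Because the partition ambiguity for T2 on the path T1 = L comes out below G(T1L), the sketch reproduces the decision to augment that path with T2.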

Figure 4. Fuzzy decision tree for example data set

FUTURE TRENDS

Fuzzy decision trees (FDTs) benefit from the inductive learning approach that underpins their construction, to aid in the classification of objects based on their values over different attributes. Their construction in a fuzzy environment allows the potentially critical effects of imprecision to be mitigated, and brings a beneficial level of interpretability to the results found, through the decision rules defined. As with the more traditional 'crisp' decision tree approaches, there are issues such as the complexity of the results, in this case the tree defined. Future trends will surely include how FDTs can work on reducing the complexity of the tree constructed, commonly known as pruning. Further, the applicability of the rules constructed should see the use of FDTs extended across a wider range of applications.

CONCLUSION

The interest in FDTs, with respect to AI, is due to the inductive learning and linguistic rule construction processes that are inherent in them. The induction undertaken lets the analysis derive intelligent results from the data available. Many of the applications in which FDTs have been used, such as medicine, have benefited greatly from the interpretative power of the readable rules. The accessibility of the results from FDTs should secure them a positive future.

REFERENCES

Armand, S., Watelain, E., Roux, E., Mercier, M., & Lepoutre, F.-X. (2007). Linking clinical measurements and kinematic gait patterns of toe-walking using fuzzy decision trees. Gait & Posture, 25(3), 475-484.

Bhatt, R.B., & Gopal, M. (2005). Improving the Learning Accuracy of Fuzzy Decision Trees by Direct Back Propagation. IEEE International Conference on Fuzzy Systems, 761-766.

Chang, R.L.P., & Pavlidis, T. (1977). Fuzzy decision tree algorithms. IEEE Transactions Systems Man and Cybernetics, SMC-7(1), 28-35.

Chiang, I.-J., & Hsu, J.Y.-J. (2002). Fuzzy classification trees for data analysis. Fuzzy Sets and Systems, 130, 87-99.

Dombi, J., & Gera, Z. (2005). The approximation of piecewise linear membership functions and Łukasiewicz operators. Fuzzy Sets and Systems, 154, 275-286.

Hunt, E.B., Marin, J., & Stone, P.T. (1966). Experiments in Induction. New York, NY: Academic Press.

Li, M.-T., Zhao, F., & Chow, L.-F. (2006). Assignment of Seasonal Factor Categories to Urban Coverage Count Stations Using a Fuzzy Decision Tree. Journal of Transportation Engineering, 132(8), 654-662.

Pal, N.R., & Chakraborty, S. (2001). Fuzzy Rule Extraction From ID3-Type Decision Trees for Real Data. IEEE Transactions on Systems, Man, and Cybernetics B, 31(5), 745-754.

Quinlan, J.R. (1979). Discovering rules by induction from large collections of examples. In D. Michie (Ed.), Expert Systems in the Micro Electronic Age. Edinburgh, UK: Edinburgh University Press.

Quinlan, J.R. (1987). Probabilistic decision trees. In P. Langley (Ed.), Proc. 4th Int. Workshop on Machine Learning, Los Altos, CA.

Wang, X., Nauck, D.D., Spott, M., & Kruse, R. (2007). Intelligent data analysis with fuzzy decision trees. Soft Computing, 11, 439-457.

Yuan, Y., & Shaw, M.J. (1995). Induction of fuzzy decision trees. Fuzzy Sets and Systems, 69(2), 125-139.

Zadeh, L.A. (1965). Fuzzy Sets. Information and Control, 8(3), 338-353.

KEY TERMS

Condition Attribute: An attribute that describes an object. Within a decision tree it is part of a non-leaf node, so performs as an antecedent in the decision rules used for the final classification of an object.

Decision Attribute: An attribute that characterises an object. Within a decision tree it is part of a leaf node, so performs as a consequent in the decision rules, from the paths down the tree to the leaf node.

Decision Tree: A tree-like structure for representing a collection of hierarchical decision rules that lead to a class or value, starting from a root node and ending in a series of leaf nodes.

Induction: A technique that infers generalizations from the information in the data.

Leaf Node: A node not further split, the terminal grouping, in a classification or decision tree.

Linguistic Term: One of a set of linguistic terms, which are subjective categories for a linguistic variable, each described by a membership function.

Linguistic Variable: A variable made up of a number of words (linguistic terms) with associated degrees of membership.

Membership Function: A function that quantifies the grade of membership of a variable to a linguistic term.

Node: A junction point down a path in a decision tree that describes a condition in an if-then decision rule. From a node, the current path may separate into two or more paths.

Path: A path down the tree from root node to leaf node, also termed a branch.

Root Node: The node at the top of a decision tree, from which all paths originate and lead to a leaf node.

Fuzzy Graphs and Fuzzy Hypergraphs

Leonid S. Bershtein
Taganrog Technological Institute of Southern Federal University, Russia

Alexander V. Bozhenyuk
Taganrog Technological Institute of Southern Federal University, Russia

INTRODUCTION

Graph theory has numerous applications to problems in systems analysis, operations research, economics, and transportation. However, in many cases, some aspects of a graph-theoretic problem may be uncertain. For example, the vehicle travel time or vehicle capacity on a road network may not be known exactly. In such cases, it is natural to deal with the uncertainty using the methods of fuzzy sets and fuzzy logic. Hypergraphs (Berge, 1989) generalize graphs to sets of multiarity relations, which extends graph models to the modelling of complex systems. When modelling systems with fuzzy binary and multiarity relations between objects, the transition to fuzzy hypergraphs, which combine the advantages of both fuzzy and graph models, is more natural. It allows formal optimisation and logical procedures to be carried out. However, using fuzzy graphs and hypergraphs as models of various systems (social and economic systems, communication networks and others) leads to difficulties. Isomorphic transformations of a graph reduce to a relabelling of its vertices and edges. This relabelling does not change the properties of the graph that are determined by the adjacency and incidence of its vertices and edges. The fuzzy independent set, the domination fuzzy set and the fuzzy chromatic set are invariants under isomorphic transformations of fuzzy graphs and fuzzy hypergraphs, and allow their structural analysis.

BACKGROUND

The idea of fuzzy graphs was introduced by Rosenfeld in a paper in (Zadeh, 1975), and was also discussed in (Kaufmann, 1977). The use of fuzzy graphs for cluster analysis was considered in (Matula, 1970; Matula, 1972). The use of fuzzy graphs in database theory was discussed in (Kiss, 1991). The problems of allocating centers on fuzzy graphs were considered in (Moreno, Moreno & Verdegay, 2001; Kutangila-Mayoya & Verdegay, 2005; Rozenberg & Starostina, 2005). The analysis of flows and vitality in transportation networks was considered in (Bozhenyuk, Rozenberg & Starostina, 2006). Applications of fuzzy hypergraphs to portfolio management, managerial decision making and neural cell-assemblies were considered in (Monderson & Nair, 2000). The use of fuzzy hypergraphs for decision making in CAD systems was also considered in (Malyshev, Bershtein & Bozhenyuk, 1991).

MAIN DEFINITIONS OF FUZZY GRAPHS AND HYPERGRAPHS

This article presents the main notations of fuzzy graphs and fuzzy hypergraphs, and the invariants of fuzzy graphs and hypergraphs.

Fuzzy Graph

Let a fuzzy directed graph $\tilde{G} = (X, \tilde{U})$ be given, where $X$ is a set of vertices and $\tilde{U} = \{\mu_U(x_i, x_j) \mid (x_i, x_j) \in X^2\}$ is a fuzzy set of edges with the membership function $\mu_U: X^2 \to [0,1]$ (Kaufmann, 1977).

Example 1. Let the fuzzy graph $\tilde{G}$ have $X = \{x_1, x_2, x_3, x_4\}$ and $\tilde{U} = \{\langle 0.5/(x_1,x_2)\rangle, \langle 0.6/(x_2,x_3)\rangle, \langle 0.3/(x_3,x_4)\rangle, \langle 0.2/(x_4,x_1)\rangle, \langle 1/(x_4,x_2)\rangle\}$. It is presented in Figure 1.

Figure 1.

The fuzzy graph $\tilde{G}$ may represent a fuzzy dependence relation between the objects $x_1$, $x_2$, $x_3$ and $x_4$. If the object $x_j$ fuzzily depends on the object $x_i$, then there is a directed edge $(x_i, x_j)$ with membership function $\mu_U(x_i, x_j)$. If the fuzzy relation represented by the fuzzy graph $\tilde{G}$ is symmetric, we have a fuzzy nondirected graph.

A fuzzy graph $\tilde{G} = (X, \tilde{U})$ is conveniently represented by a fuzzy adjacency matrix $\|r_{ij}\|_{n \times n}$, where $r_{ij} = \mu_U(x_i, x_j)$. So the fuzzy graph presented in Figure 1 may be described by the adjacency matrix:

$$
R_X =
\begin{array}{c|cccc}
 & x_1 & x_2 & x_3 & x_4 \\ \hline
x_1 & 0 & 0.5 & 0 & 0 \\
x_2 & 0 & 0 & 0.6 & 0 \\
x_3 & 0 & 0 & 0 & 0.3 \\
x_4 & 0.2 & 1 & 0 & 0
\end{array}
$$

The fuzzy graph $\tilde{H} = (X', \tilde{U}')$ is called a fuzzy subgraph (Monderson & Nair, 2000) of $\tilde{G} = (X, \tilde{U})$ if $X' \subseteq X$ and $\tilde{U}' \subseteq \tilde{U}$.

A fuzzy directed path (Bershtein & Bozhenyuk, 2005) $\tilde{L}(x_i, x_m)$ of the graph $\tilde{G} = (X, \tilde{U})$ is the sequence of fuzzy directed edges from vertex $x_i$ to vertex $x_m$:

$$\tilde{L}(x_i, x_m) = \langle \mu_U(x_i, x_j)/(x_i, x_j) \rangle, \langle \mu_U(x_j, x_k)/(x_j, x_k) \rangle, \ldots, \langle \mu_U(x_l, x_m)/(x_l, x_m) \rangle.$$

The conjunctive strength of the path $\mu(\tilde{L}(x_i, x_m))$ is defined as:

$$\mu(\tilde{L}(x_i, x_m)) = \mathop{\&}_{\langle x_\alpha, x_\beta \rangle \in \tilde{L}(x_i, x_m)} \mu_U\langle x_\alpha, x_\beta \rangle.$$

A fuzzy directed path $\tilde{L}(x_i, x_m)$ is called a simple path between vertices $x_i$ and $x_m$ if no part of it is a path between the same vertices.

If the number of vertices $n \ge 3$ and $x_i = x_m$, then the path is called a cycle. Obviously, this definition coincides with the same definition for nonfuzzy graphs.

Vertex $y$ is called fuzzy accessible from vertex $x$ in the graph $\tilde{G} = (X, \tilde{U})$ if there exists a fuzzy directed path from vertex $x$ to vertex $y$. The accessible degree of vertex $y$ from vertex $x$ ($x \ne y$) is defined by the expression:

$$\gamma(x, y) = \max_{\alpha} \big( \mu(\tilde{L}_\alpha(x, y)) \big), \quad \alpha = 1, 2, \ldots, p,$$

where $p$ is the number of distinct simple directed paths from vertex $x$ to vertex $y$.

A subset of vertices $X'$ is called a fuzzy independent vertex set (Bershtein & Bozhenuk, 2001) with the degree of independence $\alpha(X') = 1 - \max_{x_i, x_j \in X'} \{\mu_U(x_i, x_j)\}$.
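The path strength and accessible degree defined in this section can be illustrated on the graph of Example 1. The following Python sketch is hedged: it assumes that the conjunction & is the minimum operation (which is consistent with the colouring example given later in the article), and the names `MU`, `simple_paths`, `strength` and `accessible_degree` are hypothetical.

```python
# Edge memberships of Example 1: MU[(i, j)] is the degree of the edge (x_i, x_j).
MU = {(1, 2): 0.5, (2, 3): 0.6, (3, 4): 0.3, (4, 1): 0.2, (4, 2): 1.0}

def simple_paths(src, dst, trail=None):
    """Yield every simple directed path from src to dst as a list of vertices."""
    trail = trail or [src]
    for (a, b) in MU:
        # Follow only edges leaving the current vertex; never revisit a vertex.
        if a != trail[-1] or b in trail:
            continue
        if b == dst:
            yield trail + [b]
        else:
            yield from simple_paths(src, dst, trail + [b])

def strength(path):
    """Conjunctive strength of a path, with & taken as min (assumption)."""
    return min(MU[(a, b)] for a, b in zip(path, path[1:]))

def accessible_degree(x, y):
    """gamma(x, y): the best strength over all simple paths from x to y."""
    return max((strength(p) for p in simple_paths(x, y)), default=0.0)

print(accessible_degree(4, 3))  # best of x4->x2->x3 and x4->x1->x2->x3
```

For instance, x3 is reached from x4 either through x2 (strength 0.6) or through x1 and x2 (strength 0.2), so under the stated assumption the accessible degree is 0.6.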

A subset of vertices $X' \subseteq X$ of the graph $\tilde{G}$ is called a maximal fuzzy independent vertex set with the degree $\alpha(X')$ if the condition $\alpha(X'') < \alpha(X')$ is true for any $X'' \supset X'$. Let a set $\tau_k = \{X_{k1}, X_{k2}, \ldots, X_{kl}\}$ be given, where $X_{ki}$ is a fuzzy independent $k$-vertex set with the degree of independence $\alpha_{ki}$. We define

$$\alpha_k^{\max} = \max\{\alpha_{k1}, \alpha_{k2}, \ldots, \alpha_{kl}\}.$$

The value $\alpha_k^{\max}$ means that the fuzzy graph $\tilde{G}$ includes a $k$-vertex subgraph with the degree of independence $\alpha_k^{\max}$ and does not include a $k$-vertex subgraph with a degree of independence greater than $\alpha_k^{\max}$.

The fuzzy set

$$\tilde{\Psi}_X = \{\langle \alpha_1^{\max}/1 \rangle, \langle \alpha_2^{\max}/2 \rangle, \ldots, \langle \alpha_n^{\max}/n \rangle\}$$

is called the fuzzy independent set of the fuzzy graph $\tilde{G}$.

The fuzzy graph $\tilde{G}$ presented in Figure 1 has seven maximal fuzzy independent vertex sets: $\Psi_1 = \{x_2\}$, $\Psi_2 = \{x_4\}$, $\Psi_3 = \{x_1, x_3\}$ with the degree of independence 1; $\Psi_4 = \{x_1, x_4\}$ with the degree of independence 0.8; $\Psi_5 = \{x_1, x_3, x_4\}$ with the degree of independence 0.7; $\Psi_6 = \{x_1, x_2\}$ with the degree of independence 0.5 and $\Psi_7 = \{x_1, x_2, x_3\}$ with the degree of independence 0.4. So its fuzzy independent set is $\tilde{\Psi}_X = \{\langle 1/1 \rangle, \langle 1/2 \rangle, \langle 0.7/3 \rangle, \langle 0/4 \rangle\}$.
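The independence degrees quoted for the graph of Example 1 can be reproduced by brute force. The Python sketch below is illustrative only: the names are hypothetical, degrees are rounded to avoid floating-point noise, and sets with degree 0 are skipped when listing maximal sets (an assumption made so that the degenerate full vertex set is not reported).

```python
from itertools import combinations

# Edge memberships of Example 1: MU[(i, j)] is the degree of the edge (x_i, x_j).
MU = {(1, 2): 0.5, (2, 3): 0.6, (3, 4): 0.3, (4, 1): 0.2, (4, 2): 1.0}
VERTICES = [1, 2, 3, 4]

def independence(subset):
    """alpha(X') = 1 - max membership of any edge joining two vertices of X'."""
    worst = max((MU.get((a, b), 0.0) for a in subset for b in subset if a != b),
                default=0.0)
    return round(1.0 - worst, 10)

def subsets(k):
    return [frozenset(c) for c in combinations(VERTICES, k)]

def fuzzy_independent_set():
    """Best independence degree attainable with exactly k vertices, for each k."""
    return {k: max(independence(s) for s in subsets(k))
            for k in range(1, len(VERTICES) + 1)}

def maximal_independent_sets():
    """Sets whose every proper superset has a strictly smaller degree."""
    all_sets = [s for k in range(1, len(VERTICES) + 1) for s in subsets(k)]
    found = {}
    for s in all_sets:
        degree = independence(s)
        # Assumption: sets with degree 0 are not reported.
        if degree > 0 and all(independence(t) < degree for t in all_sets if s < t):
            found[s] = degree
    return found

print(fuzzy_independent_set())
for vertex_set, degree in maximal_independent_sets().items():
    print(sorted(vertex_set), degree)
```

Under these assumptions the enumeration yields the seven maximal sets listed in the text and the same fuzzy independent set.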

Let $X'$ be an arbitrary subset of the vertex set $X$. For each vertex $y \in X \setminus X'$ we define the value:

$$\gamma(y) = \max_{x \in X'} \{\mu_U(y, x)\}.$$

The set $X'$ is called a fuzzy dominating vertex set for vertex $y$ with the degree of domination $\gamma(y)$. The set $X'$ is called a fuzzy dominating vertex set for the graph $\tilde{G}$ with the degree of domination

$$\beta(X') = \min_{y \in X \setminus X'} \max_{x \in X'} \{\mu_U(y, x)\}.$$

A subset $X' \subseteq X$ of the graph $\tilde{G}$ is called a minimal fuzzy dominating vertex set with the degree $\beta(X')$ if the condition $\beta(X'') < \beta(X')$ is true for any subset $X'' \subset X'$. Let a set $\tau_k = \{X_{k1}, X_{k2}, \ldots, X_{kl}\}$ be given, where $X_{ki}$ is a fuzzy dominating $k$-vertex set with the degree of domination $\beta_{ki}$. We define $\beta_k^{\min} = \max\{\beta_{k1}, \beta_{k2}, \ldots, \beta_{kl}\}$. In the case $\tau_k = \varnothing$ we define $\beta_k^{\min} = \beta_{k-1}^{\min}$. The value $\beta_k^{\min}$ means that the fuzzy graph $\tilde{G}$ includes a $k$-vertex subgraph with the degree of domination $\beta_k^{\min}$ and does not include a $k$-vertex subgraph with a degree of domination greater than $\beta_k^{\min}$.

A fuzzy set $\tilde{B}_X = \{\langle \beta_1^{\min}/1 \rangle, \langle \beta_2^{\min}/2 \rangle, \ldots, \langle \beta_n^{\min}/n \rangle\}$ is called the domination fuzzy set of the fuzzy graph $\tilde{G}$ (Bershtein & Bozhenuk, 2001 a).

The fuzzy graph $\tilde{G}$ (Figure 1) has five minimal fuzzy dominating vertex sets: $P_1 = \{x_1, x_2, x_3\}$ with the degree of domination 1; $P_2 = \{x_1, x_3, x_4\}$ with the degree of domination 0.6; $P_3 = \{x_2, x_3\}$ with the degree of domination 0.5; $P_4 = \{x_1, x_3\}$ with the degree of domination 0.2 and $P_5 = \{x_2, x_4\}$ with the degree of domination 0.3. The domination fuzzy set of the fuzzy graph $\tilde{G}$ is $\tilde{B}_X = \{\langle 0/1 \rangle, \langle 0.5/2 \rangle, \langle 1/3 \rangle, \langle 1/4 \rangle\}$.

A value

$$L = \mathop{\&}_{i=1,k} \alpha_i = \mathop{\&}_{i=1,k} \Big(1 - \bigvee_{x, y \in X_i} \mu_G(x, y)\Big),$$

where $X_i$ is the subset of vertices painted in colour $i$, is called the separation degree of the fuzzy graph $\tilde{G}$ with $k$ colors.

The fuzzy graph $\tilde{G}$ may be colored in a number of colors from 1 to $n$. In this case the separation degree $L$ depends on the number of colors. With the fuzzy graph $\tilde{G}$ we associate a family of fuzzy sets $\Re = \{\tilde{A}_G\}$, $\tilde{A}_G = \{\langle L_{\tilde{A}}(k)/k \rangle \mid k = \overline{1,n}\}$, where $L_{\tilde{A}}(k)$ defines a degree of separation of the fuzzy graph $\tilde{G}$ with $k$ colors.

A fuzzy set $\tilde{\gamma} = \{\langle L_{\tilde{\gamma}}(k)/k \rangle \mid k = \overline{1,n}\}$ is called the fuzzy chromatic set of the graph $\tilde{G}$ if the condition $\tilde{A}_G \subseteq \tilde{\gamma}$ holds for any set $\tilde{A}_G \in \Re$, or else: $(\forall \tilde{A}_G \in \Re)(\forall k = \overline{1,n})[L_{\tilde{A}}(k) \le L_{\tilde{\gamma}}(k)]$ (Bershtein & Bozhenuk, 2001 b). In other words, the fuzzy chromatic set defines the maximal separation degree of the fuzzy graph $\tilde{G}$ with $k = 1, 2, \ldots, n$ colors.

For the fuzzy graph $\tilde{G}$ (Figure 1) the fuzzy chromatic set is $\tilde{\gamma}(\tilde{G}) = \{\langle 0/1 \rangle, \langle 0.5/2 \rangle, \langle 1/3 \rangle\}$. So, the fuzzy graph $\tilde{G}$ may be colored:

• by one color with the degree of separation 0. In other words, there is at least one pair of vertices xi and xj for which the membership function µU(xi, xj) = 1. In our graph, these vertices are x4 and x2;

• by 2 colors with the degree of separation 0.5 (vertices x1, x2 - first color, vertices x3, x4 - second color). In other words, between vertices of the same color there are no edges with a membership function greater than 0.5;

• by 3 colors with the degree of separation 1 (vertices x1, x3 - first color, vertex x2 - second color, vertex x4 - third color). In other words, between vertices of the same color there are no edges.
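The domination degree of a vertex set and the separation degree of a given colouring can be evaluated directly from the edge memberships of Example 1. The Python sketch below is an illustration with hypothetical names; it takes & as minimum and ∨ as maximum, treats a set that already contains every vertex as having domination degree 1 (an assumption), and only scores the colourings quoted in the text rather than searching for the best colouring.

```python
# Edge memberships of Example 1: MU[(i, j)] is the degree of the edge (x_i, x_j).
MU = {(1, 2): 0.5, (2, 3): 0.6, (3, 4): 0.3, (4, 1): 0.2, (4, 2): 1.0}
VERTICES = (1, 2, 3, 4)

def domination(subset):
    """beta(X'): the weakest, over outside vertices y, of the strongest edge (y, x)."""
    outside = [y for y in VERTICES if y not in subset]
    # Assumption: with no outside vertex left, the degree is taken as 1.
    return min((max(MU.get((y, x), 0.0) for x in subset) for y in outside),
               default=1.0)

def separation(colouring):
    """L: minimum over colour classes of 1 - strongest edge inside the class."""
    degrees = []
    for group in colouring:
        worst = max((MU.get((a, b), 0.0) for a in group for b in group if a != b),
                    default=0.0)
        degrees.append(1.0 - worst)
    return min(degrees)

print(domination({2, 3}))            # P3 from the text
print(separation([{1, 2}, {3, 4}]))  # the 2-colouring from the text
```

Evaluating the five sets P1 to P5 and the three quoted colourings with these functions returns the degrees stated in the text.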

Fuzzy Hypergraph

Let a fuzzy hypergraph $\tilde{H} = (X, \tilde{E})$ be given, where $X = \{x_i\}$, $i \in I = \{1, 2, \ldots, n\}$, is a finite set and $\tilde{E} = \{\tilde{e}_k\}$, $\tilde{e}_k = \{\langle \mu_{e_k}(x)/x \rangle\}$, $k \in K = \{1, 2, \ldots, m\}$, is a family of fuzzy subsets in $X$ (Monderson & Nair, 2000; Bershtein & Bozhenyuk, 2005). Thus the elements of the set $X$ are the vertices of the hypergraph, and the family $\tilde{E}$ is the family of fuzzy edges of the hypergraph. The value $\mu_{e_k}(x) \in [0,1]$ is the incidence degree of a vertex $x$ to an edge $\tilde{e}_k$.

It can be seen that a fuzzy hypergraph turns into a fuzzy graph when $1 \le |\tilde{e}_k| \le 2$, $k \in K$. Vertices $x$ and $y$ are called fuzzy adjacent vertices if there is some edge which includes both vertices. In this case the value

$$\mu(x, y) = \bigvee_{\tilde{e}_k \in \tilde{E}} \mu_{e_k}(x) \,\&\, \mu_{e_k}(y)$$

is called the adjacent degree of the two vertices $x$ and $y$ of the fuzzy hypergraph $\tilde{H}$. Two edges $\tilde{e}_i$ and $\tilde{e}_j$ are called fuzzy adjacent edges if $\tilde{e}_i \cap \tilde{e}_j \ne \varnothing$. In this case the value

$$\mu(e_i, e_j) = \bigvee_{x \in (e_i \cap e_j)} \mu_{e_i \cap e_j}(x)$$

is called the adjacent degree of the edges $\tilde{e}_i$ and $\tilde{e}_j$.

A fuzzy hypergraph $\tilde{H} = (X, \tilde{E})$ is conveniently represented by a fuzzy incidence matrix $\|r_{ij}\|_{n \times m}$, where $r_{ij} = \mu_{e_j}(x_i)$. So any matrix whose elements lie in the interval [0,1] may be considered as the fuzzy incidence matrix of some fuzzy hypergraph.

A fuzzy simple path $\tilde{C}(x_1, x_{q+1})$ of length $q$ is defined as the sequence

$$\tilde{C}(x_1, x_{q+1}) = (x_1, \mu_{e_1}(x_1), \tilde{e}_1, \mu_{e_1}(x_2), x_2, \mu_{e_2}(x_2), \tilde{e}_2, \ldots, \tilde{e}_q, \mu_{e_q}(x_{q+1}), x_{q+1}),$$

where all vertices $x_1, \ldots, x_q \in X$ and all edges $\tilde{e}_1, \ldots, \tilde{e}_q \in \tilde{E}$ are different. The strength of a fuzzy simple path is the weakest of the adjacent degrees included in this path $\tilde{C}(x_1, x_{q+1})$. If two vertices $x_1$ and $x_{q+1}$ are connected by paths $\tilde{C}_1, \tilde{C}_2, \ldots, \tilde{C}_t$ with strengths $\mu_1, \mu_2, \ldots, \mu_t$, then the vertices $x_1$ and $x_{q+1}$ are said to be fuzzy connected with the strength $\mu(x_1, x_{q+1}) = \mu_1 \vee \mu_2 \vee \ldots \vee \mu_t$.

The internal stability degree of a vertex subset $X'$ of the fuzzy hypergraph $\tilde{H}$ is determined as:

$$\alpha_{X'} = 1 - \max_{x, y \in X'} \mu(x, y).$$

A subset $X' \subseteq X$ is called a maximal fuzzy internally stable set with the degree of internal stability $\alpha_{X'}$ if the statement $(\forall X'' \supset X')(\alpha_{X''} < \alpha_{X'})$ is true.

Let us paint each vertex $x \in X$ of the hypergraph $\tilde{H}$ in one of $k$ colours ($1 \le k \le n$) and consider $X_i$, the subset of vertices coloured identically. The value

$$L = \mathop{\&}_{i=1,k} \alpha_{X_i} = \mathop{\&}_{i=1,k} \Big(1 - \bigvee_{x, y \in X_i} \mu(x, y)\Big)$$

is called the separation degree of the fuzzy hypergraph $\tilde{H}$ at its $k$-colouring (Bershtein, Bozhenuk & Rozenberg, 2005).

The fuzzy hypergraph $\tilde{H}$ can be coloured in any number $k$ of colours, and thus the separation degree $L$ depends on their number. With the fuzzy hypergraph $\tilde{H}$ we associate the family of fuzzy sets $\Re = \{\tilde{A}_H\}$, $\tilde{A}_H = \{\langle L(k)/k \rangle \mid k = \overline{1,n}\}$, where $L(k)$ determines the separation degree of the fuzzy hypergraph $\tilde{H}$ at a certain $k$-colouring. A fuzzy set $\tilde{\gamma} = \{\langle L_{\tilde{\gamma}}(k)/k \rangle \mid k = \overline{1,n}\}$ is called the fuzzy chromatic set of the hypergraph $\tilde{H}$ if, for any other set $\tilde{A}_H \in \Re$, it is true that $\tilde{A}_H \subseteq \tilde{\gamma}$. In other words, $(\forall \tilde{A}_H \in \Re)(\forall k = \overline{1,n})[L(k) \le L_{\tilde{\gamma}}(k)]$. That is, the fuzzy chromatic set of the hypergraph $\tilde{H}$ determines the greatest separation degrees when colouring its vertices in 1, 2, ..., n colours.

Let $\tilde{H}$ be a fuzzy hypergraph whose incidence matrix is given by:

$$
I =
\begin{array}{c|ccccc}
 & e_1 & e_2 & e_3 & e_4 & e_5 \\ \hline
x_1 & 0.8 & 0.5 & 0 & 0 & 0 \\
x_2 & 1 & 0 & 0 & 0 & 0 \\
x_3 & 0.4 & 1 & 0.3 & 0.7 & 0 \\
x_4 & 0 & 0.6 & 0.4 & 0.2 & 1 \\
x_5 & 0 & 0 & 0.7 & 1 & 0 \\
x_6 & 0 & 0 & 0 & 0 & 0.4
\end{array}
$$

The fuzzy chromatic set for the fuzzy hypergraph is $\tilde{\gamma} = \{\langle 0.2/1 \rangle, \langle 0.5/2 \rangle, \langle 1/3 \rangle\}$.

In other words, the fuzzy hypergraph may be colored by one color with the degree of separation 0.2; by 2 colors with the degree of separation 0.5 (vertices x2, x3 and x6 - first color, vertices x1, x4 and x5 - second color); by 3 colors with the degree of separation 1 (vertices x2 and x4 - first color, vertices x1, x5 and x6 - second color, vertex x3 - third color).
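The hypergraph colouring example can be checked from the incidence matrix alone. The Python sketch below is illustrative and uses hypothetical names; it takes & as minimum and ∨ as maximum, derives the adjacent degree of each vertex pair from the incidence degrees, and scores the colourings quoted in the text.

```python
from itertools import combinations

# Incidence degrees of the example: INCIDENCE[i][k] is the degree of x_i in e_(k+1).
INCIDENCE = {
    1: [0.8, 0.5, 0.0, 0.0, 0.0],
    2: [1.0, 0.0, 0.0, 0.0, 0.0],
    3: [0.4, 1.0, 0.3, 0.7, 0.0],
    4: [0.0, 0.6, 0.4, 0.2, 1.0],
    5: [0.0, 0.0, 0.7, 1.0, 0.0],
    6: [0.0, 0.0, 0.0, 0.0, 0.4],
}

def adjacency(x, y):
    """mu(x, y): the strongest edge, where an edge is as strong as its weaker end."""
    return max(min(a, b) for a, b in zip(INCIDENCE[x], INCIDENCE[y]))

def separation(colouring):
    """L: minimum over colour classes of 1 - largest adjacent degree inside the class."""
    degrees = []
    for group in colouring:
        worst = max((adjacency(x, y) for x, y in combinations(sorted(group), 2)),
                    default=0.0)
        degrees.append(1.0 - worst)
    return min(degrees)

print(adjacency(1, 2), adjacency(3, 5))
print(separation([{2, 3, 6}, {1, 4, 5}]))
```

The strongest adjacency in this hypergraph is between x1 and x2 (degree 0.8 through e1), which is why a single colour cannot achieve a separation degree above 0.2.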


FUTURE TRENDS

According to L. Zadeh's principle of generalization, the theory of fuzzy graphs and fuzzy hypergraphs will develop along the same course as the theories of nonfuzzy graphs, hypergraphs, and fuzzy sets.

CONCLUSION

When we consider fuzzy graphs and fuzzy hypergraphs, there is an opportunity to relate any set of vertices and edges to a family of partial graphs and hypergraphs with a given property. For example, a sequence of edges can be related to a family of graph paths; a sequence of vertices and edges can be related to a family of bipartite graphs, and so on. This makes it possible to define new properties of fuzzy graphs and hypergraphs, and to use them for the analysis and synthesis of fuzzy systems.

REFERENCES

Berge, C. (1989). Hypergraphs: combinatorics of finite sets. Elsevier Science Publishers.

Bershtein, L.S. & Bozhenuk A.V. (2001 a). Maghout Method for Determination of Fuzzy Independent, Dominating Vertex Sets and Fuzzy Graph Kernels. J. General Systems, 30, 45-52.

Bershtein, L.S. & Bozhenyuk A.V. (2001 b). A Color Problem for Fuzzy Graph. Computation intelligence: theory and applications; international conference; proceedings. 7th Fuzzy Days, Dortmund, Germany, October 1-3, 2001. Bernd Reusch (ed.): Springer-Verlag (2206). 500-505.

Bershtein, L.S. & Bozhenyuk, A.V. (2005). Fuzzy Graphs and Hypergraphs. Moscow, Nauchniy Mir.

Bershtein, L.S., Bozhenyuk, A.V. & Rozenberg, I.N. (2005). Fuzzy Coloring of Fuzzy Hypergraph. Computation Intelligence, Theory and Applications. International Conference 8th Fuzzy Days in Dortmund, Germany, Sept. 29 - Oct. 01, 2004 Proceedings. Bernd Reusch (ed.): Springer-Verlag. 703-711.

Bozhenyuk, A.V., Rozenberg, I.N. & Starostina, T.A. (2006). Analysis and Research of Flows and Vitality in Transportation Nets with Fuzzy Dates. Moscow, Nauchniy Mir.

Kaufmann, A. (1977). Introduction a la theorie des sous-ensemles flous, Masson, Paris, France.

Kiss, A. (1991). An Application of Fuzzy Graphs in Database Theory, Automata, Languages and Programming Systems. Pure Math., Appl. Ser. A, 1, 337-342.

Kutangila-Mayoya, D. & Verdegay, J.L. (2005). P-Median Problems in a Fuzzy Environment. Mathware & Soft Computing, 12, 97-106.

Malyshev, N.G., Bershtein, L.S. & Bozhenyuk, A.V. (1991). Fuzzy Models for Expert Systems in CAD-Systems. Moscow, Energoatomizdat.

Matula, D.W. (1970). Cluster Analysis Via Graph Theoretic Techniques: Proc. of Lousiana Conf. on Combinatorics, Graph Theory and Computing. 199-212.

Matula, D.W. (1972). K-components, Clusters, and Slicings in Graphs. SIAM J. Appl. Math., 22, 459-480.

Monderson, J.N. & Nair, P.S. (2000). Fuzzy Graphs and Fuzzy Hypergraphs. Heidelberg; New-York: Physica-Verl.

Moreno Perez, J.A., Moreno-Vega, J.M. & Verdegay, J.L. (2001). In Location Problem on Fuzzy Graphs. Mathware & Soft Computing, 8, 217-225.

Zadeh, L.A. (1975). Fuzzy sets and their application to cognitive and decision, Academic Press, New York, USA.

KEY TERMS

Binary Relation: A binary relation R from a set A to a set B is a subset of A×B.

Binary Symmetric Relation: A relation R on a set A is symmetric if for all x,y∈A xRy⇒yRx.

Fuzzy Set: A generalization of the definition of the classical set. A fuzzy set is characterized by a membership function, which maps the members of the universe into the unit interval, thus assigning to elements of the universe degrees of belongingness with respect to a set.

Graph: A graph G = (V, E) is a mathematical structure consisting of two finite sets V and E. The elements of V are called vertices (or nodes), and the elements of E are called edges. Each edge has a set of one or two vertices associated to it, which are called its endpoints.

Graph Invariant: A property of a graph that is preserved by isomorphisms.

Hypergraph: A hypergraph on a finite set X={x1,x2,…,xn} is a family H={E1,E2,…,Em} of subsets of X such that Ei ≠ ∅ and $\bigcup_{i=1}^{m} E_i = X$.

Isomorphic Graphs: Two graphs that have a structure-preserving vertex bijection between them.

Membership Function: The membership function of a fuzzy set is a generalization of the characteristic function of crisp sets.

Multiarity Relation: A multiarity relation R between elements of sets A, B, …, C is a subset of A×B×…×C.

Fuzzy Logic Applied to Biomedical Image Analysis

Alfonso Castro
University of A Coruña, Spain

Bernardino Arcay
University of A Coruña, Spain

INTRODUCTION

Ever since Zadeh established the basis of fuzzy logic in his famous article Fuzzy Sets (Zadeh, 1965), an increasing number of research areas have used this technique to solve and model problems, applying it mainly to control systems. This proliferation is largely due to the good results obtained in classifying the ambiguous information that is typical of complex systems. Success in this field has been so overwhelming that fuzzy logic can be found in many industrial developments of the last decade: control of the Sendai train (Yasunobu & Miyamoto, 1985), control of air-conditioning systems, washing machines, auto-focus in cameras, industrial robots, etc. (Shaw, 1998).

Fuzzy logic has also been applied to computerized image analysis (Bezdek & Keller & Krishnapuram & Pal, 1999) because of its particular virtues: high noise insensitivity and the ability to easily handle multidimensional information (Sutton & Bezdek & Cahoon, 1999), features that are required in most digital image analyses. Within fuzzy logic, the techniques most often applied to image analysis have been fuzzy clustering algorithms, ever since Bezdek proposed them in the seventies (Bezdek, 1973). These techniques have evolved continuously towards correcting the problems of the initial algorithms and obtaining a better classification: techniques for a better initialization of these algorithms, and algorithms that allow the evaluation of the solution by means of validity functions. Also, the classification mechanism was improved by modifying the membership function of the algorithm, allowing it to present an adaptive behaviour; recently, kernel functions were applied to the calculation of memberships (Zhong & Wei & Jian, 2003).

At present, applications of fuzzy logic are found in nearly all fields of computer science, and it constitutes one of the most promising branches of Artificial Intelligence from both a theoretical and a commercial point of view. A proof of this evolution is the development of intelligent systems based on fuzzy logic. This article presents several fuzzy clustering algorithms applied to medical image analysis. We also include the results of a study that uses biomedical images to illustrate the mentioned concepts and techniques.

BACKGROUND

Fuzzy logic is an extension of traditional binary logic that yields a multi-valued logic: it describes domains in much more detail and classifies better by searching a more extensive area. Fuzzy logic makes it possible to model the real world more faithfully: for example, whereas binary logic merely allows us to state that a coffee is hot or cold, fuzzy logic allows us to distinguish between all the intermediate temperatures: very hot, lukewarm, cold, very cold, etc.

Techniques based on fuzzy logic have proven to be very useful for dealing with the ambiguity and vagueness that are normally associated with digital image analysis. At what grey level do we set the threshold? Where do we locate the edge of a blurred object? When is a grey level high, low, or average?

The fuzzy processing of digital images can be considered a totally different approach from traditional computer vision techniques. It was not developed to solve a specific problem; rather, it describes a new class of image processing techniques and a new methodology to develop them: fuzzy edge detectors, fuzzy geometric operators, fuzzy morphological operators, etc.

These features make fuzzy logic especially useful for the development of algorithms that improve medical image analysis, because it provides a framework for the representation of knowledge that can be used in any phase of the analysis (Wu & Agam & Roy & Armato, 2004) (Vermandel & Betrouni & Taschner & Vasseu & Rosseau, 2007).

FUZZY CLUSTERING ALGORITHMS APPLIED TO BIOMEDICAL IMAGE ANALYSIS

Medical imaging systems use a series of sensors that detect the features of the tissues and the structure of the organs, which allows us, depending on the technique used, to obtain a great amount of information and images of the area from different angles. These virtues have made them one of the most popular support techniques in diagnosis, and have given rise to the current spread and variety of medical imaging modalities (X-rays, PET, etc.) and to new modalities that are still being developed (fMRI).

The complexity of segmenting biomedical images is entirely due to their characteristics: the large amount of data that needs to be analyzed, the loss of information associated with the transition from a 3D body to a 2D representation, the great variability and complexity of the shapes that must be analyzed, etc.

Among the approaches most frequently applied to segment medical images is the use of pattern recognition techniques, since the purpose of analyzing a medical digital image is normally the detection of a particular element or object: tumors, organs, etc. Of all these techniques, fuzzy clustering techniques have proven to be among the most powerful, because they allow us to use several features of the dataset, each with its own dimensionality, and to partition these data; they also work automatically and usually have low computational requirements.

Therefore, if the problem of segmentation is defined as the partition of the image into regions that share a common feature, fuzzy clustering algorithms carry out this partition with a set of exemplary elements, called centroids, and obtain a matrix of the size of the original image and with a dimensionality equal to the number of clusters into which the image was divided; this matrix indicates the membership of each pixel to each cluster and serves as a basis for the detection of each element.

In the next section we present a series of fuzzy clustering algorithms that can be considered to reflect the evolution in this field and its various viewpoints. Finally, these algorithms are used in a study that shows the use and possibilities of fuzzy logic in the analysis of biomedical images.

Fuzzy C-Means (FCM)

The FCM algorithm was developed by Bezdek (Bezdek, 1973) and is the first fuzzy clustering algorithm; it initially needs the number of clusters into which the image will be divided and a sample of each cluster. The steps of this algorithm are the following:

1. Calculation of the membership of each element to each cluster:

$$u_k(i,j) = \left[ \sum_{l=1}^{C} \left( \frac{\lVert y(i,j) - v_k \rVert}{\lVert y(i,j) - v_l \rVert} \right)^{\frac{2}{m-1}} \right]^{-1} \qquad (1)$$

2. Calculation of the new centroids of the image:

$$v_k = \frac{\sum_{i,j} u_k(i,j)^m \, y(i,j)}{\sum_{i,j} u_k(i,j)^m}, \quad k = 1, \ldots, C \qquad (2)$$

3. If the error stays below a determined threshold, stop. Otherwise, return to step 1.

Here y(i,j) is the value of pixel (i,j), v_k is the centroid of cluster k, C is the number of clusters, u_k(i,j) is the membership of pixel (i,j) to cluster k, and m > 1 is the exponent that controls the fuzziness of the partition.

The parameters that were varied in the analysis of the algorithm were the provided samples and the value of m.
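The three FCM steps can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the implementation used in the study: the function name `fcm`, the flattening of the image into an (N, D) array of pixel vectors, the `eps` guard against zero distances and the choice of the largest centroid displacement as the "error" of step 3 are all assumptions made here.

```python
import numpy as np

def fcm(pixels, centroids, m=2.0, tol=1e-4, max_iter=100):
    """Fuzzy C-Means on an (N, D) array of pixel feature vectors.

    centroids holds the (C, D) initial samples, one per cluster.
    Returns the (N, C) memberships of the last iteration and the
    (C, D) final centroids.
    """
    y = np.asarray(pixels, dtype=float)
    v = np.asarray(centroids, dtype=float).copy()
    eps = 1e-12  # assumption: avoids division by zero when a pixel equals a centroid
    for _ in range(max_iter):
        # Step 1 (Eq. 1): distance of every pixel to every centroid, shape (N, C).
        dist = np.linalg.norm(y[:, None, :] - v[None, :, :], axis=2) + eps
        # ratio[n, k, l] = d(n, k) / d(n, l); the sum over l gives Eq. 1.
        ratio = dist[:, :, None] / dist[:, None, :]
        u = 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=2)
        # Step 2 (Eq. 2): centroids as membership-weighted means of the pixels.
        um = u ** m
        v_new = (um.T @ y) / um.sum(axis=0)[:, None]
        # Step 3: assumed error measure, the largest centroid displacement.
        error = np.max(np.linalg.norm(v_new - v, axis=1))
        v = v_new
        if error < tol:
            break
    return u, v
```

For a colour image stored as an (H, W, 3) array `img`, the pixel matrix would be `img.reshape(-1, 3)`, and the membership map of cluster k can be recovered with `u[:, k].reshape(H, W)`.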

Fuzzy K-Nearest Neighbour (FKNN)

The Fuzzy K-Nearest Neighbour (Givens Jr. & Gray & Keller, 1992) is, as its name indicates, a fuzzy variant of a hard segmentation algorithm. It needs to know the number of classes into which the set to be classified will be divided. In the hard version, the element to be classified is assigned to the class of the nearest sample among the K most similar ones. These K most similar samples are known as "neighbours"; if, for instance, the neighbours are ordered from most to least similar, the destination class of the studied element is the class of the neighbour that is first on the list. In the fuzzy version, we use the expression in Equation 3 to calculate the membership factors of the pixel to the considered clusters:

$$u_i(x) = \frac{\sum_{j=1}^{K} u_{ij} \left( \dfrac{1}{\lVert x - x_j \rVert^{\frac{2}{m-1}}} \right)}{\sum_{j=1}^{K} \left( \dfrac{1}{\lVert x - x_j \rVert^{\frac{2}{m-1}}} \right)} \qquad (3)$$

where $u_{ij}$ represents the membership factor of the j-th sample to the i-th class; $x_j$ represents one of the K samples that are most similar to the treated pixel; $x$ represents the pixel itself; $m$ is a weight factor of the distance between the pixel and the samples; and $u_i(x)$ represents the level of membership of the pixel $x$ to class i. During the analysis of this algorithm, the parameters that were varied were the samples provided as initial centroids and the number of neighbours considered.

Modified Fuzzy C-Means (MFCM)

This algorithm is based on the work of Young Won Lim and Sang Uk Lee (Lee & Lim, 1990), who describe an algorithm for the segmentation of color images through the study of the histograms of each color band. This algorithm also relies on the fuzzy c-means classification algorithm. The MFCM consists of two parts:

1. A hard part that studies the histograms of an image in order to obtain the number of classes, and carries out a first global classification of the image; and

2. A fuzzy part that classifies the pixels whose class is more difficult to determine. These pixels form the so-called "fuzzy zone".

Once the initial clusters and their centroids have been obtained, the algorithm uses the FCM membership function (Eq. 1) to classify the pixels. The fuzzy points are the pixels that lie between the initial clusters and the pixels of clusters that are too small to be considered. Since we do not have labelled samples of each class, we use the centres of gravity of the clusters to calculate the membership factors of a pixel. During the analysis of this algorithm, we varied the value of the sigma used to smooth the histogram, the minimum area that an initial cluster needs in order to survive, and the safety areas around the clusters.

Kernelized Fuzzy C-Means (KFCM)

This algorithm was proposed by Wu Zhong-Dong et al. (Zhong & Wei & Jian, 2003) and is based on FCM, integrated with a kernel function that maps the data to a space of higher dimensionality, which makes it easier to separate the clusters. The most often used kernel functions are the polynomial functions (Eq. 4) and the radial basis functions (Eq. 5).

$$K(X, Y) = \Phi(X) \cdot \Phi(Y) = (X \cdot Y + b)^d \qquad (4)$$

$$K(X, Y) = \Phi(X) \cdot \Phi(Y) = \exp\left( -\frac{(X - Y)^2}{2\sigma^2} \right) \qquad (5)$$

The algorithm consists of the following steps:

1. Calculation of the membership function:

$$u_{jk} = \frac{\left( 1 / d^2(X_j, V_k) \right)^{1/(q-1)}}{\sum_{l=1}^{C} \left( 1 / d^2(X_j, V_l) \right)^{1/(q-1)}} \qquad (6)$$

where

$$d^2(X_j, V_k) = K(X_j, X_j) - 2K(X_j, V_k) + K(V_k, V_k)$$

2. Calculation of the new kernel matrix $K(X_j, \hat{V}_k)$ and $K(\hat{V}_k, \hat{V}_k)$:

$$K(X_j, \hat{V}_k) = \Phi(X_j) \cdot \Phi(\hat{V}_k) = \frac{\sum_{i=1}^{N} (u_{ik})^q \, K(X_i, X_j)}{\sum_{i=1}^{N} (u_{ik})^q} \qquad (7)$$

where

$$\Phi(\hat{V}_k) = \frac{\sum_{j=1}^{N} (u_{jk})^q \, \Phi(X_j)}{\sum_{j=1}^{N} (u_{jk})^q}$$

3. Update the memberships $u_{jk}$ to $\hat{u}_{jk}$ by means of Equation 6.

4. If the error stays below a determined threshold, stop. Otherwise, return to step 1.

The only parameter varied in the analysis of this algorithm was the set of initial samples.
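The KFCM steps can be sketched as follows with the radial basis kernel of Equation 5. This is a hedged illustration rather than the authors' code: the function names, the choice of initial centroids as indices into the data (`init_idx`), the clipping of very small distances and the use of the largest membership change as stopping error are assumptions. The term K(V̂_k, V̂_k) is obtained here by expanding the definition of Φ(V̂_k) given after Equation 7.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Radial basis kernel (Eq. 5) between the rows of a (N, D) and b (M, D)."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kfcm(x, init_idx, q=2.0, sigma=1.0, tol=1e-5, max_iter=100):
    """Kernelized FCM; init_idx lists the samples used as initial centroids.

    Returns the (N, C) membership matrix.
    """
    x = np.asarray(x, dtype=float)
    K = rbf_kernel(x, x, sigma)            # K[i, j] = K(X_i, X_j)
    diag = np.diag(K)                      # K(X_j, X_j)
    k_xv = K[:, init_idx]                  # K(X_j, V_k), shape (N, C)
    k_vv = diag[init_idx]                  # K(V_k, V_k), shape (C,)
    u_old = None
    for _ in range(max_iter):
        # Steps 1 and 3 (Eq. 6): kernel-induced distance and memberships.
        d2 = diag[:, None] - 2.0 * k_xv + k_vv[None, :]
        inv = (1.0 / np.maximum(d2, 1e-12)) ** (1.0 / (q - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Step 4: assumed error measure, the largest change in membership.
        if u_old is not None and np.max(np.abs(u - u_old)) < tol:
            break
        u_old = u
        # Step 2 (Eq. 7): kernel values against the updated centroids.
        w = u ** q
        w = w / w.sum(axis=0, keepdims=True)        # normalised weights, (N, C)
        k_xv = K @ w                                # sum_i w_ik K(X_i, X_j)
        k_vv = np.einsum('ik,ij,jk->k', w, K, w)    # Phi(V_k) . Phi(V_k)
    return u
```

Note that the full N x N kernel matrix is stored, so for a whole image this sketch would need to be applied to a subsample of pixels or computed in blocks.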


Images Used in the Study

To select the images used in the study, we applied traditional image processing techniques (Gonzalez & Woods, 1996) and used the histogram as the basic tool (see Figure 1). We observed that the pictures presented a high level of variation, because it was not possible to standardize the different elements that have a determining effect on them: position of the patient, luminosity, etc. We selected the pictures on the basis of a characteristic trait (bad lighting, presence of strange objects, etc.) or of their "normality" (correct lighting, good contrast, etc.). The images were digitized to a size of 500x500 pixels and 24 color bits per pixel, using an average scanner.

The histograms of Figure 1 show some of the characteristics that were present in most photographs. The bands with the largest number of pixels are red and green, because of the color of the skin and the fact that green is normally used in hospital textiles. The histogram is continuous and presents values at most levels, which leads us to suppose that the value of most points is determined by the combination of the three bands rather than by a single band, as was to be expected. This complicates the analysis of the image with algorithms.

Figure 1. Photograph that was used in the study, and histogram for each color band

Results

The test images were divided into 3 clusters: background, healthy tissue, and burned tissue. These areas are clearly distinguished by the specialist, which allows us to build better masks to evaluate the success rate in pixel detection applied to burn wounds. The success rate of the fuzzy clustering algorithms was first measured with Zhang's RUMA (Relative Ultimate Measurement Accuracy) (Zhang, 1996). The purpose of RUMA is to measure the quality of the segmentation in terms of the similarity of the measures carried out on the segmented image and on the real image (Eq. 8).

$$RUMA = \frac{\lvert R_f - S_f \rvert}{R_f} \times 100\% \qquad (8)$$

where $R_f$ is the value of a feature measured on the real (reference) image and $S_f$ is the value of the same feature measured on the segmented image.

In our study, we measured the success rate by comparing the number of pixels of the burned area in the result image that coincided with pixels of the burned area in the mask.
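As an illustration, Equation 8 and the pixel-coincidence count described above can be sketched as follows. The choice of the burned area in pixels as the measured feature, the function names and the denominator of the coincidence rate (the number of burned pixels in the mask) are assumptions made for this sketch; the study does not spell them out.

```python
import numpy as np

def ruma(reference_mask, segmented_mask):
    """RUMA (Eq. 8), assuming the measured feature is the area in pixels."""
    r_f = np.count_nonzero(reference_mask)   # feature on the reference mask
    s_f = np.count_nonzero(segmented_mask)   # same feature on the result image
    return abs(r_f - s_f) / r_f * 100.0

def burned_pixel_success(reference_mask, segmented_mask):
    """Fraction of burned pixels of the mask that are also burned in the result."""
    ref = np.asarray(reference_mask, dtype=bool)
    seg = np.asarray(segmented_mask, dtype=bool)
    return np.count_nonzero(ref & seg) / np.count_nonzero(ref)
```

A RUMA of 0% means that both areas have the same size, but not necessarily the same location, which is why a pixel-by-pixel comparison is a useful complement.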

Figure 2. Best results for the RUMA and global measurements for the: FKNN algorithm (a) and MFCM algorithm (b)



We also opted for a second success rate measurement because, although RUMA provides a value for the area of interest, it may not detect certain classification errors that can affect the resulting image. We use a measure that was developed by our research team and that measures the clustering algorithm's performance in classifying all the pixels of the image (Eq. 9). During the development of the measure, we supposed that the global error would be smaller if the error of each cluster classification were smaller, so we measured the error in the pixel classification of each cluster and weighed it against the number of pixels of that cluster.

$$error = \sum_{j=1}^{n} \sum_{\substack{i=1 \\ i \neq j}}^{n} \frac{F_{ij}}{MASC_j} \qquad (9)$$

$F_{ij}$ is the number of pixels that belong to cluster j and were assigned to cluster i, $MASC_j$ is the total number of pixels that belong to class j, and n is the number of clusters into which the image was divided. The value of this measurement lies between 0 and n; in order to simplify its interpretation, it was normalized between 0 and 1. The graphics are simplified by inverting the discrepancy values: the higher the value, the better the result.
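A minimal sketch of Equation 9 is shown below, assuming that the mask and the result are integer label images with classes numbered 0 to n-1 and that every class appears at least once in the mask. The function names are hypothetical, and the normalization and inversion are implemented as 1 - error/n, which is one plausible reading of the description above rather than the confirmed formula.

```python
import numpy as np

def global_error(mask_labels, result_labels, n):
    """Eq. 9: per-class misclassification rates, summed over the n classes."""
    mask = np.asarray(mask_labels).ravel()
    result = np.asarray(result_labels).ravel()
    error = 0.0
    for j in range(n):
        in_class_j = mask == j
        masc_j = np.count_nonzero(in_class_j)      # MASC_j, assumed non-zero
        for i in range(n):
            if i != j:
                # F_ij: pixels of class j that were assigned to cluster i
                f_ij = np.count_nonzero(in_class_j & (result == i))
                error += f_ij / masc_j
    return error

def global_success(mask_labels, result_labels, n):
    """Assumed normalization and inversion: 1 is a perfect classification."""
    return 1.0 - global_error(mask_labels, result_labels, n) / n
```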

Figure 2(a) shows the best results for the FKNN algorithm, varying the number of samples and neighbours from 1 sample per cluster to 8 samples, for both measurements. Figure 2(b) shows the results for the MFCM algorithm, varying the threshold that was required for each area in the histogram and the sigma, for both measurements. The FCM and KFCM algorithms are not detailed because the parameters that were varied were the value of the provided samples and the stop threshold, with a rather stable result for both measurements. Figure 3 shows one of the results obtained with the FCM algorithm for the image labelled Q1. Figure 4(a) shows the RUMA results of all the algorithms for the various images of the test set; Figure 4(b) shows the results using the global measurement.

Figure 3. Image labeled Q1 (left) and one of the results obtained for the FCM algorithm (right)

Figure 4. Best results for the burned area using: RUMA measurement (a) and global success rate measurement (b)

The tests reveal great variation in the values provided for the different algorithms by each measurement; this is due to the lack of homogeneous conditions in the acquisition of the images and the ensuing differences in photographic quality. We can also observe that the results obtained with KFCM are considerably better than the results with FCM, because the former uses a better function to calculate the pixel membership. Nevertheless, for most pictures the good results of KFCM are surpassed by the FKNN and MFCM algorithms. In the case of FKNN, this is due to its capacity to use several samples for each cluster, which allows a more exact calculation of the memberships and a lower error probability. MFCM, on the other hand, carries out a previous analysis of the histogram, which enables it in most cases to find good centroids and make good classifications.

Even though the FKNN algorithm obtains better results, in most cases it requires a high number of samples (more than 4), which may inconvenience the medical expert and complicate its deployment in real clinical environments. This problem does not apply to the MFCM algorithm, which calculates the samples itself; however, its success values vary greatly, and for many images we had to fine-tune the parameters in order to obtain good results.
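The way FKNN combines several samples per cluster can be made concrete with a short sketch of the membership of Equation 3. The function name, the array layout (one row per labelled sample) and the `eps` guard against a zero distance are assumptions for illustration, not part of the original study.

```python
import numpy as np

def fknn_memberships(x, samples, sample_memberships, k=3, m=2.0):
    """Fuzzy K-NN membership of one pixel x to every class (Eq. 3).

    samples: (S, D) labelled sample vectors.
    sample_memberships: (S, C) array whose row j holds the memberships
    u_ij of sample j to each class i (one-hot rows for crisp samples).
    """
    x = np.asarray(x, dtype=float)
    samples = np.asarray(samples, dtype=float)
    u = np.asarray(sample_memberships, dtype=float)
    dist = np.linalg.norm(samples - x, axis=1)
    nearest = np.argsort(dist)[:k]               # the K most similar samples
    eps = 1e-12                                  # assumption: avoids 1/0
    weights = 1.0 / (dist[nearest] + eps) ** (2.0 / (m - 1.0))
    # Weighted average of the neighbours' memberships, as in Eq. 3.
    return (weights[:, None] * u[nearest]).sum(axis=0) / weights.sum()
```

With a single sample per cluster and K = 1 the pixel simply inherits the memberships of its nearest sample; with more samples the neighbours are weighted by inverse distance, which smooths the decision near class boundaries.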

FUTURE TRENDS

Fuzzy logic is a field that evolves continuously and is increasingly applied to industrial products. Medical image analysis is among the most active fields in computer vision and represents an important challenge to researchers in search of new technological developments. Fuzzy clustering algorithms constitute one of the most useful and interesting branches of fuzzy logic. Their use is expected to increase, and new algorithms will appear that provide ever better results. These algorithms will increasingly be applied to the field of medical images, where they allow us to handle new multidimensional modalities and improvements.

CONCLUSION

This article presents the results obtained by various fuzzy clustering algorithms in analyzing a set of burn wound pictures. The studied techniques obtain a high level of detection in the burned area and as such show their capacity to analyze this type of medical image. Testing, however, reveals a high degree of variation in the values provided by each algorithm, due to the absence of homogeneous conditions during the image acquisition and the ensuing differences in the quality of the pictures. This study shows how the KFCM algorithm provides the best results with the smallest number of parameters. However, if we could control the context in which the photographs are taken, the best algorithm would be MFCM, which provides better results and operates automatically. We also review the state of the art in the field of fuzzy logic and clustering algorithms, in order to show the characteristics of these techniques and their possibilities.

REFERENCES

Zadeh, L. (1965). Fuzzy sets. Information and Control. (8) 338-353.

Shaw, I. (1998). Fuzzy Control of Industrial Systems: Theory and Applications. Kluwer Academic Publishers.

Yasunobu, S. & Miyamoto, S. (1985). Automatic train operation by fuzzy predictive control. Industrial Applications of Fuzzy Control. Ed: M. Sugeno. North Holland.

Bezdek, J., Keller, J., Krishnapuram, R. & Pal, N. (1999). Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. Kluwer, Norwell, MA.

Sutton, M., Bezdek, J. & Cahoon, T. (2000). Image Segmentation by Fuzzy Clustering: Methods and Issues. Handbook of Medical Imaging: Processing and Analysis. Ed: Isaac N. Bankman. 87-126.

Bezdek, J. (1973). Fuzzy Mathematics in Pattern Classification. Ph.D. Dissertation. Appl. Math., Cornell University, Ithaca, NY.

Zhong, W.D., Wei, X.X. & Jian, Y.P. (2003). Fuzzy C-Means clustering algorithm based on kernel method. Proceedings of the Fifth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA '03). IEEE Press.

Wu, C., Agam, G., Roy, A.S. & Armato, S.G. (2004). Regulated morphology approach to fuzzy shape analysis with application to blood vessel extraction in thoracic CT scans. Proceedings of SPIE. (5370) 1262-1270.

Vermandel, M., Betrouni, N., Taschner, C., Vasseu, C. & Rosseau, J. (2007). From MIP image to MRA segmentation using fuzzy set theory. Computerized Medical Imaging & Graphics. (31) 128-140.

Haußecker, H. & Tizhoosh, H.R. (1999). Fuzzy Image Processing. Handbook of Computer Vision and Applications. Volume 2. Ed: Bernd Jähne, Horst Haußecker and Peter Geißler. 683-727.

Givens Jr., J.A., Gray, M.R. & Keller, J.M. (1992). A fuzzy k-nearest neighbour algorithm. Fuzzy models for pattern recognition: methods that search for structures in data. Ed: J.C. Bezdek, S.K. Pal. IEEE Press. 258-263.

Lee, S.U. & Lim, Y.W. (1990). On the color image segmentation algorithm based on the thresholding and the fuzzy c-means techniques. Pattern Recognition. (23) 9, 935-952.

Pham, D.L. (2001). Spatial models for fuzzy clustering. Computer Vision and Image Understanding. (84) 285-297.

Gonzalez, R. & Woods, R. (1996). Digital image processing. Addison-Wesley.

Zhang, Y.J. (1996). A survey on evaluation methods for image segmentation. Pattern Recognition. (29) 1335-1346.

KEY TERMS

Fuzzification: The process of decomposing a system input and/or output into one or more fuzzy sets. Many types of curves can be used, but triangular or trapezoidal shaped membership functions are the most common.

Fuzzy Algorithm: An ordered sequence of instructions which may contain fuzzy assignments, conditional statements, repetitive statements, and traditional operations.

Fuzzy Inference Systems: A sequence of fuzzy conditional statements which may contain fuzzy assignment and conditional statements. The execution of such instructions is governed by the compositional rule of inference and the rule of preponderant alternative.

Fuzzy Operator: Operations that enable us to combine fuzzy sets. A fuzzy operator combines two fuzzy sets to give a new fuzzy set. The most frequently used fuzzy operators are the following: equality, containment, complement, intersection and union.

Medical Image: A medical specialty that uses x-rays, gamma rays, high-frequency sound waves, and magnetic fields to produce images of organs and other internal structures of the body. In diagnostic radiology the purpose is to detect and diagnose disease, whereas in interventional radiology, imaging procedures are combined with other techniques to treat certain diseases and abnormalities.

Membership Function: Gives the grade, or degree, of membership within the fuzzy set, of any element of the universe of discourse. The membership function maps the elements of the universe onto numerical values in the interval [0, 1].

Segmentation: A process that partitions a digital image into disjoint (non-overlapping) regions, using a set of features or characteristics. The output of the segmentation step is usually a set of classified elements, such as tissue regions or tissue edges.

Fuzzy Logic Estimator for Variant SNR Environments

Rosa Maria Alsina Pagès
Universitat Ramon Llull, Spain

Clàudia Mateo Segura
Universitat Ramon Llull, Spain

Joan-Claudi Socoró Carrié
Universitat Ramon Llull, Spain

INTRODUCTION

The acquisition system is one of the most sensitive stages in a Direct Sequence Spread Spectrum (DS-SS) receiver (Peterson, Ziemer & Borth, 1995), because correct demodulation of the received information depends on it. There are several schemes to deal with this problem, such as serial search and parallel algorithms (Proakis, 1995). Serial search algorithms converge slowly but have a very low computational load; parallel systems, on the other hand, converge very quickly but have a very high computational load. In our system, the acquisition scheme used is the multiresolutive structure presented in (Moran, Socoró, Jové, Pijoan & Tarrés, 2001), which combines quick convergence and low computational load. The decisional system that evaluates the acquisition stage is a key process in the overall system performance, and is a drawback of the structure. This becomes more important when dealing with time-varying channels, where the signal-to-noise ratio (SNR) is not a constant parameter. Several factors contribute to the performance of the acquisition system (Glisic & Vucetic, 1997): channel distortion and variations, noise and interference, uncertainty about the code phase, and data randomness. The existence of all these variables led us to consider using fuzzy logic to solve this complex acquisition estimation (Zadeh, 1973). A fuzzy logic acquisition estimator had already been tested and used in our research group to control a serial search algorithm (Alsina, Morán & Socoró, 2005) with encouraging results, and afterwards in the multiresolutive scheme (Alsina, Mateo & Socoró, 2007); other applications in this field can be found in the literature, such as (Bas, Pérez & Lagunas, 2001) or (Jang, Ha, Seo, Lee & Lee, 1998). Several previous works have focused on the development of acquisition systems for non-frequency-selective channels with fast SNR variations (Moran, Socoró, Jové, Pijoan & Tarrés, 2001) (Mateo & Alsina, 2004).

BACKGROUND

In 1964, Dr. Lotfi Zadeh came up with the term fuzzy logic (Zadeh, 1965). The reason was that traditional logic could not answer some questions with a simple yes or no; fuzzy logic therefore handles the concept of partial truth. Fuzzy logic is one of the possible ways to imitate the working of a human brain, and so to try to turn artificial intelligence into real intelligence. Zadeh devised the technique as a method to solve problems in the soft sciences, in particular those that involve human interaction.

Fuzzy logic has proved to be a good option for control in very complex processes, when it is not possible to produce a mathematical model. Fuzzy logic is also recommendable for highly non-linear processes and, above all, when it is desirable to incorporate expert knowledge. It is not a good idea to apply it, however, if traditional control or estimators already give satisfactory results, or for problems that can be modelled mathematically.

The most recent works in control and estimation using fuzzy logic applied to direct sequence spread spectrum communication systems can be classified into three types. The first group uses fuzzy logic to improve the detection stage of the DS-CDMA (endnote 1) receiver, in works presented by Bas et al. and Jang et al. (Bas, Pérez, & Lagunas, 2001) (Jang, Ha, Seo, Lee, & Lee, 1998). The second group uses fuzzy logic to improve interference rejection, with works presented by Bas et al. and by Chia-Chang et al. (Bas, & Neira, 2003) (Chia-Chang, Hsuan-Yu, Yu-Fan, & Jyh-Horng, 2005). Finally, fuzzy logic techniques are also improving estimation and control in the acquisition stage of the DS-CDMA receiver, in works by Alsina et al. (Alsina, Moran, & Socoró, 2005) (Alsina, Mateo, & Socoró, 2007).

ACQUISITION ESTIMATION IN DS-CDMA ENVIRONMENTS

One of the most important problems to be solved in direct sequence spread spectrum systems is to achieve a robust and precise acquisition of the pseudonoise sequence, that is, to obtain an accurate estimation of its exact phase or timing position (Proakis, 1995). In time-varying environments this becomes even more important, because poor acquisition and tracking performance can heavily degrade the reliability of the demodulation. In this work a new multiresolutive acquisition system with a fuzzy logic estimator is proposed (Alsina, Mateo, & Socoró, 2007). The fuzzy logic estimation improves the accuracy of the acquisition stage compared to the results of the stability controller, by estimating both the probability of being acquired and the signal-to-noise ratio in the channel; it also improves the results obtained by the first fuzzy logic estimator for the multiresolutive structure in (Alsina, Mateo & Socoró, 2007).

Figure 1. Multiresolutive adaptive structure for acquisition and tracking



Multiresolutive Acquisition Structure

The aim of the multiresolutive scheme presented in (Moran, Socoró, Jové, Pijoan & Tarrés, 2001) is to find the correct acquisition point in a reasonable convergence time. It gives a good trade-off between the speed of convergence of the parallel systems and the low computational load of the serial search algorithms. An M-order decimation is first applied to the input signal x[n] (endnote 2), since the acquisition stage can accept uncertainties below the chip period; this decreases the computational load of the acquisition stage. Once the signal x[n] is decimated, the resulting signal r[n] is fed into the filters of a multiresolutive structure (see the structure in figure 1). Note that there are H different branches that work with decimated versions of the input signal, separated into H disjoint subspaces. Each branch has an adaptive FIR LMS filter of length $N = \lceil PG/H \rceil$ (endnote 3), trained with a decimated version of the PN sequence (PN-DEC).

Under ideal conditions, in a non-frequency-selective channel with white Gaussian noise, just one of the filters should locally converge to an impulse of the form λb_i[k]δ[n – τ], where b_i[k] is the information bit, τ represents the delay between the PN sequence of the input signal and the reference one, and λ is the fading coefficient for channel distortion. The algorithm is reset at every new data symbol, and a modulus smoothing average algorithm is applied to each of the LMS solutions (w_i[n]) to remove the dependency on the data randomness component b_i[k], obtaining nonnegative and averaged impulse responses (Wav_i[n]). The decisional system uses a peak detection algorithm to find which of these filters has detected the signal (Wcon[n]), and the position of the maximum (τ) in this filter gives the coarse estimation of the acquisition phase. Once the decisional system has recovered the acquisition point, tracking is solved with another adaptive LMS filter (w_tr[n]), which expands the search window around the acquisition point, using the full time resolution input signal x[n]. Thus, the estimation of the acquisition point (now called ξ) is refined by the tracking, and the signal can be correctly demodulated.

The Fuzzy Logic Acquisition Estimation

The fuzzy logic acquisition estimator has been designed using data from the impulse responses of all the LMS filters of the structure. The variations of their values give information about the probability of being correctly acquired, and also about SNR variations in the channel. In the conducted experiments, the signal space has been divided into four subspaces (H=4), so four LMS filters compose the acquisition stage. The length of the PN sequences is PG=127, so each filter has $N = \lceil PG/H \rceil = 32$ taps to converge. These input and output variables were already defined in (Alsina, Mateo & Socoró, 2007), but the rules to be evaluated have been designed in a more precise way.
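The impulse-like LMS responses that the estimator relies on can be illustrated with a generic LMS identification sketch. It is not the multiresolutive structure itself: a single branch is simulated without decimation or data bits, and the function name, the step size `mu` and the roles of the two signals (reference sequence as filter input, noisy delayed copy as desired signal) are assumptions chosen only to show a filter converging to an impulse λδ[n – τ].

```python
import numpy as np

def lms_identify(x, d, n_taps=32, mu=0.01):
    """Standard LMS: adapt w so that the filtered x approximates d."""
    w = np.zeros(n_taps)
    for n in range(n_taps - 1, len(x)):
        window = x[n - n_taps + 1:n + 1][::-1]   # x[n], x[n-1], ..., x[n-N+1]
        e = d[n] - w @ window                    # a priori error
        w += mu * e * window                     # LMS weight update
    return w

# Hypothetical example: a +/-1 sequence delayed by tau chips, scaled by lam.
rng = np.random.default_rng(0)
pn = rng.choice([-1.0, 1.0], size=4000)
tau, lam = 5, 0.8
received = lam * np.roll(pn, tau) + 0.1 * rng.standard_normal(pn.size)
w = lms_identify(pn, received)
print(int(np.argmax(np.abs(w))))   # position of the peak of the response
```

The position and height of the peak relative to the remaining taps are exactly the kind of information that the ratios defined in the next subsection summarize.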

Input Variables

Four different parameters have been defined as inputs of the fuzzy estimator. Three of them refer to the values of the four modulus averaged acquisition LMS filters (Wav_i[n]), especially the LMS filter adapted to the decimated sequence PN-DEC (called Wcon[n]), and one refers to the tracking filter (w_tr[n]) that refines the search:

• Ratio1: it is computed as the quotient of the peak value of the LMS filter Wcon[n] divided by the mean value of this filter excluding the maximum, as follows:

$$\mathrm{Ratio1} = \frac{W_{con}[\tau]}{\dfrac{1}{N} \sum_{n=1,\, n \neq \tau}^{N} W_{con}[n]}$$

• Ratio2: it is evaluated as the quotient of the peak value of the LMS filter Wcon[τ] divided by the average of the values at the same position in the other three filters Wav_i[n]:

$$\mathrm{Ratio2} = \frac{W_{con}[\tau]}{\dfrac{1}{H-1} \sum_{i=1,\, W_{av_i} \neq W_{con}}^{H} W_{av_i}[\tau]}$$

• Ratio3: it is obtained as the quotient of the peak value of the LMS filter Wcon[τ] divided by the mean value of the three other filters Wav_i[n]:

$$\mathrm{Ratio3} = \frac{W_{con}[\tau]}{\dfrac{1}{H-1} \sum_{i=1,\, W_{av_i} \neq W_{con}}^{H} \dfrac{1}{N} \sum_{n=1}^{N} W_{av_i}[n]}$$

• Ratio1_track: it is computed as the quotient of the peak value of the LMS tracking filter w_tr[ξ], where ξ is the most precise estimation of the correct acquisition point, divided by the mean value of the same filter excluding the maximum:

$$\mathrm{Ratio1\_track} = \frac{w_{tr}[\xi]}{\dfrac{1}{N} \sum_{n=1,\, n \neq \xi}^{N} w_{tr}[n]}$$

These parameters have been chosen because of the information they contain about the probability of being acquired, and also about the SNR level in the channel and its variations. With an appropriate definition of IF-THEN rules, the variations of their values give good estimations of the acquisition quality and a good measure of the SNR.
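The four inputs can be computed directly from the averaged filter responses, as in the sketch below. The function name and array layout are hypothetical; it is assumed that `w_av` holds the H modulus averaged responses as rows, that `w_tr` is already nonnegative, that the peak detection simply takes the overall maximum, and that the length of `w_tr` plays the role of N in the last ratio.

```python
import numpy as np

def acquisition_ratios(w_av, w_tr):
    """Ratio1, Ratio2, Ratio3 and Ratio1_track from the filter responses.

    w_av: (H, N) modulus averaged acquisition responses Wav_i[n].
    w_tr: 1-D nonnegative response of the tracking filter.
    """
    w_av = np.asarray(w_av, dtype=float)
    w_tr = np.asarray(w_tr, dtype=float)
    h, n = w_av.shape
    # Peak detection: the filter (Wcon) and position (tau) of the maximum.
    con, tau = np.unravel_index(np.argmax(w_av), w_av.shape)
    w_con = w_av[con]
    peak = w_con[tau]
    others = np.delete(w_av, con, axis=0)        # the H - 1 remaining filters
    ratio1 = peak / ((w_con.sum() - peak) / n)
    ratio2 = peak / (others[:, tau].sum() / (h - 1))
    ratio3 = peak / (others.mean(axis=1).sum() / (h - 1))
    xi = int(np.argmax(w_tr))                    # refined acquisition point
    ratio1_track = w_tr[xi] / ((w_tr.sum() - w_tr[xi]) / len(w_tr))
    return {"Ratio1": ratio1, "Ratio2": ratio2,
            "Ratio3": ratio3, "Ratio1_track": ratio1_track}
```

All four ratios grow when one clean peak stands out from a flat background, which is the situation expected at high SNR with a correct acquisition.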

Output Variables

The results are obtained using a defuzzification method based on the centroid (Leekwijck & Kerre, 1999). Two output variables are computed. The first is Acquisition, which gives a value in the range [0,1]: zero when the system is Not Acquired and one when it is Acquired. Three more fuzzy sets have been defined between the extreme values: Probably Not Acquired, Not Determined and Probably Acquired. Acquisition gives a reliability value for the correct demodulation of the detector. The multiresolutive scheme only gives an estimation of the acquisition point, whereas the Acquisition value evaluates the probability of being acquired and, therefore, the consistency of the bit demodulation done by the receiver. The second variable is SNR Estimation, which gives the estimated SNR value in the channel (in the range [-30,0] dB in our experiment). SNR Estimation gives us information about channel conditions; this helps not only in acquisition and tracking, but also in detection, as in (Verdú, 1998) or (Alsina, Morán & Socoró, 2005).

Figure 2. Variable Acquisition for all input variable combinations
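Centroid defuzzification of the Acquisition output can be sketched as follows. Only the five set names and the [0,1] range come from the text; the evenly spaced triangular membership functions, the clipping of each set at its rule activation and the aggregation by maximum (a Mamdani-style inference) are assumptions made for this sketch, as is the function name.

```python
import numpy as np

LABELS = ["Not Acquired", "Probably Not Acquired", "Not Determined",
          "Probably Acquired", "Acquired"]

def centroid_defuzzify(activations, num_points=1001):
    """Centroid of the aggregated output sets of the variable Acquisition.

    activations maps a set label to its activation degree in [0, 1];
    at least one activation is assumed to be greater than zero.
    """
    x = np.linspace(0.0, 1.0, num_points)
    centers = np.linspace(0.0, 1.0, len(LABELS))   # assumed evenly spaced peaks
    width = centers[1] - centers[0]
    aggregated = np.zeros_like(x)
    for label, c in zip(LABELS, centers):
        # Assumed triangular membership function centred on c.
        mu = np.interp(x, [c - width, c, c + width], [0.0, 1.0, 0.0])
        clipped = np.minimum(mu, activations.get(label, 0.0))
        aggregated = np.maximum(aggregated, clipped)
    return float(np.sum(x * aggregated) / np.sum(aggregated))
```

For example, `centroid_defuzzify({"Not Determined": 1.0})` lies at the middle of the range, while activating only the sets near Not Acquired pulls the output towards zero.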

If-Then Rules

A total of sixty rules have been used to define the two outputs as a function of the input values, evolving the set of rules used in (Alsina, Mateo & Socoró, 2007). Figure 2 shows the surface of Acquisition for all input variables, and figure 3 shows the surface of SNR Estimation for all inputs. The rules have been defined so that each input parameter contributes to the two outputs of the fuzzy estimator only in the range of values where it performs best, that is, where its estimations are most reliable for both outputs.

Figure 3. Variable SNR Estimation for all input variable combinations

The estimation that has improved most for the output Acquisition is the one corresponding to Not Determined, that is, the case in which the input parameters do not by themselves point coherently to Acquired or Not Acquired. To obtain a precise output value, the fuzzy estimator evaluates the degree of implication of each input parameter in the membership functions, and projects this implication onto the fuzzy sets of the output variable Acquisition, in order to obtain its value through defuzzification. Ratio1 and Ratio1_track are the best input parameters to estimate Acquisition when channel conditions are good; these two parameters are supported by Ratio2 and Ratio3 when the SNR worsens. The precision of the critical estimations has been improved in the design of the new rules for the fuzzy estimator.

On the other hand, the most robust evaluations of SNR Estimation are made by Ratio2 and Ratio3; they are improved by Ratio1_track when the SNR is high, and by Ratio1 when the SNR is very low. As can be observed in figure 3, these variables correlate highly with the SNR Estimation value.

Results
In this section the results obtained with the new acquisition and SNR fuzzy logic estimator are summarized. Several simulations using an Additive White Gaussian Noise (AWGN) channel, some of them with very fast SNR changes, have been carried out to show the performance of the fuzzy estimator in terms of reliability and stability.

Fuzzy Estimator Acquisition Reliability vs. Stability Control
A previous acquisition estimation, obtained using a stability control (Morán, Socoró, Jové, Pijoan & Tarrés, 2001) that took into account the preservation of the acquisition point, is used for evaluation and comparison purposes. It considered the system to be acquired only after continuous repetitions of the acquisition point given by the multiresolutive scheme. This stability control gave a binary response about the performance

Figure 4. % of correct estimation of acquisition using the new fuzzy estimator against the stability Control



of the system. Despite the good performance of the stability control, which can be observed in figure 4, the new fuzzy approach improves the results over a wider SNR range. The quality of the fuzzy acquisition estimation is much better than that of the stability control for very low SNR, and its global performance over the whole range of SNR in our tests is improved. The stability control is not a good estimator for critical SNR (considered to be around -15 dB), and its reliability decreases as the SNR decreases. Although both show similar performance around the critical SNR, the fuzzy logic estimation of Acquisition performs better for worse SNR, staying over 90% of correct estimation throughout all the simulations.

Fuzzy SNR Estimation in Time Varying Channels
In figure 5.a the acquisition system has been simulated in an AWGN channel, forcing severe and very fast SNR changes in order to evaluate the convergence speed of the SNR estimator. Since SNR Estimation is a highly variable value, its mean is obtained through an exponential smoothing average filter and compared to the SNR in the AWGN channel. The fuzzy block estimates the SNR in the channel quite precisely down to very low SNR (near -20 dB); for lower values the input parameters are not stable enough to make a good prediction, which is similar to what happens for the Acquisition estimation. To observe the recovery of the fuzzy estimator after fast SNR changes in the channel, a detail of SNR Estimation is shown in figure 5.b. This information reveals the channel state to the receiver, and allows further work to improve the reliability of the demodulation by means of different approaches (Verdú, 1998).
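The exponential smoothing average filter mentioned above can be sketched as a first-order recursion. The smoothing factor `alpha` and the initialisation with the first sample are assumptions made for illustration; the article does not state the values used in the experiments.

```python
def exponential_smoothing(samples, alpha=0.1):
    """Return s[n] = alpha * x[n] + (1 - alpha) * s[n-1], with s[0] = x[0].

    `alpha` in (0, 1] is a hypothetical smoothing factor: values close to
    one track fast SNR changes, values close to zero give a steadier mean.
    """
    smoothed = []
    state = None
    for x in samples:
        # The first sample initialises the filter state.
        state = x if state is None else alpha * x + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed
```

For a step in the raw estimate from 0 dB to -20 dB with alpha = 0.5, the smoothed value moves to -10 dB and then -15 dB. This shows the trade-off between the convergence speed after a sudden SNR change and the variance of the smoothed estimate.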

FUTURE TRENDS
Future work will focus on improving the SNR estimation in the fuzzy system. Another goal is to increase the stability against channel changes by using previously detected symbols, obtaining a system with feedback. The fuzzy estimator outputs will be used to design a controller for the acquisition and tracking structure. Its aim will be to improve the stability of the estimation of the correct acquisition point (ξ) through an effective and robust control of its variations under sudden channel changes, so memory will be added to the fuzzy logic estimator. This way the estimator is

Figure 5. a) SNR estimation in a varying SNR channel; b) Detail of SNR Estimation when adapting to an instantaneous SNR variation



converted into a controller, and the overall performance of the receiver is improved. Further research will also take into account multipath channel conditions and possible variations, including rake-based receiver detection, in order to reach good acquisition and tracking performance in ionospheric channels. Furthermore, the reliability of the results encourages us to use the acquisition estimation to minimize the computational load of the acquisition system under favourable channel conditions, by decreasing the number of iterations needed for the LMS adaptive filters to converge. A more efficient fuzzy logic control can be designed in order to achieve a better trade-off between computational load (due to the adaptation of the LMS filters) and the accuracy of the acquisition point estimation (ξ).

CONCLUSION
The new proposed acquisition system estimator has been presented, and some results have been compared against a stability control strategy within the multiresolutive acquisition system in a varying SNR environment. The main advantages of the multiresolutive fuzzy estimator are its reliability when evaluating the probability of acquisition, its stability, and its quick convergence when there are fast SNR changes in the channel. However, the computational load of the fuzzy estimator is higher than that of the stability control: the mean number of floating-point operations needed on a DSP to carry out the whole process is greater than for the conventional stability control. This has to be taken into account, because the multiresolutive structure should keep its computational cost to a minimum in order to work on-line with the received data. Further work will compare the computational load added to the structure with the global improvements of the multiresolutive receiver, to decide whether this cost increase is affordable for the acquisition system.

REFERENCES
Alsina, R.M., Morán, J.A., & Socoró, J.C. (2003). Multiresolution Adaptive Structure for Acquisition and Detection in DS-SS Digital Receiver in a Multiuser Environment. IEEE International Symposium on Signal Processing and its Applications.

Alsina, R.M., Morán, J.A., & Socoró, J.C. (2005). Sequential PN Acquisition Based on a Fuzzy Logic Controller. 8th International Workshop on Artificial Neural Networks, Lecture Notes in Computer Science. (3512) 1238-1245. Alsina, R.M., Mateo, C., & Socoró, J.C. (2007). Multiresolutive Adaptive PN Acquisition Scheme with a Fuzzy Logic Estimator in Non Selective Fast SNR Variation Environments. 9th International Workshop on Artificial Neural Networks, Lecture Notes in Computer Science. (4507) 367-374. Bas, J., Pérez, A., & Lagunas, M.A. (2001). Fuzzy Recursive Symbol-by-Symbol Detector for Single User CDMA Receivers. International Conference on Acoustics, Speech and Signal Processing. Bas, J., & Neira, A.P. (2003). A fuzzy logic system for interference rejection in code division multiple access. The 12th IEEE International Conference on Fuzzy Systems, (2), 996-1001. Chia-Chang, H., Hsuan-Yu, L., Yu-Fan, C., & JyhHorng, W. (2005). Adaptive interference supression using fuzzy-logic-based space-time filtering techniques in multipath DS-CDMA. The 6th IEEE International Workshop on Signal Processing Advances in Wireless Communications, p. 22-26. Glisic, S.G., & Vucetic, B. (1997). Spread Spectrum CDMA Systems for Wireless Communications. Artech House Publishers. Jang, J., Ha, K., Seo, B., Lee, S., & Lee, C.W. (1998). A Fuzzy Adaptive Multiuser Detector in CDMA Communication Systems. International Conference on Communications. Leekwijck, W.V., & Kerre, E.E. (1999). Defuzzification: Criteria and Classification. Fuzzy Sets and Systems. (108) 159-178. Mateo, C., & Alsina, R.M. (2004). Diseno de un Sistema de Control Adaptativo a las Condiciones del Canal para un Sistema de Adquisición de un Receptor DS-SS. XIX Congreso de la Unión Científica Nacional de Radio. Morán, J.A., Socoró, J.C., Jové, X., Pijoan, J.L., & Tarrés, F. (2001). Multiresolution Adaptive Structure for Acquisition in DS-SS Receiver. International Conference on Acoustics, Speech and Signal Processing.


Peterson, R.L., Ziemer, R.E., & Borth, D.E. (1995). Spread Spectrum Communications Handbook. Prentice Hall. Proakis, J.G. (1995). Digital Communications. McGraw-Hill. Verdú, S. (1998). Multiuser Detection. Cambridge University Press. Zadeh, L.A. (1965). Fuzzy Sets. Information and Control. (8), 338-353. Zadeh, L.A. (1973). Outline of a New Approach to the Analysis of Complex Systems and Decision Processes. IEEE Transactions Systems Man Cybernetics. (3), 28-44. Zadeh, L.A. (1988). Fuzzy Logic. Computer, 83-92.

KEY TERMS

Defuzzification: After computing the fuzzy rules and evaluating the fuzzy variables, this is the process the system follows to obtain a new membership function for each output variable.

Degree of Truth: It denotes the extent to which a proposition is true. It must not be confused with the concept of probability.

Fuzzy Logic: Fuzzy logic is derived from fuzzy set theory; it deals with reasoning that is approximate rather than precisely deduced from classical predicate logic.

Fuzzy Sets: Fuzzy sets are sets whose members have a degree of membership. They were introduced as an extension of classical sets, in which the membership of an element is assessed in binary terms.

Fuzzification: It is the process of defining the degree of membership of a crisp value for each fuzzy set.

IF-THEN Rules: They are the typical rules used by fuzzy expert systems. The IF part is the antecedent, also named the premise, and the THEN part is the conclusion.

Linguistic Variables: They take on linguistic values, which are words, with associated degrees of membership in each set.

Linguistic Term: It is a subjective category for a linguistic variable. Each linguistic term is associated with a fuzzy set.

Membership Function: It is the function that gives the subjective measures for the linguistic terms.

ENDNOTES
1 DS-CDMA stands for Direct Sequence Code Division Multiple Access.
2 The received signal x[n] is sampled at M samples per chip in order to give the necessary time resolution for the tracking stage.
3 where PG is the length of the pseudonoise sequences, also called PN sequences, and 'ceil(x)' (expressed as N = ⌈x⌉) is the smallest integer greater than or equal to x.


Fuzzy Rule Interpolation
Szilveszter Kovács
University of Miskolc, Hungary

INTRODUCTION
The “fuzzy dot” (or fuzzy relation) representation of fuzzy rules in fuzzy rule based systems, in the case of classical fuzzy reasoning methods (e.g. the Zadeh-Mamdani-Larsen Compositional Rule of Inference (CRI) (Zadeh, 1973) (Mamdani, 1975) (Larsen, 1980) or the Takagi-Sugeno fuzzy inference (Sugeno, 1985) (Takagi & Sugeno, 1985)), assumes the completeness of the fuzzy rule base. If some rules are missing, i.e. the rule base is “sparse”, observations may exist which hit no rule in the rule base, and therefore no conclusion can be obtained. One way of handling the “fuzzy dot” knowledge representation in the case of sparse fuzzy rule bases is the application of Fuzzy Rule Interpolation (FRI) methods, where the derivable rules are deliberately missing, since FRI methods can provide reasonable (interpolated) conclusions even if none of the existing rules fires under the current observation. Since the beginning of the 1990s numerous FRI methods have been proposed. The main goal of this article is to give a brief but comprehensive introduction to the existing FRI methods.
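The problem of an observation that hits no rule can be made concrete with a small sketch. The two-rule base, the universe [0, 10] and the triangular antecedents below are invented for illustration only; the point is simply that a crisp observation falling between the antecedent supports activates nothing, so a classical reasoning method has no conclusion to return.

```python
def triangle(a, b, c):
    """Triangular membership function with support (a, c) and core b."""
    def mu(x):
        if a < x <= b:
            return (x - a) / (b - a)
        if b < x < c:
            return (c - x) / (c - b)
        return 0.0
    return mu

# Hypothetical sparse SISO rule base: the antecedents cover only
# the two ends of the input universe X = [0, 10].
SPARSE_RULES = [
    (triangle(0.0, 1.0, 2.0), "small output"),
    (triangle(8.0, 9.0, 10.0), "large output"),
]

def fired_rules(x, rules=SPARSE_RULES):
    """Return (activation, consequent label) for every rule that fires."""
    return [(mu(x), label) for mu, label in rules if mu(x) > 0.0]
```

Here `fired_rules(1.0)` returns the first rule with full activation, whereas `fired_rules(5.0)` returns an empty list. This is the gap an FRI method fills by interpolating between the two existing rules.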

BACKGROUND
Since the classical fuzzy reasoning methods (e.g. the Zadeh-Mamdani-Larsen CRI) demand complete rule bases, classical rule base construction requires special care to fill in all the possible rules. If the rule base is “sparse” (some rules are missing), observations may exist which hit no rule, and hence no conclusion can be obtained. In many application areas of fuzzy control structures, the accidental lack of a conclusion is hard to explain, or meaningless (e.g. in the steering control of a vehicle). In this case one obvious solution could be to keep the last real conclusion instead of the missing one, but automatically applying historical data to fill unintentionally missing rules could cause unpredictable side effects. Another solution to the same problem is the application of fuzzy rule interpolation (FRI) methods, where the derivable rules are deliberately missing. The rule base of an FRI controller is not necessarily complete, since FRI methods can provide reasonable (interpolated) conclusions even if none of the existing rules fires under the current observation. It can contain the most significant fuzzy rules only, without the risk of having no conclusion for some of the observations.

On the other hand, most of the FRI methods share the burden of high computational demand, e.g. the task of searching for the two closest surrounding rules to the observation, and calculating the conclusion at least in some characteristic α-cuts. Moreover, in some methods the interpretability of the fuzzy conclusion gained is also not straightforward (Kóczy & Kovács, 1993). There have been many efforts to rectify the interpretability of the interpolated fuzzy conclusion (Tikk & Baranyi, 2000). In (Baranyi, Kóczy & Gedeon, 2004) Baranyi et al. give a comprehensive overview of the recent FRI methods. Beyond these problems, some of the FRI methods are originally defined for a one-dimensional input space, and need a special extension for the multidimensional case (e.g. (Jenei, 2001), (Jenei, Klement & Konzel, 2002)). In (Wong, Tikk, Gedeon & Kóczy, 2005) Wong et al. gave a comparative overview of the recent FRI methods capable of handling multidimensional input spaces. In (Jenei, 2001) Jenei introduced a way for the axiomatic treatment of the FRI methods. In (Perfilieva, 2004) Perfilieva studies the solvability of fuzzy relation equations as the solvability of interpolating and approximating fuzzy functions with respect to a given set of fuzzy rules (e.g. fuzzy data as ordered pairs of fuzzy sets).

The high computational demand, mainly the search for the two closest surrounding rules to an arbitrary observation in the multidimensional antecedent space, makes many of these methods hardly suitable for real-time applications. Some FRI methods, e.g. the method introduced by Jenei et al. in (Jenei, Klement & Konzel, 2002), eliminate the search for the two closest surrounding rules by taking all the rules into consideration, thereby speeding up the reasoning process. On the other hand, keeping the goal of constructing a fuzzy conclusion, and not simply speeding up the reasoning, they still require some additional (or repeated) computational steps for the elements of the level set (or at least for some relevant α levels). An application oriented aspect of FRI emerges in (Kovács, 2006), where, for the sake of reasoning speed and direct real-time applicability, the fuzziness of fuzzy partitions is replaced by the concept of Vague Environment (Klawonn, 1994). In the following, the structure of several FRI methods is briefly introduced.

FUZZY RULE INTERPOLATION METHODS
One of the first FRI techniques was published by Kóczy and Hirota (Kóczy & Hirota, 1991). It is usually referred to as the KH method. It is applicable to convex and normal fuzzy (CNF) sets in single input and single output (SISO) systems. The KH method takes into consideration only the two closest surrounding (flanking) rules to the observation. It determines the conclusion by its α-cuts in such a way that the ratio of distances between the conclusion and the consequents is identical to the ratio of distances between the observation and the antecedents for all important α-cuts. The applied formula:

$$d(A^*, A_1) : d(A^*, A_2) = d(B^*, B_1) : d(B^*, B_2),$$

can be solved for the required conclusion B* for the relevant α-cuts after decomposition, where A1 → B1 and A2 → B2 are the two flanking rules of the observation A* and d: F(X)×F(X)→R is a distance function of fuzzy sets (in the case of the KH method it is calculated as the distance of the lower and upper end points of the α-cuts) (see e.g. Fig. 1).

Figure 1. KH method for two SISO rules: A1 → B1 and A2 → B2, conclusion y of the observation x

It is shown, e.g. in (Kóczy & Kovács, 1993), (Kóczy & Kovács, 1994), that the conclusion of the KH method is not always directly interpretable as a fuzzy set (see e.g. Fig. 1). This drawback motivated many alternative solutions. The first modification was proposed by Vass, Kalmár and Kóczy (Vass, Kalmár & Kóczy, 1992) (referred to as the VKK method), where the conclusion is computed based on the distance of the centre points and the widths of the α-cuts, instead of their lower and upper end point distances. The VKK method extends the applicability of the KH method, but it still depends strongly on the membership shape of the fuzzy sets (e.g. it is unable to handle singleton antecedent sets, as the width of the antecedent’s support must not be zero).

In spite of the known restrictions, the KH method is still popular because of its simplicity. Subsequently it was generalized in several ways. Among them the stabilized KH interpolator emerged, as it was proved to hold the universal approximation property in (Tikk, Joó, Kóczy, Várlaki, Moser & Gedeon, 2002) and (Tikk, 2003). This method takes into account all the rules of the rule base in the calculation of the conclusion. The method adapts a modification of the Shepard operator based interpolation (Shepard, 1968). The rules are taken into account to an extent proportional to the inverse of the distance between their antecedents and the observation. The universal approximation property holds if the distance function is raised to the power of at least the number of antecedent dimensions.

Another modification of the KH method is the modified alpha-cut based interpolation method (referred to as MACI) (fully extended in (Tikk & Baranyi, 2000)), which completely alleviates the abnormality problem. MACI’s main idea is the following: it transforms the fuzzy sets of the input and output universes to a space where abnormality is excluded, then computes the conclusion there, which is finally transformed back to the original space. MACI uses a vector representation of fuzzy sets. The original method was introduced in (Yam & Kóczy, 1997) and was applicable to CNF sets only. This restriction was later relaxed in (Tikk, Baranyi, Gedeon & Muresan, 2001) at the expense of a higher computational demand than the original method. MACI is one of the most applied FRI methods (Wong, Tikk, Gedeon & Kóczy, 2005), since it preserves the advantageous computational and approximation properties of the KH method, while it excludes the chance of an abnormal conclusion.

Another FRI method was proposed by Kóczy et al. in (Kóczy, Hirota & Gedeon, 1997). It takes into consideration only the two closest surrounding rules to the observation and its main idea is the conservation of the “relative fuzziness” (referred to as the CRF method).
This notion means that the left (and right) fuzziness of the approximated conclusion in proportion to the flanking fuzziness of the neighbouring consequent should be the same as the left (and right) fuzziness of the observation in proportion to the flanking fuzziness of the neighbouring antecedent. The original method is restricted to CNF sets only.

An improved fuzzy interpolation technique for multidimensional input spaces (referred to as IMUL) was originally proposed in (Wong, Gedeon & Tikk, 2000), and described in more detail in (Wong, Tikk, Gedeon & Kóczy, 2005). IMUL applies a combination of the CRF and MACI methods, and mixes the advantages of both. The core of the conclusion is determined by the MACI method, while its flanks are determined by CRF (the method is restricted to trapezoidal membership functions). The main advantages of this method are its applicability to multidimensional problems and its relative simplicity.

Conceptually different approaches were proposed in (Baranyi, Kóczy & Gedeon, 2004) based on the relation, semantic and inter-relational features of the fuzzy sets. The family of these methods applies a two step “General Methodology” (referred to as GM). The name also reflects the fact that methods based on GM can handle arbitrarily shaped fuzzy sets. The basic concept is to divide the task of FRI into two main steps. The first step is to determine the reference point of the conclusion based on the ratio of the distances between the reference points of the observation and the antecedents. Then, based on the existing rules, a new, interpolated rule is generated for the reference point of the observation and the reference point of the conclusion. In the second step, a single rule reasoning method (revision function) is applied to determine the final fuzzy conclusion based on the similarity of the fuzzy observation and the antecedent of the new “interpolated” rule. For both main steps of GM numerous solutions exist; therefore GM stands for an FRI concept, or a family of FRI methods.

A rather different, application oriented aspect of FRI emerges in the concept of the Fuzzy Interpolation based on Vague Environment FRI method (referred to as FIVE), originally introduced in (Kovács, 1996), (Kovács & Kóczy, 1997a), (Kovács & Kóczy, 1997b) and extended with the ability to handle fuzzy observations in (Kovács, 2006). It was developed to fit the speed requirements of direct fuzzy control, where the conclusions of the fuzzy controller are applied directly as control actions in a real-time system. The main idea of the FIVE method is based on the fact that most control applications provide crisp observations and require crisp conclusions from the controller. Adopting the idea of the vague environment (Klawonn, 1994), FIVE can handle the antecedent and consequent fuzzy partitions of the fuzzy rule base by scaling functions (Klawonn, 1994) and therefore turn the fuzzy interpolation into crisp interpolation. In FIVE any crisp interpolation, extrapolation, or regression method can be adapted very simply for FRI. Because of its simple multidimensional applicability, the Shepard operator based interpolation (Shepard, 1968) was originally adapted in FIVE.
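The KH rule stated at the beginning of this section can be sketched for the simplest case. The sketch assumes triangular CNF sets written as (left, peak, right) tuples, an observation lying between the two flanking antecedents, and the lower/upper end point distances of the α-cuts; under these assumptions the distance-ratio equation reduces to linear interpolation of each end point. The function names and the chosen α levels are illustrative, not part of the original method description.

```python
def alpha_cut(tri, alpha):
    """Closed interval [lower, upper] of the alpha-cut of a triangular set."""
    left, peak, right = tri
    return (left + alpha * (peak - left), right - alpha * (right - peak))

def kh_interpolate(a1, b1, a2, b2, obs, alphas=(0.0, 0.5, 1.0)):
    """KH-style conclusion for the flanking rules a1 -> b1 and a2 -> b2.

    Returns {alpha: (lower, upper)} for the conclusion B*. The flanking
    antecedents are assumed distinct at every alpha level, so the
    denominator below is never zero.
    """
    conclusion = {}
    for alpha in alphas:
        a1_cut, a2_cut = alpha_cut(a1, alpha), alpha_cut(a2, alpha)
        b1_cut, b2_cut = alpha_cut(b1, alpha), alpha_cut(b2, alpha)
        o_cut = alpha_cut(obs, alpha)
        ends = []
        for k in (0, 1):  # 0: lower end point, 1: upper end point
            # Relative position of the observation between the antecedents.
            lam = (o_cut[k] - a1_cut[k]) / (a2_cut[k] - a1_cut[k])
            # Same relative position between the consequents.
            ends.append((1.0 - lam) * b1_cut[k] + lam * b2_cut[k])
        conclusion[alpha] = tuple(ends)
    return conclusion
```

For the rules (0, 1, 2) → (0, 1, 2) and (8, 9, 10) → (4, 5, 6) and the observation (4, 5, 6), the support of the conclusion is [2, 4] and its core is 3. With an antecedent that has a wide flank, for example a1 = (0, 1, 4) and b1 = (0, 1, 1.2), the lower end point of an α-cut can exceed the upper one. This is the abnormal conclusion that methods such as MACI were designed to exclude.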


FUTURE TRENDS
Future trends of the FRI methods include the appearance of numerous hybrid FRI methods, i.e. neuro-FRI and genetic-FRI, for (depending on the application area) gradient based or gradient free parameter optimisation of the FRI model. Future trends are also directed towards an extended number of practical applications of FRI. Recently a freely available comprehensive FRI toolbox (Johanyák, Tikk, Kovács & Wong, 2006) and an FRI oriented web site (http://fri.gamf.hu) have appeared for aiding and guiding future FRI applications.

CONCLUSION
Relatively few Fuzzy Rule Interpolation (FRI) techniques can be found among the practical fuzzy rule based applications. On the one hand, the FRI methods are not widely known, and some of them have limitations from a practical application point of view, e.g. they can be applied only in the one-dimensional case, or are defined based on the two closest surrounding rules of the actual observation. On the other hand, by enabling the application of sparse rule bases, the FRI methods can dramatically simplify fuzzy rule base creation, since FRI methods can provide reasonable (interpolated) conclusions even if none of the existing rules fires under the current observation. Therefore these methods can save the expert from dealing with derivable rules and help to concentrate on cardinal actions only, and hence simplify the rule base creation itself. Thus, compared to the classical fuzzy CRI, the number of fuzzy rules to be handled during the design process can be dramatically reduced (see e.g. (Kovács, 2005)). Moreover, in the case of parameter optimisation of the sparse FRI model (hybrid FRI methods), the reduced FRI rule base size can also mean a reduction in the size of the optimisation search space, and hence it can lead to quicker optimisation algorithms too.

REFERENCES
P. Baranyi, L. T. Kóczy, and T. D. Gedeon (2004). A Generalized Concept for Fuzzy Rule Interpolation. IEEE Transactions on Fuzzy Systems, (12) 6, 820-837.

S. Jenei (2001). Interpolating and extrapolating fuzzy quantities revisited – an axiomatic approach. Soft Computing, (5), 179-193. S. Jenei, E. P. Klement and R. Konzel (2002). Interpolation and extrapolation of fuzzy quantities – The multipledimensional case. Soft Computing, (6), 258-270. Zs. Cs. Johanyák, D. Tikk, Sz. Kovács, K. W. Wong (2006). Fuzzy Rule Interpolation Matlab Toolbox – FRI Toolbox, Proc. of the IEEE World Congress on Computational Intelligence (WCCI’06), 15th Int. Conf. on Fuzzy Systems (FUZZ-IEEE’06), Vancouver, BC, Canada, Omnipress. ISBN 0-7803-9489-5, 1427-1433. F. Klawonn (1994). Fuzzy Sets and Vague Environments. Fuzzy Sets and Systems, (66), 207-221. G. J. Klir, T. A. Folger (1988). Fuzzy Sets Uncertainity and Information. Prentice-Hall International. L. T. Kóczy and K. Hirota (1991). Rule interpolation by α-level sets in fuzzy approximate reasoning. BUSEFAL, Automne, URA-CNRS, Toulouse, France, (46), 115-123. L. T. Kóczy and Sz. Kovács (1993). On the preservation of the convexity and piecewise linearity in linear fuzzy rule interpolation. Tokyo Institute of Technology, Yokohama, Japan, Technical Report TR 93-94/402, LIFE Chair Fuzzy Theory. L. T. Kóczy and Sz. Kovács (1994). Shape of the Fuzzy Conclusion Generated by Linear Interpolation in Trapezoidal Fuzzy Rule Bases. Proceedings of the 2nd European Congress on Intelligent Techniques and Soft Computing, Aachen, 1666–1670. L.T. Kóczy, K. Hirota, and T. D. Gedeon (1997). Fuzzy rule interpolation by the conservation of relative fuzziness. Technical Report TR 97/2. Hirota Lab, Dept. of Comp. Int. and Sys. Sci., Tokyo Institute of Technology, Yokohama. Sz. Kovács (1996). New Aspects of Interpolative Reasoning. Proceedings of the 6th. International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Granada, Spain, 477-482. Sz. Kovács, and L.T. Kóczy (1997a). Approximate Fuzzy Reasoning Based on Interpolation in the Vague



Environment of the Fuzzy Rule base as a Practical Alternative of the Classical CRI. Proceedings of the 7th International Fuzzy Systems Association World Congress, Prague, Czech Republic, 144-149. Sz. Kovács, and L.T. Kóczy (1997b). The use of the concept of vague environment in approximate fuzzy reasoning. Fuzzy Set Theory and Applications, Tatra Mountains Mathematical Publications, Mathematical Institute Slovak Academy of Sciences, Bratislava, Slovak Republic, (12), 169-181. Sz. Kovács (2005). Interpolative Fuzzy Reasoning in Behaviour-based Control, Advances in Soft Computing, Computational Intelligence, Theory and Applications, Bernd Reusch (Ed.), Springer, Germany, ISBN 3-54022807-1, (2), 159-170. Sz. Kovács (2006). Extending the Fuzzy Rule Interpolation “FIVE” by Fuzzy Observation. Advances in Soft Computing, Computational Intelligence, Theory and Applications, Bernd Reusch (Ed.), Springer Germany, ISBN 3-540-34780-1, 485-497. P. M. Larsen (1980). Industrial application of fuzzy logic control. Int. J. of Man Machine Studies, (12) 4, 3-10. E. H. Mamdani and S. Assilian (1975). An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. of Man Machine Studies, (7), 1-13. I. Perfilieva (2004). Fuzzy function as an approximate solution to a system of fuzzy relation equations. Fuzzy Sets and Systems, (147), 363-383. D. Shepard (1968). A two dimensional interpolation function for irregularly spaced data. Proc. 23rd ACM Internat. Conf., 517-524. M. Sugeno (1985). An introductory survey of fuzzy control. Information Science, (36), 59-83. T. Takagi and M. Sugeno (1985). Fuzzy identification of systems and its applications to modeling and control. IEEE Trans. on SMC, (15), 116-132. D. Tikk and P. Baranyi (2000). Comprehensive analysis of a new fuzzy rule interpolation method. In IEEE Transaction on Fuzzy Systems, (8) 3, 281-296. D. Tikk, P. Baranyi, T. D. Gedeon, and L. Muresan (2001). Generalization of a rule interpolation method


resulting always in acceptable conclusion. Tatra Mountains Mathematical Publications, (21), 73-91. D. Tikk, I. Joó, L. T. Kóczy, P. Várlaki, B. Moser, and T. D. Gedeon (2002). Stability of interpolative fuzzy KH-controllers. Fuzzy Sets and Systems, (125) 1, 105-119. D. Tikk (2003). Notes on the approximation rate of fuzzy KH interpolator. Fuzzy Sets and Systems, (138) 2, 441-453. Y. Yam, and L. T. Kóczy (1997). Representing membership functions as points in high dimensional spaces for fuzzy interpolation and extrapolation. Dept. Mech. Automat. Eng., Chinese Univ. Hong Kong, Technical Report CUHK-MAE-97-03. G. Vass, L. Kalmár and L. T. Kóczy (1992). Extension of the fuzzy rule interpolation method. Proceedings of the International Conference Fuzzy Sets Theory Applications (FSTA’92), Liptovsky Mikulas, Czechoslovakia, 1-6. K. W. Wong, T. D. Gedeon, and D. Tikk (2000). An improved multidimensional α-cut based fuzzy interpolation technique. Proceedings of the International Conference Artificial Intelligence in Science and Technology (AISAT’2000), Hobart, Australia, 29-32. K. W. Wong, D. Tikk, T. D. Gedeon, and L. T. Kóczy (2005). Fuzzy Rule Interpolation for Multidimensional Input Spaces With Applications. IEEE Transactions on Fuzzy Systems, ISSN 1063-6706, (13) 6, 809-819. L. A. Zadeh (1973). Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans. on SMC, (3), 28-44.

KEY TERMS

α-Cut of a Fuzzy Set: A crisp set which holds the elements of a fuzzy set (on the same universe of discourse) whose membership grade is greater than or equal to α. (In the case of a “strong” α-cut it must be greater than α.)

ε-Covering Fuzzy Partition: The fuzzy partition (a set of linguistic terms (fuzzy sets)) ε-covers the universe of discourse if, for every element in the universe of discourse, a linguistic term exists which has a membership value greater than or equal to ε.

Complete (or Dense) Fuzzy Rule Base: A fuzzy rule base is complete, or dense, if all the input universes are ε-covered by rule antecedents, where ε>0. In the case of a complete fuzzy rule base, for every possible multidimensional observation a rule antecedent must exist which has a nonzero activation degree. Note that completeness of the fuzzy rule base is not equivalent to having covering fuzzy partitions on each antecedent universe (this is required but not sufficient in the multidimensional case). Usually the number of rules of a complete rule base is O(M^I), where M is the average number of linguistic terms in the fuzzy partitions and I is the number of input universes.

Convex and Normal Fuzzy (CNF) Set: A fuzzy set, defined on a universe of discourse that holds a total ordering, which has a height (maximal membership value) equal to one (i.e. normal fuzzy set), and in which the membership grade of any element between two arbitrary elements is greater than or equal to the smaller membership grade of the two arbitrary boundary elements (i.e. convex fuzzy set).

Fuzzy Compositional Rule of Inference (CRI): The most common fuzzy inference method. The fuzzy conclusion is calculated as the fuzzy composition (Klir & Folger, 1988) of the fuzzy observation and the fuzzy rule base relation (see “Fuzzy dot” representation of fuzzy rules). In the case of the Zadeh-Mamdani-Larsen max-min compositional rule of inference (Zadeh, 1973) (Mamdani, 1975) (Larsen, 1980) the applied fuzzy composition is the max-min composition of fuzzy relations (“max” stands for the applied s-norm and “min” for the applied t-norm fuzzy operations).

“Fuzzy Dot” Representation of Fuzzy Rules: The most common understanding of the If-Then fuzzy rules. The fuzzy rules are represented as a fuzzy relation of the rule antecedent and the rule consequent linguistic terms. In the case of the Zadeh-Mamdani-Larsen compositional rule of inference (Zadeh, 1973) (Mamdani, 1975) (Larsen, 1980) the fuzzy rule relations are calculated as the fuzzy cylindric closures (t-norm of the cylindric extensions) (Klir & Folger, 1988) of the antecedent and the rule consequent linguistic terms.

Fuzzy Rule Interpolation: A way of fuzzy inference by interpolation of the existing fuzzy rules based on various distance and similarity measures of fuzzy sets. A suitable method for handling sparse fuzzy rule bases, since FRI methods can provide reasonable (interpolated/extrapolated) conclusions even if none of the existing rules fires under the current observation.

Sparse Fuzzy Rule Base: A fuzzy rule base is sparse if an observation may exist which hits no rule antecedent. (The rule base is not complete.)

Vague Environment (VE): The idea of a VE is based on the similarity (or in this case the indistinguishability) of the considered elements. In a VE the fuzzy membership function µA(x) indicates the level of similarity of x to a specific element a that is a representative or prototypical element of the fuzzy set µA(x), or, equivalently, the degree to which x ∈ X is indistinguishable from a ∈ X (Klawonn, 1994). Therefore the α-cuts of the fuzzy set µA(x) are the sets which contain the elements that are (1 − α)-indistinguishable from a. Two values in a VE are ε-distinguishable if their distance is greater than ε. The distances in a VE are weighted distances. The weighting factor or function is called the scaling function (factor) (Klawonn, 1994). If the VE of a fuzzy partition (the scaling function or at least the approximate scaling function (Kovács, 1996), (Kovács & Kóczy, 1997b)) exists, the member sets of the fuzzy partition can be characterized by points in that VE.
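The weighted distance of a vague environment, and the way it lets a crisp interpolation stand in for fuzzy reasoning, can be sketched as follows. The sketch assumes that the weighted distance between two points is the absolute value of the integral of the scaling function between them (a common formalisation that the definition above does not spell out), approximates it numerically, and uses a plain Shepard inverse-distance interpolation over rule points with crisp consequents. It is a simplified illustration under these assumptions, not the complete FIVE method; all function and parameter names are hypothetical.

```python
def scaled_distance(a, b, scaling, steps=1000):
    """Approximate |integral of scaling(x) dx from a to b| (trapezoidal rule)."""
    if a == b:
        return 0.0
    h = (b - a) / steps
    total = 0.5 * (scaling(a) + scaling(b))
    for i in range(1, steps):
        total += scaling(a + i * h)
    return abs(total * h)

def shepard_in_ve(x, rule_points, scaling, power=2.0):
    """Inverse-distance interpolation using distances measured in the VE.

    rule_points: list of (antecedent point a_k, crisp consequent b_k).
    """
    weights = []
    for a_k, b_k in rule_points:
        dist = scaled_distance(x, a_k, scaling)
        if dist == 0.0:
            # The observation coincides with a rule point: return its consequent.
            return b_k
        weights.append((1.0 / dist ** power, b_k))
    return sum(w * b for w, b in weights) / sum(w for w, _ in weights)
```

With a constant scaling function the weighted distance is just the Euclidean distance multiplied by the scaling factor, and an observation halfway between two rule points receives the average of their consequents. A scaling function that varies over the universe stretches the regions where the fuzzy partition is narrower.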


Fuzzy Systems Modeling: An Introduction

Young Hoon Joo
Kunsan National University, Korea

Guanrong Chen
City University of Hong Kong, China

INTRODUCTION

The basic objective of system modeling is to establish an input-output representative mapping that can satisfactorily describe the system behaviors, by using the available input-output data based upon physical or empirical knowledge about the structure of the unknown system.

BACKGROUND

Conventional system modeling techniques suggest constructing a model described by a set of differential or difference equations. This approach is effective only when the underlying system is mathematically well-defined and precisely expressible. It often fails to handle uncertain, vague or ill-defined physical systems, and yet most real-world problems do not obey such precise, idealized, and subjective mathematical rules. According to the incompatibility principle (Zadeh, 1973), as the complexity of a system increases, the human ability to make precise and significant statements about its behaviors decreases, until a threshold is reached beyond which precision and significance become impossible. Under this principle, Zadeh (1973) proposed a method of modeling human thinking with fuzzy numbers rather than crisp numbers, which eventually led to the development of various fuzzy modeling techniques.

MAIN FOCUS OF THE CHAPTER

Structure Identification

In structure identification of a fuzzy model, the first step is to select some appropriate input variables from the collection of possible system inputs; the second step is to determine the number of membership functions for each input variable. This process is closely related to the partitioning of the input space. Input space partitioning methods are useful for determining such structures (Wang & Mendel, 1996).

Grid Partitioning

Figure 1 (a) shows a typical grid partition in a two-dimensional input space. Fuzzy grids can be used to generate fuzzy rules based on system input-output training data. A one-pass build-up procedure can avoid the time-consuming learning process, but its performance depends heavily on the definition of the grid. In general, the finer the grid is, the better the performance will be. Adaptive fuzzy grid partitioning can be used to refine and even optimize this process. In the adaptive approach, a uniformly partitioned grid may be used for initialization. As the process goes on, the parameters in the antecedent membership functions are adjusted, and consequently the fuzzy grid evolves. The gradient descent method may then be used to optimize the size and location of the fuzzy grid regions and the degree of overlap among them. The major drawback of the grid partition method is that the number of grid regions, and hence of fuzzy rules, grows exponentially with the number of input variables: a grid with M membership functions on each of n inputs has M^n regions. This is known as the “curse of dimensionality,” which is a common issue for most partitioning methods.
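The growth of a fuzzy grid can be illustrated with a short sketch. The function names and the linguistic labels below are hypothetical and chosen only for illustration; each grid cell is treated as one candidate rule antecedent.

```python
from itertools import product

def grid_rule_antecedents(labels_per_input):
    """Enumerate every cell of a fuzzy grid as one candidate rule antecedent."""
    return list(product(*labels_per_input))

def grid_rule_count(num_inputs, num_mfs):
    """Number of grid cells when each input carries num_mfs membership functions."""
    return num_mfs ** num_inputs

# Two inputs with 2 and 3 linguistic labels give 2 * 3 = 6 cells.
labels = [["low", "high"], ["low", "medium", "high"]]
rules = grid_rule_antecedents(labels)

# With 3 labels per input, the cell count grows exponentially in the number of inputs.
counts = [grid_rule_count(n, 3) for n in (2, 4, 8)]
```

With three labels per input, going from 2 to 8 inputs raises the number of cells from 9 to 6561, which is the exponential growth the paragraph above refers to.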

Tree Partitioning

Figure 1 (b) visualizes a tree partition. The tree partitioning results from a series of guillotine cuts. Each region is generated by a guillotine cut, which is made entirely across the subspace to be partitioned. At the (k – 1)st iteration step, the input space is partitioned into k regions. Then a guillotine cut is applied to one of these regions to further partition the entire space into k + 1 regions. There are several strategies for determining which dimension to cut, where to cut at each step, and when to stop. This flexible tree partitioning algorithm relieves the curse of dimensionality. However, more membership functions are needed for each input variable, and they usually do not have clear linguistic meanings; consequently, the resulting fuzzy model is less descriptive.
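A minimal sketch of one guillotine cut follows. It assumes axis-aligned regions stored as (lower corner, upper corner) pairs; the function name and the example cut positions are hypothetical, and the strategy for choosing the dimension and position is left to the caller.

```python
def guillotine_cut(regions, index, dim, position):
    """Split regions[index] across dimension dim at position.

    Each region is a (lower_corner, upper_corner) pair of tuples.
    The returned list has exactly one more region than the input list.
    """
    lo, hi = regions[index]
    if not lo[dim] < position < hi[dim]:
        raise ValueError("cut must fall strictly inside the region")
    left_hi = list(hi)
    left_hi[dim] = position
    right_lo = list(lo)
    right_lo[dim] = position
    left = (lo, tuple(left_hi))
    right = (tuple(right_lo), hi)
    return regions[:index] + [left, right] + regions[index + 1:]

regions = [((0.0, 0.0), (1.0, 1.0))]          # the whole two-dimensional input space
regions = guillotine_cut(regions, 0, 0, 0.5)  # cut along the first input: 2 regions
regions = guillotine_cut(regions, 1, 1, 0.3)  # cut the right half along the second input: 3 regions
```

Each call turns k regions into k + 1, so the number of regions grows linearly with the number of cuts rather than exponentially with the number of inputs.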

Scatter Partitioning

Figure 1 (c) illustrates a scatter partition. This method extracts fuzzy rules directly from numerical data (Abe & Lan, 1995). Suppose that a one-dimensional output, y, and an m-dimensional input vector, x, are available. First, the output space is divided into n intervals, [y0, y1], (y1, y2], …, (yn–1, yn], where the ith interval is called “output interval i.” Then, activation hyperboxes are determined, which define the input region corresponding to the output interval i, by calculating the minimum and maximum values of the input data for each output interval. If the activation hyperbox for the output interval i overlaps with the activation hyperbox for the output interval j, then the overlapped region is defined as an inhibition hyperbox. If the input data for output intervals i and/or j exist in the inhibition hyperbox, then within this inhibition hyperbox one or two additional activation hyperboxes will be defined. Moreover, if two activation hyperboxes are defined and they overlap, then an additional inhibition hyperbox is further defined. This procedure is repeated until overlapping is resolved.
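The first level of this procedure (activation hyperboxes and their overlap) can be sketched as follows. This is only an illustration under simplifying assumptions: the function names and data are hypothetical, output intervals are numbered from 0, and the recursive refinement inside inhibition hyperboxes is not implemented.

```python
def activation_hyperboxes(inputs, outputs, edges):
    """Bounding box (per-dimension min and max) of the inputs in each output interval.

    Interval i is (edges[i], edges[i + 1]]; the first interval also contains edges[0].
    """
    boxes = {}
    for x, y in zip(inputs, outputs):
        i = next(k for k in range(len(edges) - 1)
                 if edges[k] < y <= edges[k + 1] or (k == 0 and y == edges[0]))
        if i not in boxes:
            boxes[i] = (list(x), list(x))
        else:
            lo, hi = boxes[i]
            boxes[i] = ([min(a, b) for a, b in zip(lo, x)],
                        [max(a, b) for a, b in zip(hi, x)])
    return boxes

def inhibition_hyperbox(box_i, box_j):
    """Overlap of two activation hyperboxes, or None if they are disjoint."""
    lo = [max(a, b) for a, b in zip(box_i[0], box_j[0])]
    hi = [min(a, b) for a, b in zip(box_i[1], box_j[1])]
    return (lo, hi) if all(l <= h for l, h in zip(lo, hi)) else None

inputs = [(0.0, 0.0), (2.0, 2.0), (1.0, 1.0), (3.0, 3.0)]
outputs = [0.1, 0.4, 0.6, 0.9]
edges = [0.0, 0.5, 1.0]                       # two output intervals
boxes = activation_hyperboxes(inputs, outputs, edges)
overlap = inhibition_hyperbox(boxes[0], boxes[1])
```

In this toy data set the two activation hyperboxes overlap on the square from (1, 1) to (2, 2), which would become an inhibition hyperbox in the procedure described above.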

Parameters Identification

After the system structure has been determined, parameters identification is in order. In this process, the optimal parameters of a fuzzy model that can best describe the input-output behavior of the underlying system are searched for by optimization techniques. Sometimes, structure and parameters are identified under the same framework through fuzzy modeling. There are many different approaches to modeling a system using the fuzzy set and fuzzy system theories (Chen & Pham, 1999, 2006), but the classical least-squares optimization and the general Genetic Algorithm (GA) optimization techniques are the most popular. They are quite generic, effective, and competitive with other successful non-fuzzy types of optimization-based modeling methods such as neural networks and statistical Monte Carlo.
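As a small illustration of parameter identification by least squares, the sketch below fits the linear coefficients of a fuzzy model whose structure is assumed to be fixed already. The three Gaussian membership functions, their centers and widths, and all function names are hypothetical; the normal equations are solved with a plain Gaussian elimination so that only the standard library is needed.

```python
import math

def fuzzy_basis(x, centers, widths):
    """Normalized Gaussian basis functions evaluated at a scalar input x."""
    mu = [math.exp(-0.5 * ((x - c) / s) ** 2) for c, s in zip(centers, widths)]
    total = sum(mu)
    return [m / total for m in mu]

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def least_squares(G, y):
    """Solve the normal equations (G^T G) a = G^T y for the coefficients a."""
    m = len(G[0])
    GtG = [[sum(row[i] * row[j] for row in G) for j in range(m)] for i in range(m)]
    Gty = [sum(row[i] * yi for row, yi in zip(G, y)) for i in range(m)]
    return solve(GtG, Gty)

centers, widths = [0.0, 0.5, 1.0], [0.25, 0.25, 0.25]   # assumed fixed structure
xs = [i / 49 for i in range(50)]
G = [fuzzy_basis(x, centers, widths) for x in xs]
true_coeffs = [1.0, -2.0, 0.5]
y = [sum(a * g for a, g in zip(true_coeffs, row)) for row in G]  # noise-free synthetic data
coeffs = least_squares(G, y)
```

Because the synthetic data are noise-free and generated by the same basis functions, the fitted coefficients should match the generating ones up to rounding error.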

Figure 1. Three typical MISO partitioning methods: (a) fuzzy grid; (b) tree partition; (c) scatter partition

An Approach Using Least-Squares Optimization

A fuzzy system can be described by the following generic form:

$$ f(x) = \sum_{k=1}^{m} \alpha_k g_k(x) = \alpha^T g(x) \qquad (1) $$

where $\alpha = [\alpha_1, \ldots, \alpha_m]^T$ are constant coefficients and

$$ g_k(x) = \frac{\prod_{j=1}^{n} \mu_{X_{kj}}(x)}{\sum_{k=1}^{m} \left( \prod_{j=1}^{n} \mu_{X_{kj}}(x) \right)}, \quad k = 1, \ldots, m \qquad (2) $$

are the basis functions, in which $\mu_{X_{kj}}(\cdot)$ are the chosen membership functions. Suppose that the real system output is

$$ y(t) = \sum_{k=1}^{m} \alpha_k g_k(x) + e(t) \qquad (3) $$

where y(t) is the system output and e(t) represents the modeling error, which is assumed to be uncorrelated with the fuzzy basis functions $\{g_k(\cdot)\}_{k=1}^{m}$ in this discussion.

Suppose that n pairs of system input-output data are given: $(x_d(t_i), y_d(t_i))$, $i = 1, \ldots, n$. The goal is to find the best possible fuzzy basis functions, such that the total least-squares error between the data set and the system outputs $\{y(t_i)\}_{i=1}^{n}$ is minimized. To do so, the linear model (3) is first written in a matrix form over the time domain $t_1 < \cdots < t_n$, namely,

$$ y = G\alpha + e $$

where $y = [y(t_1), \ldots, y(t_n)]^T$, $e = [e(t_1), \ldots, e(t_n)]^T$, and

$$ G := [g_1, \ldots, g_m] = \begin{bmatrix} g_1(t_1) & \cdots & g_m(t_1) \\ \vdots & & \vdots \\ g_1(t_n) & \cdots & g_m(t_n) \end{bmatrix} $$

with $g_j = [g_j(t_1), \ldots, g_j(t_n)]^T$, $j = 1, \ldots, m$.

The first step is to transform the set of numbers, $g_i(t_j)$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, into a set of orthogonal basis vectors, and only significant basis vectors are used to form the final least-squares optimization. Here, the Gaussian membership functions

$$ \mu_{X_{kj}}(x_k) = c_{kj} \exp\left\{ -\frac{1}{2} \left( \frac{x_k - \bar{x}_{kj}}{\sigma_{kj}} \right)^2 \right\} $$

are used as an example to illustrate the computational algorithm.

One approach to initializing the fuzzy basis functions is to choose n initial basis functions, $g_k(x)$, in the form of (2) with m = n in this discussion, and initially with $c_{kj} = 1$, $\bar{x}_{kj} = x_k(t_j)$, and

$$ \sigma_{kj} = \frac{1}{m_l} \left[ \max\{x_k(t_j),\ j = 1, \ldots, n\} - \min\{x_k(t_j),\ j = 1, \ldots, n\} \right], \quad k = 1, \ldots, n $$

where $m_l$ is the number of the basis functions in the final expression, which is determined by the designer based on experience (usually, $m_l < n$).

After choosing the initial fuzzy basis functions, the next step is to select the most significant ones among them. This process is based on the classical Gram-Schmidt orthogonalization, while $c_{kj}$, $\bar{x}_{kj}$, and $\sigma_{kj}$ are all fixed:

Step 1. For j = 1, compute

$$ w_1^{(i)} = g_i(x_d) = [g_i(x_d(t_1)), \ldots, g_i(x_d(t_n))]^T $$

$$ h_1^{(i)} = \frac{(w_1^{(i)})^T y_d}{(w_1^{(i)})^T w_1^{(i)}} $$

$$ E_1^{(i)} = (h_1^{(i)})^2 \, \frac{(w_1^{(i)})^T w_1^{(i)}}{y_d^T y_d} \qquad (1 \le i \le n) $$

where $x_d = [x_d(t_1), \ldots, x_d(t_n)]^T$ and $y_d = [y_d(t_1), \ldots, y_d(t_n)]^T$ are the input-output data set. Then, compute

$$ E_1^{(i_1)} = \max\{E_1^{(i)} : 1 \le i \le n\} $$

and let $w_1 = w_1^{(i_1)} = g_{i_1}$ and $h_1 = h_1^{(i_1)}$.

Step 2. For each j, $2 \le j \le m_l$, compute

$$ c_{kj}^{(i)} = \frac{w_k^T g_i}{w_k^T w_k} $$

$$ w_j^{(i)} = g_i - \sum_{k=1}^{j-1} c_{kj}^{(i)} w_k $$

$$ h_j^{(i)} = \frac{(w_j^{(i)})^T y_d}{(w_j^{(i)})^T w_j^{(i)}} $$

$$ E_j^{(i)} = (h_j^{(i)})^2 \, \frac{(w_j^{(i)})^T w_j^{(i)}}{y_d^T y_d} $$

$$ E_j^{(i_j)} = \max\{E_j^{(i)} : 1 \le i \le n;\ i \ne i_1, \ldots, i \ne i_{j-1}\} $$

where $E_j^{(i)}$ represents the error-reduction ratio due to $w_j^{(i)}$. Pick $w_j = w_j^{(i_j)}$ and $h_j = h_j^{(i_j)}$.

Step 3. Solve equation

$$ A^{(m_l)} \alpha^{(m_l)} = h^{(m_l)} $$

for a solution $\alpha^{(m_l)} = [\alpha_1^{(m_l)}, \ldots, \alpha_{m_l}^{(m_l)}]^T$, where $h^{(m_l)} = [h_1, \ldots, h_{m_l}]^T$ and

$$ A^{(m_l)} = \begin{bmatrix} 1 & c_{12}^{(i_2)} & c_{13}^{(i_3)} & \cdots & c_{1 m_l}^{(i_{m_l})} \\ 0 & 1 & c_{23}^{(i_3)} & \cdots & c_{2 m_l}^{(i_{m_l})} \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & c_{m_l - 1,\, m_l}^{(i_{m_l})} \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix} $$

The final result is obtained as

$$ f(x) = \sum_{k=1}^{m_l} \alpha_k^{(m_l)} g_{i_k}(x) $$

An Approach Using Genetic Algorithms

The parameter identification procedure is generally very tedious for a large-scale complex system, for which the GA approach has some attractive features such as its great flexibility and robust optimization ability (Man, Tang, Kwong & Halang, 1997). GA is attributed to Holland (1975) and was applied to fuzzy modeling and fuzzy control in the 1980s. GA can be used to find an optimal or suboptimal fuzzy model to describe a given system without manual design (Joo, Hwang, Kim & Woo, 1997; Liska & Melsheimer, 1994; Soucek & Group, 1992). In addition, the GA fuzzy modeling method can be integrated with other components of a fuzzy system, so as to achieve overall superior performance in control and automation.

Genetic Algorithm Preliminaries

GA provides an optimization method, with a stochastic search algorithm, based on the biological principles of selection, crossover and mutation. A GA encodes each point in a solution space into a string composed of binary or real values, called a chromosome. Each point is assigned a fitness value from zero to one, which is usually taken to be the same as the objective function to be maximized. A GA scheme keeps a set of points as a population, which is evolved repeatedly toward a better and possibly the best fitness value. In each generation, GA generates a new population using genetic operators such as crossover and mutation. Through these operations, individuals with higher fitness values are more likely to survive and to participate in the next genetic operations. After a number of generations, individuals with higher fitness values are kept in the population while the others are eliminated. GA can therefore yield gradually improving solutions until a desired optimal or suboptimal solution is obtained.

Basic GA Elements

A simple genetic algorithm (SGA) was first described by Goldberg (1989) and is used here for illustration, with a pseudo-code shown below, where the population at time t is a time function, P = P(t), with a random initial population P(0).

Procedure GA
Begin
    t = 0
    Initialize P(t)
    Evaluate P(t)
    While not finished do
    Begin
        t = t + 1
        Reproduce P(t) from P(t – 1)
        Crossover individuals in P(t)
        Mutate individuals in P(t)
        Evaluate P(t)
    End
End
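The pseudo-code above can be turned into a small runnable sketch. The version below is one possible reading, not Goldberg's original program: it assumes binary chromosomes, a non-negative fitness function, fitness-proportional reproduction, single-point crossover and bit-flip mutation, and it additionally remembers the best individual seen so far. The function name and default parameter values are hypothetical.

```python
import random

def simple_ga(fitness, length, pop_size=30, generations=40, pc=0.9, pm=0.01, seed=0):
    """Simple GA on binary strings; pop_size is assumed even, fitness non-negative."""
    rng = random.Random(seed)
    # Initialize P(0) with random binary chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Evaluate P(t); the small offset keeps the weights valid if all scores are zero.
        weights = [fitness(ind) + 1e-9 for ind in pop]
        # Reproduce: fitness-proportional sampling with replacement.
        parents = rng.choices(pop, weights=weights, k=pop_size)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if rng.random() < pc:                       # single-point crossover
                cut = rng.randint(1, length - 1)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children += [a, b]
        # Mutate: flip each bit with probability pm.
        pop = [[1 - g if rng.random() < pm else g for g in ind] for ind in children]
        best = max(pop + [best], key=fitness)
    return best

# Toy objective: maximize the number of ones in a 20-bit string.
best = simple_ga(sum, length=20)
```

The run is reproducible because the random generator is seeded; changing `seed`, `pc` or `pm` changes the trajectory of the search.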

Population Representation and Initialization

Individuals are encoded as strings (i.e., chromosomes) composed of symbols from some alphabet, so that the genotypes (chromosome values) are uniquely mapped onto the decision variable (phenotype) domain. The most commonly used representation in GA is the binary alphabet, {0,1}; others are ternary, integer, real-valued, etc. (Takagi & Sugeno, 1985). The search process, described below, operates on these encoded decision variables rather than on the decision variables themselves, except when real-valued genes are used. After a representation method has been chosen, the first step in the SGA is to create an initial population by generating the required number of individuals via a random number generator that distributes the initial numbers uniformly in the desired range.

Objective and Fitness Functions

The objective function is used to measure the performance of the individuals over the problem domain. The fitness function is used to transform the objective function value into a measure of relative fitness; mathematically, F(x) = g(f(x)), where f is the objective function, g is the transform that maps the value of f to a nonnegative number, and F is the resulting relative fitness. In general, the fitness function value corresponds to the number of offspring that an individual can expect to produce in the next generation. A commonly used transform is the proportional fitness assignment, defined by

$$ F(x_i) = \frac{f(x_i)}{\sum_{j=1}^{N} f(x_j)}, $$

where N is the population size and xi is the phenotypic value of individual i, i = 1,···, N. Although the above fitness assignment ensures that each individual has a certain probability of reproduction according to its relative fitness, it does not account for negative objective function values. A linear transform, which offsets the objective function, is often used prior to the fitness assignment. It takes the form F(x) = a f(x) + b, where a is a positive scaling factor if the optimization is to maximize the objective function but is negative if it is a minimization, and the offset b is used to ensure that the resulting fitness values are all non-negative. Then, the selection algorithm selects individuals for reproduction on the basis of their relative fitness.

Reproduction

Once each individual has been assigned a fitness value, individuals can be chosen from the population with a probability according to their relative fitness. They can then be recombined to produce the next generation. The most widely used genetic operators in GA are the selection, crossover, and mutation operators. They are often run simultaneously in a GA program.

Selection

Selection is the process of determining the number of trials in which a particular individual is chosen for reproduction. Thus, it is the number of offspring that an individual will produce in the mating pool, a temporary population where crossover and mutation operations are applied to each individual. The selection of individuals has two separate processes:

a. determination of the number of trials an individual can expect to receive;
b. conversion of the expected number of trials into a discrete number of offspring.
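The fitness assignment and the two selection processes listed above can be sketched together. The text does not fix how expected trials are converted into integers, so the largest-remainder rule used here is only one simple, deterministic assumption (sampling schemes such as roulette-wheel selection are common alternatives). All function names and the sample objective values are hypothetical.

```python
def linear_offset(objective_values, a=1.0):
    """F = a*f + b, with b chosen so that the smallest fitness is exactly zero."""
    scaled = [a * f for f in objective_values]
    b = -min(scaled)
    return [s + b for s in scaled]

def proportional_fitness(values):
    """Relative fitness: each value divided by the population total."""
    total = sum(values)
    return [v / total for v in values]

def expected_trials(rel_fitness):
    """Process (a): expected number of trials in a mating pool of the same size."""
    n = len(rel_fitness)
    return [n * p for p in rel_fitness]

def discrete_offspring(trials):
    """Process (b): integer parts first, then the largest fractional remainders."""
    counts = [int(t) for t in trials]
    remaining = round(sum(trials)) - sum(counts)
    order = sorted(range(len(trials)), key=lambda i: trials[i] - counts[i], reverse=True)
    for i in order[:remaining]:
        counts[i] += 1
    return counts

objective = [-2.0, 0.0, 2.0, 4.0]            # includes a negative objective value
fit = linear_offset(objective)               # offset so all values are non-negative
rel = proportional_fitness(fit)
counts = discrete_offspring(expected_trials(rel))
```

In this example the worst individual receives no offspring and the best receives two, while the total number of offspring equals the population size.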

Crossover (Recombination)

The crossover operator defines the procedure for generating children from two parents. Analogous to biological crossover, it exchanges genes at a randomly selected crossover point between two parents that are themselves randomly selected from the mating pool. A common method is the following: parent chromosomes are cut at one or more randomly selected points, and they exchange their genes at these crossover points with a user-specified crossover probability. This method is categorized into single-point crossover and multi-point crossover according to the number of crossover points. Uniform crossover often works well with small populations of chromosomes and for simpler problems (Soucek & Group, 1992).

Figure 2. A chromosome structure for fuzzy modeling

Mutation

The mutation operation is applied randomly to individuals, so as to change their gene values with a mutation probability, Pm, which is very low in general.

GA Parameters

The choice of the mutation probability Pm and the crossover probability Pc as two control parameters can be a complex nonlinear optimization problem. Their settings are critically dependent upon the nature of the objective function, and how best to select them remains an open problem. One suggestion is that for a large population size (say 100), the crossover rate is 0.6 and the mutation rate is 0.001, while for a small population size (such as 30), the crossover rate is 0.9 and the mutation rate is 0.01 (Zalzala & Fleming, 1997).
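The crossover and mutation operators, controlled by Pc and Pm, can be sketched as follows for binary chromosomes. The function names are hypothetical; single-point crossover is simply the multi-point case with one cut point.

```python
import random

def crossover(p1, p2, points):
    """Exchange the gene segments lying between successive crossover points."""
    c1, c2 = list(p1), list(p2)
    swap = False
    prev = 0
    for cut in sorted(points) + [len(p1)]:
        if swap:
            c1[prev:cut], c2[prev:cut] = c2[prev:cut], c1[prev:cut]
        swap = not swap
        prev = cut
    return c1, c2

def mutate(chrom, pm, rng):
    """Flip each binary gene independently with probability pm."""
    return [1 - g if rng.random() < pm else g for g in chrom]

def recombine(p1, p2, pc, pm, rng, n_points=1):
    """Apply crossover with probability pc, then mutate both children."""
    c1, c2 = list(p1), list(p2)
    if rng.random() < pc:
        points = rng.sample(range(1, len(p1)), n_points)
        c1, c2 = crossover(p1, p2, points)
    return mutate(c1, pm, rng), mutate(c2, pm, rng)

zeros, ones = [0, 0, 0, 0], [1, 1, 1, 1]
single = crossover(zeros, ones, [2])       # one cut after the second gene
double = crossover(zeros, ones, [1, 3])    # two cuts: only the middle segment is swapped
rng = random.Random(1)
children = recombine(zeros, ones, pc=0.9, pm=0.01, rng=rng)  # small-population setting quoted above
```

The last call uses the crossover and mutation rates suggested in the text for a small population; its outcome depends on the random seed.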

GA-Based Fuzzy System Modeling

In GA, parameters for a given problem are represented by the chromosome. This chromosome may contain one or more substrings. Each chromosome contains a possible solution to the problem. The fitness function is used to evaluate how well a chromosome solves the problem. In the GA-based approach for fuzzy modeling, each chromosome represents a specific fuzzy model, and the ultimate goal is to carefully design a good (ideally optimal) chromosome to represent a desired fuzzy model.

Chromosome Structure

As an example, consider a simple fuzzy model with only one rule, along with the scatter partition to be encoded to a chromosome. Suppose that both real number coding and integer number coding are used. The structure and the parameters of the fuzzy model are encoded into one or more substrings in the chromosome. A chromosome is composed of two substrings (candidate substring and decision substring) and these substrings are divided into two parts (IF part and THEN part), as shown in Fig. 2. The candidate substring is encoded by real numbers, as shown in Fig. 3 (a). It contains the candidates for the parameters of a membership function in the IF part, and the fuzzy singleton membership function in the THEN part. Figure 3 describes the coding format of a candidate substring in a chromosome, where n is the number of input variables, r the number of candidates for parameters in the IF part, and s the number of candidates for the real numbers in the THEN part. Decision substrings are encoded by integers, which determine the structure and the number of rules, by choosing one of the parameters in the candidate substrings, as illustrated by Fig. 3 (b). The decision substring for the IF part determines the premise structure of the fuzzy rule base. It is composed of n genes that take integer values (alleles) between 0 and r. According to this value, an appropriate parameter in the candidate substring is selected. A zero value means that the related input is not included in the rule. A decision substring for the THEN part is composed of c (the maximum number of rules) genes that take integer values between 0 and s, which choose appropriate values from the candidate substring for the THEN part. In this substring, a gene taking the zero value deletes the related rule. Therefore, these substrings determine the structure of the THEN part and the number of rules. Figure 4 illustrates an example of decoding the chromosome, with the resulting fuzzy rule shown in Fig. 5.
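The decoding step can be sketched as follows. The exact layout of the substrings is given only in the figures, so the data layout here is an assumption: candidate membership-function parameters are stored as hypothetical (center, width) pairs per input, the IF decision substring is taken to hold one row of n genes per rule, and genes are 1-based indices into the candidate lists with 0 meaning "not used".

```python
def decode(candidate_if, candidate_then, decision_if, decision_then):
    """Decode decision substrings into a list of (premise, consequent) rules.

    candidate_if[i]  : r candidate parameter tuples for input i
    candidate_then   : s candidate real numbers (fuzzy singletons)
    decision_if[k]   : n integer genes in 0..r for rule k (0 drops the input)
    decision_then[k] : one integer gene in 0..s for rule k (0 deletes the rule)
    """
    rules = []
    for if_genes, then_gene in zip(decision_if, decision_then):
        if then_gene == 0:
            continue  # a zero THEN gene deletes the related rule
        premise = {i: candidate_if[i][g - 1]
                   for i, g in enumerate(if_genes) if g != 0}
        rules.append((premise, candidate_then[then_gene - 1]))
    return rules

# n = 2 inputs, r = 2 candidates per input, s = 3 consequent candidates, c = 3 rules.
candidate_if = [[(0.0, 1.0), (5.0, 2.0)], [(-1.0, 0.5), (1.0, 0.5)]]
candidate_then = [0.2, 0.7, 1.5]
decision_if = [[1, 0], [2, 2], [1, 1]]
decision_then = [3, 0, 1]
rules = decode(candidate_if, candidate_then, decision_if, decision_then)
```

Here the second rule is deleted by its zero THEN gene, and the first rule uses only the first input because its second IF gene is zero.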



Figure 3. Two basic functions in a chromosome: (a) the candidate substrings; (b) the decision substrings

Figure 4. An example of genetic decoding process

Figure 5. The first fuzzy rule obtained by the decoding processes

Fitness Function

To measure the performance of the GA-based fuzzy modeling, an objective function is defined for optimization, which is chosen by the designer and usually is a least-squares matching measure of the form

$$ J = \frac{1}{n} \sum_{i=1}^{n} (y_i - y_i^d)^2, $$

where $\{y_i\}$ and $\{y_i^d\}$ are the fuzzy model outputs and desired outputs, respectively, and n is the number of the data used.

Since GA is guided by the fitness values and places virtually no restriction on the formulation of its performance measure, one can incorporate more information about a fuzzy model into the fitness function: f = g(Jstructure, Jaccuracy, ···). One example of a fitness function is

$$ f(J) = \frac{\lambda}{J} + \frac{1 - \lambda}{1 + c}, $$

where λ ∈ [0,1] is the weighting factor (a large λ gives a highly accurate model but requires a large number of rules), and c is the maximum number of rules. When the fitness function is evaluated over an empty set, it is undefined; but in this case one may introduce a penalty factor, 0 < p < 1, and compute p · f(J) instead of f(J). If an individual with a very high fitness value appears at an early stage, this fitness function may cause early convergence of the solution, thereby stopping the algorithm before optimality is reached. To avoid this situation, the individuals may be sorted according to their raw fitness values, and the new fitness values are determined recursively by

$$ f_1 = 1, \quad f_2 = a f_1 = a, \quad \ldots, \quad f_m = a f_{m-1} = a^{m-1} $$

for a fitness scaling factor a ∈ (0,1).

GA-Based Fuzzy Modeling with Fine Tuning

GA generally does not guarantee the convergence to a global optimum. In order to improve this, the gradient descent method can be used to fine tune the parameters identified by GA. Since GA usually can find a near-global optimum, fine tuning of the membership function parameters in both IF and THEN parts, e.g., by a gradient descent method, can generally lead to a global optimization (Chang, Joo, Park & Chen, 2002; Goldberg, 1989).
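The fitness measures described under Fitness Function can be sketched as follows. The form of f(J) follows the formula as reconstructed from the text, the rank-based scaling assumes that the sorted individuals receive 1, a, a², and so on, and the function names and sample values are hypothetical.

```python
def mse(model_outputs, desired_outputs):
    """Least-squares matching measure J between model and desired outputs."""
    n = len(model_outputs)
    return sum((y - d) ** 2 for y, d in zip(model_outputs, desired_outputs)) / n

def fitness(J, c, lam=0.5):
    """f(J) = lam / J + (1 - lam) / (1 + c); undefined for J = 0."""
    return lam / J + (1.0 - lam) / (1.0 + c)

def rank_scaled(raw_fitness, a=0.9):
    """Give 1, a, a^2, ... to the individuals sorted by decreasing raw fitness."""
    order = sorted(range(len(raw_fitness)), key=lambda i: raw_fitness[i], reverse=True)
    scaled = [0.0] * len(raw_fitness)
    for rank, i in enumerate(order):
        scaled[i] = a ** rank
    return scaled

J = mse([1.0, 2.0], [1.5, 2.5])
f = fitness(J, c=3, lam=0.5)
scaled = rank_scaled([0.2, 5.0, 1.0], a=0.5)
```

Rank-based scaling keeps the ordering of the individuals but bounds the ratio between neighbouring fitness values by a, which limits the dominance of a single very fit individual.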

FUTURE TRENDS

This will be further discussed elsewhere in the future.

CONCLUSION

Fuzzy systems identification is an important and yet challenging subject for research, which calls for more efforts from the control theory and intelligent systems communities, to reach another high level of efficiency and success.

REFERENCES

S. Abe & M. S. Lan (1995). Fuzzy rules extraction directly from numerical data for function approximation. IEEE Trans. on Systems, Man and Cybernetics. 25: 119-129.

W. Chang, Y. H. Joo, J. B. Park & G. Chen (2002). Design of robust fuzzy-model-based controller with sliding mode control for SISO nonlinear systems. Fuzzy Sets and Systems. 125: 1-22.

G. Chen & T. T. Pham (1999). Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy Control Systems. CRC Press.

G. Chen & T. T. Pham (2006). Introduction to Fuzzy Systems. CRC Press.

D. E. Goldberg (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.

J. H. Holland (1975). Adaptation in Natural and Artificial Systems. MIT Press.

Y. H. Joo, H. S. Hwang, K. B. Kim & K. B. Woo (1997). Fuzzy system modeling by fuzzy partition and GA hybrid schemes. Fuzzy Sets and Systems. 86: 279-288.

J. Liska & S. S. Melsheimer (1994). Complete design of fuzzy logic systems using genetic algorithms. Proc. of IEEE Conf. on Fuzzy Systems. 1377-1382.

K. F. Man, K. S. Tang, S. Kwong & W. A. Halang (1997). Genetic Algorithms for Control and Signal Processing. Springer.

B. Soucek & T. I. Group (1992). Dynamic Genetic and Chaotic Programming. Wiley.

W. Spears & V. Anand (1990). The use of crossover in genetic programming. NRL Technical Report, AI Center, Naval Research Labs, Washington D. C.

T. Takagi & M. Sugeno (1985). Fuzzy identification of systems and its applications to modeling and control. IEEE Trans. on Systems, Man and Cybernetics. 15: 116-132.

L. X. Wang & J. M. Mendel (1996). Generating fuzzy rules by learning from examples. IEEE Trans. on Systems, Man and Cybernetics. 22: 1414-1427.

L. A. Zadeh (1973). Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans. on Systems, Man and Cybernetics. 3: 28-44.

A. M. S. Zalzala & P. J. Fleming (1997). Genetic Algorithms in Engineering Systems. IEE Press.

KEY TERMS

Fuzzy Rule: A logical rule established based on fuzzy logic.

Fuzzy System: A system formulated and described by fuzzy set-based real-valued functions.

Genetic Algorithm: An optimization scheme based on biological genetic evolutionary principles.

Least-Squares Algorithm: An optimization scheme that minimizes the sum of the squares of the approximation errors.

Parameter Identification: Find appropriate parameter values in a mathematical model.

Structure Identification: Find a mathematical representation of the unknown system’s structure.

System Modeling: A mathematical formulation of an unknown physical system or process.


Gene Regulation Network Use for Information Processing

Enrique Fernandez-Blanco
University of A Coruña, Spain

J. Andrés Serantes
University of A Coruña, Spain

INTRODUCTION

Every organism, from the unicellular to the most complex pluricellular one, needs to process the signals from its environment in order to survive. Computer science has already taken note of this fact, as artificial neural networks (ANNs) demonstrate. This computational tool is based on the nervous system of animals, but nervous cells are not the only cells that process information in an organism. Every cell has to process the development and functioning plan encoded in its DNA, and each cell executes this program in parallel with the others. Another interesting characteristic of natural cells is that they form systems that are tolerant to partial failures: small errors do not induce a global collapse of the system. The present work proposes a model that is based on DNA information processing, adapted to general information processing. This model can be framed within a set of techniques called Artificial Embryogeny (Stanley K. & Miikkulainen R. 2003), which adapt characteristics of biological cells to solve different problems.

BACKGROUND

The Evolutionary Computation (EC) field has given rise to a set of models that are grouped under the name of Artificial Embryogeny (AE), a term first introduced by Stanley and Miikkulainen (Stanley K. & Miikkulainen R. 2003). This group refers to all the models that try to apply certain characteristics of biological embryonic cells to computer problem solving, i.e. self-organisation, failure tolerance, and parallel information processing.

The work on AE follows two points of view. On the one hand there are the grammatical models based on L-systems (Lindenmayer A. 1968), which take a top-down approach to the problem. On the other hand there are the chemical models based on Turing’s ideas (Turing A. 1952), which take a bottom-up approach.

The grammatical approach has sometimes been used to study the evolution of ANNs, which is known as neuroevolution. The first neuroevolution system was developed by Kitano (Kitano, H. 1990). In this work Kitano showed that it was possible to evolve the connectivity matrix of an ANN through a set of rewrite rules. Another remarkable work is the application of L-systems by Hornby and Pollack (Hornby, G. S. & Pollack J. B. 2002), who simultaneously evolved the body morphologies and the neural networks of artificial creatures in a simulated 3D physical environment. Finally, the works carried out by Gruau (Gruau F. 1994) use grammar trees to encode steps in the development of a neural network from a single ancestor cell.

In the chemical approach, the starting point of this field can be found in the modelling of gene regulatory networks performed by Kauffman in 1969 (Kauffman S.A. 1969). After that, several works were carried out on subjects such as the complex behaviour generated by the fact that the differential expression of certain genes has a cascade influence on the expression of others (Mjolsness E., Sharp D.H., & Reinitz J. 1995). Among the works on gene regulatory networks, the most relevant models are the following: the Kumar and Bentley model (Kumar S. & Bentley P.J 2003), which uses the theory of fractal proteins (Bentley, P.J., Kumar, S. 1999) for the calculation of protein concentration; the Eggenberger model (Eggenberger P. 1996), which uses the concepts of cellular differentiation and cellular movement to determine cell connections; and the work of Dellaert and Beer (Dellaert F. & Beer R.D. 1996), who propose a model that incorporates the idea of biological operons to control the model expression, where the function assumes the mathematical meaning of a Boolean function.



GENETIC REGULATORY NETWORK MODEL

The cells of a biological system are mainly determined by the DNA strand, the genes, and the proteins contained in the cytoplasm. The DNA is the structure that holds the gene-encoded information that is needed for the development of the system. The genes are activated or transcribed thanks to the protein-shaped information that exists in the cytoplasm, and consist of two main parts: the sequence, which identifies the protein that will be generated if the gene is transcribed, and the promoter, which identifies the proteins that are needed for gene transcription. Another remarkable aspect of biological genes is the difference between constitutive genes and regulating genes. The latter are transcribed only when the proteins identified in the promoter part are present. The constitutive genes are always transcribed, unless inhibited by the presence of the proteins identified in the promoter part, which then act as gene repressors. The present work has tried to partially model this structure with the aim of fitting some of its abilities into a computational model; the resulting system has a structure similar to the one above, which is detailed in the next section.

Proposed Model

Various model variants were developed on the basis of biological concepts. The proposed artificial cellular system is based on the interaction of artificial cells by means of messages that are called proteins. These cells can divide themselves, die, or generate proteins that will act as messages for themselves as well as for neighbour cells. The system is expected to exhibit a global information-processing behaviour. Such behaviour would emerge from the information encoded in a set of variables of the cell that, in analogy with the biological cells, will be named genes.

The central element of our model is the artificial cell. Every cell holds binary string-encoded information that regulates its functioning. Following the biological analogy, this string will be called DNA. The cell also has a structure for the storage and management of the proteins generated by the cell itself and those received from neighbouring cells; following the biological model, this structure is called cytoplasm.

The DNA of the artificial cell consists of functional units that are called genes. Each gene encodes a protein or message (produced by the gene). The structure of a gene has four parts (see Figure 1):

• Sequence: the binary string that corresponds to the protein that the gene encodes.
• Promoters: the gene area that indicates the proteins that are needed for the gene’s transcription.
• Constituent: a bit that identifies whether the gene is constituent or regulating.
• Activation percentage (binary value): the minimal concentration percentage of promoter proteins inside the cell that causes the transcription of the gene.

Figure 1. Structure of a system gene (diagram labels: DNA, Gene, Constituent, Promoters, Sequence, Activation, Proteins)

The transcription of the encoded protein occurs when the promoters of a non-constituent gene appear at a certain rate in the cellular cytoplasm. On the other hand, the constituent genes are expressed until such expression is inhibited by the rate at which their promoters are present. The other fundamental element, which keeps and manages the proteins that are received or produced by the artificial cell, is the cytoplasm. The stored proteins have a certain lifetime before they are erased. The cytoplasm checks which and how many proteins are needed for the cell to activate the DNA genes, and as such responds to all the cellular requirements for the concentration of a given type of protein. The cytoplasm also extracts the proteins from the structure in case they are needed for a gene transcription.
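The gene structure and the transcription rule described above can be sketched in a few lines. This is only an illustration under stated assumptions and not the authors' implementation: the cytoplasm is modelled as a hypothetical dictionary from protein strings to counts, the promoter rate is assumed to be the fraction of stored proteins that match the promoters, and protein lifetimes and protein consumption are omitted.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    sequence: str      # protein produced when the gene is transcribed
    promoters: tuple   # proteins referred to by the promoter region
    constituent: bool  # True: expressed unless inhibited by its promoters
    activation: float  # minimal fraction of promoter proteins in the cytoplasm

def promoter_rate(gene, cytoplasm):
    """Fraction of the stored proteins that match the gene's promoters (assumed measure)."""
    total = sum(cytoplasm.values())
    if total == 0:
        return 0.0
    return sum(cytoplasm.get(p, 0) for p in gene.promoters) / total

def transcribed(gene, cytoplasm):
    """Regulating genes need their promoters; constituent genes are inhibited by them."""
    present = (bool(gene.promoters)
               and all(cytoplasm.get(p, 0) > 0 for p in gene.promoters)
               and promoter_rate(gene, cytoplasm) >= gene.activation)
    return (not present) if gene.constituent else present

regulating = Gene("1000", ("1001", "0010"), False, 0.5)
constitutive = Gene("1000", ("1001",), True, 0.5)
rich = {"1001": 3, "0010": 1}   # both promoter proteins are stored
poor = {"0010": 1}              # protein 1001 is missing
```

With the `rich` cytoplasm the regulating gene is transcribed and the constituent gene is inhibited; with the `poor` cytoplasm the situation is reversed.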



The Information Processing Capacities

Biological cells, besides generating structures, work as small processors that handle information in parallel with the remaining cells. The information that they process comes from their own generation and from their environment. On the basis of this fact, and although the present work has so far explored the structure-generation capabilities of the model, an operation set with a Boolean algebra-like structure might be defined using the gene and protein structure. The space for the definition of the operations would be the presence or absence of certain proteins in the system, whereas the operation result would be the protein encoded in the gene. The AND operation (see Figure 2) would be modelled with a gene that needs all the proteins of its promoters for its expression. The OR operation would be modelled with two genes that, despite their different promoters, result in the same protein. Finally, the NOT operation would be modelled with the constituent part, which inverts the behaviour of the gene: the presence of the promoter proteins would imply the absence of the gene's resulting protein in the system. This behaviour is similar to that of gene regulatory networks (Kauffman S.A. 1969). Artificial Neural Networks (ANNs) can be configured to carry out these processing tasks. This analogous functioning seems to indicate that the system could execute more complex tasks, as ANNs do (Hassoun M.H. 1995).

Figure 2. Logical operators match

Operator   Constituent   Promoter   Sequence
AND        False         A, B       C
OR         False         A          C
OR         False         B          C
NOT        True          A          C

(The figure also gives the truth table of each operator, with inputs A and B and output C.)

FUTURE TRENDS

746

The final objective of this group is to develop an artificial model which is based on the biologically model with a processing information capacity similar to the ANN. In order to archive this objective some simple tests have been developed to check the functioning of the model. The result of these tests show that is possible to process information using the gene regulatory network as the basing system. From this point of development, the next steps of development must go in order to develop more complex task and to study the functioning of the model. Other objective for future works can be the combination of the process information capacities of the model with the generating structure capacities presented in (Fernández-Blanco E., Dorado J., Rabuñal J.R., Gestal M. & Pedreira N. 2007).
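The Boolean-style gene operations described under "The Information Processing Capacities" can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the authors' implementation: protein concentrations, activation percentages and protein lifetimes are ignored, a protein is simply present or absent, and the names `Gene` and `express` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Simplified gene: promoter proteins, produced protein, constituent bit."""
    promoters: frozenset   # proteins that regulate this gene
    sequence: str          # protein produced when the gene is transcribed
    constituent: bool      # True: expressed unless inhibited by its promoters


def express(genes, proteins):
    """Return the proteins produced by one transcription step.

    Assumption: a gene's promoters count as present only when all of
    them are in the cytoplasm (no concentration thresholds).
    """
    present = set(proteins)
    produced = set()
    for gene in genes:
        promoters_present = gene.promoters <= present
        if gene.constituent:
            # Constituent gene: expressed until its promoters inhibit it.
            active = not promoters_present
        else:
            # Regulating gene: expressed only when its promoters are present.
            active = promoters_present
        if active:
            produced.add(gene.sequence)
    return produced


# Gene configurations corresponding to the three operators of Figure 2.
AND = [Gene(frozenset("AB"), "C", False)]
OR = [Gene(frozenset("A"), "C", False), Gene(frozenset("B"), "C", False)]
NOT = [Gene(frozenset("A"), "C", True)]
```

For instance, `express(AND, {"A"})` yields no protein because B is missing, while `express(NOT, set())` yields C because nothing inhibits the constituent gene.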

CONCLUSION In this work some properties of biological cells have been adapted to an artificial model. In particular, the gene regulatory network idea was adapted to information processing. This adaptation is based on using the transcription rule to define a Boolean algebra-like structure. As a result, the model can now be used to develop information processing tests. Finally, this new way of generating information processing networks still requires many tests and studies before it is established as a consolidated technique for information processing.

REFERENCES
Bentley, P.J. & Kumar, S. (1999) Three ways to grow designs: A comparison of three embryogenies for an evolutionary design problem. In Proceedings of Genetic and Evolutionary Computation.
Dellaert F. & Beer R.D. (1996) A Developmental Model for the Evolution of Complete Autonomous Agents. In From animals to animats: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, Massachusetts, September 9-13, pp. 394-401, MIT Press.
Eggenberger P. (1996) Cell Interactions as a Control Tool of Developmental Processes for Evolutionary Robotics. In From animals to animats: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, Massachusetts, September 9-13, pp. 440-448, MIT Press.
Fernández-Blanco E., Dorado J., Rabuñal J.R., Gestal M. & Pedreira N. (2007) A New Evolutionary Computation Technique for 2D Morphogenesis and Information Processing. WSEAS Transactions on Information Science & Applications vol. 4(3) pp. 600-607, WSEAS Press.
Gruau F. (1994) Neural network synthesis using cellular encoding and the genetic algorithm. Doctoral dissertation, Ecole Normale Superieure de Lyon, France.
Hassoun M.H. (1995) Fundamentals of Artificial Neural Networks. University of Michigan Press, MA, USA.
Hornby, G. S. & Pollack J. B. (2002) Creating high-level components with a generative representation for body brain evolution. Artificial Life vol. 8 issue 3.
Kauffman, S.A. (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology 22 pp. 437-467.
Kitano, H. (1990) Designing neural networks using genetic algorithm with dynamic graph generation system. Complex Systems vol. 4 pp. 461-476.
Kumar, S. & Bentley P.J. (editors) (2003) On Growth, Form and Computers. Academic Press. London UK.
Lindenmayer, A. (1968) Mathematical models for cellular interaction in development: Part I and II. Journal of Theoretical Biology. Vol. 18 pp. 280-299, pp. 300-315.
Mjolsness, E., Sharp, D.H., & Reinitz, J. (1995) A Connectionist Model of Development. Journal of Theoretical Biology 176: 291-300.
Stanley, K. & Miikkulainen, R. (2003) A Taxonomy for Artificial Embryogeny. In Proceedings Artificial Life 9, pp. 93-130. MIT Press.
Turing, A. (1952) The chemical basis of morphogenesis. Philosophical Transactions of the Royal Society B, vol. 237, pp. 37-72.

KEY TERMS
Artificial Cell: Each of the elements that process the orders codified into the DNA.
Artificial Embryogeny: Under this term are all the processing models which use biological development ideas as inspiration.
Cytoplasm: Part of an artificial cell which is responsible for managing the protein-shaped messages.
DNA: Set of rules which are responsible for the cell behaviour.
Gene: Each of the rules which codifies one action of the cell.
Gene Regulatory Network: Term that names the connection between the different genes of a DNA. The connection identifies the genes that are necessary for the transcription of other ones.
Protein: This term identifies every kind of message that an artificial cell receives.


Genetic Algorithm Applications to Optimization Modeling Pi-Sheng Deng California State University at Stanislaus, USA

INTRODUCTION Genetic algorithms (GAs) are stochastic search techniques based on the concepts of natural population genetics. They explore a huge solution space to identify optimal or near-optimal solutions (Davis, 1991)(Holland, 1992)(Reeves & Rowe, 2003), and are more likely to avoid the local optima problem than traditional gradient-based hill-climbing optimization techniques when solving complex problems. In essence, GAs are a type of reinforcement learning technique (Grefenstette, 1993), able to improve solutions gradually on the basis of previous solutions. GAs are characterized by their ability to combine candidate solutions to exploit efficiently a promising area in the solution space while stochastically exploring new search regions with expected improved performance. Successful applications of this technique are frequently reported across various industries and businesses, including function optimization (Ballester & Carter, 2004)(Richter & Paxton, 2005), financial risk and portfolio management (Shin & Han, 1999), market trading (Kean, 1995), machine vision and pattern recognition (Vafaie & De Jong, 1998), document retrieval (Gordon, 1988), network topological design (Pierre & Legault, 1998)(Arabas & Kozdrowski, 2001), job shop scheduling (Özdamar, 1999), and optimization of an operating system's dynamic memory configuration (Del Rosso, 2006), among others. In this research we introduce the concept and components of GAs, and then apply the GA technique to the modeling of the batch selection problem of flexible manufacturing systems (FMSs). The model developed in this paper serves as the basis for the experiment in Deng (2007).

GENETIC ALGORITHMS GAs are simulation techniques proposed by John Holland in the 1960s (Holland, 1992). Basically, GAs solve problems by maintaining and modifying a population of candidate solutions through the application of genetic operators. During this process, beneficial changes to parent solutions are combined into their offspring in developing optimal or near-optimal solutions for the given task. Intrinsically, GAs explore multiple potentially promising regions in the solution space at the same time, and switch stochastically from one region to another for performance improvement. According to Holland (1992), regions in the solution space can be defined by syntactic patterns of solutions, and each pattern is called a schema. A schema represents the pattern of common attributes or features of the solutions in the same region. Let $\Sigma$ be an alphabet of symbols. A string over an alphabet is a finite sequence of symbols from the alphabet. An n-ary schema is defined as a string in $(\Sigma \cup \{\#\})^n$, where $\# \notin \Sigma$ is used as a wildcard denoting any symbol in $\Sigma$. Conceptually, n-ary schemata can be regarded as defining hypersurfaces of an n-dimensional hypercube that represents the space of all n-attribute solutions. Individual solutions in the same region can be regarded as instances of the representing schema, and an individual solution can belong to multiple schemata at the same time. In fact, an n-attribute solution is a member of $2^n$ different schemata. Therefore, evaluating a solution has an effect similar to sampling $2^n$ regions (i.e., schemata) at the same time, and this is the famous implicit parallelism of genetic search. A population of $M$ solutions will contain at least $2^n$ and at most $M \cdot 2^n$ schemata. Even for modest values of $n$ and $M$, there will be a large number of schemata available for processing in the population. GAs perform an implicit parallel search through the space of possible schemata by performing an explicit parallel search through the space of individual solutions.
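The schema counts quoted above can be checked with a short Python sketch. It is only an illustration: the helper names `matches` and `schemata_of` are hypothetical, and solutions are assumed to be plain strings over the alphabet.

```python
from itertools import product

WILDCARD = "#"


def matches(schema, solution):
    """True if the solution is an instance of the schema."""
    return len(schema) == len(solution) and all(
        s == WILDCARD or s == x for s, x in zip(schema, solution)
    )


def schemata_of(solution):
    """All schemata the solution belongs to: each position is either kept
    or replaced by the wildcard, giving 2**n schemata for n attributes."""
    choices = [(symbol, WILDCARD) for symbol in solution]
    return {"".join(combo) for combo in product(*choices)}


def schemata_in_population(population):
    """Union of the schemata sampled by every member of the population."""
    covered = set()
    for solution in population:
        covered |= schemata_of(solution)
    return covered
```

For the two 3-bit solutions `"101"` and `"100"`, each belongs to 8 schemata and the population covers 12 distinct ones, which lies between the bounds $2^n = 8$ and $M \cdot 2^n = 16$.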
The problem solving process of GAs follows a five-phase operational cycle: generation, evaluation, selection, recombination (or crossover), and mutation. At first a population of candidate solutions is generated. A fitness function or objective function is then defined, and each candidate solution in the population is evaluated to determine its performance or fitness. Based on the relative fitness value, two candidate solutions are selected probabilistically as parents. Recombination is then applied probabilistically to the two parents to form two offspring, and each of the offspring solutions contains some characteristics from its parent solutions. After this, mutation is applied sparingly to components of each offspring solution. The newly generated offspring are then used to replace the low-fitness members in the population. This process is repeated until a new population is formed. Through the above iterative cycles of operations, GAs are able to develop better solutions through progressive generations. To prepare for the investigation of the effects of genetic operations in the sequel to the current research, we apply the GA technique to the optimization modeling of manufacturing systems in the next section.
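The five-phase cycle can be sketched as a small, generic Python loop. This is a minimal illustration under assumptions that are not taken from the article: solutions are lists of bits, fitness values are non-negative, each iteration replaces the two least fit members, and the function name `run_ga` and all default parameter values are hypothetical.

```python
import random


def run_ga(fitness, n_bits, pop_size=20, generations=50,
           crossover_rate=0.8, mutation_rate=0.01, seed=0):
    """Generic bit-string GA; returns the fittest member of the final population."""
    rng = random.Random(seed)
    # Phase 1: generation of a random initial population.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Phase 2: evaluation of every candidate solution.
        scores = [fitness(ind) for ind in pop]
        # Phase 3: fitness-proportional selection of two parents
        # (a tiny offset avoids an all-zero weight vector).
        weights = [s + 1e-9 for s in scores]
        p1, p2 = rng.choices(pop, weights=weights, k=2)
        c1, c2 = p1[:], p2[:]
        # Phase 4: one-point recombination, applied probabilistically.
        if rng.random() < crossover_rate:
            k = rng.randint(1, n_bits - 1)
            c1, c2 = p1[:k] + p2[k:], p2[:k] + p1[k:]
        # Phase 5: sparing bit-flip mutation of each offspring.
        for child in (c1, c2):
            for i in range(n_bits):
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
        # The offspring replace the two lowest-fitness members.
        worst = sorted(range(pop_size), key=scores.__getitem__)[:2]
        for idx, child in zip(worst, (c1, c2)):
            pop[idx] = child
    return max(pop, key=fitness)
```

As a usage example, `run_ga(sum, 16)` searches for a 16-bit string with as many ones as possible, using the built-in `sum` as a toy fitness function.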

A GA-BASED BATCH SELECTION SYSTEM Batch selection is one of the most critical tasks in the development of a master production plan for flexible manufacturing systems (FMSs). In the manufacturing process, each product requires processing by different sets of tools on different machines, with different operations performed in a certain sequence. Each machine has limited space for mounting tools and a limited amount of available processing time. Under these resource constraints, choosing an optimal batch of products to be manufactured in a continuous operational process, with the purpose of maximizing machine utilization or profits, makes the batch selection decision a very hard problem. While this problem is usually manageable for a small number of products, it quickly becomes intractable as the number of products grows. The time required to solve the problem exhaustively grows exponentially with the number of products to be manufactured. Batch selection affects all the subsequent decisions in job shop scheduling for satisfying the master production plan, and holds the key to the efficient utilization of resources in generating production plans for fulfilling production orders. In our formulation, we use the following notation:

• $M$: the cardinality of the set of machines available
• $T$: the cardinality of the set of tools available
• $P$: the cardinality of the set of products to be manufactured
• $MachineUtilization$: the function of total machine utilization
• $processing\_time_{product,tool,machine}$: the time needed to manufacture product $product$ using tool $tool$ on machine $machine$
• $available\_time_{machine}$: the total available processing time on machine $machine$
• $capacity_{machine}$: the total number of slots available on machine $machine$
• $machine$, $tool$, $product$: indices for machines, tools, and products to be manufactured, respectively
• $slot_{tool}$: the number of slots required by tool $tool$
• $quantity_{product}$: the quantity of product $product$ to be manufactured in a shift
• $Q_{product}$: the quantity of product $product$ ordered by customers as specified in the production table

Fitness (or Objective) Function The objective is to identify a batch of products to be manufactured so that the total machine utilization rate will be maximized (see Exhibit A). The objective function is to be maximized subject to the following resource constraints:

1. Machine capacity constraint (see Exhibit B). The function f() in Exhibit B is used to determine if tool tool needs to be mounted on machine machine for the processing of the current batch of products.
2. Machine time constraint (see Exhibit C)
3. Non-negativity and integer constraints
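The objective of Exhibit A and the constraints of Exhibits B and C can be sketched in Python as follows. This is an illustrative reading of the formulation, not the author's code: the nested-list layout `processing_time[product][tool][machine]`, the 0-based indices and the function names are assumptions made for the example.

```python
def machine_utilization(quantity, processing_time, available_time):
    """Exhibit A: total processing time used divided by total time available.

    Assumed layout: processing_time[p][t][m] is the time needed for one
    unit of product p using tool t on machine m.
    """
    used = sum(
        processing_time[p][t][m] * quantity[p]
        for p in range(len(quantity))
        for t in range(len(processing_time[p]))
        for m in range(len(available_time))
    )
    return used / sum(available_time)


def is_feasible(quantity, processing_time, available_time, capacity, slot, Q):
    """Check the capacity, time, and non-negativity/integer constraints."""
    P, T, M = len(quantity), len(slot), len(available_time)
    # Constraint 3: integer quantities between 0 and the ordered quantity Q.
    for p, q in enumerate(quantity):
        if not isinstance(q, int) or q < 0 or q > Q[p]:
            return False
    for m in range(M):
        # Processing time that each tool accumulates on machine m.
        load = [
            sum(processing_time[p][t][m] * quantity[p] for p in range(P))
            for t in range(T)
        ]
        # Constraint 1 (Exhibit B): a tool occupies its slots only if it is
        # actually used on the machine, i.e. f(load) = 1 when load > 0.
        if sum(slot[t] for t in range(T) if load[t] > 0) > capacity[m]:
            return False
        # Constraint 2 (Exhibit C): total processing time on the machine.
        if sum(load) > available_time[m]:
            return False
    return True
```

With two products, two tools and one machine, for example `processing_time = [[[2], [0]], [[0], [3]]]` and `available_time = [100]`, the batch `[10, 20]` uses 80 of the 100 time units, giving a utilization of 0.8.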

Exhibit A.

$$\text{Maximize} \quad MachineUtilization(quantity_1, quantity_2, \ldots, quantity_P) = \frac{\sum_{machine=1}^{M} \sum_{tool=1}^{T} \sum_{product=1}^{P} processing\_time_{product,tool,machine} \cdot quantity_{product}}{\sum_{machine=1}^{M} available\_time_{machine}}$$

Exhibit B.

$$\sum_{tool=1}^{T} slot_{tool} \cdot f\left(\sum_{product=1}^{P} processing\_time_{product,tool,machine} \cdot quantity_{product}\right) \le capacity_{machine}, \quad machine = 1, 2, \ldots, M,$$

$$\text{where } f(y) = \begin{cases} 1, & \text{if } y > 0 \\ 0, & \text{if } y = 0 \end{cases}$$

Exhibit C.

$$\sum_{product=1}^{P} \sum_{tool=1}^{T} processing\_time_{product,tool,machine} \cdot quantity_{product} \le available\_time_{machine}, \quad machine = 1, 2, \ldots, M$$

$$quantity_{product} \ge 0, \ quantity_{product} \le Q_{product}, \text{ and } quantity_{product} \text{ is an integer, for } product = 1, 2, \ldots, P$$

Encoder/Decoder The Encoder/Decoder is a representation scheme used to determine how the problem is structured in the GA system. The way in which candidate solutions are encoded is a central factor in the success of GAs (Mitchell, 1996). Generally, the solution encoding can be defined over an alphabet $\Sigma$ which might consist of binary digits, continuous numbers, integers, or symbols. However, choosing the best encoding scheme is almost tantamount to solving the problem itself (Mitchell, 1996). In this research, our GA system is mainly based on Holland's canonical model (Holland, 1992), which uses binary encoding, one of the most commonly used encoding schemes in practice. A candidate solution for the batch selection task is a vector of quantities to be manufactured for $P$ products. Let the entire solution space be denoted as $solution$ (see Exhibit D). The encoding function encodes the quantity to be produced for each product as an l-bit binary string, and then forms a concatenation of the strings for the $P$ products


which are to be included in a production batch.

Exhibit D.

$$solution = \prod_{product=1}^{P} [0, 1, \ldots, Q_{product}] = \{(quantity_1, \ldots, quantity_P) \in (\{0\} \cup \mathbb{N})^P \mid 0 \le quantity_{product} \le Q_{product}, \ quantity_{product} \text{ is an integer, and } product = 1, 2, \ldots, P\}.$$

Each candidate solution $(quantity_1, \ldots, quantity_P)$ is encoded as a string of length $lP$ over the binary alphabet $\Sigma = \{0, 1\}$. Each encoded l-bit string has a value equal to

$$\begin{cases} quantity_{product}, & \text{if } \max_{product=1,2,\ldots,P}\{Q_{product}\} \le 2^l - 1 \\ \left\lceil \dfrac{quantity_{product} \, (2^l - 1)}{\max_{product=1,2,\ldots,P}\{Q_{product}\}} - 0.5 \right\rceil, & \text{otherwise.} \end{cases}$$

In the above formula, $2^l - 1$ is the value of the l-bit string $1 \cdots 1$ (l ones), and $\lceil \cdot \rceil$ is the ceiling function. For example, assume there are only two products to be selected in a production batch, with 200 units as the largest possible quantity to be manufactured for each product. A candidate solution consisting of quantities 100 and 51 for products 1 and 2 respectively will be represented by a 16-bit string as 0110010000110011, with the first 8 bits representing product 1 and the second 8 bits representing product 2. After a new solution string is generated, it is decoded back to the format needed for the computation of the objective function and for the check of solution feasibility. Let each l-bit segment of a solution string be denoted as $string$, with $string[i]$ as the value of the ith bit in the l-bit segment. The decoding function converts each l-bit string according to the following formula:

$$\begin{cases} \sum_{i=1}^{l} string[i] \cdot 2^{i-1}, & \text{if } \max_{product=1,2,\ldots,P}\{Q_{product}\} \le 2^l - 1 \\ \left\lceil \left(\sum_{i=1}^{l} string[i] \cdot 2^{i-1}\right) \cdot \dfrac{\max_{product=1,2,\ldots,P}\{Q_{product}\}}{2^l - 1} - 0.5 \right\rceil, & \text{otherwise.} \end{cases}$$

Five-Phase Genetic Operations Our system follows the generation-evaluation-selection-crossover-mutation cycle in searching for appropriate solution strings for the batch selection task. It starts with generating an initial population, $Pop$, of $pop\_size$ candidate solution strings at random. In each iteration of the operational cycle, each candidate solution string, $s_i$, in the current population is evaluated by the fitness function. Candidate solution strings in the current population are selected probabilistically on the basis of their fitness values as seeds for generating the next generation. The purpose of selection is to generate offspring of high fitness value on the basis of the fitter members in the current population. Selection is thus the mechanism that helps our GA system exploit a promising region in the solution space.

There are several fitness-based schemes for the selection process: roulette-wheel selection, rank-based selection, tournament selection, and elitist selection (Goldberg, 1989)(Michalewicz, 1994). The first three methods randomly select candidate solution strings for reproduction on the basis of either the fitness value or the rank of individual strings. The best members of the current population might be lost if they are not selected to reproduce or if they are altered by crossover (i.e., recombination) or mutation. The elitist selection strategy serves to retain some of the fittest individuals from the current population. Elitist selection retains a limited number of “elite” solution strings, i.e., strings with the best fitness values, for passing to the next generation without any modification. A fraction called the “generation gap” is used to specify the proportion of the population to be replaced by offspring strings after each iteration. Our GA system retains copies of the first $(1 - generation\_gap) \cdot pop\_size$ “elitist” members of $Pop$ for the formation of the next population, $Pop_{new}$. For generating the rest of the members of $Pop_{new}$, the GA module probabilistically selects

$$\frac{generation\_gap \cdot pop\_size}{2}$$

pairs of solution strings from $Pop$ for generating offspring strings. The probability of selecting a solution string, $s_i$, from $Pop$ is given by

$$\Pr(s_i) = \frac{Fitness(s_i)}{\sum_{j=1}^{pop\_size} Fitness(s_j)}.$$

Let the cumulative probability of individual solution strings in the population be called $C_i$, with

$$C_i = \sum_{j=1}^{i} \Pr(s_j),$$

for $i = 1, 2, \ldots, pop\_size$, and $C_0 = 0$. The solution string $s_i$ will be selected for reproduction if $C_{i-1} < rand(0,1) \le C_i$.

In addition to exploiting a promising solution region via the selection process, we also need to explore other promising regions for possibly better solutions. Exploitation without exploration will cause degeneration of a population of solution strings, and might cause the local optima problem for the system. The capability of maintaining a balance between exploitation and exploration is a major strength of the GA approach over traditional optimization techniques. The exploration function is achieved by the crossover and mutation operators. These two operators generate offspring solutions which belong to new schemata, and thus allow our system to explore other promising regions in the solution space. This process also allows our system to improve its performance stochastically.
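The elitist retention and roulette-wheel selection described above can be sketched in Python as follows. This is a sketch under assumptions: fitness values are non-negative and not all zero, indices are 0-based, "the first" elitist members are taken to mean the members with the highest fitness, and all function names are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def cumulative_probabilities(fitnesses):
    """C_i = sum of Pr(s_j) for j <= i, with Pr(s_i) = Fitness(s_i) / total."""
    total = sum(fitnesses)
    return list(accumulate(f / total for f in fitnesses))


def roulette_pick(cumulative, r):
    """0-based index i such that C_(i-1) < r <= C_i."""
    # The clamp guards against rounding that leaves the last C_i just below 1.
    return min(bisect_left(cumulative, r), len(cumulative) - 1)


def elite(pop, fitnesses, generation_gap):
    """Copies of the (1 - generation_gap) * pop_size fittest members."""
    n_elite = round((1 - generation_gap) * len(pop))
    ranked = sorted(zip(fitnesses, pop), key=lambda fp: fp[0], reverse=True)
    return [ind for _, ind in ranked[:n_elite]]


def select_parents(pop, fitnesses, generation_gap, rng):
    """generation_gap * pop_size / 2 parent pairs drawn by roulette wheel."""
    n_pairs = int(generation_gap * len(pop) / 2)
    cumulative = cumulative_probabilities(fitnesses)

    def pick():
        return pop[roulette_pick(cumulative, rng.random())]

    return [(pick(), pick()) for _ in range(n_pairs)]
```

With fitness values 1, 3 and 6, the cumulative probabilities are approximately 0.1, 0.4 and 1.0, so a random draw of 0.25 selects the second string.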

Crossover recombines good solution strings in the current population and gradually fills the population with schemata of high fitness values. Crossover is commonly regarded as the most distinguishing operator of GAs, and it usually interacts in a highly intractable manner with the fitness function, encoding, and other details of a GA (Mitchell, 1996). Though various crossover operators have been proposed, there is no general conclusion on when to use which type of crossover (Michalewicz, 1994)(Mitchell, 1996). In this paper, we adopt the standard one-point crossover for our GA system. For each pair of solution strings selected for reproduction, the value of $crossover\_rate$ determines the probability of their recombination. A position in both candidate solution strings is randomly selected as the crossover point. The parts of the two parent strings from the crossover position onward are exchanged to form two offspring. Let $k$ be the crossover point randomly generated from a uniform distribution ranging from 1 to $lP$, where $lP$ is the length of a solution string. Let $s_i = (x_1, x_2, \ldots, x_{k-1}, x_k, \ldots, x_{lP})$ and $s_j = (y_1, y_2, \ldots, y_{k-1}, y_k, \ldots, y_{lP})$ represent a pair of candidate solution strings selected for reproduction. Based on these two strings, the crossover operator generates two offspring $s_i' = (x_1', x_2', \ldots, x_{lP}')$ and $s_j' = (y_1', y_2', \ldots, y_{lP}')$, where

$$x_i' = \begin{cases} x_i, & \text{if } i < k \\ y_i, & \text{otherwise} \end{cases} \qquad y_i' = \begin{cases} y_i, & \text{if } i < k \\ x_i, & \text{otherwise.} \end{cases}$$

In other words, $s_i' = (x_1, x_2, \ldots, x_{k-1}, y_k, \ldots, y_{lP})$ and $s_j' = (y_1, y_2, \ldots, y_{k-1}, x_k, \ldots, x_{lP})$. These two offspring are then added to $Pop_{new}$. This offspring-generating process is repeated until $generation\_gap \cdot pop\_size$ offspring have been generated for $Pop_{new}$.

With selection and crossover alone, our system might occasionally develop a uniform population which consists of identical solution strings. This would blind our system to other possible solutions. Mutation, the other operator applied in the reproduction process, helps our system avoid the formation of a uniform population by introducing diversity into the population. It is generally believed that mutation alone does not advance the search for a solution, and it is usually considered to play a secondary role in the operation of GAs (Goldberg, 1989). Usually, mutation is applied only occasionally to alter the bit value of a string in a population. Let $mutation\_rate$ be the probability of mutation for each bit in a candidate solution string. For each offspring string, $s' = (x_1', x_2', \ldots, x_{lP}')$, generated by the crossover operator for the new population $Pop_{new}$, the mutation operator inverts each bit probabilistically:

$$x_i' = \begin{cases} 1 - x_i, & \text{if } rand(0,1) < mutation\_rate \\ x_i, & \text{otherwise.} \end{cases}$$

The probability of mutation for a candidate solution string is $1 - (1 - mutation\_rate)^{lP}$. The above processes constitute an operational cycle of our system. These operations are repeated until the termination criterion is reached, and the result is passed to the Decoder for decoding. The decoded result is then presented to the decision maker for further consideration in the final decision. If the current solution is not satisfactory to the decision maker, it can be modified by the decision maker and then entered into the GA system to initiate another run of the search process for satisfactory solutions.
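The one-point crossover and bitwise mutation operators can be sketched in Python as below. The sketch assumes that solution strings are lists of 0/1 integers and keeps the 1-based crossover point $k$ of the text; the function names are hypothetical.

```python
import random


def one_point_crossover(parent_a, parent_b, k):
    """Exchange positions k..lP (1-based k): positions before k are kept."""
    cut = k - 1  # convert the 1-based crossover point to a slice index
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def mutate(bits, mutation_rate, rng):
    """Invert each bit independently with probability mutation_rate."""
    return [1 - b if rng.random() < mutation_rate else b for b in bits]


def string_mutation_probability(mutation_rate, length):
    """Probability that at least one bit of a string of this length is inverted."""
    return 1 - (1 - mutation_rate) ** length
```

For example, crossing `[1, 1, 1, 1]` with `[0, 0, 0, 0]` at `k = 3` gives `[1, 1, 0, 0]` and `[0, 0, 1, 1]`. With a per-bit rate of 0.01, a 16-bit string is altered with a probability of roughly 0.15, which shows how quickly string-level mutation grows with string length.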

FUTURE TRENDS AND CONCLUSION In this paper we designed a GA-based system for the batch selection problem of flexible manufacturing systems. In our design we adopted a binary encoding scheme, the elitist selection strategy, a single-point crossover strategy, and a uniform random mutation for the batch selection problem. The performance of GAs is usually influenced by various parameters and the complicated interactions among them, and there are several issues worth further investigation. With the availability of a larger pool of diverse schemata in a larger population, our GA system will have a broader view of the “landscape” (Holland, 1992) of the solution space, and is thus more likely to contain representative solutions from a large number of hyperplanes. This advantage gives GAs more chances of discovering better solutions in the solution space. However, Davis (1991) argues that the most effective population size is dependent upon the nature of the problem, the representation formalism, and the GA operators. In contrast, Schaffer et al. (1991) asserted that the best setting for population size is independent of the problem. In the sequel to this paper, we will conduct a sequence of experiments to systematically analyze the influence of the population size on GA performance, using the batch-selection model proposed in this paper, so that we can be more conclusive on the issue of the effective population size.

REFERENCES
Arabas, J., & Kozdrowski, S. (2001). Applying an Evolutionary Algorithm to Telecommunication Network Design. IEEE Transactions on Evolutionary Computation. (5)4, 309-322.
Ballester, P.J., & Carter, J.N. (2004). An Effective Real-Parameter Genetic Algorithm with Parent Centric Normal Crossover for Multimodal Optimisation. Proceedings of the 2004 GECCO. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer-Verlag. 901-913.
Davis, L. (Editor) (1991). Handbook of Genetic Algorithms. New York, NY: Van Nostrand Reinhold.
Del Rosso, C. (2006). Reducing Internal Fragmentation in Segregated Free Lists Using Genetic Algorithms. Proceedings of the 2006 ACM Workshop on Interdisciplinary Software Engineering Research.
Deng, P-S. (2007). A Study of the Performance Effect of Genetic Operators. Encyclopedia of Artificial Intelligence, Dopico, J.R.R., de la Calle, J.D. & Sierra, A.P. (Editors), Harrisburg, PA: IDEA.
Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley.
Gordon, M. (1988). Probabilistic and Genetic Algorithms for Document Retrieval. Communications of the ACM. (31)10, 1208-1218.
Grefenstette, J.J. (1993). Introduction to the Special Track on Genetic Algorithms. IEEE Expert. October, 5-8.
Holland, J. (1992). Adaptation in Natural and Artificial Systems. Cambridge, MA: MIT Press.
Kean, J. (1995). Genetic Algorithms for Market Trading. AI in Finance. Winter, 25-29.
Michalewicz, Z. (1994). Genetic Algorithms + Data Structures = Evolution Programs. New York, NY: Springer-Verlag.
Mitchell, M. (1996). An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press.
Özdamar, L. (1999). A Genetic Algorithm Approach to a General Category Project Scheduling Problem. IEEE Transactions on Systems, Man, and Cybernetics. (29)1, 44-59.
Pierre, S., & Legault, G. (1998). A Genetic Algorithm for Designing Distributed Computer Network Topologies. IEEE Transactions on Systems, Man, and Cybernetics. (28) 249-258.
Reeves, C.R., & Rowe, J.E. (2003). Genetic Algorithms - Principles and Perspectives. Boston, MA: Kluwer Academic.
Richter, J.N., & Paxton, J. (2005). Adaptive Evolutionary Algorithms on Unitation, Royal Road and Longpath Functions. Proceedings of the Fourth IASTED International Conference on Computational Intelligence, Calgary, Alberta, Canada.
Shin, K.S., & Han, I. (1999). Case-Based Reasoning Supported by Genetic Algorithms for Corporate Bond Rating. Expert Systems With Applications. (16)2, 85-95.
Vafaie, H., & De Jong, K.A. (1998). Feature Space Transformation Using Genetic Algorithms. IEEE Intelligent Systems. (13)2, 57-65.

KEY TERMS
Batch Selection: Selecting the optimal set of products to produce, with each product requiring a set of resources, under the system capacity constraints.
Fitness Functions: The objective function of the GA for evaluating a population of solutions.
Flexible Manufacturing Systems: A manufacturing system which maintains the flexibility of order of operations and machine assignment in reacting to planned or unplanned changes in the production process.
Genetic Algorithms: A stochastic search method which applies genetic operators to a population of solutions for progressively generating optimal or near-optimal solutions.
Genetic Operators: Selection, crossover, and mutation, for combining and refining solutions in a population.
Implicit Parallelism: A property of the GA which allows a schema to be matched by multiple candidate solutions simultaneously without even trying.
Landscape: A function plot showing the state as the “location” and the objective function value as the “elevation”.
Reinforcement Learning: A learning method which interprets feedback from an environment to learn optimal sets of condition/response relationships for problem solving within that environment.
Schemata: A general pattern of bit strings that is made up of 1, 0, and #, used as a building block for solutions of the GA.

Genetic Algorithms for Wireless Sensor Networks João H. Kleinschmidt State University of Campinas, Brazil

INTRODUCTION Wireless sensor networks (WSNs) consist of a large number of low-cost and low-power sensor nodes. Applications of sensor networks include environmental observation and the monitoring of disaster areas. Distributed evolutionary computing is a powerful tool that can be applied to WSNs, because these networks require algorithms that are capable of learning independently of the operation of other nodes and of using local information (Johnson, Teredesai & Saltarelli, 2005). Evolutionary algorithms must be designed for the resource constraints present in WSNs. This article describes how genetic algorithms can be used in WSN design in order to satisfy energy conservation and connectivity constraints.

BACKGROUND Recent advances in wireless communications and digital electronics have led to the implementation of low-power and low-cost wireless sensors. A sensor node must have components for sensing, data processing and communication. These devices can be grouped to form a sensor network (Akyildiz, Sankarasubramaniam & Cayirci, 2002) (Callaway 2003). The network protocols, such as formation algorithms, routing and management, must have self-organizing capabilities. In general, sensor networks differ from traditional wireless networks in several aspects: the number of sensor nodes can be very high; sensor nodes are prone to failures; sensor nodes are densely deployed; the topology of the network can change frequently; and sensor nodes are limited in computational capacity, memory and energy. The major challenge in the design of WSNs is the fact that energy resources are significantly more limited than in wired networks and other types of wireless networks. The batteries of the sensors in the network may be difficult to recharge or replace, which severely limits the communication and processing time of all sensors in the network. Thus, the main parameter to optimize is the network lifetime, or the time until a group of sensors runs out of energy. Another issue in WSN design is the connectivity of the network according to the selected communication protocol. Usually, the protocol follows the cluster-based architecture, where single-hop communication occurs between the sensors of a cluster and a selected cluster head sensor that collects all information obtained by the other sensors in its cluster. This architecture is shown in Figure 1. Since the purpose of the sensor network is the collection and management of measured data for some particular application, this collection must meet specific requirements depending on the type of data. These requirements are turned into application-specific parameters of the network.

Figure 1. Cluster-based sensor network (Clusters 1 to 3, each with sensor nodes and a cluster head, and a sink)



GENETIC ALGORITHMS FOR WIRELESS SENSOR NETWORKS

A WSN designer who takes all the design issues into account deals with more than one non-linear objective function or design criterion, all of which should be optimized simultaneously. The problem is therefore how to find many near-optimal, non-dominated solutions in a practically acceptable computational time (Jourdan & de Weck, 2004) (Weise, 2006) (Ferentinos & Tsiligiridis, 2007). There are several interesting approaches to such problems, but one of the most powerful heuristics, which is also appropriate for multi-objective optimization, is based on genetic algorithms (GA) (Ferentinos & Tsiligiridis, 2007). Genetic algorithms have been used in many fields of science to derive solutions for many types of problems (Goldberg, 1989) (Weise, 2006). They are particularly useful in applications involving design and optimization, where there are large numbers of variables and where procedural algorithms are either non-existent or extremely complicated (Khana, Liu & Chen, 2006) (Khana, Liu & Chen, 2007).

In nature, a species adapts to an environment because the individuals that are the fittest with respect to that environment have the best chance to reproduce, possibly creating even fitter children. This is the basic idea of genetic evolution. Genetic algorithms start with an initial population of random solution candidates, called individuals or chromosomes. In the case of sensor networks, the individuals are small programs that can be executed on sensor nodes (Wazed, Bari, Jaekel & Bandyopadhyay, 2007). Each individual may be represented as a simple string or array of genes, each of which contains a part of the solution. The values of genes are called alleles. As in nature, the population is refined step by step in a cycle of computing the fitness of its individuals, selecting the best individuals and creating a new generation derived from these.
A fitness function assigns a fitness value to each individual, based on how close the individual is to the optimal solution. Two randomly selected individuals, the parents, can exchange genetic information in a process called crossover to produce two new chromosomes known as children. After crossover, a process called mutation may also be applied. Mutation helps to restore lost genetic values when the population converges too fast. After the crossover and mutation processes, the individuals of the next generation are selected. Some of the poorest individuals of the generation can be replaced by the best individuals from the previous generation. This is called elitism, and it ensures that the best individual of the new generation is at least as fit as the best individual of the previous generation. The algorithm stops when a predetermined stopping criterion is met (Hussain, Matin & Islam, 2007).
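The cycle described above (fitness evaluation, selection, crossover, mutation, elitism, stopping criterion) can be sketched as follows. This is only a minimal illustration under stated assumptions: individuals are plain bit strings rather than programs for sensor nodes, parents are chosen by binary tournament, the stopping criterion is a fixed number of generations, and every name and default value (`evolve`, `p_cross`, `p_mut`, `n_elite`) is hypothetical rather than taken from the cited works.

```python
import random


def evolve(fitness, n_genes, pop_size=20, generations=50,
           p_cross=0.8, p_mut=0.05, n_elite=2, seed=0):
    """Maximize `fitness` over bit strings of length n_genes (n_genes >= 2)."""
    rng = random.Random(seed)
    # Initial population of random solution candidates (chromosomes).
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):  # stopping criterion: fixed generation count
        ranked = sorted(pop, key=fitness, reverse=True)
        best_history.append(fitness(ranked[0]))
        children = []
        while len(children) < pop_size - n_elite:
            # Assumption: binary tournament selection of each parent.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            c1, c2 = p1[:], p2[:]
            if rng.random() < p_cross:
                # One-point crossover: the parents exchange their tails.
                cut = rng.randint(1, n_genes - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):
                # Mutation: flip each allele with a small probability.
                for i in range(n_genes):
                    if rng.random() < p_mut:
                        child[i] = 1 - child[i]
                children.append(child)
        # Elitism: the best individuals of the previous generation survive.
        pop = ranked[:n_elite] + children[:pop_size - n_elite]
    return max(pop, key=fitness), best_history


# Toy usage: maximize the number of ones in a 16-bit string.
best, history = evolve(sum, n_genes=16)
```

Because the elite individuals are copied unchanged, the best fitness recorded in `history` never decreases from one generation to the next.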

Fitness Function and Specific Parameters for WSNs

The fitness function executed in a sensor node is a weighted function that measures the quality or performance of a solution, in this case a specific sensor network design. The GA optimizes this function during the evolutionary process. A fitness function must include and correctly represent all, or at least the most important, factors that affect the performance of the system. The major issue in developing a fitness function is deciding which factors are the most important ones (Ferentinos & Tsiligiridis, 2007) (Gnanapandithan & Natarajan, 2006).

A genetic algorithm for WSN topologies must optimize the energy-related parameters that affect the battery consumption of the sensors and, thus, the lifetime of the network. At the same time, the algorithm has to meet some connectivity constraints and optimize some physical parameters of the WSN that depend on the specific application. The multiple objectives of the optimization problem are blended into a single objective function, whose parameters are combined to formulate a fitness function that gives a quality measure to each WSN topology. Three sets of parameters dominate the design and the performance of a WSN: the application-specific parameters, the connectivity parameters and the energy-related parameters. Some possible parameters are discussed in (Ferentinos & Tsiligiridis, 2007):

• Operation energy: the energy that a sensor consumes during some specific time of operation. It depends on whether the sensor operates as a cluster head or as a regular sensor.
• Communication energy: the energy consumption due to communication between sensors. It depends on the distances between transmitter and receiver.
• Battery life: the battery capacity of each sensor.
• Sensors-per-cluster head: a parameter to ensure that each cluster head does not have more than a predefined maximum number of sensors in its cluster. It depends on the physical communication capabilities and on the amount of data that can be processed by a cluster head.
• Sensors out of range error: a parameter to ensure that each sensor can communicate with its cluster head. It depends on the signal strength of the sensors.
• Spatial density: the minimal number of measurement points that adequately monitor the variables of a given area.
• Uniformity of measurement: the measurements of an area of interest must give a uniform view of the conditions in that area. The total area can be divided into several sub-areas for uniform measurement.

Other parameters can be defined, especially those related to application-specific requirements, such as sensor-to-sink delay, routing information, localization and network coverage. The optimization problem is defined as the minimization of the WSN parameters. If n optimization parameters are defined, they may be combined into a single objective function:

$$ f = \min \sum_{i=1}^{n} w_i P_i , $$

where P_i is the i-th parameter objective and w_i is its weighting coefficient, which defines the importance of that parameter in the network design. The weights have to be chosen carefully. They are first set from experience of the importance of each parameter, and the final values are then determined by experimentation.

An individual is selected as a parent of the next generation according to its fitness value: the probability that an individual is chosen is proportional to that value. After this process, the type of crossover and mutation has to be defined, as well as the population size and the probabilities of crossover and mutation. Experiments must be carried out to determine the most appropriate values for WSNs.
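As a rough sketch of the weighted objective and of fitness-proportional parent selection, the snippet below combines parameter values into a single number and turns a set of objective values into selection probabilities. The mapping from a minimized objective f to a fitness 1/(1 + f) is an assumption made here for illustration (it requires f >= 0); the article does not state how the two are related, and the function names and sample numbers are hypothetical.

```python
import random


def weighted_objective(params, weights):
    """Weighted sum f = sum_i w_i * P_i; lower values mean better designs."""
    if len(params) != len(weights):
        raise ValueError("one weight is needed per parameter")
    return sum(w * p for w, p in zip(weights, params))


def selection_probabilities(objectives):
    """Selection probabilities proportional to fitness.

    Assumption: fitness = 1 / (1 + f), so a smaller (non-negative)
    objective value gives a larger chance of being chosen as a parent.
    """
    fitness = [1.0 / (1.0 + f) for f in objectives]
    total = sum(fitness)
    return [value / total for value in fitness]


def roulette_select(population, probabilities, rng):
    """Draw one parent with probability proportional to its fitness."""
    return rng.choices(population, weights=probabilities, k=1)[0]


# Hypothetical example: three candidate designs, three parameters each.
weights = [0.5, 0.25, 1.0]
designs = {"A": [1.0, 2.0, 0.5], "B": [2.0, 2.0, 1.0], "C": [0.0, 4.0, 0.0]}
objectives = [weighted_objective(p, weights) for p in designs.values()]
probabilities = selection_probabilities(objectives)
parent = roulette_select(list(designs), probabilities, random.Random(0))
```

In practice the weights would be tuned experimentally, as the text notes, and the parameters would usually be normalized first so that no single term dominates the sum.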

FUTURE TRENDS

Recent research areas in wireless sensor networks include the design of MAC protocols, efficient routing, data aggregation, collaborative processing, sensor fusion, security, localization, data reliability and network management. All these topics may benefit from the use of genetic algorithms. Some research has applied genetic algorithms to WSN problems (Hussain, Matin & Islam, 2007) (Jin, Liu, Hsu & Kao, 2005) (Ferentinos & Tsiligiridis, 2007) (Wazed, Bari, Jaekel & Bandyopadhyay, 2007) (Rahmani, Fakhraie & Kamarei, 2006) (Qiu, Wu, Burns & Holzhauer, 2006). However, most WSN research topics remain little explored or completely unexplored from the genetic algorithm perspective.

CONCLUSION

This article discussed the application of genetic algorithms in wireless sensor networks. The basic idea of GAs was presented, together with some specific considerations for WSNs, including crossover, mutation and the definition of the fitness function. The main performance parameters may be divided into three groups: energy, connectivity and application specific. Since WSNs have many objectives to be optimized, GAs are a promising candidate for use in WSN design.

REFERENCES

Akyildiz, I. F., Su, W., Sankarasubramaniam, Y., & Cayirci, E. (2002). A survey on sensor networks. IEEE Communications Magazine, 40(8), 102-114.

Callaway, Edgar H. (2003). Wireless Sensor Networks: Architectures and Protocols. CRC Press, 352 pages.

Ferentinos, K. P., & Tsiligiridis, T. A. (2007). Adaptive Design Optimization of Wireless Sensor Networks Using Genetic Algorithms. Elsevier Computer Networks, 51, 1031-1051.

Gnanapandithan, N., & Natarajan, B. (2006). Parallel Genetic Algorithm Based Optimal Fusion in Sensor Networks. IEEE Consumer Communications and Networking Conference.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.

Hussain, S., Matin, A. W., & Islam, O. (2007). Genetic Algorithm for Energy Efficient Clusters in Wireless Sensor Networks. IEEE 4th International Conference on Information Technology, Las Vegas, Nevada, USA.

Jin, M., Liu, W., Hsu, D. F., & Kao, C. (2005). Compact Genetic Algorithm for Performance Improvement in Hierarchical Sensor Networks Management. IEEE Int. Symposium on Parallel Architectures, Algorithms and Networks, Las Vegas, USA.

Johnson, D., Teredesai, A. M., & Saltarelli, R. (2005). Genetic Programming in Wireless Sensor Networks. European Conference on Genetic Programming, Lausanne, Switzerland.

Jourdan, D. B., & de Weck, O. L. (2004). Layout Optimization for a Wireless Sensor Network Using a Multi-objective Genetic Algorithm. IEEE Vehicular Technology Conference.

Khana, R., Liu, H., & Chen, H. (2006). Self-Organization of Sensor Networks Using Genetic Algorithms. IEEE International Conference on Communications, Istanbul, Turkey.

Khana, R., Liu, H., & Chen, H. (2007). Dynamic Optimization of Secure Mobile Sensor Networks: A Genetic Algorithm. IEEE International Conference on Communications, Glasgow, Scotland.

Qiu, Q., Wu, Q., Burns, D., & Holzhauer, D. (2006). Lifetime Aware Resource Management for Sensor Network Using Distributed Genetic Algorithm. International Symposium on Low Power Electronics and Design.

Rahmani, E., Fakhraie, S. M., & Kamarei, M. (2006). Finding Agent-Based Energy-Efficient Routing in Sensor Networks using Parallel Genetic Algorithm. International Conference on Microelectronics.

Wazed, S., Bari, A., Jaekel, A., & Bandyopadhyay, S. (2007). Genetic Algorithm Based Approach for Extending the Lifetime of Two-Tiered Sensor Networks. 2nd IEEE International Symposium on Wireless Pervasive Computing, San Juan, Puerto Rico.

Weise, T. (2006). Genetic Programming for Sensor Networks. Technical report, University of Kassel.

KEY TERMS

Cluster-Based Architecture: Sensor network architecture where communication occurs between the sensors of a cluster and a selected cluster head that collects the information obtained by the sensors in its cluster.

Cluster Head: Sensor node responsible for gathering the data of a sensor cluster and transmitting them to the sink node.

Crossover: Genetic operator used to vary the programming of a chromosome or chromosomes from one generation to the next.

Energy Parameters: Parameters that affect the battery consumption of the sensors, including the energy consumed due to sensing, communication and computational tasks.

Fitness Function: A particular type of objective function that quantifies the optimality of a solution in a genetic algorithm.

Genetic Algorithms: Search technique used in computing to find true or approximate solutions to optimization and search problems.

Mutation: The occasional (low probability) alteration of a bit position.

Network Lifetime: Time until the first sensor node or group of sensor nodes in the network runs out of energy.

Sensor Node: Network node with components for sensing, data processing and communication.

Wireless Sensor Networks: A network of spatially distributed devices using sensors to monitor conditions at different locations, such as temperature, sound, pressure, etc.


Genetic Fuzzy Systems Applied to Ports and Coasts Engineering Óscar Ibáñez University of A Coruña, Spain Alberte Castro University of Santiago de Compostela, Spain

INTRODUCTION

Fuzzy Logic (FL) and fuzzy sets, in the wide interpretation of FL (in which fuzzy logic is coextensive with the theory of fuzzy sets, that is, classes of objects in which the transition from membership to non-membership is gradual rather than abrupt), have placed modelling in a new and broader perspective by providing innovative tools to cope with complex and ill-defined systems. The area of fuzzy sets emerged from the pioneering works of Zadeh (Zadeh, 1965 and 1973), where the first fundamentals of fuzzy systems were established. Rule-based systems have been successfully used to model human problem-solving activity and adaptive behaviour. The conventional approaches to knowledge representation are based on bivalent logic. A serious shortcoming of such approaches is their inability to come to grips with the issue of uncertainty and imprecision. As a consequence, the conventional approaches do not provide an adequate model for modes of reasoning that are approximate rather than exact. Unfortunately, all commonsense reasoning falls into this category.

The application of FL to rule-based systems leads us to fuzzy systems. The main role of fuzzy sets is to represent knowledge about the problem or to model the interactions and relationships among the system variables. There are two essential advantages for the design of rule-based systems with fuzzy sets and logic:

• The key features of knowledge captured by fuzzy sets involve handling uncertainty.
• Inference methods become more robust and flexible with the approximate reasoning methods of fuzzy logic.

Genetic Algorithms (GAs) are a stochastic optimization technique that mimics natural selection (Holland, 1975). GAs are intrinsically robust and capable of determining a near-global optimal solution. The use of GAs is usually recommended for optimization in high-dimensional, multimodal, complex search spaces where deterministic methods normally fail. GAs explore a population of solutions in parallel. The GA is a search process based on the laws of natural selection and genetics. Generally, a simple GA contains three basic operations: selection, genetic operations and replacement. A typical GA cycle is shown in Fig. 1. This paper shows how a genetic algorithm can be used to optimize a fuzzy system for wave reflection analysis at submerged breakwaters.

Figure 1. A typical GA cycle (initial population of chromosomes, evaluation, selection, genetic operators, subpopulation of offspring)

BACKGROUND

Much work has been done on artificial intelligence applied to Coastal Engineering, and Artificial Intelligence methods are widely accepted among Coastal and Ports Engineers. Artificial Neural Networks have been applied for years with very good results. Their main drawback is that they cannot explain how they reach their results: they work as a black box, and what happens inside them cannot be known. Over the last few years, many fuzzy systems with engineering applications have been developed (Yagci, Mercan & Kabdasli, 2003; Dingerson, 2005; Gezer, 2004; Ross, 2004; Oliveira, Souza & Mandorino, 2006; Ergin, Williams & Micallef, 2006; Yagci, Mercan, Cigizoglu & Kabdasli, 2005). These systems have the advantage that their solutions are easy to understand and that they can handle uncertainty. However, most of them ran into a knowledge extraction problem when defining their Rule Base (RB) and Data Base (DB), in many cases because of the difficulty of the problem and more often because of the difficulty of representing all the expert knowledge in a few rules and membership functions. Genetic Fuzzy Systems (GFS) emerged to overcome these problems. In a GFS, expert advice is not as important as in a Fuzzy System (FS), since it may be needed only to define the variables involved and their work domain. GFS (Cordón, et al., 2001) allow us to depend less on expert knowledge, and they make it easier to reach better accuracy, since they can tune the membership functions and refine the rule set in order to optimize it. A specific application of GFS to wave reflection analysis at submerged breakwaters is presented next. While other kinds of techniques have been applied to that problem (Taveira-Pinto, 2005; Kobayashi & Wurjanto, 1989; Abul-Azm, 1993; Losada, Silva & Losada, 1999), it is a novel approach to estimating the reflection coefficient, since a GA determines the membership functions for each variable involved in the fuzzy system.

ANALYSIS OF WAVE REFLECTION AT SUBMERGED BREAKWATERS WITH A GENETIC FUZZY SYSTEM

Fuzzy rule-based systems can be used as a tool for modelling non-linear systems, especially complex physical systems. It is a well-known fact that the breakwater damage ratio estimation process is dynamic and non-linear, so classical methods are not able to capture this behaviour and give unsatisfactory solutions.

The Knowledge Base (KB) is the FS component that comprises the expert knowledge about the problem. It is the only component of the FS that depends on the concrete application, so the accuracy of the FS depends directly on its composition. The KB has two components: a Data Base (DB), containing the definitions of the linguistic labels of the fuzzy rules, that is, the membership functions of the fuzzy sets, and a Rule Base (RB), constituted by the collection of fuzzy rules representing the expert knowledge.

Many tasks have to be performed in order to design a concrete FS. As noted above, the derivation of the KB is the only one that depends directly on the problem to be solved. The most widely used method for this task is to extract the expert experience directly from the human process operator. The problem arises when experts are not able to express their knowledge in terms of fuzzy rules. To avoid this drawback, researchers have investigated automatic learning methods that design FSs by automatically deriving an appropriate KB without the need for a human expert.

Genetic algorithms (GAs) have proved to be a powerful tool for automating the definition of the KB, since adaptive control, learning and self-organization can in many cases be considered optimization or search processes. Fuzzy systems that make use of GAs in their design process are generically called GFSs. These advantages have extended the use of GAs in the development of a wide range of approaches for designing FSs in recent years. It is possible to distinguish three different groups of genetic FS design processes according to the KB components included in the learning process:

• Genetic definition of the Fuzzy System Data Base (Bolata and Nowé, 1995; Fathi-Torbaghan and Hildebrand, 1994; Herrera, Lozano and Verdegay, 1995b; Karr, 1991b).
• Genetic derivation of the Fuzzy System Rule Base (Bonarini, 1993; Karr, 1991a; Thrift, 1991).
• Genetic learning of the Fuzzy System Knowledge Base (Cooper and Vidal, 1993; Herrera, Lozano and Verdegay, 1995a; Leitch and Probert, 1994; Lee and Takagi, 1993; Ng and Lee, 1994).

In this paper, we create a Fuzzy System that predicts the reflection coefficient for different models of submerged breakwaters. To do this, one part of the Fuzzy System, the Data Base, is defined and tuned by a Genetic Algorithm.

SUBMERGED BREAKWATER DOMAIN

Submerged breakwaters are effective shore protection structures against wave action with a reduced visual impact (see fig. 2). To predict the reflection coefficient, several parameters have to be taken into account:

• Rc: water level above crest.
• Hs: significant wave height.
• d: water depth.
• Tp: peak period, or
• Lp: peak wavelength.

These parameters connect the submerged breakwater model and the wave. The parameters that identify the submerged breakwater model (see fig. 3) are the height (h), the crest width (B), n (cotangent α), the breakwater slope (α) and the slope nature (smooth or rough). The first group of parameters was used to predict the reflection coefficient, but in many cases dimensionless combinations of them were used instead of the separate parameters. Many tests were run with different numbers of input variables and different numbers of fuzzy sets for each variable. A set of rules was established for each case, depending on the number of variables and membership functions.

PHYSICAL TEST

A large number of tests have been carried out (Taveira-Pinto, 2001) with different water depths and wave conditions for each model (figure 3 shows the general layout of the tested models). Eight impermeable physical models were tested with different geometries (crest width, slope), different slope natures (smooth, rough), and different values of tan α (from 0.20 to 1.00) and n (from 1 to 5) in the old unidirectional wave tank of the Hydraulics Laboratory of the Faculty of Engineering of the University of Porto.

Figure 2. Outline of a submerged breakwater and its action



Figure 3. Diagram of interesting variables taken into account in a submerged breakwater

GENETIC FUZZY SYSTEM

The goal of the GA is to find the best distribution of the membership functions within the domain of each variable (an optimization task), so that the error of the resulting fuzzy system is minimized when it is applied to the training set.

Genome Encoding

Each individual of the GA represents the Data Base of the fuzzy system, that is, all the membership functions. Each gene contains the position of one point of one membership function. As can be seen in fig. 4, one variable X with all its fuzzy sets is coded as a chain of real numbers. The coding allows different kinds of membership functions (triangular, trapezoidal, Gaussian, etc.) by encoding their representative points in the chromosome, so the resulting chromosome has variable size.
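One way to picture this encoding is the sketch below, which flattens a Data Base into a chain of real-valued genes and rebuilds it again. It is an illustration only: the dictionary layout, the function names `encode` and `decode`, and the sample point values are assumptions, not the authors' implementation.

```python
def encode(database):
    """Flatten {variable: [points of MF 1, points of MF 2, ...]} into genes.

    Returns the gene list plus a layout that records, for each membership
    function, its variable and its number of representative points, so that
    chromosomes of variable size can be decoded again.
    """
    genes, layout = [], []
    for variable, functions in database.items():
        for points in functions:
            layout.append((variable, len(points)))
            genes.extend(points)
    return genes, layout


def decode(genes, layout):
    """Rebuild the Data Base dictionary from a gene list and its layout."""
    database, position = {}, 0
    for variable, n_points in layout:
        points = list(genes[position:position + n_points])
        database.setdefault(variable, []).append(points)
        position += n_points
    return database


# Hypothetical Data Base: two trapezoids (4 points) and one triangle (3 points).
database = {
    "Rc/Hs": [[0.0, 0.0, 0.4, 0.6], [0.4, 0.6, 1.0, 1.0]],
    "Cr": [[0.0, 0.2, 0.4]],
}
genes, layout = encode(database)
```

Keeping the layout next to the genes is what lets trapezoids and triangles share one chromosome: each gene is still the position of one point of one membership function.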

Genetic Operators

The genetic operators were restricted so that they generate only meaningful fuzzy systems.

• Crossover: The classical crossover operator (one-point, n-point or uniform crossover) has to be limited in its possible cross points. To avoid meaningless membership functions, it is only allowed to exchange the genetic material corresponding to whole variables.
• Mutation: When a mutation happens, the new value of the gene lies between a lower and an upper limit, both worked out from the neighbouring points of the corresponding membership function and from its neighbouring membership functions.
• Selection: The selection method is tournament selection with elitism (Blickle, 1997).
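The two restricted operators can be sketched roughly as follows. Several simplifications are assumed here and are not from the article: an individual is a dictionary mapping each variable to its list of membership functions, crossover is the uniform variant applied per variable, and the mutation bounds come only from the adjacent points of the same membership function and from the variable's domain, whereas the article also takes neighbouring membership functions into account. All names are hypothetical.

```python
import random


def copy_functions(functions):
    """Copy the membership functions (lists of points) of one variable."""
    return [list(points) for points in functions]


def crossover_by_variable(parent_a, parent_b, rng):
    """Uniform crossover whose cross points fall only between variables."""
    child_a, child_b = {}, {}
    for variable in parent_a:
        first, second = parent_a[variable], parent_b[variable]
        if rng.random() < 0.5:
            # Swap all membership functions of this variable as one block.
            first, second = second, first
        child_a[variable] = copy_functions(first)
        child_b[variable] = copy_functions(second)
    return child_a, child_b


def mutate_point(functions, mf_index, point_index, domain, rng):
    """Move one point while keeping the points of its function ordered."""
    points = functions[mf_index]
    # Simplified bounds: adjacent points of the same function, else the domain.
    lower = points[point_index - 1] if point_index > 0 else domain[0]
    upper = (points[point_index + 1]
             if point_index < len(points) - 1 else domain[1])
    mutated = copy_functions(functions)
    mutated[mf_index][point_index] = rng.uniform(lower, upper)
    return mutated


rng = random.Random(3)
a = {"Rc/Hs": [[0.0, 0.0, 0.3, 0.7]], "d/Lp": [[0.1, 0.2, 0.5, 0.9]]}
b = {"Rc/Hs": [[0.0, 0.1, 0.4, 0.6]], "d/Lp": [[0.0, 0.3, 0.6, 0.8]]}
child_1, child_2 = crossover_by_variable(a, b, rng)
mutated = mutate_point(a["d/Lp"], 0, 2, (0.0, 1.0), rng)
```

Because whole variables are exchanged and a mutated point never crosses its neighbours, every offspring still decodes to well-formed membership functions.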

Fitness

The fitness function is the way to find out whether one individual is better than another. In this case, an individual represents one part of a fuzzy system (the DB), and its fitness can be calculated together with the rest of the fuzzy system (the static RB). For that purpose, the set of physical tests is split into two sets: a training set and a test set. For each physical test of the training set, the corresponding values of the input variables are fed into the fuzzy system (the individual in the genetic population). The output is calculated with a Mamdani strategy (Mamdani, 1977) and the Centroid defuzzification method, and the result is compared with the output of the physical test. The difference is accumulated over all the tests in the training set. Once every test has been fed into the fuzzy system (one individual of the GA) and its error has been calculated, the sum of the errors is the fitness function value of the individual. The smaller the total error, the better the individual.

Figure 4. Piece of a chromosome. Xij contains the position of one point (i) of one membership function (j)
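The fitness computation described above can be written down in a few lines. This sketch assumes that the accumulated "difference" is the absolute error (the article does not say whether absolute or squared differences are summed) and that the fuzzy system is available as a callable `predict`; both the names and the toy data are hypothetical.

```python
def split_tests(tests, n_train):
    """Split the physical tests into a training set and a test set."""
    return tests[:n_train], tests[n_train:]


def fitness_error(predict, training_set):
    """Accumulated error of one individual; smaller means fitter.

    `predict` maps the input values of a physical test to the output of the
    fuzzy system built from the individual (its DB plus the static RB).
    Assumption: the error of each test is the absolute difference.
    """
    return sum(abs(predict(inputs) - measured)
               for inputs, measured in training_set)


# Toy usage with a stand-in predictor instead of a real fuzzy system.
tests = [((1.0, 2.0), 3.5), ((0.0, 0.0), 0.25), ((2.0, 2.0), 4.0)]
training, held_out = split_tests(tests, 2)
error = fitness_error(lambda inputs: inputs[0] + inputs[1], training)
```

Since the GA here looks for the smallest accumulated error, a selection scheme that expects larger-is-better values would need the error negated or inverted first.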

Results

Good results were obtained (from 85% to 95% success) in the different tests. The tests differ from one another in the number of input variables, the number of rules and the genetic algorithm parameters. An easily understood test is explained below:

• Selected dimensionless parameters: Rc/Hs and d/Lp.
• Both input variables were split into two trapezoidal membership functions (Low and High).
• The output variable Cr (reflection coefficient) was split into three trapezoidal membership functions (Low, Medium and High).
• The rule set was made up of three rules:
o If (Rc/Hs = Low) and (d/Lp = Low) then (Cr = High)
o If (Rc/Hs = Low) and (d/Lp = High) then (Cr = Medium)
o If (Rc/Hs = High) and (d/Lp = Low) then (Cr = Medium)

The training set was made up of 24 physical tests, and the mean square error in that step was 0.84. The resulting membership functions can be seen in fig. 5. The test set was made up of 11 physical tests, and the mean square error in that step was 0.89.

FUTURE TRENDS

The next step is to give the GA the capacity to optimize the rules, so that the system definition becomes easier and better results can be reached. The GA must be able to generate individuals with different numbers of rules and different kinds of rules, while these individuals also represent different membership functions.

CONCLUSION

• A Genetic Fuzzy System was developed to estimate the wave reflection coefficient at submerged breakwaters.
• Good results were obtained (close to 90% accuracy), but the systems that gave better results (close to 97% accuracy) are difficult to interpret within fuzzy theory.
• Choosing the rule set is a hard task, and the accuracy of the system depends heavily on this set.
• The more inputs the problem has, the more difficult it becomes to define the rule set.

Figure 5. Resultant membership functions from tuning process of a DB by GA

REFERENCES

Abul-Azm A. G., 1993. Wave Diffraction Through Submerged Breakwaters. Journal of Waterway, Port, Coastal and Ocean Engineering, Vol. 119, No. 6, pp. 587-605.

Baglio S. & Foti E., 2003. Non-invasive measurements to analyze sandy bed evolution under sea waves action. IEEE Transactions on Instrumentation and Measurement, Vol. 52, Issue 3, pp. 762-770.

Blickle, T. (1997). Tournament selection. In T. Bäck, D.G. Fogel, & Z. Michalewicz (Eds.), Handbook of Evolutionary Computation. New York: Taylor & Francis Group.

Bolata F. & Nowé A., 1995. From fuzzy linguistic specifications to fuzzy controllers using evolution strategies. In Proc. Fourth IEEE International Conference on Fuzzy Systems (FUZZ-IEEE'95), Yokohama, pp. 77-82.

Bonarini A., 1993. Learning incomplete fuzzy rule sets for an autonomous robot. In Proc. First European Congress on Fuzzy and Intelligent Technologies (EUFIT'93), Aachen, pp. 69-75.

Cooper M. G. & Vidal J. J., 1993. Genetic design of fuzzy logic controllers. In Proc. Second International Conference on Fuzzy Theory and Technology (FTT'93), Durham.

Cordón, O., Herrera, F., Hoffmann, F., Magdalena, L. (2001). Genetic fuzzy systems. World Scientific.

Dingerson L. M., 2005. Predicting future shoreline condition based on land use trends, logistic regression and fuzzy logic. Thesis. The Faculty of the School of Marine Science.

Ergin A., Williams A.T. & Micallef A., 2006. Coastal Scenery: Appreciation and Evaluation. Journal of Coastal Research, Volume 22, Issue 4, pp. 958-964.

Fathi-Torbaghan M. & Hildebrand L., 1994. Evolutionary strategies for the optimization of fuzzy rules. In Proc. Fifth International Conference on Information Processing and Management of Uncertainty in Knowledge Based Systems (IPMU'94), Paris, pp. 671-674.

Gezer E., 2004. Coastal Scenic Evaluation, A pilot study for Çiralli. Thesis. The Graduate School of Natural and Applied Sciences of Middle East Technical University.

Herrera F., Lozano M. & Verdegay J. L., 1995a. A Learning process for fuzzy control rule using genetic algorithms. Technical Report DECSAI-95108, University of Granada, Department of Computer Science and Artificial Intelligence.

Herrera F., Lozano M. & Verdegay J. L., 1995b. Tuning fuzzy logic controllers by genetic algorithms. International Journal of Approximate Reasoning 12: 293-315.

Holland J. H., 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.

Karr C., 1991a. Applying genetics. AI Expert, pp. 38-43.

Karr C., 1991b. Genetic algorithms for fuzzy controllers. AI Expert, pp. 26-33.

Kobayashi N. & Wurjanto A., 1989. Wave Transmission Over Submerged Breakwaters. Journal of Waterway, Port, Coastal and Ocean Engineering, Vol. 115, No. 5, pp. 662-680.

Lee M. & Takagi H., 1993. Embedding a priori knowledge into an integrate fuzzy system design method based on genetic algorithms. In Proc. Fifth International Fuzzy Systems Association World Congress (IFSA'93), Seoul, pp. 1293-1296.

Leitch D. & Probert P., 1994. Context depending coding in genetic algorithms for the design of fuzzy systems. In Proc. IEEE/Nagoya University WWW on Fuzzy Logic and Neural Networks/Genetic Algorithms, Nagoya.

Losada I.J., Silva R. & Losada, M.A., 1996. 3-D nonbreaking regular wave interaction with submerged breakwaters. Coastal Engineering, Volume 28, Number 1, pp. 229-248.

Mamdani, E.H., 1977. Application of fuzzy logic to approximate reasoning using linguistic synthesis. IEEE Transactions on Computers C-26 (12), 1182-1191.

Oliveira S.S., Souza F.J. & Mandorino F., 2006. Using fuzzy logic model for the selection and priorization of port areas to hydrographic re-surveying. Evolutions in hydrography, Antwerp (Belgium), Proceedings of the 15th International Congress of the International Federation of Hydrographic Societies. Special Publication of the Hydrographic Society, 55: pp. 63-67.

Ross T. J., 2004. Fuzzy Logic with Engineering Applications. John Wiley and Sons.

Taveira-Pinto F., 2001. Analysis of the oscillations and velocity fields in the vicinity of submerged breakwaters under wave action. Ph. D. Thesis, Univ. of Porto (In Portuguese).

Taveira-Pinto F., 2005. Regular water wave measurements near submerged breakwaters. Meas. Science and Technology.

Thrift P., 1991. Fuzzy logic synthesis with genetic algorithms. In Proc. Fourth International Conference on Genetic Algorithms (ICGA'91), pp. 509-513.

Van Oosten R. P., Peixó J., Van der Meer M.R.A. & Van Gent H.J., 2006. Wave transmission at low-crested structures using neural networks. ICCE 2006, Abstract number: 1071.

Yagci O., Mercan D. E., Cigizoglu H. K. & Kabdasli M. S., 2005. Artificial intelligence methods in breakwater damage ratio estimation. Ocean Engineering, vol. 32, no. 17-18, pp. 2088-2106.

Yagci O., Mercan D. E. & Kabdasli M. S., 2003. Modelling Of Anticipated Damage Ratio On Breakwaters Using Fuzzy Logic. EGS - AGU - EUG Joint Assembly, Abstracts from the meeting held in Nice, France, 6 - 11 April 2003.

Zadeh L. A., 1965. Fuzzy sets. Information and Control 8: 338-353.

Zadeh L. A., 1973. Outline of a new approach to the analysis of complex systems and decision process. IEEE Transactions on Systems, Man and Cybernetics 3: 28-44.
KEY TERMS

Fuzzification: Establishes a mapping from crisp input values to fuzzy sets defined in the universe of discourse of that input.

Fuzzy System (FS): Any FL-based system, which either uses FL as the basis for the representation of different forms of knowledge, or to model the interactions and relationships among the system variables.

Genetic Algorithm: General-purpose search algorithms that use principles inspired by natural population genetics to evolve solutions to problems.

Genetic Fuzzy System: A fuzzy system that is augmented with an evolutionary learning process.

Mamdani Fuzzy Rule-Based System: A rule-based system where fuzzy logic (FL) is used as a tool for representing different forms of knowledge about the problem at hand, as well as for modelling the interactions and relationships that exist between its variables.

Mamdani Inference System: Derives the fuzzy outputs from the input fuzzy sets according to the relation defined through fuzzy rules. Establishes a mapping between fuzzy sets U = U1 x U2 x . . . x Un in the input domain of X1, …, Xn and fuzzy sets V in the output domain of Y. The fuzzy inference scheme employs the generalized modus ponens, an extension of the classical modus ponens (Zadeh, 1973).

Takagi-Sugeno-Kang Fuzzy Rule-Based System: A rule-based system whose antecedent is composed of linguistic variables and whose consequent is represented by a function of the input variables.


Grammar-Guided Genetic Programming

Daniel Manrique
Inteligencia Artificial, Facultad de Informatica, UPM, Spain

Juan Ríos
Inteligencia Artificial, Facultad de Informatica, UPM, Spain

Alfonso Rodríguez-Patón
Inteligencia Artificial, Facultad de Informatica, UPM, Spain

INTRODUCTION

Evolutionary computation (EC) is the study of computational systems that borrow ideas from and are inspired by natural evolution and adaptation (Yao & Xu, 2006, pp. 1-18). EC covers a number of techniques based on evolutionary processes and natural selection: evolutionary strategies, genetic algorithms and genetic programming (Keedwell & Narayanan, 2005).

Evolutionary strategies are an approach for efficiently solving certain continuous problems, yielding good results for some parametric problems in real domains. Compared with genetic algorithms, evolutionary strategies run more exploratory searches and are a good option when applied to relatively unknown parametric problems.

Genetic algorithms emulate the evolutionary process that takes place in nature. Individuals compete for survival by adapting as best they can to the environmental conditions. Crossovers between individuals, mutations and deaths are all part of this process of adaptation. By replacing the natural environment with the problem to be solved, we get a computationally cheap method that is capable of dealing with any problem, provided we know how to determine individuals’ fitness (Manrique, 2001).

Genetic programming (GP) is an extension of genetic algorithms (Couchet, Manrique, Ríos & Rodríguez-Patón, 2006). Its aim is to build computer programs that are not expressly designed and programmed by a human being. It can be said to be an optimization technique whose search space is composed of all possible computer programs for solving a particular problem. Genetic programming’s key advantage over genetic algorithms is that it can handle individuals (computer programs) of different lengths.

Grammar-guided genetic programming (GGGP) is an extension of traditional GP systems (Whigham, 1995, pp. 33-41). The difference lies in the fact that GGGP systems employ context-free grammars (CFG) that generate all the possible solutions to a given problem as sentences, thereby establishing the formal definition of the syntactic problem constraints, and use the derivation tree of each sentence to encode these solutions (Dounias, Tsakonas, Jantzen, Axer, Bjerregard & von Keyserlingk, 2002, pp. 494-500). The use of this type of syntactic formalism helps to solve the so-called closure problem (Whigham, 1996). To achieve closure, valid individuals (points that belong to the search space) should always be generated. As the generation of invalid individuals slows down convergence a great deal, solving this problem considerably improves the GP search capability.

The basic operator directly affecting the closure problem is crossover: crossing any two valid individuals should generate valid offspring. This is also the operator that has the biggest impact on the process of convergence towards the optimum solution. Therefore, this article reviews the most important crossover operators employed in GP and GGGP, highlighting the weaknesses that currently exist in this area of research. We also propose a GGGP system. This system incorporates the original idea of employing ambiguous CFG to overcome these weaknesses, thereby increasing convergence speed and reducing the likelihood of becoming trapped in local optima. Comparative results are shown to empirically corroborate our claims.



BACKGROUND

Koza (1992) defined one of the first major crossover operators (KX). This approach randomly swaps subtrees in both parents to generate offspring. Therefore, it tends to disaggregate the so-called building blocks across the trees that represent the individuals. The building blocks are those subtrees that improve fitness. This over-expansion has a negative effect on the fitness of the individuals. This operator’s excessive exploration capability also leads to another weakness: an increase in the size of individuals, which affects system performance and results in a lower convergence speed (Terrio & Heywood, 2002). This effect is known as bloat or code bloat. There is another important drawback: many of the generated offspring are syntactically invalid, as the crossovers are done completely at random. These individuals should not be part of the new population because they do not provide a valid solution. This seriously undermines the convergence process. Figure 1 shows a situation where one of the two individuals generated after Koza’s crossover breaches the constraints established by a hypothetical grammar whose sentences represent arithmetic equalities.

Figure 1. Incorrect operation of Koza’s crossover operator (two parent derivation trees with the crossover node in each parent and the subtrees to be swapped marked; one of the two offspring contains an invalid production)

The strong context preservative crossover operator (SCPC) avoids the problem of disaggregation of building blocks (also called context) across the trees by setting severe (strong) constraints for tree nodes considered as possible candidates for selection as crossover nodes (D’haesler, 1994, pp. 379-407). A system of coordinates is defined to univocally identify each node in a derivation tree. The position of each node within the tree is specified along the path that must be followed to reach it from the root. To do this, the position of a node is described by means of a tuple of n coordinates T = (b1, b2, ..., bn), where n is the node’s depth in the tree and bi indicates which branch is selected at depth i (counting from left to right). Figure 2 shows an example of this system of coordinates. Only nodes with the same coordinates in both parents can be swapped. For this reason, a subtree may never migrate to another place in the tree. This limitation can cause serious search space exploration problems, as the whole search space cannot be covered unless each function and terminal appears at every possible coordinate at least once in some individual of the population. This failure to migrate building blocks causes them to evolve separately in each region, giving the operator an excessive exploitation capability and thereby increasing the likelihood of becoming trapped in local optima (Barrios, Carrascal, Manrique & Ríos, 2003, pp. 275-293).

Figure 2. The system of coordinates defined in SCPC (nodes along one path are labeled (), (2), (2,1), (2,1,3) and (2,1,3,1))

As time moves on, the code bloat phenomenon becomes a serious problem and takes an ever more prominent role. To avoid this, Crawford-Marks & Spector (2002) developed the Fair crossover (pp. 733-739). This is a modified version of the approach proposed by Langdon (1999, pp. 1092-1097). Tree size is controlled as follows. First, a crossover node in the first parent is selected at random and the length, l, of the subtree extending from the node to the leaves is calculated. Then, a node is also selected at random in the second parent, and the length, l2, of this second subtree is calculated. If l2 is within the range [l - l/4, l + l/4], then the crossover node for the second parent is accepted, and the two subtrees are swapped. If not, another crossover node is selected at random for the second parent and the check is run again. This way, the size of the subtree to be swapped in the second parent is controlled and limited, so the code bloat phenomenon is avoided. The range in which l2 must lie can be modified to address specific problems more efficiently, but the range originally proposed works well for most of them.

Whigham (1995, pp. 33-41) proposed one of the most commonly used operators (WX) in GGGP. Because of its sound performance in such systems, it has become the de facto standard and is still in use today (Rodrigues & Pozo, 2002, pp. 324-333), (Hussain, 2003), (Grosman & Lewin, 2004, pp. 2779-2790). The algorithm works as follows. First, as all the terminal symbols have at least one non-terminal symbol above them, the crossover nodes can, without loss of generality, be confined exclusively to nodes containing non-terminal symbols. A non-terminal node belonging to the first parent is selected at random. Then a non-terminal node labeled with the same non-terminal symbol as the first-chosen crossover node is selected from the second parent. This assures that the generated individuals belong to the grammar-generated language, as the crossed nodes share the same symbol. This operator’s main flaw is that there are other possible choices of node in the second parent that are not explored and that could lead to the target solution (Manrique, Márquez, Ríos & Rodríguez-Patón, 2005, pp. 252-261).
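Returning to the size control used by the Fair crossover described in this section, its acceptance check can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the tuple-based tree encoding, the function names, the `max_tries` cut-off and the reading of "length" as the number of nodes in the subtree are all assumptions made for the example.

```python
import random

# Hypothetical encoding: a tree is a tuple (label, [children]).

def subtree_size(tree):
    """Number of nodes in the subtree (assumed meaning of 'length')."""
    _, children = tree
    return 1 + sum(subtree_size(child) for child in children)

def all_subtrees(tree):
    """Yield every subtree of the tree in pre-order."""
    yield tree
    for child in tree[1]:
        yield from all_subtrees(child)

def within_fair_range(l, l2):
    """True if l2 lies in the range [l - l/4, l + l/4]."""
    return l - l / 4 <= l2 <= l + l / 4

def pick_fair_node(parent2, l, rng, max_tries=100):
    """Draw random nodes from parent2 until one passes the size check.

    max_tries is an assumption added so that the loop always terminates.
    """
    candidates = list(all_subtrees(parent2))
    for _ in range(max_tries):
        node = rng.choice(candidates)
        if within_fair_range(l, subtree_size(node)):
            return node
    return None

# Example second parent: E -> E + E, where each inner E -> N -> digit.
parent2 = ("E", [
    ("E", [("N", [("4", [])])]),
    ("+", []),
    ("E", [("N", [("3", [])])]),
])
```

With a first-parent subtree of length l = 3, only subtrees of the second parent whose length lies in [2.25, 3.75] are accepted, so `pick_fair_node(parent2, 3, random.Random(0))` can only return one of the two inner E subtrees.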

THE PROPOSED CROSSOVER OPERATOR FOR GGGP SYSTEMS

The proposed operator is a general-purpose operator designed to work in any GGGP system. It takes advantage of the key feature that defines a CFG as ambiguous: the same sentence can be obtained by several derivation trees. This implies that there are several individuals representing the solution to a problem, so a solution is easier to find. This operator consists of eight steps:

1. Choose a node, except the axiom, with a non-terminal symbol randomly from the first parent. This node is called crossover node and is denoted CN1.
2. Choose the parent of CN1. As we are working with a CFG, this will be a non-terminal symbol. The right-hand sides of all its production rules are stored in the array R. The derivation produced by the parent of CN1 is called main derivation, and is denoted A ::= C.
3. Calculate the derivation length l as the number of symbols in the right-hand side of the main derivation. Having l, the position (p) of CN1 in the main derivation and C, define the three-tuple T(l, p, C).
4. Delete from R all the right-hand sides whose length differs from that of the main derivation.
5. Remove from R all those right-hand sides in which there is any difference between their symbols (except the one located in position p) and the symbols in C.
6. The set X is formed by all the symbols in the right-hand sides of R that are in position p. X contains all the non-terminal symbols of the second parent that can be chosen as a crossover node (CN2).
7. Choose CN2 randomly from X, discarding all the nodes that will generate offspring trees with a size greater than a previously established value D.
8. Calculate the two new derivation trees produced as offspring by swapping the two subtrees whose roots are CN1 and CN2.
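Steps 2 to 6 above can be sketched as a small function that computes the set X from the grammar. This is a minimal sketch under stated assumptions, not the authors' code: the dictionary encoding of the grammar, the function name `candidate_symbols`, the 0-based position p and the toy ambiguous grammar for arithmetic equalities are all hypothetical. Step 7 (the size bound D) and step 8 (the swap itself) are left out.

```python
def candidate_symbols(grammar, parent_symbol, main_rhs, p):
    """Return X, the symbols admissible for CN2 (steps 2 to 6).

    grammar: dict mapping each non-terminal to a list of right-hand
             sides, each right-hand side being a tuple of symbols.
    parent_symbol: A, the symbol of the parent of CN1.
    main_rhs: C, the right-hand side of the main derivation A ::= C.
    p: 0-based position of CN1 inside C (an assumption of this sketch).
    """
    l = len(main_rhs)                              # step 3
    R = list(grammar[parent_symbol])               # step 2
    R = [rhs for rhs in R if len(rhs) == l]        # step 4
    R = [rhs for rhs in R                          # step 5
         if all(s == c
                for i, (s, c) in enumerate(zip(rhs, main_rhs))
                if i != p)]
    # Step 6: symbols at position p, restricted to non-terminals.
    return {rhs[p] for rhs in R if rhs[p] in grammar}

# Hypothetical ambiguous grammar: "3+4" derives from E both through
# E ::= E + E and through E ::= F + E.
grammar = {
    "S": [("E", "=", "E")],
    "E": [("E", "+", "E"), ("E", "-", "E"),
          ("F", "+", "E"), ("F", "-", "E"), ("N",)],
    "F": [("N",)],
    "N": [(d,) for d in "0123456789"],
}

# CN1 is the first E in the main derivation E ::= E + E.
X = candidate_symbols(grammar, "E", ("E", "+", "E"), 0)
```

In this toy grammar X is {"E", "F"}: a node labeled F in the second parent is also admissible, whereas an operator that requires identical symbols would consider only nodes labeled E.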

The underlying idea of this algorithm consists in calculating which non-terminal symbols can substitute the symbol contained in CN1 while keeping the production rule that contains CN1 valid. Since all non-terminal symbols that can generate valid production rules are taken into account in the crossover process, this operator takes advantage of ambiguous grammars. The proposed crossover operator has three main attractive features: a) step 7 provides a code bloat control mechanism, b) the offspring produced are always composed of two valid trees and c) step 6 indicates that all the possible nodes of the second parent that can generate valid individuals are taken into account, not only those nodes with the same non-terminal symbol as the one chosen for the first parent. This third feature increases the GGGP system’s exploration capability, which avoids trapping in local optima and takes advantage of there being more than one derivation tree (potential solution to the problem) for a single sentence.

Results

We present and discuss the results achieved by the crossover operators described in the background section and the operator that we propose. To do so, we have tackled a complex classification problem: the real-world task of providing breast cancer prognosis (benign or malignant) from the morphological characteristics of microcalcifications. Microcalcifications are small mineral deposits in breast tissue that could constitute cancer. This experiment involved searching for a knowledge base of fuzzy rules that could give such a prognosis.

The data employed for giving a disease prognosis are: patient’s age, lesion size, lesion location in the breast, and particular features of the microcalcifications: number, distribution and type. Number indicates the quantity of existing clustered microcalcifications, distribution shows how they are clustered and type reflects the individual morphology of the microcalcifications.

To run the tests, 365 microcalcifications were selected at random. Of these, 315 lesions were randomly selected for use as genetic programming system training cases with the different crossover operators described. After training, the fittest individual was selected to form a knowledge base with the fuzzy rules encoded by this individual. Then, in what we have called the testing phase, the knowledge base was tested with the 50 remaining lesions not chosen during the training phase to output the number of correctly classified patterns.

The CFG employed was formed by 19 non-terminal symbols, 54 terminals and 51 production rules, some of them included to obtain an ambiguous grammar. The population size employed was 1000, and the upper bound for the size of the derivation trees was set to 20. The fitness function consisted of calculating the number of well-classified patterns. Therefore, the greater the fitness, the fitter the individual is, with a maximum of 315 in the training phase and 50 in the testing phase.

Figure 3 shows the average evolution process for each of the five crossover operators in the training phase after 100 executions. It is clear from Figure 3 that KX yields the worst results, because it maintains an over-diverse population and allows invalid individuals to be generated. This prevents it from focusing on one possible solution. The effect of Fair is just the opposite, leading very quickly to one of the optimal solutions (this is why it has a relatively high convergence speed initially), and slowing down if convergence is towards a local optimum (which happens in most cases). WX and SCPC produce good results, bettered only by the proposed crossover. Its high convergence speed evidences the benefits of taking into account all possible nodes of the second parent that can generate valid offspring.

Figure 3. Average convergence speed for each crossover operator (curves for the proposed crossover, SCPC, WX, Fair and KX; horizontal axis: generation, from 0 to 200)

Table 1 shows examples of fuzzy rules output in one of the executions for the best two crossover operators (WX and the proposed operator) once the training phase was complete. Table 2 shows the average number (rounded to the nearest integer) of correctly classified patterns after 100 executions, achieved by the best individual in the training and testing phases, and the percentage of times that the system converged prematurely. KX again yields the worst results, correctly classifying just 57.46% (181/315) of patterns in the training phase and 54% (27/50) in the testing phase. SCPC and Fair crossovers also return insufficient results: around 59% in the training phase and 54%-56% in the testing phase, although, as shown in Figure 3, SCPC has a higher convergence speed. Finally, note the similarity between WX and the proposed operator. However, the proposed operator has a higher speed of convergence and is less likely to get trapped in local optima, as it converged prematurely only twice in 100 executions.

Table 1. Some knowledge base fuzzy rules output by two GGGP systems
Columns: Crossover operator (WX, Proposed), Rule 1, Rule 2
IF NOT (type=branched) OR (number=few) THEN (prognosis=benign)
IF NOT (age=middle) AND NOT (location=subareolar) AND NOT (type=oval) THEN (prognosis=malignant)
IF (type=heterogeneous) THEN (prognosis=malignant)

Table 2. Average number of correctly classified patterns and unsuccessful runs
Crossover operator    Training            Testing        Unsuccessful runs
KX                    181/315 (57.46%)    27/50 (54%)    36%
SCPC                  186/315 (59.04%)    28/50 (56%)    14%
Fair                  185/315 (58.73%)    27/50 (54%)    15%
WX                    191/315 (60.63%)    30/50 (60%)    8%
Proposed              191/315 (60.63%)    31/50 (62%)    2%

FUTURE TRENDS

The continuation of the work described in this article can be divided into two main lines of investigation in GGGP. The first involves finding an algorithm that can estimate the maximum sizes of the trees generated throughout the evolution process to assure that the optimal solution will be reached. This would overcome a weakness of the proposed crossover operator: it may fail to reach a good solution because the permitted maximum tree size is too restrictive, whereas this solution could be found if individuals were just a little larger. The second interesting line of research derived from this work is the use of ambiguous grammars. It has been empirically observed that using the proposed operator combined with ambiguous grammars in GGGP systems benefits convergence speed. However, “too much ambiguity” is damaging. The idea is to get an ambiguity measure that can answer the question of how much ambiguity is needed to get the best results in terms of efficiency.

CONCLUSION

This article summarizes the latest and most important advances in GGGP, paying special attention to the crossover operator, which (alongside the initialization method, the codification of individuals and, to a lesser extent, the mutation operator) is chiefly responsible for the convergence speed and the success of the evolution process. GGGP systems are able to find solutions to any problem that can be syntactically expressed by a CFG. The proposed crossover operator provides GGGP systems with a satisfactory balance between exploration and exploitation capabilities. This results in a high convergence speed while eluding local optima, as the reported results demonstrate. To achieve such good results, the proposed crossover operator includes a computationally cheap mechanism to control bloat, it always generates syntactically valid offspring and it can choose any node from the second parent that generates valid offspring, rather than just those nodes with the same non-terminal symbol as the one chosen in the first parent.

REFERENCES

Barrios, D., Carrascal, A., Manrique, D. & Ríos, J. (2003). Optimization with real-coded genetic algorithms based on mathematical morphology. International Journal of Computer Mathematics, (80) 3, 275-293.

Couchet, J., Manrique, D., Ríos, J. & Rodríguez-Patón, A. (2006). Crossover and mutation operators for grammar-guided genetic programming. Soft Computing, DOI 10.1007/s00500-006-0144-9.

Crawford-Marks, R. & Spector, L. (2002). Size control via size fair genetic operators in the pushGP genetic programming system. In proceedings of the genetic and evolutionary computation conference, New York, 733-739.

D’haesler, P. (1994). Context preserving crossover in genetic programming. In IEEE Proceedings of the 1994 world congress on computational intelligence, Orlando, (1) 379-407.

Dounias, G., Tsakonas, A., Jantzen, J., Axer, H., Bjerregard, B., & von Keyserlingk, D. (2002). Genetic Programming for the Generation of Crisp and Fuzzy Rule Bases in Classification and Diagnosis of Medical Data. Proceedings of the 1st International NAISO Congress on Neuro Fuzzy Technologies, Havana, Cuba, 494-500.

Grosman, B. & Lewin, D.R. (2004). Adaptive Genetic Programming for Steady-State Process Modeling. Computers and Chemical Engineering, 28 2779-2790.

Hussain, T.S. (2003). Attribute grammar encoding of the structure and behaviour of artificial neural networks. PhD Thesis, Queen’s University. Kingston, Ontario, Canada.

Keedwell, E., & Narayanan, A. (2005). Intelligent bioinformatics. Wiley & Sons.

Koza, J.R. (1992). Genetic programming: on the programming of computers by means of natural selection. MIT Press, Cambridge.

Langdon, W.B. (1999). Size fair and homologous tree genetic programming crossovers. In proceedings of genetic and evolutionary computation conference, GECCO’99, Washington DC, 1092-1097.

Manrique, D. (2001). Diseño de redes de neuronas y nuevas técnicas de optimización mediante algoritmos genéticos [Artificial neural networks design and new optimization techniques using genetic algorithms]. PhD Thesis, Facultad de Informática, Universidad Politécnica de Madrid.

Manrique, D., Márquez, F., Ríos, J. & Rodríguez-Patón, A. (2005). Grammar-based crossover operator in genetic programming. Lecture Notes in Artificial Intelligence, 3562 252-261.

Rodrigues, E. & Pozo, A. (2002). Grammar-Guided Genetic Programming and Automatically Defined Functions. In proceedings of the 16th Brazilian symposium on artificial intelligence, Recife, Brazil, 324-333.

Terrio, M.D., & Heywood, M.I. (2002). Directing crossover for reduction of bloat in GP. In IEEE proceedings of Canadian conference on electrical and computer engineering, (2) 1111-1115.

Whigham, P.A. (1995). Grammatically-based genetic programming. In proceedings of the workshop on genetic programming: from theory to real-world applications, California, 33-41.

Whigham, P.A. (1996). Grammatical bias for evolutionary learning. PhD Thesis, School of Computer Science, Australian Defence Force (ADFA), University College, University of New South Wales.

Yao, X., & Xu, Y. (2006). Recent advances in evolutionary computation. Journal of Computer Science & Technology, (21) 1 1-18.

KEY TERMS

Ambiguous Grammar: Any grammar in which different derivation trees can generate the same sentence.

Closure Problem: Phenomenon that involves always generating syntactically valid individuals.

Code Bloat: Phenomenon to be avoided in a genetic programming system convergence process, involving the uncontrolled growth, in terms of size and complexity, of individuals in the population.

Convergence: Process by means of which an algorithm (in this case an evolutionary system) gradually approaches a solution. A genetic programming system is said to have converged when most of the individuals in the population are equal or when the system cannot evolve any further.

Fitness: Measure associated with individuals in an evolutionary algorithm population to determine how good the solution they represent is for the problem.

Genetic Programming: A variant of genetic algorithms that uses simulated evolution to discover functional programs to solve a task.

Grammar-Guided Genetic Programming: An extension of traditional genetic programming systems that employs context-free grammars to generate all the possible solutions to a given problem as sentences and uses the derivation tree of each sentence to encode these solutions.

Intron: Segment of code within an individual (subtree) that does not modify the fitness, but favors the convergence process.


Granular Computing

Georg Peters
Munich University of Applied Sciences, Germany

INTRODUCTION

It is well accepted that in many real life situations information is not certain and precise but rather uncertain or imprecise. To describe uncertainty, probability theory emerged in the 17th and 18th centuries. Bernoulli, Laplace and Pascal are considered to be the fathers of probability theory. Today probability can still be considered the prevalent theory for describing uncertainty.

However, in 1965 Zadeh seemed to challenge probability theory by introducing fuzzy sets as a theory dealing with uncertainty (Zadeh, 1965). Since then it has been discussed whether probability and fuzzy set theory are complementary or rather competitive (Zadeh, 1995). Sometimes fuzzy set theory is even considered a subset of probability theory and therefore dispensable. Although the discussion on the relationship between probability and fuzziness seems to have lost the intensity of its early years, it is still continuing today. Nevertheless, fuzzy set theory has established itself as a central approach to tackling uncertainty. For a discussion on the relationship between probability and fuzziness the reader is referred to, e.g., Dubois and Prade (1993), Ross et al. (2002) or Zadeh (1995).

In the meantime further ideas on how to deal with uncertainty have been suggested. For example, Pawlak introduced rough sets in the early 1980s (Pawlak, 1982), a theory that has attracted increasing attention in recent years. For a comparison of probability, fuzzy sets and rough sets the reader is referred to Lin (2002).

Presently research is being conducted to develop a Generalized Theory of Uncertainty (GTU) as a framework for any kind of uncertainty, whether it is based on probability, fuzziness or other approaches (Zadeh, 2005). Cornerstones of this theory are the concepts of information granularity (Zadeh, 1979) and generalized constraints (Zadeh, 1986).

In this context the term Granular Computing was first suggested by Lin (1998a, 1998b); however, it still lacks a unique and well accepted definition. For example, Zadeh (2006a) colorfully calls granular computing “ballpark computing” or, more precisely, “a mode of computation in which the objects of computation are generalized constraints”.

BACKGROUND

Humans often speak and think in words rather than in numbers. For example, in summer we say that it is hot outside rather than that it is 35.32° Celsius. This means that we often define our information as an imprecise perception-based linguistic variable rather than as a precise measure-based number. The imprecision in our formulations basically has four reasons (Zadeh, 2005):

1. Bounded ability of human sensors and computational limits of the brain. (1) Our human sensors do not have the abilities of a laser-based speed controller. So we cannot quantify the speed of a racing car as 252.18 km/h in Albert Park, Melbourne. However, on the linguistic level we can define the car as fast. (2) Most people cannot numerically calculate the exact race distance given by 5.303 km * 58 laps = 307.574 km due to the computational limits of their brains. However, they will probably estimate that it is around 300 km.
2. Lack of numerical information. Melbourne is considered a shopping paradise in Australia since there are countless shops. Maybe only the local government knows the exact number of shops.
3. Qualitative, non-quantifiable information. Much information is provided qualitatively rather than quantitatively. If one describes the quality of a pizza in an Italian restaurant in Lygon Street in Melbourne’s suburb Carlton, only a qualitative, linguistic judgment like excellent or very good is possible. The judgment can hardly be quantified (besides a technical counting of the olives or the weight of the salami etc.).
4. Tolerance for imprecision. Recall the example of Melbourne as a shopping paradise given above. To define Melbourne as a shopping paradise its exact number of shops is not needed. It is sufficient to know that there are many shops. This tolerance for imprecision often makes a statement more robust and efficient in comparison to exact numerical values.

So obviously humans often prefer to deal not with precise information but with the vague information that is inherent in natural language. Humans would rarely formulate a sentence like:

With a probability of 97.34% I will see Ken, who has a height of 1.97m, at 12:05pm.

Instead most humans would prefer to say:

Around noon I will almost certainly meet tall Ken.

While the first formulation is computer compatible since it contains numbers (singletons), the second formulation seems to be too imprecise to be used as input for computers.

A central objective of the concept of granular computing is to bridge this gap and compute with words (Zadeh, 1996). This leads to the ideas of information granularity or granular computing, which were introduced by Zadeh (1979, 1986). The concept of information granularity has its roots in fuzzy set theory (Zadeh, 1965, 1997). Zadeh (1986) advanced and generalized this idea so that granular computing subsumes any kind of uncertainty and imprecision like “set theory and interval analysis, fuzzy sets, rough sets, shadowed sets, probabilistic sets and probability […], high level granular constructs” (Bargiela, Pedrycz, 2002, p. 5). The term granular computing was first suggested by Lin (1998a, 1998b).

FUNDAMENTALS OF GRANULAR COMPUTING

Singular and Granular Values

To describe the difference between natural language and precise information more formally, let us recall the example sentences given in the Background section. The information given in the two sentences can be mapped as depicted in Figure 1. While the first sentence contains exact figures (singletons), the second sentence describes the same context using linguistic variables (granular values). A comparison of the singular and granular values is given in Table 1.

Figure 1. Mapping of singletons and granular values (the sentence “With a probability of 97.34% I will see Ken, who has a height of 1.97m, at 12:05pm.” is mapped to “Around noon I will almost certainly meet tall Ken.”)

Table 1. Singular and granular values
Variable       Singular value    Granular value
Probability    97.34%            almost certainly
Height         1.97m             tall
Time           12:05pm           around noon

For example, the variable height can be mapped to the singleton 1.97m or the granule tall. The granule tall covers not only the singleton 1.97m but also neighborhood values. See Figure 2 for an interval granulation of the singleton of the variable height; a fuzzy membership function (linguistic variable) would be another possibility for a granule of tall (see Figure 3). The main difference between the representations of the variable height is entailed by a different formulation of the constraints. While the formulation as a singleton is bivalent by nature (height=1.97m), a fuzzy formulation would contain memberships. This leads to the concept of generalized constraints.

Figure 2. Presentation of variable height as singleton and granule

Figure 3. Fuzzy memberships

(Figures 2 and 3: axes Height and µ(Height); labels small, tall and 1.97m)
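The three representations of the variable height discussed above (singleton, interval granule and fuzzy granule) can be sketched as membership functions. This is only an illustration: the interval bounds 1.85 m to 2.10 m and the ramp from 1.70 m to 1.90 m are hypothetical values chosen for the example and are not taken from the article's figures.

```python
def singleton(x, value=1.97):
    """Bivalent constraint height = 1.97m: satisfied or not."""
    return 1.0 if x == value else 0.0

def interval_granule(x, low=1.85, high=2.10):
    """Crisp interval granule for 'tall' (bounds are assumptions)."""
    return 1.0 if low <= x <= high else 0.0

def fuzzy_tall(x, start=1.70, full=1.90):
    """Fuzzy granule for 'tall': membership rises linearly from 0 at
    `start` to 1 at `full` (both values are assumptions)."""
    if x <= start:
        return 0.0
    if x >= full:
        return 1.0
    return (x - start) / (full - start)

# A height of 1.95m does not satisfy the singleton constraint,
# but it is covered by both granules.
example = (singleton(1.95), interval_granule(1.95), fuzzy_tall(1.95))
```

The singleton and the interval return only 0 or 1, whereas the fuzzy granule returns intermediate degrees for heights between the two ramp points, which is the step from bivalent to graded constraints.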

Generalized Constraints

Overview of Constraints

The generalization of constraints is a central concept in granular computing. The main purpose is to make classic constraints like ∈ (member), = (equal), < (smaller) and > (greater) more flexible and therefore closer to the way humans think. In the following subsections we will discuss standard, primary and generalized constraints in more detail.

Basic Concept of Generalized Constraints Standard Constraints. A standard constraint C is characterized by its bivalency (possibilistic of veristic) or probabilistic nature. Bivalent and probabilistic constraints do not have memberships degrees which indicate the degree of satisfaction of the constraint A: a variable X does or does not fulfill the standard constraint. Examples for bivalent constraints are: ∈ (member), = (equal); < (smaller) and > (greater) besides others. Primary Constraints. Zadeh (2006a) suggested the following primary constraints: • • •

Possibilistic (r=blank) Probabilistic (r=p) Veristic (r=v)

since they formulate the basic perceptions possibility, likelihood and truth. In contrast to the standard constraints bivalency is no longer required for the possibilistic and veristic constraints. Therefore standard constraints are included in the primary constraints. Applying the primary constraints to our example the second “Ken sentence” of Section 2 we get: •

•

•

Possibilistic Constraint (X is R): Ken is tall → Height(Ken) is tall (see Dubois, Prade (1998) for semantics of fuzzy sets including possibility (Zadeh, 1978)). Probabilistic Constraint (X isp R): Actual arrival time (X) at meeting point → X isp N(μ, σ2) is e.g. normal distributed around the agreed meeting time μ. Veristic Constraint (X isv R): Ken is at the meeting point at 12:05pm → Present(Ken, meeting point) isv 12:05pm.

Generalized Constraints. Further constraints include (Zadeh, 2005) usuality (r=u), random set (r=rs), fuzzy graph (r=fg), bimodal (r=bm) and group (r=g). The set of generalized constraints consists of these and the primary constraints. Formally, a generalized constraint (GC) is given by (Zadeh, 2005): GC(X): X isr R, with X the constrained variable and R the non-bivalent relation. In the term isr, the letter r defines the semantics or modality of the constraint as described above.
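The difference between a bivalent, a possibilistic and a probabilistic constraint can be illustrated with a minimal Python sketch. The membership function mu_tall, its breakpoints (1.70m and 1.90m) and the standard deviation of the arrival time are assumptions chosen for illustration only; they do not come from Zadeh's papers.

```python
import math

def mu_tall(height_m):
    """Hypothetical membership function for the fuzzy set 'tall'.

    Assumption: membership is 0 below 1.70 m, 1 above 1.90 m and
    rises linearly in between.
    """
    return min(1.0, max(0.0, (height_m - 1.70) / 0.20))

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), as used in the constraint X isp N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Standard (bivalent) constraint: either satisfied or not.
bivalent = 1.97 > 1.80

# Possibilistic constraint "Height(Ken) is tall": a degree in [0, 1].
possibility = mu_tall(1.97)

# Probabilistic constraint: arrival time in minutes relative to the agreed
# time, assumed to follow N(0, 5^2); density of arriving 5 minutes late.
density_at_5min = normal_pdf(5.0, 0.0, 5.0)

print(bivalent, possibility, round(density_at_5min, 4))
```

The bivalent constraint returns only true or false, whereas the possibilistic constraint returns a degree of satisfaction and the probabilistic constraint a likelihood.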

Generalized Constraint Language

To formally describe generalized constraints, Zadeh (2006b) suggests a Generalized Constraint Language (GCL). In Section 3.2.2 we already used the GCL in the presented example, e.g. the mapping Ken is tall → Height(Ken) is tall, which has the form p → X isr R, with p an expression in natural language. In this context Zadeh (2006b) defines the translation of natural language into GCL as precisiation. The precisiation can lead to v-precise and/or m-precise results:

• v-precisiation: a precise value is obtained. v-precisiation has s-precisiation (singleton), cg-precisiation (crisp granular) and g-precisiation (granular) as its modalities. s-precisiation leads to a singleton, while cg-precisiation leads to a crisp interval. g-precisiation is the most general form of precisiation and leads to fuzzy intervals and fuzzy graphs, among others.
• m-precisiation: a precise meaning is obtained. m-precisiation can be further divided into the modalities mm-precisiation (machine-oriented) and mh-precisiation (human-oriented).

Examples: (1) Ken is between a and b meters tall is m-precise and, since the variables a and b are not specified, v-imprecise. (2) Ken is approximately c meters tall → Ken is a meters tall is an s-precisiation. The term approximately c can also be abbreviated as c*. The star indicates that c is a granular value.
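The three modalities of v-precisiation can be sketched for the granular value c* ("approximately c"). This is only an illustration: the tolerance delta and the triangular shape used for the g-precisiation are assumptions, since the text does not prescribe a particular membership function.

```python
def s_precisiation(c):
    """Singleton: 'approximately c' is replaced by the single value c."""
    return lambda x: 1.0 if x == c else 0.0

def cg_precisiation(c, delta):
    """Crisp granule: the crisp interval [c - delta, c + delta]."""
    return lambda x: 1.0 if c - delta <= x <= c + delta else 0.0

def g_precisiation(c, delta):
    """Granule: a fuzzy interval, here a triangular fuzzy number (assumption)."""
    return lambda x: max(0.0, 1.0 - abs(x - c) / delta)

c, delta = 1.97, 0.10          # hypothetical values for Ken's height
singleton = s_precisiation(c)
interval = cg_precisiation(c, delta)
fuzzy = g_precisiation(c, delta)

for height in (1.92, 1.97, 2.10):
    print(height, singleton(height), interval(height), round(fuzzy(height), 2))
```

Each function returns a membership function over heights, so the three precisiations of the same statement can be compared value by value.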


Figure 4. Rough sets (axes: Feature 1 and Feature 2; legend: singleton, lower approximation, upper approximation).

In contrast to precisiation, granulation leads to an imprecisiation of the information. Obviously, the translation Ken is 1.97m → Ken is c meters tall is a v-imprecisiation, and Ken is c meters tall → Ken is tall is an m-imprecisiation. For example, rough sets can be interpreted as a cascading cg-imprecisiation. In rough set theory (Pawlak, 1982) a set is described by a lower and an upper approximation (LA and UA, respectively). The lower approximation is a subset of the upper approximation. While the objects in the lower approximation surely belong to the corresponding set, the objects in the upper approximation might belong to the set. Therefore, rough set theory provides an example of a cascading granulation: X ∈ LA ⊂ UA (see Figure 4).
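The lower and upper approximations can be computed directly from a partition of the universe into classes of indiscernible objects. The following Python sketch uses a hypothetical universe of five objects; the function name approximations and the example data are illustrative assumptions.

```python
def approximations(equivalence_classes, target):
    """Return the (lower, upper) approximation of `target`.

    equivalence_classes: a partition of the universe into sets of
    mutually indiscernible objects.
    """
    target = set(target)
    lower, upper = set(), set()
    for block in equivalence_classes:
        block = set(block)
        if block <= target:      # block lies entirely inside the target set
            lower |= block
        if block & target:       # block overlaps the target set
            upper |= block
    return lower, upper

classes = [{"a", "b"}, {"c", "d"}, {"e"}]   # hypothetical indiscernibility classes
target = {"a", "b", "c"}

lower, upper = approximations(classes, target)
print(sorted(lower), sorted(upper))
```

Objects a and b surely belong to the set, while c and d can only be said to possibly belong to it, because they are indiscernible from each other and only c is a member.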

Deduction Rules

Principal Deduction Rules

In this section we take the term granular computing literally, namely how to compute with granules, and focus on the principal deduction rules (Zadeh, 2005, 2006b):

• Conjunction
• Projection
• Propagation

For more details on deduction rules the reader is referred to Zadeh (2005, 2006b).

Generalized Extension Principle

One of the most fundamental theorems in fuzzy logic is the Extension Principle (Zadeh, 1975; Zimmermann, 2001). Basically, the Extension Principle defines how the memberships μy(Y) of an endogenous variable Y=f(X) can be determined, with X and Y singletons and μx(X) given. A simple transformation μy(Y) = μy(f(X)) = μx(X) does not generally provide a unique solution, because several values of X may be mapped to the same Y. Therefore, to obtain a unique solution, the supremum of μx(X) over all X with f(X)=Y is taken. The Generalized Extension Principle (Zadeh, 2006a) establishes a relationship between Y*=f*(X*) and Gr(Y) isr Gr(X), with Y*, X* and f*() granules. It can be considered the primary deduction rule, since many other deduction rules can be derived from it (Zadeh, 2006b).

Example

Let us consider an example (Zadeh, 2005, 2006a, 2006b). The following linguistic statement is given: Most Swedes are tall → (Height(Swedes) are tall) is most. First, let us specify

Swedes are tall → ∫ X(h) μtall(h) dh

with X(h) the height density function and μtall(h) the membership function of the linguistic variable tall. Second, we apply the linguistic variable most to the expression Swedes are tall and obtain:

Most (Swedes are tall) → μmost( ∫ X(h) μtall(h) dh )

As a result we get a precise formulation of the given linguistic statement.
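Both the classical Extension Principle and the precisiation of Most Swedes are tall can be sketched numerically in Python. The sketch rests on several assumptions that are not part of the source: a discrete fuzzy set for X, piecewise linear membership functions mu_tall and mu_most, and a normal height density with mean 1.85m and standard deviation 0.07m. The integral is approximated with a simple midpoint rule.

```python
import math

def extend(f, mu_x):
    """Classical extension principle on a discrete fuzzy set {x: membership}."""
    mu_y = {}
    for x, membership in mu_x.items():
        y = f(x)
        # Supremum over all x that are mapped to the same y.
        mu_y[y] = max(mu_y.get(y, 0.0), membership)
    return mu_y

mu_x = {-2: 0.3, -1: 0.8, 0: 1.0, 1: 0.6, 2: 0.1}
mu_y = extend(lambda x: x * x, mu_x)

def mu_tall(h):
    """Hypothetical membership function of 'tall' (heights in meters)."""
    return min(1.0, max(0.0, (h - 1.70) / 0.20))

def mu_most(p):
    """Hypothetical membership function of 'most' (p is a proportion)."""
    return min(1.0, max(0.0, (p - 0.50) / 0.30))

def height_density(h, mean=1.85, sd=0.07):
    """Assumed height density X(h): a normal density."""
    z = (h - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def proportion_tall(lo=1.40, hi=2.30, steps=9000):
    """Midpoint-rule approximation of the integral of X(h) * mu_tall(h) dh."""
    dh = (hi - lo) / steps
    mids = (lo + (i + 0.5) * dh for i in range(steps))
    return sum(height_density(h) * mu_tall(h) for h in mids) * dh

p = proportion_tall()
truth = mu_most(p)
print(mu_y, round(p, 3), round(truth, 3))
```

The value truth is the degree to which the statement Most Swedes are tall holds under the assumed density and membership functions; other choices would give other degrees.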


CONCLUSION AND FUTURE RESEARCH

Granular computing is a powerful framework for dealing with uncertainty. Information granules can include probabilistic as well as possibilistic phenomena, among others. Granular computing therefore functions as an umbrella for these approaches without competing with them. One core advantage is that it helps to bridge the gap between (imprecise) natural language and the precision inherent in computers. At present, Zadeh is promoting his idea of a Generalized Theory of Uncertainty in many publications and presentations. In the future, Generalized Theory of Uncertainty will probably be the dominant label for anything related to this topic. Since the Generalized Theory of Uncertainty is a young but rapidly emerging branch of science, future research will move towards the generalization of uncertainty concepts, e.g. from probabilistic and fuzzy clustering towards granular clustering.

REFERENCES

Bargiela, A. & Pedrycz, W. (2002). Granular computing: an introduction. Boston: Kluwer Academic Publishers.

Dubois, D. and Prade, H. (1993). Fuzzy sets and probability: misunderstandings, bridges and gaps. In Proceedings of the Second IEEE International Conference on Fuzzy Systems (pp. 1059-1068), San Francisco.

Dubois, D. and Prade, H. (1997). The three semantics of fuzzy sets. Fuzzy Sets and Systems, 90, 141-150.

Lin, T.Y. (1998a). Granular Computing on Binary Relations I: Data Mining and Neighborhood Systems. In Skowron, A. and Polkowski, L. (eds.), Rough Sets In Knowledge Discovery (pp. 107-121), Heidelberg: Physica-Verlag.

Lin, T.Y. (1998b). Granular Computing on Binary Relations II: Data Mining and Neighborhood Systems. In Skowron, A. and Polkowski, L. (eds.), Rough Sets In Knowledge Discovery (pp. 121-140), Heidelberg: Physica-Verlag.

Lin, T.Y. (2002). Fuzzy sets, rough set and probability. In Keller, J. and Nasraoui, O. (eds.), Proceedings of the Annual Meeting of the North American Fuzzy Information Processing Society 2002 (pp. 302-305), University, New Orleans.

Pawlak, Z. (1982). Rough sets. International Journal of Parallel Programming, 11, 341-356.

Ross, T.J., Booker, J.M., & Parkinson, W.J. (2002). Fuzzy Logic and Probability Applications: A Practical Guide. Philadelphia: SIAM - Society for Industrial & Applied Mathematics.

Zadeh, L. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Zadeh, L. (1978). Fuzzy sets as the basis for a theory of possibility. Fuzzy Sets and Systems, 1, 3-28.

Zadeh, L. (1979). Fuzzy Sets and Information Granularity. In Gupta, M., Ragade, R., Yager, R. (eds.), Advances in Fuzzy Set Theory and Applications (pp. 3-18). Amsterdam: North-Holland Publishing.

Zadeh, L. (1986). Outline of a computational approach to meaning and knowledge representation based on the concept of a generalized assignment statement. In Thoma, M. and Wyner, A. (eds.), Proceedings of the International Seminar on Artificial Intelligence and Man-Machine Systems (LNCIS 80, pp. 198-211). Heidelberg: Springer-Verlag.

Zadeh, L. (1995). Discussion: probability theory and fuzzy logic are complementary rather than competitive. Technometrics, 37, 271-276.

Zadeh, L. (1996). Fuzzy logic = computing with words. IEEE Transactions on Fuzzy Systems, 2, 103-111.

Zadeh, L. (1997). Towards a theory of fuzzy information granularity and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, 90, 111-127.

Zadeh, L. (2005). Toward a generalized theory of uncertainty (GTU) - an outline. Information Sciences, 172, 1-40.

Zadeh, L. (2006a). Granular computing - the concept of generalized-constraint-based computing (presentation slides). Proceedings of Rough Sets and Current Trends in Computing: 5th International Conference (LNCS 4259, pp. 12-14). Heidelberg: Springer-Verlag.

Zadeh, L. (2006b). Generalized theory of uncertainty (GTU) - principal concepts and ideas. Computational Statistics & Data Analysis, 51, 15-46.

Zimmermann, H. J. (2001). Fuzzy set theory and its applications. Boston: Kluwer Academic Publishers.

KEY TERMS

Fuzzy Set Theory: Fuzzy set theory was introduced by Zadeh in 1965. The central idea of fuzzy set theory is that an object can belong to more than one set simultaneously. The closeness of the object to a set is indicated by membership degrees.

Generalized Theory of Uncertainty (GTU): GTU is a framework intended to subsume any kind of uncertainty (Zadeh, 2006a). The core idea is to formulate generalized constraints (possibilistic, probabilistic, veristic, etc.). The objective of GTU is not to replace existing theories such as probability theory or fuzzy sets but to provide an umbrella that allows any kind of uncertainty to be formulated in a uniform way.

Granular Computing: The idea of granular computing goes back to Zadeh (1979). The basic idea of granular computing is that an object is described by a collection of values along dimensions such as indistinguishability, similarity and proximity. If a granule is labeled by a linguistic expression, it is called a linguistic variable. Zadeh (2006a) defines granular computing as “a mode of computation in which the objects of computation are generalized constraints”.

Hybridization: Combination of methods such as probabilistic, fuzzy and rough concepts or neural nets, e.g. fuzzy-rough, rough-fuzzy, probabilistic-rough or fuzzy-neural approaches.

Linguistic Variable: A linguistic variable is a linguistic expression (one or more words) labeling an information granule. For example, a membership function is labeled by expressions such as “hot temperature” or “rich customer”.

Membership Function: A membership function shows the membership degrees of a variable in a certain set. For example, a temperature t=30° C belongs to the set “hot temperature” with a membership degree λHT(30°)=0.8. Membership functions are not objective but context- and subject-dependent.

Rough Set Theory: Rough set theory was introduced by Pawlak in 1982. The central idea of rough sets is that some objects are distinguishable while others are indiscernible from each other.

Soft Computing: In contrast to “hard computing”, soft computing is a collection of methods (fuzzy sets, rough sets, neural nets, etc.) for dealing with ambiguous situations involving imprecision and uncertainty, e.g. human expressions like “high profit at reasonable risks”. The objective of applying soft computing is to obtain robust solutions at reasonable cost.


Growing Self-Organizing Maps for Data Analysis

Soledad Delgado
Technical University of Madrid, Spain

Consuelo Gonzalo
Technical University of Madrid, Spain

Estíbaliz Martínez
Technical University of Madrid, Spain

Águeda Arquero
Technical University of Madrid, Spain

INTRODUCTION

Many research areas currently produce large multivariable datasets that are difficult to visualize in order to extract useful information. Kohonen self-organizing maps have been used successfully in the visualization and analysis of multidimensional data. In this work, a projection technique that compresses multidimensional datasets into two-dimensional space using growing self-organizing maps is described. With this embedding scheme, traditional Kohonen visualization methods have been implemented using growing cell structures networks. The new graphical map displays have been compared with Kohonen graphs using two groups of simulated data and one group of real multidimensional data selected from a satellite scene.

BACKGROUND

The first stage of data mining usually consists of building simplified global overviews of data sets, generally in graphical form (Tukey, 1977). At present, the huge amount of information and its multidimensional nature make it difficult to employ direct graphic representation techniques. Self-Organizing Maps (Kohonen, 1982) fit well in exploratory data analysis, since their principal purpose is the visualization and analysis of nonlinear relations between multidimensional data (Rossi, 2006). In this sense, a great variety of visualization techniques for Kohonen’s SOM (Kohonen, 2001) (Ultsch & Siemon, 1990) (Kraaijveld, Mao & Jain, 1995) (Merkl & Rauber, 1997) (Rubio & Giménez, 2003) (Vesanto, 1999), and some automatic map analysis methods (Franzmeier, Witkowski & Rückert, 2005) have been proposed.

In Kohonen’s SOM the network structure has to be specified in advance and remains static during the training process. The choice of an inappropriate network structure can degrade the performance of the network. Some growing self-organizing maps have been implemented to avoid this disadvantage. In (Fritzke, 1994), Fritzke proposed the Growing Cell Structures (GCS) model, with a fixed dimensionality associated with the output map. In (Fritzke, 1995), the Growing Neural Gas (GNG) is presented, a new SOM model that learns topology relations. Even though GNG networks achieve a better degree of topology preservation than GCS networks, they cannot be used to generate graphical map displays in the plane, due to the multidimensional nature of the output map. However, using the GCS model it is possible to create networks with a fixed dimensionality lower than or equal to 3 that can be projected onto a plane (Fritzke, 1994). The GCS model, without removal of cells, has been used to compress biomedical multidimensional data sets to be displayed as two-dimensional colour images (Walker, Cross & Harrison, 1999).

GROWING CELL STRUCTURES VISUALIZATION

This work studies GCS networks in order to obtain an embedding method for projecting the bi-dimensional output map, with the aim of generating several graphic map displays for exploratory data analysis during and after the self-organization process.

Growing Cell Structures

The visualization methods presented in this work are based on the self-organizing map architecture and learning process of Fritzke’s Growing Cell Structures (GCS) network (Fritzke, 1994). The GCS network architecture consists of connected units forming k-dimensional hypertetrahedron structures that are linked to each other. The interconnection scheme defines the neighbourhood relationships. During the learning process, new units are added and superfluous ones are removed, but these modifications are performed in such a way that the original architecture structure is maintained.

The training algorithm is an iterative process that performs a non-linear projection of the input data onto the output map, trying to preserve the topology of the original data distribution. The self-organization process of GCS networks is similar to that of Kohonen’s model. For each input signal the best matching unit (bmu) is determined, and the synaptic vectors of the bmu and of its direct neighbours are modified. In GCS networks each neuron has an associated resource, which can represent the number of input signals received by the neuron or the summed quantization error caused by the neuron. In every adaptation step the resource of the bmu is updated accordingly. After a fixed number of adaptation steps, a new neuron is inserted between the unit with the highest resource, q, and its direct neighbour with the most different reference vector, f. The synaptic vector of the new unit is interpolated from the synaptic vectors of q and f, and the resource values of q and f are redistributed as well. In addition, neighbouring connections are modified in order to preserve the output architecture structure. Once all the training vectors have been processed a fixed number of times (epochs), the neurons whose reference vectors fall into regions with a very low probability density are removed. To preserve the architecture structure, some neighbouring connections are modified too.

The relative normalized probability density estimation value proposed in (Delgado, 2004) has been used in this work to determine the units to be removed. This value allows a better interpretation of some training parameters, improving the removal of cells and the topology preservation of the network.


Several separated meshes could appear in the output map when superfluous units are removed. When the growing self-organization process finishes, the synaptic vectors of the output units along with the neighbouring connections can be used to analyze different input space properties visually.
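The adaptation and insertion steps described above can be sketched in Python for k=2. This is a simplified illustration under stated assumptions and not Fritzke's exact algorithm: the learning rates eps_b and eps_n are hypothetical values, the resource is taken to be the summed quantization error, and the rule that the new unit receives one third of the resources of q and f is a placeholder for the redistribution, which the text does not specify. Removal of units is omitted.

```python
import math

def adapt(weights, neighbours, resource, signal, eps_b=0.06, eps_n=0.002):
    """One adaptation step: find the bmu, update its resource, move bmu and neighbours."""
    bmu = min(weights, key=lambda u: math.dist(weights[u], signal))
    resource[bmu] += math.dist(weights[bmu], signal)  # summed quantization error
    for unit, eps in [(bmu, eps_b)] + [(n, eps_n) for n in neighbours[bmu]]:
        weights[unit] = [w + eps * (s - w) for w, s in zip(weights[unit], signal)]
    return bmu

def insert_unit(weights, neighbours, resource, new_id):
    """Insert a unit between the highest-resource unit q and its most distant neighbour f."""
    q = max(resource, key=resource.get)
    f = max(neighbours[q], key=lambda n: math.dist(weights[q], weights[n]))
    weights[new_id] = [(a + b) / 2 for a, b in zip(weights[q], weights[f])]
    # Split the edge q-f; the new unit is linked to q, f and their common
    # neighbours, so the mesh still consists of triangles (k = 2).
    common = neighbours[q] & neighbours[f]
    neighbours[q].discard(f)
    neighbours[f].discard(q)
    neighbours[new_id] = {q, f} | common
    for n in neighbours[new_id]:
        neighbours[n].add(new_id)
    # Assumed redistribution: q and f each hand over one third of their resource.
    resource[new_id] = 0.0
    for unit in (q, f):
        share = resource[unit] / 3
        resource[unit] -= share
        resource[new_id] += share
    return new_id

# Initial network for k = 2: three mutually connected units (a triangle).
weights = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbours = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
resource = {0: 0.0, 1: 0.0, 2: 0.0}

bmu = adapt(weights, neighbours, resource, [0.9, 0.1])
total_before = sum(resource.values())
new = insert_unit(weights, neighbours, resource, new_id=3)
print(bmu, new, sorted(neighbours[new]))
```

In a full training loop, adapt would be called for every input signal and insert_unit after a fixed number of adaptation steps.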

Network Visualization: Constructing the Topographic Map

The ability to project high-dimensional input data onto a low-dimensional grid is an important property of Kohonen feature maps. By drawing the output map on a plane it is possible to visualize complex data and discover properties or relations of the input vector space that were not expected in advance. The output layer of a Kohonen feature map can easily be drawn on a plane as a rectangular grid, where each cell represents an output neuron and neighbouring cells correspond to neighbouring output units. GCS networks have less regular output unit connections than Kohonen networks. When the architecture factor k=2 is used, the GCS output layer is organized in groups of interconnected triangles. In spite of the bi-dimensional nature of these meshes, it is not obvious how to embed this structure into the plane in order to visualize it.

In (Fritzke, 1994), Fritzke proposed a physical model to construct the bi-dimensional embedding during the self-organization process of the GCS network. Each output neuron is modelled by a disc, with diameter d, made of elastic material. Two discs with distance d between their centres touch each other, and two discs with a distance smaller than d repel each other. Each neighbourhood connection is modelled as an elastic string. Two discs that are connected but not touching are pulled towards each other. Finally, all discs are positively charged and repel each other. Using this model, the bi-dimensional topographic coordinates of each output neuron can be obtained, and thus the bi-dimensional output meshes can be drawn on a plane.

In order to obtain the bi-dimensional coordinates of the output units in the topographic map (for k=2), a slightly modified version of this physical model has been used in this contribution. At the beginning of the training process, the initial three output neurons are placed in the plane in the form of a triangle. Each time a new neuron is inserted, it is placed exactly halfway between the positions of the two neighbouring neurons between which it has been inserted. After this, attraction and repulsion forces are calculated for every output neuron and its position is moved accordingly. The attraction force of a unit is calculated as the sum of the individual attraction forces that all its neighbouring connections exert on it. The attraction force between two neighbouring neurons i and j, with coordinates pi and pj in the plane and Euclidean distance e, is calculated as (e-d)/2 if e≥d, and 0 otherwise. The repelling force of a unit is calculated as the sum of the individual repulsion forces that all non-neighbouring output neurons exert on it. The repelling force between two non-neighbouring neurons i and j is calculated as d/5 if 2d<e≤3d, d/2 if d<e≤2d, d if 0<e≤d, and 0 otherwise.

There are three basic differences between the embedding model used in this work and Fritzke’s model. First, the repelling force is only calculated for non-neighbouring units. Second, the attraction force between two neurons i and j is multiplied by the distance normalization ((pj-pi)/e) and by the attraction factor 0.1 (instead of 1). Last, the repelling force between two neurons i and j is multiplied by the distance normalization ((pi-pj)/e) and by the repulsion factor 0.05 (instead of 0.2). The result of applying this projection method is shown in Fig. 1. When removal of cells is performed, the different meshes are shown unconnected. Without any additional information, this projection method makes cluster detection possible.
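The force rules above translate almost directly into code. The following Python sketch implements one relaxation step of the modified embedding model; the function names, the dictionary-based data layout and the choice to update all positions synchronously are assumptions made for illustration.

```python
import math

def attraction(e, d):
    """Attraction magnitude between two neighbouring units at distance e."""
    return (e - d) / 2 if e >= d else 0.0

def repulsion(e, d):
    """Repulsion magnitude between two non-neighbouring units at distance e."""
    if 0 < e <= d:
        return d
    if d < e <= 2 * d:
        return d / 2
    if 2 * d < e <= 3 * d:
        return d / 5
    return 0.0

def relax(pos, neighbours, d=1.0, k_att=0.1, k_rep=0.05):
    """Move every unit once according to the summed attraction and repulsion forces."""
    new_pos = {}
    for i, pi in pos.items():
        fx = fy = 0.0
        for j, pj in pos.items():
            if j == i:
                continue
            e = math.dist(pi, pj)
            if e == 0:
                continue  # coincident units: direction undefined, skip
            if j in neighbours[i]:
                # Attraction acts along (pj - pi) / e with factor 0.1.
                mag = k_att * attraction(e, d)
                ux, uy = (pj[0] - pi[0]) / e, (pj[1] - pi[1]) / e
            else:
                # Repulsion acts along (pi - pj) / e with factor 0.05.
                mag = k_rep * repulsion(e, d)
                ux, uy = (pi[0] - pj[0]) / e, (pi[1] - pj[1]) / e
            fx += mag * ux
            fy += mag * uy
        new_pos[i] = (pi[0] + fx, pi[1] + fy)
    return new_pos

# Two connected units that are too far apart are pulled together.
pulled = relax({0: (0.0, 0.0), 1: (3.0, 0.0)}, {0: {1}, 1: {0}})
# Two unconnected units that overlap are pushed apart.
pushed = relax({0: (0.0, 0.0), 1: (0.5, 0.0)}, {0: set(), 1: set()})
print(pulled, pushed)
```

Calling relax repeatedly after each insertion lets the mesh settle before it is drawn.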

Visualization Methods

Using the projection method described above, traditional Kohonen visualization methods can be implemented using GCS networks with k=2. Each output neuron is painted as a circle whose colour is determined by the value of a given parameter. When greyscale is used, dark and light tones are normally associated with high and low values, respectively. The grey scales are relative to the maximum and minimum values taken by the parameter. The nature of the data used to calculate the parameter determines three general types of methods for the visual analysis of self-organizing maps: distances between synaptic vectors, projection of training patterns over the neurons, and individual information about synaptic vectors.

All the experiments have been performed using two groups of simulated data and one group of real multidimensional data (Fig. 2) selected from a scene registered by the ETM+ sensor (Landsat 7). The input signals are defined by the six ETM+ spectral bands with the same spatial resolution: TM1 to TM5, and TM7. The input data set has a total of 1800 pixels, 1500 carefully chosen from the original scene and 300 randomly selected. The input vectors are associated with six land cover categories.

Figure 1. Output mesh projection during different stages, (a) to (e), of the self-organization process of a GCS network trained with bi-dimensional vectors distributed over eleven separate regions.

Figure 2. (a) Eleven separate regions in the bi-dimensional plane. (b) Two three-dimensional chain-links. (c) Projection of the multidimensional data of the satellite image (axes: TM4, TM5, TM7).

Displaying Distances

The adaptation process of GCS networks places the synaptic vectors in regions with high probability density and removes units positioned in regions with a very low probability density. A graphical representation of the distances between the synaptic vectors is a useful tool for detecting clusters in the input space. The distance map, the unified distance map (U-map) and the distance addition map have been implemented to represent distance information with GCS networks. In the distance map, the mean distance between the synaptic vector of each neuron and the synaptic vectors of all its direct neighbours is calculated. The U-map represents the same information as the distance map but, in addition, it includes the distance between all pairs of neighbouring neurons (painted as a circle between each pair of neighbouring units). Finally, when the distance addition map is generated, the sum of the distances between the synaptic vector of a neuron and the synaptic vectors of all other units is calculated. In the distance map and the U-map, dark zones represent clusters and light zones the boundaries between them. In the distance addition map, neurons with nearby synaptic vectors appear with a similar colour, and boundaries can be detected by analyzing the regions where a considerable colour variation exists. With GCS networks, separated meshes usually represent different input clusters. Fig. 3 shows an example of these three graphs, compared with the traditional Kohonen maps, when a data set distributed over eleven separate regions is used. The GCS network clearly represents eleven clusters in the three graphs. The distance map and U-map of the Kohonen network show the eleven clusters too, but in the distance addition map it is not possible to distinguish them.
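The distance map and the distance addition map can be computed from the synaptic vectors and the neighbourhood connections alone. A minimal Python sketch follows; the function names and the three-unit example network are hypothetical, and every unit is assumed to have at least one direct neighbour.

```python
import math

def distance_map(weights, neighbours):
    """Mean distance from each unit's synaptic vector to those of its direct neighbours."""
    return {
        u: sum(math.dist(weights[u], weights[n]) for n in neighbours[u]) / len(neighbours[u])
        for u in weights
    }

def distance_addition_map(weights):
    """Sum of distances from each unit's synaptic vector to those of all other units."""
    return {
        u: sum(math.dist(weights[u], w) for v, w in weights.items() if v != u)
        for u in weights
    }

def to_grey(values):
    """Scale values to [0, 1] relative to their minimum and maximum."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0   # avoid division by zero when all values are equal
    return {u: (v - lo) / span for u, v in values.items()}

# Hypothetical network: three units connected in a chain 0 - 1 - 2.
weights = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (3.0, 8.0)}
neighbours = {0: {1}, 1: {0, 2}, 2: {1}}

dmap = distance_map(weights, neighbours)
amap = distance_addition_map(weights)
grey = to_grey(dmap)
print(dmap, amap, grey)
```

The scaled values returned by to_grey can be mapped to grey tones when each unit is painted at its topographic coordinates.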

Displaying Projections

This technique takes the distribution of the input patterns into account to generate the value assigned to each neuron. For GCS networks, data histograms and quantization error maps have been implemented. To generate the histogram, the number of training patterns associated with each neuron is obtained. To produce the quantization error graph, the sum of the distances between the synaptic vector of a neuron and the input vectors that lie in its Voronoi region is calculated. In both graphs, dark and light zones correspond to high and low probability density areas, respectively, so they can be used in cluster analysis. Fig. 4 shows an example of these two methods compared with those obtained using Kohonen’s model when the chain-link distribution data set is used. Using Kohonen’s model it is difficult to distinguish the number of clusters present in the input space. The GCS model, on the other hand, has generated three output meshes, two of them representing one ring.
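Both projection displays can be obtained in a single pass over the training patterns. The Python sketch below assigns each pattern to its best matching unit, which is equivalent to locating the Voronoi region it lies in; the function name project and the example data are assumptions for illustration.

```python
import math

def project(weights, patterns):
    """Return (hits, qerr): pattern count and summed quantization error per unit."""
    hits = {u: 0 for u in weights}
    qerr = {u: 0.0 for u in weights}
    for x in patterns:
        # The best matching unit is the unit whose Voronoi region contains x.
        bmu = min(weights, key=lambda u: math.dist(weights[u], x))
        hits[bmu] += 1
        qerr[bmu] += math.dist(weights[bmu], x)
    return hits, qerr

# Hypothetical network with two units and three training patterns.
weights = {0: (0.0, 0.0), 1: (10.0, 0.0)}
patterns = [(1.0, 0.0), (0.0, 2.0), (9.0, 0.0)]

hits, qerr = project(weights, patterns)
print(hits, qerr)
```

The dictionary hits corresponds to the data histogram and qerr to the quantization error map; both can be scaled to grey tones relative to their minimum and maximum values.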

Figure 3. From left to right: distance map, U-map (unified distance map), and distance addition map when an eleven separate regions distribution data set is used. (a) Kohonen feature map with 10x10 grid of neurons. (b) GCS network with 100 output neurons. The right column shows the input data and the network projection using the two component values of the synaptic vectors.



Displaying Components

The displaying components technique analyzes each component of the synaptic (reference) vectors individually. This kind of graph offers a visual analysis of the topology preservation of the network, and makes it possible to detect correlations and dependences between components of the training data. Direct visualization of synaptic vectors and component plane graphs have been implemented for GCS networks. The direct visualization map represents each neuron as a circle with its synaptic vector drawn graphically inside it. This graph can be complemented with any of the graphs described in the previous sections, enriching its interpretation. A component plane map visualizes an individual component of all the synaptic vectors. When all the component planes are generated, relations between weights can be appreciated if similar structures appear in identical places of two different component planes. Fig. 5 shows an example of these two display methods when the multi-band data of the satellite image are used. The direct visualization map shows the similarity between the synaptic vectors of neighbouring units, and it is worth noting that all the neurons in a cluster have similar synaptic vector shapes. Furthermore, the integrated information from the distance addition map shows that there is no significant colour variation inside the same cluster. The six component plane graphs exhibit possible dependences among the TM1, TM2 and TM3 input vector components, and between the TM5 and TM7 components too.

Figure 4. From left to right: unified distance map, data histogram and quantization error map when the chain-link distribution data set is used. (a) Kohonen feature map with a 10x10 grid of neurons. (b) GCS network with 100 output neurons. The right column shows the input data and the network projection using the three component values of the synaptic vectors.

Figure 5. GCS network trained with the multidimensional data of the satellite image, 54 output neurons. Graphs (a) to (f) show the component planes for the six elements of the synaptic vectors. (g) Direct visualization map using the distance addition map as additional information.

Results

Several Kohonen and GCS networks have been trained in order to evaluate and compare the resulting visualization graphs. For reasons of space, only a few of these maps have been included here. Fig. 3 and Fig. 4 compare Kohonen and GCS visualizations using the distance map, U-map, distance addition map, data histogram and quantization error map. It can be observed that the GCS model offers much better graphical results in cluster analysis than Kohonen networks. Because units and connections inside low probability density areas are removed, a GCS network presents, within a particular cluster, the same quality of information that a Kohonen network presents for the entire map. As already mentioned, the grey scale used in all the maps is relative to the maximum and minimum values taken by the studied parameter. In all cases the range of values taken by the calculated factor is smaller with GCS than with Kohonen maps.

The visualization methods presented here have given very satisfactory results when applied to the visual analysis of multidimensional satellite data (Fig. 5). All trained GCS networks have been able to generate six sub-maps in the output layer (in some cases up to eight) that identify the six land cover classes present in the data sample. The direct visualization map and the component plane graphs have proved to be useful tools for extracting knowledge from the multisensorial data.

FUTURE TRENDS

The proposed knowledge visualization method based on GCS networks has proved to be a useful tool for multidimensional data analysis. In order to evaluate the quality of the trained networks, we consider it necessary to develop measurement techniques (qualitative and quantitative, in numerical and graphical format) to analyze the topology preservation obtained. In this way we will be able to validate the information visualized by the methods presented in this paper. It would also be interesting to validate these visualization methods with new data sets of very high dimensionality. The viability of cluster analysis with this projection technique needs to be studied when this class of data samples is used.

CONCLUSION

The embedding method presented here allows multidimensional data to be displayed as two-dimensional grey images. Human observers can use their visual-spatial abilities to explore these graphical maps and extract interrelations and characteristics of the dataset. In the GCS model the network size does not have to be specified in advance. During the training process, the size of the network grows and shrinks, adapting its architecture to the particular characteristics of the training dataset. Although GCS networks require a greater number of training factors to be determined than the Kohonen model, the modified learning model simplifies the tuning of the training factor values. In fact, several experiments have been carried out on datasets of diverse nature using the same values for all the training factors, giving excellent results in all cases. Especially notable is the cluster detection during the self-organization process without any additional information.

REFERENCES

Delgado S., Gonzalo C., Martínez E., & Arquero A. (2004). Improvement of Self-Organizing Maps with Growing Capability for Goodness Evaluation of Multispectral Training Patterns. IEEE International Proceedings of the Geoscience and Remote Sensing Symposium. 1, 564-567.

Franzmeier M., Witkowski U., & Rückert U. (2005). Explorative data analysis based on self-organizing maps and automatic map analysis. Lecture Notes in Computer Science. 3512, 725-733.

Fritzke, B. (1994). Growing Cell Structures - A self-organizing Network for Unsupervised and Supervised Learning. Neural Networks. 7(9), 1441-1460.

Fritzke, B. (1995). A growing neural gas network learns topologies. Advances in neural information processing systems. 7, 625-632.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics. (43), 59-69.

Kohonen, T. (2001). Self-Organizing Maps (3rd ed.). Springer, Berlin Heidelberg New York.

Kraaijveld MA., Mao J., & Jain AK. (1995). A non linear projection method based on Kohonen’s topology preserving maps. IEEE Transactions on Neural Networks. 6(3), 548-559.

Merkl D., & Rauber A. (1997). Alternative ways for cluster visualization in self-organizing maps. Workshop on Self-Organizing Maps, Helsinki, Finland.

Rossi, F. (2006). Visual data mining and machine learning. Proceedings of the European Symposium on Artificial Neural Networks, Bruges, Belgium. 251-264.

Rubio M., & Giménez V. (2003). New methods for self-organizing map visual analysis. Neural Computation & Applications. 12, 142-152.

Tukey, JW. (1977). Exploratory data analysis. Addison-Wesley, Reading, MA.

Ultsch A., & Siemon HP. (1990). Kohonen self-organizing feature maps for exploratory data analysis. Proceedings of the International Neural Network, Dordrecht, The Netherlands.

Vesanto, J. (1999). SOM-based visualization methods. Intelligent Data Analysis, Elsevier Science, 3(2), 111-126.

Walker AJ., Cross SS., & Harrison RF. (1999). Visualisation of biomedical datasets by use of growing cell structure networks: a novel diagnostic classification technique. The Lancet, Academic Research Library. 1518-1521.

KEY TERMS

Artificial Neural Networks: An interconnected group of units or neurons that uses a mathematical model for information processing based on a connectionist approach to computation.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships, or obtaining systems that perform useful tasks such as classification, prediction, estimation, or affinity grouping.

Exploratory Data Analysis: A philosophy about how data analysis should be carried out. Exploratory data analysis employs a variety of techniques (mostly graphical) to extract the knowledge inherent in the data.

Growing Cell Structures: A growing variant of the self-organizing map model, which dynamically adapts the size and connections of the output layer to the characteristics of the training patterns.

Knowledge Visualization: The creation and communication of knowledge through the use of complementary graphic representation techniques, both computer-based and non-computer-based.

Self-Organizing Map: A subtype of artificial neural network. It is trained using unsupervised learning to produce a low-dimensional representation of the training samples while preserving the topological properties of the input space.

Unsupervised Learning: A method of machine learning where a model is fit to observations. It is distinguished from supervised learning by the fact that there is no a priori output.


GTM User Modeling for aIGA Weight Tuning in TTS Synthesis

Lluís Formiga
Universitat Ramon Llull, Spain

Francesc Alías
Universitat Ramon Llull, Spain

INTRODUCTION

Unit Selection Text-to-Speech Synthesis (US-TTS) systems produce synthetic speech by retrieving previously recorded speech units from a speech database (corpus), driven by a weighted cost function (Black & Campbell, 1995). To obtain high-quality synthetic speech, these weights must be optimized efficiently. To that end, previous work introduced a weight tuning technique based on evolutionary perceptual tests by means of Active Interactive Genetic Algorithms (aiGAs) (Alías, Llorà, Formiga, Sastry & Goldberg, 2006). aiGAs mine models that map subjective user preferences by means of partial ordering graphs, synthetic fitness and Evolutionary Computation (EC) (Llorà, Sastry, Goldberg, Gupta & Lakshmi, 2005). Although aiGAs provide an effective method to map the preferences of a single user, as far as we know, the methodology to extract common solutions among different individual preferences (hereafter denoted as common knowledge) has not been tackled yet. Furthermore, there is an ambiguity problem to be solved when different users evolve to different weight configurations. In this review, Generative Topographic Mapping (GTM) is introduced as a method to extract common knowledge from aiGA models obtained from user preferences.

BACKGROUND

Weight Tuning in Unit-Selection Text-to-Speech Synthesis

The aim of US-TTS is to generate synthetic speech by concatenating the sequence of units that best fits the requirements derived from the input text. The speech units are retrieved from a database (speech corpus) that stores speech units previously recorded, typically by a professional speaker. The text-to-speech workflow is generally modelled as two independent blocks that convert written text into a speech signal. The first is the Natural Language Processing (NLP) block, which is followed by the Digital Signal Processing (DSP) block. First, the NLP block preprocesses the text (e.g. converting digits or acronyms to words) and then converts graphemes to phonemes. In its last stage, the NLP block assigns quantified prosody parameters to each phoneme, guiding the way each phoneme is converted to signal. Generally, these quantified prosody parameters involve duration, pitch and energy. Next, the DSP block retrieves from the speech corpus the sequence of units that best matches the target requirements (the phonemes and their prosody). Finally, the speech units are assembled to obtain the output speech signal.

The retrieval process is carried out by a dynamic programming algorithm (e.g. Viterbi or A* (Formiga & Alías, 2006)) driven by a cost function. The cost function computes the cost of selecting a unit within a sequence as the sum of two weighted subcosts (see equation (1)): the target subcost ($C^t$) and the concatenation subcost ($C^c$). In this work, $C^t$ is considered as a weighted linear combination of the normalized prosody distances between the prosody vector predicted by the NLP block for the target and the prosody vector of the candidate unit (see equation (2)). In turn, $C^c$ is computed as a weighted linear combination of the distances between the feature vectors of the speech signal around the concatenation point (see equation (3)).

$$C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^t(t_i, u_i) + \sum_{i=2}^{n} C^c(u_{i-1}, u_i) \tag{1}$$

$$C^t(t_i, u_i) = \sum_{j=1}^{p} w_j^t \, C_j^t(t_i, u_i) \tag{2}$$

$$C^c(u_{i-1}, u_i) = \sum_{j=1}^{q} w_j^c \, C_j^c(u_{i-1}, u_i) \tag{3}$$

where $t_1^n$ represents the target unit sequence $\{t_1, t_2, \ldots, t_n\}$ and $u_1^n$ represents the candidate unit sequence $\{u_1, u_2, \ldots, u_n\}$. The individual subcosts are given by

$$C_j^t(t_i, u_i) = 1 - e^{-\left(\frac{P_j(t_i) - P_j(u_i)}{\sigma_{P_j}}\right)^2} \tag{4}$$

$$C_j^c(u_{i-1}, u_i) = 1 - e^{-\left(\frac{P_j^R(u_{i-1}) - P_j^L(u_i)}{\sigma_{P_j}}\right)^2} \tag{5}$$
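As a minimal sketch of how equations (1) to (5) fit together, the following Python fragment evaluates the cost of one candidate sequence. The dictionary layout of targets and units, the feature names, the use of scalar features only, and all function names are assumptions made for illustration; they are not taken from the original system.

```python
import math


def subcost(difference, sigma):
    """Normalized distance 1 - exp(-(difference / sigma)^2), as in (4) and (5)."""
    return 1.0 - math.exp(-((difference / sigma) ** 2))


def target_cost(target, unit, weights, sigmas):
    """Equation (2): weighted sum of prosody subcosts between target and unit."""
    return sum(
        w * subcost(target[name] - unit["prosody"][name], sigmas[name])
        for name, w in weights.items()
    )


def concatenation_cost(prev_unit, unit, weights, sigmas):
    """Equation (3): weighted sum of subcosts across the join between two units.

    Compares the right edge of the previous unit with the left edge of the
    current one, feature by feature.
    """
    return sum(
        w * subcost(prev_unit["right"][name] - unit["left"][name], sigmas[name])
        for name, w in weights.items()
    )


def sequence_cost(targets, units, w_t, w_c, sigma_t, sigma_c):
    """Equation (1): target costs for i = 1..n plus join costs for i = 2..n."""
    total = sum(target_cost(t, u, w_t, sigma_t) for t, u in zip(targets, units))
    total += sum(
        concatenation_cost(prev, cur, w_c, sigma_c)
        for prev, cur in zip(units, units[1:])
    )
    return total


# Hypothetical two-unit example (all values invented for illustration).
w_t = {"duration": 0.5, "pitch": 0.5}
w_c = {"pitch": 1.0}
sigma_t = {"duration": 10.0, "pitch": 20.0}
sigma_c = {"pitch": 20.0}
targets = [{"duration": 80.0, "pitch": 120.0}, {"duration": 60.0, "pitch": 110.0}]
units = [
    {"prosody": {"duration": 80.0, "pitch": 120.0},
     "left": {"pitch": 118.0}, "right": {"pitch": 115.0}},
    {"prosody": {"duration": 60.0, "pitch": 110.0},
     "left": {"pitch": 115.0}, "right": {"pitch": 108.0}},
]
print(sequence_cost(targets, units, w_t, w_c, sigma_t, sigma_c))  # 0.0: perfect match
```

In a real unit selection system this function would be evaluated inside the dynamic programming search rather than on a fixed sequence; the sketch only shows how the weights enter the cost.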

Appropriate design of the cost function by means of weight training is crucial to obtain high-quality synthetic speech (Black, 2002). Nevertheless, this problem has been addressed by different approaches, without a single accepted solution. Several techniques have been suggested for weight tuning, which may be split into three families: i) manual tuning, ii) computationally-driven purely objective methods and iii) perceptually optimized techniques (Alías, Llorà, Formiga, Sastry & Goldberg, 2006). The present review focuses on techniques that incorporate human feedback into the training process, following previous work (Alías, Llorà, Formiga, Sastry & Goldberg, 2006), which is outlined in the next section.

The Approach: Interactive Evolutionary Weight Tuning

Computationally-driven purely objective methods mainly focus on an acoustic measure (obtained from cepstral distances) between the resynthesized and the natural signals. Hunt and Black adopted two approaches (Hunt & Black, 1996). The first approach was based on adjusting the weights through an exhaustive search of a prediscretized weight space (weight space search, WSS). The second approach used a multilinear regression technique (MLR) across the entire database to compute the desired weights. Later, Meron and Hirose (Meron & Hirose, 1999) presented a methodology that improved the efficiency of the WSS and refined the MLR method. In a previous work (Alías & Llorà, 2003), evolutionary computation was introduced to perform this tuning. More precisely, Genetic Algorithms (GA) were applied to obtain the most appropriate weights. The main added value of using GA to find the optimal weight configuration is their independence from linear search models (as in MLR) and, in addition, the avoidance of an exhaustive search (as in WSS). However, all these methods share a drawback: they depend on an acoustic measure to determine the actual quality of the synthesized speech, which is largely relative to human hearing.

To obtain better speech quality, it was suggested that the user should take part in the process. In (Alías, Llorà, Iriondo, Sevillano, Formiga & Socoró, 2004), preference tests were conducted by synthesizing the training text according to two different weight configurations and comparing the subjective quality of the resulting speech. Subsequently, Active Interactive Genetic Algorithms were presented in (Llorà, Sastry, Goldberg, Gupta & Lakshmi, 2005) as an interactive evolutionary computation method in which user feedback evolves the solutions through a survival-of-the-fittest mechanism. The inherent fitness of the solutions is based on the partial order provided by the evaluator; active iGAs base their efficiency on evolving different solutions by means of a surrogate fitness, which generalizes the user preferences. This surrogate fitness and the evolutionary process are based on the following key elements: i) partial ordering, ii) induced complete order, and iii) a surrogate function via ε-Support Vector Machines (ε-SVM). Preference decisions made by the user are modelled as a directed graph, which is used to generate a partial ordering of solutions (e.g. $\hat{x}_1 > \hat{x}_2;\ \hat{x}_2 > \hat{x}_3 : \hat{x}_1 \rightarrow \hat{x}_2 \rightarrow \hat{x}_3$) (see Figure 1). Table 1 shows the estimation of the global ranking based on the dominance measure: given a vertex $v$, the number of dominated vertexes $\delta(v)$ and the number of dominating vertexes $\phi(v)$ are computed. Using these measures, the estimated fitness may be computed as $\hat{f}(v) = \delta(v) - \phi(v)$. The estimated ranking $\hat{r}(v)$ is obtained by sorting based on $\hat{f}(v)$ (Llorà, Sastry, Goldberg, Gupta & Lakshmi, 2005). The procedure of aiGA is detailed in Algorithm 1.

However, once the global weights were obtained with aiGA, there was no single dominant weight solution (Alías, Llorà, Formiga, Sastry & Goldberg, 2006), i.e. the tests performed by different users gave solutions that were similar in some cases and different in others. This fact implied that a second group of users had to validate the obtained weights.
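The dominance-based estimate can be illustrated with a short Python sketch. It assumes that dominance is transitive, so that δ(v) counts every vertex reachable from v and φ(v) counts every vertex from which v is reachable, and that tied solutions share the average of their rank positions (which is consistent with the ranks 1, 2.5, 2.5 reported in Table 1). The graph encoding and function names are hypothetical.

```python
def reachable(graph, start):
    """Return the set of vertices reachable from start (start itself excluded)."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, ()))
    seen.discard(start)
    return seen


def estimated_fitness(graph):
    """Estimated fitness delta(v) - phi(v) for every vertex of the graph.

    graph maps each solution to the list of solutions it was preferred over.
    """
    vertices = set(graph) | {v for losers in graph.values() for v in losers}
    dominated = {v: reachable(graph, v) for v in vertices}
    delta = {v: len(dominated[v]) for v in vertices}
    phi = {v: sum(v in dominated[u] for u in vertices) for v in vertices}
    return {v: delta[v] - phi[v] for v in vertices}


def estimated_ranking(fitness):
    """Rank 1 is the highest fitness; ties share the average of their positions."""
    ordered = sorted(fitness, key=fitness.get, reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and fitness[ordered[j + 1]] == fitness[ordered[i]]:
            j += 1
        average = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[ordered[k]] = average
        i = j + 1
    return ranks


# The chain x1 -> x2 -> x3 used as the example in the text.
graph = {"x1": ["x2"], "x2": ["x3"]}
fitness = estimated_fitness(graph)   # {"x1": 2, "x2": 0, "x3": -2}
ranking = estimated_ranking(fitness)  # {"x1": 1.0, "x2": 2.0, "x3": 3.0}
```

The sketch presupposes that the graph has already been cleaned of cycles; with a cycle, a vertex would both dominate and be dominated by the same solutions.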



Thus, clustering the solutions obtained from the different tests was a suitable approach to the weight tuning problem, with the goal of extracting consistent results from the user tests.

Figure 1. (Left) Partial evaluations allow building a directed graph among all solutions. (Right) The obtained graph must be cleaned to avoid cycles and draws.

Table 1. Estimation of the global ranking based on the dominance measure. Dur T, Ene T and Pit T stand for the values of the target weights (duration, energy and pitch). In the same way, Pit C, Ene C and Mfc C stand for the values of the concatenation weights (pitch, energy and Mel Frequency Cepstrum).

v     | δ(v) | φ(v) | f̂(v) | Dur T | Ene C | Ene T | Mfc C | Pit C | Pit T | r̂(v) | ε-SVM f̂(v)
10718 | 15   | 0    | 15   | 0.04  | 0.32  | 0.04  | 0.27  | 0.22  | 0.12  | 1    | (0.189)
10721 | 11   | 1    | 10   | 0.27  | 0.08  | 0.26  | 0.1   | 0.11  | 0.18  | 2.5  | (3.294)
10723 | 11   | 1    | 10   | 0.17  | 0.16  | 0.05  | 0.01  | 0.31  | 0.29  | 2.5  | (3.286)
...   | ...  | ...  | ...  | ...   | ...   | ...   | ...   | ...   | ...   | ...  | ...
13271 | 1    | 14   | -13  | 0.17  | 0.16  | 0.02  | 0.11  | 0.31  | 0.23  | 21   | (13.091)
13272 | 1    | 14   | -13  | 0.24  | 0.12  | 0.25  | 0.12  | 0.26  | 0.01  | 21   | (20.174)
13269 | 0    | 19   | -19  | 0.01  | 0.21  | 0.01  | 0.17  | 0.32  | 0.27  | 23   | (23.792)

Algorithm 1. Description of the active iGA.

procedure aiGA()
1. Create an empty directed graph G.
2. Create 2^h random initial solutions (set S).
3. Create the hierarchical tournament set T using the available solutions in S.
4. Present the tournaments in T to the user and update the partial ordering in G.
5. Estimate r̂(v) for each v ∈ S.
6. Train the surrogate ε-SVM synthetic fitness based on S and r̂(v).
7. Optimize the surrogate ε-SVM synthetic fitness with cGA.
8. Create the set S′ with 2^(h−1) different solutions, where S ∩ S′ = ∅, by sampling from the probabilistic model evolved by cGA.
9. Create the hierarchical tournament set T′ with 2^h − 1 tournaments using 2^(h−1) solutions in S and 2^(h−1) solutions in S′.
10. S ⇐ S ∪ S′.
11. T ⇐ T ∪ T′.
12. Go to 4 while not converged.

GENERATIVE TOPOGRAPHIC MAPPING BEING A PLUS

GTM in a Nutshell

Unsupervised learning makes it possible to group sparse data into clusters according to the similarity of the data samples. Several methods perform this grouping (Figueiredo & Jain, 2002): Expectation Maximization (EM), k-means, Gaussian Mixture Models (GMM), Self-Organizing Maps (SOM) and Generative Topographic Mapping, among others. According to (Figueiredo & Jain, 2002), these techniques may be grouped into two types of formulation: i) model-based methods (e.g. GMM, EM, GTM) and ii) heuristic methods (e.g. k-means or hierarchical agglomerative methods). The distinguishing property is the number of sources assumed to generate the data. Model-based methods suppose that each observation has been generated by one (arbitrarily chosen and unidentified) source out of a set of alternative sources. Therefore, inferring these sources and mapping each observation to its source leads to a clustering of the set of observations. In contrast, heuristic methods assume only one source for the observed data, considering a similar heterogeneity for all of them.

Self-Organizing Maps (or Kohonen maps) (Kohonen, 1990) are a clustering technique based on neural networks. The main added value of SOM is the ease of visualizing multidimensional data. Generative Topographic Mapping, in turn, is a nonlinear latent variable model introduced in (Bishop, Svensen & Williams, 1998). GTM is intended as an alternative to SOM that overcomes its limitations, which are listed in (Kohonen, 2006): i) the absence of a cost function, ii) the lack of a theoretical basis for choosing learning rate parameter schedules and neighbourhood parameters to ensure topographic ordering, iii) the absence of any general proofs of convergence and iv) the fact that the model does not define a probability density. GTM is based on a constrained mixture of Gaussians whose parameters can be tuned through the EM algorithm. The handicap of heuristic-based models is that there is no a priori distribution of the centroids of each cluster. In GTM, the set of latent points is modelled as a grid. Each point in the grid corresponds, through a weighted combination of non-linear basis functions, to the centre of a circular Gaussian distribution in the multidimensional space. Thus, the grid is shaped to wrap the data thanks to the explicit ordering among the Gaussian distributions.


Modelling User Preferences by GTM

GTM is able to extract solutions from the different aiGA evolved graphs thanks to the consistency of its theoretical basis. The key objective is to recognize important clusters inside the evolved data space and then determine the fitness entropy of each cluster, in terms of fitness variance, in order to choose the global weight configuration set. GTM can map the best aiGA weights from the multidimensional weight space onto a two-dimensional space. Choosing the cluster with the highest averaged fitness and the lowest standard deviation allows selecting the best weight configuration from the aiGA models of different users. To adjust this method, the geometry of the Gaussian distributions and the size of the latent space have to be set up manually. The EM algorithm then adjusts the GTM centroids and the basis-function weights. Next, the average fitness and its standard deviation are extracted from each cluster. Both statistics are computed over the set of weight combinations whose Bayesian a posteriori probability is highest for that cluster.

Note that the fitness itself is not involved in the EM optimization, because it is relative to each user and is unknown for weight combinations that a specific user has not evaluated (unless it is predicted by the ε-SVM).
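The cluster selection step can be sketched as follows, assuming that a trained GTM has already provided the a posteriori probabilities of each weight configuration for each cluster. The data layout, the function names and the tie-breaking rule (highest mean fitness first, lowest standard deviation second) are assumptions made for illustration; the text does not specify how the two criteria are combined.

```python
from statistics import mean, pstdev


def cluster_statistics(posteriors, fitness):
    """Assign each configuration to its most probable cluster and summarize fitness.

    posteriors maps a configuration id to its list of per-cluster probabilities;
    fitness maps a configuration id to its fitness value.
    Returns {cluster index: (mean fitness, standard deviation, member ids)}.
    """
    members = {}
    for config, probs in posteriors.items():
        best = max(range(len(probs)), key=probs.__getitem__)
        members.setdefault(best, []).append(config)
    stats = {}
    for cluster, configs in members.items():
        values = [fitness[c] for c in configs]
        stats[cluster] = (mean(values), pstdev(values), configs)
    return stats


def best_cluster(stats):
    """Prefer the highest mean fitness; break ties with the lowest deviation."""
    return max(stats, key=lambda c: (stats[c][0], -stats[c][1]))


# Hypothetical posteriors for three weight configurations and two clusters.
posteriors = {"w1": [0.9, 0.1], "w2": [0.8, 0.2], "w3": [0.3, 0.7]}
fitness = {"w1": 0.9, "w2": 0.7, "w3": 0.5}
stats = cluster_statistics(posteriors, fitness)
chosen = best_cluster(stats)  # cluster 0, which groups w1 and w2
```

As in the text, the fitness values play no role in building the clusters; they are only used afterwards to characterize each cluster.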

Experiments and Results

In (Formiga & Alías, 2007), common knowledge was extracted from the weights evolved by users in previous tests conducted on a Catalan speech corpus with 9863 recorded units (1207 diphones and triphones) obtained from 1520 sentences and words (Alías, Llorà, Formiga, Sastry & Goldberg, 2006). In that test, five phonetically balanced sentences were extracted from the corpus to perform the global weight tuning process through a web interface named SinEvo (Alías, Llorà, Iriondo, Sevillano, Formiga & Socoró, 2004). The evolved weights were normalized through max-min normalization so that all weights range between 0 and 1. The test was conducted by three users, yielding fifteen different weight configurations.

Figure 2. Performance of GTM: different Pareto fronts were analyzed for each configuration.



In (Formiga & Alías, 2007), different configurations of GTM were analyzed for mapping the normalized weights (hexagonal or rectangular grid, and different grid sizes: 3 × 3, 4 × 4 and 5 × 5). The purpose of this analysis was to find the optimal GTM configuration, i.e. the one that minimizes the averaged standard deviation (std) per cluster and the number of significant clusters per population (those with averaged fitness over 75%) while maximizing the averaged mean fitness per cluster. As may be noticed in Figure 2, the 4 × 4 configuration with a hexagonal latent grid was selected, as it yielded the best Pareto front (although the rest of the 4 × 4 grids achieved similar performance). After GTM was set up, each evolved weight configuration was extracted and mapped onto the other users' GTMs for the same sentence, obtaining its corresponding fitness from the other users' preferences. Equation (6) defines a global fitness ($gF$) from the overall averaged fitness ($F_{Av}^{GTM}$) for each evolved weight configuration:

$$gF_{w_i} = \frac{1}{N} \cdot \sum_{i=1}^{U} \sum_{j=1}^{W} F_{Av}^{GTM}(w_j, u_i) \tag{6}$$

where $U$ stands for the number of users, $W$ for the number of weight configurations and $N$ for the total number of weights (U + W). In addition, to assess whether a manual perceptual validation stage could be avoided, ten different users, not involved in any way in the tuning process, compared the best aiGA weights, which allowed a comparison between GTM clustering and real human preference at the validation stage. Analyzing the results in Figure 3, the GTM most voted weight configurations agree with the manual user preferences for three sentences (De la Seva Selva, Del Seu Territori and Grans Extensions). However, the remaining sentences behave quite differently. In I els han venut, the weight combination preferred by the users was the second best GTM weight configuration, while the best GTM weight combination was never voted for. For these problematic weight configurations, cosine correlation is considered, since what matters is the distribution of the weights rather than their individual values. In this case, the two best GTM weight configurations have a correlation of 0.7841, so the GTM results may be considered satisfactory, as the weights follow similar patterns. On the other hand, the correlation between the two best GTM weight configurations is 0.8323 in Fusta de Birmània and, as in the previous case, the correlation again gives satisfactory results.

Figure 3. Comparison between normalized user voted preferences and GTM mapping for the five sentences. Two different solutions for the same user were considered if they obtained similar fitness in the aiGA model.
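For reference, the cosine correlation between two weight configurations can be computed as in the following minimal sketch. It treats each configuration as a vector of weights in a fixed order; the function name is hypothetical, and the actual vectors behind the 0.7841 and 0.8323 values are not given in the text, so no attempt is made to reproduce them.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long weight vectors."""
    if len(a) != len(b):
        raise ValueError("both weight vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because the measure depends only on the direction of the vectors, two configurations whose weights are proportional to each other obtain a value of 1 regardless of their scale, which matches the stated interest in weight distribution rather than absolute values.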

FUTURE TRENDS

Future work will focus on conducting new experiments, e.g. by clustering similar units instead of tackling global weight tuning on preselected sentences, or by including more users in the training process. In addition, the expansion of the capabilities of GTM to map user preferences opens the possibility of focusing on non-linear cost functions, so as to overcome the linearity restrictions of the present function.

CONCLUSIONS

This article continues the work of including user preferences in the tuning of the cost function weights in unit selection TTS systems. In previous works we presented a method to find the optimal weights of the cost function based on the perceptual criteria of individual users. As a next step, this paper applies a heuristic method for choosing the best solution among all users, overcoming the need to conduct a second listening test to select the best weight configuration among the individual optimal solutions. This proof-of-principle study shows that GTM is capable of mapping the common knowledge among different users, by working on the perceptually optimized weight space obtained through aiGA, and of reaching a final solution that can be used for a final adjustment of the TTS system.

REFERENCES

Alías, F., Llorà, X., Formiga, L., Sastry, K., Goldberg, D.E. (2006): Efficient interactive weight tuning for TTS synthesis: reducing user fatigue by improving user consistency. In: Proceedings of ICASSP. Volume I, Toulouse, France, 865–868.

Alías, F., Llorà, X., Iriondo, I., Sevillano, X., Formiga, L., Socoró, J.C. (2004): Perception-Guided and Phonetic Clustering Weight Tuning Based on Diphone Pairs for Unit Selection TTS. In: Proceedings of the 8th International Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea.

Alías, F., Llorà, X. (2003): Evolutionary weight tuning based on diphone pairs for unit selection speech synthesis. In: Proceedings of the 8th European Conference on Speech Communication and Technology (EuroSpeech).

Bishop, C.M., Svensen, M., Williams, C.K.I. (1998): GTM: The generative topographic mapping. Neural Computation 10(1), 215–234.

Black, A.W., Campbell, N. (1995): Optimising selection of units from speech databases for concatenative synthesis. In: Proceedings of EuroSpeech. Volume 1, Madrid, 581–584.

Black, A.W. (2002): Perfect synthesis for all of the people all of the time. In: IEEE Workshop on Speech Synthesis (Keynote), Santa Monica, USA.

Figueiredo, M.A.F., Jain, A.K. (2002): Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3), 381–396.

Formiga, L., Alías, F. (2007): Extracting User Preferences by GTM for aiGA Weight Tuning in Unit Selection Text-to-Speech Synthesis. International Workshop on Artificial Neural Networks (IWANN07), pp. 654-661, ISBN 978-3-540-73006-4, June 2007, Donostia (Spain).

Formiga, L., Alías, F. (2006): Heuristics for implementing the A* algorithm for unit selection TTS synthesis systems. IV Jornadas en Tecnología del Habla (4JTH06), pp. 219-224, ISBN 84-96214-82-6, November, Zaragoza (Spain).

Holland, J. (1975): Adaptation in Natural and Artificial Systems. Ann Arbor, Univ. of Michigan Press.

Hunt, A., Black, A.W. (1996): Unit selection in a concatenative speech synthesis system using a large speech database. In: Proceedings of ICASSP. Volume 1, Atlanta, USA, 373–376.

Kohonen, T. (2006): Self-Organizing Maps. Springer.

Kohonen, T. (1990): The self-organizing map. Proceedings of the IEEE 78(9), 1464–1480.

Llorà, X., Sastry, K., Goldberg, D.E., Gupta, A., Lakshmi, L. (2005): Combating user fatigue in iGAs: Partial ordering, support vector machines, and synthetic fitness. Proceedings of the Genetic and Evolutionary Computation Conference 2005 (GECCO-2005), 1363–1371. (Also IlliGAL Report No. 2005009.)

Meron, Y., Hirose, K. (1999): Efficient weight training for selection based synthesis. In: Proceedings of EuroSpeech. Volume 5, Budapest, Hungary, 2319–2322.

KEY TERMS

Correlation: A statistical measurement of the interdependence or association between two quantitative or qualitative variables. A typical calculation would be performed by multiplying a signal by either another signal (cross-correlation) or by a delayed version of itself (autocorrelation).

Digital Signal Processing (DSP): As the term suggests, the processing of signals by digital means. The processing of a digital signal is done by performing numerical calculations.

Diphone: A sound consisting of two phonemes: one that leads into the sound and one that finishes the sound, e.g. “hello”: silence-h h-eh eh-l l-oe oe-silence.

Evolutionary Algorithms: Collective term for all variants of (probabilistic) optimization and approximation algorithms that are inspired by Darwinian evolution. Optimal states are approximated by successive improvements based on the variation-selection paradigm.

Generative Topographic Mapping (GTM): A technique for density modelling and data visualisation inspired by SOM (see the SOM definition).

Mel Frequency Cepstral Coefficients (MFCC): The coefficients of the Mel cepstrum. The Mel cepstrum is the cepstrum computed on the Mel bands (scaled to the human ear) instead of the Fourier spectrum.

Natural Language Processing (NLP): Computer understanding, analysis, manipulation, and/or generation of natural language.

Pitch: Intonation measure at a given time in the signal.

Prosody: A collection of phonological features including pitch, duration, and stress, which define the rhythm of spoken language.

Text Normalization: The process of converting abbreviations and non-word written symbols into the words that a speaker would say when reading them out loud.

Unit Selection Synthesis: A synthesis technique where appropriate units are retrieved from large databases of natural speech so as to generate synthetic speech.

Unsupervised Learning: Learning techniques that group instances without a pre-specified dependent attribute. Clustering algorithms are usually unsupervised methods for grouping data sets.

Self-Organizing Maps: Self-organizing maps (SOMs) are a data visualization technique that reduces the dimensions of data through the use of self-organizing neural networks.

Surrogate Fitness: Synthetic fitness measure that tries to evaluate an evolutionary solution in the same terms as a perceptual user would.


Handling Fuzzy Similarity for Data Classification

Roy Gelbard
Bar-Ilan University, Israel

Avichai Meged
Bar-Ilan University, Israel

INTRODUCTION

Representing and consequently processing fuzzy data in standard and binary databases is problematic. The problem is further amplified in binary databases, where continuous data is represented by means of discrete ‘1’ and ‘0’ bits. As regards classification, the problem becomes even more acute. In these cases, we may want to group objects based on some fuzzy attributes, but unfortunately, an appropriate fuzzy similarity measure is not always easy to find. The current paper proposes a novel model and measure for representing fuzzy data, which lends itself to both classification and data mining.

Classification algorithms and data mining attempt to set up hypotheses regarding the assignment of different objects to groups and classes on the basis of the similarity/distance between them (Estivill-Castro & Yang, 2004) (Lim, Loh & Shih, 2000) (Zhang & Srihari, 2004). Classification algorithms and data mining are widely used in numerous fields, including: social sciences, where observations and questionnaires are used in learning mechanisms of social behavior; marketing, for segmentation and customer profiling; finance, for fraud detection; computer science, for image processing and expert systems applications; medicine, for diagnostics; and many other fields.

Classification algorithms and data mining methodologies are based on a procedure that calculates a similarity matrix, based on a similarity index between objects, and on a grouping technique. Research has shown that a similarity measure based upon binary data representation yields better results than regular similarity indexes (Erlich, Gelbard & Spiegler, 2002) (Gelbard, Goldman & Spiegler, 2007). However, binary representation is currently limited to nominal discrete attributes such as gender, marital status, etc. (Zhang & Srihari, 2003). This makes the binary approach to data representation unattractive for widespread data types.

The current research describes a novel approach to binary representation, referred to as Fuzzy Binary Representation. This new approach is suitable for all data types: nominal, ordinal and continuous. We propose that there is meaning not only in the actual explicit attribute value, but also in its implicit similarity to other possible attribute values. These similarities can either be determined by a problem domain expert or obtained automatically by analyzing fuzzy functions that represent the problem domain. The added fuzzy similarity yields improved classification and data mining results. More generally, Fuzzy Binary Representation and related similarity measures exemplify that a refined and carefully designed handling of data, including the eliciting of domain expertise regarding similarity, may add both value and knowledge to existing databases.

BACKGROUND

Binary Representation

Binary representation creates a storage scheme wherein data appear in binary form rather than in the common numeric and alphanumeric formats. The database is viewed as a two-dimensional matrix that relates entities according to their attribute values. With the rows representing entities and the columns representing possible values, entries in the matrix are either ‘1’ or ‘0’, indicating that a given entity (e.g., record, object) has or lacks a given value, respectively (Spiegler & Maayan, 1985). In this way, we can have a binary representation for discrete and continuous attributes.



Table 1. Standard binary representation table

Table 1 illustrates the binary representation of a database consisting of five entities with the following two attributes: Marital Status (nominal) and Height (continuous).

• Marital Status, with four values: S (single), M (married), D (divorced), W (widowed).
• Height, with four values: 1.55, 1.56, 1.60 and 1.84.

However, practically, binary representation is currently limited to nominal discrete attributes only. In the current study, we extend the binary model to include continuous data and fuzzy representation.
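The construction of such a binary matrix can be sketched in a few lines of Python. The attribute values are those listed above, whereas the five entities, the dictionary keys and the function name are hypothetical, since the contents of Table 1 are not reproduced here.

```python
# Possible values of each attribute, in column order.
MARITAL_VALUES = ["S", "M", "D", "W"]
HEIGHT_VALUES = [1.55, 1.56, 1.60, 1.84]


def to_binary_row(entity):
    """Return one row of the binary matrix: a '1' marks the value the entity has."""
    marital_bits = [1 if entity["marital"] == v else 0 for v in MARITAL_VALUES]
    height_bits = [1 if entity["height"] == v else 0 for v in HEIGHT_VALUES]
    return marital_bits + height_bits


# Five invented entities (assumption: the real Table 1 may differ).
entities = [
    {"marital": "S", "height": 1.55},
    {"marital": "M", "height": 1.56},
    {"marital": "D", "height": 1.60},
    {"marital": "W", "height": 1.84},
    {"marital": "M", "height": 1.60},
]
matrix = [to_binary_row(e) for e in entities]
# Each row contains exactly one '1' per attribute, e.g. the last entity
# is encoded as [0, 1, 0, 0, 0, 0, 1, 0].
```

Every row carries exactly one set bit per attribute, which is the mutual exclusiveness property that the fuzzy extension later relaxes.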

Similarity Measures

Similarity/distance measures are at the heart of all classification algorithms. The most commonly used method for calculating similarity is the Squared Euclidean measure. This measure calculates the distance between two samples as the sum of the squared differences between their properties (Jain & Dubes, 1988) (Jain, Murty & Flynn, 1999). However, such measures are applicable only to ordinal attributes and cannot be used to classify nominal, discrete, or categorical attributes, since there is no meaning in placing such attribute values in a common Euclidean space. A similarity measure that is applicable to nominal attributes, and is used in our research, is the Dice measure (Dice, 1945). Additional binary similarity measures have been developed and presented (Illingworth, Glaser & Pyle, 1983) (Zhang & Srihari, 2003). Similarity measures between the different attribute values, as proposed in Zadeh's (1971) model, are essential in the classification process. In the current study we use similarities between entities and between the attribute values of entities to obtain better classification. Following former research (Gelbard & Spiegler, 2000) (Erlich, Gelbard & Spiegler, 2002), the current study also uses the Dice measure.
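For binary vectors, the Dice measure is twice the number of positions where both vectors hold a ‘1’, divided by the total number of ‘1’ bits in the two vectors. A minimal Python sketch follows; the function name and the convention of returning 0 for two all-zero vectors are assumptions made for illustration.

```python
def dice(a, b):
    """Dice similarity between two equally long binary vectors."""
    if len(a) != len(b):
        raise ValueError("both vectors must have the same length")
    shared = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    total = sum(a) + sum(b)
    # Convention chosen here: two all-zero vectors are given similarity 0.
    return 2 * shared / total if total else 0.0


# Two single entities with heights 1.55 and 1.56 in the standard
# binary representation (columns: S, M, D, W, 1.55, 1.56, 1.60, 1.84).
single_155 = [1, 0, 0, 0, 1, 0, 0, 0]
single_156 = [1, 0, 0, 0, 0, 1, 0, 0]
single_184 = [1, 0, 0, 0, 0, 0, 0, 1]
print(dice(single_155, single_156))  # 0.5
print(dice(single_155, single_184))  # 0.5
```

Both pairs obtain the same score because only the shared marital status contributes: the height columns add nothing whether the heights differ by 0.01 or by 0.29.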

Fuzzy Logic

The theory of Fuzzy Logic was first introduced by Lotfi Zadeh (Zadeh, 1965). In classical logic, the only possible truth-values are true and false. In Fuzzy Logic, however, more truth-values are possible beyond the simple true and false. Fuzzy logic, then, derived from fuzzy set theory, is designed for situations where information is inexact and traditional digital on/off decisions are not possible.

Fuzzy sets are an extension of classical set theory and are used in fuzzy logic. In classical set theory, the membership of elements in relation to a set is assessed according to a clear condition; an element either belongs or does not belong to the set. By contrast, fuzzy set theory permits the gradual assessment of the membership of elements in relation to a set; this is described with the aid of a membership function. An element mapped to the value ‘0’ is not included in the given set, ‘1’ describes a fully included member, and all values between 0 and 1 characterize the fuzzy members. For example, the continuous variable “Height” may have three membership functions, standing for the “Short”, “Medium” and “Tall” categories. An object may belong to several categories with different membership degrees, e.g. a height of 180 cm may belong to both the “Medium” and “Tall” categories, with different membership degrees expressed in the range [0,1]. The membership degrees are returned by the membership functions. We can say that a man whose height is 180 cm is “slightly medium” and a man whose height is 200 cm is of “perfect tall” height.

Different membership functions might produce different membership degrees. Having several possibilities for membership functions is one of the theoretical and practical drawbacks of Zadeh's model. There is no “right way” to determine the right membership functions (Mitaim & Kosko, 2001). Thus, a membership function may be considered arbitrary and subjective. In the current work, we make use of membership functions to develop the enhanced similarity calculation for use in the classification of fuzzy data.
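The “Height” example can be made concrete with trapezoidal membership functions, as in the following sketch. The trapezoidal shape and all breakpoints (in cm) are assumptions chosen only so that 180 cm is partly “Medium” and partly “Tall” while 200 cm is fully “Tall”; as noted above, there is no single right choice.

```python
INF = float("inf")


def trapezoid(x, a, b, c, d):
    """Membership rising on [a, b], equal to 1 on [b, c] and falling on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical breakpoints for the three categories (assumed, in cm).
HEIGHT_CATEGORIES = {
    "Short": (-INF, -INF, 150.0, 160.0),   # open to the left
    "Medium": (150.0, 160.0, 170.0, 185.0),
    "Tall": (170.0, 190.0, INF, INF),      # open to the right
}


def memberships(height_cm):
    """Return the membership degree of a height in every category."""
    return {
        name: trapezoid(height_cm, *params)
        for name, params in HEIGHT_CATEGORIES.items()
    }


degrees_180 = memberships(180.0)  # Short 0.0, Medium about 0.33, Tall 0.5
degrees_200 = memberships(200.0)  # Short 0.0, Medium 0.0, Tall 1.0
```

Changing the breakpoints changes the degrees returned for the same height, which is exactly the arbitrariness discussed in the text.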

FUZZY SIMILARITY REPRESENTATION

Standard Binary Representation exhibits data integrity in that it is precise, and preserves data accuracy without either loss of information or rounding of any value. The mutual exclusiveness assumption causes the “isolation” of each value. This is appropriate for handling discrete data values. However, in dealing with a continuous attribute, e.g. Height, we want to assume that height 1.55 is closer to 1.56 than to 1.60. When converting such values into a mutually exclusive binary representation (Table 1), we lose these basic numerical relations. The similarity measure between any pair of entities with different attribute values is always 0, no matter how similar the attribute values are. This drawback makes the standard binary representation unattractive for representing and handling continuous data types. Similarity between attribute values is also needed for nominal and ordinal data. For example, the color “red” (nominal value) is more similar to the color “purple” than it is to the color “yellow”. In ranking (questionnaires), a satisfaction rank of “1” (ordinal variable) might be closer to rank “2” than to rank “5”. The absence of these similarity “intuitions” matters greatly in classification and may cause inaccuracies in classification results. The following sections present a model that adds relative similarity values to the data representation. This serves to empower the binary representation to better handle both continuous and fuzzy data and improves classification results for all attribute types.

Model for Fuzzy Similarity Representation

In standard binary representation, each attribute (which may have several values, e.g., color: red, blue, green, etc.) is a vector of bits where only one bit is set to “1” and all others are set to “0”. The “1” bit stands for the actual value of the attribute. In the Fuzzy Binary Representation, the zero bits are replaced by relative similarity values. The Fuzzy Binary Representation is viewed as a two-dimensional matrix that relates entities according to their attribute values. With rows representing entities and columns representing possible values, entries in the matrix are fuzzy numbers in the range [0,1], indicating the degree of similarity of a specific attribute value to the actual one, where ‘1’ means full similarity to the actual value (this is the actual value), ‘0’ means no similarity at all, and all other values mean partial similarity. The following example illustrates how the Fuzzy Binary Representation is created. Assume we have a database of five entities and two attributes represented in a binary representation as illustrated in Table 1. The fuzzy similarities between all attribute values are calculated (the next section describes the calculation process) and represented in a two-dimensional “Fuzzy

Table 2. Fuzzy similarity matrixes of the marital status and height attributes

Similarity Matrix”, wherein rows and columns stand for the different values of an attribute, and the matrix cells contain the fuzzy similarity between the value pairs. The Fuzzy Similarity Matrix is symmetrical. Table 2 illustrates fuzzy similarity matrixes for the Marital Status and Height attributes. The Marital Status similarity matrix shows that the similarity between Single and Widow is “high” (0.8), while there is no similarity between Single and Married (0). The Height similarity matrix shows that the similarity between 1.56 and 1.60 is 0.8 (“high” similarity), while the similarity between 1.55 and 1.84 is 0 (not similar at all). These similarity matrixes can be calculated automatically, as is explained in the next section. Now, the zero values in the binary representation (Table 1) are replaced by the appropriate similarity values (Table 2). For example, in Table 1, we replace the zero bit that stands for Height 1.55 of the first entity with the fuzzy similarity between 1.55 and 1.60 (the actual attribute value), as indicated in the Height fuzzy similarity matrix (0.7). Table 3 illustrates the fuzzy representation obtained after such replacements. It should be noted that the similarities indicated in the fuzzy similarity table relate to the similarity between the actual value of the attribute (e.g. 1.60 in entity 1) and the other values of the attribute (e.g. 1.55, 1.56 and 1.84).

Next, the fuzzy similarities, presented in decimal form, are converted into a binary format: the Fuzzy Binary Representation. The conversion should allow the use of similarity indexes such as Dice. To meet this requirement, each similarity value is represented by N binary bits, where N is determined by the required precision. For one-tenth precision, 10 binary bits are needed; for one-hundredth precision, 100 binary bits are needed. With ten-bit precision, the fuzzy similarity “0” is represented by ten ‘0’s, the fuzzy similarity “0.1” by nine ‘0’s followed by one ‘1’, the fuzzy similarity “0.2” by eight ‘0’s followed by two ‘1’s, and so on up to the fuzzy similarity “1”, which is represented by ten ‘1’s. Table 4 illustrates the conversion from fuzzy representation (Table 3) to fuzzy binary representation. The Fuzzy Binary Representation illustrated in Table 4 is suitable for all data types (discrete and continuous) and, with the new knowledge (fuzzy similarity values) it contains, a better classification is expected. The following section describes the process for similarity calculations necessary for this type of Fuzzy Binary Representation.
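The conversion described above can be sketched in a few lines. The function names are hypothetical. The example row uses the Height similarities quoted in the text for an entity whose actual height is 1.60 (0.7 to 1.55, 0.8 to 1.56), plus an assumed value of 0.0 for 1.84, which the text does not give.

```python
def encode_similarity(value, n_bits=10):
    """Encode a similarity in [0, 1] as n_bits bits: zeros followed by ones."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    ones = round(value * n_bits)
    return [0] * (n_bits - ones) + [1] * ones


def fuzzy_binary_row(similarities, n_bits=10):
    """Concatenate the encodings of all similarity values of one entity."""
    bits = []
    for s in similarities:
        bits.extend(encode_similarity(s, n_bits))
    return bits


# Columns: Height 1.55, 1.56, 1.60, 1.84; the actual value is 1.60.
# The last similarity (0.0) is an assumption made for this illustration.
height_row = fuzzy_binary_row([0.7, 0.8, 1.0, 0.0])
print("".join(map(str, height_row)))
```

Placing the ones at the same end of every block means that the bitwise overlap of two encoded values equals the smaller of the two, which is what lets a bit-counting index such as Dice work on the encoded data.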

Table 3. Fuzzy similarity table

Table 4. Fuzzy binary representation table



Fuzzy Similarity Calculation

Similarity calculation between the different attribute values is not a precise science, i.e., there is no one way to calculate it, just as there is no one way to develop membership functions in the Fuzzy Logic world. We suggest determining similarities according to the attribute type. A domain expert should evaluate similarity for nominal attributes like “Marital Status”. For example, Single, Divorced and Widowed are considered “one person”, while Married is considered as “two people”. Therefore, Single may be more similar to Divorced and Widowed than it is to Married. On the other hand, a Divorced person was once married, so Divorced may be more similar to Married than to Single. In short, similarity is a relative, rather than an absolute, measure; there is hardly any known automatic way to calculate similarities for such attributes, and therefore a domain expert is needed. Similarity for ordinal data such as a satisfaction rank can be calculated in the same way as for nominal or continuous attributes, depending on the nature of the attribute’s values. Similarity for continuous data like Height can be calculated automatically. Unlike nominal attributes, in continuous data there is an intuitive meaning to the “distance” between different values. For example, as regards the Height attribute, the distance between 1.55 and 1.56 is smaller than the distance between 1.55 and 1.70; therefore, the similarity is expected to be higher accordingly. For continuous data, an automatic method can be constructed, as shown below, to calculate the similarities. Depending on the problem domain, a continuous attribute can be divided into one or more fuzzy sets (categories), e.g., the Height attribute can be divided into three sets: Short, Medium and Tall. A membership function for each set can be developed. The calculated similarities depend on the specified membership functions; therefore, they are referred to here as fuzzy similarities.

The following algorithm can be used for similarity calculations of continuous data:

For each pair of attribute values (v1 and v2)
    For each membership function F
        Similarities (v1, v2) = 1 - distance between F(v1) and F(v2)
    Similarity (v1, v2) = Maximum of the calculated Similarities
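A minimal sketch of this algorithm follows, under two assumptions that the text does not state. First, the “distance” is taken to be the absolute difference of membership degrees. Second, a membership function under which both values have zero membership is skipped; without that, two values lying outside the same set would trivially score 1 under the maximum. The ramp-shaped membership functions and their breakpoints (in meters) are hypothetical.

```python
def ramp_up(x, lo, hi):
    """0 below lo, 1 above hi, linear in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


# Hypothetical membership functions for Height (meters).
def short(h):
    return 1.0 - ramp_up(h, 1.55, 1.65)


def medium(h):
    return min(ramp_up(h, 1.55, 1.65), 1.0 - ramp_up(h, 1.75, 1.80))


def tall(h):
    return ramp_up(h, 1.75, 1.80)


HEIGHT_FUNCTIONS = [short, medium, tall]


def fuzzy_similarity(v1, v2, membership_functions):
    """Maximum over membership functions of 1 - |F(v1) - F(v2)|."""
    candidates = []
    for f in membership_functions:
        m1, m2 = f(v1), f(v2)
        if m1 == 0.0 and m2 == 0.0:
            continue  # assumption: sets containing neither value are ignored
        candidates.append(1.0 - abs(m1 - m2))
    return max(candidates) if candidates else 0.0


for a, b in [(1.55, 1.56), (1.56, 1.60), (1.55, 1.84)]:
    print(a, b, round(fuzzy_similarity(a, b, HEIGHT_FUNCTIONS), 2))
```

The resulting values depend entirely on the assumed membership functions, so they will not reproduce Table 2; the sketch only shows the mechanics of the calculation.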

Now that we have discussed both a model for Fuzzy Binary Representation and a way to calculate similarities, the next section shows how the new knowledge (fuzzy similarities) added to the standard binary representation improves the similarity measures between different entities.

COMPARING STANDARD AND FUZZY SIMILARITIES

In this section, we compare standard and fuzzy similarities. The similarities were calculated according to the Dice index for the example represented in Table 4. Table 5 combines similarities of the different entities related to (a) Marital Status (nominal), to (b) Height (continuous) and to (c) both the Marital Status and Height attributes.

Table 5. Entities similarity

Several points and findings arise from the representations shown above (Table 5). These are briefly highlighted below:

1. In our small example, a nominal attribute (Marital Status) represented in standard binary representation cannot be used for classification. In contrast, the Fuzzy Binary Representation, with its large diversity of similarity results, enables better classification. Grouping entities with a similarity equal to or greater than 0.7 yields a class of entities 2, 3, 4 and 5, which represent Single, Divorced and Widowed, all belonging to the set “one person”.

2. For a continuous attribute (Height) represented in the standard binary representation, classification is not possible. In contrast, the Fuzzy Binary Representation, with its diversity of similarity results, once again enables better classification. Entities 1 and 5 have absolute similarity (1), since for the Height attribute they are identical. Entities 2 and 4 (similarity = 0.94) are very similar, since they represent the almost identical heights of 1.55 and 1.56, respectively. Classification based on these two entities is possible due to the diversity of similarities.

3. The same phenomena observed for a single attribute (Marital Status or Height) also hold when both attributes (Marital Status + Height) are taken together. A similarity greater than 0.8 is used to group entities 2, 4 and 5, which represent “one person” around 1.56 meters in height.

Two important advantages of the novel Fuzzy Binary Representation detailed in the current work over the standard binary representation are suggested: (1) It is practically suitable for all attribute types. (2) It improves classification results.
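To make the comparison concrete, the sketch below computes the Dice index, 2|A∩B| / (|A| + |B|), for two entities with heights 1.55 and 1.56, first in the standard and then in the fuzzy binary representation. The similarities 0.7, 0.8 and 0 are those quoted in the text; the values 0.9 (between 1.55 and 1.56) and 0.0 (between 1.56 and 1.84) are assumptions, since Table 2 is not reproduced here.

```python
def dice(a, b):
    """Dice index of two equal-length bit vectors."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2.0 * common / total if total else 0.0


def encode(similarities, n_bits=10):
    """Encode each similarity as n_bits bits: zeros followed by ones."""
    bits = []
    for s in similarities:
        ones = round(s * n_bits)
        bits.extend([0] * (n_bits - ones) + [1] * ones)
    return bits


# Columns: Height 1.55, 1.56, 1.60, 1.84.
standard_a = [1, 0, 0, 0]                 # entity with height 1.55
standard_b = [0, 1, 0, 0]                 # entity with height 1.56
fuzzy_a = encode([1.0, 0.9, 0.7, 0.0])    # 0.9 is an assumed value
fuzzy_b = encode([0.9, 1.0, 0.8, 0.0])    # 0.9 and 0.0 are assumed values

standard_dice = dice(standard_a, standard_b)
fuzzy_dice = dice(fuzzy_a, fuzzy_b)
print(standard_dice, round(fuzzy_dice, 2))
```

Under these assumed values the standard representation yields 0 and the fuzzy one yields 50/53, about 0.94, which is in line with the similarity reported above for the heights 1.55 and 1.56.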

FUTURE TRENDS The current work improves classification by adding new similarity knowledge to the standard representation of data. Further research can be conducted to calculate the interrelationship between the different attributes, i.e., the cross-similarities among attributes such as marital status and height. Understanding such interrelationships might further serve to refine the classification and data mining results. Another worthwhile research direction is helping the human domain expert to get the “right” similarities, and thus choose the “right” membership functions. A Decision Support System may provide a way in which to structure the similarity evaluation of the expert and make his/her decisions less arbitrary.

CONCLUSION

In the current paper, the problems of representing and classifying data in databases were addressed. The focus was on Binary Databases, which have been shown in recent years to have an advantage in classification and data mining. Novel aspects for representing fuzziness were shown and a measure of similarity for fuzzy data was developed and described. Such measures are required, as similarity calculations are at the heart of any classification algorithm. Classification examples were illustrated. The evaluation of similarity measures shows that standard binary representation is useless when dealing with continuous attributes for classification. Fuzzy Binary Representation remedies this drawback and results in promising classification based on continuous data attributes. In addition, adding fuzzy similarity was also shown to be useful for regular (nominal, ordinal) data to ensure better classification. In summary, fuzzy representation improves classification results for all attribute types.

REFERENCES

Dice, L.R. (1945). Measures of the amount of ecological association between species. Ecology, 26(3), 297-302.

Erlich, Z., Gelbard, R. & Spiegler, I. (2002). Data Mining by Means of Binary Representation: A Model for Similarity and Clustering. Information Systems Frontiers, 4(2), 187-197.

Estivill-Castro, V. & Yang, J. (2004). Fast and Robust General Purpose Clustering Algorithms. Data Mining and Knowledge Discovery, 8(2), 127-150.

Gelbard, R. & Spiegler, I. (2000). Hempel’s raven paradox: a positive approach to cluster analysis. Computers and Operations Research, 27(4), 305-320.

Gelbard, R., Goldman, O. & Spiegler, I. (2007). Investigating Diversity of Clustering Methods: An Empirical Comparison. Data & Knowledge Engineering, 63(1), 155-166.

Illingworth, V., Glaser, E.L. & Pyle, I.C. (1983). Hamming distance. In Dictionary of Computing, Oxford University Press, 162-163.

Jain, A.K. & Dubes, R.C. (1988). Algorithms for Clustering Data. Prentice Hall.

Jain, A.K., Murty, M.N. & Flynn, P.J. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3), 264-323.

Lim, T.S., Loh, W.Y. & Shih, Y.S. (2000). A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms. Machine Learning, 40(3), 203-228.

Mitaim, S. & Kosko, B. (2001). The Shape of Fuzzy Sets in Adaptive Function Approximation. IEEE Transactions on Fuzzy Systems, 9(4), 637-656.

Spiegler, I. & Maayan, R. (1985). Storage and retrieval considerations of binary data bases. Information Processing and Management, 21(3), 233-254.

Zadeh, L.A. (1965). Fuzzy Sets. Information and Control, 8(3), 338-353.

Zadeh, L.A. (1971). Similarity Relations and Fuzzy Ordering. Information Sciences, 3, 177-200.

Zhang, B. & Srihari, S.N. (2003). Properties of Binary Vector Dissimilarity Measures. In Proc. JCIS Int’l Conf. Computer Vision, Pattern Recognition, and Image Processing, 26-30.

Zhang, B. & Srihari, S.N. (2004). Fast k-Nearest Neighbor Classification Using Cluster-based Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4), 525-528.

KEY TERMS

Classification: The partitioning of a data set into subsets, so that the data in each subset (ideally) share some common traits, often proximity according to some defined similarity/distance measure.

Data Mining: The process of automatically searching large volumes of data for patterns, using tools such as classification, association rule mining, clustering, etc.

Database Binary Representation: A representation where a database is viewed as a two-dimensional matrix that relates entities (rows) to attribute values (columns). Entries in the matrix are either ‘1’ or ‘0’, indicating that a given entity has or lacks a given value.

Fuzzy Logic: An extension of Boolean logic dealing with the concept of partial truth. Fuzzy logic replaces Boolean truth values (0 or 1, black or white, yes or no) with degrees of truth.

Fuzzy Set: An extension of classical set theory. Fuzzy set theory, used in Fuzzy Logic, permits the gradual assessment of the membership of elements in relation to a set.

Membership Function: The mathematical function that defines the degree of an element’s membership in a fuzzy set. Membership functions return a value in the range of [0,1], indicating membership degree.

Similarity: A numerical estimate of the difference or distance between two entities. The similarity values are in the range of [0,1], indicating similarity degree.

Harmony Search for Multiple Dam Scheduling Zong Woo Geem Johns Hopkins University, USA

INTRODUCTION

A dam is a wall that holds water in, and the operation of multiple dams is a complicated decision-making process that can be posed as an optimization problem (Oliveira & Loucks, 1997). Traditionally, researchers have used mathematical optimization techniques with linear programming (LP) or dynamic programming (DP) formulations to find the schedule. However, most of the mathematical models are valid only for simplified dam systems. Accordingly, during the past decade, some meta-heuristic techniques, such as genetic algorithm (GA) and simulated annealing (SA), have gathered great attention among dam researchers (Chen, 2003) (Esat & Hall, 1994) (Wardlaw & Sharif, 1999) (Kim, Heo & Jeong, 2006) (Teegavarapu & Simonovic, 2002). Lately, another metaheuristic algorithm, harmony search (HS), has been developed (Geem, Kim & Loganathan, 2001) (Geem, 2006a) and applied to various artificial intelligence problems, such as music composition (Geem & Choi, 2007) and Sudoku puzzles (Geem, 2007). The HS algorithm has also been applied to various engineering problems such as structural design (Lee & Geem, 2004), water network design (Geem, 2006b), soil stability analysis (Li, Chi & Chu, 2006), satellite heat pipe design (Geem & Hwangbo, 2006), offshore structure design (Ryu, Duggal, Heyl & Geem, 2007), grillage system design (Erdal & Saka, 2006), and hydrologic parameter estimation (Kim, Geem & Kim, 2001). The HS algorithm could be a competent alternative to existing metaheuristics such as GA because the former overcame drawbacks (such as building block theory) of the latter (Geem, 2006a). To test the ability of the HS algorithm in the multiple dam operation problem, this article introduces an HS model, applies it to a benchmark system, and then compares the results with those of the GA model previously developed.

BACKGROUND

Before this study, various researchers tackled the dam scheduling problem using phenomenon-inspired techniques. Esat and Hall (1994) introduced a GA model for dam operation. They compared GA with the discrete differential dynamic programming (DDDP) technique. GA could overcome the drawback of DDDP, whose computing burden increases exponentially. Oliveira and Loucks (1997) proposed practical dam operating policies using enhanced GA (real-code chromosome, elitism, and arithmetic crossover). Wardlaw and Sharif (1999) tried other enhanced GA schemes and concluded that the best GA model for dam operation can be composed of real-value coding, tournament selection, uniform crossover, and modified uniform mutation. Chen (2003) developed a real-coded GA model for long-term dam operation, and Kim et al. (2006) applied an enhanced multi-objective GA, named NSGA-II, to a real-world multiple dam system. Teegavarapu and Simonovic (2002) used another metaheuristic algorithm, simulated annealing (SA), to solve the dam operation problem. Although several metaheuristic algorithms have already been applied to the dam scheduling problem, the recently developed HS algorithm had not been applied to the problem before. Thus, this article deals with the HS algorithm’s pioneering application to the problem.

HARMONY SEARCH MODEL AND APPLICATION

This article presents two major parts. The first part explains the structure of the HS model, and the second part applies the HS model to a benchmark problem.



Dam Scheduling Model Using HS

The HS model has the following formulation for the multiple dam scheduling:

Maximize the benefits obtained by hydropower generation and irrigation

Subject to the following constraints:

1. Range of Water Release: the amount of water release in each dam should lie between minimum and maximum amounts.

2. Range of Dam Storage: the amount of dam storage in each dam should lie between minimum and maximum amounts.

3. Water Continuity: the amount of dam storage in the next stage should equal the amount in the current stage plus the amount of inflow (including releases arriving from upstream dams) minus the dam’s own water release.

The HS algorithm starts by filling the harmony memory (HM) with random scheduling vectors. The structure of HM for the dam scheduling is as follows:

$$
\mathbf{HM} =
\begin{bmatrix}
R_1^{1} & R_2^{1} & \cdots & R_N^{1} & Z(\mathbf{R}^{1}) \\
R_1^{2} & R_2^{2} & \cdots & R_N^{2} & Z(\mathbf{R}^{2}) \\
\vdots & \vdots & & \vdots & \vdots \\
R_1^{HMS} & R_2^{HMS} & \cdots & R_N^{HMS} & Z(\mathbf{R}^{HMS})
\end{bmatrix}
\tag{1}
$$

Each row stands for a solution vector, and each column stands for a decision variable (the water release amount in each stage and each dam). The objective function value is located at the end of each row. HMS (harmony memory size) is the number of solution vectors in HM. Based on the initial HM, a new schedule can be generated with the following function:

$$
R_i^{NEW} \leftarrow
\begin{cases}
R_i, \; R_i^{MIN} \le R_i \le R_i^{MAX} & \text{w.p. } p_1 \\
R_i(k) \in \{R_i^{1}, R_i^{2}, \ldots, R_i^{HMS}\} & \text{w.p. } p_2 \\
R_i(k) + \Delta & \text{w.p. } p_3 \\
R_i(k) - \Delta & \text{w.p. } p_4
\end{cases}
\tag{2}
$$

where $R_i^{NEW}$ is a new water release amount for decision variable i; the first row on the right-hand side means that the new amount is chosen randomly from the total range; the second row means that the new amount is chosen from the HM; the third and fourth rows mean that the new amount is a certain unit (Δ) higher or lower than the original amount $R_i(k)$ obtained from the HM. The probabilities sum to one ($p_1 + p_2 + p_3 + p_4 = 1$). If the newly generated vector, $\mathbf{R}^{NEW}$, is better than the worst harmony in the HM in terms of the objective function, the new harmony is included in the HM and the existing worst harmony is excluded from the HM. If the HS model reaches MaxImp (maximum number of function evaluations), computation is terminated. Otherwise, another new harmony (= vector) is generated by considering one of the three above-mentioned mechanisms (random selection, memory consideration, or pitch adjustment).

Applying HS to a Benchmark Problem

The HS model was applied to a popular multiple dam system as shown in Figure 1 (Wardlaw & Sharif, 1999). The problem has 12 two-hour operating periods, and only dam 4 has irrigation benefit because outflows of other dams are not directed to farms. The range of water releases is as follows:

$$
0.0 \le R_1 \le 3, \quad 0.0 \le R_2, R_3 \le 4, \quad 0.0 \le R_4 \le 7
\tag{3}
$$

Figure 1. Schematic of four dam system

The range of dam storages is as follows:

$$
0.0 \le S_1, S_2, S_3 \le 10, \quad 0.0 \le S_4 \le 15
\tag{4}
$$

The initial and final storage conditions are as follows:

$$
S_1(0), S_2(0), S_3(0), S_4(0) = 5
\tag{5}
$$

$$
S_1(12), S_2(12), S_3(12) = 5, \quad S_4(12) = 7
\tag{6}
$$
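The improvisation rule of equation (2) and the memory update can be sketched as follows. This is a simplified illustration, not the author's implementation. The mapping from HMCR and PAR to the probabilities p1 to p4 follows a common HS convention and is an assumption here, as are the step size `delta`, the clamping of adjusted values to the allowed range, and the toy objective, which merely stands in for the real benefit function.

```python
import random


def improvise(hm, bounds, probs, delta, rng):
    """Generate one new vector following the four cases of equation (2)."""
    p1, p2, p3, _p4 = probs
    new_vector = []
    for i, (lo, hi) in enumerate(bounds):
        u = rng.random()
        if u < p1:                       # random selection from the total range
            value = rng.uniform(lo, hi)
        else:                            # memory consideration
            value = rng.choice(hm)[i]
            if u >= p1 + p2:             # pitch adjustment by +delta or -delta
                step = delta if u < p1 + p2 + p3 else -delta
                value = min(max(value + step, lo), hi)  # assumed clamping
        new_vector.append(value)
    return new_vector


def update_memory(hm, new_vector, objective):
    """Replace the worst harmony if the new one is better (maximization)."""
    worst = min(range(len(hm)), key=lambda j: objective(hm[j]))
    if objective(new_vector) > objective(hm[worst]):
        hm[worst] = new_vector
        return True
    return False


def toy_benefit(releases):
    """Hypothetical objective used only to exercise the loop."""
    return -sum((r - 2.0) ** 2 for r in releases)


rng = random.Random(0)
bounds = [(0.0, 3.0), (0.0, 4.0), (0.0, 4.0), (0.0, 7.0)]  # ranges of eq. (3)
hmcr, par = 0.95, 0.05
# Assumed mapping: p1 = 1 - HMCR, p2 = HMCR(1 - PAR), p3 = p4 = HMCR * PAR / 2.
probs = (1 - hmcr, hmcr * (1 - par), hmcr * par / 2, hmcr * par / 2)

hm = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(5)]
best_before = max(toy_benefit(v) for v in hm)
for _ in range(500):
    update_memory(hm, improvise(hm, bounds, probs, 0.5, rng), toy_benefit)
best_after = max(toy_benefit(v) for v in hm)
print(best_before, best_after)
```

Because only the worst harmony is ever replaced, and only by a better one, the best objective value in the memory cannot get worse from one iteration to the next. A real dam model would also have to enforce the storage and continuity constraints, which this sketch leaves out.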

Table 1. One example of optimal schedules by HS

Time    Dam 1   Dam 2   Dam 3   Dam 4
0       1.0     4.0     0.0     0.0
1       0.0     1.0     0.0     2.0
2       0.0     2.0     4.0     7.0
3       2.0     0.0     4.0     7.0
4       3.0     3.0     4.0     7.0
5       3.0     3.0     4.0     7.0
6       3.0     4.0     4.0     7.0
7       3.0     4.0     4.0     7.0
8       3.0     4.0     4.0     7.0
9       3.0     4.0     4.0     7.0
10      3.0     4.0     4.0     0.0
11      0.0     3.0     0.0     0.0

Figure 2. Water release trajectory in each dam (series: Dam 1, Dam 2, Dam 3, Dam 4; horizontal axis: Time t, 0 to 11; vertical axis: Release, 0 to 8)



There are only two inflows: 2 units to dam 1 and 3 units to dam 2.

$$
I_1 = 2, \quad I_2 = 3
\tag{7}
$$
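As a sanity check, the schedule of Table 1 can be simulated against the release limits, storage limits, boundary conditions and inflows given in equations (3) to (7). The dam connectivity used below (dam 1 releases into dam 4, dam 2 into dam 3, dam 3 into dam 4) is an assumption: Figure 1 is not reproduced here, and this layout was chosen because it balances the water budget of the tabulated schedule.

```python
DAMS = (1, 2, 3, 4)
RELEASES = {  # Table 1, one list per dam, periods 0..11
    1: [1, 0, 0, 2, 3, 3, 3, 3, 3, 3, 3, 0],
    2: [4, 1, 2, 0, 3, 3, 4, 4, 4, 4, 4, 3],
    3: [0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0],
    4: [0, 2, 7, 7, 7, 7, 7, 7, 7, 7, 0, 0],
}
EXTERNAL_INFLOW = {1: 2, 2: 3, 3: 0, 4: 0}       # equation (7)
UPSTREAM = {1: [], 2: [], 3: [2], 4: [1, 3]}     # assumed connectivity
MAX_RELEASE = {1: 3, 2: 4, 3: 4, 4: 7}           # equation (3)
MAX_STORAGE = {1: 10, 2: 10, 3: 10, 4: 15}       # equation (4)
INITIAL = {1: 5, 2: 5, 3: 5, 4: 5}               # equation (5)
TARGET = {1: 5, 2: 5, 3: 5, 4: 7}                # equation (6)


def simulate(releases):
    """Apply water continuity period by period; return all storage states."""
    storage = dict(INITIAL)
    trajectory = [dict(storage)]
    for t in range(12):
        nxt = {}
        for d in DAMS:
            inflow = EXTERNAL_INFLOW[d] + sum(releases[u][t] for u in UPSTREAM[d])
            nxt[d] = storage[d] + inflow - releases[d][t]
        storage = nxt
        trajectory.append(dict(storage))
    return trajectory


def is_feasible(releases):
    """Check release range, storage range and the final storage condition."""
    trajectory = simulate(releases)
    releases_ok = all(0 <= r <= MAX_RELEASE[d] for d in DAMS for r in releases[d])
    storages_ok = all(0 <= s[d] <= MAX_STORAGE[d] for s in trajectory for d in DAMS)
    return releases_ok and storages_ok and trajectory[-1] == TARGET


final_storage = simulate(RELEASES)[-1]
feasible = is_feasible(RELEASES)
print(final_storage, feasible)
```

Under the assumed connectivity, the tabulated schedule respects every limit and ends exactly at the required final storages. The benefit value itself cannot be checked this way, because the benefit coefficients are not given in the text.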

Wardlaw and Sharif (1999) tackled this dam scheduling problem using an enhanced GA model (Population Size = 100; Crossover Rate = 0.70; Mutation Rate = 0.02; Number of Generations = 500; Number of Function Evaluations = 35,000; Binary, Gray, & Real-Value Representations; Tournament Selection; One-Point, Two-Point, & Uniform Crossovers; and Uniform and Modified Uniform Mutations). The GA model found a best near-optimal solution of 400.5, which is 99.8% of the global optimum (401.3). The HS model was applied to the same problem with the following algorithm parameters: HMS = 30; HMCR = 0.95; PAR = 0.05; and MaxImp = 35,000. The HS model found five different global optimal solutions (HS1 ~ HS5) with an identical objective value of 401.3. Table 1 shows one example out of the five optimal water release schedules. Figure 2 shows the corresponding release trajectories in all dams. When the HS model was further tested with different algorithm parameter values, it found a better solution than that of the GA model (400.5) in seven cases out of eight.

FUTURE TRENDS

Given the success of this study, future HS models should consider more complex dam scheduling problems with various real-world situations. Also, algorithm parameter guidelines obtained from extensive experiments on the parameter values will be helpful to engineers in practice, because meta-heuristic algorithms, including HS and GA, require many trials to obtain the best algorithm parameters.

CONCLUSION

The music-inspired HS algorithm was successfully applied to the optimal scheduling problem of the multiple dam system, outperforming the results of GA. While the GA model obtained near-optimal solutions, the HS model found five different global optima under the same number of function evaluations. Moreover, the HS model found these optima without a sensitivity analysis of its algorithm parameters, whereas the GA model tested many parameter values and different operation schemes. This could reduce the time and trouble of choosing parameter values in HS.

REFERENCES

Chen, L. (2003). Real Coded Genetic Algorithm Optimization of Long Term Reservoir Operation. Journal of the American Water Resources Association, 39(5), 1157-1165.

Erdal, F. & Saka, M. P. (2006). Optimum Design of Grillage Systems Using Harmony Search Algorithm. Proceedings of the 5th International Conference on Engineering Computational Technology (ECT 2006), CD-ROM.

Esat, V. & Hall, M. J. (1994). Water Resources System Optimization Using Genetic Algorithms. Proceedings of the First International Conference on Hydroinformatics, 225-231.

Geem, Z. W. (2006a). Improved Harmony Search from Ensemble of Music Players. Lecture Notes in Artificial Intelligence, 4251, 86-93.

Geem, Z. W. (2006b). Optimal Cost Design of Water Distribution Networks using Harmony Search. Engineering Optimization, 38(3), 259-280.

Geem, Z. W. (2007). Harmony Search Algorithm for Solving Sudoku. Lecture Notes in Artificial Intelligence, In Press.

Geem, Z. W. & Choi, J. Y. (2007). Music Composition Using Harmony Search Algorithm. Lecture Notes in Computer Science, 4448, 593-600.

Geem, Z. W. & Hwangbo, H. (2006). Application of Harmony Search to Multi-Objective Optimization for Satellite Heat Pipe Design. Proceedings of US-Korea Conference on Science, Technology, & Entrepreneurship (UKC 2006), CD-ROM.

Geem, Z. W., Kim, J. H., & Loganathan, G. V. (2001). A New Heuristic Optimization Algorithm: Harmony Search. Simulation, 76(2), 60-68.

Kim, J. H., Geem, Z. W., & Kim, E. S. (2001). Parameter Estimation of the Nonlinear Muskingum Model Using Harmony Search. Journal of the American Water Resources Association, 37(5), 1131-1138.

Kim, T., Heo, J.-H., & Jeong, C.-S. (2006). Multireservoir System Optimization in the Han River Basin using Multi-Objective Genetic Algorithm. Hydrological Processes, 20, 2057-2075.

Lee, K. S. & Geem, Z. W. (2004). A New Structural Optimization Method Based on the Harmony Search Algorithm. Computers & Structures, 82(9-10), 781-798.

Li, L., Chi, S.-C., & Chu, X.-S. (2006). Location of Non-Circular Slip Surface Using the Modified Harmony Search Method Based on Correcting Strategy. Rock and Soil Mechanics, 27(10), 1714-1718.

Oliveira, R., & Loucks, D. P. (1997). Operating Rules for Multireservoir Systems. Water Resources Research, 33(4), 839-852.

Ryu, S., Duggal, A. S., Heyl, C. N., & Geem, Z. W. (2007). Mooring Cost Optimization Via Harmony Search. Proceedings of the 26th International Conference on Offshore Mechanics and Arctic Engineering, ASME, CD-ROM.

Teegavarapu, R. S. V., & Simonovic, S. P. (2002). Optimal Operation of Reservoir Systems Using Simulated Annealing. Water Resources Management, 16, 401-428.

Wardlaw, R., & Sharif, M. (1999). Evaluation of Genetic Algorithms for Optimal Reservoir System Operation. Journal of Water Resources Planning and Management, ASCE, 125(1), 25-33.

KEY TERMS

Evolutionary Computation: Solution approach guided by biological evolution, which begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data.

Genetic Algorithm: Technique to search exact or approximate solutions of optimization or search problem by using evolution-inspired phenomena such as selection, crossover, and mutation. Genetic algorithm is classified as global search algorithm.

Harmony Search: Technique to search exact or approximate solutions of optimization or search problem by using music-inspired phenomenon (improvisation). Harmony search has three major operations such as random selection, memory consideration, and pitch adjustment. Harmony search is classified as global search algorithm.

Metaheuristics: Technique to find solutions by combining black-box procedures (heuristics). Here, ‘meta’ means ‘beyond’, and ‘heuristic’ means ‘to find’.

Multiple Dam Scheduling: Process of developing individual dam schedule in multiple dam system. The schedule contains water release amount at each time period while satisfying release limit, storage limit, and continuity conditions.

Optimization: Process of seeking to optimize (minimize or maximize) an objective function while satisfying all problem constraints by choosing the values of continuous or discrete variables.

Soft Computing: Collection of computational techniques in computer science, especially in artificial intelligence, such as fuzzy logic, neural networks, chaos theory, and evolutionary algorithms.


Hierarchical Neuro-Fuzzy Systems Part I Marley Vellasco PUC-Rio, Brazil Marco Pacheco PUC-Rio, Brazil Karla Figueiredo UERJ, Brazil Flavio Souza UERJ, Brazil

INTRODUCTION

Neuro-fuzzy systems [Jang,1997][Abraham,2005] are hybrid systems that combine the learning capacity of neural nets [Haykin,1999] with the linguistic interpretation of fuzzy inference systems [Ross,2004]. These systems have been evaluated quite intensively in machine learning tasks. This is mainly due to a number of factors: the applicability of learning algorithms developed for neural nets; the possibility of promoting implicit and explicit knowledge integration; and the possibility of extracting knowledge in the form of fuzzy rules. Most of the well-known neuro-fuzzy systems, however, present limitations regarding the number of inputs allowed or the limited (or nonexistent) means to create their own structure and rules [Nauck,1997][Nauck,1998][Vuorimaa,1994][Zhang,1995]. This paper describes a new class of neuro-fuzzy models, called Hierarchical Neuro-Fuzzy BSP Systems (HNFB). These models employ BSP (Binary Space Partitioning) of the input space [Chrysanthou,1996] and have been developed to bypass traditional drawbacks of neuro-fuzzy systems. This paper introduces the HNFB models based on supervised learning algorithms. These models were evaluated in many benchmark applications related to classification and time-series forecasting. A second paper, entitled Hierarchical Neuro-Fuzzy Systems Part II, focuses on hierarchical neuro-fuzzy models based on reinforcement learning algorithms.

BACKGROUND

Hybrid Intelligent Systems conceived by using techniques such as Fuzzy Logic and Neural Networks have been applied in areas where traditional approaches were unable to provide satisfactory solutions. Many researchers have attempted to integrate these two techniques by generating hybrid models that associate their advantages and minimize their limitations and deficiencies. With this objective, hybrid neuro-fuzzy systems [Jang,1997][Abraham,2005] have been created. Traditional neuro-fuzzy models, such as ANFIS [Jang,1997], NEFCLASS [Nauck,1997] and FSOM [Vuorimaa,1994], have a limited capacity for creating their own structure and rules [Souza,2002a]. Additionally, most of these models employ grid partitioning of the input space, which, due to the rule explosion problem, is more adequate for applications with a smaller number of inputs. When a greater number of input variables is necessary, the system’s performance deteriorates. Thus, Hierarchical Neuro-Fuzzy Systems have been devised to overcome these basic limitations. Different models of this class of neuro-fuzzy systems have been developed, based on supervised learning techniques.

HIERARCHICAL NEURO-FUZZY SYSTEMS

This section presents the new class of neuro-fuzzy systems that are based on hierarchical partitioning.



Two sub-sets of hierarchical neuro-fuzzy systems (HNF) have been developed, according to the learning process used: supervised learning models (HNFB [Souza,2002b][Vellasco,2004], HNFB-1 [Gonçalves,2006], HNFB-Mamdani [Bezerra,2005]); and reinforcement learning models (RL-HNFB [Figueiredo,2005a], RL-HNFP [Figueiredo,2005b]). The focus of this paper is on the first sub-set of models, which are described in the following sections.

HIERARCHICAL NEURO-FUZZY BSP MODEL

Basic Neuro-Fuzzy BSP Cell

An HNFB cell is a neuro-fuzzy mini-system that performs fuzzy binary partitioning of the input space. The HNFB cell generates a crisp output after a defuzzification process. Figure 1(a) illustrates the cell’s functionality, where ‘x’ represents the input variable; ρ(x) and µ(x) are the membership functions low and high, respectively, which generate the antecedents of the two fuzzy rules; and y is the crisp output. The linguistic interpretation of the mapping implemented by the HNFB cell is given by the following rules:

• If x ∈ ρ then y = d1
• If x ∈ µ then y = d2

Each rule corresponds to one of the two partitions generated by BSP. Each partition can in turn be subdivided into two parts by means of another HNFB cell. The profiles of membership functions ρ(x) and µ(x) are complementary logistic functions. The output y of an HNFB cell (defuzzification process) is given by the weighted average. Because the membership function ρ(x) is the complement to 1 of the membership function µ(x), the following equation applies:

$$y = \rho(x)\, d_1 + \mu(x)\, d_2 \quad \text{or} \quad y = \sum_{i=1}^{2} \alpha_i d_i \tag{1}$$

where αi symbolizes the firing level of the rule in partition i, given by α1 = ρ(x) and α2 = µ(x). Each di corresponds to one of the three possible consequents below:

• A singleton: the case where di = constant.
• A linear combination of the inputs:

$$d_i = \sum_{k=1}^{n} w_k x_k + w_0$$

  where xk is the system’s k-th input; wk is the weight associated with input xk; n is the total number of inputs; and w0 is a constant value.
• The output of a stage of a previous level: the case where di = yj, where yj represents the output of a generic cell ‘j’, whose value is also calculated by eq. (1).

HNFB Architecture

An HNFB model may be described as a system that is made up of interconnections of HNFB cells. Figure 1(b) illustrates an HNFB system along with the respective partitioning of the input space. In this system, the initial partitions 1 and 2 (‘BSP0’ cell) have been subdivided; hence, the consequents of its rules are the outputs of BSP1 and BSP2, respectively. In turn, these subsystems have, as consequents, values d11, y12, d21 and d22, respectively. Consequent y12 is the output of the ‘BSP12’ cell. The output of the system in Figure 1(b) is given by equation (2).

$$y = \alpha_1 \left( \alpha_{11} d_{11} + \alpha_{12} \left( \alpha_{121} d_{121} + \alpha_{122} d_{122} \right) \right) + \alpha_2 \left( \alpha_{21} d_{21} + \alpha_{22} d_{22} \right) \tag{2}$$

It must be stressed that, although each BSP cell divides the input space into only two fuzzy sets (low and high), the complete HNFB architecture divides the universe of discourse of each variable into as many partitions as necessary. The number of partitions is determined during the learning process. In Figure 1(c), for instance, the upper left part of the input space (partition 12 in gray) has been further subdivided by the horizontal variable x1, resulting in three fuzzy sets for the complete universe of discourse of this specific variable.
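The cell output of eq. (1) and the nested output of eq. (2) can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the logistic slope `a`, the centre `b`, the class name `Cell`, the helper `consequent`, the numeric singleton values and the input index assigned to each cell are all hypothetical choices made for illustration, not values taken from the HNFB papers.

```python
import math


def memberships(x, a=1.0, b=0.0):
    """Return (rho, mu): complementary 'low' and 'high' logistic memberships.

    The slope a and centre b are assumed parameter names.
    """
    mu = 1.0 / (1.0 + math.exp(-a * (x - b)))
    return 1.0 - mu, mu


class Cell:
    """One BSP cell: splits a single input into 'low' and 'high' partitions."""

    def __init__(self, input_index, low, high, a=1.0, b=0.0):
        self.input_index = input_index
        self.low, self.high = low, high
        self.a, self.b = a, b

    def output(self, x):
        rho, mu = memberships(x[self.input_index], self.a, self.b)
        # Eq. (1): y = rho(x) * d1 + mu(x) * d2
        return rho * consequent(self.low, x) + mu * consequent(self.high, x)


def consequent(d, x):
    """Evaluate one of the three consequent types for the input vector x."""
    if isinstance(d, Cell):      # output of a stage of a previous level
        return d.output(x)
    if isinstance(d, tuple):     # (weights, w0): linear combination of inputs
        weights, w0 = d
        return sum(w * xk for w, xk in zip(weights, x)) + w0
    return float(d)              # singleton


# Tree of Figure 1(b); the input index used by each cell is an assumption.
bsp12 = Cell(0, low=2.0, high=4.0)    # consequents d121, d122
bsp1 = Cell(1, low=1.0, high=bsp12)   # consequents d11, y12
bsp2 = Cell(0, low=3.0, high=5.0)     # consequents d21, d22
bsp0 = Cell(1, low=bsp1, high=bsp2)

y = bsp0.output([0.0, 0.0])           # every membership equals 0.5 at this point
```

Because each cell simply delegates to its consequents, the recursion reproduces the nesting of eq. (2) without writing the expanded expression by hand.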

Learning Algorithm

The HNFB system has a training algorithm based on the gradient descent method for learning the structure of the model and, consequently, the linguistic rules. The parameters that define the profiles of the membership functions of the antecedents and consequents are regarded as fuzzy weights of the neuro-fuzzy system. In order to prevent the structure from growing indefinitely, a non-dimensional parameter, named decomposition rate (δ), was created. More details of this algorithm may be found in [Souza,2002b][Gonçalves,2006]. The results obtained in classification and time series forecasting problems are presented in the Case Studies section.

Figure 1. (a) Interior of Neuro-Fuzzy BSP cell. (b) Example of HNFB system. (c) Input space partitioning of the HNFB system.

HIERARCHICAL NEURO-FUZZY BSP FOR CLASSIFICATION

The original HNFB provides very good results for function approximation and time series forecasting. However, it is not ideal for pattern classification applications, since it has only one output and makes use of the Takagi-Sugeno inference method [Takagi,1985], which reduces the interpretability of the rule base. Therefore, a new hierarchical neuro-fuzzy BSP model dedicated to pattern classification and rule extraction, called the Inverted HNFB or HNFB-1, has been developed. It is able to extract classification rules such as: If x is A and y is B then input-pattern belongs to class Z. This new hierarchical neuro-fuzzy model is called inverted because it applies the learning process of the original HNFB to generate the model’s structure. After this first learning phase, the structure is inverted and the architecture of the HNFB-1 model is obtained. The basic cell of this new inverted structure is described below.

Basic Inverted-HNFB Cell

Similarly to the original HNFB model, a basic Inverted-HNFB cell is a neuro-fuzzy mini-system that performs fuzzy binary partitioning in a particular space according to the same membership functions ρ and µ. However, after a defuzzification process, the Inverted-HNFB cell generates two crisp outputs instead of one. Fig. 2(a) illustrates the interior of the Inverted-HNFB cell. Considering that the membership functions are complementary, the outputs of an HNFB-1 cell are given by $y_1 = \beta\, \rho(x)$ and $y_2 = \beta\, \mu(x)$, where β corresponds to one of the two possible cases below:

• β is the input of the first cell: β = 1.
• β is the output of a cell of a previous level: β = yj, where yj represents one of the two outputs of a generic cell ‘j’.

Inverted-HNFB Architecture

Fig. 2(b) presents an example of the original HNFB architecture obtained during the training phase of a database containing three distinct classes, while Fig. 2(c) shows how the HNFB-1 model is obtained after the inversion process.

In the HNFB-1 architecture shown in Fig. 2(c), it may be observed that the classification system has several outputs (y1 to y5), one for each existing leaf in the original HNFB architecture. The outputs of the leaf cells are calculated by means of the following equations (using complementary membership functions):

$$y_1 = \rho_0\, \rho_1 \tag{3}$$

$$y_2 = \rho_0\, \mu_1\, \rho_{12} \tag{4}$$

$$y_3 = \rho_0\, \mu_1\, \mu_{12} \tag{5}$$

$$y_4 = \mu_0\, \rho_2 \tag{6}$$

$$y_5 = \mu_0\, \mu_2 \tag{7}$$

where ρi and µi are the membership functions for the BSPi.
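As a hedged illustration of the leaf outputs, the following Python sketch multiplies the memberships along each path from the root cell BSP0 to a leaf. The logistic parameters `a` and `b` and the function names are assumptions, and the fourth output is taken to be µ0·ρ2 by symmetry with the tree (the low partition of BSP2). With complementary membership functions the five outputs always add up to 1.

```python
import math


def memberships(x, a=1.0, b=0.0):
    """Return (rho, mu) for complementary logistic memberships (a, b assumed)."""
    mu = 1.0 / (1.0 + math.exp(-a * (x - b)))
    return 1.0 - mu, mu


def leaf_outputs(x0, x1, x12, x2):
    """x0, x1, x12 and x2 are the input values seen by BSP0, BSP1, BSP12, BSP2."""
    rho0, mu0 = memberships(x0)
    rho1, mu1 = memberships(x1)
    rho12, mu12 = memberships(x12)
    rho2, mu2 = memberships(x2)
    return [
        rho0 * rho1,          # y1
        rho0 * mu1 * rho12,   # y2
        rho0 * mu1 * mu12,    # y3
        mu0 * rho2,           # y4 (assumed by symmetry with the tree)
        mu0 * mu2,            # y5
    ]


outputs = leaf_outputs(0.0, 0.0, 0.0, 0.0)   # all memberships equal 0.5 here
```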

Figure 2. (a) HNFB-1 basic cell. (b) Original HNFB architecture. (c) Inversion of the architecture shown in Fig. 2(b). (d) Connection of the inverted architecture to T-conorm cells.

HNFB-1 System Outputs

After the inversion has been performed, the outputs are connected to T-conorm cells (OR operator) that define the classes (see Fig. 2(d)). The initial procedure for linking the leaf cells to the T-conorm neurons consists of connecting all leaf cells with all T-conorm neurons. Once these connections have been made, it is necessary to establish their weights. For the purpose of assigning these weights, a learning method based on Least Mean Squares [Haykin,1999] has been employed. After the weights have been determined, the T-conorm operation (Limited Sum T-conorm operator [Ross,2004]) is used for processing the output of the neuron. The final output of the HNFB-1 system is specified by the highest output obtained among all the T-conorm neurons, determining the class to which the input pattern belongs. Results obtained with the HNFB-1 model, in different benchmark classification problems, are presented in the Case Studies section.
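The class decision can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' implementation: the limited sum is taken in its usual form min(1, a + b), each leaf output is assumed to be multiplied by its connection weight before being combined, and the weight matrix below is invented for the example (in the model it would come from the Least Mean Squares step).

```python
def limited_sum(values):
    """Limited (bounded) sum T-conorm extended to several arguments."""
    return min(1.0, sum(values))


def classify(leaf_outputs, weights):
    """Return (winning class index, score of every T-conorm neuron).

    weights[c][i] is the assumed weight from leaf i to the neuron of class c.
    """
    scores = [
        limited_sum(w * y for w, y in zip(row, leaf_outputs))
        for row in weights
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical weights: leaf 1 -> class 1, leaves 2-3 -> class 2, leaves 4-5 -> class 3.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
]
best, scores = classify([0.25, 0.125, 0.125, 0.25, 0.25], weights)
```

The index returned in `best` is zero-based, so a value of 2 corresponds to the third class.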

HIERARCHICAL NEURO-FUZZY BSP MAMDANI

The Hierarchical Neuro-Fuzzy BSP Mamdani (HNFB-Mamdani), like HNFB-1, was developed to enhance the interpretability of hierarchical neuro-fuzzy systems. However, since HNFB-1 is dedicated to classification problems, a more general model was devised. The HNFB-Mamdani employs the Mamdani inference method [Jang,1997] in the rules’ consequents, and can be applied in control systems, pattern classification, forecasting, and rule extraction.

HNFB-Mamdani Architecture

The HNFB-Mamdani architecture is formed by the interconnection of HNFB-1 cells in a binary tree structure and is divided into three basic modules: the input partitioning structure; the weighted connections from the leaf cells of the binary structure (di) to the T-conorm neurons (Ti); and the defuzzification process. The first two modules are identical to the HNFB-1 architecture, except that each T-conorm neuron is associated with a fuzzy set M of the consequent. All leaf cells are connected to all T-conorm neurons. Each connection has an associated weight, whose value is also established by the Least Mean Squares algorithm. The consequent of a fuzzy rule in the HNFB-Mamdani model is a fuzzy set represented by a triangular membership function. The total number of fuzzy sets associated with the output variable is specified by the user.

Defuzzification Method

The defuzzification process selected for the HNFB-Mamdani model is the weighted average of the maximum values. Figure 3 illustrates the defuzzification process for a model with three output fuzzy sets.

Figure 3. Defuzzification process (stages shown, from input to output: leaf cells, T-conorm neurons, fuzzy sets, defuzzification).

The output y is then calculated by Eq. (8):

$$y_j = \frac{\sum_{i=1}^{n} \alpha_i^{j} \ast C_i}{\sum_{i=1}^{n} \alpha_i^{j}} \tag{8}$$

where:
yj: output of the HNFB-Mamdani for input pattern j;
αij: output value of the i-th T-conorm neuron (Ti) for input pattern j;
Ci: value in the universe of discourse of the output variable where the Mi fuzzy set presents the maximum value;
*: product operator;
n: total number of fuzzy sets associated with the output variable.
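Eq. (8) is a weighted average, which the following minimal Python sketch makes concrete. The function name `defuzzify`, the activation values and the peak positions are hypothetical; the guard against a zero denominator is an added safety check, not something the text specifies.

```python
def defuzzify(alphas, centers):
    """Weighted average of the maxima, as in Eq. (8).

    alphas[i]  : output of the i-th T-conorm neuron for the current pattern
    centers[i] : point where the i-th output fuzzy set reaches its maximum
    """
    total = sum(alphas)
    if total == 0.0:
        raise ValueError("no T-conorm neuron is active for this pattern")
    return sum(a * c for a, c in zip(alphas, centers)) / total


# Three triangular output sets peaking at 0, 5 and 10 (illustrative values).
y = defuzzify([0.25, 0.5, 0.25], [0.0, 5.0, 10.0])
```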

CASE STUDIES

In order to evaluate the performance of supervised HNFB models, two benchmark classification databases and six load time series from utilities of the Brazilian electrical energy sector were selected.

Pattern Classification

Pattern classification aims to determine to which group of a predetermined set an input pattern belongs. Two benchmark applications were selected among those most frequently employed in the area of machine learning. The results obtained with the proposed HNFB models were compared to the ones described in [Gonçalves,2006]. In order to generate the training and test sets, the total set of patterns was randomly divided into two equal parts. Each of these two sets was alternately used either as a training or as a test set. Table 1 summarizes the average classification performance obtained with both test sets. The performance of the HNFB models is better than that of the other models, except for the HNFB-Mamdani case. Since HNFB-Mamdani is a general-purpose model, it tends to provide inferior results when compared to application-specific models, such as Inverted-HNFB and HNFB-Class. On the other hand, HNFB and HNFQ [Souza,2002a] are also general-purpose models but still perform better than HNFB-Mamdani. This is due to the Takagi-Sugeno inference method used by those models, which is usually more accurate than the Mamdani inference method [Bezerra,2005]. The disadvantage of the Takagi-Sugeno method is its reduced interpretability.

Table 1. Comparison of the average classification performance

NN KNN FSS BSS MFS1 MFS2 C4.5 FID3.1 NEFCLASS HNFB1 HNFB2 HNFQ HNFB-Inverted HNFB-Class1 HNFB-Class2 HNFB-Mamdani

Iris 94.00% 96.00% 96.00% 98.67% 98.67% 98.67% 98.67% 98.67% 97.33% 95.00%

Wine 95.20% 96.70% 92.80% 94.80% 97.60% 97.90% 97.80% 97.80% 98.88% 99.44% 98.87% 98.88% 95.77%

where: NN = nearest-neighbor; KNN = k-nearest-neighbor; FSS = nearest-neighbor with forward sequential selection of features; BSS = nearest-neighbor with backward sequential selection of features; MFS = Multiple Feature Subsets; HNFB1 and HNFB-Class1 use fixed selection; HNFB2 and HNFB-Class2 use adaptive selection. References to all these models are provided in [Gonçalves,2006].

Electric Load Forecasting

This experiment made use of data related to the monthly electric load of six utilities of the Brazilian electrical energy sector. The results obtained with the HNFB models were compared with those of the Backpropagation algorithm, of statistical techniques such as Holt-Winters and Box & Jenkins, and of Bayesian Neural Nets (BNN) [Bishop,1995], trained by Gaussian approximation and by the MCMC method. Table 2 presents the performance results in terms of the Mean Absolute Percentage Error (MAPE). It can be observed that the general performance of HNFB models is usually superior to the results provided by statistical methods. The results obtained with BNNs are generally better than with HNFB models. However, according to [Tito,1999], the training time with BNN was about 8 hours. This was a much longer period than the time required by the HNFB models to perform the same task, which was of the order of tens to hundreds of seconds, on similar equipment. Additionally, the data used in the HNFB models were not treated in terms of their seasonal aspects, nor were they made stationary as was the case of the BNN tested in [Tito,1999].

Table 2. Monthly load prediction errors (MAPE) for different models

Utility   HNFB-Mamdani  HNFB    Backpropagation  Box & Jenkins  Holt-Winters  BNN (Gaussian)  BNN (MCMC)
COPEL     1.77%         1.17%   1.57%            1.63%          1.96%         1.45%           1.16%
CEMIG     1.39%         1.12%   1.47%            1.67%          1.75%         1.29%           1.28%
LIGHT     2.41%         2.22%   3.57%            4.02%          2.73%         1.44%           2.23%
FURNAS    3.08%         3.76%   5.28%            5.43%          4.55%         1.33%           3.85%
CERJ      2.79%         1.35%   3.16%            3.24%          2.69%         1.50%           1.33%
E.PAULO   1.42%         1.17%   1.58%            2.23%          1.85%         0.79%           0.78%

FUTURE TRENDS

As can be seen from the results presented, HNFB models provide very good performance in different applications. To improve the performance of the HNFB-Mamdani model, which provided the worst results among the supervised HNFB models, the model is being extended to allow the use of different types of output fuzzy sets (such as Gaussian, trapezoidal, etc.) and to include an algorithm that optimizes the total number of output fuzzy sets.

CONCLUSION

The objective of this article was to introduce a new class of neuro-fuzzy models which aims to improve the weak points of conventional neuro-fuzzy systems. The results obtained by the HNFB models showed that they yield a good performance as classifiers of database patterns or as time series forecasters. These models are able to create their own structure and allow the extraction of knowledge in the form of linguistic fuzzy rules.

REFERENCES

Abraham, A. (2005). Adaptation of Fuzzy Inference System Using Neural Learning, Fuzzy System Engineering: Theory and Practice, Springer-Verlag, Chapter 3, pp. 53-83.

Bezerra, R.A., Vellasco, M.M.B.R., Tanscheit, R. (2005). Hierarchical Neuro-Fuzzy BSP Mamdani System, 11th World Congress of International Fuzzy Systems Association, 3, 1321-1326.

Bishop, C.M. (1995). Neural Networks for Pattern Recognition, Clarendon Press.

Chrysanthou, Y. & Slater, M. (1992). Computing dynamic changes to BSP trees, EUROGRAPHICS ‘92, 11(3), 321-332.

Figueiredo, K.T., Vellasco, M.M.B.R., Pacheco, M.A.C. (2005a). Hierarchical Neuro-Fuzzy Models based on Reinforcement Learning for Intelligent Agents, Computational Intelligence and Bioinspired Systems, LNCS 3512, 424-431.

Figueiredo, K., Santos, M., Vellasco, M.M.B.R., Pacheco, M.A.C. (2005b). Modified Reinforcement Learning-Hierarchical Neuro-Fuzzy Politree Model for Control of Autonomous Agents, International Journal of Simulation Systems, Science & Technology, 6(10-11), 4-13.

Gonçalves, L.B., Vellasco, M.M.B.R., Pacheco, M.A.C., Souza, F.J. (2006). Inverted Hierarchical Neuro-Fuzzy BSP System: A Novel Neuro-Fuzzy Model for Pattern Classification and Rule Extraction in Databases, IEEE Transactions on Systems, Man & Cybernetics, Part C, 36(2), 236-248.

Haykin, S. (1999). Neural Networks - A Comprehensive Foundation. Macmillan College Publishing.

Jang, J.-S.R., Sun, C.-T., Mizutani, E. (1997). Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice-Hall.

Nauck, D. & Kruse, R. (1997). A neuro-fuzzy method to learn fuzzy classification rules from data. Fuzzy Sets and Systems, 88, 277-288.

Nauck, D. & Kruse, R. (1998). A Neuro-Fuzzy Approach to Obtain Interpretable Fuzzy Systems for Function Approximation, IEEE International Conference on Fuzzy Systems, 1106-1111.

Ross, T.J. (2004). Fuzzy Logic with Engineering Applications, John Wiley & Sons.

Souza, F.J., Vellasco, M.M.B.R., Pacheco, M.A.C. (2002a). Hierarchical neuro-fuzzy quadtree models, Fuzzy Sets and Systems, 130(2), 189-205.

Souza, F.J., Vellasco, M.M.B.R., Pacheco, M.A.C. (2002b). Load Forecasting with The Hierarchical Neuro-Fuzzy Binary Space Partitioning Model, International Journal of Computers Systems and Signals, 3(2), 118-132.

Takagi, T. & Sugeno, M. (1985). Fuzzy identification of systems and its application to modelling and control, IEEE Trans. on Systems, Man and Cybernetics, 15(1), 116-132.

Tito, E., Zaverucha, G., Vellasco, M.M.B.R., Pacheco, M. (1999). Applying Bayesian Neural Networks to Electrical Load Forecasting, 6th Int. Conf. on Neural Information Processing.

Vellasco, M.M.B.R., Pacheco, M.A.C., Ribeiro-Neto, L.S., Souza, F.J. (2004). Electric Load Forecasting: Evaluating the Novel Hierarchical Neuro-Fuzzy BSP Model, International Journal of Electrical Power & Energy Systems, 26(2), 131-142.

Vuorimaa, P. (1994). Fuzzy self-organizing map, Fuzzy Sets and Systems, 66(2), 223-231.

Zhang, J. & Morris, A.J. (1995). Fuzzy neural networks for nonlinear systems modelling, IEE Proc.-Control Theory Appl., 142(6), 551-561.

KEY TERMS

Artificial Neural Networks: Composed of several units called neurons, connected through synaptic weights, which are iteratively adapted to achieve the desired response. Each neuron performs a weighted sum of its inputs, which is then passed through a nonlinear function that yields the output signal. ANNs have the ability to perform a non-linear mapping between their inputs and outputs, which is learned by a training algorithm.

Bayesian Neural Networks: Multi-layer neural networks that use training algorithms based on statistical Bayesian inference. BNNs offer a number of important advantages over the standard Backpropagation learning algorithm: confidence intervals can be assigned to the predictions generated by a network; the values of regularization coefficients can be selected using only training data; and, similarly, different models can be compared using only the training data, which deals with the issue of model complexity without the need for cross validation.

Binary Space Partitioning: The space is successively divided into two regions, in a recursive way. This partitioning can be represented by a binary tree that illustrates the successive sub-divisions of the n-dimensional space into two convex subspaces. Each division results in two new subspaces that can later be partitioned by the same method.

Fuzzy Logic: Can be used to translate, in mathematical terms, the imprecise information expressed by a set of linguistic IF-THEN rules. Fuzzy Logic studies the formal principles of approximate reasoning and is based on Fuzzy Set Theory. It deals with intrinsic imprecision, associated with the description of the properties of a phenomenon, and not with the imprecision associated with the measurement of the phenomenon itself. While classical logic is of a bivalent nature (true or false), fuzzy logic admits multivalence.

Machine Learning: Concerned with the design and development of algorithms and techniques that allow computers to “learn”. The major focus of machine learning research is to automatically extract useful information from historical data, by computational and statistical methods.

Pattern Recognition: A sub-topic of machine learning, which aims to classify input patterns into a specific class of pre-defined groups. The classification is usually based on the availability of a set of patterns that have already been classified. Therefore, the resulting learning strategy is based on supervised learning.

Supervised Learning: A machine learning technique for creating a function from training data, which consist of pairs of input patterns and the desired outputs. Therefore, the learning process depends on the existence of a “teacher” that provides, for each input pattern, the real output value. The output of the function can be a continuous value (called regression), or a class label of the input object (called classification).

Hierarchical Neuro-Fuzzy Systems Part II

Marley Vellasco
PUC-Rio, Brazil

Marco Pacheco
PUC-Rio, Brazil

Karla Figueiredo
UERJ, Brazil

Flavio Souza
UERJ, Brazil

INTRODUCTION

This paper describes a new class of neuro-fuzzy models, called Reinforcement Learning Hierarchical Neuro-Fuzzy Systems (RL-HNF). These models employ the BSP (Binary Space Partitioning) and Politree partitioning of the input space [Chrysanthou,1992] and have been developed in order to bypass traditional drawbacks of neuro-fuzzy systems: the reduced number of allowed inputs and the poor capacity to create their own structure and rules (ANFIS [Jang,1997], NEFCLASS [Kruse,1995] and FSOM [Vuorimaa,1994]). These new models, named Reinforcement Learning Hierarchical Neuro-Fuzzy BSP (RL-HNFB) and Reinforcement Learning Hierarchical Neuro-Fuzzy Politree (RL-HNFP), descend from the original HNFB that uses Binary Space Partitioning (see Hierarchical Neuro-Fuzzy Systems Part I). By using hierarchical partitioning, together with the Reinforcement Learning (RL) methodology, a new class of Neuro-Fuzzy Systems (SNF) was obtained which, in addition to automatically learning its structure, autonomously learns the actions to be taken by an agent, dispensing with a priori information (number of rules, fuzzy rules and sets) relative to the learning process. These characteristics represent an important advantage over existing learning systems for intelligent agents, because in applications involving continuous and/or high-dimensional environments, traditional Reinforcement Learning methods based on lookup tables (tables that store value functions for a small or discrete state space) are no longer feasible, since the state space becomes too large.

This second part on hierarchical neuro-fuzzy systems focuses on the use of the reinforcement learning process. The first part presented HNFB models based on supervised learning methods. The RL-HNFB and RL-HNFP models were evaluated in a benchmark control application and in a simulated Khepera robot environment with multiple obstacles.

BACKGROUND

The model described in this paper was developed based on an analysis of the limitations in existing models and of the desirable characteristics for RL-based learning systems, particularly in applications involving continuous and/or high dimensional environments [Jouffe,1998][Sutton,1998][Barto,2003][Satoh,2006]. Thus, the Reinforcement Learning Hierarchical Neuro-Fuzzy Systems have been devised to overcome these basic limitations. Two different models of this class of neuro-fuzzy systems have been developed, based on reinforcement learning techniques.

HIERARCHICAL NEURO-FUZZY SYSTEMS

This section presents the new class of neuro-fuzzy systems that are based on hierarchical partitioning. As mentioned in the first part, two sub-sets of hierarchical neuro-fuzzy systems have been developed, according to the learning process used: supervised learning models (HNFB [Souza,2002][Vellasco,2004], HNFB-1 [Gonçalves,2006], HNFB-Mamdani [Bezerra,2005]); and reinforcement learning models (RL-HNFB [Figueiredo,2005a], RL-HNFP [Figueiredo,2005b]). The focus of this article is on the second sub-set of models. These models are described in the following sections.

REINFORCEMENT LEARNING HIERARCHICAL NEURO-FUZZY MODELS

The RL-HNFB and RL-HNFP models are composed of one or more standard cells, called RL-neuro-fuzzy-BSP (RL-NFB) and RL-neuro-fuzzy-Politree (RL-NFP), respectively. The following sub-sections describe the basic cells, the hierarchical structures and the learning algorithm.

Reinforcement Learning Neuro-Fuzzy BSP and Politree Cells

An RL-NFB cell is a mini-neuro-fuzzy system that performs binary partitioning of a given space in accordance with ρ and µ membership functions. In the same way, an RL-NFP cell is a mini-neuro-fuzzy system that performs 2^n partitioning of a given input space, also using complementary membership functions in each input dimension. The RL-NFB and RL-NFP cells generate a precise (crisp) output after the defuzzification process [Figueiredo,2005a][Figueiredo,2005b]. The RL-NFB cell has only one input (x) associated with it. The RL-NFP cell receives all the inputs that are being considered in the problem. For illustration purposes, Figure 1(a) depicts a cell with two inputs, x1 and x2 (Quadtree partitioning), providing a simpler representation than the n-dimensional form of Politree. In Figure 1(a) each partition is generated by the combination of the two membership functions, ρ (low) and µ (high), of each input variable. The consequents of the cell’s poli-partitions may be of the singleton type or the output of a stage of a previous level. Although the singleton consequent is simple, it is not known in advance, because each singleton consequent is associated with an action that has not been defined a priori. Each poli-partition has a set of possible actions (a1, a2, ... an), as shown in Figure 1(a), and each action is associated with a Q-value function. The Q-value is defined as the sum of the expected values of the rewards obtained by the execution of action a in state s, in accordance with a policy π. For further details about RL theory, see [Sutton,1998].

The linguistic interpretation of the mapping implemented by the RL-NFP cell depicted in Figure 1(a) is given by the following set of rules:

rule1: If x1 ∈ ρ1 and x2 ∈ ρ2 then y = ai
rule2: If x1 ∈ ρ1 and x2 ∈ µ2 then y = aj
rule3: If x1 ∈ µ1 and x2 ∈ ρ2 then y = ap
rule4: If x1 ∈ µ1 and x2 ∈ µ2 then y = aq

where consequent ai corresponds to one of the two possible consequents below:

• a singleton (fuzzy singleton consequent, or zero-order Sugeno): the case where ai = constant;
• the output of a stage of a previous level: the case where ai = ym, where ym represents the output of a generic cell ‘m’.

RL-HNFB and RL-HNFP Architectures

RL-HNFB and RL-HNFP models can be created based on the interconnection of the basic cells. The cells form a hierarchical structure that results in the rules that compose the agent’s reasoning. In the example architecture presented in Figure 1(b), the poli-partitions 1, 3, 4, …, m-1 have not been subdivided, and the consequents of their rules are the values a1, a3, a4, …, am-1, respectively. On the other hand, poli-partitions 2 and m have been subdivided, so the consequents of their rules are the outputs (y2 and ym) of subsystems 2 and m, respectively. In turn, these subsystems have, as consequents, the values a21, a22, ..., a2m, and am1, am2, ..., amm, respectively. Each ‘ai’ corresponds to a zero-order Sugeno (singleton) consequent, representing the action that will be identified (among the possible actions), through reinforcement learning, as the most favorable for a certain state of the environment. It must be stressed that the decision of which partitions are subdivided is made automatically by the learning algorithm. The output of the system depicted in Figure 1(b) (defuzzification) is given by equation (1). In this equation, αi corresponds to the firing level of partition i and ai is the singleton consequent of the rule associated with partition i.


Figure 1. (a) RL-NFP cell; (b) RL-HNFP architecture

$$y = \alpha_1 a_1 + \alpha_2 \sum_{i=1}^{2^n} \alpha_{2i}\, a_{2i} + \alpha_3 a_3 + \alpha_4 a_4 + \dots + \alpha_m \sum_{i=1}^{2^n} \alpha_{mi}\, a_{mi} \tag{1}$$

RL-HNFB and RL-HNFP Learning Algorithm

The learning process starts with the definition of the relevant inputs for the system/environment in which the agent operates and of the sets of actions it may use in order to achieve its objectives. The agent must run many cycles to ensure learning in that system/environment. A cycle is defined as the number of steps the agent takes in the environment, from the point where it starts to the target point. The RL-HNFB and RL-HNFP models employ the same learning algorithm. Each partition chooses an action from its set of actions; the resultant action is calculated by the defuzzification process and represents the action that will be executed by the agent’s actuators. After the resultant action is carried out, the environment is read once again. This reading enables calculation of the environment reinforcement value that will be used to evaluate the action taken by the agent. The reinforcement is calculated for each partition of all active cells, according to its participation in the resulting action. Thus, the environment reinforcement calculated by the evaluation function is backpropagated from the root cell to the leaf cells. Next, the Q-values associated with the actions that have contributed to the resulting action are updated, based on the SARSA algorithm [Sutton,1998]. More details can be found in [Figueiredo,2005b].

The RL-HNFB and RL-HNFP models have been evaluated in different control applications. Two of these control applications are presented in the next section.
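For orientation, the textbook SARSA update can be sketched in Python as below. This is not the authors' exact rule: in the RL-HNF models the reinforcement is distributed over partitions according to their participation in the resulting action, a detail omitted here. The learning rate `alpha`, the discount factor `gamma`, the action names and the dictionary layout are all assumptions made for the example.

```python
def sarsa_update(q, action, reward, next_q, next_action, alpha=0.1, gamma=0.9):
    """Standard SARSA step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)).

    q      : Q-values of the actions available in the current partition/state
    next_q : Q-values of the actions available in the next partition/state
    """
    target = reward + gamma * next_q[next_action]
    q[action] += alpha * (target - q[action])
    return q[action]


# Hypothetical partition with two candidate actions.
q = {"left": 0.0, "right": 0.0}
next_q = {"left": 1.0, "right": 0.0}
updated = sarsa_update(q, "right", 1.0, next_q, "left", alpha=0.5, gamma=0.5)
```

Unlike Q-learning, the target uses the Q-value of the action actually chosen in the next state, which is what makes SARSA an on-policy method.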

CASE STUDIES

Cart-Centering

The cart-centering problem [Koza,1992] is generally used as a benchmark in the area of evolutionary programming, where the force that is applied to the car is of the “bang-bang” type [Koza,1992]. This problem was used mainly for the purpose of evaluating how well the RL-HNFB and RL-HNFP models would adapt to changes in the input variable domain without having to undergo a new training phase. The problem consists of parking, in the centre of a one-dimensional environment, a car with mass m that moves along this environment due to an applied force F. The input variables are the position (x) of the car and its velocity (v). The objective is to park the car in position x = 0 with velocity v = 0. The equations of motion are (where the parameter τ represents the time unit):

$$x_{t+\tau} = x_t + \tau\, v_t$$

$$v_{t+\tau} = v_t + \tau\, F_t / m \tag{2}$$
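The equations of motion (2) amount to one explicit Euler step per time unit, as the Python sketch below shows. The mass, the time step, the number of steps and the naive `push_towards_centre` controller are illustrative assumptions only; in particular, this controller is a placeholder for exercising the dynamics, not the policy learned by the RL-HNFB or RL-HNFP models.

```python
def step(x, v, force, mass, tau):
    """One step of Eq. (2); the position update uses the old velocity."""
    return x + tau * v, v + tau * force / mass


def simulate(x, v, policy, mass=2.0, tau=0.02, steps=100):
    """Roll the dynamics forward, asking the policy for a force at each step."""
    trajectory = [(x, v)]
    for _ in range(steps):
        x, v = step(x, v, policy(x, v), mass, tau)
        trajectory.append((x, v))
    return trajectory


def push_towards_centre(x, v):
    """Placeholder bang-bang controller: unit force pointing at x = 0."""
    return -1.0 if x > 0 else 1.0


trajectory = simulate(1.0, 0.0, push_towards_centre, steps=3)
```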


The global reinforcement is calculated by equation (3) below: If (x>0 and v […]

[…] A is a function that determines, for each state, what action to take. For any given policy π, we can define a value function Vπ, representing the expected infinite-horizon discounted return to be obtained from following such a policy starting at state s: $V^\pi(s) = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \dots]$.



Hierarchical Reinforcement Learning

Bellman (1957) provides a recursive way of determining the value function when the reward and transition probabilities of an MDP are known, called the Bellman equation:

$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V^\pi(s'),$$

commonly rewritten as an action-value function or Q-function:

$$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^\pi(s').$$

An optimal policy π*(s) is a policy that returns the action a that maximizes the value function:

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

States can be represented as a set of state variables or factors, representing different features of the environment: s = .
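When R and T are known, the Bellman equation can be applied repeatedly until the values stop changing. The Python sketch below does this for a fixed policy on a hypothetical two-state MDP; the state names, rewards, transitions, discount factor and number of sweeps are invented for illustration, and the nested-dictionary layout for R and T is just one convenient encoding.

```python
def evaluate_policy(states, policy, R, T, gamma, sweeps=200):
    """Iterate V(s) = R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') * V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {
            s: R[s][policy[s]]
            + gamma * sum(p * V[s2] for s2, p in T[s][policy[s]].items())
            for s in states
        }
    return V


def q_value(s, a, V, R, T, gamma):
    """Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())


# Hypothetical MDP: staying in B pays 1 per step, everything else pays 0.
states = ["A", "B"]
R = {"A": {"stay": 0.0, "go": 0.0}, "B": {"stay": 1.0, "go": 0.0}}
T = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
policy = {"A": "go", "B": "stay"}
V = evaluate_policy(states, policy, R, T, gamma=0.5)
```

For this policy the closed-form values are V(B) = 1 / (1 - 0.5) = 2 and V(A) = 0.5 · V(B) = 1, which the iteration approaches.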

Learning in Markov Decision Processes (MDPs)

The reinforcement-learning problem consists of determining or approximating an optimal policy through repeated interactions with the environment (i.e., based on a sample of experiences of the form <state – action – next state – reward>). There are three main approaches to learning such an optimal or near-optimal policy:

• Policy-search methods: learn a policy directly via evaluation in the environment.
• Model-free (or direct) methods: learn the policy by directly approximating the Q function with updates from direct experience.
• Model-based (or indirect) methods: first learn the transition probability and reward functions, and use those to compute the Q function by means of, for example, the Bellman equations.

Model-free algorithms are sometimes referred to as the Q-learning family of algorithms. See Sutton (1988) or Watkins (1989) for the earliest and best-known examples. Model-free methods are known to make inefficient use of experience, but they do not require expensive computation to obtain the Q function and the corresponding optimal policy.

Model-based methods make more efficient use of experience, and thus require less data, but they involve an extra planning step to compute the value function, which can be computationally expensive. Some well-known algorithms can be found in the literature (Sutton, 1990; Moore & Atkeson, 1993; Kearns & Singh, 1998; and Brafman & Tennenholtz, 2002).

Algorithms for reinforcement learning in MDP environments suffer from what is known as the curse of dimensionality: an exponential explosion in the total number of states as a function of the number of state variables. To cope with this problem, hierarchical methods try to break down the intractable state space into smaller pieces, which can be learned independently and reused as needed. To achieve this goal, changes need to be introduced to the standard MDP formalism. In the introduction we mentioned the two main ideas behind hierarchical RL: task decomposition and state abstraction. Task decomposition implies that the agent will not only be performing single-step actions, but also full subtasks which can be extended in time. Semi-Markov Decision Processes (SMDPs) will let us represent these extended actions. State abstraction means that, in certain contexts, certain aspects of the state space will be ignored, and states will be grouped together. Factored-state representations are one way of dealing with this. The following section introduces these two common formalisms used in the HRL literature.
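To make the model-free approach mentioned in this section more concrete, here is a minimal tabular Q-learning sketch in Python. It is an illustration only: the corridor environment, the parameter values and all function names are assumptions, and the update rule is the standard one-step Q-learning rule rather than anything specific to this article.

```python
import random
from collections import defaultdict


def corridor_step(state, action, goal=4):
    """Hypothetical environment: cells 0..goal in a row, reward 1 on reaching the goal."""
    nxt = min(max(state + action, 0), goal)
    done = nxt == goal
    return nxt, (1.0 if done else 0.0), done


def q_learning(step, actions, start, episodes=500, alpha=0.5, gamma=0.9,
               epsilon=0.1, max_steps=10_000, seed=0):
    """Learn Q(s, a) from sampled <state, action, next state, reward> experience."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        state = start
        for _ in range(max_steps):
            best = max(q[(state, a)] for a in actions)
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                # exploit, breaking ties at random
                action = rng.choice([a for a in actions if q[(state, a)] == best])
            nxt, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(
                q[(nxt, a)] for a in actions)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
            if done:
                break
    return q
```

No transition or reward model is stored; the table is updated directly from each observed step, which is why many episodes are needed even for a five-cell corridor.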

Beyond MDPs: SMDPs and Factored-State Representations

We will consider the limitations of the standard MDP formalism by means of an illustrative example. Imagine an agent whose task is to exit a multi-story office building. The starting position of the agent is a certain office on a certain floor, and the goal is to reach the front door at ground level. To complete the task, the agent has to first exit the room, find its way through the hallways to the elevator, take the elevator to the ground floor, and finally find its way from the elevator to the exit. We would like to be able to reason in terms of subtasks (e.g., “exit room”, “go to elevator”, “go to floor X”, etc.), each of a different duration and level of abstraction, and each encompassing a series of lower-level or primitive actions. Each of these subtasks is also concerned with only certain aspects of the full state space: while the agent is inside the room, and the current task is to exit it, the floor the elevator is on, or whether the front door of the building is open or closed, is irrelevant. However, these features will become crucial later as the agent’s subtask changes.

Under the MDP formalization, time is represented as a discrete step of unitary and constant duration. This formulation does not allow the representation of temporally extended actions of varying durations, which are needed to represent the kind of higher-level actions identified in the example. The formalism of semi-Markov Decision Processes (SMDPs) enables this representation (Puterman, 1994). In SMDPs, the transition function is altered to represent the probability that action a from state s will lead to next state s’ after t timesteps:

$\Pr(s', t \mid s, a)$

The corresponding value function is now:

$V^{\pi}(s) = R(s, \pi(s)) + \sum_{s' \in S,\, t} \gamma^{t} \Pr(s', t \mid s, \pi(s)) V^{\pi}(s')$

SMDPs also enable the representation of continuous time. For dynamic programming algorithms for solving SMDPs, see Puterman (1994) and Mahadevan et al. (1997).

Factored-state MDPs deal with the fact that certain aspects of the state space are irrelevant for certain actions. In factored-state MDPs, state variables are decomposed into independently specified components, and transition probabilities are defined as a product of factor probabilities. A common way of representing independence relations between state variables is through Dynamic Bayes Networks (DBNs). As an example, imagine that the state is represented by four state variables, s = <f1, f2, f3, f4>, and we know that for action a the value of variable f1 in the next state only depends on the prior values of f1 and f4, f2 depends on f2 and f3, and the others only depend on their own prior value. This transition probability in a Factored MDP would be represented as:

$\Pr(s' \mid s, a) = \Pr(f_1' \mid f_1, f_4, a) \, \Pr(f_2' \mid f_2, f_3, a) \, \Pr(f_3' \mid f_3, a) \, \Pr(f_4' \mid f_4, a)$
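The factored product above can be sketched in Python for a single fixed action. Everything numeric here is an assumption: the variables are taken to be binary, and `p_one` is a made-up conditional probability used only so that the example runs; only the parent structure (f1 depends on f1 and f4, f2 on f2 and f3, f3 and f4 on themselves) comes from the text.

```python
from itertools import product

# Parents of each next-state variable for one fixed action a, as in the example.
PARENTS = {"f1": ("f1", "f4"), "f2": ("f2", "f3"), "f3": ("f3",), "f4": ("f4",)}


def p_one(parent_values):
    """Hypothetical P(f' = 1 | parents): a smoothed fraction of parents equal to 1."""
    return (1 + sum(parent_values)) / (2 + len(parent_values))


def transition_prob(next_state, state):
    """Pr(s' | s, a) as a product of one factor per state variable."""
    prob = 1.0
    for var, parents in PARENTS.items():
        p1 = p_one(tuple(state[p] for p in parents))
        prob *= p1 if next_state[var] == 1 else 1.0 - p1
    return prob


def all_states():
    """Enumerate the 16 joint assignments of the four binary variables."""
    return [dict(zip(PARENTS, values)) for values in product((0, 1), repeat=4)]
```

With binary variables, a flat transition table for this action needs one row per joint state (16 rows of 16 entries), whereas the factored form needs only 4 + 4 + 2 + 2 = 12 conditional entries; the gap widens exponentially as variables are added.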

For learning algorithms in factored-state MDPs, see Kearns & Koller (1999) and Guestrin et al. (2002).

HIERARCHICAL REINFORCEMENT-LEARNING METHODS

Different approaches and goals can be identified within the hierarchical reinforcement-learning subfield. Some algorithms are concerned with learning a hierarchical view of either the environment or the task at hand, while others are just concerned with exploiting this knowledge when provided as input. Some techniques try to learn or exploit temporally extended actions, abstracting together a set of actions that lead to the completion of a subtask or subgoal. Other methods try to abstract together different states, treating them as if they were equal from the point of view of the learning problem. We will briefly review a set of algorithms that use some combination of these approaches. We will also identify which of these methods are based on the model-free learning paradigm as opposed to those that try to construct a model of the environment.

Options: Learning Temporally Extended Actions in the SMDP Framework

Options make use of the SMDP framework to allow the agent to group together a series of actions (an option’s policy) that lead to a certain state or set of states identified as subgoals. For each option, a set of valid start states is also identified, where the agent can decide whether to perform a single-step primitive action, or to make use of the option. We can think of options as pre-stored policies for performing abstract subtasks. A learning algorithm for options is described by Sutton, Precup & Singh (1999) and belongs to the model-free Q-learning family. In its current formulation, the options framework allows for two-level hierarchies of tasks, although it could potentially be generalized to multiple levels. End states (i.e., subgoals) are given as input to the algorithm. There is work devoted to discovering these subgoals and constructing useful options from them (Şimşek et al., 2005; and Jong & Stone, 2005).



While options have been shown to improve the learning time of model-free algorithms, it is not clear that there is an advantage in terms of learning time over model-based methods. Like any model-free method, though, they do not suffer from the computational cost involved in the planning step. It is still an open question whether options can be generalized to multiple-level hierarchies, and most of the work is empirical, with no theoretical bounds.
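The three ingredients of an option described above (valid start states, an internal policy, and subgoal states where it ends) can be sketched as a small Python data structure. The class, field and function names are assumptions chosen for illustration, and the corridor example is hypothetical; this is not the learning algorithm of Sutton, Precup & Singh (1999), only a way of executing a given option as one temporally extended action.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class Option:
    start_states: Set[int]            # states where the option may be invoked
    policy: Callable[[int], int]      # the option's own policy: state -> primitive action
    subgoals: Set[int]                # states where the option terminates


def run_option(option, state, step, gamma=0.9, max_steps=1000):
    """Execute an option until a subgoal is reached.

    Returns the final state, the discounted reward accumulated on the way,
    and the number of primitive steps t taken (the SMDP duration).
    """
    if state not in option.start_states:
        raise ValueError("option is not available in this state")
    total, discount, t = 0.0, 1.0, 0
    while state not in option.subgoals and t < max_steps:
        state, reward = step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        t += 1
    return state, total, t


def corridor(state, action):
    """Hypothetical environment: move along a line, paying a cost of 1 per step."""
    return state + action, -1.0


# Hypothetical "go to the door" option: from cells 0..2, walk right until cell 3.
go_to_door = Option(start_states={0, 1, 2}, policy=lambda s: 1, subgoals={3})
```

From the caller's point of view, `run_option(go_to_door, 0, corridor)` behaves like a single action that lasts three time steps, which is exactly the situation the SMDP transition function Pr(s’, t | s, a) is meant to describe.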

MaxQ: Combining a Hierarchical Task Decomposition with State Abstraction

MaxQ is also a model-free algorithm in the Q-learning family. It receives as input a multi-level hierarchical task decomposition, which decomposes the full underlying MDP into an additive combination of smaller MDPs. Within each task, abstraction is used so that state variables that are irrelevant for the task are ignored (Dietterich, 2000). The main drawback of MaxQ is that the hierarchy and abstraction have to be provided as input, and in its model-free form it misses opportunities for faster learning.
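As a hedged illustration of what the additive combination looks like, the value function decomposition in Dietterich (2000) is usually written along the following lines. The symbols i (a subtask), a (a child subtask or primitive action invoked by i) and C (the completion function) are notation introduced here for the sketch and do not appear elsewhere in this article.

```latex
% Value of invoking child a inside subtask i: value of a itself
% plus the expected return for completing i afterwards.
Q^{\pi}(i, s, a) = V^{\pi}(a, s) + C^{\pi}(i, s, a)

% The value of a subtask is defined recursively down to primitive actions.
V^{\pi}(i, s) =
\begin{cases}
  Q^{\pi}(i, s, \pi_i(s)) & \text{if } i \text{ is a composite subtask} \\
  \sum_{s'} \Pr(s' \mid s, i) \, R(s' \mid s, i) & \text{if } i \text{ is a primitive action}
\end{cases}
```

Read this way, the value of the root task is a sum of terms, one per level of the hierarchy, and each term can be learned using only the state variables relevant to its own subtask.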

DSHP: Model-Based Hierarchical Decomposition for Efficient Learning and Planning

Deterministic Sample-Based Hierarchical Planning (DSHP) combines factored-state MDP representations, a MaxQ hierarchical task decomposition, and model-based learning to achieve provably efficient learning and planning in deterministic domains (Diuk, Strehl & Littman, 2006). While, as a model-based algorithm, DSHP allows for faster learning and planning, it still suffers from the problem that the hierarchy and abstraction have to be provided as input.

HEXQ: Discovering Hierarchy

As opposed to MaxQ, DSHP, or other methods that receive the hierarchical task decomposition as input, HEXQ tries to discover it automatically. HEXQ analyses traces of experience and identifies regions of the MDP with repeated characteristics. It uses this experience to build temporal and state abstractions, constructing a hierarchy of smaller interlinked MDPs. HEXQ is model-free and based on Q-learning (Hengst, 2002). HEXQ is a promising method for discovering abstractions and hierarchies, but it still lacks theoretical bounds or proofs. All the work using HEXQ has been empirical, and its general power remains an open question.

HAM-PHAM: Restricting the Class of Possible Policies

Hierarchies of Abstract Machines (HAMs) also make use of the SMDP formalism. The main idea is to restrict the class of possible policies by means of small nondeterministic finite-state machines, which constrain the sequences of actions that are allowed. Elements in HAMs can be thought of as small programs, which at certain points can decide to make calls to other lower-level programs (Parr & Russell, 1997; and Parr, 1998). See also Programmable HAMs (PHAMs), an extension by Andre & Russell (2000). HAMs provide an interesting approach to making learning and planning easier, but they too have only been shown to work better in certain empirical examples.

FUTURE TRENDS

We expect to see most of the new work in the field of Hierarchical Reinforcement Learning tackling two areas: hierarchy and abstraction discovery, and transfer learning. We believe the main open question is how structure can be learned from experience and, once learned, be applied to tasks and problems different from the original one. There is also promising, though still scarce, theoretical work being produced in the area, which could prove the general power of different methods. Most of the work is empirical and has only been shown to work through experiments in small domains.

CONCLUSION

The goal of hierarchical reinforcement learning is to combat the “curse of dimensionality”, the main obstacle in achieving scalable RL that can be applied to real-life problems, by means of hierarchical task decompositions and state abstraction. This active area of research has achieved mixed results, with algorithms and frameworks focusing on just one or two combinations of the different aspects of the problem. A single approach that can deal with structure discovery and its use, with both temporal and state abstraction, and that can provably learn and plan in polynomial time is still the main item in the research agenda of the field.

REFERENCES

Andre, D. & Russell, S. J. (2000). Programmable reinforcement learning agents. Advances in Neural Information Processing Systems (NIPS).

Barto, A.G. & Mahadevan, S. (2003). Recent Advances in Hierarchical Reinforcement Learning. Special Issue on Reinforcement Learning, Discrete Event Systems Journal. (13) 41-77.

Bellman, R. (1957). Dynamic Programming. Princeton University Press.

Boutilier, C.; Dean, T.; & Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research. (11) 1-94.

Brafman, R. & Tennenholtz, M. (2002). R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research.

Dietterich, T.G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research (13) 227–303.

Diuk, C.; Strehl, A. & Littman, M.L. (2006). A Hierarchical Approach to Efficient Reinforcement Learning in Deterministic Domains. Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS).

Guestrin, C.; Patrascu, R.; & Schuurmans, D. (2002). Algorithm-directed exploration for model-based reinforcement learning in factored MDPs. Proceedings of the International Conference on Machine Learning, 235–242.

Hengst, B. (2002). Discovering hierarchy in reinforcement learning with HEXQ. Proceedings of the 19th International Conference on Machine Learning.

Jong, N. & Stone, P. (2005). State Abstraction Discovery from Irrelevant State Variables. Proceedings of the 19th International Joint Conference on Artificial Intelligence.

Kaelbling, L. P.; Littman, M. L., & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research. (4) 237-285.

Kearns, M. & Singh, S. (1998). Near-Optimal Reinforcement Learning in Polynomial Time. Proceedings of the 15th International Conference on Machine Learning.

Kearns, M. J., & Koller, D. (1999). Efficient reinforcement learning in factored MDPs. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI), 740–747.

Mahadevan, S., Marchalleck, N., Das, T. & Gosavi, A. (1997). Self-improving factory simulation using continuous-time average-reward reinforcement learning. Proceedings of the 14th International Conference on Machine Learning.

Moore, A. & Atkeson, Ch. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning.

Parr, R. & Russell, S. (1997). Reinforcement learning with hierarchies of machines. Proceedings of Advances in Neural Information Processing Systems 10.

Parr, R. (1998). Hierarchical Control and learning for Markov decision processes. PhD thesis, University of California at Berkeley.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York.

Şimşek, Ö., Wolfe, A.P. & Barto, A. (2005). Identifying useful subgoals in reinforcement learning by local graph partitioning. Proceedings of the 22nd International Conference on Machine Learning.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning.

Sutton, R. S. (1990). Integrated architectures for learning, planning and reacting based on approximating dynamic programming. Proceedings of the 7th International Conference on Machine Learning.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. The MIT Press.

Sutton, R.; Precup, D. & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.

Watkins, C. (1989). Learning from Delayed Rewards. PhD Thesis.

KEY TERMS

Factored-State Markov Decision Process: An extension to the MDP formalism used in Hierarchical RL where the transition probability is defined in terms of factors, allowing the representation to ignore certain state variables under certain contexts.

Hierarchical Reinforcement Learning: A subfield of reinforcement learning concerned with the discovery and use of task decomposition, hierarchical control, and temporal and state abstraction (Barto & Mahadevan, 2003).

Hierarchical Task Decomposition: A decomposition of a task into a hierarchy of smaller subtasks.

Markov Decision Process: The most common formalism for environments used in reinforcement learning, where the problem is described in terms of a finite set of states, a finite set of actions, transition probabilities between states, a reward signal and a discount factor.

Reinforcement Learning: The problem faced by an agent that learns behavior maximizing a utility measure from its interaction with the environment.

Semi-Markov Decision Process: An extension to the MDP formalism that deals with temporally extended actions and/or continuous time.

State-Space Generalization: The technique of grouping together states in the underlying MDP and treating them as equivalent for certain purposes.

High Level Design Approach for FPGA Implementation of ANNs

Nouma Izeboudjen
Center de Développement des Technologies Avancées (CDTA), Algérie

Ahcene Farah
Ajman University, UAE

Hamid Bessalah
Center de Développement des Technologies Avancées (CDTA), Algérie

Ahmed Bouridene
Queens University of Belfast, UK

Nassim Chikhi
Center de Développement des Technologies Avancées (CDTA), Algérie

INTRODUCTION

Artificial neural networks (ANNs) are systems derived from the field of neuroscience and characterized by intensive arithmetic operations. These networks display interesting features such as parallelism, classification, optimization, adaptation, generalization and associative memories. Since the pioneering work of McCulloch and Pitts (McCulloch, W.S., & Pitts, W., 1943), there has been much discussion on the topic of ANN implementation, and a huge diversity of ANNs has been designed (C. Lindsey & T. Lindblad, 1994). The benefits of such implementations are discussed in a paper by R. Lippmann (Richard P. Lipmann, 1984): “The great interest of building neural networks remains in the high speed processing that can be achieved through massively parallel implementation”. In another paper, Clark S. Lindsey (C.S Lindsey, Th. Lindbald, 1995) posed a real dilemma of hardware implementation: “Build a general, but probably expensive system that can be reprogrammed for several kinds of tasks like CNAPS for example? Or build a specialized chip to do one thing but very quickly, like the IBM ZISC Processor”. To overcome this dilemma, most researchers agree that an ideal solution should combine the performance obtained using specific hardware implementation with the flexibility allowed by software tools and general purpose chips.

Since their commercial introduction in the mid-1980s, and due to advances in both microelectronic technology and specific CAD tools, FPGA devices have progressed in both an evolutionary and a revolutionary way. The evolutionary process has brought faster and bigger FPGAs, better CAD tools and better technical support. The revolutionary process concerns the introduction of high performance multipliers, microprocessors and DSP functions. This has a direct impact on FPGA implementation of ANNs, and a lot of research has been carried out to investigate the use of FPGAs in ANN implementation (Amos R. Omandi & Jagath C. Rajapakse, 2006). Another attractive key feature of FPGAs is their flexibility, which can be obtained at different levels: exploitation of the programmability of the FPGA, dynamic reconfiguration or run time reconfiguration (RTR) (Xilinx XAPP290, 2004), and the application of the design for reuse concept (Keating, Michael; Bricaud, Pierre, 2002).

However, a big disadvantage of FPGAs is the low level, hardware oriented programming model needed to fully exploit the FPGA’s potential performance. High level VHDL-based synthesis tools have been proposed to bridge the gap between the high level application requirements and the low level FPGA hardware, but these tools are not algorithm or application specific. Thus, special concepts need to be developed for automatic ANN implementation before using synthesis tools. In this paper, we present a high level design methodology for ANN implementation that attempts to build a bridge between the synthesis tool and the ANN design requirements. This method offers high flexibility in the design while meeting speed/area performance constraints. Three implementation options for the ANN based on the back propagation algorithm are considered: the off-chip implementation, the global on-chip implementation and the dynamic reconfiguration of the ANN. To achieve our goal, a design for reuse strategy has been applied. To validate our approach, three case studies are considered using the Virtex-II and Virtex-4 FPGA devices. A comparative study is carried out and new conclusions are given.

BACKGROUND

In this section, a theoretical presentation of the multilayer perceptron (MLP) based back propagation algorithm is given. Then, the works most closely related to high level design methodology and ANN frameworks are discussed.

Theoretical Background of the Back Propagation Algorithm

Back propagation is one of the best-known algorithms used to train the MLP network in a supervised mode. The MLP is executed in three phases: the feed forward phase, the error calculation phase and the synaptic weight updating phase (Freeman, J. A. and Skapura, D. M, 1991). In the feed forward phase, a pattern x is applied to the input layer and the resulting signal is forward propagated through the network until the final outputs have been calculated; for each i (index of neuron) and j (index of layer):

$M_i^{[j]} = \sum_{k=1}^{n_0} W_{ik}^{[j]} x_k$   (1)

$O_i^{[j]} = f(M_i^{[j]}) = \frac{1}{1 + \exp(-M_i^{[j]})}$   (2)

where $M_i^{[j]}$ is the sum of the $n_0$ inputs weighted by the synaptic weights and $O_i^{[j]}$ is the output of the sigmoid activation function.

The error calculation step computes the local error δ for each layer, starting from the output and moving back to the input:

$\delta_i^{L} = f'(u_i^{L}) (d_i - y_i)$   (3)

$\delta_j^{l-1} = f'(u_j^{l-1}) \sum_{i=1}^{N_l} w_{ij}^{l} \delta_i^{l}$,   $1 \le i \le N_l$, $1 \le l \le L$   (4)

where $d_i$ is the desired output and f’ is the derivative of f. Here u denotes the weighted sum (written M in equation (1)) and y the neuron output (written O in equation (2)).

The weight update step computes the weight updates according to:

$w_{ij}^{l}(t + 1) = w_{ij}^{l}(t) + \Delta w_{ij}^{l}(t)$   (5)

$\Delta w_{ij}^{l}(t) = \eta \, \delta_i^{l} \, y_j^{l-1}$   (6)

where η is the learning factor, Δw the variation of the weights and l the index of the layer.
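The three phases (feed forward, error calculation, weight update) can be sketched in plain Python as follows. This is a software illustration under stated assumptions, not the hardware design discussed in this article: the function names and the random initialisation are hypothetical, and a constant bias input is added to every layer even though it does not appear in equations (1) to (6).

```python
import math
import random


def sigmoid(m):
    return 1.0 / (1.0 + math.exp(-m))


def make_network(sizes, seed=0):
    """weights[l][i][j]: weight from input j of layer l to its neuron i (last j is the bias)."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]


def forward(weights, x):
    """Equations (1)-(2): return the outputs of every layer (index 0 is the input)."""
    outputs = [list(x)]
    for layer in weights:
        prev = outputs[-1] + [1.0]  # assumption: extra constant input acting as a bias
        outputs.append([sigmoid(sum(w * p for w, p in zip(neuron, prev)))
                        for neuron in layer])
    return outputs


def train_step(weights, x, d, eta=0.5):
    """One pass of error calculation (3)-(4) and weight update (5)-(6), in place."""
    outputs = forward(weights, x)
    # Equation (3); for the sigmoid, f'(u) = y * (1 - y).
    delta = [y * (1.0 - y) * (di - y) for y, di in zip(outputs[-1], d)]
    for l in range(len(weights) - 1, -1, -1):
        prev = outputs[l] + [1.0]
        if l > 0:
            # Equation (4), using the weights before they are updated.
            below = [outputs[l][j] * (1.0 - outputs[l][j])
                     * sum(weights[l][i][j] * delta[i] for i in range(len(delta)))
                     for j in range(len(outputs[l]))]
        for i, neuron in enumerate(weights[l]):
            for j in range(len(neuron)):
                neuron[j] += eta * delta[i] * prev[j]  # equations (5)-(6)
        if l > 0:
            delta = below
    return outputs[-1]
```

Calling `train_step` repeatedly on the same pattern moves the network output towards the desired value; training on a full data set simply cycles through the patterns.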

Background on ANN Frameworks

The works most closely related to ANN frameworks are presented by (F. Schurmann & all, 2002), (M. Diepenhorst & all, 1999), and (J. Zhu & all, 1999). On the other hand, with the increasing complexity of FPGA circuits, core-based synthesis methodology has been proposed as a new trend for efficient hardware implementation on FPGAs. These tools offer a library of pre-designed intellectual property (IP) cores. Examples can be found in (Xilinx Core Generator reference) and (Opencores reference). In the core-based design methodology, efficient reuse is derived from parameterized design with VHDL and its many flexible constructs and characteristics (i.e. abstraction, encapsulation, inheritance and reuse through attributes, packages, procedures and functions). Besides this, the reuse concept is well suited to highly regular and repetitive structures such as neural networks. However, despite all these advantages, little attention has been paid to applying design for reuse to ANNs. In this context, our paper presents a new high level design methodology for ANNs based on the design for reuse concept. In order to achieve this goal, the design must fulfill the following requirements (Keating, Michael; Bricaud, Pierre, 2002):

• The design must be block-based.
• The design must be reconfigurable to meet the requirements of many different applications.
• The design must use standard interfaces.
• The code must be synthesizable at the RTL level.
• The design must be verified.
• The design must have robust scripts and must be well documented.

PRESENTATION OF THE PROPOSED DESIGN APPROACH

The proposed design approach is shown in Fig. 1 as a process flow. The methodology is based on a top down design approach in which the designer/user is guided step by step through the design process of the ANN. First, the user is asked to select the dimension of the network. The next step involves the selection of the ANN implementation choice: the off-chip implementation, the global on-chip implementation or implementation using run time reconfiguration (RTR). A core is generated for each type of implementation. At this level, the user/designer can fix the parameters of the network, i.e. the number of neurons in each layer, synaptic values, multiplier type, data representation and precision. At the low level, all the IP cores that construct the neuron are generated automatically from the library of the synthesis tool, which in our case is MENTOR GRAPHICS (Mentor Graphics user guide reference), and which also integrates the Xilinx IP Core Generator. In addition, for each IP core, a graphical interface is generated to fix its parameters. Thus, the user/designer can change the network performance and architecture by changing the IP cores that are stored in the library. Then VHDL code at the register transfer level (RTL) is generated for synthesis. Before synthesis, functional simulation is required. The result is a netlist file ready for place and route, followed by final FPGA prototyping on a board. Documentation is available at each level of the design process and the code is well commented. Thus, the design for reuse requirements are applied throughout the design process. In what follows, each implementation type is presented.

Figure 1. The proposed design methodology (flow chart: BP graphical user interface; ANN dimension; choice between off-chip training, global on-chip implementation and dynamic reconfiguration; selection of the corresponding off-chip, on-chip or RTR ANN core; definition of ANN parameters; reusable IP cores; generation of VHDL code at RTL level; functional simulation; RTL synthesis tool; implementation with place and route tools; FPGA prototyping)

equation (2). As shown in Fig. 4, the hardware model of the neuron is mainly based on a:

The Feed Forward Off-Chip Implementation

•

Fig. 2 shows a top view of the feed forward core which is composed of a data path module and a control module. At the top level these two modules are represented by black boxes and only the neural network inputs and outputs signals are shown to the user/designer. By clicking inside the boxes, we can get access to the network architecture which is composed of three layers represented by black boxes as shown in Fig. 3 (left side). By clicking inside each box, we can get access to the layer architecture which is composed of black boxes representing the neurons as shown in Fig. 3 (right side); and by clicking inside each neuron’s box we can get access to the neuron hardware architecture as shown in Fig 4. Each neuron implements the accumulated weight sum of equation (1) and the activation function of

•

•

• •

Memory circuit where the final values of the synaptic weights are stocked, A multiply circuit (MULT) which computes the product of the stored synaptic weights with inputs data An accumulator circuit (ACUM) which computes the sum of the above products A circuit that approximates the activation function (example linear function or sigmoid function) A multiplexer circuit (MUX) in the case of serial transfer between inputs in the same neuron

The neural network architecture has the following properties: • • •

Computation between layers is done serially For the same layer, neurons are computed in parallel For the same neuron, only one multiplier and one accumulator (MULT +ACUM=MAC) are used to compute the product sum.

Figure 2.The feed forward core module using the mentor graphics design tool Synaptic weights

Inputs

Outputs selectF clk

wi[ j ]

Feed Forward

Feed Forward Control

reset2 reset1 reset

sel2 sel1 sel

write2 write1 write

read2 read1 read

load2 load1 load

834

addr

add_sub clk resetF


Figure 3. The ANN architecture

H N eurone 1

N eurone 1 Layer 1

Layer2

Layer 3 N eurone 3

N eurone_n F eed forw ard C ontrol

Layer

Figure 4. Equivalent hardware architecture of the neuron

Memory

X1 X2…...Xn

MULT MUX

ACUM

Sy naptic w eights

W ji[l ]

Activation Output function circuit

Inputs

• •

Each multiplier is connected to a memory. The depth of each memory is equal to the number of neurons constituting the layer The whole network is controlled by a control unit module.

Each circuit that constructs the neuron is an IP core “Intellectual Property” that can be generated from the Xilinx Core Generator. The feed forward control module is composed of three phases: control of the neuron, control of the 835


layer and control of the network. Considering the fact that neurons work in parallel, so control of the layer is similar to the control of the neuron plus the multiplexer’s control. Control of the neuron is divided into four phases: start, initialization, synaptic multiplication/accumulation and storage of the weighted sum. The first state diagram of the feed forward control module which was designed, was based on the Moore machine in which the system vary only when its state change. The drawback of this machine is that it is not generic. For example, (load=0, reset=0) allows the accumulator to add a value present at the input register. This accumulated value must be done as many times as the number of neurons in the previous layer. Thus, if we change the number of neurons from one layer to another one, we have to change all the flow state of the control module. To overcome this problem, the Moor machine is replaced by the Mealy machine in which we add a counter program with a generic value M and a transition variable Max such that:

\[
\text{Max} =
\begin{cases}
1 & \text{if } \mathit{output\_counter} = M \\
0 & \text{otherwise}
\end{cases}
\]

where the value of M is set equal to the number of neurons. By using this strategy, we obtain an architecture that has two key features: genericity and flexibility. Genericity is related to the data word size, precision, and memory depth, which are kept as generic parameters at the top level of the VHDL description. Flexibility is related to the size of the network (the number of neurons in each layer): neurons can be added by a simple copy/paste of the neuron boxes or cores, and removed by a simple cut operation on the boxes. It is also possible to use other IP cores from the library (for example, replacing the parallel MULT with a pipelined MULT) to change the performance of the network without changing the VHDL code. Thus, the design for reuse concept is applied.

The Direct On-Chip Implementation Strategy

In this section, we propose the equivalent architecture for implementation of the three successive phases of the back propagation algorithm. Fig. 5 depicts the proposed architecture, which is composed of a feed forward module, an Error-calculation module and an Update module. The three modules are controlled by a global control unit. The feed forward module computes equations (1) and (2), the Error module computes equations (3) and (4), and the Update module computes equations (5) and (6). Each module exhibits a high degree of regularity, modularity and repeatability, which makes the whole ANN a good candidate for the application of the design for reuse concept. As in the off-chip implementation case, the global control unit was first designed as a Moore machine that integrates control of the three modules: feed forward, error and update. In order to achieve reuse, we have replaced the Moore machine by a Mealy machine. Thus, the size of the network can be modified by simple copy/paste or remove operations on the boxes.

The Run Time Reconfiguration Strategy

Our strategy for run time reconfiguration proceeds as follows. First, the feed forward and the global control modules are configured. The results are stored in the Bus macro module of the Virtex FPGA device. In the next step, the feed forward module is reset from the FPGA and the Update and Error modules are configured. The generated results are stored in the Bus macro modules, and the same procedure is applied to the next training example of the ANN. A more detailed description is given in (N. Izeboudjen et al., 2007).
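The run time reconfiguration schedule can be illustrated with a small software sketch. It is only an analogy, not the authors' VHDL design: the dictionary `bus_macro` stands in for the Bus macro storage, each phase function stands in for one FPGA configuration, and the weight update is assumed to follow the standard textbook delta rule for sigmoid units without biases, since equations (1) to (6) are not reproduced here. All function and variable names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def feed_forward(w_hidden, w_out, x):
    """Configuration 1: compute hidden and output activations."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, out


def error_and_update(w_hidden, w_out, x, hidden, out, target, lr):
    """Configuration 2: error terms, then weight update (assumed delta rule)."""
    delta_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]
    delta_hid = [
        h * (1.0 - h) * sum(d * w_out[k][j] for k, d in enumerate(delta_out))
        for j, h in enumerate(hidden)
    ]
    new_w_out = [[w + lr * d * h for w, h in zip(row, hidden)]
                 for row, d in zip(w_out, delta_out)]
    new_w_hidden = [[w + lr * d * xi for w, xi in zip(row, x)]
                    for row, d in zip(w_hidden, delta_hid)]
    return new_w_hidden, new_w_out


def train_rtr(w_hidden, w_out, samples, lr=0.5, epochs=1):
    """Alternate the two configurations for every training example."""
    bus_macro = {}          # stand-in for the Bus macro storage
    reconfigurations = 0
    for _ in range(epochs):
        for x, target in samples:
            reconfigurations += 1   # load the feed forward configuration
            bus_macro["hidden"], bus_macro["out"] = feed_forward(w_hidden, w_out, x)
            reconfigurations += 1   # swap in the Error and Update configuration
            w_hidden, w_out = error_and_update(
                w_hidden, w_out, x,
                bus_macro["hidden"], bus_macro["out"], target, lr)
    return w_hidden, w_out, reconfigurations
```

In this sketch every training example costs two reconfigurations, which mirrors the trade-off of the approach: a smaller device can be used, at the price of repeated configuration steps.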

Performance Evaluation

In this section, we discuss the performance of the three implementations of the back propagation algorithm. The parameters considered are the number of configurable logic blocks (CLB), the response time (TR) and the number of million connections per second (MCPS). These parameters are compared between the Virtex-II and Virtex-4 families. Functional simulation is performed with the ModelSim simulator (ModelSim user guide reference). RTL synthesis is performed with the Mentor Graphics synthesis tool (Mentor Graphics synthesis tool user guide reference) and, for final implementation, the ISE Foundation place and route tool (version 8.2) is used (ISE Foundation user guide reference).


Figure 5. Architecture of the BP algorithm (Feed_forward, Error_calculation and Update modules, with the Feed Forward Control and Error & Update Control units)

Our first application is an ANN classifier used to classify coronary heart diseases. The network was trained off chip using the MATLAB 6.5 tool. After training, the dimension of the network as well as the synaptic weights were fixed. The network has a dimension of (1, 8, 1) and the synaptic weights have a data width of 24 bits. For this application we selected the XC2V1000 and XC4VLX15 devices, of the Virtex-II and Virtex-4 families respectively. Synthesis results show that the XC2V1000 circuit consumes 99% of the CLBs, with a response time TR = 44.46 ns and MCPS = 360. The XC4VLX15 consumes 82% of the CLBs, with TR = 26.76 ns and MCPS = 597. Thus, the XC4VLX15 achieves better performance in terms of area (a gain of 19% of CLBs), with a speed ratio of 1.6 and an MCPS ratio of 1.6.

Our second application is the classical (2, 2, 1) “XOR” ANN, which is used as a benchmark for non-linearly separable problems. The general on-chip learning implementation has been applied to the network. Area constraints could not be met for the XC2V1000 or the XC4VLX15, and we tried several devices before settling on the XC4VLX80 for Virtex-4 and the XC2V8000 for Virtex-II. Synthesis results show that the XC2V8000 circuit consumes 22% of the CLBs, with TR = 59.5 ns and MCPS = 202. The XC4VLX80 consumes 30% of the CLBs, with TR = 47.93 ns and MCPS = 250. From these results we can conclude that the Virtex-II family saves 8% of the CLBs in terms of area; this is because the Virtex-II device integrates more multipliers than the Virtex-4 device, in which the MAC component is integrated into the DSP48 (the XC4VLX80 has 80 MAC DSP blocks and the XC2V8000 has 168 block multipliers). However, the Virtex-4 circuit is faster than the Virtex-II and achieves more MCPS (ratio of ~1.24). The on-chip implementation requires many multipliers, which is why we recommend using it if the timing constraints are not critical.

In the third application, three arbitrary networks are used to show the performance of the RTR over the global implementation: a (3,3,3) network, a (16,16,16) network and a (16,64,8) network. The results show that when the network is large, it is difficult to implement the whole RPG in one FPGA. With the RTR we can achieve more than 30% reduction in area and more than 40% increase in speed and MCPS.
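The speed and MCPS ratios quoted for the first two applications can be rechecked from the reported figures. The short sketch below only redoes that arithmetic; the dictionary layout and the function name `compare` are illustrative choices, not part of the original work.

```python
# Reported synthesis figures: response time TR in ns and MCPS per device family.
RESULTS = {
    "classifier (1, 8, 1)": {
        "virtex2": {"tr_ns": 44.46, "mcps": 360},
        "virtex4": {"tr_ns": 26.76, "mcps": 597},
    },
    "XOR (2, 2, 1)": {
        "virtex2": {"tr_ns": 59.5, "mcps": 202},
        "virtex4": {"tr_ns": 47.93, "mcps": 250},
    },
}


def compare(entry):
    """Return (speed ratio, MCPS ratio) of Virtex-4 relative to Virtex-II."""
    speed = entry["virtex2"]["tr_ns"] / entry["virtex4"]["tr_ns"]
    throughput = entry["virtex4"]["mcps"] / entry["virtex2"]["mcps"]
    return round(speed, 2), round(throughput, 2)


for name, entry in RESULTS.items():
    print(name, compare(entry))
```

Both ratios come out close to 1.66 for the classifier and close to 1.24 for the XOR network, in line with the rounded values given in the text.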

FUTURE TRENDS

The proposed ANN environment is still under construction. The design approach is based on the use of pre-designed IP cores generated with the Xilinx Core Generator tool. Our next objective is to enrich and enhance the library of IP cores, especially for the implementation of the activation function (sigmoid, linear transfer circuits), and to evaluate and compare the performance of the ANN with respect to other pre-designed IP cores. We also plan to extend the reuse concept to other ANN algorithms (Kohonen, Hopfield networks). Concerning run-time reconfiguration (RTR), the next step is to integrate the RTR design approach with the PlanAhead design tool (PlanAhead user guide reference). As future work, we plan to evaluate and analyse the cost of the design for reuse concept applied to ANNs.

CONCLUSION

In this paper, we have presented a successful design approach for FPGA implementation of ANNs. We have applied a design for reuse strategy and parametric design to achieve our goal. The proposed methodology offers high flexibility because the size of the ANN can be changed by simply copying or removing neuron cores. In addition, the format, data widths and precision are treated as generic parameters. Thus, different applications can be targeted in a reduced design time. As for the three applications, the first conclusion is that the new Virtex-4 FPGA devices achieve faster networks than Virtex-II devices; regarding area, i.e. the number of CLBs, the Virtex-II is better. Thus, in our opinion, the Virtex-II is well suited as a platform for experimenting with ANN implementations. This can help to give new directions for future work.

REFERENCES

C. S. Lindsey and T. Lindblad (1994), “Review of Neural Network Hardware: A user’s perspective”, IEEE Third Workshop on Neural Networks: from Biology to High Energy Physics.

Amos R. Omondi and Jagath C. Rajapakse (2006), “FPGA implementation of neural networks”, Springer Verlag.

C. S. Lindsey and Th. Lindblad (1995), “Survey of neural network hardware”, SPIE Vol. 2492, pp. 1194-1205.

M. Diepenhorst, M. van Veelen, J. A. G. Nijhuis and L. Spaanenburg (1999), IEEE, pp. 2302-2305.

Freeman, J. A. and Skapura, D. M. (1991), “Neural networks: Algorithms, Applications and Programming Techniques”, Addison Wesley.

ISE Core generator, www.xilinx.com

J. Zhu, G. J. Milne and B. K. Gunther (1999), “Towards an FPGA Reconfigurable Computing Environment for Neural Networks Implementations”, Artificial Neural Networks, Conference publication No 470, IEE, Volume 2, pp. 661-666.

Keating, Michael and Bricaud, Pierre (2002), “Reuse methodology manual”, Kluwer Academic Publishers.

McCulloch, W. S. and Pitts, W. (1943), “A Logical Calculus of Ideas Immanent in Nervous Activity”, Bulletin of Mathematical Biophysics, (5), 115-133.

ModelSim user guide, www.model.com

Mentor Graphics user guide, www.mentor.com

N. Izeboudjen, A. Farah, H. Bessalah, A. Bouridene and N. Chikhi (2007), “Towards a Platform for FPGA Implementation of the MLP Based Back Propagation Algorithm”, IWANN, LNCS, pp. 497-505.

OpenCores, www.opencores.org

PlanAhead user guide, www.xilinx.com

Richard P. Lippmann (1984), “An Introduction to computing with neural nets”, IEEE ASSP Magazine, pp. 4-22.

F. Schumann, S. Hofmann, J. Schemmel and K. Meier (2002), “Evolvable Hardware”, Proceedings NASA/DoD Conference on Volume, Issue, pp. 266-273.

Xilinx application note XAPP290 (2004), “Two Flows for Partial Reconfiguration: Module Based or Difference Based”, pp. 1-28, www.xilinx.com

KEY TERMS

ASIC: Acronym for Application Specific Integrated Circuit.

CLB: Acronym for Configurable Logic Block.

FPGA: Acronym for Field Programmable Gate Array.

High Level Synthesis: A top-down design methodology that transforms an abstract description, such as one written in the VHDL language, into a physical implementation.

Off-Chip Training: Training of the network is done using software tools like MATLAB, and only the feed forward phase (generalisation) is implemented on chip.

On-Chip Training: A term that designates the implementation of the three phases of the back propagation algorithm on one or several chips.

RTL: Acronym for Register Transfer Level.

Run Time Reconfiguration: A solution that makes it possible to use the smallest FPGA and to reconfigure it several times during processing. Run time reconfiguration can be partial or global.

VHDL: Acronym for Very high speed integrated circuits Hardware Description Language.


HOPS: A Hybrid Dual Camera Vision System

Stefano Cagnoni
Università degli Studi di Parma, Italy

Monica Mordonini
Università degli Studi di Parma, Italy

Luca Mussi
Università degli Studi di Perugia, Italy

Giovanni Adorni
Università degli Studi di Genova, Italy

INTRODUCTION

Biological vision processes are usually characterized by the following different phases:

• Awareness: natural or artificial agents operating in dynamic environments can benefit from a, possibly rough, global description of the surroundings. In humans this is referred to as peripheral vision, since it derives from stimuli coming from the edge of the retina.

• Attention: once an interesting object/event has been detected, higher resolution is required to focus on it and plan an appropriate reaction. In humans this corresponds to the so-called foveal vision, since it originates from the center of the retina (fovea).

• Analysis: extraction of detailed information about objects of interest, their three-dimensional structure and their spatial relationships completes the vision process. Achieving these goals requires at least two views of the surrounding scene with known geometrical relations. In humans, this function is performed by exploiting binocular (stereo) vision.

Computer Vision has often tried to emulate natural systems or, at least, to take inspiration from them. In fact, different levels of resolution are also useful in machine vision. In the last decade a number of studies dealing with multiple cameras at different resolutions have appeared in the literature. Furthermore, ever-growing computer performance and the ever-decreasing cost of video equipment make it possible to develop systems which rely mostly, or even exclusively, on vision for navigating and reacting to environmental changes in real time. Moreover, using vision as the only sensory input makes artificial perception closer to human perception, unlike systems relying on other kinds of sensors, and allows for the development of more direct biologically-inspired approaches to interaction with the external environment (Trullier 1997). This article presents HOPS (Hybrid Omnidirectional Pin-hole Sensor), a class of dual camera vision sensors that try to strengthen the connection between machine vision and biological vision.

BACKGROUND

In the last decade some investigations on hybrid dual camera systems have been performed (Nayar 1997; Cui 1998; Adorni 2001; Adorni 2002; Adorni 2003; Scotti 2005; Yao 2006). The joint use of a moving standard camera and of a catadioptric sensor provides these sensors with their different and complementary features: while the traditional camera can be used to acquire detailed information about a limited region of interest (“foveal vision”), the omnidirectional sensor provides wide-range, but less detailed, information about the surroundings (“peripheral vision”). Possible uses for this class of vision systems are video surveillance applications as well as mobile robot navigation tasks. Moreover, their particular configuration makes it possible to realize different strategies to control the orientation of the standard camera; for example, scattered focus on different objects makes it possible to perform recognition/classification tasks, while continuous movements allow any interesting moving object to be tracked. Three-dimensional reconstruction based on stereo vision is also possible.

HOPS: HYBRID OMNIDIRECTIONAL PIN-HOLE SENSOR

This article is focused on the latest prototype of the HOPS (Hybrid Omnidirectional-Pinhole Sensor) sensor (Adorni 2001; Adorni 2002; Adorni 2003; Cagnoni 2007). HOPS is a dual camera vision system that achieves a high-resolution 360-degree field of view as well as 3D reconstruction capabilities. The effectiveness of this hybrid sensor derives from the joint use of a traditional camera and a central catadioptric camera which both satisfy the single-viewpoint constraint. Having two different viewpoints from which the world is observed, the sensor can therefore act as a stereo pair, finding effective applications in surveillance and robot navigation. To create a handy, versatile system that could meet the requirements of the whole vision process in a wide variety of applications, HOPS has been designed to be considered as a single integrated object: one of the most direct advantages of this is that, once it is assembled and calibrated, it can be placed and moved anywhere (for example in the middle of a room ceiling or on a mobile robot) without any need for further calibration.

Figure 1 shows the latest two HOPS prototypes. In the one that has been used for the experiments reported here, the traditional camera which, in this version, cannot rotate, has been placed on top and can be pointed downwards with an appropriate fixed tilt angle to obtain a high-resolution view of a restricted region close to the sensor. In the middle, one can see the catadioptric camera, consisting of a traditional camera pointing upwards to a hyperbolic mirror hanging over it and held by a Plexiglas cylinder. As can be observed, the mirror can be moved up and down to permit optimal positioning (Swaminathan 2001; Strelow 2001) during calibration. Moreover, to avoid undesired light reflections on the internal surface of the Plexiglas cylinder, a black needle has been placed on the mirror apex as suggested in (Ishiguro 2001). Finally, in the lower part, some circuits generate video synchronization signals and allow for external connections.

The newer version of HOPS (see Figure 1, right) overcomes some limitations of the present one. It uses two digital high-resolution Firewire cameras, in conjunction with mega-pixel lenses characterized by a very low TV-distortion, to achieve better image quality. Furthermore, in this new version the traditional camera is hung from a stepper motor, controlled via a USB interface, and is therefore able to rotate. This time the traditional camera has been placed below the catadioptric part: this keeps wires out of the field of view of the omnidirectional image and also allows, in surveillance applications, the blind area of the omnidirectional view (due to the reflection of the camera on the mirror) to be seen.

Figure 1. The two latest versions of the HOPS sensor: the one used for experiments (left) and the newest version (right) which is currently being assembled and tested.

Sensor Calibration

In order to extract metric information from two-dimensional images, one must calibrate the camera and estimate the geometric parameters needed to describe image formation. After calibration, the relationships between points on images and their real position in 3D space can be expressed by mathematical equations which can be used to solve metric problems. Sensor calibration can be based on a standard Photogrammetric Calibration (Kraus 1993; Zhang 2000) using a heavily structured environment with grids of points of known coordinates. First, the two cameras are calibrated independently, before assembling them on the sensor, to estimate their intrinsic parameters as well as the radial distortion introduced by the optics. Then, the mirror is accurately positioned with respect to the camera in order to achieve single-viewpoint vision for the catadioptric part of the sensor, as described by (Benosman 2001). The last, but probably most important, phase of the calibration is aimed at detecting geometric relationships between the traditional image and the omnidirectional one: once again, a set of known points is used to estimate the parameters of the mapping. Notice that the relationships computed between the two views are constant in time because of the sensor structure. In this way, once the calibration procedure is over, no external fixed references are needed any longer, and one can place the sensor anywhere and perform stereo vision tasks without any further calibration.

Mirror to Camera Positioning

To position the hyperbolic mirror with respect to the standard camera and achieve the single-viewpoint characteristic for the catadioptric part of the sensor, one can operate as follows. Supposing that the single-viewpoint constraint is satisfied, and since the mirror profile is known, the camera calibration data and some simple equations can be used to calculate the expected projections of any known 3D point set onto the omnidirectional image. To verify the correctness of the relative mirror-to-camera positioning, a calibration box has been built with grids of known coordinates painted on its inner walls. Hence, after placing the sensor into it, the mirror can be manually moved until the grids appearing on the image taken in real time match the theoretical ones superimposed over it as they should appear if the mirror had been correctly placed (see Figure 2). This is a very cheap method which, however, yields very good results.

Figure 2. Mirror position calibration: the sensor inside the calibration box (left) and the acquired omnidirectional image (right) with the correct grid positions superimposed in white.

Joint Camera Calibration

To obtain a fully effective stereo system it is essential to perform a joint camera calibration to extract information about the relative positioning of the camera pair. Usually, the internal reference frame of one of the two cameras is chosen as the global reference for the pair: since two different kinds of cameras are available, the simplest choice is to set the omnidirectional camera’s frame as the global reference for the whole sensor. Using once again the above-mentioned grids of points, image pairs (omnidirectional and traditional) of grids lying on different (parallel) planes with known relative positions are acquired. Once the 3D coordinates of the points, referred to the sensor reference frame, have been estimated through the omnidirectional image, solving for the geometric constraints between the point projections in the traditional image makes it possible to estimate the relative position of the traditional camera. To take the standard camera rotation into consideration, its position has to be described by a more complex transformation than a simple fixed roto-translation: the geometric and kinematic coupling between the two cameras has to be understood and modeled with more parameters. Obviously, this requires that images be taken with the traditional camera in many different positions. After this joint camera calibration, HOPS can be used to perform metric measurements on the images, obtaining three-dimensional data referred to its own global reference frame: this means that no further calibration is needed to take sensor displacements into account.
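For the simpler case of a fixed traditional camera, the roto-translation mentioned above can be sketched as follows. This is a minimal illustration, assuming an ideal pin-hole model without lens distortion; the rotation matrix, translation vector, focal length and principal point are hypothetical inputs that a joint calibration would have to provide, and the function names are not taken from the HOPS software.

```python
def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(matrix[i][k] * vector[k] for k in range(3)) for i in range(3)]


def to_camera_frame(point, rotation, translation):
    """Express a point given in the sensor (omnidirectional) frame
    in the traditional camera frame: p_cam = R * p + t."""
    rotated = mat_vec(rotation, point)
    return [r + t for r, t in zip(rotated, translation)]


def project_pinhole(point_cam, focal_px, cx, cy):
    """Ideal pin-hole projection; returns None for points behind the camera."""
    x, y, z = point_cam
    if z <= 0:
        return None
    return (cx + focal_px * x / z, cy + focal_px * y / z)


# Example: a 90-degree rotation about the optical axis and no translation.
R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
t = [0, 0, 0]
p_cam = to_camera_frame([100, 0, 1000], R, t)
print(project_pinhole(p_cam, focal_px=500, cx=320, cy=240))
```

Calibration works in the opposite direction: R and t are the unknowns, estimated from many correspondences between known 3D points and their observed projections. For the rotating camera, R and t would additionally depend on the motor angle.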

Perspective Reprojections & Inverse Perspective Mapping

One of the opportunities offered by a perspective image is the possibility of applying an Inverse Perspective Mapping (IPM) transformation (Little 1991) to obtain a different image in which the information content is homogeneously distributed among all pixels. Since central catadioptric cameras are characterized by a single viewpoint, the images they acquire are perspective images suitable for IPM. Choosing a virtual image plane as the new domain for the IPM, a perspective reprojection similar to a traditional image can be obtained from part of an omnidirectional image. Figure 3 shows a pair of images acquired by HOPS and a perspective reconstruction of the omnidirectional view obtained by applying an IPM to the area seen by the traditional camera. As can be noticed, the difference in resolution between the two perspective views is considerable. Choosing a horizontal plane as reference for the IPM, it is possible to obtain something very similar to an orthographic view of that area, usually referred to as a “bird’s eye view”. If the floor is used as reference to perform IPM on both images, it is possible to extract useful information about objects/obstacles surrounding the system (Bertozzi 1998).
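The floor-referenced IPM can be illustrated for a single pin-hole view. The sketch below is not the HOPS implementation: it assumes an ideal pin-hole camera at a known height above a flat floor, tilted downwards by a known angle, with pixel coordinates given as offsets from the principal point (v pointing down). All names and parameters are hypothetical.

```python
import math


def pixel_to_floor(u, v, focal_px, height, tilt_rad):
    """Map a pixel to floor coordinates (X to the right, Y ahead of the camera).

    Returns None when the viewing ray points at or above the horizon
    and therefore never reaches the floor.
    """
    # Vertical component of the viewing ray, up to sign: must point downwards.
    denom = v * math.cos(tilt_rad) + focal_px * math.sin(tilt_rad)
    if denom <= 0.0:
        return None
    scale = height / denom  # ray parameter at which the floor is reached
    x_floor = scale * u
    y_floor = scale * (focal_px * math.cos(tilt_rad) - v * math.sin(tilt_rad))
    return x_floor, y_floor


# A camera 2 m above the floor, tilted 45 degrees down: the image center
# looks at a floor point about 2 m ahead.
print(pixel_to_floor(0, 0, focal_px=500, height=2.0, tilt_rad=math.radians(45)))
```

Applying this mapping to every pixel, and resampling on a regular floor grid, produces the bird's eye view; for the omnidirectional image the pin-hole ray would be replaced by the ray given by the mirror model.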

Figure 3. Omnidirectional image (above, left) and traditional image (above, right) acquired by HOPS. Below, a perspective reconstruction of part of the omnidirectional one is shown.

3D Reconstruction Tests

To verify the correctness of the calibration process, an estimation of the positions of points in three-dimensional space can be performed along with other tests. After capturing one image from each of the two views, the points in the test pattern are automatically detected and, for each one, the light rays from which it was generated are computed based on the projection model obtained during calibration. Since the estimated homologous rays are usually skew lines, the shortest segment joining the two rays can be found and its midpoint used as an estimate of the point’s 3D position.

In Table 1, results obtained using a 4x3 point test-pattern with 60 mm between point centers are reported. Even though the origin of the sensor reference system is physically inaccessible and no high-precision instruments were available, this pattern was placed as accurately as possible 390 mm perpendicularly ahead of the sensor itself (along the y direction in the chosen reference frame) and centered along the x direction: the z coordinates of the points in the top row were measured to be equal to 55 mm. This set-up is reflected by the estimated values for the first experiment reported in Table 1. More relevantly, the mean distance between points was estimated to be 59.45 mm with a standard deviation σ = 1.14: those values are fully compatible with the resolution available for measuring distances on the test pattern and with the mirror resolution (also limited by image resolution).

In a second experiment, a test-pattern with six points spaced by 110 mm, located about 1 m ahead, 0.25 m to the right and a bit below the sensor, has been used. In the lower part of Table 1 the estimated positions are shown: the estimated mean distance was 109.09 mm with a standard deviation σ = 8.89. In another test with the same pattern located 1.3 m ahead, 0.6 m to the left and 0.5 m below the sensor (see Figure 4), the estimated mean distance was about 102 mm with a standard deviation σ = 9.98. It should be noticed that, at those distances, taking into account image resolution as well as the mirror profile, the sensor resolution is of the same order of magnitude as the errors obtained. Furthermore, the method used to find the center of circles suffers from luminance and contrast variations: substituting circles with adjacent alternate black and white squares and using a corner detector capable of sub-pixel accuracy would probably yield better results.
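The midpoint estimate used in these tests, that is, the middle of the shortest segment joining two skew rays, can be sketched as follows. This is a generic closest-point computation between two 3D lines, not code from the HOPS system; each ray is assumed to be given by an origin and a direction vector, and the function name is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def midpoint_of_rays(p1, d1, p2, d2, eps=1e-12):
    """Return (midpoint, gap) for the lines p1 + s*d1 and p2 + t*d2.

    gap is the length of the shortest segment joining the two lines.
    Returns None when the lines are (nearly) parallel.
    """
    w0 = [a - b for a, b in zip(p1, p2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None
    s = (b * e - c * d) / denom   # parameter of the closest point on line 1
    t = (a * e - b * d) / denom   # parameter of the closest point on line 2
    q1 = [p + s * v for p, v in zip(p1, d1)]
    q2 = [p + t * v for p, v in zip(p2, d2)]
    mid = [(u + v) / 2.0 for u, v in zip(q1, q2)]
    gap = math.sqrt(sum((u - v) ** 2 for u, v in zip(q1, q2)))
    return mid, gap


# Two skew lines: the x axis, and a line parallel to the y axis at height z = 1.
print(midpoint_of_rays([0, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0]))
```

The returned gap is a convenient sanity check: a large gap between homologous rays suggests a wrong correspondence or a calibration problem. The sketch treats rays as infinite lines, so negative parameters are not rejected.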

Table 1. 3D estimation results for Experiment 1 and Experiment 2: the tables show the estimated positions obtained. The diagrams below them show the estimated distances between points on the test-pattern. All values are in mm.

Figure 4. Omnidirectional image (left) and traditional image (right) acquired for a 3D stereo estimation test

FUTURE TRENDS

A field which nowadays draws great interest is autonomous vehicle navigation. Even though there are still many problems to be solved before autonomous public vehicles become a reality, industrial applications are already possible. Since results in omnidirectional visual servoing and ego-motion estimation are also applicable to hybrid dual camera systems, and many more opportunities are offered by the presence of a second high-resolution view, the use of such devices in this field is desirable. Even though most applications of these systems are related to surveillance, they could be applied even more directly to robot-aided human activities, since the robots/vehicles involved in these situations are less critical and easier to control.

CONCLUSIONS

The Hybrid Omnidirectional Pin-hole Sensor (HOPS) dual camera system has been described. Since its joint camera calibration leads to a fully calibrated hybrid stereo pair from which 3D information can be extracted, HOPS suits several kinds of applications. For example, it can be used for surveillance and robot self-localization or obstacle detection, offering the possibility of integrating stereo sensing with peripheral/foveal active vision strategies: once objects or regions of interest are localized on the wide-range sensor, the traditional camera can be used to enhance the resolution with which these areas can be analyzed. Tracking of multiple objects/people relying on high-resolution images for recognition and access control, or estimating velocity, dimensions and trajectories, are some examples of surveillance tasks for which HOPS is suitable. Accurate obstacle detection, landmark localization, robust ego-motion estimation or three-dimensional environment reconstruction are other examples of possible applications related to (autonomous/holonomous) robot navigation in semi-structured or completely unstructured environments. Some preliminary experiments on both surveillance and robot navigation tasks have been performed, with encouraging results.

REFERENCES

Adorni, G., Bolognini, L., Cagnoni, S., & Mordonini, M. (2001). A non-traditional omnidirectional vision system with stereo capabilities for autonomous robots. In F. Esposito (Ed.), Lecture Notes In Computer Science. Springer-Verlag, Vol. 2175, 344–355.

Adorni, G., Cagnoni, S., Carletti, M., Mordonini, M. & Sgorbissa, A. (2002). Designing omnidirectional vision sensors. AI*IA Notizie 15(1), 27–30.

Adorni, G., Cagnoni, S., Mordonini, M. & Sgorbissa, A. (2003). Omnidirectional stereo systems for robot navigation. Proceedings of the IEEE Workshop on Omnidirectional Vision. Madison Wisconsin, 21 June 2003. IEEE Computer Society Press, 78-89.

Benosman, R. & Kang, S. (2001). Panoramic vision: Sensors, theory and applications. Springer-Verlag New York, Inc.

Bertozzi, M., Broggi, A. & Fascioli, A. (1998). Stereo inverse perspective mapping: Theory and applications. Image and Vision Computing Journal. Elsevier. Vol. 16, 585-590.

Cagnoni, S., Mordonini, M., Mussi, L. & Adorni, G. (2007). Hybrid stereo sensor with omnidirectional vision capabilities: Overview and calibration procedures. Proceedings of the 14th International Conference of Image Analysis and Processing. Modena, 11-13 September 2007. IEEE Computer Society Press, 99-104.

Cui, Y., Samarasekera, S., Huang, Q. & Greiffenhagen, M. (1998). Indoor monitoring via the collaboration between a peripheral sensor and a foveal sensor. VS: Proceedings of the 1998 IEEE Workshop on Visual Surveillance. Bombay, 2 January 1998. IEEE Computer Society Press, Vol. 00, 2-9.

Ishiguro, H. (2001). Development of low-cost compact omnidirectional vision sensors. In R. Benosman & S. Kang (Eds.), Panoramic vision: Sensors, theory and applications. Springer-Verlag New York, Inc, 23-28.

Kraus, K. (1993). Photogrammetry: Fundamentals and standard processes (4th ed., Vol. 1). Dümmler.

Little, J., Bohrer, S., Mallot, H. & Bülthoff, H. (1991). Inverse perspective mapping simplifies optical flow computation and obstacle detection. Biological Cybernetics. Springer-Verlag. Vol. 64, 177-185.

Nayar, S. & Boult, T. (1997). Omnidirectional vision systems: 1998 PI report. Proceedings of the 1997 DARPA Image Understanding Workshop. New Orleans, 11-14 May 1997. Storming Media, 93-99.

Scotti, G., Marcenaro, L., Coelho, C., Selvaggi, F. & Regazzoni, C. (2005). Dual camera intelligent sensor for high definition 360 degrees surveillance. IEE Proceedings on Vision Image and Signal Processing. IEE Press. Vol. 152, 250-257.

Swaminathan, R., Grossberg, M. D. & Nayar, S. K. (2001). Caustics of catadioptric cameras. Proceedings of the 8th International Conference on Computer Vision. Vancouver, 9-12 July 2001. IEEE Computer Society Press. Vol. 2, 2-9.

Trullier, O., Wiener, S., Berthoz, A. & Meyer, J. (1997). Biologically-based artificial navigation systems: Review and prospects. Progress in Neurobiology. Elsevier. Vol. 51, 483–544.

Yao, Y., Abidi, B. & Abidi, M. (2006). Fusion of omnidirectional and PTZ cameras for accurate cooperative tracking. In AVSS: Proceedings of the IEEE International Conference on Video and Signal Based Surveillance. Sydney, 22-24 November 2006. IEEE Computer Society Press, 46-51.

Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence. IEEE Computer Society Press. Vol. 22, 1330-1334.

KEY TERMS

Camera Calibration: A procedure used to obtain geometrical information about image formation in a specific camera, essential to relate metric distances on the image to distances in the real world. In any case, some a priori information is needed to reconstruct the third dimension from only one image.

Holonomous Robot: A robot with unconstrained freedom of movement and no preferential direction. This means that, from a standing position, it can move as easily in any direction.

Inverse Perspective Mapping (IPM): A procedure which allows the perspective effect to be removed from an image by homogeneously redistributing the information content of the image plane into a new two-dimensional domain.

Lens Distortion: Optical errors in camera lenses, usually due to mechanical misalignment of their parts, can cause straight lines in the observed scene to appear curved in the captured image. The deviation between the theoretical image and the actual one is mostly to be attributed to lens distortion.

Pin-Hole Camera: A camera that uses a tiny hole (the pin-hole) to convey all rays from the observed scene to the image plane. The smaller the pin-hole, the sharper the picture. Pin-hole cameras achieve a potentially infinite depth of field. Because of its geometric simplicity, the “pin-hole model” is used to describe most traditional cameras.

Single Viewpoint Constraint: When all incoming principal light rays of a lens intersect at a single point, an image with a non-distorted metric content is obtained. In this case all information contained in the image is seen from this viewpoint.

Visual Servoing: An approach to robot control based on visual perception: a vision system extracts information from the surrounding environment to localize the robot and consequently servo its position.


Hybrid Dual Camera Vision Systems

Stefano Cagnoni
Università degli Studi di Parma, Italy

Monica Mordonini
Università degli Studi di Parma, Italy

Luca Mussi
Università degli Studi di Perugia, Italy

Giovanni Adorni
Università degli Studi di Genova, Italy

INTRODUCTION Many of the known visual systems in nature are characterized by a wide field of view allowing animals to keep the whole surrounding environment under control. In this sense, dragonflies are one of the best examples: their compound eyes are made up of thousands of separate light-sensing organs arranged to give nearly a 360° field of vision. However, animals with eyes on the sides of their head have high periscopy but low binocularity, that is their views overlap very little. Differently, raptors’ eyes have a central part that permits them to see far away details with an impressive resolution and their views overlap by about ninety degrees. Those characteristics allow for a globally wide field of view and for accurate stereoscopic vision at the same time, which in turn allows for determination of distance, leading to the ability to develop a sharp, three-dimensional image of a large portion of their view. In mobile robotics applications, autonomous robots are required to react to visual stimuli that may come from any direction at any moment of their activity. In surveillance applications, the opportunity to obtain a field of view as wide as possible is also a critical requirement. For these reasons, a growing interest in omnidirectional vision systems (Benosman 2001), which is still a particularly intriguing research field, has emerged. On the other hand, requirements to be able to carry out object/pattern recognition and classification tasks are opposite, high resolution and accuracy and low distortion being possibly the most important ones. Finally, three-dimensional information extraction can be usually achieved by vision systems that combine the use of at least two sensors at the same time.

This article presents the class of hybrid dual camera vision systems. This kind of sensor, inspired by existing visual systems in nature, combines an omnidirectional sensor with a moving perspective camera. In this way, the whole surrounding scene can be observed at low resolution while the perspective camera is directed to focus on objects of interest at higher resolution.

BACKGROUND

There are essentially two ways to observe a very wide area: using many cameras pointed at non-overlapping areas or, conversely, using a single camera with a wide field of view. In the former case, the amount of data to be analyzed is much larger than in the latter, and calibration and synchronization problems for the camera network have to be faced. In the latter case, the system is cheaper and easier to calibrate, and the analysis of a single image is straightforward. The disadvantage, however, is a loss in the resolution at which object details are seen, since a wider field of view is projected onto the same video sensor area and is thus described with the same number of pixels as a normal field of view. This has been clear since the mid-1990s, when the earliest experiments with omnidirectional vision systems were made. Consequently, a number of studies arose on omnidirectional sensors “enriched” with at least one second source of environmental data, in order to achieve wide fields of view without loss of resolution. For example, some work oriented to robotics applications has dealt with a catadioptric camera working in conjunction with a laser scanner; (Kobilarov 2006; Mei 2006) are just two recent examples. More surveillance-oriented work has involved multi-camera systems that join omnidirectional and traditional cameras, while other work has dealt with geometric aspects of hybrid stereo/multi-view relations, as in (Sturm 2002; Chen 2003). The natural choice for developing a cheap vision system with both omni-sight and high-detail resolution is to couple an omnidirectional camera with a moving traditional camera. In the following, we focus on systems of this kind, which are usually called “hybrid dual camera systems”.
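As a rough illustration of the resolution trade-off discussed in this section, the following Python sketch compares the average number of pixels available per degree of field of view for a wide and a narrow view projected onto the same sensor. The 640-pixel sensor width and the 60-degree perspective field of view are hypothetical values chosen only for this example, and the model deliberately ignores the non-uniform resolution of real catadioptric images.

```python
def pixels_per_degree(pixels_across: int, fov_degrees: float) -> float:
    """Average angular resolution when `pixels_across` pixels span `fov_degrees`."""
    if fov_degrees <= 0:
        raise ValueError("field of view must be positive")
    return pixels_across / fov_degrees


# Hypothetical values: the same 640-pixel span used for both views.
SENSOR_WIDTH = 640
omni_resolution = pixels_per_degree(SENSOR_WIDTH, 360.0)        # wide view
perspective_resolution = pixels_per_degree(SENSOR_WIDTH, 60.0)  # narrow view

# How many times more pixels per degree the narrow view offers.
resolution_gain = perspective_resolution / omni_resolution
print(f"omni: {omni_resolution:.2f} px/deg, "
      f"perspective: {perspective_resolution:.2f} px/deg, "
      f"gain: {resolution_gain:.1f}x")
```

Under these assumptions the gain is simply the ratio of the two fields of view, which is why pairing the omnidirectional sensor with a narrow-angle camera recovers detail on a chosen region.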

Omnidirectional Vision

There are two ways to obtain omnidirectional images. With a special kind of lens mounted on a standard camera, called a “fisheye lens”, it is possible to obtain a field of view of up to about 180 degrees in both directions. The widest fisheye lens ever produced featured a 220-degree field of view. Unfortunately, it is very difficult to design a fisheye lens that satisfies the single viewpoint constraint. Although images acquired by fisheye lenses may prove to be good enough for some visualization applications, the distortion compensation issue has not been solved yet, and the high unit cost is a major drawback for widespread application.

Combining a rectilinear lens with a mirror is the other way to obtain omnidirectional views. In the so-called “catadioptric lenses”, a convex mirror is placed in front of a rectilinear lens, achieving a field of view possibly even larger than with a fisheye lens. Using suitably shaped mirrors precisely placed with respect to the camera, it is also possible to satisfy the single viewpoint constraint and thus to obtain an image which is perspectively correct. Moreover, catadioptric lenses are usually cheaper than fisheye ones. Figure 1 compares these two kinds of lenses.

OVERVIEW OF HYBRID DUAL CAMERA SYSTEMS

The first work concerning hybrid vision sensors is probably the one mentioned in (Nayar 1997), referred to as the “Omnidirectional Pan/Tilt/Zoom System”, in which the PTZ unit was guided by inputs obtained from the omnidirectional view. The following year, (Cui 1998) presented a distributed system for indoor monitoring: a peripheral camera was calibrated to estimate the distance between a target and the projection of the camera on the floor. In this way, they were able to precisely direct the foveal sensor, whose position was known, to the target and track it. A hybrid system for obstacle detection in robot navigation was described a few years later in (Adorni 2001). In this work, a catadioptric camera was calibrated along with a perspective one as a single sensor: its calibration procedure made it possible to compute an Inverse Perspective Mapping (IPM) (Little 1991) based on a reference plane, the floor, for both images and hence, thanks to the cameras’ disparity, to detect obstacles by computing the difference between the two images. While this was possible only within the common field of view of the two cameras, awareness or even tasks such as ego-motion estimation were potentially possible thanks to the omni-view. This system was further improved and mainly tested in RoboCup applications (see Endnote 1) (Adorni 2002; Adorni 2003; Cagnoni 2007). Figure 2 shows a pair of images acquired with such a system.

Figure 1. Comparison between image formation in fisheye lenses (left) and catadioptric lenses (right)

Figure 2. A pair of images acquired with the hybrid system described in (Cagnoni 2007). The omnidirectional image (left) and the perspective image (right). The different resolution of the two images is clearly visible.

Some recent work has concentrated on using dual camera systems for surveillance applications. In (Scotti 2005), when an alarm is detected on the omnidirectional sensor, the PTZ camera is triggered and the two views start to track the target autonomously. Acquired video sequences and other metadata, such as object classification information, are then used to update a distributed database to be queried later by users. Similarly, in (Yao 2006), after the PTZ camera is triggered by the omnidirectional one, the target is tracked independently in the two views, but a modified Kalman filter is then used to perform data fusion: this approach improves tracking accuracy and makes it possible to resolve occasional occlusions, leading to a robust surveillance system.
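To make the idea of guiding the PTZ unit from the omnidirectional view more concrete, the Python sketch below converts the pixel position of a target in the omnidirectional image into a pan angle. It is only an illustration and not the procedure of any of the cited systems: it assumes that the mirror axis projects onto a known image centre, that the PTZ camera is mounted on the same vertical axis, and that zero pan is aligned with the image x axis. The function name and the pixel values are hypothetical. Computing the tilt angle would additionally require the calibrated mirror profile and is left out.

```python
import math


def pan_angle_from_omni_pixel(x: float, y: float, cx: float, cy: float) -> float:
    """Azimuth in degrees, in [0, 360), of pixel (x, y) around the centre (cx, cy).

    Assumption: the azimuth of a scene point around the mirror axis equals
    the polar angle of its image around the image centre.
    """
    angle = math.degrees(math.atan2(y - cy, x - cx))
    return angle % 360.0


# Hypothetical 640x480 omnidirectional image centred at (320, 240).
target_pan = pan_angle_from_omni_pixel(320, 340, cx=320, cy=240)
print(f"pan the PTZ camera to {target_pan:.1f} degrees")
```

The angle is measured in image coordinates (y axis pointing down), so the sign convention of a real pan unit may need to be flipped.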

FUTURE TRENDS

Nowadays, the maintenance of public order, access control to private property and security video surveillance all require wide areas of our environment to be monitored. Surveillance is an ever-growing market and automatic surveillance is an interesting challenge: many projects are oriented in this direction, and in some of them hybrid dual camera systems already play an important role. The monitoring system installed between Eagle Pass, Texas, and Piedras Negras, Mexico, by engineers of the Computer Vision and Robotics Laboratory at the University of California, San Diego, affiliated with the California Institute for Telecommunications and Information Technology, is an example of a very complex surveillance system in which hybrid dual camera systems are involved (Hagen 2006). Because of their competitive cost, their compactness and the opportunities they offer, these systems are likely to be used more and more in intelligent surveillance systems in the future.

Another field of great interest is autonomous vehicle navigation. Even if many problems still have to be solved before autonomous public vehicles are seen, industrial applications are already possible. Since omnidirectional visual servoing and ego-motion estimation can also be implemented using hybrid dual camera systems, and many more opportunities are offered by the presence of a second high-resolution view, their future involvement in this field is desirable.

CONCLUSIONS

The class of hybrid dual camera systems has been described and briefly overviewed. The joint use of a standard camera and a catadioptric sensor provides this kind of sensor with different and complementary features: while the traditional camera can be used to acquire detailed information about a limited region of interest (“foveal vision”), the omnidirectional sensor provides wide-range but less detailed information about the surroundings (“peripheral vision”). Tracking multiple objects/people, relying on high-resolution images for recognition and access control, or estimating object/people velocity, dimensions and trajectory are some examples of automatic surveillance tasks for which hybrid dual camera systems are suitable. Furthermore, their use in (autonomous) robot navigation allows for accurate obstacle detection, ego-motion estimation and three-dimensional environment reconstruction. With one of these sensors on board, a mobile robot can be provided with all the information needed to navigate safely in a dynamic environment.

REFERENCES

Adorni, G., Bolognini, L., Cagnoni, S. & Mordonini, M. (2001). A non-traditional omnidirectional vision system with stereo capabilities for autonomous robots. In F. Esposito (Ed.), Springer-Verlag. Lecture Notes In Computer Science Vol. 2175, 344–355.

Adorni, G., Cagnoni, S., Carletti, M., Mordonini, M. & Sgorbissa, A. (2002). Designing omnidirectional vision sensors. AI*IA Notizie XV (1), 27–30.

Adorni, G., Cagnoni, S., Mordonini, M. & Sgorbissa, A. (2003). Omnidirectional stereo systems for robot navigation. In Proceedings of the IEEE Workshop on Omnidirectional Vision. Madison, Wisconsin, 21 June 2003. IEEE Computer Society Press, 79-89.

Benosman, R. & Kang, S. (2001). Panoramic vision: Sensors, theory and applications. Springer-Verlag.

Cagnoni, S., Mordonini, M., Mussi, L. & Adorni, G. (2007). Hybrid stereo sensor with omnidirectional vision capabilities: Overview and calibration procedures. In Proceedings of the 14th International Conference of Image Analysis and Processing. Modena, 11-13 September 2007. IEEE Computer Society Press, 99-104.

Chen, X., Yang, J. & Waibel, A. (2003). Calibration of a hybrid camera network. In Proceedings of the 9th IEEE International Conference on Computer Vision. Nice, 13-16 October 2003. IEEE Computer Society Press, 150-155.

Cui, Y., Samarasekera, S., Huang, Q. & Greiffenhagen, M. (1998). Indoor monitoring via the collaboration between a peripheral sensor and a foveal sensor. In VS: Proceedings of the 1998 IEEE Workshop on Visual Surveillance. Bombay, 2 January 1998. IEEE Computer Society Press. Vol.00, 2-9.

Hagen, D. & Ramsey, D. (2006). UCSD engineers deploy novel video surveillance system on Texas Bridge over Rio Grande. Retrieved June 6, 2007, from the California Institute for Telecommunications and Information Technology Web site: http://www.calit2.net/newsroom/release.php?id=873

Kobilarov, M., Hyams, J., Batavia, P. & Sukhatme, G. S. (2006). People tracking and following with mobile robot using an omnidirectional camera and a laser. In Proceedings of the IEEE International Conference on Robotics and Automation. Orlando, 15-19 May 2006. IEEE Computer Society Press, 557-562.

Little, J., Bohrer, S., Mallot, H. & Bülthoff, H. (1991). Inverse perspective mapping simplifies optical flow computation and obstacle detection. In Biological Cybernetics. Springer-Verlag. Vol.64, 177-185.

Mei, C. & Rives, P. (2006). Calibration between a central catadioptric camera and a laser range finder for robotic applications. In Proceedings of the IEEE International Conference on Robotics and Automation. Orlando, 15-19 May 2006. IEEE Computer Society Press, 532-537.

Nayar, S. & Boult, T. (1997). Omnidirectional vision systems: 1998 PI report. In Proceedings of the 1997 DARPA Image Understanding Workshop. New Orleans, 11-14 May 1997. Storming Media, 93-99.

Scotti, G., Marcenaro, L., Coelho, C., Selvaggi, F. & Regazzoni, C. (2005). Dual camera intelligent sensor for high definition 360 degrees surveillance. In IEE Proceedings on Vision Image and Signal Processing, IEE Press, Vol.152, 250-257.

Sturm, P. (2002). Mixing catadioptric and perspective cameras. In Proceedings of the Workshop on Omnidirectional Vision. Copenhagen, 12-14 June 2002. IEEE Computer Society Press, 37-44.

Yao, Y., Abidi, B. & Abidi, M. (2006). Fusion of omnidirectional and PTZ cameras for accurate cooperative tracking. In AVSS: Proceedings of the IEEE International Conference on Video and Signal Based Surveillance. Sydney, 22-24 November 2006. IEEE Computer Society Press, 46-51.

KEY TERMS

Camera Calibration: A procedure used to obtain geometrical information about image formation in a specific camera. After calibration, it is possible to relate metric distances on the image to distances in the real world. In any case, a single image is not enough to reconstruct the third dimension, and some a priori information is needed to do so.

Catadioptric Camera: A camera that uses catoptric (reflective) elements, i.e. mirrors, in conjunction with dioptric (refractive) lenses. Usually the purpose of these cameras is to achieve a wider field of view than the one obtained with classical lenses. Even if the field of view of a lens could be widened with any convex mirror surface, the mirrors of greatest interest are conic, spherical, parabolic and hyperbolic ones.

Central Catadioptric Camera: A camera that combines lenses and mirrors to capture a wide field of view through a central projection (i.e. a single viewpoint). The most common examples use paraboloidal or hyperboloidal mirrors. In the former case, a telecentric lens is needed to focus the parallel rays reflected by the mirror, and there are no constraints on the relative positioning of mirror and camera: the internal focus of the parabola acts as the unique viewpoint. In the latter case, it is possible to use a normal lens, but mirror-to-camera positioning is critical for achieving a single viewpoint: it is essential that the principal point of the lens coincides with the external focus of the hyperboloid, so that the internal focus is the unique viewpoint for the observed scene.

Omnidirectional Camera: A camera able to see in all directions. There are essentially two different methods to obtain a very wide field of view: the older one involves the use of a special type of lens, usually referred to as a fisheye lens, while the other one uses rectilinear lenses in conjunction with mirrors. Lenses obtained in the latter way are usually called catadioptric lenses, and the camera-lens ensemble is referred to as a catadioptric camera.

PTZ Camera: A camera able to pan left and right, tilt up and down, and zoom. It is usually possible to freely control its orientation and zoom level remotely through a computer or a dedicated control system.

Stereo Vision: A visual perception process that exploits two different views to achieve depth perception. The difference between the two images, usually referred to as binocular disparity, is interpreted by the brain (or by an artificial intelligent system) as depth.

Single Viewpoint Constraint: To obtain an image with a non-distorted metric content, it is essential that all incoming principal light rays of a lens intersect at a single point. In this case, a fixed viewpoint is obtained and all the information contained in an image is seen from this point.

ENDNOTE

1 Visit http://www.robocup.org for more information.


Hybrid Meta-Heuristics Based System for Dynamic Scheduling

Ana Maria Madureira
Polytechnic Institute of Porto, Portugal

INTRODUCTION

The complexity of current computer systems has led the software engineering, distributed systems and management communities to look for inspiration in diverse fields, e.g. robotics, artificial intelligence or biology, to find new ways of designing and managing systems. Hybridization and combination of different approaches seems to be a promising research field of computational intelligence, focusing on the development of the next generation of intelligent systems. A manufacturing system is naturally dynamic, as shown by several kinds of random occurrences and perturbations in working conditions and requirements over time. In this kind of environment, it is important to be able to adapt existing schedules efficiently and effectively, on a continuous basis, to such disturbances while maintaining performance levels. The application of Meta-Heuristics to this class of dynamic scheduling problems seems really promising. In this article, we propose a hybrid Meta-Heuristic based approach for complex scheduling with several manufacturing and assembly operations in dynamic Extended Job-Shop environments. Some self-adaptation mechanisms are proposed.

BACKGROUND

Scheduling Problem

The planning of Manufacturing Systems frequently involves solving a large number and variety of combinatorial optimisation problems that have an important impact on the performance of manufacturing organisations. Examples are sequencing and scheduling problems in manufacturing management, routing and transportation, layout design and timetabling problems.

Scheduling can be defined as the assignment of time-constrained jobs to time-constrained resources within a pre-defined time framework, which represents the complete time horizon of the schedule. An admissible schedule has to satisfy a set of constraints imposed on jobs and resources. A scheduling problem can therefore be seen as a decision-making process about when operations start and which resources are used. A variety of characteristics and constraints related to jobs and to the production system, such as operation processing times, release and due dates, precedence constraints and resource availability, can affect scheduling decisions (Leung, 2004) (Brucker, 2004) (Blazewicz, Ecker & Trystrams, 2005) (Pinedo, 2005).

Real world scheduling requirements are related to complex systems operated in dynamic environments. This means that they are frequently subject to several kinds of random occurrences and perturbations, such as new job arrivals, machine breakdowns, employee sickness, job cancellations, and changes in due dates and processing times, which easily make prepared schedules outdated and unsuitable. Scheduling in such an environment is known as dynamic scheduling. Dynamic scheduling problems may be classified as deterministic, when release times and all other parameters are known and fixed, or as non-deterministic, when some or all system and job parameters are uncertain, for example when jobs arrive randomly at the system over time.

Traditional heuristic scheduling methods encounter great difficulties when they are applied to some real-world situations, for three main reasons. Firstly, traditional scheduling methods use simplified and deterministic theoretical models, in which all problem data are known before scheduling starts. However, many real world optimization problems are dynamic and non-deterministic, and changes may occur in them continually. In practice, static scheduling is not able to react dynamically and rapidly to dynamic information that was not foreseen in the current schedule.



Secondly, most of the approximation methods proposed for the Job-Shop Scheduling Problem (JSSP) are problem-oriented methods, i.e. developed specifically for the problem under consideration. Some examples of this class of methods are priority rules and the Shifting Bottleneck (Pinedo, 2005). Finally, traditional scheduling methods are essentially centralized, in the sense that all the computations are carried out in a central computing and logic unit. All the information concerning every job and every resource has to go through this unit. This centralized approach is especially susceptible to problems of tractability, because the number of interacting entities that must be managed together is large and leads to a combinatorial explosion, particularly since a detailed schedule is generated over a long time horizon, and planning and execution are carried out in discrete buckets of time. Centralized scheduling is therefore large, complex, and difficult to maintain and reconfigure. On the other hand, many industrial and service processes are inherently distributed. Consequently, traditional methods are often too inflexible, costly, and slow to satisfy the needs of real-world scheduling systems. Classical optimisation methods, which exploit problem-specific characteristics, either are not sufficient for the efficient resolution of those problems or are developed for specific situations (Leung, 2004) (Brucker, 2004) (Logie, Sabaz & Gruver, 2004) (Blazewicz, Ecker & Trystrams, 2005) (Pinedo, 2005).

Meta-Heuristics

As a major departure from classical techniques, a Meta-Heuristic (MH) method implies a higher-level strategy controlling lower-level heuristic methods. Meta-heuristics exploit not only the problem characteristics but also ideas based on artificial intelligence rationale, such as different types of memory structures and learning mechanisms, as well as analogies with optimization methods found in nature. The interest of Meta-Heuristic approaches is that they converge, in general, to satisfactory solutions in an effective and efficient way (in terms of computing time and implementation effort). The MH family includes, but is not limited to, Tabu Search, Simulated Annealing, Soft Computing, Evolutionary Algorithms, Adaptive Memory procedures, Scatter Search, Ant Colony Optimization, Swarm Intelligence, and their hybrids.

For literature on this subject, see for example (Glover & Gary, 2003) and (Gonzalez, 2007). In the last decades, there has been a significant level of research interest in Meta-Heuristic approaches for solving large real world scheduling problems, which are often complex, constrained and dynamic. Scheduling algorithms that achieve good or near-optimal solutions and can efficiently adapt them to perturbations are, in most cases, preferable to those that achieve optimal solutions but cannot perform such an adaptation. The latter is the case with most algorithms for solving the so-called static scheduling problem for different settings of both single and multi-machine system arrangements. This reality motivated us to concentrate on tools that can deal with such dynamic, disturbed scheduling problems, even though, due to the complexity of these problems, optimal solutions may not be possible to find. Several attempts have been made to modify algorithms and tune them for optimization in a changing environment. It was observed in all these studies that a dynamic environment requires an algorithm to maintain sufficient diversity for continuous adaptation to changes of the landscape. Although interest in optimization algorithms for dynamic optimization problems is growing and a number of authors have proposed an even greater number of new approaches, the field lacks a general understanding of suitable benchmark problems, fair comparisons and measurement of algorithm quality (Branke, 1999) (Cowling & Johanson, 2002) (Madureira, 2003) (Madureira, Ramos & Silva, 2004) (Aytug, Lawley, McKay, Mohan & Uzsoy, 2005). In spite of all these previous efforts, the scheduling problem is still known to be NP-complete. This fact incites researchers to explore new directions.

Hybrid Intelligent Systems

Hybridization of intelligent systems is a promising research field of computational intelligence, focusing on combinations of multiple approaches to develop the next generation of intelligent systems. An important stimulus to research on Hybrid Intelligent Systems is the awareness that combined approaches will be necessary if the remaining tough problems in artificial intelligence are to be solved. Meta-Heuristics, Bio-Inspired Techniques, Neural Computing, Machine Learning, Fuzzy Logic Systems, Evolutionary Algorithms and Agent-based Methods, among others, have become established and have shown their strengths and drawbacks. Recently, hybrid intelligent systems have been gaining popularity due to their capabilities in handling several real world complexities involving imprecision, uncertainty and vagueness (Boeres, Lima, Vinod & Rebello, 2003) (Madureira, Ramos & Silva, 2004) (Bartz-Beielstein, Blesa, Blum, Naujoks, Roli, Rudolph & Sampels, 2007).

HYBRID META-HEURISTICS BASED SCHEDULING SYSTEM

The purpose of this article is to describe a framework based on the combination of Meta-Heuristics, namely Tabu Search (TS) and Genetic Algorithms (GA), with constructive optimization methods for solving a class of real world scheduling problems in which the products (jobs) to be processed have due dates, release times and different assembly levels. This means that parts to be assembled may be manufactured in parallel, i.e. simultaneously. The problem addressed in this work, which we call the Extended Job-Shop Scheduling Problem (EJSSP), has major extensions and differences in relation to the classic Job-Shop Scheduling Problem. In this work, we define a job as a manufacturing order for a final item, which can be Simple or Complex. A Simple Final Item, like a part, requires a set of operations to be processed. Complex Final Items, which require several operations to be processed on a number of parts, followed by assembly operations at several stages, are also dealt with. Moreover, in practice, the scheduling environment tends to be dynamic, i.e. new jobs arrive at unpredictable intervals, machines break down, jobs can be cancelled, and due dates and processing times can change frequently (Madureira, 2003) (Madureira, Ramos & Silva, 2004).

The work starts by focusing on the solution of dynamic deterministic EJSSP problems. To solve these, we developed a framework leading to a dynamic scheduling system whose fundamental scheduling tool is a hybrid scheduling system with two main pieces of intelligence (Figure 1). One piece is a method based on a combination of TS and GA, together with a mechanism for inter-machine activity coordination. The objective of this mechanism is to coordinate the operation of machines, taking into account the technological constraints of jobs, i.e. the precedence relationships among job operations, in order to obtain good schedules. The other piece is a dynamic adaptation module that includes mechanisms for neighbourhood/population regeneration under dynamic environments, increasing or decreasing the neighbourhood/population according to new job arrivals or cancellations.

Figure 1. Hybrid meta-heuristics based scheduling system (diagram labels: Jobs, Random Events, Scheduling Module, MH Parameterization, Pre-Processing, Scheduling Method, Dynamic Adaptation, Coordination Mechanism, User Interface, Scheduling Plan)



A detailed description of the approach, methods and of its application to concrete problems can be found in Madureira (2003).

Pre-Processing Module

The pre-processing module processes the input information, namely the problem definition and the instantiation of algorithm components and parameters, such as the initial solution and neighbourhood generation mechanisms, the size of the neighbourhood/population, the tabu list attributes and the tabu list length.

Hybrid Scheduling Module

Initially, we decompose the deterministic EJSSP problem into a series of deterministic Single Machine Scheduling Problems (SMSP). We assume the existence of different and known job release times r_j, prior to which no processing of the job can be done, and of job due dates d_j. Based on these, release dates and due dates are determined for each SMSP and, subsequently, each such problem is solved independently by a TS or a GA (considering a self-parameterization issue). Afterwards, the solutions obtained for each SMSP are integrated to obtain a solution to the main EJSSP problem instance. The integration of the SMSP solutions may give an infeasible schedule for the EJSSP, which is why schedule repairing may be necessary to obtain a feasible solution. The repairing is carried out by the Inter-Machine Activity Coordination Mechanism (IMACM). It is based on the coordination of machine activity, taking into account job operation precedence and other problem constraints, and keeps the job allocation order on each machine unchanged. The IMACM establishes the starting and completion times of each operation. It ensures that the starting time of each operation is the higher of the following two values:

• the completion time of the immediately preceding operation in the job, if there is only one, or the highest of all such completion times if there are several;
• the completion time of the immediately preceding operation on the machine.
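The starting-time rule stated above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation described in Madureira (2003): the function name, the dictionary-based data layout and the example instance are hypothetical, and the optional use of a release time as a lower bound for operations is an added assumption.

```python
from typing import Dict, List, Optional, Tuple


def coordinate_machines(
    machine_seq: Dict[str, List[str]],   # fixed operation order on each machine
    job_preds: Dict[str, List[str]],     # operations that must precede each operation in its job
    proc_time: Dict[str, float],         # processing time of each operation
    release: Optional[Dict[str, float]] = None,  # assumed earliest start per operation
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Assign start/completion times while keeping every machine order unchanged."""
    release = release or {}
    # Immediate predecessor of each operation on its machine (None for the first one).
    machine_pred: Dict[str, Optional[str]] = {}
    for seq in machine_seq.values():
        for prev, op in zip([None] + seq[:-1], seq):
            machine_pred[op] = prev

    start: Dict[str, float] = {}
    finish: Dict[str, float] = {}
    pending = set(machine_pred)
    while pending:
        progressed = False
        for op in sorted(pending):
            preds = list(job_preds.get(op, []))
            if machine_pred[op] is not None:
                preds.append(machine_pred[op])
            if all(p in finish for p in preds):
                # Start at the latest completion among job and machine predecessors.
                ready = max((finish[p] for p in preds), default=0)
                start[op] = max(ready, release.get(op, 0))
                finish[op] = start[op] + proc_time[op]
                pending.discard(op)
                progressed = True
        if not progressed:
            raise ValueError("machine orders conflict with job precedence")
    return start, finish


# Hypothetical instance: parts A and B are machined, then assembled by c1.
machine_seq = {"M1": ["a1", "b1"], "M2": ["b2", "a2"], "M3": ["c1"]}
job_preds = {"a2": ["a1"], "b2": ["b1"], "c1": ["a2", "b2"]}
proc_time = {"a1": 3, "b1": 2, "b2": 4, "a2": 1, "c1": 2}
start, finish = coordinate_machines(machine_seq, job_preds, proc_time)
print(start, finish)
```

The error raised when no operation can be scheduled corresponds to machine sequences that contradict the job precedence constraints; how such a case is repaired is outside the scope of this sketch.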

Dynamic Adaptation Module

For non-deterministic problems, some or all parameters are uncertain, i.e. they are not fixed as we assumed in the deterministic problem. Non-determinism of variables has to be taken into account in real world problems. To generate acceptable solutions in such circumstances, our approach starts by generating a predictive schedule using the available information; then, if perturbations occur in the system during execution, the schedule may have to be modified or revised accordingly, i.e. rescheduling/dynamic adaptation is performed. In this process, an important decision must therefore be taken: if and when rescheduling should happen. The decision strategies for rescheduling may be grouped into three categories: continuous, periodic and hybrid rescheduling. In continuous rescheduling, rescheduling is done whenever an event that modifies the state of the system occurs. In periodic rescheduling, the current schedule is modified at regular time intervals, taking into account the schedule perturbations that have occurred. Finally, in hybrid rescheduling, the current schedule is modified at regular time intervals if some perturbation has occurred. In the scheduling system for EJSSP, dynamic adaptation is necessary due to two classes of events:

• Partial events, which imply variability in job or operation attributes such as processing times, due dates and release times.
• Total events, which imply variability in the neighbourhood structure, resulting from either new job arrivals or job cancellations.
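The three rescheduling strategies described above (continuous, periodic and hybrid) differ only in the condition that triggers rescheduling. The following Python sketch expresses that condition as a single function; the function name, its parameters and the time units are hypothetical and serve only to illustrate the distinction.

```python
def should_reschedule(
    strategy: str,
    now: float,
    last_reschedule: float,
    period: float,
    pending_events: int,
) -> bool:
    """Decide whether to reschedule under one of the three strategies."""
    if strategy == "continuous":
        # React to every event that changes the state of the system.
        return pending_events > 0
    period_elapsed = (now - last_reschedule) >= period
    if strategy == "periodic":
        # Revise the schedule at regular intervals.
        return period_elapsed
    if strategy == "hybrid":
        # Revise at regular intervals, but only if a perturbation occurred.
        return period_elapsed and pending_events > 0
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical check 10 time units after the last rescheduling, with no new events.
print(should_reschedule("periodic", now=10, last_reschedule=0, period=10, pending_events=0))
print(should_reschedule("hybrid", now=10, last_reschedule=0, period=10, pending_events=0))
```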

While partial events only require redefining job attributes and re-evaluating the objective function of solutions, total events require a change in solution structure and size, carried out by inserting or deleting operations, as well as re-evaluation of the objective function. Therefore, under a total event, the modification of the current solution is imperative. In this work, this is carried out by the mechanisms described in (Madureira, Ramos & Silva, 2004) for SMSP.

Considering the processing times involved and the high frequency of perturbations, rescheduling all jobs from the beginning should be avoided. However, if work has not yet started and time is available, then an obvious and simple approach to rescheduling is to restart the scheduling from scratch with a new modified solution that takes the perturbation into account, for example a new job arrival. When there is not enough time to reschedule from scratch or job processing has already started, a strategy must be used that adapts the current schedule according to the kind of perturbation that occurred.

The occurrence of a partial event requires redefinition of job attributes and re-evaluation of the schedule objective function. A change in a job due date requires re-calculation of the starting and completion due times of all the operations of that job. However, a change in an operation processing time only requires re-calculation of the starting and completion due times of the succeeding operations. A new job arrival requires the definition of the corresponding operation starting and completion times and a regenerating mechanism to integrate all operations into the respective single machine problems. In the presence of a job cancellation, the application of a regenerating mechanism eliminates the job operations from the SMSPs in which they appear. After the insertion or deletion of positions, neighbourhood regeneration is done by updating the size of the neighbourhood and ensuring a structure identical to the existing one. Then the scheduling module can apply the search process to look for better solutions starting from the new modified solution.
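The handling of a job cancellation described above, in which the operations of the cancelled job are eliminated from every single machine problem where they appear, can be sketched as follows. The data layout (one ordered list of operation identifiers per machine, plus a mapping from operation to job) and all names are assumptions made for this example.

```python
from typing import Dict, List


def eliminate_job(
    machine_seq: Dict[str, List[str]],  # current operation order on each machine
    op_job: Dict[str, str],             # job to which each operation belongs
    cancelled_job: str,
) -> Dict[str, List[str]]:
    """Return new machine sequences without the operations of the cancelled job.

    The relative order of the remaining operations is preserved.
    """
    return {
        machine: [op for op in seq if op_job[op] != cancelled_job]
        for machine, seq in machine_seq.items()
    }


# Hypothetical example: job B is cancelled.
sequences = {"M1": ["a1", "b1"], "M2": ["b2", "a2"]}
owner = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
print(eliminate_job(sequences, owner, "B"))
```

Because the function builds new lists, the previous solution stays available, which may be convenient if the cancellation has to be undone.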

Job Arrival Integration Mechanism

When a new job arrives to be processed, an integration mechanism is needed. This mechanism analyses the job precedence graph, which represents the ordered allocation of machines to each job operation, and integrates each operation into the respective single machine problem. Two alternative procedures can be used for each operation: either randomly select one position at which to insert the new operation into the current solution/chromosome, or use some intelligent mechanism to insert the operation into the schedule, based on job priority, for example.
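The two insertion alternatives can be sketched as follows. This is only an illustration under stated assumptions: a chromosome is taken to be a list of (job, priority) pairs for one machine, a higher number means a higher priority, and both function names are hypothetical.

```python
import random
from typing import List, Tuple

Gene = Tuple[str, int]  # (job id, priority); higher priority should come earlier


def insert_random(chromosome: List[Gene], gene: Gene, rng: random.Random) -> List[Gene]:
    """Insert the new gene at a uniformly chosen position (appending is allowed)."""
    position = rng.randint(0, len(chromosome))  # randint is inclusive at both ends
    return chromosome[:position] + [gene] + chromosome[position:]


def insert_by_priority(chromosome: List[Gene], gene: Gene) -> List[Gene]:
    """Insert the new gene before the first gene with strictly lower priority."""
    for position, (_, priority) in enumerate(chromosome):
        if priority < gene[1]:
            return chromosome[:position] + [gene] + chromosome[position:]
    return chromosome + [gene]


current = [("J1", 5), ("J2", 3), ("J3", 1)]
by_priority = insert_by_priority(current, ("J4", 4))
at_random = insert_random(current, ("J4", 4), random.Random(0))
print(by_priority)
print(at_random)
```

Both functions return a new list, so the current solution stays available if the modified one has to be discarded.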

Job Elimination Mechanism

When a job is cancelled, an eliminating mechanism must be applied so that the corresponding positions/genes are deleted from the solutions.

Regeneration Mechanisms

After the integration/elimination of operations has been carried out by inserting/deleting positions/genes in the current solution/chromosome, population regeneration is done by updating its size. The population size for the SMSP is proportional to the number of operations. After the dynamic adaptation process, the scheduling method can be applied to search for better solutions, starting from the modified solution. In this way we propose a hybrid system in which some self-organization aspects can be considered in accordance with the problem being solved: the method and/or parameters can change at run-time, the MH used can change according to the problem characteristics, etc.

FUTURE TRENDS

Considering the complexity inherent to manufacturing systems, dynamic scheduling is considered an excellent candidate for the application of agent-based technology. A natural evolution of the approach proposed above is a Multi-agent Scheduling System that assumes the existence of several Machine Agents (which are decision-making entities) distributed inside the Manufacturing System that interact and cooperate with other agents in order to obtain optimal or near-optimal global performance. The main idea is that, from the local, autonomous and often conflicting behaviours of the agents, a global solution emerges from a community of machine agents solving their schedules locally and cooperating with other machine agents (Madureira, Gomes & Santos, 2006). Agents must be able to learn and manage their internal behaviours and their relationships with other agents, by cooperative negotiation in accordance with the business policies defined by the user manager. Some self-organization aspects could be considered in accordance with the problem being solved: the method and/or parameters can change at run-time, the agents can use different MH according to the problem characteristics, etc.



CONCLUSION

This article proposes a system architecture that combines the advantages of two different Meta-Heuristics: Tabu Search and Genetic Algorithms. We believe that a new contribution to the resolution of more realistic scheduling problems, the Extended Job-Shop Problems, has been described. The particularity of our approach is the procedure used to schedule operations: each machine first finds local optimal or near-optimal solutions, and then interacts with other machines through cooperation mechanisms as a way to find an optimal global schedule in dynamic environments. The proposed system is prepared to use other Local Search Meta-Heuristics and to drive schedules based on practically any performance measure, and it is not restricted to a specific type of scheduling problem.

REFERENCES

Aytug, Haldun, Lawley, Mark A., McKay, Kenneth, Mohan, Shantha & Uzsoy, Reha (2005). Executing production schedules in the face of uncertainties: A review and some future directions. European Journal of Operational Research, Volume 16 (1), 86-110.

Bartz-Beielstein, Thomas, Blesa, M.J., Blum, C., Naujoks, B., Roli, A., Rudolph, G. & Sampels, M. (2007). Hybrid Metaheuristics. Proceedings of 4th International Workshop H. Dortmund, Germany, Lecture Notes in Computer Science, Vol. 4771, ISBN: 978-3-540-75513-5.

Blazewicz, Jacek, Ecker, Klaus H. & Trystram, Denis (2005). Recent advances in scheduling in computer and manufacturing systems. European Journal of Operational Research, 164(3), 573-574.

Boeres, Cristina, Lima, Alexandre, Vinod, E. & Rebello, F. (2003). Hybrid Task Scheduling: Integrating Static and Dynamic Heuristics. 15th Symposium on Computer Architecture and High Performance Computing, 199.

Branke, J. (1999). Evolutionary Approaches to Dynamic Optimization Problems – A Survey. GECCO Workshop on Evolutionary Algorithms for Dynamic Optimization Problems, 34-137.

Brucker, Peter (2004). Scheduling Algorithms. Springer, 4th edition.

Cowling, P. & Johansson, M. (2002). Real time information for effective dynamic scheduling. European Journal of Operational Research, 139(2), 230-244.

Glover, Fred & Kochenberger, Gary A. (2003). Handbook of Metaheuristics. International Series in Operations Research & Management Science, Springer, Vol. 57, ISBN: 978-1-4020-7263-5.

Gonzalez, Teofilo F. (2007). Handbook of Approximation Algorithms and Metaheuristics. Chapman & Hall/CRC Computer and Information Science Series.

Leung, Joseph (2004). Handbook of Scheduling. Chapman & Hall/CRC, Boca Raton, FL.

Logie, S., Sabaz, D. & Gruver, W.A. (2004). Sliding Window Distributed Combinatorial Scheduling using JADE. Proc. of IEEE International Conference on Systems, Man and Cybernetics, Netherlands, 1984-1989.

Madureira, Ana (2003). Meta-Heuristics Application to Scheduling in Dynamic Environments of Discrete Manufacturing. PhD Dissertation, University of Minho, Braga, Portugal (in Portuguese).

Madureira, Ana, Gomes, Nuno & Santos, Joaquim (2006). Cooperative Negotiation Mechanism for Agent Based Distributed Manufacturing Scheduling. WSEAS Transactions on Systems, Issue 12, Volume 5, ISSN: 1109-2777, 2899-2904.

Madureira, Ana, Ramos, Carlos & Silva, Sílvio (2004). Toward Dynamic Scheduling Through Evolutionary Computing. WSEAS Transactions on Systems, Issue 4, Volume 3, 1596-1604.

Pinedo, M. (2005). Planning and Scheduling in Manufacturing and Services. Springer-Verlag, New York, ISBN: 0-387-22198-0.

KEY TERMS

Cooperation: The practice of individuals or entities working together with common goals, instead of working separately in competition, in which the success of one is dependent and contingent upon the success of the other.

Dynamic Scheduling Systems: Systems that are frequently subject to several kinds of random occurrences and perturbations, such as new job arrivals, machine breakdowns, employee sickness, job cancellations, and changes in due dates and processing times, which cause prepared schedules to become easily outdated and unsuitable.

Evolutionary Computation: A subfield of artificial intelligence that involves techniques implementing mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, natural selection and survival of the fittest.

Genetic Algorithms: A particular class of evolutionary algorithms that use techniques inspired by evolutionary biology, such as inheritance, mutation, selection, and crossover.

Hybrid Intelligent Systems: Software systems that employ a combination of Artificial Intelligence models, methods and techniques, such as Evolutionary Computation, Meta-Heuristics, Multi-Agent Systems, Expert Systems and others.

Meta-Heuristics: A class of powerful and practical solution techniques for tackling complex, large-scale combinatorial problems, efficiently producing high-quality solutions.

Multi-Agent Systems: Systems composed of several agents, collectively capable of solving complex problems in a distributed fashion without the need for each agent to know about the whole problem being solved.

Scheduling: A decision-making process concerning the starting times of operations and the resources to be used. A variety of characteristics and constraints related to jobs and machine environments (Single Machine, Parallel Machines, Flow-Shop and Job-Shop) can affect scheduling decisions.

Tabu Search: An approximation method, belonging to the class of local search techniques, that enhances the performance of a local search method by using memory structures (Tabu List).


A Hybrid System for Automatic Infant Cry Recognition I

Carlos Alberto Reyes-García, Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico
Ramon Zatarain, Instituto Tecnologico de Culiacan, Mexico
Lucia Barron, Instituto Tecnologico de Culiacan, Mexico
Orion Fausto Reyes-Galaviz, Universidad Autónoma de Tlaxcala, Mexico

INTRODUCTION

Crying in babies is a primary communication function, governed directly by the brain; any alteration in the normal functioning of the baby's body is reflected in the cry (Wasz-Höckert, et al, 1968). Based on the information contained in the cry's wave, the infant's physical state can be determined, and even pathologies in very early stages of life can be detected (Wasz-Höckert, et al, 1970). To perform this detection, a Fuzzy Relational Neural Network (FRNN) is applied. The input features are represented by fuzzy membership functions, and the links between nodes are represented by fuzzy relations instead of weights (Reyes, 1994). This paper, the first of a two-part document, describes the architecture of the Infant Cry Recognition System as well as the FRNN model. Implementation and testing are reported in the complementary paper.

BACKGROUND

Pioneering work on infant cry was initiated by Wasz-Höckert in the early 1960s. In one of those works his research group showed that four basic types of cry can be identified by listening: pain, hunger, pleasure and birth. Further studies led to the development of conceptual models that describe the anatomical and physiological basis of the production and neurological control of crying (Bosma, Truby & Antolop, 1965). Later on, Wasz-Höckert (1970) applied spectral analysis to identify several types of crying. Other works showed that there are significant differences among the several types of crying, such as a healthy infant's cry, a pain cry and a pathological infant's cry. In one study, Petroni used neural networks (Petroni, Malowany, Johnston, and Stevens, 1995) to differentiate between pain and no-pain crying. Cano directed several works devoted to the extraction and automatic classification of acoustic characteristics of infant cry. In one of those studies, in 1999, Cano demonstrated the utility of Kohonen's Self-Organizing Maps in the classification of Infant Cry Units (Cano-Ortiz, Escobedo-Becerro, 1999) (Cano, Escobedo and Coello, 1999). More recently, in (Orozco & Reyes, 2003) we reported the classification of cry samples from deaf and normal babies with feed-forward neural networks. In 2004 Cano and his group (Cano, Escobedo, Ekkel, 2004) reported a radial basis network (RBN) used to find relevant aspects concerned with the presence of Central Nervous System (CNS) diseases. In (Suaste, Reyes, Diaz, and Reyes, 2004) we showed the implementation of a Fuzzy Relational Neural Network (FRNN) for detecting pathologies by infant cry recognition.

The study of connectionist models, also known as Artificial Neural Networks (ANN), has enjoyed a resurgence of interest after its demise in the 1960s. Research has focused on evaluating new neural networks for pattern classification, on training algorithms using real speech data, and on determining whether parallel neural network architectures can be designed to perform efficiently the work required by complex



speech recognition algorithms (Lippmann, 1990). In the connectionist approach, pattern classification is done with a multi-layer neural network. A weight is assigned to every link between neurons in contiguous layers. In the input layer each neuron receives one of the features present in the input pattern vectors. Each neuron in the output layer corresponds to one speech unit class (word or sub-word). The neural network associates input patterns with output classes by modeling the relationship between the two pattern sets. This relationship is estimated, or learned, by the network from a representative sample of input and output patterns (Morgan, and Scofield, 1991) (Pedrycz, 1991). In order to stabilize the perceptron's behavior, many researchers have tried to incorporate fuzzy set theory into neural networks. The theory of fuzzy sets, developed by Zadeh (Zadeh, 1965), has since been used to generalize existing techniques and to develop new algorithms in pattern recognition. Pal (Pal, 1992a) suggested that, to enable systems to handle real-life situations, fuzzy sets should be incorporated into neural networks, and that the increase in the amount of computation this requires is offset by the potential for parallel computation with high flexibility that fuzzy neural networks have. Pal proposes how to carry out data fuzzification, the general system architecture of a fuzzy neural network, and the use of 3n-dimensional vectors to represent the fuzzy membership values of the input features to the primary linguistic properties low, medium, and high (Pal, 1992a) (Pal, and Mandal, 1992b). On the other hand, the idea of using a relational neural network as a pattern classifier was developed by Pedrycz and presented in (Pedrycz, 1991). By combining the methodologies proposed by Pal and Pedrycz, C. A. Reyes (1994) developed the hybrid model known as the fuzzy relational neural network (FRNN).

THE AUTOMATIC INFANT CRY RECOGNITION PROCESS

The automatic classification of infant cry is, in general, a pattern recognition problem, similar to Automatic Speech Recognition (ASR) (Huang, Acero, Hon, 2001). The goal is to take the wave from the infant's cry as the input pattern and, at the end, obtain the kind of cry or the pathology detected in the baby (Cano, Escobedo and Coello, 1999) (Ekkel, 2002). Generally, the process of Automatic Infant Cry Recognition is done in two steps. The first step is known as signal processing, or feature extraction, whereas the second is known as

Figure 1. Automatic infant cry recognition process



pattern classification. In the acoustical analysis phase, the cry signal is first normalized and cleaned, and then analyzed to extract its most important features as a function of time. The set of features obtained is represented by a vector, which represents a pattern. The set of all vectors is then used to train the classifier. Later on, a set of unknown feature vectors is compared with the acquired knowledge to measure the efficiency of the classification output. Figure 1 shows the different stages of the recognition process described.

Cry Patterns Classification

The vectors, representing patterns, obtained in the extraction stage are later used in the classification process. There are four basic schools for the solution of the pattern classification problem: a) pattern comparison (dynamic programming), b) statistical models (Hidden Markov Models, HMM), c) knowledge-based systems (expert systems), and d) connectionist models (neural networks). In recent years, a strong new trend of more robust hybrid classifiers has been emerging. Some of the better known hybrid models result from the combination of neural and fuzzy approaches (Jang, 1993) (Lin Chin-Teng, and George Lee, 1996). For the work shown here, we have implemented a hybrid model of this type, called the Fuzzy Relational Neural Network, whose parameters are found through the application of genetic algorithms. We selected this kind of model because of its adaptation, learning and knowledge representation capabilities. Besides, one of its main functions is to perform pattern recognition. In an Automatic Infant Cry Classification System, the goal is to identify a model of an unknown pattern obtained after the original sound wave has been acoustically analyzed and its dimensionality reduced. So, in this phase we determine the class or category to which each cry pattern belongs. The collection of samples, each of which is represented by a vector of n features, is divided into two subsets: the training set and the test set. First, the training set is used to teach the classifier to distinguish between the different crying types. Then the test set is used to determine how well the classifier assigns the corresponding class to a pattern by means of the classification scheme generated during training.
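The division into training and test sets can be sketched as below. The 70/30 proportion, the fixed random seed, the labels and the function name `split_samples` are assumptions chosen only for illustration; the text does not state how the split was made.

```python
import random


def split_samples(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Each sample is (feature vector with n = 2 features, class label); values are made up.
samples = [([0.1 * i, 0.2 * i], "pain" if i % 2 else "hunger") for i in range(10)]
training_set, test_set = split_samples(samples)
print(len(training_set), len(test_set))
```

Keeping the two subsets disjoint is what allows the test set to measure how well the learned classification scheme generalizes to unseen patterns.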


THE FUZZY NEURAL NETWORK MODEL

The system proposed in this work is based upon fuzzy set operations in both the neural network's structure and the learning process. Following Pal's idea of a general recognizer (Pal, S.K., 1992a), the model is divided into two main parts, one for learning and another for processing, as shown in Figure 2.

Fuzzy Learning

The fuzzy learning section is composed of three modules, namely the Linguistic Feature Extractor (LFE), the Desired Output Estimator (DOE), and the Neural Network Trainer (NNT). The Linguistic Feature Extractor takes training samples in the form of n-dimensional vectors containing n features, and converts them to Nn-dimensional vectors, where N is the number of linguistic properties. In this case the linguistic properties are low, medium, and high. The resulting 3n-dimensional vector is called the Linguistic Properties Vector (LPV). In this way an input pattern Fi = [Fi1, Fi2, ..., Fin] containing n features may be represented as (Pal, and Mandal, 1992b)

$$F_i = \left[\mu_{low(F_{i1})}(F_i),\ \mu_{med(F_{i1})}(F_i),\ \mu_{high(F_{i1})}(F_i),\ \ldots,\ \mu_{high(F_{in})}(F_i)\right]$$

The DOE takes each vector from the training samples and calculates its membership to class k, in an l-class problem domain. The vector containing the class membership values is called the Desired Vector (DV). Both the LPV and DV vectors are used by the Neural Network Trainer (NNT) to train the network. The neural network has only one input layer and one output layer. The input layer is formed by a set of Nn neurons, each corresponding to one of the linguistic properties assigned to the n input features. In the output layer there are l neurons, each corresponding to one of the l classes; in this implementation, each class represents one type of crying. There is a link from every node in the input layer to every node in the output layer. All the connections are described by means of fuzzy relations R: X × Y → [0, 1] between the input and output nodes. The error is represented by the distance between the actual output and the target or desired output. During each learning step, once the error has been computed, the trainer adjusts the relationship values or weights of the corresponding connections, until either a minimum error is obtained or a given number of iterations is completed. The output of the NNT, after the learning process, is a fuzzy relational matrix (R in Figure 2) containing the knowledge needed to map the unknown input vectors to their corresponding class during the classification process.

Figure 2. General architecture of the automatic infant cry recognition system. [Blocks shown. Learning phase: training samples, Linguistic Feature Extractor (output LPV), Desired Output Estimator (output DV), Neural Network Trainer (output R). Processing phase: test samples, Linguistic Feature Extractor (output LPV), Fuzzy Classifier (output Y), Decision Making (output).]

Fuzzy Processing

The fuzzy processing section is formed by three modules, namely the Linguistic Feature Extractor (LFE), the Fuzzy Classifier (FC), and the Decision Making Module (DMM). The LFE works like the one in the learning phase, described in the previous section. The output of this module is an LPV vector which, along with the fuzzy relational matrix R, is used by the Fuzzy Classifier to obtain the actual outputs from the neural network. The classifier applies the max-min composition to calculate the output, a vector containing the membership values of the input vector to each of the classes. Finally, the Decision Making module selects the highest value from the classifier and assigns the corresponding class to the testing vector.

Membership Functions

A membership function maps values in a domain to their membership value in a fuzzy set. Several kinds of membership functions are available. In the reported experiments triangular membership functions were used. According to (Park, Cae, and Kandel, 1992), the use of more linguistic properties to describe a pattern point makes a model more accurate, but too many can make the description impractical. So, here we use seven linguistic properties: very low, low, more or less low, medium, more or less high, high, and very high.
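The conversion of an n-dimensional pattern into an Nn-dimensional Linguistic Properties Vector with triangular membership functions can be sketched as follows. The placement of the triangles is an assumption made only for illustration: N peaks evenly spaced over each feature's range, each triangle reaching zero at the neighbouring peaks. The function names are hypothetical.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 at or beyond left/right, 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def linguistic_properties_vector(features, ranges, n_properties=3):
    """Map n features to an (n_properties * n)-dimensional LPV.

    ranges[j] = (lo, hi) is the assumed value range of feature j.
    """
    lpv = []
    for x, (lo, hi) in zip(features, ranges):
        step = (hi - lo) / (n_properties - 1)   # distance between peaks
        for p in range(n_properties):
            peak = lo + p * step
            lpv.append(triangular(x, peak - step, peak, peak + step))
    return lpv


# Two features, three properties each (low, medium, high): a 6-dimensional LPV.
lpv = linguistic_properties_vector([0.25, 10.0], [(0.0, 1.0), (0.0, 10.0)])
print(lpv)

# Seven properties per feature, as in the reported experiments.
lpv7 = linguistic_properties_vector([0.3], [(0.0, 1.0)], n_properties=7)
print(len(lpv7))
```

With this placement a feature value lying between two peaks has non-zero membership in exactly those two adjacent properties, which is one simple way to obtain overlapping linguistic terms.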

Desired Membership Values

Before defining the output membership function, we define the equation to calculate the weighted distance of the training pattern Fi to the kth class in an l-class problem domain, as in (Pal, 1992a):

$$z_{ik} = \sum_{j=1}^{n} \left[\frac{F_{ij} - \sigma_{kj}}{\upsilon_{kj}}\right]^{2}, \quad \text{for } k = 1, \ldots, l$$

where Fij is the jth feature of the ith pattern vector, σkj denotes the mean, and υkj denotes the standard deviation of the jth feature for the kth class. The membership value of the ith pattern to class k is defined as follows:

$$M_k(F_i) = \frac{1}{1 + \left(\dfrac{z_{ik}}{f_d}\right)^{f_e}}, \quad M_k(F_i) \in [0, 1]$$

where fe is the exponential fuzzy generator, and fd is the denominational fuzzy generator controlling the amount of fuzziness in this class-membership set. In this case, the greater the distance of the pattern from a class, the lower its membership to that class. Since the training data have fuzzy class boundaries, a pattern point usually belongs to more than one class, to different degrees.
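The computation of a Desired Vector can be sketched as follows. The sketch reads the weighted distance as a sum of squared, deviation-scaled differences and uses fd = fe = 1 by default; these defaults, the sample statistics and all function names are assumptions for illustration only.

```python
def weighted_distance(pattern, means, stds):
    """z_ik: sum over features of ((F_ij - mean_kj) / std_kj) squared."""
    return sum(((f - m) / s) ** 2 for f, m, s in zip(pattern, means, stds))


def class_membership(z, fd=1.0, fe=1.0):
    """M_k(F_i) = 1 / (1 + (z_ik / fd) ** fe); lies in [0, 1] for z >= 0."""
    return 1.0 / (1.0 + (z / fd) ** fe)


def desired_vector(pattern, class_stats, fd=1.0, fe=1.0):
    """One membership value per class; class_stats[k] = (means, stds) of class k."""
    return [class_membership(weighted_distance(pattern, means, stds), fd, fe)
            for means, stds in class_stats]


# Made-up statistics for a two-class, two-feature problem.
class_stats = [
    ([1.0, 2.0], [1.0, 1.0]),   # class 0: the pattern sits exactly on its mean
    ([3.0, 2.0], [2.0, 1.0]),   # class 1: one standard deviation away in feature 0
]
dv = desired_vector([1.0, 2.0], class_stats)
print(dv)
```

The pattern receives full membership in the class whose mean it matches and a lower, but non-zero, membership in the other class, which reflects the fuzzy class boundaries mentioned in the text.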

The Neural Network Trainer

The neural network model discussed here is based on the relational neural structure proposed by Pedrycz in (Pedrycz, W., 1991).

The Relational Neural Network (RNN): Let X = {x1, x2, …, xn} be a finite set of input nodes and let Y = {y1, y2, …, yl} represent the set of output nodes in an l-class problem domain. When the max-min composition operator, denoted X ∘ R, is applied to a fuzzy set X and a fuzzy relation R, the output is a new fuzzy set Y:

$$Y = X \circ R, \qquad Y(y_j) = \max_{x_i}\left(\min\left(X(x_i),\ R(x_i, y_j)\right)\right) \qquad (1)$$

where X is a fuzzy set, Y is the resulting fuzzy set and R is a fuzzy relation R: X × Y → [0, 1] describing all relationships between input and output nodes. We will take the whole neural network represented by expression (1) as a collection of l separate n-input single-output cells.

Learning in a Fuzzy Neural Network: If the actual response from the network does not match the target pattern, the network is corrected by modifying the link weights to reduce the difference between the observed and target patterns. To measure the difference, a performance index called the equality index is defined as

$$T(y) \equiv Y(y) = \begin{cases} 1 + T(y) - Y(y), & \text{if } Y(y) > T(y) \\ 1 + Y(y) - T(y), & \text{if } Y(y) < T(y) \\ 1, & \text{if } Y(y) = T(y) \end{cases}$$

where T(y) is the target output at node y, and Y(y) is the actual output at the same node. In a problem with n input patterns, there are n input-output pairs (xij, ti), where ti is the target value when the input is xij.

Parameters Updating: Pedrycz also proposes to complete the learning process separately for each output node. The learning algorithm is a version of the back-propagation algorithm. Let us consider an n-input L-output neural network having the following form

$$y_i = f(x_i; a, \upsilon) = \left[\bigvee_{j=1}^{n} \left(a_j \wedge x_{ij}\right)\right]$$

where a = [a1, a2, ..., aL] is a vector containing all the weights or relations, and xi = [xi1, xi2, ..., xin] is the vector with the values observed in the input nodes. The parameters a and υ are updated iteratively by taking increments Δam resulting from deviations between all pairs yi and ti, as follows

$$a(k+1) = a(k) + \Psi_1(k)\left[\frac{\Delta a(k+1)}{Nn} + \eta\,\frac{\Delta a(k)}{Nn}\right]$$

where k is the learning step. Ψ1 and Ψ2 are non-increasing functions of k controlling the decreasing influence of the increments Δam, and η is the learning momentum specifying the level of modification of the learning parameters with regard to their values in the previous learning step k. The increments Δam are determined with regard to the mth coordinates of a, m = 1, 2, ..., L. The computation of the overall performance index, and of the derivatives used to calculate the increments for each coordinate of a and υ, is explained in detail in (Reyes, C. A., 1994). Once the training has terminated, the output of the trainer is the updated relational matrix, which contains the knowledge needed to map unknown patterns to their corresponding classes.
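The max-min composition, the equality index and the final decision step can be sketched in a few lines. This is a minimal illustration with a made-up relational matrix, not the trained model; the training update itself is not reproduced, and all function names are hypothetical.

```python
def max_min_composition(x, relation):
    """Y(y_j) = max over i of min(X(x_i), R(x_i, y_j)).

    x is the input fuzzy set (one membership per input node) and
    relation[i][j] holds R(x_i, y_j).
    """
    n_outputs = len(relation[0])
    return [max(min(xi, row[j]) for xi, row in zip(x, relation))
            for j in range(n_outputs)]


def equality_index(target, actual):
    """1 when target and actual coincide, decreasing with their difference."""
    if actual > target:
        return 1.0 + target - actual
    if actual < target:
        return 1.0 + actual - target
    return 1.0


def decide(outputs):
    """Decision making: index of the class with the highest membership."""
    return max(range(len(outputs)), key=outputs.__getitem__)


x = [0.2, 0.9, 0.5]          # three input nodes
R = [[0.7, 0.1],             # made-up relation for two output classes
     [0.3, 0.8],
     [0.6, 0.4]]
y = max_min_composition(x, R)
winner = decide(y)
print(y, winner)
```

For the first output node the minima are 0.2, 0.3 and 0.5, so the node takes the value 0.5; for the second they are 0.1, 0.8 and 0.4, giving 0.8, and the second class is selected.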

FUTURE TRENDS

One unexplored possibility for improving the FRNN performance is the use of other fuzzy relational products instead of the max-min composition. Moreover, membership functions have parameters that can be optimized by genetic algorithms or any other optimization technique. Adequate parameters may improve the learning and recognition efficiency of the FRNN.

CONCLUSIONS

We have presented the development and implementation of an AICR system as well as a powerful hybrid classifier, the FRNN, which is a model formed by the combination of fuzzy relations and artificial neural networks. The synergistic symbiosis obtained through the fusion of both methodologies will be demonstrated. In the related paper on applications of this model, we will show some practical results, as well as a model improved by means of genetic algorithms.

ACKNOWLEDGMENTS

This work is part of a project that is being financed by CONACYT-Mexico (46753).

REFERENCES

Bosma, J. F., Truby, H. M., and Antolop, W. (1965), Cry Motions of the Newborn Infant. Acta Paediatrica Scandinavica (Suppl.), 163, 61-92.

Cano, Sergio D., Escobedo, Daniel I., and Coello, Eddy (1999), El Uso de los Mapas Auto-Organizados de Kohonen en la Clasificación de Unidades de Llanto Infantil, Grupo de Procesamiento de Voz, 1er Taller AIRENE, Universidad Catolica del Norte, Chile, pp. 24-29.

Cano, Sergio D., Escobedo, Daniel I., Ekkel, Taco (2004), A Radial Basis Function Network Oriented for Infant Cry Classification, Proc. of 9th Iberoamerican Congress on Pattern Recognition, Puebla, Mexico.

Cano-Ortiz, S.D., Escobedo-Becerro, D.I. (1999), Clasificación de Unidades de Llanto Infantil Mediante el Mapa Auto-Organizado de Kohonen, I Taller AIRENE sobre Reconocimiento de Patrones con Redes Neuronales, Universidad Católica del Norte, Chile, pp. 24-29.

Ekkel, T. (2002), Neural Network-Based Classification of Cries from Infants Suffering from Hypoxia-Related CNS Damage, Master Thesis, University of Twente, The Netherlands.

Huang, X., Acero, A., Hon, H. (2001), Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice-Hall, Inc., USA.

Jang, J.-S. R. (1993), ANFIS: Adaptive Network-based Fuzzy Inference System, in IEEE Transactions on Systems, Man, and Cybernetics, 23(03):665-685.

Lin Chin-Teng, and George Lee, C.S. (1996), Neural Fuzzy System: A Neuro-Fuzzy Synergism to Intelligent Systems, Prentice Hall, Upper Saddle River, NJ.

Lippmann, R.P. (1990), Review of Neural Networks for Speech Recognition, in Readings in Speech Recognition, Morgan Kaufmann Publishers Inc., San Mateo, Calif., pp. 374-392.

Morgan, D.P., and Scofield, C.L. (1991), Neural Networks and Speech Processing, Kluwer Academic Publishers, Boston.

Orozco, J., Reyes, C.A. (2003), Mel-frequency Cepstrum Coefficients Extraction from Infant Cry for Classification of Normal and Pathological Cry with Feed-Forward Neural Networks, Proc. of ESANN, Bruges, Belgium.

Pal, S.K. (1992a), Multilayer Perceptron, Fuzzy Sets, and Classification, in IEEE Trans. on Neural Networks, vol. 3, No. 5, Sep 1992, pp. 683-697.

Pal, S.K. and Mandal, D.P. (1992b), Linguistic Recognition Systems Based on Approximated Reasoning, in Information Science, vol. 61, No. 2, pp. 135-161.

Park, D., Cae, Z., and Kandel, A. (1992), Investigations on the Applicability of Fuzzy Inference, in Fuzzy Sets and Systems, vol. 49, pp. 151-169.

Pedrycz, W. (1991), Neuro Computations in Relational Systems, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 13, No. 3, pp. 289-296.

Petroni, M., Malowany, A. S., Johnston, C., and Stevens, B. J. (1995), Identification of pain from infant cry vocalizations using artificial neural networks (ANNs), The International Society for Optical Engineering, Volume 2492, Part two of two, Paper #: 2492-79, pp. 729-738.

Reyes, C. A. (1994), On the design of a fuzzy relational neural network for automatic speech recognition, Doctoral Dissertation, The Florida State University, Tallahassee, FL, USA.

Suaste, I., Reyes, O.F., Diaz, A., Reyes, C.A. (2004), Implementation of a Linguistic Fuzzy Relational Neural Network for Detecting Pathologies by Infant Cry Recognition, Proc. of IBERAMIA, Puebla, Mexico, pp. 953-962.

Wasz-Höckert, O., Lind, J., Vuorenkoski, V., Partanen, T., & Valanne, E. (1970), El Llanto en el Lactante y su Significación Diagnóstica, Cientifico-Medica, Barcelona.

Wasz-Höckert, O., Lind, J., Vuorenkoski, V., Partanen, T., Valanne, E. (1968), The infant cry: a spectrographic and auditory analysis, Clin. Dev. Med. 29, pp. 1-42.

Zadeh, L.A. (1965), Fuzzy Sets, Inform. Contr., vol. 8, pp. 338-353.


KEY TERMS

Artificial Neural Networks: Networks of many simple processors that imitate a biological neural network. The units are connected by unidirectional communication channels, which carry numeric data. Neural networks can be trained to find nonlinear relationships in data, and are used in applications such as robotics, speech recognition, signal processing or medical diagnosis.

Automatic Infant Cry Recognition (AICR): A process in which the crying signal is automatically analyzed to extract acoustical features, with the aim of determining the infant's physical state or the cause of crying, or even of detecting pathologies in very early stages of life.

Back-propagation Algorithm: A learning algorithm for ANNs, based on minimising the error obtained by comparing the outputs that the network gives for a set of network inputs with the outputs it should give (the desired outputs).

Fuzzy Relational Neural Network (FRNN): A hybrid classification model combining the advantages of fuzzy relations with artificial neural networks.

Fuzzy Sets: A generalization of ordinary sets that allows a degree of membership for their elements. This theory was proposed by Lotfi Zadeh in 1965. Fuzzy sets are the basis of fuzzy logic.

Hybrid Intelligent System: A software system which employs, in parallel, a combination of methods and techniques from Soft Computing.

Learning Stage: A process to teach classifiers to distinguish between different pattern types.


A Hybrid System for Automatic Infant Cry Recognition II

Carlos Alberto Reyes-García, Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico
Sandra E. Barajas, Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico
Esteban Tlelo-Cuautle, Instituto Nacional de Astrofísica Óptica y Electrónica, Mexico
Orion Fausto Reyes-Galaviz, Universidad Autónoma de Tlaxcala, Mexico

INTRODUCTION

The Automatic Infant Cry Recognition (AICR) process is basically a pattern processing problem, very similar to the Automatic Speech Recognition (ASR) process (Huang, Acero, Hon, 2001). In AICR we first perform acoustical analysis, in which the crying signal is analyzed to extract the most important acoustical features, such as LPC, MFCC, etc. (Cano, Escobedo and Coello, 1999). The characteristics obtained are represented by feature vectors, and each vector represents a pattern. These patterns are then classified into their corresponding pathology (Ekkel, 2002). In the reported case we automatically classify cries from normal, deaf and asphyxiating infants. We use a genetic algorithm to find several optimal parameters needed by the Fuzzy Relational Neural Network (FRNN) (Reyes, 1994), such as the number of linguistic properties, the type of membership function, the method used to calculate the output, and the learning rate. The whole model has been tested on several data sets for infant cry classification. The process, as well as some results, is described.

BACKGROUND

The first part of this document gives a complete description of the AICR system and of the FRNN. For continuity, this part concentrates on the description of the genetic algorithm and on the implementation and testing of the whole system.

A genetic algorithm refers to a model introduced and investigated by John Holland (Holland, 1975) and by students of Holland (De Jong, 1975). Genetic algorithms are often viewed as function optimizers, although the range of problems to which they have been applied is quite broad. Recently, numerous papers and applications combining fuzzy concepts and genetic algorithms (GAs) have appeared, and there is increasing interest in the integration of these two topics. In particular, a great number of publications explore the use of GAs for developing or improving fuzzy systems, called genetic fuzzy systems (GFSs) (Cordon et al., 2001) (Casillas, Cordon, del Jesus, Herrera, 2000).

EVOLUTIONARY DESIGN

Among the evolutionary techniques, perhaps one of the most popular is the genetic algorithm (GA) (Goldberg, 1989). Its structure presents analogies with the biological theory of evolution, and it is based on the principle of the survival of the fittest individual (Holland, 1975). Generally, a genetic algorithm has five basic components (Michalewicz, 1992): a representation of potential solutions to the problem, a way to create potential initial solutions, a fitness function that evaluates solutions, genetic operators that alter the composition of the offspring, and values for parameters such as the population size, crossover probability, mutation probability, number of generations and others. Here we present the features of the genetic algorithm used to find a combination of parameters for the FRNN.

Chromosomal Representation

Binary codification is used in genetic algorithms, and Holland (Holland, 1975) gave a theoretical justification for its use. Holland argued that binary codification allows more schemata than a decimal representation. A schema is a template that describes a subgroup of strings that share certain similarities at some positions throughout their length (Goldberg, 1989). The problem variables are the number of linguistic properties, the type of membership function, the classification method and the learning rate. We are interested in having between 3 and 7 linguistic properties, so the number of linguistic properties is encoded as a binary string of 3 bits. The membership function is represented as a 2-bit string, where [00] decodes to the Trapezoidal membership function, [01] to the Π function, [10] to the Triangular function and [11] to the Gaussian membership function. The classification method is also coded as a 2-bit string, where [00] represents the max-min composition, [01] the geometrical mean and [10] the relational square product. Finally, the learning rate is represented as a binary string of 3 bits, where [000] decodes to a learning rate of 0.1, [001] to 0.2, [010] to 0.31, [011] to 0.4 and [100] to 0.5. A larger learning rate is not desirable, so all other bit values are ignored. The chromosome is obtained by concatenating all the above strings. Figure 1 shows an example of the chromosomal representation. The initial population is generated by randomly selecting chromosomes; a population size of 50 was used.
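The encoding just described can be illustrated with a small decoder. The following Python sketch is not the authors' Matlab implementation. It assumes that the 3-bit field for the number of linguistic properties is read as a plain binary integer (consistent with the codification 011 being interpreted as 3 in the results), and that a chromosome containing an unused code is simply rejected. The names `decode`, `MEMBERSHIP`, `METHOD` and `LEARNING_RATE` are hypothetical.

```python
# Code tables taken from the text; field widths are 3 + 2 + 2 + 3 = 10 bits.
MEMBERSHIP = {"00": "trapezoidal", "01": "pi", "10": "triangular", "11": "gaussian"}
METHOD = {"00": "max-min", "01": "geometrical mean", "10": "relational square product"}
LEARNING_RATE = {"000": 0.1, "001": 0.2, "010": 0.31, "011": 0.4, "100": 0.5}


def decode(chromosome):
    """Decode a 10-bit string into FRNN parameters; return None if a code is unused."""
    if len(chromosome) != 10 or set(chromosome) - {"0", "1"}:
        raise ValueError("expected a string of exactly 10 bits")
    props_bits = chromosome[0:3]
    mf_bits = chromosome[3:5]
    method_bits = chromosome[5:7]
    lr_bits = chromosome[7:10]
    # Assumption: the field is a plain binary integer restricted to 3..7.
    n_props = int(props_bits, 2)
    if not 3 <= n_props <= 7:
        return None
    # Assumption: unused method or learning-rate codes invalidate the chromosome.
    if method_bits not in METHOD or lr_bits not in LEARNING_RATE:
        return None
    return {
        "linguistic_properties": n_props,
        "membership_function": MEMBERSHIP[mf_bits],
        "classification_method": METHOD[method_bits],
        "learning_rate": LEARNING_RATE[lr_bits],
    }


example = decode("1010000001")  # the bit pattern shown in Figure 1
```

Under these assumptions, the Figure 1 pattern decodes to five linguistic properties, a trapezoidal membership function, max-min composition and a learning rate of 0.2. How the original system treats chromosomes with unused codes is not stated beyond their being "ignored", so the `None` return value is only one possible reading.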

Genetic Operations

We use four genetic operations, namely elitism, roulette wheel selection, crossover and mutation.

Elitism: To ensure that the members of the population with the highest fitness values stay in the next generation, we apply elitism. It has been demonstrated (Rudolph, 1994) that a genetic algorithm must use elitism to be able to show convergence. At each iteration of the genetic algorithm we select the members with the four highest fitness values and place them in the next generation.

Selection: In the genetic algorithm the selection process is probabilistic; that is, the less fit individuals still have a certain chance of being selected. There are many different selection approaches; we use roulette wheel selection, where members of the population have a probability of being selected that is directly proportional to their fitness.

Crossover: In this work we use single-point crossover. Observing the performance of different crossover operators, De Jong (De Jong, 1975) concluded that, although increasing the number of crossover points affects the schemata from a theoretical perspective, in practice this does not seem to have a significant impact. Crossover is the principal operator in the genetic algorithm. Based on some experiments, we decided to choose the crossover point randomly, and the crossover probability was fixed at 0.8.

Mutation: This operator allows the introduction of new chromosomal material into the population. We select a gene randomly and replace it with its complement: a zero is changed to a one and a one is changed to a zero. Some authors suggest that a mutation probability of 1/L, where L is the length of the bit string, is an acceptable lower bound for the optimal mutation rate (Bäck, 1993). In this work the mutation probability is fixed at 0.05.
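The four operations can be combined into one generation step. The Python sketch below is only an illustration under stated assumptions, not the authors' Matlab code: it treats higher fitness as better, applies the mutation probability once per chromosome (the text does not say whether it is per chromosome or per bit), and uses hypothetical function names throughout. The constants 4, 0.8 and 0.05 come from the text.

```python
import random

ELITE_COUNT = 4        # members copied unchanged into the next generation
CROSSOVER_PROB = 0.8   # probability of applying single-point crossover
MUTATION_PROB = 0.05   # probability of flipping one bit (assumed per chromosome)


def roulette_select(population, fitnesses, rng):
    """Pick one member with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for member, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return member
    return population[-1]


def crossover(a, b, rng):
    """Single-point crossover at a random cut point, applied with CROSSOVER_PROB."""
    if rng.random() < CROSSOVER_PROB:
        cut = rng.randint(1, len(a) - 1)
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a, b


def mutate(chrom, rng):
    """With MUTATION_PROB, replace one randomly chosen bit by its complement."""
    if rng.random() < MUTATION_PROB:
        i = rng.randrange(len(chrom))
        flipped = "1" if chrom[i] == "0" else "0"
        return chrom[:i] + flipped + chrom[i + 1:]
    return chrom


def next_generation(population, fitnesses, rng):
    """Elitism first, then fill the rest by selection, crossover and mutation."""
    ranked = sorted(zip(fitnesses, population), reverse=True)
    new_pop = [member for _, member in ranked[:ELITE_COUNT]]
    while len(new_pop) < len(population):
        a = roulette_select(population, fitnesses, rng)
        b = roulette_select(population, fitnesses, rng)
        for child in crossover(a, b, rng):
            if len(new_pop) < len(population):
                new_pop.append(mutate(child, rng))
    return new_pop
```

A caller would evaluate the fitness of every chromosome, pass both lists to `next_generation` together with a `random.Random` instance, and repeat for the desired number of generations.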

Figure 1. Chromosomal representation. Example chromosome: 101 | 00 | 00 | 001, with fields (left to right) linguistic properties | membership function | classification method | learning rate.


Fitness Function

The objective function of our optimization problem is called the fitness function. This function must penalize poor solutions and reward good ones so that the good ones propagate quickly (Coello, 1995). As the fitness function we use the classification error given by the Fuzzy Relational Neural Network. The fitness function is then defined by

$$F = e_{FRNN}$$

where the classification error is defined as

$$e_{FRNN} = \frac{\text{No.PM}}{\text{No.S}}$$

Here No.PM is the number of perfect matches, that is, the number of samples classified correctly, and No.S is the total number of samples given to the FC. Note that, although it is called a classification error, the quantity as defined is the fraction of correctly classified samples, so larger values indicate better solutions.
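As a worked illustration of the ratio No.PM / No.S, the short Python sketch below counts perfect matches between predicted and expected class labels. The function name `fitness` and the label strings are hypothetical; in the system described here the predictions would come from the trained FRNN.

```python
def fitness(predicted, expected):
    """Return No.PM / No.S: perfect matches divided by the number of samples."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one sample is required")
    perfect_matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return perfect_matches / len(expected)


# Three of the four hypothetical samples are classified correctly.
score = fitness(["deaf", "normal", "deaf", "normal"],
                ["deaf", "normal", "normal", "normal"])
```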

IMPLEMENTATION AND RESULTS

Signal Processing

The analysis of the raw cry waveform provides the information needed for its recognition. At the same time, it discards unwanted information such as background noise and channel distortion (Levinson and Roe, 1990). Acoustic feature extraction is a transformation of measured data into pattern data. Some of the most important techniques used for analyzing cry wave signals are the Discrete Fourier Transform (DFT), cepstral processing, and Linear Prediction Analysis (LPA) (Ainsworth, 1988) (Schafer and Rabiner, 1990). Applying these techniques during signal processing yields the values of a set of acoustic features. The features may be spectral coefficients, linear prediction coefficients (LPC), Mel frequency cepstral coefficients (MFCC), among others (Ainsworth, 1988). The set of values for n features may be represented by a vector in an n-dimensional space. Each vector represents a pattern.

For the present experiments we work with samples of infant cries. The infant cries were recorded directly by medical doctors; each signal wave was then divided into segments of 1 second, and each segment represents a sample. Acoustic features were then obtained by means of techniques such as Linear Prediction Coefficients (LPC) and Mel Frequency Cepstral Coefficients (MFCC), using the freeware program Praat v4.0.8 (Boersma and Weenink, 2002). Every 1-second sample is divided into frames of 50 milliseconds, and from each frame we extract 16 coefficients; this procedure generates vectors with 304 coefficients per sample. In this paper we show the results obtained with Mel Frequency Cepstral Coefficients. To reduce the dimensions of the sample vectors we apply Principal Component Analysis. The FRNN and the genetic algorithm are implemented in Matlab.

We have a corpus of 157 samples of normal infant cry, 340 of asphyxia infant cry, and 879 of hypoacoustic (deaf) infant cry. We also have a corpus of 192 samples of pain crying and 350 samples of hunger crying. We worked with a population of 50 individuals, and the number of training epochs for the FRNN was set to three. The initial population was randomly chosen. The genetic algorithm needed only three generations. These values were set on the basis of the results observed in several experiments.

Preliminary Results

Three different classification experiments were carried out: the first classifies deaf and normal infant cry, the second classifies infant cry into the categories asphyxia and normal, and the third classifies hunger and pain crying. In each task the training samples and the test samples are randomly selected. The results of the model in the classification of deaf and normal cry are given in Table 1. Table 2 shows the results obtained in the second classification task. Finally, Table 3 shows the results in the classification of hunger and pain cry. In every classification task the GA was run about 15 times, and the reported results show the average of the best classification in each experiment.



Table 1. Results of classifying deaf and normal cry

Characteristics                   Successful codification   Interpretation
Number of linguistic properties   011                       3
Membership function               01                        Π
Classification method             00                        max-min
Learning rate                     001                       0.2
Accuracy                                                    98%

Table 2. Results of classifying asphyxia and normal cry

Characteristics                   Successful codification   Interpretation
Number of linguistic properties   (missing)                 3
Membership function               01                        Π
Classification method             01                        geometrical mean
Learning rate                     010                       0.31
Accuracy                                                    84%

Table 3. Results of classifying hunger and pain cry

Characteristics                   Successful codification   Interpretation
Number of linguistic properties   (missing)                 7
Membership function               01                        Π
Classification method             00                        max-min
Learning rate                     010                       0.31
Accuracy                                                    95.24%


Performance Comparison with Other Models

Reyes and Orozco (Orozco, Reyes, 2003) classified cry samples from deaf and normal babies, obtaining recognition results around 97.43%. Reyes et al. (Suaste, Reyes, Diaz, Reyes, 2004) presented an implementation of a linguistic fuzzy relational neural network to classify normal and pathological infant cry, with correct classification percentages of 97.3% and 98%. Petroni, Malowany, Johnston and Stevens (1995) classified cries from normal babies to identify pain with artificial neural networks, and report correct classification results ranging from 61% with cascade-correlation networks up to 86.2% with feed-forward neural networks. Lederman (Lederman, 2002) presents some classification results for infants with respiratory distress syndrome (RDS, related to asphyxia) versus healthy infants. For the classification he used a Hidden Markov Model architecture with 8 states and 5 Gaussians per state. The reported result is a total mean correct classification of 63%.

FUTURE TRENDS

AICR systems may expand their utility by being trained to recognize a larger number of pathologies. The first requirement to achieve this goal is to collect a suitable set of labeled samples for each target pathology. The GA presented here optimizes some parameters of the FRNN, but the model has more. Other parameters can therefore be added to the chromosomal representation in order to improve the model, such as the initial values of the relational matrix and of the bias vectors, the number of training epochs, and the values of the exponential fuzzy generator and the denominational fuzzy generator used by the DOE.

CONCLUSIONS

The proposed genetic algorithm selects the number of linguistic properties, the membership function used to calculate the linguistic features, the method to calculate the output of the classifier in the fuzzy processing section, and the learning rate of the FRNN. The solution obtained by the proposed genetic algorithm is a set of characteristics that the FRNN can use to classify infant cry. The use of linguistic properties allows us to deal with the imprecision of infant cry and provides the classifier with very useful information. By applying the linguistic information, and given the nature of the model, neither a high number of learning epochs nor a high number of iterations of the genetic algorithm is necessary. The results of classifying deaf and normal infant cry are very similar to those of other models, but when we classify hunger and pain the results are much better than those of other models.

ACKNOWLEDGMENTS

This work is part of a project that is being financed by CONACYT-Mexico (46753).

REFERENCES

Bäck, Thomas (1993), Optimal mutation rates in genetic search, Proceedings of the Fifth International Conference on Genetic Algorithms, San Mateo, California: Morgan Kaufmann, pp. 2-8.
Boersma, P., Weenink, D. (2002), Praat v 4.0.8. A system for doing phonetics by computer, Institute of Phonetic Sciences of the University of Amsterdam, February.
Bosma, J. F., Truby, H. M., & Antolop, W. (1965), Cry Motions of the Newborn Infant. Acta Paediatrica Scandinavica (Suppl.), 163, 61-92.
Cano, Sergio D., Escobedo, Daniel I., and Coello, Eddy (1999), El Uso de los Mapas Auto-Organizados de Kohonen en la Clasificación de Unidades de Llanto Infantil, Grupo de Procesamiento de Voz, 1er Taller AIRENE, Universidad Catolica del Norte, Chile, pp. 24-29.
Casillas, J., Cordon, O., Jesus, M.J. del, Herrera, F. (2000), Genetic Feature Selection in a Fuzzy Rule-Based Classification System Learning Process for High Dimensional Problems, Technical Report DECSAI000122, Universidad de Granada, Spain.
Coello, Carlos A. (1995), Introducción a los algoritmos genéticos, Soluciones Avanzadas. Tecnologías de Información y Estrategias de Negocios, Año 3, No. 17, pp. 5-11.



Cordon, Oscar, Herrera, Francisco, Hoffmann, Frank and Magdalena, Luis (2001), Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases, Singapore, World Scientific.
De Jong, K. (1975), An Analysis of the Behavior of a Class of Genetic Adaptive Systems. Ph.D. Dissertation. Dept. of Computer and Communication Sciences, Univ. of Michigan, Ann Arbor.
Ekkel, T. (2002), Neural Network-Based Classification of Cries from Infants Suffering from Hypoxia-Related CNS Damage, Master Thesis. University of Twente. The Netherlands.
Goldberg, David E. (1989), Genetic Algorithms in Search, Optimization and Machine Learning. Massachusetts: Addison-Wesley.
Rudolph, Günter (1994), Convergence analysis of canonical genetic algorithms, IEEE Transactions on Neural Networks, vol. 5, pp. 96-101.
Holland, J. (1975), Adaptation in Natural and Artificial Systems, University of Michigan Press.
Huang, X., Acero, A., Hon, H. (2001), Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice-Hall, Inc., USA.
Lederman, D. (2002), Automatic Classification of Infants' Cry. Master Thesis. University of Negev. Israel.
Petroni, M., Malowany, A. S., Johnston, C., and Stevens, B. J. (1995), Identification of pain from infant cry vocalizations using artificial neural networks (ANNs), The International Society for Optical Engineering. Volume 2492. Part two of two. Paper #: 2492-79. pp. 729-738.
Reyes, C. A. (1994), On the design of a fuzzy relational neural network for automatic speech recognition, Doctoral Dissertation, The Florida State University, Tallahassee, FL, USA.
Suaste, I., Reyes, O.F., Diaz, A., Reyes, C.A. (2004), Implementation of a Linguistic Fuzzy Relational Neural Network for Detecting Pathologies by Infant Cry Recognition, Proc. of IBERAMIA, Puebla, Mexico, pp. 953-962.


Michalewicz, Zbigniew (1992), Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, 2nd ed.

KEY TERMS

Binary Chromosome: An encoding scheme that represents one potential solution to a problem, during a search process, by means of a string of bits.
Evolutionary Computation: A subfield of computational intelligence that involves combinatorial optimization problems. It uses iterative progress, such as growth or development in a population, which is then selected in a guided random search to achieve the desired end. Such processes are often inspired by biological mechanisms of evolution.
Fitness Function: A function defined over the genetic representation that measures the quality of the represented solution. The fitness function is always problem dependent.
Genetic Algorithms: A family of computational models inspired by evolution. These algorithms encode a potential solution to a specific problem on a simple chromosome-like data structure and apply recombination operators to these structures so as to preserve critical information. Genetic algorithms are often viewed as function optimizers, although the range of problems to which they have been applied is quite broad.
Hybrid Intelligent System: A software system which employs, in parallel, a combination of methods and techniques mainly from subfields of Soft Computing.
Signal Processing: The analysis, interpretation and manipulation of signals. Processing of such signals includes storage and reconstruction, separation of information from noise, compression, and feature extraction.
Soft Computing: A partnership of techniques which in combination are tolerant of imprecision, uncertainty, partial truth, and approximation, and whose role model is the human mind. Its principal constituents are Fuzzy Logic (FL), Neural Computing (NC), Evolutionary Computation (EC), Machine Learning (ML) and Probabilistic Reasoning (PR).


IA Algorithm Acceleration Using GPUs Antonio Seoane University of A Coruña, Spain Alberto Jaspe University of A Coruña, Spain

INTRODUCTION

Graphics Processing Units (GPUs) have been evolving very fast, turning into high-performance programmable processors. Although GPUs were designed to compute graphics algorithms, their power and flexibility make them a very attractive platform for general-purpose computing. In recent years they have been used to accelerate calculations in physics, computer vision, artificial intelligence, database operations, etc. (Owens, 2007). This paper first presents an approach to general-purpose computing with GPUs, and then describes artificial intelligence algorithms based on Artificial Neural Networks (ANN) and Evolutionary Computation (EC) that have been accelerated using the GPU.

BACKGROUND

General-Purpose Computation using Graphics Processing Units (GPGPU) consists of using the GPU as an alternative platform for parallel computing, taking advantage of the powerful performance provided by the graphics processor (General-Purpose Computation Using Graphics Hardware Website; Owens, 2007). Several reasons justify the use of the GPU for general-purpose computing (Luebke, 2006):

• Latest-generation GPUs are very fast in comparison with current processors. For instance, an NVIDIA 8800 GTX card has a computing capability of approximately 330 GFLOPS, whereas an Intel Core2 Duo 3.0 GHz processor has a capability of only about 48 GFLOPS.
• GPUs are highly programmable. In recent years the programming capabilities of graphics chips have grown considerably, replacing fixed-function engines with programmable ones, such as pixel and vertex engines. Moreover, this has led to the appearance of high-level languages that ease their programming.
• GPUs evolve faster than CPUs. GPU performance currently increases by 1.7x to 2.3x per year, whereas CPU performance increases by about 1.4x. The pressure exerted by the videogame market is one of the main reasons for this evolution, as it forces companies to evolve graphics hardware continuously.
• GPUs use high-precision data types. Although graphics hardware was originally designed to work with low-precision data types, internal calculations are now computed using 32-bit floating-point numbers.
• Graphics cards have a low cost in relation to the capabilities they provide. Nowadays, GPUs are affordable for any user.
• GPUs are highly parallel: they can have multiple processors that allow high-performance parallel arithmetic calculations.
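The figures quoted in the list can be put side by side with a little arithmetic. The Python sketch below is only an illustration: it assumes that the yearly growth factors stay constant and compound over time, which is a simplification, and the function name `relative_gain` is hypothetical.

```python
# Peak-throughput figures quoted in the text (GFLOPS).
gpu_gflops = 330.0
cpu_gflops = 48.0
speed_ratio = gpu_gflops / cpu_gflops  # 6.875, i.e. roughly 7x


def relative_gain(gpu_rate, cpu_rate, years):
    """Factor by which the GPU/CPU gap widens after `years` of compound growth."""
    return (gpu_rate / cpu_rate) ** years


# Gap widening over three years for the low and high ends of the quoted GPU range.
low_estimate = relative_gain(1.7, 1.4, 3)
high_estimate = relative_gain(2.3, 1.4, 3)
```

Under this constant-rate assumption, the existing gap of about 7x would widen by a further factor of roughly 1.8 to 4.4 over three years.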

Nevertheless, there are some obstacles. First, not all algorithms fit the GPU's programming model, because GPUs are designed to compute highly intensive parallel algorithms (Harris, 2005). Second, there are difficulties in using GPUs, due mainly to the following:

• The GPU's programming model is different from the CPU's.
• GPUs are designed for graphics algorithms and, therefore, for graphics programming. The implementation of general-purpose algorithms on the GPU is quite different from traditional implementations.
• Some limitations or restrictions exist in programming capabilities. Most functions in GPU programming languages are very specific and dedicated to calculations in graphics algorithms.
• GPU architectures are quite variable due to their fast evolution and the incorporation of new features.

Therefore it is not easy to port an algorithm developed for CPUs to run on a GPU.

Overview of the Graphics Pipeline

Current GPUs perform their computations following a common structure called the Graphics Pipeline. The Graphics Pipeline (Akenine-Möller, 2002) is composed of a set of stages that are executed sequentially inside the GPU, allowing the computation of graphics algorithms. Recent hardware is made up of four main elements. First, the vertex processors receive vertex arrays from the CPU and perform the necessary transformations from their positions in space to their final positions on the screen. Second, primitive assembly builds graphics primitives (for instance, triangles) using connectivity information between vertices. Third, in the rasterizer, those graphics primitives are discretized and turned into fragments. A fragment represents a potential pixel and contains the information (color, depth, etc.) needed to generate the final color of a pixel. Finally, in the fragment processors, fragments become pixels whose final color is written to a target buffer, which can be the screen buffer or a texture. Present-day GPUs have multiple vertex and fragment processors that compute operations in parallel. Both are programmable using small pieces of code called vertex and fragment programs, respectively. In recent years, high-level programming languages such as Cg/HLSL (Mark, 2003; HLSL Shaders) and GLSL (OpenGL Shading Language Information Site) have been released, making those processors easier to program.

The GPU Programming Model

There is a big difference between programming CPUs and GPUs, due mainly to their different programming models. GPUs are based on the stream programming model (Owens, 2005a; Luebke, 2006; Owens, 2007), where all data are represented as streams, a stream being an ordered set of data of the same type. A kernel operates on full streams, taking input data from one or more streams to produce one or more output streams. The main characteristic of a kernel is that it operates on the whole stream instead of on individual elements. The typical use of a kernel is the evaluation of a function over each element of an input stream, which is called a map operation. Other kernel operations are expansions, reductions, filters, etc. (Buck, 2004; Horn, 2005; Owens, 2007). The outputs generated by a kernel are always based on its input streams, which means that, inside the kernel, the calculations made on one element never depend on the other elements.

In the stream programming model, applications are built by connecting multiple kernels. An application can be represented as a dependency graph where each node is a kernel and each edge represents a data stream between kernels (Owens, 2005b; Lefohn, 2005). The behavior of the graphics pipeline is similar to the stream programming model. Data flows through each stage, and the output of one stage feeds the next. Stream elements (vertex or fragment arrays) are processed independently by kernels (vertex or fragment programs), and their output can in turn be received by other kernels. The stream programming model allows efficient computation, because kernels operate on independent elements from a set of input streams and can be processed by hardware such as the GPU, which processes vertex or fragment streams in parallel. This allows parallel computing without the complexity of traditional parallel programming models.
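The kernel operations mentioned above can be mimicked on the CPU to make the model concrete. The Python sketch below is a sequential illustration only; it does not run on a GPU, and the function names `map_kernel`, `filter_kernel` and `reduce_kernel` are hypothetical. It chains three kernels so that each output stream feeds the next, as in the dependency graph just described.

```python
from functools import reduce


def map_kernel(fn, stream):
    """Map: evaluate fn on every element independently of the others."""
    return [fn(x) for x in stream]


def filter_kernel(predicate, stream):
    """Filter: keep only the elements that satisfy the predicate."""
    return [x for x in stream if predicate(x)]


def reduce_kernel(fn, stream, initial):
    """Reduction: combine a whole stream into a single value."""
    return reduce(fn, stream, initial)


# A tiny application: three kernels connected by streams.
input_stream = [1, 2, 3, 4]
squares = map_kernel(lambda x: x * x, input_stream)
even_squares = filter_kernel(lambda x: x % 2 == 0, squares)
total = reduce_kernel(lambda a, b: a + b, even_squares, 0)
```

Because `map_kernel` never lets one element's result depend on another, each iteration could in principle be assigned to a different processor, which is the property the stream model exploits.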

Computational Resources on GPU

Several computational resources are available for implementing any kind of algorithm on the GPU (Harris, 2005; Owens, 2007). On the one hand, current GPUs have two different parallel programmable processors: vertex and fragment processors. Vertex processors compute vertex streams (points with associated properties such as position, color, normal, etc.). A vertex processor applies a vertex program to transform each input vertex to its position on the screen. Fragment processors compute fragment streams. They apply a fragment program to each fragment to calculate the final color of the pixel. In addition to using the attributes of each fragment, those processors can access other data streams, such as textures, when they are generating each pixel. Textures can be seen as an interface for accessing read-only memory.

Another resource available in the GPU is the rasterizer. It generates fragments using triangles built from vertex and connectivity information. The rasterizer allows a larger output data set to be generated from a smaller input one, because it interpolates the properties of each vertex that belongs to a triangle (such as color, texture coordinates, etc.) for each generated fragment.

One of the essential features of GPUs is render-to-texture. This allows the pixels generated by the fragment processors to be stored in a texture instead of in a screen buffer. At the moment this is the only mechanism for obtaining output data directly from GPU computing. Render-to-texture cannot be thought of as an interface to read-write memory, because a fragment processor can read data from a texture multiple times but can write to it just once, at the end of each fragment's processing.
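The read-many, write-once restriction shapes how multi-step computations are organized. The Python sketch below emulates this on the CPU, under the assumption of a multi-pass scheme (a common GPGPU practice that is not described in the text): each pass reads a read-only source texture and writes every output texel exactly once, and the output of one pass becomes the input of the next. The texture is assumed to be square with a power-of-two side, and all function names are hypothetical.

```python
def render_pass(fragment_program, src, out_size):
    """One pass: every output texel is written once, reading only from src."""
    return [[fragment_program(src, x, y) for x in range(out_size)]
            for y in range(out_size)]


def sum_2x2(src, x, y):
    """Fragment program: add the 2x2 block of src that maps onto texel (x, y)."""
    return (src[2 * y][2 * x] + src[2 * y][2 * x + 1]
            + src[2 * y + 1][2 * x] + src[2 * y + 1][2 * x + 1])


def reduce_sum(texture):
    """Sum all texels by halving the texture side on each pass."""
    size = len(texture)
    while size > 1:
        size //= 2
        # The previous output is bound as the read-only input of this pass.
        texture = render_pass(sum_2x2, texture, size)
    return texture[0][0]


grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
grid_total = reduce_sum(grid)
```

A reduction that a CPU would write as a single accumulating loop thus becomes a sequence of passes over progressively smaller textures.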

ARTIFICIAL INTELLIGENCE ALGORITHMS ON GPU

Using the stream programming model together with the resources provided by graphics hardware, Artificial Intelligence algorithms can be parallelized and their computation therefore accelerated. The parallel and computationally intensive nature of this kind of algorithm makes them good candidates for implementation on the GPU. Consider the evolution process of genetic algorithms, where a fitness value needs to be computed for each individual. The population can be considered a data stream and the fitness function a kernel that processes this stream. On the GPU, for instance, the data stream must be represented as a texture, whereas the kernel must be implemented as a fragment program. Each individual's fitness is obtained in an output stream, also represented by a texture, by using the render-to-texture feature. Recently some work has been carried out, mainly on parallelizing ANN and EC algorithms, as described in the following sections.
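The mapping of a population onto a texture can be sketched on the CPU as follows. This Python example only mimics the data layout and is not GPU code: the population is packed into a 2D grid standing in for the input texture, a fitness kernel is applied to each texel independently, and the result is a grid of the same shape standing in for the output texture. The fitness function `negative_sphere` is a hypothetical placeholder, and all helper names are assumptions.

```python
def pack(population, width):
    """Lay a 1D population out as rows of a 2D 'texture'."""
    return [population[i:i + width] for i in range(0, len(population), width)]


def run_fragment_program(kernel, texture):
    """Apply the kernel to every texel independently, producing a new texture."""
    return [[kernel(texel) for texel in row] for row in texture]


def unpack(texture):
    """Flatten a 2D 'texture' back into a 1D list."""
    return [texel for row in texture for texel in row]


def negative_sphere(individual):
    """Hypothetical fitness: larger (closer to zero) is better."""
    return -sum(gene * gene for gene in individual)


population = [(0, 0), (1, 0), (1, 1), (2, 1)]
input_texture = pack(population, width=2)
fitness_texture = run_fragment_program(negative_sphere, input_texture)
fitness_values = unpack(fitness_texture)
```

On real hardware each texel would hold a fixed number of color channels, so longer chromosomes would have to be spread over several texels or textures; that detail is omitted here.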

Artificial Neural Networks

Bohn (1998) used GPGPU to reduce training time in Kohonen's feature maps. In this case, the bigger the map, the greater the time reduction obtained with the GPU. On 128x128 maps, time was similar using CPU and GPU, but on 512x512 maps the GPU was almost 3.5 times faster than the CPU, increasing to 5.3 times faster on 1024x1024 maps. This was one of the first implementations of GPGPU, made on a non-programmable graphics system, a Silicon Graphics Infinite Reality workstation.

Later, with programmable hardware, Oh (2004) used the GPU to accelerate the computation of the output of a multilayer perceptron ANN. The system was applied to pattern recognition, obtaining a computing time 20 times lower than that of the CPU implementation.

Considering another kind of ANN, Zhongwen (2005) used GPGPU to reduce computing time in training Self-Organizing Maps (SOMs). The bigger the SOM, the greater the reduction. Whereas with 128x128 neuron maps the computing time was similar between CPU and GPU, with 512x512 neuron maps the training process was 4 times faster using the GPU implementation.

Bernhard (2006) used the GPU to simulate the Spiking Neurons model. This ANN model both requires highly intensive calculations and has a parallel nature, so it fits GPGPU computation very well. The authors made different implementations depending on the neural network application. In the first case, an image segmentation algorithm was implemented using a locally-excitatory globally-inhibitory Spiking Neural Network (SNN). In this experiment, the authors obtained results up to 10 times faster. In the second case, SNNs were used for image segmentation with an algorithm based on histogram clustering, where the ANN minimized the objective function. Here the speed was also improved up to 10 times.

Seoane (2007) showed the acceleration of multilayer perceptron ANN training using GA. GPGPU techniques for ANN computing allowed it to be accelerated up to 11 times. The company Evolved Machines (Evolved Machines Website) uses the powerful performance of GPUs to simulate neural computation, obtaining results up to 100 times faster than CPU computation.

Evolutionary Computation

In EC-related work, Yu (2005) describes how parallel genetic algorithms can be mapped onto low-cost graphics hardware. In their approach, chromosomes and fitness values are stored in textures. Fitness calculation and genetic operators were implemented using fragment programs on the GPU. Different population sizes applied to the Colville minimization problem were used for testing, with larger populations giving greater time reductions. For a 128x128 population, the computation of the genetic operators on the GPU was 11.8 times faster than on the CPU, whereas for a 512x512 population that rate increased to 20.1. For the fitness function computation, the rates were 7.9 and 17.1 respectively.

In another work, Wong (2006) implemented Hybrid Genetic Algorithms on the GPU incorporating the Cauchy mutation operator. All algorithm steps were implemented in graphics hardware, except random number generation. In this approach, a pseudo-deterministic method was proposed for the selection process, allowing significant running-time reductions. The GPU implementation was 3 times faster than the CPU one.

Fok (2007) showed how to implement evolutionary algorithms on the GPU. Since the crossover operators of GAs require more complex calculations than mutation operators, the authors studied a GPU implementation of Evolutionary Programming, which uses only mutation operators. Tests were carried out with the Cauchy distribution on 5 different optimization problems, obtaining results between 1.25 and 5 times faster.
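A Cauchy mutation operator of the kind mentioned above can be sketched in a few lines. The Python example below is a CPU illustration under stated assumptions and does not reproduce the cited GPU implementations: it draws Cauchy-distributed steps by the standard inverse-CDF method, x = scale * tan(pi * (u - 0.5)) with u uniform on [0, 1), and adds one step to each gene. The function names and the `scale` parameter are hypothetical.

```python
import math
import random


def cauchy_sample(rng, scale=1.0):
    """Draw one Cauchy-distributed value using the inverse-CDF method."""
    u = rng.random()
    return scale * math.tan(math.pi * (u - 0.5))


def cauchy_mutate(individual, rng, scale=1.0):
    """Return a mutated copy: every gene receives an independent Cauchy step."""
    return [gene + cauchy_sample(rng, scale) for gene in individual]


rng = random.Random(1)
parent = [0.0, 0.0, 0.0]
child = cauchy_mutate(parent, rng, scale=0.1)
```

Compared with Gaussian steps, the heavy tails of the Cauchy distribution occasionally produce very large jumps, which is the usual motivation for this operator. Since each gene is mutated independently, the operator fits the map pattern of the stream programming model.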

FUTURE TRENDS

Nowadays GPUs are very powerful and they are evolving quite fast. On the one hand, there are more and more programmable elements in GPUs; on the other, programming languages are becoming full-featured. There are more and more implementations of different kinds of general-purpose algorithms that take advantage of these features. In the field of Artificial Intelligence the number of developments is rather low, in spite of the great number of existing algorithms and their high computing requirements. Using GPUs to extend existing implementations therefore seems very promising. For instance, some examples of speeding up ANN simulations have been shown; however, there are no works on accelerating training times. Likewise, the same ideas can be applied to implement other kinds of ANN architectures or AI techniques, as in the field of genetic programming, where there are no developments either.

CONCLUSION

This paper has introduced general-purpose programming on GPUs. They have been shown to be powerful parallel processors whose programming capabilities allow them to be used for general-purpose, compute-intensive algorithms. Based on this idea, existing GPU implementations of AI models such as ANNs and EC have been described, all of them with a considerable reduction in computing time. General-purpose computing on GPUs and its use for accelerating AI algorithms provide great advantages and are an essential contribution in applications where computing time is a decisive factor.

REFERENCES

Akenine-Möller, T. & Haines, E. (2002). Real-Time Rendering. Second Edition. A.K. Peters.

Bernhard, F. & Keriven, R. (2006). Spiking Neurons on GPUs. International Conference on Computational Science – ICCS 2006. 236-243.

Bohn, C.-A. (1998). Kohonen Feature Mapping through Graphics Hardware. In 3rd International Conference on Computational Intelligence and Neurosciences.

Buck, I. & Purcell, T. (2004). A toolkit for computation on GPUs. In GPU Gems, R. Fernando, editor. Addison-Wesley, 621-636.

Evolved Machines Website. (n.d.). Retrieved June 4, 2007 from http://www.evolvedmachines.com/

Fok, K.L., Wong, T.T. & Wong, M.L. (2007). Evolutionary Computing on Consumer-Level Graphics Hardware. IEEE Intelligent Systems. 22(2), 69-78.

General-Purpose Computation Using Graphics Hardware Website. (n.d.). Retrieved June 4, 2007 from http://www.gpu.org

Harris, M. (2005). Mapping computational concepts to GPUs. In GPU Gems 2, M. Pharr, editor. Addison-Wesley, 493-508.

HLSL Shaders. (n.d.). Retrieved June 4, 2007 from http://msdn.microsoft.com/archive/default.asp?url=/archive/en-us/directx9_c_Dec_2005/HLSL_Shaders.asp

Horn, D. (2005). Stream reduction operations for GPGPU applications. In GPU Gems 2, M. Pharr, editor. Addison-Wesley, 573-589.

Lefohn, A., Kniss, J. & Owens, J. (2005). Implementing efficient parallel data structures on GPUs. In GPU Gems 2, M. Pharr, editor. Addison-Wesley, 521-545.

Luebke, D. (2006). General-Purpose Computation on Graphics Hardware. In Supercomputing 2006 Tutorial on GPGPU.

Mark, W.R., Glanville, R.S., Akeley, K. & Kilgard, M.J. (2003). Cg: a system for programming graphics hardware in a C-like language. ACM Trans. Graph. ACM Press, 22(3), 896-907.

Oh, K.-S. & Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition. 37(6), 1311-1314.

OpenGL Shading Language Information Site. (n.d.). Retrieved June 4, 2007, from http://developer.3dlabs.com/openGL2/index.htm

Owens, J. (2005a). Streaming architectures and technology trends. In GPU Gems 2, M. Pharr, editor. Addison-Wesley, 457-470.

Owens, J. (2005b). The GPGPU Programming Model. In General Purpose Computation on Graphics Hardware. IEEE Visualization 2005 Tutorial.

Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A.E. & Purcell, T.J. (2007). A Survey of General-Purpose Computation on Graphics Hardware. COMPUTER GRAPHICS forum. 26(1), 80-113.

Seoane, A., Rabuñal, J. R. & Pazos, A. (2007). Aceleración del Entrenamiento de Redes de Neuronas Artificiales mediante Algoritmos Genéticos utilizando la GPU. Actas del V Congreso Español sobre Metaheurísticas, Algorítmos Evolutivos y Bioinspirados - MAEB 2007. 69-76.

Wong, M.L. & Wong, T.T. (2006). Parallel Hybrid Genetic Algorithms on Consumer-Level Graphics Hardware. IEEE Congress on Evolutionary Computation. 2973-2980.

Yu, Q., Chen, C. & Pan, Z. (2005). Parallel Genetic Algorithms on Programmable Graphics Hardware. Lecture Notes in Computer Science. 1051-1059.

Zhongwen, L., Hongzhi, L., Zhengping, Y. & Xincai, W. (2005). Self-Organizing Maps Computing on Graphic Process Unit. ESANN’2005 proceedings - European Symposium on Artificial Neural Networks. 557-562.

KEY TERMS

Fragment: Potential pixel containing all the necessary information (color, depth, etc.) to generate the final fragment color.

Fragment Processor: Graphics system element that receives a set of fragments as input and processes them to obtain pixels, writing them to a target buffer. Present GPUs have multiple fragment processors working in parallel, which can be programmed using fragment programs.

Graphics Pipeline: Architecture oriented to three-dimensional graphics, composed of several stages that run sequentially.

Graphics Processing Unit (GPU): Electronic device designed for graphics rendering in computers. Its architecture is specialized in graphics calculations.

General-Purpose Computation on GPUs (GPGPU): Trend in computing dedicated to implementing general-purpose algorithms on graphics devices, called GPUs. At present, the high programmability and performance of GPUs allow developers to run classical algorithms on these devices to speed up non-graphics applications, especially algorithms of a parallel nature.

Pixel: Abbreviation of Picture Element, used to refer to the points of a graphic image.

Rasterizer: Graphics Pipeline element that, from graphic primitives, provides appropriate fragments to a target buffer.

Render-to-Texture: GPU feature that allows storing the fragment processor output in a texture instead of a screen buffer.

Stream Programming Model: Parallel programming model based on defining, on the one hand, sets of input and output data, called streams, and on the other hand, compute-intensive operations, called kernel functions, to be applied sequentially on the streams.

Texture: In computer graphics, a digital image used to modify the appearance of a three-dimensional object. The operation that wraps a texture around an object is called texture mapping. In GPGPU, a texture can be considered a data stream.

Vertex: In computer graphics, a clearly defined point in three-dimensional space that is processed by the Graphics Pipeline. Relationships can be established between vertices (such as triangles) to assemble structures that define a three-dimensional object. In GPGPU, a vertex array can be considered a data stream.

Vertex Processor: Graphics system component that receives a set of 3D vertices as input and processes them to obtain 2D screen positions. Present GPUs have multiple vertex processors working in parallel, which can be programmed using vertex programs.


Improving the Naïve Bayes Classifier

Liwei Fan
National University of Singapore, Singapore

Kim Leng Poh
National University of Singapore, Singapore

INTRODUCTION

A Bayesian Network (BN) establishes a relationship between graphs and probability distributions. In the past, BNs were mainly used for knowledge representation and reasoning. Recent years have seen numerous successful applications of BNs in classification, among which the Naïve Bayes classifier was found to be surprisingly effective in spite of its simple mechanism (Langley, Iba & Thompson, 1992). It is built upon the strong assumption that different attributes are independent of each other. Despite its many advantages, a major limitation of the Naïve Bayes classifier is that real-world data may not always satisfy the independence assumption among attributes. This strong assumption can make the prediction accuracy of the Naïve Bayes classifier highly sensitive to correlated attributes. To overcome this limitation, many approaches have been developed to improve the performance of the Naïve Bayes classifier. This article gives a brief introduction to the approaches that attempt to relax the independence assumption among attributes, or that use certain pre-processing procedures to make the attributes as independent of each other as possible. Previous theoretical and empirical results have shown that these approaches can significantly improve the performance of the Naïve Bayes classifier, although the computational complexity also increases to a certain extent.
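For reference, the independence assumption can be written out explicitly. The notation below is an assumption introduced here for illustration (it does not appear in the article): $c$ denotes a class value and $a_1, \dots, a_n$ the observed attribute values. Bayes' rule gives the first equality, and the assumption that attributes are independent of each other given the class gives the factorized form on the right:

$$P(c \mid a_1, \dots, a_n) = \frac{P(c)\, P(a_1, \dots, a_n \mid c)}{P(a_1, \dots, a_n)} \propto P(c) \prod_{i=1}^{n} P(a_i \mid c)$$

The classifier predicts the class $c$ that maximizes the right-hand side. When attributes are strongly correlated, the product counts the same evidence more than once, which is the sensitivity the approaches surveyed in this article try to reduce.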

BACKGROUND

The Naïve Bayes classifier, also called the simple Bayesian classifier, is essentially a simple BN. Since no structure learning is required, it is very easy to construct and implement a Naïve Bayes classifier. Despite its simplicity, the Naïve Bayes classifier is competitive with other more advanced and sophisticated classifiers such as decision trees (Friedman, Geiger & Goldszmidt, 1997). Owing to these advantages, the Naïve Bayes classifier has gained great popularity in solving different classification problems. Nevertheless, its independence assumption among attributes is often violated in the real world. Fortunately, many approaches have been developed to alleviate this problem.

In general, these approaches can be divided into two groups. One attempts to relax the independence assumption of the Naïve Bayes classifier, e.g. Semi-Naïve Bayes (SNB) (Kononenko, 1991), Searching for Dependencies (Pazzani, 1995), the Tree Augmented Naïve Bayes (TAN) (Friedman, Geiger & Goldszmidt, 1997), SuperParent Tree Augmented Naïve Bayes (SP-TAN) (Keogh & Pazzani, 1999), Lazy Bayes Rule (LBR) (Zheng & Webb, 2000) and Aggregating One-Dependence Estimators (AODE) (Webb, Boughton & Wang, 2005). The other group uses certain pre-processing procedures to select or transform the attributes so that they better fit the assumption of the Naïve Bayes classifier. Feature selection can be implemented by greedy forward search (Langley & Sage, 1994) and Decision Trees (Ratanamahatana & Gunopulos, 2002). The transformation techniques include Principal Component Analysis (PCA) (Gupta, 2004), Independent Component Analysis (ICA) (Prasad, 2004) and CC-ICA (Bressan & Vitria, 2002). The next section describes the main ideas of the two groups of techniques in a broad way.

IMPROVING THE NAÏVE BAYES CLASSIFIER

This section introduces the two groups of approaches that have been used to improve the Naïve Bayes classifier. In the first group, the strong independence assumption is relaxed by restricted structure learning. The second group helps to select some major (and approximately independent) attributes from the original attributes or transform them into some new attributes, which can then be used by the Naïve Bayes classifier.

Relaxing the Independence Assumption

Relaxing the independence assumption means that dependencies between attributes are taken into account when constructing the network. To consider these dependencies, Kononenko (1991) proposed the Semi-Naïve Bayes classifier (SNB), which joins attributes based on the theorem of Chebyshev. Medical diagnostic data were used to compare the performance of SNB and NB. In two domains the results were identical, while in the other two SNB slightly improved the performance. Nevertheless, this method may cause overfitting problems. Another limitation of SNB is that the number of parameters grows exponentially with the number of attributes that need to be joined. In addition, the exhaustive search for attributes to join may affect the computational time.

Pazzani (1995) used Forward Sequential Selection and Joining (FSSJ) and Backward Sequential Elimination and Joining (BSEJ) to search for dependencies and join the attributes. The two methods were tested on UCI data, and BSEJ provided the most improvement.

Friedman et al. (Friedman, Geiger & Goldszmidt, 1997) found that Kononenko's and Pazzani's methods can be represented as an augmented Naïve Bayes network, which includes some subgraphs. They restricted the network to be a Tree Augmented Naïve Bayes (TAN) network, which spans all attributes and can be learned by tree-structure learning algorithms. Results on problems from the UCI repository showed that the TAN classifier outperforms the Naïve Bayes classifier. It is also competitive with C4.5 while maintaining computational simplicity. However, the TAN classifier is limited to problems with discrete attributes; continuous attributes must be prediscretized. To address this problem, Friedman et al. (Friedman, Goldszmidt & Lee, 1998) extended TAN to deal with continuous attributes via parametric and semiparametric conditional probabilities.

Keogh & Pazzani (1999) proposed a variant of the TAN classifier, SP-TAN, which could result in better performance than TAN. The performance of SP-TAN is also competitive with the Lazy Bayes Rule (LBR), in which lazy learning techniques are used in the Naïve Bayes classifier (Zheng & Webb, 2000; Wang & Webb, 2002). Although LBR and SP-TAN have outstanding performance on the testing data, the main disadvantage of the two methods is their high computational complexity.

Aggregating One-Dependence Estimators (AODE), developed by Webb et al. (Webb, Boughton & Wang, 2005), avoids model selection, which may reduce computational complexity and lead to lower variance. These advantages have been demonstrated by empirical results. It has also been found empirically that the average prediction accuracy of AODE is comparable to that of LBR and SP-TAN but with lower variance. Therefore, AODE might be more suitable for small datasets due to its lower variance.
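The idea behind AODE can be sketched for categorical attributes as follows. This is a simplified illustration under stated assumptions, not the implementation of Webb et al.: the class name `AODE`, the parameter `min_parent_count` and the particular Laplace smoothing are choices made here. Each attribute in turn acts as a shared "parent" on which all other attributes depend (together with the class), and the resulting one-dependence estimates are summed, so no single structure has to be selected.

```python
from collections import Counter

class AODE:
    """Illustrative averaged one-dependence estimator for categorical data."""

    def __init__(self, min_parent_count=1):
        # Hypothetical threshold: a value must be seen this often to act as parent.
        self.m = min_parent_count

    def fit(self, rows, labels):
        self.n = len(rows)
        self.classes = sorted(set(labels))
        n_attrs = len(rows[0])
        self.values = [sorted({r[j] for r in rows}) for j in range(n_attrs)]
        self.value_count = Counter()  # (i, x_i) -> count
        self.pair = Counter()         # (y, i, x_i) -> count
        self.triple = Counter()       # (y, i, x_i, j, x_j) -> count
        for row, y in zip(rows, labels):
            for i, xi in enumerate(row):
                self.value_count[(i, xi)] += 1
                self.pair[(y, i, xi)] += 1
                for j, xj in enumerate(row):
                    if j != i:
                        self.triple[(y, i, xi, j, xj)] += 1
        return self

    def joint_scores(self, row):
        """Sum over parents i of P(y, x_i) * prod_{j != i} P(x_j | y, x_i)."""
        scores = {}
        for y in self.classes:
            total = 0.0
            for i, xi in enumerate(row):
                if self.value_count[(i, xi)] < self.m:
                    continue  # too rare to serve as a parent
                # Laplace-smoothed estimate of P(y, x_i)
                p = (self.pair[(y, i, xi)] + 1) / (
                    self.n + len(self.classes) * len(self.values[i]))
                for j, xj in enumerate(row):
                    if j != i:
                        # Laplace-smoothed estimate of P(x_j | y, x_i)
                        p *= (self.triple[(y, i, xi, j, xj)] + 1) / (
                            self.pair[(y, i, xi)] + len(self.values[j]))
                total += p
            scores[y] = total
        return scores

    def predict(self, row):
        scores = self.joint_scores(row)
        return max(scores, key=scores.get)
```

An XOR-style dataset, where the class depends on the interaction of two attributes, is a case the plain Naïve Bayes classifier cannot represent but a one-dependence model can. Note that if no attribute value qualifies as a parent, all scores are zero; a fuller implementation would need a fallback for that case.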

Using Pre-Processing Procedures

In general, the pre-processing procedures for the Naïve Bayes classifier include feature selection and transformation of the original attributes. The Selective Bayes classifier (SBC) (Langley & Sage, 1994) deals with correlated features by including only some of the attributes in the final classifier. It uses a greedy method to search the space and forward selection to select the attributes. In that study, six UCI datasets were used to compare the performance of the Naïve Bayes classifier, SBC and C4.5. Selecting the attributes was found to improve the performance of the Naïve Bayes classifier when there are redundant attributes. In addition, SBC was found to be competitive with C4.5 on the datasets for which C4.5 outperforms the Naïve Bayes classifier. Ratanamahatana & Gunopulos (2002) applied C4.5 to select the attributes for the Naïve Bayes classifier. Interestingly, experimental results showed that the attributes selected by C4.5 can make the Naïve Bayes classifier outperform C4.5 on a number of datasets.

Transforming the attributes is another useful pre-processing procedure for the Naïve Bayes classifier. Gupta (2004) found that Principal Component Analysis (PCA) helped to improve the classification accuracy and reduce the computational complexity. Prasad (2004) applied Independent Component Analysis (ICA) to all the training data and found that the Naïve Bayes classifier integrated with ICA performed better than C4.5 and IB1 integrated with ICA. Bressan and Vitria (2002) proposed class-conditional ICA (CC-ICA) as a pre-processing procedure for the Naïve Bayes classifier, and found that the CC-ICA based Naïve Bayes classifier outperformed the pure Naïve Bayes classifier.

Based on UCI datasets, a detailed comparative study of PCA, ICA and CC-ICA for the Naïve Bayes classifier has been carried out by Fan & Poh (2007). PCA attempts to transform the original data into a new uncorrelated dataset, while ICA attempts to transform them into a new dataset with independent attributes. CC-ICA is built upon the idea that ICA is used to make the attributes as independent as possible within each class. In this way, the new attributes satisfy the independence assumption of the Naïve Bayes classifier more closely than those obtained from PCA and ICA. The datasets were limited to continuous datasets, as required by the three pre-processing procedures. The results showed that all three pre-processing procedures can improve the performance of the Naïve Bayes classifier, probably because transforming the attributes weakens the dependence among them. In addition, the discrepancy between the performance of ICA and PCA integrated with the Naïve Bayes classifier is not large, which may indicate that PCA and ICA are competitive in improving the performance of the Naïve Bayes classifier. The improvement obtained from the three pre-processing procedures also became larger as the number of attributes increased. From the methodological point of view, the CC-ICA pre-processing procedure seems to be more plausible than PCA and ICA for the Naïve Bayes classifier (Bressan and Vitria, 2002; Vitria, Bressan, & Radeva, 2007).

The experimental results by Fan & Poh (2007) also showed that CC-ICA integrated with the Naïve Bayes classifier outperforms PCA and ICA integrated with the Naïve Bayes classifier in terms of classification accuracy. However, CC-ICA requires more training data, since there must be enough training data for each class. It is therefore suggested that the choice of a suitable pre-processing procedure should depend on the characteristics of the datasets, e.g. the sample size for each class.
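The greedy forward selection used for feature selection earlier in this section can be sketched in a few lines of Python. This is an illustration only, not the procedure of Langley & Sage: the function names are hypothetical, the classifier is a Laplace-smoothed Naïve Bayes for categorical attributes, and scoring candidate attribute sets by accuracy on a separate validation set is an assumption made here. `train` and `valid` are `(rows, labels)` pairs.

```python
import math
from collections import Counter

def train_nb(rows, labels, attrs):
    """Collect the counts a categorical Naive Bayes model needs."""
    class_count = Counter(labels)
    cond = Counter()  # (attribute index, value, class) -> count
    values = {j: set() for j in attrs}
    for row, y in zip(rows, labels):
        for j in attrs:
            cond[(j, row[j], y)] += 1
            values[j].add(row[j])
    return class_count, cond, values

def predict_nb(model, row, attrs):
    """Return the class with the highest Laplace-smoothed log posterior."""
    class_count, cond, values = model
    n = sum(class_count.values())
    best, best_score = None, -math.inf
    for y in sorted(class_count):
        score = math.log(class_count[y] / n)
        for j in attrs:
            score += math.log((cond[(j, row[j], y)] + 1) /
                              (class_count[y] + len(values[j])))
        if score > best_score:
            best, best_score = y, score
    return best

def accuracy(train, valid, attrs):
    """Train on `train` using only `attrs`, then score on `valid`."""
    model = train_nb(train[0], train[1], attrs)
    hits = sum(predict_nb(model, r, attrs) == y for r, y in zip(*valid))
    return hits / len(valid[1])

def forward_select(train, valid):
    """Greedy forward search: add the best attribute until nothing improves."""
    remaining = set(range(len(train[0][0])))
    selected, best_acc = [], accuracy(train, valid, [])
    while remaining:
        best_j, best_j_acc = None, best_acc
        for j in sorted(remaining):
            acc = accuracy(train, valid, selected + [j])
            if acc > best_j_acc:
                best_j, best_j_acc = j, acc
        if best_j is None:
            break  # no single attribute improves accuracy; never reconsider
        selected.append(best_j)
        remaining.remove(best_j)
        best_acc = best_j_acc
    return selected, best_acc
```

The search never revisits an earlier choice, which keeps the number of evaluated attribute sets quadratic in the number of attributes rather than exponential.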

FUTURE TRENDS

With the development of algorithms for learning BNs, relaxing the independence assumption is a promising way to improve the performance of the Naïve Bayes classifier. However, relaxing the independence assumption all the way to an unrestricted BN is not appropriate. Friedman et al. (Friedman, Geiger, & Goldszmidt, 1997) compared the Naïve Bayes classifier with Bayesian Networks and found that using an unrestricted BN did not improve the accuracy. On the contrary, it even reduced the accuracy in some domains. Therefore, other restricted BNs may be used to improve the performance while keeping the simplicity of the Naïve Bayes classifier. Effective and simple learning algorithms are also important for improving the performance.

On the other hand, with the development of machine learning algorithms, more pre-processing procedures are expected to be developed for selecting or transforming the attributes. One possible way to obtain better performance is to combine feature selection with transformation techniques in the pre-processing step. Among the alternative pre-processing techniques, the most promising one might be ICA. The reason is that the motivation of the pre-processing procedures is to derive attributes that satisfy the independence assumption of the Naïve Bayes classifier, while the objective of ICA is to find independent components. However, there are also some limitations on the use of ICA, e.g. the requirements of continuous datasets and a large number of training samples. How to overcome these limitations is therefore a potential area for future research.

CONCLUSION

This article briefly discusses the techniques that can be used to improve the performance of the Naïve Bayes classifier. The general idea is to overcome the limitation of the strong independence assumption of the Naïve Bayes classifier. Relaxing the strong assumption is a natural way to do so and has been studied from different viewpoints. All the approaches for relaxing the assumption discussed in this article are restricted Bayesian Networks, which are still the most practicable techniques. In addition, pre-processing procedures are also very useful for making the attributes satisfy the independence assumption. However, using these approaches increases the computational complexity to a certain extent. It would be useful to model correlations among appropriate attributes that can be captured by a simple restricted structure with good performance.

REFERENCES

Bressan, M., Vitria, J. (2002). Improving Naïve Bayes Using Class-conditional ICA. Advances in Artificial Intelligence - IBERAMIA 2002. 1-10.

Cheng, J., & Greiner, R. (1999). Comparing Bayesian Network Classifiers. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. 101-107.

Fan, L., Poh, K.L. (2007). A Comparative Study of PCA, ICA and Class-conditional ICA for Naïve Bayes Classifier. Computational and Ambient Intelligence. Lecture Notes in Computer Science. (4507), 16-22.

Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian Network Classifiers. Machine Learning. (29) 131-163.

Friedman, N., Goldszmidt, M., & Lee, T.J. (1998). Bayesian Network Classification with Continuous Attributes: Getting the best of Both Discretization and Parametric Fitting. Proceedings of the Fifteenth International Conference on Machine Learning. 179-187.

Gupta, G. K. (2004). Principal Component Analysis and Bayesian Classifier Based on Character Recognition. Bayesian Inference and Maximum Entropy Methods in Science and Engineering. AIP Conference Proceedings. (707), 465-479.

Keogh, E., & Pazzani, M.J. (1999). Learning Augmented Bayesian Classifiers: A Comparison of Distribution-based and Classification-based Approaches. Proceedings of the International Workshop on Artificial Intelligence and Statistics. 225-230.

Kononenko, I. (1991). Semi-Naïve Bayesian Classifier. Proceedings of the sixth European Working Session on Learning. 206-219.

Langley, P., Iba, W., & Thompson, K. (1992). An Analysis of Bayesian Classifiers. Proceedings of the Tenth National Conference on Artificial Intelligence. AAAI Press, San Jose, CA. 223-228.

Langley, P. & Sage, S. (1994). Induction of Selective Bayesian Classifiers. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence. 399-406.

Pazzani, M.J. (1995). Searching for Dependencies in Bayesian Classifiers. Proceedings of the fifth International Workshop on Artificial Intelligence and Statistics. 424-429.

Prasad, M.N., Sowmya, A., Koch, I. (2004). Feature Subset Selection using ICA for Classifying Emphysema in HRCT Images. Proceedings of the 17th International Conference on Pattern Recognition. 515-518.

Ratanamahatana, C.A. & Gunopulos, D. (2002). Feature selection for the naive Bayesian classifier using decision trees. Applied Artificial Intelligence. (17), 475-487.

Vitria, J., Bressan, M., & Radeva, P. (2007). Bayesian Classification of Cork Stoppers Using Class-conditional Independent Component Analysis. IEEE Transactions on Systems, Man and Cybernetics. (37), 32-38.

Wang, Z., & Webb, G.I. (2002). Comparison of Lazy Bayesian Rule and Tree-Augmented Bayesian Learning. Proceedings of the IEEE International Conference on Data Mining. 775-778.

Webb, G., Boughton, J.R., & Wang, Z. (2005). Not so Naïve Bayes: Aggregating One-Dependence Estimators. Machine Learning. (58), 5-24.

Zheng, Z., & Webb, G. (2000). Lazy Learning of Bayesian Rules. Machine Learning. (41), 53-87.

KEY TERMS

Decision Trees: A decision tree is a classifier in the form of a tree structure, where each node is either a leaf node or a decision node. A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance. A well-known and frequently used decision tree algorithm is C4.5.

Forward Selection and Backward Elimination: A forward selection method starts with the empty set and successively adds attributes, while a backward elimination process begins with the full set and removes unwanted ones.

Greedy Search: At each point in the search, the algorithm considers all local changes to the current set of attributes, makes its best selection, and never reconsiders this choice.

Independent Component Analysis (ICA): ICA is a newly developed technique for finding hidden factors or components to give a new representation of multivariate data. ICA can be thought of as a generalization of PCA. PCA tries to find uncorrelated variables to represent the original multivariate data, whereas ICA attempts to obtain statistically independent variables to represent the original multivariate data.

Naïve Bayes Classifier: The Naïve Bayes classifier, also called the simple Bayesian classifier, is essentially a simple Bayesian Network (BN). There are two underlying assumptions in the Naïve Bayes classifier. First, all attributes are independent of each other given the classification variable. Second, all attributes are directly dependent on the classification variable. The Naïve Bayes classifier computes the posterior of the classification variable given a set of attributes by using the Bayes rule under the conditional independence assumption.

Principal Component Analysis (PCA): PCA is a popular tool for multivariate data analysis, feature extraction and data compression. Given a set of multivariate measurements, the purpose of PCA is to find a set of variables with less redundancy. The redundancy is measured by correlations between data elements.

UCI Repository: A repository of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.


Incorporating Fuzzy Logic in Data Mining Tasks

Lior Rokach
Ben Gurion University, Israel

In this chapter we discuss how fuzzy logic extends the envelope of the main data mining tasks: clustering, classification, regression and association rules. We begin by presenting a formulation of data mining using fuzzy logic attributes. Then, for each task, we provide a survey of the main algorithms and a detailed description (i.e. pseudo-code) of the most popular algorithms.

INTRODUCTION

There are two main types of uncertainty in supervised learning: statistical and cognitive. Statistical uncertainty deals with the random behavior of nature, and all existing data mining techniques can handle the uncertainty that arises (or is assumed to arise) in the natural world from statistical variations or randomness. Cognitive uncertainty, on the other hand, deals with human cognition. Fuzzy set theory, first introduced by Zadeh in 1965, deals with cognitive uncertainty and seeks to overcome many of the problems found in classical set theory. For example, a major problem faced by researchers in control theory is that a small change in input can result in a major change in output, which throws the whole control system into an unstable state. A further problem was that the representation of subjective knowledge was artificial and inaccurate. Fuzzy set theory is an attempt to confront these difficulties, and in this chapter we show how it can be used in data mining tasks.

BACKGROUND

Data mining is a term coined to describe the process of sifting through large and complex databases to identify valid, novel, useful, and understandable patterns and relationships. Data mining involves inferring algorithms that explore the data, develop the model and discover previously unknown patterns. The model is used for understanding phenomena from the data, analysis and prediction. The accessibility and abundance of data today makes knowledge discovery and data mining a matter of considerable importance and necessity.

We begin by presenting some of the basic concepts of fuzzy logic. The main focus, however, is on those concepts used in the induction process when dealing with data mining. Since fuzzy set theory and fuzzy logic are much broader than the narrow perspective presented here, the interested reader is encouraged to read Zimmermann (2005).

In classical set theory, a certain element either belongs or does not belong to a set. Fuzzy set theory, on the other hand, permits the gradual assessment of the membership of elements in relation to a set. Let U be a universe of discourse, representing a collection of objects denoted generically by u. A fuzzy set A in a universe of discourse U is characterized by a membership function $\mu_A$ which takes values in the interval [0, 1], where $\mu_A(u) = 0$ means that u is definitely not a member of A and $\mu_A(u) = 1$ means that u is definitely a member of A.

The above definition can be illustrated on the vague set of Young. In this case the set U is the set of people. For each person in U, we define the degree of membership in the fuzzy set Young. The membership function answers the question "to what degree is person u young?". The easiest way to do this is with a membership function based on the person's age. For example, Figure 1 presents the following membership function:

$$\mu_{Young}(u) = \begin{cases} 0 & age(u) > 32 \\ 1 & age(u) < 16 \\ \dfrac{32 - age(u)}{16} & \text{otherwise} \end{cases} \tag{1}$$

Figure 1. Membership function for the young set

Given this definition, John, who is 18 years old, has a degree of youth of 0.875. Philip, 20 years old, has a degree of youth of 0.75. Unlike in probability theory, degrees of membership do not have to add up to 1 across all objects, and therefore either many or few objects in the set may have high membership. However, an object's membership in a set (such as "young") and in the set's complement ("not young") must still sum to 1.

The main difference between classical set theory and fuzzy set theory is that the latter admits partial set membership. A classical or crisp set, then, is a fuzzy set that restricts its membership values to {0,1}, the endpoints of the unit interval. Membership functions can be used to represent a crisp set. For example, Figure 2 presents a crisp membership function defined as:

$$\mu_{CrispYoung}(u) = \begin{cases} 0 & age(u) > 22 \\ 1 & age(u) \le 22 \end{cases} \tag{2}$$

Figure 2. Membership function for the crisp young set
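The two membership functions, equations (1) and (2), translate directly into code. The short Python sketch below is only an illustration; the function names are chosen here and are not part of the chapter. It reproduces the degrees of youth quoted for John and Philip and shows the complement rule.

```python
def mu_young(age):
    """Fuzzy membership in the set Young, following equation (1)."""
    if age > 32:
        return 0.0
    if age < 16:
        return 1.0
    return (32 - age) / 16

def mu_crisp_young(age):
    """Crisp membership in Young, following equation (2): only 0 or 1."""
    return 1.0 if age <= 22 else 0.0

def complement(membership):
    """Membership in the complement set, e.g. 'not young'."""
    return 1.0 - membership

john = mu_young(18)     # 0.875
philip = mu_young(20)   # 0.75
not_young_john = complement(john)  # 0.125, so the two degrees sum to 1
```

Note that the crisp version jumps from 1 to 0 between ages 22 and 23, whereas the fuzzy version declines gradually between 16 and 32.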

In regular classification problems, we assume that each instance takes one value for each attribute and that each instance is classified into only one of the mutually exclusive classes. To illustrate how fuzzy logic can help data mining tasks, we introduce the problem of modeling the preferences of TV viewers. In this problem there are 3 input attributes: A = {Time of Day, Age Group, Mood}. The classification can be the movie genre that the viewer would like to watch, such as C = {Action, Comedy, Drama}. All the attributes are vague by definition. For example, people's feelings of happiness, indifference, sadness, sourness and grumpiness are vague, without any crisp boundaries between them. Although the vagueness of "Age Group" or "Time of Day" can be avoided by indicating the exact age or exact time, a rule induced with a crisp decision tree may then have an artificial crisp boundary, such as "IF Age < 16 THEN action movie". But what about someone who is 17 years of age? Should this viewer definitely not watch an action movie? The viewer's preferred genre may also be vague. For example, the viewer may be in the mood for both comedy and drama movies. Moreover, the association of movies with genres may also be vague. For instance, the movie "Lethal Weapon" (starring Mel Gibson and Danny Glover) is considered to be both a comedy and an action movie.

A fuzzy concept can be introduced into a classical data mining task if at least one of the attributes is fuzzy. In the example described above, both input and target attributes are fuzzy. Formally, the problem is defined as follows: Each class $c_j$ is defined as a fuzzy set on the universe of objects U. The membership function $\mu_{c_j}(u)$ indicates the degree to which object u belongs to class $c_j$. Each attribute $a_i$ is defined as a linguistic attribute which takes linguistic values from $dom(a_i) = \{v_{i,1}, v_{i,2}, \dots, v_{i,|dom(a_i)|}\}$. Each linguistic value $v_{i,k}$ is also a fuzzy set defined on U. The membership $\mu_{v_{i,k}}(u)$ specifies the degree to which object u's attribute $a_i$ is $v_{i,k}$. Recall that the membership of a linguistic value can be subjectively assigned or transferred from numerical values by a membership function defined on the range of the numerical value.
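As an illustration of transferring numerical values to linguistic values, the sketch below fuzzifies an exact age into memberships for three linguistic values of "Age Group". Everything specific in it is an assumption made for illustration: the triangular shape of the membership functions, the names of the linguistic values and their breakpoints are hypothetical and are not taken from the chapter.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical linguistic values for the attribute "Age Group",
# each given as (left, peak, right) breakpoints in years.
AGE_GROUP = {
    "young": (0, 15, 35),
    "middle-aged": (25, 45, 65),
    "old": (55, 80, 120),
}

def fuzzify(value, terms):
    """Map a numerical value to a membership degree per linguistic value."""
    return {name: triangular(value, *abc) for name, abc in terms.items()}

memberships = fuzzify(30, AGE_GROUP)
# A 30-year-old viewer is partly "young" and partly "middle-aged":
# {'young': 0.25, 'middle-aged': 0.25, 'old': 0.0}
```

Because adjacent linguistic values overlap, a single object can belong to several of them at once, which is exactly what removes the artificial crisp boundary discussed above.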

Typically, before one can incorporate fuzzy concepts into a data mining application, an expert is required to provide the fuzzy sets for the quantitative attributes, along with their corresponding membership functions (Mitra and Pal, 2005). Alternatively the appropriate fuzzy sets are determined using fuzzy clustering.

MAIN FOCUS OF THE CHAPTER

Fuzzy Supervised Learning

In this section we survey supervised methods that incorporate fuzzy sets. Supervised methods attempt to discover the relationship between input attributes and a target attribute (sometimes referred to as a dependent variable). The relationship discovered is represented in a structure referred to as a model. Usually models describe and explain phenomena that are hidden in the dataset, and they can be used for predicting the value of the target attribute from the values of the input attributes. It is useful to distinguish between two main kinds of supervised models: classification models (classifiers) and regression models. Regression models map the input space into a real-valued domain. For instance, a regressor can predict the demand for a certain product given its characteristics. Classifiers, on the other hand, map the input space into pre-defined classes.

Fuzzy set theoretic concepts can be incorporated at the input, at the output, or into the backbone of the classifier. The data can be presented in fuzzy terms and the output decision may be provided as fuzzy membership values (Peng, 2004). In this chapter we will concentrate on fuzzy decision trees. The interested reader is encouraged to read also about soft regression (Shnaider et al., 1997) and neuro-fuzzy systems (Mitra and Hayashi, 2000; Nauck, 1997).

A decision tree is a predictive model which can be used to represent classifiers. Decision trees are frequently used in applied fields such as finance, marketing, engineering and medicine. Decision trees are self-explanatory: there is no need to be an expert in data mining in order to follow a certain decision tree. There are several algorithms for the induction of fuzzy decision trees (Olaru and Wehenkel, 2003); most of them extend existing decision tree methods, such as Fuzzy-CART (Jang, 1994) and Fuzzy-ID3 (Cios and Sztandera, 1992; Maher and Clair, 1993). Another complete framework for building a fuzzy tree, including several inference procedures based on conflict resolution in rule-based systems and efficient approximate reasoning methods, was presented in (Janikow, 1998).

In this section we will focus on the algorithm proposed in Yuan and Shaw (1995). This algorithm can handle classification problems with both fuzzy attributes and fuzzy classes represented in linguistic fuzzy terms. It can also handle other situations in a uniform way: numerical values can be fuzzified to fuzzy terms, and crisp categories can be treated as a special case of fuzzy terms with zero fuzziness. The algorithm uses classification ambiguity as fuzzy entropy. The classification ambiguity directly measures the quality of classification rules at the decision node. It can be calculated under fuzzy partitioning and multiple fuzzy classes.

When a certain attribute is numerical, it needs to be fuzzified into linguistic terms before it can be used in the algorithm (Hong et al., 1999). The fuzzification process can be performed manually by experts or can be derived automatically using some sort of clustering algorithm. Clustering groups the data instances into subsets in such a manner that similar instances are grouped together, while different instances belong to different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled.

One can use a simple algorithm to generate a set of membership functions on numerical data. Assume attribute $a_i$ has numerical value x from the domain X. We can cluster X into k linguistic terms $v_{i,j}$, j = 1,...,k. The value of k is manually predefined. Figure 3 illustrates the creation of four groups defined on the age attribute: "young", "early adulthood", "middle-aged" and "old age". Note that the first set ("young") and the last set ("old age") have a trapezoidal form which can be uniquely described by the four corners. For example, the "young" set could be represented as (0,0,16,32). In between, all other sets ("early adulthood" and "middle-aged") have a triangular form which can be uniquely described by the three corners. For example, the set "early adulthood" is represented as (16,32,48).

Figure 3. Membership function for various groups in the age attribute

The induction algorithm of the fuzzy decision tree measures the classification ambiguity associated with each attribute and splits the data using the attribute with the smallest classification ambiguity. The classification ambiguity of attribute $a_i$ with linguistic terms $v_{i,j}$, j = 1,...,k on fuzzy evidence S, denoted as $G(a_i \mid S)$, is the weighted average of classification ambiguity calculated as:

$$G(a_i \mid S) = \sum_{j=1}^{k} w(v_{i,j} \mid S) \cdot G(v_{i,j} \mid S) \qquad (3)$$

where $w(v_{i,j} \mid S)$ is the weight which represents the relative size of $v_{i,j}$ and is defined as:

$$w(v_{i,j} \mid S) = \frac{M(v_{i,j} \mid S)}{\sum_{k} M(v_{i,k} \mid S)} \qquad (4)$$

Here $M(\cdot)$ denotes the cardinality of a fuzzy set, i.e., the sum of its membership values.
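As an illustration of the fuzzification step and of the weights in Equation 4, the following minimal Python sketch may help. It uses the corners quoted in the text for "young" and "early adulthood"; the corners of "middle-aged" and "old age", the upper age limit of 120, the sample ages, and all function names are assumptions made only for this example.

```python
def trapezoid(x, a, b, c, d):
    """Membership of x in a trapezoidal fuzzy set with corners a <= b <= c <= d."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


def triangle(x, a, b, c):
    """A triangular set is a trapezoid whose two top corners coincide."""
    return trapezoid(x, a, b, b, c)


# "young" and "early adulthood" follow the text; the other two sets and the
# upper limit 120 are assumed for illustration.
AGE_TERMS = {
    "young": lambda x: trapezoid(x, 0, 0, 16, 32),
    "early adulthood": lambda x: triangle(x, 16, 32, 48),
    "middle-aged": lambda x: triangle(x, 32, 48, 64),
    "old age": lambda x: trapezoid(x, 48, 64, 120, 120),
}


def fuzzify(age):
    """Membership grade of a numerical age in every linguistic term."""
    return {term: mu(age) for term, mu in AGE_TERMS.items()}


def term_weights(ages):
    """Equation 4: relative size of each term, with M taken as a sum of memberships."""
    sizes = {term: sum(mu(a) for a in ages) for term, mu in AGE_TERMS.items()}
    total = sum(sizes.values())
    return {term: size / total for term, size in sizes.items()}


ages = [10, 24, 40, 70]          # hypothetical sample of viewers
memberships_24 = fuzzify(24)     # a 24-year-old is partly "young", partly "early adulthood"
weights = term_weights(ages)
```

With these assumed sets, neighbouring terms overlap so that the grades of any age sum to 1, which is one common design choice rather than a requirement of the algorithm.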

The classification ambiguity of $v_{i,j}$ is defined as

$$G(v_{i,j} \mid S) = g\big(\vec{p}(C \mid v_{i,j})\big),$$

which is measured based on the possibility distribution vector

$$\vec{p}(C \mid v_{i,j}) = \big(p(c_1 \mid v_{i,j}), \ldots, p(c_{|C|} \mid v_{i,j})\big).$$

Given $v_{i,j}$, the possibility of classifying an object to class $c_l$ can be defined as:

$$p(c_l \mid v_{i,j}) = \frac{S(v_{i,j}, c_l)}{\max_{k} S(v_{i,j}, c_k)} \qquad (5)$$
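The quantities in Equations 3 to 5 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the original authors' implementation: it takes the cardinality M(A) as the sum of memberships and the subsethood as S(A,B) = M(A∩B)/M(A) with min as the intersection, which the chapter does not spell out, and it uses the nonspecificity measure g of Equation 6 with the convention that the entry following the last sorted possibility is 0. All function names and membership values are hypothetical.

```python
import math


def subsethood(mu_a, mu_b):
    """S(A, B) = M(A and B) / M(A), using min as the fuzzy intersection (assumed)."""
    size_a = sum(mu_a)
    if size_a == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(mu_a, mu_b)) / size_a


def possibility(mu_term, class_mus):
    """Equation 5: subsethood of the term in each class, normalised by the maximum."""
    s = [subsethood(mu_term, mu_c) for mu_c in class_mus]
    top = max(s)
    return [value / top if top > 0 else 0.0 for value in s]


def nonspecificity(p):
    """Equation 6: sum of (p*_i - p*_{i+1}) * ln(i) over the sorted possibilities."""
    ranked = sorted(p, reverse=True) + [0.0]   # p*_{n+1} = 0 by convention
    return sum((ranked[i] - ranked[i + 1]) * math.log(i + 1) for i in range(len(p)))


def attribute_ambiguity(term_mus, class_mus):
    """Equations 3 and 4: size-weighted average ambiguity over the terms of one attribute."""
    sizes = [sum(mu) for mu in term_mus]
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum((size / total) * nonspecificity(possibility(mu, class_mus))
               for mu, size in zip(term_mus, sizes))


# Hypothetical memberships of four viewers in two age terms and two genres.
young = [1.0, 0.8, 0.2, 0.0]
old = [0.0, 0.2, 0.8, 1.0]
comedy = [0.9, 0.7, 0.1, 0.0]
drama = [0.1, 0.3, 0.9, 1.0]

g_age = attribute_ambiguity([young, old], [comedy, drama])
```

An attribute whose terms each point clearly to one class yields a value of g close to 0, while an attribute whose terms are equally compatible with all classes yields ln of the number of classes.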

Here S(A,B) is the fuzzy subsethood, which measures the degree to which A is a subset of B. The subsethood can be used to measure the truth level of classification rules. For example, given a classification rule such as "IF Age is Young AND Mood is Happy THEN Comedy", we have to calculate S(Young ∩ Happy, Comedy) in order to measure the truth level of the classification rule.

The function $g(\vec{p})$ is the possibilistic measure of ambiguity or nonspecificity and is defined as:

$$g(\vec{p}) = \sum_{i=1}^{|\vec{p}|} \left(p_i^{*} - p_{i+1}^{*}\right) \cdot \ln(i) \qquad (6)$$

where $\vec{p}^{\,*} = (p_1^{*}, \ldots, p_{|\vec{p}|}^{*})$ is the permutation of the possibility distribution $\vec{p}$ sorted such that $p_i^{*} \geq p_{i+1}^{*}$, with $p_{|\vec{p}|+1}^{*} = 0$.

All the above calculations are carried out at a predefined significance level α. An instance is taken into consideration in a certain branch $v_{i,j}$ only if its corresponding membership is greater than α. This parameter is used to filter out insignificant branches.

After partitioning the data using the attribute with the smallest classification ambiguity, the algorithm looks for nonempty branches. For each nonempty branch, the algorithm calculates the truth level of classifying all instances within the branch into each class. The truth level is calculated using the fuzzy subsethood measure S(A,B). If the truth level of one of the classes is above a predefined threshold β, then no additional partitioning is needed and the node becomes a leaf in which all instances are labeled with the class with the highest truth level. Otherwise the procedure continues in a recursive manner. Note that small values of β will lead to smaller trees with the risk of underfitting. A higher β may lead to a larger tree with higher classification accuracy. However, at a certain point, higher values of β may lead to overfitting.

In a regular decision tree, only one path (rule) can be applied for every instance. In a fuzzy decision tree, several paths (rules) can be applied for one instance. In order to classify an unlabeled instance, the following steps should be performed:

• Step 1: Calculate the membership of the instance for the condition part of each path (rule). This membership will be associated with the label (class) of the path.
• Step 2: For each class, calculate the maximum membership obtained from all applied rules.
• Step 3: An instance may be classified into several classes with different degrees based on the membership calculated in Step 2.
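The three classification steps can be sketched as follows. This is an illustrative sketch only: the chapter does not fix the operator used to combine the conditions of a rule, so min is assumed for the AND, and the rule set, attribute terms and membership values are hypothetical.

```python
def classify(instance, rules):
    """Return the degree to which an instance belongs to each class.

    instance: {attribute: {linguistic term: membership grade}}
    rules:    list of (conditions, label), conditions being {attribute: term}
    """
    degrees = {}
    for conditions, label in rules:
        # Step 1: membership of the instance in the condition part of the rule
        # (min is assumed as the fuzzy AND).
        firing = min(instance[attr][term] for attr, term in conditions.items())
        # Step 2: keep the maximum membership obtained for each class.
        degrees[label] = max(degrees.get(label, 0.0), firing)
    # Step 3: the instance belongs to several classes with different degrees.
    return degrees


# Hypothetical paths of a small fuzzy decision tree.
rules = [
    ({"Age": "Young", "Mood": "Happy"}, "Comedy"),
    ({"Age": "Young", "Mood": "Sad"}, "Drama"),
    ({"Age": "Old"}, "Drama"),
]

viewer = {
    "Age": {"Young": 0.7, "Old": 0.3},
    "Mood": {"Happy": 0.6, "Sad": 0.4},
}

result = classify(viewer, rules)   # {'Comedy': 0.6, 'Drama': 0.4}
```

If a single crisp label is needed, one option is to pick the class with the largest degree, although doing so discards the graded answer that motivates the fuzzy tree.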

Fuzzy Clustering

The goal of clustering is descriptive, whereas that of classification is predictive. Since the goal of clustering is to discover a new set of categories, the new groups are of interest in themselves, and their assessment is intrinsic. In classification tasks, however, an important part of the assessment is extrinsic, since the groups must reflect some reference set of classes. Clustering groups data instances into subsets in such a manner that similar instances are grouped together, while different instances belong to different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled. Formally, the clustering structure is represented as a set of subsets $C = \{C_1, \ldots, C_k\}$ of S, such that:

$$S = \bigcup_{i=1}^{k} C_i$$

and $C_i \cap C_j = \emptyset$ for $i \neq j$. Consequently, any instance in S belongs to exactly one subset.

Traditional clustering approaches generate partitions; in a partition, each instance belongs to one and only one cluster. Hence, the clusters in a hard clustering are disjoint. Fuzzy clustering (Nasraoui and Krishnapuram, 1997; Shnaider et al., 1997) extends this notion and suggests a soft clustering schema. In this case, each pattern is associated with every cluster using some sort of membership function; namely, each cluster is a fuzzy set of all the patterns. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership value.

The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. FCM is an iterative algorithm. The aim of FCM is to find cluster centers (centroids) that minimize a dissimilarity function. To accommodate the introduction of fuzzy partitioning, the membership matrix U is randomly initialized according to Equation 7:

$$\sum_{i=1}^{c} u_{ij} = 1, \quad \forall j = 1, \ldots, n \qquad (7)$$

The algorithm minimizes a dissimilarity (or distance) function which is given in Equation 8:

$$J(U, c_1, c_2, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2} \qquad (8)$$

where $u_{ij}$ is between 0 and 1; $c_i$ is the centroid of cluster i; $d_{ij}$ is the Euclidean distance between the i-th centroid and the j-th data point $x_j$; and m > 1 is a weighting exponent. To reach a minimum of the dissimilarity function, two conditions must hold. These are given in Equation 9 and Equation 10:

$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \qquad (9)$$

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left(\dfrac{d_{ij}}{d_{kj}}\right)^{2/(m-1)}} \qquad (10)$$

By iteratively updating the cluster centers and the membership grades for each data point, FCM moves the cluster centers to the "right" location within a data set. However, FCM is not guaranteed to converge to an optimal solution, and the random initialization of U may have a lasting effect on the final result.
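The alternating updates of Equations 9 and 10 can be sketched in plain Python as below. This is a minimal sketch rather than a reference implementation: the function name, the stopping rule based on the largest change in membership, the default m = 2, and the toy data are assumptions made for illustration.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=200, tol=1e-9, seed=0):
    """Return (centers, u), where u[i][j] is the membership of point j in cluster i."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Equation 7: random memberships, normalised so that each point's grades sum to 1.
    u = [[rng.random() + 1e-12 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        col = sum(u[i][j] for i in range(c))
        for i in range(c):
            u[i][j] /= col
    centers = []
    for _ in range(max_iter):
        # Equation 9: each centroid is a mean of the points weighted by u_ij^m.
        centers = []
        for i in range(c):
            w = [u[i][j] ** m for j in range(n)]
            total = sum(w)
            centers.append([sum(w[j] * points[j][d] for j in range(n)) / total
                            for d in range(dim)])
        # Equation 10: memberships from the ratios of distances to the centroids.
        change = 0.0
        for j in range(n):
            dist = [math.dist(centers[i], points[j]) for i in range(c)]
            zeros = [i for i in range(c) if dist[i] == 0.0]
            if zeros:  # the point coincides with one or more centroids
                new = [1.0 / len(zeros) if i in zeros else 0.0 for i in range(c)]
            else:
                new = [1.0 / sum((dist[i] / dist[k]) ** (2.0 / (m - 1.0))
                                 for k in range(c))
                       for i in range(c)]
            for i in range(c):
                change = max(change, abs(new[i] - u[i][j]))
                u[i][j] = new[i]
        if change < tol:
            break
    return centers, u


# Two well-separated one-dimensional groups (hypothetical data).
data = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
centers, u = fuzzy_c_means(data, c=2)
```

Because the result depends on the random initial U, running the sketch with several seeds and keeping the solution with the lowest value of Equation 8 is a common precaution.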

Fuzzy Association Rules

Association rules are rules of the kind "70% of the customers who buy wine and cheese also buy grapes". While the traditional field of application is market basket analysis, association rule mining has since been applied to various fields, which has led to a number of important modifications and extensions. A fuzzy association algorithm is proposed in Komem and Schneider (2005). The quantitative values are first transformed into a set of membership grades by using predefined membership functions. Every membership grade represents the agreement of a quantitative value with a linguistic term. In order to avoid discriminating the importance level of data, each point must have a membership grade of 1 in one membership function; thus, the membership functions of each attribute produce a continuous line of µ = 1. Additionally, in order to diagnose the bias direction of an item from the center of a membership function region, almost every point gets another membership grade, lower than 1, in a neighboring membership function region. Thus, each end of a membership function region is touching, close to, or slightly overlapping an end of another membership function (except the outside regions, of course). By this mechanism, as point "a" moves right, further from the center of the region "middle", it gets a higher value of the label "middle-high", in addition to the value 1 of the label "middle".

FUTURE TRENDS

Some of the challenges of using fuzzy theory in data mining tasks include the following:

1. Incorporation of domain knowledge for improving the fuzzy modeling.
2. Developing methods for presenting fuzzy data models to the end-users.
3. Efficient integration of fuzzy logic in data mining tools.
4. A hybridization of fuzzy sets with data mining techniques.

CONCLUSIONS

This chapter discussed how fuzzy logic can be used to solve several different data mining tasks, namely classification, clustering, and discovery of association rules. The discussion focused mainly on one representative algorithm for each of these tasks. Broadly speaking, there are at least two motivations for using fuzzy logic in data mining. First, as mentioned earlier, fuzzy logic can produce more abstract and flexible patterns, since many quantitative features are involved in data mining tasks. Second, the crisp usage of metrics is better replaced by fuzzy sets that can reflect, in a more natural manner, the degree of belongingness/membership to a class or a cluster.

REFERENCES

Cios, K. J., & Sztandera, L. M. (1992). Continuous ID3 algorithm with fuzzy entropy measures, Proc. IEEE Internat. Conf. on Fuzzy Systems, pp. 469-476.

Hong, T.P., Kuo, C.S., & Chi, S.C. (1999). A Fuzzy Data Mining Algorithm for Quantitative Values. Third International Conference on Knowledge-Based Intelligent Information Engineering Systems. Proceedings. IEEE, pp. 480-483.

Jang, J. (1994). Structure determination in fuzzy modeling: A fuzzy CART approach, in Proc. IEEE Conf. Fuzzy Systems, pp. 480-485.

Janikow, C.Z. (1998). Fuzzy Decision Trees: Issues and Methods, IEEE Transactions on Systems, Man, and Cybernetics, 28(1): 1-14.

Komem, J., & Schneider, M. (2005). On the Use of Fuzzy Logic in Data Mining, in The Data Mining and Knowledge Discovery Handbook, O. Maimon, L. Rokach (Eds.), Springer, pp. 517-533.

Maher, P. E., & Clair, D. C. (1993). Uncertain reasoning in an ID3 machine learning framework, in Proc. 2nd IEEE Int. Conf. Fuzzy Systems, pp. 7-12.

Mitra, S., & Hayashi, Y. (2000). Neuro-fuzzy Rule Generation: Survey in Soft Computing Framework. IEEE Trans. Neural Networks, 11(3):748-768.

Mitra, S., & Pal, S. K. (2005). Fuzzy sets in pattern recognition and machine intelligence, Fuzzy Sets and Systems, 156(1):381-386.

Nasraoui, O., & Krishnapuram, R. (1997). A Genetic Algorithm for Robust Clustering Based on a Fuzzy Least Median of Squares Criterion, Proceedings of NAFIPS, Syracuse NY, pp. 217-221.

Nauck, D. (1997). Neuro-Fuzzy Systems: Review and Prospects, in Proc. Fifth European Congress on Intelligent Techniques and Soft Computing (EUFIT'97), Aachen, pp. 1044-1053.

Olaru, C., & Wehenkel, L. (2003). A complete fuzzy decision tree technique, Fuzzy Sets and Systems, 138(2):221-254.

Peng, Y. (2004). Intelligent condition monitoring using fuzzy inductive learning, Journal of Intelligent Manufacturing, 15(3): 373-380.

Shnaider, E., Schneider, M., & Kandel, A. (1997). A Fuzzy Measure for Similarity of Numerical Vectors, Fuzzy Economic Review, 2(1):17-38.

Yuan, Y., & Shaw, M. (1995). Induction of fuzzy decision trees, Fuzzy Sets and Systems, 69(1):125-139.

Zimmermann, H. J. (2005). Fuzzy Set Theory and its Applications, Springer, 4th edition.

KEY TERMS

Association Rules: Techniques that find in a database conjunctive implication rules of the form "X and Y implies A and B."

Attribute: A quantity describing an instance. An attribute has a domain defined by the attribute type, which denotes the values that can be taken by an attribute.

Classifier: A structured model that maps unlabeled instances to a finite set of classes.

Clustering: The process of grouping data instances into subsets in such a manner that similar instances are grouped together into the same cluster, while different instances belong to different clusters.

Data Mining: The core of the KDD process, involving the inferring of algorithms that explore the data, develop the model, and discover previously unknown patterns.

Fuzzy Logic: A type of logic that recognizes more than simple true and false values. With fuzzy logic, propositions can be represented with degrees of truthfulness and falsehood; thus it can deal with imprecise or ambiguous data. Boolean logic is considered to be a special case of fuzzy logic.

Instance: A single object of the world from which a model will be learned, or on which a model will be used.

Knowledge Discovery in Databases (KDD): A nontrivial exploratory process of identifying valid, novel, useful, and understandable patterns from large and complex data repositories.

Independent Subspaces

Lei Xu
Chinese University of Hong Kong, Hong Kong, & Peking University, Beijing, China

INTRODUCTION

Several unsupervised learning topics have been extensively studied, with wide applications, for decades in the literature of statistics, signal processing, and machine learning. The topics are mutually related and certain connections have been partly discussed, but a systematic overview is still needed. This article provides a unified perspective via a general framework of independent subspaces, with different topics featured by differences in choosing and combining three ingredients. Moreover, an overview is made via three streams of studies. The first consists of studies on the widely studied principal component analysis (PCA) and factor analysis (FA), featured by second order independence. The second consists of studies featured by higher order independence: independent component analysis (ICA), binary FA, and nonGaussian FA. The third is called mixture based learning, which combines individual jobs to fulfill a complicated task. The extensive literature makes it impossible to provide a complete review. Instead, we aim at sketching a roadmap for each stream, with attention to those topics missing in the existing surveys and textbooks, and limited to the author's knowledge.

A GENERAL FRAMEWORK OF INDEPENDENT SUBSPACES

A number of unsupervised learning topics are characterized by how they handle a fundamental task. As shown in Fig.1(b), every sample x is projected into $\hat{x}$ on a manifold, and the error $e = x - \hat{x}$ of using $\hat{x}$ to represent x is minimized collectively on a set of samples. One widely studied situation is that the manifold is a subspace represented by linear coordinates, e.g., spanned by three linearly independent basis vectors $a_1, a_2, a_3$ as shown in Fig.1(a). So, $\hat{x}$ can be represented by its projection $y^{(j)}$ on each basis vector, i.e.,

$$\hat{x} = \sum_{j=1}^{3} y^{(j)} a_j \quad \text{or} \quad x = \hat{x} + e = Ay + e, \quad y = [y^{(1)}, y^{(2)}, y^{(3)}]^T. \qquad (1)$$

Typically, the error $e = x - \hat{x}$ is measured by the square norm, which is minimized when e is orthogonal to $\hat{x}$. Collectively, the minimization of the average error $\|e\|^2$ on a set of samples, or of its expectation $E\|e\|^2$, is characterized by the properties listed at the bottom of Fig.1(a).

Generally, the task consists of three ingredients, as shown in Fig.2. The first is how the error $e = x - \hat{x}$ is measured. Different measures define different projections. The square norm $d = \|e\|^2$ applies to a homogeneous medium between x and $\hat{x}$. Other measures are needed for inhomogeneous mediums. In Fig.1(c), a non-orthogonal but still linear projection is considered via $d = \|e\|_B^2 = e^T \Sigma_e^{-1} e$ with $\Sigma_e^{-1} = B^T B$, as if e is first mapped to a homogeneous medium by the linear mapping $Be$ and then measured by the square norm. Shown at the bottom of Fig.1(c) are the properties of this $\min \|e\|_B^2$. As they are considerably different from those of $\min \|e\|^2$, more assumptions have to be imposed externally. The second ingredient is a coordinate system, given either by linear vectors as in Fig.1(a)&(c) or by a set of curves on a nonlinear manifold as in Fig.1(b). The third ingredient imposes a certain structure to further constrain how y is distributed within the coordinates, e.g., by the property (d).

The differences in choosing and combining the three ingredients lead to different approaches. We use the name "independent subspaces" to denote those structures in which the components of y are mutually independent, and obtain a general framework for accommodating several unsupervised learning topics. Subsequently, we summarize them via three streams of studies by considering

• $d = \|e\|_B^2 = e^T \Sigma_e^{-1} e$ and two special cases,
• three types of independence structure, and
• whether there is temporal structure among samples,

varying from one linear coordinate system to multiple linear coordinate systems at different locations, as shown in Fig.2.
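The basic task of Equation 1 under the square norm can be illustrated with a small worked example. The sketch below assumes, for simplicity, that the basis vectors are orthonormal, in which case each coordinate is just an inner product; for a merely linearly independent basis a least-squares solve would be needed instead. The vectors and function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(x, basis):
    """Project x onto the span of an orthonormal basis; return (y, x_hat, e)."""
    y = [dot(a, x) for a in basis]                       # coordinates y^(j)
    x_hat = [sum(y[j] * basis[j][d] for j in range(len(basis)))
             for d in range(len(x))]                     # x_hat = sum_j y^(j) a_j
    e = [xi - xh for xi, xh in zip(x, x_hat)]            # e = x - x_hat
    return y, x_hat, e


a1 = [1.0, 0.0, 0.0]
a2 = [0.0, 0.6, 0.8]      # unit length and orthogonal to a1
x = [2.0, 1.0, 3.0]

y, x_hat, e = project(x, [a1, a2])
orthogonality = dot(e, x_hat)   # close to 0: the error is orthogonal to x_hat
```

Here the sample decomposes into a part inside the two-dimensional subspace and a residual that is orthogonal to it, which is the situation the square-norm measure describes.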


Figure 1.

Figure 2.


STUDIES FEATURED BY SECOND ORDER INDEPENDENCE

We start by considering independently and identically distributed (i.i.d.) samples, represented in linear coordinates, with an independent structure in which each component $y_t^{(j)}$ is Gaussian and with the projection measure varying as illustrated within the first column of the table in Fig.2. We encounter factor analysis (FA) in the general case $d = \|e\|_B^2 = e^T B^T B e$. In the special case $B = \sigma_e I$, the linear coordinates span a principal subspace of the data. Further imposing $A^T A = I$ and requiring the columns of A to be given by the first m principal components (PCs), i.e., the eigenvectors that correspond to the largest eigenvalues of $\Sigma = (B^T B)^{-1}$, it becomes equivalent to PCA. Moreover, in the degenerate case e = 0, y = xW de-correlates the components of y, e.g., performing a pre-whitening as encountered in signal processing.

We summarize the studies on Roadmap A. The first stream originated 100 years ago. The first adaptive learning rule is the Oja rule, which finds the 1st-PC (i.e., the eigenvector that corresponds to the largest eigenvalue of Σ) without explicitly estimating Σ. To find multiple PCs, one way is an asymmetrical or a sequential implementation of the 1st-PC rule, which suffers from error accumulation. For details, see Refs.5,6,67,76,96 in (Xu, 2007a). The other way is finding the multi-PCs symmetrically, e.g., by the Oja subspace rule. Further studies are summarized in the following branches:

MCA, Dual Subspace, and TLS Fitting

In (Xu, Krzyzak&Oja, 1991), a dual pattern recognition is suggested by considering both the principal subspace and its complementary subspace, as well as both the multiple PCs and their complementary counterparts, i.e., the components that correspond to the smallest eigenvalues of Σ (the row vectors of U in Fig.2). Moreover, the first adaptive rule for getting the component that corresponds to the smallest eigenvalue of Σ is proposed by eqn.(11a) in (Xu, Krzyzak&Oja, 1991), under the name minor component analysis (MCA) first coined by Xu, Oja&Suen (1992), and it is also used for implementing a total least square (TLS) curve fitting. Subsequently, this topic was brought to the signal processing literature by Gao, Ahmad & Swamy (1992), motivated by a visit of Gao to Xu's office where Xu introduced him to the result of Xu, Oja&Suen (1992). Thereafter, adaptive MCA learning for TLS filtering became a popular topic in signal processing; see (Feng,Bao&Jiao,1998) and Refs.24,30,58,60 in (Xu,2007a).

It was also suggested in (Xu,Krzyzak&Oja,1992) that an implementation of PCA or MCA can be made by switching the updating sign in the above eqn.(11a). Efforts were subsequently made to examine whether the existing PCA rules remain stable after such a sign switching. These studies usually need tedious mathematical analyses of ODE stability, e.g., Chen & Amari (2001). An alternative way is turning the optimization of a PCA cost into a stable optimization of an induced cost for MCA, e.g., the LMSER cost is turned into one for the subspace spanned by multiple MCs (Xu, 1994, see Ref.111, Xu2007a). A general method is further given by eqns(24-26) in (Xu, 2003) and then discussed in (Xu, 2007a).

LMSER Learning and Subspace Tracking

A new adaptive PCA rule is derived from the gradient $\nabla E^2(W)$ for a least mean square error reconstruction (LMSER) (Xu,1991), together with the first proof of global convergence of the Oja subspace rule, a task that was previously regarded as difficult. Further comparative studies showed mathematically and experimentally that LMSER improves on the Oja rule; e.g., see (Karhunen,Pajunen&Oja,1998) and (Refs14,15,48,54,71,72, Xu2007a). Two years after (Xu,1991), this $E^2(W)$ was used for signal subspace tracking via a recursive least square technique (Yang,1993), then followed by others in the signal processing literature (Refs.33&55, Xu2007a). Also, PCA and subspace analysis can be performed by other theories or costs (Xu, 1994a&b). The algebraic and geometric properties of one of them, namely the relative uncertainty theory (RUT), were further analyzed by Fiori (2000&04, see Refs.25,29, Xu2007a). Moreover, the NIC criterion for subspace tracking is actually a special case of this RUT, which can be observed by comparing eqn.(20) in (Miao&Hua,1998) with the equation of $\rho_e$ at the end of Sec.III.B in (Xu,1994a).
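To make the idea of adaptive PCA concrete, the single-unit Oja rule that these studies build on can be sketched as below. It updates a weight vector w from one sample at a time by w ← w + η y (x − y w) with y = w·x, so the covariance Σ is never formed explicitly. This is a sketch of the standard textbook form of the rule, not of the LMSER or subspace rules themselves; the learning rate, the synthetic data and all names are assumptions for illustration.

```python
import math
import random


def oja_first_pc(samples, lr=0.02):
    """Estimate the first principal direction with the single-unit Oja rule."""
    dim = len(samples[0])
    w = [1.0] + [0.0] * (dim - 1)                  # arbitrary unit-length start
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))   # projection y = w . x
        # Hebbian term y*x plus a decay term -y^2*w that keeps |w| near 1.
        w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


# Synthetic zero-mean data: large variance along u, small variance along v.
rng = random.Random(0)
r = 1.0 / math.sqrt(2.0)
u_dir, v_dir = (r, r), (-r, r)
samples = []
for _ in range(5000):
    t, s = rng.gauss(0.0, 1.0), 0.1 * rng.gauss(0.0, 1.0)
    samples.append([t * u_dir[0] + s * v_dir[0], t * u_dir[1] + s * v_dir[1]])

w = oja_first_pc(samples)
norm = math.sqrt(sum(wi * wi for wi in w))
alignment = abs(w[0] * u_dir[0] + w[1] * u_dir[1]) / norm   # |cosine| with true PC
```

Roughly speaking, reversing the sign of such an update is the kind of switch discussed above for minor components, although stability then has to be examined separately.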

Principal Subspace vs. Multi-PCs

The Oja subspace rule does not truly find the multi-PCs, due to a rotation indeterminacy. Interestingly, it is demonstrated experimentally that adding a sigmoid function makes LMSER approximate the multi-PCs well (Xu,1991). Working at Harvard in the late summer of 1991, Xu became aware of Brockett (1991) and thus extended the Brockett flow of $n \times n$ orthogonal matrices to that of $n \times n_1$ orthogonal matrices with $n > n_1$, from which two learning rules that truly find the multi-PCs are obtained through modifying the LMSER rule and the Oja subspace rule. The two rules were included as eqns (13)&(14) in Xu (1993), which was submitted in 1991; they are independent of and also different from Oja (1992). Recently, Tanaka (2005) unified these rules into one expression controlled by one parameter, and a comparative study was made to show that eqn(14) in (Xu,1993) turned out to be the most promising one.

Figure 3.

Figure 4.

Adaptive Robust PCA

In the statistics literature, robust PCA was proposed to resist outliers via a robust estimator of Σ. Xu&Yuille (1992&95) generalized the rules of Oja, LMSER, and MCA into robust adaptive learning by statistical physics, related to the Huber M-estimators. Also, the PCA costs in (Xu,1994b) are extended to robust versions in Tab.2 of (Xu, 1994a). Thereafter, further efforts have been made, including its use in computer vision; e.g., see (Refs9,21,45,52, Xu2007a).

On Roadmap A, another branch consists of advances on FA, which includes PCA as its special case at $\Sigma_e = \sigma_e^2 I$. In the past decade there has been a renewed interest in FA: not only has the EM algorithm for FA been brought to implementing PCA, but an adaptive EM algorithm and other advances have also been developed with the help of the Bayesian Ying Yang (BYY) harmony learning.

SUBSPACES OF HIGHER ORDER INDEPENDENCE

Noticing the table in Fig.2, we proceed to the last two columns, where the distribution of each $y_t^{(j)}$ becomes nonGaussian. As shown at the left-upper corner on Roadmap B, the degenerate case e = 0 leads to the problem of solving x = Ay from samples of x and an independence constraint

$$p(y) = \prod_{j=1}^{m} p(y^{(j)}).$$

One way is solving induced nonlinear algebraic equations. Another way is called independent component analysis (ICA), tackled in the following four branches:

• Seeking extremes of the higher order cumulants of y.
• Using nonlinear Hebbian learning for removing higher order dependences among components of y, from which ICA studies actually originate.
• Optimizing a cost that is based directly on $p(y) = \prod_{j=1}^{m} p(y^{(j)})$. As shown on Roadmap B, the same updating equation is reached from several aspects, with the actual differences coming from pre-specifying the nonlinearity $f(y^{(j)})$. One choice works when the source components of y* are all subgaussian, while the other works when the components of y* are all supergaussian. This problem is solved by learning W and $f(y^{(j)})$ jointly via a parametric model. It is further found that a rough estimate of each source is already enough, which motivates the so-called one-bit-matching conjecture that was recently proved to be true mathematically (Xu, 2007b).
• Implementing nonlinear LMSER (Xu, 1991&93). For details, see Roadmap B.

Here, we add clarifications on two previous confusions. One relates to an omission of the origin of nonlinear LMSER. This has already been clarified in (Karhunen,Pajunen,&Oja,1998; Hyvarinen,Karhunen,&Oja,2001; Plumbley&Oja,2004), clearly spelling out that the nonlinear $E^2(W)$ and its adaptive gradient rule were both first proposed in (Xu, 1991&93). The second confusion is that ICA is usually regarded as a counterpart of PCA. As stated in (Xu,2001b&03) and observed from the table in Fig.2, ICA by y = xW is actually an extension of de-correlation analysis, in any combination of PCs and MCs. The counterpart of MCA is minor ICA (M-ICA), while the counterpart of PCA is principal ICA (P-ICA).

In fact, the concept "principal" emerges from $e_t = x_t - Ay \neq 0$. As shown within the table in Fig.2 and on the rightmost column on Roadmap B, as the distribution of each $y_t^{(j)}$ becomes nonGaussian, FA is extended to a binary FA (BFA) if y is binary, and to a nonGaussian FA (NFA) if y is real but nonGaussian. Similar to FA performing PCA at $\Sigma_e = \sigma_e^2 I$, both BFA and NFA perform a P-ICA at $\Sigma_e = \sigma_e^2 I$. Observing the first box in this column, for $e_t = x_t - Ay \neq 0$ we need to seek an appropriate nonlinear map y = f(x). It usually has no analytical solution and needs an expensive computation to approximate. As discussed in (Xu, 2003), nonlinear LMSER uses a sigmoid nonlinearity $y_t^{(j)} = s(z_t^{(j)})$, $z = xW$, to avoid the computing costs, and approximately implements a BFA for a Bernoulli $p(y^{(j)})$ with a probability $p_j = \frac{1}{N}\sum_{t=1}^{N} s(z_t^{(j)})$ and a NFA for $p(y^{(j)})$ with a pseudo uniform distribution on $(-\infty, +\infty)$, as well as a nonnegative ICA (Plumbley&Oja,2004) when $p(y^{(j)})$ is on $[0, +\infty)$. However, further quantitative analysis is needed for this approximation. Without approximation, the EM algorithm has been developed for maximum likelihood learning since 1997, still suffering from expensive computing costs. Favorably, further improvements have also been achieved by the BYY harmony learning. For details, see the rightmost column on Roadmap B.

Next, we move to multiple subspaces at different locations, as shown in Fig.2. Studies are summarized on Roadmap C, categorized according to one key point, i.e., a scheme $p_{\ell,t}$ that allocates a sample $x_t$ to different subspaces. This $p_{\ell,t}$ is based on two issues. One is a local measure of how suitable the $\ell$-th subspace is for representing $x_t$. The other is a mechanism that summarizes the local measures of the subspaces to yield $p_{\ell,t}$. One typical mechanism is that which emerges in the EM algorithm for maximum likelihood or Bayesian learning, where $x_t$ is fractionally allocated among the subspaces in proportion to their local measures. Another typical mechanism is that $x_t$ is nonlinearly allocated to one or more winners via a competition based on the local measures, e.g., as in the classic competitive learning and the rival penalized competitive learning (RPCL). Also, a scheme $p_{\ell,t}$ may come from blending both types of mechanisms, as that from the BYY harmony learning. For details, see (Xu,2007c) and its two http-sites.

TEMPORAL AND LOCALIZED EXTENSIONS

Another important task is how to determine the number k of subspaces and the dimension $m_\ell$ of each subspace. It is called model selection, and is usually implemented in two phases. First, a set of candidates is considered by enumerating k and $m_\ell$, with unknown parameters estimated by maximum likelihood learning. Second, the best among the candidates is selected by one of several criteria, such as AIC, CAIC, SIC/BIC/MDL, cross validation, etc. However, this two-phase implementation is computationally very expensive. Moreover, the performance degenerates considerably when the sample size is finite while k and $m_\ell$ are not too small. One trend is letting model selection be made automatically during learning, i.e., on a candidate with k and $m_\ell$ initially large enough, learning not only determines the unknown parameters but also automatically shrinks k and $m_\ell$ to appropriate values. Two such efforts are RPCL and the BYY harmony learning. For details, see (Xu,2007c) and its two http-sites. Also, there are open issues on x = Ay + e, e ≠ 0, with the components of y mutually independent in higher order statistics. Some are listed below:

We further consider temporal samples, shown at the bottom of the rightmost column on both Roadmap A and Roadmap B, via embedding a temporal structure in $p(y_t^{(j)} \mid \omega_t^{(j)})$. A typical one is using

$$\omega_t^{(j)} = f(\xi_t^{(j)}, \varphi_j), \qquad \xi_t^{(j)} = \{y_{t-\tau}^{(j)}\}_{\tau=1}^{q^{(j)}},$$

e.g., a linear regression

$$\omega_t^{(j)} = \sum_{\tau=1}^{q^{(j)}} \beta_\tau^{(j)} y_{t-\tau}^{(j)},$$

to turn a model (e.g., one in the table of Fig. 2) into its temporal extension. Information is carried over time in two ways. One is computing $\omega_t^{(j)}$ by the regression, with learning on $\xi_t^{(j)}$ made through the gradient with respect to $\varphi_j$ by a chain rule. The second is computing the integral of $p(y_t^{(j)} \mid \omega_t^{(j)})$ over the past values $\xi_t^{(j)}$ and getting the gradient with respect to $\varphi_j$. For details, see Xu (2000, 2001a, 2003).

FUTURE TRENDS


Figure 5.

• Which part of the unknown parameters in x = Ay + e can be determined uniquely?

• Under which conditions can the independence $p(y) = \prod_{j=1}^{m} p(y^{(j)})$ be ensured in concept? Can it be further achieved by a learning algorithm?

• In what sense can both ensuring $p(y) = \prod_{j=1}^{m} p(y^{(j)})$ and the best reconstruction of x by $\hat{x} = Ay$ be achieved simultaneously?

• If not, what is the best nonlinear y = f(x) in terms of both $p(y) = \prod_{j=1}^{m} p(y^{(j)})$ and e ≠ 0? Can such a best be obtained analytically or via effective computing?

CONCLUSION

Studies of three closely related unsupervised learning streams have been overviewed in an extensive scope and from a systematic perspective. A general framework of independent subspaces is presented, from which a number of learning topics are summarized via different features of choosing and combining the three basic ingredients.

ACKNOWLEDGMENT

The work is supported by the Chang Jiang Scholars Program of the Chinese Ministry of Education for a Chang Jiang Chair Professorship at Peking University.

REFERENCES

Brockett, R.W., (1991), Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems, Linear Algebra and Its Applications 146, 79-91.

Chen, T., & Amari, S., (2001), Unified stabilization approach to principal and minor components extraction algorithms, Neural Networks 14(10), 1377-1387.

Feng, D.Z., Bao, Z., & Jiao, L.C., (1998), Total least mean squares algorithm, IEEE Transactions Signal Processing 46, 2122-2130.

Gao, K., Ahmad, M.O., & Swamy, M.N., (1992), Learning algorithm for total least-squares adaptive signal processing, Electronic Letters 28(4), 430-432.

Hyvarinen, A., Karhunen, J., & Oja, E., (2001), Independent component analysis, John Wiley, NY.

Karhunen, J., Pajunen, P., & Oja, E., (1998), The nonlinear PCA criterion in blind source separation: relations with other approaches, Neurocomputing 22, 5-20.

Miao, Y.F., & Hua, Y.B., (1998), Fast subspace tracking and neural network learning by a novel information criterion, IEEE Transactions Signal Processing 46, 1967-79.

Oja, E., (1992), Principal components, minor components, and linear neural networks, Neural Networks 5, 927-935.

Oja, E., Ogawa, H., & Wangviwattana, J., (1991), Learning in nonlinear constrained Hebbian networks, Proc. ICANN'91, 385-390.

Plumbley, M.D., & Oja, E., (2004), A "nonnegative PCA" algorithm for independent component analysis, IEEE Transactions Neural Networks 15(1), 66-76.

Tanaka, T., (2005), Generalized weighted rules for principal components tracking, IEEE Transactions Signal Processing 53(4), 1243-1253.

Xu, L., (2007a), A unified perspective on advances of independent subspaces: basic, temporal, and local structures, Proc. 6th Intl. Conf. Machine Learning and Cybernetics, Hong Kong, 19-22 Aug. 2007, 767-776.

Xu, L., (2007b), One-bit-matching ICA theorem, convex-concave programming, and distribution approximation for combinatorics, Neural Computation 19, 546-569.

Xu, L., (2007c), A unified perspective and new results on RHT computing, mixture based learning, and multi-learner based problem solving, Pattern Recognition 40, 2129-2153. Also see http://www.scholarpedia.org/article/Rival_Penalized_Competitive_Learning and http://www.scholarpedia.org/article/Bayesian_Ying_Yang_Learning.

Xu, L., (2003), Independent component analysis and extensions with noise and time: A Bayesian Ying-Yang learning perspective, Neural Information Processing Letters and Reviews 1(1), 1-52.

Xu, L., (2001a), BYY harmony learning independent state space and generalized APT financial analyses, IEEE Transactions Neural Networks 12, 822-849.

Xu, L., (2001b), An Overview on Unsupervised Learning from Data Mining Perspective, Advances in Self-Organizing Maps, Allison et al, Eds., Springer, 181-210.

Xu, L., (2000), Temporal BYY learning for state space approach, hidden Markov model and blind source separation, IEEE Transactions Signal Processing 48, 2132-2144.

Xu, L., Cheung, C.C., & Amari, S., (1998), Learned parametric mixture based ICA algorithm, Neurocomputing 22, 69-80.

Xu, L., (1994a), Beyond PCA learning: from linear to nonlinear and from global representation to local representation, Proc. ICONIP94, Vol. 2, 943-949.

Xu, L., (1994b), Theories for unsupervised learning: PCA and its nonlinear extensions, Proc. IEEE ICNN94, Vol. II, 1252-1257.

Xu, L., (1993), Least mean square error reconstruction principle for self-organizing neural-nets, Neural Networks 6, 627-648.

Xu, L., Oja, E., & Suen, C.Y., (1992), Modified Hebbian learning for curve and surface fitting, Neural Networks 5, 393-407.

Xu, L., & Yuille, A.L., (1992&95), Robust PCA learning rules based on statistical physics approach, Proc. IJCNN92-Baltimore, Vol. I, 812-817. An extended version in IEEE Transactions Neural Networks 6, 131-143.

Xu, L., (1991), Least MSE reconstruction for self-organization, Proc. IJCNN91-Singapore, Vol. 3, 2363-73.

Xu, L., Krzyzak, A., & Oja, E., (1991), A neural net for dual subspace pattern recognition methods, International Journal Neural Systems 2(3), 169-184.

Yang, B., (1993), Subspace tracking based on the projection approach and the recursive least squares method, Proc. IEEE ICASSP93, Vol. IV, 145-148.

KEY TERMS

BYY Harmony Learning: A statistical learning theory for a two-pathway featured intelligent system via two complementary Bayesian representations of the joint distribution on the external observation and its inner representation, with both parameter learning and model selection determined by the principle that the two Bayesian representations become best harmony. See http://www.scholarpedia.org/article/Bayesian_Ying_Yang_Learning.

Factor Analysis: A set of samples $\{x_t\}_{t=1}^{N}$ is described by a linear model x = Ay + µ + e, where µ is a constant, y and e are both Gaussian and mutually uncorrelated, and the components of y are called factors and are mutually uncorrelated. Typically, the model is estimated by the maximum likelihood principle.

Independence Subspaces: A family of models, each of which consists of one or several subspaces. Each subspace is spanned by linearly independent basis vectors, and the corresponding coordinates are mutually independent.

Least Mean Square Error Reconstruction (LMSER): For an orthogonal projection of $x_t$ onto a subspace spanned by the column vectors of a matrix W, maximizing $\frac{1}{N}\sum_{t=1}^{N} \|W^T x_t\|^2$ subject to $W^T W = I$ is equivalent to minimizing the mean square error $\frac{1}{N}\sum_{t=1}^{N} \|x_t - \hat{x}_t\|^2$ by using the projection $\hat{x}_t = W W^T x_t$ as the reconstruction of $x_t$, which is reached when W spans the same subspace as that spanned by the PCs.

Minor Component (MC): Being orthogonally complementary to the PC, the solution of $\min_{w^T w = 1} J(w) = \frac{1}{N}\sum_{t=1}^{N} (w^T x_t)^2 = w^T \Sigma w$ is the MC, while the m-MCs refer to the columns of W that minimize $J(W) = \frac{1}{N}\sum_{t=1}^{N} \|W^T x_t\|^2 = \mathrm{Tr}[W^T \Sigma W]$ subject to $W^T W = I$.

Principal Component (PC): For samples $\{x_t\}_{t=1}^{N}$ with zero mean, the PC is a unit vector w originating at zero with a direction along which the average of the orthogonal projection of every sample is maximized, i.e., $\max_{w^T w = 1} J(w) = \frac{1}{N}\sum_{t=1}^{N} (w^T x_t)^2 = w^T \Sigma w$. The solution is the eigenvector of the sample covariance matrix $\Sigma = \frac{1}{N}\sum_{t=1}^{N} x_t x_t^T$ corresponding to the largest eigenvalue. Generally, the m-PCs refer to the m orthonormal vectors forming the columns of W that maximize $J(W) = \frac{1}{N}\sum_{t=1}^{N} \|W^T x_t\|^2 = \mathrm{Tr}[W^T \Sigma W]$.

Rival Penalized Competitive Learning: A development of competitive learning with the help of an appropriate balance between participating and leaving mechanisms, such that an appropriate number of agents or learners is allocated to learn the multiple structures underlying the observations. See http://www.scholarpedia.org/article/Rival_Penalized_Competitive_Learning.

Total Least Square (TLS) Fitting: Given samples $\{z_t\}_{t=1}^{N}$, $z_t = [y_t, x_t^T]^T$, instead of finding a vector w to minimize the error $\frac{1}{N}\sum_{t=1}^{N} \|y_t - w^T x_t\|^2$, TLS fitting finds an augmented vector $\tilde{w} = [w^T, c]^T$ such that the error $\frac{1}{N}\sum_{t=1}^{N} \|\tilde{w}^T z_t\|^2$ is minimized subject to $\tilde{w}^T \tilde{w} = 1$. The solution is the MC of $\{z_t\}_{t=1}^{N}$.
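The PC, MC and LMSER definitions in the key terms can be checked numerically. The following is a minimal sketch on synthetic zero-mean data (the data, sizes and variable names are assumptions made for this illustration); it uses an eigendecomposition of the sample covariance rather than any of the adaptive learning rules cited in the references.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 500, 4, 2
# Synthetic samples with different spreads along the coordinate axes.
X = rng.normal(size=(N, d)) * np.array([3.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)                        # enforce zero mean

Sigma = X.T @ X / N                        # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order

w_pc = eigvecs[:, -1]                      # principal component
w_mc = eigvecs[:, 0]                       # minor component
W = eigvecs[:, -m:]                        # m-PCs as orthonormal columns

J_pc = np.mean((X @ w_pc) ** 2)            # w^T Sigma w at the PC
J_mc = np.mean((X @ w_mc) ** 2)            # w^T Sigma w at the MC

# LMSER: reconstruct each sample by W W^T x_t (rows of X are samples).
X_hat = X @ W @ W.T
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
captured = np.mean(np.sum((X @ W) ** 2, axis=1))   # Tr[W^T Sigma W]
total = np.trace(Sigma)
print(J_pc, J_mc, mse, captured, total)
```

Because the reconstruction error and the captured variance add up to the total variance when W has orthonormal columns, maximizing one is the same as minimizing the other, which is the equivalence stated in the LMSER entry.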

Information Theoretic Learning

Deniz Erdogmus
Northeastern University, USA

Jose C. Principe
University of Florida, USA

INTRODUCTION Learning systems depend on three interrelated components: topologies, cost/performance functions, and learning algorithms. Topologies provide the constraints for the mapping, and the learning algorithms offer the means to find an optimal solution; but the solution is optimal with respect to what? Optimality is characterized by the criterion and in neural network literature, this is the least addressed component, yet it has a decisive influence in generalization performance. Certainly, the assumptions behind the selection of a criterion should be better understood and investigated. Traditionally, least squares has been the benchmark criterion for regression problems; considering classification as a regression problem towards estimating class posterior probabilities, least squares has been employed to train neural network and other classifier topologies to approximate correct labels. The main motivation to utilize least squares in regression simply comes from the intellectual comfort this criterion provides due to its success in traditional linear least squares regression applications – which can be reduced to solving a system of linear equations. For nonlinear regression, the assumption of Gaussianity for the measurement error combined with the maximum likelihood principle could be emphasized to promote this criterion. In nonparametric regression, least squares principle leads to the conditional expectation solution, which is intuitively appealing. Although these are good reasons to use the mean squared error as the cost, it is inherently linked to the assumptions and habits stated above. Consequently, there is information in the error signal that is not captured during the training of nonlinear adaptive systems under non-Gaussian distribution conditions when one insists on second-order statistical criteria. 
This argument extends to other linear-second-order techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), and canonical correlation analysis (CCA). Recent work tries to generalize these techniques to nonlinear scenarios by utilizing kernel techniques or other heuristics. This begs the question: what other alternative cost functions could be used to train adaptive systems and how could we establish rigorous techniques for extending useful concepts from linear and second-order statistical techniques to nonlinear and higher-order statistical learning methodologies?

BACKGROUND

This seemingly simple question is at the core of recent research on information theoretic learning (ITL) conducted by the authors, as well as research by others on alternative optimality criteria for robustness to outliers and faster convergence, such as different Lp-norm induced error measures (Sayed, 2005), the epsilon-insensitive error measure (Scholkopf & Smola, 2001), Huber's robust M-estimation theory (Huber, 1981), or Bregman's divergence based modifications (Bregman, 1967). Entropy is an uncertainty measure that generalizes the role of variance in Gaussian distributions by including information about the higher-order statistics of the probability density function (pdf) (Shannon & Weaver, 1964; Fano, 1961; Renyi, 1970; Csiszár & Körner, 1981). For on-line learning, information theoretic quantities must be estimated nonparametrically from data. A nonparametric expression that is differentiable and easy to approximate stochastically will enable importing useful concepts such as stochastic gradient learning and backpropagation of errors. The natural choice is kernel density estimation (KDE) (Parzen, 1967), due to its smoothness and asymptotic properties. The plug-in estimation methodology (Gyorfi & van der Meulen, 1990), combined with the definitions of Renyi (Renyi, 1970), provides a set of tools that are well-tuned for learning applications: tools suitable


for supervised and unsupervised, off-line and on-line learning. Renyi's definition of entropy for a random variable X is

$$H_\alpha(X) = \frac{1}{1-\alpha} \log \int p^\alpha(x)\,dx \tag{1}$$
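As a quick check on definition (1), the standard limit $\alpha \to 1$ recovers Shannon's entropy. Both the numerator $\log \int p^\alpha(x)\,dx$ and the denominator $1-\alpha$ vanish at $\alpha = 1$, so L'Hôpital's rule applies:

```latex
\lim_{\alpha \to 1} H_\alpha(X)
  = \lim_{\alpha \to 1}
    \frac{\dfrac{d}{d\alpha} \log \int p^\alpha(x)\,dx}{\dfrac{d}{d\alpha}(1-\alpha)}
  = \lim_{\alpha \to 1}
    \frac{\int p^\alpha(x) \log p(x)\,dx \,\big/ \int p^\alpha(x)\,dx}{-1}
  = -\int p(x) \log p(x)\,dx .
```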

This generalizes Shannon's linear additivity postulate to exponential additivity, resulting in a parametric family. Dropping the logarithm for optimization simplifies algorithms. Of specific interest is the quadratic entropy (α=2), because its sample estimator requires only one approximation (the density estimator itself) and an analytical expression for the integral can be obtained for kernel density estimates. Consequently, a sample estimator for quadratic entropy can be derived for Gaussian kernels of standard deviation σ on an iid sample set {x1,…,xN} as the sum of pairwise sample (particle) interactions (Principe et al, 2000):

$$\hat{H}_2(X) = -\log\left( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_i - x_j) \right) \tag{2}$$
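Estimator (2) translates almost directly into code. The sketch below is for scalar samples and a Gaussian kernel; the function names are chosen for this illustration, and the argument of the logarithm is named the information potential, following the term the article uses for this quantity.

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """One-dimensional Gaussian kernel with standard deviation sigma."""
    return np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def information_potential(x, sigma):
    """Average of all pairwise interactions G_{sigma*sqrt(2)}(x_i - x_j)."""
    x = np.asarray(x, dtype=float)
    diffs = x[:, None] - x[None, :]        # N x N matrix of pairwise differences
    return gaussian_kernel(diffs, np.sqrt(2.0) * sigma).mean()

def quadratic_entropy(x, sigma):
    """Sample estimator of Renyi's quadratic entropy, equation (2)."""
    return -np.log(information_potential(x, sigma))

tight = quadratic_entropy([0.0, 0.1], sigma=1.0)
spread = quadratic_entropy([0.0, 5.0], sigma=1.0)
print(tight, spread)
```

The double sum costs O(N²) kernel evaluations per entropy estimate, which is the motivation for the stochastic approximations mentioned later in the article.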

The pairwise interaction of samples through the kernel intriguingly provides a connection to the entropy of particles in physics. Particles interacting through information forces (as in the N-body problem in physics) can employ computational techniques developed for simulating such large scale systems. The use of entropy in training multilayer structures can be studied in the backpropagation of information forces framework (Erdogmus et al, 2002). The quadratic entropy estimator was employed in measuring divergences between probability densities and in blind source separation (Hild et al, 2006), blind deconvolution (Lazaro et al, 2005), and clustering (Jenssen et al, 2006). Quadratic expressions with mutual-information-like properties were introduced based on the Euclidean and Cauchy-Schwartz distances (ED/CSD). These offer computational simplicity and statistical stability in optimization (Principe et al, 2000). Following the concepts of information potential and information force, the pairwise-interaction estimator is generalized to use arbitrary kernels and any order α of entropy. The stochastic information gradient (SIG) is developed (Erdogmus et al, 2003) to train adaptive systems with a complexity comparable to the LMS (least-mean-square) algorithm, which is essential for training complex systems with large data sets. Supervised and unsupervised learning are unified under information-based criteria. It is possible to minimize error entropy in supervised regression, to maximize output entropy for unsupervised learning (factor analysis), to minimize mutual information between the outputs of a system to achieve independent components, or to maximize mutual information between the outputs and the desired responses to achieve optimal subspace projections in classification. Systematic comparisons of ITL with conventional MSE in system identification verified the advantage of the technique for nonlinear system identification and blind equalization of communication channels. Relationships with instrumental variables techniques were discovered and led to the error-whitening criterion for unbiased linear system identification in noisy-input-output data conditions (Rao et al, 2005).
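The notion of information forces can be made concrete by differentiating the information potential, i.e., the argument of the logarithm in the quadratic entropy estimator, with respect to each sample. The sketch below works under the assumption of scalar samples and a Gaussian kernel; the function names are hypothetical, and this is the full batch gradient, not the stochastic information gradient of Erdogmus et al (2003).

```python
import numpy as np

def information_potential(x, sigma):
    """V(x) = (1/N^2) sum_i sum_j G_{sigma*sqrt(2)}(x_i - x_j)."""
    diffs = x[:, None] - x[None, :]
    s2 = 2.0 * sigma**2                    # variance of the kernel G_{sigma*sqrt(2)}
    return np.mean(np.exp(-diffs**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2))

def information_forces(x, sigma):
    """Gradient of V with respect to each sample x_k:
    dV/dx_k = -(2 / (N^2 * s2)) * sum_j G(x_k - x_j) * (x_k - x_j)."""
    n = x.size
    diffs = x[:, None] - x[None, :]
    s2 = 2.0 * sigma**2
    g = np.exp(-diffs**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return -(2.0 / n**2) * np.sum(g * diffs, axis=1) / s2

x = np.array([0.0, 0.5, 2.0])
forces = information_forces(x, sigma=1.0)
print(forces)
```

Moving each sample along its force increases the potential and therefore decreases the estimated quadratic entropy, so the outermost samples are pulled toward the others, much like attracting particles.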

SOME IDEAS IN AND APPLICATIONS OF ITL

Kernel Machines and Spectral Clustering: KDE has been motivated by the smoothness properties inherent to reproducing kernel Hilbert spaces (RKHS). Therefore, a practical connection between KDE-based ITL, kernel machines, and spectral machine learning techniques was imminent. This connection was realized and exploited in recent work that demonstrates an information theoretic framework for pairwise similarity (spectral) clustering, especially normalized cut techniques (Shi & Malik, 2000). Normalized cut clustering is shown to determine an optimal solution that maximizes the CSD between clusters (Jenssen, 2004). This connection immediately allows one to approach kernel machines from a density estimation perspective, thus providing a robust method to select the kernel size, a problem still investigated by some researchers in the kernel and spectral techniques literature. In our experience, kernel size selection techniques based on suitable criteria aimed at obtaining the best fit to the training data (using Silverman's regularized squared error fit (Silverman, 1986) or leave-one-out cross-validation maximum likelihood (Duin, 1976), for instance) have proved to be convenient, robust, and accurate, and they avoid many of the computational complexity and load



issues. Local data spread based modifications resulting in variable-width KDE are also observed to be more robust to noise and outliers. An illustration of ITL clustering by maximizing the CSD between the two estimated clusters is provided in Figure 1. The samples are labeled to maximize

$$D_{CS}(p,q) = -\log \frac{\langle p, q \rangle_f}{\|p\|_f \, \|q\|_f} \tag{3}$$

where p and q are the KDEs for two candidate clusters, f is the overall data KDE, and the weighted inner product used to measure the angular distance between clusters is

$$\langle p, q \rangle_f = \int p(x)\, q(x)\, f^{-1}(x)\, dx \tag{4}$$

When estimated using a weighted KDE variant, this criterion becomes equivalently

$$D_{CS}(p,q) \approx -\log \frac{\sum_{x_i \in p,\, y_j \in q} K_{1/f}(x_i, y_j)}{\sqrt{\sum_{x_i \in p,\, x_j \in p} K_{1/f}(x_i, x_j) \, \sum_{y_i \in q,\, y_j \in q} K_{1/f}(y_i, y_j)}} \tag{5}$$
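Criterion (5) only needs sums over blocks of a kernel matrix, which the following sketch makes explicit. It assumes the matrix of pairwise kernel values is already given; since the equivalent kernel $K_{1/f}$ is not spelled out here, a plain Gaussian Gram matrix is used as a stand-in, and the function names and toy data are hypothetical.

```python
import numpy as np

def cs_divergence(K, labels):
    """Evaluate criterion (5) from a symmetric kernel matrix K (N x N).
    labels is a boolean array: True for cluster p, False for cluster q."""
    labels = np.asarray(labels, dtype=bool)
    cross = K[np.ix_(labels, ~labels)].sum()       # sum over x_i in p, y_j in q
    within_p = K[np.ix_(labels, labels)].sum()     # sum over pairs inside p
    within_q = K[np.ix_(~labels, ~labels)].sum()   # sum over pairs inside q
    return -np.log(cross / np.sqrt(within_p * within_q))

def gaussian_gram(X, sigma):
    """Stand-in kernel matrix (assumption): unnormalized Gaussian kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

# Two small, well separated groups of points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
K = gaussian_gram(X, sigma=1.0)
good = cs_divergence(K, [True, True, True, False, False, False])
bad = cs_divergence(K, [True, False, True, False, True, False])
print(good, bad)
```

The labeling that follows the true groups yields a much larger value than the mixed labeling, which is the behavior the clustering criterion exploits when searching over labelings.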

where $K_{1/f}$ is an equivalent kernel generated from the original kernel K (Gaussian here). One difficulty with kernel machines is their nonparametric nature: the requirement to solve for the eigendecomposition of a large positive-definite matrix of size N×N, for N training samples. The solution is a weighted sum of kernels evaluated over each training sample, thus the test procedure for each novel sample involves evaluating the sum of N kernels: $y_{test} = \sum_{k=1}^{N} w_k K(x_{test} - x_k)$. The Fast Gauss Transform (FGT) (Greengard, 1991), which uses polynomial expansions for a Gaussian (or other) kernel, has been employed to overcome this difficulty. FGT carefully selects a few center points around which truncated Hermite polynomial expansions approximate the kernel machine. FGT still requires a heavy computational load in off-line training (minimum $O(N^2)$, typically $O(N^3)$). The selection of expansion centers is typically done via clustering (e.g., Ozertem & Erdogmus, 2006).

Correntropy as a Generalized Similarity Metric: The main feature of ITL is that it preserves the universe of concepts we have in neural computing, but allows the adaptive system to extract more information from the data. For instance, the general Hebbian principle is

Figure 1. Maximum CSD clustering of two synthetic benchmarks: training and novel test data (left), KDE using Gaussian kernels with Silverman-kernel-size (center), and spectral projections of data on two dominant eigenfunctions of the kernel (right). The eigenfunctions are approximated using the Nystrom formula.



reduced into a second order metric in traditional artificial neural network literature (input-output product), thus becoming a synonym for second order statistics. The learning rule that maximizes output entropy (instead of output variance), using SIG with Gaussian kernels, is $\Delta w(n) = \eta\,(x(n) - x(n-1))(y(n) - y(n-1))$ (Erdogmus et al, 2002), which still obeys the Hebbian principle, yet extracts more information from the data (leading to the error-whitening criterion for input-noise robust learning). ITL quantifies global properties of the data, but will it be possible to apply it to functions, specifically those in RKHS? A concrete example is the similarity between random variables, which is typically expressed as second order correlation. Correntropy generalizes similarity to include higher order moment information. The name indicates the strong relation to correlation, but also stresses the difference: the average over the lags (for random processes) or over dimensions (for multidimensional random variables) is the information potential, i.e., the argument of second order Renyi's entropy. For random variables X and Y with joint density p(x,y), correntropy is defined as

$$V(X,Y) = \iint \delta(x-y)\, p(x,y)\, dx\, dy \tag{6}$$

and measures how dense the two random variables are along the line x=y in the joint space. Notice that it is similar to correlation, which also asks the same question in a second moment framework. However, correntropy is local to the line x=y, while correlation is quadratically dependent upon distances of samples in the joint space. Using a KDE with Gaussian kernels,

$$V(X,Y) = \frac{1}{N} \sum_{i=1}^{N} G_\sigma(x_i - y_i) \tag{7}$$
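The sample estimator (7) is a one-line computation. The sketch below assumes paired scalar samples and a Gaussian kernel of size `sigma`; the toy signals are made up to show how a single gross outlier affects correntropy compared with the mean squared difference.

```python
import numpy as np

def correntropy(x, y, sigma):
    """Sample correntropy, equation (7): average Gaussian kernel of x_i - y_i."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.mean(np.exp(-e**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma))

x = np.zeros(10)
y = x.copy()
y[0] = 1000.0                                # one gross outlier

v_clean = correntropy(x, x, sigma=1.0)       # equals G_sigma(0)
v_outlier = correntropy(x, y, sigma=1.0)     # loses only one of ten terms
mse_outlier = np.mean((x - y) ** 2)          # dominated by the outlier
print(v_clean, v_outlier, mse_outlier)
```

Each pair contributes at most $G_\sigma(0)/N$ to the estimate, so a single outlier can change correntropy by at most that amount, whereas its effect on a second order measure grows quadratically with its magnitude.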

Correntropy is a positive-definite function, and thus defines a RKHS. Unlike correlation, this RKHS is nonlinearly related to the input, because all moments of the random variable are included in the transformation. It is possible to analytically solve for least squares regression and principal components in this space, yielding nonlinear fits in input space. The correntropy induced metric (CIM) behaves as the L2-norm for small distances, progressively approaches the L1-norm, and then converges to L0 at infinity. Thus robustness to outliers is automatically achieved, and equivalence to Huber's robust estimation can be proven (Santamaria, 2006). Unlike conventional kernel methods, correntropy solutions remain in the same dimensionality as the input vector. This might indicate built-in regularization properties, yet to be explored.

Figure 2. Maximum mutual information projection versus kernel LDA test ROC results on hand-written digit recognition shown in terms of type-1 and type-2 errors (left); ROC results (Pdetect vs Pfalse) compared for various techniques on sonar data (right). Both data sets are from the UCI Machine Learning Repository (2007).

Nonparametric Learning in the RKHS: It is possible to obtain robust solutions to a variety of problems in learning using the nonparametric and local nature of KDE and its relationship with RKHS theory. Recently, we explored the possibility of designing nonparametric solutions to the problem of identifying nonlinear dimensionality reduction schemes that maintain maximal discriminative information in a pattern recognition problem (quite appropriately measured by the mutual information between the data and the class labels, as agreed upon by many researchers). Using the RKHS formalism and based on the KDE, results were obtained that consistently outperformed the alternative, rather heuristic kernel approaches such as kernel PCA and kernel LDA (Scholkopf & Smola, 2001). The conceptual oversight in the latter two is that both PCA and LDA procedures are most appropriate for Gaussian distributed data (although they are acceptable for other symmetric unimodal distributions and are commonly, but possibly inappropriately, used for arbitrary data distributions). Clearly, the distribution of the data in the kernel induced feature space could not be Gaussian for all typically exploited kernel selections (such as the Gaussian kernel), since these are usually translation invariant; therefore the data is, in principle, mapped to an infinite dimensional hypersphere on which the data could not have been Gaussian distributed (nor symmetrically distributed in general for the ideal kernel for a given problem, since these are positive definite functions). Consequently, the hasty use of kernel extensions of second-order techniques is not necessarily optimal in a meaningful statistical sense. Nevertheless, these techniques have found successful applications in various problems; however, their suboptimality is clear from comparisons with more carefully designed solutions. In order to illustrate how drastic the performance difference could be, we present a comparison of a mutual information based nonlinear nonparametric projection approach (Ozertem et al, 2006) and kernel LDA in a simplified two-class handwritten digit classification case study and a sonar mine detection case study. The ROC curves of both algorithms on the test set, after being trained with the same data, are shown in Figure 2. The kernel is assumed to be a circular Gaussian with size set to Silverman's rule-of-thumb. For the sonar data, we also include a KDE-based approximate Bayes classifier and linear LDA for reference. In this example, KLDA performs close to mutual information projections, as observed occasionally.
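The kernel size setting mentioned above can be sketched as follows for a circular Gaussian kernel. The formula is the common multivariate form of Silverman's rule-of-thumb; summarizing the data spread by the root of the average per-dimension variance is an assumption of this sketch, as are the function names and the synthetic data.

```python
import numpy as np

def silverman_kernel_size(X):
    """Rule-of-thumb size for a circular Gaussian kernel; X has shape (N, d)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # Assumption: one overall spread from the average per-dimension variance.
    spread = np.sqrt(np.mean(np.var(X, axis=0, ddof=1)))
    return spread * (4.0 / ((d + 2.0) * n)) ** (1.0 / (d + 4.0))

def gaussian_kde(query, X, sigma):
    """Evaluate the KDE built on X (N, d) at the query points (M, d)."""
    query = np.asarray(query, dtype=float)
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    sq = np.sum((query[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    norm = (2.0 * np.pi * sigma**2) ** (d / 2.0)
    return np.mean(np.exp(-sq / (2.0 * sigma**2)), axis=1) / norm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))                        # synthetic 1-D sample
sigma = silverman_kernel_size(X)
grid = np.linspace(-10.0, 10.0, 2001).reshape(-1, 1)
density = gaussian_kde(grid, X, sigma)
print(sigma, density.sum() * 0.01)                   # kernel size, approximate integral
```

The rule shrinks the kernel slowly as the sample size grows, and more slowly still in higher dimensions, which is one reason data-driven alternatives such as leave-one-out likelihood are also used.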

FUTURE TRENDS Nonparametric Snakes, Principal Curves and Surfaces: More recently, we have been investigating

Figure 3. Nonparametric snake after convergence from an initial state that was located at the boundary of the guitar image rectangle (left). The global principal curve of a mixture of ten Gaussians obtained according to the local subspace maximum definition for principal manifolds (right).



the application of KDE and RKHS to nonparametric clustering, principal curves and surfaces. Interesting mean-shift-like fixed-point algorithms have been obtained; specifically interesting are the concepts of nonparametric snakes (Ozertem & Erdogmus, 2007) and local principal manifolds (Erdogmus & Ozertem, 2007) that we developed recently. The nonparametric snake approach overcomes the principal difficulties experienced by snakes (active contours) for image segmentation, such as low capture range, data curvature inhomogeneity, and noisy and missing edge information. Similarly, the local conditions for determining whether a point is in a principal manifold or not provide guidelines for designing fixed point and other iterative learning algorithms for identifying such important structures. Specifically in nonparametric snakes, we treat the edgemap of an image as samples and the values of the edginess as weights to construct a weighted KDE, from which a fixed point iterative algorithm can be devised to detect the boundaries of an object against the background. The designed algorithm can easily be made robust to outlier edges, converges very fast, and can penetrate into concavities, while not being trapped inside the object at missing edge localities. The guitar image in Figure 3 emphasizes these advantages, as the image exhibits both missing edges and concavities, while background complexity is trivially low, as that was not the main concern in this experiment; the variable width KDE easily avoids textured obstacles. The algorithm could be utilized to detect the ridge-boundary of a structure in a data set of any dimension in other applications. In defining principal manifolds, we avoided the traditional least-squares error reconstruction type criteria, such as Hastie's self-consistent principal curves (Hastie & Stuetzle, 1989), and proposed a local subspace maximum definition for principal manifolds inspired by differential geometry.
This definition lends itself to a uniquely defined principal manifold hierarchy such that one can use inflation and deflation to obtain a d-dimensional principal manifold from a (d+1)-dimensional principal manifold. The rigorous and local definition lends itself to easy algorithm design and multiscale principal structure analysis for probability densities. We believe that in the near future, the community will be able to prove maximal information preserving properties of principal manifolds obtained using this definition in a manner similar to mean-shift clustering solving for minimum information distortion clustering (Rao et al, 2006) and maximum likelihood modelling achieving minimum Kullback-Leibler divergence asymptotically (Carreira-Perpinan & Williams, 2003; Erdogmus & Principe, 2006).
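The mean-shift-like fixed-point iterations mentioned in this section can be illustrated in their simplest form: an iteration that climbs to a mode of a weighted Gaussian KDE. This is only a sketch of that basic idea under assumed names and toy data; it is not the nonparametric snake or the principal manifold algorithm of the cited papers, which impose additional conditions on the iteration.

```python
import numpy as np

def weighted_mean_shift(start, samples, weights, sigma, n_iter=100, tol=1e-8):
    """Fixed-point iteration toward a mode of a weighted Gaussian KDE.
    samples: (N, d) points, e.g. edge pixel locations (hypothetical use);
    weights: (N,) nonnegative weights, e.g. edge strengths."""
    x = np.asarray(start, dtype=float).copy()
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    for _ in range(n_iter):
        sq = np.sum((samples - x) ** 2, axis=1)
        k = weights * np.exp(-sq / (2.0 * sigma**2))   # weighted kernel values
        x_new = k @ samples / k.sum()                  # kernel-weighted mean
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

samples = np.array([[1.0, 1.0], [1.2, 1.0], [0.8, 1.0], [1.0, 1.2], [1.0, 0.8]])
weights = np.ones(len(samples))
mode = weighted_mean_shift([0.5, 0.5], samples, weights, sigma=0.5)
print(mode)
```

Each update moves the point to the kernel-weighted mean of the samples, so no step size has to be tuned; raising the weight of a sample pulls nearby trajectories toward it.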

CONCLUSION

The use of information theoretic learning criteria in neural networks and other adaptive system solutions has so far clearly demonstrated a number of advantages that arise due to the increased information content of these measures relative to second-order statistics (Erdogmus & Principe, 2006). Furthermore, the use of kernel density estimation with smooth kernels allows one to obtain continuous and differentiable criteria suitable for iterative descent/ascent-based learning, and the nonparametric nature of KDE and its variants (such as variable-size kernels) allows one to achieve simultaneously robustness, global optimization through kernel annealing, and data modeling flexibility in designing neural networks and learning algorithms for a variety of benchmark problems. Due to lack of space, detailed mathematical treatments cannot be provided in this article; the reader is referred to the literature for details.

REFERENCES

Bregman, L.M., (1967). The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming. USSR Computational Mathematics and Physics, (7), 200-217. Carreira-Perpinan, M.A., Williams, C.K.I., (2003). On the Number of Modes of a Gaussian Mixture. Proceedings of Scale-Space Methods in Computer Vision. 625-640. Csiszár, I., Körner, J. (1981). Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press. Duin, R.P.W., (1976). On the Choice of Smoothing Parameter for Parzen Estimators of Probability Density Functions. IEEE Transactions on Computers, (25) 1175-1179.



Erdogmus, D., Ozertem, U., (2007). Self-Consistent Locally Defined Principal Surfaces. Proceedings of ICASSP 2007. to appear. Erdogmus, D., Principe, J.C., (2006). From Linear Adaptive Filtering to Nonlinear Information Processing. IEEE Signal Processing Magazine, (23) 6, 14-33. Erdogmus, D., Principe, J.C., Hild II, K.E., (2002). Do Hebbian Synapses Estimate Entropy? Proceedings of NNSP'02, 199-208. Erdogmus, D., Principe, J.C., Hild II, K.E., (2003). On-Line Entropy Manipulation: Stochastic Information Gradient. IEEE Signal Processing Letters, (10) 8, 242-245. Erdogmus, D., Principe, J.C., Vielva, L. Luengo, D., (2002). Potential Energy and Particle Interaction Approach for Learning in Adaptive Systems. Proceedings of ICANN'02, 456-461. Fano, R.M. (1961). Transmission of Information: A Statistical Theory of Communications, MIT Press. Greengard, L., Strain, J., (1991). The Fast Gauss Transform. SIAM Journal of Scientific and Statistical Computation, (12) 1, 79-94. Gyorfi, L., van der Meulen, E.C. (1990). On Nonparametric Estimation of Entropy Functionals. Nonparametric Functional Estimation and Related Topics, (G. Roussas, ed.), Kluwer Academic Publisher, 81-95. Hastie, T., Stuetzle, W., (1989). Principal Curves. Journal of the American Statistical Association, (84) 406, 502-516. Hild II, K.E., Erdogmus, D., Principe, J.C., (2006). An Analysis of Entropy Estimators for Blind Source Separation. Signal Processing, (86) 1, 182-194. Huber, P.J., (1981). Robust Statistics. Wiley. Jenssen, R., Erdogmus, D., Principe, J.C., Eltoft, T., (2004). The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space. Advances in NIPS'04, 625-632. Jenssen, R., Erdogmus, D., Principe, J.C., Eltoft, T., (2006). Some Equivalences Between Kernel Methods and Information Theoretic Methods. JVLSI Signal Processing Systems, (45) 1-2, 49-65.


Lazaro, M., Santamaria, I., Erdogmus, D., Hild II, K.E., Pantaleon, C., Principe, J.C., (2005). Stochastic Blind Equalization Based on PDF Fitting Using Parzen Estimator. IEEE Transactions on Signal Processing, (53) 2, 696-704. Ozertem, U., Erdogmus, D., (2006). Maximum Entropy Approximation for Kernel Machines. Proceedings of MLSP 2005. Ozertem, U., Erdogmus, D., Jenssen, R., (2006). Spectral Feature Projections that Maximize Shannon Mutual Information with Class Labels. Pattern Recognition, (39) 7, 1241-1252. Ozertem, U., Erdogmus, D., (2007). A Nonparametric Approach for Active Contours. Proceedings of IJCNN 2007, to appear. Parzen, E., (1967). On Estimation of a Probability Density Function and Mode. Time Series Analysis Papers, Holden-Day, Inc. Principe, J.C., Fisher, J.W., Xu, D., (2000). Information Theoretic Learning. Unsupervised Adaptive Filtering, (S. Haykin, ed.), Wiley, 265-319. Rao, Y.N., Erdogmus, D., Principe, J.C., (2005). Error Whitening Criterion for Adaptive Filtering: Theory and Algorithms. IEEE Transactions on Signal Processing, (53) 3, 1057-1069. Rao, S., de Madeiros Martins, A., Liu, W., Principe, J.C., (2006). Information Theoretic Mean Shift Algorithm. Proceedings of MLSP 2006. Renyi, A., (1970). Probability Theory, North-Holland Publishing Company. Sayed, A.H. (2005). Fundamentals of Adaptive Filtering. Wiley & IEEE Press. Scholkopf, B., Smola, A.J. (2001). Learning with Kernels. MIT Press. Shannon, C.E., Weaver, W. (1964). The Mathematical Theory of Communication, University of Illinois Press. Shi, J., Malik, J., (2000). Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, (22) 8, 888-905.


Silverman, B.W., (1986). Density Estimation for Statistics and Data Analysis, Chapman & Hall. Santamaria, I., Pokharel, P.P., Principe, J.C., (2006). Generalized Correlation Function: Definition, Properties, and Application to Blind Equalization. IEEE Transactions on Signal Processing, (54) 6, 2187-2197. UCI Machine Learning Repository (2007). http://mlearn.ics.uci.edu/MLRepository.html. Last accessed in June 2007.

KEY TERMS

Cauchy-Schwarz Distance: An angular density distance measure in the Euclidean space of probability density functions that approximates information theoretic divergences for nearby densities.

Correntropy: A statistical measure that estimates the similarity between two or more random variables by integrating the joint probability density function along the main diagonal of the vector space (line along ones). It relates to Renyi's entropy when averaged over sample-index lags.

Information Theoretic Learning: A technique that employs information theoretic optimality criteria such as entropy, divergence, and mutual information for learning and adaptation.

Information Potentials and Forces: Physically intuitive pairwise particle interaction rules that emerge from information theoretic learning criteria and govern the learning process, including backpropagation in multilayer system adaptation.

Kernel Density Estimate: A nonparametric technique for probability density function estimation.

Mutual Information Projections: Maximally discriminative nonlinear nonparametric projections for feature dimensionality reduction based on the reproducing kernel Hilbert space theory.

Renyi Entropy: A generalized definition of entropy that stems from modifying the additivity postulate and results in a class of information theoretic measures that contain Shannon's definitions as special cases.

Stochastic Information Gradient: Stochastic gradient of a nonparametric entropy estimate based on kernel density estimation.


Intelligent Classifier for Atrial Fibrillation (ECG)

O. Valenzuela, University of Granada, Spain
I. Rojas, University of Granada, Spain
F. Rojas, University of Granada, Spain
A. Guillen, University of Granada, Spain
L. J. Herrera, University of Granada, Spain
F. J. Rojas, University of Granada, Spain
M. Cepero, University of Granada, Spain

INTRODUCTION

This chapter is focused on the analysis and classification of arrhythmias. An arrhythmia is any cardiac rhythm other than normal sinus rhythm, caused by alterations in the formation and/or conduction of the electrical impulses. In pathological conditions, the depolarization process can be initiated outside the sinoatrial (SA) node, and several kinds of extrasystolic or ectopic beats can appear. In addition, electrical impulses can be blocked, accelerated or diverted along alternative pathways, and their origin can change from one heartbeat to the next, giving rise to several types of block and anomalous connection. In both situations, the ECG can show changes in the signal morphology or in the duration of its waves and intervals, as well as the absence of one of the waves.

This work addresses the development of intelligent classifiers in the area of biomedicine, concentrating on the problem of diagnosing cardiac diseases from the electrocardiogram (ECG), and more precisely on differentiating the types of atrial fibrillation. First we study the ECG and the processing it requires for this specific pathology. To this end we study different ways of removing, as far as possible, any activity that does not originate in the atria. We study and reproduce the ECG processing methodologies and the characteristics extracted from the electrocardiograms by the researchers who obtained the best results in the PhysioNet Challenge, in which ECG recordings were classified according to the type of atrial fibrillation (AF) they showed. We extract a large number of characteristics: some are those used by these researchers, and others are additional characteristics that we consider important for this distinction. A new method based on evolutionary algorithms is then used to select the most relevant characteristics and to obtain a classifier capable of distinguishing the different types of this pathology.

BACKGROUND

The electrocardiogram (ECG) is a diagnostic tool that measures and records the electrical activity of the heart in exquisite detail (Lanza 2007). Interpretation of these details allows diagnosis of a wide range of heart conditions. The QRS complex is the most striking waveform within the electrocardiogram (Figure 1). Since it reflects the electrical activity within the heart during the ventricular contraction, the time of its occurrence as well as its shape provide much information about the current state of the heart. Due to its characteristic shape

it serves as the basis for the automated determination of the heart rate, as an entry point for classification schemes of the cardiac cycle, and it is often also used in ECG data compression algorithms. A normal QRS complex is 0.06 to 0.10 s (60 to 100 ms) in duration. In order to obtain an ECG signal that contains only atrial activity (that is, one clean of ventricular activity), we will analyse and compare the performance of two different approaches:

1. Remove the activity of the QRS complex by subtracting from the signal, at every heartbeat, a morphological average of that activity.
2. Detect the TQ segment between heartbeats (a zone clean of ventricular activity) and analyse only the data from that segment.

Figure 1. Diagram of the QRS complex
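The first approach (subtracting a morphological average of each beat) can be illustrated with a minimal sketch. It assumes that the R-peak positions are already known and that the beat windows do not overlap; the function name `subtract_average_beat` and the fixed window `half_width` are hypothetical choices made for illustration, not the procedure used by the authors.

```python
import numpy as np

def subtract_average_beat(ecg, r_peaks, half_width):
    """Subtract the mean beat template, aligned on each R peak, from the ECG.

    Assumes non-overlapping windows of 2 * half_width + 1 samples per beat.
    """
    ecg = np.asarray(ecg, dtype=float)
    # Keep only beats whose window lies fully inside the recording.
    peaks = [p for p in r_peaks
             if p - half_width >= 0 and p + half_width < len(ecg)]
    if not peaks:
        raise ValueError("no complete beat window inside the recording")
    windows = np.array([ecg[p - half_width:p + half_width + 1] for p in peaks])
    template = windows.mean(axis=0)  # morphological average of the beats
    residual = ecg.copy()
    for p in peaks:
        residual[p - half_width:p + half_width + 1] -= template
    return residual, template
```

What remains in `residual` is the part of the signal that the average beat does not explain, which in this setting is taken as an estimate of the atrial activity.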

A great variety of algorithms exist for extracting the atrial activity from the electrocardiogram, such as the Thakor method (a recurrent adaptive filter structure), adaptive filtering of the whole band, methods based on neural networks, spatiotemporal cancellation methods, and methods based on wavelets or on the concept of Principal Component Analysis (PCA) (Castells et al. 2004, Gilad-Bachrach et al. 2004, Petrutiu et al. 2004). A fundamental step in any of these approaches is the detection of the QRS complex in every heartbeat. Software QRS detection has been a research topic for

Figure 2. Segments detected by the algorithm on the two channels of a recording. The end of the T wave is shown in green and the onset of the Q wave in red. Each stretch between the end of a T wave (green) and the onset of the next Q wave (red) is therefore a segment of atrial activity. The QRST complex is detected automatically with good precision.



more than 30 years. Once the QRS complex is identified, we have a starting point from which to implement different techniques for QRST removal. Figure 2 shows how the QRST complex is detected automatically. This is the first step in the analysis of the ECG.

Studying feature extraction techniques is a common task when implementing an automatic classification system for signals of any kind, and ECG signals are no exception. For this sub-task it is important to review the results reported in the literature. One important feature comes from the frequency domain: the Dominant Atrial Frequency (DAF), an index of the atrial activity that measures the dominant frequency in the spectrum of the atrial activity signal. For each ECG record, the maximum energy peak of this spectrum is located, and its frequency is the one that dominates the spectrum (Cantini et al. 2004). The RR interval is also useful, as is filtering in the 4-10 Hz range with a first-order Butterworth filter. The MUSIC (Multiple Signal Classification) method of order 12 can be used to calculate the pseudo-periodogram of the signal. To obtain more robust estimates, the signal can be split into non-overlapping windows of variable length and the frequency spectrum analysed on each of them. Other relevant tools are the Welch method, the Choi-Williams transform, and some heuristic methods used by cardiology experts (Atrial Fibrillation, 2007).
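As an illustration of the dominant-frequency idea, the following sketch locates the maximum energy peak of a plain FFT periodogram. Restricting the search to the 4-10 Hz band and using an unwindowed periodogram are simplifying assumptions made here; the function name `dominant_frequency` is hypothetical, and this is not the MUSIC or Welch estimator mentioned above.

```python
import numpy as np

def dominant_frequency(signal, fs, band=(4.0, 10.0)):
    """Return the frequency (Hz) of the largest spectral peak inside `band`.

    signal: 1-D atrial activity signal; fs: sampling rate in Hz.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                           # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2        # periodogram (unnormalised)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    if not in_band.any():
        raise ValueError("no frequency bin falls inside the requested band")
    return float(freqs[in_band][np.argmax(power[in_band])])
```

The frequency resolution of this estimate is `fs / len(signal)`, so short segments give a coarse value; averaging over several windows, as the text suggests, is one way to make it more robust.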

GENETIC PROGRAMMING

Genetic programming (GP) can be understood as an extension of the genetic algorithm (GA) (Zhao, 2007). GP began as an attempt to discover how computers could learn to solve problems in fields such as automatic design, function approximation, classification, robotic control and signal processing without being explicitly programmed to do so (Koza, 2003). GP has also been used extensively and successfully in biomedical applications (Lopes, 2007). The primary differences between GAs and GP can be summarised as follows: (a) GP typically codes solutions as tree-structured, variable-length chromosomes, while GAs generally make use of chromosomes of fixed length and structure; (b) GP typically incorporates a domain-specific syntax that

governs acceptable (or meaningful) arrangements of information on the chromosome. For GAs, the chromosomes are typically syntax free. The field of program induction, using a tree-structured approach, was first clearly defined by Koza (Koza, 2003). The following steps summarise the search procedure used with GP.

1. Create an initial population of programs, randomly generated as compositions of the function and terminal sets.
2. WHILE termination criterion not reached DO
   (a) Execute each program to obtain a performance (fitness) measure representing how well each program performs the specified task.
   (b) Use a fitness proportionate selection method to select programs for reproduction to the next generation.
   (c) Use probabilistic operators (crossover and mutation) to combine and modify components of the selected programs.
3. The fittest program represents a solution to the problem.
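The search procedure above can be sketched on a toy symbolic-regression task. Everything specific in this sketch is an assumption made for illustration: programs are nested tuples, the function set is {+, -, *}, fitness is 1 / (1 + absolute error), and the population size, mutation rate and node limit are arbitrary. It is not the classifier described later in this chapter.

```python
import math
import operator
import random

FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x", 1.0, 2.0]
MAX_NODES = 60  # arbitrary cap that keeps programs from growing without bound

def random_tree(rng, depth=3):
    """Step 1: random composition of the function and terminal sets."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(list(FUNCTIONS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if not isinstance(tree, tuple):
        return tree  # numeric constant
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))

def paths(tree, path=()):
    """Yield the path (tuple of child indices) of every node in the tree."""
    yield path
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from paths(tree[i], path + (i,))

def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, new):
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)

def crossover(a, b, rng):
    """Replace a random subtree of `a` with a random subtree of `b`."""
    return replace(a, rng.choice(list(paths(a))), get(b, rng.choice(list(paths(b)))))

def mutate(tree, rng):
    """Replace a random subtree with a freshly generated one."""
    return replace(tree, rng.choice(list(paths(tree))), random_tree(rng, depth=2))

def run_gp(target, xs, pop_size=60, generations=20, seed=0):
    rng = random.Random(seed)

    def fitness(tree):  # step 2(a): higher is better, always positive
        err = sum(abs(evaluate(tree, x) - target(x)) for x in xs)
        return 1.0 / (1.0 + err) if math.isfinite(err) else 1e-12

    population = [random_tree(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(t) for t in population]
        offspring = []
        while len(offspring) < pop_size:
            # Step 2(b): fitness-proportionate (roulette-wheel) selection.
            a, b = rng.choices(population, weights=scores, k=2)
            child = crossover(a, b, rng)      # step 2(c)
            if rng.random() < 0.1:
                child = mutate(child, rng)
            if sum(1 for _ in paths(child)) > MAX_NODES:
                child = a                     # reject oversized programs
            offspring.append(child)
        population = offspring
        best = max(population + [best], key=fitness)
    return best, fitness(best)                # step 3: the fittest program
```

For example, `run_gp(lambda x: x * x + 1.0, [-2.0, -1.0, 0.0, 1.0, 2.0])` returns the best expression tree found and its fitness. The run is repeatable for a fixed `seed`, but nothing guarantees that the exact target expression is recovered.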

A NEW INTELLIGENT CLASSIFIER BASED ON GENETIC PROGRAMMING FOR ECG

In the articles we have studied, the authors did not use any algorithmic method to classify the electrocardiograms (Cantini et al. 2004, Lemay et al. 2004). They applied simple methods, trying to establish a classification from the discriminating capacity of one single characteristic or of pairs of characteristics (through a graphic representation) (Hayn et al. 2004). Nevertheless, the fact that a single characteristic cannot by itself separate a group of patterns into the different categories does not mean that it cannot reach a high classification rate when combined with one or more others. Given the large number of characteristics obtained from the ECG, a method to classify the patterns was needed, together with a way of selecting the subgroup of characteristics that is optimal for classifying, since the large number of existing characteristics would introduce noise as soon as the search for the optimal classifier begins. In total 55 different


characteristics were used, taken from the papers (Cantini et al. 2004, Lemay et al. 2004, Hayn et al. 2004, Mora et al. 2004). Other papers in the literature have used soft-computing methods to analyse the ECG (Wiggins et al. 2008, Lee et al. 2007, Yu et al. 2007). In this paper, a new intelligent algorithm based on genetic programming (one paradigm of the soft-computing area) that simultaneously selects the best features is proposed for the problem of classifying the spontaneous termination of atrial fibrillation. In this algorithm, genetic programming is used to search for a good classifier at the same time as it searches for an optimal subgroup of characteristics. The algorithm consists of a population of classifiers, each of which is associated with a fitness value that indicates how well it classifies. Each classifier is made up of:

1. A binary vector of characteristics, which indicates with 1's the characteristics it uses.
2. A multitree with as many trees as there are classes in the data set of the problem. Every tree i distinguishes between the class i (giving a positive output) and the rest of the classes (negative output). Furthermore, it is connected to the values pj (frequency of failures) and wj (frequency of successes). The trees are made up of function nodes [+, -, *, /, trigonometric functions (sine, cosine, etc.), statistical functions (minimum, maximum, average)] and terminal nodes {constants and features}. Their translation to a mathematical formula is immediate.
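The two components of a classifier can be pictured with a small data-structure sketch. The class name `MultitreeClassifier` is hypothetical, each tree is represented simply as a Python callable, and the rule used when several trees give a positive output (pick the largest output) is an assumption, since the text does not say how such conflicts are resolved.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class MultitreeClassifier:
    feature_mask: List[int]                           # 1 = the feature is used
    trees: List[Callable[[Sequence[float]], float]]   # one tree per class

    def selected(self, features: Sequence[float]) -> List[float]:
        """Keep only the features flagged with 1 in the binary vector."""
        return [v for v, bit in zip(features, self.feature_mask) if bit == 1]

    def predict(self, features: Sequence[float]) -> int:
        """Return the class whose tree gives the largest output.

        Assumption: ties and multiple positive outputs are resolved by
        taking the maximum; the source does not specify this rule.
        """
        x = self.selected(features)
        outputs = [tree(x) for tree in self.trees]
        return max(range(len(outputs)), key=outputs.__getitem__)
```

With `feature_mask=[1, 0, 1]` and the two trees `lambda x: x[0] - x[1]` and `lambda x: x[1] - x[0]`, the input `[3.0, 99.0, 1.0]` is reduced to `[3.0, 1.0]` and assigned to class 0, because only the first tree gives a positive output.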

The algorithm consists of a loop in which, at each iteration, a new population is formed from the previous one through the genetic operators. The classifiers with the highest fitness are more likely to participate, so the quality of the population tends to improve over successive generations. The proposed algorithm is composed of the following building blocks:

1. Fitness function. The fitness function combines the double objective of achieving a good classification and a small subgroup of characteristics:

$$\mathrm{Fitness} = f \left( 1 + \alpha \, e^{-\beta / n} \right) \tag{1}$$

In this equation, f is the sum of the cases of success in the classification of the trees, β is the cardinality of the feature subset used, n is the total number of features and α is a parameter which determines the relative importance assigned to correct classification and to the size of the feature subset, calculated as:

$$\alpha = C \left( 1 - \frac{gen}{TotalGen} \right) \tag{2}$$

where C is a constant, TotalGen is the number of generations for which the proposed genetic algorithm is evolved, and gen is the current generation number.

Figure 3. An example of a crossover operation in the proposed multitree classifier. (a) and (b) show the initial classifiers P1 and P2. (c) and (d) show the results of the crossover operator.



2. Reproduction operator: a classifier chosen proportionally to the fitness passes on, intact, to the next generation.
3. Mutation operator: a classifier is selected randomly and nodes of a tree are changed, giving more probability to the worst trees.
4. Crossover operator: homogeneous cross (classifiers with the same characteristics) and heterogeneous cross (classifiers with a similar subgroup). It exchanges subtrees and trees between the classifiers. Figure 3 shows the behaviour of this operator.
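The fitness function of equations (1) and (2) is simple enough to write out directly. In this sketch the function names and the default value `c=0.1` are hypothetical; the source only states that C is a constant.

```python
import math

def alpha(gen, total_gen, c):
    """Equation (2): weight of the subset-size term, decaying to 0 over the run."""
    return c * (1.0 - gen / total_gen)

def fitness(f, beta, n, gen, total_gen, c=0.1):
    """Equation (1): f * (1 + alpha * exp(-beta / n)).

    f: number of correct classifications, beta: size of the feature subset,
    n: total number of features. The value of c is an illustrative assumption.
    """
    return f * (1.0 + alpha(gen, total_gen, c) * math.exp(-beta / n))
```

For equal classification success f, a smaller feature subset gives a larger fitness early in the run. In the last generation (gen = TotalGen) the bonus vanishes and the fitness reduces to f.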

It was considered useful to assess the characteristics first, and to use this assessment when a subgroup is assigned to a classifier. This is performed as follows:

• Each characteristic is given a probability of being assigned to the initial subgroup of the classifier that is proportional to its assessment.
• G-flip was used to assess the characteristics (Gilad-Bachrach et al. 2004). G-flip is a greedy search algorithm for maximizing an evaluation function that takes into account the number of features selected. The algorithm repeatedly iterates over the feature set and updates the set of chosen features. In each iteration it decides whether to remove the current feature from the selected set or add it, by evaluating the margin term of the evaluation function with and without this feature. This algorithm is similar to the zero-temperature Monte-Carlo (Metropolis) method. It converges to a local maximum of the evaluation function, as each step increases its value and the number of possible feature sets is finite.
• The proposed methodology devalues bad characteristics in groups with a large number of characteristics, thus accelerating their convergence to good groups of characteristics and good classification results.
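The greedy flip search described above can be sketched generically. This sketch takes the evaluation function as a parameter instead of implementing the margin-based function of Gilad-Bachrach et al., and it visits the features in a fixed order; both are simplifications, and the name `greedy_flip` is hypothetical.

```python
def greedy_flip(n_features, evaluate, max_passes=100):
    """Greedy add/remove search over feature subsets.

    evaluate: callable taking a set of feature indices and returning a score
    to maximise. A flip is kept only if it strictly increases the score, so
    the search stops at a local maximum.
    """
    selected = set()
    best = evaluate(selected)
    for _ in range(max_passes):
        changed = False
        for j in range(n_features):
            candidate = selected ^ {j}   # add j if absent, remove it if present
            score = evaluate(candidate)
            if score > best:
                selected, best = candidate, score
                changed = True
        if not changed:                  # a full pass with no improvement
            break
    return selected, best
```

As a toy usage, with per-feature utilities `[0.9, 0.1, 0.5, 0.05]` and a penalty of 0.2 per selected feature, the search keeps features 0 and 2 and discards the two whose utility does not cover the penalty.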

SIMULATION RESULTS

We have used and compared two different new intelligent classifiers. The first one is an online feature selection algorithm using genetic programming. The proposed genetic programming methodology simultaneously selects a good subset of features and constructs a classifier using the selected features for the problem of ECG classification. We have designed new genetic operators in order to produce a robust and precise algorithm. The other classifier is based on the hybridization of a feature selection algorithm and a neural network system based on a kernel method (Support Vector Machine). There are four classification tasks:

• Event A: To distinguish between records of type N (Group N: non-terminating AF, defined as AF that was not observed to have terminated for the duration of the long-term recording, at least an hour following the segment) and records of type T (Group T: AF that terminates immediately, within one second, after the end of the record).
• Event B: To distinguish between records of type S (Group S: AF that terminates one minute after the end of the record) and those of type T.
• Event C: To distinguish between records of type N and a second group that includes the records of types S and T.
• Event D: To separate the three types of record simultaneously.

The groups N, T and S are distributed across a learning set (consisting of 10 labelled records from each group) and two test sets. Test set A contains 30 records, of which about one-half are from group N and the remainder from group T. Test set B contains 20 records, 10 from each of groups S and T. Table 1 shows the simulation results (in % of correct classification) for the different methods and for the evolutionary algorithm proposed for ECG classification.

Table 1. Comparison of different approaches (classification rate in %; standard deviation in brackets)

Task      Infogain                New evolutive algorithm   Kernel method (SVM)        Relief
          (Molina et al. 2002)    for classification        (Schölkopf et al. 2002)    (Kononenko 1994)
          Best   Median           Best   Median             Best   Median              Best   Median
Event A   93     91 (±2)          100    98 (±2)            100    98 (±2)             72     64 (±8)
Event B   70     66 (±4)          95     81 (±14)           80     68 (±12)            80     74 (±6)
Event C   96     88 (±6)          89     83 (±6)            84     75 (±9)             74     68 (±6)
Event D   68     62 (±4)          85     80 (±5)            83     77 (±6)             53     49 (±4)

FUTURE TRENDS

Signal processing for biomedical problems is an exciting and growing field. The rapid development of powerful microcomputers has promoted the widespread application of software for electrocardiogram analysis, of QRS detection algorithms in cardiological devices, and of automatic classifiers. An important research field for the coming years will be the hybridization of new intelligent techniques, such as genetic algorithms and genetic programming, or other soft-computing paradigms (fuzzy logic, neural networks, SVM, etc.), to improve the behaviour of standard classification algorithms for the diagnosis of different cardiological pathologies.

CONCLUSIONS

In this paper, a new online feature selection algorithm using the genetic programming technique has been proposed as a classifier for the spontaneous termination of atrial fibrillation. Our genetic programming methodology automatically selects the required features while it designs the multitree classifier. Different genetic operators have been designed for the multitree classifier and, for better performance of the classifier, the initialization process generates solutions using smaller feature subsets that have been previously selected with a greedy search algorithm (G-flip) for maximizing the evaluation function. The effectiveness of the proposed scheme is demonstrated on a real problem: the classification of the spontaneous termination of atrial fibrillation. At this point, it is important to note that the use of different characteristics gives different classification results, as has been observed by the authors working on this challenge. The selection of the features extracted from an electrocardiogram has a strong influence on the problem to be solved and on the behaviour of the classifier. It is therefore important to develop a general tool that can cope with different cardiac illnesses and can select the most appropriate features in order to obtain an automatic classifier. As can be observed, the proposed methodology obtains very good results compared with the winner of the PhysioNet and Computers in Cardiology Challenge 2004, even though it has been developed in a general way to solve different classification problems.

REFERENCES

Atrial Fibrillation Atrial Flutter Fibrillation - What are they? http://www.hoslink.com/heart/af.htm

Cantini, F., et al. (2004). Predicting the end of an Atrial Fibrillation Episode: The PhysioNet Challenge. Computers in Cardiology, 121-124.

Castells, F., Rieta, J.J., Mora, C., Millet, J., & Sánchez, C. (2004). Estimation of Atrial Fibrillatory Waves from one-lead ECGs using principal component analysis concepts. Computers in Cardiology, 215-219.

Gilad-Bachrach, R., Navot, A., & Tishby, N. (2004). Margin based feature selection - theory and algorithms. ICML, 4-8.

Hayn, D., et al. (2004). Automated Prediction of Spontaneous Termination of Atrial Fibrillation from Electrocardiograms. Computers in Cardiology, 117-120.

Kononenko, I. (1994). Estimating Attributes: Analysis and Extensions of RELIEF. ECML, 171-182.



Koza, J.R., et al. (2003). Genetic Programming IV: Routine Human-Competitive Machine Intelligence. Kluwer Academic Publishers.

Lanza, G.A. (2007). The Electrocardiogram as a Prognostic Tool for Predicting Major Cardiac Events. Cardiovascular Diseases, (50) 2, 87-111.

Lee, C.S., & Wang, M.H. (2007). Ontological fuzzy agent for electrocardiogram application. Expert Systems with Applications, available online.

Lemay, H., et al. (2004). AF Classification Based on Clinical Features. Computers in Cardiology, 669-672.

Lopes, H.S. (2007). Genetic programming for epileptic pattern recognition in electroencephalographic signals. Applied Soft Computing, (7) 1, 343-352.

Molina, L., Belanche, L., & Nebot, A. (2002). Feature Selection Algorithms: A Survey and Experimental Evaluation. IEEE International Conference on Data Mining, 306-313.

Mora, C., & Castells, J. (2004). Prediction of Spontaneous Termination of Atrial Fibrillation Using Time Frequency Analysis of the Atrial Fibrillation Wave. Computers in Cardiology, 109-112.

Petrutiu, S., Sahakian, A.V., & Swiryn, S. (2004). Fibrillatory Wave Analysis of the Surface ECG to Predict Termination of Atrial Fibrillation. Computers in Cardiology, 250-261.

Schölkopf, B., & Smola, A.J. (2002). Learning with Kernels. MIT Press, Cambridge, MA.

Wiggins, M., Saad, A., Litt, B., & Vachtsevanos, G. (2008). Evolving a Bayesian classifier for ECG-based age classification in medical applications. Applied Soft Computing, (8) 1, 599-608.

Yu, S.N., & Chen, Y.H. (2007). Electrocardiogram beat classification based on wavelet transformation and probabilistic neural network. Pattern Recognition Letters, (28) 10, 1142-1150.

Zhao, H. (2007). A multi-objective genetic programming approach to developing Pareto optimal decision trees. Decision Support Systems, (43) 3, 809-826.


KEY TERMS

Arrhythmia: Arrhythmias are disorders of the regular rhythmic beating of the heart. Arrhythmias can be divided into two categories: ventricular and supraventricular.

Atrial Fibrillation: Atrial fibrillation (AF) is the sustained arrhythmia most frequently found in clinical practice, present in 0.4% of the total population. Its frequency increases with age and with the presence of structural cardiopathology. AF is especially prevalent in the elderly, affecting 2-5% of the population older than 60 years and 10% of people older than 80 years.

Electrocardiogram: The electrocardiogram (ECG) is a diagnostic tool that measures and records the electrical activity of the heart.

Feature Selection: Feature selection is a process frequently used in classification algorithms, wherein a subset of the features available from the data is selected for the classifier. The best subset contains the smallest number of dimensions or features that contribute most to a correct classification.

Genetic Algorithm: Genetic Algorithms (GA) are a way of solving problems by mimicking the processes that nature uses. They use the same combination of selection, recombination and mutation to evolve a solution to a problem.

Genetic Programming: Genetic Programming (GP) evolves a solution in the form of a Lisp program using an evolutionary, population-based search algorithm that extends the fixed-length concepts of genetic algorithms.

Soft-Computing: Refers to a collection of different paradigms (such as fuzzy logic, neural networks, simulated annealing, genetic algorithms and other computational techniques) that focus on analysing, modelling and discovering information in very complex problems.

Support Vector Machine (SVM): A special kind of neural network that performs classification by constructing an N-dimensional hyperplane that separates the data into two categories.


Intelligent MAS in System Engineering and Robotics

G. Nicolás Marichal, University of La Laguna, Spain
Evelio J. González, University of La Laguna, Spain

INTRODUCTION

The concept of agent has been successfully used in a wide range of applications such as robotics, e-commerce, agent-assisted user training, military transport and health care. The origin of this concept can be traced to 1977, when Carl Hewitt proposed the idea of an interactive object called an actor. This actor was defined as a computational agent that has a mail address and a behaviour (Hewitt, 1977). Actors receive messages from other actors and carry out their tasks concurrently. A single agent is rarely sufficient to carry out a relatively complex task. The usual approach consists of a society of agents, called a Multiagent System (MAS), which communicate and collaborate with one another and are coordinated when pursuing a goal. The purpose of this chapter is to analyze the aspects related to the application of MAS to System Engineering and Robotics, focusing on those approaches that combine MAS with other Artificial Intelligence (AI) techniques.

BACKGROUND

There is no academic definition of the term agent that is accepted by every researcher. In fact, agent researchers have offered a variety of definitions, each explaining their particular use of the word. An extensive list of these definitions can be found in (Franklin and Graesser, 1996). Reproducing that list is beyond the scope of this chapter. However, we include some of them in order to illustrate how heterogeneous these definitions are.

“Autonomous agents are computational systems that inhabit some complex dynamic environment, sense and act autonomously in this environment, and by doing so realize a set of goals or tasks for which they are designed.” (Maes, 1995, p. 108)

“Autonomous agents are systems capable of autonomous, purposeful action in the real world.” (Brustoloni, 1991, p. 265)

“An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors.” (Russell and Norvig, 1995, p. 31)

Despite the existing plethora of definitions, agents are often characterized only by describing their features (long-lived, autonomy, reactivity, proactivity, collaboration, ability to perform in a dynamic and unpredictable environment, etc.). With these characteristics, users can delegate to agents tasks designed to be carried out without human intervention, for instance, as personal assistants that learn from their user. In most applications, a standalone agent is not sufficient for carrying out the desired task: agents are forced to interact with other agents, forming a MAS. Due to their capacity for flexible autonomous action, MAS can deal with open (or at least highly dynamic or uncertain) environments. On the other hand, MAS can effectively manage situations where distributed systems are needed: the problem being solved is itself distributed, the data are geographically distributed, the system has many components and huge content, the system is expected to be extended in the future, etc. A researcher could use a single agent to implement all the tasks. Nevertheless, this type of macroagent represents a bottleneck for the speed, reliability and management of the system.


I

Intelligent MAS in System Engineering and Robotics

It is clear that designing a MAS is more complex than designing a single agent. Apart from the code that deals with the task itself, a developer needs to implement the aspects related to communication, negotiation among the agents and their organization in the system. Nevertheless, it has been shown that MAS offer more than they cost (Cockburn, 1996; Gonzalez, 2006; Gonzalez, 2006b; Gyurjyan, 2003; Seilonen, 2005).

MAS, AI AND SYSTEM ENGINEERING

An important topic in System Engineering is the process control problem. It can be defined as that of manipulating the input variables of a dynamic system in an attempt to influence the output variables in a desired fashion, for example, to achieve certain values or certain rates (Jacquot, 1981). In this context, as in other engineering disciplines, there are many relevant formalisms and standards, whose description is beyond the scope of this chapter. An interested reader can find an introductory presentation of these aspects in (Jacquot, 1981). Despite their advantages, there are few approaches to the application of MAS technology to process automation (far fewer than in other fields such as the manufacturing industry). Some reasons for this lack of application can be found in (Seilonen, 2005):

• Process automation requires run-time specifications that are difficult to reach with current agent technology.
• The parameters in the automation process design are usually interconnected in a strict way, so it is highly difficult to decompose the task into agent behaviors.
• Lack of parallelism to be modeled through agents.

In spite of these difficulties, some significant approaches to the application of MAS to process control can be distinguished: •

918

An interesting approach of application of MAS to process control is that in which communication techniques among agents are used as a mechanism of integration among systems independently designed. An example of this approach is the

•

•

ARCHON (Architecture for Cooperative Heterogeneous on-line systems) architecture (Cockburn, 1996) that has been used in at least three engineering domains: Electricity Transportation, Electricity Distribution and Particle Accelerator Control. In ARCHON, each application program (known as Intelligent System) is provided with a layer (called Archon Layer) that allows it to transfer data/messages to other Intelligent Systems. A second approach consists of those systems that implement a closed loop-based control. In this sense, we will cite the work of (Velasco et al., 1996) for the control of a thermal central. A different proposal consists of complementing a pre-existing process automation system with agent technology. In other words, it is a complementation, not a replacement. The agent system is an additional layer that supervises the automation system and reconfigures it when it is necessary. Seilonen et al. also propose a specification of a BDI-model-based agent platform for process automation (Seilonen, 2005). V. Gyurjyan et al. (2003) propose a controller system architecture with the ability of combining heterogeneous processes and/or control systems in a homogeneous environment. This architecture (based on the FIPA standard) develops the agents as a level of abstraction and uses a description of the control system in a language called COOL (Control Oriented Ontology Language). Tetiker et al. (2006) propose a decentralized multi-layered agent structure for the control of distributed reactor networks where local control agents individually decide on their own objectives allowing the framework to achieve multiple local objectives concurrently at different parts of the network. On top of that layer, a global observer agent continuously monitors the system. Horling, Lesser et al. (2006) describe a soft realtime control architecture designed to address temporal and ordering constraints, shared resources and the lack of a complete and consistent world view. 
Designed in response to challenges encountered in a real-time distributed sensor allocation environment, the system is able to generate schedules respecting temporal, structural and resource constraints, to merge new goals with existing ones, and to detect and handle unexpected results from activities. Another proposal for a real-time control architecture is CIRCA (A Cooperative Intelligent Real-Time Control Architecture) by Musliner, Durfee and Shin (1993), which uses separate AI and real-time subsystems to address the problems for which each is designed.

In this context, we proposed a MAS (called MASCONTROL) for the identification and control of processes, whose design follows the FIPA specifications (FIPA, 2007) regarding architecture, communication and protocols. This MAS implements a self-tuning regulator (STR) scheme, so it is not a new general control algorithm but a new approach to developing one. Its main contribution is to show that a controller built with MAS and ontologies, expressed in OWL (Web Ontology Language), can control systems autonomously, using actions whose description is available, for example, on the web, and reading from that description (without knowing it a priori) the logic of how to perform the control. Our experience is that agents offer no advantage if they are not intelligent, and that ontologies are an intelligent way to manage knowledge, since they provide the common format in which agents can express that knowledge. Two important advantages of their use are extensibility and communication with other agents sharing the same language. These advantages are most evident in open systems, that is, when different MAS from different developers interact (Gonzalez, 2006).

As an STR, our MAS carries out the identification and control of a plant. We consider that this model can be properly managed by a MAS for two main reasons:

• An STR scheme contains modules that are conceptually different, such as the direct interaction with the plant to be controlled, the identification of the system and the determination of the best values for the controller parameters.
• The calculations can be carried out in parallel. For instance, several transfer functions can be explored simultaneously. Thus, several agents can be launched on different computers, taking advantage of the parallelism provided by the MAS.
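The second reason, the simultaneous exploration of several candidate models, can be illustrated with a minimal sketch. It is not the MASCONTROL implementation: the first-order model form, the candidate grid, the function names and the use of local threads in place of agents running on different computers are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(a, b, u, y0=0.0):
    """Simulate the (assumed) first-order model y[k+1] = a*y[k] + b*u[k]."""
    y = [y0]
    for uk in u[:-1]:
        y.append(a * y[-1] + b * uk)
    return y


def squared_error(candidate, u, y_measured):
    """Sum of squared errors between a candidate model and the measured output."""
    a, b = candidate
    y_model = simulate(a, b, u, y_measured[0])
    return sum((ym - yp) ** 2 for ym, yp in zip(y_model, y_measured))


def best_candidate(candidates, u, y_measured, workers=4):
    """Evaluate all candidates concurrently; return the one with the lowest error.

    Each worker thread stands in for an identification agent (an assumption).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        errors = list(pool.map(lambda c: squared_error(c, u, y_measured), candidates))
    return min(zip(errors, candidates))[1]


# Hypothetical plant data: unit step response of a plant with a = 0.8, b = 0.5.
u = [1.0] * 20
y_measured = simulate(0.8, 0.5, u)

# Hypothetical grid of candidate (a, b) pairs to explore in parallel.
candidates = [(a / 10, b / 10) for a in range(5, 10) for b in range(1, 8)]
best = best_candidate(candidates, u, y_measured)
```

Because each candidate is scored independently of the others, the same function could be handed to separate agents without changing the selection step.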

Another innovative aspect of this work is the use of artificial neural networks (ANN) for the identification and determination of the parameters. ANN and STR present clear analogies. Training a neural network consists of finding the best values of the network weights, while an STR requires optimizing the parameters of a model (identification) or of a controller. Because of this similarity of methods, we have considered the application of ANN training methods to control problems. In this case, ANN are applied for two purposes: optimizing the parameters of a model of the unknown system and optimizing the parameters of a controller. The resulting system can thus be seen as a hybrid intelligent system for a real-time application. The interested reader can find a deeper description of the system in (Gonzalez, 2006b).

It is important to remark that this framework can be used with any identification and control algorithm. We have tested the MAS on several different plants and obtained proper behavior. However, because of the transmission rate and the optimization time, the designed MAS should only be used to control processes that are not excessively fast, in accordance with the first restriction stated above. Nevertheless, we expect to have shown an example of how the other two restrictions (strong interdependency of the parameters and lack of parallelism) can be overcome.

As can be seen, the mentioned restrictions often become serious obstacles in the application of MAS to Engineering Systems. In this framework, Fuzzy rules are a very common way to define the behaviours of single agents (Hoffmann, 2003). Unfortunately, defining the rules is cumbersome in most cases. To ease the difficult task of generating adequate rules, several automatic algorithms have been proposed, including new rule extraction approaches based on connectionist models.
Among them, Neuro-Fuzzy systems have proven to be a way to obtain the rules, taking advantage of the learning properties of Neural Networks and of the way Fuzzy rules express knowledge (Mitra and Hayashi, 2000). In this context, several applications have been developed. In Robotics, we can cite the work of Lee and Qian (1998), who describe a two-component system for picking up moving objects for a vibratory feeder, and the work of Kiguchi (2004), who proposes a hierarchical neuro-fuzzy controller for a robotic exoskeleton that assists the motion of physically weak persons such as elderly, disabled and injured persons.

As a particular case, a system for the detection and identification of road markings is presented in this chapter. This system has been incorporated into a vehicle, as can be seen in Figure 1. It is based on infrared technology and a classification tool based on a Neuro-Fuzzy System. A particular feature of this kind of task is that detection and classification have to be done in real time. Hence, the time consumed by the hardware system and the processing algorithms is critical for taking the right decision within the time frame of its relevance. When an inexpensive and fast system is sought, infrared technology is a good alternative for this kind of application. Given the time limitations, a device based on infrared technology is combined with different techniques for extracting convenient Fuzzy rules (Marichal, 2006). It is important to remark that the extraction and the interpretation of

the rules have generated great interest in recent years (Guillaume, 2007). The final purpose is to achieve a MAS in which each agent does its work as fast as possible, overcoming the temporal limitations of MAS pointed out by (Seilonen, 2005). In this context, we would like to mention some approaches that apply MAS to decision fusion for distributed sensor systems, in particular that of Yu and Sycara (2006).

To achieve such a MAS, it is necessary to obtain the rules for each agent. Furthermore, an in-depth analysis of the rules has to be done, minimizing their number and establishing the mapping between the rules and the different scenarios. The approach used in the case shown here is based on designing rules for each situation encountered by the vehicle. In fact, each different scenario should be expressed by its own rules. This feature gives more flexibility in the process of designing the desired MAS. For that reason, separating the rules according to the kind of road marking helps in this purpose. Table 1 shows the result of this process for the infrared system shown in Figure 1. The reference values are the values associated with each road marking, the range is the interval in which the output values of the resulting Fuzzy system can lie for a particular sign, and the rules are identified by an order number.

It is important to remark that the obtained rules need to be interpreted. In this way, it is possible to associate these rules with different situations and to generate new rules that are more appropriate for the particular case under consideration. Hence, the agents related to the detection and classification of the signs can be expressed by this set of Fuzzy rules. Moreover, the agents in charge of taking decisions based on the information provided by the detection and classification of a particular road marking can incorporate these rules. The problems in the task decomposition process pointed out by (Seilonen, 2005) can be simplified in this way. On the other hand, although the design of behaviors is very important, the issues related to co-operation among agents are also essential. In this context, the work of (Howard et al., 2006) can be cited.

Figure 1. Infrared system under the vehicle

Table 1. Rules extracted by the neuro-fuzzy approach

Road marking           Range     Reference Value   Rules
Arrow                  [0 2)     1                 6, 7
Right Arrow            [2 4)     3                 8, 9, 10, 11, 12
Yield                  [4 6)     5                 13, 14, 15, 16, 17, 18
Forward-right Arrow    [6 8]     7                 19, 20, 21, 22, 23, 24, 25
Other Rules            [-1 0]    -                 1, 2, 3, 4, 5
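The mapping in Table 1 between the output of the Fuzzy system and the road marking can be sketched as a simple range lookup. This is only an illustration under stated assumptions: the function and variable names are hypothetical, and because the table lists both [-1 0] and [0 2), the sketch resolves an output of exactly 0 in favour of the first matching entry.

```python
# Output ranges from Table 1: (low, high, high_inclusive, label, reference value).
# The entry order decides ties at shared boundaries (an assumption of this sketch).
RANGES = [
    (-1.0, 0.0, True, "other", None),
    (0.0, 2.0, False, "arrow", 1),
    (2.0, 4.0, False, "right arrow", 3),
    (4.0, 6.0, False, "yield", 5),
    (6.0, 8.0, True, "forward-right arrow", 7),
]


def classify(output_value):
    """Return the road-marking label whose range contains the fuzzy output."""
    for low, high, high_inclusive, label, _reference in RANGES:
        below_top = output_value <= high if high_inclusive else output_value < high
        if low <= output_value and below_top:
            return label
    return None  # the output lies outside every range listed in Table 1


marking = classify(3.2)
```

A decision agent could call such a lookup on each new output value and then apply only the rule subset that Table 1 associates with the returned marking.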

FUTURE TRENDS

As technology provides faster and more efficient computers, the application of AI techniques to MAS is expected to become increasingly popular. This improvement in computing capacity, together with some emerging techniques (meta-level accounting, schedule caching, variable time granularities, etc.) (Horling, Lesser et al., 2006), means that other AI methods that cannot currently be applied in the field of System Engineering will be introduced efficiently in the near future. In our opinion, another important aspect to be explored is the improvement of MAS communication. Beyond the aspects related to new hardware features, it is also worth looking for more efficient MAS protocols and standards. These improvements would make it possible, for example, to develop operational real-time tele-operated applications.

CONCLUSION

The application of MAS to Engineering Systems and Robotics is an attractive platform for the convergence of various AI technologies. This chapter has summarized how different AI techniques (ANN, Fuzzy rules, Neuro-Fuzzy systems) have been successfully included in MAS technology in the field of System Engineering and Robotics. These techniques can also overcome some of the drawbacks traditionally attributed to MAS applications, in particular the difficulty of decomposing the task into agent behaviors and the lack of parallelism to be modeled through agents. However, present-day MAS technology does not completely fulfill the severe real-time requirements that are implicit in automation processes. Thus, until technology provides faster and more efficient computers, our opinion is that the application of AI techniques in MAS needs to be optimized for real-time systems, for example by extracting convenient Fuzzy rules and minimizing their number.

REFERENCES

Brustoloni, J.C. (1991). Autonomous Agents: Characterization and Requirements. Carnegie Mellon Technical Report CMU-CS-91-204, Pittsburgh: Carnegie Mellon University.

Cockburn, D. & Jennings, N.R. (1996). ARCHON: A Distributed Artificial Intelligence System for Industrial Applications. In G.M.P. O'Hare and N.R. Jennings, editors, Foundations of Distributed Artificial Intelligence. John Wiley & Sons.

Destercke, S., Guillaume, S. & Charnomordic, B. (2007). Building an interpretable fuzzy rule base from data using orthogonal least squares: application to a depollution problem. Fuzzy Sets and Systems, 158, 2078-2094.

FIPA web site. http://www.fipa.org. Last access: 15th August 2007.

Franklin, S. & Graesser, A. (1996). Is it an Agent, or just a Program?: A Taxonomy for Autonomous Agents. Intelligent Agents III. Agent Theories, Architectures and Languages (ATAL'96).

González, E.J., Hamilton, A., Moreno, L., Marichal, R. & Muñoz, V. (2006). Software experience when using ontologies in a multi-agent system for automated planning and scheduling. Software - Practice and Experience, 36(7), 667-688.

González, E.J., Hamilton, A., Moreno, L., Marichal, R., Marichal, G.N. & Toledo, J. (2006b). A MAS Implementation for System Identification and Process Control. Asian Journal of Control, 8(4), 417-423.

Gyurjyan, V., Abbott, D., Heyes, G., Jastrzembski, E., Timmer, C. & Wolin, E. (2003). FIPA agent based network distributed control system. 2003 Computing in High Energy and Nuclear Physics (CHEP03).

Hewitt, C. (1977). Viewing Control Structures as Patterns of Passing Messages. Artificial Intelligence, 8(3), 323-364.

Hoffmann, F. (2003). An Overview on Soft Computing in Behavior Based Robotics. Lecture Notes in Computer Science, 544-551.

Horling, B., Lesser, V., Vincent, R. & Wagner, T. (2006). The Soft Real-Time Agent Control Architecture. Autonomous Agents and Multi-Agent Systems, 12(1), 35-91.

Howard, A., Parker, L.E. & Sukhatme, G. (2006). Experiments with a Large Heterogeneous Mobile Robot Team: Exploration, Mapping, Deployment, and Detection. International Journal of Robotics Research, 25(5-6), 431-447.

Jacquot, R.G. (1981). Modern Digital Control Systems. Marcel Dekker, Editor. Electrical engineering and electronics; 11.

Kiguchi, K., Tanaka, T. & Fukuda, T. (2004). Neuro-fuzzy control of a robotic exoskeleton with EMG signals. IEEE Transactions on Fuzzy Systems, 12(4), 481-490.

Lee, K.M. & Qian, Y.F. (1998). Intelligent vision-based part-feeding on dynamic pursuit of moving objects. Journal Manufacturing Science Engineering - Transactions ASME, 120(3), 640-647.

Maes, P. (1995). Artificial Life Meets Entertainment: Life like Autonomous Agents. Communications of the ACM, 38(11), 108-114.

Marichal, G.N., González, E.J., Acosta, L., Toledo, J., Sigut, M. & Felipe, J. (2006). An Infrared and Neuro-Fuzzy-Based Approach for Identification and Classification of Road Markings. Advances in Natural Computation. Lecture Notes in Computer Science, 4222, 918-927.

Mitra, S. & Hayashi, Y. (2000). Neuro-fuzzy rule generation: survey in soft computing framework. IEEE Transactions on Neural Networks, 11(3), 748-768.

Musliner, D., Durfee, E. & Shin, K. (1993). CIRCA: A Cooperative Intelligent Real-Time Control Architecture. IEEE Transactions on Systems, Man and Cybernetics, 23(6).

Russell, S.J. & Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice Hall.

Seilonen, I., Koskinen, K., Pirttioja, T., Appelqvist, P. & Halme, A. (2005). Reactive and Deliberative Control and Cooperation in Multi-Agent System Based Process Automation. 6th IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA 2005).

Tetiker, M.D., Artel, A., Tatara, E., Teymour, F., North, M., Hood, C. & Cinar, A. (2006). Agent-based System for Reconfiguration of Distributed Chemical Reactor Network Operation. Proceedings of the American Control Conference.

Velasco, J., González, J.C., Magdalena, L. & Iglesias, C. (1996). Multiagent-based control systems: a hybrid approach to distributed process. Control Engineering Practice, 4, 839-846.

Yu, B. & Sycara, K. (2006). Learning the Quality of Sensor Data in Distributed Decision Fusion. International Conference on Information Fusion (Fusion 06), Florence, Italy.

KEY TERMS

Artificial Neural Network: An organized set of many simple processors, called neurons, that imitates a biological neural configuration.

FIPA: Foundation for Intelligent Physical Agents, an IEEE Computer Society standards organization that promotes agent-based technology and the interoperability of its standards with other technologies.

MultiAgent System: System composed of several agents, usually designed to cooperate in order to reach a goal.

Neuro-Fuzzy: Hybrids of Artificial Neural Networks and Fuzzy Logic.

Ontology: Set of classes, relations, functions, etc. that represents knowledge of a particular domain.

Real-Time System: System with operational deadlines from event to system response.

Self-Tuning Regulator: Type of adaptive control system composed of two loops: an inner loop (process and ordinary linear feedback regulator) and an outer loop (recursive parameter estimator and design calculation, which adjusts the regulator parameters).


Intelligent Query Answering Mechanism in Multi Agent Systems

Safiye Turgay
Abant İzzet Baysal University, Turkey

Fahrettin Yaman
Abant İzzet Baysal University, Turkey

INTRODUCTION

The query answering system carries out the data selection, data preparation, pattern discovery and pattern development processes in an agent-based structure within the multi agent system, and it is designed to ensure communication between agents and the effective operation of agents within the multi agent system. The system is designed to process and evaluate fuzzy, incomplete information by means of the fuzzy SQL query method. The modelled system becomes intelligent thanks to the fuzzy approach, and it makes predictions about the future through its learning process.

The operation mechanism of the system is a process in which the agents within the multi agent system filter and evaluate, according to certain criteria, both the knowledge in their databases and the knowledge they receive externally. The system uses two types of knowledge. The first is the data existing in the agent databases within the system; the second is the data that agents receive from the outer world and that is not included in the evaluation criteria. Upon receiving data from the outer world, the agent first evaluates it in the knowledge base, then evaluates it for use in the rule base, and finally applies an evaluation process to the rule base in order to store the knowledge in the task base. Meanwhile, the agent also completes the learning process.

This paper presents this intelligent query answering mechanism. The following sections present the necessary literature review and the query answering approach, followed by the future trends and the conclusion.

BACKGROUND

The query answering system in agents takes fuzzy SQL queries from the agents, then creates and optimizes a query plan that involves the multiple data sources of the whole multi agent system. Accordingly, it controls the execution of the task that generates the data set. The query operation is the basic function of query answering, and through it the most important function of the system is fulfilled. This study also discusses the peer to peer network structure and the SQL structure, as well as the query operation.

The query operation has been applied in various fields. For example, selecting the related knowledge in a web environment has been evaluated in terms of the relational concept in databases. A relational database system particularly assists the system in making evaluations for decisions about the future and in making the right decisions with the fuzzy logic approach (Raschia & Mauaddib, 2002; Tatarinov et al., 2003; Galindo et al., 2001; Bosc & Pivert, 1995; Chaudhry et al., 1999; Saygın et al., 1999; Turgay et al., 2006). The query operation has mostly been used for choosing the related information in a web environment (Jim & Suciu, 2001; He et al., 2004). The data mining approach has been used in the dynamic site discovery process, through data preparation and type recognition approaches, for complex schema matching with correlation values in query interfaces and query schemas (Nambiar & Kambhampati, 2006; Necib & Freytag, 2005). Query processing within a peer to peer network structure with an SQL structure has been discussed in general terms (Cybenko et al., 2004; Bernstein et al., 1981). Query processing and databases have been reviewed for relational databases (Genet & Hinze, 2004; Halashek-Wiener et al., 2006). The fuzzy set was proposed by Zadeh (1965), and the division of features into various linguistic values has been widely used in pattern recognition and in fuzzy inference systems. Kubat et al. (2004) reviewed the frequency of the fuzzy logic approach in operations research methods, as well as artificial intelligence ones, in discrete manufacturing.

Data processing within multi-agent systems can be grouped as static or dynamic. The evaluation of existing data by the system can be referred to as a static structure, while the evaluation of new or possible data within the system can be referred to as a dynamic structure. The studies on the static structure concern the query process of database management (McClean, Scotney, Rutjes & Hartkamp, 2003), and the studies on the dynamic structure concern the agent system as a whole (Purvia, Cranefield, Bush & Carter, 2000; Hoschek, 2002; Doherty, Lukaszewicz, & Szalas, 2004; Turgay, 2006).

AGENT BASED QUERY ANSWERING SYSTEM

The query process lists the knowledge that has the desired characteristics and complies with the required condition, while query answering finds the knowledge that conforms to the required conditions and responds to the related message in the form of knowledge. In particular, a well-defined query answering process within a multi agent system provides communication among agents, the sharing of knowledge, and the effective performance of data processing and learning activities. The system is able to process incomplete or fuzzy knowledge intelligently with the fuzzy SQL query approach. The distributed query answering mechanism was proposed as a cooperative agent-based solution for information management with fuzzy SQL queries. A multi-agent approach to information management includes features such as:

• Concurrency
• Distributed computation
• Modularity
• Cooperation

Figure 1 represents each agent's query answering mechanism. When data is received by the system, the query variables are chosen by the query, and the data related to them are then evaluated with fuzzy SQL. The obtained result is represented as the answer knowledge in the agent, and the process is thus completed. The data are classified by the fuzzy query approach, depending on fuzzy relations and importance levels. The rule base of the system is formed after a query and evaluation. The mechanism updates the task base structure of the system in line with the obtained fuzzy rules, which ensures that the system makes appropriate and correct decisions and acts intelligently.

Figure 1. Model driven framework for query answering mechanism in a multi-agent system (components shown: Agent 1, Agent 2, ..., Agent n-1, Agent n; Interface Unit; Data; Query; Query Variables; Evaluation with Fuzzy SQL; Query Answer, Find Result; Obtained Rules From Query Based)

Operation Mechanism of Agent Based Fuzzy Query Answering System

The agent does the following:

Step 1: receives the task knowledge from the related agent
Step 2: fuzzifies the knowledge
Step 3: determines fuzzy grade values according to the knowledge features
Step 4: determines the knowledge that complies with the criteria through fuzzy SQL commands
Step 5: sends the obtained task or rule to the related agent
Step 6: performs the answering operation

The agent based query answering system involves three main stages: knowledge processing, query processing and agent learning (see Figure 2). These stages are described in detail below.
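The six steps of the operation mechanism can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the triangular membership function, the record layout, the threshold and all names are assumptions, and a plain Python filter stands in for the fuzzy SQL command of Step 4.

```python
def triangular(x, left, peak, right):
    """Triangular membership function returning a grade in [0, 1]."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def answer_query(records, feature, fuzzy_set, threshold):
    """Steps 2-4: fuzzify one feature, grade each record, keep those meeting the criterion."""
    graded = [(triangular(r[feature], *fuzzy_set), r) for r in records]
    selected = [(grade, r) for grade, r in graded if grade >= threshold]
    # Sort on the grade only, best match first.
    return sorted(selected, key=lambda pair: pair[0], reverse=True)


# Step 1: task knowledge received from the related agent (hypothetical records).
records = [{"id": 1, "load": 20}, {"id": 2, "load": 55}, {"id": 3, "load": 48}]

# Steps 5-6: the graded result that would be sent back as the answer.
answer = answer_query(records, "load", (30, 50, 70), threshold=0.5)
```

Here the hypothetical fuzzy set (30, 50, 70) plays the role of a linguistic value such as "medium load"; the records with ids 3 and 2 are returned with grades 0.9 and 0.75, and record 1 is filtered out.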

Knowledge Processing

This is the stage in which the agent receives knowledge from the external environment and makes the necessary preparations before the query. The criteria and keywords to be used in evaluating the received data are defined in this stage, which can also be called pre-query. The keywords, concepts, attributes and relationship knowledge to be analysed by the agent are determined in this stage, before the query. In this system, the behaviour structure of the intelligent query answering system is formed.

Figure 2. Suggested system model for each agent (blocks shown: Real World; Agent; Receive knowledge; Knowledge Processing, with fuzzification of the received knowledge into Data F1, ..., Data Fn; Query Processing, with Query, Query Parsing, Fuzzy Query Decomposition and Fuzzy Query Optimization; Agent Learning, with evaluation of each rule from the query results and determination of the agent task and the bid from the rule results; Query Process; Answering Process)

In modelling the system, the perception model plays an important role: it treats the signals, data and knowledge arriving from the external environment as input, which makes the structure of the learning module easier to understand. This input is defined as the perception set; the percept x received by agent i from the external environment is denoted Ai,x. Table 1 gives the nomenclature of the agent based query answering system.

The multi-agent system consists of more than one agent. The agent set is A={A1, A2,…,Ai}. The knowledge set is K={K1, K2, ...,Ky}; the knowledge base of agent i is denoted Ki,y (see Table 1 and Figure 3). The rule set is R={R1, R2, ...,Rx}; the rule base of agent i is denoted Ri,x. The task set is T={T1, T2, ...,Tj}; the task base of agent i is denoted Ti,t. When data arrives from the external environment, it is perceived as input: when "x" is perceived by Agent i, it is referred to as Ai,x. This input can also be used in the knowledge base, the rule base and the task base.

The intelligent query answering mechanism should achieve the following goals, which are determined by processing and evaluating the information that reaches the knowledge base:

• Goal definition
• Data selection
• Data preparation
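The sets introduced above (agents, knowledge, rules, tasks) and the handling of a percept can be pictured with a small data structure. This is only a sketch under assumptions: the class, field and function names are hypothetical, and the keyword test stands in for whatever pre-query criteria an agent actually defines.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Agent i with the three stores of Table 1: knowledge, rule and task base."""
    name: str
    knowledge_base: list = field(default_factory=list)
    rule_base: list = field(default_factory=list)
    task_base: list = field(default_factory=list)

    def perceive(self, percept, is_relevant):
        """Store a percept in the knowledge base if it passes the pre-query criteria."""
        if is_relevant(percept):
            self.knowledge_base.append(percept)
            return True
        return False


# Hypothetical pre-query criterion: keep only percepts about known keywords.
keywords = {"price", "stock"}
agent = Agent("A1")
kept = agent.perceive({"keyword": "price", "value": 12}, lambda p: p["keyword"] in keywords)
dropped = agent.perceive({"keyword": "weather", "value": 3}, lambda p: p["keyword"] in keywords)
```

Rules and tasks derived later from queries over the knowledge base would be appended to `rule_base` and `task_base` in the same way.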

Query Processing

The agent performs two types of query while defining keywords, concepts or attributes during knowledge processing. The first is the external query, which takes place among the agents; the second is the internal query, in which the agent scans the knowledge it holds itself. The fuzzy SQL approach is applied in both query processes. The attribute At and the relation Re (referred to as the relation or resource) are elements formed among the components within the system. These elements are the databases of the knowledge base, the rule base and the task base. While the attribute refers to the agent specifications, the resource includes not only the raw data received externally but also the knowledge base, rule base and task base that each agent possesses.

Table 1. The nomenclature of agent based query answering system

Symbol            Meaning
A                 agent set {A1, A2, ..., Ai}
T                 task set {T1, T2, ..., Tj}
Ai,x              percept x of agent i
∪(k=1..m) Tjk     task sets j of agent i, made up of the consecutive subsets from k to m
Li,m              learning situation m of agent i
Qi,n              querying situation n of agent i
Ati               attribute situation of agent i
Ri,r              decision situation r of agent i
Ki,y              knowledge base y of agent i
Ri,x              rule base x of agent i
Ti,t              task base t of agent i


A={At, Re(Ki,y, Ri,x, Ti,t)}

Let P(At) denote the set of all possibility distributions that may be defined over the domain of an attribute At. A fuzzy relation R with schema At1, At2, …, Atn, where each Ati is an attribute, is defined as R=P(At1)×P(At2)×…×P(Atn)×D, where D is a system-supplied attribute for the membership degree with domain [0,1] and × denotes the cross product. Each data value v of an attribute is associated with a possibility distribution defined over the domain of the attribute and has a membership function denoted by µv(x). If the data value is crisp, its possibility distribution is defined by

\[ \mu_v(x) = \begin{cases} 1 & \text{if } x = v \\ 0 & \text{otherwise} \end{cases} \qquad (1) \]

Like standard SQL, queries in fuzzy SQL are specified in a select statement of the following form:

SELECT Attributes
FROM Relations
WHERE Selection Conditions

The semantics of a fuzzy SQL query is defined on the basis of the satisfaction degrees of the query conditions. Consider a predicate XΘY in a WHERE clause. The satisfaction degree, denoted by d(XΘY), is evaluated for the values of X and Y. Let the value of X be v1 and that of Y be v2. Then

\[ d(X \Theta Y) = \max_{X,Y} \min\bigl(\mu_{v_1}(X),\; \mu_{v_2}(Y),\; \mu_{\Theta}(X, Y)\bigr) \qquad (2) \]

where X and Y range over the crisp values in the common domain over which v1 and v2 are defined (Yang et al., 2001). Θ is a comparison function whose membership function µΘ gives the degree to which the comparison between the two variables is satisfied.

As shown in Figure 2, the bids are taken as a set, the frequencies of the received bids are fixed, and the bids are then decomposed into groups. The decomposed bids are included in the databases of the multi-agent system. The information in the databases is fuzzified, and the interrelations within it are determined in terms of weight and importance level.
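Equations (1) and (2) can be evaluated directly over a small discrete domain. The following is a minimal sketch, assuming a finite integer domain and two hypothetical membership functions ("about 30" and "approximately equal"); none of these names or shapes come from the original text.

```python
def crisp(v):
    """Possibility distribution of a crisp value v, as in Equation (1)."""
    return lambda x: 1.0 if x == v else 0.0


def satisfaction_degree(mu_v1, mu_v2, mu_theta, domain):
    """Equation (2): max over (x, y) of min(mu_v1(x), mu_v2(y), mu_theta(x, y))."""
    return max(
        min(mu_v1(x), mu_v2(y), mu_theta(x, y))
        for x in domain
        for y in domain
    )


def about_30(x):
    """Hypothetical fuzzy value 'about 30' (triangular, width 10)."""
    return max(0.0, 1.0 - abs(x - 30) / 10)


def approx_equal(x, y):
    """Hypothetical fuzzy comparator 'approximately equal' (tolerance 5)."""
    return max(0.0, 1.0 - abs(x - y) / 5)


domain = range(0, 61)

# Degree to which "about 30" is approximately equal to the crisp value 33.
d = satisfaction_degree(about_30, crisp(33), approx_equal, domain)
```

With these assumed shapes the maximum is reached at X = 32, Y = 33, where both the membership of "about 30" and the comparator grade equal 0.8, so d is 0.8. When both values are crisp and the comparator is exact equality, the degree reduces to 1 or 0, which matches ordinary SQL semantics.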

Agent Learning Process

This is the process in which the system learns the knowledge obtained from a query as a rule or a task. The system fulfils not only the task but also the learning process (see Figure 3). During its activities, the agent system processes the data arriving from the external environment in line with the defined aim, and learning is acquired in this way. The learning algorithm reflects the variability of the system status (see Table 2). In the learning process, with the help of query processing, candidate rules are determined by taking the fuzzy dimension attributes and the attribute measures into consideration. It is therefore fair to say that the system has a hierarchical order that runs from the knowledge base to the rule base and from the rule base to the task base.


Table 2. The query answering mechanism's learning analysis algorithm

Algorithm Learning Analysis
Input: A relational view that contains a set of records and the questions for influence analysis.
Output: An efficient association rule.
Step 1: Specifies the fuzzy dimension attribute and the measure attribute.
Step 2: Identifies the fuzzy dimension item sets and calculates the support coefficient.
Step 3: Identifies the measure item sets and calculates the support coefficient.
Step 4: Constructs sets of candidate rules, and computes the confidence and aggregate value.
Step 5: Obtains a rule at the granularity level with greatest confidence, and forms a rule at the aggregation level with largest abstract value of the measure attribute.
Step 6: Computes the assertions at different levels, exits if comparable (i.e., there is no inconsistency found in semantics at different levels).
Step 7: Generates rules from the refined measure item sets and forms the framework of the rule.
Step 8: Constructs the final rule as a task for related agent.
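Steps 2 to 5 of the algorithm rely on support and confidence values for candidate rules. The text does not give the formulas, so the sketch below assumes one common choice for fuzzy association rules: the support of an item set is the mean over records of the minimum membership grade, and confidence is the ratio of two supports. The record contents and all names are hypothetical.

```python
def fuzzy_support(records, items):
    """Mean over the records of the minimum membership grade among the items."""
    return sum(min(r[i] for i in items) for r in records) / len(records)


def fuzzy_confidence(records, antecedent, consequent):
    """support(antecedent and consequent) divided by support(antecedent)."""
    base = fuzzy_support(records, antecedent)
    if base == 0:
        return 0.0
    return fuzzy_support(records, antecedent + consequent) / base


# Hypothetical membership grades of three records in two fuzzy item sets.
records = [
    {"price_high": 0.9, "demand_low": 0.8},
    {"price_high": 0.6, "demand_low": 0.7},
    {"price_high": 0.1, "demand_low": 0.2},
]

# Candidate rules as (antecedent, consequent) pairs; keep the most confident one.
candidates = [(["price_high"], ["demand_low"]), (["demand_low"], ["price_high"])]
best_rule = max(candidates, key=lambda rule: fuzzy_confidence(records, *rule))
best_confidence = fuzzy_confidence(records, *best_rule)
```

In this example the rule "price_high implies demand_low" is selected; in the scheme of Table 2 such a rule would then be checked for consistency across levels before being stored as a task for the related agent.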

Figure 3. The way the input perceived by the agent is processed (flowchart: the percept includes the knowledge base, the rule base or the task base; querying the knowledge base uses FSQL; querying the rule base is <Ri,x, Ai, ∅>; querying the task base is <Ti,t, Ai, ∅>; learning is <Ki,y, Ri,x, Ti,t, Ai, ∅>; the task is then realized)


FUTURE TRENDS

Future tasks of the system will be carried out as the system answers queries more quickly, thanks to the distributed, autonomous, intelligent and communicative agent structure of the suggested agent based fuzzy query answering system. In the fuzzy approach, the system will first examine and group the relational data in the agents' databases with fuzzy logic, and will then shape its rule base by applying the fuzzy logic method to these data. After the related rule is chosen, the rule base of the system will be designed and the decision mechanism of the system will operate. Therefore, the relational database structure and the system behaviour are important for determining the initial characteristics of the system and for data cleaning.

For future research, it should be noted that the design of fuzzy databases involves not just modelling the data but also modelling the operations on the data. Relational databases support only limited data types, while fuzzy and possibility databases allow a much larger number of comparatively complex data types (e.g., possibility distributions). This suggests that it might be fruitful to employ object-oriented database technology to allow explicit modelling of complex data types. The incorporation of fuzziness into distributed events could be addressed in a future study. Finally, because the positions and status of objects change frequently in an active mobile database environment, the issue of temporality should be considered by adapting the research results of the temporal database systems area to active mobile databases.

CONCLUSION

This paper discusses a variety of issues in adapting fuzzy database concepts to an active multi-agent database system which incorporates active rules in a multi-computing environment. This study shows how fuzziness can be introduced to different aspects of rule execution, from event detection to coupling modes. As an initial step, membership degree calculation for various types of composite events has been explained. Dynamic determination of coupling modes has been done by using the strengths of events and the reliabilities of conditions, which are calculated via membership functions. Strengths of events and condition reliabilities have been shown to be useful for condition and action status as well. The partitioning of the rule set into multi-agent system events has also been discussed as an example of inter-rule fuzziness. Similarity-based event detection has been introduced to active multi-agent databases, which is an important contribution from the perspective of performance.
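The following sketch illustrates, in a simplified way, how the strength of a composite event and a coupling mode could be derived from membership degrees. It is only an assumption-laden illustration: the minimum and maximum operators are the classical fuzzy conjunction and disjunction, not necessarily the ones used in this work, and the function names and the thresholds 0.8 and 0.5 are hypothetical.

```python
def conjunction_strength(*strengths: float) -> float:
    """Strength of 'E1 AND E2 AND ...' using the minimum operator (assumed)."""
    return min(strengths)


def disjunction_strength(*strengths: float) -> float:
    """Strength of 'E1 OR E2 OR ...' using the maximum operator (assumed)."""
    return max(strengths)


def coupling_mode(event_strength: float,
                  condition_reliability: float,
                  immediate_at: float = 0.8,   # hypothetical threshold
                  deferred_at: float = 0.5) -> str:  # hypothetical threshold
    """Pick a coupling mode from the weaker of the two degrees."""
    score = min(event_strength, condition_reliability)
    if score >= immediate_at:
        return "immediate"
    if score >= deferred_at:
        return "deferred"
    return "detached"


# Two primitive events detected with membership degrees 0.9 and 0.6.
both = conjunction_strength(0.9, 0.6)
either = disjunction_strength(0.9, 0.6)
mode = coupling_mode(both, condition_reliability=0.7)
```

The three mode names are the usual coupling modes of active database rules; how the thresholds are actually chosen would depend on the application.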


KEY TERMS

Agent: A system that fulfils independent functions, perceives the outer world and establishes links with other agents through its software.

Flexible Query: A query that incorporates some elements of natural language so as to allow a simple and powerful expression of subjective information needs.

Fuzzy SQL (Structured Query Language): An extension of the SQL language that allows flexible conditions to be written in queries. FSQL allows the use of linguistic labels defined on any attribute.



Fuzzy SQL Query: Fuzzy SQL allows the system to make flexible queries about crisp or fuzzy attributes in fuzzy relational data or knowledge.

Intelligent Agent: A sophisticated intelligent computer program that is situated, independent, reactive, proactive and flexible, recovers from failure and interacts with other agents.

Multi-Agent System: A flexible incorporated network of software agents that interact to solve problems that are beyond the individual capacities or knowledge of each problem solver.

Query: Carries out the scanning of the data according to the required specifications.

Query Answering: Answers a user query with the help of a single database or multiple databases in the multi-agent system.

System: A set of components considered to act as a single goal-oriented entity.


Intelligent Radar Detectors

Raúl Vicen Bueno, University of Alcalá, Spain
Manuel Rosa Zurera, University of Alcalá, Spain
María Pilar Jarabo Amores, University of Alcalá, Spain
Roberto Gil Pita, University of Alcalá, Spain
David de la Mata Moya, University of Alcalá, Spain

INTRODUCTION

Artificial Neural Networks (ANNs) are inspired by the behaviour of the brain, so they can be regarded as intelligent systems. ANNs are built from the same basic element as a brain: the neuron. These neurons are connected so that they interact with each other, and the intelligence of the network emerges from that interaction. Finally, like any brain, an ANN needs memory, which in this model is provided by its weights. From this point of view, ANNs are able to learn difficult tasks. In this article, the task to learn is to decide whether a reflected signal, called the target, is present or absent in a Radar environment dominated by clutter. The clutter comprises all the signals reflected from objects in the Radar environment other than the desired target. Noise is also considered in this environment, because it is present in every communications system.

BACKGROUND

ANNs, as intelligent systems, are able to detect known targets in adverse Radar conditions. These conditions involve one of the most difficult types of clutter that can be found, coherent Weibull clutter. This is possible because ANNs trained in a supervised way can approximate the Neyman-Pearson (NP) detector (De la Mata-Moya, 2005, Vicen-Bueno, 2006, Vicen-Bueno, 2007), which is usually used in Radar systems design. This detector maximizes the probability of detection (Pd) while keeping the probability of false alarm (Pfa) lower than or equal to a given value (Van Trees, 1997).

The detection of targets in the presence of clutter is the main problem in Radar detection systems. Many clutter models have been proposed in the literature (Cheikh, 2004), although one of the most widely used is the Weibull model (Farina, 1987a, DiFranco, 1980). The research presented in (Farina, 1987b) established the optimum detector for target and clutter with arbitrary Probability Density Functions (PDFs). Because analytical expressions for the optimum detector cannot be obtained, only suboptimum solutions were proposed. The Target Sequence Known A Priori (TSKAP) detector is one of them and is taken as the reference for the experiments. These solutions also pose implementation problems, some of which make them non-realizable.

As mentioned above, one kind of ANN, the MultiLayer Perceptron (MLP), is able to approximate the NP detector when it is trained in a supervised way to minimize the Mean Square Error (MSE) (Ruck, 1990, Jarabo, 2005). For this reason, MLPs have been applied to the detection of known targets in different Radar environments (Gandhi, 1997, Andina, 1996).
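The NP criterion can be illustrated with a case that, unlike the Weibull problem treated in this article, has a closed-form solution. The sketch below assumes a constant real amplitude observed in n samples of real white Gaussian noise of unit variance, so the test statistic is simply the sum of the samples; the function names are hypothetical and the setting is deliberately simpler than the coherent clutter model used later.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit variance


def np_threshold(pfa: float, n: int) -> float:
    """Threshold on the sum of n unit-variance noise samples for a given Pfa.

    Under H0 the sum is Gaussian with zero mean and standard deviation
    sqrt(n), so the threshold is the (1 - pfa) quantile of that law.
    """
    return sqrt(n) * _STD_NORMAL.inv_cdf(1.0 - pfa)


def np_detection_probability(pfa: float, amplitude: float, n: int) -> float:
    """Pd of the NP test for a known constant amplitude in Gaussian noise.

    Under H1 the sum has mean n * amplitude and standard deviation sqrt(n).
    """
    threshold = np_threshold(pfa, n)
    return 1.0 - NormalDist(mu=n * amplitude, sigma=sqrt(n)).cdf(threshold)


# Assumed example: Pfa fixed at 1e-4, two samples, three candidate amplitudes.
pd_values = [np_detection_probability(1e-4, a, 2) for a in (0.0, 1.0, 3.0)]
```

With zero amplitude the detection probability collapses to the false alarm probability, and it grows with the amplitude, which is the behaviour the NP detector guarantees for a fixed Pfa.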



INTELLIGENT RADAR DETECTORS BASED ON ARTIFICIAL NEURAL NETWORKS

This section starts with a discussion of the models selected for the target, clutter and noise signals. For these models, the optimum and suboptimum detectors are presented; they are taken as a reference for the experiments. Next, the intelligent detector proposed in this work is presented. This detector is based on intelligent systems such as ANNs, and its structure and parameters are analyzed further. Finally, several results are obtained for the detectors under study in order to analyze their performance.

Signal Models: Target, Clutter and Noise

The Radar is assumed to collect N pulses in a scan, so the input vectors (z) presented to the detector are composed of N complex samples. Under hypothesis H0 (target absent), z is composed of N samples of clutter and noise. Under hypothesis H1 (target present), a known target, characterized by a fixed amplitude (A) and phase (θ) for each of the N pulses, is added to the clutter and noise samples. A Doppler frequency of 0.5 · PRF is assumed in the target model, where PRF is the Pulse Repetition Frequency of the Radar system.

The noise is modelled as a coherent white Gaussian complex process of unity power, i.e., a power of ½ for each of the in-phase and quadrature components. The clutter is modelled as a coherent correlated sequence with Gaussian AutoCorrelation Function (ACF), whose complex samples have a modulus with a Weibull PDF:

$$p(|w|) = a\, b^{-a}\, |w|^{a-1}\, e^{-\left(\frac{|w|}{b}\right)^{a}} \qquad (1)$$

where |w| is the modulus of the coherent Weibull sequence and a and b are the skewness (shape) and scale parameters of the Weibull distribution, respectively. The NxN autocorrelation matrix of the clutter is given by

$$(M_c)_{h,k} = P_c\, \rho_c^{|h-k|^{2}}\, e^{\, j 2\pi (h-k) \frac{f_c}{PRF}} \qquad (2)$$

where the indexes h and k vary from 1 to N, Pc is the clutter power, ρc is the one-lag correlation coefficient and fc is the Doppler frequency of the clutter. The relationship between the Weibull distribution parameters and Pc is

$$P_c = \frac{2 b^{2}}{a}\, \Gamma\!\left(\frac{2}{a}\right) \qquad (3)$$

where Γ( ) is the Gamma function. The model used to generate coherent correlated Weibull sequences consists of two blocks in cascade: a correlator filter and a NonLinear MemoryLess Transformation (NLMLT) (Farina, 1987a). To obtain the desired sequence, a coherent white Gaussian sequence is correlated with the filter designed according to (2) and (3). The NLMLT block, according to (1), gives the desired Weibull distribution to the sequence. In that way, it is possible to obtain a coherent sequence with the desired correlation and PDF.

Taking into consideration that the complex noise samples are of unity variance (power), the following power relationships are considered for the study:

• Signal to Noise Ratio: SNR = 10log10(A^2)
• Clutter to Noise Ratio: CNR = 10log10(Pc)
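The relationships between CNR, the clutter power Pc, the Weibull scale parameter b and the clutter autocorrelation matrix can be sketched numerically as follows. The function names are hypothetical, and the values chosen for the one-lag correlation coefficient (0.9) and for fc/PRF (0.0) are illustrative assumptions; only CNR = 30 dB and a = 1.2 are values used in this article.

```python
import cmath
import math


def clutter_power(cnr_db: float) -> float:
    """Clutter power Pc for unit noise power, from CNR = 10*log10(Pc)."""
    return 10.0 ** (cnr_db / 10.0)


def weibull_scale(a: float, pc: float) -> float:
    """Scale b obtained by inverting Pc = (2 * b**2 / a) * Gamma(2 / a)."""
    return math.sqrt(pc * a / (2.0 * math.gamma(2.0 / a)))


def clutter_autocorrelation(n: int, pc: float, rho_c: float,
                            fc_over_prf: float) -> list:
    """N x N matrix with entries
    Pc * rho_c**(|h-k|**2) * exp(j*2*pi*(h-k)*fc/PRF).

    Rows and columns are indexed from 0 here; only the difference h - k
    enters the expression, so the result matches indexes running 1..N.
    """
    return [
        [pc * rho_c ** (abs(h - k) ** 2)
         * cmath.exp(2j * math.pi * (h - k) * fc_over_prf)
         for k in range(n)]
        for h in range(n)
    ]


pc = clutter_power(30.0)          # CNR = 30 dB
b = weibull_scale(1.2, pc)        # a = 1.2
m_c = clutter_autocorrelation(3, pc, rho_c=0.9, fc_over_prf=0.0)  # assumed values
```

The matrix is Hermitian with Pc on its diagonal, which is a quick sanity check on any implementation of the Gaussian ACF model.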

Neyman-Pearson Detectors: Optimum and Suboptimum Detectors

The problem of optimum Radar detection of targets in clutter is explored in (Farina, 1987a) for the case in which both are time correlated and have arbitrary PDFs. The optimum detector scheme is built around two non-linear estimators of the disturbances under both hypotheses, which minimize the MSE. The detection of Gaussian correlated targets in Gaussian correlated clutter plus noise is studied there, but for the cases where the hypotheses are non-Gaussian distributed, only suboptimum solutions are studied.

The proposed detectors basically consist of two channels. The upper channel is matched to the condition that the sequence to be detected is the sum of the target plus clutter in the presence of noise (hypothesis H1), while the lower one is matched to the detection of clutter in the presence of noise (hypothesis H0).

For the detection problem considered in this paper, the suboptimum detection scheme (TSKAP) shown in figure 1 is taken. Considering that the CNR is very high (CNR>>1), the inverse of the NLMLT is assumed to transform the Weibull clutter into Gaussian clutter, so the Linear Prediction Filter (LPF) is a linear filter of order N-1. Then, the NLMLT transforms the filter output into a Weibull sequence. Besides being suboptimum, this scheme presents two important drawbacks:

1. The prediction filters have N-1 memory cells that must contain the suitable information to predict correct values for the N samples of each input pattern. So N+(N-1) pulses are necessary to decide whether the target is present or not.
2. The target sequence must be subtracted from the input of the H1 channel.

There is no sense in subtracting the target component before deciding whether this component is present or not. In practical cases, this makes the scheme non-realizable.

Figure 1. Target sequence known a priori detector

Intelligent Radar Detectors

In order to overcome the drawbacks of the scheme proposed in the previous section, a detector based on an MLP with log-sigmoid activation functions in its hidden and output neurons, followed by a hard limit threshold at its output, is proposed. Also, as MLPs have been proved to approximate the NP detector when minimizing the MSE (Jarabo, 2005), it can be expected that the MLP-based detector outperforms the suboptimum scheme proposed in (Farina, 1987a).

MLPs have been trained to minimize the MSE using two algorithms: back-propagation (BP) with varying learning rate and momentum (Haykin, 1999) and Levenberg-Marquardt (LM) with varying adaptive parameter (Bishop, 1995). While BP is based on the steepest descent method, LM is based on an approximation to the Newton method that is designed specifically for minimizing the MSE. For MLPs which have up to a few hundred weights (W), the LM algorithm is more efficient than BP with variable learning rate or the conjugate gradient algorithms, being able to converge in many cases when the other two algorithms fail (Hagan, 1994). In each iteration, the LM algorithm uses information about the error surface (an estimate of the WxW Hessian matrix) to find the minimum, which makes it faster than the other algorithms.

Cross-validation is used with both training algorithms, where training and validation sets are synthetically generated. Moreover, a new set (test set) of patterns is generated to test the trained MLP and estimate the Pfa and Pd using Monte Carlo simulation. All the patterns of the three sets are generated under the same conditions (SNR, CNR and a parameters of the Radar problem) in order to study the capabilities of the MLP plus hard limit thresholding working as a detector.

MLPs are initialized using the Nguyen-Widrow method (Nguyen, 1999) and, in all cases, the training process is repeated ten times to guarantee that the performance of all the MLPs is similar on average. Once all the MLPs are trained, the best MLP in terms of the MSE estimated with the validation set is selected, in order to avoid the problem of getting stuck in local minima at the end of the training.

The architecture of the MLP considered for the experiments is I/H/O, where I is the number of MLP inputs, H is the number of hidden neurons in its hidden layer and O is the number of MLP outputs. As the MLPs work with real arithmetic, if the input vector (z) is composed of N complex samples, the MLP will have 2N inputs (N in-phase and N quadrature components). The number of MLP independent elements (weights) to solve the problem is W=(I+1)·H+(H+1)·O, including the bias of each neuron.
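The structure of the MLP-based detector described above can be sketched in a few lines. This is only an illustration of the forward pass and of the weight count W=(I+1)·H+(H+1)·O; the function names are hypothetical, the ordering of the inputs (all in-phase parts first, then all quadrature parts) is an assumption, and the weights below are placeholders rather than trained values.

```python
import math


def count_weights(i: int, h: int, o: int) -> int:
    """Independent elements of an I/H/O MLP, biases included."""
    return (i + 1) * h + (h + 1) * o


def logsig(x: float) -> float:
    """Log-sigmoid activation, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mlp_detect(z, hidden_w, hidden_b, out_w, out_b, threshold):
    """Return (MLP output, decision) for a list z of complex samples.

    hidden_w has one row of 2*len(z) weights per hidden neuron; out_w has
    one weight per hidden neuron. The decision is True when H1 is chosen.
    """
    # Real-valued input vector: in-phase parts followed by quadrature parts.
    x = [s.real for s in z] + [s.imag for s in z]
    hidden = [logsig(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    y = logsig(sum(w * hi for w, hi in zip(out_w, hidden)) + out_b)
    return y, y > threshold  # hard limit threshold after the output neuron


# Placeholder 6/5/1 network with all weights and biases set to zero.
z = [1 + 2j, 0.5 - 1j, -1 + 0j]
y, decision = mlp_detect(z, [[0.0] * 6 for _ in range(5)], [0.0] * 5,
                         [0.0] * 5, 0.0, threshold=0.4)
```

Applying the formula to a 6/20/1 architecture gives `count_weights(6, 20, 1) == 161`, so every network of this kind stays well within the "few hundred weights" range for which LM is reported to be efficient.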

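The Monte Carlo estimation of Pfa and Pd mentioned above can be sized with a short calculation. The sketch below assumes independent test patterns, so that the number of threshold crossings is binomial, and it assumes that half of the test patterns are generated under H0; both function names are hypothetical.

```python
import math


def relative_std_error(p: float, n_trials: int) -> float:
    """Relative standard deviation of a probability p estimated as a
    fraction of n_trials independent trials (binomial model)."""
    return math.sqrt((1.0 - p) / (p * n_trials))


def threshold_for_pfa(h0_outputs, pfa: float) -> float:
    """Empirical threshold that about a fraction pfa of the detector
    outputs recorded under H0 exceed."""
    ordered = sorted(h0_outputs, reverse=True)
    k = min(int(pfa * len(ordered)), len(ordered) - 1)
    return ordered[k]  # outputs strictly above this value count as alarms


def exceedance_rate(outputs, threshold: float) -> float:
    """Fraction of outputs above the threshold: estimates Pfa when the
    outputs come from H0 patterns and Pd when they come from H1 patterns."""
    return sum(1 for y in outputs if y > threshold) / len(outputs)


# Assumed split: 1.25e6 of the test patterns are H0 patterns.
err = relative_std_error(1e-4, 1_250_000)

# Toy H0 outputs, used only to show the threshold sweep for one ROC point.
toy_h0 = [i / 100 for i in range(100)]
thr = threshold_for_pfa(toy_h0, 0.1)
```

Under these assumptions `err` is about 0.09. Repeating `threshold_for_pfa` for several values of Pfa and evaluating `exceedance_rate` on the H1 outputs yields the points of an estimated ROC curve.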
Results

The performance of the detectors described in the previous sections is shown in terms of the Receiver Operating Characteristic (ROC) curves. They give the estimated Pd for a desired Pfa, whose values are obtained by varying the output threshold of the detector. The experiments presented are made for an integration of two pulses (N=2). So, in order to test the TSKAP detector correctly, observation vectors (also called patterns throughout the text) of 3 (N+(N-1)) complex samples are generated, due to the memory requirements of the TSKAP detector (N-1 pulses).

The a priori probabilities of the H0 and H1 hypotheses are assumed to be equal. Three sets of patterns are generated for each experiment: training, validation and test sets. The first and the second ones have 5·10^3 patterns each. The third one has 2.5·10^6 patterns, so the error in the estimation of the Pfa and the Pd is lower than 10% of the estimated values in the worst case (Pfa=10^-4). The patterns of all the sets are synthetically generated under the same conditions. These conditions involve typical values (Farina, 1987a, DiFranco, 1980, Farina, 1987b) for the SNR (20 dB), the CNR (30 dB) and the a parameter (a=1.2) of the Weibull-distributed clutter.

The MLP architecture used to generate the MLP-based detector is 6/H/1. The number of MLP outputs (O=1) is established by the problem (binary detection). The number of hidden neurons (H) is studied in this work. The number of MLP inputs (I=6) is established according to the following criterion: a total of 6 inputs (2(N+(N-1))) is selected so that the MLP-based detector can be compared with the TSKAP detector under the same conditions, i.e., when both detectors have the same available information (3 pulses for an integration of N=2 pulses). This case is considered because of the memory requirements of the TSKAP detector.

Figure 2 shows the results of a study in which 3 pulses are used by the MLP-based detector to take the final decision, according to the criterion described above. The study shows the influence of the training algorithm and the MLP size, i.e., the number of independent elements (W weights) that the ANN has to solve the problem. For the case under study, two important aspects have to be noted. The first one is related to the training algorithm. As can be observed, the performance achieved with a small MLP (6/05/1) is very similar for both training algorithms (LM and BP). But when the MLP size is greater, for instance 6/10/1, the performance achieved with the LM algorithm is better than that achieved with BP. This is because the LM algorithm is more efficient than BP at finding the minimum of the error surface. Moreover, MLP training with LM is faster than training with BP, because the number of training epochs can be reduced by an order of magnitude. The second aspect is related to the MLP size. As can be observed, when 20 or more hidden neurons are used, LM no longer improves on BP as it did with 10 hidden neurons. Moreover, from 20 (W=161 weights) to 30 (W=241 weights) hidden neurons, the performance tends to a maximum value (independently of the training algorithm used), i.e., almost no performance improvement is achieved with more weights. So, an MLP-based detector with 20 hidden neurons achieves an appropriate performance with low complexity.

Figure 2. MLP-based detector performances for different structure sizes (6/H/1) and different training algorithms: (a) BP and (b) LM

A comparison between the performances achieved with the TSKAP detector and the MLP-based detector of size 6/20/1 trained with the BP and LM algorithms is shown in figure 3. Two differences can be observed. The first one is that the MLP-based detector performance is practically independent of the training algorithm when the results are compared with those obtained for the TSKAP detector. The second one is that the 6/20/1 MLP-based detector is always better than the TSKAP detector when they are compared under the same conditions of availability of information, i.e., with 3 (N+(N-1)) pulses available to decide. Under these conditions, and comparing figures 2 and 3, it can be observed that a 6/05/1 MLP-based detector is enough to outperform the TSKAP one. The observed differences between the TSKAP and MLP-based detectors appear because the first one is a suboptimum detector and the second one approximates the optimum one, although it will always be worse than the optimum detector. This cannot be demonstrated, because an analytical expression for the optimum detector cannot be obtained when detecting targets in the presence of Weibull-distributed clutter.

Figure 3. TSKAP and MLP-based detectors performances for MLP size 6/20/1 trained with BP and LM algorithms

FUTURE TRENDS

Two different future trends can be mentioned. The first one is related to ANNs and the second one is related to research in Radar detectors. In the first trend, it is possible to emphasize the research in areas like ensembles of ANNs, committee machines based on ANNs and other ways to combine the intelligence of different ANNs, such as MLPs, Radial Basis Functions and others. Moreover, new trends try to find different ways to train ANNs. In the second trend, several researchers are trying to find different ways to create Radar detectors in order to improve their performance. Several solutions have been proposed, but they depend on the Radar environment considered. So, detectors based on signal processing tools seem to be the most appropriate, but the intelligent detector presented here is a new way of working, which can bring good solutions to these problems. This is possible because of the intelligence of ANNs, which lets them adapt to almost any kind of Radar conditions and problems.

CONCLUSION

From the study carried out, several conclusions can be drawn. The LM training algorithm achieves better MLP-based detectors than the BP one. No performance improvement is obtained when training MLPs with the LM or BP algorithms once their sizes are greater than 6/20/1. However, the great advantage of LM over BP is its faster training for small MLPs (a few hundred weights), i.e., the MLPs considered in this study. Finally, the MLP-based detector works better than the TSKAP one when both work with the same available information (N+(N-1)=3 pulses), a case imposed by the memory requirements of the TSKAP detector. In those cases, low complexity MLP-based detectors can be obtained, because a 6/05/1 MLP has enough intelligence to obtain better performance than the TSKAP one.

REFERENCES

Andina, D., & Sanz-Gonzalez, J.L. (1996). Comparison of a Neural Network Detector Vs Neyman-Pearson Optimal Detector. Proc. of ICASSP-96. 3573-3576.

Bishop, C.M. (1995). Neural networks for pattern recognition. Oxford University Press Inc.

Cheikh, K., & Faozi S. (2004). Application of Neural Networks to Radar Signal Detection in K-distributed Clutter. First Int. Symp. on Control, Communications and Signal Processing Workshop Proc. 633-637.

De la Mata-Moya, D., Jarabo-Amores, P., Rosa-Zurera, M., López-Ferreras, F., & Vicen-Bueno, R. (2005). Approximating the Neyman-Pearson Detector for Swerling I Targets with Low Complexity Neural Networks. Lecture Notes in Computer Science. (3697), 917-922.

DiFranco, J.V., & Rubin, W.L. (1980). Radar Detection. Artech House.

Farina, A., Russo, A., Scannapieco, F., & Barbarossa, S. (1987a). Theory of Radar Detection in Coherent Weibull Clutter. In: Farina, A. (eds.): Optimised Radar Processors. IEE Radar, Sonar, Navigation and Avionics, Series 1. Peter Peregrinus Ltd. 100-116.

Farina, A., Russo, A., & Scannapieco, F. (1987b). Radar Detection in Coherent Weibull Clutter. IEEE Trans. on Acoustics, Speech and Signal Processing. ASSP-35 (6), 893-895.

Gandhi, P.P., & Ramamurti, V. (1997). Neural Networks for Signal Detection in Non-Gaussian Noise. IEEE Trans. on Signal Processing. (45) 11, 2846-2851.

Hagan, M.T., & Menhaj, M.B. (1994). Training Feedforward Networks with Marquardt Algorithm. IEEE Trans. on Neural Networks. (5) 6, 989-993.

Haykin, S. (1999). Neural Networks. A Comprehensive Foundation (Second Edition). Prentice-Hall.

Jarabo-Amores, P., Rosa-Zurera, M., Gil-Pita, R., & López-Ferreras, F. (2005). Sufficient Condition for an Adaptive System to Approximate the Neyman-Pearson Detector. Proc. IEEE Workshop on Statistical Signal Processing. 295-300.

Nguyen, D., & Widrow, B. (1999). Improving the Learning Speed of 2-layer Neural Networks by Choosing Initial Values of the Adaptive Weights. Proc. of the Int. Joint Conf. on Neural Networks. 21-26.

Ruck, D.W., Rogers, S.K., Kabrisky, M., Oxley, M.E., & Suter, B.W. (1990). The Multilayer Perceptron as an Approximation to a Bayes Optimal Discriminant Function. IEEE Trans. on Neural Networks. (1) 11, 296-298.

Van Trees, H.L. (1997). Detection, Estimation and Modulation Theory. Part I. John Wiley and Sons.

Vicen-Bueno, R., Jarabo-Amores, M. P., Rosa-Zurera, M., Gil-Pita, R., & Mata-Moya, D. (2007). Performance Analysis of MLP-Based Radar Detectors in Weibull-Distributed Clutter with Respect to Target Doppler Frequency. Lecture Notes in Computer Science. (4669), 690-698.

Vicen-Bueno, R., Rosa-Zurera, M., Jarabo-Amores, P., & Gil-Pita, R. (2006). NN-Based Detector for Known Targets in Coherent Weibull Clutter. Lecture Notes in Computer Science. (4224), 522-529.

KEY TERMS

Artificial Neural Networks (ANNs): A network of many simple processors (“units” or “neurons”) that imitates a biological neural network. The units are connected by unidirectional communication channels, which carry numeric data. Neural networks can be trained to find nonlinear relationships in data, and are used in applications such as robotics, speech recognition, signal processing or medical diagnosis.

Backpropagation Algorithm: Learning algorithm of ANNs, based on minimising the error obtained from the comparison between the ANN outputs after the application of a set of network inputs and the desired outputs. The update of the weights is done according to the gradient of the error function evaluated at the point of the input space that indicates the input to the ANN.

Knowledge Extraction: Making the internal knowledge of a system or set of data explicit in a way that is easily interpretable by the user.

Intelligence: A property of mind that encompasses many related abilities, such as the capacities to reason, plan, solve problems, think abstractly, comprehend ideas and language, and learn.

Levenberg-Marquardt Algorithm: Similar to the Backpropagation algorithm, but with the difference that the weight update is computed using an estimate of the Hessian matrix of the error. This matrix gives information about several directions in which to move in order to find the minimum of the error function, instead of the single local direction given by the backpropagation algorithm.

Probability Density Function: The statistical function that shows how the density of possible observations in a population is distributed.

Radar: The acronym of Radio Detection and Ranging. In a few words, a Radar emits an electromagnetic wave that is reflected by the target and other objects present in its observation space. The Radar then receives these reflected waves (echoes) and analyzes them in order to decide whether a target is present or not.


Intelligent Software Agents Analysis in E-Commerce I

Xin Luo, The University of New Mexico, USA
Somasheker Akkaladevi, Virginia State University, USA

INTRODUCTION

Equipped with sophisticated information technology infrastructures, the information world is becoming more expansive and more widely interconnected. Internet usage is expanding throughout the web-linked globe, which stimulates people’s need for timely and convenient access to information. Electronic commerce activities, powered by Internet growth, are increasing continuously. It is estimated that online retail will reach nearly $230 billion and account for 10% of total U.S. retail sales by 2008 (Johnson et al. 2003). In addition, e-commerce entailing business-to-business (B2B), business-to-customer (B2C) and customer-to-customer (C2C) transactions is spawning new markets such as mobile commerce. As the degree and sophistication of automation increase, commerce becomes much more dynamic, personalized, and context sensitive for both buyers and sellers.

Software agents were first used several years ago to filter information, match people with similar interests, and automate repetitive behavior (Maes et al. 1999). In recent years, agents have been applied to e-commerce, triggering a revolutionary change in the way we conduct online transactions in B2B, B2C, and C2C. Researchers argue that the potential of the Internet for transforming commerce is largely unrealized (Begin et al. 2002; Maes et al. 1999). Further, He and Jennings noted that a new model of software agent is needed to achieve the degree of automation required to move to second-generation e-commerce applications (He et al. 2003). This is because electronic purchases are still largely unautomated. Maes et al. (1999) also observed that, even though information is more easily accessible and orders and payments are dealt with electronically, humans are still in the loop in all stages of the buying process, which inevitably increases transaction costs. A human buyer is still responsible for collecting and interpreting information on merchants and products, making decisions about merchants and products, and ultimately entering purchase and payment information. Additionally, Jennings et al. (1998) confirmed that commerce is almost entirely driven by human interactions and further argued that there is no reason why some commerce cannot be automated. This unautomated loop requires a lot of time and energy and results in inefficiency and high cost for both buyers and sellers.

To automate time-consuming tasks, intelligent software agent (ISA) technology can play an important role in online transaction and negotiation because it can deliver unprecedented levels of autonomy, customization, and general sophistication in the way e-commerce is conducted (Sierra et al. 2003). Systems containing ISAs have been developed to automate the complex process of negotiating a deal between a buyer and a seller. An increasing number of e-commerce agent systems are being developed to support online transactions that involve many variables and to aim for a win-win result for sellers and buyers. In today’s e-commerce arena, systems equipped with ISAs may allow buyers and sellers to find the best deal, taking into account the relative importance of each factor. Advanced e-commerce systems that embody ISA technologies can perform many queries and process very large volumes of information. ISAs reduce transaction costs by collecting information about services and commodities from many firms and presenting only the results most relevant to the user. ISA technologies help businesses automate information transaction activity, largely eliminate human intervention in negotiation, lower transaction and information search costs, and cultivate competitive advantage. ISAs can therefore free people to concentrate on the issues requiring true human intelligence and intervention. By implementing personalized, social, continuously running, and semi-autonomous ISA technologies in business information systems, online businesses can become more user-friendly, semi-intelligent, and human-like (Pivk 2003).
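The idea of finding the best deal by taking into account the relative importance of each factor can be illustrated with a weighted score over normalized attributes. The following is only a minimal sketch under stated assumptions: the `Offer` fields, the attribute names, and the weights are hypothetical and are not taken from any system cited in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    # Hypothetical attributes a buyer's agent might compare.
    seller: str
    price: float          # lower is better
    delivery_days: int    # lower is better
    warranty_months: int  # higher is better


def normalise(value, worst, best):
    """Map value to [0, 1], where 1 is the best observed value and 0 the worst."""
    if best == worst:
        return 1.0
    return (value - worst) / (best - worst)


def score_offers(offers, weights):
    """Return (score, offer) pairs sorted from best to worst.

    weights maps 'price', 'delivery' and 'warranty' to their relative
    importance; the weights are assumed to sum to 1.
    """
    prices = [o.price for o in offers]
    days = [o.delivery_days for o in offers]
    warranties = [o.warranty_months for o in offers]
    scored = []
    for o in offers:
        score = (
            weights["price"] * normalise(o.price, max(prices), min(prices))
            + weights["delivery"] * normalise(o.delivery_days, max(days), min(days))
            + weights["warranty"]
            * normalise(o.warranty_months, min(warranties), max(warranties))
        )
        scored.append((score, o))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Changing the weights changes the recommendation: a buyer who cares mostly about price and a buyer who cares mostly about warranty can receive different top offers from the same set of sellers.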

LITERATURE REVIEW

A number of scholars have defined the term intelligent software agent. Bradshaw (1997) proposed that one person’s intelligent agent is another person’s smart object. Jennings and Wooldridge (1995) defined an agent as a computer system situated in some environment that is capable of autonomous action in this environment in order to meet its design objectives. Shoham (1997) further described an ISA as a software entity which functions continuously and autonomously in a particular environment, often inhabited by other agents and processes. In general, an ISA is a software agent that uses Artificial Intelligence (AI) in the pursuit of the goals of its clients (Croft 2002). It can perform tasks independently on behalf of a user in a network and help users cope with information overload. It differs from conventional programs in being proactive, adaptive, and personalized (Guttman et al. 1998b). It can actively initiate actions for its users according to the configurations they set; it can read and understand users’ preferences and habits to better cater to their needs; and it can provide users with relevant information according to the patterns it learns from them.

ISA is a cutting-edge technology in computational sciences and holds considerable potential to open new avenues in information and communication technology (Shih et al. 2003). It is used to perform multi-task operations in decentralized information systems, such as the Internet, to conduct complicated and wide-scale search and retrieval activities, and to assist in shopping decision-making and product information search (Cowan et al. 2002). An ISA’s ability to perform continuously and autonomously reflects the human desire for an agent that can carry out certain activities in a flexible and intelligent manner, responsive to changes in the environment, without constant human supervision. Over a long period of time, an agent is capable of learning from its previous experience and is able to inhabit an environment with other agents, communicating and cooperating with them to achieve tasks for humans.

Intelligent Agent Taxonomy and Typology

Franklin and Graesser (1996) proposed a general taxonomy of agents (see Figure 1). This taxonomy reflects the fact that ISA technologies are implemented in a variety of areas, including biotechnology, economic simulation and data mining, as well as in hostile applications (malicious code), machine learning and cryptography algorithms. In addition, Nwana (1996b) proposed an agent typology (see Figure 2) in which four types of agents can be categorized: collaborative agents, collaborative learning agents, interface agents and smart agents. These four types combine learning, autonomy, and cooperation to different degrees and therefore address different sides of this typology in terms of functionality. According to Nwana (1996b), collaborative agents emphasize autonomy and cooperation more than learning. They collaborate with other agents in multi-agent environments and may have to negotiate with other agents in order to reach mutually acceptable agreements for users. Unlike collaborative agents, interface agents emphasize autonomy and learning more than cooperation. They support users and provide proactive assistance. They can observe a user’s actions in the interface and suggest better ways of completing a task. Interface agents’ cooperation with other agents is typically limited to asking for advice (Ndumu et al. 1997).

Figure 1. Franklin and Graesser’s agent taxonomy (Source: Franklin & Graesser, 1996)



Figure 2. A partial view of the agent typology (Source: Nwana, 1996b)

The benefits of interface agents include reducing the user’s effort in repetitive work and adapting to the user’s preferences and habits. Smart agents are agents that are intelligent, adaptive, and computational (Carley 1998). They are advanced intelligent agents that combine the best capabilities and properties of all the categories presented. This typology highlights the key contexts in which agents are used in the AI literature. Nwana (1996b) argued that agents ideally should do all three (learning, autonomy, and cooperation) equally well, but that this is an aspiration rather than the reality.

Furthermore, according to Nwana (1996b) and Jennings and Wooldridge (1998), five more agent types can be derived from the typology when it is viewed from a panoramic perspective (see Figure 3). In this extended typology, mobile agents are autonomous and cooperative software processes capable of roaming wide area networks, interacting with foreign hosts, and performing tasks on behalf of their owners (Houmb 2002). Information agents can help us manage the explosive growth of information we are experiencing. They manage, manipulate, or collate information from many distributed sources (Nwana 1996b). Reactive agents choose actions by using the current world state as an index into a table of actions, where the indexing function maps known situations to appropriate actions. These agents are sufficient for limited environments in which every possible situation can be mapped to an action or set of actions (Chelberg 2003). Hybrid agents adopt the strengths of both the reactive and deliberative paradigms. They aim to have the quick response time of reactive agents for well-known situations, yet also have the ability to generate new plans for unforeseen situations (Chelberg 2003). Heterogeneous agent systems are integrated set-ups of two or more agents that belong to two or more different agent classes (Nwana 1996b).
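The difference between reactive and hybrid agents described above can be sketched in a few lines. This is an illustrative sketch only: the situation names, the action names, and the use of breadth-first search as the deliberative component are assumptions made for the example, not part of the cited typology.

```python
from collections import deque

# Hypothetical reactive table: known situation -> action.
REACTIVE_TABLE = {
    "obstacle_ahead": "turn_left",
    "goal_visible": "move_forward",
    "battery_low": "return_to_dock",
}


def reactive_agent(state):
    """Pure table lookup; returns None when the situation is not in the table."""
    return REACTIVE_TABLE.get(state)


def plan(start, goal, neighbours):
    """Deliberative fallback: breadth-first search for a path from start to goal."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None


def hybrid_agent(state, position, goal, neighbours):
    """React immediately to known situations; otherwise plan a route."""
    action = reactive_agent(state)
    if action is not None:
        return action
    path = plan(position, goal, neighbours)
    if path is None or len(path) < 2:
        return "wait"
    return f"go_to:{path[1]}"
```

The reactive part answers in constant time but only for situations listed in the table; the planning part is slower but can handle a situation nobody anticipated when the table was written.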

Figure 3. A panoramic overview of the different agent types (Source: Jennings & Wooldridge, 1998)

CONCLUSION AND FUTURE WORK

This paper explores how ISAs can automate and add value to e-commerce transactions and negotiations. By leveraging ISA-based e-commerce systems, companies can make decisions more efficiently because they have more accurate information and can identify consumers’ tastes and habits. Opportunities and limitations for ISA development are also discussed. Future ISA technologies will be able to evaluate basic characteristics of online transactions in terms of price and product description as well as other properties, such as warranty, method of payment, and after-sales service. They would also better manage ambiguous content, personalized preferences, complex goals, changing environments, and disconnected parties (Guttman et al. 1998a). Regarding the future trend of ISA technology deployment, Nwana (1996a) states that “Agents are here to stay, not least because of their diversity, their wide range of applicability and the broad spectrum of companies investing in them. As we move further and further into the information age, any information-based organization which does not invest in agent technology may be committing commercial hara-kiri.”

REFERENCES

Begin, L., and Boisvert, H. “Enhancing the value proposition via the Internet,” International Conference on Electronic Commerce Research (ICECR-5), 2002.

Bradshaw, J.M. “Software Agents,” online: http://agents.umbc.edu/introduction/01-Bradshaw.pdf, 1997.

Carley, K.M. “Smart Agents and Organizations of the Future,” online: http://www.hss.cmu.edu/departments/sds/faculty/carley/publications/ORGTHEO36.pdf, 1998.

Chelberg, D. “Reactive Agents,” online: http://zen.ece.ohiou.edu/~robocup/papers/HTML/AAAI/node3.html, 03-05 2003.

Cowan, R., and Harison, E. “Intellectual Property Rights in Intelligent-Agent Technologies: Facilitators, Impediments and Conflicts,” online: http://www.itas.fzk.de/e-society/preprints/ecommerce/CowanHarison.pdf, 2002.

Croft, D.W. “Intelligent Software Agents: Definitions and Applications,” online: http://www.alumni.caltech.edu/~croft/research/agent/definition/, 2002.

Franklin, S., and Graesser, A. “Is it an Agent, or just a Program?: A Taxonomy for Autonomous Agents,” Proceedings of the Third International Workshop on Agent Theories, Architectures, and Languages, Springer-Verlag, 1996.

Guttman, R., Moukas, A., and Maes, P. “Agent-mediated Electronic Commerce: A Survey,” Knowledge Engineering Review (13:3), June 1998a.

Guttman, R., Moukas, A., and Maes, P. “Agents as Mediators in Electronic Commerce,” International Journal of Electronic Markets (8:1), February 1998b, pp 22-27.

He, M., Jennings, N.R., and Leung, H.-F. “On Agent-Mediated Electronic Commerce,” IEEE Transactions on Knowledge and Data Engineering (15:4), July/August 2003.

Houmb, S.H. “Software Agent: An Overview,” online: http://www.idi.ntnu.no/emner/dif8914/ppt-2002/swagent_dif8914_2002.ppt, 2002.

Jennings, N.R., and Wooldridge, M. “Applications of Intelligent Agents,” in Agent Technology: Foundations, Applications, and Markets, 1998, pp 3-28.

Johnson, C., Delhagen, K., and Yuen, E.H. “US eCommerce Overview: 2003 To 2008,” online: http://www.forrester.com/ER/Research/Brief/Excerpt/0,1317,16875,00.html, July 25 2003.

Maes, P., Guttman, R.H., and Moukas, A.G. “Agents That Buy and Sell. (software agents for electronic commerce)(Technology Information),” Communications of the ACM (42:3) 1999, p 81.

Ndumu, D., and Nwana, H. “Research and Development Challenges for Agent-Based Systems,” IEE Proceedings on Software Engineering (144:01), January 1997.

Nwana, H.S. “Software Agents: An Overview,” online: http://agents.umbc.edu/introduction/ao/, 1996b.

Pivk, A. “Intelligent Agents in E-Commerce,” online: http://ai.ijs.si/Sandi/IntelligentAgentRepository.html, 2003.

Shih, T.K., Chiu, C.-F., and Hsu, H.-h. “An Agent-Based Multi-Issue Negotiation System in E-commerce,” Journal of Electronic Commerce in Organizations (1:1), Jan-March 2003, pp 1-16.

Sierra, C., Wooldridge, M., Sadeh, N., Conte, R., Klusch, M., and Treur, J. “Agent Research and Development in Europe,” online: http://www.unicom.co.uk/3in/ISSUE4/4.Asp, 2003.

KEY TERMS

Agent: A computer system situated in some environment that is capable of autonomous action in this environment in order to meet its design objectives.

Business-to-Business E-Commerce: Electronic transaction of goods or services between businesses, as opposed to that between businesses and other groups.

Business-to-Customer E-Commerce: Electronic or online activities of commercial organizations serving the end consumer with products and/or services. It is usually applied exclusively to e-commerce.

Customer-to-Customer E-Commerce: Online transactions involving the electronically facilitated transactions between consumers through some third party.

Electronic Commerce (E-Commerce): Consists of the buying and selling of products or services over electronic systems such as the Internet and other computer networks. A wide variety of commerce is conducted in this way, including electronic funds transfer, supply chain management, e-marketing, online transaction processing, and automated data collection systems.

Intelligent Software Agent: A software agent that uses Artificial Intelligence (AI) in the pursuit of the goals of its clients.

Ubiquitous Commerce (U-Commerce): The ultimate form of e-commerce and m-commerce in an ‘anytime, anywhere’ fashion. It involves the use of ubiquitous networks to support personalized and uninterrupted communications and transactions at a level of value that far exceeds traditional commerce.


Intelligent Software Agents Analysis in E-Commerce II

Xin Luo, The University of New Mexico, USA
Somasheker Akkaladevi, Virginia State University, USA

ISA OPPORTUNITIES AND LIMITATIONS IN E-COMMERCE

Cowan et al. (2002) argued that the human cognitive ability to search for information and to evaluate its usefulness is extremely limited in comparison with that of computers. It is cumbersome and time-consuming for a person to search for information from limited resources and to evaluate that information’s usefulness. They further indicated that while people are able to perform several queries in parallel and are good at drawing parallels and analogies between pieces of information, advanced systems that embody ISA architecture are far more effective in terms of calculation power and parallel processing abilities, particularly in the quantities of material they can process (Cowan et al. 2002). According to Bradshaw (1997), information complexity will continue to increase dramatically in the coming decades. He further contended that the dynamic and distributed nature of both data and applications requires that software not merely respond to requests for information but intelligently anticipate, adapt, and actively seek ways to support users.

Agent-oriented e-commerce applications have great potential. Agents can be designed using the latest web-based technologies, such as Java, XML, and HTTP, and can dynamically discover and compose e-services and mediate interactions to handle routine tasks, monitor activities, set up contracts, execute business processes, and find the best services (Shih et al., 2003). The main advantages of these technologies are their ease of use, ubiquity, support for heterogeneity, and platform independence (Begin and Boisvert, 2002). XML will likely become the standard language for agent-oriented e-commerce interactions to encode exchanged messages, documents, invoices, orders, service descriptions, and other information. HTTP, the dominant WWW protocol, can be used to provide many services, such as robust and scalable web servers, firewall access, and levels of security for these e-commerce applications.

Agents can be made to work individually, as well as collaboratively to perform more complex tasks (Franklin and Graesser, 1996). For example, to purchase a product on the Internet, a group of agents can exchange messages in a conversation to find the best deal, bid in an auction for the product, arrange financing, select a shipper, and track the order. Multi-agent systems (groups of agents collaborating to achieve some purpose) are critical for large-scale e-commerce applications, especially B2B interactions such as service provisioning, supply chain, negotiation, and fulfillment. The grouping of agents can be static or dynamic depending on the specific need (Guttman et al., 1998b). Careful coordination of the interactions between the agents is needed to achieve a higher-level task, such as requesting, offering and accepting a contract for some services (Guttman et al., 1998a). Several agent toolkits are publicly available which can be used to satisfy customer requirements; ideally, they need to adhere to standards which define multi-party agent interoperability. For example, fuzzy-logic-based intelligent negotiation agents can interact autonomously and consequently save human labor in negotiations. The aim of modeling a negotiation agent is to reach mutual agreement efficiently and intelligently. The negotiation agent should be able to negotiate with other such agents over various sets of issues on behalf of the real-world parties they represent; that is, it should be able to handle multi-issue negotiations at any given time.

The boom in e-commerce has now created the need for ISAs that can handle complicated online transactions and negotiations for both sellers and buyers. In general, buyers want to find sellers that have the desired products and services. They also want to find product information and gain expert advice, before and after the purchase, from sellers, which in turn want to find buyers and provide expert advice about their product or service as well as customer service and support. There is therefore an opportunity for both buyers and sellers to automate the handling of this potential transaction by adopting ISA technology. The use of ISAs will be essential to handling many tasks of creating, maintaining, and delivering information on the Web. With ISA technology implemented in e-commerce, agents can shop around for their users; they can communicate with other agents about product specifications, such as price, features, quantity, and service package, make a comparison according to the user’s objectives and requirements, and return with purchase recommendations that meet those specifications; they can also act for sellers by providing product or service sales advice and help troubleshoot customer problems by automatically offering solutions or suggestions; and they can automatically pay bills and keep track of payments.

Looking at ISA development from an international standpoint, the nature of the Internet in developed regions, such as the USA, Canada, Western Europe, Japan, and Australia, and the consequent evolution of e-commerce as the new model provide exciting opportunities and challenges for ISA-based developments. Opportunities include wider market reach in a timely manner, higher earnings, a broader spectrum of target and potential customers, and collaboration among vendors. This ISA-powered e-commerce arena would be different from traditional commerce, because the traditional form of competition can give way to collaborative efforts across industries for adding value to business processes.
This means that agents of different vendors can establish a cooperative relationship and communicate with each other in XML in order to set up and complete transactions online. For instance, if an information agent found that the vendor needed more airplane tickets, it would notify a collaborative agent to search other sources on the Internet for relevant ticket information, such as availability, price, and quantity. In this case, the collaborative agent would work with mobile agents, negotiate with agents working for different vendors, and obtain ticket information for its user. It would be able to provide the user with the result of the search and, if needed, purchase the tickets for the user if certain requirements are met. In the meantime, interface agents can monitor the user’s reactions and decision behavior and provide the user with informational assistance in the form of advice, recommendations, and suggestions for any related and similar transactions.

This kind of intelligent electronic communication and transaction, however, is relatively inapplicable in traditional commerce, where competing vendors are not willing to share information with each other (Maes et al., 1999). The level of willingness in ISA-based e-commerce is also somewhat limited by sociological and ethical factors, which will be discussed later in this paper. In addition, designing and implementing ISA technology is costly, which prevents companies from adopting this emerging tool. Companies need to invest a lot of money to get the ISA engine started. Notwithstanding the exciting theoretical benefits discussed above, many companies are still not sure how much ISA technology can benefit them in terms of revenue, ROI, and business influence in a market where other players are yet to adopt this technology to cooperate with each other. Small and medium-sized companies in particular are reluctant to enter this arena, mainly because of cost. Additionally, the lack of consistent architectures in terms of standards and laws obstructs the further development of ISA technology (He et al., 2003). The IT industry has not yet finalized ISA standards, as there are a number of proprietary standards set by various companies. This makes it difficult for ISAs to communicate freely with each other. Also, related to standards, relevant laws have not yet emerged to regulate how ISAs can legally cooperate with each other and represent their human users in the cyber world.
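The ticket scenario and the standards problem described above both come down to agents agreeing on a message format. As a rough sketch only, the following shows two agents exchanging an offer encoded in XML using Python’s standard `xml.etree.ElementTree` module. The element names (`offer`, `item`, `price`, `quantity`) and vendor names are hypothetical; they do not correspond to any published agent communication standard.

```python
import xml.etree.ElementTree as ET


def build_offer(vendor, item, price, quantity):
    """Encode an offer as an XML string (hypothetical schema)."""
    root = ET.Element("offer", attrib={"vendor": vendor})
    ET.SubElement(root, "item").text = item
    ET.SubElement(root, "price", attrib={"currency": "USD"}).text = f"{price:.2f}"
    ET.SubElement(root, "quantity").text = str(quantity)
    return ET.tostring(root, encoding="unicode")


def parse_offer(message):
    """Decode an offer; both agents must agree on the element names."""
    root = ET.fromstring(message)
    return {
        "vendor": root.get("vendor"),
        "item": root.findtext("item"),
        "price": float(root.findtext("price")),
        "quantity": int(root.findtext("quantity")),
    }


def cheapest_available(messages, item, needed):
    """Pick the cheapest offer for item that can supply the needed quantity."""
    offers = [parse_offer(m) for m in messages]
    candidates = [
        o for o in offers if o["item"] == item and o["quantity"] >= needed
    ]
    return min(candidates, key=lambda o: o["price"], default=None)
```

If one vendor’s agent named the price element differently, `parse_offer` would fail on its messages, which is a small-scale version of the interoperability problem that proprietary standards create.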
Additionally, ISA development and deployment are not yet global in scope (Jennings et al. 1998). Although ISA technology is an active topic in developed countries, developing countries are not fully aware of the benefits of ISAs and therefore have not deployed ISA-based systems on the Web, because their e-commerce development levels and skills are not as sophisticated or advanced as those of the developed countries. This disparity between developed and developing countries unfortunately hinders agents from freely communicating with each other over the globally connected Internet.

SOCIOLOGICAL AND ETHICAL CHALLENGES

In the preceding sections of this paper, the technical issues involved in agent development have been addressed. In addition to these issues, however, there is a range of social and cyber-ethical problems, such as trust and delegation, privacy, responsibility, and legal issues, which will become increasingly important in the field of agent technology (Bradshaw 1997; Jennings et al. 1998; Nwana 1996b).

• Trust and delegation: Users who want to depend on ISA technology to obtain desired information must trust the agents to which they delegate the job and which act autonomously on their behalf. It takes time for users to get used to their agents and gain confidence in them. Users also have to strike a balance between agents that continually seek guidance and agents that never seek guidance. Users might need to set proper limits for their agents; otherwise agents might exceed their authority.

• Privacy: In the rapidly growing information society, security is becoming more and more important. Users must therefore make sure that their agents always maintain their privacy in the course of transactions. Electronic agent security policies may be needed to counter this potential threat.

• Responsibility: Users need to consider seriously how much responsibility agents should carry when a transaction goes wrong. To some extent, agents are given responsibility for obtaining the desired product or service for their users. If users are not satisfied with the transaction result, they may need to redesign or reprogram the agent rather than simply blame the electronic agent.

• Legal issues: In addition to responsibility, users should also think about any potential legal issues triggered by their agents, which might, for instance, offer inappropriate advice to other agents, resulting in liabilities to other people. This is very challenging for ISA technology development, and the scenario is complicated because current law does not specify which party (the company that wrote the agent, the company that customized and used the agent, or both) should be responsible for the legal issues.

• Cyber-ethical issues: Eichmann (1994) and Etzioni & Weld (1994) proposed the following etiquette for ISAs which gather information on the Web.
   • Agents must identify themselves;
   • They must moderate the pace and frequency of their requests to any given server;
   • They must limit their searches to appropriate servers;
   • They must share information with others;
   • They must respect the authority placed on them by server operators;
   • Their services must be accurate and up-to-date;
   • Safety: they should not destructively alter the world;
   • Tidiness: they should leave the world as they found it;
   • Thrift: they should limit their consumption of scarce resources;
   • Vigilance: they should not allow client actions with unanticipated results.
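Three of the etiquette rules listed above (self-identification, moderating the pace of requests, and limiting searches to appropriate servers) can be expressed directly in code. The sketch below is a hypothetical illustration rather than an implementation from the cited authors: the class name, the one-second default interval, and the allow-list mechanism are assumptions, and the class only prepares request descriptions instead of performing network calls.

```python
import time


class PoliteAgent:
    """Prepares web requests while following basic agent etiquette."""

    def __init__(self, name, contact, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        # Identify the agent and give operators a way to reach its owner.
        self.headers = {"User-Agent": f"{name} (+{contact})"}
        self.min_interval = min_interval  # seconds between requests per host
        self.allowed_hosts = set()        # servers the agent may search
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}           # host -> time of the last request

    def allow(self, host):
        self.allowed_hosts.add(host)

    def prepare_request(self, host, path):
        """Return a request description, waiting first if the host was
        contacted less than min_interval seconds ago."""
        if host not in self.allowed_hosts:
            raise PermissionError(f"{host} is not on the allowed list")
        now = self._clock()
        last = self._last_request.get(host)
        if last is not None:
            wait = self.min_interval - (now - last)
            if wait > 0:
                self._sleep(wait)
                now = self._clock()
        self._last_request[host] = now
        return {"url": f"https://{host}{path}", "headers": dict(self.headers)}
```

The clock and sleep functions are passed in as parameters so that the throttling behavior can be checked without real delays; in ordinary use the defaults apply.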

CONCLUSION AND FUTURE WORK

ISA technology has to confront the increasing complexity of modern information environments. Research and development of ISAs on the Internet is crucial for the development of the next generation of open information environments. Sociological and cyber-ethical issues need to be considered for the next generation of agents in e-commerce systems, which will explore new types of transactions in the form of dynamic relationships among previously unknown parties (Guttman et al. 1998b). According to Nwana (1996a), the ultimate success of ISAs will be their acceptance and mass usage by users, once issues such as privacy, trust, legal liability, and responsibility are addressed when users design and implement ISA technologies in e-commerce and emerging forms of commerce, such as mobile commerce (M-commerce) and ubiquitous commerce (U-commerce). It is expected that future research can further explore how ISAs are leveraged in these two newly emerged avenues.



REFERENCES

Begin, L., and Boisvert, H. “Enhancing the value proposition via the Internet,” International Conference on Electronic Commerce Research (ICECR-5), 2002.

Bradshaw, J.M. “Software Agents,” online: http://agents.umbc.edu/introduction/01-Bradshaw.pdf, 1997.

Cowan, R., and Harison, E. “Intellectual Property Rights in Intelligent-Agent Technologies: Facilitators, Impediments and Conflicts,” online: http://www.itas.fzk.de/e-society/preprints/ecommerce/CowanHarison.pdf, 2002.

Eichmann, D. “Ethical Web Agents,” Second International World-Wide Web Conference: Mosaic and the Web, October 18-20 1994, pp 3-13.

Etzioni, O., and Weld, D. “A Softbot-Based Interface to the Internet,” Communications of the ACM, July 1994, pp 72-76.

Franklin, S., and Graesser, A. “Is it an Agent, or just a Program?: A Taxonomy for Autonomous Agents,” Proceedings of the Third International Workshop on Agent Theories, Architectures, and Languages, Springer-Verlag, 1996.

Guttman, R., Moukas, A., and Maes, P. “Agent-mediated Electronic Commerce: A Survey,” Knowledge Engineering Review (13:3), June 1998a.

Guttman, R., Moukas, A., and Maes, P. “Agents as Mediators in Electronic Commerce,” International Journal of Electronic Markets (8:1), February 1998b, pp 22-27.

He, M., Jennings, N.R., and Leung, H.-F. “On Agent-Mediated Electronic Commerce,” IEEE Transactions on Knowledge and Data Engineering (15:4), July/August 2003.

Jennings, N.R., and Wooldridge, M. “Applications of Intelligent Agents,” in Agent Technology: Foundations, Applications, and Markets, 1998, pp 3-28.

Maes, P., Guttman, R.H., and Moukas, A.G. “Agents That Buy and Sell. (software agents for electronic commerce)(Technology Information),” Communications of the ACM (42:3) 1999, p 81.

Ndumu, D., and Nwana, H. “Research and Development Challenges for Agent-Based Systems,” IEE Proceedings on Software Engineering (144:01), January 1997.

Nwana, H.S. “Intelligent Software Agents on the Internet: an inventory of currently offered functionality in the information society & a prediction of (near-) future developments,” online: http://www.hermans.org/agents/index.html, July 1996a.

Nwana, H.S. “Software Agents: An Overview,” online: http://agents.umbc.edu/introduction/ao/, 1996b.

Shih, T.K., Chiu, C.-F., and Hsu, H.-h. “An Agent-Based Multi-Issue Negotiation System in E-commerce,” Journal of Electronic Commerce in Organizations (1:1), Jan-March 2003, pp 1-16.

KEY TERMS

Agent: A computer system situated in some environment that is capable of autonomous action in this environment in order to meet its design objectives.

Business-to-Business E-Commerce: Electronic transaction of goods or services between businesses, as opposed to that between businesses and other groups.

Business-to-Customer E-Commerce: Electronic or online activities of commercial organizations serving the end consumer with products and/or services. It is usually applied exclusively to e-commerce.

Customer-to-Customer E-Commerce: Online transactions involving the electronically facilitated transactions between consumers through some third party.

Electronic Commerce (E-Commerce): Consists of the buying and selling of products or services over electronic systems such as the Internet and other computer networks. A wide variety of commerce is conducted in this way, including electronic funds transfer, supply chain management, e-marketing, online transaction processing, and automated data collection systems.

Intelligent Software Agent: A software agent that uses Artificial Intelligence (AI) in the pursuit of the goals of its clients.

Ubiquitous Commerce (U-Commerce): The ultimate form of e-commerce and m-commerce in an ‘anytime, anywhere’ fashion. It involves the use of ubiquitous networks to support personalized and uninterrupted communications and transactions at a level of value that far exceeds traditional commerce.


Intelligent Software Agents with Applications in Focus

Mario Janković-Romano, University of Belgrade, Serbia
Milan Stanković, University of Belgrade, Serbia
Uroš Krčadinac, University of Belgrade, Serbia

INTRODUCTION

Most people are familiar with the concept of agents in real life. There are stock-market agents, sports agents, real-estate agents, and so on. Agents are used to filter and present information to consumers. Likewise, during the last couple of decades, people have developed software agents that have a similar role. They behave intelligently, run on computers, and are autonomous, but are not human beings. Basically, an agent is a computer program that is capable of flexible and independent action in typically dynamic and unpredictable domains (Luck, McBurney, Shehory, & Willmott, 2005). Agents are capable of performing actions and making decisions without the guidance of a human. Software agents emerged in IT because of the ever-growing need for information processing and the problems of working with large quantities of data. Especially important is how agents interact with other agents in the same environment, and the connections they form to find, refine and present information in the best way. Agents can often perform tasks better when they work together, and that is why multi-agent systems were developed.

The concept of an agent has become important in a diverse range of sub-disciplines of IT, including software engineering, networking, mobile systems, control systems, decision support, information recovery and management, e-commerce, and many others. Agents are now used in an increasingly wide range of applications, from comparatively small systems such as web or e-mail filters to large, complex systems such as air-traffic control, which depend heavily on fast and precise decision making.

Undoubtedly, the main contribution to the field of intelligent software agents came from the field of artificial intelligence (AI). The main focus of AI is to build intelligent entities and if these entities sense and act in some environment, then they can be considered agents (Russell & Norvig, 1995). Also, object-oriented programming (Booch, 2004), concurrent object-based systems (Agha, Wegner, and Yonezawa, 1993), and human-computer interaction (Maes, 1994) are fields that constantly drive forward the development of agents.

BACKGROUND

Although the term ‘agent’ is widely used by many people working in closely related areas, it defies attempts to produce a single universally accepted definition. One of the most broadly used definitions states that “an agent is an encapsulated computer system that is situated in some environment, and that is capable of flexible, autonomous action in that environment in order to meet its design objectives” (Wooldridge and Jennings, 1995). There are three main concepts in this definition: situatedness, autonomy, and flexibility:

• Situatedness means that an agent is situated in some environment and that it receives sensory input and performs actions which change that environment in some way.

• Autonomy is the ability of an agent to act without the direct intervention of humans. It has control over its own actions and over its internal state. Autonomy also implies the capability of learning from experience.

• Flexibility means that the agent is able to perceive its environment and respond to changes in a timely fashion; it should be able to exhibit opportunistic, goal-directed behaviour and take the initiative whenever appropriate. In addition, an agent should be able to interact with other agents and humans, and thus be ‘social’.

For some researchers, particularly those interested in AI, the term ‘agent’ has a stronger and more specific meaning than that sketched out above. These researchers generally mean an agent to be a computer system that, in addition to having the properties identified above, is either conceptualized or implemented using concepts that are more usually applied to humans. For example, it is quite common in AI to characterize an agent using mentalistic notions, such as knowledge, belief, intention, and obligation (Wooldridge & Jennings, 1995).

INTELLIGENT SOFTWARE AGENTS

Agents and Environments

An agent collects its percepts through its sensors, and acts upon the environment through its actuators. Thus, the agent is proactive. Its actions at any moment depend on the whole sequence of these inputs up to that moment. A decision tree for every possible percept sequence of an agent would completely define the agent’s behavior. This would define the function that maps any sequence of percepts to a concrete action: the agent function. The program that defines the agent function is called the agent program. So, the agent function is a formal description of the agent’s behavior, and the agent program is a concrete implementation of that formalism (Krcadinac, Stankovic, Kovanovic & Jovanovic, 2007). To implement all this, we need a computing device with appropriate sensors and actuators on which the agent program will run. This is called the agent architecture. So, an agent is essentially made of two components: the agent architecture and the agent program. Also, as Russell and Norvig (1995) specify, one of the most sought-after characteristics of an agent is its rationality. An agent is rational if it always chooses the action that will lead to the most successful outcome. The rationality of an agent depends on (a) the performance measure that defines what is a good action and what is a bad action, (b) the agent’s knowledge about the environment, (c) the agent’s available actions, and (d) the agent’s percept history.
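The distinction between the agent function and the agent program can be illustrated with a minimal sketch. It assumes a toy world with string percepts and actions; the lookup table, the percept names and the default action "noop" are hypothetical and serve only to show a program that maps the whole percept sequence to an action.

```python
from typing import Callable, Dict, List, Tuple

Percept = str
Action = str


def make_table_driven_agent(
    table: Dict[Tuple[Percept, ...], Action],
    default: Action = "noop",  # hypothetical fallback action
) -> Callable[[Percept], Action]:
    """Return an agent program implementing the agent function stored in `table`."""
    history: List[Percept] = []  # the whole percept sequence seen so far

    def program(percept: Percept) -> Action:
        history.append(percept)
        # The agent function maps the full percept sequence to an action.
        return table.get(tuple(history), default)

    return program


# Hypothetical agent function for a tiny cleaning world (illustration only).
table = {
    ("dirty",): "clean",
    ("clean",): "move",
    ("clean", "dirty"): "clean",
    ("clean", "clean"): "stop",
}

agent = make_table_driven_agent(table)
actions = [agent(p) for p in ["clean", "dirty"]]
print(actions)
```

The table plays the role of the agent function, while `program` is one possible agent program for it. An explicit table grows with every possible percept sequence, which is one reason practical agent programs use more compact representations.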

Figure 1. Agent and environment

The Types of Agents

There are several basic types of agents with respect to their structure (Russell & Norvig, 1995):

1. The simplest kind of agents are the simple reflex agents. Such an agent only reacts to its current percept, completely ignoring its percept history. When a new percept is received, a rule that maps that percept to an action is activated. Such rules are known as condition-action rules.
2. Model-based reflex agents are more powerful agents, because they maintain some sort of internal state of the environment that depends on the percept history. To maintain this sort of information, an agent must know how the environment evolves, and how its actions affect the environment.
3. Goal-based agents have some sort of goal information that describes desirable states of the world. Such an agent’s decision-making process is fundamentally different, because when a goal-based agent is considering performing an action it is asking itself “would this action make me happy?” along with the standard “what will this action have as a result?”.
4. Utility-based agents use a utility function that maps each state to a number that represents the degree of happiness. They are able to perform rationally even in situations where there are conflicting goals, as well as when there are several goals that can be achieved, but none with certainty.
5. Learning agents do not have a priori knowledge of the environment, but learn about it. This is beneficial because these agents can operate in unknown environments, and to a certain degree it facilitates the job of developers because they do not need to specify the whole knowledge base.
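Two of the agent types listed above can be contrasted in a short sketch. It is only an illustration under stated assumptions: the rule set, the action names, the outcome probabilities and the utility values are all hypothetical. The reflex agent looks only at the current percept, while the utility-based agent picks the action with the highest expected utility when no outcome is certain.

```python
from typing import Dict, List, Tuple


def simple_reflex_agent(percept: str, rules: Dict[str, str]) -> str:
    """Apply a condition-action rule to the current percept only."""
    return rules.get(percept, "wait")  # "wait" is a hypothetical default action


def utility_based_agent(
    outcomes: Dict[str, List[Tuple[float, str]]],
    utility: Dict[str, float],
) -> str:
    """Choose the action whose expected utility is highest."""

    def expected_utility(action: str) -> float:
        # Sum of probability * utility over the possible resulting states.
        return sum(prob * utility[state] for prob, state in outcomes[action])

    return max(outcomes, key=expected_utility)


# Condition-action rules (hypothetical).
rules = {"obstacle": "turn", "clear": "forward"}
reflex_action = simple_reflex_agent("obstacle", rules)

# Each action leads to states with some probability (hypothetical numbers).
outcomes = {
    "highway": [(0.8, "on_time"), (0.2, "late")],
    "back_road": [(1.0, "slightly_late")],
}
utility = {"on_time": 1.0, "late": 0.0, "slightly_late": 0.7}
chosen = utility_based_agent(outcomes, utility)
print(reflex_action, chosen)
```

With these numbers the expected utility of "highway" is 0.8 and that of "back_road" is 0.7, so the utility-based agent prefers the uncertain but on average better option.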

Multi-Agent Systems

Multi-Agent Systems (MAS) are systems composed of multiple autonomous components (agents). They historically belong to Distributed Artificial Intelligence (Bond & Gasser, 1998). A MAS can be defined as a loosely coupled network of problem solvers that work together to solve problems that are beyond the individual capabilities or knowledge of a single problem solver (Durfee and Lesser, 1989). In a MAS, each agent has incomplete information or capabilities for solving the problem and thus has a limited viewpoint. There is no global system control, the data is decentralized and the computation is asynchronous. In addition to MAS, there is also the concept of a multi-agent environment, which can be seen as an environment that includes more than one agent. It can be cooperative, competitive, or a combination of both, and it creates a setting where agents need to interact (socialize) with each other, either to achieve their individual objectives or to manage the dependencies that follow from being situated in a common environment. These interactions range from simple semantic interoperation (exchanging comprehensible communications) and client-server interactions (the ability to request that a particular action is performed) to rich social interactions (the ability to cooperate, coordinate, and negotiate about a course of action).

Because of the issues that arise from the heterogeneous nature of the agents involved in communication (e.g., finding one another), there is also a need for middle-agents, which support cooperation among agents and connect service providers with service requesters in the agent world. These agents are useful in various roles, such as matchmakers or yellow page agents that collect and process service offers (“advertisements”), blackboard agents that collect requests, and brokers that process both (Sycara, Decker, & Williamson, 1997). There are several alternatives to middle-agents, such as Electronic Institutions, a framework for agents’ negotiation which seeks to incorporate organizational concepts into multi-agent systems (Rocha and Oliveira, 2001).

Communication among agents is achieved by exchanging messages expressed in a mutually understandable language (syntax) and carrying mutually understandable semantics. In order to find a common ground for communication, an agent communication language (ACL) should be used to provide mechanisms for agents to negotiate, query, and inform each other. The most important such languages today are KQML (Knowledge Query and Manipulation Language) (ARPA Knowledge Sharing Initiative, 1993) and FIPA ACL (FIPA, 1997).
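The matchmaker (yellow page) role described above can be sketched as a small registry. This is a simplified illustration, not KQML or FIPA ACL: the class name, method names, agent names and service names are hypothetical, and real middle-agents exchange ACL messages rather than direct method calls.

```python
from collections import defaultdict
from typing import Dict, List


class Matchmaker:
    """Toy yellow-page agent: stores advertisements and answers lookups."""

    def __init__(self) -> None:
        # Maps a service name to the providers that advertised it.
        self._ads: Dict[str, List[str]] = defaultdict(list)

    def advertise(self, provider: str, service: str) -> None:
        """A provider agent registers a service offer."""
        if provider not in self._ads[service]:
            self._ads[service].append(provider)

    def find(self, service: str) -> List[str]:
        """A requester agent asks which providers offer a service."""
        return list(self._ads.get(service, []))


matchmaker = Matchmaker()
matchmaker.advertise("weather-agent", "forecast")
matchmaker.advertise("news-agent", "headlines")
matchmaker.advertise("backup-weather-agent", "forecast")

providers = matchmaker.find("forecast")
print(providers)
```

After the lookup, the requester contacts one of the returned providers directly. A broker, by contrast, would forward the request itself and relay the answer.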

AGENT APPLICABILITY

There are great possibilities for applying multi-agent systems to solving different kinds of practical problems.

• The auction negotiation model, as a form of communication, enables a group of agents to find good solutions by reaching agreement and making mutual compromises in case of conflicting goals. Such an approach is applicable to trading systems, where agents act on behalf of buyers and sellers. Financial markets, as well as scheduling, travel arrangement, and fault diagnosis, are also applicable fields for agents.
• Another very important field is information gathering, where agents are used to search through diverse and vastly different information sources (e.g., the World Wide Web) and acquire relevant information for their users. One of the most common domains is Web browsing and search, where agents are used to adapt the content (e.g., search results) to the users’ preferences and offer relevant help in browsing.
• Process control software systems require various kinds of automatic (autonomous) control and reaction for their processes (e.g., a production process). Being reactive and responsive, agents fit the needs of such tasks well. Example domains in this field include production process control, climate monitoring, spacecraft control, and monitoring of nuclear power plants.
• Artificial life studies the evolution of agents, or populations of computer-simulated life forms, in artificial environments. The goal is to study phenomena found in real-life evolution in a controlled manner, with the hope of eliminating some of the inherent limitations and cruelty of evolutionary studies using live animals.
• Finally, intelligent tutoring systems often include pedagogical agents, which are software entities constructed to present the learning content in a user-friendly fashion and to monitor the user’s progress through the learning process. These agents are responsible for guiding the user and suggesting additional learning topics related to the user’s needs (Devedzic, 2006).

Some of the more specific examples of intelligent agent applications include the Talaria System, military training, and Mobility Agents.

Talaria System (The Autonomous Lookup And Report Internet Agent System) is a multi-agent system developed for academic purposes at the University of Belgrade, Serbia. It was built as a solution to the common problem of gathering information from diverse Web sites that do not provide RSS feeds for news tracking. The system was implemented using the JADE modeling framework in Java (Stankovic, Krcadinac, Kovanovic & Jovanovic, 2007). Talaria System uses the advantages of the human-agent communication model to improve the usability of web sites and to relieve users of annoying and repetitive work. The system provides each user with a personal agent, which periodically monitors the Web sites that the user has expressed interest in. The agent informs its user about relevant changes, filtered by assumed user preferences and default relevance factors. Human-agent communication is implemented via email, so that a user can converse with her/his agent in natural language, whereas the agent heuristically interprets concrete instructions from the mail text (e.g., “monitor this site” or “kill yourself”).

Simulation and modelling are extensively used in a wide range of military applications, from development, testing and acquisition of new systems and technologies, to operation, analysis and provision of training, and mission rehearsal for combat situations. The Human Variability in Computer Generated Forces (HV-CGF) project, undertaken on behalf of the UK’s Ministry of Defence, developed a framework for simulating behavioral changes of individuals and groups of military personnel when subjected to moderating influences such as caffeine and fatigue. The project was built with the JACK Intelligent Agents toolkit, a commercial Java-based environment for developing and running multi-agent applications. Each team member is a rational agent able to execute actions such as doctrinal and non-doctrinal behaviour tactics, which are encoded as JACK agent graphical plans (Belecheanu et al., 2005).

Mobility Agents is an agent-based architecture that helps a person with cognitive disabilities to travel using public transportation. Agents are used to represent transportation participants (buses and travelers) and to enable notification of bus approach and arrival. Information is passed to the traveler using a multimedia interface, via a handheld device. Customizable user profiles determine the most appropriate modality of interaction (voice, text, and pictures) based on the user’s abilities (Repenning & Sullivan, 2003). This requires a personal agent to ensure that abstract goals, such as “go home”, are translated into concrete directions. To achieve this, an agent needs to collect information about user-specific locations and must be able to suggest the right bus for the particular user’s current location and destination.

FUTURE TRENDS

The future looks bright for this technology, as development is taking place within a context of broader visions and trends in IT. The whole growing field of IT is about to drive forward the R&D of intelligent agents. We especially emphasize the Semantic Web, ambient intelligence, service oriented computing, peer-to-peer computing and Grid computing.

The Semantic Web is the vision of the future Web based on the idea that the data on the Web can be defined and linked in such a way that it can be used by machines for automatic processing and integration (Berners-Lee, Hendler, & Lassila, 2001). The key to achieving this is augmenting Web pages with descriptions of their content in such a way that it is possible for machines to reason automatically about that content. The common opinion is that the Semantic Web itself will be a form of intelligent infrastructure for agents, allowing them to “understand” the meaning of the data on the Web (Luck et al., 2005).

The concept of ambient intelligence describes a shift away from PCs to a variety of devices which are embedded in our environment and which are accessed via intelligent interfaces. It requires agent-like technologies in order to achieve autonomy, distribution, adaptation, and responsiveness.

Service oriented computing is where MAS could become very useful. In particular, this might involve web services, where Quality of Service demands are important. Each web service could be modeled as an agent, with dependencies, and then simulated for observed failure rates.

Peer-to-peer (P2P) computing, which covers networked applications in which every node is in some sense equivalent to all others, is likely to become more complex in the future. Auction mechanism design, agent negotiation techniques, increasingly advanced approaches to trust and reputation, and the application of social norms, rules and structures are some of the agent technologies that are about to become relevant in the context of P2P computing.

Grid computing is the high-performance agent-based computing infrastructure for supporting large-scale distributed scientific endeavour. The Grid provides a means of developing eScience applications, yet it also provides a computing infrastructure for supporting more general applications that involve large-scale information handling, knowledge management and service provision. The key benefit of Grid computing is flexibility: the distributed system and network can be reconfigured on demand in different ways as business needs change.

Some considerable challenges still remain in the agent-based world, such as the lack of sophisticated software tools, techniques and methodologies that would support the specification, development, integration and management of agent systems.

CONCLUSION

Today, research and development in the field of intelligent agents is rapidly expanding. At its core is the concept of autonomous agents interacting with one another for their individual and/or collective benefit. A number of significant advances have been made over the past two decades in the design and implementation of individual autonomous agents, and in the way in which they interact with one another. These concepts and technologies are now finding their way into commercial products and real-world software solutions. Future IT visions share the common need for agent technologies and suggest that agent technologies will continue to be of vital importance. It is foreseeable that agents will become an integral part of information technologies and artificial intelligence in the near future, which is why they deserve close attention.

REFERENCES

Agha, G., Wegner, P., & Yonezawa, A. (Eds.). (1993). Research directions in concurrent object-oriented programming. Cambridge, MA: The MIT Press.
ARPA Knowledge Sharing Initiative. (1993). Specification of the KQML agent-communication language – plus example agent policies and architectures. Retrieved January 30, 2007, from http://www.csee.umbc.edu/kqml/papers/kqmlspec.pdf.
Barber, K. S., & Martin, C. E. (1999). Agent Autonomy: Specification, Measurement, and Dynamic Adjustment. Autonomy Control Software Workshop, Seattle, Washington.
Belecheanu, A. R., Luck, M., McBurney, P., Miller, T., Munroe, S., Payne, T., & Pechoucek, M. (2005). Commercial Applications of Agents: Lessons, Experiences and Challenges (p. 2). Southampton, UK.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May). The Semantic Web. Scientific American, pp. 35-43.
Bond, A. H., & Gasser, L. (Eds.). (1998). Readings in distributed artificial intelligence. San Mateo, CA: Morgan Kaufmann Publishers.
Booch, G. (2004). Object-oriented analysis and design (2nd ed.). MA: Addison-Wesley.
Booth, D., Haas, H., McCabe, F., Newcomer, E., Champion, M., Ferris, C., & Orchard, D. (2004, February). Web services architecture. W3C working group note 11. Retrieved January 30, 2007, from http://www.w3.org/TR/ws-arch/.


Devedzic, V. (2006). Semantic web and education. Berlin, Heidelberg, New York: Springer.
Durfee, E. H., & Lesser, V. (1989). Negotiating task decomposition and allocation using partial global planning. In L. Gasser, & M. Huhns (Eds.), Distributed artificial intelligence: Volume II (pp. 229–244). London: Pitman Publishing and San Mateo, CA: Morgan Kaufmann.
FIPA (1997). Part 2 of the FIPA 97 specifications: Agent communication language. Retrieved January 30, 2007, from http://www.fipa.org/specs/fipa00003/OC00003A.html.
Krcadinac, U., Stankovic, M., Kovanovic, V., & Jovanovic, J. (2007). Intelligent Multi-Agent Systems. In Carteli, A., & Palma, M. (Eds.), Encyclopedia of Information Communication Technology. Idea Group International Publishing (forthcoming).
Luck, M., McBurney, P., Shehory, O., & Willmott, S. (2005). Agent technology: Computing as interaction. Retrieved January 30, 2007, from http://www.agentlink.org/roadmap/al3rm.pdf.
Maes, P. (1994). Agents that reduce work and information overload. Communications of the ACM, 37(7), 31–40.
Repenning, A., & Sullivan, J. (2003). The Pragmatic Web: Agent-Based Multimodal Web Interaction with no Browser in Sight. In G.W.M. Rauterberg, M. Menozzi, & J. Wesson (Eds.), Proceedings of the Ninth International Conference on Human-Computer Interaction (pp. 212-219). Amsterdam, The Netherlands: IOS Press.
Rocha, A. P., & Oliveira, E. (2001). Electronic Institutions as a framework for Agents’ Negotiation and mutual Commitment. In P. Brazdil, A. Jorge (Eds.), Progress in Artificial Intelligence (Proceedings of 10th EPIA), LNAI 2258, pp. 232-245. Springer.
Russell, S. J., & Norvig, P. (1995). Artificial intelligence: A modern approach. New Jersey: Prentice-Hall.
Stankovic, M., Krcadinac, U., Kovanovic, V., & Jovanovic, J. (2007). An Overview of Intelligent Software Agents. In Khosrow-Pour, M. (Ed.), Encyclopedia of Information Science and Technology, 2nd Edition. Idea Group International Publishing (forthcoming).
Sycara, K., Decker, K., & Williamson, M. (1997). Middle-Agents for the Internet. In M. E. Pollack (Ed.), Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (pp. 578-584). Morgan Kaufmann Publishers.
Wooldridge, M., & Jennings, N. R. (1995). Intelligent agents: Theory and practice. The Knowledge Engineering Review, 10(2), pp. 115–152.

KEY TERMS

Actuators: Software component and part of the agent used as a means of performing actions in the agent environment.

Agent Autonomy: Agent’s active use of its capabilities to pursue some goal, without intervention by any other agent in the decision-making process used to determine how that goal should be pursued (Barber & Martin, 1999).

Agent Percepts: All the information that an agent receives through its sensors about the state of the environment or any part of the environment.

Intelligent Software Agent: An encapsulated computer system that is situated in some environment and that is capable of flexible, autonomous action in that environment in order to meet its design objectives (Wooldridge & Jennings, 1995).

Middle-Agents: Agents that facilitate cooperation among other agents and typically connect service providers with service requesters.

Multi-Agent System (MAS): A software system composed of several agents that interact in order to find solutions to complex problems.

Sensors: Software component and part of the agent used as a means of acquiring information about the current state of the agent environment (i.e., agent percepts).


Intelligent Traffic Sign Classifiers

Raúl Vicen Bueno, University of Alcalá, Spain
Elena Torijano Gordo, University of Alcalá, Spain
Antonio García González, University of Alcalá, Spain
Manuel Rosa Zurera, University of Alcalá, Spain
Roberto Gil Pita, University of Alcalá, Spain

INTRODUCTION

Artificial Neural Networks (ANNs) are inspired by the behavior of the brain, so they can be considered intelligent systems. An ANN is built from the same basic element as a brain: the neuron. The neurons are connected so that they can interact with each other and give rise to the desired intelligent behavior. Finally, like any brain, an ANN needs memory, which in this model is provided by its weights. From this point of view, ANNs are able to learn difficult tasks. In this article, the task to learn is to distinguish between different kinds of traffic signs. Moreover, the ANN must learn from traffic signs that are not in perfect condition, so the learning must be robust against several problems such as rotation, translation or even vandalism. To achieve this objective, an intelligent extraction of information from the images is performed. This stage is very important because it improves the performance of the ANN in this task.

BACKGROUND

The Traffic Sign Classification (TSC) problem has been studied many times in the literature. This problem is solved in (Perez, 2002, Escalera, 2004) using the correlation between the traffic sign and each element of a database, which involves a large computational cost. In (Hsu, 2001), Matching Pursuit (MP) is applied in two stages: training and testing. The training stage finds a set of the best MP filters for each traffic sign, while the testing stage projects the unknown traffic sign onto different MP filters to find the best match. This method also implies a large computational cost, especially when the number of elements grows. In recent works (Escalera, 2003, Vicen, 2005a, Vicen, 2005b), the use of ANNs is studied. The first one studies the combination of the Adaptive Resonance Theory with ANNs. It is applied to the whole image, where many traffic signs can exist, which means that the ANN complexity must be very high to recognize all the possible signs. In the last two works, the TSC is constructed using a preprocessing stage before the ANN, which reduces the computational cost of the classifier.

TSCs are usually composed of two specific stages: the detection of traffic signs in a video sequence or image, and their classification. In this work we pay special attention to the classification stage. The performance of these stages depends highly on the lighting conditions of the scene and the state of the traffic sign due to deterioration, vandalism, rotation, translation or inclination. Moreover, the ideal position of a sign is perpendicular to the trajectory of the vehicle, but in many cases this does not hold. Problems related to the traffic sign size are of special interest too. Although the size is normalized, signs of different sizes can be found, because the distance between the camera and the sign is variable. So, the classification of a traffic sign in this environment is not easy.



The objective of this work is the study of different classification techniques combined with different preprocessings to implement an intelligent TSC system. The preprocessings considered are shown below and are used to reduce the classifier complexity and to improve its performance. The studied classifiers are the k-Nearest Neighbor (k-NN) and an ANN based method using Multilayer Perceptrons (MLPs). So, this work tries to find which are the best preprocessings, the best classifiers and which combination of them minimizes the error rate.

INTELLIGENT TRAFFIC SIGN CLASSIFICATION

Intelligent traffic sign classification can be achieved by taking into account two important aspects. The first one focuses on the extraction of the relevant information from the input traffic signs, which can be done adaptively or in a fixed way. The second one is related to the classification core. For this part, ANNs can play a great role, because they are able to learn from different environments. So, an intelligent combination of both aspects can lead to success in the classification of traffic signs.

Traffic Sign Classification System Overview

The TSC system and the blocks that compose it are shown in figure 1. Once the Video Camera block takes a video sequence, the Image Extraction block makes the video sequence easy to read and is responsible for obtaining images. The Sign Detection and Extraction Stage extracts all the traffic signs contained in each image and generates small images called blobs, one per possible sign. Figure 1 also shows an example of the way this block works. The Color Recognition Stage is responsible for discerning the predominant color of the traffic sign: blue, red or others. Once the blob is classified according to its predominant color, the TSC Stage is responsible for recognizing the exact type of sign, which is the aim of this work. This stage is divided into two parts: the traffic sign preprocessing stage and the TSC core.

Database Description

The database of blobs used to obtain the results presented in this work is composed of blobs with only noise and nine different types of blue traffic signs, which belong to the international traffic code. Figure 2.a (Normal Traffic Signs) shows the different classes of traffic signs considered in this work, which have been collected by the TSC system presented above. So, they present distortions due to the problems described in previous sections, which are shown in figure 2.b (Traffic Signs with problems). The problems caused by vandalism are shown in the example of class S8. The problems related to the blob extraction in the Sign Detection and Extraction Stage (not a correct fit in the square image) are shown in the examples of classes S2, S4 and S9. Examples of signs with problems of rotation, translation or inclination are those of classes S4, S6 and S9. Finally, the difference of brightness is observed in both parts of figure 2. For example, when the lighting of the blob is high, the vertical row of the example of class S3 is greater than horizontal row of the example of class S2.

Figure 1. Traffic sign classification system

Figure 2. Noise and nine classes of international traffic signs: (a) Normal traffic signs and (b) Traffic signs with problems

Traffic Sign Preprocessing Stage

Each blob presented at the input of the TSC stage contains information of the three color components: red, green and blue. Each blob is composed of 31x31 pixels, so the memory required for each blob is 2883 bytes. Because of this high quantity of data, the purpose of this stage is to reduce it and to limit the redundancy of information, in order to improve the TSC performance and to reduce the computational cost of the TSC core. The first preprocessing made in this stage is the transformation of the color blob (3x31x31) to a gray scale blob (31x31) (Paulus, 2003). For the following explanation, consider that M is a general bidimensional matrix that contains either the gray scale blob or the output of one of the following preprocessings:

• Median filter (MF) (Abdel, 2004). It is applied to each pixel of M. A block of nxn elements that surrounds a pixel of M is taken and sorted into a linear vector. The median value of this vector is selected as the value of the processed pixel. This preprocessing is usually used to reduce the noise in an image.
• Histogram equalization (HE). It tries to enhance the contrast of M. The pixels are transformed according to a specified image histogram (Paulus, 2003). This equalization is usually used to improve the dynamic range of M.
• Vertical (VH) and horizontal (HH) histograms (Vicen, 2005a, Vicen, 2005b). They are computed with

\[ vh_i = \frac{1}{31} \sum_{j=1}^{31} (m_{i,j} > T), \quad i = 1, 2, \ldots, 31 \tag{1} \]

\[ hh_j = \frac{1}{31} \sum_{i=1}^{31} (m_{i,j} > T), \quad j = 1, 2, \ldots, 31 \tag{2} \]

respectively, where $m_{i,j}$ is the element of the i-th row and j-th column of the matrix M and T is the fixed or adaptive threshold of this preprocessing. If T is fixed, it is established at the beginning of the preprocessing; if T is adaptive, it can be calculated with the Otsu method (Ng, 2004) or with the mean value of the blob, so both adaptive methods are M-dependent. With this index convention, $vh_i$ is the ratio of values in the i-th row that are greater than T, and $hh_j$ is the ratio of values in the j-th column that are greater than T.
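The vertical and horizontal histograms can be sketched in a few lines of plain Python. The sketch follows the index convention of equations (1) and (2), where the sum for vh_i runs over the columns j of row i, and it uses the mean of the blob as the adaptive threshold. The function names and the 3x3 test matrix are hypothetical; a real blob would be 31x31.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def mean_threshold(m: Matrix) -> float:
    """Adaptive threshold T taken as the mean value of the blob."""
    values = [v for row in m for v in row]
    return sum(values) / len(values)


def vertical_histogram(m: Matrix, t: float) -> List[float]:
    """vh_i: fraction of elements m[i][j] in row i that exceed t (eq. 1)."""
    return [sum(v > t for v in row) / len(row) for row in m]


def horizontal_histogram(m: Matrix, t: float) -> List[float]:
    """hh_j: fraction of elements m[i][j] in column j that exceed t (eq. 2)."""
    n_rows = len(m)
    n_cols = len(m[0])
    return [sum(m[i][j] > t for i in range(n_rows)) / n_rows for j in range(n_cols)]


# Hypothetical 3x3 gray scale "blob" (illustration only).
blob = [
    [0, 200, 200],
    [0, 0, 200],
    [0, 0, 0],
]
T = mean_threshold(blob)
vh = vertical_histogram(blob, T)
hh = horizontal_histogram(blob, T)
features = vh + hh  # concatenated feature vector passed to the classifier
print(features)
```

For a 31x31 blob the concatenated vector has 62 elements, far fewer than the 961 gray scale pixels, which is the data reduction this stage aims for.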


Traffic Sign Classification Core

TSC can be formulated as a multiple hypothesis test. Consider that $P(D_i|S_j)$ is the probability of deciding in favor of $S_i$ (decision $D_i$) when the true hypothesis is $S_j$, $C_{i,j}$ is the cost associated with this decision and $P(S_j)$ is the prior probability of hypothesis $S_j$. Then the objective is to minimize a risk function given by the average cost $\bar{C}$, which is defined in (3) for L hypotheses.

\[ \bar{C} = \sum_{i=1}^{L} \sum_{j=1}^{L} C_{i,j} \, P(D_i|S_j) \, P(S_j) \tag{3} \]

The classifier performance can be given as the total error rate (Pe) and the total correct rate (Pc = 1 - Pe) over all the hypotheses (classes).

Traffic Sign Classification Core Based on Statistical Methods: The k-NN

The k-NN approach is a widely used statistical method (Kisienski, 1975) applied in classification tasks. It assumes that the training set contains $M_i$ points of class $S_i$ and M points in total, so

\[ \sum_{i} M_i = M. \]

Then a hypersphere around the observation point x is taken, which encompasses k points irrespective of their class label. Suppose this sphere, of volume V, contains $k_i$ points of class $S_i$; then

\[ p(x|S_i) \approx \frac{k_i}{M_i V} \tag{4} \]

provides an approximation to the class-conditional density. The unconditional density can be estimated using

\[ p(x) \approx \frac{k}{M V} \tag{5} \]

while the priors can be estimated using

\[ p(S_i) \approx \frac{M_i}{M}. \tag{6} \]

Then, applying Bayes’ theorem (Bishop, 1995), we obtain:

\[ p(S_i|x) = \frac{p(x|S_i) \, p(S_i)}{p(x)} \approx \frac{k_i}{k}. \tag{7} \]

Thus, to minimize the probability of misclassifying x, it should be assigned to the class $S_i$ for which the ratio $k_i/k$ is highest. In practice, each x of the test set is compared with all the training set patterns, and the most appropriate class $S_i$ is decided. k denotes the number of patterns that take part in the final decision of classifying x in class $S_i$. When a draw exists in the majority voting, the decision is taken using the class of the nearest pattern. So, the results for k=1 and k=2 are the same.
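The decision rule above (assign x to the class with the highest ratio k_i/k, and break draws with the nearest pattern) can be sketched as follows. The Euclidean distance and the interpretation of the tie-break as "the nearest pattern among the tied classes" are assumptions of this sketch, and the training points are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Pattern = Tuple[Sequence[float], str]  # (feature vector, class label)


def knn_classify(x: Sequence[float], training: List[Pattern], k: int) -> str:
    """Classify x by majority vote among its k nearest training patterns."""
    # Rank all training patterns by Euclidean distance to x (assumed metric).
    ranked = sorted(training, key=lambda pattern: math.dist(x, pattern[0]))
    neighbours = ranked[:k]

    # k_i for every class S_i present among the k neighbours.
    votes = Counter(label for _, label in neighbours)
    best = max(votes.values())
    tied = {label for label, count in votes.items() if count == best}

    # On a draw, use the class of the nearest pattern among the tied classes.
    for _, label in neighbours:
        if label in tied:
            return label
    raise ValueError("training set must not be empty and k must be >= 1")


# Hypothetical two-class training set (illustration only).
training = [
    ((0.0, 0.0), "S1"),
    ((1.0, 0.0), "S1"),
    ((5.0, 5.0), "S2"),
    ((6.0, 5.0), "S2"),
]
x = (3.0, 2.4)
labels = [knn_classify(x, training, k) for k in (1, 2, 4)]
print(labels)
```

For this x the two nearest patterns belong to different classes, so k=2 produces a draw that is resolved by the nearest pattern and gives the same answer as k=1, as noted in the text.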

Traffic Sign Classification Core Based on Neural Networks: The MLP

The Perceptron was developed by F. Rosenblatt (Rosenblatt, 1962) in the 1960s for optical character recognition. The Perceptron has multiple inputs fully connected to an output layer with multiple outputs. Each output yi is the result of applying a non-linear function, called the activation function, to a linear combination of the inputs. MLPs (Haykin, 1999) extend the Perceptron by cascading one or more extra layers of processing elements. These layers are called hidden layers, since their elements are not connected directly to the external world. The expression I/H1/.../Hh/O denotes an MLP with I inputs (size of the observation vector x), h hidden layers, the j-th of which has Hj neurons, and O outputs (size of the classification vector y). Cybenko’s theorem (Cybenko, 1989) states that any continuous function f: ℜ^n → ℜ can be approximated with any degree of precision by superpositions of log-sigmoidal functions. Therefore, MLPs using the log-sigmoidal activation function for each neuron are selected. The MLPs are trained with the backpropagation algorithm using gradient descent with momentum and an adaptive learning rate, where the Mean Square Error (MSE) criterion is minimized. Moreover, cross-validation is used in order to reduce generalization problems.
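As an illustration of the training scheme just described, the following sketch implements an I/H1/O MLP with log-sigmoidal units, trained in batch mode by gradient descent with momentum on the MSE. It is only a sketch under stated assumptions: a single hidden layer, a fixed learning rate (the adaptive learning rate and the validation-based stopping are omitted), and hypothetical names (`MLP`, `train_epoch`) and parameter values.

```python
import math
import random


def logsig(a):
    """Log-sigmoidal activation function."""
    return 1.0 / (1.0 + math.exp(-a))


class MLP:
    """I/H1/O perceptron with log-sigmoid units (one hidden layer assumed)."""

    def __init__(self, n_in, n_hid, n_out, seed=0):
        rng = random.Random(seed)
        # One row of weights per neuron; the last entry of each row is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]
        # Momentum terms (previous weight updates), initially zero.
        self.v1 = [[0.0] * (n_in + 1) for _ in range(n_hid)]
        self.v2 = [[0.0] * (n_hid + 1) for _ in range(n_out)]

    def forward(self, x):
        h = [logsig(sum(w * v for w, v in zip(row, x + [1.0]))) for row in self.w1]
        y = [logsig(sum(w * v for w, v in zip(row, h + [1.0]))) for row in self.w2]
        return h, y

    def mse(self, X, T):
        total = 0.0
        for x, t in zip(X, T):
            _, y = self.forward(x)
            total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
        return total / (len(X) * len(T[0]))

    def train_epoch(self, X, T, lr=0.1, momentum=0.9):
        """One batch update: backpropagate the MSE gradient, then apply momentum."""
        g1 = [[0.0] * len(row) for row in self.w1]
        g2 = [[0.0] * len(row) for row in self.w2]
        n = len(X) * len(T[0])
        for x, t in zip(X, T):
            h, y = self.forward(x)
            # Output deltas: d(MSE)/dy times the sigmoid derivative y(1 - y).
            d_out = [2.0 * (yi - ti) * yi * (1.0 - yi) / n for yi, ti in zip(y, t)]
            # Hidden deltas: output deltas propagated back through w2.
            d_hid = [hj * (1.0 - hj) * sum(d_out[o] * self.w2[o][j] for o in range(len(y)))
                     for j, hj in enumerate(h)]
            for o, d in enumerate(d_out):
                for j, v in enumerate(h + [1.0]):
                    g2[o][j] += d * v
            for j, d in enumerate(d_hid):
                for i, v in enumerate(x + [1.0]):
                    g1[j][i] += d * v
        for w, vel, g in ((self.w1, self.v1, g1), (self.w2, self.v2, g2)):
            for r in range(len(w)):
                for c in range(len(w[r])):
                    vel[r][c] = momentum * vel[r][c] - lr * g[r][c]
                    w[r][c] += vel[r][c]
```

In use, `train_epoch` would be called repeatedly on the training set while the MSE on a separate validation set is monitored, and training would stop when that validation error no longer improves.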

RESULTS

The database considered for the experiments is composed of 235 blobs of ten different classes: noise (S0) and nine classes of traffic signs (S1-S9). The database has been divided into three sets: training, validation and test, which are composed of 93, 52 and 78 blobs, respectively, and which are preprocessed before they are presented to the TSC core. The first set is used to train the k-NN and the MLPs. The second is used to stop the MLP training algorithm (Bishop, 1995; Haykin, 1999) according to the cross-validation applied during training. The last is used to evaluate the performance of the k-NN and the MLPs. Experimental environments characterized by a high-dimensional space and a small data set pose generalization problems. For this reason, the MLP training is repeated 10 times with a different weight initialization each time, and the best MLP in terms of the Pe estimated with the validation set is selected. Once the color blobs are transformed to gray scale, three different combinations of preprocessings (CPPs) are applied, each producing an output of 62 elements:

• The first combination (CPP1) applies the VH and HH with an adaptive threshold T calculated with the mean of the blob.
• The second combination (CPP2) applies, in this order, the HE and the VH and HH with an adaptive threshold T calculated with the Otsu method.
• The third combination (CPP3) applies, in this order, the MF, the HE and the VH and HH with a fixed threshold (T=185).

For the TSC core based on the k-NN, a study of the k parameter is made for the different CPPs considered in the experiments (Table 1). The lowest error rate is achieved with CPP3 and k=1, with Pe=6,4% (Pc=93,6%). For the TSC core based on MLPs, a study of the number of hidden layers (h) and the number of neurons in each one (Hh) is done. For the case of one hidden layer (h=1), Table 2 shows the results for the different CPPs. In this case, the best performance is obtained with CPP3 and an MLP of 62/62/10, whose error rate is Pe=2,6% (Pc=97,4%). CPP2 achieves good performance, but always below that of CPP3. The use of CPP1 with MLPs achieves the worst results of the three cases under study. The study of the TSC core based on an MLP with two hidden layers (h=2) (Table 3) shows that the best combination of CPP and [H1,H2] for the MLP is CPP3 and [H1=70,H2=20], respectively. In this case, the best performance achieved is Pe=1,3% (Pc=98,7%). As occurs for MLPs with one hidden layer, the best CPP is the third one and the worst is the first one.

FUTURE TRENDS

Further innovations can be achieved in this research area. New trends try to improve the preprocessing techniques; in this respect, advanced signal processing can be applied to TSC. On the other hand, other TSC cores can be used. For instance, classifiers based on Radial Basis Functions or Support Vector Machines (Maldonado, 2007) can be applied. Finally, optimization techniques, like Genetic Algorithms, have an important role in this research area for finding the best selection of preprocessings from a bank of them.

CONCLUSION

The performance of all the TSC designs is quite good, even when the blobs suffer from deterioration, vandalism, rotation, translation, inclination, an incorrect fit in the 31x31 blob and variation in size. Several combinations of preprocessings are used. The best one applies, in this order, the median filter, the histogram equalization and the vertical and horizontal histograms with a fixed threshold (T=185). Concerning the type of classifier, the best TSCs are always achieved with MLPs. Moreover, the best results are achieved by MLPs with two hidden layers. The TSC core based on a 62/70/20/10 MLP (Pe=1,3%) reduces Pe by 1,3 percentage points with respect to the best MLP with only one hidden layer (62/62/10) and by 5,1 percentage points with respect to the best k-NN (k=1).
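The reported error rates can be related to the size of the test set with a short calculation. The sketch below assumes that Pe is the fraction of misclassified blobs among the 78 test blobs; the error counts of 1, 2 and 5 blobs are inferred from the percentages and are not stated in the text, and the helper name `pe_percent` is hypothetical.

```python
TEST_BLOBS = 78  # size of the test set given in RESULTS


def pe_percent(errors, total=TEST_BLOBS):
    """Error probability in percent, rounded to one decimal place."""
    return round(100.0 * errors / total, 1)


# Assumed error counts that reproduce the three best reported results.
pe_mlp_two_layers = pe_percent(1)   # 62/70/20/10 MLP
pe_mlp_one_layer = pe_percent(2)    # 62/62/10 MLP
pe_knn = pe_percent(5)              # k-NN with k=1

# Reductions in percentage points achieved by the two-hidden-layer MLP.
gain_over_one_layer = round(pe_mlp_one_layer - pe_mlp_two_layers, 1)
gain_over_knn = round(pe_knn - pe_mlp_two_layers, 1)
```

Under this assumption the three values come out as 1.3, 2.6 and 6.4, and the two reductions as 1.3 and 5.1 percentage points, in agreement with the figures quoted above.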


Table 1. Pe(%) versus k parameter for each TSC based on different CPPs and k-NN

k       1     3     4     5     6     7     8     9     10    11    12
CPP1    29,5  30,8  29,5  29,5  32,0  30,8  30,8  28,2  25,6  25,6  25,6
CPP2    19,2  17,9  14,1  16,7  14,1  15,4  19,2  16,7  17,9  19,2  19,2
CPP3    6,4   9,0   9,0   11,5  12,8  12,8  12,8  12,8  12,8  12,8  10,3

Table 2. Pe(%) versus H1 parameter for each TSC based on different CPPs and MLPs of sizes (62/H1/10)

H1      6     14    22    30    38    46    54    62    70    78    86
CPP1    24,4  17,9  17,9  15,4  18,9  16,7  14,1  17,9  19,2  17,9  15,4
CPP2    21,8  14,1  14,1  14,1  12,8  10,3  11,5  10,3  12,8  9,0   11,5
CPP3    12,8  3,8   5,1   3,8   3,8   5,1   3,8   2,6   5,1   5,1   3,8

Table 3. Pe(%) versus [H1, H2] parameters for each TSC based on different CPPs and MLPs of sizes (62/H1/H2/10)

H1      10    10    15    15    25    25    40    40    60    60    70    70
H2      6     8     5     7     8     10    15    20    18    25    20    30
CPP1    28,2  24,4  23,1  25,6  19,2  19,2  19,2  19,2  17,9  17,9  15,4  15,4
CPP2    25,6  25,6  26,9  23,1  17,9  20,5  16,7  11,5  12,8  11,5  12,8  9,0
CPP3    15,4  10,3  15,4  12,8  7,7   5,1   6,4   5,1   5,1   5,1   1,3   5,1

REFERENCES

Abdel-Dayem, A.R., Hamou, A.K., & El-Sakka, M.R. (2004). Novel Adaptive Filtering for Salt-and-Pepper Noise Removal from Binary Document Images. Lecture Notes in Computer Science. (3212), 191-199.
Bishop, C.M. (1995). Neural networks for pattern recognition. Oxford University Press Inc.
Cybenko, G. (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems. (2), 303-314.
Escalera, A. de la, et al. (2004). Visual Sign Information Extraction and Identification by Deformable Models for Intelligent Vehicles. IEEE Trans. on Intelligent Transportation Systems. (5) 2, 57-68.
Escalera, A. de la, et al. (2003). Traffic Sign Recognition and Analysis For Intelligent Vehicles. Image and Vision Computing. (21), 247-258.
Haykin, S. (1999). Neural Networks. A Comprehensive Foundation (Second Edition). Prentice-Hall.
Hsu, S.H., & Huang, C.L. (2001). Road Sign Detection and Recognition Using Matching Pursuit Method. Image and Vision Computing. (19), 119-129.
Kisienski, A. A., et al. (1975). Low-frequency Approach to Target Identification. Proc. IEEE. (63), 1651-1659.
Maldonado-Bascon, S., Lafuente-Arroyo, S., Gil-Jimenez, P., Gomez-Moreno, H., & Lopez-Ferreras, F. (2007). Road-Sign Detection and Recognition Based on Support Vector Machines. IEEE Trans. on Intelligent Transportation Systems. (8) 2, 264-278.
Ng, H.F. (2004). Automatic Thresholding for Defect Detection. IEEE Proc. Third Int. Conf. on Image and Graphics. 532-535.
Paulus, D.W.R., & Hornegger, J. (2003). Applied Pattern Recognition (4th Ed.): Algorithms and Implementation in C++. Vieweg.
Pérez, E., & Javidi, B. (2002). Nonlinear Distortion-Tolerant Filters for Detection of Road Signs in Background Noise. IEEE Trans. on Vehicular Technology. (51) 3, 567-576.
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan books.
Vicen-Bueno, R., Gil-Pita, R., Rosa-Zurera, M., Utrilla-Manso, M., & López-Ferreras, F. (2005a). Multilayer Perceptrons Applied to Traffic Sign Recognition Tasks. Lecture Notes in Computer Science. (3512), 865-872.
Vicen-Bueno, R., Gil-Pita, R., Jarabo-Amores, M. P., & López-Ferreras, F. (2005b). Complexity reduction in Neural Networks applied to Traffic Sign Recognition tasks. 13th European Signal Processing Conference. EUSIPCO 2005.

KEY TERMS

Artificial Neural Networks (ANNs): A network of many simple processors (“units” or “neurons”) that imitates a biological neural network. The units are connected by unidirectional communication channels, which carry numeric data. Neural networks can be trained to find nonlinear relationships in data, and are used in applications such as robotics, speech recognition, signal processing or medical diagnosis.
Backpropagation Algorithm: Learning algorithm of ANNs, based on minimizing the error obtained from the comparison between the ANN outputs after the application of a set of network inputs and the desired outputs. The weights are updated according to the gradient of the error function evaluated at the point of the input space that corresponds to the input to the ANN.
Classification: The act of distributing things into classes or categories of the same type.
Detection: The perception that something has occurred or some state exists.
Information Extraction: Extraction of the relevant aspects contained in data. It is commonly used to reduce the input space of a classifier.
Pattern: Observation vector that, because of its relevance, is considered an important example of the input space.
Preprocessing: Operation or set of operations applied to a signal in order to improve some aspects of it.


Interactive Systems and Sources of Uncertainties

Qiyang Chen
Montclair State University, USA

John Wang
Montclair State University, USA

INTRODUCTION

Today’s e-commerce environment requires that interactive systems exhibit abilities such as autonomy, adaptive and collaborative behavior, and inferential capability. Such abilities are based on knowledge about users and the tasks they perform (Raisinghani, Klassen and Schkade, 2001). To adapt to users’ input and tasks, an interactive system must be able to establish a set of assumptions about users’ profiles and task characteristics, which is often referred to as a user model. However, to develop a user model, an interactive system needs to analyze users’ input and recognize the tasks and the ultimate goals users are trying to achieve, which may involve a great deal of uncertainty. Uncertainty refers to a set of values about an assumption that cannot be determined during a dialog session. In fact, the problem of uncertainty in reasoning processes is a complex and difficult one. Information available for user model construction and reasoning is often uncertain, incomplete, and even vague. The propagation of such data through an inference model is also difficult to predict and control. Therefore, the capacity to deal with uncertainty is crucial to the success of any knowledge management system. Currently, a vigorous debate is in progress concerning how best to represent and process uncertainties in knowledge-based systems. This debate carries great importance because it is not only related to the construction of knowledge-based systems but also focuses on human thinking, in which most decisions are made under conditions of uncertainty. This chapter presents and discusses uncertainties in the context of user modeling in interactive systems. Some elementary distinctions between different kinds of uncertainties are introduced. The purpose is to provide an analytical overview and perspective concerning how and where uncertainties arise and the major methods that have been proposed to cope with them.

Sources of Uncertainties

User-model-based interactive systems face problems of uncertainty in the inference rules, the facts, and the representation languages. There is no widely accepted definition of the presence of uncertainty in user modeling. However, the nature of uncertainty in a user model can be investigated through its origin. Uncertainty can arise from a variety of sources. Several authors have emphasized the need for differentiating among the types and sources of uncertainty. Some of the major sources are as follows:

(1) The imprecise and incomplete information obtained from the user’s input. This type of source is related to the reliability of information, which involves the following aspects:

• Uncertain or imprecise information exists in the factual knowledge (Dutta, 2005). The contents of a user model involve uncertain factors. For instance, the system might want to assert "It is not likely that this user is a novice programmer." This kind of assertion might be treated as a piece of knowledge. But it is uncertain, and it seems difficult to find a numerical description for the uncertainty in this statement (i.e., there is no appropriate sample space in which to give this statement statistical meaning, if a statistical method is considered for capturing the uncertainty).
• The default information often brings uncertain factors to inference processes (Reiter, 1980). For example, the stereotype system carries extensive default assumptions about a user. Some assumptions may be subject to change as interaction progresses.
• Uncertainty occurs as a result of ill-defined concepts in the observations or due to inaccuracy and poor reliability of the measurement (Kahneman and Tversky, 1982). For example, a user's typing speed could be considered a measurement of the user's file editing skill, but for some applications this may be questionable.
• Uncertain exceptions to general assumptions for performing some actions under some circumstances can cause conflicts in reasoning processes.

(2) Inexact language by which the information is conveyed. The second source of uncertainty is caused by the inherent imprecision or inexactness of the representation languages. The imprecision appears in both natural languages and knowledge representation languages. Three kinds of inexactness in natural language have been proposed (Zwick, 1999). The first is generality, in which a word applies to a multiplicity of objects in the field of reference. For example, the word “table” can apply to objects differing in size, shape, materials, and functions. The second kind of linguistic inexactness is ambiguity, which appears when a limited number of alternative meanings have the same phonetic form (e.g., bank). The third is vagueness, in which there are no precise boundaries to the meaning of the word (e.g., old, rich). In knowledge representation languages employed in user modeling systems, if rules are not expressed in a formal language, their meaning usually cannot be interpreted exactly. This problem has been partially addressed by the theory of approximate reasoning. Generally, a proposition (e.g., fact, event) is uncertain if it involves a continuous variable. Note that an exact assumption may be uncertain (e.g., the user is able to learn this concept), and an assumption that is absolutely certain may be linguistically inexact (e.g., the user is familiar with this concept).

(3) Aggregation or summarization of information. The third type of uncertainty source arises from the aggregation of information from different knowledge sources or expertise (Bonissone and Tong, 2005). Aggregating information brings several potential problems that are discussed in (Chen and Norcio, 1997).

(4) Deformation while transferring knowledge. There might be no semantic correspondence between one representation language and another. It is even possible that there is no appropriate representation for certain expertise, for example, the measurement of a user’s mental workload. This makes deformation during transformation inevitable. In addition, human factors greatly affect the procedure of information translation. Several tools that use cognitive models for knowledge acquisition have been presented (Jacobson and Freiling, 1988).

CONCLUSION

Generally, uncertainty affects the performance of an adaptive interface in the following aspects, all of which the management of uncertainty must address (Chen and Norcio, 2001):

• How to determine the degree to which the premise of a given rule has been satisfied.
• How to verify the extent to which external constraints have been met.
• How to propagate the amount of uncertain information through the triggering of a given rule.
• How to summarize and evaluate the findings provided by various rules or domain expertise.
• How to detect possible inconsistencies among the various sources.
• How to rank different alternatives or different goals.

REFERENCES

Barr, A. and Feigenbaum, E. A., The Handbook of Artificial Intelligence 2. Los Altos, Kaufmann, 1982.
Bhatnager, R. K. and Kanal, L. N., “Handling Uncertainty Information: A Review of Numeric and Nonnumeric Methods,” Uncertainty in Artificial Intelligence, Kanal, L. N. and Lemmer, J. F. (ed.), pp. 2-26, 1986.
Bonissone, P. and Tong, R. M., “Editorial: Reasoning with Uncertainty in Expert Systems,” International Journal of Man-Machine Studies, Vol. 30, pp. 69-111, 2005.
Buchanan, B. and Smith, R. G., “Fundamentals of Expert Systems,” Ann. Rev., Computer Science, Vol. 3, pp. 23-58, 1988.
Chen, Q. and Norcio, A.F., “Modeling a User’s Domain Knowledge with Neural Networks,” International Journal of Human-Computer Interaction, Vol. 9, No. 1, pp. 25-40, 1997.
Chen, Q. and Norcio, A.F., “Knowledge Engineering in Adaptive Interface and User Modeling,” Human-Computer Interaction: Issues and Challenges, (ed.) Chen, Q., Idea Group Pub., 2001.
Cohen, P. R. and Grinberg, M. R., “A Theory of Heuristic Reasoning about Uncertainty,” AI Magazine, Vol. 4(2), pp. 17-23, 1983.
Dempster, A. P., “Upper and Lower Probabilities Induced by a Multivalued Mapping,” The Annals of Mathematical Statistics, Vol. 38(2), pp. 325-339, 1967.
Dubois, D. and Prade, H., “Combination and Propagation of Uncertainty with Belief Functions -- A Reexamination,” Proc. of 9th International Joint Conference on Artificial Intelligence, pp. 111-113, 1985.
Dutta, A., “Reasoning with Imprecise Knowledge in Expert Systems,” Information Sciences, Vol. 37, pp. 2-24, 2005.
Doyle, J., “A Truth Maintenance System,” AI, Vol. 12, pp. 231-272, 1979.
Garvey, T. D., Lowrance, J. D. and Fischer, M. A., “An Inference Technique for Integrating Knowledge from Disparate Source,” Proc. of the 7th International Joint Conference on AI, Vancouver, B. C., pp. 319-325, 1982.
Heckerman, D., “Probabilistic Interpretations for MYCIN’s Certainty Factors,” Uncertainty in Artificial Intelligence, (ed.) Kanal, L. N. and Lemmer, J. F., 1986.
Jacobson, C. and Freiling, M. J., “ASTEK: A Multiparadigm Knowledge Acquisition Tool for Complex Structured Knowledge,” International Journal of Man-Machine Studies, Vol. 29, pp. 311-327, 1988.
Kahneman, D. and Tversky, A., “Variants of Uncertainty,” Cognition, 11, pp. 143-157, 1982.
McDermott, D. and Doyle, J., “Non-monotonic Logic,” AI, Vol. 13, pp. 41-72, 1980.
Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publisher, San Mateo, CA, 1988.
Pearl, J., “Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach,” Proc. of the 2nd National Conference on Artificial Intelligence, IEEE Computer Society, pp. 1-12, 1985.
Pednault, E. P. D., Zucker, S. W. and Muresan, L.V., “On the Independence Assumption Underlying Subjective Bayesian Updating,” Artificial Intelligence, 16, pp. 213-222, 1981.
Raisinghani, M., Klassen, C. and Schkade, L., “Intelligent Software Agents in Electronic Commerce: A Socio-Technical Perspective,” Human-Computer Interaction: Issues and Challenges, (ed.) Chen, Q., Idea Group Pub., 2001.
Reiter, R., “A Logic for Default Reasoning,” Artificial Intelligence, Vol. 13, pp. 81-132, 1980.
Rich, E., “User Modeling via Stereotypes,” Cognitive Sciences, Vol. 3, pp. 329-354, 1979.
Shafer, G., A Mathematical Theory of Evidence, Princeton University Press, 1976.
Zadeh, L. A., “Review of Books: A Mathematical Theory of Evidence,” AI Magazine, 5(3), pp. 81-83, 1984.
Zadeh, L. A., “Knowledge Representation in Fuzzy Logic,” IEEE Transactions on Knowledge and Data Engineering, Vol. 1, No. 1, pp. 89-100, 1989.
Zwick, R., “Combining Stochastic Uncertainty and Linguistic Inexactness: Theory and Experimental Evaluation of Four Fuzzy Probability Models,” Int. J. Man-Machine Studies, Vol. 30, pp. 69-111, 1999.

KEY TERMS

Interactive System: A system that allows dialogs between the computer and the user.
Knowledge Based Systems: Computer systems that are programmed to imitate human problem-solving by means of artificial intelligence and reference to a database of knowledge on a particular subject.
Knowledge Representation: The notation or formalism used for coding the knowledge to be stored in a knowledge-based system.
Stereotype: A set of assumptions about a user, based on conventional, formulaic, and simplified conceptions and opinions, which is created by an interactive system.
Uncertainties: Potential deficiencies in any phase or activity of the modeling process that are due to a lack of knowledge.
User Model: A set of information an interactive system infers or collects, which is used to characterize a user’s tasks, goals, domain knowledge and preferences, etc., to facilitate human-computer interaction.


Intuitionistic Fuzzy Image Processing

Ioannis K. Vlachos
Aristotle University of Thessaloniki, Greece

George D. Sergiadis
Aristotle University of Thessaloniki, Greece

INTRODUCTION

Since its genesis, fuzzy sets (FSs) theory (Zadeh, 1965) has provided a flexible framework for handling the indeterminacy characterizing real-world systems, arising mainly from the imprecise and/or imperfect nature of information. Moreover, fuzzy logic set the foundations for dealing with reasoning under imprecision and offered the means for developing a context that reflects aspects of human decision-making. Images, on the other hand, are susceptible to bearing ambiguities, mostly associated with pixel values. This observation was made early on by Prewitt (1970), who stated that “a pictorial object is a fuzzy set which is specified by some membership function defined on all picture points”, thus acknowledging the fact that “some of its uncertainty is due to degradation, but some of it is inherent”. A decade later, Pal & King (1980, 1981, 1982) introduced a systematic approach to fuzzy image processing, by modelling image pixels using FSs expressing their corresponding degrees of brightness. A detailed study of fuzzy techniques for image processing and pattern recognition can be found in Bezdek, Keller, Krisnapuram, & Pal (1999) and Chi, Yan, & Pham (1996). However, FSs themselves suffer from the requirement of precisely assigning degrees of membership to the elements of a set. This constraint reduces the flexibility of FSs theory to cope with data characterized by uncertainty. This observation led researchers to seek more efficient ways to express and model imprecision, thus giving birth to higher-order extensions of FSs theory. This article aims at outlining an alternative approach to digital image processing using the apparatus of Atanassov’s intuitionistic fuzzy sets (A-IFSs), a simple, yet efficient, generalization of FSs. We describe heuristic and analytic methods for analyzing/synthesizing images to/from their intuitionistic fuzzy components and discuss the particular properties of each stage of the process. Finally, we describe various applications of the intuitionistic fuzzy image processing (IFIP) framework from diverse imaging domains and provide the reader with open issues to be resolved and future lines of research to be followed.

BACKGROUND

From the very beginning of their development, FSs have intrigued researchers to apply the flexible fuzzy framework in different domains. In contrast with ordinary (crisp) sets, FSs are defined using a characteristic function, namely the membership function, which maps elements of a universe to the unit interval, thereby attributing values expressing the degree of belongingness with respect to the set under consideration. This particular property of FSs theory was exploited in the context of digital image processing and soon turned out to be a powerful tool for handling the inherent uncertainty carried by image pixels. The importance of fuzzy image processing was rapidly acknowledged by both theoreticians and practitioners, who exploited its potential to perform various image-related tasks, such as contrast enhancement, thresholding and segmentation, de-noising, edge-detection, and image compression. However, despite their vast impact on the design of algorithms and systems for real-world applications, FSs are not always able to directly model uncertainties associated with imprecise and/or imperfect information. This is due to the fact that their membership functions are themselves crisp. These limitations and drawbacks characterizing most ordinary fuzzy logic systems (FLSs) were identified and described by Mendel & Bob John (2002), who traced their sources back to the uncertainties that are present in FLSs and arise from various factors. The very meaning of words that are used in the antecedents and consequents of FLSs can be uncertain, since some words may often mean different things to different people. Moreover, extracting the knowledge from a group of experts who do not all agree leads to consequents having a histogram of values associated with them. Additionally, data presented as inputs to an FLS, as well as data used for its tuning, are often noisy, thus bearing an amount of uncertainty. As a result, these uncertainties translate into additional uncertainties about FS membership functions. Finally, Atanassov et al. (Atanassov, Koshelev, Kreinovich, Rachamreddy & Yasemis, 1998) proved that there exists a fundamental justification for applying methods based on higher-order FSs to deal with everyday-life situations. Therefore, it comes as a natural consequence that such an extension should also be carried out in the field of digital image processing.

THE IFIP FRAMEWORK

In the quest for new theories treating imprecision, various higher-order extensions of FSs were proposed by different scholars. Among them, A-IFSs (Atanassov, 1986) provide a simple and flexible, yet solid, mathematical framework for coping with the intrinsic uncertainties characterizing real-world systems. A-IFSs are defined using two characteristic functions, namely the membership and the non-membership functions, which do not necessarily sum up to unity. These functions assign to elements of the universe corresponding degrees of belongingness and non-belongingness with respect to a set. The membership and non-membership values induce an indeterminacy index, which models the hesitancy of deciding the degree to which an element satisfies a particular property. In fact, it is this additional degree of freedom that provides us with the ability to efficiently model and minimize the effects of uncertainty due to the imperfect and/or imprecise nature of information. Hesitancy in images originates from various factors, the majority of which are caused by inherent weaknesses of the acquisition and imaging mechanisms. Distortions that occur as a result of the limitations of the acquisition chain, such as the quantization noise, the suppression of the dynamic range, or the nonlinear behavior of the mapping system, affect our certainty regarding the “brightness” or “edginess” of a pixel and therefore introduce a degree of hesitancy associated with the corresponding pixel. Moreover, dealing with “qualitative” rather than “quantitative” properties of images is one of the sound advantages of fuzzy-based techniques. Qualitative properties describe image attributes, such as the “contrast” and the “homogeneity” of an image region, or the “edginess” of a boundary, in a more natural and human-centric manner. However, as already pointed out, these terms are themselves imprecise and thus they additionally increase the uncertainty of image pixels. It is therefore a necessity, rather than a luxury, to employ A-IFSs theory to cope with the uncertainty present in real-world images. In order to apply the IFIP framework, images should first be expressed in terms of elements of A-IFSs theory. Analyzing and synthesizing digital images to and from their corresponding intuitionistic fuzzy components is not a trivial task and can be carried out using either heuristic or analytic approaches.

Heuristic Modelling

As already stated, the factors introducing hesitancy in real-world images can be traced back to the acquisition stage of imaging systems and involve pixel degradation, mainly triggered by the presence of quantization noise generated by the A/D converters, as well as the suppression of the dynamic range caused by the imaging sensor. A main effect of quantization noise in images is that there exist a number of gray levels with zero, or almost zero, frequency of occurrence, while gray levels in their vicinity possess high frequencies. This is due to the fact that a gray level g in a digital image can be either (g+1) or (g-1) without any appreciable change in the visual perception. An intuitive and heuristic approach to the modelling of the aforementioned sources of uncertainty in the context of A-IFSs was proposed by Vlachos & Sergiadis (2005, 2007 d) for gray-scale images, while an extension to color images was presented in Vlachos & Sergiadis (2006). The underlying idea involves the application of the concept of the fuzzy histogram of an image, which models the notion of the gray level “approximately g”. The fuzzy histogram takes into account the frequency of neighboring gray levels to assess the frequency of occurrence of the gray level under consideration. Consequently, a quantitative measure of the quantization noise can be calculated as the normalized absolute difference between the ordinary (crisp) and fuzzy histograms. Finally, to further incorporate the additional distortion factors into the calculation of hesitancy, parameters are employed that model the influence of the dynamic range suppression and the fact that lower gray levels are more prone to noise than higher ones.
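The idea of comparing crisp and fuzzy histograms can be illustrated with a minimal sketch. It assumes a triangular membership function for the gray level “approximately g” and a normalization by the largest difference; both choices, as well as the function names and the `spread` parameter, are assumptions made for illustration and are not taken from the cited works.

```python
def crisp_histogram(pixels, levels):
    """Ordinary histogram: number of pixels at each gray level."""
    hist = [0] * levels
    for g in pixels:
        hist[g] += 1
    return hist


def fuzzy_histogram(pixels, levels, spread=2):
    """Frequency of 'approximately g': neighbouring gray levels contribute
    with a triangular weight max(0, 1 - |d| / spread) (assumed shape)."""
    crisp = crisp_histogram(pixels, levels)
    fuzzy = [0.0] * levels
    for g in range(levels):
        for d in range(-spread + 1, spread):
            if 0 <= g + d < levels:
                fuzzy[g] += crisp[g + d] * (1.0 - abs(d) / spread)
    return fuzzy


def quantization_noise(pixels, levels, spread=2):
    """Absolute difference between crisp and fuzzy histograms,
    normalized here by its largest value (an assumed normalization)."""
    crisp = crisp_histogram(pixels, levels)
    fuzzy = fuzzy_histogram(pixels, levels, spread)
    diff = [abs(c - f) for c, f in zip(crisp, fuzzy)]
    peak = max(diff)
    return [d / peak if peak > 0 else 0.0 for d in diff]
```

For a hypothetical image whose pixels all have gray level 2 out of 5 levels, the measure is largest at the empty neighbouring levels 1 and 3, which is where the text locates the effect of quantization noise.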

Analytic Modelling

The analytic approach offers a more generic treatment of hesitancy modelling of digital images, since it does not require a priori knowledge of the system characteristics, nor a particular pre-defined image acquisition model. Generally, it consists of sequential operations that primarily aim to optimally transfer the image from the pixel domain (PD) to the intuitionistic fuzzy domain (IFD), where the appropriate actions will be performed, using the fuzzy domain (FD) as an intermediate step. After the modification of the membership and non-membership components of the image in the IFD, an inverse procedure is carried out for transferring the image back to the PD. A block diagram illustrating the analytic modelling is given in Figure 1. Details on each of the aforementioned stages of IFIP are provided below.

Figure 1. Overview of the analytic IFIP framework

Figure 2. The process of fuzzification (from image properties to membership functions)

Fuzzification

Fuzzification constitutes the first stage of the IFIP framework, which assigns degrees of membership to image pixels with respect to an image property, such as “brightness”, “homogeneity”, or “edginess”. These properties are application-dependent and also determine the operations to be carried out in the following stages of the IFIP framework. For the task of contrast enhancement one may consider the “brightness” of gray levels and construct the corresponding FS “Bright pixel” or “Dark pixel” using different schemes that range from simple intensity normalization to more complex approaches involving knowledge extracted from a group of human experts (Figure 2).
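The simplest of the schemes mentioned above, intensity normalization, can be sketched as follows. The function name and the handling of a flat image are assumptions made for illustration.

```python
def fuzzify_brightness(pixels, g_min=None, g_max=None):
    """Membership of each pixel to the FS 'Bright pixel' by intensity
    normalization: mu(g) = (g - g_min) / (g_max - g_min)."""
    lo = min(pixels) if g_min is None else g_min
    hi = max(pixels) if g_max is None else g_max
    if hi == lo:
        # Flat image: no contrast to normalize; 0.0 is a convention chosen here.
        return [0.0 for _ in pixels]
    return [(g - lo) / (hi - lo) for g in pixels]
```

The membership to the complementary FS “Dark pixel” would then simply be one minus the returned values.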

Intuitionistic Fuzzification Intuitionistic fuzzification is one of the most important stages of the IFIP architecture, since it involves the construction of the A-IFS that represents the image properties in the IFD. The analytic approach allows for an automated modelling of the hesitancy carried by image pixels, by rendering image properties directly from the FS obtained in the fuzzification stage through the use of intuitionistic fuzzy generators (Bustince, Kacprzyk & Mohedano, 2000). In order to construct an A-IFS that efficiently models a particular image property, tunable parametric intuitionistic fuzzy generators are utilized. The underlying statistics of images are closely related to, and strongly affect, the process of hesitancy modelling. Different parameter values of the intuitionistic fuzzy generators produce different A-IFSs, and therefore alternative representations of the image in the IFD are possible. Consequently, an optimization criterion should be employed to select, from the multitude of possible representations, the parameter set that yields the A-IFS that optimally models the hesitancy of pixels. Such a criterion, which also encapsulates the image statistics, is the intuitionistic fuzzy entropy (Burillo & Bustince, 1996; Szmidt & Kacprzyk, 2001) of the image under consideration. Therefore, the set of parameters that produces the A-IFS with the maximum intuitionistic fuzzy entropy is considered optimal. We refer to this selection process as the maximum intuitionistic fuzzy entropy principle (Vlachos & Sergiadis, 2007 d). The optimal parameter set is then used to construct membership and non-membership functions corresponding to the intuitionistic fuzzy components of the image in the IFD. This procedure is schematically illustrated in Figure 3.
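The maximum intuitionistic fuzzy entropy principle can be sketched as a one-dimensional parameter search. The snippet below is only an illustration under stated assumptions: the Sugeno-type generator, the ratio-type entropy in the style of Szmidt & Kacprzyk (2001), the grid of candidate values and all function names are choices made here for the example, and are not claimed to be the exact generators or entropy used in the cited works.

```python
def sugeno_type_generator(mu, lam):
    """Assumed generator: nu = (1 - mu) / (1 + lam * mu), with lam >= 0.

    For lam >= 0 this gives nu <= 1 - mu, so mu + nu <= 1 as an A-IFS requires.
    """
    return (1.0 - mu) / (1.0 + lam * mu)


def intuitionistic_fuzzy_entropy(mus, nus):
    """Ratio-type entropy averaged over all pixels (assumed form)."""
    total = 0.0
    for mu, nu in zip(mus, nus):
        pi = max(0.0, 1.0 - mu - nu)   # hesitancy, clamped against round-off
        total += (min(mu, nu) + pi) / (max(mu, nu) + pi)
    return total / len(mus)


def select_optimal_parameter(mus, candidates):
    """Grid search for the generator parameter with maximum entropy."""
    best_lam, best_entropy = None, -1.0
    for lam in candidates:
        nus = [sugeno_type_generator(mu, lam) for mu in mus]
        entropy = intuitionistic_fuzzy_entropy(mus, nus)
        if entropy > best_entropy:
            best_lam, best_entropy = lam, entropy
    return best_lam, best_entropy


# Hypothetical memberships produced by the fuzzification stage
mus = [0.0, 0.2, 0.4, 0.9]
candidates = [0.0, 0.5, 1.0, 2.0, 5.0]
lam_opt, e_opt = select_optimal_parameter(mus, candidates)
nus_opt = [sugeno_type_generator(mu, lam_opt) for mu in mus]
```

A grid search is used only for readability; any scalar optimizer could be substituted, and the image histogram could weight the per-level terms instead of looping over pixels.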

Modification of Intuitionistic Fuzzy Components This stage involves the actual processing of the intuitionistic fuzzy components of the image with respect to a particular property. Depending on the image task to be performed, suitable intuitionistic fuzzy operators are applied to both the membership and the non-membership functions.

Intuitionistic Defuzzification After the modified intuitionistic fuzzy components of the image have been obtained, they must be combined to produce the processed image in the FD. This procedure involves the embedding of hesitancy into the membership function. To carry out this task, we utilize suitable parametric intuitionistic fuzzy operators that de-construct an A-IFS into an FS. It should be stressed that the final result depends strongly on the selected parameters of these operators. Therefore, optimization criteria, such as the maximization of the index of fuzziness of the image, are employed to select the overall optimal parameters with respect to the considered image operation.
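One way to picture the embedding of hesitancy is with an Atanassov-type operator that adds a fraction alpha of the hesitancy to the membership, with alpha chosen to maximize the linear index of fuzziness. This is a hedged sketch: the choice of operator, the linear form of the index, the candidate grid and the function names are assumptions made for illustration.

```python
def embed_hesitancy(mus, nus, alpha):
    """Assumed operator: mu' = mu + alpha * pi, with 0 <= alpha <= 1."""
    return [mu + alpha * (1.0 - mu - nu) for mu, nu in zip(mus, nus)]


def linear_index_of_fuzziness(mus):
    """Linear index of fuzziness: (2 / n) * sum(min(mu, 1 - mu))."""
    return 2.0 * sum(min(mu, 1.0 - mu) for mu in mus) / len(mus)


def select_alpha(mus, nus, candidates):
    """Pick the alpha whose de-constructed FS has maximum fuzziness."""
    return max(
        candidates,
        key=lambda a: linear_index_of_fuzziness(embed_hesitancy(mus, nus, a)),
    )


# Hypothetical modified intuitionistic fuzzy components of three pixels
mus = [0.2, 0.6, 0.1]
nus = [0.5, 0.3, 0.7]
alpha_opt = select_alpha(mus, nus, [0.0, 0.25, 0.5, 0.75, 1.0])
mus_fd = embed_hesitancy(mus, nus, alpha_opt)   # membership function in the FD
```

Because mu + nu <= 1 and alpha lies in [0, 1], the resulting memberships stay within [0, 1], so the output is a valid FS.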

Defuzzification The final stage of the IFIP framework involves the transfer of the processed fuzzy image into the PD. Depending on the desired image operation, various functions may be applied to carry out this task.

Applications The IFIP architecture has been successfully applied to many image processing problems. Vlachos & Sergiadis (2007 d) exploited the potential of the framework to perform contrast enhancement of low-contrast images. Different approaches were introduced, namely the intuitionistic fuzzy contrast intensification and the intuitionistic fuzzy histogram hyperbolization (IFHH). An extension of the IFHH technique to color images was proposed in Vlachos & Sergiadis (2007 b). Additionally, the effects of employing different intuitionistic fuzzification and intuitionistic defuzzification schemes on the performance of contrast enhancement algorithms were thoroughly investigated in Vlachos & Sergiadis (2007), (2007 d) and (2006 b), respectively. Application of A-IFSs theory to edge detection, based on intuitionistic fuzzy similarity measures, was also demonstrated in Vlachos & Sergiadis (2007 d). The problem of image thresholding and segmentation in the context of IFIP was also addressed (Vlachos & Sergiadis, 2006 a) using novel intuitionistic fuzzy information measures. Under the general framework of IFIP, the notions of the intuitionistic fuzzy histograms of a digital image were introduced (Vlachos & Sergiadis, 2007 c) and their application to contrast enhancement was demonstrated (Vlachos & Sergiadis, 2007 a). Finally, the IFIP architecture was successfully applied in mammographic image processing (Vlachos & Sergiadis, 2007 d). Figure 4 illustrates the stages of IFIP in the case of the IFHH approach.

Figure 3. The process of intuitionistic fuzzification

Figure 4. The stages of the IFIP framework for contrast enhancement

FUTURE TRENDS Even though higher-order FSs have been widely applied to decision-making and pattern recognition problems, their application in the field of digital image processing is just beginning to develop. As a newly introduced approach, the IFIP architecture remains a suggestive and challenging open field for future research. Therefore, it is expected that the IFIP framework will attract the interest of theoreticians and practitioners in the near future. The proposed IFIP context bases its efficiency on the ability of A-IFSs to capture and render the hesitancy associated with image properties. Consequently, the analysis and synthesis of images in terms of elements of A-IFSs theory plays a key role in the performance of the framework itself. Therefore, the stages of intuitionistic fuzzification and defuzzification need to be further studied from an application point of view, to provide meaningful ways of extracting hesitancy from, and embedding it into, images. Finally, the IFIP architecture should be extended to the image processing tasks handled today by FS theory, in order to investigate and evaluate its advantages and particular merits.

CONCLUSION This article describes an intuitionistic fuzzy architecture for the processing of digital images. The IFIP framework exploits the potential of A-IFSs to efficiently model the uncertainties associated with image pixels, as well as with the definitions of their properties. The proposed methodology provides alternative approaches for analyzing/synthesizing images to/from their intuitionistic fuzzy components. Application of the IFIP framework to diverse imaging domains demonstrates its efficiency compared to traditional image processing techniques. It is expected that the proposed context will provide theoreticians and practitioners with an alternative and challenging way to perceive and deal with real-world image processing problems.

REFERENCES

Atanassov, K.T. (1986). Intuitionistic Fuzzy Sets. Fuzzy Sets and Systems, 20 (1), 87-96.

Atanassov, K.T., Koshelev, M., Kreinovich, V., Rachamreddy, B., & Yasemis, H. (1995). Fundamental Justification of Intuitionistic Fuzzy Logic and of Interval-Valued Fuzzy Methods. Notes on Intuitionistic Fuzzy Sets, 4 (2), 42-46.

Bezdek, J.C., Keller, J., Krishnapuram, R., & Pal, N.R. (1999). Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. Springer.

Burillo, P., & Bustince, H. (1996). Entropy on Intuitionistic Fuzzy Sets and on Interval-Valued Fuzzy Sets. Fuzzy Sets and Systems, 78 (3), 305-316.

Bustince, H., Kacprzyk, J., & Mohedano, V. (2000). Intuitionistic Fuzzy Generators: Application to Intuitionistic Fuzzy Complementation. Fuzzy Sets and Systems, 114 (3), 485-504.

Chi, Z., Yan, H., & Pham, T. (1996). Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition. World Scientific Publishing Company.

Mendel, J.M., & Bob John, R.I. (2002). Type-2 Fuzzy Sets Made Simple. IEEE Transactions on Fuzzy Systems, 10 (2), 117-127.

Pal, S.K., & King, R.A. (1980). Image Enhancement Using Fuzzy Set. Electronics Letters, 16 (10), 376-378.

Pal, S.K., & King, R.A. (1981). Image Enhancement Using Smoothing with Fuzzy Sets. IEEE Transactions on Systems, Man, and Cybernetics, 11 (7), 495-501.

Pal, S.K., & King, R.A. (1982). A Note on the Quantitative Measurement of Image Enhancement Through Fuzziness. IEEE Transactions on Pattern Analysis and Machine Intelligence, 4 (2), 204-208.

Prewitt, J.M.S. (1970). Object Enhancement and Extraction. In Lipkin, B.S., & Rosenfeld, A. (Eds.), Picture Processing and Psycho-Pictorics (pp. 75-149). Academic Press, New York.

Szmidt, E., & Kacprzyk, J. (2001). Entropy for Intuitionistic Fuzzy Sets. Fuzzy Sets and Systems, 118 (3), 467-477.

Vlachos, I.K., & Sergiadis, G.D. (2005). Towards Intuitionistic Fuzzy Image Processing. Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation (CIMCA 2005), Vienna, Austria.

Vlachos, I.K., & Sergiadis, G.D. (2006). A Heuristic Approach to Intuitionistic Fuzzification of Color Images. Proceedings of the 7th International FLINS Conference on Applied Artificial Intelligence (FLINS 2006), Genova, Italy.

Vlachos, I.K., & Sergiadis, G.D. (2006 a). Intuitionistic Fuzzy Information - Applications to Pattern Recognition. Pattern Recognition Letters, 28 (2), 197-206.

Vlachos, I.K., & Sergiadis, G.D. (2006 b). On the Intuitionistic Defuzzification of Digital Images for Contrast Enhancement. Proceedings of the 7th International FLINS Conference on Applied Artificial Intelligence (FLINS 2006), Genova, Italy.

Vlachos, I.K., & Sergiadis, G.D. (2007). A Two-Dimensional Entropic Approach to Intuitionistic Fuzzy Contrast Enhancement. Proceedings of the International Workshop on Fuzzy Logic and Applications (WILF 2007), Genova, Italy.

Vlachos, I.K., & Sergiadis, G.D. (2007 a). Hesitancy Histogram Equalization. Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2007), London, United Kingdom.

Vlachos, I.K., & Sergiadis, G.D. (2007 b). Intuitionistic Fuzzy Histogram Hyperbolization for Color Images. Proceedings of the International Workshop on Fuzzy Logic and Applications (WILF 2007), Genova, Italy.

Vlachos, I.K., & Sergiadis, G.D. (2007 c). Intuitionistic Fuzzy Histograms of an Image. Proceedings of the World Congress of the International Fuzzy Systems Association (IFSA 2007), Cancun, Mexico.

Vlachos, I.K., & Sergiadis, G.D. (2007 d). Intuitionistic Fuzzy Image Processing. In Nachtegael, M., Van der Weken, D., Kerre, E.E., & Philips, W. (Eds.), Soft Computing in Image Processing: Recent Advances (pp. 385-416). Series: Studies in Fuzziness and Soft Computing, 210, Springer.

Vlachos, I.K., & Sergiadis, G.D. (2007 e). The Role of Entropy in Intuitionistic Fuzzy Contrast Enhancement. Proceedings of the World Congress of the International Fuzzy Systems Association (IFSA 2007), Cancun, Mexico.

KEY TERMS

Crisp Set: A set defined using a characteristic function that assigns a value of either 0 or 1 to each element of the universe, thereby discriminating between members and non-members of the crisp set under consideration. In the context of fuzzy sets theory, we often refer to crisp sets as “classical” or “ordinary” sets.

Defuzzification: The inverse process of fuzzification. It refers to the transformation of fuzzy sets into crisp numbers.

Fuzzification: The process of transforming crisp values into grades of membership corresponding to fuzzy sets expressing linguistic terms.

Fuzzy Logic: Fuzzy logic is an extension of traditional Boolean logic. It is derived from fuzzy set theory and deals with concepts of partial truth and reasoning that is approximate rather than precise.

Fuzzy Set: A generalization of the definition of the classical set. A fuzzy set is characterized by a membership function, which maps the members of the universe into the unit interval, thus assigning to elements of the universe degrees of belongingness with respect to a set.

Image Processing: Image processing encompasses any form of information processing for which the input is an image and the output an image or a corresponding set of features.

Intuitionistic Fuzzy Index: Also referred to as “hesitancy margin” or “indeterminacy index”. It represents the degree of indeterminacy regarding the assignment of an element of the universe to a particular set. It is calculated as the difference between unity and the sum of the corresponding membership and non-membership values.

Intuitionistic Fuzzy Set: An extension of the fuzzy set. It is defined using two characteristic functions, the membership and the non-membership, that do not necessarily sum up to unity. They attribute to each individual of the universe corresponding degrees of belongingness and non-belongingness with respect to the set under consideration.

Membership Function: The membership function of a fuzzy set is a generalization of the characteristic function of crisp sets. In fuzzy logic, it represents the degree of truth as an extension of valuation.

Non-Membership Function: In the context of Atanassov’s intuitionistic fuzzy sets, it represents the degree to which an element of the universe does not belong to a set.
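The verbal definition of the intuitionistic fuzzy index given above corresponds to the standard relation for an A-IFS A over a universe X (Atanassov, 1986). The symbols and the numerical values in the comment are illustrative only:

```latex
\pi_A(x) = 1 - \mu_A(x) - \nu_A(x), \qquad 0 \le \mu_A(x) + \nu_A(x) \le 1, \quad x \in X
% Illustrative values: \mu_A(x) = 0.6,\ \nu_A(x) = 0.3 \;\Rightarrow\; \pi_A(x) = 0.1
```

For an ordinary fuzzy set, where the non-membership is the complement of the membership, the index is zero for every element.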


Knowledge Management Systems Procedural Development Javier Andrade University of A Coruña, Spain Santiago Rodríguez University of A Coruña, Spain María Seoane University of A Coruña, Spain Sonia Suárez University of A Coruña, Spain

INTRODUCTION The success of organisations is increasingly dependent on the knowledge they have, to the detriment of other traditionally decisive factors such as labour or capital (Tissen, 2000). This situation has led organisations to pay special attention to this new intangible asset, so numerous efforts are being made to preserve and institutionalise it. Knowledge Management (KM) is a recent discipline that responds to this increasing interest; however, despite its importance, the discipline is currently at an immature stage, as none of the many existing proposals for the development of Knowledge Management Systems (KMS) achieves enough detail to perform such a complex task. In order to alleviate this situation, this work presents a methodological framework for the explicit management of knowledge. The study has a formal basis that provides an increased level of detail, as all the conceptual elements needed for understanding and representing the knowledge of any domain are identified. The required descriptive character is achieved by basing the process on these elements, so that the development of the systems can be guided more effectively.

BACKGROUND In recent years numerous methodological frameworks for the development of KMS have arisen, the most important of which are those of Junnarkar (1997), Wiig et al. (1997), Daniel et al. (1997), Holsapple and Joshi (1997), Liebowitz and Beckman (Liebowitz, 1998; Beckman, 1997), Staab and Schnurr (1999), Tiwana (2000) and Maté et al. (2002). Nevertheless, the existing proposals do not adequately satisfy the knowledge needs of organisations (Rubenstein-Montano, 2001; Andrade, 2003) due to their immaturity, which stems mainly from the following aspects:

1. The research efforts have been mainly focused on the definition of a process for KMS development, ignoring the study of the object to be managed: the knowledge.
2. The definition of such a process has in most cases overlooked the human factor and has been restricted to the technological viewpoint of KM.

The first aspect concerns the necessary study of knowledge as the basis for defining the structure of the Corporate Memory; this study should identify (i) the types of knowledge that have to be included in that repository and (ii) their descriptive properties, so that the Corporate Memory includes all the features of the knowledge items that it stores. The definition of that structure would also enable the definition of a descriptive process for creating KMS by using the different characteristics and types of knowledge. However, despite the influence that the object to be managed has on the management process, only the Wiig (1997) proposal pays attention to its study. That proposal identifies a small set of descriptors that support the formalisation (making explicit) of the knowledge, although (i) their identification does not result from an exhaustive study and (ii) they do not enable a complete formalisation, as they are restricted to some generic properties. The second aspect suggests that the whole process for KMS development should consider the technological as well as the human vision. The first one is focused on how to obtain, store and share the relevant knowledge that exists within an organisation, by creating the Corporate Memory and the computer support system. The second vision involves not only the creation of a collaborative atmosphere within the organisation in order to achieve the involvement of the workers in the KM program, but also their willingness to share their knowledge and to use that already provided by other members. Despite this, the vast majority of the analysed approaches are solely focused on the technological KM viewpoint, which jeopardises the success of a KMS (Andrade, 2003). In fact, among the previously mentioned proposals, only the Tiwana (2000) proposal explicitly considers the human viewpoint by including a specific phase for it. As a result of both aspects, the current proposals are restricted to a set of generic guides for performing KM, which is quite different from the formal and detailed vision that is being demanded. In other words, the current approaches indicate what to do but not how to do it (prescriptive viewpoint against descriptive/procedural viewpoint). In this scenario the developers of this type of system have to elaborate their own ad hoc approach, achieving results that depend only on the experience and the capabilities of the development team.

DEVELOPMENT FOR KNOWLEDGE MANAGEMENT SYSTEMS This section presents a methodological framework for explicit KM that addresses the previously mentioned problems. A study of the object to be managed has been performed in order to obtain a knowledge formalisation schema, i.e., to determine the relevant knowledge items and the characteristics/properties that should be made explicit. Using the results of this study, a methodological framework for KMS creation has been defined. Both aspects are discussed below.

Proposed Formalisation Schema Natural language is the language par excellence for sharing knowledge. Because of this, all the elements necessary for conceptualising (understanding) the knowledge of any domain (and therefore those for which the respective formalisation mechanisms must be provided) can be identified by analysing the different grammatical categories of natural language: nouns, adjectives, verbs, adverbs, locutions and other linguistic expressions. This study, whose detailed description and applications have been reported in several works (Andrade, 2006; Andrade, 2008), reveals that all the identified conceptual elements can be placed in the following knowledge levels according to their function within the domain:

• Static. It concerns the structural or operative knowledge of the domain, meaning domain facts that are true and that can be used in some operations, such as concepts, properties, relationships and constraints.
• Dynamic. It is related to the behaviour of the domain, that is, functionality, action, process or control: inferences, calculations and step sequences. This level can be divided into two sublevels:
  • Strategic. It includes what to do, when and in what order (i.e., step factorisation).
  • Tactical. It specifies how and when to obtain new operative knowledge (i.e., the description of a given step).

Each of these levels addresses a different fragment of the organisation's knowledge, although they are all obviously interrelated. In fact, the strategic level controls the tactical one, since for every last-level (elemental) step (strategic knowledge) the inferences and calculations must be indicated (tactical knowledge). The operative knowledge level is likewise controlled by the other two, as it determines not only how the bifurcation points or execution alternatives are decided (strategic knowledge), but also how inferences and calculations are carried out (tactical knowledge). Therefore, a KMS must provide support for all these levels. As can be observed in Table 1, the main formalisation schema has been divided, on the one hand, into several individual schemas corresponding to each of the identified knowledge levels and, on the other, into a common one for the three levels, providing the global vision of the organisation's knowledge. Therefore, the knowledge formalisation involves a dynamic schema including the strategic and tactical individual schemas, a static schema including an operative schema, and a common schema for describing the common aspects regardless of the level.

Table 1. Components defined for every identified schema

Schemas                Components
Common                 Catalogue of terms
Dynamic / Strategic    Catalogue of non terminal steps
                       Catalogue of terminal steps
Dynamic / Tactical     Catalogue of tactical steps
Static / Operative     Catalogue of concepts
                       Catalogue of relationships
                       Catalogue of properties

Every individual schema is in turn made up of several components. The catalogue of terms is a component common to all the schemas, providing synonyms and abbreviations for identifying every knowledge asset within the organisation. The strategic schema describes the functional splitting of every KMS operation and also each identified step. As the description varies depending on whether or not the step is terminal (an elemental step), two different components are needed to include all the characteristics of this level. The approach (procedural or algorithmic, for instance) should be described in detail for every asset included in the catalogue of terminal steps. All this information is included in the catalogue of tactical steps. Lastly, the static schema is made up of the catalogue of concepts (including the identified concepts and their description), the catalogue of relationships (describing the identified relationships and their meaning) and the catalogue of properties (referring to the properties of the previously mentioned concepts and relationships). The detailed description of this study, together with the descriptors of every component, can be found in (Andrade, 2008).
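The catalogue structure of Table 1 can be pictured as a small data model. The sketch below is hypothetical: the class names, the field names and the sample operation are assumptions introduced for illustration, and the actual descriptors of each component are those given in (Andrade, 2008), which are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Entry of the common catalogue of terms (illustrative fields)."""
    name: str
    knowledge_level: str                      # "strategic", "tactical" or "operative"
    synonyms: list = field(default_factory=list)
    abbreviations: list = field(default_factory=list)


@dataclass
class Step:
    """Strategic step; a step without substeps is terminal (elemental)."""
    name: str
    substeps: list = field(default_factory=list)

    @property
    def is_terminal(self):
        return not self.substeps


@dataclass
class FormalisationSchema:
    terms: dict = field(default_factory=dict)               # common schema
    non_terminal_steps: dict = field(default_factory=dict)  # dynamic / strategic
    terminal_steps: dict = field(default_factory=dict)      # dynamic / strategic
    tactical_steps: dict = field(default_factory=dict)      # dynamic / tactical
    concepts: dict = field(default_factory=dict)            # static / operative
    relationships: dict = field(default_factory=dict)       # static / operative
    properties: dict = field(default_factory=dict)          # static / operative

    def add_step(self, step):
        """File a step, and recursively its substeps, in the right catalogue."""
        catalogue = self.terminal_steps if step.is_terminal else self.non_terminal_steps
        catalogue[step.name] = step
        # Every knowledge asset is also registered in the catalogue of terms.
        self.terms.setdefault(step.name, Term(step.name, "strategic"))
        for sub in step.substeps:
            self.add_step(sub)


# Hypothetical operation split into two elemental steps
schema = FormalisationSchema()
schema.add_step(Step("Handle claim", [Step("Register claim"), Step("Assess claim")]))
```

Keeping one dictionary per catalogue mirrors the table row by row, which makes later consistency checks between catalogues straightforward.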

PROPOSED METHODOLOGICAL FRAMEWORK The proposed process, whose basic structure is shown in Figure 1, has been elaborated bearing in mind the problems detected in the KM discipline and already mentioned throughout the present work.

Figure 1. Structure of the proposed process

As shown in Figure 1, this process includes the following phases:

1. Setting-up of the KM commission. The management defines a KM commission for tracking and performing the KM project.

2. Scope identification. The problem to be approached is specified by determining where the present cycle of the KM project should focus. In order to achieve this, the framework proposes the use of SWOT analysis (Strengths, Weaknesses, Opportunities, Threats), together with the proposal of Zack (1999).

3. Knowledge acquisition, including:

3.1. Identification of knowledge domains. The knowledge needs regarding the approached subject area are determined by means of different meetings involving the development team, the KM commission and the people responsible for every operation to be performed.

3.2. Capture of the relevant knowledge. The capture of all the possible knowledge related to the operation approached is based on the identified domains. This is done by means of:

(a) Identifying where the knowledge lies. The KM commission is in charge of identifying and providing the human and non-human knowledge sources that are going to be analysed.
(b) Determining the knowledge that has to be captured. As in the previous section, the strategic, tactical and operative knowledge should be borne in mind.
(c) Obtaining the knowledge. Obviously, when some of the knowledge that is needed does not exist in the organisation, it should be generated or imported.

4. Knowledge assimilation, comprising:

4.1. Knowledge conceptualisation. Its goal is the comprehension of the captured knowledge. It is recommended to start with the strategic knowledge and subsequently focus on the tactical knowledge. As the strategic and tactical elements are understood, the elements of the operative level that arise have to be assimilated as well.

4.2. Knowledge representation. The relevant knowledge has to be made explicit and formalised, according to the components (Andrade, 2008) summarised in Table 1. This is one of the main distinguishing points of the proposal presented here, as the proposed formalisation schema indicates the specific descriptors needed for a correct and complete formalisation of the knowledge.

5. Knowledge consolidation, including:

5.1. Knowledge verification. In order to detect failures and omissions related to the represented knowledge, the following should be considered:

(a) Generic aspects. It has to be checked that every knowledge element (strategic, tactical and operative) is included in the catalogue of terms, that every term included there has been made explicit according to its type of knowledge, and that all the fields are completed.
(b) Strategic aspects. It should be verified that (i) every decision regarding an execution is made according to the existing operative knowledge, (ii) every last-level step is associated with existing tactical knowledge and (iii) every non terminal step is correctly split. All of this is achieved by checking the agreement between the splitting tree and the content of the formalisation schema of the terminal strategic knowledge.
(c) Tactical aspects. It should be verified that (i) the whole of the tactical knowledge is used in some of the last-level steps of the strategic knowledge and (ii) any operative knowledge related to the tactical knowledge is available. In order to achieve this, the operative knowledge items are represented as nodes within a knowledge map. This type of map enables the graphic visualisation of how new elements are obtained from the existing ones. Once the map has been built, it should be traversed to check that the whole of the operative knowledge has been included.
(d) Operative aspects. It should be confirmed that (i) there are no isolated concepts, (ii) there are no attributes unrelated to a concept or to a relationship, (iii) there are no relationships associating non-existing concepts or relationships and (iv) the whole of the operative knowledge is used in some of the tactical knowledge and/or in the decision making of the flow control of the strategic knowledge. In order to perform the first three verifications, a relationship diagram is elaborated to show graphically the existing relationships among the different elements of the operative knowledge. The syntax of this type of diagram is analogous to that of the class diagrams used in object-oriented software development methodologies. The last verification is done by using a knowledge map; the execution structures included in the content of the formalisation schema for the last-level strategic knowledge related to every process (the remaining lower levels are included in the higher one) are also used in this verification. With these two graphic representations it can be verified that every operative element is included in at least one of the representations.

5.2. Knowledge validation. In order to validate the knowledge represented and verified, the development team, the KM commission and the involved parties will review:

(a) The knowledge splitting tree
(b) The knowledge map
(c) The relationship diagram
(d) The functional splitting tree
(e) The content of the formalisation schema

6. Creation of the support system, which is divided into:

6.1. Definition of the incorporation mechanisms. The KM commission and the development team determine the adequacy of the incorporation type (passive, active or a combination of both) according to criteria such as financial considerations or the stored knowledge.

6.2. Definition of the notification mechanisms. The KM commission and the development team establish the most suitable method for notifying users of newly included knowledge. The notification can be passive or active; even the absence of notification could be considered.

6.3. Definition of the mechanisms for knowledge localisation. Several alternatives, such as the need to include intelligent searches or meta-searches, are evaluated.

6.4. Development of the KM support system. It is necessary to define and implement the corporate memory, the communication mechanisms and the applications for collaboration and team work.

6.5. Population of the corporate memory. Once the KM system has been developed, the knowledge captured, assimilated and consolidated is included in the corporate memory.

7. Creation of the collaboration environment. The main goal of this phase is to promote and improve the contribution of knowledge and its subsequent use by the organisation. The risk involved in an unsuitable organisational culture or in inadequate tools for promotion and reward should be borne in mind. The following strategies should be followed instead:

• Valuing employees according to their knowledge contribution to the organisation
• Supporting and rewarding the use of the existing organisational knowledge
• Promoting relaxed dialogue among employees from different domains
• Promoting a good atmosphere among the employees
• Committing all the employees
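The first three operative checks of step 5.1(d) lend themselves to automation once the operative catalogues are available in machine-readable form. The following Python sketch is an assumption-laden illustration: the data layout (a set of concept names, a mapping from each relationship to the elements it associates, and a mapping from each property to its owner), the function name and the sample domain are all hypothetical. Check (iv) is left out because it also requires the tactical and strategic catalogues.

```python
def verify_operative_knowledge(concepts, relationships, properties):
    """Return a list of human-readable issues found in the operative catalogues.

    concepts:      set of concept names
    relationships: dict mapping relationship name -> tuple of associated elements
    properties:    dict mapping property name -> owning concept or relationship
    """
    issues = []
    known = set(concepts) | set(relationships)
    linked = set()

    # (iii) relationships must associate existing concepts or relationships
    for rel, ends in relationships.items():
        for end in ends:
            if end not in known:
                issues.append(f"relationship '{rel}' refers to unknown element '{end}'")
            linked.add(end)

    # (i) every concept must take part in at least one relationship
    for concept in sorted(concepts):
        if concept not in linked:
            issues.append(f"concept '{concept}' is isolated")

    # (ii) every property must belong to a concept or a relationship
    for prop, owner in properties.items():
        if owner not in known:
            issues.append(f"property '{prop}' is attached to unknown element '{owner}'")

    return issues


# Hypothetical operative knowledge with one defect of each kind
concepts = {"Customer", "Order", "Invoice"}
relationships = {"places": ("Customer", "Order"), "ships_to": ("Order", "Warehouse")}
properties = {"total": "Order", "colour": "Product"}
issues = verify_operative_knowledge(concepts, relationships, properties)
```

In this example the function reports the dangling "Warehouse" reference, the isolated "Invoice" concept and the orphan "colour" property, which are the kinds of defects the relationship diagram is meant to expose visually.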

FUTURE TRENDS As has been indicated, the KM discipline remains at an immature stage due to an inadequate viewpoint: the absence of a rigorous study for determining the relevant knowledge and the characteristics that should be supported. This situation has led to a significant lack of detail in the existing proposals for KMS development, which currently depends solely on the individual good work of the developers. The present proposal offers a new viewpoint for developing this type of system. However, much remains to be done. As the authors are aware of the high degree of bureaucracy that might be needed to follow the present proposal strictly, it should be streamlined and tailored to specific domains. Nevertheless, this viewpoint could be considered the key to achieving specific ontologies for KM in every domain.

CONCLUSION This article has presented a methodological framework for the development of KMS that, unlike the existing proposals, is based on a rigorous study of the knowledge to be managed. This characteristic has provided the framework with a higher procedural level of detail than the current proposals, as the elements conceptually needed for understanding and representing the knowledge of any domain have been identified and formalised.

REFERENCES

Andrade, J., Ares, J., García, R., Rodríguez, S., & Suárez, S. (2003). Lessons Learned for the Knowledge Management Systems Development. Proc. 2003 IEEE International Conference on Information Reuse and Integration, 471-477.

Andrade, J., Ares, J., García, R., Pazos, J., Rodríguez, S., & Silva, S. (2006). Definition of a problem-sensitive conceptual modelling language: foundations and application to software engineering. Information and Software Technology, 48 (7), 517-531.

Andrade, J., Ares, J., García, R., Pazos, J., Rodríguez, S., & Silva, S. (2008). Formal conceptualisation as a basis for a more procedural knowledge management. Decision Support Systems, 45 (1), 164-179.

Beckman, T. (1997). A Methodology for Knowledge Management. International Association of Science and Technology for Development (IASTED) AI and Soft Computing Conference.

Daniel, M., Decker, S., Domanetzki, A., Heimbrodt-Habermann, E., Höhn, F., Hoffmann, A., Röstel, H., Studer, R., & Wegner, R. (1997). ERBUS-Towards a Knowledge Management System for Designers. Proc. of the Knowledge Management Workshop at the 21st Annual German AI Conference.

Holsapple, C., & Joshi, K. (1997). Knowledge Management: A Three-Fold Framework. Kentucky Initiative for Knowledge Management Paper, n. 104.

Junnarkar, B. (1997). Leveraging Collective Intellect by Building Organizational Capabilities. Expert Systems with Applications, 13 (1), 29-40.

Liebowitz, J., & Beckman, T. (1998). Knowledge Organizations. What Every Manager Should Know. CRC Press.

Maté, J.L., Paradela, L.F., Pazos, J., Rodríguez-Patón, A., & Silva, A. (2002). MEGICO: an Intelligent Knowledge Management Methodology. Lecture Notes in Artificial Intelligence, 2473, 102-107.

Rubenstein-Montano, B., Liebowitz, J., Buchwalter, J., McCaw, D., Newman, B., & Rebeck, K. (2001). A Systems Thinking Framework for Knowledge Management. Decision Support Systems, 31, 5-16.

Staab, S., & Schnurr, H.P. (1999). Knowledge and Business Processes: Approaching an Integration. Workshop on Knowledge Management and Organizational Methods. IJCAI99.

Tissen, R., Andriessen, D., & Deprez, F.L. (2000). The Knowledge Dividend. Financial Times Prentice-Hall.

Tiwana, A. (2000). The Knowledge Management Toolkit. Practical Techniques for Building a Knowledge Management System. Prentice-Hall.

Wiig, K., de Hoog, R., & van der Spek, R. (1997). Supporting Knowledge Management: a Selection of Methods and Techniques. Expert Systems with Applications, 13 (1), 15-27.

Zack, M.H. (1999). Developing a Knowledge Strategy. California Management Review, 41 (3), 125-145.

KM Systems

KEY TERMS

Commission of Knowledge Management: Team in charge of the Knowledge Management project.

Corporate Memory: Physical and persistent storage of the knowledge in an organisation. Its structure is determined by the knowledge formalisation schema.

Knowledge: Pragmatic level of information resulting from the combination of the information received with individual experience.

Knowledge Formalisation Schema: Set of attributes for describing and formalising the knowledge.

Knowledge Management: Discipline that tries to provide the right people with adequate information and knowledge, whenever and however they need them. In this way, these people will have all the elements necessary to perform their tasks as well as possible.

Knowledge Management System: System for managing knowledge in organizations, supporting the addition, storage, notification and localization of expertise and knowledge.

Methodological Framework: Approach for making explicit and structuring how a given task is performed.

Knowledge Management Tools and Their Desirable Characteristics

Juan Ares, University of A Coruña, Spain
Rafael García, University of A Coruña, Spain
María Seoane, University of A Coruña, Spain
Sonia Suárez, University of A Coruña, Spain

INTRODUCTION

Knowledge Management (KM) is a recent discipline that arose from the idea of explicitly managing all the knowledge that exists within an organisation (Wiig, 1995) (Wiig et al., 1997). More specifically, KM involves providing the people concerned with the right information and knowledge, at the level most suitable for them, when and how it best suits them; in this way, these people have all the ingredients they need to choose the best option when faced with a specific problem (Rodríguez, 2002). As knowledge, together with the ability to manage it well, has become the key factor that allows organisations to stand out, it is desirable to identify and develop the instruments that support the generation of such value within organisations. This view is shared by several authors, among them (Brooking, 1996) (Davenport & Prusak, 2000) (Huang et al., 1999) (Liebowitz & Beckman, 1998) (Nonaka & Takeuchi, 1995) and (Wiig, 1993). Technological tools should be available to reduce communication distance and to provide a common environment in which knowledge can be stored and shared. As KM is a very recent discipline, few commercial software tools deal with all the aspects it requires. Most of the tools classified as KM-related are mere document management tools, which are unsuitable for correctly managing an organisation's knowledge. Bearing this problem in mind, the present work sets out the basis for a KM support software tool, drawing on the definition of KM itself and on the existing tools. To this end, the Background section presents the market analysis carried out to study the existing KM tools, which examined not only their characteristics but also the future needs of knowledge workers. Following this study, the functionality that a KM support tool should have was identified, together with a proposal for how best to provide it.

BACKGROUND

The first step in developing a complete KM support tool that meets present and future market needs is a study of the existing market. After an initial identification of the characteristics that a KM support tool should have, the study examines how the analysed tools support each of these characteristics. Lastly, the results obtained are evaluated.

Characteristics to be Considered

The definition of KM given above was the basis for identifying the characteristics to be considered, bearing in mind the different aspects that the tool should support. A KM tool should support the following aspects (Andrade et al., 2003a):

• Corporate Memory
• Yellow Pages
• Collaboration and Communication mechanisms

1. Corporate Memory. The Corporate Memory compiles the knowledge that exists within an organisation and places it at its workers' disposal (Stein, 1995) (Van Heijst et al., 1997). Consequently, compiling the relevant knowledge and making it explicit is just as important as providing suitable mechanisms for locating and retrieving it correctly and easily.

2. Yellow Pages. A KM programme should not make the mistake of trying to capture and represent all of the organisation's knowledge, as this would not be feasible; only the knowledge relevant to the organisation's performance should be included. However, the fact that some knowledge is not made explicit does not mean that it should be ignored; for that reason, it is important to determine which knowledge each individual in the organisation holds by compiling the Yellow Pages. These identify and publish the additional knowledge sources, human and non-human, that are at the organisation's disposal (Davenport & Prusak, 2000).

3. Collaboration and Communication Mechanisms. In organisations, knowledge is shared and distributed whether or not the process is automated. A knowledge transfer occurs every time an employee asks a colleague in the next office how to perform a given task. These daily knowledge transfers make up the routine of the organisation but, as they are local and fragmentary, systems for user collaboration and communication should be established. An adequate KM support tool should include mechanisms that ensure efficient collaboration and communication, regardless of where and when the interlocutors take part.

Analysed Tools

Once the aspects that a KM support tool should cover have been identified, the next step is to analyse how current tools address them. For this purpose, the main tools currently marketed as KM support tools were analysed. Tools such as information search engines or simple document management applications were discarded, as they offer only partial solutions.

Table 1. Tools analysed
Tools (rows): K-Factory Norma K-Factor Hyperwave GTC Epicentric Plumtree Intrasuite Coldata Intranets WebSpace Knowledge Discovery System Documentum 5 Livelink (Opentext) Adenin
Aspects (columns): Corporate Memory; Yellow Pages; Collaboration and communication mechanisms


The analysis covered thirteen tools (Table 1), all of which address at least two of the aspects mentioned above. It should be highlighted that all the tools implement the Corporate Memory as a document warehouse, while the Yellow Pages appear as a telephone directory.

Results Evaluation

The analysis of the tools showed that, for every aspect considered, there are some common elements. Bearing in mind these elements and current needs, Table 2 shows the desirable characteristics of a KM support tool. The conclusions drawn from a deeper study of how the analysed tools address these desirable characteristics are presented below. Firstly, none of the tools classified as KM tools has the structure needed to identify, formalise and share the relevant knowledge, as they merely perform document management complemented, at best, by some descriptive fields, an association with a contents tree, or links to other related documents. This creates many problems, in particular, given the large volume of data, the difficulty of selecting the knowledge that the user needs at a given moment. Therefore, as pointed out previously, knowledge must be structured in some way if it is to be put to best use. Communication support is also important. The characteristics of a KM support tool therefore need to be defined, together with a guide for approaching them.

RECOMMENDED FEATURES

Once the functionality that a support tool for the explicit management of corporate knowledge should have has been determined, each of the characteristics identified can be addressed.

1. Corporate Memory: the organisation's knowledge has to be physically stored in a Corporate Memory so that it can be adequately shared. A Corporate Memory is an explicit, independent and persistent representation of knowledge (Stein, 1995) (Van Heijst et al., 1997) that can be regarded as a repository of the knowledge of the individuals who work in a given organisation. The Corporate Memory should include the following aspects:

Table 2. Desirable characteristics of a KM support tool

Aspect: Desirable characteristics
Corporate Memory: Knowledge formalisation; Knowledge incorporation; New knowledge notification; Search
Yellow Pages: Experts search
Collaboration and communication mechanisms: Integration; Workgroup; Workflow; Management of time, tasks and resources
    Asynchronous communication: E-mail; Forum; Suggestion box; Notice board
    Synchronous communication: Chat; Electronic board; Audio-conference; Video-conference


1.1. Knowledge formalisation. Before being included in the Corporate Memory, knowledge has to be formalised by determining not only which knowledge is relevant but also the attributes that describe it. When performing this formalisation, it should be borne in mind that there are two types of knowledge. On the one hand, the Corporate Memory must include the knowledge needed to describe the operations involved in performing an organisational task. On the other hand, it is necessary to capture the knowledge that individuals have acquired through their experience over time. This markedly heuristic knowledge is known as Learned Lessons: positive and negative experiences that can be used to improve the future performance of the organisation (Van Heijst et al., 1997) and thereby refine its current knowledge.

a. Organisational knowledge (Andrade et al., 2003b). A KM system should consider different types of knowledge when structuring the relevant knowledge associated with the organisation's operations:
• Strategic or control knowledge: it indicates not only what to do but also why, where and when. For that reason, the constituents of the functional decomposition of every operation should be identified.
• Tactical knowledge: it specifies how and under what circumstances the tasks are done. This type of knowledge is associated with the execution of every last-level strategic step.

b. Learned lessons. These relate to the experience and knowledge that individuals have of their tasks. They give the person who possesses them the ability to refine both the processes followed at work and the existing knowledge, in order to be more efficient. It is therefore appropriate to create lessons-learned systems (Weber et al., 2001) to preserve this type of knowledge.
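As an illustration of what a knowledge formalisation schema could look like in practice, the following Python sketch describes a knowledge item by a small set of attributes and checks it before incorporation. It is only a sketch under stated assumptions: the attribute names (title, topics, positive) and the validation rules are hypothetical and are not taken from the schema used by the authors.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class KnowledgeType(Enum):
    STRATEGIC = "strategic"    # what to do, and why, where and when
    TACTICAL = "tactical"      # how and under what circumstances
    LESSON_LEARNED = "lesson"  # positive or negative experience


@dataclass
class KnowledgeItem:
    """One formalised entry; the attribute names are illustrative only."""
    title: str
    kind: KnowledgeType
    description: str
    author: str
    topics: List[str] = field(default_factory=list)
    positive: Optional[bool] = None  # only meaningful for lessons learned


def validate(item: KnowledgeItem) -> List[str]:
    """Return the problems that block incorporation (empty list if none)."""
    problems = []
    if not item.title.strip():
        problems.append("title is empty")
    if not item.topics:
        problems.append("at least one topic is needed for later search")
    if item.kind is KnowledgeType.LESSON_LEARNED and item.positive is None:
        problems.append("a lesson must be marked positive or negative")
    return problems
```

A check of this kind could be run by the KM group in active incorporation, or by the contributor in passive incorporation, before the item enters the Corporate Memory.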

1.2. Incorporation mechanisms. Knowledge can be incorporated in an active or a passive way (Andrade et al., 2003c). Active incorporation relies on a KM group in charge of safeguarding the quality of the knowledge to be incorporated. This guarantees the quality of the knowledge included in the Corporate Memory, but it also takes up human resources. In passive incorporation, by contrast, there is no quality evaluation group: the individual who wishes to share knowledge and experience is responsible for checking that the proposal meets the minimum requirements of quality and relevance. The main advantage of this second alternative is that it takes up no additional resources. Bearing these considerations in mind, active knowledge incorporation is preferred whenever possible, as it guarantees the quality and relevance of the knowledge.

1.3. Notification mechanisms. All the members of the organisation should be informed when new knowledge is incorporated, as this enables them to refine their own knowledge. The step prior to notification is defining the group of people that will be informed of the appearance of a new knowledge item. There are two alternatives (García et al., 2003): subscription, where each individual in the organisation can subscribe to specific topics of interest, and spreading, where notification messages reach workers without a prior request. With spreading, the messages can be sent to all the members of the organisation, but this is not advisable, as recipients would be unable to discern which of the many messages received might be of interest to them.
Another spreading possibility relies on an individual or a group in charge of determining the addressees of each message; this option is very convenient for the members of the organisation, but it takes up a large amount of resources, and those responsible need a great deal of information about the interests of every member.

1.4. Localisation mechanisms. The tool should provide search mechanisms so that the greatest possible benefit is obtained from the captured and incorporated knowledge (Tiwana, 2000). A compromise between efficiency and functionality is needed: enough search options should be available without increasing the complexity of the system. For this reason, the following search mechanisms are suggested:
• Hierarchy search: the knowledge is catalogued in a fixed hierarchy, so that the user can follow a series of links to refine the search.
• Attribute search: the user specifies the terms of interest, and the search returns the knowledge elements that contain those terms. This type of search provides more general results than the previous one.

2. Yellow Pages: a KM system should not try to capture and assimilate all of the knowledge that exists in the organisation, as this would not be feasible. Therefore, the Yellow Pages are used to record not only the systems that store knowledge but also the individuals who hold additional knowledge. They are compiled after determining the knowledge possessed by every individual in the organisation or by any non-human agent.

3. Collaboration and communication mechanisms: in organisations, knowledge is shared and distributed whether or not the process is automated. Technology helps the members of the organisation to exchange knowledge and ideas, as it brings the best available knowledge within reach of the individual who requires it. The collaboration and communication mechanisms identified are the following:

3.1. Asynchronous communication. It does not require both ends of the communication to be connected at the same time.
• E-mail. Electronic mail enables the exchange of text and/or any other type of document between two or more users.
• Forum. It consists of a Web page where participants leave questions that do not have to be answered at that very moment. Other participants leave answers which, together with the questions, can be seen by anyone entering the forum at any time.
• Suggestion box. It enables suggestions or comments on any relevant aspect of the organisation to be sent to the appropriate person or department.
• Notice board. It is a common space where the members of the organisation can publish announcements of general interest.

3.2. Synchronous communication. This type of interactive technology is based on real-time communication. Some of the most important systems are the following:
• Chat. It allows several people to communicate through the computer: everyone connected can follow the conversation, express an opinion, contribute ideas, and ask or answer questions whenever they wish.
• Electronic board. It provides the members of the organisation with a shared space where everybody can draw or write, improving the exchange of ideas.
• Audio conference. Two or more users communicate by voice in real time.
• Video conference. Two or more users communicate by image in real time.
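Returning to the Corporate Memory mechanisms recommended in 1.3 and 1.4, the following Python sketch shows how subscription-based notification, hierarchy search and attribute search might fit together. It is a toy in-memory model, not a description of any of the analysed tools: the class name, method names and sample data are all hypothetical, and treating every level of the hierarchy path as a subscribable topic is an assumption made for brevity.

```python
class CorporateMemory:
    """Toy in-memory store; every name here is hypothetical."""

    def __init__(self):
        self.items = []        # (path, title, text) tuples
        self.subscribers = {}  # topic -> set of user names
        self.outbox = []       # (user, title) notifications

    def subscribe(self, user, topic):
        """Subscription: the user asks to be told about one topic."""
        self.subscribers.setdefault(topic, set()).add(user)

    def add(self, path, title, text):
        """Store an item under a hierarchy path and notify subscribers."""
        self.items.append((tuple(path), title, text))
        notified = set()
        for topic in path:  # each level of the path counts as a topic
            notified |= self.subscribers.get(topic, set())
        for user in sorted(notified):
            self.outbox.append((user, title))

    def hierarchy_search(self, prefix):
        """Return the titles filed under one branch of the hierarchy."""
        prefix = tuple(prefix)
        return [t for p, t, _ in self.items if p[:len(prefix)] == prefix]

    def attribute_search(self, *terms):
        """Return the titles whose title or text contains every term."""
        wanted = [term.lower() for term in terms]
        return [t for _, t, x in self.items
                if all(w in (t + " " + x).lower() for w in wanted)]


cm = CorporateMemory()
cm.subscribe("ana", "maintenance")
cm.add(["operations", "maintenance"], "Valve inspection", "Check seals monthly")
cm.add(["operations", "sales"], "Quoting", "Use the standard price list")
print(cm.outbox)                            # [('ana', 'Valve inspection')]
print(cm.hierarchy_search(["operations"]))  # ['Valve inspection', 'Quoting']
print(cm.attribute_search("seals"))         # ['Valve inspection']
```

Only the subscriber to a matching topic is notified, which avoids the message overload that the text associates with spreading to every member of the organisation.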

FUTURE TRENDS

As mentioned above, no current KM tool adequately covers organisational needs. The present work has approached this problem by trying to determine the functionality that any such tool should incorporate. This is a first step that should be complemented by subsequent work, as it is necessary to go deeper and to determine more precisely how to approach and implement the aspects specified.

CONCLUSION

Knowledge is transmitted within organisations whether or not it is managed, but its existence does not imply that it is adequately used. A vast amount of knowledge is extremely difficult to access; such items yield no return and are lost within the organisation. KM represents the effort to capture the collective experience of the organisation and to benefit from it by making it accessible to any of its members. However, no current tool performs this task efficiently: although so-called KM tools exist, they merely store documents, and none of them structures the relevant knowledge so that it can be put to best use. To alleviate these problems, the present work proposes an approach based on market research and on the definition of KM, which indicates how to approach, and defines, the characteristics that a tool should have in order to act as a facilitator of adequate and explicit Knowledge Management.

REFERENCES Andrade, J., Ares, J., García, R., Rodríguez, S., Silva, A., & Suárez, S. (2003a): Knowledge Management Systems Development: a Roadmap. Lecture Notes in Artificial Intelligence, 2775, 1008-1015. Andrade, J.; Ares, J.; García, R.; Rodríguez, S. & Suárez, S. (2003b): Lessons Learned for the Knowledge Management Systems Development. In Proceedings of the 2003 IEEE International Conference on Information Reuse and Integration. Las Vegas (USA). Brooking, A. (1996): Intellectual Capital. Core Asset for the Third Millennium Enterprise. International Thomson Business Press. London (UK). Davenport, T. H. & Prusak, L. (2000): Working Knowledge: How Organizations Manage What They Know. Harvard Business School Press. Boston (USA).

García, R.; Rodríguez, S.; Seoane, M. & Suárez, S. (2003): Approach to the Development of Knowledge Management Systems. In Proceedings of the 10th. International Congress on Computer Science Research. Morelos (Mexico). Huang, K. T.; Lee, Y. W. & Wang, R. Y. (1999): Quality Information and Knowledge. Prentice-Hall PTR. New Jersey (USA). Liebowitz, J. & Beckman, T. (1998): Knowledge Organizations. What Every Manager Should Know. CRC Press. Florida (USA). Nonaka, I. & Takeuchi, H. (1995): The Knowledge Creating Company: How Japanese Companies Create the Dynamics of Innovation. Oxford University Press. New York (USA). Rodríguez, S. (2002): Un marco metodológico para la Knowledge Management y su aplicación a la Ingeniería de Requisitos Orientada a Perspectivas. PhD. Dissertation. School of Computer Science. University of A Coruña (Spain). Stein, E. W. (1995): Organizational Memory: Review of Concepts and Recommendations for Management. International Journal of Information Management. Vol. 15. No. 2. PP: 17-32. Tiwana, A. (2000): The Knowledge Management Toolkit. Prentice Hall. Van Heijst, G.; Van der Spek, R. & Kruizinga, E. (1997): Corporate Memories as a Tool for Knowledge Management. Expert Systems with Applications. Vol. 13. No. 1. PP: 41-54. Weber R., Aha, D. W., & Becerra-Fernandez, I. (2001): Intelligent Lessons Learned Systems. International Journal of Expert Systems Research and Applications, Vol. 20 No.1, PP: 17-34. Wiig, K.; de Hoog, R. & Van der Spek, R. (1997): Supporting Knowledge Management: A Selection of Methods and Techniques. Expert Systems with Applications. Vol. 13. No. 1. PP: 15-27. Wiig, K. (1993): Knowledge Management Foundations: Thinking about thinking. How People and Organizations Create, Represent and Use Knowledge. Schema Press. Texas (USA).



Wiig, K. (1995): Knowledge Management Methods: Practical Approaches to Managing Knowledge. Schema Press, LTD. Texas (USA).

KEY TERMS

Communication & Collaboration Tool: System that enables collaboration and communication among the members of an organisation (e.g. chat applications, whiteboards).

Document Management: The computerised management of electronic as well as paper-based documents.

Institutional Memory: The physical storage of the knowledge entered in an organization.

Knowledge: Pragmatic level of information that provides the capability of dealing with a problem or making a decision.

Knowledge Management: Discipline that intends to provide the right people with accurate information and knowledge, at the most suitable level, whenever they may need it and at their best convenience.

Knowledge Management Tool: Organisational system that connects people with information and communication technologies, with the purpose of improving the sharing and distribution of organisational knowledge.

Lesson Learned: Specific experience, positive or negative, in a certain domain. It is obtained in a practical context and can be used in future activities in similar contexts.

Yellow Page: It stores information about a human or non-human source that has additional and/or specialized knowledge about a particular subject.


Knowledge-Based Systems


Adrian A. Hopgood De Montfort University, UK

INTRODUCTION

The tools of artificial intelligence (AI) can be divided into two broad types: knowledge-based systems (KBSs) and computational intelligence (CI). KBSs use explicit representations of knowledge in the form of words and symbols. This explicit representation makes the knowledge more easily read and understood by a human than the numerically derived implicit models in computational intelligence. KBSs include techniques such as rule-based, model-based, and case-based reasoning. They were among the first forms of investigation into AI and remain a major theme. Early research focused on specialist applications in areas such as chemistry, medicine, and computer hardware. These early successes generated great optimism in AI, but more broad-based representations of human intelligence have remained difficult to achieve (Hopgood, 2003; Hopgood, 2005).

BACKGROUND

The principal difference between a knowledge-based system and a conventional program lies in its structure. In a conventional program, domain knowledge is intimately intertwined with software for controlling the application of that knowledge. In a knowledge-based system, the two roles are explicitly separated. In the simplest case there are two modules: the knowledge module is called the knowledge base and the control module is called the inference engine. Some interface capabilities are also required for a practical system, as shown in Figure 1.

Figure 1. The main components of a knowledge-based system [diagram labels: knowledge base; inference engine; interface to the outside world; humans; hardware; data; other software]

Within the knowledge base, the programmer expresses information about the problem to be solved. Often this information is declarative, i.e. the programmer states some facts, rules, or relationships without having to be concerned with the detail of how and when that information should be applied. These latter details are determined by the inference engine, which uses the knowledge base as a conventional program uses a data file. A KBS is analogous to the human brain, whose control processes are approximately unchanging in their nature, like the inference engine, even though individual behavior is continually modified by new knowledge and experience, like updating the knowledge base. As the knowledge is represented explicitly in the knowledge base, rather than implicitly within the structure of a program, it can be entered and updated with relative ease by domain experts who may not have any programming expertise. A knowledge engineer is someone who provides a bridge between the domain expertise and the computer implementation. The knowledge engineer may make use of meta-knowledge, i.e. knowledge about knowledge, to ensure an efficient implementation.

Traditional knowledge engineering is based on models of human concepts. However, it has recently been argued that animals and pre-linguistic children operate effectively in a complex world without necessarily using concepts. Moss (2007) has demonstrated that agents using non-conceptual reasoning can outperform stimulus–response agents in a grid-world test bed. These results may justify the building of non-conceptual models before moving on to conceptual ones.

TYPES OF KNOWLEDGE-BASED SYSTEM

Expert Systems

Expert systems are a type of knowledge-based system designed to embody expertise in a particular specialized domain such as diagnosing faulty equipment (Yanga, 2005). An expert system is intended to act like a human expert who can be consulted on a range of problems within his or her domain of expertise. Typically, the user of an expert system will enter into a dialogue in which he or she describes the problem, such as the symptoms of a fault, and the expert system offers advice, suggestions, or recommendations. It is often proposed that an expert system must offer certain capabilities that mirror those of a human consultant. In particular, it is often stated that an expert system must be capable of justifying its current line of inquiry and explaining its reasoning in arriving at a conclusion. This functionality can be integrated into the inference engine (Figure 1).

Rule-Based Systems

Rules are one of the most straightforward means of representing knowledge in a KBS. The simplest type of rule is called a production rule and takes the form:

    if <condition> then <conclusion>

An example production rule concerning a boiler system might be:

    /* rule1 */ if valve is open and flow is high then steam is escaping

Part of the attraction of using production rules is that they can often be written in a form that closely resembles natural language, as opposed to a computer language. The facts in a KBS for boiler monitoring might include:

    /* fact1 */ valve is open
    /* fact2 */ flow is high

One or more given facts may satisfy the condition of a rule, resulting in the generation of a new fact, known as a derived fact. For example, by applying rule1 to fact1 and fact2, fact3 can be derived:

    /* fact3 */ steam is escaping

The derived fact may satisfy the condition of another rule, such as:

    /* rule2 */ if steam is escaping or valve is stuck then outlet is blocked

This, in turn, may lead to the generation of a new derived fact or an action. Rule1 and rule2 are interdependent, since the conclusion of one can satisfy the condition of the other. The inter-dependencies amongst the rules define a network, as shown in Figure 2, known as an inference network.

Figure 2. An inference network for a boiler system [diagram labels: blockage; or; steam escaping; valve stuck; and; and; valve open; flow high; pressure high; valve closed]

It is the job of the inference engine to traverse the inference network to reach a conclusion. Two important types of inference engine can be distinguished: forward-chaining and backward-chaining, also known as data-driven and goal-driven, respectively. A KBS working in data-driven mode takes the available information, i.e. the given facts, and generates as many derived facts as it can. In goal-driven mode, evidence is sought to support a particular goal or proposition.

The data-driven (forward chaining) approach might typically be used for problems of interpretation, where the aim is to find out whatever the system can infer about some data. The goal-driven (backward chaining) approach is appropriate when a more tightly focused solution is required, such as the generation of a plan for a particular goal. In the example of a boiler monitoring system, forward chaining would lead to the reporting of any recognised problems. In contrast, backward chaining might be used to diagnose a specific mode of failure by linking a logical sequence of inferences, disregarding unrelated observations.

The rules that make up the inference network in Figure 2 are used to link cause and effect:

    if <cause> then <effect>

Using the inference network, an inference can be drawn that if the valve is open and the flow rate is high (the causes) then steam is escaping (the effect). This is the process of deduction. Many problems, such as diagnosis, involve reasoning in the reverse direction, i.e. the user wants to ascertain a cause, given an effect. This is abduction. Given the observation that steam is escaping, abduction can be used to infer that the valve is open and the flow rate is high. However, this is only a valid conclusion if the inference network shows all of the circumstances in which steam may escape. This is the closed-world assumption. If many examples of cause and effect are available, the rule (or inference network) that links them can be inferred. For instance, if every boiler blockage ever seen was accompanied by steam escaping and a stuck valve, then rule2 above might be inferred from those examples. Inferring a rule from a set of example cases of cause and effect is termed induction.

Hopgood (2001) summarizes deduction, abduction, and induction as follows:

• deduction: cause + rule ⇒ effect
• abduction: effect + rule ⇒ cause
• induction: cause + effect ⇒ rule
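The data-driven and goal-driven modes described in this section can be sketched in a few lines of Python. The sketch below is illustrative only: it encodes rule1 and rule2 from the boiler example as plain data, and the tuple layout and the function names (forward_chain, backward_chain) are assumptions made for this example rather than part of any particular KBS toolkit. Note that the rules are held as data, separate from the functions that apply them, mirroring the separation of knowledge base and inference engine.

```python
# Each rule: (name, connective, conditions, conclusion), as in rule1 and rule2.
RULES = [
    ("rule1", "and", ["valve is open", "flow is high"], "steam is escaping"),
    ("rule2", "or", ["steam is escaping", "valve is stuck"], "outlet is blocked"),
]


def satisfied(connective, results):
    """Combine truth values using the rule's connective."""
    return all(results) if connective == "and" else any(results)


def forward_chain(given_facts, rules=RULES):
    """Data-driven mode: derive every fact that the rules allow."""
    facts = set(given_facts)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for _name, connective, conditions, conclusion in rules:
            if conclusion in facts:
                continue
            if satisfied(connective, [c in facts for c in conditions]):
                facts.add(conclusion)
                changed = True
    return facts


def backward_chain(goal, given_facts, rules=RULES, _seen=frozenset()):
    """Goal-driven mode: seek only the evidence that supports `goal`."""
    if goal in given_facts:
        return True
    if goal in _seen:  # guard against circular rule sets
        return False
    for _name, connective, conditions, conclusion in rules:
        if conclusion == goal:
            results = [backward_chain(c, given_facts, rules, _seen | {goal})
                       for c in conditions]
            if satisfied(connective, results):
                return True
    return False


given = {"valve is open", "flow is high"}
print(sorted(forward_chain(given) - given))
# ['outlet is blocked', 'steam is escaping']
print(backward_chain("outlet is blocked", given))             # True
print(backward_chain("outlet is blocked", {"flow is high"}))  # False
```

Forward chaining reports everything that follows from the given facts, whereas backward chaining explores only the part of the inference network that bears on the chosen goal.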

Logic Programming Logic programming describes the use of logic to establish the truth, or otherwise, of a proposition. It is, therefore, an underlying principle for rule-based systems. Although various forms of logic programming have been explored, the most commonly used one is the Prolog language (Bramer, 2005), which embodies the features of backward chaining, pattern matching, and list manipulation. The Prolog language can be programmed declaratively, although an appreciation of the procedural behavior of the language is needed in order to program it effectively. Prolog is suited to symbolic problems, particularly logical problems involving relationships between items. It is also suitable for tasks that involve data lookup and retrieval, as pattern-matching is fundamental to the functionality of the language.
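To give a feel for the pattern matching on which Prolog relies, the following Python sketch matches a pattern containing variables against a nested list. It is a simplified, one-way form of matching (the data must contain no variables), not full Prolog unification, and the names Var, ANY and match are hypothetical helpers introduced only for this illustration.

```python
class Var:
    """A named pattern variable, playing the role of X or Y in Prolog."""

    def __init__(self, name):
        self.name = name


ANY = object()  # plays the role of Prolog's anonymous variable "_"


def match(pattern, data, bindings=None):
    """Match `pattern` against variable-free `data`.

    Returns a dict of variable bindings on success, or None on failure.
    """
    bindings = {} if bindings is None else bindings
    if pattern is ANY:
        return bindings
    if isinstance(pattern, Var):
        if pattern.name in bindings:  # already bound: values must agree
            return bindings if bindings[pattern.name] == data else None
        return {**bindings, pattern.name: data}
    if isinstance(pattern, list):
        if not isinstance(data, list) or len(pattern) != len(data):
            return None
        for p, d in zip(pattern, data):
            bindings = match(p, d, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == data else None


data = ["animal", ["cat", "dog"], "vegetable", "mineral"]
pattern = ["animal", [ANY, Var("X")], "vegetable", Var("Y")]
print(match(pattern, data))  # {'X': 'dog', 'Y': 'mineral'}
```

Because a variable that is already bound must match the same value again, a pattern such as [X, X] succeeds against [a, a] but fails against [a, b], which is the behaviour that makes pattern matching useful for data lookup and retrieval.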

Symbolic Computation A knowledge base may contain a mixture of numbers, letters, words, punctuation, and complete sentences. These symbols need to be recognised and processed by the inference engine. Lists are a particularly useful 991


data structure for symbolic computation, and they are integral to the AI languages Lisp and Prolog. Lists allow words, numbers, and symbols to be combined in a wide variety of ways. A list in the Prolog language might look like this:

[animal, [cat, dog], vegetable, mineral]

where this example includes a nested list, i.e. a list within a list. In order to process lists or similar structures, the technique of pattern matching is used. For example, the above list in Prolog could match to the list

[animal, [_, X], vegetable, Y]

where the variables X and Y would be assigned values of dog and mineral respectively. This pattern matching capability is the basis of an inference engine’s ability to process rules, facts and evolving knowledge.

Uncertainty The examples considered so far have all dealt with unambiguous facts and rules, leading to clear conclusions. In real life, the situation can be complicated by three forms of uncertainty:

Uncertainty in the Rule Itself For example, rule1 (above) stated that an open valve and high flow rate lead to an escape of steam. However, if the boiler has entered an unforeseen mode, it may be that these conditions do not lead to an escape of steam. The rule ought really to state that an open valve and high flow rate will probably lead to an escape of steam.

Uncertainty in the Evidence There are two possible reasons why the evidence upon which the rule is based may be uncertain. First, the evidence may come from a source that is not totally reliable. For example, in rule1 there may be an element of doubt whether the flow rate is high, as this information relies upon a meter of unspecified reliability. Second, the evidence itself may have been derived by a rule whose conclusion was probable rather than certain.

Use of Vague Language Rule1, above, is based around the notion of a “high” flow rate. There is uncertainty over whether “high” means a flow rate of the order of 1 cm³ s⁻¹ or 1 m³ s⁻¹.

Two popular techniques for handling the first two sources of uncertainty are Bayesian updating and certainty theory (Hopgood, 2001). Bayesian updating has a rigorous derivation based upon probability theory, but its underlying assumptions, e.g., the statistical independence of multiple pieces of evidence, may not be true in practical situations. Certainty theory does not have a rigorous mathematical basis, but has been devised as a practical and pragmatic way of overcoming some of the limitations of Bayesian updating. It was first used in the classic MYCIN system for diagnosing infectious diseases (Buchanan, 1984). Other approaches are reviewed in (Hopgood, 2001), where it is also proposed that a practical non-mathematical approach is to treat rule conclusions as hypotheses that can be confirmed or refuted by the actions of other rules. Possibility theory, or fuzzy logic, allows the third form of uncertainty, i.e. vague language, to be used in a precise manner.
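Bayesian updating, named in this section as one technique for handling uncertain rules and evidence, can be illustrated with a minimal Python sketch in its odds form. The article gives no formulas, so the odds formulation, the function names, and all numbers below are assumptions chosen for illustration. Multiplying the likelihood ratios of several pieces of evidence relies on exactly the independence assumption that the text warns may not hold in practice.

```python
def to_odds(probability):
    """Convert a probability (strictly below 1) to odds."""
    return probability / (1.0 - probability)


def to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def bayesian_update(prior, likelihood_ratios):
    """Update the probability of a hypothesis given several pieces of evidence.

    Each likelihood ratio is P(evidence | hypothesis) / P(evidence | not hypothesis).
    Multiplying the ratios assumes the pieces of evidence are statistically
    independent, which is the assumption that may fail in practical situations.
    """
    odds = to_odds(prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return to_probability(odds)


# Hypothetical figures for the steam-escape rule: prior belief in an escape
# of steam, then evidence "valve open" (ratio 4) and "flow rate high" (ratio 3).
prior_escape = 0.1
after_valve = bayesian_update(prior_escape, [4.0])
after_both = bayesian_update(prior_escape, [4.0, 3.0])
```

With these illustrative numbers the belief in an escape of steam rises with each supporting observation, while the rule itself never has to be treated as certain.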

Decision Support and Analysis Decision support and analysis (DSA) and decision support systems (DSSs) describe a broad category of systems that involve generating alternatives and selecting among them. Web-based DSA, which uses external information sources, is becoming increasingly important. Decision support systems that use artificial intelligence techniques are sometimes referred to as intelligent DSSs. One clearly identifiable family of intelligent DSS is expert systems, described above. An expert system may contain a mixture of simple rules based on experience and observation, known as heuristic or shallow rules, and more fundamental or deep rules. For example, an expert system for diagnosing car breakdowns may contain a heuristic that suggests checking the battery if the car will not start. In contrast, the expert system might also contain deep rules, such as Kirchhoff’s laws, which apply to any electrical circuit and could be used in association with other rules and observations to diagnose any electrical circuit. Heuristics can often provide a useful shortcut to a solution, but lack the adaptability of deep knowledge.

Building and maintaining a reliable set of cause–effect pairs in the form of rules can be a huge task. The principle of model-based reasoning (MBR) is that, rather than storing a huge collection of symptom–cause pairs in the form of rules, these pairs can be generated by applying underlying principles to the model. The model may describe any kind of system, including systems that are physical (Fenton, 2001), software-based (Mateis, 2000), medical (Montani, 2003), legal (Bruninghaus, 2003), and behavioral (De Koning, 2000). Models of physical systems are made up of fundamental components such as tubes, wires, batteries, and valves. As each of these components performs a fairly simple role, it also has a simple failure mode. Given a model of how these components operate and interact to form a device, faults can be diagnosed by determining the effects of local malfunctions on the overall device.

Case-based reasoning (CBR) also has a major role in DSA. A characteristic of human intelligence is the ability to recall previous experience whenever a similar problem arises. This is the essence of case-based reasoning, in which new problems are solved by adapting previous solutions to old problems (Bergmann, 2003). Consider the example of diagnosing a broken-down car. If an expert system has made a successful diagnosis of the breakdown, given a set of symptoms, it can file away this information for future use. If the expert system is subsequently presented with details of another broken-down car of exactly the same type, displaying exactly the same symptoms in exactly the same circumstances, then the diagnosis can be completed simply by recalling the previous solution. However, a full description of the symptoms and the environment would need to be very detailed, and it is unlikely to be reproduced exactly.
What is needed is the ability to identify a previous case, the solution of which can be reused or modified to reflect the slightly altered circumstances, and then saved for future use. Such an approach is a good model of human reasoning. Indeed case-based reasoning is often used in a semiautomated manner, where a human can intervene at any stage in the cycle.
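The retrieve, reuse and retain steps of case-based reasoning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the symptom names, the diagnoses, and the matching-feature similarity measure are hypothetical and are not taken from any system cited in the article.

```python
def similarity(query, stored):
    """Fraction of features on which two symptom descriptions agree.

    A feature missing from one description counts as a mismatch.
    Assumes at least one feature is present in either description.
    """
    features = set(query) | set(stored)
    agreeing = sum(1 for f in features if query.get(f) == stored.get(f))
    return agreeing / len(features)


def retrieve(case_base, query):
    """Return the stored case whose symptoms are most similar to the query."""
    return max(case_base, key=lambda case: similarity(query, case["symptoms"]))


# Hypothetical case base for the broken-down car example.
case_base = [
    {"symptoms": {"starts": False, "lights": "dim", "clicking": True},
     "diagnosis": "flat battery"},
    {"symptoms": {"starts": False, "lights": "bright", "clicking": False},
     "diagnosis": "faulty starter motor"},
]

# A new problem that is similar to, but not identical with, a stored case.
new_problem = {"starts": False, "lights": "dim"}
best_case = retrieve(case_base, new_problem)

# Reuse the previous solution, then retain the new case for future use.
case_base.append({"symptoms": new_problem, "diagnosis": best_case["diagnosis"]})
```

In a semi-automated setting, a human could inspect `best_case` and amend the proposed diagnosis before the new case is retained.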

FUTURE TRENDS

While large corporate knowledge-based systems remain important, small embedded intelligent systems have also started to appear in the home and workplace. Examples include washing machines that incorporate knowledge-based control and wizards for personal computer management. By being embedded in their environment, such systems are less reliant on human data input than traditional expert systems, and often make decisions entirely based on sensor data. If AI is to become more widely situated in everyday environments, it needs to become smaller, cheaper, and more reliable. The next key stage in the development of AI is likely to be a move towards embedded AI, i.e. intelligent systems that are embedded in machines, devices, and appliances. The work of Choy (2003) is significant in this respect, as it demonstrates that the DARBS blackboard system can be ported to a compact platform of parallel low-cost processors. In addition to being distributed in their applications, intelligent systems are also becoming distributed in their method of implementation. Complex problems can be divided into subtasks that can be allocated to specialized collaborative agents, bringing together the best features of knowledge-based and computational intelligence approaches (Li, 2003). As the collaborating agents need not necessarily reside on the same computer, an intelligent system can be both distributed and hybridized (Choy, 2004). Paradoxically, there is also a sense in which intelligent systems are becoming more integrated, as software agents share access to a single definitive copy of data or knowledge, accessible via the web.

CONCLUSION As with any technique, knowledge-based systems are not suitable for all types of problems. Each problem calls for the most appropriate tool, but knowledge-based systems can be used for many problems that would be impracticable by other means. They have been particularly successful in narrow specialist domains. Building an intelligent system that can make sensible decisions about unfamiliar situations in everyday, non-specialist domains remains a severe challenge.



This development will require progress in simulating behaviors that humans take for granted – specifically perception, recognition, language, common sense, and adaptability. To build an intelligent system that spans the breadth of human capabilities is likely to require a hybrid approach using a combination of artificial intelligence techniques.

REFERENCES

Bergmann, R., Althoff, K.-D., Breen, S., Göker, M., Manago, M., Traphöner, R., and Wess, S. (2003). Developing Industrial Case-Based Reasoning Applications – the INRECA Methodology (2nd Edition). Lecture Notes in Artificial Intelligence, Vol. 1612. Springer - Buchreihe.

Bramer, M.A. (2005). Logic Programming with Prolog. Springer-Verlag, London.

Bruninghaus, S. and Ashley, K.D. (2003). Combining case-based and model-based reasoning for predicting the outcome of legal cases. Lecture Notes in Artificial Intelligence, 2689, 65-79.

Buchanan, B.G. and Shortliffe, E.H. (1984). Rule-Based Expert Systems: the MYCIN experiments of the Stanford Heuristic Programming Project. Addison-Wesley.

Choy, K.W., Hopgood, A.A., Nolle, L. and O’Neill, B.C. (2003). Design and implementation of an inter-process communication model for an embedded distributed processing network. International Conference on Software Engineering Research and Practice (SERP’03), Las Vegas, 239-245.

Choy, K.W., Hopgood, A.A., Nolle, L. and O’Neill, B.C. (2004). Implementation of a tileworld testbed on a distributed blackboard system. 18th European Simulation Multiconference (ESM2004), Magdeburg, Germany, 129-135.

De Koning, K., Bredeweg, B., Breuker, J., and Wielinga, B. (2000). Model-based reasoning about learner behaviour. Artificial Intelligence, 117, 173-229.

Fenton, W.G., Mcginnity, T.M., and Maguire, L.P. (2001). Fault diagnosis of electronic systems using intelligent techniques: a review. IEEE Transactions on Systems Man and Cybernetics Part C - Applications and Reviews, 31, 269-281.

Hopgood, A.A. (2001). Intelligent Systems for Engineers and Scientists, 2nd edition. CRC Press, Boca Raton.

Hopgood, A.A. (2003). Artificial intelligence: hype or reality? IEEE Computer, 6, 24-28.

Hopgood, A.A. (2005). The state of artificial intelligence. Advances in Computers, 65, 1-75.

Li, G., Hopgood, A.A. and Weller, M.J. (2003). Shifting Matrix Management: a model for multi-agent cooperation. Engineering Applications of Artificial Intelligence, 16, 191-201.

Mateis, C., Stumptner, M., and Wotawa, F. (2000). Locating bugs in Java programs - First results of the Java diagnosis experiments project. Lecture Notes in Artificial Intelligence, 1821, 174-183.

Montani, S., Magni, P., Bellazzi, R., Larizza, C., Roudsari, A.V., and Carson, E.R. (2003). Integrating model-based decision support in a multi-modal reasoning system for managing type 1 diabetic patients. Artificial Intelligence in Medicine, 29, 131-151.

Moss, N.G., Hopgood, A.A. and Weller, M.J. (2007). Can Agents without Concepts think? An Investigation using a Knowledge Based System. Proc. AI-2007: 27th SGAI International Conference on Artificial Intelligence, Cambridge, UK.

Yang, B.S., Lim, D.S., and Tan, A.C.C. (2005). VIBEX: an expert system for vibration fault diagnosis of rotating machinery using decision tree and decision table. Expert Systems with Applications, 28(4), 735-742.

KEY TERMS

Backward Chaining: Rules are applied through depth-first search of the rule base to establish a goal. If a line of reasoning fails, the inference engine must backtrack and search a new branch of the search tree. This process is repeated until the goal is established or all branches have been explored.

Case-Based Reasoning: Solving new problems by adapting solutions that were previously used to solve old problems.

Closed-World Assumption: The assumption that all knowledge about a domain is contained in the knowledge base. Anything that is not true according to the knowledge base is assumed to be false.

Deep Knowledge: Fundamental knowledge with general applicability, such as the laws of physics, which can be used in conjunction with other deep knowledge to link evidence and conclusions.

Forward Chaining: Rules are applied iteratively whenever their conditions are satisfied, subject to a selection mechanism known as conflict resolution when the conditions of multiple rules are satisfied.

Heuristic or Shallow Knowledge: Knowledge, usually in the form of a rule, that links evidence and conclusions in a limited domain. Heuristics are based on observation and experience, without an underlying derivation or understanding.

Inference Network: The linkages between a set of conditions and conclusions.

Knowledge-Based System: System in which the knowledge base is explicitly separated from the inference engine that applies the knowledge.

Model-Based Reasoning: The knowledge base comprises a model of the problem area, constructed from component parts. The inference engine reasons about the real world by exploring behaviors of the model.

Production Rule: A rule of the form if <condition> then <conclusion>.

Kohonen Maps and TS Algorithms Marie-Thérèse Boyer-Xambeu Université de Paris VII – LED, France Ghislain Deleplace Université de Paris VIII – LED, France Patrice Gaubert Université de Paris 12 – ERUDITE, France Lucien Gillard CNRS – LED, France Madalina Olteanu Université de Paris I – CES SAMOS, France

INTRODUCTION

In the analysis of a temporal process, Kohonen maps may be used together with time-series (TS) algorithms. Previous research aimed at combining Kohonen algorithms and Markov switching models in order to suggest a periodization of international bimetallism in the 19th century (Boyer-Xambeu, Deleplace, Gaubert, Gillard and Olteanu, 2006). This research was based on an economic study of the international monetary system prevailing at that time in Europe, which combined three monetary zones: a gold-standard one, centred in London, a bimetallic one, centred in Paris, and a silver-standard one, centred in Hamburg (Boyer-Xambeu, Deleplace and Gillard, 2006). The three major financial centres of that system (London, Paris, and Hamburg, hence the label LPH used hereafter) were linked through arbitrage operations between markets for gold and silver and markets for foreign exchange located in those centres. Since two metals, gold and silver, acted as monetary standards in that system, it worked as an international bimetallism. Its growing integration over half a century (from 1821 to 1873) was reflected in the convergence of the observed levels of the relative price of gold to silver in London, Paris, and Hamburg. However, this integration process was subject to various changes, which can be understood as exogenous shocks disturbing that process.

One such shock is widely documented in the literature: the discovery of new gold mines in the United States and Australia, which led in 1850 to a sudden decline of the gold-silver price in all the markets of the world. This decline was not of the same magnitude everywhere, and therefore the spread between the London, Paris, and Hamburg gold-silver prices increased, stopping for a time the integration process of the system. This is what we will call a breaking in that process. The present paper aims at locating the major breakings occurring during the period of international bimetallism; a historical study could link them to special events, which operated as exogenous shocks on that system. The indicator of integration used is the spread between the highest and the lowest among the London, Paris, and Hamburg gold-silver prices. Three algorithms are combined to study this integration: a periodization obtained with the SOM algorithm is compared with the estimation of a two-regime Markov switching model, in order to give an interpretation of the changes of regime; at the same time, change-points are identified over the whole period, providing a more precise interpretation of these varying types of regulation. Section 2 summarizes the results obtained with the SOM algorithm to differentiate the sub-periods obtained using the whole available data. Section 3 presents the kind of model used and the results of its estimation using the new indicator, the spread computed at each period of quotation between the three relative prices of gold in silver. The sub-periods are compared with the two regimes obtained, and some evidence of a relation between the regime and the volatility of the spread is presented. Section 4 presents the technique used to identify change-points in the temporal process, and some strong results on breaks in mean and in variance of the spread are obtained. They are interpreted in terms of monetary history, since some of them are quite new in the literature of this domain. Some further directions of research are indicated in the conclusion.
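The indicator of integration defined in this introduction, the spread between the highest and the lowest of the three gold-silver prices, is simple enough to state as a short Python sketch. The function names and the quotations below are hypothetical and serve only to make the definition concrete; they are not data from the study.

```python
def gold_silver_price(gold_price, silver_price):
    """Relative price of gold in silver at one financial centre."""
    return gold_price / silver_price


def spread(gold_silver_prices):
    """Highest minus lowest gold-silver price across the financial centres."""
    values = list(gold_silver_prices.values())
    return max(values) - min(values)


# Hypothetical gold-silver prices for one quotation date.
quotation = {"London": 15.72, "Paris": 15.60, "Hamburg": 15.81}
current_spread = spread(quotation)
```

A narrowing of this quantity over time is what the paper reads as growing integration of the three markets.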

THE SUB-PERIODS OBTAINED WITH A SOM ALGORITHM¹

The Data The relative prices of gold in silver are computed from the price of each metal observed, twice a week, in each of the three financial places, Paris, London and Hamburg (respectively, poa, lgs, and hoa), from the beginning of 1821 until the end of 1860. The same type of data is available for the exchange rates (Pound in Francs, Pound in Marks, Mark in Francs: respectively, lpv, hlv, and phv). An observation is a set of twelve values, two quotations (Tuesday and Friday) for each of the six variables. A computed variable (hpl) has been added to emphasize the relation between the relative price of metals in Hamburg and the average level of this value in Paris and London. Most of the time the quotations show rather small differences within a given week, but periods of serious disturbance, Paris in the late 1840s for instance, may be well separated from the more classical ones. After the Kohonen classification using a grid of 25 nodes, a hierarchical ascending classification is used to produce a small number of macro-classes, in this case 6, corresponding to the main sub-periods. This latter classification is constructed with the code vectors obtained from the first process².

Characteristics of the Macro-Classes Large sequences of contiguous weeks are grouped in the macro-classes; however, a few years are fragmented into short periods situated in different classes:

• Class 1 is constituted of 3 groups of years, 1829-30, 1834-38 and 1848-49, and many fragments of other years
• Class 2 is simpler to describe, with 3 intervals, 1832-33, 1842-43 and 1846-47, and some sparse weeks from the 1830s

They represent a central position contrasting with the other, well-identified classes:

• Class 3: 2 sets constituted of years 1824-25 and 1827-28, with almost no missing weeks in these intervals, indicating that this sub-period is very homogeneous
• Class 4: the end of year 1853 and the whole period 1854-60; again only a small number of weeks are missing for this continuous sub-period of more than seven years
• Class 5: 1821-24 and 1826-beginning 1827 plus small parts of 1830 and 1832
• Class 6: two sets, 1839-41 and 1851-53
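The two-stage classification described under The Data (a Kohonen map with 25 nodes, followed by a hierarchical ascending classification of the code vectors into 6 macro-classes) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the 5 x 5 grid shape, the learning-rate and neighbourhood schedules, the random initialisation, and the centroid linkage used for merging are all assumptions, since the text does not specify them.

```python
import math
import random


def best_matching_unit(codes, x):
    """Grid position of the node whose code vector is closest to x."""
    return min(codes, key=lambda node: sum((w - v) ** 2 for w, v in zip(codes[node], x)))


def train_som(data, rows=5, cols=5, epochs=20, seed=0):
    """Train a small Kohonen map; returns {grid position: code vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Assumption: code vectors start as copies of randomly chosen observations.
    codes = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            progress = step / total_steps
            rate = 0.5 * (1.0 - progress) + 0.01          # decreasing learning rate
            radius = max(rows, cols) / 2.0 * (1.0 - progress) + 0.5
            winner = best_matching_unit(codes, x)
            for node, weights in codes.items():
                grid_dist2 = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * radius * radius))
                for k in range(dim):
                    weights[k] += rate * influence * (x[k] - weights[k])
            step += 1
    return codes


def macro_classes(codes, n_classes=6):
    """Agglomerate the code vectors into n_classes groups (centroid linkage)."""
    dim = len(next(iter(codes.values())))
    clusters = [[node] for node in codes]

    def centroid(cluster):
        return [sum(codes[n][k] for n in cluster) / len(cluster) for k in range(dim)]

    while len(clusters) > n_classes:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = centroid(clusters[i]), centroid(clusters[j])
                dist2 = sum((a - b) ** 2 for a, b in zip(ci, cj))
                if best is None or dist2 < best[0]:
                    best = (dist2, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the two closest clusters
        del clusters[j]
    return clusters
```

Each weekly observation would then be assigned to the macro-class containing its best matching unit, which is how contiguous runs of weeks end up grouped into sub-periods.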

The means of the variables used to obtain the classification can be represented to illustrate the great differences appearing between the sub-periods. Changing hierarchies between the relative prices are the characteristic identifying the last four macro-classes. Rearranging the various classes according to calendar time makes it possible to distinguish between three sub-periods: a) the 1820s (classes 5 and 3, covering 1821 to 1828); b) the 1830s and 1840s (classes 1 and 2, covering 1829 to 1849); c) the 1850s (classes 6 and 4, covering 1851 to 1860). Only the years 1839-41 resist that rearrangement, since they belong to class 6, while they should appear in classes 1 and 2 relative to the 1830s and 1840s; some explanation will be suggested in the last section. Fig. 1 exhibits two contrasted situations, where the gold-silver price is respectively low (class 4) and high (class 5) in all the three financial centres. Fig. 2 confirms that opposition, since the two classes are also sharply contrasted by the levels of the exchange rates. Years 1821-23 and 1826 (class 5) are marked by a low mark/franc exchange rate and high gold-silver prices, the Hamburg one being higher than the Paris one; years 1854-60 (class 4) are marked by a high mark/franc exchange rate and low gold-silver prices, the Hamburg one being below the Paris one.

Figure 1. Gold-silver price and the 6 macro-classes (series: Paris, Hamburg, London; horizontal axis: classes 1 to 6)

Figure 2. Exchange rates and the 6 macro-classes (series: Pound in Mark, Pound in Francs, Mark in Franc; horizontal axis: classes 1 to 6)

These remarks, which also apply respectively to the rest of the 1820s (class 3) and to the rest of the 1850s (class 6), are consistent with historical analysis: while the Hamburg mark was always anchored to silver, the French franc was during the 1820s and 1850s anchored to gold (in contrast with the 1830s and 1840s when it was anchored to silver); it is then to be expected that the mark depreciated against the franc when silver depreciated against gold, and more in Hamburg than in Paris (as in classes 5 and 3), and that the mark appreciated against the franc when silver appreciated against gold, and more in Hamburg than in Paris (as in classes 4 and 6).

A MODEL FOR THE SPREAD BETWEEN THE HIGHEST AND THE LOWEST GOLD-SILVER PRICE

An Autoregressive Markov Switching Model The key assumption is that the time series to be modeled follows a different pattern or a different model according to some unobserved, finite-valued process. Usually, the unobserved process is a Markov chain whose states are called “regimes”, while the observed series follows a linear autoregressive model whose coefficients depend on the current regime. Let us put this in mathematical language. Suppose that $(y_t)_{t \in \mathbb{Z}}$ is the observed time series and that the unobserved process $(x_t)_{t \in \mathbb{Z}}$ is a two-state Markov chain with transition matrix

$$A = \begin{pmatrix} p & 1-q \\ 1-p & q \end{pmatrix}, \quad \text{where } p, q \in \, ]0,1[ \qquad (1)$$

Then, assuming that $y_t$ depends on its first $l$ lags, we have the following equation of the model:

$$y_t = a_0^{x_t} + a_1^{x_t} y_{t-1} + \dots + a_l^{x_t} y_{t-l} + \sigma_{x_t} \varepsilon_t \qquad (2)$$

where $a_i^{x_t} \in \{a_i^1, a_i^2\}$ with $(a_i^1, a_i^2) \in \mathbb{R}^2$ for every $i \in \{0, 1, \dots, l\}$, $\sigma_{x_t} \in \{\sigma_1, \sigma_2\}$ with $(\sigma_1, \sigma_2) \in (\mathbb{R}_+^*)^2$, and $\varepsilon_t$ is a standard Gaussian noise.

The parameters of the model are then

$$\{a_0^1, a_1^1, \dots, a_l^1, a_0^2, a_1^2, \dots, a_l^2, \sigma_1, \sigma_2, p, q\}$$

and they are usually estimated by maximizing the log-likelihood function via an EM (Expectation-Maximization) algorithm³. Our characteristic of interest will be the “a posteriori” computed conditional probabilities of belonging to the first or to the second regime. Indeed, as our goal is to derive a periodization of the international bimetallism, the “a posteriori” computed states of the unobserved Markov chain will provide a natural one. Although the results obtained with a switching Markov model are usually satisfying in terms of prediction and the periodizations are interesting and easily interpretable, a difficulty remains: how does one choose the number of regimes? In the absence of a complete theoretical answer, the criteria for selecting the “right” number of regimes are quite subjective from a statistical point of view⁴.
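To make the model of equations (1) and (2) concrete, the following Python sketch simulates a two-regime switching autoregression with a single lag and then computes the filtered probability of each regime given the observations so far. It is an illustration under assumptions, not the estimation procedure of the paper: the parameters are taken as known rather than estimated by EM, the lag order is fixed at l = 1, the chain is started in regime 1 with y_0 = 0, the filter starts from equal regime probabilities, and all numerical values are hypothetical. Regimes 1 and 2 of the text are indexed 0 and 1 in the code.

```python
import math
import random


def simulate(n, a0, a1, sigma, p, q, seed=0):
    """Simulate n observations; a0, a1, sigma are pairs indexed by regime."""
    rng = random.Random(seed)
    stay = (p, q)            # probability of remaining in regime 0 / regime 1
    state, y = 0, 0.0        # assumption: start in regime 0 with y_0 = 0
    states, series = [], []
    for _ in range(n):
        if rng.random() > stay[state]:
            state = 1 - state
        y = a0[state] + a1[state] * y + sigma[state] * rng.gauss(0.0, 1.0)
        states.append(state)
        series.append(y)
    return states, series


def gaussian_density(value, mean, sd):
    z = (value - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def filtered_probabilities(series, a0, a1, sigma, p, q):
    """Forward filter: P(x_t = regime | y_1, ..., y_t) for each t."""
    prob = [0.5, 0.5]        # assumption: no prior knowledge of the first regime
    previous_y = 0.0         # assumption: y_0 = 0, as in simulate()
    result = []
    for y in series:
        # Prediction step, using the transition matrix of equation (1).
        predicted = [prob[0] * p + prob[1] * (1.0 - q),
                     prob[0] * (1.0 - p) + prob[1] * q]
        # Update step, using the Gaussian density implied by equation (2).
        joint = [predicted[r] * gaussian_density(y, a0[r] + a1[r] * previous_y, sigma[r])
                 for r in (0, 1)]
        total = joint[0] + joint[1]
        prob = [joint[0] / total, joint[1] / total]
        result.append((prob[0], prob[1]))
        previous_y = y
    return result
```

The sequence of most probable regimes obtained from such probabilities is what provides a periodization; the paper uses probabilities computed "a posteriori" from the whole series, which would additionally require a backward smoothing pass not shown here.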

The Results In this paper we use a two-regime model to represent the spread computed with the gold-silver prices observed at each period on the three places. The transition matrix indicates good properties of stability:

$$\begin{pmatrix} 0.844298 & 0.253357 \\ 0.155702 & 0.746643 \end{pmatrix}$$

and no three-regime model was found with an acceptable stability. The first regime is a multilayer perceptron with one hidden layer, the second one is a simple linear model with one lag. Using the probabilities computed for each regime at each period, it may be interesting to study the six sub-periods obtained and to observe the switch between the regimes along these periods of time. Most of the time regime 1 explains the spread (about 70% of the whole period), but important differences are to be noted between the sub-periods: classes 3 and 4 clearly contrast, with respectively the highest and the lowest volatility of spread, as they are ruled by, respectively, the regime 2 and the regime 1 models. As will be explained later, further investigations have to be made with a more complex model and using a more adapted indicator of the arbitrages ruling the markets.

Diagram (unnumbered): dependence structure of the model, with $y_t = f_{x_t}(y_{t-1}, y_{t-2}) + \sigma_{x_t} \varepsilon_t$, $y_{t+1} = f_{x_{t+1}}(y_t, y_{t-1}) + \sigma_{x_{t+1}} \varepsilon_{t+1}$, and the hidden chain moving from $x_t$ to $x_{t+1}$ with probability $P(x_{t+1} \mid x_t)$.

Table 1. Regime 1 and volatility of spread

Sub-period   Number of obs.   Proportion in regime 1   Standard deviation of spread
1            483              0.733                    0.053
2            335              0.627                    0.061
3            191              0.445                    0.075
4            376              0.816                    0.044
5            303              0.625                    0.050
6            390              0.723                    0.049

IDENTIFICATION OF CHANGE-POINTS: A GLOBAL VISION OF THE BIMETALLIST SYSTEM OF PAYMENTS

Elements About the Technique⁵ A different approach to model changes of regime in a time-series is to detect change-points or breaks. Here, the main assumption is that the whole series is observed and change-points are computed “a posteriori”. Thus, this approach does not have a predictive goal; it is rather aimed at explaining the series by a piecewise stationary process, which seems to be well adapted to our problem. Mathematically, the model can be written as follows: let us consider the observed m-dimensional series $y_t = (y_{1,t}, \dots, y_{m,t})^T$, $t = 1, \dots, T$, and suppose that it is abruptly changed. The changes, whose number and configuration are unknown, occur in the marginal distribution and may be in mean, in variance or in both mean and variance. We assume that there exists an integer $K^*$ and a sequence of change-points $\tau^* = \{\tau_1^*, \dots, \tau_{K^*}^*\}$ with $\tau_0^* = 0 < \tau_1^* < \dots < \tau_{K^*-1}^* < \tau_{K^*}^* = T$ such that $(\mu_k, \Sigma_k) \neq (\mu_{k+1}, \Sigma_{k+1})$, where $\mu_k = E(Y_t)$ and $\Sigma_k = \mathrm{Cov}(Y_t) = E[(Y_t - E(Y_t))(Y_t - E(Y_t))^T]$ for $\tau_{k-1}^* + 1 \le t \le \tau_k^*$. The number of changes as well as their configuration are computed by minimizing a penalized contrast function. Details on the algorithms for computing the change-point configuration $\tau^*$ can be found in Lavielle and Teyssière (2006)⁶.
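The idea of locating change-points by minimizing a penalized contrast function can be illustrated with a simplified Python sketch. It is not the algorithm of Lavielle and Teyssière: it handles only a univariate series and changes in mean, uses a least-squares contrast, charges a fixed penalty (the hypothetical parameter `penalty`) for each segment after the first, and finds the exact minimum by dynamic programming over all segmentations.

```python
def detect_mean_changes(series, penalty):
    """Return the indices at which a new segment starts (changes in mean).

    Minimizes: sum of within-segment squared deviations from the segment mean
    plus `penalty` for every change-point, over all possible segmentations.
    """
    n = len(series)
    # Prefix sums give the least-squares contrast of any segment in O(1).
    sums, squares = [0.0], [0.0]
    for value in series:
        sums.append(sums[-1] + value)
        squares.append(squares[-1] + value * value)

    def contrast(start, end):
        """Squared deviation of series[start:end] around its own mean."""
        total = sums[end] - sums[start]
        return (squares[end] - squares[start]) - total * total / (end - start)

    # best[end]: minimal penalized contrast of series[:end].
    # Starting at -penalty means the first segment is not charged.
    best = [-penalty] + [float("inf")] * n
    last_start = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            candidate = best[start] + contrast(start, end) + penalty
            if candidate < best[end]:
                best[end] = candidate
                last_start[end] = start

    # Walk back through the optimal segmentation to collect the change-points.
    change_points = []
    end = n
    while end > 0:
        start = last_start[end]
        if start > 0:
            change_points.append(start)
        end = start
    return sorted(change_points)
```

A larger penalty yields fewer change-points; in the multivariate setting of the text, the contrast would also have to account for changes in the covariance matrix.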

Some Results and Interpretation Applying this technique to the spread gave 7 change-points in mean and 4 in mean and variance. Fig. 3 summarizes the spread, the four change-points (the first 4 green lines in chronological order) obtained in mean and variance, and the last 2 change-points in mean, which correspond to a major break in the level of the gold-silver price, observed simultaneously in the three places, and to the great change in the production of gold in the United States. A closer look at the spread between the highest and the lowest among the London, Hamburg and Paris gold-silver prices draws attention to three episodes, each of them beginning with a break which sharply increases the spread and ending with another breaking which sharply narrows it (green vertical lines on Fig. 3). These episodes have in common that they are linked to shocks affecting the integration process of the LPH system, although the shocks may have been asymmetrical (only one or two of the financial centres being initially hit) or symmetrical (the three of them being simultaneously hit).

Figure 3. Spread, change-points and probability of regime 1 (series: spread (max-min) and prob1; change-points marked at year-week 1824-21, 1825-41, 1839-45, 1843-13, 1850-46 and 1854-41; horizontal axis: 1821 to 1859)

The first episode runs from the 21st week of 1824 till the 41st week of 1825. The sharp initial increase in the spread may be explained by two opposite movements in London and Hamburg: on one side, heavy speculation in South-American bonds and Indian cotton fuelled in London the demand for foreign payments in silver, which resulted in a great increase in the price of silver and a corresponding decline in the gold-silver price; on the other side, the price of gold rose in Hamburg while the price of silver remained constant, sparking the huge spread between the highest (Hamburg) and the lowest (London) gold-silver prices. More than one year later, the opposite movements took place: the price of gold plunged in Hamburg, while the price of silver remained at its height in London, under the influence of continuing speculation (which would end up in the famous banking crisis of December 1825); consequently the spread abruptly narrowed, this event being reflected by the breaking of the 41st week of 1825.


The second episode runs from the 45th week of 1839 till the 13th week of 1843. It started with the attempt of Prussia to unify the numerous German-speaking independent states in a common monetary zone, on a silver standard. Since the Bank of Hamburg maintained the price of silver fixed, that pressure on silver led to a drop in the Hamburg price of gold, and consequently in its gold-silver price, at a time when it was more or less stabilized in Paris. The spread between the highest (Paris) and the lowest (Hamburg) gold-silver price was suddenly enlarged, and for more than three years remained at a level significantly higher than during the 14 preceding years. This episode ended with the breaking of the 13th week of 1843, when, this shock having been absorbed, the gold-silver price in Hamburg went back in line with the price in the two other financial centres.

The third episode runs from the 46th week of 1850 till the 41st week of 1854. The shock was then symmetrical: London, Paris and Hamburg were hit by the influx of gold following the discovery of the Californian mines, and by the sudden downward pressure on the world price of that metal. It took four years to absorb this enormous shock, as reflected by the breaking of the 41st week of 1854.

CONCLUSION

In the three cases, the integration process of the LPH system, shown by the downward trend of the spread over half a century, was jeopardized by a shock: a speculative one in 1824, an institutional one in 1839, a technological one in 1850. But the effects of these shocks were absorbed after some time, thanks to active arbitrage operations between the three financial centres of the system. Generally, that arbitrage did not imply the barter of gold for silver but the coupling of a foreign exchange operation (on bills of exchange) with the transport of one metal only. As a consequence, it would be appropriate in a further study to locate the breakings of another indicator of integration: the spread between a representative “national” gold-silver price and an arbitrated international gold-silver price taking into account the foreign exchange rates. At the same time it would be interesting to go further with the Markov switching model, trying more complete specifications.

REFERENCES

Boyer-Xambeu, M.-T., Deleplace, G. & Gillard, L. (1995). « Bimétallisme, taux de change et prix de l’or et de l’argent (1717-1873) », Economies et Sociétés 29, no. 7-8: 5-377.

Boyer-Xambeu, M.-T., Deleplace, G. & Gillard, L. (1997). “‘Bimetallic Snake’ and Monetary Regimes: The Stability of the Exchange Rate between London and Paris from 1796 to 1873”, in Monetary Standards and Exchange Rates, M.C. Marcuzzo, L.H. Officer, and A. Rosselli Eds., Routledge, London: 106-49.

Boyer-Xambeu, M.-T., Deleplace, G. & Gillard, L. (2006). International Bimetallism? Exchange Rates and Bullion Flows in Europe, 1821-1873, mimeo, Université Paris 8 – LED.

Boyer-Xambeu, M.-T., Deleplace, G., Gaubert, P., Gillard, L. & Olteanu, M. (2006). “Combining a Dynamic Version of Kohonen Algorithm and a Two-Regime Markov Switching Model: an Application to the Periodization of International Bimetallism (1821-1873)”, Investigacion Operacional (forthcoming).

Cottrell, M., Fort, E.C. & Pagès, G. (1997). “Theoretical aspects of the Kohonen Algorithm”, WSOM’97, Helsinki.

Cottrell, M., Gaubert, P., Letremy, P. & Rousset, P. (1999). “Analysing and Representing Multidimensional Quantitative and Qualitative Data. Demographic Study of the Rhone Valley. The Domestic Consumption of the Canadian Families”, in Kohonen Maps, E. Oja and S. Kaski Eds., Elsevier Science, Amsterdam.

Hamilton, J.D. (1989). “A new approach to the economic analysis of non-stationary time series and the business cycle”, Econometrica, 57, 357-84.

Kohonen, T. (1984). Self-Organization and Associative Memory. Springer, Berlin (3rd edition 1989).

Lavielle, M. (1999). “Detection of multiple changes in a sequence of dependent variables”, Stochastic Process. Appl., vol. 83, pp. 79-102.

Lavielle, M. & Teyssière, G. (2005). “Adaptive detection of multiple change-points in asset price volatility”, in Teyssière, G. & Kirman, A. Eds., Long-Memory in Economics, Springer, Berlin, pp. 129-156.

Lavielle, M. & Teyssière, G. (2006). “Detection of Multiple Change-Points in Multivariate Time Series”, Lithuanian Mathematical Journal, vol. 46, n° 3, pp. 287-306.

Maillet, B., Olteanu, M. & Rynkiewicz, J. (2004). “Nonlinear Analysis of Shocks when Financial Markets are Subject to Changes in Regime”, Proceedings of ESANN 2004, pp. 87-92.

Olteanu, M. & Rynkiewicz, J. (2006). “Estimating the Number of Regimes in an Autoregressive Model with Markov Switching”, IOR 2006, La Habana.

Rynkiewicz, J. (2004). “Estimation of linear autoregressive models with Markov-switching”, Investigacion Operacional, La Havane, Cuba. Vol. 25:2, pp. 166-173.

Teyssière, G. (2003). “Interaction models for common long-range dependence in asset price volatility”, in Rangarajan G. and Ding M. Eds., Processes with Long Range Correlations: Theory and Applications, Lectures Notes in Physics, 621, Springer, Berlin, pp. 251-269.


KEY TERMS

Change-Point: Instant of time at which the basic parameters of a time series change (in mean and/or in variance); the series may be considered a piecewise stationary process between two change-points.

Gold-Silver Price: Ratio of the market price of gold to the market price of silver in one place. The stability of that ratio through time and the convergence of its levels in the various places constituting the international bimetallism (see that definition) are tests of the integration of that system.

International Arbitrage: Activity of traders in gold and silver and in foreign exchange, which consisted in comparing their prices in different places, and in moving the precious metals and the bills of exchange accordingly, in order to make a profit. Arbitrage and monetary rules were the two factors explaining the working of international bimetallism (see that definition).

International Bimetallism: An international monetary system (see that definition) which worked from 1821 to 1873. It was based on gold and silver acting as monetary standards, either together in the same country (like France) or separately in different countries (gold in England, silver in the German and Northern states). The integration of that system was reflected in the stability and the convergence of the observed levels of the relative price of gold to silver (see that definition) in London, Paris, and Hamburg.

International Monetary System: A system linking the currencies of various countries, which ensures the stability of the exchange rates between them. Its working depends on the monetary rules adopted in each country and on international arbitrage (see that definition) between the foreign exchange markets. Historical examples are the gold-standard system (1873-1914) and the Bretton-Woods system (1944-1976). The paper studies some characteristics of another historical example: international bimetallism (see that definition).

Markov Switching Model: An autoregressive model where the process linking a present value to its lags is governed by a hidden Markov chain defined by its transition matrix.

SOM Algorithm: An unsupervised classification technique (Kohonen, 1984) combining adaptive learning and neighbourhood to construct a very stable classification, with a simpler interpretation (‘Kohonen maps’) than other techniques.

ENDNOTES

1 Details may be found in Boyer-Xambeu, ..., Olteanu, 2006.
2 See Cottrell M., Fort…(1997) and Cottrell M., Gaubert…(1999).
3 See Rynkiewicz (2004) and Maillet et al. (2004).
4 See Olteanu et al. (2006).
5 The authors are very grateful to Gilles Teyssière for significant help on this part.
6 See also Lavielle M. and Teyssière G. (2005), Teyssière G. (2003) and Lavielle M. (1999).


Learning in Feed-Forward Artificial Neural Networks I Lluís A. Belanche Muñoz Universitat Politècnica de Catalunya, Spain

INTRODUCTION

The view of artificial neural networks as adaptive systems has led to the development of ad-hoc generic procedures known as learning rules. The first of these is the Perceptron Rule (Rosenblatt, 1962), useful for single-layer feed-forward networks and linearly separable problems. Its simplicity and beauty, and the existence of a convergence theorem, made it a basic point of departure for neural learning algorithms. This algorithm is a particular case of the Widrow-Hoff or delta rule (Widrow & Hoff, 1960), applicable to continuous networks with no hidden layers and an error function that is quadratic in the parameters.

BACKGROUND

The first truly useful algorithm for feed-forward multilayer networks is the backpropagation algorithm (Rumelhart, Hinton & Williams, 1986), reportedly proposed first by Werbos (1974) and Parker (1982). Many efforts have been devoted to enhancing it in a number of ways, especially concerning speed and reliability of convergence (Haykin, 1994; Hecht-Nielsen, 1990). The backpropagation algorithm serves in general to compute the gradient vector in all the first-order methods reviewed below.

Neural networks are trained by setting values for the network parameters w to minimize an error function E(w). If this function is quadratic in w, then the solution can be found by solving a linear system of equations (e.g. with Singular Value Decomposition (Press, Teukolsky, Vetterling & Flannery, 1992)) or iteratively with the delta rule. In the general case, the minimization is realized by a variant of a gradient descent procedure, whose ultimate outcome is a local minimum: a w* from which any infinitesimal change makes E(w*) increase, and which may not correspond to one of the global minima. Different solutions are found by starting at different initial states. The process is also perturbed by round-off errors. Given E(w) to be minimized and an initial state $w_0$, these methods perform at each iteration the updating step:

$w_{i+1} = w_i + \alpha_i u_i$    (1)

where $u_i$ is the minimization direction (the direction in which to move) and $\alpha_i \in \mathbb{R}$ is the step size (how far to move along $u_i$), also known as the learning rate in earlier contexts. For convenience, define $\Delta w_i = w_{i+1} - w_i$. Common stopping criteria are:

1. A maximum number of presentations of D (epochs) is reached.
2. A maximum amount of computing time has been exceeded.
3. The evaluation has been minimized below a certain tolerance.
4. The gradient norm has fallen below a certain tolerance.
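As an illustration of the updating step (1) combined with the four stopping criteria, the following minimal Python sketch uses the steepest-descent direction and a constant step size. The function name `gradient_descent`, its parameters and the quadratic test error are assumptions made for this example only; they are not taken from the text.

```python
import time


def gradient_descent(E, grad_E, w0, alpha=0.1, max_epochs=1000,
                     max_seconds=1.0, e_tol=1e-10, g_tol=1e-6):
    """Iterate w_{i+1} = w_i + alpha * u_i with u_i = -grad E(w_i)."""
    w = list(w0)
    start = time.monotonic()
    for _ in range(max_epochs):                  # criterion 1: epoch budget
        if time.monotonic() - start > max_seconds:
            return w, "time"                     # criterion 2: time budget
        if E(w) < e_tol:
            return w, "error"                    # criterion 3: error tolerance
        g = grad_E(w)
        if sum(gj * gj for gj in g) ** 0.5 < g_tol:
            return w, "gradient"                 # criterion 4: gradient norm
        u = [-gj for gj in g]                    # minimization direction
        w = [wj + alpha * uj for wj, uj in zip(w, u)]
    return w, "epochs"


# Hypothetical quadratic error with its minimum at w = (1, -2).
def E(w):
    return 0.5 * ((w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2)


def grad_E(w):
    return [w[0] - 1.0, w[1] + 2.0]


w_star, reason = gradient_descent(E, grad_E, [0.0, 0.0])
```

The returned `reason` records which criterion ended the run, which is useful when comparing runs started from different initial states.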

LEARNING ALGORITHMS

Training algorithms may require information from the objective function only, from the gradient vector of the objective function, or from the Hessian matrix of the objective function:

• Zero-order training algorithms make use of the objective function only. The most significant algorithms are evolutionary algorithms, which are global optimization methods (Goldberg, 1989).
• First-order training algorithms use the objective function and its gradient vector. Examples are Gradient Descent, Conjugate Gradient or Quasi-Newton methods, which are all local optimization methods (Luenberger, 1984).
• Second-order training algorithms make use of the objective function, its gradient vector and its Hessian matrix. Examples are Newton’s method and the Levenberg-Marquardt algorithm, which are local optimization methods (Luenberger, 1984).

First-order methods. The gradient $\nabla E_w$ of an s-dimensional function is the vector field of first derivatives of E(w) w.r.t. w,

$\nabla E_w = \left( \frac{\partial E(w)}{\partial w_1}, \ldots, \frac{\partial E(w)}{\partial w_s} \right)$    (2)

Here s = dim(w). A linear approximation to E(w) in an infinitesimal neighbourhood of an arbitrary point $w_i$ is given by:

$E(w) \approx E(w_i) + \nabla E_w(w_i) \cdot (w - w_i)$    (3)
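A small numerical check can make (2) and (3) concrete. The sketch below estimates the gradient by central differences and compares the linear approximation with the true value at a nearby point. The error function and the evaluation points are hypothetical choices for illustration, and finite differences are used here only for the check; they are not how the gradient is obtained in network training.

```python
def numerical_gradient(E, w, h=1e-6):
    """Central-difference estimate of the gradient of E at w."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += h
        w_minus[j] -= h
        grad.append((E(w_plus) - E(w_minus)) / (2.0 * h))
    return grad


def linear_approximation(E, w_i, w):
    """First two Taylor terms: E(w_i) + grad E(w_i) . (w - w_i)."""
    g = numerical_gradient(E, w_i)
    return E(w_i) + sum(gj * (wj - wij) for gj, wj, wij in zip(g, w, w_i))


# Hypothetical smooth error function, used only for illustration.
def E(w):
    return w[0] ** 2 + 3.0 * w[0] * w[1] + 2.0 * w[1] ** 2


w_i = [1.0, -1.0]
near = [1.001, -0.999]
approx = linear_approximation(E, w_i, near)
exact = E(near)
```

The gap between `approx` and `exact` shrinks quadratically as `near` approaches `w_i`, which is why the approximation is only trusted in a small neighbourhood.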

We write ∇Ew(wi) for the gradient ∇Ew evaluated at wi. These are the first two terms of the Taylor expansion of E(w) around wi. In steepest or gradient descent methods, this local gradient alone determines the minimization direction ui. Since, at any point wi, the gradient ∇Ew(wi) points in the direction of fastest increase of E(w), an adjustment of wi in the negative direction of the local gradient leads to its maximum decrease. In consequence the direction ui= -∇Ew(wi) is taken. In conventional steepest descent, the step size αi is obtained by a line search in the direction of ui: how far to go along ui before a new direction is chosen. To this end, evaluations of E(w) and its derivatives are made to locate some nearby local minimum. Line search is a move in the chosen direction ui to find the minimum of E(w) along it. For this one-dimensional problem, the simplest approach is to proceed along ui in small steps, evaluating E(w) at each sampled point, until it starts to increase. One often used method is a divide-and-conquer strategy, also called Brent’s method (Fletcher, 1980): 1.

2. 3.

Bracket the search by setting three points a,...,<xp,yp>}, f(xµ)k+ε = yµ,k

(19)

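For simplicity, we assume that the loss is the square error and define:

$E(w) = \frac{1}{2} \sum_{\mu=1}^{p} \sum_{k=1}^{m} \left( y_{\mu,k} - \zeta^{\mu,c+1}_{k} \right)^2$    (20)

where $\zeta^{\mu,c+1}_{k} = F_w(x_\mu)_k$ is the k-th component of the network’s response to input pattern $x_\mu$ (the network has c+1 layers, of which c are hidden). For a given input pattern $x_\mu$, define:

$E_\mu(w) = \frac{1}{2} \sum_{k=1}^{m} \left( y_{\mu,k} - \zeta^{\mu,c+1}_{k} \right)^2$    (21)

so that $E(w) = \sum_{\mu=1}^{p} E_\mu(w)$.

The computation of a single unit i in layer l, 1 ≤ l ≤ c+1, upon presentation of pattern $x_\mu$ to the network may be expressed as $\zeta^{\mu,l}_{i} = g(\hat{\zeta}^{\mu,l}_{i})$, with g a smooth function (such as the sigmoidals) and $\hat{\zeta}^{\mu,l}_{i} = \sum_j w^{l}_{ij} \zeta^{\mu,l-1}_{j}$.

The first outputs are then defined as $\zeta^{\mu,0}_{i} = x_{\mu,i}$. A single weight $w^{l}_{ij}$ denotes the connection strength from neuron j in layer l-1 to neuron i in layer l, 1 ≤ l ≤ c+1. If the gradient-descent rule (1) with constant α is followed, then $u_i = -\nabla E_w(w_i)$. Together with (2), the increment $\Delta w^{l}_{ij}$ in a single weight $w^{l}_{ij}$ of w is:

$\Delta w^{l}_{ij} = -\alpha \frac{\partial E(w)}{\partial w^{l}_{ij}} = -\alpha \sum_{\mu=1}^{p} \frac{\partial E_\mu(w)}{\partial w^{l}_{ij}} = -\alpha \sum_{\mu=1}^{p} \Delta_\mu w^{l}_{ij}$    (22)

We have:

$\Delta_\mu w^{l}_{ij} = \frac{\partial E_\mu(w)}{\partial w^{l}_{ij}} = \frac{\partial E_\mu(w)}{\partial \zeta^{\mu,l}_{i}} \, \frac{\partial \zeta^{\mu,l}_{i}}{\partial \hat{\zeta}^{\mu,l}_{i}} \, \frac{\partial \hat{\zeta}^{\mu,l}_{i}}{\partial w^{l}_{ij}}$    (23)

Proceeding from right to left in (23):

$\frac{\partial \hat{\zeta}^{\mu,l}_{i}}{\partial w^{l}_{ij}} = \zeta^{\mu,l-1}_{j}$    (24)

$\frac{\partial \zeta^{\mu,l}_{i}}{\partial \hat{\zeta}^{\mu,l}_{i}} = \frac{d g(\hat{\zeta}^{\mu,l}_{i})}{d \hat{\zeta}^{\mu,l}_{i}} = g'(\hat{\zeta}^{\mu,l}_{i})$    (25)

Assuming g to be the logistic function with slope β:

$g'(\hat{\zeta}^{\mu,l}_{i}) = \beta \, g(\hat{\zeta}^{\mu,l}_{i}) \left[ 1 - g(\hat{\zeta}^{\mu,l}_{i}) \right] = \beta \, \zeta^{\mu,l}_{i} \left( 1 - \zeta^{\mu,l}_{i} \right)$    (26)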
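The quantities defined so far can be sketched in a few lines of Python: the layer-by-layer computation of the unit outputs, the per-pattern and total errors of (21) and (20), and a numerical check of the logistic derivative (26). The weight values, the two training patterns and all function names are hypothetical and serve only as an illustration; as in the formulas above, no bias terms are included.

```python
import math


def logistic(z, beta=1.0):
    """Logistic activation g with slope beta."""
    return 1.0 / (1.0 + math.exp(-beta * z))


def forward(weights, x, beta=1.0):
    """Return the outputs of every layer; outputs[0] is the input pattern."""
    outputs = [list(x)]
    for W in weights:                         # one weight matrix per layer l
        prev = outputs[-1]
        net = [sum(w_ij * z_j for w_ij, z_j in zip(row, prev)) for row in W]
        outputs.append([logistic(n, beta) for n in net])
    return outputs


def pattern_error(weights, x, y, beta=1.0):
    """E_mu(w): half the squared error on one pattern, as in (21)."""
    out = forward(weights, x, beta)[-1]
    return 0.5 * sum((y_k - o_k) ** 2 for y_k, o_k in zip(y, out))


def total_error(weights, data, beta=1.0):
    """E(w): sum of the per-pattern errors, as in (20)."""
    return sum(pattern_error(weights, x, y, beta) for x, y in data)


# Hypothetical network: 2 inputs, 2 hidden units, 1 output unit.
weights = [[[0.5, -0.3], [0.8, 0.2]], [[1.0, -1.0]]]
data = [([0.0, 1.0], [1.0]), ([1.0, 0.0], [0.0])]
E_total = total_error(weights, data)

# Check (26): g'(z) = beta * g(z) * (1 - g(z)), against a central difference.
beta, z, h = 2.0, 0.3, 1e-6
analytic = beta * logistic(z, beta) * (1.0 - logistic(z, beta))
numeric = (logistic(z + h, beta) - logistic(z - h, beta)) / (2.0 * h)
```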

The remaining expression $\frac{\partial E_\mu(w)}{\partial \zeta^{\mu,l}_{i}}$

For the general case l0 a smoothing term, plus an activation g which very often is a monotonically decreasing response from the origin. These units are localized, in the sense that they give a significant response only in a neighbourhood of their centre $w_i$. For the activation function, a Gaussian $g(z) = \exp(-z^2/2)$ is a preferred choice. Learning in RBF networks is characterized by the separation of the process into two consecutive stages (Haykin, 1994; Bishop, 1995):

1. Optimize the free parameters of the hidden layer (including the smoothing term) using only the $\{x_i\}_i$ in D. This is an unsupervised method that depends on the input sample distribution.
2. With these parameters found and frozen, optimize the $\{c_i\}_i$, the hidden-to-output weights, using the full information in D. This is a supervised method that depends on the given task.

There are many ways of optimizing the hidden-layer parameters. When the number of hidden neurons equals the number of patterns, each pattern may be taken as the centre of a particular neuron. In general, however, the aim is to form a representation of the probability density function of the data, by placing the centres only in those regions of the input space where significant data are present. One commonly used method is the k-means algorithm (McQueen, 1967), which in turn is an approximate version of the maximum-likelihood (ML) solution for determining the location of the means of a mixture density of component densities (that is, maximizing the likelihood of the parameters with respect to the data). The Expectation-Maximization (EM) algorithm (Duda & Hart, 1973) can be used to find the exact ML solution for the means and covariances of the density. EM has been reported to be superior to k-means (Nowlan, 1990). The set of centres can also be selected randomly from the set of data points. The value of the smoothing term can be obtained from the clustering method itself, or else estimated a posteriori. One popular heuristic is:

$\theta = \frac{d}{\sqrt{2M}}$    (2)

where d is the maximum distance between the chosen centres and M is the number of centres (hidden units). Alternatively, the method of Distance Averaging (Moody and Darken, 1989) can be used, which is the global average over all Euclidean distances between the centre of each unit and that of its nearest neighbour. Once these parameters are chosen and kept constant, and assuming the output units are linear, the (square) error function is quadratic. The hidden-to-output weights can thus be found quickly and reliably, either iteratively by simple gradient descent over the quadratic surface of the error function, or directly by computing the minimum-norm solution to the overdetermined least-squares data fitting problem (Orr, 1995). The whole set of parameters of an RBF network can also be optimized with a global gradient descent procedure on all the free parameters at once (Bishop, 1995; Haykin, 1994). This brings back the problems of local minima, slow training, etc., already discussed. However, better solutions can in principle be found, because the unsupervised solution focuses on estimating the input probability density function, but the resulting disposition may not be the one minimizing the square error.

Learning in Feed-Forward Artificial Neural Networks II
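The two-stage procedure described above can be sketched as follows for scalar inputs. Several choices here are assumptions made for illustration and are not taken from the text: the centres are picked at regular intervals from the sorted inputs (rather than by k-means or EM), each hidden unit computes g(|x - centre| / width), the width comes from the d / sqrt(2M) heuristic, and the output weights are fitted by simple per-pattern gradient descent on the quadratic error.

```python
import math


def gaussian(z):
    """Gaussian activation g(z) = exp(-z^2 / 2)."""
    return math.exp(-z * z / 2.0)


def hidden_outputs(x, centres, width):
    """Response of each localized unit to a scalar input x."""
    return [gaussian(abs(x - c) / width) for c in centres]


def predict(x, centres, width, c):
    """Linear output unit on top of the hidden layer."""
    return sum(ci * hi for ci, hi in zip(c, hidden_outputs(x, centres, width)))


def squared_error(data, centres, width, c):
    return 0.5 * sum((y - predict(x, centres, width, c)) ** 2 for x, y in data)


def train_rbf(data, n_centres, alpha=0.05, epochs=2000):
    xs = sorted(x for x, _ in data)
    # Stage 1 (unsupervised): choose centres and width from the inputs only.
    step = max(1, len(xs) // n_centres)
    centres = xs[::step][:n_centres]
    d = max(centres) - min(centres)            # maximum distance between centres
    width = d / math.sqrt(2.0 * len(centres))  # heuristic d / sqrt(2M)
    # Stage 2 (supervised): gradient descent on the hidden-to-output weights.
    c = [0.0] * len(centres)
    for _ in range(epochs):
        for x, y in data:
            h = hidden_outputs(x, centres, width)
            err = y - sum(ci * hi for ci, hi in zip(c, h))
            c = [ci + alpha * err * hi for ci, hi in zip(c, h)]
    return centres, width, c


# Hypothetical task: ten samples of sin(x) on [0, 2.7].
data = [(0.3 * i, math.sin(0.3 * i)) for i in range(10)]
centres, width, c = train_rbf(data, n_centres=5)
error_before = squared_error(data, centres, width, [0.0] * len(centres))
error_after = squared_error(data, centres, width, c)
```

Because stage 1 never looks at the targets, the same centres and width would be reused unchanged if the task (the y values) were replaced by another one defined on the same inputs.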

EVOLUTIONARY LEARNING ALGORITHMS

The alternative to derivative-based learning algorithms (DBLA) are Evolutionary Algorithms (EA) (Bäck, 1996). Although the number of successful specific applications of EA runs into the hundreds (see (Bäck, Fogel & Michalewicz, 1997) for a review), only Genetic Algorithms or GA (Goldberg, 1989) and, to a lesser extent, Evolutionary Programming (Fogel, 1992), have been broadly used for ANN optimization, since the early works using genetic algorithms (Montana & Davis, 1989). Evolutionary algorithms operate on a population of individuals, applying the principle of survival of the fittest to produce better approximations to a solution. At each generation, a new population is created by selecting individuals according to their level of fitness in the problem domain and recombining them using operators borrowed from natural genetics. The offspring also undergo mutation. This process leads to the evolution of populations of individuals that are better suited to their environment than the individuals they were created from, just as in natural adaptation. There are comprehensive review papers and guides to the extensive literature on this subject: see (Shaffer, Whitley & Eshelman, 1992), (Yao, 1993), (Kusçu & Thornton, 1994) and (Balakrishnan & Honavar, 1995).

One of their main advantages over methods based on derivatives is the global search mechanism. A global method does not guarantee that the solution found is not a local optimum; rather, it reduces the likelihood of getting caught in local optima. Another appealing issue is the possibility of performing the traditionally separated steps of determining the best architecture and its weights at the same time, in a search over the joint space of structures and weights. Another advantage is the use of potentially any cost measure to assess the goodness of fit or to include structural information. Still another possibility is to embed a DBLA into a GA, using the latter to search the space of structures and the DBLA to optimize the weights; this hybridization leads to extremely high computational costs. Finally, there is the use of EA solely for the numerical optimization problem. In the neural context, this is arguably the task for which continuous EA are most naturally suited.

However, it is difficult to find applications in which GA (or other EA, for that matter) have clearly outperformed DBLA for supervised training of feed-forward neural networks (Whitley, 1995). It has been pointed out that this task is inherently hard for algorithms that rely heavily on the recombination of potential solutions (Radcliffe, 1991). In addition, the training times can become too costly, even worse than those of DBLA. In general, Evolutionary Algorithms (particularly the continuous ones) need specific research devoted to ascertaining their general validity as alternatives to DBLA in neural network optimization. Theoretical as well as practical work, oriented to tailoring specific EA parameters for this task, together with specialized operator design, should pave the way to a fruitful assessment of validity.
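The generational cycle described above (selection by fitness, recombination, mutation) can be sketched for a real-valued weight vector as follows. This is a minimal illustration under stated assumptions, not a specific published algorithm: truncation selection, uniform recombination, Gaussian mutation and elitism are choices made for the example, and the linear-unit error function is hypothetical.

```python
import random


def evolve(error, dim, pop_size=20, generations=100, sigma=0.3, seed=0):
    """Minimise `error` over real-valued weight vectors with a simple EA."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=error)               # fitter = lower error
        history.append(error(population[0]))
        parents = population[:pop_size // 2]     # truncation selection
        offspring = [parents[0]]                 # elitism: keep the best
        while len(offspring) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]    # recombination
            child = [g + rng.gauss(0.0, sigma) for g in child]  # mutation
            offspring.append(child)
        population = offspring
    best = min(population, key=error)
    return best, history


# Hypothetical task: fit a single linear unit w0 * x + w1 to y = 2x - 1.
samples = [(x / 4.0, 2.0 * (x / 4.0) - 1.0) for x in range(-4, 5)]


def error(w):
    return 0.5 * sum((y - (w[0] * x + w[1])) ** 2 for x, y in samples)


best, history = evolve(error, dim=2)
```

Only values of the error function are used; no gradient is computed, which is what makes this a zero-order method. Elitism guarantees that the best error recorded in `history` never increases from one generation to the next.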

FUTURE TRENDS

Research in ANN currently concerns the development of learning algorithms for weight adaptation or, more often, the enhancement of existing ones. New architectures (ways of arranging the units in the network) are also introduced from time to time. Classical neuron models, although useful and effective, are limited to a few generic function classes, of which only a handful of instances are used in practice. One of the most attractive enhancements is the extension of neuron models to modern data mining situations, such as data heterogeneity. Although a feed-forward neural network can in principle approximate an arbitrary function to any desired degree of accuracy, in practice a pre-processing scheme is often applied to the data samples to ease the task. In many important real-world domains, objects are described by a mixture of continuous and discrete variables, usually containing missing information and characterized by an underlying vagueness, uncertainty or imprecision. For example, in the well-known UCI repository (Murphy & Aha, 1991) over half of the problems contain explicitly declared nominal attributes, let alone other discrete types or fuzzy information, usually unreported. This heterogeneous information should not, in general, be treated as real-valued quantities. Conventional ways of encoding non-standard information in ANN include (Prechelt, 1994), (Bishop, 1995), (Fiesler & Beale, 1997):


Ordinal variables. These variables correspond to discrete (finite) sets of values on which an ordering has been defined (possibly only partial). They are more often than not treated as real-valued, and mapped equidistantly onto an arbitrary real interval. A second possibility is to encode them using a thermometer. To this end, let k be the number of ordered values; k new binary inputs are then created. To represent value i, for 1≤i≤k, the leftmost 1,...,i units will be on, and the remaining i+1,...,k off. The interest of these variables lies in the fact that they appear frequently in real domains, either as symbolic information or from processes that are discrete in nature. Note that an ordinal variable need not be numerical.

Nominal variables. Nominal variables are almost always encoded using a 1-out-of-k representation, k being the number of values, which are then encoded as the rows of the $I_{k \times k}$ identity matrix.

Missing values. Missing information is an old issue in statistical analysis (Little & Rubin, 1987). There are several causes for the absence of a value. Missing values are very common in Medicine and Engineering, where many variables come from on-line sensors or device measurements. Missing information is difficult to handle, especially when the lost parts are of significant size. The entire case can be removed, or the missing value can be “filled in” with the mean, median or nearest neighbour, or encoded by adding another input equal to one only if the value is absent and zero otherwise. Statistical approaches need to make assumptions about the input distribution, or to model it. The main problem with missing data is that we never know whether all the effort devoted to their estimation will result, in practice, in better-behaved data. This is also the reason why we elaborate on the treatment of missing values as part of the general discussion on data characteristics. The reviewed methods pre-process the data to make it acceptable to models that otherwise would not accept it. In the case of missing values, the data is completed because the available neural methods only admit complete data sets.

Uncertainty. Vagueness, imprecision and other sources of uncertainty are considerations usually put aside in the ANN paradigm. Nonetheless, many variables in learning processes are likely to bear some form of uncertainty. In Engineering, for example, on-line sensors are likely to age with time and continuous use, and this may be reflected in the quality of their measurements. On many occasions, the data at hand are imprecise for a variety of reasons: technical limitations, a genuinely qualitative origin, or even a deliberate interest in introducing imprecision to increase the capacity for abstraction or generalization (Esteva, Godo & García, 1998), possibly because the underlying process is believed to be less precise than the available measures. In Fuzzy Systems theory there are explicit formalisms for representing and manipulating uncertainty, which is precisely what such systems model and manage best. It is perplexing that, when supplying this kind of input/output data, we require the network to approximate the desired output in a very precise way. Sometimes the known value takes an interval form, such as “between 5.1 and 5.5”, so that any transformation to a real value will result in a loss of information. A more common situation is the absence of numerical knowledge. For example, consider the value “fairly tall” for the variable height. Again, Fuzzy Systems handle this comfortably, but for an ANN it is real trouble.

The integration of symbolic and continuous information is also important because numeric methods bring greater concreteness, whereas symbolic methods bring greater abstraction. Their combined use is likely to increase the flexibility of hybrid systems. For numeric data, added flexibility is obtained by considering imprecision in their values, leading to fuzzy numbers (Zimmermann, 1992).
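The conventional encodings reviewed above (thermometer for ordinal variables, 1-out-of-k for nominal variables, and an extra indicator input for missing values) are simple enough to sketch directly. The function names and the fill value are hypothetical and chosen only for this illustration.

```python
def thermometer(i, k):
    """Ordinal value i (1 <= i <= k): the leftmost i of k units are on."""
    return [1 if j < i else 0 for j in range(k)]


def one_out_of_k(i, k):
    """Nominal value i (1 <= i <= k): row i of the k x k identity matrix."""
    return [1 if j == i - 1 else 0 for j in range(k)]


def with_missing_indicator(value, fill):
    """Return [input, flag]; the flag is 1 only if the value is absent."""
    if value is None:
        return [fill, 1]
    return [value, 0]


ordinal = thermometer(3, 5)                    # third of five ordered values
nominal = one_out_of_k(2, 4)                   # second of four categories
missing = with_missing_indicator(None, 0.5)    # 0.5 stands in for, e.g., a mean
```

Note that the thermometer code preserves the ordering (a larger value switches on a superset of units), whereas the 1-out-of-k code makes all values equidistant, which is why the two variable types are encoded differently.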

CONCLUSION

As explained at length in other chapters, derivative-based learning algorithms make a number of assumptions about the local error surface and its differentiability. In addition, the existence of local minima is often neglected or overlooked entirely. In fact, the possibility of getting caught in these minima is more often than not circumvented by multiple runs of the algorithm (that is, multiple restarts from different initial points in weight space). This “sampling” procedure is actually an implementation of a very naïve stochastic process. A global training algorithm for neural networks is the evolutionary algorithm, a stochastic search training algorithm based on the mechanics of natural genetics and biological evolution. It requires information from the objective function, but not from the gradient vector or the Hessian matrix, and thus it is a zero-order method. On the other hand, there is an emerging need to devise neuron models that properly handle different data types, as is done in support vector machines (Shawe-Taylor & Cristianini, 2004), where kernel design is a current research topic.

REFERENCES

Bäck, T. (1996). Evolutionary Algorithms in Theory and Practice. Oxford Univ. Press, New York.

Bäck, T., Fogel, D.B., Michalewicz, Z. (Eds.) (1997). Handbook of Evolutionary Computation. IOP Publishing & Oxford Univ. Press.

Balakrishnan, K., Honavar, V. (1995). Evolutionary design of neural architectures - a preliminary taxonomy and guide to literature. Technical report CS-TR-95-01. Dept. of Computer Science. Iowa State University.

Bishop, C. (1995). Neural Networks for Pattern Recognition. Clarendon Press.

Duda, R.O., Hart, P.E. (1973). Pattern classification and scene analysis. John Wiley.

Esteva, F., Godo, L., García, P. (1998). Similarity-based Reasoning. IIIA Research Report 98-31, Instituto de Investigación en Inteligencia Artificial. Barcelona, Spain.

Fiesler, E., Beale, R. (Eds.) (1997). Handbook of Neural Computation. IOP Publishing & Oxford Univ. Press.

Fogel, L.J. (1992). An analysis of evolutionary programming. In Fogel and Atmar (Eds.) Procs. of the 1st annual conf. on evolutionary programming. La Jolla, CA: Evolutionary Programming Society.

Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley.

Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. MacMillan.

Hertz, J., Krogh, A., Palmer R.G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City.

Hinton, G.E. (1989). Connectionist learning procedures. Artificial Intelligence, 40: 185-234.

Kusçu, I., Thornton, C. (1994). Design of Artificial Neural Networks using genetic algorithms: review and prospect. Technical Report of the Cognitive and Computing Sciences Dept. University of Sussex, England.

Little, R.J.A., Rubin, D.B. (1987). Statistical analysis with missing data. John Wiley.

McQueen, J. (1967). Some methods of classification and analysis of multivariate observations. In Procs. of the 5th Berkeley Symposium on Mathematics, Statistics and Probability. LeCam and Neyman (eds.), University of California Press.

Montana, D.J., Davis, L. (1989). Training Feed-Forward Neural Networks using Genetic Algorithms. In Proceedings of the 11th International Joint Conference on Artificial Intelligence. Morgan Kaufmann.

Moody, J., Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1: 281-294.

Murphy, P.M., Aha, D. (1991). UCI Repository of machine learning databases. UCI Dept. of Information and Computer Science.

Nowlan, S. (1990). Max-likelihood competition in RBF networks. Technical Report CRG-TR-90-2. Connectionist Research Group, Univ. of Toronto.

Orr, M.J.L. (1995). Introduction to Radial Basis Function Networks. Technical Report of the Centre for Cognitive Science, Univ. of Edinburgh.

Poggio, T., Girosi, F. (1989). A Theory of Networks for Approximation and Learning. AI Memo No. 1140, AI Laboratory, MIT.

Prechelt, L. (1994). Proben1: A set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21/94. Universität Karlsruhe.

Radcliffe, N.J. (1991). Genetic set recombination and its application to neural network topology optimization. Technical Report EPCC-TR-91-21. University of Edinburgh.

Shaffer, J.D., Whitley, D., Eshelman, L.J. (1992). Combination of Genetic Algorithms and Neural Networks: A Survey of the State of the Art. In Combination of Genetic Algorithms and Neural Networks. Shaffer, J.D., Whitley, D. (eds.), pp. 1-37.

Shawe-Taylor, J., Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.

Whitley, D. (1995). Genetic Algorithms and Neural Networks. In Genetic Algorithms in Engineering and Computer Science. Periaux, Galán, Cuesta (eds.), John Wiley.

Yao, X. (1993). A Review of Evolutionary Artificial Neural Networks. Intl. Journal of Intelligent Systems, 8(4): 539-567.

Zimmermann, H.J. (1992). Fuzzy set theory and its applications. Kluwer.

KEY TERMS

Architecture: The number of artificial neurons, their arrangement and connectivity.

Artificial Neural Network: Information processing structure without global or shared memory that takes the form of a directed graph where each of the computing elements (“neurons”) is a simple processor with internal and adjustable parameters, that operates only when all its incoming information is available.

Evolutionary Algorithm: A computer simulation in which a population of individuals (abstract representations of candidate solutions to an optimization problem) are stochastically selected, recombined, mutated, and then removed or kept, based on their relative fitness to the problem.

Feed-Forward Artificial Neural Network: Artificial Neural Network whose graph has no cycles.

Learning Algorithm: Method or algorithm by virtue of which an Artificial Neural Network develops a representation of the information present in the learning examples, by modification of the weights.

Neuron Model: The computation of an artificial neuron, expressed as a function of its input and its weight vector and other local information.

Weight: A free parameter of an Artificial Neural Network, that can be modified through the action of a Learning Algorithm to obtain desired responses to certain input stimuli.


Learning Nash Equilibria in Non-Cooperative Games Alfredo Garro University of Calabria, Italy

INTRODUCTION

Game Theory (Von Neumann & Morgenstern, 1944) is a branch of applied mathematics and economics that studies situations (games) where self-interested interacting players act to maximize their returns; the return of each player therefore depends on his behaviour and on the behaviours of the other players. Game Theory, which plays an important role in the social and political sciences, has recently drawn attention in new academic fields ranging from algorithmic mechanism design to cybernetics. However, a fundamental problem to solve in order to apply Game Theory effectively in real-world applications is the definition of well-founded solution concepts of a game and the design of efficient algorithms for their computation. A widely accepted solution concept for a game in which any cooperation among the players must be self-enforcing (non-cooperative game) is the Nash Equilibrium. In particular, a Nash Equilibrium is a set of strategies, one for each player of the game, such that no player can benefit by changing his strategy unilaterally, i.e. while the other players keep their strategies unchanged (Nash, 1951). The problem of computing Nash Equilibria in non-cooperative games is considered one of the most important open problems in Complexity Theory (Papadimitriou, 2001). Daskalakis, Goldberg, and Papadimitriou (2005) showed that the problem of computing a Nash equilibrium in a game with four or more players is complete for the complexity class PPAD (Polynomial Parity Argument, Directed version) (Papadimitriou, 1991); moreover, Chen and Deng extended this result to 2-player games (Chen & Deng, 2005). However, even in the two-player case, the best algorithm known has an exponential worst-case running time (Savani & von Stengel, 2004); furthermore, if the computation of equilibria with simple additional properties is required, the problem immediately becomes NP-hard (Bonifaci, Di Iorio, & Laura, 2005) (Conitzer & Sandholm, 2003) (Gilboa & Zemel, 1989) (Gottlob, Greco, & Scarcello, 2003).

Motivated by these results, recent studies have dealt with the problem of efficiently computing Nash Equilibria by exploiting approaches based on the concepts of learning and evolution (Fudenberg & Levine, 1998) (Maynard Smith, 1982). In these approaches the Nash Equilibria of a game are not statically computed but are the result of the evolution of a system composed of agents playing the game. In particular, after several rounds each agent will learn to play a strategy that, under the hypothesis of agent rationality, will be one of the Nash equilibria of the game (Benaim & Hirsch, 1999) (Carmel & Markovitch, 1996). This article presents SALENE, a Multi-Agent System (MAS) for learning Nash Equilibria in non-cooperative games, which is based on the above-mentioned concepts.

BACKGROUND

An n-person strategic game G can be defined as a tuple G = (N; (Ai)i∈N; (ri)i∈N), where N = {1, 2, … , n} is the set of players, Ai is a finite set of actions for player i∈N, and ri : A1 × … × An → ℜ is the payoff function of player i. The set Ai is also called the set of pure strategies of player i. The Cartesian product ×i∈N Ai = A1 × … × An can be denoted by A, and r : A → ℜ^N can denote the vector-valued function whose ith component is ri, i.e., r(a) = (r1(a), … , rn(a)), so it is possible to write (N, A, r) as shorthand for (N; (Ai)i∈N; (ri)i∈N). For any finite set Ai, the set of all probability distributions on Ai can be denoted by Δ(Ai). An element σi ∈ Δ(Ai) is a mixed strategy for player i.

A (Nash) equilibrium of a strategic game G = (N, A, r) is an N-tuple of (mixed) strategies σ = (σi)i∈N, σi ∈ Δ(Ai), such that for every i ∈ N and any other strategy of player i, τi ∈ Δ(Ai), ri(τi,σ-i) ≤ ri(σi,σ-i), where ri also denotes the expected payoff to player i in the mixed extension of the game and σ-i represents the mixed strategies in σ of all the other players. Basically, supposing that all the other players do not change their strategies, it is not possible for any player i to play a different strategy τi that yields a better payoff than that gained by playing σi. σi is called a Nash equilibrium strategy for player i. In 1951 J. F. Nash proved that a strategic (non-cooperative) game G = (N, A, r) has at least one (Nash) equilibrium σ (Nash, 1951); in his honour, the computational problem of finding such equilibria is known as NASH (Papadimitriou, 1994).
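The equilibrium condition ri(τi,σ-i) ≤ ri(σi,σ-i) can be checked mechanically for a small two-player game. The following is a minimal sketch, not part of SALENE: the function names and the matching-pennies payoff matrices are illustrative assumptions. It tests only pure deviations, which is sufficient because the expected payoff is linear in τi.

```python
def expected_payoff(payoff, sigma_row, sigma_col):
    """Expected payoff of one player under a pair of mixed strategies."""
    return sum(
        sigma_row[a] * sigma_col[b] * payoff[a][b]
        for a in range(len(sigma_row))
        for b in range(len(sigma_col))
    )


def is_nash_equilibrium(r1, r2, sigma1, sigma2, tol=1e-9):
    """Return True if no player gains by deviating unilaterally.

    r1[a][b] and r2[a][b] are the payoffs of players 1 and 2 when
    player 1 plays a and player 2 plays b (hypothetical encoding).
    """
    v1 = expected_payoff(r1, sigma1, sigma2)
    v2 = expected_payoff(r2, sigma1, sigma2)
    m1, m2 = len(sigma1), len(sigma2)
    for a in range(m1):  # pure deviations of player 1
        pure = [1.0 if k == a else 0.0 for k in range(m1)]
        if expected_payoff(r1, pure, sigma2) > v1 + tol:
            return False
    for b in range(m2):  # pure deviations of player 2
        pure = [1.0 if k == b else 0.0 for k in range(m2)]
        if expected_payoff(r2, sigma1, pure) > v2 + tol:
            return False
    return True


# Matching pennies: player 1 wins on a match, player 2 on a mismatch.
MP_R1 = [[1, -1], [-1, 1]]
MP_R2 = [[-1, 1], [1, -1]]
```

With these matrices, the uniform profile ([0.5, 0.5], [0.5, 0.5]) passes the check, while the pure profile ([1, 0], [1, 0]) fails because player 2 would gain by switching.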

SOFTWARE AGENTS FOR LEARNING NASH EQUILIBRIA

SALENE was conceived as a system for learning at least one Nash Equilibrium of a non-cooperative game given in the form G = (N; (Ai)i∈N; (ri)i∈N). In particular, the system asks the user for:

• the number n of the players, which defines the set of players N = {1, 2, … , n};
• for each player i∈N, the related finite set of pure strategies Ai and his payoff function ri : A1 × … × An → ℜ;
• the number k of times the players will play the game.

Figure 1. The class diagram of SALENE (diagram not reproduced; classes shown: FIPAAgent, ManagerAgent, RefereeAgent, PlayerAgent, ManagerBehaviour, RefereeBehaviour, PlayerBehaviour, GameDefinition)

Then, the system creates n agents, one associated with each player, and a referee. The agents will play the game G k times; after each match, each agent will decide the strategy to play in the next match so as to maximise his expected utility on the basis of his beliefs about the strategies that the other agents are adopting. By analyzing the behaviour of each agent in all the k matches of the game, SALENE presents to the user an estimate of a Nash Equilibrium of the game. The Agent paradigm is a “natural” way of modelling and implementing the proposed solution, as the solution is characterized by several interacting autonomous entities (players) which try to achieve their goals (maximising their returns).

The class diagram of SALENE is shown in Figure 1. The Manager Agent interacts with the user and is responsible for the global behaviour of the system. In particular, after having obtained from the user the input parameters G and k, the Manager Agent creates both n Player Agents and a Referee Agent that coordinates and monitors the behaviours of the players. The Manager Agent sends to all the agents the definition G of the game and then asks the Referee Agent to orchestrate k matches of the game G. In each match, the Referee Agent asks each Player Agent which pure strategy he has decided to play; then, after having acquired the strategies from all players, the Referee Agent communicates to each Player Agent both the strategies played and the payoffs gained by all players. After k matches of the game G have been played, the Referee Agent communicates all the data about the played matches to the Manager Agent, which analyses it and presents the obtained results to the user.

A Player Agent is a rational player that, given the game definition G, acts to maximise his expected utility in each single match of G without considering the overall utility that he could obtain in a set of matches.
In particular, the behaviour of the Player Agent i can be described by the following main steps:

1. In the first match the Player Agent i chooses to play a pure strategy randomly generated considering all the pure strategies playable with the same probability: if |Ai|=m the probability of choosing a pure strategy s∈Ai is 1/m;

2. The Player Agent i waits for the Referee Agent to ask him which strategy he wants to play, then he communicates to the Referee Agent the chosen pure strategy as computed in step 1 if he is playing his first match or in step 4 otherwise;

3. The Player Agent waits for the Referee Agent to communicate to him both the pure strategies played and the payoffs gained by all players;

4. The Player Agent decides the mixed strategy to play in the next match. In particular, the Player Agent updates his beliefs about the mixed strategies currently adopted by the other players and consequently recalculates the strategy able to maximise his expected utility. Basically, the Player Agent i tries to find the strategy σi ∈ Δ(Ai) such that for any other strategy τi ∈ Δ(Ai), ri(τi,σ-i) ≤ ri(σi,σ-i), where ri denotes his expected payoff and σ-i represents his beliefs about the mixed strategies currently adopted by all the other players, i.e. σ-i=(σj)j∈N,j≠i, σj ∈ Δ(Aj). In order to evaluate σj for each other player j≠i, the Player Agent i considers the pure strategies played by the player j in all the previous matches and computes the frequency of each pure strategy; this frequency distribution will be the estimate for σj. If there is at least one element in the newly computed set σ-i=(σj)j∈N,j≠i that differs from the set σ-i as computed in the previous match, the Player Agent i solves the inequality ri(τi,σ-i) ≤ ri(σi,σ-i), which is equivalent to solving the optimization problem P={max(ri(σi,σ-i)), σi∈Δ(Ai)}. It is worth noting that P is a linear optimization problem: given the set σ-i, ri(σi,σ-i) is a linear objective function in σi (see the game definition reported in the Background Section), and with |Ai|=m, σi∈Δ(Ai) is a vector χ ∈ ℜ^m such that Σs∈M χs=1 and χs≥0 for every s∈M, where M = {1, … , m}, so the constraint σi∈Δ(Ai) is a set of m+1 linear constraints (one equality and m inequalities). P is solved by the Player Agent by using an efficient method for solving problems in linear programming, in particular the predictor-corrector method of Mehrotra (1992), whose complexity is polynomial for both average and worst case. The obtained solution for σi is a pure strategy because it is one of the vertices of the polytope which defines the feasible region for P. The obtained strategy σi will be played by the Player Agent i in the next match; ri(σi,σ-i) represents the expected payoff to player i in the next match;

5. Back to step 2.
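The belief update and best-response computation of step 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the SALENE implementation: the function names and the data layout (a history of action profiles, a payoff callable) are hypothetical, and instead of the interior-point method of Mehrotra (1992) the sketch simply enumerates the pure strategies of player i, which yields the same maximum because the optimum of P lies at a vertex of Δ(Ai).

```python
from collections import Counter
from itertools import product


def empirical_beliefs(history, player, action_sets):
    """Estimate sigma_j for every j != player from the played profiles.

    history is a non-empty list of action profiles (tuples), one per match.
    """
    beliefs = {}
    for j, actions in enumerate(action_sets):
        if j == player:
            continue
        counts = Counter(profile[j] for profile in history)
        beliefs[j] = {a: counts[a] / len(history) for a in actions}
    return beliefs


def expected_payoff(payoff_i, player, own_action, beliefs, action_sets):
    """Expected payoff of own_action against the believed mixed strategies."""
    others = [j for j in range(len(action_sets)) if j != player]
    total = 0.0
    for combo in product(*(action_sets[j] for j in others)):
        prob = 1.0
        for j, a in zip(others, combo):
            prob *= beliefs[j][a]
        if prob == 0.0:
            continue
        profile = [None] * len(action_sets)
        profile[player] = own_action
        for j, a in zip(others, combo):
            profile[j] = a
        total += prob * payoff_i(tuple(profile))
    return total


def best_response(payoff_i, player, beliefs, action_sets):
    """Pure strategy maximising expected payoff (ties go to the first listed)."""
    return max(
        action_sets[player],
        key=lambda s: expected_payoff(payoff_i, player, s, beliefs, action_sets),
    )


# Hypothetical two-player coordination game used only for illustration.
ACTIONS = [["A", "B"], ["A", "B"]]
HISTORY = [("A", "B"), ("B", "B"), ("A", "B")]


def coordination_payoff(profile):
    return {("A", "A"): 2, ("B", "B"): 1}.get(profile, 0)
```

In this example player 1 (index 1) has always played "B", so player 0 believes σ1 puts probability 1 on "B" and the best response of player 0 is "B".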

It is worth noting that a Player Agent, in order to choose the mixed strategy to play in each match of G, does not need to know the payoff functions of the other players; in fact, for solving the optimization problem P it only needs to consider the strategies which have been played by the other players in all the previous matches.

The Manager Agent receives from the Referee Agent all the data about the k matches of the game G and computes an estimate of a Nash Equilibrium of G, i.e. an N-tuple σ=(σi)i∈N, σi∈Δ(Ai). In particular, in order to estimate σi (the Nash equilibrium strategy of the player i), the Manager Agent computes, on the basis of the pure strategies played by the player i in each of the k matches, the frequency of each pure strategy: this frequency distribution will be the estimate for σi. The set σ=(σi)i∈N, σi∈Δ(Ai) computed in this way is then presented to the user together with the data exploited for its estimation.

SALENE has been implemented using JADE (Bellifemine, Poggi, & Rimassa, 2001), a software framework allowing for the development of multi-agent systems and applications conforming to FIPA standards (FIPA, 2006), and tested on different games that differ from each other both in the number and in the kind of Nash Equilibria. The experiments have demonstrated that:

• if the game has p>=1 Pure Nash Equilibria and s>=0 Mixed Nash Equilibria, the agents converge to playing one of the p Pure Nash Equilibria; in these cases, as the behaviour of each Player Agent converges with probability one to a Nash Equilibrium of the game, the learning process converges in behaviours to equilibrium (Foster & Young, 2003);

• if the game has only Mixed Nash Equilibria, while the behaviour of the Player Agents does not converge to an equilibrium, the time-average behaviour, i.e. the empirical frequency with which each player chooses his strategy, may converge to one of the mixed Nash Equilibria of the game; that is, the learning process may converge in time average to equilibrium (Foster & Young, 2003).
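The two kinds of convergence reported above can be reproduced on toy games with a short simulation. The sketch below is an assumption-laden illustration rather than the SALENE code: it handles only two players, replaces the agents and the referee with a single loop, and fixes the first action profile instead of drawing it uniformly at random so that the run is repeatable. The returned frequencies correspond to the estimate computed by the Manager Agent.

```python
def fictitious_play(r1, r2, k, first=(0, 0)):
    """Play a two-player game k times; return each player's empirical frequencies.

    r1[a][b] and r2[a][b] are the payoffs of players 1 and 2 when player 1
    plays a and player 2 plays b. `first` is the (assumed) opening profile.
    """
    m1, m2 = len(r1), len(r1[0])
    counts1, counts2 = [0] * m1, [0] * m2
    a, b = first
    for _ in range(k):
        counts1[a] += 1
        counts2[b] += 1
        # Best responses to the opponent's counts; counts are proportional
        # to the empirical frequencies, so the argmax is the same.
        next_a = max(
            range(m1),
            key=lambda s: sum(counts2[t] * r1[s][t] for t in range(m2)),
        )
        next_b = max(
            range(m2),
            key=lambda t: sum(counts1[s] * r2[s][t] for s in range(m1)),
        )
        a, b = next_a, next_b
    return [c / k for c in counts1], [c / k for c in counts2]


# Coordination game: two Pure Nash Equilibria, (0, 0) and (1, 1).
COORD = [[2, 0], [0, 1]]
# Matching pennies: only a Mixed Nash Equilibrium (uniform play).
MP_R1 = [[1, -1], [-1, 1]]
MP_R2 = [[-1, 1], [1, -1]]
```

Starting from profile (0, 0), the coordination game stays at the pure equilibrium (0, 0) in every match (convergence in behaviours). In matching pennies the played profile keeps cycling, but the frequencies returned after many matches are close to the uniform mixed equilibrium (convergence in time average).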

In the next Section the main aspects related to the convergence properties of the approach exploited by the SALENE agents for learning Nash Equilibria are discussed within a more general discussion of current and future research efforts.

FUTURE TRENDS

Innovative approaches such as SALENE, based on the concepts of learning and evolution, have shown great potential for modelling and efficiently solving non-cooperative games. However, as the solutions of the games (e.g. Nash Equilibria) are not statically computed but are the result of the evolution of a system composed of interacting agents, there are several open problems, mainly related to the accuracy of the provided solution, that need to be tackled to allow these approaches to be widely exploited in concrete business applications. The approach exploited in SALENE, which derives from the Fictitious Play (Robinson, 1951) approach, efficiently solves the problem of learning a Nash Equilibrium in non-cooperative games which have at least one Pure Nash Equilibrium: in such a case the behaviour of the players exactly converges to one of the Pure Nash Equilibria of the game (convergence in behaviours to equilibrium). On the contrary, if the game has only Mixed Nash Equilibria, the convergence of the learning algorithm is not ensured. Determining ex ante whether this case occurs is quite costly, as it requires solving the following problem: “Determining whether a strategic game has only Mixed Nash Equilibria”, which is equivalent to: “Determining whether a strategic game does not have any Pure Nash Equilibria”. This problem is Co-NP complete as its complement “Determining whether a strategic game has a Pure Nash Equilibrium” is NP complete (Gottlob, Greco, & Scarcello, 2003). As witnessed by the conducted experiments, when a game has only Mixed Nash Equilibria there are still some cases in which, while the behaviour of the players does not converge to an equilibrium, the time-average behaviour, i.e. the empirical frequency with which each player chooses his strategy, converges to one of the Mixed Nash Equilibria of the game (convergence in time average to equilibrium).
Nevertheless, there are some cases in which there is neither convergence in behaviours nor convergence in time average to equilibrium; an example of such a case is the fashion game of Shapley (1964). An important open problem is then the characterization of the classes of games for which the learning algorithm adopted in SALENE converges; more specifically, the classes of games for which the algorithm: (a) converges in behaviours to equilibrium (which implies convergence in time average to equilibrium); (b) converges only in time average to equilibrium; (c) converges neither in behaviours nor in time average. Currently, it has been demonstrated that the algorithm converges in behaviours or in time average to equilibrium for the following classes of games:

• zero-sum games (Robinson, 1951);
• games which are solvable by iterated elimination of strictly dominated strategies (Nachbar, 1990);
• potential games (Monderer & Shapley, 1996);
• 2xN games, i.e. games with 2 players, 2 strategies for one player and N strategies for the other player (Berger, 2005).

Future efforts will be geared towards: (i) completing the characterization of the classes of games for which the learning algorithm adopted in SALENE converges and evaluating the complexity of solving the membership problem for such classes; (ii) evaluating different learning algorithms and their convergence properties; (iii) letting the user ask for the computation of Nash Equilibria with simple additional properties. More generally, a wide adoption of the emerging agent-based approaches for solving games which model concrete business applications will depend on the accuracy and the convergence properties of the provided solutions; both aspects still need to be fully investigated.
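For a small game whose payoff table is written out explicitly, the question of whether a Pure Nash Equilibrium exists can be answered by brute force. The following sketch is only an illustration under assumptions: the function name and the dictionary encoding of payoffs are hypothetical. It visits every action profile, so its cost grows with the size of the payoff table, which is itself exponential in the number of players.

```python
from itertools import product


def pure_nash_equilibria(action_sets, payoffs):
    """List the action profiles from which no player gains by deviating alone.

    payoffs[i] maps an action profile (tuple) to the payoff of player i.
    """
    equilibria = []
    for profile in product(*action_sets):
        stable = True
        for i, actions in enumerate(action_sets):
            current = payoffs[i][profile]
            for alt in actions:
                deviated = profile[:i] + (alt,) + profile[i + 1:]
                if payoffs[i][deviated] > current:
                    stable = False
                    break
            if not stable:
                break
        if stable:
            equilibria.append(profile)
    return equilibria


ACTIONS = [(0, 1), (0, 1)]
# Matching pennies (zero-sum): no pure equilibrium exists.
MP_1 = {(0, 0): 1, (0, 1): -1, (1, 0): -1, (1, 1): 1}
MP_2 = {profile: -value for profile, value in MP_1.items()}
# Coordination game: both players receive the same payoff.
COORD = {(0, 0): 2, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```

An empty result, as for matching pennies, signals the case in which only convergence in time average can be hoped for; a non-empty result, as for the coordination game, signals the case in which convergence in behaviours was observed in the experiments.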

CONCLUSION

The complexity of NASH, the problem of computing Nash Equilibria in non-cooperative games, is still debated, but even in the two-player case, the best known algorithm has an exponential worst-case running time. SALENE, the proposed MAS for learning Nash Equilibria in non-cooperative games, can be conceived as a heuristic and efficient method for computing at least one Nash Equilibrium in a non-cooperative game represented in its normal form; indeed, the learning algorithm adopted by the Player Agents has a polynomial running time for both average and worst case. SALENE can then be fruitfully exploited for efficiently solving non-cooperative games which model interesting concrete problems, ranging from classical economic and finance problems to the emerging ones related to the economic aspects of the Internet such as TCP/IP congestion, selfish routing, and algorithmic mechanism design.

REFERENCES

Bellifemine, F., Poggi, A., & Rimassa, G. (2001). Developing multi-agent systems with a FIPA-compliant agent framework. Software Practice and Experience, 31(2), 103-128.

Benaim, M., & Hirsch, M.W. (1999). Learning process, mixed equilibria and dynamic system arising for repeated games. Games and Economic Behavior, 29, 36-72.

Berger, U. (2005). Fictitious Play in 2xN Games. Journal of Economic Theory, 120, 139-154.

Bonifaci, V., Di Iorio, U., & Laura, L. (2005). On the complexity of uniformly mixed Nash equilibria and related regular subgraph problems. Proceedings of the 15th International Symposium on Fundamentals of Computation Theory, 197-208.

Carmel, D., & Markovitch, S. (1996). Learning Models of Intelligent Agents. Proceedings of the 13th National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference, Vol. 2, 62-67, Menlo Park, California: AAAI Press.

Chen, X., & Deng, X. (2005). Settling the Complexity of 2-Player Nash-Equilibrium. Electronic Colloquium on Computational Complexity, Report No. 140.

Conitzer, V., & Sandholm, T. (2003). Complexity results about Nash equilibria. Proceedings of the 18th International Joint Conference on Artificial Intelligence, 765-771.

Daskalakis, C., Goldberg, P.W., & Papadimitriou, C.H. (2005). The Complexity of Computing a Nash Equilibrium. Electronic Colloquium on Computational Complexity, Report No. 115.

FIPA (2006). Foundation for Intelligent Physical Agents, http://www.fipa.org.

Foster, D.P., & Young, P.H. (2003). Learning, Hypothesis Testing, and Nash Equilibrium. Games and Economic Behavior, 45(1), 73-96.

Fudenberg, D., & Levine, D. (1998). The Theory of Learning in Games. Cambridge, MA: MIT Press.

Gilboa, I., & Zemel, E. (1989). Nash and correlated equilibria: some complexity considerations. Games and Economic Behavior, 1(1), 80-93.

Gottlob, G., Greco, G., & Scarcello, F. (2003). Pure Nash Equilibria: hard and easy games. Proceedings of the 9th Conference on Theoretical Aspects of Rationality and Knowledge, 215-230.

Maynard Smith, J. (1982). Evolution and the Theory of Games. Cambridge University Press.

Mehrotra, S. (1992). On the Implementation of a Primal-dual Interior Point Method. SIAM Journal on Optimization, 2, 575-601.

Monderer, D., & Shapley, L.S. (1996). Fictitious Play Property for Games with Identical Interests. Journal of Economic Theory, 68, 258-265.

Nachbar, J. (1990). Evolutionary Selection Dynamics in Games: Convergence and Limit Properties. International Journal of Game Theory, 19, 59-89.

Nash, J.F. (1951). Non-cooperative games. Annals of Mathematics, 54, 289-295.

Papadimitriou, C.H. (1991). On inefficient proofs of existence and complexity classes. Proceedings of the 4th Czechoslovakian Symposium on Combinatorics.

Papadimitriou, C.H. (1994). On the complexity of the parity argument and other inefficient proofs of existence. Journal of Computer and Systems Sciences, 48(3), 498-532.

Papadimitriou, C.H. (2001). Algorithms, Games and the Internet. Proceedings of the 33rd Annual ACM Symposium on the Theory of Computing, 749-753.

Robinson, J. (1951). An Iterative Method of Solving a Game. Annals of Mathematics, 54, 296-301.

Savani, R., & von Stengel, B. (2004). Exponentially many steps for finding a Nash equilibrium in a bimatrix game. Proceedings of the 45th Symposium on Foundations of Computer Science, 258-267.

Shapley, L.S. (1964). Some topics in two-person games. Advances in Game Theory, 1-28, Dresher, M., Shapley, L.S., & Tucker, A. W. editors, Princeton University Press.

Tsebelis, G. (1990). Nested Games: rational choice in comparative politics. University of California Press.

Von Neumann, J., & Morgenstern, O. (1944). Theory of Games and Economic Behaviour. Princeton University Press.

KEY TERMS

Computational Complexity Theory: A branch of the theory of computation in computer science which studies how the running time and the memory requirements of an algorithm increase as the size of the input to the algorithm increases.

Game Theory: A branch of applied mathematics and economics that studies situations (games) where self-interested interacting players act to maximize their returns.

Heuristic: In computer science, a technique designed to solve a problem which allows for gaining computational performance or conceptual simplicity, potentially at the cost of accuracy and/or precision of the provided solutions to the problem itself.

Nash Equilibrium: A solution concept of a game where no player can benefit by changing his strategy unilaterally, i.e. while the other players keep theirs unchanged; this set of strategies and the corresponding payoffs constitute a Nash Equilibrium of the game.

NP-Hard Problems: Problems that are at least as hard as the hardest problems that can be solved by a nondeterministic Turing machine in polynomial time.

Non-Cooperative Games: Games in which any cooperation among the players must be self-enforcing.

Payoffs: Numeric representations of the utility obtainable by a player in the different outcomes of a game.


Learning-Based Planning

Sergio Jiménez Celorrio, Universidad Carlos III de Madrid, Spain
Tomás de la Rosa Turbides, Universidad Carlos III de Madrid, Spain

INTRODUCTION

Automated Planning (AP) studies the generation of action sequences for problem solving. A problem in AP is defined by a state-transition function describing the dynamics of the world, the initial state of the world and the goals to be achieved. According to this definition, AP problems seem to be easily tackled by searching for a path in a graph, which is a well-studied problem. However, the graphs resulting from AP problems are so large that explicitly specifying them is not feasible. Thus, different approaches have been tried to address AP problems. Since the mid-1990s, new planning algorithms have enabled the solution of practical-size AP problems. Nevertheless, domain-independent planners still fail in solving complex AP problems, as solving planning tasks is a PSPACE-complete problem (Bylander, 1994).

How do humans cope with this planning-inherent complexity? One answer is that our experience allows us to solve problems more quickly; we are endowed with learning skills that help us plan when problems are selected from a stable population. Inspired by this idea, the field of learning-based planning studies the development of AP systems able to modify their performance according to previous experiences.

Since its first days, Artificial Intelligence (AI) has been concerned with the problem of Machine Learning (ML). As early as 1959, Arthur L. Samuel developed a prominent program that learned to improve its play in the game of checkers (Samuel, 1959). It is hardly surprising that ML has often been used to make changes in systems that perform tasks associated with AI, such as perception, robot control or AP. This article analyses the diverse ways ML can be used to improve AP processes. First, we review the major AP concepts and summarize the main research done in learning-based planning. Second, we describe current trends in applying ML to AP. Finally, we comment on the next avenues for combining AP and ML and conclude.

BACKGROUND

The languages for representing AP tasks are typically based on extensions of first-order logic. They encode tasks using a set of actions that represents the state-transition function of the world (the planning domain) and a set of first-order predicates that represent the initial state together with the goals of the AP task (the planning problem). In the early days of AP, STRIPS was the most popular representation language. In 1998 the Planning Domain Definition Language (PDDL) was developed for the first International Planning Competition (IPC), and since then it has become the standard language for the AP community. In PDDL (Fox & Long, 2003), an action in the planning domain is represented by: (1) the action preconditions, a list of predicates indicating the facts that must be true for the action to be applicable, and (2) the action postconditions, typically separated into add and delete lists, which are lists of predicates indicating the changes in the state after the action is applied.

Before the mid-1990s, automated planners could only synthesize plans of no more than 10 actions in an acceptable amount of time. During those years, planners strongly depended on speedup techniques for solving AP problems. Therefore, the application of search control became a very popular solution to accelerate planning algorithms. In the late 1990s, a significant scale-up in planning took place due to the appearance of the reachability planning graphs (Blum & Furst, 1995) and the development of powerful domain-independent heuristics (Hoffman & Nebel, 2001) (Bonet & Geffner, 2001). Planners using these approaches could often synthesize 100-action plans in just seconds.
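The representation of an action by preconditions, an add list and a delete list can be illustrated with a small sketch. The class below is an illustrative assumption, not PDDL syntax or any planner's API: states are sets of ground facts written as strings, and the blocks-world action pickup(a) is a hypothetical example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS-style action over states that are sets of facts."""

    name: str
    preconditions: frozenset
    add_list: frozenset
    delete_list: frozenset

    def applicable(self, state):
        # Every precondition must hold in the state.
        return self.preconditions <= state

    def apply(self, state):
        # Remove the delete list, then add the add list.
        if not self.applicable(state):
            raise ValueError(f"{self.name} is not applicable")
        return (state - self.delete_list) | self.add_list


# Hypothetical blocks-world action: pick block a up from the table.
pickup_a = Action(
    name="pickup(a)",
    preconditions=frozenset({"clear(a)", "ontable(a)", "handempty"}),
    add_list=frozenset({"holding(a)"}),
    delete_list=frozenset({"clear(a)", "ontable(a)", "handempty"}),
)

state = frozenset(
    {"clear(a)", "ontable(a)", "handempty", "ontable(b)", "clear(b)"}
)
next_state = pickup_a.apply(state)
```

After the action is applied, the facts about block b are untouched, the three deleted facts are gone, and holding(a) has been added; applying pickup(a) again fails because its preconditions no longer hold.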



At present, there is no such dependence on ML for solving AP problems, but there is a renewed interest in applying ML to AP motivated by three factors: (1) IPC-2000 showed that knowledge-based planners significantly outperform domain-independent planners. The development of ML techniques that automatically define the kind of knowledge that humans put in these planners would bring great advances to the field. (2) Domain-independent planners are still not able to cope with real-world complex problems. Instead, these problems are often solved by defining ad hoc planning strategies by hand. ML promises to be a solution to automatically defining these strategies. (3) There is a need for tools that assist in the definition, validation and maintenance of planning-domain models. At the moment, these processes are still done by hand.

LEARNING-BASED PLANNING

This section describes the current ML techniques for improving the performance of planning systems. These techniques are grouped according to the target of learning: search control, domain-specific planners, or domain models.

Learning Search Control

Domain-independent planners require high search effort, so search-control knowledge is frequently used to reduce this effort. Hand-coded control knowledge has proved to be useful in many domains; however, it is difficult for humans to formalize, as it requires specific knowledge of the planning domains and the planner structure. Since AP’s early days, diverse ML techniques have been developed with the aim of automatically learning search-control knowledge. A few examples of these techniques are macro-actions (Fikes, Hart & Nilsson, 1972), control-rules (Borrajo & Veloso, 1997), and case-based and analogical planning (Veloso, 1994).

At present, most of the state-of-the-art planners are based on heuristic search over the state space (12 of the 20 participants in IPC-2006 used this approach). These planners achieve impressive performance in many domains and problems, but their performance strongly depends on the definition of a good domain-independent heuristic function. These heuristics are computed by solving a simplified version of the planning task, which ignores the delete list of actions. The solution to the simplified task is taken as the estimated cost for reaching the task goals. These kinds of heuristics provide good guidance across a wide range of different domains. However, they have some faults: (1) in many domains, these heuristic functions vastly underestimate the distance to the goal, leading to poor guidance; (2) the computation of the heuristic values of the search nodes is too expensive; and (3) these heuristics are non-admissible, so heuristic planners do not find good solutions in terms of plan quality.

Since evaluating a search node in heuristic planning is so time consuming, De la Rosa, García-Olaya & Borrajo (2007) proposed using Case-based Reasoning (CBR) to reduce the number of explored nodes. Their approach stores sequences of abstracted state transitions related to each particular object in a problem instance. Then, with a new problem, these sequences are retrieved and re-instantiated to support a forward heuristic search, deciding the node ordering for computing its heuristic value. In recent years, other approaches have been developed to minimize the negative effects of the heuristic through ML: Botea, Enzenberger, Müller & Schaeffer (2005) learned off-line macro-actions to reduce the number of evaluated nodes by decreasing the depth of the search tree. Coles & Smith (2007) learned on-line macro-actions to escape from plateaus in the search tree without any exploration. Yoon, Fern & Givan (2006) proposed an inductive approach that corrects the domain-independent heuristic by learning a supplement to the heuristic from observations of solved problems in the domain.

All these methods for learning search-control knowledge suffer from the utility problem. Learning too much control knowledge can actually be counterproductive because the difficulty of storing and managing the information and the difficulty of determining which information to use when solving a particular problem can interfere with efficiency.
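The idea of a heuristic obtained by ignoring delete lists can be sketched as follows. This is a deliberately simplified illustration, not the heuristic of any particular planner: rather than extracting a relaxed plan, the assumed function relaxed_layers only counts how many rounds of relaxed action application are needed before every goal fact has been reached. Actions are hypothetical triples (preconditions, add list, delete list) of frozensets.

```python
def relaxed_layers(state, goals, actions):
    """Count relaxed reachability layers until all goals are reached.

    Delete lists are ignored, so facts are only ever added. Returns None
    when the goals are unreachable even in the relaxed task.
    """
    reached = set(state)
    layers = 0
    while not goals <= reached:
        new_facts = set()
        for pre, add, _delete in actions:  # _delete is deliberately unused
            if pre <= reached:
                new_facts |= add - reached
        if not new_facts:
            return None
        reached |= new_facts
        layers += 1
    return layers


# Hypothetical task: a1 consumes p to produce q, a2 turns q into r,
# and a3 needs both p and r to reach the goal g.
a1 = (frozenset({"p"}), frozenset({"q"}), frozenset({"p"}))
a2 = (frozenset({"q"}), frozenset({"r"}), frozenset())
a3 = (frozenset({"p", "r"}), frozenset({"g"}), frozenset())
ACTIONS = [a1, a2, a3]
```

On this task the relaxed estimate from state {p} to goal {g} is 3 layers, although the real task has no solution because a1 deletes p and nothing restores it. This is an extreme instance of the underestimation mentioned in point (1) above.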

Learning Domain-Specific Planners

An alternative approach to learning search control consists of learning domain-specific planning programs. These programs receive as input a planning problem of a fixed domain and return a plan that solves the problem.

The first approaches to learning domain-specific planners were based on supervised inductive learning; they used genetic programming (Spector, 1994) and decision-list learning (Khardon, 1999), but they were not able to reliably produce good results. Recently, Winner & Veloso (2003) presented a different approach based on generalizing an example plan into a domain-specific planning program and merging the resulting source code with the previous ones. Domain-specific planners are also represented as policies, i.e., pairs of a state and the preferred action to be executed in that state. Relational Reinforcement Learning (RRL) (Dzeroski, Raedt & Blockeel, 1998) has aroused interest as an efficient approach for learning policies for relational domains. RRL includes a set of learning techniques for computing the optimal policy for reaching the given goals by exploring the state space through trial and error. The major benefit of these techniques is that they can be used to solve problems whether the action model is known or not. On the other hand, since RRL does not explicitly include the task goals in the policies, new policies have to be learned every time a new goal has to be achieved, even if the dynamics of the environment have not changed. In general, domain-specific planners have to deal with the problem of generalization. These techniques build planning programs from a given set of solved problems, so they cannot theoretically guarantee solving subsequent problems.

Learning Domain Models

No matter how efficient a planner is, if it is fed with a defective domain model, it will return defective plans. Designing, encoding and maintaining a domain model is very laborious. For the time being, planners are the only tool available to assist in the development of an AP domain model, but planners are not designed specifically for this purpose. Domain model learning studies ML mechanisms to automatically acquire the planning action schemas (the action preconditions and post-conditions) from observations of action executions. Learning domain models in deterministic environments is a well-studied problem; diverse inductive learning techniques have been successfully applied to automatically define the action schemas from observations (Shen & Simon, 1989), (Benson, 1997), (Yang, Wu & Jiang, 2005), (Shahaf & Amir, 2006). In stochastic environments, this problem becomes more complex. Actions may result in innumerable different outcomes, so more elaborate approaches are required. Pasula, Zettlemoyer & Kaelbling (2004) presented the first specific algorithm to learn simple stochastic actions without conditional effects. This algorithm is based on three levels of learning: the first one consists of deterministic rule-learning techniques to induce the action preconditions; the second one relies on a search for the set of action outcomes that best fits the execution examples; and the third one consists of estimating the probability distributions over the set of action outcomes. However, stochastic planning algorithms do not need to consider all the possible action outcomes. Jimenez & Cussens (2006) proposed to learn complex action-effect models (including conditions) for only the relevant action outcomes. Thus, planners generate robust plans by covering only the most likely execution outcome while leaving others to be completed when more information is available. In deterministic environments, Shahaf & Amir (2006) introduced an algorithm that exactly learns STRIPS action schemas even if the domain is only partially observable. In stochastic environments, however, there is still no general efficient approach to learn action models.
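The following sketch illustrates, under strong simplifying assumptions, what acquiring an action schema from observed executions can look like. It is not one of the cited algorithms: it assumes fully observable ground states given as sets of facts, and both function names are hypothetical. For a deterministic action, effects are read off as state differences and preconditions are over-approximated by the facts common to all observed pre-states; for a stochastic action, outcome probabilities are estimated by relative frequency.

```python
from collections import Counter


def learn_schema(observations):
    """Induce (preconditions, add list, delete list) for one deterministic action.

    observations is a non-empty list of (pre_state, post_state) pairs.
    The learned preconditions may include irrelevant facts that happened
    to hold in every observed pre-state.
    """
    pre_states = [frozenset(pre) for pre, _ in observations]
    preconditions = frozenset.intersection(*pre_states)
    add_list = frozenset()
    delete_list = frozenset()
    for pre, post in observations:
        add_list |= frozenset(post) - frozenset(pre)
        delete_list |= frozenset(pre) - frozenset(post)
    return preconditions, add_list, delete_list


def outcome_probabilities(observations):
    """Estimate a distribution over distinct (added, deleted) outcomes."""
    counts = Counter(
        (frozenset(post) - frozenset(pre), frozenset(pre) - frozenset(post))
        for pre, post in observations
    )
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}


# Two hypothetical observations of the blocks-world action pickup(a).
OBS = [
    ({"clear(a)", "ontable(a)", "handempty", "ontable(b)"},
     {"holding(a)", "ontable(b)"}),
    ({"clear(a)", "ontable(a)", "handempty", "on(b,c)"},
     {"holding(a)", "on(b,c)"}),
]
```

From the two observations, the sketch recovers the preconditions {clear(a), ontable(a), handempty}, the add list {holding(a)} and a delete list equal to the preconditions. More observations from varied states are needed to prune facts that only coincidentally hold in every pre-state.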

FUTURE TRENDS

Since the appearance of the first PDDL version in IPC-1998, the standard planning representation language has evolved to bring together AP algorithms and real-world planning problems. Nowadays, the PDDL 3.0 version for IPC-2006 includes numeric state variables to support quality metrics, durative actions that allow explicit time representation, derived predicates to enrich the descriptions of the system states, and soft goals and trajectory constraints to express user preferences about the different possible plans without discarding valid plans. However, most of these new features are not handled by the state-of-the-art planning algorithms: the existing planners usually fail to solve problems that define quality metrics. The issue of goal and trajectory preferences has only been initially addressed. Time and resources add such extra complexity to the search process that a real-world problem becomes extremely difficult to solve. New challenges for the AP community are those related to developing new planning algorithms and heuristics to deal with these kinds of problems. As it is very difficult to find an efficient general solution, ML must play an important role in addressing these new challenges because it can be used to alleviate the complexity of the search process by exploiting regularity in the space of common problems.

Besides, the state-of-the-art planning algorithms need a detailed domain description to efficiently solve the AP task, but new applications like controlling underwater autonomous vehicles, Mars rovers, etc. imply planning in environments where the dynamics model may not be easily accessible. There is a current need for planning systems to be able to acquire information about their execution environment. Future planning systems have to include frameworks that allow the integration of the planning and execution processes together with domain modeling techniques.

Traditionally, learning-based planners are evaluated only against the same planner without learning, in order to prove their performance improvement. Additionally, these systems are not exhaustively evaluated; typically the evaluation only focuses on a very small number of domains, so these planners are usually quite fragile when encountering new domains. Therefore, the community needs a formal methodology to validate the performance of the new learning-based planning systems, including mechanisms to compare different learning-based planners. Although ML techniques improve planning systems, existing research cannot theoretically demonstrate that they will be useful in new benchmark domains. Moreover, for the time being, it is not possible to formally explain the underlying meaning of the learned knowledge (i.e., does the acquired knowledge subsume a task decomposition? a goal ordering? a solution path?). This point reveals that future research in AP and ML will also focus on theoretical aspects that address these issues.

CONCLUSION

Generic domain-independent planners are still not able to address the complexity of real planning problems. Thus, most planning systems implemented in applications require additional knowledge to solve real planning tasks. However, extracting and compiling this specific knowledge by hand is complicated. This article has described the main recent advances in developing planners successfully assisted by ML techniques. Automatically learned knowledge is useful for AP in diverse ways: it helps planners guide search processes, complete domain theories, or specify particular solutions to a particular problem. However, the learning-based planning community should focus not only on developing new learning techniques but also on defining formal mechanisms to validate their performance against other generic planners and against other learning-based planners.

REFERENCES

Benson, S. (1997). Learning Action Models for Reactive Autonomous Agents. PhD thesis, Stanford University.
Blum, A., & Furst, M. (1995). Fast planning through planning graph analysis. In C. S. Mellish (Ed.), Proceedings of the 14th International Joint Conference on Artificial Intelligence, IJCAI-95, volume 2, pages 1636–1642, Montreal, Canada, August 1995. Morgan Kaufmann.
Bonet, B., & Geffner, H. (2001). Planning as Heuristic Search. Artificial Intelligence, 129(1-2), 5–33.
Borrajo, D., & Veloso, M. (1997). Lazy Incremental Learning of Control Knowledge for Efficiently Obtaining Quality Plans. AI Review Journal, Special Issue on Lazy Learning, 11(1-5), 371–405.
Botea, A., Enzenberger, M., Müller, M., & Schaeffer, J. (2005). Macro-FF: Improving AI Planning with Automatically Learned Macro-Operators. Journal of Artificial Intelligence Research (JAIR), 24, 581–621.
Bylander, T. (1994). The computational complexity of propositional STRIPS planning. Artificial Intelligence, 69(1-2), 165–204.
Coles, A., & Smith, A. (2007). Marvin: A heuristic search planner with online macro-action learning. Journal of Artificial Intelligence Research, 28, 119–156.
De la Rosa, T., García Olaya, A., & Borrajo, D. (2007). Using Utility Cases for Heuristic Planning Improvement. Proceedings of the 7th International Conference on Case-Based Reasoning, Belfast, Northern Ireland, Springer-Verlag.
Dzeroski, S., Raedt, L. D., & Blockeel, H. (1998). Relational reinforcement learning. In International Workshop on Inductive Logic Programming, pages 11–22.
Fikes, R., Hart, P., & Nilsson, N. (1972). Learning and Executing Generalized Robot Plans. Artificial Intelligence, 3, 251–288.
Fox, M., & Long, D. (2003). PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research, 20, 61–124.
Hoffmann, J., & Nebel, B. (2001). The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14, 253–302.
Jiménez, S., & Cussens, J. (2006). Combining ILP and Parameter Estimation to Plan Robustly in Probabilistic Domains. In Conference on Inductive Logic Programming, ILP2006, Santiago de Compostela, Spain.
Khardon, R. (1999). Learning action strategies for planning domains. Artificial Intelligence, 113, 125–148.
Pasula, H., Zettlemoyer, L., & Kaelbling, L. (2004). Learning probabilistic relational planning rules. Proceedings of the Fourteenth International Conference on Automated Planning and Scheduling, ICAPS04.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 211–229.
Shahaf, D., & Amir, E. (2006). Learning partially observable action schemas. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI'06).
Shen, W., & Simon, H. (1989). Rule creation and rule learning through environmental exploration. In Proceedings of the IJCAI-89, pages 675–680.
Spector, L. (1994). Genetic programming and AI planning systems. In Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, Washington, USA, AAAI Press/MIT Press.
Veloso, M. (1994). Planning and learning by analogical reasoning. Springer Verlag.
Winner, E., & Veloso, M. (2003). Distill: Towards learning domain-specific planners by example. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 03), Washington, DC, USA.
Yang, Q., Wu, K., & Jiang, Y. (2005). Learning action models from plan examples with incomplete knowledge. In Proceedings of the 2005 International Conference on Automated Planning and Scheduling (ICAPS 2005), Monterey, CA, USA, pages 241–250.
Yoon, S., Fern, A., & Givan, R. (2006). Learning heuristic functions from relaxed plans. In International Conference on Automated Planning and Scheduling (ICAPS-2006).

KEY TERMS

Control Rule: IF-THEN rule to guide the planning search-tree exploration.

Derived Predicate: Predicate used to enrich the description of the states that is not affected by any of the domain actions. Instead, the predicate truth values are derived by a set of rules of the form if formula(x) then predicate(x).

Domain Independent Planner: Planning system that addresses problems without specific knowledge of the domain, as opposed to domain-dependent planners, which use domain-specific knowledge.

Macro-Action: Planning action resulting from combining the actions that are frequently used together in a given domain. Used as control knowledge to speed up plan generation.

Online Learning: Knowledge acquisition during a problem-solving process with the aim of improving the rest of the process.

Plateau: Portion of a planning search tree where the heuristic value of nodes is constant or does not improve.

Policy: Mapping between the world states and the preferred action to be executed in order to achieve a given set of goals.

Search Control Knowledge: Additional knowledge introduced to the planner with the aim of simplifying the search process, mainly by pruning unexplored portions of the search space or by ordering the nodes for exploration.
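To make the Macro-Action entry concrete, the following is a minimal sketch of how two ground STRIPS-style actions might be merged into a single macro-action. The `Action` class, the function names and the blocks-world facts are illustrative assumptions, not taken from any planner discussed in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    """Hypothetical ground STRIPS action: preconditions, add list, delete list."""
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset

def apply_action(state, action):
    """Return the successor state, assuming the action is applicable."""
    assert action.pre <= state, f"{action.name} is not applicable"
    return (state - action.delete) | action.add

def compose(a1, a2):
    """Macro-action 'a1 then a2', or None if a1 deletes a fact that a2 still needs."""
    still_needed = a2.pre - a1.add          # preconditions of a2 not produced by a1
    if still_needed & a1.delete:
        return None
    return Action(
        name=f"{a1.name}+{a2.name}",
        pre=a1.pre | still_needed,
        add=(a1.add - a2.delete) | a2.add,
        delete=a1.delete | a2.delete,
    )

# Illustrative blocks-world actions (assumed facts, encoded as plain strings).
pickup_a = Action("pickup(a)",
                  pre=frozenset({"clear a", "ontable a", "handempty"}),
                  add=frozenset({"holding a"}),
                  delete=frozenset({"clear a", "ontable a", "handempty"}))
stack_a_b = Action("stack(a,b)",
                   pre=frozenset({"holding a", "clear b"}),
                   add=frozenset({"on a b", "clear a", "handempty"}),
                   delete=frozenset({"holding a", "clear b"}))

macro = compose(pickup_a, stack_a_b)
state = frozenset({"clear a", "ontable a", "handempty", "clear b", "ontable b"})
print(sorted(apply_action(state, macro)))
```

Applying the macro in one step yields the same state as applying the two actions in sequence, which is why a planner can treat it as a single search step.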


A Longitudinal Analysis of Labour Market Data with SOM Patrick Rousset CEREQ, France Jean-Francois Giret CEREQ, France

INTRODUCTION

The aim of this paper is to present a typology of career paths in France drawn up with the Kohonen algorithm, and its extension to a clustering method for life history analysis based on the use of Self Organizing Maps (SOMs). Several methods have previously been presented for transforming qualitative into quantitative information so as to be able to apply clustering algorithms such as SOMs, which are based on the Euclidean distance. Our approach consists in performing a quantitative encoding based on the proximities between labor market situations across time. Using SOMs, the preservation of the topology also makes it possible to check whether this new method of encoding preserves the particularities of the life history according to our economic approach to careers. Lastly, this quantitative encoding preprocessing, which can easily be applied to methods of life history analysis, completes the set of methods extending the use of SOMs to qualitative data.

BACKGROUND

Several methods are generally used to study the dynamic aspects of careers. The first kind, which estimates reduced-form transition models, has been extensively used in labor microeconometrics, using event-history models for continuous-time data or discrete-time panel data with Markov processes. Those of the second kind, which include the method presented here, are sequence analysis methods dealing with complex information about individual labor market histories, such as the various states undergone, the duration of the spells, multiple transitions between the states, etc. The idea is to empirically generate a statistical typology of sequences by performing cluster analysis (Lebart, 2006). This method thus makes it possible to define "cluster paths" that constitute endogenous variables and can be explained in terms of individual characteristics such as gender, educational level or parental socio-economic status. The optimal matching method, which has been widely used in social science since the pioneering paper by Abbott (Abbott & Hrycak, 1990), is an attractive solution for analysing longitudinal data of this kind. The basic idea underlying this method is to take a pair of sequences and calculate the cost of transforming one into the other by performing a series of elementary operations (insertion, deletion and substitution). However, this method has been heavily criticized because it may be difficult to determine the costs of these elementary operations.

Here we adopt another strategy. First, in order to classify sequences into groups, we have defined a measure of the distance between trajectories, which is coherent with our data and with some well-known theoretical hypotheses in the field of labor economics. We then use Self Organizing Maps (the Kohonen algorithm) for classification purposes. Self Organizing Maps (see Kohonen, 2001; Fort, 2006) are known to be a powerful clustering and projection method. Since this method accounts efficiently for changes occurring with time, SOMs yield accurate predictions (see for example Cottrell, Girard & Rousset, 1998; Dablemont, Simon, Lendasse, Ruttiens, Blayo & Verleysen, 2003; Souza, Barreto & Mota, 2005). Life histories can be considered as a qualitative record of information, while SOMs are based on the Euclidean distance. Many attempts have been made to transform qualitative variables into quantitative ones, using for example the Burt description (see the KACM presentation in Cottrell & Letrémy, 1995) or multidimensional scaling (Miret, García-Lagos, Joya, Arazoza & Sandoval, 2005). In our approach, the quantitative recoding focuses on the proximity between items, considering the particularities of the data (a life history) according to our economic approach. Once the recoding preprocessing has been performed, the Self Organizing Map is a useful clustering tool, first because of its aforementioned clustering and projection qualities, and also because of its ability to reveal the efficiency of our new encoding.
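The optimal matching idea mentioned above, and the sensitivity to operation costs for which it is criticized, can be illustrated with a short sketch. This is a generic edit-distance dynamic program written for illustration; the function name, the default costs and the toy sequences are assumptions and do not reproduce the settings of Abbott & Hrycak (1990).

```python
def optimal_matching(seq_a, seq_b, indel=1.0, sub_cost=None):
    """Minimal cost of turning seq_a into seq_b by insertions, deletions, substitutions."""
    if sub_cost is None:
        # Assumed default: substituting two different states costs 2, identical states 0.
        sub_cost = lambda x, y: 0.0 if x == y else 2.0
    n, m = len(seq_a), len(seq_b)
    # d[i][j] = cost of transforming the first i items of seq_a into the first j of seq_b.
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                                   # deletion
                d[i][j - 1] + indel,                                   # insertion
                d[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]),  # substitution
            )
    return d[n][m]

# Toy monthly sequences of state codes (illustrative values only).
a = [6, 6, 2, 1]
b = [6, 2, 1, 1]
print(optimal_matching(a, b))

# The same pair of sequences under two cost settings: the distance depends on the costs.
cheap_sub = lambda x, y: 0.0 if x == y else 1.0
print(optimal_matching([1, 1, 1], [2, 2, 2]), optimal_matching([1, 1, 1], [2, 2, 2], sub_cost=cheap_sub))
```

The last line shows two different distances for the same pair of sequences, which is the practical difficulty raised in the text: the typology obtained depends on how the elementary costs are set.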

CLUSTERING LIFE HISTORY WITH SOM

An Example of a Life History

Career Paths

Labor economists have generally assumed that the beginning of a career results from a matching process (Jovanovic, 1979). Employers and job seekers lack information about each other: employers need to know how productive their potential employee is, and job applicants want to know whether the characteristics of the job correspond to their expectations. Job turnover and temporary employment contracts can therefore be viewed as the consequences of this trial-and-error process. However, individuals' first employment situations may also act as a signal of employability to the labor market. For example, a long spell of unemployment during the first years of a person's career may be interpreted by potential employers as a sign of low work efficiency, whereas working at a temp agency may be regarded as a sign of motivation and adaptability. This is consistent with the following path dependency hypothesis: the influence of past job experience on the subsequent career depends on the "cost" associated with the change of occupational situation. However, empirical studies have shown that employers mainly recruit on the basis of recent work experience (Allaire, Cahuzac & Tahar, 2000). The effects of less recent employment situations on a person's career therefore decrease over time.

Data

The data used in this study were based on the "Generation 98" survey carried out by Céreq: 22 000 young people who had left initial training in 1998 at all levels and in all training specializations were interviewed in spring 2001 and 2003 and in autumn 2005. This sample was representative of the 750 000 young people leaving the education system for the first time that year in France. This survey provided useful information about the young people's characteristics (their family's socioeconomic status, age, highest grade completed, highest grade attended, discipline, any jobs taken during their studies, work placement) and their month-by-month work history from 1998 to 2005. We therefore have a complete and detailed record of the labor market status of the respondents during the 88-month period from July 1998 to November 2005. Employment spells were coded as follows, depending on the nature of the labor contract: 1 = permanent labor contract, 2 = fixed term contract, 3 = apprenticeship contract, 4 = public temporary labor contract, 5 = interim/temping. Situations other than employment were coded as follows: 6 = unemployment, 7 = inactivity, 8 = military service, 9 = at school.
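As a small illustration of this coding, the sketch below stores one month-by-month history as a list of the nine state codes and derives its spells and its (state, time) pairs. The example trajectory and the helper names are invented for illustration; they are not survey data.

```python
from itertools import groupby

# State codes as listed in the text.
STATE_LABELS = {
    1: "permanent labor contract", 2: "fixed term contract",
    3: "apprenticeship contract", 4: "public temporary labor contract",
    5: "interim/temping", 6: "unemployment", 7: "inactivity",
    8: "military service", 9: "at school",
}

def spells(trajectory):
    """Run-length encode monthly states into (state, first_month, duration) triples."""
    out, month = [], 1
    for state, run in groupby(trajectory):
        length = len(list(run))
        out.append((state, month, length))
        month += length
    return out

def situations(trajectory):
    """List the (state, time) pairs visited, with months numbered from 1."""
    return [(state, t) for t, state in enumerate(trajectory, start=1)]

# Hypothetical 88-month history: 3 months unemployed, 12 on a fixed term
# contract, then a permanent contract until the end of the observation window.
trajectory = [6] * 3 + [2] * 12 + [1] * 73

for state, start, duration in spells(trajectory):
    print(f"month {start:2d}: {STATE_LABELS[state]} for {duration} months")
```

The (state, time) pairs returned by `situations` are the basic objects on which the distance of the next section is defined.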

Preprocessing Phase: Life History Encoding

The encoding of the trajectories involved a two-step preprocessing phase: defining a distance between states that includes time dynamics, and deriving the resulting quantitative encoding of trajectories. These two steps reflect the specific structure of life history data sets: the items of the variables (the states) are qualitative information, whereas the order of the variables records quantitative information (the timing and the duration of events).

The Distance Between Situations

Working with pairs (state, time), called situations, makes it possible to include the time dynamics in the proximities between occupational states. The proximity between two situations is measured on the basis of their common future, in line with our economic approach. A situation is regarded as a potential for its own future, depending on its influence on this future, and the similarity between two situations is deduced by comparing their respective potentials. The potential future $P_S$ of a situation S among n monthly periods and p states is defined as the p×n dimensional vector given in (1). Its components $P_{SS'}$ are the product of the terms φ and β. φ measures the flow between situation S and any situation S' as the empirical probability of reaching S' starting from S, that is, the empirical probability of an individual i being in the future situation S' conditionally on being at present in S. The coefficient of temporal inertia β weights the influence of S' on $P_S$ according to the economic approach. It is a decreasing function of the time delay (t'−t). In the career paths application, the function chosen is the inverse of the delay, and 0 for the past. Lastly, α ensures that the potential futures $P_S$ are profiles. The natural distance between situations is therefore the χ² distance between their potential future profiles.

$$P_{S=(s,t)} = \frac{1}{\alpha(t)} \Big( 0,\dots,0,\ \beta(t,t')\,\varphi_{SS'=(1,t')},\ \beta(t,t')\,\varphi_{SS'=(2,t')},\ \dots,\ \beta(t,t')\,\varphi_{SS'=(9,t')},\ \dots,\ \beta(t,T)\,\varphi_{SS'=(1,T)},\ \dots,\ \beta(t,T)\,\varphi_{SS'=(9,T)} \Big), \quad t' \ge t \tag{1}$$

where

$$\varphi_{S=(s,t),\,S'=(s',t')} = \frac{\sum_i C_S^i\, C_{S'}^i}{\sum_i C_S^i}, \qquad \beta(t,t') = \frac{1}{t'-t+1}, \qquad \alpha(t) = \sum_{S'=(s',t')} \beta(t,t')\,\varphi_{SS'},$$

and $C_S^i$ equals 1 if individual i is in situation S and 0 otherwise.

The Trajectory Encoding

In the present case of equi-weighting, the inertia of the space of situations results from the distances previously computed. The principal components of inertia, called here principal events, can therefore be deduced. The term "event" refers to a combination of the point in time, the duration and the occupational status. The quantitative encoding of trajectories proposed here results from their description in the "events" space. The process used here is in line with that of J.P. Benzécri (Benzécri, 1973): given a set of situations $\{S_j\}$, its center of gravity G and the matrix of squared distances $(d_{jj'})$ between elements $S_j$ and $S_{j'}$, one can deduce the matrix ∆ of scalar products between any vectors $GS_j$ and $GS_{j'}$ with formula (2), where $d_{j\cdot}$, $d_{\cdot j'}$ and $d_{\cdot\cdot}$ denote the row, column and overall means of the squared distances. Applied to the situations, the principal components of inertia (the principal events) are computed as the principal component vectors of the matrix ∆. Trajectories can then be described in the principal events space: performing the traditional binary encoding (3) of the trajectory $T_i$ is equivalent to performing a linear encoding through the situations (4) and then also through the principal events E (5).

$$\Delta_{jj'} = GS_j \cdot GS_{j'} = -\frac{1}{2}\left[\, d_{jj'} - d_{j\cdot} - d_{\cdot j'} + d_{\cdot\cdot} \,\right] \tag{2}$$

$$T_i = \big(\underbrace{0,\dots,0,1,0,\dots,0}_{\text{time}=1},\ \underbrace{0,\dots,0,1,0,\dots,0}_{\text{time}=2},\ \dots,\ \underbrace{0,\dots,0,1,0,\dots,0}_{\text{time}=T}\big), \quad \text{each block having 9 positions} \tag{3}$$

$$T_i = \sum_{S=(s,t)} \alpha_S^i\, \big(0,\dots,0,\underset{\uparrow\ \text{position: } s*t}{1},0,\dots,0\big) \tag{4}$$

$$T_i = \sum_{S} \alpha_S^i \Big( \sum_{e} \beta_{eS}\, E_e \Big) = \sum_{e} \gamma_e\, E_e \tag{5}$$

In (4) and (5), $\alpha_S^i$ is the coordinate of $T_i$ on situation S, $\beta_{eS}$ the coordinate of S on the principal event $E_e$, and $\gamma_e = \sum_S \alpha_S^i \beta_{eS}$ the coordinate of $T_i$ on $E_e$; these encoding coefficients are distinct from the normalization α(t) and the temporal inertia β(t,t') of (1).

CLASSIFICATION OF LIFE HISTORIES WITH SOM: A TYPOLOGY OF CAREER PATHS IN FRANCE AND ECONOMIC INTERPRETATIONS
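Before any map is trained, the preprocessing described in the text (potential futures, χ² distance between situations, scalar products by double centering, coordinates on the principal events) can be sketched in a few lines of NumPy. This is only one reading of the description, under stated assumptions: the function name, the array layout, the use of the mean profile as χ² weights and the `n_events` parameter are all illustrative choices rather than the authors' implementation.

```python
import numpy as np

def encode_trajectories(traj, n_states, n_events=5):
    """traj: (N, T) integer array of state codes 1..n_states, one row per individual.
    Returns an (N, n_events) array of coordinates in the principal events space."""
    N, T = traj.shape
    # Binary encoding: C[i, t*n_states + s] = 1 if individual i is in state s at time t.
    C = np.zeros((N, T * n_states))
    rows = np.repeat(np.arange(N), T)
    cols = np.tile(np.arange(T), N) * n_states + (traj.ravel() - 1)
    C[rows, cols] = 1.0
    # Keep only situations observed at least once (assumption: the others are dropped).
    counts = C.sum(axis=0)
    seen = counts > 0
    C, counts = C[:, seen], counts[seen]
    times = np.repeat(np.arange(T), n_states)[seen]
    # phi[S, S']: empirical probability of being in S' given being in S.
    phi = (C.T @ C) / counts[:, None]
    # beta[S, S']: inverse of the delay (t' - t + 1) for the future, 0 for the past.
    delay = times[None, :] - times[:, None]
    beta = np.where(delay >= 0, 1.0 / (np.abs(delay) + 1.0), 0.0)
    # Potential futures, normalized so that each row is a profile (role of alpha).
    P = phi * beta
    P /= P.sum(axis=1, keepdims=True)
    # Chi-square distance between profiles, weighting columns by the mean profile
    # (assumption: equal weights on situations).
    Z = P / np.sqrt(P.mean(axis=0))
    sq = (Z ** 2).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    # Double centering of the squared distances gives the scalar products.
    row_mean = D2.mean(axis=1, keepdims=True)
    Delta = -0.5 * (D2 - row_mean - row_mean.T + D2.mean())
    # Principal events: leading eigenvectors of Delta, scaled by sqrt(eigenvalue).
    eigval, eigvec = np.linalg.eigh(Delta)
    order = np.argsort(eigval)[::-1][:n_events]
    coords = eigvec[:, order] * np.sqrt(np.clip(eigval[order], 0.0, None))
    # A trajectory is the sum of the coordinates of the situations it visits.
    return C @ coords

# Toy data: 30 random trajectories over 6 months and 3 states (not survey data).
rng = np.random.default_rng(0)
toy = rng.integers(1, 4, size=(30, 6))
toy[1] = toy[0]                      # two identical histories
X = encode_trajectories(toy, n_states=3, n_events=4)
print(X.shape)
```

Each row of `X` is a numeric vector on which a Euclidean clustering method can operate; identical histories receive identical encodings.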

The result of the typology of career paths in France using Self Organizing Maps with a 10x10 grid is presented in Figure 1. In each unit of the map, a chronogram describes the characteristics of the class (the career path). Chronograms show the evolution over time of the proportional contribution (as a percentage) of each state to the classes. On the one hand, the SOM topology reflects the continuity of the evolution over time. On the other hand, similarities between situations give rise to the mixing of classes (see for example cluster 95) or to proximities on the map between two clusters (for example, between clusters 71 and 72, on the eighth line, first and second columns), although few individuals are in the same state at the same time. The map thus makes it possible to assess the efficiency of the encoding process.

Figure 1. Typology of career paths with SOM: each unit on the map gives a chronogram of the evolution over time of the proportional contribution (as a percentage) of each occupational position. The populations of two close units have similar chronograms.

The Kohonen map displays a concise vision of the types of career paths occurring during the first seven years of working life. In general, most of the clusters describe a direct school-to-work transition process, which can be characterized by immediate access to a permanent contract or by indirect access during the first few years. In the upper-left-hand corner of the map, mainly in the five clusters of the first two lines, career paths are characterized by a high level of access to employment with permanent contracts. Young people rapidly gained access to a permanent contract and kept it until the end of the observation period. In the upper-right-hand corner of the map, access to a permanent contract was less direct: during the first year, young people first obtained a temporary contract or spent ten months in compulsory military service before obtaining a permanent contract. In contrast to the upper part of the map, the bottom part describes career paths with longer periods of temporary contracts and/or unemployment. In the lower-right-hand corner, access to a permanent contract is rare during the first few years. However, more than ninety per cent of young people eventually obtained a permanent position (in the last column of the map). In the bottom lines, a five-year public policy contract called "Emploi Jeunes" features strongly instead of the classical fixed term contract. The lower-left-hand corner of the map shows more unstable trajectories which end in a temporary position: seven years after leaving school, people have a temporary contract (in the last two clusters in the first column) or are unemployed (second and third cells on the last line). The chronograms situated in the middle-left-hand part of the map show how useful the longitudinal approach is for understanding the complexity of transition processes: the young people here were directly recruited for a permanent job, but five or six years after graduating, they lost this permanent job. This turn of events can be explained by the strong change in the economic environment which occurred during this period; the years 2003 and 2004 correspond to a dramatic growth of youth unemployment on the French labor market.

What role does each individual characteristic play in the development of these career paths? Several factors may explain the labor market opportunities of school leavers: human capital factors, parents' social class, and other factors responsible for inequalities on the labor market, such as parents' nationality and gender. The distribution of these characteristics was included graphically in each cell on the map. Figure 2, which gives the distribution in terms of educational level, clearly shows that educational level strongly affected the career path. Higher education graduates feature much more frequently in the upper-left-hand corner of the map, whereas school leavers without any qualifications occur more frequently in the bottom part. Figure 3 shows a similar pattern as far as gender is concerned: there are much higher percentages of females than males in the most problematic career paths, which suggests the occurrence of gender segregation or discriminatory practices on the French labor market. The differences are less conspicuous as far as the father's nationality is concerned (Figure 4). However, the results obtained here also suggest that children with French parents have better chances than the others of finding "safe" career paths.

Figure 2. Career path typology by educational levels (dark = without qualifications; medium = secondary education graduates; light = higher education graduates).

Figure 3. Career path typology by gender (dark = male, light = female).

Figure 4. Career path typology by father's nationality (dark = foreign nationality, light = French nationality).
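For readers unfamiliar with how a Kohonen map such as the 10x10 grid used for this typology is obtained, the following is a minimal online SOM training sketch in NumPy. The decay schedules, the Gaussian neighborhood, the initialization from data points and all parameter names are common textbook choices assumed for illustration; the article does not specify the settings actually used.

```python
import numpy as np

def train_som(X, rows=10, cols=10, n_iter=5000, lr0=0.5, sigma0=None, seed=0):
    """Train a rectangular SOM on X (n_samples, dim); returns (rows*cols, dim) code vectors."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    # Grid position of each unit, numbered line by line (row-major order).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Assumed initialization: code vectors start at randomly drawn observations.
    W = X[rng.integers(0, n, size=rows * cols)].astype(float)
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    for it in range(n_iter):
        frac = it / n_iter
        lr = lr0 * (1.0 - frac)                      # decreasing learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)      # shrinking neighborhood radius
        x = X[rng.integers(0, n)]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))  # best matching unit (Euclidean)
        # Gaussian neighborhood on the grid, centered on the best matching unit.
        g = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
        W += lr * g[:, None] * (x - W)
    return W

def assign(X, W):
    """Index of the closest unit for every observation."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# Toy run on random vectors standing in for encoded trajectories (not survey data).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
W_toy = train_som(X_toy, rows=4, cols=4, n_iter=500)
labels = assign(X_toy, W_toy)
print(np.bincount(labels, minlength=16))
```

With units numbered line by line from 0, unit `k` sits at line `k // cols` and column `k % cols`; chronograms like those of Figure 1 would then be computed from the individuals assigned to each unit.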


FUTURE TRENDS

The relevance of the method presented here depends on two aspects: the preprocessing and the SOM results. The advantages of SOMs depend on the distance chosen, which must enable the associated algorithm to preserve the proximities between situations. The relevance of the preprocessing stage in the method will therefore be confirmed if it enhances the reliability of the SOMs (De Bodt, Cottrell & Verleysen, 2002, and Rousset, Guinot & Maillet, 2006, have presented a method of measuring reliability and a method of increasing it, respectively). On the other hand, the reliability also depends on the choice of the future weighting function, function β in formula (1). Consequently, function β could be determined here with a view to the reliability of the SOM results. Unfortunately, this approach may be counter-productive in some cases. In the case of career paths, for example, it would lead to giving more weight to the long-term future, which would increase the robustness but would not be suitable from the point of view of the economic investigation. This problem arises in many general contexts where the main effects of the present on the future are short-term effects, whereas the reliability increases in the long term. The main criterion used to choose function β must therefore be the topic of interest.

Further studies are now required to improve the reliability of this method: first, function β needs to be defined more closely, and secondly, the validity of the method needs to be tested after enhancing the reliability (that of the SOM topology obtained after performing the preprocessing step described above). It might also be worth investigating the use of Markovian models to define function β in particular, as well as to study career paths in general. This method will also have to be applied in the future to other samples.

CONCLUSION

The aim of this study was to analyze the early careers of French school leavers using Self Organizing Maps. This empirical analysis showed that career paths are strongly segmented. Although most of the "career paths" studied were characterized by stabilization on the labor market at some point or another, some of them show the great difficulties encountered by labor market entrants. Obtaining a permanent contract does not actually guarantee life-long employment. In addition, the econometric analysis carried out in the second part of this study shows that the diversity of career paths can be partly explained by the educational levels and individual characteristics of school leavers.

In the present method of analyzing information on individuals' trajectories in time through a finite number of states, two important aspects are combined: the encoding of the data and the analysis of the data presented in the form of SOMs. The first aspect avoids the well-known problem of skew present with qualitative encoding, including when it is linked to the evolution in time. Self Organizing Maps are a natural approach to the data analysis, since this tool combines the advantages of clustering and representation methods. The method described here turns out to be an efficient means of investigating changes with time and the proximities between situations. In addition, the preservation of the topology was found to be a useful property, which makes it possible to assess the efficiency of the recoding. In conclusion, the method presented here could easily be used to analyze any life history.

REFERENCES

Abbott, A. & Hrycak, A. (1990). Measuring resemblance in sequence data: an optimal matching analysis of musicians' careers. American Journal of Sociology, 96, 1, 144-185.
Allaire, G., Cahuzac, E. & Tahar, G. (2000). Persistance du chômage et insertion. L'Actualité économique, 76, 2, 237-263.
Benzécri, J.P. (1973). L'Analyse des Données. TIIB n°2 "Représentation Euclidienne d'un Ensemble fini de masses et de distances", Dunod, Paris, 65-95.
Cottrell, M., Girard, B. & Rousset, P. (1998). Forecasting of curves using a Kohonen classification. Journal of Forecasting, 17, 429-439.
Cottrell, M. & Letrémy, P. (1995). Classification et analyse des correspondances au moyen de l'algorithme de Kohonen: Application à l'étude de données socio-économiques. Prépublication du SAMOS, (42), University of Paris I, France.
De Bodt, E., Cottrell, M. & Verleysen, M. (2002). Statistical tools to assess the reliability of self-organizing maps. Neural Networks, 15, 967-978.
Dablemont, S., Simon, G., Lendasse, A., Ruttiens, A., Blayo, F. & Verleysen, M. (2003). Time series forecasting with SOM and local non-linear models - Application to the DAX30 index prediction. In Proceedings of the 4th Workshop on Self-Organizing Maps (WSOM'03), 340-345.
Fort, J.C. (2006). SOM's mathematics. Neural Networks, 19, 812-816.
Fougère, D. & Kamionka, T. (2005). Econometrics of Individual Labor Market Transitions. IZA Discussion Papers 1850, Institute for the Study of Labor (IZA).
Jovanovic, B. (1979). Job matching and the theory of turnover. Journal of Political Economy, 87, 5, 972-990.
Kohonen, T. (2001). Self-Organizing Maps. 3rd ed., Springer Series in Information Sciences, 30, Springer Verlag, Berlin.
Lebart, L., Morineau, M. & Piron, M. (2006). Statistique exploratoire multidimensionnelle. 4th ed., Dunod, Paris.
Miret, E., García-Lagos, F., Joya, G., Arazoza, H. & Sandoval, F. (2005). A combined multidimensional scaling + self-organizing maps method for exploratory analysis of qualitative data. Proceedings of the 5th Workshop on Self-Organizing Maps (WSOM'05), Paris, France, 711-718.
Rousset, P., Guinot, C. & Maillet, B. (2006). Understanding and Reducing Variability of SOM neighbourhood structure. Neural Networks, 19, 838-846.
Souza, L. G., Barreto, G. A. & Mota, J. C. (2005). Using the self-organizing map to design efficient RBF models for nonlinear channel equalization. Proceedings of the 5th Workshop on Self-Organizing Maps (WSOM'05), Paris, France, 339-346.

KEY TERMS

Career Paths: Sequence of monthly positions among several pre-defined working categories.

Distributional Equivalency: Property of a distance that makes it possible to group two modalities of the same variable having identical profiles into a new modality weighted with the sum of the two weights.

Markov Process: Stochastic process in which the new state of a system depends on the previous state or on a finite set of previous states.

Optimal Matching: Statistical method originating in biology that compares two sequences on the basis of predefined substitution costs.

Preservation of Topology: After learning, observations associated with the same class or with "close" classes, according to the definition of the neighborhood given by the network structure, are "close" according to the distance in the input space.

Self-Organizing Maps by Kohonen: An unsupervised neural network method of vector quantization widely used in classification. Self-Organizing Maps are much appreciated for their topology preservation property and their associated data representation system. These two properties come from a pre-defined organization of the network that is at the same time a support for the topology learning and for its representation.

χ² Distance: Distance having certain specific properties such as distributional equivalency.


Managing Uncertainties in Interactive Systems Qiyang Chen Montclair State University, USA John Wang Montclair State University, USA

INTRODUCTION

To adapt to users' input and tasks, an interactive system must be able to establish a set of assumptions about users' profiles and task characteristics, which is often referred to as a user model. However, to develop a user model, an interactive system needs to analyze users' input and recognize the tasks and the ultimate goals that users are trying to achieve, which may involve a great deal of uncertainty. In this chapter, the approaches for handling uncertainty are reviewed and analyzed. The purpose is to provide an analytical overview and perspective concerning the major methods that have been proposed to cope with uncertainties.

Approaches for Handling Uncertainties For a long time, the Bayesian model has been the primary numerical approach for representation and inference with uncertainty. Several mathematical models that are different from the probability prospective have also been proposed. The main ones are Shafer-Dempster’s Evidence Theory (Belief Function) (Shafer, 1976; Dempster, 1976) and Zadeh’s Possibility Theory (Zadeh, 1984). There have also been some attempts to handle the problem of incomplete information using classical logic. Many approaches to default reasoning logic have been proposed, and study of non-monotonic logic has gained much attention. These approaches can be classified into two categories: numerical approaches and non-numerical approaches. 1.

Probability and Bayesian Theory. There is support for the theoretical necessity and justification of using a probability framework for knowledge representation, evidence combination and propagation, learning ability, and clarity of explanation (Buchanan and Smith, 1988). Bayesian processing remains the fundamental idea underlying many new proposals that claim to handle uncertainty efficiently. In all the practical developments to date, the Bayesian formula and probability values have been used as a kind of coefficient to augment deterministic knowledge represented by production rules (Barr and Feigenbaum, 1982). Some intuitive methods for combining and propagating these values have been suggested and used. One such case is the use of Certainty Factors (CF) in MYCIN (Shortliffe and Buchanan, 1976). Rich also uses a simplified CF approach in the user modeling system GRUNDY (Rich, 1979).

However, some objections against such probabilistic methods of accounting for uncertainty have been raised (Kanal and Lemmer, 1986). One of the main objections is that these values lack any definite semantics because of the way they have been used. Using a single number to summarize uncertainty information has always been a contested issue (Heckerman, 1986). The Bayesian approach requires that each piece of evidence be conditionally independent. It has been concluded that the assumption of conditional independence of the evidence under the hypotheses is inconsistent with the other assumptions of an exhaustive and mutually exclusive space of hypotheses. Specifically, Pednault et al. (1981) show that, under these assumptions, no probabilistic update could take place if there were more than two competing hypotheses. Pearl (1985) suggests that the assumption of conditional independence of the evidence under the negation of the hypotheses is over-restrictive. For example, if the inference process contains multiple paths linking the evidence to the same hypothesis, the independence is violated. Similarly, the required mutual exclusiveness and exhaustiveness of the hypotheses are not very realistic. This assumption would not hold if more than one hypothesis occurred simultaneously, and it is as restrictive as the single-fault assumption of the simplest diagnosing systems. This assumption also requires that every possible hypothesis is known a priori. It would be violated if the problem domain were not suitable to a closed-world assumption.

Perhaps the most restrictive limitation of the Bayesian approach is its inability to represent ignorance. The Bayesian view of probability does not allow one to distinguish uncertainty from ignorance. One cannot tell whether a degree of belief was directly calculated from evidence or indirectly inferred from an absence of evidence. In addition, this method requires a large amount of data to determine the estimates for prior and conditional probabilities. Such a requirement becomes manageable only when the problem can be represented as a sparse Bayesian network that is formed by a hierarchy of small clusters of nodes. In this case, the dependencies among variables (nodes in the network) are known, and only the explicitly required conditional probabilities must be obtained (Pearl, 1988).
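The update discussed in this section can be made concrete with a minimal Python sketch. It assumes an exhaustive, mutually exclusive set of hypotheses and evidence items that are conditionally independent given each hypothesis. The function name, the hypothesis labels and all numbers are hypothetical and chosen only for illustration; they are not taken from MYCIN, GRUNDY or any other cited system.

```python
def bayes_update(prior, likelihoods):
    """Posterior over mutually exclusive, exhaustive hypotheses.

    prior: dict mapping hypothesis -> P(h)
    likelihoods: list of dicts, one per observed piece of evidence,
                 each mapping hypothesis -> P(e | h)
    Assumes the evidence items are conditionally independent given h.
    """
    unnormalised = {}
    for h, p in prior.items():
        for lik in likelihoods:
            p *= lik[h]              # multiply in each independent evidence item
        unnormalised[h] = p
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return {h: p / total for h, p in unnormalised.items()}


# Hypothetical numbers, for illustration only.
prior = {"flu": 0.5, "cold": 0.5}
fever = {"flu": 0.9, "cold": 0.2}    # P(fever | h)
cough = {"flu": 0.6, "cold": 0.7}    # P(cough | h)
posterior = bayes_update(prior, [fever, cough])
```

Note that the uniform prior in this sketch is forced to stand both for "the two hypotheses are known to be equally likely" and for "nothing is known about them", which is the inability to represent ignorance described above.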

2. The Dempster-Shafer Theory of Evidence. The Dempster-Shafer theory, proposed by Shafer (Shafer, 1976), was developed within the framework of Dempster’s work on upper and lower probabilities induced by a multi-valued mapping (Dempster, 1967). Like Bayesian theory, this theory relies on degrees of belief to represent uncertainty. However, it allows one to assign a degree of belief to subsets of hypotheses. According to the Dempster-Shafer theory, the multi-valued nature of the mapping is the fundamental reason why the well-known theorem of probability that determines the probability density of the image of a one-to-one mapping cannot be applied (Cohen, 1983). In this context, the lower probability is associated with the degree of belief and the upper probability with a degree of plausibility. This formalism defines certainty as a function that maps subsets of a proposition space onto the [0,1] scale. The sets of partial beliefs are represented by mass distributions of a unit of belief across the space of propositions. Such a distribution is called a basic probability assignment (BPA). The total certainty over the space is 1. A non-zero BPA can be given to the entire proposition space to represent the degree of ignorance. The certainty of any proposition is then represented by the interval characterized by upper and lower probabilities.

Dempster’s rule of combination normalizes the intersection of the bodies of evidence from the two sources by the amount of non-conflicting evidence between the sources. This theory is attractive for several reasons. First, it builds on classical probability theory, thus inheriting much of its theoretical foundations. Second, it seems not to over-commit by not forcing precise statements of probabilities: its probabilities do not seem to provide more information than is really available. Third, it reflects the degree of ignorance of the probability estimate. Fourth, the Dempster-Shafer theory provides rules for combining probabilities and thus for propagating measures through the system. This is also one of the most controversial points, since the propagation method is an extension of the multiplication rule for independent events. Because many applications involve dependent events, the rule might be inapplicable by classical statistical criteria. The tendency to assume that events are independent unless proven otherwise has stimulated a large proportion of the criticism of probability approaches. Dempster-Shafer theory suffers from the same problem (Bhatnagar and Kanal, 1986).

In addition, there are two problems with the Dempster-Shafer approach. The first problem is computational complexity. In the general case, the evaluation of the degree of belief and upper probability requires time exponential in the cardinality of the hypothesis set. This complexity is caused by the need to enumerate all the subsets of a given set. The second problem results from the normalization process presented in both Dempster’s and Shafer’s work. Zadeh has argued that this normalization process can lead to incorrect and counter-intuitive results (Zadeh, 1984). By removing the conflicting parts of the evidence and normalizing the remaining parts, important information may be discarded rather than utilized adequately.
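A small Python sketch may help to make the combination rule and the normalization issue concrete. It represents a basic probability assignment as a dictionary from sets of hypotheses to masses, which is an assumption made here for illustration. The hypothesis labels and numbers are hypothetical; the second example is only in the spirit of the highly conflicting cases raised by Zadeh.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule for two basic probability assignments (BPAs).

    Each BPA maps a frozenset of hypotheses to its mass; masses sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + ma * mb
        else:
            conflict += ma * mb      # mass that would fall on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Normalise by the amount of non-conflicting evidence.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


def belief(m, s):
    """Lower probability: total mass of the subsets of s."""
    return sum(v for k, v in m.items() if k <= s)


def plausibility(m, s):
    """Upper probability: total mass of the sets that intersect s."""
    return sum(v for k, v in m.items() if k & s)


A, B, C = frozenset("a"), frozenset("b"), frozenset("c")

# Ignorance: mass 0.4 is left on the whole proposition space.
m_ignorant = {A: 0.6, frozenset("abc"): 0.4}
interval_a = (belief(m_ignorant, A), plausibility(m_ignorant, A))

# Two sources that disagree almost completely.
m1 = {A: 0.99, B: 0.01}
m2 = {C: 0.99, B: 0.01}
result = combine(m1, m2)
```

In the first example the certainty of `a` is the interval from 0.6 to 1.0. In the second, nearly all of the mass is conflicting and is normalized away, so `result` places essentially all belief on `b`, the hypothesis that both sources considered very unlikely.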
Dubois and Prade (1985) have also shown that the normalization process in the rule of evidence combination creates a sensitivity problem, where assigning a zero value or a very small value to a basic probability assignment causes very different results. Based on Dempster-Shafer theory, Garvey et al. (1982) proposed an approach called Evidential Reasoning that adopts the evidential interpretation of the degree of belief and upper probabilities. This approach defines the likelihood of a proposition as a subinterval of the



unit interval [0,1]. The lower bound of this interval is the degree of support of the proposition and the upper bound is its degree of plausibility. When distinct bodies of evidence must be pooled, this approach uses the same Dempster-Shafer techniques, requiring the same normalization process that was criticized by Zadeh (Zadeh, 1984).

3. Fuzzy Sets and Possibility Theory. The theory of possibility was proposed independently by Zadeh, as a development of fuzzy set theory, in order to handle vagueness inherent in some linguistic terms (Zadeh, 1978). For a given set of hypotheses, a possibility distribution may be defined in a way that is very similar to that of a probability distribution. However, there is a qualitative difference between the probability and possibility of an event. The difference is that a high degree of possibility does not imply a high degree of probability, nor does a low degree of probability imply a low degree of possibility. However, an impossible event must also be improbable. More formally, Zadeh defined the concept of a possibility distribution.
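The qualitative difference between possibility and probability can be sketched in a few lines of Python. The sketch assumes the usual max-based possibility measure over a finite set of outcomes; the outcome set and all numbers are hypothetical and serve only as an illustration.

```python
def possibility(pi, event):
    """Possibility of an event: the largest degree of any outcome in it."""
    return max((pi[u] for u in event), default=0.0)


def necessity(pi, event):
    """Necessity of an event: 1 minus the possibility of its complement."""
    complement = [u for u in pi if u not in event]
    return 1.0 - possibility(pi, complement)


def probability(p, event):
    """Probability of an event: the sum over its outcomes."""
    return sum(p[u] for u in event)


# Hypothetical example: number of eggs someone eats at breakfast.
pi = {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.8, 5: 0.6, 6: 0.4, 7: 0.0}  # how easily it could happen
p = {1: 0.1, 2: 0.8, 3: 0.1, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0}   # how often it does happen

three_or_more = {3, 4, 5, 6}
poss = possibility(pi, three_or_more)
prob = probability(p, three_or_more)
```

Here eating three or more eggs is fully possible but has a low probability, while the outcome with possibility 0 also has probability 0, in line with the remark that an impossible event must also be improbable.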

The concept of possibility theory has been built upon fuzzy set theory and is well suited for representing the imprecision of vague linguistic predicates. The vague predicate induces a fuzzy set and the corresponding possibility distribution. From a semantic point of view, the values restricted by a possibility distribution are more or less all the eligible values for a linguistic variable. This theory is completely feasible for every element of the universe of discourse.

4. Theory of Endorsement. A different approach to uncertainty representation was proposed by Cohen (Cohen, 1983), which is based on a qualitative theory of “endorsement.” According to Cohen, the records of the factors relating to one’s certainty are called endorsements. Cohen’s model of endorsement is based on the explicit recording of the justifications for a statement, normally requiring a complex data structure of information about the source. Therefore, this approach maintains the uncertainty. The justification is classified according to the type of evidence for a proposition, the possible actions required to resolve the uncertainty of that evidence, and other related features.

Endorsements can provide a good mechanism for explaining reasoning, since they create and maintain the entire history of justifications (i.e., reasons for believing or disbelieving a proposition) and the relevance of any proposition with respect to a given goal. Endorsements are divided into five classes: rules, data, task, conclusion, and resolution. Cohen points out that the main difference between the numerical approaches and the endorsement-based approach, specifically with respect to chains of inferences, is that reasoning in the former approach is entirely automatic and non-reflective, while the latter approach provides more information for reasoning about uncertainty. Consequently, reasoning in the latter approach can be controlled and determined by the quality and availability of evidence. Endorsements provide the information necessary to many aspects of reasoning about uncertainty. Endorsements are used to schedule sure tasks before unsure ones, to screen tasks before activating them, to determine whether a proposition is certain enough for some purpose, and to suggest new tasks when old ones fail to cope with uncertainty. Endorsements distinguish different kinds of uncertainty, and tailor reasoning to what is known about uncertainty. However, Bonissone and Tong (1995) argue that combinations of endorsements in a premise (i.e., proposition), propagation of endorsements to a conclusion, and ranking of endorsements must be explicitly specified for each particular context. This creates potential combinatorial problems.

5. Assumption-Based Reasoning and Non-Monotonic Logic. In the reasoned-assumptions approach proposed by Doyle (1979), the uncertainty embedded in an implication rule is removed by listing all the exceptions to that rule. When this is not possible, assumptions are used to show the typicality of a value (i.e., default values) and the defeasibility of a rule (i.e., liability to defeat of reason). In classical logic, if a proposition C can be derived from a set of propositions S, and if S is a subset of T, then C can also be derived from T. As a system’s premises increase, its possible conclusions at least remain constant and more likely increase. Deductive systems with this property are called monotonic. This kind of logic lacks tools for describing how to revise a formal theory to deal with inconsistencies caused by new information. McDermott and Doyle proposed a non-monotonic logic to cope with this problem (McDermott and Doyle, 1980).
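The monotonicity property described above, and its failure under default assumptions, can be illustrated with a deliberately tiny Python sketch. The single default rule (a bird flies unless it is known to be a penguin) and the function name are hypothetical; this is not an implementation of McDermott and Doyle’s logic or of a truth maintenance system.

```python
def conclusions(facts):
    """Derive conclusions from a set of facts using one default rule.

    Hypothetical default: a bird flies unless it is known to be a penguin.
    """
    derived = set(facts)
    if "bird" in facts and "penguin" not in facts:
        derived.add("flies")         # assumed by default, open to later defeat
    return derived


before = conclusions({"bird"})
after = conclusions({"bird", "penguin"})

# In a monotonic system, enlarging the premises could never remove a conclusion.
is_monotonic_here = before <= after
```

Adding the fact `penguin` withdraws the conclusion `flies`, so the set of conclusions does not grow with the set of premises and `is_monotonic_here` is false.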

When an assumption used in the deductive process is found to be false, non-monotonic mechanisms must be used to keep the integrity of the statements (Doyle, 1979). However, this approach lacks facilities for computing degrees of belief. Bonissone and Tong (1995) suggest that assumption-based systems can cope with cases of incomplete information, but that they are inadequate for handling imprecise information. In particular, they cannot integrate probabilistic measures with reasoned assumptions. Furthermore, such systems rely on the precision of the defaulted values. On the other hand, when specific information is missing, the system should be able to use analogous or relevant information inherited from some higher-level concept. This surrogate for the missing information is generally fuzzy or imprecise and provides limited constraints on the value of the missing information.

In an inference system employing non-monotonic logic, assumptions are made that may have to be revised in the light of new information. Such systems have the property that at any given inference stage, more than one mutually consistent set of conclusions can be derived from the available data and possible assumptions. Such conclusions may be invalidated when new data is found to be incompatible with some default assumptions. The inference system requires that justifications for any conclusion are recorded during the inference process and used for dependency-directed backtracking during the revision of beliefs. This is implemented by the Truth Maintenance System (TMS) (Doyle, 1979). The weakness of standard non-monotonic logic is that the only message conveyed by a contradiction is that a piece of information previously believed true is actually false (for the time being). However, what a discovered inconsistency really indicates may be that the information is not as reliable as was assumed, that the subject is not in a well-ordered state, or a mixture of both (Bhatnagar and Kanal, 1986). In addition, since the TMS examines the new information one piece at a time, it lacks the ability to detect noisy input that should be ignored. This weakness is crucial to the task of pattern recognition (Chen and Norcio, 1997).

CONCLUSION

This chapter analyzes approaches to handling uncertainties in an adaptive human-computer interface. Each approach can deal effectively with only a particular type of uncertainty problem. The interface system needs a more comprehensive approach to uncertainty management because of the various sources of uncertainty in human-machine dialog. In particular, since human-machine dialog tends to be context-dependent, the management of uncertainty must provide a pattern-formatted view for user modeling. In other words, a user modeling system must examine the user input based on the context of the dialog to obtain complete and consistent user profiles. Some non-traditional approaches, such as neural networks and genetic algorithms, have been proposed to handle uncertainties in interactive systems (Chen and Norcio, 1997), because they have a strong ability for pattern recognition and classification. However, the conversion between non-numerical user input and numerical input for neural network processing still involves a great deal of uncertainty.

REFERENCES

Barr, A. and Feigenbaum, E. A., The Handbook of Artificial Intelligence 2. Los Altos, Kaufmann, 1982.

Bhatnagar, R. K. and Kanal, L. N., “Handling Uncertainty Information: A Review of Numeric and Nonnumeric Methods,” Uncertainty in Artificial Intelligence, Kanal, L. N. and Lemmer, J. F. (ed.), pp. 2-26, 1986.

Bonissone, P. and Tong, R. M., “Editorial: Reasoning with Uncertainty in Expert Systems,” International Journal of Man-Machine Studies, Vol. 30, 69-111 (2005).

Buchanan, B. and Smith, R. G., Fundamentals of Expert Systems, Ann. Rev., Computer Science, Vol. 3, pp. 23-58, 1988.

Chen, Q. and Norcio, A. F., “Modeling a User’s Domain Knowledge with Neural Networks,” International Journal of Human-Computer Interaction, Vol. 9, No. 1, pp. 25-40, 1997.

Chen, Q. and Norcio, A. F., “Knowledge Engineering in Adaptive Interface and User Modeling,” Human-Computer Interaction: Issues and Challenges, (ed.) Chen, Q., Idea Group Pub., 2001.

Cohen, P. R. and Grinberg, M. R., “A Theory of Heuristic Reasoning about Uncertainty,” AI Magazine, Vol. 4(2), pp. 17-23, 1983.

Dempster, A. P., “Upper and Lower Probabilities Induced by a Multivalued Mapping,” The Annals of Mathematical Statistics, Vol. 38(2), pp. 325-339, 1967.

Dubois, D. and Prade, H., “Combination and Propagation of Uncertainty with Belief Functions -- A Reexamination,” Proc. of 9th International Joint Conference on Artificial Intelligence, pp. 111-113, 1985.

Dutta, A., “Reasoning with Imprecise Knowledge in Expert Systems,” Information Sciences, Vol. 37, pp. 2-24, 2005.

Doyle, J., “A Truth Maintenance System,” Artificial Intelligence, Vol. 12, pp. 231-272, 1979.

Garvey, T. D., Lowrance, J. D. and Fischer, M. A., “An Inference Technique for Integrating Knowledge from Disparate Source,” Proc. of the 7th International Joint Conference on AI, Vancouver, B. C., pp. 319-325, 1982.

Heckerman, D., “Probabilistic Interpretations for MYCIN’s Certainty Factors,” Uncertainty in Artificial Intelligence, Kanal, L. N. and Lemmer, J. F. (ed.), 1986.

Jacobson, C. and Freiling, M. J., “ASTEK: A Multiparadigm Knowledge Acquisition Tool for Complex Structured Knowledge,” International Journal of Man-Machine Studies, Vol. 29, pp. 311-327, 1988.

Kahneman, D. and Tversky, A., “Variants of Uncertainty,” Cognition, Vol. 11, pp. 143-157, 1982.

McDermott, D. and Doyle, J., “Non-monotonic Logic,” Artificial Intelligence, Vol. 13, pp. 41-72, 1980.

Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publisher, San Mateo, CA, 1988.

Pearl, J., “Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach,” Proc. of the 2nd National Conference on Artificial Intelligence, IEEE Computer Society, pp. 1-12, 1985.

Pednault, E. P. D., Zucker, S. W. and Muresan, L. V., “On the Independence Assumption Underlying Subjective Bayesian Updating,” Artificial Intelligence, Vol. 16, pp. 213-222, 1981.

Raisinghani, M., Klassen, C. and Schkade, L., “Intelligent Software Agents in Electronic Commerce: A Socio-Technical Perspective,” Human-Computer Interaction: Issues and Challenges, (ed.) Chen, Q., Idea Group Pub., 2001.

Reiter, R., “A Logic for Default Reasoning,” Artificial Intelligence, Vol. 13, pp. 81-132, 1980.

Rich, E., “User Modeling via Stereotypes,” Cognitive Science, Vol. 3, pp. 329-354, 1979.

Shafer, G., A Mathematical Theory of Evidence, Princeton University Press, 1976.

Zadeh, L. A., “Review of Books: A Mathematical Theory of Evidence,” AI Magazine, Vol. 5(3), pp. 81-83, 1984.

Zadeh, L. A., “Knowledge Representation in Fuzzy Logic,” IEEE Transactions on Knowledge and Data Engineering, Vol. 1, No. 1, pp. 89-100, 1989.

Zwick, R., “Combining Stochastic Uncertainty and Linguistic Inexactness: Theory and Experimental Evaluation of Four Fuzzy Probability Models,” Int. J. Man-Machine Studies, Vol. 30, pp. 69-111, 1999.

KEY TERMS

Bayesian Theory: Also known as Bayes’ rule or Bayes’ law. It is a result in probability theory, which relates the conditional and marginal probability distributions of random variables. In some interpretations of probability, Bayes’ theory tells how to update or revise beliefs in light of new evidence.

Default Reasoning: A non-monotonic logic proposed by Raymond Reiter to formalize reasoning with default assumptions. Default reasoning can express facts like “by default, something is true”; by contrast, standard logic can only express that something is true or that something is false. This is a problem because reasoning often involves facts that are true in the majority of cases but not always. A classical example is: “birds typically fly”. This rule can be expressed in standard logic either by “all birds fly”, which is inconsistent with the fact that penguins do not fly, or by “all birds that are not penguins and not ostriches and ... fly”, which requires all exceptions to the rule to be specified. Default logic aims at formalizing inference rules like this one without explicitly mentioning all their exceptions.

Non-Monotonic Logic: A formal logic whose consequence relation is not monotonic. Most studied formal logics have a monotonic consequence relation, meaning that adding a formula to a theory never produces a reduction of its set of consequences. Intuitively, monotonicity indicates that learning a new piece of knowledge cannot reduce the set of what is known. A monotonic logic cannot handle various reasoning tasks such as reasoning by default.

Possibility Theory: A mathematical theory for dealing with certain types of uncertainty; it is an alternative to probability theory. Professor Lotfi Zadeh first introduced possibility theory in 1978 as an extension of his theory of fuzzy sets and fuzzy logic.

Shafer-Dempster’s Evidence Theory: A mathematical theory of evidence based on belief functions and plausible reasoning, which is used to combine separate pieces of information (evidence) to calculate the probability of an event. The theory was developed by Arthur P. Dempster and Glenn Shafer.

Theory of Endorsement: An approach to representing uncertainty proposed by Cohen, which is based on a qualitative theory of “endorsement.” According to Cohen, the records of the factors relating to one’s certainty are called endorsements. Cohen’s model of endorsement is based on the explicit recording of the justifications for a statement, normally requiring a complex data structure of information about the source. Therefore, this approach maintains the uncertainty. The justification is classified according to the type of evidence for a proposition, the possible actions required to resolve the uncertainty of that evidence, and other related features.

Truth Maintenance System: A knowledge representation method for representing both beliefs and their dependencies. The name truth maintenance is due to the ability of these systems to restore consistency. There are two major types of truth maintenance system: single-context and multi-context. In single-context systems, consistency is maintained among all facts in memory (database). Multi-context systems allow consistency to be relevant to a subset of facts in memory (a context) according to the history of logical inference. This is achieved by tagging each fact or deduction with its logical history. Multi-agent truth maintenance systems perform truth maintenance across multiple memories, often located on different machines.


Many-Objective Evolutionary Optimisation

Francesco di Pierro, University of Exeter, UK
Soon-Thiam Khu, University of Exeter, UK
Dragan A. Savić, University of Exeter, UK

INTRODUCTION

Many-objective evolutionary optimisation is a recent research area that is concerned with the optimisation of problems consisting of a large number of performance criteria using evolutionary algorithms. Despite the tremendous development that multi-objective evolutionary algorithms (MOEAs) have undergone over the last decade, studies addressing problems consisting of a large number of objectives are still rare. The main reason is that these problems cause additional challenges with respect to low-dimensional ones. This chapter gives a detailed analysis of these challenges, provides a critical review of the traditional remedies and methods for the evolutionary optimisation of many-objective problems and presents the latest advances in this field.

BACKGROUND

There has been considerable recent interest in the optimisation of problems consisting of more than three performance criteria, a realm that was termed many-objective optimisation by Farina and Amato (Farina, & Amato, 2002). To date, the vast majority of the literature has focused on two- and three-dimensional problems (Deb, 2001). However, in recent years, the incorporation of multiple indicators into the problem formulation has clearly emerged as a prerequisite for a sound approach in many engineering applications (Coello Coello, Van Veldhuizen, & Lamont, 2002). Despite the tremendous development that MOEAs have undergone over the last decade, and their ample success in disparate applications, studies addressing high-dimensional real-life problems are still rare (Coello Coello, & Aguirre, 2002). The main reason is that

many-objective problems cause additional challenges with respect to low-dimensional ones:

• If the dimensionality of the objective space increases, then in general, the dimensionality of the Pareto-optimal front also increases.
• The number of points required to characterise the Pareto-optimal front increases exponentially with the number of objectives considered.

It is clear that these two features represent a hindrance for most population-based methods, including MOEAs. In fact, in order to provide a good approximation of a high-dimensional Pareto-optimal front, this class of algorithms must evolve populations of solutions of considerable size. This has a profound impact on their performance, since evaluating each individual solution may be a time-consuming task. Using smaller populations would not be a viable option, at least for Pareto-based algorithms, given the progressive loss of selective pressure they experience as the number of objectives increases, with a consequent deterioration of performance, as shown theoretically in (Farina, & Amato, 2004) and evidenced empirically in (Deb, 2001, pages 404-405).

In contrast to Pareto-based methods, traditional multi-objective optimisation approaches, which work by reducing the multi-objective problem into a series of parameterised single-objective ones that are solved in succession, are not affected by the curse of dimensionality. However, such strategies cause the optimisations to be executed independently of each other, thereby losing the implicit parallelism of population-based multi-objective algorithms.

The remainder of this chapter provides a detailed review of the methods proposed to address the first two issues affecting many-objective evolutionary optimisation and discusses the latest advances in the field.
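The loss of selective pressure mentioned above can be observed with a short Python experiment. The sketch below draws uniformly random objective vectors, which is an assumption made only for illustration and not a model of any real problem, and measures the fraction of them that are non-dominated as the number of objectives grows. The function names and sample sizes are hypothetical.

```python
import random


def dominates(a, b):
    """True if a Pareto-dominates b (all objectives minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated_fraction(n_points, n_objectives, rng):
    """Fraction of uniformly random points that no other point dominates."""
    pts = [[rng.random() for _ in range(n_objectives)] for _ in range(n_points)]
    count = sum(
        1 for p in pts if not any(dominates(q, p) for q in pts if q is not p)
    )
    return count / n_points


rng = random.Random(0)               # fixed seed so the run is repeatable
fractions = {m: nondominated_fraction(200, m, rng) for m in (2, 5, 10)}
```

With only a handful of objectives, a small share of a random population is non-dominated; with ten objectives most of the population is, so a ranking based on Pareto dominance alone can no longer discriminate between solutions.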

REMEDIAL MEASURES: STATE-OF-THE-ART

The possible remedies that have been proposed to address the issues arising in evolutionary many-objective optimisation can be broadly classified as follows:

• aggregation, goals and priorities
• conditions of optimality
• dimensionality reduction

In the next sub-sections we give an overview of each of these methods and review the approaches that have been so far proposed.

Aggregation, Goals and Priorities

This class of methods tries to overcome the difficulties described in the previous section by decomposing the original problem into a series of parameterised single-objective ones, which can then be solved by any classical or evolutionary algorithm. Many aggregation-based methods have been presented so far; they are usually based on modifications of the weighted sum approach, such as the augmented Tchebycheff function, that are able to identify exposed solutions and explore non-convex regions of the trade-off surface. However, the problem of selecting an effective strategy to vary weights or goals so that a representative approximation of the trade-off curve can be achieved is still unresolved.

The ε-constraint approach (Chankong, & Haimes, 1983), which is based on the minimisation of one (the most preferred or primary) objective function while considering the other objectives as constraints bound by some allowable levels, has also been used in the context of evolutionary computing. The main limitations of this approach are its computational cost and the lack of an effective strategy to vary the bound levels (ε). Recently, Laumanns et al. (Laumanns, Thiele, & Zitzler, 2006) proposed a variant of the original approach in which they developed a variation scheme based on the concept of ε-Pareto dominance (efficiency) (White, 1986) that adaptively generates constraint values, thus enabling the exhaustive exploration of the Pareto front, provided the scheme is coupled with an exact single-objective optimiser. It must be pointed out, however, that none of the methods described above has ever been thoroughly tested in the context of many-objective optimisation.

The Multiple Single Objective Pareto Sampling (MSOPS 1 & 2), an interesting hybridisation of the aggregation method with goal specification, was presented in (Hughes, 2003; Hughes, 2005). In MSOPS, the selective pressure is not provided by Pareto ranking. Instead, a set of user-defined target vectors is used in turn, in conjunction with an aggregation method, to evaluate the performance of each solution at every generation of a MOEA. The greater the number of targets that a solution nears, the better its rank. The author suggested two aggregation methods: the weighted min-max approach (implemented in MSOPS) and the Vector-Angle-Distance-Scaling (implemented in MSOPS 2). The results indicated with statistical significance that NSGA-II (Deb, Pratap, Agarwal, & Meyarivan, 2002), the Pareto-based MOEA used for comparative purposes, was outperformed on many-objective problems. This was also recently confirmed by Wagner et al. in (Wagner, Beume, & Naujoks, 2007), where they benchmarked traditional MOEAs, aggregation-based methods and indicator-based methods on problems with up to six objectives and suggested a more effective method to generate the target vectors.
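To illustrate why aggregation functions other than the plain weighted sum are used, the following Python sketch compares a weighted sum with one common form of the augmented Tchebycheff function on a tiny, hypothetical non-convex front (both objectives minimised). The three points, the weights, the ideal point and the value of `rho` are assumptions made for illustration and do not come from any of the cited studies.

```python
def weighted_sum(f, w):
    """Plain weighted sum of the objective values."""
    return sum(wi * fi for wi, fi in zip(w, f))


def augmented_tchebycheff(f, w, ideal, rho=0.01):
    """Weighted maximum deviation from the ideal point plus a small sum term."""
    dev = [abs(fi - zi) for fi, zi in zip(f, ideal)]
    return max(wi * di for wi, di in zip(w, dev)) + rho * sum(dev)


# Hypothetical front: the middle point lies in a non-convex region.
front = [(0.0, 1.0), (0.6, 0.6), (1.0, 0.0)]
w = (0.5, 0.5)
ideal = (0.0, 0.0)

best_ws = min(front, key=lambda f: weighted_sum(f, w))
best_tch = min(front, key=lambda f: augmented_tchebycheff(f, w, ideal))
```

For this front the weighted sum returns one of the two extreme points whatever weights are chosen, whereas the augmented Tchebycheff function with equal weights selects the middle point `(0.6, 0.6)`.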

Conditions of Optimality

Recently, great attention has been given to the role that conditions of optimality may play in the context of many-objective evolutionary optimisation when used to rank trial solutions during the selection stage of a MOEA, as an alternative to, or in conjunction with, Pareto efficiency. Farina and Amato (Farina, & Amato, 2004) proposed the use of a fuzzy optimality condition, but did not provide a direct means to incorporate it into a MOEA. Köppen et al. (Koppen, Vincente-Garcia, & Nickolay, 2005) also suggested the fuzzification of the Pareto dominance relation, which was exploited within a generational elitist genetic algorithm on a synthetic MOP.

The concept of knee (Deb, 2003) has also been exploited in the context of evolutionary many-objective optimisation. Simply stated, a knee is a portion of a Pareto surface where the marginal substitution rates are particularly high, i.e. a small improvement in one objective leads to a large deterioration of the others. A graphical representation is given in Figure 1. The idea


is that, with no prior information about the preference structures of the decision maker (DM), the knee is likely to be the most interesting area. Branke et al. (Branke, Deb, Dierolf, & Osswald, 2004) developed two methodologies to detect solutions lying on knees, and incorporated them into the second stage (crowding measure) of the ranking procedure of NSGA-II. The first methodology consists of evaluating, for each individual in a population, the angle between itself and some neighbouring solutions and using this value to favour solutions with higher angles, i.e. closer to the knee. The methodology, however, scales poorly with the number of objectives. The second strategy resorts to the expected marginal utility function to detect solutions close to the knee. This approach extends easily with the number of objectives; however, the sampling necessary to evaluate the expectation of the marginal utility function with a certain confidence may become expensive. Neither of these approaches has been tested on many-objective problems.

Figure 1. Simple Pareto front with a knee

The concept of approximate optimal solutions has also been investigated to some extent in the context of evolutionary many-objective optimisation. In particular, ε-efficiency was considered to be potentially effective in easing some of the difficulties associated with many-objective problems. A recent study by Wagner et al. (Wagner, Beume, & Naujoks, 2007) showed the excellent performance of ε-MOEA (Deb, Mohan, & Mishra, 2003) on a 6-objective instance of two synthetic test functions. A good review of the application of approximate conditions of optimality is given in (Burke, & Landa Silva, 2006), where the authors also compared the effect of using two relaxed forms of Pareto dominance as evaluation methods within two MOEAs. Recently, di Pierro et al. (di Pierro, Khu, & Savić, 2007) proposed a ranking scheme based on Preference Ordering (Das, 1999), a condition of optimality that generalises Pareto efficiency but is more stringent, and tested it using NSGA-II as the optimisation shell on a suite of seven benchmark problems with up to eight objectives. Results indicated that the proposed methodology significantly enhanced the convergence properties of the standard NSGA-II algorithm on all the test problems. The strengths of this approach are its absence of parameters to tune and the fact that it showed very good performance across varying problem features; the drawbacks are its computational runtime and


the fact that its combination with diversity-preserving mechanisms that favour extreme solutions may generate too high a selective pressure.

In (Sato, Aguirre, & Tanaka, 2007), Sato et al. introduced an approach to modify the Pareto dominance condition used within the selection stage of Pareto-based MOEAs, by which the area dominated by any given point is contracted or expanded according to a formula derived from the sine theorem, and the extent of the contraction/expansion is controlled by a constant factor. The results of a series of experiments performed using NSGA-II equipped with the contraction/expansion mechanism on 0/1 multi-objective knapsack problems showed substantially improved convergence and diversity performance compared to the standard NSGA-II algorithm. However, it was also shown that the optimal value for the contraction/expansion factor depends strongly on various problem features, and no indication was given to support a correct choice.

Most modern MOEAs rely on a two-stage ranking during the selection process. At the first stage the ranks are assigned according to some form of Pareto-based dominance relation; if ties exist, these are resolved by resorting to mechanisms that favour a good distribution of the examined solutions along the Pareto front. It has now been acknowledged that this second stage of the ranking procedure may in fact be detrimental in the case of many objectives, as was shown in (Purshouse, & Fleming, 2003b) for the case of NSGA-II. Recent efforts have therefore focused on replacing diversity-preserving mechanisms at this second stage with more effective ones. Köppen and Yoshida (Koppen, & Yoshida, 2007) proposed four secondary ranking assignments and tested them by replacing the crowding distance assignment within NSGA-II. The results indicated improved convergence in all cases compared to the standard NSGA-II. However, the authors did not report any result on the diversity performance of the algorithms.
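As a rough illustration of how a relaxed condition of optimality restores some selective pressure, the Python sketch below filters a set of points with an additive form of ε-dominance, as it is commonly defined in the MOEA literature. This form is an assumption made here; the sketch is not the ε-MOEA algorithm, nor any of the other schemes cited in this sub-section, and the points and the value of `eps` are hypothetical.

```python
def dominates(a, b):
    """Standard Pareto dominance (all objectives minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def eps_dominates(a, b, eps):
    """Additive epsilon-dominance: a is within eps of being no worse than b everywhere."""
    return all(x - eps <= y for x, y in zip(a, b))


def eps_archive(points, eps):
    """Keep a point only if no archived point epsilon-dominates it."""
    archive = []
    for p in points:
        if any(eps_dominates(q, p, eps) for q in archive):
            continue
        # Drop archived points that the newcomer epsilon-dominates.
        archive = [q for q in archive if not eps_dominates(p, q, eps)]
        archive.append(p)
    return archive


# Hypothetical 4-objective points; the second is almost identical to the first.
points = [(1.0, 2.0, 3.0, 4.0), (1.05, 1.95, 3.0, 4.02),
          (4.0, 3.0, 2.0, 1.0), (2.0, 2.0, 2.0, 2.0)]

pareto = [p for p in points if not any(dominates(q, p) for q in points if q is not p)]
relaxed = eps_archive(points, eps=0.1)
```

Under standard Pareto dominance all four points are mutually non-dominated, while the relaxed relation discards the near-duplicate and keeps three. The result of this simple archive depends on the order in which points are presented.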

Dimensionality Reduction Methods

The aim of this class of methods is usually to transform the objective space into a lower-dimensional representation, either one-off (prior to the optimisation) or iteratively (as the search progresses). Deb and Saxena (Deb, & Saxena, 2006) developed a procedure based on principal component analysis (PCA) for reducing the dimension of the problem to solve. The procedure consists of performing a series of optimisations using a state-of-the-art MOEA, each one focusing only on the objectives that PCA found to explain most of the variance on the basis of the Pareto front obtained with the previous optimisation. Recently, Saxena and Deb (Saxena, & Deb, 2007) extended their work and replaced PCA with two dimensionality reduction techniques, correntropy PCA and a modified maximum variance unfolding, that could also detect non-linear interactions in the objective space. The results indicated that the former method suffered to some extent from the difficult choice of the best kernel function to use, whereas for the latter the authors performed a significant number of experiments to suggest bound values for the only free parameter of the procedure. It must be highlighted that these two studies are the only efforts that have tested new algorithms on high-dimensional test problems (up to 50 objectives).

In a recent study, Brockhoff and Zitzler (Brockhoff, & Zitzler, 2006b) introduced the minimum objective subset problem (MOSS), which is concerned with the identification of the largest set of objectives that can be removed without altering the dominance structure of the problem (i.e. the set of Pareto optimal solutions obtained considering all the objectives or only the MOSS is the same), and developed an exact algorithm and a greedy heuristic to solve it. Subsequently (Brockhoff, & Zitzler, 2006a), they proposed a measure of variation for the dominance structure and extended the MOSS to allow for dimensionality reductions involving predefined thresholds of problem structure changes. However, they did not propose a mechanism to incorporate these algorithms within a MOEA.

Recently, the analysis of the relationships of interdependence between the objectives of an optimisation problem has been successfully exploited to devise effective reduction methods. Following the definitions of conflict, support or harmony, and independence proposed in (Carlsson, & Fuller, 1995), Purshouse and Fleming (Purshouse, & Fleming, 2003a) discussed the effects of these relationships in the context of many-objective evolutionary optimisation. In a later study, Purshouse and Fleming (Purshouse, & Fleming, 2003c) also suggested, for the case of independent objectives, a divide-and-conquer algorithm based on objective space decomposition.
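The idea behind the minimum objective subset problem can be illustrated on a finite sample of solutions. The sketch below is a brute-force illustration only, not the exact algorithm or the greedy heuristic of the cited reports: it assumes minimisation, compares the weak dominance relation induced by each candidate subset of objectives with the one induced by all objectives, and its cost grows exponentially with the number of objectives. The function names are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Sequence, Set, Tuple


def dominance_relation(points: List[Sequence[float]],
                       objs: Iterable[int]) -> Set[Tuple[int, int]]:
    """Pairs (i, j) where point i weakly dominates point j on `objs`."""
    objs = list(objs)
    return {
        (i, j)
        for i in range(len(points))
        for j in range(len(points))
        if i != j and all(points[i][k] <= points[j][k] for k in objs)
    }


def minimum_objective_subset(points: List[Sequence[float]]) -> Tuple[int, ...]:
    """Smallest subset of objective indices preserving the dominance relation."""
    m = len(points[0])
    full = dominance_relation(points, range(m))
    for size in range(1, m + 1):          # try small subsets first
        for subset in combinations(range(m), size):
            if dominance_relation(points, subset) == full:
                return subset
    return tuple(range(m))


# Illustrative sample: the third objective is twice the first, hence redundant.
sample = [(1, 4, 2), (2, 3, 4), (3, 1, 6)]
reduced = minimum_objective_subset(sample)
```

On this sample no single objective reproduces the original relation, while the first two objectives together do, so the third can be dropped.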



FUTURE TRENDS

As the discussion above shows, there is an increasing effort to develop strategies that are able to overcome the limitations of Pareto-based methods when solving problems with many objectives. Although promising results have generally been reported, most of the approaches presented are of an empirical nature, which makes it difficult to draw conclusions that can be generalised. With the exception of dimensionality reduction techniques, the majority of the studies presented to date focus on mechanisms to improve the ranking of the solutions in the selection process. However, the analysis of these mechanisms is usually undertaken in isolation from the other components of the algorithms. In our view, this is an important limitation that next-generation algorithms will have to address, in particular by analysing these mechanisms in relation to the variation operators. Moreover, little attention has been paid to characterising the solutions that a given method (belonging to the first or second category identified in the previous section) favours, in relation to the properties of the problem being solved. Theoretical frameworks are therefore needed in order to analyse existing methods and develop more focused approaches. As was pointed out by di Pierro in (di Pierro, 2006), where he provided a theoretical framework to analyse the effect of the Preference Ordering-based ranking procedure in relation to the interdependence relationships of a problem, this approach enables predicting the effect of applying a given methodology to a particular problem with limited prior knowledge, which is certainly an advantage since the goal of developing powerful algorithms is to solve (often for the first time) real-life problems.

CONCLUSIONS

In this chapter we have provided a comprehensive review of the state of the art of evolutionary algorithms for the optimisation of many-objective problems, discussing the limitations and strengths of the approaches described, and we have suggested future trends of research for a field that is gathering increasing momentum.

REFERENCES

Branke, J., Deb, K., Dierolf, H., & Osswald, M., (2004). Finding Knees in Multi-objective Optimization, Kanpur Genetic Algorithm Laboratory, KanGAL Tech. Rep. No. 2004010.

Brockhoff, D., & Zitzler, E., (2006a). Dimensionality Reduction in Multiobjective Optimization with (Partial) Dominance Structure Preservation: Generalized Minimum Objective Subset Problems, TIK-Report No. 247.

Brockhoff, D., & Zitzler, E., (2006b). On Objective Conflicts and Objective Reduction in Multiple Criteria Optimization, TIK-Report No. 243.

Burke, E. K., & Landa Silva, J. D., (2006). The Influence of the Fitness Evaluation Method on the Performance of Multiobjective Search Algorithms, European Journal of Operational Research, (169) 3, 875-897.

Carlsson, C., & Fuller, R., (1995). Multiple Criteria Decision Making: The Case for Interdependence, Computers and Operations Research, (22) 251-260.

Chankong, V., & Haimes, Y. Y., (1983). Multiobjective decision making: theory and methodology, Dover Publications.

Coello Coello, C. A., & Aguirre, A. H., (2002). Design of Combinational Logic Circuits through an Evolutionary Multiobjective Optimization Approach, Artificial Intelligence for Engineering Design, Analysis and Manufacture, (16) 1, 39-53.

Coello Coello, C. A., Van Veldhuizen, D. A., & Lamont, G. B., (2002). Evolutionary Algorithms for Solving Multi-Objective Problems, Kluwer Academic Publishers.

Das, I., (1999). A Preference Ordering Among Various Pareto Optimal Alternatives, Structural Optimization, (18) 1, 30-35.

Deb, K., (2001). Multi-Objective Optimization using Evolutionary Algorithms, John Wiley & Sons.

Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T., (2002). A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation, (6) 2, 182-197.

Deb, K., (2003). Multi-objective evolutionary algorithms: Introducing bias among Pareto-optimal solutions, in Advances in Evolutionary Computing: Theory and Applications, A. Ghosh and S. Tsutsui, Eds. London: Springer-Verlag, pp. 263-292.

Deb, K., Mohan, M., & Mishra, S., (2003). A Fast Multi-objective Evolutionary Algorithm for Finding Well-Spread Pareto-Optimal Solutions, Kanpur Genetic Algorithm Laboratory, KanGAL Technical Report, 2003002.

Deb, K., & Saxena, D. K., (2006). Searching for Pareto-optimal solutions through dimensionality reduction for certain large-dimensional multi-objective optimization problems, in IEEE Congress on Evolutionary Computation (CEC 2006), pp. 3353-3360.

di Pierro, F., (2006). Many-Objective Evolutionary Algorithms and Applications to Water Resources Engineering, School of Engineering, Computer Science and Mathematics, University of Exeter, UK, Ph.D. Thesis.

di Pierro, F., Khu, S.-T., & Savic, D. A., (2007). An Investigation on Preference Order - Ranking Scheme for Multi Objective Evolutionary Optimisation, IEEE Transactions on Evolutionary Computation, (11) 1, 17-45.

Farina, M., & Amato, P., (2002). On the Optimal Solution Definition for Many-criteria Optimization Problems, in Proceedings of the NAFIPS-FLINT International Conference 2002, pp. 233-238, Piscataway, New Jersey: IEEE Service Center.

Farina, M., & Amato, P., (2004). A Fuzzy Definition of Optimality for Many-Criteria Optimization Problems, IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, (34) 3, 315-326.

Hughes, E. J., (2003). Multiple single objective pareto sampling, in 2003 Congress on Evolutionary Computation, pp. 2678-2684.

Köppen, M., Vincente-Garcia, R., & Nickolay, B., (2005). Fuzzy-Pareto-Dominance and Its Application in Evolutionary Multi-objective Optimization, in Evolutionary Multi-Criterion Optimization, Third International Conference, EMO 2005, vol. 3410, pp. 399-412.

Köppen, M., & Yoshida, K., (2007). Substitute Distance Assignments in NSGA-II for Handling Many-Objective Optimization Problems, in Evolutionary Multi-Criterion Optimization, 4th International Conference, EMO 2007, S. Obayashi and K. Deb and C. Poloni and T. Hiroyasu and T. Murata Eds. Springer.

Laumanns, M., Thiele, L., & Zitzler, E., (2006). An efficient, adaptive parameter variation scheme for metaheuristic based on the epsilon-constraint method, European Journal of Operational Research, (169) 932-942.

Purshouse, R. C., & Fleming, P. J., (2003a). Conflict, harmony, and independence: Relationships in evolutionary multi-criterion optimisation, in Second International Conference on Evolutionary Multi-Criterion Optimization (EMO 2003), pp. 16-30, Berlin: Springer.

Purshouse, R. C., & Fleming, P. J., (2003b). Evolutionary Multi-Objective Optimisation: An Exploratory Analysis, in Proceedings of the 2003 Congress on Evolutionary Computation (CEC 2003), vol. 3, pp. 2066-2073, IEEE Press.

Purshouse, R. C., & Fleming, P. J., (2003c). An adaptive divide-and-conquer methodology for evolutionary multi-criterion optimisation, in Second International Conference on Evolutionary Multi-Criterion Optimization (EMO 2003), pp. 133-147, Berlin: Springer.

Sato, H., Aguirre, H. E., & Tanaka, K., (2007). Controlling Dominance Area of Solutions and Its Impact on the Performance of MOEAs, in Evolutionary Multi-Criterion Optimization, 4th International Conference, EMO 2007, pp. 5-19.

Saxena, D. K., & Deb, K., (2007). Non-linear Dimensionality Reduction Procedures, in Evolutionary Multi-Criterion Optimization, 4th International Conference, EMO 2007, pp. 772-787, S. Obayashi and K. Deb and C. Poloni and T. Hiroyasu and T. Murata Eds. Springer.

Wagner, T., Beume, N., & Naujoks, B., (2007). Pareto-, Aggregation-, and Indicator-based Methods in Many-Objective Optimization, in Evolutionary Multi-Criterion Optimization, 4th International Conference, EMO 2007, pp. 742-756, S. Obayashi and K. Deb and C. Poloni and T. Hiroyasu and T. Murata Eds. Springer.

White, D. J., (1986). Epsilon Efficiency, Journal of Optimization Theory and Applications, (49) 319-337.

KEY TERMS

Evolutionary Algorithms: Solution methods inspired by the natural evolution process that evolve a population of solutions to an optimisation problem through iterative application of randomised processes of recombination and selection, until a termination criterion is met.

Many-Objective Problem: Problem consisting of more than 3-4 objectives to be concurrently maximised/minimised.

Pareto Front: Image of the Pareto Set onto the performance (objective) space.

Pareto Optimal Solution: Solution that is not dominated by any other feasible solution.

Pareto Set: Set of Pareto Optimal solutions.

Ranking Scheme: Scheme that assigns to each solution of a population a score that is a measure of its fitness relative to the other members of the same population.

Selective Pressure: The ratio between the number of expected selections of the best solution and that of the mean-performing one.

Mapping Ontologies by Utilising Their Semantic Structure

Yi Zhao
Fernuniversitaet in Hagen, Germany

Wolfgang A. Halang
Fernuniversitaet in Hagen, Germany

INTRODUCTION

As a key factor in enabling interoperability in the Semantic Web (Berners-Lee, Hendler & Lassila, 2001), ontologies are developed by different organisations on a large scale, often in overlapping areas. Therefore, ontology mapping has come to the fore as a means to achieve knowledge sharing and semantic integration in an environment where knowledge and information are represented by different underlying ontologies. The ontology mapping problem can be defined as acquiring the relationships that hold between the entities of two ontologies. Mapping results can be used for various purposes such as schema/ontology integration, information retrieval, query mediation, or web service mapping. In this article, a method to map concepts and properties between ontologies is presented. First, syntactic analysis is applied based on token strings, and then semantic analysis is executed according to WordNet (Fellbaum, 1999) and tree-like graphs representing the structures of ontologies. The experimental results show that our algorithm finds mappings with high precision.

BACKGROUND

Borrowed from philosophy, ontology refers to a systematic account of what can exist or ‘be’ in the world. In the fields of artificial intelligence and knowledge representation, ontology refers to the construction of knowledge models that specify a set of concepts, their attributes, and the relationships between them. Ontologies are defined as “explicit conceptualisation(s) of a domain” (Gruber, 1993), and are seen as a key to realising the vision of the Semantic Web. Ontology, as an important technique to represent knowledge and information, allows semantics to be incorporated into data to drastically enhance information exchange. The Semantic Web (Berners-Lee, Hendler & Lassila, 2001) is envisioned as a universal medium for data, information, and knowledge exchange. It proposes annotating web resources with machine-processable metadata. With the rapid development of the Semantic Web, it is likely that the number of ontologies in use will increase strongly over the next few years. By themselves, however, ontologies do not solve any interoperability problem. Ontology mapping (Ehrig, 2004) is, therefore, a key to exploiting semantic interoperability of information and, thus, has drawn great attention in the research community in recent years.

This section introduces the basic concepts of information integration, ontologies, and ontology mapping. Mismatches between ontologies are mainly caused by the independent development of ontologies in different organisations. They become evident when trying to combine ontologies which describe partially overlapping domains. The mismatches between ontologies can broadly be divided into syntactic, semantic, and structural heterogeneity. Syntactic heterogeneity denotes differences in the language primitives used to specify ontologies, semantic heterogeneity denotes differences in the way domains are conceptualised and modelled, while structural heterogeneity denotes differences in information structuring.

A number of approaches to ontology mapping have been proposed so far (Shvaiko, 2005, Noy, 2004, Sabou, 2006, Su, 2006). In (Madhavan, 2001), a hybrid similarity mapping algorithm was introduced. The proposed measure integrates linguistic and structural schema matching techniques. The matching is based primarily on schema element names, not considering their properties. LOM (Li, 2004) is a semi-automatic lexicon-based ontology-mapping tool that supports a human mapping engineer with a first-cut comparison of ontological terms between the ontologies to be mapped. It decomposes multi-word terms into their word constituents, but it does not perform direct mapping between the words. The procedure associates the WordNet synset index numbers of the constituent words with the ontological term. The two terms which have the largest number of common synsets are recorded and presented to the user.

MAIN FOCUS OF THE CHAPTER

Our current work tries to overcome the limitations mentioned above, and to improve the precision of ontology mapping. The research goal is to develop a method for ontology mapping and to evaluate its results. In this article, we present a method to map ontologies that combines token-based syntactic analysis with semantic analysis employing the WordNet (Fellbaum, 1999) thesaurus and tree-structured graphs. The algorithm is outlined and expressed in pseudo-code as listed in Figure 1. The promising results obtained from experiments indicate that our algorithm finds mappings with high precision.

Figure 1. Pseudo-code of mapping algorithm

Input: OWL O1, OWL O2, threshold sigma;
Output: similarity between O1 and O2;
Begin
  build tree-structured graphs for O1 and O2, and get their edge sets E1 and E2;
  for each child node Ci ∈ E1 do
    for each child node Cj ∈ E2 do
      tokenise Ci and Cj into token sets tci and tcj;
      if (tci unequal to tcj) then
        calculate syntactic-level similarity Sim_syn between tci and tcj;
        if (Sim_syn < sigma) then // semantic mapping
          compute semantic-level similarity of tci, tcj based on WordNet;
          if (tci and tcj have no WordNet relationship) then
            determine similarity Sim_tsg with the specific properties and relationships between their parent/child nodes in ts-graphs
          fi
        fi
      fi
    od
  od
end

Syntax-Level Mapping Based on Tokenisation

Before syntactic mapping is applied, a pre-processing step called tokenisation is required. Here, ontologies are represented in the language OWL-DL¹. Therefore, all ontology terms are represented with OWL URIs. For example, in the ontology “beer”, an OWL class ‘Ingredient’ is described by “[OWLClassImpl] http://www.purl.org/net/ontology/beer#Ingredient”, where “[OWLClassImpl]” denotes an OWL class, the URL “http://www.purl.org/net/ontology/beer” gives the provenance of the ontology, and ‘Ingredient’ is the class name. Tokenisation should first extract the valid ontology entities from OWL descriptions, which, in this example, is ‘Ingredient’. Moreover, the labels of ontology entities (classes and properties) are quite often defined with different representations by different organisations. For instance, representations may be with or without connector symbols, in upper or lower case, etc., which makes it complicated and hard to identify terms. Tokenisation means parsing names into tokens based on specific rules or symbols by customisable tokenisers using punctuation, upper case, special symbols, digits, etc. In this way, a class or property name can be tokenised into one or several token strings. For example, the term ‘Social_%26_Science’ can be tokenised as ‘Social’, ‘26’, and ‘Science’. Note that terms can sometimes contain digits, such as dates, which cannot be neglected. For simplicity, we assume that all terms of ontology concepts (classes, properties) are described without abbreviations.

The mapping process between different class and property names is then transformed into a mapping between tokens. We first check whether the original child nodes are equal ignoring case. Otherwise, the tokens are used instead to check whether they are equal. If not, the similarity measure based on the edit distance is adopted to calculate similarity. If the calculated similarity value is above a threshold σ (for example, 0.95), the compared nodes are considered to be similar. The process continues with the next pair of nodes in the same way. The edit distance formulated by Levenshtein (Levenshtein, 1966) and the string mapping method proposed by Maedche & Staab (Maedche & Staab, 2002) are employed here to calculate token similarity. The edit distance is a well-established method to weigh the difference between two strings. It measures the minimum number of token insertions, deletions, and substitutions required to transform one string into another using a dynamic programming algorithm.
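The tokenisation step can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' Java tokeniser: the function name and the regular expression are hypothetical, the entity name is taken to be the part of the OWL description after the last ‘#’, and tokens are split at connector symbols, case changes, and digit runs.

```python
import re
from typing import List

# Acronym runs, capitalised or lower-case words, or digit runs.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenise(description: str) -> List[str]:
    """Extract the entity name from an OWL description and split it into tokens."""
    # Assumption: the entity name follows the last '#' of the URI, if any.
    local_name = description.rsplit("#", 1)[-1]
    # findall skips connector symbols such as '_' and '%'.
    return _TOKEN.findall(local_name)


ingredient = tokenise(
    "[OWLClassImpl] http://www.purl.org/net/ontology/beer#Ingredient")
social = tokenise("Social_%26_Science")
```

For the two examples from the text this yields the single token ‘Ingredient’ and the three tokens ‘Social’, ‘26’, and ‘Science’, respectively.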
The string matching method is used to calculate token similarity based on Levenshtein’s edit distance:

\[ \mathrm{Sim}(X, Y) = \max\left(0, \frac{\min(|X|, |Y|) - \mathrm{ed}(X, Y)}{\min(|X|, |Y|)}\right) \in [0, 1] \tag{1} \]

where X, Y are token strings, |X| is the length of X, min( ) and max( ) denote the minimum and maximum of their arguments, respectively, and ed( ) is the edit distance. As the original ontology terms may have been tokenised into many sub-terms, i.e., tokens, it is necessary to calculate the similarity between each pair of token strings separately. Assume that the first term has m tokens and the second term n tokens, with m ≥ n; the total similarity measure according to Eq. (1) is then:

\[ \mathrm{Sim}_{syn} = \omega_{1}\, \mathrm{Sim}_{orig} + \sum_{i=1}^{m} \sum_{j=1}^{n} \omega_{ij}\, \mathrm{Sim}_{ij}, \qquad \omega_{1} + \sum_{i=1}^{m} \sum_{j=1}^{n} \omega_{ij} = 1 \tag{2} \]

where Sim_orig is the similarity between the original strings, Sim_ij is the similarity between the i-th token of the first term and the j-th token of the second term, and ω_1, ω_ij are the weights for Sim_orig and Sim_ij, which together sum to 1. Given a predefined similarity threshold, if the acquired similarity value is greater than or equal to the threshold, the two compared terms are considered similar; otherwise they are not.
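Eqs. (1) and (2) can be made concrete with a short sketch. The following is an illustration under stated assumptions rather than the authors' implementation: the edit distance is computed over characters, comparison ignores case, and, since the text does not specify the weights, a hypothetical weight w_orig is given to the original strings while the remainder is spread uniformly over the m × n token pairs.

```python
from typing import List


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance by dynamic programming over characters."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        curr = [i]
        for j, cy in enumerate(y, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cx != cy)))  # substitution
        prev = curr
    return prev[-1]


def sim(x: str, y: str) -> float:
    """Eq. (1): similarity in [0, 1] derived from the edit distance."""
    x, y = x.lower(), y.lower()      # assumption: case is ignored
    shortest = min(len(x), len(y))
    if shortest == 0:                # avoid division by zero
        return 1.0 if x == y else 0.0
    return max(0.0, (shortest - edit_distance(x, y)) / shortest)


def sim_syn(term1: str, term2: str,
            tokens1: List[str], tokens2: List[str],
            w_orig: float = 0.5) -> float:
    """Eq. (2) with uniform pair weights (an assumption of this sketch)."""
    pairs = [(a, b) for a in tokens1 for b in tokens2]
    if not pairs:
        return sim(term1, term2)
    w_pair = (1.0 - w_orig) / len(pairs)   # all weights sum to 1
    return w_orig * sim(term1, term2) + sum(w_pair * sim(a, b) for a, b in pairs)
```

For instance, ‘kitten’ and ‘sitting’ have edit distance 3 and a shortest length of 6, so Eq. (1) gives a similarity of 0.5, well below a threshold such as 0.95. Note that with uniform weights, cross pairs of unrelated tokens lower the score of multi-token terms, so other weightings may be preferable in practice.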

Semantic-Level Mapping Based on Ontology Structure

Semantic heterogeneity occurs when there is a disagreement about meaning, interpretation, or intended use of the same or related data. Semantic relations (Gahleitner, & Woess, 2004) are:

• different naming of the same content, i.e., synonyms,
• different abstraction levels: generic terms vs. more specific ones (name vs. first name and last name), i.e., hypernyms or hyponyms, and
• different structures about the same content (separate type vs. part of a type), i.e., meronyms.

In ontology mapping, WordNet is one of the most frequently used sources of background knowledge. It plays the role of an ‘intermediate’ that helps to find semantic heterogeneity. The WordNet library can be accessed with the Java API JWNL². WordNet groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. It is assumed that each sense in a WordNet synset describes a concept. WordNet senses are related among themselves via synonym, hyponym, and hypernym relations. Terms lexicalising the same concept (sense) are considered to be equivalent through the synonym relationship, while hypernyms, hyponyms, and meronyms are considered similar. With this rule, the original ontology terms and their respective token strings are first checked as to whether they have the same part of speech, i.e., noun, verb, adjective, or adverb. The next step is to judge whether they are synonyms, hypernyms, hyponyms, or meronyms. Analogously to Eq. (2), a similarity value is calculated for a pair of nodes according to the weights of the different similarity parts. If it exceeds a given threshold, the pair is considered to be similar.

If the above-mentioned syntactic and semantic WordNet mapping methods still cannot find a mapping between two terms, another semantic-level method based on tree-structured graphs is applied. As rendered with SWOOP³ (a hypermedia-inspired ontology browser and editor based on OWL ontologies, which supports renderers to obtain class/property hierarchy trees as well as definitions of and inferred facts on OWL classes and properties), ontology entities are represented by class/property hierarchy trees. From class hierarchy trees, tree-structured graphs are constructed. Based on the notion of structure graphs (Lian, 2004), a tree-structured graph (ts-graph) is defined as follows.

Definition 1. Given a tree of sets T, N the union of all sets in T, and E the set of edges in T, then ts-g(T) = (N, E) is called the tree-structured graph of T if (a, b) ∈ E holds if and only if a is a parent element of b; a is called the parent node, and b the child node.

In building a ts-graph, breadth-first traversal is applied to traverse a tree hierarchy. To construct a ts-graph, we begin with the root node’s first child node. All its child nodes and their respective parent nodes form edges (parent node, child node) of the ts-graph. The process is repeated until the tree is completely traversed. After the ts-graphs of the two ontologies to be matched are built, the edge sets of both graphs are employed in the mapping process. The relative positions of a pair of nodes (one from each ontology) within their tree-structured graphs determine the semantic-level mapping between them. In Table 1 we summarise three types of relationships between edges, characterising properties, child classes, and parent-and-child classes. From these relationships we can derive, for example, that entities having the same properties are similar. This rule does not always hold true, but it is a strong indicator of similarity.
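The construction of the edge set of a ts-graph by breadth-first traversal can be sketched as follows. This is a minimal illustration: the hierarchy is assumed to be given as a dictionary mapping each class to its child classes, the class names are made up for the example, and, as a simplifying assumption, the edges leaving the root are included as well.

```python
from collections import deque
from typing import Dict, List, Tuple


def ts_graph_edges(tree: Dict[str, List[str]], root: str) -> List[Tuple[str, str]]:
    """Collect (parent node, child node) edges in breadth-first order."""
    edges = []
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in tree.get(parent, []):   # leaves have no entry
            edges.append((parent, child))
            queue.append(child)
    return edges


# Hypothetical class hierarchy used only for illustration.
hierarchy = {
    "Thing": ["Beer", "Ingredient"],
    "Beer": ["Ale", "Lager"],
    "Ingredient": ["Hops"],
}
edges = ts_graph_edges(hierarchy, "Thing")
```

The resulting edge list pairs every class with its direct parent, which is the information the structural rules need when two nodes are compared through their parent and child nodes.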

Table 1. Relationships between classes (rules: given two classes a and b)

No. | Characteristic | Rule
R1 | Properties | If properties (data type property/object property ∉ null) of a and b are similar, a and b are also similar.
R2 | Child classes | If all child classes of a and b are similar, a and b are also similar.
R3 | Parent-and-child classes | If parent class and one of the child classes of a and b are similar, a and b are also similar.

Experimental Results

The implementation of our algorithm was written in Java. It relies on SWOOP for parsing OWL files. All tests were run on a standard PC (with Windows XP). Our test cases include the ontologies “Baseball team” and “Russia” provided by the institute AIFB⁴. To get an impression of the matchmaker’s performance, different measures have to be considered.

Figure 2. Analysis of mapping results with ontologies Russia (series: Precision, Recall, F-Measure; horizontal axis: threshold from 0.8 to 0.95; vertical axis: 0 to 1)

There are various ways to measure how well retrieved information matches intended information. To evaluate the quality of mapping results, we use standard information-retrieval metrics: Recall (r), Precision (p), and F-Measure (Do, Melnik & Rahm, 2002), where Recall is the ratio of the number of relevant entities retrieved to the total number of relevant entities, Precision is the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved, and

\[ \text{F-Measure} = \frac{2rp}{r + p} \tag{3} \]
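The three metrics can be computed directly from the set of mappings found and a reference set of correct mappings. The sketch below is illustrative only; the function name and the sample mapping pairs are hypothetical and are not taken from the experiments.

```python
from typing import Iterable, Tuple

Mapping = Tuple[str, str]


def evaluate(found: Iterable[Mapping],
             reference: Iterable[Mapping]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) of found mappings, per Eq. (3)."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)                 # relevant and retrieved
    p = correct / len(found) if found else 0.0
    r = correct / len(reference) if reference else 0.0
    f = 2 * r * p / (r + p) if (r + p) > 0 else 0.0
    return p, r, f


# Hypothetical example: four mappings found, three in the reference alignment.
found = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
reference = {("a", "x"), ("b", "y"), ("e", "v")}
p, r, f = evaluate(found, reference)
```

Here two of the four retrieved mappings are correct and two of the three reference mappings are found, so precision is 1/2, recall is 2/3, and the F-Measure is 4/7.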

As shown in Figure 2, with the increase of the threshold, the mapping precision for “Russia” gets higher. The shortcoming of this method is its efficiency. Though we trade off efficiency to get more effective mapping results, our algorithm is still applicable to some offline web applications, like information filtering (Hanani, 2001) according to users’ profile.

FUTURE TRENDS

Currently, the results of similarity computations are provided in the form of text documents. In order to present mapping results in a more reasonable and understandable way, one objective of our future work is to treat the results of similarity computations as an ontology. Another objective of our future work is to address security problems connected to ontology mapping, such as trust management in the application of web services.

CONCLUSION

The overall research goal presented in this article is to develop a method for ontology mapping that combines syntactic analysis, which measures the difference between tokens by the edit distance, with semantic analysis based on WordNet semantic relations and on the similarity of the tree-structured graphs representing the ontologies being compared. We have shown empirically that our synthesised mapping method works with relatively high precision.

REFERENCES

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 35-43.

Do, H., Melnik, S., & Rahm, E. (2002). Comparison of Schema Matching Evaluations. In Proceedings of the 2nd International Workshop on Web Databases (German Informatics Society).

Ehrig, M., & Sure, Y. (2004). Ontology Mapping - An Integrated Approach. In Bussler, C., Davis, J., Fensel, D., & Studer, R. (Eds.), Proceedings of the 1st ESWS, Volume 3053 of Lecture Notes in Computer Science, Heraklion, Greece, pp. 76-91. Springer Verlag.

Fellbaum, C. (1999). Wordnet: An Electronic Lexical Database. MIT Press.

Gahleitner, E., & Woess, W. (2004). Enabling Distribution and Reuse of Ontology Mapping Information for Semantically Enriched Communication Services. In Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA’04).

Gruber, T.R. (1993). A Translation Approach to Portable Ontology Specification. Knowledge Acquisition, 5(2), 199-220.

Hanani, U., Shapira, B., & Shoval, P. (2001). Information Filtering: Overview of Issues, Research and Systems. User Modeling and User-Adapted Interaction, 11, 203-259.

Kalyanpur, A., Parsia, B., & Hendler, J. (2005). A Tool for Working with Web Ontologies. International Journal on Semantic Web and Information Systems, 1(1).

Levenshtein, V.I. (1966). Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Cybernetics and Control Theory, 10(8), 707-710.

Li, J. (2004). LOM: A Lexicon-based Ontology Mapping Tool. In Proceedings of the Performance Metrics for Intelligent Systems (PerMIS’04), Information Interpretation and Integration Conference (I3CON), Gaithersburg, MD. Retrieved Aug. 30, 2007, from http://reliant.teknowledge.com/DAML/I3con.pdf.

Lian, W., Cheung, D.W., & Yiu, S.M. (2004). An Efficient and Scalable Algorithm for Clustering XML Documents by Structure. IEEE Transactions on Knowledge and Data Engineering, 16(1), 82-96.

Maedche, A., & Staab, S. (2002). Measuring Similarity between Ontologies. In Proceedings of the European Conference on Knowledge Acquisition and Management (EKAW-2002). LNCS/LNAI 2473, pp. 251-263, Springer-Verlag.

Noy, N.F. (2004). Semantic Integration: A Survey of Ontology-based Approaches. SIGMOD Record, 33(4), 65-70.

Sabou, M., d’Aquin, M., & Motta, E. (2006). Using the Semantic Web as Background Knowledge for Ontology Mapping. In Proceedings of the International Workshop on Ontology Matching (OM-2006), collocated with ISWC’06.

Su, X.M., & Atle Gulla, J. (2006). An Information Retrieval Approach to Ontology Mapping. Data & Knowledge Engineering, 58(1), 47-69.

KEY TERMS

Ontology: As a means to conceptualise and structure knowledge, ontologies are seen as the key to realising the vision of the Semantic Web.

Ontology Mapping: Ontology mapping is required to achieve knowledge sharing and semantic integration in an environment with different underlying ontologies.

Precision: The ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved.

Recall: The ratio of the number of relevant entities retrieved to the total number of relevant entities.

Semantic Web: Envisioned by Tim Berners-Lee, the Semantic Web is a universal medium for data, information, and knowledge exchange. It proposes annotating web resources with machine-processable metadata.

Similarity Measure: A method used to calculate the degree of similarity between mapping sources.

Tokenisation: Tokenisation extracts the valid ontology entities from OWL descriptions.

Tree-Structured Graph: A graphical structure to represent a tree with nodes and a hierarchy of its edges.

ENDNOTES

1 http://www.w3.org/TR/owl-features/
2 http://sourceforge.net/projects/jwordnet
3 www.mindswap.org/2004/SWOOP/
4 The datasets are available from http://www.aifb.uni-karlsruhe.de/WBS/meh/mapping/.

Mathematical Modeling of Artificial Neural Networks

Radu Mutihac
University of Bucharest, Romania

INTRODUCTION

Models and algorithms designed to mimic the information processing and knowledge acquisition of the human brain are generically called artificial or formal neural networks (ANNs), parallel distributed processing (PDP), or neuromorphic or connectionist models. The term network is common today: computer networks exist, communications are referred to as networking, and corporations and markets are structured in networks. The concept of ANN was initially coined as a hopeful vision of anticipating artificial intelligence (AI) synthesis by emulating the biological brain. ANNs are an alternative to symbol programming, aiming to implement neural-inspired concepts in AI environments (neural computing) (Hertz, Krogh, & Palmer, 1991), whereas cognitive systems attempt to mimic the actual biological nervous systems (computational neuroscience). All conceivable neuromorphic models lie in between and are supposed to be simplified but meaningful representations of some reality. In order to establish a unifying theory of neural computing and computational neuroscience, mathematical theories should be developed along with specific methods of analysis (Amari, 1989) (Amit, 1990). The following outlines a tentative, mathematically closed framework for neural modeling.

BACKGROUND

ANNs may be regarded as dynamic systems (discrete or continuous), whose states are the activity patterns and whose controls are the synaptic weights, which control the flux of information between the processing units (adaptive systems controlled by synaptic matrices). ANNs are parallel in the sense that most neurons process data at the same time. This process can be synchronous, if the processing time of an input neuron is the same for all units of the net, and asynchronous otherwise. Synchronous models may be regarded as discrete models. As biological neurons are asynchronous, they require a continuous-time treatment by differential equations.

Alternatively, ANNs can recognize the state of the environment and act on the environment to adapt to given viability constraints (cognitive systems controlled by conceptual controls). Knowledge is then stored in conceptual controls rather than encoded in synaptic matrices, whereas learning rules describe the dynamics of conceptual controls in terms of state evolution in adapting to viability constraints.

The concept of paradigm referring to ANNs typically comprises a description of the form and functions of the processing unit (neuron, node), a network topology that describes the pattern of weighted interconnections among the units, and a learning rule to establish the values of the weights (Domany, 1988). Although paradigms differ in details, they still have a common subset of selected attributes (Jansson, 1991), such as simple processing units, high connectivity, parallel processing, nonlinear transfer function, feedback paths, non-algorithmic data processing, self-organization, adaptation (learning), and fault tolerance. Some extra features might be generalization, useful outputs from fuzzy inputs, energy saving, and potentially high overall operating speed.

The digital paradigm dominating computer science assumes that information must be digitized to avoid noise interference and signal degradation. In contrast, a neuron is highly analog in the sense that its computations are based on spatiotemporal integrative processes of smoothly varying ion currents at the trigger zone rather than on bits. Yet neural systems are highly efficient and reliable information processors.
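The difference between synchronous and asynchronous processing described in this section can be illustrated with a toy network. The sketch below is a simplification under stated assumptions: bipolar threshold units, discrete time, and an asynchronous sweep that visits units one at a time in random order (a discrete stand-in for the continuous-time treatment mentioned above). The function names and the two-unit weight matrix are hypothetical.

```python
import random


def unit_output(weights, state, j):
    """Bipolar threshold response of unit j to the current network state."""
    net = sum(w * s for w, s in zip(weights[j], state))
    return 1 if net >= 0 else -1


def synchronous_update(weights, state):
    """All units read the same old state and switch at the same time."""
    return [unit_output(weights, state, j) for j in range(len(state))]


def asynchronous_update(weights, state, rng):
    """Units are visited one by one in random order; each sees earlier changes."""
    state = list(state)
    for j in rng.sample(range(len(state)), len(state)):
        state[j] = unit_output(weights, state, j)
    return state


# Two mutually inhibiting units (assumed example).
W = [[0, -1], [-1, 0]]
start = [1, 1]
after_sync = synchronous_update(W, start)
after_async = asynchronous_update(W, start, random.Random(0))
print(after_sync, after_async)
```

In this example the synchronous rule flips both units together and keeps oscillating between the two symmetric states, whereas a single asynchronous sweep settles in a state where exactly one unit is active.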

Memory and Learning

The specificity of neural processes consists in their distributive and collective nature. The phenomenon by which biological neural networks (NNs) change in response to extrinsic stimuli is called self-organization. The flexible nature of the human brain, represented by self-organization, seems to be responsible for the learning function which is specific to living organisms. Essentially, learning is an adaptive self-organizing process. From the point of view of training assistance, there are supervised and unsupervised neural classifiers. Supervised classifiers seek to characterize predefined classes by defining measures that maximize in-class similarity and out-class dissimilarity. Supervision may be conducted either by direct comparison of the output with the desired target and estimating the error, or by specifying whether the output is correct or not (reinforcement learning). The measure of success in both cases is given by the ability to recover the original classes for similar but not identical input data. Unsupervised classifiers seek similarity measures without any predefined classes, performing cluster analysis or vector quantization. Neural classifiers organize themselves according to their initial state, the types and frequency of the presented patterns, and correlations in the input patterns by setting up some criteria for classification (Fukushima, 1975) reflecting causal mechanisms. There is no general agreement on the measure of their success since likelihood optimization always tends to favor single-instance classes. Classification as performed by ANNs has essentially a dual interpretation, which is reflected in machine learning too. It could mean either the assignment of input patterns to predefined classes, or the construction of new classes from a previously undifferentiated instance set (Stutz & Cheesman, 1994).
However, the assignment of instances to predefined classes can produce either the class that best represents the input pattern, as in classical decision theory, or the classifier can be used as a content-addressable or associative memory, where the class representative is desired and the input pattern is used to determine which exemplar to produce. While the first task assumes that inputs were corrupted by some processes, the second one deals with incomplete input patterns when retrieval of full information is the goal. Most neural classifiers do not require simultaneous availability of all training data and frequently yield error rates comparable to Bayesian methods without needing prior information. An efficient memory might store and retrieve many patterns, so its dynamics must allow for as many states of activity that are stable against small perturbations as possible. Several approaches dealing with uncertainty, such as fuzzy logic, probabilistic, hyperplane, kernel, and exemplar-based classifiers, can be incorporated into ANN classifiers in applications where only a few data are available (Ng & Lippmann, 1991). The capacity of analog neural systems to operate in unpredictable environments depends on their ability to represent information in context. The context of a signal may be some complex collection of neural patterns, including those that constitute learning. The interplay of context and adaptation is a fundamental principle of the neural paradigm. As only variations and differences convey information, permanent change is a necessity for neural systems rather than a source of difficulty as it is for digital systems.

MATHEMATICAL FRAMEWORK OF NEURONS AND ANNS MODELING

An approach to investigating neural systems in a general frame is the mean field theory (Cooper & Scofield, 1988) from statistical physics, which is suited to highly interconnected systems such as cortical regions. However, there is a big gap between the formal-model level of description of associative memories and the complexity of neural dynamics in biological nets. Neural modeling needs no information concerning correlations of input data; rather, nonlinear processing units and a sufficiently large number of variable parameters ensure the flexibility to adapt to any relationship between input and output data. Models can be altered externally, by adopting a different axiomatic structure, and internally, by revealing new inside structural or functional relationships. Ranking several neuromorphic models is ultimately carried out based on some measure of performance.

Neuron Modeling

Central problems in any artificial system designed to mimic NNs arise from (i) the biological features to be preserved, (ii) the connectivity matrix of the processing units, whose size increases with the square of their number, and (iii) the processing time, which has to be independent of the network size. Biologically realistic models of neurons might minimally include:

• Continuous-valued transfer functions (graded response), as many neurons respond to their input in a continuous way, though the nonlinear relationship between the input and the output of cells is a universal feature;
• Nonlinear summation of the inputs and significant logical processing performed along the dendritic tree;
• Sequences of pulses as output, rather than a simple output level. A single state variable $y_j$ representing the firing rate, even if continuous, ignores much information (e.g., pulse phase) that might be encoded in pulse sequences. However, there is no relevant evidence that phase plays a significant role in most neuronal circuits;
• Asynchronous updating and variable delay of data processing, that is, the time unit elapsing per processing step, $t \to t + 1$, is variable among neurons;
• Variability of synaptic strengths caused by the amount of transmitter substance released at a chemical synapse, which may vary unpredictably. This effect is partially modeled by stochastic generalization of the binary neural models dynamics.

Figure 1. Typical sigmoid transfer functions


Most neuromimetic models are based on the McCulloch and Pitts (1943) neuron as a binary threshold unit:

$$y_j(t+1) = \Theta\left(\sum_{i=1}^{N} w_{ij}\, x_i(t) - \theta_j\right) \tag{1}$$

where $y_j$ represents the state of neuron $j$ (either 1 or 0) in response to the input signals $\{x_i\}_i$, $\theta_j$ stands for a certain threshold characteristic of each neuron $j$, time $t$ is considered discrete, with one time unit elapsing per processing step, and $\Theta$ is the unit step (Heaviside) function:

$$\Theta(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \tag{2}$$

The weights $w_{ij}$, $1 \le i \le N$, represent the strengths of the synapses connecting neuron $i$ to neuron $j$, and may be positive (excitatory) or negative (inhibitory). The weighted sum $\sum_{i=1}^{N} w_{ij}\, x_i(t)$ of the inputs presented at time $t$ to unit $j$ must reach or exceed the threshold $\theta_j$ for the neuron $j$ to fire. Though extremely simplified, a synchronous assembly of such formal neurons is theoretically capable of universal computation (Welstead, 1994) for suitably chosen weights $\{w_{ij}\}_{i,j}$, i.e., it may perform the computations that conventional computers do, though not necessarily as rapidly or conveniently. A general expression that includes some of the above features derived from the digital model (1) is:

$$y_j = f_j\left(\sum_{i=1}^{N} w_{ij}\, x_i - \theta_j\right) \tag{3}$$

where $y_j$ is the continuous-valued state (activation) of unit $j$ and $f_j$ is a general transfer function. Threshold nodes are required for universal approximation and the activation function ought to be nonlinear with bounded output (Fig. 1). The neurons are updated asynchronously in random order at random times.
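The binary threshold unit of equation (1), the step function of equation (2), and the graded unit of equation (3) can be sketched in a few lines. The function names, the logistic choice for the transfer function, and the AND-gate weights are assumptions made for illustration only.

```python
import math


def heaviside(x):
    """Unit step of equation (2): 1 for x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def mcp_neuron(weights, inputs, threshold):
    """Binary threshold unit of equation (1) for one time step."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return heaviside(net - threshold)


def graded_neuron(weights, inputs, threshold, transfer=None):
    """Continuous-valued unit of equation (3).

    The logistic sigmoid is used when no transfer function is given
    (an assumed default, one of the typical sigmoids).
    """
    if transfer is None:
        transfer = lambda z: 1.0 / (1.0 + math.exp(-z))
    net = sum(w * x for w, x in zip(weights, inputs))
    return transfer(net - threshold)


# Illustrative weights and threshold realising a logical AND of two inputs.
and_table = [mcp_neuron([1, 1], [a, b], threshold=2)
             for a in (0, 1) for b in (0, 1)]
print(and_table)
print(graded_neuron([1, 1], [1, 1], threshold=2))
```

With these weights the threshold unit fires only when both inputs are 1, and the graded unit returns the midpoint of the sigmoid when the weighted sum exactly equals the threshold.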

Mathematical Methods for ANNs Modeling

Several neural models and parallel information processing systems inspired by brain mechanisms were proposed. Almost all practical applications were achieved by simulation on conventional (von Neumann) digital computers, so the real parallel processing advantages and massive unit densities hoped for were lost. Mathematical methods approaching various types of ANNs in a unified way and results from linear and nonlinear control systems used to obtain learning algorithms could be grouped in four categories:

1. Tensor products and pseudo-inverses of linear operators, which represent the specific structural connectionism and provide a mathematical explanation of the Hebbian nature of many learning algorithms (Hebb, 1949). This is due to the fact that derivatives of a wide class of nonlinear maps defined on spaces of synaptic matrices are tensor products and because the pseudo-inverse of a tensor product of linear operators is the tensor product of their pseudo-inverses;

2. Convex and nonsmooth analysis, which is particularly suited to nonlinear networks in proving the convergence of two main types of learning rules. The first class consists of algorithms derived from gradient methods and includes the backpropagation update rule, whereas the second class deals with algorithms based on Newton’s method;

3. Control and viability theory (Aubin, 1991), which deals with neural systems that learn viable solutions as control systems satisfying given viability (state) constraints. The purpose is to derive algorithms of control systems emulated by ANNs with feedback regulation. Three classes of learning rules are envisaged: (i) external learning rules based on gradient methods of optimization problems involving nonsmooth functions, (ii) internal learning rules based on the viability theory, and (iii) uniform algorithms;

4. Probability theory and Bayesian statistics. Bayesian statistics and neural modeling may seem extremes of the data-modeling spectrum. ANNs are nonlinear parallel computational devices and their training by example to solve prediction and classification problems is quite a purpose-specific procedure. In contrast, Bayesian statistics is heavily based on coherent inference and clearly defined axioms. Yet both approaches aim to create models in good accordance with the data. ANNs can be interpreted as more flexible versions of traditional regression techniques in the sense of capturing regularities in the data that linear models are not able to handle. However, over-flexible ANNs may discover non-existent correlations in the data. Bayesian inference provides means to infer how flexible a model is warranted by the data and suppresses the tendency to assess spurious structure in the data by incorporating Occam’s razor, which sets the preference for simpler models if they compete to come out with the same result. Learning in ANNs is interpreted as inference on the most probable parameters for a model, given the training data. The search in the model space can also be treated as an inference problem of relative probability for alternative models, given the data. Bayesian inference for ANNs can be implemented numerically by deterministic methods involving Gaussian approximations (MacKay, 1992), or by Monte Carlo methods (Neal, 1996).
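The link between tensor products and Hebbian learning named in the first category can be sketched with an outer-product rule. The example below is a minimal illustration under assumptions that are not in the article: bipolar patterns, an autoassociative memory, a zero diagonal, and recall by a single sign-thresholded update. All names and patterns are hypothetical.

```python
def outer(u, v):
    """Tensor (outer) product of two vectors as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def hebbian_matrix(patterns):
    """Synaptic matrix built as a sum of outer products of stored patterns."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        product = outer(p, p)
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections (assumed)
                    W[i][j] += product[i][j]
    return W


def recall(W, x):
    """One synchronous update: sign of each unit's weighted input."""
    return [1 if sum(w * xj for w, xj in zip(row, x)) >= 0 else -1
            for row in W]


stored = [[1, -1, 1, -1, 1, -1],
          [1, 1, 1, -1, -1, -1]]
W = hebbian_matrix(stored)
noisy = [1, -1, 1, -1, 1, 1]  # first pattern with its last component flipped
restored = recall(W, noisy)
print(restored)
```

Each weight grows when the two units it connects are active together across the stored patterns, which is the Hebbian reading of the outer product. In this small case one update restores the corrupted component of the first pattern.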



Let N formal neurons link, directly for one-layer networks and indirectly for multi-layered ones, an input space X of signals to an output space Y. The state space of the system is the product X × Y of the input-output pairs (x, y), which are generically called patterns or configurations in pattern recognition (PR), data analysis, and classification problems. When X = Y and the inputs of the patterns coincide with the outputs, (x, y), x = y, the system is called autoassociative; if the input and output patterns are different, (x, y), y ≠ x, then the system is heteroassociative. Among all possible input-output patterns, a subset K ⊂ X × Y is chosen as the training set. Most often, the input and output spaces are finite-dimensional linear spaces, $X = \mathbb{R}^N$ and $Y = \mathbb{R}^M$, whereas the input signals may obey some state constraints:

• Real numbers for fuzzy applications, preferably in the intervals [0,1] or [–1, +1];
• Binary numbers that belong to {0, 1};
• Bipolar numbers that belong to {–1, +1}.

If neurons are labeled by $j = 1, 2, \ldots, N$, then let $P(N)$, of cardinal $2^N$, denote the family of subsets of neurons called conjuncts (or coalitions) of neurons. Any connection links a postsynaptic neuron $j$ to conjuncts $S \subset P(N)$ of presynaptic neurons. Each conjunct $S$ preprocesses (or gates) the afferent signals $\{x_i\}_i$ produced by the presynaptic neurons through a function:

$$\vartheta_S : \{x_i\}_{i=1,2,\ldots,N} \mapsto \vartheta_S(x) \tag{4}$$

If conjuncts are reduced to individual neurons, $S = \{i\}$, then the role of control is played by the synaptic matrix:

$$W = \left(w_j^S\right)_{S \subset P(N),\; j=1,2,\ldots,N} \tag{5}$$

where $w_j^S$ represents the entries from $S$ to neuron $j$. The modulus of the synaptic weight $w_j^S$ represents the strength of the connection from conjunct $S$ to the formal neuron $j$, and its sign gives the nature of the connection, counted positively if the synapse is excitatory and negatively if it is inhibitory. Accordingly, the neuron $j$ receives the signal $\sum_{S \subset P(N)} w_j^S \vartheta_S(x)$ (Fig. 2), which determines its state of activity.

Figure 2. Conjuncts of neurons {2,3} and {4,5,6} gate the inputs to neuron 1

Hence the propagation rule that characterizes the network dynamics is:

$$y_j = f_j\left(\left\{w_j^S, \vartheta_S(x)\right\}_{S \subset P(N)}\right) \tag{6}$$

for synchronous neurons (discrete dynamical system), and:

$$x'_j(t) = f_j\left(\left\{w_j^S(t), \vartheta_S(x, t)\right\}_{S \subset P(N)}\right) \tag{7}$$

for asynchronous neurons (continuous dynamical system), where in most cases:

$$f_j\left(\left\{w_j^S, \vartheta_S(x)\right\}_{S \subset P(N)}\right) = g_j\left(\sum_{S \subset P(N)} w_j^S \vartheta_S(x)\right) \tag{8}$$

Here $g_j$ integrates the afferent signals $\{w_j^S \vartheta_S(x)\}$ sent to the neuron $j$ by the afferent neurons through their outputs $\{x_i\}_i$, preprocessed by the conjunct $S$, and delivered to neuron $j$ via the weight $w_j^S$. Usually, the synaptic weights $w_j^S = 0$ when $j \in S$, whereas $w_j^S \neq 0$ when $j \in S$ is associated with autoexcitation. However, when $S = \{j\}$, $S \neq \{i\}$, and $w_j^S = 0$, then the loss term $g_j(0, 0, \ldots, w_{ij}\vartheta_j(x), 0, \ldots, 0)$ represents some kind of forgetting, like a decaying frequency while the neuron $j$ in question is not excited by the others. Several neural systems can be expressed within this framework (Aubin, 1991).

1. Associative memories are defined by the lack of preprocessing, that is, $\vartheta_S(x) = 0$ if $|S| > 1$ or $\vartheta_S(x) = x_i$ if $S = \{i\}$, then:

$$y_j = \sum_{i=1}^{N} w_{ij}\, x_i + c_j \tag{9}$$

Boolean associative memories correspond to $X = \mathbb{R}^N$, $Y = \mathbb{R}$, $K = \{0,1\}^N \times \{0,1\}$, and fuzzy associative memories to $X = \mathbb{R}^N$, $Y = \mathbb{R}$, $K = [0,1]^N \times [0,1]$, hence:

$$y = \sum_{S \subset P(N)} w_S \left(\prod_{i \in S} x_i\right)^{1/|S|} \tag{10}$$

where $|S|$ stands for the number of elements in conjunct $S$.

2. Associative memories with gates are defined by preprocessing and $g_j$ affine:

$$y_j = \sum_{S \subset P(N)} w_j^S \vartheta_S(x) + c_j \tag{11}$$

3. Nonlinear automata are defined by various forms of $g_j$:

$$f_j(x, W) = g_j\left(\sum_{S \subset P(N)} w_j^S \vartheta_S(x)\right) \tag{12}$$

When appropriate, thresholds $\theta \in Y$ may be integrated in the processing function $g$:

$$g(z) = h(z - \theta) \tag{13}$$

If the threshold is part of the controls to be adjusted during training, it may be incorporated as an entry of an extended synaptic matrix:

$$\hat{W} = \left(\hat{w}_{ij}\right)_{i=1,2,\ldots,N;\; j=0,1,\ldots,M} \in L(\mathbb{R} \times X, Y), \qquad \hat{w}_{ij} = \begin{cases} w_{ij} & \text{for } i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, M \\ \theta_i & \text{for } i = 1, 2, \ldots, N;\ j = 0 \end{cases} \tag{14}$$

Particularly simple is the perceptron: $X = \mathbb{R}^N$, $Y = \mathbb{R}$, $\theta \in Y$, $\vartheta_S(x) = 0$ if $|S| > 1$, then:

$$y = \begin{cases} 0 & \text{if } \sum_{i=1}^{N} w_i x_i < \theta \\ 1 & \text{if } \sum_{i=1}^{N} w_i x_i \ge \theta \end{cases}$$
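The role of conjuncts in the framework above can be illustrated with a small sketch in which each conjunct gates its inputs through the geometric mean of their product, as in the Boolean and fuzzy associative memories, and a perceptron uses single-neuron conjuncts only. The function names, the dictionary encoding of weights, and the particular weight values are assumptions chosen for illustration.

```python
import math


def gate(x, conjunct):
    """Preprocessing by a conjunct: geometric mean of the gated inputs."""
    product = math.prod(x[i] for i in conjunct)
    return product ** (1.0 / len(conjunct))


def gated_unit(x, weights):
    """Weighted sum over conjuncts; `weights` maps a tuple of indices to w_S."""
    return sum(w * gate(x, conjunct) for conjunct, w in weights.items())


def perceptron(x, w, theta):
    """Threshold unit without preprocessing: fires when the sum reaches theta."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0


# Hypothetical weights: the pair conjunct (0, 1) lets one unit compute XOR.
xor_weights = {(0,): 1.0, (1,): 1.0, (0, 1): -2.0}
xor_table = [gated_unit((a, b), xor_weights) for a in (0, 1) for b in (0, 1)]

# Without conjuncts of size two, these weights give a logical OR.
or_table = [perceptron((a, b), (1, 1), theta=1) for a in (0, 1) for b in (0, 1)]

# On fuzzy inputs in [0, 1] the same gate acts as a graded conjunction.
fuzzy_and = gated_unit((0.25, 1.0), {(0, 1): 1.0})
print(xor_table, or_table, fuzzy_and)
```

The design point of the sketch is that gating moves part of the nonlinearity into the preprocessing step: a function such as XOR becomes a plain weighted sum once the pair conjunct is available as an input.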

veniences and make them suitable for modeling rather complex systems involving plenty of information.

0 agents (PLKn) (Fagin, Halpern, Moses & Vardi, 1995). A special terminology, notation and Kripke models are used in this framework. The set of relational symbols Rel in PLKn consists of the natural numbers [1..n], which represent names of agents. The notation for modalities is: if i ∈ [1..n] and φ is a formula, then (K_i φ) and (S_i φ) are used instead of ([i] φ) and (〈i〉 φ). These formulas are read as “(an agent) i knows φ” and “(an agent) i can suppose φ”. For every agent i ∈ [1..n] in every model M = (D, I), the interpretation I(i) is an “indistinguishability relation”, i.e. an equivalence relation² between states that the agent i cannot distinguish. Every model M where all agents are interpreted in this way is denoted by (D, ~_1, …, ~_n, I), with explicit I(1) = ~_1, …, I(n) = ~_n, instead of the brief standard notation (D, I). An agent knows some “fact” φ in a state s of a model M if the fact is valid in every state s′ of this model that the agent cannot distinguish from s:

• s ⊨_M (K_i φ) iff s′ ⊨_M φ for every state s′ ~_i s.

Similarly, an agent can suppose a “fact” φ in a state s of a model M if the fact is valid in some state s′ of this model that the agent cannot distinguish from s:

Modal Logics for Reasoning about Multiagent Systems

• s ⊨_M (S_i φ) iff s′ ⊨_M φ for some state s′ ~_i s.

The above possible worlds semantics of knowledge is due to pioneering research (Hintikka, 1962).
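The two semantic clauses for “knows” and “can suppose” can be sketched over a finite model. In this sketch an agent's indistinguishability relation is induced by an observation function (two states are indistinguishable when the agent observes the same thing in both), which guarantees an equivalence relation. The function names and the toy model are hypothetical.

```python
def indistinguishable(states, observe, s):
    """Equivalence class of s: all states with the same observation as s."""
    return [t for t in states if observe(t) == observe(s)]


def knows(states, observe, fact, s):
    """The fact holds in every state the agent cannot distinguish from s."""
    return all(fact(t) for t in indistinguishable(states, observe, s))


def can_suppose(states, observe, fact, s):
    """The fact holds in some state the agent cannot distinguish from s."""
    return any(fact(t) for t in indistinguishable(states, observe, s))


# Toy model (assumed): a state is a pair of bits (a, b);
# agent 1 observes only a, agent 2 observes only b.
D = [(a, b) for a in (0, 1) for b in (0, 1)]
sees_a = lambda state: state[0]
sees_b = lambda state: state[1]
a_is_one = lambda state: state[0] == 1

s = (1, 0)
print(knows(D, sees_a, a_is_one, s))        # agent 1 observes a directly
print(knows(D, sees_b, a_is_one, s))        # agent 2 cannot rule out a = 0
print(can_suppose(D, sees_b, a_is_one, s))  # but a = 1 remains possible
```

The two modalities are dual in this encoding: an agent can suppose a fact exactly when it does not know the negation of that fact.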

Temporal Logic with Actions

Another propositional polymodal logic is Computation Tree Logic with actions (Act-CTL). Act-CTL is a variant of the basic propositional branching-time temporal logic Computation Tree Logic (CTL) (Emerson, 1990; Clarke, Grumberg & Peled, 1999). In Act-CTL the set of relational symbols consists of action symbols Act. Each action symbol can be interpreted by an “instant action” that is executable in one indivisible moment of time. The Act-CTL notation for basic modalities is: if b ∈ Act and φ is a formula, then (A_b X φ) and (E_b X φ) are used instead of ([b] φ) and (〈b〉 φ). The syntax of Act-CTL also has some other special constructs associated with action symbols: if b ∈ Act and φ and ψ are formulas, then (A_b G φ), (A_b F φ), (E_b G φ), (E_b F φ), A_b(φ U ψ) and E_b(φ U ψ) are also formulas of Act-CTL. In formulas of Act-CTL the prefix “A” is read as “for every future”, “E” as “for some future”, the suffix “X” as “next state”, “G” as “always” or “globally”, “F” as “sometimes” or “future”, the infix “U” as “until”, and a sub-index “b” as “in b-run(s)”. We have already explained the semantics of (A_b X φ) and (E_b X φ) by reference to ([b] φ) and (〈b〉 φ). The constructs “A_b G”, “A_b F”, “E_b G”, and “E_b F” can be expressed in terms of “A_b(…U…)” and “E_b(…U…)”, for example: (E_b F φ) ↔ E_b(true U φ). Thus let us define below the semantics of “A_b(…U…)” and “E_b(…U…)” only. Let M = (D, I) be a model. If b ∈ Act is an action symbol, then a partial b-run is a sequence of states s_0, …, s_k, s_(k+1), … ∈ D (maybe infinite) such that (s_k, s_(k+1)) ∈ I(b) for every consecutive pair of states within this sequence. If b ∈ Act is an action symbol, then a b-run is an infinite partial b-run or a finite partial b-run that cannot be continued³. Then the semantics of the constructs “A_b(…U…)” and “E_b(…U…)” can be defined as follows:

• s ⊨_M A_b(φ U ψ) iff for every b-run s_0, …, s_k, … that starts in s (i.e. s_0 = s) there exists some n ≥ 0 for which s_n ⊨_M ψ and s_k ⊨_M φ for every k ∈ [0..(n−1)];
• s ⊨_M E_b(φ U ψ) iff for some b-run s_0, …, s_k, … that starts in s (i.e. s_0 = s) there exists some n ≥ 0 for which s_n ⊨_M ψ and s_k ⊨_M φ for every k ∈ [0..(n−1)].

The standard branching-time temporal logic CTL can be treated as Act-CTL with a single implicit action symbol.
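On a finite model, the two “until” constructs defined above can be evaluated by the least-fixpoint iteration that is standard in CTL model checking. The sketch below assumes that the interpretation of one action symbol b is given as an adjacency dictionary and that formulas are Python predicates; these encodings, the function name `until`, and the four-state model are hypothetical.

```python
def until(states, step, phi, psi, universal):
    """States satisfying A_b(phi U psi) if universal, else E_b(phi U psi).

    `step[s]` lists the b-successors of s. A state without successors ends
    every b-run, so it can satisfy the universal variant only through psi.
    """
    satisfied = {s for s in states if psi(s)}
    changed = True
    while changed:
        changed = False
        for s in states:
            if s in satisfied or not phi(s):
                continue
            successors = step.get(s, ())
            if universal:
                ok = bool(successors) and all(t in satisfied for t in successors)
            else:
                ok = any(t in satisfied for t in successors)
            if ok:
                satisfied.add(s)
                changed = True
    return satisfied


# Assumed model: state 2 loops forever, state 3 has no b-successor.
D = {0, 1, 2, 3}
I_b = {0: [1, 2], 1: [3], 2: [2], 3: []}
phi = lambda s: s in (0, 1, 2)
psi = lambda s: s == 3

e_until = until(D, I_b, phi, psi, universal=False)
a_until = until(D, I_b, phi, psi, universal=True)
e_future = until(D, I_b, lambda s: True, psi, universal=False)  # E_b F psi
print(e_until, a_until, e_future)
```

State 0 satisfies the existential variant through the run 0, 1, 3, but not the universal one, because the run that enters the loop at state 2 never reaches a state where ψ holds.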

Combined Logic of Knowledge, Actions and Time

There are many combined polymodal logics for reasoning about multiagent systems. Perhaps the most advanced is Belief-Desire-Intention (BDI) logic (Wooldridge, 1996; Wooldridge, 2002). An agent’s beliefs correspond to information the agent has about the world. (This information may be incomplete or incorrect. An agent’s knowledge in BDI is just a true belief.) An agent’s desires correspond to the allocated tasks. An agent’s intentions represent desires that it has committed to achieving. Admissible actions are actions of individual agents; they may be constructed from primitive actions by means of composition, nondeterministic choice, iteration, and parallel execution. But the semantics of BDI and reasoning in BDI are too complicated for a short encyclopedia article. Instead, let us discuss below a simple example of a combined logic of knowledge, actions and time, namely the Propositional Logic of Knowledge and Branching Time for n>0 agents, Act-CTL-Kn (Garanina, Kalinina, & Shilov, 2004; Shilov, Garanina & Choe, 2006; Shilov & Garanina, 2006). First we provide a formal definition of Act-CTL-Kn, then discuss some pragmatics, and then, in the next section, introduce model checking as a reasoning mechanism. Let [1..n] be a set of agents (n > 0), and Act be a finite alphabet of action symbols. The syntax of Act-CTL-Kn admits epistemic modalities K_i and S_i for every i ∈ [1..n], and branching-time constructs A_b X, E_b X, A_b G, E_b G, A_b F, E_b F, A_b(…U…), and E_b(…U…) for every b ∈ Act. Semantics is defined in terms of entailment in environments. An (epistemic) environment is a tuple E = (D, ~_1, …, ~_n, I) such that (D, ~_1, …, ~_n) is a model for PLKn, and (D, I) is a model for Act-CTL. The entailment relation ⊨ is defined by induction according to the standard definition for propositional connectives (see semantics of EPDL), and the above definitions of epistemic modalities and branching-time constructs.

We are mostly interested in trace-based perfect recall synchronous environments generated from background finite environments. “Generated” means that possible “worlds” are runs of finite-state machine(s). There are several ways to define the semantics of combined logics on runs. In particular, there are two extreme cases: Forgetful Asynchronous Systems (FAS) and Synchronous systems with Perfect Recall (PRS). “Perfect recall” means that every agent has a log-file with all his/her observations along a run, while “forgetful” means that information of this kind is not available. “Synchronous” means that every agent can distinguish runs of different lengths, while “asynchronous” means that some runs of different lengths may be indistinguishable. It is quite natural that in the FAS case the combined logic Act-CTL-Kn can express as much as it can express in the background finite system. In contrast, in the PRS case Act-CTL-Kn becomes much more expressive than in the background finite environment. The importance of combined logics in the framework of trace-based semantics with synchronous perfect recall relies upon their characteristic as logics of an agent’s learning or knowledge acquisition. We would like to illustrate this characteristic by the following single-agent⁴ Fake Coin Puzzle FCP(N,M).

A set consists of (N+1) enumerated coins. The last coin is a valid one. A single coin with a number in [1..N] is fake, but the other coins with numbers in [1..(N+1)] are valid. All valid coins have the same weight, which differs from the weight of the fake. Is it possible to identify the fake by balancing coins M times at most?

In FCP(N,M) the agent (i.e. a person who has to solve the puzzle) knows neither the number of the fake coin nor whether it is lighter or heavier than the valid coins. Nevertheless, this number is in [1..N], and the fake coin is either lighter (l) or heavier (h). The agent can make balancing queries and read the balancing result after each query. Every balancing query is an action b(L,R) which consists in balancing two disjoint sets of coins: with numbers L ⊆ [1..N+1] on the left pan, and with numbers R ⊆ [1..N+1] on the right pan, |L| = |R|. There are three possible balancing results: “<”, “>”, and “=”, which mean that the left pan is lighter than, heavier than, or equal to the right pan, respectively. Of course, there are initial states (marked by ini) which represent a situation when no query has been made. Let us summarize. The agent acts in the environment generated from a finite space [1..N]×{l,h}×{<, >, =, ini}. His/her admissible actions are balancing queries b(L,R) for disjoint L, R ⊆ [1..N+1] with |L| = |R|. The only information available to the agent (i.e., which gives him/her an opportunity to distinguish states) is a balancing result. The agent should learn fake_coin_number from a sequence which may start from any initial state and then consists of M queries and corresponding results. Hence the single-agent logic Act-CTL-K1 seems to be a very natural framework for expressing FCP(N,M) as follows: to validate or refute whether

s ⊨_E (E_B X … M times … E_B X (∨_{f∈[1..N]} K_1(fake_coin_number = f)) …)

for every initial state s, where E is a PRS environment generated from the finite space [1..N]×{l,h}×{<, >, =, ini}, and B is the balancing query ∪_{L,R⊆[1..N+1]} b(L,R).

FUTURE TRENDS: MODEL CHECKING FOR COMBINED LOGICS

The model checking problem for a combined logic (Act-CTL-Kn in particular) and a class of epistemic environments (e.g., PRS or FAS environments) is to validate or refute s ⊨_E φ, where E is a finitely-generated environment in the class, s is an “initial state” of the environment E, and φ is a formula of the logic. The above re-formulation of FCP(N,M) is a particular example of a model checking problem for a formula of Act-CTL-Kn and some finitely-generated perfect recall environment. Papers (Meyden & Shilov, 1999) and (Garanina, Kalinina & Shilov, 2004) have demonstrated that if the number of agents n>1, then the model checking problem in perfect recall synchronous systems is very hard or even undecidable. In particular, it has non-elementary⁵ upper and lower time bounds for Act-CTL-Kn. Papers (Meyden & Shilov, 1999) and (Shilov, Garanina & Choe, 2006) have suggested tree-like data structures to make model checking of combinations of temporal and action logics with the propositional logic of knowledge PLKn “feasible”. Alternatively, (van der Hoek & Wooldridge, 2002; Lomuscio & Penczek, 2003) have suggested either simplifying the language of the logics to be combined, or considering agents with “bounded” recall.
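The Fake Coin Puzzle FCP(N,M) described above can also be explored by brute force for small N and M. The sketch below is not the model checking procedure of the cited papers. It is a direct search under stated assumptions: the agent's knowledge under perfect recall is represented as the set of states (fake coin number, lighter or heavier) consistent with all results seen so far, and the question asked is whether some querying strategy reveals the fake coin's number whatever the results are. All function names are hypothetical.

```python
from itertools import combinations


def balance(state, left, right):
    """Result of the query b(left, right) in a state (fake number, 'l' or 'h')."""
    fake, kind = state
    if fake in left:
        return '<' if kind == 'l' else '>'
    if fake in right:
        return '>' if kind == 'l' else '<'
    return '='


def all_queries(n_coins):
    """All pairs of disjoint, equally sized, non-empty sets of coin numbers."""
    coins = range(1, n_coins + 1)
    queries = []
    for size in range(1, n_coins // 2 + 1):
        for left in combinations(coins, size):
            rest = [c for c in coins if c not in left]
            for right in combinations(rest, size):
                queries.append((set(left), set(right)))
    return queries


def can_learn(knowledge, m, queries):
    """True if at most m further queries always reveal the fake coin's number."""
    if len({fake for fake, _ in knowledge}) <= 1:
        return True
    if m == 0:
        return False
    for left, right in queries:
        # Split the states the agent cannot distinguish by the balancing result.
        classes = {}
        for state in knowledge:
            classes.setdefault(balance(state, left, right), []).append(state)
        if all(can_learn(c, m - 1, queries) for c in classes.values()):
            return True
    return False


def fcp(n, m):
    """Decide the puzzle for n suspect coins, one extra valid coin, m balancings."""
    states = [(fake, kind) for fake in range(1, n + 1) for kind in 'lh']
    return can_learn(states, m, all_queries(n + 1))


print(fcp(2, 1), fcp(3, 1), fcp(5, 2))
```

Under these assumptions, one balancing suffices for two suspect coins but not for three, and two balancings suffice for five suspect coins but not for six. The search enumerates every query at every step, so it is only practical for very small instances.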


CONCLUSION

Combinations of temporal logics and logics of actions with logics of knowledge have become a topical research area because studying the interactions between knowledge and actions is important for reasoning about real-time multiagent systems. A comprehensive survey of logics, techniques, and results is beyond the scope of this article. Its primary aim was to provide a semi-formal introduction to the field of combined modal logics and to discuss their utility for reasoning about multiagent systems. Emphasis has been placed on model checking of trace-based knowledge-temporal specifications of perfect recall synchronous systems.

REFERENCES

Bull, R., & Segerberg, K. (2001). Basic Modal Logic. Handbook of Philosophical Logic, v.3. D. Gabbay and F. Guenthner editors. Kluwer Academic Publishers.

Clarke, E., Grumberg, O., & Peled, D. (1999). Model Checking. MIT Press.

Dixon, C., Nalon, C., & Fisher, M. (2004). Tableau for Logics of Time and Knowledge with Interactions Relating to Synchrony. Journal of Applied Non-Classical Logics, 14(4), 397-445.

Emerson, E.A. (1990). Temporal and Modal Logic. Handbook of Theoretical Computer Science (B), J. van Leeuwen, A.R. Meyer, M. Nivat, M. Paterson, D. Perrin editors. Elsevier and The MIT Press.

Fagin, R., Halpern, J.Y., Moses, Y., & Vardi, M.Y. (1995). Reasoning about Knowledge. MIT Press.

Garanina, N.O., Kalinina, N.A., & Shilov, N.V. (2004). Model checking knowledge, actions and fixpoints. Proceedings of Concurrency, Specification and Programming Workshop CS&P’2004. Humboldt Universitat, Berlin, Informatik-Bericht, 170, 351-357.

Halpern, J.Y., & Vardi, M.Y. (1986). The Complexity of Reasoning About Knowledge and Time. Proceedings of the eighteenth annual ACM symposium on Theory of computing, 304-315.

Halpern, J.Y., van der Meyden, R., & Vardi, M.Y. (2004). Complete Axiomatizations for Reasoning about Knowledge and Time. SIAM Journal on Computing, 33(3), 674-703.

Harel, D., Kozen, D., & Tiuryn, J. (2000). Dynamic Logic. MIT Press.

Hintikka, J. (1962). Knowledge and Belief. Cornell University Press.

van der Hoek, W., & Wooldridge, M.J. (2002). Model Checking Knowledge and Time. Lecture Notes in Computer Science, 2318, 95-111.

Lomuscio, A., & Penczek, W. (2003). Verifying Epistemic Properties of Multi-agent Systems via Bounded Model Checking. Fundamenta Informaticae, 55(2), 167-185.

van der Meyden, R., & Shilov, N.V. (1999). Model Checking Knowledge and Time in Systems with Perfect Recall. Lecture Notes in Computer Science, 1738, 432-445.

Rescher, N. (2005). Epistemic Logic. A Survey of the Logic of Knowledge. University of Pittsburgh Press.

Shilov, N.V., Garanina, N.O., & Choe, K.-M. (2006). Update and Abstraction in Model Checking of Knowledge and Branching Time. Fundamenta Informaticae, 72(1-3), 347-361.

Wooldridge, M. (1996). Practical reasoning with procedural knowledge: A logic of BDI agents with know-how. Lecture Notes in Artificial Intelligence, (1085), 663–678.

Wooldridge, M. (2002). An Introduction to MultiAgent Systems. John Wiley & Sons Ltd.

KEy TERmS Environment: A labeled transition system that provides an interpretation for logic of knowledge, actions and time simultaneously. Labeled Transition Systems or Kripke Model: An oriented labeled graph (infinite maybe). Nodes of the graph are called states or worlds, some of them are marked by propositional symbols that are interpreted to be valid in these nodes. Edges of the graph are marked by relational symbols that are interpreted by these edges. 1093

M


Logic of Actions: A polymodal logic that associate modalities like “always” and “sometimes” with action symbols that are to be interpreted in labeled transition systems by transitions. A so-called Elementary Propositional Dynamic Logic (EPDL) is sample logic of actions. Logic of Knowledge or Epistemic Logic: A polymodal logic that associate modalities like “know” and “suppose” with enumerated agents or groups of agents. Agents are to be interpreted in labeled transition systems by equivalence “indistinguishability” relations. A socalled Propositional Logic of Knowledge of n agents (PLKn) is sample epistemic logic. Logic of Time or Temporal Logic: A polymodal logic with a number of modalities that correspond to “next time”, “always”, “sometimes”, and “until” to be interpreted in labeled transition systems over discrete partial orders. For example, Linear Temporal Logic (LTL) is interpreted over linear orders. Model Checking Problem: An algorithmic problem to validate or refute a property (presented by a formula) in a state of a model (from a class of Kripke structures). For example, model checking problem for combined logic of knowledge, actions and time in initial states of perfect recall finitely generated environments.


Multiagent System: A collection of communicating and collaborating agents, where every agent has some knowledge, intentions, abilities, and possible actions.
Perfect Recall Synchronous Environment: An environment for modeling the behavior of a perfect recall synchronous system.
Perfect Recall Synchronous System: A multiagent system in which every agent records its observations at every moment of time while the system runs.

ENDNOTES
1 Due to the pioneering papers of Saul Aaron Kripke (born in 1940) on models for modal logics.
2 A symmetric, reflexive, and transitive binary relation on D.
3 That is, for the last state s there is no state s′ such that (s,s′)∈I(b).
4 For a multiagent example, refer to the Muddy Children Puzzle (Fagin, Halpern, Moses & Vardi, 1995).
5 I.e., it is not bounded by a tower of exponents of any fixed height.


Modularity in Artificial Neural Networks
Ricardo Téllez
Technical University of Catalonia, Spain
Cecilio Angulo
Technical University of Catalonia, Spain

INTRODUCTION
The concept of modularity is a main concern in the generation of artificially intelligent systems. Modularity is a ubiquitous organization principle found in natural and artificial complex systems (Callebaut, 2005). Evidence from biological and philosophical points of view (Caelli and Wen, 1999) (Fodor, 1983) indicates that modularity is a requisite for complex intelligent behaviour. Besides, from an engineering point of view, modularity seems to be the only way to construct complex structures. Hence, if complex neural programs for complex agents are desired, modularity is required. This article introduces the concepts of modularity and module from a computational point of view, and shows how they apply to the generation of neural programs based on modules. Two levels at which modularity can be implemented, strategic and tactical, are identified. The article presents how they work and how they can be combined to generate a completely modular controller for a neural network based agent.

BACKGROUND
When designing a controller for an agent, there are two main approaches: either a single module contains all the agent’s required behaviours (monolithic approach), or the global behaviour is decomposed into a set of simpler sub-behaviours, each implemented by one module (modular approach). Monolithic controllers implement in a single module all the required mappings between the agent’s inputs and outputs. The advantage is that neither the required sub-behaviours nor the relations between them need to be identified. The drawback is that, if the complexity of the controller is high, it may be impossible in practice to design such a controller without large interference between its different parts. When a modular controller is used instead, the global controller is built from a group of sub-controllers, so the required sub-controllers and the interactions by which they generate the final global output must be defined.
Despite the disadvantages of the modular approach (Boers, 1992), complex behaviour cannot be achieved without some degree of modularity (Azam, 2000). Modular controllers allow new knowledge to be acquired without forgetting what was previously learned, which is a serious problem for monolithic controllers when the number of knowledge rules to be learned is large (De Jong et al., 2004). They also reduce the effects of the credit assignment problem, in which the learning mechanism must provide a learning signal based on the current performance of the controller. This learning signal must be used to modify the controller parameters in a way that improves the controller behaviour. In large controllers, it becomes difficult to determine which parameters to change on the basis of the global learning signal. Modularization helps to keep the controllers small, reducing the effect of the credit assignment problem.
Modular approaches also reduce the complexity of the task to be solved (De Jong et al., 2004). In a monolithic system, all variables are optimized at the same time, resulting in a large optimization space; in modular systems, optimization is performed independently for each module, resulting in reduced search spaces. Modular systems are scalable, in the sense that existing modules can be used to generate new ones when problems become more complex, or new modules can simply be added to the existing ones. Modular systems are also robust, since damage to one module results in the loss of the abilities provided by that module, while the rest of the system keeps functioning.
Modularity can be a solution to the problem of neural interference (Di Ferdinando et al., 2000), which is encountered in monolithic networks. This phenomenon occurs when an already trained network loses part of its knowledge, either when it is retrained to perform a different task, called temporal cross-talk (Jacobs et al., 1991), or when it is trained on two or more different tasks at the same time, called spatial cross-talk (Jacobs, 1990). Modular systems allow modules to be reused in different activities, without re-implementing the function they represent for each task (De Jong et al., 2004) (Garibay et al., 2004).

Modularity
From a computational point of view, modularity is understood as the property that some complex computational tasks can be divided into simpler subtasks. Each of those simpler subtasks is then performed by a specialized computational system called a module, and the solution of the complex task is generated from the solutions of the simpler subtask modules (Azam, 2000). From a mathematical point of view, modularity is based on the idea of a subset of system variables that may be optimized independently of the other system variables (De Jong et al., 2004). In any case, the use of modularity implies that a structure exists in the problem to be solved.
In modular systems, each of the system modules operates primarily according to its own intrinsically determined principles. Modules are tightly integrated within the whole system, but each follows its own implementation independently of the other modules. They may have distinct inputs or the same inputs, but each generates its own response. When the interactions between modules are weak and modules act independently from each other, the modular system is called nearly decomposable (Simon, 1969). Other authors have identified this type of modular system as separable problems (Watson et al., 1998). This is by far one of the most studied types of modularity, and it can be found everywhere from business to biological systems. In nearly decomposable modular systems, the final optimal solution of a global task is obtained as a combination of the optimal solutions of the simpler ones (the modules).
However, the existence of a decomposition for a problem does not imply that the sub-problems are completely independent from each other. In fact, a system may be modular and still have interdependencies between modules. A decomposable problem is defined as a problem that can be decomposed into other sub-problems, but in which the optimal solution of one of those sub-problems depends on the optimal solution of some of the others (Watson, 2002). Solving such modular systems is more difficult than solving a typical separable modular system, and in the literature they are usually treated as monolithic ones.
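The difference between optimizing a separable problem module by module and optimizing it monolithically can be illustrated with a small sketch. The cost functions `cost_a` and `cost_b`, the candidate range, and the exhaustive search are illustrative assumptions, not taken from the cited works; the point is only that, when the total cost is a sum of independent terms, both searches reach the same optimum while the modular one evaluates far fewer candidates.

```python
from itertools import product

# Hypothetical separable cost: the total is a sum of per-module costs,
# so each variable can be optimized without looking at the other.
def cost_a(x):
    return (x - 3) ** 2

def cost_b(y):
    return (y + 1) ** 2

def total_cost(x, y):
    return cost_a(x) + cost_b(y)

candidates = range(-5, 6)  # 11 candidate values per variable

# Monolithic search: every (x, y) pair is evaluated, 11 * 11 = 121 evaluations.
joint_best = min(product(candidates, candidates), key=lambda p: total_cost(*p))

# Modular search: each module is optimized alone, 11 + 11 = 22 evaluations.
modular_best = (min(candidates, key=cost_a), min(candidates, key=cost_b))

print(joint_best, modular_best)
```

If a coupling term between `x` and `y` were added to `total_cost`, the two searches would no longer be guaranteed to agree, which corresponds to the interdependent (decomposable) case described above.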

Module
Most works that use modularity adopt the definition of module given by Fodor (1983), which is very similar to the concept of object in object-oriented programming: a module is a domain-specific processing element, which is autonomous and cannot influence the internal workings of other modules. A module can influence another only through its output, that is, the result of its computation. Modules do not know about a global problem to solve or global tasks to accomplish, and are driven by specific stimuli. The final response of a modular system to a global task is given by the integration of the responses of the different modules by a special unit. The global architecture of the system defines how this integration is performed. The integration unit must decide how to combine the outputs of the modules to produce the final answer of the system, and it is not allowed to feed information back into the modules.

MODULAR NEURAL NETWORKS
When modularity is applied to the design of a modular neural network (MNN) based controller, three general steps are commonly observed: task decomposition, training, and multi-module decision-making (Auda and Kamel, 1999). Task decomposition consists of dividing the required controller into several sub-controllers and assigning each sub-controller to one neural module. Modules should be trained either in parallel or in separate processes, following a sequence indicated by the modular design. Finally, when the modules have been prepared, a multi-module decision-making strategy is implemented, which indicates how all those modules should interact in order to generate the global controller response. This modularization approach can be seen as operating at the level of the task.
The previous general steps only apply to the modularization of nearly decomposable or separable problems. Decomposable problems, those where strong interdependencies between modules exist, are not considered under that decomposition mechanism, and they are treated as monolithic ones. This article introduces a distinction between two modular levels: the current modularization level, which concentrates on task sub-division, and a new modularization performed at the level of the devices or elements. These approaches are called strategic and tactical, respectively.

Strategic and Tactical Modularity
Borrowing the concepts from game theory, strategy deals with what has to be done in a given situation in order to perform a task, by dividing the global target solution into all the sub-targets required to accomplish the global one. Tactics, on the other hand, deals with how plans are going to be implemented, that is, how to use the resources available at that moment to accomplish each of those sub-targets.
Strategic modularity in neural controllers is defined as the modular approach that identifies which sub-goals an agent requires in order to solve a global problem. Each sub-goal identified is implemented by a monolithic neural net. In contrast, tactical modularity in neural controllers is defined as the approach that identifies which inputs and outputs are necessary for the implementation of a given goal, and designs a single module for each input and output. In tactical modularity, modularization is performed at the level of the elements (any meaningful input or output of the neural controller) that are actually involved in the accomplishment of the task.
To our knowledge, all the research based on neural modularity and divide-and-conquer principles focuses its division at the strategic level, that is, on how to divide the global problem into its sub-goals. Each of those sub-goals is then implemented by a single neural controller, and the final goal is generated by combining the outputs of those sub-goals in some way. This article proposes, first, the definition of two different levels of modularity and, second, the use of tactical modularity as a new level of modularization that makes room for decomposable modularity. Tactical modularization is expected to help in the generation of complex neural controllers when many inputs and outputs must be taken into account. This is supported by the application examples below, where the two types of modularity are compared against monolithic approaches.

Implementing Modularity
Strategic modularity can be implemented by any of the modular approaches that already exist in the literature; see (Auda and Kamel, 1999) for a complete description. Any of the modularization methods described there is strategic, although it was not given that name, and they can, in general, be integrated with tactical modularity. The term strategic is used for those modular approaches in order to differentiate them from the newly proposed modularity.
Tactical modularity defines modularity at the level of the elements involved in the generation of a sub-goal. By elements, we mean the inputs required to generate the sub-goal and the outputs that define the sub-goal solution. Each of those elements forms a tactical module, implemented by a simple neural network. That is, tactical modularity is implemented by designing a completely distributed controller composed of small processing modules around each of the meaningful elements of the problem.
The schematics of a tactical module are shown in Figure 1. Each tactical module is connected to its associated element, controlling it and processing the information coming in (for input elements) or going out (for output elements). This kind of connectivity means that the processing element is the one that decides which commands must be sent to the output element, or how a value received from an input element must be interpreted. The processing element is said to be responsible for its associated element. In order to generate a complete answer for the sub-goal, all the tactical modules are connected to each other, and the output of each module is sent back to all the others. With this connectivity, each module is aware of what the others are doing, which allows the different modules to coordinate in generating a common answer and avoids the need for a central coordinator.
The resulting architecture is a completely distributed MNN, in which neural modules are independent but have strong interactions with the other modules. Figure 2 shows an example of the connectivity of a tactical modular neural controller for a simple system composed of two input elements and two output elements.
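The connectivity described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `TacticalModule` and `TacticalController`, the single-layer `tanh` networks, the random initial weights, and the choice of feeding output modules a zero element value are all hypothetical.

```python
import math
import random


class TacticalModule:
    """One small single-layer network responsible for one element (hypothetical design)."""

    def __init__(self, n_modules, rng):
        # One weight for the module's own element value, one per module output
        # fed back from the previous step, and one bias weight.
        self.weights = [rng.uniform(-1.0, 1.0) for _ in range(1 + n_modules + 1)]

    def step(self, element_value, previous_outputs):
        inputs = [element_value] + list(previous_outputs) + [1.0]
        activation = sum(w * v for w, v in zip(self.weights, inputs))
        return math.tanh(activation)


class TacticalController:
    """Fully distributed controller: no central unit combines the modules."""

    def __init__(self, n_inputs, n_outputs, seed=0):
        rng = random.Random(seed)
        self.n_inputs = n_inputs
        n_modules = n_inputs + n_outputs
        self.modules = [TacticalModule(n_modules, rng) for _ in range(n_modules)]
        self.outputs = [0.0] * n_modules  # module outputs from the previous step

    def step(self, sensor_values):
        # Input modules read their sensor; output modules have no external reading here.
        padding = [0.0] * (len(self.modules) - self.n_inputs)
        element_values = list(sensor_values) + padding
        # Every module sees the previous outputs of all modules.
        self.outputs = [m.step(v, self.outputs)
                        for m, v in zip(self.modules, element_values)]
        # The outputs of the output modules are the commands sent to the actuators.
        return self.outputs[self.n_inputs:]


controller = TacticalController(n_inputs=2, n_outputs=2)
commands = controller.step([0.5, -0.2])
print(commands)
```

With two input elements and two output elements the sketch builds four modules, each receiving the outputs of all four from the previous step, which mirrors the two-input, two-output configuration mentioned in the text.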



Figure 1. Schematics of a tactical module for one input element (left) and for one output element (right)

Figure 2. Connectivity of a tactical modular controller with two input elements and two output elements

Training tactical modules is difficult because of the strong relationships between the different modules. The error-propagation training methods used for strategic modules are not suitable, so a genetic algorithm is used to train the nets: it allows the network weights to be found without defining an error measurement, just by specifying a cost function (Di Ferdinando et al., 2000).
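As a rough illustration of how weights can be found from a cost function alone, the following sketch evolves a weight vector with a very small genetic algorithm. The function `evolve`, its truncation selection, Gaussian mutation, population size, and the toy `cost` are assumptions made for the example; the training setup used in the cited experiments is not specified here.

```python
import random


def evolve(cost, n_weights, pop_size=30, generations=60, sigma=0.3, seed=0):
    """Minimal genetic algorithm: truncation selection plus Gaussian mutation."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)               # lower cost is better
        parents = population[: pop_size // 2]   # the better half survives unchanged
        children = [[w + rng.gauss(0, sigma) for w in rng.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return min(population, key=cost)


# Hypothetical cost: squared distance of the weights to an arbitrary target.
# In a controller, the cost would instead score the behaviour produced by the weights.
target = [0.5, -0.25, 0.75]


def cost(weights):
    return sum((w - t) ** 2 for w, t in zip(weights, target))


best = evolve(cost, n_weights=len(target))
print(best, cost(best))
```

No gradient or per-output error is computed anywhere; the algorithm only ranks candidates by their cost, which is why the same scheme can be applied to strongly coupled modules.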

Combination of Different Levels
The use of one kind of modularity does not prevent, in principle, the simultaneous use of the other. In fact, strategic and tactical modularity can be used separately or in conjunction with each other. When the solution required from the controller is simple, either a strategic or a tactical modularization can be used. In those cases, it is suggested that the kind of modularity be selected according to the complexity of the problem. For simple problems with a small number of elements, a monolithic controller will suffice. If the number of elements is high, a tactical modular controller will be the best option. Finally, for very complex tasks with many elements, a combination of strategic and tactical modularization could be preferable.


When combining both levels in one neural controller, the strategic modularization should be performed first, in order to identify the different sub-goals required for the implementation. Next, a tactical modularization should be carried out, implementing each of those sub-goals by a group of tactical modules. The number of tactical modules for each strategic module will depend on the elements involved in the resolution of the specific sub-goal.

Application Examples
So far, strategic and tactical modularity have been mainly applied to robot control, where the input elements are sensors and the output elements are actuators. In a first experiment, tactical modularity was applied to the control of a Khepera robot learning to solve the garbage collector problem (Téllez and Angulo, 2006) (Téllez and Angulo, 2007). It involved the coordination of 11 elements (seven sensors and four actuators), creating 11 tactical modules. The task was solved with different levels of modularization, including monolithic, strategic, tactical, and a combination of both, and the results were compared. The combination of both levels obtained the best results (see Figure 3).
In additional experiments, tactical modularity was implemented for an Aibo robot. In this case, 31 tactical modules were required to generate the controller. Controllers were generated to solve different tasks, such as standing up, standing, and pushing the ground (Téllez et al., 2005). The approach also produced one of the first MNN controllers able to make Aibo walk (Téllez et al., 2006).

Figure 3. This figure represents the maximal performance value obtained by different types of modular approaches. Approach (a) is a monolithic approach, (b) and (c) are two different types of strategic approaches, (d) is tactical approach, and (f) is a reduced version of the tactical approach.

FUTURE TRENDS
Within the evolutionary robotics paradigm, it is very difficult to generate complex behaviours when the robot used is quite complex, with a large number of sensors and actuators. The use of tactical modularity together with strategic modularity is introduced as a possible solution to the problem of generating complex behaviours in complex robots. Even though some examples have been provided with a quite complex robot, it remains to be seen whether the approach can scale to systems with hundreds of elements. Additional applications include its use in more classical domains such as pattern recognition and speech recognition.

CONCLUSION
The level of modularity in neural controllers can be greatly increased if tactical modularity is taken into account. This type of modularity complements typical modularization approaches based on strategic modularization, by dividing strategic modules into their minimal components and assigning one single neural module to each of them. This modularization allows decomposable problems to be implemented within a modularized structure. Both types of modularization can be combined in order to obtain a highly modular neural controller, which shows better results in complex robot control.

REFERENCES
Auda, G., & Kamel, M. (1999), Modular neural networks: a survey, International Journal of Neural Systems, 9(2), 129-151.
Azam, F. (2000), Biologically inspired modular neural networks, PhD Thesis at the Virginia Polytechnic Institute and State University.
Boers, E., & Kuiper, H. (1992), Biological metaphors and the design of modular artificial neural networks, Master Thesis, Leiden University.
Caelli, G. L., & Wen, W. (1999), Modularity in neural computing, Proceedings of the IEEE, 87(9), 1497-1518.
Callebaut, W. (2005), The ubiquity of modularity, Modularity. Understanding the Development and Evolution of Natural Complex Systems, The MIT Press.
De Jong, E.D., & Thierens, D., & Watson, R.A. (2004), Defining Modularity, Hierarchy, and Repetition, Proceedings of the GECCO Workshop on Modularity, Regularity and Hierarchy in Open-ended Evolutionary Computation.
Di Ferdinando, A., & Calabretta, R., & Parisi, D. (2000), Evolving modular architectures for neural networks, Proceedings of the sixth Neural Computation and Psychology Workshop: Evolution, Learning and Development.
Fodor, J. (1983), The modularity of mind, The MIT Press.
Garibay, O., & Garibay, I., & Wu, A.S. (2004), No Free Lunch for Module Encapsulation, Proceedings of the Modularity, Regularity and Hierarchy in Open-ended Evolutionary Computation Workshop - GECCO 2004.
Jacobs, R.A. (1990), Task decomposition through competition in a modular connectionist architecture, PhD thesis, University of Massachusetts.
Jacobs, R.A., & Jordan, M.I., & Barto, A.G. (1991), Task decomposition through competition in a modular connectionist architecture: the what and where vision tasks, Cognitive Science, 15, 219-250.
Simon, H.A. (1969), The sciences of the artificial, The MIT Press.
Téllez, R.A., & Angulo, C., & Pardo, D. (2005), Highly modular architecture for the general control of autonomous robots, Proceedings of the 8th International Work-Conference on Artificial Neural Networks.
Téllez, R.A., & Angulo, C., & Pardo, D. (2006), Evolving the walking behaviour of a 12 DOF quadruped using a distributed neural architecture, Proceedings of the 2nd International Workshop on Biologically Inspired Approaches to Advanced Information Technology.
Téllez, R., & Angulo, C. (2006), Tactical modularity for evolutionary animats, Proceedings of the International Catalan Conference on Artificial Intelligence.
Téllez, R., & Angulo, C. (2007), Acquisition of meaning through distributed robot control, Proceedings of the ICRA Workshop on Semantic information in robotics.
Watson, R.A., & Hornby, G.S., & Pollack, J. (1998), Modeling Building-Block Interdependency, Late Breaking Papers at the Genetic Programming 1998 Conference.
Watson, R. (2002), Modular Interdependency in Complex Dynamical Systems, Proceedings of the 8th International Conference on the Simulation and Synthesis of Living Systems.

KEY TERMS
Cost Function: A mathematical function used to determine how well or how badly a neural network has performed during the training phase. The cost function usually indicates what is expected from the neural controller.
Element: Any variable of the program that contains a value that is fed into the neural network controller (input element) or that holds the answers of the neural network (output element). The input elements are usually the variables that contain the information from which the output will be generated. The output elements contain the output of the neural controller.
Evolutionary Robotics: A technique for the creation of neural controllers for autonomous robots, based on genetic algorithms.
Genetic Algorithm: An algorithm that simulates the natural evolutionary process, applied to the generation of the solution of a problem. It is usually used to obtain the values of parameters that are difficult to calculate by other means (for example, neural network weights). It requires the definition of a cost function.
Modularization: A process to determine the simplest meaningful parts that compose a task. There is no formal process to implement modularization, and in practice it is very arbitrary.
Neural Controller: A computer program based on artificial neural networks. The neural controller is a neural net, or a group of them, which acts upon a series of meaningful inputs and generates one or several outputs.


Morphological Filtering Principles
Jose Crespo
Universidad Politécnica de Madrid, Spain

INTRODUCTION
In approximately the last fifty years, advances in computers and the availability of images in digital form have made it possible to process and analyze them in automatic (or semi-automatic) ways. Alongside general signal processing, the discipline of image processing has acquired great importance for practical applications as well as for theoretical investigations. Some general image processing references are (Castleman, 1979) (Rosenfeld & Kak, 1982) (Jain, 1989) (Pratt, 1991) (Haralick & Shapiro, 1992) (Russ, 2002) (Gonzalez & Woods, 2006). Mathematical Morphology, which was founded by Serra and Matheron in the 1960s, has distinguished itself from other types of image processing in that, among other aspects, it has focused on the importance of shapes. The principles of Mathematical Morphology can be found in numerous references such as (Serra, 1982) (Serra, 1988) (Giardina & Dougherty, 1988) (Schmitt & Mattioli, 1993) (Maragos & Schafer, 1990) (Heijmans, 1994) (Soille, 2003) (Dougherty & Lotufo, 2003) (Ronse, 2005).

BACKGROUND
Morphological processing relies mainly on set-based approaches and is not frequency-based. This is in sharp contrast with linear signal processing (Oppenheim, Schafer, & Buck, 1999), which deals mainly with the frequency content of an input signal. Mathematical Morphology (as the name suggests) normally employs a mathematical formalism. Morphological filtering is a type of image filtering that focuses on increasing transformations. Shapes can be satisfactorily processed by morphological filters. Starting with elementary transformations that are based on Minkowski set operations, other more complex transformations can be realized. The theory of morphological filtering is soundly based on mathematics.

This article provides an overview of morphological filtering. The main families of morphological filters are discussed, taking into consideration the possibility of computing hierarchical image simplifications. Both the binary (or set) and gray-level function frameworks are considered. In the rest of this section, some fundamental notions of morphological processing are discussed. The underlying algebraic structure and the associated operations, which establish the distinguishing characteristics of morphological processing, are described.

UNDERLYING ALGEBRAIC STRUCTURE AND BASIC OPERATIONS
In morphological processing, the underlying algebraic structure is a complete lattice (Serra, 1988). A complete lattice is a set of elements with a partial ordering relationship, denoted ≤, and with two operations called supremum (sup) and infimum (inf):
• The sup operation computes the smallest element that is larger than or equal to the operands. Thus, if a, b are two elements of a lattice, “a sup b” is the element of the lattice that is larger than or equal to both a and b, and there is no smaller element with that property.
• The inf operation computes the greatest element that is smaller than or equal to the operands.

Moreover, every subset of a complete lattice has an infimum and a supremum. For sets and gray-level images, these operations are:
• Sets (or binary images)
  o Order relationship: ⊆ (set inclusion).
  o “A sup B” is equal to “A ∪ B”, where A and B are sets.
  o “A inf B” is equal to “A ∩ B”.
• Gray-level images (images with intensity values within a range of integers)
  o Order relationship: for two functions f and g, f ≤ g ⇔ f(x) ≤ g(x) for every pixel x, where the right-hand-side ≤ refers to the order relationship of integers.
  o The sup of f and g is the function (f sup g)(x) = max {f(x), g(x)}, where “max” denotes the computation of the maximum of integers.
  o The inf of f and g is the function (f inf g)(x) = min {f(x), g(x)}, where “min” denotes the computation of the minimum of integers.
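The two lattices listed above can be illustrated with a short sketch. The representation is an assumption made for the example: sets are Python sets of pixel coordinates, and gray-level images are equally sized lists of integers.

```python
# Set lattice: the order is inclusion, sup is union, inf is intersection.
A = {(0, 0), (0, 1), (1, 1)}
B = {(1, 1), (2, 1)}
set_sup = A | B
set_inf = A & B

# Function lattice: sup and inf are the pixelwise maximum and minimum.
f = [3, 7, 2, 9]
g = [4, 5, 2, 1]
fun_sup = [max(a, b) for a, b in zip(f, g)]
fun_inf = [min(a, b) for a, b in zip(f, g)]


def leq(f, g):
    """f <= g in the function lattice: f(x) <= g(x) at every pixel x."""
    return all(a <= b for a, b in zip(f, g))


print(set_sup, set_inf)
print(fun_sup, fun_inf)
print(leq(fun_inf, f), leq(f, fun_sup), leq(f, g))
```

Note that `f` and `g` are not comparable (neither `leq(f, g)` nor `leq(g, f)` holds), which is why the ordering is only partial, yet their sup and inf still exist.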

TRANSFORMATION PROPERTIES
The concept of ordering is key in non-linear morphological processing, which focuses especially on those transformations that preserve ordering. An increasing transformation Ψ defined on a lattice satisfies, for all a, b:
a ≤ b ⇒ Ψ(a) ≤ Ψ(b)
The following two properties concern the ordering between the input and the output. If I denotes an input image, an image operator Ψ is extensive if and only if, ∀I,
I ≤ Ψ(I)
A related property is anti-extensivity. An operator Ψ is anti-extensive if and only if, ∀I,
I ≥ Ψ(I)
The concept of idempotence is a fundamental notion in morphological image processing. An operator Ψ is idempotent if and only if, ∀I,
Ψ(I) = ΨΨ(I)
Within the non-linear morphological framework, the important duality principle states that, for each morphological operator, there exists a dual one with respect to the complementation operation. Two operators Ψ and Ω are dual if
Ψ = CΩC
The complementation operation C, for sets, computes the complement of the input. In the case of gray-level images, a related operation is image inversion, which reverses the intensity values with respect to the middle point of the intensity value range.
The following concept of pyramid applies to multi-scale transformations. A family of operators {Ψ_i}, where i ∈ S = {1,...,n}, forms a multi-level pyramid if
∀j, k ∈ S, j ≥ k, ∃l such that Ψ_j = Ψ_l Ψ_k
In words, the set of transformations {Ψ_i} constitutes a pyramid if any level j of the hierarchy can be reached by applying a member of {Ψ_i} to a finer (smaller index) level k.

STRUCTURING ELEMENTS
A structuring element is a basic tool used by morphological operators to explore and to process the shapes and forms that are present in an input image. Normally, flat structuring elements, which are sets that define a shape, are employed. Two usual shapes (square and diamond) are displayed next (the “x” symbol denotes the center):

[Figure: (a) Square 3x3; (b) Diamond 3x3]

If B denotes a structuring element, its transposed is B̌ = {(−x, −y) : (x, y) ∈ B} (i.e., B reflected with respect to the coordinate origin). If a structuring element B is centered and symmetric, then B̌ = B.
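Flat structuring elements and their transposition can be written down directly as sets of pixel offsets. In this sketch the offsets are taken relative to a center at (0, 0), the 3x3 diamond is assumed to be the five-pixel cross, and `corner` is a hypothetical non-symmetric element added only for contrast.

```python
def transpose(B):
    """Reflect a structuring element through the coordinate origin."""
    return {(-x, -y) for (x, y) in B}


# Flat structuring elements as sets of offsets relative to the center (0, 0).
square_3x3 = {(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
diamond_3x3 = {(dx, dy) for (dx, dy) in square_3x3 if abs(dx) + abs(dy) <= 1}
corner = {(0, 0), (1, 0), (0, 1)}  # hypothetical non-symmetric element

print(transpose(square_3x3) == square_3x3)    # centered and symmetric
print(transpose(diamond_3x3) == diamond_3x3)  # centered and symmetric
print(transpose(corner) == corner)            # not symmetric
```

Only for elements such as `corner` does the distinction between an element and its transposed matter in practice.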



DILATIONS AND EROSIONS
Dilations and erosions are the most basic transformations in morphological processing. Dilations δ are increasing operators that satisfy, ∀I, I',
δ(I sup I') = δ(I) sup δ(I')
Correspondingly, erosions ε are increasing operators that satisfy, ∀I, I',
ε(I inf I') = ε(I) inf ε(I')
Dilations and erosions by structuring element perform, respectively, sup and inf operations over an input image that depend on a structuring element B. These dilations and erosions are symbolized, respectively, by δ_B and ε_B, and they originate from the Minkowski set addition and subtraction.
Let us first discuss the set framework. If A denotes an input set, the δ_B(A) dilation computes the locus of points x such that the structuring element B translated to x has a non-empty intersection with (i.e., “touches”) the input set A:
δ_B(A) = {x | B_x ∩ A ≠ ∅}
B_x symbolizes the structuring element B translated to point (or pixel) x (i.e., B_x = {x' | x' – x ∈ B}, where “–” symbolizes vector subtraction). Figure 1 shows a set example in R². Input set A (composed of two connected components) and structuring element B (a circle) are displayed in part (a). The δ_B(A) dilation is shown in part (b).

Figure 1. Dilation δ_B(A): (a) set A and structuring element B; (b) δ_B(A)

The previous expression can be formulated using the sup operation as:
δ_B(A) = ⋃_{b∈B} A_{−b} = sup_{b∈B} A_{−b}
Using the lattice framework, the expression for functions is formally identical to the expression for sets:
δ_B(I) = sup_{b∈B} I_{−b}
Note: the sup operator is that of the function lattice. The function expression can be written in another way that gives a more operational expression to compute the value of the result of δ_B(I) at each pixel x of I:
[δ_B(I)](x) = max_{b∈B} {I(x + b)}


The sup operation has been replaced by the “max” operation that computes the maximum of a set of integers. Note that “[δB(I)](x)” is the intensity value of pixel x in that image. The “+” symbol denotes the vector addition. Some important properties of dilations δB are

The expression for sets formulated by means of the inf operation is: εB(A)=  A−b = inf b∈B A−b b∈B

the following:

• • •

For sets, dilation δB is commutative, i.e., if A denotes an input set, then δB(A) = δA(B). If a structuring element B contains the coordinate origin, then δB is extensive. The dilation by a structuring element is associative, i.e, if B is the result of δC(D) (or δD(C)), then δB(I) = δC(δD(I)) = δD(δC(I)).

If A denotes an input set and Bdenotes a structuring element, the εB(A) erosion computes the locus of points where the B structuring element translated to is completely included within input set A: εB(A) = {x | Bx ⊆ A} Figure 2 displays a set example of εB(A), where A is that of the previous dilation example. The following expressions of erosions εB are analogous to those already introduced for dilations.

The expressions for functions are: εB(I) = infb∈BI–b [εB(I)](x) = minb∈B{I(x + b)} Some important properties of erosions εB are the following: • •

If the coordinate origin belongs to a structuring element B, then εB is anti-extensive. The erosion by a structuring element is associative, i.e, if B is the result of δC(D) (or δD(C)), then εB(I) = εC (εD(I)) = εD (εC(I)).

In fact, the expressions for erosions are the duals of the corresponding expressions for dilations; δB and εB are dual of each other: δB = C εB C

Figure 2. Erosion εB(A)

A simple 1-D example of a gray-level dilation and erosion (where B has 3 points) is the following:

I:      9 10 11 12 12 11 13 13 12 12 12 10  9  9  9 10
δB(I): 10 11 12 12 12 13 13 13 13 12 12 12 10  9 10 10
εB(I):  9  9 10 11 11 11 11 12 12 12 10  9  9  9  9  9
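The 1-D example can be reproduced with a few lines of code. The following is a minimal sketch, assuming that B = {−1, 0, 1} and that window positions falling outside the signal are simply ignored (a boundary convention chosen here because it matches the values above; the function names are illustrative).

```python
def dilate(signal, offsets=(-1, 0, 1)):
    """Flat dilation: maximum of I(x + b) over the offsets b that stay inside the signal."""
    n = len(signal)
    return [max(signal[x + b] for b in offsets if 0 <= x + b < n) for x in range(n)]


def erode(signal, offsets=(-1, 0, 1)):
    """Flat erosion: minimum of I(x + b) over the offsets b that stay inside the signal."""
    n = len(signal)
    return [min(signal[x + b] for b in offsets if 0 <= x + b < n) for x in range(n)]


I = [9, 10, 11, 12, 12, 11, 13, 13, 12, 12, 12, 10, 9, 9, 9, 10]
dilated = dilate(I)
eroded = erode(I)
print(dilated)  # [10, 11, 12, 12, 12, 13, 13, 13, 13, 12, 12, 12, 10, 9, 10, 10]
print(eroded)   # [9, 9, 10, 11, 11, 11, 11, 12, 12, 12, 10, 9, 9, 9, 9, 9]
```

Since B contains the origin, the dilated signal is never below the input and the eroded signal is never above it, in agreement with the extensivity and anti-extensivity properties.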

FROM SET OPERATORS TO FUNCTION OPERATORS

Flat operators are function operators Ψ that can be derived from a set operator Ψ' and that satisfy the following threshold superposition property:

[Ψ(I)](x) = sup{u : x ∈ Ψ'(Uu(I))}

where Uu is the thresholding operator at level u, and I is an image. The thresholding operator at level u is defined as

Uu(I) = {x : I(x) ≥ u}

Let us define a variant of the thresholding operator that outputs a binary function (instead of a set): (U'u(I))(x) is 1 if I(x) ≥ u, and 0 otherwise. Then, a flat operator Ψ that commutes with thresholding is said to satisfy:

U'uΨ = ΨU'u
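As an illustration of both properties, the sketch below checks them numerically for a 1-D flat dilation. It assumes a 3-point structuring element and ignores window positions outside the signal; these choices, and the helper names, are assumptions of this example only.

```python
def dilate(signal, offsets=(-1, 0, 1)):
    """Flat dilation; window positions outside the signal are ignored (an assumption)."""
    n = len(signal)
    return [max(signal[x + b] for b in offsets if 0 <= x + b < n) for x in range(n)]


def threshold(signal, u):
    """Binary thresholding U'_u: 1 where the value is >= u, 0 elsewhere."""
    return [1 if v >= u else 0 for v in signal]


I = [9, 10, 11, 12, 12, 11, 13, 13, 12, 12, 12, 10, 9, 9, 9, 10]
levels = list(range(min(I), max(I) + 1))

# Commutation with thresholding: U'_u(dilate(I)) == dilate(U'_u(I)) at every level u.
commutes = all(threshold(dilate(I), u) == dilate(threshold(I, u)) for u in levels)

# Threshold superposition: the gray-level result is rebuilt from the binary results
# as the highest level u at which pixel x belongs to the dilated threshold set.
rebuilt = [
    max(u for u in levels if dilate(threshold(I, u))[x] == 1)
    for x in range(len(I))
]

print(commutes)
print(rebuilt == dilate(I))
```

Both checks succeed for this signal because taking a maximum and comparing against a level u can be done in either order.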

BASIC MORPHOLOGICAL FILTERS

Openings and Closings

In morphological processing, a filter is an increasing and idempotent transformation. The two most fundamental filters in morphological processing arise when there is an order between the input and the filter output. They are the so-called openings and closings, symbolized, respectively, by γ and φ.

• An opening γ is an anti-extensive morphological filter.
• A closing φ is an extensive morphological filter.

The names “algebraic openings” and “algebraic closings” are also used in the literature to refer to these most general types of openings and closings. The computation of openings and closings that use structuring elements as the “shape probes” to process input image shapes is discussed next. They are defined in terms of dilations and erosions by structuring element. For an input set A, an opening by structuring element B, symbolized by γB, is the set of points x that belong to a translated structuring element that fits set A, i.e., that is included in A. Let us establish how γB is computed in the next definition, which applies both to sets and images. An opening by structuring element B, symbolized by γB, is defined by

$\gamma_B = \delta_{\check{B}} \varepsilon_B$

i.e., γB is the sequential composition of an erosion εB followed by a dilation $\delta_{\check{B}}$, where $\check{B}$ denotes B transposed. This type of filter first erodes an input image by the εB erosion, and then the subsequent $\delta_{\check{B}}$ dilation generally recovers in some sense the parts of the input image that have persisted. Nevertheless, not everything is normally recovered, and the output image is always less than or equal to the input image. The definition of the dual closing follows. A closing by structuring element B, symbolized by φB, is defined by

$\varphi_B = \varepsilon_{\check{B}} \delta_B$

i.e., φB is the sequential composition of a dilation δB followed by an erosion $\varepsilon_{\check{B}}$, where $\check{B}$ denotes B transposed.
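The two definitions can be sketched in code as follows. This is an illustrative 1-D version, assuming that structuring elements are given as tuples of integer offsets, that the transposed element is obtained by negating the offsets, and that window positions outside the signal are ignored; none of these conventions is prescribed by the text.

```python
def erode(signal, offsets):
    """Flat erosion: minimum of I(x + b) over in-range offsets b."""
    n = len(signal)
    return [min(signal[x + b] for b in offsets if 0 <= x + b < n) for x in range(n)]


def dilate(signal, offsets):
    """Flat dilation: maximum of I(x + b) over in-range offsets b."""
    n = len(signal)
    return [max(signal[x + b] for b in offsets if 0 <= x + b < n) for x in range(n)]


def transpose(offsets):
    """Transposed structuring element: every offset b becomes -b."""
    return tuple(-b for b in offsets)


def opening(signal, offsets):
    """Erosion by B followed by dilation by B transposed."""
    return dilate(erode(signal, offsets), transpose(offsets))


def closing(signal, offsets):
    """Dilation by B followed by erosion by B transposed."""
    return erode(dilate(signal, offsets), transpose(offsets))


B = (-1, 0, 1)
I = [9, 10, 11, 12, 12, 11, 13, 13, 12, 12, 12, 10, 9, 9, 9, 10]
opened = opening(I, B)
closed = closing(I, B)

print(all(o <= i for o, i in zip(opened, I)))   # anti-extensivity of the opening
print(all(c >= i for c, i in zip(closed, I)))   # extensivity of the closing
print(opening(opened, B) == opened)             # idempotence of the opening
print(closing(closed, B) == closed)             # idempotence of the closing
```

The four printed checks correspond to the ordering and idempotence properties that make openings and closings morphological filters.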

Alternated Filters

The sequential compositions of an opening γ and a closing φ are called alternated sequential compositions. A morphological alternated filter is a sequential composition of an opening and a closing, i.e., φγ and γφ are alternated filters. An important fact is that there is generally no ordering between the input and output of alternated filters, i.e.,

I ≰ φγ(I) ≰ I
I ≰ γφ(I) ≰ I

In addition, there is generally no ordering between φγ and γφ. Alternated filters are quite useful in image processing and analysis because they combine in some way the effects of both openings and closings in one filter.

Parallel Combination Properties

The class of openings (or, respectively, of closings) is closed under the sup (respectively, inf) operation. In other words:

• The sup of openings is an opening.
• The inf of closings is a closing.

Different structuring elements can be combined to achieve a desired shape filtering effect. For example, the effect of a sup of openings such as (γA sup γB), which is itself an opening, can be quite different from either γA or γB.

GRANULOMETRIES AND ANTI-GRANULOMETRIES

The granulometry concept formalizes the size distribution notion. Size distributions are families of transformations Ψi with a size parameter i that satisfy the following axioms:

• Increasingness
• Anti-extensivity
• Absorption

Absorption means that, if Ψi and Ψj belong to a size distribution, where i ≤ j, then

Ψi Ψj = Ψj Ψi = Ψmax(i, j)

In morphological filtering, the so-called granulometries are families of transformations that satisfy the size distribution axioms above. A family of openings {γi}, where i ∈ S = {1,...,n}, is a granulometry if, for all i, j ∈ S,

i ≤ j ⇒ γi ≥ γj,

i.e., an ordered family of openings constitutes a granulometry. The dual concept of a granulometry is called an anti-granulometry, which is an ordered family of closings, as defined next. A family of closings {φi}, where i ∈ S = {1,...,n}, is an anti-granulometry if, for all i, j ∈ S,

i ≤ j ⇒ φi ≤ φj.

Quite often, both a granulometry and an anti-granulometry are computed. Normally, quantitative increasing measures are computed at each output:

M(γN(I)) ≤ ... ≤ M(γ1(I)) ≤ M(I) ≤ M(φ1(I)) ≤ ... ≤ M(φN(I))

These measure values build a curve that can be used to characterize an input image if the measure criterion is appropriate. An example of an increasing criterion is the area or number of pixels in binary images, or the volume in non-binary images. To build a family of openings and a family of closings by structuring elements that constitute, respectively, a granulometry and an anti-granulometry, an appropriate family of structuring elements {iB | i ∈ {0,...,N}} that ensures the ordering of openings and closings is needed. In particular, the family of structuring elements must satisfy the following property:

γ(i–1)B(iB) = iB, for i ≥ 1.

Figure 3 illustrates the granulometry concept. Three opening outputs have been displayed in parts (b), (c) and (d), specifically the outputs corresponding to openings γ4, γ6 and γ8, where the subindex indicates the size of the structuring element (Iñesta & Crespo, 2003). A subindex i refers to a square of side 2i+1. There is an ordering between the four images: part (a) ≥ part (b) ≥ part (c) ≥ part (d).

Figure 3. Granulometry

MULTI-LEVEL MORPHOLOGICAL FILTERING

Alternating Sequential Filters

Granulometries and anti-granulometries make it possible to build complex filters composed of ordered openings and closings. An alternating sequential filter ASF is an ordered sequential composition of alternated filters φjγj or γjφj, such as

ASFi = φiγi...φjγj...φ1γ1
ASF’i = γiφi...γjφj...γ1φ1

where i ≥ j ≥ 1, and where γi and φi belong, respectively, to a granulometry and an anti-granulometry. Alternating sequential filters satisfy the following absorption property: if i ≥ j, then

ASFi ASFj = ASFi
ASFj ASFi ≤ ASFi
ASF’i ASF’j = ASF’i
ASF’j ASF’i ≥ ASF’i

Morphological Pyramids

Multi-level (or multi-scale) operators are families of transformations that depend on a scale parameter i. Within morphological filters, cases that satisfy the pyramid condition are:

• granulometries,
• anti-granulometries, and
• alternating sequential filters.
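The granulometry idea can be illustrated with a small numerical sketch. It assumes a 1-D binary signal, openings by centered segments of length 2i+1 (a 1-D analogue of the squares of side 2i+1 mentioned above), and the number of foreground pixels as the increasing measure; the signal and the function names are made up for this example.

```python
def erode(signal, offsets):
    """Flat erosion; window positions outside the signal are ignored."""
    n = len(signal)
    return [min(signal[x + b] for b in offsets if 0 <= x + b < n) for x in range(n)]


def dilate(signal, offsets):
    """Flat dilation; window positions outside the signal are ignored."""
    n = len(signal)
    return [max(signal[x + b] for b in offsets if 0 <= x + b < n) for x in range(n)]


def opening(signal, size):
    """Opening by a centered segment of length 2*size + 1 (symmetric, so B = B transposed)."""
    offsets = tuple(range(-size, size + 1))
    return dilate(erode(signal, offsets), offsets)


# Binary signal with runs of 1s of lengths 1, 3, 5 and 7, separated by single 0s.
signal = [0]
for run_length in (1, 3, 5, 7):
    signal += [1] * run_length + [0]

# Granulometric curve: area (number of 1s) that survives each opening size.
areas = [sum(opening(signal, size)) for size in range(5)]
print(areas)  # [16, 15, 12, 7, 0]
```

Each opening removes the runs that are shorter than its segment, so the curve is non-increasing and its drops indicate how much of the signal lies in structures of each size.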


FUTURE TRENDS

Operators that consider connectivity aspects have been an active research and work area in morphological processing. Connectivity integrates easily in the morphological filtering framework using the connected class concept (and the associated opening) introduced in (Serra, 1988). The class of connected filters (Serra & Salembier, 1993) (Crespo, Serra, & Schafer, 1993) (Vincent, 1993) (Salembier & Serra, 1995) (Crespo, Serra, & Schafer, 1995) (Breen & Jones, 1996) (Crespo & Schafer, 1997) (Crespo & Maojo, 1998) (Garrido, Salembier, & Garcia, 1998) (Heijmans, 1999) (Crespo, Maojo, Sanandrés, Billhardt, & Muñoz, 2002) (Crespo & Maojo, 2008), which preserve shapes particularly well, has been successfully used in image processing and analysis applications. In more recent years, certain types of connected filters, such as the so-called levelings (Meyer, 1998) (Meyer, 2004), whose origin in the set framework can be traced back to (Crespo et al., 1993) (Crespo & Schafer, 1997), have been the focus of new research efforts.

CONCLUSION

This article has provided a summary of morphological filtering, which is qualitatively different from linear filtering. These differences are clear when morphological filtering is approached by analysing the underlying algebraic framework and the key importance of ordering and increasingness. Morphological filtering provides a distinctive type of image analysis that is appropriate to deal with shapes. Although in its origin morphological filtering was especially associated with set processing (and many concepts are originally set-based), it extends to non-binary gray-level functions.

ACKNOWLEDGMENTS

This work has been supported in part by “Ministerio de Educación y Ciencia” of Spain (Ref.: TIN200761768).

REFERENCES

Breen, E. J., & Jones, R. (1996, November). Attribute openings, thinnings, and granulometries. Computer Vision and Image Understanding, 64(3), 377-389.

Castleman, K. (1979). Digital image processing. Englewood Cliffs: Prentice Hall.

Crespo, J., & Maojo, V. (1998, April). New results on the theory of morphological filters by reconstruction. Pattern Recognition, 31(4), 419-429.

Crespo, J., & Maojo, V. (2008). The strong property of morphological connected alternated filters. Accepted for publication in the Journal of Mathematical Imaging and Vision. DOI: 10.1007/x10851-008-0098-x.

Crespo, J., Maojo, V., Sanandrés, J., Billhardt, H., & Muñoz, A. (2002). On the strong property of connected open-close and close-open filters. In J. Braquelaire, J.-O. Lachaud, & A. Vialard (Eds.), Discrete geometry for computer imagery (Vol. 2301, pp. 165-174). Berlin-Heidelberg: Springer-Verlag.

Crespo, J., & Schafer, R. W. (1997). Locality and adjacency stability constraints for morphological connected operators. Journal of Mathematical Imaging and Vision, 7(1), 85–102.

Crespo, J., Serra, J., & Schafer, R. W. (1993, May). Image segmentation using connected filters. In J. Serra & P. Salembier (Eds.), Workshop on mathematical morphology, Barcelona (pp. 52–57).

Crespo, J., Serra, J., & Schafer, R. W. (1995, November). Theoretical aspects of morphological filters by reconstruction. Signal Processing, 47(2), 201–225.

Dougherty, E., & Lotufo, R. (2003). Hands-on morphological image processing. Bellingham: SPIE Press.

Garrido, L., Salembier, P., & Garcia, D. (1998). Extensive operators in partition analysis for image sequence analysis. Signal Processing, 66(2), 157–180.

Giardina, C., & Dougherty, E. (1988). Morphological methods in image and signal processing. Englewood Cliffs: Prentice-Hall.

Gonzalez, R. C., & Woods, R. E. (2006). Digital image processing (3rd. ed.). Englewood Cliffs: Prentice Hall.

Haralick, R., & Shapiro, L. (1992). Computer and robot vision. Volume I. Reading: Addison-Wesley Publishing Company.

Heijmans, H. (1994). Morphological image operators. Boston: Academic Press.

Heijmans, H. (1999). Connected morphological operators for binary images. Computer Vision and Image Understanding, 73, 99–120.

Iñesta, J. M., & Crespo, J. (2003). Principios básicos del análisis de imágenes médicas. In M. Belmonte, O. Coltell, V. Maojo, J. Mateu, & F. Sanz (Eds.), Manual de informática médica (pp. 299–333). Barcelona: Editorial Menarini - Caduceo Multimedia.

Jain, A. (1989). Fundamentals of digital image processing (Prentice hall information and system sciences series; Series Editor: T. Kailath). Englewood Cliffs: Prentice Hall.

Maragos, P., & Schafer, R. W. (1990, April). Morphological systems for multidimensional signal processing. Proc. of the IEEE, 78(4), 690–710.

Meyer, F. (1998). From connected operators to levelings. In H. J. A. M. Heijmans & J. B. T. M. Roerdink (Eds.), Mathematical morphology and its applications to image and signal processing (pp. 191–198). Dordrecht: Kluwer Academic Publishers.

Meyer, F. (2004, January-March). Levelings, image simplification filters for segmentation. Journal of Mathematical Imaging and Vision, 20(1-2), 59–72.

Oppenheim, A., Schafer, R. W., & Buck, J. (1999). Discrete-time signal processing (2nd. ed.). Englewood Cliffs: Prentice-Hall.

Pratt, W. (1991). Digital image processing (2nd. ed.). New York: John Wiley and Sons.

Ronse, C. (2005). Guest editorial. Journal of Mathematical Imaging and Vision, 22(2-3), 103–105.

Rosenfeld, A., & Kak, A. (1982). Digital picture processing - Volumes 1 and 2. Orlando: Academic Press.

Russ, J. C. (2002). The image processing handbook (4th. ed.). Boca Raton: CRC Press.

Salembier, P., & Serra, J. (1995). Flat zones filtering, connected operators, and filters by reconstruction. IEEE Transactions on Image Processing, 4(8), 1153–1160.

Schmitt, M., & Mattioli, J. (1993). Morphologie mathématique. Paris: Masson.

Serra, J. (1982). Mathematical morphology. Volume I. London: Academic Press.

Serra, J. (Ed.). (1988). Mathematical morphology. Volume II: Theoretical Advances. London: Academic Press.

Serra, J., & Salembier, P. (1993, July). Connected operators and pyramids. In Proceedings of SPIE, Nonlinear algebra and morphological image processing, San Diego (Vol. 2030, pp. 65–76).

Soille, P. (2003). Morphological image analysis (2nd. ed.). Berlin-Heidelberg-New York: Springer-Verlag.

Vincent, L. (1993, April). Morphological grayscale reconstruction in image analysis: Applications and efficient algorithms. IEEE Transactions on Image Processing, 2, 176–201.

KEY TERMS

Duality: The duality principle states that, for each morphological operator, there exists a dual one. In sets, the duality is established with respect to the set complementation operation (see further details in the text).

Extensivity: A transformation is extensive when its output is larger than or equal to the input. Anti-extensivity is the opposite concept: a transformation is anti-extensive when its output is smaller than or equal to the input.

Idempotence: A transformation Ψ is said to be idempotent if, when sequentially applied twice, it does not change the output of the first application, i.e., Ψ Ψ = Ψ.

Image Transformation: An operation that processes an input image and produces an output image.

Increasingness: A transformation is increasing when it preserves ordering. If Ψ is increasing, then a ≤ b ⇒ Ψ(a) ≤ Ψ(b).

Lattice: A complete lattice is a set of elements with a partial ordering relationship and two operations called supremum and infimum.

Morphological Filter: An increasing and idempotent transformation.

Multi-Scale Transformation: A transformation that displays some characteristics controllable by means of (at least) a parameter, which is called the size or scale parameter.

MREM, Discrete Recurrent Network for Optimization

Enrique Mérida-Casermeiro, University of Málaga, Spain
Domingo López-Rodríguez, University of Málaga, Spain
Juan M. Ortiz-de-Lazcano-Lobato, University of Málaga, Spain

INTRODUCTION

Since McCulloch and Pitts’ seminal work (McCulloch & Pitts, 1943), several models of discrete neural networks have been proposed, many of them able to assign a discrete value (other than unipolar or bipolar) to the output of a single neuron. These models have focused on a wide variety of applications. One of the most important models was developed by J. Hopfield in (Hopfield, 1982), which has been successfully applied in fields such as pattern and image recognition and reconstruction (Sun et al., 1995), design of analog-digital circuits (Tank & Hopfield, 1986), and, above all, in combinatorial optimization (Hopfield & Tank, 1985) (Takefuji, 1992) (Takefuji & Wang, 1996), among others. The purpose of this work is to review some applications of multivalued neural models to combinatorial optimization problems, focusing specifically on the neural model MREM, since it includes many of the multivalued models in the specialized literature.

BACKGROUND

In Hopfield and Tank’s pioneering work (Hopfield & Tank, 1985), neural networks were applied for the first time to solve combinatorial optimization problems, specifically the well-known travelling salesman problem. They developed two types of networks, discrete and continuous, although the latter has mostly been chosen to solve optimization problems, on the grounds that it escapes more easily from local optima. Since then, the search for better neural algorithms to tackle the diverse problems of combinatorial optimization (many of them belonging to the class of NP-complete problems) has been the objective of researchers in this field. This method of optimization consists of minimizing an energy function, whose parameters and constraints are obtained by identification with the objective function of the optimization problem. In this case, the energy function has the form:

$E(S) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{i,j} s_i s_j + \sum_{i=1}^{N} \theta_i s_i$

where N is the number of neurons of the network, wi,j is the synaptic weight between neurons j and i, and θi is the threshold or bias of neuron i. In the discrete version of Hopfield’s model, component si of the state vector S = (s1,...,sN) can take values in ℳ = {−1,1} (constituting the bipolar model) or in ℳ = {0,1} (unipolar model). In the continuous version, ℳ = [−1,1] or ℳ = [0,1]. This continuous version, although it has traditionally been the most used for optimization problems, presents certain drawbacks:

• Certain special mechanisms, possibly in the form of constraints, must be added to ensure that, in the final state of the network, all the components of state vector S belong to {–1,1} or {0,1}.
• The traditional dynamics used in this model, when implemented on a digital computer, does not guarantee that the energy function decreases in every iteration, so it is not ensured that the final state is a minimum of the energy function (Galán-Marín, 2000).
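To make the energy function concrete, the following sketch evaluates it for a tiny bipolar network. The weight matrix, thresholds, and function name are made-up illustrative values, not taken from the cited works, and the exhaustive search over states is only feasible because the network has three neurons.

```python
from itertools import product


def hopfield_energy(w, theta, s):
    """E(S) = -1/2 * sum_ij w[i][j]*s[i]*s[j] + sum_i theta[i]*s[i]."""
    n = len(s)
    quadratic = sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
    linear = sum(theta[i] * s[i] for i in range(n))
    return -0.5 * quadratic + linear


# Hypothetical symmetric weights with zero diagonal, and zero thresholds.
w = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
theta = [0, 0, 0]

e_a = hopfield_energy(w, theta, [1, 1, -1])
e_b = hopfield_energy(w, theta, [1, 1, 1])
print(e_a)  # -3.0
print(e_b)  # 1.0

# Exhaustive search over all bipolar states (only practical for very small N).
best_state = min(product([-1, 1], repeat=3), key=lambda s: hopfield_energy(w, theta, s))
best_energy = hopfield_energy(w, theta, best_state)
print(best_energy)  # -3.0
```

In an optimization setting, the weights and thresholds would be chosen so that low-energy states correspond to good solutions of the problem at hand.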



However, the biggest problem of this model (the discrete as well as the continuous one) is the possibility of converging to an infeasible state, or to a local (not global) minimum. Wilson and Pawley (1988) demonstrated, through massive simulations, that, for the travelling salesman problem with 10 cities, only 8% of the solutions were feasible, and most of them were not good. Moreover, this proportion got worse when the problem size was increased. After this, many works focused on improving Hopfield's network:

• By modifying the energy function (Xu & Tsai, 1991).
• By adjusting the numerous parameters present in the network, as in (Lai & Coghill, 1988).
• By using stochastic techniques in the dynamics of the network (Kirkpatrick et al., 1983) (Aarts & Korst, 1988).

In particular, researchers tried to improve the efficiency of Hopfield's network for the travelling salesman problem, achieving results that were acceptable but inferior to those of Operations Research techniques (Takahashi, 1997). The reason for these disappointing results is that the linear formulation used by these techniques is a great advantage in comparison with neural networks, which unavoidably use a quadratic energy function; this impedes the use of subpath deletion techniques (Smith, 1996) and gives rise to a larger number of local minima. Another research line was devoted to the improvement of Hopfield-type recurrent networks and their application to diverse optimization problems, in which some results proved to be better than those obtained by traditional Operations Research techniques (Smith & Krishnamoorthy, 1998). Takefuji's work (Takefuji, 1992) (Lee et al., 1992) (Takefuji & Wang, 1996), with a great number of publications in international media, must be highlighted. Their results have been surpassed by the OCHOM model (Galán-Marín & Muñoz-Pérez, 2001).

MULTIVALUED DISCRETE RECURRENT MODEL. APPLICATION TO COMBINATORIAL OPTIMIZATION PROBLEMS

A new generalization of Hopfield’s model arises in the works (Mérida-Casermeiro, 2000) (Mérida-Casermeiro et al., 2001), where the MREM (Multivalued REcurrent Model) model is presented.

The Neural MREM Model

This model presents two essential features that make it very versatile and that increase its applicability:

• The output of each neuron, si, is a value of the set ℳ = {m1, m2, ..., mL}, which is not necessarily numeric.
• The concept of similarity function f between neuron outputs is introduced. f(x,y) represents the similarity between neuron states x and y.

This way, the energy function of this model is as follows:

$E(S) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{i,j} f(s_i, s_j) + \sum_{i=1}^{N} \theta_i(s_i)$

where θi : ℳ → ℝ is a generalization of the thresholds of each neuron. Thanks to the features mentioned above, certain optimization problems (such as the travelling salesman problem) have a better representation in this model than in the unipolar or bipolar Hopfield models and their successors. It is clear that MREM includes Hopfield’s models (with outputs in ℳ = {−1,1} or in ℳ = {0,1}) if we consider the similarity function given by the product f(a,b) = ab. Other multivalued models, like MAREN or SOAR (Erdem & Ozturk, 1996) (Ozturk & Abut, 1997), are also generalized by MREM. The dynamics for this network is chosen according to the problem to be tackled.
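The role of the similarity function can be illustrated with a short sketch. It assumes a made-up 3-neuron weight matrix and zero threshold functions, and evaluates the MREM energy once with the product similarity (which should reproduce the bipolar Hopfield energy) and once with a Kronecker-delta similarity on non-numeric outputs; all names and values are illustrative.

```python
def mrem_energy(w, theta, s, similarity):
    """E(S) = -1/2 * sum_ij w[i][j]*f(s[i], s[j]) + sum_i theta[i](s[i])."""
    n = len(s)
    quadratic = sum(w[i][j] * similarity(s[i], s[j]) for i in range(n) for j in range(n))
    linear = sum(theta[i](s[i]) for i in range(n))
    return -0.5 * quadratic + linear


def product_similarity(a, b):
    """f(a, b) = a*b recovers the Hopfield energy for numeric outputs."""
    return a * b


def kronecker(a, b):
    """f(a, b) = 1 if both outputs coincide, 0 otherwise; works for any labels."""
    return 1 if a == b else 0


def zero_threshold(value):
    """Threshold function that contributes nothing to the energy."""
    return 0


# Hypothetical symmetric weights with zero diagonal.
w = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
theta = [zero_threshold] * 3

e_bipolar = mrem_energy(w, theta, [1, 1, -1], product_similarity)
e_labels = mrem_energy(w, theta, ["red", "red", "blue"], kronecker)
print(e_bipolar)  # -3.0
print(e_labels)   # -1.0
```

Because the similarity function only compares outputs, the second call works with labels that have no numeric meaning, which is the versatility the model description refers to.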



Application to Several Combinatorial Optimization Problems

This multivalued model has been successfully applied to diverse optimization problems, outperforming the best-established algorithms. Several of these applications can be found in (Mérida-Casermeiro et al., 2003) (Mérida-Casermeiro & López-Rodríguez, 2005) (López-Rodríguez et al., 2006). These problems are typical representatives of the NP-complete complexity class, which indicates how difficult they are to solve.

The Travelling Salesman Problem

The Travelling Salesman Problem (TSP) is one of the best-known and most studied combinatorial optimization problems due to its wide range of real-life applications and its intrinsic complexity. Real-life applications cover aspects such as automatic routing for robots and hole location in printed circuit design (Reinelt, 1994), as well as gas turbine checking, machine task scheduling or crystallographic analysis (Bland & Shallcross, 1987), among others. This problem can be stated as follows: given N cities X1,...,XN and distances di,j between each pair of cities Xi and Xj, the objective is to find the shortest closed tour visiting each city once. In order to solve the TSP with this neural model, two identifications must be made:

• A network state must be identified with a solution to the TSP: since a solution to the N-city TSP can be represented as a permutation of the set {1,...,N}, the net is formed by N neurons taking values in the set ℳ = {1,...,N}, such that state vector S = (s1,...,sN) represents a permutation of {1,...,N}. With this representation, si = k means that the kth city is visited in the ith place.
• The energy function must be identified with the total distance of a tour: if we let f(x,y) = –2dx,y and

$w_{i,j} = \begin{cases} 1, & (j = i + 1) \lor ((i = N) \land (j = 1)) \\ 0, & \text{otherwise} \end{cases}$

the energy function obtained is

$E(S) = \sum_{i=1}^{N-1} d_{s_i, s_{i+1}} + d_{s_N, s_1}$,

the total distance of the tour represented by state vector S.

Computational dynamics is based on starting with a random feasible initial state vector and updating neuron outputs so as to keep the current state vector inside the set of feasible states. To this end, at each iteration, a 2-opt update is made on the current state vector, that is, every pair of neurons p, q with q > p + 1 is studied, checking in parallel whether there exists a crossing between segments (sp, sp+1) and (sq, sq+1). In that case, the following relation holds:

$d_{s_p, s_{p+1}} + d_{s_q, s_{q+1}} > d_{s_p, s_q} + d_{s_{p+1}, s_{q+1}}$

Then, the trajectory between cities sp+1 and sq is inverted, that is, if S is the current state, the new state vector S’ is defined by:

$s'_i = \begin{cases} s_{q+p+1-i}, & p + 1 \le i \le q \\ s_i, & \text{otherwise} \end{cases}$

As an additional technique for improvement, 3-opt updates have also been considered: the tour is decomposed into three consecutive arcs, A, B and C, which are then recombined in all possible ways: {ABC, ACB, AB’C, ABC’, AB’C’, AC’B, ACB’, AC’B’}, where A’, B’, C’ are the reversed arcs corresponding to A, B, and C, respectively. Note that {ABC, AB’C, ABC’, AC’B’} are 2-opt updates, so there is no need to check them again. The next state of the net is the combination that most decreases the energy function. Further details are given in (Mérida-Casermeiro et al., 2003). In (Mérida-Casermeiro et al., 2003), some experimental results are provided for problems from the TSPLIB repository (see Table 1). This model is compared against KNIES (Aras et al., 1999), a model based on Kohonen’s self-organizing map. MREM proved to outperform KNIES, obtaining almost optimal solutions in many cases.
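The 2-opt update described above can be sketched as follows. This is a simplified, sequential illustration (the text describes pairs being checked in parallel), with 0-based city indices and a made-up four-city instance on the corners of a unit square; the function names are hypothetical.

```python
import math


def tour_length(dist, s):
    """Energy of state s: total length of the closed tour s[0] -> s[1] -> ... -> s[0]."""
    n = len(s)
    return sum(dist[s[i]][s[(i + 1) % n]] for i in range(n))


def two_opt_step(dist, s):
    """Apply the first improving 2-opt move found; return (new_state, improved)."""
    n = len(s)
    for p in range(n - 2):
        for q in range(p + 2, n):
            a, b = s[p], s[p + 1]
            c, d = s[q], s[(q + 1) % n]
            if a == d:
                continue  # the two segments share a city (p = 0 and q = n - 1)
            if dist[a][b] + dist[c][d] > dist[a][c] + dist[b][d]:
                # Invert the trajectory between positions p + 1 and q.
                return s[:p + 1] + s[p + 1:q + 1][::-1] + s[q + 1:], True
    return s, False


# Four cities on the corners of a unit square.
points = [(0, 0), (1, 0), (1, 1), (0, 1)]
dist = [[math.hypot(x1 - x2, y1 - y2) for (x2, y2) in points] for (x1, y1) in points]

state = [0, 2, 1, 3]  # a tour whose two diagonals cross
improved = True
while improved:
    state, improved = two_opt_step(dist, state)

print(state)                     # [0, 1, 2, 3]
print(tour_length(dist, state))  # 4.0
```

Every accepted move keeps the state a permutation of the cities, so the search never leaves the set of feasible tours, and each move strictly decreases the tour length.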


Figure 1. Best solution found by MREM (left, error=1.3%) and optimal solution (right)

Table 1. Results of KNIES and MREM for the TSP for some instances from TSPLIB

Instance   Optimum   KNIES Best (%)   MREM Best (%)   MREM Av. (%)   t (sec)
eil51          426        2.86             0.23            2.43          3.12
st70           675        1.51             0.00            1.89          9.01
eil76          538        4.98             1.30            3.43         10.80
rd100         7910        2.09             0.00            3.02         61.70
eil101         629        4.66             1.43            3.51         27.76
lin105       14379        1.29             0.00            1.71         28.83
pr107        44303        0.42             0.15            0.82         49.79
pr124        59030        0.08             0.00            1.23         59.51
bier127     118282        2.76             0.42            2.06         66.29
kroA200      28568        5.71             3.49            6.70        318.44

The Graph Partition Problem

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected graph without self-connections. $\mathcal{V} = \{v_i\}$ is the set of vertices and $\mathcal{E}$ is the set of edges. For each edge $(v_i, v_j) \in \mathcal{E}$ there is a weight $c_{i,j} \in \mathbb{R}^+$. All weights can be expressed by a symmetric real matrix C, with ci,j = 0 when the arc (vi, vj) does not exist. The MaxCut problem is to find a partition of $\mathcal{V}$ into K disjoint sets Ai such that the sum of the weights of the edges from $\mathcal{E}$ that have their endpoints in different elements of the partition is maximum. Therefore, the function to maximize is

$\sum_{m > n} \; \sum_{i \mid v_i \in A_m} \; \sum_{j \mid v_j \in A_n} c_{i,j}$

To solve the MaxCut problem with MREM, we need N neurons, one per node in $\mathcal{V}$. The output of neuron i, si ∈ ℳ = {1, 2, ..., K}, denotes that the ith node is assigned to set Asi. Since maximizing the cost of the edges cut by the partition is equivalent to minimizing the cost of the edges with both endpoints in the same set of the partition, the objective function can be modelled as an energy function by taking wi,j = –2ci,j and f(x,y) = δx,y (that is, f(x,y) = 1 if, and only if, x = y; otherwise it is 0), with θi = 0.

The dynamics used in (Mérida-Casermeiro & López-Rodríguez, 2005) was named best2. It consists of obtaining the greatest decrease of the energy function by changing the state of only two neurons at a time. If neurons p and q are to be updated, the energy increments ΔE(i, j) when sp = i and sq = j, for i, j ∈ {1,...,K}, are computed. Then, the state of minimum increase is chosen as the new network state. Using this dynamics, in (Mérida-Casermeiro & López-Rodríguez, 2005), the MREM model is compared against some other networks, like OCHOM (Galán-Marín & Muñoz-Pérez, 2001), obtaining the best results in the authors’ experiments (see Table 2).
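The equivalence between maximizing the cut and minimizing the energy can be checked on a toy instance. The sketch below assumes a made-up weighted triangle, K = 2, and zero thresholds; it evaluates the MREM energy with wi,j = –2ci,j and the Kronecker-delta similarity, and compares it with the cut weight over all assignments (the names are illustrative, and the exhaustive search replaces the best2 dynamics only for the sake of the check).

```python
from itertools import product


def maxcut_energy(c, s):
    """MREM energy with w = -2c, f = Kronecker delta, theta = 0:
    E(S) = sum over ordered pairs (i, j) in the same set of c[i][j]."""
    n = len(s)
    return sum(c[i][j] for i in range(n) for j in range(n) if s[i] == s[j])


def cut_weight(c, s):
    """Total weight of the edges whose endpoints lie in different sets."""
    n = len(s)
    return sum(c[i][j] for i in range(n) for j in range(i + 1, n) if s[i] != s[j])


# Hypothetical symmetric weight matrix of a triangle (zero diagonal).
c = [[0, 1, 2],
     [1, 0, 3],
     [2, 3, 0]]
K = 2
total_weight = sum(c[i][j] for i in range(3) for j in range(i + 1, 3))

states = list(product(range(1, K + 1), repeat=3))
# For every assignment: cut weight = total weight - energy / 2.
consistent = all(cut_weight(c, s) == total_weight - maxcut_energy(c, s) / 2 for s in states)

best_state = min(states, key=lambda s: maxcut_energy(c, s))
print(consistent)                 # True
print(cut_weight(c, best_state))  # 5
```

Since the total weight is a constant of the instance, the state of minimum energy is also a state of maximum cut.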

Table 2. Results for MaxCut comparing MREM and OCHOM

                 MREM                            OCHOM
N     dens   Best       Av.        t      Best       Av.        t
50    0.05     276.8     256.28   0.05      276.8     242.15   0.0023
50    0.25    1013.2     970.84   0.06      999.6     926.26   0.0026
50    0.5     1778.8    1724.08   0.06     1778.8    1694.44   0.0033
50    0.75    2663.6    2475.48   0.05     2646      2432.47   0.0036
50    0.9     2941.8    2876.18   0.06     2940.4    2865.83   0.0031
100   0.05     990.2     917.72   0.15      958.8     867.64   0.0064
100   0.25    3719.2    3620.9    0.14     3725.5    3571.24   0.0086
100   0.5     6711.6    6637.08   0.13     6695.8    6585.54   0.0126
100   0.75    9816.2    9524.1    0.14     9816.2    9444.33   0.0118
100   0.9    11348.8   11215.06   0.14    11391.3   11148.4    0.0109
150   0.05    2009.8    1933.6    0.26     1929.6    1837.43   0.0147
150   0.25    7990      7807.16   0.26     7940.2    7690.35   0.0258
150   0.5    14701.4   14531.06   0.24    14658.4   14489.5    0.0209
150   0.75   21126.2   20899.94   0.22    21124     20907.6    0.0252
150   0.9    24926     24589.62   0.22    24859.7   24533.1    0.0256
200   0.05    3411.4    3321.84   0.38     3409.5    3316.28   0.0276
200   0.25   13741     13533.9    0.35    13617.9   13439.7    0.0468
200   0.5    25750.8   25500.18   0.34    25770.8   25526.8    0.0451
200   0.75   37038.6   36789.2    0.32    36932     36683.4    0.0486
200   0.9    43584.8   43296.26   0.33    43420.6   43104.6    0.0462


The 2-Pages Graph Layout Problem

In recent years, several graph representation problems have been studied in the literature. Most of them are related to the linear graph layout problem, in which the vertices of a graph are placed along a horizontal “node line”, or “spine” (dividing the plane into two half-planes or “pages”), and then edges are added to this representation as specified by the adjacency matrix. The objective of this problem is to minimize the number of crossings produced by such a layout. Some examples of problems associated with this linear graph layout problem (or 2 pages crossing number problem, 2PCNP) are the bandwidth problem (Chinn et al., 1982), the book thickness problem (Kainen, 1990), the pagenumber problem (Malitz, 1994), the boundary VLSI layout problem (Ullman, 1984) and the single-row routing problem (Raghavan & Sahni, 1983), or printed circuit board layout (Sinden, 1966) and automated graph drawing (Tamassia et al., 1988). In (López-Rodríguez et al., 2007), a neural model, derived from MREM, is designed to solve this problem. One of the differences between this model and the algorithms developed in the literature is that there is no need to assign a good ordering of the vertices in a preprocessing step. The model computes this optimal node order, as well as the relative position of the arcs. To solve the 2PCNP problem, the authors considered two MREM neural models:

• The first network is formed by N neurons, N being the number of nodes in the graph. The neuron outputs (the state vector) indicate the node ordering in the line. Thus, si = k is interpreted as the kth node being placed in the ith position in the node line. Hence, the output of each neuron can take values in the set ℳ1 = {1, 2, ..., N}.
• The second network is formed by as many neurons as there are edges in the graph, M. The output of each neuron belongs to the set ℳ2 = {−1, 1}. For the arc (vi, vj), s(vi,vj) = –1 indicates that the edge is drawn in the lower half-plane, and s(vi,vj) = +1, in the upper one.

Initially, the state of the net of vertices is randomly selected as a permutation of {1,2,...,N}. At any time, the net is looking for a better solution than the current one, in terms of minimizing the energy function. This is achieved by permuting the output of two neurons (node positions) and changing the location of an edge (from the upper half-plane to the lower one, and vice versa). In (López-Rodríguez et al., 2007), this new model is compared against some heuristics (Cimikowski, 2002) specially designed for this problem. MREM obtained the best solutions in the experiments, improving the best known solution in some cases (Table 3).

Table 3. Comparison between MREM and the heuristics mentioned in Cimikowski’s work

Graph              N    M    MREM   CN    e-len   1-page   greedy
K6                 6    21     3      3      3       4        5
K7                 7    28     9      9     11       9       13
K8                 8    36    18     18     18      30       27
K9                 9    45    36     36     42      50       50
K10               10    55    60     60     80      92       80
C20 (1, 2)        20    40     0      2      0       0        0
C20 (1, 2, 3)     20    60    19     24     36      48       40
C20 (1, 2, 3, 4)  20    80    74     74     90     118      108
C22 (1, 2, 3)     22    66    22     26     40      54       44
C22 (1, 3, 5, 7)  22    88   198    200    306     294      286
C24 (1, 3)        24    48    11     14     22      16       22
C26 (1, 3)        26    52    11     16     24      16       24
C28 (1, 3, 5)     28    84    80     86    138     138      130
C30 (1, 3, 5)     30    90    92     96    148     150      140

Figure 2. Optimal layouts for graphs K6 (left) and K3,3 (right)
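The quantity being minimized, the number of crossings of a 2-page layout, can be computed directly from the two state vectors. The sketch below is an illustrative counting routine, not the energy function of the cited model: it assumes 0-based node labels, a node ordering given as a list, and a page assignment given as a dictionary from edges to −1 or +1.

```python
def count_crossings(order, edges, page):
    """Count crossings of a 2-page layout.

    order[i] is the node placed at position i on the node line;
    page[(u, v)] is +1 (upper half-plane) or -1 (lower half-plane).
    Two edges cross iff they lie on the same page and their endpoints interleave.
    """
    position = {node: i for i, node in enumerate(order)}
    spans = []
    for (u, v) in edges:
        left, right = sorted((position[u], position[v]))
        spans.append((left, right, page[(u, v)]))

    crossings = 0
    for i in range(len(spans)):
        for j in range(i + 1, len(spans)):
            a, b, page_1 = spans[i]
            c, d, page_2 = spans[j]
            if page_1 == page_2 and (a < c < b < d or c < a < d < b):
                crossings += 1
    return crossings


# Complete graph K4 with the nodes in natural order.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
order = [0, 1, 2, 3]

all_upper = {edge: 1 for edge in edges}
one_page = count_crossings(order, edges, all_upper)

# Moving edge (1, 3) to the lower half-plane removes the only crossing.
split = dict(all_upper)
split[(1, 3)] = -1
two_pages = count_crossings(order, edges, split)

print(one_page)   # 1
print(two_pages)  # 0
```

A search procedure of the kind described above would repeatedly swap two entries of the ordering or flip the page of one edge, keeping the change whenever this count decreases.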

fUTURe TReNDS Recurrent neural networks can be used to solve many optimization problems. Researchers and practitioners can benefit from the application of the neural model MREM to diverse optimization problems. Other problems where these models can be applied cover aspects such as data classification, image compression by vector quantization, etc. It must be noted that many graph-based problems can be easily formulated in terms of minimizing the energy function of this model: degreeconstrained minimum spanning tree, maximum clique, etc.


CONCLUSION

The first works on optimization by neural networks were inspired by Hopfield's models. These models did not obtain good results when compared to the well-known Operations Research techniques, so many researchers focused on developing new neural models to improve the performance of Hopfield-type networks in this kind of task. The problem with these binary models is that all the information given by the problem has to be specified by means of only two values ({0,1} or {−1,1}), so some information is lost. Multivalued neural models are designed to represent the information of the problem by means of more than two values, achieving a better representation of the problem. With this improvement, the computational dynamics of multivalued models can be easily designed to solve a given optimization problem. These advantages make this kind of network a very powerful ally in tackling combinatorial problems. The MREM model is a multivalued model that generalizes many other models, so it can be easily used to solve optimization problems, as shown in the text.

Some applications of the model are well-known NP-complete optimization problems, like the Traveling Salesman Problem, the Graph Partition Problem, and the 2 Pages Crossing Number Problem. As shown in the references, this model is able to outperform the best algorithms to date in each of the mentioned problems.

REFERENCES

Aarts, E. & Korst, J. (1988). Simulated Annealing with Boltzmann Machines. Wiley.
Aras, N., Oommen, B. J., & Altinel, I. K. (1999). The Kohonen Network Incorporating Explicit Statistics and its Application to the Travelling Salesman Problem. Neural Networks, 12, 1273-1284.
Bland, R. & Shallcross, D. F. (1987). Large Traveling Salesman Problem Arising from Experiments in X-ray Crystallography: a Preliminary Report on Computation. Technical Report No. 730, School of OR/IE, Cornell University, New York.
Chinn, P. Z., Chvátalová, L., Dewdney, A. K., & Gibbs, N. E. (1982). The bandwidth problem for graphs and matrices - a survey. J. Graph Theory, 6, 223-253.
Cimikowski, R. (2002). Algorithms for the fixed linear crossing number problem. Discrete Applied Mathematics, 122, 93-115.
Erdem, M. H. & Ozturk, Y. (1996). A new family of multivalued networks. Neural Networks, 9(6), 979-989.
Galán-Marín, G. (2000). Redes Neuronales Recurrentes para Optimización Combinatoria. PhD thesis, Universidad de Málaga.
Galán-Marín, G. & Muñoz-Pérez, J. (2001). Design and analysis of maximum Hopfield networks. IEEE Trans. on Neural Networks, 12(2), 329-339.
Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.
Hopfield, J. & Tank, D. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141-152.
Kainen, P. C. (1990). The book thickness of a graph, II. Congr. Numer., 71, 127-132.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. (1983). Optimization by simulated annealing. Science, 220, 671-680.
Lai, W. K. & Coghill, G. G. (1988). Genetic breeding of control parameters for the Hopfield-Tank neural network. In Proc. Int. Conf. Neural Networks (pp. 618-623).
Lee, K. C., Funabiki, N., & Takefuji, Y. (1992). A parallel improvement algorithm for the bipartite subgraph problem. IEEE Transactions on Neural Networks, 3(1), 139-145.
López-Rodríguez, D., Mérida-Casermeiro, E., Ortiz-de-Lazcano-Lobato, J. M., & Galán-Marín, G. (2007). Two pages graph layout via recurrent multivalued neural networks. Lecture Notes in Computer Science, 4507, 192-199.
López-Rodríguez, D., Mérida-Casermeiro, E., Ortiz-de-Lazcano-Lobato, J. M., & López-Rubio, E. (2006). Image compression by vector quantization with recurrent discrete networks. Lecture Notes in Computer Science, 4132, 595-605.
Malitz, S. M. (1994). On the page number of graphs. J. Algorithms, 17(1), 71-84.
McCulloch, W. & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Mérida-Casermeiro, E. (2000). Red Neuronal recurrente multivaluada para el reconocimiento de patrones y la optimización combinatoria. PhD thesis, Universidad de Málaga.
Mérida-Casermeiro, E., Galán-Marín, G., & Muñoz-Pérez, J. (2001). An efficient multivalued Hopfield network for the travelling salesman problem. Neural Processing Letters, 14, 203-216.
Mérida-Casermeiro, E. & López-Rodríguez, D. (2005). Graph partitioning via recurrent multivalued neural networks. Lecture Notes in Computer Science, 3512, 1149-1156.
Mérida-Casermeiro, E., Muñoz-Pérez, J., & Domínguez-Merino, E. (2003). An n-parallel multivalued network: Applications to the travelling salesman problem. Computational Methods in Neural Modelling, Lecture Notes in Computer Science, 2686, 406-413.
Ozturk, Y. & Abut, H. (1997). System of associative relationships (SOAR). Proceedings of ASILOMAR.
Raghavan, R. & Sahni, S. (1983). Single row routing. IEEE Trans. Comput., C-32(3), 209-220.
Reinelt, G. (1994). The Travelling Salesman. Computational Solutions for TSP Applications. Springer.
Sinden, F. W. (1966). Topology of thin films circuit. Bell Syst. Tech. Jour., XLV, 1639-1666.
Smith, K. (1996). An argument for abandoning the traveling salesman problem as a neural network benchmark. IEEE Transactions on Neural Networks, 7(6), 1542-1544.
Smith, K., P. M. & Krishnamoorthy, M. (1998). Neural techniques for combinatorial optimization with applications. IEEE Transactions on Neural Networks, 9(6), 1301-1318.
Sun, Y., Li, J. G., & Yu, S. Y. (1995). Improvement on performance of modified Hopfield network for image restoration. IEEE Transactions on Image Processing, 5, 688-692.
Takahashi, Y. (1997). Mathematical improvement of the Hopfield model for TSP feasible solutions by synapse dynamical systems. Neurocomputing, 15, 15-43.
Takefuji, Y. (1992). Neural Network Parallel Computing. Kluwer Academic.
Takefuji, Y. & Wang, J. (1996). Neural Computing for Optimization and Combinatorics, volume 3. World Scientific.
Tamassia, R., Di Battista, G., & Batini, C. (1988). Automatic graph drawing and readability of diagrams. IEEE Trans. Syst. Man. Cybern., SMC-18, 61-79.
Tank, D. & Hopfield, J. (1986). Simple neural optimization networks: An A/D converter, signal decision circuit and linear programming circuit. IEEE Transactions on Circuits and Systems, 33(5), 533-541.
Ullman, J. D. (1984). Computational Aspects of VLSI. Computer Science Press.
Wilson, V. & Pawley, G. (1988). On the stability of the TSP problem algorithm of Hopfield and Tank. Biological Cybernetics, 58, 63-70.
Xu, X. & Tsai, W. T. (1991). Effective neural algorithms for the traveling salesman problem. Neural Networks, 4, 193-205.

KEY TERMS

2 Pages Graph Layout Problem: Problem of finding an ordering of the nodes of a graph on a straight line, and assigning, to each edge, a location in one of the two half-planes induced by that line, such that the number of crossings between edges is minimum.

Artificial Neural Network: Structure for distributed and parallel processing of information, formed by a series of units (which may possess a local memory and make local information processing operations), interconnected via one-way communication channels, called connections.

Computational Dynamics: Updating scheme of the neuron outputs in a neural model.

Energy Function: Objective function of the optimization problem solved by a neural model.

MaxCut Problem: Problem of finding a partition of the set of nodes of a weighted graph, such that the sum of the costs corresponding to edges with end-points in different sets of the partition is maximum.

Multivalued Discrete Neural Model: A model of neural networks in which neuron outputs may take a value in a set {m1, ..., mL}, instead of {−1, 1} or {0, 1}.

Travelling Salesman Problem: Problem of finding the shortest closed tour that visits a series of N cities. Each city must be visited exactly once.


Multilayer Optimization Approach for Fuzzy Systems

Ivan N. Silva, University of São Paulo, Brazil
Rogerio A. Flauzino, University of São Paulo, Brazil

INTRODUCTION

The design of fuzzy inference systems involves several decisions by the designer: it is necessary to determine, in a coherent way, the number of membership functions for the inputs and outputs, to specify the set of fuzzy rules of the system, and to define the strategies for rule aggregation and for defuzzification of the output sets. The need for systematic procedures to assist designers is widely felt, because trial and error is often the only technique available (Figueiredo & Gomide, 1997).

In general terms, for applications involving system identification and fuzzy modeling, it is convenient to use energy functions that express the error between the desired results and those provided by the fuzzy system. Examples are the mean squared error and the normalized mean squared error. In the context of system identification, besides the mean squared error, data regularization indicators can be added to the energy function in order to improve the system response in the presence of noise in the training data (Guillaume, 2001). In the absence of a tuning set, as happens in the parameter adjustment of a process controller, the energy function can be defined by functions that consider the desired requirements of a particular design (Wan, Hirasawa, Hu & Murata, 2001), e.g., maximum overshoot, settling time, rise time, undamped natural frequency, etc.

From this point of view, this article presents a new methodology based on error backpropagation for the adjustment of fuzzy inference systems, which can then be designed as a three-layer model. Each of these layers represents one of the tasks performed by the fuzzy inference system: fuzzification, fuzzy rule inference and defuzzification. The adjustment procedure proposed in this article is performed through the adaptation of the free parameters of each of these layers, in order to minimize the energy function previously specified. In principle, the adjustment can be made layer by layer separately. The operational differences associated with each layer, where the parameter adjustment of one layer does not influence the performance of another, allow each layer to be adjusted on its own. Thus, the tuning routine of a fuzzy inference system acquires greater flexibility than the training process used in artificial neural networks.

This methodology is interesting not only for the results obtained through computer simulations, but also for its generality with respect to the kind of fuzzy inference system used. The methodology can therefore be applied both to the Mamdani architecture and to that suggested by Takagi-Sugeno.

BACKGROUND

In recent years, a wide and growing interest in applications involving fuzzy logic has been observed. These applications range from consumer products, such as cameras, video camcorders, washing machines and microwave ovens, to industrial applications such as process control, medical instrumentation and decision support systems (Ramot, Friedman, Langholz & Kandel, 2003). Fuzzy inference systems can be treated as methods that use the concepts and operations defined by fuzzy set theory and by fuzzy reasoning methods (Sugeno & Yasukawa, 1993). Basically, their operational functions include fuzzification of the inputs, application of the inference rules, aggregation of the rules, and defuzzification, which produces the crisp outputs of the fuzzy system (Jang, 1993).

At present, several researchers are engaged in studies related to design techniques for fuzzy inference systems. The first type of design technique focuses on modeling a process from expert knowledge bases, where both the antecedent and consequent terms of the rules are always fuzzy sets, thus offering a high semantic level and good interpretability (Mamdani & Assilian, 1975). However, applying this technique to the mapping of complex systems composed of several input and output variables has been an arduous task, which can produce inaccurate results as well as poor performance (Guillaume, 2001; Becker, 1991).

The second type of design technique comprises those that learn automatically from data representing the behavior of the input and output variables of the process. This design strategy therefore uses a collection of input and output values obtained from the process to be modeled, which differs from the first design strategy, where the fuzzy system is defined using only the expert knowledge acquired from observation of the respective system. In a generic way, the methods derived from this second strategy can be interpreted as techniques for the automatic generation of fuzzy rules, which use the available data for their adjustment (or training) procedures.

Among the main approaches belonging to this second design strategy, the ANFIS (Adaptive-Network-based Fuzzy Inference System) algorithm proposed by Jang (1993) stands out. It is applicable to fuzzy architectures whose rules have real polynomial functions as consequent terms, such as those presented by Takagi & Sugeno (1985) and Sugeno & Kang (1988). More recent approaches, such as those proposed by Panella & Gallo (2005), Huang & Babri (2006) and Li & Hori (2006), also belong to this design strategy. However, representing a process through these automatic architectures can reduce the interpretability of the resulting rule base, whose consequent terms are expressed in most cases by polynomial functions instead of linguistic variables (Kamimura, Takagi & Nakanishi, 1994). This has strongly motivated the development of adjustment algorithms for fuzzy inference systems in which the consequent terms of the fuzzy rules are also represented by fuzzy sets.

MAIN FOCUS OF THE CHAPTER

Considering the operational functions performed by fuzzy inference systems, it is convenient to represent them by a three-layer model. Thus, the fuzzy inference system presented in this article can be represented by the sequential composition of three layers, i.e., input layer, inference layer and output layer. The input layer connects the input variables (coming from outside) to the fuzzy inference system and performs their fuzzification through proper membership functions. In the inference layer, the fuzzified input variables are combined with each other, according to the defined rules, using the operations defined by fuzzy theory. The set resulting from the aggregation of the rules is then defuzzified to produce the output of the fuzzy inference system. Both the aggregation and the defuzzification of the fuzzy system output are performed by the output layer. It is important to observe that the output layer, besides performing these two processes, is also responsible for storing the membership functions of the output variables. As an illustration, Fig. 1 shows the proposed multilayer model for a system with two inputs and one output, having three fuzzy rules in its inference layer. The following subsections present further details on how fuzzy inference systems can be represented by a three-layer model.

Figure 1. Fuzzy inference system composed of two inputs and one output. Input layer: membership functions μA1, ..., μAN1 for x1 and μB1, ..., μBN2 for x2, producing the vectors I1 and I2. Inference layer: rules r1(.), r2(.), r3(.) weighted by w1, w2, w3 to give R1(.), R2(.), R3(.). Output layer: aggregation and defuzzification, producing y.

Input Layer

The fuzzification of the inputs has the purpose of determining the membership degree of each input with respect to the fuzzy sets associated with that input variable. As many fuzzy sets as necessary can be associated with each input variable of the fuzzy system. Thus, for a fuzzy system with only one input and N fuzzy sets, the output of the input layer is a column vector with N elements, which represent the membership degrees of this input in relation to those fuzzy sets. If x denotes the single input of this fuzzy system, then the output of the input layer is the vector I1 given by:

$$I_1(x) = \left[\, \mu_{A_1}(x) \;\; \mu_{A_2}(x) \;\; \cdots \;\; \mu_{A_N}(x) \,\right]^T \qquad (1)$$

where μAk(.) is the membership function of the k-th fuzzy set associated with the input x. The input layer concept can be generalized to a fuzzy system having p input variables if each input is modeled as a sub-layer of the input layer. Under this consideration, the output vector of the input layer I(x) is defined by:

$$I(x) = \left[\, I_1(x_1)^T \;\; I_2(x_2)^T \;\; \cdots \;\; I_p(x_p)^T \,\right]^T \qquad (2)$$

where xk is the k-th input of the fuzzy system and Ik(.) is the k-th vector of membership functions, associated with the input xk. Fig. 1 illustrates the input layer for a fuzzy system composed of two inputs, which are mapped by the vectors I1 and I2. Several membership functions can be used in the proposed approach. One of the necessary requirements for those functions is that their values are normalized to the closed interval [0,1].
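As an illustration of equations (1) and (2), the following sketch computes the output of the input layer. It assumes Gaussian membership functions parameterized by a mean and a standard deviation, the type used later in the adjustment example; the names `gaussian_mf` and `input_layer` are hypothetical.

```python
import numpy as np


def gaussian_mf(x, mean, sigma):
    """Gaussian membership degree, normalized so that its maximum is 1."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)


def input_layer(x, mf_params):
    """Stack the membership vectors I_1(x_1), ..., I_p(x_p) as in equation (2).

    x         : sequence with the p crisp inputs
    mf_params : list with one (means, sigmas) pair of arrays per input
    """
    sub_layers = [gaussian_mf(xk, means, sigmas)
                  for xk, (means, sigmas) in zip(x, mf_params)]
    return np.concatenate(sub_layers)


# Two inputs: three fuzzy sets for x1 and two fuzzy sets for x2 (illustrative values).
params = [
    (np.array([-1.0, 0.0, 1.0]), np.array([1.0, 1.0, 1.0])),
    (np.array([0.0, 1.0]), np.array([0.5, 0.5])),
]
I = input_layer([0.0, 1.0], params)   # vector with 3 + 2 = 5 membership degrees
```

Each sub-vector corresponds to one sub-layer of the input layer, so the length of I is the total number of input fuzzy sets.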

Inference Layer

The inference layer of a fuzzy system processes the fuzzy inference rules defined for it. It also provides a knowledge base for the process. In this paper, the fuzzy inference system initially contains all the possible inference rules, so the tuning algorithm has the task of weighting them. Weighting the inference rules is a proper way to represent the most important rules, and it also allows conflicting rules to be related to each other without any loss of linguistic completeness. Thus, the i-th fuzzy rule can be expressed as follows:

$$R_i(I(x)) = w_i \, r_i(I(x)) \qquad (3)$$

where Ri(.) is the function representing the fuzzy weighting of the i-th fuzzy rule, wi is the weight of the i-th fuzzy rule and ri(.) represents the fuzzy value of the i-th fuzzy rule. Fig. 1 shows the composition involving ri(.) and Ri(.) for the three fuzzy rules belonging to the inference layer.
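Equation (3) can be sketched as follows. The text does not state which fuzzy operator combines the antecedents of a rule, so this example assumes the minimum (a common choice in Mamdani systems); the function name `inference_layer` and the rule encoding are hypothetical.

```python
import numpy as np


def inference_layer(memberships, rules, weights):
    """Weighted rule values R_i = w_i * r_i, as in equation (3).

    memberships : list of 1-D arrays, one per input (the sub-vectors I_k)
    rules       : list of tuples; rule[k] is the index of the fuzzy set of
                  input k used in the antecedent of that rule
    weights     : rule weights w_i
    """
    # r_i: firing strength of rule i, here the minimum over its antecedents (assumption)
    r = np.array([min(memberships[k][j] for k, j in enumerate(rule))
                  for rule in rules])
    return np.asarray(weights, dtype=float) * r


# Two inputs with two fuzzy sets each and three rules (illustrative values).
memberships = [np.array([0.2, 0.9]), np.array([0.5, 0.4])]
rules = [(0, 0), (1, 0), (1, 1)]
R = inference_layer(memberships, rules, [1.0, 0.5, 2.0])   # [0.2, 0.25, 0.8]
```

A weight of zero removes a rule from the inference process, which is how a tuning algorithm that starts from all possible rules can effectively select among them.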

Output Layer

The output layer of the fuzzy inference system aggregates the inference rules and defuzzifies the fuzzy set generated by this aggregation. Besides the operational aspects, the aggregation and defuzzification methods must take hardware performance requirements into account, in order to reduce the computational effort needed to process the fuzzy system. In this paper, the output layer of the inference system is also adjusted. The adjustment of this layer is carried out in a similar way to that of the input layer of the fuzzy system. As an example, the procedures involved in the output layer are also illustrated in Fig. 1.
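The text does not fix the aggregation and defuzzification operators, so the sketch below only illustrates one common combination under explicit assumptions: each output membership function is clipped by the weighted value of its rule, the clipped sets are aggregated with the maximum, and the result is defuzzified by the centroid over a discretized output range. The names `output_layer`, `consequents` and `y_grid` are hypothetical.

```python
import numpy as np


def gaussian_mf(x, mean, sigma):
    """Gaussian membership degree with maximum 1."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)


def output_layer(R, consequents, out_means, out_sigmas, y_grid):
    """Aggregate the weighted rules and defuzzify (max aggregation, centroid).

    R           : weighted rule values R_i from the inference layer
    consequents : consequents[i] is the index of the output fuzzy set of rule i
    y_grid      : discretized universe of the output variable
    """
    aggregated = np.zeros_like(y_grid)
    for strength, c in zip(R, consequents):
        clipped = np.minimum(strength, gaussian_mf(y_grid, out_means[c], out_sigmas[c]))
        aggregated = np.maximum(aggregated, clipped)
    area = aggregated.sum()
    if area == 0.0:
        # No rule fired: return the middle of the range (an arbitrary convention).
        return float(y_grid.mean())
    return float((y_grid * aggregated).sum() / area)


y_grid = np.linspace(-1.0, 1.0, 201)
out_means = np.array([-0.5, 0.5])
out_sigmas = np.array([0.2, 0.2])
y = output_layer([0.7, 0.7], [0, 1], out_means, out_sigmas, y_grid)   # close to 0 by symmetry
```

The means and standard deviations of the output membership functions are the free parameters of this layer, in the same way as for the input layer.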

Adjustment of the Fuzzy Inference System

Consider a fuzzy system with two inputs, each one composed of three Gaussian membership functions, with a total of five inference rules, and an output defined by two Gaussian membership functions. For each Gaussian membership function, two free parameters must be considered, i.e., the mean and the standard deviation. Consequently, the number of free parameters of the input layer is 12. A weighting factor is associated with each inference rule, resulting in a total of 5 free parameters in the inference layer. For the output layer, the same considerations used for the input layer are valid, so four free parameters are associated with it. The mapping f between the input space x and the output space y may then be defined by:

$$y = f(x, \mathit{mf}_{In}, w, \mathit{mf}_{Out}) \qquad (4)$$

where mfIn is the parameter vector associated with the input membership functions, w is the weight vector of the inference rules, and mfOut is the parameter vector associated with the output membership functions. Therefore, mfIn, w and mfOut represent the free parameters of the fuzzy system, and (4) can be rewritten as follows:

$$y = f(x, \Theta) \qquad (5)$$

where Θ is the vector resulting from the concatenation of the free parameters of the fuzzy system, i.e.

$$\Theta = \left[\, \mathit{mf}_{In}^{\,T},\; w^T,\; \mathit{mf}_{Out}^{\,T} \,\right] \qquad (6)$$

The energy function to be minimized, considering the fixed tuning set {x,d}, is defined by:

$$\xi \equiv \xi(x, y)(\Theta) \qquad (7)$$

where ξ represents the energy function associated with the fuzzy inference system f.
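The parameter count of the example above (12 + 5 + 4 = 21) and the concatenation in equation (6) can be checked with a short sketch. The array layout (one mean and one standard deviation per membership function) and the helper names `pack` and `unpack` are assumptions made for illustration.

```python
import numpy as np


def pack(mf_in, w, mf_out):
    """Concatenate all free parameters into a single vector Theta, as in (6)."""
    return np.concatenate([np.ravel(mf_in), np.ravel(w), np.ravel(mf_out)])


def unpack(theta, in_shape, n_rules, out_shape):
    """Recover the three parameter groups from Theta."""
    n_in = int(np.prod(in_shape))
    mf_in = theta[:n_in].reshape(in_shape)
    w = theta[n_in:n_in + n_rules]
    mf_out = theta[n_in + n_rules:].reshape(out_shape)
    return mf_in, w, mf_out


# 2 inputs x 3 Gaussian sets x (mean, std) = 12 input-layer parameters
mf_in = np.arange(12, dtype=float).reshape(2, 3, 2)
# 5 rule weights
w = np.ones(5)
# 2 output Gaussian sets x (mean, std) = 4 output-layer parameters
mf_out = np.array([[-0.5, 0.2], [0.5, 0.2]])

theta = pack(mf_in, w, mf_out)          # 21 free parameters in total
```

Keeping the three groups in one vector lets a generic optimizer treat the whole fuzzy system as a single function of Θ, while `unpack` restores the per-layer view when the system is evaluated.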

Unconstrained Optimization Techniques

Let ξ(x,y)(Θ) be an energy function that is differentiable with respect to the free parameters of the fuzzy inference system. The objective is to find an optimum solution Θ* such that, for every Θ:

$$\xi(\Theta^*) \le \xi(\Theta) \qquad (8)$$

Therefore, to satisfy the condition expressed in (8), it is necessary to solve an unconstrained optimization problem whose solution Θ* is given by:

$$\Theta^* \equiv \arg\min_{\Theta} \xi(\Theta) \qquad (9)$$

A necessary condition for the optimum solution in (9) can be written as follows:

$$\nabla \xi(\Theta^*) = 0$$

where ∇ is the gradient operator defined by:

$$\nabla \xi(\Theta) = \left[\, \frac{\partial \xi}{\partial \Theta_1},\; \frac{\partial \xi}{\partial \Theta_2},\; \ldots,\; \frac{\partial \xi}{\partial \Theta_m} \,\right]^T \qquad (10)$$

There are several techniques for solving unconstrained optimization problems. A detailed description of these methods can be found in Bertsekas (1999). The selection of the most suitable method depends on the complexity of the energy function. For example, the Gauss-Newton method for unconstrained optimization is well suited to problems where the energy function is defined by:

$$\xi(\Theta) = \frac{1}{2} \sum_{i=1}^{m} e^2(i) \qquad (11)$$


where e(i) is the absolute error in relation to the i-th tuning pattern. In this paper, a variant of the Gauss-Newton method is used for tuning the fuzzy inference system, defined by the following expression:

$$\Theta_{next} = \Theta_{now} - \frac{1}{2} \left( J^T J \right)^{-1} g \qquad (12)$$

where g is the gradient of ξ expressed in (11) and J is the Jacobian matrix of e defined in (11). The optimization algorithm used was the Levenberg-Marquardt method (Marquardt, 1963), which can efficiently handle ill-conditioned matrices JᵀJ by altering equation (12) as follows:

$$\Theta_{next} = \Theta_{now} - \frac{1}{2} \left( J^T J + \lambda I \right)^{-1} g \qquad (13)$$

The matrices J and the vectors g were calculated through the finite differences method.
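A minimal sketch of the update in (13) with a finite-difference Jacobian is given below. It makes several simplifying assumptions that are not taken from the article: forward differences with a fixed step h, a fixed damping factor λ (practical Levenberg-Marquardt implementations usually adapt it), and a fixed number of iterations. With ξ defined as in (11), the gradient is g = Jᵀe, so the printed factor 1/2 makes each iteration roughly half of a damped Gauss-Newton step; the sketch keeps the factor as printed. The function names are hypothetical.

```python
import numpy as np


def residuals_and_jacobian(residual_fn, theta, h=1e-6):
    """Errors e(theta) and their Jacobian J, by forward finite differences."""
    e0 = residual_fn(theta)
    J = np.empty((e0.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        J[:, j] = (residual_fn(theta + step) - e0) / h
    return e0, J


def levenberg_marquardt(residual_fn, theta0, lam=1e-3, n_iter=200):
    """Iterate Theta_next = Theta_now - 0.5 * (J^T J + lam*I)^-1 g, as in (13)."""
    theta = np.asarray(theta0, dtype=float)
    identity = np.eye(theta.size)
    for _ in range(n_iter):
        e, J = residuals_and_jacobian(residual_fn, theta)
        g = J.T @ e                                  # gradient of 0.5 * sum(e**2)
        step = 0.5 * np.linalg.solve(J.T @ J + lam * identity, g)
        theta = theta - step
    return theta


# Toy tuning problem (not from the article): fit d = 2*x + 1 with y = theta[0]*x + theta[1].
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * x + 1.0
theta_star = levenberg_marquardt(lambda th: th[0] * x + th[1] - d, [0.0, 0.0])
```

For a fuzzy inference system, `residual_fn` would evaluate the three layers with the parameters in Θ and return the differences between the system outputs and the desired outputs of the tuning set.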

Simulation Results

This section presents simulation results of the proposed methodology for the Mamdani fuzzy model. In the two following examples, the fuzzy system is used to model nonlinear functions. In the first example, a fuzzy inference system is used to predict the Mackey-Glass time series. In the second example, a two-dimensional sinc function is modeled by the fuzzy inference system.

Example 1: Modeling the Mackey-Glass Function

Using the adjustment methodology presented in this paper, a fuzzy inference system of Mamdani type was developed with the objective of predicting the Mackey-Glass time series (Mackey & Glass, 1977), which is defined by:

$$\frac{dx(t)}{dt} = -b \cdot x(t) + \frac{a \cdot x(t - \tau)}{1 + x(t - \tau)^c} \qquad (14)$$

where the values of the constants are usually assumed as a = 0.2, b = 0.1 and c = 10. The value of the delay constant τ was 17. The tuning set consisted of 500 patterns. The fuzzy inference system had four input variables, corresponding to the values x(t − 18), x(t − 12), x(t − 6) and x(t), and x(t + 6) was adopted as the output variable. The fuzzy inference system was defined with 4 fuzzy sets assigned to each input variable and also to the output variable. A total of 64 inference rules were used in the inference process. The energy function of the system was defined as the mean squared error between the desired values x(t + 6) and the estimated values x̂(t + 6), i.e.

$$\xi(\Theta) = \frac{1}{L} \sum_{i=1}^{L} \left[ x_i(t + 6) - \hat{x}_i(t + 6) \right]^2 \qquad (15)$$

where L is the number of data used in the tuning process (L = 500). After minimization of (15), the membership functions of the fuzzy inference system were adjusted as illustrated in Fig. 2. Fig. 3 presents the prediction results provided by the fuzzy inference system for 1000 sample points. The mean squared error of estimation for the proposed problem was 0.000598, with a standard deviation of 0.02448. The prediction error for the 1000 sample points is shown in Fig. 4. For comparison, a fuzzy inference system adjusted by ANFIS (Adaptive Neural-Fuzzy Inference System) was also developed. This fuzzy inference system was composed of 10 membership functions for each input, with a knowledge base of 10 rules. Its mean squared error of estimation for the proposed problem was 0.000165, with a standard deviation of 0.0041.

Figure 2. Input membership functions (one panel for each input: x(t-18), x(t-12), x(t-6) and x(t))

Figure 3. Estimation of the fuzzy inference system for the Mackey-Glass series

Figure 4. Prediction error for the Mackey-Glass series (relative percentual error versus test patterns)

Example 2: Modeling the Two-Input Sinc Function

In this example, the proposed methodology is used to model a two-dimensional sinc function defined by:

$$z = \mathrm{sinc}(x, y) = \frac{\sin(x) \cdot \sin(y)}{x \cdot y} \qquad (16)$$

From uniformly distributed grid points in the input range [-10,10] x [-10,10] of (16), 225 tuning data pairs were obtained. The fuzzy inference system used here contains 11 rules, with 8 membership functions assigned to input variable x, 7 membership functions assigned to input y and 3 membership functions assigned to output z. The tuning data and the reconstructed surface are illustrated in Fig. 5.

Figure 5. Tuning data (a) and reconstructed surface (b)
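The tuning sets of the two examples could be generated along the following lines. This is only a sketch under stated assumptions: the Mackey-Glass equation is integrated with a simple Euler scheme (step 0.1), the initial condition is x(0) = 1.2 with x(t) = 0 for t < 0, the 500 patterns start at t = 118, and the 225 grid points form a 15 x 15 grid. None of these choices is specified in the text, and all function names are hypothetical.

```python
import numpy as np


def mackey_glass(n_steps, a=0.2, b=0.1, c=10, tau=17, x0=1.2, dt=0.1):
    """Euler integration of the Mackey-Glass equation, sampled at t = 0, 1, ..., n_steps."""
    per_unit = int(round(1 / dt))      # integration steps per unit of time
    delay = int(round(tau / dt))       # delay expressed in integration steps
    total = n_steps * per_unit
    x = np.empty(total + 1)
    x[0] = x0
    for k in range(total):
        x_tau = x[k - delay] if k >= delay else 0.0   # assumed history: x(t) = 0 for t < 0
        x[k + 1] = x[k] + dt * (-b * x[k] + a * x_tau / (1.0 + x_tau ** c))
    return x[::per_unit]


def mackey_glass_patterns(series, t_start=118, n_patterns=500):
    """Inputs [x(t-18), x(t-12), x(t-6), x(t)] and desired output x(t+6)."""
    t = np.arange(t_start, t_start + n_patterns)
    inputs = np.column_stack([series[t - 18], series[t - 12], series[t - 6], series[t]])
    return inputs, series[t + 6]


def sinc2d_grid(n_per_axis=15, lo=-10.0, hi=10.0):
    """Uniform grid over [lo, hi] x [lo, hi] and z = sin(x) sin(y) / (x y)."""
    axis = np.linspace(lo, hi, n_per_axis)
    xx, yy = np.meshgrid(axis, axis)
    # np.sinc(u) = sin(pi*u) / (pi*u), so np.sinc(x / pi) = sin(x) / x, with value 1 at x = 0
    zz = np.sinc(xx / np.pi) * np.sinc(yy / np.pi)
    return np.column_stack([xx.ravel(), yy.ravel()]), zz.ravel()


def mean_squared_error(desired, estimated):
    """Energy function of the form used in equation (15)."""
    return float(np.mean((np.asarray(desired) - np.asarray(estimated)) ** 2))


series = mackey_glass(700)
X, d = mackey_glass_patterns(series)    # 500 patterns with 4 inputs each
P, z = sinc2d_grid()                    # 225 grid points and their sinc values
```

The arrays `X`, `d` and `P`, `z` play the role of the tuning set {x,d} in the energy function; the estimated values passed to `mean_squared_error` would come from the fuzzy inference system being tuned.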

FUTURE TRENDS

The methodology for the adjustment of fuzzy inference systems presented in this article can be considered very promising, not only for the performance and precision obtained through computer simulations, but also for its interpretability in relation to the output variable, which is a highly desirable feature of a fuzzy system. In fact, interpretability is the most prominent feature that distinguishes fuzzy systems from many other modeling techniques. We think that fuzzy system adjustment architectures, such as the one proposed here, are ideally suited for explaining solutions to users, because both the premises (antecedents) and the consequences of the rules are defined by fuzzy sets. Future research and applications should return to and concentrate on the linguistic features of fuzzy systems and their capabilities for knowledge representation, exploiting the tolerance for imprecision and uncertainty to summarize data and focus on decision-relevant information.

CONCLUSION

This article outlined the basic foundations of the adjustment process of fuzzy inference systems by means of unconstrained optimization techniques. For the tuning to be efficient, the energy function must be properly specified for the adjustment process. To validate the proposed methodology, it was applied to mathematical modeling problems and its results were compared with those provided by the ANFIS methodology. The results obtained with this methodology offer new perspectives for research on fuzzy inference systems, allowing problems previously treated only by artificial neural networks to be treated through fuzzy inference systems as well.



REFERENCES

Becker, S. (1991). Unsupervised Learning Procedures for Neural Networks. International Journal of Neural Systems, 2, 17-33.
Bertsekas, D. P. (1999). Nonlinear Programming (2nd ed.). Belmont: Athena Scientific.
Figueiredo, M., & Gomide, F. (1997). Adaptive Neuro Fuzzy Modeling. Proceedings of the Sixth IEEE International Conference on Fuzzy Systems, 1567-1572.
Guillaume, S. (2001). Designing Fuzzy Inference Systems from Data: An Interpretability-Oriented Review. IEEE Transactions on Fuzzy Systems, 9, 426-443.
Huang, G.-B., & Babri, H. A. (2006). Universal Approximation Using Incremental Networks with Random Hidden Computation Nodes. IEEE Transactions on Neural Networks, 17(4), 879-892.
Jang, J. R. (1993). ANFIS: Adaptive-Network-Based Fuzzy Inference System. IEEE Transactions on Systems, Man, and Cybernetics, 23, 665-685.
Kamimura, R., Takagi, T., & Nakanishi, S. (1994). Improving Generalization Performance by Information Minimization. Proceedings of the IEEE World Congress on Computational Intelligence, 143-148.
Li, W., & Hori, Y. (2006). An Algorithm for Extracting Fuzzy Rules Based on RBF Neural Network. IEEE Transactions on Industrial Electronics, 53(4), 1269-1276.
Mackey, M. C., & Glass, L. (1977). Oscillation and Chaos in Physiological Control Systems. Science, 197, 287-289.
Mamdani, E. H., & Assilian, S. (1975). An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller. International Journal of Man-Machine Studies, 7, 1-13.
Marquardt, D. (1963). An Algorithm for Least Squares Estimation of Nonlinear Parameters. SIAM Journal on Applied Mathematics, 11, 431-441.
Panella, M., & Gallo, A. S. (2005). An Input-Output Clustering Approach to the Synthesis of ANFIS Networks. IEEE Transactions on Fuzzy Systems, 13(1), 69-81.
Ramot, D., Friedman, M., Langholz, G., & Kandel, A. (2003). Complex Fuzzy Logic. IEEE Transactions on Fuzzy Systems, 11, 450-461.
Sugeno, M., & Kang, G. T. (1988). Structure Identification of Fuzzy Model. Fuzzy Sets and Systems, 28, 15-33.
Sugeno, M., & Yasukawa, T. (1993). A Fuzzy-Logic-Based Approach to Qualitative Modeling. IEEE Transactions on Fuzzy Systems, 1, 7-31.
Takagi, T., & Sugeno, M. (1985). Fuzzy Identification of System and Its Application to Modeling and Control. IEEE Transactions on Systems, Man, and Cybernetics, 15, 116-132.
Wan, W., Hirasawa, K., Hu, J., & Murata, J. (2001). Relation Between Weight Initialization of Neural Networks and Pruning Algorithms: Case Study on Mackey-Glass Time Series. Proceedings of the International Joint Conference on Neural Networks, 1750-1755.

KEY TERMS

Backpropagation Algorithm: Learning algorithm of ANNs, based on minimizing the error obtained from the comparison between the outputs that the network gives after the application of a set of network inputs and the outputs it should give (the desired outputs).

Defuzzification: Process of producing a quantifiable (crisp) result in fuzzy logic. Typically, a fuzzy system will have a number of rules that transform a number of variables into a "fuzzy" result, that is, the result is described in terms of membership in fuzzy sets.

Fuzzification: Process of transforming crisp values into grades of membership for linguistic terms of fuzzy sets. The membership function is used to associate a grade to each linguistic term.

Fuzzy Logic: Type of logic dealing with reasoning that is approximate rather than precisely deduced from classical predicate logic.

Fuzzy Rule: Linguistic constructions of type IF-THEN that have the general form "IF A THEN B", where A and B are (collections of) propositions containing linguistic variables.

Fuzzy System: Approach of computational intelligence that uses a collection of fuzzy membership functions and rules, instead of Boolean logic, to reason about data.

Membership Function: Generalization of the indicator function in classical sets. In fuzzy logic, it represents the degree of truth as an extension of valuation.


Multi-Layered Semantic Data Models

László Kovács, University of Miskolc, Hungary
Tanja Sieber, University of Miskolc, Hungary

INTRODUCTION

One of the basic terms in information engineering is data. In our approach, a data item is defined as the representation of an information atom stored in digital computers. Although an information atom can be considered as a subject-predicate-value triplet (Lassila, 1999), data is usually given only with its value representation. This fact can lead to definitions where data is just numbers, words or pictures without context. For example, in (WO, 2007), data is given as information in numerical form that can be digitally transmitted or processed. It is interesting that the term 'data' is often used without any exact terminological definition, with the effect that the term often remains confusing, sometimes even contradicting the definitions of the term presented. Sieber and Kammerer (2006) introduce a new interpretation of data containing several levels. The lowest level belongs to data instances that describe the form and appearance of symbols. The intermediate level is the level of representatives, which includes the applied encoding system. The highest level is related to the meaning with context description. All three levels are needed to get to know the information atom. For example, the symbol '36' in a database determines only the value and representation system, but not the meaning. To cover the whole information atom, the database should store some additional data items to describe the original data. These additional data items are called metadata. The main purpose of semantic data models is to describe both the context and the main structure of data items in the problem area. It is important to see that:

• metadata are data,
• metadata are relative, and
• metadata describe data.

Metadata constitute a basis for bringing together data that are related in terms of content, and for processing them further. They can be understood as a pre-requisite for intelligent and efficient administration and processing, and not least as a focused, formal means of providing relevant data.

BACKGROUND

In data management systems, the context of a value is usually defined with the help of a storage structure. An identification name (a text value) is assigned to each position of the structure. The description of the storage (structure, naming and constraints) is called a schema. A major problem of structural data modeling is that it cannot provide all the information needed to understand the full context of the data. For example, a relational schema RT (NM INT, KNEV CHAR(20), RU DATE) alone is not enough to capture the meaning of the stored data items. The main building blocks for describing context in semantic data models (SDM) are concepts and relationships. The first widely known structure-oriented semantic models in database design are the Entity-Relationship (ER) model (Chen, 1976) and the EER model (Thalheim, 2000). The ER model consists of three basic elements: entity (concept), relationship and attribute. The attributes are considered structural elements of the entities; one attribute may belong to only one entity. The EER model extends the ER model with IS_A and HAS_A relationships. Other extensions include SIM, IFO and RM/T. One of the main drawbacks of structure-oriented SDM is their limited expressive power.


Later, models like UML or ODL (Catell, 1997) were developed to cover the missing object-oriented elements. In the case of ODL, a class description can contain the following elements: attributes, methods, inheritance parameters, visibility, relationships and integrity rules. These models are powerful tools for software engineering, but they are not flexible enough to describe data models of higher abstraction. Research then focused on SDM with simpler and more universal elements. The most widely known high-level semantic models are semantic networks and ontology models. A semantic network is represented by a directed graph in which the vertices are the concepts and the edges are the relationships. The main differences between ontology models and the traditional SDM are as follows: there is no fixed structural hierarchy among the concepts, the relationships are flexible, the model is independent of the application domain, the structure is mapped into a logical formula, and the model can be connected to an inference engine. It is widely assumed that anything at a high level of information processing must be based on ontology (Sloman, 2003). Further details on current applications of ontology can be found, among others, in (Taniar, 2006). One of the first languages for ontology is RDF (Lassila, 1999). RDF is used to describe concepts in a neutral, machine-readable format. According to the specification, the basic language elements are resources, literals and statements. There are two types of resources: entity resources and properties. A statement is a triplet (p,s,o), where p is a property, s is a resource and o is either a literal or a resource. In another terminology, p is called the predicate, s the subject and o the object of the statement. As can be seen, a statement corresponds to an information atom.
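The correspondence between a statement and an information atom can be illustrated with a minimal Python sketch. It does not use any RDF library; the `Statement` type, the `values_of` helper and all `ex:` names are hypothetical and serve only to show how the bare value ‘36’ gains meaning once its subject and predicate are stored with it.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    """One RDF-style statement, i.e. one information atom."""
    subject: str    # the resource the statement is about
    predicate: str  # the property of the subject
    obj: str        # a literal value or another resource


# Hypothetical statements; the "ex:" names are invented for illustration.
statements = [
    Statement("ex:person42", "ex:age", "36"),
    Statement("ex:person42", "ex:worksAt", "ex:UniversityOfMiskolc"),
]


def values_of(subject: str, predicate: str) -> List[str]:
    """Return every object asserted for the given subject and predicate."""
    return [s.obj for s in statements
            if s.subject == subject and s.predicate == predicate]


print(values_of("ex:person42", "ex:age"))  # ['36']
```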
A pioneering representative of the next generation of languages is OWL (Bechhofer, 2004), which can be considered an extension of RDF that contains extra elements to describe, among others, typing, property characteristics, cardinality and behavioral properties. The OWL-DL language is based on Description Logic, which describes the structural relationships of the domain in a logic language and thereby enables automatic reasoning and constraint checking in the system. The applied logic language is based on first-order predicate logic. The most widely used products related to OWL are Protégé, Pellet and KAON2.

MULTI-LAYERED SEMANTIC MODELS

Multi-Layered Schemas

In the case of systems with complex functionality, one way to reduce complexity is to build a modular system. Modularization is a successful concept in all engineering areas. Modularization can be vertical or horizontal. Vertical modularization is called layering. The basic properties of a layered system are the following:

• the elements are assigned to clusters (called layers);
• there exists a hierarchical relationship between the clusters;
• the relationships within the clusters differ from the relationships between the clusters;
• the clusters cooperate with each other in the role of a client or of a server.

Every layer offers a set of functions that are built upon the services of the underlying layers. A multi-layered implementation can reduce cost compared with a single-layer structure. Layering means modularization from the viewpoint of implementation, and it has the following qualitative and quantitative benefits (Knoerschild, 2003):

• encapsulation (the layers are in great part self-contained, consistency),
• independence,
• flexibility (the layers can be replaced without affecting the other layers),
• cost reduction (simplicity in testing and in design, reusability).

Layered structures are common nowadays in, among others, networking (Hnatyshin, 2007), image processing (Sunitha, 2007), process control (Zender, 2007) and software development (Kreku, 2006).



Multi-Layered Nature of Human Recognition

It was realized very early that human spatial cognition is based on a partially hierarchical conceptual view of space (McNamara, 1986). It is usual to perform the mapping of the spatial environment with a semantic hierarchy (Kuipers, 2000). In the proposal of Sloman (2003), the internal representation of the spatial environment is implemented with a three-layered model. The lowest layer is called the metric layer. It establishes an absolute frame of reference and consists of a navigation graph that describes the important positions of the environment. In the next, topological layer, the navigational nodes are mapped into areas, where an area corresponds to a set of connected nodes. An area denotes a compound spatial concept. The highest level belongs to the conceptual layer. In this layer the areas are mapped to general abstract concepts. This level corresponds to the ontology layer that provides different relationships and a reasoning engine.

According to the current H-Cogaff view of human information processing architecture (Sloman, 2003), the cognition system consists of several regions performing concurrent activities. The regions are structured into hierarchies. The perception hierarchy can, for example, activate different concepts at the same time for a single sensor input image. Visual perception can detect different levels of structure and different levels of concepts. The developed multi-layer ontology model consists of three layers: the reactive, the deliberative and the meta-management layer.

Multi-layered structures have also gained popularity in artificial intelligence. In (Kamimura, 2003), the information theoretical competitive learning method was implemented with multi-layered networks to solve complex problems. Networks are composed of several competitive layers. In each competitive layer, information is maximized. This successive information maximization enables networks to extract features gradually. Experimental results confirmed that information can be maximized in multi-layered networks, and that the networks can extract features that cannot be detected by single-layered networks.


Multi-Layered Concept Models

Traditional SDM models were intended to manage only a single-layer structure. None of the original versions of ER, RDF or OWL uses layers in the model. On the other hand, a layered structure has many benefits:

• increase in simplicity of management,
• decrease in complexity,
• increase in flexibility, and
• increase in reusability.

The first classic layered models for semantic networks were developed in the 1980s. The strong influence of psychological theories of human cognition can be easily observed in the proposed models. The layered model of Thompson (1990) consists of five layers, similar to Greenwald’s (1988) model. The base layer represents sensor data (images, sounds, signs) and the temporal relationships between these data items. The next layer is devoted to basic concepts. The connection between a concept and its sensory appearances may vary and may be very complex (transformations). This connection should perform more complex activity than just simple association. The next level is called the level of events. In this layer, the simple object instances are bound to series and sentences. This layer should contain the logic supporting ideas of time and causality. The next level generates abstract objects that cluster object instances together. The next model level describes the activities on the abstract concepts, like planning and modeling. The highest level is related to the abstract concepts and abstract activities like reasoning and metadata management.

In the proposal of Khosla (2004), the layering of ontology is strongly correlated with the functional layering of the system. The paper describes the structure of a general soft-computing module. The innermost layer is the object layer, which describes the data schema. On top of this data layer are the distributed agent layer, the tool agent layer and the optimization agent layer. These layers perform, among others, preprocessing, transformation and decision making.

The concept of multi-layering can be applied to the traditional data models, too. A layered UML model is presented, among others, in (Kreku, 2006). The layers here also correspond to the different functional areas within the application. The three proposed layers are the component layer, the HW architecture layer and the platform architecture layer. In (Sunitha, 2007), the investigation is focused on the SDM part only. The semantic data model is divided into three layers. The bottom part is the concept measure layer that contains the descriptions of the concepts themselves. The middle layer is used to store the relationships (like specialization, classification) among the concepts. The top layer is for the context-related knowledge elements describing the environment of the application field.

In some other proposals, layering refers not to the functional structure but to the abstraction levels. In the classical UML (Terrasse, 2001), a four-layered metamodel architecture is used. The bottom layer is the object layer and the next layer is the model layer. On top of the model layer is the metamodel layer. At the top, the meta-metamodel layer can be found. The Meta-Object Facility (MOF) model is based on a layered, conceptual metamodel structure. The content of a conceptual layer describes the elements in the next layer down. Both UML and MOF are based on class-oriented representations. In (Melnik, 2000), a three-layer abstraction model is defined for the semantic web. These layers are the syntax layer, the object layer and the semantic layer.

The semantic data model presented in (Sieber, 2006) was invented as a model for bridging the gap between data and semiotics in terms of Peirce, with a special focus on technical documentation processes. The concerned knowledge can be shared between many people in different departments, different production locations (including different countries) and in different applications. As a consequence, in such a process nearly everyone has two roles: one of owning and dating knowledge; and one of searching and needing dated knowledge.
For mathematical purposes this model was extended (Sieber, 2007) using a semantic network based upon a lattice of concepts. The result is a multi-layer semantic data model that can be used to visualize more general decoding and encoding processing between signals and (semantic) concepts.

FUTURE TRENDS

The term layered ontology occurs very rarely in the literature. The main reason is that a basic ontology can describe any level of abstraction, so a single layer can cover concepts at any level. This monolithic structure causes difficulties in integrating existing ontologies, as the overlap of concepts is harder to detect. Current research projects on ontology are usually aimed at the development of accurate ontologies for different application domains. The main current areas are: medical systems, geographical systems, linguistics, social studies, enterprise information systems, logic, knowledge representation and automatic reasoning. Very few proposals deal with the application of modular, multi-layered ontologies. In (Purao, 2005), for example, a database design specific domain is analyzed, where three layers are defined: the core (local) level, the neighborhood level and the global domain level. Although the importance of a domain-independent ontology is widely acknowledged, current work seems to neglect this requirement. According to (Mikroyannidis, 2006), ontology management in most information systems is based on simplicity, ontology layering is rarely used, and the requirements for ontology evolution and integration are usually neglected. We can predict that modularized, layered ontology models will get more attention in the near future.

CONCLUSION

Traditional semantic models are based on a single-layer structure. This approach is a drawback in the development of complex systems. As the model of human cognition is built on a multi-layer approach, and the goal of semantic models is to describe concepts of our world, multi-layer models seem better suited to creating a global semantic model. Current semantic models, like UML or ontology, provide some layering possibilities, but the detailed analysis of multi-layered semantic models is a task for the future.



REFERENCES

Bechhofer, S. et al. (2004). OWL Web Ontology Language Reference, homepage: http://www.w3.org/TR/2004/rec-owL-REF-20040210

Catell, R. et al. (1997). The Object Database Standard: ODMG 2.0, The Morgan Kaufmann Publisher, ISBN 15589604634

Chen, P. (1976). The Entity-Relationship Model: Toward a unified view of data, ACM Transactions on Database Systems, pp. 9-36.

Greenwald, A. (1988). Levels of representation, Technical paper, homepage: http://faculty.washington.edu/agg/unpublished.htm

Gustavsson, L. & Torgersson, O. (2005). Benefits of Multi-Layer Design in Software with Multi-User Interfaces - A Three Step Study, Proceedings of Software Engineering.

Hnatishin, V., Gramatges, G. & Stiefel, M. (2007). Practical Considerations for Extending Network Layer Models with OPNET Modeler, Proceedings of Modeling and Simulations.

Kamimura, R. (2003). Information theoretic competitive learning in self-adaptive multi-layered networks, Connection Science, Vol. 15, pp. 3-26.

Khosla, R. (2004). Multi-layered Distributed Agent Technology for Soft computing systems, International Journal Knowledge-based Intelligent Information and Engineering Systems, Vol. 8, pp. 117-128.

Knoerschild, K. (2003). A Layering challenge, Technical paper, homepage: http://www.artima.com/weblogs/viewpost.jsp/thread=88316

Kreku, J. et al. (2006). Layered UML Workload and SystemC Platform Models for Performance Simulation, Proceedings of FDL2006.

Kuipers, B. (2000). The Spatial Semantic Hierarchy, Artificial Intelligence, Vol. 119, pp. 191-233.

Lassila, O. & Swick, R. (1999). Resource Description Framework (RDF) Model and Syntax Specification, W3C Recommendation, homepage: http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/

McNamara, T. (1986). Mental representations of spatial relations, Cognitive Psychology, Vol. 18, pp. 87-121.

Mikroyannidis, A. & Theodouldis, B. (2006). Heraclitus II: A Framework for Ontology Management and Evolution, Proceedings of IEEE/WIC/ACM Conference on Web Intelligence.

Melnik, S. & Decker, S. (2000). A Layered Approach to Information Modeling and Interoperability on the WEB, Proceedings of ECDL Workshop on the Semantic Web, pp. 10-23.

Purao, S. & Storey, V. (2005). A multi-layered ontology for comparing relationship semantics, Applied Ontology, I, pp. 117-139.

Sieber, T. & Kammerer, M. (2006). Daten, Wissen und Information. Eine Grundlagenanalyse unter besonderer Berücksichtigung der technischen Dokumentation, source: www.iit.uni-miskolc.hu/~kovacs

Sieber, T. & Kovacs, L. (2007). Development of a multi-layer concept network, Technical paper, source: www.iit.uni-miskolc.hu/~kovacs

Sloman, A. (2003). What kind of virtual machine is capable of human consciousness, Proceedings of ECAP2003.

Sloman, A. (2003). Human vision – a multi-layered multi-functional system, Proceedings of BMVA2003.

Sunitha, A. & Govindarajului, P. (2007). Multilayer Semantic data Model for Sports Video, IJCSN Journal, Vol. 7, pp. 330-341.

Taniar, D. (2006). WEB Semantics and Ontology, Igi Global Publisher, ISBN 9781591409069

Terrasse, M., Savonnet, M. & Becker, G. (2001). An UML-metamodeling Architecture for Interoperability of Information Systems, Proceedings of IDEAS01.

Thalheim, B. (2000). Entity-Relationship Modeling (EER), Springer Verlag.

Thompson, I. (1990). Layered Cognitive Networks, Generative Science, homepage: www.generativescience.org/ps-papers/layer7.html

WO (2007). Webopedia database, homepage: http://www.webopedia.com/TERM/d/data.html

Zender, H. & Kruijff, G. (2007). Multi-layered conceptual spatial mapping for autonomous mobile robots, Proc. of Control Mechanisms for Spatial Knowledge Processing in Cognitive / Intelligent Systems, pp. 62-66.

KEY TERMS

Data Model: A formal description language to describe and to manipulate the investigated data instances. It contains three components: a static structural part, an integrity part and a manipulation part.

Multi-Layered Data Model: A data model where the model elements are assigned to levels. In the model, a hierarchy is defined between the levels. Regarding the element-level relationships, the intra-level relationships differ from the inter-level relationships.

Ontology: A semantic data model that describes the concepts and their relationships. It contains a controlled vocabulary and a grammar for using the vocabulary terms. The ontology makes it possible to perform queries, assertions and reasoning. The most popular forms for describing an ontology are RDF and OWL.

OWL: A language to describe Web ontologies. It uses an XML format and it contains a formal description logic component, too. It provides the following extra functionality: classification, type and cardinality constraints, thesauri, decidability.

RDF: A semantic data model that describes the world with statements. A statement is a triplet having the following form: subject-predicate-object.

Semantic Data Model: A high-level data model. It is usually based on concepts and it uses a graphical formalism. It contains only the key semantic properties of the data structure. It does not cover the details of the implementation.

UML: A standardized general-purpose modeling language for object-oriented software systems. It has a graphical notation and contains several diagrams: structure diagrams (class, object, component, package) and behavioral diagrams (activity, use-case, state machine, interaction).


Multilogistic Regression by Product Units P. A. Gutiérrez University of Córdoba, Spain C. Hervás University of Córdoba, Spain F. J. Martínez-Estudillo INSA – ETEA, Spain M. Carbonero INSA – ETEA, Spain

INTRODUCTION

Multi-class pattern recognition has a wide range of applications including handwritten digit recognition (Chiang, 1998), speech tagging and recognition (Athanaselis, Bakamidis, Dologlou, Cowie, Douglas-Cowie & Cox, 2005), bioinformatics (Mahony, Benos, Smith & Golden, 2006) and text categorization (Massey, 2003). This chapter presents a comprehensive and competitive study in multi-class neural learning which combines different elements, such as multilogistic regression, neural networks and evolutionary algorithms. The Logistic Regression model (LR) has been widely used in statistics for many years and has recently been the object of extensive study in the machine learning community. Although logistic regression is a simple and useful procedure, it poses problems when it is applied to a real classification problem, where frequently we cannot make the stringent assumption of additive and purely linear effects of the covariates. A technique to overcome these difficulties is to augment or replace the input vector with new variables, the basis functions, which are transformations of the input variables, and then to use linear models in this new space of derived input features. Methods like sigmoidal feed-forward neural networks (Bishop, 1995), generalized additive models (Hastie & Tibshirani, 1990), and PolyMARS (Kooperberg, Bose & Stone, 1997), which is a hybrid of Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991) specifically designed to handle classification problems, can all be seen as different nonlinear basis function models. The major drawback of these approaches is the need to specify the type and the optimal number of the corresponding basis functions.

Logistic regression models are usually fit by maximum likelihood, where the Newton-Raphson algorithm is the traditional way to estimate the maximum likelihood a-posteriori parameters. Typically, the algorithm converges, since the log-likelihood is concave. It is important to point out that the computation of the Newton-Raphson algorithm becomes prohibitive when the number of variables is large. Product Unit Neural Networks, PUNN, introduced by Durbin and Rumelhart (Durbin & Rumelhart, 1989), are an alternative to standard sigmoidal neural networks and are based on multiplicative nodes instead of additive ones.

BACKGROUND

In the classification problem, measurements $x_i$, $i = 1, 2, \ldots, k$, are taken on a single individual (or object), and the individuals are to be classified into one of $J$ classes on the basis of these measurements. It is assumed that $J$ is finite, and the measurements $x_i$ are random observations from these classes. A training sample $D = \{(\mathbf{x}_n, \mathbf{y}_n);\ n = 1, 2, \ldots, N\}$ is available, where $\mathbf{x}_n = (x_{1n}, \ldots, x_{kn})$ is the vector of measurements taking values in $\Omega \subset \mathbb{R}^k$, and $\mathbf{y}_n$ is the class level of the $n$th individual. In this chapter, we will adopt the common technique of representing the class levels using a “1-of-J” encoding vector $\mathbf{y} = (y^{(1)}, y^{(2)}, \ldots, y^{(J)})$, such that $y^{(l)} = 1$ if $\mathbf{x}$ corresponds to an example belonging to class $l$ and $y^{(l)} = 0$ otherwise. Based on the training sample, we wish to find a decision function $C : \Omega \to \{1, 2, \ldots, J\}$ for classifying the individuals. In other words, $C$ provides a partition, say $D_1, D_2, \ldots, D_J$, of $\Omega$, where $D_l$ corresponds to the $l$th class, $l = 1, 2, \ldots, J$, and measurements belonging to $D_l$ will be classified as coming from the $l$th class. A misclassification occurs when a decision rule $C$ assigns an individual (based on its measurement vector) to a class $j$ when it actually comes from a class $l \neq j$. To evaluate the performance of the classifiers we can define the Correctly Classified Rate by

$$CCR = \frac{1}{N} \sum_{n=1}^{N} I\big(C(\mathbf{x}_n) = \mathbf{y}_n\big),$$

where $I(\cdot)$ is the zero-one indicator function, which equals 1 when its argument holds and 0 otherwise. A good classifier tries to achieve the highest possible CCR in a given problem. Suppose that the conditional probability that $\mathbf{x}$ belongs to class $l$ verifies:

$$p\big(y^{(l)} = 1 \mid \mathbf{x}\big) > 0, \quad l = 1, 2, \ldots, J, \quad \mathbf{x} \in \Omega,$$

and set the function:

$$f_l(\mathbf{x}, \theta_l) = \log \frac{p\big(y^{(l)} = 1 \mid \mathbf{x}\big)}{p\big(y^{(J)} = 1 \mid \mathbf{x}\big)}, \quad l = 1, 2, \ldots, J, \quad \mathbf{x} \in \Omega,$$

where $\theta_l$ is the weight vector corresponding to class $l$ and $f_J(\mathbf{x}, \theta_J) \equiv 0$. Under a multinomial logistic regression, the probability that $\mathbf{x}$ belongs to class $l$ is then given by

$$p\big(y^{(l)} = 1 \mid \mathbf{x}, \theta\big) = \frac{\exp f_l(\mathbf{x}, \theta_l)}{\sum_{j=1}^{J} \exp f_j(\mathbf{x}, \theta_j)}, \quad l = 1, 2, \ldots, J,$$

where $\theta = (\theta_1, \theta_2, \ldots, \theta_{J-1})$. The classification rule coincides with the optimal Bayes’ rule. In other words, an individual should be assigned to the class which has the maximum probability, given the measurement vector $\mathbf{x}$:

$$C(\mathbf{x}) = \hat{l}, \quad \text{where } \hat{l} = \arg\max_{l} f_l(\mathbf{x}, \hat{\theta}_l), \text{ for } l = 1, \ldots, J.$$

On the other hand, because of the normalization condition we have:

$$\sum_{l=1}^{J} p\big(y^{(l)} = 1 \mid \mathbf{x}, \theta\big) = 1,$$

and the probability for one of the classes (in the proposed case, the last) need not be estimated (observe that we have considered $f_J(\mathbf{x}, \theta_J) \equiv 0$).
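The class probabilities, the Bayes decision rule and the CCR defined in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, classes are numbered from 1 to J with class J as the reference class whose score is fixed at 0, and the scores f_1, ..., f_{J-1} are taken as given rather than fitted.

```python
import math


def class_probabilities(f):
    """Softmax over f_1..f_{J-1} plus the fixed reference score f_J = 0."""
    scores = list(f) + [0.0]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(f):
    """Bayes rule: return the 1-based index of the most probable class."""
    p = class_probabilities(f)
    return p.index(max(p)) + 1


def ccr(predicted, actual):
    """Fraction of individuals whose predicted class equals the true class."""
    hits = sum(1 for c, y in zip(predicted, actual) if c == y)
    return hits / len(actual)


# Three classes (J = 3), so two free scores per individual.
scores = [[2.0, -1.0], [-3.0, -2.0], [0.5, 1.5]]
predicted = [classify(f) for f in scores]
print(predicted)                  # [1, 3, 2]
print(ccr(predicted, [1, 3, 1]))  # two of three correct
```

Because the exponential is monotone, picking the largest probability is the same as picking the largest score, with the reference class competing through its score of 0.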

MULTILOGISTIC REGRESSION AND PRODUCT UNIT NEURAL NETWORKS

Multilogistic Regression by using Linear and Product-Unit models (MLRPU) addresses the nonlinear effects of the covariates by proposing a multilogistic regression model based on the combination of linear and product-unit models, where the nonlinear basis functions of the model are given by the product of the inputs raised to arbitrary powers. These basis functions express the possible strong interactions between the covariates, where the exponents are not fixed and may even take real values. In fitting the proposed model, the non-linearity of the PUNN implies that the corresponding Hessian matrix is generally indefinite and that the likelihood may have several local maxima. This justifies the use of an alternative heuristic procedure to estimate the parameters of the model.

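Non-linear Model Proposed

The general expression of the proposed model is given by:

$$f_l(\mathbf{x}, \theta_l) = \alpha_0^l + \sum_{i=1}^{k} \alpha_i^l x_i + \sum_{j=1}^{m} \beta_j^l \prod_{i=1}^{k} x_i^{w_{ji}}, \quad l = 1, 2, \ldots, J-1,$$

where $\theta_l = (\boldsymbol{\alpha}^l, \boldsymbol{\beta}^l, W)$, $\boldsymbol{\alpha}^l = (\alpha_0^l, \alpha_1^l, \ldots, \alpha_k^l)$, $\boldsymbol{\beta}^l = (\beta_1^l, \ldots, \beta_m^l)$ and $W = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m)$, with $\mathbf{w}_j = (w_{j1}, w_{j2}, \ldots, w_{jk})$, $w_{ji} \in \mathbb{R}$.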
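As a worked illustration of the model expression, the following Python sketch evaluates one f_l for given coefficients. The function names and the numeric values are hypothetical. It also assumes strictly positive inputs, since a negative base raised to a non-integer real exponent is not a real number.

```python
def product_unit(x, w):
    """B_j(x, w_j): product of the inputs raised to real exponents.

    Assumes every x_i > 0 so that x_i ** w_ji stays real.
    """
    value = 1.0
    for xi, wi in zip(x, w):
        value *= xi ** wi
    return value


def f_l(x, alpha, beta, W):
    """Linear part alpha_0 + sum_i alpha_i x_i plus the product-unit part."""
    linear = alpha[0] + sum(a * xi for a, xi in zip(alpha[1:], x))
    nonlinear = sum(b * product_unit(x, w) for b, w in zip(beta, W))
    return linear + nonlinear


x = [2.0, 4.0]                  # k = 2 inputs
W = [[1.0, 0.5], [-1.0, 0.0]]   # m = 2 basis functions
alpha = [1.0, 0.5, 0.0]         # alpha_0, alpha_1, alpha_2
beta = [2.0, -2.0]

# B_1 = 2^1 * 4^0.5 = 4 and B_2 = 2^-1 * 4^0 = 0.5,
# so f_l = (1 + 0.5*2 + 0*4) + (2*4 - 2*0.5) = 9.
print(f_l(x, alpha, beta, W))   # 9.0
```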

As has been stated before, the nonlinear part of fl(x,θl) corresponds to Product Unit Neural Networks (PUNN), introduced by Durbin and Rumelhart (Durbin & Rumelhart, 1989) and subsequently developed by other authors (Janson & Frenzel, 1993), (Leerink, Giles, Horne & Jabri, 1995), (Ismail & Engelbrecht, 2000), (Martínez-Estudillo, Hervás-Martínez, Martínez-Estudillo & García-Pedrajas, 2006), (Martínez-Estudillo, Martínez-Estudillo, Hervás-Martínez & García-Pedrajas, 2006). Advantages of product-unit based neural networks include increased information capacity and the ability to form higher-order combinations of inputs. They are universal approximators and it is possible to obtain upper bounds of the VC dimension of product unit neural networks that are similar to those obtained for sigmoidal neural networks (Schmitt, 2001). Despite these obvious advantages, product-unit based neural networks have a major drawback. Their training is more difficult than the training of standard sigmoidal based networks (Durbin & Rumelhart, 1989). The main reason for this difficulty is that small changes in the exponents can cause large changes in the total error surface. Hence, networks based on product units have more local minima and a greater probability of becoming trapped in them. It is well-known (Janson & Frenzel, 1993) that back-propagation is not efficient in training product-units. Several efforts have been made to develop learning methods for product units (Leerink, Giles, Horne & Jabri, 1995), (Martínez-Estudillo, Martínez-Estudillo, Hervás-Martínez, & García-Pedrajas, 2006), mainly in a regression context.

Estimation of the Model Coefficients

In the supervised learning context, the components of the weight vectors $\theta = (\theta_1, \theta_2, \ldots, \theta_{J-1})$ are estimated from the training dataset $D$. To perform the maximum likelihood (ML) estimation of $\theta$, one can minimize the negative log-likelihood function

$$L(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p(\mathbf{y}_n \mid \mathbf{x}_n, \theta) = \frac{1}{N} \sum_{n=1}^{N} \left[ -\sum_{l=1}^{J} y_n^{(l)} f_l(\mathbf{x}_n, \theta_l) + \log \sum_{l=1}^{J} \exp f_l(\mathbf{x}_n, \theta_l) \right].$$

The error surface associated with the proposed model is very convoluted, with numerous local optima. The non-linearity of the model with respect to the parameters $\theta_l$ and the indefinite character of the associated Hessian matrix do not recommend the use of gradient-based methods to maximize the log-likelihood function. Moreover, the optimal number of basis functions of the model (i.e. the number of hidden nodes in the product-unit neural network) is unknown. Thus, the estimation of the parameter vector $\hat{\theta}$ is carried out by means of a hybrid procedure described below. The different aspects of the MLRPU methodology are described in detail in what follows. The process is structured in four steps:

Step 1. We apply an evolutionary neural network algorithm to find the basis functions

$$B(\mathbf{x}, \hat{W}) = \{B_1(\mathbf{x}, \hat{\mathbf{w}}_1), B_2(\mathbf{x}, \hat{\mathbf{w}}_2), \ldots, B_m(\mathbf{x}, \hat{\mathbf{w}}_m)\}$$

corresponding to the nonlinear part of $f(\mathbf{x}, \theta)$. We have to determine the number of basis functions $m$ and the weight vector $W = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m)$. To apply evolutionary neural network techniques, we consider a PUNN with the following structure (Fig. 1): an input layer with a node for every input variable, a hidden layer with several nodes, and an output layer with one node for each category. The activation function of the $j$-th node in the hidden layer is given by

$$B_j(\mathbf{x}, \mathbf{w}_j) = \prod_{i=1}^{k} x_i^{w_{ji}},$$

where $w_{ji}$ is the weight of the connection between input node $i$ and hidden node $j$ and $\mathbf{w}_j = (w_{j1}, \ldots, w_{jk})$. The activation function of the output node $l$ is given by

$$g_l(\mathbf{x}, \boldsymbol{\beta}^l, W) = \beta_0^l + \sum_{j=1}^{m} \beta_j^l B_j(\mathbf{x}, \mathbf{w}_j),$$

where $\beta_j^l$ is the weight of the connection between the hidden node $j$ and the output node $l$. The transfer function of all output nodes is the identity function. The weight vector $W = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m)$ is estimated by means of an evolutionary programming algorithm detailed in (Hervás-Martínez, Martínez-Estudillo & Gutiérrez, 2006), which optimizes the error function given by the negative log-likelihood for $N$ observations associated with the product-unit model:

$$L^*(\boldsymbol{\beta}, W) = \frac{1}{N} \sum_{n=1}^{N} \left[ -\sum_{l=1}^{J-1} y_n^{(l)} g_l(\mathbf{x}_n, \boldsymbol{\beta}^l, W) + \log \sum_{l=1}^{J-1} \exp g_l(\mathbf{x}_n, \boldsymbol{\beta}^l, W) \right].$$

Although in this step the evolutionary process obtains a concrete value for the $\boldsymbol{\beta}$ vector, we only consider the estimated weight vector $\hat{W} = (\hat{\mathbf{w}}_1, \hat{\mathbf{w}}_2, \ldots, \hat{\mathbf{w}}_m)$ that builds the basis functions. The value for the $\boldsymbol{\beta}$ vector will be determined in Step 3 together with the $\boldsymbol{\alpha}$ coefficient vector.

Figure 1. Model of a product-unit based neural network

Step 2. We consider the following transformation of the input space by adding the nonlinear basis functions obtained by the evolutionary algorithm in Step 1:

$$H : \mathbb{R}^k \to \mathbb{R}^{k+m}, \quad (x_1, x_2, \ldots, x_k) \to (x_1, x_2, \ldots, x_k, z_1, \ldots, z_m),$$

where $z_1 = B_1(\mathbf{x}, \hat{\mathbf{w}}_1), \ldots, z_m = B_m(\mathbf{x}, \hat{\mathbf{w}}_m)$.

Step 3. We minimize the negative log-likelihood function for $N$ observations:

$$L(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{N} \sum_{n=1}^{N} \left[ -\sum_{l=1}^{J} y_n^{(l)} \big(\boldsymbol{\alpha}^l \mathbf{x}_n + \boldsymbol{\beta}^l \mathbf{z}_n\big) + \log \sum_{l=1}^{J} \exp\big(\boldsymbol{\alpha}^l \mathbf{x}_n + \boldsymbol{\beta}^l \mathbf{z}_n\big) \right],$$

where $\mathbf{x}_n = (1, x_{1n}, \ldots, x_{kn})$, $\mathbf{z}_n = (z_{1n}, \ldots, z_{mn})$ collects the basis functions evaluated at the $n$th observation, and the products are inner products. Now, the Hessian matrix of the negative log-likelihood in the new variables $x_1, x_2, \ldots, x_k, z_1, \ldots, z_m$ is positive semidefinite. Then, we can apply Newton’s method, also known, in this case, as Iteratively Reweighted Least Squares (IRLS). Although there are other methods for performing this optimization, none clearly outperforms IRLS (Minka, 2003). The estimated coefficient vector $\hat{\theta} = (\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}}, \hat{W})$ determines the model:

$$f_l(\mathbf{x}, \hat{\theta}_l) = \hat{\alpha}_0^l + \sum_{i=1}^{k} \hat{\alpha}_i^l x_i + \sum_{j=1}^{m} \hat{\beta}_j^l \prod_{i=1}^{k} x_i^{\hat{w}_{ji}}, \quad l = 1, 2, \ldots, J-1.$$

Step 4. In order to select the final model, we use a backward stepwise procedure, starting with the full model with all the covariates, initial and PU covariates, and successively pruning variables from the model until further pruning does not improve the fit.

Application to Remote Sensing

We have tested the proposed methodology on a real agronomical problem in precision farming: mapping weed patches in crop fields from remotely sensed data. Remote sensing systems can provide a large amount of continuous field information at a reasonable cost, and remotely sensed imagery shows great potential for modelling agronomic parameters for precision farming. One way of reducing the impact of agriculture on environmental quality is to develop more efficient approaches for crop production determination and for site-specific weed management. Potential economic and environmental benefits of site-specific herbicide application include reduced spray volume, herbicide costs and non-target spraying, and increased control of weeds (Thompson, Stafford & Miller, 1991; Medlin, Shaw, Gerard & LaMastus, 2000). We address the weed patch mapping problem through the analysis of aerial photographs. Images and data sets were provided by the Precision Farming and Remote Sensing Unit of the Department of Crop Protection, Institute of Sustainable Agriculture (CSIC, Spain), whose members reported previous results in predicting Ridolfia segetum Moris patches (Peña-Barragán, López-Granados, Jurado-Expósito & García-Torres, 2007). The data analyzed correspond to a study conducted in 2003 at the 42 ha farm Matabueyes, naturally infested by R. segetum. In a field study, the nature of 1,600 pixels was determined, and these were taken as ground-truth pixels: 800 pixels were classified as R. segetum and 800 pixels as R. segetum free. The input variables are the digital values of all bands in each available image, that is, Red (R), Green (G) and Blue (B) for the June image, and R, G, B and Near Infrared (NIR) for the May and July images.
The experimental design was conducted using a stratified holdout cross-validation procedure, with approximately 0.7n (1,120 pixels) for the training set and 0.3n (480 pixels) for the generalization set, n being the size of the full dataset. In all experiments, the EA was applied with the same parameters. SPSS 13.0 software (SPSS, 2005) was used to apply the IRLS algorithm and to select the most significant variables in the final model through a backward stepwise procedure. Four models are compared in the experiments: first, the best PUNN model obtained by the EA (EPUNN); second, a standard Logistic Regression model using the initial covariates (LR); third, Logistic Regression over only the basis functions extracted from the EPUNN model (MLRPU); and fourth, Logistic Regression over the same basis functions together with the initial covariates (MLRLPU).
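The stratified holdout split described above can be sketched as follows. This is a minimal illustration, not the procedure actually run by the authors: the function name, the seed and the label encoding (1 for R. segetum, 0 for weed-free) are assumptions made for the example.

```python
import random

def stratified_holdout(labels, train_fraction=0.7, seed=0):
    """Split indices so that each class keeps the same train/test proportion."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return sorted(train), sorted(test)

# 800 weed pixels and 800 weed-free pixels, as in the ground-truth data set.
labels = [1] * 800 + [0] * 800
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))  # 1120 480
```

Splitting within each class keeps the 50/50 class balance of the ground-truth pixels in both the training and the generalization sets.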

Results

The performance of each model has been evaluated using the Correctly Classified Ratio in the generalization set (CCRG). Table 1 shows the classification matrices over the training and generalization sets for the three classification problems and the four models (one problem at each date, May, June and July, and four models, EPUNN, LR, MLRPU and MLRLPU). The best CCRG results are obtained with MLRPU and MLRLPU in May and June, although in July the EPUNN model yields the best results. At all dates, the differences between standard LR and hybrid LR (MLRPU and MLRLPU) are very significant. Table 2 lists the models obtained at the date that leads to the best classification results, that is, June. Using these models we can obtain the probability of R. segetum presence at every pixel of the image, including pixels that are not ground truth. Figures 1, 2, 3 and 4 show the weed maps obtained in June using the four models, EPUNN, LR, MLRPU and MLRLPU, respectively. Weed presence probability is represented on a scale between white (minimum probability, nearly 0) and dark green (maximum probability, nearly 1). From these maps, the agronomical expert can decide which threshold probability value to use when applying herbicide. The MLRLPU and MLRPU models differentiate more clearly between high weed density zones and weed-free zones, so they are of greater interest from the point of view of intelligent site-specific herbicide application.


Table 1. Classification matrices (Y=0, R. segetum absence; Y=1, R. segetum presence) at all dates. Each cell lists four values in this order: best Evolutionary Product Unit Neural Network (EPUNN); Logistic Regression (LR); Logistic Regression only with Product Units, MLRPU (in parentheses); Logistic Regression with initial covariates and Product Units, MLRLPU (in square brackets). Rows give the ground-truth response and columns the predicted response.

| Phenological stage (date) | Ground truth | Training: predicted Y=0 | Training: predicted Y=1 | Training: CCR (%) | Generalization: predicted Y=0 | Generalization: predicted Y=1 | Generalization: CCR (%) |
|---|---|---|---|---|---|---|---|
| Vegetative (mid-May) | Y=0 | 384 352 (383) [394] | 176 208 (177) [166] | 68.5 62.9 (68.4) [70.4] | 164 148 (164) [168] | 76 92 (76) [72] | 68.3 61.7 (68.3) [70] |
| Vegetative (mid-May) | Y=1 | 133 171 (136) [141] | 427 389 (424) [419] | 76.2 69.5 (75.7) [74.8] | 65 69 (67) [69] | 175 171 (173) [171] | 72.9 71.3 (72.1) [71.3] |
| Vegetative (mid-May) | CCR (%) | | | 72.4 66.2 (72.1) [72.6] | | | 70.6 66.5 (70.2) [70.6] |
| Flowering (mid-June) | Y=0 | 547 529 (547) [552] | 13 31 (13) [8] | 97.7 94.5 (97.7) [98.6] | 236 226 (237) [238] | 4 14 (3) [2] | 98.3 94.2 (98.8) [99.2] |
| Flowering (mid-June) | Y=1 | 7 30 (9) [11] | 553 530 (551) [549] | 98.8 94.6 (98.4) [98] | 2 12 (2) [2] | 238 228 (238) [238] | 99.2 95 (99.2) [99.2] |
| Flowering (mid-June) | CCR (%) | | | 98.2 94.6 (98) [98.3] | | | 98.7 94.6 (99) [99.2] |
| Senescence (mid-July) | Y=0 | 443 296 (443) [447] | 117 264 (117) [113] | 79.1 52.9 (79.1) [79.8] | 195 138 (425) [189] | 45 102 (55) [51] | 81.2 57.5 (88.5) [78.8] |
| Senescence (mid-July) | Y=1 | 105 131 (111) [117] | 455 429 (449) [443] | 81.2 76.6 (80.2) [79.1] | 52 60 (53) [50] | 188 180 (187) [190] | 78.3 75 (77.9) [79.2] |
| Senescence (mid-July) | CCR (%) | | | 80.1 64.7 (79.6) [79.5] | | | 79.8 66.3 (79.6) [79] |
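The Correctly Classified Ratio reported in Table 1 is the percentage of pixels on the diagonal of the classification matrix. The sketch below recomputes it for one cell of the table, using the EPUNN training counts for mid-May; the function names are hypothetical and chosen only for this illustration.

```python
def ccr(confusion):
    """Overall CCR (%): confusion[i][j] counts pixels of true class i predicted as j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * correct / total

def per_class_ccr(confusion):
    """CCR (%) computed separately for each ground-truth class (row)."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(confusion)]

# EPUNN, training set, mid-May: rows are ground truth Y=0 and Y=1.
epunn_may_train = [[384, 176],
                   [133, 427]]
print(round(ccr(epunn_may_train), 1))  # 72.4
```

The per-class values returned by `per_class_ccr` correspond to the Y=0 and Y=1 rows of the CCR column, up to the rounding used in the table.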

Table 2. Models obtained in June for determining the R. segetum presence probability (P) used to build the weed patch maps

| Methodology | #coef. | Model |
|---|---|---|
| EPUNN | 8 | P = 1/(1+exp(-(-0.424 + 75.419(V^4.633) + 0.322(R^-1.888) + 14.990(A^3.496 V^-3.415)))) |
| LR | 4 | P = 1/(1+exp(-(-0.694 + 8.282(A) - 63.342(V) - 11.402(R)))) |
| MLRPU | 7 | P = 1/(1+exp(-(-17.227 + 143.012(V^4.633) + 0.636(R^-1.888) + 23.021(A^3.496 V^-3.415)))) |
| MLRLPU | 9 | P = 1/(1+exp(-(18.027 + 130.674(A) - 133.662(V) - 29.346(R) + 353.147(V^4.633) - 3.396(B^3.496 G^-3.415)))) |
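As an illustration of how a model from Table 2 turns band values into a weed map, the sketch below evaluates the LR model at a single pixel and applies a probability threshold. It is a hedged example only: the function names are hypothetical, the 0.5 threshold is an arbitrary placeholder for the value an agronomical expert would choose, and the scaling of the inputs A, V and R is not stated in the text, so any input values used here are assumptions.

```python
import math

def lr_june_probability(a, v, r):
    """P of R. segetum presence from the LR model of Table 2.

    a, v, r are the band values named A, V and R in Table 2; their scaling
    is an assumption of this sketch (values in [0, 1] keep exp() in range).
    """
    eta = -0.694 + 8.282 * a - 63.342 * v - 11.402 * r
    return 1.0 / (1.0 + math.exp(-eta))

def spray_decision(probability, threshold=0.5):
    """Treat the pixel when the predicted probability reaches the threshold."""
    return probability >= threshold

# With all three inputs at zero only the intercept contributes to the predictor.
p0 = lr_june_probability(0.0, 0.0, 0.0)
print(spray_decision(p0))  # False
```

Applying `lr_june_probability` to every pixel of the image, rather than only to ground-truth pixels, is what produces a presence probability map such as the one in Figure 2.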



FUTURE TRENDS

The concepts presented in this chapter make it possible to develop new models of multi-class generalized linear regression by considering different types of basis functions (Sigmoidal Units, Radial Basis Functions and Product Units) for the non-linear part of the proposed model. Moreover, future research could include ordinal logistic regression models or probit models with different basis functions.

Figure 1. EPUNN R.segetum presence probability map

Figure 2. LR R.segetum presence probability map

Figure 3. MLRPU R.segetum presence probability map

Figure 4. MLRLPU R.segetum presence probability map



CONCLUSION

To the best of our knowledge, the approach presented in this chapter is a study in multi-class neural learning that combines three tools used in machine learning research: logistic regression, the product-unit neural network model and the evolutionary neural network paradigm. Logistic regression is a well-tested statistical approach that performs well in two-class classification and generalizes naturally to the multi-class case. Product-unit neural network models are an alternative to standard sigmoidal neural networks with the ability to capture non-linear interactions between the input variables. Finally, evolutionary artificial neural networks offer a platform for optimizing network performance and architecture simultaneously. The combination of these three elements, carried out in several steps in our proposal, provides a competitive methodology for solving classification problems.

REFERENCES

Athanaselis, T., Bakamidis, S., Dologlou, I., Cowie, R., Douglas-Cowie, E., & Cox, C. (2005). ASR for emotional speech: Clarifying the issues and enhancing performance. Neural Networks, 18(4), 437-444.

Bishop, M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Chiang, J.H. (1998). A Hybrid Neural Model in Handwritten Word Recognition. Neural Networks, 11(2), 337-346.

Durbin, R., & Rumelhart, D. (1989). Product Units: A computationally powerful and biologically plausible extension to back propagation networks. Neural Computation, 1, 133-142.

Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1-141.

Hastie, T., Tibshirani, R.J., & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.

Hervás-Martínez, C., Martínez-Estudillo, F.J., & Gutiérrez, P.A. (2006). Classification by means of evolutionary product-unit neural networks. IEEE International Joint Conference on Neural Networks (IJCNN06, Vancouver), 2834-2841.

Ismail, A., & Engelbrecht, A.P. (2000). Global Optimization Algorithms for Training Product Unit Neural Networks. International Joint Conference on Neural Networks, 1, 132-137.

Janson, D.J., & Frenzel, J.F. (1993). Training product unit neural networks with genetic algorithms. IEEE Expert: Intelligent Systems and Their Applications, 8(5), 26-33.

Kooperberg, C., Bose, S., & Stone, C.J. (2006). Polychotomous regression. Journal of the American Statistical Association, 92, 117-127.

Leerink, L., Giles, C., Horne, B., & Jabri, M. (1995). Learning with product units. Advances in Neural Information Processing Systems, 7, 537-544.

Mahony, S., Benos, P.V., Smith, T.J., & Golden, A. (2006). Self-organizing neural networks to support the discovery of DNA-binding motifs. Neural Networks, 19(6-7), 950-962.

Martínez-Estudillo, A.C., Martínez-Estudillo, F.J., Hervás-Martínez, C., & García-Pedrajas, N. (2006). Evolutionary Product Unit based Neural Networks for Regression. Neural Networks, 19(1), 477-486.

Martínez-Estudillo, A.C., Hervás-Martínez, C., Martínez-Estudillo, F.J., & García-Pedrajas, N. (2006). Hybridization of evolutionary algorithms and local search by means of a clustering method. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 36(3), 534-546.

Medlin, C.R., Shaw, D.R., Gerard, P.D., & LaMastus, F.E. (2000). Using remote sensing to detect weed infestations in Glycine max. Weed Science, 48(3), 393-398.

Minka, T.P. (2003). A comparison of numerical optimizers for logistic regression. http://research.microsoft.com/~minka/papers/logreg/

Peña-Barragán, J.M., López-Granados, F., Jurado-Expósito, M., & García-Torres, L. (2007). Mapping Ridolfia segetum patches in sunflower crop using remote sensing. Weed Research, 47(2), 164-172.

Schmitt, M. (2001). On the Complexity of Computing and Learning with Multiplicative Neural Networks. Neural Computation, 14(24), 241-301.

SPSS (2005). Advanced Models. Copyright 13.0 SPSS Inc. Chicago, IL.

Thompson, J.F., Stafford, J.V., & Miller, P.C.H. (1991). Potential for automatic weed detection and selective herbicide application. Crop Protection, 10, 254-259.

KEY TERMS

Artificial Neural Networks: A network of many simple processors ("units" or "neurons") that imitates a biological neural network. The units are connected by unidirectional communication channels, which carry numeric data. Neural networks can be trained to find nonlinear relationships in data, and are used in applications such as robotics, speech recognition, signal processing or medical diagnosis.

Evolutionary Computation: Computation based on iterative progress, such as growth or development in a population. This population is selected in a guided random search using parallel processing to achieve the desired solution. Such processes are often inspired by biological mechanisms of evolution.

Evolutionary Programming: One of the four major evolutionary algorithm paradigms, with no fixed structure or representation, in contrast with some of the other evolutionary paradigms. Its main variation operator is mutation.

Iteratively Reweighted Least Squares (IRLS): Numerical algorithm for minimizing any specified objective function using a standard weighted least squares method such as Gaussian elimination. It is widely applied in Logistic Regression.

Logistic Regression: Statistical regression model for Bernoulli-distributed dependent variables. It is a generalized linear model that uses the logit as its link function. Logistic regression applies maximum likelihood estimation after transforming the dependent variable into a logit variable (the natural log of the odds of the dependent variable occurring or not).

Precision Farming: Use of new technologies, such as global positioning (GPS), sensors, satellites or aerial images, and information management tools (GIS) to assess and understand in-field variability in agriculture. The collected information may be used to evaluate optimum sowing density more precisely, to estimate fertilizer and other input needs, and to predict crop yields more accurately.

Product Unit Neural Networks: Alternative to standard sigmoidal neural networks, based on multiplicative nodes instead of additive ones. Concretely, the output of each hidden node is the product of all its inputs, each raised to a real exponent.

Remote Sensing: Small- or large-scale acquisition of information about an object or phenomenon, by the use of either recording or real-time sensing devices that are not in physical or intimate contact with the object (such as by way of aircraft, spacecraft, satellite, or ship).


Multi-Objective Evolutionary Algorithms

Sanjoy Das
Kansas State University, USA

Bijaya K. Panigrahi
Indian Institute of Technology, India

INTRODUCTION

Real world optimization problems are often too complex to be solved through analytical means. Evolutionary algorithms, a class of algorithms that borrow paradigms from nature, are particularly well suited to address such problems. These algorithms are stochastic optimization methods that have become immensely popular recently because they are derivative-free, are less prone to getting trapped in local minima (as they are population based), and have been shown to work well for many complex optimization problems. Although evolutionary algorithms have conventionally focussed on optimizing single objective functions, most practical problems in engineering are inherently multi-objective in nature. Multi-objective evolutionary optimization is a relatively new and rapidly expanding area of research in evolutionary computation that looks at ways to address these problems. In this chapter, we provide an overview of some of the most significant issues in multi-objective optimization (Deb, 2001).

BACKGROUND

Arguably, Genetic Algorithms (GAs) are one of the most common evolutionary optimization approaches. These algorithms maintain a population of candidate solutions, called chromosomes, in each generation. Each chromosome corresponds to a point in the algorithm's search space. GAs use three Darwinian operators to perform their search: selection, mutation, and crossover (Mitchell, 1998). Each generation is improved by systematically removing the poorer solutions while retaining the better ones, based on a fitness measure. This process is called selection. Binary tournament selection and roulette wheel selection are two popular methods of selection. In binary tournament selection, two solutions, called parents, are picked randomly from the population, with replacement, and their fitness values are compared so that the fitter of the two is selected. In roulette wheel selection, the probability of a solution being picked is made directly proportional to its fitness. Following selection, the crossover operator is applied. Usually, two parent solutions from the current generation are picked randomly to produce offspring that populate the next generation of solutions. The offspring are created from the parent solutions in such a manner that they bear characteristics from both. The offspring chromosomes are probabilistically subject to another operator called mutation, which is the addition of small random perturbations. Only a few solutions undergo mutation. Evolution Strategies (ES) form another class of evolutionary algorithms that is closely related to GAs and uses similar operators.

Particle Swarm Optimization (PSO) is a more recent approach (Clerc, 2005). It is modeled after the social behavior of organisms such as a flock of birds or a school of fish, and is thus only loosely classified as an evolutionary approach. Each solution within the population in PSO, called a particle, has a unique position in the search space. In each generation, the position of each particle is updated by adding the particle's own velocity to it. The velocity of a particle, a vector, is then incremented towards the best location encountered in the particle's own history (called the individual best), as well as the best position in the current iteration (called the global best).
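The two selection schemes described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, fitness is assumed to be non-negative with larger values being better, and ties in the tournament are resolved in favor of the first pick.

```python
import random

def binary_tournament(population, fitness, rng):
    """Pick two solutions at random (with replacement) and keep the fitter one."""
    first = rng.choice(population)
    second = rng.choice(population)
    return first if fitness(first) >= fitness(second) else second

def roulette_wheel(population, fitness, rng):
    """Pick one solution with probability proportional to its fitness."""
    weights = [fitness(solution) for solution in population]
    return rng.choices(population, weights=weights, k=1)[0]

rng = random.Random(42)
population = [0.5, 1.0, 2.0, 4.0]

def identity_fitness(solution):
    # Toy fitness for the example: the solution value itself.
    return solution

parents = [binary_tournament(population, identity_fitness, rng) for _ in range(4)]
```

Tournament selection only needs a way to compare two solutions, whereas roulette wheel selection needs numeric, non-negative fitness values, which is one reason tournament selection adapts more easily to rank-based multi-objective comparisons.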

EVOLUTIONARY ALGORITHMS FOR MULTI-OBJECTIVE OPTIMIZATION

Multi-Objective Optimization

When dealing with optimization problems with multiple objectives, the conventional theories of optimality cannot be applied. Instead, the concepts of dominance and Pareto-optimality are used. Without loss of generality, we will assume that the optimization problem involves the simultaneous minimization of several objectives only. If these objective functions are fi(.), i = 1,...,M, a solution x is said to dominate another solution y if and only if fi(x) ≤ fi(y) for all i, with at least one of the inequalities being strict. In other words, x dominates y if and only if x is as good as y for all objectives and better in at least one. This relationship is written x ≺ y. In the set of all feasible solutions, the subset whose members are not dominated by any other in the set is called the Pareto set. In other words, if S is the search space, the Pareto set P is given by P = {x ∈ S | ∀y ∈ S, y ≺ x is false}. The image of the Pareto set P in the M dimensional objective function space is called the Pareto front, F. Thus, F = {(f1(x), f2(x), ..., fM(x)) | x ∈ P}.

The goal of a multi-objective optimization algorithm is twofold. Firstly, its output, the set of non-dominated solutions in the population, must be as close to the true Pareto front as possible. This feature is called convergence. Secondly, in addition to good convergence, the multi-objective evolutionary algorithm should also yield solutions that sample the front at approximately regularly spaced intervals, a feature that is usually referred to as diversity. Outputs where the solutions are clustered in a few regions of the front, while other regions are either omitted or poorly sampled, are not desirable. Figure 1 illustrates the concepts of good convergence and diversity.

Figure 1. An illustration of convergence and diversity concepts in multi-objective optimization algorithms. The objective functions f1 and f2 are to be minimized. (Panels, each plotting f2 against f1 together with the true Pareto front: good convergence but poor diversity; poor convergence; good convergence and diversity.)

In order to handle multi-objective optimization tasks, an evolutionary algorithm must be equipped to discriminate between solutions using either convergence or diversity as the criterion for comparison. When using convergence, the majority of current evolutionary algorithms make use of one of two basic ranking schemes that were originally put forth by Goldberg (Goldberg, 1989). The first is a method that shall be referred to here as domination counting. Within a population of solutions, the rank of any solution is the number of other solutions in the population that dominate it. Clearly, the non-dominated solutions in the population are assigned counts of zero. The second approach will be called non-dominated sorting. Here, ranks are assigned to each solution in a population in such a manner that solutions that have the same rank do not dominate one another, and each solution is assigned a lower rank than any solution it dominates and a higher rank than any solution dominating it. As before, non-dominated solutions in the population are assigned a rank of zero. Both concepts are illustrated in figure 2. The numbers corresponding to each solution in figure 2 (left) are the domination counts. The figure shows specifically the solution with a count of 3 being dominated by three others (with ranks 0, 0 and 2). In figure 2 (right), the solutions with equal rank (the ranks being 0, 1 or 2) are grouped together. Solutions with a rank of 0 are non-dominated ones. They dominate those with ranks 1 or 2. When they are removed, those with ranks of 1 are no longer dominated. Removing rank 1 solutions makes the rank 2 solution non-dominated.

Multi-objective evolutionary algorithms must also be able to distinguish solutions that lie in sparser regions of the M dimensional space of objective functions from those in denser ones. The three main approaches used to do so are illustrated in figure 3. The first of these methods is to consider a bounding hypercube around each solution in the objective function space that does not enclose any other solution (Deb, Pratap, Agarwal & Meyarivan, 2002). Neighboring solutions will be located at some of the corners of this hypercube. This is shown in figure 3 (left), where two solutions, a and b, have been enclosed by hypercubes. The perimeters of the hypercubes are considered as measures of diversity. Solutions whose bounding hypercubes have a larger perimeter are considered to be located in sparser regions than those with smaller ones. The second approach (Knowles & Corne, 2000) is to superimpose an M-dimensional hypergrid on the objective function space, and to consider the number of solutions that fall in each of the hypergrid's cells as a measure of how dense the region around the cell is. In figure 3 (middle), since b occupies the same cell as another solution, whereas a does not, the latter is regarded as being placed in a sparser region. The last approach computes the distance from each solution to its kth nearest neighbor (Zitzler, Laumanns & Thiele, 2001). This situation is depicted in figure 3 (right), where the solutions a and b have been connected to their nearest neighbors (k=1). Solutions that lie at greater distances from their neighbors are considered to be in sparser regions of the objective function space.

Since elitism, the guaranteed survival of the fittest solutions in each generation, leads to faster convergence in single objective evolutionary algorithms, this feature has also been incorporated into most current multi-objective evolutionary algorithms. Elitism is ensured by means of an archive that stores the best solutions in each generation. Quite often, the archived solutions are reinserted into the main population. Archiving is implemented via schemes that we will refer to, collectively, as elite preservation.
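The dominance relation and the two ranking schemes discussed in this section can be made concrete with a short Python sketch. It is a minimal illustration under the minimization convention used in the text; the function names and the six sample points are assumptions made for the example, and the sorting routine is a plain peel-off loop, not the fast subroutine of NSGA-II.

```python
def dominates(fx, fy):
    """True if objective vector fx dominates fy (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(fx, fy))
    strictly_better = any(a < b for a, b in zip(fx, fy))
    return no_worse and strictly_better

def domination_counts(points):
    """Domination counting: number of other solutions that dominate each one."""
    return [sum(dominates(q, p) for q in points) for p in points]

def non_dominated_sort(points):
    """Non-dominated sorting: repeatedly remove the current non-dominated set."""
    ranks = [None] * len(points)
    remaining = set(range(len(points)))
    rank = 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks

# Hypothetical objective vectors (f1, f2) for six solutions.
points = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 2), (5, 5)]
print(domination_counts(points))   # [0, 0, 0, 1, 1, 5]
print(non_dominated_sort(points))  # [0, 0, 0, 1, 1, 2]
```

The last point shows how the two schemes differ: it is dominated by all five other solutions, so its domination count is 5, yet it becomes non-dominated after only two fronts have been removed, so its rank under non-dominated sorting is 2.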

Figure 2. Discriminating between solutions based on convergence. (Panels, each plotting f2 against f1: domination counting; non-dominated sorting.)

Figure 3. Discriminating between solutions based on diversity. (Panels, each plotting f2 against f1 with two solutions a and b marked: bounding hypercube; hypergrid; nearest neighbor.)

A Few Recent Evolutionary Algorithms

(i) NSGA-II (Deb, Pratap, Agarwal & Meyarivan, 2002): NSGA-II (Non-dominated Sorting Genetic Algorithm) maintains a population and a separate archive, each of size N. In each generation, the population is merged with the archive. This merged set is then subject to elite preservation. The new archive is filled by taking the N best-ranked solutions obtained from elite preservation. The same N individuals are also subject to tournament selection, crossover, and mutation to form the population for the next generation. Elite preservation is implemented in NSGA-II by using non-dominated sorting for convergence, and the hypercube method for diversity. It proposes an O(MN^2) subroutine for fast non-dominated sorting to assign ranks to the solutions in the merged population. Starting with the lowest ranked solutions, the archive is filled until it reaches its full capacity N. Diversity is invoked to further discriminate between mutually non-dominating solutions when not all solutions of an identical rank can be inserted into the archive. NSGA-II incorporates an algorithm of order O(MN log(N)) to do so. Because non-dominated sorting is the computationally rate limiting portion of NSGA-II, the overall complexity is O(MN^2) per generation.

(ii) SPEA-2 (Zitzler, Laumanns & Thiele, 2001): The SPEA-2 (Strength Pareto Evolutionary Algorithm) method is quite similar to NSGA-II. It also maintains a population and an archive of size N, merging both at the beginning of every generation and making use of elite preservation to identify the best solutions to undergo crossover and mutation for the next generation. In SPEA-2, individual fitness is computed in a two-step manner that is an improved version of the basic domination counting approach discussed earlier. First to be computed is the strength of each solution in the merged population, i.e., the number of solutions it dominates. Then, the raw fitness of each individual is computed as the sum of the strengths of all the solutions that dominate it. In (Zitzler, Laumanns & Thiele, 2001) it is argued that this method of computing fitness gives SPEA-2 some capability to preserve diversity. However, raw fitness in itself is inadequate to do so, and so a separate term that explicitly takes diversity into account is added to it. This second term, added to each solution's fitness, is inversely related to the distance from the solution to its kth nearest neighbor in objective function space. The overall algorithm complexity of SPEA-2 is O(MN^3) per generation.

(iii) MOPSO (Coello Coello, 2004): MOPSO (Multi-objective Particle Swarm Optimization) is an approach for fast multi-objective optimization that is based on PSO. It imposes population diversity by means of the M-dimensional hypergrid described earlier, counting the number of solutions present in each of the hypergrid's cells. Solutions occupying cells with lower counts are preferred to those with higher counts. An even distribution of solutions along the Pareto front is achieved by biasing particles to update their velocities towards global best particles that are located in sparser cells of the hypergrid, i.e. cells with lower counts. This is done using a roulette wheel selection algorithm that picks a cell probabilistically using cell counts, such that the higher the cell count, the lower the probability of selection becomes. MOPSO also implements a mutation operator.

(iv) ParEGO (Knowles, 2006): ParEGO (Parallel Efficient Global Optimization) has been explicitly designed for problems where evaluating the objective function is highly expensive in terms of computer time. ParEGO therefore aims to converge in as few function evaluations as possible. The algorithm uses a Gaussian process model, learned adaptively using supervised learning, to approximate the fitness landscape. For further details the reader is referred to (Knowles, 2006).

(v) FSGA (Koduru, Das, Welch & Roe, 2004): FSGA (Fuzzy Simplex Genetic Algorithm) has a complexity of O(MN^2) per generation, similar to NSGA-II. It differs from NSGA-II and SPEA-2 in the method used for elite preservation. A measure called fuzzy dominance is used for the purpose. A solution that is not dominated by any other is assigned a fuzzy dominance of zero. The poorer a solution is, the higher the fuzzy dominance value it is assigned. FSGA's fuzzy dominance is a numerical method that not only uses Pareto-optimality, but also considers the degree to which one solution dominates another, making effective use of differences between their values of the objective functions. It has been designed specifically so that FSGA can be hybridized readily with a local search algorithm.

PAES (Pareto Archived Evolution Strategy) and PESA (Pareto Envelope based Selection Algorithm) are other successful multi-objective evolutionary algorithms that make use of the hypergrid to measure diversity (Knowles & Corne, 2000; Corne, Jerram, Knowles & Oates, 2001). Another algorithm, RDGA (Rank Density based Genetic Algorithm), uses this method along with a ranking scheme wherein a non-dominated individual has unit rank and others are assigned one plus the sum of the ranks of all solutions that dominate them (Lu & Yen, 2003). Very recently, fuzzy dominance has been successfully applied to another multi-objective PSO algorithm (Koduru, Das & Welch, 2007).
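To make the two-step fitness computation of SPEA-2 concrete, the sketch below computes strengths, raw fitness and a kth-nearest-neighbor density term for a small set of objective vectors. It is a hedged illustration, not the published algorithm: the function names and sample points are hypothetical, and the density form 1/(distance + 2) is an assumption of this sketch, chosen only because it is inversely related to the kth-neighbor distance as described above.

```python
import math

def dominates(fx, fy):
    """True if objective vector fx dominates fy (all objectives minimized)."""
    return (all(a <= b for a, b in zip(fx, fy))
            and any(a < b for a, b in zip(fx, fy)))

def strengths(points):
    """Strength of a solution: how many solutions it dominates."""
    return [sum(dominates(p, q) for q in points) for p in points]

def raw_fitness(points):
    """Raw fitness: sum of the strengths of all solutions dominating this one."""
    s = strengths(points)
    return [sum(s[j] for j, q in enumerate(points) if dominates(q, p))
            for p in points]

def kth_neighbor_distance(points, i, k=1):
    """Euclidean distance in objective space from solution i to its kth neighbor."""
    distances = sorted(math.dist(points[i], q)
                       for j, q in enumerate(points) if j != i)
    return distances[k - 1]

def fitness(points, k=1):
    """Raw fitness plus a density term that shrinks as the kth neighbor gets farther."""
    raw = raw_fitness(points)
    return [raw[i] + 1.0 / (kth_neighbor_distance(points, i, k) + 2.0)
            for i in range(len(points))]

# Hypothetical objective vectors (f1, f2) for six solutions.
points = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 2), (5, 5)]
print(strengths(points))    # [1, 2, 2, 1, 1, 0]
print(raw_fitness(points))  # [0, 0, 0, 2, 2, 7]
```

Lower values are better in this sketch. Non-dominated solutions have a raw fitness of zero, so among them the density term alone decides the ordering, which is how the second term helps preserve diversity.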

FUTURE TRENDS

Multi-objective evolutionary optimization is a rapidly expanding, new field of research. Although several interesting approaches have been proposed in the recent literature, further investigation is necessary before multi-objective algorithms can truly address the needs of the application domains. One current research focus is on devising numerical metrics to compare solutions. This is particularly useful when the problem contains a large number of objectives. In a higher dimensional objective function space, it is less likely that one solution dominates another, i.e. is better than or equal to it in all objectives. Under these circumstances, comparing solutions that are already within the Pareto front is essential. One such method has been suggested recently (Farina & Amato, 2004). This method counts the number of objectives along which one solution is better than, and worse than, another, and proposes fuzzy metrics based on the counts. However, such ideas have yet to be incorporated within evolutionary algorithms. A related direction of research is devising schemes to compare solutions in the presence of uncertainty in the objective functions. This research has obvious practical implications in engineering and other applications where measuring objectives such as cost, efficiency or expected lifetime is difficult (Fieldsend, Everson & Singh, 2005).
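The counting step behind such comparison metrics can be sketched as follows. This is only a minimal illustration of counting better, equal and worse objectives under minimization; the function name is hypothetical, and the fuzzy metrics built on these counts by Farina and Amato are not reproduced here.

```python
def compare_counts(fx, fy):
    """Count objectives where fx is better than, equal to, or worse than fy."""
    better = sum(a < b for a, b in zip(fx, fy))
    worse = sum(a > b for a, b in zip(fx, fy))
    equal = len(fx) - better - worse
    return better, equal, worse

# Two mutually non-dominating solutions with four objectives each.
print(compare_counts((1, 2, 3, 4), (2, 2, 1, 5)))  # (2, 1, 1)
```

In the example neither solution dominates the other, yet the first is better in two objectives and worse in only one, which gives a finer comparison than the dominance relation alone.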

Another direction of certain future interest is multi-objective optimization using novel biological paradigms. Only a few multi-objective PSO algorithms, such as MOPSO and the fuzzy dominance based PSO method (Koduru, Das & Welch, 2007), have been proposed; consequently, there is great interest within the evolutionary computation community in devising better PSO search strategies. Another emerging class of algorithms, called Artificial Immune Systems (AIS), is based on computations involved in the vertebrate immune system. Although a few multi-objective AIS algorithms have been proposed recently (cf. Coello Coello & Cortés, 2005), there is substantial scope for improvement in this direction. A further trend is the design of more difficult benchmark test problems. Huband et al. have proposed recent benchmarks (Huband, Hingston, Barone & While, 2006), and the performance of evolutionary methods on these functions needs to be investigated.

CONCLUSION

We have provided an overview of the new and expanding field of multi-objective optimization, outlining some of the most significant approaches. We chose to describe NSGA-II and SPEA-2 as they are the most popular algorithms today. We also discussed the recent algorithm ParEGO, which is very promising for some specialized applications, as well as the even more recent FSGA, currently under development, which fills the need for hybrid multi-objective algorithms. We also outlined MOPSO, which is based on a new evolutionary paradigm, PSO. Lastly, we addressed future trends in evolutionary multi-objective optimization to complete the discussion.

REFERENCES
Clerc, M. (2005). Particle Swarm Optimization. ISTE Press, UK.
Coello Coello, C.A. (2004). Handling multiple objectives with particle swarm optimization. IEEE Transactions on Evolutionary Computation, 8(3): 256-279.
Coello Coello, C.A., & Cortés, N.C. (2005). Solving multiobjective optimization problems using an artificial immune system. Genetic Programming and Evolvable Machines, 6(2): 163-190.
Corne, D.W., Jerram, N.R., Knowles, J.D., & Oates, M.J. (2001). PESA-II: Region based selection in evolutionary multiobjective optimization. In Spector et al. (editors), Proceedings of the Genetic and Evolutionary Computation Conference: 283-290.
Deb, K. (2001). Multi-Objective Optimization Using Evolutionary Algorithms. Wiley: Chichester, U.K.
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2): 182-197.
Farina, M., & Amato, P. (2004). A fuzzy definition of “optimality” for many-criteria optimization problems. IEEE Transactions on Systems, Man, and Cybernetics Part A-Systems and Humans, 34(3): 315-326.
Fieldsend, J.E., Everson, R.M., & Singh, S. (2005). Multi-objective optimization in the presence of uncertainty. IEEE Congress on Evolutionary Computation, 1: 243-250.
Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA.
Huband, S., Hingston, P., Barone, L., & While, L. (2006). A review of multiobjective test problems and a scalable test problem toolkit. IEEE Transactions on Evolutionary Computation, 10(5): 477-506.
Knowles, J.D., & Corne, D.W. (2000). Approximating the nondominated front using the Pareto archived evolution strategy. Evolutionary Computation, 8: 149-172.
Knowles, J. (2005). ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 10(1): 50-66.
Koduru, P., Das, S., Welch, S.M., & Roe, J. (2004). Fuzzy dominance based multi-objective GA-Simplex hybrid algorithms applied to gene network models. Proceedings of the Genetic and Evolutionary Computing Conference, Seattle, Washington, Kalyanmoy Deb et al. (editors), Springer-Verlag, LNCS 3102: 356-367.
Koduru, P., Das, S., & Welch, S.M. (2007). Multiobjective and hybrid PSO using ε-fuzzy dominance. Proceedings of the ACM Genetic and Evolutionary Computing Conference, London, UK (Eds. Dirk Thierens et al.): 853-860.
Lu, H., & Yen, G.G. (2003). Rank-density-based multiobjective genetic algorithm and benchmark test function study. IEEE Transactions on Evolutionary Computation, 7(4): 325-343.
Mitchell, M. (1998). An Introduction to Genetic Algorithms. MIT Press.
Zitzler, E., Laumanns, M., & Thiele, L. (2002). SPEA-2: Improving the strength Pareto approach. Proceedings of EUROGEN 2001, Evolutionary Methods for Design, Optimization, and Control with Applications to Industrial Problems, K. Giannakoglou, D. Tsahalis, J. Periaux, P. Papailou, and T. Fogarty (editors), Athens, Greece: 95-100.

KEY TERMS
Elitism: A strategy in evolutionary algorithms where the best one or more solutions in each generation, called the elites, are inserted into the next without undergoing any change. This strategy usually speeds up the convergence of the algorithm. In a multi-objective framework, any non-dominated solution can be considered to be an elite.
Evolutionary Algorithm: A class of probabilistic algorithms that are based upon biological metaphors such as Darwinian evolution, and widely used in optimization.
Fitness: A measure that is used to determine the goodness of a solution for an optimization problem.
Fitness Landscape: A representation of the search space of an optimization problem that brings out the differences in the fitness of the solutions, such that those with good fitness are “higher”. Optimal solutions are the maxima of the fitness landscape.
Generation: A term used in evolutionary algorithms that roughly corresponds to each iteration of the outermost loop. The offspring obtained in one generation become the parents of the next.
Multi-Objective Optimization: An optimization problem involving more than a single objective function. In such a setting, it is not easy to discriminate between good and bad solutions, as a solution that is better than another in one objective may be poorer in another. Without any loss of generality, any optimization problem can be cast as one involving minimizations only.
Objective Function: The function that is to be optimized. In a minimization problem, the fitness varies inversely as the objective function.
Population Based Algorithm: An algorithm that maintains an entire set of candidate solutions, each solution corresponding to a unique point in the search space of the problem.
Search Space: Set of all possible solutions for any given optimization problem. Almost always, a neighborhood around each solution can also be defined in the search space.


Multi-Objective Training of Neural Networks
M. P. Cuéllar, Universidad de Granada, Spain
M. Delgado, Universidad de Granada, Spain
M. C. Pegalajar, Universidad de Granada, Spain

INTRODUCTION
Traditionally, applying a neural network (Haykin, 1999) to a problem has required several steps before the desired network is obtained: data preprocessing, model selection, topology optimization and, finally, training. Each of these tasks usually demands a large amount of computational time and human interaction, particularly topology optimization and network training. There have been many proposals to reduce the effort these tasks require and to provide experts with a robust methodology. For example, Giles et al. (1995) provide a constructive method to iteratively optimize the topology of a recurrent network. Other methods attempt to reduce the complexity of the network structure by removing unnecessary network nodes and connections, as in (Morse, 1994). In recent years, evolutionary algorithms have proved to be promising tools for this problem, and many competitive approaches exist in the literature. For example, Blanco et al. (2001) proposed a master-slave genetic algorithm to train the network (master algorithm) and to optimize its size (slave algorithm). For a general view of the problem and the use of evolutionary algorithms for neural network training and optimization, we refer the reader to (Yao, 1999). Although the literature on genetic algorithms and neural networks is very extensive, we would like to remark on the recent popularity of multi-objective optimization (Coello et al., 2002; Jin, 2006), especially for the problem of simultaneous training and topology optimization of neural networks. These methods have been shown to perform suitably for this task in previous works, although most of them are proposed for feedforward models. They attempt to optimize the structure of the network (number of connections, hidden units or layers) while training the network at the same time. Multi-objective algorithms may provide important advantages in the simultaneous training and optimization of neural networks: they may force the search to return a set of optimal networks instead of a single one; they are capable of speeding up the optimization process; they may be preferred to a weight-aggregation procedure to address the regularization problem in neural networks; and they are more suitable when the designer would like to combine different error measures for the training. A recent review of these techniques may be found in (Jin, 2006).

BACKGROUND
Multi-objective algorithms have become popular in recent years for the simultaneous training and topology optimization of neural networks because of the innovations they bring to the problem. Certain authors have addressed it through the evolution of single ensembles, for example with DIVACE-II (Chandra et al., 2006), which also implements different levels of coevolution. In other works, the networks are fully evolved and the evolutionary operators are designed to deal with both training and structure optimization. Some authors have addressed structure optimization by reducing either the number of network neurons or the number of network connections. In the former methods (Abbass et al., 2001; Delgado et al., 2005; González et al., 2003), the optimization is easier since the codification of a network contains fewer degrees of freedom than in the latter; however, they have the disadvantage that the networks obtained are fully connected. On the other hand, the latter methods (Jin et al., 2004; Cuéllar et al., 2007) attempt to reduce the number of connections, but they do not ensure that the number of network nodes is also minimal. Nevertheless, experimental results have shown that the networks obtained with these proposals are small (Jin et al., 2004).

The hybridization of multi-objective evolutionary algorithms with traditional gradient-based training algorithms has also provided promising results. While the evolutionary algorithm explores the solution space widely, the gradient-based algorithms are capable of directing the search towards promising areas during the evolution and of exploiting the solutions suitably. This hybridization is usually carried out by including the gradient-based training method as a local search operator in the evolutionary process. The local search operator is then applied after the mutation and before the evaluation of the solutions. Some examples are the system MPANN developed by H.A. Abbass (2001), and the works by Y. Jin et al. (2006).

In the next section, we study different aspects of the multi-objective optimization of neural networks. Concretely, we study the objectives to be achieved and the multi-objective algorithms used. We focus our analysis on recurrent neural networks (Haykin, 1999; Mandic and Chambers, 2001), since these models have a high complexity due to the recurrence. The experiments are illustrated on time-series prediction problems, since this type of problem has multiple applications in many research and enterprise areas and the neural models used are suitable for this application, as suggested by previous works (Aussem, 1999).

MULTI-OBJECTIVE EVOLUTIONARY ALGORITHMS FOR NEURAL NETWORKS TRAINING AND OPTIMIZATION
The most recent multi-objective evolutionary algorithms are based on the concept of Pareto dominance as a criterion to determine whether a solution is optimal or not. Let F(s)=(f1(s), f2(s),..., fk(s)) be a set of k objectives to be achieved, and let s1 and s2 be two solutions. In a minimization problem, it is said that s2 is dominated by s1 if, and only if:

\[ \left( f_i(s_1) \le f_i(s_2),\ \forall i : 1 \le i \le k \right) \wedge \left( \exists j : f_j(s_1) < f_j(s_2);\ 1 \le j \le k \right) \qquad (1) \]

The solutions that are not dominated by any other solution are called the non-dominated set or Pareto frontier. The goal of any multi-objective algorithm is to find the solutions in the Pareto frontier. Thus, the selection of the objectives in a multi-objective algorithm is a key aspect, since they are used to guide the search across the search space towards the optimal solutions. However, the higher the number of objectives, the higher the complexity of the search space. In this work, we attempt to train and optimize the size of an Elman network (Mandic and Chambers, 2001) for time series prediction problems. This network type has an input layer, an output layer and a hidden layer. The data of the time series is provided over time to the network inputs, and the goal is for the network output to provide the future values of the time series. The recurrent connections are in the hidden layer, so that the output of a hidden neuron at time t is also an input for all the hidden neurons at time t+1. The reader may find more information about dynamical recurrent neural networks applied to time series prediction in (Aussem, 1999; Mandic and Chambers, 2001).

\[ f_1(s^*) = \min\{ f_1(s) \} = \min\left\{ \frac{1}{T} \sum_{t=1}^{T} \left( Y(t) - O(t) \right)^2 \right\} \qquad (2) \]

\[ f_2(s^*) = \min\{ f_2(s) \} = \min\{ h(s) \} \qquad (3) \]

\[ f_3(s^*) = \min\{ f_3(s) \} = \min\{ n(s) \} \qquad (4) \]
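As a minimal sketch of the dominance test and of the extraction of the non-dominated set (Pareto frontier), assuming all objectives are minimized: s1 dominates s2 when it is no worse in every objective and strictly better in at least one. The function names and the sample objective vectors (error, hidden neurons, connections) are illustrative assumptions, not results from this work.

```python
from typing import List, Sequence, Tuple


def dominates(f_s1: Sequence[float], f_s2: Sequence[float]) -> bool:
    """True if s1 dominates s2: f_i(s1) <= f_i(s2) for every objective i
    and f_j(s1) < f_j(s2) for at least one objective j (minimization)."""
    no_worse = all(a <= b for a, b in zip(f_s1, f_s2))
    strictly_better = any(a < b for a, b in zip(f_s1, f_s2))
    return no_worse and strictly_better


def pareto_frontier(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep the objective vectors that no other vector dominates.
    Quadratic in the number of points, which is fine for small populations."""
    return [
        p
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical (error, hidden neurons, connections) vectors for four networks.
candidates = [
    (0.10, 3, 6),
    (0.05, 5, 17),
    (0.07, 6, 20),  # dominated by (0.05, 5, 17): worse in all three objectives
    (0.02, 9, 40),
]
frontier = pareto_frontier(candidates)
```

Algorithms such as NSGA2 and SPEA2 use more elaborate ranking and density estimation on top of this relation; the sketch only shows the relation itself.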

For the problem of neural network optimization and training, we consider three objectives (see equations (2)-(4)). The objective f1(s) minimizes the network error, f2(s) the number of hidden neurons, and f3(s) the number of network connections. In equation (2), T is the number of training patterns, Y(t) is the desired output for pattern t and O(t) is the network output. In equation (3), h(s) is the number of hidden neurons of the network s, and in equation (4), n(s) is its number of connections.

Another issue related to the objectives is the network codification. For example, in works such as (Abbass, 2001), the objectives are (f1(s), f2(s)), which yields fully connected networks. In this case, the representation codifies the neurons in a binary vector and the network weights in a real-valued matrix. In other works such as (Jin et al., 2006), the network connections are codified in a binary matrix and the network weights in a real-valued matrix, since the objectives to be optimized are (f1(s), f3(s)). If all the objectives are to be optimized, the representation should contain the network structure (hidden neurons and connections) and the weights. In our proposal, the number of network neurons is codified with an integer value following the guidelines in (Delgado et al., 2005; Cuéllar et al., 2007), the connections are codified in a binary vector, and the network weights in a real-valued vector. Figure 1 shows an example of the codification of an Elman network into a solution, where Vij are the network weights from input j to hidden neuron i, Uir is the recurrent weight from neuron r to neuron i and Woi are the weights from hidden neuron i to the output neuron o. A network connection is active if the corresponding gene is set to 1; otherwise, the connection is not active.

Figure 1. Example for the codification of an Elman network with 1 input, 1 output, 2 hidden neurons and 5 connections

The evolutionary operators, such as crossover and mutation, should consider two different areas in a solution: structural recombination/mutation and genetic recombination/mutation. The genetic one is associated with the area of the network weights, while the structural one concerns the network topology. Additionally, a local search operator could be included to improve the network performance locally in the area that codifies the network weights, as suggested in the previous section. In our experiments, we have used a simple recombination that generates two children from two parents, with no structural recombination: our tests showed that structural recombination could provide a high exploitation of the solution space, and the selective pressure produced by the objectives could then lead to premature convergence. For the mutation, on the other hand, we have included three probabilities, since structural mutation may have a high impact on the population of solutions: structural mutation is selected with probability p1, and genetic mutation with probability 1-p1. In the structural mutation, the number of hidden neurons is altered with probability p2; otherwise, the active/inactive connections are mutated. Finally, a gene is altered with probability p3. Figure 2 shows an example of the crossover, and Figure 3 shows an example of the structural mutation for the number of hidden neurons. In this last case, the solution grows to have three hidden neurons, and new genes are generated. The values of these genes are random numbers within the bounds of the gene. In Figure 3, these genes are marked with the symbol ?.

Figure 2. Example of the crossover

Figure 3. Example of mutation

In the experiments, we study the benefits of including a higher number of objectives in the multi-objective algorithm, and the effects of the evolutionary algorithm. To illustrate our results, we have selected two socio-economic time series for forecasting: the evolution of the U.S. population from 1950 to 2004, taken monthly (USPop), and the evolution of the euro/U.S. dollar variation from 1995 to 2004, taken monthly (EurDol). 80% of the data is used for training and the remaining 20% for testing. Both time series may be downloaded for free from http://www.economagic.com. In our experiments, the number of hidden neurons is bounded between 3 and 12. The networks have one input for the value of the time series at time t, and one output for the value to be predicted at time t+1. We carried out 30 experiments with the multi-objective algorithms, which are based on NSGA2 (Deb et al., 2002) and SPEA2 (Zitzler et al., 2001). We label as NSGA2 and SPEA2 the algorithms that optimize the objectives (f1(s), f2(s)), and as NSGA2.connect and SPEA2.connect the algorithms that optimize (f1(s), f2(s), f3(s)). The stopping criterion is 10000 evaluated solutions, and the population size is 50. The mutation parameters are (p1, p2, p3)=(0.5, 0.5, 0.1) and the range for the genes containing network weights is [-5.0, 5.0]. We have used binary tournament selection, Wright’s heuristic crossover and displacement mutation as evolutionary operators. Figure 4 shows the distribution of the performance of the neural networks obtained in the Pareto frontiers for the 30 experiments, for each data set. Additionally, Table 1 shows the best Pareto frontiers obtained, where column 1 gives the algorithm, columns 2 and 5 the number of hidden neurons, columns 3 and 6 the number of network connections, and columns 4 and 7 the Mean Square Error (MSE) in training.

Figure 4. Boxplots for the distribution of network performance in USPop (a), and EurDol (b)

Table 1. Best Pareto frontiers obtained in the data sets

                 USPop                               EurDol
Algorithm        Hidden units  Connections  MSE      Hidden units  Connections  MSE
SPEA2.connect    3             6            0.042    3             11           0.013
                 5             17           0.041    4             7            0.012
                 11            43           0.028    5             16           0.009
                 10            44           0.050
SPEA2            4             24           0.021    4             24           0.016
NSGA2.connect    3             10           0.092    3             4            0.013
                                                     5             17           0.012
NSGA2            7             63           0.007    4             24           0.003

We may observe that SPEA2.connect has obtained a wider Pareto frontier than NSGA2.connect in both problems. In some situations this may be desirable, since we are provided with a larger set of optimal networks from which we could select the most appropriate network to solve our problem. On the other hand, Figures 4(a) and 4(b) suggest that including the optimization of network connections in the multi-objective method may produce poorer results. In both problems, the best solutions are provided by the algorithm NSGA2, which returns fully connected networks. The same algorithm extended to also optimize the number of connections, NSGA2.connect, provides networks of minimum size, but their performance is lower. Among the algorithms based on SPEA2, SPEA2.connect is the least robust, since its MSE distribution is the widest. However, the boxplot shows that the best solutions of this method may be similar to those from NSGA2. This suggests that we may find smaller networks using the three objectives in SPEA2.connect, at the cost of sacrificing some network performance and spending more computational time to obtain a suitable solution. Moreover, the networks obtained by including objective f3(s) in the optimization process are very small compared with the fully connected networks from SPEA2 and NSGA2.
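The solution encoding and the three-probability mutation described in this section can be sketched as follows. This is a simplified illustration, not the authors' implementation: the class layout, the choice to grow or shrink by exactly one hidden neuron, the per-gene reading of p3, and the uniform resampling of weights (used here in place of the displacement mutation named in the text) are all assumptions. Only the bounds 3 to 12, the weight range [-5.0, 5.0] and the values of (p1, p2, p3) come from the text.

```python
import random
from dataclasses import dataclass
from typing import List

W_MIN, W_MAX = -5.0, 5.0  # range of the weight genes reported in the text
H_MIN, H_MAX = 3, 12      # bounds on the number of hidden neurons


@dataclass
class Gene:
    active: int    # 1 if the connection is active, 0 otherwise
    weight: float  # real-valued weight of the connection


def random_gene(rng: random.Random) -> Gene:
    return Gene(rng.randint(0, 1), rng.uniform(W_MIN, W_MAX))


@dataclass
class Solution:
    """Elman network with 1 input and 1 output (hypothetical layout)."""
    v: List[Gene]        # input -> hidden neuron i
    u: List[List[Gene]]  # hidden neuron r -> hidden neuron i (recurrent)
    w: List[Gene]        # hidden neuron i -> output

    @property
    def hidden(self) -> int:       # objective f2: number of hidden neurons
        return len(self.v)

    def genes(self) -> List[Gene]:
        return self.v + [g for row in self.u for g in row] + self.w

    def connections(self) -> int:  # objective f3: number of active connections
        return sum(g.active for g in self.genes())


def random_solution(rng: random.Random, hidden: int) -> Solution:
    return Solution(
        v=[random_gene(rng) for _ in range(hidden)],
        u=[[random_gene(rng) for _ in range(hidden)] for _ in range(hidden)],
        w=[random_gene(rng) for _ in range(hidden)],
    )


def mutate(s: Solution, rng: random.Random,
           p1: float = 0.5, p2: float = 0.5, p3: float = 0.1) -> None:
    """Mutate s in place: structural with probability p1, genetic otherwise."""
    if rng.random() < p1:
        if rng.random() < p2:
            # Alter the number of hidden neurons by one, staying within bounds.
            grow = s.hidden <= H_MIN or (s.hidden < H_MAX and rng.random() < 0.5)
            if grow:
                for row in s.u:
                    row.append(random_gene(rng))
                s.u.append([random_gene(rng) for _ in range(s.hidden + 1)])
                s.v.append(random_gene(rng))
                s.w.append(random_gene(rng))
            else:
                s.v.pop()
                s.w.pop()
                s.u.pop()
                for row in s.u:
                    row.pop()
        else:
            # Flip each connection gene with probability p3.
            for g in s.genes():
                if rng.random() < p3:
                    g.active = 1 - g.active
    else:
        # Genetic mutation: resample each weight gene with probability p3.
        for g in s.genes():
            if rng.random() < p3:
                g.weight = rng.uniform(W_MIN, W_MAX)


rng = random.Random(0)
solution = random_solution(rng, hidden=3)
initial_gene_count = len(solution.genes())
for _ in range(200):
    mutate(solution, rng)
```

With this layout, a network with h hidden neurons has h² + 2h possible connections, for example 63 for h = 7, which agrees with the fully connected NSGA2 entry for USPop in Table 1.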

FUTURE TRENDS
We have seen in the previous section that including a larger number of objectives in the network optimization can reduce the size of the network, although the network performance obtained is poorer. This is to be expected, since the more objectives there are to optimize, the more complex the search space is and therefore the harder it is to find optimal solutions. The hybridization of multi-objective evolutionary algorithms with nonlinear programming methods, which direct the search towards promising areas, has proved to work well in works that use a lower number of objectives. In the case studied here, the improvements from these procedures could be greater since the search space is larger. Another important issue is the study of the evolution in terms of diversity and convergence: the objectives used usually introduce a high selective pressure in the population, especially the objectives for topology optimization. This could be addressed by introducing components in the evolutionary process to control the balance between diversity and convergence, thereby improving the search process and the exploration/exploitation of the solution space. Another interesting line of work is the inclusion of objectives to improve other properties of neural networks, such as noise tolerance or generalization. For example, this has been suggested in (Graning et al., 2006), where an extra objective is introduced to improve the generalization of a feedforward network in binary classification.

CONCLUSION
In this work, we have studied the benefits and disadvantages of multi-objective training and full topology optimization of recurrent neural networks. We have tested the methods on time series prediction problems, and compared them with methods that do not optimize the number of connections. In general terms, all the algorithms have solved the problems suitably. The methods studied provide networks with a minimum number of hidden units and connections, and the network performance is good. However, these methods may produce poorer results than those that only optimize the number of hidden neurons and provide fully connected networks. Given more computational time, the algorithms that optimize the topology in terms of both hidden neurons and connections may be competitive, providing networks with performance similar to that of techniques that do not optimize the number of network connections. Moreover, these methods have the advantage that the resulting networks are very small compared with fully connected networks.

REFERENCES
Abbass, H.A.; & Sarker, R. (2001). Simultaneous Evolution of Architectures and Connection Weights in ANNs. In Proc. of The Artificial neural networks and Expert systems Conference. 16-21.
Abbass, H.A. (2001). A Memetic Pareto Evolutionary Approach to Artificial Neural Networks. Lecture Notes in Artificial Intelligence. 2256, 1-12.
Aussem, A. (1999). Dynamical Recurrent Neural Networks towards Prediction and Modelling of Dynamical Systems. Neurocomputing. 28(15), 207-232.
Blanco, A.; Delgado, M.; & Pegalajar, M.C. (2001). A genetic algorithm to obtain the optimal recurrent neural network. International Journal of Approximate Reasoning. 23, 67-83.
Chandra, A.; & Yao, X. (2006). Evolving hybrid ensembles of learning machines for better generalization. Neurocomputing. 69, 686-700.
Coello, C.A.; Van Veldhuizen, D.A.; & Lamont, G.B. (2002). Evolutionary Algorithms for Solving Multiobjective Problems. New York: Kluwer Academic Publishers.
Cuéllar, M.P.; Delgado, M.; & Pegalajar, M.C. (2007). Topology optimization and training of Recurrent Neural Networks with pareto-based multi-objective algorithms: A experimental study. Lecture Notes in Computer Science. 4507, 359-366.
Deb, K.; Pratap, A.; Agarwal, S.; & Meyarivan, T. (2002). A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation. 6(2), 182-197.
Delgado, M.; & Pegalajar, M.C. (2005). A multiobjective genetic algorithm for obtaining the optimal size of a recurrent neural network for grammatical inference. Pattern Recognition. 38(9), 1444-1456.
Giles, C.; Chen, D.; Sun, G.; Chen, H.; Lee, H.; & Goudreau, M. (1995). Constructive learning of recurrent neural networks: problems with recurrent cascade correlation and a simple solution. IEEE Transactions on Neural Networks. 6(4), 489.
González, J.; Rojas, I.; Ortega, J.; Pomares, H.; Fernández, F.; & Díaz, A. (2003). Multiobjective evolutionary optimization of the size, shape, and position parameters of radial basis function networks for function approximation. IEEE Transactions on Neural Networks. 14(6), 1478-1495.
Graning, L.; Jin, Y.; & Sendhoff, B. (2006). Generalization improvement in multi-objective learning. In Proc. of International Joint Conference on Neural Networks. 9893-9900.
Haykin, S. (1999). Neural Networks: A Comprehensive Foundation. Prentice Hall.
Jin, Y. (2006). Multi-objective machine learning. Springer, New York.
Jin, Y.; Okabe, T.; & Sendhoff, B. (2004). Neural network regularization and ensembling using multi-objective evolutionary algorithms. In Proc. of the 2004 congress on evolutionary computation. 1-8.
Jin, Y.; Sendhoff, B.; & Körner, E. (2006). Evolutionary multi-objective optimization for simultaneous generation of signal-type and symbol-type representations. Lecture Notes in Computer Science. 3410, 752-766.
Mandic, D.P.; & Chambers, J. (2001). Recurrent Neural Networks for Prediction. John Wiley and Sons.
Morse, J. (1994). Reducing the size of the non-dominated set, pruning by clustering. Computational and Operational Research 7(1:2), 55-66.
Yao, X. (1999). Evolving Artificial Neural Networks. Proc. of the IEEE. 87(9), 1423-1447.
Zitzler, E.; Laumanns, M.; & Thiele, L. (2001). SPEA2: Improving the Strength Pareto Evolutionary Algorithm. Technical report 103, Computer Engineering and Networks Laboratory (TIK), Swiss Federal Institute of Technology (ETH), Zurich, Switzerland.

KEY TERMS
Dynamical Recurrent Neural Networks: Artificial neural networks that include recurrent connections in the network structure. They are capable of processing patterns of undetermined size and/or indexed in time. The output of these networks at time t+1 is computed using the network inputs at time t and the network state, provided by the recurrent connections.
Ensembles: Self-contained area of a neural network (a neuron, a connection, a neuron together with its connections...) that, combined with other ensembles, is able to build a neural network that solves a problem.
Evolutionary Algorithm: Optimization algorithm based on Darwinian natural evolution.
Multi-Objective Optimization: Optimization of a problem that involves the satisfaction or optimization of two or more objectives, sometimes opposed to each other.
Regularization: Optimization of both complexity and performance of a neural network following a linear aggregation or a multi-objective algorithm.
Time-Series: Data sequence indexed in time.
Time-Series Prediction: Problem that involves the prediction of the future values of a time series, considering a few values from the data set in the past.


“Narrative” Information and the NKRL Solution
Gian Piero Zarri, LaLIC, University Paris 4-Sorbonne, France

INTRODUCTION
In a companion article of this Encyclopaedia, ‘Narrative’ Information, the Problem, we have introduced the problem of finding a complete and computationally efficient system for representing and managing ‘nonfictional narrative information’. We have stressed there the important economic value of this multimedia type of information, which concerns, e.g., corporate memory documents, news stories, normative and legal texts, medical records, intelligence messages, surveillance videos or visitor logs, actuality photos, eLearning and Cultural Heritage material, etc. We have also emphasised that the usual Computer Science tools, including those pertaining to the now very popular ‘Semantic Web’ domain, see (Bechhofer et al., 2004; Beckett, 2004), are not really suitable for dealing with this type of information.

BACKGROUND
In this article, we will present an Artificial Intelligence tool, NKRL (Narrative Knowledge Representation Language), that has been especially developed for dealing in an ‘intelligent’ way with nonfictional narrative information. NKRL is, at the same time:

• a knowledge representation system for describing in the best possible detail the essential content (the ‘meaning’) of complex nonfictional ‘narratives’;
• a system of reasoning (inference) procedures that, thanks to the richness of the representation system, is able to automatically establish ‘interesting’ relationships among the represented data;
• an implemented software environment that allows the user to encode the original narratives in terms of the representation language, to create ‘NKRL knowledge bases’ in a specific application domain, and to exploit these bases ‘intelligently’.

The main innovation introduced by NKRL with respect to the usual ontological paradigms is the addition, to the traditional ontology of concepts (called HClass, ‘hierarchy of classes’, in NKRL’s jargon), of an ontology of events, i.e., a new sort of hierarchical organization where the nodes correspond to n-ary structures called ‘templates’ (HTemp, ‘hierarchy of templates’). A partial image of the ‘upper level’ of HClass, which follows the standard Protégé approach, see (Noy et al., 2000), is given in Figure 1; for HTemp, see Table 1 and Figure 2 below.

Figure 1. A partial representation of the ‘upper level’ of HClass, the NKRL ‘traditional’ ontology of concepts.

A SHORT DESCRIPTION OF NKRL
Instead of using the traditional (binary) attribute/value organization, the templates are generated from the n-ary combination of quadruples connecting together the symbolic name of the template, a predicate, and the arguments of the predicate introduced by named relations, the roles. The quadruples have in common the name and predicate components. Denoting with Li the generic symbolic label identifying a given template, with Pj the predicate used in the template, with Rk the generic role and with ak the corresponding argument, the core data structure for templates has the following general format (see also the companion article, ‘Narrative’ Information, the Problem):

(Li (Pj (R1 a1) (R2 a2) … (Rn an)))     (1)
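To make format (1) concrete, the structure can be sketched as a small data type. This is an illustration only and not part of the NKRL software: the class and method names are hypothetical, arguments are reduced to plain strings, and the example keeps just the two role fillers named in the text for occurrence mod3.c5, omitting its structured SUBJ filler, location and dates. The predicate and role inventories are those listed in this article.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Predicate and role inventories as listed in the article.
PREDICATES = {"BEHAVE", "EXIST", "EXPERIENCE", "MOVE", "OWN", "PRODUCE", "RECEIVE"}
ROLES = ("SUBJ", "OBJ", "SOURCE", "BENF", "MODAL", "TOPIC", "CONTEXT")


@dataclass(frozen=True)
class PredicativeStructure:
    """Hypothetical container for the format (Li (Pj (R1 a1) ... (Rn an)))."""
    label: str             # Li, the symbolic name
    predicate: str         # Pj
    roles: Dict[str, str]  # Rk -> ak (arguments simplified to strings)

    def __post_init__(self) -> None:
        if self.predicate not in PREDICATES:
            raise ValueError(f"unknown predicate: {self.predicate}")
        unknown = set(self.roles) - set(ROLES)
        if unknown:
            raise ValueError(f"unknown roles: {sorted(unknown)}")

    def quadruples(self) -> List[Tuple[str, str, str, str]]:
        """One (label, predicate, role, argument) quadruple per role;
        all quadruples share the label and predicate components."""
        return [(self.label, self.predicate, r, self.roles[r])
                for r in ROLES if r in self.roles]

    def render(self) -> str:
        pairs = " ".join(f"({r} {self.roles[r]})" for r in ROLES if r in self.roles)
        return f"({self.label} ({self.predicate} {pairs}))"


occurrence = PredicativeStructure(
    label="mod3.c5",
    predicate="PRODUCE",
    roles={"SUBJ": "INDIVIDUAL_PERSON_20", "BENF": "ROBUSTINIANO_HABLO"},
)
text = occurrence.render()
```

The validation in `__post_init__` mirrors the closed sets of predicates and roles; checking role fillers against HClass constraints, as NKRL does when instantiating a template, is outside the scope of this sketch.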

Predicates pertain to the set {BEHAVE, EXIST, EXPERIENCE, MOVE, OWN, PRODUCE, RECEIVE}, and roles to the set {SUBJ(ect), OBJ(ect), SOURCE, BEN(e)F(iciary), MODAL(ity), TOPIC, CONTEXT}. An argument of the predicate can consist of a simple ‘concept’ or of a structured association (‘expansion’) of several concepts. Templates can be conceived as the formal representation of generic classes of elementary events like “move a physical object”, “be present in a place”, “produce a service”, “send/receive a message”, etc. When a particular event pertaining to one of these general classes must be represented, the corresponding template is instantiated to produce a predicative occurrence. To represent then a simple narrative like: “On November 20, 1999, in an unspecified village, an armed group of people has kidnapped Robustiniano Hablo”, we must then select firstly in the HTemp hierarchy the template corresponding to “execution of violent actions”, see Figure 2 and Table 1 below – this example refers to a recent application of NKRL in a ‘terrorism’ context in the framework of an European project see, e.g., (Zarri, 2005). As it appears from Table 1a, the arguments of the predicate (the ak terms in (1)) are represented by variables with associated constraints expressed as HClass concepts or combinations of concepts. When deriving a predicative occurrence (an instance of a template) like mod3.c5 in Table 1b, the role fillers in this occurrence must conform to the constraints of the father-template. For example, ROBUSTINIANO_HABLO (the ‘BEN(e)F(iciary)’ of the action of kidnapping) and INDIVIDUAL_PERSON_20 (the unknown ‘SUBJECT’, actor, initiator etc. of this action) are both ‘individuals’, instances of the HClass concept individual_person. The 1160

constituents – as SOURCE in Table 1a – included in square brackets are optional. A ‘conceptual label’ like mod3.c5 is the symbolic name used to identify the NKRL code corresponding to a specific predicative occurrence. The ‘attributive operator’, SPECIF(ication), is one of the four operators used in NKRL for the construction of ‘structured arguments’ (‘complex fillers’ or ‘expansions’); see, e.g., (Zarri, 2003). The SPECIF lists, with syntax (SPECIF ei p1 … pn), are used to represent the properties or attributes that can be asserted about the first element ei, concept or individual, of the list – e.g., in the SUBJ filler of mod3.c5, Table 1b, the attributes weapon_wearing and (SPECIF cardinality_ several_) are both associated with INDIVIDUAL_PERSON_20. The ‘location attributes’, represented in the predicative occurrences as lists, are linked with the arguments of the predicate by using the colon operator, ‘:’, see the individual VILLAGE_1 in Table 1b. In the occurrences, the two operators date-1, date-2 materialize the temporal interval normally associated with narrative events, see (Zarri, 1998) and, more generally, (Allen, 1981; Ferro et al., 2005). 150 templates are permanently inserted into HTemp; Figure 2 reproduces the ‘external’ organization of the PRODUCE branch of HTemp. This branch includes the Produce:Violence template used in Table 1. HTemp thus corresponds to a sort of ‘catalogue’ of narrative formal structures that are very easy to ‘customize’ in order to derive the new templates that could be needed for a particular application. What has been expounded so far illustrates the NKRL solutions to the problem of representing ‘elementary’ (simple) events. To deal now with those ‘connectivity phenomena’ that arise when several elementary events are connected through causality, goal, indirect speech etc.
links – see also (Mani and Pustejovsky, 2004) – the basic NKRL knowledge representation tools have been complemented by more complex mechanisms that make use of second order structures, see (Zarri, 2003). For example, the binding occurrences consist of lists of symbolic labels (ci) of predicative occurrences; the lists are differentiated using specific binding operators like GOAL, CONDITION and CAUSE. Let us suppose that, in Table 1, we state now that: “…an armed group of people has kidnapped Robustiniano Hablo in order to ask his family for a ransom”, where the new elementary event: “the unknown individuals will ask for a ransom” corresponds to a new predicative occurrence, e.g., mod3.


Table 1. Building up and querying predicative occurrences


a) name: Produce:Violence
   father: Produce:PerformTask/Activity
   position: 6.35
   NL description: ‘Execution of Violent Actions on the Filler of the BEN(e)F(iciary) Role’

   PRODUCE   SUBJ      var1: [(var2)]
             OBJ       var3
             [SOURCE   var4: [(var5)]]
             BENF      var6: [(var7)]
             [MODAL    var8]
             [TOPIC    var9]
             [CONTEXT  var10]
             {[modulators], ≠abs}

   var1 = var3 = var4 = var6 = var8 = <machine_tool> <weapon_> var9 = var10 = <situation_> <spatio/temporal_relationship> <symbolic_label> var2, var5, var7 =

b) mod3.c5)  PRODUCE   SUBJ     (SPECIF INDIVIDUAL_PERSON_20 weapon_wearing (SPECIF cardinality_ several_)): (VILLAGE_1)
                       OBJ      kidnapping_
                       BENF     ROBUSTINIANO_HABLO
                       CONTEXT  #mod3.c6
                       date-1:  20/11/1999
                       date-2:
   Produce:Violence (6.35)

   On November 20, 1999, in an unspecified village (VILLAGE_1), an armed group of people has kidnapped Robustiniano Hablo.

c) PRODUCE   SUBJ :   human_being :
             OBJ :    violence_
             BENF :   human_being :
             {}
             date1 :  1/1/1999
             date2 :  31/12/1999

   Is there any information in the system concerning violent activities during 1999?
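The search pattern of Table 1c can be read as a filter over predicative occurrences: the predicate must coincide, each role filler of the occurrence must be subsumed by the concept required in the pattern, and the date of the occurrence must fall within the interval of the pattern. The Python sketch below illustrates this reading under strong simplifications. The dictionaries `PARENT` and `INSTANCE_OF` are an invented stand-in for a fragment of HClass, the functions `subsumes` and `matches` are hypothetical, and SPECIF lists, locations and modulators are ignored.

```python
from datetime import date

# Toy fragment of a concept hierarchy (child -> parent); invented for this sketch.
PARENT = {
    "kidnapping_": "violence_",
    "individual_person": "human_being",
}
# Individuals and the concept each one instantiates.
INSTANCE_OF = {
    "INDIVIDUAL_PERSON_20": "individual_person",
    "ROBUSTINIANO_HABLO": "individual_person",
}


def subsumes(general: str, specific: str) -> bool:
    """True if `specific` is `general`, or an instance/specialisation of it."""
    if general == specific:
        return True
    node = INSTANCE_OF.get(specific, specific)
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)   # climb one level; None at the top
    return False


def matches(pattern: dict, occ: dict) -> bool:
    """Unify one search pattern with one predicative occurrence."""
    if pattern["predicate"] != occ["predicate"]:
        return False
    for role, concept in pattern["roles"].items():
        filler = occ["roles"].get(role)
        if filler is None or not subsumes(concept, filler):
            return False
    return pattern["date1"] <= occ["date"] <= pattern["date2"]


# Simplified version of the occurrence of Table 1b.
mod3_c5 = {
    "label": "mod3.c5",
    "predicate": "PRODUCE",
    "roles": {"SUBJ": "INDIVIDUAL_PERSON_20", "OBJ": "kidnapping_",
              "BENF": "ROBUSTINIANO_HABLO"},
    "date": date(1999, 11, 20),
}
# Search pattern of Table 1c.
pattern_1c = {
    "predicate": "PRODUCE",
    "roles": {"SUBJ": "human_being", "OBJ": "violence_", "BENF": "human_being"},
    "date1": date(1999, 1, 1),
    "date2": date(1999, 12, 31),
}
print(matches(pattern_1c, mod3_c5))
```

In this toy setting the pattern retrieves mod3.c5 because kidnapping_ is assumed to be a specialisation of violence_ and individual_person a specialisation of human_being.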



Figure 2A. Partial representation of the PRODUCE branch of HTemp, the ‘ontology of events’

c7. To represent this situation, we must add to the occurrences that represent the two elementary events a new binding occurrence, e.g., mod3.c8, to link together the conceptual labels mod3.c5 (corresponding to the kidnapping occurrence, see also Table 1b) and mod3.c7 (corresponding to the new occurrence describing the intended result). mod3.c8 will then have the form: “mod3.c8) (GOAL mod3.c5 mod3.c7)”. The meaning of mod3.c8 can be paraphrased as: “the activity described in mod3.c5 is directed towards (GOAL) the realization of mod3.c7”. Reasoning in NKRL ranges from the direct questioning of an NKRL knowledge base, making use of search patterns (formal queries over the contents of the knowledge base) that try to unify with the predicative occurrences of the base, to high-level inference procedures. A simple example of a search pattern is supplied in Table 1c; it produces as an answer, among other things, the predicative occurrence mod3.c5 of Table 1b. See, e.g., (Ellis, 1995; Corbett, 2003) for the techniques used to unify complex conceptual structures. With respect now to the high-level procedures – a detailed

paper on this topic is (Zarri, 2005) – the transformation rules try to ‘adapt’, from a semantic point of view, the original query/queries (search patterns) that failed to the real contents of the existing knowledge bases. The principle employed consists in using rules to automatically ‘transform’ the original query (i.e., the original search pattern) into one or more different queries (search patterns) that are not strictly ‘equivalent’ but only ‘semantically close’ to the original one. Let us suppose, e.g., that during the search for all the possible information linked with Robustiniano Hablo’s kidnapping, we ask the system whether Robustiniano Hablo is wealthy. In the absence of a direct answer, the system will automatically ‘transform’ the original query using a rule like: “In a context of ransom kidnapping, the certification that a given character is wealthy or has a professional role can be substituted by the certification that: i) this character has a tight kinship link with another person, and ii) this second person is a wealthy person or a professional”. The final result can then be paraphrased in this way: we do not know whether Robustiniano Hablo is wealthy, but we


can say that his father is a wealthy businessperson, see (Zarri, 2005) for the details. Hypothesis rules allow building up ‘reasonable’ logic/semantic connections among the data stored in an NKRL knowledge base using a number of pre-defined reasoning schemata, e.g., ‘causal’ schemata. For example, to mention a ‘classic’ NKRL example, after having directly retrieved through the use of a search pattern a piece of information like: “Pharmacopeia, a US biotechnology company, has received 64,000,000 dollars from the German company Schering in connection with an R&D activity”, we may be able to automatically construct a sort of ‘causal explanation’ of this event by retrieving information like: i) “Pharmacopeia and Schering have signed an agreement concerning the production by Pharmacopeia of a new compound” and ii) “in the framework of the agreement previously mentioned, Pharmacopeia has actually produced the new compound”. In Table 2, we give the informal description of the reasoning steps (called ‘condition schemata’ in a hypothesis context) that must be validated to prove that a generic ‘kidnapping’ corresponds, in reality, to a more precise ‘kidnapping for ransom’ environment. When several reasoning steps must be simultaneously validated, as in Table 2, a failure is always possible. To overcome this problem – and, at the same time, discover all the possible implicit information associated with the original data – the two inference modes, transformations and hypotheses, can be used in an integrated way, see (Zarri, 2005). In practice, we make use of ‘transformations’ within a ‘hypothesis’ context. This means that, whenever a ‘search pattern’ is derived from a ‘condition schema’ of a hypothesis to implement one of the steps of the reasoning process, we can use it ‘as it is’ – i.e., as originally coded when the inference rule was built up – but also in a ‘transformed’ form if the appropriate

transformation rules exist within the system. Making use of the transformation rules already existing within the system, the hypothesis represented in an informal way in Table 2 becomes, in practice, potentially equivalent to the hypothesis of Table 3. For example, the proof that the kidnappers are part of a terrorist group or separatist organization (reasoning step Cond1 of Table 2) can be now obtained indirectly, transformation T3, by checking whether they are members of a specific subset of this group or organization.
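The principle of falling back on a transformation when a direct query fails can be sketched as follows. This is only a toy illustration in Python: the fact base `FACTS`, the tuple encoding of facts and the functions `direct`, `kinship_transformation` and `ask` are all invented for the example and do not reproduce the actual NKRL InferenceEngine. In particular, the check that we are in a ‘ransom kidnapping’ context is omitted.

```python
# Toy knowledge base of ground facts; all names and facts are invented.
FACTS = {
    ("kinship", "ROBUSTINIANO_HABLO", "PERSON_2"),
    ("wealthy", "PERSON_2"),
}


def direct(query):
    """Step 1: try the original query as it is."""
    return [query] if query in FACTS else []


def kinship_transformation(query):
    """Rewrite 'X is wealthy' as 'X has a relative Y' and 'Y is wealthy'."""
    if query[0] != "wealthy":
        return []              # the rule does not apply to this query
    person = query[1]
    answers = []
    for fact in sorted(FACTS):
        if fact[0] == "kinship" and fact[1] == person:
            relative = fact[2]
            if ("wealthy", relative) in FACTS:
                # Keep both facts as the evidence for the indirect answer.
                answers.append([fact, ("wealthy", relative)])
    return answers


def ask(query, transformations):
    """Return ('direct', ...) or ('transformed', ...) evidence, or None."""
    found = direct(query)
    if found:
        return ("direct", found)
    for rule in transformations:
        found = rule(query)
        if found:
            return ("transformed", found)
    return None


print(ask(("wealthy", "ROBUSTINIANO_HABLO"), [kinship_transformation]))
```

The answer is flagged as ‘transformed’ so that a caller can distinguish a semantically close result from a direct one, in the spirit of the paraphrase “we do not know whether he is wealthy, but his relative is”.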

FUTURE TRENDS

NKRL is a fully implemented language/environment. The software exists in two versions, an ORACLE-supported and a file-oriented one. Future improvements will mainly concern:

• The addition of features that will allow us to query the system in Natural Language. Very encouraging experimental results have already been obtained in this context thanks to the combined use of shallow parsing techniques – see, e.g., (Koster, 2004) – and of the standard NKRL inference capabilities.

• On a more ambitious basis, the introduction of some features for the semi-automatic construction of the knowledge base of annotations/occurrences making use of full NL techniques. Some preliminary work in this context has been carried out making use of the syntactic/semantic Cafetière tools, see (Black et al., 2003, 2004).

• The introduction of optimisation techniques for the (basic) chronological backtracking of the NKRL InferenceEngine, in the style of the well-known techniques developed in a Logic Programming context; see, e.g., (Clark and Tärnlund, 1982).

Table 2. Inference steps for the ‘kidnapping for ransom’ hypothesis

(Cond1) The kidnappers are part of a separatist movement or of a terrorist organization.
(Cond2) This separatist movement or terrorist organization currently practices ransom kidnapping of particular categories of people.
(Cond3) In particular, executives or assimilated categories are concerned.
(Cond4) It can be proved that the kidnapped is really a businessperson or assimilated.



Even in its present form, NKRL has been able to deal successfully, in an ‘intelligent information retrieval’ mode, with widely different ‘narrative’ domains – from the history of France to terrorism, from the Falklands War to the corporate domain, from the legal field to the beauty care domain or the analysis of customers’ motivations, etc.

CONCLUSION

In this article, we have supplied some details about NKRL (Narrative Knowledge Representation Language), a fully implemented, up-to-date knowledge representation and inference system especially created for an ‘intelligent’ exploitation of narrative knowledge. The main innovation of NKRL consists in associating with the traditional ontologies of concepts an ‘ontology of events’, i.e., a hierarchical arrangement where the nodes correspond to n-ary structures called ‘templates’.

REFERENCES

Allen, J.F. (1981). An Interval-Based Representation of Temporal Knowledge. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence - IJCAI/81. San Francisco: Morgan Kaufmann.

Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider, P.F., and Stein, L.A., eds. (2004). OWL Web Ontology Language Reference – W3C Recommendation 10 February 2004. W3C (http://www.w3.org/TR/owl-ref/).

Table 3. ‘Kidnapping’ hypothesis in the presence of transformations concerning intermediary inference steps

(Cond1) The kidnappers are part of a separatist movement or of a terrorist organization.
   – (Rule T3, Consequent1) Try to verify whether a given separatist movement or terrorist organization is in strict control of a specific sub-group and, in this case,
   – (Rule T3, Consequent2) check if the kidnappers are members of this sub-group. We will then assimilate the kidnappers to ‘members’ of the movement or organization.
(Cond2) This movement or organization practices ransom kidnapping of given categories of people.
   – (Rule T2, Consequent) The family of the kidnapped has received a ransom request from the separatist movement or terrorist organization.
   – (Rule T4, Consequent1) The family of the kidnapped has received a ransom request from a group or an individual person, and
   – (Rule T4, Consequent2) this second group or individual person is part of the separatist movement or terrorist organization.
   – (Rule T5, Consequent1) Try to verify if a particular sub-group of the separatist movement or terrorist organization exists, and
   – (Rule T5, Consequent2) check whether this particular sub-group practices ransom kidnapping of particular categories of people.
   – …
(Cond3) In particular, executives or assimilated categories are concerned.
   – (Rule T0, Consequent1) In a ‘ransom kidnapping’ context, we can check whether the kidnapped person has a strict kinship relationship with a second person, and
   – (Rule T0, Consequent2) (in the same context) check if this second person is a businessperson or assimilated.
(Cond4) It can be proved that the kidnapped person is really an executive or assimilated.
   – (Rule T6, Consequent) In a ‘ransom kidnapping’ context, ‘personalities’ like physicians, journalists, artists etc. can be assimilated to businesspersons.



Beckett, D., ed. (2004). RDF/XML Syntax Specification (Revised) – W3C Recommendation 10 February 2004. W3C (http://www.w3.org/TR/rdf-syntax-grammar/).

Black, W.J., McNaught, J., William J. Black, Vasilakopoulos, A., Zervanou, K., Rinaldi, F., and Theodoulidis, B. (2003). CAFETIERE: Conceptual Annotations for Facts, Events, Individual Entities, and Relations (Technical Report TR-U4.3.1). Manchester: UMIST Department of Computation.

Black, W.J., Jowett, S., Mavroudakis, T., McNaught, J., Theodoulidis, B., Vasilakopoulos, A., Zarri, G.P., and Zervanou, K. (2004). Ontology-Enablement of a System for Semantic Annotation of Digital Documents. In: Proceedings of the 4th International Workshop on Knowledge Markup and Semantic Annotation (SemAnnot 2004) – 3rd International Semantic Web Conference (November 8, 2004, Hiroshima, Japan).

Clark, K.L., and Tärnlund, S.-A., eds. (1982). Logic Programming. London: Academic Press.

Corbett, D. (2003). Reasoning and Unification over Conceptual Graphs. New York: Kluwer Academic/Plenum Publishers.

Ellis, G. (1995). Compiling Conceptual Graphs. IEEE Transactions on Knowledge and Data Engineering 7: 68-81.

Ferro, L., Gerber, L., Mani, I., Sundheim, B., and Wilson, G. (2005). TIDES – 2005 Standard for the Annotation of Temporal Expressions (2005 Release). McLean (VA): The MITRE Corporation.

Koster, C.H.A. (2004). Head/modifier Frames for Information Retrieval. In: Proceedings of the 5th International Conference on Computational Linguistics and Intelligent Text Processing – CLing-2004 (LNCS 2945), Gelbukh, A., ed. Berlin: Springer.

Mani, I., and Pustejovsky, J. (2004). Temporal Discourse Models for Narrative Structure. In: Proceedings of the ACL Workshop on Discourse Annotation. East Stroudsburg (PA): Association for Computational Linguistics.

Noy, F.N., Fergerson, R.W., and Musen, M.A. (2000). The Knowledge Model of Protégé-2000: Combining Interoperability and Flexibility. In: Knowledge Acquisition, Modeling, and Management – Proceedings of EKAW’2000 (LNCS 1937), Dieng, R., and Corby, O., eds. Berlin: Springer.

Zarri, G.P. (1998). Representation of Temporal Knowledge in Events: The Formalism, and Its Potential for Legal Narratives. Information & Communications Technology Law – Special Issue on Models of Time, Action, and Situations 7: 213-241.

Zarri, G.P. (2003). A Conceptual Model for Representing Narratives. In: Innovations in Knowledge Engineering, Jain, R., Abraham, A., Faucher, C., and van der Zwaag, eds. Adelaide: Advanced Knowledge International.

Zarri, G.P. (2005). Integrating the Two Main Inference Modes of NKRL, Transformations and Hypotheses. Journal on Data Semantics (JoDS) 4: 304-340.

KEY TERMS

Attributive Operator: The ‘attributive operator’, SPECIF(ication), is one of the four operators used in NKRL for the construction of ‘structured arguments’ (‘complex fillers’ or ‘expansions’) of the conceptual predicates. The SPECIF lists, with syntax (SPECIF ei p1 … pn), are used to represent the properties or attributes that can be asserted about the first element ei, concept or individual, of the list.

Binding Occurrences: Second order structures used to deal with those ‘connectivity phenomena’ that arise when several elementary events are connected through causality, goal, indirect speech etc. links. They consist of lists of symbolic labels (ci) of predicative occurrences; the lists are differentiated using specific binding operators like GOAL, CONDITION and CAUSE.

Format of NKRL Templates: Templates take the form of n-ary combinations of quadruples connecting together the ‘symbolic name’ of the template, a ‘conceptual predicate’ and the ‘arguments’ of the predicate introduced by named relations, the ‘roles’ (like SUBJ(ect), OBJ(ect), SOURCE, BEN(e)F(iciary), etc.). The quadruples have in common the ‘name’ and ‘predicate’ components. Denoting then with Li the symbolic label identifying the template, with Pj the predicate, with Rk the generic



role and with ak the generic argument, the core data structure for templates has the format: (Li (Pj (R1 a1) (R2 a2) … (Rn an))). Templates are included in an inheritance hierarchy, HTemp(lates), which implements NKRL’s ‘ontology of events’.

NKRL Inference Engine: A software module that carries out the different ‘reasoning steps’ included in hypotheses or transformations. It also allows us to use these two classes of inference rules in an ‘integrated’ mode, thereby increasing the possibility of finding interesting (implicit) information.

NKRL Inference Rules, Hypotheses: They are used to build up automatically ‘reasonable’ connections among the information stored in an NKRL knowledge base according to a number of pre-defined reasoning schemata, e.g., ‘causal’ schemata.


NKRL Inference Rules, Transformations: These rules try to ‘adapt’, from a semantic point of view, a query that failed to the contents of the existing knowledge bases. The principle employed consists in using rules to automatically ‘transform’ the original query into one or more different queries that are not strictly ‘equivalent’ but only ‘semantically close’ to the original one.

Ontology of Concepts vs. Ontology of Events: The ontologies of concepts concern the ‘standard’ hierarchical organizations of concepts to be used to model (in a ‘static’ way) a given domain. NKRL adds an ‘ontology of events’, i.e., a new sort of hierarchical organization where the nodes, represented by n-ary structures called ‘templates’, represent general classes of ‘dynamical’ events like “move a physical object”, “produce a service”, “send/receive a message”, etc.


“Narrative” Information Problems

Gian Piero Zarri
LaLIC, University Paris 4-Sorbonne, France

INTRODUCTION

‘Narrative’ information concerns in general the account of some real-life or fictional story (a ‘narrative’) involving concrete or imaginary ‘personages’. In this article we deal with (multimedia) nonfictional narratives of economic interest. This means, first, that we are not concerned with the sorts of fictional narratives that have mainly an entertainment value, and that represent an imaginary narrator’s account of a story that happened in an imaginary world: a novel is a typical example of fictional narrative. Secondly, our ‘nonfictional narratives’ must have an economic value: they are typically embodied in corporate memory documents, and they concern news stories, normative and legal texts, medical records, intelligence messages, surveillance videos or visitor logs, news photos and video fragments for newspapers and magazines, eLearning and multimedia Cultural Heritage material, etc. Because of the ubiquity of these ‘narrative’, ‘dynamic’ resources, it is particularly important to build up computer-based applications able to represent and to exploit in a general, accurate, and effective way the semantic content – i.e., the key ‘meaning’ – of these resources.

BACKGROUND

‘Narratives’ are presently a very ‘hot’ domain. From a theoretical point of view, they constitute the object of a full discipline, ‘narratology’, whose aim can be defined as that of producing an in-depth description of the ‘syntactic/semantic structures’ of narratives, i.e., the narratologist is in charge of dissecting narratives into their component parts in order to establish their functions, their purposes and the relationships among them. A good introduction to the full domain is (Jahn, 2005). Even if narratology is particularly concerned with literary analysis (and, therefore, with ‘fictional’ narratives), in recent years some of its varieties have also acquired a particular importance from a strict Artificial Intelligence (AI) and Computer Science (CS) point of view. Leaving aside the old dream of generating fiction by computer, see (Mehan, 1977) and, more recently, (Callaway and Lester, 2002), we can mention here two new disciplines, ‘storytelling’ and ‘eChronicles’, that are of interest from both a nonfictional narratives and an AI/CS point of view. Storytelling – see, e.g., (Soulier, 2006) – concerns in general the study of the different ways of conveying ‘stories’ and events in words, images and sounds in order to entertain, teach, explain etc. Digital Storytelling deals in particular with the ways of introducing characters and emotions in the interactive entertainment domain, and thus concerns videogames, massively multiplayer online games, interactive TV, virtual reality etc., see (Handler Miller, 2004). Digital Storytelling is, therefore, related to another, computer-based variant of narratology called Narrative Intelligence, a sub-domain of AI that explores topics at the intersection of Artificial Intelligence, media studies, and human computer interaction design (narrative interfaces, history databases management systems, artificial agents with narrative structured behaviour, systems for the generation and/or understanding of histories/narratives etc.), see (Mateas and Sengers, 2003). An eChronicle system can be defined in short as a way of recording, organizing and then accessing streams of multimedia events captured by individuals, groups, or organizations making use of video, audio and other sensors. The ‘chronicles’ gathered in this way may concern any sort of ‘narratives’ from meeting minutes to football games, sales activities, ‘lifelogs’ obtained from wearable sensors, etc.
The technical challenges concern mainly the ways of aggregating the events into coherent ‘episodes’ making use of domain models as ontologies, and providing then access to this sort of material to the users at the required level of granularity. Note that exploration, and not ‘normal’ querying, is the predominant way of interaction with the chronicle



repositories; more details can be found, e.g., in (Güven, Podlaseck and Pingali, 2005), (Westermann and Jain, 2006). The solution (NKRL) proposed for the ‘intelligent’ management of nonfictional narratives in the companion article – ‘Narrative’ Information, the NKRL Solution – of the present one is considered as a fully-fledged eChronicle technique, see (Zarri, 2006). In NKRL, however, a fundamental aspect concerns the presence of powerful ‘reasoning’ techniques – an aspect that is not taken into consideration sufficiently in depth in eChronicles that are mainly interested in the accumulation of narrative materials more than in the ‘intelligent’ exploitation of their inner relationships.

REPRESENTING THE ‘NONFICTIONAL’ NARRATIVES

All the different sorts of ‘nonfictional narratives’ evoked in the previous Sections concern, practically, the description of spatially and temporally characterised ‘events’ that relate, at some level of abstraction, the behaviour or the state of some real-life ‘actors’ (characters, personages, etc.): these try to attain a specific result, experience particular situations, manipulate some (concrete or abstract) materials, send or receive messages, buy, sell, deliver etc. Note that:

• The term ‘event’ is taken here in its most general meaning, covering also strictly related notions like fact, action, state, situation, episode, activity etc.

• The ‘actors’ or ‘personages’ involved in the events are not necessarily human beings: we can have narratives concerning, e.g., the vicissitudes in the journey of a nuclear submarine (the ‘actor’, ‘subject’ or ‘personage’) or the various avatars in the life of a commercial product.

• Even if a large amount of nonfictional narratives are embodied within natural language (NL) texts, this is not necessarily true: narrative information is really ‘multimedia’. A photo representing a situation that, verbalized, could be expressed as “The US President is addressing the Congress” is not of course an NL document, yet it surely represents a narrative.

An in-depth analysis of the existing Knowledge Representation solutions that could be used to represent and manage nonfictional narratives endowed with the above characteristics is beyond the scope of this article – see in this context, e.g., (Zarri, 2005). We will limit ourselves, here, to some brief considerations. We can note, first of all, that the now popular Semantic Web (W3C) languages like RDF (Resource Description Framework), see (Manola and Miller, 2004), and OWL (Web Ontology Language), see (McGuinness and Harmelen, 2004), are unable to fit the bill because their core formalism consists in practice of the classical ‘attribute – value’ model. For these ‘binary’ languages, then, a property can only be a binary relationship, linking two individuals or an individual and a value. When these languages must represent simple ‘narratives’ like “John has given a book to Mary”, several difficulties arise. In this extremely simple sentence, e.g., “give” is an n-ary (ternary) relationship that, to be represented in a complete way, asks for the presence of a specific ‘semantic predicate’ in the “give” or “transfer” style, where the ‘arguments’ of the predicate, “John”, “book” and “Mary”, must be labelled with ‘conceptual roles’ such as, e.g., ‘agent of give’, ‘object of give’ and ‘beneficiary of give’ respectively. Efforts to extend the W3C languages by introducing some n-ary feature have not been very successful so far: see, in this context, a recent working paper from the W3C Semantic Web Best Practices and Deployment Working Group (SWBPD WG) about “Defining N-ary Relations on the Semantic Web” (Noy and Rector, 2006). This paper proposes some extensions to the binary paradigm to allow the correct representation of ‘narratives’ like: “Steve has temperature, which is high, but falling” or “United Airlines flight 3177 visits the following airports: LAX, DFW, and JFK”.
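The difference between binary statements and an n-ary, role-labelled structure for “John has given a book to Mary” can be made concrete with a small Python sketch. Everything in it is illustrative: the property names (`gives`, `givesTo`, `agent`, `object`, `beneficiary`), the identifier `give_event_1` and the helper `participants` are assumptions made for the example and are not taken from RDF, OWL or the W3C working paper.

```python
# (a) Binary "attribute - value" statements: each one links exactly two terms.
binary_triples = [
    ("John", "gives", "book"),      # loses the recipient
    ("John", "givesTo", "Mary"),    # loses the object
]

# (b) Staying binary by reifying the event as an artificial individual.
reified_triples = [
    ("give_event_1", "type", "Give"),
    ("give_event_1", "agent", "John"),
    ("give_event_1", "object", "book"),
    ("give_event_1", "beneficiary", "Mary"),
]

# (c) A single n-ary structure: one predicate plus role-labelled arguments.
nary = ("GIVE", {"agent": "John", "object": "book", "beneficiary": "Mary"})


def participants(triples, event):
    """Collect the role fillers scattered over the triples of a reified event."""
    return {p: o for s, p, o in triples if s == event and p != "type"}


# The reified version carries the same roles, but only after reassembly
# around the invented individual give_event_1.
print(participants(reified_triples, "give_event_1") == nary[1])
```

In variant (a) nothing ties the two statements to the same act of giving; variant (b) recovers the link at the price of an extra, artificial individual; variant (c) keeps predicate and roles together in one structure.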
The technical solutions expounded in this paper are not very convincing and have aroused several criticisms. These have focused, mainly, on i) the fact that the majority of the solutions proposed do not deal, in reality, with the n-ary problem, but with (only loosely) related matters like the possibility of specifying a ‘standard’ binary relationship via the addition of properties, and ii) the arbitrary introduction, through reification processes, of fictitious (and inevitably ad hoc) ‘individuals’ to represent the n-ary relations when these are actually dealt with. Moreover, the paper says nothing,


e.g., about the way of dealing, in concrete ‘narrative’ situations, with those crucial ‘connectivity phenomena’ like causality, goal, indirect speech, co-ordination and subordination etc. that link together the basic pieces of information – e.g., the ‘basic events’ corresponding to the present illness state of Steve with other ‘basic events’ corresponding to the (possible or definite) ‘causes’ of such state. Several solutions for representing narratives in computer-usable ways according to some sort of actual ‘n-ary model’ have been described in the literature. For example, in the context of his work – between the mid-fifties and the mid-sixties – on the set up of a mechanical translation process based on the simulation of the thought processes of the translator, Silvio Ceccato (Ceccato, 1961) proposed a representation of narrative-like sentences as a network of triadic structures (‘correlations’) organized around specific ‘correlators’ (a sort of roles). Ceccato is also credited to be one of the pioneers of the semantic network studies; basically, semantic networks are directed graphs (digraphs) where the nodes represent concepts, and the arcs different kinds of associative links, not only the ‘classical’ IsA and property-value links, but also n-ary relationships. A panorama of the different conceptual solutions proposed in a semantic network context can be found in (Lehmann, 1992). In the seventies, a sort of particularly popular, n-ary semantic network approach has been represented by the Conceptual Dependency theory of Roger Schank (Schank, 1973). In this theory, the underlying meaning (‘conceptualization’) of narrative-like utterances is expressed as combinations of ‘semantic predicates’ chosen from a set of twelve ‘primitive actions’ (like INGEST, MOVE, ATRANS, the transfer of an abstract relationship like possession, ownership and control, PTRANS, physical transfer, etc.) plus states and changes of states, and seven role relationships (‘conceptual case’). 
Conceptual Graphs (CGs) is the representation system developed by John Sowa (Sowa, 1984, 1999) and derived, at least partly, from Schank’s work and other early work in the Semantic Networks domain. CGs make use of a graph-based notation for representing ‘concept-types’ (organized into a type-hierarchy), ‘concepts’ (that are instantiations of concept types) and ‘conceptual relations’ that relate one concept to another. CGs can be used to represent in a formal way narratives like “A pretty lady is dancing gracefully” and more complex,

second-order constructions like contexts, wishes and beliefs. CYC, see (Lenat et al., 1990), is one of the most controversial endeavours in the history of Artificial Intelligence. Started in the early 1980s as an MCC (Microelectronics and Computer Technology Corporation, Texas, USA) project, it ended about 15 years later with the set up of an enormous knowledge base containing about a million hand-entered ‘logical assertions’ including both simple statements of facts and rules about what conclusions can be inferred if certain statements of facts are satisfied. The ‘upper level’ of the ontology that structures the CYC knowledge base is now freely accessible on the Web, see http://www.cyc.com/cyc/opencyc. A detailed analysis of the origins, developments and motivations of CYC can be found in (Bertino et al., 2001: 275-316). We can also mention here another ‘modern’ system, Topic Maps, see (Rath, 2003), where information is represented using topics (representing any concept, from people to software modules and events), associations (the relationships between them), and occurrences (the relationships between topics and information resources relevant to them). They correspond, ultimately, to a sort of downgraded Semantic Network representation. Leaving now aside ‘historical’ solutions like those proposed by Schank or Ceccato, none of the existing n-ary solutions mentioned above seems to be able to satisfy completely the nonfictional narratives requirements, see again (Zarri, 2005) for more details. The universal purposes of CYC, the extremely large dimensions of its knowledge base and the extreme diversity of the contents of this base give rise to serious consistency problems, which have apparently restricted the development of concrete applications based on this technology to experimental projects mainly supported by the US Government.
On the other hand, the knowledge representation language of CYC, CycL (substantially, a frame system rewritten in logical form), seems to be too rigid and uniform to adapt itself to the representation of all the different facets (from general concepts and elementary events to the connectivity phenomena etc.) that characterise the narratives. Conceptual Graphs (CGs) could represent, at least in principle, a valid solution for dealing with nonfictional narrative information. However, it seems evident that work in a CGs context concerns mainly, with few exceptions, the ‘academic’ domain, and that the practically-oriented applications of CGs are particularly scarce. This becomes particularly

1169

N


evident when we consider that the CGs developers still lack of an exhaustive and authoritative list of standard CGs structures under the form of ‘canonical graphs’ that could constitute a sort of ‘catalogue’ for dealing with practical problems; the set up of a tool like this seems never have been planned. The existence of such a catalogue could be extremely important for practical applications in the narrative (not only) domain given that: i) a system-builder should not have to create himself the structural and inferential knowledge needed to describe and exploit the events proper to a (sufficiently) large class of narratives; ii) the reproduction and the sharing of previous results could become neatly easier. We can add to the above difficulties the existence of a series of general problems that are not associated with a specific system but that concern by and large all the existing n-ary solutions, like the lack of agreement about the list of ‘roles’ (conceptual cases) to be used when a narrative must be practically represented into conceptual format, or the differences of opinion about the use of ‘primitives’.

ACTUAL TRENDS

In spite of the quite pessimistic considerations of the previous Section, conceiving a specific Knowledge Representation tool for dealing in practice with nonfictional narrative information is far from impossible. Returning now to the “John gave a book…” example above – and leaving aside, for the time being, all the additional problems linked, e.g., with the existence of the ‘connectivity phenomena’ – it is not too difficult to see that a complete, n-ary representation that captures all the ‘essential meaning’ of this elementary narrative amounts to:

• Define JOHN_, MARY_ and BOOK_1 as ‘individuals’, instances of general ‘concepts’ like human_being and information_support or of more specific concepts. Concepts and instances (individuals) are, as usual, collected into a ‘binary’ ontology (built up using a standard tool like, e.g., Protégé).

• Define an n-ary structure organised around a conceptual predicate like, e.g., MOVE or PHYSICAL_TRANSFER, and associate the above individuals (the arguments) with the predicate through conceptual roles that specify their ‘function’ within the global narrative. JOHN_ will then be introduced by an AGENT (or SUBJECT) role, BOOK_1 by an OBJECT (or PATIENT) role, MARY_ by a BENEFICIARY role. Additional information like “yesterday” could be introduced by, e.g., a TEMPORAL_ANCHOR role, etc.

• ‘Reify’ the resulting n-ary structure by associating with it a unique identifier in the form of a ‘semantic label’, to assure both i) the logical-semantic coherence of the structure; ii) a rational and efficient way of storing and retrieving it.

Formally, an n-ary structure defined according to the above guidelines can be described as:

(L_i (P_j (R_1 a_1) (R_2 a_2) … (R_n a_n)))    (1)

where L_i is the symbolic label identifying the particular n-ary structure (e.g., the global structure corresponding to the representation of the “John gave a book…” example), P_j is the conceptual predicate, R_k is the generic role and a_k the corresponding argument (e.g., the individuals JOHN_, MARY_, etc.). Note that each of the (R_k a_k) cells of (1), taken individually, represents a binary relationship in the W3C language style. The main point here is, however, that the whole conceptual structure represented by (1) must be considered globally.

The solution represented formally by (1) is at the core of a complete and running conceptual tool for the representation and management of nonfictional narrative information called NKRL (Narrative Knowledge Representation Language), see (Zarri, 2005) and the companion article: ‘Narrative’ Information, the NKRL Solution.
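As an illustration only, the label, predicate and role/argument cells of (1) can be sketched as a small Python data structure. This is not NKRL syntax: the class name NaryStructure, its methods and the label EVENT_1 are assumptions introduced for the example, while the predicate, roles and individuals are those of the “John gave a book…” case discussed above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class NaryStructure:
    """One reified n-ary structure: a label, a predicate and (role, argument) cells."""
    label: str                            # the symbolic label L of formula (1)
    predicate: str                        # the conceptual predicate P
    roles: Tuple[Tuple[str, str], ...]    # the (R, a) cells, kept in order

    def render(self) -> str:
        """Serialise the structure in the bracketed notation of formula (1)."""
        cells = " ".join(f"({role} {arg})" for role, arg in self.roles)
        return f"({self.label} ({self.predicate} {cells}))"

    def argument(self, role: str) -> str:
        """Return the argument filling a given role (KeyError if the role is absent)."""
        return dict(self.roles)[role]


# EVENT_1 is a hypothetical label; roles and individuals follow the text.
gift = NaryStructure(
    label="EVENT_1",
    predicate="MOVE",
    roles=(
        ("AGENT", "JOHN_"),
        ("OBJECT", "BOOK_1"),
        ("BENEFICIARY", "MARY_"),
        ("TEMPORAL_ANCHOR", "yesterday"),
    ),
)

# The label is the key under which the whole structure is stored and retrieved.
store: Dict[str, NaryStructure] = {gift.label: gift}

print(store["EVENT_1"].render())
```

Keeping the structure immutable and indexing it by its label reflects the two purposes of reification stated above: the structure is handled as a single coherent unit, and it can be stored and retrieved through one identifier.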

CONCLUSION

We deal in this article with ‘nonfictional narratives’. These are information resources of high economic importance that concern, e.g., ‘corporate knowledge’ documents, news stories, medical records, surveillance videos or visitor logs, etc. When we examine the existing (or past) general Knowledge Representation systems that could be used for dealing with nonfictional narratives, we can note that none of them seems able to satisfy completely the requirements of nonfictional narratives. For example, the W3C (Semantic Web) languages like RDF and OWL cannot fit the bill since they are binary-based types of representation, while narratives call, in general, for n-ary solutions. A specific, narrative-oriented formalism able to capture the essential ‘meaning’ of an ‘elementary’ narrative event does exist, however, see (Zarri, 2005) and the companion article: ‘Narrative’ Information, the NKRL Solution.

REFERENCES

Bertino, E., Catania, B., and Zarri, G.P. (2001). Intelligent Database Systems. London: Addison-Wesley and ACM Press.

Callaway, C.B., and Lester, J.C. (2002). Narrative Prose Generation. Artificial Intelligence 139: 213-252.

Ceccato, S. (1961). Linguistic Analysis and Programming for Mechanical Translation. Milano: Feltrinelli.

Güven, S., Podlaseck, M., and Pingali, G. (2005). PICASSO: Pervasive Information Chronicling, Access, Search, and Sharing for Organizations. In: Proceedings of the IEEE 2005 Pervasive Computing Conference (PerCom 2005). Los Alamitos (CA): IEEE Computer Society Press.

Handler Miller, C. (2004). Digital Storytelling. A Creator’s Guide to Interactive Entertainment. Burlington (MA): Focal Press.

Jahn, M. (2005). Narratology: A Guide to the Theory of Narrative (version 1.8). Cologne: English Department of the University (http://www.uni-koeln.de/~ame02/pppn.htm).

Lehmann, F., ed. (1992). Semantic Networks in Artificial Intelligence. Oxford: Pergamon Press.

Lenat, D.B., Guha, R.V., Pittman, K., Pratt, D., and Shepherd, M. (1990). CYC: Toward Programs With Common Sense. Communications of the ACM 33(8): 30-49.

Manola, F., and Miller, E. (2004). RDF Primer – W3C Recommendation 10 February 2004. W3C (http://www.w3.org/TR/rdf-primer/).

Mateas, M., and Sengers, P., eds. (2003). Narrative Intelligence. Amsterdam: John Benjamins.

McGuinness, D.L., and van Harmelen, F. (2004). OWL Web Ontology Language Overview – W3C Recommendation 10 February 2004. W3C (http://www.w3.org/TR/owl-features/).

Mehan, J. (1977). TALE-SPIN – An Interactive Program That Writes Stories. In: Proceedings of the 1977 International Joint Conference on Artificial Intelligence - IJCAI/77. San Mateo (CA): Morgan Kaufmann.

Noy, F.N., and Rector, A., eds. (2006). Defining N-ary Relations on the Semantic Web – W3C Working Group Note 12 April 2006. W3C (http://www.w3.org/TR/2006/NOTE-swbp-n-aryRelations-20060412/).

Rath, H.H. (2003). The Topic Maps Handbook (White Paper, version 1.1). Gütersloh: empolis GmbH (http://www.empolis.com/downloads/empolis_TopicMaps_Whitepaper20030206.pdf).

Schank, R.C. (1973). Identification of Conceptualizations Underlying Natural Language. In: Computer Models of Thought and Language, Schank, R.C., and Colby, K.M., eds. San Francisco: W.H. Freeman and Co.

Soulier, E., ed. (2006). Le Storytelling, concepts, outils et applications. Paris: Lavoisier.

Sowa, J.F. (1984). Conceptual Structures: Information Processing in Mind and Machine. Reading (MA): Addison-Wesley.

Sowa, J.F. (1999). Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove (CA): Brooks Cole Publishing Co.

Westermann, U., and Jain, R. (2006). A Generic Event Model for Event-Centric Multimedia Data Management in eChronicle Applications. In: Proceedings of the 22nd International Conference on Data Engineering Workshops – ICDE Workshop on eChronicles (ICDEW’06). Los Alamitos (CA): IEEE Computer Society Press.

Zarri, G.P. (2005). Integrating the Two Main Inference Modes of NKRL, Transformations and Hypotheses. Journal on Data Semantics (JoDS) 4: 304-340.

Zarri, G.P. (2006). Modeling and Advanced Exploitation of eChronicle ‘Narrative’ Information. In: Proceedings of the 22nd International Conference on Data Engineering Workshops – ICDE Workshop on eChronicles (ICDEW’06). Los Alamitos (CA): IEEE Computer Society Press.


KEY TERMS

‘Binary’ Languages vs. n-ary Languages: Binary languages (like RDF and OWL) are based on the classical ‘attribute – value’ model: they are called ‘binary’ because, for them, a property can only be a binary relationship, linking two individuals or an individual and a value. They cannot be used to represent narratives accurately; narratives call instead, in general, for the use of n-ary knowledge representation languages.

Connectivity Phenomena: In the presence of several, logically linked elementary events, this term denotes the existence of a global ‘narrative’ information content that goes beyond the simple addition of the information conveyed by the single events. The connectivity phenomena are linked with the presence of logico-semantic relationships like causality, goal, co-ordination and subordination, etc.

Core Format of a Complete Solution for Representing Narratives: Formally, an n-ary structure able to represent the ‘essential meaning’ of an ‘elementary event’ can be described as: (L_i (P_j (R_1 a_1) (R_2 a_2) … (R_n a_n))), where L_i is the symbolic label identifying the particular formalized event, P_j is the conceptual predicate, R_k is the generic role and a_k the corresponding argument.

Examples of n-ary Languages: ‘Historical’ examples of n-ary languages are Ceccato’s ‘correlations’, Schank’s Conceptual Dependency theory, many Semantic Networks proposals, etc. Current n-ary systems are, e.g., Topic Maps, Sowa’s Conceptual Graphs, Lenat’s CYC, etc. None of them is able to satisfy completely the requirements for an ‘intelligent’ representation and management of nonfictional narrative information.

Narrative Information: Concerns in general the account of some real-life or fictional story (a ‘narrative’) involving concrete or imaginary ‘personages’.

Narratology: Discipline that deals with narratives from a theoretical point of view. Sub-classes of narratology that have a ‘computational’ interest are, e.g., Storytelling, Narrative Intelligence and the eChronicle systems.

Nonfictional Narrative of an Economic Interest: In this case, the personages are ‘real characters’, and the narrative happens in the real world. Moreover, the narratives are embodied in multimedia documents of economic interest: corporate memory documents, news stories, normative and legal texts, medical records, intelligence messages, surveillance videos or visitor logs, etc.

Natural Language Processing and Biological Methods

Gemma Bel Enguix
Rovira i Virgili University, Spain

M. Dolores Jiménez López
Rovira i Virgili University, Spain

INTRODUCTION

During the 20th century, biology, and especially molecular biology, became a pilot science, so that many disciplines have formulated their theories using models taken from biology. Computer science has become almost a bio-inspired field thanks to the great development of natural computing and DNA computing. On the side of linguistics, interactions with biology were not frequent during the 20th century. Nevertheless, because of the “linguistic” view of the genetic code, molecular biology has taken several models from formal language theory in order to explain the structure and working of DNA. Such attempts have focused on the design of grammar-based approaches to define a combinatorics of protein and DNA sequences (Searls, 1993). The linguistics of natural language has also made some contributions in this field through Collado (1989), who applied generativist approaches to the analysis of the genetic code. On the other hand, and strictly out of theoretical interest, several attempts have been made to establish structural parallelisms between DNA sequences and verbal language (Jakobson, 1973; Marcus, 1998; Ji, 2002). However, there is a lack of theory attempting to explain the structure of human language from the results of the semiosis of the genetic code. This is probably the only arrow that remains incomplete in order to close the path between computer science, molecular biology, biosemiotics and linguistics.

Natural Language Processing (NLP), a subfield of Artificial Intelligence that concerns the automated generation and understanding of natural languages, can take great advantage of the structural and “semantic” similarities between those codes. Specifically, taking the systemic code units and methods of combination of the genetic code, the methods of such an entity can be translated to the study of natural language. Therefore, NLP could become another “bio-inspired” science by means of theoretical computer science, which provides the theoretical tools and formalizations that are necessary for approaching such an exchange of methodology. In this way, we obtain a theoretical framework where biology, NLP and computer science exchange methods and interact, thanks to the semiotic parallelism between the genetic code and natural language.

BACKGROUND

Most current approaches to natural language show several features that invite the search for new formalisms able to account for natural languages in a simpler and more natural way. Two main facts lead us to look for a more natural computational system to give a formal account of natural languages: a) natural language sentences cannot be placed in any of the families of the Chomsky hierarchy (Chomsky, 1956) on which current computational models are largely based, and b) the rewriting methods used in a large number of natural language approaches seem not very adequate, from a cognitive perspective, to account for the processing of language. Now, if to these we add (1) that languages generated following a molecular computational model are placed in between the Context-Sensitive and Context-Free families; (2) that the genetic model offers simpler alternatives to rewriting rules; and (3) that genetics is a natural informational system, as natural language is, we have the ideal scenario for proposing biological models in NLP.

The idea of using biological methods in the description and processing of natural languages is backed up by a long tradition of interchanging methods between biology and natural/formal language theory:

1. Results and methods in the field of formal language theory have been applied to biology: (1) Pawlak’s (1965) dependency grammars as an approach to the study of protein formation; (2) transformational grammars for modeling gene regulation (Collado, 1989); (3) stochastic context-free grammars for modeling RNA (Sakakibara et al., 1994); (4) definite clause grammars and cut grammars to investigate gene structure and the mutations and rearrangements in it (Searls, 1989); (5) tree-adjoining grammars for predicting RNA structure from biological data (Uemura et al., 1999).

2. Natural languages as models for biology: (1) Watson’s (1968) understanding of heredity as a form of communication; (2) Asimov’s (1968) idea that nucleotide bases are letters and that they form an alphabet; (3) Jacob’s (1970) consideration that the sense of the genetic message is given by the combination of its signs in words and by the arrangement of words in phrases; (4) Jakobson’s (1970) ideas about taking the nucleotide bases as phonemes of the genetic code, or about the binary oppositions in phonemes and in the nucleic code.

3. Biological ideas in linguistics: (1) the “tree model” proposed by Schleicher (1863); (2) the “wave model” due to Schmidt (1872); (3) the “geometric network model” proposed by Forster (1997); or (4) the naturalistic metaphor in Linguistics defended by Jakobson (1970, 1973).

4. Using DNA as a support for computation is the basic idea of Molecular Computing (Păun et al., 1998). Speculations about this possibility can be found in Feynman (1961), Bennett (1973) and Conrad (1995).

BIOLOGICAL METHODS IN NLP

Here, we present an overview of different bio-inspired methods that in recent years have been successfully applied to several NLP issues, from syntax to pragmatics. These methods are taken mainly from computer science and are basically the following: DNA computing, membrane computing and networks of evolutionary processors.

DNA Computing

One of the most developed lines of research in natural computing is so-called molecular computing, a model based on molecular biology, which arose mainly after Adleman (1994). An active area in molecular computing is DNA computing (Păun et al., 1998), inspired by the way DNA performs operations to generate, replicate or change the configuration of strings. The application of molecular computing methods to natural language syntax gives rise to molecular syntax (Bel-Enguix & Jiménez-López, 2005a). Molecular syntax takes as a model two types of mechanisms used in biology to modify or generate DNA sequences: mutations and splicing. Mutations are changes performed on a single linguistic string, whether a phrase, a sentence or a text. Splicing is a process involving two or more linguistic sequences. It is a good framework for approaching syntax, from both the sentential and the dialogical perspective.

The methods used by molecular syntax are based on basic genetic processes: cut, paste, delete and move. By combining these elementary rules, most of the complex structures of natural language can be obtained with a high degree of simplicity. This approach is a test of the generative power of splicing for syntax. According to the results achieved, splicing seems quite powerful for generating, in a very simple way, most of the patterns of traditional syntax. Moreover, the new perspectives and results it provides could mean a transformation in the general perspective of syntax. We therefore think that bio-NLP, applied in a methodological and clear way, is a powerful and simple model that can be very useful to a) formulate systems capable of generating the larger part of the structures of language, and b) define a formalization that can be implemented and may be able to describe and predict the behavior of natural language structures.
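The splicing and mutation operations described above can be sketched on token sequences as follows. This is a minimal illustration, not the formal system of molecular syntax: the function names, the choice of word tokens as units and the explicit cut positions are assumptions made for the example. Splicing follows the definition given in the Key Terms of this article (left side of one string joined to the right side of the other).

```python
from typing import List, Tuple

Tokens = List[str]


def splice(first: Tokens, second: Tokens,
           cut_first: int, cut_second: int) -> Tuple[Tokens, Tokens]:
    """Cut both sequences and recombine them.

    Direct splicing joins the left part of `first` to the right part of
    `second`; inverse splicing joins the left part of `second` to the
    right part of `first`.
    """
    direct = first[:cut_first] + second[cut_second:]
    inverse = second[:cut_second] + first[cut_first:]
    return direct, inverse


def delete(tokens: Tokens, start: int, end: int) -> Tokens:
    """Mutation on a single string: remove tokens[start:end]."""
    return tokens[:start] + tokens[end:]


def move(tokens: Tokens, start: int, end: int, dest: int) -> Tokens:
    """Mutation on a single string: cut tokens[start:end] and paste the
    segment at index `dest` of the remaining sequence."""
    segment = tokens[start:end]
    rest = tokens[:start] + tokens[end:]
    return rest[:dest] + segment + rest[dest:]


s1 = "the cat chased the mouse".split()
s2 = "a dog bit the postman".split()

direct, inverse = splice(s1, s2, 2, 2)
print(" ".join(direct))
print(" ".join(inverse))

fronted = move("John gave Mary a book yesterday".split(), 5, 6, 0)
print(" ".join(fronted))
```

Here the cut positions are passed in by hand; in a splicing system proper they would be determined by rules that specify the contexts in which a cut may occur.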

Membrane Computing

Membrane Systems (MS) (Păun, 2000) are models of computation inspired by some basic features of biological membranes. They can be viewed as a new paradigm in the field of natural computing based on the functioning of membranes inside the cell. MS can be used as generative, computing or decidability devices. This computing model has several intrinsically interesting features such as, for example, the use of multisets, the inherent parallelism in its evolution, and the possibility of devising computations that solve problems of exponential complexity in polynomial time. This framework provides a powerful tool for formalizing any kind of interaction, both among agents and between agents and their environment. One of the key ideas of MS is that generation is achieved by evolution. Therefore, most evolving systems can be formalized by means of membrane systems.

Linguistic Membrane Systems (LMS) (Bel-Enguix & Jiménez-López, 2005b) aim to model linguistic processes, taking advantage of the flexibility of MS and their suitability for dealing with fields where contexts are a central part of the theory. LMS can be easily adapted to deal with different aspects of the description and processing of natural languages. The most developed applications of LMS are semantics and dialogue.

MS are a good framework for developing a semantic theory because they are evolving systems by definition, in the same sense that we take meaning to be a dynamic entity. Moreover, MS provide a model in which contexts, either isolated or interacting, are an important element that is already formalized and can give us the theoretical tools we need. Semantic membranes may be seen as an integrative approach to semantics coming from formal languages, biology and linguistics. Taking into account the results obtained in the field of computer science, as well as the naturalness and simplicity of the formalism, the formalization of contexts by means of membranes seems a promising area of research for the future. Examples of the application of MS to semantics can be found in Bel-Enguix and Jiménez-López (2007).

A topic where context and interaction among agents are essential is dialogue modeling and its applications to the design of effective and user-friendly computer dialogue systems. Taking a pragmatic perspective on dialogue, and based on speech acts, multi-agent theory and dialogue games, Dialogue Membrane Systems have arisen as an attempt to compute speech acts by means of MS. Considering membranes as agents, and domains as personal background and linguistic competence, the application to dialogue is almost natural, and simple from the formal point of view. For examples of this application see Bel-Enguix and Jiménez-López (2006b).
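To make the basic mechanics of a membrane system concrete, the following sketch performs one evolution step over regions that hold multisets of objects. It is a strongly simplified illustration under stated assumptions, not the formal model of Păun (2000) nor an LMS: each rule rewrites a single object, there is at most one rule per object and region (so the maximally parallel step is deterministic), membranes neither dissolve nor divide, and the names step, regions, parent and rules are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Multiset = Counter                      # object -> number of copies
Products = List[Tuple[str, str]]        # (product, target), target is "here" or "out"


def step(regions: Dict[str, Multiset],
         parent: Dict[str, str],
         rules: Dict[str, Dict[str, Products]]) -> Dict[str, Multiset]:
    """Apply the rules of every region to all the objects they match, at once.

    Objects with no rule are carried over unchanged. A product with target
    "here" stays in the region; a product with target "out" crosses the
    membrane into the enclosing (parent) region.
    """
    nxt: Dict[str, Multiset] = {name: Counter() for name in regions}
    for name, contents in regions.items():
        for obj, count in contents.items():
            products = rules.get(name, {}).get(obj)
            if products is None:
                nxt[name][obj] += count
                continue
            for product, target in products:
                dest = name if target == "here" else parent[name]
                nxt[dest][product] += count
    return nxt


# Two nested membranes: "inner" sits inside "skin".
regions = {"skin": Counter(), "inner": Counter({"a": 2})}
parent = {"inner": "skin"}
# In the inner region, every "a" becomes a "b" that stays and a "c" that is sent out.
rules = {"inner": {"a": [("b", "here"), ("c", "out")]}}

after = step(regions, parent, rules)
print(dict(after["inner"]), dict(after["skin"]))
```

In a linguistic reading of this sketch, the regions would play the role of contexts and the objects that of linguistic units, with rules describing how units evolve within a context or move between contexts.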

NEPs: Networks of Evolutionary Processors

Networks of Evolutionary Processors (NEPs) are a new computing mechanism directly inspired by the behavior of cell populations. Every cell is described by a set of words (DNA) evolving by mutations, which are represented by operations on these words. At the end of the process, only the cells with correct strings survive. In spite of the biological inspiration, the architecture of the system is directly related to the Connection Machine (Hillis, 1985) and the Logic Flow paradigm (Errico & Jesshope, 1994). Moreover, the global framework for the development of NEPs has to be completed with the biological background of DNA computing (Păun et al., 1998), membrane computing (Păun, 2000) and, especially, with grammar systems (Csuhaj-Varjú et al., 1994), which share with NEPs the idea of several devices working together and exchanging results.

The first precedents of NEPs as generating devices can be found in Csuhaj-Varjú & Salomaa (1997) and Csuhaj-Varjú & Mitrana (2000). The topic was introduced in Castellanos et al. (2003) and Martín-Vide et al. (2003), and further developed in Castellanos et al. (2005) and Csuhaj-Varjú et al. (2005).

With this background and these theoretical connections, it is easy to understand how NEPs can be described as agential, bio-inspired, context-sensitive systems. Many disciplines need models of this type, able to support a biological framework in a collaborative environment. The conjunction of these features allows the system to be applied to a number of areas beyond generation and recognition in formal language theory. NLP is one of the fields with a lack of biological models and a clear suitability for agential approaches.

NEPs have significant intrinsic multi-agent capabilities, together with the environmental adaptability that is typical of bio-inspired models. Some of the characteristics of the NEP architecture are the following: modularization, contextualization and redefinition of agent capabilities, synchronization, evolvability and learnability. Inside the construct, every agent is autonomous, specialized, context-interactive and learning-capable. As regards the functioning of NEPs, two main features deserve to be highlighted: emergence and parallelism.

Because of those features, NEPs seem to be a suitable model for tackling natural languages. One of the main problems of natural language is that it is generated in the brain, and there is a lack of knowledge of the mental processes the mind undergoes to produce a sentence. While awaiting new advances in neuroscience, we have to use models that seem to fit NLP better. Modularity has proved to be an important idea in a wide range of fields: cognitive science, computer science and, of course, NLP. NEPs provide a suitable theoretical framework for the formalization of modularity in NLP.

Another major problem for the formalization and processing of natural language is its changing nature. Not only words, but also rules, meaning and phonemes can take different shapes during the process of computation. Formal models based on mathematical language lack the flexibility needed to describe natural language. Biological models seem better suited to this task, since biological entities share with languages the concept of “evolution”. From this perspective, NEPs offer enough flexibility to model any change at any moment in any part of the system. Besides, as a bio-inspired method of computation, they are capable of simulating natural evolution in a highly pertinent and specialized way.

Some linguistic disciplines, such as pragmatics or semantics, are context-driven areas, where the same utterance has different meanings in different contexts. To model such variation, a system with a good definition of environment is needed. NEPs offer a way to approach formal semantics and formal pragmatics from a natural computing perspective.

Finally, the multimodal approach to communication, where not just production but also gestures, vision and supra-segmental features of sounds have to be tackled, calls for a parallel way of processing. NEPs allow modules to work in parallel. The autonomy of each processor and the possible miscoordination between them can also account for several problems of speech. Examples of NEP applications to NLP can be found in Bel-Enguix and Jiménez-López (2005c, 2006a).
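The alternation of evolution and communication that underlies NEPs can be sketched as below. This is a toy illustration under explicit assumptions rather than the formal definition found in the cited literature: only single-symbol substitution rules are modelled, the input and output filters are arbitrary Python predicates, every evolutionary step is followed by a communication step, and all names (Node, evolve, communicate, run) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Rule = Tuple[str, str]  # (old symbol, new symbol): a substitution "mutation"


@dataclass
class Node:
    rules: List[Rule]
    accepts: Callable[[str], bool]   # input filter: which words may enter
    emits: Callable[[str], bool]     # output filter: which words may leave
    words: Set[str] = field(default_factory=set)


def evolve(node: Node) -> None:
    """Evolutionary step: each word is replaced by all words obtainable by
    applying one rule at one position; words no rule applies to are kept."""
    result: Set[str] = set()
    for w in node.words:
        rewritten = {w[:i] + new + w[i + 1:]
                     for old, new in node.rules
                     for i, ch in enumerate(w) if ch == old}
        result |= rewritten or {w}
    node.words = result


def communicate(nodes: Dict[str, Node], edges: List[Tuple[str, str]]) -> None:
    """Communication step: words passing a node's output filter leave it, and
    copies enter every neighbour whose input filter accepts them."""
    outgoing = {name: {w for w in n.words if n.emits(w)} for name, n in nodes.items()}
    for name, n in nodes.items():
        n.words -= outgoing[name]
    for a, b in edges:                       # edges are undirected
        for src, dst in ((a, b), (b, a)):
            nodes[dst].words |= {w for w in outgoing[src] if nodes[dst].accepts(w)}


def run(nodes: Dict[str, Node], edges: List[Tuple[str, str]], rounds: int) -> None:
    for _ in range(rounds):
        for node in nodes.values():
            evolve(node)
        communicate(nodes, edges)


nodes = {
    # n1 rewrites a -> b and releases words once they contain no "a".
    "n1": Node(rules=[("a", "b")], accepts=lambda w: False,
               emits=lambda w: "a" not in w, words={"aa"}),
    # n2 accepts words containing "b" and rewrites b -> c.
    "n2": Node(rules=[("b", "c")], accepts=lambda w: "b" in w,
               emits=lambda w: False),
}
run(nodes, [("n1", "n2")], rounds=4)
print(nodes["n1"].words, nodes["n2"].words)
```

Each node works only on its own words and interacts with the others solely through its filters, which is one simple way of picturing the specialized, autonomous, context-interactive agents described above.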

FUTURE TRENDS

Three general formalisms for dealing with NLP by means of biological methods have been introduced, focusing on the formal definition of several frameworks that adapt models coming from the area of bio-inspired computation to the needs of NLP. The main trends for the future focus on the implementation of these models in order to test their computational advantages over classical models of NLP without biological inspiration.

CONCLUSION

The coincidences between several structures of language and biology allow us, in the field of NLP, to take advantage of the bio-inspired models formalized by theoretical computer science. Moreover, the multi-agent capabilities of some of these models make them a suitable tool for simulating the processes of generation and recognition in natural language. Biological methods coming from computer science can be very useful in the field of natural language, since they provide simple, flexible and intuitive tools for describing natural languages and making their implementation in NLP systems easier. This research provides an integrative path for biology, computer science and NLP, three branches of human knowledge that have to come together in the development of new systems of communication for the future global society.

REFERENCES

Adleman, L.M. (1994). Molecular Computation of Solutions to Combinatorial Problems. Science, 266, 1021-1024.

Asimov, I. (1968). Il Codice Genetico. Torino: Einaudi.

Bel-Enguix, G. & Jiménez-López, M.D. (2005a). Biosyntax. An Overview. Fundamenta Informaticae, 64, 1-12.

Bel-Enguix, G. & Jiménez-López, M.D. (2005b). Linguistic Membrane Systems and Applications. In Gh. Ciobanu, Gh. Păun & M.J. Pérez-Jiménez (Eds.), Applications of Membrane Computing (pp. 347-388). Berlin: Springer.

Bel-Enguix, G. & Jiménez-López, M.D. (2005c). Analysing Sentences with Networks of Evolutionary Processors. In J. Mira & J.R. Álvarez (Eds.), Artificial Intelligence and Knowledge Engineering Applications: A Bioinspired Approach (pp. 102-111). LNCS 3562. Berlin: Springer.

Bel-Enguix, G. & Jiménez-López, M.D. (2006a). Cognitive Modeling with Networks of Evolutionary Processors. A Preview. In J. Multisita & H. Haaparanta (Eds.), Proceedings of the Workshop on Human Centered Technology HCT06 (pp. 268-275). Pori: Tampere University of Technology, Publication 6.

Bel-Enguix, G. & Jiménez-López, M.D. (2006b). Computing Dialogues with Membranes. Electronic Notes on Theoretical Computer Science, 157(4), 57-73.

Bel-Enguix, G. & Jiménez-López, M.D. (2007). Dynamic Meaning Membrane Systems: An Application to the Description of Semantic Change. Fundamenta Informaticae, 76(3), 219-237.

Bennett, C.H. (1973). Logical Reversibility of Computation. IBM Journal of Research Development, 17, 525-532.

Castellanos, J., Leupold, P. & Mitrana, V. (2005). Descriptional and Computational Complexity Aspects of Hybrid Networks of Evolutionary Processors. Theoretical Computer Science, 330(2), 205-220.

Castellanos, J., Martín-Vide, C., Mitrana, V. & Sempere, J.M. (2003). Networks of Evolutionary Processors. Acta Informatica, 39, 517-529.

Chomsky, N. (1956). Three Models for the Description of Language. IRE Transactions on Information Theory, 2, 113-124.

Collado Vides, J. (1989). A Transformation-Grammar Approach to the Study of Regulation of Gene Expression. Journal of Theoretical Biology, 136, 403-425.

Conrad, M. (1995). The Price of Programmability. In R. Herken (Ed.), The Universal Turing Machine: A Half-Century Survey (pp. 261-282). Wien: Springer.

Csuhaj-Varjú, E., Dassow, J., Kelemen, J. & Păun, Gh. (1994). Grammar Systems. London: Gordon and Breach.

Csuhaj-Varjú, E., Martín-Vide, C. & Mitrana, V. (2005). Hybrid Networks of Evolutionary Processors are Computationally Complete. Acta Informatica, 41(4-5), 257-272.

Csuhaj-Varjú, E. & Mitrana, V. (2000). Evolutionary Systems: A Language Generating Device Inspired by Evolving Communities of Cells. Acta Informatica, 36, 913-926.

Csuhaj-Varjú, E. & Salomaa, A. (1997). Networks of Parallel Language Processors. In Gh. Păun & A. Salomaa (Eds.), New Trends in Formal Languages (pp. 299-318). LNCS 1218. Berlin: Springer.

Errico, L. & Jesshope, C. (1994). Towards a New Architecture for Symbolic Processing. In I. Plander (Ed.), Artificial Intelligence and Information-Control Systems of Robots ’94 (pp. 31-40). World Scientific Publisher.

Feynman, R.P. (1961). There’s Plenty of Room at the Bottom. In D.H. Hilbert (Ed.), Miniaturization (pp. 282-296). Reinhold.

Forster, P. (1997). Network Analysis of Word Lists. In Third International Conference on Quantitative Linguistics (pp. 184-186). Helsinki: Research Institute for the Languages of Finland.

Hillis, W.D. (1985). The Connection Machine. Cambridge: MIT Press.

Jacob, F. (1970). La Logique du Vivant. Une histoire de l’hérédité. Paris: Gallimard.

Jakobson, R. (1970). Linguistics. In Main Trends of Research in the Social and Human Sciences (pp. 419-463). Paris: Mouton.

Jakobson, R. (1973). Essais de Linguistique Générale. 2. Rapports Internes et Externes du Langage. Paris: Les Éditions de Minuit.

Ji, S. (2002). Microsemiotics of DNA. Semiotica, 138(1/4), 15-42.

Marcus, S. (1998). Language at the Crossroad of Computation and Biology. In Gh. Păun (Ed.), Computing with Bio-Molecules (pp. 1-35). Singapore: Springer.

Martín-Vide, C., Mitrana, V., Pérez-Jiménez, M. & Sancho-Caparrini, F. (2003). Hybrid Networks of Evolutionary Processors. In E. Cantú-Paz et al. (Eds.), Genetic and Evolutionary Computation (Part I, pp. 401-412). LNCS 2723. Berlin: Springer.

Păun, Gh. (2000). Computing with Membranes. Journal of Computer and System Sciences, 61, 108-143.

Păun, Gh., Rozenberg, G. & Salomaa, A. (1998). DNA Computing. New Computing Paradigms. Berlin: Springer.

Pawlak, Z. (1965). Gramatyka i Matematyka. Warszawa: Państwowe Zakłady Wydawnictw Szkolnych.

Sakakibara, Y., Brown, M., Underwood, R., Saira Mian, I. & Haussler, D. (1994). Stochastic Context-Free Grammars for Modeling RNA. In Proceedings of the 27th Hawaii International Conference on System Sciences (pp. 284-283). Honolulu: IEEE Computer Society Press.

Schleicher, A. (1863). Die Darwinsche Theorie und die Sprachwissenschaft. Weimar.

Schmidt, J. (1872). Die Verwandtschaftsverhältnisse der Indogermanischen Sprachen. Weimar: Böhlau.

Searls, D. (1989). Investigating the Linguistics of DNA with Definite Clause Grammars. In E. Lusk & R. Overbeek (Eds.), Logic Programming: Proceedings of the North American Conference on Logic Programming (vol. 1, pp. 189-208). Association for Logic Programming.

Searls, D. (1993). The Linguistics of DNA. American Scientist, 80, 579-591.

Uemura, Y., Hasegawa, A., Kobayashi, S. & Yokomori, T. (1999). Tree Adjoining Grammars for RNA Structure Prediction. Theoretical Computer Science, 210(2), 277-303.

Watson, J.D. (1968). Biologie Moléculaire du Gène. Paris: Ediscience.

KEY TERMS

Grammar Systems Theory: A consolidated and active branch in the field of formal languages that provides syntactic models for describing multi-agent systems at the symbolic level using tools from formal languages and grammars.

Membrane Systems: In a membrane system, multisets of objects are placed in the compartments defined by the membrane structure that delimits the system from its environment. Each membrane identifies a region, the space between it and all directly inner membranes. Objects evolve by means of reaction rules associated with compartments and applied in a maximally parallel, nondeterministic manner. Objects can pass through membranes, and membranes can change their permeability, dissolve and divide.

Multi-Agent System: A system composed of a set of computational agents that perform local problem solving and cooperatively interact to solve a single problem (or reach a goal) that would be difficult for an individual agent to solve (or achieve).

Mutations: Several types of transformations in a single string.

Natural Computing: Research field that deals with computational techniques inspired by nature and natural systems. This type of computing includes evolutionary algorithms, neural networks, molecular computing and quantum computing.

Neural Network: Interconnected group of artificial neurons that uses a mathematical or a computational model for information processing based on a connectionist approach to computation. It involves a network of simple processing elements that can exhibit complex global behaviour.

Splicing: Operation which consists of splitting up two strings in an arbitrary way and sticking the left side of the first one to the right side of the second one (direct splicing), and the left side of the second one to the right side of the first one (inverse splicing).

Natural Language Understanding and Assessment

Vasile Rus, The University of Memphis, USA
Philip M. McCarthy, The University of Memphis, USA
Danielle S. McNamara, The University of Memphis, USA
Arthur C. Graesser, The University of Memphis, USA

INTRODUCTION

Natural language understanding and assessment is a subset of natural language processing (NLP). The primary purpose of natural language understanding algorithms is to convert written or spoken human language into representations that can be manipulated by computer programs. Complex learning environments such as intelligent tutoring systems (ITSs) often depend on natural language understanding for fast and accurate interpretation of human language so that the system can respond intelligently in natural language. These ITSs function by interpreting the meaning of student input, assessing the extent to which it manifests learning, and generating suitable feedback to the learner.

To operate effectively, systems need to be fast enough to operate in the real-time environments of ITSs. Delays in feedback caused by computational processing run the risk of frustrating the user and leading to lower engagement with the system. At the same time, the accuracy of assessing student input is critical because inaccurate feedback can potentially compromise learning and lower the student’s motivation and metacognitive awareness of the learning goals of the system (Millis et al., 2007). As such, student input in ITSs requires an assessment approach that is fast enough to operate in real time but accurate enough to provide appropriate evaluation.

One of the ways in which ITSs with natural language understanding verify student input is through matching. In some cases, the match is between the user input and a pre-selected stored answer to a question, solution to a problem, misconception, or other form of benchmark response. In other cases, the system evaluates the degree to which the student input varies from a complex representation or a dynamically computed structure. The computation of matches and similarity metrics is limited by the fidelity and flexibility of the computational linguistics modules.

The major challenge with assessing natural language input is that it is relatively unconstrained and rarely follows brittle rules in its computation of spelling, syntax, and semantics (McCarthy et al., 2007). Researchers who have developed tutorial dialogue systems in natural language have explored the accuracy of matching students’ written input to targeted knowledge. Examples of these systems are AutoTutor and Why-Atlas, which tutor students on Newtonian physics (Graesser, Olney, Haynes, & Chipman, 2005; VanLehn, Graesser, et al., 2007), and the iSTART system, which helps students read text at deeper levels (McNamara, Levinstein, & Boonthum, 2004). Systems such as these have typically relied on statistical representations, such as latent semantic analysis (LSA; Landauer, McNamara, Dennis, & Kintsch, 2007) and content word overlap metrics (McNamara, Boonthum, et al., 2007). Indeed, such statistical and word overlap algorithms can boast much success. However, over short dialogue exchanges (such as those in ITSs), the accuracy of interpretation can be seriously compromised without a deeper level of lexico-syntactic textual assessment (McCarthy et al., 2007). Such a lexico-syntactic approach, entailment evaluation, is presented in this chapter. The approach incorporates deeper natural language processing solutions for ITSs with natural language exchanges while remaining sufficiently fast to provide real-time assessment of user input.

BACKGROUND

Entailment evaluations help in the assessment of the appropriateness of student responses during ITS exchanges. Entailment can be distinguished from three similar terms (implicature, paraphrase, and elaboration), all of which are also important for assessment in ITS environments (McCarthy et al., 2007).

The term entailment is often associated with the highly similar concept of implicature. The distinction is that entailment is reserved for linguistic-based inferences that are closely tied to explicit words, syntactic constructions, and formal semantics, as opposed to the knowledge-based implied referents and references, for which the term implicature is more appropriate (McCarthy et al., 2007). Implicature corresponds to the controlled knowledge-based elaborative inferences defined by Kintsch (1993) or to knowledge-based inferences defined in the inference taxonomies in discourse psychology (Graesser, Singer, & Trabasso, 1994).

The terms paraphrase and elaboration also need to be distinguished from entailment. A paraphrase is a reasonable restatement of the text. Thus, a paraphrase is a form of entailment, yet an entailment is not necessarily a paraphrase. This asymmetric relation can be understood if we consider that John went to the store is entailed by (but not a paraphrase of) John drove to the store to buy supplies. The term elaboration refers to information that is generated inferentially or associatively in response to the text being analyzed, but without the systematic and sometimes formal constraints of entailment, implicature, or paraphrase. Examples of each term are provided below for the sentence John drove to the store to buy supplies.

Entailment: John went to the store. (Explicit, logical implication based on the text)
Implicature: John bought some supplies. (Implicit, reasonable assumption from the text, although not explicitly stated in the text)
Paraphrase: He took his car to the store to get things that he wanted. (Reasonable re-statement of all and only the critical information in the text)
Elaboration: He could have borrowed stuff. (Reasonable reaction to the text)


Evaluating entailment is generally referred to as the task of recognizing textual entailment (RTE; Dagan, Glickman, & Magnini, 2005). Specifically, it is the task of deciding, given two text fragments, whether the meaning of one can be logically inferred from the other. When it can, T (the entailing text) is said to entail H (the entailed hypothesis). For example, a text (from the RTE data) of Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year would entail a hypothesis of Yahoo bought Overture. The task of recognizing entailment is relevant to a large number of applications, including machine translation, question answering, and information retrieval.

The task of textual entailment has been a priority in investigations of information retrieval (Monz & de Rijke, 2001) and automated language processing (Pazienza, Pennacchiotti, & Zanzotto, 2005). In related work, Moldovan and Rus (2001) analyzed how to use unification and matching to address the answer correctness problem. Similar to entailment, answer correctness is the task of deciding whether candidate answers logically imply an ideal answer to a question.
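To make the RTE task concrete, the following is a minimal sketch of a naive content-word overlap baseline of the kind mentioned in the Introduction. It is not the approach of this chapter. The function names, the tiny stop-word list, and the tokenisation rule are illustrative assumptions. Applied to the Yahoo example above, it shows why overlap alone can miss an entailment that hinges on a lexical relation such as took over and bought.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "to", "by", "in", "on", "and"}


def content_words(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def overlap_score(text, hypothesis):
    """Fraction of the hypothesis content words that also occur in the text."""
    h_words = content_words(hypothesis)
    if not h_words:
        return 0.0
    return len(h_words & content_words(text)) / len(h_words)


T = ("Eyeing the huge market potential, currently led by Google, "
     "Yahoo took over search company Overture Services Inc last year")
H = "Yahoo bought Overture"

# "yahoo" and "overture" are found in T, "bought" is not.
score = overlap_score(T, H)
```

Here two of the three hypothesis words are matched, so the score stays below 1 even though T entails H; capturing the remaining word requires lexical or syntactic knowledge beyond surface overlap.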

THE LEXICO-SYNTACTIC ENTAILMENT APPROACH

A complete solution to the textual entailment challenge requires linguistic information, reasoning, and world knowledge (Rus, McCarthy, McNamara, & Graesser, in press). This chapter focuses on the role of linguistic information in making entailment decisions. The overall goal is to produce a light (i.e., computationally inexpensive) but accurate solution that could be used in interactive systems such as ITSs. Solutions that rely on processing-intensive deep representations (e.g., frame semantics and reasoning) and large structured repositories of information (e.g., ResearchCyc) are impractical for interactive tasks because they result in lengthy response times, causing user dissatisfaction.

One solution for recognizing textual entailment is based on subsumption. In general, an object X subsumes an object Y if and only if X is more general than or identical to Y. Applied to textual entailment, subsumption translates as follows: hypothesis H is entailed from T if and only if T subsumes H. The solution has two phases: (I) map both T and H into graph structures and (II) perform a subsumption operation between the T-graph and H-graph. An entailment score, entail(T,H), is computed, quantifying the degree to which the T-graph subsumes the H-graph.

In phase I, the two text fragments involved in a textual entailment decision are initially mapped onto a graph representation. The graph representation employed is based on the dependency-graph formalisms of Mel’cuk (1998). The mapping relies on information from syntactic parse trees. A phrase-based parser is used to derive the dependencies. Although a dependency parser could have been adopted, our particular research agenda required partial phrase parsers for other tasks such as computing cohesion metrics. Integrating both a phrase-based and a dependency parser would have led to a heavier, less interactive system. A parse tree groups words into phrases and organizes these phrases into hierarchical tree structures from which syntactic dependencies among concepts can be detected. The system uses Charniak’s (2000) parser to obtain parse trees and Magerman’s (1994) head-detection rules to obtain the head of each phrase. A dependency tree is generated by linking the head of each phrase to its modifiers in a systematic mapping process. The dependency tree encodes exclusively local dependencies (head-modifiers), as opposed to long-distance (remote) dependencies, such as the remote subject relation between bombers and enter in the sentence The bombers managed to enter the embassy compounds. The dependency tree is therefore transformed into a dependency graph by generating remote dependencies between content words. Remote dependencies are computed by a naive-Bayes functional tagger (Rus & Desai, 2005). An example of a dependency graph is shown in Figure 1 for the sentence The two objects will cover the same horizontal distance. For instance, there is a subject (subj) dependency relation between objects and cover.

Figure 1. An example of a dependency graph for the sentence The two objects will cover the same horizontal distance (edge labels: subj, obj, mod, nn, post)

In phase II, the textual entailment problem (i.e., each T and H) is mapped into a specific example of graph isomorphism called subsumption (also known as containment). Isomorphism in graph theory addresses the problem of testing whether two graphs are the same. A graph G = (V, E) consists of a set of nodes or vertices V and a set of edges E. Graphs can be used to model the linguistic information embedded in a sentence: vertices represent concepts (e.g., bombers, joint venture) and edges represent syntactic relations among concepts (e.g., the edge labeled subj connects the verb cover to its subject objects in Figure 1). The Text (T) entails the Hypothesis (H) if and only if the hypothesis graph is subsumed (or contained) by the text graph.

The subsumption algorithm for textual entailment (Rus et al., in press) has three major steps: (1) find an isomorphism between V_H (the set of vertices of the Hypothesis graph) and V_T; (2) check whether the labeled edges in H, E_H, have correspondents in E_T; and (3) compute the score. In step 1, for each vertex in V_H, a correspondent node in V_T is sought. If a vertex in H does not have a direct correspondent in T, a thesaurus is used to find all possible synonyms for vertices. Step 2 takes each relation in H and checks its presence in T. The checking is augmented with relation equivalences among linguistic phenomena such as possessives and linking verbs (e.g., be, have). For instance, tall man would be equivalent to man is tall. A normalized score for vertex and edge mapping is then computed. The score for the entire entailment is the sum of each individual vertex and edge matching score.

Finally, the score must account for negation. The approach handles both explicit and implicit negation. Explicit negation is indicated by particles such as no, not, neither ... nor and the shortened form n’t. Implicit negation is present in text via deeper lexico-semantic relations among linguistic expressions. The most obvious example is the antonymy relation among words, which is retrieved from WordNet (Miller, 1995). Negation is accommodated in the score after making the entailment decision for the Text-Hypothesis pair (without negation). If exactly one of the two text fragments is negated, the decision is reversed; if both are negated, the decision is retained (double negation), and so forth.
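The three steps and the negation rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a graph is assumed to be a pair (set of lemmatised vertices, set of (head, label, dependent) edges), the thesaurus is a plain dictionary, the equal weights and the 0.5 threshold are arbitrary, and only explicit negation particles are handled. All function and variable names are hypothetical.

```python
# Illustrative sketch only: data layout, weights and threshold are assumptions,
# not the settings of the system described in this chapter.
NEGATION_PARTICLES = {"no", "not", "neither", "nor", "n't"}


def match_vertex(h_vertex, t_vertices, synonyms):
    """Step 1: find a T vertex equal to h_vertex, or to one of its synonyms."""
    if h_vertex in t_vertices:
        return h_vertex
    for candidate in synonyms.get(h_vertex, ()):
        if candidate in t_vertices:
            return candidate
    return None


def entail_score(t_graph, h_graph, synonyms=None, w_vertex=0.5, w_edge=0.5):
    """Degree (between 0 and 1) to which the T-graph subsumes the H-graph."""
    synonyms = synonyms or {}
    t_vertices, t_edges = t_graph
    h_vertices, h_edges = h_graph
    mapping = {v: match_vertex(v, t_vertices, synonyms) for v in h_vertices}
    matched_vertices = sum(1 for m in mapping.values() if m is not None)
    # Step 2: an H edge is matched if the same labelled edge links the
    # mapped vertices in T.
    matched_edges = sum(
        1 for head, label, dep in h_edges
        if (mapping.get(head), label, mapping.get(dep)) in t_edges
    )
    # Step 3: normalised vertex and edge scores, combined as a weighted sum.
    vertex_score = matched_vertices / len(h_vertices) if h_vertices else 1.0
    edge_score = matched_edges / len(h_edges) if h_edges else 1.0
    return w_vertex * vertex_score + w_edge * edge_score


def is_negated(tokens):
    """Explicit negation only: true if any negation particle is present."""
    return any(tok in NEGATION_PARTICLES for tok in tokens)


def entails(t_graph, h_graph, t_tokens, h_tokens, synonyms=None, threshold=0.5):
    """Decide without negation first, then reverse if exactly one side is negated."""
    decision = entail_score(t_graph, h_graph, synonyms) >= threshold
    if is_negated(t_tokens) != is_negated(h_tokens):
        decision = not decision
    return decision


# Toy example: "The tall man bought a car" vs. "The man purchased a car".
T_GRAPH = ({"man", "tall", "buy", "car"},
           {("buy", "subj", "man"), ("buy", "obj", "car"), ("man", "mod", "tall")})
H_GRAPH = ({"man", "purchase", "car"},
           {("purchase", "subj", "man"), ("purchase", "obj", "car")})
SYNONYMS = {"purchase": {"buy"}}
```

In the toy example every hypothesis vertex and edge finds a correspondent in the text graph once purchase is mapped to buy, so the score is 1.0. Adding not to the hypothesis tokens reverses the decision. Relation equivalences (e.g., tall man and man is tall) and implicit negation via antonymy are deliberately left out of this sketch.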

Entailment for Intelligent Tutoring Systems

The problem of evaluating student input in ITSs with natural language understanding is modeled here as a textual entailment problem. Results of this approach are shown on data sets from two ITSs: AutoTutor and iSTART. Data from the AutoTutor experiments involve college students learning Newtonian physics, whereas data from iSTART involve adolescent and college students constructing explanations about science texts.

AutoTutor

AutoTutor (autotutor.org) teaches topics such as Newtonian physics, computer literacy, and critical thinking by holding a dialogue in natural language with the student. The system presents deep-reasoning questions to the student that call for explanations or other elaborate answers. AutoTutor has a list of anticipated good answers (or expectations) and a list of misconceptions associated with each main question. AutoTutor guides the student in articulating the expectations through a number of dialogue moves and adaptively responds to the student by giving short feedback on the quality of student contributions.

To understand how the entailment approach helps to assess the appropriateness of student responses in AutoTutor, consider the following AutoTutor problem: Suppose a runner is running in a straight line at constant speed, and the runner throws a pumpkin straight up. Where will the pumpkin land? Explain why. An expectation for this problem is The object will continue to move at the same horizontal velocity as the person when it is thrown. A real student answer is The pumpkin and the runner have the same horizontal velocity before and after release. The expert judgment of this response was very good. Such expectation/student-input (E-S) pairs can be viewed as Text-Hypothesis entailment pairs. The task is to find the truth value of the student answer based on the true fact encoded in the expectation.

Rus and Graesser (2006) examined how the lexico-syntactic system described in the previous section performed on a test set of 125 E-S pairs collected from a sample of AutoTutor tutorial dialogues. The lexico-syntactic approach provided the best accuracy (69%), whereas a Latent Semantic Analysis (LSA; Landauer et al., 2007) approach yielded an accuracy of 60%. Such a result illustrates the value of augmenting AutoTutor with lexico-syntactic natural language understanding.

iSTART (Interactive Strategy Trainer for Active Reading and Thinking)

The primary goal of iSTART (istartreading.com) is to help high school and college students learn to use reading comprehension strategies that support deeper understanding. iSTART’s design combines the power of self-explanation in facilitating deep learning (McNamara et al., 2004) with content-sensitive, interactive strategy training. The iSTART system helps students learn to self-explain using a variety of reading strategies (e.g., rewording the text, or paraphrasing; or elaborating on the text by linking textual content to what the reader already knows). The final stage of the iSTART process requires students to self-explain sentences from two short passages. Scaffolded feedback is provided to the students based on the quality of the student responses.

The entailment evaluation has been used in two iSTART studies. In Rus et al. (2007), a corpus of iSTART self-explanation responses was evaluated by an array of textual evaluation measures. The results demonstrated that the entailment approach was the most powerful distinguishing index of the self-explanation categories (Entailer: F(1,1228) = 25.05, p < .001; LSA: F(1,1228) = 2.98, p > .01). In McCarthy et al. (2007), iSTART self-explanations were hand-coded for degree of entailment, paraphrase, and elaboration. Once again, the entailment evaluation proved to be a more powerful predictor of these categories than traditional measures: for entailment, the Entailer was a significant predictor (t = 9.61, p < .001) and LSA was a marginal predictor (t = -1.90, p = .061); for elaboration and for paraphrase the Entailer was again a significant predictor (t = -7.98, p < .001; t = 5.62, p < .001, respectively), whereas LSA results were not significant.

FUTURE TRENDS

While the results of the entailment evaluation have been encouraging, several extensions of the approach are under way. For example, there are plans to weight words by their specificity and to learn syntactic patterns or transformations that lead to similar meanings. The current negation detection algorithm will be extended to assess plausible implicit forms of negation in words such as denied, denies, without, ruled out. A second extension addresses issues of relative opposites: knowing that an object is not hot does not entail that the object is cold (i.e., it could simply be warm).

CONCLUSION

Recognizing and assessing textual entailment is a prominent and challenging task in the fields of Natural Language Processing and Artificial Intelligence. This chapter presented a lexico-syntactic approach to the task of evaluating entailment. The approach is light, using minimal knowledge resources, yet it has delivered high performance in evaluations of three data sets involving natural language interactions in ITSs. The entailment approach is a promising step in achieving the goal of fast and effective evaluation of student contributions in short text exchanges, which is needed to provide optimal feedback and responses to student learners.

ACKNOWLEDGMENT

This research was partially supported by the National Science Foundation (REC 106965, ITR 0325428, REESE 0633918), and by the Institute for Education Sciences (IES R305G020018-02). Any opinions, findings, conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of The University of Memphis, the NSF, or the IES.

REFERENCES

Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings ANLP-NAACL’2000, Seattle, Washington, 132-139.
Dagan, I., Glickman, O., & Magnini, B. (2005). The PASCAL recognising textual entailment challenge. In Proceedings of the Recognizing Textual Entailment Challenge Workshop, Southampton, U.K., 1-8.
Graesser, A. C., Olney, A., Haynes, B. C., & Chipman, P. (2005). AutoTutor: A cognitive system that simulates a tutor that facilitates learning through mixed-initiative dialogue. In C. Forsythe, M. L. Bernard, & T. E. Goldsmith (Eds.), Cognitive Systems: Human Cognitive Models in Systems Design. Mahwah, NJ: Erlbaum.
Graesser, A. C., Singer, M., & Trabasso, T. (1994). Constructing inferences during narrative text comprehension. Psychological Review, 101, 371-395.
Kintsch, W. (1993). Information accretion and reduction in text processing: Inferences. Discourse Processes, 16, 193-202.
Landauer, T., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.) (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum.
Magerman, D. M. (1994). Natural language parsing as statistical pattern recognition. Ph.D. thesis, Stanford University, February.
McCarthy, P. M., Rus, V., Crossley, S. A., Bigham, S. C., Graesser, A. C., & McNamara, D. S. (2007). Assessing entailer with a corpus of natural language. In Proceedings of the Florida Artificial Intelligence Research Society International Conference (FLAIRS) (pp. 247-252). Menlo Park, CA: AAAI Press.
McNamara, D. S., Boonthum, C., Levinstein, I. B., & Millis, K. (2007). Evaluating self-explanations in iSTART: Comparing word-based and LSA algorithms. In T. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 227-241). Mahwah, NJ: Erlbaum.
McNamara, D. S., Levinstein, I. B., & Boonthum, C. (2004). iSTART: Interactive strategy trainer for active reading and thinking. Behavior Research Methods, Instruments, & Computers, 36, 222-233.
Miller, G. (1995). WordNet: A lexical database for English. Communications of the ACM, 38, 39-41.
Millis, K., Magliano, J., Wiemer-Hastings, K., Todaro, S., & McNamara, D. S. (2007). Assessing and improving comprehension with Latent Semantic Analysis. In T. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 207-225). Mahwah, NJ: Erlbaum.
Mel’cuk, I. A. (1998). Dependency syntax: Theory and practice. Albany, NY: State University of New York Press.
Moldovan, D. I. & Rus, V. (2001). Logic form transformation of WordNet and its applicability to question answering. In Proceedings of the ACL 2001 Conference, Toulouse, France, 394-401.
Monz, C. & de Rijke, M. (2001). Light-weight entailment checking for computational semantics. In P. Blackburn & M. Kohlhase (Eds.), Proceedings of Inference in Computational Semantics (ICoS-3), Siena, Italy (pp. 59-72).
Pazienza, M. T., Pennacchiotti, M., & Zanzotto, F. M. (2005). Textual entailment as syntactic graph distance: A rule based and SVM based approach. In Proceedings of the Recognizing Textual Entailment Challenge Workshop, Southampton, U.K., April 11-13, 25-28.
Rus, V. & Desai, K. (2005). Assigning function tags with a simple model. In Proceedings of Conference on Intelligent Text Processing and Computational Linguistics (CICLing), Mexico City, Mexico, 112-115.
Rus, V., McCarthy, P. M., McNamara, D. S., & Graesser, A. C. (in press). A study of textual entailment. International Journal on Artificial Intelligence Tools.
Rus, V., Graesser, A. C., & Desai, K. (2005). Lexico-syntactic subsumption for textual entailment. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2005), Borovets, Bulgaria, 444-452.
Rus, V., McCarthy, P. M., McNamara, D. S., & Graesser, A. C. (2007). Assessing student self-explanations in an intelligent tutoring system. In D. S. McNamara & J. G. Trafton (Eds.), Proceedings of the 29th Annual Cognitive Science Society. Austin, TX: Cognitive Science Society.
VanLehn, K., Graesser, A. C., Jackson, G. T., Jordan, P., Olney, A., & Rose, C. P. (2007). When are tutorial dialogues more effective than reading? Cognitive Science, 31, 3-62.

KEY TERMS

Dependency: A binary relation between two words in a sentence whose label indicates the syntactic relation between the two words.

Entailment: The task of deciding whether one text fragment logically or semantically implies another text fragment.

Expectation: A stored (generally ideal) answer to a problem, against which input is evaluated; concept used in ITSs.

Graph Subsumption: A specific example of graph isomorphism. Isomorphism exists when two graphs are equivalent. Subsumption can be viewed as subgraph isomorphism.

Intelligent Tutoring System: Interactive, feedback-based computer systems designed to help students learn various topics.

Latent Semantic Analysis: A statistical technique for human language understanding based on words that co-occur in documents of large corpora.

Natural Language Processing: The science of capturing the meaning of human language in computational representations and algorithms.

Natural Language Understanding and Assessment: An NLP subset focusing on evaluating natural language input in intelligent tutoring systems.

Syntactic Parsing: The process of discovering the underlying structure of sentences.


Navigation by Image-Based Visual Homing

Matthew Szenher, University of Edinburgh, UK

INTRODUCTION

Almost all autonomous robots need to navigate. We define navigation as do Franz & Mallot (2000): “Navigation is the process of determining and maintaining a course or trajectory to a goal location” (p. 134). We allow that this definition may be more restrictive than some readers are used to (it does not, for example, include problems such as obstacle avoidance and position tracking), but it suits our purposes here.

Most algorithms published in the robotics literature localise in order to navigate (see e.g. Leonard & Durrant-Whyte (1991a)). That is, they determine their own location and the position of the goal in some suitable coordinate system. This approach is problematic for several reasons. Localisation requires a map of available landmarks (i.e. a list of landmark locations in some suitable coordinate system) and a description of those landmarks. In early work, the human operator provided the robot with a map of its environment. More recently, though, researchers have developed simultaneous localisation and mapping (SLAM) algorithms which allow robots to learn environmental maps while navigating (Leonard & Durrant-Whyte (1991b)). Of course, autonomous SLAM algorithms must choose which landmarks to map and sense these landmarks from a variety of different positions and orientations. Given a map, the robot has to associate sensed landmarks with those on the map. This data association problem is difficult in cluttered real-world environments and is an area of active research.

We describe in this chapter an alternative approach to navigation called visual homing which makes no explicit attempt to localise and thus requires no landmark map. There are broadly two types of visual homing algorithms: feature-based and image-based. The feature-based algorithms, as the name implies, attempt to extract the same features from multiple images and use the change in the appearance of corresponding features to navigate. Feature correspondence is, like data association, a difficult, open problem in real-world environments. We argue that image-based homing algorithms, which provide navigation information based on whole-image comparisons, are more suitable for real-world environments in contemporary robotics.

BACKGROUND

Visual homing algorithms make no attempt to localise in order to navigate. No map is therefore required. Instead, an image I_S (usually called a snapshot for historical reasons) is captured at a goal location S = (x_S, y_S). Note that though S is defined as a point on a plane, most homing algorithms can be easily extended to three dimensions (see e.g. Zeil et al. (2003)). When a homing robot seeks to return to S from a nearby position C = (x_C, y_C), it takes an image I_C and compares it with I_S. The home vector H = S - C is inferred from the disparity between I_S and I_C (vectors are in upper case and bold in this work). The robot’s orientation at C and S is often different; if this is the case, image disparity is meaningful only if I_C is rotated to account for this difference. Visual homing algorithms differ in how this disparity is computed.

Visual homing is an iterative process. The home vector H is frequently inaccurate, leading the robot closer to the goal position but not directly to it. If H does not take the robot to the goal, another image I_C is taken at the robot’s new position and the process is repeated.

The images I_S and I_C are typically panoramic grayscale images. Panoramic images are useful because, for a given location (x, y), they contain the same image information regardless of the robot’s orientation. Most researchers use a camera imaging a hemispheric, conical or paraboloid mirror to create these images (see e.g. Nayar (1997)).

Some visual homing algorithms extract features from I_S and I_C and use these to compute image disparity. Alternatively, disparity can be computed from entire images, essentially treating each pixel as a viable feature. Both feature-based and image-based visual homing algorithms are discussed below.
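The iterative nature of homing can be illustrated with a short simulation. The sketch below is hypothetical throughout: the image comparison is replaced by a stand-in that returns the true home direction rotated by a fixed angular error, and the arrival test uses the true distance to the goal where a real robot would use an image-similarity threshold. It shows only that an inaccurate home vector, recomputed after every step, can still bring the robot to the goal.

```python
import math


def rotate(v, angle):
    """Rotate the 2D vector v counter-clockwise by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def estimated_home_direction(current, goal, angular_error):
    """Stand-in for image comparison: unit vector toward the goal,
    corrupted by a fixed angular error (an assumption for illustration)."""
    dx, dy = goal[0] - current[0], goal[1] - current[1]
    norm = math.hypot(dx, dy)
    return rotate((dx / norm, dy / norm), angular_error)


def home(start, goal, step=0.1, tol=0.1,
         angular_error=math.radians(30), max_iter=200):
    """Repeat: estimate a home direction, move one step, re-estimate."""
    pos = start
    path = [pos]
    for _ in range(max_iter):
        # Arrival test; a real system would compare I_C with I_S instead.
        if math.hypot(goal[0] - pos[0], goal[1] - pos[1]) < tol:
            break
        d = estimated_home_direction(pos, goal, angular_error)
        pos = (pos[0] + step * d[0], pos[1] + step * d[1])
        path.append(pos)
    return path


path = home(start=(2.0, 1.0), goal=(0.0, 0.0))
```

With a 30 degree error the robot follows a curved path but the distance to the goal shrinks at each step, which is why homing tolerates imprecise home vectors as long as they point roughly toward the goal.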



feATURe-BASeD VISUAl HOmING Feature-based visual homing methods segment IS and IC into features and background (the feature extraction problem). Each identified feature in the snapshot is then usually paired with one feature in IC (the correspondence problem). The home vector is inferred from - depending on the algorithm - the change in the bearing and/or apparent size of the paired features. Generally, in order for feature-based homing algorithms to work properly, they must reliably solve the feature extraction and correspondence problems. The Snapshot Model (Cartwright & Collett (1983)) - the first visual homing algorithm to appear in the literature and the source of the term “snapshot” to describe the goal image - matches each snapshot feature with the current feature closest in bearing (after both images are rotated to the same external compass orientation). Features in (Cartwright & Collett (1983)) were black cylinders in an otherwise empty environment. Two unit vectors, one radial and the other tangential, are associated with each feature pair. The radial vector is parallel to the bearing of the snapshot feature; the tangential vector is perpendicular to the radial vector. The direction of the radial vector is chosen to move the agent so as to reduce the discrepancy in apparent size between paired features. The direction of the tangential vector is chosen to move the agent so as to reduce the discrepancy in bearing between paired features. The radial and tangential vectors for all feature pairs are averaged to produce a homing vector. The Snapshot Model was devised to explain the behaviour of nest-seeking honeybees but has inspired several robotic visual homing algorithms. One such algorithm is the Average Landmark Vector (ALV) Model (Möller et al. (2001)). The ALV Model, like the Snapshot Model, extracts features from both IC and IS. The ALV Model, though, does not explicitly solve the correspondence problem. 
Instead, given features extracted from IS, the algorithm computes and stores a unit vector ALVS in the direction of the mean bearing to all features as seen from S. At C, the algorithm extracts features from IC and computes their mean bearing, encoded in the unit vector ALVC. The home vector H is defined as ALVC - ALVS. Figure 1 illustrates home vector computation for a simple environment with four easily discernible landmarks.

Several other interesting feature-based homing algorithms can be found in the literature. Unfortunately, space constraints prevent us from reviewing them here. Two algorithms of note are visual homing by “surfing the epipoles” (Basri et al., 1998) and the Proportional Vector Model (Lambrinos et al., 2000).

The Snapshot and ALV Models were tested by their creators in environments in which features contrasted highly with the background and so were easy to extract. How are feature extraction and correspondence solved in real-world cluttered environments? One method is described in Gourichon et al. (2002). The authors use images converted to the HSV (Hue-Saturation-Value) colour space, which is reported to be more resilient to illumination change than RGB. Features are defined as image regions of approximately equal colour (identified using a computationally expensive region-growing technique). Potential feature pairs are scored on their difference in average hue, average saturation, average intensity and bearing. The algorithm searches for a set of pairings which maximises the sum of individual match scores. The pairing scheme requires O(n^2) pair-score computations (where n is the number of features). The algorithm is sometimes fooled by features with similar colours (specifically, pairing a blue chair in the snapshot image with a blue door in the current image). Gourichon et al. did not explore environments with changing lighting conditions.

Several other feature extraction and correspondence algorithms appear in the literature; see e.g. Rizzi et al. (2001), Lehrer & Bianco (2000) and Gaussier et al. (2000). Many of these suffer from some of the same problems as the algorithm of Gourichon et al. described above. The appearance of several competing feature extraction and correspondence algorithms in recent publications indicates that these are open and difficult problems; this is why we are advocating image-based homing in this chapter.

Figure 1. Illustration of Average Landmark Vector computation. See Section titled “Feature-based Visual Homing” for details.
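The ALV home vector defined in this section can be illustrated with a minimal sketch. It assumes that landmark bearings have already been extracted and expressed in a shared compass frame; the function names and the landmark layout are hypothetical and are not taken from Möller et al. (2001).

```python
import math

def average_landmark_vector(bearings):
    """Unit vector along the mean bearing of the landmarks (angles in radians)."""
    x = sum(math.cos(b) for b in bearings)
    y = sum(math.sin(b) for b in bearings)
    norm = math.hypot(x, y)
    if norm == 0.0:
        raise ValueError("mean bearing undefined: landmark directions cancel out")
    return (x / norm, y / norm)

def bearings_from(position, landmarks):
    """Compass bearings from `position` to each landmark (illustrative helper)."""
    px, py = position
    return [math.atan2(ly - py, lx - px) for lx, ly in landmarks]

def home_vector(alv_current, alv_snapshot):
    """H = ALV_C - ALV_S, as defined in the text."""
    return (alv_current[0] - alv_snapshot[0], alv_current[1] - alv_snapshot[1])

# Hypothetical environment with four landmarks around the workspace.
landmarks = [(0.0, 10.0), (10.0, 0.0), (0.0, -10.0), (-10.0, 0.0)]
snapshot_position = (2.0, 0.0)   # S
current_position = (-2.0, 0.0)   # C

alv_s = average_landmark_vector(bearings_from(snapshot_position, landmarks))
alv_c = average_landmark_vector(bearings_from(current_position, landmarks))
h = home_vector(alv_c, alv_s)
```

In this layout the resulting vector points along the positive x axis, that is, from C towards S. No feature pairing is performed at any point, which is the property that distinguishes the ALV Model from the Snapshot Model.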


IMAGE-BASED VISUAL HOMING

Feature-based visual homing algorithms require consistent feature extraction and correspondence over a variety of viewing positions. Both of these are still open problems in computer vision, and existing solutions are often computationally intensive. Image-based visual homing algorithms avoid these problems altogether. They infer image disparity from entire images; no pixel is disregarded. We believe that these algorithms present a more viable option for real-world, real-time robotics. Three image-based visual homing algorithms have been published so far; we describe these below.

Image Warping

The image warping algorithm (Franz et al., 1998) asks the following question: when the robot is at C in some unknown orientation, what change in orientation and position is required to transform IC into IS? The robot needs to know the distance to all imaged objects in IS to answer this question precisely. Not having this information, the image warping algorithm makes the assumption that all objects are at an equal (though unknown) distance from S. The algorithm searches for the values of position and orientation change which minimise the mean-square error between a transformed IC and IS. Since the mean-square error function is rife with local minima, the authors resort to a brute-force search over all permissible values of position and orientation change. Unlikely as the equal-distance assumption is, the algorithm frequently yields quite accurate values for H. Unlike most visual homing schemes, image warping requires no external compass reference. Unfortunately, the brute-force search for the homing vector and the large number of transformations of IC carried out during this search make image warping quite computationally expensive.
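The flavour of the brute-force search can be conveyed with a deliberately reduced sketch. The position-dependent warp of Franz et al. (1998) is not reproduced in this article, so the code below searches over orientation change only, on a hypothetical one-dimensional panoramic intensity row; the function names are illustrative assumptions.

```python
def mean_square_error(a, b):
    """Mean-square error between two equal-length intensity rows."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def rotate(image, shift):
    """Rotate a 1-D panoramic row by `shift` pixels; columns wrap around."""
    shift %= len(image)
    return image[shift:] + image[:shift]

def best_rotation(current, snapshot):
    """Try every rotation of `current`; return (shift, error) with the lowest error."""
    candidates = (
        (s, mean_square_error(rotate(current, s), snapshot))
        for s in range(len(current))
    )
    return min(candidates, key=lambda pair: pair[1])

# Hypothetical data: the current row is the snapshot row seen after a turn.
snapshot = [0, 1, 4, 9, 3, 2, 7, 5]
current = rotate(snapshot, 3)
shift, error = best_rotation(current, snapshot)
```

The full algorithm repeats this kind of exhaustive evaluation over position change as well, which is why its cost grows quickly with the resolution of the search grid.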

Homing with Optic Flow Techniques

When an imaging system moves from S to C, the image of a particular point in space moves from IS(x, y) to IC(x′, y′). This movement is called optic flow, and (x - x′, y - y′) is the so-called pixel displacement vector. Vardy & Möller (2005) demonstrate that the home vector H can be inferred from a single displacement vector so long as the navigating robot is constrained to move on a single plane. Several noisy displacement vectors can be combined to estimate H.

Vardy & Möller (2005) describe a number of methods, adapted from the optic flow literature, to estimate the displacement vector. One of the most successful methods, BlockMatch, segments the snapshot image into several equal-sized subimages. The algorithm then performs a brute-force search of a subset of IC to find the best match for each subimage. A displacement vector is computed from the centre of each subimage to the centre of its matching subimage in IC. A less computationally intensive algorithm estimates the displacement vector from the intensity gradient at each pixel in IC. The intensity gradient at a particular pixel can be computed straightforwardly from the intensities surrounding that pixel. No brute-force search is required.

In comparative tests, Vardy & Möller demonstrated that their optic flow based methods perform consistently better than image warping in several unadulterated indoor environments. A drawback of the optic flow homing methods is that the robot is constrained to move on a single plane. The authors do not provide a way to extend their algorithm to three-dimensional visual homing.
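The block-matching idea described above can be sketched as follows. This is a simplified illustration and not the BlockMatch implementation of Vardy & Möller (2005): images are plain nested lists, the match cost is a sum of squared differences, and the parameter names (`size`, `radius`) are assumptions.

```python
def block(image, top, left, size):
    """Extract the size x size subimage whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]

def sum_squared_diff(a, b):
    """Sum of squared pixel differences between two equal-sized blocks."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def block_match(snapshot, current, top, left, size, radius):
    """Offset (dy, dx) from a snapshot block to its best match in `current`.

    Only offsets within `radius` pixels are searched, i.e. a subset of the
    current image. The offset equals the centre-to-centre displacement.
    """
    target = block(snapshot, top, left, size)
    height, width = len(current), len(current[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue  # candidate block would fall outside the image
            cost = sum_squared_diff(target, block(current, t, l, size))
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best
```

The sign convention of the returned offset is a choice made for this sketch; it can be negated to match the (x - x′, y - y′) form used in the text. One such vector is obtained per subimage, and the resulting set of noisy vectors is then combined to estimate H.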

Surfing the Difference Surface

Zeil et al. (2003) describe a property of natural scenes which can be exploited for visual homing: as the Euclidean distance between S and C increases, the pixel-by-pixel root mean square (RMS) difference between IS and IC increases smoothly and monotonically. Labrosse and Mitchell discovered this phenomenon as well; see Mitchell & Labrosse (2004). Zeil et al. reported that the increase in the RMS signal was discernible from noise up to about three meters from S in their outdoor test environment; they call this region the catchment area. RMS, when evaluated at locations in a subset of the plane surrounding S, forms a mathematical surface, the difference surface. A sample difference surface is shown in Figure 2(a) (see caption for details).

Zeil et al. describe a simple algorithm to home using the RMS difference surface. Their “Run-Down” algorithm directs the robot to move in its current direction while periodically sampling the RMS signal. When the current sample is greater than the previous one, the robot is made to stop and turn ninety degrees (clockwise or counter-clockwise, it does not matter). It then repeats the process in this new direction. The agent stops when the RMS signal falls below a pre-determined threshold. We have explored a biologically inspired difference surface homing method which was more successful than “Run-Down” in certain situations (Zampoglou et al., 2006).

Unlike the optic flow methods described in the previous section, visual homing by optimising the difference surface is easily extensible to three dimensions (Zeil et al., 2003). Unfortunately, when lighting conditions change between capture of IS and IC, the minimum of the RMS difference surface often fails to coincide with S, making homing impossible (Figure 2(b)).

Figure 2. Two difference surfaces formed using the RMS image similarity measure. Both the surfaces and their contours are shown. In each case, the snapshot IS was captured at x=150cm, y=150cm in a laboratory environment. (a) The snapshot was captured in the same illumination conditions as all other images. Notice the global minimum at the goal location and the absence of local minima. (b) Here we use the same snapshot image as in (a) but the lighting source has changed in all other images. The global minimum no longer appears at the goal location. When different goal locations were used, we observed qualitatively similar disturbances in the difference surfaces formed. The images used were taken from a database provided by Andrew Vardy which is described in Vardy & Möller (2005).
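The RMS measure and the “Run-Down” procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: `sample_rms` is a hypothetical callback standing in for "capture an image here and compare it with the snapshot", and the step size, threshold and step limit are illustrative values that do not come from Zeil et al. (2003).

```python
import math

def rms_difference(image_a, image_b):
    """Pixel-by-pixel RMS difference between two equal-sized images (flat lists)."""
    return math.sqrt(
        sum((p - q) ** 2 for p, q in zip(image_a, image_b)) / len(image_a)
    )

def run_down(sample_rms, start, heading=0.0, step=1.0, threshold=0.5,
             max_steps=10_000):
    """Move straight while the RMS signal falls; turn ninety degrees when it rises.

    Stops once the most recent sample is below `threshold`, or after
    `max_steps` moves (a safety limit added for this sketch).
    """
    x, y = start
    previous = sample_rms(x, y)
    for _ in range(max_steps):
        if previous < threshold:
            break
        x += step * math.cos(heading)
        y += step * math.sin(heading)
        current = sample_rms(x, y)
        if current > previous:
            heading += math.pi / 2  # the turn direction is arbitrary
        previous = current
    return x, y

# Stand-in difference surface: smooth, monotonic, minimum at the goal (150, 150).
def synthetic_rms(x, y):
    return math.hypot(x - 150.0, y - 150.0)

final_position = run_down(synthetic_rms, start=(140.0, 145.0), threshold=1.5)
```

With a fixed step size the agent can end up circling the goal at a distance comparable to one step, so in this sketch the threshold has to be chosen with the step size in mind.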

FUTURE TRENDS

No work has yet been published comparing the efficacy of the image-based homing algorithms described above. This would seem the logical next step for image-based homing researchers. As we mentioned in the section titled “Surfing the Difference Surface,” the difference surface is disrupted by changes in lighting between captures of IS and IC. This problem obviously demands a solution and is a focus of our current research. Finally, it would be interesting to compare standard map-based navigation algorithms with the image-based visual homing methods presented here.

CONCLUSION

Visual homing algorithms, unlike most of the navigation algorithms found in the robotics literature, do not require a detailed map of their environment. This is because they make no attempt to explicitly infer their location with respect to the goal. These algorithms instead infer the home vector from the discrepancy between a stored snapshot image taken at the goal position and an image captured at their current location. We reviewed two types of visual homing algorithms: feature-based and image-based. We argued that image-based algorithms are preferable because they make no attempt to solve the tough problems of consistent feature extraction and correspondence, which feature-based algorithms must solve. Of the three image-based algorithms reviewed, image warping is probably not practicable due to the computationally demanding brute-force search required. Work is required to determine which of the two remaining image-based algorithms is more effective for robot homing in real-world environments.

REFERENCES

Basri, R., Rivlin, E., & Shimshoni, I. (1998). Visual homing: Surfing on the epipoles. In The Proceedings of the Sixth International Conference on Computer Vision (pp. 863-869).

Cartwright, B., & Collett, T. (1983). Landmark learning in bees. Journal of Comparative Physiology, 151, 521-543.

Franz, M., & Mallot, H. (2000). Biomimetic robot navigation. Robotics and Autonomous Systems, 30, 133-153.

Franz, M., Schölkopf, B., Mallot, H., & Bülthoff, H. (1998). Where did I take that snapshot? Scene-based homing by image matching. Biological Cybernetics, 79, 191-202.

Gaussier, P., Joulain, C., Banquet, J., Leprêtre, S., & Revel, A. (2000). The visual homing problem: An example of robotics/biology cross fertilization. Robotics and Autonomous Systems, 30, 155-180.

Gourichon, S., Meyer, J., & Pirim, P. (2002). Using colored snapshots for short-range guidance in mobile robots. International Journal of Robotics and Automation: Special Issue on Biologically Inspired Robotics, 17 (4), 154-162.

Lambrinos, D., Möller, R., Labhart, T., Pfeifer, R., & Wehner, R. (2000). A mobile robot employing insect strategies for navigation. Robotics and Autonomous Systems, 30, 39-64.

Lehrer, M., & Bianco, G. (2000). The turn-back-and-look behaviour: Bee versus robot. Biological Cybernetics, 83, 211-229.

Leonard, J. J., & Durrant-Whyte, H. F. (1991a). Mobile robot localization by tracking geometric beacons. IEEE Transactions on Robotics and Automation, 7 (3), 376-382.

Leonard, J. J., & Durrant-Whyte, H. F. (1991b). Simultaneous map building and localization for an autonomous mobile robot. In Proceedings of the IEEE International Workshop on Intelligent Robots and Systems (pp. 1442-1447). Osaka, Japan.

Möller, R., Lambrinos, D., Roggendorf, T., Pfeifer, R., & Wehner, R. (2001). Insect strategies of visual homing in mobile robots. In B. Webb & T. R. Consi (Eds.), Biorobotics: Methods and Applications (pp. 37-66). The MIT Press, Cambridge, Massachusetts.

Mitchell, T., & Labrosse, F. (2004). Visual homing: A purely appearance-based approach. In Proceedings of TAROS (Towards Autonomous Robotic Systems). The University of Essex, UK.

Nayar, S. K. (1997). Omnidirectional video camera. In Proceedings of DARPA Image Understanding Workshop. New Orleans, USA.

Rizzi, A., Duina, D., & Cassinis, R. (2001). A novel visual landmark matching for a biologically inspired homing. Pattern Recognition Letters, 22, 1371-1378.

Vardy, A., & Möller, R. (2005). Biologically plausible visual homing methods based on optical flow techniques. Connection Science, Special Issue: Navigation, 17 (1-2), 47-89.

Zampoglou, M., Szenher, M., & Webb, B. (2006). Adaptation of controllers for image-based homing. Adaptive Behavior, 14 (4), 381-399.

Zeil, J., Hofmann, M., & Chahl, J. (2003). Catchment areas of panoramic snapshots in outdoor scenes. Journal of the Optical Society of America A, 20 (3), 450-469.

KEY TERMS

Catchment Area: The area from which a goal location is reachable using a particular navigation algorithm.

Correspondence Problem: The problem of pairing an imaged feature extracted from one image with the same imaged feature extracted from a second image. The images may have been taken from different locations, changing the appearance of the features.

Feature Extraction Problem: The problem of extracting the same imaged features from two images taken from (potentially) different locations.

Image-based Visual Homing: Visual homing (see definition below) in which the home vector is estimated from the whole-image disparity between snapshot and current images. No feature extraction or correspondence is required.

Navigation: The process of determining and maintaining a course or trajectory to a goal location.

Optic Flow: The perceived movement of objects due to viewer translation and/or rotation.

Snapshot Image: In the visual homing literature, this is the image captured at the goal location.

Visual Homing: A method of navigating in which the relative location of the goal is inferred by comparing an image taken at the goal with the current image. No landmark map is required.

Nelder-Mead Evolutionary Hybrid Algorithms

Sanjoy Das
Kansas State University, USA

INTRODUCTION

Real-world optimization problems are often too complex to be solved through analytic means. Evolutionary algorithms are a class of algorithms that borrow paradigms from nature to address them. These are stochastic methods of optimization that maintain a population of individual solutions, which correspond to points in the search space of the problem. These algorithms have been immensely popular as they are derivative-free techniques, are less prone to becoming trapped in local minima, and can be tailored specifically to suit any given problem. The performance of evolutionary algorithms can be improved further by adding a local search component to them. The Nelder-Mead simplex algorithm (Nelder & Mead, 1965) is a simple local search algorithm that has been routinely applied to improve the search process in evolutionary algorithms, and such a strategy has met with great success. In this article, we provide an overview of the various strategies that have been adopted to hybridize two well-known evolutionary algorithms: genetic algorithms (GA) and particle swarm optimization (PSO).

BACKGROUND

Arguably, GAs are among the most common population-based approaches for optimization. The candidate solutions that these algorithms maintain in each generation are called chromosomes. GAs apply the Darwinian operators of selection, mutation, and recombination to these chromosomes to perform their search (Mitchell, 1998). Each generation is improved by removing the poorer solutions from the population, while retaining the better ones, based on a fitness measure. This process is called selection. Following selection, a method of recombining solutions called crossover is applied. Here two (or more) parent solutions from the current generation are picked randomly for producing offspring to populate the next generation of solutions. The offspring chromosomes are then probabilistically subject to mutation, which is carried out by the addition of small random perturbations.

PSO is a more recent approach for optimization (Kennedy & Eberhart, 2001). Being modelled after the social behavior of organisms such as a flock of birds in flight or a school of fish swimming, it is considered an evolutionary algorithm only in a loose sense. Each solution within the population is called a particle in PSO. Each particle’s position in the search space is updated in each generation by the addition of the particle’s velocity to it. The velocity of a particle is then adjusted towards the best position encountered in the particle’s own history (individual best), as well as the best position in the current iteration (global best).

Since evolutionary algorithms use a population of individuals and randomized variational operators, they are adept at performing exploratory searches over their search spaces. However, when the aim is to produce outputs within reasonable time limits, it is important to balance this exploration with better exploitation of smaller-scale features in the fitness landscape. In the latter context, local search algorithms enable single solutions to be improved using local information (e.g., directional trends in fitness around each solution) and take the solution towards the closest fitness maximum. Hybrid algorithms that combine the advantages of exploration and exploitation comprise a distinct area of evolutionary computation research, variously called Lamarckian or memetic approaches, of which Nelder-Mead hybrids form a significant part.
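The PSO position and velocity updates described in this section can be sketched as follows. This is a minimal illustration, not the formulation of Kennedy & Eberhart (2001): the inertia weight, the acceleration coefficients `c1` and `c2`, the search bound and the use of the best position found so far as the global best are all assumptions of this sketch.

```python
import random

def pso_maximise(fitness, dim, n_particles=20, generations=50,
                 inertia=0.7, c1=1.5, c2=1.5, bound=5.0, seed=0):
    """Maximise `fitness` over `dim` dimensions; returns (best_position, best_fitness)."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-bound, bound) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_fit = [fitness(p) for p in positions]
    g = max(range(n_particles), key=personal_fit.__getitem__)
    global_best, global_fit = personal_best[g][:], personal_fit[g]

    for _ in range(generations):
        for i in range(n_particles):
            # Position update: add the particle's velocity to its position.
            positions[i] = [x + v for x, v in zip(positions[i], velocities[i])]
            # Velocity update: pull towards the individual best and the global best.
            velocities[i] = [
                inertia * v
                + c1 * rng.random() * (pb - x)
                + c2 * rng.random() * (gb - x)
                for x, v, pb, gb in zip(positions[i], velocities[i],
                                        personal_best[i], global_best)
            ]
            f = fitness(positions[i])
            if f > personal_fit[i]:
                personal_best[i], personal_fit[i] = positions[i][:], f
            if f > global_fit:
                global_best, global_fit = positions[i][:], f
    return global_best, global_fit

def negative_sphere(point):
    """Illustrative fitness with a single maximum at the origin."""
    return -sum(x * x for x in point)

best_position, best_fitness = pso_maximise(negative_sphere, dim=2)
```

Because the random perturbations drive exploration rather than fine local improvement, this is the component that the hybrid schemes in this article complement with a local search step.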

NELDER-MEAD SIMPLEX BASED HYBRIDIZATION

The Nelder-Mead Downhill Simplex Algorithm

The Nelder-Mead simplex algorithm is a derivative-free local search technique that is capable of moving a cluster of solutions in the gradient direction and which, as per current research, can be very effectively combined with GA and PSO approaches. These hybrid evolutionary algorithms have been shown to be very successful in continuous optimization problems. The Nelder-Mead simplex method makes use of a construct called a simplex (see Figure 1). When the search space is n-dimensional, the simplex consists of n+1 solutions, si, i = 1, 2, …, n+1, that are usually closely spaced. As shown in the top left of Figure 1, in a two-dimensional search plane, a simplex is a triangle. The fitness of each solution is considered in each step of the Nelder-Mead method, and the worst solution w is identified. The centroid, c, of the remaining n points,

$$c = \frac{1}{n} \sum_{s_i \neq w} s_i,$$

is computed and the reflection of w through it is determined. This reflection yields a new solution r that replaces w in the next step, as shown in the top right of Figure 1. If the solution r produced by this reflection has a higher fitness than any other solution in the simplex, the simplex is further expanded along the direction of r, as shown in the middle left of the figure. On the other hand, if r has a low fitness compared to the others, the simplex is contracted. Contraction can be either outward or inward depending upon whether r is better or worse than w. The contraction operations are shown in the middle right and bottom left of the figure. If neither contraction improves the worst solution in the simplex, the best point in the simplex is identified and a collapse is carried out: all the points of the simplex are moved a little closer towards the best one, as shown in the bottom right of the same figure.

Figure 1. Various operations in the Nelder-Mead simplex routine (panels: original, reflection, expansion, outward contraction, inward contraction, collapse)

The approaches taken to incorporate a simplex-based local search routine within the broad framework of a genetic algorithm fall under four different schemes, which are shown in Figure 2 and described in turn in the sections that follow.
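The simplex operations described above (reflection, expansion, outward and inward contraction, collapse) can be sketched as a single step that maximises fitness. The coefficient values are the conventional ones and are assumptions here, as is the rule that a reflected point is accepted when it beats the second-worst solution; the article does not specify either.

```python
def nelder_mead_step(simplex, fitness, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5):
    """One Nelder-Mead iteration maximising `fitness`; returns the new simplex."""
    pts = sorted(simplex, key=fitness, reverse=True)  # best first, worst last
    best, worst = pts[0], pts[-1]
    n = len(pts) - 1
    # Centroid c of the n points that remain once the worst one is set aside.
    c = [sum(p[j] for p in pts[:-1]) / n for j in range(len(worst))]

    def along(t):
        """Point c + t * (c - w): t = 1 reflects, t > 1 expands, |t| < 1 contracts."""
        return [cj + t * (cj - wj) for cj, wj in zip(c, worst)]

    r = along(alpha)
    f_r, f_w = fitness(r), fitness(worst)
    if f_r > fitness(best):
        e = along(gamma)                      # expansion
        new = e if fitness(e) > f_r else r
    elif f_r > fitness(pts[-2]):
        new = r                               # plain reflection
    else:
        # Outward contraction if r is better than w, inward otherwise.
        k = along(rho) if f_r > f_w else along(-rho)
        if fitness(k) > max(f_r, f_w):
            new = k
        else:
            # Collapse: move every point part of the way towards the best one.
            return [[b + sigma * (pj - b) for b, pj in zip(best, p)] for p in pts]
    return pts[:-1] + [new]

def negative_sphere(point):
    """Illustrative fitness with a single maximum at the origin."""
    return -sum(x * x for x in point)

simplex = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
after_one_step = nelder_mead_step(simplex, negative_sphere)
```

In this example the worst point [1, 2] is reflected through the centroid [1.5, 1] of the other two and is replaced by the reflected point. The best point is never discarded by any of the operations, so the best fitness in the simplex cannot decrease from one step to the next.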

Two-phase Hybridization

This is the simplest of all approaches and has been applied to GAs (Chelouah & Siarry, 2000; Chelouah & Siarry, 2003; Robin, Orzati, Moreno, Homan & Bachtold, 2003). In the first phase of this scheme, a GA is applied to the optimization problem to explore the entire search space until one or more good solutions are found that can no longer be improved through the random operations of crossover and mutation. The Nelder-Mead simplex algorithm is then invoked in the second phase to improve these solutions further by allowing them to ascend towards their local maxima. In another approach, the initial points of the simplex are obtained by taking the solution with the best fitness given by the GA and then generating the remaining n points around it (Chelouah & Siarry, 2000; Chelouah & Siarry, 2003).
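The hand-over between the two phases can be illustrated with a small sketch that builds an initial simplex around the best GA solution. The article does not say how the remaining n points are generated; offsetting the best point by a fixed step along each coordinate axis is an assumption of this sketch, as are the function name and the step size.

```python
def simplex_around(best, step=0.1):
    """Return n+1 points: `best` plus one point offset by `step` along each axis."""
    simplex = [list(best)]
    for j in range(len(best)):
        point = list(best)
        point[j] += step  # assumed construction, not taken from the cited papers
        simplex.append(point)
    return simplex

# Hypothetical best chromosome returned by the first (GA) phase.
ga_best = [1.0, 2.0]
initial_simplex = simplex_around(ga_best, step=0.5)
```

The resulting simplex would then be passed to the Nelder-Mead routine for the second phase.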

Serial Hybridization

In this scheme, the solutions in each generation are subject to the usual operators of the main evolutionary algorithm as well as one or more steps of the Nelder-Mead simplex method. It has been successfully applied to hybridize GAs (Renders & Flasse, 1998; Yang & Douglas, 1998; Durand & Alliot, 1999; Guo & Shouyi, 2003; Trabia, 2004). This method has also been used in conjunction with PSO by Das et al. (Das, Koduru, Welch, Gui, Cochran, Wareing & Babin, 2006; Koduru, Welch & Das, 2007). In each generation, following the position and velocity updates, the population is partitioned into distinct clusters of n+1 solutions each, and a few steps of the Nelder-Mead algorithm are applied separately to each cluster. The Nelder-Mead step is applied a fixed number of times per generation.

The serial hybridization scheme has also been successfully implemented within a multi-objective optimization framework (Koduru, Das & Welch, 2007). Instead of fitness, a metric called fuzzy dominance is applied to discriminate between the n+1 solutions within a simplex. A solution that is not dominated by any other is assigned a fuzzy dominance of zero. The poorer a solution is, the higher the fuzzy dominance value it is assigned.

Parallel Hybridization

Such hybridization approaches assemble the offspring generation from the parent generation in two parallel tracks. The standard evolutionary algorithm operators are used to generate some of the offspring, while others are generated using the simplex algorithm. This strategy has been applied to hybridize GAs (Yen, Liao, Lee & Randolph, 1998; Koduru, Das, Welch & Roe, 2004; Koduru, Das, Welch, Roe & Lopez-Dee, 2005). In these approaches, the best n+1 solutions (called elites) of each generation are picked to be improved further, using the Nelder-Mead simplex method. In another strategy, a probabilistic variant of the Nelder-Mead approach is used, where the amount of contraction and/or expansion of the simplex is determined randomly, but within specific limits (Yen, Liao, Lee & Randolph, 1998). The approaches taken in Koduru, Das, Welch & Roe (2004) and Koduru, Das, Welch, Roe & Lopez-Dee (2005) are multi-objective implementations that make use of the fuzzy dominance metric discussed earlier to identify the best and worst solutions. In order to preserve solution diversity within the population, the collapse operation is never used; instead, the Nelder-Mead routine is terminated within each generation when the need for a collapse arises.

This scheme has also been used with PSO (Fan, Liang & Zahara, 2004; Zahara, Fan & Tsai, 2005). As before, only the best n+1 points of the population are picked to undergo improvement using the Nelder-Mead simplex method. The remaining solutions in each generation are obtained using standard PSO position and velocity updates.

Figure 2. Four different GA-simplex hybridization strategies (panels: Two-Phase, Serial, Parallel, Implicit; blocks labelled Evolutionary Algorithm and Nelder-Mead Simplex)

Implicit Hybridization

Here, the Nelder-Mead simplex algorithm is not applied directly. Instead, the approach is buried implicitly within any of the evolutionary algorithm’s generic operators. One simple technique in GAs is the multi-parent simplex-based crossover (Renders & Bersini, 1994). This method applies a single Nelder-Mead step to produce a new offspring. Novel crossover techniques have also been suggested (Bersini, 2002). In another method, each simplex is encoded as a chromosome and the algorithm uses a specially devised multi-parent crossover within the GA (Hedar & Fukushima, 2003). Das et al. (Das, Koduru, Welch, Gui, Cochran, Wareing & Babin, 2006) use implicit hybridization in PSO by adding a term to the velocity of each particle that allows the latter to reorient its trajectory towards the gradient direction sensed by the Nelder-Mead simplex (from the worst solution towards the centroid).
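The idea of a crossover that performs a single Nelder-Mead step can be sketched as a reflection of the least-fit parent through the centroid of the other parents. This is only an illustration of the general idea under that assumption; it is not the operator defined by Renders & Bersini (1994), and the function name is hypothetical.

```python
def simplex_crossover(parents, fitness):
    """Offspring = reflection of the least-fit parent through the others' centroid."""
    ranked = sorted(parents, key=fitness)  # least fit first
    worst, rest = ranked[0], ranked[1:]
    centroid = [sum(p[j] for p in rest) / len(rest) for j in range(len(worst))]
    return [2 * c - w for c, w in zip(centroid, worst)]

def negative_sphere(point):
    """Illustrative fitness with a single maximum at the origin."""
    return -sum(x * x for x in point)

parents = [[1.0, 2.0], [1.0, 1.0], [2.0, 1.0]]
offspring = simplex_crossover(parents, negative_sphere)
```

The same worst-to-centroid direction is what the PSO variant mentioned above adds to each particle's velocity.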

FUTURE TRENDS

Although evolutionary algorithms have traditionally focussed on optimizing single objective functions, most practical problems in engineering are inherently multi-objective in nature. Consequently, multi-objective evolutionary optimization is a relatively new, emerging direction of evolutionary computation research. Perhaps the only attempts at incorporating Nelder-Mead simplex as an additional operator within a multi-objective GA have been reported by Koduru, Das & Welch (cf. Koduru, Das, Welch & Roe, 2004). Clearly more research is required in this direction, and as multi-objective algorithms become more common, Nelder-Mead strategies will be investigated more vigorously.

PSO is a new technique for evolutionary optimization. Research into PSO-based hybrid algorithms has only recently begun to appear. A few limited approaches have been suggested to hybridize PSO with Nelder-Mead simplex by Das et al. (Das, Koduru, Welch, Gui, Cochran, Wareing & Babin, 2006) and Zahara et al. (Zahara, Fan & Tsai, 2005). The method suggested in Koduru, Das & Welch (2007) is, to the best of the author’s knowledge, the only attempt at producing a multi-objective PSO hybrid algorithm. Here again, further investigation is necessary.

Although research into these evolutionary hybrid algorithms is over a decade old, with several good approaches having been suggested, there is no clear consensus about which approach is best suited for any given application. More research in this direction is warranted to obtain further insights into the performance of these algorithms.

CONCLUSION

In the literature on evolutionary optimization, many effective approaches have been proposed to hybridize GAs with Nelder-Mead simplex. More recently, researchers have begun implementing similar ideas within PSO also. A few papers on multi-objective hybrid approaches have been published. However, a formal framework to categorize all these approaches has so far been lacking. This chapter surveys the various methods and proposes a way to organize them into four distinct categories.

REFERENCES

Bersini, H. (2002). The immune and chemical crossovers. IEEE Transactions on Evolutionary Computation, 6(3): 306-313.

Chelouah, R., & Siarry, P. (2000). A continuous genetic algorithm designed for the global optimization of multimodal functions. Journal of Heuristics, 6: 191-213.

Chelouah, R., & Siarry, P. (2003). Genetic and Nelder-Mead algorithms hybridized for a more accurate global optimization of continuous multiminima functions. European Journal of Operational Research, 148: 335-348.

Das, S., Koduru, P., Welch, S.M., Gui, M., Cochran, M., Wareing, A., & Babin, B. (2006). Adding local search to particle swarm optimization. Proceedings, World Congress on Computational Intelligence, Vancouver, BC, Canada, 428-433.

Durand, N., & Alliot, J.M. (1999). A combined Nelder-Mead simplex and genetic algorithm. Genetic and Evolutionary Computing Conference (GECCO).

Fan, S.S., Liang, Y.C., & Zahara, E. (2004). Hybrid simplex search and particle swarm optimization for the global optimization of multimodal functions. Engineering Optimization, 36(4): 401-418.

Guo, G., & Shouyi, Y. (2003). Evolutionary parallel local search for function optimization. IEEE Transactions on Systems, Man and Cybernetics Part-B, 7(1): 243-258.

Hedar, A., & Fukushima, M. (2003). Minimizing multimodal functions by simplex coding genetic algorithm. Optimization Methods and Software, 18: 265-282.

Kennedy, J., & Eberhart, R.C. (2001). Swarm Intelligence. Morgan Kaufmann, CA.

Koduru, P., Das, S., Welch, S.M., & Roe, J. (2004). Fuzzy dominance based multi-objective GA-Simplex hybrid algorithms applied to gene network models. Lecture Notes in Computer Science: Proceedings of the Genetic and Evolutionary Computing Conference, Seattle, Washington (Eds. Kalyanmoy Deb et al.), Springer-Verlag, 3102: 356-367.

Koduru, P., Das, S., Welch, S.M., Roe, J., & Lopez-Dee, Z.P. (2005). A co-evolutionary hybrid algorithm for multi-objective optimization of gene regulatory network models. Proceedings of the Genetic and Evolutionary Computing Conference, Washington D.C., 393-399.

Koduru, P., Das, S., & Welch, S.M. (2007). Multi-objective and hybrid PSO using ε-fuzzy dominance. Proceedings of the Genetic and Evolutionary Computing Conference, London, UK (Eds. Dirk Thierens et al.), 853-860.

Koduru, P., Welch, S.M., & Das, S. (2007). A particle swarm optimization approach for estimating confidence regions. Proceedings of the Genetic and Evolutionary Computing Conference, London, UK (Eds. Dirk Thierens et al.), 70-77.

Mitchell, M. (1998). An Introduction to Genetic Algorithms. MIT Press.

Nelder, J.A., & Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7(4): 308-313.

Renders, J., & Bersini, H. (1994). Hybridizing genetic algorithms with hill-climbing methods for global optimization: Two possible ways. Proceedings of IEEE International Conference on Evolutionary Computation, 312-317.

Renders, J.M., & Flasse, S.P. (1998). Hybrid methods using genetic algorithms for global optimization. IEEE Transactions on Systems, Man and Cybernetics Part-B, 28(2): 73-91.

Robin, F., Orzati, A., Moreno, E., Homan, O.J., & Bachtold, W. (2003). Simulation and evolutionary optimization of electron-beam lithography with genetic and simplex-downhill algorithms. IEEE Transactions on Evolutionary Computation, 7(1): 69-82.

Trabia, M.B. (2004). A hybrid fuzzy simplex genetic algorithm. Journal of Mechanical Design, 126(6): 969-97.

Yang, R., & Douglas, I. (1998). Simple genetic algorithm with local tuning: Efficient global optimizing technique. Journal of Optimization Theory and Applications, 98: 449-465.

Yen, J., Liao, J.C., Lee, B., & Randolph, D. (1998). A hybrid approach to modeling metabolic systems using a genetic algorithm and simplex method. IEEE Transactions on Systems, Man and Cybernetics Part-B, 28(2): 173-191.

Zahara, E., Fan, S.S., & Tsai, D.M. (2005). Optimal multi-thresholding using a hybrid optimization approach. Pattern Recognition Letters, 26(8): 1082-1095.



Key TeRmS Dominance: A relationship between solutions in a multi-objective optimization problem. A solution dominates another if and only if it is equal to the latter in all the objectives and better in at least one. Evolutionary Algorithm: A class of probabilistic algorithms that are based upon biological metaphors such as Darwinian evolution, and widely used in optimization. Exploration: A strategy that samples the fitness landscape extensively to obtain good regions. Exploitation: A greedy strategy that seeks to improve one or more solutions to an optimization problem to take it to a maximum in its vicinity. Fitness: A measure to determine the goodness of a solution to an optimization problem. When a single objective is to be maximized, the fitness is either equal to the objective or a monotonically increasing function of it. Fitness Landscape: A representation of the search space of an optimization problem that brings out the differences in the fitness of the solutions, such that those with good fitness are “higher”. Optimal solutions are the maxima of the fitness landscape.

1196

Generation: A term used in evolutionary algorithms that corresponds to an iteration of the outermost loop. Local Search: A search algorithm to carry out exploitation. Multi-Objective Optimization: An optimization problem involving more than a single objective function. In such a setting, it is not easy to discriminate between good and bad solutions, as a solution that is better than another in one objective may be poorer in another. Without any loss of generality, each objective function can be considered to be one involving maximization. Population-Based Algorithm: An algorithm, which maintains an entire set of candidate solutions for an optimization problem. Search Space: Set of all possible solutions for any given optimization problem, in which one can usually define a neighborhood of any solution.


Neural Control System for Autonomous Vehicles

Francisco García-Córdova
Polytechnic University of Cartagena (UPCT), Spain

Antonio Guerrero-González
Polytechnic University of Cartagena (UPCT), Spain

Fulgencio Marín-García
Polytechnic University of Cartagena (UPCT), Spain

INTRODUCTION

Neural networks have been used in a number of robotic applications (Das & Kar, 2006; Fierro & Lewis, 1998), including both manipulators and mobile robots. A typical approach is to use neural networks for nonlinear system modelling, including, for instance, the learning of forward and inverse models of a plant, noise cancellation, and other forms of nonlinear control (Fierro & Lewis, 1998). An alternative approach is to solve a particular problem by designing a specialized neural network architecture and/or learning rule (Sutton & Barto, 1981). It is clear that biological brains, though exhibiting a certain degree of homogeneity, rely on many specialized circuits designed to solve particular problems. We are interested in understanding how animals are able to solve complex problems such as learning to navigate in an unknown environment, with the aim of applying what is learned from biology to the control of robots (Chang & Gaudiano, 1998; Martínez-Marín, 2007; Montes-González, Santos-Reyes & Ríos-Figueroa, 2006).

In particular, this article presents a neural architecture that integrates a kinematical adaptive neuro-controller for trajectory tracking with an obstacle avoidance adaptive neuro-controller for nonholonomic mobile robots. The kinematical adaptive neuro-controller, termed the Self-Organization Direction Mapping Network (SODMN), is a real-time, unsupervised neural network that learns to control a nonholonomic mobile robot in a nonstationary environment. It combines associative learning and Vector Associative Map (VAM) learning to generate transformations between spatial and velocity coordinates (García-Córdova, Guerrero-González & García-Marín, 2007). The transformations are learned in an unsupervised training phase, during which the robot moves as a result of randomly selected wheel velocities.

The obstacle avoidance adaptive neuro-controller is a neural network that learns to control avoidance behaviours in a mobile robot based on a form of animal learning known as operant conditioning. Learning, which requires no supervision, takes place as the robot moves around a cluttered environment with obstacles. The neural network requires no knowledge of the geometry of the robot or of the quality, number, or configuration of the robot’s sensors. The efficacy of the proposed neural architecture is tested experimentally on a differentially driven mobile robot.

BACKGROUND

Several heuristic approaches based on neural networks (NNs) have been proposed for identification and adaptive control of nonlinear dynamic systems (Fierro & Lewis, 1998; Pardo-Ayala & Angulo-Bahón, 2007). In wheeled mobile robots (WMR), the trajectory-tracking problem with exponential convergence has been solved theoretically using time-varying state feedback based on the backstepping technique (Ping & Nijmeijer, 1997; Das & Kar, 2006). Dynamic feedback linearization has been used for trajectory tracking and posture stabilization of mobile robot systems in chained form (Oriolo, Luca & Vendittelli, 2002).

The study of autonomous behaviour has become an active research area in the field of robotics. Even the simplest organisms are capable of behavioural feats unimaginable for the most sophisticated machines. When an animal has to operate in an unknown environment, it must somehow learn to predict the consequences of its own actions. Biological organisms are a clear example that this sort of learning is possible in spite of what, from an engineering standpoint, seem to be insurmountable difficulties: noisy sensors, unknown kinematics and dynamics, nonstationary statistics, and so on. A related form of learning is known as operant conditioning (Grossberg, 1971). Chang and Gaudiano (1998) introduce a neural network for obstacle avoidance that is based on a model of classical and operant conditioning.

Psychologists have identified classical and operant conditioning as two primary forms of learning that enable animals to acquire the causal structure of their environment. In the classical conditioning paradigm, learning occurs by repeated association of a Conditioned Stimulus (CS), which normally has no particular significance for an animal, with an Unconditioned Stimulus (UCS), which has significance for an animal and always gives rise to an Unconditioned Response (UCR). The response that comes to be elicited by the CS after classical conditioning is known as the Conditioned Response (CR) (Grossberg & Levine, 1987). Hence, classical conditioning is the putative learning process that enables animals to recognize informative stimuli in the environment. In the case of operant conditioning, an animal learns the consequences of its actions. More specifically, the animal learns to exhibit more frequently a behaviour that has led to reward in the past, and to exhibit less frequently a behaviour that has led to punishment. In the field of neural networks research, it is often suggested that neural networks based on associative learning laws can model the mechanisms of classical conditioning, while neural networks based on reinforcement learning laws can model the mechanisms of operant conditioning (Chang & Gaudiano, 1998).
Reinforcement learning is used to acquire navigation skills for autonomous vehicles, and it updates both the vehicle model and the optimal behaviour at the same time (Galindo, González & Fernández-Madrigal, 2006; Lamiraux & Laumond, 2001; Galindo, Fernández-Madrigal & González, 2007). In this article, we propose a neurobiologically inspired neural architecture to show how an organism, in this case a robot, can learn without supervision to recognize simple stimuli in its environment and to associate them with different actions.

ARCHITECTURE OF THE NEURAL CONTROL SYSTEM

Figure 1(a) illustrates our proposed neural architecture. Trajectory tracking control in the absence of obstacles is implemented by the SODMN, and a neural network of biological behaviour implements the obstacle avoidance behaviour.

Self-Organization Direction Mapping Network (SODMN)

The transformation of spatial directions to wheel angular velocities is expressed as a linear mapping and is shown in Fig. 1(b). The spatial error is computed to get a spatial direction vector (DVs). The DVs is transformed by the direction mapping network elements Vik to the corresponding motor direction vector (DVm). In addition, a set of tonically active inhibitory cells, which receive broad-based inputs that determine the context of a motor action, was implemented as a context field. The context field selects the Vik elements based on the configuration of the wheel angular velocities.

A speed-control GO signal acts as a non-specific multiplicative gate and controls the movement’s overall speed. The GO signal is an input from a decision centre in the brain; it starts at zero before movement and then grows smoothly to a positive value as the movement develops. During learning, the GO signal is inactive.

Activities of cells of the DVs and DVm are represented in the neural network by quantities (S1, S2, ..., Sm) and (R1, R2, ..., Rn), respectively. The direction mapping is formed with a field of cells with activities Vik. Each Vik cell receives the complete set of spatial inputs Sj, j = 1, ..., m, but connects to only one Ri cell. The direction mapping cells ($V \in \mathbb{R}^{n \times k}$) compute a difference of activity between the spatial and motor direction vectors via feedback from DVm. During learning, this difference drives the adjustment of the weights. During performance, the difference drives DVm activity to the value encoded in the learned mapping.

A context field cell pauses when it recognizes a particular velocity state (i.e., a velocity configuration) on its inputs, and thereby disinhibits its target cells. The target cells (direction mapping cells) are completely shut off when their context cells are active (see Fig. 1(b)). Each context field cell projects to a set of direction mapping cells, one for each velocity vector component. Each velocity vector component has a set of direction mapping cells associated with it, one for each context. A cell is “off” for a compact region of the velocity space. It is assumed for simplicity that only one context field cell turns “off” at a time. The centre context field cell is “off” when the angular velocities are in the centre region of the velocity space. The “off” context cell enables a subset of direction mapping cells through the inhibition variable ck, while “on” context cells disable the other subsets.

Learning is obtained by decreasing weights in proportion to the product of the presynaptic and postsynaptic activities (Gaudiano & Grossberg, 1991). Training is done by generating random movements and by using the resulting angular velocities and observed spatial velocities of the mobile robot as training vectors for the direction mapping network.

Figure 1. (a) Neural architecture for reactive and adaptive navigation of a mobile robot. (b) Self-organization direction mapping network for the trajectory tracking of a mobile robot. [Diagram not reproduced. Panel (b) labels the context field output as ck = [cf]+ when active and ck = 0 when inactive.]
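The training procedure described above (random wheel velocities, observed spatial velocities, and a difference signal that adjusts the mapping weights) can be sketched for a single context as follows. This is a minimal illustration under stated assumptions, not the authors' SODMN: it assumes a differential-drive robot with a hypothetical wheel radius and axle length, collapses the context field to one always-enabled context, and swaps in a normalized delta rule for the learning law. All function and constant names are illustrative.

```python
import random

# Hypothetical robot geometry in metres (not taken from the article).
WHEEL_RADIUS = 0.1
AXLE_LENGTH = 0.4


def spatial_velocity(theta_r, theta_l):
    """Forward kinematics of a differential drive: wheel rates -> (linear, angular) velocity."""
    linear = WHEEL_RADIUS * (theta_r + theta_l) / 2.0
    angular = WHEEL_RADIUS * (theta_r - theta_l) / AXLE_LENGTH
    return (linear, angular)


def train_direction_mapping(steps=2000, rate=0.5, seed=0):
    """Learn weights V so that V applied to a spatial velocity returns the wheel rates."""
    rng = random.Random(seed)
    V = [[0.0, 0.0], [0.0, 0.0]]  # V[i][j]: weight from spatial input S_j to motor cell R_i
    for _ in range(steps):
        # Random movement: pick wheel velocities and observe the spatial velocity.
        wheels = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        s = spatial_velocity(*wheels)
        norm = s[0] * s[0] + s[1] * s[1] + 1e-12
        for i in range(2):
            predicted = V[i][0] * s[0] + V[i][1] * s[1]
            difference = wheels[i] - predicted  # mismatch between motor and mapped spatial vector
            for j in range(2):
                V[i][j] += rate * difference * s[j] / norm  # normalized delta rule (assumption)
    return V


def motor_command(V, spatial_direction, go):
    """Map a spatial direction vector to wheel rates, gated multiplicatively by the GO signal."""
    return tuple(
        go * (V[i][0] * spatial_direction[0] + V[i][1] * spatial_direction[1])
        for i in range(2)
    )
```

With the GO signal at zero the command vanishes regardless of the spatial error, which mirrors the role of GO as a multiplicative gate on overall speed. A fuller sketch would keep one weight set per context and let the context field choose among them from the sensed wheel velocities.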

Neural Network for the Avoidance Behaviour (NNAB)

Grossberg proposed a model of classical and operant conditioning, which was designed to account for a variety of behavioural data on learning in vertebrates (Grossberg, 1971; Grossberg & Levine, 1987). Our implementation is based on Grossberg’s conditioning circuit, follows closely that of Grossberg & Levine (1987) and Chang & Gaudiano (1998), and is shown in Figure 2. In this model the sensory cues (both CSs and UCS) are stored in Short Term Memory (STM) within the population labelled S, which includes competitive interactions to ensure that the most salient cues are contrast enhanced and stored in STM while less salient cues are suppressed. The population S is modelled as a simplified discrete-time recurrent competitive field, which removes inherent noise and efficiently normalizes and contrast-enhances the ultrasound sensor activations. In the present model, the CS nodes correspond to activation from the robot’s ultrasound sensors. In the network, Ii represents a sensor value that codes proximal objects with large values and distal objects with small values. The network requires no knowledge of the geometry of the mobile robot or of the quality, number, or distribution of sensors over the robot’s body.



Fig. 2. Neural Network for the avoidance behaviour

The drive node D corresponds to the Reward/Punishment component of operant conditioning (an animal/robot learns the consequences of its own actions). Learning can only occur when the drive node is active. Activation of drive node D is determined by the weighted sum of all the CS inputs, plus the UCS input, which is presumed to have a large, fixed connection strength. The drive node D is active when the robot collides with an obstacle. The unconditioned stimulus (UCS) in this case therefore corresponds to a collision detected by the mobile robot. The activation of the drive node and of the sensory nodes converges upon the population of polyvalent cells P. Polyvalent cells require the convergence of two types of inputs in order to become active. In particular, each polyvalent cell receives input from only one sensory node, and all polyvalent cells also receive input from the drive node D.

Finally, the neurons (xmi) represent the conditioned or unconditioned response and are thus connected to the motor system. The motor population consists of nodes (i.e., neurons) encoding desired angular velocities of avoidance. When driving the robot, activation is distributed as a Gaussian centred on the desired angular velocity of avoidance. The use of a Gaussian leads to smooth transitions in angular velocity even with few nodes. The output of the angular velocity population is decomposed by the SODMN into left and right wheel angular velocities. A gain term can be used to specify the maximum possible velocity.

In the NNAB, the proximity sensors initially do not propagate activity to the motor population because the initial weights are small or zero. The robot is trained by allowing it to make random movements in a cluttered environment. Whenever the robot collides with an obstacle during one of these movements (or comes very close to it), the nodes corresponding to the largest (closest) proximity sensor measurements just prior to the collision will be active. Activation of the drive node D allows two different kinds of learning to take place: the learning that couples sensory nodes (infrared or ultrasound) with the drive node (the collision), and the learning of the angular velocity pattern that existed just before the collision. The first type of learning follows an associative learning law with decay. The primary purpose of this learning scheme is to ensure that learning occurs only for those CS nodes that were active within some time window prior to the collision (UCS). The second type of learning, which is also of an associative type but inhibitory in nature, is used to map the sensor activations to the angular velocity map. By using an inhibitory learning law, the polyvalent cell corresponding to each sensory node learns to generate a pattern of inhibition that matches the activity profile active at the time of collision.

Once learning has occurred, the activation of the angular velocity map is given by two components (see Figure 3). An excitatory component, which is generated directly by the sensory system, reflects the angular velocity required to reach a given target in the absence of obstacles. The second, inhibitory component, generated by the conditioning model in response to sensed obstacles, moves the robot away from the obstacles as a result of the activation of sensory signals in the conditioning circuit.

Figure 3. Positive Gaussian distribution represents the angular velocity without obstacle and negative distribution represents activation from the conditioning circuit. The summation represents the angular velocity that will be used to drive the mobile robot.
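The combination of the excitatory and inhibitory components on the angular velocity map can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the node spacing, the Gaussian width, the inhibition gain, and the winner-takes-the-peak readout are all assumptions, and the learned inhibition is represented simply by a list of hypothetical centres instead of trained weights.

```python
import math


def gaussian(x, centre, sigma):
    """Unnormalized Gaussian bump used for both excitation and inhibition."""
    return math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))


def select_angular_velocity(nodes, target, inhibition_centres, sigma=0.2, inhibition_gain=1.0):
    """Return the node with the largest net activity (excitation minus inhibition)."""
    best_node, best_activity = None, -math.inf
    for node in nodes:
        excitation = gaussian(node, target, sigma)  # pull toward the target velocity
        inhibition = inhibition_gain * sum(
            gaussian(node, centre, sigma) for centre in inhibition_centres
        )  # push away from velocities associated with past collisions
        activity = excitation - inhibition
        if activity > best_activity:
            best_node, best_activity = node, activity
    return best_node


# Hypothetical map of 41 nodes covering angular velocities from -1 to 1 rad/s.
NODES = [-1.0 + 0.05 * k for k in range(41)]
```

With no inhibition the peak sits on the target velocity; an inhibitory bump slightly to one side of the target shifts the peak to the other side, which is the turning-away effect described in the text.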

EXPERIMENTAL RESULTS

The proposed control algorithm is implemented on a mobile robot from the Polytechnic University of Cartagena (UPCT) named “CHAMAN”. The platform has two driving wheels (in the rear) mounted on the same axis and two passive supporting wheels (in front) of free orientation. The two driving wheels are independently driven by two DC motors to achieve the motion and orientation. High-level control algorithms (SODMN and NNAB) are written in VC++ and run with a sampling time of 10 ms on a remote server (a Pentium IV processor). The lower-level control layer is in charge of the execution of the high-level velocity commands. It consists of a Texas Instruments TMS320C6701 Digital Signal Processor (DSP).

Figure 4 shows approach behaviours and the tracking of a trajectory by the mobile robot with respect to the reference trajectory. Figure 5 illustrates the mobile robot’s performance in the presence of several obstacles. The mobile robot starts from the initial position labelled X and reaches a desired position. During the movements, whenever the mobile robot is approaching an obstacle, the inhibitory profile from the conditioning circuit (NNAB) changes the selected angular velocity and makes the mobile robot turn away from the obstacle.

FUTURE TRENDS

The trend in robot control systems is to understand and imitate the way that biological systems learn and evolve to solve complex problems in unknown environments. Simple animals (e.g., crabs, insects, scorpions and others) are studied to formalize robust neural models for robot locomotion systems. In humans, neural activity decoded from the cortex is increasingly applied to the control of movement in robotic prostheses. Neural networks and other biomimetic techniques, with an emphasis on navigation and control, are used to build systems that operate in real time with only minimal assumptions about the robot or the environment and that can learn, if needed, with little or no external supervision. The neural control system proposed in this article can also be applied to underwater vehicles, in which case sonar sensors would replace the ultrasound sensors. The proposed neural architecture learns to carry out reactive and adaptive navigation in nonstationary environments.



Figure 4. Adaptive control by the SODMN. (a) Approach behaviours. The symbol X indicates the start of the mobile robot and Ti indicates the desired reach. (b) Tracking control of a desired trajectory. (c) Real-time tracking performance. [Plots not reproduced. Axes are X [m] and Y [m]; the plotted curves are labelled “Mobile robot's trajectory” and “Desired trajectory”.]

Figure 5. Trajectory followed by the mobile robot in the presence of obstacles using the NNAB, panels (a) and (b). [Plots not reproduced.]


CONCLUSION

In this article, we have implemented a neural architecture for trajectory tracking and avoidance behaviours of a mobile robot. A biologically inspired neural network for spatial reaching and tracking has been developed. This neural network is implemented as a kinematical adaptive neuro-controller. The SODMN uses a context field for learning the direction mapping between spatial and angular velocity coordinates. The performance of this neural network has been successfully demonstrated in experimental results with the trajectory tracking and reaching of a mobile robot. The obstacle avoidance behaviours were implemented by a neural network that is based on a form of animal learning known as operant conditioning. The efficacy of the proposed neural network for avoidance behaviours was tested experimentally on a differentially driven mobile robot.

REFERENCES

Chang, C., & Gaudiano, P. (1998). Application of biological learning theories to mobile robot avoidance and approach behaviors. J. Complex Systems. (1), 79–114.

Das, T., & Kar, I.N. (2006). Design and implementation of an adaptive fuzzy logic-based controller for wheeled mobile robots. IEEE Transactions on Control Systems Technology. (14), 501–510.

Fierro, R., & Lewis, F.L. (1998). Control of a nonholonomic mobile robot using neural networks. IEEE Trans. Neural Netw. (9), 589–600.

Galindo, C., Fernández-Madrigal, J.A., & González, J. (2007). Towards the automatic learning of reflex modulation for mobile robot navigation. Nature Inspired Problem-Solving Methods in Knowledge Engineering, Mira, J., & Alvarez, J.R. (Eds.) IWINAC 2007, Part II. LNCS vol. 4528, 347-356. Springer, Heidelberg.

Galindo, C., González, J., & Fernández-Madrigal, J.A. (2006). A control architecture for human-robot integration: Application to robotic wheelchair. IEEE Trans. on Systems, Man, and Cyb. Part B, 36, 1053-1068.

García-Córdova, F., Guerrero-González, A., & García-Marín, F. (2007). Design and implementation of an adaptive neuro-controller for trajectory tracking of nonholonomic wheeled mobile robots. Nature Inspired Problem-Solving Methods in Knowledge Engineering, Mira, J., & Alvarez, J.R. (Eds.) IWINAC 2007, Part II. LNCS vol. 4528, 459-468. Springer, Heidelberg.

Gaudiano, P., & Grossberg, S. (1991). Vector associative maps: Unsupervised real-time error-based learning and control of movement trajectories. Neural Networks. (4), 147–183.

Grossberg, S. (1971). On the dynamics of operant conditioning. Journal of Theoretical Biology. (33), 225–255.

Grossberg, S., & Levine, D. (1987). Neural dynamics of attentionally modulated Pavlovian conditioning: Blocking, interstimulus interval, and secondary reinforcement. Applied Optics. (26), 5015–5030.

Lamiraux, F., & Laumond, J.P. (2001). Smooth motion planning for car-like vehicles. IEEE Trans. Robot. Automat. 17(4), 498-502.

Martínez-Marín, T. (2007). Learning autonomous behaviours for nonholonomic vehicles. Computational and Ambient Intelligence, Sandoval, F., Prieto, A., Cabestany, J., & Graña, M. (Eds.) IWANN 2007. LNCS vol. 4507, 839-846. Springer, Heidelberg.

Montes-González, F., Santos Reyes, J., & Ríos Figueroa, H. (2006). Integration of evolution with a robot action selection model. Gelbukh, A., & Reyes-García, C.A. (Eds.) MICAI 2006. LNCS vol. 4293, 1160-1170.

Oriolo, G., Luca, A.D., & Vendittelli, M. (2002). WMR control via dynamic feedback linearization: Design, implementation and experimental validation. IEEE Trans. Control. Syst. Technol. 10, 835-852.

Pardo-Ayala, D.E., & Angulo-Bahón, C. (2007). Emerging behaviors by learning joint coordination in articulated mobile robots. Computational and Ambient Intelligence, Sandoval, F., Prieto, A., Cabestany, J., & Graña, M. (Eds.) IWANN 2007. LNCS vol. 4507, 806-813. Springer, Heidelberg.

Ping, Z., & Nijmeijer, H. (1997). Tracking control of mobile robots: A case study in backstepping. Automatica. (33), 1393–1399.

Sutton, R.S., & Barto, A.G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review. (88), 135–170.



KEY TERMS

Artificial Neural Network: A network of many simple processors (“units” or “neurons”) that imitates a biological neural network. The units are connected by unidirectional communication channels, which carry numeric data. Neural networks can be trained to find nonlinear relationships in data, and are used in applications such as robotics, speech recognition, signal processing and medical diagnosis.

Classical Conditioning: A form of associative learning that was first demonstrated by Ivan Pavlov. It is a type of learning in which a stimulus acquires the capacity to evoke a response that was originally evoked by another stimulus.

Conditioned Response (CR): If the conditioned stimulus and the unconditioned stimulus are repeatedly paired, eventually the two stimuli become associated and the organism begins to produce a behavioral response to the conditioned stimulus. The conditioned response is this learned response to the previously neutral stimulus.

Conditioned Stimulus (CS): A previously neutral stimulus that, after becoming associated with the unconditioned stimulus, eventually comes to trigger a conditioned response. The neutral stimulus could be any event that does not result in an overt behavioral response from the organism under investigation.

Operant Conditioning: The term “operant” refers to how an organism operates on the environment; hence, operant conditioning comes from how we respond to what is presented to us in our environment. Operant conditioning is a form of associative learning through which an animal learns about the consequences of its behaviour.

Unconditioned Response (UCR): The unlearned response that occurs naturally in response to the unconditioned stimulus.

Unconditioned Stimulus (UCS): A stimulus that unconditionally, naturally, and automatically triggers an innate, often reflexive, response. For example, when you smell one of your favourite foods, you may immediately feel very hungry. In this example, the smell of the food is the unconditioned stimulus.


Neural Network-Based Visual Data Mining for Cancer Data

Enrique Romero
Technical University of Catalonia, Spain

Julio J. Valdés
National Research Council Canada, Canada

Alan J. Barton
National Research Council Canada, Canada

INTRODUCTION

According to the World Health Organization (http://www.who.int/cancer/en), cancer is a leading cause of death worldwide. From a total of 58 million deaths in 2005, cancer accounts for 7.6 million (or 13%) of all deaths. The main types of cancer leading to overall cancer mortality are i) Lung (1.3 million deaths/year), ii) Stomach (almost 1 million deaths/year), iii) Liver (662,000 deaths/year), iv) Colon (655,000 deaths/year) and v) Breast (502,000 deaths/year). Among men the most frequent cancer types worldwide are (in order of number of global deaths): lung, stomach, liver, colorectal, oesophagus and prostate, while among women (in order of number of global deaths) they are: breast, lung, stomach, colorectal and cervical.

Technological advancements in recent years are enabling the collection of large amounts of cancer-related data. In particular, in the field of Bioinformatics, high-throughput microarray gene experiments are possible, leading to an information explosion. This requires the development of data mining procedures that speed up the process of scientific discovery and the in-depth understanding of the internal structure of the data. This is crucial for the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro & Smyth, 1996). Researchers need to understand their data rapidly and with greater ease. In general, objects under study are described in terms of collections of heterogeneous properties. It is typical for medical data to be composed of properties represented by nominal, ordinal or real-valued variables (scalar), as well as by others of a more complex nature, like images, time series, etc. In addition, the information comes with different degrees of precision, uncertainty and information completeness (missing data is quite common). Classical data mining and analysis methods are sometimes difficult to use, the output of many procedures may be large and time consuming to analyze, and often their interpretation requires special expertise. Moreover, some methods are based on assumptions about the data which limit their application, especially for the purpose of exploration, comparison, hypothesis formation, etc., typical of the first stages of scientific investigation.

This makes graphical representation directly appealing. Humans perceive most information through vision, in large quantities and at very high input rates. The human brain is extremely well qualified for the fast understanding of complex visual patterns, and still outperforms the computer. Several reasons make Virtual Reality (VR) a suitable paradigm: i) it is flexible (it allows the choice of different representation models to better suit human perception preferences), ii) it allows immersion (the user can navigate inside the data and interact with the objects in the world), iii) it creates a living experience (the user is not merely a passive observer, but an actor in the world) and iv) it is broad and deep (the user may see the VR world as a whole, and/or concentrate on specific details of the world). Of no less importance is the fact that only minimal skills are required to interact with a virtual world. Visualization techniques may be very useful for medical decision support in the oncology area.

In this paper unsupervised neural networks are used for constructing VR spaces for visual data mining of gene expression cancer data. Three datasets are used in the paper, representative of three of the most important types of cancer in modern medicine: liver, stomach and lung. The data sets are composed of samples from normal and tumor tissues, described in terms of tens of thousands of variables, which are the corresponding gene expression intensities measured in microarray experiments. Despite the very high dimensionality of the studied patterns, high quality visual representations in the form of structure-preserving VR spaces are obtained using SAMANN neural networks, which enable the differentiation of cancerous and noncancerous tissues. The same networks could be used as nonlinear feature generators in a preprocessing step for other data mining procedures.

NEURAL NETWORKS FOR THE CONSTRUCTION OF VIRTUAL REALITY SPACES

VR spaces for the visual representation of information systems (Pawlak, 1991) and relational structures were introduced in (Valdés, 2002a) (Valdés, 2003). A VR space is a tuple $\Omega = \langle \underline{O}, G, B, \Re^m, g_0, l, g_r, b, r \rangle$, where $\underline{O}$ is a relational structure ($\underline{O} = \langle O, \Gamma^v \rangle$, where $O$ is a finite set of objects and $\Gamma^v$ is a set of relations); $G$ is a non-empty set of geometries representing the different objects and relations; $B$ is a non-empty set of behaviors of the objects in the virtual world; and $R^m \subseteq \Re^m$ is a metric space of dimension $m$ (Euclidean or not), which will be the actual VR geometric space. The other elements are mappings: $g_0 : O \to G$, $l : O \to R^m$, $g_r : \Gamma^v \to G$ and $b : O \to B$.

The typical desiderata for the visual representation of data and knowledge can be formulated in terms of minimizing information loss, maximizing structure preservation, maximizing class separability, or their combination, which leads to single or multi-objective optimization problems. In many cases, these concepts can be expressed deterministically using continuous functions with well defined partial derivatives. This is the realm of classical optimization, where there is a plethora of methods with well known properties. In the case of heterogeneous information the situation is more complex and other techniques are required (Valdés, 2002b) (Valdés, 2004) (Valdés & Barton, 2005). In the unsupervised case, the function f mapping the original space to the VR (geometric) space $R^m$ can be constructed so as to maximize some metric/non-metric structure preservation criteria, as is typical in multidimensional scaling (Borg & Lingoes, 1987), or to minimize some error measure of information loss (Sammon, 1969). A typical error measure is:

$$\text{Sammon Error} = \frac{1}{\sum_{i<j} \delta_{ij}} \sum_{i<j} \frac{(\delta_{ij} - \xi_{ij})^2}{\delta_{ij}}$$

where δij is a dissimilarity measure between two objects i, j in the original space, and ξij is another dissimilarity measure defined on objects i, j in the VR space (the images of i, j under f). Typical dissimilarity measures for δij are the Euclidean distance or the dissimilarity based on Gower’s similarity coefficient (Gower, 1971). The Euclidean distance is the usual measure for ξij in the VR space. Usually, the mappings f obtained using approaches of this kind are implicit because the images of the objects in the new space are computed directly. However, a functional representation of f is highly desirable, specially in cases where more samples are expected a posteriori and need to be placed within the space. With an implicit representation, the space has to be computed every time that a new sample is added to the set, whereas with an explicit representation, the mapping can be computed directly. As long as the incoming objects can be considered as belonging to the same population of samples used for constructing the mapping function, the space does not need to be recomputed. Neural networks are natural candidates for constructing explicit representations due to their general universal approximation property. If proper training methods are used, neural networks can learn structure preserving mappings of high dimensional samples into lower dimensional spaces suitable for visualization (2D, 3D). If visualization is not a requirement, spaces of smaller dimension than the original can be used as new features for noise reduction or other data mining methods. Such an example is the SAMANN network. This is a feedforward network and its architecture consists of an input layer with as many neurons as descriptor attributes, an output layer with as many neurons as the dimension of the VR space and one or more hidden layers. The classical way of training the SAMANN network is described in (Mao & Jain, 1995). 
It consists of a gradient descent method where the derivatives of the Sammon error are computed in a similar way to the classical backpropagation algorithm. Unlike the backpropagation algorithm, the training is unsupervised and the weights can only be updated after a pair of examples has been presented to the network.
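As an illustration of the error measure defined above, the following minimal sketch computes the Sammon error for a given set of original samples and their images. It assumes the Euclidean distance in both spaces (the choice reported later for the experiments); the function names `pairwise_euclidean` and `sammon_error` and the small constant `eps` are illustrative and do not come from the original work.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the full matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_error(X_orig, X_proj, eps=1e-12):
    """Sammon error between original-space and projected-space distances.

    X_orig: (n_samples, n_attributes) array, the original data.
    X_proj: (n_samples, d) array, the images of the same samples (e.g. d = 3).
    eps guards against division by zero for coincident samples (an assumption).
    """
    delta = pairwise_euclidean(np.asarray(X_orig, dtype=float))  # delta_ij
    xi = pairwise_euclidean(np.asarray(X_proj, dtype=float))     # xi_ij
    iu = np.triu_indices(delta.shape[0], k=1)                    # pairs with i < j
    d, x = delta[iu], xi[iu]
    return float(((d - x) ** 2 / np.maximum(d, eps)).sum() / d.sum())
```

A projection that preserves every pairwise distance gives an error of zero. The SAMANN training described above minimizes this quantity with respect to the network weights rather than with respect to the projected coordinates themselves.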

CANCER DATA SETS DESCRIPTION

Three microarray gene expression cancer databases were selected. They are representative of some of the leading causes of cancer death in the world and share the typical features of this kind of data: a small number of samples (on the order of tens), described in terms of a very large number of attributes (on the order of tens of thousands).

Liver Cancer Data

We used the same data as in (Lam, Wu, Vega, Miller, Spitsbergen, Tong, Zhan, Govindarajan, Lee, Mathavan, Murthy, Buhler, Liu & Gong, 2006), where zebrafish liver tumors were analyzed and compared with human liver tumors. The database (http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=2220) contains 20 samples (10 normal, 10 tumor), with 16,512 attributes. First, liver tumors were generated in zebrafish by treating the fish with carcinogens. Then, the expression profiles of zebrafish liver tumors were compared with those of zebrafish normal liver tissues using a Wilcoxon rank-sum test. As a result of this comparison, a zebrafish liver tumor differentially expressed gene set consisting of 2,315 gene features was obtained. This data set was used for comparison with human tumors. The results suggest that the molecular similarities between zebrafish and human liver tumors are greater than the molecular similarities between other types of tumors (stomach, lung and prostate).

Stomach Cancer Data

We used the same data as in (Hippo, Taniguchi, Tsutsumi, Machida, Chong, Fukayama, Kodama & Aburatani, 2002), where a study of genes that are differentially expressed in cancerous and noncancerous human gastric tissues was performed. The database (http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=1210) contains 30 samples (22 tumor, 8 normal) that were analyzed by oligonucleotide microarray, obtaining the expression profiles for 6,936 genes (7,129 attributes). Using the 6,272 genes that passed a prefilter procedure, cancerous and noncancerous tissues were successfully distinguished with a two-dimensional hierarchical clustering using Pearson’s correlation. However, the clustering results used most of the genes on the array. To identify the genes that were differentially expressed between cancerous and noncancerous tissues, a Mann-Whitney U test was applied to the data. As a result of this analysis, 162 and 129 genes showed a higher expression in cancerous and noncancerous tissues, respectively. In addition, several genes associated with lymph node metastasis and histological classification (intestinal, diffuse) were identified.

Lung Cancer Data

We used the same data as in (Spira, Beane, Pinto-Plata, Kadar, Liu, Shah, Celli & Brody, 2004), where gene expressions were compared for severely emphysematous lung tissue (from smokers at lung volume reduction surgery) and normal or mildly emphysematous lung tissue (from smokers undergoing resection of pulmonary nodules). The database (http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=737) contains 30 samples (18 severe emphysema, 12 mild or no emphysema), with 22,283 attributes. Genes with large detection P-values were filtered out, leading to a data set with 9,336 genes that were used for subsequent analysis. Nine classification algorithms were used to identify a group of genes whose expression in the lung distinguished severe emphysema from mild or no emphysema. First, model selection was performed for every algorithm by leave-one-out cross-validation, and the gene list corresponding to the best model was saved. The genes reported by at least four classification algorithms (102 genes) were chosen for further analysis. With these genes, a two-dimensional hierarchical clustering using Pearson’s correlation was performed that distinguished between severe emphysema and mild or no emphysema. Other genes were also identified that may be causally involved in the pathogenesis of emphysema.

EXPERIMENTAL SETTINGS

Data Preprocessing

For stomach and lung data, each gene was scaled to mean zero and standard deviation one (original data were not normalized). For liver data, no transformation was performed (original data were log2 ratios).
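The scaling applied to the stomach and lung data can be sketched as follows. This is only an illustration: the function name `standardize_genes`, the use of the population standard deviation, and the handling of constant genes are assumptions, since the text does not specify them.

```python
import numpy as np

def standardize_genes(X):
    """Scale each column (gene) of X to mean 0 and standard deviation 1.

    X: (n_samples, n_genes) expression matrix.
    Assumption: population standard deviation (ddof=0) is used.
    Assumption: constant genes are left at zero instead of dividing by zero.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std
```

Because the samples are rows and the genes are columns, the statistics are taken along `axis=0`, so every gene contributes on a comparable scale to the Euclidean distances used later.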

Model Training

For every data set, SAMANN networks were constructed to map the original data to a 3D VR space. The Euclidean distance was the dissimilarity measure used for both the original and the VR spaces. The activation functions used were sinusoidal for the first hidden layer and hyperbolic tangent for the rest. A collection of models was obtained by varying some of the network controlling parameters: number of units in the first hidden layer (two different values), weight ranges in the first hidden layer (three different values), learning rates (three different values), momentum (three different values), number of pairs presented to the network at every iteration (three different values), number of iterations (three different values) and random seeds (four different values), for a total of 1,944 SAMANN networks for every data set.
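The total of 1,944 networks is the size of the full Cartesian product of the parameter values (2 × 3 × 3 × 3 × 3 × 3 × 4). The sketch below enumerates such a grid. Only the number of values per parameter comes from the text; the concrete values and the dictionary keys are hypothetical placeholders.

```python
from itertools import product

# Hypothetical placeholder values; only the COUNT per parameter is from the text.
grid = {
    "first_hidden_units": [10, 20],                # two values
    "first_layer_weight_range": [0.1, 1.0, 10.0],  # three values
    "learning_rate": [0.01, 0.1, 0.5],             # three values
    "momentum": [0.0, 0.5, 0.9],                   # three values
    "pairs_per_iteration": [100, 500, 1000],       # three values
    "iterations": [1000, 5000, 10000],             # three values
    "seed": [0, 1, 2, 3],                          # four values
}

# One configuration dictionary per network to be trained.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 1944
```

Each configuration is independent of the others, which is what makes a run of this kind well suited to a pool of distributed machines.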

Computing Environment

All of the experiments were conducted on a Condor pool (http://www.cs.wisc.edu/condor) located at the Institute for Information Technology, National Research Council Canada.

RESULTS

For every data set, we constructed the histograms of the Sammon error for the obtained networks. All of the empirical distributions were positively skewed (with the mode on the lower error side), which is desirable behavior. In addition, the general error ranges were small. Table 1 presents some statistics of the experiments: minimum, maximum, mean and standard deviation for the best (i.e., with smallest Sammon error) 1,000 networks.

Clearly, it is impossible to represent a VR space on printed media (navigation, interaction, and world changes are all lost). Therefore, very simple geometries were used for objects and only snapshots of the virtual worlds are presented. Figures 1, 2 and 3 show the VR spaces corresponding to the best networks for the liver, stomach and lung cancer data sets, respectively. Although the mapping was generated from an unsupervised perspective (i.e., without using the class labels), objects from different classes are represented differently in the VR space for comparison purposes. Transparent membranes wrap the corresponding classes, so that the degree of class overlap can be easily seen. This also makes it possible to look for particular samples with ambiguous diagnostic decisions. The low values of the Sammon error indicate that the spaces preserved most of the distance structure of the data and therefore give a good idea of the distribution in the original spaces.

The three virtual spaces are clearly polarized with two distribution modes, each one corresponding to a different class. Note, however, that classes are more clearly differentiated for the liver and stomach data sets than for the lung data set, where a certain level of overlap exists. The reason for this may be that mild and no emphysema were considered members of the same class (see above).

The advantage of using SAMANN networks is that, since the mapping f between the original and the virtual space is explicit, a new sample can be easily transformed and visualized in the virtual space. Since the distance between any two objects is an indication of their dissimilarity, the new point is more likely to belong to the same class as its nearest neighbors. In the same way, outliers can be readily identified, although they may result from the space deformation inevitably introduced by the dimensionality reduction.
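The nearest-neighbor reading of a newly mapped sample can be sketched as a simple majority vote in the 3D space. This is an illustration under stated assumptions, not a procedure from the original study: it presumes that the new sample has already been mapped by the trained network, and the function name `knn_label` and the default `k=3` are hypothetical.

```python
from collections import Counter

import numpy as np

def knn_label(points_3d, labels, new_point, k=3):
    """Majority class among the k nearest mapped samples.

    points_3d: (n_samples, 3) coordinates of the training samples in the VR space.
    labels:    sequence of class labels, one per training sample.
    new_point: (3,) coordinates of the new sample after mapping (assumed given).
    """
    points_3d = np.asarray(points_3d, dtype=float)
    dists = np.linalg.norm(points_3d - np.asarray(new_point, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest samples
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```

As the text cautions, distances in the reduced space are only approximations of the original dissimilarities, so a vote of this kind is a visual aid rather than a diagnostic classifier.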

Table 1. Statistics of the Sammon error for the best 1,000 SAMANN networks obtained

Data Set          Minimum     Maximum     Mean        Std. Dev.
Liver Cancer      0.039905    0.055640    0.049857    0.003621
Stomach Cancer    0.062950    0.077452    0.072862    0.003346
Lung Cancer       0.079242    0.107842    0.094693    0.006978


Figure 1. VR space of the liver cancer data set (Sammon error = 0.039905, best out of 1,944 experiments). Dark spheres: normal, Light spheres: cancerous samples.

Figure 2. VR space of the stomach cancer data set (Sammon error = 0.062950, best out of 1,944 experiments). Dark spheres: normal, Light spheres: cancerous samples.

Figure 3. VR space of the lung cancer data set (Sammon error = 0.079242, best out of 1,944 experiments). Dark spheres: severe emphysema, Light spheres: mild or no emphysema. The boundary between the classes in the VR space seems to be a low curvature surface.

CONCLUSION

High quality virtual reality spaces for visual data mining of typical examples of gene expression cancer data were obtained using unsupervised structure-preserving neural networks in a distributed computing data mining (grid) environment. These results show that a few nonlinear features can effectively capture the similarity structure of the data and also provide a good differentiation between the cancer and normal classes. A similar study can be found in (Valdés, Romero & González, 2007). However, in cases where the descriptor attributes are not directly related to class structure or where there are many noisy or irrelevant attributes, the situation may not be as clear. In these cases, feature subset selection and other data mining procedures could be considered in a preprocessing stage.

ACKNOWLEDGMENT

This work was partially supported by the Consejo Interministerial de Ciencia y Tecnología (CICYT, Spain), under project TIN2006-08114, and conducted in the framework of the STATEMENT OF WORK between the National Research Council Canada (Institute for Information Technology, Integrated Reasoning Group) and the Soft Computing Group (Dept. of Languages and Information Systems), Polytechnic University of Catalonia, Spain.



REFERENCES

Borg, I. & Lingoes, J. (1987). Multidimensional Similarity Structure Analysis. Springer-Verlag.

Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P. (1996). From Data Mining to Knowledge Discovery. Advances in Knowledge Discovery and Data Mining, U. Fayyad, et al. editors, 1-34, AAAI Press.

Gower, J.C. (1971). A General Coefficient of Similarity and Some of Its Properties. Biometrics 27 (4), 857-871.

Hippo, Y., Taniguchi, H., Tsutsumi, S., Machida, N., Chong, J.M., Fukayama, M., Kodama, T. & Aburatani, H. (2002). Global Gene Expression Analysis of Gastric Cancer by Oligonucleotide Microarrays. Cancer Research 62 (1), 233-240.

Lam, S.H., Wu, Y.L., Vega, V.B., Miller, L.D., Spitsbergen, J., Tong, Y., Zhan, H., Govindarajan, K.R., Lee, S., Mathavan, S., Murthy, K.R.K., Buhler, D.R., Liu, E.T. & Gong, Z. (2006). Conservation of Gene Expression Signatures between Zebrafish and Human Tumors and Tumor Progression. Nature Biotechnology 24 (1), 73-75.

Mao, J. & Jain, A.K. (1995). Artificial Neural Networks for Feature Extraction and Multivariate Data Projection. IEEE Transactions on Neural Networks 6, 296-317.

Pawlak, Z. (1991). Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers.

Sammon, J.W. (1969). A Non-linear Mapping for Data Structure Analysis. IEEE Transactions on Computers C-18, 401-408.

Spira, A., Beane, J., Pinto-Plata, V., Kadar, A., Liu, G., Shah, V., Celli, B. & Brody, J.S. (2004). Gene Expression Profiling of Human Lung Tissue from Smokers with Severe Emphysema. American Journal of Respiratory Cell and Molecular Biology 31, 601-610.

Valdés, J.J. (2002a). Virtual Reality Representation of Relational Systems and Decision Rules: An Exploratory Tool for Understanding Data Structure. Theory and Application of Relational Structures as Knowledge Instruments, P. Hajek editor, Meeting of the COST action 274.

Valdés, J.J. (2002b). Similarity-based Heterogeneous Neurons in the Context of General Observational Models. Neural Network World 12 (5), 499-508.

Valdés, J.J. (2003). Virtual Reality Representation of Information Systems and Decision Rules: An Exploratory Tool for Understanding Data and Knowledge. International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (LNAI 2639), 615-618.

Valdés, J.J. (2004). Building Virtual Reality Spaces for Visual Data Mining with Hybrid Evolutionary-classical Optimization: Application to Microarray Gene Expression Data. IASTED International Joint Conference on Artificial Intelligence and Soft Computing, 161-166.

Valdés, J.J. & Barton, A. (2005). Virtual Reality Visual Data Mining with Nonlinear Discriminant Neural Networks: Application to Leukemia and Alzheimer Gene Expression Data. International Joint Conference on Neural Networks, 2475-2480.

Valdés, J.J., Romero, E. & González, R. (2007). Data and Knowledge Visualization with Virtual Reality Spaces, Neural Networks and Rough Sets: Application to Geophysical Prospecting. International Joint Conference on Neural Networks, 1060-1065.

KEY TERMS

Artificial Neural Networks: Interconnected group of simple units (neurons) that, as a function of the connections between the units and the parameters, can compute complex behaviors and find nonlinear relationships in data. They are used in applications such as robotics, signal processing, or medical diagnosis.

Backpropagation Algorithm: Algorithm to compute the gradient with respect to the weights, used for the training of some types of artificial neural networks. It was first described by P. Werbos in 1974, and further developed by D.E. Rumelhart, G.E. Hinton and R.J. Williams in 1986.

Condor: Specialized workload management system for compute-intensive jobs in a distributed computing environment, developed at the University of Wisconsin-Madison (http://www.cs.wisc.edu/condor). It provides a job queuing mechanism, resource monitoring and management, scheduling policy, and priority scheme.

Data Mining: Nontrivial extraction of implicit, previously unknown and potentially useful information from data. Typically, analytical methods and tools are applied to data with the aim of identifying patterns, relationships or obtaining databases for tasks such as classification, prediction, estimation or clustering.

Gene Expression: Process by which the inheritable information which comprises a gene, such as the DNA sequence, is made manifest as a physical and biologically functional gene product, such as protein or RNA.

SAMANN Neural Networks: Unsupervised feedforward neural networks for data projection. The classical way of training SAMANN networks was described by J. Mao and A.K. Jain in 1995. It consists of a gradient descent method where the derivatives of the Sammon error are computed in a similar way to the backpropagation algorithm.

Sammon Error: Error function whose minimization favours structure preservation in projected data. It is defined as

$$
\frac{1}{\sum_{i<j} \delta_{ij}} \sum_{i<j} \frac{(\delta_{ij} - \xi_{ij})^2}{\delta_{ij}},
$$

where δij and ξij are dissimilarity measures between two objects i, j in the original and projected space, respectively.

Virtual Reality: Technology which allows the user to interact with a computer-simulated environment. Most current virtual reality environments are mainly visual experiences, displayed either on a computer screen or through special stereoscopic displays. Some advanced haptic systems include tactile information.


Neural Network-Based Process Analysis in Sport

Juergen Perl
University of Mainz, Germany

INTRODUCTION

Processes in sport, such as motions or games, are influenced by communication, interaction, adaptation, and spontaneous decisions. On the one hand, those processes are therefore often fuzzy and unpredictable and have not yet been dealt with extensively. On the other hand, most of them are roughly determined in their structure by intention, rules, and context conditions, and so can be classified by means of information patterns deduced from data models of the processes. Self-organizing neural networks of the Kohonen Feature Map (KFM) type help to classify information patterns, either by mapping whole processes to corresponding neurons (see Perl & Lames, 2000; McGarry & Perl, 2004) or by mapping process steps to neurons, which can then be connected by trajectories that serve as process patterns for further analyses (see examples below). In either case, the dimension of the original data (i.e. the number of attributes contained) is reduced to the dimension of the representing neuron (normally 2 or 3), which makes the data much easier to deal with. Additionally, extensions of the KFM approach are introduced that can flexibly adjust the net to dynamically changing training situations. Those extensions also allow adaptation processes such as learning or tactical behaviour to be simulated. Finally, a current project is introduced in which tactical processes in soccer are analysed with a view to simulation-based optimization.

BACKGROUND

A major problem in analysing complex processes in sport, such as motions or games, is often the reduction of the available data to useful information. Two examples illustrate the particular problems in sport:

In Motor Analysis, a large amount of data regarding positions, angles, speed, or acceleration of articulations can be recorded automatically by means of markers and high-speed digital cameras. The problem is that the recorded data show a high degree of redundancy and inherent correlation: a leg consisting of thigh, lower leg, foot, and the articulations hip joint, knee, and ankle obviously has only a comparatively small range of possible movements due to natural restrictions. Therefore the share of characteristic motion data is comparatively small as well. Classification can help to extract that relevant information from the recorded data by mapping them to representative types or patterns.

In Game Analysis, an increasing number of approaches have been developed over roughly the last five years that enable automatic recording of position data. At the video time resolution of 25 frames per second, 9,315,000 x-y-z coordinate values from 22 players and the ball can be taken from a 90-minute soccer game. Obviously, this amount of data has to be reduced and focused on the major tactical patterns of the teams. Similar to what coaches do, the collection of players’ positions can be reduced to constellations of tactical groups which interact like super-players and therefore enable a computer-aided game analysis based on pattern analysis.

As is demonstrated in the following, neural network-based pattern analysis can support the handling of those problems.

MAIN FOCUS OF THE CHAPTER

Artificial Neural Networks

Current developments in the fields of Soft Computing and/or Computational Intelligence demonstrate how information patterns can be taken from data collections by means of fuzziness, similarity and learning, of which the approach of Artificial Neural Networks is an impressive example. In particular, self-organizing neural networks of the KFM (Kohonen Feature Map) type play an important role in aggregating input data into clusters or types by means of a self-organized similarity analysis (Kohonen, 1995).
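The data volume quoted in the Background for a 90-minute soccer game can be reproduced with a few lines of arithmetic. The sketch below assumes, as the text states, 25 frames per second, 22 players plus the ball, and three coordinates per tracked object; the variable names are illustrative.

```python
FRAMES_PER_SECOND = 25
GAME_MINUTES = 90
TRACKED_OBJECTS = 22 + 1   # 22 players and the ball
COORDINATES = 3            # x, y, z

frames = FRAMES_PER_SECOND * 60 * GAME_MINUTES   # video frames in one game
values = frames * TRACKED_OBJECTS * COORDINATES  # single coordinate values

print(frames)  # 135000
print(values)  # 9315000
```

The figure counts individual coordinate values, so the number of recorded positions is one third of it (3,105,000).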

Net-Based Process Analysis

Processes can be mapped to attribute vectors (in a game, for example, by recording the positions of the players), which can then be learned by neurons. There is, of course, a certain loss of precision when an attribute vector is replaced by a representing neuron, the entry of which is similar but normally not identical to that attribute vector. Nevertheless, there are two major advantages of the way a KFM maps input data to corresponding neurons:

1. The number of objects is dramatically reduced if the representing neurons are used instead of the original attribute vectors: a 2-dimensional 20×20 neuron matrix contains 400 neurons, while a 10-dimensional vector space with only 10 different values per attribute already contains 10^10 = 10,000,000,000 vectors.
2. The dimension of the input data is reduced to the dimension of the network (i.e. normally 2 or at most 3). This, for example, makes it possible to map time series of high-dimensional attribute vectors to trajectories of neurons that can easily be presented graphically.

There are three ways of gaining information from data by means of Artificial Neural Networks of the KFM type:

1. Neurons represent classes of similar data and so define types of information patterns.
2. Clusters of neurons represent time-static classes of similar information patterns and so build structures of information patterns.
3. Trajectories of neurons represent time-dynamic sequences of information patterns and so build 2-dimensional mappings of time-dependent processes. Moreover, trajectories themselves build patterns and can therefore be input to a network for classifying their similarities, which is extremely helpful not least in motor analysis or in game analysis.
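The first and third of these mechanisms can be illustrated with a generic textbook self-organizing map. The sketch below is not the author's implementation: the function names, the linear decay of the learning rate and neighbourhood width, and all default parameter values are assumptions chosen for brevity.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Grid position (row, col) of the neuron whose entry is closest to x."""
    d2 = ((weights - x) ** 2).sum(axis=-1)
    return tuple(int(i) for i in np.unravel_index(np.argmin(d2), d2.shape))

def train_som(data, rows=20, cols=20, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map on data of shape (n_samples, n_attributes)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every neuron, shape (rows, cols, 2).
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    sigma0 = max(rows, cols) / 2.0
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3    # shrinking neighbourhood
            bmu = best_matching_unit(weights, x)
            d2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))    # Gaussian neighbourhood
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

def trajectory(weights, series):
    """Map a time series of attribute vectors to a sequence of grid positions."""
    return [best_matching_unit(weights, x) for x in series]
```

After training, `trajectory(weights, series)` turns a time series of n-dimensional vectors into a sequence of 2-dimensional grid positions, which is the kind of neuron trajectory referred to in the third item of the list.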

There are a large number of successful applications that demonstrate how those neural networks can be used for that pattern analysis (see Perl & Dauscher, 2006).

Example “Gait Analysis”: Reduction of Redundancy and Dimensionality

In gait analysis, data from articulations such as the hip joint, knee and ankle can be recorded automatically using markers. They form a time series of n-dimensional attribute vectors with which a net can be trained. As a result, each of those n-dimensional vectors is mapped to a neuron of the 2-dimensional net, i.e. the dimension is reduced from n to 2. Following the original time series, the neurons can be connected by a trajectory, which represents the original n-dimensional process as a 2-dimensional trajectory and therefore enables a much easier similarity analysis (Perl, 2004; Schöllhorn, 2004). Moreover, net-based analysis shows that, by avoiding redundancy, the dimension of the original data can also be reduced without losing relevant information (see Figure 1).

Figure 1. Two trajectories of the same gait process, using 20 attribute values (left) and 10 attribute values (right), respectively. The high degree of similarity suggests that the missing 10 values are redundant and can be neglected.
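One simple way to put a number on the similarity of two neuron trajectories is the mean distance between corresponding grid positions. This is only a hypothetical measure for illustration: the text does not state which similarity measure was used, and the sketch assumes two trajectories of equal length on the same net.

```python
import numpy as np

def trajectory_distance(traj_a, traj_b):
    """Mean Euclidean distance between corresponding grid positions.

    traj_a, traj_b: sequences of (row, col) positions of equal length.
    A value of 0 means the two trajectories visit the same neurons in the same order.
    """
    a = np.asarray(traj_a, dtype=float)
    b = np.asarray(traj_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("trajectories must have the same length")
    return float(np.linalg.norm(a - b, axis=1).mean())
```

Trajectories of different length, such as strokes or steps of different duration, would first have to be resampled to a common number of points.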

Example “Ergometer Rowing”: Inter- and Intra-Individual Process Analysis

With the same approach that was used for gait analysis, the process of rowing was analyzed with respect to inter-individual similarity and intra-individual stability. Obviously, there is a great similarity across the set of all trajectories (see Figure 2). However, the trajectories of rower A are almost identical to each other, demonstrating a high stability, while those of rower B are less so. The experience with rowing patterns is that net-based analysis of rowing trajectories is very sensitive and helps to detect even small instabilities which otherwise could not have been detected from video frames or from the original time series of data vectors (see Perl & Baca, 2003).

Figure 2. Trajectories of the rowing process of two rowers A and B, one stroke per graphic

Example “Tactics in Games”: Constellation Analysis

In a more complex way, trajectories can improve the transparency of the tactical behaviour of players or even a team (net-based volleyball analysis: Jäger, Perl & Schöllhorn, 2007). A collection of player positions, i.e. a constellation, can be represented by a vector of position coordinates, with which a net can then be trained. Figure 3 shows two exemplars of the same trained volleyball net, with small squares representing activated constellations and marked areas representing major constellation types. Obviously, the teams represented by the left and the right net activate quite different types of constellation. Moreover, the moves between the constellations, i.e. the edges and/or trajectories, are quite different, too: the left team moves between the areas, while the right team more or less selects an area and then adjusts its constellation. In a game like volleyball, i.e. with separated teams, it is comparatively easy to deduce tactical ideas from those trajectory patterns. Some first results could be obtained for handball too, where net-based analysis was helpful for detecting successful offence processes (net-based handball analysis: Pfeiffer & Perl, 2006; net-based soccer analysis: Lees et al., 2003; Leser, 2006). Based on those results, a project is currently being run which deals with simulation-based tactics optimisation in soccer. First results are encouraging. They were shown as a video presentation at the Documenta exhibition on fine arts, 2007, in Kassel, Germany.

Figure 3. Two examples of a net trained with constellations, where the marked squares represent frequent constellations and the marked areas represent major types of constellations.

Dynamic Extensions of KFM-Type Neural Networks

Self-organizing maps of the KFM type are very helpful for analyzing dynamic processes. They fail, however, if learning or other process dynamics are parts of the processes to be trained. This is due to the fact that the learning procedure of a KFM is externally controlled, resulting in a network that works like a tool, without being able to change with or adapt to changing process types or contexts. One successful approach that improves the dynamics of the learning process is that of the Dynamically Controlled Network (DyCoN: Perl, 2002 a/b), which is a KFM derivative that is able to learn continuously. The idea is that each neuron contains an individual adaptive learning model based on the Performance Potential Metamodel (PerPot: Perl, 2002 a; learning strategies: Perl & Weber, 2004). While DyCoN helps to analyse dynamic learning processes, a different type of neural network is necessary for simulating those learning processes in order to eventually schedule and optimize them individually. One important point was to dynamically adapt the capacity of the network to the requirements of the learning process. This was done by integrating the concept of Growing Neural Gas (GNG: Fritzke, 1997), where, briefly, the number and positions of neurons vary over time with the changing information flow from training, in this way adapting the network size and topology to the training amount and content. The result is the Dynamically Controlled Neural Gas (DyCoNG), the concept of which completes the combination of DyCoN and GNG by specific “quality neurons” that reflect the information theoretical quality of information and therefore can measure the originality of a recorded activity (Perl et al., 2006). Based on the assumption that there is a strong correspondence

between the “quality” of a neuron and the originality of the represented type of activity, the network’s reaction to an input stimulus (i.e. generating a new connected or unconnected quality neuron, or not) indicates an evaluation of the originality of the corresponding activity. According to the two tasks “analysis of creativity learning” and “simulation of creativity learning”, two major results could be obtained: The DyCoN model was used for analyzing the learning profiles, which were fed into the net as patterns and then recognized as members of clusters or types of learning behaviour. It was remarkable that the net could detect a number of significantly different types of learning behaviour, which in practice is useful for individually adjusting the training to the athletes (Perl et al., 2006). The DyCoNG model was used for learning profile simulation, with the original activity and rating data as input and learning profiles as output. The learning profiles resulting from DyCoNG training could also be separated into types which qualitatively correspond to those from the DyCoN analysis. This at least gives an idea of how to manage the above mentioned individual adaptation by means of net-based simulation. In a first approach, net-based originality analysis has been used successfully in the case of handball: in a case study dealing with data from the Handball World Championship 2007 in Germany, offence activities of high originality could be detected with a remarkably high agreement with experts’ evaluation. Moreover, a degree of originality per team and game could be measured, resulting in team-specific originality profiles that characterize increasing and decreasing playing qualities during the tournament. Currently, a similar project is being run for soccer, in which, as a first attempt, the final of the World Championship 2006 is analyzed.

FUTURE TRENDS

The two major ideas for planned future work are to expand net-based simulation of originality to associative behaviour and to analyse the effects of virtually generated “creative” activities in simulated games.

Net-Based Simulation of Associative Behaviour

In a simplified way, behaviour can be understood as recognizing the behavioural context, such as environment or situation, followed by a context-oriented selection of a best fitting activity. In the case of convergent behaviour, this selection is more or less rule-based and determined. In the case of divergent or creative behaviour, the selection has a certain undetermined degree of freedom, i.e. spontaneous “jumps” are possible from a first-priority activity to associated ones. Mapped to neural networks, where activities can be thought of as connected to neurons, this means a “jump” from the input-corresponding neuron to a different one, located either in a neighbouring cluster or as an isolated quality neuron (see Figure 4). Such an associative network could help to improve the simulation of “creative” behaviour, based on a specific creativity potential that describes frequency, maximal distance, or neuron similarity of those associative jumps.

Figure 4. Net with clusters (marked by slim lines), associative “jumps” between clusters (bold dotted lines), and generated quality neuron (bold line)

Improvement of Tactical Process Patterns

The idea of optimizing strategies by means of simulation was developed in the early 1980s for games like tennis or badminton, where the player’s abilities and tactics can be characterized in a simplified way by two matrices: the action-dependent transfer of situations can be measured by a transfer frequency matrix, while the situation-dependent success of actions can be measured by an action success matrix. Based on those two matrices for both players, a game can be simulated stochastically with regard to its main process structures. Moreover, modifying the entries of the matrices, i.e. changing tactical aspects or technical skills, can help to improve tactical patterns by means of simulation. Although soccer is much more complex than tennis or badminton, the same idea can be used if the complexity is reduced by introducing “super-players”, as we do in a current project: groups of players, e.g. representing offence or defence, are combined into corresponding data objects, which are characterized by constellations of player positions. The interactions of the single players are then reduced to the interactions of the constellations or super-players, which makes it much easier to map the processes to networks for tactical analysis. The intended aim is to derive those characteristic matrices as well as information about creativity from the network in order to simulate games and improve tactical process patterns: as is indicated in Figure 5, a recorded original activity (white dot on the net) could be replaced by an apparently better or more creative one (white circle above the white dot), which in the simulation changes the corresponding constellation and the resulting process and its success.
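The two-matrix idea can be made concrete with a deliberately simplified stochastic rally simulation. The sketch rests on assumptions that go beyond the text: an action is identified with the situation it creates for the opponent, so that `transfer[s, a]` is the relative frequency of answering situation `s` with action `a`, and `success[s, a]` is the probability that this action succeeds. The function names and the rally cut-off are hypothetical as well.

```python
import numpy as np

def simulate_rally(players, rng, first=0, start_situation=0, max_strokes=200):
    """Simulate one rally; return the winner's index (0 or 1), or None if cut off.

    players: two (transfer, success) pairs, each an (n, n) array.
             Rows of transfer must sum to 1 (assumed, not checked).
    """
    s, p = start_situation, first
    for _ in range(max_strokes):
        transfer, success = players[p]
        a = int(rng.choice(transfer.shape[1], p=transfer[s]))  # chosen action
        if rng.random() >= success[s, a]:
            return 1 - p          # action failed: the opponent wins the rally
        s, p = a, 1 - p           # the opponent now faces situation a
    return None                   # undecided after max_strokes (assumption)

def win_rate(players, n_rallies=2000, seed=0):
    """Share of rallies won by player 0, alternating who acts first."""
    rng = np.random.default_rng(seed)
    wins = sum(simulate_rally(players, rng, first=i % 2) == 0 for i in range(n_rallies))
    return wins / n_rallies
```

Changing single entries of `transfer` (a tactical change) or `success` (a technical skill) for one player and comparing `win_rate` before and after corresponds to the simulation-based improvement of tactical patterns described above.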

Figure 5. Steps of net-based analysis and simulation of games like soccer: Replacing players by positions and positions by constellations; analysing constellations by means of networks; simulative modification of tactical patterns; analysing simulated games in order to improve tactical and creative behaviour.

CONCLUSION

Net-based analysis of processes in sport is a difficult and challenging task because of the fuzziness and the indeterminism of athletes’ behaviour and interaction. The result of about 30 years of work in this area is that a lot of problems could be solved methodically. The bottleneck, however, was the recording of data and the transfer to information. Meanwhile, data from biomechanical, physiological, and medical applications can be recorded automatically, and even in games like soccer automatic position recording has become possible. Therefore the problem has changed from “how to get data” to “how to transfer data to information”. The presented net-based approaches show how this problem can be handled, opening new perspectives of transferring theoretical approaches to practical work.

REFERENCES

Lees, A., Barton, B. & Kerschaw, L. (2003). The use of Kohonen neural network analysis to establish characteristics of technique in soccer kicking. Journal of Sports Sciences, 21, 243-244.

Perl, J. & Baca, A. (2003). Application of neural networks to analyze performance in sports. In E. Müller, H. Schwameder, G. Zallinger & V. Fastenbauer (Eds.), Proceedings of the 8th annual congress of the European College of Sport Science, 342.

Fritzke, B. (1997). A self-organizing network that can follow non-stationary distributions. Proceedings of ICANN97, International Conference on Artificial Neural Networks. 613-618. Jäger, J., Perl, J. & Schöllhorn, W. (2007). Analysis of players’ configurations by means of artificial neural networks. International Journal of Performance Analysis of Sport, 3 (7), 90-103. Kohonen T. (1995). Self-Organizing Maps. Berlin–Heidelberg–New-York: Springer.

Leser, R. (2006). Prozessanalyse im Fußball mittels Neuronaler Netze. M. Raab, A. Arnold, K. Gärtner, J. Köppen, C. Lempertz, N. Tielemann, H. Zastrow (Eds.), Human Performance and Sport, 2, 199-202. McGarry, T., & Perl, J. (2004). Models of sports contests – Markov processes, dynamical systems and neural networks. M. Hughes, & I. M. Franks (Eds.), Notational Analysis of Sport, 227–242.

Perl, J. & Dauscher, P. (2006). Dynamic Pattern Recognition in Sport by Means of Artificial Neural Networks. R. Begg & M. Palaniswami (Eds.), Computational Intelligence for Movement Science, 299-318. Perl, J. & Lames, M. (2000). Identifikation von Ballwechselverlaufstypen mit Neuronalen Netzten am Beispiel Volleyball. W. Schmidt & A. Knollenberg (Eds.), Schriften der dvs, 112, 211-215.


Perl, J. & Weber, K. (2004). A Neural Network approach to pattern learning in sport. International Journal of Computer Science in Sport, 3 (1), 67-70. Perl, J, Memmert, D., Bischof, J. & Gerharz, Ch. (2006). On a First Attempt to Modelling Creativity Learning by Means of Artificial Neural Networks. International Journal of Computer Science in Sport, 5 (2), 33-37. Perl, J. (2002 a). Adaptation, Antagonism, and System Dynamics. G. Ghent, D. Kluka & D. Jones (Eds.), Perspectives – The Multidisciplinary Series of Physical Education and Sport Science, 4, 105-125. Perl, J. (2002 b). Game analysis and control by means of continuously learning networks. International Journal of Performance Analysis of Sport, 2, 21-35. Perl, J. (2004). A Neural Network approach to movement pattern analysis. Human Movement Science, 23, 605-620. Pfeiffer, M. & Perl, J. (2006). Analysis of Tactical Structures in Team Handball by Means of Artificial Neural Networks. International Journal of Computer Science in Sport, 5 (1), 4-14. Schöllhorn, W. (2004). Applications of artificial neural nets in clinical biomechanics. Clinical Biomechanics, 19 (9), 876-898.

KEY TERMS Cluster: A collection of neurons is called a cluster if they are similar and locally neighboured. Due to the topology-preserving property of KFM training, classes of similar training vectors are mapped to clusters of neighboured neurons. DyCoN: A DyCoN is a KFM-type network in which each neuron contains an individual PerPot-based self-control of its activation radius and learning rate. The DyCoN concept enables continuous learning and therefore supports continuous training and testing, training in phases and with generated data, on-line adaptation during tests and analyses, and flexible adaptation to new information patterns (Perl, 2002 a). (Note that DyCoN is used commercially. Therefore, technical details cannot be published but are kept confidential by DyCoS GmbH (www.dycos.net).)

DyCoNG: The concept of DyCoNG combines the concepts of DyCoN and GNG and completes them by dynamically generating “quality” neurons in order to represent relevant and rare information during the training process (Perl et al., 2006). GNG: A GNG is a network without a fixed neuron topology, which is able to generate new neurons on demand. Therefore a GNG is able to dynamically adapt its neuron structure to the amount and structure of the trained information (Fritzke, 1997). Information Pattern: An information pattern is a structure of information units such as a vector or matrix of numbers, a stream of video frames, or a distribution of probabilities. KFM: A KFM consists of a (normally 2-dimensional) matrix of neurons, each of which contains a vector of attributes. Two neurons are called similar if the (Euclidean) distance of their attribute vectors is below a given threshold. Two neurons are called neighboured if they are next to each other with regard to the given net topology (see Kohonen, 1995). PerPot: PerPot is a model of dynamic adaptation in which an input flow feeds an internal strain potential as well as an internal response potential, from which an output potential is fed by specifically delayed flows. Since the strain flow is negative and the response flow is positive, resulting in an oscillating, stabilizing adaptation, the model is called antagonistic (Perl, 2002 a). Test: In a test, an attribute vector is fed to the network to determine its type, i.e. the neuron it corresponds to. Training: During training, attribute vectors are fed to the network and each is mapped to the neuron whose entry is most similar to it. After training, the space of training attribute vectors is (more or less) completely represented by the neurons of the network, meaning that every training attribute vector belongs to the neuron whose entry it is most similar to.
Type: The collection of attribute vectors that, after training, is represented by a neuron is called its type. Also the representing neuron can be called the type.
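The notions of training, test and type defined above can be made concrete with a minimal Kohonen-style sketch in plain Python. It is an illustration under assumptions, not DyCoN (whose details are not published): the grid size, the linearly decaying learning rate and neighbourhood radius, and the two artificial data clouds are all hypothetical choices.

```python
import math
import random

def best_match(neurons, v):
    """Test step: index of the neuron whose attribute vector is closest to v."""
    return min(range(len(neurons)), key=lambda k: math.dist(neurons[k], v))

def train(data, rows=4, cols=4, epochs=30, seed=0):
    """Training step: pull the winning neuron and its grid neighbours towards each vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    neurons = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs                  # decays from 1 towards 0
        rate = 0.01 + 0.5 * frac                     # learning rate
        radius = 0.3 + 0.5 * max(rows, cols) * frac  # neighbourhood radius on the grid
        for v in rng.sample(data, len(data)):
            wr, wc = divmod(best_match(neurons, v), cols)
            for k, neuron in enumerate(neurons):
                r, c = divmod(k, cols)
                h = math.exp(-((r - wr) ** 2 + (c - wc) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    neuron[d] += rate * h * (v[d] - neuron[d])
    return neurons

data_rng = random.Random(1)

def cloud(cx, cy, n=40):
    """Artificial training vectors scattered around (cx, cy)."""
    return [(cx + data_rng.uniform(-0.05, 0.05), cy + data_rng.uniform(-0.05, 0.05))
            for _ in range(n)]

data = cloud(0.1, 0.1) + cloud(0.9, 0.9)
neurons = train(data)
type_a = best_match(neurons, (0.1, 0.1))   # type of a vector from the first cloud
type_b = best_match(neurons, (0.9, 0.9))   # type of a vector from the second cloud
print(type_a, type_b)
```

After training, vectors from the two clouds are assigned to different neurons, i.e. they are of different types in the sense defined above.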


Neural Networks and Equilibria, Synchronization, and Time Lags Daniela Danciu University of Craiova, Romania Vladimir Răsvan University of Craiova, Romania

INTRODUCTION All neural networks, both natural and artificial, are characterized by two kinds of dynamics. The first is what we would call the “learning dynamics”, in fact the sequential (discrete-time) dynamics of the choice of synaptic weights. The second is the intrinsic dynamics of the neural network viewed as a dynamical system after the weights have been established via learning. Regarding the second dynamics, the emergent computational capabilities of a recurrent neural network can be achieved provided it has many equilibria, and the network task is achieved provided it approaches these equilibria. But the dynamical system has a dynamics induced a posteriori by the learning process that established the synaptic weights. This a posteriori dynamics need not have the required properties, hence they have to be checked separately. The standard stability properties (Lyapunov, asymptotic and exponential stability) are defined for a single equilibrium. Their counterparts for several equilibria are mutability, global asymptotics and gradient behavior. For the definitions of these general concepts the reader is referred to Gelig et al. (1978) and Leonov et al. (1992). In the last decades the number of applications of recurrent neural networks has increased; they have been designed for classification, identification, and complex image, visual and spatio-temporal processing in fields such as engineering, chemistry, biology and medicine (see, for instance, Fortuna et al., 2001; Fink, 2004; Atencia et al., 2004; Iwahori et al., 2005; Maurer et al., 2005; Guirguis & Ghoneimy, 2007). All these applications are mainly based on the existence of several equilibria for such networks, which requires the “good behavior” properties discussed above. Another aspect of the qualitative analysis is the so-called synchronization problem, when an external stimulus, in most cases periodic or almost periodic, has to be tracked (Gelig, 1982; Danciu, 2002). From the mathematical point of view, this problem is nothing more than the existence, uniqueness and global stability of forced oscillations. In the last decades the models of neural network dynamics have been modified once more by introducing transmission delays. The standard model of a Hopfield-type network with delay, as considered in (Gopalsamy & He, 1994), is

$$\frac{du_i}{dt} = -a_i u_i(t) + \sum_{j=1}^{n} w_{ij}\, g_j\big(u_j(t-\tau_{ij})\big) + I_i, \qquad i = 1, \dots, n. \tag{1}$$

The present paper aims at a general presentation, for both research and educational purposes, of the three topics mentioned above.

BACKGROUND Dynamical systems with several equilibria occur in such fields of science and technology as electrical machines, chemical reactions, economics, biology and, last but not least, neural networks. For systems with several equilibria the usual local concepts of stability are not sufficient for an adequate description. The so-called “global phase portrait” may contain both stable and unstable equilibria: each of them may be characterized separately since stability is a local concept dealing with a specific trajectory. But global concepts are also required for a better system description and this is particularly true for the case of the neural networks. Indeed, the neural networks may be viewed as interconnections of simple computing elements whose computational capability is increased by interconnection (“emergent collective capacities”



– to cite Hopfield). This is due to the nonlinear characteristics leading to the existence of several stable equilibria. The network achieves its computing goal if no self-sustained oscillations are present and it always reaches some steady state (equilibrium) among a finite (though large) number of such states. This behavior is most suitably described by the concepts arising from the papers of Kalman (1957) and Moser (1967). The latter relies on the following remark concerning the rather general nonlinear autonomous system

$$\dot{x} = -f(x), \qquad x \in \mathbb{R}^n, \tag{2}$$

where $f(x) = \operatorname{grad} G(x)$ and $G : \mathbb{R}^n \to \mathbb{R}$ is such that the number of its critical points is finite and it is radially unbounded, i.e. $\lim_{|x| \to \infty} G(x) = \infty$. Under these assumptions any solution of (2) approaches asymptotically one of the equilibria (which is also a critical point of $G$, where its gradient, i.e. $f$, vanishes). Obviously the best limit behavior of a neural network would be like this, naturally called gradient-like behavior. Nevertheless there are other properties that are also important, although weaker; in the following we shall discuss some of them. The mathematical object will be the system of ordinary differential equations

$$\dot{x} = f(x, t) \tag{3}$$

and we shall first define some basic notions.

Definition 1 a) Any constant solution of (3) is called an equilibrium; the set of equilibria $E$ is called the stationary set. b) A solution of (3) is called convergent if it approaches asymptotically some equilibrium:

$$\lim_{t \to \infty} x(t) = c \in E. \tag{4}$$

A solution is called quasi-convergent if it approaches asymptotically the stationary set:

$$\lim_{t \to \infty} d(x(t), E) = 0, \tag{5}$$

with $d(z, M)$ being the distance (in the usual sense) from the point $z$ to the set $M$. c) System (3) is called monostable (strictly mutable) if every bounded solution is convergent (in the above sense); it is called quasi-monostable if every bounded solution is quasi-convergent. d) System (3) is called gradient-like if every solution is convergent; it is called quasi-gradient-like (has global asymptotics) if every solution is quasi-convergent.

Note that convergence is a property of solutions, while monostability and the gradient property are associated with systems. For autonomous (time-invariant) systems of the form (2) the following Lyapunov-type results are available.

Lemma 1 Consider system (2) and assume the existence of a continuous function $V : \mathbb{R}^n \to \mathbb{R}$ that is nonincreasing along any of its solutions. If, additionally, any solution $x(t)$ bounded on $\mathbb{R}_+$ for which there exists some $\tau > 0$ such that $V(x(\tau)) = V(x(0))$ is an equilibrium, then the system is quasi-monostable.

Lemma 2 If the assumptions of Lemma 1 hold and, additionally, $V(x)$ is radially unbounded, then system (2) is quasi-gradient-like.

Lemma 3 If the assumptions of Lemma 2 hold and the set $E$ is discrete (i.e. it consists of isolated points only), then system (2) is gradient-like.
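As a small numerical illustration of gradient-like behavior for a system of the form (2), the following Python sketch integrates $\dot{x} = -\operatorname{grad} G(x)$ for a hypothetical double-well function $G$ chosen only for this example (it is radially unbounded and has three isolated critical points). The explicit Euler scheme and the step size are assumptions of the sketch; $G$ itself plays the role of the function $V$ of the lemmas.

```python
def G(x, y):
    """Double-well potential: radially unbounded, critical points (-1,0), (0,0), (1,0)."""
    return 0.25 * (x * x - 1.0) ** 2 + 0.5 * y * y

def grad_G(x, y):
    """Gradient of G, i.e. the function f in x' = -f(x)."""
    return (x * (x * x - 1.0), y)

def integrate(x, y, dt=0.01, steps=5000):
    """Explicit Euler integration of x' = -grad G(x); returns final state and G history."""
    values = [G(x, y)]
    for _ in range(steps):
        gx, gy = grad_G(x, y)
        x, y = x - dt * gx, y - dt * gy
        values.append(G(x, y))
    return (x, y), values

(x_end, y_end), values = integrate(0.5, 1.0)
print(x_end, y_end, values[0], values[-1])
```

Along the computed trajectory the value of $G$ does not increase, and the state settles at one of the equilibria rather than oscillating, which is the behavior the lemmas describe.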

DYNAMICS ISSUES OF RECURRENT NEURAL NETWORKS Neural Networks as Systems with Several Equilibria It has been already mentioned that the emergent computational capacities of the neural networks are ensured by: a) nonlinear behavior of the neural cells; b) their connectivity. These two properties define the neural networks as dynamical systems with many equilibria whose performance depends on the (high) number of these equilibria and on the gradient like property of the network. On the other hand, the standard recurrent neural networks (Bidirectional Associative Memory (Kosko, 1988), Hopfield (1982), cellular (Chua & Yang, 1988), Cohen-Grossberg (1983)), which contain internal feedback loops - having thus the propensity for instability, possess some “natural”, i.e. associated in a natural way,


Lyapunov function that allows the required qualitative properties to be obtained (Răsvan, 1998). One of the most general models of neural networks that has a natural Lyapunov function is the Cohen-Grossberg model described by

$$\dot{x}_i = a_i(x_i)\Big[b_i(x_i) - \sum_{j=1}^{n} c_{ij}\, d_j(x_j)\Big], \qquad i = 1, \dots, n, \tag{6}$$

with $c_{ij} = c_{ji}$; this model may be written as

$$\dot{x} = -A(x)\, \operatorname{grad} V(x), \tag{7}$$

where $A(x)$ is a diagonal matrix with the entries

$$A_{ij}(x) = \frac{a_i(x_i)}{d_i'(x_i)}\, \delta_{ij} \tag{8}$$

and $V : \mathbb{R}^n \to \mathbb{R}$ is defined by

$$V(x) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}\, d_i(x_i)\, d_j(x_j) - \sum_{i=1}^{n}\int_0^{x_i} b_i(\lambda)\, d_i'(\lambda)\, d\lambda. \tag{9}$$

The presence of $A(x)$ makes system (7) a pseudo-gradient system (compare with (2)). The properties of the associated Lyapunov function (9) give sufficient conditions for obtaining the required qualitative behaviors of the system. The derivative of (9) along the solutions is

$$W(x) = -\sum_{i=1}^{n} a_i(x_i)\, d_i'(x_i)\Big[b_i(x_i) - \sum_{j=1}^{n} c_{ij}\, d_j(x_j)\Big]^2 \le 0. \tag{10}$$

One can see that inequality (10) holds provided $a_i(\lambda) > 0$ and $d_i(\lambda)$ are monotone nondecreasing. If, additionally, $d_i(\lambda)$ are strictly increasing, then the set where $W = 0$ consists of equilibria only. The system is then quasi-gradient-like, i.e. every solution approaches asymptotically the stationary set. Consider now a model of an artificial neural network implemented by electrical circuits:

$$R_i C_i \frac{dv_i}{dt} = -v_i + \sum_{j=1}^{n} \frac{R_i}{R_{ij}}\big(\varphi_j(v_j) - v_j\big) + R_i I_i, \tag{11}$$

with $\varphi_j(\cdot)$ being sigmoidal. Since sigmoidal functions are subject to sector restrictions and global Lipschitz inequalities, it was only natural to try to improve the stability conditions using the Lyapunov functions suggested by the Popov frequency domain inequalities and the Yakubovich-Kalman-Popov lemma. For instance, in (Danciu & Răsvan, 2000) a rather general system with several sector-restricted nonlinearities was considered, and the Lyapunov function was constructed in a rational way starting from an improved frequency domain stability inequality of Popov type with PI multiplier. In the case of (11) this rather involved approach gives gradient-like behavior provided the symmetry condition $R_{ij} = R_{ji}$ is observed.

Time Delays in Neural Networks

We shall consider here the model (1). Since in the time delay case we do not (yet) have at our disposal an instrument like the Lyapunov-like lemmas given in BACKGROUND, we have to restrict ourselves to the analysis of the stability of a particular equilibrium. If $\bar{u}_i$, $i = 1, \dots, n$, is some equilibrium of (1) and if the deviations $z_i = u_i - \bar{u}_i$ are considered, the system in deviations is obtained:

$$\frac{dz_i}{dt} = -a_i z_i(t) - \sum_{j=1}^{n} w_{ij}\, \varphi_j\big(z_j(t-\tau_{ij})\big), \qquad i = 1, \dots, n, \tag{12}$$

with $\varphi_j(z_j) = g_j(\bar{u}_j) - g_j(\bar{u}_j + z_j)$. As is known, if $g_j : \mathbb{R} \to \mathbb{R}$ satisfy the usual sigmoid conditions, i.e. $g_j(0) = 0$, monotonically increasing and globally Lipschitz, that is

$$0 \le \frac{g_j(\sigma_1) - g_j(\sigma_2)}{\sigma_1 - \sigma_2} \le L_j, \qquad \forall \sigma_1 \ne \sigma_2, \tag{13}$$

then the $\varphi_j$ defined above are such as well. With the usual notations of the field, let $z_t(\cdot) = z(t + \cdot)$ denote the state of (12) at $t$; the state space will be considered $\mathcal{C}(-r, 0; \mathbb{R}^n)$ with $r = \max_{i,j} \tau_{ij}$, the space of continuous $\mathbb{R}^n$-valued mappings defined on $[-r, 0]$ with the usual norm of uniform convergence. One considers the Lyapunov-Krasovskii functional (the analogue of the Lyapunov function of the delayless case) suggested by (Nishimura & Kitamura, 1969), $V : \mathcal{C} \to \mathbb{R}_+$, as

$$V(z) = \sum_{i=1}^{n}\Big\{ \frac{1}{2}\pi_i z_i^2(0) + \lambda_i \int_0^{z_i(0)} \varphi_i(\theta)\, d\theta + \sum_{j=1}^{n} \int_{-\tau_{ij}}^{0} \big[\rho_{ij} z_j^2(\theta) + \delta_{ij}\varphi_j^2\big(z_j(\theta)\big)\big]\, d\theta \Big\} \tag{14}$$

with $\pi_i \ge 0$, $\lambda_i \ge 0$, $\rho_{ij} \ge 0$, $\delta_{ij} \ge 0$ some free parameters. Considering this functional along the solutions of (12) and differentiating it with respect to $t$, we may find the so-called derivative functional $W : \mathcal{C} \to \mathbb{R}$ as below

$$W(z) = \sum_{i=1}^{n}\Big\{ -a_i\pi_i z_i^2(0) - \lambda_i a_i\, \varphi_i\big(z_i(0)\big)\, z_i(0) - \big[\pi_i z_i(0) + \lambda_i \varphi_i\big(z_i(0)\big)\big] \sum_{j=1}^{n} w_{ij}\, \varphi_j\big(z_j(-\tau_{ij})\big) \Big\} + \sum_{i=1}^{n}\sum_{j=1}^{n} \big[\rho_{ij} z_j^2(0) + \delta_{ij}\varphi_j^2\big(z_j(0)\big) - \rho_{ij} z_j^2(-\tau_{ij}) - \delta_{ij}\varphi_j^2\big(z_j(-\tau_{ij})\big)\big]. \tag{15}$$

The problem of the sign of $W$ gives the following choice of the free parameters in (14) (Danciu & Răsvan, 2007):

$$\lambda_i > 0, \qquad \sigma_i = a_i^2 - \Big(\sum_{j=1}^{n} \frac{w_{ij}^2}{\delta_{ji}}\Big)\Big(\sum_{j=1}^{n} (\rho_{ji} + \delta_{ji})\Big) > 0, \qquad 2\Big(\sum_{j=1}^{n} \frac{w_{ij}^2}{\delta_{ji}}\Big)^{-1}(a_i - \sigma_i) < \pi_i < 2\Big(\sum_{j=1}^{n} \frac{w_{ij}^2}{\delta_{ji}}\Big)^{-1}(a_i + \sigma_i). \tag{16}$$

The application of the standard stability theorems for time delay systems (Hale & Verduyn Lunel, 1993) then gives asymptotic stability of the equilibrium $z = 0$ ($u = \bar{u}$). The mathematical result reads as follows. Theorem 3: Consider system (12) with $a_i > 0$ and $w_{ij}$ such that it is possible to choose $\rho_{ij} > 0$ and $\delta_{ij} > 0$ in order to satisfy $\sigma_i > 0$ with $\sigma_i$ defined in (16). Then the equilibrium is globally asymptotically stable.
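The convergence of a delayed Hopfield-type network of the form (1) to a single equilibrium can be observed numerically. The sketch below is only an illustration under assumptions: it uses explicit Euler integration with a history buffer, $g_j = \tanh$, constant initial histories, and invented parameter values for which $a_i > \sum_j |w_{ij}|$ (a simple sufficient condition used here for convenience; it does not check condition (16) of the theorem). The names `simulate` and `residual` are hypothetical helpers.

```python
import math

def simulate(a, w, tau, bias, u0, dt=0.01, steps=12000):
    """Euler integration of du_i/dt = -a_i u_i + sum_j w_ij tanh(u_j(t - tau_ij)) + I_i."""
    n = len(a)
    lag = [[round(tau[i][j] / dt) for j in range(n)] for i in range(n)]
    max_lag = max(max(row) for row in lag)
    hist = [list(u0) for _ in range(max_lag + 1)]   # constant history; hist[-1] is "now"
    for _ in range(steps):
        u = hist[-1]
        new = []
        for i in range(n):
            s = sum(w[i][j] * math.tanh(hist[-1 - lag[i][j]][j]) for j in range(n))
            new.append(u[i] + dt * (-a[i] * u[i] + s + bias[i]))
        hist.append(new)
        hist.pop(0)                                  # keep only the last max_lag + 1 states
    return hist[-1]

def residual(a, w, bias, u):
    """Largest violation of the equilibrium equation 0 = -a_i u_i + sum_j w_ij tanh(u_j) + I_i."""
    n = len(a)
    return max(abs(-a[i] * u[i] + sum(w[i][j] * math.tanh(u[j]) for j in range(n)) + bias[i])
               for i in range(n))

# Invented two-neuron example.
a = [1.0, 1.0]
w = [[0.2, -0.3], [0.3, 0.2]]
tau = [[0.5, 1.0], [1.0, 0.5]]
bias = [0.2, -0.1]

u_a = simulate(a, w, tau, bias, [2.0, -2.0])
u_b = simulate(a, w, tau, bias, [-3.0, 1.0])
print(u_a, u_b, residual(a, w, bias, u_a))
```

Both initial histories lead to the same final state, and that state satisfies the equilibrium equation of (1) up to numerical error, in line with global asymptotic stability of a unique equilibrium.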

Synchronization Problems From this point of view the qualitative behavior of the network is nothing more than its behavior under time-varying stimuli. This is particularly true for the modeling of rhythmic activities in the nervous system (Kopell, 2000) or the synchronization of oscillatory responses (König & Schillen, 1991). Both rhythmicity and synchronization suggest some recurrence, and this implies coefficients and stimuli being periodic or almost periodic. The model with time-varying stimulus has the form

$$\frac{du_i}{dt} = -a_i u_i(t) - \sum_{j=1}^{n} w_{ij}\, f_j\big(u_j(t-\tau_{ij})\big) + c_i(t), \qquad i = 1, \dots, n, \tag{17}$$

under the same assumptions as previously, with the functions $f_i : \mathbb{R} \to [-1, 1]$ being sigmoidal and therefore globally Lipschitz. The forcing stimuli $c_i(t)$ are periodic or almost periodic, and the main mathematical problem is to find conditions on the system that ensure existence and exponential stability of a unique global (i.e. defined on $\mathbb{R}$) solution which has the features of a limit regime, i.e. it is not defined by initial conditions and is of the same type as the stimulus, periodic or almost periodic respectively. This is an “almost linear behavior” for reasons that are obvious. The approach to this problem is to obtain some estimates of the system’s solutions, which finally give information about the system’s convergence and ultimate boundedness. Next we have to apply a fixed-point theorem, and we use the theorems of Halanay (Halanay, 1967) on invariant manifolds for flows on Banach spaces (see (Danciu, 2002) for details and simulation results). We give below a theorem based on the application of the Lyapunov functional (14), but restricted to be only quadratic in the state variables ($\lambda_i = 0$, $\delta_{ij} = 0$),

$$V(u) = \sum_{i=1}^{n}\Big\{ \frac{1}{2}\pi_i u_i^2(0) + \sum_{j=1}^{n} \rho_{ij} \int_{-\tau_{ij}}^{0} u_j^2(\theta)\, d\theta \Big\}, \tag{18}$$

with $\pi_i > 0$, $\rho_{ij} > 0$, $i, j = 1, \dots, n$. We may state Theorem 2 Assume that $a_i > 0$, $L_i > 0$ and $w_{ij}$ are such that the derivative functional corresponding to $c_i(t) \equiv 0$ in (17), namely


$$W(u) = \sum_{i=1}^{n}\Big\{ -a_i\pi_i u_i^2(0) - \pi_i u_i(0) \sum_{j=1}^{n} w_{ij}\, f_j\big(u_j(-\tau_{ij})\big) \Big\} + \sum_{i=1}^{n}\sum_{j=1}^{n} \rho_{ij}\big[u_j^2(0) - u_j^2(-\tau_{ij})\big], \tag{19}$$

is negative definite with a quadratic upper bound. Then system (17) has a unique global solution $u_i(t)$, $i = 1, \dots, n$, which is bounded on $\mathbb{R}$ and exponentially stable. Moreover, this solution is periodic or almost periodic according to the character of $c_i(t)$, periodic or almost periodic respectively.
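The limit regime described by the theorem can be illustrated with a scalar instance of (17). The following sketch is a hedged toy example, not the simulation reported in (Danciu, 2002): it assumes a single neuron, $f = \tanh$, a sinusoidal stimulus of period 4, explicit Euler integration, and invented values $a = 1$, $w = 0.5$, $\tau = 1$; the function name `simulate_forced` is hypothetical.

```python
import math

def simulate_forced(u0, a=1.0, w=0.5, tau=1.0, period=4.0, dt=0.01, steps=8000):
    """Euler integration of du/dt = -a u(t) - w tanh(u(t - tau)) + sin(2 pi t / period)."""
    lag = round(tau / dt)
    u = [u0] * (lag + 1)                       # constant initial history on [-tau, 0]
    for k in range(steps):
        c = math.sin(2.0 * math.pi * k * dt / period)   # periodic stimulus c(t)
        du = -a * u[-1] - w * math.tanh(u[-1 - lag]) + c
        u.append(u[-1] + dt * du)
    return u

traj_a = simulate_forced(2.0)
traj_b = simulate_forced(-3.0)
steps_per_period = round(4.0 / 0.01)
print(traj_a[-1], traj_b[-1], traj_a[-1 - steps_per_period])
```

After the transient has died out, the two trajectories started from different initial histories coincide, and the common trajectory repeats with the period of the stimulus: the limit regime does not depend on the initial conditions and is of the same type as the forcing signal.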

FUTURE TRENDS Supposing the field of AI has its own dynamics, neural networks and their structures will evolve in order to improve their imitative behavior, i.e. more of the “natural” intelligence will be transferred to AI. Consequently, science and technology will deal with new structures of various physical natures having multiple equilibria. At least the following qualitative behaviors will remain under study: stability-like properties (dichotomy, gradient behavior and so on), synchronization (forced oscillations, almost linear behavior, chaos control) and complex dynamics (including chaotic behavior).

CONCLUSIONS Our experience with neural network dynamics shows that the most important task is to obtain conditions for gradient-like or quasi-gradient-like behavior. Besides the comparison method of (Popov, 1979), which requires relaxation of the condition of identical dynamics for all neurons, the most popular tool remains the Lyapunov method. If the Lyapunov-like lemmas given in BACKGROUND were available in the time delay case, then improved Lyapunov functionals remaining constant on the set of equilibria could ensure gradient-like behavior.

REFERENCES Atencia, M., Joya, G., Sandoval, F. (2004). Parametric identification of robotic systems with stable timevarying Hopfield networks. Neural Computing and Applications, Springer London, 13(4), 270-280. Chua, L. & Yang, L. (1988). Cellular neural networks: theory and applications, IEEE Transactions on Circuits and Systems, CAS-35, 1257-1290. Cohen, M. A. & Grossberg, S. (1983). Absolute stability of pattern formation and parallel storage by competitive neural networks. IEEE Transactions of Systems, Man & Cybernetics, 13, 815-826. Danciu, D. (2002). Qualitative behavior of the time delay Hopfield type neural networks with time varying stimulus. Annals of The University of Craiova, Series: Electrical Engineering 26, 72–82. Danciu, D. & Răsvan, V. (2000). On Popov-type stability criteria for neural networks. Electronic Journal on Qualitative Theory of Differential Equations 23. http://www.math.uszeged.hu/ejqtde/6/623.pdf Danciu, D. & Răsvan, V. (2007). Dynamics of Neural Networks – Some Qualitative Properties. Computational and ambient Intelligence. Lectures Notes in Computer Science, (4507), F. Sandoval, A. Prieto, J. Cabestany, editors, 8-15. Fink, W. (2004). Neural attractor network for application in visual field data classification. Physics in Medicine and Biology, 49(13), 2799-2809. Fortuna, L., Arena, P., Balya, D. & Zarandy, A. (2001). Cellular Neural Networks. IEEE Circuits and Systems Magazine, 4, 6–21. Gelig, A. Kh., Leonov, G. A. & Yakubovich, V.A. (1978). Stability of nonlinear systems with non-unique equilibrium state. (in Russian) U.R.S.S.: Moscow, Nauka Publishers House. Gelig, A. Kh. (1982) Dynamics of pulse systems and neural networks (in Russian). Leningrad Univ. Publishing House. Gopalsamy, K. & He, X. Z. (1994). Stability in asymmetric Hopfield nets with transmission delays, Physica D., 76, 344-358.



Guirguis, L.A., Ghoneimy, M.M.R.E. (2007). Channel Assignment for Cellular Networks Based on a Local Modified Hopfield Neural Network. Wireless Personal Communications, Springer US, 41(4), 539-550.

Nishimura, M., Kitamura, S. & Hirai, K. (1969). A Lyapunov Functional for Systems with Multiple Nonlinearities and Time Lags, Technological Reports, Japan: Osaka University, 19(860), 83-88.

Halanay, A. (1967). Invariant manifolds for systems with time lag. Differential and dynamical systems. Hale & La Salle editors, New York, Academic Press, 199–213.

Popov, V.M. (1979). Monotonicity and Mutability. Journal of Differential Equations, 31(3), 337-358.

Hale J. K. & Verduyn Lunel, S. M. (1993). Introduction to Functional Differential Equations. SpringerVerlag. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of National Academic Science U.S.A., 79, 2554-2558. Iwahori, Y., Kawanaka, H., Fukui, S., Funahashi, K. (2005). Obtaining Shape from Scanning Electron Microscope using Hopfield Neural Network. Journal of Intelligent Manufacturing, Springer US, 16(6), 715-725. Kalman, R. E. (1957). Physical and mathematical mechanisms of instability in nonlinear automatic control systems. Transactions American Society of Mechanical Engineers, 79(3). König, P. & Schillen, J.B. (1991). Stimulus dependent assembly formation of oscillatory responses: I. Synchronization. Neural Computation, 3, 155-166. Kopell, N. (2000). We got rhythm: dynamical systems of the nervous system. Notices of American Mathematical Society, 47, 6-16. Kosko, B. (1988). Bidirectional associative memories. IEEE Transactions Systems, Man and Cybernetics, 18, 49-60.

Răsvan, V. (1998). Dynamical Systems with Several Equilibria and Natural Lyapunov Functions, Archivum mathematicum, 34(1), [EQUADIFF 9], 207-215.

Leonov, G.A., Reitmann, V. & Smirnova, V.B. (1992). Non-local methods for pendulum-like feedback systems, Germany: Leipzig, Teubner Verlag.

Maurer, A., Hersch, M. and Billard, A. G. (2005). Extended Hopfield Network for Sequence Learning: Application to Gesture Recognition. Proceedings of 15th International Conference on Artificial Neural Network, 493-498.

Moser, J. (1967). On nonoscillating networks. Quarterly Applied Mathematics, 25, 1-9.

KEY TERMS Asymptotic Stability: The solution $\bar{x}(t)$ of (3) is called asymptotically stable if it is Lyapunov stable (see below) and, moreover, there exists $\delta_0 > 0$ such that if $|x_0 - \bar{x}(t_0)| < \delta_0$ then $\lim_{t \to \infty} |x(t; t_0, x_0) - \bar{x}(t)| = 0$.

Fixed Point Theorem: If $f(x)$ is some function of a real variable with real values, the values such that $f(x) = x$ are called the fixed points of the mapping. In general, if $f : X \to X$ is a mapping from the metric space $X$ into itself, the fixed points of this mapping are defined as above. A fixed point theorem is a theorem showing under which conditions some mapping has a fixed point in the corresponding metric space. Frequency Domain Stability Inequality of Popov: Consider a feedback structure containing a linear dynamical block with the transfer function $H(s)$ and a nonlinear function subject to the sector condition $0 < \varphi(\sigma)\sigma < k\sigma^2$. The Popov inequality ensures absolute stability, i.e. global asymptotic stability of the zero equilibrium for all nonlinear functions satisfying the above inequality, and reads as follows: there exists some $\beta$ such that

$$\frac{1}{k} + \operatorname{Re}\big[(1 + j\omega\beta)\, H(j\omega)\big] > 0, \qquad \forall \omega \in \mathbb{R}.$$

Global Stability: An equilibrium is globally (asymptotically) stable if it is the unique equilibrium of the dynamical system and the property holds globally (its domain of attraction is the entire state space).

Lyapunov Function: State scalar function defined on the state space of a system in order to obtain some qualitative properties - stability of equilibria, oscillatory


behavior etc. - using a single function instead of several i.e. system’s state trajectories. A Lyapunov function is usually positive definite and, along system’s trajectories, is at least nonincreasing. The definite sign condition may also be relaxed for the generalized Lyapunov functions in the LaSalle sense. The basic physical model for the Lyapunov function is system’s energy - a state function that is nonincreasing along the state trajectory being at the same time positive definite. The strength of the Lyapunov function is exactly its independence of the physical concepts since writing down the stored energy of a system is not an easy job except possibly such standard cases as mechanical systems or electrical circuits. The energy like concepts may be nevertheless inspiring when “guessing” a Lyapunov function. In the infinite dimensional cases e.g. time delay or propagation systems, the Lyapunov function is replaced by a Lyapunov functional defined on the infinite dimensional state space. Oscillations (Self-Sustained and Forced): Type of steady state behavior when the state trajectories, while remaining bounded, never reach an equilibrium but their deviations from this equilibrium keep sign changing. Usually an oscillation is viewed as having some recurrent properties, being either periodic or almost periodic. When the system is autonomous i.e. free of external oscillatory signals while nevertheless displaying an oscillatory behavior which is sustained by non-oscillatory internal factors of the system, it is said that this system displays self-sustained oscillations (the term belongs to Mandelstamm and Andronov). When the system is non-autonomous and subject to external oscillatory signals (stimuli), the limit regime that occurs is called forced oscillation. Phase Portrait: Term borrowed from the Poincaré theory of the phase (space) plane where this portrait is better defined. 
Its extension to higher order systems is mainly informal, based on geometric arguments. By phase portrait it is understood the total of state trajectories as limit regimes (equilibria, recurrent motions, limit sets) and standard trajectories e.g. defined by initial conditions.

Recurrent Neural Network (RNN): Neural networks which display feedback interconnections among their units (neurons). Due to these cyclic connections RNNs are nonlinear dynamical systems with very rich spatial and temporal behaviors: stable and unstable fixed points, limit cycles and chaotic behavior. These behaviors make them suitable for modeling certain cognitive functions such as associative memory, unsupervised learning, self-organizing maps and temporal reasoning. Synchronization: Interaction phenomenon among coupled subsystems of a system resulting in some ordering of their evolution. Its maximal stage is the complete synchronization of the subsystems’ periods resulting in a periodic evolution of the state of the entire system. When a system is externally forced by an oscillatory signal, synchronization means a limit regime of the entire state which has the same waveform as the forcing signal (periodic with the same period if the forcing signal is periodic, or almost periodic if the forcing signal is such). Stability: Qualitative property of a solution of a system, expressing the limitation of the effect of perturbations on the considered solution, viewed as basic. Among all kinds of stability (bounded input/bounded output, Lagrange stability, Birkhoff stability, input-to-state stability), stability in the sense of Lyapunov, with respect to the initial conditions viewed as incorporating the effect of short-period perturbations, is the most widely used; it means that sufficiently small deviations in the initial condition (state) will result in arbitrarily small deviations in the current state at all following moments. Rigorously, the basic solution $\bar{x}(t)$ of (3) is called stable in the sense of Lyapunov if, for any $\varepsilon > 0$ arbitrarily small and any $t_0 \in \mathbb{R}$, there exists some $\delta(\varepsilon, t_0) > 0$ sufficiently small such that if $|x_0 - \bar{x}(t_0)| < \delta(\varepsilon, t_0)$, then $|x(t; t_0, x_0) - \bar{x}(t)| < \varepsilon$ for all $t > t_0$.
If in the above definition $\delta$ is independent of the initial moment $t_0$, the stability is called uniform; from the practical point of view, this is the more important notion of stability. It is also a necessary condition for uniform asymptotic stability (see above).


Neural Networks and HOS for Power Quality Evaluation
Juan J. González De la Rosa, Universities of Cádiz-Córdoba, Spain
Carlos G. Puntonet, University of Granada, Spain
A. Moreno-Muñoz, Universities of Cádiz-Córdoba, Spain

INTRODUCTION

Power quality (PQ) event detection and classification is gaining importance due to the worldwide use of delicate electronic devices. Events such as lightning, large switching loads, non-linear load stresses, inadequate or incorrect wiring and grounding, or accidents involving electric lines can cause problems for sensitive equipment if it is designed to operate within narrow voltage limits, or if it cannot filter fluctuations in the electrical supply (Gerek et al., 2006; Moreno et al., 2006). Solving a PQ problem implies acquiring and monitoring long data records from the energy distribution system, along with an automated detection and classification strategy that allows the cause of these voltage anomalies to be identified. Signal processing tools have been widely used for this purpose, and are mainly based on spectral analysis and wavelet transforms. These second-order methods, the most familiar to the scientific community, are based on the independence of the spectral components and on the evolution of the spectrum in the time domain. Other tools are threshold-based algorithms, linear classifiers and Bayesian networks. The goal of the signal processing analysis is to obtain a feature vector from the data record under study; this vector constitutes the input to the computational intelligence module, whose task is classification. Some recent works adopt a different strategy, based on higher-order statistics (HOS), to analyze transients within PQ analysis (Gerek et al., 2006; Moreno et al., 2006) and in other fields of science (De la Rosa et al., 2004, 2005, 2007).

Without perturbation, the 50-Hz voltage waveform exhibits Gaussian behaviour. Deviations from Gaussianity can be detected and characterized via HOS. Non-Gaussian processes need third- and fourth-order statistical characterization in order to be recognized. In other words, second-order moments and cumulants may not be capable of differentiating non-Gaussian events. The situation described matches the problem of differentiating between a long-duration transient, named fault (within a signal period), and a short-duration transient (25 per cent of a cycle). The latter can also bring the 50-Hz voltage to zero instantly and generally affects the sinusoid dramatically. By contrast, the long-duration transient can be considered a modulating signal (the 50-Hz signal is the carrier). These transients are intrinsically non-stationary, so a battery of observations (sample registers) is necessary to obtain a reliable characterization.

The main contribution of this work consists of the application of higher-order central cumulants to characterize PQ events, along with the use of a competitive layer as the classification tool. Results reveal that two different clusters, associated with the two types of transients, can be recognized in the 2-D graph. The successful results convey the idea that the underlying physical processes associated with the analyzed transients generate different types of deviations from the typical effects that noise causes in the 50-Hz sinusoidal voltage waveform.

The paper is organized as follows. The section on higher-order cumulants summarizes the main equations of the cumulants used in the paper. Then we recall the foundations of the competitive layer, along with the Kohonen learning rule. The experiment is then described, and the conclusions are drawn.


HIGHER-ORDER CUMULANTS

Higher-order statistics, known as cumulants, are used to infer new properties about the data of non-Gaussian processes (Mendel, 1991; Nikias & Mendel, 1993). The relationship among the cumulants of r stochastic signals, $\{x_i\}_{i \in [1,r]}$, and their moments of order p, p ≤ r, can be calculated by using the Leonov-Shiryayev formula (Nandi, 1999; Nikias & Mendel, 1993). For an rth-order stationary random process {x(t)}, the rth-order cumulant is defined as the joint rth-order cumulant of the random variables $x(t), x(t+\tau_1), \ldots, x(t+\tau_{r-1})$:

$$C_{r,x}(\tau_1, \tau_2, \ldots, \tau_{r-1}) = \mathrm{Cum}[x(t), x(t+\tau_1), \ldots, x(t+\tau_{r-1})]. \tag{1}$$

Considering $\tau_1 = \tau_2 = \tau_3 = 0$ in Eq. (1), we have some particular cases (written here for a zero-mean process):

$$\gamma_{2,x} = E\{x^2(t)\} = C_{2,x}(0), \tag{2a}$$

$$\gamma_{3,x} = E\{x^3(t)\} = C_{3,x}(0,0), \tag{2b}$$

$$\gamma_{4,x} = E\{x^4(t)\} - 3(\gamma_{2,x})^2 = C_{4,x}(0,0,0). \tag{2c}$$

Eqs. (2) are measurements of the variance, skewness and kurtosis of the statistical distribution, in terms of the cumulants at zero lags. We will use and refer to normalized quantities because they are shift and scale invariant.

COMPETITIVE LAYERS

The neurons in a competitive layer distribute themselves to recognize frequently presented input vectors. The competitive transfer function accepts a net input vector p for a layer (each neuron competes to respond to p) and returns outputs of 0 for all neurons except for the winner, which is associated with the most positive element of the net input. For zero bias, the neuron whose weight vector is closest to the input vector has the least negative net input and, therefore, wins the competition to output a 1. The winning neuron moves closer to the input after the input has been presented. The weights of the winning neuron are adjusted with the Kohonen learning rule. If, for example, the ith neuron wins, the elements of the ith row of the input weight matrix (IW) are adjusted as shown in Eq. (3):

$$IW_i^{1,1}(q) = IW_i^{1,1}(q-1) + \alpha\,\left[p(q) - IW_i^{1,1}(q-1)\right], \tag{3}$$

where p is the input vector, q is the time instant, and α is the learning rate. The Kohonen rule allows the weights of a neuron to learn an input vector, so it is useful in recognition applications. The winning neuron is more likely to win the competition the next time a similar vector is presented. As more and more inputs are presented, each neuron in the layer closest to a group of input vectors soon adjusts its weight vector toward those inputs. Eventually, if there are enough neurons, every cluster of similar input vectors will have a neuron that outputs "1" when a vector in the cluster is presented.
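The competitive layer and the Kohonen rule of Eq. (3) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy feature vectors, the learning rate of 0.1 and the tie-breaking rule (the first neuron wins a tie) are all assumptions made for illustration. The zero-bias net input is taken as the negative Euclidean distance, so the closest neuron has the most positive net input.

```python
import math

def net_input(weights, p):
    """Net input of each neuron: negative Euclidean distance to input p (zero bias)."""
    return [-math.dist(w, p) for w in weights]

def winner_of(weights, p):
    """Index of the neuron with the most positive net input (first one on ties)."""
    n = net_input(weights, p)
    return max(range(len(n)), key=lambda i: n[i])

def train_competitive_layer(inputs, n_neurons=2, alpha=0.1, epochs=20):
    """Kohonen rule, Eq. (3): w_i(q) = w_i(q-1) + alpha * (p(q) - w_i(q-1))."""
    dims = range(len(inputs[0]))
    # Every weight vector starts at the midpoint of the input intervals.
    midpoint = [(min(p[d] for p in inputs) + max(p[d] for p in inputs)) / 2
                for d in dims]
    weights = [list(midpoint) for _ in range(n_neurons)]
    for _ in range(epochs):
        for p in inputs:
            i = winner_of(weights, p)  # only the winning neuron learns
            weights[i] = [w + alpha * (x - w) for w, x in zip(weights[i], p)]
    return weights

# Hypothetical 2-D feature vectors forming two well-separated clusters.
cluster_a = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (0.0, 0.0)]
cluster_b = [(1.0, 0.9), (0.9, 1.0), (1.0, 1.0), (0.9, 0.9)]
weights = train_competitive_layer(cluster_a + cluster_b)
labels_a = {winner_of(weights, p) for p in cluster_a}
labels_b = {winner_of(weights, p) for p in cluster_b}
```

With clusters this far apart, each neuron ends up near one cluster. Starting all weights from the same point can leave a neuron that never wins when clusters are less well separated, which is why practical implementations often add a bias term.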

EXPERIMENTAL RESULTS

The aim is to differentiate between two classes of PQ events, named long-duration and short-duration. The experiment comprises two stages. The feature extraction stage is based on the computation of cumulants: the coordinates of each feature vector correspond to the local maximum and minimum of the 4th-order central cumulant. The classification stage is based on the application of the competitive layer to the feature vectors. We use a two-neuron competitive layer, which receives two-dimensional input feature vectors during network training.

We analyze 16 real-life registers of 1000 points each during the feature extraction stage. Before the computation of the cumulants, two pre-processing actions are performed on the sample signals. First, they are normalized because they exhibit voltage levels of very different magnitude. Secondly, a high-pass digital filter (5th-order Butterworth model with a characteristic frequency of 150 Hz) eliminates the low-frequency components, which are not the targets of the experiment. This also enhances the non-Gaussian characteristics of the signals, which are reflected in the higher-order cumulants. Fig. 1 shows the comparison of the two types of events.

Figure 1. Analysis for two types of transients

After pre-processing, a battery of sliding central cumulants (2nd-, 3rd- and 4th-order) is calculated. Each cumulant is computed over 50 points; this window length has been selected to be neither so long that it covers the whole signal nor too short. The algorithm calculates the 3 central cumulants over 50 points and then jumps to the following starting point; as a consequence we have 98 per cent overlapping sliding windows (49/50 = 0.98). Each computation over a window (called a segment) outputs 3 cumulants.

The signal processing analysis indicates that the 2nd-order cumulant sequence (the variance) clearly indicates the presence of an event. Both types of transients exhibit an increasing variance in the neighbourhood of the PQ event, with the same shape and only one maximum. The magnitude of this maximum is the only available feature that can be used to distinguish different events from the second-order point of view.

In the classification stage, the two-dimensional representation (2-D feature vectors) yields very intelligible 2-D graphs for the 4th-order cumulants. The 3rd-order diagrams do not show clearly different clusters because the maxima and minima are similar. It would be possible to differentiate PQ events from the 3rd-order perspective if more features were considered in the input vector (perhaps 3-D feature vectors), such as the number of extremes (maxima and minima) and the order in which the maxima and the minima appear as time increases.

The sliding 4th-order cumulants exhibit clear differences, not only in the shape of the time-domain graphs but also in the location of the minima, which suggests a clustering of the points in the 2-D feature space. Fig. 2 shows an example of a 4th-order cumulant sequence comparison for the two types of transients. For each sample register (data record), the sliding 4th-order cumulant sequence is calculated (as in Fig. 2), and its maximum and minimum are detected and selected as a point in the feature space.

Figure 2. Comparison of 4th-order cumulants' sequences for two types of transients

Fig. 3 presents the results of the training stage, using the Kohonen rule. The horizontal (vertical) axis corresponds to the maxima (minima) values. Each cross in the diagram corresponds to an input vector, and the circles indicate the final location of the weight vector (after learning) for the two neurons of the competitive layer. Before training, both weight vectors pointed to the asterisk, which is the initializing point (the midpoint of the input intervals). The separation between classes (inter-class distance) is well defined, and both types of PQ events are clustered. The correct configuration of the clusters is corroborated during the simulation of the neural network, in which we have obtained an approximate classification accuracy of 97 percent. During the simulation, new signals (randomly selected from our database) were processed using this methodology. The accuracy of the classification results increases with the number of data. To evaluate the confidence of the statistics, a significance test has been conducted; it indicates that the number of measurements is statistically adequate.
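The feature extraction stage described above (sliding 50-point windows with a step of one sample, then the maximum and minimum of the 4th-order sequence) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the cumulants are estimated from plain (biased) sample central moments, the pre-processing (normalization and high-pass filtering) is omitted, and the test register is synthetic.

```python
import math

def central_cumulants(window):
    """2nd-, 3rd- and 4th-order central cumulants at zero lag, from sample moments."""
    n = len(window)
    mean = sum(window) / n
    centred = [x - mean for x in window]
    m2 = sum(v ** 2 for v in centred) / n
    m3 = sum(v ** 3 for v in centred) / n
    m4 = sum(v ** 4 for v in centred) / n
    return m2, m3, m4 - 3 * m2 ** 2  # same form as Eqs. (2a)-(2c)

def sliding_cumulants(signal, width=50):
    """One cumulant triple per window; consecutive windows share width - 1 points."""
    return [central_cumulants(signal[s:s + width])
            for s in range(len(signal) - width + 1)]

def feature_vector(signal, width=50):
    """2-D feature: (maximum, minimum) of the sliding 4th-order cumulant sequence."""
    c4 = [c[2] for c in sliding_cumulants(signal, width)]
    return max(c4), min(c4)

# Synthetic register (an assumption): a sinusoid with one impulsive disturbance.
register = [math.sin(2 * math.pi * k / 20) for k in range(200)]
register[120] += 3.0
features = feature_vector(register)
```

Each register then contributes one point `features` to the 2-D feature space used to train the competitive layer.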

Figure 3. Competitive layer training results over 20 epochs. Upper cluster: short-duration PQ events. Lower cluster: long-duration events.

CONCLUSION

In this paper we have proposed an automatic method to detect and classify two PQ transients, named short-duration and long-duration. The method comprises two stages. The first includes pre-processing (normalizing and filtering) and outputs the 2-D feature vectors, whose coordinates correspond to the maximum and minimum of the central cumulants. The second stage uses a neural network to classify the signals into two clusters. This stage is different in nature from the one used in (Gerek et al., 2006), which consists of quadratic classifiers. The configuration of the clusters is assessed during the simulation of the network, in which we have obtained acceptable classification accuracy.

ACKNOWLEDGMENT

We would like to acknowledge the Spanish Ministry of Education and Science for funding the projects DPI2003-00878 and PETRI-95-0824-OP, and the Andalusian Government for funding the project PAI2005-TIC00155.

REFERENCES

Bendat, J., Piersol, A.: Random Data Analysis and Measurement Procedures, 3rd Edition, Vol. 1 of Wiley Series in Probability and Statistics, Wiley Interscience, 2000.

Chonavel, T.: Statistical Signal Processing. Modelling and Estimation, 1st ed., ser. Advanced Textbooks in Control and Signal Processing. London: Springer, 2003, vol. 1.

De la Rosa, J.J.G., Puntonet, C.G., Lloret, I., Górriz, J.M.: Wavelets and wavelet packets applied to termite detection. In: ICCS 2005. LNCS, vol. 3514, pp. 900–907. Springer, Heidelberg (2005)

De la Rosa, J.J.G., Ruzzante, J., Piotrkowski, R.: Third-order spectral characterization of acoustic emission signals in ring-type samples from steel pipes for the oil industry. In: Elsevier (Ed.) Mechanical Systems and Signal Processing, vol. 21, pp. 1917–1926 (2007)

De la Rosa, J.J.G., Lloret, I., Puntonet, C.G., Górriz, J.M.: Higher-order statistics to detect and characterise termite emissions. Electronics Letters 40, 1316–1317, Ultrasonics (2004)

De la Rosa, J.J.G., Puntonet, C.G., Lloret, I.: An application of the independent component analysis to monitor acoustic emission signals generated by termite activity in wood. In: Elsevier (Ed.) Measurement, vol. 37, pp. 63–76 (2005)

De la Rosa, J.J.G., Moreno-Muñoz, A.: Higher-order cumulants and spectral kurtosis for early detection of subterranean termites. Mechanical Systems and Signal Processing (Ed. Elsevier), vol. In Press, Accepted Manuscript, 2007, available online 1 September 2007.

Gerek, O.N., Ece, D.G.: Power-quality event analysis using higher order cumulants and quadratic classifiers. IEEE Transactions on Power Delivery 21, 883–889 (2006)

Mendel, J.M.: Tutorial on higher-order statistics (spectra) in signal processing and system theory: Theoretical results and some applications. In: Proceedings of the IEEE 79, 278–305 (1991)

Moreno, A., Pallarés, V., De la Rosa, J.J.G., Galisteo, P.: Study of voltage sag in a highly automated plant. In: MELECON 2006, Proceedings of the 2006 13th IEEE Mediterranean Electrotechnical Conference.

Moreno-Muñoz, A., Redel, Mª D.: Calm in the campus: power disturbances threaten university life. IEE Power Engineer, 19 (4), (2005), p. 34

Moreno-Muñoz, A., Redel, M.D., González, M.: Power quality in high-tech campus. Proc. of the Institution of Mechanical Engineers, part A: Journal of Power and Energy, 220 (3), (2006), p. 257

Nandi, A.K.: Blind Estimation using Higher-Order Statistics, 1st Edn., vol. 1. Kluwer Academic Publishers, Boston (1999)

Nikias, C.L., Mendel, J.M.: Signal processing with higher-order spectra. IEEE Signal Processing Magazine, pp. 10–37 (1993)

Nikias, C.L., Petropulu, A.P.: Higher-Order Spectra Analysis. In: A Non-Linear Signal Processing Framework, Prentice-Hall, Englewood Cliffs, NJ (1993)

KEY TERMS

Artificial Neural Networks: A network of many simple processors ("units" or "neurons") that imitates a biological neural network. The units are connected by unidirectional communication channels, which carry numeric data. Neural networks can be trained to find nonlinear relationships in data, and are used in applications such as robotics, speech recognition, signal processing or medical diagnosis.

Cluster: A set of occurrences sharing characteristics associated with some previously analyzed signals.

Cumulants: Statistics that characterize a probability distribution. A distribution with given cumulants can be approximated through the Edgeworth series.

Competitive Layer: A layer whose neurons distribute themselves to recognize frequently presented input vectors.

HOS: Higher-Order Statistics; the set of statistics of order higher than 2. Their main advantage is the rejection of noise from symmetrically distributed processes.

Power Quality: The branch of research that studies techniques for assessing the quality of electricity.

Transient: A signal that vanishes over time and is usually of short duration. Transients are very common in industrial applications and may occur either in a repeatable fashion or as random impulses.


Neural Networks on Handwritten Signature Verification
J. Francisco Vargas, University of Las Palmas de Gran Canaria, Spain & Universidad de Antioquia, Colombia
Miguel A. Ferrer, University of Las Palmas de Gran Canaria, Spain

INTRODUCTION

Biometrics offers potential for automatic personal identification and verification. Unlike other means of personal verification, biometric means are not based on the possession of anything (such as cards) or the knowledge of some information (such as passwords). There is considerable interest in biometric authentication based on automatic signature verification (ASV) systems because ASV has been shown to be superior to many other biometric authentication techniques, e.g. fingerprints or retinal patterns, which are reliable but much more intrusive and expensive. An ASV system is a system capable of efficiently addressing the task of deciding whether a signature is genuine or forged.

Numerous pattern recognition methods have been applied to signature verification. Among the methods that have been proposed for pattern recognition in ASV, two broad categories can be identified: memory-based methods and parameter-based methods such as neural networks. The major approaches to ASV systems are the template matching approach, spectrum approach, spectrum analysis approach, neural networks approach, cognitive approach and fractal approach.

This article reviews ASV techniques corresponding to the approaches that have so far been proposed in the literature. An attempt is made to describe important techniques, especially those involving ANNs, and to assess their performance based on published literature. The article also discusses possible future areas for research using ASV.

BACKGROUND

Like any human production, handwriting is subject to many variations of very diverse origins: historic, geographic, ethnic, social, psychological, etc. (Bouletreau, 1998). ASV is a difficult problem because signature samples from the same person are similar but not identical. In addition, a person's signature often changes radically during their lifetime (Hou, 2004). Although these factors can affect a given instance of a person's writing, writing style develops as the writer learns to write, as do consistencies which are typically retained (Guo, 1997). One of the methods used by expert document examiners is to try to exploit these consistencies and identify those which are both stable and difficult to imitate.

In general, ASV systems can be categorized into two kinds: on-line and off-line systems. In on-line systems, the use of electronic devices to capture the dynamics of the signature makes it possible to register more information about the signing process, which improves system performance. In off-line approaches to ASV, this dynamic information is lost and only a static image is available. This makes it quite difficult to define effective global or local features for verification purposes.

Three different types of forgeries are usually taken into account in an ASV system: random forgeries, produced without knowing either the name of the signer or the shape of his signature; simple forgeries, produced knowing the name of the signer but without having an example of his signature; and skilled forgeries, produced by people who, looking at an original instance of the signature, attempt to imitate it as closely as possible. The problem of signature verification becomes more difficult when passing from random to simple and skilled forgeries, the latter being so difficult a task that even human beings make errors in several cases. It is worth pointing out that several systems proposed up to now, while performing reasonably well on a single category of forgeries, decrease in performance when working with all the categories simultaneously, and generally this decrement is bigger than one would expect (Abuhaiba, 2007; Ferrer, 2005).



Numerous pattern recognition methods have been applied to signature verification (Plamondon, 1989). Among the methods that have been proposed for pattern recognition, two broad categories can be identified: memory-based techniques, in which incoming patterns are matched to a (usually large) dictionary of templates, and parameter-based methods, in which pre-processed patterns are sent to a trainable classifier such as a neural network (Lippmann, 1987). Memory-based recognition methods require a large memory space to store the templates, while a neural network is a parameter-based approach that requires only a small amount of memory to store the linking weights among neurons.

Mighell et al. (Mighell, 1989) were apparently the first to apply NNs to off-line signature classification. Sabourin and Drouhard (Sabourin, 1992) presented a method based on directional probability density functions together with a back-propagation neural network (BPN) to detect random forgeries. Qi and Hunt (Qi, 1996) used global and grid features with a simple Euclidean distance classifier. Sansone and Vento (Sansone, 2000) proposed a sequential three-stage multi-expert system, in which the first expert eliminates random and simple forgeries, the second isolates skilled forgeries, and the third gives the final decision by combining the decisions of the previous stages together with reliability estimations. Baltzakis and Papamarkos (Baltzakis, 2001) developed a two-stage neural network, in which the first stage gets the decisions from neural networks and Euclidean distance classifiers supplied by the global, grid and texture features, and the second combines the four decisions using a radial basis function (RBF) neural network.

MAIN FOCUS OF THE CHAPTER

As mentioned above, the major approaches to signature verification systems are the template matching approach, spectrum approach, spectrum analysis approach, neural networks approach, cognitive approach and fractal approach. Rigid template matching, the simplest and earliest approach to pattern recognition, can successfully distinguish random forgeries from genuine signatures, but cannot detect skilled forgeries effectively. The statistical approach, including HMMs, Bayesian methods and so on, can distinguish random forgeries as well as skilled forgeries from genuine signatures. The structural approach shows good performance when detecting genuine signatures and forgeries, but it may yield a combinatorial explosion of possibilities to be investigated, demanding large training sets and very large computational efforts. The spectrum analysis approach can be applied to different languages, including English and Chinese; moreover, it can be applied to either on-line or off-line verification systems. The neural networks approach offers several advantages, such as unified approaches for feature extraction and classification and flexible procedures for finding good, moderately nonlinear solutions. It also shows reasonable performance in both on-line and off-line signature verification.

Neural Networks on ASV

Multi-layer perceptron (MLP) neural networks are among the most commonly used classifiers for pattern recognition problems. Despite their advantages, they suffer from some very serious limitations that make their use impossible for some problems. The first limitation is the size of the neural network: very large neural networks are very difficult to train, and as the amount of training data increases this difficulty becomes a serious obstacle for the training process. The second difficulty is that the geometry and size of the network, the training method used and the training parameters depend substantially on the amount of training data. Also, in order to specify the structure and size of the neural network, it is necessary to know a priori the number of classes that the network will have to deal with. Unfortunately, for a useful ASV system, a priori knowledge about the number of signatures and the number of signature owners is not available (Baltzakis, 2001).

For the BPN case, a learning law is used to modify weight values based on an output error signal propagated back through the network. Starting from random initial values, the weights are changed according to this learning law, which uses a learning rate and a smoothing rate; the latter sometimes allows faster convergence of the training phase. The training phase is critical, especially when the data to be classified are not clearly distinguishable and when there are not enough examples to conduct training. In this case, the training phase can be very long and it may even be impossible to obtain acceptable performance. Usually a criterion for stopping the training phase is defined. After that, several rejection methods are evaluated to improve the decision taken by this kind of classifier. Finally, the number of neurons in the hidden layer of the BPN is adjusted in order to increase the global performance of the first stage of the ASV (Drouhard, 1996).

An interesting aspect of the BPN is that, during the learning process, the hidden layers build an internal representation of the inputs that is useful for producing the output (Looney, 1997). (Fleming, 1990) used a two-stage NN with the same number of neurons for the input and output layers, and fewer units for the hidden layer. This forces the network to encode the inputs in a lower-dimensional space that retains most of the relevant information, in a way equivalent to the Principal Component Analysis (PCA) method. This class of networks is known as compression networks. An important property of compression networks is that they can act as auto-associative or content-addressable memories (Kohonen, 1977; Valentin, 1994). This means that these networks are able to acceptably reconstruct a degraded pattern when a noisy input is given, or to complete an incomplete input pattern (O'Toole, 1993). The quality of the results depends on the number of hidden units of the compression network.

On the other hand, syntactic NNs can model stochastic and non-stochastic grammars. Learning is therefore a process of grammatical inference, and recognition a process of parsing. This has great generality: by varying the grammar we can encompass a wide range of pattern recognition models. The stochastic nets are properly probabilistic and are powerful discriminators; the non-stochastic nets are less powerful, but have a straightforward silicon implementation with existing technology. Learning in syntactic nets may proceed supervised or unsupervised (Lucas, 1990).
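The BPN learning law discussed earlier in this section, with its learning rate and smoothing rate, can be sketched as a single weight-update step. The sketch reads the smoothing rate as a momentum term, which is an interpretation rather than something the cited works state here. The gradient is assumed to be already available (in a BPN it would come from propagating the output error back through the network), and a toy quadratic error surface with fixed initial weights is used instead of random initialization so that the example is reproducible. All names are hypothetical.

```python
def momentum_step(weights, gradients, velocity, learning_rate=0.1, smoothing_rate=0.5):
    """One update: the new change blends the previous change with the gradient step."""
    new_velocity = [smoothing_rate * v - learning_rate * g
                    for v, g in zip(velocity, gradients)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy error surface (an assumption): E(w) = sum_i (w_i - target_i)^2.
target = [3.0, -1.0]

def error_gradient(weights):
    """Gradient of the toy error; stands in for the back-propagated error signal."""
    return [2.0 * (w - t) for w, t in zip(weights, target)]

weights = [0.0, 0.0]   # fixed start for reproducibility
velocity = [0.0, 0.0]  # no previous weight change before the first step
for _ in range(200):
    weights, velocity = momentum_step(weights, error_gradient(weights), velocity)
```

Setting `smoothing_rate` to zero recovers plain gradient descent; a stopping criterion would normally replace the fixed number of iterations.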

Combined Classifiers Approaches

(Baltzakis, 2001) presents a different technique for off-line signature recognition and verification. The proposed technique addresses the above-mentioned BPN problems by reducing the training computation time (each neural network corresponds to only one signature owner) and the size of the neural networks used (the feature set is split into three different groups: global features, grid features and texture features). For each of these feature sets, a special two-stage Perceptron OCON (one-class-one-network) classification structure has been implemented. In the first stage, the classifier combines the decision results of the neural networks and the Euclidean distance obtained using the three feature sets. The results of the first-stage classifier feed a second-stage radial basis function (RBF) neural network structure, which makes the final decision.

To verify skilled forgeries effectively, a fuzzy neural network named Pseudo Outer-Product based Fuzzy Neural Network (POPFNN) is integrated into the signature verification system described in (Zhou, 1996). As a hybrid of fuzzy systems and neural networks, the POPFNN possesses many advantages, such as high computational capability and learning ability, when compared against other techniques used in signature verification systems. As hybrid intelligent systems, fuzzy NNs possess the advantages of both NNs and fuzzy rule-based systems and are particularly powerful in handling complex, non-linear and imprecise problems such as ASV. Besides, the membership functions and fuzzy rules identified in the POPFNN give more transparency to the decision-making process. These advantages make the proposed fuzzy neural network driven signature verification system particularly powerful and robust, even in dealing with skilled forgeries. In (Zhou, 1996), POPFNN operates in two fundamental modes, the learning mode and the classification mode. In the learning mode, a collection of training signature samples is used to train POPFNN. Feature vectors extracted from the training signature samples are used to initialize and adjust the parameters of POPFNN, including membership functions, fuzzy rules and the weights of the links. In the classification mode, POPFNN performs pure classification without self-modification. Feature vectors extracted from the unknown signatures are fed into POPFNN and the corresponding outputs are obtained at its output layer.

(Bromley, 1994) presents an algorithm based on a novel NN, called a "Siamese" neural network. This network has two input fields to compare two patterns and one output whose state value corresponds to the similarity between the two patterns. During training, the two sub-networks extract features from two signatures, while the joining neuron measures the distance between the two feature vectors. Training was carried out using a modified version of BP. All weights could be learnt, but the two sub-networks were constrained to have identical weights.
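The structure of the "Siamese" network described above, two sub-networks with identical weights joined by a unit that compares their feature vectors, can be sketched as follows. This shows the forward pass only and is not the cited architecture: the single tanh layer, the use of cosine similarity as the joining measure, the weight values and the input vectors are all assumptions chosen for illustration, and training is not shown.

```python
import math

def embed(signature_features, shared_weights):
    """Sub-network: one tanh layer; both branches call this with the same weights."""
    return [math.tanh(sum(w * x for w, x in zip(row, signature_features)))
            for row in shared_weights]

def cosine_similarity(a, b):
    """Joining unit: cosine of the angle between the two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def siamese_output(sig_a, sig_b, shared_weights):
    """Similarity of two signatures; weight sharing makes the output symmetric."""
    return cosine_similarity(embed(sig_a, shared_weights),
                             embed(sig_b, shared_weights))

# Hypothetical shared weights (2 features out, 3 in) and signature descriptors.
shared_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
reference = [1.0, 0.5, 0.2]
questioned = [-0.4, 0.9, 0.7]
same = siamese_output(reference, reference, shared_weights)
different = siamese_output(reference, questioned, shared_weights)
```

Because both branches use the same `shared_weights`, swapping the two inputs does not change the output, which is the practical consequence of constraining the sub-networks to identical weights.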


FUTURE TRENDS

Notwithstanding the enormous amount of work carried out in the field of signature verification, several questions still remain unresolved. New solutions to these problems will determine the conditions under which the signature verification systems of the near future will be developed. The selection of the most suitable set of features for a signer is one of the relevant open questions, and the use of new approaches for classification is still an open problem. Genetic algorithms (GA) have recently been used for this purpose (Xuhua, 1996). Another promising area of research concerns multi-expert verification, which combines hard (Dimauro, 1997) and soft (Plamondon, 1992) decisions, based on parallel (Qi, 1995), serial (Cardot, 1991) or hybrid strategies (Cordella, 2000).

In the framework of a handwritten text recognition application, (Heutte, 2004) developed a multiple-agent system able to manage interaction between different contextual levels of handwriting interpretation. The EMAC (Hernoux, 1999) environment has been specified from constraints imposed by their handwriting interpretation system. This work presents the platform as an aid to implementing specific collaboration or cooperation schemes between agents, which bring out new trends in the automatic reading of handwritten texts and could be implemented for automatic signature verification systems.

(Balkrishana, 2007) recently presented a Colour Code Algorithm that deals with signature recognition, a task generally performed by a human operator. Hence the algorithm simulates human behavior, to achieve perfection and skill through AI. The logic that decides the extent of validity of the signature must implement Artificial Intelligence. Pattern recognition is the science that concerns the description or classification of measurements, usually based on an underlying model. In the future the system can be configured using neural networks and a fuzzy rule base, where online training of recognition is possible.

A list of companies involved in the production of signature verification systems is given in (Kalenova, 2004), along with a short description of the products available. Although signature verification is not one of the safest biometric solutions, its use in business practices is still justified. This is primarily because the signature is a de facto means of confirming the identity of a person, and will therefore provide a far less disruptive migration to an advanced technology than any other biometric can. Thus, signature verification has a very promising future.

CONCLUSION

Automatic signature verification is a very attractive problem for researchers. This article presents a review of approaches for Automatic Signature Verification using Neural Networks. The main aspects related to the training process are discussed. Although some approaches report False Reject Rates and False Acceptance Rates ranging from 2% to 5%, system developers cannot compare their results due to the lack of a widely accepted protocol for experimental tests, as well as the absence of large, public signature databases. A useful bibliography is also provided for interested readers.

REFERENCES

Abuhaiba, I., (2007), Offline Signature Verification Using Graph Matching, Turkish Journal of Electrical Engineering & Computer Sciences. 1(4).
Balkrishana, V., (2007), A Colour Code Algorithm for Signature Recognition, Electronic Letters on Computer Vision and Image Analysis. 6(1), 1-12.
Baltzkis H, & Papamarkos N, (2001), A new signature verification technique based on a two stage neural network classifier, Engineering Application of Artificial Intelligence. 14, 95-103.
Bouletreau, V., Vincent, N., Sabourin, R. & Emptoz, H., (1998), Handwriting and signature: one or two personality identifiers?, Proceedings Fourteenth International Conference on Pattern Recognition. 2, 1758-1760.
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E. & Shah, R., (1994), Signature Verification using a “Siamese” Time Delay Neural Network, Advances in Neural Information Processing Systems. 7(4), 669-688.
Cardot, H., Revenu, M., Victorri, B., & Revillet, M.J., (1991), Cooperation de réseaux neuronaux pour l’authentification de signatures manuscrites, Proceedings of International Conference on Neuro-Nimes. 6, 737-744.
Cordella, L.P., Foggia, P., Sansone, C., Tortorella, F., & Vento, M., (2000), A Cascaded Multiple Expert System for Verification, Multiple Classifier Systems, editions J.Kittler and F.Roli, Springer. 1857, 330-339.
Dimauro, G., Impedovo, S., Pirlo, G., & Salzo, A., (1997), A multi-expert signature verification system for bankcheck processing, International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI). 11(5), 827-844.
Drouhard, J.P., Sabourin, R., & Godbout, M., (1996), A Neural Network Approach Off-line Signature Verification using directional PDF, Pattern Recognition. 29(3), 415-424.
Ferrer, M.A., Alonso, J.B., & Travieso, C.M., (2005), Offline geometric parameters for automatic signature verification using fixed-point arithmetic, IEEE Transactions on Pattern Analysis and Machine Intelligence. 27(6), 993-997.
Fleming, M.K. & Cottrell, G.W., (1990), Categorization of faces using unsupervised feature extraction, Proceedings International Conference on Neural Networks II. 2, 65-70.
Guo J, Doermann D, & Rosenfeld A, (1997), Local correspondence for detecting random forgeries, Proceedings of the Fourth International Conference on Document Analysis and Recognition. 1, 319-323.
Hernoux, C., (1999), EMAC, Un environnement MultiAgents à mémoire Collective, Mémoire d’ingénieur, CNAM.
Heutte, L., Nosary, A., & Paquet, T., (2004), A multiple agent architecture for handwritten text recognition, Pattern Recognition. 37, 665-674.
Hou W, Ye X, & Wang K, (2004), A Survey of Off-line Signature Verification, Proceedings International Conference on Intelligent Mechatronics and Automation.
Kalenova, D., (2004), Personal Authentication Using Signature Recognition, Department of Information Technology, Laboratory of Information Processing, Lappeenranta University of Technology.
Kohonen, T., (1977), Associative Memory: A System Theoretic Approach, Springer.
Lippmann, R. P., (1987), An introduction to computing with neural nets, IEEE ASSP Magazine. 4(2), 4-22.
Looney, C.G., (1997), Pattern Recognition using Neural Networks: Theory and Algorithms for Engineers and Scientists, Oxford University Press.
Lucas, S.M. & Damper, R.I., (1990), Signature verification with a syntactic neural net, IJCNN International Joint Conference on Neural Networks.
Mighell, D. A., Wilkinson, T. S. & Goodman, J. W., (1989), Backpropagation and Its Application to Handwritten Signature Verification, Advances in Neural Information Processing Systems. 340-347.
O’Toole A.J., et al., (1993), Low dimensional representation of faces in higher dimensions of the face space, Journal of the Optical Society of America A. 10, 405-410.
Plamondon, R. & Lorette, G., (1989), Automatic signature verification and writer identification - The state of the art, Pattern Recognition. 22(2), 107-131.
Plamondon, R., Yergeau, P., & Brault, J., (1992), A multi-level signature verification system, From Pixels to Features III - Frontiers in Handwriting Recognition, S.Impedovo and J.C.Simon editions.
Qi Y Y, & Hunt B R, (1994), Signature verification using global and grid features, Pattern Recognition. 27(12), 1621-1629.
Qi, Y., & Hunt, B.R., (1995), A multiresolution approach to computer verification of handwritten signatures, IEEE Transaction on Image Processing. 4(6).
Sabourin R., & Drouhard J. P., (1992), Offline signature verification using directional PDF and neural networks, Proceedings 11th international conference on pattern recognition.
Sansone C, & Vento M, (2000), Signature verification: increasing performance by a multistage system, Pattern Analysis & Application. 3, 169-181.
Valentin, D., Abdi, H., O’Toole, A.J. & Cottrell, G.W., (1994), Connectionist Models of Face Processing: A Survey, Pattern Recognition. 27(9), 1209-1230.
Xuhua, Y., Furuhashi, T., Obata, K., & Uchikawa, Y., (1996), Selection of features for signature verification using the genetic algorithm, Computers Ind. Eng.
Zhou, R.W. & Quek, C., (1996), An automatic fuzzy neural network driven signature verification system, IEEE International Conference on Neural Networks.

KEY TERMS

Agent Based Model: A specific individual-based computational model for computer simulation, closely related to complex systems, the Monte Carlo method, multi-agent systems, and evolutionary programming. The idea is to construct computational devices (agents with some properties) and then simulate them in parallel to model real phenomena.

Automatic Signature Verification: A procedure that determines whether a handwritten signature is genuine or a forgery when a person claims an identity.

Backpropagation Algorithm: Learning algorithm of ANNs, based on minimising the error obtained from the comparison between the outputs that the network gives after the application of a set of network inputs and the outputs it should give (the desired outputs).

Feature Selection: The technique, commonly used in machine learning, of selecting a subset of relevant features for building robust learning models. Its objective is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.

Fuzzy Logic: Derived from fuzzy set theory, it deals with reasoning that is approximate rather than precisely deduced from classical predicate logic. It can be thought of as the application side of fuzzy set theory, dealing with well thought out real world expert values for a complex problem.

Genetic Algorithms: A genetic algorithm is a technique used for searching or programming. It is used in computing to find exact or approximate solutions to optimization and search problems of various types, and it is a component of evolutionary computation. Genetic algorithms are inspired by biological processes: they mimic biological evolution.

Principal Component Analysis: A technique used to reduce multidimensional data sets to lower dimensions for analysis. PCA involves the computation of the eigenvalue decomposition of the covariance matrix of a data set, usually after mean centering the data for each attribute.


Neural/Fuzzy Computing Based on Lattice Theory Vassilis G. Kaburlasos Technological Educational Institution of Kavala, Greece

INTRODUCTION

Computational Intelligence (CI) consists of an evolving collection of methodologies often inspired by nature (Bonissone, Chen, Goebel & Khedkar, 1999, Fogel, 1999, Pedrycz, 1998). Two popular methodologies of CI are neural networks and fuzzy systems. Lately, a unification was proposed in CI, at a “data level”, based on lattice theory (Kaburlasos, 2006). More specifically, it was shown that several types of data including vectors of (fuzzy) numbers, (fuzzy) sets, 1D/2D (real) functions, graphs/trees, (strings of) symbols, etc. are partially (lattice) ordered. In conclusion, a unified cross-fertilization was proposed for knowledge representation and modeling based on lattice theory with emphasis on clustering, classification, and regression applications (Kaburlasos, 2006). Of particular interest in practice is the totally-ordered lattice (R,≤) of real numbers, which has emerged historically from the conventional measurement process of successive comparisons. It is known that (R,≤) gives rise to a hierarchy of lattices including the lattice (F,≤) of fuzzy interval numbers, or FINs for short (Papadakis & Kaburlasos, 2007). This article shows extensions of two popular neural networks, i.e. fuzzy-ARTMAP (Carpenter, Grossberg, Markuzon, Reynolds & Rosen 1992) and self-organizing map (Kohonen, 1995), as well as an extension of conventional fuzzy inference systems (Mamdani & Assilian, 1975), based on FINs. Advantages of the aforementioned extensions include both a capacity to rigorously deal with nonnumeric input data and a capacity to introduce tunable nonlinearities. Rule induction is yet another advantage.

BACKGROUND

Lattice theory has been compiled by Birkhoff (Birkhoff, 1967). This section summarizes selected results regarding a Cartesian product lattice (L,≤) = (L1,≤1)×…×(LN,≤N) of constituent lattices (Li,≤i), i=1,…,N. Given an isomorphic function θi: (Li,≤i)→(Li,≤i)∂ in a constituent lattice (Li,≤i), i=1,…,N, where (Li,≤i)∂ ≡ (Li,≤i∂) denotes the dual (lattice) of lattice (Li,≤i), then an isomorphic function θ: (L,≤)→(L,≤)∂ is given by θ(x1,…,xN) = (θ1(x1),…,θN(xN)). Given a positive valuation function vi: (Li,≤i)→R in a constituent lattice (Li,≤i), i=1,…,N, then a positive valuation v: (L,≤)→R is given by v(x1,…,xN) = v1(x1)+…+vN(xN). It is well-known that a positive valuation vi: (Li,≤i)→R in a lattice (Li,≤i) implies a metric function di: Li×Li→R₀⁺ given by di(a,b) = vi(a∨b) - vi(a∧b). Minkowski metrics dp: (L1,≤1)×…×(LN,≤N) = (L,≤)→R are given by

$$d_p(x,y) = \left[ d_1^p(x_1,y_1) + \cdots + d_N^p(x_N,y_N) \right]^{1/p},$$

where x = (x1,…,xN), y = (y1,…,yN), p∈R. An interval [a,b] in a lattice (L,≤) is defined as the set [a,b]≐{x∈L: a≤x≤b, a,b∈L}. Let τ(L) denote the set of intervals in a lattice (L,≤). It turns out that (τ(L),≤) is a lattice, ordered by set inclusion.

Definition 1. The size Zp: τ(L)→R₀⁺ of a lattice (L,≤) interval [a,b]∈τ(L), with respect to a positive valuation v: (L,≤)→R, is defined as Zp([a,b]) = dp(a,b).
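The construction above can be illustrated for the simplest case, in which every constituent lattice is the chain (R,≤), so that join and meet reduce to max and min. The following Python fragment is only a minimal sketch under that assumption; the names `valuation_metric`, `minkowski`, `identity` and `cube` are hypothetical helpers introduced here for illustration and do not come from the article.

```python
from typing import Callable, Sequence

Valuation = Callable[[float], float]


def valuation_metric(v: Valuation, a: float, b: float) -> float:
    """d(a,b) = v(a v b) - v(a ^ b) in the chain (R, <=): join is max, meet is min."""
    return v(max(a, b)) - v(min(a, b))


def minkowski(x: Sequence[float], y: Sequence[float],
              valuations: Sequence[Valuation], p: float = 1.0) -> float:
    """Minkowski metric d_p on the product lattice, one valuation per dimension."""
    total = sum(valuation_metric(v, a, b) ** p
                for v, a, b in zip(valuations, x, y))
    return total ** (1.0 / p)


# Illustrative valuations (assumptions): any strictly increasing function works.
identity = lambda t: t
cube = lambda t: t ** 3

d1 = minkowski((0.0, 0.0), (3.0, 4.0), [identity, identity], p=1)
d2 = minkowski((0.0, 0.0), (3.0, 4.0), [identity, identity], p=2)
d_nonlinear = valuation_metric(cube, 1.0, 2.0)
print(d1, d2, d_nonlinear)
```

With the identity valuation the metric reduces to the familiar city-block (p=1) and Euclidean (p=2) distances, while a nonlinear valuation such as `cube` stretches distances differently in different regions of the real line.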

NEURAL/FUZZY COMPUTING BASED ON LATTICE THEORY

This section delineates modified extensions to a hierarchy of lattices stemming from the totally ordered lattice (R,≤) of real numbers. Then, it details the relevance of novel mathematical tools. Next, based on the previous mathematical tools, this section presents extensions of ART/SOM/FIS. Finally, it discusses comparative advantages.

Modified Extensions in a Hierarchy of Lattices

Consider the product lattice (∆,≤) = (R×R,≤∂×≤) = (R×R,≥×≤) of generalized intervals. A generalized interval (element in ∆) will be denoted by [a,b] and will be called positive (negative) for a≤b (a>b). The set of positive (negative) generalized intervals will be denoted by ∆+ (∆-). We remark that the set of positive generalized intervals is isomorphic to the set of conventional intervals in the set R of real numbers. A decreasing function θR: R→R is an isomorphic function θR: (R,≤)→(R,≤)∂; furthermore, a strictly increasing function vR: R→R is a positive valuation vR: (R,≤)→R. Hence, function v∆: (∆,≤)→R given by v∆([a,b]) = vR(θR(a))+vR(b) is a positive valuation in lattice (∆,≤). There follows a metric function d∆: ∆×∆→R₀⁺ given by d∆([a,b],[c,d]) = [vR(θR(a∧c))-vR(θR(a∨c))] + [vR(b∨d)-vR(b∧d)]; in particular, for θR(x) = -x and vR(x) = x it follows d∆([a,b],[c,d]) = |a-c| + |b-d|. Choosing parametric functions θR(.) and vR(.) there follow tunable nonlinearities in lattice (R,≤). Moreover, note that ∆ is a real linear space with
• addition defined as [a,b] + [c,d] = [a+c,b+d], and
• multiplication (by a real k) defined as k[a,b] = [ka,kb].

It turns out that ∆+ (as well as ∆-) is a cone in linear space ∆. Recall that a subset C of a linear space is called a cone if for all x∈C and λ>0, we have λx∈C.

Definition 2. A generalized interval number (GIN) is a function f: (0,1]→∆.

Let G denote the set of GINs. It follows that (G,≤) is a lattice, in particular (G,≤) is the Cartesian product of lattices (∆,≤). Moreover, G is a real linear space with
• addition defined as (G1 + G2)(h) = G1(h) + G2(h), h∈(0,1], and
• multiplication (by a real k) defined as (kG)(h) = kG(h), h∈(0,1].

We remark that the cardinality of set G equals ℵ1^ℵ1 = (2^ℵ0)^ℵ1 = 2^(ℵ0·ℵ1) = 2^ℵ1 = ℵ2 > ℵ1, where ℵ1 is the cardinality of the set R of real numbers.

Proposition 3. Consider metric(s) d∆: ∆×∆→R₀⁺ in lattice (∆,≤). Let G1,G2∈(G,≤). Assuming that the following integral exists, a metric function dG: G×G→R₀⁺ is given by

$$d_G(G_1,G_2) = \int_0^1 d_\Delta\big(G_1(h), G_2(h)\big)\,dh.$$
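As an illustration of the metric d∆ between generalized intervals and of the integral metric dG between GINs, the sketch below evaluates both numerically. It is only a sketch under stated assumptions: the default choices θR(x) = -x and vR(x) = x, the midpoint-rule integration with `n` sample levels, and the function names `d_delta`, `d_G`, `tri0`, `tri2` and `crisp0` are all introduced here for illustration.

```python
from typing import Callable, Tuple

Interval = Tuple[float, float]          # a generalized interval [a, b]
GIN = Callable[[float], Interval]       # a function h -> [a(h), b(h)], h in (0, 1]


def d_delta(x: Interval, y: Interval,
            theta: Callable[[float], float] = lambda t: -t,
            v: Callable[[float], float] = lambda t: t) -> float:
    """Metric between generalized intervals induced by theta (decreasing) and v (increasing)."""
    (a, b), (c, d) = x, y
    left = v(theta(min(a, c))) - v(theta(max(a, c)))
    right = v(max(b, d)) - v(min(b, d))
    return left + right


def d_G(g1: GIN, g2: GIN, n: int = 1000, **kw) -> float:
    """Approximate the integral over h in (0, 1] with the midpoint rule (an assumption)."""
    levels = [(k + 0.5) / n for k in range(n)]
    return sum(d_delta(g1(h), g2(h), **kw) for h in levels) / n


# Two triangular shapes given by their h-cuts, and a crisp zero (illustrative data).
tri0: GIN = lambda h: (h - 1.0, 1.0 - h)      # centred at 0
tri2: GIN = lambda h: (h + 1.0, 3.0 - h)      # same shape, centred at 2
crisp0: GIN = lambda h: (0.0, 0.0)

print(d_delta((0.0, 1.0), (2.0, 5.0)))        # |0-2| + |1-5|
print(d_G(tri0, tri2), d_G(tri0, crisp0))
```

With the default θR and vR the interval metric reduces to |a-c| + |b-d|; passing other `theta` and `v` functions is one way to experiment with the tunable nonlinearities mentioned in the text.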

Our interest here focuses on the sublattice (F,≤) of lattice (G,≤), namely sublattice of fuzzy interval numbers (FINs). A FIN is defined rigorously as follows. Definition 4. A fuzzy interval number (FIN) F is a GIN such that either (1) both F(h)∈∆+ and h1≤h2 ⇒ F(h1)≥F(h2), for all h∈(0,1] (positive FIN) or (2) there is a positive FIN P such that F(h) = -P(h), for all h∈(0,1] (negative FIN). Let F+ (F-) denote the set of positive (negative) FINs. Note that both F+∪F- = F and F+∩F-=∅ hold. Furthermore, F+ (F-) is a cone with cardinality ℵ1 (Kaburlasos & Kehagias, 2006). The previous mathematical analysis may potentially produce useful techniques based on lattice vector theory (Vulikh, 1967). A positive FIN will simply be called “FIN”. A FIN may admit different interpretations including a (fuzzy) number, an interval, and a cumulative distribution function.

Relevance of Novel Mathematical Tools

A fundamental mathematical result in fuzzy set theory is the “resolution identity theorem”, which states that a fuzzy set can, equivalently, be represented either by its membership function or by its α-cuts (Zadeh, 1975). The aforementioned theorem has been given little attention in practice to date. However, some authors have capitalized on it by designing effective as well as efficient fuzzy inference systems (FIS) involving fuzzy numbers whose α-cuts are conventional closed intervals (Uehara & Fujise, 1993, Uehara & Hirota, 1998). This work builds on the abovementioned mathematical result as follows. In the first place, we drop the possibilistic interpretation of a membership function. Then, we consider the corresponding “α-cuts representation”. Next, we consider the metric cone F+^N of (positive) FINs. In conclusion, we propose extensions of established neural/fuzzy algorithms, including ART (adaptive resonance theory), SOM (self-organizing map), and FIS (fuzzy inference systems), in F+^N (Kaburlasos, 2007). A novelty of this work is an improved mathematical notation, which emphasizes relevance with the aforementioned “resolution identity theorem”.

An Extension of Fuzzy-ARTMAP

A fuzzy-ARTMAP extension, namely fuzzy lattice reasoning (FLR), is presented in this section based on a similarity measure (function) defined in the following.

Definition 5. A similarity measure in a set S is a function µ: S×S→(0,1], which satisfies the following conditions.
(S1) µ(a,b) = 1 ⇔ a = b.
(S2) µ(a,b) = µ(b,a).
(S3) 1/µ(a,b) + 1/µ(x,x) ≤ 1/µ(a,x) + 1/µ(x,b).

A similarity measure is defined based on a metric function next.

Proposition 6. If function d: S×S→R₀⁺ is a metric then function µ: S×S→(0,1] given by µ(a,b) = 1/[1+d(a,b)] is a similarity measure.

FLR for Training

FLR-0: A set RB = {(u1,C1),…,(uL,CL)} is given, where ul∈F+^N and Cl∈C, l=1,…,L is a class label in the finite set C.
FLR-1: Present the next input pair (xi,Ki)∈F+^N×C, i=1,…,n to the initially “set” RB.
FLR-2: If no more pairs are “set” in RB then store input pair (xi,Ki) in the RB; L←L+1; goto step FLR-1. Else, compute the similarity µ(xi,ul) of input xi∈F+^N with a “set” element ul∈F+^N, l=1,…,L in RB.
FLR-3: Competition among the “set” pairs in the RB: Winner is pair (uJ,CJ) such that J ≐ arg max_{l∈{1,…,L}} µ(xi,ul). In case of multiple winners, choose the one with the smallest size Z1(.).
FLR-4: Assimilation Condition: Both (1) size Z1(xi∨uJ) is less than a user-defined threshold size Zcrit, and (2) Ki = CJ.
FLR-5: If the Assimilation Condition is not satisfied then “reset” the winner pair (uJ,CJ); goto step FLR-2. Else, replace the winner uJ by the join-interval xi∨uJ; goto step FLR-1.

The corresponding testing phase is carried out by winner-take-all competition based on the similarity measure function µ(.,.).
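The FLR training steps FLR-0 to FLR-5 and the winner-take-all testing phase can be sketched in Python for a deliberately simplified setting. The assumptions are loud ones: every FIN is replaced by an ordinary N-dimensional interval (a hyperbox), θR(x) = -x and vR(x) = x are used so that d∆([a,b],[c,d]) = |a-c| + |b-d|, the size Z1 of a box is taken as the sum of its edge lengths, and all names (`d1`, `mu`, `join`, `size`, `flr_train`, `flr_predict`) are hypothetical. It is not the authors' implementation.

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]         # one (low, high) interval per dimension


def d1(x: Box, u: Box) -> float:
    """Minkowski d_1 between boxes, with |a-c| + |b-d| per dimension."""
    return sum(abs(a - c) + abs(b - d) for (a, b), (c, d) in zip(x, u))


def mu(x: Box, u: Box) -> float:
    """Similarity measure of Proposition 6: 1 / (1 + d)."""
    return 1.0 / (1.0 + d1(x, u))


def join(x: Box, u: Box) -> Box:
    """Smallest box containing both x and u (the join-interval)."""
    return [(min(a, c), max(b, d)) for (a, b), (c, d) in zip(x, u)]


def size(u: Box) -> float:
    """Size Z_1 of a box under the identity valuation: sum of edge lengths."""
    return sum(b - a for a, b in u)


def flr_train(data, z_crit: float, rules=None):
    rules = list(rules or [])                        # FLR-0: rule base of (box, label)
    for x, k in data:                                # FLR-1: next input pair
        active = set(range(len(rules)))              # all pairs start out "set"
        while True:
            if not active:                           # FLR-2: nothing left, store input
                rules.append((x, k))
                break
            # FLR-3: highest similarity wins; ties go to the smallest box
            j = max(active, key=lambda l: (mu(x, rules[l][0]), -size(rules[l][0])))
            u, c = rules[j]
            merged = join(x, u)
            if size(merged) < z_crit and c == k:     # FLR-4: assimilation condition
                rules[j] = (merged, c)               # FLR-5: replace winner by join
                break
            active.discard(j)                        # FLR-5: "reset" the winner
    return rules


def flr_predict(x: Box, rules):
    """Testing phase: winner-take-all on the similarity measure."""
    return max(rules, key=lambda r: mu(x, r[0]))[1]


point = lambda *xs: [(float(v), float(v)) for v in xs]   # trivial (point) interval
data = [(point(0, 0), "A"), (point(1, 1), "A"), (point(10, 10), "B")]
rules = flr_train(data, z_crit=3.0)
print(rules, flr_predict(point(0.5, 0.5), rules))
```

In this toy run the two nearby class-A points are merged into one box, because the merged box stays below the threshold `z_crit`, whereas the distant class-B point is stored as a rule of its own.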

An Extension of SOM

A straightforward SOM extension, namely granular SOM (grSOM), is presented in this section in cone F+^N.

grSOM for Training

GR-0: The user defines the size L of an L×L grid of neurons. Each neuron can store both an N-dimensional FIN Wi,j∈F+^N, i,j∈{1,…,L} and a class label Ci,j∈C, where C is a finite set. Initially all neurons are uncommitted.
GR-1: Memorize the first training data pair (x1,K1)∈F+^N×C by committing, randomly, a neuron in the L×L grid. Repeat the following steps a user-defined number Nepochs of epochs.
GR-2: For each training datum (xk,Kk)∈F+^N×C, k=1,…,n “reset” all L×L grid neurons. Then carry out the following computations.
GR-3: Calculate the Minkowski metric distance d1(xk,Wi,j) between xk and committed neurons Wi,j, i,j∈{1,…,L}.
GR-4: Competition among the “set” (and, committed) neurons in the L×L grid: Winner is neuron (I,J) whose weight WI,J is the nearest to xk, that is (I,J) ≐ arg min_{i,j∈{1,…,L}} d1(xk,Wi,j).
GR-5: Assimilation Condition: Both (1) Vector Wi,j is in the neighborhood of vector WI,J on the L×L grid, and (2) CI,J = Kk.
GR-6: If the Assimilation Condition is satisfied then compute a new value W'i,j as

$$W'_{i,j} \doteq \left[1 - \frac{h(k)}{1 + d_1(W_{I,J}, W_{i,j})}\right] W_{i,j} + \frac{h(k)}{1 + d_1(W_{I,J}, W_{i,j})}\, x_k.$$

Else, “reset” the winner (I,J); goto GR-4.
GR-7: If all the L×L neurons are “reset” then commit an uncommitted neuron from the grid, and memorize the current training datum (xk,Kk). If there are no more uncommitted neurons then increase L by one.

The corresponding testing phase is carried out by winner-take-all competition based on the Minkowski metric d1(.,.).
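The weight update of step GR-6 can be sketched as follows, again under simplifying assumptions that are not in the article: weights and inputs are N-dimensional intervals rather than general FINs, d1 uses |a-c| + |b-d| per dimension, and h(k) is passed in as a plain number `h_k` because the excerpt does not define it. The names `d1` and `grsom_update` are hypothetical.

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]          # one (low, high) interval per dimension


def d1(u: Box, w: Box) -> float:
    """Minkowski d_1 between interval vectors, with |a-c| + |b-d| per dimension."""
    return sum(abs(a - c) + abs(b - d) for (a, b), (c, d) in zip(u, w))


def grsom_update(w: Box, w_winner: Box, x: Box, h_k: float) -> Box:
    """GR-6 sketch: W' = (1 - g) W + g x with g = h(k) / (1 + d1(W_winner, W)).

    The convex combination is applied to interval end points, using the linear
    space operations on generalized intervals.
    """
    g = h_k / (1.0 + d1(w_winner, w))
    return [((1.0 - g) * a + g * c, (1.0 - g) * b + g * d)
            for (a, b), (c, d) in zip(w, x)]


winner = [(0.0, 2.0)]
neighbour = [(0.5, 1.5)]                 # at distance 1 from the winner
x_k = [(4.0, 4.0)]
print(grsom_update(winner, winner, x_k, h_k=0.5))      # the winner moves furthest
print(grsom_update(neighbour, winner, x_k, h_k=0.5))   # a neighbour moves less
```

The factor 1/(1 + d1) plays the role of a neighborhood kernel: the further a neuron's weight lies from the winner's weight, the smaller its step toward the input.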

An Extension of FIS

The basic idea towards novel FIS analysis and design is to employ a similarity measure function µ(X,Ai) = 1/[1+d(X,Ai)], where X,Ai∈F+^N, as a fuzzy membership function regarding a rule Ri: Ai→Ci, where Ai∈F+^N, Ci∈F+^M, i=1,…,L (Kaburlasos & Kehagias, 2007). Advantages are presented in the following.

Comparative Advantages

First, an important advantage of the mathematical tools above is that the proposed ART/SOM/FIS extensions can handle, in any combination, numeric and/or nonnumeric data; the latter include fuzzy numbers, intervals, and cumulative distribution functions. Second, we can employ parametric decreasing (increasing) functions θR: R→R (vR: R→R) in a data dimension, where the function parameters can be estimated/tuned optimally towards improving performance. Third, the proposed ART/SOM/FIS extensions can induce descriptive decision-making knowledge (i.e. rules) from the training data. Fourth, regarding the FLR, note that a similarity measure function µ(.,.) can effectively replace an inclusion measure function σ(.,.). Recall that the latter (function) had replaced both fuzzy-ARTMAP’s Choice (Weber) function and its Match function (Kaburlasos & Petridis, 2000, Kaburlasos, Athanasiadis & Mitkas, 2007). The reason behind the aforementioned “effective” replacement is that an inclusion measure σ(A,B), or σ(B,A), considers mainly one of A,B∈F+^N, whereas a similarity measure µ(A,B) considers both A,B∈F+^N based on their corresponding metric distance. Fifth, regarding the proposed SOM extension, note that this work carries out computations in the cone F+ of FINs for faster data processing compared to a previous version of grSOM (Kaburlasos & Papadakis, 2006). Sixth, regarding the proposed FIS, novel advantages include a capacity to generalize beyond a fuzzy rule’s support. The latter implies, potentially, an alleviation of the “curse of dimensionality” problem regarding the number of rules.

FUTURE TRENDS

Data-processing of FINs by multilayer perceptrons is straightforward, as described in (Kaburlasos & Christoforidis, 2006), and it will be pursued in future work.

CONCLUSION

This article has presented novel mathematical tools for unified analysis and design of neural/fuzzy systems. We built on fuzzy set theory’s “resolution identity theorem”. Nevertheless, in the first place, we dropped the possibilistic interpretation of a membership function. Then, we considered the corresponding “α-cuts representation”. Our interest focused on fuzzy interval numbers, or FINs for short, which can represent (fuzzy) numbers, intervals, and cumulative distribution functions. Based on lattice theory, we showed that the space of FINs is a metric cone. In conclusion, this work opens up the possibility of designing FIN-to-FIN maps implementable on neural/fuzzy architectures, including tunable nonlinearities.

REFERENCES

Birkhoff, G. (1967). Lattice Theory. Providence, RI: AMS, Colloquium Publications, 25.
Bonissone, P.P., Chen, Y.T., Goebel, K., & Khedkar, P.S. (1999). Hybrid Soft Computing Systems: Industrial and Commercial Applications. Proc IEEE, (87) 9, 1641-1667.
Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., & Rosen, D.B. (1992). Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps. IEEE Transactions on Neural Networks, (3) 5, 698-713.
Fogel, D.B. (1999). Evolutionary Computation: Toward a New Philosophy of Machine Intelligence (2nd ed.). Piscataway, NJ: IEEE Press.
Kaburlasos, V.G. (2006). Towards a Unified Modeling and Knowledge Representation Based on Lattice Theory - Computational Intelligence and Soft Computing Applications. Heidelberg, Germany: Springer, series: Studies in Computational Intelligence, vol. 27.
Kaburlasos, V.G. (2007). Unified Analysis and Design of ART/SOM Neural Networks and Fuzzy Inference Systems Based on Lattice Theory. Computational and Ambient Intelligence, Sandoval, F., Prieto, A., Cabestany, J., Graña, M. editors. Heidelberg, Germany: Springer-Verlag, series: LNCS, vol. 4507, pp. 80-93.
Kaburlasos, V.G., & Christoforidis, A. (2006). Granular Auto-Regressive Moving Average (grARMA) Model for Predicting a Distribution From Other Distributions. Real-World Applications. Proceedings of the World Congress on Computational Intelligence (WCCI) 2006, FUZZ-IEEE Program, pp. 791-796.
Kaburlasos, V.G., & Kehagias, A. (2006). Novel Fuzzy Inference System (FIS) Analysis and Design Based on Lattice Theory. Part I: Working Principles. International Journal of General Systems, (35) 1, 45-67.
Kaburlasos, V.G., & Kehagias, A. (2007). Novel Fuzzy Inference System (FIS) Analysis and Design Based on Lattice Theory. IEEE Trans. Fuzzy Systems, (15) 2, 243-260.
Kaburlasos, V.G., & Papadakis, S.E. (2006). Granular Self-Organizing Map (grSOM) for Structure Identification. Neural Networks, (19) 5, 623-643.
Kaburlasos, V.G., & Petridis, V. (2000). Fuzzy Lattice Neurocomputing (FLN) Models. Neural Networks, (13) 10, 1145-1170.
Kaburlasos, V.G., Athanasiadis, I.N., & Mitkas, P.A. (2007). Fuzzy Lattice Reasoning (FLR) Classifier and its Application for Ambient Ozone Estimation. International Journal of Approximate Reasoning, (45) 1, 152-188.
Kohonen, T. (1995). Self-Organizing Maps. Berlin, Germany: Springer.
Mamdani, E.H., & Assilian, S. (1975). An Experiment in Linguistic Synthesis With a Fuzzy Logic Controller. International Journal of Man-Machine Studies, (7), 1-13.
Papadakis, S.E., & Kaburlasos, V.G. (2007). Induction of Classification Rules From Histograms. Joint Conference on Information Sciences (JCIS), Proceedings of the 8th International Conference on Natural Computing (NC), pp. 1646-1652.
Pedrycz, W. (1998). Computational Intelligence - An Introduction. Boca Raton, FL: CRC Press.
Uehara, K., & Fujise, M. (1993). Fuzzy Inference Based on Families of α-level sets. IEEE Transactions on Fuzzy Systems, (1) 2, 111-124.
Uehara, K., & Hirota, K. (1998). Parallel and Multistage Fuzzy Inference Based on Families of α-level sets. Information Sciences, (106) 1-2, 159-195.
Vulikh, B.Z. (1967). Introduction to the Theory of Partially Ordered Vector Spaces. Groningen: Wolters-Noordhoff Scientific Publications, XV.
Zadeh, L.A. (1975). The Concept of a Linguistic Truth Variable and its Application to Approximate Reasoning - I, II, III. Information Sciences, (8) 3, 199-249; (8) 4, 301-357; (9) 1, 43-80.

KEY TERMS

ART: ART stands for Adaptive Resonance Theory. It is a biologically inspired neural paradigm for, originally, clustering binary patterns. An analog pattern version of ART, namely fuzzy-ART, is applicable in the unit hypercube. The corresponding neural network for classification is called fuzzy-ARTMAP.

Dual (Lattice): Given a lattice (L,≤), its dual lattice, symbolically (L,≤)∂ or (L,≤∂) ≡ (L,≥), is a lattice with the inverse order relation (≥).

FIS: FIS stands for Fuzzy Inference System. It is an architecture for reasoning involving fuzzy sets (typically fuzzy numbers) based on fuzzy logic.

Isomorphic (Function): Given two lattices (L1,≤1) and (L2,≤2), an isomorphic function is a bijective (one-to-one) function ϕ: (L1,≤1)→(L2,≤2) such that x≤y ⇔ ϕ(x)≤ϕ(y).

Lattice: A lattice is a poset (L,≤) any two of whose elements have both a greatest lower bound (g.l.b.), denoted by x∧y, and a least upper bound (l.u.b.), denoted by x∨y.

Poset: A partially ordered set (or, poset, for short) is a pair (P,≤), where P is a set and ≤ is an order relation on P. The latter (relation) by definition satisfies (1) x≤x, (2) x≤y and y≤x ⇒ x = y, and (3) x≤y and y≤z ⇒ x≤z.

Positive Valuation (Function): Given a lattice (L,≤), a positive valuation is a function v: (L,≤)→R, which satisfies both v(x)+v(y) = v(x∧y)+v(x∨y) and x<y ⇒ v(x)<v(y).

… mi/c · mj/c · d²(oi, oj)    (10)

Therefore, the algorithm is the following:
• Initialization step: Start from an initial partition Xc, c = 1..C, obtained for instance by an affectation (assignment) from an initial random set of referent observations.
• Representation step: For all prototypes ωc and observations oi, compute the weights mi/c in Eq. (9) and the inertia I(Xc) in Eq. (10), and update the neighborhood function for the next iteration.
• Affectation step: Assign each observation to a prototype ωf(i) according to the minimum distance in Eq. (9): f(i) = arg min_c D^T(oi, Wc).

The representation step and affectation step are computed sequentially until convergence. The training parameters for the decreasing neighborhood function follow the usual recommendations for SOM algorithms: fast, then slow decrease (http://www.cis.hut.fi/projects/somtoolbox/documentation/). After convergence, if necessary for visualization of the final map, a referent observation can be associated to each prototype according to a “set Mean search” (or set Median) or a “Mean search” (or Median), for instance. In the following, we will compare three DSOMs, respectively called DSOM(K), DSOM(EG) and, for our proposal, DSOM. To compare the “set Mean” and “set Median” approaches for the three algorithms, d²(oi, oj) will be substituted by d^γ(oi, oj): “set Median” corresponds to γ = 1 and “set Mean” to γ = 2. Different power values γ will also be tested. Other transformations may be applied to a dissimilarity matrix to transform it into a distance matrix, such as adding a constant, or combining both (Joly & le Calvé, 1994). The “adding constant” method introduces great distortions in the initial dissimilarity data, which our experiments confirm. The “power” method gives better results. Concerning the computation time, these DSOM algorithms are equivalent, but the reasons differ. For DSOM(K) and DSOM(EG), the representation step is the most time-consuming one due to the optimization for each referent. With our proposal, this optimization is implicit, but this step remains time-consuming because of the computation of the weights mi/c and inertia I(Xc).
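The role of the power γ in a “set Median” or “set Mean” search can be illustrated with a small sketch. It assumes the common reading of such a search, namely that the referent of a group is the member observation minimising the sum of dissimilarities raised to the power γ; the excerpt does not spell this out, so both this reading and the name `set_referent` are assumptions made for illustration.

```python
from typing import List, Sequence


def set_referent(D: List[List[float]], members: Sequence[int], gamma: float = 1.0) -> int:
    """Return the member j minimising sum_i D[i][j] ** gamma over the group.

    gamma = 1 gives a "set Median" search, gamma = 2 a "set Mean" search.
    """
    return min(members, key=lambda j: sum(D[i][j] ** gamma for i in members))


# Toy data (illustrative): five points on a line, one of them far away.
xs = [0.0, 1.0, 2.0, 3.0, 10.0]
D = [[abs(a - b) for b in xs] for a in xs]
group = range(len(xs))

median_idx = set_referent(D, group, gamma=1.0)
mean_idx = set_referent(D, group, gamma=2.0)
print(xs[median_idx], xs[mean_idx])
```

On this toy group the two searches select different referents, because squaring the dissimilarities gives the distant observation more influence. Only the dissimilarity matrix is needed, which is what makes this kind of search usable when no vector representation of the observations exists.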

Methodology

Description of the Experiment

To evaluate the 3 DSOM algorithms, two metrics will be used. The first one is the classical quantization error (Eg). The second one concerns topology preservation. Among existing criteria, we have chosen two measures in Eq. (11) which are compatible with dissimilarity data: the “trustworthiness” (M1) and the “continuity” (M2) (Venna & Kaski, 2001). Trustworthiness relates to the error caused by observations that appear in an output neighborhood while they are not in the input neighborhood; continuity relates to the converse. M1 and M2 are evaluated as a function of the number (k) of nearest neighbors and normalized between 0 and 1. For visualization, according to Venna & Kaski, the trustworthiness is more important than the continuity. The larger M1(k) and M2(k) are, the better the projection quality is. We also compute the integrated Mi(k) up to a neighborhood containing 10% of the whole samples: these values (M̄i) measure the quality of the local topology preservation.

$$M_1(k) = 1 - \frac{2}{Nk(2N-3k-1)} \sum_{i=1}^{N} \sum_{o_j \in U_k(o_i)} \big( r(o_i,o_j) - k \big)$$

$$M_2(k) = 1 - \frac{2}{Nk(2N-3k-1)} \sum_{i=1}^{N} \sum_{o_j \in V_k(o_i)} \big( \hat{r}(o_i,o_j) - k \big) \qquad (11)$$

with Ck(oi), Ĉk(oi) the sets of the k first neighbors of oi in the input space and output space, respectively;
Uk(oi) = {oj | oj ∈ Ĉk(oi) ∧ oj ∉ Ck(oi)},
Vk(oi) = {oj | oj ∉ Ĉk(oi) ∧ oj ∈ Ck(oi)};


A New Self-Organizing Map for Dissimilarity Data

r(oi, oj), r̂(oi, oj): ranks of oj in the neighbourhood of oi in the input space and output space, respectively. Three databases are used. The first one is an artificial dataset: 100 uniform samples in R², where the dissimilarity is the exact Euclidean distance and the configuration parameter γ is set to 2. The second one is the “Chicken Silhouette” dataset (http://algoval.essex.ac.uk:8080/data/sequence/chicken/chicken.tgz). It consists of 446 samples (binary images of chicken parts) categorized in 5 classes. The distance matrix is calculated according to the “AngleCostFunction” (Barbara Spillmann, 2004) based on the local orientation of the sample contours. The third dataset is larger. It is extracted from the SCOWL word lists (http://wordlist.sourceforge.net/). After some reduction of plural and possessive forms from a small English dictionary, the dataset consists of 2000 words. The Levenshtein distance (Levenshtein, 1966) is then used to calculate the pair-wise dissimilarities.
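The trustworthiness and continuity measures of Eq. (11) can be computed directly from two dissimilarity matrices, one for the input space and one for the output (map) space. The Python sketch below is an illustration only: the function names `ranks` and `trust_cont` are hypothetical, ties in the neighbor ordering are broken by index (an arbitrary choice), and the normalization is only meaningful for k < N/2.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def ranks(D: Matrix) -> List[List[int]]:
    """R[i][j] = rank of o_j among the neighbours of o_i (1 = nearest), o_i excluded."""
    n = len(D)
    R = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: D[i][j])
        for r, j in enumerate(order, start=1):
            R[i][j] = r
    return R


def trust_cont(D_in: Matrix, D_out: Matrix, k: int) -> Tuple[float, float]:
    """Return (M1(k), M2(k)): trustworthiness and continuity as in Eq. (11)."""
    n = len(D_in)
    r_in, r_out = ranks(D_in), ranks(D_out)
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    t_err = c_err = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            in_input = r_in[i][j] <= k       # o_j in C_k(o_i)
            in_output = r_out[i][j] <= k     # o_j in C^_k(o_i)
            if in_output and not in_input:   # o_j in U_k(o_i)
                t_err += r_in[i][j] - k
            if in_input and not in_output:   # o_j in V_k(o_i)
                c_err += r_out[i][j] - k
    return 1.0 - norm * t_err, 1.0 - norm * c_err


# Toy check (illustrative): six points on a line, projected with two points swapped.
pos_in = [0, 1, 2, 3, 4, 5]
pos_out = [0, 5, 2, 3, 4, 1]
dist = lambda p: [[abs(a - b) for b in p] for a in p]

print(trust_cont(dist(pos_in), dist(pos_in), k=1))    # perfect projection
print(trust_cont(dist(pos_in), dist(pos_out), k=1))   # degraded projection
```

A projection that preserves every neighborhood scores 1 on both measures, and swapping two observations in the output lowers both.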

Results

On the artificial dataset, the performances of the three algorithms are very similar (Table 1). With a vector SOM, the results are identical. The map is a hexagonal one with a grid of 5x5 neurons. As expected, the behaviour of the three algorithms differs on the real datasets. With the “Chicken” database, the map is a hexagonal one with a grid of 7x7 neurons. DSOM presents the best topology preservation according to M1(k) and M2(k) (Fig. 1.a), and the best compromise between quantization and topology preservation (Table 2). While varying γ, we observe an evolution of these criteria. We notice that each algorithm exhibits a different value for the optimal power γ: γ = 1 for DSOM(K), γ = 1.5 for DSOM(EG), γ = 3 for DSOM. However, γ = 1 can be considered as the best compromise for the three algorithms and will be used

Table 1. Comparison of the quantization quality ($E_g$) and topology preservation ($\bar{M}_1$, $\bar{M}_2$). Artificial dataset, γ = 2.

              DSOM(K)    DSOM(EG)   DSOM
$E_g$         0.0063     0.0067     0.0063
$\bar{M}_1$   0.9892     0.9848     0.9855
$\bar{M}_2$   0.9791     0.9777     0.9804

Table 2. Comparison of the quantization quality ($E_g$) and topology preservation ($\bar{M}_1$, $\bar{M}_2$). Chicken database, γ = 1.

              DSOM(K)    DSOM(EG)   DSOM
$E_g$         11.7183    12.0817    11.7966
$\bar{M}_1$   0.8923     0.9040     0.9360
$\bar{M}_2$   0.8320     0.8083     0.8880


Figure 1. (a) Chicken database: evolution of $M_1(k)$ and $M_2(k)$ with γ = 1; (b) SCOWL database: evolution of $\bar{M}_1$ and $\bar{M}_2$ for different values of the power γ.

Figure 2. Chicken database: prototypes of the neurons for DSOM. Each color corresponds to one of five classes of chicken parts: wing, back, drumstick, thigh and back, and breast.



Figure 3. SCOWL database: Part of the final map. At the end, the referents are assigned with a “set Median search”. For the particularity of referent 117, see the text.

to present the results. Figure 2 shows the prototypes of all the nodes for DSOM. Neighboring nodes have similar prototypes, and the map is organized to respect the five classes of the data as well as possible.

For the third dataset, a hexagonal map with a grid of 12x12 neurons is used. The conclusions are the same. Fig. 1.b presents the evolution of the integrated $M_i(k)$ ($\bar{M}_i$). The values are higher for DSOM and also less sensitive to the value of γ. Figure 3 shows the central part of the map for γ = 1, where the organization of the referents by word length is evident. In this figure, only referent 117 (“present”) does not belong to its partition. Over the whole map, this is the case for 5 of the 144 referents (3.5%). For DSOM(K) and DSOM(EG), the figures are 23.4% and 99.7% respectively. These characteristics also show the higher effectiveness of the proposed algorithm, which is mainly due to the implicit reference.
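The “set median search” mentioned in the caption of Figure 3 is not detailed in this section. A minimal sketch, assuming the usual definition of a set median (the member of a set that minimizes the sum of dissimilarities to all members), could look as follows. The function name and the toy matrix are hypothetical.

```python
def set_median(dissimilarity, members):
    """Return the index in `members` that minimizes the summed
    dissimilarity to all members (assumed set-median definition)."""
    return min(members,
               key=lambda c: sum(dissimilarity[c][m] for m in members))


# Toy symmetric dissimilarity matrix for three observations.
toy = [[0, 1, 4],
       [1, 0, 2],
       [4, 2, 0]]
referent = set_median(toy, [0, 1, 2])
```

With this definition the referent is always an actual observation, so it can be displayed on the map as a word or an image.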


FUTURE TRENDS

The proposed algorithm is based on the computation of a “pseudo” gravity centre for each prototype. This computation is justified by assuming the existence of a latent Euclidean space, which means that the dissimilarity data must be isometric to an L2 norm. In practice, this requirement is seldom strictly satisfied, and an approximation is often sufficient. Therefore, to validate this new DSOM completely, it must be tested with other data types and with larger databases that have a “ground truth”. The data organization is interpreted after projection onto the final map, and the neighbourhood in the output map must reveal the main latent properties of the observations, which must agree with the “ground truth”.


CONCLUSION

This article presents a new effective algorithm for DSOM. According to the criteria of trustworthiness and continuity, this DSOM achieves good topology preservation. The improvement comes mainly from the representation step, where the referent of each prototype can be adapted continuously, as in the vector model. To achieve this, an implicit reference is used during the representation step, thanks to the Huygens theorem. Even if the Euclidean assumptions are not exactly verified in practice, the distortions due to this mismatch are less important than those caused by the collision effect, which is a difficult problem for the classical DSOM algorithms. This effectiveness is shown in this article by the better performance of the proposed algorithm compared to the other ones.

ACKNOWLEDGMENTS

This work is supported by grants of the “Fonds National pour la Science”, from the program “ACI Masse de Données” and the project “DataHighDim”. T. Ho-Phuoc’s PhD is funded by the French MESR.

REFERENCES

Ambroise C. and Govaert G. (1996). Analyzing dissimilarity matrices via Kohonen maps. IFCS-96, Int. Federation of Classification Societies, (2), Kobe (Japan), 96-99.
Barbara Spillmann. (2004). Description of the distance matrices. Institute of Computer Science and Applied Mathematics, University of Bern.
Borg I., Groenen P. (1997). Modern Multidimensional Scaling: Theory and Applications. Springer Verlag, New-York, Inc.
Conan-Guez B., Rossi F., El Golli A. (2006). Fast algorithm and implementation of dissimilarity self-organizing maps. Neural Networks, 19(6-7), 855-863.
El Golli A., Conan-Guez B., Rossi F. (2004). A self organizing map for dissimilarity data. IFCS-04, International Federation of Classification Societies, Chicago, 61-68.
Graepel T., Obermayer K. (1999). A stochastic self-organizing map for proximity data. Neural Computation, 11(1), 139-155.
Jain A.K., Dubes R.C. (1988). Algorithms for clustering Data, Prentice-Hall, Englewood Cliffs, NJ.
Joly S., Le Calvé G. (1994). Similarity functions, Chapter 3, 67-86, in Classification and Dissimilarity Analysis, Lecture Notes in Statistics, Van Cutsem ed., Springer-Verlag, New York.
Kohonen T. (1997). Self-Organizing Maps. Springer Verlag New York.
Kohonen T., Somervuo P.J. (1998). Self-organizing maps for symbol strings. Neurocomputing, (21), 19-30.
Kohonen T., Somervuo P.J. (2002). How to make large self-organizing maps for non vectorial data. Neural networks, 21(8).
Levenshtein V.I. (1966). Binary codes capable of correcting deletions, insertions and reversals, Soviet Physics Dokl., vol. 10 (8), 707-710.
Martínez C. D., Juan A., Casacuberta F. (2001). Improving classification using median string and NN rules. IX Spanish Symp. on Pattern Recog. and Image Analysis, (2), 391-395.
Van Cutsem B. (1994). Classification and Dissimilarity Analysis, Lecture Notes in Statistics, Van Cutsem Ed., Springer-Verlag, New York.
Venna J., Kaski S. (2001). Neighborhood preservation in nonlinear projection methods: An experimental study. ICANN 2001, Berlin, 485-491.

KEY TERMS

Affectation Step: A part of the learning iteration where an observation is assigned to the nearest prototype according to a predefined distance.

Dissimilarity Data: Data in which all we know about the observations are pair-wise dissimilarities.

Dissimilarity SOM: A SOM where all observations are described by a dissimilarity matrix.

Prototype: Referent of a node (neuron) on the map.

Quantization Error: Error which appears when an observation is represented by a prototype.

Representation Step: A part of the learning iteration where the prototype is adapted to represent its assigned observations well.

Self-Organizing Map (SOM): A subtype of artificial neural networks. It is trained using unsupervised learning to produce a low dimensional representation of the training samples while preserving the topological properties of the input space.

SOM Batch Algorithm: A version of SOM in which, at each iteration, all observations are available and used for computation.

Topology Preservation: Preservation of the neighbourhood relation of the observations in the output space. It means that the observations which are neighbours in the input space should be projected onto neighbouring nodes.

NLP Techniques in Intelligent Tutoring Systems

Chutima Boonthum, Hampton University, USA
Irwin B. Levinstein, Old Dominion University, USA
Danielle S. McNamara, The University of Memphis, USA
Joseph P. Magliano, Northern Illinois University, USA
Keith K. Millis, The University of Memphis, USA

INTRODUCTION

Many Intelligent Tutoring Systems (ITSs) aim to help students become better readers. The computational challenges involved are (1) to assess the students’ natural language inputs and (2) to provide appropriate feedback and guide students through the ITS curriculum. To address both challenges, the following non-structural Natural Language Processing (NLP) techniques have been explored, and the first two are already in use: word-matching (WM), latent semantic analysis (LSA, Landauer, Foltz, & Laham, 1998), and topic models (TM, Steyvers & Griffiths, 2007). This article describes these NLP techniques, the iSTART (Interactive Strategy Trainer for Active Reading and Thinking, McNamara, Levinstein, & Boonthum, 2004) intelligent tutor and the related Reading Strategies Assessment Tool (R-SAT, Magliano et al., 2006), and how these NLP techniques can be used to assess students’ input in iSTART and R-SAT. It also discusses related NLP techniques that are used in other applications and may be of use in assessment tools or intelligent tutoring systems.

BACKGROUND

Interpreting text is critical for intelligent tutoring systems (ITSs) that are designed to interact meaningfully with, and adapt to, the users’ input. Different ITSs use different Natural Language Processing (NLP) techniques. NLP systems may be structural, i.e., focused on grammar and logic, or non-structural, i.e., focused on words and statistics. This article deals with the latter. Examples of the structural approach include ExtrAns (Extracting Answers from technical texts question-answering system; Molla et al., 2003), which uses minimal logical forms (MLF; that is, the form of first order predicates) to represent both texts and questions, and C-Rater (Leacock & Chodorow, 2003), which scores short-answer questions by analyzing the conceptual information of an answer with respect to the given question. Turning to the non-structural approach, AutoTutor (Graesser et al., 2000) uses LSA to analyze the student’s input against expected sets of answers, and CIRCSIM-Tutor (Kim et al., 1989) uses a word-matching technique to evaluate students’ short answers. The systems considered more fully below, iSTART (McNamara et al., 2004) and R-SAT (Magliano et al., 2006), use both word-matching and LSA to assess the quality of students’ self-explanations. Topic models (TM) were explored in both systems, but have not yet been integrated.

MAIN FOCUS OF THE CHAPTER

This article presents three non-structural NLP techniques (WM, LSA, and TM) which are currently used or being explored in reading strategies assessment and training applications, particularly iSTART and R-SAT.

Word Matching

Word matching is a simple and intuitive way to estimate the nature of an explanation. There are two ways to compare words from the reader’s input (either answers or explanations) against benchmarks (collections of words that represent a unit of text or an ideal answer): (1) literal word matching and (2) Soundex matching.

Literal word matching: Words are compared character by character, and if there is a match of sufficient length, it is called a literal match. An alternative is to count words that have the same stem (e.g., indexer and indexing) as matching. If a word is short, a complete match may be required to reduce the number of false positives.

Soundex matching: This algorithm compensates for misspellings by mapping similar characters to the same soundex symbol (Christian, 1998). Words are transformed to their soundex code by retaining the first character, dropping the vowels, and then converting other characters into soundex symbols: 1 for b, p; 2 for f, v; 3 for c, k, s; etc. Sometimes only one consecutive occurrence of the same symbol is retained. There are many variants of this algorithm designed to reduce the number of false positives (e.g., Philips, 1990). As in literal matching, short words may require a full soundex match, while for longer words the first n soundex symbols may suffice.

Word-matching is also used in other applications, such as CIRCSIM-Tutor (Kim et al., 1989) for short-answer questions and the Short Essay Grading System (Ventura et al., 2004) for questions with ideal expert answers.
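The two kinds of matching can be sketched as follows. This is only an illustration under stated assumptions: the Soundex function implements the standard American Soundex grouping (b, f, p, v as 1; c, g, j, k, q, s, x, z as 2; and so on), which differs from the symbol table quoted above, and the prefix length `min_len` and the helper `count_matches` are hypothetical choices, not the ones used in iSTART or R-SAT.

```python
# Standard American Soundex letter groups (an assumption; the variant
# described in the text groups letters differently).
_CODES = {}
for _digit, _letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word, length=4):
    """Return the Soundex code of `word`, padded with zeros."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit  # keep one symbol per run of equal codes
        if ch not in "hw":  # h and w do not break a run
            prev = digit
    return (code + "0" * length)[:length]


def literal_match(a, b, min_len=4):
    """Prefix match of `min_len` characters; short words must be equal."""
    a, b = a.lower(), b.lower()
    if min(len(a), len(b)) < min_len:
        return a == b
    return a[:min_len] == b[:min_len]


def count_matches(input_words, benchmark_words):
    """Number of input words matching any benchmark word."""
    return sum(
        any(literal_match(w, b) or soundex(w) == soundex(b)
            for b in benchmark_words)
        for w in input_words)


matches = count_matches("the hart pumps blod".split(),
                        ["heart", "blood", "pump"])
```

In this made-up example the misspellings "hart" and "blod" are caught by the Soundex comparison, while "pumps" is caught by the literal prefix match.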

Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA; Landauer, Foltz, & Laham, 1998) uses statistical computation to extract and represent the meaning of words. Meanings are represented in terms of their similarity to other words in a large corpus of documents. LSA begins by finding the frequency of the terms used and the number of co-occurrences in each document throughout the corpus, and then applies a mathematical transformation to uncover deeper meanings and relations between words.

When measuring the similarity between text-objects, LSA’s accuracy improves with the size of the objects. It therefore provides the most benefit when comparing two documents; because it does not take word order into account, short documents may not receive the full benefit. The details for constructing an LSA corpus matrix are in Landauer & Dumais (1997). Briefly, the steps are: (1) select a corpus; (2) create a term-document-frequency (TDF) matrix; (3) apply Singular Value Decomposition (SVD; Press et al., 1986) to the TDF matrix to decompose it into three matrices (L x S x R, where S is a scaling matrix). The leftmost matrix (L) becomes the LSA matrix of that corpus. The optimal size is usually in the range of 300–400 dimensions. Hence, the LSA matrix dimensions become N x D, where N is the number of unique words in the entire corpus and D is the optimal dimension (reduced from the total number of documents in the entire corpus).

The similarity of terms (or words) is computed by comparing two rows, each representing a term vector. This is done by taking the cosine of the two term vectors. To find the similarity of sentences or documents, (1) for each document, create a document vector using the sum of the term vectors of all the terms appearing in the document, and (2) calculate the cosine between the two document vectors. Cosine values range from −1 to +1, where +1 means highly similar.

To use LSA in tutoring systems, a set of benchmarks is created and compared with the trainee’s input. Example benchmarks are the current target sentence, previous sentences, and the ideal answer. A high cosine value between the current sentence benchmark and the reader’s input would indicate that the reader understood the sentence and was able to paraphrase what was read. To provide appropriate feedback, a number of cosines are computed (one for each benchmark). Various statistical methods, such as discriminant analysis and regression analysis, are used to construct the feedback formula. McNamara et al. (2007) describe various ways that LSA can be used to evaluate the reader’s explanations: either LSA alone or a combination of LSA with WM. The final conclusion is that a fully automated, combined system (i.e., one requiring less hand-crafted benchmark construction) produces the better results.

A number of other intelligent tutoring systems use LSA in their feedback system, for example, Summary Street (Steinhart, 2001), AutoTutor (Graesser et al., 2000), and Tutoring System (Lemaire, 1999).
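The construction steps and the cosine comparison can be sketched on a toy corpus. This is a minimal illustration assuming NumPy is available; the four sentences, the choice of D = 2 dimensions, and the helper names are made up for the example and are far smaller than the 300–400 dimensions used in practice.

```python
import numpy as np

# Step 1: a toy corpus (hypothetical sentences, one per document).
corpus = [
    "the heart pumps blood through the body",
    "blood carries oxygen to the body",
    "plants use sunlight to make food",
    "sunlight helps plants grow",
]

# Step 2: term-document-frequency (TDF) matrix, one row per unique term.
vocab = sorted({w for doc in corpus for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}
tdf = np.zeros((len(vocab), len(corpus)))
for j, doc in enumerate(corpus):
    for w in doc.split():
        tdf[index[w], j] += 1

# Step 3: SVD, then keep the first D columns of the left matrix (N x D).
L, S, Rt = np.linalg.svd(tdf, full_matrices=False)
D = 2
lsa = L[:, :D]


def doc_vector(text):
    """Sum of the term vectors of the known terms in `text`."""
    vec = np.zeros(D)
    for w in text.lower().split():
        if w in index:
            vec += lsa[index[w]]
    return vec


def cosine(u, v):
    """Cosine of two vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


# Compare a reader's input against two benchmarks.
reader = doc_vector("blood moves through the body")
sim_target = cosine(reader, doc_vector(corpus[0]))
sim_other = cosine(reader, doc_vector(corpus[2]))
```

In a tutoring system one such cosine would be computed per benchmark, and the set of cosines would then feed the feedback formula.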

Topic Models

The Topic Models approach (TM; Steyvers & Griffiths, 2007) applies a probabilistic model to find a relationship between terms and documents in terms of topics. A document is considered to be generated probabilistically from a number of topics, where each topic consists of a number of terms, each given a probability of selection if that topic is used. By using a TM matrix, the probability that a certain topic was used in the creation of a given document is estimated. If two documents are similar, the estimates of the topics within these documents should be similar. TM is similar to LSA, except that a term-document frequency matrix is factored into two matrices instead of three: one is the probabilities of terms belonging to the topics (the TM matrix), the other the probabilities of topics belonging to the documents. The Topic Modeling Toolbox (Steyvers & Griffiths, 2007) can be used to construct a TM matrix. To measure the similarity between documents, the Kullback Leibler distance (KL-distance; Steyvers & Griffiths, 2007) is recommended, rather than the cosine measure (which can also be used).

Using TM in a tutoring system is similar to using LSA: a set of benchmarks is defined and the reader’s input is compared against each benchmark. The only difference is the use of the KL-distance instead of the LSA cosine value. The preliminary results of investigating TM in place of LSA (Boonthum, Levinstein, & McNamara, 2006) indicate that TM is as good as LSA alone (in terms of the correlation between computerized scores and human rating scores), but slightly worse than a combined system using both WM and LSA. This suggests that TM should be further investigated in combination with WM or LSA or both.

TM is mostly used in document clustering (grouping documents based on relevancy or similar topics; Buntine et al., 2005), data mining (Tuulos & Tirri, 2004), and search engines (Perkiö et al., 2004). A variation on the TM of Steyvers & Griffiths (2007) is Probabilistic Latent Semantic Analysis (PLSA; Hofmann, 2001), which models each document as generated from a number of hidden topics, where each topic has its features defined as the conditional probabilities of word occurrences in that topic.
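A minimal sketch of the KL-based comparison follows, assuming each document has already been reduced to a probability distribution over topics. The symmetric form (the average of the two directed divergences) is one common variant and is an assumption here, as are the smoothing constant `eps` and the example distributions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Directed KL divergence D(p || q) for two discrete distributions.

    `eps` is a small smoothing constant (an assumption of this sketch)
    that avoids division by zero when q has empty topics.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Average of the two directed divergences; 0 for identical inputs."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical topic distributions for a benchmark and two reader inputs.
benchmark = [0.70, 0.20, 0.10]
close_input = [0.60, 0.30, 0.10]
far_input = [0.05, 0.15, 0.80]

d_close = symmetric_kl(benchmark, close_input)
d_far = symmetric_kl(benchmark, far_input)
```

Unlike the LSA cosine, a smaller value here means greater similarity, so any feedback thresholds have to be reversed accordingly.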

iSTART and R-SAT Applications

iSTART (Interactive Strategy Trainer for Active Reading and Thinking) is a web-based, automated tutor designed to help students become better readers using multi-media technology. It provides adolescent to college-aged students with a program of self-explanation and reading strategy training called Self-Explanation Reading Training, or SERT (see McNamara et al., 2004). iSTART consists of three modules: Introduction (description of SERT and reading strategies), Demonstration (illustration of how these reading strategies can be used), and Practice (hands-on practice of these reading strategies). In the Practice module, students practice using reading strategies by typing self-explanations of sentences. The system evaluates each explanation and then provides appropriate feedback to the student. If the explanation is irrelevant or too short compared to the given sentence and passage, the student is required to add more information. Otherwise, the feedback is based on the level of its overall quality.

The computational challenge is to provide appropriate feedback to the students about their explanations. Doing so requires capturing some sense of both the meaning and quality of their explanation. A combination of word-matching and LSA provided better results than either technique separately, that is, a higher correlation between the computerized scores and the human rating scores (McNamara, Boonthum, Levinstein, & Millis, 2007).

R-SAT (Reading Strategy Assessment Tool; Magliano et al., 2007) is an automated web-based reading assessment tool designed to measure readers’ comprehension and spontaneous use of reading strategies. R-SAT is similar to the iSTART Practice module in that it presents passages to the reader one sentence at a time and asks for the reader’s input. The difference is that, instead of an explanation, R-SAT asks either an indirect question (“What are your thoughts regarding your understanding of the sentence in the context of the passage?”) or a direct question (e.g., “Why did the miller want to marry the girl?”) at pre-selected target sentences. The answers to the indirect questions are evaluated on how they are related to the given sentence and passage; the answers to the direct questions are assessed by comparing them to ideal answers.



The problem is to analyze the answers and generate a set of scores for overall comprehension and strategy usage. Ultimately, these scores can be used as a pre-assessment for iSTART, allowing the trainer to individualize the iSTART curriculum based on the reader’s needs. R-SAT was initially proposed to use word-matching, LSA, and other techniques beyond LSA. However, during the course of development, word-matching alone was found to produce better results than LSA alone or word-matching combined with LSA.

FUTURE TRENDS

These three NLP techniques (WM, LSA, and TM) are used in the ongoing research on assessing and improving comprehension skills via reading strategies in the R-SAT and iSTART projects. WM and LSA have been extensively investigated for iSTART and to some extent in R-SAT. The lack of success of LSA compared to the simpler WM in R-SAT is somewhat surprising and may be due to particular features of the algorithms used or to the variety of text genres used in R-SAT. Future work is planned with modified algorithms and with genre-specific LSA spaces in place of the general space now used. In addition, TM needs further exploration, especially in its use with small units of text, where the recommended Kullback Leibler distance has not proven particularly effective.

CONCLUSION

The purpose of this article is to describe three NLP techniques and how they can be used in assessment tools and intelligent tutoring systems. For iSTART to teach reading strategies effectively, it must be able to deliver valid feedback on the quality of the explanations that a reader produces, and therefore the system must understand the explanation, at least to some extent. Of course, automating natural language understanding has been extremely challenging, especially for non-restrictive content domains like explaining a freely-entered text. Algorithms such as LSA open up a number of possibilities to systems such as iSTART: in essence, LSA provides a ‘simple’ algorithm that allows tutoring systems to provide appropriate feedback to students (see Landauer et al., 2007). The results presented in Boonthum et al. (2006) show that the topic model similarly offers a wealth of possibilities in natural language processing.

For R-SAT to measure a reader’s comprehension and reading skills accurately, like iSTART it must also be able to understand, to some extent, what a reader says, especially when he/she is asked to describe their current thoughts. Although LSA is a good candidate, simple word matching against various benchmarks seems adequate to provide satisfactory results, especially when aggregated over several explanations (see Magliano et al., 2006). It has also been demonstrated that a combination of techniques produces better results than using one technique on its own.

REFERENCES

Boonthum, C., Levinstein, I.B., & McNamara, D.S. (2006). Evaluating Self-Explanations in iSTART: Word Matching, Latent Semantic Analysis, and Topic Models. In A. Kao & S. Poteet (Eds.), Text Mining and Natural Language Processing, Springer. 91-106.
Buntine, W., Löfström, J., Perttu, S., & Valtonen, K. (2005). Topic-Specific Scoring of Documents for Relevant Retrieval. In Workshop on Learning in Web Search (LWS-2005), pp. 34-41.
Christian, P. (1998). Soundex – can it be improved? Computers in Genealogy, 6(5).
Graesser, A., Wiemer-Hastings, P., Wiemer-Hastings, K., Harter, D., Person, N., & TRG. (2000). Using Latent Semantic Analysis to evaluate the contributions of students in AutoTutor. Interactive Learning Environments, 8, 149-169.
Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, pp. 177-196.
Kim, N., Evens, M.W., Michael, J.A., & Rovick, A.A. (1989). CIRCSIM-Tutor: An Intelligent Tutoring System for Circulatory Physiology. In Maurer, H. (ed.), Computer-Assisted Learning: 2nd International Conference (ICCAL-89), pp. 254-266. Berlin: Springer-Verlag.
Landauer, T.K., Foltz, P.W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
Landauer, T., McNamara, D.S., Dennis, S., & Kintsch, W. (2007). A Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum.
Landauer, T.K. & Dumais, S.T. (1997). A solution to Plato’s problem: the Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211-240.
Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4), 389-405.
Lemaire, B. (1999). Tutoring Systems Based on Latent Semantic Analysis. In Artificial Intelligence in Education (AIED-99), S. Lajoie and M. Vivet (eds.), IOS Press, Amsterdam, pp. 527-534.
Magliano, J.P., Millis, K.K., Gilliam, S., Levinstein, I.B., & Boonthum, C. (2006). Assessing Reading Comprehension with Verbal Protocols and Latent Semantic Analysis. In the Proceeding of the 47th Annual Meeting of the Psychonomic Society, Houston, TX.
McNamara, D.S., Boonthum, C., Levinstein, I.B., & Millis, K.K. (2007). Using LSA and word-based measures to assess self-explanations in iSTART. In T. Landauer et al. (Eds.), A Handbook of Latent Semantic Analysis (pp. 227-241). Mahwah, NJ: Erlbaum.
McNamara, D.S., Levinstein, I.B., & Boonthum, C. (2004). iSTART: Interactive Strategy Trainer for Active Reading and Thinking. Submitted to Behavioral Research Methods, Instruments, and Computers, 36, 222-233.
Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J., & Hess, M. (2003). ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent System, 18(4), 12-17.
Perkiö, J., Buntine, W., & Perttu, S. (2004). Exploring Independent Trends in a Topic-Based Search Engine. In Proceedings of the Web Intelligence Conference (WI-2004), pp. 664-668.
Philips, L. (1990). Hanging on the Metaphone. Computer Language, 7(12).
Press, W.M., Flannery, B.P., Teukolsky, S.A., & Vetterling, W.T. (1986). Numerical recipes: The art of scientific computing. New York, NY: Cambridge University Press.
Steinhart, D. (2001). Summary Street: An intelligent tutoring system for improving student writing through the use of latent semantic analysis. Ph.D. dissertation, Dept. Psychology, Univ. Colorado, Boulder.
Steyvers, M., & Griffiths, T. (2007). Probabilistic Topic Models. In T. Landauer, D.S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 427-448). Mahwah, NJ: Erlbaum.
Tuulos, V.H. & Tirri, H. (2004). Combining Topic Models and Social networks for Chat Data Mining. In Proceedings of the Web Intelligence Conference (WI-2004), pp. 206-213.
Ventura, M.J., Franchescetti, D.R., Pennumatsa, P., Graesser, A.C., Jackson, G.T., Hu, X., Cai, Z., & TRG. (2004). Combining Computational Models of Short Essay Grading for Conceptual Physics Problems. In J.C. Lester et al. (Eds.), Intelligent Tutoring Systems (pp. 423-431). Berlin, Germany: Springer.

KEY TERMS

Intelligent Tutoring System (ITS): Also called Intelligent Computer-Aided Instruction (ICAI), a personal training assistant that captures the subject matter and teaching expertise and individualizes the curriculum to meet each learner’s needs in order to master the subject matter. Its main goal is to provide the benefits of one-on-one instruction: lessons are conducted at the learner’s own pace; practices are interactive so the learner can improve their weaker skills; real-time question answering clarifies the learner’s doubts or misunderstandings; and the curriculum is individualized based on the learner’s needs.

Kullback Leibler Distance (KL-distance): A natural distance function from a “true” probability distribution to a “target” probability distribution. It can be interpreted as the expected extra message-length per datum due to using a code based on the wrong (target) distribution compared to using a code based on the true distribution.

Latent Semantic Analysis (LSA): A natural language processing technique that analyses relationships between a set of documents and terms within these documents. LSA was created in 1990 for information retrieval and is sometimes called latent semantic indexing (LSI).

LSA Cosine: A measurement of a relation between two vector-units. A unit can be as small as a word or as large as an entire document. It can be computed using the dot-product of two vectors where each vector is a representation of a unit (word, sentence, paragraph, or whole document).

Probabilistic Latent Semantic Analysis (PLSA): A statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. PLSA evolved from LSA but focuses more on the relationship of topics within documents.

Protocols: Any verbal input that students or readers produce during a session. This can be a set of explanations or answers to direct questions.

Self-Explanation and Reading Strategy Trainer (SERT): A pedagogy that uses five strategies to help students become better readers. The reading strategies include (1) comprehension monitoring, being aware of one’s own understanding of the text; (2) paraphrasing, or restating the text in different words; (3) elaboration, using prior knowledge or experiences to understand the text (domain-specific knowledge-based inferences) or using common sense or logic to understand the text (general knowledge-based inferences); (4) prediction, predicting what the text will say next; and (5) bridging, understanding the relation between separate sentences of the text.

Word Matching (WM): A simple way to compare words. A literal match is done by comparing character by character, while a Soundex match transforms each word into a Soundex code, similar to phonetic spelling.


Non-Cooperative Facial Biometric Identification Systems

Carlos M. Travieso González, University of Las Palmas de Gran Canaria, Spain
Aythami Morales Moreno, University of Las Palmas de Gran Canaria, Spain

INTRODUCTION

The verification of identity is becoming a crucial factor in our hugely interconnected society. Questions such as “Is she really who she claims to be?” or “Is this person authorized to use this facility?” are routinely posed in a variety of scenarios, ranging from issuing a driver’s license to gaining entry into a country. The need for reliable user authentication techniques has increased in the wake of heightened concerns about security and rapid advancements in networking, communication, and mobility. Biometrics, described as the science of recognizing an individual based on his or her physical or behavioural traits, is beginning to gain acceptance as a legitimate method for determining an individual’s identity. Biometric systems have now been deployed in various commercial, civilian, and forensic applications as a means of establishing identity. In particular, this work presents a non-cooperative identification system based on facial biometrics.

BACKGROUND

How do biological measurements qualify as being biometric? Any human physiological and/or behavioural characteristic can be used as a biometric characteristic as long as it satisfies the following requirements (Jain, Ross & Prabhakar, 2004): universality, distinctiveness, permanence, and collectability. The choice of biometric identifiers has a major impact on the performance of the system, and it depends greatly on the intended application of the system. Currently, some of the most widely used biometric identifiers include fingerprints (Jain, Ross & Prabhakar, 2004, pp. 43-64), hand geometry (Sanchez-Reillo, Sanchez-Avila, Gonzalez-Marcos, 2000), iris (Jain, Ross & Prabhakar, 2004, pp. 103-121), face (Jain, Ross & Prabhakar, 2004, pp. 65-86), etc.

Most biometric systems require co-operation on the part of the users in order to acquire their biometric data. Face identification does not require such co-operation, although it can make use of it. This is its principal advantage over other biometric systems. Human face identification is an extensively studied field, both because computational cost is no longer a drawback and because this kind of biometric identification is increasingly important for access security in places such as airports, metros, and train and bus stations.

The process of facial identification incorporates two significant methods: detection (an individual from among a set) and identification (whether an individual is who s/he claims to be). Face detection (Young-Bum Sun, Jin-Tae Kim & Won-Hyung Lee, 2002) involves locating the human face within an image captured by a video camera and isolating that face from the other objects captured within the image. Identification compares the captured face with other faces that have been saved and stored in a database. The basic underlying technology of facial feature identification involves either eigenfeatures (facial metrics) or eigenfaces. A great variety of references can be found for this type of study (Discrete Cosine Transform (DCT), Karhunen-Loeve (KL) Transform, Independent Component Analysis (ICA), Principal Component Analysis (PCA), etc.). The greatest advantage of a facial identification system is its non-cooperative nature: it can work independently of user co-operation.



Non-Cooperative Facial Biometric Identification Systems

FACIAL IDENTIFICATION SYSTEM

This article presents the two principal processes associated with face identification: face detection and face identification. Other aspects of a facial identification system must also be taken into account. In the face detection module the face is captured at the moment the camera takes a picture or frame. Image acquisition can be carried out using RGB images or infrared (IR) images, among other formats; recently, thermal images have also been used. The choice of image format depends on the application, the lighting conditions, the location (indoor or outdoor system) and the required degree of security. The face identification module holds a database with the information of the users that must be identified; therefore a supervised classification must be carried out. The parameterization submodule extracts the user features, and the classification system generates a model in order to distinguish the enrolled user or users from all other persons (see figure 1).
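The parameterization and supervised classification stages described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses PCA (an eigenface-style projection) as the parameterization and a nearest-neighbour rule as the classifier, it relies on NumPy, and the "facial database" is synthetic random data, not real images. The function names `fit_pca`, `parameterize` and `classify` are hypothetical.

```python
import numpy as np

def fit_pca(train, n_components):
    """Return the training mean and the leading principal axes (one per row)."""
    mean = train.mean(axis=0)
    # SVD of the centred data: the rows of vt are the principal axes
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def parameterize(samples, mean, axes):
    """Project samples onto the principal axes to obtain feature vectors."""
    return (samples - mean) @ axes.T

def classify(test_feats, train_feats, train_labels):
    """Nearest-neighbour decision using the Euclidean distance."""
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    return train_labels[np.argmin(d, axis=1)]

# Synthetic stand-in for a facial database (assumption): 2 users, 5 images each.
# Every "image" is a flattened 8x8 vector = user template + small noise.
rng = np.random.default_rng(0)
templates = rng.normal(size=(2, 64))
train = np.vstack([t + 0.05 * rng.normal(size=(5, 64)) for t in templates])
train_labels = np.repeat([0, 1], 5)
test = templates + 0.05 * rng.normal(size=(2, 64))   # one test sample per user

mean, axes = fit_pca(train, n_components=2)
predicted = classify(parameterize(test, mean, axes),
                     parameterize(train, mean, axes), train_labels)
print(predicted)
```

In a real system the trained model (here the mean, the axes and the stored training features) would be built once from the facial database and reused in test mode for every detected face.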

Face Detection

The challenges associated with face detection can be attributed to the following factors: pose, presence or absence of structural components, facial expression, occlusion, image orientation and imaging conditions. There are many closely related problems with respect to face detection. Face localization aims to determine the image position of a single face; this is a simplified detection problem with the assumption that an input image contains only one face (Lam & Yan, 1994). The goal of facial feature detection is to detect the presence and location of features such as eyes, nose, nostrils, eyebrows, mouth, lips, ears, etc., with the assumption that there is only one face in an image (Zhiwei & Qiang, 2006). Face recognition or face identification compares an input image against a database and reports a match, if found (Darrell, Gordon, Harville & Woodfill, 2000). The purpose of face authentication is to verify the claim of the individual's identity in an input image (Crowley & Berard, 1997), while face tracking methods continuously estimate the location and possibly the orientation of a face in an image sequence in real time (Darrell, Gordon, Harville & Woodfill, 2000; Zhiwei & Qiang, 2006) (see figure 2).

Several face detection systems have been introduced (Ming-Hsuan Yang, David Kriegman & Narendra Ahuja, 2002; Yang, Ahuja & Kriegman, 2000). There are many existing techniques to detect faces based on a single image. These techniques can be classified into three categories:

• Knowledge Based System: This approach depends on using rules about human facial features to detect faces. Human facial features such as two eyes that are symmetric to each other, a nose and mouth, and other distance features represent this feature set. After detecting features, a verification process is carried out to reduce false detections. This approach is good for frontal images, as is shown in figure 3. The difficulty lies in translating human knowledge into well-defined rules and in detecting faces in different poses. Furthermore, the surrounding environment can also pose a problem. For example, changes in light sources can add or remove shadows from a face. Therefore, many variables should be considered when designing a face detection system.

Figure 1. Block diagram for a non-cooperative facial identification. Blocks shown: FACE DETECTION MODULE (face detection); FACE IDENTIFICATION MODULE (facial database, parameterization, classification system, trained model, test sample, decision in test mode).



Figure 2. Face detection examples in motion picture captures

For these reasons, this technique lacks invariance to such changes when it is used in a non-cooperative system.

Figure 3. A typical face image used in knowledge based methods

• Image Based System: In this approach, a predefined standard face pattern is matched against the segments in the image to determine whether they are faces or not. Training algorithms are used to classify regions into face or non-face classes. Image-based techniques depend on multi-resolution window scanning to detect faces, so they have high detection rates but are slower than the feature-based techniques. Eigenfaces (Yang, Ahuja & Kriegman, 2000) and neural networks (Rowley, Baluja & Kanade, 1998) are examples of image-based techniques. This approach has the advantage of being simple to implement, but it cannot effectively deal with variation in scale, pose and shape (Rein-Lien Hsu & Jain, 2002).

• Features Based System: This approach depends on the extraction of facial features that are not affected by variations in lighting conditions, pose and/or other factors. These methods are classified according to the extracted features. Feature-based techniques depend on feature derivation and analysis to gain the required knowledge about faces. Features may be skin colour, face shape, or facial features such as eyes, nose, etc. Feature based methods are preferred for real time systems, where the multi-resolution window scanning used by image based methods is not applicable. Human skin colour is an effective feature for detecting faces: although different people have different skin colours, several studies have shown that the basic difference lies in their intensity rather than their chrominance. Human faces also have a special texture that can be used to separate them from other objects (Bojkovic & Samcovic, 2006). The facial features method depends on detecting features of the face.
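The observation that skin tones differ mainly in intensity rather than chrominance suggests thresholding only the chrominance channels. The following is a minimal sketch, assuming NumPy, the ITU-R BT.601 full-range RGB to YCbCr conversion, and illustrative Cb/Cr ranges of the kind commonly quoted in the skin segmentation literature; these thresholds are an assumption and are not taken from this article, and `skin_mask` is a hypothetical helper.

```python
import numpy as np

def rgb_to_cbcr(rgb):
    """Convert an (..., 3) RGB array (0-255) to Cb and Cr (BT.601, full range)."""
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr

def skin_mask(rgb, cb_range=(77, 127), cr_range=(133, 173)):
    """True where the chrominance falls inside the assumed skin ranges.

    The luminance Y is deliberately ignored, so that brighter and darker
    versions of the same tone are treated alike.
    """
    cb, cr = rgb_to_cbcr(rgb)
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))

pixels = np.array([[224, 172, 138],   # light skin-like tone
                   [112,  86,  69],   # same tone at half intensity
                   [  0,   0, 255],   # blue
                   [128, 128, 128]])  # grey
mask = skin_mask(pixels)
print(mask)
```

In a detector, the connected regions of such a mask would only be face candidates; they would still have to be verified with shape or facial feature tests.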

Face Identification in Transform Domain Systems

Detected faces are always captured under variable conditions (lighting, expression, rotation, translation, etc.), and therefore the images used for training can differ from the images delivered by face detection. Features or Knowledge Based Systems are at a disadvantage here because of the wide data variability produced by these conditions. Transform domain systems are a good alternative because they group the information and contribute more discrimination to facial identification.

Transform domain analysis is a commonly used image processing and parameterization technique. In recent years some work has been done to extract transform domain features for image identification. Li et al. extract Fourier range and angle features to identify the palm-print image (Li, Zhang & Xu, 2002). Lai et al. use holistic Fourier invariant features to recognize the facial image (Lai, Yuen & Feng, 2001). Another spectral feature, generated from singular value decomposition (SVD), is used by some researchers (Chellappa, Wilson & Sirohey, 1995). However, Tian et al. indicate that this feature does not contain adequate information for face recognition (Tian, Tan, Wang & Fang, 2003). Hafed and Levine (2001) extract discrete cosine transform (DCT) features for face recognition. They point out that the DCT obtains the near-optimal performance of the Karhunen-Loeve (KL) transform in facial information compression, and that the performance of the DCT is superior to that of the discrete Fourier transform (FT) and other conventional transforms. By manually selecting the DCT frequency bands, their recognition method achieves a recognition performance similar to that of the Eigenface method (M. H. Yang, 2002), which is based on the KL transform. Nevertheless, their method does not provide a rational band selection rule or strategy, nor does it outperform the classic Eigenface method.

In addition, some extended discrimination methods have been proposed. Zhang et al. (2002) present a dual eigenspace method for face recognition. W. Malina (2001) proposed several new discrimination principles based on the Fisher criterion. Yang uses kernel principal component analysis (kernel PCA) for facial feature extraction and recognition (Yang, 2002), while Bartlett et al. (2002) apply independent component analysis (ICA) to face recognition. However, Yang shows that both ICA and kernel PCA need much more computing time than PCA. In addition, when the Euclidean distance is used, there is no significant difference in the classification performance of PCA and ICA (Bartlett, Movellan & Sejnowski, 2002). Jing et al. (2003) put forward a classifier combination method for face recognition. This article does not analyze and compare these extended discrimination methods, but limits itself to a comparison of major linear discrimination methods, including the Eigenface method, the Fisherface method, DLDA and discriminant waveletfaces.

The KL transform is an optimal transform for removing statistical correlation. Of the discrete transforms, the DCT best approaches the KL transform (Hu, Worrall, Sadka & Kondoz, 2001). In other words, the DCT has a strong ability to remove correlation and compress images. Furthermore, the DCT can be computed efficiently by means of the fast Fourier transform (FFT), while there is no fast algorithm for the KL transform. Our approach therefore takes advantage of these favourable properties of the DCT. Table 1 shows different systems based on different methods of face recognition with their corresponding recognition rates. The databases used are ORL [ORL Database], Yale [Yale Database], AR-Face [AR Database] and FERET [FERET Database].
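The energy compaction property that makes the DCT attractive as a parameterization can be illustrated with a short sketch. It is not the method of any of the cited papers: it assumes NumPy, builds the orthonormal DCT-II matrix explicitly instead of using a fast algorithm, keeps a fixed 8x8 low-frequency block as the feature vector (the block size is an arbitrary assumption), and uses a synthetic smooth image instead of a face.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C @ x is the DCT of a length-n vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    c[0, :] = np.sqrt(1.0 / n)
    return c

def dct2(image):
    """Separable 2D DCT-II of a 2D array (rows first, then columns)."""
    rows, cols = image.shape
    return dct_matrix(rows) @ image @ dct_matrix(cols).T

def dct_features(image, block=8):
    """Keep the top-left (low-frequency) block x block coefficients as features."""
    return dct2(image)[:block, :block].ravel()

# Smooth synthetic 32x32 "image" (assumption): a constant plus one
# low-frequency cosine pattern sampled on the DCT grid.
n = 32
x = (np.arange(n) + 0.5) / n
image = 0.5 + 0.3 * np.outer(np.cos(np.pi * x), np.cos(2 * np.pi * x))

features = dct_features(image)
# The transform is orthonormal, so coefficient energy equals image energy.
energy_kept = np.sum(features ** 2) / np.sum(image ** 2)
print(features.shape, energy_kept)
```

For this deliberately smooth image the 64 retained coefficients carry essentially all of the energy of the 1024 pixels. Real face images are less compressible, and the choice of which bands to keep is exactly the selection problem mentioned above.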

FUTURE TRENDS

Recently, numerous methods that combine several facial features have been proposed to locate or detect faces. Most of them use global features such as skin colour, size and shape to find face candidates, and then verify these candidates using different local parameterization methods. The challenge is to make the captured images invariant to acquisition conditions (light, shapes, ...) and positional changes (rotations, scales, ...). The creation and development of new methods based on transform domain systems will provide robust characteristics for achieving this invariance.

With respect to facial identification, 3D techniques can be used for this purpose, but the computational cost is a major disadvantage for real time applications. Facial reconstruction with 3D techniques can obtain more information, from which any features can be extracted. Moreover, such a system retains its non-cooperative quality. In the future, the use of multimodal systems that include other biometric characteristics will produce stronger and more robust systems.

Figure 4. Face samples with different conditions (lighting and rotation)

CONCLUSION Face recognition is a challenging and interesting problem. However, it can also be regarded as part of the wider attempt to solve one of the greatest challenges to computer vision, that of object recognition. In particular, facial identification is becoming a very important biometric system in the battle to reduce global terrorism. Much research has already been carried out in this field, and bearing in mind the threat to security which the world is currently facing, there will undoubtedly be many more publications on facial identification in the future.

REFERENCES

Jain, A. K., Ross, A., & Prabhakar, S., (2004) An Introduction to Biometric Recognition, IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Image and Video Based Biometrics, 14(1), pp. 4-20.
Sanchez-Reillo, R., Sanchez-Avila, C., & Gonzalez-Marcos, A., (2000) Biometric identification through hand geometry measurements, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10), pp. 1168-1171.
Young-Bum S., Jin-Tae K., & Won-Hyung L., (2002) Extraction of face objects using skin color information, IEEE 2002 International Conference on Communications, 1, pp. 600-604.
Lam, K., & Yan, H., (1994) Fast Algorithm for Locating Head Boundaries, J. Electronic Imaging, 3(4), pp. 351-359.
Zhiwei, Z., & Qiang J., (2006) Robust Pose Invariant Facial Feature Detection and Tracking in Real-Time, 18th International Conference on Pattern Recognition, 1, pp. 1092-1095.

Table 1. Results of different systems with the different databases

Database   System                                                          Recognition rate
ORL        IGF (Liu & Wechsler, 2003)                                      100%
ORL        Gabor with FLD (Zhu, Vai & Peng Un Mak, 2004)                   99.0%
ORL        Discrete Wavelet Transform + SVM (Travieso et al., 2004)        98.9%
ORL        FRCM (Ho-Man Tang, Michael Lyu & Irwin King, 2003)              98.8%
ORL        ENFS (Zhu, Vai & Mak, 2004)                                     98.5%
ORL        Embedded HMM (Nefian & Hayes, 1999)                             98.0%
ORL        Several SVM+NN arbitrator (Kim, Jung & Kim, 2002)               97.9%
ORL        Kernel PCA (Kim, Jung & Kim, 2002)                              97.5%
ORL        Nearest Feature Space (Chien & Wu, 2002)                        96.1%
ORL        2D DCT with KPCA and NFS (Zhu, Vai & Mak, 2003)                 96.0%
Yale       ICA + SVM (Déniz, Castrillón & Hernández, 2003)                 99.3%
Yale       Discriminative Common Vector (Cevikalp et al., 2005)            97.3%
Yale       MRF (Huang, Pavlovic & Metaxas, 2004)                           96.1%
Yale       FRCM (Ho-Man Tang, Michael Lyu & Irwin King, 2003)              96.0%
AR-Face    Discriminative Common Vector (Cevikalp et al., 2005)            99.3%
FERET      Gabor + ICA (Liu & Wechsler, 2003)                              100%
FERET      ICA + SVM (Jain & Huang, 2004)                                  95.7%



Yang, M. H., Ahuja, N., & Kriegman, D., (2000) Face recognition using kernel eigenfaces, International Conference on Image Processing, 1, pp. 37-40.
Crowley, J. L., & Berard, F., (1997) Multi-Modal Tracking of Faces for Video Communications, Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 640-645.
Darrell, T., Gordon, G., Harville, M., & Woodfill, J., (2000) Integrated Person Tracking Using Stereo, Color, and Pattern Detection, Int'l J. Computer Vision, 37(2), pp. 175-185.
Yang, M. H., Kriegman, D. J., & Ahuja, N., (2002) Detecting Faces in Images, IEEE Trans. Pattern Analysis and Machine Intelligence, 24(1).
Rowley, H. A., Baluja, S., & Kanade, T., (1998) Neural Network Based Face Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, January, pp. 23-38.
Bojkovic, Z., & Samcovic, A., (2006) Face Detection Approach in Neural Network Based Method for Video Surveillance, 8th Seminar on Neural Network Applications in Electrical Engineering, pp. 44-47.
Li, W., Zhang, D., & Xu, Z., (2002) Palmprint identification by Fourier transform, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(4), pp. 417-432.
Lai, J. H., Yuen, P. C., & Feng, G. C., (2001) Face recognition using holistic Fourier invariant features, Pattern Recognition, 34(1), pp. 95-109.
Chellappa, R., Wilson, C., & Sirohey, S., (1995) Human and machine recognition of faces: A survey, Proceedings of IEEE, 83, pp. 705-740.
Tian, Y., Tan, T. N., Wang, Y. H., & Fang, Y. C., (2003) Do singular values contain adequate information for face recognition?, Pattern Recognition, 36(3), pp. 649-655.
Hafed, Z. M., & Levine, M. D., (2001) Face recognition using the discrete cosine transform, International Journal of Computer Vision, 43(3), pp. 167-188.
Zhang, D., Peng, H., Zhou, J., & Pal, S. K., (2002) A novel face recognition system using hybrid neural and dual eigenspaces methods, IEEE Transactions on Systems, Man, and Cybernetics A, 32, pp. 787-793.
Malina, W., (2001) Two-parameter Fisher criterion, IEEE Transactions on Systems, Man, and Cybernetics B, 31, pp. 629-636.
Yang, M. H., (2002) Kernel eigenfaces vs. kernel fisherfaces: Face recognition using kernel methods, IEEE Proc. 5th International Conference on Automatic Face and Gesture Recognition, pp. 215-220.
Bartlett, M. S., Movellan, J. R., & Sejnowski, T. J., (2002) Face recognition by independent component analysis, IEEE Transactions on Neural Networks, 13, pp. 1450-1464.
Jain, A. K., (1989) Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice-Hall.
Jing, X. Y., Zhang, D., & Yang, J. Y., (2003) Face recognition based on a group decision-making combination approach, Pattern Recognition, 36(7), pp. 1675-1678.
Liu, C., & Wechsler, H., (2003) Independent component analysis of Gabor features for face recognition, IEEE Transactions on Neural Networks, 14, pp. 919-928.
Zhu, J., Vai, M., & Peng U. M., (2004) Gabor Wavelets Transform and Extended Nearest Feature Space Classifier for Face Recognition, Third International Conference on Image and Graphics, pp. 246-249.
Tang, H. M., Lyu, M., & King, I., (2003) Face recognition committee machine, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 837-840.
Zhu, J., Vai, M., & Mak, P., (2004) A New Enhanced Nearest Feature Space (ENFS) Classifier for Gabor Wavelets Features-based Face Recognition, International Conference on Biometric Authentication, pp. 124-131.
Nefian, A. V., & Hayes, M. H., (1999) An embedded HMM-based approach for face detection and recognition, IEEE International Conference on Acoustics, Speech, and Signal Processing, 6, pp. 3553-3556.
Kim, K. I., Jung, K., & Kim, H. J., (2002) Face Recognition Using Kernel Principal Component Analysis, IEEE Signal Processing Letters, 9, pp. 40-42.


Chien, J. T., & Wu, C. C., (2002) Discriminant waveletfaces and nearest feature classifiers for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, pp. 1644-1649.
Zhu, J., Vai, M., & Mak, P., (2003) Face Recognition, a Kernel PCA Approach, Chinese Conference on Medicine and Biology, pp. 81-83.
Huang, R., Pavlovic, V., & Metaxas, D., (2004) A hybrid face recognition method using Markov random fields, Proceedings of the 17th International Conference on Pattern Recognition, 3, pp. 157-160.
Travieso, C. M., Alonso, J. B., & Ferrer, M. A., (2004) Facial identification using transformed domain by SVM, Proceedings of the 38th IEEE International Carnahan Conference on Security Technology, pp. 193-196.
Déniz, O., Castrillón, M., & Hernández, M., (2003) Face recognition using independent component analysis and support vector machines, Pattern Recognition Letters, 24(13), pp. 2153-2157.
Cevikalp, H., Neamtu, M., Wilkes, M., & Barkana, A., (2005) Discriminative Common Vectors for Face Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(1), pp. 4-13.
Jain, A., & Huang, J., (2004) Integrating Independent Components and Support Vector Machines for Gender Classification, Proceedings of the 17th International Conference on Pattern Recognition, 3, pp. 558-561.
ORL Database, http://www.uk.research.att.com/facedatabase.html (last visit: 07-31-05)
Yale Database, http://cvc.yale.edu/projects/yalefaces/yalefaces.html (last visit: 07-31-07)

KEY TERMS

Biometric System: A system which identifies persons from physical or behavioural characteristics. These characteristics are intrinsic to the individuals.

Face Detection: The act of detecting a face in a frame or an image.

Face Identification: A system which creates a model from facial features in order to recognize persons.

Independent Component Analysis (ICA): A computational method for separating a multivariate signal into additive subcomponents, assuming the mutual statistical independence of the non-Gaussian source signals.

Multi-Modal System: The use of different biometric systems in order to identify or verify persons.

Non-Cooperative Identification System: A system for identification which does not require the collaboration of a user in order to operate. The information for identification is obtained with the permission of the user.

Supervised Classification: A classification system that generates a model using training samples and then uses that model to evaluate or test other samples.

Transform Domain System: A change from the visible range to another, different range, which transforms the information and provides other properties in this domain.

AR-Face Database, http://cobweb.ecn.purdue.edu/~aleix/aleix_face_DB.html (last visit: 07-31-07)
FERET Database, http://www.itl.nist.gov/iad/humanid/feret/feret_master.html (last visit: 07-31-07)


Nonlinear Techniques for Signals Characterization

Jesús Bernardino Alonso Hernández
University of Las Palmas de Gran Canaria, Spain

Patricia Henríquez Rodríguez
University of Las Palmas de Gran Canaria, Spain

INTRODUCTION

The field of nonlinear signal characterization and nonlinear signal processing has attracted a growing number of researchers in the past three decades. This comes from the fact that linear techniques have some limitations in certain areas of signal processing. Numerous nonlinear techniques have been introduced to complement the classical linear methods and as an alternative when the assumption of linearity is inappropriate. Two of these techniques are higher order statistics (HOS) and nonlinear dynamics theory (chaos). They have been widely applied to time series characterization and analysis in several fields, especially to biomedical signals.

Both HOS and chaos techniques have had a similar evolution. They were first studied around 1900: the method of moments (related to HOS) was developed by Pearson, and in 1890 Henri Poincaré found sensitive dependence on initial conditions (a symptom of chaos) in a particular case of the three-body problem. Both approaches were displaced by linear techniques until around 1960, when Lorenz rediscovered a chaotic system by chance while he was studying the behaviour of air masses. Meanwhile, a group of statisticians at the University of California began to explore the use of HOS techniques again. However, these techniques were largely ignored until around 1980, when Mendel (Mendel, 1991) developed system identification techniques based on HOS, and Ruelle (Ruelle, 1979), Packard (Packard, 1980), Takens (Takens, 1981) and Casdagli (Casdagli, 1989) established the methods to model nonlinear time series through chaos theory. Only recently has the application of HOS and chaos to time series become feasible, thanks to the higher computation capacity of computers and Digital Signal Processing (DSP) technology.

This article presents the state of the art of two nonlinear techniques applied to time series analysis: higher order statistics and chaos theory. Some measurements based on HOS and chaos techniques are described, and the way in which these measurements characterize different behaviours of a signal is analyzed. The application of nonlinear measurements permits a more realistic characterization of signals and is therefore an advance in the development of automatic systems.

BACKGROUND

In digital signal processing, estimators are used in order to characterize signals and systems. These estimators are usually obtained using linear techniques. Their mathematical simplicity and the existence of a unifying linear systems theory make their computation easy. Furthermore, linear processing techniques offer satisfactory performance for a variety of applications. However, linear models and techniques cannot deal with issues such as nonlinearities due to noise or to the production system of the signal, system nonlinearities in digital signal acquisition, transmission and perception, nonlinearities introduced by the processing method, and nonlinear dynamical behaviour. Therefore, the application of linear processing techniques leads to a less realistic characterization of certain systems and signals.

As a result of the shortcomings of linear techniques, analysis procedures are being revised and nonlinear techniques are being applied in computing estimators and models and in signal characterization, to increase the possibilities of digital signal processing.

HOS is a field of statistical signal processing which has become very popular in the last 25 years. To date almost all digital signal processing has been based on second order statistics (autocorrelation function, power spectrum). HOS use extra information which can be used to get better estimates in noisy situations and in the presence of nonlinearities. Chaos (the subject of nonlinear dynamics theory) is long-term unpredictable behaviour in a nonlinear dynamical system caused by sensitive dependence on initial conditions. Therefore, irregularities in a signal can be produced not only by random external input but also by chaotic behaviour.

Both nonlinear techniques have been used in signal characterization, and numerous automatic classification systems have been developed using HOS and chaos features in many fields. Texture classification (Coroyer, Declercq, Duvaut, 1997), seismic event prediction (Van Zyl, 2001), fault diagnosis in machine condition monitoring through vibration signals (Samanta, Al-Balushi & Al-Araimi, 2006), (Wang & Lin, 2003) and economy (Hommes & Manzan, 2006) are some examples. Their application to biomedical signals is especially important. Nonlinear features have proven to be useful in the characterization of voice, electrocardiogram (ECG) and electroencephalogram (EEG) signals. Automatic classification systems that discriminate between pathological and healthy voices have been implemented using nonlinear features (Alonso, de León, Alonso, Ferrer, 2001) (Alonso, Díaz-de-María, Travieso, Ferrer, 2005). Nonlinear characteristics have been used in the detection of electrocardiographic changes through the ECG signal (Ubeyli & Guler, 2004), in the evaluation of neurological diseases using the EEG signal (Guler, Ubeyli & Guler, 2005), (Kannathal, Lim Choo Min, Rajendra Acharya & Sadasivan, 2005) and in the diagnosis of phonocardiograms (Shen, Shen, 1997).

NONLINEAR METHODS: CHAOS THEORY AND HIGHER ORDER STATISTICS APPLIED TO TIME SERIES

Higher Order Statistics

Higher order statistics, known as cumulants, and their Fourier transforms, known as polyspectra, are extensions of second-order measures (such as the autocorrelation function and power spectrum). Some advantages of HOS over second-order statistics are:

1. HOS give amplitude and phase information in the spectral domain, whereas second order statistics only give amplitude information (Mendel, 1991) (Nikias & Petropulu, 1993). Therefore, non-minimum phase signals and certain types of phase coupling (associated with nonlinearities) cannot be correctly identified by second-order statistics.

2. HOS are blind to Gaussian processes whereas correlation is not (Mendel, 1991). Therefore, cumulants can be used in determining Gaussian noise levels in a signal, in separating non-Gaussian signals from Gaussian noise, in harmonic components estimation, or in increasing the signal to noise ratio (SNR) when signals are contaminated with Gaussian noise.

The second-order measures work properly if the signal has a Gaussian probability density function, but many real-life signals are non-Gaussian. Therefore, HOS are a powerful tool to work with non-Gaussian and nonlinear processes. Next, some higher order statistics measurements are shown and their usefulness in characterizing certain nonlinear phenomena is explained.

Third Order Moment: Skewness

Skewness is the standardized third order central moment and a measure of the asymmetry of a probability distribution. This measurement enables us to discriminate among different kinds of data distribution, as its value varies according to the asymmetry of the distribution. The skewness of a Normal distribution is zero (data symmetric about the mean); positive skewness corresponds to a distribution with a longer right tail and negative skewness to a distribution with a longer left tail. In most cases a Normal distribution is assumed, but data points are not usually perfectly symmetric. Skewness reflects positive or negative deviations from the mean and gives a more realistic characterization of a data set.

Fourth Order Moment: Kurtosis

Kurtosis is the standardized fourth order central moment and a measure of whether the data in a probability distribution are peaked or flat relative to a Normal distribution. Kurtosis is a measure of the data concentration about the mean: higher kurtosis means that more of the variance is due to infrequent extreme deviations.
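Both measures can be computed directly from the central moments. The sketch below assumes NumPy and uses the plain (biased) sample moments with the non-excess convention for kurtosis, under which a Normal distribution gives 3; other conventions subtract 3 or apply small-sample corrections. The two tiny data sets are made up for illustration.

```python
import numpy as np

def skewness(x):
    """Third central moment divided by the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

def kurtosis(x):
    """Fourth central moment divided by the squared variance (Normal -> 3)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2

symmetric = [1, 2, 3, 4, 5]        # symmetric about its mean
right_tailed = [1, 1, 1, 1, 10]    # one large value creates a right tail

print(skewness(symmetric), skewness(right_tailed))
print(kurtosis(symmetric), kurtosis(right_tailed))
```

The symmetric set has zero skewness, while the right-tailed set has a positive skewness (about 1.5). The flat, symmetric set also has a kurtosis (about 1.7) well below the Normal reference value of 3.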

Higher Order Cumulants

Higher order moments are natural generalizations of the autocorrelation, while cumulants (Mendel, 1991) are nonlinear combinations of moments. For a zero-mean process, the second order cumulant is the autocorrelation function. Higher order cumulants can be seen as a measure of the Gaussianity of a random process, because cumulants of order higher than two are zero for a Gaussian process.
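The statement that cumulants above second order vanish for a Gaussian process can be checked numerically at zero lag. The sketch below is an illustration only: it assumes NumPy, restricts itself to zero-lag cumulants of a zero-mean signal (c2 = m2, c3 = m3, c4 = m4 - 3 m2^2), and compares simulated Gaussian samples with simulated exponential samples as an arbitrary non-Gaussian example.

```python
import numpy as np

def zero_lag_cumulants(x):
    """Sample 2nd, 3rd and 4th order cumulants at zero lag of a real signal."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()                      # the formulas below assume zero mean
    m2, m3, m4 = (np.mean(d ** k) for k in (2, 3, 4))
    return m2, m3, m4 - 3.0 * m2 ** 2

rng = np.random.default_rng(1)
gaussian = rng.normal(size=200_000)
non_gaussian = rng.exponential(size=200_000)   # skewed, long right tail

c_gauss = zero_lag_cumulants(gaussian)
c_exp = zero_lag_cumulants(non_gaussian)
print("Gaussian    c2, c3, c4:", c_gauss)
print("Exponential c2, c3, c4:", c_exp)
```

For the Gaussian samples only the second order cumulant (the variance) is clearly different from zero, while the exponential samples keep large third and fourth order cumulants. This is the sense in which HOS are blind to Gaussian processes.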

Bispectrum

The bispectrum is the Fourier transform of the third order cumulant. The bispectrum of a stationary zero-mean Gaussian process is equal to zero. The bispectrum of a signal plus Gaussian noise is the same as that of the signal alone, whereas the power spectrum of a signal plus Gaussian noise is very different from the power spectrum of the signal alone. Therefore, through the bispectrum, non-Gaussian signals can be separated from Gaussian noise and signal-to-noise ratios can be improved. In addition, quadratic phase coupling can be detected and non-minimum phase systems can be identified with the bispectrum.

Bicoherence

Closely related to the bispectrum is the third-order coherence measure, the bicoherence, which is the normalized bispectrum. Bicoherence is bounded between 0 and 1 and is used to detect quadratic phase coupling due to second order nonlinearities. Phase coupling among the frequency components ω1, ω2 and their sum ω1 + ω2 exists if the bicoherence has a value equal to one for the pair of frequencies (ω1, ω2).
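A direct (FFT-based) estimate makes the role of the bicoherence concrete. The sketch below is one possible estimator under stated assumptions, not the only definition in use: it assumes NumPy, averages the triple product X(k1) X(k2) X*(k1 + k2) over independent segments, and normalizes it so that the result lies between 0 and 1. The test signals (three sinusoids placed exactly on FFT bins, with and without coupled phases) and the function names are hypothetical.

```python
import numpy as np

def bicoherence(segments, k1, k2):
    """Bicoherence at FFT bins (k1, k2), averaged over segments (one per row)."""
    X = np.fft.fft(segments, axis=1)
    triple = X[:, k1] * X[:, k2] * np.conj(X[:, k1 + k2])
    bispectrum = triple.mean()                 # direct bispectrum estimate
    norm = np.sqrt(np.mean(np.abs(X[:, k1] * X[:, k2]) ** 2) *
                   np.mean(np.abs(X[:, k1 + k2]) ** 2))
    return np.abs(bispectrum) / norm

def make_segments(coupled, n_seg=64, n=128, k1=10, k2=17, seed=0):
    """Sinusoids at bins k1, k2 and k1+k2; phases are redrawn for every segment."""
    rng = np.random.default_rng(seed)
    p1, p2, p3 = rng.uniform(0, 2 * np.pi, size=(3, n_seg, 1))
    if coupled:
        p3 = p1 + p2      # quadratic phase coupling of the sum component
    w = 2 * np.pi * np.arange(n) / n
    x = np.cos(k1 * w + p1) + np.cos(k2 * w + p2) + np.cos((k1 + k2) * w + p3)
    return x + 0.1 * rng.normal(size=x.shape)  # small additive Gaussian noise

b_coupled = bicoherence(make_segments(coupled=True), 10, 17)
b_uncoupled = bicoherence(make_segments(coupled=False), 10, 17)
print(b_coupled, b_uncoupled)
```

Both signals have practically the same power spectrum, since the three sinusoids have the same amplitudes in each case. Only the bicoherence separates them: it is close to one when the phase of the sum component is tied to the other two, and it is small when the three phases are independent.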

Chaos Theory

Chaos theory helps us to understand and interpret the observations from complex deterministic dynamical systems, and it can be used to predict and control time series (Kantz & Schreiber, 1997). Until the appearance of chaos theory, all irregular behaviour was interpreted as stochastic behaviour and therefore as unpredictable. Thanks to chaos theory we know that this is not necessarily true. For example, both stochastic and chaotic systems have rich broadband power spectra and varying phase spectra, so the spectrum alone cannot tell them apart. Chaos theory is therefore a powerful tool for distinguishing between stochastic and chaotic systems.

A deterministic dynamical system describes the time evolution of a system in some phase space $\Gamma \subseteq \mathbb{R}^m$ (an m-dimensional vector space), where a state is specified by a vector $\mathbf{x} \in \mathbb{R}^m$. This evolution can be expressed by ordinary differential equations (Kantz & Schreiber, 1997):

$$\frac{d}{dt}\mathbf{x}(t) = f(t, \mathbf{x}(t)), \quad t \in \mathbb{R}$$

or in discrete time $t = n\Delta t$ by maps:

$$\mathbf{x}_{n+1} = F(\mathbf{x}_n), \quad n \in \mathbb{Z}$$

A sequence of points ($\mathbf{x}_n$ or $\mathbf{x}(t)$) that solves the equations of the system is called a trajectory. The initial conditions are $\mathbf{x}_0$ or $\mathbf{x}(0)$, respectively. The region of the phase space to which all trajectories originating from a range of initial conditions converge after a transient time is called the attractor. An example of a chaotic attractor from the Colpitts oscillator (Kennedy, 1994) is illustrated in Figure 1.

Most of the time we need to characterize nonlinear systems for which equations and models are unknown. However, some measurements of the system are known.
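A discrete-time map is the easiest setting in which to see a trajectory and its sensitivity to the initial condition. The sketch below is an illustration only: it assumes NumPy and swaps in the logistic map x_{n+1} = 4 x_n (1 - x_n), a standard one-dimensional chaotic example, in place of the Colpitts oscillator mentioned in the text. The helper `iterate_map` and the chosen initial values are hypothetical.

```python
import numpy as np

def iterate_map(F, x0, n_steps):
    """Return the trajectory x_0, x_1, ..., x_n of the map x_{n+1} = F(x_n)."""
    traj = np.empty(n_steps + 1)
    traj[0] = x0
    for n in range(n_steps):
        traj[n + 1] = F(traj[n])
    return traj

def logistic(x, r=4.0):
    """Logistic map, chaotic for r = 4."""
    return r * x * (1.0 - x)

# Two trajectories whose initial conditions differ by only 1e-10
a = iterate_map(logistic, 0.3, 100)
b = iterate_map(logistic, 0.3 + 1e-10, 100)
separation = np.abs(a - b)
print(separation[[0, 10, 20, 30, 40, 50]])
```

The separation grows roughly geometrically until it reaches the size of the attractor itself, after which the two trajectories are effectively unrelated. This is the long-term unpredictability of a fully deterministic system.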

Figure 1. Attractor from Colpitts oscillator


There exist some techniques to obtain the phase space and the attractor from the output signal (embedding techniques). Thus, certain quantities such as Lyapunov exponents, correlation dimension and KolmogorovSinai entropy are obtained from the attractor. These quantities provide measurements of the nonlinearity degree of the system. These measurements are invariant under smooth transformations and thus independent of the embedding procedure.

Embedding Techniques

Takens' embedding theorem (Takens, 1981) states that an embedding exists if the dimension m of the reconstructed phase space satisfies m ≥ 2D + 1, where D is the attractor dimension. There are two main methods to reconstruct the attractor from a time series: the method of delays (Kantz & Schreiber, 1997) and principal component analysis (Broomhead & King, 1986). The former is the most popular: a delay reconstruction in m dimensions is formed by the vectors s_n given by (Kantz & Schreiber, 1997)

s_n = [s(n), s(n − T), ..., s(n − (m − 1)T)]

where s(n) is the measured scalar signal, m is the embedding dimension of the reconstructed phase space and T is the time delay. Takens' theorem is strictly an existence theorem and does not suggest how to find the embedding dimension (m) and the time delay (T). The first zero of the autocorrelation function, or the lag at which it decays to 1/e, has been suggested as a first-order estimator of T. The first minimum of the mutual information function (Fraser & Swinney, 1986) is another estimator of T that takes nonlinear correlations into account. The false neighbours method (Kennel, Brown & Abarbanel, 1992) and the false strands method (Kennel & Abarbanel, 2002) have been proposed to estimate the embedding dimension (m). The latter is an improvement of the false neighbours method.

Chaotic Measurements

The following paragraphs describe some chaotic measurements.

Lyapunov Exponents

Lyapunov exponents characterize the rate of separation of two points in phase space that are initially a small distance apart. There are as many Lyapunov exponents as dimensions (m) of the phase space. The maximal Lyapunov exponent (MLE) is the largest one and determines the predictability of a dynamical system. A positive MLE means divergence of nearby trajectories, i.e. chaos. For a mathematical description we refer the reader to Kantz & Schreiber (1997). Several algorithms to compute Lyapunov exponents from a time series have been implemented (Wolf et al., 1985; Rosenstein, Collins & De Luca, 1993; Kantz, 1994; Sprott, 2003). The MLE is useful to characterize different kinds of behaviour in a signal or system: a negative MLE indicates a stable fixed point (a dissipative or nonconservative system), a positive MLE indicates irregular (chaotic) behaviour, a zero MLE indicates a conservative system (such as a harmonic oscillator) and an infinite MLE indicates noise.

Kolmogorov-Sinai Entropy

Kolmogorov-Sinai (KS) entropy quantifies the loss of information as a system evolves, and it is another measurement related to the unpredictability of a system. In a regular and predictable system H_KS = 0, i.e. nearby points remain closely grouped in another small region of phase space and there is no change in information. In a random process H_KS = ∞, because all regions of phase space become possible after a short time. In chaotic systems 0 < H_KS < ∞, which indicates that nearby points in phase space diverge exponentially. According to its KS entropy value, a system can therefore be characterized as regular, chaotic or noise.
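As an illustration of the method of delays described under Embedding Techniques, the following is a minimal sketch in Python with NumPy. The function names `autocorr_delay` and `delay_embed` are hypothetical, the 1/e criterion is only the first-order estimator of T mentioned in the text, and the embedding dimension m = 3 is an arbitrary choice for the example rather than the output of a false neighbours test.

```python
import numpy as np


def autocorr_delay(signal):
    """First lag at which the autocorrelation falls below 1/e (estimator of T)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    # Full autocorrelation; keep non-negative lags only.
    acf = np.correlate(x, x, mode="full")[x.size - 1:]
    if acf[0] == 0:          # constant signal: no meaningful delay
        return 1
    acf = acf / acf[0]       # normalise so that acf[0] == 1
    below = np.flatnonzero(acf < 1.0 / np.e)
    return int(below[0]) if below.size else 1


def delay_embed(signal, m, T):
    """Delay vectors s_n = [s(n), s(n - T), ..., s(n - (m - 1)T)], one per row."""
    x = np.asarray(signal, dtype=float)
    start = (m - 1) * T      # first index n with a complete history
    if start >= x.size:
        raise ValueError("time series too short for this m and T")
    n = np.arange(start, x.size)
    # Column k holds s(n - k*T).
    return np.stack([x[n - k * T] for k in range(m)], axis=1)


# Example: a sine wave with a period of 100 samples (assumed test signal).
t = np.arange(1000)
signal = np.sin(2 * np.pi * t / 100)
T = autocorr_delay(signal)
vectors = delay_embed(signal, m=3, T=T)
print(T, vectors.shape)
```

Each row of `vectors` is one point of the reconstructed phase space; measurements such as the correlation dimension are then computed on these points rather than on the raw signal.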

Correlation Dimension

Correlation dimension (Grassberger & Procaccia, 1983) quantifies the complexity of the reconstructed attractor. It is a geometric measurement of sensitive dependence on initial conditions, because in chaotic motion the attractor usually shows a very complicated, fractal geometry. In a chaotic deterministic system the correlation dimension converges to a finite value, whereas in a random process it does not converge. The Takens-Theiler estimator (Theiler, 1988) is a maximum likelihood estimator that yields optimal values of the correlation dimension. Correlation dimension allows us to distinguish a random process from chaotic motion. A non-integer (fractal) value of the correlation dimension is usually a symptom of chaos, whereas an integer value is a symptom of regular behaviour. Furthermore, the correlation dimension is an estimate of the number of degrees of freedom of a system.
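The correlation dimension can be illustrated with a short sketch of the correlation sum C(r), the fraction of point pairs closer than r, whose log-log slope over small r approximates the dimension. This is a simplified reading of the Grassberger-Procaccia idea, not the Takens-Theiler estimator; the function names and the choice of radii are assumptions made for the example, and the brute-force distance matrix is only suitable for a few thousand points.

```python
import numpy as np


def pair_distances(points):
    """Euclidean distances between all distinct pairs of points (each pair once)."""
    pts = np.asarray(points, dtype=float)
    if pts.ndim == 1:
        pts = pts[:, None]   # treat a 1-D series as points on a line
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(pts), k=1)
    return dist[i, j]


def correlation_sum(distances, r):
    """C(r): fraction of pairs whose distance is smaller than r."""
    return float(np.mean(distances < r))


def correlation_dimension(points, radii):
    """Least-squares slope of log C(r) against log r over the given radii."""
    distances = pair_distances(points)
    radii = np.asarray(radii, dtype=float)
    c = np.array([correlation_sum(distances, r) for r in radii])
    keep = c > 0             # log(0) is undefined; drop empty radii
    slope, _intercept = np.polyfit(np.log(radii[keep]), np.log(c[keep]), 1)
    return float(slope)


# Points spread evenly along a line segment should give a slope close to 1.
line = np.linspace(0.0, 1.0, 500)
print(correlation_dimension(line, np.logspace(-2, -1, 10)))
```

In practice the slope is read from a scaling region of the log-log curve, and the points would be the delay vectors of the reconstructed attractor rather than a synthetic line.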

FUTURE TRENDS

In automatic recognition systems it is necessary to characterize data sequences and objects (voice, sounds, faces, hands, etc.) in order to obtain a well-described feature space. Discriminating features later lead to a successful classification process. However, finding such features is not always easy. Nonlinear techniques are novel resources for characterizing time series and for overcoming certain limitations of linear techniques. Proof of this is the development of several automatic classification systems that use nonlinear features, such as (Alonso, de León, Alonso & Ferrer, 2001), (Alonso, Díaz-de-María, Travieso & Ferrer, 2005), (Ubeyli & Guler, 2004) and (Guler, Ubeyli & Guler, 2005).

CONCLUSION

In this article we have presented the state of the art in two recent nonlinear techniques: higher order statistics and chaos theory. The main point is that many real-life signals cannot be adequately modelled by linear approximation alone. Recently, the development of packages to compute chaotic measures (TISEAN package; Hegger, Kantz & Schreiber, 1999) and HOS measures (HOSA toolbox for Matlab) has made the application of these techniques to data sets feasible. Thanks to these techniques it is now possible to extract characteristics that linear analysis ignores. The use of nonlinear techniques therefore leads to a more realistic characterization of signals and systems. These approaches provide new tools for signal analysis and characterization, and they are a preliminary step towards more accurate and powerful automatic pattern recognition systems, such as voice and facial recognition.

REFERENCES

Alonso, J. B., de León, J., Alonso, I. & Ferrer, M. A. (2001). Automatic detection of pathologies in the voice by HOS based parameters. EURASIP Journal on Applied Signal Processing, 1, 275-284.
Alonso, J. B., Díaz-de-María, F., Travieso, C. M. & Ferrer, M. A. (2005). Using nonlinear features for voice disorder detection. 3rd International Conference on Nonlinear Speech Processing, 94-106.
Broomhead, D. & King, G. (1986). Extracting qualitative dynamics from experimental data. Physica D, 20, 217–236.
Casdagli, M. (1989). Nonlinear prediction of chaotic time series. Physica D, 35, 335.
Coroyer, C., Declercq, D. & Duvaut, P. (1997). Texture classification using third order correlation tools. IEEE Signal Processing Workshop on Higher-Order Statistics (SPW-HOS'97), p. 0171.
Fraser, A. M. & Swinney, H. L. (1986). Independent coordinates for strange attractors from mutual information. Phys. Rev. A, 33, 1134-1140.
Grassberger, P. & Procaccia, I. (1983). Measuring the strangeness of strange attractors. Physica D, 9, 189.
Guler, N. F., Ubeyli, E. D. & Guler, I. (2005). Recurrent neural networks employing Lyapunov exponents for EEG signals classification. Expert Systems with Applications, 29, 506–514.
Harbourne, R. T. & Stergiou, N. (2003). Nonlinear analysis of the development of sitting postural control. Wiley Periodicals, Inc.
Hegger, R., Kantz, H. & Schreiber, T. (1999). Practical implementation of nonlinear time series methods: The TISEAN package. Chaos, 9, 413.
Hommes, C. H. & Manzan, S. (2006). Testing for nonlinear structure and chaos in economic time series. Tinbergen Institute Discussion Papers No. 2006-030/1.
Kannathal, N., Lim Choo Min, Rajendra Acharya U. & Sadasivan, P. K. (2005). Entropies for detection of epilepsy in EEG. Computer Methods and Programs in Biomedicine, 80, 187–194.
Kantz, H. (1994). A robust method to estimate the maximal Lyapunov exponent of a time series. Physics Letters A, 185, 77-87.
Kantz, H. & Schreiber, T. (1997). Nonlinear Time Series Analysis. Cambridge Nonlinear Science Series 7.
Kennedy, M. P. (1994). Chaos in the Colpitts oscillator. IEEE Trans. Circ. Syst., 41, 771-774.
Kennel, M. & Abarbanel, H. (2002). False neighbors and false strands: A reliable minimum embedding dimension algorithm. Phys. Rev. E, 66.
Kennel, M., Brown, R. & Abarbanel, H. (1992). Determining embedding dimension for phase space reconstruction using the method of false nearest neighbors. Phys. Rev. A, 45, 3403–3411.
Logan, D. & Mathew, J. (1996). Using the correlation dimension for vibration fault diagnosis of rolling element bearing – 2. Selection of experimental parameters. Mechanical Systems and Signal Processing, 10, 251-264.
Mendel, J. M. (1991). Tutorial on higher-order statistics (spectra) in signal processing and system theory: theoretical results and some applications. Proceedings of the IEEE, 79, 278-305.
Nikias, C. L. & Petropulu, A. P. (1993). Higher-Order Spectra Analysis. New Jersey: PTR Prentice Hall.
Packard, N. H., Crutchfield, J. P., Farmer, J. D. & Shaw, R. S. (1980). Geometry from a time series. Phys. Rev. Lett., 45 (9), 712-716.
Rosenstein, M. T., Collins, J. J. & De Luca, C. J. (1993). A practical method for calculating largest Lyapunov exponents from small data sets. Physica D, 65, 117.
Ruelle, D. (1979). Sensitive dependence on initial condition and turbulent behaviour of dynamical systems. Annals of the New York Academy of Sciences, 316 (1), 408-416.
Samanta, B., Al-Balushi, K. R. & Al-Araimi, S. A. (2006). Artificial neural networks and genetic algorithm for bearing fault detection. Soft Computing, (10), 264-271.
Shen, M. & Shen, F. (1997). Time-varying third-order cumulant spectra and its application to the analysis and diagnosis of phonocardiogram. IEEE Signal Processing Workshop on Higher-Order Statistics (SPW-HOS'97), p. 0024.
Sprott, J. C. (2003). Chaos and Time-Series Analysis. Oxford, UK: Oxford University Press.
Takens, F. (1981). Detecting strange attractors in turbulence. Lecture notes in mathematics. Dynamical systems and turbulence (898), 366. Springer, Berlin.
Theiler, J. (1988). Lacunarity in a best estimator of fractal dimension. Phys. Lett. A, 135, 195.
Tufillaro, N., Abbott, T. & Reilly, J. (1992). An Experimental Approach to Nonlinear Dynamics and Chaos. Reading, MA: Addison-Wesley.
Ubeyli, E. D. & Guler, I. (2004). Detection of electrocardiographic changes in partial epileptic patients using Lyapunov exponents with multilayer perceptron neural networks. Engineering Applications of Artificial Intelligence, 6 (17), 567–576.
Van Zyl, J. (2001). Modelling Chaotic Systems with Neural Networks: Application to Seismic Event Predicting in Gold Mines. Thesis.
Wang, W. J. & Lin, R. M. (2003). The application of pseudo-phase portrait in machine condition monitoring. Journal of Sound and Vibration, 1 (259), 1-16.
Wolf, A., Swift, J. B., Swinney, H. L. & Vastano, J. A. (1985). Determining Lyapunov exponents from a time series. Physica D, 16, 285-317.

KEY TERMS

Attractor: A region in the phase space to which all trajectories converge after a transition time. It is the long term behaviour of a dynamical system.

Bicoherence: A normalised version of the bispectrum. The bicoherence takes values bounded between 0 and 1, which makes it a convenient measure for quantifying the phase coupling in a signal.

Chaos: Long-term unpredictable behaviour caused by sensitive dependence on initial conditions.

Cumulants: The kth order cumulant is a function of the moments of orders up to and including k.

HOS: Higher order statistics is a field of statistical signal processing that uses more information than autocorrelation functions and spectrum. It uses moments, cumulants and polyspectra. They can be used to get better estimates of parameters in noisy situations, or to detect nonlinearities in the signals.

Kolmogorov-Sinai Entropy: Measurement of information loss per unit of time in phase space.

Lyapunov Exponents: Quantities that characterize the rate of separation of infinitesimally close trajectories in a dynamical system. The maximal Lyapunov exponent (MLE) determines the predictability of a dynamical system. A positive MLE means a chaotic system.

Polyspectra: The Fourier transforms of cumulants. The second order polyspectrum is the power spectrum. Most HOS work on polyspectra focusses attention on the bispectrum and the trispectrum.

Reconstructed Phase Space: Phase space obtained from a time series through embedding techniques such as principal component analysis or the method of delays.


Ontologies and Processing Patterns for Microarrays

Mónica Miguélez Rico
University of A Coruña, Spain

José Antonio Seoane Fernández
University of A Coruña, Spain

Julián Dorado de la Calle
University of A Coruña, Spain

INTRODUCTION

Researchers now have a new tool for tackling biomedical problems: microarrays. These devices make it possible to study and acquire information about many genes at the same time in a single experiment, and they have many potential applications, such as mutation detection or microorganism identification. Some of the problems that arise when working with this type of technology are the large volume of data and the complex technical nomenclature involved. These facts imply the need to use several standards and ontologies when performing this type of experiment.

BACKGROUND

Microarrays have been a key element in the biotechnological revolution of recent years; however, new problems in both data handling and statistical analysis have arisen because of the vast volume of information and the structure of the data used. The main concern lies in the vast amount of data to be stored, processed and analysed. Besides, as microarrays are a new technique, most of the methods, protocols and standards are still being defined.

When dealing with such an amount of unstructured information, it is unlikely that the descriptors of the stored concepts, or their units, will be the same in the different databases where the information is accessed. In order to support the vocabulary unification task, ontologies (Chandrasekaran, 1999) enable a hierarchical definition of concepts for framing the schemas of the accessed databases. There are well-established and widely used ontologies, such as the UMLS medical vocabulary (UMLS, 2006), which holds information about symptoms and illnesses, or the GO (Gene Ontology) genomic ontology (Gene Ontology, 2006), which holds information about the function and the expression location of the different human genes.

Once the use of ontologies has been established, they are also quite useful for searching for hidden relationships among data. Queries written in SQL-type (Structured Query Language) (Beaulieu, 2005) query languages may be posed against an ontology and translated into the query language of each underlying database. In this way, by means of the ontology, it can be determined that fever is a symptom and which illnesses present fever as a symptom.

There are currently special data formats in medical science, such as the DICOM standard (Oosterwijk, 2001), for the storage and transfer of the increasing amount of medical images produced by new imaging modalities. Nevertheless, typical biomedical images, such as microarrays or DNA gels, are not currently considered in DICOM, although their integration is foreseeable in future revisions, as clinical tests based on these techniques might be increasingly used in routine medical practice. At the moment, however, the management of this type of image is quite delicate.
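Returning to the fever example above, the following toy sketch shows the kind of question an ontology can answer. It is an illustrative assumption only: the triples, relation names and helper functions are invented for this example and are not taken from UMLS or any other real vocabulary.

```python
# Toy knowledge base as (subject, relation, object) triples; hypothetical data.
TRIPLES = [
    ("fever", "is_a", "symptom"),
    ("cough", "is_a", "symptom"),
    ("influenza", "has_symptom", "fever"),
    ("influenza", "has_symptom", "cough"),
    ("malaria", "has_symptom", "fever"),
]


def is_symptom(concept, triples=TRIPLES):
    """True if the ontology classifies the concept as a symptom."""
    return (concept, "is_a", "symptom") in triples


def illnesses_with(symptom, triples=TRIPLES):
    """Illnesses that present the given symptom, in alphabetical order."""
    return sorted(s for s, rel, o in triples
                  if rel == "has_symptom" and o == symptom)


print(is_symptom("fever"))
print(illnesses_with("fever"))
```

A real system would translate such questions into queries against each underlying database instead of scanning an in-memory list.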

MAIN FOCUS OF THE CHAPTER

This paper describes the most important standards and ontologies for working with microarray experiments; it also addresses how some of these ontologies and standards can be integrated into an information system for managing microarrays.

The first standardisation initiatives appeared in 1998. They were more or less isolated initiatives in which three standardisation areas could be distinguished: hardware, fixed material, and procedures for the analysis and storage of study information. Several organisations, such as the MGED Normalization Working Group (MGED Data, 2006), were created to standardise the information. The MGED (Microarray Gene Expression Data) Society is an international organisation devoted to the standardisation and exchange of information related to microarray experiments. Other organisations worth mentioning are the OMG (Object Management Group) (OMG, 2006) and the UCL/HGNC (Human Gene Nomenclature) (HGNC, 2006).

As far as terminologies, vocabularies, nomenclatures and ontologies are concerned, two should be highlighted: the MGED Ontology (MGED OWG, 2006), which describes the experiments and the gene expression data, and the GO (Gene Ontology) (Gene Ontology, 2006), which provides controlled vocabularies for describing the molecular function, the biological process and the cellular components of gene products. The UCL/HGNC (Human Gene Nomenclature) (HGNC, 2006), the TaO (TAMBIS Ontology) (TaO, 2006), RiboWeb (RiboWeb, 2001) and EcoCyc (EcoCyc, 2005) should also be mentioned.

Figure 1. MGED ontology

Regarding data exchange standards in the microarray field, the MicroArray and Gene Expression Markup Language (MAGE-ML) (MAGE-ML, 2006) is a language designed for describing and communicating information about microarray experiments. Other data exchange standards are the Bioinformatics Sequence Markup Language (BSML) (BSML, 2006), the Gene Expression Markup Language (GeneXML) (NCGR, 2006) and the Genome Annotation Markup Elements (GAME) (Bioxml, 2006).

The MGED Group is the standardisation organisation with the widest scope in the microarray field; in November 2000 it presented the MIAME (Minimum Information About a Microarray Experiment) standard (MIAME, 2006). This standard describes the minimal information about a microarray experiment that should be stored in a database (hereafter, DD.BB.) used as a public repository, or that should be stored to enable the unambiguous interpretation of the experimental results and the repetition of the experiments.

Once the information to be stored has been defined (MIAME), an object model (UML) is needed to describe not only how the data of these experiments should be expressed, but also the mechanisms for their exchange, always bearing in mind the MIAME guidelines. This is precisely what the MAGE-OM (MicroArray Gene Expression Object Model) (MAGE-OM, 2006) standard defines. This object model has been designed to be independent of the chosen implementation, so it can be used as a map for data structures on platforms such as Java, Perl or C++. The model has also been translated into a set of relational tables divided into packages, according to the natural separation of the gene expression data into cases and objects.

At this point, with the standards already described, both the microarray experiment data to be stored and their object model are defined. A language for data exchange is therefore needed: MAGE-ML (MicroArray Gene Expression Markup Language) (MAGE-ML, 2006). It is an XML (XML, 2006) formal language derived directly from the MAGE-OM object model. This language has been designed for describing and communicating this type of information, and it can be used to describe microarray-related items such as array designs, manufacturing information or the structure of experiments.

A tool named MAGE-stk (MAGE Software Toolkit) has been developed in order to simplify the use of the MAGE-OM standard. This tool is an Open Source collection of packages that implement the MAGE object model (MAGE-OM) in several programming languages. It makes reading MAGE-ML easier, simplifies writing MAGE-ML from MAGE-OM, and provides methods for the full maintenance and updating of MAGE-OM.

Once the standards needed for working with microarray technology have been defined, the following step is the description and use of several ontologies that enable, as mentioned before, the unification of the different vocabularies used.
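The relationship described above between an object model and an XML exchange language derived from it can be sketched in a few lines of Python. This is not MAGE-OM or MAGE-ML: the classes `Experiment` and `Hybridization` and every element and attribute name below are hypothetical stand-ins chosen for the example, and the real standards define a much richer schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hybridization:          # hypothetical class, not a MAGE-OM class
    name: str
    sample: str


@dataclass
class Experiment:             # hypothetical class, not a MAGE-OM class
    identifier: str
    description: str
    hybridizations: List[Hybridization] = field(default_factory=list)


def to_xml(exp):
    """Serialise the object model to an XML string (made-up element names)."""
    root = ET.Element("Experiment", {"identifier": exp.identifier})
    ET.SubElement(root, "Description").text = exp.description
    hybs = ET.SubElement(root, "Hybridizations")
    for h in exp.hybridizations:
        ET.SubElement(hybs, "Hybridization", {"name": h.name, "sample": h.sample})
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Rebuild the object model from the XML produced by to_xml."""
    root = ET.fromstring(text)
    hybs = [Hybridization(h.get("name"), h.get("sample"))
            for h in root.iter("Hybridization")]
    return Experiment(root.get("identifier"), root.findtext("Description"), hybs)


exp = Experiment("EXP-1", "Heat shock time course",
                 [Hybridization("hyb-1", "control"),
                  Hybridization("hyb-2", "treated")])
document = to_xml(exp)
print(document)
```

Because the XML is generated from the object model, two laboratories that share the model can exchange documents and rebuild the same objects, which is the role a toolkit such as MAGE-stk plays for the real standards.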
The MGED Ontology (MO) is one of the most important ontologies when working with microarrays and, particularly, when using some of the standards mentioned above. The main goal of this ontology is to provide standard terms for annotating microarray experiments; such terms not only serve to structure queries about the elements of the experiments, but can also be used to describe unambiguously how the experiments were carried out.

As the ontology-encoded terms will eventually be placed in MAGE-ML documents, the efforts of MAGE and of the ontology working group should be coordinated where they overlap, so that the ontology classes and the MAGE classes have the same names and relationships. The ontology has been conceived to grow continuously and thus meet the need for descriptive terms related to emerging applications of microarrays. At the same time, the ontology used in software development should be stable, in order to avoid constantly revising programs to track changes in vocabularies and relationships. Both objectives are achieved by establishing the core MGED ontology, a nucleus of the MGED ontology that will remain constant. The extended MGED ontology is a second ontology layer that contains all the additional terms that might be considered (see Figure 1). The core MGED ontology has been developed for working with the MAGE 1.0 schema, and it is restricted to MAGE-OM v1.1. The extended MGED ontology augments the core with terms that are outside the scope of MAGE v1.1.

The Gene Ontology (GO) is another ontology that should be considered when working with microarrays. The Gene Ontology Project is a collaborative effort to meet the need for consistent descriptors of gene products in different DD.BB. The project started in 1998 as a collaboration among three model organism DD.BB.: FlyBase, the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD). Since then, the GO consortium has grown and now includes many more DD.BB., among them some of the world's biggest repositories for plant, animal and microbial genomes. The GO project has developed three controlled and structured ontologies/vocabularies: biological processes, molecular functions and cellular components.
In this way, a given gene can be located in one or more cellular components, the biological processes in which it is active can be checked, and the molecular functions it performs in those processes can be visualised. For instance, the 'cytochrome c' gene can be described by the molecular function term 'oxidoreductase activity', by the biological process terms 'oxidative phosphorylation' and 'induction of cell death', and by the cellular component terms 'mitochondrial matrix' and 'mitochondrial membrane'.
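The cytochrome c example can be expressed as a small annotation index organised by the three GO vocabularies. This is a sketch under stated assumptions: the class `AnnotationIndex` and its methods are hypothetical, and the terms are stored as plain text labels rather than real GO identifiers.

```python
# The three controlled vocabularies named in the text.
ASPECTS = ("molecular_function", "biological_process", "cellular_component")


class AnnotationIndex:
    """Hypothetical in-memory index of gene annotations, grouped by aspect."""

    def __init__(self):
        self._by_gene = {}

    def annotate(self, gene, aspect, term):
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect}")
        entry = self._by_gene.setdefault(gene, {a: set() for a in ASPECTS})
        entry[aspect].add(term)

    def terms(self, gene, aspect):
        """Terms attached to a gene for one aspect, in alphabetical order."""
        entry = self._by_gene.get(gene, {})
        return sorted(entry.get(aspect, ()))

    def genes_with(self, aspect, term):
        """Genes annotated with the given term, in alphabetical order."""
        return sorted(g for g, entry in self._by_gene.items()
                      if term in entry[aspect])


index = AnnotationIndex()
index.annotate("cytochrome c", "molecular_function", "oxidoreductase activity")
index.annotate("cytochrome c", "biological_process", "oxidative phosphorylation")
index.annotate("cytochrome c", "biological_process", "induction of cell death")
index.annotate("cytochrome c", "cellular_component", "mitochondrial matrix")
index.annotate("cytochrome c", "cellular_component", "mitochondrial membrane")

print(index.terms("cytochrome c", "biological_process"))
print(index.genes_with("cellular_component", "mitochondrial matrix"))
```

Keeping the three aspects separate mirrors the structure of GO and lets the same gene be queried by where it is located, what it does, or which process it takes part in.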



ESSENTIAL CHARACTERISTICS OF AN INFORMATION SYSTEM FOR MICROARRAYS MANAGEMENT

This type of system needs a data integration architecture that makes it easy to store the vast amount of information generated by microarray experiments. To achieve this, the architecture should provide users with assistants and contextual support for handling the information. In addition, a Web architecture will enable the information to be accessed and managed over the Internet from any place at any time.

For the ontology information to be always up to date and available to users, the architecture should provide integrated access to several ontology servers. To achieve this, it is advisable to use Web Services where they exist, as in the case of the Gene Ontology (GO) or the Biological Imaging Methods (FBbi) ontology; alternatively, direct Internet access should be used, as for the MGED Ontology. Besides, so that users can enter data and consult the stored information more easily, the system should have an interface that shows a list of ontology terms and values; this list would enable ontology queries that include all the meanings of a given concept.

As the proposed system has to support the exchange of information among different researchers, the architecture should use the existing standards for data storage (MIAME) and for information exchange (MAGE-OM and MAGE-ML). In the first case, the system should implement a DD.BB. whose fields fulfil the MIAME standard; in the second case, the system should use the MAGE-OM object model so that users can generate the MAGE-ML information exchange file whenever they require it.

Lastly, it is also advisable that users can continue using the existing applications they are used to, which have been developed by experts on the subject, usually in the R language. For that reason, the system should make such applications available to users. To achieve this, an approach based on the use of Web services by the architecture is proposed. This architecture is currently being developed by the RNASA/IMEDIR lab group of the University of A Coruña.


CONCLUSION

Several tools currently enable the analysis of microarray images; however, as each is designed for a specific array type, they offer few options, they must be installed on the user's machine, and their installation is restricted to a few operating systems. Regarding data processing, several projects include packages for microarray processing tasks such as normalisation or clustering; however, some of these packages require the individual processing tools they contain to be downloaded before they can be used. Lastly, there are several public DD.BB. that store the information of this type of experiment by means of Web forms. There are also some stand-alone tools that store the data in a DD.BB. created on the user's machine, which therefore needs a DD.BB. manager installed. Nevertheless, no system has been found that performs all these steps without installing software or leaving the system.

New systems in this area should allow data to be stored in a MIAME-compliant DD.BB., with the option of performing the image analysis of the different microarray experiments and keeping the analysis results in the system DD.BB. The systems should also provide several types of processing using the R language, in order to perform data analysis and draw the subsequent conclusions about the experiment. The data model of the system should use the MAGE-OM standard and then offer the resulting experiment MAGE-ML file to the user.

ACKNOWLEDGMENT

This work was partially supported by the Spanish Ministry of Education and Culture (Ref TIN2006-13274) and the European Regional Development Funds (ERDF), grant (Ref. PIO61524) funded by the Carlos III Health Institute, grant (Ref. PGIDIT 05 SIN 10501PR) from the General Directorate of Research of the Xunta de Galicia and grants (File 2006/60, 2007/127 and 2007/144) from the General Directorate of Scientific and Technologic Promotion of the Galician University System of the Xunta de Galicia.


REFERENCES

Beaulieu, A. (2005). Learning SQL. O'Reilly.
Bioxml. (2006). Retrieved July 2006, from http://www.bioxml.org/Projects/game/
BSML. Bioinformatics Sequence Markup Language. (2006). Retrieved June 2006, from http://www.bsml.org/
Chandrasekaran, B., Josephson, J. R. & Benjamins, V. R. (1999). What are ontologies, and why do we need them? IEEE Intelligent Systems, 14 (1), 20–26.
EcoCyc. (2005). Retrieved February 2006, from http://ecocyc.org/
Gene Ontology. (2006). Retrieved September 2006, from http://www.geneontology.org/
HGNC. HUGO Gene Nomenclature Committee. (2006). Retrieved March 2006, from http://www.gene.ucl.ac.uk/nomenclature/
MAGE-ML. (2006). Retrieved May 2006, from http://www.mged.org/Workgroups/MAGE/mageml.html
MAGE-OM. (2006). Retrieved May 2006, from http://www.mged.org/Workgroups/MAGE/mageom.html
MGED Data Transformation and Normalization Working Group. (2006). Retrieved June 2006, from http://genome-www5.stanford.edu/mged/normalization.html
MGED OWG. The MGED Ontology is an experimental Ontology. Retrieved August 2006, from http://mged.sourceforge.net/ontologies/OntologyWorkshopMGED6.ppt
MIAME. (2006). Retrieved July 2006, from http://www.mged.org/Workgroups/MIAME/miame.html
NCGR. (2006). Retrieved June 2006, from http://www.ncgr.org/genex/
OMG. Object Management Group. (2006). Retrieved March 2006, from http://www.omg.org/
Oosterwijk, H. (2001). DICOM Básico. 2ª Edición. OTech.
RiboWeb. (2001). Retrieved February 2006, from http://riboweb.stanford.edu/
TaO: TAMBIS Ontology. (2006). Retrieved February 2006, from http://imgproj.cs.man.ac.uk/tambis/
UMLS. Unified Medical Language System. (2006). Retrieved February 2006, from http://www.nlm.nih.gov/research/umls/
XML. Extensible Markup Language. (2006). Retrieved August 2006, from http://www.w3.org/XML/

KEY TERMS

MAGE-ML: Microarray Gene Expression Markup Language. Formal language designed for describing and communicating information about microarray-based experiments.

MAGE-OM: MicroArray Gene Expression Object Model. Standard that defines the object model for gene expression experiments.

MAGE-stk: MAGE Software Toolkit. Open Source collection of packages that implement the MAGE object model (MAGE-OM) in several programming languages.

MIAME: Minimum Information About a Microarray Experiment. Standard that specifies the minimal information needed for microarray experiments.

MicroArrays: A technology using a high-density array of nucleic acids, protein, or tissue for simultaneously examining complex biological interactions which are identified by specific location on a slide array. A scanning microscope detects the bound, labelled sample and measures the visualized probe to ascertain the activity of the genes of interest in genotyping, cellular studies, and expression analysis.

Ontology: In computer science, this term refers to the attempt to formulate an exhaustive and rigorous conceptual schema for a given domain, with the aim of making communication and information sharing among systems easier.

R: Language and programming environment for graphic and statistical analysis.


Ontologies for Education and Learning Design

Manuel Lama
University of Santiago de Compostela, Spain

Eduardo Sánchez
University of Santiago de Compostela, Spain

INTRODUCTION

In recent years, the growth of the Internet has opened the door to new ways of learning and new education methodologies. Furthermore, the appearance of different tools and applications has increased the need for interoperable and reusable learning contents, teaching resources and educational tools (Wiley, 2000). Driven by this new environment, several metadata specifications describing learning resources, such as IEEE LOM (LTCS, 2002) or Dublin Core (DCMI, 2004), and learning design processes (Rawlings et al., 2002) have appeared. In this context, the term learning design is used to describe the method that enables learners to achieve learning objectives after a set of activities has been carried out using the resources of an environment. Among the proposed specifications, IMS Learning Design (IMS, 2003) has emerged as the de facto standard that facilitates the representation of any learning design, whatever pedagogical technique it is based on.

The metadata specifications are useful for describing educational resources in a way that favours interoperability and reuse between learning software platforms. However, the majority of the metadata standards focus only on determining the vocabulary used to represent the different aspects of the learning process, while the meaning of the metadata elements is usually described in natural language. Although such a description is easy for the learning participants to understand, it is not appropriate for software programs designed to process the metadata. To solve this issue, ontologies (Gómez-Pérez, Fernández-López, and Corcho, 2004) can be used to describe formally and explicitly the structure and meaning of the metadata elements; that is, an ontology would semantically describe the metadata concepts. Furthermore, both metadata and ontologies emphasize that their descriptions must be shared (or standardized) within a given community.

In this paper, we present a short review of the main ontologies developed in recent years in the field of education, focusing on the use that their authors have made of them. As we will show, ontologies solve issues related to the inconsistencies of natural language descriptions and to the consensus needed for managing the semantics of a given specification.

ONTOLOGIES IN EDUCATION

In the educational domain a number of ontologies have been developed by different authors. Ontologies have been developed to describe the learning contents of technical documents and to formalize the semantics of learning objects; to model the elements required for the design, analysis, and evaluation of the interaction between learners in computer supported cooperative learning; and to describe the learning design associated with a unit of learning, in which the learning flow is explicitly declared.

Ontologies in Learning Contents and Metadata
The main purpose of these ontologies is to describe the contents or features of documents in order to facilitate their indexing and retrieval by applications. Kabel, Wielinga, and de Hoog (1999) develop three ontologies that annotate technical documents from a given domain: these documents are converted into a large collection of information elements described by a number of attributes, whose values are taken from the ontologies. These attributes refer to the subject matter in the application domain, to structural and representational properties (paragraphs, sections, etc.), and to the potential instructional roles of the information elements. Following this approach, the ontologies represent the semantics of the documents, enabling their indexing in and retrieval from databases.

Another interesting ontology in this field is proposed by Brase, Painter and Nejdl (2004). Using an ontology language such as TRIPLE (Sintek & Decker, 2002), this ontology describes the semantics of the LOM specification, adding formal axioms and rules to the metadata representation of the standard. This formal description does not change the semantics of the LOM specification; rather, it helps to define the constraints on LOM fields and makes their meaning and use clear, which eases the exchange of LOM metadata between different applications and contexts.

Ontologies in Collaborative Learning Environments
These ontologies are used to model the interaction between the learning actors (typically teachers and students) in collaborative environments. Inaba et al. (2001) present a collaborative learning ontology that facilitates the design, analysis, and evaluation of a collaborative learning session. This ontology describes the concepts of several well-established learning theories, defining the semantics of the learning goal concept and connecting it with the theories, which are organized in a taxonomy. The ontology has been used to help users design and execute the instructional process in a collaborative environment (Barros, Verdejo, Read, & Mizoguchi, 2002).

Ontologies in Learning Design
These ontologies focus on the semantic description of learning design models, which define the learning flow of the activities to be carried out by teachers and students. The ontologies developed in this field are based on the IMS Learning Design (IMS LD) specification, which has become a de facto standard for defining learning designs. This specification has: (1) a well-founded conceptual model that declares the vocabulary and the functional relations between the concepts of the learning design; (2) an information model that describes in an informal (natural language) way the semantics of every concept and relation introduced in the conceptual model; and (3) a behavioural model that specifies the constraints imposed on the software system when a given learning design is executed at runtime. In other words, the behavioural model defines the semantics of the IMS LD specification during the execution phase. Figure 1 depicts the main concepts of the IMS LD specification.

Figure 1. Main concepts of the IMS Learning Design specification (Amorim et al., 2006)

Knight, Gasevic and Richards (2006) present a general framework whose purpose is to bridge the gap between learning designs and the learning objects used in them. To achieve this, the framework considers the development of three ontologies that describe the learning design, the learning objects, and the context in which these objects are used. LOCO is the ontology, defined in the OWL language (Dean & Schreiber, 2004), that deals with the description of learning designs. It represents the semantics specified in IMS LD and, particularly, in its conceptual model, which means that LOCO integrates the concepts and relations defined in the conceptual and information models of the IMS LD standard, but the semantics expressed in natural language is not included in the ontology. To deal with this issue, Amorim, Lama, Sánchez, Riera and Vila (2006) propose an ontology, also based on IMS LD, that incorporates all its semantics by adding a number of axioms to the conceptual model. These axioms are extracted from the information model, where they are expressed as natural language restrictions on the values of the concept attributes (Table 1). This ontology therefore does not modify the IMS LD specification, but it incorporates all the semantics so that software programs can work directly from the representation in the ontology. With this formal specification, the ontology, which is developed in F-Logic (Kifer, Lausen, & Wu, 1995) and OWL, has been used to validate the consistency of units of learning defined in authoring tools and as a knowledge interchange language between agents in a collaborative environment (Riera et al., 2004).

Table 1. Examples of axioms that constrain the semantics of the IMS LD concepts

Design Axiom 1
IMS LD Specification: Page 38 (item 0.2.2): “The time limit specifies that it is completed when a certain amount of time has passed, relative to the start of the run of the current unit of learning. The time is always counted relative to the time when the run of the unit-of-learning has been started. Authors have to take care that the time limits set on role-parts, acts and plays are logical.”
Explanation: The value of the attribute time limit of a Method must be greater than or equal to the value of the time limit of any Play. That is, the Play(s) cannot finish after the Method.
Formal Description: ∀ m, p, cm, cp  m ∈ Method ∧ p ∈ Play ∧ cm ∈ Complete-Method ∧ cp ∈ Complete-Play ∧ play-ref(p, m) ∧ complete-unit-of-learning-ref(cm, m) ∧ complete-play-ref(cp, p) → time-limit(cm) ≥ time-limit(cp)

Design Axiom 2
IMS LD Specification: Page 90: “The same role can be associated with different activities or environments in different roleparts, and the same activity or environment can be associated with different roles in different roleparts. However, the same role may only be referenced once in the same act.”
Explanation: For the same Act, the Roles involved in the execution of the Act are disjoint.
Formal Description: ∀ a, r, rp  a ∈ Act ∧ r ∈ Role ∧ rp ∈ Role-Part ∧ role-part-ref(rp, a) ∧ role-ref(r, rp) → ¬ ∃ rp1  rp1 ∈ Role-Part ∧ rp1 ≠ rp ∧ role-part-ref(rp1, a) ∧ role-ref(r, rp1)

Runtime Axiom 1
IMS LD Specification: Page 25 (item 0.2.1): “The create-new attribute indicates whether multiple occurrences of this role may be created during runtime. When the attribute has the value “not-allowed” then there is always one and only one instance of the role.”
Explanation: If the value of the attribute create-new is “not-allowed”, there can be only one instance of the Role to which it is applied.
Formal Description: ∀ r  r ∈ Role ∧ create-new(r) = “not-allowed” → ¬ ∃ r1  r1 ∈ r
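The time-limit axiom and the role-part axiom of Table 1 can also be read as executable consistency checks on a unit of learning. The following is only a minimal sketch in Python, not the F-Logic or OWL encoding used by the authors: the classes Method, Play, Act and RolePart and their fields are a hypothetical in-memory model introduced here for illustration, and time limits are assumed to be plain numbers of seconds.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical, simplified stand-ins for IMS LD concepts (illustration only).
@dataclass
class RolePart:
    role: str       # identifier of the referenced Role
    activity: str   # identifier of the referenced activity


@dataclass
class Act:
    role_parts: List[RolePart] = field(default_factory=list)


@dataclass
class Play:
    acts: List[Act] = field(default_factory=list)
    time_limit: Optional[float] = None   # assumed unit: seconds; None = no limit


@dataclass
class Method:
    plays: List[Play] = field(default_factory=list)
    time_limit: Optional[float] = None


def check_time_limits(method: Method) -> List[str]:
    """Time-limit axiom: no Play may have a time limit above that of its Method."""
    errors = []
    if method.time_limit is None:
        return errors
    for i, play in enumerate(method.plays):
        if play.time_limit is not None and play.time_limit > method.time_limit:
            errors.append(f"play {i} finishes after the method")
    return errors


def check_roles_once_per_act(method: Method) -> List[str]:
    """Role-part axiom: within one Act, a Role is referenced by at most one Role-Part."""
    errors = []
    for i, play in enumerate(method.plays):
        for j, act in enumerate(play.acts):
            seen = set()
            for rp in act.role_parts:
                if rp.role in seen:
                    errors.append(f"play {i}, act {j}: role '{rp.role}' referenced twice")
                seen.add(rp.role)
    return errors


# Usage: a design that violates both axioms.
act = Act([RolePart("learner", "read"), RolePart("learner", "discuss")])
method = Method([Play([act], time_limit=7200)], time_limit=3600)
for message in check_time_limits(method) + check_roles_once_per_act(method):
    print(message)
```

Hand-written checks like these cover only the axioms someone has chosen to code. Stating the axioms declaratively in the ontology, as described above, lets a reasoner apply all of them uniformly.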


CONCLUSION
Ontologies in education are usually developed following a metadata standard whose intent is to capture the semantics of a given theory or specification. Most metadata standards have been modelled in the XML-Schema language (Thompson, Beech, Maloney, & Mendelsohn, 2004), which is not expressive enough to describe the semantics (or meaning) associated with the elements defined in the metadata. The main limitations of the XML-Schema language are (Gil & Ratnakar, 2002) that hierarchical relations between two or more concepts cannot be explicitly defined, and that general and formal constraints (or axioms) between concepts, attributes, and relations cannot be specified. To overcome these limitations, the modelling of metadata standards needs to be enriched so that the semantics of their elements is described explicitly and formally. In this way, misinterpretations and errors are avoided when instances of the concepts are created. This is the main purpose of the ontologies developed in the field of education: to favour interoperability between software programs by representing all the semantics of the metadata, not only the concepts and relations expressed in XML-based formats.

ACKNOWLEDGMENT
The authors would like to thank the Xunta de Galicia for its financial support in carrying out this work under the project PGIDIT06SIN20601PR.

REFERENCES
Amorim, R., Lama, M., Sánchez, E., Riera, A., & Vila, X.A. (2006). A learning design ontology based on the IMS specification. Journal of Educational Technology and Society, 9(1), 38-57.
Barros, B., Verdejo, F., Read, T., & Mizoguchi, R. (2002). Applications of a Collaborative Learning Ontology. In C.A. Coello, A. de Albornoz, L.E. Sucar, & O.C. Battistutti (Eds.), Proceedings of the Second Mexican International Conference on Artificial Intelligence (pp. 301-310), Yucatan, Mexico.
Brase, J., & Nejdl, W. (2004). Ontologies and Metadata for eLearning. In S. Staab & R. Studer (Eds.), Handbook on Ontologies (pp. 555-574). Berlin: Springer-Verlag.
Dean, M., & Schreiber, G. (Eds.) (2004). OWL – Web Ontology Language Reference. W3C Recommendation. http://www.w3.org/TR/owl-ref.
Dublin Core Metadata Initiative (2004). Dublin Core Metadata Element Set, Version 1.1. Reference Description. http://dublincore.org/documents/dces.
Gil, Y., & Ratnakar, V. (2002). A Comparison of (Semantic) Markup Languages. In S.M. Haller, & G. Simmons (Eds.), Proceedings of the Fifteenth International FLAIRS Conference (pp. 413-418), Pensacola Beach, Florida.
Gómez-Pérez, A., Fernández-López, M., & Corcho, O. (2004). Ontological Engineering. Berlin: Springer-Verlag.
IEEE Learning Technology Standards Committee (2002). Draft Standard for Learning Object Metadata (LOM). http://ltsc.ieee.org/wg12/files/LOM_1484_12_1_v1_Final_Draft.pdf
IMS Global Learning Consortium (2003). IMS Learning Design Information Model. Version 1.0 Final Specification. http://www.imsglobal.org/learningdesign/ldv1p0/imsld_infov1p0.html
Inaba, A., Tamura, T., Ohkubo, R., Ikeda, M., Mizoguchi, R., & Toyoda, J. (2001). Design and Analysis of Learners Interaction based on Collaborative Learning Ontology. Proceedings of the Second European Conference on Computer-Supported Collaborative Learning (Euro-CSCL’2001) (pp. 308-315).
Kabel, S., Wielinga, B., & de Hoog, R. (1999). Ontologies for indexing Technical Manuals for Instruction. Proceedings of the AIED-Workshop on Ontologies for Intelligent Educational Systems (pp. 44-53), LeMans, France.
Kifer, M., Lausen, G., & Wu, J. (1995). Logical foundations of object oriented and frame based languages. Journal of the ACM, 42, 741-843.
Rawlings, A., Rosmalen, P., Koper, R., Rodríguez-Artacho, M., & Lefrere, P. (2002). Survey of Educational Modelling Languages (EMLs). CEN/ISSS WS/LT Learning Technologies Workshop.
Riera, A., Sánchez, E., Lama, M., Amorim, R., Vila, X., & Barro, S. (2004). Study of Communication in a Multi-Agent System for Collaborative Learning Scenarios. Proceedings of the Twelfth Euromicro Conference on Parallel, Distributed and Network based Processing (pp. 233-240), A Coruña, Spain.
Sintek, M., & Decker, S. (2002). TRIPLE: A Query, Inference, and Transformation Language for the Semantic Web. In I. Horrocks, & J.A. Hendler (Eds.), Proceedings of the International Semantic Web Conference, Sardinia, Italy.
Thompson, H., Beech, D., Maloney, M., & Mendelsohn, N. (2004). XML-Schema Part 1: Structures Second Edition. http://www.w3.org/TR/xmlschema-1
Wiley, D. (2000). Learning Object Design and Sequencing Theory. Department of Instructional Psychology and Technology. Brigham Young University. Doctoral Thesis.



KEY TERMS
Collaborative Learning Environment: Software system that supports collaborative learning experiences, in which two or more agents pursue the goal of constructing knowledge through group discussion and decision-making processes.
Interoperability: Capability to communicate, execute programs, or transfer data among various functional units in a manner that requires the user to have little or no knowledge of the unique characteristics of those units.
Learning Design: Description of a method enabling learners to attain certain learning objectives by performing certain learning activities in a certain order in the context of a certain learning environment. A learning design is based on the pedagogical principles of the designer and on specific domain and context variables (e.g., designs for mathematics teaching can differ from designs for language teaching).
Learning Objects: Any reproducible and addressable digital or non-digital resource used to perform learning activities or support activities. Examples are web pages, textbooks, text processors, instruments, etc.
Metadata: Information about data, which can be used to comprehend, use, and manage data.
Ontology: Formal and explicit specification of a shared conceptualization, where conceptualization refers to an abstract model of a concept in the world; formal means that the ontology should be machine readable; explicit means that the type of concepts and the constraints on their use are explicitly defined; and shared reflects the notion that an ontology captures consensual knowledge accepted by a group.
Ontology Language: Formal language based on a logic paradigm that can represent concepts and the constraints between them. The reasoning capabilities of the language depend on the paradigm on which it is based.


Ontology Alignment Overview
José Manuel Vázquez Naya, University of A Coruña, Spain
Marcos Martínez Romero, University of A Coruña, Spain
Javier Pereira Loureiro, University of A Coruña, Spain
Alejandro Pazos Sierra, University of A Coruña, Spain

INTRODUCTION
At present, ontologies are considered to be an appropriate solution to the problem of heterogeneity in data, since ontological methods make it possible to reach a common understanding of concepts in a particular domain. However, utilizing a single ontology is not always possible or advisable, given that different tasks or different points of view usually require different conceptualizations. This can lead to the use of different ontologies, which collectively might contain overlapping and possibly even contradictory information. This, in turn, represents another type of heterogeneity that can result in inefficient processing or misinterpretation of data, information, and knowledge. To address this problem while at the same time ensuring an appropriate level of interoperability between heterogeneous systems, it is necessary to find the correspondences or mappings that exist between the elements of the different ontologies being used. This process is known as ontology alignment.

This article offers an updated overview of ontology alignment, including a detailed explanation of what alignment consists of and how it can be achieved. First, ontologies are defined using a fusion of different interpretations. This is followed by a definition of the concept of ontology alignment and, using a simple example, an illustration of some of the most commonly used alignment techniques. Subsequently, a case is made for the importance of automating the process of ontology alignment, and some of the main alignment systems currently in use are summarized. Finally, in the context of future directions, the advantages of integrating ontology alignment into systems that need to exchange information automatically are discussed.

BACKGROUND
Towards the end of the 20th century and the beginning of the 21st, the term “ontology” (or ontologies) gained usage in computer science to refer to a research area in the subfield of artificial intelligence primarily concerned with the semantics of concepts and with expressive (or interpretive) processes in computer-based communications. In this context, there are many definitions of ontology, and these definitions have evolved over the years. Gruber offered one of the first definitions of ontology in 1993, as follows (Gruber, 1993): “An ontology is an explicit specification of a conceptualization”. Gruber’s definition became the most frequently referenced one in the literature and the working definition for those active in this area. At present, ontologies are viewed as a practical way to conceptualize information that is expressed in electronic format, and they are being used in many applications, including the Semantic Web, e-Commerce, data warehouses, and information integration and retrieval. The basic idea behind these applications is to use ontologies to reach a common level of understanding or comprehension within a particular domain (e.g., a particular industry, medicine, housing, car repair, finances, etc.).



However, certain systems that encompass a large number of components associated with different domains generally require the use of different ontologies. In such cases, using ontologies does not reduce heterogeneity but rather recasts the heterogeneity problem at a different (and higher) level, where the problem becomes one of ontology alignment. Solving it allows a more efficient exchange of information and knowledge derived from different (heterogeneous) databases, knowledge bases, and the ontologies themselves. In this manner, ontology alignment enhances system interoperability.

ONTOLOGY ALIGNMENT
Euzenat et al. defined the problem of ontology alignment in the following manner (Euzenat et al., 2004): “Given two ontologies which describe each a set of discrete entities (which can be classes, properties, rules, predicates, etc.), find the relationships (e.g. equivalence or subsumption) holding between these entities.”

The key issue in ontology alignment is finding which entity in one ontology corresponds (in terms of meaning) to an entity in another ontology (or in several other ontologies). Essentially, ontology alignment can be reduced to defining a similarity measure between entities in different ontologies and selecting the set of correspondences between those entities that have the highest similarity measures. There are different methods to calculate the similarity measures between entities, and collectively these methods are known as ontology alignment techniques. Many of these techniques are derived from other fields (for instance, discrete mathematics, machine learning, database design, and pattern recognition). Consequently, some techniques compare the text strings that describe the entities in the ontologies (terminology-based ontology alignment), while others calculate the similarity measures between entities taking into account the structure of their corresponding ontologies (structural ontology alignment). A complete classification of alignment techniques has been developed by Martínez (Martínez, 2007).

Using a simple example, the following discussion illustrates some of the basic ontology alignment techniques that are currently used. In this example, two simple ontologies are examined, as shown in Figure 1.

Figure 1. An example illustrating the alignment between two ontologies

The ontologies shown in Figure 1 describe various entities in the real world: sets of elements that share certain characteristics, or classes (e.g., Wing, Car, Bus, etc.), instances of classes (individuals) and their relations (e.g., a specific Ferrari F50 belongs to a specific person, Mark), as well as three different types of relationships between individuals (isA, partOf and hasOwner). Each of the ontologies presented in this example has its own set of entities organized according to a specific taxonomy. The two representations differ because they correspond to two different perspectives or points of view, each associated with a different domain. However, some pairs of entities can be identified in these ontologies that share the same or similar semantics. Thus, it is probable that the Plane class in the first ontology and the Aeroplane class in the second refer to the same real-world concept, given that the terms that describe them are synonymous. Table 1 shows some of the pairs of entities of these ontologies among which semantic similarities could exist, as would be revealed once alignment techniques are applied. For each pair, the table indicates the technique applied, along with a description of the technique itself.

Table 1. Some examples of ontology alignment techniques
Correspondence | Technique Used | Description
Thing – Object; Vehicle – Mean of transport | Language-based terminological technique | A support tool such as a dictionary (e.g., WordNet, 2007) is used to discover that both terms are synonymous.
Car – Car; Ferrari F50 – Ferrari F50 | Terminological technique based on text strings | The text strings that describe the entities coincide completely, so it can be assumed that both entities have the same or similar semantics.
Plane – Aeroplane | Terminological technique based on text strings (suffix) | The first term is a suffix of the second, which would indicate that a relationship exists between them.
Winged vehicle – Air mean | Structural technique | In the first ontology, Winged vehicle is a child class of Vehicle and the parent class of Plane. In the second, Air mean is a child class of Mean of transport and the parent class of Aeroplane. Since Vehicle was shown to be equivalent to Mean of transport, and Plane refers to the same concept as Aeroplane, both classes have ascendants and descendants with the same or similar semantics, indicating a semantic relationship between them.

Ontology Alignment Systems
Ontology alignment is intended for use in an automated fashion for two primary reasons: first, it is a time-consuming, tedious, and occasionally difficult task, and, second, its true value is revealed when it is integrated into processes that exchange information automatically. Over the past few years, this has resulted in the emergence of multiple software tools developed by diverse research groups and well-established international organizations, primarily associated with the academic community. These tools, designed to automatically identify the correspondences that may exist between entities of different ontologies, are called ontology alignment systems. A considerable number of such systems have become available, each offering a unique set of advantages, disadvantages, and performance characteristics. Table 2 lists the main ontology alignment systems that are currently available. An ontology alignment system accepts two (or more) ontologies as input and provides, as output, a set of correspondences between their elements. This set of correspondences is referred to as an alignment. The quality of a particular alignment depends on the correctness and completeness of the correspondences it contains. An alignment system typically combines several of the latest alignment techniques with its own methods, with the aim of obtaining the most precise and complete alignment possible.
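The idea of scoring entity pairs and keeping the best ones can be illustrated with a minimal Python sketch applied to the class labels of the example in Table 1. Everything in it is an assumption made for illustration: the synonym sets stand in for a lexical resource such as WordNet, the scores 1.0, 0.9 and 0.8 and the threshold are arbitrary, and correctness and completeness are measured with precision and recall, a common choice that this article does not itself prescribe. It does not reproduce the method of any system listed in Table 2.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical synonym sets standing in for a dictionary lookup.
SYNONYMS: List[FrozenSet[str]] = [
    frozenset({"thing", "object"}),
    frozenset({"vehicle", "mean of transport"}),
]

SOURCE = ["Thing", "Vehicle", "Car", "Plane", "Winged vehicle"]
TARGET = ["Object", "Mean of transport", "Car", "Aeroplane", "Air mean"]

# Reference alignment taken from the class correspondences of Table 1.
REFERENCE = {
    ("Thing", "Object"),
    ("Vehicle", "Mean of transport"),
    ("Car", "Car"),
    ("Plane", "Aeroplane"),
    ("Winged vehicle", "Air mean"),
}


def similarity(a: str, b: str) -> float:
    """Illustrative terminological similarity in [0, 1] (scores are arbitrary)."""
    a, b = a.lower(), b.lower()
    if a == b:                                     # identical text strings
        return 1.0
    if any(a in s and b in s for s in SYNONYMS):   # language-based: synonyms
        return 0.9
    if a.endswith(b) or b.endswith(a):             # one label is a suffix of the other
        return 0.8
    return 0.0


def align(source: List[str], target: List[str],
          threshold: float = 0.5) -> Set[Tuple[str, str]]:
    """For each source entity, keep its best-scoring target if above the threshold."""
    alignment = set()
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        if similarity(s, best) >= threshold:
            alignment.add((s, best))
    return alignment


def precision_recall(found: Set[Tuple[str, str]],
                     reference: Set[Tuple[str, str]]) -> Tuple[float, float]:
    """Precision measures correctness of an alignment; recall measures completeness."""
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


found = align(SOURCE, TARGET)
print(sorted(found))
print(precision_recall(found, REFERENCE))
```

With these labels, the purely terminological measure finds no match for Winged vehicle. That is the kind of correspondence for which Table 1 relies on a structural technique, and it is one reason alignment systems combine several techniques.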

Table 2. Ontology alignment systems
Name | Developed by | References
AnchorPROMPT | Stanford University (USA) | Noy & Musen, 2003
Chimaera | Stanford University (USA) | McGuinness, Fikes, Rice & Wilder, 2000
CMS | School of Electronics and Computer Science & Advanced Knowledge Technologies group (University of Southampton), Hewlett Packard Laboratories (UK) | CMS, 2006; Kalfoglou & Hu, 2005
COMA++/COMA | University of Leipzig (Germany) | COMA, 2006; Aumueller, Do, Massmann & Rahm, 2005; Massmann, Engmann & Rahm, 2006
CtxMatch | University of Trento (Italy) | Zanobini, 2004
GLUE | University of Washington (USA) | Doan, Madhavan, Domingos & Halevy, 2002; Doan, Madhavan, Domingos & Halevy, 2004
Falcon-AO | Southeast University (China) | Jian, Hu, Cheng & Qu, 2005; Hu, Jian, Qu & Wang, 2005; Hu, Zhao & Qu, 2006; Hu, Cheng, Zheng, Zhong & Qu, 2006
FOAM [APFEL, NOM, QOM] | University of Karlsruhe (Germany) | Ehrig & Staab, 2004; Ehrig & Sure, 2005; Ehrig, Staab & Sure, 2005
HCONE-merge | University of Aegean (Greece) | Kotis, Vouros & Padilla, 2004; Kotis, Vouros & Stergiou, 2005; Vouros & Kotis, 2005
H-Match | University of Milan (Italy) | Castano, Ferrara & Montanelli, 2003
LOM | Teknowledge Corporation (Palo Alto, USA) | Li, 2004
MAFRA | Instituto Politecnico do Porto (Portugal) | Maedche, Motik, Silva & Volz, 2002
MapOnto | University of Toronto (Canada), University of Rutgers (USA) | An, Borgida & Mylopoulos, 2005
MetaQuerier | University of Illinois (USA) | Chang, He & Zhang, 2004; Chang, He & Zhang, 2005
MoA | Electronics and Telecommunications Research Institute (Korea) | Jaehong et al., 2005
OLA | INRIA Rhône-Alpes (France), University of Montreal (Canada) | Euzenat, Loup, Touzani & Valtchev, 2004; Euzenat & Valtchev, 2004; Euzenat, Guérin & Valtchev, 2005
OntoBuilder | Technion Israel Institute of Technology (Israel) | Gal, Modica & Jamil, 2004
OntoMerge | Yale University (USA), University of Oregon (USA) | Dou, McDermott & Qi, 2002
Rondo | University of Leipzig (Germany), Microsoft Research (USA) | Melnik, Rahm & Bernstein, 2003
S-Match | University of Trento (Italy) | Giunchiglia, Shvaiko & Yatskevich, 2004
SAMBO | University of Linköping (Sweden) | Lambrix & Tan, 2006

FUTURE TRENDS
At present, there are several ontology alignment systems capable of identifying, with acceptable efficiency, semantic correspondences that may exist between entities associated with different ontologies. However, the true potential of ontology alignment will be realized when this methodology is integrated into processes that require information to be exchanged between different systems fully automatically. This will be achievable when ontology alignment systems become sufficiently powerful to resolve, in real time and with minimal error, alignment problems in specific domains. Once these issues are successfully addressed, it will become possible to attain an appropriate level of interoperability between heterogeneous systems that were previously not exploited jointly, which would represent a milestone in the field of information and communications technologies. Multiple systems of different characteristics and origins would thus be able to communicate with each other, making it possible to reveal new knowledge that would otherwise have remained hidden in disjointed information systems. This would potentially provide human users with a wide range of automated intelligent systems and services capable of interacting with each other without external assistance, which in turn would considerably facilitate one of the most challenging tasks: the automatic, efficient, and reliable exploitation of large quantities of information.

CONCLUSION
In some applications, the use of a single ontology to fully describe an entire domain is not an adequate solution, and it becomes necessary to use different ontologies. In such cases, the need arises to find relationships between the elements of the different ontologies, a process known as ontology alignment. The ontology alignment process can reasonably be automated, which is precisely why it is especially useful in environments or applications that require automatic interoperability between systems. Currently, there are numerous ontology alignment systems available, and most of these are the result of academic or basic research. These systems can be viewed as software tools capable of finding correspondences or relationships that may exist between the elements of different ontologies. These tools can provide rather remarkable results, especially when taking into account the fact that they essentially remain works in progress, still in the initial development or testing phases. In the future, it is expected that ontology alignment systems will reach acceptable levels of robustness, efficiency, and reliability, which would make it possible to apply them to processes that automatically exchange data between systems that each use a different ontology. These automated interactions between systems would not only reduce user intervention but would also automate many time-consuming, complex, and computationally costly tasks that are currently either performed manually or not at all.

ACKNOWLEDGMENT
This work was partially supported by the Spanish Ministry of Education and Culture (Ref TIN2006-13274) and the European Regional Development Funds (ERDF), grant (Ref. PIO52048) funded by the Carlos III Health Institute, grant (Ref. PGIDIT 05 SIN 10501PR) from the General Directorate of Research of the Xunta de Galicia and grant (File 2006/60) from the General Directorate of Scientific and Technologic Promotion of the Galician University System of the Xunta de Galicia. The work of José M. Vázquez is supported by an FPU grant (Ref. AP2005-1415) from the Spanish Ministry of Education and Science.

REFERENCES An, Y., Borgida, A., & Mylopoulos, J. (2005). Constructing Complex Semantic Mappings between XML Data and Ontologies. Proceedings of ISWC’05. Aumueller, D., Do, H.H., Massmann, S., & Rahm, E. (2005). Schema and ontology matching with COMA++. SIGMOD Conference. Castano, S., Ferrara, A., & Montanelli, S. (2003). H-MATCH: an algorithm for dynamically matching ontologies in peer-based systems. Proceedings of the First Workshop on Semantic Web and Databases (SWDB-03), VLDB 03, Berlin, Germany. Chang, C., He, B., & Zhang, Z. (2004). MetaQuerier over the Deep Web: Shallow Integration across Holistic Sources. Proceedings of the VLDB Workshop on 1287

O


Information Integration on the Web (VLDB-IIWeb’04), Toronto, Canada. Chang, C., He, B., & Zhang Z. (2005). Towards Large Scale Integration: Building a MetaQuerier over Databases on the Web. Proceedings of the Second Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, California. COMA Website (2006). URL: http://dbs.uni-leipzig. de/en/Research/coma.html/ Crosi Mapping System Website (2006). URL: http:// www.aktors.org/crosi/deliverables/summary/cms. html/ Doan, A., Madhavan, J., Domingos, P., & Halevy, A. (2002). Learning to map between ontologies on the semantic web. Proceedings of the World-Wide Web Conference, Hawai, USA. Doan, A., Madhavan, J., Domingos, P., & Halevy, A. (2004). Ontology Matching: A Machine Learning Approach. Staab, S. & Studer, R. (eds.). Handbook on Ontologies in Information Systems, Springer-Velag, 397-416. Dou, D., McDermott, D., & Qi, P. (2002). Ontology translation by ontology merging and automated reasoning. Proceedings of the EKAW 2002 Workshop on Ontologies for Multi-Agent Systems. Sigüenza, Spain. Ehrig, M., & Staab, S. (2004). QOM - Quick Ontology Mapping. Proceedings of the Third International Semantic Web Conference, LNCS 3298, 683-697. Springer, Hiroshima, Japan. Ehrig, M., & Sure, Y. (2005). FOAM - Framework for Ontology Alignment and Mapping - Results of the Ontology Alignment Evaluation Initiative. Proceedings of the Workshop on Integrating Ontologies, 156, 72-76. Ehrig, M., Staab, S., & Sure, Y. (2005). Bootstrapping Ontology Alignment Methods with APFEL. Proceedings of the 4th International Semantic Web Conference, ISWC 2005, LNCS 3729, 186-200. Springer. Euzenat, J., Le Bach, T., Barrasa, J., Bouquet, P., De Bo, J., Dieng, R., Ehrig, R., et al. (2004). State of the art on ontology alignment. Deliverable D2.2.3 v1.2. Knowledge Web. URL: http://knowledgeweb.semanticweb.org/ 1288

Euzenat, J., & Valtchev, P. (2004). Similarity-based ontology alignment in OWL-Lite. Proceedings of 16th european conference on artificial intelligence (ECAI), 333-337. Amsterdam, Holland. Euzenat, J., Loup, D., Touzani, M., & Valtchev, P. (2004). Ontology alignment with OLA. Proceedings of 3rd ISWC2004 workshop on Evaluation of Ontologybased tools (EON), 59-68, Hiroshima, Japan. Euzenat, J., Guérin, P., & Valtchev, P. (2005). OLA in the OAEI 2005 alignment contest. Proceedings KCap 2005 workshop on Integrating ontology, 97-102, Banff, Canada. Gal, A., Modica, G. A., & Jamil, H. M. (2004). OntoBuilder: Fully Automatic Extraction and Consolidation of Ontologies from Web Sources. Proceedings of the ICDE 2004. Giunchiglia, F., Shvaiko, P., & Yatskevich, M. (2004). S-Match: An Algorithm and an Implementation of Semantic Matching. Proceedings of ESWS’04. Gruber, T. R. A translation approach to portable ontology specification. (1993). Knowledge Acquisition, 5(2), 199-200. Hu, W., Jian, N., Qu, Y., & Wang, Y. (2005). GMO: A graph matching for ontologies. Proceedings of the KCAP workshop on Integrating Ontologies, 41-48. Hu, W., Zhao, Y., & Qu, Y. (2006). Partition-based block matching of large class hierarchies. Proceedings of the 1st Asian Semantic Web Conference (ASWC’06), 72-83. Hu, W., Cheng, G., Zheng, D., Zhong, X., & Qu, Y. (2006). The Results of Falcon-AO in the OAEI 2006 Campaign. ISWC Ontology matching workshop. Athens, USA. Jaehong, K., Jang, M., Young-Guk, H., Joo-Chan, S. & Jo, S. (2005). MoA: OWL ontology merging and alignment tool for the semantic web. Lecture notes in Computer Science, 3533/2005, 722-731, Springer. Jian, N., Hu, W., Cheng, G., & Qu, Y. (2005). FalconAO: Aligning Ontologies with Falcon. Proceedings of K-Cap 2005 Workshop on Integrating Ontologies, 85-91, Banff, Canada. Kalfoglou, Y., & Hu, B. (2005). CMS: CROSI Mapping System - Results of the 2005 Ontology Alignment


Contest. Proceedings of K-Cap’05 Integrating Ontologies workshop, 77-85, Banff, Canada. Kotis, K., Vouros, G. A., & Padilla, J. (2004). HCOME: tool-supported methodology for collaboratively devising living ontologies. Semantic Web and Databases. Second International Workshop, SWDB. Toronto, Canada. Kotis, K., Vouros, G., & Stergiou, K. (2005). Towards Automatic Merging of Domain Ontologies: The HCONE-merge approach. Elsevier’s Journal of Web Semantics (JWS), 4:1, 60-79. Lambrix, P., & Tan, H. (2006). SAMBO - A System for Aligning and Merging Biomedical Ontologies. Journal of Web Semantics, Special issue on Semantic Web for the Life Sciences, 4(3), 196-206. Li, J. (2004). LOM: A Lexicon-based Ontology Mapping Tool. Proceedings of the Performance Metrics for Intelligent Systems (PerMIS. ‘04). Maedche, A., Motik, B., Silva, N., & Volz, R. (2002). MAFRA - A Mapping Framework for Distributed Ontologies. Proceedings of 13th European Conference on Knowledge Engineering and Knowledge Management (EKAW). Sigüenza, Spain. Martínez, M. (2007). Analysis and comparative study of ontology alignment systems, and development of an ontology alignment system optimized for aligning medical ontologies. Pazos, A., Vázquez, J.M. (dirs.). University of A Coruña. Final project.

Massmann, S., Engmann, D., & Rahm, E. (2006). COMA++: Results for the Ontology Alignment Contest OAEI 2006. International Workshop on Ontology Matching (5th ISWC-2006), Athens, Georgia, USA.
Melnik, S., Rahm, E., & Bernstein, P. A. (2003). Rondo: A Programming Platform for Model Management. Proceedings of ACM SIGMOD 2003, San Diego, USA.
McGuinness, D. L., Fikes, R., Rice, J., & Wilder, S. (2000). An environment for merging and testing large ontologies. Proceedings of 7th Intl. Conf. on Principles of Knowledge Representation and Reasoning (KR2000). Colorado, USA.
Noy, F. N., & Musen, A. M. (2003). The PROMPT Suite: Interactive Tools for Ontology Merging and Mapping. International Journal of Human-Computer Studies, 59/6, 983-1024.
Vouros, G., & Kotis, K. (2005). Extending HCONE-merge by approximating the intended interpretations of concepts iteratively. 2nd European Semantic Web Conference, Heraklion, Crete, Greece.
WordNet. (2007). Cognitive Science Laboratory. Princeton University. URL: http://wordnet.princeton.edu/
Zanobini, S. (2004). Improving ctxmatch by means of grammatical and ontological knowledge - in order to handle attributes. Technical Report 554, Department of Information and Communication Technology, University of Trento, Italy.

KEY TERMS

Class: A set that contains individuals which share certain characteristics. The word concept is sometimes used in place of class. Classes are a concrete representation of concepts.
Individual: An object in the domain that we are interested in. Individuals are also known as instances of classes.
Interoperability: A state or situation through which heterogeneous systems can exchange data and/or processes.
Mapping: A correspondence found during the process of ontology alignment.
Ontology: A formal and explicit specification of a shared conceptualization.
Ontology Alignment: A process that consists of finding the semantic relationships that may exist between different elements in different ontologies.
Ontology Alignment System: A software tool capable of conducting the alignment of ontologies in an automated fashion.
Ontology Mapping: See ontology alignment.
Ontology Matching: See ontology alignment.
Relation: A link between individuals. In the field of ontologies, relations are also known as properties.


Ontology Alignment Techniques

Marcos Martínez Romero, University of A Coruña, Spain
José Manuel Vázquez Naya, University of A Coruña, Spain
Javier Pereira Loureiro, University of A Coruña, Spain
Norberto Ezquerra, Georgia Institute of Technology, USA

INTRODUCTION

A single ontology is sometimes not sufficient to cover the different vocabularies used in the same domain, and it becomes necessary to use several ontologies in order to encompass the entire domain knowledge and its various representations. Disciplines where this occurs include medical science and biology, as well as many of their subfields, such as genetics and epidemiology. The cause may be the domain's complexity, its expansiveness, and/or the different perspectives that different groups of users have of the same domain. In such cases, it is essential to find the relationships that may exist between the elements of the domain's different ontologies, a process known as ontology alignment.

There are several methods for identifying the relationships or correspondences between elements of different ontologies; collectively, these methods are called ontology alignment techniques. Many of them stem from other fields of study (e.g., matching techniques in discrete mathematics), while others have been designed specifically for this purpose. The key to successfully aligning ontologies is the appropriate selection and implementation of the set of ontology alignment techniques best suited to a particular alignment problem.

Ontology alignment is a complex, tedious, and time-consuming task, especially when working with ontologies of considerable size (containing, for instance, thousands of elements or more) that have complex relationships between their elements (for example, a particular problem domain in medicine). Furthermore, the true potential of ontology alignment is realized when different information-exchange processes are integrated automatically, thereby providing the framework for reaching a suitable level of efficient interoperability between heterogeneous systems. Automatic ontology alignment has therefore been a topic of major interest in recent years, and there has been a surge of software tools dedicated to aligning ontologies in either a fully or partially automated fashion. Some of these tools, generally referred to as ontology alignment systems, have been developed at well-known and respected research centers, including Stanford University and Hewlett Packard Laboratories. Updated information on the currently available ontology alignment systems is given in Shvaiko & Euzenat (2007).

Each ontology alignment system combines different alignment approaches with its own techniques, so that correspondences between the ontologies can be detected in the most complete, precise, and efficient manner. Since each system is based on its own approximation techniques, different systems yield different results, and the quality of the results therefore varies among systems. Most alignment systems are oriented to solving problems of a general nature. However, since ontologies associated with a single domain share certain characteristics that set them apart from ontologies associated with other domains, some systems have recently emerged that are designed to align ontologies in a specific domain. An example is the SAMBO alignment system (Lambrix & Tan, 2006) in the biomedical domain. These and other domain-specific systems can produce excellent results when used for the domains for which they were designed, but are generally not useful when applied to other domains.



This article presents a classification of the most commonly used, recently developed alignment techniques, supported by simple examples to illustrate the specific techniques underlying different systems. Future directions in ontology alignment are also examined.

BACKGROUND

The key to ontology alignment is to find those entities in one ontology that may correspond to entities in another ontology. Basically, this can be viewed as computing a similarity measure between elements (or so-called entities) of different ontologies, and subsequently selecting the set of correspondences that produce the strongest measures of similarity. There are, however, different ways to compute similarity measures, and various studies are dedicated to the classification of these techniques (Rahm & Bernstein, 2001; Euzenat & Valtchev, 2004; Euzenat et al., 2004; Shvaiko & Euzenat, 2005). Following these classification schemes (especially that of Euzenat & Valtchev, 2004, and the one in Euzenat et al., 2004), the next section introduces an abbreviated classification of the ontology alignment techniques most commonly utilized by current ontology alignment systems. This condensed classification is centered on the type of element manipulated by the alignment technique and complements the taxonomy proposed by Rahm and Bernstein (2001). For the purpose of clarity and brevity, it summarizes only those alignment techniques that compare a single element in one ontology with a single element in another ontology (known as local alignment techniques, as in Euzenat et al., 2004).
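The two steps described above (score every pair of entities, then keep the strongest correspondences) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure of any particular alignment system: the similarity measure is a stand-in (the ratio computed by the standard library's difflib.SequenceMatcher), and the greedy one-to-one selection, the 0.6 threshold, the function names similarity and align, and the entity names are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; any other measure could be plugged in here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(entities_a, entities_b, threshold=0.6):
    """Greedy one-to-one selection of the strongest correspondences (assumed strategy)."""
    # Score every cross-ontology pair and visit the strongest pairs first.
    scored = sorted(
        ((similarity(a, b), a, b) for a in entities_a for b in entities_b),
        reverse=True,
    )
    used_a, used_b, correspondences = set(), set(), []
    for score, a, b in scored:
        if score < threshold:
            break  # every remaining pair is weaker still
        if a in used_a or b in used_b:
            continue  # each entity takes part in at most one correspondence
        used_a.add(a)
        used_b.add(b)
        correspondences.append((a, b, round(score, 2)))
    return correspondences


# Hypothetical entity names from two small ontologies.
ontology_a = ["Person", "Birth_date", "Weight"]
ontology_b = ["Persons", "Date_of_birth", "Colour"]
result = align(ontology_a, ontology_b)
print(result)
```

With these inputs only Person and Persons are paired. Birth_date and Date_of_birth fall below the threshold although they denote the same attribute, which hints at why a single string-based measure is rarely sufficient on its own.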

ONTOLOGY ALIGNMENT TECHNIQUES

Ontology alignment techniques can be classified as follows (please refer to Figure 1):

1. Terminological techniques. These calculate the similarity between the text strings that describe elements in the ontologies (names, labels, and/or comments). There are two types of terminological

techniques: those based on text strings and those based on language.

1.1. Terminological techniques based on text strings. These compare the structure of text strings, viewed as sequences of characters. They assume that the similarity between two terms increases with the similarity between their text strings, without considering the underlying semantics of the terms. Applied to the terms Apple and Apples, a technique of this type would yield a relatively high similarity measure, whereas applied to Apple and Orange it would yield a lower one, since in the second case the text strings are quite different. Using these techniques in isolation is usually not recommended; they are better used in conjunction with other, more powerful alignment techniques. Their limitations are easily illustrated: it would be erroneous to conclude that the terms Cream and Scream are highly similar because their strings are alike (their meanings are very different), or that the terms Student and Pupil are dissimilar because their strings differ (their meanings are generally the same). Examples of terminological techniques based on text strings are the Hamming distance (Hamming, 1950), which counts the positions at which two text strings of equal length differ; the Levenshtein distance (Levenshtein, 1966), which is the minimum number of operations (insertions, deletions, and/or substitutions) needed to transform one text string into another; and the Jaro distance (Jaro, 1989), which is based on the number and order of the characters common to two text strings.

1.2. Terminological techniques based on language.
These techniques are more complex but more reliable than those previously discussed. They do not treat terms as simple sequences of independent characters; rather, they view terms as groups of meaningful elements (lexemes and morphemes, e.g., prefixes and suffixes). Their main objective is to discover the similarity that may exist between terms associated with one concept, even when those terms are written as very different strings of characters. In other words, these techniques attempt to see past the terminological variations that can affect the terms being compared. They can in turn be classified according to whether they take an intrinsic or an extrinsic approach:

1.2.1. Intrinsic techniques. These are oriented toward detecting the similarity between terms that have undergone morphological and syntactical variations (e.g., Means of transport, Means of transportation, Transportation means), as in the Porter Stemming Algorithm (Porter, 1980).

1.2.2. Extrinsic techniques. These use external linguistic resources, such as dictionaries and thesauri, in order to find the similarity between lexical variations of the same term (e.g., Means of transport and Vehicle). Extrinsic techniques rely on the fact that there is usually an equivalence relationship between synonyms, and a subsumption relationship between a hyponym and its more general term. In this manner, an alignment system based on extrinsic terminological techniques would presumably be capable of detecting, for instance, an equivalence relationship between the terms Leukocyte and White blood cell (since they are synonymous) and a subsumption relationship between Myocyte and Cell

(since a Myocyte is a type of Cell). The external linguistic resources most commonly used by current alignment systems include WordNet (WordNet, 2007), an English-language resource, and UMLS (National Library of Medicine, 2007) in the medical domain. Other extrinsic techniques in use are multilingual techniques, which find relationships between terms written in different languages (such as the Spanish word célula and its English counterpart, cell) using multilingual dictionaries such as EuroWordNet (Vossen, 1997).

Figure 1. Ontology alignment techniques

2. Structural techniques. In addition to comparing the text strings that describe the entities in each ontology, it is frequently useful to compare the internal structure of the entities themselves (internal structure comparison), or the relationships that each entity maintains with other entities (external structure comparison).

2.1. Internal structure comparison techniques. These techniques compare internal characteristics of the entities, such as the range, cardinality, transitivity, and/or symmetry of their properties (attributes and relationships). For instance, if in an ontology A there is an entity Person with three attributes (birth_date of type date, name of type string, and weight of type int), and in another ontology B there is an entity Human_being with two attributes (date_of_birth of type date and first_name of type string), a technique of this type might conclude that there is a certain similarity between these two entities, since the types of two of the attributes coincide. In this concrete case the conclusion would be correct: Person and Human_being refer to the same real-world concept. However, it is easy to find cases in which the technique would produce erroneous results. For instance, if the entity in ontology B were Car with three attributes (registration_date of type date, color of type string, and weight of type int), a comparison of internal structure might suggest that the entities Person and Car are similar, since the ranges (types) of the three attributes coincide, although in reality the two entities have very different semantics. Consequently, given that an ontology frequently contains multiple entities with similar internal characteristics, these techniques tend to be used in conjunction with other techniques (such as terminological techniques). It is probably wise to compare internal structure during the initial alignment stages, in order to filter the pairs of entities that could be related, and subsequently to apply other techniques before finally deciding on the overall level of similarity.

2.2. External structure comparison techniques. These techniques compute the similarity that may exist between entities by considering the position that the entities in question occupy within their respective ontologies. The underlying principle is that, if two entities are similar, then their adjacent (or neighboring) entities are likely to be similar as well.
These techniques tend to treat ontologies as graphs in which each node is an entity of the ontology and each edge is a relationship between entities; algorithms designed to work with graphs are then used to find the relationships between elements of the ontologies. In fact, this problem is equivalent to that of solving a graph homomorphism problem (Garey & Johnson, 1979). One of the better-known techniques for external structure comparison is the one used by the Anchor-PROMPT ontology alignment system (Noy & Musen, 2000), which is based on the idea that if two pairs of entities in the source ontologies are similar and there are paths connecting them, then the elements in those paths are likely to be similar as well.

3. Extensional techniques. These techniques compare the extensions of the classes of the ontologies, in other words, their sets of instances or examples. This is useful when the information about the entities to be compared is limited but additional information about their instances is available; alternatively, these techniques can support other alignment techniques by helping to detect erroneous or misleading correspondences. For instance, if one ontology contains a class Human_being with two instances, John and Mary, and the other ontology contains a class Person with the same instances (John and Mary), then it could be inferred, by comparing the instances of the two classes, that the classes are similar.

4. Semantic techniques. These techniques attempt to align the elements of the ontologies according to their semantic interpretation. The general approach is based on deductive methods that draw on theoretical models, which provide a justification for the results obtained. Examples include techniques based on propositional satisfiability (SAT) and techniques based on Description Logics (DL).

4.1. SAT techniques. Applying SAT techniques to the ontology alignment problem consists of translating the information associated with each pair of terms between which a relationship could exist into a propositional formula. The formula has the form Axioms → rel(element1, element2), where element1 and element2 are the entities in the ontologies that are being examined to determine whether there is a semantic relationship between them, and rel is the candidate relationship between the entities. Subsequently, the validity of this formula is evaluated. The advantage of SAT techniques is that they support an exhaustive analysis of all the possible correspondences, as well as the possibility of selecting only the major correspondences.

4.2. Techniques based on DL. The expressivity of the propositional language used by SAT techniques is limited, as it cannot represent certain types of predicates. Description Logics provide the additional expressivity needed to encode alignment problems as validity problems with greater flexibility. For instance, suppose that one ontology contains the classes City, Worker, and Industrial_city, where an Industrial_city is defined as a City with more than 600,000 Workers, and another ontology contains the classes Big_town, Inhabitant, and Crowded_big_town, where a Crowded_big_town is defined as a Big_town with more than 500,000 Inhabitants. If it is established that all Workers are Inhabitants and that City is equivalent to Big_town, then a DL-based technique could deduce that an Industrial_city is a Crowded_big_town.

FUTURE TRENDS

Current ontology alignment systems take two ontologies as input and, once the alignment process has been executed, yield as output a set of correspondences between their elements. Even with up-to-date alignment techniques, this process is still very time-consuming and computationally expensive, especially when the input ontologies are large. This may not be a problem when the same ontologies are always used, since in such cases the alignment only needs to be performed once and the correspondences found can subsequently be reused. However, there are applications or contexts where it becomes necessary to identify instantly which entity in ontology A corresponds to an entity in ontology B, without previously "knowing" the ontologies. This is the case with the Semantic Web and with the integration of information from sources that were mutually "unknown", and in these cases current ontology alignment techniques are limited. In these types of problems, it is more important to reduce the computational time needed to carry out the alignment, even if the quality of the alignment is somewhat affected. As a result, it is very probable that in the next few years the field of ontology alignment will place a major emphasis on exploring techniques capable of finding correspondences in ever shorter amounts of time.

It is also expected that new techniques will emerge that allow external linguistic resources to be consulted and used more efficiently and powerfully than is now possible. The use of external resources is essential in alignment problems associated with specific domains, but current approaches are not capable of using these resources optimally and thereby waste a significant amount of potentially useful information.


CONCLUSION

Ontology alignment is an important aspect of practically any domain or application area in which several ontologies have to be used together. There are various approaches to finding the semantic correspondences that may exist between elements of different ontologies, known as ontology alignment techniques. This article has presented a condensed classification of the ontology alignment techniques that are most commonly used today. Clearly, not all alignment techniques are equally applicable to every problem. For instance, it is not useful to apply an extensional technique to ontologies that have no instances. Consequently, a number of factors ought to be considered when selecting among alignment techniques for a particular problem. Among these are the domain to which the ontologies belong, the language in which the ontologies are expressed, and the number and type of elements contained in the ontologies. Moreover, even when a particular technique is applicable to a specific alignment problem, it may still produce erroneous correspondences. As a result, it should be stressed that aligning two ontologies is not simply the application of one alignment technique in isolation: rather, the goal is to find the appropriate combination of alignment techniques, such that the strengths of one technique compensate for the weaknesses and limitations of another, with the overarching objective of uncovering an optimal set of correspondences between the ontologies of interest.
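The idea of letting one technique compensate for another can be sketched as a weighted combination of two of the measures classified in this article. The sketch below is an assumption-laden illustration, not a method taken from any cited system: the equal weights, the use of the Jaccard overlap as the extensional measure, difflib.SequenceMatcher as the terminological stand-in, the fallback to names alone when no instances exist, and the instance names Chantilly and Munch_1893 are all hypothetical.

```python
from difflib import SequenceMatcher


def terminological(a: str, b: str) -> float:
    """String-based similarity of two class names (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extensional(instances_a: set, instances_b: set) -> float:
    """Jaccard overlap of two instance sets; 0.0 when neither class has instances."""
    union = instances_a | instances_b
    return len(instances_a & instances_b) / len(union) if union else 0.0


def combined(name_a, inst_a, name_b, inst_b, w_term=0.5, w_ext=0.5):
    """Weighted average of both measures (assumed weights)."""
    if not inst_a and not inst_b:
        # An extensional technique is of no use without instances: use names only.
        return terminological(name_a, name_b)
    return (w_term * terminological(name_a, name_b)
            + w_ext * extensional(inst_a, inst_b))


people = {"John", "Mary"}
# Different names, identical instances: the extensional measure raises the score.
score_same = combined("Human_being", people, "Person", people)
# Similar names, disjoint instances: the extensional measure lowers the score.
score_diff = combined("Cream", {"Chantilly"}, "Scream", {"Munch_1893"})
print(round(score_same, 2), round(score_diff, 2))
```

In this toy setting Human_being and Person end up with a higher combined score than Cream and Scream, the opposite of what the string measure alone would suggest.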

ACKNOWLEDGMENT This work was partially supported by the Spanish Ministry of Education and Culture (Ref TIN2006-13274) and the European Regional Development Funds (ERDF), grant (Ref. PIO52048) funded by the Carlos III Health Institute, grant (Ref. PGIDIT 05 SIN 10501PR) from the General Directorate of Research of the Xunta de Galicia and grant (File 2006/60) from the General Directorate of Scientific and Technologic Promotion of the Galician University System of the Xunta de Galicia. The work of José M. Vázquez is supported by an FPU grant (Ref. AP2005-1415) from the Spanish Ministry of Education and Science.


REFERENCES

Euzenat, J., Loup, D., Touzani, M., & Valtchev, P. (2004). Ontology alignment with OLA. Proceedings of 3rd ISWC2004 workshop on Evaluation of Ontology-based tools (EON), 59-68. Hiroshima, Japan.
Euzenat, J., Le Bach, T., Barrasa, J., Bouquet, P., De Bo, J., Dieng, R., Ehrig, R., et al. (2004). State of the art on ontology alignment. Deliverable D2.2.3 v1.2. Knowledge Web. URL: http://knowledgeweb.semanticweb.org/
Euzenat, J., & Valtchev, P. (2004). Similarity-based ontology alignment in OWL-Lite. Proceedings of 16th European Conference on Artificial Intelligence (ECAI), 333-337. Amsterdam, Holland.
Garey, M., & Johnson, D. (1979). Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman & Co.
Gruber, T. R. (1993). A translation approach to portable ontology specification. Knowledge Acquisition, 5(2), 199-200.
Hamming, R. W. (1950). Error Detecting and Error Correcting Codes. Bell System Technical Journal, 26(2), 147-160.
Jaro, M. A. (1989). Advances in record linking methodology as applied to the 1985 census of Tampa Florida. Journal of the American Statistical Society, 64, 1183-1210.
Lambrix, P., & Tan, H. (2006). SAMBO - A System for Aligning and Merging Biomedical Ontologies. Journal of Web Semantics, Special issue on Semantic Web for the Life Sciences, 4(3), 196-206.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10, 707-710.
Noy, F. N., & Musen, A. M. (2000). Anchor-PROMPT: Using non-local context for semantic matching. Proceedings of the Workshop on Ontologies and Information Sharing at the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-2001). Seattle, USA.
Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program, 14(3), 130-137.
Rahm, E., & Bernstein, P. (2001). A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4), 334-350.
Shvaiko, P., & Euzenat, J. (2005). A Survey of Schema-based Matching Approaches. Journal on Data Semantics (JoDS), IV, LNCS 3730, 146-171.
Shvaiko, P., & Euzenat, J. (2007). Ontology Matching Web. URL: http://www.ontologymatching.org
National Library of Medicine (NLM). (2007). Unified Medical Language System. URL: http://umlsinfo.nlm.nih.gov/
Vossen, P. (1997). EuroWordNet: a multilingual database for information retrieval. Third DELOS workshop - Cross-Language Information Retrieval. European Research Consortium for Informatics and Mathematics, 85-94, Zurich.
WordNet. (2007). Cognitive Science Laboratory. Princeton University. URL: http://wordnet.princeton.edu

KEY TERMS

Domain: Specific areas of interest (e.g., artworks by Picasso) or of knowledge (e.g., medicine, physics, etc.).
Ontology: A formal and explicit specification of a shared conceptualization.
Ontology Alignment: A process that consists of finding the semantic relationships that may exist between different elements in different ontologies.
Ontology Alignment System: A software tool capable of conducting the alignment of ontologies in an automated fashion.
Ontology Alignment Technique: Method used to identify the semantic correspondences that may exist between the elements of different ontologies.
Ontology Entity: An ontology entity represents a conceptual element of the domain of discourse.
Thesaurus: Networked collection of controlled vocabulary terms.


Optimization of the Acoustic Systems

V. Romero-García, Polytechnic University of Valencia, Spain
E. Fuster-Garcia, Polytechnic University of Valencia, Spain
J. V. Sánchez-Pérez, Polytechnic University of Valencia, Spain
L. M. Garcia-Raffi, Polytechnic University of Valencia, Spain
X. Blasco, Polytechnic University of Valencia, Spain
J. M. Herrero, Polytechnic University of Valencia, Spain
J. Sanchis, Polytechnic University of Valencia, Spain

INTRODUCTION

A genetic algorithm is a global search method based on an analogy with natural evolution. Genetic Algorithms have demonstrated good performance on difficult problems where the function to minimize is complicated. In this work we apply this optimization method to improve the acoustical properties of Sonic Crystals (Martínez-Sala et al., 1995; Kushwaha et al., 1994), a kind of structure used in acoustics.

In the last few years, the propagation of acoustic waves in heterogeneous materials whose acoustic properties vary periodically in space has attracted considerable interest. The so-called Sonic Crystals are the typical example of this kind of material in the range of acoustic frequencies. These systems are defined as periodic structures with a strong modulation of the elastic constants between the scatterers and the surrounding material.

Recently, the strategy for enhancing the properties of Sonic Crystals has been based on the use of scatterers with added acoustical properties. The use of local resonators (Liu et al., 2000) or Helmholtz resonators (Hu et al., 2005) as scatterers has produced very good results. Some authors have also built new structures with scatterers made of porous material, improving the attenuation capability of the Sonic Crystals (Umnova et al., 2006). However, the use of Sonic Crystals as outdoor acoustic barriers requires scatterers made of robust and long-lasting materials. For this reason it is interesting to analyze the possibility of optimizing the attenuation capability of Sonic Crystals made with rigid scatterers of materials such as wood, PVC, or aluminium.

The creation of vacancies in a Sonic Crystal improves its attenuation capability (Caballero et al., 2001). However, there is no general rule for creating vacancies in a Sonic Crystal; in fact, similar structures can produce very different acoustic fields behind them. Because of the complexity of the mathematical functions involved in Sonic Crystal calculations, the Genetic Algorithm emerges as a tool especially suited to this kind of problem (Hakanson et al., 2004; Romero-García et al., 2006). This procedure can work together with the Multiple Scattering theory, a self-consistent method for calculating the acoustic pressure that includes all orders of scattering (Chen & Ye, 2001).

Given a starting Sonic Crystal, the Genetic Algorithm generates offspring Quasi Ordered Structures by creating vacancies, and these structures are ranked by a cost function based on the pressure values at a specific point. The pressure scattered by every structure analyzed by the Genetic Algorithm is calculated with a two-dimensional (2D) Multiple Scattering theory. The present work shows an improvement of the Genetic Algorithm based on a parallel implementation; as a consequence, new and better results are obtained in the design of Quasi Ordered Structures made of rigid cylinders that attenuate sound in a predetermined band of frequencies.

SONIC CRYSTALS

Sonic Crystals are arrays of scatterers placed periodically in space whose physical properties differ from those of the surrounding material. In the low frequency range, Sonic Crystals behave as a homogeneous medium with an acoustic impedance greater than that of air, so they can work as refractive devices. Moreover, Sonic Crystals present band gaps, i.e., ranges of sound frequencies for which sound propagation inside the crystal is forbidden. The presence of these band gaps is explained by the well-known Bragg's law. The reflections inside the crystal, and consequently the position of the gaps, depend on the lattice constant, i.e., on the geometry of the Sonic Crystal. The existence, in periodic media, of an absolute band gap where the propagation of sound is forbidden for every incidence direction can have a profound impact on several scientific and technological disciplines, for example, in the design of acoustic filters or acoustic barriers.

Some studies have shown that there are three important parameters for the creation of the spectral gap (Economou & Sigalas, 1994). The first is the density ratio y = ρ_s/ρ_h between the densities of the scattering material and the host material. The second is the filling factor, ff = V_s/V, which gives the volume occupied by the scattering material with respect to the total volume. The last parameter is the topology used to design the Sonic Crystal. It was demonstrated that the density ratio plays an important role in gap creation: Sonic Crystals built with scatterers of high density embedded in a host material of low density create the spectral gap better than other kinds of configurations. Moreover, the optimum value of the filling factor for gap creation has been found to lie between 10% and 50%.

In this work we use a Sonic Crystal built with aluminium cylinders of 2 cm radius as scatterers embedded in air (Network topology). Because this structure presents a high density ratio and its maximum filling factor is ff = 0.36, we ensure that it is well designed for gap creation. We now want to find the filling factor and the spatial distribution of scatterers that give the best acoustical properties. The Genetic Algorithm together with the Multiple Scattering theory (MST) is a good procedure for achieving this objective.

COST FUNCTION AND CHROMOSOME DESCRIPTION

The mechanism used by the Genetic Algorithm in this work is the creation of vacancies in the starting Sonic Crystal. Fig. 1 shows the starting Sonic Crystal and a Quasi Ordered Structure offspring generated by the Genetic Algorithm by creating vacancies. With this procedure we can vary the filling factor and, at the same time, evaluate different spatial configurations. Each Quasi Ordered Structure is considered an individual. The chromosome that represents each Quasi Ordered Structure is a real vector with values in the range [0, 1]. Each coordinate represents the presence or absence of a cylinder at a specific position of the Sonic Crystal (beginning with the cylinder at the top left corner and proceeding by columns to the bottom right corner; see the starting Sonic Crystal in Fig. 1). Values in [0, 0.5) mean there is a vacancy, whereas values in [0.5, 1] mean there is a cylinder.

Figure 1. Starting sonic crystals and a possible quasi ordered structures offspring

In this work we are interested in maximizing the sound attenuation, at a point located behind the crystal, for a predetermined range of frequencies that does not depend on the lattice constant. The acoustic attenuation at a point (x, y) and for an incidence frequency ν is:

where the interfered pressure is determined by the MST. This pressure depends on the position and the radius of the scatterers and on the incidence frequency. Equation (1) shows that, for a point (x, y), an incidence frequency ν and a cylinder radius rl, it is possible to find a configuration of cylinders that minimizes the interfered pressure, that is, maximizes the acoustic attenuation. If we are interested in maximizing the sound attenuation over a predetermined range of frequencies at a point of coordinates (x, y), we have to define a new function to be minimized in order to achieve the maximum acoustic attenuation. To do that, we define our cost function based on the MST

where

represents the mean pressure in the range of frequencies [ν1, νN] and N represents the number of frequencies considered in this range. In our case, we use N = 13. The second term in equation (2) represents the mean deviation. The variable under study is x = (Xcyl, Ycyl), a vector that contains the information about the spatial configuration of the Quasi Ordered Structure.
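The chromosome decoding and the structure of the cost function described above can be illustrated with a minimal Python sketch. It is not the authors' Matlab code. The names `decode_chromosome`, `cost` and `pressure_at` are hypothetical; `pressure_at` stands in for an MST solver, which is not reproduced here. The sketch also assumes that the "mean deviation" is the mean absolute deviation of the pressure magnitudes from their mean, which the text does not state explicitly.

```python
from statistics import mean

def decode_chromosome(chromosome, n_rows, n_cols):
    """Turn genes in [0, 1] into an occupancy grid (True = cylinder present)."""
    if len(chromosome) != n_rows * n_cols:
        raise ValueError("chromosome length must equal n_rows * n_cols")
    grid = [[False] * n_cols for _ in range(n_rows)]
    for k, gene in enumerate(chromosome):
        col, row = divmod(k, n_rows)   # genes run down each column, left to right
        grid[row][col] = gene >= 0.5   # [0, 0.5) -> vacancy, [0.5, 1] -> cylinder
    return grid

def cost(chromosome, n_rows, n_cols, frequencies, pressure_at):
    """Mean pressure magnitude over the band plus its mean deviation.

    pressure_at(grid, nu) is a hypothetical callable standing in for the MST
    computation of the interfered pressure at the measurement point.
    The mean absolute deviation used below is an assumption of this sketch.
    """
    grid = decode_chromosome(chromosome, n_rows, n_cols)
    pressures = [abs(pressure_at(grid, nu)) for nu in frequencies]
    p_mean = mean(pressures)
    deviation = mean(abs(p - p_mean) for p in pressures)
    return p_mean + deviation
```

Penalizing the deviation as well as the mean favours configurations whose pressure is low across the whole band rather than at a single frequency.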

PARALLEL GENETIC ALGORITHM

A Genetic Algorithm is an optimization technique that searches for the solution of an optimization problem by imitating the evolutionary mechanism of species (Goldberg, 1989).

In an optimization problem there is a function to optimize (cost function) and a region in which to search (search space). Every point of the search space has an associated value of the function. The different points of the search space are the different individuals of the population. As in natural genetics, every individual is characterized by a chromosome; in the optimization problem, this chromosome is made up of the coordinates of the point in the search space. The cost function value of an individual is to be understood as that individual's level of adaptation to the environment. The evolutionary mechanism, that is, the set of rules for changing populations throughout generations, is performed by Genetic Operators. A general Genetic Algorithm evolution mechanism could be described as follows. From an initial population (randomly generated), the next generation is obtained as:

1. Some individuals are selected for the next generation. This selection depends on the adaptation level (cost function value): individuals with a better J(x) value are more likely to be selected.
2. To explore the search space, information is exchanged between individuals by crossover, which produces a gene exchange between chromosomes. The rate of individuals subject to crossover is fixed by Pc, the crossover probability.
3. An additional exploration of the search space is performed by mutation. Some individuals are subject to a random variation in their genes. The rate of individuals to be mutated is set by the mutation probability Pm.

Within this general framework, there are several variations in Genetic Algorithm implementation: different gene codifications, different genetic operator implementations, etc. The implementation used in the present work has the following characteristics:

1. Real-value codification: each gene has a real value. The interpretation of the chromosome has been detailed in the previous section.
2. J(x) is not directly used as cost function. A linear ’ranking’ operation is performed (Bäck, 1996). Ranking prevents the algorithm from stagnating, since it keeps clearly dominant individuals from prevailing too soon.
3. Selection is made by the operator known as Stochastic Universal Sampling (SUS) (Baker, 1987).
4. For crossover, the intermediate recombination operator is used (Mühlenbein et al., 1993). The child chromosomes (x’1 and x’2) are obtained from the parent chromosomes (x1 and x2) through the following operation:

   x’1 = α1 · x1 + (1 - α1) · x2 ; x’2 = α2 · x2 + (1 - α2) · x1 ; α1, α2 ∈ [-d, 1+d]

   Generating α1 and α2 for each gene increases the search capabilities, but at a higher computational cost. The implemented Genetic Algorithm has been adjusted as follows: α1 = α2, generated once for each chromosome, d = 0 and Pc = 0.8.
5. Mutation is applied with a probability Pm = 0.1, using a normal distribution with standard deviation set to 20% of the search space range.
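The operator choices listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation. The selection pressure of 2 in the linear ranking, the pairing of parents after a shuffle, the perturbation of every gene of a mutated individual and the clipping of genes to [0, 1] are all assumptions of the sketch. The `evaluate` argument is a hypothetical hook: the built-in `map` is used by default, and a pool's `map` method could be passed instead to spread cost evaluations over several workers.

```python
import random

def linear_ranking(costs, pressure=2.0):
    """Rank-based fitness: the lowest cost receives the highest fitness."""
    n = len(costs)
    order = sorted(range(n), key=lambda i: costs[i], reverse=True)  # worst first
    fitness = [0.0] * n
    for rank, idx in enumerate(order):
        fitness[idx] = 2.0 - pressure + 2.0 * (pressure - 1.0) * rank / (n - 1)
    return fitness

def sus(fitness, n_select, rng):
    """Stochastic Universal Sampling: equally spaced pointers, one random offset."""
    step = sum(fitness) / n_select
    start = rng.uniform(0.0, step)
    selected, cumulative, i = [], fitness[0], 0
    for k in range(n_select):
        pointer = start + k * step
        while cumulative < pointer and i < len(fitness) - 1:
            i += 1
            cumulative += fitness[i]
        selected.append(i)
    return selected

def intermediate_recombination(x1, x2, rng, d=0.0):
    """alpha1 = alpha2, drawn once per chromosome from [-d, 1 + d]."""
    a = rng.uniform(-d, 1.0 + d)
    child1 = [a * g1 + (1.0 - a) * g2 for g1, g2 in zip(x1, x2)]
    child2 = [a * g2 + (1.0 - a) * g1 for g1, g2 in zip(x1, x2)]
    return child1, child2

def mutate(x, rng, pm=0.1, sigma=0.2):
    """With probability pm, add Gaussian noise (20% of the [0, 1] range)."""
    if rng.random() >= pm:
        return list(x)
    # Assumption: every gene is perturbed and clipped back into [0, 1].
    return [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in x]

def next_generation(population, cost, rng, pc=0.8, pm=0.1, evaluate=map):
    """One generation; assumes an even population size of at least 2."""
    costs = list(evaluate(cost, population))   # the expensive, parallelizable step
    fitness = linear_ranking(costs)
    parents = [population[i] for i in sus(fitness, len(population), rng)]
    rng.shuffle(parents)
    children = []
    for x1, x2 in zip(parents[0::2], parents[1::2]):
        if rng.random() < pc:
            x1, x2 = intermediate_recombination(x1, x2, rng)
        children += [mutate(x1, rng, pm), mutate(x2, rng, pm)]
    return children
```

Keeping the cost evaluation behind a single `evaluate` call separates the cheap genetic operators from the expensive evaluations, which is the only part that needs to change if the work is distributed.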

The high computational cost of the Sonic Crystal optimization problem leads to very long execution times: a standard execution (population of 360 individuals, 250 generations) takes around 104 hours. Execution time has been improved with a parallel implementation of the Genetic Algorithm described above. Several alternatives for parallelization are possible (Cantú-Paz, 1995); the one selected is the Master-Slave configuration. In this architecture one processor works as Master, executing the tasks of the Genetic Algorithm (ranking, selection, crossover and mutation), and the rest evaluate the fitness function of a subpopulation (see Fig. 2). The Master sends a subpopulation to each Slave, which evaluates the fitness and returns the results to the Master. The Master works synchronously, waiting for the fitness values from all Slaves. After receiving all fitness values, the Master performs the evolution to produce the next generation (the genetic operators are executed) and sends the new population to the Slaves for fitness evaluation. This type of implementation is the simplest and does not change the Genetic Algorithm operators or behaviour. The time reduction is significant, since the overall time is roughly divided by the number of Slaves. For the problem proposed, with 5 Slaves, the total execution time is reduced to 21 hours. All developments (Genetic Algorithm and Sonic Crystal models) have been made in Matlab®; parallelization has been done using the Matlab Distributed Computing Toolbox and the Matlab Distributed Computing Engine.

Figure 2. Master/slave architecture for parallel genetic algorithm

RESULTS

In this section we present some of our main results. We have analyzed frequency ranges 600 Hz wide centered at several frequencies (800, 1100, 1300, 1700, 2000, 2300, 3090 Hz) above the first Bragg’s peak. Fig. 3 presents the results corresponding to the ranges of frequencies centered at 1700 and 3090 Hz, respectively. The left-hand side of Fig. 3 shows the schemes of cylinders of the Quasi Ordered Structures generated by the design tool described above. The right-hand side shows the acoustic attenuation spectra calculated by the MST for the starting Sonic Crystals (continuous line) and for the optimized Quasi Ordered Structures (dashed line). The creation of attenuation peaks in ranges of frequencies independent of the geometry of the starting Sonic Crystals, using rigid scatterers, has been the goal of this paper. As Fig. 3 shows, the attenuation peak in the spectra of the optimized Quasi Ordered Structures appears in the chosen frequency range, and this peak is absent from the spectra of the starting Sonic Crystals. Notice that, in this frequency range, the acoustic attenuation level of the starting Sonic Crystals is much lower than that of the Quasi Ordered Structures. In some cases the starting Sonic Crystals even produce sound reinforcement. Moreover, the total number of cylinders in the optimized Quasi Ordered Structures is also lower than in the starting Sonic Crystals. In our results the number of cylinders ranges between 36.7% and 60%.

Figure 3. Optimized Quasi Ordered Structures and its spectrum. On the left hand the plot presents the schemes of cylinders of the optimized Quasi Ordered Structures. On the right hand the plots show the acoustic attenuation spectra calculated by the MST for the starting Sonic Crystals (continuous line) and for the optimized Quasi Ordered Structures (dashed line). (a) Optimization corresponding to the central frequency of 1700 Hz. (b) Optimization corresponding to the central frequency of 3000 Hz.



These results constitute a useful tool to design acoustic barriers based on Sonic Crystal with no need for sophisticated scatterers. The technological advantages of using Quasi Ordered Structures with rigid cylinders as scatterers are: high resistance for use outdoors, constructive simplicity and low cost due to the reduction in volume of the crystal.

CONCLUSION

This work shows an important and successful application of a Genetic Algorithm with a parallel implementation. Sonic Crystals open the way to innovative applications in noise reduction in several interesting areas, such as acoustic noise barriers for traffic or general devices for controlling noise. The Genetic Algorithm provides an adequate optimization for such a complex problem, and with the parallel implementation execution times are drastically reduced. Moreover, this method makes it possible to test a wide range of Sonic Crystal adjustments in a reasonable time.

ACKNOWLEDGMENT

The authors acknowledge financial support provided by the Spanish MEC (Project No. MAT2006-03097) and by the Generalitat Valenciana (Spain) under Grant No. GV/2007/191. This work has also been partially supported by MEC (Spanish government) and FEDER funds: projects DPI2005-07835, DPI2004-8383- C0302 and GVA-026.

REFERENCES

T. Bäck. Evolutionary Algorithms in theory and practice. Oxford University Press, New York, (1996).

J.E. Baker. Reducing bias and inefficiency in the selection algorithm. In Proc. Second International Conference on Genetic Algorithms, (1987).

D. Caballero, J. Sánchez-Dehesa, R. Martínez-Sala, C. Rubio, J.V. Sánchez Pérez, L. Sanchis and F. Meseguer. Suzuki phase in two-dimensional sonic crystals. Phys. Rev. B (64), 064303 (2001).

E. Cantú-Paz. A summary of research on parallel genetic algorithms. Technical Report 95007, Illinois Genetic Algorithms Laboratory, IlliGAL, (1995).

Y.Y. Chen and Zhen Ye. Theoretical analysis of acoustic stop bands in two-dimensional periodic scattering arrays. Phys. Rev. E (64), 036616 (2001).

E.N. Economou and M.M. Sigalas. Classical wave propagation in periodic structures: Cermet versus network topology. Phys. Rev. B, (48), 18, 13434, (1993).

D.E. Goldberg. Genetic Algorithms in search, optimization and machine learning. Addison-Wesley, (1989).

A. Hakansson, J. Sánchez-Dehesa and L. Sanchis. Acoustic lens design by genetic algorithms. Phys. Rev. B (70), 214302 (2004).

X. Hu, C.T. Chan, and J. Zi. Two dimensional sonic crystals with Helmholtz resonators. Phys. Rev. E (71), 055601 (2005).

M.S. Kushwaha, P. Halevi, G. Martínez, L. Dobrzynski and B. Djafari-Rouhani. Theory of acoustic band structure of periodic elastic composites. Phys. Rev. B, (49), 4, pp. 2313-2322, (1994).

Z. Liu, X. Zhang, Y. Mao, Y.Y. Zhu, Z. Yang, C.T. Chan, and P. Sheng. Locally resonant sonic materials. Science, (289), 1734, (2000).

R. Martínez-Sala, J. Sancho, J.V. Sánchez Pérez, J. Llinares, F. Meseguer. Sound attenuation by sculpture. Nature (London) (378), 241 (1995).

H. Mühlenbein and D. Schlierkamp-Voosen. Predictive Models for the Breeder Genetic Algorithm I. Continuous Parameter Optimization. Evolutionary Computation, (1), 1, (1993).

V. Romero-García, E. Fuster, L.M. García-Raffi, E.A. Sánchez-Pérez, M. Sopena, J. Llinares, J.V. Sánchez-Pérez. Band gap creation using quasiordered structures based on sonic crystals. Appl. Phys. Lett., (88), 174104-1 to 174104-3, (2006).

M.M. Sigalas, E.N. Economou and M. Kafesaki. Spectral gaps for electromagnetic and scalar waves: Possible explanation for certain differences. Phys. Rev. B, (50), 5, 3393, (1994).

O. Umnova, K. Attenborough, and C.M. Linton. Effects of porous covering on sound attenuation by periodic arrays of cylinders. J. Acoust. Soc. Am. (119), 278 (2006).

KEY TERMS

Acoustic Attenuation Spectrum: Representation of the attenuation contribution of each acoustic frequency to a sound.

Cost Function: Mathematical function to minimize in an optimization problem.

Evolutionary Mechanism: Mechanism guided by biological evolution which represents the rules for changing populations throughout generations.

Filling Factor: Volume fraction occupied by the scattering material. Defined as ff = Vs/V, where V is the total volume of the composite and Vs the volume of the scattering material.

Genetic Algorithm: Global search method based on a simile of natural evolution.

Quasi Ordered Structure: Given a starting Sonic Crystal (see Sonic Crystal), a Quasi Ordered Structure is the configuration of scatterers resulting from the creation of vacancies in the Sonic Crystal.

Search Space: Set of all possible states that the problem we want to solve could ever be in.

Sonic Crystal: Array of scatterers placed periodically in space whose physical properties differ from those of the surrounding material.

Particle Swarm Optimization and Image Analysis


Stefano Cagnoni Università degli Studi di Parma, Italy Monica Mordonini Università degli Studi di Parma, Italy

INTRODUCTION

Particle Swarm Optimization (PSO) is a simple but powerful optimization algorithm, introduced by Kennedy and Eberhart (Kennedy 1995). Its search for function optima is inspired by the behavior of flocks of birds looking for food. Like birds, a set (swarm) of agents (particles) flies over the search space, which coincides with the function domain, looking for the points where the function value is maximum (or minimum). In doing so, each particle’s motion obeys two very simple difference equations which describe the update of the particle’s position and velocity. A particle’s motion has a strong random component (exploration) and is mostly independent of the others’; in fact, the only piece of information shared among all members of the swarm, or of a large neighborhood of each particle, is the point where the best value of the function has been found so far. Therefore, the search behavior of the swarm can be defined as emergent, since no particle is specifically programmed to achieve the final collective behavior or to play a specific role within the swarm, but just to perform a much simpler local task. This chapter introduces the basics of the algorithm and describes the main features which make it particularly efficient in solving a large number of problems, with particular regard to image analysis and to the modifications that must be applied to the basic algorithm in order to exploit its most attractive features in a domain which is different from function optimization.

BACKGROUND

One of the most attractive features of PSO, apart from its effectiveness and robustness with respect to local minima, is certainly its simplicity, which makes it trivial to implement in any programming language. It is also very versatile and applicable to a large number of optimization problems, virtually to any problem defined within a space for which a metric can be defined. However, its behavior, which mainly depends on the values of three constants, is still far from being fully understood. Extensive work (Engelbrecht 2005, Clerc 2006, Poli 2007a) has provided very important insights into the properties of the algorithm by studying the dynamic properties of the swarm, even if under some restrictive assumptions.

The model which underlies PSO describes the motion of a swarm of particles within the domain of a function, usually termed fitness function as for evolutionary algorithms (Eiben 2004, de Jong 2006), seeking its optimum. Such a motion is comparable to the random motion of a set of independent non-interacting particles within a force field generated by two attractors, one of which is specific to each particle. The basic PSO equations for a generic particle P within the swarm are

XP(t) = XP(t-1) + vP(t)    (1)

vP(t) = ω * vP(t-1) + C1 * rand() * [XPbest - XP(t-1)] + C2 * rand() * [Xgbest - XP(t-1)]    (2)

where vP is the velocity of particle P, C1 and C2 are two positive constants, ω is the so-called inertia weight, XP is the position of particle P, XPbest is the best-fitness point reached by P up to time t-1, Xgbest is the best-fitness point found by the whole swarm, and rand() is a random value taken from a uniform distribution in the interval [0,1]. In its motion, the swarm explores the space effectively, usually converging rapidly to the optimum, even if its behavior is strongly dependent on the values of ω, C1, and C2, which must therefore be set very accurately.
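A minimal Python sketch of equations (1) and (2), written here for minimization, may help to make the update rule concrete. The function name `pso_minimize` and the default values of ω, C1 and C2 are assumptions chosen for illustration, not values prescribed by this chapter. Velocities start at zero and positions are not clamped to the bounds, which is a further simplification.

```python
import random

def pso_minimize(f, bounds, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    X = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in X]                  # XPbest for each particle
    pbest_val = [f(x) for x in X]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # Xgbest for the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                # Equation (2): inertia + pull to own best + pull to swarm best
                V[i][d] = (w * V[i][d]
                           + c1 * rng.random() * (pbest[i][d] - X[i][d])
                           + c2 * rng.random() * (gbest[d] - X[i][d]))
                # Equation (1): move the particle
                X[i][d] += V[i][d]
            value = f(X[i])
            if value < pbest_val[i]:
                pbest[i], pbest_val[i] = X[i][:], value
                if value < gbest_val:
                    gbest, gbest_val = X[i][:], value
    return gbest, gbest_val

def sphere(x):
    """Toy fitness function with its minimum at the origin."""
    return sum(v * v for v in x)
```

For example, `pso_minimize(sphere, [(-5.0, 5.0), (-5.0, 5.0)])` searches a two-dimensional domain for the minimum of the toy function.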

PARTICLE SWARM OPTIMIZATION AND IMAGE ANALYSIS

Even if much is still to be learned and discovered about PSO from a theoretical point of view (Kennedy 2007), as regards applications PSO is gaining more and more popularity. As reported in (Poli 2007b), a very recent in-depth review of the field, searching the IEEE Xplore (http://ieeexplore.ieee.org) technical publication database by the keyword PSO returns a list of well over 1,000 titles, about one third of which deal with theoretical aspects. This means that, to date, an incomplete list of PSO application papers adds up to little less than 1,000. Remarkably, about two thirds of them have been published in the last two years.

Image analysis is one of the fields to which PSO is being applied most frequently. As shown by a large number of papers in the image processing and computer vision literature, image analysis problems can often be reformulated as optimization problems, in which an objective function, directly derived from the physical features of the problem, is either maximized or minimized. In most cases, an optimum set of parameters which define the solution is sought using an optimization method. For most real-world problems, usually severely affected by noise or by the natural variability of the instances of the objects which must be detected, this is often inevitable, since methods in which closed-form solutions are directly applied are not usually robust enough with respect to such features. A large number of examples of applications of both traditional and evolutionary optimization methods, including PSO, are reported in the literature.

In this section we will not consider direct applications of PSO as an optimizer of an objective function. We will focus our attention on applications in which PSO is not only a way to ‘tune’ a more general algorithm by adapting it to the specific features of the problem at hand, but is directly part of the solution. We will first introduce some general considerations on image analysis problems, which define the requirements imposed by them. This will allow us to reformulate some typical classes of problems encountered in image analysis, such as object detection and tracking or image segmentation, to include PSO, or some adapted version of its basic formulation, into the solution. We will then briefly show two examples of applications of PSO to segmentation and object detection, in which the above-mentioned considerations have been taken into account.

PSO for Object Detection and Segmentation

In considering the application of PSO to image analysis tasks, one could assume the swarm to fly over the image to detect points or regions of interest. Therefore, the domain of the fitness function becomes the image itself. The fitness value to be assigned to each point can then be defined as a local function of image intensity in a neighborhood of that point, returning high values in points where features similar to the ones which are sought are found. However, more global information must usually be extracted in image analysis tasks. In fact, while the basic PSO algorithm aims at finding a single optimum within the fitness landscape under exploration, in several image analysis applications more than one optimum (multiple objects) is to be found. This situation is typical of object recognition tasks, where the goal is to identify all possible occurrences of an object of interest characterized by a set of specific features. Similarly, in region-based segmentation, several regions with homogeneous features must be accurately located. Such requirements, encountered also in many other application areas, have led to the definition of several variants of PSO, in which particles are subdivided into a predefined number of sub-swarms, based on some clustering technique (Kennedy 2000, Veenhuis 2006, Passaro 2008), or through speciation (Chow 2004, Bird 2006, Leong 2006, Yen 2006), to achieve a dynamical reconfiguration of the swarm and the detection of an arbitrary number of regions of interest within the search space. The velocity update function must also be modified in order to let the swarm spread as uniformly as possible over a whole area of interest featuring high fitness values.
Such modifications may include introducing repulsive forces between particles, to prevent the whole swarm from converging onto the same point, and limiting particles’ mobility inside a region of interest, to keep the swarm compact and in a stable configuration.


We will first show how these ideas can be applied to two common image analysis problems: region segmentation and object detection. Then we will show results obtained in two real-world problems. The first one was proposed as a topic for a competition at GECCO 2006, and consists of detecting and segmenting as precisely as possible large pieces of pasta imaged over a set of noisy backgrounds, over which tiny pasta pieces are also scattered and must be ignored (see Figure 1). The second problem is a sub-task of license plate recognition, in which the region occupied by a license plate is to be located within an image (see Figure 2). Even if the two tasks are semantically different, they share some common lower-level features, which allow the same modifications to basic PSO to be used in both cases, with a two-step approach. In the basic step, the image is explored to focus on regions where interesting features are detected, before a refinement occurs in the subsequent step.

Modified PSO Equations for Image Analysis

In basic PSO, the fitness function is evaluated point by point. When images are analyzed using PSO, the search space being the image, such a local fitness function would make the search extremely sensitive to noise and possibly misleading. If fitness evaluation were just pixel-based, a meaningless isolated pixel yielding high fitness as a result of noise could attract and trap the whole swarm into its neighborhood. To allow PSO to produce a uniform distribution of particles over each region of interest, the basic PSO algorithm can be modified in two directions:

• Forcing division of the swarm into sub-swarms, able to converge towards different regions of interest;
• Favoring dispersion of the particles all over the regions of interest.

Using the so-called K-means PSO (Passaro 2008), in which clusters of particles form based on their proximity within the search space, the former goal can be achieved. To achieve the latter, both the fitness function and the velocity-update equation must be modified. As concerns the fitness function, a local fitness term, which evaluates how “interesting” the neighborhood of a pixel is, can be added to a punctual fitness term, whose value is computed based only on information carried by the pixel under consideration:

fitness(x,y) = punctual_fitness(x,y) + local_fitness(x,y)

The local_fitness term depends on the number of particles with high punctual fitness which are neighbors of the pixel located in (x,y), and is given by:

local_fitness = K0 * number_of_neighbors

where number_of_neighbors is the number of particles within a pre-defined neighborhood of (x,y) and K0 is a constant. This way, the particles are attracted towards the areas where more pixels meet the punctual requirement, keeping away from isolated noisy pixels. This modification enhances the density of particles in the most interesting regions. To cover the whole extension of these regions, the basic PSO velocity-update equation (2) also needs to be modified to:

vP*(t) = vP(t) + repulsionP

The repulsion term can be expressed as

|repulsion(i,j)| = REPULSION_RANGE - |Xi - Xj|

where i and j are the particle indices and REPULSION_RANGE is the maximum distance within which the particles interact. Values of repulsion(i,j) are set to 0 for distances between i and j larger than REPULSION_RANGE. The global repulsion term repulsionP for particle P is the average of all repulsion terms acting on it

repulsionP = ( Σ_{j=1..N} repulsion(P,j) ) / n

N being the number of particles in the swarm and n the number of particles within the neighborhood of P defined by REPULSION_RANGE. Finally, to produce more stable sub-swarms, a particle with high punctual and local fitness is allowed to stand still with a probability which is linearly dependent on the particle density in its neighborhood, estimated as

P{vP(t) = 0} = n/N
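The modified fitness and the repulsion term described above can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements from the chapter: the text gives only the magnitude of repulsion(i,j), so the direction is taken to point away from the neighboring particle; "high punctual fitness" is read as "greater than zero"; a particle lying exactly on the evaluated pixel is not counted as a neighbor; and coincident particles exert no repulsion. All function names are hypothetical.

```python
import math

def local_fitness(x, y, positions, punctual_fitness, k0, radius):
    """K0 times the number of fit particles in the neighborhood of (x, y)."""
    neighbors = 0
    for px, py in positions:
        dist = math.hypot(px - x, py - y)
        # Assumption: "high punctual fitness" means strictly positive.
        if 0.0 < dist <= radius and punctual_fitness(px, py) > 0:
            neighbors += 1
    return k0 * neighbors

def fitness(x, y, positions, punctual_fitness, k0, radius):
    """fitness(x,y) = punctual_fitness(x,y) + local_fitness(x,y)."""
    return punctual_fitness(x, y) + local_fitness(
        x, y, positions, punctual_fitness, k0, radius)

def repulsion(p, positions, repulsion_range):
    """Average repulsion on particle p and the neighbor count n."""
    xi, yi = positions[p]
    rx = ry = 0.0
    n = 0
    for j, (xj, yj) in enumerate(positions):
        if j == p:
            continue
        dist = math.hypot(xi - xj, yi - yj)
        if dist == 0.0 or dist >= repulsion_range:
            continue                      # out of range, or direction undefined
        magnitude = repulsion_range - dist
        # Assumption: the force pushes p directly away from particle j.
        rx += magnitude * (xi - xj) / dist
        ry += magnitude * (yi - yj) / dist
        n += 1
    if n == 0:
        return (0.0, 0.0), 0
    return (rx / n, ry / n), n

def stand_still_probability(n, swarm_size):
    """P{vP(t) = 0} = n / N."""
    return n / swarm_size
```

The vector returned by `repulsion` would be added to the velocity computed by the basic update, and `stand_still_probability` would be compared with a uniform random number to decide whether a well-placed particle keeps its position.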



REAL-WORLD EXAMPLES

Pasta Segmentation

In a color-based region segmentation problem, the fitness function measures the similarity of the pixel color to the expected color of the objects of interest. For pasta, it can be expressed as:

    if (|r(x,y) - g(x,y)| < 30 and r(x,y) - b(x,y) > 60)
    then punctual_fitness = 30 - |r(x,y) - g(x,y)|
    else punctual_fitness = 0

where r(x,y), g(x,y) and b(x,y) are the red, green, and blue values, respectively, of the pixel located in (x,y). Since the goal is to obtain an accurate segmentation, up to pixel precision, and given the large number of pixels belonging to the objects of interest, PSO obviously cannot produce the final solution directly. Instead, it can be used in a pre-processing stage preceding a final thresholding stage which produces the actual output.
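The pasta color rule above translates directly into a small Python function. The function name is hypothetical, and the sketch assumes 8-bit integer channel values passed as separate arguments.

```python
def pasta_punctual_fitness(r, g, b):
    """Punctual fitness of one pixel from its red, green and blue values."""
    # Pasta-like color: red and green close to each other, blue clearly lower.
    if abs(r - g) < 30 and r - b > 60:
        return 30 - abs(r - g)
    return 0
```

A yellowish pixel such as (200, 190, 100) scores 20, while a saturated red pixel such as (200, 100, 100) scores 0 because its red and green values differ too much.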

Following the PSO rules modified as previously described, the particles will tend to move towards larger pasta regions and stay around there. If one performs a number of PSO runs, assigning to each pixel a score which is directly proportional to the number of times a particle walks through it, the probability of belonging to a large pasta piece can be estimated for each pixel. To better estimate such a probability, avoiding bias deriving from the initial particle locations, each run should start with a different random initialization of the whole swarm. Image regions which eventually have a high density of high-score pixels correspond to pieces of pasta. The final result of this stage, which we termed global search, is a preliminary segmentation by which the areas where large pieces of pasta are most likely to be found are roughly detected. To refine the segmentation, an algorithm which is very similar to the one used in the previous stage is applied; this time the domain where the swarm can move is limited to smaller regions surrounding pixel clusters whose score was above a threshold in the last phase of the global search. The final segmentation is eventually obtained by thresholding the locally updated scores to obtain a binary image. Figure 1 shows the results obtained on one of the images from the image set used in the competition.

Figure 1. Pasta segmentation. Top: Original image (left) and results of global search (right). Bottom: Results of local search (left) and final segmentation (right).
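The score accumulation over several runs and the final thresholding described for pasta segmentation can be sketched as below. This is an illustration only: the function names are hypothetical, each run is assumed to be available as a list of (x, y) pixel coordinates visited by the particles, and the score is taken to be the raw visit count.

```python
def accumulate_scores(width, height, runs):
    """Count, for every pixel, how many times a particle walked through it.

    runs is a list of runs; each run is a list of (x, y) pixel visits.
    """
    scores = [[0] * width for _ in range(height)]
    for visits in runs:
        for x, y in visits:
            if 0 <= x < width and 0 <= y < height:   # ignore out-of-image moves
                scores[y][x] += 1
    return scores

def threshold(scores, level):
    """Binary image: 1 where the score reaches the level, 0 elsewhere."""
    return [[1 if s >= level else 0 for s in row] for row in scores]
```

Because each run is meant to start from a different random initialization, pixels that are visited in many runs are unlikely to owe their score to the starting positions alone.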

Plate Detection

In the license plate detection problem, the low-level feature on which detection is based is the density of high values of the horizontal gradient, due to the presence, in the plate, of symbols or symbol elements, which can be encountered when the image is scanned row-wise. Since a color image is available, we can use both color and gradient information, by first considering only those pixels which satisfy the typical features of plates (black characters on a white background for the most recent European standards), and then considering gradient information. The punctual fitness of a pixel is defined as:

    if ( |r(x,y) - g(x,y)| > 30 or |r(x,y) - b(x,y)| > 30 or |g(x,y) - b(x,y)| > 30 )
        punctual_fitness = 0;
    else {
        right_gradient = |intensity(x,y) - intensity(x+1,y)|;
        left_gradient = |intensity(x,y) - intensity(x-1,y)|;
        if (right_gradient > left_gradient) punctual_fitness = right_gradient;
        else punctual_fitness = left_gradient;
    }

The basic PSO step is virtually the same as in pasta segmentation. However, a different algorithm is used, which is also divided into a global and a local exploration stage: after the most promising areas are first located, the exploration is refined to determine whether they actually include a plate.

In the global search, the swarm flies over the image until at least one sub-swarm of size greater than a prefixed threshold (50% of the whole swarm) has formed or a given number of iterations has been reached. Then a local search is performed within regions where sub-swarms of sufficient size have formed, starting from the region occupied by the largest sub-swarm; during this second stage: (i) the search is restricted to smaller image regions of interest enclosing the sub-swarms, (ii) the search is re-initialized by activating a new full-size swarm in the region of interest, and (iii) the search is run for a pre-set number of iterations. At the end of this stage, a new bounding box, containing all particles, is computed. If this box has an aspect ratio compatible with a license plate, the plate is considered to have been found. Otherwise, the swarm is expanded along its two dimensions, by forcing low-fitness particles to move only horizontally or vertically, in order to reach higher-fitness points and, possibly, to let the bounding box reach the expected aspect ratio; in case of failure, the current region is discarded and the next area detected during the global search is explored. Figure 2 shows the original image, along with the results of the global and local search, and the final result of the PSO-based algorithm.

The algorithm is computationally very efficient. Detecting the plate requires fewer function evaluations than computing the whole gradient image, which would be just the very first step in any ‘traditional’ computer vision approach. By iteratively re-initializing, in each frame, the swarm location in a neighborhood of the region where the plate was detected in the previous frame, real-time performance can be achieved in tracking the plate in videos acquired at 30 frames per second using a standard PC.
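The punctual fitness defined for plate detection can be written as a short Python function. This is a sketch under assumptions that the chapter does not specify: the image is a list of rows of (r, g, b) tuples, intensity is the plain average of the three channels, and at the image border the missing neighbor is replaced by the pixel itself. The function names are hypothetical.

```python
def intensity(pixel):
    """Gray level of an (r, g, b) pixel; the plain average is an assumption."""
    r, g, b = pixel
    return (r + g + b) / 3.0

def plate_punctual_fitness(image, x, y, tolerance=30):
    """Largest horizontal gradient at (x, y), or 0 for clearly colored pixels."""
    r, g, b = image[y][x]
    # Plates are black on white, so strongly colored pixels are discarded.
    if abs(r - g) > tolerance or abs(r - b) > tolerance or abs(g - b) > tolerance:
        return 0.0
    width = len(image[y])
    here = intensity(image[y][x])
    # Assumption: neighbors outside the image are clamped to the border pixel.
    right_gradient = abs(here - intensity(image[y][min(x + 1, width - 1)]))
    left_gradient = abs(here - intensity(image[y][max(x - 1, 0)]))
    return max(right_gradient, left_gradient)
```

Only the pixel under a particle and its two horizontal neighbors are read, which is why far fewer pixels are touched than when a full gradient image is computed.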

Figure 2. License-plate detection; Original image (left) and results of the detection (right)

The same cannot be said for the pasta segmentation algorithm when high segmentation accuracy is required (about 30 seconds were needed to produce the segmentation in Figure 1 on a 2.8 GHz PC). Even in that case, however, if the pieces only need to be roughly located, a few runs of the algorithm are enough to achieve the goal.
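The control logic of the license-plate search described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, the sub-swarm labels are assumed to come from a clustering step that is not shown (with -1 marking particles that belong to no sub-swarm), and the target aspect ratio of 4.7 with a 25% tolerance is only an illustrative value for an assumed plate format. The sketch covers the stopping test of the global search, the aspect-ratio check on the bounding box, and the re-initialization of the swarm near the previous detection when tracking.

```python
import random
from collections import Counter


def global_search_done(labels, iteration, max_iters, min_fraction=0.5):
    """Stop when the largest sub-swarm exceeds min_fraction of the whole
    swarm, or when the iteration budget has been used up."""
    counts = Counter(label for label in labels if label >= 0)
    largest = max(counts.values(), default=0)
    return largest > min_fraction * len(labels) or iteration >= max_iters


def bounding_box(particles):
    """Axis-aligned box (x0, y0, x1, y1) enclosing all (x, y) particles."""
    xs = [x for x, _ in particles]
    ys = [y for _, y in particles]
    return min(xs), min(ys), max(xs), max(ys)


def plate_like(box, target_ratio=4.7, tolerance=0.25):
    """True if width/height is within a relative tolerance of the target.
    Both default values are assumptions made for illustration."""
    x0, y0, x1, y1 = box
    width, height = x1 - x0, y1 - y0
    if height <= 0:
        return False
    return abs(width / height - target_ratio) <= tolerance * target_ratio


def reinit_near(box, n_particles, margin, image_size, rng=random):
    """Scatter a fresh swarm uniformly in the previous detection box,
    enlarged by `margin` pixels and clipped to the image."""
    x0, y0, x1, y1 = box
    width, height = image_size
    lo_x, hi_x = max(0, x0 - margin), min(width - 1, x1 + margin)
    lo_y, hi_y = max(0, y0 - margin), min(height - 1, y1 + margin)
    return [(rng.uniform(lo_x, hi_x), rng.uniform(lo_y, hi_y))
            for _ in range(n_particles)]
```

In a tracking loop, the box accepted by `plate_like` in one frame would be passed to `reinit_near` to seed the swarm for the next frame, so that the search starts close to the expected plate position instead of covering the whole image.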

FUTURE TRENDS

Research on PSO and on its applications to a wide variety of fields is booming. Image analysis is no exception: according to the INSPEC bibliography database, the number of papers describing applications of PSO to this field has increased by almost 50% in the last six months. Results are already very encouraging and suggest that much more is to be expected in the near future.

CONCLUSION

PSO is a versatile and effective optimization technique whose features can be easily adapted to a vast variety of problems, in which it can act not only as a “plain” optimizer but also as a more general, flexible search paradigm. The applications described in this chapter have confirmed this, introducing a general framework which can be applied, with few changes, to many other object detection and recognition problems, as well as to lower-level tasks in computer vision, such as image segmentation.

REFERENCES

De Jong, K.A. (2006). Evolutionary Computation: a unified approach. MIT Press.

Engelbrecht, A.P. (2005). Fundamentals of Computational Swarm Intelligence. Wiley.

Eiben, A. & Smith, J. (2004). Introduction to Evolutionary Computation. Springer.

Kennedy, J. & Eberhart, R. (1995). Particle Swarm Optimization. Proc. IEEE International Conference on Neural Networks. 1942-1948, Vol. IV.

Kennedy, J. (2000). Stereotyping: improving particle swarm performance with cluster analysis. Proc. IEEE Int. Conference on Evolutionary Computation, 1507-1512.

Kennedy, J., Poli, R. & Blackwell, T. (2007). Particle Swarm Optimisation: an overview. Swarm Intelligence, in press.

Leong, W.F. & Yen, G.G. (2006). Dynamic population size in PSO-based multiobjective optimization. Proc. IEEE Congress on Evolutionary Computation, 6182-6189.

Passaro, A. & Starita, A. (2008). Particle swarm optimization for multimodal functions: A clustering approach. Journal of Artificial Evolution and Applications, Volume 2008, Article ID 482032.

Poli, R. (2007). The sampling distribution of particle swarm optimizers and their stability. Tech. Rep. CSM465, Department of Computer Science, University of Essex.

Poli, R. (2007). Analysis of the publications on the applications of Particle Swarm Optimisation. Journal of Artificial Evolution and Applications, in press.

Bird, S. & Li, X. (2006). Enhancing the robustness of a speciation-based PSO. Proc. IEEE Congress on Evolutionary Computation, 3185-3192.

Veenhuis, C. & Köppen, M. (2006). Data swarm clustering. In Abraham, A., Groşan, C. & Ramos, V. (eds.). Swarm Intelligence in Data Mining. Springer, 221-241.

Chow, C.K. & Tsui, H.T. (2004). Autonomous agent response learning by a multispecies particle swarm optimization. Proc. IEEE Congress on Evolutionary Computation, 778-785.

Yen, G.G. & Daneshyari, M. (2006). Diversity-based information exchange among multiple swarms in Particle Swarm Optimization. Proc. IEEE Congress on Evolutionary Computation, 6150-6157.

Clerc, M. (2006). Particle Swarm Optimization. ISTE.

KEY TERMS

Evolutionary Computation: Collection of techniques, basically aimed at function optimization but applicable to a huge variety of problems, by which the optimum of a function (fitness function) is sought through iterative refinements, according to rules inspired by the laws of natural evolution.

Fitness Function: In evolutionary computation, the objective function which is to be optimized.

Image Analysis: Collection of techniques by which high-lev
