Lecture Notes in Control and Information Sciences 377
Editors: M. Thoma, M. Morari
Krzysztof Patan
Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes
Series Advisory Board F. Allgöwer, P. Fleming, P. Kokotovic, A.B. Kurzhanski, H. Kwakernaak, A. Rantzer, J.N. Tsitsiklis
Author
Krzysztof Patan
University of Zielona Góra
Institute of Control and Computation Engineering
ul. Podgórna 50
65-246 Zielona Góra, Poland
E-Mail: [email protected]

ISBN 978-3-540-79871-2
e-ISBN 978-3-540-79872-9
DOI 10.1007/978-3-540-79872-9
Lecture Notes in Control and Information Sciences, ISSN 0170-8643
Library of Congress Control Number: 2008926085
© 2008 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India
Printed on acid-free paper
springer.com
To my beloved wife Agnieszka, and children Weronika and Leonard, for their patience and tolerance
Foreword
An unappealing characteristic of all real-world systems is that they are vulnerable to faults, malfunctions and, more generally, unexpected modes of behaviour. This explains the continuous need for reliable and universal monitoring systems based on suitable and effective fault diagnosis strategies. This is especially true for engineering systems, whose complexity is permanently growing due to the inevitable development of modern industry as well as the information and communication technology revolution. Indeed, the design and operation of engineering systems require increased attention with respect to availability, reliability, safety and fault tolerance. It is thus natural that fault diagnosis plays a fundamental role in modern control theory and practice, which is reflected in the large number of papers on fault diagnosis in many control-oriented conferences and journals. Indeed, a large amount of knowledge on model based fault diagnosis has been accumulated in the scientific literature since the beginning of the 1970s. As a result, a wide spectrum of fault diagnosis techniques has been developed. A major category of fault diagnosis techniques is the model based one, where an analytical model of the plant to be monitored is assumed to be available. Unfortunately, a fundamental difficulty with the model based approach is that there are always modelling uncertainties due to unmodelled disturbances, simplifications, idealisations, linearisations, model parameter inaccuracies and so on. Another important difficulty concerns the intrinsically non-linear character of most engineering systems. Indeed, with a few exceptions, most of the well-established approaches presented in the literature can be applied to linear systems only. This fact, of course, considerably limits their application in modern industrial control systems.
Therefore, it is clear that there is a need for both modelling and fault diagnosis techniques for non-linear dynamic systems, which must ensure robustness to modelling uncertainties. Presently, many researchers see artificial neural networks as a strong alternative to the classical methods used in the model based fault diagnosis framework. Indeed, due to their interesting properties as functional approximators, neural networks turn out to be a very promising tool for
dealing with non-linear processes. Although considerable research attention has been devoted to the application of neural networks in this important research area, the existing publications on the specific class of locally recurrent neural networks considered in this book are rather scarce. To date, very few works in the literature present locally recurrent neural networks in a unified framework including stability analysis, approximation abilities, training sequence selection as well as industrial applications. The book presents the application of neural networks to the modelling and fault diagnosis of industrial processes. The first two chapters focus on fundamental issues such as the basic definitions and fault diagnosis schemes as well as a survey of the ways neural networks are used in different fault diagnosis strategies. This part has a tutorial value and can be perceived as a good starting point for newcomers to the field. Chapter 3 presents a special class of locally recurrent neural networks, addressing their properties and training algorithms. Investigations regarding stability, approximation capabilities and the selection of optimal input training sequences are carried out in the subsequent three chapters. Chapter 7 describes decision making methods, including robustness analysis. Chapter 8 shows original achievements in the area of fault diagnosis of industrial processes. All the concepts described in this book are illustrated with either simple academic examples or real-world practical applications. Because both theory and practical applications are discussed, the book is expected to be useful for both academic researchers and professional engineers working in industry. The first group may be especially interested in the fundamental issues and/or some inspirations regarding future research directions concerning fault diagnosis.
The second group may be interested in the practical implementations, which can be very helpful in industrial applications of the techniques described in this publication. Thus, the book can be strongly recommended to both researchers and practitioners in the wide field of fault detection, supervision and safety of technical processes.
February, 2008
Prof. Thomas Parisini University of Trieste, Italy
Preface
It is well understood that fault diagnosis has become an important issue in modern automatic control theory. Early diagnosis of faults that might occur in the supervised process makes it possible to perform important preventive actions. Moreover, it allows one to avoid heavy economic losses caused by stopped production and the replacement of elements and parts. The core of fault diagnosis methodology is the so-called model based scheme, where either analytical or knowledge based models are used in combination with decision making procedures. The fundamental idea of model based fault diagnosis is to generate signals that reflect inconsistencies between nominal and faulty system operating conditions. In the case of complex systems, however, one is faced with the problem that no accurate, or no sufficiently accurate, mathematical models are available. A solution to this problem can be obtained through the use of artificial neural networks. Over the last two and a half decades, significant development of so-called dynamic neural networks has been observed. One of the most interesting solutions to the dynamic system identification problem is the application of locally recurrent globally feedforward networks. The book is mainly focused on investigating the properties of locally recurrent neural networks, developing training procedures for them, and their application to the modelling and fault diagnosis of non-linear dynamic processes and plants. The material included in the monograph results from research that has been carried out at the Institute of Control and Computation Engineering of the University of Zielona Góra, Poland, over the last eight years in the area of the modelling of non-linear dynamic processes as well as the fault diagnosis of industrial processes.
Some of the presented results were developed with the support of the Ministry of Science and Higher Education in Poland under the grants Artificial neural networks in robust diagnostic systems (2007-2008) and Modelling and identification of non-linear dynamic systems in robust diagnostics (2004-2007). The work was also supported by the EC under the RTN project Development and Application of Methods for Actuator Diagnosis in Industrial Control Systems DAMADICS (2000-2004).
The monograph is divided into nine chapters. The first chapter constitutes an introduction to the theory of the modelling and fault diagnosis of technical processes. Chapter 2 focuses on the modelling issue in fault diagnosis, especially on the model based scheme and neural networks' role in it. Chapter 3 deals with a special class of locally recurrent neural networks, investigating their properties and training algorithms. The next three chapters discuss the fundamental issues of locally recurrent networks, namely approximation abilities, stability and stabilization procedures, and selecting optimal input training sequences. Chapter 7 discusses several methods of decision making in the context of fault diagnosis, including both constant and adaptive thresholds. Chapter 8 shows original achievements in the area of fault diagnosis of industrial processes, and Chapter 9 concludes the monograph and outlines further research directions. At this point, I would like to express my sincere thanks to Prof. Józef Korbicz for suggesting the problem, his invaluable help and continuous support. I am grateful to all the friends at the Institute of Control and Computation Engineering of the University of Zielona Góra for many stimulating discussions and the friendly atmosphere required for finishing this work. In particular, I would like to thank my brother Maciek for his partial contribution to Chapter 6, and Wojtek Paszke for support in the area of linear matrix inequalities. Finally, I would like to express my gratitude to Ms Agnieszka Rożewska for proofreading and linguistic advice on the text.
Zielona Góra, February 2008
Krzysztof Patan
Contents
1  Introduction ............................................................. 1
   1.1  Organization of the Book ............................................ 3

2  Modelling Issue in Fault Diagnosis ....................................... 7
   2.1  Problem of Fault Detection and Fault Diagnosis ...................... 8
   2.2  Models Used in Fault Diagnosis ...................................... 11
        2.2.1  Parameter Estimation ......................................... 12
        2.2.2  Parity Relations ............................................. 12
        2.2.3  Observers .................................................... 13
        2.2.4  Neural Networks .............................................. 14
        2.2.5  Fuzzy Logic .................................................. 15
   2.3  Neural Networks in Fault Diagnosis .................................. 16
        2.3.1  Multi-layer Feedforward Networks ............................. 16
        2.3.2  Radial Basis Function Network ................................ 18
        2.3.3  Kohonen Network .............................................. 20
        2.3.4  Model Based Approaches ....................................... 21
        2.3.5  Knowledge Based Approaches ................................... 23
        2.3.6  Data Analysis Approaches ..................................... 23
   2.4  Evaluation of the FDI System ........................................ 24
   2.5  Summary ............................................................. 26

3  Locally Recurrent Neural Networks ........................................ 29
   3.1  Neural Networks with External Dynamics .............................. 30
   3.2  Fully Recurrent Networks ............................................ 31
   3.3  Partially Recurrent Networks ........................................ 32
   3.4  State-Space Networks ................................................ 34
   3.5  Locally Recurrent Networks .......................................... 36
        3.5.1  Model with the IIR Filter .................................... 40
        3.5.2  Analysis of Equilibrium Points ............................... 43
        3.5.3  Controllability and Observability ............................ 47
        3.5.4  Dynamic Neural Network ....................................... 49
   3.6  Training of the Network ............................................. 52
        3.6.1  Extended Dynamic Back-Propagation ............................ 52
        3.6.2  Adaptive Random Search ....................................... 53
        3.6.3  Simultaneous Perturbation Stochastic Approximation ........... 55
        3.6.4  Comparison of Training Algorithms ............................ 57
   3.7  Summary ............................................................. 62

4  Approximation Abilities of Locally Recurrent Networks .................... 65
   4.1  Modelling Properties of the Dynamic Neuron .......................... 66
        4.1.1  State-Space Representation of the Network .................... 67
   4.2  Preliminaries ....................................................... 67
   4.3  Approximation Abilities ............................................. 68
   4.4  Process Modelling ................................................... 72
   4.5  Summary ............................................................. 74

5  Stability and Stabilization of Locally Recurrent Networks ................ 77
   5.1  Stability Analysis – Networks with One Hidden Layer ................. 78
        5.1.1  Gradient Projection .......................................... 82
        5.1.2  Minimum Distance Projection .................................. 82
        5.1.3  Strong Convergence ........................................... 86
        5.1.4  Numerical Complexity ......................................... 88
        5.1.5  Pole Placement ............................................... 90
        5.1.6  System Identification Based on Real Process Data ............. 92
        5.1.7  Convergence of Network States ................................ 93
   5.2  Stability Analysis – Networks with Two Hidden Layers ................ 96
        5.2.1  Second Method of Lyapunov .................................... 97
        5.2.2  First Method of Lyapunov ..................................... 105
   5.3  Stability Analysis – Cascade Networks ............................... 110
   5.4  Summary ............................................................. 111

6  Optimum Experimental Design for Locally Recurrent Networks ............... 113
   6.1  Optimal Sequence Selection Problem in Question ...................... 114
        6.1.1  Statistical Model ............................................ 114
        6.1.2  Sequence Quality Measure ..................................... 115
        6.1.3  Experimental Design .......................................... 116
   6.2  Characterization of Optimal Solutions ............................... 117
   6.3  Selection of Training Sequences ..................................... 118
   6.4  Illustrative Example ................................................ 119
        6.4.1  Simulation Setting ........................................... 119
        6.4.2  Results ...................................................... 120
   6.5  Summary ............................................................. 121

7  Decision Making in Fault Detection ....................................... 123
   7.1  Simple Thresholding ................................................. 124
   7.2  Density Estimation .................................................. 126
        7.2.1  Normality Testing ............................................ 126
        7.2.2  Density Estimation ........................................... 127
        7.2.3  Threshold Calculating – A Single Neuron ...................... 130
        7.2.4  Threshold Calculating – A Two-Layer Network .................. 131
   7.3  Robust Fault Diagnosis .............................................. 132
        7.3.1  Adaptive Thresholds .......................................... 133
        7.3.2  Fuzzy Threshold Adaptation ................................... 135
        7.3.3  Model Error Modelling ........................................ 137
   7.4  Summary ............................................................. 140

8  Industrial Applications .................................................. 141
   8.1  Sugar Factory Fault Diagnosis ....................................... 141
        8.1.1  Instrumentation Faults ....................................... 143
        8.1.2  Actuator Faults .............................................. 143
        8.1.3  Experiments .................................................. 146
        8.1.4  Final Remarks ................................................ 160
   8.2  Fluid Catalytic Cracking Fault Detection ............................ 161
        8.2.1  Process Modelling ............................................ 163
        8.2.2  Faulty Scenarios ............................................. 164
        8.2.3  Fault Diagnosis .............................................. 164
        8.2.4  Robust Fault Diagnosis ....................................... 168
        8.2.5  Final Remarks ................................................ 172
   8.3  DC Motor Fault Diagnosis ............................................ 172
        8.3.1  AMIRA DR300 Laboratory System ................................ 173
        8.3.2  Motor Modelling .............................................. 176
        8.3.3  Fault Diagnosis Using Density Shaping ........................ 176
        8.3.4  Robust Fault Diagnosis ....................................... 181
        8.3.5  Final Remarks ................................................ 182

9  Concluding Remarks and Further Research Directions ....................... 187

References .................................................................. 191

Index ....................................................................... 203
List of Figures
2.1   Scheme of the diagnosed automatic control system ...................... 8
2.2   Types of faults: abrupt (dashed), incipient (solid) and intermittent (dash-dot) ... 9
2.3   Two stages of fault diagnosis ......................................... 10
2.4   General scheme of model based fault diagnosis ......................... 11
2.5   Neuron scheme with n inputs and one output ............................ 17
2.6   Three layer perceptron with n inputs and m outputs .................... 18
2.7   Structure of the radial basis function network with n inputs and m outputs ... 19
2.8   Model based fault diagnosis using neural networks ..................... 22
2.9   Model-free fault diagnosis using neural networks ...................... 24
2.10  Fault diagnosis as pattern recognition ................................ 25
2.11  Definition of the benchmark zone ...................................... 25

3.1   External dynamics approach realization ................................ 30
3.2   Fully recurrent network of Williams and Zipser ........................ 32
3.3   Partially recurrent networks due to Elman (a) and Jordan (b) .......... 33
3.4   Architecture of the recurrent multi-layer perceptron .................. 34
3.5   Block scheme of the state-space neural network with one hidden layer ... 34
3.6   Generalized structure of the dynamic neuron unit (a), network composed of dynamic neural units (b) ... 36
3.7   Neuron architecture with local activation feedback .................... 39
3.8   Neuron architecture with local synapse feedback ....................... 39
3.9   Neuron architecture with local output feedback ........................ 40
3.10  Memory neuron architecture ............................................ 41
3.11  Neuron architecture with the IIR filter ............................... 41
3.12  Transformation of the neuron model with the IIR filter to the general local activation feedback structure ... 42
3.13  State-space form of the i-th neuron with the IIR filter ............... 43
3.14  Positions of equilibrium points: stable node (a), stable focus (b), unstable node (c), unstable focus (d), saddle point (e), center (f) ... 44
3.15  Eigenvalue positions of the matrix A: γ = 0.3 (a), γ = 0.5 (b) ........ 45
3.16  Eigenvalue positions of the modified matrix A: γ = 0.3 (a), γ = 0.5 (b) ... 46
3.17  Topology of the locally recurrent globally feedforward network ........ 50
3.18  Learning error for different algorithms ............................... 60
3.19  Testing phase: EDBP (a), ARS (b) and SPSA (c). Actuator (black), neural model (grey) ... 61

4.1   i-th neuron of the second layer ....................................... 70
4.2   Cascade structure of the modified dynamic neural network .............. 70
4.3   Cascade structure of the modified dynamic neural network .............. 72

5.1   Result of the experiment – an unstable system ......................... 80
5.2   Result of the experiment – a stable system ............................ 81
5.3   Idea of the gradient projection ....................................... 83
5.4   Stability triangle and the search region .............................. 84
5.5   Sum squared error – training without stabilization .................... 90
5.6   Poles location during learning without stabilization: neuron 1 (a), neuron 2 (b), neuron 3 (c), neuron 4 (d) ... 91
5.7   Poles location during learning, stabilization using GP: neuron 1 (a), neuron 2 (b), neuron 3 (c), neuron 4 (d) ... 92
5.8   Poles location during learning without stabilization: neuron 1 (a), neuron 2 (b), neuron 3 (c), neuron 4 (d) ... 93
5.9   Poles location during learning, stabilization using MDP: neuron 1 (a), neuron 2 (b), neuron 3 (c), neuron 4 (d) ... 94
5.10  Actuator (solid line) and model (dashed line) outputs for learning (a) and testing (b) data sets ... 94
5.11  Results of training without stabilization: error curve (a) and the convergence of the state x(k) of the neural model (b) ... 95
5.12  Results of training with GP: error curve (a) and the convergence of the state x(k) of the neural model (b) ... 95
5.13  Results of training with MDP: error curve (a) and the convergence of the state x(k) of the neural model (b) ... 96
5.14  Convergence of network states: original system (a)–(b), transformed system (c)–(d), learning track (e) ... 99
5.15  Convergence of network states: original system (a)–(b), transformed system (c)–(d), learning track (e) ... 102
5.16  Graphical solution of the problems (5.70) ............................. 109

6.1   Convergence of the design algorithm ................................... 120
6.2   Average variance of the model response prediction for optimum design (diamonds) and random design (circles) ... 121

7.1   Residual with thresholds calculated using: (7.3) (a), (7.5) with ζ = 1 (b), (7.5) with ζ = 2 (c), (7.5) with ζ = 3 (d) ... 125
7.2   Normality testing: comparison of cumulative distribution functions (a), probability plot (b) ... 127
7.3   Neural network for density calculation ................................ 129
7.4   Simple two-layer network .............................................. 129
7.5   Output of the network (7.20) with the threshold (7.34) ................ 131
7.6   Residual signal (solid), adaptive thresholds calculated using (7.38) (dotted), and adaptive thresholds calculated using (7.39) (dashed) ... 135
7.7   Illustration of the fuzzy threshold adaptation ........................ 136
7.8   Scheme of the fault detection system with the threshold adaptation .... 137
7.9   Idea of the fuzzy threshold ........................................... 137
7.10  Model error modelling: error model training (a), confidence region constructing (b) ... 138
7.11  Idea of model error modelling: system output (solid), centre of the uncertainty region (dotted), confidence bands (dashed) ... 139

8.1   Evaporation station. Four heaters and the first evaporation section ... 142
8.2   Actuator to be diagnosed (a), block scheme of the actuator (b) ........ 144
8.3   Causal graph of the main actuator variables ........................... 145
8.4   Residual signal for the vapour model in different faulty situations: fault in P51 03 (900–1200), fault in T51 07 (1800–2100) ... 148
8.5   Residual signals for the temperature model in different faulty situations: fault in F51 01 (0–300), fault in F51 02 (325–605), fault in T51 06 (1500–1800), fault in T51 08 (2100–2400), fault in TC51 05 (2450–2750) ... 149
8.6   Normal operating conditions: residual with the constant (a) and the adaptive (b) threshold ... 151
8.7   Residual for different faulty situations .............................. 152
8.8   Residual of the nominal model (output F) in the case of the faults f1 (a), f2 (b) and f3 (c) ... 154
8.9   Residual of the nominal model (output X) in the case of the faults f1 (a), f2 (b) and f3 (c) ... 155
8.10  Residuals of the fault models f1 (a) and (d); f2 (b) and (e); f3 (c) and (f) in the case of the fault f1 ... 156
8.11  Residuals of the fault models f1 (a) and (d); f2 (b) and (e); f3 (c) and (f) in the case of the fault f2 ... 157
8.12  Residuals of the fault models f1 (a) and (d); f2 (b) and (e); f3 (c) and (f) in the case of the fault f3 ... 158
8.13  General scheme of the fluid catalytic cracking converter .............. 162
8.14  Results of modelling the temperature of the cracking mixture (8.9) .... 163
8.15  Residual signal ....................................................... 164
8.16  Cumulative distribution functions: normal – solid, residual – dashed ... 165
8.17  Probability plot for the residual ..................................... 165
8.18  Residual histogram (a), network output histogram (b), estimated PDF and the confidence interval (c) ... 166
8.19  Residual histogram (a), network output histogram (b), estimated PDF and the confidence interval (c) ... 167
8.20  Residual (solid) and the error model output (dashed) under nominal operating conditions ... 169
8.21  Confidence bands and the system output under nominal operating conditions ... 170
8.22  Residual with constant thresholds under nominal operating conditions ... 170
8.23  Fault detection results: scenario f1 (a), scenario f2 (b), scenario f3 (c) ... 171
8.24  Laboratory system with a DC motor ..................................... 174
8.25  Equivalent electrical circuit of a DC motor ........................... 176
8.26  Responses of the motor (solid) and the neural model (dash-dot) – open-loop control ... 177
8.27  Responses of the motor (solid) and the neural model (dashed) – closed-loop control ... 177
8.28  Symptom distribution .................................................. 179
8.29  Residual and constant thresholds (a) and confidence bands generated by model error modelling (b) ... 182
8.30  Fault detection using model error modelling: fault f11 – confidence bands (a) and decision logic without the time window (b); fault f61 – confidence bands (c) and decision logic without the time window (d); fault f42 – confidence bands (e) and decision logic without the time window (f) ... 183
8.31  Fault detection by using constant thresholds: fault f11 – residual with thresholds (a) and decision logic without the time window (b); fault f61 – residual with thresholds (c) and decision logic without the time window (d); fault f42 – residual with thresholds (e) and decision logic without the time window (f) ... 184
List of Tables
3.1 Specification of different types of dynamic neuron units
3.2 Outline of ARS
3.3 Outline of the basic SPSA
3.4 Characteristics of learning methods
3.5 Characteristics of learning methods
4.1 Selection results of the cascade dynamic neural network
4.2 Selection results of the two-layer dynamic neural network
5.1 Outline of the gradient projection
5.2 Outline of the minimum distance projection
5.3 Number of operations: GP method
5.4 Number of operations: MDP method
5.5 Comparison of the learning time for different methods
5.6 Outline of norm stability checking
5.7 Comparison of methods
5.8 Outline of constrained optimisation training
6.1 Sample mean and the standard deviation of parameter estimates
7.1 Threshold calculating
8.1 Specification of process variables
8.2 Description of symbols
8.3 Selection of the neural network for the vapour model
8.4 Number of false alarms
8.5 Neural models for nominal conditions and faulty scenarios
8.6 Results of fault detection (a) and isolation (b) (X – detectable/isolable, N – not detectable/not isolable)
8.7 Modelling quality for different models
8.8 FDI properties of the examined approaches
8.9 Specification of measurable process variables
8.10 Comparison of false detection rates
8.11 Performance indices for faulty scenarios
8.12 Performance indices for faulty scenarios
8.13 Laboratory system technical data
8.14 Results of fault detection for the density shaping technique
8.15 Fault isolation results
8.16 Fault identification results
8.17 Results of fault detection for model error modelling
Nomenclature
Symbols

R – set of real numbers
t, k – continuous- and discrete-time indexes
x(·) – state vector
u(·) – input vector
y(·) – output vector
σ(·) – activation function
σ(·) – vector-valued activation function
A – state matrix
W – weight matrix
C – output matrix
B – feed-forward filter parameters matrix
D – transfer matrix
G – slope parameters matrix
g – vector of biases
θ, θ̂ – vector of network parameters and its estimate
N(m, v) – normally distributed random number with the expectation value m and the standard deviation v
β – significance level
C¹ – class of continuously differentiable mappings
I – identity matrix
0 – zero matrix
C – set of constraints
K – set of violated constraints
A⁻ – pseudo-inverse of a matrix A
rtd, rfd – true and false detection rates, respectively
rti, rfi – true and false isolation rates, respectively
tdt – time of fault detection
Operators

P – probability
E – expectation
E[·|·] – conditional expectation
sup – least upper bound (supremum)
inf – greatest lower bound (infimum)
max – maximum
min – minimum
rank(A) – rank of a matrix A
det(A) – determinant of a matrix A
trace(A) – trace of a matrix A
Abbreviations

FDI – Fault Detection and Isolation
UIO – Unknown Input Observer
GMDH – Group Method of Data Handling
BP – Back-Propagation
RBF – Radial Basis Function
RTRN – Real-Time Recurrent Network
RMLP – Recurrent Multi-Layer Perceptron
IIR – Infinite Impulse Response
FIR – Finite Impulse Response
LRGF – Locally Recurrent Globally Feed-forward
ARS – Adaptive Random Search
EDBP – Extended Dynamic Back-Propagation
SPSA – Simultaneous Perturbation Stochastic Approximation
BIBO – Bounded Input Bounded Output
GP – Gradient Projection
MDP – Minimum Distance Projection
ODE – Ordinary Differential Equations
w.p.1 – with probability 1
a.s. – almost surely
LMI – Linear Matrix Inequality
OED – Optimum Experimental Design
ARX – Auto-Regressive with eXogenous input
NNARX – Neural Network Auto-Regressive with eXogenous input
MEM – Model Error Modelling
SCADA – Supervisory Control And Data Acquisition
AIC – Akaike Information Criterion
FPE – Final Prediction Error
MIMO – Multi Input Multi Output
MISO – Multi Input Single Output
FCC – Fluid Catalytic Cracking
DC – Direct Current
1 Introduction
The diagnostics of industrial processes is a scientific discipline aimed at the detection of faults in industrial plants, their isolation and, finally, their identification. Its main task is the diagnosis of process anomalies and of faults in process components, sensors and actuators. The early diagnosis of faults that might occur in the supervised process makes it possible to perform important preventive actions. Moreover, it allows one to avoid heavy economic losses involved in stopped production, the replacement of elements and parts, etc. Most of the methods in the fault diagnosis literature are based on linear methodology or exact models. Industrial processes, however, are often difficult to model: they are complex and not exactly known, and measurements are corrupted by noise and unreliable sensors. Therefore, a number of researchers have perceived artificial neural networks as an alternative way to represent knowledge about faults [1, 2, 3, 4, 5, 6, 7]. Neural networks can filter out noise and disturbances, and they can provide stable, highly sensitive and economic diagnostics of faults without traditional types of models. Another desirable feature of neural networks is that no exact models are required to reach the decision stage [2]. In typical operation, the process model may be only approximate and the critical measurements may be corrupted; neural networks are able to map internally the functional relationships that represent the process, filter out the noise, and handle correlations as well. Although there are many promising simulation examples of neural networks in fault diagnosis in the literature, real applications are still quite rare. More detailed scientific investigations concerning the application of neural networks in real industrial plants are greatly needed in order to achieve the complete utilization of their attractive features. One of the most frequently used schemes for fault diagnosis is the model based concept.
The basic idea of model based fault diagnosis is to generate signals that reflect inconsistencies between nominal and faulty system operating conditions [8, 9, 10, 11]. Such signals, called residuals, are usually calculated by using analytical methods such as observers [9, 12], parameter estimation methods [13, 14] or parity equations [15, 16]. Unfortunately, the common drawback of these approaches is that an accurate mathematical model of the diagnosed plant is required. When no mathematical model of the diagnosed system is available, or the complexity of the dynamic system increases and the task of modelling is very hard to carry out, analytical models cannot be applied or cannot give satisfactory results. In these cases data based models, such as neural networks, fuzzy sets or their combination (neuro-fuzzy networks), can be considered. In recent years, a great deal of attention has been paid to the application of artificial neural networks to the modelling and identification of dynamic processes [17, 18, 19, 20], adaptive control systems [19, 21, 22] and time series prediction problems [23, 24]. A growing interest in the application of artificial neural networks to fault diagnosis systems has also been observed [25, 26, 7, 6, 4, 27]. Artificial neural networks provide an excellent mathematical tool for dealing with non-linear problems. An important property is that any continuous non-linear relationship can be approximated with arbitrary accuracy by a neural network with a suitable architecture and weight parameters [23, 28]. Another attractive property is their self-learning ability. A neural network can extract the system features from historical training data using a learning algorithm, requiring little or no a priori knowledge about the process. This gives great flexibility in modelling non-linear systems [19, 23, 29]. These features allow one to design adaptive control systems for complex, unknown and non-linear dynamic processes. In contrast to many successful applications, e.g. in pattern recognition problems [30, 31] or the approximation of non-linear functions [32, 33], the application of neural networks in control systems requires taking into consideration the dynamics of the processes being investigated.

K. Patan: Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes, LNCIS 377, pp. 1–6, 2008. © Springer-Verlag Berlin Heidelberg 2008. springerlink.com
The application of feedforward neural networks with the back-propagation learning algorithm in control systems requires the introduction of delay elements [7, 18, 19, 34, 21]. Such a solution is needed because these relatively simple and easy to apply networks are of a static type [19, 29]. Hence, their application possibilities in relation to dynamic problems are very limited and insufficient. Recurrent neural networks possess considerably better properties from the point of view of their application in control theory [35, 36, 37]. Thanks to feedback introduced into the network architecture, it is possible to accumulate historical data and use them later. Feedback can be either local or global. Globally recurrent networks can model a wide class of dynamic processes; however, they suffer from disadvantages such as the slow convergence of learning and stability problems [18]. In general, these architectures seem to be too complex for practical implementations. Furthermore, the fixed relationship between the number of states and the number of neurons does not allow adjusting the dynamics of the model and its non-linear behaviour separately. The drawbacks of globally recurrent networks can be partly avoided by using locally recurrent networks [38, 26, 39]. Such networks have a feedforward multi-layer architecture, and their dynamic properties are obtained using a specific kind of neuron model [38, 40]. One of the possible solutions is the use of neuron models with the Infinite Impulse Response (IIR) filter. By introducing a linear dynamic system into the neuron structure, the neuron activation depends on its current inputs as well as on past inputs and outputs. The conditions for the global stability of the neural
network considered can be derived using pole placement and the second method of Lyapunov. Neural networks with two hidden layers possess much more powerful properties than networks with one hidden layer. Therefore, the stabilization of such networks is a problem of crucial importance. Calculating bounds on the network parameters, based on the elaborated stability conditions, in order to guarantee that the final neural model after training is stable is also a challenging objective. Most studies on locally recurrent globally feedforward networks are focused on training algorithms and stability problems. The literature on the approximation abilities of such networks is rather scarce. An interesting topic is thus the investigation of the approximation abilities of a locally recurrent neural network. Different structures of dynamic networks can be analysed to answer the question of how many layers are necessary to approximate a state-space trajectory produced by any continuous function with arbitrary accuracy. It is also interesting to investigate how these results can be used, in a broader sense, to estimate the number of neurons needed to ensure a given level of approximation accuracy. Another important issue is the problem of how to select the training data so as to carry out the training as effectively as possible. The theory related to Optimum Experimental Design (OED) can be applied here. The problem can be stated as follows: where to locate measurement sensors so as to guarantee the maximal accuracy of parameter estimation. This is of paramount interest in applications, as it is generally impossible to measure the system state over the entire domain. The optimal measurement problem is very attractive from the viewpoint of the degree of optimality, and it arises in a variety of applications. At present, the state of the art contains no contribution on experimental design for dynamic neural networks.
Therefore, this topic seems to be the most challenging one. Dynamic neural networks can be successfully applied to design model based fault diagnosis. However, model based fault diagnosis is founded on a number of idealized assumptions. One of them is that the model of the system is a faithful replica of the plant dynamics. Another one is that the disturbances and noise acting upon the system are known. This is, of course, not possible in engineering practice. The robustness problem in fault diagnosis can be defined as the maximisation of the detectability and isolability of faults together with the simultaneous minimisation of uncontrolled effects such as disturbances, noise, changes in inputs and/or the state, etc. Therefore, the problem of estimating model uncertainty is of paramount importance for limiting the number of false alarms in the monitoring system.
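To make the idea of a neuron with internal IIR filtering concrete, the following minimal sketch simulates a single dynamic neuron with a first-order filter; the function name, the filter order and all parameter values are illustrative assumptions, not the book's notation:

```python
import numpy as np

def iir_neuron_step(u, state, w, b, a, g, bias):
    """One time step of a sketched dynamic neuron with a first-order IIR filter:
       s(k) = w . u(k)                          (weighted input sum)
       z(k) = b0*s(k) + b1*s(k-1) - a1*z(k-1)   (linear IIR filter)
       y(k) = tanh(g*z(k) + bias)               (static activation)
    The filter state makes the activation depend on past inputs and outputs."""
    s_prev, z_prev = state
    s = float(np.dot(w, u))
    z = b[0] * s + b[1] * s_prev - a * z_prev
    y = float(np.tanh(g * z + bias))
    return y, (s, z)

# impulse response: a non-zero input only at k = 0
w = np.array([0.5, -0.2])
state = (0.0, 0.0)
outputs = []
for k in range(5):
    u = np.array([1.0, 0.0]) if k == 0 else np.array([0.0, 0.0])
    y, state = iir_neuron_step(u, state, w, b=(0.4, 0.3), a=0.6, g=1.0, bias=0.0)
    outputs.append(y)
# the neuron keeps responding after the input has vanished (internal dynamics)
```

With |a1| < 1 the internal filter of this sketch is stable, which is the kind of condition the stability analysis mentioned above is meant to guarantee for a whole network of such neurons.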
1.1 Organization of the Book

The remaining part of the book consists of the following chapters.

Modelling issue in fault diagnosis. The chapter is divided into four parts. The objective of the first part (Section 2.1) is to introduce the reader to
the theory of fault detection and diagnosis. This part explains the main tasks that a fault diagnosis system should perform, and defines fault types and the phases of the diagnostic procedure. Section 2.2 presents the most popular methods used in model based fault diagnosis. This section discusses parameter estimation methods, parity relations, observers, neural networks and fuzzy logic models. The main advantages and drawbacks of the discussed techniques are portrayed. A brief introduction to popular structures of neural networks is given in Section 2.3. This section also presents three main classes of fault diagnosis methods, i.e. model based, knowledge based and data analysis based approaches, with emphasis put on the role of neural networks in these schemes. In order to validate a diagnostic procedure, a number of performance indices are introduced in Section 2.4.

Locally recurrent neural networks. The first part of the chapter, consisting of Sections 3.1, 3.2, 3.3 and 3.4, deals with network structures in which dynamics are realized using time delays and global feedbacks. Well-known structures are discussed, i.e. the Williams-Zipser structure, partially recurrent networks and state-space models, with a rigorous analysis of their advantages and shortcomings. Section 3.5 presents locally recurrent structures, with the main emphasis on networks designed with neuron models containing the IIR filter. This part of the chapter contains original research results, including the analysis of the equilibrium points of the neuron as well as its observability and controllability. Training methods intended for use with locally recurrent networks are described in Section 3.6. Three algorithms are presented: extended dynamic back-propagation, adaptive random search and simultaneous perturbation stochastic approximation [6, 41, 26, 42, 43, 44, 45].

Approximation abilities of locally recurrent networks.
The chapter contains original research dealing with the approximation abilities of a special class of discrete-time locally recurrent neural networks [46, 47]. It includes analytical results showing that a locally recurrent network with two hidden layers is able to approximate a state-space trajectory produced by any Lipschitz continuous function with arbitrary accuracy [46, 47]. Moreover, based on these results, the network can be simplified and transformed into a more practical structure needed in real-world applications. In Section 4.1, the modelling properties of a single dynamic neuron are presented. The dynamic neural network and its representation in the state space are described in Section 4.1.1. Some preliminaries required to show the approximation abilities of the proposed network are discussed in Section 4.2. The main result concerning the approximation of state-space trajectories is presented in Section 4.3 [46, 47]. Section 4.4 illustrates the identification of a real technological process using the locally recurrent networks considered [46].

Stability and stabilization of locally recurrent networks. The chapter presents originally developed algorithms for the stability analysis and stabilization of a class of discrete-time locally recurrent neural networks [45, 48, 49]. In Section 5.1, stability issues of the dynamic neural network with
one hidden layer are discussed. The training of the network under an active set of constraints is formulated (Sections 5.1.1 and 5.1.2) [45], together with a convergence analysis of the proposed projection algorithms (Section 5.1.3) [49]. The section also reports experimental results, including a complexity analysis (Section 5.1.4), the stabilization effectiveness of the proposed methods (Section 5.1.5) and their application to the identification of an industrial process (Section 5.1.6). Section 5.2 presents stability analysis based on Lyapunov's methods [48]. Theorems based on the second method of Lyapunov are presented in Section 5.2.1. In turn, algorithms utilizing Lyapunov's first method are discussed in Section 5.2.2. Section 5.3 is devoted to the stability analysis of the cascade locally recurrent network proposed in Chapter 4.

Optimal experiment design for locally recurrent networks. Original developments concerning input data selection for the training process of a locally recurrent neural network are presented. At present, the state of the art contains no contribution on the optimal selection of input sequences for locally recurrent neural networks. Therefore, this topic seems to be the most challenging one among those proposed in the monograph. The chapter aims to fill this gap and proposes a practical approach to input data selection for the training of the locally recurrent neural network. The first part of the chapter, including Sections 6.1 and 6.2, gives fundamental knowledge about optimal experimental design. The proposed solution for selecting training sequences is formulated in Section 6.3. Section 6.4 contains the results of a numerical experiment showing the performance of the delineated approach.

Decision making in fault detection. The chapter discusses several methods of decision making in the context of fault diagnosis. It is composed of two parts.
The first part, consisting of Sections 7.1 and 7.2, is devoted to algorithms and methods of constant threshold calculation. Section 7.1 briefly describes known algorithms for generating simple thresholds based on the assumption that a residual signal has a normal distribution. A slightly different, original approach is shown in Section 7.2, where first a simple neural network is used to approximate the probability density function of a residual, and then a threshold is calculated [50, 51, 52]. The second part, comprising Section 7.3, presents several robust techniques for decision making. Section 7.3.1 discusses a statistical approach to adapting the threshold using a time window and recalculating the mean value and the standard deviation of a residual [53]. The application of fuzzy logic to threshold adaptation is described in Section 7.3.2 [54]. An original solution for designing a robust decision making process, obtained through model error modelling and neural networks, is investigated in Section 7.3.3 [55, 56].

Industrial applications. The chapter presents original achievements in the area of fault diagnosis of industrial processes. Section 8.1 includes experimental results of fault detection and isolation of selected parts of the sugar
evaporator [26, 49, 44, 45, 43, 42, 57, 54, 58, 59]. The experiments presented in this section were carried out using real process data. Section 8.2 consists of results concerning fault detection of selected components of the fluid catalytic cracking process [52, 50, 55]. The experiments presented in this section were carried out using simulated data. The last example, fault detection, isolation and identification of the electrical motor, is shown in Section 8.3 [51, 56]. The experiments presented in this section were carried out using real process data.
2 Modelling Issue in Fault Diagnosis
When introducing fault diagnosis as a scientific discipline, it is worth providing some basic definitions. These definitions, suggested by the IFAC Technical Committee SAFEPROCESS, have been introduced in order to unify the terminology in the area.

Fault is an unpermitted deviation of at least one characteristic property or variable of the system from acceptable/usual/standard behaviour.

Failure is a permanent interruption of the system's ability to perform a required function under specified operating conditions.

Fault detection is a determination of the faults present in the system and the time of detection.

Fault isolation is a determination of the kind, location and time of detection of a fault. Follows fault detection.

Fault identification is a determination of the size and time-variant behaviour of a fault. Follows fault isolation.

Fault diagnosis is a determination of the kind, size, location and time of detection of a fault. Follows fault detection. Includes both fault isolation and fault identification.

In the literature, there exist also other definitions of fault diagnosis. A very popular definition of fault diagnosis also includes fault detection [60]. Such a definition of fault diagnosis is used in this monograph. The chapter is divided into four main parts. Section 2.1 is an introduction to fault diagnosis theory. This section explains the main objectives of fault diagnosis, and defines fault types and the phases of the diagnostic procedure. Section 2.2 presents the most popular methods used for model based fault diagnosis. This section discusses parameter estimation methods, parity relations, observers, neural networks and fuzzy logic models. The main advantages and drawbacks of the discussed techniques are portrayed. A brief introduction to popular structures of neural networks is given in Section 2.3. The section also presents three main classes of fault diagnosis methods, i.e. model based, knowledge based and data
analysis based approaches, with emphasis put on the role of neural networks in these schemes. Each diagnostic algorithm should be validated to confirm its effectiveness and usefulness for real-world fault diagnosis. Some indices needed for this purpose are introduced in Section 2.4. The chapter concludes with some final remarks in Section 2.5.
2.1 Problem of Fault Detection and Fault Diagnosis

The main objective of the fault diagnosis system is to determine the location and occurrence time of possible faults based on accessible data and knowledge about the behaviour of the diagnosed process, e.g. using mathematical, quantitative or qualitative models. Advanced methods of supervision and fault diagnosis should satisfy the following requirements [13]:
• early detection of small faults, abrupt as well as incipient,
• diagnosis of faults in actuators, process components or sensors,
• detection of faults in closed loop control,
• supervision of processes in transient states.
The aim of early detection and diagnosis is to have enough time to take counteractions such as reconfiguration, maintenance, repair or other operations. Let us assume that a plant of an automatic control system with the known input vector u and the output vector y, as shown in Fig. 2.1, is given [4, 7, 61]. Such a plant can be treated as a system which is composed of a certain number of subsystems such as actuators, process components, and sensors. In each of these functional devices faults may occur that lead to undesired or intolerable performance, in other words, a failure of the controlled system. The main objective of fault diagnosis is to detect faults in each subsystem and their causes early enough so that the failure of the overall system can be avoided, and to provide information about their sizes and sources. Typical examples of faults are as follows:
• defective constructions, such as cracks, ruptures, fractures, leaks,
• faults in drives, such as damages of the bearings, deficiencies in force or momentum, defects in the gears,
Fig. 2.1. Scheme of the diagnosed automatic control system (figure: the plant, composed of actuators, the process and sensors, with input u, output y, faults fa, fp, fs acting on the respective subsystems, and unknown inputs: noise, disturbances, parameter variations)
• faults in sensors – scaling errors, hysteresis, drift, dead zones, shortcuts,
• abnormal parameter variations,
• external obstacles – collisions, the clogging of outflows.

Taking into account the scheme shown in Fig. 2.1, it is useful to divide faults into three categories: actuator (final control element), component and sensor faults, respectively. Actuator faults fa can be viewed as any malfunction of the equipment that actuates the system, e.g. a malfunction of the pneumatic servomotor in the control valve in the evaporation station [7, 26]. Component faults (process faults) fp occur when some changes in the system make the dynamic relation invalid, e.g. a leakage in a gas pipeline [62]. Sensor faults fs can be viewed as serious measurement variations. Faults can commonly be described as inputs. In addition, there is always modelling uncertainty due to unmodelled disturbances, noise and the model (see Fig. 2.1). This may not be critical to the process behaviour, but it may obscure fault detection by raising false alarms. Faults can also be classified taking into account the time-variant behaviour of a fault. Three classes can be distinguished: abrupt, incipient and intermittent faults (Fig. 2.2). An abrupt fault (marked in Fig. 2.2 with the dashed line) is simply an abrupt change of a variable: it is assumed that a variable or a signal has a constant value θ0, and when a fault occurs, the value of the parameter jumps to a new constant value θ1. An incipient fault gradually develops to a larger and larger value (marked in Fig. 2.2 with the solid line). The slow degradation of a component can be viewed as an incipient fault. An intermittent fault is a fault that occurs and disappears repeatedly (marked in Fig. 2.2 with the dash-dot line). A typical example of such a fault is a loose connector. In general, there are three phases in the diagnostic process [13, 63, 10, 7]:
• detection of faults,
• isolation of faults,
• identification of faults.
Fig. 2.2. Types of faults: abrupt (dashed), incipient (solid) and intermittent (dash-dot)
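The three fault classes of Fig. 2.2 can be reproduced with simple signal profiles. The following sketch is purely illustrative; the ramp slope, the switching period and all parameter names are assumptions:

```python
import numpy as np

def fault_profile(kind, n=100, k_fault=40, theta0=0.0, theta1=1.0):
    """Illustrative time profiles of the three fault classes."""
    k = np.arange(n)
    f = np.full(n, theta0)
    if kind == "abrupt":
        # the parameter jumps from theta0 to a new constant value theta1
        f[k >= k_fault] = theta1
    elif kind == "incipient":
        # the fault gradually develops towards theta1 (slow degradation)
        ramp = np.clip((k - k_fault) / (n - k_fault), 0.0, 1.0)
        f = theta0 + (theta1 - theta0) * ramp
    elif kind == "intermittent":
        # the fault occurs and disappears repeatedly (e.g. a loose connector)
        active = (k >= k_fault) & (((k - k_fault) // 10) % 2 == 0)
        f[active] = theta1
    return f
```

Plotting the three profiles over k reproduces the qualitative picture of Fig. 2.2: a step, a ramp, and a periodically appearing deviation from θ0.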
The main objective of fault detection is to make a decision whether a fault has occurred or not. Fault isolation should give information about the fault location, which requires that faults be distinguishable. Finally, fault identification comprises the determination of the size of a fault and the time of its occurrence. In practice, however, the identification phase appears rarely, and sometimes it is incorporated into fault isolation. Thus, from the practical point of view, the diagnostic process consists of two phases only: fault detection and isolation. Therefore, the common abbreviation used in many papers is FDI (Fault Detection and Isolation). In other words, automatic fault diagnosis can be viewed as a sequential process involving symptom extraction and the actual diagnostic task. Usually, a complete fault diagnosis system consists of two parts (Fig. 2.3):
• residual generation,
• residual evaluation.
The residual generation process is based on a comparison between the measured and predicted system outputs. As a result, the difference, or the so-called residual, is expected to be near zero under normal operating conditions, but on the occurrence of a fault a deviation from zero should appear. In turn, the residual evaluation module is dedicated to the analysis of the residual signal in order to determine whether a fault has occurred and to isolate the fault in a particular system device. Fault detection can be performed either with or without the use of a process model. In the first case, the detection phase includes generating residuals using models (analytical, neural, rough, fuzzy, etc.) and estimating residual values. It consists in transforming quantitative diagnostic residuals into qualitative ones and making a decision about the identification of symptoms. In the latter case, methods of limit value checking or the checking of simple relations between process variables are used in order to obtain special features of the diagnosed process.
This process is often called feature extraction or diagnostic signal generation. These features are then compared with the normal features of the healthy

Fig. 2.3. Two stages of fault diagnosis (figure: the process, with input u(k), output y(k), faults f and disturbances d, followed by residual generation producing residuals r and residual evaluation yielding the faults f)
Fig. 2.4. General scheme of model based fault diagnosis (figure: the plant output y(k) is compared with the outputs y0(k), y1(k), ..., yn(k) of a nominal model and of fault models 1 to n, producing residuals r0, r1, ..., rn, which are evaluated by a fault classifier indicating the fault f)
process. To carry out this process, change detection and classification methods can be applied. One of the most well-known approaches to residual generation is the model based concept. In the general case, this concept can be realized using different kinds of models: analytical, knowledge based and data based ones [64]. Unfortunately, the analytical model based approach is usually restricted to simpler systems described by linear models. When there are no mathematical models of the diagnosed system, or the complexity of the dynamic system increases and the task of modelling is very hard to achieve, analytical models cannot be applied or cannot give satisfactory results. In these cases data based models, such as neural networks, fuzzy sets or their combination (neuro-fuzzy networks), can be considered. Figure 2.4 illustrates how the fault diagnosis system can be designed using models of the system. As can be seen in Fig. 2.4, a bank of process models should be designed. Each model represents one class of the system behaviour: one model represents the system under its normal operating conditions, and each successive one a faulty situation [7]. After that, the residuals can be determined by comparing the system output y(k) and the outputs of the models y0(k), y1(k), ..., yn(k). In this way, the residual vector r = [r0, r1, ..., rn], which characterizes a suitable class of the system behaviour, can be obtained. Finally, the residual vector r should be transformed by a classifier to determine the location and time of fault occurrence. It is worth noting here that it is impossible to model all potential system faults. The designer of FDI systems can construct models based on the available data. In many cases, however, only data for normal operating conditions are available, and data for faulty scenarios have to be simulated. Therefore, when designing faulty models using, e.g., neural networks, serious problems can be encountered.
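A minimal numerical sketch of this bank-of-models scheme, with hypothetical static single-input models standing in for the nominal and fault models:

```python
import numpy as np

# Hypothetical bank of models: index 0 is the nominal model, indices 1 and 2
# represent two faulty classes of behaviour (here simple static gains).
MODEL_GAINS = np.array([2.0, 1.2, 3.1])

def residual_vector(y, u):
    """r = [r0, r1, r2]: mismatch between the measured output y and the
    output of each model in the bank for the input u."""
    return y - MODEL_GAINS * u

def classify(y, u):
    """Pick the behaviour class whose model best matches the measurement,
    i.e. the index of the residual with the smallest magnitude."""
    r = residual_vector(y, u)
    return int(np.argmin(np.abs(r))), r

# the system currently behaves like fault class 1 (its gain dropped to 1.2),
# so r[1] is (near) zero while r[0] and r[2] deviate from zero
u, y = 1.0, 1.2
fault_class, r = classify(y, u)
```

In a realistic setting each entry of the bank would be a dynamic (e.g. neural) model and the classifier would be more elaborate, but the structure of Fig. 2.4 is the same: one residual per modelled behaviour class, evaluated jointly.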
2.2 Models Used in Fault Diagnosis

This section presents the models most frequently used in the framework of model based fault diagnosis. Due to the comprehensive and vast literature available at
2 Modelling Issue in Fault Diagnosis
the moment, the presented models are discussed only briefly. A more complete description of many models used in fault diagnosis can be found in [65, 13, 4, 15, 9, 14, 7, 25, 12].
2.2.1 Parameter Estimation
In most practical cases, the process parameters are either unknown or not known accurately enough. If the basic structure of the model is known, they can be determined by parameter estimation methods from measured input and output signals. Consider the process described by

y(k) = Ψ^T θ,   (2.1)
where Ψ is the regressor vector, Ψ = [−y(k−1), ..., −y(k−m), u(k), ..., u(k−n)]^T, and θ is the parameter vector, θ = [a_1, ..., a_m, b_0, ..., b_n]^T. Assuming that the parameter vector θ has a physical meaning, the task consists in detecting faults in the system by measuring the input u(k) and the output y(k), and then computing the estimate θ̂ of the parameters of the system model. If the fault is modelled as an additive term f acting on the parameter vector of the system,

θ = θ_nom + f,   (2.2)

where θ_nom represents the nominal (fault-free) parameter vector, then the parameter estimate θ̂ indicates a change in the parameters as follows:

Δθ = θ̂ − θ.   (2.3)
Fault detection decision making then reduces to checking whether the norm of the parameter change (2.3) is greater than a predefined threshold. Methods of threshold determination are presented in detail in Chapter 7. The problem thus requires on-line parameter estimation, which can be solved with various recursive algorithms, such as the recursive least-squares method [66], the instrumental variable approach [67] or the bounded-error approach [68]. The main drawback of this approach is that the model parameters should have a physical meaning, i.e. they should correspond to the parameters of the system. When they do, the detection and isolation of faults is very straightforward. If this is not the case, it is usually difficult to distinguish a fault from a change in the parameter vector θ resulting from the time-varying properties of the system. Moreover, the process of fault isolation may become extremely difficult because the model parameters do not uniquely correspond to those of the system. It should also be pointed out that the detection of faults in sensors and actuators is possible but rather complicated [14]. Parameter estimation can also be applied to non-linear processes [69, 70, 66].
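As an illustration of fault detection via on-line parameter estimation, the sketch below runs a recursive least-squares estimator on a first-order ARX model of the form (2.1) and checks the parameter change (2.3) against a threshold. All numerical values (the nominal parameters, the simulated fault, the forgetting factor and the threshold) are illustrative assumptions, not taken from the text.

```python
import numpy as np

def rls_step(theta, P, psi, y, lam=1.0):
    """One step of recursive least squares for y(k) = psi(k)^T theta,
    with forgetting factor lam to track time-varying parameters."""
    Pp = P @ psi
    k = Pp / (lam + psi @ Pp)          # gain vector
    e = y - psi @ theta                # prediction error
    theta = theta + k * e
    P = (P - np.outer(k, Pp)) / lam
    return theta, P

# first-order ARX example: y(k) = -a*y(k-1) + b*u(k-1)
rng = np.random.default_rng(0)
a_nom, b_nom = -0.8, 0.5               # theta_nom = [a, b]
u = rng.standard_normal(400)
y = np.zeros(401)
for k in range(1, 401):
    a = a_nom if k < 200 else -0.4     # abrupt parameter fault at k = 200
    y[k] = -a * y[k-1] + b_nom * u[k-1]

theta = np.zeros(2)
P = 1e3 * np.eye(2)
for k in range(1, 401):
    psi = np.array([-y[k-1], u[k-1]])  # regressor, cf. (2.1)
    theta, P = rls_step(theta, P, psi, y[k], lam=0.95)

# fault decision: norm of the parameter change, cf. (2.3)
delta = np.linalg.norm(theta - np.array([a_nom, b_nom]))
print(delta > 0.1)
```

The forgetting factor below one lets the estimate track the post-fault parameters, so the norm of the deviation from θ_nom crosses the threshold.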
2.2.2 Parity Relations
Consider a linear process described by the following transfer function:

G_P(s) = B_P(s)/A_P(s).   (2.4)
If the structure of the process as well as its parameters are known, the process model is represented by

G_M(s) = B_M(s)/A_M(s).   (2.5)

Assume that f_u(t) and f_y(t) are additive faults acting on the input and the output, respectively. If G_P(s) = G_M(s), the output error has the form

e′(s) = y(s) − G_M(s)u(s) = G_P(s)f_u(s) + f_y(s).   (2.6)
Faults that influence the input or the output of the process result in changes of the residual e′(t) with different transients. The polynomials of G_M(s) can also be used to form a polynomial error:

e(s) = A_M(s)y(s) − B_M(s)u(s) = A_P(s)f_y(s) + B_P(s)f_u(s).   (2.7)
Equations (2.6) and (2.7) are known as parity equations (parity relations) [15]. Parity relations can also be derived from the state-space representation, which offers more freedom in their design [16]. The fault isolation strategy can be realised relatively easily for sensor faults. Indeed, using the general idea of the dedicated fault isolation scheme, it is possible to design the parity relation with the i-th sensor only, i = 1, ..., m. Thus, by assuming that all actuators are fault free, the i-th residual generator is sensitive to the i-th sensor fault only. This form of parity relations is called the single-sensor parity relation, and it has been studied in a number of papers, e.g. [71, 72]. Unfortunately, the design strategy for actuator faults is not as straightforward as that for sensor faults. It can, of course, be realised in a very similar way but, as indicated in [9, 71], the isolation of actuator faults is not always possible in the so-called single-actuator parity relation scheme. An extension of parity relations to non-linear polynomial dynamic systems was proposed in [73]. Parity relations for a more general class of non-linear systems were introduced by Krishnaswami and Rizzoni [74].
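Although (2.6) and (2.7) are written in the s-domain, the same idea applies to discrete-time models. The sketch below (an illustrative first-order model with an assumed additive sensor fault, not an example from the text) computes a discrete-time analogue of the polynomial error (2.7):

```python
import numpy as np

def parity_residual(y, u, aM, bM):
    """Discrete-time polynomial error e(k) = A_M(q)y(k) - B_M(q)u(k).

    aM, bM : model polynomial coefficients [1, a1, ..., am], [b0, ..., bn].
    With a perfect model and no faults, e(k) = 0 for all k.
    """
    T = len(y)
    e = np.zeros(T)
    for k in range(max(len(aM), len(bM)) - 1, T):
        e[k] = sum(aM[i] * y[k - i] for i in range(len(aM)))
        e[k] -= sum(bM[j] * u[k - j] for j in range(len(bM)))
    return e

# model: y(k) = 0.7 y(k-1) + 0.2 u(k-1)  =>  A_M = [1, -0.7], B_M = [0, 0.2]
aM, bM = [1.0, -0.7], [0.0, 0.2]
u = np.ones(100)
y = np.zeros(100)
for k in range(1, 100):
    y[k] = 0.7 * y[k-1] + 0.2 * u[k-1]
y[60:] += 0.5                          # additive sensor fault f_y from k = 60

e = parity_residual(y, u, aM, bM)
print(abs(e[:59]).max() < 1e-12, abs(e[61:]).max() > 0.1)   # True True
```

The residual stays at zero while the model matches the process and departs from zero as soon as the sensor fault appears.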
2.2.3 Observers
Assume that the state equations of the system have the following form:

x(k+1) = Ax(k) + Bu(k),   (2.8)
y(k) = Cx(k),   (2.9)
where A is the state transition matrix, B is the input matrix, C is the output matrix, x is the state vector, u and y are the input and output vectors, respectively. The basic idea underlying observer based approaches to fault detection is to obtain the estimates of certain measured and/or unmeasured signals [9, 15, 12]. Then, the estimates of the measured signals are compared with their
originals, i.e. the difference between the original signal and its estimate is used to form a residual of the form

r(k) = y(k) − Cx̂(k).   (2.10)
To tackle this problem, many different observers (or filters) can be employed, e.g. Luenberger observers [65] or Kalman filters [75]. From the above discussion, it is clear that the main objective is the estimation of the system outputs, while the estimation of the entire state vector is unnecessary. Since reduced-order observers can be employed, state estimation is significantly facilitated. On the other hand, to provide additional freedom to achieve the required diagnostic performance, the observer order is usually larger than the minimum possible one. The popularity of observer based fault detection schemes stems from the ever increasing use of state-space models as well as the wide usage of observers in modern control theory and applications. For these reasons, the theory of observers (or filters) is well developed (especially for linear systems), which has provided a good background for the development of observer based FDI schemes. Faults f and disturbances d can be modelled in the state equations as follows [9]:

x(k+1) = Ax(k) + Bu(k) + Ed(k) + Ff(k),   (2.11)
y(k) = Cx(k) + Δy,   (2.12)
where E is the disturbance input matrix, F is the fault matrix, and Δy denotes faults in the measurements. An observer designed for this structure is known in the literature as the Unknown Input Observer (UIO) [9, 76]. Recently, this kind of state observer was exhaustively discussed in [12]. Model linearisation is a straightforward way of extending the applicability of linear techniques to non-linear systems. On the other hand, it is well known that such approaches work well only when there is no large mismatch between the linearised model and the non-linear system. Two types of linearisation can be distinguished: linearisation around a constant state and linearisation around the current state estimate. The second type usually yields better results. Unfortunately, such linearisation usually neglects the influence of terms of higher than linear order (as in the case of the extended Luenberger observer and the extended Kalman filter). One way out of this problem is to improve the performance of linearisation based observers. Another is to use linearisation-free approaches. Unfortunately, the application of such observers is limited to certain classes of non-linear systems.
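A minimal discrete-time illustration of residual generation with a Luenberger-type observer, cf. (2.8)-(2.10), is given below. The system matrices, the hand-picked observer gain L and the fault scenario are all assumptions of this sketch, not values from the text.

```python
import numpy as np

# illustrative matrices for (2.8)-(2.9) and a hand-picked gain L;
# A - L C has eigenvalues inside the unit circle, so the error decays
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
L = np.array([[0.5], [0.2]])

def simulate(T=120, fault_at=80):
    x = np.zeros((2, 1))                 # true state
    xh = np.array([[1.0], [-1.0]])       # deliberately wrong initial estimate
    r = []
    for k in range(T):
        u = np.array([[np.sin(0.1 * k)]])
        y = C @ x
        if k >= fault_at:
            y = y + 0.4                  # additive sensor fault
        rk = y - C @ xh                  # residual (2.10)
        x = A @ x + B @ u
        xh = A @ xh + B @ u + L @ rk     # Luenberger observer update
        r.append(rk[0, 0])
    return np.array(r)

r = simulate()
# the residual dies out in the fault-free phase and jumps at the fault
print(abs(r[60:80]).max() < 1e-3, abs(r[80:]).max() > 0.2)   # True True
```

Note that only the measured output is estimated here; the residual reacts immediately at the fault onset, before the observer partially absorbs the bias.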
2.2.4 Neural Networks
Artificial neural networks have been intensively studied during the last two decades and successfully applied to dynamic system modelling [23, 19, 18, 77] as well as to fault detection and diagnosis [7, 25, 78]. Neural networks provide an interesting and valuable alternative to classical methods, because they can deal
with highly complex situations which are not defined precisely enough for deterministic algorithms to handle. They are especially useful when there is no mathematical model of the process considered, so that classical approaches such as observers or parameter estimation methods cannot be applied. Neural networks provide an excellent mathematical tool for dealing with non-linear problems [23, 79]. They have the important property that any non-linear function can be approximated with arbitrary accuracy using a neural network with a suitable architecture and weight parameters. Neural networks are parallel data processing tools capable of learning functional dependencies in data. This feature is extremely useful when solving various pattern recognition problems. Another attractive property is their self-learning ability: a neural network can extract the system features from historical training data using a learning algorithm, requiring little or no a priori knowledge about the process. This provides great flexibility in the modelling of non-linear systems. These features make it possible to design adaptive control systems for complex, unknown and non-linear dynamic processes. Neural networks are also robust with respect to incorrect or missing data; for instance, protective relaying based on artificial neural networks is not affected by a change in system operating conditions. Neural networks also offer high computation rates, large input error tolerance and adaptive capability. In general, artificial neural networks can be applied in fault diagnosis to solve both modelling and classification problems [25, 7, 6, 9, 4, 80]. To date, many neural structures with dynamic characteristics have been developed. These structures are characterized by good effectiveness in modelling non-linear processes.
Among them, one can distinguish the multi-layer perceptron with tapped delay lines, recurrent networks, and networks of the GMDH (Group Method of Data Handling) type [81]. Neural networks of the dynamic type are discussed at length in Chapter 3. Further in this chapter, Section 2.3 discusses different neural network structures and the possibilities of their application to the fault diagnosis of technical processes.
2.2.5 Fuzzy Logic
Analytical models of systems are often unknown, and the knowledge about the diagnosed system is inaccurate; it is formulated by experts in the form of if-then rules containing linguistic evaluations of process variables. In such cases, fuzzy models can be successfully applied to fault diagnosis. Such models are based on so-called fuzzy sets, defined as follows [82]:

A = {(μ_A(x), x)}, ∀x ∈ X,   (2.13)
where μ_A(x) is the membership function of the fuzzy set A, with μ_A(x) ∈ [0, 1]. The membership function realizes the mapping of the numerical space X of a variable to the range [0, 1]. A fuzzy model structure contains three main blocks: the fuzzification block, the inference block and the defuzzification block. Input signal values are introduced
to the fuzzification block. This block defines the degree of membership of the input signal in a particular fuzzy set in the following way:

μ_A(x): X → [0, 1].   (2.14)
Fuzzy sets are assigned to each input and output, and linguistic values, e.g. small, medium, large, are attributed to particular fuzzy sets. Within the inference block, the knowledge about the system is described in the form of rules of the form

R_i: if (x_1 = A_1j) and (x_2 = A_2k) and ... then (y = B_l),   (2.15)
where x_n is the n-th input, A_nk is the k-th fuzzy set of the n-th input, y represents the output, and B_l denotes the l-th fuzzy set of the output. The set of all fuzzy rules constitutes the rule base. On the basis of the resulting membership function of the output, a precise (crisp) value of the output is calculated in the defuzzification block. The expert's knowledge can be used for designing the model. Unfortunately, this direct approach to model construction has serious disadvantages: if the expert's knowledge is incomplete or faulty, an incorrect model is obtained. While designing a model, one should also utilize the measurement data. It is therefore advisable to combine the expert's knowledge with the available data: the expert's knowledge is useful for defining the structure and initial parameters of the model, while the data are helpful for model adjustment. This concept has been applied in so-called fuzzy neural networks. They are convenient modelling tools for residual generation, since they combine the fuzzy modelling technique with neural training algorithms. More details about fuzzy neural networks can be found in [83, 84, 85, 86, 87].
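The fuzzification-inference-defuzzification pipeline can be illustrated with a toy rule base of the form (2.15). Everything below — the triangular membership functions, the Sugeno-style singleton rule outputs and the weighted-average defuzzification — is an illustrative assumption of this sketch, not a design from the text.

```python
def trimf(x, a, b, c):
    """Triangular membership function mu_A(x) on [a, c] with peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# fuzzification: linguistic values for a residual signal (illustrative sets)
def fuzzify(r):
    return {"small":  trimf(r, -0.2, 0.0, 0.2),
            "medium": trimf(r,  0.1, 0.3, 0.5),
            "large":  trimf(r,  0.4, 0.7, 1.0)}

# rule base of the form (2.15), with crisp output singletons (Sugeno-style)
RULES = [({"small"},  0.0),    # if r is small  then fault degree = 0
         ({"medium"}, 0.5),    # if r is medium then fault degree = 0.5
         ({"large"},  1.0)]    # if r is large  then fault degree = 1

def infer(r):
    mu = fuzzify(r)
    num = den = 0.0
    for sets, out in RULES:
        w = min(mu[s] for s in sets)   # rule activation degree
        num += w * out
        den += w
    return num / den if den else 0.0   # defuzzification (weighted average)

print(infer(0.05), infer(0.65))        # prints 0.0 1.0
```

A small residual yields a fault degree of zero, a large one a degree near one; in a fuzzy neural network the membership parameters would additionally be tuned from data.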
2.3 Neural Networks in Fault Diagnosis

Artificial neural networks, due to their ability to learn and generalize non-linear functional relationships between input and output variables, provide a flexible mechanism for learning and recognising system faults. Among a variety of architectures, the notable ones are feedforward and recurrent networks. Feedforward networks are commonly used in pattern recognition tasks, while recurrent networks are used to construct dynamic models of the process. Recurrent networks are discussed in Chapter 3 and are outside the scope of this section. Below, the neural networks most frequently used in fault diagnosis are briefly presented.
2.3.1 Multi-layer Feedforward Networks
Artificial neural networks are constructed with a certain number of single processing units which are called neurons. The McCulloch-Pitts model (Fig. 2.5) is the fundamental, classical neuron model and it is described by the equation
y = σ( Σ_{i=1}^{n} w_i u_i + b ),   (2.16)
where u_i, i = 1, 2, ..., n, denotes the neuron inputs, b is the bias (threshold), w_i denotes the synaptic weight coefficients, and σ(·) is the non-linear activation function. There are many modifications of the above neuron model, resulting from the application of different activation functions. McCulloch and Pitts used the unit step as the activation function. In 1960, Widrow and Hoff applied the linear activation function, creating in this way the Adaline neuron [88, 89]. In recent years, sigmoid and hyperbolic tangent functions [23, 29, 90] have been used most frequently. The choice of a suitable activation function depends on the specific application of the neural network. The multi-layer perceptron is a network in which the neurons are grouped into layers (Fig. 2.6). Such a network has an input layer, one or more hidden layers, and an output layer. The main task of the input units (black squares) is the preliminary processing of the input data u = [u_1, u_2, ..., u_n]^T and passing them on to the elements of the hidden layer. This processing can comprise scaling, filtering or signal normalization, among others. The fundamental neural data processing is carried out in the hidden and output layers. It should be noticed that the links between neurons are designed in such a way that each element of the previous layer is connected with each element of the next layer. These connections are assigned suitable weight coefficients which are determined, for each separate case, depending on the task the network should solve. The output layer generates the network response vector y. The non-linear neural computation performed by the network shown in Fig. 2.6 can be expressed as

y = σ^3{W^3 σ^2[W^2 σ^1(W^1 u)]},   (2.17)
where σ^1, σ^2 and σ^3 are vector-valued activation functions which define the neural signal transformation through the first, second and output layers; W^1, W^2 and W^3 are the matrices of weight coefficients which determine the intensity of the connections between neurons in the neighbouring layers; u and y are the input and output vectors, respectively. One of the fundamental advantages of neural networks is their ability to learn and adapt. From the technical point of view, the training of
Fig. 2.5. Neuron scheme with n inputs and one output
Fig. 2.6. Three layer perceptron with n inputs and m outputs
a neural network is nothing else but the determination of the weight coefficient values between the neighbouring processing units. The fundamental training algorithm for feedforward multi-layer networks is the Back-Propagation (BP) algorithm [91, 92, 93]. It gives a prescription for how to change an arbitrary weight value assigned to a connection between processing units in the neighbouring layers of the network. The algorithm is iterative and is based on the minimisation of a sum-squared error using the gradient descent optimisation method. The modification of the weights is performed according to the formula

w(k+1) = w(k) − η∇J(w(k)),   (2.18)
where w(k) denotes the weight vector at the discrete time k, η is the learning rate, and ∇J(w(k)) is the gradient of the performance index J with respect to the weight vector w. The back-propagation algorithm is widely used, and in the last few years numerous modifications and extensions of it have been proposed [90]. Unfortunately, the standard BP algorithm is slowly convergent. To overcome this inconvenience, modified techniques can be used. One of them uses the momentum factor [94]. Another way to speed up the convergence of the training algorithm is to use adaptable parameters [95]. Besides the above techniques, there are many other modifications of BP which have proved their usefulness in practical applications. It is worth mentioning the quickprop algorithm [96, 97], resilient backpropagation [98], the Levenberg-Marquardt algorithm [99] and conjugate gradient methods [100].
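A compact sketch of gradient-descent training (2.18) for a one-hidden-layer perceptron, cf. (2.16)-(2.17), is given below. The toy task, the network size and the learning rate are arbitrary choices for illustration, and plain batch gradient descent stands in for the BP variants cited above.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = np.tanh                                # activation sigma(.)

# one hidden layer for brevity; (2.17) stacks two such transformations
W1 = 0.5 * rng.standard_normal((8, 1))
W2 = 0.5 * rng.standard_normal((1, 8))
b1, b2 = np.zeros((8, 1)), np.zeros((1, 1))

u = np.linspace(-2, 2, 64).reshape(1, -1)      # inputs
d = np.sin(u)                                  # desired outputs

eta = 0.05
for _ in range(5000):
    # forward pass through the layers
    a1 = W1 @ u + b1
    h = sigma(a1)
    y = W2 @ h + b2                            # linear output layer
    e = y - d
    J = 0.5 * np.mean(e**2)                    # sum-squared performance index
    # backward pass: gradients of J w.r.t. the weights
    n = u.shape[1]
    dW2 = e @ h.T / n
    db2 = e.mean(axis=1, keepdims=True)
    dh = W2.T @ e
    da1 = dh * (1 - h**2)                      # tanh'(a) = 1 - tanh(a)^2
    dW1 = da1 @ u.T / n
    db1 = da1.mean(axis=1, keepdims=True)
    # weight update (2.18): w <- w - eta * grad J
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print(J < 0.05)
```

The loss decreases steadily to a small value; momentum, adaptive rates or Levenberg-Marquardt would reach the same accuracy in far fewer iterations.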
2.3.2 Radial Basis Function Network
In recent years, Radial Basis Function (RBF) networks have been enjoying greater and greater popularity as an alternative to the slowly convergent multi-layer perceptron. Similarly to the multi-layer perceptron, the radial basis network has the ability to model any non-linear function [101, 102]. However, this kind of network may need many nodes to achieve the required approximating
properties. This issue is similar to the choice of the number of hidden layers and neurons in the multi-layer perceptron. The RBF network architecture is shown in Fig. 2.7. Such a network has three layers: the input layer, a single non-linear hidden layer, and a linear output layer. It should be noticed that the weights connecting the input and hidden layers are all equal to one, which means that the input data are passed on to the hidden layer without any weighting operation. The output φ_i of the i-th neuron of the hidden layer is a non-linear function of the Euclidean distance between the input vector u = [u_1, ..., u_n]^T and the vector of centres c_i = [c_i1, ..., c_in]^T, and can be described by the following expression:

φ_i = ϕ(‖u − c_i‖, ρ_i),   i = 1, ..., v,   (2.19)
where ρ_i denotes the spread of the i-th basis function, ‖·‖ is the Euclidean norm, and v is the number of hidden neurons. The network output y is a weighted sum of the hidden neurons' outputs:

y = Θφ,   (2.20)
where Θ denotes the matrix of connection weights between the hidden neurons and the output elements, and φ = [φ_1, ..., φ_v]^T. Many different functions ϕ(·) have been suggested. The most frequently used are Gaussian functions,

ϕ(z, ρ) = exp(−z²/ρ²),   (2.21)

as well as inverse quadratic functions:

ϕ(z, ρ) = (z² + ρ²)^(−1/2).   (2.22)
A fundamental design decision for the RBF network is the selection of the number of basis functions and the positions of their centres. Too small a number of centres can result in weak approximating properties. On the other hand, the number of centres required increases exponentially with the size of the input space of the
Fig. 2.7. Structure of the radial basis function network with n inputs and m outputs
network. Hence, it is unsuitable to use the RBF network in problems with a large input space. To train such a network, hybrid techniques are used. First, the centres and the spreads of the basis functions are established heuristically. After that, the adjustment of the weights is performed. The centres of the radial basis functions can be chosen in many ways, e.g. as values drawn randomly over the input space or by clustering algorithms [103, 104], which give statistically the best choice of the number of centres and of their positions. When the centre values are established, the objective of the learning algorithm is to determine the optimal weight matrix Θ, which minimises the difference between the desired and the actual network response. The output of the network is linear in the weights, and that is why traditional regression methods can be used to estimate the weight matrix. Examples of such techniques are the orthogonal least squares method [105] and the Kaczmarz algorithm [104]. The former guarantees fast convergence and fast training of the RBF network. In contrast, Kaczmarz's algorithm is less numerically complicated, but it is sometimes slowly convergent, for example when the system equations are ill-conditioned.
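Since the output (2.20) is linear in the weights, fitting an RBF network reduces to a linear least-squares problem once the centres and spreads are fixed. The sketch below uses a heuristic grid of centres and ordinary least squares in place of the orthogonal least squares or Kaczmarz algorithms mentioned above; the toy target function and all numbers are illustrative assumptions.

```python
import numpy as np

def rbf_design(U, centres, rho):
    """Hidden-layer outputs phi_i(u) = exp(-||u - c_i||^2 / rho^2), cf. (2.21)."""
    d2 = ((U[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho**2)

# toy 1-D approximation problem
U = np.linspace(-3, 3, 100).reshape(-1, 1)
d = np.sinc(U).ravel()

centres = np.linspace(-3, 3, 12).reshape(-1, 1)   # fixed heuristically (grid)
rho = 0.8                                          # common spread for all bases

Phi = rbf_design(U, centres, rho)
# the output (2.20) is linear in Theta, so ordinary least squares suffices
Theta, *_ = np.linalg.lstsq(Phi, d, rcond=None)

err = np.abs(Phi @ Theta - d).max()
print(err < 0.05)
```

With well-spread centres the fit is accurate; clustering algorithms would choose the centres from data instead of a fixed grid.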
2.3.3 Kohonen Network
The Kohonen network is a self-organizing map. Such a network can learn to detect regularities and correlations in its input and adapt its future responses to that input accordingly. The network parameters are adapted by a learning procedure based on input patterns only (unsupervised learning). Contrary to standard supervised learning methods, unsupervised ones use input signals to extract knowledge from data. During learning, there is no feedback from the environment or the investigated process. Therefore, the neurons and weighted connections should have a certain level of self-organization. Moreover, unsupervised learning is only useful and effective when there is redundancy in the learning patterns. The inputs and the neurons in the competitive layer are fully connected. The competitive layer is also the network output, which generates the response of the Kohonen network. The weight parameters are adapted using the winner-takes-all rule as follows [106]:

i = arg min_j ‖u − w_j‖,   (2.23)
where u is the input vector, i is the index of the winner, and w_j is the weight vector of the j-th neuron. However, instead of adapting only the winning neuron, all neurons within a certain neighbourhood of the winner are adjusted according to the formula

w_j(k+1) = w_j(k) + η(k)C(k)(u(k) − w_j(k)),   (2.24)
where η(k) is the learning rate and C(k) is a neighbourhood function. The learning rate and the neighbourhood size are altered through two phases: an ordering phase and a tuning phase. The iterative character of the learning rate leads to a gradual establishment of the feature map. During the first phase, the neuron weights are
expected to order themselves in the input space consistently with the associated neuron positions. During the second phase, the learning rate continues to decrease, but very slowly. The small value of the learning rate finely tunes the network while keeping the ordering learned in the previous phase stable. In the Kohonen learning rule, the learning rate is a monotonically decreasing function of time. Frequently used functions are η(k) = 1/k or η(k) = ak^(−a) for 0 < a ≤ 1. The concept of the neighbourhood is extremely important during network processing. A suitably defined neighbourhood determines the number of adapting neurons, e.g. 7 neurons belong to a neighbourhood of radius 1 defined on a hexagonal grid, while a neighbourhood of radius 1 arranged on a rectangular grid includes 9 neurons. A dynamic change of the neighbourhood size beneficially influences the speed of feature map ordering. The learning process starts with a large neighbourhood size. Then, as the neighbourhood size decreases to 1, the map tends to order itself topologically over the presented input vectors. Once the neighbourhood size is 1, the network should be fairly well ordered, and the learning rate slowly decreases over a longer period to give the neurons time to spread out evenly across the input vectors. A typical neighbourhood function is the Gaussian one [23, 90]. After designing the network, a very important task is to associate the clustering results generated by the network with the desired results for a given problem. It is necessary to determine which regions of the feature map will be active during the occurrence of a given fault. The remaining part of this section is devoted to different fault diagnosis schemes. Modern methods of FDI of dynamic systems can be split into three broad categories: model based approaches, knowledge based approaches, and data analysis based approaches. In the following sections, all three classes are discussed with emphasis on the role of neural networks in each scheme.
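The competitive update (2.23)-(2.24) with a Gaussian neighbourhood and decreasing learning rate can be sketched as follows. The map size, the decay schedules and the two-cluster data are illustrative assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
grid = np.array([(i, j) for i in range(6) for j in range(6)])  # 6x6 map
W = rng.uniform(size=(36, 2))                                  # weight vectors

def som_step(u, W, eta, radius):
    i = np.argmin(((W - u) ** 2).sum(1))          # winner, rule (2.23)
    d2 = ((grid - grid[i]) ** 2).sum(1)           # map distance to winner
    C = np.exp(-d2 / (2 * radius**2))             # Gaussian neighbourhood
    W += eta * C[:, None] * (u - W)               # update (2.24)
    return W

# two-cluster data; the map should allocate distinct regions to each cluster
data = np.vstack([rng.normal(0.2, 0.05, (200, 2)),
                  rng.normal(0.8, 0.05, (200, 2))])
for k, u in enumerate(rng.permutation(data)):
    eta = max(0.02, 0.5 / (1 + 0.02 * k))          # decreasing learning rate
    radius = max(0.5, 3.0 * (1 - k / len(data)))   # shrinking neighbourhood
    W = som_step(u, W, eta, radius)

w1 = np.argmin(((W - [0.2, 0.2]) ** 2).sum(1))     # winner for cluster 1
w2 = np.argmin(((W - [0.8, 0.8]) ** 2).sum(1))     # winner for cluster 2
print(w1 != w2)
```

In a fault diagnosis application, the regions of the trained map would then be labelled with the fault classes they respond to.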
2.3.4 Model Based Approaches
Model based approaches generally utilise results from the field of control theory and are based on parameter estimation or state estimation. The approach is founded on the fact that a fault will cause changes in certain physical parameters which in turn will lead to changes in some model parameters or states. When using this approach, it is essential to have quite accurate models of the process considered. Technological plants are often complex dynamic systems described by non-linear high-order differential equations. For their quantitative modelling for residual generation, simplifications are inevitable. This usually concerns both the reduction of dynamics order and linearisation. Another problem arises from unknown or time variant process parameters. Due to all these difficulties, conventional analytical models often turn out to be not accurate enough for effective residual generation. In this case, knowledge based models are the only alternative. For the model based approach, the neural network replaces the analytical model that describes the process under normal operating conditions. First, the network has to be trained for this task. The learning data can be collected directly from the process, if possible, or from a simulation model that is as realistic as possible. The latter possibility is of special interest for data acquisition
Fig. 2.8. Model based fault diagnosis using neural networks
in different faulty situations in order to test the residual generator, as such data are generally not available from the real process. The training process can be carried out off-line or on-line, depending on the availability of data. The possibility of training a network on-line is very attractive, especially when adapting a neural model to a changing environment or to non-stationary systems. After the training is finished, the neural network is ready for on-line residual generation. To be able to capture the dynamic behaviour of the system, the neural network should have dynamic properties, e.g. it should be a recurrent network. Residual evaluation is a decision-making process that transforms quantitative knowledge into qualitative Yes or No statements. It can also be seen as a classification problem: the task is to match each pattern of the symptom vector with one of the pre-assigned classes of faults or the fault-free case. This process may benefit greatly from the use of intelligent decision making. To perform residual evaluation, neural networks can be applied, e.g. feedforward networks or self-organizing maps. Figure 2.8 presents the block scheme of model based fault diagnosis designed using neural networks. Neural networks have been successfully used in many applications, including model based fault diagnosis. Among many, it is worth noting several applications. Neural networks have been used for fault detection and classification in chemical processes: batch polymerisation and a distillation column [107]. Multi-layer feedforward networks with delays have been used to model the chemical processes, and an RBF network has been applied as a classifier. Chen and Lee used a neural network based scheme for fault detection and diagnosis in the framework of fault tolerant control of an F-16 aircraft simulator [108]. The authors used an RBF network with delays to model the full non-linear dynamics of an F-16 aircraft flight simulator.
After that, with the help of the well-known multi-layer perceptron, a decision about faults was made. There is also a variety of papers showing the application of recurrent networks to model based fault diagnosis. A fault diagnosis scheme to detect and diagnose transient faults in a turbine waste gate of a diesel engine was reported in [109]. An observer based fault detection and isolation system of a
three-tank laboratory system was discussed in [39]. Model based fault diagnosis of sensor and actuator faults in a sugar evaporator using recurrent networks was presented in [26].
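The on-line part of the scheme in Fig. 2.8 can be illustrated as below. For brevity, an exact copy of a toy process equation stands in for a trained recurrent neural model, and residual evaluation is reduced to a fixed threshold test; all numerical values are assumptions of this sketch.

```python
import numpy as np

def process(y_prev, u, fault=0.0):
    """Toy first-order non-linear plant (an assumption of this sketch)."""
    return 0.6 * y_prev + 0.3 * np.tanh(u) + fault

# a trained neural model of the fault-free plant is assumed; here an exact
# copy of the process equation plays its role so the scheme stays visible
neural_model = lambda y_prev, u: 0.6 * y_prev + 0.3 * np.tanh(u)

rng = np.random.default_rng(5)
threshold = 0.05              # would be set from fault-free residual statistics
y = yh = 0.0
alarms = []
for k in range(200):
    u = np.sin(0.05 * k)
    f = 0.3 if k >= 120 else 0.0            # abrupt process fault at k = 120
    y = process(y, u, fault=f) + 0.005 * rng.standard_normal()  # noisy plant
    yh = neural_model(yh, u)                # model runs in parallel
    r = y - yh                              # residual generation
    alarms.append(abs(r) > threshold)       # residual evaluation

print(not any(alarms[:110]), all(alarms[130:]))
```

Before the fault the residual reduces to the measurement noise and no alarms are raised; after the fault it settles well above the threshold.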
2.3.5 Knowledge Based Approaches
Knowledge based approaches are generally based on expert or qualitative reasoning [110]. Several knowledge based fault diagnosis approaches have been proposed. These include the rule based approach, where diagnostic rules can be formulated from the process structure and unit functions, and the qualitative simulation based approach. In the rule based approach, faults are usually diagnosed by causally tracing symptoms backward along their propagation paths. Fuzzy reasoning can be used in the rule based approach to handle uncertain information. In the qualitative simulation based approach, qualitative models of a process are used to predict the behaviour of the process under normal operating conditions and in various faulty situations. Fault detection and diagnosis is then performed by comparing the predicted behaviour with the actual observations. The methods that fall into this category can be viewed as fault analysers, because their objective is to decide whether or not a fault has occurred in the system based on a set of logical rules that are either pre-programmed by an expert or learned through a training process (Fig. 2.9). When data about the process states or operating conditions are passed on to the fault analyser, they are checked against the rule base stored there and a decision about the operating condition of the system is made. Neural networks are an excellent tool for designing such fault analysers [64]. The well-known feedforward multi-layer networks are most frequently used. Summarizing, to develop knowledge based diagnostic systems, knowledge about the process structure, the process unit functions and qualitative models of the process units under various faulty conditions is required. Therefore, the development of a knowledge based diagnosis system is generally effort demanding.
2.3.6 Data Analysis Approaches
In data analysis based approaches, process operational data covering various normal and abnormal operations are used to extract diagnostic knowledge. Two main methods exist: neural network based fault diagnosis and fault diagnosis based on multivariate statistical data analysis. In neural network based fault diagnosis, the only knowledge required is the training data, which contain faults and their symptoms. The fault symptoms take the form of variations in process measurements. Through training, the relationships between the faults and their symptoms can be discovered and stored as network weights. The trained network can then be used to diagnose faults in such a way that it associates the observed abnormal conditions with their corresponding faults. This group of approaches uses neural networks as pattern classifiers (Fig. 2.10). In multivariate statistical data analysis techniques, fault signatures are extracted from process
2 Modelling Issue in Fault Diagnosis
Fig. 2.9. Model-free fault diagnosis using neural networks
operational data through multivariate statistical methods such as principal component analysis, projection to latent structures or non-linear principal component analysis [111]. It should be mentioned that statistical data analysis such as principal component analysis can be carried out by means of neural network training, e.g. with the Generalized Hebbian Algorithm (GHA) or the Adaptive Principal-component EXtractor (APEX) algorithm, which utilize a single perceptron network or its modifications [23]. There is a rich bibliography reporting applications of neural networks in the framework of data analysis to the fault diagnosis of technical and industrial processes. Karpenko and colleagues implemented a neural network of the feedforward type to detect and identify actuator faults in a pneumatic control valve [112]. The network was trained to assign each operating condition to a specific class, and to estimate the magnitude of the faulty condition. On-line fault diagnosis of a continuous stirred tank reactor using multiple neural networks is reported in [113]. The achieved results confirmed that a multiple network structure based system gives more reliable diagnoses than a single neural network. In turn, a self-organizing competitive neural network was applied to the fault diagnosis of a sucker rod pumping system in [114]. The authors obtained a high quality fault classifier performing better than a classical feedforward network. Neural networks can also be useful in the area of feature extraction. Glowacki and co-workers used a multi-layer feedforward network trained with the Levenberg-Marquardt algorithm for the fault detection of a DC motor [115]. A similar approach was used in [116] for sensor fault isolation and reconstruction.
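As noted above, principal component analysis can be realized through neural network training. A minimal sketch, assuming NumPy, of Oja's learning rule — the single-component special case on which the GHA builds — with illustrative data and learning rate:

```python
import numpy as np

def oja_first_component(X, eta=0.01, epochs=50, seed=0):
    """Estimate the first principal direction of zero-mean data X with
    Oja's rule: w <- w + eta * y * (x - y * w), where y = w.x."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x
            w += eta * y * (x - y * w)  # Hebbian term minus weight decay
    return w / np.linalg.norm(w)

# Illustrative correlated 2-D data whose dominant direction is ~(1, 1)
rng = np.random.default_rng(1)
t = rng.normal(size=500)
X = np.column_stack([t + 0.1 * rng.normal(size=500),
                     t + 0.1 * rng.normal(size=500)])
X -= X.mean(axis=0)
w = oja_first_component(X)
```

The learned weight vector converges, up to sign, to the leading eigenvector of the data covariance matrix, which is exactly the first principal component.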
2.4 Evaluation of the FDI System

Each diagnostic algorithm or method should be validated to confirm its effectiveness and usefulness for real-world fault diagnosis. In this section, we define a set of indices needed to evaluate an FDI system. The benchmark zone is defined from the benchmark start-up time $t_{on}$ to the benchmark time horizon $t_{hor}$.
Fig. 2.10. Fault diagnosis as pattern recognition

Fig. 2.11. Definition of the benchmark zone
Figure 2.11 illustrates the benchmark zone definition. Decisions before the benchmark start-up $t_{on}$ and after the benchmark time horizon $t_{hor}$ are not of interest. The time of the fault start-up is denoted by $t_{from}$. When a fault occurs in the system, a residual should deviate from the level assigned to the fault-free case (Fig. 2.11). The quality of the fault detection system can be evaluated using a number of performance indices [52, 26]:
• Time of fault detection $t_{dt}$ – the period of time needed for the detection of a fault, measured from $t_{from}$ to a permanent, true decision about a fault, as presented in Fig. 2.11. As one can see there, the first three true decisions are temporary ones and are not taken into account when determining $t_{dt}$;
• False detection rate $r_{fd}$, defined as follows:
\[ r_{fd} = \frac{\sum_i t^i_{fd}}{t_{from} - t_{on}}, \qquad (2.25) \]
where $t^i_{fd}$ is the period of the $i$-th false fault detection. This index is used to check the system in the fault-free case. Its value shows the percentage of false alarms. In the ideal case (no false alarms), its value should be equal to 0;
• True detection rate $r_{td}$, given by
\[ r_{td} = \frac{\sum_i t^i_{td}}{t_{hor} - t_{from}}, \qquad (2.26) \]
where $t^i_{td}$ is the period of the $i$-th true fault detection. This index is used in the case of faults and describes the efficiency of fault detection. In the ideal case (fault detected immediately and surely), its value is equal to 1;
• Isolation time $t_{it}$ – the period of time from the beginning of the fault start-up $t_{from}$ to the moment of fault isolation;
• False isolation rate $r_{fi}$, represented by the formula
\[ r_{fi} = \frac{\sum_i t^i_{fi}}{t_{from} - t_{on}}, \qquad (2.27) \]
where $t^i_{fi}$ is the period of the $i$-th false fault isolation;
• True isolation rate $r_{ti}$, defined by the following equation:
\[ r_{ti} = \frac{\sum_i t^i_{ti}}{t_{hor} - t_{from}}, \qquad (2.28) \]
where $t^i_{ti}$ is the period of the $i$-th true fault isolation.
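A minimal sketch of how the detection indices above could be computed from a sampled binary decision signal. A unit sampling period is assumed, and the function and variable names are illustrative:

```python
import numpy as np

def detection_indices(decision, t_on, t_from, t_hor):
    """Compute benchmark indices from a binary decision signal sampled
    at unit intervals: false detection rate r_fd (2.25), true detection
    rate r_td (2.26) and detection time t_dt, i.e. the time from t_from
    to the first permanently true decision."""
    d = np.asarray(decision, dtype=bool)
    # r_fd: alarm time accumulated in the fault-free zone [t_on, t_from)
    r_fd = d[t_on:t_from].sum() / (t_from - t_on)
    # r_td: alarm time accumulated in the faulty zone [t_from, t_hor)
    faulty = d[t_from:t_hor]
    r_td = faulty.sum() / (t_hor - t_from)
    # t_dt: time to the decision that stays true until t_hor; temporary
    # alarms followed by a false decision are skipped over
    t_dt = None
    idx = np.flatnonzero(~faulty)
    first_permanent = 0 if idx.size == 0 else idx[-1] + 1
    if first_permanent < faulty.size:
        t_dt = first_permanent
    return r_fd, r_td, t_dt

# Illustrative signal: no alarms before the fault at k = 10, two
# temporary alarms, then a permanent alarm from k = 14 onwards
decision = [0] * 10 + [1, 0, 1, 0] + [1] * 6
r_fd, r_td, t_dt = detection_indices(decision, t_on=0, t_from=10, t_hor=20)
```

For this signal the fault-free zone contains no alarms ($r_{fd} = 0$), the alarm is active for 8 of the 10 faulty samples ($r_{td} = 0.8$), and the two temporary alarms are ignored, so $t_{dt} = 4$.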
2.5 Summary

The diagnostics of industrial processes is a scientific discipline which has been extensively developed in recent years. It is very difficult to imagine a large industrial plant without a monitoring or diagnostic system. It is clear that providing fast and reliable fault diagnosis is a part of control design. Unfortunately, most control systems exhibit non-linear behaviour, which makes it impossible to use classical methods such as parameter estimation, parity relations or observers. Taking the above into account, there is a need for techniques able to cope with non-linearities. One possible solution is to use artificial intelligence methods. Artificial neural networks have gained a more and more prominent
position in fault detection systems. This chapter presented three main classes of fault diagnosis methods, i.e. model based, knowledge based and data analysis based approaches, with emphasis on the role of neural networks in these schemes. Generally, based on the presented case studies, neural networks can be used in two ways: to construct a model of the process considered or to perform classification tasks. In the remainder of this book, neural networks are discussed in the framework of modelling and model based fault diagnosis. Special attention is paid to the so-called locally recurrent networks. In order to properly use this dynamic type of neural networks for modelling or residual generation, a number of problems have to be solved, e.g. deriving training algorithms, investigating approximation abilities and stability problems, and selecting optimal training sequences. These problems are presented in the forthcoming chapters. Neural network based algorithms for residual evaluation are also considered.
3 Locally Recurrent Neural Networks
Artificial neural networks provide an excellent mathematical tool for dealing with non-linear problems [18, 23, 77]. They have an important property according to which any continuous non-linear relation can be approximated with arbitrary accuracy using a neural network with a suitable architecture and weight parameters. Another attractive property is their self-learning ability. A neural network can extract the system features from historical training data using a learning algorithm, requiring little or no a priori knowledge about the process. This makes the modelling of non-linear systems very flexible [18, 19]. However, the application of neural networks to the modelling or fault diagnosis of control systems requires taking into account the dynamics of the processes or systems considered. To be dynamic, a neural network must contain a memory. The memory can be divided into short-term memory and long-term memory, depending on the retention time [117, 36, 118, 23]. Short-term memory refers to a compilation of knowledge representing the current state of the environment. In turn, long-term memory refers to knowledge stored for a long time or permanently. One simple way of incorporating memory into the structure of a neural network is the use of time delays, which can be implemented at the synaptic level or in the input layer of the network. Another important way in which dynamics can be built into the operation of a neural network in an implicit manner is through the use of feedback. There are two basic methods of incorporating feedback into a neural network: local feedback at the level of a single neuron inside the network and global feedback encompassing the whole network. Neural networks with one or more feedbacks are referred to as recurrent networks. This chapter is mainly focused on locally recurrent networks.
The chapter is organized as follows: the first part, consisting of Sections 3.1, 3.2, 3.3 and 3.4, deals with network structures in which dynamics are realized using time delays and global feedbacks. Section 3.5 presents locally recurrent structures, with the main emphasis on networks designed with neuron models containing infinite impulse response filters. Training methods intended for use with locally recurrent networks are described in Section 3.6. Three algorithms are proposed: extended dynamic back-propagation, adaptive random search and simultaneous perturbation stochastic approximation. The chapter concludes with some final remarks in Section 3.7.

K. Patan: Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes, LNCIS 377, pp. 29–63, 2008. © Springer-Verlag Berlin Heidelberg 2008
3.1 Neural Networks with External Dynamics

The most commonly used neural network for modelling processes is the multi-layer perceptron. This class of neural models, however, is of a static type and can be used to approximate any continuous non-linear, although static, function [33, 32]. Therefore, neural network modelling of control systems should take into account the dynamics of the processes or systems considered. Two main methods exist to provide a static neural network with dynamic properties: the insertion of an external memory into the network or the use of feedback. The strategy most frequently applied to model dynamic non-linear mappings is the external dynamics approach [20, 21, 23, 19, 18]. It is based on the non-linear input/output model
\[ y_m(k+1) = f\bigl(y(k), \dots, y(k-m), u(k), \dots, u(k-m)\bigr), \qquad (3.1) \]
where $f(\cdot)$ is a non-linear function, $u(k)$ is the input, $y(k)$ and $y_m(k)$ are the outputs of the process and the model, respectively, and $m$ is the order of the process. The non-linear model is clearly separated into two parts: a non-linear static approximator (multi-layer perceptron) and an external dynamic filter bank (tapped delay lines) (Fig. 3.1). As a result, a model known as the multi-layer perceptron with tapped delay lines (time-delay neural network) is obtained. Time-delay neural networks can describe a large class of systems but are not as general as non-linear state-space models. Limitations are observed for processes with non-unique non-linearities, e.g. hysteresis or backlash, where internal unmeasurable states play a decisive role, and partly for processes with non-invertible non-linearities [119, 18]. Moreover, the problem of order selection is not yet satisfactorily solved. This problem is equivalent to the determination of the relevant inputs of the function $f(\cdot)$. If the order of the process is known, all necessary past inputs and outputs should be fed to the network. In this way, the input space
Fig. 3.1. External dynamics approach realization
of the network becomes large. In many practical cases, there is no possibility of learning the order of the modelled process, and the number of suitable delays has to be selected experimentally using a trial and error procedure [19]. Many papers show that the multi-layer perceptron is able to predict the outputs of various dynamic processes with high precision, but its inherent non-linearity makes assuring stability a hard task, especially in cases in which the output of the network is fed back to the network input, as in the parallel model [20, 19]. There are also situations in which this type of network is not capable of capturing the whole plant state information of the modelled process [120, 119]. The use of real plant outputs avoids many of the analytical difficulties encountered, assures stability and simplifies the identification procedure. This type of feedforward network is known as the series-parallel model, introduced by Narendra and Parthasarathy [20]. Such networks are capable of modelling systems if they have a weakly visible state, i.e. if there is an input-output equivalent of the system whose state is a function of a fixed set of finitely many past values of its inputs and outputs [120]. Otherwise, the model has a strongly hidden state and its identification requires recurrent networks of a fairly general type. Recurrent networks are neural networks with one or more feedback loops. As a result of feedbacks introduced into the network structure, it is possible to accumulate information and use it later. Feedbacks can be either of a local or a global type. Taking into account the possible location of feedbacks, recurrent networks can be divided as follows [38, 23, 40]:
• Globally recurrent networks – feedbacks are allowed between neurons of different layers or between neurons of the same layer. Such networks incorporate a static multi-layer perceptron or parts of it.
Moreover, they exploit the non-linear mapping capability of the multi-layer perceptron. Basically, three kinds of networks can be distinguished [29, 18]: fully recurrent networks, partially recurrent networks, state-space networks; • Locally recurrent networks – there are feedbacks only inside neuron models. This means that there are neither feedback connections between neurons of successive layers nor lateral links between neurons of the same layer. These networks have a structure similar to static feedforward ones, but consist of the so-called dynamic neuron models.
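As an illustration of the external dynamics approach of (3.1), the sketch below builds the tapped-delay-line regressor and, for brevity, fits a linear map in place of the multi-layer perceptron. NumPy is assumed, and the second-order test process is illustrative:

```python
import numpy as np

def build_regressors(u, y, m):
    """Form the regressor of (3.1): each row collects
    y(k), ..., y(k-m), u(k), ..., u(k-m); the target is y(k+1)."""
    u, y = np.asarray(u, float), np.asarray(y, float)
    rows, targets = [], []
    for k in range(m, len(y) - 1):
        rows.append([y[k - i] for i in range(m + 1)]
                    + [u[k - i] for i in range(m + 1)])
        targets.append(y[k + 1])
    return np.array(rows), np.array(targets)

# Illustrative second-order linear process:
# y(k+1) = 0.5 y(k) - 0.2 y(k-1) + u(k) + 0.3 u(k-1)
rng = np.random.default_rng(0)
u = rng.normal(size=200)
y = np.zeros(201)
for k in range(1, 200):
    y[k + 1] = 0.5 * y[k] - 0.2 * y[k - 1] + u[k] + 0.3 * u[k - 1]
Phi, t = build_regressors(u, y[:200], m=1)
# A linear least-squares fit stands in for the static approximator
theta, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```

Because the test process is itself linear in the regressor entries, the fit recovers the true coefficients exactly; with a real non-linear process, a multi-layer perceptron would be trained on the same `(Phi, t)` pairs.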
3.2 Fully Recurrent Networks

The most general architecture of the recurrent neural network was proposed by Williams and Zipser in [35]. This structure is often called the Real Time Recurrent Network (RTRN), because it has been designed for real-time signal processing. The network consists of M neurons, each of which creates a feedback. Each link between two neurons represents an internal state of the model. Any connections between neurons are allowed. Thus, a fully connected neural
Fig. 3.2. Fully recurrent network of Williams and Zipser
architecture is obtained. Only m of the M neurons are established as output units. The remaining H = M − m neurons are hidden ones. The scheme of the RTRN architecture with n inputs and m outputs is shown in Fig. 3.2. This network is not organized in layers and has no feedforward architecture. The fundamental advantage of such networks is the possibility of approximating a wide class of dynamic relations. This kind of network, however, exhibits some well-known disadvantages. The first of them is large structural complexity: O(n²) weights are needed for n neurons. Also, the training of the network is usually complex and slowly convergent [29, 23, 40]. Moreover, there are problems with keeping the network stable. Generally, this dynamic structure seems to be too complex for practical applications. Moreover, the fixed relation between the number of states and the number of neurons does not allow one to adjust the dynamics order and the non-linear properties of the model separately. Bearing in mind these disadvantages, fully recurrent networks are rarely used in the engineering practice of non-linear system identification.
3.3 Partially Recurrent Networks

Partially recurrent networks have a less general character [121, 122, 36, 123]. Contrary to the fully recurrent network, the architecture of partially recurrent networks is based on a feedforward multi-layer perceptron extended with an additional layer of units called the context layer. The neurons of this layer serve as internal states of the model. Among the many proposed structures, two partially recurrent networks have received considerable attention: the Elman [36] and the Jordan [123] structures. The Elman network is probably the best-known example of a partially recurrent neural network. The realization of such networks is considerably less expensive than in the case of a multi-layer perceptron with tapped delay lines. The scheme of the Elman network is shown in Fig. 3.3(a). This network consists of four layers of units: the input layer with n units, the context layer with v units, the hidden layer with v units and the output layer
Fig. 3.3. Partially recurrent networks due to Elman (a) and Jordan (b)
with m units. The input and output units interact with the outside environment, whereas the hidden and context units do not. The context units are used only to memorize the previous activations of the hidden neurons. A very important assumption is that in the Elman structure the number of context units is equal to the number of hidden units. All the feedforward connections are adjustable; the recurrent connections denoted by a thick arrow in Fig. 3.3(a) are fixed. Theoretically, this kind of network is able to model an s-th order dynamic system, if it can be trained to do so [23]. At a specific time k, the previous activations of the hidden units (at time k − 1) and the current inputs (at time k) are used as inputs to the network. In this case, the Elman network's behaviour is analogous to that of a feedforward network. Therefore, the standard back-propagation algorithm can be applied to train the network parameters. However, it should be kept in mind that such simplifications limit the application of the Elman structure to the modelling of dynamic processes [29, 28]. In turn, the Jordan network is presented in Fig. 3.3(b). In this case, feedback connections from the output neurons are fed to the context units. The Jordan network has been successfully applied to recognize and differentiate various output time-sequences [124, 123] or to classify English syllables [125]. Partially recurrent networks have the advantage over fully recurrent networks that their recurrent links are more structured, which leads to faster training and fewer stability problems [23, 18]. Nevertheless, the number of states is still strongly related to the number of hidden (for Elman) or output (for Jordan) neurons, which severely restricts their flexibility. In the literature, there are propositions to extend partially recurrent networks by introducing additional recurrent links, represented by the weight α, from the context units to themselves [121, 126].
The value of α should be less than 1. For α close to 1, long-term memory is obtained, but the model is less sensitive to details. Another architecture can be found in the recurrent network elaborated by Parlos [37] (Fig. 3.4). A Recurrent Multi-Layer Perceptron (RMLP) is designed
Fig. 3.4. Architecture of the recurrent multi-layer perceptron
based on the multi-layer perceptron network by adding delayed links between neighbouring units of the same hidden layer (cross-talk links), including unit feedback on itself (recurrent links) [37]. Empirical evidence indicates that by using delayed recurrent and cross-talk weights the RMLP network is able to emulate a large class of non-linear dynamic systems. The feedforward part of the network still maintains the well-known curve-fitting properties of the multi-layer perceptron, while the feedback part provides its dynamic character. Moreover, the use of past process observations is not necessary, because their effect is captured by the internal network states. The RMLP network has been successfully used as a model for dynamic system identification [37]. However, a drawback of this dynamic structure is the increased network complexity, strictly dependent on the number of hidden neurons, and the resulting long training time. For a network containing one input, one output and only one hidden layer with v neurons, the number of network parameters is equal to v² + 3v.
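The count v² + 3v can be checked by enumerating one plausible parameter grouping for such a network; the particular breakdown into link types below is an assumption, chosen to be consistent with the total stated above:

```python
def rmlp_param_count(v):
    """Parameters of an RMLP with one input, one output and v hidden
    neurons: input-to-hidden weights, hidden biases, the full set of
    delayed hidden-to-hidden links (recurrent and cross-talk), and
    hidden-to-output weights."""
    input_to_hidden = v
    hidden_biases = v
    hidden_to_hidden = v * v  # each unit fed by all v delayed hidden outputs
    hidden_to_output = v
    return (input_to_hidden + hidden_biases
            + hidden_to_hidden + hidden_to_output)

# Matches the closed form v**2 + 3*v for any v
counts = [rmlp_param_count(v) for v in (1, 5, 10)]
```

The quadratic hidden-to-hidden term is what makes the complexity grow quickly with the number of hidden neurons, as noted above.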
3.4 State-Space Networks

Figure 3.5 shows another type of recurrent neural network, known as the state-space neural network [119, 23, 18]. The output of the hidden layer is fed back to

Fig. 3.5. Block scheme of the state-space neural network with one hidden layer
the input layer through a bank of unit delays. The number of unit delays used here determines the order of the system. The user can choose how many neurons are used to produce the feedback. Let $u(k) \in \mathbb{R}^n$ be the input vector, $x(k) \in \mathbb{R}^q$ the output of the hidden layer at time $k$, and $y(k) \in \mathbb{R}^m$ the output vector. Then the state-space representation of the neural model presented in Fig. 3.5 is described by the equations
\[ x(k+1) = f(x(k), u(k)), \qquad (3.2) \]
\[ y(k) = Cx(k), \qquad (3.3) \]
where $f(\cdot)$ is a non-linear function characterizing the hidden layer, and $C$ is the matrix of synaptic weights between the hidden and output neurons. This model looks similar to the external dynamics approach presented in Fig. 3.1, but the main difference is that for external dynamics the outputs which are fed back are known during training, while for the state-space model the outputs which are fed back are unknown during training. As a result, state-space models can be trained only by minimizing the simulation error. State-space models possess a number of advantages over fully and partially recurrent networks [23, 18]:
• The number of states (model order) can be selected independently of the number of hidden neurons. In this way, only those neurons that feed their outputs back to the input layer through delays are responsible for defining the state of the network. As a consequence, the output neurons are excluded from the definition of the state;
• Since the model states feed the input of the network, they are easily accessible from the outside environment. This property can be useful when state measurements are available at some time instants (e.g. initial conditions).
The state-space model includes several recurrent structures as special cases. The previously analyzed Elman network has an architecture similar to that presented in Fig. 3.5, except for the fact that the output layer can be non-linear and the bank of unit delays at the output is omitted. In spite of the fact that state-space neural networks seem to be more promising than fully or partially recurrent networks, in practice a lot of difficulties can be encountered [18]:
• Model states do not approach the true process states;
• Wrong initial conditions can deteriorate the performance, especially when short data sets are used for training;
• Training can become unstable;
• The model after training can be unstable.
In particular, these drawbacks appear in cases when no state measurements and no initial conditions are available. A very important property of the state-space neural network is that it can approximate a wide class of non-linear dynamic systems [119]. There are, however, some restrictions. The approximation is only valid on compact subsets of the state-space and for finite time intervals; thus, interesting dynamic characteristics are not reflected [127, 23].
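Simulating the model (3.2)–(3.3) itself is straightforward. A minimal sketch, assuming NumPy, a tanh hidden layer as $f(\cdot)$ and small random weights chosen purely for illustration:

```python
import numpy as np

def simulate_state_space_net(A, B, C, u_seq, x0):
    """Simulate the state-space neural model (3.2)-(3.3) with
    x(k+1) = tanh(A x(k) + B u(k)) as the hidden layer and y(k) = C x(k)."""
    x = np.asarray(x0, float)
    ys = []
    for u in u_seq:
        ys.append(C @ x)                 # linear output layer (3.3)
        x = np.tanh(A @ x + B @ u)       # non-linear state update (3.2)
    return np.array(ys), x

rng = np.random.default_rng(0)
q, n, m = 3, 1, 1                        # states, inputs, outputs
A = 0.3 * rng.normal(size=(q, q))        # small gains for a stable sketch
B = rng.normal(size=(q, n))
C = rng.normal(size=(m, q))
y, x_final = simulate_state_space_net(A, B, C, [np.zeros(n)] * 20,
                                      np.ones(q))
```

Note that only the hidden-layer outputs that are delayed and fed back (here all q of them) define the state, in line with the first advantage listed above.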
3.5 Locally Recurrent Networks

All the recurrent networks described in the previous sections are called globally recurrent neural networks. In such networks all possible connections between neurons are allowed, but all of the neurons in the network structure are static ones based on the McCulloch-Pitts model. A biological neural cell not only performs a non-linear mapping operation on the weighted sum of its inputs but also has dynamic properties such as state feedback, time delays, hysteresis or limit cycles. In order to cope with such dynamic behaviour, a special kind of neuron model has been proposed [128, 129, 130, 131, 132]. Such neuron models constitute a basic building block for designing complex dynamic neural networks. The dynamic neuron unit systematized by Gupta and co-workers in [77] as the basic element of neural networks of the dynamic type is presented in Fig. 3.6(a). The neuron receives not only external inputs but also state feedback signals from itself and other neurons in the network. The synaptic links in this model contain a self-recurrent connection, representing a weighted feedback signal of its own state, and lateral connections, which constitute state feedback from other neurons of the network. The dynamic neuron unit is connected to (n − 1) other models of the same type, forming a neural network (Fig. 3.6(b)). The general dynamic neuron unit is described by the following equations:
\[ \frac{dx_i(t)}{dt} = -\alpha_i x_i(t) + f_i(\boldsymbol{w}_i, \boldsymbol{x}), \qquad (3.4) \]
\[ y_i(t) = g_i(x_i(t)), \qquad (3.5) \]
where $\boldsymbol{x} \in \mathbb{R}^{n+1}$ is the augmented vector of the n neural states from the other neurons in the network, including the bias, $\boldsymbol{w}_i$ is the vector of synaptic weights associated with the i-th dynamic neuron unit, $\alpha_i$ is the feedback parameter of the i-th dynamic unit, $y_i(t)$ is the output of the i-th neuron, $f_i(\cdot)$ is a non-linear function
Fig. 3.6. Generalized structure of the dynamic neuron unit (a), network composed of dynamic neural units (b)
of the i-th neuron, and $g_i(\cdot)$ is an output function of the i-th neuron. Using Euler's method, the first-order derivative is approximated as
\[ \left.\frac{dx(t)}{dt}\right|_{t=kT} = \frac{x((k+1)T) - x(kT)}{T}, \qquad (3.6) \]
where $T$ stands for the sampling time and $k$ is the discrete-time index. Assuming that $T = 1$, (3.6) can be rewritten in the simpler form
\[ \frac{dx(t)}{dt} = x(k+1) - x(k). \qquad (3.7) \]
Using (3.7), the discrete-time forms of (3.4) and (3.5) are as follows:
\[ x_i(k+1) = -(\alpha_i - 1)x_i(k) + f_i(\boldsymbol{w}_i, \boldsymbol{x}(k)), \qquad (3.8) \]
\[ y_i(k) = g_i(x_i(k)). \qquad (3.9) \]
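As an illustration, the discrete-time unit (3.8)–(3.9) can be simulated directly. The sketch below (NumPy assumed) uses the DNU-3 choice from Table 3.1, $f_i = \sigma_i(\boldsymbol{w}_i^T \boldsymbol{x}(k))$ with $\sigma = \tanh$ and $g_i(x_i) = x_i$, for a single neuron whose augmented state vector holds its own state and a constant bias; all parameter values are illustrative:

```python
import numpy as np

def simulate_dnu(alpha, w, x0, steps):
    """Discrete-time dynamic neuron unit (3.8)-(3.9) with the DNU-3
    choice f = tanh(w^T x) and g(x) = x; the augmented state vector
    holds the neuron state and a constant bias input."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        aug = np.array([x, 1.0])                  # state plus bias term
        x = -(alpha - 1.0) * x + np.tanh(w @ aug)  # state update (3.8)
        trajectory.append(x)                       # output y = g(x) = x
    return np.array(trajectory)

# Illustrative parameters: alpha = 0.5 gives x(k+1) = 0.5 x(k) + tanh(...)
traj = simulate_dnu(alpha=0.5, w=np.array([0.8, 0.2]), x0=0.0, steps=50)
```

With these values the leaky state update is a contraction near its equilibrium, so the trajectory settles at the fixed point of $x = 0.5x + \tanh(0.8x + 0.2)$.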
Due to various choices of the functions $f_i(\cdot)$ and $g_i(\cdot)$ in (3.8) and (3.9), as well as different types of synaptic connections, different dynamic neuron models can be obtained. The general discrete-time model described by (3.8) and (3.9) may be expanded into various other representations. Mathematical details of different types of dynamic neuron units are given in Table 3.1. Neural networks composed of dynamic neuron units have a recurrent structure with lateral links between neurons, as depicted in Fig. 3.6(b). A different approach to providing dynamically driven neural networks is used in the so-called Locally Recurrent Globally Feed-forward (LRGF) networks [38, 40]. LRGF networks have an architecture that is somewhere in between a feedforward and a globally recurrent architecture. The topology of this kind of neural network is analogous to the multi-layered feedforward one, and the dynamics are reproduced by the so-called dynamic neuron models. Based on the well-known McCulloch-Pitts neuron model, different dynamic neuron models can be designed. In general, the differences between them depend on the localization of the internal feedbacks.

Model with local activation feedback. This neuron model was studied by Frasconi [130], and may be described by the following equations:
\[ \varphi(k) = \sum_{i=1}^{n} w_i u_i(k) + \sum_{i=1}^{r} d_i \varphi(k-i), \qquad (3.10a) \]
\[ y(k) = \sigma\bigl(\varphi(k)\bigr), \qquad (3.10b) \]
where $u_i$, $i = 1, 2, \dots, n$ are the inputs to the neuron, $w_i$ are the input weights, $\varphi(k)$ is the activation potential, $d_i$, $i = 1, 2, \dots, r$ are the coefficients which determine the feedback intensity of $\varphi(k-i)$, and $\sigma(\cdot)$ is a non-linear activation function. With reference to Fig. 3.7, the input to the neuron can be a combination of input variables and delayed versions of the activation $\varphi(k)$. Note that the right-hand side summation in (3.10a) can be interpreted as the Finite
Table 3.1. Specification of different types of dynamic neuron units

Model | $f_i(\cdot)$ | $g_i(\cdot)$ | Reference
DNU-1 | $\boldsymbol{w}_i^T \sigma(\boldsymbol{x}(k))$ | some function | based on Hopfield [133]
  (where $\sigma(\cdot)$ is a vector-valued activation function)
DNU-2 | $\boldsymbol{w}_i^T \boldsymbol{y}(k)$ | $\sigma_i(x_i(k))$ | based on Hopfield [133]
  (where $\sigma_i(\cdot)$ is a non-linear activation function of the i-th neuron)
DNU-3 | $\sigma_i(\boldsymbol{w}_i^T \boldsymbol{x}(k))$ | $x_i(k)$ | Pineda [134]
DNU-4 | $\sigma_i(\boldsymbol{w}_i^T \boldsymbol{x}(k)) + x_0 w_{0i}$ | $x_i(k)$ | Pineda [135]
  (where $\boldsymbol{w}_i \in \mathbb{R}^n$ is the vector of synaptic weights of the i-th neuron without the bias, and $w_{0i}$ is the bias of the i-th neuron)
DNU-5 | $(\gamma_i - \beta_i x_i)\, \boldsymbol{w}_i^T \sigma(\boldsymbol{x}(k))$ | $\sigma_i(x_i(k))$ | Grossberg [136]
  (where $\gamma_i$ is an automatic gain control of the i-th neuron, and $\beta_i$ is a total normalization for the internal state of the i-th neuron)
Impulse Response (FIR) filter. This neuron model has the feedback signal taken before the non-linear activation block (Fig. 3.7).

Model with local synapse feedback. Back and Tsoi [129] introduced the neuron architecture with local synapse feedback (Fig. 3.8). In this structure, instead of a synapse in the form of a single weight, a synapse with a linear transfer function, the Infinite Impulse Response (IIR) filter, with poles and zeros is applied. In this case, the neuron is described by the following set of equations:
\[ y(k) = \sigma\left( \sum_{i=1}^{n} G_i(z^{-1}) u_i(k) \right), \qquad (3.11a) \]
\[ G_i(z^{-1}) = \frac{\sum_{j=0}^{r} b_j z^{-j}}{\sum_{j=0}^{p} a_j z^{-j}}, \qquad (3.11b) \]
where $u_i(k)$, $i = 1, 2, \dots, n$ is the set of inputs to the neuron, $G_i(z^{-1})$ is the linear transfer function, and $b_j$, $j = 0, 1, \dots, r$ and $a_j$, $j = 0, 1, \dots, p$ determine its zeros and poles,
Fig. 3.7. Neuron architecture with local activation feedback
respectively. As seen in (3.11b), the linear transfer function has r zeros and p poles. Note that the inputs $u_i(k)$, $i = 1, 2, \dots, n$ may be taken from the outputs of the previous layer, or from the output of the neuron. If they are derived from the previous layer, this is local synapse feedback. On the other hand, if they are derived from the output $y(k)$, it is local output feedback. Moreover, local activation feedback is a special case of the local synapse feedback architecture, in which all synaptic transfer functions have the same denominator and only one zero, i.e. $b_j = 0$, $j = 1, 2, \dots, r$.

Model with local output feedback. Another dynamic neuron architecture was proposed by Gori [128] (see Fig. 3.9). In contrast to local synapse as well as local activation feedback, this neuron model takes the feedback after the non-linear activation block. In a general case, such a model can be described as follows:
\[ y(k) = \sigma\left( \sum_{i=1}^{n} w_i u_i(k) + \sum_{i=1}^{r} d_i y(k-i) \right), \qquad (3.12) \]
where $d_i$, $i = 1, 2, \dots, r$ are the coefficients which determine the feedback intensity of the neuron outputs $y(k-i)$. In this architecture, the output of the neuron is filtered by the FIR filter, whose output is added to the inputs, providing the activation. It is easy to see that by applying an IIR filter to the filtering of the neuron output, a more general structure can be obtained [38]. The work of Gori [128] found its basis in the work by Mozer [122]. In fact, one can consider this architecture as a generalization of the Jordan-Elman architecture [123, 36].
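A minimal simulation of the local output feedback model (3.12), assuming NumPy, $\sigma = \tanh$ and illustrative weights:

```python
import numpy as np

def local_output_feedback(u_seq, w, d, y_init):
    """Neuron with local output feedback (3.12):
    y(k) = sigma(sum_i w_i u_i(k) + sum_i d_i y(k-i)), sigma = tanh.
    y_init holds the r past outputs y(-1), ..., y(-r)."""
    past = list(y_init)                       # past[0] = y(k-1), etc.
    ys = []
    for u in u_seq:
        y = np.tanh(w @ u + np.dot(d, past))  # feedback taken after sigma
        ys.append(y)
        past = [y] + past[:-1]                # shift the output memory
    return np.array(ys)

# Illustrative single-input neuron with second-order output feedback
w = np.array([1.0])
d = np.array([0.5, -0.1])
u_seq = [np.array([1.0])] * 30
y = local_output_feedback(u_seq, w, d, y_init=[0.0, 0.0])
```

For a constant input and small feedback coefficients the output settles at the fixed point of $y = \tanh(1 + 0.4y)$; because the feedback is taken after $\sigma(\cdot)$, the output is bounded by the activation function regardless of the feedback weights.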
Fig. 3.8. Neuron architecture with local synapse feedback
Fig. 3.9. Neuron architecture with local output feedback
Memory neuron. Memory neuron networks were introduced by Poddar and Unnikrishnan [131]. These networks consist of neurons which have a memory, i.e. they contain information regarding the past activations of their parent network neurons. A general scheme of the memory neuron is shown in Fig. 3.10. The mathematical description of such a neuron is presented below:
\[ y(k) = \sigma\left( \sum_{i=1}^{n} w_i u_i(k) + \sum_{i=1}^{n} s_i z_i(k) \right), \qquad (3.13a) \]
\[ z_i(k) = \alpha_i u_i(k-1) + (1 - \alpha_i) z_i(k-1), \qquad (3.13b) \]
where z_i, i = 1, 2, ..., n are the outputs of the memory neurons of the previous layer, s_i, i = 1, 2, ..., n are the weight parameters of the memory neuron outputs z_i(k), and α_i = const is a coefficient. It is observed that the memory neuron "remembers" the past output values passed to that particular neuron; here, the memory takes the form of an exponential filter. This neuron structure can be considered a special case of the generalised local output feedback architecture: it has a feedback transfer function with one pole only. Memory neuron networks have been intensively studied in recent years, and there are some interesting results concerning the use of this architecture in the identification and control of dynamic systems [137].

3.5.1 Model with the IIR Filter
In the following part of this section, the general structure of the neuron model proposed by Ayoubi [109] is considered. Dynamics are introduced into the neuron in such a way that the neuron activation depends on its internal states. This is done by introducing an IIR filter into the neuron structure. In this way, the neuron takes into account its own past inputs and activations using two signals: the inputs u_i(k), i = 1, 2, ..., n, and the output y(k). Figure 3.11 shows the structure of the neuron model considered. Three main operations are performed in this dynamic structure. First of all, the weighted sum of the inputs is calculated according to the formula
3.5 Locally Recurrent Networks

Fig. 3.10. Memory neuron architecture
\varphi(k) = \sum_{i=1}^{n} w_i u_i(k).    (3.14)
The weights play a similar role to that in static feedforward networks: together with the activation function, they are responsible for the approximation properties of the model. The calculated sum φ(k) is then passed to the IIR filter. The filters considered here are linear dynamic systems of different orders, viz. the first or the second order. The filter consists of feedback and feedforward paths weighted by the parameters a_i, i = 1, ..., r and b_i, i = 0, 1, ..., r, respectively. The behaviour of this linear system can be described by the following difference equation:

z(k) = \sum_{i=0}^{r} b_i \varphi(k-i) - \sum_{i=1}^{r} a_i z(k-i),    (3.15)

where φ(k) is the filter input, z(k) is the filter output, and k is the discrete-time index. Alternatively, the equation (3.15) may be rewritten as a transfer function:

G(z) = \frac{\sum_{i=0}^{r} b_i z^{-i}}{1 + \sum_{i=1}^{r} a_i z^{-i}}.    (3.16)

Finally, the neuron output can be described by

y(k) = \sigma\big( g_2 (z(k) - g_1) \big),    (3.17)

Fig. 3.11. Neuron architecture with the IIR filter
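As an illustration, the chain (3.14)–(3.17) can be sketched in Python. This is a hypothetical minimal implementation: tanh is assumed for σ, and the parameter values are arbitrary.

```python
import numpy as np

class IIRNeuron:
    """Dynamic neuron with an internal IIR filter, following (3.14)-(3.17)."""
    def __init__(self, w, a, b, g1=0.0, g2=1.0):
        self.w = np.asarray(w, float)   # input weights
        self.a = np.asarray(a, float)   # feedback filter parameters a_1..a_r
        self.b = np.asarray(b, float)   # feedforward filter parameters b_0..b_r
        self.g1, self.g2 = g1, g2       # bias and slope of the activation
        r = len(self.a)
        self.phi_hist = np.zeros(r)     # phi(k-1), ..., phi(k-r)
        self.z_hist = np.zeros(r)       # z(k-1), ..., z(k-r)

    def step(self, u):
        phi = np.dot(self.w, u)                      # weighted sum (3.14)
        z = (self.b[0] * phi                         # IIR filter (3.15)
             + np.dot(self.b[1:], self.phi_hist)
             - np.dot(self.a, self.z_hist))
        self.phi_hist = np.concatenate(([phi], self.phi_hist[:-1]))
        self.z_hist = np.concatenate(([z], self.z_hist[:-1]))
        return np.tanh(self.g2 * (z - self.g1))      # activation (3.17)

neuron = IIRNeuron(w=[0.5, -0.3], a=[0.451, 0.041], b=[1.0, 0.2, 0.1])
ys = [neuron.step([1.0, 0.0]) for _ in range(30)]
```

Note that, unlike the local output feedback model, the recursion here acts on the internal filter state z(k) before the non-linearity is applied.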
Fig. 3.12. Transformation of the neuron model with the IIR filter to the general local activation feedback structure
where σ(·) is a non-linear activation function that produces the neuron output y(k), and g_1 and g_2 are the bias and the slope parameter of the activation function, respectively. In the dynamic neuron, the slope parameter can change; thus, the dynamic neuron can model the biological neuron better. In the biological neuron, at the macroscopic level, the dendrites of each neuron receive pulses at the synapses and convert them to a continuously variable dendritic current. The flow of this current through the axon membrane modulates the axonal firing rate. This morphological change of the neuron during the learning process may be modelled by introducing the slope of the activation function as one of the adaptable parameters of the neuron, in addition to the synaptic weights and the filter parameters [132, 138].

The neuron model with the IIR filter may be made equivalent to the general local activation structure. Let us assume that the r-th order IIR filter has been divided into two r-th order FIR filters (Fig. 3.12(a)): one filter recovers the past data of φ(k), and the second recovers the past data of z(k). In this way, the structure shown in Fig. 3.12(b) can be obtained. In this model, each synapse signal is weighted by a suitable weight, and then the weighted sum is passed to one common FIR filter. Alternatively, one can dispense with the weights and replace them with FIR filters; as a result, local activation feedback with FIR synapses is obtained [40]. In spite of the equivalence, these two models require different numbers of adaptable parameters.

In order to analyse the properties of the neuron model considered, it is convenient to represent it in the state-space. The states of the neuron can be described by the following state equation:
Fig. 3.13. State-space form of the i-th neuron with the IIR filter

x(k + 1) = A x(k) + W u(k),    (3.18)

where x(k) ∈ R^r is the state vector, W = 1 w^T is the weight matrix (w ∈ R^n, and 1 ∈ R^r is the vector with one in the first place and zeros elsewhere), u(k) ∈ R^n is the input vector, n is the number of inputs, and the state matrix A has the form

A = \begin{bmatrix} -a_1 & -a_2 & \cdots & -a_{r-1} & -a_r \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}.    (3.19)
Finally, the neuron output is described by

y(k) = \sigma\big( g_2 (b x(k) + d u(k) - g_1) \big),    (3.20)

where σ(·) is a non-linear activation function, b = [b_1 - b_0 a_1, ..., b_r - b_0 a_r] is the vector of feedforward filter parameters, and d = [b_0 w_1, ..., b_0 w_n]. The block structure of the state-space representation of the neuron considered is presented in Fig. 3.13.

3.5.2 Analysis of Equilibrium Points
Let x* be an equilibrium state of the dynamic neuron described by (3.18). Introducing the equivalent transformation z(k) = x(k) - x*, the system (3.18) can be brought to the form

z(k + 1) = A z(k).    (3.21)

A constant vector z is said to be an equilibrium (stationary) state of the system (3.21) if it satisfies

(A - I) z = 0.    (3.22)

If the matrix A - I is non-singular, the origin z = 0 is the unique equilibrium state, and its character is determined by the eigenvalues of A. If one assumes that the dynamic neuron contains a second order linear dynamic system, the equilibrium points may be divided into the six classes depicted in Fig. 3.14. The cases (a) and (b) show stable equilibrium points, while
Fig. 3.14. Positions of equilibrium points: stable node (a), stable focus (b), unstable node (c), unstable focus (d), saddle point (e), center (f)
the situations (c)–(e) present unstable ones. Figure 3.14(f) shows an example of a critically stable system. Direct computation of the eigenvalues of A is usually complicated, especially in the case of large scale non-linear systems; in many cases, indirect approaches are useful. To localise the eigenvalues of the state transition matrix A, Ostrowski's theorem [139] can be used.

Theorem 3.1 (Ostrowski's theorem). Let A = [a_{i,j}]_{n×n} be a complex matrix, let γ ∈ [0, 1] be given, and let R_i and C_j denote the deleted row and deleted column sums of A:

R_i = \sum_{j=1, j \neq i}^{n} |a_{i,j}|, \qquad C_j = \sum_{i=1, i \neq j}^{n} |a_{i,j}|.

All the eigenvalues of A are then located in the union of n closed disks in the complex plane with the centres a_{i,i} and the radii

r_i = R_i^{\gamma} C_i^{1-\gamma},    i = 1, 2, ..., n.
According to this theorem, the eigenvalues of A are located in a small neighbourhood of the points a_{1,1}, ..., a_{n,n}. Moreover, the disks centred at the points a_{i,i} are easily computable.

Example 3.2. Let us analyse the equilibrium points of the system (3.21) with the state transition matrix determined during the training process, with the following elements:

A = \begin{bmatrix} 0.451 & 0.041 \\ 1 & 0 \end{bmatrix}.

An illustration of the eigenvalue regions for this case is presented in Figs. 3.15(a) and (b) for different settings of the parameter γ. In both figures there are two Ostrowski disks with the centres a_{1,1} = 0.451 and a_{2,2} = 0, but with different radii due to the different values of γ. A suitably selected parameter γ makes it possible to locate the eigenvalues more accurately. Fig. 3.15(a) presents the results for γ = 0.3, where the two Ostrowski disks intersect each other, while in Fig. 3.15(b) there are two separate disks obtained for γ = 0.5. In the latter case, one knows that each disk represents the position of exactly one eigenvalue. In general, it is easy to verify that the stability of the system (3.21) is guaranteed if

|a_{i,i}| + r_i < 1,    i = 1, 2, ..., n.    (3.23)
Fig. 3.15. Eigenvalue positions of the matrix A: γ = 0.3 (a), γ = 0.5 (b)
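Theorem 3.1 and the criterion (3.23) are easy to evaluate numerically. A sketch in Python (the helper names are ours), applied to the matrix of Example 3.2 and to its modified form (3.24) with ν = 0.7 (cf. Example 3.3):

```python
import numpy as np

def ostrowski_disks(A, gamma):
    """Centres a_ii and radii r_i = R_i^gamma * C_i^(1-gamma) (Theorem 3.1)."""
    A = np.asarray(A, float)
    absA = np.abs(A)
    R = absA.sum(axis=1) - np.diag(absA)   # deleted row sums
    C = absA.sum(axis=0) - np.diag(absA)   # deleted column sums
    return np.diag(A), R**gamma * C**(1.0 - gamma)

def is_stable(A, gamma):
    """Sufficient stability condition (3.23): |a_ii| + r_i < 1 for all i."""
    centres, radii = ostrowski_disks(A, gamma)
    return bool(np.all(np.abs(centres) + radii < 1.0))

A = np.array([[0.451, 0.041], [1.0, 0.0]])      # Example 3.2
A_mod = np.array([[0.451, 0.041], [0.7, 0.0]])  # modified form (3.24), nu = 0.7
```

For γ = 0.5 both matrices satisfy (3.23); the criterion is only sufficient, so a violation would not by itself imply instability.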
Fig. 3.16. Eigenvalue positions of the modified matrix A: γ = 0.3 (a), γ = 0.5 (b)
The form of the matrix A makes some stability criteria hardly applicable, or not applicable at all. This problem arises especially for much more complex network structures (see Chapter 5). The state transition matrix A has a specific form: all elements excluding the first row are constants (not adjustable). The ones in this matrix cause Ostrowski's radii to take quite large values. In order to make the criterion (3.23) and the other stability criteria discussed in Chapter 5 more applicable, let us introduce a modified form of the matrix A:

A = \begin{bmatrix} -a_1 & -a_2 & \cdots & -a_{r-1} & -a_r \\ \nu & 0 & \cdots & 0 & 0 \\ 0 & \nu & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \nu & 0 \end{bmatrix},    (3.24)

with the parameter ν ∈ (0, 1) instead of ones. The parameter ν represents the influence of the state x_i on the state x_{i-1}.

Example 3.3. Let us revisit the problem considered in Example 3.2 with the state transition matrix of the form

A = \begin{bmatrix} 0.451 & 0.041 \\ \nu & 0 \end{bmatrix}.

For ν = 0.7, the eigenvalue regions are presented in Fig. 3.16(a) for γ = 0.3 and in Fig. 3.16(b) for γ = 0.5. Similarly as in Example 3.2, there are two families of Ostrowski's disks with the centres a_{1,1} = 0.451 and a_{2,2} = 0, but each disk has a smaller radius than in the previous example. Moreover, it is easier to select the parameter γ so as to obtain separated disks: in this example the disks are already separated for γ = 0.3, while in the previous one they are not for the same setting.
3.5.3 Controllability and Observability

In order to make the problem tractable, it is necessary to make certain assumptions, including the controllability and observability of the system [140, 141]. Even in the case of linear time-invariant systems, some prior information about the system (e.g. the system order, the relative degree, the high frequency gain) is assumed to be known in order to obtain a solution. Controllability is concerned with whether or not one can control the dynamic behaviour of the neural network. In turn, observability is concerned with whether or not one can observe the result of the control applied to the network; in that sense, observability is dual to controllability.

Definition 3.4. A dynamic system is controllable if for any two states x_1 and x_2 there exists an input sequence u of a finite length that will transfer the system from x_1 to x_2.

Definition 3.5. A dynamic system is observable if a state x of this system can be determined from an input sequence u and an output sequence y, both of a finite length.

For non-linear systems, conditions for both global controllability and observability are very hard to elaborate and verify; therefore, local forms may be used instead. The observation equation of the dynamic neuron (3.20) is non-linear. Therefore, one can linearise it by expanding it into a Taylor series around the origin x(k) = 0, u(k) = 0 and retaining first order terms:

\delta y(k) = \sigma(-g_1 g_2) + \sigma'(-g_1 g_2) g_2 b \, \delta x(k) + \sigma'(-g_1 g_2) g_2 d \, \delta u(k),    (3.25)
where δu(k), δy(k) and δx(k) represent small displacements of the input, the output and the state, respectively. Thus, the linearised system can be represented in the form

\delta x(k+1) = \bar{A} \delta x(k) + \bar{B} \delta u(k),    (3.26)
\delta \bar{y}(k) = \bar{C} \delta x(k) + \bar{D} \delta u(k),    (3.27)

where \bar{A} = A, \bar{B} = W, \bar{C} = \sigma'(-g_1 g_2) g_2 b, \bar{D} = \sigma'(-g_1 g_2) g_2 d, and \delta\bar{y}(k) = \delta y(k) - \sigma(-g_1 g_2). The state-space representation (3.26)–(3.27) has a standard linear form, and in further analysis the well-known approaches to checking the controllability and observability of linear systems will be applied.

Controllability

Let us define the controllability matrix in the form

M_C = [\bar{A}^{q-1}\bar{B}, \ldots, \bar{A}\bar{B}, \bar{B}].    (3.28)
The system (3.26) is controllable if the matrix M_C is of rank q (a full rank matrix). For linear systems, controllability is a global property. Taking into account the fact that the state equation (3.18) is linear, the condition (3.28) is a condition of global controllability of the neuron with the IIR filter described by the state equation (3.18).

Observability

Let us define the matrix

M_O = [\bar{C}^T, (\bar{C}\bar{A})^T, \ldots, (\bar{C}\bar{A}^{q-1})^T].    (3.29)
The matrix M_O is called the observability matrix of the linearised system. If the matrix M_O is of rank q (a full rank matrix), then the system represented by (3.26) and (3.27) is observable.

Theorem 3.6. Let (3.26) and (3.27) be a linearisation of the neuron (3.18) and (3.20). If the linearised system is observable, then the neuron described by (3.18) and (3.20) is locally observable around the origin.

Proof. The proof is based on the inverse function theorem [142] and the reasoning presented in [143, 23]. Let us consider the mapping

H(U^q(k), x(k)) = (U^q(k), Y^q(k)),    (3.30)

where H : R^{2q} → R^{2q}, U^q(k) = [u(k), u(k+1), ..., u(k+q-1)] and Y^q(k) = [y(k), y(k+1), ..., y(k+q-1)]. The Jacobian matrix of the mapping H at (0, 0) has the form

J_{(0,0)} = \begin{bmatrix} \dfrac{\partial U^q(k)}{\partial U^q(k)} & \dfrac{\partial Y^q(k)}{\partial U^q(k)} \\[1ex] \dfrac{\partial U^q(k)}{\partial x(k)} & \dfrac{\partial Y^q(k)}{\partial x(k)} \end{bmatrix} = \begin{bmatrix} I & \dfrac{\partial Y^q(k)}{\partial U^q(k)} \\[1ex] 0 & \dfrac{\partial Y^q(k)}{\partial x(k)} \end{bmatrix}.    (3.31)

The element ∂Y^q(k)/∂U^q(k) is of no interest here. Using (3.20), the derivatives of y(i), i = k, ..., k + q - 1, with respect to x(k) are given by

\frac{\partial y(k)}{\partial x(k)} = \bar{C}, \quad \frac{\partial y(k+1)}{\partial x(k)} = \bar{C}\bar{A}, \quad \ldots, \quad \frac{\partial y(k+q-1)}{\partial x(k)} = \bar{C}\bar{A}^{q-1}.    (3.32)

The derivatives (3.32) form the columns of the observability matrix M_O; thus, the Jacobian can be rewritten as

J_{(0,0)} = \begin{bmatrix} I & P \\ 0 & M_O \end{bmatrix},    (3.33)

where P = ∂Y^q(k)/∂U^q(k). By the inverse function theorem, if rank M_O = q, then locally there exists an inverse Ψ = H^{-1} such that

(U^q(k), x(k)) = Ψ(U^q(k), Y^q(k)).    (3.34)
As a result, in the local neighbourhood of the origin, using the sequences U^q(k) and Y^q(k), by the continuity of Ψ and σ, one is able to determine the state x(k) of the system (3.18) and (3.20).

Example 3.7. Let us consider the already trained dynamic neuron (3.18) and (3.20) represented by the matrices

A = \begin{bmatrix} 0.3106 & -0.3439 \\ 1 & 0 \end{bmatrix}, \quad W = \begin{bmatrix} 0.4135 \\ 0 \end{bmatrix}, \quad b = [0.9326, \; 0.6709],

d = [0.0371], \quad g_1 = 0.9152, \quad g_2 = 0.1126.

After linearisation around the origin, one obtains the following matrices:

\bar{A} = \begin{bmatrix} 0.3106 & -0.3439 \\ 1 & 0 \end{bmatrix}, \quad \bar{B} = \begin{bmatrix} 0.4135 \\ 0 \end{bmatrix}, \quad \bar{C} = [0.1039, \; 0.0747], \quad \bar{D} = -0.0041.

The controllability matrix has the form

M_C = \begin{bmatrix} 0.1284 & 0.4135 \\ 0.4135 & 0 \end{bmatrix},    (3.36)

and rank(M_C) = 2, which is equal to q. The observability matrix has the form

M_O = \begin{bmatrix} 0.1038 & 0.107 \\ 0.0747 & -0.0357 \end{bmatrix},    (3.37)

and rank(M_O) = 2, which is equal to q. The neuron is thus both controllable and observable.

3.5.4 Dynamic Neural Network
One of the main advantages of locally recurrent networks is that their structure is similar to that of a static feedforward one: the dynamic neurons simply replace the standard static neurons. This network structure does not contain any global feedbacks, which would complicate the architecture of the network and the training algorithm. Such networks have an architecture that is somewhere in between a feedforward and a globally recurrent architecture. Tsoi and Back [38] called this class of neural networks the Locally Recurrent Globally Feed-forward (LRGF) architecture. The topology of an LRGF network is illustrated in Fig. 3.17. There are some interesting motivations which make LRGF networks very attractive [40, 38]:

1. Well-known neuron interconnection topology;
2. Small number of neurons required for a given problem;
3. Stability of the network. Globally recurrent architectures have a lot of problems in settling to an equilibrium value, and their stability is hard to prove. Many locally recurrent networks allow an easy check on stability by a simple investigation of the poles of their internal filters;
Fig. 3.17. Topology of the locally recurrent globally feedforward network
4. Explicit incorporation of past information into the architecture, needed for identification, control or time series prediction;
5. Simpler training than in globally recurrent networks. Gradient calculation carried out with real-time recurrent learning or back-propagation through time becomes tedious and time consuming in globally recurrent networks. Locally recurrent networks have feedforward connected neurons, which yields simpler training than in the case of globally recurrent networks;
6. Convergence speed. As mentioned above, taking into account the complexity of recurrent structures, and thus the complexity of training algorithms, the learning convergence time is long. LRGF networks have a less complicated structure, and the convergence of these networks can be expected to be faster.

Let us consider the M-layered network with dynamic neurons represented by (3.14)–(3.17) with differentiable activation functions σ(·) (Fig. 3.17). Let s_μ denote the number of neurons in the μ-th layer, and u_i^μ(k) the output of the i-th neuron of the μ-th layer at discrete time k. The activity of the j-th neuron in the μ-th layer is defined by the formula

u_j^{\mu}(k) = \sigma\Big( g_{2j}^{\mu} \Big( \sum_{i=0}^{r} b_{ij}^{\mu} \sum_{p=1}^{s_{\mu-1}} w_{jp}^{\mu} u_p^{\mu-1}(k-i) - \sum_{i=1}^{r} a_{ij}^{\mu} z_j^{\mu}(k-i) - g_{1j}^{\mu} \Big) \Big).    (3.38)
In order to analyse the properties of the neural networks considered, e.g. stability or approximation abilities, it is convenient to represent them in the state-space. The following paragraphs present a state-space representation of discrete-time dynamic neural networks with one and two hidden layers, respectively.

State-space representation of the dynamic network

Let us consider a discrete-time neural network with n inputs and m outputs. The network is composed of the dynamic neuron models described by (3.18) and (3.20), each neuron containing an IIR filter of order r.
Network with one hidden layer

A neural model with one hidden layer is described by the following formulae:

x(k + 1) = A x(k) + W u(k),
y(k) = C \sigma\big( G_2 (B x(k) + D u(k) - g_1) \big),    (3.39)

where N = v × r represents the number of model states, x ∈ R^N is the state vector, u ∈ R^n and y ∈ R^m are the input and output vectors, respectively, A ∈ R^{N×N} is the block diagonal state matrix (diag(A) = [A_1, ..., A_v]), W ∈ R^{N×n} (W = [w_1 1^T, ..., w_v 1^T]^T, where w_i is the input weight vector of the i-th hidden neuron) and C ∈ R^{m×v} are the input and output matrices, respectively, B ∈ R^{v×N} is a block diagonal matrix of feedforward filter parameters (diag(B) = [b_1, ..., b_v]), D ∈ R^{v×n} is the transfer matrix (D = [b_{01} w_1^T, ..., b_{0v} w_v^T]^T), g_1 = [g_{11} ... g_{1v}]^T denotes the vector of biases, G_2 ∈ R^{v×v} is the diagonal matrix of slope parameters (diag(G_2) = [g_{21} ... g_{2v}]), and σ : R^v → R^v is the non-linear vector-valued function.

Network with two hidden layers

A neural model composed of two hidden layers with v_1 neurons in the first layer and v_2 neurons in the second layer is represented as follows:

x(k + 1) = g(x(k), u(k)),
y(k) = h(x(k), u(k)),    (3.40)

where g and h are non-linear functions. Taking into account the layered topology of the network, one can decompose the state vector as x(k) = [x^1(k) x^2(k)]^T, where x^1(k) ∈ R^{N_1} (N_1 = v_1 × r) represents the states of the first layer, and x^2(k) ∈ R^{N_2} (N_2 = v_2 × r) represents the states of the second layer.
Then the state equation can be rewritten in the following form:

x^1(k + 1) = A^1 x^1(k) + W^1 u(k),    (3.41a)
x^2(k + 1) = A^2 x^2(k) + W^2 \sigma\big( G_2^1 (B^1 x^1(k) + D^1 u(k) - g_1^1) \big),    (3.41b)

where u ∈ R^n and y ∈ R^m are the inputs and outputs, respectively, the matrices A^1 ∈ R^{N_1×N_1}, B^1 ∈ R^{v_1×N_1}, W^1 ∈ R^{N_1×n}, D^1 ∈ R^{v_1×n}, g_1^1 ∈ R^{v_1} and G_2^1 ∈ R^{v_1×v_1} have a form analogous to the matrices describing the network with one hidden layer, A^2 ∈ R^{N_2×N_2} is the block diagonal state matrix of the second layer (diag(A^2) = [A_1^2, ..., A_{v_2}^2]), and W^2 ∈ R^{N_2×v_1} is the weight matrix between the first and second hidden layers, defined in a similar manner as W^1. Finally, the output of the model is represented by the equation

y(k) = C^2 \sigma\Big( G_2^2 \big( B^2 x^2(k) + D^2 \sigma\big( G_2^1 (B^1 x^1(k) + D^1 u(k) - g_1^1) \big) - g_1^2 \big) \Big),    (3.42)

where C^2 ∈ R^{m×v_2} is the output matrix, B^2 ∈ R^{v_2×N_2} is the block diagonal matrix of the second layer feedforward filter parameters, D^2 ∈ R^{v_2×v_1} is the transfer matrix of the second layer, g_1^2 ∈ R^{v_2} is the vector of the second layer biases, and G_2^2 ∈ R^{v_2×v_2} represents the diagonal matrix of the second layer activation function slope parameters. The matrices B^2, D^2, g_1^2 and G_2^2 have a form analogous to that of the matrices of the first hidden layer.
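A simulation of the one-hidden-layer model (3.39) can be sketched as follows. The block structure of A, W, B and D mirrors the definitions above, while the particular numbers, the tanh activation and the helper name are illustrative assumptions only:

```python
import numpy as np

def simulate(A, W, B, D, C, g1, G2, inputs, act=np.tanh):
    """Iterate (3.39): x(k+1) = A x(k) + W u(k),
    y(k) = C sigma(G2 (B x(k) + D u(k) - g1))."""
    x = np.zeros(A.shape[0])
    ys = []
    for u in inputs:
        ys.append(C @ act(G2 @ (B @ x + D @ u - g1)))
        x = A @ x + W @ u
    return np.array(ys)

# Tiny network: v = 2 hidden neurons, filter order r = 2, n = m = 1.
A1 = np.array([[0.3, -0.1], [1.0, 0.0]])       # per-neuron state blocks
A2 = np.array([[0.2, 0.05], [1.0, 0.0]])
A = np.block([[A1, np.zeros((2, 2))], [np.zeros((2, 2)), A2]])
one = np.array([[1.0], [0.0]])                 # vector 1 = [1, 0]^T
W = np.vstack([0.5 * one, -0.4 * one])         # W = [w_1 1^T, w_2 1^T]^T
B = np.block([[np.array([[0.4, 0.1]]), np.zeros((1, 2))],
              [np.zeros((1, 2)), np.array([[0.3, -0.2]])]])
D = np.array([[0.6 * 0.5], [0.2 * (-0.4)]])    # D = [b01 w_1, b02 w_2]^T
C = np.array([[1.0, -1.0]])
ys = simulate(A, W, B, D, C, g1=np.zeros(2), G2=np.eye(2),
              inputs=np.ones((25, 1)))
```

The two-hidden-layer model (3.41)–(3.42) is obtained by cascading two such blocks, with the non-linear output of the first layer feeding the state equation of the second.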
3.6 Training of the Network

3.6.1 Extended Dynamic Back-Propagation

All unknown network parameters can be represented by a vector θ. The main objective of learning is to adjust the elements of the vector θ in such a way as to minimise some loss (cost) function:

\theta^{\star} = \arg\min_{\theta \in C} J(\theta),    (3.43)

where θ^⋆ is the optimal network parameter vector, J : R^p → R represents the loss function to be minimised, p is the dimension of the vector θ, and C ⊆ R^p is the constraint set defining the allowable values of the parameters θ. The derivation of Extended Dynamic Back-Propagation (EDBP) follows the same lines as that of the standard BP algorithm [92, 93, 28, 23, 90]. Let us define the objective (loss) function as follows:

J(l; \theta) = \frac{1}{2} \sum_{k=1}^{N} \big( y_d(k) - y(k; \theta) \big)^2,    (3.44)
where y_d(k) and y(k; θ) are the desired output of the network and the actual response of the network to the given input pattern u(k), respectively, N is the size of the training set, and l is the iteration index. The objective function is minimised on the basis of a given set of input-output patterns. The adjustment of the parameters of the j-th neuron in the μ-th layer according to off-line EDBP has the following form [44, 144, 41]:

\theta_j^{\mu}(l+1) = \theta_j^{\mu}(l) - \eta \frac{\partial J(l)}{\partial \theta_j^{\mu}(l)}.    (3.45)

Substituting (3.44) into (3.45), one obtains

\theta_j^{\mu}(l+1) = \theta_j^{\mu}(l) - \eta \sum_{k=1}^{N} \delta_j^{\mu}(k) S_{\theta j}^{\mu}(k),    (3.46)

where η represents the learning rate, \delta_j^{\mu}(k) = \partial J(l)/\partial \tilde{z}_j^{\mu}(k), S_{\theta j}^{\mu}(k) = \partial \tilde{z}_j^{\mu}(k)/\partial \theta_j^{\mu}(l), and \tilde{z}_j^{\mu}(k) = g_{2j}^{\mu} (z_j^{\mu}(k) - g_{1j}^{\mu}). The error \delta_j^{\mu}(k) is defined as follows:

\delta_j^{\mu}(k) = -\sigma'(\tilde{z}_j^{\mu}(k)) \big( y_d(k) - y(k) \big), \quad \text{for } \mu = M,
\delta_j^{\mu}(k) = \sigma'(\tilde{z}_j^{\mu}(k)) \sum_{p=1}^{s_{\mu+1}} \delta_p^{\mu+1}(k) g_{2p}^{\mu+1} b_{0p}^{\mu+1} w_{pj}^{\mu+1}, \quad \text{for } \mu = 1, \ldots, M-1.    (3.47)
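A hedged sketch of one off-line EDBP step: the functions below assume tanh for σ and take the sensitivities S as already computed (their recursive formulae are given in (3.48)–(3.52)); all names and numbers are ours, for illustration only.

```python
import numpy as np

def output_delta(z_tilde, y_d, y):
    """Output-layer error of (3.47) for mu = M, assuming sigma = tanh."""
    return -(1.0 - np.tanh(z_tilde) ** 2) * (y_d - y)

def edbp_update(theta, deltas, sensitivities, eta):
    """Off-line update (3.46) for a single parameter: accumulate
    delta_j(k) * S_theta_j(k) over the training set, then step."""
    grad = float(np.dot(np.asarray(deltas), np.asarray(sensitivities)))
    return theta - eta * grad

# Toy usage with made-up numbers (N = 2 patterns).
theta_new = edbp_update(0.1, deltas=[0.2, -0.1], sensitivities=[1.0, 0.5],
                        eta=0.01)
```

For on-line training the sum over k collapses to a single term, which is exactly the form (3.53) discussed below.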
The sensitivity S_{θj}^μ for the elements of the unknown vector of the network parameters θ for the j-th neuron in the μ-th layer can be calculated according to the following formulae [44, 144, 41]:
i) sensitivity with respect to the feedback filter parameter a_{pj}^μ:

S_{apj}^{\mu}(k) = -g_{2j}^{\mu} \Big( z_j^{\mu}(k-p) + \sum_{i=1}^{r} a_{ij}^{\mu} S_{apj}^{\mu}(k-i) \Big),    (3.48)

where j = 1, ..., s_μ and p = 1, ..., r;

ii) sensitivity with respect to the feedforward filter parameter b_{pj}^μ:

S_{bpj}^{\mu}(k) = g_{2j}^{\mu} \Big( \varphi_j^{\mu}(k-p) - \sum_{i=1}^{r} a_{ij}^{\mu} S_{bpj}^{\mu}(k-i) \Big),    (3.49)

where φ_j^μ(k) is given by (3.14), j = 1, ..., s_μ, and p = 0, ..., r;

iii) sensitivity with respect to the bias g_{1j}^μ:

S_{g1j}^{\mu}(k) = -1,    (3.50)

where j = 1, ..., s_μ;

iv) sensitivity with respect to the slope parameter g_{2j}^μ:

S_{g2j}^{\mu}(k) = z_j^{\mu}(k),    (3.51)

where j = 1, ..., s_μ;

v) sensitivity with respect to the weight w_{jp}^μ:

S_{wjp}^{\mu}(k) = g_{2j}^{\mu} \Big( \sum_{i=0}^{r} b_{ij}^{\mu} u_p^{\mu-1}(k-i) - \sum_{i=1}^{r} a_{ij}^{\mu} S_{wjp}^{\mu}(k-i) \Big),    (3.52)

where j = 1, ..., s_μ and p = 1, ..., s_{μ-1}.

In many industrial applications, there is a need to perform the training on-line. Then, the update of the network parameters should be done after the presentation of each single pattern. For on-line training, the formula (3.46) takes the form

\theta_j^{\mu}(k+1) = \theta_j^{\mu}(k) - \eta \, \delta_j^{\mu}(k) S_{\theta j}^{\mu}(k).    (3.53)

Such a simplification introduces some disturbances into gradient based algorithms, but they can be neglected for appropriately small values of the parameter η.

3.6.2 Adaptive Random Search
In this section, an Adaptive Random Search (ARS) method for optimisation is considered. The method has the advantages of being simple to implement and broadly applicable. The information required to implement the method is essentially only the input-output data, where the vector of parameters θ is the
input, and the loss function measurement J(θ) (noise-free) or L(θ) (noisy) is the output. The underlying assumptions about J are relatively minimal; in particular, there is no requirement that the gradient of J be computable, or even that the gradient exist, nor that J be unimodal. The algorithm can be used with virtually any function. The user should simply specify the nature of the sampling randomness so as to permit an adequate search of the parameter domain Θ. Assuming that the sequence of solutions θ̂_0, θ̂_1, ..., θ̂_k has already been determined, the next point θ̂_{k+1} is obtained as follows [66]:

\hat{\theta}_{k+1} = \hat{\theta}_k + r_k,    (3.54)

where θ̂_k is the estimate of θ^⋆ at the k-th iteration, and r_k is a perturbation vector generated randomly according to the normal distribution N(0, v). The new solution θ̂_{k+1} is accepted when the cost function J(θ̂_{k+1}) is less than J(θ̂_k); otherwise, θ̂_{k+1} = θ̂_k. To start the optimisation procedure, it is necessary to determine the initial point θ̂_0 and the variance v.

Let θ^⋆ be a global minimum to be located. When θ̂_k is far from θ^⋆, r_k should have a large variance to permit large displacements, which are necessary to escape local minima. On the other hand, when θ̂_k is close to θ^⋆, r_k should have a small variance to permit an exact exploration of the parameter space. The idea of ARS is to alternate two phases: variance selection and variance exploitation [66]. During the variance selection phase, several successive values of v are tried for a given number of iterations of the basic algorithm. Each competing v_i is rated by its performance in the basic algorithm in terms of cost reduction, starting from the same initial point. Each v_i is computed according to the formula

v_i = 10^{-i} v_0, \quad \text{for } i = 1, \ldots, 4,    (3.55)

and is allowed 100/i iterations, so as to give more trials to larger variances. The initial variance v_0 can be determined, e.g. as the spread of the parameter domain:

v_0 = \theta_{max} - \theta_{min},    (3.56)

where θ_max and θ_min are the largest and lowest possible values of the parameters, respectively. The v_i best in terms of the lowest value of the cost function is selected for the variance exploitation phase. The best parameter set θ̂_k and the variance v_i are used in the variance exploitation phase, in which the algorithm (3.54) is typically run for one hundred iterations. The whole procedure terminates when the maximum number of iterations n_max is reached or when the assumed accuracy J_min is obtained. Taking local minima into account, the algorithm can also be stopped when v_4 has been selected a given number of times; this means that the algorithm is stuck in a local minimum and cannot escape its basin of attraction. Apart from its simplicity, the algorithm possesses the property of global convergence; moreover, the adaptive parameters of the algorithm decrease the chance of getting stuck in local minima.
Table 3.2. Outline of ARS

Step 0: Initiation
    Choose θ̂_0, n_max, J_min, v_0; set θ̂_best = θ̂_0, n = 1.
Step 1: Variance selection phase
    Set i = 1, k = 1, θ̂_k = θ̂_0;
    while ( i < 5 ) do
        while ( k ≤ 100/i ) do
            Computation for a trial point; set k = k + 1;
        end while
        Set i = i + 1, k = 1, θ̂_k = θ̂_0;
    end while
Step 2: Variance exploitation phase
    Set k = 1, θ̂_k = θ̂_best, i = i_best;
    while ( k ≤ 100 ) do
        Computation for a trial point; set k = k + 1;
    end while
    if ( n = n_max ) or ( J(θ̂_best) < J_min ) then STOP
    else set θ̂_0 = θ̂_best, n = n + 1, and go to Step 1.

Computation for a trial point:
    Perturb θ̂_k to get θ̂'_k: v_i = 10^{-i} v_0, θ̂'_k = θ̂_k + r_k;
    if ( J(θ̂'_k) ≤ J(θ̂_k) ) then θ̂_{k+1} = θ̂'_k else θ̂_{k+1} = θ̂_k;
    if ( J(θ̂'_k) ≤ J(θ̂_best) ) then θ̂_best = θ̂'_k and i_best = i.
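The procedure of Table 3.2 can be condensed into the following Python sketch. It is a simplified, greedy variant: the bookkeeping of n and the v_4 stopping rule are omitted, and the quadratic test loss is made up.

```python
import numpy as np

def ars(loss, theta0, v0, n_max=20, j_min=1e-8, seed=0):
    """Simplified Adaptive Random Search after Table 3.2: alternate variance
    selection over v_i = 10^-i * v0, i = 1..4, with variance exploitation."""
    rng = np.random.default_rng(seed)
    theta_best = np.asarray(theta0, float)
    i_best = 1

    def run(theta, i, iters):
        # Basic algorithm (3.54): accept the perturbed point if it improves.
        for _ in range(iters):
            cand = theta + rng.normal(0.0, 10.0 ** -i * v0, size=theta.shape)
            if loss(cand) <= loss(theta):
                theta = cand
        return theta

    for _ in range(n_max):
        for i in range(1, 5):                 # variance selection phase
            trial = run(theta_best.copy(), i, 100 // i)
            if loss(trial) < loss(theta_best):
                theta_best, i_best = trial, i
        theta_best = run(theta_best, i_best, 100)   # exploitation phase
        if loss(theta_best) < j_min:
            break
    return theta_best

target = np.array([1.0, -2.0])
theta_hat = ars(lambda t: float(np.sum((t - target) ** 2)),
                theta0=np.zeros(2), v0=1.0)
```

Since only improving points are accepted, the cost sequence is monotonically non-increasing, while the alternation of large and small variances trades exploration against refinement.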
3.6.3 Simultaneous Perturbation Stochastic Approximation

Stochastic Approximation (SA) is a very important class of stochastic search algorithms. It is worth mentioning that the well-known back-propagation algorithm, recursive least squares and some forms of simulated annealing are special cases of stochastic approximation [145]. These methods can be divided into two groups: gradient-free (Kiefer-Wolfowitz) and stochastic gradient based (Robbins-Monro root finding) algorithms [146]. In recent years, a growing interest has been observed in stochastic optimisation algorithms that do not depend on gradient information or measurements. This class of algorithms is based on an approximation to the gradient formed from generally noisy measurements of the loss function. The interest is motivated by problems such as the adaptive control or statistical identification of complex systems, the training of recurrent neural networks, the recovery of images from noisy sensor data, and many more. The general form of the SA recursive procedure is [147]:

\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k),    (3.57)
where \hat{g}_k(\hat{\theta}_k) is the estimate of the gradient ∂J/∂θ̂ based on the measurements L(·) of the loss function J(·) (L(·) being a measurement affected by noise). In the context of neural network training, the loss function can take the form of the sum of squared errors between the desired and network outputs, calculated over the entire set of input patterns (batch or off-line learning). The essential part of (3.57) is the gradient approximation. Simultaneous Perturbation Stochastic Approximation (SPSA) perturbs all the elements of θ̂ randomly at once to obtain two measurements L(·), and each component \hat{g}_{ki}(\hat{\theta}_k) is formed from a ratio involving the individual components of the perturbation vector and the difference between the two corresponding measurements. For two-sided simultaneous perturbation, the gradient estimate is obtained by the formula [148, 43]:

\hat{g}_{ki}(\hat{\theta}_k) = \frac{L(\hat{\theta}_k + c_k \Delta_k) - L(\hat{\theta}_k - c_k \Delta_k)}{2 c_k \Delta_{ki}}, \quad i = 1, \ldots, p,    (3.58)

where the user-specified p-dimensional random perturbation vector Δ_k = (Δ_{k1}, Δ_{k2}, ..., Δ_{kp})^T has components that are independent and symmetrically distributed around 0, with finite inverse moments E(|Δ_{ki}|^{-1}) for all k, i. One of the possible distributions that satisfy these conditions is the symmetric Bernoulli ±1 distribution. On the other hand, two widely used distributions that do not satisfy these conditions are the uniform and the normal ones. The rich literature presents sufficient conditions for the convergence of SPSA (θ̂_k → θ^⋆ in the stochastic, almost sure sense) [145, 149, 147]. However, the efficiency of SPSA depends on the shape of J(θ), the values of the gain sequences {a_k} and {c_k}, and the distribution of {Δ_{ki}}. The choice of the gain sequences is critical for the performance of the algorithm. In SPSA, the gain sequences are calculated as follows [147]:

a_k = \frac{a}{(A + k)^{\alpha}}, \qquad c_k = \frac{c}{k^{\gamma}},    (3.59)

where A, a, c, α and γ are non-negative coefficients. The asymptotically optimal values of α and γ are 1.0 and 1/6, respectively; Spall proposed using 0.602 and 0.101 [148]. It appears that choosing α < 1 usually yields better finite-sample performance. An outline of the basic SPSA algorithm is given in Table 3.3, where n_max is the maximum number of iterations, J_min is the assumed accuracy, and θ̂_0 is the initial vector of the network parameters. In the case of the neural network, the measurements L(·) are calculated using the sum of squared errors between the desired and the actual response of the neural network over the whole learning set. As one can see in (3.58), each iteration of SPSA requires only two measurements of the loss function, in contrast to the well-known standard Kiefer-Wolfowitz stochastic approximation, which uses 2p measurements, while both algorithms achieve the same level of statistical accuracy [148]. In other words, the per-iteration computational burden does not depend on the dimension of the parameter vector θ̂_k to be determined (this dimension may be quite large when neural models are considered). This makes SPSA very suitable and promising for real applications.
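Putting (3.57)–(3.59) together gives a very short algorithm. The sketch below is a minimal illustration: the quadratic test loss is made up, and the small gain a = 0.05 follows the recommendation for dynamic networks discussed below.

```python
import numpy as np

def spsa_minimise(loss, theta0, n_iter=500, a=0.05, A=10.0, c=0.1,
                  alpha=0.602, gamma=0.101, seed=0):
    """Basic SPSA: gains (3.59), two-sided gradient estimate (3.58),
    and the stochastic approximation update (3.57)."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, float)
    for k in range(1, n_iter + 1):
        ak = a / (A + k) ** alpha                    # gain sequences (3.59)
        ck = c / k ** gamma
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli +/-1
        g_hat = (loss(theta + ck * delta)
                 - loss(theta - ck * delta)) / (2.0 * ck * delta)  # (3.58)
        theta = theta - ak * g_hat                   # SA update (3.57)
    return theta

target = np.array([0.5, -0.5])
theta_hat = spsa_minimise(lambda t: float(np.sum((t - target) ** 2)),
                          theta0=np.zeros(2))
```

Note that each iteration evaluates the loss exactly twice, regardless of the dimension of θ, which is the key advantage over the Kiefer-Wolfowitz scheme.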
3.6 Training of the Network
Table 3.3. Outline of the basic SPSA algorithm

Step 0: Initiation
  Choose θ̂_0, n_max, J_min, A, a, c, α and γ; set k := 1
Step 1: Generation of the perturbation vector
  Calculate a_k and c_k using (3.59); generate a p-dimensional random vector Δ_k using the Bernoulli ±1 distribution
Step 2: Loss function evaluations
  Obtain two measurements of the loss function around θ̂_k: L(θ̂_k + c_k Δ_k) and L(θ̂_k − c_k Δ_k)
Step 3: Gradient approximation
  Generate g(θ̂_k) according to (3.58)
Step 4: Update of the parameter estimates
  Use the recursive SA form (3.57) to update θ̂_k to a new value θ̂_(k+1); set k := k + 1
Step 5: Termination criteria
  If (quality ≤ J_min) or (number of iterations ≥ n_max), then STOP; otherwise go to Step 1
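The steps of Table 3.3 can be condensed into a short routine; below is a minimal Python sketch (the function and parameter names are illustrative, not from the book):

```python
import numpy as np

def spsa_minimize(loss, theta0, n_max=1000, j_min=1e-8,
                  a=0.2, A=0.0, c=0.01, alpha=0.602, gamma=0.101, seed=0):
    """Basic SPSA of Table 3.3: two loss measurements per iteration."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for k in range(1, n_max + 1):
        a_k = a / (A + k) ** alpha                  # gain sequences (3.59)
        c_k = c / k ** gamma
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli +/-1
        l_plus = loss(theta + c_k * delta)          # two measurements of L
        l_minus = loss(theta - c_k * delta)
        g_hat = (l_plus - l_minus) / (2.0 * c_k * delta)  # gradient approximation (3.58)
        theta -= a_k * g_hat                        # SA recursion (3.57)
        if min(l_plus, l_minus) < j_min:            # simple termination check
            break
    return theta
```

For a quadratic loss such as Σθ_i², the iterate drifts towards the minimiser even though no exact gradient is ever computed, using only two loss evaluations per step regardless of the dimension of θ.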
It is also possible to apply SPSA to global optimisation [150, 151]. Two solutions are reported in the literature: using injected noise, and using a stepwise (slowly decaying) sequence {c_k}. In the latter case, the parameter γ controls the decay rate of the sequence {c_k} and can be set to a small value to enable global optimisation. The dynamic neural network is very sensitive to large changes in parameter values (dynamic filters), and large values of a can make the learning process diverge. Therefore, it is recommended to use a relatively small initial value of a, e.g. a = 0.05. Many papers report successful applications of SPSA to queueing systems, pattern recognition, neural network training, parameter estimation, etc. For a survey, the interested reader is referred to [148].

3.6.4 Comparison of Training Algorithms
All training methods are implemented in Borland C++ Builder Enterprise Suite Ver. 5.0. Simulations are performed using a PC with an Athlon K7 550 processor and 128 MB of RAM. To check the efficiency of the training methods, the following examples are studied:

Example 3.8. Modelling of an unknown dynamic system. The second-order linear process under consideration is described by the following transfer function [152, 42]:

G(s) = ω / ((s + a)^2 + ω^2).    (3.60)
3 Locally Recurrent Neural Networks
Its discrete form is given by

y_d(k) = A_1 y_d(k − 1) + A_2 y_d(k − 2) + B_1 u(k − 1) + B_2 u(k − 2).    (3.61)
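With the parameter values used below (a = 1, ω = 2π/2.5, T = 0.5 s), the coefficients of (3.61) can be reproduced by zero-order-hold discretisation of (3.60). A sketch using only the standard library (the helper name is illustrative; ZOH is assumed, which matches the quoted coefficients to about 10^-5):

```python
import math

def discretize(a, w, T):
    """ZOH discretisation of G(s) = w / ((s + a)^2 + w^2) into the
    difference equation y(k) = A1*y(k-1) + A2*y(k-2) + B1*u(k-1) + B2*u(k-2)."""
    K = w / (a * a + w * w)                        # DC gain of G(s)
    def step(t):                                   # continuous-time step response
        return K * (1.0 - math.exp(-a * t)
                    * (math.cos(w * t) + (a / w) * math.sin(w * t)))
    A1 = 2.0 * math.exp(-a * T) * math.cos(w * T)  # poles mapped by z = exp(sT)
    A2 = -math.exp(-2.0 * a * T)
    B1 = step(T)                 # a ZOH model matches the step response at samples
    B2 = step(2.0 * T) - A1 * step(T) - B1
    return A1, A2, B1, B2
```

Calling `discretize(1.0, 2*math.pi/2.5, 0.5)` returns values agreeing with the A_1, A_2, B_1, B_2 quoted in the text.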
Assuming that the parameters of the process (3.60) are a = 1 and ω = 2π/2.5, and the sampling time is T = 0.5 s, the coefficients of equation (3.61) are A_1 = 0.374861, A_2 = −0.367879, B_1 = 0.200281 and B_2 = 0.140827. Taking into account the structure of the dynamic neuron (3.14)–(3.17), only one neuron with the second-order IIR filter and the linear activation function is required to model this process. The training of the dynamic neuron was carried out using the off-line EDBP, ARS and SPSA algorithms. In order to compare the different learning methods, the assumed accuracy is set to 0.01 and several performance indices are observed: the Sum of Squared Errors (SSE), the number of Floating-point Operations (FO), the number of Network Evaluations (NE), and the training time. The learning data are generated by feeding a random signal of uniform distribution with |u(k)| ≤ (a^2 + ω^2)/ω to the process and recording its output. In this way, a training set containing 200 patterns is generated. After the training, the behaviour of the neural model was checked using the step signal u(k) = (a^2 + ω^2)/ω.

EDBP algorithm. The learning process was carried out off-line. To speed up the convergence of learning, an adaptive learning rate was used. The initial value of the learning rate η was equal to 0.005. The initial network parameters were chosen randomly using a uniform distribution from the interval [−0.5; 0.5]. Figure 3.18 shows the course of the output error. The assumed accuracy is reached after 119 algorithm iterations.

ARS algorithm. The next experiment was performed using the ARS algorithm. As in the previous example, the initial network parameters were generated randomly using a uniform distribution in the interval [−0.5; 0.5]. The initial variance v_0 is 0.1. In Fig. 3.18, one can see the error course for this example. The assumed accuracy is achieved after 9 iterations. The initial value of v_0 is very important for the convergence of the learning.
When this value is too small, e.g. 0.0001, the convergence is very slow. On the other hand, when the value of v_0 is too large, e.g. 10, many cost evaluations are performed at very large variances, which results in too chaotic a search. Such steps do not improve the performance of the algorithm and significantly prolong the learning time.

SPSA algorithm. In the last training example the SPSA algorithm was used. The initial parameters were generated randomly with a uniform distribution in the interval [−0.5; 0.5]. After some experiments, the algorithm parameters which assure quite fast convergence were found to be a = 0.001, A = 0, c = 0.02, α = 0.35 and γ = 0.07. The learning results are shown in Fig. 3.18. The assumed accuracy is obtained after 388 iterations.

Discussion. All the algorithms considered reached the assumed accuracy. It must be taken into account, however, that these methods need different numbers of floating-point operations per iteration as well as different numbers of network
Table 3.4. Characteristics of the learning methods

Characteristics   EDBP        ARS         SPSA
Learning time     2.67 sec    10.06 sec   3.99 sec
Iterations        119         9           388
SSE               0.0099      0.006823    0.00925
FO                2.26·10^6   7.5·10^6    2.9·10^6
NE                119         2772        776
FO/iteration      1.89·10^4   8.33·10^5   8.58·10^3
NE/iteration      1           308         2
evaluations in order to calculate the values of the cost function. The characteristics of the algorithms are shown in Table 3.4. ARS reached the assumed accuracy in the lowest number of iterations, but during one algorithm step it has to perform many more floating-point operations than the other algorithms. This is caused by a large number of network evaluations (see Table 3.4). Therefore, the learning time for this algorithm is the greatest. In turn, EDBP uses the smallest number of network evaluations, but calculating a gradient is a time-consuming operation. As a result, the simplest algorithm, taking into account the number of floating-point operations per iteration, is SPSA. However, SPSA only approximates the gradient, and it needs to perform more algorithm steps to obtain an accuracy similar to that of the gradient-based algorithm. For this simple example, EDBP is the most effective algorithm. It should be kept in mind, however, that the examined system is a linear one and the error surface has only one minimum. The next example shows the behaviour of the learning methods in a non-linear dynamic case.

Example 3.9. Modelling of a sugar actuator. The actuator to be modelled is described in detail in Section 8.1. In Fig. 8.1 this device is marked by the dotted square. For the actuator, LC51_03.CV denotes the control signal (actuator input), and F51_01 is the juice flow on the inlet to the evaporation station (actuation). With these two signals, the neural model of the actuator can be defined as

F51_01 = F_N(LC51_03.CV),    (3.62)

where F_N denotes a non-linear function.

Experiment. During the experiment, a locally recurrent network composed of neurons with the IIR filter, of the structure N^2_{1,5,1} (two processing layers, one input, five neurons in the hidden layer and one output), was trained using in turn the EDBP, ARS and SPSA methods.
Fig. 3.18. Learning error for different algorithms (sum of squared errors vs. iterations; curves for EDBP, ARS and SPSA)
2160 training samples per monitored process variable are collected. For many industrial processes, measurement noise is of a high frequency [153]. Therefore, to eliminate the noise, a second-order low-pass filter of the Butterworth type was used. Moreover, the input samples were normalised to zero mean and unit standard deviation. In turn, the output data should be transformed taking into consideration the response range of the output neurons. For the hyperbolic tangent activation function, this range is [−1; 1]. To perform such a transformation, simple linear scaling can be used. Additionally, to avoid the saturation of the activation functions, the output was transformed into the range [−0.8; 0.8]. Note that if the network is to be used with other data sets, it is necessary to memorise the maximum and minimum values of the training sequence. To perform the experiments, two data sets were used. The first set, containing 500 samples, was used for training, and the other one, containing 1000 samples, was used to check the generalisation ability of the networks.

EDBP algorithm. The algorithm was run over 20 times with different initial network parameter settings. The learning process was carried out off-line for 5000 steps. To speed up the convergence of learning, an adaptive learning rate was used. The initial value of the learning rate η was 0.005. The obtained accuracy is 0.098. To check the quality of the modelling, the neural model was tested using another data set of 1000 samples. Figure 3.19(a) shows the testing phase of the neural model. As can be seen, the generalisation abilities of the dynamic network are quite good.

ARS algorithm. Many experiments were performed to find the best value of the initial variance v_0. Eventually, this value was found to be v_0 = 0.05. With this initial variance the training was carried out for 200 iterations. The modelling results for the testing set are presented in Fig. 3.19(b).
The characteristics of the algorithm are included in Table 3.5. ARS is time-consuming, but it can find a better solution than EDBP. The influence of the initial network parameters was examined, too. The most frequently used range of parameter values is [−1; 1]. The simulations show that narrower intervals, e.g. [−0.7; 0.7] or [−0.5; 0.5], assure faster convergence.
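The preprocessing described above (second-order Butterworth low-pass filtering, zero-mean/unit-variance input normalisation, and linear scaling of the output into [−0.8, 0.8]) can be sketched as follows, assuming SciPy is available; the cut-off frequency is an illustrative value, not taken from the book:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(u_raw, y_raw, fs=0.1, cutoff=0.02):
    """Filter and scale raw process data for network training.
    fs: sampling frequency in Hz (10 s sampling time -> 0.1 Hz)."""
    b, a = butter(2, cutoff / (fs / 2.0))           # 2nd-order Butterworth low-pass
    u = filtfilt(b, a, np.asarray(u_raw, float))    # zero-phase filtering
    y = filtfilt(b, a, np.asarray(y_raw, float))
    u = (u - u.mean()) / u.std()                    # zero mean, unit std
    y_min, y_max = y.min(), y.max()                 # memorise for use with new data
    y = -0.8 + 1.6 * (y - y_min) / (y_max - y_min)  # scale into [-0.8, 0.8]
    return u, y, (y_min, y_max)
```

The returned (y_min, y_max) pair plays the role of the memorised extreme values mentioned in the text, needed to scale any further data set consistently.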
Fig. 3.19. Testing phase: EDBP (a), ARS (b) and SPSA (c). Actuator output (black) and neural model output (grey) vs. time.
SPSA algorithm. This algorithm is a simple and very fast procedure. However, the choice of proper parameters is not a trivial problem. There are five parameters which have a crucial influence on the convergence of SPSA. In spite of the speed of the algorithm, the user may need a lot of time to select proper values. Sometimes it is very difficult to find good values and the algorithm fails. The experiment was carried out for 7500 iterations using the following parameters: a = 0.001, A = 100, c = 0.01, α = 0.25 and γ = 0.05. The modelling results for the testing set are presented in Fig. 3.19(c). The parameter γ controls the decay rate of the sequence {c_k} and is set to a small value to enable global optimisation. The parameter a is set to a very small value to assure the convergence of the algorithm. The dynamic neural network is very sensitive to large changes in parameter values (dynamic filters), and large values of a, such as 0.4, can make the learning process diverge. Taking into account that the first value of the sequence a_k is small, the parameter α is set to 0.25 (the asymptotically optimal value is 1; Spall [148] proposes to use 0.602). In spite of the difficulties in selecting the algorithm parameters, the modelling results are quite good. Moreover, the generalisation ability in this case is better than for both EDBP and ARS.
Table 3.5. Characteristics of the learning methods

Characteristics   EDBP        ARS          SPSA
Learning time     12.79 min   33.9 min     10.1 min
Iterations        5000        200          7500
SSE – training    0.098       0.07         0.0754
SSE – testing     0.564       0.64         0.377
FO                3.1·10^9    1.6·10^10    2.5·10^9
NE                5000        61600        15000
FO/iteration      6.2·10^5    8·10^7       3.3·10^5
NE/iteration      1           308          2
Discussion. The characteristics of the learning methods are shown in Table 3.5. The best accuracy for the training set was obtained using ARS. A slightly worse result was achieved using SPSA, and the worst quality was obtained using EDBP. In this example, the actuator is described by a non-linear dynamic relation, and the algorithms belonging to global optimisation techniques performed their task better than the gradient-based one. At the same time, SPSA is much faster than ARS. Moreover, given the similar training accuracy, the generalisation ability of the neural model trained by SPSA is much better than that of the model trained by ARS.
3.7 Summary

This chapter describes different neural architectures adequate for control applications, especially the modelling and identification of dynamic processes. The well-known and commonly used structures, starting from simple feedforward networks with delays and finishing with more sophisticated recurrent architectures, are presented and discussed in detail. Each of these structures has its advantages and disadvantages. Feedforward networks appeal through their simplicity and good approximation abilities. On the other hand, globally recurrent networks have a more complex structure but exhibit naturally rich dynamic behaviour. The third group, locally recurrent globally feedforward networks, may be placed in the middle. They have an architecture similar to that of feedforward networks and the dynamic properties of recurrent ones. They seem very attractive, because they combine the best features of both feedforward and globally recurrent networks. However, these neural networks are still relatively little known, and therefore require more scientific research, including stability analysis, the application of more effective learning procedures, robustness investigation, etc. The structure of the neuron with the IIR filter and its mathematical description are presented in detail. A comprehensive analysis of equilibrium points is carried out together with a discussion of the aspects of observability and controllability. Based on dynamic neuron models, a neural network can be designed. For the locally recurrent network considered, a state-space representation required for
both stability and approximation ability discussions is derived. These problems are discussed in detail further in this monograph. For locally recurrent networks, several training algorithms are proposed. The fundamental training method is a gradient descent algorithm utilising the backpropagation error scheme. This algorithm, called extended dynamic backpropagation, has both off-line and on-line forms, and therefore it can be widely used in control applications. The identification of dynamic processes, however, is an example where the training of the neural network is not a trivial problem. The error function is strongly multimodal, and during training the EDBP algorithm often gets stuck in local minima. Even multi-starting EDBP cannot yield the expected results. Therefore, other methods belonging to the class of global optimisation techniques are investigated. To tackle this problem, two stochastic approaches are proposed, and comparative studies between the proposed methods and the off-line version of the gradient-based algorithm are carried out, taking into account both simulated and real data sets. The first stochastic method is the ARS algorithm, which is very simple. This algorithm can be very useful in engineering practice, because the user has to determine only one parameter to start the optimisation procedure. The second stochastic method, SPSA, is much faster than ARS, but five parameters have to be determined to start the optimisation process. To define these values, the user should possess quite extensive knowledge about the method in order to use it properly. The performed simulations show that stochastic approaches can be effective alternatives to gradient-based methods. Taking into account the property of global optimisation, both stochastic approaches can be effectively used for the modelling of non-linear processes. Locally recurrent networks are a very attractive modelling tool.
However, to use them properly, some of their properties should be investigated more deeply. The next chapter contains original research results which deal with the approximation abilities of a special class of discrete-time locally recurrent neural networks.
4 Approximation Abilities of Locally Recurrent Networks
In the last decade, a growing interest in locally recurrent networks has been observed. This class of neural networks, due to their interesting properties, has been successfully applied to solve problems from different scientific and engineering areas. Cannas and co-workers [154] applied a locally recurrent network to learn the attractors of Chua's circuit, as a paradigm for studying chaos. The modelling of continuous polymerisation and neutralisation processes is reported in [155]. In turn, a three-layer locally recurrent neural network was successfully applied to the control of non-linear systems in [132]. In the framework of fault diagnosis, the literature reports many applications, e.g. a fault diagnosis scheme to detect and diagnose a transient fault in a turbine waste gate of a diesel engine [109], an observer-based fault detection and isolation system of a three-tank laboratory system [39], or model-based fault diagnosis of sensor and actuator faults in a sugar evaporator [26]. Tsoi and Back [38] compared and applied different architectures of locally recurrent networks to the prediction of speech utterances. Finally, Campolucci and Piazza [156] elaborated an intrinsic stability control method for a locally recurrent network designed for signal processing. Most theoretical studies on locally recurrent networks are focused on training algorithms, stability problems or the convergence of the network to the equilibria [77]. The literature on the approximation abilities of such networks is rather scarce. Interesting results on the approximation capabilities of discrete-time recurrent networks were elaborated by Jin and co-workers [157]. A completely different approach was used by Garzon and Botelho [158] to explore the problem of approximating real-valued functions by recurrent networks, both analog and discrete. Unfortunately, both approaches are dedicated to globally recurrent networks.
K. Patan: Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes, LNCIS 377, pp. 65–75, 2008. © Springer-Verlag Berlin Heidelberg 2008, springerlink.com

The chapter proposes a generalisation of the method presented in [157] to locally recurrent neural networks, based on the well-known universal approximation theorem for multi-layer feedforward networks [33, 32, 159, 160]. The works [33, 32, 159, 160] present several assumptions under which multi-layer feedforward networks are universal approximators. Hornik and co-workers, for example, proved that networks with arbitrary squashing activation functions are capable of approximating any function [32]. In turn, the authors of [159] showed
that a multi-layer feedforward network can approximate any continuous function to any degree of accuracy if and only if the network’s activation functions are non-polynomial. The chapter is organized as follows: in Section 4.1, modelling properties of a single dynamic neuron are presented. The dynamic neural network and its representation in the state-space are described in Section 4.1.1. Some preliminaries required to show approximation abilities of the proposed network are discussed in Section 4.2. The main result concerning the approximation of state-space trajectories is presented in Section 4.3. Section 4.4 illustrates the identification of a real technological process using the locally recurrent networks considered. The chapter concludes with some final remarks in Section 4.5.
4.1 Modelling Properties of the Dynamic Neuron

Let us assume that the non-linear model is given by the sigmoidal function, described by

σ(z(k)) = 1 / (1 + exp(−z(k))).    (4.1)

Expanding σ(·) in a Taylor series around z = 0, one obtains

σ(z(k)) = 1/2 + (1/4) z(k) − (1/48) z^3(k) + (1/480) z^5(k) − ...    (4.2)
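The coefficients in (4.2) can be checked numerically: near z = 0 the truncated series should agree with the sigmoid up to the neglected O(z^7) term. A small sketch (helper names are illustrative):

```python
import math

def sigmoid(z):
    """Sigmoidal function (4.1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_taylor(z):
    """Truncated Taylor expansion (4.2): 1/2 + z/4 - z^3/48 + z^5/480."""
    return 0.5 + z / 4.0 - z ** 3 / 48.0 + z ** 5 / 480.0
```

For |z| ≤ 0.1 the two functions agree to better than 10^-9, consistent with the neglected seventh-order term.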
Accordingly, the input-output relation for the dynamic neuron with the second-order filter, and with z(k) represented by (3.15), is given in the form

y(k) = c_0 + c_1 ϕ(k) + c_2 ϕ(k − 1) + c_3 ϕ(k − 2) + c_4 z(k − 1) + c_5 z(k − 2)
     + c_6 ϕ^3(k) + c_7 ϕ^2(k) ϕ(k − 1) + c_8 ϕ^2(k) ϕ(k − 2) + c_9 ϕ^2(k) z(k − 1)
     + c_10 z^3(k) + ...,    (4.3)

where ϕ is the weighted sum of inputs calculated according to (3.14), and the c_i are parameters which are functions of the neuron parameters:

c_0 = 1/2,  c_1 = (1/4) b_0,  c_2 = (1/4) b_1,  c_3 = (1/4) b_2,  c_4 = −(1/4) a_1,  c_5 = −(1/4) a_2,
c_6 = (1/48) b_0^3,  c_7 = (1/48) b_0^2 b_1,  c_8 = (1/48) b_0^2 b_2,  c_9 = −(1/48) b_0^2 a_1,  c_10 = −(1/48) a_1^3.
If a number of such models is connected into a multi-layer structure, this result can be extended to higher level network approximations and other non-linear functions. Thus, a powerful approximating tool may be obtained. Modelling capabilities of a dynamic neural network are studied in the forthcoming sections.
4.1.1 State-Space Representation of the Network
A locally recurrent network with only one hidden layer is represented by a linear state equation [48]. Thus, its ability to approximate non-linear mappings is limited. Therefore, in this section a network with two hidden layers is taken into account. Let us consider a discrete-time dynamic neural network with n inputs and m outputs, with two hidden layers described by (3.41) and (3.42). Using the augmented state vector x(k) = [x^1(k) x^2(k)]^T, the state equation (3.41) may be represented in the following form:

x(k + 1) = A x(k) + W_1 σ(W_2 x(k) + W_3 u(k) + W_4) + B u(k),    (4.4)

where σ(·) is a continuously differentiable sigmoidal vector-valued function, and

A = [ A^1  0 ; 0  A^2 ],   B = [ W^1 ; 0 ],   W_1 = [ 0 ; W^2 ],
W_2 = [ G^12 B^1   0 ],   W_3 = G^12 D^1,   W_4 = −G^12 g^11.
4.2 Preliminaries

To prove the approximation abilities of the neural network considered, some necessary preliminaries should be provided.

Definition 4.1. Let S ⊂ R^n and U ⊂ R^m be open sets. A mapping f : S × U → R^n is said to be Lipschitz in x on S × U if there exists a constant L > 0 such that

‖f(x_1, u) − f(x_2, u)‖ ≤ L ‖x_1 − x_2‖    (4.5)

for all x_1, x_2 ∈ S and any u ∈ U, where L is a Lipschitz constant of f(x, u). We call f locally Lipschitz in x if each point of S has a neighbourhood S_0 ⊂ S such that the restriction of f to S_0 × U is Lipschitz in x.

Lemma 4.2. Let S ⊂ R^n and U ⊂ R^m be open sets and a mapping f : S × U → S be C^1 (continuously differentiable). Then f is locally Lipschitz in x. Moreover, if D_x ⊂ S and D_u ⊂ U are compact (closed and bounded) sets, then f is Lipschitz in x on the set D_x × D_u.

Proof. For the proof, see Hirsch and Smale [161], pages 163 and 173.
Lemma 4.3. Let S ⊂ R^n and U ⊂ R^m be open sets, f, f̂ : S × U → S Lipschitz continuous mappings, L a Lipschitz constant of f̂(x, u) in x on S × U, and for all x ∈ S and u ∈ U let

‖f(x, u) − f̂(x, u)‖ < ε.    (4.6)

If x(k) and z(k) are solutions of the difference equations

x(k + 1) = f(x(k), u(k))

and

z(k + 1) = f̂(z(k), u(k)),

with an initial condition x(0) = z(0) ∈ S, then

‖x(k) − z(k)‖ < ε a_k,    k ≥ 0,    (4.7)

where a_k = 1 + L a_{k−1} with a_0 = 0.
Proof. For the proof, see Jin et al. [157].
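The amplification sequence a_k in (4.7) grows geometrically with the Lipschitz constant: unrolling a_k = 1 + L a_{k−1}, a_0 = 0, gives a_k = 1 + L + ... + L^{k−1} = (L^k − 1)/(L − 1) for L ≠ 1. A quick check (the helper name is illustrative):

```python
def a_seq(L, k_max):
    """Recursion a_k = 1 + L * a_{k-1}, a_0 = 0, from Lemma 4.3."""
    a = [0.0]
    for _ in range(k_max):
        a.append(1.0 + L * a[-1])
    return a
```

For L = 2 the sequence is 0, 1, 3, 7, 15, ..., matching the closed form (2^k − 1)/(2 − 1); this growth is why the approximation accuracy ε_1 in the proof of Theorem 4.5 must shrink with the horizon I.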
Lemma 4.4. Let S ⊂ R^n and U ⊂ R^m be open sets, and g : S × U → S a Lipschitz continuous mapping; then the mapping of the form ḡ(x, u) = Ax + g(x, u) + Bu is also Lipschitz in x on S × U.

Proof. From Definition 4.1 one obtains

‖Ax_1 + g(x_1, u) + Bu − Ax_2 − g(x_2, u) − Bu‖ = ‖A(x_1 − x_2) + g(x_1, u) − g(x_2, u)‖
  ≤ ‖A‖ ‖x_1 − x_2‖ + ‖g(x_1, u) − g(x_2, u)‖
  ≤ ‖A‖ ‖x_1 − x_2‖ + L ‖x_1 − x_2‖ = L_1 ‖x_1 − x_2‖,

where L_1 = ‖A‖ + L is a Lipschitz constant of ḡ and L is a Lipschitz constant of g.
4.3 Approximation Abilities

This section presents the main result concerning the approximation ability of the dynamic neural network under consideration. The theorem presented below is an extension of the result elaborated in [157], and utilises the universal approximation theorem for multi-layer feedforward neural networks [33, 32].

Theorem 4.5. Let S ⊂ R^n and U ⊂ R^m be open sets, D_s ⊂ S and D_u ⊂ U compact sets, Z ⊂ D_s an open set, and f : S × U → R^n a C^1 vector-valued function. For a discrete-time non-linear system of the form

z(k + 1) = f(z(k), u(k)),    z ∈ R^n, u ∈ R^m,    (4.8)

with an initial state z(0) ∈ Z, for arbitrary ε > 0 and an integer 0 < I < +∞, there exist integers v_1, v_2 and a neural network of the form (4.4) with an appropriate initial state x(0) such that for any bounded input u : R_+ = [0, ∞) → D_u

max_{0 ≤ k ≤ I} ‖z(k) − x(k)‖ < ε.    (4.9)
Proof. From Lemma 4.2 one knows that f(z, u) is Lipschitz in z on D_s × D_u with a constant L; let a_k = 1 + L a_{k−1}, a_0 = 0. For a given number ε, define

ε_1 = ε / a_I.    (4.10)

Using the universal approximation theorem for multi-layer feedforward neural networks [33, 32], one knows that any continuous function can be uniformly approximated by a continuous neural network having only one hidden layer and continuous sigmoidal functions [33]. Thus, there exist matrices W_1, W_2, W_3 and W_4 and an integer N̄ such that

‖f̄(z, u) − W_1 σ(W_2 z + W_3 u + W_4)‖ < ε_1,    (4.11)

where f̄(z, u) = f(z, u) − Az − Bu; thus

‖f(z, u) − Az − Bu − W_1 σ(W_2 z + W_3 u + W_4)‖ < ε_1.    (4.12)

Let us define a vector-valued function g(z, u):

g(z, u) = Az + Bu + W_1 σ(W_2 z + W_3 u + W_4),    (4.13)

and let (4.12) be expressed in the form

‖f(z, u) − g(z, u)‖ < ε_1.    (4.14)

According to Lemma 4.4, g(z, u) is Lipschitz in z. Assume that z ∈ D_s and η ∈ R^{N_1 + N_2} are solutions of the following difference equations:

z(k + 1) = f(z(k), u(k)),    (4.15)
η(k + 1) = g(η(k), u(k)),    (4.16)

with an initial condition z(0) = η(0) = z_0 ∈ Z. Thus, using Lemma 4.3,

‖z(k) − η(k)‖ < ε_1 a_k ≤ ε_1 a_I,

and hence

‖z(k) − η(k)‖ < ε.
Finally, comparing (4.13) with (4.4), one can see that x(k) = η(k), and the theorem is proved.

Remark 4.6. The theorem applies in particular to functions that are of the C^1 class, or even continuously differentiable only in x, for such an f is locally Lipschitz in x.

Remark 4.7. Approximation is performed on a finite closed interval [0, I].

Remark 4.8. Non-linearities incorporated in the neurons of the second layer do not affect the state equation (3.41). Following this line of reasoning, the neuron models of the second layer can be simplified. Firstly, these neurons can have a linear character, so non-linear functions no longer have to appear in the neuron structure. Secondly, the IIR filters can be replaced by FIR ones. These modifications do not influence the form of the state equation, but the neuron structure is much simpler and there is a smaller number of adjustable parameters.
Fig. 4.1. The i-th neuron of the second layer (input u(k), state x^i(k) with unit delay, matrices W^i, A^i and C^i, output y^i(k))
The modified neuron structure is shown in Fig. 4.1. If the output matrix C^i = I, then the output produced by the neuron is y^i(k) = x^i(k). For further analysis, let us consider the modified network structure with the state represented as follows:

x^1(k + 1) = A^1 x^1(k) + W^1 u(k),    (4.17a)
x̄^2(k + 1) = Ā^2 x̄^2(k) + W̄^2 σ(G^12 (B^1 x^1(k) + D^1 u(k) − g^11)) + W^u u(k),    (4.17b)

where x̄^2 ∈ R^{N̄_2}, Ā^2 ∈ R^{N̄_2 × N̄_2}, W̄^2 ∈ R^{N̄_2 × v_1}, W^u ∈ R^{N̄_2 × n}, and the neurons of the second layer receive excitation not only from the neurons of the previous layer but also from the external inputs (Fig. 4.2). According to Remark 4.8, the first layer includes neurons with IIR filters while the second one consists of neurons with FIR filters. In this case, the second layer of the network is not a hidden one, contrary to the original structure presented in Fig. 3.17. The following corollary presents the approximation abilities of the modified neural network:
Corollary 4.9. Let S ⊂ R^n and U ⊂ R^m be open sets, D_s ⊂ S and D_u ⊂ U compact sets, Z ⊂ D_s an open set, and f : S × U → R^n a C^1 vector-valued function. For a discrete-time non-linear system of the form

z(k + 1) = f(z(k), u(k)),    z ∈ R^n, u ∈ R^m,    (4.18)

with an initial state z(0) ∈ Z, for arbitrary ε > 0 and an integer 0 < I < +∞, there exist integers v_1 and v_2 and a neural network of the form (4.17) with
Fig. 4.2. Cascade structure of the modified dynamic neural network (neurons with IIR filters in the first layer, neurons with FIR filters in the second)
an appropriate initial state x(0) such that for any bounded input u : R_+ = [0, ∞) → D_u

max_{0 ≤ k ≤ I} ‖z(k) − x̄^2(k)‖ < ε.    (4.19)
Proof. Let us decompose the vector x̄^2 into η^1 ∈ R^{N_1} and η^2 ∈ R^{N_2}; then (4.17b) can be rewritten in the following form:

η^1(k + 1) = A^{21} η^1(k) + W^{21} σ(G^12 (B^1 x^1(k) + D^1 u(k) − g^11)) + W^{u1} u(k),
η^2(k + 1) = A^{22} η^2(k) + W^{22} σ(G^12 (B^1 x^1(k) + D^1 u(k) − g^11)) + W^{u2} u(k),    (4.20)

where A^{21} ∈ R^{N_1 × N_1}, A^{22} ∈ R^{N_2 × N_2}, W^{21} ∈ R^{N_1 × v_1}, W^{22} ∈ R^{N_2 × v_1}, W^{u1} ∈ R^{N_1 × n} and W^{u2} ∈ R^{N_2 × n}. If the weight matrices are given as follows:

W^{21} = 0,   W^{22} = W^2,   A^{21} = A^1,   A^{22} = A^2,   W^{u1} = W^1,   W^{u2} = 0,

the state equation (4.20) takes the form

η^1(k + 1) = A^1 η^1(k) + W^1 u(k),
η^2(k + 1) = A^2 η^2(k) + W^2 σ(G^12 (B^1 x^1(k) + D^1 u(k) − g^11)).    (4.21)
If the vectors η^1 and η^2 represent the states x^1 and x^2, respectively, the system (4.21) is equivalent to (3.41), and by using Theorem 4.5 the corollary is proved.

Remark 4.10. The network structure (4.17) is not a strictly feedforward one, as it has a cascade structure. The introduction of the additional weight matrix W^u makes it possible to obtain a system equivalent to (3.41), but the main advantage of this representation is that the whole state vector is available from the neurons of the second layer of the network. This fact is of crucial importance for the training of the neural network. If the output y(k) is

y(k) = x̄^2(k),    (4.22)
then the weight matrices can be determined using a training process which minimises the error between the network output and the measurable states of the process.

Remark 4.11. Usually, in engineering practice, not all process states are directly available (measurable). In such cases, the dimension of the output vector is lower than the dimension of the state vector, and the network output can be produced in the following way:

y(k) = C x̄^2(k).    (4.23)

In such cases, the cascade neural network contains an additional layer of static linear neurons playing the role of the output layer (Fig. 4.3). This neural structure has two hidden layers containing neurons with IIR and FIR filters, respectively, and an output layer with static linear units.
Fig. 4.3. Cascade structure of the modified dynamic neural network with an output layer of static linear neurons
4.4 Process Modelling

Example 4.12. Modelling of a sugar actuator (revisited). To illustrate the modelling abilities of the neural structures investigated in the previous sections, the sugar actuator discussed in Section 3.6.4 is revisited. During the experiment two neural structures were examined: the two-layer locally recurrent network described by (3.41) and (3.42), and the cascade locally recurrent network described by (4.17). Both neural networks were trained using the ARS algorithm. The initial values of the network parameters were generated randomly from the interval [−0.5, 0.5] using a uniform distribution. Since the largest and smallest values of the network parameters are unknown, one cannot use (3.56) to set the initial value of v_0. Therefore, the value of v_0 was selected experimentally, and v_0 equal to 0.1 assured satisfactory learning results. The identification results are given in Tables 4.1 and 4.2. The training and testing sets were formed to be disjoint. The training set consisted of 1000 samples. To evaluate the generalisation ability of the networks, three different testing sets were applied. The first set (T1) consisted of 20000 samples; the second (T2) and third (T3) ones contained 5000 samples each. To find the best performing model, many network configurations were checked. The purpose of model selection is to identify the model that fits a given data set best. Several information criteria can be used to accomplish this task [162], e.g. the Akaike Information Criterion (AIC). The criterion, which determines model complexity by minimising an information-theoretic function f_AIC, is defined as follows:

f_AIC = log(J) + 2K/N,    (4.24)

where K is the number of model parameters and J is the sum of squared errors between the desired outputs (y_i^d) and the network outputs (y_i), defined as follows:

J = Σ_{i=1}^{N} (y_i^d − y_i)^2,    (4.25)
where N is the number of samples used to compute J. In Table 4.1, the notation n − v − m(r1 , r2 ) represents a cascade network with n inputs, v hidden
neurons with the r1-th order IIR filter and m output neurons with the FIR filter of the r2-th order. The notation n − v1 − v2 − m(r) in Table 4.2 stands for a two-layer locally recurrent network with n inputs, v1 neurons in the first hidden layer, v2 neurons in the second hidden layer, m linear static output neurons, and each hidden neuron containing an r-th order IIR filter. All hidden neurons have hyperbolic tangent activation functions. The best result for each data set is marked with an asterisk. Let us analyze the results for the cascade network given in Table 4.1. The best results for the training set were obtained for the dynamic networks containing only one hidden neuron (networks 1, 2 and 3). However, such structures have relatively poor generalization abilities. For the testing set T1, the best performance is observed for network 15. Slightly worse results are achieved for structures 8, 10 and 12. In this case, the size of the testing set is large (20000 samples) and the penalty term in (4.24) does not have much influence on the value of the information criterion. For the smaller testing sets,

Table 4.1. Selection results of the cascade dynamic neural network

| No. | Structure  | K   | J (train) | fAIC (train) | J (T1) | fAIC (T1) | J (T2) | fAIC (T2) | J (T3) | fAIC (T3) |
|-----|------------|-----|-----------|--------------|--------|-----------|--------|-----------|--------|-----------|
| 1   | 4-1-1(0,2) | 13  | 10.69     | 2.40*        | 415.46 | 6.03      | 55.00  | 4.01      | 71.86  | 4.28      |
| 2   | 4-1-1(1,1) | 15  | 10.73     | 2.40*        | 481.32 | 6.18      | 64.41  | 4.17      | 93.74  | 4.55      |
| 3   | 4-1-1(2,2) | 18  | 10.64     | 2.40*        | 462.05 | 6.14      | 59.72  | 4.10      | 84.58  | 4.44      |
| 4   | 4-2-1(0,2) | 20  | 10.78     | 2.42         | 393.96 | 5.98      | 52.39  | 3.97      | 65.80  | 4.19      |
| 5   | 4-3-1(1,1) | 35  | 10.99     | 2.47         | 446.23 | 6.10      | 50.64  | 3.94      | 60.73  | 4.12      |
| 6   | 4-3-1(1,2) | 36  | 10.68     | 2.44         | 422.25 | 6.05      | 52.44  | 3.97      | 70.72  | 4.27      |
| 7   | 4-5-1(0,2) | 41  | 10.67     | 2.45         | 459.51 | 6.13      | 60.14  | 4.11      | 84.06  | 4.45      |
| 8   | 4-3-1(2,1) | 41  | 10.95     | 2.48         | 375.23 | 5.93      | 43.70  | 3.79*     | 55.66  | 4.04      |
| 9   | 4-5-1(1,1) | 55  | 10.69     | 2.48         | 489.79 | 6.20      | 62.68  | 4.16      | 93.13  | 4.56      |
| 10  | 4-7-1(0,2) | 55  | 10.92     | 2.50         | 379.92 | 5.95      | 51.92  | 3.97      | 63.63  | 4.18      |
| 11  | 4-5-1(1,2) | 56  | 10.84     | 2.50         | 494.76 | 6.21      | 58.20  | 4.09      | 85.30  | 4.47      |
| 12  | 4-5-1(2,1) | 65  | 10.97     | 2.53         | 385.62 | 5.96      | 44.92  | 3.83      | 53.32  | 4.00*     |
| 13  | 4-7-1(1,2) | 76  | 10.88     | 2.54         | 409.54 | 6.02      | 46.05  | 3.86      | 58.27  | 4.10      |
| 14  | 4-6-1(2,1) | 77  | 10.64     | 2.52         | 394.44 | 5.99      | 47.70  | 3.90      | 60.47  | 4.13      |
| 15  | 4-6-1(2,2) | 78  | 10.80     | 2.54         | 371.09 | 5.92*     | 58.03  | 4.09      | 67.79  | 4.25      |
| 16  | 4-7-1(2,1) | 89  | 10.64     | 2.54         | 446.63 | 6.11      | 59.04  | 4.11      | 81.27  | 4.43      |
| 17  | 4-7-1(2,2) | 90  | 10.96     | 2.57         | 437.38 | 6.09      | 52.60  | 4.00      | 59.99  | 4.13      |
| 18  | 4-9-1(2,2) | 114 | 10.64     | 2.59         | 464.92 | 6.15      | 60.44  | 4.15      | 85.29  | 4.49      |
Table 4.2. Selection results of the two-layer dynamic neural network

| No. | Structure    | K   | J (train) | fAIC (train) | J (T1)  | fAIC (T1) | J (T2) | fAIC (T2) | J (T3) | fAIC (T3) |
|-----|--------------|-----|-----------|--------------|---------|-----------|--------|-----------|--------|-----------|
| 1   | 4-3-2-1(1)   | 45  | 12.62     | 2.63         | 717.12  | 6.58      | 76.36  | 4.35      | 90.57  | 4.52      |
| 2   | 4-4-3-1(1)   | 66  | 11.63     | 2.59*        | 675.40  | 6.52      | 94.33  | 4.57      | 79.95  | 4.41      |
| 3   | 4-4-3-1(1)   | 66  | 12.12     | 2.63         | 645.59  | 6.48      | 69.20  | 4.26      | 69.64  | 4.27      |
| 4   | 4-5-2-1(1)   | 67  | 12.54     | 2.66         | 939.35  | 6.85      | 66.97  | 4.23      | 94.52  | 4.58      |
| 5   | 4-4-3-1(2)   | 80  | 11.78     | 2.63         | 571.68  | 6.36*     | 65.28  | 4.21      | 89.52  | 4.53      |
| 6   | 4-5-3-1(2)   | 94  | 11.39     | 2.62         | 675.98  | 6.53      | 58.99  | 4.11*     | 58.41  | 4.11*     |
| 7   | 4-7-4-1(1)   | 115 | 11.80     | 2.70         | 1370.59 | 7.23      | 467.65 | 6.19      | 341.98 | 5.88      |
| 8   | 4-7-4-1(2)   | 137 | 12.22     | 2.78         | 1336.49 | 7.21      | 77.18  | 4.40      | 95.77  | 4.62      |
other structures show better performance: in the case of T2 it is structure 8, and in the case of T3 structure 12. Summarizing, the best neural structure giving reasonable results for all testing sets is the structure 4-3-1(2,1) with 41 parameters. In turn, Table 4.2 contains the results for two-layer neural networks. In this case, the network selected as optimal with respect to the AIC was structure 6 with the configuration 4-5-3-1(2). This network includes 94 parameters, much more than the best performing cascade network, which contains only 41. The generalization results are also much worse, especially when comparing the sum of squared errors for the testing set T1 (675.98 against 375.23 obtained for the cascade network). Taking into account the results of the training, one can observe that two-layer networks are much more difficult to train than cascade ones. The reason is that the observation equation (3.42) has a complex non-linear form, which transforms the state vector into the output one. Theorem 1 shows that a two-layer network can represent the state of any Lipschitz continuous mapping with arbitrary accuracy, but there is no result showing how this can be done using the output of the network. The cascade structure has a more practical form, especially taking into account the training process, which has been confirmed by computer experiments. The experiments show that the cascade network can be trained more effectively than the two-layer locally recurrent network. In spite of the fact that the neural structure can be determined using some information criteria, tests of many network configurations are still required to select the best neural network. There are still open problems, e.g. how to determine an appropriate number of hidden neurons which assures the required level of approximation, or how to select the order of the filters in the neurons so as to capture the dynamics of the modelled process well.
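As a quick illustration of the AIC-based selection discussed above, the sketch below (not part of the original study) recomputes fAIC from (4.24)-(4.25) for a few cascade configurations taken from Table 4.1. The J values and parameter counts are the reported training-set figures (N = 1000); log(·) is the natural logarithm, which reproduces the tabulated fAIC values.

```python
import numpy as np

def f_aic(J, K, N):
    """Akaike information criterion (4.24): f_AIC = log(J) + 2K/N."""
    return np.log(J) + 2.0 * K / N

def sse(y_desired, y_model):
    """Sum of squared errors (4.25) between desired and network outputs."""
    return float(np.sum((np.asarray(y_desired) - np.asarray(y_model)) ** 2))

# (structure, parameter count K, training-set J) for three rows of Table 4.1
N = 1000
candidates = [("4-1-1(0,2)", 13, 10.69),
              ("4-3-1(2,1)", 41, 10.95),
              ("4-9-1(2,2)", 114, 10.64)]

best = min(candidates, key=lambda c: f_aic(c[2], c[1], N))
print(best[0])  # prints "4-1-1(0,2)": the single-neuron network wins on the training set
```

Note how the penalty term 2K/N separates models whose raw errors J are nearly identical, which is exactly the effect discussed for the training set above.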
4.5 Summary

In this chapter, it was proved that a locally recurrent network with two hidden layers is able to approximate a state-space trajectory produced by any Lipschitz continuous function with arbitrary accuracy. The undertaken analysis of the discussed network made it possible to simplify its structure and to significantly reduce the number of parameters. The resulting neural network consists of non-linear neurons with IIR filters in the first hidden layer and linear neurons with FIR filters in the second layer. To make this result applicable to real-world problems, a novel structure of the neural network was proposed, in which the whole state vector is available from the output neurons. The newly proposed network has a cascade structure. This means that the neurons of the output layer receive excitation from both the neurons of the previous layer and the external input. However, approximation accuracy is strictly dependent on suitably selected weight matrices. Moreover, the parameters of the network can be determined by minimizing a criterion based on the errors between the network output and the measurable states of the identified process. The performed experiments show that the cascade network
can be trained more effectively than the two-layer locally recurrent network. In spite of the fact that the neural structure can be determined using some information criteria, tests of many network configurations are still required to select the best neural network. It is worth noting that approximation performance is strictly dependent on the initial parameter values. Since a neural network is a redundant system, many different parameter settings may yield similar approximation accuracy. Therefore, in order to explore a neural structure as thoroughly as possible, the multi-start technique should be applied. There are still open problems, e.g. how to determine an appropriate number of hidden neurons which assures the required approximation accuracy, or how to select the order of the filters to capture the dynamics of the modelled process. Another important problem regarding the locally recurrent neural network is its stability. The next chapter presents originally developed algorithms for the stability analysis and stabilization of a class of discrete-time locally recurrent neural networks.
5 Stability and Stabilization of Locally Recurrent Networks
Stability plays an important role in both control theory and system identification. Furthermore, the stability issue is of crucial importance for training algorithms adjusting the parameters of neural networks. If the predictor is unstable for certain choices of the neural model parameters, serious numerical problems can occur during training. Stability criteria should be universal, applicable to as broad a class of systems as possible, and at the same time computationally efficient. The majority of well-known approaches are based on Lyapunov's method [163, 77, 164, 165, 166, 167]. Fang and Kincaid applied the matrix measure technique to study the global exponential stability of asymmetrical Hopfield-type networks [168]. Jin et al. [169] derived sufficient conditions for the absolute stability of a general class of discrete-time recurrent networks by using Ostrowski's theorem. Recently, global asymptotic as well as exponential stability conditions for discrete-time recurrent networks with globally Lipschitz continuous and monotone non-decreasing activation functions were introduced by Hu and Wang [170]. The existence and uniqueness of an equilibrium were given as a matrix determinant problem. Unfortunately, most of the existing results do not consider the stabilization of the network during training. They allow checking the stability of the neural model only after training it. The literature on the stabilization of neural networks during training is rather scarce. Jin and Gupta [171] proposed two training methods for a discrete-time dynamic network: multiplier and constrained learning rate algorithms. Both algorithms utilized stability conditions derived by using Lyapunov's first method and Gersgorin's theorem. In turn, Suykens et al. [172] derived stability conditions for recurrent multi-layer networks using linearisation, robustness analysis of linear systems under non-linear perturbations, and matrix inequalities.
The elaborated conditions have been used to constrain the dynamic backpropagation algorithm. These solutions, however, are devoted to globally recurrent networks. Any non-linear neural network model with squashing activation functions (e.g. sigmoid or hyperbolic tangent) is stable in the BIBO (Bounded Input Bounded Output) sense [19, 119]. When locally recurrent networks composed of neuron models with the IIR filter are applied,
during learning the filter parameters may take values forcing the instability of the filter. Taking into account the fact that the activation function is bounded, the neuron then starts to work as a switching element. The neural model is still stable in the BIBO sense, but a certain number of neurons become useless. To avoid this undesirable effect, the neural network should be stabilized during learning, which means that, to utilize each neuron as fully as possible, each IIR filter inside a neuron should be stable. Stability conditions for dynamic neuron units can be found in the interesting book of Gupta et al. [77]. The authors derived stability conditions for various dynamic networks using the diagonal Lyapunov function method. Unfortunately, the stabilization problem is not considered there. The training process is an iterative procedure, and stability should be checked after each learning step. On the other hand, the stabilization of the network should be a simple procedure which does not introduce any considerable complexity into the existing training algorithm. This chapter proposes two methods for stabilizing a neural model with one hidden layer during training. The first one is based on a gradient projection, whilst the second one on a minimum distance projection. These methods are relatively simple procedures and can be used for stabilizing a neural network. As was shown in Chapter 4, the approximation abilities of locally recurrent networks with only one hidden layer are limited [46]. Therefore, the chapter also presents stability criteria for more complex neural models consisting of two hidden layers. For such networks, stability criteria based on Lyapunov's methods are introduced. Moreover, some aspects concerning the computational burden of stability checking are also discussed. Based on the elaborated stability conditions, a stabilization procedure is proposed which guarantees the stability of the trained model.
The chapter is organized as follows: in Section 5.1, stability issues of the dynamic neural network with one hidden layer are discussed. The training of the network under an active set of constraints is formulated (Sections 5.1.1 and 5.1.2), and a convergence analysis of the proposed projection algorithms is conducted (Section 5.1.3). The section also reports experimental results, including a complexity analysis (Section 5.1.4), the stabilization effectiveness of the proposed methods (Section 5.1.5), and their application to the identification of an industrial process (Section 5.1.6). Section 5.2 presents stability analysis based on Lyapunov's methods: theorems based on the second method of Lyapunov are presented in Section 5.2.1, and algorithms utilizing Lyapunov's first method are discussed in Section 5.2.2. Section 5.3 is devoted to the stability analysis of the cascade locally recurrent network proposed in Chapter 4. The chapter concludes with some final remarks in Section 5.4.
5.1 Stability Analysis – Networks with One Hidden Layer

A very important issue in the identification of unknown dynamic systems using neural network approaches is stability. The problem is especially visible when recurrent neural networks are applied. As has
been mentioned earlier, the dynamic neuron contains a linear dynamic subsystem (an IIR filter), and during training the filter poles may leave the stability region. The following experiment shows that the stability of the network may have a crucial influence on the training quality.

Example 5.1. Consider a network with a single hidden layer consisting of 3 dynamic neurons with second order IIR filters and hyperbolic tangent activation functions. The network was trained off-line with the SPSA method [26] for 500 iterations using 100 learning patterns. The process to be identified is described by the following difference equation [20]:

    yd(k) = f[yd(k − 1), yd(k − 2), yd(k − 3), u(k − 1), u(k − 2)],
(5.1)
where the non-linear function f[·] is given by

    f[x1, x2, x3, x4, x5] = (x1 x2 x3 x5 (x3 − 1) + x4) / (1 + x2² + x3²).    (5.2)
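The benchmark (5.1)-(5.2) is straightforward to simulate. The sketch below is an illustration only; the sinusoidal excitation is an assumption chosen here, not an input signal specified in the book.

```python
import numpy as np

def f(x1, x2, x3, x4, x5):
    # non-linear function (5.2)
    return (x1 * x2 * x3 * x5 * (x3 - 1.0) + x4) / (1.0 + x2**2 + x3**2)

def simulate(u):
    """Iterate the difference equation (5.1) for an input sequence u."""
    y = np.zeros(len(u))
    for k in range(3, len(u)):
        y[k] = f(y[k-1], y[k-2], y[k-3], u[k-1], u[k-2])
    return y

u = np.sin(2.0 * np.pi * np.arange(200) / 25.0)   # assumed excitation
y = simulate(u)
```

Input-output pairs (u(k), y(k)) generated this way can serve as learning patterns of the kind used in the experiment.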
The results of the training are presented in Figs. 5.1 and 5.2: those for the unstable model in Fig. 5.1, and those for the stable one in Fig. 5.2. As one can observe, training in both cases is convergent (Fig. 5.1(c) and Fig. 5.2(c)). However, the first model is unstable (the states are divergent – Fig. 5.1(a)) and its generalization properties are very poor (Fig. 5.1(b)). In turn, the states of the stable neural model are depicted in Fig. 5.2(a) and the testing of this network in Fig. 5.2(b). For this network, the modelling results are much better. This simple experiment shows that the stability problem is of crucial importance and should be taken into account during training; otherwise, the obtained model may be improper. One possible way of assuring network stability is to introduce constraints on the filter parameters into the optimisation procedure, i.e. to formulate a constrained optimisation problem. This technique may be very useful, because the optimisation procedure then returns parameters which assure the stability of the model. In this section, stability analysis and stabilization approaches are presented. Let us consider the locally recurrent neural network (3.39) with a single hidden layer containing v dynamic neurons as processing elements and an output layer with linear static elements. It is well known that a linear discrete-time system is stable iff all roots zi of the characteristic equation lie inside the unit circle:

    ∀i  |zi| < 1.                                   (5.3)

The state equation in (3.39) is linear. Thus, the system (3.39) is stable iff the roots of the characteristic equation

    det(zI − A) = 0                                 (5.4)

satisfy (5.3). In (5.4), I ∈ R^{N×N} represents the identity matrix. In a general case, (5.4) may have a relatively complex form and the analytical calculation of the roots
[Fig. 5.1. Result of the experiment – an unstable system: (a) network states, (b) output of the process (solid) and the network (dashed), (c) sum-squared network error for 499 epochs.]
can be extremely difficult. In the analysed case, however, the state equation is linear with the block-diagonal matrix A, which makes the consideration of stability relatively easy. For block-diagonal matrices, the determinant can be represented as follows [139]:

    det(A) = ∏_{i=1}^{v} det(Ai).                      (5.5)

Using (5.5), the characteristic equation (5.4) can be rewritten in the following way:

    ∏_{i=1}^{v} det(zi I − Ai) = 0,                    (5.6)

where I ∈ R^{r×r} is the identity matrix and zi represents the poles of the i-th neuron. Thus, from (5.6) one can determine the poles of (3.39) by solving the set of equations

    ∀i  det(Ai − I zi) = 0.                            (5.7)
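Because A is block diagonal, the factorization (5.5)-(5.7) lets one test stability neuron by neuron. A sketch (my own illustration; the controllable canonical realization of the filter block is an assumption, not the book's specific state-space form):

```python
import numpy as np

def filter_block(a1, a2):
    """State block A_i of a neuron with a second order IIR filter
    z^2 + a1 z + a2, in an assumed controllable canonical form."""
    return np.array([[-a1, -a2],
                     [1.0,  0.0]])

def neuron_poles(blocks):
    """Poles of each neuron: eigenvalues of its block A_i, cf. (5.7)."""
    return [np.linalg.eigvals(Ai) for Ai in blocks]

def network_stable(blocks):
    """The network (3.39) is stable iff every neuron's poles satisfy (5.3)."""
    return all(np.all(np.abs(p) < 1.0) for p in neuron_poles(blocks))
```

With this decomposition, a network of v neurons requires v small eigenvalue problems instead of one N × N problem, which is exactly the simplification exploited in the text.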
[Fig. 5.2. Result of the experiment – a stable system: (a) network states, (b) output of the process (solid) and the network (dashed), (c) sum-squared network error for 499 epochs.]
From the above analysis one can conclude that the poles of the i-th subsystem (the i-th dynamic neuron) can be calculated separately. Hence, if all neurons in the network are stable, then the whole neural network model is stable, and if during training the poles are kept inside the unit circle, the stability of the neural model is guaranteed. The main problem now is how to elaborate a method of keeping the poles inside the unit circle during neural network training. This problem can be solved by deriving a feasible set of the filter parameters.

First order filter. This is a trivial case. The poles must satisfy the condition (5.3). The characteristic equation of the simple neuron is given by

    1 + a1 z^{−1} = 0.                                 (5.8)

The solution is z = −a1. Substituting this solution into (5.3), one finally obtains a1 ∈ (−1, 1).

Second order filter. The characteristic equation is represented by the formula

    1 + a1 z^{−1} + a2 z^{−2} = 0.                     (5.9)
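These pole conditions are easy to verify numerically by computing the filter poles directly; a small sketch (not from the book):

```python
import numpy as np

def filter_stable(coeffs):
    """True iff all poles of 1 + a1 z^-1 + ... + ar z^-r satisfy (5.3),
    i.e. all roots of z^r + a1 z^(r-1) + ... + ar lie in |z| < 1."""
    poles = np.roots([1.0, *coeffs])
    return bool(np.all(np.abs(poles) < 1.0))
```

For the first order case this reduces to |a1| < 1, matching the interval derived from (5.8); for higher orders it provides a direct numerical counterpart of the coefficient conditions derived next.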
Using the Hurwitz stability criterion, one can show that the feedback filter parameters must satisfy the conditions

    1 − a1 + a2 > 0
    1 + a1 + a2 > 0                                   (5.10)
    1 − a2 > 0

The set of inequalities (5.10) determines the feasible region of the filter parameters in the form of a triangle.

5.1.1 Gradient Projection
Constraints are imposed on two groups of parameters: the slopes of the activation functions and the feedback filter parameters. In this section, an extension of the Gradient Projection (GP) method presented in [66] is described. The new parameter vector suggested by the training method is projected onto the feasible region. The main idea is to modify the search direction only when the constraints are active. This means that at each learning iteration one can compute the set of active constraints and check which parameters violate them. The resulting algorithm is presented below, where θui and θli represent the upper and lower bounds on the i-th parameter, respectively. For the slope parameters of the activation function there is only a lower bound θli = 0. In the case when the neurons contain first order filters, the lower and upper bounds have the values θli = −1 and θui = 1. Slightly more complicated is the case when the neurons possess second order IIR filters. In that case, one can determine the bounds as follows: θli = −1 and θui = 1 for ai2, and θli = −ai2 − 1 and θui = ai2 + 1 for ai1. The general form of gradient projection is described in Table 5.1. This algorithm works well and is easy to use only when the set of constraints K has a simple geometrical shape (e.g. a hypercube). Hence, problems can occur when second order filters are applied inside the neurons. To illustrate the problem, let us analyse the situation presented in Fig. 5.3. The training method updates the point Pk to a new value Pk+1. The coordinate a2 of Pk+1 has a correct value (a2 ∈ (−1, 1)). Unfortunately, the second coordinate exceeds the admissible value. According to Step 2 of the algorithm, the search direction for this coordinate is set to zero (the dashed line in Fig. 5.3). As one can observe in Fig. 5.3, the obtained point P′k+1 is still not acceptable. This undesirable effect is caused by the complex form of the feasible region. Therefore, it is proposed to add another step to the standard gradient projection procedure (Step 4: check for solution feasibility) in order to avoid such problems. This step can look as follows (the dotted line in Fig. 5.3):

    if P′k+1 is still not acceptable then set Pk+1 := Pk.
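One such corrected gradient-projection step for box-type bounds can be sketched as follows. This is an illustration only: the actual update (3.57) in the book is the ARS/SPSA step, replaced here by a plain gradient step with an assumed rate eta.

```python
import numpy as np

def gp_step(theta, grad, eta, lower, upper):
    """One gradient-projection step with the extra feasibility check (Step 4)."""
    d = -eta * grad                      # search direction
    trial = theta + d
    violated = (trial < lower) | (trial > upper)
    d[violated] = 0.0                    # Step 2: zero the violating components
    new = theta + d                      # Step 3: update
    still_bad = (new < lower) | (new > upper)
    new[still_bad] = theta[still_bad]    # Step 4: fall back to the previous value
    return new
```

The fallback in the last two lines is exactly the remedy proposed above for feasible regions that are not simple boxes.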
5.1.2 Minimum Distance Projection
The method proposed in the previous subsection does not take into account the distance between the solution suggested by the training method and the feasible region. The approach described in this section is based on the Minimum Distance
[Fig. 5.3. Idea of the gradient projection: the update Pk → Pk+1 leaves the triangular feasible region in the (a1, a2) plane, and zeroing the violating component of the search direction still yields an infeasible point P′k+1.]
Projection (MDP). The main idea is to project a point onto the feasible region so as to deteriorate the training in a minimal way. For the slope parameters there is only a lower bound θli = 0. If a slope parameter exceeds this lower bound, then it is set to a small value ε, e.g. ε = 10^{−2}. When the neurons have second order IIR filters, the stabilization problem can be solved as a quadratic programming one. Let us consider the problem of the minimum distance projection of a given point d onto the feasible region. For second order IIR filters, this task can be formulated as follows:

    min  Σ_{i=1}^{2} (di − ai)²
    s.t. 1 − a1 + a2 > 0
         1 + a1 + a2 > 0                               (5.11)
         1 − a2 > 0

Table 5.1. Outline of the gradient projection

    Step 0: Initiation. Choose θ̂0 ∈ Θ, set k := 0.
    Step 1: Compute ĝ(θ̂k). Define a set Kk containing all violated constraints:
            Kk = {i | (θ̂i_{k+1} < θli) or (θ̂i_{k+1} > θui)}.
    Step 2: Correct the search direction −ĝ(θ̂k), taking into account the set of constraints Kk:
            if i ∈ Kk then −ĝi(θ̂k) := 0.
    Step 3: Compute θ̂k+1 according to (3.57).
    Step 4: Check for solution feasibility:
            if ((θ̂i_{k+1} < θli) or (θ̂i_{k+1} > θui)) then θ̂i_{k+1} := θ̂i_k.
    Step 5: Termination criteria:
            if (termination criterion satisfied) then STOP, else go to Step 1.
where ai is the i-th optimal filter parameter and di is the value suggested by the training algorithm. The constraints in (5.11) form a compact but open set, which makes this problem unsolvable. To deal with this, it is proposed to use a stability margin. Let us assume a constant ψ (ψ < 1) representing the stability margin. The problem is to project the poles zi into the circle of radius ψ. Deriving the zeros of (5.9) and using the condition ∀i |zi| ≤ ψ, after simple but time-consuming calculations one can obtain the following set of constraints:

    ψ² − ψa1 + a2 ≥ 0
    ψ² + ψa1 + a2 ≥ 0                                  (5.12)
    ψ² − a2 ≥ 0

Figure 5.4 presents the stability triangle and the search region for the second order IIR filter. Using (5.12), the problem (5.11) can be rewritten as follows:

    min  Σ_{i=1}^{2} (di − ai)²                         (5.13a)
    s.t. −a2 − ψa1 − ψ² ≤ 0,                           (5.13b)
         −a2 + ψa1 − ψ² ≤ 0,                           (5.13c)
         a2 − ψ² ≤ 0.                                  (5.13d)

Now, the constraints (5.13b)-(5.13d) form a compact and closed set, and the problem can be easily solved using the Lagrange multipliers method. The Lagrange function has the form

    L(a1, a2, λ1, λ2, λ3) = (d1 − a1)² + (d2 − a2)²
        + λ1(−a2 − ψa1 − ψ²) + λ2(−a2 + ψa1 − ψ²) + λ3(a2 − ψ²).    (5.14)

[Fig. 5.4. Stability triangle and the search region: the search region is the triangle with vertices P1 = (−2ψ, ψ²), P2 = (2ψ, ψ²) and P3 = (0, −ψ²), lying inside the stability triangle in the (a1, a2) plane.]
Let us define the Kuhn-Tucker conditions:

    ∂L/∂a1 = −2d1 + 2a1 − ψλ1 + ψλ2 = 0
    ∂L/∂a2 = −2d2 + 2a2 − λ1 − λ2 + λ3 = 0
    λ1 ∂L/∂λ1 = λ1(−a2 − ψa1 − ψ²) = 0
    λ2 ∂L/∂λ2 = λ2(−a2 + ψa1 − ψ²) = 0                  (5.15)
    λ3 ∂L/∂λ3 = λ3(a2 − ψ²) = 0
    ∀i  λi ≥ 0
The solution of (5.15) can be derived by analyzing which constraints are active in a specific case. There are six possibilities:

1. The constraint (5.13b) is active (λ1 ≠ 0, λ2 = 0, λ3 = 0); the corrected coordinates are calculated using the formulae

       a1 = d1 + ψ/(1 + ψ²) (−d2 − ψd1 − ψ²),
       a2 = d2 + 1/(1 + ψ²) (−d2 − ψd1 − ψ²);

2. The constraint (5.13c) is active (λ2 ≠ 0, λ1 = 0, λ3 = 0); the corrected coordinates are obtained as follows:

       a1 = d1 − ψ/(1 + ψ²) (−d2 + ψd1 − ψ²),
       a2 = d2 + 1/(1 + ψ²) (−d2 + ψd1 − ψ²);

3. The constraint (5.13d) is active (λ3 ≠ 0, λ1 = 0, λ2 = 0); the corrected parameters are a1 = d1, a2 = ψ²;
4. The constraints (5.13b) and (5.13d) are active (λ1 ≠ 0, λ3 ≠ 0, λ2 = 0); the solution is the point P1 = (a1, a2) = (−2ψ, ψ²) (see Fig. 5.4);
5. The constraints (5.13c) and (5.13d) are active (λ2 ≠ 0, λ3 ≠ 0, λ1 = 0); the solution is the point P2 = (a1, a2) = (2ψ, ψ²) (see Fig. 5.4);
6. The constraints (5.13b) and (5.13c) are active (λ1 ≠ 0, λ2 ≠ 0, λ3 = 0); the solution is the point P3 = (a1, a2) = (0, −ψ²) (see Fig. 5.4).

The feasible region considered is a triangle; the case where all three constraints are active never occurs, because no point can violate all of them simultaneously. Table 5.2 presents the minimum distance projection algorithm. As one can see, by using the solution of the Kuhn-Tucker conditions (5.15), a very simple algorithm is obtained.
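The six cases translate directly into code. The sketch below is my own rendering of the projection, not the book's implementation: it evaluates all six candidate solutions and returns the feasible one of minimum distance (the default margin psi = 0.95 is an assumption).

```python
def mdp_project(d1, d2, psi=0.95):
    """Minimum distance projection of suggested second order filter
    parameters (d1, d2) onto the search region (5.13b)-(5.13d)."""
    def feasible(a1, a2, tol=1e-12):
        return (-a2 - psi * a1 - psi**2 <= tol and
                -a2 + psi * a1 - psi**2 <= tol and
                a2 - psi**2 <= tol)
    if feasible(d1, d2):
        return d1, d2
    s = 1.0 + psi**2
    g1 = -d2 - psi * d1 - psi**2         # value of constraint (5.13b) at d
    g2 = -d2 + psi * d1 - psi**2         # value of constraint (5.13c) at d
    candidates = [
        (d1 + psi / s * g1, d2 + 1.0 / s * g1),   # case 1: (5.13b) active
        (d1 - psi / s * g2, d2 + 1.0 / s * g2),   # case 2: (5.13c) active
        (d1, psi**2),                              # case 3: (5.13d) active
        (-2.0 * psi, psi**2),                      # case 4: vertex P1
        (2.0 * psi, psi**2),                       # case 5: vertex P2
        (0.0, -psi**2),                            # case 6: vertex P3
    ]
    feas = [c for c in candidates if feasible(*c)]
    return min(feas, key=lambda c: (c[0] - d1)**2 + (c[1] - d2)**2)
```

An edge projection that lands outside its own edge segment violates one of the other constraints and is filtered out, so the nearest vertex is returned instead, matching the case analysis above.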
Table 5.2. Outline of the minimum distance projection

    Step 0: Initiation. Choose θ̂0 ∈ Θ, set k := 0.
    Step 1: Compute ĝ(θ̂k) and update the parameter estimate θ̂k+1.
    Step 2: Check constraint violation and correct the parameters according to the solution of (5.15).
    Step 3: Termination criteria: if (termination criterion satisfied) then STOP, else go to Step 1.
In the case when the neurons contain first order filters, the lower and upper bounds have the values θli = −1 and θui = 1. If a feedback parameter θi exceeds the lower bound θli, then θi is set to the value −ψ. On the other hand, if a feedback filter parameter θi exceeds the upper bound θui, then θi is set to the value ψ. This simple algorithm generates a new feasible solution which lies at a minimal distance from the solution proposed by the training procedure and guarantees the stability of the filter, while deteriorating the training of the network in a minimal way.

5.1.3 Strong Convergence
Both of the proposed algorithms are projection based methods and can be represented in the following general form:

    θ̂_{k+1} = πG( θ̂k − ak ĝk(θ̂k) ),                    (5.16)

where πG : R^p → G is the projection onto the constraint set G. Introducing the projection term zk, (5.16) can be rewritten as

    θ̂_{k+1} = θ̂k − ak ĝk(θ̂k) + ak zk.                  (5.17)

In this way, ak zk is the vector of shortest distance needed to take θ̂k − ak ĝk(θ̂k) back to the constraint set G if it is not in G. As ĝk(θ̂k) is the estimate of the gradient gk(θ̂k), the bias in ĝk(θ̂k) is defined as follows:

    bk(θ̂k) = E[ ĝk(θ̂k) − g(θ̂k) | θ̂k ],                 (5.18)

where E[·|·] denotes the conditional expectation. It is expected that bk → 0 as k → ∞. Using (5.18) and defining the error term

    ek(θ̂k) = ĝk(θ̂k) − E[ ĝk(θ̂k) | θ̂k ],                (5.19)

one can rewrite (5.17) as

    θ̂_{k+1} = θ̂k − ak g(θ̂k) − ak bk(θ̂k) − ak ek(θ̂k) + ak zk.    (5.20)
All methods of convergence analysis need to show that the so-called "tail" effect of the noise vanishes. Such behaviour is essentially due to the martingale difference property and the decreasing step size ak. Firstly, let us define sets describing the feasible regions of the network parameters derived at the beginning of Section 5.1. The proposed projection algorithms work with inequality constraints. Let us introduce the following assumptions, where Assumption 1 defines the set of constraints when first order filters are applied inside the neurons, whilst Assumption 2 defines the set of constraints in the case of second order filters.

Assumption 1. Define the set G = {θ : θli ≤ θi ≤ θui}, where θli < θui and θli and θui are real numbers. The set G is a hyperrectangle in this case.

Assumption 2. Define the set G = {θ : qi(θ) ≤ 0, i = 1, . . . , 3} and assume that it is connected, compact and non-empty. Let qi(·), i = 1, . . . , 3 be continuously differentiable real-valued functions.

Additionally, it is necessary to show that the fundamental algorithm (SPSA) is convergent. Recall the following assumptions:

Assumption 3 (Gain sequences). ak, ck > 0 ∀k; ak → 0, ck → 0 as k → ∞; Σ_{k=0}^{∞} ak = ∞, Σ_{k=0}^{∞} (ak/ck)² < ∞.

Assumption 4 (Measurement noise). For some α0, α1, α2 > 0 and ∀k: E[(εk^(±))²] ≤ α0, E[J(θ̂k ± ck Δk)²] ≤ α1, and E[Δkl^{−2}] ≤ α2 (l = 1, 2, . . . , p).

Assumption 5 (Iterate boundedness). ‖θ̂k‖ < ∞ a.s. ∀k.

Assumption 6 (Relationship to Ordinary Differential Equations (ODE)). θ∗ is an asymptotically stable solution of the differential equation dx(t)/dt = −g(x).

Assumption 7. Let D(θ∗) = {x0 : lim_{t→∞} x(t|x0) = θ∗}, where x(t|x0) denotes the solution to the differential equation of Assumption 6 based on the initial conditions x0. There exists a compact set S ⊆ D(θ∗) such that θ̂k ∈ S infinitely often for almost all sample points.

Assumption 8 (Smoothness of J). J is three-times continuously differentiable and bounded on R^p.

Assumption 9 (Statistical properties of perturbations). {Δki} are independent for all k, i, identically distributed for all i at each k, symmetrically distributed about zero, and uniformly bounded in magnitude for all k, i.

For motivations and a detailed discussion of Assumptions 3-9, the reader is referred to [147, 145].

Proposition 5.2. Assume that the conditions of SPSA (Assumptions 3-9) hold for the algorithm (5.16), with any of the constraint set conditions (Assumption 1 or Assumption 2) holding. Then

    θ̂k → θ∗ as k → ∞  a.s. (w.p.1).                    (5.21)
Proof. From Theorem 2.1 and Theorem 2.3 of [173], we know that (5.21) holds if

    1. ‖bk(θ̂k)‖ < ∞ ∀k and bk(θ̂k) → 0 a.s.,
    2. lim_{k→∞} P( sup_{m≥k} ‖ Σ_{i=k}^{m} ai ei(θ̂i) ‖ ≥ λ ) = 0 for any λ > 0,
    3. zk is equicontinuous.

Condition 1) follows from Lemma 1 of [147] and Assumptions 3, 8 and 9. Condition 2) can be shown according to Proposition 1 of [147]. Consider 3). The reasoning for the equicontinuity of zk in the case where G satisfies Assumption 1 is given in the proof of Theorem 2.1 of [173] (Section 5.2, pages 96-97), and for the case where G satisfies Assumption 2 it is given in the proof of Theorem 2.3 of [173] (Section 5.2, pages 101-102). Then the conditions 1)-3) are satisfied and the proposition follows.

Remark 5.3. The above proof is based on the assumption that the noise in each observation is a martingale difference. However, the achieved result can be extended to other types of noise, e.g. correlated noise.

Remark 5.4. Some conditions, e.g. Assumption 5, can be replaced by weaker ones. One reason is that weaker conditions are more easily verifiable. Another advantage is seen when dealing with complicated problems such as correlated noise, state dependent noise or asynchronous algorithms.

Remark 5.5. The above deliberations show that the discussed class of projection methods is convergent almost surely. Unfortunately, this analysis does not consider the differences between the two projection algorithms proposed in Sections 5.1.1 and 5.1.2. These differences, numerical complexity and other issues will be explored in the following sections.

5.1.4 Numerical Complexity
The main objective of further investigation is to show the reliability and effectiveness of the techniques presented in Sections 5.1.1 and 5.1.2. The first experiment focuses on the complexity of the proposed approaches. Both stabilization techniques are based on checking constraint violations, so a number of additional operations have to be performed at each iteration of the training algorithm. Taking into account the fact that, in general, the training process consists of many steps, the numerical complexity of the proposed solutions is of crucial importance. Tables 5.3 and 5.4 show the minimal and maximal numbers of operations needed at each iteration for the stabilization methods GP and MDP, respectively. As one can see, in the case of the first order filter, both methods require a similar number of operations. In the case of the second order filter, GP is computationally less complex than MDP as far as the average number of operations is concerned. In specific cases, MDP can be less complex (see the columns for the minimum number of operations). Taking into account the
Table 5.3. Number of operations: GP method

                        first order filter      second order filter
Type of operation       min   max   average     min   max   average
statement checking       1     1     1           3     3     3
setting operations       1     2     1.5         2     5     3.5
additions                1     1     1           2     2     2
multiplications          1     1     1           2     2     2
TOTAL                    4     5     4.5         9    12    10.5
Table 5.4. Number of operations: MDP method

                        first order filter      second order filter
Type of operation       min   max   average     min   max   average
statement checking       1     2     1.5         2     3     2.5
setting operations       0     1     0.5         0     2     1
additions                0     2     1           0    16     8
multiplications          0     1     0.5         0    16     8
TOTAL                    1     6     3.5         2    37    19.5
fact that the training procedure consists of hundreds of operations of different kinds, the proposed solutions seem very attractive because of their simplicity. The next experiment shows how time-consuming these methods are. Let us consider the second order linear process discussed in Example 3.8. The learning data are generated by feeding a uniformly distributed random signal u(k) to the process and recording its output signal. In this way, a training set containing 200 patterns is generated. In order to compare the stabilization methods, a number of experiments were performed using different numbers of hidden neurons and different numbers of learning steps. All methods are implemented in Borland C++ Builder™ Enterprise Suite Ver. 5.0. Simulations are performed using a PC with a Celeron 600 processor and 160 MB RAM. Training using each method was performed 10 times, and the average results are presented in Table 5.5. The results confirm that with a small number of learning steps the differences between the methods are negligible, a few hundredths of a second. Greater differences are observed in the case of a larger number of iterations. As one can see in Table 5.5, after 15000 learning steps the differences with respect to training without stabilization amount to 3.6 sec for GP and 13.2 sec for MDP. The results show that the proposed methods are very simple as far as software implementation is concerned, and they do not prolong learning in a significant way.
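Both corrections amount to a cheap check-and-project step per filter coefficient, which is why the overhead observed above is so small. The sketch below is illustrative only: the precise GP and MDP update rules are those of Sections 5.1.1 and 5.1.2; here GP is shown as a projection just inside the boundary of the first order stability region |a| < 1, and MDP as a projection onto a circle with stability margin ψ.

```python
def gp_correct(a, eps=1e-3):
    """GP-like correction for a first order filter: if the updated feedback
    coefficient violates |a| < 1, project it just inside the stable region."""
    if abs(a) >= 1.0:                                  # statement checking
        a = (1.0 - eps) if a > 0 else -(1.0 - eps)     # setting operation
    return a

def mdp_correct(a, psi=0.9):
    """MDP-like correction: if the pole leaves the circle of radius psi
    (the stability margin), move it to the closest point of that circle."""
    if abs(a) > psi:
        a = psi if a > 0 else -psi
    return a

# A stable coefficient passes through untouched; unstable ones are corrected.
corrected = [gp_correct(0.5), gp_correct(1.2), mdp_correct(-1.1)]
```

Each call performs at most one comparison and one assignment, consistent with the operation counts of Tables 5.3 and 5.4.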
Table 5.5. Comparison of the learning time for different methods

                                  Stabilization method
Characteristics                   none        GP          MDP
v = 3,  r = 2, nmax = 500         7.44 sec    7.46 sec    7.47 sec
v = 8,  r = 2, nmax = 500        12.59 sec   12.65 sec   12.7 sec
v = 15, r = 2, nmax = 500        25.16 sec   25.18 sec   25.18 sec
v = 7,  r = 2, nmax = 5000        1.98 min    1.99 min    1.99 min
v = 7,  r = 2, nmax = 15000       6.93 min    6.99 min    7.15 min

5.1.5 Pole Placement
To show the stabilization capabilities of the proposed methods, several experiments are carried out. The first one is the identification of a dynamic process without stabilization of the learning. In the next two experiments, the GP and MDP techniques are applied to train the dynamic network. All experiments are performed with exactly the same learning data, initial network parameters and parameters of the training algorithm. The process to be identified is described by the following difference equation [20]:

y_d(k) = y_d(k − 1) / (1 + y_d²(k − 2)) + u³(k − 3).   (5.22)
This is a third order dynamic process. To identify (5.22), the dynamic network (3.39) is applied. The arbitrarily selected two-layer architecture contains four hidden dynamic neurons with second order IIR filters and hyperbolic tangent activation functions, and one linear output neuron. The training process was carried out off-line for 500 iterations using a pseudo-random input uniformly distributed on the interval [-2,2]. The parameters of SPSA are as follows: A = 0, α = 0.2, γ = 0.1, a = 0.0002, c = 0.001.
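The identification data for (5.22) can be generated in a few lines. The sketch below is an assumption about the data-generation setup (uniform input on [−2, 2], zero initial conditions), not code from the book.

```python
import random

def simulate_process(u):
    """Simulate y(k) = y(k-1)/(1 + y(k-2)^2) + u(k-3)^3 with zero initial conditions."""
    y = [0.0, 0.0]                                   # y(0), y(1)
    for k in range(2, len(u)):
        u_del = u[k - 3] if k >= 3 else 0.0          # delayed input u(k-3)
        y.append(y[k - 1] / (1.0 + y[k - 2] ** 2) + u_del ** 3)
    return y

random.seed(1)
u = [random.uniform(-2.0, 2.0) for _ in range(500)]  # pseudo-random input on [-2, 2]
y = simulate_process(u)                              # the pairs (u, y) form the training set
```

Note that the squared term in the denominator keeps the output bounded even for large excitations, which makes (5.22) a convenient identification benchmark.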
Fig. 5.5. Sum-squared error – training without stabilization
Fig. 5.6. Pole locations during learning without stabilization: neuron 1 (a), neuron 2 (b), neuron 3 (c), neuron 4 (d)
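The pole plots of Figs. 5.6–5.9 can be reproduced by computing, after every update, the roots of each neuron's filter denominator. The sketch below assumes the common denominator convention z² + a₁z + a₂ (the book's exact parametrisation is given by (3.19)) and shows an MDP-style correction that radially pulls unstable poles back into a circle of radius ψ; the actual MDP rule of Section 5.1.2 is a minimum distance projection.

```python
import cmath

def poles(a1, a2):
    """Poles of a second order IIR filter with denominator z^2 + a1*z + a2."""
    d = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return (-a1 + d) / 2.0, (-a1 - d) / 2.0

def mdp_like_correction(a1, a2, psi=0.9):
    """If a pole leaves the circle |z| <= psi, shrink both poles radially so the
    largest modulus equals psi, then rebuild the denominator coefficients."""
    p1, p2 = poles(a1, a2)
    rho = max(abs(p1), abs(p2))
    if rho > psi:
        s = psi / rho
        p1, p2 = p1 * s, p2 * s                       # radial shrink keeps conjugate pairs conjugate
        a1, a2 = -(p1 + p2).real, (p1 * p2).real      # z^2 + a1*z + a2 = (z - p1)(z - p2)
    return a1, a2

a1c, a2c = mdp_like_correction(-2.0, 0.75)            # poles at 1.5 and 0.5: unstable
```

After the correction the dominant pole sits exactly on the margin circle, which matches the dashed-circle behaviour visible in Fig. 5.9.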
Figure 5.5 presents the learning error of the neural network in the case when constraints on the network parameters are not applied. As one can see there, the course of the error is not smooth. There are large fluctuations caused by the instability of the neurons, including the large jump after the 400-th iteration. Pole placement during learning without stabilization is shown in Fig. 5.6. To clarify the analysis, pole locations are shown only every 20 algorithm iterations. Three out of four neurons lost stability. Only the poles of the third neuron (Fig. 5.6(c)) are inside the unit circle. In turn, Fig. 5.7 presents the pole locations in the case when the GP approach is applied. As one can observe, each neuron keeps its poles inside the unit circle. Unstable poles from the previous case are corrected according to GP so as to fall into the stable region. At each iteration the entire neural model is stable, and the convergence of the learning is faster. Interesting results can be observed by analysing Figs. 5.6(c) and 5.7(c). In the case when the poles are stable, the stabilization method does not change the pole locations and, consequently, does not introduce any needless operations. A crucial factor for the correct operation of the GP method is the initial pole location. The
Fig. 5.7. Pole locations during learning, stabilization using GP: neuron 1 (a), neuron 2 (b), neuron 3 (c), neuron 4 (d)
initial neuron poles should be stable, so at the beginning of the training the feedback filter parameters are set to zero. The next experiment shows a comparison of pole placement during learning without stabilization and with the use of the MDP method (Figs. 5.8 and 5.9, respectively). In this case, the stability margin is set to ψ = 0.9. During learning without stabilization all neurons lost stability (Fig. 5.8). As one can see in Fig. 5.9, MDP stabilization works very well. All neurons are stable during learning. Moreover, according to the assumptions, the corrected poles are arranged within the circle of radius ψ = 0.9 (marked in Fig. 5.9 with the dashed line). The MDP method controls pole placement very well: no pole exceeds the circle with the assumed radius ψ.

5.1.6 System Identification Based on Real Process Data
In this experiment, real process data from an industrial plant are employed to identify the input-output model of a selected part of the plant. The plant
Fig. 5.8. Pole locations during learning without stabilization: neuron 1 (a), neuron 2 (b), neuron 3 (c), neuron 4 (d)
considered is a sugar actuator described in detail in Section 8. The data used for the learning and testing sets were suitably preprocessed by removing trends, resulting in 500 elements for the learning data set and 1000 elements for the testing data set. To model the process (8.4), the dynamic neural network (3.39) is applied using three hidden neurons, each including a second order IIR filter and a hyperbolic tangent activation function. The parameters of the SPSA algorithm are A = 0, α = 0.35, γ = 0.1, a = 0.001, c = 0.01. As a stabilization method, the MDP technique is applied with the stability margin ψ = 0.9. The responses of the neural model obtained for both the learning and testing data sets are presented in Fig. 5.10. The mean squared output error is equal to 1.5523 for the learning set and 7.6094 for the testing set.

5.1.7 Convergence of Network States
The last experiment aims at showing the convergence of the states of the network trained without and with stabilization techniques. The training data and the
Fig. 5.9. Pole locations during learning, stabilization using MDP: neuron 1 (a), neuron 2 (b), neuron 3 (c), neuron 4 (d)
Fig. 5.10. Actuator (solid line) and model (dashed line) outputs for learning (a) and testing (b) data sets
Fig. 5.11. Results of training without stabilization: error curve (a) and the convergence of the state x(k) of the neural model (b)
Fig. 5.12. Results of training with GP: error curve (a) and the convergence of the state x(k) of the neural model (b)
structure of the network are the same as in the previous experiment. Here, the parameters of SPSA are as follows: A = 100, α = 0.602, γ = 0.1, a = 0.015, c = 0.1. Figure 5.11 presents the training results of the basic algorithm without stabilization. As one can see, the error curve is convergent. In spite of that, the neural model is not stable, because four out of six states are divergent. Consequently, two out of three neurons go into saturation, and both the dynamic and the approximation properties of the neural network are strongly restricted. This example clearly shows that the stabilization problem is important and has to be tackled during training. The next two figures (Figs. 5.12 and 5.13) show the results of training when a stabilization technique is used. Both of the proposed methods assure the stability of the neural model. The neural states are convergent, as depicted in Fig. 5.12 for the GP algorithm and in Fig. 5.13 for MDP.
Fig. 5.13. Results of training with MDP: error curve (a) and the convergence of the state x(k) of the neural model (b)
5.2 Stability Analysis – Networks with Two Hidden Layers

Let us consider the locally recurrent neural network (3.41) and (3.42) with two hidden layers containing v₁ neurons in the first layer and v₂ neurons in the second layer, where each neuron consists of the r-th order IIR filter, and an output layer with linear static elements. For further analysis let us assume that the activation function of each neuron is the hyperbolic tangent σ(x) = tanh(x), satisfying the following conditions:

(i)   σ(x) → ±1 as x → ±∞,
(ii)  σ(x) = 0 at a unique point x = 0,
(iii) σ′(x) > 0 and σ′(x) → 0 as x → ±∞,   (5.23)
(iv)  σ′(x) has a global maximum equal to 1.

In this case the state equation has a non-linear form. From the decomposed state equation (3.41), it is clearly seen that the states of the first layer of the network are independent of the states of the second layer and have a linear form (3.41a). The states of the second layer are described by the non-linearity (3.41b). Let Ψ = G¹₂B¹ and s¹ = G¹₂D¹u(k) − G¹₂g¹₁, where s¹ can be regarded as a threshold or a fixed input; then (3.41b) takes the form

x²(k + 1) = A²x²(k) + W²σ(Ψx¹(k) + s¹).   (5.24)

Using the linear transformation v¹(k) = Ψx¹(k) + s¹ and v²(k) = x²(k), one obtains an equivalent system:

v¹(k + 1) = ΨA¹Ψ⁻v¹(k) − ΨA¹Ψ⁻s¹ + s²,
v²(k + 1) = A²v²(k) + W²σ(v¹(k)),   (5.25)
where Ψ⁻ is a pseudoinverse of the matrix Ψ (e.g. in the Moore–Penrose sense), and s² = ΨW¹u(k) + s¹ is a threshold or a fixed input. Let v* = [v¹* v²*]ᵀ be an equilibrium point of (5.25). Introducing the equivalent coordinate transformation z(k) = v(k) − v*(k), the system (5.25) can be transformed into the form

z¹(k + 1) = ΨA¹Ψ⁻z¹(k),
z²(k + 1) = A²z²(k) + W²f(z¹(k)),   (5.26)

where f(z¹(k)) = σ(z¹(k) + v¹*(k)) − σ(v¹*(k)). Substituting z(k) = [z¹(k) z²(k)]ᵀ, one finally obtains

z(k + 1) = Az(k) + Wf(z(k)),   (5.27)

where

A = ⎡ ΨA¹Ψ⁻   0  ⎤        W = ⎡ 0    0 ⎤
    ⎣ 0       A² ⎦,           ⎣ W²   0 ⎦.   (5.28)

5.2.1 Second Method of Lyapunov
In this section, the second method of Lyapunov is used to determine stability conditions for the system (5.27).

Lemma 5.6 (Global stability theorem of Lyapunov [77]). Let x = 0 be an equilibrium point of the system

x(k + 1) = f(x(k)),   (5.29)

and V : Rⁿ → R a continuously differentiable function such that

1. V(0) = 0,
2. V(x(k)) > 0 for x ≠ 0,
3. V(x) → ∞ as ‖x‖ → ∞,
4. ΔV(x(k)) = V(x(k + 1)) − V(x(k)) < 0 for x ≠ 0.

Then the equilibrium point x = 0 is globally asymptotically stable and V(x) is a global Lyapunov function.

Theorem 5.7. The neural system represented by (5.27) is globally asymptotically stable if the following condition is satisfied:

‖A‖ + ‖W‖ < 1.   (5.30)
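The condition (5.30) is directly computable. The sketch below builds block matrices shaped as in (5.28) — with arbitrary illustrative 2×2 blocks standing in for ΨA¹Ψ⁻ and A² — and evaluates ‖A‖ + ‖W‖ in the spectral norm.

```python
import numpy as np

def norm_stable(A, W, p=2):
    """Sufficient stability test of Theorem 5.7: ||A|| + ||W|| < 1 in the chosen norm."""
    return np.linalg.norm(A, p) + np.linalg.norm(W, p) < 1.0

# Block-diagonal A and strictly lower block-triangular W, shaped as in (5.28);
# the 2x2 blocks are arbitrary example values, not taken from the book.
Z = np.zeros((2, 2))
A = np.block([[0.4 * np.eye(2), Z], [Z, np.array([[0.2, 0.1], [0.0, 0.3]])]])
W = np.block([[Z, Z], [0.2 * np.ones((2, 2)), Z]])
stable = norm_stable(A, W)   # here ||A||_2 + ||W||_2 = 0.4 + 0.4 = 0.8 < 1
```

Scaling A up by a factor of three makes the test fail, illustrating how quickly the condition becomes violated.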
Proof. Let V(z) = ‖z‖ be a Lyapunov function for the system (5.27). This function is positive definite with the minimum at z(k) = 0. The difference along the trajectory of the system is given as follows:

ΔV(z(k)) = ‖z(k + 1)‖ − ‖z(k)‖ = ‖Az(k) + Wf(z(k))‖ − ‖z(k)‖
         ≤ ‖A‖‖z(k)‖ + ‖W‖‖f(z(k))‖ − ‖z(k)‖.   (5.31)

The activation function σ is a short map with the Lipschitz constant L = 1. Then f is also a short map, with the property ‖f(z(k))‖ ≤ ‖z(k)‖, and (5.31) can be expressed in the form

ΔV(z(k)) ≤ ‖A‖‖z(k)‖ + ‖W‖‖z(k)‖ − ‖z(k)‖ = (‖A‖ + ‖W‖ − 1)‖z(k)‖.   (5.32)

From (5.32) one can see that if

‖A‖ + ‖W‖ < 1,   (5.33)

then ΔV(z(k)) is negative definite and the system (5.27) is globally asymptotically stable, which completes the proof.

Remark 5.8. The theorem formulates a sufficient condition only, not a necessary one. Therefore, if the condition (5.30) is not satisfied, one cannot judge the stability of the system.

Remark 5.9. The condition (5.30) is very restrictive. The matrix A is a block diagonal one with the entries ΨA¹Ψ⁻ and A²ᵢ, for i = 1, . . . , v₂. For block diagonal matrices, the following relation holds:

‖A‖ = max_{i=1,...,n} ‖Aᵢ‖.   (5.34)

The entries of A for i = 2, . . . , v₂ have the form (3.19). For such matrices, the norm is greater than or equal to one. Thus, Theorem 5.7 is useless, because there is no network (5.27) able to satisfy (5.30). One way to make (5.30) applicable to the system (5.27) is to use the modified neuron state matrix of the form (3.24) with the parameter ν. The parameter ν can be selected experimentally by the user or can be adapted by a training procedure.

Remark 5.10. In spite of its shortcomings, the condition (5.30) is very attractive because of its simplicity and ease of use.

Remark 5.11. The theorem is also valid for other activation functions with the Lipschitz constant L ≤ 1, satisfying the conditions (5.23).

Example 5.12. Consider the neural network described by (3.41) and (3.42) with 7 neurons in the first hidden layer and 4 neurons in the second hidden layer. Each neuron consists of the second order IIR filter and a hyperbolic tangent activation function. The network is applied to model the process (5.1). Training was carried out for 5000 steps using the SPSA algorithm with the settings a = 0.002, c = 0.01, α = 0.302, γ = 0.101, A = 100. The training set consists of 100 samples generated randomly using the uniform distribution. The sum of squared errors for the training set is 0.6943, and for the testing set containing another 100 samples it is 1.2484. The stability of the trained network was tested using the norm stability condition (5.30) as follows:

‖A‖₂ + ‖W‖₂ = 4.0399 > 1,   (5.35)
‖A‖₁ + ‖W‖₁ = 4.7221 > 1,   (5.36)
‖A‖∞ + ‖W‖∞ = 6.4021 > 1,   (5.37)
Fig. 5.14. Convergence of network states: original system (a)–(b), transformed system (c)–(d), learning track (e)
where

‖X‖₂ = √(λ_max(XᵀX)),
‖X‖₁ = max_{1≤j≤n} Σ_{i=1}^{n} |x_ij|,
‖X‖∞ = max_{1≤i≤n} Σ_{j=1}^{n} |x_ij|.

Unfortunately, based on (5.35)–(5.37), the norm stability condition cannot judge the stability of the system. On the other hand, observing the convergence of the network states one can see that the system is stable. Figures 5.14(a) and (b) present the convergence of the states of the first and second layers of the system (3.41). In turn, in Figs. 5.14(c) and (d) the convergence of the transformed autonomous system (5.27) is shown. All states converge to zero, which means that the network is stable. This experiment clearly shows that the norm stability condition is very restrictive. Moreover, to satisfy the condition (5.30) the entries of both matrices A and W should have relatively small values. The following procedure proposes the training of the network assuring the stability of the model. Assuming that each neuron in the network is represented by the modified state transition matrix (3.24), the norm of the matrix W is checked after each training step. If the norm stability condition is not satisfied, the entries of W are decreased iteratively.

Example 5.13. Let us revisit the problem considered in Example 5.12, but this time with each neuron in the network represented by the modified state transition matrix (3.24) with the parameter ν = 0.5. The training is carried out using the procedure shown in Table 5.6. In this case, the sum of squared errors for the training set is 0.7008, and for the testing set containing another

Table 5.6. Outline of norm stability checking

Step 0: Initiation
        Choose the network parameters in such a way that ‖A‖ < 1; set ν < 1
Step 1: Update the network parameters using a training algorithm
Step 2: Assure the stability of the network
        set x := 1;
        while (‖A‖ + ‖W‖ > 1) do
          x := x + 1;
          W := W/(x · ‖W‖);
        end while
Step 3: Termination criteria
        if (termination criterion satisfied) then STOP
        else go to Step 1
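The stabilizing step of Table 5.6 can be transcribed compactly. The sketch below assumes the spectral norm for ‖·‖ and abstracts the training update away; it is an illustration of the table, not the book's implementation.

```python
import numpy as np

def enforce_norm_stability(A, W):
    """Step 2 of Table 5.6: iteratively shrink W until ||A|| + ||W|| <= 1."""
    x = 1
    while np.linalg.norm(A, 2) + np.linalg.norm(W, 2) > 1.0:
        x += 1
        W = W / (x * np.linalg.norm(W, 2))   # W := W / (x * ||W||)
    return W

A = 0.5 * np.eye(3)                          # Step 0 requires ||A|| < 1
W = np.random.default_rng(0).normal(size=(3, 3))
W = enforce_norm_stability(A, W)             # after the call, ||A|| + ||W|| <= 1
```

Note that after the first pass the rescaled W has norm 1/x, so the loop terminates quickly; this is the source of the spikes in the learning track of Fig. 5.15(e).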
100 samples it is 1.2924. The stability of the trained network was tested using the norm stability condition (5.30) as follows:

‖A‖₂ + ‖W‖₂ = 0.9822 < 1.   (5.38)
In this case, the criterion is satisfied and the neural network is globally asymptotically stable. Similarly as in the previous example, Figs. 5.15(a) and (b) present the convergence of the states of the first and second layers of the system (3.41). In turn, in Figs. 5.15(c) and (d) the convergence of the transformed autonomous system (5.27) is shown. All states converge to zero, which means that the network is stable. The procedure presented in Table 5.6 guarantees the stability of the model. Recalculating the weights W can introduce perturbations to the training in the form of spikes, as illustrated in Fig. 5.15(e), but the training is, in general, convergent. The discussed examples show that the norm stability condition is a restrictive one. In order to successfully apply this criterion to network training, several modifications are required. Firstly, the form of the state transition matrix A is modified and, secondly, an update of the network weight matrix W should be performed during training. In the remainder of this section, less restrictive stability conditions are investigated.

Theorem 5.14. The neural system (5.27) is globally asymptotically stable if there exists a matrix P ≻ 0 such that the following condition is satisfied:

(A + W)ᵀP(A + W) − P ≺ 0.   (5.39)
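For given matrices, the condition (5.39) is a definiteness test that can be checked via eigenvalues. The matrices below are arbitrary small examples with P = I chosen by trial; they are not data from the book.

```python
import numpy as np

def lyapunov_condition_holds(A, W, P):
    """Check condition (5.39): (A+W)^T P (A+W) - P negative definite."""
    M = A + W
    S = M.T @ P @ M - P
    return float(np.max(np.linalg.eigvalsh((S + S.T) / 2.0))) < 0.0

A = np.diag([0.5, -0.3])                 # stable diagonal part
W = np.array([[0.0, 0.0], [0.2, 0.0]])   # strictly lower triangular, as in (5.28)
P = np.eye(2)                            # candidate P > 0 found by trial
```

Symmetrising S before calling the symmetric eigensolver guards against rounding-induced asymmetry.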
Proof. Let us consider a positive definite candidate Lyapunov function:

V(z) = zᵀPz.   (5.40)

The difference along the trajectory of the system (5.27) is given as follows:

ΔV(z(k)) = V(z(k + 1)) − V(z(k))
         = (Az(k) + Wf(z(k)))ᵀP(Az(k) + Wf(z(k))) − zᵀ(k)Pz(k)
         = zᵀ(k)AᵀPAz(k) + fᵀ(z(k))WᵀPAz(k)
           + zᵀ(k)AᵀPWf(z(k)) + fᵀ(z(k))WᵀPWf(z(k)) − zᵀ(k)Pz(k).   (5.41)

For activation functions satisfying the conditions (5.23) it holds that |f(z)| ≤ |z| and

f = {  f,   z > 0
      −f,   z < 0 };   (5.42)

then

fᵀ(z(k))WᵀPAz(k) ≤ zᵀ(k)WᵀPAz(k),   (5.43)
zᵀ(k)AᵀPWf(z(k)) ≤ zᵀ(k)AᵀPWz(k)   (5.44)
Fig. 5.15. Convergence of network states: original system (a)–(b), transformed system (c)–(d), learning track (e)
and

fᵀ(z(k))WᵀPWf(z(k)) ≤ zᵀ(k)WᵀPWz(k).   (5.45)

Substituting the inequalities (5.43), (5.44) and (5.45) into (5.41), one obtains

ΔV(z(k)) ≤ zᵀ(k)AᵀPAz(k) + zᵀ(k)WᵀPAz(k) + zᵀ(k)AᵀPWz(k)
           + zᵀ(k)WᵀPWz(k) − zᵀ(k)Pz(k)
         = zᵀ(k)(AᵀPA + WᵀPA + AᵀPW + WᵀPW − P)z(k)
         = zᵀ(k)((A + W)ᵀP(A + W) − P)z(k).   (5.46)

From (5.46) one can see that if

(A + W)ᵀP(A + W) − P ≺ 0,   (5.47)
then ΔV(z(k)) is negative definite and the system (5.27) is globally asymptotically stable.

Remark 5.15. From the practical point of view, the selection of a proper matrix P, in order to satisfy the condition (5.39), can be troublesome. Therefore, the corollary presented below allows us to verify the stability of the system in an easier manner. The corollary is formulated in the form of a Linear Matrix Inequality (LMI). Recently, LMI methods have become quite popular among researchers from the control community due to their simplicity and effectiveness, taking into account numerical complexity [174].

Lemma 5.16 (Schur complement [175]). Let A ∈ Rⁿˣⁿ and C ∈ Rᵐˣᵐ be symmetric matrices, and A ≻ 0; then

C + BᵀA⁻¹B ≺ 0   (5.48)

iff

U = ⎡ −A   B ⎤ ≺ 0     or, equivalently,     U = ⎡ C    Bᵀ ⎤ ≺ 0.   (5.49)
    ⎣ Bᵀ   C ⎦                                   ⎣ B   −A  ⎦
Corollary 5.17. The neural system (5.27) is globally asymptotically stable if there exists a matrix Q ≻ 0 such that the following LMI holds:

⎡ −Q           (A + W)Q ⎤ ≺ 0.   (5.50)
⎣ Q(A + W)ᵀ   −Q        ⎦

Proof. From Theorem 5.14 one knows that the system (5.27) is globally asymptotically stable if the following condition is satisfied:

(A + W)ᵀP(A + W) − P ≺ 0.   (5.51)

Applying the Schur complement formula to (5.51) yields

⎡ −P⁻¹        A + W ⎤ ≺ 0.   (5.52)
⎣ (A + W)ᵀ   −P     ⎦

In order to transform (5.52) into the LMI, let us introduce the substitution Q = P⁻¹ and then multiply the result from the left and the right by diag(I, Q) to obtain

⎡ −Q           (A + W)Q ⎤ ≺ 0.
⎣ Q(A + W)ᵀ   −Q        ⎦
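A matrix P satisfying (5.39) can also be obtained without a dedicated LMI solver: whenever the spectral radius of A + W is smaller than one, solving the discrete Lyapunov equation (A+W)ᵀP(A+W) − P = −I yields an admissible P. The sketch below solves it by vectorisation with numpy; this is an illustrative alternative, not the LMI Control Toolbox route used in the book.

```python
import numpy as np

def solve_discrete_lyapunov(M, Q):
    """Solve M^T P M - P = -Q by vectorisation:
    (I - kron(M^T, M^T)) vec(P) = vec(Q), with column-stacking vec."""
    n = M.shape[0]
    K = np.eye(n * n) - np.kron(M.T, M.T)
    return np.linalg.solve(K, Q.flatten(order="F")).reshape((n, n), order="F")

M = np.array([[0.5, 0.0], [0.2, -0.3]])    # A + W with spectral radius < 1
P = solve_discrete_lyapunov(M, np.eye(2))  # P then satisfies (5.39) with margin -I
```

For large state dimensions a dedicated solver (e.g. an interior point LMI solver) scales better than this O(n⁶) vectorised solve.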
Remark 5.18. The LMI (5.50) defines the so-called feasibility problem [175, 174]. This convex optimisation problem can be solved effectively using polynomial-time algorithms, e.g. interior point methods. Interior point algorithms are computationally efficient and nowadays widely used for solving LMIs.

Example 5.19. Consider again the problem presented in Example 5.12. As is shown in that example, the norm stability condition cannot ensure the stability of the neural model (3.41). In this example, the condition given in Corollary 5.17 is used to check the stability of the neural network. The problem was solved with the LMI solver implemented in the LMI Control Toolbox under Matlab 7.0. After 4 iterations the solver found the feasible solution represented by the following positive definite matrix Q:

Q =
⎡  48.6    0.1   −2.0   17.4   11.0   12.0  −14.9   −9.9   −8.9   −2.5   −1.4    2.4    1.2   −6.2   −2.1  ⎤
⎢   0.1   82.9   −3.4   −9.9   20.5   19.4   −5.5   −0.5    1.1    6.9   −0.3   −0.7    0.9    0.4   −3.4  ⎥
⎢  −1.9   −3.4   30.8    0.6    4.6  −16.7   −7.2    5.7    9.0    1.3    0.8   13.1    7.2    0.9  −0.01  ⎥
⎢  17.4   −9.9    0.6   65.7  −14.9   −9.6  −26.0    2.3    8.2   −0.7   −0.3   −4.5   −2.4    4.0   −1.4  ⎥
⎢  11.0   20.5    4.6  −14.9   56.1  −11.7   21.1   −7.1    6.4   −3.4    0.3    1.5   −1.5    1.8   −1.7  ⎥
⎢  11.9   19.4  −16.7   −9.6  −11.9   77.3  −17.5    1.1    4.3    0.5    0.1   −2.6   −0.9   −7.4    2.6  ⎥
⎢ −14.9   −5.5   −7.2  −26.0   21.1  −17.5   51.5   −2.4    3.4    2.2    0.3   −1.9   −1.4   −4.1    3.7  ⎥
⎢  −9.9   −0.5    5.7    2.3   −7.1    1.1   −2.4  123.9   23.3    6.2    2.2   −5.5    2.9   −1.3    1.9  ⎥
⎢  −8.9    1.1    9.0    8.2    6.4    4.3    3.4   23.3  185.4    1.6    4.9    3.1   −4.5    0.3   −0.1  ⎥
⎢  −2.5    6.9    1.3   −0.7   −3.4    0.5    2.2    6.2    1.6  122.5   10.3   −5.3    0.2    9.9    7.4  ⎥
⎢  −1.4   −0.3    0.8   −0.3    0.3    0.1    0.3    2.2    4.9   10.3  184.9    5.3   −3.3   −2.8    7.2  ⎥
⎢   2.4   −0.7   13.1   −4.5    1.5   −2.6   −1.9   −5.5    3.1   −5.3    5.4  120.1    7.2   −8.6   −3.4  ⎥
⎢   1.3    0.9    7.2   −2.4   −1.5   −0.9   −1.4    2.9   −4.5    0.2   −3.3    7.2  183.1   −5.4   −6.1  ⎥
⎢  −6.2    0.4    0.9    4.0    1.8   −7.4    4.1   −1.3    0.3    9.9   −2.8   −8.6   −5.4  115.2   −0.2  ⎥
⎣  −2.1   −3.4  −0.01   −1.4   −1.7    2.6    3.7    1.9   −0.1    7.4    7.2   −3.4   −6.1   −0.2  178.9  ⎦
For the matrix Q, the condition (5.50) is satisfied and the neural network is globally asymptotically stable. This example shows that the condition presented in Theorem 5.14 is less restrictive than the norm stability condition. Moreover, representing a stability condition in the form of LMIs renders it possible to easily check the stability of the neural system.

Lemma 5.20 ([176, 177]). Let A ∈ R^{q×q} be a symmetric matrix, and P ∈ R^{r×q} and Q ∈ R^{s×q} real matrices; then there exists a matrix B ∈ R^{r×s} such that

A + PᵀBᵀQ + QᵀBP ≺ 0   (5.53)
iff the inequalities W_PᵀAW_P ≺ 0 and W_QᵀAW_Q ≺ 0 both hold, where W_P and W_Q are full rank matrices satisfying Im(W_P) = ker(P) and Im(W_Q) = ker(Q).

Example 5.21. The term (5.50) can be rewritten as

⎡ −Q           (A + W)Q ⎤     ⎡ −Q    AQ ⎤     ⎡ 0 ⎤                ⎡ Q ⎤
⎢                       ⎥  =  ⎢          ⎥  +  ⎢   ⎥ W [Q  0]   +   ⎢   ⎥ Wᵀ [0  I].   (5.54)
⎣ Q(A + W)ᵀ   −Q        ⎦     ⎣ QAᵀ   −Q ⎦     ⎣ I ⎦                ⎣ 0 ⎦

Using Lemma 5.20 one obtains

W_Pᵀ ⎡ −Q    AQ ⎤ W_P ≺ 0,        W_Qᵀ ⎡ −Q    AQ ⎤ W_Q ≺ 0,   (5.55)
     ⎣ QAᵀ   −Q ⎦                      ⎣ QAᵀ   −Q ⎦

where W_P = diag(ker(W), I) and W_Q = diag(I, 0). Multiplying out the second inequality in (5.55) gives Q ≻ 0. Then (5.55) can be rewritten as

W_PᵀRW_P ≺ 0,    Q ≻ 0,   (5.56)

where

R = ⎡ −Q    AQ ⎤
    ⎣ QAᵀ   −Q ⎦.   (5.57)
These LMI conditions can be solved with a smaller computational burden than the LMI condition (5.50). The results of the computations, for different network structures, are presented in Table 5.7. The experiments were performed using the LMI Control Toolbox under Matlab 7.0 on a PC with an Intel Centrino 1.4 GHz and 512 MB RAM. Lemma 5.20 is frequently used to reduce the number of matrix variables. Since some variables can be eliminated, the computational burden can be significantly reduced. As shown in Table 5.7, the transformed LMIs (5.56) are relatively easier to solve than the LMI (5.50). For each network structure considered, the LMIs (5.56) are solved in only one step of the algorithm, whilst the LMI (5.50) requires three or four steps. As a result, the LMIs (5.56) are solved 3–5 times faster (Table 5.7).

Table 5.7. Comparison of methods

Network        LMI (5.50)                 LMIs (5.56)
structure      time [sec]   iterations    time [sec]   iterations
7-4            0.0845       4             0.0172       1
15-7           0.545        3             0.189        1
25-10          6.5          4             1.7          1
5.2.2 First Method of Lyapunov

Theorems based on Lyapunov's second method formulate sufficient conditions for the global asymptotic stability of the system. In many cases, however, there is a need to determine necessary conditions. In such cases, Lyapunov's first method can be used. Moreover, stability criteria developed using the second method of Lyapunov cannot be used as a starting point to determine constraints on the network parameters. Thus, an optimisation problem with constraints cannot be formulated. This section presents an approach, based on the first method of Lyapunov, which allows us to elaborate a training procedure with constraints on the network parameters. Thus, the training process can guarantee the stability of the neural model.

Lemma 5.22 (Lyapunov's first method). Let x* = 0 be an equilibrium point of the system

x(k + 1) = f(x(k)),   (5.58)

where f : D → Rⁿ is a continuously differentiable function and D is a neighbourhood of the origin. Define the Jacobian of (5.58) in the neighbourhood of the equilibrium point x* = 0 as

J = ∂f/∂x |_{x=0}.   (5.59)

Then
1. The origin is locally asymptotically stable if all the eigenvalues of J are inside the unit circle in the complex plane.
2. The origin is unstable if one or more of the eigenvalues of J are outside the unit circle in the complex plane.

Theorem 5.23. The neural system (5.27) composed of neurons with first order filters (r = 1) is locally asymptotically stable if the following conditions are satisfied:

1. |a¹₁ᵢ| < 1 ∀i = 1, . . . , v₁ and |a²₁ᵢ| < 1 ∀i = 1, . . . , v₂,
2. b¹₁ᵢ ≠ 0 ∀i = 1, . . . , v₁.
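For the first order case these conditions are a direct coefficient check, and Lemma 5.22 can be cross-checked numerically on the Jacobian. The sketch below assumes the diagonal structure A¹ = diag(a¹₁ᵢ), A² = diag(a²₁ᵢ) used in the proof; the coefficient values are arbitrary examples.

```python
import numpy as np

def theorem_5_23_holds(a1, a2, b1):
    """Conditions of Theorem 5.23: |a| < 1 in both layers and b nonzero."""
    return (all(abs(a) < 1.0 for a in a1)
            and all(abs(a) < 1.0 for a in a2)
            and all(b != 0.0 for b in b1))

a1, a2, b1 = [0.3, -0.8], [0.5, 0.1], [1.0, -0.4]   # arbitrary example values
ok = theorem_5_23_holds(a1, a2, b1)

# Cross-check via Lemma 5.22: with the diagonal structure used in the proof,
# the Jacobian eigenvalues are just the filter coefficients themselves.
eigs = np.linalg.eigvals(np.diag(a1 + a2))
```

Because the coefficient check is so cheap, it can be embedded directly into the training loop as a constraint.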
Proof. The Jacobian of (5.27) is given by

J = ⎡ ΨA¹Ψ⁻     0  ⎤
    ⎣ W²f′(0)   A² ⎦.   (5.60)

The characteristic equation has the form

det(J − λI) = 0.   (5.61)

The Jacobian is a block lower triangular matrix, and then the determinant of J − λI is given by

det(J − λI) = det(ΨA¹Ψ⁻ − λI) det(A² − λI).   (5.62)
Finally, the characteristic equation takes the form

det(ΨA¹Ψ⁻ − λI) det(A² − λI) = 0,   (5.63)

and the system is stable if all eigenvalues of both matrices ΨA¹Ψ⁻ and A² are located in the unit circle. In our case, A¹ = diag(a¹₁, . . . , a¹ᵥ₁), A² = diag(a²₁, . . . , a²ᵥ₂) and B¹ = diag(b¹₁, . . . , b¹ᵥ₁). If Condition 2 is satisfied, then Ψ = diag(g¹₂₁b¹₁, . . . , g¹₂ᵥ₁b¹ᵥ₁). In this trivial case, a pseudoinverse of Ψ is given as follows:

Ψ⁻ = diag( 1/(g¹₂₁b¹₁), . . . , 1/(g¹₂ᵥ₁b¹ᵥ₁) )   (5.64)

and, finally, ΨA¹Ψ⁻ = A¹. Then the system is stable if all eigenvalues of A¹ and A² are located in the unit circle. Taking into account the reasoning presented in Section 5.1, one knows that all eigenvalues of a block diagonal matrix are located in the unit circle if the eigenvalues of each matrix on the diagonal are located in the unit circle. According to this, |a¹₁ᵢ| < 1, ∀i = 1, . . . , v₁ and |a²₁ᵢ| < 1, ∀i = 1, . . . , v₂, which completes the proof.

Theorem 5.24. The neural system (5.27) composed of neurons with second order filters (r = 2) is locally asymptotically stable if the following conditions are satisfied:

1. For each entry of A¹ and A², the following set of inequalities is satisfied:

   1 − a₁ + a₂ > 0,
   1 + a₁ + a₂ > 0,   (5.65)
   1 − a₂ > 0;

2. (b¹₁ᵢ)² + (b¹₂ᵢ)² ≠ 0 ∀i = 1, . . . , v₁;
3. |b¹₁ᵢ| < |b¹₂ᵢ| ∀i = 1, . . . , v₁.

Proof. From the proof of Theorem 5.23 one knows that the system (5.27) is stable if the eigenvalues of ΨA¹Ψ⁻ and A² are located in the unit circle. Let us consider the eigenvalues of A² first. According to the reasoning presented in Section 5.1, one knows that the eigenvalues of A² are stable if for each entry on the diagonal the set of inequalities (5.65) holds. Next, take into account ΨA¹Ψ⁻. This is a block diagonal matrix with the entries ΨᵢA¹ᵢΨᵢ⁻, i = 1, . . . , v₁, where Ψᵢ = g¹₂ᵢb¹ᵢ. Using Singular Value Decomposition (SVD), it is easy to verify that

Ψᵢ⁻ = b¹ᵢ / (g¹₂ᵢ‖b¹ᵢ‖₂²),   (5.66)
5 Stability and Stabilization of Locally Recurrent Networks
where ‖x‖_2 is the Euclidean norm of the vector x. Using (5.66), Ψ_i A^1_i Ψ^-_i can be represented as

Ψ_i A^1_i Ψ^-_i = [ −a^1_{1i}(b^1_{1i})^2 + b^1_{1i} b^1_{2i}(1 − a^1_{2i}) ] / [ (b^1_{1i})^2 + (b^1_{2i})^2 ]. (5.67)

In order to obtain a stable system, the condition

| −a^1_{1i}(b^1_{1i})^2 + b^1_{1i} b^1_{2i}(1 − a^1_{2i}) | / [ (b^1_{1i})^2 + (b^1_{2i})^2 ] < 1,  ∀i = 1, …, v_1 (5.68)

should be satisfied. To clarify the presentation, in the following deliberations the index i (and, for brevity, the layer superscript) is omitted. Let us rewrite (5.68) as

−b_1^2 − b_2^2 − b_1 b_2 < f(a_1, a_2) < b_1^2 + b_2^2 − b_1 b_2, (5.69)
where f(a_1, a_2) = −a_1 b_1^2 − a_2 b_1 b_2. To complete the proof, it is necessary to show that max f(a_1, a_2) < b_1^2 + b_2^2 − b_1 b_2 and min f(a_1, a_2) > −b_1^2 − b_2^2 − b_1 b_2. Therefore, it is required to solve the two optimisation problems

max f(a_1, a_2)  and  min f(a_1, a_2)
s.t. 1 − a_1 + a_2 ≥ 0, 1 + a_1 + a_2 ≥ 0, 1 − a_2 ≥ 0. (5.70)

The graphical solution of the optimisation problems (5.70) is presented in Fig. 5.16. Since f is linear in (a_1, a_2), its extrema over the constraint triangle are attained at the vertices P_1 = (−2, 1), P_2 = (2, 1) and P_3 = (0, −1).

Case 1: b_1 b_2 > 0 and b_1^2 > b_1 b_2. The course of the cost function is presented in Fig. 5.16(a). The maximum is located at the point P_1 = (−2, 1) and the minimum at the point P_2 = (2, 1).

Case 2: b_1 b_2 > 0 and b_1^2 < b_1 b_2. The course of the cost function is presented in Fig. 5.16(b). The maximum is located at the point P_3 = (0, −1) and the minimum at the point P_2 = (2, 1). There is another possibility, when b_1 b_2 > 0 and b_1^2 = b_1 b_2, but in this case b_1 = b_2 and Condition 3 is not satisfied.

Case 3: b_1 b_2 < 0; then b_1^2 > b_1 b_2. The course of the cost function is presented in Fig. 5.16(c). The maximum is located at the point P_1 = (−2, 1) and the minimum at the point P_2 = (2, 1).
According to (5.69), one should check the following:

1. f(−2, 1) < b_1^2 + b_2^2 − b_1 b_2; in this case 2b_1^2 − b_1 b_2 < b_1^2 + b_2^2 − b_1 b_2, i.e. b_1^2 < b_2^2, and Condition 3 is satisfied;
Fig. 5.16. Graphical solution of the problems (5.70)
2. f(0, −1) < b_1^2 + b_2^2 − b_1 b_2; one obtains b_1 b_2 < b_1^2 + b_2^2 − b_1 b_2, i.e. 0 < (b_1 − b_2)^2, which holds for any b_1 ≠ b_2 (guaranteed by Condition 3);
3. f(2, 1) > −b_1^2 − b_2^2 − b_1 b_2; in this case −2b_1^2 − b_1 b_2 > −b_1^2 − b_2^2 − b_1 b_2, i.e. b_1^2 < b_2^2, and Condition 3 is satisfied.

The problems (5.70) were solved for constraints in the form of the compact set Ā, but Condition 1 defines an open set of constraints A. Therefore, the operations of
Table 5.8. Outline of constrained optimisation training
Step 0: Initiation. Choose the initial network parameters and set ε to a small value, e.g. ε = 10^-5.
Step 1: Update the network parameters using a training algorithm.
Step 2: Assure the feasibility of the matrices A^1 and A^2, e.g. using the gradient projection or the minimum distance projection proposed in Section 5.1.
Step 3: Assure the feasibility of the matrix B^1 as follows:
    for i := 1 to v_1 do
        if |b^1_{1i}| > |b^1_{2i}| then
            if b^1_{1i} > 0 then
                if b^1_{2i} > 0 then b^1_{2i} := b^1_{1i} + ε else b^1_{2i} := −b^1_{1i} − ε
            else
                if b^1_{2i} > 0 then b^1_{2i} := −b^1_{1i} + ε else b^1_{2i} := b^1_{1i} − ε
    end for
Step 4: Termination. If the termination criterion is satisfied then STOP else go to Step 1.
maximum and minimum over the compact set can be replaced by the operations of supremum and infimum over the open set as follows:

b_1^2 + b_2^2 − b_1 b_2 > max_{Ā} f(a_1, a_2) ≥ sup_{A} f(a_1, a_2), (5.71)

and

−b_1^2 − b_2^2 − b_1 b_2 < min_{Ā} f(a_1, a_2) ≤ inf_{A} f(a_1, a_2), (5.72)
which completes the proof.
Remark 5.25. Contrary to the global asymptotic stability theorems presented in Section 5.2.1, Theorems 5.23 and 5.24 formulate necessary as well as sufficient conditions for the local asymptotic stability of a neural network, and are able to judge between the stability and instability of a neural model. Furthermore, based on the conditions formulated by them, a constrained training procedure can be derived which guarantees the stability of the neural network. An example of such a training procedure for a neural network consisting of second-order filters is presented in Table 5.8.
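As an illustration, the Step 3 projection of Table 5.8 can be sketched in a few lines of Python (a hypothetical helper; b1 and b2 stand for the coefficients b^1_{1i} and b^1_{2i} of a single neuron):

```python
def project_b(b1, b2, eps=1e-5):
    """Enforce Condition 3 of Theorem 5.24 (|b1| < |b2|) by moving b2
    just past b1 in magnitude, following Step 3 of Table 5.8."""
    if abs(b1) > abs(b2):
        if b1 > 0:
            b2 = b1 + eps if b2 > 0 else -b1 - eps
        else:
            b2 = -b1 + eps if b2 > 0 else b1 - eps
    return b1, b2
```

Note that, like the pseudocode in Table 5.8, the projection preserves the sign of b2 and only enlarges its magnitude by ε past |b1|, so the disturbance to the learned parameters is minimal.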
5.3 Stability Analysis – Cascade Networks

In the following section, the stability of the cascade dynamic network discussed in Section 4.3 is investigated. It is shown that the cascade dynamic network is equivalent to the locally recurrent network with two hidden layers. Consequently, all stability methods considered in the previous sections can be successfully applied.
Consider the neural model represented by the state equation (4.17). Let Ψ = G^1_2 B^1 and s^1 = G^1_2 D^1 u(k) − G^1_2 g^1_1, where s^1 can be treated as a threshold or a fixed input; then (4.17b) takes the form

x̄^2(k+1) = Ā^2 x̄^2(k) + W̄^2 σ(Ψ x^1(k) + s^1) + W^u u(k). (5.73)

Using the linear transformation v^1(k) = Ψ x^1(k) + s^1 and v^2(k) = x̄^2(k), one obtains an equivalent system:

v^1(k+1) = Ψ A^1 Ψ^- v^1(k) − Ψ A^1 Ψ^- s^1 + s^2,
v^2(k+1) = Ā^2 v^2(k) + W̄^2 σ(v^1(k)) + s^3, (5.74)

where Ψ^- is a pseudoinverse of the matrix Ψ, and s^2 = Ψ W^1 u(k) + s^1 and s^3 = W^u u(k) are thresholds or fixed inputs. Let v* = [v^{1*} v^{2*}]^T be an equilibrium point of (5.74). Introducing the equivalent coordinate transformation z(k) = v(k) − v*(k), the system (5.74) can be transformed to the form

z^1(k+1) = Ψ A^1 Ψ^- z^1(k),
z^2(k+1) = Ā^2 z^2(k) + W̄^2 f(z^1(k)), (5.75)

where f(z^1(k)) = σ(z^1(k) + v^{1*}(k)) − σ(v^{1*}(k)). Substituting z(k) = [z^1(k) z^2(k)]^T, one finally obtains

z(k+1) = Ā z(k) + W̄ f(z(k)), (5.76)

where

Ā = [ Ψ A^1 Ψ^-   0 ;  0   Ā^2 ],   W̄ = [ 0  0 ;  W̄^2  0 ]. (5.77)
Finally, comparing (5.76) with (5.27), one can state that these two representations are analogous. Thus, all stability methods elaborated in Sections 5.2.1 and 5.2.2 can be successfully applied to stability analysis of the cascade network.
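Because Ā in (5.77) is block diagonal, its spectrum is the union of the spectra of Ψ A^1 Ψ^- and Ā^2, which is precisely why the earlier stability tests carry over to the cascade network. A quick numerical confirmation, with arbitrary illustrative blocks standing in for Ψ A^1 Ψ^- and Ā^2:

```python
import numpy as np

# Arbitrary stable diagonal blocks standing in for Psi A^1 Psi^- and A^2_bar
blk1 = np.diag([0.4, -0.2])
blk2 = np.diag([0.7, 0.1, -0.5])

# Block-diagonal state matrix A_bar as in (5.77)
A_bar = np.block([
    [blk1, np.zeros((2, 3))],
    [np.zeros((3, 2)), blk2],
])

eigs = np.linalg.eigvals(A_bar)
print(sorted(eigs.real.round(6).tolist()))  # union of the block spectra
print(bool(np.all(np.abs(eigs) < 1.0)))     # all inside the unit circle
```

The first line prints the five block eigenvalues, and the second confirms asymptotic stability; checking each block separately therefore suffices.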
5.4 Summary

The purpose of this chapter was to propose methods for the stability analysis of locally recurrent neural models. To tackle this problem for locally recurrent networks with only one hidden layer, two approaches were elaborated. The first one is based on the gradient projection, prescribing how to modify the search direction in order to meet the constraints imposed on the neural model parameters. The second method is formulated as the minimum distance projection. The search direction is modified in such a way as to find the new solution by minimizing the distance to the feasible region. Hence, the stabilization method finds a new search direction that deteriorates learning in a minimal way. An important result included in the chapter is the stability analysis. The discrete-time recurrent neural network was represented in the state space. This representation makes it
possible to utilize the state equation to derive stability conditions. The resulting state equation has a linear form with a block diagonal state transition matrix. Due to this representation, feasible regions for the network parameters were derived and employed later in the proposed stabilization methods. The chapter also presents sufficient conditions for the strong convergence of the projection algorithms under consideration. The methods were checked in a number of experiments, showing their usefulness and efficiency. It should be pointed out that the methods are very simple and numerically uncomplicated, and can be easily introduced into the learning procedure. The example of the identification of a real process confirms the effectiveness of the proposed learning with stabilization. The possibility to represent neural networks in the state space makes it possible to apply the analysed neural networks to design not only input-output models, but state-space models as well. The stability of more complex locally recurrent networks (networks with two hidden layers and cascade ones) was investigated using Lyapunov's methods. The norm stability condition is a very restrictive one, but introducing a modified neuron structure makes this condition applicable to real-life problems. Moreover, the norm stability criterion can be adopted to design stable training of the dynamic neural network, which guarantees the stability of the model. On the other hand, the condition presented in Theorem 5.14 is less restrictive than the norm stability condition, but there are problems with finding a proper matrix P able to satisfy the condition. Therefore, it is proposed to formulate this condition in the form of LMIs; then the stability can be easily checked using suitable numerical packages. Theorems based on the second method of Lyapunov give sufficient conditions for global asymptotic stability. If these conditions are not satisfied, one cannot judge the stability of the system.
If necessary conditions are required, one can use Lyapunov's first method. Algorithms utilizing the first method of Lyapunov formulate necessary as well as sufficient conditions for the local asymptotic stability of a neural network, and are able to judge between the stability and instability of a neural model. Moreover, based on the conditions formulated by these theorems, a constrained training procedure can be derived which guarantees the stability of the neural network. There are still challenging open problems, e.g. to propose new Lyapunov candidate functions which would make it possible to formulate less restrictive stability conditions, or to elaborate more robust procedures for stabilizing the neural network during training, which would deteriorate the training process in a negligible way. In the next chapter, the problem of selecting training sequences for locally recurrent neural networks is discussed. The proposed approach is based on the theory of optimum experimental design.
6 Optimum Experimental Design for Locally Recurrent Networks
A fundamental problem underlying the training of neural networks is the selection of proper input data to provide good representation of the modelled system behaviour [77, 7]. This problem comprises the determination of a limited number of observational units obtained from the experimental environment in such a way as to obtain the best quality of the system responses. The importance of input data selection has already been recognized in many application domains [178]. Fault detection and isolation of industrial systems is an example which is particularly stimulating in the light of the results reported in this monograph. One of the tasks of failure protection systems is to provide reliable diagnosis of the expected system state. But to produce such a forecast, an accurate model is necessary together with its calibration, which requires parameter estimation. The preparation of experimental conditions in order to gather informative measurements can be very expensive or even impossible (e.g. for faulty system states). On the other hand, data from a real-world system may be very noisy and using all the data available may lead to significant systematic modelling errors. As a result, we are faced with the problem of how to optimise the training data in order to obtain the most precise model. Although it is well known that the training quality for neural networks heavily depends on the choice of input sequences, there have been relatively few contributions to experimental design for those systems [179, 180] and, in addition, they focus mainly on the multi-layer perceptron class of networks. The applicability of such a static type of networks for the modelling of dynamic systems is rather limited. To the best of the author’s knowledge, the problem of optimal selection of input sequences has not been considered in the context of dynamic neural networks yet. 
This chapter aims to fill this gap and proposes a practical approach to input data selection for the training of dynamic neural networks. More precisely, locally recurrent neural networks are taken into account due to their obvious advantages in the modelling of complex dynamic systems. In addition, a particular experimental setting is assumed, i.e. that the observational data are gathered in the form of time series, so the observational unit is understood here as a finite sequence of samples.

K. Patan: Artificial Neural Net. for the Model. & Fault Diagnosis, LNCIS 377, pp. 113–122, 2008. © Springer-Verlag Berlin Heidelberg 2008. springerlink.com

The problem in question is as follows: Given a finite set of observational units, obtained under different experimental conditions, assign a non-negative weight to each of them so as to maximize the determinant of the Fisher information matrix related to the parameters to be identified. All weights should sum up to unity. The weight assigned to a particular sequence can be interpreted as the proportion of effort spent at this unit during network training, or the percentage of experimental effort spent at this unit. The potential solutions are of considerable interest when assessing which sequences are more informative than others, and they permit a complexity reduction of the training process. The solution proposed here is close in spirit to classical optimum experimental design theory for lumped systems [181, 66]. It relies on a generalization of a known numerical algorithm for the computation of a D-optimum design on a finite set. The performance of the delineated approach is illustrated via numerical simulations regarding a simple example of a linear dynamic object. The chapter is organized as follows: the first part of the chapter, including Sections 6.1 and 6.2, gives the fundamental knowledge about optimum experimental design. The proposed solution of selecting training sequences is formulated in Section 6.3. Section 6.4 contains the results of a numerical experiment showing the performance of the delineated approach. The chapter concludes with some final remarks in Section 6.5.
6.1 Optimal Sequence Selection Problem in Question

Assume that the dynamic neural network under consideration is composed of the neuron models with the IIR filter introduced in Section 3.5.1, represented in the state space by the equations (3.18) and (3.20). The analysis undertaken in this chapter is limited to a neural network consisting of one hidden layer, represented by (3.39).

6.1.1 Statistical Model

Let y^j = y(u^j; θ) = {y(k; θ)}_{k=0}^{L_j} denote the sequence of network responses for the sequence of inputs u^j = {u(k)}_{k=0}^{L_j}, related to the consecutive time instants k = 0, …, L_j < ∞ and selected from among an a priori given set of input sequences U = {u^1, …, u^S}. Here θ represents a p-dimensional unknown network parameter vector which must be estimated using observations of the system (i.e. filter parameters, weights, slope and bias coefficients). From the statistical point of view, the sequences of observations related to P different input sequences may be considered as
z^j(k) = y^j(k; θ) + ε^j(k),   k = 0, …, L_j,   j = 1, …, P, (6.1)
where z^j(k) is the output and ε^j(k) denotes the measurement noise. It is customary to assume that the measurement noise is zero-mean, Gaussian and white, i.e.

E[ε^i(k) ε^j(k′)] = v^2 δ_{ij} δ_{kk′}, (6.2)
where v > 0 is the standard deviation of the measurement noise, δ_{ij} and δ_{kk′} standing for the Kronecker delta functions. An additional substantial assumption is that the training of the neural network, equivalent to the estimation of the unknown parameter vector θ, is performed via the minimisation of the least-squares criterion

θ̂ = arg min_{θ∈Θ_ad} Σ_{j=1}^{P} Σ_{k=0}^{L_j} ‖z^j(k) − y^j(k; θ)‖^2, (6.3)

where Θ_ad is the set of admissible parameters. It becomes clear that, since y^j(k; θ) strongly depends on the input sequences u^j, it is possible to improve the training process through the appropriate selection of input sequences.

6.1.2 Sequence Quality Measure
In order to properly choose the input sequences which will be most informative for the training of the dynamic network, a quantitative measure of the goodness of parameter identification is required. A reasonable approach is to choose a performance measure defined on the Fisher Information Matrix (FIM), which is commonly used in optimum experimental design theory [182, 181, 66]. Sequences which guarantee the best accuracy of the least-squares estimates of θ are then found by choosing u^j, j = 1, …, P so as to minimize some scalar measure of performance Ψ defined on the average Fisher information matrix given by [183]:

M = (1/P) Σ_{j=1}^{P} (1/L_j) Σ_{k=0}^{L_j} H(u^j, k) H^T(u^j, k), (6.4)

where

H(u, k) = ∂y(u, k; θ)/∂θ |_{θ=θ^0} (6.5)

stands for the so-called sensitivity matrix, θ^0 being a prior estimate of the unknown parameter vector θ [184, 185, 183, 186] (see Appendix A for the derivation of the matrix H). Such a formulation is generally accepted in optimum experimental design for non-linear dynamic systems, since the inverse of the FIM constitutes, up to a constant multiplier, the Cramér-Rao lower bound on the covariance matrix of any unbiased estimator of θ [66], i.e.

cov θ̂ ⪰ M^{-1}. (6.6)

When the observation horizon is large, the non-linearity of the model with respect to its parameters is mild, and the measurement errors are independently distributed and have small magnitudes, it is legitimate to assume that the estimator is efficient in the sense that the parameter covariance matrix achieves the lower bound [185, 186].
As for Ψ, various choices exist for such a function [66, 181], but the most common are:

• the D-optimality (determinant) criterion:

Ψ(M) = −log det M; (6.7)

• the G-optimality (maximum variance) criterion:

Ψ(M) = max_{u^j∈U} φ(u^j, M), (6.8)

where

φ(u^j, M) = trace[ (1/L_j) Σ_{k=0}^{L_j} H^T(u^j, k) M^{-1} H(u^j, k) ].

The D-optimum design minimises the volume of the uncertainty ellipsoid for the estimates. In turn, the G-optimum design suppresses the maximum variance of the system response prediction. The introduction of an optimality criterion renders it possible to formulate the training sequence selection problem as the optimisation problem

Ψ[ M(u^1, …, u^P) ] → min (6.9)

with respect to u^j, j = 1, …, P belonging to the admissible set U.

6.1.3 Experimental Design
The direct consequence of the assumption (6.2) is that we admit replicated input sequences, i.e. some u^j s may appear several times in the optimal solution (because independent observations guarantee that every replication provides additional information). Consequently, it is sensible to reformulate the problem so as to operate only on the distinct sequences u^1, …, u^S instead of u^1, …, u^P, by relabelling them suitably. To this end, we introduce r_1, …, r_S as the numbers of replicated measurements corresponding to the sequences u^1, …, u^S. In this formulation, the u^i s are said to be the design or support points, and p_1, …, p_S are called their weights. The collection of variables

ξ_P = { u^1, u^2, …, u^S ; p_1, p_2, …, p_S }, (6.10)

where p_i = r_i/P and P = Σ_{i=1}^{S} r_i, is called the exact design of the experiment. The proportion p_i of observations performed for u^i can be considered as the percentage of experimental effort spent at that sequence. Hence, we are able to rewrite the FIM in the form

M(ξ_P) = Σ_{i=1}^{S} p_i (1/L_i) Σ_{k=0}^{L_i} H(u^i, k) H^T(u^i, k). (6.11)
Here the p_i s are rational numbers, since both the r_i s and P are integers. This leads to a discrete numerical analysis problem whose solution is difficult for standard optimisation techniques, particularly when P is large. A potential remedy for this problem is to extend the definition of the design. This is achieved through the relaxation of the constraints on the weights, allowing the p_i s to be considered as real numbers in the interval [0, 1]. This assumption will also be made in what follows. Obviously, we must have Σ_{i=1}^{S} p_i = 1, so we may think of the designs as probability distributions on U. This leads to the so-called continuous designs, which constitute the basis of the modern theory of optimal experiments [181, 66]. It turns out that such an approach drastically simplifies the design, and the existing rounding techniques [181] justify such an extension. Thus, we shall operate on designs of the form

ξ = { u^1, u^2, …, u^S ; p_1, p_2, …, p_S },   Σ_{i=1}^{S} p_i = 1, (6.12)
which concentrates P·p_1 observational sequences at u^1 (so we repeat the presentation of this sequence approximately P·p_1 times during the training of the network), P·p_2 at u^2, and so on. Then we may redefine the optimal design as a solution to the optimisation problem

ξ = arg min_{ξ∈Ξ(U)} Ψ[M(ξ)], (6.13)

where Ξ(U) denotes the set of all probability distributions on U.
6.2 Characterization of Optimal Solutions

In the remainder of this chapter we shall assume that H ∈ C(U; R^p). A number of characterizations of the optimal design ξ can be derived in a rather straightforward manner from the general results given in [183] or [186].

Lemma 6.1. For any ξ ∈ Ξ(U), the information matrix M(ξ) is symmetric and non-negative definite.

Let us introduce the notation M(U) for the set of all admissible information matrices, i.e.

M(U) = { M(ξ) : ξ ∈ Ξ(U) }. (6.14)

Lemma 6.2. M(U) is compact and convex.

Theorem 6.3. An optimal design exists comprising no more than p(p + 1)/2 support sequences. Moreover, the set of optimal designs is convex.

The next theorem is crucial for the approach considered and provides a tool for checking the optimality of designs. It is usually called an equivalence theorem [187].
Theorem 6.4 (Equivalence theorem). The following conditions are equivalent:

(i) the design ξ maximizes ln det M(ξ);
(ii) the design ξ minimizes max_{u^i∈U} φ(u^i, ξ);
(iii) max_{u^i∈U} φ(u^i, ξ) = p.

The so-called sensitivity function

φ(u^i, ξ) = trace[ (1/L_i) Σ_{k=0}^{L_i} H^T(u^i, k) M^{-1}(ξ) H(u^i, k) ]

is of paramount importance here. From the result above it follows that the minimisation of the average variance of the estimated system response (understood as the quality of the training process) is equivalent to the optimisation of the D-optimality criterion. This paves the way for the application of numerous efficient algorithms known from experimental design theory to the discussed problem of the selection of training sequences for the network considered. Since the analytical determination of optimal designs is difficult or impossible even for very simple network structures, some iterative design procedures are required. A simple computational scheme for this purpose is given in the next section.
6.3 Selection of Training Sequences

In the case considered here, i.e. the design over a fixed, finite set of candidate sequences, a computational algorithm can be derived based on the mapping T : Ξ(U) → Ξ(U) defined by

T ξ = { u^1, …, u^S ; p_1 φ(u^1, ξ)/p, …, p_S φ(u^S, ξ)/p }. (6.15)

From Theorem 6.4 it follows that a design ξ is D-optimal if it is a fixed point of the mapping T, i.e.

T ξ = ξ. (6.16)

Therefore, the following algorithm can be used as a generalization of that proposed in [188, p. 139] for the classical optimum experimental design problem, consisting in the iterative computation of a D-optimum design on a finite set:

Step 1: Guess a discrete starting design ξ^(0) such that p_i^(0) > 0 for i = 1, …, S. Choose some positive tolerance η ≪ 1. Set ℓ = 0.
Step 2: If the condition φ(u^i, ξ^(ℓ))/p < 1 + η for all i = 1, …, S is satisfied, then STOP.
Step 3: Construct the next design ξ^(ℓ+1) by determining its weights according to the rule

p_i^(ℓ+1) = p_i^(ℓ) φ(u^i, ξ^(ℓ))/p,   i = 1, …, S,

increment ℓ by one and go to Step 2.

The convergence result for this scheme can be found in [188] or [186].
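Under the assumptions above, the iteration ξ^(ℓ+1) = T ξ^(ℓ) takes only a few lines of NumPy. The sketch below is a hypothetical illustration (the data layout `H_list`, with `H_list[i]` holding the (L_i × p) sensitivity matrix of candidate sequence u^i stacked over time, is an assumption of this sketch): it forms the per-sequence information matrices of (6.11), evaluates φ(u^i, ξ), and rescales the weights until the bound of Theorem 6.4 is met.

```python
import numpy as np

def d_optimum_weights(H_list, eta=1e-2, max_iter=200):
    """Multiplicative algorithm of Section 6.3.
    H_list[i] is the (L_i x p) sensitivity matrix of candidate
    sequence u^i stacked over time (hypothetical data layout)."""
    S = len(H_list)
    p = H_list[0].shape[1]
    # per-sequence information matrices M_i = (1/L_i) sum_k H(k)^T H(k)
    M_i = [H.T @ H / H.shape[0] for H in H_list]
    w = np.full(S, 1.0 / S)                  # initial design: equal weights
    for _ in range(max_iter):
        M = sum(wi * Mi for wi, Mi in zip(w, M_i))
        Minv = np.linalg.inv(M)
        phi = np.array([np.trace(Minv @ Mi) for Mi in M_i])
        if phi.max() / p < 1.0 + eta:        # stopping rule of Step 2
            break
        w = w * phi / p                      # weight update of Step 3
    return w

# Toy check: 3 candidate sequences, p = 2 parameters; the third sequence
# carries the largest sensitivities, so it should dominate the design.
rng = np.random.default_rng(0)
H_list = [rng.standard_normal((50, 2)) * s for s in (0.5, 1.0, 2.0)]
w = d_optimum_weights(H_list)
print(np.isclose(w.sum(), 1.0), int(w.argmax()) == 2)  # -> True True
```

Note that the update preserves Σ_i w_i = 1 exactly, since Σ_i w_i φ_i = trace(M^{-1} M) = p; this mirrors the fixed-point property (6.16).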
6.4 Illustrative Example

6.4.1 Simulation Setting
Consider a single dynamic neuron with a second-order IIR filter and a hyperbolic tangent activation function. The neuron is used to model the linear dynamic system given by (3.60) (see Example 3.8 for details). The discrete form of this system is

y_d(k) = A_1 y_d(k−1) + A_2 y_d(k−2) + B_1 u(k−1) + B_2 u(k−2), (6.17)

with the parameters A_1 = 0.374861, A_2 = −0.367879, B_1 = 0.200281 and B_2 = 0.140827. At the beginning, the neuron was preliminarily trained using randomly generated data. The learning data were generated by feeding a random signal with the uniform distribution |u(k)| ≤ 2 to the process and recording its output. The training process was carried out off-line for 2000 steps using the EDBP algorithm. Taking into account that a neuron is a redundant system, some of its parameters are not identifiable. In order to apply optimum experimental design to neuron training, certain assumptions should be made. Therefore, we focus our attention only on the parameters directly related to the neuron dynamics, i.e. filter parameters and weights. So, without loss of generality, let us assume that the feedforward filter parameter b_0 is fixed to the value 1, and the slope of the activation function g_2 is also set to 1. This reduces the dimensionality of estimation and assures the identifiability of the rest of the parameters (i.e. it assures that the related FIM is non-singular). At the second stage of the training process, the learning data were split into 20 time sequences containing 100 consecutive samples each. The design purpose was to choose from this set of learning patterns the most informative sequences (in the sense of D-optimality) and their presentation frequency (i.e. how often they should be repeated during the training). To determine the optimal design, the numerical routine from Section 6.3 was implemented as a Matlab program. All the admissible learning sequences taken with equal weights formed the initial design. The accuracy of the design algorithm was set to η = 10^-2, and it produced the solution in no more than 200 iterations each time, i.e. below 1 second (on a PC with a Pentium M740 processor (1.73 GHz, 1 GB RAM) running Windows XP and MATLAB 7 (R14)). The convergence of the design algorithm is presented in Fig. 6.1.
Fig. 6.1. Convergence of the design algorithm
6.4.2 Results

The neuron was trained in two ways. The first way is to use the optimal training sets selected during the optimum experimental design phase. The second is to use random sequences as the training ones. The purpose of these experiments is to check the quality of parameter estimation. For a selected plan, the training was carried out 10 times. Each sequence in the plan was used proportionally to its weight in the plan. For example, if the optimal plan consists of the sequences 3, 6 and 10 with the weights 0.1, 0.2 and 0.7, respectively, then during the training the 3rd sequence is used only once, the 6th sequence twice and the 10th sequence seven times. The procedure is repeated 20 times using different measurement noise affecting the output of the system. The achieved results are presented in Table 6.1. As we can see, the accuracies of a majority of parameter estimates are improved, with some

Table 6.1. Sample mean and the standard deviation of parameter estimates

                  sample mean                 standard deviation
parameter   optimal plan   random plan   optimal plan   random plan
w             0.0284         0.0284        0.0016         0.0017
a1           −0.3993        −0.3853        0.0199         0.0241
a2            0.3775         0.3635        0.0244         0.0231
b1            3.9895         3.8301        0.1191         0.1787
b2            3.4629         3.2295        0.1527         0.0892
g1            0.0016         0.0011        0.0142         0.0191
Fig. 6.2. Average variance of the model response prediction for optimum design (diamonds) and random design (circles)
exceptions (cf. the standard deviations of a_2 or b_2). This is a direct consequence of applying a D-optimality criterion, which minimises the volume of the confidence ellipsoid for the parameter estimates, so sometimes an increase in the quality of the majority of estimates may be achieved at the cost of a few others. In Fig. 6.2, the uncertainty of network response prediction is compared based on the parameter estimates determined using an optimal and a random design. It becomes clear that training based on optimal learning sequences leads to greater reliability of the network response.
6.5 Summary The results contained in this chapter show that some well-known methods of optimum experimental design for linear regression models can be easily extended to the setting of the optimal training sequence selection problem for dynamic neural networks. The clear advantage of the proposed approach is that the quality of the training process measured in terms of the uncertainty of network response prediction can be significantly improved with the same effort spent on training or, alternatively, training process complexity can be reduced without degrading network performance. In this chapter, a very simple but extremely efficient algorithm for finding optimal weights assigned to given input sequences was extended to the framework of the optimal training sequence selection problem. Future research will be focused on the adaptation of the proposed approach to the task of fault detection and its application to industrial systems.
Appendix A. Derivation of the Sensitivity Matrix

Partial derivatives of the output with respect to the neuron parameters are obtained as follows (σ′ denotes the derivative of the activation function evaluated at the neuron's activation potential).

For a_i, i = 1, 2:

∂y(k)/∂a_i = σ′ g_2 ( b_1 ∂x_1(k)/∂a_i + b_2 ∂x_2(k)/∂a_i ), (6.18)

where

∂x_1(k)/∂a_i = −x_i(k−1) − a_1 ∂x_1(k−1)/∂a_i − a_2 ∂x_2(k−1)/∂a_i (6.19)

and

∂x_2(k)/∂a_i = ∂x_1(k−1)/∂a_i. (6.20)

For b_0:

∂y(k)/∂b_0 = σ′ g_2 w u(k). (6.21)

For b_i, i = 1, 2:

∂y(k)/∂b_i = σ′ g_2 x_i(k). (6.22)

For w:

∂y(k)/∂w = σ′ g_2 ( b_1 ∂x_1(k)/∂w + b_2 ∂x_2(k)/∂w ), (6.23)

where

∂x_1(k)/∂w = u(k) − a_1 ∂x_1(k−1)/∂w − a_2 ∂x_2(k−1)/∂w (6.24)

and

∂x_2(k)/∂w = ∂x_1(k−1)/∂w. (6.25)

For g_1:

∂y(k)/∂g_1 = −σ′ g_2. (6.26)

For g_2:

∂y(k)/∂g_2 = σ′ ( b_1 x_1(k) + b_2 x_2(k) + b_0 w u(k) − g_1 ). (6.27)
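The recursions above translate directly into code. A minimal Python sketch for the sensitivity ∂y(k)/∂a_1 follows; the parameter values are hypothetical, a tanh activation is assumed, and the output is taken as y(k) = σ(g_2(b_1 x_1(k) + b_2 x_2(k) + b_0 w u(k) − g_1)), consistent with (6.26)-(6.27):

```python
import numpy as np

def dy_da1(u, w, a1, a2, b0, b1, b2, g1, g2):
    """Sensitivity dy(k)/da1 via the recursions (6.18)-(6.20).
    A tanh activation is assumed, so sigma'(s) = 1 - tanh(s)**2."""
    x1 = x2 = 0.0        # IIR filter states x1(k), x2(k)
    dx1 = dx2 = 0.0      # dx1/da1, dx2/da1
    out = []
    for uk in u:
        # filter state update: x1(k) = -a1 x1(k-1) - a2 x2(k-1) + w u(k)
        x1_prev = x1
        x1 = -a1 * x1 - a2 * x2 + w * uk
        x2 = x1_prev
        # sensitivity recursions (6.19)-(6.20) for i = 1
        dx1, dx2 = -x1_prev - a1 * dx1 - a2 * dx2, dx1
        s = g2 * (b1 * x1 + b2 * x2 + b0 * w * uk - g1)  # activation potential
        sigma_prime = 1.0 - np.tanh(s) ** 2
        out.append(sigma_prime * g2 * (b1 * dx1 + b2 * dx2))  # (6.18)
    return np.array(out)

u = np.sin(0.3 * np.arange(100))
dy = dy_da1(u, w=0.5, a1=0.3, a2=-0.2, b0=1.0, b1=1.0, b2=0.5, g1=0.0, g2=1.0)
print(dy.shape)  # -> (100,)
```

Stacking such columns for every parameter of interest yields the sensitivity matrix H used in (6.4); the analytic values can be cross-checked against a central finite difference of the simulated output.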
7 Decision Making in Fault Detection
Every model based fault diagnosis scheme that utilizes an analytical, a neural or a fuzzy model contains a decision part, in which the evaluation of the residual signal takes place and, subsequently, the decision about faults is made in the form of an alarm. Residual evaluation is nothing else but a logical decision making process that transforms quantitative knowledge into qualitative Yes-No statements [4, 9, 7]. It can also be seen as a classification problem. The task is to match each pattern of the symptom vector with one of the pre-assigned classes of faults and the fault-free case. This process may highly benefit from the use of intelligent decision making. A variety of well-established approaches and techniques (thresholds, adaptive thresholds, statistical and classification methods) can be used for residual evaluation. A desirable property of decision making is insensitivity to uncontrolled effects such as changes in the inputs u and the state x, disturbances, model errors, etc. The reasons why decision making can be sensitive to the mentioned uncontrolled effects are as follows [189]:
• Sometimes it is impossible to completely decouple disturbances and the effects of faults;
• Unmodelled disturbances or an incorrect model structure imply that the performance of the decision making block is decreased;
• Even though noise terms are included in the model, it is impossible to prevent the noise from affecting the decision making process.
These factors make the problem of robust decision making extremely important when designing fault detection and isolation systems [190, 191]. The robustness of fault diagnosis can be achieved in many ways. In the following sections, different decision making techniques based on artificial intelligence will be discussed and investigated, e.g. the realization of adaptive thresholds or robust fault detection using uncertainty models [192].
Most of the decision making techniques presented in this chapter are based on artificial intelligence methods, which are very attractive and more and more frequently used in FDI systems [64, 7]. The chapter is composed of two parts. The first part consists of Sections 7.1 and 7.2, and is devoted to algorithms and methods of constant threshold calculation. Section 7.1 briefly describes algorithms for generating simple thresholds based on the assumption that a residual signal has a normal distribution. A slightly different approach is shown in Section 7.2, where a simple neural network is first used to approximate the probability density function of a residual and then a threshold is calculated. The second part, comprising Section 7.3, presents several robust techniques for decision making. Section 7.3.1 discusses a statistical approach to adapting the threshold using a time window and recalculating the mean value and the standard deviation of a residual. The application of fuzzy logic to threshold adaptation is described in Section 7.3.2. The robustness of the decision making process obtained through model error modelling and neural networks is investigated in Section 7.3.3. The chapter concludes with some final remarks in Section 7.4.

K. Patan: Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes, LNCIS 377, pp. 123–140, 2008. © Springer-Verlag Berlin Heidelberg 2008
7.1 Simple Thresholding

To evaluate residuals and to obtain information about faults, simple thresholding can be applied. If residuals are smaller than the threshold value, a process is considered to be healthy; otherwise, it is faulty [193]. For fault detection, the residual should ideally be zero in the fault-free case and different from zero in the case of a fault. In practice, due to modelling uncertainty and measurement noise, it is necessary to assign thresholds larger than zero in order to avoid false alarms. This operation causes a reduction in fault detection sensitivity. Therefore, the choice of the threshold is a compromise between fault detection sensitivity and the false alarm rate. In order to select the threshold, let us assume that the residual satisfies

r(k, θ) = ε(k),  k = 1, ..., N,   (7.1)

where ε(k) are N(m, v) random variables with the mean value m and the standard deviation v, N is the number of samples used to calculate m and v, and θ is the vector of the model parameters. A significance level β corresponds to the probability that the normalized residual exceeds a value tβ of the N(0, 1) distribution [66]:

β = P{ (r(k) − m)/v > tβ }.   (7.2)

The values of tβ are tabulated in most statistical books. In this way, assuming a significance level β, one can obtain a tabulated value of tβ and then a threshold T according to the formula

T = tβ v + m.   (7.3)

The decision making algorithm compares the absolute residual value with its assigned threshold. The diagnostic signal s(r) takes the value of one if the threshold value T has been exceeded:

s(r) = 0 if |r(k)| ≤ T,  s(r) = 1 if |r(k)| > T.   (7.4)
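Under the normality assumption, the threshold (7.3) and the decision rule (7.4) reduce to a few lines of code. The sketch below assumes NumPy and SciPy are available; SciPy only supplies the tabulated N(0, 1) value tβ:

```python
import numpy as np
from scipy.stats import norm

def simple_threshold(residual, beta=0.01):
    """Constant threshold T = t_beta * v + m, cf. (7.3)."""
    m = residual.mean()
    v = residual.std(ddof=1)
    t_beta = norm.ppf(1.0 - beta)   # tabulated N(0,1) value for level beta
    return t_beta * v + m

def diagnostic_signal(residual, T):
    """Diagnostic signal s(r): 1 when |r(k)| > T, cf. (7.4)."""
    return (np.abs(residual) > T).astype(int)
```

The choice of β directly trades fault sensitivity against the false alarm rate, exactly as discussed above.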
Another method frequently used to derive a threshold is the ζ-standard deviation one. Assuming that the residual is an N(m, v) random variable, thresholds are assigned to the values

T = m ± ζv,   (7.5)

where ζ, in most cases, is equal to 1, 2 or 3. The probability that a sample exceeds the threshold (7.5) is equal to 0.15866 for ζ = 1, 0.02275 for ζ = 2 and 0.00135 for ζ = 3. The described method works well and gives satisfactory results only when the normality assumption of the residual is satisfied. A discussion about normality testing is carried out in Section 7.2.1.

Example 7.1. The following example shows a residual with thresholds calculated using (7.3) and (7.5). The mean value m and the standard deviation v calculated over the residual consisting of N = 1000 samples were as follows:

m = (1/N) Σ_{i=1}^{N} r_i = 0.0253,  v = sqrt( (1/(N−1)) Σ_{i=1}^{N} (r_i − m)² ) = 0.0018.   (7.6)

Fig. 7.1. Residual with thresholds calculated using: (7.3) (a), (7.5) with ζ = 1 (b), (7.5) with ζ = 2 (c), (7.5) with ζ = 3 (d)
Figure 7.1(a) presents a residual signal along with thresholds determined using (7.3). In turn, Fig. 7.1(b) shows thresholds calculated using (7.5) with ζ = 1. In this case, a large number of false alarms is generated. By increasing the value of ζ, the confidence interval becomes wider (Figs. 7.1(c) and (d)) and, simultaneously, the number of false alarms is reduced significantly. Unfortunately, the sensitivity of the fault detection algorithm decreases. As one can see in Fig. 7.1(d), with ζ = 3 there are almost no false alarms, but small faults can be hidden from the decision making procedure. This example clearly illustrates that the choice of the threshold is a compromise between fault detection sensitivity and the false alarm rate.
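The exceedance probabilities quoted above are simply values of the standard normal survival function, which can be checked numerically (assuming SciPy):

```python
from scipy.stats import norm

# One-sided probability that a normal residual exceeds the threshold m + zeta*v:
for zeta in (1, 2, 3):
    print(f"zeta = {zeta}: P = {norm.sf(zeta):.5f}")
# zeta = 1: P = 0.15866
# zeta = 2: P = 0.02275
# zeta = 3: P = 0.00135
```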
7.2 Density Estimation

7.2.1 Normality Testing
In most cases, histograms do not help much in deciding whether to accept or reject the normality assumption. The most popular method of normality testing is to compare the cumulative distribution function of the residual Fr(x) with that of a normal distribution F(x) [66, 52]. The first step of the procedure is to normalize the residual as follows:

rn(k) = (r(k) − m)/v,  k = 1, ..., n,   (7.7)

where m is the mean value of r(k) and v is the standard deviation of r(k). Then, the residual is ordered by indexing time instants:

rn(k1) ≤ rn(k2) ≤ ··· ≤ rn(kn).   (7.8)

The empirical cumulative distribution function is then

Fr(x) = 0 if x < rn(k1);  Fr(x) = i/n if rn(ki) ≤ x < rn(ki+1);  Fr(x) = 1 if rn(kn) ≤ x.   (7.9)
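The normalization (7.7), the ordering (7.8) and the empirical distribution function (7.9) can be sketched as:

```python
import numpy as np

def empirical_cdf(residual):
    """Normalize the residual (7.7), order it (7.8) and return the points
    of the empirical cumulative distribution function (7.9)."""
    m, v = residual.mean(), residual.std(ddof=1)
    rn = np.sort((residual - m) / v)      # ordered, normalized residual
    n = rn.size
    Fr = np.arange(1, n + 1) / n          # value i/n at each ordered sample
    return rn, Fr
```

Plotting Fr against the N(0, 1) distribution function, or building the probability plot described below, then takes one further line with any plotting library.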
After that, Fr(x) can be plotted against the cumulative distribution function of a normal variable N(0, 1), i.e. F(x). Now, making a decision about rejecting or accepting the normality assumption is much easier than in the case of a histogram. Another simple method for normality testing is the so-called probability plot, which can be obtained by plotting F(x) as a function of i/n. When the normality assumption is satisfied, this plot should be close to a straight line. If normality testing fails, it means that using normal distribution statistics for decision making can cause significant mistakes in fault detection, e.g. a large number of false alarms may occur in the diagnosed system. Thus, in order to select a threshold in a more proper way, the distribution of the residual should be
Fig. 7.2. Normality testing: comparison of cumulative distribution functions (a), probability plot (b)
discovered, or a transformation of the residual to another known distribution should be performed. A possible solution of this problem is discussed in the next section.

Example 7.2. Consider again the residual used in Example 7.1. In the following example, the two methods of normality testing discussed earlier are illustrated. In Fig. 7.2(a), the cumulative distribution function of the residual (dashed) is compared with that of the normal distribution (solid). As one can observe, the normality assumption does not seem to be valid in this case, because the empirical cumulative distribution function of the residual is not symmetric. The probability plot for the residual is shown in Fig. 7.2(b). This plot clearly shows that there are large deviations from the normal distribution for probabilities from the intervals (0, 0.1) and (0.9, 1). Assuming that the residual has the normal distribution and applying a confidence level, a significant mistake in decision making can be made.

7.2.2 Density Estimation
The transformation of a random vector x of an arbitrary distribution to a new random vector y of a different distribution can be realized by maximising the mutual information that the output of a neural network contains about its input [23, 194, 195]. For invertible continuous deterministic mappings, the mutual information between inputs and outputs can be maximised by maximising the entropy of the output alone [195]. Let us consider a situation when a single input is passed through a transforming function σ(x) to give an output y, where σ(x) is a monotonically increasing continuous function satisfying lim_{x→+∞} σ(x) = 1 and lim_{x→−∞} σ(x) = 0. The probability density function of y satisfies

p_y(y) = p_x(x) / |∂y/∂x|.   (7.10)
The entropy of the output h(y) is given by

h(y) = −E{log p_y(y)},   (7.11)

where E{·} stands for the expected value. Substituting (7.10) into (7.11) gives

h(y) = h(x) + E{ log |∂y/∂x| }.   (7.12)

The first term on the right can be considered to be unaffected by alterations in the parameters of σ(x). Therefore, to maximise the entropy of y one needs to take into account the second term only. Let us define the divergence between two density functions as follows:

D(p_x(x), q_x(x)) = E{ log ( p_x(x) / q_x(x) ) }.   (7.13)

Substituting

q_x(x) = |∂y/∂x|   (7.14)

and using (7.12), one finally obtains

h(y) = −D(p_x(x), q_x(x)).   (7.15)

The divergence between the true density of x, p_x(x), and an arbitrary one, q_x(x), is minimised when the entropy of y is maximised. The input probability density function is then approximated by |∂y/∂x|. A simple and elegant way to adjust the network parameters in order to maximise the entropy of y was given in [195]. The authors of that work used the on-line version of a stochastic gradient ascent rule in the form

Δv = ∂h(y)/∂v = ∂/∂v ( log |∂y/∂x| ) = (∂y/∂x)^{−1} ∂/∂v (∂y/∂x),   (7.16)

where v is a generalized network parameter. Considering the logistic transfer function of the form

y = 1 / (1 + exp(−u)),  u = wx + b,   (7.17)

where w is the input weight and b is the bias weight parameter, and applying (7.16) to (7.17) yield

Δw = 1/w + x(1 − 2y).   (7.18)

Using similar reasoning, a rule for the bias weight parameter can be derived as follows:

Δb = 1 − 2y.   (7.19)

The presented algorithm is a self-organizing one and does not assume any a priori knowledge about the input distribution. The learning rule (7.18) is anti-Hebbian, with an anti-decay term. The anti-Hebbian term keeps y away from
Fig. 7.3. Neural network for density calculation
saturation at 0 and 1. Unfortunately, alone this term adjusts the weight w to go to 0. Therefore, the anti-decay term (1/w) keeps the network away from the situation when w is too small and y stays around 0.5. After training, the estimated probability density function can be calculated using the scheme shown in Fig. 7.3. The estimate of the input probability density function takes the form

q̂x(x) = |w| y(1 − y).   (7.20)

The proposed algorithm can be successfully applied to shape an unknown but symmetric input probability density function. In many cases, however, residuals generated by models of diagnosed processes have asymmetric probability density functions. To estimate such probability density functions, a more complex neural network is proposed. Let us consider a simple neural network consisting of two sigmoidal neurons connected in series (Fig. 7.4). This neural model is described by the formulae

y = 1 / (1 + exp(−w2 z − b2)),  z = 1 / (1 + exp(−w1 x − b1)).   (7.21)
To derive learning rules for the network parameters, the unsupervised learning presented earlier can also be adapted. The update of a generalized network parameter v is given as follows:

Δv = (∂y/∂x)^{−1} ∂/∂v (∂y/∂x),   (7.22)

where the partial derivative of y with respect to x is represented as

∂y/∂x = w1 w2 y(1 − y) z(1 − z).   (7.23)

Fig. 7.4. Simple two-layer network
After simple but time-consuming calculations one obtains the following update rules:

• for the parameter w1:

Δw1 = (∂y/∂x)^{−1} ∂/∂w1 (∂y/∂x) = 1/w1 + w2 x z(1 − z)(1 − 2y) + x(1 − 2z),   (7.24)

• for the parameter w2:

Δw2 = (∂y/∂x)^{−1} ∂/∂w2 (∂y/∂x) = 1/w2 + z(1 − 2y),   (7.25)

• for the parameter b1:

Δb1 = (∂y/∂x)^{−1} ∂/∂b1 (∂y/∂x) = w2 z(1 − z)(1 − 2y) + (1 − 2z),   (7.26)

• for the parameter b2:

Δb2 = (∂y/∂x)^{−1} ∂/∂b2 (∂y/∂x) = 1 − 2y.   (7.27)
In this case, the input probability density function is approximated by |∂y/∂x| in the following form:

q̂x(x) = |w1 w2| y(1 − y) z(1 − z).   (7.28)
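The self-organizing training of the two-neuron network (7.21) with the update rules (7.24)–(7.27) can be sketched as follows; the learning rate, the initial parameter values and the number of epochs are illustrative assumptions, and positive weights are assumed so that the anti-decay terms 1/w stay well defined:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_density_net(x, lr=0.01, epochs=5):
    """Entropy-maximising training of the two-neuron network (7.21)
    using the stochastic update rules (7.24)-(7.27)."""
    w1, b1, w2, b2 = 1.0, 0.0, 1.0, 0.0
    for _ in range(epochs):
        for xi in x:
            z = sigmoid(w1 * xi + b1)
            y = sigmoid(w2 * z + b2)
            dw1 = 1.0 / w1 + w2 * xi * z * (1 - z) * (1 - 2 * y) + xi * (1 - 2 * z)
            dw2 = 1.0 / w2 + z * (1 - 2 * y)
            db1 = w2 * z * (1 - z) * (1 - 2 * y) + (1 - 2 * z)
            db2 = 1 - 2 * y
            w1, b1 = w1 + lr * dw1, b1 + lr * db1
            w2, b2 = w2 + lr * dw2, b2 + lr * db2
    return w1, b1, w2, b2

def density_estimate(x, w1, b1, w2, b2):
    """q_hat(x) = |w1 w2| y(1-y) z(1-z), cf. (7.28)."""
    z = sigmoid(w1 * x + b1)
    y = sigmoid(w2 * z + b2)
    return np.abs(w1 * w2) * y * (1 - y) * z * (1 - z)
```

After training on the residual samples, `density_estimate` delivers the asymmetric density approximation used for threshold selection in the following subsections.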
7.2.3 Threshold Calculating – A Single Neuron
For a given significance level β, the objective is to find a and b in such a way as to satisfy the condition

∫_a^b q̂x(x) dx = 1 − β.   (7.29)

Moreover, for symmetric probability density functions another condition, in the form

∫_{−∞}^a q̂x(x) dx = β/2,   (7.30)

is frequently used. Then, a threshold assigned to a given significance level is determined as follows:

Tβ = q̂x(a) = q̂x(b).   (7.31)

Taking into account (7.20), (7.29) can be rewritten as

y(b) − y(a) = 1 − β,   (7.32)

and (7.30), knowing that for the sigmoidal function y(−∞) = 0, takes the form

y(a) = β/2.   (7.33)

Substituting (7.33) into (7.31) yields

Tβ = |w| · 0.5β(1 − 0.5β).   (7.34)
Thus, the threshold can be determined using only the weight w and the significance level β.

Example 7.3. This example shows how a single neuron can be used in the decision making process. The estimated probability density function of a residual given by (7.20) can be represented in the form of the neural network shown in Fig. 7.3. Furthermore, the threshold can be calculated using (7.34). Now, decision making can be carried out easily. The residual is fed to the network input to obtain the output q̂x(x). Then, the value of q̂x(x) is compared with the threshold. If q̂x(x) is greater than the threshold, the system is healthy; otherwise, it is faulty. An illustration of this process is presented in Fig. 7.5, where around the 2800-th time instant a fault occurred in the system, which was clearly detected by the proposed approach.
Fig. 7.5. Output of the network (7.20) with the threshold (7.34)
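A minimal sketch of this decision rule, combining the density estimate (7.20) with the threshold (7.34); the parameter values used in the test are illustrative:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def single_neuron_threshold(w, beta):
    """Threshold assigned to the significance level beta, cf. (7.34)."""
    return abs(w) * 0.5 * beta * (1.0 - 0.5 * beta)

def is_faulty(r, w, b, beta):
    """Feed the residual r to the neuron and compare q_hat(r) with T_beta:
    the system is flagged faulty when the density estimate (7.20) drops
    below the threshold."""
    y = sigmoid(w * r + b)
    q = abs(w) * y * (1.0 - y)
    return q < single_neuron_threshold(w, beta)
```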
7.2.4 Threshold Calculating – A Two-Layer Network
In the case of an input signal with an asymmetric probability density function, the condition (7.30) cannot be satisfied and the analytic calculation of a threshold can be difficult or impossible. Instead, an iterative procedure is proposed (the algorithm is presented in Table 7.1). The algorithm starts with an arbitrarily selected a, and then evaluates the neural network to obtain y(a). After that, it checks whether y(a) has a feasible value or not. If yes, then q̂x(a) and q̂x(b) are calculated. Finally, if the absolute value of the difference between q̂x(b) and q̂x(a) is less than an assumed accuracy, the threshold has been found. Otherwise, a is increased and the procedure is repeated. As q̂x(x) is only an estimate of the input probability density function, the condition

∫_{−∞}^{+∞} q̂x(x) dx = 1   (7.35)

can be difficult to satisfy. Please note that the value of ∫_{−∞}^{+∞} q̂x(x)dx is used to check the feasibility of y(a) as well as to calculate y(b). Therefore, at the beginning of the algorithm, S is calculated to determine the estimated value of ∫_{−∞}^{+∞} q̂x(x)dx.

Table 7.1. Threshold calculating

Step 0: Initiation
  Choose the accuracy ε, the step δ and β; set Tβ = 0 and an initial a; calculate S := y(+∞) − y(−∞).
Step 1: Calculating q̂x(a)
  Calculate y(a) according to (7.21).
  If y(a) < 1 − S(1 − β), then calculate q̂x(a) using (7.28); else STOP.
Step 2: Calculating q̂x(b)
  Calculate in turn: y(b) = (1 − β)S + y(a),
  z(b) = −( log((1 − y(b))/y(b)) + b2 ) / w2,
  and finally q̂x(b) using (7.28).
Step 3: Termination criterion
  If |q̂x(b) − q̂x(a)| < ε, then Tβ := q̂x(a); STOP.
  Else a := a + δ; go to Step 1.
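The procedure of Table 7.1 can be sketched as follows; positive weights are assumed (so that z(−∞) = 0 and z(+∞) = 1), and the starting point, the step and the accuracy are illustrative choices:

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def q_hat(y, z, w1, w2):
    # q_hat(x) = |w1 w2| y(1-y) z(1-z), cf. (7.28)
    return abs(w1 * w2) * y * (1 - y) * z * (1 - z)

def two_layer_threshold(w1, b1, w2, b2, beta, a=-20.0, step=0.005, eps=1e-3):
    """Iterative threshold search of Table 7.1 for a trained two-layer
    network; returns T_beta, or None when no feasible value is found."""
    # S estimates the integral of q_hat over the real line (w1 > 0 assumed)
    S = sigmoid(w2 * 1.0 + b2) - sigmoid(w2 * 0.0 + b2)
    while True:
        z_a = sigmoid(w1 * a + b1)
        y_a = sigmoid(w2 * z_a + b2)
        if y_a >= 1.0 - S * (1.0 - beta):        # feasibility check, Step 1
            return None
        q_a = q_hat(y_a, z_a, w1, w2)
        y_b = (1.0 - beta) * S + y_a             # Step 2
        z_b = -(math.log((1.0 - y_b) / y_b) + b2) / w2
        q_b = q_hat(y_b, z_b, w1, w2)
        if abs(q_b - q_a) < eps:                 # Step 3
            return q_a                           # T_beta
        a += step
```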
7.3 Robust Fault Diagnosis

In recent years, great emphasis has been put on providing uncertainty descriptions for models used for control purposes or fault diagnosis design. These problems can be referred to as robust identification. The robust identification procedure should deliver not only a model of a given process, but also a reliable estimate of the uncertainty associated with the model. There are three main factors which contribute to uncertainty in models fitted to data [196]:
• noise corrupting the data,
• changing plant dynamics,
• selecting a model form which cannot capture the true process dynamics.
Two main philosophies exist in the literature:
1. Bounded error approaches or set-membership identification. This group of approaches relies on the assumption that the identification error is unknown but bounded. In this case, identification provides hard error bounds, which guarantee upper bounds on model uncertainty [197]. In this framework, robustness is integrated directly into the identification process;
2. Statistical error bounds. In these approaches, statistical methods are used to quantify model uncertainty in the form of so-called soft error bounds. In this framework, identification is carried out without robustness deliberations, and robustness is then considered as an additional step. This usually leads to least-squares estimation and prediction error methods [198].
In the framework of fault diagnosis, robustness plays an important role. Model based fault diagnosis is built on a number of idealized assumptions. One of them is that the model of the system is a faithful replica of plant dynamics. Another one is that disturbances and noise acting upon the system are known. This is, of course, not the case in engineering practice. The robustness problem in fault diagnosis can be defined as the maximisation of the detectability and isolability of faults together with the minimisation of the effects of uncontrolled factors such as disturbances, noise, changes in inputs and/or the state, etc. [9]. In the fault diagnosis area, robustness can be achieved in two ways [9, 199]:
1. active approaches – based on generating residuals insensitive to model uncertainty and simultaneously sensitive to faults,
2. passive approaches – enhancing the robustness of the fault diagnosis system in the decision making block.
Active approaches to fault diagnosis are frequently realized using, e.g. unknown input observers, robust parity equations or H∞ techniques. However, in the case of models with uncertainty located in the parameters, perfect decoupling of residuals from uncertainties is limited by the number of available measurements [15]. An alternative solution is to use passive approaches, which propagate uncertainty into the residuals. Robustness is then achieved through the use of adaptive thresholds [65]. The passive approach has an advantage over the active one because it can achieve the robustness of the diagnosis procedure in spite of uncertain model parameters and without any approximation based on simplifications of the underlying parameter representation. The shortcoming of passive approaches is that faults producing a residual deviation smaller than model uncertainty can be missed. In the remainder of this section, some passive approaches to robust fault diagnosis are discussed.

7.3.1 Adaptive Thresholds
In practice, due to modelling uncertainty and measurement noise, it is necessary to set the threshold T to a larger value in order to avoid false alarms. This operation causes a reduction in fault detection sensitivity. Therefore, the choice of the threshold is a compromise between fault detection sensitivity and the false alarm rate. For that reason, it is recommended to apply adaptive thresholds, whose main idea is that they should vary in time, since disturbances and other uncontrolled effects can also vary in time. A simple idea is to construct such an adaptive threshold based on the estimation of statistical parameters from the past observations of the residual. Assume that the residual approximately has a normal distribution. Over the past n samples, one can calculate the estimated value of the mean:

m(k) = (1/n) Σ_{i=k−n}^{k} r(i),   (7.36)

and the variance

v(k) = (1/(n−1)) Σ_{i=k−n}^{k} (r(i) − m(k))²,   (7.37)

where 0 < n < k. Using (7.36) and (7.37), a threshold can be calculated according to the following formula:

T(k) = tβ v(k) + m(k).   (7.38)
The main problem here is to choose properly the length of the time window n. If n is too small, the threshold adapts very quickly to any change in the residual caused by any factor, e.g. disturbances, noise or a fault. If n is too large, the threshold acts in a similar way as a constant one, and the sensitivity of decision making is decreased. In order to avoid too fast an adaptation to the changing residual, it is proposed to apply the weighted sum of the current and previous residual statistics as follows:

T(k) = tβ v̄(k) + m̄(k),   (7.39)

where

v̄(k) = ζ v(k) + (1 − ζ) v(k − 1),   (7.40)

and ζ ∈ (0, 1) is the momentum parameter controlling the influence of the current and previous values of the standard deviation on the threshold level. In a similar way, m̄(k) can be calculated as

m̄(k) = ζ m(k) + (1 − ζ) m(k − 1).   (7.41)
In practice, in order to obtain the expected behaviour of the threshold, it is recommended to use a value of ζ slightly lower than 1, e.g. ζ = 0.99. The presented method takes into account the analysis of the residual signal only. In order to obtain a more reliable method for threshold adaptation, one should estimate model uncertainty taking into account other process variables, e.g. measurable process inputs and outputs. The adaptive threshold based on the measure of model uncertainty can be represented as follows:

T(k) = c1 U(k) + c2,   (7.42)

where U(k) is the measure of model uncertainty at the time instant k, c1 denotes a known bound of the model error, and c2 represents the level of disturbances such as measurement noise. The uncertainty U(k) can be obtained as, e.g. a minimised sum of prediction errors [189], a norm of the output filter [200] or the filtering of model uncertainty [201]. Unfortunately, the presented methods based on model uncertainty are devoted to linear systems. The next section proposes a threshold adaptation method for non-linear systems.
Example 7.4. The following example shows the properties of the adaptive thresholds (7.38) and (7.39). The length of the time window was set to n = 10 and the momentum parameter to ζ = 0.99. Thresholds calculated for the significance level β = 0.01, along with an exemplary residual signal, are presented in Fig. 7.6, where the residual is marked with the solid line, the thresholds given by (7.38) are marked with the dotted lines, and the thresholds calculated using (7.39) are marked with the dashed lines. In the case of the thresholds (7.38), the short time window makes the thresholds adapt to a changing residual very quickly, even when a fault occurs at the 850-th time instant. This example clearly shows that this kind of threshold is useless when a short time window is used. The thresholds (7.39) perform much better. Due to the momentum term of quite a large value, fast changes in the currently calculated residual statistics do not influence much the current value of the thresholds. The remaining problem is to select the proper value of the momentum term, which can be troublesome.
Fig. 7.6. Residual signal (solid), adaptive thresholds calculated using (7.38) (dotted), and adaptive thresholds calculated using (7.39) (dashed)
7.3.2 Fuzzy Threshold Adaptation
Adaptive thresholding can also be successfully realized using the fuzzy logic approach. Threshold changes can be described by fuzzy rules and fuzzy variables [202, 203]. The threshold is adapted based on the changes of the values of u and y_p. The idea is presented in Fig. 7.7. The inputs u and the outputs y_p are expressed in the form of fuzzy sets by proper membership functions, and then the adaptation of the threshold is performed with the help of fuzzy sets. The resulting relationship for the fuzzy threshold adaptation is given by

T(u, y_p) = T0 + ΔT(u, y_p),   (7.43)
where T0 denotes the constant (nominal) threshold and ΔT(u, y_p) denotes the effect of modelling errors due to deviations of the process from its operating point. The value of the nominal threshold T0 can be set as follows:

T0 = m0 + v0,   (7.44)

where m0 is the mean value of the residual under nominal operating conditions and v0 denotes the standard deviation of the residual under nominal operating conditions. Other methods useful for selecting the nominal threshold T0 are presented in Section 7.1. A general scheme of a fault detection system using threshold adaptation, in the framework of model based fault diagnosis, is shown in Fig. 7.8. The main idea is to use fuzzy conditional statements operating on fuzzy sets which represent the inputs u and the outputs y_p. The residual r, calculated as the difference between the process output y_p and the model output y, is compared with the adaptive threshold T in the decision logic block. If the value of the residual r is greater than the threshold, then a fault is signalled. An application example of fuzzy threshold adaptation is presented in Section 8.1.3. The adaptation of thresholds can also be interpreted as the adaptation of membership functions of residuals [4]. The idea of the fuzzy threshold is presented in Fig. 7.9. The first maximum of a residual represents a disturbance, while the second one represents a fault. In the classical manner, illustrated in Fig. 7.9(a), the first maximum does not exceed the threshold T, but in the case of a small disturbance a false alarm would appear. Figure 7.9(b) presents fuzzy threshold selection, where the threshold is split up into an interval of a finite width, the so-called fuzzy domain, as presented in Fig. 7.9(c). Now, a small change of the value of the first or the second maximum around T causes only small changes in the false alarm tendency and, consequently, a small change in the decision making process. By the composition of the fuzzy sets {healthy} and {faulty}, the threshold can be fuzzified as depicted in Fig. 7.9(c). If required, a threshold can be represented by more fuzzy sets, e.g. {small}, {medium}, {large}.
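A minimal sketch of the adaptation rule (7.43); the triangular membership functions, the rule increments and the use of the input u alone (rather than both u and y_p) are hypothetical simplifications for illustration:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return np.maximum(0.0, np.minimum((x - a) / (b - a), (c - x) / (c - b)))

def fuzzy_threshold(u, T0, dT_levels=(0.0, 0.05, 0.1)):
    """T(u) = T0 + dT(u), cf. (7.43): the deviation of u from the
    operating point is fuzzified into {small, medium, large} and the
    threshold increment is defuzzified as a weighted average."""
    mu = np.array([tri(u, -0.5, 0.0, 0.5),   # deviation small
                   tri(u, 0.0, 0.5, 1.0),    # deviation medium
                   tri(u, 0.5, 1.0, 1.5)])   # deviation large
    if mu.sum() == 0.0:
        return T0
    dT = float(np.dot(mu, dT_levels) / mu.sum())
    return T0 + dT
```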
In general, the fuzzification of the threshold can be interpreted as the fuzzification of the residual [4].

Fig. 7.7. Illustration of the fuzzy threshold adaptation
Fig. 7.8. Scheme of the fault detection system with the threshold adaptation
Fig. 7.9. Idea of the fuzzy threshold
7.3.3 Model Error Modelling
As was mentioned at the beginning of Section 7.3, two main ideas exist to deal with the uncertainty associated with the model. The first group of approaches, the so-called set-membership identification [204] or bounded error approaches [66], relies on the assumption that the identification error is unknown but bounded. In this framework, robustness is integrated directly into the identification process. A somewhat different approach is to identify the process without robustness deliberations first, and then consider robustness as an additional step. This usually leads to least-squares estimation and prediction error methods. Prediction error approaches are widely used in designing empirical process models for control purposes and fault diagnosis. Great emphasis has been put on providing uncertainty descriptions. In control theory, identification which provides the uncertainty of the model is called control-relevant identification or robust identification [205, 196, 198]. In order to characterize uncertainty in the model, an estimate of a true model is required. To obtain the latter, a model of increasing complexity is designed until it is not falsified (the hypothesis that the model
Fig. 7.10. Model error modelling: error model training (a), confidence region constructing (b)
provides an adequate description of a process is accepted at a selected significance level). Statistical theory is then used to derive uncertainty in the parameters. Model error modelling employs prediction error methods to identify a model from the input-output data [198]. After that, one can estimate the uncertainty of the model by analyzing the residuals evaluated from the inputs. Uncertainty is a measure of unmodelled dynamics, noise and disturbances. The identification of the residuals provides the so-called model error model. In the original algorithm, a nominal model along with uncertainty is constructed in the frequency domain, adding the model error to the nominal model frequency by frequency [198]. Below, an algorithm to form uncertainty bands in the time domain is proposed, intended for use in the fault diagnosis framework [55, 51]. The design procedure is described by the following steps:
1. Using a model of the process, compute the residual r = y − ym, where y and ym are the desired and model outputs, respectively;
2. Collect the data {ui, ri}, i = 1, ..., N, and identify an error model using these data. This model constitutes an estimate of the error due to undermodelling, and it is called the model error model;
3. Derive the centre of the uncertainty region as ym + ye;
4. If the model error model is not falsified by the data, one can use statistical properties to calculate a confidence region. The confidence region forms uncertainty bands around the response of the model error model.
The model error modelling scheme can be carried out by using neural networks of the dynamic type, discussed in Chapter 3. Both the fundamental model of the process and the error model can be realized utilizing such networks. Assuming that the fundamental model of the process has already been constructed, the next step is to design the error model. The training process of the error model is illustrated in Fig. 7.10(a).
In this case, a neural network is used to model an “error” system with the input u and the output r. After training, the response of this model is used to form uncertainty bands as
Fig. 7.11. Idea of model error modelling: system output (solid), centre of the uncertainty region (dotted), confidence bands (dashed)
presented in Fig. 7.10(b), where the centre of the uncertainty region is obtained as the sum of the output of the system model and the output of the error model. Then, the upper band can be calculated as

Tu = ym + ye + tβ v,   (7.45)

and the lower band in the following way:

Tl = ym + ye − tβ v,   (7.46)

where ye is the output of the error model for the input u, tβ is the N(0, 1) tabulated value assigned to the confidence level, e.g. β = 0.05 or β = 0.01, and v is the standard deviation of ye. It should be kept in mind that ye represents not only the residual but also structured uncertainty, disturbances, etc. Therefore, the uncertainty bands (7.45) and (7.46) work well only under the assumption that the signal ye has a normal distribution. The centre of the uncertainty region is the signal ym + ye ≈ y. Now, observing the system output y, one may make a decision whether a fault has occurred or not. If y is inside the uncertainty region, the system is healthy. The idea of model error modelling in the time domain is presented in Fig. 7.11. The output of the system is marked with the solid line. In turn, the sum of the outputs of the model ym and the error model ye is marked with the dotted line; this signal constitutes the centre of the uncertainty region. Using a certain significance level, confidence bands (marked with the dashed lines) are generated around the centre. Thus, the uncertainty region is determined. As long as the process output lies within the uncertainty region, a fault is not signalled.
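The bands (7.45)–(7.46) can be sketched as follows, assuming the fundamental model output ym and the error model output ye have already been computed for the same input u:

```python
import numpy as np

def uncertainty_bands(ym, ye, t_beta=2.33):
    """Uncertainty region of model error modelling, cf. (7.45)-(7.46);
    t_beta is the N(0,1) value for the confidence level (2.33 for
    beta = 0.01)."""
    v = ye.std(ddof=1)                 # standard deviation of ye
    centre = ym + ye                   # centre of the uncertainty region
    return centre - t_beta * v, centre + t_beta * v

def is_healthy(y, Tl, Tu):
    """The system is considered healthy while y stays inside the bands."""
    return (Tl <= y) & (y <= Tu)
```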
The key question is how to find a proper structure of the error model. As discussed in [198], one can start with an a priori chosen flexible structure, e.g. a 10th-order FIR filter. If this error model is not falsified by the data, it is kept. Otherwise, the model complexity should be increased until the model is unfalsified by the data. In Sections 8.2.4 and 8.3.4, neural network based error models are discussed.
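The order-increasing search described above can be sketched as follows. As illustrative assumptions (not the procedure of [198]), a least-squares FIR model stands in for the neural error model, and a simple residual-whiteness check stands in for the falsification test:

```python
import numpy as np

def fit_fir(u, r, order):
    """Least-squares FIR error model: r[k] ~ sum_{j<order} b[j] * u[k-j]."""
    n = len(u)
    U = np.column_stack([u[order - 1 - j:n - j] for j in range(order)])
    b, *_ = np.linalg.lstsq(U, r[order - 1:], rcond=None)
    return b, r[order - 1:] - U @ b          # parameters and fit residual

def whiteness_ok(e, n_lags=10):
    """Crude 'unfalsified' check: sample autocorrelations in the 95% band."""
    e = e - e.mean()
    bound = 1.96 / np.sqrt(len(e))
    denom = np.dot(e, e)
    return all(abs(np.dot(e[:-k], e[k:]) / denom) < bound
               for k in range(1, n_lags + 1))

def select_error_model(u, r, start_order=10, max_order=40, step=5):
    """Start from a flexible FIR structure and grow it until unfalsified."""
    for order in range(start_order, max_order + 1, step):
        b, e = fit_fir(u, r, order)
        if whiteness_ok(e):
            break
    return order, b                          # the largest model if none passed
```
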
7.4 Summary

The purpose of this chapter was to introduce methods responsible for making decisions about possible faults. It was shown that by using artificial neural networks effective residual evaluation can be realized. The first proposed method of residual evaluation used a simple neural network trained to maximise the output entropy in order to approximate the probability density function of a residual. Thus, a more representative threshold value assigned to a given significance level can be obtained. It was shown that such an approach significantly reduces the number of false alarms caused by an inaccurate model of the process. The proposed density shaping approach can easily be extended to more complex network topologies in order to estimate more sophisticated probability density functions. By using two sigmoidal neurons connected in series it is possible to estimate asymmetric probability density functions, and the number of false alarms can be reduced even further. It is worth noting that the self-organizing training used to adjust the network parameters is very simple, and even tens of thousands of training steps take only a few seconds on a standard PC. The second proposed method for residual evaluation was a model error modelling technique realized using neural networks and intended for use in the time domain. By estimating model uncertainty, the robust fault diagnosis system can turn out to be much more sensitive to the occurrence of small faults than standard decision making methods. Moreover, the number of false alarms can be considerably reduced. The open problem here is finding a proper model error model. This issue seems to be much more difficult than finding a fundamental model of the system.
8 Industrial Applications
This chapter presents the application of the artificial neural networks discussed in the previous chapters to the fault diagnosis of industrial processes. Three examples are considered:
1. fault detection and isolation of selected parts of the sugar evaporator,
2. fault detection of selected components of the Fluid Catalytic Cracking (FCC) process,
3. fault detection, isolation and identification of a DC motor.
In all case studies, locally recurrent globally feedforward networks, introduced in Section 3.5.4, are used as models of the industrial processes considered. Other types of neural networks, discussed in Sections 7.2 and 7.3.3, are used in decision making in order to detect faults. The chapter is organized as follows: in Section 8.1, fault detection and isolation of selected parts of the sugar evaporator is presented. In turn, Section 8.2 presents results concerning fault detection of selected components of the FCC process. The last example, fault detection, isolation and identification of the DC motor, is given in Section 8.3.
8.1 Sugar Factory Fault Diagnosis

The problem regarding FDI of the components of the sugar evaporator was widely considered within the EU DAMADICS project [206, 207]. DAMADICS was a research project focused on drawing together wide-ranging techniques of fault diagnosis within the framework of a real application: the on-line diagnosis of a 5-stage evaporation plant of the sugar factory in Lublin, Poland. The sugar factory was a subcontractor providing real process data and the evaluation of trials of fault diagnosis methods. The evaporation station presented below is part of the Lublin Sugar Factory, Poland. In a sugar factory, sucrose juice is extracted by diffusion. This juice is concentrated in a multiple-stage evaporator to produce a syrup. The liquor goes through a series of five stages of vapourisers, and in each passage its sucrose concentration increases. The sugar evaporation control should be performed in
K. Patan: Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes, LNCIS 377, pp. 141–186, 2008. © Springer-Verlag Berlin Heidelberg 2008. springerlink.com
Fig. 8.1. Evaporation station. Four heaters and the first evaporation section
such a way that the energy used is minimised while the required quality of the final product is achieved. The main inconvenient features that complicate the control of the evaporation process are [153, 26]:
• the highly complex evaporation structure (a large number of interacting components),
• large time delays and responses (configuration of the evaporator, number of delays, capacities),
• strong disturbances caused by violent changes in the steam,
• many constraints on several variables.
The filtered and clarified syrup containing ca. 14% of sugar (weak syrup) is directed to the evaporation station to be condensed, up to about 70% of dry substance, with minimum heat-energy consumption. The evaporation station is composed of five sections of vapourisers. The first three sections are of the Roberts type with a bottom heater-chamber, whereas the last two are of the Wiegends type with a top heater-chamber. For waste-heat treatment, two multi-section boilers are used. The heat power supplied to drive the evaporation process is gained from the waste steam of the local power station. This power is used to drive all energy-consuming technological nodes such as preheaters, the evaporation station, strike pans and central heating. The saturated vapour generated during the evaporation process is fed to the successive sections of vapourisers and to the preheaters. The water condensate from the evaporators is directed to the waste-heat boilers as well as to the other devices mentioned above. The waste-heat boilers retrieve heat power by decompressing the condensate in their successive sections. The sugar production process is controlled, monitored and supervised by a decentralized automatic control system of the Supervisory Control And Data Acquisition (SCADA) type. This system makes it possible to record and store set-point, control and process variables. These archives can be used to train neural models off-line.
After that, neural models are applied for on-line fault detection and isolation purposes.
Table 8.1. Specification of process variables

Variable   Description                                                Range
F51_01     Thin juice flow to the input of the evaporation station    0–500 m3/h
F51_02     Steam flow to the input of the evaporation station         0–100 t/h
P51_03     Vapour pressure in the 1st section of the evaporator       0–250 kPa
T51_06     Input steam temperature                                    50–150 °C
T51_07     Vapour temperature in the 1st section of the evaporator    50–150 °C
T51_08     Juice temperature after the 1st section of the evaporator  50–150 °C
TC51_05    Thin juice temperature after heaters                       50–150 °C
8.1.1 Instrumentation Faults
Thanks to technological improvements and a very careful inspection of the plant before starting a 3-month long sugar campaign, faults in sensors, actuators and technology are rather exceptional. Therefore, to check the sensitivity and effectiveness of the fault detection system designed using dynamic neural networks, data with artificial faults in the measuring circuits are used (the achieved results are presented in Section 8.1.3). Sensor faults are simulated by increasing or decreasing the values of particular signals by 5, 10 and 20% at specified time intervals. In Fig. 8.1, the first section of the evaporation station is shown. In the figure, most of the accessible measurable variables are marked, and their specification is given in Table 8.1. Based on observations of the process variables and on the knowledge of the process, the following models can be designed and investigated [208, 26]:
• Vapour pressure in the vapour chamber of the evaporation section (vapour model):
P51_03 = h1(T51_07);   (8.1)
• Juice temperature after the evaporation section (temperature model):
T51_08 = h2(T51_06, TC51_05, F51_01, F51_02),   (8.2)
where h1(·) and h2(·) are the relations between the variables. Suitable process variables are measured by specific sensors at chosen points of the evaporation station. After that, the obtained data are transferred to the monitoring system and stored there.

8.1.2 Actuator Faults
The actuator to be diagnosed is marked in Fig. 8.1 by the dashed square. The block scheme of this device is presented in Fig. 8.2, where measurable process variables are marked with the dashed lines. The actuator considered consists of three main parts: the control valve, the linear pneumatic servo-motor and the positioner [209, 210]. The symbols and process variables are presented in
Table 8.2. Description of symbols

Symbol   Variable    Specification                   Range
V1, V2   –           Hand-driven cut-off valves      –
V3       –           Hand-driven by-pass valve       –
V        –           Control valve                   –
P1       P51_05      Pressure sensor (valve inlet)   0–1000 kPa
P2       P51_06      Pressure sensor (valve outlet)  0–1000 kPa
T1       T51_01      Liquid temperature sensor       50–150 °C
F        F51_01      Process media flowmeter         0–500 m3/h
X        LC51_03X    Piston rod displacement         0–100 %
CV       LC51_03CV   Control signal                  0–100 %
The control valve is typically used to allow, prevent and/or limit the flow of fluids. The state of the control valve is changed by a servo-motor. The pneumatic servo-motor is a compressible fluid-powered device in which the fluid acts on a flexible diaphragm to provide the linear motion of the servo-motor stem. The third part is the positioner, applied to eliminate control valve stem mis-positions produced by external or internal sources such as friction, pressure unbalance, hydrodynamic forces, etc. Structural analysis of the actuator and expert knowledge allow us to define the relations between the variables. The resulting causal graph is presented in Fig. 8.3. Besides the basic measured variables, there are variables that seem realistic to measure:
• positioner supply pressure – PZ,
• pneumatic servo-motor chamber pressure – PS,
• position P controller output – CVI,

Fig. 8.2. Actuator to be diagnosed (a), block scheme of the actuator (b)
and an additional set of unmeasurable physical values useful for structural analysis:
• flow through the control valve – FV,
• flow through the by-pass valve – FV3,
• vena contracta force – FVC,
• by-pass valve opening ratio – X3.
Taking into account the causal graph presented in Fig. 8.3 and the set of measurable variables, the following two relations are considered:
• Servo-motor rod displacement:
X = h3(CV, P1, P2, T1, X);   (8.3)
• Flow through the actuator:
F = h4(X, P1, P2, T1),   (8.4)
where h3(·) and h4(·) are non-linear functions. Fault isolation is only possible if data describing several faulty scenarios are available. Due to safety regulations, it is impossible to generate real faulty data. Therefore, in cooperation with the sugar factory, some faults were simulated by manipulations of the process variables. The monitored data acquired from the SCADA system are, after suitable modification, introduced back into the controlled system. In this way, one can generate artificial faults that are as realistic as possible. For example, the fully opened by-pass valve scenario causes an increased flow through the actuator, and the system responds by throttling the flow in the main pipe. This event can be recognized by observing the CV value. During the experiments, the following faults were considered [26]:
f1 – positioner supply pressure drop,
f2 – unexpected pressure change across the valve,
f3 – fully opened by-pass valve.
Fig. 8.3. Causal graph of the main actuator variables
The first faulty scenario can be caused by many factors, such as pressure supply station faults, oversized system air consumption, air-leading pipe breaks, etc. This is a rapidly developing fault. Physical interpretations of the second fault can be media pump station failures, increased pipe resistance or external media leakage. This fault is rapidly developing as well. The last scenario can be caused by valve corrosion or seat sealing wear. This fault is abrupt.

8.1.3 Experiments
Data preprocessing

Individual process variables are characterized by their amplitudes and ranges. In many cases, these signal parameters differ significantly for different process variables. Such large differences may cause the neural model to be trained inaccurately. Therefore, raw data should be preprocessed. Preprocessing is a sequence of operations converting raw data, such as measurements, into a data representation suitable for processing tasks such as modelling, identification or prediction. In this experiment, the inputs of the models under consideration are normalized according to the formula

x̄i = (xi − mi)/vi,   (8.5)

where mi denotes the mean (expected) value of the i-th input, and vi denotes the standard deviation of the i-th input. The normalization of the inputs guarantees that the i-th input has a zero mean and a unit standard deviation. In turn, the output data should be transformed taking into consideration the response range of the output neurons. For hyperbolic tangent neurons, this range is [−1, 1]. To perform this transformation, linear scaling can be adopted:

ys = 2(y − a)/(b − a) − 1,   (8.6)

where y and ys are the actual and scaled patterns, respectively, and a and b are the minimal and maximal values of the process variables, respectively. To achieve the above transformation, the ranges of the suitable process variables given in Tables 8.1 and 8.2 can be used.

Model selection

The purpose of model selection is to identify a model that best fits a learning data set. Several information criteria can be used to accomplish this task [162]. One of them is the Akaike information criterion (4.24), discussed in detail in Section 4.4. Another well-known criterion is the Final Prediction Error (FPE), which selects the model order minimizing the function defined according to the formula

fFPE = J (1 + K/N)/(1 − K/N),   (8.7)
where J is the sum of squared errors between the desired and network outputs, N is the number of samples used to compute J, and K is the number of model parameters. The term (1 + K/N)/(1 − K/N) grows with K and represents the inaccuracies in estimating the model parameters.

Instrumentation fault detection

In order to evaluate the quality of modelling, a performance index is introduced in the form of the sum of squared errors. Testing and training sets are formed based on data from two different working shifts. It is well known that model selection should never be performed on the same data that are used for the identification of the model itself. When the test data change, the best model generally changes as well; however, under suitable assumptions and with a rich enough data set this issue becomes less critical and should not invalidate the proposed approach. To check the sensitivity and effectiveness of the proposed fault detection system, data with artificial faults in the measuring circuits are employed. The faults are simulated by increasing or decreasing the values of particular signals by 5, 10 and 20% at specified time intervals. In the following, experimental results on the detection of instrumentation faults are reported.

Vapour model

The process to be modelled is described by the formula (8.1). This simple process has one input and one output. The training process was carried out off-line for 30000 iterations using the SPSA algorithm for the dynamic network architecture (3.39). The parameters of the learning algorithm were as follows: A = 100, a = 0.008, c = 0.01, α = 0.602, and γ = 0.101. The learning set consisted of Nu = 800 samples, whilst the testing set included Nt = 3000 samples. The best model was selected by using the information criteria AIC and FPE.
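The preprocessing formulas (8.5)-(8.6) and the FPE criterion (8.7) used for model selection can be collected into a small helper module (a sketch; the function names are ours):

```python
import numpy as np

def normalize_input(x):
    """Input normalization (8.5): zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

def scale_output(y, a, b):
    """Output scaling (8.6): linear map of the range [a, b] onto [-1, 1]."""
    return 2.0 * (y - a) / (b - a) - 1.0

def fpe(J, N, K):
    """Final Prediction Error (8.7) for a model with K parameters."""
    return J * (1.0 + K / N) / (1.0 - K / N)

# e.g. the juice temperature T51_08 has the range 50-150 C (Table 8.1)
t = np.array([60.0, 100.0, 150.0])
assert scale_output(t, 50.0, 150.0).max() == 1.0
```
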
The results of model development are presented in Table 8.3, where N^m_{r,v,s} denotes an m-layer dynamic neural network with r inputs, v hidden neurons and s outputs, K is the number of network parameters, and Ju and Jt are the sums of squared errors between the desired and network outputs calculated for the training and testing sets, respectively. For the training set, the best results were obtained for the model of the N^2_{1,4,1} architecture with the first-order filter. However, for the testing set, one can observe that another network structure shows better performance. In this case, the neural network belongs to the class N^2_{1,5,1}. Each neuron of the dynamic network model contains a first-order IIR filter and has the hyperbolic tangent activation function. This indicates that this network structure has a better generalization ability than the one selected for the training set. Finally, the model corresponding to the minimum criteria in testing was selected as the optimal one. Figure 8.4 presents the residual signal. Our study includes failures of two sensors: a fault in the measurement P51_03 (time steps 900–1200) and a fault in the measurement T51_07 (time steps 1800–2100). In both cases, the faults can be detected immediately and with certainty, e.g. by using a threshold technique according to (7.3) with β = 0.05. In this case, T = 0.05. Taking into account the sensitivity of the proposed fault detection system, we can state that measuring disturbances smaller than 5% can
Table 8.3. Selection of the neural network for the vapour model

Network       Filter   K    Training                     Testing
structure     order         Ju      fFPE    fAIC         Jt     fFPE    fAIC
N^2_{1,2,1}   2        28   0.1080  0.1158  -0.8966      0.639  0.6510  -0.1758
N^2_{1,3,1}   2        38   0.0946  0.1040  -0.9291      0.674  0.6913  -0.1460
N^2_{1,4,1}   1        35   0.0745  0.0813  -1.0403      0.622  0.6366  -0.1828
N^2_{1,5,1}   1        46   0.0832  0.0934  -0.9649      0.607  0.6259  -0.1861
N^2_{1,5,1}   1        46   0.0763  0.0856  -1.0025      0.703  0.7249  -0.1224
N^2_{1,5,1}   2        58   0.1180  0.1364  -0.7831      0.992  1.0311  0.0352
be easily detected. Unfortunately, in Fig. 8.4 we can see that the fault detection system generates a certain number of false alarms. Some of them, like the one at time step 1750, can be caused by disturbances or noise; however, there is also a false alarm (at about time step 200) which was caused by an inaccurate model. One possible solution is to use a more accurate model of the system considered. If it is impossible to find a better model, an adaptive threshold technique may be applied, which is much more robust than the fixed threshold.

Temperature model

The process to be modelled is described by the formula (8.2). This process has one output and four inputs. The training process was carried out off-line for 30000 iterations using the SPSA algorithm for the dynamic network architecture.
Fig. 8.4. Residual signal for the vapour model in different faulty situations: fault in P 51 03 (900–1200), fault in T 51 07 (1800–2100)
Fig. 8.5. Residual signals for the temperature model in different faulty situations: fault in F 51 01 (0–300), fault in F 51 02 (325–605), fault in T 51 06 (1500-1800), fault in T 51 08 (2100–2400), fault in T C51 05 (2450–2750)
The parameters of the learning algorithm were as follows: A = 1000, a = 0.04, c = 0.01, α = 0.602, and γ = 0.101. The learning set consisted of 800 samples. For the training set, the best results were obtained for the model of the N^2_{4,4,1} architecture with the first-order filter. However, for the testing set, one can observe that another network structure shows better performance. In this case, the neural network belongs to the class N^2_{4,3,1}. Each neuron of the dynamic network model contains a first-order IIR filter and has the hyperbolic tangent activation function. Figure 8.5 presents the residuals for simulated failures of different sensors. The following sensor faults were successively introduced at chosen time intervals: a fault in the measurement F51_01 (time steps 0–300), a fault in the measurement F51_02 (time steps 325–605), a fault in the measurement T51_06 (time steps 1500–1800), a fault in the output sensor T51_08 (time steps 2100–2400) and a fault in the measurement TC51_05 (time steps 2450–2750). In this study, the threshold applied was equal to 0.04. As can be seen in Fig. 8.5, the fault detection system is most sensitive to failures of the output sensor T51_08 (time steps 2100–2400). Even sensor failures smaller than 5% can be immediately detected. Somewhat worse are the diagnosis results for the faults of the sensors T51_06 (time steps 1500–1800) and TC51_05 (time steps 2450–2750). In both cases, 5% failures are explicitly and surely detected by the proposed fault detection system. For the sensor TC51_05, however, one can observe that only the -5, -10 and -20% faults are distinctly and surely detected. The +5, +10 and +20% faults are signalled by small spikes only, whose occurrence in the residuals is due to noise effects rather than fault occurrences. The worst results are
obtained for the failures of the sensors F51_01 (time steps 0–300) and F51_02 (time steps 325–605). Only large faults in both sensors are shown by the residuals. This means that the fault detection system is not very sensitive to the occurrence of faults in these two sensors.

Robust instrumentation fault detection

Adaptation of the threshold

Analyzing the residual signal in the fault-free case, one can see that in some time intervals there are large deviations of the residual from zero. Unfortunately, these deviations, caused by disturbances or modelling errors, can generate false alarms during residual evaluation. In order to avoid false alarms, it is necessary to analyze how changes of the inputs and outputs of the process influence deviations of the residual from zero. Such knowledge can be encoded in the form of the adaptive threshold (7.43) by means of fuzzy rules. Two sample fuzzy rules, which take into account the modelling mismatch of the vapour model, are given below [54]:
R1: If {u is zero} and {yp is zero} then {ΔJ is large};
R2: If {u is small positive} and {yp is zero} then {ΔJ is medium}.
The linguistic variables zero, large, small positive and medium are defined by the relevant membership functions. To realize this kind of threshold, the Fuzzy Logic Toolbox for Matlab 5.3 was used. The number of linguistic variables, as well as the shape of the membership functions, was chosen experimentally. The defuzzification process is carried out using the centre of area method. A comparison of the constant and adaptive thresholds, in the case of the temperature model, is shown in Fig. 8.6. The value of the constant threshold shown in Fig. 8.6(a) is set as the sum of the mean value and the standard deviation of the residual according to (7.5) with ζ = 1. In this case, the number of false alarms is quite high.
It is easy to see that the false alarms which occurred in the case of the constant threshold can be avoided with a suitably adjusted adaptive threshold (Fig. 8.6(b)). In the case of the adaptive threshold, the number of false alarms was reduced roughly threefold. Table 8.4 contains the number of false alarms generated using both the constant and the adaptive threshold, for all investigated models. In all cases the number of false alarms is reduced considerably.

Table 8.4. Number of false alarms

Model               Constant threshold   Adaptive threshold
temperature model   399                  132
vapour model        226                  76
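The effect reported in Table 8.4 can be illustrated with a toy sketch: the constant threshold follows (7.5), while the fuzzy adaptation is replaced by a hypothetical rule that simply widens the threshold where the input changes rapidly (the real system uses Mamdani-type fuzzy rules and centre-of-area defuzzification):

```python
import numpy as np

def constant_threshold(r_nominal, zeta=1.0):
    """Constant threshold (7.5): residual mean plus zeta standard deviations."""
    return r_nominal.mean() + zeta * r_nominal.std()

def count_false_alarms(r_nominal, threshold):
    """Alarms raised on fault-free data are, by definition, false alarms."""
    return int(np.sum(r_nominal > threshold))

def adaptive_threshold(r_nominal, u, zeta=1.0, gain=10.0):
    """Hypothetical adaptation: widen the threshold where the input changes
    rapidly, mimicking rules such as R2 above (a stand-in for the fuzzy system)."""
    activity = np.abs(np.gradient(u))
    widening = gain * r_nominal.std() * activity / (activity.max() + 1e-12)
    return constant_threshold(r_nominal, zeta) + widening

# fault-free residual with modelling-error spikes excited by an input step
rng = np.random.default_rng(1)
u = np.concatenate([np.zeros(50), np.ones(50)])
r = 0.01 * rng.standard_normal(100)
r[49:51] += 0.5                      # model mismatch at the step change
n_const = count_false_alarms(r, constant_threshold(r))
n_adapt = count_false_alarms(r, adaptive_threshold(r, u))
assert n_adapt < n_const             # fewer false alarms with adaptation
```
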
Fig. 8.6. Normal operating conditions: residual with the constant (a) and the adaptive (b) threshold
Robust fault detection

Similarly as in the previous sections, to check the sensitivity and effectiveness of the proposed robust fault detection system, data with artificial faults in the measuring circuits were employed. The faults were simulated by increasing or decreasing the values of particular signals by 5, 10 and 20% at specified time intervals. Figure 8.7 presents the absolute value of the residual signal for the vapour model. This study includes failures of two sensors. First, changes of measurements
by +5%, +10%, +20%, -5%, -10% and -20% were introduced in turn in the sensor P51_03 (at time steps 900–1200). After that, similar failures in the sensor F51_04 were studied (at time steps 1800–2100). In both cases, the faults are detected immediately and reliably. Taking into account the sensitivity of the proposed fault detection system, it can be stated that even sensor failures smaller than 5% can be detected easily. Moreover, using the adaptive threshold technique, the fault detection system can avoid a certain number of false alarms. Taking these experimental results into account, one can conclude that the proposed robust fault detection system is very sensitive to the occurrence of faults. Using the adaptive threshold technique, it is possible to considerably reduce the number of false alarms caused by modelling errors. However, problems of the selection of fuzzy model components, such as the number of linguistic variables, the shape of the membership functions or the generation of rules, are still open.
Fig. 8.7. Residual for different faulty situations
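The sensor-fault injection used throughout these experiments (increasing or decreasing a signal by 5, 10 or 20% over a chosen interval) can be sketched as follows; the data below are hypothetical, since the actual experiments used archived process data:

```python
import numpy as np

def inject_sensor_fault(signal, start, stop, change):
    """Simulate a sensor fault: scale the signal by (1 + change) on [start, stop)."""
    faulty = signal.copy()
    faulty[start:stop] *= 1.0 + change
    return faulty

# e.g. a +10% fault in a pressure measurement between time steps 900 and 1200
p = 100.0 + np.random.default_rng(0).standard_normal(3000)
p_faulty = inject_sensor_fault(p, 900, 1200, change=0.10)
assert np.all(p_faulty[900:1200] > p[900:1200])
```
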
Actuator fault detection and isolation

In the proposed FDI system, four classes of the system behaviour – the normal operating condition f0 and three faults f1–f3 – are modelled by a bank of dynamic neural networks, according to the scheme presented in Fig. 2.4. To identify the models (8.3) and (8.4), dynamic neural networks with four inputs and two outputs were applied:

[X, F]^T = NN(P1, P2, T1, CV),   (8.8)

where NN is the neural model. Multi-Input Multi-Output (MIMO) models are considered for two reasons. The first one is the computational
Table 8.5. Neural models for nominal conditions and faulty scenarios

Faulty scenario   Structure     Filter order   Activation function
f0                N^2_{4,5,2}   2              hyperbolic tangent
f1                N^2_{4,7,2}   1              hyperbolic tangent
f2                N^2_{4,7,2}   1              hyperbolic tangent
f3                N^2_{4,5,2}   1              hyperbolic tangent
effort during identification. Each neural model considered can be represented by two Multi-Input Single-Output (MISO) models, for which the training process can be easier to perform. Unfortunately, when some number of faults is considered, let us say 10, it is required to design 20 MISO models instead of 10 of the MIMO type. The training and testing sets are different and formed similarly as in the case of instrumentation fault detection (see the previous paragraphs). All models were trained and selected as in the previous two examples, but details concerning the model selection are not given here. The final neural model specification is presented in Table 8.5. The selected neural networks have a relatively small structure. Only two processing layers with 5 or 7 hidden elements are enough to identify faults with high accuracy. Moreover, the dynamic neurons have hyperbolic tangent activation functions and first-order IIR filters. Each neural model was trained using suitable faulty data. Subsequently, the performance of the constructed models was examined using both nominal and faulty data. Both fault detection and isolation are performed using the thresholding described by (7.3), assuming the significance level β = 0.05 (5%). The experimental results are reported in the forthcoming sections.

Fault detection

Fault detection is performed using the model representing nominal operating conditions. The threshold corresponding to the output F was found to be Tf = 0.0256, and the threshold for the output X to be Tx = 0.0322. The residuals for this model should not be greater than the thresholds when the actuator is healthy, and should exceed the thresholds in faulty cases. Figures 8.8 and 8.9 show the behaviour of both nominal model residuals in various faulty scenarios. For clarity of presentation, the threshold levels are not shown there. The thick line represents the time instant when a fault occurred.
It is clearly seen that all the faults are reliably and surely detected. At this stage, of course, it is impossible to find out what exactly happened. In order to localize a fault, it is necessary to perform fault isolation.

Fault isolation

Fault f1. The first fault was simulated at time step 270 and lasted about 275 time steps. In order to isolate this fault, the residuals generated by the fault model f1 should be near zero, while the other fault models should generate residuals different from zero. Figure 8.10 shows the residuals for all fault models. One can observe that this
Fig. 8.8. Residual of the nominal model (output F) in the case of the faults f1 (a), f2 (b) and f3 (c)
fault is isolated, because only the residual in Fig. 8.10(a) is near zero during the occurrence of f1. Simultaneously, however, the residual in Fig. 8.10(a) is near zero under nominal operating conditions as well (while it should be different from zero). This means that the related model generates a large number of false alarms.

Fault f2. The next faulty scenario was simulated from time step 775 (pressure off) till time step 1395 (pressure on). Figure 8.11 presents the residuals of the fault models for this case. Using the residuals obtained from the output F of the models, one can conclude that this fault is not isolable, because two residuals (Figs. 8.11(b) and 8.11(c)) tend to zero. Fortunately, there is a chance to isolate this fault using the output X of the models. Only the fault model f2 generates a residual near zero (Fig. 8.11(e)). In this case, however, the residual is strongly oscillating, which can result in quite a large number of false alarms.

Fault f3. The third fault was simulated from time step 860 (valve opening) till time step 1860 (valve closing). In this case, the fault is reliably isolated
Fig. 8.9. Residual of the nominal model (output X) in the case of the faults f1 (a), f2 (b) and f3 (c)
using the outputs X of the neural models. Only one residual (Fig. 8.12(f)) is near zero when the fault occurs. Similarly as in the previous study, the fault is not isolable using the outputs F of the models. Two of the residuals tend to zero: the residual for the fault f1 (Fig. 8.12(a)) and the one for the fault f3 (Fig. 8.12(c)). Decision making is performed using the thresholding technique described by (7.3) with β = 0.05. The threshold values can be found in the paragraph on fault detection for the nominal model and in Table 8.8 for the fault models. The results of fault detection are presented in Table 8.6(a). One can see there that all faults are detected by the proposed system using either the nominal model of the flow F or the nominal model of the rod displacement X. In turn, fault isolation results are shown in Table 8.6(b). The main result here is the fact that the fault f1 cannot be isolated by the neural models. The second result shows that the faults f2 and f3 can be isolated using only the fault models of the rod displacement. Fault models of the flow cannot locate a fault which occurred in the system.
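The bank-of-models decision logic described here can be sketched as follows; the nominal thresholds Tf = 0.0256 and Tx = 0.0322 come from the text, while the fault-model thresholds below are placeholders (the actual values, cf. Table 8.8, lie outside this excerpt):

```python
import numpy as np

T_F, T_X = 0.0256, 0.0322          # nominal-model thresholds (from the text)

def detect(r_f, r_x):
    """Signal a fault if either nominal residual exceeds its threshold (7.3)."""
    return bool(np.any(np.abs(r_f) > T_F) or np.any(np.abs(r_x) > T_X))

def isolate(residuals, thresholds):
    """Bank-of-models isolation: the isolated fault is the single fault model
    whose residual stays inside its threshold over the fault window."""
    inside = [np.all(np.abs(r) <= t) for r, t in zip(residuals, thresholds)]
    return inside.index(True) if inside.count(True) == 1 else None

# hypothetical residual windows of the fault models f1, f2, f3 during f2;
# the placeholder thresholds play the role of the Table 8.8 values
thresholds = [0.03, 0.03, 0.03]
residuals = [np.full(50, 0.20), np.full(50, 0.01), np.full(50, 0.15)]
assert detect(residuals[0], residuals[2])     # nominal residuals exceed Tf, Tx
assert isolate(residuals, thresholds) == 1    # fault f2 (index 1) isolated
```
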
8 Industrial Applications
Fig. 8.10. Residuals of the fault models f1 (a) and (d); f2 (b) and (e); f3 (c) and (f) in the case of the fault f1
Fig. 8.11. Residuals of the fault models f1 (a) and (d); f2 (b) and (e); f3 (c) and (f) in the case of the fault f2
Fig. 8.12. Residuals of the fault models f1 (a) and (d); f2 (b) and (e); f3 (c) and (f) in the case of the fault f3
Table 8.6. Results of fault detection (a) and isolation (b) (X – detectable/isolable, N – not detectable/not isolable)

(a) Fault detection results

Faulty scenario          f1  f2  f3
flow model               X   X   X
rod displacement model   X   X   X

(b) Fault isolation results

Faulty scenario          f1  f2  f3
flow model               N   N   N
rod displacement model   N   X   X
Qualitative analysis of the considered fault detection and isolation system is presented below, together with a comparison with alternative approaches.

Comparative study

To check the efficiency of the proposed fault detection and isolation system, the Locally Recurrent (LR) network trained with SPSA is compared with alternative approaches such as Auto-Regressive with eXogenous inputs (ARX) models [162] and Neural Network Auto-Regressive with eXogenous inputs (NNARX) models [19]. The simulations are performed on the processes (8.3) and (8.4). All structures of the models used are selected experimentally. To compare the achieved results, the following performance indices are used:
• modelling quality in the form of a sum of squared errors between the desired and the actual response of the model, calculated on a testing set,
• detection time,
• false detection rate (2.25),
• isolation time,
• false isolation rate (2.27).

The first comparative study shows the modelling quality achieved by the examined methods. The modelling quality represents the prediction capabilities of the models and is calculated as a sum of squared errors over the testing set. The achieved results are shown in Table 8.7. As one can see, the worst result was obtained for the ARX models. The actuator is described by non-linear relations and the classical ARX models cannot handle its behaviour in a proper way. Comparing dynamic networks with non-linear autoregressive models, one can see that better results are achieved in the case of the LR network (5 of 8 models have a better quality than the NNARX models) but, generally speaking, the results are comparable.

Table 8.7. Modelling quality for different models

               f0             f1              f2              f3
Method         F      X       F      X        F       X       F      X
LR             0.73   0.46    0.02   0.91     0.098   0.139   2.32   12.27
ARX            2.52   5.38    4.93   14.39    11.92   16.96   19.9   4.91
NNARX          0.43   0.71    0.089  0.1551   0.6     2.17    0.277  22.5

The second study aims at presenting the FDI capabilities of the examined methods. At this stage, only the LR and NNARX models are taken into account. As can be seen in Table 8.8, all faults were detected and isolated by both approaches using the corresponding thresholds Tf and Tx, whose values are given in the last two rows of the table. Analysing the results, one can state that the detection and isolation times are almost the same in both cases. Slightly better results are observed for the NNARX model in the case of the fault f3. On the other hand, in most cases the number of false alarms is smaller for the LR model. The values of the indices rfd and rid should be equal to zero in an ideal case (no false alarms). In the case of fault detection using the NNARX models for the faults f2 and f3, the number of false alarms is pretty high, 42% and 45%, respectively. An interesting result can be observed in the case of the fault f3 for the NNARX approach. The detection time is relatively short (37 time instants), but the number of false alarms is very high. At the same time, for the LR approach one can see that the detection time is longer but the number of false alarms is smaller. This phenomenon is directly caused by the thresholding level. It clearly shows that the choice of the threshold is a compromise between fault decision sensitivity and the false alarm rate.

Table 8.8. FDI properties of the examined approaches

        LR                        NNARX
Index   f1      f2      f3       f1      f2      f3
td      4       5       81       10      3       37
ti      1       7       92       1       5       90
rfd     0.34    0.26    0.186    0.357   0.42    0.45
rid     0.08    0.098   0.091    0.145   0.0065  0.097
Tf      0.0164  0.0191  0.0468   0.0245  0.0541  0.0215
Tx      0.0936  0.0261  0.12     0.0422  0.0851  0.2766
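The detection indices above can be computed directly from the binary alarm signal. This is a sketch under the simplifying assumption that the false detection rate (2.25) reduces to the fraction of fault-free samples flagged, and the detection time to the delay of the first alarm after the fault start; the residual and threshold are synthetic:

```python
import numpy as np

def detection_time(alarm, fault_start):
    """Number of time steps between the fault start and the first alarm."""
    hits = np.flatnonzero(alarm[fault_start:])
    return int(hits[0]) if hits.size else None

def false_detection_rate(alarm, fault_start):
    """Fraction of fault-free (pre-fault) samples that were wrongly flagged."""
    pre = alarm[:fault_start]
    return float(pre.sum()) / len(pre)

rng = np.random.default_rng(1)
residual = np.concatenate([rng.normal(0.0, 0.05, 100),   # fault-free part
                           0.5 * np.ones(50)])           # fault from step 100
alarm = np.abs(residual) > 0.3                           # constant threshold
td = detection_time(alarm, 100)         # detected immediately here
rfd = false_detection_rate(alarm, 100)
```

Raising the threshold lowers rfd but lengthens td, which is exactly the compromise observed for the NNARX models above.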
Although the dynamic networks do not show a clear superiority over the NNARX approach, it should be mentioned that neural networks with delays suffer from the need to select a proper input lag space and from a large input space, which can make the learning process very difficult.

8.1.4 Final Remarks
In this section, a few experiments showed the feasibility of applying artificial neural networks composed of dynamic neurons to fault diagnosis. In such systems, an artificial neural network is used as the model of the process under
consideration. The effectiveness of the proposed solution was investigated on the basis of two different groups of experiments. The first group concerns the application of a dynamic neural network to the detection of sensor faults. The second group of experiments illustrates how to apply neural networks to fault detection and isolation. To this end, a bank of neural models, including the model for normal operating conditions as well as models for all identified faults, should be used. All faulty situations can then be identified and localized in order to perform relevant preventive operations. In practice, however, it is very difficult to obtain real data on faulty situations. From the reported results, we can conclude that by using artificial neural networks composed of dynamic neurons one can design an effective fault diagnosis system. The experimental results clearly show that dynamic networks perform quite well in comparison with other approaches. An important fact here is that all the experiments were carried out using real process data recorded at the Lublin Sugar Factory in Poland. A limitation of model based fault diagnosis is that only known working conditions can be reliably isolated. Unknown faults can therefore be detected, but isolated only as a group of faults labelled “unknown”. The problem of detecting and isolating multiple faults is very difficult, and, also because of the lack of data, its treatment is out of the scope of this section.
8.2 Fluid Catalytic Cracking Fault Detection

Fluid catalytic cracking converts heavy oil into lighter, more valuable fuel products and petrochemical feedstocks. A general scheme of the catalytic cracking process is presented in Fig. 8.13 [211, 52]. It consists of three main subsystems: a reactor, a riser and a regenerator. A finely sized solid catalyst continuously circulates in a closed loop between the reactor and the regenerator. The reactor provides proper feed contacting time and temperature to achieve the desired level of conversion and to disengage products from the spent catalyst. The regenerator restores the catalytic activity of the coke-laden spent catalyst by combustion with air. It also provides the heat of reaction and the heat of feed vaporization by returning the hot, freshly regenerated catalyst back to the reaction system. The hot regenerated catalyst flows to the base of the riser, where it is contacted with heavier feed. The vaporized feed and the catalyst travel up the riser, where vapour phase catalytic reactions occur. The reacted vapour is rapidly disengaged from the spent catalyst in direct-coupled riser cyclones, and it is directly routed to product fractionation in order to discourage further thermal and catalytic cracking. In the product recovery system, reactor vapours are quenched and fractionated, yielding dry gas, liquid petroleum gas, naphtha, and middle distillate products. The whole catalytic cracking process was implemented in Simulink as an FCC benchmark according to the mathematical description presented in [211]. The manipulated variables of crucial importance are the flowrate of the regenerated
Fig. 8.13. General scheme of the fluid catalytic cracking converter
catalyst to the riser and the flowrate of combustion air to the regenerator beds. The available measurement variables are presented in Table 8.9. Taking into account expert knowledge about the technological process, one can design the following relations between the variables:

• Temperature of the cracking mixture:
  Trx = h1(Trg2, Tfp, Trx);   (8.9)

• Temperature of the dense phase at the regenerator first stage:
  Trg1 = h2(Trg1, Tar, Rar);   (8.10)

• Temperature of the dense phase at the regenerator second stage:
  Trg2 = h3(Trg1, Tar, Rar);   (8.11)

• Temperature of the regenerator first stage dilute phase:
  Td1 = h4(Trg1);   (8.12)

• Temperature of the general dilute phase:
  Tdg = h5(Td1).   (8.13)
In order to design a fault diagnosis system for the FCC process, a model based scheme is applied. The residual generation block is realized using locally recurrent networks, described in detail in Section 3.5.4. In turn, residual evaluation is carried out by using statistical analysis, discussed in Section 7.2.2, and MEM, presented in Section 7.3.3. The complete fault diagnosis system is evaluated using several faulty scenarios.
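A neuron of such a locally recurrent network passes its weighted input through a low-order IIR filter before the activation function, so the neuron itself carries dynamics. A minimal sketch follows; the weights and filter coefficients are illustrative, not those of the networks of Section 3.5.4:

```python
import numpy as np

class IIRNeuron:
    """Neuron with an internal second order IIR filter: the weighted input is
    filtered, then squashed by tanh. The filter states persist between calls,
    which gives the neuron its own dynamics."""
    def __init__(self, w, b_coef, a_coef):
        self.w = np.asarray(w, dtype=float)       # input weights
        self.b = np.asarray(b_coef, dtype=float)  # numerator [b0, b1, b2]
        self.a = np.asarray(a_coef, dtype=float)  # denominator [a1, a2]
        self.x_hist = np.zeros(3)                 # current + two past inputs
        self.y_hist = np.zeros(2)                 # two past filter outputs

    def step(self, u):
        x = float(self.w @ np.atleast_1d(u))
        self.x_hist = np.roll(self.x_hist, 1)
        self.x_hist[0] = x
        y = self.b @ self.x_hist - self.a @ self.y_hist
        self.y_hist = np.roll(self.y_hist, 1)
        self.y_hist[0] = y
        return np.tanh(y)

neuron = IIRNeuron(w=[0.5], b_coef=[0.2, 0.1, 0.05], a_coef=[-0.6, 0.1])
outputs = [neuron.step(1.0) for _ in range(20)]   # response to a constant input
```

For a constant input the filter settles at its DC gain (here 0.7) times the weighted input, so the neuron output converges to tanh(0.35).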
Table 8.9. Specification of measurable process variables

Variable  Description
Rar       air flowrate to the regenerator [ton/h]
Tar       air temperature [°C]
Tfp       feed temperature at the riser entrance [°C]
Trx       temperature of the cracking mixture in the riser [°C]
Trg1      temperature of the dense phase at the regenerator first stage [°C]
Trg2      temperature of the dense phase at the regenerator second stage [°C]
Td1       temperature of the regenerator first stage dilute phase [°C]
Tdg       temperature of the general dilute phase [°C]
8.2.1 Process Modelling
A locally recurrent neural network is used to describe the process under normal operating conditions. First, the network has to be trained for this task. The training data were collected from the FCC benchmark. The network was trained using the ARS algorithm to mimic the behaviour of the temperature of the cracking mixture (8.9). The neural model (3.39) has three inputs, Trg2(k), Tfp(k) and Trx(k), one output, Trx(k + 1), and consists of three hidden neurons, each with the hyperbolic tangent activation function and a second order IIR filter. The structure of the network was selected using the “trial and error” procedure. The model with the smallest value of the criterion in the form of a sum of squared errors calculated on the testing set is selected as the best one. The training was carried out off-line for 50 steps using 1000 samples. The sum of squared errors calculated over 7000 testing samples is equal to 169.31. The modelling results for 100 testing samples are shown in Fig. 8.14, where the model output is marked by the dotted line and the output of the process by the solid line. In turn, the residual signal is presented in Fig. 8.15. Although the model mimics the behaviour of the process quite well, at some time instances there are large differences between the process and model outputs.

Fig. 8.14. Results of modelling the temperature of the cracking mixture (8.9)

Fig. 8.15. Residual signal

8.2.2 Faulty Scenarios
The FCC benchmark makes it possible to simulate a number of faulty scenarios. During the experiments, the following scenarios were examined [212, 213, 46]:
1. scenario f1 – 10% increase in the catalyst density,
2. scenario f2 – 15% decrease in the weir constant of the first and second stages,
3. scenario f3 – 10% decrease in the CO2/CO ratio constant.
These faulty scenarios were implemented in Simulink/Matlab as an additional component of the FCC benchmark mentioned above.

8.2.3 Fault Diagnosis
Normality testing

The comparison between the cumulative distribution function of the normal distribution, F(x) (solid line), and that of the residual, Fr(x) (dashed line), is presented in Fig. 8.16. As one can observe, the normality assumption does not seem to be valid in this case. The cumulative distribution function of the residual shows that its probability density function is slightly asymmetric. In turn, the probability plot for the residual is presented in Fig. 8.17. This plot clearly shows that there are large deviations from the normal distribution at the edges of the probability density function. If it is assumed that the residual has a normal distribution and a threshold assigned to a given confidence level is applied, then a significant mistake can be made. Some faulty events could be hidden by a wrongly selected threshold.

Fig. 8.16. Cumulative distribution functions: normal – solid, residual – dashed

Fig. 8.17. Probability plot for the residual

Density shaping

Using a neural model of the process, a residual signal is generated. This signal is used to train another neural network to approximate the probability density function of the residual. Two cases are considered here [46]:
Case 1. Estimate of the residual probability density function (7.20). In this case, the neural network (7.17) is trained on-line for 90000 steps using unsupervised learning (update rules (7.18) and (7.19)), described in Section 7.2.2. The final network parameters are w = −14.539 and b = −1.297. The residual histogram and the estimated probability density function are presented in Figs. 8.18(a) and (b), respectively. In this case, the estimated distribution function is symmetric; the cut-off values determined for the significance level β = 0.05 are xl = −0.34 and xr = 0.163, and the threshold is equal to T = 0.354 (Fig. 8.18(c)).

Case 2. Estimate of the residual probability density function (7.28). In this case, the neural network (7.21) is trained on-line for 90000 steps using unsupervised learning (update rules (7.24)–(7.27)), described in Section 7.2.2. The final network parameters are w1 = −5.27, w2 = −11.475, b1 = −0.481 and b2 = 5.626. The residual histogram and the estimated probability density function are presented in Figs. 8.19(a) and (b), respectively. In this case, the estimated distribution function has a wider shape than in the previous case (Fig. 8.19(c)). The cut-off values determined for the significance level β = 0.05 are xl = −0.358 and xr = 0.19, with the threshold T = 0.25.
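The single-neuron case can be illustrated with a standalone sketch: a sigmoidal unit trained with InfoMax-style entropy-maximisation updates (the Bell–Sejnowski rule, standing in for the update rules (7.18)–(7.19); the data, learning rate and resulting parameters below are synthetic, not those reported above). The derivative |dy/dx| of the trained neuron approximates the residual density, and the cut-offs for a significance level β follow from the fitted CDF:

```python
import numpy as np

rng = np.random.default_rng(0)
residual = rng.normal(0.0, 0.1, 30000)       # synthetic fault-free residual

w, b, eta = 1.0, 0.0, 0.005
for x in residual:
    y = 1.0 / (1.0 + np.exp(-(w * x + b)))   # sigmoidal neuron output
    # entropy-maximisation updates push the output histogram towards
    # uniformity, so |dy/dx| tracks the input density
    w += eta * (1.0 / w + x * (1.0 - 2.0 * y))
    b += eta * (1.0 - 2.0 * y)

def pdf_estimate(x):
    y = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return abs(w) * y * (1.0 - y)            # p(x) ~ |dy/dx|

beta = 0.05                                  # significance level
logit = lambda p: np.log(p / (1.0 - p))
xl = (logit(beta / 2) - b) / w               # left cut-off
xr = (logit(1.0 - beta / 2) - b) / w         # right cut-off
```

After training, the estimated density is peaked near zero and the cut-offs bracket the origin, mirroring the behaviour of the network (7.17).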
(b)
2000
450
1800
400
1600
350
1400
300
1200 250
1000 200
800
150
600 400
100
200
50
0 −1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0 0
0.6
0.2
0.4
Data
0.6
0.8
1
Data
(c) 4 3.5 3 2.5 2 1.5 1 0.5 0 −1
−0.8
−0.6
−0.4
−0.2
xl Data
0
0.2 xr
0.4
0.6
Fig. 8.18. Residual histogram (a), network output histogram (b), estimated PDF and the confidence interval (c)
Fig. 8.19. Residual histogram (a), network output histogram (b), estimated PDF and the confidence interval (c)
Sensitivity

To perform a decision, the significance level β = 0.05 is assumed. The sensitivity of the proposed fault diagnosis system in the fault-free case is checked using the so-called false detection rate (2.25). If it is assumed that the residual has the normal distribution, the upper threshold is equal to Tu = 0.1068, the lower one to Tl = −0.2975, and rfd = 0.0732. The proposed density shaping using the single neuron (7.17) gives the threshold T = 0.354 and rfd = 0.034, i.e. more than two times fewer false alarms than in the case of the normality assumption. Even better results are obtained for density shaping using the more complex neural network (7.21). In this case, the threshold is T = 0.25 and the false detection rate rfd = 0.0259. The generalization ability of both networks is pretty good: by assuming the significance level β = 0.05, statistically 5% of the samples are allowed to pass the threshold, while the achieved result for the network (7.17) is 3.4% and for the network (7.21) it is 2.59%. The next experiment shows the relationship between the assumed significance level and the false detection rates obtained using, respectively, normal distribution statistics and density shaping with a single neuron. The results are presented in Table 8.10, where rfdN is the false detection rate calculated assuming the normal distribution of the residual, and rfdD is the false detection rate of the density shaping method.

Table 8.10. Comparison of false detection rates

β      rfdN    rfdD    Ratio rfdN/rfdD
0.05   0.0732  0.034   2.153
0.01   0.022   0.0083  2.65
0.001  0.007   0.0011  6.36

These results clearly indicate the advantages of the density shaping method. As one can observe, for smaller values of the significance level the disproportion between the false detection rates, represented by the ratio rfdN/rfdD, increases. This result confirms the analysis based on the probability plot (Fig. 8.17). If normal distribution statistics are used for decision making, then significant mistakes are made, especially for small values of the significance level, e.g. β = 0.001.

Fault detection

The results of fault detection are presented in Table 8.11. In each case the true detection rate (2.26) is close to one, which means that the faults are detected reliably. In order to make a decision about the faults and to determine the detection time tdt, a time window of length n = 10 was used: a fault is signalled only if the residual exceeds the threshold during n consecutive time steps. The application of the time window prevents a temporary threshold crossing from signalling a fault (see Fig. 2.11). The detection time indices are shown in the last column of Table 8.11. The fault f2 is detected relatively fast. More time is needed to detect the faults f1 and f3. All faulty scenarios can be classified as abrupt faults. It is observed that the fault f3 develops more slowly than the faults f1 and f2, so the fault diagnosis system needs more time to make a proper decision.

Table 8.11. Performance indices for faulty scenarios
Faulty scenario  start-up time  fault time horizon  rtd     tdt
f1               7890           9000                0.9315  90
f2               7890           9000                0.9883  14
f3               7890           9000                0.8685  151

8.2.4 Robust Fault Diagnosis
Confidence Bands

In this experiment, decision making is carried out using uncertainty bounds obtained by means of model error modelling, discussed in Section 7.3.3. The error model was designed using an NNARX type neural network [55]. Many neural architectures were examined by the “trial and error” method. The best performing two-layer network consists of four hidden neurons with hyperbolic tangent activation functions and one linear output element. The numbers of the input delays na and the output delays nb are equal to 5 and 15, respectively. The conclusion is that a high order model is required to capture the residual dynamics. The output of the error model along with the residual is shown in Fig. 8.20. To determine the confidence bands, the 95% confidence level was assumed (β = 0.05). The uncertainty region (dashed lines) along with the output of the healthy system (solid line) is shown in Fig. 8.21. The false detection rate in this case is rfd = 0.0472. When there are rapid changes of the output signal with a large amplitude, the uncertainty region is relatively narrow. This situation is depicted in Fig. 8.21 at the 40th time step, when the output signal exceeds the uncertainty region. For comparison, let us analyze the simple thresholding calculated using (7.3), depicted in Fig. 8.22. The false detection rate in this case is rfd = 0.0734, a result more than 1.5 times worse than that of the adaptive technique based on MEM.

Fig. 8.20. Residual (solid) and the error model output (dashed) under nominal operating conditions

Fig. 8.21. Confidence bands and the system output under nominal operating conditions

Fig. 8.22. Residual with constant thresholds under nominal operating conditions

Table 8.12. Performance indices for faulty scenarios

Faulty scenario  start-up time  fault time horizon  rtd     tdt
f1               7890           9000                0.9613  40
f2               7890           9000                0.9919  24
f3               7890           9000                0.9207  80
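The band-checking step of the MEM scheme can be sketched as follows: the band is centred on the nominal model output corrected by the error model, and a fault is flagged when the measured output leaves the band. All signals below are synthetic stand-ins, and the band half-width T would in practice come from the assumed confidence level:

```python
import numpy as np

def mem_alarm(y, y_nominal, r_error_model, T):
    """Flag samples where the measured output leaves the uncertainty band
    centred on the nominal prediction corrected by the error model output."""
    centre = y_nominal + r_error_model
    return (y > centre + T) | (y < centre - T)

t = np.arange(500)
y_nominal = np.sin(0.05 * t)              # nominal model prediction
bias = 0.1 * np.sin(0.005 * t)            # slow residual bias tracked by MEM
y = y_nominal + bias                      # healthy measured output
y[300:] += 0.6                            # additive fault from step 300
alarm = mem_alarm(y, y_nominal, bias, T=0.25)
```

Because the band follows the error-model output rather than a constant level, the slow bias does not trigger false alarms, while the additive fault does.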
Fault Detection

The results of fault detection are presented in Fig. 8.23 and Table 8.12. In Fig. 8.23, the uncertainty bands are marked by the dashed lines and the system output by the solid one. The achieved results confirm that the robust fault diagnosis technique based on MEM is more sensitive to the occurrence of faults than decision making algorithms based on constant thresholds (compare the true detection rates in Tables 8.11 and 8.12). For the faults f1 and f3, the detection time is also shorter. Only in the case of the scenario f2 did MEM require more time to make a proper decision, in spite of the fact that this fault was detected faster than the other two.

Fig. 8.23. Fault detection results: scenario f1 (a), scenario f2 (b), scenario f3 (c)

8.2.5 Final Remarks
In this section it was shown that, by using artificial neural networks, a model based fault detection system can be designed for chosen parts of the FCC technological process. The experiments show that the locally recurrent network can model a complex technological process with pretty good accuracy. In turn, a simple neural network trained to maximise the output entropy can approximate the probability density function of a residual, and in this way a more representative threshold value can be obtained for a given significance level. It was shown that such an approach significantly reduces the number of false alarms caused by an inaccurate model of the process. The proposed density shaping approach can easily be extended to more complex network topologies in order to estimate more sophisticated probability distribution functions. By using two sigmoidal neurons connected in series, it is possible to estimate asymmetric probability density functions, and the number of false alarms can be reduced even further. It is worth noting that the self-organizing training used to adjust the network parameters is very simple, and even tens of thousands of training steps take only a few seconds on a standard PC. Better fault detection results were obtained for a robust fault detection system based on a neural network realization of model error modelling. In the framework of MEM, the locally recurrent network was used to model the process under normal operating conditions, and then the NNARX model was used to identify the error model (residual). The experiments show that the proposed method gives promising results. An open problem here is finding a proper error model. This problem seems to be much more difficult than finding a fundamental model of the system.
8.3 DC Motor Fault Diagnosis

Electrical motors play a very important role in the safe and efficient operation of modern industrial plants and processes [214]. Early diagnosis of abnormal and faulty states makes it possible to perform important preventive actions, and it allows one to avoid heavy economic losses caused by stopped production and the replacement of elements or parts [10]. To keep an electrical machine in the best condition, several techniques such as fault monitoring or diagnosis should be implemented. Conventional DC motors are very popular, because they are reasonably cheap and easy to control. Unfortunately, their main drawback is the mechanical collector, which has only a limited life span. In addition, brush sparking can destroy the rotor coil, generate electromagnetic compatibility problems and reduce insulation resistance to an unacceptable limit [215]. Moreover, in many cases, electrical motors operate in closed-loop control, and small faults often remain hidden by the control loop. It is only when the whole device fails that
the failure becomes visible. Therefore, there is a need to detect and isolate faults as early as possible. Recently, a great deal of attention has been paid to electrical motor fault diagnosis [215, 216, 217, 78]. In general, the elaborated solutions can be split into three categories: signal analysis methods, knowledge based methods and model based approaches [216, 7]. Methods based on signal analysis include vibration analysis, current analysis, etc. The main advantage of these approaches is that accurate modelling of the motor is avoided. However, these methods use only the output signals of the motor, so the influence of the input on the output is not considered. Moreover, frequency analysis is time consuming, and thus not suitable for on-line fault diagnosis. In the case of vibration analysis, there are serious problems with the noise produced by the environment and with the coupling of the sensors to the motor [216]. Knowledge based approaches are generally founded on expert or qualitative reasoning [110]. Several knowledge based fault diagnosis approaches have been proposed. These include rule based approaches, where diagnostic rules can be formulated from the process structure and unit functions, and qualitative simulation based approaches. The trouble with such models is that accumulating experience and expressing it as knowledge rules is difficult and time consuming; therefore, the development of a knowledge based diagnosis system demands considerable effort. Model based approaches include parameter estimation, state estimation, etc. These methods can be effectively used for on-line diagnosis, but their disadvantage is that an accurate model of the motor is required [7]. An alternative solution can be obtained through artificial intelligence, e.g. neural networks. The self-learning ability and the capability of modelling non-linear systems allow one to employ neural networks to model complex, unknown and non-linear dynamic processes [4, 218].

8.3.1 AMIRA DR300 Laboratory System
In this section, a detailed description of the AMIRA DR300 laboratory system is presented. The laboratory system, shown in Fig. 8.24, is used to control the rotational speed of a DC motor with a changing load. The laboratory object considered consists of five main elements: a DC motor M1, a DC motor M2, two digital incremental encoders and a clutch K. The input signal of the engine M1 is an armature current and the output one is the angular velocity. The available sensors for the output are an analog tachometer, an optical sensor which generates impulses corresponding to the rotations of the engine, and a digital incremental encoder. The shaft of the motor M1 is connected with the identical motor M2 by the clutch K. The second motor M2 operates in the generator mode and its input signal is an armature current. The available measurements of the plant are as follows:
• motor current Im – the motor current of the DC motor M1,
• generator current Ig – the motor current of the DC motor M2,
• tachometer signal T;
Fig. 8.24. Laboratory system with a DC motor
and control signals:
• motor control signal Cm – the input of the motor M1,
• generator control signal Cg – the input of the motor M2.

The technical data of the laboratory system are shown in Table 8.13. The separately excited DC motor is governed by two differential equations. The classical description of the electrical subsystem is given by the equation

u(t) = Ri(t) + L di(t)/dt + e(t),   (8.14)

where u(t) is the motor armature voltage, R is the armature coil resistance, i(t) is the motor armature current, L is the motor coil inductance, and e(t) is the induced electromotive force. The counter electromotive force is proportional to the angular velocity of the motor:

e(t) = Ke ω(t),   (8.15)

where Ke stands for the motor voltage constant and ω(t) is the angular velocity of the motor. The equivalent electrical circuit of the DC motor is shown in Fig. 8.25. In turn, the mechanical subsystem can be derived from a torque balance:

J dω(t)/dt = Tm(t) − Bm ω(t) − Tl − Tf(ω(t)),   (8.16)

where J is the motor moment of inertia, Tm is the motor torque, Bm is the viscous friction torque coefficient, Tl is the load torque, and Tf(ω(t)) is the friction torque. The motor torque Tm(t) is proportional to the armature current:

Tm(t) = Km i(t),   (8.17)

where Km stands for the motor torque constant. The friction torque can be considered as a function of the angular velocity and it is assumed to be the sum of
the Stribeck, Coulomb and viscous components. The viscous friction torque opposes motion and is proportional to the angular velocity. The Coulomb friction torque is constant at any angular velocity. The Stribeck friction is a non-linear component occurring at low angular velocities. Although the model (8.14)–(8.17) has a direct relation to the motor physical parameters, the true relation between them is non-linear. There are many non-linear factors in the motor, e.g. the non-linearity of the magnetization characteristic of the material, the effect of armature reaction, the effect caused by eddy currents in the magnet, residual magnetism, the commutator characteristic, and mechanical frictions [216]. These factors are not captured by the model (8.14)–(8.17). Summarizing, the DC motor is a non-linear dynamic process, and to model it suitably non-linear modelling, e.g. dynamic neural networks [46], should be employed. In the following section, a dynamic type of neural network is proposed to design a non-linear model of the DC motor considered. The motor described works in closed-loop control with the PI controller. It is assumed that the load of the motor is equal to 0. The objective of system control is to keep the rotational speed at the constant value equal to 2000. Additionally, it is assumed that the reference value is corrupted by additive white noise.

Table 8.13. Laboratory system technical data

Component            Variable           Value
Motor                rated voltage      24 V
                     rated current      2 A
                     rated torque       0.096 Nm
                     rated speed        3000 rpm
                     voltage constant   6.27 mV/rpm
                     moment of inertia  17.7 × 10−6 kgm²
                     torque constant    0.06 Nm/A
                     resistance         3.13 Ω
Tachometer           output voltage     5 mV/rpm
                     moment of inertia  10.6 × 10−6 kgm²
Clutch               moment of inertia  33 × 10−6 kgm²
Incremental encoder  number of lines    1024
                     max. resolution    4096/R
                     moment of inertia  1.45 × 10−6 kgm²
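Equations (8.14)–(8.17) can be integrated directly, which is a quick way to sanity-check the parameters of Table 8.13. In the forward-Euler sketch below, the rated values R, Ke (converted from 6.27 mV/rpm to V·s/rad), Km and J come from the table, while the armature inductance and the viscous friction coefficient are not listed there and the values used are pure assumptions; the Coulomb and Stribeck friction terms are omitted:

```python
import numpy as np

R, Km, J = 3.13, 0.06, 17.7e-6       # from Table 8.13
Ke = 6.27e-3 * 60 / (2 * np.pi)      # 6.27 mV/rpm -> ~0.06 V*s/rad
L_a, Bm = 3e-3, 1e-5                 # ASSUMED inductance and viscous friction
u, Tl = 24.0, 0.0                    # rated voltage, no load
dt, i, w = 1e-4, 0.0, 0.0
for _ in range(2000):                # simulate 0.2 s, enough to settle
    di = (u - R * i - Ke * w) / L_a  # electrical subsystem (8.14)-(8.15)
    dw = (Km * i - Bm * w - Tl) / J  # mechanical subsystem (8.16)-(8.17)
    i, w = i + dt * di, w + dt * dw
rpm = w * 60 / (2 * np.pi)           # no-load steady-state speed
```

The no-load speed comes out somewhat above the rated 3000 rpm, as expected, since the rated speed is specified under load.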
Fig. 8.25. Equivalent electrical circuit of a DC motor
8.3.2
Motor Modelling
A separately excited DC motor was modelled by using the dynamic neural network (4.17) proposed in Section 4.3. The model of the motor was selected as follows:

T = f(Cm).    (8.18)

The following input signal was used in the experiments:

Cm(k) = 3 sin(2π1.7k) + 3 sin(2π1.1k − π/7) + 3 sin(2π0.3k + π/3).    (8.19)
The input signal (8.19) is persistently exciting of order 6 [162]. Using (8.19), a learning set containing 1000 samples was formed. The neural network model (4.17) and (4.23) had the following structure: one input, 3 IIR neurons with first order filters and hyperbolic tangent activation functions, 6 FIR neurons with first order filters and linear activation functions, and one linear output neuron [56, 51]. The structure of the neural model was selected by trial and error. The quality of each model was determined using the AIC [162]. This criterion contains a penalty term and makes it possible to discard overly complex models. The training process was carried out for 100 steps using the ARS algorithm [66, 42] with the initial variance v0 = 0.1. The outputs of the neural model and the separately excited motor generated for another 1000 testing samples are depicted in Fig. 8.26. The efficiency of the neural model was also checked during the work of the motor in closed-loop control. The results are presented in Fig. 8.27. After transitional oscillations (Fig. 8.27(a)), the neural model settled at a proper value. For clarity of presentation, only the outputs of the process and the neural model for 200 time steps are illustrated in Fig. 8.27(b). The above results give a strong argument that the neural model mimics the behaviour of the DC motor well and confirm its good generalization abilities.
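The multisine input (8.19) is straightforward to generate; the sketch below treats k directly as the sample index, as in the formula.

```python
import math

def input_signal(k):
    """Multisine input Cm(k) of (8.19): a sum of three sinusoids, which
    makes the signal persistently exciting of a finite order."""
    return (3 * math.sin(2 * math.pi * 1.7 * k)
            + 3 * math.sin(2 * math.pi * 1.1 * k - math.pi / 7)
            + 3 * math.sin(2 * math.pi * 0.3 * k + math.pi / 3))

# form a learning set of 1000 samples, as in the experiment
learning_set = [input_signal(k) for k in range(1000)]
```

Since each sinusoid has amplitude 3, the signal is bounded by 9 in magnitude.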
8.3.3 Fault Diagnosis Using Density Shaping
Two types of faults were examined during the experiments: • fi1 – tachometer faults simulated by increasing/decreasing the rotational speed, in turn by −5% (f11 ), +5% (f21 ), −10% (f31 ), +10% (f41 ), −20% (f51 ) and +20% (f61 ),
Fig. 8.26. Responses of the motor (solid) and the neural model (dash-dot) – open-loop control
Fig. 8.27. Responses of the motor (solid) and the neural model (dashed) – closed-loop control
• fi2 – mechanical faults simulated by increasing/decreasing the motor torque, in turn by +20% (f12 ), −20% (f22 ), +10% (f32 ), −10% (f42 ), +5% (f52 ) and −5% (f62 ). As a result, a total of 12 faulty situations were investigated. Each fault occurred at the time step tf rom = 4000 and lasted until the time step ton = 5000. Using the neural model of the process, a residual signal was generated. This signal was used to train another neural network to approximate the probability density function of the residual. The training process was carried out on-line for 100000 steps using unsupervised learning, described in Section 7.2.2. The final network
parameters were w1 = −52.376, w2 = 55.274, b1 = −0.011 and b2 = −27.564. Cut-off values determined for the significance level β = 0.05 were as follows: xl = −0.005 and xr = 0.0052, and the threshold was equal to T = 17.33. In order to make the decision about faults and to determine the detection time tdt , a time window with the length n = 50 (0.25 sec) was used (see Fig. 2.11). In the fault-free case, the number of false alarms represented by the false detection rate rf d was monitored [26]. The achieved index value was rf d = 0.04. For comparison, for the constant threshold (7.3) the value rf d = 0.098 was obtained. One can conclude that by using the density shaping technique to calculate the threshold, the number of false alarms can be reduced significantly. The next step is to check the fault detection ability of the proposed method. The results of fault detection are presented in Table 8.14. All faults were reliably detected using the density shaping threshold, contrary to the constant threshold technique. In the latter case, problems were encountered with the faults f41 , f52 and f62 , whose detection times precede the fault occurrence, indicating false detections. An interesting situation is observed for the fault f61 . Due to the moving window with the length of 50, false alarms were not raised just before the 4000-th time step, but in practice the residual exceeded the threshold from the 3968-th time step, which means a false alarm.
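The windowed decision making and the false detection rate used above can be sketched as follows; the thresholds and the residual trace in the usage lines are illustrative, not the experimental values.

```python
def detect(residual, lower, upper, n=50):
    """Return the detection time index, or None: the first instant at which
    the residual has stayed outside [lower, upper] for n consecutive samples.
    The window suppresses short spikes that would otherwise raise alarms."""
    run = 0
    for t, r in enumerate(residual):
        run = run + 1 if (r < lower or r > upper) else 0
        if run >= n:
            return t
    return None

def false_detection_rate(residual, lower, upper):
    """Fraction of fault-free samples lying outside the thresholds (r_fd)."""
    outside = sum(1 for r in residual if r < lower or r > upper)
    return outside / len(residual)

# illustrative use: a residual that jumps out of band at sample 100
residual = [0.0] * 100 + [1.0] * 100
print(detect(residual, -0.5, 0.5, n=50))   # -> 149, i.e. 50 samples after the jump
```

This also explains the f61 case above: an excursion starting shortly before the fault is only flagged once the whole window lies outside the thresholds.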
Table 8.14. Results of fault detection for the density shaping technique

                                 f11     f21     f31     f41     f51     f61
Density shaping thresholding
  rtd [%]                      99.75    98.2    98.4    98.5    98.7    99.5
  tdt                           4075    4069    4067    4085    4063    4061
Constant thresholds
  rtd [%]                       98.1    99.4    98.8    99.4    98.9     100
  tdt                           4072    4067    4064    3147    4063    4018

                                 f12     f22     f32     f42     f52     f62
Density shaping thresholding
  rtd [%]                       99.3    99.4    99.1    99.0    98.6    96.2
  tdt                           4057    4059    4060    4060    4065    4075
Constant thresholds
  rtd [%]                       99.5    99.7    99.3    99.1    99.9    98.4
  tdt                           4056    4059    4057    4059    3726    3132
Fault isolation

Fault isolation can be considered as a classification problem where a given residual value is assigned to one of the predefined classes of the system behaviour. In the case considered here, there is only one residual signal and 12 different faulty scenarios. Firstly, it is required to check the distribution of the symptom signals in order to verify the separability of the faults. The symptom distribution is shown in Fig. 8.28. Almost all classes are separable except the faults f11 (marked with o) and f62 (marked with ∗), which overlap each other. A similar situation is observed for the faults f21 (marked with ·) and f52 (marked with +). As a result, the pairs f11 , f62 and f21 , f52 can be isolated, but only as groups of faults. Finally, 10 classes of faults are formed: C1 = {f11 , f62 }, C2 = {f21 , f52 }, C3 = {f31 }, C4 = {f41 }, C5 = {f51 }, C6 = {f61 }, C7 = {f12 }, C8 = {f22 }, C9 = {f32 } and C10 = {f42 }. To perform fault isolation, the well-known multilayer perceptron was used. The neural network had two inputs (the model input and the residual) and 4 outputs (each class of the system behaviour was coded using a 4-bit representation). The learning set was formed using 100 samples per faulty situation; thus the size of the learning set was equal to 1200. As the best performing neural classifier, the network with 15 hyperbolic tangent neurons in the first hidden layer, 10 hyperbolic tangent neurons in the second hidden layer, and 4 sigmoidal output neurons was selected. The neural classifier
Fig. 8.28. Symptom distribution (residual plotted against the system input for the faults f11–f61 and f12–f62)
was trained for 200 steps using the Levenberg-Marquardt method. Additionally, the real-valued response of the classifier was transformed into a binary one. A simple idea is to calculate the distance between the classifier output and each predefined class of the system behaviour. As a result, the binary representation giving the shortest Euclidean distance is selected as the classifier binary output. This transformation can be represented as follows:

j = arg mini ||x − Ki ||,    i = 1, . . . , NK ,    (8.20)

where x is the real-valued output of the classifier, Ki is the binary representation of the i-th class, NK is the number of predefined classes of the system behaviour, and ||·|| is the Euclidean distance. Then, the binary representation of the classifier output can be determined in the form x̄ = Kj . Recognition accuracy (R) results are presented in Table 8.15. All classes of faulty situations were reliably recognized, with an accuracy of more than 90%; the entries corresponding to the true class of each fault are the dominant values in the table. There are cases of misclassification, e.g. the class C4 was classified as the class C2 at a rate of 5.7%. Misclassification can be caused by the fact that some classes of faults are closely arranged in the symptom space or even slightly overlap each other. Such a situation is observed for the classes C4 and C9 . Generally speaking, the achieved isolation results are satisfactory. It is necessary to mention that such high isolation rates are achievable only if some faulty scenarios are treated as a group of faults. In the case considered, there were two such groups of faults, C2 and C1 .

Table 8.15. Fault isolation results, R [%]

        C1     C2     C3     C4     C5     C6     C7     C8     C9    C10
f11    100      –      –      –      –      –      –      –      –      –
f21    0.3   99.7      –      –      –      –      –      –      –      –
f31    0.2    0.5   99.3      –      –      –      –      –      –      –
f41      –    5.7    0.7   93.6      –      –      –      –      –      –
f51    0.9      –      –    0.9   94.1      –    0.5      –      –    3.4
f61      –    0.2      –      –    1.1   95.9      –      –    2.1    0.7
f12      –      –      –      –    0.4    1.4   97.5      –    0.7      –
f22      –      –      –      –      –      –    1.6   98.4      –      –
f32      –      –    0.2    3.9      –      –      –    1.8   94.1      –
f42    0.2    0.7    3.0      –      –      –      –      –    2.1   94.1
f52      –   97.7      –      –      –      –      –      –      –    2.3
f62   97.5    2.5      –      –      –      –      –      –      –      –

Fault identification

In this experiment, the objective of fault identification was to estimate the size (S) of the detected and isolated faults. When analytical equations of residuals are unknown, fault identification consists in estimating the fault size and the time of fault occurrence on the basis of residual values. An elementary index of the residual size assigned to the fault size is the ratio of the residual value rj to a suitably assigned threshold value Tj . In this way, the fault size can be represented as the mean value of such elementary indices over all residuals as follows:

S(fk ) = (1/N) Σ_{j: rj ∈ R(fk )} rj /Tj ,    (8.21)
where S(fk ) represents the size of the fault fk , R(fk ) is the set of residuals sensitive to the fault fk , and N is the size of the set R(fk ). The threshold values are given at the beginning of this section. The results are shown in Table 8.16. Analyzing them, one can observe that quite large values were obtained for the faults f51 , f61 and f12 . These faults were arbitrarily assigned to the group large. Another group is formed by the faults f31 , f41 , f22 , f32 and f42 , possessing similar values of the fault size. This group was called medium. The third group of faults consists of f11 , f21 , f52 and f62 . The fault sizes in these cases are distinctly smaller than in the cases already discussed, and this group is called small. The small size of the faults f52 and f62 somewhat explains the problems with their detection using a constant threshold (see Table 8.16).

Table 8.16. Fault identification results, S

           f11    f21    f31    f41    f51     f61     f12    f22    f32    f42    f52    f62
small     2.45   3.32      –      –      –       –       –      –      –      –   2.28   1.73
medium       –      –   5.34   6.19      –       –       –   8.27   8.61   8.65      –      –
large        –      –      –      –   10.9   11.64   17.39      –      –      –      –      –
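The decoding rule (8.20) and the fault-size index (8.21) can be sketched as follows; the 4-bit class codes and the residual and threshold values in the usage lines are illustrative, not the experimental ones.

```python
import math

def decode(x, codes):
    """(8.20): pick the binary class code nearest, in the Euclidean sense,
    to the real-valued classifier output x."""
    def dist(k):
        return math.sqrt(sum((xi - ki) ** 2 for xi, ki in zip(x, k)))
    return min(codes, key=dist)

def fault_size(residuals, thresholds):
    """(8.21): mean ratio of each residual value to its assigned threshold,
    taken over the residuals sensitive to the fault."""
    return sum(r / T for r, T in zip(residuals, thresholds)) / len(residuals)

# illustrative use with 4-bit class codes
codes = [(0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1)]
print(decode((0.1, 0.05, 0.9, 0.2), codes))   # -> (0, 0, 1, 0)
```

With a single residual, as in the DC motor case, the sum in (8.21) reduces to one term, so the fault size is simply the ratio of the residual to its threshold.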
8.3.4 Robust Fault Diagnosis
To estimate the uncertainty associated with the neural model, the MEM technique discussed in Section 7.3.3 is applied. To design the error model, two methods are utilized: the classical linear ARX model and the neural network based NNARX one. In order to select a proper number of delays, several ARX models were examined and the best performing one was selected using the AIC. The parameters of the ARX model were as follows: the number of past outputs na = 20 and the number of past inputs nb = 20. In turn, the best NNARX structure was selected by the trial and error procedure and its parameters are as follows: 8 hidden neurons with hyperbolic tangent activation functions, one linear output neuron, the number of past outputs na = 3, the number of past inputs nb = 20. The sum of squared errors calculated for 3000 testing samples for the ARX model was equal to 0.0117, and for the NNARX network it was 0.0065. Due to its better performance and generalization abilities, the neural network based error model was used to form the uncertainty bands.
Fig. 8.29. Residual and constant thresholds (a) and confidence bands generated by model error modelling (b)
Using the procedure described in Section 7.3.3 and assuming a confidence level equal to β = 0.05, two bands were calculated. The results are presented in Fig. 8.29(b). To evaluate the quality of the proposed solution, another decision making technique, based on constant thresholds calculated using (7.3), was also examined. Decision making using constant thresholds is illustrated in Fig. 8.29(a). In both methods, the number of false alarms represented by the false detection rate rf d was monitored [26]. The achieved indices are as follows: rf d = 0.012 in the case of adaptive thresholds and rf d = 0.098 in the case of constant ones.

Fault detection

In order to make the decision about faults and to determine the detection time tdt , a time window with the length 0.25 sec was used. The results of fault detection are presented in Table 8.17. All faults were reliably detected except the fault f62 . In this case, model error modelling needed more time to detect this small fault. However, the MEM technique demonstrates more reliable behaviour than simple thresholding. Examples of fault detection are illustrated in Fig. 8.30 (adaptive thresholds) and Fig. 8.31 (constant thresholds). In the presented cases, better performance is observed for model error modelling.
8.3.5 Final Remarks
In this section, a neural network based method for the detection, isolation and identification of faults in a DC motor was proposed. Using the novel cascade structure of the dynamic neural network, quite an accurate model of the motor was obtained, which can mimic the technological process with good accuracy. In turn, a simple neural network trained to maximise the output entropy can approximate the probability density function of a residual, and in this way a more representative threshold value can be obtained for a given significance level. It was shown that such an approach significantly reduces the number of false alarms caused by an inaccurate model of the process. Even better fault
Fig. 8.30. Fault detection using model error modelling: fault f11 – confidence bands (a) and decision logic without the time window (b); fault f61 – confidence bands (c) and decision logic without the time window (d); fault f42 – confidence bands (e) and decision logic without the time window (f)
Fig. 8.31. Fault detection by using constant thresholds: fault f11 – residual with thresholds (a) and decision logic without the time window (b); fault f61 – residual with thresholds (c) and decision logic without the time window (d); fault f42 – residual with thresholds (e) and decision logic without the time window (f)
Table 8.17. Results of fault detection for model error modelling

                            f11     f21     f31     f41     f51     f61
Model error modelling
  rtd [%]                  97.9    99.6    98.8    99.7    99.6    99.5
  tdt                      4074    4055    4077    4053    4058    4075
Constant thresholds
  rtd [%]                  98.1    99.4    98.8    99.4    98.9     100
  tdt                      4072    4067    4064    3147    4063    4018

                            f12     f22     f32     f42     f52     f62
Model error modelling
  rtd [%]                  99.2    99.3    99.2    98.8    99.1      81
  tdt                      4058    4100    4060    4061    4060    4357
Constant thresholds
  rtd [%]                  99.5    99.7    99.3    99.1    99.9    98.4
  tdt                      4056    4059    4057    4059    3726    3132
detection results can be obtained by means of robust fault diagnosis carried out using model error modelling. Due to the estimation of model uncertainty, the robust fault diagnosis system may be much more sensitive to the occurrence of small faults than standard decision making methods such as constant thresholds. The superiority of MEM may be evident in the case of incipient faults, when a fault develops very slowly and a robust technique performs in a more sensitive manner than constant thresholds. Moreover, comparing the false detection ratios calculated for normal operating conditions for adaptive as well as constant thresholds, one can conclude that the number of false alarms was considerably reduced when model error modelling was applied. Furthermore, fault isolation was performed using the standard multi-layer perceptron. Preliminary analysis of the symptom distribution and splitting faulty scenarios into groups made it possible to obtain high fault isolation rates. The last step in the fault diagnosis procedure was fault identification. In the framework of fault identification, the objective was to estimate the fault size. The size of a fault was estimated by checking how much the residual exceeded the threshold assigned to it. The whole fault diagnosis approach was successfully tested on a number of faulty scenarios simulated in the real plant, and the achieved results confirm the usefulness and effectiveness of artificial neural networks in designing fault detection and isolation systems. It should be pointed out that the presented solution can easily be applied to on-line fault diagnosis.
9 Concluding Remarks and Further Research Directions
There is no doubt that artificial neural networks have gained a considerable position in the existing state-of-the-art in the field of both the modelling and identification of non-linear dynamic processes and the fault diagnosis of technical processes. The self-learning ability and the property of approximating non-linear functions provide great flexibility in the modelling of non-linear systems. These features allow one to design adaptive control systems for complex, unknown and non-linear dynamic processes. The present monograph is mainly devoted to a special class of dynamically driven neural networks consisting of neuron models with IIR filters. The existing literature shows a great potential of locally recurrent globally feedforward networks, which is confirmed by a variety of applications in different scientific areas. Therefore, the application of locally recurrent networks to the modelling of technical processes and fault diagnosis seems to be justified. In the light of the discussion above, the original objective of the research reported in this monograph was to develop efficient tools able to solve problems encountered in modelling and identification theory, and model based fault diagnosis. In order to accomplish this task, appropriate theoretical deliberations were carried out. Furthermore, some known methods were generalized and several new algorithms were constructed. The following is a concise summary of the original contributions provided by this monograph to the state-of-the-art in neural network modelling and fault diagnosis of non-linear dynamic processes:
• Detailed analysis of dynamic properties of the neuron model with the IIR filter, including the analysis of the equilibrium points, observability and controllability. Deriving state-space representations of locally recurrent globally feedforward neural networks with one and two hidden layers needed in both stability and approximation discussions;
• Deriving training algorithms based on global optimisation techniques in order to obtain a high quality model of a given process;
• Formulating stability conditions for LRGF networks. Based on the conditions obtained for the LRGF network with one hidden layer, a stabilization problem was defined and solved as a constrained optimisation task. For the LRGF
network with two hidden layers, both local and global stability conditions were derived using Lyapunov's methods. Global stability conditions were formulated in the form of LMIs, which makes checking the stability very easy. Based on local stability conditions, constraints on the network parameters were defined. Thus, a constrained training algorithm was elaborated which guarantees the stability of the neural model;
• Proving approximation abilities of LRGF networks. In the monograph it was proved that the locally recurrent network with two hidden layers is able to approximate a state-space trajectory produced by any Lipschitz continuous function with arbitrary accuracy. The undertaken analysis of the discussed network rendered it possible to simplify its structure and to significantly reduce the number of parameters. Thus, a novel structure of a locally recurrent neural network was proposed;
• Developing methods for optimal training sequence selection for dynamic neural networks. The result presented in the monograph is in fact the first step to the problem which, in the author's opinion, is the most challenging one among those stated in the monograph. To solve this problem, some well-known methods of optimum experimental design for linear regression models were successfully adopted;
• Technical and industrial applications. Three applications were discussed:
– application of the discussed approaches to the modelling and fault detection and isolation of the components of a sugar evaporator based on real process data,
– application of the investigated approaches to the modelling and fault detection of the components of a fluid catalytic cracking converter simulator,
– application of the developed approaches to the modelling and fault detection, isolation and identification of an electrical drive laboratory system.
Moreover, the uncertainty of the neural model in the framework of fault diagnosis was investigated. In the monograph, the model error modelling method was extended to the time domain. Moreover, a neural version of this method was proposed. From the engineering point of view, many of the proposed approaches lead to more transparent solutions as well as many efficient and easy to implement numerical procedures. The author strongly believes that these advantages establish a firm position of the discussed methodologies regarding applications in widely understood engineering. Nevertheless, there still remain open problems which require closer attention and indicate further research directions. In particular, the following research problems should be considered:
• to determine an appropriate number of hidden neurons which assures the required level of approximation accuracy,
• to select the proper order of filters to capture the dynamics of the modelled process,
• to propose new Lyapunov candidate functions which make it possible to formulate less restrictive stability conditions,
• to investigate more robust procedures for stabilizing neural networks during training, which deteriorate the training process only in a negligible way,
• to find a proper structure of the error model in order to obtain a very sensitive robust fault detection procedure,
• to propose fault models using dynamic neural networks without the need for faulty data,
• to integrate neural network based fault diagnosis with the fault tolerant control system.
References
1. Sorsa, T., Koivo, H.N.: Application of neural networks in the detection of breaks in a paper machine. In: Preprints IFAC Symp. On-line Fault Detection and Supervision in the Chemical Process Industries, Newark, Delaware, USA (1992) 162–167
2. Himmelblau, D.M.: Use of artificial neural networks to monitor faults and for troubleshooting in the process industries. In: Preprints IFAC Symp. On-line Fault Detection and Supervision in the Chemical Process Industries, Newark, Delaware, USA (1992) 144–149
3. Patton, R.J., Chen, J., Siew, T.: Fault diagnosis in nonlinear dynamic systems via neural networks. In: Proc. of CONTROL'94, Coventry, UK, Volume 2 (1994) 1346–1351
4. Frank, P.M., Köppen-Seliger, B.: New developments using AI in fault diagnosis. Engineering Applications of Artificial Intelligence 10 (1997) 3–14
5. Patton, R.J., Korbicz, J.: Advances in computational intelligence. Special Issue of International Journal of Applied Mathematics and Computer Science 9 (1999)
6. Calado, J., Korbicz, J., Patan, K., Patton, R., Sá da Costa, J.: Soft computing approaches to fault diagnosis for dynamic systems. European Journal of Control 7 (2001) 248–286
7. Korbicz, J., Kościelny, J., Kowalczuk, Z., Cholewa, W.: Fault Diagnosis. Models, Artificial Intelligence, Applications. Springer-Verlag, Berlin Heidelberg (2004)
8. Isermann, R.: Supervision, fault detection and diagnosis of technical systems. Special Section of Control Engineering Practice 5 (1997)
9. Chen, J., Patton, R.J.: Robust Model-Based Fault Diagnosis for Dynamic Systems. Kluwer Academic Publishers, Berlin (1999)
10. Patton, R.J., Frank, P.M., Clark, R.: Issues of Fault Diagnosis for Dynamic Systems. Springer-Verlag, Berlin (2000)
11. Korbicz, J., Patan, K., Kowal, M., eds.: Fault Diagnosis and Fault Tolerant Control. Challenging Problems of Science – Theory and Applications: Automatic Control and Robotics. Academic Publishing House EXIT, Warsaw (2007)
12.
Witczak, M.: Modelling and Estimation Strategies for Fault Diagnosis of Non-Linear Systems. From Analytical to Soft Computing Approaches. Lecture Notes in Control and Information Sciences. Springer-Verlag, Berlin (2007)
13. Isermann, R.: Fault diagnosis of machines via parameter estimation and knowledge processing – A tutorial paper. Automatica 29 (1994) 815–835
14. Patton, R.J., Frank, P.M., Clark, R.N.: Issues of Fault Diagnosis for Dynamic Systems. Springer-Verlag, Berlin (2000)
15. Gertler, J.: Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, Inc., New York (1998)
16. Isermann, R.: Fault Diagnosis Systems. An Introduction from Fault Detection to Fault Tolerance. Springer-Verlag, New York (2006)
17. Rutkowski, L.: New Soft Computing Techniques for System Modelling, Pattern Classification and Image Processing. Springer-Verlag, Berlin (2004)
18. Nelles, O.: Nonlinear System Identification. From Classical Approaches to Neural Networks and Fuzzy Models. Springer-Verlag, Berlin (2001)
19. Norgard, M., Ravn, O., Poulsen, N., Hansen, L.: Neural Networks for Modelling and Control of Dynamic Systems. Springer-Verlag, London (2000)
20. Narendra, K.S., Parthasarathy, K.: Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks 1 (1990) 12–18
21. Hunt, K.J., Sbarbaro, D., Zbikowski, R., Gawthrop, P.J.: Neural networks for control systems – A survey. Automatica 28 (1992) 1083–1112
22. Miller, W.T., Sutton, R.S., Werbos, P.J.: Neural Networks for Control. MIT Press, Cambridge, MA (1990)
23. Haykin, S.: Neural Networks. A Comprehensive Foundation, 2nd Edition. Prentice-Hall, New Jersey (1999)
24. Zhang, J., Man, K.F.: Time series prediction using RNN in multi-dimension embedding phase space. In: Proc. IEEE Int. Conf. Systems, Man and Cybernetics, San Diego, USA, 11–14 October (1998) 1868–1873. Published on CD-ROM
25. Janczak, A.: Identification of Nonlinear Systems Using Neural Networks and Polynomial Models. A Block-oriented Approach. Lecture Notes in Control and Information Sciences. Springer-Verlag, Berlin (2005)
26. Patan, K., Parisini, T.: Identification of neural dynamic models for fault detection and isolation: The case of a real sugar evaporation process. Journal of Process Control 15 (2005) 67–79
27.
Guglielmi, G., Parisini, T., Rossi, G.: Fault diagnosis and neural networks: A power plant application (keynote paper). Control Engineering Practice 3 (1995) 601–620
28. Osowski, S.: Neural Networks in Algorithmic Expression. WNT, Warsaw (1996) (in Polish)
29. Hertz, J., Krogh, A., Palmer, R.G.: Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company, Inc. (1991)
30. Looney, C.G.: Pattern Recognition Using Neural Networks. Oxford University Press (1997)
31. Sharkey, A.J.C., ed.: Combining Artificial Neural Nets. Springer-Verlag, London, UK (1999)
32. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2 (1989) 359–366
33. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 2 (1989) 303–314
34. Kuschewski, J.G., Hui, S., Żak, S.: Application of feedforward neural networks to dynamical system identification and control. IEEE Transactions on Neural Networks 1 (1993) 37–49
35. Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (1989) 270–289
36. Elman, J.L.: Finding structure in time. Cognitive Science 14 (1990) 179–211
37. Parlos, A.G., Chong, K.T., Atiya, A.F.: Application of the recurrent multilayer perceptron in modelling complex process dynamics. IEEE Transactions on Neural Networks 5 (1994) 255–266
38. Tsoi, A.C., Back, A.D.: Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks 5 (1994) 229–239
39. Marcu, T., Mirea, L., Frank, P.M.: Development of dynamical neural networks with application to observer based fault detection and isolation. International Journal of Applied Mathematics and Computer Science 9 (1999) 547–570
40. Campolucci, P., Uncini, A., Piazza, F., Rao, B.D.: On-line learning algorithms for locally recurrent neural networks. IEEE Transactions on Neural Networks 10 (1999) 253–271
41. Korbicz, J., Patan, K., Obuchowicz, A.: Neural network fault detection system for dynamic processes. Bulletin of the Polish Academy of Sciences, Technical Sciences 49 (2001) 301–321
42. Patan, K., Parisini, T.: Stochastic learning methods for dynamic neural networks: Simulated and real-data comparisons. In: Proc. 2002 American Control Conference, ACC'02, Anchorage, Alaska, USA, May 8–10 (2002) 2577–2582
43. Patan, K., Parisini, T.: Stochastic approaches to dynamic neural network training. Actuator fault diagnosis study. In: Proc. 15th IFAC Triennial World Congress, b'02, Barcelona, Spain, July 21–26 (2002). Published on CD-ROM
44. Patan, K., Korbicz, J.: Artificial neural networks in fault diagnosis. In Korbicz, J., Kościelny, J.M., Kowalczuk, Z., Cholewa, W., eds.: Fault Diagnosis. Models, Artificial Intelligence, Applications. Springer-Verlag, Berlin (2004) 330–380
45. Patan, K.: Training of the dynamic neural networks via constrained optimization. In: Proc. IEEE Int. Joint Conference on Neural Networks, IJCNN 2004, Budapest, Hungary (2004). Published on CD-ROM
46.
Patan, K.: Approximation ability of a class of locally recurrent globally feedforward neural networks. In: Proc. European Control Conference, ECC 2007, Kos, Greece, July 2–5 (2007). Published on CD-ROM
47. Patan, K.: Approximation of state-space trajectories by locally recurrent globally feed-forward neural networks. Neural Networks (2007) DOI: 10.1016/j.neunet.2007.10.004
48. Patan, K., Korbicz, J., Prętki, P.: Global stability conditions of locally recurrent neural networks. Lecture Notes in Computer Science. Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005 3697 (2005) 191–196
49. Patan, K.: Stability analysis and the stabilization of a class of discrete-time dynamic neural networks. IEEE Transactions on Neural Networks 18 (2007) 660–673
50. Patan, K., Korbicz, J.: Fault detection in catalytic cracking converter by means of probability density approximation. In: Proc. Int. Symp. Fault Detection Supervision and Safety for Technical Processes, SAFEPROCESS 2006, Beijing, P.R. China (2006). Published on CD-ROM
51. Patan, K., Korbicz, J., Glowacki, G.: DC motor fault diagnosis by means of artificial neural networks. In: Proc. 4th International Conference on Informatics in Control, Automation and Robotics, ICINCO 2007, Angers, France, May 9–12 (2007). Published on CD-ROM
52. Patan, K., Korbicz, J.: Fault detection in catalytic cracking converter by means of probability density approximation. Engineering Applications of Artificial Intelligence 20 (2007) 912–923
194
References
53. Patan, M., Patan, K.: Optimal observation strategies for model-based fault detection in distributed systems. International Journal of Control 78 (2005) 1497–1510 54. Patan, K.: Fault detection system for the sugar evaporator based on AI techniques. In: Proc. 6th IEEE Int. Conf. Methods and Models in Automation and Robotics, MMAR 2000. Mi¸edzyzdroje, Poland, 28-21 August. (2000) 807–812 55. Patan, K.: Robust fault diagnosis in catalytic cracking converter using artificial neural networks. In: Proc. 16th IFAC World Congress, July 3-8, Prague, Czech Republic. (2005) Published on CD-ROM. 56. Patan, K.: Robust faul diagnosis in a DC motor by means of artificial neural networks and model error modelling. In Korbicz, J., Patan, K., Kowal, M., eds.: Fault Diagnosis and Fault Tolerant Control. Academic Publishing House Exit, Warsaw (2007) 337–346 57. Patan, K., Parisini, T.: Dynamic neural networks for actuator fault diagnosis: Application to DAMADICS benchmark problem. In: Proc. Int. Symp. Fault Detection Supervision and Safety for Technical Processes, SAFEPROCESS 2003, Washington D.C., USA. (2003) Published on CD-ROM. 58. Patan, K.: Fault detection of the actuators using neural networks. In: Proc. 7th IEEE Int. Conf. Methods and Models in Automation and Robotics, MMAR 2001. Mi¸edzyzdroje, Poland, August 28–31. Volume 2. (2001) 1085–1090 59. Patan, K.: Actuator fault diagnosis study using dynamic neural networks. In: Proc. 8th IEEE Int. Conf. Methods and Models in Automation and Robotics, MMAR 2002. Szczecin, Poland, September 2–5. (2002) 219–224 60. Gertler, J.: Analytical redundancy methods in fault detection and isolation. Survey and synthesis. In: Proc. Int. Symp. Fault Detection Supervision and Safety for Technical Processes, SAFEPROCESS’91, Baden-Baden, Germany. (1991) 9–21 61. Ko´scielny, J.M.: Diagnostics of Automatic Industrial Processes. Academic Publishing Office EXIT (2001) (in Polish). 62. 
Liu, M., Zang, S., Zhou, D.: Fast leak detection and location of gas pipelines based on an adaptive particle filter. International Journal of Applied Mathematics and Computer Science 15 (2005) 541–550 63. Ko´scielny, J.M.: Fault isolation in industrial processes by dynamic table of states method. Automatica 31 (1995) 747–753 64. K¨ oppen-Seliger, B., Frank, P.M.: Fuzzy logic and neural networks in fault detection. In Jain, L., Martin, N., eds.: Fusion of Neural Networks, Fuzzy Sets, and Genetic Algorithms, New York, CRC Press (1999) 169–209 65. Patton, R.J., Frank, P.M., Clark, R.N., eds.: Fault Diagnosis in Dynamic Systems. Theory and Application. Prentice Hall, New York (1989) 66. Walter, E., Pronzato, L.: Identification of Parametric Models from Experimental Data. Springer, London (1997) 67. Soderstrom, T., Stoica, P.: System Identification. Prentice-Hall International, Hemel Hempstead (1989) 68. Milanese, M., Norton, J., Piet-Lahanier, H., Walter, E.: Bounding Approaches to System Identification. Plenum Press, New York (1996) 69. Isermann, R.: Fault diagnosis of machines via parameter estimation and knowledge processing. Automatica 29 (1993) 815–835 70. Walker, B.K., Kuang-Yang, H.: FDI by extended Kalman filter parameter estimation for an industrial actuator benchmark. Control Engineering Practice 3 (1995) 1769–1774 71. Massoumnia, B.K., Vander Velde, W.E.: Generating parity relations for detecting and identifying control system components failures. Journal of Guidance, Control and Dynamics 11 (1988) 60–65
References
195
72. Peng, Y.B., Youssouf, A., Arte, P., Kinnaert, M.: A complete procedure for residual generation and evaluation with application to a heat exchanger. IEEE Transactions on Control Systems Technology 5 (1997) 542–555 73. Guernez, C., Cassar, J.P., Staroswiecki, M.: Extension of parity space to nonlinear polynomial dynamic systems. In: Proc. 3rd IFAC Symp. Fault Detection, Supervision and Safety of Technical Processes, SAFEPROCESS’97, Hull, UK. Volume 2. (1997) 861–866 74. Krishnaswami, V., Rizzoni, G.: Non-linear parity equation residual generation for fault detection and isolation. In Ruokonen, T., ed.: Proc. IFAC Symposium SAFEPROCESS’94, Espoo, Finland. Volume 1., Pergamon Press (1994) 317–332 75. Anderson, B.D.O., Moore, J.B.: Optimal Filtering. Prentice-Hall, New Jersey (1979) ˙ 76. Hui, S., Zak, S.H.: Observer design for systems with unknown inputs. International Journal of Applied Mathematics and Computer Science 15 (2005) 431–446 77. Gupta, M.M., Jin, L., Homma, N.: Static and Dynamic Neural Networks. From Fundamentals to Advanced Theory. John Wiley & Sons, New Jersey (2003) 78. Kowalski, C.T.: Monitoring and Fault Diagnosis of Induction Motors Using Neural Networks. Wroclaw University of Technology Academic Press, Wroclaw, Poland (2005) (in Polish). 79. Tadeusiewicz, R.: Neural Networks. Academic Press RM, Warsaw (1993) (in Polish). 80. Koivo, M.H.: Artificial neural networks in fault diagnosis and control. Control Engineering Practice 2 (1994) 89–101 81. Korbicz, J., Mrugalski, M.: Confidence estimation of GMDH neural networks and its application in fault detection systems. International Journal of Systems Science (2007) DOI: 10.1080/00207720701847745. 82. Babuka, R.: Fuzzy Modeling for Control. Kluwer Academic Publishers, London (1998) 83. Jang, J.: ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man and Cybernetics 23 (1995) 665–685 84. Rutkowska, D.: Neuro-Fuzzy Architectures and Hybrid Learning. 
Springer, Berlin (2002) 85. Osowski, S., Tran Hoai, L., Brudzewski, K.: Neuro-fuzzy TSK network for calibration of semiconductor sensor array for gas measurements. IEEE Transactions on Measurements and Instrumentation 53 (2004) 630–637 86. Kowal, M.: Optimization of Neuro-Fuzzy Structures in Technical Diagnostic Systems. Volume 9 of Lecture Notes in Control and Computer Science. Zielona G´ ora University Press, Zielona G´ ora, Poland (2005) 87. Korbicz, J., Kowal, M.: Neuro-fuzzy networks and their application to fault detection of dynamical systems. Engineering Applications of Artificial Intelligence 20 (2007) 609–617 88. Widrow, B., Hoff, M.E.: Adaptive switching circuit. In: 1960 IRE WESCON Convention Record, part 4, New York, IRE (1960) 96–104 89. Widrow, B.: Generalization and information storage in networks of adaline neurons. In Yovits, M., Jacobi, G.T., Goldstein, G., eds.: Self-Organizing Systems 1962 (Chicago 1962), Washington, Spartan (1962) 435–461 90. Duch, W., Korbicz, J., Rutkowski, L., Tadeusiewicz, R., eds.: Biocybernetics and Biomedical Engineering 2000. Neural Networks. Academic Publishing Office EXIT, Warsaw (2000) (in Polish). 91. Werbos, P.J.: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University (1974)
196
References
92. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by backpropagating errors. Nature (1986) 533–536 93. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Parallel Distributed Processing I (1986) 94. Plaut, D., Nowlan, S., Hinton, G.: Experiments of learning by back propagation. Technical Report CMU-CS-86-126, Department of Computer Science, Carnegie Melon University, Pittsburg, PA (1986) 95. Demuth, H., Beale, M.: Neural Network Toolbox for Use with MATLAB. The MathWorks Inc. (1993) 96. Fahlman, S.E.: Fast learning variation on back-propagation: An empirical study. In Touretzky, D., Hilton, G., Sejnowski, T., eds.: Proceedings of the 1988 Connectionist Models Summer School (Pittsburg 1988), San Mateo, Morgan Kaufmann (1989) 38–51 97. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In Touretzky, D.S., ed.: Advances in Neural Information Processing Systems II (Denver 1989), San Mateo, Morgan Kaufmann (1990) 524–532 98. Rojas, R.: Neural Networks. A Systematic Introduction. Springer-Verlag, Berlin (1996) 99. Hagan, M.T., Menhaj, M.B.: Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks 5 (1994) 989–993 100. Hagan, M., Demuth, H.B., Beale, M.H.: Neural Network Design. PWS Publishing, Boston, MA (1996) 101. Girosi, J., Poggio, T.: Neural networks and the best approximation property. Biol. Cybernetics 63 (1990) 169–176 102. Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural Computation 3 (1991) 246–257 103. Chen, S., Billings, S.A.: Neural network for nonliner dynamic system modelling and identification. International Journal of Control 56 (1992) 319–346 104. Warwick, K., Kambhampati, C., Parks, P., Mason, J.: Dynamic systems in neural networks. 
In Hunt, K.J., Irwin, G.R., Warwick, K., eds.: Neural Network Engineering in Dynamic Control Systems, Berlin, Springer-Verlag (1995) 27–41 105. Chen, S., Cowan, C.F.N., Grant, P.M.: Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on Neural Networks 2 (1991) 302–309 106. Kohonen, T.: Self-organization and Associative Memory. Springer-Verlag, Berlin (1984) 107. Zhou, Y., Hahn, J., Mannan, M.S.: Fault detection and classification in chemical processes based on neural networks with feature extraction. ISA Transactions 42 (2003) 651–664 108. Chen, Y.M., Lee, M.L.: Neural networks-based scheme for system failure detection and diagnosis. Mathematics and Computers in Simulation 58 (2002) 101–109 109. Ayoubi, M.: Fault diagnosis with dynamic neural structure and application to a turbo-charger. In: Proc. Int. Symp. Fault Detection Supervision and Safety for Technical Processes, SAFEPROCESS’94, Espoo, Finland. Volume 2. (1994) 618–623 110. Zhang, J., Roberts, P.D., Ellis, J.E.: A self-learning fault diagnosis system. Transactions of the Institute of Measurements and Control 13 (1991) 29–35 111. Karhunen, J.: Optimization criteria and nonlinear PCA neural networks. In: Proc. Int. Conf. Neural Networks, ICNN. (1994) 1241–1246
References
197
112. Karpenko, M., Sepehri, N., Scuse, D.: Diagnosis of process valve actuator faults using multilayer neural network. Control Engineering Practice 11 (2003) 1289– 1299 113. Zhang, J.: Improved on-line process fault diagnosis through information fusion in multiple neural networks. Computers & Chemical Engineering 30 (2006) 558–571 114. Xu, P., Xu, S., Yin, H.: Application of self-organizing competitive neural network in fault diagnosis of suck rod pumping system. Journal of Petroleum Science & Engineering 58 (2006) 43–48 115. Glowacki, G., Patan, K., Korbicz, J.: Nonlinear principal component analysis in fault diagnosis. In Korbicz, J., Patan, K., Kowal, M., eds.: Fault Diagnosis and Fault Tolerant Control, ISBN: 978-83-60434-32-1. Challenging Problems of Science - Theory and Applications : Automatic Control and Robotics. Academic Publishing House EXIT, Warsaw (2007) 211–218 116. Harkat, M.F., Djelel, S., Doghmane, N., Benouaret, M.: Sensor of fault detection, isolation and reconstruction using nonlinear principal component analysis. International Journal of Automation and Computing (2007) 149–155 117. Arbib, M.A., ed.: The Metaphorical Brain, 2nd edition. Wiley, New York (1989) 118. Mozer, M.C.: Neural net architectures for temporal sequence processing. In Weigend, A.S., A, G.N., eds.: Time series predictions: Forecasting the future and understanding the past, Reading, MA, Addison-Wesley Publishing Company, Inc. (1994) 243–264 119. Zamarreno, J.M., Vega, P.: State space neural network. Properties and application. Neural Networks 11 (1998) 1099–1112 120. Williams, R.J.: Adaptive state representation and estimation using recurrent connectionist networks. In: Neural Networks for Control, London, MIT Press (1990) 97–115 121. Stornetta, W.S., Hogg, T., Hubermann, B.A.: A dynamic approach to temporal pattern processing. In Anderson, D.Z., ed.: Neural Information Processing Systems, New York, American Institute of Physics (1988) 750–759 122. 
Mozer, M.C.: A focused backpropagation algorithm for temporal pattern recognition. Complex Systems 3 (1989) 349–381 123. Jordan, M.I., Jacobs, R.A.: Supervised learning and systems with excess degrees of freedom. In Touretzky, D.S., ed.: Advances in Neural Information Processing Systems II (Denver 1989), San Mateo, Morgan Kaufmann (1990) 324–331 124. Jordan, M.I.: Attractor dynamic and parallelism in a connectionist sequential machine. In: Proc. 8th Annual Conference of the Cognitive Science Society (Amherst, 1986), Hillsdale, Erlbaum (1986) 531–546 125. Anderson, S., Merrill, J.W.L., Port, R.: Dynamic speech categorization with recurrent networks. In Touretzky, D., Hinton, G., Sejnowski, T., eds.: Proc. of the 1988 Connectionist Models Summer School (Pittsburg 1988), San Mateo, Morgan Kaufmann (1989) 398–406 126. Pham, D.T., Xing, L.: Neural Networks for Identification, Prediction and Control. Springer-Verlag, Berlin (1995) 127. Sontag, E.: Feedback stabilization using two-hidden-layer nets. IEEE Transactions on Neural Networks 3 (1992) 981–990 128. Gori, M., Bengio, Y., Mori, R.D.: BPS: A learning algorithm for capturing the dynamic nature of speech. In: International Joint Conference on Neural Networks. Volume II. (1989) 417–423 129. Back, A.D., Tsoi, A.C.: FIR and IIR synapses, A new neural network architecture for time series modelling. Neural Computation 3 (1991) 375–385
198
References
130. Fasconi, P., Gori, M., Soda, G.: Local feedback multilayered networks. Neural Computation 4 (1992) 120–130 131. Poddar, P., Unnikrishnan, K.P.: Memory neuron networks: A prolegomenon. Technical Report GMR-7493, General Motors Research Laboratories (1991) 132. Gupta, M.M., Rao, D.H.: Dynamic neural units with application to the control of unknown nonlinear systems. Journal of Intelligent and Fuzzy Systems 1 (1993) 73–92 133. Hopfield, J.: Neural networks and physical systems with emergent collective computational abilities. In: Proc. Nat. Acad. Sci. USA. (1982) 2554–2558 134. Pineda, F.J.: Dynamics and architecture for neural computation. J. Complexity 4 (1988) 216–245 135. Pineda, F.J.: Generalization of back-propagation to recurrent neural networks. Physical Rev. Lett. 59 (1987) 2229–2232 136. Grossberg, S.: Content-addressable memory storage by neural networks: A general model and global Lyapunov method. In Schwartz, E.L., ed.: Computational Neuroscience, Cambridge, MA, MIT Press (1990) 137. Sastry, P.S., Santharam, G., Unnikrishnan, K.P.: Memory neuron networks for identification and control of dynamical systems. IEEE Transactions on Neural Networks 5 (1994) 306–319 ˙ 138. Zurada, J.M.: Lambda learning rule for feedforward neural networks. In: Proc. Int. Conf. on Neural Networks. San Francisco, USA, March 28–April 1. (1993) 1808–1811 139. Horn, R.A., Johnson., C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985) 140. Klamka, J.: Stochastic controllability of linear systems with state delays. International Journal of Applied Mathematics and Computer Science 17 (2007) 5–13 141. Oprz¸edkiewicz, K.: An observability problem for a class of uncertain-parameter linear dynamic systems. International Journal of Applied Mathematics and Computer Science 15 (2005) 331–338 142. Vidyasagar, M.: Nonlinear System Analysis, 2nd edition. Prentice-Hall, Englewood Cliffs, NJ (1993) 143. 
Levin, A.U., Narendra, K.S.: Control of nonlinear dynamical systems using neural networks: Controllability and stabilization. IEEE Transactions on Neural Networks 4 (1993) 192–206 144. Patan, K., Korbicz, J.: Dynamic Networks and Their Application in Modelling and Identification. In Duch, W., Korbicz, J., Rutkowski, L., Tadeusiewicz, R., eds.: Biocybernetics and Biomedical Engineering 2000. Neural Networks. Academic Publishing Office EXIT, Warsaw (2000) (in Polish). 145. Spall, J.C.: Introduction to Stochastic Search and Optimization. John Willey & Sons, New Jersey (2003) 146. Pflug, G.C.: Optimization of Stochastic Models. The Interface Between Simulation and Optimization. Kluwer Academic Publishers, Boston (1996) 147. Spall, J.: Multivariate stochastic aproximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control (1992) 332– 341 148. Spall, J.: Stochastic optimization, stochastic approximation and simulated annealing. In Webster, J., ed.: Encyclopedia of Electrical and Electronics Engineering. John Wiley & Sons, New York (1999) 149. Spall, J.: Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control 45 (2000) 1839–1853
References
199
150. Chin, D.C.: A more efficient global optimization algorithm based on Styblinski and Tang. Neural Networks 7 (1994) 573–574 151. Maryak, J., Chin, D.C.: Global random optimization by simultaneous perturbation stochastic approximation. In: Proc. of the American Control Conference, ACC 2001, Arlington VA, USA. (2001) 756–762 152. Pham, D.T., Liu, X.: Training of Elman networks and dynamic system modelling. International Journal of Systems Science 27 (1996) 221–226 153. Lissane Elhaq, S., Giri, F., Unbehauen, H.: Modelling, identification and control of sugar evaporation – theoretical design and experimental evaluation. Control Engineering Practice (1999) 931–942 154. Cannas, B., Cincotti, S., Marchesi, M., Pilo, F.: Learnig of Chua’s circuit attractors by locally recurrent neural networks. Chaos Solitons & Fractals 12 (2001) 2109–2115 155. Zhang, J., Morris, A.J., Martin, E.B.: Long term prediction models based on mixed order locally recurrent neural networks. Computers Chem. Engng 22 (1998) 1051–1063 156. Campolucci, P., Piazza, F.: Intrinsic stability-control method for recursive filters and neural networks. IEEE Trans. Circuit and Systems – II: Analog and Digital Signal Processing 47 (2000) 797–802 157. Jin, L., Nikiforuk, P.N., Gupta, M.M.: Approximation of discrete-time statespace trajectories using dynamic recurrent neural networks. IEEE Transactions on Automatic Control 40 (1995) 1266–1270 158. Garzon, M., Botelho, F.: Dynamical approximation by recurrent neural networks. Neurocomputing 29 (1999) 25–46 159. Leshno, M., Lin, V., Pinkus, A., Schoken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6 (1993) 861–867 160. Scarselli, F., Tsoi, A.C.: Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks 11 (1998) 15–37 161. Hirsch, M., Smale, S.: Differential Equations, Dynamical Systems and Linear Algebra. 
Academic Press, New York (1974) 162. Ljung, L.: System Identification – Theory for the User. Prentice Hall (1999) 163. Matsuoka, K.: Stability conditions for nonlinear continuous neural networks with asymmetric connection weights. Neural Networks 5 (1992) 495–500 164. Ensari, T., Arik, S.: Global stability analysis of neural networks with multiple time varying delays. IEEE Transactions on Automatic Control 50 (2005) 1781– 1785 165. Liang, J., Cao, J.: A based-on LMI stability criterion for delayed recurrent neural networks. Chaos, Solitons & Fractals 28 (2006) 154–160 166. Cao, J., Yuan, K., Li, H.: Global asymptotical stability of recurrent neural networks with multiple discrete delays and distributed delays. IEEE Transactions on Neural Networks 17 (2006) 1646–1651 167. Forti, M., Nistri, P., Papini, D.: Global exponential stability and global convergence in finite time of delayed neural networks with infinite gain. IEEE Transactions on Neural Networks 16 (2005) 1449–1463 168. Fang, Y., Kincaid, T.G.: Stability analysis of dynamical neural networks. IEEE Transactions on Neural Networks 7 (1996) 996–1006 169. Jin, L., Nikiforuk, P.N., Gupta, M.M.: Absolute stability conditions for discretetime recurrent neural networks. IEEE Transactions on Neural Networks 5 (1994) 954–963
200
References
170. Hu, S., Wang, J.: Global stability of a class of discrete-time recurrent neural networks. IEEE Trans. Circuits and Systems – I: Fundamental Theory and Applications 49 (2002) 1104–1117 171. Jin, L., Gupta, M.M.: Stable dynamic backpropagation learning in recurrent neural networks. IEEE Transactions on Neural Networks 10 (1999) 1321–1334 172. Suykens, J.A.K., Moor, B.D., Vandewalle, J.: Robust local stability of multilayer recurrent neural networks. IEEE Transactions on Neural Networks 11 (2000) 222–229 173. Kushner, H.J., Yin, G.G.: Stochastic Approximation Algorithms and Applications. Springer-Verlag, New York (1997) 174. Paszke, W.: Analysis and Synthesis of Multidimensional System Classes using Linear Matrix Inequality Methods. Volume 8 of Lecture Notes in Control and Computer Science. Zielona G´ ora University Press, Zielona G´ ora, Poland (2005) 175. Boyd S, L. E. Ghaoui, E.F., Balakrishnan, V.: Linear Matrix Inequalities in System and Control Theory. SIAM Studies in Applied and Numerical Mathematics. SIAM, Philadelphia, USA (1994) Vol. 15. 176. Gahinet, P., Apkarian, P.: A linear matrix inequality approach to h∞ control. International Journal of Robust and Nonlinear Control 4 (1994) 421–448 177. Iwasaki, T., Skelton, R.E.: All controllers for the general H∞ control problem: LMI existence conditions and state space formulas. Automatica 30 (1994) 1307– 1317 178. van de Wal, M., de Jager, B.: A review of methods for input/output selection. Automatica 37 (2001) 487–510 179. Fukumizu, K.: Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks 11 (2000) 17–26 180. Witczak, M.: Toward the training of feed-forward neural networks with the Doptimum input sequence. IEEE Transactions on Neural Networks 17 (2006) 357– 373 181. Fedorov, V.V., Hackl, P.: Model-Oriented Design of Experiments. Lecture Notes in Statistics. Springer-Verlag, New York (1997) 182. Atkinson, A.C., Donev, A.N.: Optimum Experimental Designs. 
Clarendon Press, Oxford (1992) 183. Patan, M.: Optimal Observation Strategies for Parameter Estimation of Distributed Systems. Volume 5 of Lecture Notes in Control and Computer Science. Zielona G´ ora University Press, Zielona G´ ora, Poland (2004) 184. Uci´ nski, D.: Optimal selection of measurement locations for parameter estimation in distributed processes. International Journal of Applied Mathematics and Computer Science 10 (2000) 357–379 185. Rafajlowicz, E.: Optimum choice of moving sensor trajectories for distributed parameter system identification. International Journal of Control 43 (1986) 1441– 1451 186. Uci´ nski, D.: Optimal Measurement Methods for Distributed Parameter System Identification. CRC Press, Boca Raton (2005) 187. Kiefer, J., Wolfowitz, J.: Optimum designs in regression problems. The Annals of Mathematical Statistics 30 (1959) 271–294 188. P´ azman, A.: Foundations of Optimum Experimental Design. Mathematics and Its Applications. D. Reidel Publishing Company, Dordrecht (1986) 189. Nyberg, M.: Model Based Fault Diagnosis: Methods, Theory, and Automotive Engine Applications. PhD thesis, Link¨ oping University, Link¨ oping, Sweden (1999) 190. Korbicz, J.: Robust fault detection using analytical and soft computing methods. Bulletin of the Polish Academy of Sciences, Technical Sciences 54 (2006) 75–88
References
201
191. Shumsky, A.: Redundancy relations for fault diagnosis in nonlinear uncertain systems. International Journal of Applied Mathematics and Computer Science 17 (2007) 477–489 192. Mrugalski, M., Witczak, M., Korbicz, J.: Confidence estimation of the multi-layer perceptron and its application in fault detection systems. Engineering Applications of Artificial Intelligence (2007) DOI: 10.1016/j.engappai.2007.09.008. 193. Basseville, M., Nikiforov, I.V.: Detection of Abrupt Changes: Theory and Application. Prentice Hall (1993) 194. Roth, Z., Baram, Y.: Multidimensional density shaping by sigmoids. IEEE Transactions on Neural Networks 7 (1996) 1291–1298 195. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7 (1995) 1129–1159 196. Quinn, S.L., Harris, T.J., Bacon, D.W.: Accounting for uncertainty in controlrelevant statistics. Journal of Process Control 15 (2005) 675–690 197. Gunnarson, S.: On some asymptotic uncertainty bounds in recursive least squares identification. IEEE Transactions on Automatic Control 38 (1993) 1685–1689 198. Reinelt, W., Garulli, A., Ljung, L.: Comparing different approaches to model error modeling in robust identification. Automatica 38 (2002) 787–803 199. Puig, V., Stancu, A., Escobet, T., Nejjari, F., Quevedo, J., Patton, R.J.: Passive robust fault detection using interval observers: Application to the DAMADICS benchmark problem. Control Engineering Practice 14 (2006) 621–633 200. Ding, X., Frank, P.: Frequency domain approach and threshold selector for robust model-based fault detection and isolation. In: Proc. Int. Symp. Fault Detection Supervision and Safety for Technical Processes, SAFEPROCESS’91, BadenBaden, Germany. (1991) 271–276 201. H¨ ofling, T., Isermann, R.: Fault detection based on adaptive parity equations and single-parameter tracking. Control Engineering Practice 4 (1996) 1361–1369 202. 
Sauter, D., Dubois, G., Levrat, E., Br´emont, J.: Fault diagnosis in systems using fuzzy logic. In: Proc. First European Congress on Fuzzy and Intelligent Technologies, EUFIT’93, Aachen, Germany. (1993) 781–788 203. Schneider, H.: Implementation of a fuzzy concept for supervision and fault detection of robots. In: Proc. First European Congress on Fuzzy and Intelligent Technologies, EUFIT’93, Aachen, Germany. (1993) 775–780 204. Milanese, M.: Set membership identification of nonlinear systems. Automatica 40 (2004) 957–975 205. Ding, L., Gustafsson, T., Johansson, A.: Model parameter estimation of simplified linear models for a continuous paper pulp degester. Journal of Process Control 17 (2007) 115–127 206. DAMADICS: Website of the Research Training Network on Development and Application of Methods for Actuator Diagnosis in Industrial Control Systems http://diag.mchtr.pw.edu.pl/damadics (2004) 207. Papers of the special sessions: DAMADICS I, II, III. In: Proc. 5th IFAC Symp. Fault Detection Supervision and Safety of Technical Processes, SAFEPROCESS 2003, Washington DC, USA (2003) June 9-11. 208. Ko´scielny, J., Ostasz, A., Wasiewicz, P.: Fault Detection based on Fuzzy Neural Networks – Application to Sugar Factory Evaporator. In: Proc. Int. Symp. Fault Detection Supervision and Safety for Technical Processes, SAFEPROCESS 2000, Budapest, Hungary. (2000) 337–342 209. Barty´s, M., Ko´scielny, J.: Application of fuzzy logic fault isolation methods for actuator diagnosis. In: Proc. 15th IFAC Triennial World Congress, b’02. Barcelona, Spain, July 21–26. (2002) Published on CD-ROM.
202
References
210. Witczak, M.: Advances in model-based fault diagnosis with evolutionary algorithm and neural networks. International Journal of Applied Mathematics and Computer Science 16 (2006) 85–99 211. Moro, L.F.L., Odloak, D.: Constrained multivariable control of fluid catalytic cracking converters. Journal of Process Control 5 (1995) 29–39 212. Alcorta-Garcia, E., de L´eon-Cant´ on, P., Sotomayor, O.A.Z., Odloak, D.: Actuator and component fault isolation in a fluid catalytic cracking unit. In: Proc. 16th IFAC World Congress, July 3–8, Prague, Czech Republic. (2005) Published on CD-ROM. 213. Sotomayor, O.A.Z., Odloak, D., Alcorta-Garcia, E., L´eon-Cant´ on, P.: Observerbased supervision and fault detection of a FCC unit model predictive control system. In: Proc. 7th Int. Symp. Dynamic and Control of Process Systems, DYCOPS 7, Massachusetts, USA. (2004) 214. Orlowska-Kowalska, T., Szabat, K., Jaszczak, K.: The influence of parameters and structure of PI-type fuzzy controller on DC drive system dynamics. Fuzzy Sets and Systems 131 (2002) 251–264 215. Moseler, O., Isermann, R.: Application of model-based fault detection to a brushless DC motor. IEEE Trans. Industrial Electronics 47 (2000) 1015–1020 216. Xiang-Qun, L., Zhang, H.Y.: Fault detection and diagnosis of permanent-magnet DC motor based on parameter estimation and neural network. IEEE Trans. Industrial Electronics 47 (2000) 1021–1030 217. Fuessel, D., Isermann, R.: Hierarchical motor diagnosis utilising structural knowledge and a self-learning neuro-fuzzy scheme. IEEE Trans. Industrial Electronics 47 (2000) 1070–1077 218. Grzesiak, L.M., Kamierowski, M.P.: Improving flux and speed estimators for sensorless AC drives. Industrial Electronics 1 (2007)
Index

activation function
  hyperbolic tangent 96
  linear 17
  radial basis 19
  sigmoidal 66
  step 17
actuator 143
  block scheme 144
Adaptive Random Search, see ARS
Akaike Information Criterion (AIC) 72, 146, 147, 176, 181
ARS 53, 58, 60, 72
  global optimisation 54
  outline 55
  variance-exploitation 54
  variance-selection 54
autonomous system 100, 101
Back-Propagation algorithm (BP) 18
benchmark zone 24
Bernoulli distribution 56, 57
bounded error approaches 132, 137
characteristic equation 79–81, 106
classification 11, 15, 22, 179
conditional expectation 86
constraints 82, 84, 86, 87, 106
  active 82, 85
  set 86, 87, 109
  violated 83, 86
continuously differentiable (C¹) 67–70
controllability 47
  matrix 47, 49
control valve 143–145
covariance matrix 115
cumulative distribution function 126, 127
D-optimality criterion 116, 118, 121
data analysis based approaches 23
DC motor 141, 172–176, 182
  electrical subsystem 174
  mechanical subsystem 174
decision making 123, 124, 126, 131, 133, 136
  robust 123
density function 124, 128
detection of faults, see fault detection
detection time 25, 159, 160, 168, 172, 178, 182
dynamic neuron unit 36, 38
EDBP 52, 58, 60
eigenvalues 106, 107
electrical circuit 174, 176
electrical motor 173
entropy 127, 128
equilibrium point (state) 43, 44, 97, 106, 111
equivalence theorem 118
error model 138, 139
Extended Dynamic Back-Propagation, see EDBP
failure 7, 8, 113, 146, 149, 173
false alarm 9, 26, 124, 126, 133, 136, 148, 150, 154, 167, 178, 182
false detection rate 26, 159, 167–169, 178, 182
false isolation rate 26, 159
fault 7–9, 12, 16
  abrupt 8, 9, 146, 168
  actuator 12, 13, 23, 24, 143
  detection 7, 9, 10, 13, 22, 26, 123, 124, 133
  diagnosis 7, 8, 10, 11, 15, 22, 23, 123, 132, 133, 136, 137, 141, 160, 162, 170, 173
  identification 7, 10, 180
  incipient 8, 9, 185
  intermittent 9
  isolation 7, 10, 12, 13, 26, 153, 155, 179
  large 150, 181
  medium 181
  process (component) 9
  sensor 9, 12, 23, 24, 149, 150, 161
  small 181, 182, 185
FCC 141, 161–164, 172
FDI 10, 11, 14, 21, 24, 123, 152, 160
feasibility problem 104
feasible region 82, 85, 87
FIM 115–117, 119
Final Prediction Error (FPE) 146, 147
Fisher Information Matrix, see FIM
Fluid Catalytic Cracking, see FCC
fuzzy logic 7, 15
G-optimality criterion 116
Gersgorin's theorem 77
globally asymptotically stable 97, 98, 101, 103, 104
global optimisation 57
gradient ascent rule 128
gradient projection 82
  outline 83
identification of faults, see fault identification
input sequence 113–115, 121
interior point method 104
isolation of faults, see fault isolation
isolation time 26, 159, 160
knowledge based approaches 23
Kohonen network 20
Kuhn-Tucker conditions 85
Lagrange
  function 84
  multipliers method 84
Linear Matrix Inequality, see LMI
Lipschitz
  constant 67, 68, 98
  function 77
  mapping 67, 74
LMI 103–105
locally asymptotically stable 106, 107
locally recurrent globally feedforward network, see LRGF
LRGF 37, 49
  approximation ability 68, 70
  cascade 71, 72
  state-space representation 51, 67
  training 52
Lyapunov
  first method of 105
  function 97, 101
  global stability theorem 97
  second method of 97
martingale 87, 88
matrix norm 100
MEM 124, 139
minimum distance projection 82, 83, 85
  outline 86
model 10, 11, 13
  analytical 10, 15, 21, 123
  fuzzy 7, 10, 15, 123, 152
  mathematical 8, 15
  mismatch 9, 14, 150
  neural 10, 22, 35, 51, 59, 60, 62, 81, 91, 104, 111, 129, 142, 152, 161, 163, 176, 177
  qualitative 8, 23
  quantitative 8
  uncertainty 9, 132–134, 137, 181, 185
model based 123
model based approaches 21
model error modelling, see MEM
modelling 57, 59, 72
Moore-Penrose pseudoinverse 97, 107, 111
multi-input multi-output 152
multi-input single-output 153
multi-layer feedforward network, see multi-layer perceptron
multi-layer perceptron 16, 22, 30
necessary condition 110, 112
network state convergence 93, 100, 102
neural network 14, 17
  dynamic 49
  recurrent, see recurrent network
  time-delay 30
  with external dynamics 30, 35
neuron 16, 17, 19, 21, 36
  Adaline 17
  dynamic 31, 36, 37, 42, 43, 47, 49, 50, 66, 79
  hidden 19, 32–35
  linear 71
  McCulloch-Pitts 16
  memory 39
  output 33, 35, 60
  sigmoidal 17, 129, 140
  winner 20
  with finite impulse response 70, 73, 74
  with infinite impulse response 40, 42, 48, 58, 62, 70, 74, 77, 119
  with local activation feedback 37
  with local output feedback 39
  with local synapse feedback 38
normality
  assumption 125–127
  testing 125–127
norm stability condition 98, 100, 101, 104
observability 47, 48
  matrix 48, 49
observer 7, 13, 15
optimal design 117–119
optimum experimental design 115, 118–120
Ostrowski's theorem 44, 77
parallel model 31
parameter estimation 7, 12, 15, 21, 173
parameter identifiability 119
parity relations 7, 12
poles 80, 81, 84, 91
  location (placement) 90–92
  stable 91
  unstable 91
positioner 143–145
probability density function 127, 129–131
projection 86–88
quadratic programming 83
Radial Basis Function (RBF) 18
random design 121
reactor 161
recurrent multi-layer perceptron 33
recurrent network 31
  Elman 32
  fully 31
  globally 31
  Jordan 32
  locally 31
  partially 31–33
  state-space 31, 34
  Williams-Zipser 31
regenerator 161–163
residual
  definition 10
  evaluation 10, 22, 123, 124, 140
  generation 10, 11, 13, 16, 21, 162, 179
  signal 10, 124, 126, 134, 135, 164, 165, 177
riser 161–163
RLMP 33
robust
  fault detection 123
  fault diagnosis 132, 133, 140
  identification 132, 137
robustness 133, 137
  active approaches 133
  passive approaches 133
RTRN 31
SCADA 142, 145
Schur complement 103, 104
sensitivity
  of decision making 134
  of fault decision 126
  of fault detection 124, 126, 133
sensitivity matrix 115, 122
series-parallel model 31
servo-motor 143, 144
short map 98
significance level 130, 131, 135, 138, 139
Simultaneous Perturbation Stochastic Approximation, see SPSA
SPSA 55, 58, 61
  gain sequences 56
  gradient estimate 56
  outline 57
stability 78
  BIBO 77, 78
  condition 97, 101, 104
  margin 84, 92, 93
  triangle 84
stabilization 83, 88, 90, 93, 94
stable
  model 79, 91
  network 100
  region 91
  solution 87
  system 79, 81, 100, 107
stable system 107, 108
statistical error bounds 132
strong convergence 86
sufficient condition 98, 105, 110
sugar evaporator 141
support points 116
sequences SVD 107
117
tapped-delay lines 30 temperature model 143, 149 threshold 12, 124, 125, 130, 131 adaptive 123, 134, 135, 182 constant 123, 136, 150, 170, 182 fuzzy 135, 136 nominal 136 simple 124, 169, 182 true detection rate 26, 168, 170 true isolation rate 26 uncertainty bands 138, 139 region 138, 139 unit circle 79, 81, 91, 106, 107 universal approximation theorem 69 unknown input observer 14 unstable model 79, 95 system 80 vapour model
143, 148, 150, 151
winner takes all rule
20
68,
Lecture Notes in Control and Information Sciences
Edited by M. Thoma, M. Morari
Vol. 377: Patan K.
Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes
206 p. 2008 [978-3-540-79871-2]