SEQUENTIAL METHODS IN PATTERN RECOGNITION A N D MACHINE LEARNING
This is Volume 52 in MATHEMATICS IN SCIENCE AND ENGINEERING A series of monographs and textbooks Edited by RICHARD BELLMAN, University of Southern California A complete list of the books in this series appears at the end of this volume.
SEQUENTIAL METHODS IN PATTERN RECOGNITION AND MACHINE LEARNING K. S. FU School of Electrical Engineering Purdue University Lafayette, Indiana
ACADEMIC PRESS New York and London 1968
COPYRIGHT © 1968 BY ACADEMIC PRESS, INC. ALL RIGHTS RESERVED. NO PART OF THIS BOOK MAY BE REPRODUCED IN ANY FORM, BY PHOTOSTAT, MICROFILM, OR ANY OTHER MEANS, WITHOUT WRITTEN PERMISSION FROM THE PUBLISHERS.
ACADEMIC PRESS, INC. 111 Fifth Avenue, New York, New York 10003
United Kingdom Edition published by ACADEMIC PRESS, INC. (LONDON) LTD. Berkeley Square House, London W. 1
LIBRARY OF CONGRESS CATALOG CARD NUMBER: 68-8424
PRINTED IN THE UNITED STATES OF AMERICA
PREFACE
During the past decade there has been a considerable growth of interest in problems of pattern recognition and machine learning. This interest has created an increasing need for methods and techniques for the design of pattern recognition and learning systems. Many different approaches have been proposed. One of the most promising techniques for the solution of problems in pattern recognition and machine learning is the statistical theory of decision and estimation. This monograph treats the problems of pattern recognition and machine learning by use of sequential methods in statistical decision and estimation theory. The material presented in this volume is primarily based on the research carried out by the author and his co-workers, Dr. G. P. Cardillo, Dr. C. H. Chen, Dr. Y. T. Chien, and Dr. Z. J. Nikolic, during the past several years. In presenting the material, emphasis is placed upon the development of basic theory and computation algorithms in a systematic fashion. Although many different types of experiments have been performed to test the methods discussed, for illustrative purposes only experiments in English-character recognition have been presented. The monograph is intended to be of use both as a reference for system engineers and computer scientists and as a supplementary textbook for courses in pattern recognition and adaptive and learning systems. The presentation is kept concise. As a background to this monograph, it is assumed that the reader has adequate preparation in college mathematics and an introductory course on probability theory and mathematical statistics. The subject matter may be divided into two major parts: (1) pattern recognition and (2) machine learning. Roughly speaking, six approaches are presented; they are divided equally among Chapters 2 through 7.
After a brief review of several important approaches in pattern recognition in Chapter 1, two methods for feature selection and ordering, in terms of an information theoretic approach and the Karhunen-Loève expansion, are presented in Chapter 2. In addition to the application of Wald's sequential probability ratio test and the generalized sequential probability ratio test to pattern classification problems, three techniques are discussed, namely, the modified sequential probability ratio test with time-varying stopping boundaries (Chapter 3), the backward procedure using dynamic programming (Chapter 4), and the nonparametric sequential ranking procedure (Chapter 5). The application of dynamic programming to both feature ordering and pattern classification is also included in Chapter 4. A brief introduction to sequential analysis is given in Appendix A. Bayesian estimation techniques (Chapter 6) and the stochastic approximation procedure (Chapter 7) are introduced as learning techniques in sequential recognition systems. Both supervised and nonsupervised learning schemes are discussed. Relationships between Bayesian estimation techniques and the generalized stochastic approximation procedure are demonstrated. Methods are also suggested for the learning of slowly time-varying parameters. The method of potential functions, because of its close relationship to the stochastic approximation procedure, is briefly presented in Appendix G. Some of the material in the monograph has been discussed in several short courses at Purdue University, Washington University, and UCLA. Most of the material has been taught in both regular and seminar courses at Purdue University and the University of California at Berkeley. For a regular course in pattern recognition and machine learning, many other approaches should also be discussed. Unfortunately, because of the limited scope of the monograph, those promising approaches cannot be covered in detail here. Instead, a very brief remark on other related approaches and interesting research problems is given in the last section of each chapter. There is no doubt that some works are still not mentioned even in these remarks, due to the author's oversight or ignorance.

Lafayette, Indiana
August, 1968
K. S. Fu
ACKNOWLEDGMENTS
It is the author's pleasure to acknowledge the encouragement of Dr. M. E. VanValkenburg, Dr. L. A. Zadeh, Dr. T. F. Jones, Dr. W. H. Hayt, Jr., Dr. J. C. Hancock, and Dr. J. R. Lehmann. He owes a debt of gratitude to Dr. Richard Bellman, who read the manuscript and contributed many valuable suggestions. The author is also indebted to his colleagues and students at Purdue University and the University of California at Berkeley, who, through many helpful discussions during office and class hours, coffee breaks, and late evenings, assisted in the preparation of the manuscript. Particular suggestions and errata lists were provided by Dr. Z. J. Nikolic and Dr. Y. T. Chien. The author and his co-workers at Purdue have been very fortunate in having the consistent support of the National Science Foundation for the research in pattern recognition and machine learning. The major part of the manuscript was completed during the author's sabbatical year (1967) at the Department of Electrical Engineering and Computer Science, University of California, Berkeley. The environment and the atmosphere in Cory Hall and on Telegraph Avenue definitely stimulated the improvement and the early completion of the manuscript. In addition, the author wishes to thank Mrs. Patricia Gress for her efficient and careful typing of the manuscript.
CONTENTS

Preface

1. Introduction
   1.1 Pattern Recognition
   1.2 Deterministic Classification Techniques
   1.3 Training in Linear Classifiers
   1.4 Statistical Classification Techniques
   1.5 Sequential Decision Model for Pattern Classification
   1.6 Learning in Sequential Pattern Recognition Systems
   1.7 Summary and Further Remarks
   References

2. Feature Selection and Feature Ordering
   2.1 Feature Selection and Ordering: Information Theoretic Approach
   2.2 Feature Selection and Ordering: Karhunen-Loève Expansion
   2.3 Illustrative Examples
   2.4 Summary and Further Remarks
   References

3. Forward Procedure for Finite Sequential Classification Using Modified Sequential Probability Ratio Test
   3.1 Introduction
   3.2 Modified Sequential Probability Ratio Test: Discrete Case
   3.3 Modified Sequential Probability Ratio Test: Continuous Case
   3.4 Procedure of Modified Generalized Sequential Probability Ratio Test
   3.5 Experiments in Pattern Classification
   3.6 Summary and Further Remarks
   References

4. Backward Procedure for Finite Sequential Recognition Using Dynamic Programming
   4.1 Introduction
   4.2 Mathematical Formulation and Basic Functional Equation
   4.3 Reduction of Dimensionality
   4.4 Experiments in Pattern Classification
   4.5 Backward Procedure for Both Feature Ordering and Pattern Classification
   4.6 Experiments in Feature Ordering and Pattern Classification
   4.7 Use of Dynamic Programming for Feature-Subset Selection
   4.8 Suboptimal Sequential Pattern Recognition
   4.9 Summary and Further Remarks
   References

5. Nonparametric Procedure in Sequential Pattern Classification
   5.1 Introduction
   5.2 Sequential Ranks and Sequential Ranking Procedure
   5.3 A Sequential Two-Sample Test Problem
   5.4 Nonparametric Design of Sequential Pattern Classifiers
   5.5 Analysis of Optimal Performance and a Multiclass Generalization
   5.6 Experimental Results and Discussions
   5.7 Summary and Further Remarks
   References

6. Bayesian Learning in Sequential Pattern Recognition Systems
   6.1 Supervised Learning Using Bayesian Estimation Techniques
   6.2 Nonsupervised Learning Using Bayesian Estimation Techniques
   6.3 Bayesian Learning of Slowly Varying Patterns
   6.4 Learning of Parameters Using an Empirical Bayes Approach
   6.5 A General Model for Bayesian Learning Systems
   6.6 Summary and Further Remarks
   References

7. Learning in Sequential Recognition Systems Using Stochastic Approximation
   7.1 Supervised Learning Using Stochastic Approximation
   7.2 Nonsupervised Learning Using Stochastic Approximation
   7.3 A General Formulation of Nonsupervised Learning Systems Using Stochastic Approximation
   7.4 Learning of Slowly Time-Varying Parameters Using Dynamic Stochastic Approximation
   7.5 Summary and Further Remarks
   References

Appendix A. Introduction to Sequential Analysis
   1. Sequential Probability Ratio Test
   2. Bayes' Sequential Decision Procedure
   References

Appendix B. Optimal Properties of Generalized Karhunen-Loève Expansion
   1. Derivation of Property (i)
   2. Derivation of Property (ii)

Appendix C. Properties of the Modified SPRT

Appendix D. Enumeration of Some Combinations of the kj's and Derivation of Formula for the Reduction of Tables Required in the Computation of Risk Functions

Appendix E. Computations Required for the Feature Ordering and Pattern Classification Experiments Using Dynamic Programming

Appendix F. Stochastic Approximation: A Brief Survey
   1. Robbins-Monro Procedure for Estimating the Zero of an Unknown Regression Function
   2. Kiefer-Wolfowitz Procedure for Estimating the Extremum of an Unknown Regression Function
   3. Dvoretzky's Generalized Procedure
   4. Methods of Accelerating Convergence
   5. Dynamic Stochastic Approximation
   References

Appendix G. The Method of Potential Functions or Reproducing Kernels
   1. The Estimation of a Function with Noise-Free Measurements
   2. The Estimation of a Function with Noisy Measurements
   3. Pattern Classification: Deterministic Case
   4. Pattern Classification: Statistical Case
   References

Author Index
Subject Index
CHAPTER 1
INTRODUCTION
1.1 Pattern Recognition
The problem of pattern recognition is that of classifying or labeling a group of objects on the basis of certain subjective requirements. Objects classified into the same pattern class usually have some common properties. The classification requirements are subjective, since different types of classification occur in different situations. For example, in recognizing English characters, there are twenty-six pattern classes. However, in distinguishing English characters from Chinese characters, there are only two pattern classes, i.e., English and Chinese. Human beings perform the task of pattern recognition at almost every level of the nervous system. More recently, engineers have faced the problem of designing machines for pattern recognition, and preliminary results have been very encouraging. There have been some successful attempts to design or to program machines to read printed or typed characters, identify bank checks, classify electrocardiograms, recognize some spoken words, play checkers and chess, and sort photographs. Other applications of pattern recognition include handwritten character or word recognition, general medical diagnosis, system fault identification, seismic wave classification, target detection, weather prediction, speech recognition, etc. The simplest approach to pattern recognition is probably that of "template matching." In this case, a set of templates or prototypes, one for each pattern class, is stored in the machine. The input pattern (with unknown classification) is compared with the template of each class, and the classification is based on a preselected matching criterion or similarity criterion. In other words, if the input pattern matches the template of the ith pattern class better than it matches any other template, then the input is classified as from the ith pattern class. Usually, for the simplicity of the machine, the templates are stored
in their raw-data form. This approach has been used for some existing printed-character recognizers and bank-check readers. The disadvantage of the template-matching approach is that it is sometimes difficult to select a good template for each pattern class and to define a proper matching criterion. The difficulty is especially pronounced when large variations and distortions are expected in the patterns belonging to one class. The recognition of handwritten characters is a good example of this case. A more sophisticated approach is that, instead of matching the input pattern with the templates, the classification is based on a set of selected measurements extracted from the input pattern. These selected measurements, called "features," are supposed to be invariant or less sensitive with respect to the commonly encountered variations and distortions, and also to contain less redundancy. Under this proposition, pattern recognition can be considered as consisting of two subproblems. The first subproblem is what measurements should be taken from the input patterns. Usually, the decision of what to measure is rather subjective and also dependent on practical considerations (for example, the availability of measurements, the cost of measurements, etc.). Unfortunately, at present there is very little general theory for the selection of feature measurements. However, there are some investigations concerned with the selection of a subset and the ordering of features in a given set of measurements. The criterion for feature selection or ordering is often based on either the importance of the features in characterizing the patterns or the contribution of the features to the performance of recognition (i.e., the accuracy of recognition). The second subproblem in pattern recognition is the problem of classification (making a decision on the class assignment of the input patterns) based on the measurements taken from the selected features.
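The template-matching rule just described can be illustrated with a short sketch. This is not from the text: the templates, the test input, and the use of Euclidean distance as the matching criterion are all assumptions chosen for the example.

```python
import math

def classify_by_template(x, templates):
    """Return the class whose stored template best matches input x,
    using (negated) Euclidean distance as the similarity criterion."""
    best_class, best_score = None, -math.inf
    for label, t in templates.items():
        # similarity = negative distance: larger means a better match
        score = -math.dist(x, t)
        if score > best_score:
            best_class, best_score = label, score
    return best_class

# Hypothetical raw-data templates for two pattern classes
templates = {"class_1": [0.0, 0.0, 1.0], "class_2": [1.0, 1.0, 0.0]}
print(classify_by_template([0.1, 0.0, 0.9], templates))  # closest to class_1
```

The difficulty noted above shows up directly here: a single stored prototype per class cannot absorb large within-class variation.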
The device or machine which extracts the feature measurements from input patterns is called a feature extractor. The device or machine which performs the function of classification is called a classifier. A simplified block diagram of a pattern recognition system is shown in Fig. 1.1.†

Fig. 1.1. A pattern recognition system.

Thus, in general terms, the template-matching approach may be interpreted as a special case of the second, "feature-extraction," approach, where the templates are stored in terms of feature measurements and a special classification criterion (matching) is used for the classifier.
1.2 Deterministic Classification Techniques
The concept of pattern classification may be expressed in terms of the partition of the feature space (or a mapping from the feature space to a decision space). Suppose that N features are to be measured from each input pattern. Each set of N features can be considered as a vector X, called a feature (measurement) vector, or a point in the N-dimensional feature space Ω_X. The problem of classification is to assign each possible vector or point in the feature space to a proper pattern class. This can be interpreted as a partition of the feature space into mutually exclusive regions, each region corresponding to a particular pattern class.

Mathematically, the problem of classification can be formulated in terms of "discriminant functions" [1]. Let ω_1, ω_2, ..., ω_m be designated as the m possible pattern classes to be recognized, and let

    X = (x_1, x_2, ..., x_N)^T    (1.1)

be the feature (measurement) vector, where x_i represents the ith feature measurement. Then the discriminant function D_i(X) associated with pattern class ω_i, i = 1, ..., m, is such that if the input pattern represented by the feature vector X is in class ω_i, denoted as X ~ ω_i, the value of D_i(X) must be the largest. That is, for all X ~ ω_i,

    D_i(X) > D_j(X),    i, j = 1, ..., m,  i ≠ j    (1.2)

Thus, in the feature space Ω_X, the boundary of partition, called the decision boundary, between the regions associated with class ω_i and class ω_j, respectively, is expressed by the equation

    D_i(X) - D_j(X) = 0    (1.3)

† The division into two parts is primarily for convenience rather than necessity.
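The decision rule (1.2) is simply an argmax over the m discriminant functions. A minimal sketch, in which the two linear discriminants are illustrative placeholders, not functions from the text:

```python
def classify(x, discriminants):
    """Assign x to the class i whose discriminant D_i(x) is largest, rule (1.2)."""
    scores = [D(x) for D in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])

# Two hypothetical linear discriminants over a two-dimensional feature space
D1 = lambda x: 1.0 * x[0] + 0.0 * x[1]
D2 = lambda x: 0.0 * x[0] + 1.0 * x[1]
print(classify([2.0, 1.0], [D1, D2]))  # D1 is largest, so class index 0
```

The decision boundary (1.3) is exactly the set of points where the two scores tie.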
Fig. 1.2. A classifier.
A general block diagram for the classifier using criterion (1.2) and a typical two-dimensional illustration of (1.3) are shown in Figs. 1.2 and 1.3, respectively. Many different forms satisfying condition (1.2) can be selected for D_i(X). Several important discriminant functions are discussed in the following.
Fig. 1.3. An example of partition in a two-dimensional feature space.
A. Linear Discriminant Functions

In this case a linear combination of the feature measurements x_1, x_2, ..., x_N is selected for D_i(X), i.e.,

    D_i(X) = Σ_{k=1}^{N} w_ik x_k + w_{i,N+1},    i = 1, ..., m    (1.4)

The decision boundary between the regions in Ω_X associated with ω_i and ω_j is of the form

    D_i(X) - D_j(X) = Σ_{k=1}^{N} w_k x_k + w_{N+1} = 0    (1.5)
with Wk = W i k - W j k and WN+1 = Wi,N+1 - Wj,N+1. Equation (1.5) is the equation of a hyperplane in the feature space SZ,. A general linear discriminant computer is shown in Fig. 1.4. If m = 2, on the
:23-Di(X)
xN
+I
f i g . 1.4.
"iN
wi#*l
A linear discriminant computer.
basis of ( l S ) , i, j = 1 , 2 ( i # j ) , a threshold logic device as shown in Fig. 1.5 can be employed as a linear classifier (a classifier using linear
. Fig. 1.5.
A linear two-class classifier.
discriminant functions). From Fig. 1.5, let D ( X ) = & ( X ) - D,(X),if and if
output
=
+I,
i.e., D ( X ) > 0, then
output
=
-1,
i.e., D ( X ) < 0, then X
X
N
w1
(1.6) w2
For the number of pattern classes more than two, m > 2, several threshold logic devices can be connected in parallel, so that the combinations of the outputs from, say, M threshold logic devices will be sufficient for distinguishing m classes when 2^M ≥ m. Or, the general configuration of Figs. 1.2 and 1.4 can also be used.
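The two-class rule (1.6) reduces to a sign test on D(X), and the paragraph above notes that M parallel threshold units can encode up to 2^M classes. A sketch under assumed, illustrative weights:

```python
def threshold_unit(x, w):
    """Threshold logic device: output +1 if w.x + w_{N+1} > 0, else -1."""
    d = sum(wi * xi for wi, xi in zip(w[:-1], x)) + w[-1]
    return 1 if d > 0 else -1

# Rule (1.6): one unit separates two classes
w = [1.0, -1.0, 0.0]                 # illustrative weights (w_1, w_2, w_{N+1})
print(threshold_unit([2.0, 1.0], w))   # D(X) > 0, output +1, so X ~ w1

# With M units in parallel, the vector of +/-1 outputs is a binary code
# that can distinguish up to 2^M classes.
units = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # M = 2, so up to 4 classes
code = tuple(threshold_unit([2.0, -1.0], u) for u in units)
print(code)  # (1, -1): one of the 2^M cells of the partition
```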
B. Minimum-Distance Classifier

An important class of linear classifiers is that using the distances between the input pattern and a set of reference vectors or prototype
points in the feature space as the classification criterion. Suppose that m reference vectors R_1, R_2, ..., R_m are given, with R_j associated with the pattern class ω_j. A minimum-distance classification scheme with respect to R_1, R_2, ..., R_m is to classify the input X as from ω_i, i.e.,

    X ~ ω_i    if |X - R_i| is the minimum    (1.7)

where |X - R_j| is the distance defined between X and R_j. For example, |X - R_i| may be defined as

    |X - R_i| = [(X - R_i)^T (X - R_i)]^{1/2}    (1.8)

where the superscript T represents the transpose operation on a vector. From (1.8),

    |X - R_i|² = X^T X - X^T R_i - R_i^T X + R_i^T R_i    (1.9)

Since X^T X is not a function of i, the corresponding discriminant function for a minimum-distance classifier is essentially

    D_i(X) = X^T R_i + R_i^T X - R_i^T R_i,    i = 1, ..., m    (1.10)

which is linear. Hence, a minimum-distance classifier is also a linear classifier. The performance of a minimum-distance classifier is, of course, dependent upon an appropriately selected set of reference vectors.

C. Piecewise Linear Discriminant Functions
The concept adopted in Section B can be extended to the case of minimum-distance classification with respect to sets of reference vectors. Let R_1, R_2, ..., R_m be the m sets of reference vectors associated with classes ω_1, ω_2, ..., ω_m, respectively, and let the reference vectors in R_j be denoted R_j^(k), i.e.,

    R_j^(k) ∈ R_j,    k = 1, ..., u_j

where u_j is the number of reference vectors in the set R_j. Define the distance between an input feature vector X and R_j as

    |X - R_j| = Min_{k=1,...,u_j} |X - R_j^(k)|    (1.11)

That is, the distance between X and R_j is the smallest of the distances between X and each vector in R_j. The classifier will assign the input to the pattern class which is associated with the closest vector set. If the distance between X and R_i^(k), |X - R_i^(k)|, is defined as in (1.8), then the discriminant function used in this case is essentially

    D_i(X) = Max_{k=1,...,u_i} {X^T R_i^(k) + (R_i^(k))^T X - (R_i^(k))^T R_i^(k)},    i = 1, ..., m    (1.12)

Let

    D_i^(k)(X) = X^T R_i^(k) + (R_i^(k))^T X - (R_i^(k))^T R_i^(k)    (1.13)

Then

    D_i(X) = Max_{k=1,...,u_i} {D_i^(k)(X)},    i = 1, ..., m    (1.14)
It is noted that D_i^(k)(X) is a linear combination of the features; hence, the class of classifiers using (1.12) or (1.14) is often called piecewise linear classifiers [1]. An example of the piecewise linear classifier is the α-perceptron, which is shown in Fig. 1.6.

Fig. 1.6. An α-perceptron.
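The piecewise linear rule (1.11)-(1.14) can be sketched as follows: the distance from X to a class is the minimum distance to any of its reference vectors, and X goes to the closest set. The reference sets below are illustrative assumptions, not data from the text:

```python
import math

def classify_piecewise(x, reference_sets):
    """Assign x to the class whose reference-vector set is closest.
    Minimizing the set distance of (1.11) is equivalent to maximizing
    D_i(X) = max_k D_i^(k)(X) as in (1.12)-(1.14)."""
    def set_distance(R):
        # (1.11): distance to a set is the smallest member distance
        return min(math.dist(x, r) for r in R)
    return min(reference_sets, key=lambda label: set_distance(reference_sets[label]))

# Class "w1" is represented by two prototypes, "w2" by one (illustrative)
refs = {"w1": [[0.0, 0.0], [4.0, 4.0]], "w2": [[2.0, 0.0]]}
print(classify_piecewise([3.9, 4.1], refs))  # nearest prototype belongs to w1
```

With one reference vector per class (u_j = 1), this reduces to the minimum-distance classifier of Section B.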
D. Polynomial Discriminant Functions

An rth-order polynomial discriminant function can be expressed as

    D_i(X) = w_i1 f_1(X) + w_i2 f_2(X) + ... + w_iL f_L(X) + w_{i,L+1}    (1.15)

where f_j(X) is of the form

    x_{k_1}^{n_1} x_{k_2}^{n_2} ... x_{k_r}^{n_r}    for k_1, k_2, ..., k_r = 1, ..., N and n_1, n_2, ..., n_r = 0 and 1    (1.16)

The decision boundary between any two classes is also in the form of an rth-order polynomial. In particular, if r = 2, the discriminant function is called a quadric discriminant function.
In this case,

    f_j(X) = x_{k_1}^{n_1} x_{k_2}^{n_2}    for k_1, k_2 = 1, ..., N and n_1, n_2 = 0 and 1    (1.17)

Typically,

    L = ½ N(N + 3)    (1.19)
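The count in (1.19) can be checked by enumeration: a quadric function has N linear terms, N squared terms, and N(N - 1)/2 cross products, for L = N(N + 3)/2 in total. A small sketch:

```python
from itertools import combinations_with_replacement

def quadric_terms(N):
    """Enumerate the distinct terms of a quadric discriminant function:
    every x_k, every x_k^2, and every cross product x_k x_l (k < l)."""
    linear = [(k,) for k in range(N)]
    quadratic = list(combinations_with_replacement(range(N), 2))
    return linear + quadratic

N = 4
print(len(quadric_terms(N)), N * (N + 3) // 2)  # both 14, agreeing with (1.19)
```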
In general, the decision boundary for quadric discriminant functions is a hyperhyperboloid. Special cases include the hypersphere, the hyperellipsoid, and the hyperellipsoidal cylinder. A general quadric discriminant computer is shown in Fig. 1.7.
Fig. 1.7. A quadric discriminant computer.

1.3 Training in Linear Classifiers
The two-class linear classifier discussed in Section 1.2 can easily be implemented by a single threshold logic device. If the patterns from different classes are linearly separable (can be separated by a hyperplane in the feature space Ω_X), then with correct values of the coefficients or weights w_1, w_2, ..., w_{N+1} in (1.5), the achievement of perfectly correct recognition is possible. However, in practice, the proper values of the weights are usually not available. Under such circumstances, it is proposed that the classifier be designed to have the capability of estimating the best values of the weights from the input patterns. The basic idea is that, by observing patterns with known classifications, the classifier can automatically adjust the weights in order to achieve correct recognitions. The performance of the classifier is supposed to improve as more and more patterns are observed. This process is called training or learning, and the
patterns used as the inputs are called training patterns. Several simple training rules are briefly introduced in this section. Let Y be an augmented feature vector, defined as

    Y = (x_1, x_2, ..., x_N, 1)^T    (1.20)

where X is the feature vector of a pattern. Consider two sets of training patterns T_1' and T_2' belonging to two different pattern classes ω_1 and ω_2, respectively. Corresponding to the two training sets there are two sets of augmented vectors T_1 and T_2; each element in T_1 and T_2 is obtained by augmenting the patterns in T_1' and T_2', respectively. That the two training sets are linearly separable means that a weight vector W (called the solution weight vector) exists such that

    Y^T W > 0    for each Y ∈ T_1

and

    Y^T W < 0    for each Y ∈ T_2    (1.21)

For Y ∈ T_1, Y^T W should be positive. If the output of the classifier is erroneous (i.e., Y^T W < 0) or undefined (i.e., Y^T W = 0), then let the new weight vector be

    W' = W + αY    (1.23)

where α > 0 is called the correction increment. On the other hand, for Y ∈ T_2, Y^T W should be negative. If the output of the device is erroneous (i.e., Y^T W > 0) or undefined, then let

    W' = W - αY    (1.24)
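The error-correction updates (1.23) and (1.24) can be sketched as repeated training passes over two illustrative sets of augmented vectors; the training data and the fixed increment α = 1 are assumptions for the example:

```python
def train_pass(W, T1, T2, alpha=1.0):
    """One pass of error-correction training on augmented vectors.
    Y in T1 should give Y.W > 0; Y in T2 should give Y.W < 0."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    for Y in T1:
        if dot(Y, W) <= 0:                                   # erroneous or undefined
            W = [wi + alpha * yi for wi, yi in zip(W, Y)]    # update (1.23)
    for Y in T2:
        if dot(Y, W) >= 0:
            W = [wi - alpha * yi for wi, yi in zip(W, Y)]    # update (1.24)
    return W

# Augmented training vectors (last component is the appended 1)
T1 = [[2.0, 1.0], [1.5, 1.0]]     # class w1 samples, illustrative
T2 = [[-1.0, 1.0], [-2.0, 1.0]]   # class w2 samples, illustrative
W = [0.0, 0.0]
for _ in range(10):               # repeat passes; here the sets are separable
    W = train_pass(W, T1, T2)
print(all(sum(wi * yi for wi, yi in zip(W, Y)) > 0 for Y in T1))
```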
Before training begins, W may be preset to any convenient values. Three rules for choosing α are suggested:

(i) Fixed increment rule. α is any fixed positive number.

(ii) Absolute correction rule. α is taken to be the smallest integer which will make the value of Y^T W cross the threshold of zero. That is,

    α = the smallest integer > |Y^T W| / (Y^T Y)    (1.25)
(iii) Fractional correction rule. α is chosen such that

    |Y^T W' - Y^T W| = λ |Y^T W|,    0 < λ < 2

For m > 2, the generalized sequential probability ratio test (GSPRT) can be used [5]. At the nth stage, the generalized sequential probability ratios for each pattern class are computed as

    U_n(X/ω_i) = p(x_1, ..., x_n/ω_i) / [∏_{j=1}^{m} p(x_1, ..., x_n/ω_j)]^{1/m}    (1.59)
The U_n(X/ω_i) is then compared with the stopping boundary of the ith pattern class, A(ω_i), and the decision procedure is to reject the pattern class ω_i from further consideration; that is, X is not considered to be in the class ω_i if

    U_n(X/ω_i) < A(ω_i),    i = 1, 2, ..., m    (1.60)

The stopping boundary is determined by the following relationship:

    A(ω_i) = (1 - e_ii) / [∏_{j=1}^{m} (1 - e_ij)]^{1/m},    i = 1, 2, ..., m    (1.61)
After the rejection of pattern class ω_i from consideration, the total number of pattern classes is reduced by one, and a new set of generalized sequential probability ratios is formed. The pattern classes are rejected sequentially until only one is left, which is accepted as the recognized class. The rejection criterion suggested, though somewhat conservative, will usually lead to a high percentage of correct recognition, because only the pattern classes which are the most unlikely to be true are rejected. For two pattern classes, m = 2, the classification procedure (1.60) is equivalent to Wald's SPRT, and the optimality of the SPRT holds. For m > 2, whether the optimal property is still valid remains to be justified. However, the classification procedure is close to optimal in that the average number of feature measurements required to reject a pattern class from further consideration is nearly minimum when two hypotheses (the hypothesis of a pattern class to be rejected and the hypothesis of a class not rejected) are considered. A general block diagram for a sequential recognition system is shown in Fig. 1.9. Computer simulations for English character recognition will be described in Section 2.3.

Fig. 1.9. A sequential pattern recognition system.
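A sketch of the sequential rejection scheme: after each measurement, the generalized ratio of every surviving class is compared with its stopping boundary A(ω_i), and classes that fall below are dropped until one remains. The one-dimensional gaussian likelihoods and the common boundary value are illustrative assumptions, not parameters from the text:

```python
import math

def gsprt(measurements, likelihoods, A):
    """Sequentially reject classes whose generalized probability ratio
    falls below the stopping boundary A[c]; the lone survivor is accepted."""
    classes = list(likelihoods)
    p = {c: 1.0 for c in classes}             # running products p_n(x | c)
    for x in measurements:
        for c in classes:
            p[c] *= likelihoods[c](x)
        m = len(classes)
        geo = math.prod(p[c] for c in classes) ** (1.0 / m)
        # reject every class whose ratio U_n = p_n / geo falls below A
        classes = [c for c in classes if p[c] / geo >= A[c]]
        if len(classes) == 1:
            return classes[0]
    return classes                            # classes still undecided, if any

# Illustrative one-dimensional gaussian likelihoods for three classes
def gauss(mu):
    return lambda x: math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

lik = {"w1": gauss(0.0), "w2": gauss(2.0), "w3": gauss(4.0)}
A = {c: 0.5 for c in lik}                     # illustrative common boundary
print(gsprt([0.1, -0.2, 0.3, 0.1], lik, A))
```

As in the text, the most distant classes are rejected first, and the number of measurements used depends on the input itself.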
A pattern classifier using a standard sequential decision procedure, SPRT or GSPRT, may be unsatisfactory because: (i) an individual classification may require more feature measurements than can be tolerated; and (ii) the average number of feature measurements may become extremely large if the e_ij's are chosen to be very small. In practical situations, it may become virtually necessary to interrupt the standard procedure and resolve among various courses of action. This can be achieved by truncating the sequential process at n = N. For example, the truncated sequential decision procedure
for the SPRT will be the following. Carry out the regular SPRT until either a terminal decision is made or stage N of the process is reached. If no decision has been reached at stage N, decide X ~ ω_1 if λ_N ≥ 1, or decide X ~ ω_2 if λ_N < 1. In a pattern classifier using the truncated GSPRT, at n = N the input pattern is classified as belonging to the class with the largest generalized sequential probability ratio. Under the truncated procedure the process must terminate in at most N stages. Truncation is a compromise between an entirely sequential procedure and a classical, fixed-sample-size decision procedure such as (1.35). It is an attempt to reconcile the good properties of both procedures: the sequential property of examining measurements as they accumulate, and the classical property of guaranteeing that the tolerances will be met with a specified number of available measurements.
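The truncated two-class procedure can be sketched as follows; the likelihoods, the boundaries A and B, and the truncation stage N are illustrative assumptions:

```python
def truncated_sprt(measurements, lik1, lik2, A, B, N):
    """Wald's SPRT truncated at stage N: stop when the probability ratio
    crosses A (accept w1) or B (accept w2); if still undecided at stage N,
    force a terminal decision by comparing the ratio with 1."""
    ratio = 1.0
    for x in measurements[:N]:
        ratio *= lik1(x) / lik2(x)
        if ratio >= A:
            return "w1"
        if ratio <= B:
            return "w2"
    # forced terminal decision at the truncation stage
    return "w1" if ratio >= 1.0 else "w2"

# Illustrative binary-feature likelihoods: w1 favors x = 1, w2 favors x = 0
lik1 = lambda x: 0.8 if x == 1 else 0.2
lik2 = lambda x: 0.3 if x == 1 else 0.7
print(truncated_sprt([1, 0, 1], lik1, lik2, A=50.0, B=0.02, N=3))
```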
G_q, then, for any set of features F, the percentage of correct recognition using f_p and F must be greater than the percentage of correct recognition using f_q and F. (iii) The percentage of correct recognition using F is a linear function of the sum of the G_j values for the features in F. Since no single-number statistic satisfies either (ii) or (iii) in general, a statistic which at least satisfies (ii) and (iii) over a fairly wide range of situations (not all) is proposed. The requirement that G_j be a single number suggests that G_j may be selected as an expected value of some function. Assuming statistical independence among the feature measurements, it is suggested that
    G_j = Σ_{i=1}^{m} Σ_{k=1}^{v_j} P[ω_i, f_j(k)] log γ{P[ω_i, f_j(k)]}    (2.1)

A logarithmic function is selected because of the additive property of G_j required by (ii). In view of (i), γ should be a measure of the correlation between f_j and ω_i. The proposed γ is

    γ{P[ω_i, f_j(k)]} = P[ω_i, f_j(k)] / {P(ω_i) P[f_j(k)]}    (2.2)

Thus

    G_j = Σ_{i=1}^{m} Σ_{k=1}^{v_j} P[ω_i, f_j(k)] log {P[ω_i, f_j(k)] / (P(ω_i) P[f_j(k)])}    (2.3)

From (2.2), G_j can be interpreted as the mutual information of the feature f_j and the pattern classes ω_1, ..., ω_m [9].
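The statistic (2.3) can be sketched for discrete features: G_j is the mutual information computed from the joint table P[ω_i, f_j(k)]. The two joint tables below are illustrative, one for a feature correlated with the class and one for a feature independent of it:

```python
import math

def mutual_information(joint):
    """G_j of (2.3): sum over classes i and feature values k of
    P[w_i, f(k)] * log( P[w_i, f(k)] / (P(w_i) P[f(k)]) )."""
    classes = range(len(joint))
    values = range(len(joint[0]))
    p_class = [sum(joint[i]) for i in classes]                      # marginals P(w_i)
    p_value = [sum(joint[i][k] for i in classes) for k in values]   # marginals P[f(k)]
    G = 0.0
    for i in classes:
        for k in values:
            if joint[i][k] > 0:
                G += joint[i][k] * math.log(joint[i][k] / (p_class[i] * p_value[k]))
    return G

# Illustrative joint tables P[w_i, f_j(k)] for two candidate features:
# feature A is informative about the class, feature B is independent of it
feature_A = [[0.4, 0.1], [0.1, 0.4]]
feature_B = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_information(feature_A) > mutual_information(feature_B))  # True
```

Ordering features by descending G_j then puts the informative feature first, as the criterion intends.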
The application of divergence as a criterion for feature selection and ordering has been proposed by Marill and Green [2]. Assume that, for ω_i, X is distributed according to a multivariate gaussian density function with mean vector M_i and covariance matrix K, i.e.,

    p(X/ω_i) = [(2π)^{N/2} |K|^{1/2}]^{-1} exp[-½ (X - M_i)^T K^{-1} (X - M_i)]    (2.4)

Let the likelihood ratio be

    λ = p(X/ω_i) / p(X/ω_j)    (2.5)

and let

    L = log λ = log p(X/ω_i) - log p(X/ω_j)    (2.6)

Substituting (2.4) into (2.6), we obtain

    L = X^T K^{-1} (M_i - M_j) - ½ (M_i + M_j)^T K^{-1} (M_i - M_j)    (2.7)

and

    E[L/ω_i] = ½ (M_i - M_j)^T K^{-1} (M_i - M_j)    (2.8)

Define the divergence between ω_i and ω_j as [3]

    J(ω_i, ω_j) = E[L/ω_i] - E[L/ω_j]    (2.9)

Then, from (2.8) and (2.9),

    J(ω_i, ω_j) = (M_i - M_j)^T K^{-1} (M_i - M_j)    (2.10)
It is noted that in (2.10), if K = I, the identity matrix, then J(ω_i, ω_j) represents the squared distance between M_i and M_j. If a fixed-sample-size or nonsequential Bayes decision rule is used for the classifier, then for P(ω_i) = P(ω_j) = ½, from (1.27),

    X ~ ω_i    if λ > 1, or L > 0
    X ~ ω_j    if λ < 1, or L < 0
The probability of misrecognition is

    e = ½ P[L > 0/ω_j] + ½ P[L < 0/ω_i]    (2.11)
From (2.7), (2.8), and (2.10), it is concluded that p(L/ω_i) is a gaussian density function with mean ½J and variance J, where J = J(ω_i, ω_j).
Similarly, p(L/ω_j) is also a gaussian density function with mean -½J and variance J. Thus,

    e = ∫_0^∞ (2πJ)^{-1/2} exp[-½ (t + ½J)² / J] dt    (2.12)

Let

    y = (t + ½J) / √J    (2.13)

then

    e = ∫_{√J/2}^∞ (2π)^{-1/2} exp[-½ y²] dy    (2.14)
It is noted that, from (2.14), e is a monotonically decreasing function of J ( w c , mi).Therefore, features selected or ordered according to the magnitude of J(wi , wi)will imply their corresponding discriminatory power between wt and w j . For more than two pattern classes, the criterion of maximizing the minimum divergence or the expected divergence between any pair of classes has been proposed for signal detection and pattern recognition problems [4]-[6]. The expected divergence between any pair of classes is given by
For the distributions given in (2.4),
Let

d² = Min_{i≠j} J(ω_i, ω_j)   (2.17)

then (2.18)
2. FEATURE SELECTION AND FEATURE ORDERING
Hence (2.19)
The tightest upper bound of d occurs when 1 - Σ_{i=1}^m [P(ω_i)]² is maximum. This maximum is 1 - (1/m), which yields (2.20)
The bound indicated in (2.20) can be achieved by taking various combinations of features from a given feature set, or, alternatively, by gradually increasing the number of features N such that the feature subset selected corresponds to the case where d² is closest to mJ(ω)/(m - 1). In general, there may be more than one feature subset which satisfies the criterion. In sequential recognition systems, since the features are measured sequentially, a slightly different approach with a similar viewpoint from information theory can be used for "on-line" ordering of features [6]. In the application of SPRT or GSPRT for classification, the knowledge of which pattern classes are more likely to be true (at the input of the recognition system) is used to determine the "goodness" of features. Let η be the number of features available at any stage of the sequential process, η ≤ N, and let f_j, j = 1,..., η, be the jth feature. The criterion for choosing the feature to measure next, following Lewis' approach, is a single-number statistic which is an expectation of a function describing the correlation among pattern classes, previous feature measurements, and each of the remaining features. Such a statistic associated with f_j, after n (noisy) feature measurements x₁, x₂,..., x_n have been taken, can be expressed as
which is a decreasing function of n. Since σ₁² ≥ σ₂² ≥ ⋯ ≥ σ_k² ≥ σ²_{k+1} ≥ ⋯, the complete ordering of feature measurements according to the descending order of eigenvalues will produce a smaller error with respect to any other ordering when the recognition process terminates at a finite number of measurements. Also, since the proposed procedure is independent of the classification scheme used in the recognition system, the problem of selecting a feature subset from a given set of features can be viewed as a subproblem of feature ordering. The procedure of completely ordering the coordinate vectors allows us to select a subset of r (r ≤ N) feature measurements with minimized mean-square error by simply picking out the first r coordinate vectors in the resulting generalized Karhunen-Loève system. A computational example will be given in Section 2.3 to illustrate the procedure.
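The eigenvalue-ordering idea can be sketched as follows on synthetic data (everything below is illustrative): the coordinate vectors are the eigenvectors of the sample covariance, ordered by descending eigenvalue, and dropping the trailing vectors leaves a mean-square error approximately equal to the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data whose coordinates have decreasing variance.
X = rng.normal(size=(500, 6)) * np.array([3.0, 2.0, 1.0, 0.5, 0.3, 0.1])

K = np.cov(X, rowvar=False)                  # sample covariance
vals, vecs = np.linalg.eigh(K)               # eigenvalues in ascending order
order = np.argsort(vals)[::-1]               # reorder: descending eigenvalues
vals, vecs = vals[order], vecs[:, order]

r = 3
P = vecs[:, :r]                              # first r coordinate vectors
mu = X.mean(axis=0)
X_hat = (X - mu) @ P @ P.T + mu              # rank-r reconstruction

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, vals[r:].sum())                   # approximately equal
```

The near-equality of the two printed numbers is the minimized mean-square-error property claimed for the generalized Karhunen-Loève ordering.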
Let {X(t), t ≥ 0} be the stochastic process measured in the feature space; the classifier wishes to decide, as soon as possible, whether {X(t)} is {X₁(t)} or {X₂(t)}. Let t_T be the time when the classifier reaches a terminal decision. In general, t_T is a random variable. Let E_i(t_T) denote the expected value of t_T when {X(t)} = {X_i(t)}, i = 1, 2. Subject to the requirement that when {X(t)} = {X_i(t)} the probability of an incorrect classification will be at most e_{ji}, j (≠ i) = 1, 2, the problem is to give a decision procedure for classifying between {X₁(t)} and {X₂(t)} such that E_i(t_T) is a minimum for i = 1, 2. This is simply the same formulation for stochastic processes with continuous time parameter as that originally given by Wald for stochastic processes with discrete time parameter. Assume that the stochastic processes associated with the two pattern classes satisfy the following condition: for every t ≥ 0, X(t) is a sufficient statistic for the process; that is, given X(t), the conditional distribution of X(τ), 0 ≤ τ ≤ t, is (with probability 1) the same for the processes {X₁(t)} and {X₂(t)}.
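In discrete time, Wald's test of this kind accumulates the log-likelihood ratio and stops at two fixed boundaries. A minimal sketch with unit-variance gaussian observations (the parameters and boundary constants A and B here are chosen arbitrarily for illustration, not taken from the text):

```python
import math
import random

def wald_sprt(samples, logpdf1, logpdf2, A, B):
    """Accumulate the log-likelihood ratio; stop at log(A) or log(B).

    Returns (decided class, number of observations used)."""
    llr = 0.0
    for n, x in enumerate(samples, 1):
        llr += logpdf1(x) - logpdf2(x)
        if llr >= math.log(A):
            return 1, n                      # accept hypothesis 1
        if llr <= math.log(B):
            return 2, n                      # accept hypothesis 2
    return (1 if llr > 0 else 2), len(samples)   # forced terminal decision

def gauss_logpdf(mean):
    # unit-variance gaussian log density, constant term dropped
    return lambda x: -0.5 * (x - mean) ** 2

random.seed(0)
obs = [random.gauss(1.0, 1.0) for _ in range(100)]    # truth: class 1
decision, n_used = wald_sprt(obs, gauss_logpdf(1.0), gauss_logpdf(-1.0),
                             A=99.0, B=1.0 / 99.0)
print(decision, n_used)
```

Because the expected increment of the ratio is positive under class 1, the test typically terminates after only a few observations, which is the economy of observations the sequential formulation is after.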
< e₁₂, since γ₁E₁(t_T)/[T + γ₁E₁(t_T)] is always positive. In this formulation, the modified SPRT with continuous time parameter essentially includes the standard Wald SPRT with discrete time parameter as a special case in which g₁(t) and g₂(t) are constants and t is restricted to the nonnegative integers {0, 1, 2,...}. Also, because of the use of a continuous time parameter, some of the approximate relations due to Wald (obtained by neglecting the excess over the boundaries) become exact with probability 1.

3.4 Procedure of Modified Generalized Sequential Probability Ratio Test
Generally speaking, the principle of constructing the time-varying stopping boundaries for Wald's SPRT also applies to the generalized sequential probability ratio test [7] when the number of pattern classes to be recognized is more than two. In the following, the procedure of modified GSPRT (with time-varying stopping boundaries) for continuous time parameter is described. The case for discrete time parameter can be derived analogously. Let {X_i(t), t ≥ 0}, i = 1, 2,..., m, be the hypothesized stochastic process associated with the ith pattern class ω_i, whose probability density function is p(X(t)/ω_i). The classifier continuously measures a stochastic process {X(t), t ≥ 0} at its input and decides, as soon as possible, to classify the input stochastic process as one of the m possible stochastic processes. In a modified GSPRT, the generalized sequential probability ratio for each pattern class is computed upon the measurement of X(t), at time instant t,
and is compared with the stopping boundary g_i(t), i = 1,..., m. As soon as

U(X(t)/ω_i) < g_i(t)   (3.34)

the pattern class ω_i is dropped from consideration, and the number of possible pattern classes is reduced by one for the next computation. The process of forming the generalized sequential probability ratio continues until only one pattern class is retained; this pattern class is then assigned to the input. Note that the stopping boundaries g_i(t), i = 1,..., m, are, in general, functions of time and need not be identical for all classes. Similar to the ones suggested for the modified SPRT, a simple class of convergent boundaries may assume the form
(3.35)

In fact, the spirit of the modified GSPRT relies on an optimal construction of these functions such that all the pattern classes but one are dropped from consideration by a prespecified time T. It remains to determine the error probabilities e_{ij} and the expected termination time E_i(t_T) in terms of the design parameters, such as T, r_i, etc. Following the approach taken by Reed, the modified GSPRT defined in (3.33) and (3.34) may be viewed as a special Markov process with continuous time parameter. The probability aspects of the modified GSPRT are not as yet completely known. Instead, in the next section, an algorithmic construction of the time-varying stopping boundaries will be given, and experimental results
3. FINITE SEQUENTIAL CLASSIFICATION-FORWARD PROCEDURE

will be used to illustrate how the desirable performance resulting from the modified SPRT may also be achieved in the case of the modified GSPRT.

3.5 Experiments in Pattern Classification
The modified SPRT and the modified GSPRT described in the previous sections have been applied to the classification of the handwritten English characters a, b, and c. Sixty samples of each character were processed in establishing the necessary statistics for the construction of suitable mathematical models. The same eighteen features used in the experiments in Section 2.3 were used here. Each input pattern was represented by a sequence of eighteen measurements, denoted by a feature vector in the 18-dimensional feature space. The measurements in the examples described below were discretized into ten possible values simply so that they could be easily simulated on a digital computer, with the understanding that the results apply without any modification to stochastic processes with discrete time parameter.

Experiment 1

The feature distributions for each pattern class are assumed to be multivariate gaussian. Let p_n(X/ω_i), i = 1, 2, 3, represent the multivariate gaussian densities for characters a, b, and c, respectively, at the nth stage of the sequential classification process. X is an n-dimensional feature vector denoting the successive measurements (x₁, x₂,..., x_n), n ≤ N = 18. This is the case in which the classification process terminates in no more than eighteen feature measurements. Specifically,
for n ≥ 0 and i = 1, 2, 3. The classification procedure of the modified GSPRT is to drop the class ω_i from consideration at the nth stage if U_n(X/ω_i) < g_i(n).

4. SEQUENTIAL RECOGNITION-BACKWARD PROCEDURE

The optimal Bayes sequential decision procedure which minimizes the expected risk, including the cost of observations, is essentially a backward procedure [1]. It is intended to show in this chapter that, as an alternative approach to the modified sequential probability ratio test, dynamic programming [2]-[8] provides a feasible computational technique for a class of sequential recognition systems with finite stopping rules. The intuitive argument for using dynamic programming in finite sequential recognition problems can be stated as follows. Consider a sequential decision process with observations taken one at a time. Each stage of the process is a decision problem involving both the choice of closing the sequence of observations and making a terminal decision, and the choice of taking an additional observation. It is easy to determine the expected risk involved in the decision when the procedure is terminated, but it is not easy to find the expected risk involved in taking an additional observation. For the case of taking one more observation, the expected risk is that of continuing and then doing the best possible from then on. Consequently, in order to determine the best decision at the present stage
(i.e., whether to continue the process or not), it is necessary to know the best decision in the future. In other words, as far as seeking the optimal decision procedure is concerned, the natural time order of working from the present to the future is of little use, because the present optimum essentially involves the future optimum. The only alternative that preserves true optimality is to work backwards in time, i.e., to deduce the optimal present behavior from the optimal future behavior, and so on back into the past. The entire available future must be considered in deciding whether to continue the process or not, and the method of dynamic programming provides just such an optimization procedure, working backwards from a prespecified last stage to the very first stage. In problems of sequential recognition where the decision procedure must terminate at a finite number of observations, the termination point can be used as a convenient starting point (i.e., the last stage) for backward computation.

4.2 Mathematical Formulation and Basic Functional Equation
The way in which dynamic programming is carried out in the finite optimal sequential decision procedure is by applying the principle of optimality. As stated by Bellman [2], "an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." In essence, this is equivalent to saying that if an optimal policy is pursued, then at each stage of the sequential process the remaining decisions must themselves form an optimal policy from the state reached to the terminal point of the process. Consider the successive observations or feature measurements x₁, x₂,..., x_n, n = 1, 2,..., with known distribution function of x_{n+1} given the sequence x₁,..., x_n, P(x_{n+1} | x₁,..., x_n). After the observation of each feature measurement, the decisions available to the classifier include both the choice of closing the sequence of feature measurements and making a terminal decision (deciding the pattern class on the basis of the observed feature measurements), and the choice of observing the next feature measurement before coming to a terminal decision. Let

p_n(x₁, x₂,..., x_n) be the minimum expected risk of the entire sequential decision process, having observed the sequence of feature measurements x₁, x₂,..., x_n;

C(x₁, x₂,..., x_n) be the cost of continuing the sequential process at the nth stage, i.e., of taking an additional feature measurement x_{n+1};

R(x₁, x₂,..., x_n; d_i) be the risk of making terminal decision d_i (i.e., the ith pattern class is accepted by the classifier), i = 1, 2,..., m, on the basis of the feature measurements x₁, x₂,..., x_n.
If the classifier decides to stop the process, the expected risk, employing an optimal decision rule, is Min_i R(x₁, x₂,..., x_n; d_i). If the classifier decides to continue the process and take one more feature measurement x_{n+1}, the expected risk is

C(x₁, x₂,..., x_n) + ∫ p_{n+1}(x₁, x₂,..., x_n, x_{n+1}) dP(x_{n+1} | x₁,..., x_n)

where the integration is carried over the admissible region of x_{n+1}. Hence, by the principle of optimality, the basic functional equation governing the infinite sequence of expected risks p_n(x₁, x₂,..., x_n), n = 1, 2,..., is

p_n(x₁, x₂,..., x_n) = Min { Continue: C(x₁,..., x_n) + ∫ p_{n+1}(x₁,..., x_n, x_{n+1}) dP(x_{n+1} | x₁,..., x_n);  Stop: Min_i R(x₁,..., x_n; d_i) }   (4.1)
In the case of finite sequential decision processes, where a terminal decision has to be made at or before a preassigned stage number N (for example, when only a maximum of N feature measurements is available for observation), the optimal stopping rule can be determined backwards, starting from the given risk function (or specified error probabilities) of the last stage. That is, at the Nth stage let

p_N(x₁, x₂,..., x_N) = Min_i R(x₁, x₂,..., x_N; d_i)   (4.2)

and compute the expected risk for stage numbers less than N through the functional equation (4.1). Specifically, starting with the known
(or given) value of p_N(x₁, x₂,..., x_N) in (4.2), we have at the (N-1)th stage

p_{N-1}(x₁, x₂,..., x_{N-1}) = Min { Continue: C(x₁, x₂,..., x_{N-1}) + ∫ p_N(x₁,..., x_N) dP(x_N | x₁,..., x_{N-1});  Stop: Min_i R(x₁,..., x_{N-1}; d_i) }   (4.3)

in which p_N(x₁,..., x_N) is obtained from (4.2). At the (N-2)th stage,

p_{N-2}(x₁, x₂,..., x_{N-2}) = Min { Continue: C(x₁, x₂,..., x_{N-2}) + ∫ p_{N-1}(x₁,..., x_{N-1}) dP(x_{N-1} | x₁,..., x_{N-2});  Stop: Min_i R(x₁,..., x_{N-2}; d_i) }   (4.4)

in which p_{N-1}(x₁,..., x_{N-1}) is obtained from (4.3). At the second stage,

p₂(x₁, x₂) = Min { Continue: C(x₁, x₂) + ∫ p₃(x₁, x₂, x₃) dP(x₃ | x₁, x₂);  Stop: Min_i R(x₁, x₂; d_i) }   (4.5)

in which p₃(x₁, x₂, x₃) is obtained from the third stage. At the first stage,
in which p₂(x₁, x₂) is obtained from (4.5). One can easily see the computational difficulty arising from this formulation. Aside from the memory locations necessary for estimating the high-order conditional probabilities, the storage required in a computer for calculating the risk functions alone is already enormous. For example, suppose there are eight feature measurements available for successive observation (N = 8) and each measurement can take on one of ten values (discrete case); in order to solve (4.1) through the recursive equations just described, the total storage required for storing all the possible risk functions p_n(x₁, x₂,..., x_n), n = 1, 2,..., 8, is 10 + 10² + ⋯ + 10⁸. Because of this type of computational difficulty, methods for reducing the storage requirement are of major concern in designing a truly sequential recognition system with an optimal stopping rule. This is the subject of discussion in the next section.
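The backward recursion (4.1)-(4.2) can be sketched for a toy discrete problem. Everything below is illustrative (two classes, three feature values, arbitrary class-conditional tables and observation cost), and risks are kept in unnormalized joint-probability form so that the integral in (4.1) becomes a plain sum over the next value:

```python
r, N = 3, 4                       # r discrete feature values, at most N stages
C = 0.05                          # cost of taking one more measurement
classes = (0, 1)
prior = {0: 0.5, 1: 0.5}
# illustrative class-conditional probabilities of each discrete value
p_obs = {0: (0.6, 0.3, 0.1), 1: (0.1, 0.3, 0.6)}

def joint(hist, w):
    """P(class w and measurement history hist)."""
    pr = prior[w]
    for x in hist:
        pr *= p_obs[w][x]
    return pr

def stop_risk(hist):
    # Bayes stopping risk with (0,1) loss: the losing class's joint probability
    return min(joint(hist, w) for w in classes)

memo = {}
def rho(hist):
    """Minimum expected risk of eq. (4.1), in unnormalized form."""
    if hist in memo:
        return memo[hist]
    if len(hist) == N:            # terminal stage, eq. (4.2)
        memo[hist] = stop_risk(hist)
        return memo[hist]
    cont = C * sum(joint(hist, w) for w in classes)   # measurement cost
    cont += sum(rho(hist + (x,)) for x in range(r))   # expected future risk
    memo[hist] = min(stop_risk(hist), cont)
    return memo[hist]

print(rho(()))                    # minimum expected risk before any measurement
```

The memo table indexed by the full history has 1 + r + r² + ⋯ + r^N entries, which is exactly the storage blow-up described above.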
4.3 Reduction of Dimensionality
4.3.1 USE OF SUFFICIENT STATISTICS

The first possible solution to reduce the dimensionality is the use of sufficient statistics in describing the recognition process under consideration. Let each feature measurement assume one of the r discrete values E₁, E₂,..., E_r (a quantization of the feature space). Assume that the features of each pattern class are characterized by a multinomial distribution, i.e., for each ω_i, i = 1,..., m, there exists a probability function

P(k₁, k₂,..., k_r / ω_i) = [n!/(k₁! k₂! ⋯ k_r!)] p_{i1}^{k₁} p_{i2}^{k₂} ⋯ p_{ir}^{k_r}   (4.7)

where p_{ij} is the probability of occurrence of E_j for class ω_i, Σ_{j=1}^r p_{ij} = 1, and k_j is the number of occurrences of E_j, Σ_{j=1}^r k_j = n. Since the statistic (k₁, k₂,..., k_r; n) is sufficient to characterize the multinomial distribution, it is reasonable to assume that only the number of occurrences of E_j, namely k_j, j = 1, 2,..., r, and not its order, is important in making a decision. Then the functional equation (4.1) becomes
p_n(k₁, k₂,..., k_r) = Min { Continue: C(k₁, k₂,..., k_r) + Σ_{i=1}^m P(ω_i) Σ_{j=1}^r p_{ij} p_{n+1}(k₁,..., k_j + 1,..., k_r);  Stop: Min_i R(k₁,..., k_r; d_i) }   (4.8)
where P(ω_i) is the a priori probability for class ω_i. Specifically, at the Nth stage,

p_N(k₁, k₂,..., k_r) = Min_i R(k₁, k₂,..., k_r; d_i)   (4.9)

at the (N-1)th stage,

p_{N-1}(k₁, k₂,..., k_r) = Min { Continue: C(k₁, k₂,..., k_r) + Σ_{i=1}^m P(ω_i) Σ_{j=1}^r p_{ij} p_N(k₁,..., k_j + 1,..., k_r);  Stop: Min_i R(k₁, k₂,..., k_r; d_i) }   (4.10)

and at the first stage,

p₁(k₁, k₂,..., k_r) = Min { Continue: C(k₁, k₂,..., k_r) + Σ_{i=1}^m P(ω_i) Σ_{j=1}^r p_{ij} p₂(k₁,..., k_j + 1,..., k_r);  Stop: Min_i R(k₁, k₂,..., k_r; d_i) }   (4.11)
The risk function p_n(k₁, k₂,..., k_r) is then determined for each and every sequence k₁, k₂,..., k_r with Σ_j k_j = n, n = 1, 2,..., N. In addition, the optimal stopping rule is determined at each stage: if the risk of stopping is less than the expected risk of continuing for a given history of feature measurements, the sequential process is terminated. The actual optimal structure of the resulting procedure is obtained in the course of solving the functional equation (4.8). In solving (4.8), it is also required to compute the minimum termination risk Min_i R(k₁, k₂,..., k_r; d_i) at each stage. The Bayes decision rule is employed here to illustrate the computation procedure, although in practice other optimal decision rules may be chosen according to the statistical knowledge at hand. Let L(ω_i, d_j) be the loss incurred by making the terminal decision d_j when the input pattern is really from class ω_i. Then the risk of deciding that the pattern belongs to class ω_j, having observed the joint event [k₁, k₂,..., k_r], can be written as

R(k₁, k₂,..., k_r; d_j) = Σ_{i=1}^m P(ω_i) L(ω_i, d_j) P(k₁,..., k_r | ω_i)   (4.12)
The quantity Min_i R(k₁, k₂,..., k_r; d_i) is in fact the risk attained when the sequential process stops. It is worth noting that, similarly to the discussion in Section 1.4, in the case of the (0, 1) loss function, i.e.,

L(ω_i, d_j) = 0   if i = j
            = 1   if i ≠ j   (4.13)

the decision procedure reduces to: decide d_j if

P(ω_j) P(k₁, k₂,..., k_r | ω_j) > P(ω_i) P(k₁, k₂,..., k_r | ω_i)   for all i ≠ j   (4.14)

and the risk attained is R(k₁, k₂,..., k_r; d_j). The reduction of dimensionality is achieved through the assumption of independent measurements implied by ignoring the order of occurrence of the E_j. This assumption reduces the storage requirement from Σ_{n=1}^N r^n to Σ_{n=1}^N C(n + r - 1, r - 1), by exploiting the constraints Σ_{j=1}^r k_j = n and n ≤ N. Detailed results on this type of reduction are given in Appendix D.
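The storage comparison above is easy to check directly; a short sketch for the earlier N = 8, r = 10 example:

```python
from math import comb

def table_sizes(N, r):
    """Risk-table sizes over N stages with r discrete feature values:
    full-history indexing versus count-vector (sufficient statistic) indexing."""
    full_history = sum(r ** n for n in range(1, N + 1))
    count_vectors = sum(comb(n + r - 1, r - 1) for n in range(1, N + 1))
    return full_history, count_vectors

print(table_sizes(8, 10))   # (111111110, 43757)
```

The sufficient-statistic formulation shrinks the table by more than three orders of magnitude in this case.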
Therefore, by the principle of optimality, the basic functional equation governing the sequential recognition process becomes

p_n(x₁,..., x_n | F_{t_n}) = Min { Continue: Min_{f_{t_{n+1}} ∈ F̄_n} [C(x₁,..., x_n) + Σ_{x_{n+1}} p_{n+1}(x₁,..., x_n, x_{n+1} | F_{t_n}, f_{t_{n+1}}) p(x_{n+1}; f_{t_{n+1}} | x₁,..., x_n; F_{t_n})];  Stop: Min_i R(x₁,..., x_n; d_i | F_{t_n}) }   (4.20)
Again, (4.20) can be solved recursively by setting the terminal condition

p_N(x₁,..., x_N | F_{t_N}) = Min_i R(x₁,..., x_N; d_i | F_{t_N})   (4.21)

and computing the risk functions p_n, n < N, backwards. The major difference between the solution of (4.20) and that of (4.1) lies in the fact that the optimal stopping rules obtained from the present solution are automatically accompanied by a best sequence of features capable of minimizing the expected risk upon termination.

4.6 Experiments in Feature Ordering and Pattern Classification
To test the formulation and the optimality of the procedure outlined in Section 4.5, the English character recognition problem described in Section 4.4 was again used. Only three pattern classes, D, J, and P, were considered, each represented by thirty-six samples, which were processed both to obtain the probability distributions used and to test the technique. As in the example in Section 4.4, eight radial intersection measurements quantized into twenty quanta were used as features, and a histogram procedure was employed to estimate the probability that a given feature falls in a given quantum, conditioned on the fact that a particular character was measured. All feature measurements were assumed statistically independent. In order to reduce the dimensionality, the Bayesian statistic (the a posteriori probability) was used. At each stage of the process, the conditional probability that the sample came from each class was calculated, given the past history of feature measurements. That is, after x₁ was measured,
where f_{t₁} is the feature the classifier selected to measure and x₁ is the outcome of the measurement. These quantities are then used as the a priori probabilities for the next stage (the second stage in this case) of the process. The procedure can be formulated recursively: at the nth stage,
Thus it can be seen that, by using this procedure, all information provided by the past history of feature selection and measurement outcomes is contained in the a posteriori probabilities calculated by (4.23). The classifying decision at the final stage depends on the a posteriori probability of occurrence of each class, having measured all eight features, in addition to the loss due to misrecognition. For computational purposes each a posteriori probability was quantized into twenty equal divisions. Thus the probability space was quantized into a total of 210 quanta, as shown in Fig. 4.3. The loss due to misrecognition was assumed equal to one in all cases, i.e.,
L(ω_i, d_j) = 0,   i = j
            = 1,   i ≠ j
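The recursive use of each stage's a posteriori probabilities as the next stage's a priori probabilities, as described above, can be sketched as follows; the likelihood values are illustrative, not taken from the experiment:

```python
def update_posterior(prior, likelihood):
    """One stage of the recursive Bayes update: this stage's posterior
    becomes the next stage's prior."""
    joint = {w: prior[w] * likelihood[w] for w in prior}
    total = sum(joint.values())
    return {w: joint[w] / total for w in joint}

# Illustrative per-class likelihoods of the observed quanta (made up).
posterior = {"D": 1 / 3, "J": 1 / 3, "P": 1 / 3}
for lik in ({"D": 0.30, "J": 0.05, "P": 0.20},
            {"D": 0.25, "J": 0.10, "P": 0.15}):
    posterior = update_posterior(posterior, lik)

print(posterior)
```

After the measurements, the posterior vector summarizes the entire measurement history, which is exactly the dimensionality reduction the Bayesian statistic provides.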
Fig. 4.3. (a) The classification decision boundary; the letter indicates the decision to be made. (b) Expected cost of making a classifying decision.
The cost of a feature measurement is 0.01 per measurement. In all experiments the a priori probability of each class was taken to be one-third. The expected risk of making a decision for the various a posteriori probabilities is printed in the corresponding quantum. The decision boundary diagram shown in Fig. 4.3(a) is interpreted as follows: if the a posteriori probabilities fall in a quantum labeled D, J, or P, the input pattern is classified as a D, J, or P, respectively. The same quantizations were used at every stage of the process, including the calculation of decision boundaries for the selection of features. Detailed illustration of these computations is given in Appendix E.

Three experiments were performed. Their purpose was to allow a verification of the optimal properties of the proposed procedure and a comparison of the results obtained from the proposed procedure and from other statistical classification procedures.

Experiment 1: Sequential classification with feature ordering.
Experiment 2: Nonsequential Bayes classification using all eight features.
Experiment 3: Sequential classification without feature ordering.

Table 4.2 summarizes the results concerning the accuracy of recognition and the number of feature measurements required for classification. Table 4.3 indicates the costs of the various classification procedures. The results of the three experiments are summarized and discussed as follows.

(1) The same percentage of correct recognition is obtained for all three classification procedures. In fact, it turned out that the misrecognitions were made on exactly the same patterns.

(2) From Table 4.2, it should be noted that even though the sequential classification procedure without feature ordering required fewer feature measurements to classify patterns from class J than the sequential procedure with feature ordering, it required more measurements over the entire process. It appears that the sequential procedure with feature ordering may give poorer performance in some particular cases, but on the average over the entire process it produces better results. This is expected, since the optimization was carried out over the entire process.
Table 4.2

ACCURACY OF CLASSIFICATION

(i) Experiment 1

                  No. of patterns classified as
True class         D     J     P    % correct   Required measurements
D                 33     0     3      91.6            147
J                  0    36     0     100               82
P                  7     0    29      80.6            135
Overall accuracy: 90.7%.  Total no. of required measurements: 364.

(ii) Experiment 2

True class         D     J     P    % correct   Required measurements
D                 33     0     3      91.6            288
J                  0    36     0     100              288
P                  7     0    29      80.6            288
Overall accuracy: 90.7%.  Total no. of required measurements: 864.

(iii) Experiment 3

True class         D     J     P    % correct   Required measurements
D                 33     0     3      91.6            187
J                  0    36     0     100               61
P                  7     0    29      80.6            189
Overall accuracy: 90.7%.  Total no. of required measurements: 436.
Table 4.3

COSTS OF CLASSIFICATION PROCESSES

(i) Classification of 36 samples of D (class ω₁)

                                             Exp. 1    Exp. 2    Exp. 3
No. of required measurements                   147       288       187
Cost of measurements                          1.47      2.88      1.87
Expected risks of 36 classifying decisions    1.67      1.87      1.43
Combined total cost                           3.14      4.75      3.30

(ii) Classification of 36 samples of J (class ω₂)

                                             Exp. 1    Exp. 2    Exp. 3
No. of required measurements                    82       288        61
Cost of measurements                          0.82      2.88      0.61
Expected risks of 36 classifying decisions    0.90      0.90      0.90
Combined total cost                           1.72      3.78      1.51

(iii) Classification of 36 samples of P (class ω₃)

                                             Exp. 1    Exp. 2    Exp. 3
No. of required measurements                   135       288       189
Cost of measurements                          1.35      2.88      1.89
Expected risks of 36 classifying decisions    3.075     3.025     3.275
Combined total cost                           4.425     5.905     5.165

(iv) Cumulative results of classifying all 108 samples

                                             Exp. 1    Exp. 2    Exp. 3
No. of required measurements                   364       864       437
Total cost of measurements                    3.64      8.64      4.37
Total expected risks of 108 decisions         5.645     5.795     5.60
Combined total cost                           9.285    14.435     9.97
(3) The sequential procedure with feature ordering required about 60% fewer feature measurements than the nonsequential Bayes procedure, while the sequential procedure without feature ordering required about 50% fewer.

(4) From Table 4.3, it is seen that the overall total cost of the recognition process is smallest for the sequential procedure with feature ordering. The row labeled total expected risk of classifications was obtained by summing the expected risks of recognition and is an indication of the confidence with which the classifying decisions are made. The sequential procedure with feature ordering costs about 64% as much as the nonsequential procedure, while the sequential procedure without feature ordering costs about 68% as much.

4.7 Use of Dynamic Programming for Feature-Subset Selection
The proposed dynamic programming procedure for feature ordering and pattern classification can be modified to allow the selection of an optimum subset of features from a given set. Two particular cases are discussed in this section.

If an abruptly truncated sequential decision procedure is to be used for pattern classification, it is important to select the best subset of size equal to the truncation length from a given set of features. The dynamic programming procedure also provides the answer to this type of feature selection problem. Consider, for example, that it is desired to recognize the characters D, J, P using a forward sequential decision procedure with no more than five (independent but not identically distributed) features. The problem is to select a best subset of size five from the eight given features. Assume that the a priori probabilities for each class are given. The feature-subset selection problem can be solved by searching the memory for the minimum expected risk decision boundaries among all boundaries for which five features remain. In the example given in Section 4.6, if the a priori probabilities are assumed to be P(ω₁) = P(ω₂) = 0.25 and P(ω₃) = 0.5, then the subsets (f₈, f₆, f₃, f₂, f₁), (f₇, f₅, f₃, f₂, f₁), and (f₆, f₅, f₃, f₂, f₁) all yield the same minimum expected risk for the process. Any one of the three ordered feature subsets is an optimal subset with five features.

If a nonsequential Bayes or maximum-likelihood decision procedure is to be used for the pattern classification, the dynamic programming procedure can also be applied to determine the best feature subset from a given set of features. The only difference from the case just treated is that, in this case, the cost of taking measurements becomes zero. A computer simulation was performed using the same example as in Section 4.6. The a priori probabilities of the three classes were assumed equal, one-third each. The loss due to misrecognition equals one in all cases. Using the dynamic programming procedure, the expected risk of every subset of the eight features was calculated, and the classification of all 108 pattern samples was performed using each subset. In all, a total of 256 classification studies were made. The results are summarized in Figs. 4.4, 4.5, and 4.6.
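The exhaustive subset evaluation can be sketched with a toy surrogate for the expected risk (the book computes each subset's risk from the dynamic programming tables; the per-feature weights below are illustrative only):

```python
from itertools import combinations

features = tuple(range(1, 9))                # eight features f1..f8
# made-up per-feature discriminatory weights for the sketch
weight = {1: .30, 2: .25, 3: .22, 4: .05, 5: .18, 6: .15, 7: .12, 8: .10}

def expected_risk(subset):
    # toy surrogate: risk shrinks as the subset's total weight grows
    return 0.5 * (1.0 - min(1.0, sum(weight[f] for f in subset)))

best = {}                                    # best subset of each size
for size in range(1, 9):
    for s in combinations(features, size):
        risk = expected_risk(s)
        if size not in best or risk < best[size][1]:
            best[size] = (s, risk)

for size, (s, risk) in sorted(best.items()):
    print(size, s, round(risk, 3))
```

Because adding a feature can never increase the surrogate risk, the best risk is nonincreasing in the subset size, mirroring the bounds plotted in Fig. 4.5.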
Fig. 4.4. Experimental relationship between percent error and expected cost of decision.
Figure 4.4 shows the relationship between the expected risk of decision and the percentage of misrecognitions. Because the loss due to misrecognition is equal to one in all situations, a linear relationship is expected between the percentage of misrecognition and the expected risk. Both the theoretical relationship and the actual regression line obtained from the experimental results are shown in the figure. Figure 4.5 indicates the bounds within which the results for feature subsets of various sizes fall. The results show that the expected risk is a good indicator of the classification accuracy. The
Fig. 4.5. Minimums and maximums: the number of errors versus the expected cost of decision (j indicates the number of features used for recognition).
variation in the expected percentage of misrecognitions for feature subsets with various sizes is demonstrated in Fig. 4.6. The numbers associated with the plotted points indicate the best and the worst feature subsets, respectively. 4.8 Suboptimal Sequential Pattern Recognition
In Section 4.5, a backward procedure has been developed for constructing the optimal solution for feature ordering and pattern classification. In general, the knowledge of the joint probability density functions of all the features and the a priori probabilities for each class are required in the computation of (4.20). From the computational point of view the procedure is often difficult to implement without large-scale computation facilities. If certain assumptions (e.g., independence of feature measurements) can be made in the
Fig. 4.6. Minimum and maximum; the number of errors versus the number of features measured.
practical recognition problems, the optimal procedure can be implemented as described in Section 4.6. However, it is still desirable to develop an approximation to the optimal procedure so that the computations involved can be much simplified. In this section, an approximation scheme which leads to a suboptimal solution is discussed, and comparisons are made with the optimal procedure to show the trade-off between optimality and computational difficulty. The approximation which leads to a suboptimal solution is that, at each stage, the classifier considers the next stage to be terminal; that is, a classification decision must be made at the next stage (one-stage-ahead truncation). The following three different cases are chosen to illustrate the effectiveness of the suboptimal solution:
Case 1: optimal solution when the feature measurements for each class are independent;
Case 2: suboptimal (one-stage-ahead truncation) solution when the feature measurements for each class are independent;
Case 3: suboptimal solution when the feature measurements for each class have a first order Markov dependence.
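The contrast between the full backward recursion and the one-stage-ahead truncation can be made concrete on a toy problem. The sketch below uses a two-class problem with a single binary feature measured repeatedly; the class-conditional probabilities, the measurement cost, and the 0-1 loss are invented for illustration, and the sufficient statistic is simply the posterior probability p of class 1.

```python
F1 = {1: 0.7, 0: 0.3}   # P(x | class 1)  (assumed toy values)
F2 = {1: 0.3, 0: 0.7}   # P(x | class 2)
COST = 0.05             # cost per measurement; misclassification loss = 1

def predictive(p, x):          # P(next measurement = x) given posterior p
    return p * F1[x] + (1 - p) * F2[x]

def posterior(p, x):           # Bayes update of P(class 1 | data)
    return p * F1[x] / predictive(p, x)

def optimal_risk(p, k):
    """Full backward recursion: up to k measurements may still be taken."""
    stop = min(p, 1 - p)       # Bayes risk of deciding now (0-1 loss)
    if k == 0:
        return stop
    cont = COST + sum(predictive(p, x) * optimal_risk(posterior(p, x), k - 1)
                      for x in (0, 1))
    return min(stop, cont)

def myopic_risk(p, k):
    """One-stage-ahead truncation: pretend the next stage is terminal."""
    stop = min(p, 1 - p)
    if k == 0:
        return stop
    lookahead = COST + sum(predictive(p, x) *
                           min(posterior(p, x), 1 - posterior(p, x))
                           for x in (0, 1))
    if stop <= lookahead:
        return stop
    return COST + sum(predictive(p, x) * myopic_risk(posterior(p, x), k - 1)
                      for x in (0, 1))

print(optimal_risk(0.5, 4), myopic_risk(0.5, 4))
```

The truncated policy is a feasible policy, so its risk can never fall below the optimal backward value; on this toy problem the two come out close, mirroring the near-optimal behavior reported in Table 4.4.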
A comparison of cases 1 and 2 displays the effects of the truncation approximation. On the other hand, a comparison of cases 1 and 3 allows a determination of the relative advantage of increasing the computational complexity either through (i) increasing the knowledge of the statistical dependence of the feature measurements while truncating the backward procedure, or (ii) simplifying the probability assumptions and carrying out the entire backward programming computation.

Case 1

In this case,

$$p(x_{n+1}; f_{t_{n+1}} \mid x_1, \dots, x_n; F_{t_n}) = p(x_{n+1}; f_{t_{n+1}})$$

Equation (4.23) can be used to compute the a posteriori probabilities of each class at each stage. Let $P_n$ be the set of a posteriori probabilities of the occurrence of each class computed by (4.23), i.e.,

$$P_n = \{ P(\omega_i \mid x_1, \dots, x_n; F_{t_n}); \; i = 1, \dots, m \} \tag{4.25}$$

Then the basic recursive equation (4.20), using $P_n$ as the sufficient statistic, reduces to

$$\rho_n(P_n) = \min \begin{cases} \text{Continue: } \min_{f_{t_{n+1}}} \{ C(x_1, \dots, x_n \mid F_{t_n}) + E[\rho_{n+1}(P_{n+1}) \mid P_n; f_{t_{n+1}}] \} \\ \text{Stop: } \min_{d} \{ R(d \mid P_n) \} \end{cases} \tag{4.26}$$

Case 2

Equation (4.26), in this case, becomes

$$\rho_n(P_n) = \min \begin{cases} \text{Continue: } \min_{f_{t_{n+1}}} \{ C(x_1, \dots, x_n \mid F_{t_n}) + E[\min_{d} R(d \mid P_{n+1}) \mid P_n; f_{t_{n+1}}] \} \\ \text{Stop: } \min_{d} \{ R(d \mid P_n) \} \end{cases} \tag{4.27}$$
It is noted that, in (4.27), the averaging process is always over the terminal-stage costs, and is easily performed as the sequential recognition process proceeds. In this way the requirement for storage
(i.e., the storage of the cost surfaces at each stage) is greatly reduced, and the resulting computations simplified.

Case 3

For this case,

$$p(x_{n+1}; f_{t_{n+1}} \mid x_1, \dots, x_n; F_{t_n}) = p(x_{n+1} \mid x_n; f_{t_{n+1}}) \tag{4.28}$$

and the a posteriori probabilities $P(\omega_i \mid x_1, \dots, x_n; F_{t_n})$ are calculated by (4.29). The sufficient statistic is then $(P_n; x_n; f_{t_n})$, and (4.20) becomes

$$\rho_n(P_n; x_n; f_{t_n}) = \min \begin{cases} \text{Continue: } \min_{f_{t_{n+1}}} \{ C(x_1, \dots, x_n \mid F_{t_n}) + E[\min_{d} R(d \mid P_{n+1}) \mid P_n; x_n; f_{t_{n+1}}] \} \\ \text{Stop: } \min_{d} \{ R(d \mid P_n) \} \end{cases} \tag{4.30}$$
It can be seen that for cases 1 and 2, all information provided by the past history of feature selection and measurement outcomes up to and including the nth stage is contained in the a posteriori probabilities calculated. All that need be done is to keep track of what features have been measured and the a posteriori probabilities calculated. Thus, the actual values of measurement outcomes and the order of features measured can be dropped from consideration, thereby allowing the reduction in computations and storage. In case 3, additional storage is required in order to save the last feature measurement. More serious memory requirements are necessitated by the storage of the transition matrices, and by the added dimension of dependence of the cost function at each stage. Of course, the main computational advantage remains the fact that the expectation (averaging) in (4.30) is taken only over the terminal stage and can be easily computed at each stage of the process. To test the formulations in cases 1, 2, and 3, the recognition of handprinted English characters D, J, P was again used as an example. The same training samples were used as in Section 4.6 to establish
Table 4.4
RESULTS OF OPTIMAL AND SUBOPTIMAL SOLUTIONS

                 % of correct   Total no. of    Cost of feature   Expected loss of
  True class     recognition    measurements    measurements      classification decisions

Case 1 (total expected loss of the entire process: 9.29)
    D                91.6           147              1.47               1.675
    J               100              82              0.82               0.900
    P                80.6           135              1.35               3.075
    Overall          90.7           364              3.64               5.650

Case 2 (total expected loss of the entire process: 9.07†)
    D                91.6           153              1.53               1.675
    J               100              78              0.78               0.900
    P                83.4           114              1.14               3.050
    Overall          91.7           345              3.45               5.625

Case 3 (total expected loss of the entire process: 6.29)
    D                97.5            85              0.85               0.925
    J               100              66              0.66               0.900
    P                97.5            88              0.88               2.075
    Overall          98.2           239              2.39               3.900
the probability density functions and the transition probabilities required. Table 4.4 summarizes the results obtained with the cost of measuring any feature at any stage equal to 0.01, and the loss of making any classification error equal to 1.0. It appears from the results in Table 4.4 that, for this example, by using Markovian statistics and the one-stage-ahead truncation approximation, we would be able to take the correlation between feature measurements into account and still retain the capability to implement a sequential recognition process which approaches the optimum. Of course, if the feature measurements are truly independent, then the Markov assumption would result in no improvement over an independence assumption.

† The peculiar result of a lower expected loss than the optimum was due to the fact that one incorrectly classified pattern accounted for an expected loss of classification of 0.475 using the optimal procedure and only 0.25 using the approximation. Neglecting this single pattern, the results become more reasonable.

4.9 Summary and Further Remarks

In this chapter, the dynamic programming approach has been proved useful in designing a finite sequential classifier whose optimal structure is considered as a multistage decision process. It is shown that the actual decision structure of the sequential classifier, which includes both the choice of continuing and the choice of stopping the sequence of measurements, is obtained by recursively optimizing the risk functions in a backward manner. The backward procedure guarantees the termination of classification processes within a prespecified number of feature measurements (finiteness) and, at the same time, preserves the optimality of minimizing the average risk. Methods of reducing the computational difficulty and storage requirements have been suggested to make the multistage decision process suitable for numerical solution. While the assumptions made on (i) independent measurements and (ii) Markov-dependent measurements are only approximations of the true state of affairs, they nevertheless provide a ready solution to the optimal design of many recognition problems. When it is desirable for the recognition system to perform "on-line" selection of feature measurements for successive observations, the dynamic programming approach presents a possibility of designing a recognition system for
both feature selection and pattern classification. Computer-simulated experiments on character recognition, including comparisons between sequential and nonsequential classifiers, have illustrated the validity and feasibility of the dynamic programming approach.

There has not been much quantitative comparison of performance between the forward and the backward sequential classification procedures other than the degree of optimality and computational difficulty. This lack of comparison makes it difficult to determine exactly which procedure is more appropriate for a particular problem at hand. Although a suboptimal backward procedure (a one-stage-ahead truncation procedure) has been suggested as a compromise, the degradation of performance in general cannot be quantitatively determined beforehand.

References

1. D. Blackwell and M. A. Girshick, "Theory of Games and Statistical Decisions." Wiley, New York, 1954.
2. R. Bellman, "Dynamic Programming." Princeton Univ. Press, Princeton, New Jersey, 1957.
3. R. Bellman, R. Kalaba, and D. Middleton, Dynamic programming, sequential estimation and sequential detection processes. Proc. Nat. Acad. Sci. 47, 338-341 (1961).
4. D. V. Lindley, Dynamic programming and decision theory. Appl. Statist. 10, 39-51 (1961).
5. E. B. Dynkin, The optimum choice of the instant for stopping a Markov process. Soviet Math. Dokl. 4, No. 3, 627-629 (1963).
6. P. C. Fishburn, A general theory of finite sequential decision processes. Tech. Paper RAC-TP-143. Res. Anal. Corp., McLean, Virginia, February 1965.
7. R. A. Howard, "Dynamic Programming and Markov Processes." Wiley, New York, 1960.
8. G. B. Wetherill, "Sequential Methods in Statistics." Methuen, London, and Wiley, New York, 1966.
9. K. S. Fu, Y. T. Chien, and G. P. Cardillo, A dynamic programming approach to sequential pattern recognition. IEEE Trans. Electron. Computers 16, 790-803 (1967).
10. K. S. Fu and G. P. Cardillo, An optimum finite sequential procedure for feature selection and pattern classification. IEEE Trans. Auto. Control 12, 588-591 (1967).
11. Y. T. Chien and K. S. Fu, An optimal pattern classification system using dynamic programming. Intern. J. Math. Biosciences 1, No. 3, 439-461 (1967).
12. G. P. Cardillo and K. S. Fu, A dynamic programming procedure for sequential pattern classification and feature selection. Intern. J. Math. Biosciences 1, No. 3, 463-491 (1967).
13. B. R. Bhat, Bayes solution of sequential decision problem for Markov dependent observations. Ann. Math. Statist. 35, 1656-1662 (1964).
14. R. Bellman and R. Kalaba, On the role of dynamic programming in statistical communication theory. IRE Trans. Inform. Theory 3, No. 3, 197-203 (1957).
15. H. H. Goode, Deferred decision theory. In "Recent Developments in Information and Decision Processes" (R. E. Machol and P. Gray, eds.). Macmillan, New York, 1962.
CHAPTER 5

NONPARAMETRIC PROCEDURE IN SEQUENTIAL PATTERN CLASSIFICATION

5.1 Introduction
The design of a sequential pattern classification system for classifying patterns in a random environment (noise, distortion, etc.) has been primarily concerned with the case where the following assumptions are made: (i) a sufficient number of feature measurements is always available and thus the classification process can be prolonged forever if needed;
(ii) the statistical knowledge about the patterns in each class is either completely known a priori or can be estimated by the classification system through some learning processes.

The first difficulty, which arises from the prolonged experimentation, can be avoided either by modifying the standard Wald sequential probability ratio test so that the classification process will terminate at a prespecified finite number of feature measurements, as described in Chapter 3, or simply by employing the dynamic programming procedure which determines the optimal stopping boundaries by computing backwards from the last feature measurement up to the first, as discussed in Chapter 4. An equally important but perhaps less explored case of design is the one which relaxes the constraint in (ii) so that no assumption or actual knowledge is needed on the form of the underlying probability distributions associated with each pattern class [1]-[8]. The purpose of this chapter, therefore, is to introduce a nonparametric approach to the design of a sequential pattern classification system using Wald's SPRT. It is noted that in order to carry out the computation in Wald's SPRT, an assumption or actual knowledge is needed on the specific
forms of the probability density functions $p_n(X \mid \omega_1)$ and $p_n(X \mid \omega_2)$. This is essentially what has been done in the experiments presented in previous chapters, where the feature vectors are assumed to be samples from known probability distributions describable by a set of parameters (for example, mean vectors and covariance matrices in Gaussian distributions), known to or estimated by the classification system. It may frequently happen that this knowledge is not available, or that a simplified assumption cannot be justified due to the lack of a priori information about the random patterns or due to the changing statistics of the operating environment. In either case, nonparametric methods would have to be pursued so as to obtain a more realistic mathematical model in approximating the physical situation. In statistical decision theory, many nonparametric schemes are based on the set of ranks determined by sample measurements. In the following sections, a sequential ranking procedure [9] is employed, and the resulting performance analyzed, in the design of a binary classifier so that the nonparametric setting of Wald's SPRT can be naturally applied. A generalized procedure capable of classifying patterns from more than two classes is also discussed.

5.2 Sequential Ranks and Sequential Ranking Procedure
It was remarked in the previous section that in order to apply Wald's SPRT in the nonparametric setting, we would have to replace the feature measurement vector $X = [x_1, x_2, \dots, x_n]^T$ by a vector of ranks $T = [T_1, T_2, \dots, T_n]$. The rank $T_i$ of $x_i$ is $l$, $l = 1, 2, \dots, n$, if and only if $x_i$ is the $l$th smallest measurement with respect to the set of measurements $x_1, x_2, \dots, x_n$. Because of the sequential nature of taking the feature measurements in the SPRT, we are naturally led to the idea of sequentially ranking the measurements every time a new measurement is taken, without having to rerank all the preceding measurements in the entire feature vector. To see exactly how such a procedure may be derived, it is helpful to look into the ordinary (nonsequential) reranking process, which is described as follows. Suppose that the feature measurements $x_1, x_2, \dots, x_n$ are taken successively, and each time a new measurement is taken the entire set of measurements is reranked. Let $T_{ij}$ be the rank of $x_j$ with respect to the entire set of measurements $(x_1, x_2, \dots, x_i)$ at the $i$th stage of the
process, where $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, i$. Then two groups of vectors describe the ordinary reranking process: the successive measurement sets $(x_1), (x_1, x_2), \dots, (x_1, x_2, \dots, x_n)$ and the corresponding ordinary rank vectors $[T_{11}], [T_{21}, T_{22}], \dots, [T_{n1}, T_{n2}, \dots, T_{nn}]$.
It should be pointed out that the vector $[T_{11}, T_{22}, \dots, T_{nn}]$ alone completely determines the reranking process, in the sense that each ordinary rank vector listed above can be reconstructed given only the ranks $T_{ii}$, $i = 1, 2, \dots, n$, where $T_{ii}$ is the rank of $x_i$ relative to the measurement set $(x_1, x_2, \dots, x_i)$. In fact, it is easily seen that a feature measurement can be ranked as it is measured, relative to the entire set of preceding measurements, without reranking the previous measurements, and still retain the information which would come from reranking all the preceding measurements. This method of ranking the measurements fits naturally with the idea of a sequential decision procedure, where the measurements are taken successively in accordance with a specified stopping rule. To formally present this idea, which leads to the development of a nonparametric sequential classification procedure, the following definition and lemma are first given:

Definition. The "sequential rank" of $x_n$ relative to the set of measurements $(x_1, x_2, \dots, x_n)$ is $S_n$ if $x_n$ is the $(S_n)$th smallest in $(x_1, x_2, \dots, x_n)$.
Thus the sequential rank of $x_1$ is always 1; the sequential rank of $x_2$ is either 1 or 2, depending on whether $x_2 < x_1$ or $x_1 < x_2$; and the sequential rank of $x_3$ is 1, 2, or 3, according to whether $x_3$ is the smallest, the second smallest, or the largest in the set of measurements $(x_1, x_2, x_3)$; etc. In the sequel, the sequential rank vector for the feature measurement vector $X = [x_1, x_2, \dots, x_n]^T$ will be denoted by $S(n) = [S_1, S_2, \dots, S_n]$.
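The definition can be implemented directly: each new measurement is ranked by a single insertion into the sorted list of the earlier measurements, with no reranking. A minimal sketch (the sample values are arbitrary, and the function name is mine):

```python
import bisect

def sequential_ranks(measurements):
    """Sequential rank S_i of x_i: its position among x_1, ..., x_i."""
    seen = []    # earlier measurements, kept in sorted order
    ranks = []
    for x in measurements:
        # binary search locates the new measurement's rank without reranking
        ranks.append(bisect.bisect_left(seen, x) + 1)
        bisect.insort(seen, x)
    return ranks

print(sequential_ranks([3.1, 1.2, 2.5, 2.0]))  # -> [1, 1, 2, 2]
```

Note that the first rank is always 1 and the $i$th rank is always between 1 and $i$, exactly as the definition requires.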
Lemma. There is a one-to-one correspondence between the set of $n!$ possible orderings $x_{i_1} < x_{i_2} < \cdots < x_{i_n}$ and the $n!$ possible sequential rank vectors $[S_1, S_2, \dots, S_n]$ for the feature measurement vector $X = [x_1, x_2, \dots, x_n]^T$.

Proof [9], [10]. Consider the vector $[x_1, x_2, \dots, x_n]^T$, where the $x_i$ are $n$ distinct real numbers, and the set $\{[x_{i_1}, x_{i_2}, \dots, x_{i_n}]^T\}$ consisting of the $n!$ vectors and possible orderings obtained by permuting the coordinates of $[x_1, x_2, \dots, x_n]^T$. Now define the mapping $\varphi$ from the set $\{[x_{i_1}, x_{i_2}, \dots, x_{i_n}]^T\}$ into the set $\{[r_1, r_2, \dots, r_n]^T : r_1 = 1;\ r_2 = 1, 2;\ \dots;\ r_n = 1, 2, \dots, n\}$ by setting the $j$th coordinate of $\varphi(x_{i_1}, x_{i_2}, \dots, x_{i_n})$ equal to the rank of $x_{i_j}$ in the set $x_{i_1}, x_{i_2}, \dots, x_{i_j}$; that is, the $j$th coordinate is $r$ if $x_{i_j}$ is the $r$th smallest among $x_{i_1}, x_{i_2}, \dots, x_{i_j}$. The mapping is one-to-one and onto.
The significance of this lemma, which will become clear later, may be summarized as follows: If we consider each ordering, say $x_{i_1} < x_{i_2} < \cdots < x_{i_n}$, of a feature measurement vector $X = [x_1, x_2, \dots, x_n]^T$, and use the definition given above to obtain the associated sequential ranks $S_1, S_2, \dots, S_n$, the sequential rank vector will be uniquely determined. Conversely, the sequential rank vector uniquely determines the original ordering. Since a particular ordering $x_{i_1} < x_{i_2} < \cdots < x_{i_n}$ also uniquely determines the ordinary rank vector $[T_{n1}, T_{n2}, \dots, T_{nn}]$, there exists a one-to-one mapping between the set of sequential rank vectors and the set of ordinary rank vectors for all possible orderings. In order to provide a smooth transition of Wald's SPRT to its nonparametric setting, it is necessary to find the probability distribution of the sequential rank vectors. There are two significant findings in nonparametric statistics which can be used to obtain, respectively, the exact calculation of the sequential rank distribution and a practical application in nonparametric testing problems. One is the fact that there exists a one-to-one correspondence between the ordered measurements (hence the ordinary rank vectors) and the sequential rank vectors. It follows that the distribution of the sequential ranks is completely determined, since the distribution of the ordinary ranks can be easily calculated [11]. A second useful finding is the basic assumption of Lehmann alternatives frequently made in nonparametric tests. It will be shown later in this chapter that this assumption, although necessary, is not quite as restrictive as it appears to be when used in the nonparametric design of sequential classification systems.
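The one-to-one correspondence asserted by the lemma can be checked mechanically: from the sequential rank vector alone, the full ordering of the measurements is recovered by insertion. A small self-contained sketch (function names are mine):

```python
import bisect
from itertools import permutations

def sequential_ranks(xs):
    seen, ranks = [], []
    for x in xs:
        ranks.append(bisect.bisect_left(seen, x) + 1)
        bisect.insort(seen, x)
    return ranks

def ordering_from_ranks(S):
    """Invert the map: recover the indices of the measurements from
    smallest to largest, using only the sequential ranks S_1, ..., S_n."""
    order = []
    for i, s in enumerate(S):
        order.insert(s - 1, i)   # x_i is the s-th smallest seen so far
    return order

# Verify the bijection over every ordering of four distinct values.
for xs in permutations((10, 20, 30, 40)):
    S = sequential_ranks(xs)
    assert ordering_from_ranks(S) == sorted(range(4), key=lambda i: xs[i])
print("bijection verified for all 4! orderings")
```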
Consider first the distribution of the sequential rank vectors. Using the fact that there is a unique relation between the ordered feature measurements and the sequential rank vectors, the distribution of the sequential rank vectors is also completely specified by

$$P(S(n) = [S_1, \dots, S_n]) = P(x_{i_1} < x_{i_2} < \cdots < x_{i_n}) = \int \cdots \int_{x_{i_1} < \cdots < x_{i_n}} \prod_{j=1}^{n} dP_{i_j}(x_{i_j}) \tag{5.1}$$

where $P_{i_j}(x_{i_j})$ indicates the distribution function of $x_{i_j}$, and the $x_{i_j}$'s are assumed to be independent in this calculation. For the special case when the distribution functions $P_i(x_i)$ are taken to be Lehmann alternatives [12], then

$$P_i(x_i) = [P(x_i)]^{r_i}, \qquad r_i > 0 \tag{5.3}$$

Using (5.1) and substituting (5.3), we obtain

$$P(x_1 < x_2 < \cdots < x_n) = \prod_{j=1}^{n} \frac{r_j}{r_1 + r_2 + \cdots + r_j} \tag{5.5}$$
By relabeling the $x_i$'s, the probability of any order of the $x_i$'s can be found using (5.5), giving all the values needed in (5.1) to specify the distribution of the sequential rank vectors.

5.3 A Sequential Two-Sample Test Problem
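The ordering probability under Lehmann alternatives has the closed product form just described: if the measurement that comes $j$th in the ordering carries exponent $a_j$, the ordering has probability $\prod_j a_j/(a_1 + \cdots + a_j)$. The short numeric check below verifies two sanity properties (the function name is mine):

```python
import math
from itertools import permutations

def ordering_prob(exponents):
    """P(X_1 < X_2 < ... < X_n) when X_j has CDF F(x)**exponents[j]
    (independent, common continuous F): prod_j a_j / (a_1 + ... + a_j)."""
    prob, total = 1.0, 0.0
    for a in exponents:
        total += a
        prob *= a / total
    return prob

# Identical distributions (all exponents 1): every ordering has probability 1/n!.
assert math.isclose(ordering_prob([1.0] * 4), 1 / math.factorial(4))

# The probabilities of all orderings sum to one for any choice of exponents.
s = sum(ordering_prob(p) for p in permutations([1.0, 2.0, 0.5]))
assert math.isclose(s, 1.0)
print("checks passed")
```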
As a basic model for the nonparametric design of a sequential classification system, a sequential two-sample test problem is described in this section. Suppose there are available two vectors of successive measurements, $X = [x_1, x_2, \dots, x_n]^T$ and $Y = [y_1, y_2, \dots, y_n]^T$, each sampled from some probability distribution. The problem is to test the hypothesis that the two distributions are the same against the alternative hypothesis that they are different, using as few measurements as possible. Let the successive measurements $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$ be independent random variables, and assume that we wish to test the hypothesis

$$H_0: \quad G = P(X)$$

against the alternative

$$H_1: \quad G = f(P(X))$$

where $P(X)$ is the probability distribution of $X$ and $G$ the distribution of $Y$, $f(P(X))$ denoting a specified functional of $P(X)$. In order to use Wald's SPRT based on the sequential ranks, the measurements will be arranged so that they can be taken alternately as $x_1, y_1, x_2, y_2, \dots, x_n, y_n$. Let the combined measurements at the $k$th stage be denoted by a vector $V(k) = [v_1, v_2, \dots, v_k]$, where $v_1 = x_1$, $v_2 = y_1$, etc. Let $S(k) = [S_1, S_2, \dots, S_k]$ be the sequential rank vector for $V(k)$, and let

$$\lambda_k = \frac{P_k(S(k) \mid H_1)}{P_k(S(k) \mid H_0)}$$

be the sequential probability ratio at the $k$th stage of the process.
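Putting the pieces together, the two-sample test can be sketched as Wald's SPRT on the sequential ranks: under $H_0$ every ordering of the combined measurements has probability $1/k!$, while under $H_1$ the $y$'s are given a Lehmann exponent $r$, so the ordering probability takes the product form used below. This is a sketch, not the book's exact computation scheme; the thresholds are the standard Wald choices and the function names are mine.

```python
import math

def ordering_prob(exponents):
    """P of a particular ordering when the j-th smallest value carries
    Lehmann exponent exponents[j]: prod_j a_j / (a_1 + ... + a_j)."""
    prob, total = 1.0, 0.0
    for a in exponents:
        total += a
        prob *= a / total
    return prob

def prob_ratio(values, exps):
    """lambda_k = P(S(k) | H1) / P(S(k) | H0), with P(. | H0) = 1/k!."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return ordering_prob([exps[i] for i in order]) * math.factorial(len(values))

def sprt_two_sample(xs, ys, r, alpha=0.05, beta=0.05):
    """Take x and y alternately; stop when lambda_k crosses a Wald boundary."""
    A, B = (1 - beta) / alpha, beta / (1 - alpha)
    values, exps = [], []
    for x, y in zip(xs, ys):
        for v, a in ((x, 1.0), (y, r)):   # exponent 1 for x's, r for y's
            values.append(v)
            exps.append(a)
            lam = prob_ratio(values, exps)
            if lam >= A:
                return "reject H0", len(values)
            if lam <= B:
                return "accept H0", len(values)
    return "no decision", len(values)

# lambda_2 for x = 0.1, y = 0.9, r = 2: the ordering (x, y) has probability
# 1 * (2/3) under H1 and 1/2 under H0, so the ratio is 4/3.
print(prob_ratio([0.1, 0.9], [1.0, 2.0]))  # -> 1.333...
```

Recomputing the full sort at every stage is done here only for clarity; the whole point of sequential ranks is that each new measurement can be placed by a single insertion.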
Under the hypothesis $H_0$, $P_k(S(k) = S \mid H_0) = 1/k!$ for a certain outcome vector $S$ of $S(k)$, and therefore $P_k(S(k) = S \mid H_1)$ can be computed by noting that each outcome $S$ corresponds, in a one-to-one manner, to a particular ordering of the combined measurements of the $x_i$'s and $y_i$'s. That is, it is sufficient to compute $P(v_1 < v_2 < \cdots < v_k \mid H_1)$.

For the generalized (multiclass) test, with $r_i > 0$, $i = 1, 2, \dots, m$, consider the successive measurements of the $m$ combined sample vectors to be

$$V_i(k) = [v_{i1}, v_{i2}, \dots, v_{ik}] = [x_{i1}, y_1, x_{i2}, y_2, \dots], \qquad i = 1, 2, \dots, m$$

where $k = 1, 2, \dots, 2n$.
The corresponding sequential rank vectors are determined to be

$$S_i(k) = [S_{i1}, S_{i2}, \dots, S_{ik}], \qquad i = 1, 2, \dots, m$$

where $k = 1, 2, \dots, 2n$. At the $k$th measurement of the $m$ combined sample vectors $V_1(k), V_2(k), \dots, V_m(k)$, the sequential probability ratios are computed, i.e.,

$$\lambda_{ik} = \frac{P_k(S_i(k) \mid H_1)}{P_k(S_i(k) \mid H_0)}, \qquad i = 1, 2, \dots, m \tag{5.31}$$
where $P_k(S_i(k) \mid H_0)$ and $P_k(S_i(k) \mid H_1)$ are the probabilities of the sequential rank vector $S_i(k)$, given the hypothesis that $Y$ belongs to the class $\omega_i$, or the alternative that $Y$ does not. Following (5.8), for $k$ even, (5.31) becomes

$$\lambda_{ik} = k! \prod_{j=1}^{k} \frac{a_{ij}}{a_{i1} + a_{i2} + \cdots + a_{ij}} \tag{5.32}$$

where the $a_{ij}$'s are the quantities $A_{ij}$ of (5.33) taken in the order given by the sequential ranks, and

$$A_{ij} = \begin{cases} 1 & \text{if } v_{ij} \text{ is an } x \text{ from } X_i \\ r_i & \text{if } v_{ij} \text{ is a } y \end{cases} \tag{5.33}$$
Adopting the rejection criterion of the GSPRT, the pattern class $\omega_i$ is dropped from consideration at the $k$th measurement if

$$\lambda_{ik} \geq A_k(\omega_i) \tag{5.34}$$

The process of forming $\lambda_{ik}$ continues until there is only one sequential probability ratio left not satisfying the above inequality, and its associated hypothesis is then accepted as the true pattern class to which $Y$ belongs. Note that the upper stopping boundaries $A_k(\omega_i)$ in (5.34) are generally functions of both the hypothesis under test and the number of the stage $k$ under consideration. In practice, $A_k(\omega_i)$ may be set to $[1 - (e_{01})_i]/(e_{10})_i$, where $(e_{10})_i$ and $(e_{01})_i$ are the two types of error probabilities for the hypothesis associated with class $\omega_i$. As implied by (5.34), the lower stopping boundaries $B_k(\omega_i)$ have been made negligibly small, by setting $(e_{01})_i$ arbitrarily small, to prevent the test from accepting a hypothesis prematurely when the alternative $H_1$ is true. In any case, the stopping boundaries should be determined in such a way that they minimize the effect of possible ambiguity, for example, a rejection of all pattern classes in a situation where it is known that the input sample is from one of the pattern classes.

5.6 Experimental Results and Discussions
To determine the effectiveness of the sequential ranking procedure in constructing a nonparametric sequential classifier, a computer-simulation experiment was carried out. The experiment consisted in classifying the handwritten English characters a and b as described in Section 2.3, with the exception that the probability distributions of the patterns in each class were assumed to be nonparametric and unknown. Let the learning sample from character a be denoted by

$$X^a = [x_1^a, x_2^a, \dots]$$

which was obtained, in this experiment, from the estimated mean vector of sixty samples of character a. The input sample $Y$ is then tested against the hypothesis that it belongs to the class of character a, or the alternative that it does not (i.e., to accept that $Y$ belongs to the class of character b). A flow diagram showing the computer-simulated classification procedure is given in Fig. 5.3. For the purpose of illustration, some sixty pattern samples of characters a and b were tested on the computer. The classification results are summarized in Fig. 5.4, in which the error probability (average percentage of the two types of misrecognition) is plotted against the average number of feature measurements required to make a terminal decision. The classification experiment was repeated with $r$ in the Lehmann alternatives as a running parameter. It is seen that the experiment performed tends to verify the theoretical conclusions in two respects: (i) The sequential ranking procedure and the resulting two-sample test model do provide an effective nonparametric procedure for sequential classification, in which the error probability decreases as the number of measurements increases, as usually expected in Wald's SPRT. (ii) For a specified error probability, fewer measurements are required to make a terminal decision by increasing the value of $r$ if $r > 1$, or by decreasing the value of $r$ if $r < 1$, which is the relation concluded in (5.30). This relation is particularly useful in selecting a desirable Lehmann alternative for a certain pattern class in the absence of any statistical knowledge about the pattern samples. Although no direct verification of the validity of the assumption of Lehmann alternatives was attempted in the experiment, the simulation result does indicate
Fig. 5.3. Computer flow diagram for the recognition experiments using the nonparametric technique. (Main steps: obtain the learning sample vector $X(l) = (x_1, x_2, \dots, x_l)$, $l = 1, 2, \dots, 18$; read in successive measurements of an input sample to be recognized, $Y(l) = (y_1, y_2, \dots, y_l)$; form the combined sample vector $V(k) = (v_1, v_2, \dots, v_k) = (x_1, y_1, \dots)$; obtain the sequential rank vector $S(k) = (S_1, S_2, \dots, S_k)$, $k = 1, 2, \dots, 36$; form the vector $\lambda(k) = (\lambda_1, \lambda_2, \dots)$ using Eq. (5.10) or (5.11); then either decide character a, decide character b, or replace $l$ by $l + 1$ and $k$ by $k + 1$ and continue.)
Fig. 5.4. Performance curves; recognition of characters a and b (nonparametric model).
possible low error probabilities in the classification of character samples if proper Lehmann alternatives are chosen and a sufficient number of measurements is available.

5.7 Summary and Further Remarks
A nonparametric setting for Wald's sequential probability ratio test based on the sequential ranks has been discussed in this chapter. The essential feature of the sequential ranks lies in the fact that a new measurement can be ranked as it is measured, relative to the preceding measurements, without reranking all the previous measurements. One application of this ranking scheme is in the design of a sequential recognition system to classify patterns with nonparametric statistics. The solution is obtained by formulating the classification procedure in terms of a sequential two-sample problem, where the classifier wishes to decide whether or not an X-population and a Y-population have the same probability distribution. With the assumption of Lehmann alternatives in the two-sample test, a simple design of a nonparametric sequential classifier is developed. Both intuitive and theoretical justifications have been given for the use and selection of suitable Lehmann alternatives. A generalization of the two-sample test to the case of the multiclass classification problem
has also been suggested. Computer-simulated experiments have shown satisfactory results regarding the verification of the theoretical conclusions and the classification of English characters. The nonparametric sequential classification procedure proposed in this chapter is a rather special approach based on the sequential probability ratio test and the assumption of Lehmann alternatives. It should be interesting to explore more general results and possible extensions by considering alternatives other than Lehmann alternatives, or other nonparametric decision procedures.

References

1. G. H. Ball, Data analysis in the social sciences: What about the details? Proc. Fall Joint Computer Conference, 533-599 (1965).
2. G. Sebestyen and J. Edie, An algorithm for nonparametric pattern recognition. IEEE Trans. Electronic Computers 15, 908-915 (1966).
3. J. Owen, Nonparametric pattern recognition, Part I and Part II. TR No. 1 and No. 2, July/October. Information Research Associates, Inc., Waltham, Massachusetts, 1965.
4. M. A. Aiserman, E. M. Braverman, and L. I. Rozonoer, The probability problem of pattern recognition learning and the method of potential functions. Avtomatika i Telemekhanika 25, 1175-1190 (1964).
5. T. M. Cover and P. E. Hart, Nearest neighbor pattern classification. IEEE Trans. Information Theory 13, 21-27 (1967).
6. D. F. Specht, Generation of polynomial discriminant functions for pattern recognition. IEEE Trans. Electronic Computers 16, 308-319 (1967).
7. G. F. Hughes, On the mean accuracy of statistical pattern recognizers. IEEE Trans. Information Theory 14, 55-63 (1968).
8. E. G. Henrichon, On nonparametric methods for pattern recognition. Ph.D. Thesis (TR-EE68-18), Purdue University, Lafayette, Indiana, June 1968.
9. E. A. Parent, Sequential ranking procedure. Tech. Rept. No. 80, Dept. of Statist., Stanford Univ., Stanford, California, 1965.
10. O. Barndorff-Nielsen, On the limit behavior of extreme order statistics. Ann. Math. Statist. 34, 992-1002 (1963).
11. W. Hoeffding, Optimum nonparametric tests. Proc. Symp. Math. Statist. and Probability, 2nd, Berkeley, pp. 83-92. Univ. of California Press, Berkeley, California, 1951.
12. E. L. Lehmann, The power of rank tests. Ann. Math. Statist. 24, 23-43 (1953).
13. I. R. Savage and J. Sethuraman, Stopping time of a rank-order sequential probability ratio test based on Lehmann alternatives. Ann. Math. Statist. 37, No. 5, 1154-1160 (1966).
14. K. S. Fu and Y. T. Chien, Sequential recognition using a nonparametric ranking procedure. IEEE Trans. Inform. Theory 13, 484-492 (1967).
15. D. A. S. Fraser, "Nonparametric Methods in Statistics." Wiley, New York, 1957.
16. I. R. Savage, Contributions to the theory of rank order statistics: the two-sample case. Ann. Math. Statist. 27, 590-615 (1956).
CHAPTER 6
BAYESIAN LEARNING IN SEQUENTIAL PATTERN RECOGNITION SYSTEMS
6.1 Supervised Learning Using Bayesian Estimation Techniques
As pointed out in Section 1.6, in the absence of complete a priori knowledge, pattern recognition systems can be designed to learn the necessary information from their input observations. Depending upon whether or not the correct classifications of the input observations are available, the learning process can be classified into supervised learning and nonsupervised learning schemes. Various techniques have been proposed for the design of learning systems. Two problems are of primary interest in sequential pattern recognition: the problem of learning an unknown probability density function and that of learning an unknown probability measure. Supervised learning schemes using Bayesian estimation techniques are discussed in this section [1]-[3]. When the form of the probability density function $p(X \mid \omega_i)$ is known but some parameters $\theta$ of the density function are unknown, the unknown parameters can be learned (estimated) by iterative applications of Bayes' theorem. It is assumed that there exists an a priori density function $p_0(\theta)$ for the unknown parameter $\theta$ (in general, vector-valued) which reflects the initial knowledge about $\theta$. Consider what happens to the knowledge about $\theta$ when a sequence of independent identically distributed feature vectors $X_1, X_2, \dots, X_n$, all from the same pattern class, is observed. The function $p_0(\theta)$ changes to the a posteriori density function $p(\theta \mid X_1, \dots, X_n)$ according to Bayes' theorem. For example, the a posteriori density function of $\theta$ given the first observation $X_1$ is†

$$p(\theta \mid X_1) = \frac{p(X_1 \mid \theta)\, p_0(\theta)}{\int p(X_1 \mid \theta)\, p_0(\theta)\, d\theta} \tag{6.1}$$
† Since all the learning observations X₁,..., Xₙ are from the same class, ωᵢ can be dropped from each term in (6.1) without causing any confusion.
After X₁ and X₂ are observed, the a posteriori density function of θ is

p(θ/X₁, X₂) = p(X₂/θ) p(θ/X₁) / ∫ p(X₂/θ) p(θ/X₁) dθ   (6.2)
In general,

p(θ/X₁,..., Xₙ) = p(Xₙ/θ) p(θ/X₁,..., Xₙ₋₁) / ∫ p(Xₙ/θ) p(θ/X₁,..., Xₙ₋₁) dθ   (6.3)
The required probability density function can be computed by

p(X/X₁,..., Xₙ, ωᵢ) = ∫ p(X/θ, ωᵢ) p(θ/X₁,..., Xₙ, ωᵢ) dθ,  n = 1, 2,...   (6.4)

where the first term at the right-hand side of (6.4), p(X/θ, ωᵢ), is known, and the second term, p(θ/X₁,..., Xₙ, ωᵢ), is obtained from (6.3). The central idea of Bayesian estimation is to extract information from the observations X₁, X₂,..., Xₙ about the unknown parameter θ through successive applications of the recursive Bayes formula. It is known [1] that, on the average, the a posteriori density function becomes more concentrated and converges to the true value of the parameter as long as the true value is not excluded by the a priori density function of the parameter. In each of the supervised learning schemes to be discussed, the iterative application of Bayes theorem can be accomplished by a fixed computational algorithm. This is made possible by carefully selecting a reproducing a priori density function for the unknown parameter, so that the a posteriori density function after each iteration is a member of the same family as the a priori density function (i.e., the form of the density function is preserved and only the parameters of the density function are changed).† The learning schemes are then reduced to successive estimations of parameter values.

† Some important results concerning the necessary and sufficient conditions admitting a reproducing density function can be found in the work of Spragins [4].
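When no reproducing prior is convenient, the recursion (6.3) can still be carried out numerically by finitely quantizing θ, as is assumed later for the nonsupervised case. The following sketch (in Python; the Bernoulli likelihood, the grid, and all numerical values are illustrative assumptions, not taken from the text) updates a quantized posterior one observation at a time:

```python
import numpy as np

# Recursive Bayes formula (6.3) on a finitely quantized parameter grid:
# posterior is proportional to p(X_n/theta) * p(theta/X_1,...,X_{n-1}).
def bayes_update(prior, likelihoods):
    post = prior * likelihoods
    return post / post.sum()

thetas = np.linspace(0.01, 0.99, 99)                  # quantized values of theta
posterior = np.full_like(thetas, 1.0 / len(thetas))   # uniform p0(theta)

rng = np.random.default_rng(0)
samples = rng.random(500) < 0.3                       # Bernoulli data, true theta = 0.3

for x in samples:
    lik = thetas if x else (1.0 - thetas)             # p(x/theta)
    posterior = bayes_update(posterior, lik)

estimate = float((thetas * posterior).sum())          # posterior mean of theta
```

The posterior mass concentrates near the true parameter value as long as the prior does not exclude it, in agreement with the convergence remark above.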
6.1.1 LEARNING THE PARAMETERS OF A GAUSSIAN DISTRIBUTION
A. Learning the Mean Vector M, with Known Covariance Matrix K

In this case, the unknown parameter θ to be learned is M, whose uncertainty can be reflected by assigning a proper reproducing a priori density function p₀(θ) = p₀(M). Let

p(X/M) = [(2π)^{N/2} |K|^{1/2}]⁻¹ exp[−½(X − M)ᵀK⁻¹(X − M)]

and assign

p₀(M) = [(2π)^{N/2} |Φ₀|^{1/2}]⁻¹ exp[−½(M − M₀)ᵀΦ₀⁻¹(M − M₀)]

where M₀ represents the initial estimate of the mean vector and Φ₀ is the initial covariance matrix which reflects the uncertainty about M₀. From the reproducing property of the gaussian density function, it is known [2], [3] that, after successive applications of Bayes' formula, the a posteriori density function p(M/X₁,..., Xₙ), given the learning observations X₁,..., Xₙ, is again a gaussian density function with M₀ and Φ₀ replaced by the new estimates Mₙ and Φₙ. The new estimates Mₙ and Φₙ are, respectively, the conditional mean and covariance of M after the n learning observations X₁,..., Xₙ, i.e.,

Mₙ = E[M/X₁,..., Xₙ] = (n⁻¹K)(Φ₀ + n⁻¹K)⁻¹M₀ + Φ₀(Φ₀ + n⁻¹K)⁻¹⟨X⟩   (6.5)

and

Φₙ = cov[M/X₁,..., Xₙ] = (n⁻¹K)(Φ₀ + n⁻¹K)⁻¹Φ₀   (6.6)

where ⟨X⟩ = n⁻¹ Σᵢ₌₁ⁿ Xᵢ denotes the sample mean. In terms of a recursive relationship, (6.5) and (6.6) can be written as

Mₙ = K(Φₙ₋₁ + K)⁻¹Mₙ₋₁ + Φₙ₋₁(Φₙ₋₁ + K)⁻¹Xₙ   (6.7)

and

Φₙ = K(Φₙ₋₁ + K)⁻¹Φₙ₋₁   (6.8)

Equation (6.5) shows that Mₙ can be interpreted as a weighted average of the a priori mean vector M₀ and the sample information ⟨X⟩, with the weights being (n⁻¹K)(Φ₀ + n⁻¹K)⁻¹ and Φ₀(Φ₀ + n⁻¹K)⁻¹, respectively. The nature of this interpretation can be seen more easily in the special case where
Φ₀ = α⁻¹K,  α > 0   (6.9)

Then (6.5) and (6.6) become

Mₙ = [α/(n + α)]M₀ + [n/(n + α)]⟨X⟩   (6.10)

and

Φₙ = K/(n + α)   (6.11)
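The recursive relations (6.7) and (6.8) can be exercised directly. The sketch below (Python; the true mean, K = I, and Φ₀ = 10I are illustrative choices, not from the text) shows the estimate approaching the true mean vector while Φₙ shrinks:

```python
import numpy as np

# Sequential Bayesian learning of a gaussian mean vector with known
# covariance K, via the recursive relations (6.7) and (6.8).
rng = np.random.default_rng(1)
K = np.eye(2)                        # known covariance of p(X/M)
M_true = np.array([1.0, -2.0])       # assumed "true" mean (illustrative)

M_est = np.zeros(2)                  # M0, initial estimate of the mean
Phi = 10.0 * np.eye(2)               # Phi0, initial uncertainty about M

for _ in range(2000):
    X = rng.multivariate_normal(M_true, K)
    A = np.linalg.inv(Phi + K)
    M_est = K @ A @ M_est + Phi @ A @ X      # Eq. (6.7)
    Phi = K @ A @ Phi                        # Eq. (6.8)
```

Since Φ₀ = 10I = α⁻¹K here, the run also illustrates the special case (6.9)-(6.11) with α = 0.1: Mₙ tends to the sample mean and Φₙ to K/(n + α).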
As n → ∞, Mₙ → ⟨X⟩ and Φₙ → 0, which means that, on the average, the estimate Mₙ will approach the true mean vector M of the gaussian density function.†

B. Learning the Covariance Matrix K, with Zero (or Known) Mean Vector

In this case θ = K is the parameter to be learned. Let K⁻¹ = Q and assign the a priori density function for Q to be the Wishart density function with parameters (K₀, ν₀) [22], i.e.,

p₀(Q) = C_{N,ν₀} |Q|^{(ν₀−N−1)/2} exp[−½ν₀ tr(K₀Q)]  for Q ∈ Ω_Q
      = 0  otherwise   (6.12)

where Ω_Q denotes the subset of the Euclidean space of dimension ½N(N + 1) in which Q is positive definite, and C_{N,ν₀} is the normalizing constant (6.13). K₀ is a positive definite matrix which reflects the initial knowledge of K⁻¹, and ν₀ is a scalar which reflects the confidence in the initial estimate K₀. It can be shown that, by successive applications of Bayes' formula, the a posteriori density function of Q,

† Since the sample mean ⟨X⟩ is an unbiased estimate of the true mean vector M.
p(Q/X₁,..., Xₙ), is again a Wishart density function with parameters K₀ and ν₀ replaced by Kₙ and νₙ, where

Kₙ = (ν₀ + n)⁻¹[ν₀K₀ + n⟨XXᵀ⟩]   (6.14)

νₙ = ν₀ + n   (6.15)

and

⟨XXᵀ⟩ = n⁻¹ Σᵢ₌₁ⁿ XᵢXᵢᵀ   (6.16)

Equation (6.14) can again be interpreted as the weighted average of the a priori knowledge about K⁻¹, namely K₀, and the sample information contained in ⟨XXᵀ⟩.
C. Learning the Mean Vector M and the Covariance Matrix K

In this case, θ = (M, Q) and Q = K⁻¹. An appropriate a priori density function for the unknown parameter θ is found to be Gaussian-Wishart, i.e., M is distributed according to a gaussian density function with mean vector M₀ and covariance matrix Φ₀ = ρ₀⁻¹K, and Q is distributed according to a Wishart density function with parameters ν₀ and K₀. It can be shown that, by successive applications of Bayes' formula, the a posteriori density function of θ, p(θ/X₁,..., Xₙ) = p(M, Q/X₁,..., Xₙ), is again a Gaussian-Wishart probability density function with parameters ν₀, ρ₀, M₀, and K₀ replaced by νₙ, ρₙ, Mₙ, and Kₙ, respectively, where

νₙ = ν₀ + n   (6.17)

ρₙ = ρ₀ + n   (6.18)

Mₙ = [ρ₀/(n + ρ₀)]M₀ + [n/(n + ρ₀)]⟨X⟩   (6.19)

Kₙ = νₙ⁻¹[ν₀K₀ + ρ₀M₀M₀ᵀ + (n − 1)S + n⟨X⟩⟨X⟩ᵀ − ρₙMₙMₙᵀ]   (6.20)

and

S = (n − 1)⁻¹ Σᵢ₌₁ⁿ (Xᵢ − ⟨X⟩)(Xᵢ − ⟨X⟩)ᵀ   (6.21)

Equation (6.19) is the same as (6.10) except that α is replaced by ρ₀. Equation (6.20) can be interpreted as follows. The first two terms
at the right-hand side are weighted estimates of the noncentralized moments of X: the term (ν₀K₀ + ρ₀M₀M₀ᵀ) represents the a priori knowledge, and [(n − 1)S + n⟨X⟩⟨X⟩ᵀ] represents the sample information. The last term at the right-hand side is generated from the new estimate of the mean of X.
6.1.2 LEARNING THE PARAMETERS OF A BINOMIAL DISTRIBUTION
It seems rather obvious and reasonable to interpret the new estimates of parameters in terms of the weighted average of a priori knowledge and sample information in the case of a gaussian distribution. Unfortunately, the interpretation becomes much less obvious for distributions other than the gaussian. The difficulty involved can be illustrated by examining the case of the binomial distribution b(n, p) with parameters (n, p) [5], [6]. Consider the Bernoulli process with parameter θ = p. Let x₁, x₂,..., xₙ denote the observed samples of the process, where each x (1 or 0) is drawn from the distribution b(1, p), 0 < p < 1. If r = Σᵢ₌₁ⁿ xᵢ, then r is the number of ones (successes) in the n observations, which has the binomial distribution b(n, p). That is, the conditional density function of r, given p, is

p(r/p) = [n!/(r!(n − r)!)] pʳ(1 − p)ⁿ⁻ʳ   (6.22)
Notice that the sample outcome y = (r, n) is a sufficient statistic of dimension two for the parameter θ = p. Suppose that p is unknown and is to be learned through the sample outcome (r, n). As in the case of the gaussian distribution, an appropriate a priori density function assigned for p is the beta probability density function [22]

p₀(θ) = p₀(p) = [B(r₀, n₀ − r₀)]⁻¹ p^{r₀−1}(1 − p)^{n₀−r₀−1}   (6.23)

where B(r₀, n₀ − r₀) is the beta function with parameters r₀ and n₀, which are assigned positive constants reflecting the initial knowledge about the unknown parameter p. It can be easily verified by Bayes theorem that the a posteriori density of p, given (r, n), is again a beta density function with parameters rₙ and nₙ,

p(p/r, n) = [B(rₙ, nₙ − rₙ)]⁻¹ p^{rₙ−1}(1 − p)^{nₙ−rₙ−1}   (6.24)

where

rₙ = r₀ + r   (6.25)
and

nₙ = n₀ + n   (6.26)

At first glance at (6.25) and (6.26), it seems rather natural to regard r and n as the sample information used to update the initial knowledge r₀ and n₀, respectively. In doing so, however, one cannot give an interpretation in the sense of a weighted average of initial knowledge and sample information, as in the case of the gaussian distribution. The difficulty lies in the fact that neither component of the statistic (r, n) can be considered as a measure of information in a sample from a Bernoulli process, and it follows that it would not be sensible to consider either component of the parameter (r₀, n₀) as a measure of the knowledge underlying the a priori distribution. To remedy this situation, let

m = r/n  and  m₀ = r₀/n₀   (6.27)
Since for given m an increase in n implies an increase in r, the sample information seems to be unambiguously measured by (m, n). Substituting m₀n₀ for r₀ in (6.23),

p(p/m₀, n₀) = [B(m₀n₀, n₀(1 − m₀))]⁻¹ p^{m₀n₀−1}(1 − p)^{n₀(1−m₀)−1}   (6.28)

The a posteriori density of p, given the sample (m, n), is then a beta density with parameters mₙnₙ and nₙ(1 − mₙ), where

mₙ = (n₀m₀ + nm)/(n₀ + n)   (6.29)

a weighted average of the initial estimate m₀ and the sample information m.
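The beta-density updating of (6.24)-(6.26) reduces to two running counters. A minimal sketch (Python; the prior constants r₀ = 2, n₀ = 4 and the true p = 0.7 are illustrative assumptions, not from the text):

```python
from random import Random

def beta_update(r0, n0, xs):
    """Sequential updating of the beta density (6.24): each Bernoulli
    observation adds its outcome to r and one to n, so that
    r_n = r0 + r and n_n = n0 + n (Eqs. 6.25 and 6.26)."""
    r, n = r0, n0
    for x in xs:                     # x is 1 (success) or 0 (failure)
        r += x
        n += 1
    return r, n

rng = Random(2)
xs = [1 if rng.random() < 0.7 else 0 for _ in range(1000)]  # true p = 0.7
r_n, n_n = beta_update(2.0, 4.0, xs)     # prior constants r0, n0 (illustrative)
p_hat = r_n / n_n                        # posterior-mean estimate, i.e., m_n
```

The final ratio rₙ/nₙ is exactly the weighted average mₙ of (6.29), with the prior contributing n₀ pseudo-observations.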
…p(Xₙ/θ, ω₁), p(Xₙ/ω₂), P(ω₁), and P(ω₂) are known, p(θ/X₁,..., Xₙ) can be computed by (6.38). Assume that P(ω₁) and P(ω₂) are known. In order to compute p(Xₙ/θ, ω₁) and p(θ/X₁,..., Xₙ) for all values of θ, it must be assumed that θ can be finitely quantized so that the number of computations can be kept finite. For multiclass problems, if more than one pattern class has unknown parameters, let θᵢ be the unknown parameter associated with pattern class ωᵢ, i = 1,..., m. Assuming conditional independence of the learning observations X₁,..., Xₙ and independence of the θᵢ, i = 1,..., m, a recursive equation for estimating θᵢ, similar to (6.38), can be obtained.
In general, either p₀(θᵢ) or P(ωᵢ), i = 1,..., m, must be different. Otherwise, the computations for all the θᵢ will learn the same thing (since they compute the same quantity) and the system as a whole will learn nothing.

6.3 Bayesian Learning of Slowly Varying Patterns
In Section 6.1, the parameters to be learned (for example, the mean vector of a multivariate gaussian distributed pattern class) are considered fixed but unknown. The problem in which the parameters to be learned change slowly in a random manner will be treated in this section. For illustrative purposes, the problem of learning the mean vector M of a gaussian distribution is again used [2]. Assume that the change of M is slow compared with the observation time of the learning observations, so that M changes only slightly from one observation to the next. Mathematically,

p(X/Mₙ) = [(2π)^{N/2} |K|^{1/2}]⁻¹ exp[−½(X − Mₙ)ᵀK⁻¹(X − Mₙ)]   (6.40)

where Mₙ is a function of n and is to be learned from a sequence of classified learning observations X₁, X₂,..., Xₙ. Let Xₙ = Mₙ + ηₙ, n = 1, 2,..., where ηₙ is the (mean zero) noise component contaminated in the measurement. The noise components η₁,..., ηₙ₋₁, ηₙ, ηₙ₊₁,... are assumed statistically independent of each other, and ηₙ, n = 1, 2,..., are also independent of Mₙ. From the slowly varying nature of Mₙ, assume that Mₙ is developed by a random walk process where the random steps are the independent gaussian vectors Δₙ. That is,
Mₙ = Mₙ₋₁ + Δₙ₋₁
Mₙ₊₁ = Mₙ + Δₙ = Mₙ₋₁ + Δₙ₋₁ + Δₙ = ··· = M₀ + Δ₀ + Δ₁ + ··· + Δₙ   (6.41)

where Δₙ is gaussian distributed with mean zero and covariance matrix K_Δ. Roughly speaking, Mₙ can be considered as arising from a series of independent steps of length Δⱼ (a random variable) being added together. However, the model is inconvenient to apply due to the fact that Mₙ as defined is the sum of a large number of identically distributed random variables. As n increases, the components of Mₙ become unbounded with probability 1. This difficulty can be eliminated by introducing a constant 0 < a < 1 and changing (6.41) to
Mₙ = aMₙ₋₁ + Δₙ₋₁
Mₙ₊₁ = aMₙ + Δₙ = a²Mₙ₋₁ + aΔₙ₋₁ + Δₙ = ··· = aⁿ⁺¹M₀ + aⁿΔ₀ + aⁿ⁻¹Δ₁ + ··· + Δₙ   (6.42)

Later, a → 1 is taken in the final answer, so that the modification is only temporary. As in the case of supervised Bayesian learning, the successive estimates of the mean vector Mₙ are essentially the conditional
expectations of Xₙ₊₁, given the sequence of learning observations X₁, X₂,..., Xₙ, that is,

Mₙ = E[Xₙ₊₁/X₁,..., Xₙ] = E[Mₙ₊₁/X₁,..., Xₙ]   (6.43)

Similarly, Φₙ is the conditional covariance of Mₙ₊₁,

Φₙ = cov[Mₙ₊₁/X₁,..., Xₙ]   (6.44)

Let

M′ = E[Mₙ/X₁,..., Xₙ]   (6.45)

and

Φ′ = cov[Mₙ/X₁,..., Xₙ]   (6.46)

By iterative applications of Bayes' formula, the same as was done in Section 6.1, the following results are obtained:

M′ = K(Φₙ₋₁ + K)⁻¹Mₙ₋₁ + Φₙ₋₁(Φₙ₋₁ + K)⁻¹Xₙ   (6.47)

and

Φ′ = K(Φₙ₋₁ + K)⁻¹Φₙ₋₁   (6.48)

Since

Mₙ₊₁ = aMₙ + Δₙ   (6.49)

then

Mₙ = aM′ = aK(Φₙ₋₁ + K)⁻¹Mₙ₋₁ + aΦₙ₋₁(Φₙ₋₁ + K)⁻¹Xₙ   (6.50)

and

Φₙ = a²Φ′ + K_Δ = a²K(Φₙ₋₁ + K)⁻¹Φₙ₋₁ + K_Δ   (6.51)

For large n, Φₙ ≈ Φₙ₋₁. Solving (6.51) for Φₙ with slowly varying mean Mₙ (a ≈ 1), we obtain

Φₙ ≈ [K_ΔK⁻¹]^{1/2} K   (6.52)

Using (6.52) in (6.50) and expanding [I + (K_ΔK⁻¹)^{1/2}]⁻¹ in a Neumann series [16], we have

Mₙ ≈ [I − (K_ΔK⁻¹)^{1/2}]Mₙ₋₁ + (K_ΔK⁻¹)^{1/2}Xₙ   (6.53)
Equation (6.53) again shows that the new estimate Mₙ is a weighted average of the a priori mean vector and the sample information. A special example is given to bring out the significance of the results. Let

K_Δ = β²K   (6.54)

Then (6.53) becomes

Mₙ = (1 − β)Mₙ₋₁ + βXₙ   (6.55)
From (6.55), the slowly varying Mₙ is tracked by adding βXₙ to an attenuated version of the previous estimate Mₙ₋₁ as new learning observations arrive. It is noted that if Mₙ is stationary, then K_Δ → 0 as a → 1. Consequently, (6.50) and (6.51) reduce to (6.7) and (6.8), respectively.

6.4 Learning of Parameters Using an Empirical Bayes Approach
In the previous sections of this chapter, Bayesian estimation techniques have been applied to the estimation of unknown parameters θ in a probability distribution function when the a priori distribution of θ is assumed to have a convenient form (reproducing distributions). If θ itself is a random variable and its a priori distribution P(θ) is unknown, a more general formulation based on the empirical Bayes approach is suggested [17], [18]. In this section, the estimation of unknown parameters in a probability distribution using the empirical Bayes approach is presented. It is known that the unconditional distribution function of X can be expressed as

P(X) = ∫ P(X/θ) dP(θ)   (6.56)

where P(X/θ) is the conditional distribution of X given θ. Let the estimate of θ be of the form φ(X). Then the mean-square error is

E{[φ(X) − θ]²}   (6.57)

which is a minimum when

φ(X) = E[θ/X] = θ_X   (6.58)
The random variable θ_X defined by (6.58) is the Bayes estimator of θ corresponding to the a priori distribution P(θ). Equation (6.58) is of course the expected value of the a posteriori distribution of θ given X. If P(θ) is known, then (6.58) is a computable function. If P(θ) is unknown, let θ₁,..., θₙ be the sequence generated corresponding to the sequence of learning observations X₁,..., Xₙ. Assume that the θₙ, n = 1, 2,..., are independent with common distribution P(θ) and that the distribution of Xₙ depends only on θₙ. At the nth stage of the estimation process, i.e., after taking n learning observations X₁,..., Xₙ, if the previous values θ₁,..., θₙ₋₁ were by now known, the empirical distribution function of θ could be formed as

Pₙ₋₁(θ) = [number of terms θ₁,..., θₙ₋₁ which are ≤ θ]/(n − 1)

Example 1   P(X/θ) is the Poisson distribution

P(x/θ) = e^{−θ}θˣ/x!,  x = 0, 1, 2,...   (6.62)
P(θ) belongs to the class of all distribution functions on the positive real axis. In this case, from (6.56),

P(x) = ∫ (e^{−θ}θˣ/x!) dP(θ)   (6.63)

Then, from (6.58),

θ_x = ∫ θ e^{−θ}θˣ dP(θ) / ∫ e^{−θ}θˣ dP(θ)   (6.64)

From (6.63) and (6.64), the following relation can be written:

θ_x = (x + 1) P(x + 1)/P(x)   (6.65)

Let

φₙ(x) = (x + 1) · [number of terms x₁,..., xₙ which = x + 1]/[number of terms x₁,..., xₙ which = x]   (6.66)

Then, regardless of the unknown P(θ), we have, as n → ∞,

φₙ(x) → θ_x  with probability 1   (6.67)
This suggests using as the estimate of the unknown θₙ the computable quantity φₙ(xₙ), in the hope that, as n → ∞,

E[φₙ(xₙ) − θₙ]² → E[θ_x − θ]²   (6.68)

If θ has all its probability concentrated at a single value θ₀, i.e., p(θ) = δ(θ − θ₀), the Bayes estimator of θ is

θ_x = θ₀   (6.69)

which does not involve x at all. Hence,

E[θ_x − θ]² = 0   (6.70)

and

E[(X − θ)²] = E(θ) = θ₀   (6.71)
Example 2   P(X/θ) is the binomial distribution

P(x/θ) = [r!/(x!(r − x)!)] θˣ(1 − θ)^{r−x}   (6.72)

where r is the total number of trials, x the number of successes, and θ the unknown probability of success in each trial. P(θ) may be taken as the class of all distribution functions on the interval (0, 1). In this case,

P_r(x) = ∫ [r!/(x!(r − x)!)] θˣ(1 − θ)^{r−x} dP(θ)   (6.73)

and the Bayes estimator of θ, given x successes in the first r − 1 trials, is

θ_x = ∫ θ^{x+1}(1 − θ)^{r−1−x} dP(θ) / ∫ θˣ(1 − θ)^{r−1−x} dP(θ)   (6.74)

From (6.73) and (6.74), we can write

θ_x = (x + 1) P_r(x + 1)/[r P_{r−1}(x)]   (6.75)

Let

Pₙ,ᵣ(x) = [number of terms x₁,..., xₙ which = x]/n   (6.76)
then Pₙ,ᵣ(x) → P_r(x) with probability 1 as n → ∞. Now consider the sequence of learning observations x₁′, x₂′,..., xₙ′, where xₖ′ denotes the number of successes in the first (r − 1) out of the r trials which produced xₖ successes, and let

Pₙ,ᵣ₋₁(x) = [number of terms x₁′,..., x′ₙ₋₁ which = x]/(n − 1)   (6.77)

Thus, as n → ∞,

Pₙ,ᵣ₋₁(x) → P_{r−1}(x)
with probability 1. Define

φₙ,ᵣ(x) = (x + 1) Pₙ,ᵣ(x + 1)/[r Pₙ,ᵣ₋₁(x)]   (6.78)

Then, as n → ∞,

φₙ,ᵣ(x) → (x + 1) P_r(x + 1)/[r P_{r−1}(x)] = θ_x  with probability 1   (6.79)
This means that if φₙ,ᵣ(xₙ′) is used as the estimate of θₙ, the estimator, for large n, will do about as well as if the a priori distribution P(θ) were known. It is noted that if (6.56) is considered as a mixture distribution, then the estimation of θ and P(θ) from the sequence of learning observations X₁,..., Xₙ is related directly to the problem of nonsupervised learning discussed in Section 6.2. The identifiability condition of a mixture distribution will again play an important role in this class of estimation problems [8], [18].

6.5 A General Model for Bayesian Learning Systems
Pugachev has proposed a general model for Bayesian learning systems which includes various learning schemes in one formulation [19]-[21]. With probably a slightly different interpretation, the model is presented in this section. The central idea of Pugachev's model is to consider a real teacher (or trainer) who might not know the correct answer (for example, the correct classification of a learning observation) exactly. Let the input of the learning system be X. Corresponding to the input X, let the output of the learning system be Q̂, the output of the teacher be Q′, and the desired output be Q. In general, X, Q̂, Q, and Q′ are vector-valued random variables. The input-output relationship of the teacher can be expressed by the conditional probability density function p_t(Q′/X, Q, Q̂). In the special case of an ideal teacher (or trainer), the teacher knows the desired output Q, i.e.,

p_t(Q′/X, Q, Q̂) = δ(Q′ − Q)   (6.80)

which is independent of X and Q̂, where δ(·) is the Dirac delta function. For any teacher, Q′ in general does not coincide with Q. A simple block diagram of the system and teacher is shown in Fig. 6.1. If the teacher trains the system by demonstrating a sequence of learning observations X₁,..., Xₙ, then

p_t(Q′/X, Q, Q̂) = p_t(Q′/X)   (6.81)
Fig. 6.1. A general block diagram of learning systems.
which is independent of Q and Q̂. If the teacher trains the system by evaluating the system's performance (from its output), then

p_t(Q′/X, Q, Q̂) = p_t(Q′/X, Q̂)   (6.82)

which is independent of Q. The operation of the system is represented by the conditional probability density function p_s(Q̂/X). In the special case of a Bayesian optimal system,

p_s(Q̂/X) = δ(Q̂ − Q*)   (6.83)

where Q* is the Bayesian optimal output with respect to a given criterion. For example, for a Bayesian optimal classifier,

R(X, Q*) = min_{Q̂} R(X, Q̂)   (6.84)
For illustrative purposes, the discrete case is considered here. Let p(X, Q) be the joint probability density function of X and Q; in general, the function may contain a linear combination of δ functions at the points (X, Q) to which nonzero probabilities correspond. Assume that the functions p_t, p_s, and p are known functions of their arguments and depend on a finite number of unknown parameters which form the components of a vector θ. As discussed in Section 6.1, the unknown parameters θ can be estimated (learned) using Bayes formula. Let the a priori density function of θ be p₀(θ), and let the corresponding output sequences of the system and the teacher for the input sequence of learning observations X₁,..., Xₙ be Q̂₁,..., Q̂ₙ and Q₁′,..., Qₙ′, respectively. Assume that X₁,..., Xₙ are independent;
then the a posteriori probability density function of θ can be expressed as

p(θ/X₁,..., Xₙ; Q̂₁,..., Q̂ₙ; Q₁′,..., Qₙ′) = K p₀(θ) ∏ᵢ₌₁ⁿ pᵢ(Xᵢ, Q̂ᵢ, Qᵢ′/θ)   (6.85)
where K is a normalization constant. The subscript in pᵢ(Xᵢ, Q̂ᵢ, Qᵢ′/θ) indicates that the density may be different at different times. In the special case of an ideal teacher,
(6.86)
(6.87)
In the case that a real teacher trains the system by demonstrating a sequence of learning observations,

p(X, Q̂, Q′/θ) = p(X/θ) p_t(Q′/X, θ)   (6.88)

where

p(X/θ) = ∫ p(X, Q/θ) dQ   (6.89)

Then (6.85) becomes

p(θ/X₁,..., Xₙ; Q₁′,..., Qₙ′) = K₁ p₀(θ) ∏ᵢ₌₁ⁿ pᵢ(Xᵢ/θ) p_t(Qᵢ′/Xᵢ, θ)   (6.90)
In the case that the system learns using its own decisions,

p(θ/X₁,..., Xₙ; Q̂₁,..., Q̂ₙ) = K₂ p₀(θ) ∏ᵢ₌₁ⁿ pᵢ(Xᵢ/θ)   (6.91)
which is independent of the teacher's output. This class of learning systems is sometimes called decision-directed learning systems. Since the optimal system is defined in the sense of the Bayes criterion, it is easily seen that in this sense the ideal teacher may not be the best one. In fact, no system can learn, in general, to reproduce exactly the desired output Q that the ideal teacher does (for example, the case of zero probability of misrecognition). It can learn only to find appropriate estimators of Q. Hence, the teacher which trains the system to find optimal Bayes estimators of the desired output should be considered the best one. The output of such a teacher coincides with the output Q* of the optimal system minimizing the average risk. Assume that an operator A(θ) can be determined such that

Q* = A(θ)X   (6.92)

Consequently, the distribution of θ is entirely concentrated on the subset of values of θ defined by the equations

Qᵢ′ = A(θ)Xᵢ,  i = 1,..., n   (6.94)
If there exist r such equations with a unique solution for θ, then for any n ≥ r the distribution of θ is concentrated at one point, which corresponds to the true values of the unknown parameters θ. This can be done by solving the r equations simultaneously from r pairs of training samples (Xᵢ, Qᵢ′). It is noted that two kinds of information are needed for this operation. The first kind is the knowledge of (6.94), and the second kind is the r pairs of training samples with Qᵢ′ = Qᵢ*. With this amount of information, and by solving r equations of the form (6.94) simultaneously, the teacher is able to learn the true values of the unknown parameters θ, and from then on the output of the teacher will be Q*, which is Bayes optimal. The Q* from the output of the teacher is then used to train the system, and, in turn, the system will approach a Bayes optimal system.

Example   Let the real teacher be a binary linear classifier as shown in Fig. 6.2. For gaussian distributed pattern classes with equal
Fig. 6.2. A linear classifier.
covariance matrix, the Bayes optimal decision boundary is essentially a hyperplane. Let

Y = [x₁, x₂, 1]ᵀ  and  W = [w₁, w₂, w₃]ᵀ   (6.95)

Then the Bayes optimal output of the classifier can be expressed by

Q′ = Q* = Σᵢ₌₁² wᵢxᵢ + w₃ = WᵀY = A(W)Y   (6.96)

Thus, by applying three pairs of training samples (X₁, Q₁*), (X₂, Q₂*), and (X₃, Q₃*), the true values of the unknown parameters w₁, w₂, and w₃ can be obtained by solving the three equations

Qᵢ* = A(W)Yᵢ,  i = 1, 2, 3

simultaneously, with Yᵢ the augmented vector formed from Xᵢ as in (6.95).
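The teacher's finite-sample learning step, solving (6.94) for the three unknown weights from three training pairs, is an ordinary linear solve. A sketch (Python; the true weight values and the three sample points are illustrative assumptions):

```python
import numpy as np

# The teacher's finite-sample step: recover W from three pairs (X_i, Q_i*)
# by solving Q_i* = A(W)Y_i = W^T Y_i, i = 1, 2, 3, as in (6.96).
W_true = np.array([2.0, -1.0, 0.5])          # assumed true w1, w2, w3

def Q_star(X):
    """Bayes-optimal output Q* = w1*x1 + w2*x2 + w3 = W^T Y."""
    return W_true @ np.array([X[0], X[1], 1.0])

# Three training samples whose augmented vectors Y_i are linearly independent.
Xs = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
A = np.array([[x[0], x[1], 1.0] for x in Xs])     # rows are Y_i^T
b = np.array([Q_star(x) for x in Xs])

W_solved = np.linalg.solve(A, b)     # exact after r = 3 samples, not asymptotic
```

The recovery is exact after exactly r = 3 samples, in contrast to the asymptotic convergence of the Bayesian training of the system itself.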
It is noted that the training (or supervised learning) of the system based on the output of the real teacher using Bayesian techniques is an asymptotic process, i.e., the estimated parameter values converge to the true values only asymptotically (or after an infinite sequence of learning samples). However, with a known Bayes optimal operator A(θ) for the teacher, the teacher can learn the unknown parameters with only a finite number of training samples. (Presumably, the teacher has the capability of solving a set of r simultaneous equations.) If the system has the same structure as that of the teacher,
i.e., the Bayes optimal operator A(θ) can also be used for the system, then there will be no difference between the teacher and the system. Both of them can learn the unknown parameters with a finite set of training samples. In general, the teacher is considered to have a more complicated structure than that of the system, or at least the capability of solving equations such as (6.94).

6.6 Summary and Further Remarks
Bayes estimation techniques have been applied to the learning (estimation) of unknown parameters in a probability distribution (or density) function. If the parameters are fixed but unknown, by assuming a convenient form for the a priori distribution of the parameters, the true parameter values can be learned through an iterative application of Bayes formula. Both supervised and nonsupervised learning schemes are discussed. In nonsupervised learning, the unclassified learning observations are considered as coming from a mixture distribution with the probability distributions of the individual classes as component distributions. The Bayes estimation procedure can also be extended to the case where the unknown parameters are slowly time-varying. If the unknown parameters are themselves random variables with unknown a priori distribution functions, the empirical Bayes approach is suggested for the estimation of the parameters. Finally, the general Bayesian learning model proposed by Pugachev is presented. In this model, learning processes with an ideal teacher and with a real teacher can be put into one mathematical formulation. The class of reproducing distribution functions plays an important role in obtaining simple computational algorithms for the Bayesian estimation of unknown parameters. This may practically limit the applications of Bayesian estimation techniques. In using the mixture formulation, a unique solution can be obtained only if the identifiability condition is satisfied. However, it is not obvious that efficient computational algorithms can be obtained for the estimation even if the mixture is identifiable. It would be interesting to study the mixture estimation problems from the computational viewpoint, especially in the high-dimensional cases where the number of unknown parameters is large. Similarly, it would be desirable, from a practical viewpoint, to investigate the computational algorithms derived from Pugachev's general learning model.
References

1. T. W. Anderson, "An Introduction to Multivariate Statistical Analysis." Wiley, New York, 1958.
2. N. Abramson and D. Braverman, Learning to recognize patterns in a random environment. IRE Trans. Inform. Theory 8, 58-63 (1962).
3. D. G. Keehn, A note on learning for Gaussian properties. IEEE Trans. Inform. Theory 11, 126-132 (1965).
4. J. D. Spragins, Jr., A note on the iterative application of Bayes rule. IEEE Trans. Inform. Theory 11, 544-549 (1965).
5. Y. T. Chien and K. S. Fu, On Bayesian learning and stochastic approximation. IEEE Trans. Systems Sci. Cybernetics 3, 28-38 (1967).
6. R. Bellman, "Adaptive Control Processes-A Guided Tour." Princeton Univ. Press, Princeton, New Jersey, 1961.
7. H. Teicher, On the mixture of distributions. Ann. Math. Statist. 31, 55-73 (1960).
8. H. Teicher, Identifiability of finite mixtures. Ann. Math. Statist. 34, 1265-1269 (1963).
9. S. J. Yakowitz and J. D. Spragins, On the identifiability of finite mixtures. Ann. Math. Statist. 39, 209-214 (1968).
10. D. B. Cooper and P. W. Cooper, Nonsupervised adaptive signal detection and pattern recognition. Information and Control 7, 416-444 (1964).
11. P. W. Cooper, Some topics on nonsupervised adaptive detection for multivariate normal distributions. In "Computer and Information Sciences-II" (J. T. Tou, ed.). Academic Press, New York, 1967.
12. R. F. Daly, The adaptive binary-detection problem on the real line. Rept. 2003-3. Stanford Electron. Labs., Stanford, California, February 1962.
13. S. C. Fralick, Learning to recognize patterns without a teacher. IEEE Trans. Inform. Theory 13, 57-64 (1967).
14. D. F. Stanat, Nonsupervised pattern recognition through the decomposition of probability functions. Tech. Rept., Sensory Intelligence Lab., Univ. of Michigan, Ann Arbor, Michigan, April 1966.
15. J. W. Sammon, An adaptive technique for multiple signal detection and identification. In "Pattern Recognition" (L. Kanal, ed.). Thompson Book Co., Washington, D.C., 1968.
16. E. T. Whittaker and G. N. Watson, "Modern Analysis." Cambridge Univ. Press, London and New York, 1958.
17. H. Robbins, An empirical Bayes approach to statistics. Proc. Symp. Math. Statist. and Probability, 3rd, Berkeley, 1956, 1, 157-164. Univ. of California Press, Berkeley, California, 1956.
18. H. Robbins, The empirical Bayes approach to statistical decision problems. Ann. Math. Statist. 35, 1-20 (1964).
19. V. S. Pugachev, A Bayes approach to the theory of learning systems. Preprints, Proc. IFAC Conf., 3rd, June 1966.
20. V. S. Pugachev, Optimal training algorithms for automatic systems with nonideal trainers. Dokl. Akad. Nauk SSSR 172, No. 5, 1039-1042 (1967).
21. V. S. Pugachev, Optimal learning systems. Dokl. Akad. Nauk SSSR 175, No. 5, 762-764 (1967).
22. H. Cramér, "Mathematical Methods of Statistics." Princeton Univ. Press, Princeton, New Jersey, 1961.
CHAPTER 7
LEARNING IN SEQUENTIAL RECOGNITION SYSTEMS USING STOCHASTIC APPROXIMATION
7.1 Supervised Learning Using Stochastic Approximation
Stochastic approximation is a scheme for successive estimation of a sought quantity (the unknown parameter to be estimated) when, due to the stochastic nature of the problem, the measurements or observations contain certain errors. A brief introduction to stochastic approximation is given in Appendix F. Supervised learning schemes using stochastic approximation are discussed in this section.

In the learning of an unknown probability P(ωᵢ) from a set of classified learning observations X₁, X₂,..., Xₙ, let nᵢ denote the number of times that the observations are from class ωᵢ, with Σᵢ₌₁ᵐ nᵢ = n and Σᵢ₌₁ᵐ P(ωᵢ) = 1. Since the correct classifications of the learning observations are known, so is nᵢ. If the initial estimate of P(ωᵢ) is P₀(ωᵢ), with 0 ≤ P₀(ωᵢ) ≤ 1 and Σᵢ₌₁ᵐ P₀(ωᵢ) = 1, then the successive estimates of P(ωᵢ) can be formed by the following stochastic approximation algorithm:

Pₙ(ωᵢ) = Pₙ₋₁(ωᵢ) + γₙ[zₙ(ωᵢ) − Pₙ₋₁(ωᵢ)]

where zₙ(ωᵢ) = 1 if Xₙ is from class ωᵢ and zₙ(ωᵢ) = 0 otherwise, and {γₙ} is a sequence of positive coefficients, for example γₙ = (n + a)⁻¹, satisfying

limₙ→∞ γₙ = 0,  Σₙ₌₁^∞ γₙ = ∞,  and  Σₙ₌₁^∞ γₙ² < ∞
Since ‖M₀‖ < ∞ is assumed for the initial estimate, and E[‖ηₙ‖²] is bounded above by B, i.e.,

E[‖ηₙ‖²] = Σᵢ₌₁ᴺ E[(ηₙⁱ)²] < B < ∞   (7.13)

hence

E[‖M₀‖²] + Σₙ₌₁^∞ E[‖γₙηₙ‖²] ≤ E[‖M₀‖²] + B Σₙ₌₁^∞ γₙ² < ∞   (7.14)
which verifies Dvoretzky’s condition (N4) (Appendix F). Condition (N5) is satisfied for any measurable function qw(Ml,..., M,J. Therefore, by Dvoretzky’s theorem (special case 11) lim E[ll Mn - M \Iz]
=0
(7.15)
P(lim Mn = M } = 1
(7.16)
n-rm
n+w
which simply means that (7.8) is a special case of stochastic approximation with the convergence of the extimates to the true mean vector in the mean square sense and with probability 1.
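In the special case (6.9), the mean estimate takes the recursive form Mₙ = Mₙ₋₁ + (n + α)⁻¹(Xₙ − Mₙ₋₁), which is assumed here to be the recursion denoted (7.8). A numerical sketch (Python; the true mean, unit-variance noise, and α = 0.1 are illustrative assumptions):

```python
import numpy as np

# Recursive (stochastic-approximation) form of the mean estimate:
# M_n = M_{n-1} + (n + alpha)^{-1} (X_n - M_{n-1}).
rng = np.random.default_rng(5)
M_true = np.array([1.0, -2.0])       # illustrative true mean
alpha = 0.1
M = np.zeros(2)                      # initial estimate M0

for n in range(1, 5001):
    X = rng.normal(M_true, 1.0)      # unit-variance noise per component
    gamma = 1.0 / (n + alpha)        # gamma_n -> 0, sum = inf, sum sq < inf
    M = M + gamma * (X - M)
```

The decreasing gain (n + α)⁻¹ is exactly the weighting that makes Mₙ the prior-corrected sample mean, consistent with the mean-square and probability-1 convergence stated in (7.15) and (7.16).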
B. Learning the Covariance Matrix K of a Gaussian Distribution

Rewrite the estimation equation (6.14) as

Kₙ = (ν₀K₀ + Σᵢ₌₁ⁿ XᵢXᵢᵀ)/(ν₀ + n)   (7.17)

Since

Kₙ₋₁ = (ν₀K₀ + Σᵢ₌₁ⁿ⁻¹ XᵢXᵢᵀ)/(ν₀ + n − 1)

then (7.17) becomes

Kₙ = Kₙ₋₁ + γₙ(XₙXₙᵀ − Kₙ₋₁)   (7.18)

with

γₙ = (ν₀ + n)⁻¹   (7.19)

which satisfies conditions (7.3). It can be shown that (7.18) is again a special algorithm of Dvoretzky's stochastic approximation procedure [2]. As a result, the estimates obtained from (7.18) converge to the true covariance matrix (for every element) in the mean square sense and with probability 1. It is also possible to verify that the Bayesian estimation of both the mean vector and the covariance matrix of a gaussian distribution forms a stochastic approximation algorithm. The detailed analysis is omitted here.
C. Learning the Parameter of a Binomial Distribution

Equation (6.29) can be rewritten as

mₙ = mₙ₋₁ + (n + n₀)⁻¹(m − mₙ₋₁)   (7.20)

where m is the sample information obtained at the nth stage. Let

γₙ = (n + n₀)⁻¹   (7.21)

which satisfies conditions (7.3), and

m = p + ηₙ   (7.22)

so that

E(ηₙ) = 0   (7.23)
there exist ρ > 0 and R > 0 such that

(iii) |x′ − x″| < ρ  implies  |M(x′) − M(x″)| < R   (F.18)

(iv) For every δ > 0, there exists π(δ) > 0 such that

(v) |x − θ| > δ  implies  inf_{δ/2 > ε > 0} |M(x + ε) − M(x − ε)|/ε > π(δ)   (F.19)
If {a,} and {c,} are sequences of positive real numbers satisfying m
(vi) then
C a,
n=1
m
m
= 03,
C a,c, < 03, n=l
and
n=1
lim I?[(%, - 8)T = 0
n+w
where {x,} is defined by (F.15).
C( s r < 00,
(F.20)
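A minimal sketch of the Kiefer-Wolfowitz procedure of Theorem 3, with weighting sequences a_n = a/n and c_n = c/n^{1/3}, which satisfy the conditions in (vi); the quadratic regression function and the noise level are illustrative assumptions.

```python
import random

def kiefer_wolfowitz(y, x1=0.0, a=1.0, c=1.0, n_steps=4000):
    """Kiefer-Wolfowitz search for the maximum theta of a regression
    function M(x), observed only through noisy values y(x):

        x_{n+1} = x_n + (a_n/c_n) * [y(x_n + c_n) - y(x_n - c_n)]

    with a_n = a/n and c_n = c/n^(1/3).  These choices satisfy (vi):
    sum a_n diverges, while sum a_n c_n and sum (a_n/c_n)^2 are both
    sums of n^(-4/3) terms and hence converge."""
    x = x1
    for n in range(1, n_steps + 1):
        a_n = a / n
        c_n = c / n ** (1.0 / 3.0)
        x += (a_n / c_n) * (y(x + c_n) - y(x - c_n))
    return x

# Regression function M(x) = -(x - 2)^2, maximum at theta = 2,
# observed with additive zero-mean Gaussian noise.
random.seed(0)
y = lambda x: -(x - 2.0) ** 2 + random.gauss(0.0, 0.5)
theta_hat = kiefer_wolfowitz(y)
```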
It is noted that conditions (F.17), (F.18), and (F.19) are Lipschitz-type regularity conditions. Blum has proved that Theorem 3 holds even when (F.17) is not satisfied. In this case, Blum has also proved that

    P{lim_{n→∞} x_n = θ} = 1

The Kiefer-Wolfowitz procedure has been extended to the multidimensional case by Blum [3] and Sacks [10], and to the continuous case by Sakrison [11].

3. Dvoretzky's Generalized Procedure
Dvoretzky [12] has suggested that any stochastic approximation procedure may be viewed as an ordinary deterministic (error-free) successive approximation method with a random noise component superimposed upon it. On the basis of this concept, a generalized stochastic approximation algorithm is proposed as

    x_{n+1} = T_n(x_1,..., x_n) + z_n        (F.21)

where T_n(x_1,..., x_n) is the error-free transformation and z_n is the random noise component.

Theorem 4.  Let {α_n}, {β_n}, and {γ_n} be nonnegative real numbers satisfying

(D1)  lim_{n→∞} α_n = 0        (F.22)

(D2)  Σ_{n=1}^{∞} β_n < ∞        (F.23)

(D3)  Σ_{n=1}^{∞} γ_n = ∞        (F.24)

Let T_n be measurable transformations satisfying

(D4)  |T_n(r_1,..., r_n) − θ| ≤ max[α_n, (1 + β_n)|r_n − θ| − γ_n]

for all real r_1,..., r_n.
With

    T_n(x_1,..., x_n) = x_n + (a_n/c_n)[M(x_n + c_n) − M(x_n − c_n)]        (F.30)

    z_n = (a_n/c_n)[y(x_n + c_n) − M(x_n + c_n) − y(x_n − c_n) + M(x_n − c_n)]        (F.31)

the procedure is essentially the Kiefer-Wolfowitz procedure. Dvoretzky's proof of the theorem was simplified by Wolfowitz [13], revealing more of the essential structure of the process. The multidimensional generalization of Theorem 4 has been proved by Gray [14]. Two special cases of Dvoretzky's procedure are presented in the following, as they are extensively used in Section 7.1.

Special Case I (real random variables). Let T_n be measurable transformations
satisfying

(R1)  |T_n(r_1,..., r_n) − θ| ≤ F_n |r_n − θ|

(see also Molverton and Rawgen [26]), where {F_n} is a sequence of positive numbers satisfying Π_{n=1}^{∞} F_n = 0. Then the conditions

    E(x_1²) < ∞,   Σ_{n=1}^{∞} E(z_n²) < ∞

and

    E[z_n | x_1,..., x_n] = 0

with probability 1 for all n, imply

    lim_{n→∞} E[(x_n − θ)²] = 0

    P{lim_{n→∞} x_n = θ} = 1
Special Case II (normed linear space). Suppose that X_n and Z_n assume values in a normed linear space Ω, with ||Y|| denoting the norm of Y. Let θ be an element of Ω, let T_n be measurable transformations from the nth Cartesian power of Ω into Ω, and assume that

(N1)  ||T_n(r_1,..., r_n) − θ|| ≤ F_n ||r_n − θ||        (F.34)

where {F_n} is a sequence of positive numbers satisfying

(N2)  Π_{n=1}^{∞} F_n = 0

Define

(N3)  X_{n+1} = T_n(X_1,..., X_n) + Z_n        (F.35)

Then the conditions

(N4)  E[||X_1||²] < ∞,   Σ_{n=1}^{∞} E[||Z_n||²] < ∞        (F.36)

(N5)  E[||φ(X_1,..., X_n) + Z_n||²] ≤ E[||φ(X_1,..., X_n)||²] + E[||Z_n||²]        (F.37)

for every measurable function φ(X_1,..., X_n), imply

    lim_{n→∞} E[||X_n − θ||²] = 0        (F.38)

    P{lim_{n→∞} ||X_n − θ|| = 0} = 1        (F.39)
Block also proposed a more general type of stochastic approximation taking place in a normed vector space [15].

4. Methods of Accelerating Convergence
Two approaches have been suggested for accelerating the convergence of a stochastic approximation procedure. The first approach is to accelerate convergence by selecting a proper weighting sequence {a_n}, or {a_n} and {c_n}, etc. Intuitively speaking, an intelligent choice of the weighting sequence, based on information about the behavior of the regression function, should improve the rate of convergence. Historically, the first method of accelerating the convergence of a stochastic approximation procedure was proposed by Kesten [16]. The basic idea is that when the estimate is far from the sought quantity θ there will be few changes of sign of (x_n − x_{n−1}); near the goal θ we would expect overshooting to cause oscillation from one side of θ to the other. Kesten therefore proposed using the number of sign changes of (x_n − x_{n−1}) to indicate whether the estimate is near or far from θ. Specifically, the gain a_n is not decreased as long as (x_n − x_{n−1}) retains its sign. Mathematically, the algorithm can be written in the form of Dvoretzky's procedure (F.40), where d_1 = a_1, d_2 = a_2, d_n = d_{s(n)}, and

    s(n) = 2 + Σ_{i=3}^{n} Φ[(x_i − x_{i−1})(x_{i−1} − x_{i−2})]        (F.41)

with

    Φ(x) = 1   for x < 0
         = 0   for x ≥ 0
This means that d_n is constant so long as (x_n − x_{n−1}) and (x_{n−1} − x_{n−2}) have the same sign. The algorithm (F.40) converges with probability 1.

Fabian has proposed the following accelerated algorithms [17]:

    x_{n+1} = x_n + a_n sgn[α − y(x_n)]        (F.42)

for the Robbins-Monro procedure, and

    x_{n+1} = x_n + (a_n/c_n) sgn[y(x_n + c_n) − y(x_n − c_n)]        (F.43)

for the Kiefer-Wolfowitz procedure.
Algorithms (F.42) and (F.43) converge to their sought quantities, respectively, only in a comparatively narrow class of problems, namely those in which the distribution function of the random variable y is symmetric with respect to θ. Another scheme for accelerating convergence, proposed by Fabian, is an analogy of the method of steepest descent. The scheme can be summarized as follows. For given x_n and y(x_n), take a series of (noisy) observations V_k of the quantity M(x_n + k a y(x_n)), k = 1, 2,.... Assume that the V_k's are independent of x_n and y_n. Select a_n = ka when

    sgn V_1 = sgn V_2 = ··· = sgn V_{k−1} = sgn V_k = −sgn V_{k+1}        (F.44)

for the Robbins-Monro procedure, or when

    V_1 > V_2 > ··· > V_{k−1} > V_k < V_{k+1}        (F.45)
for the Kiefer-Wolfowitz procedure. Under rather general conditions on the V_k, the suggested scheme converges with probability 1.

The second approach to accelerating convergence is to take more observations at each stage of iteration. Intuitively speaking, taking more observations at each stage explores the regression function in more detail than the original stochastic approximation procedure does [18], [19], and, consequently, the extra information can be utilized to improve the rate of convergence. Venter and Fabian have proposed accelerated algorithms for the Robbins-Monro and Kiefer-Wolfowitz procedures, respectively [20], [21]. For illustrative purposes, consider Venter's procedure, which estimates the slope of the regression function at the root by taking two observations at each stage and uses this information to improve the rate of convergence and the asymptotic variance of the Robbins-Monro procedure. The recursive algorithm is (F.46),
where y_n′ and y_n″ are random variables whose conditional distributions, given y_k′, y_k″, k = 1,..., (n − 1), are independent and identical to those of y(x_n + c_n) and y(x_n − c_n), respectively; {c_n} and {d_n} are two sequences of positive numbers. A_n is an estimate of the slope ᾱ, defined as follows: assuming that 0 < a < ᾱ < b < ∞, with a and b known, let

    B_n = n^{−1} Σ_{j=1}^{n} (y_j′ − y_j″)/2c_j        (F.47)

and

    A_n = B_n   if a ≤ B_n ≤ b
        = a     if B_n < a
        = b     if B_n > b        (F.48)
The algorithm (F.46) converges with probability 1. If also E(x_1²) < ∞, it converges in the mean-square sense.
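Since (F.46) itself is not reproduced above, the sketch below uses an assumed Venter-type update x_{n+1} = x_n − [a_n/A_n]·½(y_n′ + y_n″) with a_n = 1/n, together with the truncated slope estimate (F.47)-(F.48); the gain, the truncation bounds, and the test function are illustrative assumptions, not the book's exact algorithm.

```python
import random

def venter(y, x1=0.0, a_lo=0.5, b_hi=5.0, n_steps=3000):
    """Venter-type Robbins-Monro procedure: two observations per
    stage, taken at x_n + c_n and x_n - c_n.  Their average drives
    the root search, while their difference feeds the running slope
    estimate B_n of (F.47), truncated to [a_lo, b_hi] as in (F.48)."""
    x, slope_sum = x1, 0.0
    for n in range(1, n_steps + 1):
        c_n = n ** -0.25
        y_plus, y_minus = y(x + c_n), y(x - c_n)
        slope_sum += (y_plus - y_minus) / (2.0 * c_n)
        B_n = slope_sum / n                      # (F.47)
        A_n = min(max(B_n, a_lo), b_hi)          # (F.48)
        # Assumed form of (F.46): scale the step by the slope estimate.
        x -= (y_plus + y_minus) / (2.0 * n * A_n)
    return x

# Regression function M(x) = 2(x - 1): root theta = 1, slope 2.
random.seed(0)
y = lambda x: 2.0 * (x - 1.0) + random.gauss(0.0, 0.3)
root = venter(y)
```

Dividing the step by A_n makes the effective gain approximately 1/(n·slope), which is what improves the asymptotic variance relative to a fixed gain a/n.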
The same idea can be carried over to the Kiefer-Wolfowitz procedure. In this case, three observations are taken at each stage of iteration, and the appropriate second-order differences of the observations are used to estimate the second-order derivative of the regression function at the maximum (or minimum). This information is then utilized to determine the next estimate x_{n+1} of the maximum (or minimum). In a similar idea proposed by Fabian, the Kiefer-Wolfowitz procedure can be modified in such a way as to be almost as speedy as the Robbins-Monro procedure. The modification consists of taking more observations at every stage of iteration and utilizing this information to eliminate (smooth out) the effect of all higher-order derivatives of the regression function.

5. Dynamic Stochastic Approximation
Fabian and Dupač have considered the case in stochastic approximation where the sought quantity θ moves during the iteration process. The following presentation is based on Dupač's discussion [22].
A. The Modified Robbins-Monro Procedure

Let

    M_n(x) = M(x − θ_n + θ_1),   n = 1, 2,...        (F.49)

such that θ_n is the unique root of M_n(x) = 0. Let {a_n} be a sequence of positive numbers, and let x_1 be an arbitrary random variable (initial estimate). Define

    x_{n+1} = x_n* − a_n y(x_n*),   n = 1, 2,...        (F.50)

where x_n* is the corrected estimate given by (F.51), the conditional expectation of y(x_n*) is given by (F.52), and

    var[y(x_n*) | x_1,..., x_n] ≤ σ² < ∞        (F.53)

The meaning of the algorithm (F.50) is the following: at the (n + 1)th stage of iteration an estimate of θ_{n+1} is determined. Starting from the preceding estimate x_n, first make a correction based on (F.51); then estimate the value of M_{n+1} at x_n* by means of the observation y(x_n*); and, finally, take a further correction, −a_n y(x_n*). It will be seen from Theorem 5 and its corollary that the use of the algorithm is justified when θ_n is a linear (nearly linear) function of n.

Theorem 5.  Suppose that the following conditions are satisfied:
(i)  M(x) < 0 for x < θ_1, and M(x) > 0 for x > θ_1        (F.54)

(ii)  There exist K_0, K_1 such that

    K_0 |x − θ_1| ≤ |M(x)| ≤ K_1 |x − θ_1|   for −∞ < x < +∞        (F.55)

(iii)  For n = 1, 2,...,

    a_n = a/n^α,   a > 0,   ½ < α ≤ 1        (F.57)

(iv)  Further,

    E(x_1²) < ∞        (F.58)

Then (x_n − θ_n) approaches zero in the mean, and

    E[(x_n − θ_n)²] = O(n^{−α})   or   O(n^{−2(1−α)})

according to the value of α.
There exist K_2, K_3, K_4 such that

    K_2 |x − θ_1| ≤ |M′(x)| ≤ K_3 |x − θ_1|,   |M″(x)| ≤ K_4   for −∞ < x < +∞        (F.67)

For n = 1, 2,...,

    a_n = a/n^α,   a > 0,   c_n = c/n^γ,   c > 0,   ½ < α < 1,   γ > 0

Then

    E[(x_n − θ_n)²] → 0   as n → ∞        (F.70)

at a rate determined by α and γ.
References
1. H. Robbins and S. Monro, A stochastic approximation method. Ann. Math. Statist. 22, No. 1, 400-407 (1951).
2. J. A. Blum, Approximation methods which converge with probability one. Ann. Math. Statist. 25, No. 2, 382-386 (1954).
3. J. A. Blum, Multidimensional stochastic approximation procedures. Ann. Math. Statist. 25, No. 4, 737-744 (1954).
4. E. G. Gladyshev, On stochastic approximation. Teor. Veroyatnost. i Primenen. 10, No. 2 (1965).
5. M. Driml and N. Nedoma, Stochastic approximations for continuous random processes. Trans. 2nd Prague Conf. Inform. Theory, Statist. Dec. Functions, Random Processes, Prague, 1960.
6. O. Hanš and A. Špaček, Random fixed point approximation by differentiable trajectories. Trans. 2nd Prague Conf. Inform. Theory, Statist. Dec. Functions, Random Processes, Prague, 1960.
7. M. Driml and O. Hanš, Continuous stochastic approximations. Trans. 2nd Prague Conf. Inform. Theory, Statist. Dec. Functions, Random Processes, Prague, 1960.
8. J. Kiefer and J. Wolfowitz, Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23, No. 3, 462-466 (1952).
9. J. A. Blum, A note on stochastic approximation. Proc. Amer. Math. Soc. 9, 404-407 (1958).
10. J. Sacks, Asymptotic distribution of stochastic approximation procedures. Ann. Math. Statist. 29, No. 2, 373-405 (1958).
11. D. L. Sakrison, A continuous Kiefer-Wolfowitz procedure for random processes. Ann. Math. Statist. 35, No. 2, 590-599 (1964).
12. A. Dvoretzky, On stochastic approximation. Proc. 3rd Berkeley Symp. Math. Statist. and Probability, 1956, Vol. 1. Univ. of California Press, Berkeley, California, 1956.
13. J. Wolfowitz, On stochastic approximation methods. Ann. Math. Statist. 27, 1151-1156 (1956).
14. K. B. Gray, Application of stochastic approximation to the optimization of random circuits. Proc. Symp. Appl. Math., 16th, 1964. Amer. Math. Soc., Providence, Rhode Island, 1964.
15. H. D. Block, On stochastic approximation. Unpublished report, Dept. of Math., Cornell Univ., Ithaca, New York, 1956.
16. H. Kesten, Accelerated stochastic approximation. Ann. Math. Statist. 29, No. 1, 41-59 (1958).
17. V. Fabian, Stochastic approximation methods. Czechoslovak Math. J. 10, No. 1, 123-159 (1960).
18. D. Burkholder, On a class of stochastic approximation processes. Ann. Math. Statist. 27, No. 4, 1044-1059 (1956).
19. H. D. Block, Estimates of error for two modifications of the Robbins-Monro stochastic approximation process. Ann. Math. Statist. 28, No. 4, 1003-1010 (1957).
20. J. H. Venter, An extension of the Robbins-Monro procedure. Ann. Math. Statist. 38, No. 1, 181-190 (1967).
21. V. Fabian, Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Statist. 38, No. 1, 191-200 (1967).
22. V. Dupač, A dynamic stochastic approximation method. Ann. Math. Statist. 36, 1695-1702 (1965).
23. D. J. Wilde, "Optimum Seeking Methods." Prentice-Hall, Englewood Cliffs, New Jersey, 1964.
24. V. Fabian, A stochastic approximation method for finding optimal conditions in experimental work and in self-adapting systems. Aplikace Matematiky 6, 162-183 (1961).
25. L. Schmetterer, Stochastic approximation. Proc. 4th Berkeley Symp. Math. Statist. and Probability, 1961, Vol. 1. Univ. of California Press, Berkeley, California, 1961.
26. C. T. Molverton and J. T. Rawgen, A counterexample to Dvoretzky's stochastic approximation theorem. IEEE Trans. Inform. Theory 14, 157-158 (1968).
APPENDIX G
THE METHOD OF POTENTIAL FUNCTIONS OR REPRODUCING KERNELS
The potential function method, introduced and studied extensively by Aiserman, Braverman, and Rozonoer, has been used for the successive approximation of unknown uniformly bounded continuous functions, which may be either deterministic or stochastic. The unknown function may be, for example, a discriminant function, a response function of a system, or a probability distribution (or density) function. In this appendix the method of potential functions is briefly introduced [1]-[5], and several applications are described.

Consider a function f(X) whose exact behavior is unknown. The only information available is some knowledge about the value of f(X) at certain points, X_1, X_2,..., X_n. The problem of learning is to construct, from the available information, a function which converges to f(X) in a certain sense. The space of X, Ω_X, is in general multidimensional, and the points of observation, X_1,..., X_n, cannot be chosen at will but occur independently in a random manner. Also, the information on f(X_i), i = 1,..., n, may be noisy or only partially measurable [for instance, only the sign of f(X_i) may be measurable]. Hence, the usual extrapolation techniques are practically inapplicable for solving such learning (estimation) problems; this is where the potential function method is suitable.

The general formulation of the potential function method can be stated as follows. Let φ_i(X), i = 1, 2,..., be a complete set of functions defined on Ω_X. Suppose that the function f(X) to be learned can be represented by the expansion

    f(X) = Σ_{i=1}^{∞} c_i φ_i(X)        (G.1)

The coefficients c_i are unknown a priori. The learning measurements or observations X_1,..., X_n are assumed statistically independent and
distributed according to an unknown probability density function p(X). A function of two variables, called the "potential function," is introduced as

    K(X, Y) = Σ_{i=1}^{∞} λ_i² φ_i(X) φ_i(Y)        (G.2)

where the λ_i are real numbers chosen in such a way that the function K(X, Y) is bounded. After n observations, X_1,..., X_n, are taken, the nth estimate of the function f(X), denoted by f_n(X), can be computed from the following general algorithm

    f_n(X) = f_{n-1}(X) + r_n K(X, X_n)        (G.3)

with f_0(X) = 0. Here r_n depends upon the type of information received about f(X_n). The function f(X) is assumed to be sufficiently smooth so that the condition

    Σ_{i=1}^{∞} (c_i/λ_i)² < ∞        (G.4)

is satisfied. If f(X) admits the finite expansion

    f(X) = Σ_{i=1}^{M} c_i φ_i(X)        (G.5)

then the potential function may be chosen to be

    K(X, Y) = Σ_{i=1}^{M} λ_i² φ_i(X) φ_i(Y)        (G.6)

and the condition (G.4) is automatically satisfied for λ_i ≠ 0. In addition, it also follows that

    f_n(X) = Σ_{i=1}^{M} c_{in} φ_i(X)        (G.7)

where

    c_{in} = c_{i,n-1} + r_n λ_i² φ_i(X_n)        (G.8)
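The coefficient update (G.8) can be sketched as follows, using the noisy-measurement gain r_n = γ_n[y_n − f_{n-1}(X_n)] introduced later in (G.18). The particular basis φ_i, the choice λ_i² = 1, the gain sequence, and the target function are all illustrative assumptions.

```python
import math, random

# A finite basis phi_i on [0, 2*pi]; potential function as in (G.6).
PHI = [lambda x: 1.0,
       lambda x: math.sin(x),
       lambda x: math.cos(x)]
LAM2 = [1.0, 1.0, 1.0]                       # lambda_i^2 (illustrative)

def learn(samples, gamma=lambda n: 1.0 / (n + 1)):
    """Algorithm (G.3), f_n = f_{n-1} + r_n K(., X_n), tracked through
    the expansion coefficients c_{in} of (G.7)-(G.8), with the
    noisy-measurement choice r_n = gamma_n [y_n - f_{n-1}(X_n)]."""
    c = [0.0] * len(PHI)                     # f_0(X) = 0
    for n, (x, y) in enumerate(samples, start=1):
        f_prev = sum(ci * p(x) for ci, p in zip(c, PHI))
        r = gamma(n) * (y - f_prev)
        for i, p in enumerate(PHI):
            c[i] += r * LAM2[i] * p(x)       # update (G.8)
    return c

# Target f(X) = 0.5 + 2 sin X, observed with zero-mean noise.
random.seed(0)
f = lambda x: 0.5 + 2.0 * math.sin(x)
samples = [(x, f(x) + random.gauss(0.0, 0.2))
           for x in (random.uniform(0.0, 2.0 * math.pi)
                     for _ in range(20000))]
c = learn(samples)
```

The recovered coefficients approach (c_1, c_2, c_3) = (0.5, 2, 0), i.e. the c_i of the finite expansion (G.5) for this target.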
Four possible applications are now discussed.

1. The Estimation of a Function with Noise-Free Measurements

Let y = f(X) and assume that the learning observations y_1, y_2,..., y_n, where y_i = f(X_i), are noise-free. Let also X_1,..., X_n be independently distributed according to some unknown probability density function p(X). The learning algorithm (G.3) can be applied to estimate f(X) with

    r_n = γ_n sgn[y_n − f_{n-1}(X_n)]        (G.9)

where

    sgn(u) = +1   for u ≥ 0
           = −1   for u < 0

and γ_n satisfies condition (G.11). Then

    lim_{n→∞} P{[f(X) − f_n(X)]² ≥ ε} = 0        (G.15)

when (G.12) is used. If conditions (G.5) and (G.6) are satisfied, and if condition (G.16) holds for any p_1,..., p_M that do not vanish simultaneously, then f_n(X) converges to f(X) according to (G.14) and (G.15) not only in probability but also with probability 1.
2. The Estimation of a Function with Noisy Measurements
In this case, the observations y_1, y_2,..., y_n are noisy. Let

    y_n = f(X_n) + ξ_n        (G.17)

where the ξ_n are independent random variables (noise) with zero mean and finite variances. Also, the conditional probability density function p(ξ_n/X_n) is assumed not to be a function of n. Under such conditions, it is suggested that, in the algorithm (G.3),

    r_n = γ_n [y_n − f_{n-1}(X_n)]        (G.18)

where γ_n satisfies the condition (G.11). It can be shown that

    lim_{n→∞} E{[f(X) − f_n(X)]²} = 0        (G.19)

that is, by applying the algorithm (G.3) with (G.18), the estimate f_n(X) converges to f(X) in the mean-square sense. Similarly, if (G.5), (G.6), and (G.16) are satisfied, then f_n(X) also converges to f(X) with probability 1.

3. Pattern Classification-Deterministic Case
Suppose that the input patterns to the classifier are from one of two possible pattern classes, w_1 and w_2, and assume that w_1 and w_2 are mutually exclusive. The decision function f(X) used by the classifier is such that, for an ε > 0,

    if f(X) > ε,    then X ∈ w_1
    if f(X) < −ε,   then X ∈ w_2        (G.20)

A special case which is commonly used is

    sgn f(X) = +1,   X ∈ w_1
             = −1,   X ∈ w_2        (G.21)

Let the learning observations X_1,..., X_n from both pattern classes be independently distributed according to some unknown probability density function p(X). The X_1,..., X_n are the feature vectors characterizing the input patterns (learning samples) with known classifications. The algorithm (G.3) can be used to establish the estimates f_n(X) with

    r_n = ½[sgn f(X_n) − sgn f_{n-1}(X_n)]        (G.22)
If the condition (G.4) is satisfied, then the convergence property (G.23) can be proved. If, in addition to the condition (G.4), the statistics of the learning observations satisfy the condition that, for 0 < k < ∞, whenever X_1,..., X_n are not completely separated (correctly classified) by f_n(X), there is a strictly positive probability of occurrence of an X_{n+1} that reduces the misclassifications, then such a k can be found with probability 1. In other words, with probability 1, f_n(X) converges to f(X) within a finite number of iterations (observations) k. More specifically, the rate of convergence and the stopping rule for the learning process can be investigated in terms of the number of corrections made in cases of misclassification. Let L be an infinite sequence of learning observations drawn from w_1 and w_2. It can be shown that there exists a quantity J, given by (G.24), which is independent of the choice of L, such that the number of corrections (of misclassifications) S ≤ J. The learning process is considered to be terminated if, after S corrections of misclassifications, there are no corrections during the subsequent S_1 observations. Let the probability of misclassification after S + S_1 learning observations be P_{S+S_1}(ε). Then S_1 can be selected according to the following stopping rule: for ε > 0 and δ > 0,

    S_1 ≥ log δ / log(1 − ε)        (G.26)
4. Pattern Classification-Statistical Case
In this case, the classification of input patterns is based on the set of probabilities P(w_i/X), i = 1,..., m, where m is the number of pattern classes. Since the P(w_i/X) are unknown a priori, the potential function method is applied to estimate these probabilities. Let f(X) = P(w_i/X) and consider a random function φ(X) such that

    φ(X) = 1   if X ∈ w_i
         = 0   if X ∈ w̄_i        (G.27)

where w̄_i is the complement of w_i. It is noted that, from (G.27),

    P{φ(X) = 1} = f(X) = P(w_i/X)        (G.28)

    P{φ(X) = 0} = 1 − f(X)        (G.29)

so that, by (G.30) and (G.31), φ(X) = f(X) + ζ, where ζ is a random variable with zero mean and finite variance. Then the problem is reduced to that of estimating f(X) from noisy measurements, as in Section 2. An alternative approach is to consider a "~ operator" such that, when it operates on a function ψ(X),

    ψ̃(X) = 0
         = ψ(X)
         = 1     if