Pattern Recognition 33 (2000) 1–7
Spatio-temporal target identification method of high-range resolution radar

Dequan Zhou*, Guosui Liu, Jianxin Wang

Section 116, Air Force No. 1 Institute of Aeronautics, Xinyang City, 464000, Henan Province, People's Republic of China
Department of Electrical Engineering, Nanjing University of Science and Technology, Nanjing, 210094, People's Republic of China

Received 13 December 1996; received in revised form 13 April 1998
Abstract

A spatio-temporal method for target identification of high-range resolution radar is proposed. The hidden Markov model (HMM) technique is used as the classification algorithm, making classification decisions based on a spatio-temporal sequence of radar range profiles. The effectiveness of the method is investigated in identification experiments using extensive data sets of real C-band inverse synthetic aperture radar (ISAR) range profiles of three different aircraft. An identification accuracy of 93.01% is obtained for sequences of range profiles generated over a large aspect angle. © 1999 Published by Elsevier Science Ltd. All rights reserved.

Keywords: Target identification; Range profiles; Hidden Markov models; Vector quantizers; High-range resolution radars
1. Introduction

Ideally, radar systems should have the ability to provide a reliable identification of every target they detect and/or track, especially when they are used to detect hostile targets such as aircraft, missiles, tanks and so on. Recently, several papers [1–3] have discussed target identification methods using the radar range profiles of high-resolution radar. A radar range profile can be thought of as the one-dimensional projection of the spatial distribution of radar target reflectivity onto the radar line of sight. If a profile were unique to a particular aircraft class, then a decision processing algorithm could identify that aircraft based on the information contained within a single radar range profile. However, this "unique profiles" condition is not encountered with real radar targets. If the aspect of an aircraft (the angle its nose makes with the radar line of sight, shown in Fig. 1)
* Corresponding author.
changes, then the relative ranges of the backscattering points of the aircraft may change. If two points that are ordinarily in the same range resolution bin no longer are, owing to the aspect change, the range migration phenomenon occurs and the range profile fluctuates greatly. To avoid range migration, the aspect change Δθ would have to satisfy

Δθ < Δr/W,  (1)
where Δr is the range resolution of the radar and W is the maximum length of the aircraft. In this study, Δr is about 0.5 m and W is 36.8 m, so the aspect change Δθ would have to be less than 0.78°. Even if the aspect does not change enough to cause range migration, changes in the relative range of different scatterers by a fraction of the radar wavelength can cause the radar range profiles to fluctuate due to interference effects. If the range change of a point is ΔR ≈ λ/4, where λ is the radar wavelength, then the contribution of the point to the range profile will change from constructive to destructive, or vice versa, and the range profile will fluctuate significantly. This is the phenomenon of
Fig. 1. Aspect of an aircraft.
speckle. To avoid speckle fluctuations, the aspect change would have to satisfy

Δθ < λ/(4W).  (2)
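To make the two bounds concrete, the following sketch evaluates constraints (1) and (2) for the parameters used in this study. The range resolution and aircraft length are taken from the text; the C-band wavelength is an assumed nominal value (about 5 cm), since it is not stated in the paper.

import math

delta_r = 0.5        # range resolution in metres (from the text)
W = 36.8             # maximum aircraft length in metres (from the text)
wavelength = 0.05    # assumed nominal C-band wavelength in metres

migration_bound = delta_r / W          # constraint (1), in radians
speckle_bound = wavelength / (4 * W)   # constraint (2), in radians

print(f"range-migration bound: {math.degrees(migration_bound):.2f} deg")  # about 0.78
print(f"speckle bound:         {math.degrees(speckle_bound):.3f} deg")    # about 0.02

Under this assumed wavelength, the speckle bound is roughly forty times tighter than the migration bound, which is why neither can be expected to hold for a maneuvering aircraft.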
It can be seen that we cannot expect the aspect change of a flying aircraft to satisfy constraints (1) and (2) when short wavelengths are used. The range profile of a maneuvering flying aircraft is constantly changing, and we may obtain infinitely many range profiles within a small aspect range. In order to characterize an aircraft, a large number of profiles from all possible aspects would be required. Hudson and Psaltis used a set of correlation filters [2], each of which represents an aircraft over a limited range of aspect, to characterize an aircraft. The average correct-identification rate for an individual range profile without regard to aspect was 57%. When the estimated aspect of the aircraft was taken into account, an average correct-identification rate of 79% was obtained. To improve identification effectiveness, they used a compound identification approach in which they put a series of profiles through the correlator, identified each, and averaged the identifications to arrive at a single "best" identification. The result for frames of eight range profiles was a correct-identification rate of 84%.

Most of the above experimental results were for nose-on data obtained over an aspect range of 0–20°. Why was the identification accuracy not high for such a small aspect angle range? We believe there are two reasons: (1) the conventional idea of radar target identification is to find and employ invariant features to identify a target; unfortunately, it is very difficult, and may not be possible, to obtain invariant features of an aircraft; (2) the temporal evolution information among profiles, which reflects the change of spatial features with time, was not exploited; only the information contained within an individual range profile was used. The relationships among the profiles may be more important than the individual profiles during the identification process. Seibert and Waxman gave an example to explain this importance [4]. Consider an attentive observer watching an object as it moves naturally past him. He first sees the front of the object as it approaches, then the transition to the rear view, and finally the rear view as the object moves beyond him.
He experiences not only multiple views of the object but also the relationships between the views as well as the key features. He may learn the views and the view transitions. He may also learn how to explore objects actively for their important features with attentional mechanisms such as directed eye motions.

A spatio-temporal target identification idea for high-range resolution radar is proposed in this paper based on the above evidence. The idea is not only to try to find an invariant feature, but also to use a model to learn the information contained in the range profile sequences of a target class; the models of the different potential classes are then used to identify radar targets, here aircraft. The hidden Markov model is used to characterize and identify the radar range profile sequences, i.e. the radar targets. We concentrate on modeling the time-varying range profiles of a maneuvering aircraft for the purpose of aircraft identification. We give a brief description of the HMM in Section 2. The detailed experimental procedure and results are reported in Section 3, and the conclusion is discussed in Section 4.

2. Hidden Markov models

An HMM is represented by two interrelated mechanisms: an underlying Markov chain having a finite number of states, and a set of random functions, one of which is associated with each state. HMM techniques have been used extensively in the area of speech recognition over the past decade because of their ability to learn time-varying signals [5]. Recently, they have also been used successfully in hand-written character [6] and visual target [7] recognition.

A discrete HMM is typically defined by the state-transition probability matrix A, the observation symbol probability matrix B, the initial state probability distribution Π, the number of states N, and the number of observation symbols M [5]. The notation for the HMM is now formally defined as follows:

(1) N, the number of states in the model. Generally, the states are interconnected in such a way that any state can be reached from any other state (i.e. an ergodic model); however, other interconnections of states are often of interest. We denote the individual states as S = {s_1, s_2, …, s_N}, and the state at time t as q_t.

(2) M, the number of distinct observation symbols per state, i.e. the discrete alphabet size. The observation symbols correspond to the physical output of the system being modeled. We denote the individual symbols as V = {v_1, v_2, …, v_M}, and the observation at time t as o_t.

(3) The state-transition probability distribution A = {a_ij}, where

a_ij = P[q_{t+1} = s_j | q_t = s_i], 1 ≤ i, j ≤ N,  (3)
and a_ij is subject to the constraint

Σ_{j=1}^{N} a_ij = 1, 1 ≤ i ≤ N.  (4)

For the special case where any state can reach any other state in a single step, we have a_ij > 0 for all i, j. For other types of HMMs, we would have a_ij = 0 for one or more (i, j) pairs.

(4) The observation symbol probability distribution B = {b_j(k)} in state j, where b_j(k) represents the probability of observing symbol k in state j, i.e.

b_j(k) = P[v_k at t | q_t = s_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M,  (5)
and b_j(k) should satisfy

Σ_{k=1}^{M} b_j(k) = 1, 1 ≤ j ≤ N.  (6)

(5) The initial state probability distribution Π = {π_i}, where

π_i = P[q_1 = s_i], 1 ≤ i ≤ N,  (7)

and π_i should satisfy

Σ_{i=1}^{N} π_i = 1.  (8)

It can be seen from the above discussion that a complete specification of an HMM requires specification of two model parameters (N and M), specification of the observation symbols, and specification of the three probability measures A, B and Π. For convenience, we use the compact notation

λ = (A, B, Π)  (9)
to indicate the complete parameter set of the model. An HMM can model a set of observations using these probabilistic parameters as a probabilistic function of an underlying Markov chain whose state transitions are not directly observable.

Two problems must be solved when we use HMMs to identify radar targets. First, how can we obtain the model parameters of an HMM, based on an optimization criterion, after a number of training observation sequences of a known class are given? Second, given the models and a new observation sequence O_new = o_1, o_2, …, o_T, how can we classify the observation efficiently? The former can be solved using the Baum–Welch re-estimation algorithm [5]. Here, an observation sequence corresponds to a set of feature vectors obtained from the range profiles in the sequence. The model of class l is denoted by λ_l, 1 ≤ l ≤ K, where K is the number of potential target classes. The latter problem can be solved in two steps: (1) use the Viterbi algorithm [5] to calculate the probability that model λ_l produces the new observation O_new, denoted P(O_new|λ_l); (2) obtain the identification of O_new by choosing the maximum P(O_new|λ_l) over all models, written as

C = arg max_{1 ≤ l ≤ K} P(O_new|λ_l),  (10)

where l is an index over the model for each target and C is the classification. The Viterbi algorithm used to calculate the probability of a new observation sequence O_new, given the model λ_l, 1 ≤ l ≤ K, i.e. P(O_new|λ_l), is as follows.

Initialization:

δ_1(i) = log(π_i) + log[b_i(o_1)], 1 ≤ i ≤ N,  (11)

where N is the number of states and δ_1(i) is the log of the probability that symbol o_1 is observed at time t = 1 in state i.

Recursion:

δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t−1}(i) + log a_ij] + log[b_j(o_t)], 2 ≤ t ≤ T, 1 ≤ j ≤ N.  (12)

Termination:

log P* = max_{1 ≤ i ≤ N} [δ_T(i)].  (13)

The log of the probability is computed, rather than the probability itself, since the latter would fall outside the dynamic range of the machine; the computation is also significantly reduced [5].

The structure of an HMM can take various forms: an unconstrained (ergodic) model, in which a transition can be made from any state to any other state, and a left–right model, in which the state index increases as time increases, that is, the states proceed from left to right. The latter model is appropriate for modeling data having sequential variations over time. In this study, discrete left–right HMMs with five states, which are more suitable for modeling the signals investigated here, are used. Fig. 2 shows a five-state left–right HMM. A state in the model may correspond to a certain range of aspect angles of the target. We choose five states because a good identification result is obtained with comparatively little computation in this case.
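As a minimal sketch, the scoring step of Eqs. (11)–(13) can be written in a few lines of vectorized Python; the function and array names and the use of numpy are our own choices, not part of the paper. Zero-probability transitions of the left–right model appear as -inf log entries, which the max operations handle naturally.

import numpy as np

def log_viterbi_score(obs, log_pi, log_A, log_B):
    # obs: sequence of observation symbol indices o_1 .. o_T
    # log_pi: (N,) log initial probabilities; log_A: (N, N) log transitions;
    # log_B: (N, M) log observation-symbol probabilities
    delta = log_pi + log_B[:, obs[0]]          # initialization, Eq. (11)
    for o_t in obs[1:]:                        # recursion, Eq. (12)
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o_t]
    return np.max(delta)                       # termination, Eq. (13)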
Fig. 2. Five-state left–right HMM. Several possible links are omitted for the sake of clarity.
3. Spatio-temporal identification procedure and results

The data used here were obtained by a real C-band wide-band ISAR. The range resolution of the radar is about 0.5 m and the pulse repetition frequency is 400 Hz. Our experiments are performed for only three aircraft, as we have data for only these aircraft. The aircraft are a large jet plane (Yark-42), a propeller plane (An-26) and a small jet plane (Citation); their geometrical sizes are 34.88 m × 36.8 m, 29.2 m × 23.8 m and 15.9 m × 14.4 m, respectively. Fig. 3 shows the plane trajectories of the aircraft, which flew according to the order of the numbers in the figure. We take both training and testing data sets from these trajectories. The radar is located at the origin of the coordinate axes. The elevation angles of the three aircraft vary over the ranges 13.2–6.0°, 22.2–8.2° and 30.6–12.6°, respectively. Fig. 4 shows some range profiles of the three aircraft. In Fig. 4 the abscissa represents the range extent occupied by the target; each bin denotes a range of 37.5 cm. The magnitude of a range bin is the relative reflectivity of the target within the bin. Tang obtained an identification accuracy of 88.5% using a correlation filter on the same data sets used here [8]; moreover, he obtained that result with the training and testing data sets taken from the same trajectories.

Range profiles can be used directly as feature vectors. In this case the computational load is increased, as a range alignment algorithm is needed. Investigation shows that
a much simpler feature may be obtained after preprocessing by multiresolution decomposition and Fourier transform [9]. The signal after multiresolution decomposition is an approximation at half the original resolution, so the range profiles are 128-dimensional instead of 256-dimensional (readers interested in the details of multiresolution signal decomposition are referred to Ref. [10]). An FFT is then applied to the 128-dimensional range profiles. Only the first 64 frequency features are used, because of the symmetry of the Fourier transform. The physical length of a target is also an important feature: the number of bins above a set threshold value in a range profile is taken as the length of the target's projection on the radar line of sight and, after multiplication by a proper factor, is appended to the feature vector. Finally, we obtain 65-dimensional feature vectors. Fig. 5 shows some feature vectors obtained from range profiles after this preprocessing; the abscissa in Fig. 5 indexes the elements of a feature vector.

From Figs. 4 and 5 we can see that the three kinds of aircraft are relatively featureless and difficult to identify. This can also be revealed by applying a nonlinear mapping algorithm [11]; the results are shown in Fig. 6. The nonlinear mapping algorithm is based on a point mapping of N K-dimensional vectors from the K-space to a two-dimensional space such that their interpoint distances approximate the corresponding interpoint distances in the K-space. In Fig. 6b and c the points gather in nearly the same region, which makes classification difficult.
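A sketch of the preprocessing chain described above is given below. The one-level pairwise average stands in for the multiresolution approximation of Ref. [10] (the paper does not specify the analysis filter), and the threshold and length scale factor are illustrative assumptions, since their values are not given in the text.

import numpy as np

def feature_vector(profile, threshold=0.1, length_scale=0.01):
    x = np.asarray(profile, dtype=float)           # 256-bin normalized range profile
    approx = 0.5 * (x[0::2] + x[1::2])             # 128-D approximation (assumed Haar-style)
    spectrum = np.abs(np.fft.fft(approx))[:64]     # first 64 bins only, by FFT symmetry
    length = length_scale * np.sum(x > threshold)  # occupied bins ~ projected target length
    return np.concatenate([spectrum, [length]])    # 65-D feature vector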
Fig. 3. Trajectories of the three aircraft: (a) Yark-42, (b) An-26, (c) Citation.
Fig. 4. Some range profiles of the three aircraft: (a) Yark-42, (b) An-26, (c) Citation.
Fig. 5. Some feature vectors of the three aircraft: (a) Yark-42, (b) An-26, (c) Citation.
Fig. 6. Nonlinear mapping from E^K to E^2 for the three aircraft: (a) Yark-42, (b) An-26, (c) Citation.
Table 1
Distribution of 7200 feature vectors on the first 20 characteristic profiles

Codeword indices    1     2     3     4     5     6     7     8     9    10
Yark-42            18     1     0    78     0     0     0     0   103     0
An-26              93   154   102     0     0    83   110     0     0     1
Citation            1    10     0     0    77    95     1    98     0   153

Codeword indices   11    12    13    14    15    16    17    18    19    20
Yark-42             0     0   141   109     0     0     5     0     0     1
An-26               6    14     0     0    38    10   135     2     6    49
Citation          120    81     0     0    49    87     0    56   110   110
As previously mentioned, infinitely many range profiles may appear within a small aspect range. In order to use a discrete HMM we must create a quantizer from feature vectors of all three aircraft, so that a finite set of codewords can represent all range profiles. The clustering algorithm proposed by Linde et al. [12], known as LBG, is used to create a vector quantizer. The 7200 normalized range profiles (2400 for each class), which were taken from parts 2 and 5 of the trajectory in Fig. 3a for the Yark-42, parts 5 and 6 of the trajectory in Fig. 3b for the An-26, and parts 6 and 7 of the trajectory in Fig. 3c for the Citation, are preprocessed to form feature vectors. The 7200 feature vectors are then processed together by the LBG algorithm to produce a 64-codeword vector quantizer. A 64-codeword quantizer is chosen because a 128-codeword quantizer shows little further improvement in reducing the cluster distortion. After clustering, the potentially infinite number of range profiles of the three aircraft within the above parts of the trajectories are collapsed into a finite and manageable set of codewords, as sketched below.
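The sketch below outlines an LBG-style quantizer of the kind described in [12]: the codebook is grown by splitting and refined by nearest-neighbour (Lloyd) iterations. The perturbation factor and iteration count are assumed values; the paper reports only the final codebook size.

import numpy as np

def lbg_codebook(vectors, n_codewords=64, eps=0.01, n_iter=20):
    codebook = vectors.mean(axis=0, keepdims=True)       # start from the global centroid
    while len(codebook) < n_codewords:
        codebook = np.vstack([codebook * (1 + eps),      # split every codeword in two
                              codebook * (1 - eps)])
        for _ in range(n_iter):                          # Lloyd refinement
            d = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
            labels = d.argmin(axis=1)
            for k in range(len(codebook)):
                members = vectors[labels == k]
                if len(members) > 0:
                    codebook[k] = members.mean(axis=0)
    return codebook

A profile sequence is then mapped to an observation symbol sequence by assigning each feature vector the index of its nearest codeword.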
These codewords were referred to as aspect categories (or characteristic views) by Seibert and Waxman [4], because in their paper a codeword represents some 2-D views of a 3-D object. A codeword in our paper may be called a characteristic profile. Table 1 gives the distribution of the 7200 feature vectors over the first 20 characteristic profiles. We can see from Table 1 that some characteristic profiles are unique to one aircraft, some are ambiguous among all three aircraft, and some are ambiguous between pairs of aircraft. About 75% of the characteristic profiles are ambiguous to various degrees. Of course, the identification rate must be low when only one range profile is used to identify targets. But when a temporal sequence of range profiles is used, both the transition information among the characteristic profiles over time and the information in the individual profiles are employed.

Seven hundred and twenty symbol sequences (240 for each class), each of which corresponds to the indices of the characteristic profiles of 10 successive feature vectors among the above-mentioned 7200 feature vectors, are used to
train the corresponding HMM. The Baum–Welch algorithm used to train the HMMs is as follows [5].

(1) Set initial estimates for the elements of matrices A and B. We select random nonzero values, normalized to satisfy the constraints

Σ_{j=1}^{N} a_ij = 1, i = 1, 2, …, N,  (14)

Σ_{k=1}^{M} b_j(k) = 1, j = 1, 2, …, N,  (15)

as initial estimates of the elements of A and B.

(2) Successively obtain new, better estimates of A and B by the modified re-estimation formulas for multiple observation sequences:

ā_ij = [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k−1} α_t^k(i) a_ij b_j(o_{t+1}^k) β_{t+1}^k(j)] / [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k−1} α_t^k(i) β_t^k(i)], 1 ≤ i, j ≤ N,  (16)

b̄_j(l) = [Σ_{k=1}^{K} (1/P_k) Σ_{t: o_t^k = v_l} α_t^k(j) β_t^k(j)] / [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k} α_t^k(j) β_t^k(j)], 1 ≤ j ≤ N, 1 ≤ l ≤ M,  (17)

where K is here the number of training sequences, P_k is the probability of the kth observation sequence O^k given the model, α_t^k(i) and β_t^k(i) are the forward and backward variables for O^k, and the summation in the numerator of Eq. (17) runs only over those t for which o_t^k = v_l. These functions are calculated by the forward–backward algorithm [5], and π_i is not re-estimated since, for the left–right model, π_1 = 1 and π_i = 0 for i = 2, 3, …, N.
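The re-estimation loop of Eqs. (14)–(17) can be sketched as follows, accumulating the numerators and denominators over all training sequences before updating. This unscaled version is for illustration only; for longer sequences the scaled forward–backward recursions of [5] are needed to avoid underflow, and the variable names are our own.

import numpy as np

def baum_welch(seqs, A, B, pi, n_iter=10):
    # seqs: list of symbol-index sequences; A: (N, N); B: (N, M); pi: (N,)
    # pi is kept fixed (left-right model: pi_1 = 1, pi_i = 0 otherwise)
    N = len(pi)
    for _ in range(n_iter):
        num_A = np.zeros_like(A); den_A = np.zeros(N)
        num_B = np.zeros_like(B); den_B = np.zeros(N)
        for obs in seqs:
            T = len(obs)
            alpha = np.zeros((T, N)); beta = np.zeros((T, N))
            alpha[0] = pi * B[:, obs[0]]                     # forward pass
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            beta[-1] = 1.0                                   # backward pass
            for t in range(T - 2, -1, -1):
                beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
            P = alpha[-1].sum()                              # P_k = P(O^k | model)
            for t in range(T - 1):                           # accumulate Eq. (16)
                num_A += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / P
                den_A += alpha[t] * beta[t] / P
            gamma = alpha * beta / P
            for t in range(T):                               # accumulate Eq. (17)
                num_B[:, obs[t]] += gamma[t]
            den_B += gamma.sum(axis=0)
        A = num_A / den_A[:, None]
        B = num_B / den_B[:, None]
    return A, B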
When a new range profile sequence from a target of unknown class arrives, it is first preprocessed and mapped to a symbol sequence by the vector quantizer; the probabilities of the three HMMs producing the symbol sequence are then calculated by the Viterbi algorithm. Finally, the class corresponding to the maximum probability is chosen as the identification result. The block diagram of the whole identification system is shown in Fig. 7.

A correct-identification rate of 100% is obtained when the above training sequence set is used as the testing set. New test data sets, whose samples are taken from new parts of the trajectories, i.e. parts 1, 3 and 4 of the trajectory for the Yark-42, parts 1, 2, 3, 4 and 7 for the An-26, and parts 1, 2, 3, 4 and 5 for the Citation, are used to test the power of the above method on unseen data. The trained HMMs are applied to identify the observation symbol sequence sets (characteristic profile index sequence sets obtained after the same vector quantizer processing). Table 2 shows the identification results for all new trajectories. The average correct-identification rate over all the aircraft is 93.01%.

A comparison of the HMM method with a one nearest-neighbor (1-NN) classifier using the 65-dimensional feature vectors was performed. The 1-NN method is used to classify each range profile in a sequence independently; the result is a compound identification for the sequence as a whole. The average correct-identification rate for the above testing data sets using the 1-NN method is 84.5%. These results clearly show the advantage of using the information in a temporal sequence of target features in pattern recognition.
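Putting the pieces together, the decision rule of Fig. 7 amounts to quantizing the preprocessed profiles and taking the class whose model scores highest, per Eq. (10). The helper functions and the dictionary of per-class model parameters are the hypothetical names introduced in the sketches above.

import numpy as np

def identify(profiles, codebook, models):
    # models: dict mapping class name -> (log_pi, log_A, log_B)
    feats = np.array([feature_vector(p) for p in profiles])
    d = np.linalg.norm(feats[:, None] - codebook[None], axis=2)
    symbols = d.argmin(axis=1)                   # characteristic-profile indices
    scores = {c: log_viterbi_score(symbols, *m) for c, m in models.items()}
    return max(scores, key=scores.get)           # Eq. (10)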
Fig. 7. Block diagram of the HMM-based identification system.
Table 2
Identification results for test data sets from new trajectories

Trajectories    1         2         3         4         5         6         Average
Yark-42       98.33%               100%      100%                           99.44%
An-26        100%       83.33%     97.5%     71.67%              86.66%     87.83%
Citation      68%      100%        96.67%    94.17%   100%                  91.77%
4. Conclusions

A spatio-temporal target identification method for high-range resolution radar, which differs from the conventional method, has been proposed and demonstrated. The HMM is used to model an aircraft's characteristic range profile sequences. As the temporal evolution information is exploited, the identification accuracy is improved: about 93.01% accuracy was obtained for real unseen data of the three aircraft, taken from new trajectories.
Acknowledgements

The authors are grateful to the referees for their valuable comments and suggestions on earlier versions of this paper.
References

[1] H.J. Li, S.H. Yang, Using range profiles as feature vectors to identify aerospace objects, IEEE Trans. Antennas Propagat. 41 (3) (1993) 261–268.
[2] S. Hudson, D. Psaltis, Correlation filters for aircraft identification from radar profiles, IEEE Trans. Aerospace Electron. Systems 29 (3) (1993) 741–748.
[3] C. Stewart, Y.C. Lu, V. Larson, A neural clustering approach for high resolution radar target classification, Pattern Recognition 27 (4) (1994) 503–513.
[4] M. Seibert, A.M. Waxman, Adaptive 3-D object recognition from multiple views, IEEE Trans. Pattern Anal. Mach. Intell. 14 (2) (1992) 107–124.
[5] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989) 257–284.
[6] W.S. Kim, R. Park, Off-line recognition of handwritten Korean and alphanumeric characters using hidden Markov models, Pattern Recognition 29 (5) (1996) 854–858.
[7] K.H. Fielding, D.W. Ruck, Spatio-temporal pattern recognition using hidden Markov models, IEEE Trans. Aerospace Electron. Systems 31 (4) (1995) 1292–1300.
[8] J. Tang, Target detection and recognition in high resolution radar, Ph.D. Dissertation, Nanjing University of Aeronautics and Astronautics, 1996.
[9] D. Zhou, G. Liu, Radar target identification based on MAP of time-series, Chinese J. Infrared and Millimeter Waves 16 (5) (1997) 369–374.
[10] S.G. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Trans. Pattern Anal. Mach. Intell. 11 (7) (1989) 674–693.
[11] J.W. Sammon Jr., A nonlinear mapping for data structure analysis, IEEE Trans. Comput. C-18 (5) (1969) 401–409.
[12] Y. Linde, A. Buzo, R. Gray, An algorithm for vector quantizer design, IEEE Trans. Commun. COM-28 (1) (1980) 84–95.
About the Author: DEQUAN ZHOU received his M.Sc. and Ph.D. degrees from the Department of Electrical Engineering of the Nanjing University of Science and Technology in 1995 and 1998, respectively. His research interests are mainly in signal processing, neural networks and pattern recognition. He is currently an Associate Professor at the Air Force No. 1 Institute of Aeronautics.

About the Author: GUOSUI LIU was born in 1933. He is currently a Professor in the Department of Electrical Engineering of the Nanjing University of Science and Technology. Professor Liu is a senior member of the IEEE. His current research interests are radar systems and signal processing.

About the Author: JIANXIN WANG received his B.Sc. and M.Sc. degrees from the Department of Electrical Engineering of the Nanjing University of Science and Technology in 1984 and 1987, respectively. He is currently an Associate Professor in the Electrical Engineering Department of the Nanjing University of Science and Technology.
Pattern Recognition 33 (2000) 9–23
Tesseral spatio-temporal reasoning for multi-dimensional data

Frans Coenen*

Department of Computer Science, The University of Liverpool, Chadwick Building, P.O. Box 147, Liverpool L69 3BX, UK

Received 25 March 1997; received in revised form 26 August 1998; accepted 14 December 1998
Abstract

A generally applicable approach to N-dimensional spatial reasoning is described. The approach is founded on a unique representation based on ideas concerning "tesseral" addressing. This offers many computational advantages, including minimal data storage, computationally efficient translation of data, and simple data comparison, regardless of the number of dimensions under consideration. The representation allows the spatial attributes associated with objects to be expressed simply and concisely in terms of sets of addresses, which can then be related using standard set operations expressed as constraints. The approach has been incorporated into a spatial reasoning system, the SPARTA (SPAtial Reasoning using Tesseral Addressing) system, which has been successfully used in conjunction with a significant number of spatial application domains. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Spatio-temporal reasoning; Tesseral addressing; N-dimensional information processing
1. Introduction

A versatile and generally applicable approach to multi-dimensional spatial reasoning is described. The approach is founded on a tesseral quantitative representation of space that offers the advantages of both effective data storage and computationally efficient translation and comparison of data, regardless of the number of dimensions under consideration. The nature of the representation, which is in many respects a raster encoding, is such that it is immediately compatible with all applications where spatial objects are represented using linear formats. Examples of such representations include image encodings (GIF, PBM, etc.), some Geographic Information System (GIS) data representations, the U.K. Admiralty Raster Chart System (ARCS), and 3-D visualisation representations such as DXF. In addition, a tesseral reference can be viewed as both a raster label and a vector quantity.
* Corresponding author. Tel.: +44-151-794-3698; fax: +44-151-794-3715. E-mail address: [email protected] (F. Coenen).
Consequently the representation can also be interfaced to vector representations, such as those prevalent in GIS, and to standards such as the DX90 international hydrographic exchange standard. Further, the quantitative representation can be extended to describe qualitative aspects of spatial reasoning. Although a reference framework is essential for quantitative spatial reasoning, qualitative reasoning still requires a reference framework to express relations such as "before" or "inFrontOf". In a purely topological qualitative approach no reference framework is required; however, such an approach can still be defined in terms of such a framework.

The tesseral approach to spatial reasoning described here has been incorporated into a general-purpose spatial reasoning system, the SPARTA (SPAtial Reasoning using Tesseral Addressing) system. SPARTA operates using a constraint satisfaction approach to spatial problem solving, supported by a heuristically guided constraint selection mechanism, so as to minimise the search space. Spatial problems are passed to the system in the form of a script comprising a set of object descriptions and a set of constraints defining the relationships "desired" to exist between pairs or groups of objects. The system then produces all solutions that satisfy the given constraints. Solutions are output in either graphical or
textual formats, as directed by the user. The general applicability offered by the system is a direct consequence of the tesseral representation on which it is founded. This representation may be viewed either as an intermediate representation which links the reasoning mechanism with some primary representation (raster or vector), or as a primary representation in its own right. Applications that have been investigated using the technique include: Geographic Information Systems (GIS) [1,2], noise pollution monitoring [3], environmental impact assessment [4], shape fitting [5], scheduling and timetabling [6], rule base verification and validation [7], and also a number of "AI problems" such as the N-queens problem.

The rest of this paper is organised as follows. Full details of the unique tesseral representation on which the approach is founded are given in Section 2. In Section 3 the SPARTA system is introduced and a brief overview of the nature of the scripting language presented. The definitions of the attributes that may be associated with spatial objects, and of the constraints that may link such objects, are then described in Sections 4 and 5, respectively. Some notes concerning the constraint satisfaction mechanism are given in Section 6. In the following two sections the approach is illustrated, first (Section 7) with respect to Allen's interval calculus [8] and second (Section 8) with respect to Egenhofer's 9-Intersection mechanism [9]. It should be noted that these two illustrations are of necessity brief and therefore only serve to give a flavour of the full power of the approach. Finally, in Section 9, some concluding remarks are presented.

1.1. Note on spatial reasoning

Spatial reasoning can be defined, very broadly, as the automated manipulation of data objects belonging to spatial domains so as to arrive at application-dependent conclusions. A spatial domain, in this context, is considered to imply any N-dimensional space (real or imaginary, including 1-D temporal spaces) in which objects of interest may be contained. Automated manipulation implies the computer simulation of some higher mental processes (not just the simple response to some stimulus or the mechanical performance of an algorithm). Spatial reasoning is therefore not concerned with (say) the automated retrieval of spatially referenced data contained in some database format. At its simplest, spatial reasoning can be considered to revolve around the identification of (previously unknown) relationships that exist between spatial objects, according to the relationships that are known or desired to exist between such objects. In more complex systems the identified relations are then used as the foundation whereby further reasoning can take place, and consequently additional conclusions drawn.
2. Representation

The tesseral representation used assumes that multi-dimensional space comprises a set of N-dimensional isohedral (same-shape) cells, each of which is labelled with a unique address (see Section 2.1 for further detail on tesseral representations). This is a well-established view and widely adopted, especially using labels founded on the Cartesian coordinate system (or some variation of it, e.g. latitude and longitude, or eastings and northings). In such systems coordinates can be conceived of as having either a positive or a negative sign; consequently both positive and negative space can be identified (where the latter comprises labels which include at least one negative coordinate). Thus the universe of discourse can be described as the set U of all possible labels, defined as follows:

U = P ∪ N (U is equivalent to the disjoint union of P and N)

where:

P = {p | p ∈ the set of all cells representing positive space},
N = {n | n ∈ the set of all cells representing negative space}.

Any specific application will then consist of some domain of discourse (U_D):

U_D = P_D ∪ N_D

comprising a subset of U such that each dimension (D_n) is defined in terms of the set of its possible values:

D_n = {v | min_n ≤ v ≤ max_n, min_n, max_n ∈ Integer, max_n = min_n × −1}

(where Integer is the set of all integers). Note that, for convenience, individual dimensions are numbered sequentially commencing with the number 1, and that given any space U_D the cell where all the coordinate values equate to 0 is referred to as the global origin of U_D.

Although the Cartesian system offers the advantages that it is well understood and consequently widely accepted, the disadvantage is that it is not consistent over dimensions, e.g. a 2-D point requires two coordinates while a 4-D point requires four coordinates. In addition, the standard Cartesian representation becomes very unwieldy when attempting to manipulate objects with more than two dimensions. The desire to overcome these disadvantages was the initial motivation for the particular tesseral representation described here; however, the representation is much more than just a compression technique.

The addressing system operates as follows. Given a set of N dimensions, each of which has a range of values min . . max, bit groupings are allocated within a signed integer sufficient to encapsulate the full range of values for each dimension. For example, given three dimensions each of which has an integer value range of −3 . . 3, then a
Fig. 1. Bit allocation.
9-bit signed integer can be used to encapsulate the entire space. A suitable allocation of bits is presented in Fig. 1. Note that this 9-bit representation is used here for illustrative purposes only (the SPARTA system is founded on a 64-bit representation).

Using representations of this form, any coordinate N-tuple can be converted into an address (label) using the following function:

Dom(f_carttess) = {⟨v_1 . . v_n⟩ | v_i ∈ D_i},
Cod(f_carttess) = {t | t ∈ U},
Gr(f_carttess) = {⟨v_1 . . v_n, t⟩ | t = Σ_i v_i × base_i},

where Dom, Cod and Gr are the domain, codomain and graph of the function respectively, and base_n is calculated using the following identity:

base_n = 2^(Σ_{i=1}^{n−1} b_i)

in which b_i equals the number of bits allocated to dimension i. In the case of the above example, the bases will be:

base_1 = 2^0 = 1, base_2 = 2^3 = 8, base_3 = 2^6 = 64.

Thus

f_carttess(v_1, v_2, v_3) = v_1 × base_1 + v_2 × base_2 + v_3 × base_3 = v_1 × 1 + v_2 × 8 + v_3 × 64.

Using this function all the addresses in the given 3-D space can be calculated. The set of addresses in the plane of D_1–D_2 (where the value for D_3 is 0) is given in Fig. 2. The above identity can be verified with respect to this figure by substituting appropriate coordinate values for v_1 . . v_3. We can define a similar function, f_tesscart, to convert back from a tesseral address to a Cartesian tuple. This is defined in terms of the f_carttess function as follows:

Dom(f_tesscart) = {t | t ∈ U},
Cod(f_tesscart) = {⟨v_1 . . v_n⟩ | v_i ∈ D_i},
Gr(f_tesscart) = {⟨t, v_1 . . v_n⟩ | f_carttess(v_1 . . v_n) = t}.

The function can be implemented using appropriate left and right shift operations to isolate bit groupings.
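As a sketch, the two conversion functions for the 9-bit example space can be implemented as follows; the decode performs sign-aware extraction of each 3-bit field. The function and variable names are our own.

BITS = [3, 3, 3]   # bits per dimension for the 9-bit example of Fig. 1

def cart_to_tess(coords):
    # f_carttess: t = sum of v_i * base_i, each base_i a power of two
    t, shift = 0, 0
    for v, b in zip(coords, BITS):
        t += v << shift
        shift += b
    return t

def tess_to_cart(t):
    # f_tesscart: recover each signed field, lowest dimension first
    coords = []
    for b in BITS:
        half = 1 << (b - 1)
        v = ((t + half) & ((1 << b) - 1)) - half   # signed value of the low field
        coords.append(v)
        t = (t - v) >> b
    return coords

assert cart_to_tess([1, 2, 0]) == 17    # one right, two up (cf. Section 2.3)
assert tess_to_cart(17) == [1, 2, 0]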
Fig. 2. Section of example space in plane D_1–D_2 (value for D_3 = 0).
2.1. Relationship to other tesseral representations

Tesseral representations are spatial representations derived by repeatedly dividing a space into isohedral sub-spaces until some predefined resolution is reached [10,11]. The resulting sub-spaces are then allocated (usually numeric) "addresses" in such a way as to reflect the hierarchical decomposition. The approach is founded on ideas concerning hierarchical clustering, developed in the 1960s and 1970s to improve data access times [12], and atomic isohedral tiling strategies, developed in the 1970s and 1980s in connection with group theory [13]. These two strands were brought together in the early 1980s when a tesseral arithmetic was defined for the representation [14]. There are many variations on the tesseral theme, of which the "quad-tesseral" approach is arguably the most popular [15,16].

The representation described here is categorised as a tesseral representation because it is a mechanism for addressing spaces decomposed into isohedral sub-spaces, but at some predefined resolution. There is no explicit inclusion of the concept of hierarchical decomposition, and consequently this is not reflected in the addressing system. This method of deriving the decomposition is the feature which sets the described representation apart from all other tesseral representations. The representation also offers a number of significant advantages over other tesseral representations (as well as Cartesian approaches), as follows:

1. All references are unique and conceptually simple to generate.
2. It results in a "left-to-right" linearisation of space (see Fig. 2) which in turn offers advantages of:
   - General applicability to any number of dimensions, without requiring alteration of program code or
     recourse to procedures and functions dedicated to a particular number of dimensions (as in the case of systems founded on Cartesian approaches): all spaces are treated in one-dimensional terms.
   - Efficient data storage using ideas founded on "run-line" encoding strategies (see note in Section 2.2).
   - Computationally effective comparison of sets of addresses (see Section 2.4 for further comment).
   - Predictability, i.e. given a location it is always possible to predict the addresses of physically adjacent cells (this is not the case with tesseral approaches that adopt a Morton or Hilbert linearisation).
3. It supports computationally efficient translation through the space using straightforward integer addition and subtraction, unlike other tesseral systems which require recourse to "look-up" tables or specialised tesseral processors (see note in Section 2.3 for further details).
4. It provides for the rotation of objects.
5. If desired, addresses can be efficiently converted to and from Cartesian tuples as described above (not the case with other tesseral approaches).
6. The addressing system is intuitive.

As a consequence of these advantages, many of the complexity problems associated with Cartesian approaches to spatial representation are circumvented.

2.2. Data storage

In Section 2 the representation was described in terms of an approach to data compression. However, a much greater degree of storage efficiency can be achieved by taking into consideration the linearisation that results from the encoding. Linearisation is a feature of all tesseral representations, although in most cases the linearisation follows a Morton sequence or a Hilbert curve, which is not as intuitively obvious as the left-to-right linearisation associated with the SPARTA representation. Whatever the case, "run-line" techniques can be used to store sequences of addresses in terms of a start and an end address. In the case of the SPARTA representation this would result in the definition of spatial attributes (such as location and shape) in terms of a sequence of "bars". This is adequate for one- and two-dimensional spaces but results in unacceptable overheads when higher-dimensional spaces (N > 2) are considered. Consequently an alternative approach has been adopted here, whereby spatial attributes are expressed in terms of N-dimensional "boxes", referred to as N-cubes, defined in terms of a start and an end address. The start address is defined as the reference geometrically nearest to the origin, and the end address that geometrically furthest away. For example, consider the 3-D "plus" shape shown
Fig. 3. 3-D example shape.
in Fig. 3. In the context of our 3-D example encoding this might be defined as follows:

{9 . . 18, 65 . . 146, 72 . . 144, 75 . . 147, 201 . . 210}

Note that sequences are indicated using a ". ." operator and that they are ordered according to the numeric location of their start address within the overall linearisation. The shape in Fig. 3 is thus defined in terms of 10 integers. Using a "bar" encoding this would have required 24 integers. Using a "linear quad-tesseral" encoding (with a Morton linearisation) 36 integers (18 sequences) would be required. Without any run-line encoding 64 integers (tesseral addresses) would be needed, and using a Cartesian system 192 integers (64 3-tuples). For the particular shape shown in the figure a "hierarchical quad-tesseral" representation, where the space is decomposed at different resolutions according to the required application, would offer no advantages over the "linear quad-tesseral" version. The N-cube encoding thus offers significant data storage advantages over other related representations. Although it is difficult to quantify this advantage, as some types of "shape" are more advantageous to some encodings than others, the above gives an indication of the degree of saving that can be made. In addition, the space-saving advantage gained increases with the number of dimensions under consideration.

2.3. Translation

A further important advantage of the SPARTA representation is the computational ease with which tesserally
defined "shapes" can be translated (moved) through a space U_D. For example, considering the space presented in Fig. 2, to move one "cell" to the right from any given reference, the address 1 is added to the reference. To move to the left, the address 1 is subtracted. To move diagonally (say) one space to the right and two upwards, 17 is added (subtracted to move in the opposite direction). To move diagonally (say) one space to the left and two upwards, a negative address (15, whose D_1 coordinate is negative) is required. It is for this reason (to allow translation in all directions) that the representation must include the negative equivalents of any space P_D of interest. Thus tesserally defined shapes can be translated in any direction, by any distance, using standard integer addition and subtraction. Further, the addition/subtraction need only be applied to the start and end references of each sequence. Also, a given shape can be expanded along a particular axis and/or in any particular direction. For example, using the given example space, to expand a single cell in all directions the sequence −73 . . 73 would be added. A translation function f_translate is thus defined as follows:

Dom(f_translate) = {⟨S_u, {c . . d}⟩ | S_u = {a . . b ∈ RR_{U_D}}, c, d ∈ U_D},
Cod(f_translate) = {S_u' | S_u' = {e . . f | e, f ∈ RR_{U_D}}},
Gr(f_translate) = {⟨S_u, {c . . d}, S_u'⟩ | ∀ a . . b ∈ S_u, ∃ e . . f ∈ S_u' such that e = a + c, f = b + d},

where RR_{U_D} denotes the set of N-cube references over U_D.
The implementation of this function requires the inclusion of a mechanism to deal with the situation where an attribute is wholly or partially translated out of U_D. The operation of the function can best be illustrated by considering the 2-D example "H" shape given in Fig. 4, which might be encoded as follows:

S_h = {−2 . . 14, 2 . . 18, 7 . . 9}.

To move this shape one cell to the left and two cells downwards the set S_t = {−17 . . −17} must be added:

f_translate(S_h, S_t) → {−19 . . −3, −15 . . 1, −10 . . −8}.

Similarly, if the "H" is to be expanded in both directions along the D_1 axis by one cell, then S_t = {−1 . . 1}:

f_translate(S_h, S_t) → {−3 . . 15, 1 . . 19, 6 . . 10}.

Alternatively, to expand the "H" in all three directions (in our example space), S_t = {−73 . . 73}.

2.4. Comparison

Knowledge of the linearisation can also be used to advantage when comparing sets of addresses. For example, given the sets of addresses S_1 and S_2, the validity of the Boolean relation S_1 subset S_2 can be established as follows:

S_1 subset S_2 ⟺ ∀ m . . n ∈ S_1, ∃ s . . t ∈ S_2 such that m ≥ s ∧ n ≤ t.

The relation S_1 subset S_2 is true if and only if for every N-cube m . . n in the set S_1 there exists an N-cube s . . t in the set S_2 such that m . . n is within s . . t. Thus, to compare sets of addresses whose elements are defined in terms of N-cubes, we do not have to consider individual cells. Again, this advantage is of particular relevance when dealing with spaces comprising more than two dimensions.
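The following sketch captures the two operations in a few lines, with shapes held as lists of (start, end) N-cube pairs; the data layout is our own modelling choice.

def translate(shape, offset):
    # add the offset N-cube (c, d) to every (start, end) pair, per f_translate
    c, d = offset
    return [(a + c, b + d) for (a, b) in shape]

def is_subset(s1, s2):
    # S1 subset S2: every N-cube of S1 lies within some N-cube of S2
    return all(any(m >= s and n <= t for (s, t) in s2) for (m, n) in s1)

h = [(-2, 14), (2, 18), (7, 9)]          # the "H" shape of Fig. 4
print(translate(h, (-17, -17)))          # [(-19, -3), (-15, 1), (-10, -8)]
print(translate(h, (-1, 1)))             # [(-3, 15), (1, 19), (6, 10)]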
3. Spatial application description

Using the representation, the spatial attributes associated with objects can be defined in terms of sets of N-cubes. The relationships that are desired to exist between objects may then be defined in terms of constraints linking the attributes associated with pairs of objects. With respect to the SPARTA system, such descriptions are expressed as a sequence of PROLOG facts contained in a script. A script typically comprises four sections: a space definition, object descriptions defined in terms of classes and instances of those classes, and constraint descriptions. The space definition describes the limits of the set (U_D) in terms of the positive maximum coordinate for each dimension:
space({max_1, max_2, …, max_n}).
The space of interest, i.e. the space within which the objects pertaining to the application under consideration exist, is then considered to be equivalent to the positive part of this space (P_D). The negative part of this space (N_D) is required when translating objects through the space (see Section 2.3 above). A class definition takes one of the following two formats:

class(ClassName, ObjectType).
Table 1
Possible attribute definitions for different object types

Attributes                 Fixed    Free    Shapeless
Location/location space      ×        ×        ×
Shape                                 ×
Rotation                              ×
Size                                           ×
Contiguity                            ×        ×
class(ClassName, ObjectType, ShapeDef).

Every class has a name and a type definition for the objects belonging to that class. Currently three object "types" are supported:

- fixed objects, which cannot be moved, i.e. they have a "fixed" location;
- free objects, which have a known shape but no fixed location (and thus can be moved);
- shapeless objects, which have no given shape or location but are known to exist somewhere within the space P_D.

Only free objects can have a shape definition associated with them (in the case of fixed objects this is implied by the location; in the case of shapeless objects it is unknown). An instance definition then takes the form:

instance(InstanceName, ClassName, Attribute_1, Attribute_2, …, Attribute_n).

Possible attributes that can be associated with any given instance include:

- location or location space,
- shape,
- rotation,
- size (actual, maximum, minimum),
- contiguity (a spatial entity does not have to be comprised of a continuous set of cells),

although not every attribute is applicable to every class of object. The distinction between a location and a location space is that the first describes the precise location of a fixed object, while the second describes a set of addresses somewhere within which a free or shapeless object is known to exist. The attributes that may be defined for each type of object are listed in Table 1. Further detail concerning each of these attributes is given in the following section. Constraints (at their simplest) are of the form:

constraint(Operand1, Relation, Operand2).

The operands are locations or location spaces associated with single instances, lists of instances, or entire classes of instances (and even lists of classes), indicated using the appropriate instance or class names. The possible relations that can be used to link the operands are the standard set operations. The nature of constraints is described further in Section 5.

4. Object attributes

In this section the attributes listed in the foregoing section are described in further detail.

4.1. Location

Other than an identifier, the second most important attribute that may be associated with a spatial entity is its location. In the case of a fixed object this is precisely known. This is not the case for a free or shapeless object, where only a general location space within which the object may exist is known. In either case the location or location space is described in terms of some subset (Location or LocationSpace) of P_D:

Location = {l | l ∈ L ∧ L ⊆ P_D},
LocationSpace = {l | l ∈ L ∧ L ⊆ P_D}.

The definitions are the same, but the interpretations are different. In the absence of any other information, LocationSpace is assumed to be equal to P_D. Note also that in the case of a fixed object the nature of the set Location also implies the nature of all of the other attributes associated with such an object, and hence there is no need to define these attributes (see Table 1). To support the manipulation of objects, every location (location space) has a local origin associated with it. This is defined as the corner address of the minimum bounding N-cube surrounding the entire location definition that is geometrically closest to the global origin for the given set U_D.
4.2. Shape

Shape is arguably the third most significant attribute that a spatial entity has. In the case of a fixed object this is defined by implication, whereas in the case of a free object the shape is defined in terms of a set Shape:

Shape = {s | s ∈ S ∧ S ⊆ U_D ∧ globalOrigin ∈ S}.

Note that in this case the shape set is a subset of U_D, as opposed to P_D as stipulated for the Location and LocationSpace sets. Note also that the global origin must be included in the shape definition; this then acts as the reference address with respect to which the set can be manipulated. Given a free object, the shape definition also implies further attributes such as size and contiguity. Note also that a free object must have some location space associated with it, somewhere within which it is known to exist. From Section 4.1 this may be either the entire space P_D or some subset of P_D.

Given a set Shape and a set LocationSpace, a set of sets LocationCandidates (the set of candidate location sets) is derived using a function f_candidateLocations:

Dom(f_candidateLocations) = {⟨L_s, S⟩ | L_s = LocationSpace ∧ S = Shape},
Cod(f_candidateLocations) = {L_c | L_c ⊆ L_s},
Gr(f_candidateLocations) = {⟨L_s, S, L_c⟩ | L_c = ∪ (∀ x ∈ L_s · f_translate(S, {x . . x}) ⊆ L_s)}.

Note that the above would not identify all the candidate locations if the zero address (the global origin for U_D) were not required to be included in the definition of the set Shape.

4.3. Rotation

A further important attribute associated with free objects is orientation: given a particular object, this may be either fixed or free to rotate. Whether a shape/object can be rotated or not is defined in terms of a constant rotation which may be included in the predicate defining the object in question.

4.4. Size

In the absence of location and shape information, knowledge may be available concerning size. The size of a spatial object is expressed in terms of the number of cells (not N-cubes) in the set describing its shape, i.e. the cardinality of that set. In the case of a fixed object this will be equivalent to f_cardinality(Location), where the function f_cardinality returns the "size" of its argument, which must be a set of N-cubes. In the case of a free object this will be equivalent to f_cardinality(Shape). In the case of a shapeless object, where the precise definition of these sets is not known, the required number of elements for the set
Location/Shape can still be expressed. The minimum size of any object (that can be physically realised) is 1, indicating that it is represented by a single cell. The maximum size of an object is equivalent to f_cardinality(P_D); otherwise the object would exceed the application domain. In addition, the nature of the size of an object can be expressed in terms of (i) a minimum size, (ii) a maximum size, or (iii) an actual size, using the set of operators {<, =, >}. Thus a set Size, comprising a single 2-tuple, is defined as follows:

Size = {⟨q, m⟩ | q ∈ {<, =, >} ∧ m ∈ {1 . . cardinality(P_D)}}.

4.5. Contiguity

Finally, something may be known about whether an object's location/shape is represented by a contiguous set of addresses or not. In this context contiguity is defined as the situation where each cell of a Location or Shape associated with any object is adjacent to at least one other element of this set (adjacency is assumed to mean either edge or corner adjacency). The nature of the connectivity attribute is defined in terms of a constant contiguous which may be included in the predicate defining a particular object.
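A minimal sketch of the f_candidateLocations function of Section 4.2 follows, modelling locations as plain Python sets of cell addresses rather than N-cube runs, for brevity; a production implementation would of course work on the run-line encoding directly.

def candidate_locations(location_space, shape):
    # slide the shape's global origin (address 0) over every cell of the
    # location space and keep the placements that remain inside it
    candidates = []
    for x in location_space:
        placed = {s + x for s in shape}        # f_translate(Shape, {x . . x})
        if placed <= location_space:
            candidates.append(placed)
    return candidates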
5. Constraints From Section 3 constraints are used to link pairs of object locations (where locations are indicated using instance names, list of such names, class names or lists of class names). Given that locations are described in terms of sets of N-cubes it is natural to consider relations in terms of the standard operators found in set theory: equals ("), intersection (5), union (6), subset (-), superset (.), `propera subset (L) and `propera superset (M); and also the negation of these relations. In the context of spatial reasoning these can be incorporated into Boolean functions (functions that return true or false) and non-Boolean functions (functions that return a `newa set) depending on the nature of the relation. With respect to the SPARTA system the Boolean functions are incorporated into constraints that act as "lters to `testa location pairs. Eight such "lters are supported: constraint(A, ,lterEquals, B)"f
(A, B) DGJRCP#OS?JQ Ntrue if A"B,
constraint(A, ,lterNotEquals, B)"f (A, B) DGJRCP,MR#OS?JQ Ntrue if AOB, constraint(A, ,lterIntersects, B)"f
(A, B) DGJRCP'LRCPQCARQ Ntrue if A5BO,
16
F. Coenen / Pattern Recognition 33 (2000) 9}23
constraint(A, ,lterNotIntersects, B) "f (A, B)Ntrue if A5B", DGJRCP,MR'LRCPQCARQ constraint(A, ,lterSubset, B)"f (A, B) DGJRCP1S@QCR Ntrue if ALB, constraint(A, ,lterNotSubset, B)"f
(A, B) DGJRCP,MR1S@QCR Ntrue if A : . B
constraint(A, ,lterSuperset, B)"f
constraint(A, filterSuperset, B) = f_filterSuperset(A, B) → true if A ⊇ B,
constraint(A, filterNotSuperset, B) = f_filterNotSuperset(A, B) → true if A ⊉ B.

No distinction is made between a "proper" subset (superset) and a subset (superset) because no applications have been found (to date) where the distinction is significant. Boolean filters are thus used to test locations for fixed objects and/or candidate locations for free objects against each other. Note that the implementation of the above functions makes full use of knowledge of the linearisation as described in Section 2.4.

Non-Boolean functions are incorporated into constraints that act as mappings in the sense that they "map" a non-Boolean set operation (intersection, union, complement) on to the prefix operand with respect to the postfix operand to produce a new set. In the context of the SPARTA system the prefix operand must represent a location space associated with a shapeless object, and the postfix operand either a fixed location associated with a fixed object or a candidate location associated with a free object. The function then returns a revised location space for the prefix operand. Only two mapping constraints are supported:

constraint(A, mapIntersects, B) = f_mapIntersects(A, B) → A′ = A ∩ B,
constraint(A, mapComplement, B) = f_mapComplement(A, B) → A′ = B − A.

For procedural reasons it is desirable to maintain a declarative reading of constraints (i.e. constraint interpretation should be independent of any ordering in which constraints might be presented). Thus a union mapping constraint is not supported: it would have the effect of increasing the location space associated with a shapeless object, and consequently the final output would be influenced by the ordering in which constraints are processed, i.e. the sequence in which constraints are presented would become significant.

Using the above filters and mappings it is possible to express many of the standard relations encountered in spatial reasoning applications (such as those discussed by Retz-Schmidt [17], Egenhofer [9] and Cohn [18] amongst many others). However, to increase the expressiveness of these constraints an offset can be applied to the location associated with an identifier prior to satisfaction of the constraint. An offset is some subset of U which is applied (using the f_translate function) to either (a) all the elements describing a location, location space or set of candidate locations, or (b) the local origin for such sets. The effect in the first case is to translate or expand (or both) the space, as illustrated in Section 2.3. In the second case this provides the means whereby other "locations" may be defined with respect to the given location (location space or candidate location). A facility is also provided to "shrink" a location by a number of cells, either uniformly or along specific axes. Whatever the case, offsets are defined by a predicate offset as follows:

offset(Q, P) where Q = {q | q ∈ {ref, all, shrink}}, P = {p | p ∈ U}.

Thus a constraint is defined using one of the following formats:

constraint(Operand1, Relation, Operand2).
constraint(Operand1, Offset1, Relation, Operand2).
constraint(Operand1, Relation, Operand2, Offset2).
constraint(Operand1, Offset1, Relation, Operand2, Offset2).

Finally, the expressiveness of the intersects relations (either in addition to or as an alternative to the use of offsets) may be further expanded by quantifying the size of the intersection. This is achieved by allowing two optional arguments for the intersection relation: an operator and a desired cardinality. The set of possible operators is as follows:

<, ≤, =, ≥, >.

The cardinality is expressed as a positive integer in the range 1 to f_cardinality(P_S). Thus we can insist that an intersection is (say) equal to a cardinality of 4 as follows:

constraint(A, filterIntersects(=, 4), B).
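Once locations are linearised, the filter and mapping constraints reduce to elementary set operations. The following Python sketch illustrates this, under the assumption that a location (or location space) is simply a set of integer tesseral addresses; the function names mirror the relations above, but the set-based encoding is illustrative and is not the C implementation used by SPARTA.

def filter_subset(a, b):            # true if A is a subset of B
    return a <= b

def filter_superset(a, b):          # true if A is a superset of B
    return a >= b

def filter_intersects(a, b, op=None, k=None):
    # filterIntersects, optionally quantified by an operator and a
    # desired cardinality k, e.g. op='=' and k=4 insists |A ∩ B| = 4.
    size = len(a & b)
    if op is None:
        return size > 0
    return {'<': size < k, '<=': size <= k, '=': size == k,
            '>=': size >= k, '>': size > k}[op]

def map_intersects(a, b):           # A' = A ∩ B (revised location space)
    return a & b

def map_complement(a, b):           # A' = B − A
    return b - a

def offset_translate(location, offsets):
    # Apply an offset set to every element of a location (case (a) of the
    # f_translate behaviour above): the result is the union of all translates.
    return {addr + off for addr in location for off in offsets}

# Example: a shapeless object's location space clipped to a fixed location.
space_a = set(range(0, 20))         # location space of a shapeless object
loc_b = set(range(5, 9))            # fixed location
print(map_intersects(space_a, loc_b))             # {5, 6, 7, 8}
print(filter_intersects(space_a, loc_b, '=', 4))  # True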
6. The constraint satisfaction process There are many computer applications where a solution is required to be generated which conforms to a set of constraints. Examples include: machine vision, belief maintenance, scheduling and time-tabling, graph
colouring, logic puzzles, floor plan design and circuit design. This class of applications is often referred to as Constraint Satisfaction Problems (CSP) [19]. Temporal and spatial reasoning problems also fall into this category. More formally, a CSP can be defined in terms of a set of variables {X1, X2, ..., Xn} which can take values from the finite domains {D1, D2, ..., Dn} respectively, and a set of constraints which specify which values are compatible with each other. Thus a constraint constraint(Xi1, Xi2, ..., Xik) between k variables is a subset of the Cartesian product Di1 × Di2 × ... × Dik. A solution to a CSP is then an assignment of values to all variables which satisfies all constraints. Given that the domains {D1, D2, ..., Dn} are finite, a solution (or solutions) can always be found provided that the constraints are not contradictory. The real problem associated with CSP is that of complexity (CSP are NP-complete) and, as a result, efficiency. The solution to a CSP is typically found using some tree searching strategy supported by a mechanism whereby fruitless branches within the tree can be identified early in the search process. Examples include dependency-based backtracking and a priori pruning techniques such as forward checking and "look ahead" [20]. Given a particular CSP, the nature of the problem can be further exploited to achieve additional pruning by guiding the search using application-dependent heuristics. This was the idea behind a number of problem solving systems such as ALICE [21]. CSP tree searching strategies are well suited to encoding using logic programming languages (e.g. PROLOG). However, the current disadvantage of such languages is that their execution is very inefficient: search procedures (i.e. backtracking) are based on recovery from failure rather than avoidance of failure. Consequently a number of constraint (logic) programming languages have been developed to address the disadvantages associated with standard logic languages. One example is CHIP (Constraint Handling In PROLOG), which extends standard PROLOG to include the concepts of finite domains, Boolean terms and relational terms [22]. Further examples include PROLOG III [23], and CLP (Constraint Logic Programming), a framework for constraint handling in logic [24] which in turn is an extension of the earlier logic programming language "Scheme" [25]. Alternatively, many researchers have resorted to implementing CSP solution strategies using more traditional imperative programming languages. The current version of the SPARTA system is written in C. The design of the SPARTA system is such that the constraint satisfaction mechanism is independent of the representation; consequently the system is not tied to any particular approach to CSP solving. However, the current version of the system uses a heuristically guided depth-first search strategy. The constraint satisfaction process commences with a single root node in a solution
tree. Initially this node contains a list of constraints to do, as defined in the input script, and an object space corresponding to P_S containing no objects. A constraint is then selected from the constraints-to-do list following the constraint selection criteria (see Section 6.1). The system then attempts to resolve this constraint, with three possible outcomes:

1. The constraint cannot be satisfied, in which case (at this stage) it is concluded that no appropriate configuration of objects can be found and therefore the satisfaction process is stopped.

2. One or more compatible solutions exist; therefore include the referenced objects in the object space, adjust the constraints-to-do list and select another constraint. Note that two solutions are compatible if the current constraint will continue to hold when the solutions are combined. This will be the case where one of the operands is a fixed object. Where both operands are free objects it is likely that the solutions will not be compatible.

3. A number (more than one) of non-compatible solutions is produced; therefore create a number of branches, equal to the number of solutions, emanating from the current node, each terminating in a new node containing an updated version of the object space and constraints-to-do list.

In this manner a solution tree is dynamically created. If all the given constraints have only one solution the tree will consist of a single (root) node. If, however, the script includes constraints that have more than one solution, the tree will consist of a number of levels, each level representing a point in the solution process where the satisfaction of a constraint generates more than one (non-compatible) solution. Whenever an additional level in the tree is created, each branch is processed in a "depth-first" manner until either all constraints have been satisfied, in which case the solution is stored, or an unsatisfiable constraint is discovered. On completion of processing a particular branch the result is output and the current node removed from the tree; the system then backtracks to the previous node. If all branches emanating from this node have also been processed, this node is also expunged. The process continues until all branches in the tree have been investigated and all solutions generated. Consequently, the solution tree never exists in its entirety; only the current branch and those higher level nodes which merit further investigation exist at any one time.

6.1. Constraint selection strategy

In practice constraints are processed in a sequential fashion. This implies an ordering and consequently a selection strategy. However, from a user perspective, it is desirable that constraints can be presented regardless of
any ordering requirement. Further, from an implementational perspective, it is desirable that constraints are processed in a computationally efficient manner and that the growth of the solution tree is minimised. This is achieved by delaying the satisfaction of constraints that may cause the generation of a new level in the solution tree for as long as possible. Constraints are therefore selected according to the following sequence:

1. All constraints relating two fixed objects, or a fixed object and a shapeless object. Fixed and shapeless objects have only one possible location/location space associated with them; such constraints are therefore the most vulnerable, i.e. those that are hardest to satisfy.

2. Constraints linking free objects to fixed objects. Although free objects may have many candidate locations, when related to a fixed object (which by definition can have only one location) only one set of compatible solutions can be produced. Constraints are selected here according to the number of candidate locations associated with each free object (calculated using the f_candidateLocations function described in Section 4.2). Constraints involving free objects with a low candidate location set cardinality are more vulnerable than those with a higher cardinality.

3. The remaining constraints linking free objects to shapeless objects, or free objects to one another, again according to the cardinality associated with the operands (a shapeless object's location space has a cardinality of 1). The product of the two cardinalities then dictates the maximum number of branches that may be produced on satisfaction of the constraint, although it may be that on satisfaction some or all of the solutions are compatible. Constraints are thus selected so as to minimise cardinality(Operand1) × cardinality(Operand2).
The above constraint selection strategy thus has the effect of limiting the growth of the search tree while at the same time causing the most vulnerable constraints to be processed first.
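For illustration, the depth-first satisfaction process described in this section can be sketched in Python as follows. Here a constraint is assumed to be a callable that, given the current object space, returns a list of alternative updated object spaces (an empty list meaning the constraint cannot be satisfied), and select() stands in for the vulnerability-based ordering above; all names are hypothetical and this is not the SPARTA C implementation.

def satisfy(object_space, constraints_to_do, select, solutions):
    if not constraints_to_do:
        solutions.append(object_space)     # all constraints satisfied
        return
    c = select(constraints_to_do, object_space)   # most vulnerable first
    remaining = [x for x in constraints_to_do if x is not c]
    branches = c(object_space)
    # Outcome 1: no solution -> this branch of the tree is abandoned.
    # Outcome 2: one (combined, compatible) solution -> the loop runs once,
    #            so no new level is added to the tree.
    # Outcome 3: several non-compatible solutions -> one branch each,
    #            processed depth first; only the current branch ever
    #            exists in full.
    for new_space in branches:
        satisfy(new_space, remaining, select, solutions)

def solve(constraints, select):
    solutions = []
    satisfy({}, list(constraints), select, solutions)
    return solutions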
7. Example 1: Temporal reasoning using Allen's interval calculus approach

The most influential modern work carried out in the field of temporal reasoning can be considered to be James Allen's interval calculus. Although Allen's work cannot be said to be definitive, its importance lies in the fact that it is the basis on which many subsequent temporal (and spatial) reasoning systems have been built, or if not the basis then at least the catalyst for such work. For example Freksa [26], Hernández [27] and Mukerjee and Joe [28] have all transferred the approach to the spatial domain, while Ligozat [29] has generalised the interval concept for
reasoning with chains of events. Good and available accounts of Allen's original work can be found in [30] and [31] (see also [32]), and a more general overview in [8]. The work has subsequently been extended to incorporate points as well as intervals (see [33-35]). A good critique of Allen's work can be found in [36], while Vilain et al. [37,38] and Nökel [39] studied the computational complexity of Allen's reasoning scheme and some of its variants. Allen, in his interval calculus, identifies 13 "interval" relations which are used to link one-dimensional objects (Fig. 5). The objects are stored in a network where each arc represents one or more relations. The network is typically expressed as a set of tuples of the form:

(node1, relationList, node2).
Fig. 5. Allen's 13 interval relations.
By adding further tuples to the network we can check the continuing consistency of the network, remove possible relations which are no longer applicable (provided that at least one relation always remains linking each node pair), or deduce further relations not previously included in the network. Allen uses a transitivity table technique to achieve these revisions. Using the SPARTA representation, Allen's 13 interval relations can be expressed as follows (SPARTA relations given to the right):

a before b ≡ constraint(a, filterSubset, b, offset(ref, {−maxD1 .. −1})).
a after b ≡ constraint(a, filterSubset, b, offset(ref, {(length+1) .. maxD1})).
a during b ≡ constraint(a, filterSubset, b).
a contains b ≡ constraint(a, filterSuperset, b).
a overlaps b ≡ constraint(a, filterIntersects, b). constraint(a, offset(ref, {0}), filterNotSubset, b).
a overlappedby b ≡ constraint(a, filterIntersects, b). constraint(a, filterNotSubset, b, offset(ref, {length})).
a meets b ≡ constraint(a, offset(ref, {length}), filterEquals, b, offset(ref, {−1})).
a metby b ≡ constraint(a, offset(ref, {−1}), filterEquals, b, offset(ref, {length})).
a starts b ≡ constraint(a, offset(ref, {0}), filterEquals, b, offset(ref, {0})). constraint(a, filterSubset, b).
a startedby b ≡ constraint(a, offset(ref, {0}), filterEquals, b, offset(ref, {0})). constraint(a, filterSuperset, b).
a finishes b ≡ constraint(a, offset(ref, {length}), filterEquals, b, offset(ref, {length})). constraint(a, filterSubset, b).
a finishedby b ≡ constraint(a, offset(ref, {length}), filterEquals, b, offset(ref, {length})). constraint(a, filterSuperset, b).
a equals b ≡ constraint(a, filterEquals, b).
where maxD1 is the maximal value of the dimension D1, and the identifier length indicates the "duration" of the interval in question. Note also that the application of offset(ref, {0}) has the effect of isolating the local origin of the set referred to. Thus, given the relations a startedBy b and b overlappedBy c, using Allen's transitivity table technique, c overlappedBy a can be deduced. This can be expressed in the form of a SPARTA script as follows:

space(MaxD1).
class(interval, free, S).
instance(a, interval).
instance(b, interval).
instance(c, interval).
constraint(a, offset(ref, {0}), equals, b, offset(ref, {0})).
constraint(a, superset, b).
constraint(b, intersects, c).
constraint(c, notSubset, b, offset(ref, {length})).

Note that because a quantitative representation is used, the scenario has been dimensioned. Thus we have assumed a 1-D application domain comprising MaxD1 addresses and defined each interval as a free object with some shape S, contained in a location space which has a default value equivalent to P_S. Using the SPARTA system, all possible configurations for the set of objects will be returned. In the above case each configuration will also illustrate the possible relationships between the instances a and c, i.e. c overlappedBy a.
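Because intervals are just contiguous sets of cell addresses under the linearisation, Allen's relations collapse into ordinary set and endpoint comparisons. The following Python sketch assumes a 1-D space of integer addresses and expresses a few of the 13 relations directly; it is a simplification of the offset-based constraint encodings above, not the SPARTA implementation.

def cells(start, length):
    # An interval as a set of cell addresses: start .. start+length-1.
    return set(range(start, start + length))

def before(a, b):        # a ends, then a gap, then b starts
    return max(a) + 1 < min(b)

def meets(a, b):         # the cell after a's last cell is b's first cell
    return max(a) + 1 == min(b)

def during(a, b):        # a strictly inside b
    return min(a) > min(b) and max(a) < max(b)

def overlaps(a, b):      # they intersect, a starts first and ends first
    return bool(a & b) and min(a) < min(b) and max(a) < max(b)

def equals(a, b):
    return a == b

a, b = cells(0, 5), cells(3, 6)
print(overlaps(a, b))    # True: a overlaps b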
8. Example 2: Topological spatial reasoning using Egenhofer's 9-intersection approach

The most obvious approach to spatial reasoning is to extend established temporal reasoning techniques so as to encompass more than one dimension. For example, it is possible to consider that the relations that exist in 1-D space exist along all axes of interest in an N-dimensional space. Individual relations are then defined by the intersection of any two 1-D relations. This approach has been considered by many authors, such as Freksa [26], Hernández [27] and Puller and Egenhofer [40]. It is generally acknowledged that there are some 26 1-D relations (including Allen's 13 interval relations) that can exist between points, intervals, and points and intervals in a 1-D space [41]. Extending these relations to N dimensions results in an exponential increase in the number of relations as the number of dimensions increases. This is the principal reason why one-dimensional reasoning techniques do not lend themselves to easy adaptation to higher dimensions. This has been recognised by many authors (see [27,42,43] amongst many others), and much work has been done on techniques to address these concerns. One approach is to represent information about individual dimensions independently and reason about each dimension in isolation [44]. Alternatively, some authors differentiate between projection and orientation relations [27]. However, although these techniques all serve to reduce the severity of the problem, the basic difficulty (that the complexity of the problem increases exponentially with the number of dimensions under consideration) has not been removed. An alternative approach is to consider only topological relations. Egenhofer defines such relations as follows: those spatial relations that are invariant under topological transformations and, therefore, preserved if the objects are translated, rotated or scaled [9]. Egenhofer uses a 9-intersection model [9,45], founded on an earlier 4-intersection model [46], to identify and
Table 2
Egenhofer's 9-intersection table

intersection(interior(A), interior(B))   intersection(interior(A), boundary(B))   intersection(interior(A), exterior(B))
intersection(boundary(A), interior(B))   intersection(boundary(A), boundary(B))   intersection(boundary(A), exterior(B))
intersection(exterior(A), interior(B))   intersection(exterior(A), boundary(B))   intersection(exterior(A), exterior(B))
manipulate such relations. In this model spatial objects are considered to have three parts: (a) an interior, (b) a boundary and (c) an exterior. The topological relation between two "point sets" (sets of addresses), A and B, is then described by the nine intersections of the interior, boundary and exterior of A with those of B, as demonstrated in Table 2. Various topological invariants can be used to evaluate and characterise Egenhofer's 9-intersection model. However, it can be shown that there are 2^9 (512) possible combinations [47], of which only a small subset can be physically realised, as follows:

• 8 between objects without holes (contiguous region objects).
• 18 between objects with holes.
• 33 between simple lines.
• 19 between an object without a hole and a line.

Considering only the relationships between contiguous region objects, these can be labelled as shown in Fig. 6. The labelling in the figure is applied to 2-D examples; however, it is equally applicable to any number of dimensions. The 8 topological relations presented in Fig. 6 can be expressed using the SPARTA representation as follows:

a disjoint b ≡ constraint(a, offset(all, {sw .. se}), filterNotIntersects, b).
a meets b ≡ constraint(a, offset(all, {sw .. se}), filterIntersects, b). constraint(a, filterNotIntersects, b).
a overlaps b ≡ constraint(a, filterIntersects, b). constraint(a, filterNotSubset, b).
a coveredby b ≡ constraint(a, filterSubset, b). constraint(a, filterNotSubset, b, offset(shrink, {1})).
a covers b ≡ constraint(a, filterSuperset, b). constraint(a, offset(shrink, {1}), filterNotSuperset, b).
a inside b ≡ constraint(a, filterSubset, b, offset(shrink, {1})).
a contains b ≡ constraint(a, offset(shrink, {1}), filterSuperset, b).
a equals b ≡ constraint(a, filterEquals, b).

Fig. 6. The 8 "contiguous object" topological relations.
where the identi"ers sw and se represent the addresses required to uniformly expand a location by a factor of one cell in all directions. The application of o+set(shrink, +1,) has the opposite e!ect. Given two 9-intersections, representing two topological relations between two pairs of objects a and b, and b and c, the 9-intersection for the relation linking a to c can be determined by deriving a combined 9-intersection from the two given 9-intersections. To this end Egenhofer speci"es eight `inferencea rules to describe the dependencies of the intersections so that given 9-intersections linking a to b and b to c, the 9-intersection linking a to c can be found. The signi"cance here is that the relationships can be produced without recourse to a
transitivity table (although the possible ways in which the above 8 relationships can be combined can be expressed using such a table). Thus, given the relations a covers b and b overlaps c, using Egenhofer's inference rules we can deduce that either a overlaps c, a covers c or a contains c. We can express this using the SPARTA representation as follows:

space(maxD1, ..., maxDN).
class(fixedRegion, fixed).
class(freeRegion, free, S).
instance(a, fixedRegion, L).
instance(b, freeRegion).
instance(c, freeRegion).
constraint(a, superset, b).
constraint(a, offset(shrink, {1}), notSuperset, b).
constraint(b, intersects, c).
constraint(b, notSubset, c).

Note that here, for illustrative purposes only, one of the instances has been defined as a fixed object located at L (this will have the effect of reducing the size of the search tree). The remaining two regions are then defined as free objects with shape S. When processed, this script will produce all possible configurations for the objects, each configuration illustrating a possible relation between the objects a and c, i.e. that a overlaps, covers or contains c in this case. The concept of topological relations forms an important part of spatio-temporal theory and has been adopted by many researchers working in the field. For example, the spatial axiomatisations of Clarke [48,49], Cohn [50], Randell et al. [51-53] (see also Cui et al. [54]), and Vieu [55] are all directed at a topological conceptualisation of spatial relations.
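A 9-intersection test is straightforward to sketch for raster regions. The Python fragment below assumes a region is a set of (x, y) cells within a finite universe, derives interior, boundary and exterior with 4-neighbour adjacency, and returns the 3 × 3 matrix of empty/non-empty intersections of Table 2; the grid-based reading of interior and boundary is an illustrative assumption, not Egenhofer's point-set formulation.

NBRS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def boundary(region):
    # Cells of the region with at least one 4-neighbour outside it.
    return {c for c in region
            if any((c[0] + dx, c[1] + dy) not in region for dx, dy in NBRS)}

def nine_intersection(a, b, universe):
    parts = lambda r: (r - boundary(r), boundary(r), universe - r)
    ia, ba, ea = parts(a)
    ib, bb, eb = parts(b)
    # 3x3 matrix of empty/non-empty intersections; rows = parts of A.
    return [[bool(p & q) for q in (ib, bb, eb)] for p in (ia, ba, ea)]

universe = {(x, y) for x in range(9) for y in range(9)}
A = {(x, y) for x in range(1, 7) for y in range(1, 7)}   # 6x6 region
B = {(x, y) for x in range(2, 5) for y in range(2, 5)}   # 3x3 region inside A
print(nine_intersection(A, B, universe))  # the pattern for "a contains b"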
9. Conclusions

A qualitative spatio-temporal reasoning mechanism has been described, founded on a tesseral representation and linearisation of space. The mechanism offers the following significant advantages:

1. It is universally applicable regardless of the number of dimensions under consideration.

2. It is fully compatible with both raster and vector representations, rendering it suited to a wide range of applications.

3. It is conceptually simple and computationally effective.

These advantages are gained primarily as a consequence of the particular tesseral representation on which the mechanism is founded.
The proposed mechanism has been incorporated into a spatio-temporal reasoning system, the SPARTA system, which has been tested against a significant number of spatial problem domains including GIS, environmental impact assessment and noise pollution. However, the approach is not complete. There are still many application areas, such as map and chart interaction, spatial simulation and environmental planning, which require further mechanisms to increase the expressiveness of the relations. With respect to the attributes that can be associated with spatial entities, there are also still a number of variations on the given list that require further investigation. For example, we can conceive of partially shaped objects, in that we may know a minimum shape for an object. Thus there is still much work to do; however, the work to date has resulted in a foundation which is already widely applicable.
10. Summary

A quantitative spatio-temporal reasoning mechanism founded on a tesseral representation of space is described. The principal advantages of this mechanism are that it is generally applicable to any N-dimensional space while at the same time providing for effective manipulation and comparison of objects located within such spaces. These advantages are a direct consequence of the derived tesseral representation on which the mechanism is founded. The defining features of this representation include:

• efficient translation through N-dimensional space using conventional integer arithmetic,
• computationally effective rotation,
• linearisation of space providing for the application of one-dimensional (temporal) reasoning techniques regardless of the number of dimensions under consideration,
• run line encoding,
• simple relational comparison of N-dimensional objects.

In addition, the representation may be viewed either as an intermediate representation which links the reasoning mechanism with some primary representation (raster and, by extension, vector), or as a primary representation in its own right. The mechanism is incorporated into a demonstration spatio-temporal reasoning system, the SPARTA (SPAtial Reasoning using Tesseral Addressing) system. This currently operates using a constraint satisfaction approach to spatial problem solving, supported by a heuristically guided constraint selection mechanism so as to minimise the search space. Spatial problems are passed to this system in the form of a script comprising a set of object
descriptions (de"ned in terms of classes and instances of those classes) and a set of constraints de"ning the relationships `desireda to exist between pairs or groups of objects. Successful resolution of a scrip results in production all solutions that satisfy the given constaints (assuming a solution exists). The resulting output can be either graphical or textual as directed by the user. The approach has been applied to a number of practical application areas including; site indenti"cation for civil engineering projects, electronic chart interaction and noise pollution analysis. Other, more esoteric, applications that have been investigated include classic AI problems such as N-queens problems, and multi-dimensional shape "tting scenarios. The paper gives a formal description of the representation and an overview of the SPARTA N-dimensional reasoning system and its associated constraint satisfaction mechanism. The paper is illustrated with implementations of Alan's interval calculaus and Egenhofer's 9Intersection approach to `topologicala N-dimensional spatial reasoning; which serve to give a #avour of the power of the proposed approach.
Acknowledgements

Early work on the SPARTA system was carried out as part of the dGKBIS (dynamic Geographic Knowledge Based Information Systems) project (http://www.csc.liv.ac.uk/~frans/dGKBIS/dGKBIS.html) funded by the UK Engineering and Physical Sciences Research Council. The author is indebted to the ground-breaking work of the original research team involved in this project, namely: Bridget Beattie, Trevor Bench-Capon, Bernard (Diz) Diaz and Michael Shave. The author would also like to thank Paul Leng for his helpful comments on an earlier version of this paper.
References

[1] B. Beattie, F.P. Coenen, T.J.M. Bench-Capon, B.M. Diaz, M.J.R. Shave, Spatial reasoning for GIS using a tesseral data representation, in: N. Revell, A.M. Tjoa (Eds.), Database and Expert Systems Applications (Proceedings DEXA'95), Lecture Notes in Computer Science, Vol. 978, Springer, Berlin, 1995, pp. 207-216.
[2] F.P. Coenen, B. Beattie, T.J.M. Bench-Capon, B.M. Diaz, M.J.R. Shave, Spatial reasoning for geographic information systems, Proceedings First International Conference on GeoComputation, School of Geography, University of Leeds, Vol. 1, 1996, pp. 121-131.
[3] A.G.P. Brown, F.P. Coenen, M.J.R. Shave, M.W. Knight, An AI approach to noise prediction, Build. Acoust. 4 (2) (1999) 137-150.
[4] B. Beattie, F.P. Coenen, A. Hough, T.J.M. Bench-Capon, B.M. Diaz, M.J.R. Shave, Spatial reasoning for environmental impact assessment, in: Third International Conference on GIS and Environmental Modelling, National Centre for Geographic Information and Analysis, Santa Barbara, WWW and CD, 1996.
[5] F.P. Coenen, B. Beattie, T.J.M. Bench-Capon, B.M. Diaz, M.J.R. Shave, Spatio-temporal reasoning using a multi-dimensional tesseral representation, Proceedings ECAI'98, Wiley, New York, 1998, pp. 140-144.
[6] F.P. Coenen, B. Beattie, T.J.M. Bench-Capon, M.J.R. Shave, B.M. Diaz, Spatial reasoning for timetabling: the TIMETABLER system, Proceedings of the First International Conference on the Practice and Theory of Automated Timetabling (ICPTAT'95), Napier University, Edinburgh, 1995, pp. 57-68.
[7] F.P. Coenen, Rulebase checking using a spatial representation, Database and Expert Systems Applications (Proceedings DEXA'98), Lecture Notes in Computer Science, Springer, Berlin, 1999, pp. 166-175.
[8] J.F. Allen, Time and time again: the many ways to represent time, Int. J. Intell. Systems 6 (1991) 341-355.
[9] M.J. Egenhofer, Deriving the composition of binary topological relations, J. Visual Languages Comput. 5 (1994) 133-149.
[10] S.B.M. Bell, B.M. Diaz, F.C. Holroyd, M.J.J. Jackson, Spatially referenced methods of processing raster and vector data, Image Vision Comput. 1 (4) (1983) 211-220.
[11] B.M. Diaz, S.B.M. Bell, Spatial Data Processing Using Tesseral Methods, Natural Environment Research Council Publication, Swindon, England, 1986.
[12] G.M. Morton, A computer oriented geodetic data base, and a new technique in file sequencing, IBM Canada Ltd., 1966.
[13] B. Grunbaum, G.C. Shephard, Tilings and Patterns, Freeman, New York, 1987.
[14] F.C. Holroyd, The geometry of tiling hierarchies, Ars Combin. 16B (1983) 211-244.
[15] H. Samet, The quadtree and related data structures, ACM Comput. Surv. 16 (2) (1984) 187-260.
[16] I. Gargantini, Linear octrees for fast processing of three-dimensional objects, Comput. Graphics Image Process. 20 (1982) 365-374.
[17] G. Retz-Schmidt, Various views on spatial prepositions, AI Mag. 9 (2) (1988) 95-105.
[18] A.G. Cohn, J.M. Gooday, B. Bennett, N.M. Gotts, A logical approach to representing and reasoning about space, Artif. Intell. Rev. 9 (1995) 255-259.
[19] P. van Hentenryck, Constraint Satisfaction in Logic Programming, MIT Press, Cambridge, MA, 1989.
[20] A.K. Mackworth, Consistency in networks of relations, AI J. 8 (1) (1977) 99-118.
[21] J.-L. Lauriere, A language and a program for stating and solving combinatorial problems, Artif. Intell. 10 (1) (1978) 29-127.
[22] M. Dincbas, P. van Hentenryck, H. Simonis, A. Aggoun, T. Graf, F. Berthier, The constraint logic programming language CHIP, in: Proceedings of the International Conference on Fifth Generation Computer Systems (FGCS'88), Tokyo, Japan, 1988.
[23] A. Colmerauer, Opening the PROLOG-III universe, Byte Mag. 12 (9) (1987).
[24] J. Jaffar, S. Michaylov, Methodology and implementation of a CLP system, Fourth International Conference on Logic Programming, Melbourne, Australia, 1987.
[25] J. Jaffar, J.-L. Lassez, M.J. Maher, A logic programming language scheme, in: D. DeGroot, G. Lindstrom (Eds.), Logic Programming: Relations, Functions and Equations, Prentice-Hall, Englewood Cliffs, NJ, 1986.
[26] C. Freksa, Qualitative spatial reasoning, in: D.M. Mark, A.U. Frank (Eds.), Cognitive and Linguistic Aspects of Geographic Space, Kluwer, Dordrecht, Netherlands, 1991, pp. 361-372.
[27] D. Hernández, Relative representation of spatial knowledge: the 2-D case, in: D.M. Mark, A.U. Frank (Eds.), Cognitive and Linguistic Aspects of Geographic Space, Kluwer, Dordrecht, Netherlands, 1991, pp. 373-385.
[28] A. Mukerjee, G. Joe, A qualitative model of space, Proceedings AAAI'90, 1990, pp. 721-727.
[29] G. Ligozat, Weak representations of interval algebras, Proceedings AAAI'90, 1990, pp. 715-720.
[30] J.F. Allen, Maintaining knowledge about temporal intervals, Commun. ACM 26 (11) (1983) 832-843.
[31] J.F. Allen, Towards a general theory of action and time, Artif. Intell. 23 (1984) 123-154.
[32] J.F. Allen, J.A. Koomen, Planning using a temporal world model, in: Proceedings IJCAI'83, Vol. 2, Morgan Kaufmann, Los Altos, 1983, pp. 741-747.
[33] J.F. Allen, P.J. Hayes, A common sense theory of time, in: Proceedings IJCAI'85, Morgan Kaufmann, Los Altos, 1985, pp. 528-531.
[34] J.F. Allen, P.J. Hayes, Short time periods, in: Proceedings IJCAI'87, Morgan Kaufmann, Los Altos, 1987, pp. 981-983.
[35] J.F. Allen, P.J. Hayes, Moments and points in an interval-based temporal logic, Comput. Intell. (Canada) 5 (1989) 225-238.
[36] A. Galton, A critical examination of Allen's theory of action and time, Artif. Intell. 42 (1990) 159-188.
[37] M.B. Vilain, H. Kautz, Constraint propagation algorithms for temporal reasoning, in: Proceedings AAAI'86, Morgan Kaufmann, Los Altos, CA, 1986, pp. 377-382.
[38] M.B. Vilain, H. Kautz, P.G. van Beek, Constraint propagation algorithms for temporal reasoning: a revised report, in: D.S. Weld, J. de Kleer (Eds.), Readings in Qualitative Reasoning about Physical Systems, Morgan Kaufmann, San Mateo, CA, 1989, pp. 373-381.
[39] K. Nökel, Convex relations between time intervals, in: Proceedings 5th Österreichische Artificial-Intelligence-Tagung, Springer, Berlin, 1989, pp. 298-302.
[40] D. Puller, M.J. Egenhofer, Towards formal definitions of topological relations among spatial objects, in: D. Marble (Ed.), Proceedings Third International Symposium on Spatial Data Handling, 1988, pp. 225-242.
[41] C.L. Hamblin, Instants and intervals, Stud. Gen. 24 (1971) 127-134.
[42] S.K. Chang, Q.Y. Shi, C.W. Yan, Iconic indexing by 2-D strings, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9 (3) (1987) 413-428.
[43] A. Frank, R. Barrera, K.K. Al-Taha, Temporal relations in geographic information systems, ACM SIGMOD Rec. 20 (3) (1991) 85-91.
[44] J. Malik, T.O. Binford, Reasoning in time and space, Proceedings International Joint Conference on Artificial Intelligence (IJCAI'83), Vol. 1, 1983, pp. 343-345.
[45] D. Papadias, Y. Theodoridis, T. Sellis, M.J. Egenhofer, Topological relations in the world of minimum bounding rectangles: a study with R-trees, SIGMOD'95, San Jose, CA, USA, 1995, pp. 92-103.
[46] M.J. Egenhofer, A formal definition of binary topological relationships, in: W. Litwin, H.-J. Schek (Eds.), Proceedings Third International Conference on Foundations of Data Organisation and Algorithms (FODO), Lecture Notes in Computer Science, Vol. 367, Springer, New York, 1989, pp. 457-472.
[47] M.J. Egenhofer, R. Franzosa, Point-set topological spatial relationships, Int. J. Geographic Inform. Systems 5 (1991) 161-174.
[48] B.L. Clarke, A calculus of individuals based on connection, Notre Dame J. Formal Logic 23 (3) (1981) 204-218.
[49] B.L. Clarke, Individuals and points, Notre Dame J. Formal Logic 26 (1) (1985) 61-75.
[50] A.G. Cohn, A more expressive formulation of many sorted logic, J. Automat. Reason. 3 (2) (1987) 113-200.
[51] D.A. Randell, Analysing the familiar: reasoning about space and time in the everyday world, Ph.D. Thesis, University of Warwick, UK, 1991.
[52] D.A. Randell, Z. Cui, A.G. Cohn, An interval logic for space based on "connections", Proceedings ECAI'92, 1992.
[53] D.A. Randell, Z. Cui, A.G. Cohn, A spatial logic based on regions and connection, in: Proceedings Third International Conference on Principles of Knowledge Representation and Reasoning, Morgan Kaufmann, Los Altos, CA, 1992.
[54] Z. Cui, A.G. Cohn, D.A. Randell, Qualitative simulation based on a logic of space and time, in: Proceedings AISB Workshop "Qualitative and Causal Reasoning" (QCR'93), University of Birmingham, 1993.
[55] L. Vieu, A logical framework for reasoning about space, in: A.U. Frank, I. Campari (Eds.), Spatial Information Theory: A Theoretical Basis for GIS (Proceedings COSIT'93), Springer, Berlin, 1993, pp. 25-35.
About the Author: FRANS COENEN is a lecturer in the Department of Computer Science at the University of Liverpool. His current research interests include: spatio-temporal reasoning, particularly its implementation using tesseral approaches and its application to Geographic Information Systems (GIS) and marine electronic chart systems; Knowledge Based Systems (KBS) and their validation, verification and maintenance; and the application of all forms of computer science to the maritime industry. He has published widely on all these subjects.
Pattern Recognition 33 (2000) 25-41
Comparison of algorithms that select features for pattern classifiers

Mineichi Kudo*, Jack Sklansky

Division of Systems and Information Engineering, Graduate School of Engineering, Hokkaido University, Kita 13, Nishi 8, Sapporo 060-8628, Japan
Department of Electrical Engineering, University of California, Irvine, CA 92697, USA

Received 15 May 1998; received in revised form 9 November 1998; accepted 12 January 1999
Abstract

A comparative study of algorithms for large-scale feature selection (where the number of features is over 50) is carried out. In the study, the goodness of a feature subset is measured by the leave-one-out correct-classification rate of a nearest-neighbor (1-NN) classifier, and many practical problems are used. A unified way is given to compare algorithms having dissimilar objectives. Based on the results of many experiments, we give guidelines for the use of feature selection algorithms. In particular, it is shown that sequential floating search methods are suitable for small- and medium-scale problems and genetic algorithms are suitable for large-scale problems. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Feature selection; Monotonicity; Genetic algorithms; Leave-one-out method; k-nearest-neighbor method
1. Introduction

Feature selection in the design of pattern classifiers has three goals: (1) to reduce the cost of extracting features, (2) to improve the classification accuracy, and (3) to improve the reliability of the estimate of performance. In the design of segmenters of medical images or of remotely sensed aerial images, the initial set of candidate features often consists of over 100 features. Such a large number of features often includes many garbage features. Such features are not only useless in classification, but sometimes degrade the performance of a classifier designed on
* Corresponding author. Tel.: +81-11-706-6852; fax: +81-11-706-6852. E-mail address:
[email protected] (M. Kudo). This work was carried out within the Japan-U.S. Cooperative Science Program of the U.S. National Science Foundation (NSF) and the Japan Society for the Promotion of Science (JSPS). The authors thank NSF and JSPS for their financial support. Part of this support was provided by NSF grant No. IRI-9123720.
the basis of a "nite number of training samples. In such a case, removing the garbage features can improve the classi"cation accuracy. The choice of an algorithm for selecting features from an initial set > depends on ">", the number of features in >. We say that the feature selection problem is small scale, medium scale, or large scale if ">" belongs to [0, 19], [20, 49], or [50, R], respectively. A large number of algorithms have been proposed for feature selection and some comparative studies have been carried out [1}5]. However, many comparative studies do not treat large-scale problems or nonmonotonic problems where an addition of a feature can degrade the classi"cation accuracy. Some studies adopt a monotonic criterion like Mahalanobis distance, but this monotonicity is satis"ed only when the optimal Bayes classi"er is used } an impractical hypothesis. Pudil et al. [4] included nonmonotonic criteria in comparison of large-scale problems, but they compared only a few algorithms. In comparing feature selection algorithms, we place them in three groups according to what is optimized. In one group, the algorithms "nd the feature subset of
a speci"ed dimensionality in which the classes of data are most discriminable. In a second group, the algorithms "nd the smallest feature dimensionality for which the discriminability exceeds a speci"ed value. In a third group, the algorithms "nd a compromise between a small subset of features and class discriminability. To ensure that our conclusions and recommendations are realistic, we tested the feature selection algorithms on real, rather than synthetic data. These data included mammograms, synthetic aperture radar images, numerical shape descriptors of mushrooms, and shape descriptors of motor vehicles. Our estimates of discriminability were based mainly on leave-one-out correct-classi"cation rate of nearest-neighbor (1-NN) classi"ers. We developed a uni"ed method of comparing selection algorithms with di!erent objectives. We describe this method in Section 3.
2. Algorithms

Here we describe our terminology. Feature sets: The initial feature set is denoted by Y, |Y| = n, and a selected feature subset by X. Unless mentioned otherwise, we find the best X of size m. X_i denotes a feature subset of size i. Criterion: A criterion function J(X) evaluates the goodness of subset X on the basis of the ability of a classifier to discriminate the classes in the feature space represented by X. A larger value of J indicates a better feature subset. We list the algorithms treated in this paper in Table 1, along with their time complexities, their objective types and their search types. Every algorithm is classified into one of three objective types. For Objective Type A, the algorithm finds the subset of a given size for which J is a maximum. For Objective Type B, the algorithm
Table 1
Types and time complexity of feature selection algorithms. The objective types are (A) to find the best subset of a given size, (B) to find the smallest subset satisfying a condition, (C) to find a subset whose combined size and error rate are optimal. The search types are (S) sequential or (P) parallel

Algorithm              Time complexity        Objective type    Search type
SFS, SBS               Θ(n^2)                 A                 S
GSFS(g), GSBS(g)       Θ(n^(g+1))             A                 S
PTA(l, r)              Θ(n^2)                 A                 S
GPTA(l, r)             Θ(n^(max{l,r}+1))      A                 S
SFFS, SBFS             O(2^n)                 A                 S
BAB, BAB+, BAB++       O(2^n)                 A                 S
RBAB, RBABM            O(2^n)                 B                 S
GA                     Θ(1) (Θ(n))            C                 P
PARA                   Θ(n) (Θ(n^2))          C                 P
"nds the smallest subset for which J is not less than a speci"ed value, J . For Objective Type C, the algorithm "nds a compromise between Objective Types A and B. This compromise is found by minimizing a penalty function of "X" and J. In Table 1, #( ) ) denotes a tight estimate of complexity (exact except for a multiplicative constant) and O( ) ) denotes an estimate of complexity for which only an upper bound is known. Time complexities under a typical setting of parameters are shown in parentheses. These time complexities are only a clue when we use these algorithms. In many practical situations, some algorithms are carried out faster than their estimates and others are not. For example, BAB is the fastest in some problems in spite of its time complexity of O(2L) and GA consumes time in proportion to the number of generations and also to the population size. We summarize these algorithms below. Some algorithms are improved at some points in this paper. Such points and the new algorithms are marked by &-'. The "rst six algorithms are found in Ref. [1]. SFS, SBS, GSFS(u), SFS: Selects the best single feature and then the best pair including the best GSBS(g) single, and so on. SBS is the backward version. These algorithms are generalized to GSFS(g) and GSBS(g) in such a way that g features are evaluated at the same time and the best g-feature subset is chosen for addition or deletion in the algorithms. PTA(l, r), GPTA(l, r): Plus-l take-away-r algorithm. Go forward l stages (by adding l features by SFS) and go backward r stages (by deleting r features by SBS) and repeat this process. In the generalized algorithm (GPTA(l, r)), GSFS(l) and GSBS(r) are used instead of SFS and SBS in inclusion and exclusion stage, respectively. SFFS, SBFS: The #oating version of PTA(l, r). Unlike PTA(l, r), SFFS can backtrack unlimitedly as long as the backtrack "nds a better feature subset than the feature subset obtained so far at the same size [3]. SBFS is the backward version. BAB, BABⴙ, BABⴙⴙ-, BAB(s), BABⴙⴙ(s), BABⴙⴙ(s)-- : The branch and bound methods. BAB is the original algorithm proposed by Narendra and Fukunaga [6] and BAB> is an improved algorithm by one by Yu and Yuan [7]. Both methods give the optimal solution, as long as the criterion function J is monotonic. BAB(s) is a faster but suboptimal version of BAB [6] (also BAB>(s) for BAB> and BAB>>(s) for BAB>>). In a feature subset X of size t(*m) under search, BAB uses the best criterion value obtained so far at size m for cutting the branches below X, while BAB(s) uses the best criterion value obtained so far at size t!s (if t!s)m then it behaves like BAB), where s is a look-ahead parameter. In this study, we use s"0, 1, 2. In these algorithms, the
efficiency depends on how fast they find a better threshold for cutting branches. Therefore, in BAB++†, we improved BAB+ so that it can take an initial threshold h externally before starting the algorithm. This initial threshold h is useful until the algorithm finds a better threshold. How to give h is described in Section 4.4.1.

RBAB, RBABM†: The relaxed branch and bound methods. These algorithms can find the optimal solution even if the monotonicity of J is damaged a little. Unlike BAB, RBAB [8] aims to find the smallest subset for which the criterion value is not under a given threshold h, and the search is carried out over subsets with criterion values greater than or equal to h − δ (δ > 0), where δ is called a margin. RBABM† does not use the margin δ any more. Instead, RBABM cuts branches below X only when both X and a parent of X are under h.

GA: The genetic algorithm. Many studies have been done on GA for feature selection (for example, see Refs. [9,10]). In GA, a feature subset is represented by a binary string of length n, called a chromosome, with a zero or one in position i denoting the absence or presence of feature i. Each chromosome is evaluated in its fitness through an optimization function in order to survive to the next generation. A population of chromosomes is maintained and evolved by the two operators of crossover and mutation. This algorithm can be regarded as a parallel and randomized algorithm. The initial population is arbitrary, so we discuss how to choose the initial population†.

PARA: A parallel algorithm. This algorithm maintains a population of N feature subsets, as in GA, but it updates the population only by local hill-climbing; that is, the population for the next generation is made of the N best feature subsets among all unvisited immediate supersets and immediate subsets of the current N subsets.

2.1. Practical implementation of GA and PARA

Here, we describe in detail the implementation of GA and PARA. The following function has often been used in GA [11]:

O(X) = −|X| − (exp((h − J(X))/a) − 1)/(exp(1) − 1),    (1)

where a and h are constants specified according to the problem (usually a = 0.01 J_max, where J_max is an estimated upper bound of J). This optimization function requires the answer subset X first to satisfy J(X) ≥ h and then to be as small as possible, because the exponential function penalizes heavily subsets not satisfying the first requirement. In this study, however, we take more straightforward
optimization functions in order to make GA comparable with Objective Type A and Objective Type B algorithms. First we estimate an upper bound J_max and a lower bound J_min of the criterion function through a preliminary feature selection described later. The following optimization function is in accordance with Objective Type A algorithms:

O_A(X) = J(X) − ε|X|      (|X| ≤ m),
O_A(X) = J_min − ε|X|     (|X| > m),    (2)

where ε = b(J_max − J_min)/n (b = 0.01 in this paper) and is introduced in order to make the second term work only when two subsets have almost the same criterion values. The following optimization function is in accord with Objective Type B algorithms:

O_B(X) = −|X| + (J(X) − J_min)/(J_max − J_min + ε)    (J(X) ≥ h),
O_B(X) = −n + (J(X) − J_min)/(J_max − J_min + ε)      (J(X) < h),

where ε is an arbitrary small positive constant and h is a threshold. It is noted that 0 ≤ (J(X) − J_min)/(J_max − J_min + ε) < 1, and thus a superset X′ of X never takes a larger value of O_B than X as long as both X and X′ satisfy J(X), J(X′) ≥ h. We use both O_A and O_B in GA and PARA for comparison with Objective Type A and Objective Type B algorithms. In addition, we use O_C = J(X) in order to find the optimal solution in J.

In GA, the difficulty is in the setting of parameters. GA has mainly four parameters to be set: the population size N, the maximum number of generations T, the probability of crossover p_c, and the probability of mutation p_m. In addition, there is arbitrariness in the initial population of chromosomes. We determined these values on the basis of the results of many experiments using artificial data. In this study, as in most problems, we set N = 2n and T = 50. The resulting complexity of GA is Θ(n). We use mainly two sets of values of (p_c, p_m): (0.8, 0.1) and (0.6, 0.4). In addition, we use the following two types of initial populations of chromosomes: (P1) 2n extreme feature subsets consisting of the n distinct 1-feature subsets and the n distinct (n−1)-feature subsets, and (P2) 2n feature subsets in which the number of features is in [m−2, m+2] and all features appear as evenly as possible, where m is the desired number of features and n is the original number of features. A setting (P1) is denoted by {1, n−1} and (P2) by [m−2, m+2]. Thus GA is specified by a six-tuple (objective type (O_A, O_B or O_C), N, T, initial population type ({1, n−1} or [m−2, m+2]), p_c, p_m).
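As a concrete reading of the two optimization functions, the following Python sketch builds O_A and O_B as closures over a criterion function J. The constants follow the definitions above; the names and structure are illustrative, not the authors' implementation.

def make_O_A(J, J_min, J_max, n, m, b=0.01):
    # Eq. (2): reward the criterion up to size m, with a tiny size penalty
    # that only matters when two subsets have nearly equal J.
    eps = b * (J_max - J_min) / n
    def O_A(X):
        base = J(X) if len(X) <= m else J_min
        return base - eps * len(X)
    return O_A

def make_O_B(J, J_min, J_max, n, h, eps=1e-6):
    # Subsets meeting the threshold h are ranked by (small) size first;
    # the scaled criterion term lies in [0, 1) and only breaks ties.
    def O_B(X):
        j = J(X)
        scaled = (j - J_min) / (J_max - J_min + eps)
        return (-len(X) if j >= h else -n) + scaled
    return O_B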
PARA also has two parameters: the population size and the maximum number of generations. We used N = 2n and T = 50, as in GA. PARA is specified by a triplet (objective type (O_A, O_B or O_C), N, initial population type ({1, n−1} or [m−2, m+2])). Some parameters are omitted if their values are obvious from the context. PARA is carried out until its population is no better than its previous population.
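A single PARA generation, as described, can be sketched as follows; feature subsets are assumed to be frozensets of feature indices and fitness is one of the optimization functions above. This is an illustrative reconstruction, not the authors' code.

def para_step(population, n_features, fitness, visited, N):
    # Gather all unvisited immediate supersets and subsets of the
    # current population (local hill-climbing: no crossover, no mutation).
    candidates = set()
    for X in population:
        for f in range(n_features):
            Y = X | {f} if f not in X else X - {f}
            if Y and Y not in visited:
                candidates.add(Y)
    visited |= candidates
    ranked = sorted(candidates, key=fitness, reverse=True)
    return ranked[:N]   # iterate until no better population appears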
2.2. Criterion function and monotonicity

In feature selection, the choice of the criterion function J(X) is very important. If we know which classifier will be used in the problem under consideration, the best criterion is, in general, the correct recognition rate of the classifier for infinitely many samples in the feature space represented by X. However, it is very difficult to estimate the correct recognition rate (or the error rate) of a classifier on the basis of a limited number of training samples. This is one reason why previous comparative studies on feature selection used the Mahalanobis distance, which gives an upper bound of the Bayes error rate with a priori probabilities of classes. Such a parametric estimation is not practical in many problems. In this study we estimate the correct recognition rate by the leave-one-out technique, where a training sample is used as a test sample, a classifier is constructed from the remaining training samples, and the test sample is classified by the constructed classifier. This procedure is repeated until all training samples have been classified, and the results are summed up. If the number of training samples is large enough, we use the d-fold cross-validation technique, where each of d nonoverlapping similar-sized subsets is used in place of a single training sample in the leave-one-out technique. In this technique we usually use the nearest-neighbor classifier (1-NN) because of its good asymptotic property for large training samples: it is known that e_NN < 2 e_Bayes when the number of training samples is sufficiently large. (We define e_NN as the error rate of the 1-NN classifier, and e_Bayes as the error rate of the Bayes optimal classifier.) Another reason for using 1-NN is its ease of implementation with the leave-one-out technique.

If we use near-Bayes-optimal classifiers, the correct recognition rate is likely to be nearly monotonic. To exploit this property, Foroutan and Sklansky [8] introduced the concept of approximate monotonicity and proposed a relaxed branch and bound method, RBAB. RBAB chooses the smallest feature subset from the feasible solutions, i.e. the feature subsets for which J exceeds a threshold h. The search is continued for feature subsets for which J exceeds h − δ (δ > 0), where δ is called a margin. Thus, within a range of margin δ, J is permitted to violate its monotonicity and RBAB still finds the optimal solution. However, there remains the problem of determining the value of δ. We propose another version of the branch and bound method, called RBABM. A criterion function J is said to be k-monotonic if

X ⊂ X′, |X′| − |X| ≥ k ⇒ J(X) ≤ J(X′)

for every X, X′. BAB performs the optimal search only when J is 1-monotonic, and RBABM does so even when J is only 2-monotonic. That is, RBABM permits a violation of monotonicity between parents and children, but requires monotonicity between grandparents and grandchildren.

3. Preliminary feature selection
The result of an algorithm with a sequential search can be represented by a curve, called a criterion curve, connecting the points (m, J(X_m)) obtained by the algorithm at each size m (m = 1, 2, ..., n). Many comparative studies [2,3,12] compared algorithms by their criterion curves. However, our concern is only with the part where the criterion value does not degrade too much. In addition, it is difficult to compare algorithms with different objective types by criterion curves. To cope with these difficulties, we propose preliminary feature selection. This approach was inspired by Sklansky and Siedlecki [11]. With an algorithm of low time complexity like SFS or SBS, we execute a preliminary feature selection in order to obtain a criterion curve. Then we classify the problem under consideration into one of three cases: a monotonic case, an approximately monotonic case or a nonmonotonic case. In addition, the values of h and δ for RBAB and RBABM are determined from the criterion curve(s). If the problem is judged to be monotonic or approximately monotonic, we determine a parameter α (= 1% or 5%) showing the permitted degree of degradation and find the point that is degraded by α as compared with the maximum criterion value J_max (Fig. 1). From this α-degradation
point, we determine a criterion value J_α as a threshold and the corresponding number of features m_α as a desired number of features. The value of m_α is passed to Type-A algorithms, and the value of J_α is passed to Type-B algorithms as the threshold h. For Type-C algorithms, both values are passed to their optimization functions O_A and O_B. In addition, an upper bound J_max and a lower bound J_min of J are read from the criterion curve and used in O_A and O_B. If the problem is judged to be nonmonotonic, we use only Type-A and Type-C algorithms. Type-A algorithms are carried out so as to draw their criterion curves, in such a way that an algorithm with a forward search is required to find the best (n−1)-feature subset and an algorithm with a backward search is required to find the best 1-feature subset. In the process of this search, we obtain a criterion curve. Then we choose the optimal subset on the criterion curve as the answer (Fig. 1). For Type-C algorithms, we use O_C = J(X) as the optimization function for obtaining the optimal subset in the criterion.
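The criterion J used for most of the problems, i.e. the leave-one-out correct-classification rate of a 1-NN classifier restricted to the features in X, can be computed compactly with NumPy, as in the following sketch (assuming data is a (K, n) array and labels is a length-K array; the names are illustrative).

import numpy as np

def loo_1nn_rate(data, labels, X):
    # Leave-one-out 1-NN correct-classification rate on feature subset X.
    sub = data[:, sorted(X)]                      # keep only features in X
    d2 = ((sub[:, None, :] - sub[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                  # leave each sample out
    nearest = d2.argmin(axis=1)                   # index of nearest other sample
    return float((labels[nearest] == labels).mean())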
Fig. 1. Classi"cation of problems.
4. Experimental results

4.1. Datasets

We examined eight different problems based on one synthetic dataset and five real datasets (Table 2). The datasets indicated by (small) are datasets in which some features were preselected by the method SUB, a revised version of the method of Kudo and Shimbo [13]. SUB was devised for classifier-independent feature selection [14]. Unlike the classifier-specific algorithms discussed in this paper, classifier-independent feature selection algorithms do not need a criterion function J; they aim to select a feature subset for which any kind of classifier improves in its discriminability compared with when the initial feature set is used. A comparative study of classifier-independent algorithms is given in Ref. [15]. The problem type is determined by the preliminary feature selection with SFS and SBS.

SAR: Synthetic aperture radar satellite image. We tested a SAR image with 28,561 (= 169 × 169) pixels. There are three classes corresponding to different landmarks (urban area, runway and agricultural area). A pixel is characterized by 10 texture features. All pixels are labeled by hand. One percent of all pixels were extracted randomly in proportion to the population sizes of the classes.

Vehicle: Vehicle data [16]. The task is to classify a given silhouette as one of four types of vehicle, using 18 features extracted from the silhouette. From the preliminary feature selection, we determined α = 5% and J_α = 0.821, m_α = 9. For RBAB, δ = 0.005 is used.

Mammogram (large): A mammogram database. The database is a collection of 86 mammograms from 74 cases, gathered from the University of California, San Francisco (UCSF), the Mammographic
Image Analysis Society (MIAS), and the University of California, Los Angeles (UCLA). The 65 chosen features comprise 18 features characterizing calcification (number, shape, size, etc.) and 47 texture features (histogram statistics, Gabor wavelet response, edge intensity, etc.). There are two classes, benign and malignant (57 and 29 samples, respectively). For this problem, we penalize the error of misclassifying malignant as benign more heavily than the reverse error. We therefore calculated a criterion value by weighting these two kinds of errors, obtained by the leave-one-out method, by 10/11 and 1/11. In addition, to adapt the 5-NN method to this special requirement, we weighted malignant samples by three, so that a malignant neighbor is counted as three neighbors. With this modification, if two of five neighbors are malignant, the unknown sample is judged malignant because 2 × 3 > 3.

Mammogram (small): A mammogram database. This is the same data as the Mammogram (large) data; 19 features were preselected from the original 65 features by SUB. We used J_α = 0.97 and m_α = 13 for α = 1%. For RBAB, δ = 0.06 is used.

Kittler: Kittler's synthetic data, used in many references [1,3,12]. There are two classes with 20 features. The two distributions of samples are normal distributions with a common covariance matrix Σ and different mean vectors μ₁ and μ₂. As the criterion function J, we used the Mahalanobis distance defined by (μ₁ − μ₂)ᵀ Σ⁻¹ (μ₁ − μ₂), which is equivalent to the Bayes correct recognition rate in this problem. We used J_α = 4.75 and m_α = 16 for α = 5%.

Sonar (large): A sonar database [16]. The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock, using 60 features of which each
describes the energy within a particular frequency band, integrated over a certain period of time. The database consists of 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions, and 97 patterns obtained from rocks under similar conditions.

Sonar (small): A sonar database. This is the same data as above; 40 features were preselected from the 60 by SUB. We used J_α = 0.9 and m_α = 32 for α = 1%.

Mushroom (small): A mushroom database [16]. The task is to assess the edibility of a large sample of mushrooms. There are two classes, edible and poisonous, with 4208 and 3916 samples, respectively. The 22 categorical features include cap-shape, odor, gill-color, etc. They were converted to 125 numerical features in such a way that a categorical feature with k possible category values was converted to k numerical features. For example, for the categorical feature gill-size with values "broad" or "narrow", "broad" is converted to the two numerical features (1, 0) and "narrow" to (0, 1). From these 125 original features, 29 features were preselected by SUB. We chose 1000 samples (500 each) randomly.
4.2. Evaluation method

In this study, we used two kinds of graphs to compare results: (G1) a graph of m vs. J(X_m), and (G2) a graph of the number of evaluations of J vs. J(X)/(J_U - J_L) - |X|/(n - 1), where J_L and J_U are determined depending on the problem, in order to account for the number of features in addition to the criterion value J(X). In (G2), thus, goodness is evaluated along lines of slope (J_U - J_L)/(n - 1): an increase of one feature must be compensated by an increase of (J_U - J_L)/(n - 1) in criterion value.

4.3. Setting of GA and PARA

The values of the parameters used in GA and PARA are summarized in Tables 3 and 4.

4.4. Results and discussion

All results are shown in Figs. 2-10. We summarize these results in Tables 5 and 6. Next, we examine the results in detail on individual topics.
Table 2
Experimental data. n: number of features, M: number of classes, K: number of training samples per class, J: criterion function ((L): leave-one-out, (k-CV): k-fold cross-validation)

Database            n    M   K             J                         Problem type
SAR                 10   3   131, 61, 93   (L) 1-NN                  Nonmonotonic
Vehicle             18   4   199-218       (9-CV) linear classifier  App. monotonic
Mammogram (small)   19   2   57 and 29     (L) weighted 5-NN         App. monotonic
Kittler             20   2   1000 each     Mahalanobis               Monotonic
Mushroom (small)    29   2   500 each      (L) 1-NN                  Monotonic
Sonar (small)       40   2   111 and 97    (L) 1-NN                  App. monotonic
Sonar (large)       60   2   111 and 97    (L) 1-NN                  Nonmonotonic
Mammogram (large)   65   2   57 and 29     (L) weighted 5-NN         Nonmonotonic
Table 3
Parameters of GA. O: optimization function, N: population size, I: initial population type, T: maximum number of generations, (p_c, p_m): crossover probability and mutation probability, K: number of trials for each parameter set. '/' means 'or'

Database            O        N    I                  T   (p_c, p_m)                          K
SAR                 O_C      20   [2, 6]/{1, 9}      50  (0.6, 0.4)/(0.8, 0.1)               3
Vehicle             O_A/O_B  36   [8, 12]/{1, 17}    50  (0.6, 0.4)/(0.8, 0.1)               1
Mammogram (small)   O_A/O_B  38   [12, 16]/{1, 18}   50  (0.6, 0.4)/(0.8, 0.1)               1
Kittler             O_A/O_B  40   [14, 18]/{1, 19}   50  (0.6, 0.4)/(0.8, 0.1)/(0.8, 0.02)   1
Mushroom (small)    O_A/O_B  58   [1, 5]/{1, 28}     50  (0.6, 0.4)/(0.8, 0.1)               1
Sonar (small)       O_A/O_B  80   [30, 34]/{1, 39}   50  (0.6, 0.4)/(0.8, 0.1)               1
Sonar (large)       O_C      120  [18, 22]/{1, 59}   50  (0.6, 0.4)/(0.8, 0.1)               3
Mammogram (large)   O_C      130  [17, 21]/{1, 64}   50  (0.6, 0.4)/(0.8, 0.1)               3
Table 4
Parameters of PARA. O: optimization function, N: population size, I: initial population type, K: number of trials for each parameter set. '/' means 'or'

Database            O        N       I                  K
SAR                 O_C      20      [2, 6]/{1, 9}      3
Vehicle             O_A/O_B  36      [8, 12]/{1, 17}    3
Mammogram (small)   O_A/O_B  38      [12, 16]/{1, 18}   3
Kittler             O_A/O_B  40      [14, 18]/{1, 19}   1
Sonar (small)       O_A/O_B  80      [30, 34]/{1, 39}   1
Sonar (large)       O_C      10/120  [18, 22]/{1, 59}   3
Mammogram (large)   O_C      10/130  [17, 21]/{1, 64}   3
Fig. 2. Result of Kittler's data. J_a = 4.75 and m_a = 16 for a = 5%. Here X is a selected feature subset, J(X) is the Mahalanobis distance, |X| is the size of X, and U is the number of evaluations. (a) |X| vs. J(X); (b) U vs. J(X)/(5 - 4) - |X|/(20 - 1).
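For concreteness, the normalized goodness measure plotted on the vertical axis of the type-(G2) graphs can be computed as in this minimal sketch (the function name is ours; J_L, J_U and n are the problem-dependent constants of Section 4.2):

```python
def goodness(J_of_X, subset_size, J_L, J_U, n):
    """Normalized measure of graph type (G2): J(X)/(J_U - J_L) - |X|/(n - 1)."""
    return J_of_X / (J_U - J_L) - subset_size / (n - 1)

# Kittler's data (Fig. 2b): J_L = 4, J_U = 5, n = 20; a subset with J(X) = 4.75, |X| = 16
print(goodness(4.75, 16, 4.0, 5.0, 20))   # 4.75/1 - 16/19 = 3.908...
```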
4.4.1. Efficiency of BAB algorithms

The branch and bound method, BAB, is equivalent to the exhaustive search method when the criterion function is monotonic, but it is too time-consuming. So, we compared several variants of the branch and bound method on Kittler's data. First, we compared BAB, BAB+ and BAB++. As the initial threshold theta for BAB++, we adopted the better of the solutions obtained by SFS and SBS at each size m (m = 1, 2, ..., n - 1). The result is shown in Fig. 3. For comparison, the results of SFFS and SBFS are also shown. In Fig. 3, the result of BAB++ does not include the number of evaluations consumed by SFS and SBS, because these had already been carried out in the preliminary feature selection. BAB+ is very efficient compared with the original BAB, especially when the desired number of features m is relatively small. In addition, giving an initial threshold theta to BAB+, that is, BAB++, contributes a little more efficiency (Fig. 3). Since such an initial threshold is useful until the algorithm finds a new and better threshold, it should work especially well when there are many possible solutions, that is, when C(n, m) is large: when n is large or m is close to n/2.
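To make the pruning idea concrete, here is a minimal sketch of a basic branch and bound search with an initial threshold, in the spirit of BAB++ (this is our simplification: it omits the node-ordering refinements of BAB+, and assumes a monotonic criterion J over tuples of feature indices):

```python
def bab(features, m, J, theta=float("-inf")):
    """Find the size-m subset maximizing a monotonic criterion J.
    theta is an initial threshold, e.g. the better of the SFS/SBS solutions (BAB++)."""
    best, best_val = None, theta

    def search(subset, start):
        nonlocal best, best_val
        val = J(subset)
        if val <= best_val:        # monotonicity: no subset of 'subset' can do better,
            return                 # so the whole branch below this node is pruned
        if len(subset) == m:
            best, best_val = subset, val
            return
        # delete one more feature; 'start' ensures each subset is visited exactly once
        for i in range(start, len(subset)):
            search(subset[:i] + subset[i + 1:], i)

    search(tuple(features), 0)
    return best, best_val
```

With theta left at its default, this degenerates to the plain BAB; a good initial theta lets the search discard branches from the very first levels, which is exactly the benefit discussed above.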
Fig. 3. BAB families on Kittler's data. (a) |X| vs. U; (b) |X| vs. J(X).
Next, we examined the suboptimal version of BAB++, BAB++(s) (s = 0, 1, 2). The look-ahead parameter s shows how deep the algorithm looks ahead. The larger the value of s the algorithm uses, the closer it approaches the optimal solution; BAB++(n - m - 1) is equivalent to BAB++. The results are shown in Fig. 3a and b. In criterion value, BAB++(1) and BAB++(2) are sufficiently close to the optimal curve and are faster than BAB++. In conclusion, in this problem, BAB+ and BAB++ outperform the other algorithms in the region of interest (12 < m < 20), in both performance and speed (Fig. 3). Also in the vehicle data with 18 features and in the mushroom (small) data with 29 features, BAB++ was comparable in speed with the other algorithms and found near-optimal solutions, although some other algorithms found the same solutions (Figs. 8 and 9). Its suboptimal version BAB++(s) was inferior in performance to BAB++ but faster. One possible use of BAB++(s) is to carry it out in advance of BAB++, increasing the value of s one by one from 0. Then we can have some solutions in an acceptable time.

For approximately monotonic problems, RBAB and RBABM are expected to be effective. We obtained solutions in a reasonable time only for the mammogram (small) and vehicle data. The results for mammogram (small) are shown in Fig. 6. From these figures, we can see that RBAB and RBABM succeeded in finding smaller feature subsets than those of the other algorithms, without a large degradation of performance. In particular, RBAB could find the smallest subset, with six features. This best solution cannot be obtained by SFFS and SBFS even when they are carried out over all the features (Fig. 6d). RBAB worked well in the vehicle data too, where GA also gave the same answer. RBAB and RBABM require a tremendous number of evaluations (Fig. 6c and Fig. 8b). RBAB evaluated 147,195 subsets (28% of the 2^19 - 1 possible subsets) and RBABM evaluated 31,143 (6%) in the mammogram (small) data, and RBAB evaluated 92,380 subsets (35% of the 2^18 - 1 possible subsets) in the vehicle data. However, they found their answers already within the first 34,586 and 6331 evaluations in the mammogram (small) data, respectively, and within 4709 evaluations in the vehicle data (RBAB only). Therefore, a possible efficient use of RBAB and RBABM is to terminate them after a specified number of evaluations.

4.4.2. Sequential algorithms

In this section, we restrict our discussion to sequential algorithms. When using a sequential algorithm, both the forward and backward forms of the algorithm should be used at the same time. Usually, backward algorithms are better than their forward counterparts. However, this depends on the specific problem (for example, SFFS is better than SBFS in the sonar (large) and vehicle data; see Fig. 5a and Fig. 8a). In addition, it might seem that forward algorithms are better than backward algorithms when m is very small (say, m < 5), and the reverse when m is near n (say, n - m < 5). However, this is not always true: we can see two counterexamples in the mushroom (small) and sonar (small) data (Figs. 9 and 10).
Fig. 4. Result graphs of Mammogram (large) data. Here X is a selected feature subset, J(X) is the leave-one-out correct recognition rate by a weighted 5-NN method, |X| is the size of X, and U is the number of evaluations. (a) |X| vs. J(X); (b) GA and PARA; (c) U vs. J(X)/(1 - 0.9) - |X|/(65 - 1); (d) GA vs. SFFS and SBFS.
Therefore, both directions should be examined at the same time.

The sequential forward and backward floating search algorithms (SFFS and SBFS) are known to be effective for many problems [2-5]. Indeed, these algorithms found fairly good solutions in a moderate time in our experiments. However, it cannot be said that they are always better than the others. SBFS or SFFS sometimes failed to find good solutions (see Figs. 2, 6, 8 and 10), but at least one of them worked well on almost all datasets. On average, GPTA(1, 2) was the best single sequential algorithm in performance, but it was very time-consuming. Overall, SFFS and SBFS found fairly good solutions in a shorter time than the other sequential algorithms, provided that both algorithms are used at the same time. A minimal sketch of the floating search idea is given below.

Fig. 5. Result of Sonar (large) data. Here X is a selected feature subset, J(X) is the leave-one-out correct recognition rate by a weighted 1-NN method, |X| is the size of X, and U is the number of evaluations. (a) |X| vs. J(X); (b) GA and PARA; (c) U vs. J(X)/(1 - 0.85) - |X|/(60 - 1); (d) GA vs. SFFS and SBFS.
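The following Python sketch (our own simplification, not the authors' code) illustrates the floating idea of SFFS: after each inclusion, features are conditionally excluded as long as that improves on the best subset of the same size found so far. J is assumed to score a list of feature indices; SBFS is the mirror image, starting from the full set:

```python
def sffs(n, m_target, J):
    """Sequential forward floating selection; returns the best subset found per size."""
    X, best = [], {}                      # current subset, best subset per size

    def record(S):
        k = len(S)
        if k not in best or J(S) > J(best[k]):
            best[k] = list(S)

    while len(X) < m_target:
        # inclusion: add the most significant feature
        f = max((g for g in range(n) if g not in X), key=lambda g: J(X + [g]))
        X = X + [f]
        record(X)
        # conditional exclusion: drop the least significant feature while it helps
        while len(X) > 2:
            w = max(X, key=lambda g: J([h for h in X if h != g]))
            reduced = [h for h in X if h != w]
            if len(reduced) in best and J(reduced) <= J(best[len(reduced)]):
                break                     # no improvement over the best subset of that size
            X = reduced
            record(X)
    return best
```

Because the backtracking step only fires on a strict improvement, the search cannot cycle; it traces a sequence of subsets in which adjacent subsets differ by one feature, which is exactly the limitation discussed in Section 4.4.4.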
4.4.3. GA and PARA

We first describe the sensitivity of GA with respect to some of its parameters.

(i) The values of p_c and p_m are crucial. We tested three sets of these parameters, (0.6, 0.4), (0.8, 0.1) and (0.8, 0.02), mainly the first two. In our experiments, the pair (0.8, 0.1) worked best across all experiments, but this seems to depend on the problem. We recommend using several sets of parameters, and not using too small a value of p_m (p_m > 0.02), to avoid being trapped in local maxima.

Fig. 6. Result of Mammogram (small) data. J_a = 0.97 and m_a = 13 for a = 1%. Here X is a selected feature subset, J(X) is the leave-one-out correct recognition rate by a weighted 5-NN method, |X| is the size of X, and U is the number of evaluations. (a) |X| vs. J(X); (b) GA and PARA; (c) U vs. J(X)/(1 - 0.9) - |X|/(65 - 1); (d) GA vs. SFFS and SBFS.

(ii) We tested two kinds of initial populations, (P1) and (P2). The difference is not large. When (P1) is used, GA showed a tendency to find larger subsets as
compared with when (P2) is used. We recommend a large value of p_m (say, p_m >= 0.1) if (P1) is used.

(iii) Since GA is a randomized algorithm, it can produce different solutions in different trials with the same parameter set and the same initial population. Our results show that this variability in solutions is very small: our GA produced almost the same solutions over different trials (see Figs. 4b, 5b and 7a).

Fig. 7. Result graphs of SAR data. Here X is a selected feature subset, J(X) is the leave-one-out correct recognition rate by a weighted 5-NN method, |X| is the size of X, and U is the number of evaluations. (a) |X| vs. J(X); (b) J(X)/(0.9 - 0.85) - |X|/(10 - 1).

Fig. 8. Result graphs of Vehicle data. J_a = 0.821 and m_a = 9 for a = 5%. Here X is a selected feature subset, J(X) is the 9-fold cross-validation correct recognition rate with a linear classifier, |X| is the size of X, and U is the number of evaluations. (a) |X| vs. J(X); (b) U vs. J(X)/(0.9 - 0.8) - |X|/(18 - 1).
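Before turning to performance, the GA used here can be summarized in a short sketch. This is our reading under stated assumptions: one-point crossover, per-bit mutation and simple truncation selection (the selection scheme is not fully specified in the text); the population size of 2n and the 50-generation limit follow Table 3 and Section 4.4.4:

```python
import random

def ga_select(n, J, generations=50, pc=0.8, pm=0.1, seed=0):
    """GA feature selection sketch: a chromosome is an n-bit mask over the feature set."""
    rng = random.Random(seed)
    pop_size = 2 * n                               # population size of 2n

    def fitness(c):
        subset = [i for i, b in enumerate(c) if b]
        return J(subset) if subset else float("-inf")

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        best = max(best, parents[0], key=fitness)
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = a[:]
            if rng.random() < pc:                  # one-point crossover with probability pc
                cut = rng.randrange(1, n)
                child = a[:cut] + b[cut:]
            for i in range(n):                     # bit-flip mutation with probability pm per bit
                if rng.random() < pm:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = children
    best = max([best] + pop, key=fitness)
    return [i for i, bit in enumerate(best) if bit]
```

Note that the total number of criterion evaluations is pop_size x generations = 100n, which is the source of the linear-in-n evaluation counts quoted in Section 4.4.4.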
Next, let us discuss the performance of GA and PARA.

(i) GAs with Objective Types A and B work well for approximately monotonic problems (Figs. 6 and 8-10).
Fig. 9. Result graphs of Mushroom (small) data. J_a = 0.989 and m_a = 3 for a = 1%. Here X is a selected feature subset, J(X) is the leave-one-out correct recognition rate with 1-NN, |X| is the size of X, and U is the number of evaluations. (a) |X| vs. J(X); (b) U vs. J(X)/(1 - 0.9) - |X|/(29 - 1).
Fig. 10. Result graphs of Sonar (small) data. J_a = 0.9 and m_a = 32 for a = 1%. Here X is a selected feature subset, J(X) is the leave-one-out correct recognition rate with 1-NN, |X| is the size of X, and U is the number of evaluations. (a) |X| vs. J(X); (b) U vs. J(X)/(1 - 0.85) - |X|/(60 - 1).
GAs with Objective Type B are superior in finding smaller subsets with acceptable discriminability.

(ii) For large nonmonotonic problems, GA with Objective Type C works well in finding subsets that are better in the criterion, provided GA is given an appropriate set of parameters (Figs. 4 and 5). Some algorithms other than GA succeeded in finding smaller subsets with comparable criterion values, but this seems to be because GA was adjusted to find the subset of best discriminability.
Table 5
Comments to aid choice of algorithm

SFS, SBS: are fast, but low in performance. They are useful for the preliminary feature selection (see Section 3).

GSFS(g), GSBS(g): are very time-consuming. Their performance is a little better than SFS and SBS. They are effective when the desired number of features is very small or very close to the number of initial features.

PTA(l, r): is effective when we want to obtain better solutions than SFS and SBS in a moderate time.

GPTA(l, r): shows almost the same high performance as SFFS and SBFS, but is much more time-consuming than they are.

SFFS, SBFS: are very effective for Objective Type A. Their computation time is admissible for small-scale and medium-scale problems. For Objective Type C in large-scale problems, they are somewhat inferior in performance to GA and consume much more time than GA.

BAB+, BAB++: are very effective for Objective Type A in small-scale and medium-scale monotonic problems, but often require a lot of time.

BAB+(s), BAB++(s): are faster than the optimal BABs but lower in performance. We can use these algorithms before the optimal BABs, increasing the value of s as 0, 1, 2, ...

RBAB, RBABM: are very effective for Objective Type B in monotonic or approximately monotonic problems. They can be used for small-scale and medium-scale problems, but sometimes require an excessive time. It may be useful to terminate the algorithms after a given number of evaluations.

GA: is very useful for Objective Type C in large-scale problems because of its low computation time. Almost all large-scale problems in the real world become nonmonotonic due to garbage features; for such problems, GA would be appropriate. GA is also available for Objective Types A and B, but takes much more time than the other selection algorithms in small-scale and medium-scale problems.

PARA: is inferior in performance to GA and takes much more time than GA for Objective Type C. With a small population size, it can be used for Objective C in large-scale problems. For Objectives A and B, it has a possibility of finding better solutions than the other algorithms.
From the successes of GAs with Objective Types A and B, we believe that once the optimization function is modified to account for both class discriminability and feature dimensionality (for example, as in Eq. (1)), GA will be satisfactory.

(iii) GA takes much more time than the others in small-scale and medium-scale problems (Figs. 2, 6, 7 and 10), in spite of its time complexity of Theta(n). This is because GA needed many iterations (generations) to converge and sometimes reached the limit of 50 generations. The number of generations heavily affected the execution time in small-scale and medium-scale problems. On the contrary, in large-scale problems this fixed budget of generations bounds the execution time and makes GA relatively more efficient (Fig. 4c and Fig. 5c).

(iv) PARA worked well for monotonic and approximately monotonic problems (Figs. 2, 6 and 10), but not for
nonmonotonic problems. This is because PARA does not have a mechanism for escaping from local maxima such as GA has.

In conclusion, GA is suitable for large-scale problems because of its efficiency and effectiveness. In addition, GA is useful for finding better answers for Objective Types A and B even in small-scale and medium-scale problems, if the computation time is not an important consideration.

4.4.4. SFFS and SBFS vs. GA

Let us compare SFFS and SBFS with GA. There may be a problem in the termination condition of SFFS and SBFS. We terminated them when the number of features reached a given size m for the first time. We learned after our experiments that m + dm is recommended in Ref. [4] to improve the performance. SFFS and SBFS try to find solutions along a vertical line (at a fixed m), and GA with Objective Type B tries to find solutions over a horizontal
Table 6
Recommended algorithms for feature selection (*: rarely occurs)

Objective Type A
  Small (n < 20):        Mono.: BAB++;  Approx. mono.: SFFS, SBFS;  Non mono.: *
  Medium (20 <= n < 50): Mono.: BAB++, BAB++(s);  Approx. mono.: SFFS, SBFS, GA;  Non mono.: *
  Large (50 <= n <= 100): Mono. or approx. mono.: *;  Non mono.: *
  Very large (n > 100):  *

Objective Type B
  Small (n < 20):        Mono.: RBAB, RBABM;  Approx. mono.: RBAB, RBABM;  Non mono.: *
  Medium (20 <= n < 50): Mono.: RBAB w. termination, RBABM w. termination, GA;  Approx. mono.: RBAB w. termination, RBABM w. termination, GA;  Non mono.: *
  Large (50 <= n <= 100): Mono. or approx. mono.: *;  Non mono.: *
  Very large (n > 100):  *

Objective Type C
  Small (n < 20):        Mono. or approx. mono.: GA, SFFS, SBFS;  Non mono.: GA, SFFS, SBFS
  Medium (20 <= n < 50): Mono. or approx. mono.: SFFS, SBFS, GA;  Non mono.: GA
  Large (50 <= n <= 100): Mono. or approx. mono.: *;  Non mono.: GA
  Very large (n > 100):  GA
line (over a fixed criterion value). Therefore, in monotonic or approximately monotonic problems, it is still difficult to compare algorithms with different objective types in a completely common way. However, we observed that GA sometimes found better solutions that SFFS and SBFS could not find. This is confirmed by comparing the solutions of GA with the criterion curves of SFFS and SBFS (Figs. 4d, 5d and 6d). In the mammogram (large) data, nine of twelve solutions of GA are better than or equal to those of SFFS and SBFS, and the remaining three are worse (Fig. 4). Two of the three worse solutions correspond to the same pair (p_c, p_m) = (0.6, 0.4) (Fig. 4d). In the sonar (large) data, five of twelve solutions of GA are better than or equal to those of SFFS and SBFS, and the remaining seven are worse (Fig. 5). Six of the seven worse solutions correspond to the same pair (p_c, p_m) = (0.6, 0.4) (Fig. 5d). These results suggest that several distinct parameter sets of GA should be examined at the same time. The time complexity of GA is Theta(n), while that of SFFS and SBFS is at least Theta(n^2). This difference increases as n increases. Then, even if we take into consideration several runs of GA with several parameter sets, GA would be more effective than SFFS and SBFS for larger problems. Even when a particular GA takes much more time than SFFS and SBFS, the GA has a high possibility of finding better solutions than those two algorithms. This is because SFFS and SBFS have no mechanism for jumping from one subset to another, very
different subset: they trace a sequence of subsets in which adjacent subsets differ by only one feature. In addition, in the medium-scale approximately monotonic problems, GAs with Objective Types A and B found several solutions that could not be found by SFFS and SBFS in the mammogram (small) and sonar (small) data (for example, Fig. 6d).

In earlier works [2,5], GA was compared with SFFS and SBFS. Ferri et al. [2] concluded that GA and SFFS are comparable in performance, but that as the dimensionality increases the result of GA becomes worse than that of SFFS. Jain and Zongker [5] compared GA with SFFS on Kittler's data and reported a tendency toward premature convergence when GA is used. In contrast to these reports, our experiments show that in small-scale and medium-scale problems GA is better than SFFS and SBFS if GA is given extended time for search. As the dimensionality increases, GA becomes faster than SFFS and SBFS. Below we suggest several possible explanations of the difference between our observations and those of Refs. [2,5].

The first possible explanation is a difference in the way of determining the search time. Ferri et al. stopped GA when it reached the same number of evaluations as SFFS, but we stopped GA after 50 generations. Our search time determination produced a GA faster than SFFS and SBFS, while the subsets selected by GA were
comparable in performance with those of SFFS and SBFS.

A second possible explanation is a difference in the values of the parameters. There are many possible differences between Ferri et al.'s GAs and ours: the population size, the initial population, the crossover rate and the mutation rate. Ferri et al. did not reveal the values of their parameters in their paper [2]. We took the population size to be twice the number of features, so as to make GA adapt to an increase of the dimensionality. This adaptive population size may have brought better results than their GAs. The crossover and mutation rates also strongly affect the results, as described in the previous section.

A third possible explanation is a difference in the objective of selection. Ferri et al. used Objective Type B for their GA and the penalty function of Eq. (1) with a monotonic criterion. We used only Objective Type C for GA on large-scale problems. For Objective Type C, GA seems to be more effective than SFFS and SBFS by virtue of its ability to examine a wider range of candidate solutions. For Objective Type A, SFFS or SBFS could yield a good feature subset stably. If both SFFS and SBFS are used simultaneously, the result should be better.

Jain and Zongker [5] compared GA and SFFS on Kittler's data. We also compared GA and SFFS on Kittler's data. In our experiments, half of the results were better than or equal to those of SFFS and SBFS (Fig. 2). Almost all the worse results were obtained when the mutation rate p_m was 0.02, which is the same situation as that of Jain and Zongker. This suggests the advisability of using a higher mutation rate; our suggestion is p_m = 0.1.

In summary, if GA is applied to the same problem two or three times with a few different sets of parameters, GA can be better than SFFS and SBFS on the following two points: (1) GA is controllable in execution time; indeed, we can terminate the generations whenever we want; and (2) the result of GA can be improved by repeating trials and by varying the values of the parameters. With regard to item (1), GA seems preferable for all large-scale problems in which n >= 100. For example, to get the criterion curve of SFS we need 5050 evaluations for n = 100 and 45,150 evaluations for n = 300. If we estimate the complexity of SFFS as Theta(n^2.4), we need approximately 63,000 evaluations for n = 100 and 880,000 evaluations for n = 300. In our setting of GA, we need only 10,000 evaluations for n = 100 (a population of 2n = 200 chromosomes over 50 generations) and 30,000 evaluations for n = 300. The second item gives us a high possibility of finding better answers at the expense of more time. In fact, in our two large-scale nonmonotonic problems, GA found the best feature subset in the criterion, and found feature subsets better than or equal to those of SFFS and SBFS in more than half of the settings and trials. Thus, in more realistic situations, where a time-consuming estimate of the correct recognition rate of a classifier is used and n > 100,
GA becomes the only practical way to get reasonable feature subsets.

5. Conclusions

Our main conclusions are as follows:

(i) Our methodology permits comparisons of algorithms having diverse objectives. This methodology is based on the use of a criterion curve that provides a guideline for combinations of the number of features and the power of discrimination among the classes. This criterion curve is obtained by applying the simplest sequential search methods over the entire range of the number of features (see Section 3).

(ii) Preliminary feature selection using algorithms with a low time complexity is effective for capturing a given problem (i.e. for estimating the principal design parameters) and for determining some parameters needed by the algorithms in a subsequent refined feature selection (see Section 3).

(iii) SFFS and SBFS give better solutions than the other sequential search algorithms in a reasonable time for small-scale and medium-scale problems. Both algorithms should be carried out at the same time (see Sections 4.4.2 and 4.4.4).

(iv) GA is suitable for large-scale problems and has a high possibility of finding better solutions that cannot be found by the other selection algorithms. A few runs with different sets of parameters are recommended. GA is slower, but more effective, than SFFS and SBFS in small-scale and medium-scale problems (see Sections 4.4.2-4.4.4).

(v) BAB+ and BAB++ can work even in medium-scale problems (see Section 4.4.1).

(vi) RBAB and RBABM are effective for Objective Type B in small-scale and medium-scale problems, but they consume an excessive time in medium-scale and large-scale problems. Termination after a given number of evaluations is recommended for medium-scale and large-scale problems (see Section 4.4.1).

Acknowledgements

The authors thank the UCI repository of machine learning databases, which provided many kinds of databases, and the Turing Institute, Glasgow, Scotland, for the vehicle dataset.

References

[1] J. Kittler, Feature set search algorithms, in: C.H. Chen (Ed.), Pattern Recognition and Signal Processing, Sijthoff and Noordhoff, Alphen aan den Rijn, Netherlands, 1978, pp. 41-60.
[2] F.J. Ferri, P. Pudil, M. Hatef, J. Kittler, Comparative study of techniques for large-scale feature selection, in: E.S. Gelsema, L.N. Kanal (Eds.), Pattern Recognition in Practice IV, Elsevier Science B.V., Amsterdam, 1994, pp. 403-413.
[3] P. Pudil, J. Novovicova, J. Kittler, Floating search methods in feature selection, Pattern Recognition Lett. 15 (1994) 1119-1125.
[4] P. Pudil, F.J. Ferri, J. Novovicova, J. Kittler, Floating search methods for feature selection with nonmonotonic criterion functions, 12th International Conference on Pattern Recognition, 1994, pp. 279-283.
[5] A. Jain, D. Zongker, Feature selection: evaluation, application, and small sample performance, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 153-157.
[6] P.M. Narendra, K. Fukunaga, A branch and bound algorithm for feature subset selection, IEEE Trans. Computers 26 (1977) 917-922.
[7] B. Yu, B. Yuan, A more efficient branch and bound algorithm for feature selection, Pattern Recognition 26 (6) (1993) 883-889.
[8] I. Foroutan, J. Sklansky, Feature selection for automatic classification of non-Gaussian data, IEEE Trans. Systems Man Cybernet. 17 (1987) 187-198.
[9] M.R. Vriesenga, Genetic selection and neural modeling for designing pattern classifiers, Doctoral thesis, University of California, Irvine, 1995.
[10] W. Siedlecki, J. Sklansky, A note on genetic algorithms for large-scale feature selection, Pattern Recognition Lett. 10 (1989) 335-347.
[11] J. Sklansky, W. Siedlecki, Large-scale feature selection, in: L.F. Pau, P.S. Chen, P.S.P. Wang (Eds.), Handbook of Pattern Recognition and Computer Vision, World Scientific, Singapore, 1993, pp. 61-123 (Chapter 1.3).
[12] D. Zongker, A. Jain, Algorithms for feature selection: an evaluation, 13th International Conference on Pattern Recognition 2 (1996) 18-22.
[13] M. Kudo, M. Shimbo, Feature selection based on the structural indices of categories, Pattern Recognition 26 (1993) 891-901.
[14] H.J. Holz, M.H. Loew, Relative feature importance: a classifier-independent approach to feature selection, in: E.S. Gelsema, L.N. Kanal (Eds.), Pattern Recognition in Practice IV, Elsevier, Amsterdam, 1994, pp. 473-487.
[15] M. Kudo, J. Sklansky, Classifier-independent feature selection for two-stage feature selection, in: A. Amin, D. Dori, P. Pudil, H. Freeman (Eds.), Advances in Pattern Recognition 1998, Lecture Notes in Computer Science, vol. 1451, Springer, Berlin, 1998, pp. 548-554.
[16] P.M. Murphy, D.W. Aha, UCI Repository of Machine Learning Databases [machine-readable data repository], University of California, Irvine, Department of Information and Computer Science, 1996.
About the Author - MINEICHI KUDO received his Dr. Eng. degree in Information Engineering from Hokkaido University in 1988. Since 1994, he has been Associate Professor of Information Engineering at Hokkaido University, Sapporo, Japan. His current research interests include pattern recognition, image processing and computational learning theory.

About the Author - JACK SKLANSKY received the B.E.E. degree from The City College of New York in 1950, the M.S.E.E. degree from Purdue University in 1952 and the Eng.Sc.D. degree from Columbia University in 1955. At the University of California, Irvine, he is Research Professor of Electrical and Computer Engineering and of Radiological Sciences. At the King/Drew Medical Center, Los Angeles, he is Professor of Radiology. His research subjects include automatic pattern recognition, machine vision, imaging architectures, medical image modeling, and computer-aided medical diagnosis. He was awarded the grade of Fellow by the Institute of Electrical and Electronics Engineers for contributions to digital pattern classification and medical applications. He was awarded the grade of Fellow by the International Association for Pattern Recognition for contributions to pattern recognition, machine vision, and medical imaging, and for service to the IAPR. He received the 1977 Annual Award of the Pattern Recognition Society. He was a member of the United States delegation to the Board of Governors of the International Association for Pattern Recognition. He was Chairman of the King Sun Fu Memorial Award Committee. He has published over 150 papers and two books in his fields of interest. He is on the editorial boards of Pattern Recognition and Machine Vision and Applications. Dr. Sklansky is a registered professional engineer in the State of California.
Pattern Recognition 33 (2000) 43-52
Rotation-invariant texture classification using feature distributions

M. Pietikäinen*, T. Ojala, Z. Xu

Machine Vision and Media Processing Group, Department of Electrical Engineering, Infotech Oulu, University of Oulu, P.O. Box 4500, FIN-90401 Oulu, Finland

Received 25 July 1998; received in revised form 19 January 1999; accepted 19 January 1999
Abstract

A distribution-based classification approach and a set of recently developed texture measures are applied to rotation-invariant texture classification. The performance is compared to that obtained with the well-known circular-symmetric autoregressive random field (CSAR) model approach. A difficult classification problem of 15 different Brodatz textures and seven rotation angles is used in experiments. The results show much better performance for our approach than for the CSAR features. A detailed analysis of the confusion matrices and the rotation angles of misclassified samples produces several interesting observations about the classification problem and the features used in this study. 1999 Published by Elsevier Science Ltd. All rights reserved.

Keywords: Texture analysis; Classification; Feature distribution; Rotation invariant; Performance evaluation
1. Introduction

Texture analysis is important in many applications of computer image analysis for classification, detection, or segmentation of images based on local spatial variations of intensity or color. Important applications include industrial and biomedical surface inspection, for example for defects and disease; ground classification and segmentation of satellite or aerial imagery; segmentation of textured regions in document analysis; and content-based access to image databases.

* Corresponding author. Tel.: +358-8-553-2782; fax: +358-8-553-2612.
E-mail address: [email protected] (M. Pietikäinen).

There are many applications of texture analysis in which rotation invariance is important, but a problem is that many of the existing texture features are not invariant with respect to rotations. Some invariance for features derived from co-occurrence matrices or difference histograms can be obtained, for example, by simply averaging the matrices (histograms) or features computed for different angles, e.g. for 0°, 45°, 90° and 135°. Other early approaches proposed for rotation-invariant texture classification include methods based on polarograms [1], generalized co-occurrence matrices [2] and texture anisotropy [3]. Kashyap and Khotanzad developed a method based on the circular-symmetric autoregressive random field (CSAR) model for rotation-invariant texture classification [4]. Mao and Jain proposed a multivariate rotation-invariant simultaneous autoregressive model (RISAR) that is based on the CSAR model, and extended it to a multiresolution SAR model, MR-RISAR [5]. A method for classification of rotated and scaled textures using Gaussian Markov random field models was introduced by Cohen et al. [6]. Approaches based on Gabor filtering have been proposed by, among others, Leung and Peterson [7], Porat and Zeevi [8], and Haley and Manjunath [9]. A steerable oriented pyramid was used to extract rotation-invariant features by Greenspan et al. [10], and a covariance-based representation transforming the neighborhood about each pixel into a set of invariant descriptors was proposed by Madiraju and Liu [11]. You and Cohen extended Laws' masks for
rotation-invariant texture characterization in their "tuned" mask scheme [12].

Recently, we introduced new measures for texture classification based on center-symmetric auto-correlation and local binary patterns, using Kullback discrimination of sample and prototype distributions, and conducted an extensive comparative study of texture measures with classification based on feature distributions [13,14]. Using a standard set of test images, we showed that very good texture discrimination can be obtained with distributions of simple texture measures, like absolute gray-level differences, local binary patterns or center-symmetric auto-correlation. The performance is usually further improved by using two-dimensional distributions of joint pairs of complementary features [14]. In experiments involving various applications, we have obtained very good results with this distribution-based classification approach, see e.g. Ref. [15]. Recently, we also extended our approach to unsupervised texture segmentation with excellent results [16]. An overview of our recent progress is presented in [17].

In our earlier experiments, the problems related to rotation invariance were not considered. This paper investigates the efficiency of distribution-based classification, and of our feature set, in rotation-invariant texture classification. We are especially interested in the performance for the relatively small windows (64x64 and 32x32 pixels) required by many applications, whereas many of the existing approaches have been tested with larger windows. Texture measures based on center-symmetric autocorrelation, gray-level differences and a rotation-invariant version of the local binary pattern operator (LBPROT) are used in the experiments. A simple method based on bilinear gray-level interpolation is used to improve rotation invariance when extracting texture features in a discrete 3x3 neighborhood. The performance of LBPROT, recently proposed by Ojala [18], is now experimentally evaluated for the first time. The use of joint pairs of features to increase the classification accuracy is also studied. The results for distribution-based classification are compared to those obtained with the well-known CSAR features.
2. Texture measures

2.1. Measures based on center-symmetric auto-correlation

In a recent study by Harwood et al. [13], a set of related measures was introduced, including two local center-symmetric auto-correlation measures, with linear (SAC) and rank-order (SRAC) versions, together with a related covariance measure (SCOV). All of these are rotation-invariant robust measures and, apart from SCOV, they are locally gray-scale invariant. These measures are abstract measures of texture pattern and gray scale, providing highly discriminating information about the amount of local texture.

Fig. 1. 3x3 neighborhood with four center-symmetric pairs of pixels.

A mathematical description of these measures, computed for center-symmetric pairs of pixels in a 3x3 neighborhood (see Fig. 1), is presented in Eqs. (1)-(4); \mu denotes the local mean and \sigma^2 the local variance. SCOV is a measure of the pattern correlation as well as of the local pattern contrast. Since it is not 'normalized' with respect to local gray-scale variation, it provides more texture information than the normalized auto-correlation measures SAC and SRAC. SAC is an auto-correlation measure, a 'normalized', gray-scale invariant version of the texture covariance measure SCOV. SAC is invariant under linear gray-scale shifts, such as correction by mean and standard deviation. It should also be noted that the values of SAC are bounded between -1 and 1:

\mathrm{SCOV} = \frac{1}{4} \sum_{i=1}^{4} (x_i - \mu)(x_i' - \mu),   (1)

\mathrm{SAC} = \frac{\mathrm{SCOV}}{\sigma^2},   (2)

where (x_i, x_i') are the four center-symmetric pairs of pixels.
Texture statistics based directly on the gray values of an image are sensitive to noise and to monotonic shifts in the gray scale. With SRAC, the local rank order of the gray values is used instead of the gray values themselves. Hence, SRAC is invariant under any monotonic gray-scale transformation, including correction by mean and standard deviation and histogram equalization. The amount of auto-correlation in the ranked neighborhood is given by Spearman's rank correlation. It is defined for the n x n neighborhood with four center-symmetric pairs of pixels as Eq. (3), where m is n^2, and each t_j is the number of ties at rank r_j in the ranked neighborhood. The values of SRAC are bounded between -1 and 1:

\mathrm{SRAC} = 1 - \frac{12\left(\sum_{i}(r_i - r_i')^2 + T_x\right)}{m^3 - m},   (3)
T_x = \frac{1}{12} \sum_{j=1}^{J} (t_j^3 - t_j).   (4)

The symmetric variance ratio (the ratio between the within-pair and between-pair variances), SVR, is a statistic equivalent to the auto-correlation measure SAC. SVR is also invariant under linear gray-scale shifts:

\mathrm{SVR} = \frac{\mathrm{WVAR}}{\mathrm{BVAR}}.   (5)
Additionally, the discrimination information provided by three local variance measures can be used. VAR (= BVAR + WVAR) and the two elements contributing to it are all measures of local gray-scale variation, and are very sensitive to noise and other local gray-scale transformations. The between-pair variance, BVAR, is mostly a measure of residual texture variance, and usually it is a very small part of VAR; the majority of the local variance is generally due to the within-pair variance, WVAR. In our experiments, the classification results for the VAR measure are reported:

\mathrm{VAR} = \frac{1}{8} \sum_{i} (x_i^2 + x_i'^2) - \mu^2,   (6)

\mathrm{BVAR} = \frac{1}{16} \sum_{i} (x_i + x_i')^2 - \mu^2,   (7)

\mathrm{WVAR} = \frac{1}{16} \sum_{i} (x_i - x_i')^2.   (8)
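A minimal NumPy sketch of these measures for a single 3x3 neighborhood follows. This is our reading of Eqs. (1)-(8); in particular, taking \mu and \sigma^2 to be the mean and variance of the eight neighbor pixels is our assumption, and SRAC, which requires ranking and tie handling, is omitted:

```python
import numpy as np

def center_symmetric_measures(block):
    """block: 3x3 array; (x_i, x_i') are the four center-symmetric pairs of Fig. 1."""
    b = np.asarray(block, dtype=float)
    x  = np.array([b[0, 0], b[0, 1], b[0, 2], b[1, 2]])   # one pixel of each pair
    xp = np.array([b[2, 2], b[2, 1], b[2, 0], b[1, 0]])   # its center-symmetric mate
    mu = (x.sum() + xp.sum()) / 8.0
    scov = np.mean((x - mu) * (xp - mu))                   # Eq. (1)
    var  = np.sum(x**2 + xp**2) / 8.0 - mu**2              # Eq. (6)
    bvar = np.sum((x + xp)**2) / 16.0 - mu**2              # Eq. (7)
    wvar = np.sum((x - xp)**2) / 16.0                      # Eq. (8)
    sac  = scov / var if var > 0 else 0.0                  # Eq. (2)
    svr  = wvar / bvar if bvar > 0 else float("inf")       # Eq. (5)
    return {"SCOV": scov, "SAC": sac, "VAR": var,
            "BVAR": bvar, "WVAR": wvar, "SVR": svr}
```

Note that VAR = BVAR + WVAR holds by construction, as stated in the text.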
2.2. Rotation-invariant local binary pattern

In the local binary pattern (LBP) texture operator we introduced in Ref. [14], the original 3x3 neighborhood (Fig. 2a) is thresholded at the value of the center pixel. The values of the pixels in the thresholded neighborhood (Fig. 2b) are multiplied by the binomial weights given to the corresponding pixels (Fig. 2c). The result for this example is shown in Fig. 2d. Finally, the values of the eight pixels are summed to obtain the LBP number (169) of this texture unit. By definition, LBP is invariant to any monotonic gray-scale transformation. The texture contents of an image region are characterized by the distribution of LBP.

LBP is not rotation invariant, which is undesirable in certain applications. It is possible to define rotation-invariant versions of LBP; one solution is illustrated in Fig. 3 [18]. The binary values of the thresholded neighborhood (Fig. 3a) are mapped into an 8-bit word in clockwise or counterclockwise order (Fig. 3b). An arbitrary number of binary shifts is then made (Fig. 3c), until the word matches one of the 36 different patterns (Fig. 3d) of '0' and '1' that an 8-bit word can form under rotation. The index of the matching pattern is used as the feature value, describing the rotation-invariant LBP of this particular neighborhood.
Fig. 2. Computation of the local binary pattern (LBP).

Fig. 3. Computation of LBPROT, the rotation-invariant version of LBP.
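The LBP and LBPROT computations can likewise be sketched in a few lines of Python. This is a minimal version of ours: we use the smallest 8-bit code over all rotations as the canonical representative of a rotation class, which is equivalent to indexing the 36 patterns of Fig. 3d, and the >= thresholding convention is our assumption:

```python
import numpy as np

def lbp_codes(block):
    """block: 3x3 array. Returns (LBP code, rotation-invariant LBPROT code)."""
    b = np.asarray(block)
    # the 8 neighbors in clockwise order, starting from the top-left pixel
    ring = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if b[p] >= b[1, 1] else 0 for p in ring]   # threshold at the center (Fig. 2b)
    lbp = sum(bit << i for i, bit in enumerate(bits))    # binomial weights 1, 2, ..., 128
    # LBPROT: circularly shift the 8-bit word and keep the minimal code
    rotations = [bits[r:] + bits[:r] for r in range(8)]
    lbprot = min(sum(bit << i for i, bit in enumerate(w)) for w in rotations)
    return lbp, lbprot
```

Scanning an image region with this operator and accumulating the codes yields the histograms used for classification in Section 3.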
2.3. Gray-level difference method

The method based on histograms of absolute differences between pairs of gray levels or of average gray levels has performed very well in some comparative studies and applications, see e.g. Refs. [14,15]. For any given displacement d = (dx, dy), where dx and dy are integers, let f_d(x, y) = |f(x, y) - f(x + dx, y + dy)|, and let P_d be the probability density function of f_d. If the image has m gray levels, P_d has the form of an m-dimensional vector whose ith component is the probability that f_d(x, y) has the value i. P_d can easily be computed by counting the number of times each value of f_d(x, y) occurs. For a small d, the difference histogram peaks near zero, while for a larger d it is more spread out.

The rotation-invariant feature DIFF4 is computed by accumulating, in the same one-dimensional histogram, the absolute gray-level differences in all four principal directions at the chosen displacement D. If D = 1, for example, the displacements d = (0, 1), (1, 1), (1, 0) and (1, -1) are considered.
3. Classification based on feature distributions

Most approaches to texture classification quantify texture measures by single values (means, variances, etc.), which are then concatenated into a feature vector. In this way, much of the important information contained in the whole distributions of feature values is lost. In this paper, a log-likelihood pseudo-metric, the G statistic, is used for comparing feature distributions in the classification process. The value of the G statistic indicates the probability that two sample distributions come from the same population: the higher the value, the lower the probability that the two samples are from the same population. For a goodness-of-fit test, the G statistic is

G = 2 \sum_{i=1}^{n} s_i \log\frac{s_i}{m_i},   (9)

where s and m are the sample and model distributions, n is the number of bins, and s_i, m_i are the respective sample and model probabilities at bin i.
In the experiments, a texture class is represented by a number of model samples. When a particular test sample is being classified, the model samples are ordered according to the probability that they come from the same population as the test sample. This probability is measured by a two-way test of interaction:

G = 2\left[\sum_{s,m}\sum_{i=1}^{n} f_i \log f_i - \sum_{i=1}^{n}\Big(\sum_{s,m} f_i\Big)\log\Big(\sum_{s,m} f_i\Big) - \sum_{s,m}\Big(\sum_{i=1}^{n} f_i\Big)\log\Big(\sum_{i=1}^{n} f_i\Big) + \Big(\sum_{s,m}\sum_{i=1}^{n} f_i\Big)\log\Big(\sum_{s,m}\sum_{i=1}^{n} f_i\Big)\right],   (10)

where f_i is the frequency at bin i. For a detailed derivation of the formula, see Ref. [19]. After the model samples have been ordered, the test sample is classified using the
k-nearest-neighbor principle, i.e. the test sample is assigned to the class of the majority among its k nearest models. In our experiments, a value of 3 was used for k.

The feature distribution for each sample is obtained by scanning the texture image with the local texture operator. The distributions of local statistics are divided into histograms having a fixed number of bins; hence, the G tests for all pairings of a sample and a model have the same number of degrees of freedom. The feature space is quantized by adding together the feature distributions of every single model image into a total distribution, which is divided into 32 bins having an equal number of entries. Hence, the cut values of the histogram bins correspond to the 3.125% (100/32) percentiles of the combined data. Deriving the cut values from the total distribution, and allocating every bin the same amount of the combined data, guarantees that the highest resolution of the quantization is used where the number of entries is largest, and vice versa. It should be noted that the quantization of the feature space is required only for texture operators with a continuous-valued output. The output of discrete operators like LBP or LBPROT, where two successive values can have totally different meanings, does not require any further processing; operator outputs are simply accumulated into a histogram.

To compare distributions of complementary feature pairs, the metric G is extended in a straightforward manner to scan through the two-dimensional histograms. If quantization of the feature space is required, it is done separately for both features, using the same approach as with single features.
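The classification rule of this section can be condensed into the following sketch (our code, using the 0 log 0 = 0 convention; Eq. (10) is used to order the models, and a 3-NN vote decides the class):

```python
import math

def xlogx(v):
    return v * math.log(v) if v > 0 else 0.0

def g_statistic(s, m):
    """Two-way G statistic of Eq. (10) between sample and model frequency histograms."""
    t1 = sum(xlogx(f) for f in s) + sum(xlogx(f) for f in m)
    t2 = sum(xlogx(si + mi) for si, mi in zip(s, m))
    t3 = xlogx(sum(s)) + xlogx(sum(m))
    t4 = xlogx(sum(s) + sum(m))
    return 2.0 * (t1 - t2 - t3 + t4)

def classify(sample_hist, models, k=3):
    """models: list of (class_label, histogram) pairs; smaller G means a closer model."""
    ranked = sorted(models, key=lambda lm: g_statistic(sample_hist, lm[1]))
    votes = [label for label, _ in ranked[:k]]
    return max(set(votes), key=votes.count)
```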
4. Experimental design

4.1. Texture images

In the experiments, 15 classes of Brodatz [20] textures were used: pressed cork (D4), grass lawn (D9), herringbone weave (D16), woolen cloth (D19), french canvas (D21), calf leather (D24), beach sand (D29), pressed cork (D32), straw cloth (D53), handmade paper (D57), wood grain (D68), cotton canvas (D77), raffia (D84), pigskin (D92) and calf fur (D93). The original 600x450 images were globally gray-scale corrected by Gaussian match [21]. First, each image was rotated by 11° around its center using bicubic interpolation, in order to obtain uniform behavior with respect to interpolation effects for both rotated and unrotated samples in the experiments. The rotation operator included in the MATLAB Image Processing Toolbox was used [22]. In further discussion, we call these processed images reference images; they served as training data in the classification experiments. Then, the rotated images used for testing the texture classifier
Fig. 4. Brodatz textures.
were generated by rotating each image counterclockwise around its center with the same bicubic interpolation method. We used the same set of six different rotation angles that was used by Kashyap and Khotanzad [4]: 30°, 60°, 90°, 120°, 150° and 200°. In other words, this is a true test of rotation-invariant texture classification, for the classifier 'sees' only instances of the reference textures, and it is tested with instances of rotated textures it has not 'seen' before.

In order to analyze the effect of the window size, samples of two different sizes were used: 64x64 and 32x32 pixels. The samples were extracted inside a 256x256 rectangle located at the center of each texture image. Fig. 4 depicts the 256x256 images of each texture, and Fig. 5 illustrates the extraction of sixteen 64x64 samples for a texture with a rotation of 30° (plus the original rotation of 11°). Hence, with the window size of 64x64 pixels, each texture class contained 16 reference samples for training the classifier and 6x16 = 96 rotated samples for testing it. Similarly, each texture class contained 64 training samples and 6x64 = 384 testing samples when the window size of 32x32 pixels was used. Even though the original texture images were globally corrected by Gaussian match, each individual sample image was also histogram equalized prior to feature extraction, to minimize the effect of possible local variations within the images.
4.2. Use of gray-scale interpolation and joint pairs of complementary features

A problem with computing rotation-invariant texture features in a local neighborhood is that the diagonal neighbors are farther from the center pixel than the horizontal and vertical neighbors. To reduce the effect of this on the classification performance, the pixel values of 'virtual' diagonal neighbors, located at the same distance from the center pixel as the horizontal and vertical neighbors, can be computed from the original pixel values by interpolation. In our experiments, we used a simple bilinear gray-scale interpolation method for this purpose; a sketch is given at the end of this subsection.

In most cases, a single texture measure cannot provide enough information about the amount and spatial structure of local texture. Better discrimination of textures should be obtained by considering the joint occurrence of two or more features. In Ref. [14], we showed that the use of joint distributions of pairs of features that provide complementary information about textures usually improves the classification accuracy. In this study, we perform experiments with various pairs of center-symmetric features and LBPROT. Our goal is not to find an optimal feature pair for this task, but to see how much improvement a joint feature pair can provide in the rotation-invariant classification problem.
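Here is a sketch of the interpolation step (our implementation; the 'virtual' diagonal neighbors sit at offsets of 1/sqrt(2) so that all eight neighbors are at unit distance from the center, and (y, x) is assumed to be an interior pixel):

```python
import numpy as np

def virtual_diagonal_neighbors(img, y, x):
    """Bilinearly interpolated values of the four diagonal neighbors of pixel (y, x),
    placed at distance 1 from the center like the horizontal/vertical neighbors."""
    d = 1.0 / np.sqrt(2.0)
    values = []
    for sy in (-1, 1):
        for sx in (-1, 1):
            fy, fx = y + sy * d, x + sx * d
            y0, x0 = int(np.floor(fy)), int(np.floor(fx))
            ty, tx = fy - y0, fx - x0
            v = ((1 - ty) * (1 - tx) * img[y0, x0] + (1 - ty) * tx * img[y0, x0 + 1]
                 + ty * (1 - tx) * img[y0 + 1, x0] + ty * tx * img[y0 + 1, x0 + 1])
            values.append(v)
    return values
```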
Fig. 5. Extraction of 64x64 samples from a rotated 600x450 image.
Table 1
Error rates (%) obtained with single features

Window size   Interpolation   DIFF4   LBPROT   SCOV   SAC    SRAC   SVR    VAR
64x64         no              28.5    38.5     27.6   37.1   41.2   36.9   24.0
64x64         yes             27.2    39.2     16.3   32.4   48.1   31.8   14.2
32x32         no              44.9    50.5     39.0   53.6   48.4   53.9   40.7
32x32         yes             39.6    47.7     35.1   50.0   59.4   50.3   36.4
5. Results and discussion

5.1. Single-feature performance

Table 1 shows the results of rotation-invariant classification with single features for window sizes of 64x64 and 32x32 pixels, without and with the interpolation of diagonal neighbors during feature extraction. In the case of 64x64 samples, the VAR and SCOV features with interpolation provide the best error rates of 14.2 and 16.3%, respectively. DIFF4 also performs reasonably well, with an error rate of 27.2%, whereas the worst results are obtained with the gray-scale invariant features (LBPROT, SVR, SAC, SRAC), indicating that information about local gray-scale contrast is useful in discriminating these Brodatz textures. We see that the use of gray-level interpolation usually improves the performance.
Understandably, the error rates for 32x32 samples are much higher, and the performance of, e.g., VAR deteriorates to 36.4%. The 32x32 samples contain only one-fourth of the pixel data of the 64x64 samples, and consequently the feature histograms possess less discriminative power.

A closer examination of the confusion matrices reveals that the center-symmetric features have the most trouble in discriminating disordered textures. For example, in the case of 64x64 samples, textures D4, D9, D24 and D32 contribute almost 70% of the misclassified samples when the feature VAR is used. Interestingly, LBPROT has no trouble separating D9 or D32, missing only three of the 192 samples belonging to these two classes. This suggests that a better result could be obtained by carrying out the classification in stages, selecting the features that best discriminate among the remaining alternatives. The local
gray-scale contrast seems to be particularly useful in separating textures D21, D57 and D77, for in these cases SCOV and VAR provide almost perfect results while their gray-scale invariant counterparts LBPROT, SVR, SAC and SRAC fail miserably. All features are able to discriminate textures D16 and D93, which both exhibit strong local orientation.

Similarly, a closer look at the rotation angles of the misclassified samples reveals several interesting details. As expected, of the six rotation angles, all features have the fewest misclassified samples at 90°. This is attributable to the pseudo-rotation-invariant nature of our features. For example, both the center-symmetric features and DIFF4 are truly rotation-invariant only for rotations that are multiples of 45° when computed in a 3x3 neighborhood, and in the chosen set of rotation angles 90° happens to be the only one of this type. A related observation, similarly attributable to the pseudo-rotation-invariant nature of the features, is that the results obtained at a particular rotation a and at its orthogonal counterpart (a + 90°) are very similar for all features. In other words, almost identical results are obtained at 30° and 120°, just as is the case with 60° and 150°. This suggests that the set of six rotation angles we adopted from Kashyap and Khotanzad [4] is suboptimal, at least when 3x3 operators are used. A more comprehensive test could be obtained by choosing a set of rotation angles in which no two angles differ by a multiple of 45°.

The rotation invariance of the features can be improved by exploiting their generic nature. The features are not restricted to a 3x3 neighborhood; they can all be generalized to scale. For example, the center-symmetric measures can be computed for suitably symmetrical digital neighborhoods of any size, such as disks or boxes of odd or even sizes. This allows a finer quantization of orientations, for example with a 5x5 box, which contains eight center-symmetric pairs of pixels.

In earlier studies, LBP, the rotation-variant ancestor of LBPROT, has proven to be very powerful in discriminating unrotated homogeneous Brodatz textures. However, in this study LBPROT provides fairly poor results with rotated Brodatz textures. By definition, LBPROT is rotation-invariant only in the digital domain, and consequently, of the six rotation angles, the interpretation of rotated binary patterns works properly only at 90°, where LBPROT clearly provides the best result of all features. At the other angles, however, rotated binary patterns are obviously not mapped properly. This is particularly apparent in the case of strongly ordered textures (e.g. D21, D53, D68 and D77), where LBPROT provides perfect results at 90° but fails completely at the other rotation angles. This suggests that the current mapping of rotated binary patterns is probably far too strict, and could be relaxed by grouping together rotated patterns within a specific Hamming distance, for example.
5.2. Results for joint pairs of features

The results obtained with joint pairs of features are presented in Tables 2 and 3. Gray-scale interpolation was used in the feature extraction, and the histograms were quantized into 8x8 bins (36x8 for pairs including LBPROT). As expected, the use of feature pairs clearly improves the classification compared with single features. We see that the feature pairs LBPROT/VAR and LBPROT/SCOV provide the best performance, with error rates of 10.1 and 10.8% for 64x64 samples and 24.1 and 24.0% for 32x32 samples, respectively. Many other feature pairs also achieve error rates close to 10% for 64x64 samples, including SCOV/VAR, SCOV/SVR, SAC/SCOV and DIFF4/SAC. All these pairs have one thing in common: they include a feature that incorporates local gray-scale contrast. This emphasizes the importance of contrast in discriminating Brodatz textures. Consequently, the pairs of gray-scale invariant features fail.

In the previous section we remarked that all features have the fewest misclassified samples at the rotation angle of 90°. This phenomenon is much stronger when joint pairs of features are used. For example, in the case of LBPROT/VAR, only 5 of the 145 misclassified samples occur at 90°, while each other rotation angle contributes at least 24 misclassified samples.

Interestingly, even though LBPROT does fairly poorly by itself, LBPROT combined with a gray-scale variant feature (SCOV or VAR) provides the best results of all pairings.
Table 2
Error rates (%) obtained with pairs of features for 64x64 samples

         LBPROT   SCOV   SAC    SRAC   SVR    VAR
DIFF4    17.6     12.8   12.6   30.6   12.9   16.3
LBPROT            10.8   29.2   38.1   29.2   10.1
SCOV                     11.6   31.7   11.5   10.6
SAC                             43.9   32.0   14.6
SRAC                                   44.2   23.4
SVR                                           14.4
Table 3
Error rates (%) obtained with pairs of features for 32x32 samples

         LBPROT   SCOV   SAC    SRAC   SVR    VAR
DIFF4    30.2     25.5   28.7   40.0   28.5   30.7
LBPROT            24.0   41.2   45.5   41.0   24.1
SCOV                     28.7   38.1   28.6   29.8
SAC                             54.5   50.9   28.4
SRAC                                   54.2   35.2
SVR                                           28.4
This indicates that the combination of just a crude pattern shape code and the pattern contrast is a useful description of rotated Brodatz textures. The shortcomings of LBPROT are still apparent, though: in the case of 64x64 samples, the strongly ordered texture D21 contributes 80 of the 145 samples misclassified by the LBPROT/VAR pair. These 80 misclassified D21 samples, which incidentally are all assigned to class D9, correspond to samples of all rotation angles other than 90°, which are classified correctly. Obviously, a significant improvement in the classification accuracy can be expected once the shapes of rotated binary patterns are described effectively.
Texture classi"cation is performed by extracting parameters a, b and f from the sample images, and feeding them to a feature vector classi"er. In other words, a texture sample is represented by three numerical CSAR features, in contrast to our approach where features are described by histograms of 32 bins. The texture classi"er is trained with the features of the reference samples, and tested with the rotated samples. Both a multivariate Gaussian (quadratic) classi"er and a 3-NN (nearest neighbor) classi"er were used. When the 3-NN classi"er was used, the features were "rst normalized into the range 0}1 by dividing each feature with its maximum value over the entire training data. 6.3. Results and discussion
6. Quantitative comparison to circular-symmetric autoregressive random 5eld (CSAR) model 6.1. CSAR features For comparison purposes we used the circular-symmetric auto-regressive random "eld (CSAR) model which was proposed for rotation invariant texture classi"cation by Kashyap and Khotanzad [4]. The general expression of the model is described as y(s)"a g y(sr)#(bl(s), P PZ,!
6.2. Classixcation procedure
(11)
where +y(s), s3), )"(0)s , s )M!1), is a set of pixel intensity values of a given M;M digitized discrete image, s are the pixel coordinates, N is the neighbor hood pixel set, r are the neighborhood pixel coordinates, l(s) is a correlated sequence with zero mean and unit variance, a and b are the coe$cients of the CSAR model, and denotes modulo M addition. The model from this equation yields two parameters, a and b. Parameter a measures a certain isotropic property of the image and b a roughness property. The rotation invariant characteristics of a and b are contributed by the choice of the interpolated circular neighborhood of image pixels. There is also a third parameter f which is a measure of directionality and is determined by "tting other SAR models over the image [4].
6.2. Classification procedure

Texture classification is performed by extracting the parameters α, β and f from the sample images and feeding them to a feature-vector classifier. In other words, a texture sample is represented by three numerical CSAR features, in contrast to our approach, where features are described by histograms of 32 bins. The texture classifier is trained with the features of the reference samples and tested with the rotated samples. Both a multivariate Gaussian (quadratic) classifier and a 3-NN (nearest-neighbor) classifier were used. When the 3-NN classifier was used, the features were first normalized into the range 0–1 by dividing each feature by its maximum value over the entire training data.

6.3. Results and discussion

Since the quadratic classifier provided slightly better results than the 3-NN classifier, the discussion is based on the results obtained with the quadratic classifier (Table 4). Even though none of the individual features is very powerful by itself (the best feature, β, provides error rates of 51.9 and 60.0% for the 64×64 and 32×32 samples, respectively), the three features combined offer reasonable texture discrimination, with error rates of 15.3 and 29.3%, respectively. However, we see that the distribution-based classification with single features (e.g. VAR and SCOV) performs about as well as the three CSAR features combined in the case of 64×64 samples, and several pairs of joint features provide better performance than the combined CSAR features for both the 64×64 and 32×32 samples. Confusion matrices reveal that the CSAR features have difficulties in separating textures D4, D24, D57 and D68. When all three features are used, these four classes contribute almost 80% of the 221 misclassified 64×64 samples. The confusion is particularly severe between classes D4 (pressed cork) and D24 (pressed calf leather). Examination of the rotation angles of the misclassified samples verifies the observation of the pseudo-rotation-invariant nature of 3×3 operators made in Section 5.1, for again by far the fewest misclassified samples occur at 90° (only 10, while each other rotation angle contributes at least 31), and again the results obtained at two orthogonal rotation angles are almost identical.
Table 4
Error rates (%) obtained with the CSAR features

Window size    α       β       f       α+β     α+f     β+f     α+β+f
64×64          56.4    51.9    68.5    24.9    34.2    27.6    15.3
32×32          67.0    60.0    72.5    44.5    48.2    37.0    29.3
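Both classifiers used above are standard; as a sketch (our own minimal implementations with hypothetical names, not the code used in the experiments), the quadratic classifier keeps one mean vector and covariance matrix per class and decides by maximum Gaussian log-likelihood, while the 3-NN preprocessing rescales each feature by its maximum over the training data:

    import numpy as np

    class QuadraticClassifier:
        # Multivariate Gaussian (quadratic) classifier.
        def fit(self, X, labels):
            self.params = {}
            for c in np.unique(labels):
                Xc = X[labels == c]
                cov = np.cov(Xc, rowvar=False)
                self.params[c] = (Xc.mean(axis=0), np.linalg.inv(cov),
                                  np.linalg.slogdet(cov)[1])
            return self

        def predict(self, X):
            def score(x, p):
                mu, icov, logdet = p
                d = x - mu
                return -0.5 * (d @ icov @ d) - 0.5 * logdet
            return np.array([max(self.params,
                                 key=lambda c: score(x, self.params[c]))
                             for x in X])

    def normalize_for_knn(train, test):
        # Scale each feature by its maximum over the training set, as done
        # before applying the 3-NN classifier.
        scale = train.max(axis=0)
        return train / scale, test / scale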
7. Conclusion

In this paper, a distribution-based classification approach and a set of texture measures based on center-symmetric autocorrelation and local binary patterns were applied to rotation-invariant texture classification. The performance of the proposed approach was compared to that of the circular-symmetric autoregressive random field (CSAR) model on a difficult classification problem involving 15 different Brodatz textures and seven rotation angles. The error rates of the best single features (SCOV, VAR) were comparable to those obtained with the three CSAR features combined, and better results were achieved with distributions of joint pairs of features. It was also shown that the rotation invariance of texture measures in a discrete 3×3 neighborhood can be improved by gray-level interpolation. The experimental results also emphasize the importance of local pattern contrast in discriminating Brodatz textures, for even though the samples were corrected against global gray-scale variations, the features measuring local gray-scale variations clearly outperformed their gray-scale invariant counterparts. Another observation, verified both with our operators and with the CSAR features, was the pseudo-rotation-invariant nature of the features, which should be taken into account when designing new operators and planning new studies on rotation-invariant texture classification. The shortcomings of LBPROT, the rotation-invariant version of the powerful LBP operator, were exposed; a significant improvement in classification accuracy can be expected once the shapes of rotated binary patterns are described effectively.
Acknowledgements

The financial support provided by the Academy of Finland and the Technology Development Center is gratefully acknowledged. The authors also wish to thank the anonymous referee for his helpful comments.
References

[1] L.S. Davis, Polarograms: a new tool for texture analysis, Pattern Recognition 13 (1981) 219–223.
[2] L.S. Davis, S.A. Johns, J.K. Aggarwal, Texture analysis using generalized co-occurrence matrices, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1979) 251–259.
[3] D. Chetverikov, Experiments in the rotation-invariant texture discrimination using anisotropy features, in: Proceedings of the 6th International Conference on Pattern Recognition, Munich, Germany, 1982, pp. 1071–1073.
[4] R.L. Kashyap, A. Khotanzad, A model-based method for rotation invariant texture classification, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1986) 472–481.
[5] J. Mao, A.K. Jain, Texture classification and segmentation using multiresolution simultaneous autoregressive models, Pattern Recognition 25 (1992) 173–188.
[6] F.S. Cohen, Z. Fan, M.A. Patel, Classification of rotated and scale textured images using Gaussian Markov random field models, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 192–202.
[7] M.M. Leung, A.M. Peterson, Scale and rotation invariant texture classification, in: Proceedings of the 26th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, 1992.
[8] M. Porat, Y. Zeevi, Localized texture processing in vision: analysis and synthesis in the Gaborian space, IEEE Trans. Biomed. Engng. 36 (1989) 115–129.
[9] G.M. Haley, B.S. Manjunath, Rotation-invariant texture classification using modified Gabor filters, in: Proceedings of the IEEE Conference on Image Processing, Austin, TX, 1994, pp. 655–659.
[10] H. Greenspan, S. Belongie, R. Goodman, P. Perona, Rotation invariant texture recognition using a steerable pyramid, in: Proceedings of the 12th International Conference on Pattern Recognition, vol. 2, Jerusalem, Israel, 1994, pp. 162–167.
[11] S.V.R. Madiraju, C.C. Liu, Rotation invariant texture classification using covariance, in: Proceedings of the IEEE Conference on Image Processing, vol. 1, Washington, DC, 1995, pp. 262–265.
[12] J. You, H.A. Cohen, Classification and segmentation of rotated and scaled textured images using texture "tuned" masks, Pattern Recognition 26 (1993) 245–258.
[13] D. Harwood, T. Ojala, M. Pietikäinen, S. Kelman, L.S. Davis, Texture classification by center-symmetric autocorrelation, using Kullback discrimination of distributions, Pattern Recognition Lett. 16 (1995) 1–10.
[14] T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on feature distributions, Pattern Recognition 29 (1996) 51–59.
[15] T. Ojala, M. Pietikäinen, J. Nisula, Determining composition of grain mixtures by texture classification based on feature distributions, Int. J. Pattern Recognition Artificial Intell. 10 (1996) 73–82.
[16] T. Ojala, M. Pietikäinen, Unsupervised texture segmentation using feature distributions, Pattern Recognition 32 (1999) 477–486.
[17] M. Pietikäinen, T. Ojala, O. Silven, Approaches to texture-based classification, segmentation and surface inspection, in: C.H. Chen, L.F. Pau, P.S.P. Wang (Eds.), Handbook of Pattern Recognition and Computer Vision, second ed., World Scientific, Singapore, 1998 (Chapter 4.2).
[18] T. Ojala, Multichannel approach to texture description with feature distributions, Technical Report CAR-TR-846, Center for Automation Research, University of Maryland, 1996.
[19] R.R. Sokal, F.J. Rohlf, Introduction to Biostatistics, second ed., W.H. Freeman, New York, 1987.
[20] P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover Publications, New York, 1966.
[21] J.M. Carstensen, Cooccurrence feature performance in texture classification, in: Proceedings of the 8th Scandinavian Conference on Image Analysis, vol. 2, Tromsø, Norway, 1993, pp. 831–838.
[22] Image Processing Toolbox for use with MATLAB, The MathWorks, Inc. (1984–1998).
About the Author: MATTI PIETIKÄINEN received his Doctor of Technology degree in Electrical Engineering from the University of Oulu, Finland, in 1982. Currently he is Professor of Information Technology, Scientific Director of Infotech Oulu, a center for information technology research, and Head of the Machine Vision and Media Processing Group at the University of Oulu. From 1980 to 1981 and from 1984 to 1985 he visited the Computer Vision Laboratory at the University of Maryland, USA. His research interests include machine vision, document analysis and their applications. He has authored about 100 papers in journals, books and conferences. He is the editor (with L.F. Pau) of the book "Machine Vision for Advanced Production", published by World Scientific in 1996. Prof. Pietikäinen is a Fellow of the International Association for Pattern Recognition (IAPR) and a Senior Member of the IEEE; he serves as a Member of the Governing Board of the IAPR and was Chairman of the IAPR Education Committee. He also serves on the program committees of several international conferences.

About the Author: ZELIN XU received the M.S. degree in Optics from the Changchun Institute of Optics and Fine Mechanics, China, in 1984. Currently he is a graduate student at the Department of Electrical Engineering of the University of Oulu, Finland. His research interests include texture analysis and visual inspection.

About the Author: TIMO OJALA received the M.S. degree in Electrical Engineering with honors from the University of Oulu, Finland, in 1992, and his Dr. Tech. degree from the same university in 1997. He is a member of the Machine Vision and Media Processing Group at the University of Oulu. From 1996 to 1997 he visited the University of Maryland Institute for Advanced Computer Studies (UMIACS). His research interests include pattern recognition, texture analysis and object-oriented software design.
Pattern Recognition 33 (2000) 53–68
Relaxation labeling in stereo image matching

Gonzalo Pajares*, Jesús Manuel de la Cruz, José Antonio López-Orozco

Dpto. Arquitectura de Computadores y Automática, Facultad de CC. Físicas, Universidad Complutense de Madrid, 28040 Madrid, Spain

Received 8 July 1998; accepted 14 December 1998
Abstract

This paper outlines a method for solving the global stereovision matching problem using edge segments as the primitives. A relaxation scheme is the technique commonly used by existing methods to solve this problem. These techniques generally impose the following competing constraints: similarity, smoothness, ordering and uniqueness, and assume a bound on the disparity range. The smoothness constraint is basic in the relaxation process. We have verified that the smoothness and ordering constraints can be violated by objects close to the cameras and that the setting of the disparity limit is a serious problem. This problem also arises when repetitive structures appear in the scene (i.e. complex images), where the existing methods produce a high number of failures. We develop our approach from a relaxation labeling method ([1] W.J. Christmas, J. Kittler, M. Petrou, Structural matching in computer vision using probabilistic relaxation, IEEE Trans. Pattern Anal. Mach. Intell. 17 (8) (1995) 749–764), which allows us to map the above constraints. The main contributions are made (1) by applying a learning strategy in the similarity constraint and (2) by introducing specific conditions to overcome the violation of the smoothness constraint and to avoid the serious problem produced by the required fixation of a disparity limit. Consequently, we improve the stereovision matching process. The better performance of the proposed method is illustrated by a comparative analysis against some recent global matching methods. 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Stereo correspondence; Probabilistic relaxation; Similarity; Smoothness; Uniqueness
1. Introduction

A major portion of the research efforts of the computer vision community has been directed towards the study of the three-dimensional (3-D) structure of objects using machine analysis of images [2–4]. Analysis of video images in stereo has emerged as an important passive method for extracting the 3-D structure of a scene. Following the Barnard and Fischler [5] terminology, we can view the problem of stereo analysis as consisting of the following steps: image acquisition, camera modeling, feature acquisition, image matching, depth determination and interpolation. The key step is that of image matching, that is, the process of identifying the corresponding points in two images that are cast by the same physical point in 3-D space. This paper is devoted solely to this

* Corresponding author. Tel.: +0034-91-3944477; fax: +0034-91-3944687. E-mail address:
[email protected] (G. Pajares)
problem. The basic principle involved in the recovery of depth using passive imaging is triangulation. In stereopsis the triangulation needs to be achieved with the help of only the existing environmental illumination. Hence, a correspondence needs to be established between features from two images that correspond to some physical feature in space. Then, provided that the positions of the centers of projection, the effective focal length, the orientation of the optical axis and the sampling interval of each camera are known, the depth can be established using triangulation [6].

1.1. Techniques in stereovision matching

A review of the state of the art in stereovision matching allows us to distinguish two sorts of techniques broadly used in this discipline: area based and feature based [3,4,7,8]. Area-based stereo techniques use correlation between brightness (intensity) patterns in the local neighborhood of a pixel in one image and brightness patterns
in the local neighborhood of the other image [9–14], where the number of possible matches is intrinsically high, while feature-based methods use sets of pixels with similar attributes, normally either pixels belonging to edges [15–27] or the corresponding edges themselves [4,8,28–38]. These latter methods lead to a sparse depth map only, leaving the rest of the surface to be reconstructed by interpolation; but they are faster than area-based methods, as there are many fewer points (features) to be considered [3]. We select a feature-based method, as justified below.

1.2. Factors affecting the physical stereovision system and choice of features

There are extrinsic and intrinsic factors affecting the stereovision matching system: (a) extrinsic: in a practical stereovision system the left and right images are obtained with two cameras placed at different positions/angles; although both capture the same scene, each camera may receive a different illumination and also, incidentally, different reflections; (b) intrinsic: the stereovision system is equipped with two physical cameras, always placed at the same relative position (left and right); although they are the same commercial model, their optical devices and electronic components are different, hence each camera may convert the same illumination level to a different gray level. As a result of the above-mentioned factors the corresponding features in both images may display different values. This may lead to incorrect matches. Thus, it is very important to find features in both images which are unique or independent of possible variations in the images [39].

This research uses a feature-based method where edge segments are to be matched, since our experiments have been carried out in an artificial environment where edge segments are abundant. This environment is where a mobile robot equipped with the stereovision system navigates. Such features have been studied in terms of reliability [7] and robustness [39] and, as mentioned before, have also been used in previous stereovision matching works. This justifies our choice of features, although they may be too local. Four average attribute values (module and direction gradient, variance and Laplacian) are computed for each edge segment, as we will see later. The ideas described here can be used for other simple geometric primitives with other attributes, even though in the details of the current implementation the line segments play an important role.

1.3. Constraints applied in stereovision matching

Our stereo correspondence problem can be defined in terms of finding pairs of true matches, namely, pairs of edge segments in two images that are generated by the
same physical edge segment in space. These true matches generally satisfy some competing constraints [22,23,27]: (1) similarity: matched edge segments have similar local properties or attributes; (2) smoothness: disparity values in a given neighborhood change smoothly, except at a few depth discontinuities; (3) ordering: the relative position between two segments in the left image is preserved in the right one for the corresponding matches; (4) uniqueness: each edge segment in one image should be matched to a unique edge segment in the other image.

Next, we introduce the very important concept of disparity. Assume a stereo pair of edge segments, where one segment belongs to the left image and the other to the right one. If we superpose the right image over the left one, for example, the two edge segments of the stereo pair appear horizontally displaced. Following horizontal (epipolar) lines we determine the two intersection points of each epipolar line with the pair of edge segments. The relative displacement of the x-coordinates of the two intersection points is the disparity for such points. We compute the disparity for the pair of edge segments as the average displacement between points belonging to both edge segments along their common length.

The similarity constraint is associated with a local matching process and the smoothness and ordering constraints with a global matching one. The major difficulty of stereo processing arises from the need to make global correspondences. A local edge segment in one image may match equally well with a number of edge segments in the other image. This problem is compounded by the fact that the local matches are not perfect, due to the above-mentioned extrinsic and intrinsic factors. These ambiguities in local matches can only be resolved by considering sets of local matches globally. Hence, to make global correspondences given a pair of edge segments, we consider a set of neighboring edge segments, where a bound on the disparity range defines the neighborhood. Relaxation is a technique commonly used to find the best matches globally; it refers to any computational mechanism that employs a set of locally interacting parallel processes, one associated with each image unit, that update each unit's current labeling in an iterative fashion in order to achieve a globally consistent interpretation of image data [40,41].

1.4. Relaxation in stereovision

Relaxation labeling is a technique proposed by Rosenfeld et al. [42] and developed to deal with uncertainty in sensory data interpretation systems and to find the best matches. It uses contextual information as an aid in classifying a set of interdependent objects by allowing interactions among the possible classifications of related objects. In the stereo paradigm the problem involves assigning unique labels (or matches) to a set of features in
an image from a given list of possible matches. So, the goal is to assign each feature (edge segment) a value corresponding to disparity in a manner consistent with certain predefined constraints. Two main relaxation processes can be distinguished in stereo correspondence: optimization-based and probabilistic/merit-based. In the optimization-based processes stereo correspondence is carried out by minimizing an energy function which is formulated from the applicable constraints. It represents a mechanism for the propagation of constraints among neighboring match features for the removal of ambiguity of multiple stereo matches in an iterative manner. The optimal solution is the ground state, that is, the state (or states) of lowest energy. In the probabilistic/merit-based processes, the initial probabilities/merits, established from a local stereo correspondence process and computed from similarity in the feature values, are updated iteratively depending on the matching probabilities/merits of neighboring features and also on the applicable constraints. The following papers use a relaxation technique: (a) probabilistic/merit [1,4,17,18,20–22,25,26,41–54] and (b) optimization through a Hopfield neural network [12,14,24,27,36,55,56]; we use (a).
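To make the disparity concept of Section 1.3 concrete, a minimal sketch (with a hypothetical segment representation of our own choosing) computes the disparity of a pair of edge segments as the mean x-displacement over the epipolar scanlines spanning their common length:

    def pair_disparity(x_left, x_right):
        # x_left[s] and x_right[s] are the x-coordinates at which the left
        # and right edge segments cut epipolar scanline s along the common
        # length (a hypothetical representation).
        assert len(x_left) == len(x_right) and len(x_left) > 0
        return sum(l - r for l, r in zip(x_left, x_right)) / len(x_left)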
1.5. Motivational research and contribution

For a given pair of features the smoothness constraint makes global correspondences based on how well its neighbors match. The neighborhood region is defined by a bound on the disparity range. In our experiments we have found complex images with highly repetitive structures, with objects very close to the cameras, or with occlusions. In complex images setting a disparity limit is a difficult task, because depending on such a limit the smoothness constraint can be violated. Additionally, in complex images the ordering constraint is also easily violated.

We propose a probabilistic relaxation approach for solving the edge-segment stereovision matching problem and base our method on the work of Christmas et al. [1] (similar to those proposed by Rosenfeld et al. [42] or Peleg [48]) for the following reasons: (a) it incorporates the recent advances of the theory of probabilistic relaxation, (b) the similarity, smoothness and ordering constraints can be mapped directly, and (c) it has been satisfactorily applied in stereovision matching tasks [1].

The main contribution of the paper is the use in the relaxation process of a compatibility measure between pairs of matched segments. This enforces consistency, imposing the smoothness and ordering constraints when there is no suspicion that such constraints are violated; otherwise a different compatibility measure is used and the lost global consistency is replaced by local consistency through the matching probabilities. With this approach our method is valid for common stereo images and is extended to deal robustly with complex stereo images, where the smoothness and ordering constraints are commonly violated. The extension of our approach is based on the idea of Dhond and Aggarwal [57] for dealing with occluding objects. Additional contributions are made in the following ways: (a) by using the learning strategy of Cruz et al. [31] and Pajares [4] to compute the matching probability for each pair of edge segments; the matching probabilities are used to start the relaxation process and then to obtain a compatibility measure when the smoothness and ordering constraints are violated; and (b) by using the minimum differential disparity criterion of Medioni and Nevatia [8] and the approach of Ruichek and Postaire [27] to map the smoothness and ordering constraints, respectively. Therefore, our method integrates several criteria, following the idea of Pavlidis [58], and works properly even if the images are complex.

1.6. Paper organization

The paper is organized as follows: in Section 2 the stereovision probabilistic relaxation matching scheme is described. In Section 3 we summarize the proposed matching procedure. The performance of the method is illustrated in Section 4, where a comparative study against existing global relaxation matching methods is carried out, and the necessity of using a relaxation matching strategy rather than a local matching technique is also made clear. Finally, in Section 5 there is a discussion of some related topics.
2. Probabilistic relaxation scheme in stereovision matching

As mentioned earlier, this paper proposes a probabilistic relaxation technique for solving the stereovision matching problem. This technique is extended to work properly with complex images and integrates several criteria. In this section we deal with these issues.

2.1. Feature and attribute extraction

We believe, as in Medioni and Nevatia [8], that feature-based stereo systems have strong advantages over area-based correlation systems. However, detection of the boundaries is a complex and time-consuming scene-analysis task. The contour edges in both images are extracted using the Laplacian of Gaussian filter in accordance with the zero-crossing criterion [59,60]. For each zero crossing in a given image, its gradient vector, Laplacian and variance values are computed from the gray levels of a central pixel and its eight immediate neighbors. The gradient vector (magnitude and direction)
is computed as in Leu and Yau [61], the Laplacian value as in Lew et al. [62] and the variance value as in Krotkov [63]. To find the gradient magnitude of the central pixel we compare the gray-level differences of the four pairs of opposite pixels in the 8-neighborhood, and the largest difference is taken as the gradient magnitude. The gradient direction of the central pixel is the direction, out of the eight principal directions, whose opposite pixels yield the largest gray-level difference; it points in the direction in which the pixel gray level is increasing. We use a chain code and assign eight digits to represent the eight principal directions; these digits are integer numbers from 1 to 8. This approach allows the normalization of the gradient direction, so its values fall in the same range as the remainder of the attribute values.

In order to avoid noise problems in edge detection that can lead to later mismatches in realistic images, the following two globally consistent methods are used: (1) the edges are obtained by joining adjacent zero crossings following the algorithm of Tanaka and Kak [64]; a margin of deviation of ±20% in gradient magnitude and of ±45° in gradient direction is allowed; (2) each detected contour is approximated by a series of piecewise linear line segments [65]. Finally, for every segment, an average value of each of the four attributes is obtained from all computed values of its zero crossings. All average attribute values are normalized in the same range. Each segment is identified by its initial and final pixel coordinates, its length and its label.

Therefore, each pair of features has two associated four-dimensional vectors x_l and x_r, where the components are the attribute values and the sub-indices l and r denote features belonging to the left and right images, respectively. A four-dimensional difference measurement vector x is then also obtained from the above x_l and x_r vectors, x = x_l − x_r = {x_m, x_d, x_l, x_v}. The components of x are the corresponding differences for module and direction gradient, Laplacian and variance values, respectively, associated with the same four attributes used in our approach.

Of all the possible combinations of pairs of matches formed by segments of the left and right images, only those pairs with a difference value in the direction of the gradient of less than ±45°, and which overlap, will be processed. This is called the initial condition. The remaining pairs that do not meet these conditions are directly classified as false correspondences. Overlap is a concept introduced by Medioni and Nevatia [8]: two segments u and v overlap if, by sliding one of them in a direction parallel to the epipolar line, they would intersect. We note that a segment in one image is allowed to match more than one segment in the other image. Therefore, we make provision for broken segments, resulting in possible multiple correct matches. The pedagogical example in Fig. 1 clarifies the above: the edge segment u in the left image matches the broken segment represented by v and w in the right image, but under the condition that v and w do not overlap and that their orientations, measured by their corresponding x_d, do not differ by more than ±10°.
Fig. 1. Broken edge segments v and w may match with u.
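The 8-neighborhood gradient computation of Section 2.1 can be sketched as follows (a minimal Python fragment; the exact assignment of the chain-code digits 1-8 to directions is our own assumption, as the text does not spell it out):

    import numpy as np

    # Offsets of the 8-neighborhood; opposite pixels are 4 apart in this
    # list, and the chain-code digit of direction k is assumed to be k + 1.
    OFFS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
            (0, -1), (1, -1), (1, 0), (1, 1)]

    def gradient_mag_dir(img, i, j):
        # Gradient magnitude and direction of the central pixel (i, j) from
        # the four pairs of opposite pixels in its 8-neighborhood.
        best_mag, best_dir = -1.0, 0
        for k in range(4):
            (ai, aj), (bi, bj) = OFFS[k], OFFS[k + 4]
            diff = float(img[i + ai, j + aj]) - float(img[i + bi, j + bj])
            if abs(diff) > best_mag:
                best_mag = abs(diff)
                # point toward the neighbor whose gray level is increasing
                best_dir = k + 1 if diff > 0 else k + 5
        return best_mag, best_dir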
2.2. Computation of the local matching probability

The local probabilities for all pairs of features that meet the above initial conditions are computed following a learning strategy [4,31], briefly described below. Due to the above-mentioned extrinsic and intrinsic factors, we have verified that the differences between the attributes for the true matches cluster in a cloud around a center, and we have assumed an underlying probability density function (PDFt). The form of the PDFt is known: our experience leads us to model it as a Gaussian with two unknown parameters, the mean difference vector and the covariance matrix. The first is the center of the cluster and the second gives us information about the dispersion of the differences of the attributes in the cluster. Both parameters are estimated through a learning process based on the parametric Bayes classifier by a maximum-likelihood method [4,66]. The learning process is carried out with a number of samples through a training process. Once the PDFt is fully described, it allows us to compute the local matching probability associated with any new pair of edge segments, using the differences of its attributes, according to the following expression [4,31,66–68]:

p(x) = \frac{1}{(2\pi)^{2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^{t}\, \Sigma^{-1} (x - \mu)\right],   (1)

where μ and Σ are the learned cluster center and covariance matrix, respectively, x is the four-dimensional difference measurement vector (see Section 2.1) and t stands for the transpose. During the computation of the local probabilities we explicitly apply the similarity constraint between features from the left and right images. These local probabilities are the starting point for the subsequent relaxation matching process, i.e. the initial probabilities for the global relaxation process.
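Evaluating Eq. (1) is straightforward once μ and Σ have been learned; a minimal sketch (assuming the parameters were estimated beforehand, e.g. by maximum likelihood over training pairs):

    import numpy as np

    def matching_probability(x, mu, cov):
        # Gaussian PDFt of Eq. (1) for a four-dimensional attribute
        # difference vector x (module and direction gradient, Laplacian,
        # variance); returns the local matching probability density.
        d = x - mu
        norm = (2.0 * np.pi) ** (len(x) / 2.0) * np.sqrt(np.linalg.det(cov))
        return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / norm)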
2.3. A review of the relaxation labeling processes

Recent advances in the theory of probabilistic relaxation can be found in Refs. [1,69], although it has been studied in previous works [42,44,48,51,54]. The probabilistic relaxation process is applicable to general matching problems. The matching problem is formulated in the Bayesian framework for contextual label assignment. The formulation leads to an evidence-combining formula which prescribes, in a consistent and unified manner, how unary attribute measurements relating to single entities, binary relation measurements relating to pairs of objects, and prior knowledge of the environment (initial probabilities) should be jointly brought to bear on the object labeling problem. The theoretical framework also suggests how the probabilistic relaxation process should be initialized based on unary measurements (according to Section 2.2). This contrasts with the approach adopted by Li [70], who started the iterative process from a random assignment of label probabilities.

A set A of N objects of the scene is to be matched, A = {a_1, a_2, …, a_N}. The goal is to match the scene to a model. Each object a_i has an assigned label θ_i, which may take as its value any of the M model labels that form the set Ω: Ω = {ω_1, ω_2, …, ω_M}. The notation ω_{θ_i} indicates that a model label is associated with a particular scene label θ_i. At the end of the labeling process, it is expected that each object will have one unambiguous label value. For convenience two sets of indices are defined: N_0 ≡ {1, 2, …, N} and N_i ≡ {1, 2, …, i−1, i+1, …, N}. For each object a_i a set of m_1 measurements x_i is available, corresponding to the unary attributes of the object: x_i = {x_i^1, x_i^2, …, x_i^{m_1}}. The abbreviation x_{j, j∈N_0} denotes the set of all unary measurement vectors x_j made on the set A of objects, i.e., x_{j, j∈N_0} = {x_1, …, x_N}. For each pair of objects a_i and a_j a set of m_2 binary measurements is available, A_ij = {A_ij^1, A_ij^2, …, A_ij^{m_2}}. The same classes of unary and binary measurements are also made on the model.

With the above notation and following Christmas et al. [1], the theoretical framework of Bayesian probability for object labeling using probabilistic relaxation is as follows: the label θ_i of an object a_i will be given the value ω_{θ_i} provided that it is the most probable label given all the information we have for the system, i.e. all unary measurements and the values of all binary relations between the various objects. By mathematical induction it can be shown that there is no need to include all binary relations of all objects in order to identify an object, and the most appropriate label ω_{θ_i} of object a_i is given by

P(\theta_i = \omega_{\theta_i} \mid x_j,\, j \in N_0;\, A_{ij},\, j \in N_i) = \max_{\omega_\lambda \in \Omega} P(\theta_i = \omega_\lambda \mid x_j,\, j \in N_0;\, A_{ij},\, j \in N_i),   (2)

where the upper-case P represents the probability of an event; thus P(θ_i = ω_λ) and P(θ_i = ω_{θ_i}) denote the probabilities that scene label θ_i is matched to model labels ω_λ and ω_{θ_i}, respectively. Under certain often-adopted assumptions, the conditional probabilities in this equation can be expressed in a form that indicates that a relaxation updating rule would be an appropriate method of finding the required maximum. Using Bayes's rule and after an exhaustive treatment (see Christmas et al. [1]), the desired solution to the labeling problem, as defined by Eq. (2), can be obtained using an iterative scheme as follows:

P^{(n+1)}(\theta_i = \omega_{\theta_i}) = \frac{P^{(n)}(\theta_i = \omega_{\theta_i})\, Q^{(n)}(\theta_i = \omega_{\theta_i})}{\sum_{\omega_\lambda \in \Omega} P^{(n)}(\theta_i = \omega_\lambda)\, Q^{(n)}(\theta_i = \omega_\lambda)},   (3)

where

Q^{(n)}(\theta_i = \omega_\alpha) = \prod_{j \in N_i} \sum_{\omega_\beta \in \Omega} P^{(n)}(\theta_j = \omega_\beta)\, p(A_{ij} \mid \theta_i = \omega_\alpha, \theta_j = \omega_\beta),   (4)

and where P^{(n)}(θ_i = ω_{θ_i}) is the matching probability at level (iteration) n that scene label θ_i is matched to model label ω_{θ_i}, and P^{(n+1)}(θ_i = ω_{θ_i}) is the updated probability of the corresponding match at level n+1. The quantity Q^{(n)}(θ_i = ω_α) expresses the support that the match θ_i = ω_α receives at the nth iteration step from the other objects in the scene, taking into consideration the binary relations that exist between them and object a_i. The density function p(A_ij | θ_i = ω_α, θ_j = ω_β) corresponds to the compatibility coefficients of other methods [42,44]; i.e. it quantifies the compatibility between the match θ_i = ω_α and a neighboring match θ_j = ω_β. The density function ranges from 0 to 1 and will be expressed as a compatibility coefficient in Eq. (6).
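One iteration of this scheme can be sketched as follows (our own minimal Python illustration; `comp` abstracts the density p(A_ij | θ_i = ω_α, θ_j = ω_β), which in this paper becomes the compatibility coefficient of Eq. (6), and in practice the product would run over neighboring objects only):

    import numpy as np

    def support(P, i, a, comp):
        # Q of Eq. (4): product over the other objects j of the
        # compatibility-weighted sum of their current label probabilities.
        N, M = P.shape
        q = 1.0
        for j in range(N):
            if j != i:
                q *= sum(P[j, b] * comp(i, a, j, b) for b in range(M))
        return q

    def relaxation_step(P, comp):
        # Eq. (3): multiply each probability by its support and renormalize
        # over the candidate labels of every object.
        N, M = P.shape
        Q = np.array([[support(P, i, a, comp) for a in range(M)]
                      for i in range(N)])
        new = P * Q
        return new / new.sum(axis=1, keepdims=True)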
that exist between them and object a . The density funcG tion p(A " h "u , h "u ) corresponds to the compatiGH G ? H @ bility coe$cients of other methods [42,44]; i.e. it quanti"es the compatibility between the match h "u and G ? a neighboring match h "u . The density function H @ ranges from 0 to 1 and will be expressed as a compatibility coe$cient in Eq. (6). 2.4. Stereo correspondence using relaxation labeling At this stage there are two goals in mind: (a) to show how the relaxation labeling theory may be applied to stereovision matching problems and (b) to map the similarity, smoothness and ordering stereovision matching constraints in Eqs. (3) and (4). The uniqueness constraint will be considered later during the decision process in Section 4.3. 2.4.1. Stereovision matching by relaxation labeling As explained in the introduction, the stereovision matching is the process of identifying the corresponding features in two images (left and right) that are cast by the same physical feature in 3-D space. In order to apply the relaxation labeling theory to the stereovision matching and without loss of generality, we associate the left stereo-image to the scene and the right one to the model. With this approach we have to match the scene to a model and the stereovision matching problem is now a relaxation matching one. In our approach, the objects in the scene and the model are edge segments. Following the notation introduced in Section 2.3, the set A, of N objects, is identi"ed with the set of N edge segments in the left image from the stereo pair, each left edge segment (object) is identi"ed by its corresponding label h . The edge segments from the G right image are associated to the model labels that form the set ). Then the notation u G in relaxation labeling F indicates now, that we wish to match a right edge segment with a particular left edge segment. According to Section 2.3 the unary attributes m , to map the similarity constraint are module and direction gradient, Laplacian and variance values, as explained in Section 2.1. Therefore m takes a value of 4. The binary relations are: (a) the di!erential disparity of Ref. [8] where we apply the smoothness constraint, and (b) the ordering relation to apply the ordering constraint. Hence m takes the value of 2. According to Section 1.3, we can infer that unary and binary measurements are associated to local and global correspondences, respectively. 2.4.2. Mapping the similarity, smoothness and ordering stereovision matching constraints The next step consists in the mapping of the similarity, smoothness and ordering stereovision matching
constraints [20–23,27] into the relaxation labeling Eqs. (3) and (4). To achieve this we have made two important contributions to solve the stereovision matching problem: (a) applying a learning strategy in the similarity constraint, and (b) introducing specific conditions that take into account the possible violation of the smoothness and ordering constraints (objects in the 3-D scene very close to the cameras, and occlusions) and the problem of fixing a disparity limit (repetitive structures). We can also point out that, unlike Christmas et al. [1], and due to the possible violation of the smoothness and ordering constraints, we have not considered structural or geometrical binary relations for mapping the smoothness constraint. Some of these binary relations can be found in Ref. [1] (angle between line segments, minimum distance between the endpoints of line segments, distance between the midpoints of line segments).

(a) Mapping the similarity constraint: Given a pair of edge segments from the left and right images, to be matched and labeled as θ_i and ω_j respectively, we have the four-dimensional difference measurement vector x_i, which is obviously a measurement of similarity between the two edge segments involved. This vector is obtained from the corresponding unary measurement vectors associated with the left edge segment, x_l, and the right edge segment, x_r, as explained in Section 2.1, so x_i = {x_im, x_id, x_il, x_iv}. The matching probability at level n that the left edge segment θ_i is matched to the right edge segment ω_j is now P^n(θ_i = ω_j), and the x of Eq. (1) is now x_i.

(b) Mapping the smoothness constraint: Assuming no violation of the smoothness constraint, and unlike in Christmas et al. [1], the density function p(A_ij | θ_i = ω_α, θ_j = ω_β) in the support of the match, given in Eq. (4), is replaced by a compatibility coefficient in which the smoothness and ordering constraints are mapped. This is the idea proposed by Hummel and Zucker [44] and Rosenfeld et al. [42]. Traditional interpretations of compatibility coefficients have been in terms of statistical measures such as correlation [42] or structural information [1]. In this paper we define a new compatibility coefficient. Beforehand, however, we change the notation for simplicity. Given two pairs of matched edge segments (θ_i, ω_j) and (θ_h, ω_k), throughout the remainder of the paper these pairs are denoted as (i, j) and (h, k), respectively, and the matching probabilities P^n(θ_i = ω_j) and P^n(θ_h = ω_k) as P^n_ij and P^n_hk, respectively. We measure the consistency between (i, j) and (h, k) by means of the compatibility coefficient c(i, j; h, k), which ranges over the interval [0, 1].

According to Medioni and Nevatia [8], the smoothness constraint assumes that neighboring edge segments have similar disparities, except at a few depth discontinuities [12]. Generally, to define the neighboring region a
bound is assumed on the disparity range allowed for any given segment. This limit is denoted maxd, fixed at eight pixels in this paper. For each edge segment "i" in the left image we define a window w(i) in the right image in which corresponding segments from the right image must lie and, similarly, for each edge segment "j" in the right image we define a window w(j) in the left image in which corresponding segments from the left image must lie. An edge segment "h" is said to lie in a window if at least 30% of its length is contained in that window. The shape of these windows is a parallelogram: two sides are fixed by "i" or "j" and the other two are lines of length 2·maxd (see Fig. 3). The smoothness constraint implies that "i" in w(j) implies "j" in w(i). Now, given "i" and "h" in w(j) and "j" and "k" in w(i), where "i" matches "j" and "h" matches "k", the differential disparity |d_ij − d_hk|, to be included later in c(i, j; h, k), measures how close the disparity d_ij between edge segments "i" and "j" is to the disparity d_hk between edge segments "h" and "k". The disparity between two edge segments is the average of the disparity between them along the length they overlap. This differential disparity criterion is used in Refs. [8,27,36,55,56], among others.

(c) Mapping the ordering constraint: Assuming, as for the smoothness constraint, no violation of the ordering constraint, we extend the concept defined by Ruichek and Postaire [27] for edge points to edge segments. As described in Ref. [27], if an edge point "i" in the left image is matched with an edge point "j" in the right image, then it is not possible for an edge point "h" in the left image, such that x_h < x_i, to be matched with an edge point "k" in the right image for which x_k > x_j, where x denotes the x-coordinate in a Cartesian system with its origin in the bottom left corner of each image. We define the ordering constraint coefficient O_ijhk for the edge segments as follows:
O_{ijhk} = \frac{1}{N_s} \sum_{s=1}^{N_s} o_{ijhk}(s), \qquad o_{ijhk}(s) = 1 - \left| S(x_i - x_h) - S(x_j - x_k) \right|, \qquad S(r) = \begin{cases} 1 & \text{if } r > 0, \\ 0 & \text{otherwise}, \end{cases}   (5)
where this upper-case coefficient O_ijhk measures the agreement between the relative positions of edge segments "i" and "h" in the left image and of "j" and "k" in the right image. The lower-case coefficient o_ijhk is defined in Ref. [27] for edge points and takes the discrete values 0 and 1. O_ijhk is the ordering measure averaged over edge points along the common overlapping length of the four edge segments; it ranges from 0 to 1 and is included later in the compatibility coefficient c(i, j; h, k). We trace N_s scanlines along the common overlapping length; each scanline produces a set of four intersection points with the four edge segments, from which the lower-case o_ijhk is computed. Fig. 2 clarifies the above.

(d) Problems in applying the smoothness and ordering constraints: During our experiments we have found the following problems related to these constraints: (1) violation of the constraints in systems with parallel geometry, due to the presence of objects close to the cameras and occlusions; this type of object is particularly dangerous and must be considered immediately in order to avoid, for example, undesired collisions in navigation systems; (2) the fixation of a bound on the disparity is a hard task, because one can never be sure which is the best limit; in images with repetitive structures this problem is more evident, because an edge segment lying in a window does not imply that its partner lies in the other window. As mentioned in Section 1.5, we call the images in which these problems appear complex images, and we try to solve them by introducing a specific condition in the compatibility coefficient, as we will see later. Thus, our system works properly with both complex and non-complex images.
Fig. 2. Common overlapping length and intersection points produced by a scanline.
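Following Eq. (5) as reconstructed above, the ordering coefficient can be sketched as follows (a minimal fragment with a scanline-sampled segment representation of our own choosing):

    def S(r):
        # Step function of Eq. (5).
        return 1 if r > 0 else 0

    def ordering_coefficient(xi, xh, xj, xk):
        # O_ijhk of Eq. (5): per-scanline agreement between the left-image
        # ordering of segments "i", "h" and the right-image ordering of
        # "j", "k", averaged over the N_s scanlines traced along the common
        # overlapping length. Each argument lists the x-coordinates of one
        # segment's intersections with those scanlines (cf. Fig. 2).
        o = [1 - abs(S(a - b) - S(c - d))
             for a, b, c, d in zip(xi, xh, xj, xk)]
        return sum(o) / len(o)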
(e) Definition of the compatibility coefficient: In our stereovision relaxation labeling model, the probabilities computed according to Eq. (3) are used to determine the various possible correspondences between edge segments. The goal of the relaxation process is to increase the consistency of a given pair of edge segments with the constraints, so that the probability of a correct match is increased and the probability of any incorrect match is decreased during each iteration. Therefore, we look for a compatibility coefficient which is able to represent the consistency between the current pair of edge segments under correspondence and the pairs of edge segments in a given neighborhood. When no violation of the smoothness or ordering constraints is suspected, the compatibility coefficient enforces global consistency between neighboring pairs of edge segments based on these constraints. This is usual in stereovision matching systems and is handled by condition C_1. When violation of the smoothness or ordering constraints is assumed, global consistency is lost and these constraints cannot be applied; we must fall back on local consistency, as this is the only resource available. Local consistency is represented by the similarity constraint, that is, by the matching probability, but now considering the corresponding neighboring edge segments. This is handled by condition C_2. The compatibility coefficient c(i, j; h, k) is finally defined as follows:
c(i, j; h, k) = \begin{cases} \dfrac{O_{ijhk}\, \delta_{ijhk}}{1 + |d_{ij} - d_{hk}|} & \text{if } C_1, \\[6pt] \delta_{ijhk}\, \dfrac{P^{n}_{ij} + P^{n}_{hk}}{2} & \text{if } C_2, \\[4pt] 0 & \text{otherwise}, \end{cases}   (6)
where δ_ijhk = (overlap(i, j) + overlap(h, k))/2 is an overlap function in the interval [0, 1] that measures the average overlap rate of the edge segments under matching (see Section 2.1); d_ij is the average disparity, i.e. the average of the disparity between the two segments "i" and "j" along their common overlap length; P^n_ij is the matching probability of edge segments "i" and "j" at level n; and O_ijhk is the ordering coefficient.

Condition C_1: This condition assumes the mapping of the ordering and smoothness constraints. The assumption that the ordering constraint is not violated is simple, and only requires that O_ijhk be greater than a fixed threshold, set to 0.85 in this paper. The assumption that the smoothness constraint is not violated is based on the following criterion: if the edge segment "h" is a neighbor of the edge segment "i" (h in w(j)), its preferred match "k" is a neighbor of the edge segment "j" (k in w(i)). This is equivalent to saying that both pairs of edge segments have similar disparities; this is the minimum differential disparity criterion of Ref. [8]. Given a segment "h" in the left image (LI) and a segment "k" in the right image (RI), we define "k" as a preferred match of "h", denoted pm(h, k), if P^n_hk ≥ A · P^n_hk' ∀ k' ∈ RI, where P^n_hk > T (with the threshold T set to 0.5 in this experiment) and P^n_hk' = max_{k'∈RI} P^n_hk' is the maximum matching probability of the edge segment "h" with any edge segment "k'" in RI. The constant A is a coefficient that fixes the degree of exigency in defining the preferred match (set to 0.85 in this paper). Hence, the compatibility coefficient is maximum if the differential disparity is minimum and O_ijhk and the overlap function are maximum, and vice versa.

Condition C_2: Unlike C_1, this condition assumes the violation of the smoothness or ordering constraints (due to objects close to the cameras or occlusions) or the possibility that an edge segment "h" in w(j) has a preferred match "k" outside w(i) (repetitive structures). C_2 is only considered when C_1 is false. The loss of global consistency is replaced by local consistency in the compatibility coefficient, computed as the average of the matching probabilities, where the overlap function is also taken into account. We must point out that condition C_2 overrides the mapping of the ordering and smoothness constraints and takes up the similarity constraint again, through the matching probabilities, so that the pair of edge segments under correspondence is not penalized during the relaxation process by a violation of the smoothness and ordering constraints. The compatibility coefficient is maximum if the matching probabilities and the overlap function are maximum, and vice versa.

(f) A pedagogical example: The following simple example clarifies the above concepts related to conditions C_1 and C_2, where we consider the evaluation of a pair of edge segments (i, j) (Fig. 3): (1) the preferred match of "h" is "k", where h ∈ w(j) and k ∈ w(i); ordering and smoothness are not violated; (2) the preferred match of "h" is "l", hence the ordering constraint is preserved but the smoothness constraint is violated, because l ∉ w(i); (3) the preferred match of "h" is "m", hence the ordering constraint is violated but the smoothness constraint is preserved, because m ∈ w(i); (4) the preferred match of "h" is "n", in which case both the ordering and smoothness constraints are violated. In the first case condition C_1 reinforces the match (i, j), as expected, but in the remaining cases the reinforcement is assumed by condition C_2. This is the appropriate behavior: without condition C_2 the match (i, j) would be penalized, which is unwanted, as the edge segments in the right image are all preferred matches of "h".
Fig. 3. Evaluation of the pair (i, j) considering its neighborhood.
The dotted lines in the right image represent the positions of "i" and "h" when the left image is superimposed on the right one.
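A direct transcription of Eq. (6) makes the two conditions explicit (a minimal sketch; the predicates c1_holds and c2_holds stand for the tests on O_ijhk, the preferred matches and the windows described above):

    def compatibility(O_ijhk, delta_ijhk, d_ij, d_hk, P_ij, P_hk,
                      c1_holds, c2_holds):
        # Compatibility coefficient c(i, j; h, k) of Eq. (6):
        #   O_ijhk     ordering coefficient of Eq. (5)
        #   delta_ijhk average overlap of the two pairs, in [0, 1]
        #   d_ij, d_hk average disparities of pairs (i, j) and (h, k)
        #   P_ij, P_hk current matching probabilities of the two pairs
        if c1_holds:    # global consistency: smoothness and ordering
            return (O_ijhk * delta_ijhk) / (1.0 + abs(d_ij - d_hk))
        if c2_holds:    # suspected violation: fall back on local consistency
            return delta_ijhk * (P_ij + P_hk) / 2.0
        return 0.0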
3. Summary of the matching procedure

After mapping the similarity, ordering and smoothness stereovision matching constraints, the correspondence process is achieved by allowing the system to evolve until it reaches a stable state, i.e. until no changes occur in the matching probabilities during the updating procedure. The whole matching procedure can be summarized as follows:

(1) The system is organized with a set of pairs of edge segments (from the left and right images) with a difference in the direction of the gradient of less than ±45° and which overlap (initial conditions in Section 2.1). Each pair of edge segments is characterized by its initial matching probability.
(2) The initial matching probabilities are computed according to Eq. (1).
(3) The variable npair, which represents the number of pairs of edge segments whose matching probabilities are modified by the updating procedure, is set to 0.
(4) The pairs of edge segments are sequentially selected and the relaxation procedure is applied to each one.
(5) At the nth iteration, the matching probability is computed according to Eq. (3).
(6) If |P^{n+1}(θ_i = ω_{θ_i}) − P^n(θ_i = ω_{θ_i})| > ε then npair is increased. This is the criterion used in Ref. [1], so that for each object a_i there is one label ω_{θ_i} for which the matching probability at a given level is within some small distance ε of unity. The constant ε is introduced to accelerate the convergence and is set to 0.01 in this experiment.
(7) If npair is not equal to 0 then go to step 3.
(8) Matching decisions: A left edge segment can be assigned to a unique right edge segment (unambiguous pair) or to several right edge segments (ambiguous pairs). The decision about whether a match is correct is made by choosing the greatest probability value (in the unambiguous case there is only one), whenever it surpasses a previously fixed probability threshold T, set to 0.5. Ambiguities under the criterion given in Section 2.1 are allowed, as they can be produced by broken edge segments. This step represents the mapping of the uniqueness constraint, which completes the set of matching constraints in stereovision matching.
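The procedure as a whole reduces to a simple fixed-point loop; a skeleton of our own (with the per-sweep update of Eq. (3) abstracted as `sweep`):

    def relaxation_matching(P0, sweep, eps=0.01, T=0.5):
        # P0    : dict mapping candidate pairs (i, j) to initial matching
        #         probabilities from Eq. (1) (steps 1-2)
        # sweep : one pass of the updating rule of Eq. (3) (steps 4-5)
        P = dict(P0)
        while True:
            P_new = sweep(P)
            npair = sum(abs(P_new[p] - P[p]) > eps for p in P)  # step 6
            P = P_new
            if npair == 0:                                      # step 7
                break
        # step 8 (uniqueness): for each left segment keep the candidate(s)
        # with the highest final probability, provided it exceeds T
        matches = {}
        for (i, j), prob in P.items():
            best = max(q for (a, _), q in P.items() if a == i)
            if prob > T and prob >= best:
                matches.setdefault(i, []).append(j)
        return matches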
4. Validation, comparative analysis and performance evaluation

4.1. Design of a test strategy

In order to assess the validity and performance of the proposed method (i.e. how well our approach works in images with and without additional complexity), we have selected 48 stereo pairs of realistic stereo images from an indoor environment. Each pair consists of two left and right original images and two left and right images of labeled edge segments. All tested images are 512×512 pixels in size, with 256 gray levels. The two cameras have equal focal lengths and are aligned so that their viewing directions are parallel. The plane formed by the viewed
Fig. 4. Group SP1: (a) original left stereo image, (b) original right stereo image, (c) labeled segments left image, (d) labeled segments right image.
point and the two focal points intersects the image plane along a horizontal line known as the epipolar line. A grid pattern of parallel horizontal lines is used for camera alignment. The cameras are adjusted so that the images of these lines coincide [71]. Thus, each point in an edge segment lies along the same horizontal (epipolar) line crossing both images of the stereo pair.

The 48 stereo pairs are classified into three groups: SP1, SP2 and SP3, with 18, 18 and 12 pairs of stereo images, respectively. For space reasons a single representative stereo pair is shown for each SP set. Group SP1 consists of stereo images without apparent complexity; Figs. 4a–d (original images and labeled edge segments, as stated) show a stereo pair representative of SP1. Group SP2 corresponds to scenes where a repetitive structure has been captured; Figs. 5a–d show a stereo pair representative of SP2, where the repetitive structure is provided by the vertical books in the foreground of the scene. Finally, group SP3 contains objects close to the cameras, which produce a high range of disparity, violating the smoothness and ordering constraints. Figs. 6a–d show a stereo pair representative of SP3, where we can see the object labeled 9, 10 in the left image and 11, 12 in the right image as a characteristic example of an object close to the cameras, occluding edge segment 19 in the right image. Although this last type of image is unusual, its treatment is very important, as such objects could produce fatal errors in navigation systems, for example, where the nearest objects must be processed immediately.

As mentioned in the introduction, we base our method on the relaxation work of Christmas et al. [1], in which a stereovision matching problem is solved. This
Fig. 5. Group SP2: (a) original left stereo image, (b) original right stereo image, (c) labeled segments left image, (d) labeled segments right image.
Fig. 6. Group SP3: (a) original left stereo image, (b) original right stereo image, (c) labeled segments left image, (d) labeled segments right image.
algorithm uses straight-line segments as features for the matching, a unique unary measurement of each line segment (orientation of the scene relative to the model) and four binary relations. These binary relations are: (a) the angle between line segments a_i and a_j; (b) the angle between segment a_i and the line joining the center-points of segments a_i and a_j; (c) the minimum distance between the endpoints of line segments a_i and a_j; and (d) the
distance between the midpoints of line segments a_i and a_j. It is a structural matching method based on geometrical relations. These are the fundamental differences with regard to our approach. Indeed, we use four unary measurements (module and direction gradient, Laplacian and variance) to map the similarity constraint and two binary relations to map the smoothness and ordering constraints. Only the direction gradient in our approach could be identified with the unary measurement in Ref. [1]. These reasons prevent us from performing an appropriate comparative analysis between our approach and the structural method in Ref. [1]. Moreover, the geometrical relationships would be applicable only with great difficulty to our complex images.

According to Section 1.4, there are different relaxation approaches: probabilistic, merit and optimization. This paper develops a probabilistic relaxation (PR) method. To compare our PR technique, we choose one merit relaxation and two optimization relaxation methods.

4.1.1. The merit relaxation approach (MR)

The minimum differential disparity algorithm of Medioni and Nevatia [8] is the merit relaxation method chosen for comparison purposes, for the following reasons: (1) it applies the commonly used constraints: similarity, smoothness and uniqueness; (2) it uses edge segments as features and the contrast and orientation of the features as attributes; (3) it is implied in the derivation of PR; and (4) Medioni and Nevatia conclude that several improvements to the matching method may be possible. A brief description of the MR algorithm is given below.

MR uses straight-line segments as features, because its authors believe that feature-based stereo systems have strong advantages over area-based correlation systems. The method defines two windows around every segment in the left and right images and assumes a bound on the allowed disparity range. We have applied these concepts in our algorithm, and they are exhaustively explained in Section 2.4.2. The MR local matching process defines a boolean function indicating whether two segments are potential matches; such a function is true iff the segments overlap and have similar contrast and orientation (similarity constraint). The overlap concept in PR is also taken from this method, and the contrast and orientation could be identified with our module and direction gradient, although we believe that the gradient provides better performance. To each pair of segments MR associates an average disparity, which is the average of the disparity between the two segments involved along the length of their overlap. This is also applied in our method. Thus, a global relaxation matching process is triggered, in which a merit function is updated iteratively and changes if the segments in the neighborhood are assigned new preferred matches. The preferred matches are selected according to the boolean function and the
concept, the latter defined on the basis of the disparity bound. MR applies the smoothness constraint through the minimum differential disparity criterion. We have also applied this criterion in PR. The fundamental differences between MR and PR can be summarized as follows: (1) PR uses quantitative probability values (matching probabilities) whereas MR uses a qualitative boolean function; (2) PR introduces specific conditions to overcome the possible violation of the smoothness constraint and to avoid the problem of the fixation of the disparity limit, which is missing in MR.

4.1.2. Two optimization relaxation approaches (OR1 and OR2)
Two appropriate techniques against which to compare our PR are the approaches based on the Hopfield neural network of Ref. [36] (OR1) and of Ref. [27] (OR2). These techniques take into account other previous works in this area [12,14,24,55,56]. They are chosen for the following reasons: (1) in OR1 the similarity and smoothness constraints are mapped as in PR, and both use the same set of attributes and features; (2) the compatibility coefficient in OR1 only takes into account the violation of the smoothness constraint, whereas in PR the violation of both the smoothness and ordering constraints is considered; and (3) although OR2 uses edge pixels as features, it is considered as a guideline to map the similarity, smoothness and ordering constraints for edge segments. This is done as in PR but removing the specific conditions C1 and C2; that is, with respect to C1 the preferred match "k" of "h" fulfills P^n_hk = P^n_hk' for all k' in R^k, and with respect to C2 the violation of the smoothness and ordering constraints is not considered. The comparison between PR and OR1 allows us to check the effectiveness of the ordering constraint, because they use a similar compatibility coefficient, and the comparison between PR and OR2 allows us to prove the effectiveness of the compatibility coefficient itself, more specifically when the smoothness and ordering constraints are violated.

4.2. Experimental results
In this study, the proposed PR method for solving the stereovision matching problem was executed on a VAX 8810 workstation. During the test, 48 relaxation processes were built and executed, one for each stereo image pair. The number of pairs of edge segments in the different relaxation processes is variable and ranges from 36 to 104. This variability depends on the number of edge segments extracted and on the initial conditions established in Section 2.1. Eq. (3) describes the evolution of the matching probabilities for our global stereo matching approach. Fig. 7 shows the number of pairs (npair) of edge segments for which the matching probabilities are modified by the updating procedure, as a function of the iteration number.
Fig. 7. Number of pairs (npair) of edge segments for which the matching probabilities are modified by the updating procedure, as a function of the iteration number, averaged over all relaxation processes.
Table 1
Percentage of successes for the group of stereo-pairs SP1, as a function of the iteration number

SP1     0      8      16     24     32
PR      79.1   86.1   91.2   94.3   97.3
OR1     79.1   84.3   88.7   92.9   96.2
OR2     79.1   80.3   83.5   87.6   89.9
MR      79.1   79.8   82.6   85.2   86.7
This number is averaged over the 48 executed relaxation processes. It is computed according to point 6 in Section 3. The number of pairs that change their matching probabilities drops in a relatively smooth fashion, the maximum number of pairs that change between one iteration and the next being 9. This means that the number of correct matches grows slowly during the global matching process, for the following two reasons: (a) initially (iteration 0) there is a relatively high percentage of true matches (as we will see later in Table 1); this is a direct consequence of the local stereovision matching process used, and is the same idea as that used by Christmas et al. [1], where the initial probabilities are computed from the unary measurements; and (b) the coefficient e introduced in Section 3, point 6, relaxes the criterion used to determine whether npair changes. Finally, we can see that from iteration number 15 npair varies slowly, and it can be assumed that all analyzed stereo images reach an acceptably high degree of performance at iteration 15.
We can point out that, owing to the variability in the number of pairs of edge segments to be matched for each stereo image and to the differing complexity of the stereo images, the time for convergence is also variable. We have verified over all tested stereo images that the average time for convergence is about 7 min. This time is measured from iteration 0 to iteration 32, the start and end of the global relaxation process, which is the objective of this paper.

4.3. Comparative analysis
As we already know, the stereo-matching process can be split into local and global processes. This paper develops a global, probabilistically based relaxation process, which is now of interest to us, since we have already studied the local process exhaustively in previous works [4,31,32]. Therefore, the comparative analysis we perform concerns only the relaxation process.
The system processes the SP1, SP2 and SP3 groups of stereo images. Of all the possible combinations of pairs of matches formed by segments of the left and right images, only 468, 359 and 294 pairs are considered for SP1, SP2 and SP3, respectively, because the remaining pairs do not meet the initial conditions. For each pair of features we compute the local matching probability following the learning procedure described in Section 2.2. This is the starting point for all relaxation processes. For PR, OR1 and OR2 the local matching probabilities are the initial matching probabilities. MR requires the definition of a boolean function to indicate whether two segments are potential matches. We have chosen the criterion that if the local matching probability is greater than 0.5, the intermediate probability value in [0,1], the pair of segments is a potential match and the boolean function is true.
The computed results are summarized in Tables 1–3. They show the percentage of successes for each group of stereo-pairs (SP1, SP2 and SP3) and for each method (PR, OR1, OR2 and MR) as a function of the number of iterations. Iteration 0 corresponds to the results for the local matching probabilities, the starting point for the relaxation process. As shown in Fig. 7, 32 iterations suffice to halt the PR process; this is the number of iterations used for the remaining methods. We have performed experiments with window-width values (maxd, see Section 2.4b) in the range 3–15 pixels, and the best results were obtained when maxd was set to 8 pixels.
Table 2
Percentage of successes for the group of stereo-pairs SP2, as a function of the iteration number

SP2     0      8      16     24     32
PR      61.3   75.6   84.8   90.1   96.7
OR1     61.3   72.7   81.9   88.9   95.3
OR2     61.3   67.8   70.4   76.9   80.3
MR      61.3   66.5   69.8   73.7   77.2
Table 3
Percentage of successes for the group of stereo-pairs SP3, as a function of the iteration number

SP3     0      8      16     24     32
PR      62.8   78.3   87.9   91.4   95.7
OR1     62.8   76.5   86.8   90.7   95.1
OR2     62.8   64.6   67.1   69.3   70.5
MR      62.8   62.1   61.6   60.7   59.8
4.3.1. Decision process
When the relaxation processes have finished, there are still both unambiguous and ambiguous pairs of segments, depending on whether one and only one, or several, right image segments correspond to a given left image segment. In any case, the decision about whether a match is correct is made by choosing the result of greatest probability for PR, OR1 and OR2, and the best merit value for MR; in the unambiguous case there is only one. For PR, OR1 and OR2 this probability must surpass the previously fixed threshold probability (T = 0.5). An edge segment l in the left image is allowed to match both r1 and r2 in the right image if r1 and r2 do not overlap, their directions differ by less than a fixed threshold (set to 10° in this paper) and the matching probabilities between l and r1 and between l and r2 are both greater than T. A sketch of this decision rule is given at the end of this section.
According to the values in Tables 1–3, the following conclusions may be inferred: (1) Group SP1: all methods progress in a similar fashion in the number of successes. The best results are obtained with PR and OR1; in both cases this is due to the definition of condition C1 in the compatibility coefficient. (2) Group SP2: the best results are once again obtained with PR and OR1, where the differences with respect to OR2, and particularly with respect to MR, are significant. The explanation lies in the presence of repetitive structures in the images. These structures produce false matches, which are taken as true by MR and OR2 and as false by PR and OR1; hence PR and OR1 work as expected. The reason is the definition of condition C1 applied in the compatibility coefficient. (3) Group SP3: the number of successes for PR and OR1 evolves as expected, whereas for OR2 it remains practically stable and for MR it decreases as the number of iterations increases. This last case is a surprising result. It is a direct consequence of images with some objects close to the cameras violating the smoothness and ordering constraints (in Fig. 3a–d, see the object bounded by labels 9 and 10 in the left image and 11 and 12 in the right image). MR increases the merit of false matches, such as (3, 3) or (2, 2), while PR and OR1 handle this phenomenon correctly through condition C2.
In general, the best results are always obtained with PR. The small difference with respect to OR1 is due to the convergence criterion used in the Hopfield neural network, which requires a greater number of iterations (about 22) than PR to obtain a similar percentage of successes. We have verified that the ordering constraint is not decisive here, because its impact is subsumed by the smoothness constraint. We think this is the reason why the ordering constraint is not considered in classical stereovision matching methods [8,14,18,20,23,55], among others. Nevertheless, it is useful in structural or geometric matching approaches [1,30,37]. (4) The difference in the results between PR and OR2 tells us that the definition of the compatibility coefficient with its two conditions is appropriate for dealing with the violation of the smoothness and ordering constraints and also for fixing the disparity limit. (5) Additionally, we can point out that the global relaxation matching process substantially improves the local matching results (compare the results between iterations 0 and 32). Hence the relaxation process becomes particularly necessary for complex stereo images.
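As a concrete illustration of the decision rule of Section 4.3.1, the following minimal Python sketch selects the final matches from the converged probabilities. It is not the authors' implementation: the dictionary probs and the helpers overlaps and direction are hypothetical stand-ins for data produced by the relaxation and local matching stages.

    # Minimal sketch of the decision rule of Section 4.3.1 (illustrative only).
    # probs[(l, r)] is assumed to hold the converged matching probability of
    # left segment l with right segment r; overlaps() and direction() are
    # hypothetical helpers supplied by the local matching stage.
    T = 0.5            # probability threshold
    MAX_DIR_DIFF = 10  # maximum direction difference (degrees)

    def decide_matches(probs, overlaps, direction):
        candidates = {}
        for (l, r), p in probs.items():
            if p > T:                       # only probabilities above T survive
                candidates.setdefault(l, []).append((p, r))
        matches = {}
        for l, cands in candidates.items():
            cands.sort(key=lambda t: t[0], reverse=True)  # best probability first
            chosen = [cands[0][1]]
            if len(cands) > 1:
                # a second match is tolerated only for non-overlapping right
                # segments whose directions are nearly equal
                p2, r2 = cands[1]
                if (not overlaps(chosen[0], r2)
                        and abs(direction(chosen[0]) - direction(r2)) < MAX_DIR_DIFF):
                    chosen.append(r2)
            matches[l] = chosen
        return matches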
5. Concluding remarks
An approach to stereo correspondence using a probabilistic relaxation method has been presented. The stereo correspondence problem is formulated as a relaxation task in which a matching probability is updated through the mapping of four constraints (similarity, smoothness, ordering and uniqueness). The similarity constraint is doubly applied: (a) during the computation of the local matching probability (see Section 2.2), and (b) during the mapping of the corresponding constraint (see Section 2.4a). We have also introduced, through the compatibility coefficient of Eq. (6), specific conditions to overcome the violation of the smoothness and ordering constraints and to avoid the serious problem caused by the setting of the disparity limit. The advantage of using a relaxation process is that a global match is achieved automatically, taking into account each stereo pair of edge segments to be matched (a) in an isolated fashion (unary measurements) and (b) as a member of a neighborhood (binary measurements).
A comparative analysis has been performed against the other methods OR1, OR2 and MR. We have shown that the proposed method PR gives better performance. We first started our experiments with normal images and then used more complex images with repetitive structures, objects violating the smoothness and ordering constraints, and occlusions, always obtaining satisfactory results. We have also shown the necessity of using a global relaxation strategy.
We hope that the above results show the potential of PR. However, our method does not yet achieve full performance, as it does not reach the 100% success level, and several improvements to the matching method may be possible. We are currently investigating some of these: (1) hierarchical processing, starting at region level, and (2) using corners as complementary features [54].
References
[1] W.J. Christmas, J. Kittler, M. Petrou, Structural matching in computer vision using probabilistic relaxation, IEEE Trans. Pattern Anal. Mach. Intell. 17 (8) (1995) 749–764.
[2] A.R. Dhond, J.K. Aggarwal, Structure from stereo – a review, IEEE Trans. Systems Man Cybernet. 19 (1989) 1489–1510.
[3] T. Ozanian, Approaches for stereo matching – a review, Modeling Identification Control 16 (2) (1995) 65–94.
[4] G. Pajares, Estrategia de solución al problema de la correspondencia en visión estereoscópica por la jerarquía metodológica y la integración de criterios, Ph.D. Thesis, Facultad de C.C. Físicas, U.N.E.D., Department of Informática y Automática, Madrid, 1995.
[5] S. Barnard, M. Fishler, Computational stereo, ACM Comput. Surveys 14 (1982) 553–572.
[6] K.S. Fu, R.C. González, C.S.G. Lee, Robótica: Control, Detección, Visión e Inteligencia, McGraw-Hill, Madrid, 1988.
[7] D.H. Kim, R.H. Park, Analysis of quantization error in line-based stereo matching, Pattern Recognition 8 (1994) 913–924.
[8] G. Medioni, R. Nevatia, Segment based stereo matching, Comput. Vision Graphics Image Process. 31 (1985) 2–18.
[9] H.H. Baker, Building and Using Scene Representations in Image Understanding, AGARD-LS-185 Machine Perception, 1982, 3.1–3.11.
[10] I.J. Cox, S.L. Hingorani, S.B. Rao, B.M. Maggs, A maximum likelihood stereo algorithm, Comput. Vision Image Understanding 63 (2) (1996) 542–567.
[11] P. Fua, A parallel algorithm that produces dense depth maps and preserves image features, Mach. Vision Appl. 6 (1993) 35–49.
[12] J.J. Lee, J.C. Shim, Y.H. Ha, Stereo correspondence using the Hopfield neural network of a new energy function, Pattern Recognition 27 (11) (1994) 1513–1522.
[13] Y. Shirai, Three-Dimensional Computer Vision, Springer, Berlin, 1983.
[14] Y. Zhou, R. Chellappa, Artificial Neural Networks for Computer Vision, Springer, New York, 1992.
[15] W.E.L. Grimson, Computational experiments with a feature-based stereo algorithm, IEEE Trans. Pattern Anal. Mach. Intell. 7 (1985) 17–34.
[16] A. Khotanzad, A. Bokil, Y.W. Lee, Stereopsis by constraint learning feed-forward neural networks, IEEE Trans. Neural Networks 4 (1993) 332–342.
[17] Y.C. Kim, J.K. Aggarwal, Positioning three-dimensional objects using stereo images, IEEE J. Robot. Automat. 3 (1987) 361–373.
[18] V.R. Lutsiv, T.A. Novikova, On the use of a neurocomputer for stereoimage processing, Pattern Recognition Image Anal. 2 (1992) 441–444.
[19] D. Maravall, E. Fernandez, Contribution to the matching problem in stereovision, Proceedings of the 11th IAPR International Conference on Pattern Recognition, The Hague, 1992, pp. 411–414.
[20] D. Marr, La Visión, Alianza Editorial, Madrid, 1985.
[21] D. Marr, Vision, Freeman, San Francisco, 1982.
[22] D. Marr, T. Poggio, A computational theory of human stereovision, Proc. Roy. Soc. London B 207 (1979) 301–328.
[23] D. Marr, T. Poggio, Cooperative computation of stereo disparity, Science 194 (1976) 283–287.
[24] M.S. Mousavi, R.J. Schalkoff, ANN implementation of stereo vision using a multi-layer feedback architecture, IEEE Trans. Systems Man Cybernet. 24 (8) (1994) 1220–1238.
[25] P. Rubio, Análisis comparativo de métodos de correspondencia estereoscópica, Ph.D. Thesis, Facultad de Psicología, Universidad Complutense, Madrid, 1991.
[26] P. Rubio, RP: un algoritmo eficiente para la búsqueda de correspondencias en visión estereoscópica, Informática y Automática 26 (1993) 5–15.
[27] Y. Ruycheck, J.G. Postaire, A neural network algorithm for 3-D reconstruction from stereo pairs of linear images, Pattern Recognition Lett. 17 (1996) 387–398.
[28] N. Ayache, B. Faverjon, Efficient registration of stereo images by matching graph descriptions of edge segments, Int. J. Comput. Vision 1 (1987) 107–131.
[29] H.H. Baker, T.O. Binford, Depth from edge and intensity based stereo, in: Proceedings of the 7th International Joint Conference on Artificial Intelligence, Vancouver, Canada, 1981, pp. 631–636.
[30] K.L. Boyer, A.C. Kak, Structural stereopsis for 3-D vision, IEEE Trans. Pattern Anal. Mach. Intell. 10 (2) (1988) 144–166.
[31] J.M. Cruz, G. Pajares, J. Aranda, A neural network approach to the stereovision correspondence problem by unsupervised learning, Neural Networks 8 (5) (1995) 805–813.
[32] J.M. Cruz, G. Pajares, J. Aranda, J.L.F. Vindel, Stereo matching technique based on the perceptron criterion function, Pattern Recognition Lett. 16 (1995) 933–944.
[33] W. Hoff, N. Ahuja, Surface from stereo: integrating feature matching, disparity estimation, and contour detection, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 121–136.
[34] D.H. Kim, W.Y. Choi, R.H. Park, Stereo matching technique based on the theory of possibility, Pattern Recognition Lett. 13 (1992) 735–744.
[35] Y. Ohta, T. Kanade, Stereo by intra- and inter-scanline search using dynamic programming, IEEE Trans. Pattern Anal. Mach. Intell. 7 (2) (1985) 139–154.
[36] G. Pajares, J.M. Cruz, J. Aranda, Relaxation by Hopfield network in stereo image matching, Pattern Recognition 31 (5) (1998) 561–574.
[37] L.G. Shapiro, R.M. Haralick, Structural descriptions and inexact matching, IEEE Trans. Pattern Anal. Mach. Intell. 3 (5) (1981) 504–519.
[38] M.S. Wu, J.J. Leou, A bipartite matching approach to feature correspondence in stereo vision, Pattern Recognition Lett. 16 (1995) 23–31.
[39] D.M. Wuescher, K.L. Boyer, Robust contour decomposition using a constraint curvature criterion, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1) (1991) 41–51.
[40] R.M. Haralick, L.G. Shapiro, Computer and Robot Vision, Vols. I and II, Addison-Wesley, Reading, MA, 1992, 1993.
[41] M. Sonka, V. Hlavac, R. Boyle, Image Processing, Analysis and Machine Vision, Chapman & Hall, London, 1995.
[42] A. Rosenfeld, R. Hummel, S. Zucker, Scene labelling by relaxation operations, IEEE Trans. Systems Man Cybernet. 6 (1976) 420–453.
[43] S.T. Barnard, W.B. Thompson, Disparity analysis of images, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1982) 333–340.
[44] R. Hummel, S. Zucker, On the foundations of relaxation labeling processes, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 267–287.
[45] A. Laine, G. Roman, A parallel algorithm for incremental stereo matching on SIMD machines, IEEE Trans. Robot. Automat. 7 (1991) 123–134.
[46] S. Lloyd, E. Haddow, J. Boyce, A parallel binocular stereo algorithm utilizing dynamic programming and relaxation labelling, Comput. Vision Graphics Image Process. 39 (1987) 202–225.
[47] N.M. Nasrabadi, A stereo vision technique using curve-segments and relaxation matching, IEEE Trans. Pattern Anal. Mach. Intell. 14 (5) (1992) 566–572.
[48] S. Peleg, A new probabilistic relaxation scheme, IEEE Trans. Pattern Anal. Mach. Intell. 2 (1980) 362–369.
[49] K. Prazdny, Detection of binocular disparities, Biol. Cybernet. 52 (1985) 93–99.
[50] K.E. Price, Relaxation matching techniques – a comparison, IEEE Trans. Pattern Anal. Mach. Intell. 7 (5) (1985) 617–623.
[51] S. Ranade, A. Rosenfeld, Point pattern matching by relaxation, Pattern Recognition 12 (1980) 269–275.
[52] D. Sherman, S. Peleg, Stereo by incremental matching of contours, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 1102–1106.
[53] C. Stewart, C. Dyer, Simulation of a connectionist stereo algorithm on a memory-shared multiprocessor, in: V. Kumar (Ed.), Parallel Algorithms for Machine Intelligence and Vision, Springer, Berlin, 1990, pp. 341–359.
[54] C.Y. Wang, H. Sun, S. Yada, A. Rosenfeld, Some experiments in relaxation image matching using corner features, Pattern Recognition 16 (2) (1983) 167–182.
[55] N.M. Nasrabadi, C.Y. Choo, Hopfield network for stereovision correspondence, IEEE Trans. Neural Networks 3 (1992) 123–135.
[56] Y.-H. Tseng, J.-J. Tzen, K.-P. Tang, S.-H. Lin, Image-to-image registration by matching area features using Fourier descriptors and neural networks, Photogrammetric Eng. Remote Sensing 63 (8) (1997) 975–983.
[57] U.R. Dhond, J.K. Aggarwal, Stereo matching in the presence of narrow occluding objects using dynamic disparity search, IEEE Trans. Pattern Anal. Mach. Intell. 17 (7) (1995) 719–724.
[58] T. Pavlidis, Why progress in machine vision is so slow, Pattern Recognition Lett. 13 (1992) 221–225.
[59] A. Huertas, G. Medioni, Detection of intensity changes with subpixel accuracy using Laplacian–Gaussian masks, IEEE Trans. Pattern Anal. Mach. Intell. 8 (5) (1986) 651–664.
[60] D. Marr, E. Hildreth, Theory of edge detection, Proc. Roy. Soc. London B 207 (1980) 187–217.
[61] J.G. Leu, H.L. Yau, Detecting the dislocations in metal crystals from microscopic images, Pattern Recognition 24 (1991) 41–56.
[62] M.S. Lew, T.S. Huang, K. Wong, Learning and feature selection in stereo matching, IEEE Trans. Pattern Anal. Mach. Intell. 16 (9) (1994) 869–881.
[63] E.P. Krotkov, Active Computer Vision by Cooperative Focus and Stereo, Springer, New York, 1989.
[64] S. Tanaka, A.C. Kak, A rule-based approach to binocular stereopsis, in: R.C. Jain, A.K. Jain (Eds.), Analysis and Interpretation of Range Images, Springer, Berlin, 1990, pp. 33–139.
[65] R. Nevatia, K.R. Babu, Linear feature extraction and description, Comput. Vision Graphics Image Process. 13 (1980) 257–269.
[66] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[67] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, New York, 1994.
[68] B. Kosko, Neural Networks and Fuzzy Systems, Prentice-Hall, Englewood Cliffs, NJ, 1992.
[69] A.M.N. Fu, H. Yan, A new probabilistic relaxation method based on probability space partition, Pattern Recognition 30 (11) (1997) 1905–1917.
[70] S.Z. Li, Matching: invariant to translations, rotations and scale changes, Pattern Recognition 25 (1992) 583–594.
[71] J. Majumdar, Seethalakshmy, Efficient parallel processing for depth calculation using stereo, Robot. Autonomous Systems 20 (1997) 1–13.
About the Author: G. PAJARES received the B.S. and Ph.D. degrees in physics from U.N.E.D. (Distance University of Spain) in 1987 and 1995, respectively, with a thesis on the application of pattern recognition techniques to stereovision. Since 1990 he has worked at ENOSA in critical software development. He joined the Complutense University in 1995 as an Associate Professor in Robotics. His current research interests include robotic vision systems and applications of automatic control to robotics.

About the Author: J.M. CRUZ received an M.Sc. degree in Physics and a Ph.D. from the Complutense University in 1979 and 1984, respectively. From 1985 to 1990 he was with the Department of Automatic Control, UNED (Distance University of Spain), and from October 1990 to 1992 with the Department of Electronics, University of Santander. In October 1992 he joined the Department of Computer Science and Automatic Control of the Complutense University, where he is a professor. His current research interests include robotic vision systems, sensor fusion and applications of automatic control to robotics and flight control.
About the Author: J.A. LOPEZ-OROZCO received an M.Sc. degree in Physics from the Complutense University in 1991. From 1992 to 1994 he held a Fellowship to work on inertial navigation systems. Since 1995 he has been an assistant researcher in robotics at the Department of Computer Architecture and Automatic Control, where he is working towards his Ph.D. His current interests are robotic vision systems and multisensor data fusion.
Pattern Recognition 33 (2000) 69–80
Fingerprint ridge distance computation methodologies
Zs.M. Kovács-Vajna*, R. Rovatti, M. Frazzoni
DEIS, University of Bologna, viale Risorgimento 2, 40134 Bologna, Italy
Received 16 October 1997; received in revised form 25 January 1999; accepted 25 January 1999
Abstract
The average ridge distance of fingerprint images is used in many problems and applications, such as fingerprint filter design and identification and classification procedures. This paper addresses the problem of local average ridge distance computation. The computation is based on a two-step procedure: first, the average distance is defined in each significant portion of the image, and then this information is propagated onto the remaining regions to complete the computation. Two methods are considered for the first step: geometric and spectral. In the geometric approach the central points of ridges are estimated on a regular grid, and straight lines passing through these points and parallel to the ridge directions are used. The second method is based on the computation of harmonic coefficients leading to effective estimates of the average ridge period. In order to complete the average distance map, a diffusion equation is used so that maps with minimum variations are favored. Finally, some experimental results on NIST SDB4 are reported. 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Image pre-processing; Fingerprint; Ridge distances; Spectral analysis; NIST SDB4
1. Introduction
Each person has a unique set of fingertips. Thus, fingerprints have been used as a personal signature since ancient times [1,2]. The scientific foundations of fingerprint use for personal identification were laid by F. Galton (1822–1916), H. Faulds (1843–1930), H. Wilder (1864–1928) and H. Poll (1877–1939). Galton [3] pointed out the uniqueness and permanence over time of fingerprint geometry. Nowadays fingerprints are mainly used in three areas, namely forensic science [4,5], security clearance (financial transactions or access to restricted areas), and anthropological and medical studies (specific diseases and genetic features) [6].
The police traditionally collect fingerprint impressions of suspects' inked fingers on paper, hence introducing
* Corresponding address: DEA, Facoltà di Ingegneria, University of Brescia, via Branze 38, 25123 Brescia, Italy. Tel.: +39-030-3715437; fax: +39-030-380014. E-mail addresses: [email protected] (Zs.M. Kovács-Vajna), [email protected] (R. Rovatti).
noise because of insufficient or excess ink and irregularities of the paper surface. Noise is also present in latent fingerprints, no matter how they are detected and recorded. Furthermore, latent fingerprints are often distorted by non-uniform pressure or movement. Both noise and imperfect recording techniques make the matching task very difficult.
Security clearance systems use sensors to perceive and record a fingerprint. Though movement-induced distortion is certainly reduced by the cooperation of the person to be identified, noise will still appear as the fingerprint is recorded in real conditions, when the finger can be dirty and/or wet. Moreover, irregular translation and rotation generally affect different samples of the same fingerprint because of varied finger position, finger moisture and pressure on the recording surface.
Fingerprint matching cannot therefore be performed by superimposing two different fingerprint images (FPIs). Instead, fingerprints are matched by comparing ridge patterns, the relative positions of ridge discontinuities (called minutiae) such as ridge endings and bifurcations, and the ridge count between minutiae. Other minutiae can be considered, as well as pore information or more complex topological structures [5].
0031-3203/99/$20.00 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00040-0
This paper addresses the problem of local average ridge distance computation in FPIs. Though ridge distance is an essential parameter in many procedures of automatic FPI processing and matching systems, so far its computation seems not to have been investigated in detail.
In most automatic FPI processing systems, the images are first filtered (regardless of their gray-scale or binary nature) to facilitate matching. An important parameter during this step is the filter window, often a function of the "ridge pattern period". Formally speaking, the ridge period is not the same as the ridge distance, despite the frequent interchangeability of the two terms in the literature. In fact, the distance from a given ridge to the subsequent one is locally defined as the length of the segment connecting the centers of the two ridges along the line perpendicular to the first one, while the ridge period is commonly defined as the sum of the widths of the ridge and the subsequent valley. Though the difference between ridge distance and ridge period can be locally relevant, their averages tend to coincide in patterns with limited curvature. To see how this happens, assume that n ridges are present with different widths w_1, ..., w_n, separated by n−1 valleys with different widths v_1, ..., v_{n−1}. Referring to Fig. 1, it is easy to see that the average ridge distance is

(1/(n−1)) Σ_{i=1}^{n−1} (w_i/2 + v_i + w_{i+1}/2),

while the average ridge period is

(1/(n−1)) Σ_{i=1}^{n−1} (w_i + v_i),

so that their difference is (w_n − w_1)/(2(n−1)). Note that this difference vanishes for uniform patterns (with w_1 = w_2 = ... = w_n) and decreases as more ridges are considered, i.e. for increasing n.
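The identity above is easy to verify numerically. The following short Python check, with hypothetical widths of our own choosing, only illustrates the arithmetic and is not part of the authors' method:

    # Check: average distance minus average period equals (w_n - w_1)/(2(n-1)).
    w = [6.0, 7.0, 5.0, 8.0]   # hypothetical ridge widths w_1..w_n
    v = [4.0, 5.0, 6.0]        # hypothetical valley widths v_1..v_{n-1}
    n = len(w)

    dist = sum(w[i] / 2 + v[i] + w[i + 1] / 2 for i in range(n - 1)) / (n - 1)
    period = sum(w[i] + v[i] for i in range(n - 1)) / (n - 1)

    assert abs((dist - period) - (w[-1] - w[0]) / (2 * (n - 1))) < 1e-12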
Fig. 1. Comparison between ridge distance and ridge period.
Many contributions mention and exploit ridge distance estimation and stress its weight in FPI processing but, to the best of the present authors' knowledge, none of them addresses its computation in depth.
O'Gorman and Nickerson [7] propose a filter design process specially tailored to gray-scale fingerprints. Their filter takes advantage of the assumptions about the underlying pattern usually made by humans when inspecting FPIs. Some fingerprint image parameters are required so that the filter is better matched and produces the required image enhancement. One of the key parameters is the ridge pattern period. O'Gorman and Nickerson assume it to be constant over the entire image. It is considered to be either the sum of the maximum ridge width and the minimum valley width, or the sum of the minimum ridge width and the maximum valley width. Actually, the authors assume these two quantities to be the same and express the ridge period by means of four parameters in order to provide for internal variations of ridge and valley width. In order to improve the performance of their filtering procedure, O'Gorman and Nickerson propose using the statistical average for ridge distance, depending on the subject class (the smaller the class, the better the filter). To do so, they preselect FPIs, partition them into different classes (men, women and children) and design a different filter for each class. This approach has two weaknesses: the assumption that the ridge period is constant over the entire image, and the grouping of different people into classes. Both these weaknesses lead to problems which the authors claimed were dependent on incorrect feature detection.
Lin and Dubes [6] have developed a system for automatic ridge counting and assume the ridge period to be constant. Nevertheless, in the discussion of their results they observe that ridge period variation sometimes jeopardizes the possibility of correct ridge counting.
Hung [8] computes the ridge period independently for each fingerprint. He attempts to overcome internal ridge width variation by equalizing it throughout the whole FPI. First he applies a binary filter that uses an estimate of the ridge pattern period. Then a direction map is computed, and from any ridge point the next ridge is searched for by moving perpendicularly to the local ridge direction. The ridge pattern period is computed by summing the averages of ridge width and valley width over the whole image. This method suffers from several problems. First, the high internal variation of ridge and valley width can preclude proper processing in a considerable portion of the image if average values over the whole image are computed. Second, neighboring ridges are usually oriented in different directions, so that the exact computation of their spacing would require extensive inspection of the image instead of a local orthogonal search, which may give incorrect results. Finally, when noise is present in the form of spurs, holes and short ridges, grossly mistaken ridge and valley widths may be included in the final statistics, affecting their reliability.
After filtering, binarization and ridge thinning, a ridge skeleton is obtained by all the commonly adopted FPI processing systems. With skeletonized images, ridge distance becomes an even more important feature. In fact, the whole skeleton enhancement process consists of selecting a few dozen minutiae with a high confidence rating and using them for the actual fingerprint matching. Hung himself, as well as Xiao and Rafaat, employ ridge distance in analyzing skeleton structure, in order to remove false minutiae or noisy skeleton patterns. For example, most heuristics for minutiae extraction are based on local analysis (close bifurcations, facing endpoints, short ridges) in a neighborhood whose size is a direct function of the ridge distance.
From the above discussion, some key factors seem to affect the average ridge distance estimation:
1. Variability of ridge distance from one person to another.
2. Variability of ridge distance, up to 150%, within the same FPI.
3. Variability of ridge direction.
4. High levels of noise that may distort statistics.
As an additional feature, any candidate methodology for ridge distance extraction should be applicable to gray-scale FPIs, as it is one of the earliest processing phases.
This paper is organized as follows. The following section introduces some basic notations which are common to the methods we propose and discuss. In Sections 3 and 4, two different approaches to average ridge distance estimation are described; they are inspired, respectively, by geometric considerations and by the physical significance of image spectral features. Since these methods are able to effectively estimate the average ridge distance only in some sub-blocks of the original image, a mechanism is needed to extend the estimation to sub-blocks corrupted by noise or containing a high-curvature pattern. Section 5 deals with this problem and proposes a feasible, formally sound approach to this task. In Section 6 the two methods are compared by analyzing their response when some FPIs from a standard database are processed. Finally, some conclusions are drawn and future directions of investigation are discussed.
2. Common notations and assumptions
All the procedures analyzed in the following consider gray-level FPIs partitioned into square blocks of N×N pixels. Fig. 2 shows a 256-level, 512×512 FPI from the NIST-SDB4 database [9] partitioned into 64 square blocks with N = 64.
Average ridge distance estimation is carried out separately in each sub-block to cope with the high variability over the whole FPI. The procedures represent each
Fig. 2. Partition of f0025_6 from the NIST-SDB4 database into zones for average ridge distance estimation.
block with a matrix g_xy, assigning a gray level to each pair of integer coordinates x, y ∈ {0, ..., N−1}. For each block, the two numbers g_min = min_{x,y} g_xy and g_max = max_{x,y} g_xy identify the actual range of gray variation.
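In code, this block decomposition is straightforward. The following Python sketch, which assumes the image is available as a NumPy array (our own assumption, not a detail given in the paper), yields each sub-block together with its gray range:

    import numpy as np

    def blocks(image, N=64):
        # Yield (block row, block column, sub-block, g_min, g_max) for a
        # gray-level image stored as a 2-D array, e.g. 512x512 uint8.
        H, W = image.shape
        for r in range(0, H, N):
            for c in range(0, W, N):
                g = image[r:r + N, c:c + N]
                yield r // N, c // N, g, int(g.min()), int(g.max())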
3. A geometric approach
As ridge distance is a geometric feature of the FPI, the most natural way of tackling its estimation relies on recognizing and processing basic geometric entities. The basic entities we are interested in are the central points of ridges and the straight lines passing through them parallel to the local ridge. In fact, wherever in the sub-block these entities can be recognized, analytical computation of the ridge distance is a matter of trivial trigonometric considerations. Hence, we are first concerned with recognizing these entities in an FPI which is noisy and irregular.
Central ridge points are always located at minima of the gray-scale matrix g_xy, though not every minimum corresponds to a ridge center. Even for a relatively small block, an exhaustive search for gray-level minima, and the subsequent processing needed to associate subsets of the minima with the ridges they lie on, are not advisable from the computational point of view. A regular grid is therefore superimposed on the block, as shown in Fig. 3, and local minima are searched for along each horizontal and vertical line.
Fig. 3. A minima-searching grid for the central block in Fig. 2.
Due to noise and pattern irregularity, not every local minimum is a ridge center; candidate minima must therefore be tested against suitable criteria to ensure that they are true background pixels. To do so, a threshold g_th is defined and only minima at points (x, y) such that g_xy ≤ g_th are taken as ridge centers.
As this is the starting point of the whole procedure, the definition of g_th is critical and deserves careful analysis. In good quality images, foreground and background are
well distinguished, so that many pixels are either bright, i.e. have high g_xy, or dark, i.e. have low g_xy. If statistical information is collected about the relative frequency of each gray level in such an image, a typical bi-modal histogram is obtained. Many methods for threshold extraction have been proposed in the literature (e.g. Refs. [11–13]). As we must perform threshold selection for every row and every column in the grid, we cope with the need for a fast procedure by devising a histogram-based
Fig. 4. The bi-modal gray-level histogram of a good image.
Fig. 5. First and second trial for the gray-level threshold g_th.
technique combining many aspects of the methods found in the literature.
Fig. 4 shows a typical gray-level occurrence histogram. It has been obtained by grouping the 256 original gray levels into 64 groups, so that the ith histogram entry actually accounts for the number of occurrences of gray levels 4i, 4i+1, 4i+2 and 4i+3. In such a case, the natural value for a gray-level threshold discriminating between foreground and background lies somewhere in the valley between the two relevant maxima. Its numerical evaluation exploits a two-step procedure aimed at avoiding statistically negligible contributions to the gray-occurrence histogram.
First, the gray-level histogram is collected by analyzing the pixels along the chosen sub-grid line and assigning the Δg gray levels between a_i and a_{i+1} (with a_i = iΔg) to the ith entry. Then the histogram is smoothed by averaging every entry with its neighbors. Finally, if H_i is the relative occurrence of the gray levels grouped in the ith entry of the smoothed histogram and H_max = max_i H_i is their maximum, we define the two indices

i_min = min{i : H_i ≥ H_max/2} and i_max = max{i : H_i ≥ H_max/2}.

These two indices delimit the statistically significant portion of the gray-level histogram, and a first threshold g′ is set to

g′ = Δg (i_min + i_max + 1)/2.

The construction of i_min and i_max from the half-mode horizontal line is sketched in Fig. 5. Note how, even with this well-behaved histogram, this first estimate may fail to find the most natural threshold value, though it certainly identifies the most promising region. A final tuning is therefore needed. To this end we search for local minima of H_i in a neighborhood of (i_min + i_max + 1)/2. In practice, we consider those histogram entries i such that H_i ≤ min{H_{i−1}, H_{i+1}} and |i − (i_min + i_max + 1)/2| ≤ 20/Δg.
If such entries exist, the one with the minimum H_i defines a second gray-level threshold g″ = Δg (i + 1/2), while if no local minima are present in the neighborhood of (i_min + i_max + 1)/2, g′ is adopted. Though devised under the hypothesis of a bi-modal gray-level histogram, this method can also be applied to unimodal or multi-modal profiles, often resulting in a reasonable estimate of the natural gray-level threshold distinguishing foreground pixels from background pixels.
Gray-level threshold extraction is performed for each horizontal and vertical line of the grid crossing the sub-block, so that a set of ridge centers can be constructed consisting of all local minima falling below those thresholds.
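A compact rendering of this two-step thresholding in Python might look as follows. It is a sketch under our own assumptions (NumPy available, bin width Δg = 4), not the authors' code:

    import numpy as np

    def gray_threshold(pixels, dg=4):
        # First step: smoothed occurrence histogram and half-maximum indices.
        H, _ = np.histogram(pixels, bins=256 // dg, range=(0, 256))
        H = np.convolve(H, np.ones(3) / 3, mode="same")   # neighbor averaging
        sig = np.nonzero(H >= H.max() / 2)[0]
        i_min, i_max = sig[0], sig[-1]
        center = (i_min + i_max + 1) / 2
        g1 = dg * center                                  # first threshold g'
        # Second step: local minima of H close to the center of the plateau.
        cand = [i for i in range(1, len(H) - 1)
                if H[i] <= min(H[i - 1], H[i + 1])
                and abs(i - center) <= 20 / dg]
        if cand:
            i_star = min(cand, key=lambda i: H[i])
            return dg * (i_star + 0.5)                    # second threshold g''
        return g1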
Once the ridge centers are collected, computation of the local distance requires knowledge of the ridge direction. The method we adopt is similar to the one used in Ref. [8], in which only a discrete number of directions is considered. Fig. 6 illustrates the procedure for a number of distinguishable directions n_d equal to 8, while the practical implementation employs a resolution of n_d = 16. Let P be the center of the ridge just discovered. A disk with radius R = 10 pixels is considered, partitioned into n_d equivalent sectors S_1, ..., S_{n_d}. Then, for each sector S_i, we define the two quantities g^i_min = min_{(x,y)∈S_i} g_xy and

p_i = (1/#S_i) Σ_{(x,y)∈S_i} [g_xy − g^i_min]²,

where the # operator gives the number of pixels in S_i. The index p_i accounts for the presence of foreground gray levels in S_i with a quadratic emphasis. Hence, the darker and more compact the region, the smaller the index, and the ridge direction is taken to be the one associated with the minimum dispersion p_i, i.e. the line bisecting that sector.
This method can yield a wrong direction when ridges are very close and the darkest pixel of the sector does not belong to the ridge under examination but to one of its neighbors. Nevertheless, experience has shown that these cases are statistically insignificant.
Fig. 6. Extraction of ridge direction.
Fig. 7. Next ridge search and distance evaluation.
Once a central point P and a ridge direction have been extracted, we search for the two neighboring ridges forwards and backwards along the line normal to the ridge direction. To do this, the center-searching algorithm explained above is employed, accepting only those centers whose distance from the initial ridge center is not greater than a threshold d_max.
We now set out to evaluate the distance between each pair of ridges, accounting for the fact that the two are, in general, not parallel. Two lines starting from P are therefore considered, one normal to the original ridge, the other normal to the second ridge. If P′ is the center of the second ridge on the first of the two lines, then the second line marks a point P″ on the line parallel to the second ridge and passing through P′. The distance between the two ridges is defined as

d = (PP′ + PP″)/2.

Fig. 7 shows a sample application of this technique. The resulting distance d is accepted only if it is greater than a threshold d_min and if the extracted directions of the two neighboring ridges are not too different. This prevents false ridges, due to noisy pixels in valleys, from altering the overall distance statistics.
Acceptable distances are extracted by repeating the above procedure for each ridge center found along every horizontal and vertical line of the grid. Each element in this distance set is then rounded to the nearest integer. With this integer distance set we may now construct a distance histogram whose ith entry D_i accounts for the occurrences of a distance whose rounded value is i. The histogram describes the distribution of distances over the sub-block, and the final average ridge distance is taken to be the average of the significant portion of this distance distribution. The significant portion is obtained, as before, by defining the two indices i_min and i_max used to estimate the color threshold on the gray-level histogram. Once i_min and i_max are defined, we set the average distance to

d̄ = (Σ_{i=i_min}^{i_max} i D_i) / (Σ_{i=i_min}^{i_max} D_i)

whenever the number of distances in the significant portion of the histogram (Σ_{i=i_min}^{i_max} D_i) is not less than a threshold D_min ∝ N². If, on the contrary, the number of significant distances is not a significant fraction of the sub-block size, the estimate is rejected.
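The final averaging step admits a direct translation into code. The sketch below assumes the local distances have already been collected in a list; the 1% fraction matches the value 0.01 N² quoted in Section 6:

    import numpy as np

    def average_distance(distances, N=64, frac=0.01):
        D = np.bincount(np.rint(distances).astype(int))
        sig = np.nonzero(D >= D.max() / 2)[0]   # same half-maximum rule as above
        i_min, i_max = sig[0], sig[-1]
        Ds = D[i_min:i_max + 1]
        if Ds.sum() < frac * N * N:             # threshold D_min proportional to N^2
            return None                         # estimate rejected
        i = np.arange(i_min, i_max + 1)
        return float((i * Ds).sum() / Ds.sum())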
4. The spectral approach
Ridge period computation can also be regarded as a spectral analysis problem. In fact, regular two-dimensional patterns may be thought of as a linear combination of
simple orthogonal periodic signals (called harmonics). The closer the pattern is to a given periodic behavior, the greater the coefficient of the harmonic with that period. As spectral analysis enables the computation of these harmonic coefficients, it may also give effective estimates of the average ridge period.

4.1. Discrete Fourier transform and harmonic coefficients
If g_xy is the gray-scale value of the pixel with coordinates x, y ∈ {0, ..., N−1} in an N×N image, the Discrete Fourier Transform (DFT) of g_xy is defined as [[10], Chap. 8]

G_uv = (1/N) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} g_xy e^{−(2πj/N)⟨(x,y),(u,v)⟩},   (1)

where j is the imaginary unit, u, v ∈ {0, ..., N−1} and ⟨(x,y),(u,v)⟩ = xu + yv is the vector dot product. G_uv is, in general, complex. However, if we express G_uv in Eq. (1) as G_uv = |G_uv| e^{j arg(G_uv)}, it obviously follows that [[10], Chap. 8]

g_xy = (1/N) Σ_{u=0}^{N−1} Σ_{v=0}^{N−1} G_uv e^{(2πj/N)⟨(x,y),(u,v)⟩}
     = (1/N) Σ_{u=0}^{N−1} Σ_{v=0}^{N−1} |G_uv| cos[(2π/N)⟨(x,y),(u,v)⟩ + arg(G_uv)],

in which we exploit the fact that, g_xy being real, the imaginary part of the right-hand side can be discarded. In the light of this, the image can be understood as a sum of harmonics h_uv(x,y) = cos[(2π/N)⟨(x,y),(u,v)⟩] whose phase and amplitude are modulated by the complex harmonic coefficients G_uv. The set of all the harmonic coefficients of an image is commonly referred to as its spectrum.
In Fig. 8 three harmonics h_uv(x, y) are shown for a 64×64-pixel image. Note how they reproduce an extremely regular pattern whose period can be computed analytically from the expression for h_uv(x,y), resulting in

d_uv = N/√(u² + v²).   (2)

In fact, h_uv(x, y) is maximum for all points (x,y) belonging to the straight line xu + yv − kN = 0 for some integer k. However, the distance between two such straight lines for k and k+1 can be easily computed to give Eq. (2).

Fig. 8. Images of three sample harmonics h_uv(x, y).

Several well-known features of the DFT are of great importance in practical computation and deserve to be mentioned. It can be shown that the two-dimensional set of harmonics and harmonic coefficients is symmetric and periodic along each axis since we have, for example, G_{u+N,v} = G_{u,v} and h_{u+N,v} = h_{u,v}, as well as G_{N−u,v} = G_{−u,v} and h_{N−u,v} = h_{−u,v}. A more global symmetry also exists, as G_{−u,−v} = G_{u,v} and h_{−u,−v} = h_{u,v}.
Finally, two more properties are of interest for our analysis. Let us first analyze what happens to the harmonic coefficients when the original image g_xy is transformed into a different image g̃_xy defined as

g̃_xy = g_{⌊x+x₀⌋_N, ⌊y+y₀⌋_N},

where x, y = 0, ..., N−1 and the translated image is "wrapped around" by defining ⌊a⌋_N to be a if 0 ≤ a < N, a − N if a ≥ N and N − a if a < 0. Starting from the definition of the DFT and of g̃_xy, and with a suitable sum splitting, it follows that

G̃_uv = (1/N) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} g_{⌊x+x₀⌋_N, ⌊y+y₀⌋_N} e^{−(2πj/N)⟨(x,y),(u,v)⟩}
     = (1/N) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} g_xy e^{−(2πj/N)⟨(x,y),(u,v)⟩} e^{−(2πj/N)⟨(x₀,y₀),(u,v)⟩}
     = G_uv e^{−(2πj/N)⟨(x₀,y₀),(u,v)⟩},
so that |G̃_uv| = |G_uv|. Let A now be an orthogonal 2×2 matrix and let us consider a modified image

g̃_{⌊A(x,y)⌋_N} = g_{x,y},

where the wrap-around is applied separately to each coordinate of the vector A(x,y). If A(x,y) has all integer coordinates, we have

G̃_uv = (1/N) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} g_xy e^{−(2πj/N)⟨⌊A(x,y)⌋_N,(u,v)⟩}
     = (1/N) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} g_xy e^{−(2πj/N)⟨(x,y),Aᵀ(u,v)⟩}.

From this, and from the fact that A⁻¹ = Aᵀ, we can easily prove that G̃_{A(u,v)} = G_{u,v}, i.e. that in passing from g to g̃ the spectrum undergoes the same orthogonal transformation. Though the integer assumption on A(x,y) is crucial in the formal development, this result can be extended to image rotations, which often feature the transcendental matrix

A = [cos α  −sin α; sin α  cos α],

if a certain degree of approximation is acceptable.

4.2. Ridge distance estimation from the harmonic coefficients
The whole procedure relies on a radial distribution function Q(r) defined for every integer 0 ≤ r ≤ N. To define such a function, say that C_r is the set of coordinates (u, v) such that √(u² + v²) ≈ r, where the ≈ symbol is resolved by rounding to the nearest integer. If #C_r is the number of elements of C_r, then Q(r) is defined as

Q(r) = (1/#C_r) Σ_{(u,v)∈C_r} |G_uv|,

where 0/0 is defined to be 0. Recalling the significance of |G_uv| and the expression (2) for the ridge distance resulting from a single harmonic, it can be intuitively accepted that Q(r) gives the average contribution of the harmonics with ridge distance N/r to the construction of the overall image. Thus, peaks in Q(r) identify the main harmonic contributions, and the corresponding ridge distance is likely to dominate in the image pattern.
Note how, owing to the last two properties of the DFT, the definition of Q(r) is well posed, as it is invariant with respect to translation and rotation of the image. In fact, computation of Q(r) only involves the modulus of the harmonic coefficients, and the set of coefficients taken into consideration for each r is obviously invariant through any rotation of the spectrum, i.e. through any rotation of the image.
The method we are proposing detects peaks of Q(r) and estimates the average ridge distance in consequence:
1. Compute G_uv from g_xy.
2. Compute Q(r) for 0 ≤ r ≤ N−1.
3. Find r₁ such that Q(r₁) ≥ Q(r) for any r with 0 < r_min ≤ r ≤ r_max < N−1, r ≠ r₁ (find the position of the largest peak).
4. If r₁ is not a local maximum of Q(r) for 0 ≤ r ≤ N−1, estimation is impossible.
5. Find r₂ such that Q(r₂) ≥ Q(r) for any r with 0 < r_min ≤ r ≤ r_max < N−1, r ≠ r₁, r ≠ r₂ (find the position of the second largest peak).
6. Estimate d̄ = N/r₁ with confidence
(a/Q(r₁)) min{Q(r₁) − Q(r₂), Q(r₁) − Q(r₁−1), Q(r₁) − Q(r₁+1)}.

From the confidence expression it can be easily understood that an estimate is considered to be reliable when r₁ is a well-defined peak of the radial density function and the second maximum is sufficiently smaller. Rounding is finally applied to d̄ to obtain an integer number of pixels; the direction of rounding is biased by the position of the second maximum of Q(r).
The three parameters the whole procedure depends on are easily determined from the available data. In particular, once it is known that a typical human fingerprint has a ridge distance within the interval [d_min, d_max], one may set r_min = N/d_max and r_max = N/d_min. These two parameters set a filtering mask on the radial distribution function Q(r) that helps to cut out the high-frequency noise contribution (r > r_max) and average gray-level effects (r < r_min). For typical applications, relying on a 500 DPI image of the fingerprint, two reasonable values turn out to be r_min = 3 and r_max = 20.
The confidence normalization coefficient a is only needed when estimates obtained with the spectral method have to be compared with those obtained with other methods. The value of a may be obtained by estimating the ridge distance in ideal images with parallel equidistant gray peaks (see Fig. 8). Then, setting

a = Q(r₁) / max{Q(r₁) − Q(r₂), Q(r₁) − Q(r₁−1), Q(r₁) − Q(r₁+1)},

where the max is taken over all the considered ideal images, provides a reasonable confidence normalization coefficient. This procedure leads to a normalization coefficient a = 2.43. Once the normalized confidence has been computed, we decided to accept only estimations whose confidence was not less than 0.4.
5. Ridge map completion
All the previously discussed approaches provide local estimation of the average ridge distance. They can report when their estimate is not reliable, because the corresponding image block is affected by noise or has too complex a ridge geometry. As a result, an incomplete map of average ridge distances is obtained, in which some zones are assigned an average distance while some are not. Regrettably, ridge distance must be defined in each zone to allow filtering and skeletonization procedures to be applied.
To address this problem, let us first consider an ideal, continuous image in which a number d(x, y), representing the average ridge distance in that zone, is associated with a square zone centered on (x, y). Our problem is equivalent to the reconstruction of the function d(x, y) for every allowed x and y, given its values d_i at certain points (x_i, y_i). Such a task is well known to be ill-posed in the sense of Hadamard [14], since its solution is neither unique nor depends continuously on the given data. In order to solve it, some constraints reflecting our knowledge about the actual d(x, y) have to be introduced [15,16], the most natural and commonly adopted being some kind of smoothness enforcement.
If we choose to measure smoothness with the second-order Lebesgue norm of some of the derivatives of d, we may recast the reconstruction problem into the minimization problem

min ∫∫ T[d](x, y) dx dy   subject to   d(x_i, y_i) = d_i,

where T[·] is a Tikhonov regularizer, i.e. a positive quadratic differential operator implicitly defining the concept of smoothness. To complete the map of average ridge distances we choose the simplest Tikhonov regularizer, the squared modulus of the gradient, so that maps which minimize variations are favored. This is quite a natural assumption, as ridge distance is known to change little between adjacent zones, and these changes are smoothed even further by the local averaging procedure. Thus, we set T[d] = (∂d/∂x)² + (∂d/∂y)² and apply elementary calculus of variations [17] to obtain that the maximally smooth d satisfies ∂²d/∂x² + ∂²d/∂y² = 0. Thus, using the Laplacian operator ∇², our problem is to find a function d(x, y) such that

∇²d = 0,   d(x_i, y_i) = d_i.   (3)

We may now go back to our discrete problem and solve Eq. (3) for the central points of the blocks for which no reliable estimate was provided. To do so, we first translate Eq. (3) into a diffusion problem which is suitable for
an iterative and discrete solution. That is, we consider d to be the steady-state solution of

∂d/∂t = ∇²d,   (4)

which can be discretized both in time and space to obtain

[d^{t+Δt}(x, y) − d^t(x, y)]/Δt = [d^t(x+1, y) − 2d^t(x, y) + d^t(x−1, y)]/(Δx)² + [d^t(x, y+1) − 2d^t(x, y) + d^t(x, y−1)]/(Δy)².

With Δx = Δy = Δ, the iteration of the above formula is stable only if Δt/Δ² ≤ 1/4 [[18], Chap. 17], and assuming Δt = Δ²/4 we obtain

d^{t+Δt}(x, y) = (1/4)[d^t(x+1, y) + d^t(x−1, y) + d^t(x, y+1) + d^t(x, y−1)],   (5)

whose iteration yields the steady-state solution of Eq. (4) as t → ∞, i.e. the solution of Eq. (3). To enhance convergence when few given data are present, we enlarge the number of values of d entering the right-hand side of the discretized equation. To do so it must be kept in mind that the Laplacian operator is invariant under rotation, and the same discretization path can be followed along the diagonals of the average ridge distance map:

d^{t+Δt}(x, y) = (1/4)[d^t(x+1, y+1) + d^t(x−1, y−1) + d^t(x−1, y+1) + d^t(x+1, y−1)],   (6)

where it has been noted that moving along the diagonals implies a space discretization step √2 times the previous one, and thus a double time step. With this, and assuming a piecewise-linear time evolution for the unknown values of d(x, y), we may average Eqs. (5) and (6) so as to take all eight neighboring zones into account at each iteration.
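A discrete completion of the map can be sketched in a few lines of Python. Here d0 is assumed to be the (for instance 8×8) array of local estimates, with NaN marking the blocks where both estimators failed; clamping the known values at every step is our reading of the boundary conditions of Eq. (3):

    import numpy as np

    def complete_map(d0, iters=500):
        known = ~np.isnan(d0)
        d = np.where(known, d0, np.nanmean(d0))    # crude initial guess
        for _ in range(iters):
            p = np.pad(d, 1, mode="edge")
            # Eq. (5): axial 4-neighbor average
            axial = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4
            # Eq. (6): diagonal 4-neighbor average
            diag = (p[:-2, :-2] + p[2:, 2:] + p[:-2, 2:] + p[2:, :-2]) / 4
            d = np.where(known, d0, (axial + diag) / 2)
        return d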
6. Experimental results
The two methods presented in the previous sections have been tested on many different fingerprint images from a standard database. In particular, we applied them to NIST-SDB4, a set of 2000 pairs of 500 DPI FPIs coded in 8-bit gray scale [9]. As an example, let us consider two images from this database labeled, respectively, f0025_6 (see Fig. 2) and s0011_2 (see Fig. 9).
As ridge distance is subject to wide variation over the whole fingerprint, the enhancement procedure is applied block by block. Block sizes from N = 50 to 150 were tried,
Fig. 9. The 500 DPI fingerprint image s0011_2.
Fig. 10. Results of average ridge distance estimation for f0025_6 with the geometric and spectral approaches: local estimation and distance map completion.
and a reasonable trade-off between the number of blocks in which the procedures produce a reliable estimate (which increases with N) and their ability to track local ridge-distance variation (which decreases as N increases) was found for N in the range [60, 100]. For this reason
Fig. 11. Results of average ridge distance estimation for s0011_2 with the geometric and spectral approaches: local estimation and distance map completion.
N was set to 64, so that each 512×512 image was divided into 64 sub-blocks. Ridge distance estimation by means of the geometric and spectral approaches is performed for each sub-block.
Figs. 10 and 11 report the results of average distance estimation for f0025_6 and s0011_2 with both the geometric and spectral approaches. Null entries in the left-hand side tables reflect the fact that both the geometric and spectral approaches may fail to give a reliable estimate of the average ridge distance. The right-hand sides of Figs. 10 and 11 report the completed distance maps obeying the smoothness criterion discussed in Section 5 for the two FPIs under consideration.
A failure in estimation may occur when fewer than D_min significant distances can be averaged to obtain the final estimate (i.e. 0.01 × 64² ≈ 41 in the considered cases) or when the normalized confidence for the spectral estimate is less than 0.4. Though the latter case has a quite straightforward interpretation in terms of noisy or highly irregular patterns, the number of significant distances extracted with the geometric approach is affected by many factors. Table 1 shows the incidence of the various decisions on the final number of distances. In principle, every local minimum along the horizontal and vertical grid lines in the sub-blocks may lead to two local distances. Yet, only minima under the gray-level threshold are accepted, and the ridge direction is estimated only for those points not too close to the block border.
Table 1
Incidence of different phenomena in the reduction of the number of collected distances

Heuristics                   Discarded over        Discarded over
                             total potential (%)   total discarded (%)
Minimum above threshold      22.93                 41.43
Block border hit             0.77                  1.40
Neighbor search failure      12.21                 22.06
Different directions         7.25                  13.11
Distance below threshold     12.17                 21.99
Total                        55.33
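The sketch below restates the filtering chain behind Table 1 in Python. The record fields, function name and call structure are hypothetical (the text does not specify data structures); only the order of the tests, the gray-level threshold and the six-pixel minimum distance come from the source.

```python
def filter_local_distances(candidates, gray_threshold, min_distance=6):
    """Filtering chain behind Table 1 (hypothetical record fields).

    candidates: iterable of local-minimum records, each assumed to
    carry the attributes tested below. Returns the accepted local
    distances and a per-heuristic count of discarded candidates,
    mirroring the rows of Table 1.
    """
    discarded = {'minimum above threshold': 0, 'block border hit': 0,
                 'neighbor search failure': 0, 'different directions': 0,
                 'distance below threshold': 0}
    accepted = []
    for m in candidates:
        if m.gray_level > gray_threshold:
            discarded['minimum above threshold'] += 1
        elif m.too_close_to_border:
            discarded['block border hit'] += 1
        elif m.neighbor is None:
            discarded['neighbor search failure'] += 1
        elif m.direction_mismatch:
            discarded['different directions'] += 1
        elif m.distance < min_distance:
            discarded['distance below threshold'] += 1
        else:
            accepted.append(m.distance)
    return accepted, discarded
```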
Moreover, even if the direction can be extracted, the search for neighboring ridges may partially or totally fail. For the local distance to be computed, we further require that the direction of the neighboring ridges is not too different from the direction of the first one, and the local distances themselves are accepted only if they do not fall below a certain threshold d (six pixels
in the cases under consideration). Table 1 shows the average incidence of these phenomena in the reduction of the number of extracted distances. It can easily be seen how the gray-threshold computation is critical in establishing the "quality" of the extracted distances, followed by geometrical considerations such as border-effect avoidance and constraint enforcement on the extracted distance. A substantial agreement with ridge distances extracted by visual inspection has been verified on a random sample of 20 images from the database. Beyond this, the assessment of the performance of the two proposed methods has to consider the whole fingerprint recognition system in which the procedures are embedded. To do so, we consider a minutiae extraction system which provides minutiae to a classifier [19]. If the ridge distance is assumed constant and equal to the average over the database, the number of minutiae that are automatically extracted amounts to ≈75% of the number of real minutiae. When filtering and skeletonization exploit the estimation of the ridge distance provided by the proposed methodologies, this extraction rate increases to ≈90%. At an even higher level, we provided the classifier with one FPI and asked it to find the other FPI belonging to the same pair in the database (the "verification" task specific of allowance systems). When a fixed ridge distance is assumed and a false-positive rate of not more than 0.05% is required, the number of matches is not more than 77% of the total number of FPIs. If ridge distance estimation is introduced, this ratio increases to 85% with the same false-positive ratio.
7. Conclusion

Since the average ridge distance in fingerprint images is an important parameter in various problems and applications, several methodologies which have proved capable of tackling the problem of its computation were illustrated. The average ridge distance is used in fingerprint filter design, and in identification or classification systems. The average distance computation presented in this work is based on a two-step procedure: first, the average distance is defined in each significant portion of the image and then the complete distance map is obtained by propagating the distance values by means of a diffusion equation. We have developed two methods capable of extracting the distance values in the main portions of the image: a geometric method and a spectral one. In the geometric approach, central points of ridges are estimated on a regular grid and tangents to the ridges at these points are used. The second methodology is based on the computation of harmonic coefficients leading to effective estimates of the average ridge period. Moreover, since the mathematical definition of distance map completion is not unique, and is also ill-posed, a specific formulation was employed which minimizes value variations in the final distance map. Tests using the NIST SDB4 and other proprietary databases show the effectiveness of the approaches in about one-tenth of the overall computation time of a user identification system. On the NIST SDB4 database, the recognition performance achieved by the two systems employing the geometric and spectral approaches is substantially the same and leads to the extraction of the same number of minutiae with respect to human experts.
References

[1] B. Moayer, K.S. Fu, A syntactic approach to fingerprint pattern recognition, Pattern Recognition 7 (1975) 1–23.
[2] D.K. Isenor, S.G. Zaky, Fingerprint identification using graph matching, Pattern Recognition 19 (2) (1986) 113–122.
[3] F. Galton, Finger Prints, MacMillan, London, 1892.
[4] Q. Xiao, H. Raafat, A combined statistical and structural approach for fingerprint image postprocessing, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 1990, pp. 331–335.
[5] The Science of Fingerprints: Classification and Uses, United States Department of Justice, Federal Bureau of Investigation, Washington, rev. 12-84, 1988.
[6] W.-C. Lin, R.C. Dubes, A review of ridge counting in dermatoglyphics, Pattern Recognition 16 (1983) 1–8.
[7] L. O'Gorman, J.V. Nickerson, An approach to fingerprint filter design, Pattern Recognition 22 (1989) 28–38.
[8] D.C.D. Hung, Enhancement and feature purification of fingerprint images, Pattern Recognition 26 (1993) 1661–1671.
[9] C.I. Watson, C.L. Wilson, NIST Special Database 4, Fingerprint Database, National Institute of Standards and Technology, March 1992.
[10] W.K. Pratt, Digital Image Processing, Wiley Interscience, New York, 1991.
[11] J.S. Weszka, R.N. Nagel, A. Rosenfeld, A threshold selection technique, IEEE Trans. Comput. (1974) 1322–1326.
[12] J.S. Weszka, A. Rosenfeld, Histogram modification for threshold selection, IEEE Trans. Systems Man Cybernet. 9 (1979) 38–52.
[13] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Systems Man Cybernet. 9 (1979) 62–66.
[14] J. Hadamard, La théorie des équations aux dérivées partielles, Éditions Scientifiques, Pékin, 1964.
[15] A.N. Tikhonov, V.Y. Arsenin, Solution of Ill-Posed Problems, Winston and Wiley, Washington, 1977.
[16] V.A. Morozov, Methods for Solving Incorrectly Posed Problems, Springer, Berlin, 1984.
[17] R. Courant, D. Hilbert, Methods of Mathematical Physics, Interscience Publishers, New York, 1953.
[18] W.H. Press, B.P. Flannery, S.A. Teukolsky, W.T. Vetterling, Numerical Recipes, Cambridge University Press, New York, 1990.
[19] A. Farina, Zs. M. Kovács, A. Leone, Fingerprint minutiae extraction from skeletonized binary images, Pattern Recognition 32 (5) (1999) 877–889.
About the Author: ZSOLT M. KOVÁCS-VAJNA received his D. Eng. degree from the University of Bologna, Italy, in 1988. Since 1989 he has been with the Department of Electrical Engineering of the same University, where he received a Ph.D. in Electrical Engineering and Computer Sciences in 1994 for his research on optical character recognition and circuit simulation techniques. He is currently Assistant Professor in Electronics. His research interests include pattern recognition (OCR, ICR, fingerprint identification), neural networks and circuit simulation techniques. He is a member of the Institute of Electrical and Electronics Engineers (IEEE), of the International Association for Pattern Recognition (IAPR-IC) and of the International Neural Network Society (INNS).

About the Author: RICCARDO ROVATTI was born in Bologna, Italy, on 14 January 1969. He received the Dr. Eng. degree (with honors) in Electronic Engineering and the Ph.D. degree in Electronic Engineering and Computer Science from the University of Bologna, in 1992 and 1996, respectively. Since 1997 he has been a lecturer of Digital Electronics at the University of Bologna. He has authored or co-authored more than sixty international scientific publications. His research interests include fuzzy theory foundations, learning and CAD algorithms for fuzzy and neural systems, statistical pattern recognition, function approximation, non-linear system theory and identification, as well as applications of chaotic systems.

About the Author: MIRKO FRAZZONI received the D. Eng. degree in Electrical Engineering in 1996 at the University of Bologna, Italy. He is presently with Praxis, UK.
Pattern Recognition 33 (2000) 81–95
Calibrating a video camera pair with a rigid bar N. Alberto Borghese *, P. Cerveri Laboratory of Human Motion Study and Virtual Reality, Istituto Neuroscienze e Bioimmagini CNR, Via f.lli Cervi, 93 - 20090 Segrate, Milano, Italy Department of Bioengineering, Centro di Bioingegneria, Fondazione ProJuventute, Politecnico di Milano, Via Capecelotro, 66, 20148 Milano, Italy Received 4 June 1998
Abstract

In this paper a new procedure to determine all the geometrical parameters of a stereo-system is presented. It is based on surveying a rigid bar, carrying two markers at its extremities, moved inside the working volume; it does not require grids or complex calibration structures. The external parameters are estimated through the epipolar geometry up to a scale factor, which is determined from the true length of the bar. The focal lengths are determined using the properties of the absolute conic in the projective space. The principal points are computed through a non-linear minimisation carried out by an evolutionary optimisation. The accuracy of the method is assessed on real data and it compares favourably with that obtained through classical approaches based on control points of known 3D coordinates. 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Epipolar geometry; Evolution strategy; Calibration; Fundamental matrix
1. Introduction

Classical calibration techniques are based on surveying a 3D distribution of control points of known position [1–3]. To achieve a high accuracy, the control points should be positioned with extreme precision and distributed over the entire working volume [4]; therefore large, difficult-to-move and expensive calibration structures are usually required. A different approach is represented by self-calibration procedures, which do not require knowing in advance the 3D position of the control points. Classical photogrammetry offers the most general solution through the method of bundle adjustment [5,6]. This allows one to determine the 3D coordinates of the control points and the internal and external parameters of the set-up at the same time, through a non-linear minimisation of the distance between the reprojected and the measured 2D positions
* Corresponding author. Tel.: +39-02-21717548; fax: +39-02-21717558. E-mail address:
[email protected] (N. Alberto Borghese)
of the control points. In this method, the dimension of the design matrix and the number of unknowns increase linearly with the number of control points, leading to a huge computational load. Moreover, singularities in the design matrix can easily occur, and these may hamper the accuracy of the parameters. To avoid this, the distribution of the 3D control points should be carefully examined before starting the estimate. Finally, bundle adjustment requires a good initialisation of both the parameters and the 3D coordinates of the control points. A different approach to self-calibration has been developed under the framework of the so-called "structure from motion" (SfM) problem [3,7–10]. In this framework, the scene is surveyed by a moving camera; its motion, as well as the 3D position of a set of points in the scene, is determined from the image sequence. The internal parameters are supposed to be known. A linear solution to SfM problems was proposed by Longuet-Higgins [11]: it is based on the epipolar constraint, which expresses the condition that the two straight lines, through the 2D measured positions of a point and the perspective centres, lie on the same plane: the epipolar plane (cf. Fig. 1). This solution allows one to keep the design
0031-3203/99/$20.00 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S 0 0 3 1 - 3 2 0 3 ( 9 9 ) 0 0 0 3 3 - 3
Fig. 1. The geometrical arrangement of a stereo-pair. The point P is projected into p1 and p2 on the image planes of the two cameras. The optical axis intersects the image plane at a point c_j(x_{0j}, y_{0j}) called the principal point. The distance between the perspective centre and the image plane, f_j, is the focal length of the camera. The epipolar plane, Π, contains the points P, p1, p2, and the line C1C2 (the relative position T), which joins the perspective centres of the two cameras. The points of intersection between C1C2 and the image planes are the epipoles, e1 and e2. The intersection between Π and the two image planes originates two lines, lp1 and lp2, called epipolar lines. They are corresponding epipolar lines in the homography generated by P, C1 and C2.
matrix small (9×9). SfM solutions have recently been extended by Hartley [12,13] and Faugeras and coworkers [14–16] to estimate also the internal parameters, provided that the sequence consists of a sufficient number of images. The SfM framework has been linked to stereo-calibration by Borghese and Perona [17], who made the observation that the two cameras of a stereo-pair can be associated to a sequence of two images taken by a moving camera from two different positions. Under this hypothesis, only the focal lengths can be determined by SfM solutions [3,12], and the position of the two principal points has to be determined with a different procedure. As a first approximation, they can be assumed to coincide with the image centre but, due to imperfect assembling of the optical system, they can be offset by
several pixels. When high noise levels in the measurements are present, the contribution of the mislocation of the principal point to the 3D reconstruction can be neglected [18], and an accuracy sufficient for certain applications, like human-computer interfaces, can be attained. However, this approximation is not adequate when high accuracy is required. Several techniques have been proposed to determine the true position of the principal point. In Wei and Ma [19] the cross-ratio invariance over projection is exploited but, to get a reliable estimate of the principal point, purely radial lens distortion is required, which is often not the case. Different solutions are based on ad hoc devices, like a rigid cross filmed on the field [8] or laser-based systems [20,21]. The use of additional devices makes this kind of calibration procedure less friendly and less easy to perform.
In this paper a novel method to estimate all the calibration parameters of a stereo-pair is given. It does not require any initialisation (differently from bundle adjustment) and it allows one to determine the internal and external parameters, including the two principal points (differently from SfM approaches), using control points whose 3D position need not be measured. The results of real calibrations carried out in the field are reported and discussed.
2. Geometrical background

2.1. Geometrical model of the stereo system

The projection of a 3D point, P(X, Y, Z), on the image plane of video camera j, p_j(x, y), is described by the perspective equations which, in homogeneous notation, are [3]

$p_j = K_j M D_j P$ with $B_j = K_j M$,  (1)

where

$K_j = \begin{bmatrix} -f_j & 0 & x_{0j} \\ 0 & -f_j & y_{0j} \\ 0 & 0 & 1 \end{bmatrix}$,  (2)

f_j is the focal length of the camera and $(x_{0j}, y_{0j})$ are the coordinates of the principal point; $c_j(x_{0j}, y_{0j})$ and f_j constitute the internal parameters of the camera (cf. Fig. 1);

$M = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$,  (3)

$D = \begin{bmatrix} R & -RT \\ 0 & 1 \end{bmatrix}$.  (4)

R and T are the orientation (3×3 matrix) and the location vector of the camera with respect to a certain reference frame: they constitute the external parameters. Calibration is here defined as the determination of the internal and external parameters of the stereo-pair. Whenever the two focal lengths and the two principal points are known, the normalised target coordinates of a point can be introduced. They are

$\hat p_j = \begin{bmatrix} (x_j - x_{0j})/(-f_j) \\ (y_j - y_{0j})/(-f_j) \\ 1 \end{bmatrix}$  (5)

and they will be used in Section 2.4 to estimate the external parameters. Whenever only the principal points are known, the offset coordinates are defined as

$\tilde p_j = \begin{bmatrix} x_j - x_{0j} \\ y_j - y_{0j} \\ 1 \end{bmatrix}$.  (6)

They will be adopted in Section 2.3 to estimate the two focal lengths. For the sake of convenience, the absolute reference frame is taken as solid with one of the two cameras, with the X, Y axes parallel to those of its image plane. With this choice, Eq. (1) for the two cameras becomes

$p_1 = K_1 M P$,  (7a)
$p_2 = K_2 M D P$.  (7b)

These show that the internal and the external parameters can be factorised into the product of a matrix containing the internal parameters, K, by a matrix containing the external ones, D.
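As an illustration of the model of Eqs. (1)-(7), the following is a minimal NumPy sketch that builds K, M and D and projects a 3D point; the function names, the homogeneous-coordinate handling and the example usage are our own choices, not part of the paper.

```python
import numpy as np

def K_matrix(f, x0, y0):
    """Internal parameters, Eq. (2)."""
    return np.array([[-f, 0.0, x0],
                     [0.0, -f, y0],
                     [0.0, 0.0, 1.0]])

def D_matrix(R, T):
    """External parameters, Eq. (4): D = [R, -RT; 0, 1]."""
    D = np.eye(4)
    D[:3, :3] = R
    D[:3, 3] = -R @ T
    return D

M = np.hstack([np.eye(3), np.zeros((3, 1))])   # Eq. (3)

def project(P, K, D):
    """Perspective projection of Eq. (1), p = K M D P, in pixels."""
    p = K @ M @ D @ np.append(P, 1.0)
    return p / p[2]

def normalize(p, f, x0, y0):
    """Normalised target coordinates of Eq. (5)."""
    return np.array([(p[0] - x0) / -f, (p[1] - y0) / -f, 1.0])

# For the first camera D = I (Eq. (7a)); for the second, D carries
# the relative orientation R and position T (Eq. (7b)).
```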
2.2. The epipolar geometry

The epipolar geometry is based on the observation that each 3D point, P, and its projections, p1 and p2, lie on the same plane Π, called the epipolar plane (cf. Fig. 1). This plane intersects the two image planes along two lines, lp1 and lp2, called epipolar lines. By construction, a point p1 belonging to lp1 will correspond to lp2, and a point p2 belonging to lp2 will correspond to lp1. This condition is expressed analytically as follows:

$lp_2 = F p_1$,  (8a)
$lp_1 = F^T p_2$,  (8b)

where F is the fundamental matrix [3]. As the scalar product between a point and a line to which it belongs is equal to 0, the following relationship holds:

$p_2^T\, lp_2 = 0$  (9)

and, from Eqs. (8a) and (8b),

$p_2^T F p_1 = 0$,  (10)

which represents the epipolar plane, Π, identified by p1, p2 and C1C2. The epipoles, e1 and e2, are the points of intersection of the line through C1 and C2 with the two image planes. For these points the epipolar lines, le1 and le2, reduce to the epipoles themselves, and it holds that

$F e_1 = F^T e_2 = 0$.  (11)
This equation will be used to determine the focal lengths in Section 2.3. Matrix F can also be obtained through different considerations. Imposing that the three vectors C1C2, PC1 and PC2 are coplanar, the following relationship holds:

$\vec{PC_1} \cdot (T \times \vec{PC_2}) = 0$.  (12)

Defining the skew-symmetric matrix S, which represents the elements of T, as

$S = (T)_\wedge \Rightarrow S = \begin{bmatrix} 0 & -T_Z & T_Y \\ T_Z & 0 & -T_X \\ -T_Y & T_X & 0 \end{bmatrix}$  (13)

and taking into account Eqs. (7a) and (7b), the following homogeneous equation is obtained [9,11]:

$p_2^T K_2^{-T} R S K_1^{-1} p_1 = 0$  (14)

with

$F = K_2^{-T} R S K_1^{-1}$.  (15)

This contains all the calibration parameters factorised into the product of the internal (matrices K1 and K2) by the external ones (matrices R and S). When the internal parameters are known, the normalised coordinates of Eq. (5) can be used, and Eq. (14) assumes the following shape:

$\hat p_2^T E \hat p_1 = 0$ with $E = RS$.  (16)

E is named the essential matrix and it contains only the external parameters, which can be estimated according to the procedure described in Section 2.4. Algebraic considerations which limit to seven the number of independent parameters which can be estimated from Eq. (14) are reported in Appendix A. Of these seven parameters, five will be the relative orientation and location up to a scale factor, and the other two will be used to compute the focal lengths [22]. No other parameter can be directly estimated by this approach.

2.3. The fundamental matrix and the focal lengths

To estimate the internal parameters, the matrices K1 and K2 in Eq. (14) are isolated from R and T through considerations based on projective geometry. In particular, the correspondence between the lines tangent to the absolute conic is exploited [12,16]. The absolute conic, Ω∞, is given by the following equation:

$P^T P = 0$,  (17)

which has no real points (only imaginary points). It has the remarkable property of being invariant under Euclidean transformations and, in particular, under rotation and translation. Ω∞ is projected onto the image plane of camera j into the conic ρ_j (cf. Fig. 2), which is described by

$p_j^T K_j^{-T} K_j^{-1} p_j = 0$.  (18)

It can be shown that the problem of determining the internal parameters of a camera is equivalent to finding ρ_j [23]. Let us specialise the matrices K1 and K2 for the case when the principal points are known and only the focal lengths have to be determined. In this case the offset coordinates can be adopted (Eq. (6)), and K_j assumes the following shape:

$K_j = \begin{bmatrix} -f_j & 0 & 0 \\ 0 & -f_j & 0 \\ 0 & 0 & 1 \end{bmatrix}$.  (19)

Eq. (18) becomes

$\tilde p_j^T\, \mathrm{diag}(1, 1, f_j^2)\, \tilde p_j = 0$.  (20)

This matrix represents the imaginary circle, ρ_j, centred in the origin which, in homogeneous coordinates, is

$\rho_j{:}\; u_j^2 + v_j^2 + w_j^2 f_j^2 = 0$.  (21)

As the projection of the absolute conic depends only on the internal parameters (Eq. (18)), the reference system of each camera can be arbitrarily positioned. In particular, to get a simplified shape for the matrix F, the reference system is positioned such that the epipoles have coordinates [1, 0, 1]:

$e_1 = e_2 = [1, 0, 1]^T$.  (22)

With this reference system, taking into account Eq. (11), matrix F assumes the following simplified shape:

$F = \begin{bmatrix} a & b & -a \\ c & d & -c \\ -a & -b & a \end{bmatrix}$.  (23)
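Before turning to the focal lengths, here is a compact NumPy sketch of how the matrices of Eqs. (13), (15) and (16) fit together, and of the residual of the epipolar constraint of Eq. (10). It assumes K1, K2, R and T are already known; the helper names are ours.

```python
import numpy as np

def skew(T):
    """S = (T)^, Eq. (13): skew(T) @ x equals the cross product T x x."""
    return np.array([[0.0, -T[2], T[1]],
                     [T[2], 0.0, -T[0]],
                     [-T[1], T[0], 0.0]])

def essential(R, T):
    """Essential matrix E = R S, Eq. (16)."""
    return R @ skew(T)

def fundamental(K1, K2, R, T):
    """Fundamental matrix F = K2^{-T} R S K1^{-1}, Eq. (15)."""
    return np.linalg.inv(K2).T @ essential(R, T) @ np.linalg.inv(K1)

def epipolar_residual(p1, p2, F):
    """Residual of Eq. (10), p2^T F p1, for homogeneous pixel points;
    it vanishes for a noise-free corresponding pair (p1, p2)."""
    return float(p2 @ F @ p1)
```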
To compute f1 and f2, the properties of the correspondence between points and epipolar lines described by Eqs. (8a) and (8b) are exploited. Let us consider the lines lp1′ and lp1″ through e1 which are tangent to ρ1 (cf. Fig. 2). The two points of tangency are p1′ and p1″. They identify the polar of e1 with respect to ρ1, which has the following equation:

$\dfrac{\partial \rho_1}{\partial u}\Big|_{e_1} u + \dfrac{\partial \rho_1}{\partial v}\Big|_{e_1} v + \dfrac{\partial \rho_1}{\partial w}\Big|_{e_1} w = 0 \;\Rightarrow\; u + w f_1^2 = 0.$  (24)

The intersection of the line in Eq. (24) with the conic in Eq. (21) originates two imaginary points, p1′ and p1″, which
The polar of a point, p, with respect to a conic ρ, is the line joining the two points of tangency with ρ of the lines through p.
Fig. 2. The absolute conic, Ω∞, and its projection over the two image planes, ρ1 and ρ2. The intersection of the two tangent planes with the absolute conic gives the two points P′ and P″, which project into p1′ and p1″ on the first image plane and into p2′ and p2″ on the second one. Through these points the two pairs of epipolar lines, lp1′ and lp1″, and lp2′ and lp2″, are identified.
are computed as

$u = -w f_1^2, \quad v^2 = -w^2 f_1^2 (1 + f_1^2), \quad w = w.$  (25)

Introducing the imaginary unit i, and dividing the three homogeneous equations by w, the coordinates of p1′ and p1″ can be written as

$p_1' = [-f_1^2,\; i f_1 \sqrt{f_1^2 + 1},\; 1]^T$,  (26a)
$p_1'' = [-f_1^2,\; -i f_1 \sqrt{f_1^2 + 1},\; 1]^T$.  (26b)

The points p1′ and p1″ can be transformed through F into their corresponding epipolar lines, lp2′ and lp2″ (Eqs. (8a) and (8b)), which will be tangent to ρ2. Using relations (23) and (26), and taking into account that Eqs. (8a) and (8b) are defined up to a scale factor, Eq. (8a) can be written for the points p1′ and p1″ as

$F p_1' = \left[-1,\; \dfrac{-c\Delta + i d f_1}{a\Delta - i b f_1},\; 1\right]^T$,  (27a)

$F p_1'' = \left[-1,\; \dfrac{-c\Delta - i d f_1}{a\Delta + i b f_1},\; 1\right]^T$,  (27b)

where $\Delta = \sqrt{f_1^2 + 1}$. In compact notation, Eqs. (27a) and (27b) can be written as

$F p_1' = [-1,\; (k + it),\; 1]^T = lp_2'$,  (28a)
$F p_1'' = [-1,\; (k - it),\; 1]^T = lp_2''$,  (28b)
where

$k + it = \dfrac{-ac\Delta^2 - bd f_1^2 + i f_1 (ad - bc)\Delta}{a^2\Delta^2 + b^2 f_1^2}.$  (29)

As the epipoles were chosen on the u-axis (cf. Eq. (22)) and the conic ρ1 is a circle centred in the origin, the polar line will be parallel to the v-axis, and p1′ and p1″ will be symmetrical with respect to the u-axis. It follows that

$(k + it) = -(k - it) \;\Rightarrow\; k = 0.$  (30)

Imposing this condition on Eq. (29), f1 can be determined as

$f_1^2 = \dfrac{-ac}{ac + bd}.$  (31a)

With similar reasoning, the focal length of the second camera, f2, can be computed as

$f_2^2 = \dfrac{-ab}{ab + cd}.$  (31b)
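A minimal sketch of Eqs. (31a) and (31b) follows; a negative squared value corresponds to an imaginary focal length, which (as described in Section 3) happens when the candidate principal points are far off their true position. Returning None in that case is our convention, not the paper's.

```python
import numpy as np

def focal_lengths(a, b, c, d):
    """Focal lengths from the simplified F of Eq. (23),
    F = [[a, b, -a], [c, d, -c], [-a, -b, a]], via Eqs. (31a)-(31b)."""
    f1_sq = -a * c / (a * c + b * d)   # Eq. (31a)
    f2_sq = -a * b / (a * b + c * d)   # Eq. (31b)
    f1 = np.sqrt(f1_sq) if f1_sq > 0 else None   # None: imaginary value
    f2 = np.sqrt(f2_sq) if f2_sq > 0 else None
    return f1, f2
```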
The solution of Eqs. (31a) and (31b) holds as long as the two optical axes do not intersect. When they do, the principal points of the two cameras, c1 and c2, correspond to each other in the epipolar transformation (Eq. (10)). In this situation the following relationship holds:

$c_2^T F c_1 = 0 \;\Rightarrow\; [0, 0, w_2]\, F\, [0, 0, w_1]^T = 0$,  (32)

from which it can be seen that the four elements of F in Eq. (23) are linearly dependent. It follows that f1 and f2 can be determined only up to a scale factor. This suggests a certain care in setting up the cameras, although an exact intersection of the optical axes is practically impossible in real applications.

2.4. The essential matrix and the external parameters

Once the matrices K1 and K2 have been determined, the fundamental matrix F can be transformed into the essential matrix E, which contains only the external parameters. An elegant algebraic solution to determine R and T is based on the singular value decomposition (svd) of the matrix E [12,24]. Since the right kernel of E is equal to T (Eqs. (13) and (16)), the product ET is equal to zero. It follows that the vector T is parallel to the eigenvector associated with the smallest eigenvalue, w₃, of E. Expressing the essential matrix in terms of its svd,

$E = U W V^T$,  (33)

the relative position T can be determined, up to the sign and to a scale factor, as one of the two versors

$\hat T_1 = V_3$,  (34a)
$\hat T_2 = -V_3$,  (34b)

both of which satisfy Eq. (16). R can assume one of the two following forms:

$R_1 = U Z V^T$,  (35a)
$R_2 = U Z^T V^T$,  (35b)

with

$Z = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$,  (36)

both of which satisfy Eq. (16). Eqs. (35a) and (35b) allow one to determine R up to rotations of 180° around the rotation axis. From these four possible solutions the correct matrices, R and T, are obtained by constraining the reconstructed position of the 3D control points to be in front of the two cameras [11,24]. It should be remarked that T is obtained up to a multiplicative factor. This reflects the fact that the 3D world can be reconstructed through the epipolar constraint only up to a scale factor [2,8,11]: a large scene seen from a large distance by a stereo-system with a large inter-camera distance, and a smaller scene seen from a smaller distance by a stereo-system with a smaller inter-camera distance, have the same image on the target of the two cameras.
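A NumPy sketch of the decomposition of Eqs. (33)-(36) follows. The enforcement of det = +1 on the rotation candidates and the return of all four (R, T) pairs (to be disambiguated by the in-front-of-both-cameras test mentioned above) are our implementation choices.

```python
import numpy as np

Z = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])          # Eq. (36)

def decompose_essential(E):
    """The four (R, T) candidates of Eqs. (33)-(36).

    T is the right singular vector associated with the smallest
    singular value of E, defined up to sign and scale; the correct
    candidate is selected by requiring the reconstructed control
    points to lie in front of both cameras.
    """
    U, _, Vt = np.linalg.svd(E)
    T = Vt[-1, :]                        # Eq. (34a); -T gives Eq. (34b)
    R1 = U @ Z @ Vt                      # Eq. (35a)
    R2 = U @ Z.T @ Vt                    # Eq. (35b)
    # force proper rotations (det = +1); the flip keeps orthogonality
    R1 = R1 * np.sign(np.linalg.det(R1))
    R2 = R2 * np.sign(np.linalg.det(R2))
    return [(R1, T), (R1, -T), (R2, T), (R2, -T)]
```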
2.5. Determination of the 3D scale factor

In the solution through the essential matrix, ‖T‖ = 1 has been assumed (Eq. (34a)). The inter-camera distance is therefore taken as the norm of the 3D space, and the 3D point positions will be scaled according to this choice:

$k = \|T_t\|, \quad T_t = kT, \quad P_t = kP$,  (37)
where $P_t$ and $T_t$ are the true 3D position of a point and the true inter-camera distance, and P is the 3D position reconstructed with ‖T‖ = 1. To determine the value of k, a possibility is to survey a segment whose length is precisely known. Whenever control points are used, a pair of points whose distance is precisely known can be used instead of the segment. Calling these two points P1 and P2, it follows that

$(P_{t1} - P_{t2}) = k(P_1 - P_2) \;\Rightarrow\; d_t = kd \;\Rightarrow\; k = d_t/d.$  (38)
However, in the real situation, due to noise on the target coordinates, the 3D reconstruction of P1 and
P2 will generally not be exact. As a consequence, the distance, d, will not be precisely reconstructed and the value of k will be erroneous. To avoid this, multiple measurements are required and the scale factor can be determined as the mean value over a set of M distances between M different pairs of control points:

$k = E\!\left[\dfrac{d_t}{d}\right] = \dfrac{1}{M}\sum_{j=1}^{M} \dfrac{d_{tj}}{d_j}.$  (39)
In particular, when the control points are the extremes of a single rigid bar, Eq. (39) becomes

$k = \dfrac{1}{M}\sum_{j=1}^{M} \dfrac{d_t}{d_j}.$  (40)

3. Determination of the two principal points

Up to now, the position of the two principal points was supposed to be known. When this is not the case, we show here that it can be determined by adjusting the position of the principal points to minimise a figure of merit derived from geometrical considerations on the 3D reconstruction of the control points (indirect estimation).

3.1. Indirect estimation

When the two principal points are displaced with respect to their true value, the offset coordinates (Eq. (6)) will also be displaced and the estimate of the focal lengths and of the external parameters will be inaccurate. This reflects in an error in the reconstruction of the bar length, which will be minimum when the parameters have been correctly estimated. The mean absolute error on the bar length, defined as

$e_B = \dfrac{1}{M}\sum_{j=1}^{M} |d_j - d_t|$,  (41)

will be considered here. Another geometrical quantity affected by the displacement of the principal points is the distance between the two straight lines through the perspective centres and the 2D measured points (p1C1 and p2C2 in Fig. 1). In the ideal case, these lines do intersect but, in practice, due to measurement errors on p1 and p2, they will generally be skewed. As a consequence, Eq. (10) will not be exactly satisfied and it should be rewritten as

$p_2^T F p_1 = e_c$,  (42)

where $e_c$ is the coplanarity error. This error has been widely used to solve the correspondence problem [25,26] and it has recently been proposed in the calibration framework [27]. These considerations have led us to formulate the estimate of the internal parameters as a non-linear minimisation problem of the following cost function:

$\min_{c_1, c_2}\left[ a\,\dfrac{1}{N/2}\sum_{j=1}^{N/2} |d_j - d_t| + b\,\dfrac{1}{N}\sum_{i=1}^{N} |p_{2i}^T F p_{1i}| \right]$,  (43)

where N is the number of control points. The second term does not allow one to determine the right norm of the 3D space (Eq. (10)) and it can be used only as a refinement of the solution which, in its bulk part, is obtained through the first term. Therefore b ≪ a should hold. In the following, a = 1 and b = 0.1 have been adopted.
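The sketch below evaluates the cost of Eq. (43) for one candidate pair of principal points; the array shapes, the helper name and the use of einsum are our assumptions, and the reconstructed bar lengths d_rec are supposed to be computed beforehand through the current calibration parameters.

```python
import numpy as np

def fitness(d_rec, d_true, p1, p2, F, a=1.0, b=0.1):
    """Cost function of Eq. (43).

    d_rec  : array of reconstructed bar lengths, one per bar position
    d_true : true bar length
    p1, p2 : (N, 3) arrays of corresponding homogeneous image points
    F      : fundamental matrix for the candidate principal points
    The coplanarity term (Eq. (42)) only refines the solution, so b << a.
    """
    bar_term = np.mean(np.abs(np.asarray(d_rec) - d_true))
    coplanarity = np.abs(np.einsum('ij,jk,ik->i', p2, F, p1))  # p2^T F p1
    return a * bar_term + b * np.mean(coplanarity)
```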
3.2. Evolutionary optimisation and the principal points

The above formulation leads to a highly non-linear implicit function of c1 and c2. These in fact influence the value of the other calibration parameters which, in turn, influence the 3D point positions and therefore the value of the cost function. The global minimum of Eq. (43) is therefore hard to find with classical gradient-based methods. In this paper a solution offered by evolutionary optimisation, which has proved particularly successful in the solution of difficult optimisation problems, is exploited [28]. In this domain, the search for the optimal solution is guided by the evaluation of the cost function alone. A single solution is represented as an individual, the ensemble of possible solutions as a population, and the cost associated with each solution as the degree of fitness of the population element associated with the solution. Before starting the search for the optimal pair of principal points, a search region is defined. Taking into account that the maximum offset of the principal points with respect to the image centre is ±T, a hyperbox search region, H, is created in the four-dimensional space of the two principal points ([c1 | c2]):

$H = [\,c_1 \pm T \;|\; c_2 \pm T\,].$  (44)
The optimisation strategy is initialised by generating inside H a random set of NP pairs of principal points, F_A = [c1 | c2] ∈ H, called the "fathers" of the "population". For these points, the external parameters, the focal lengths and the 3D scale factor are determined following the procedures reported in Sections 2.3-2.5, and the "fitness" is determined through Eq. (43) for each father. Notice that for some of the fathers an imaginary value of the focal lengths may come out of Eqs. (31a) and (31b). This happens when the principal points are far off the true position. In this case a high cost is associated with the related fathers. After the fitness has been computed for all the fathers, a second set, S, of NP pairs of principal points, called "sons", is obtained by "mutating" the set F_A. The fitness of the sons is evaluated through Eq. (43) as for the fathers. In the mutation process, each father is displaced by a random quantity, z_i(0, σ), where z_i is a Gaussian
variable in the four-dimensional space with zero mean and standard deviation σ, obtaining

$[c_1 | c_2]_i^{k+1} = [c_1 | c_2]_i^k + h^k z_i(0, \sigma)$,  (45)

where k indicates the kth generation. σ and h^k concur in determining the amplitude of the displacement for each father. The fitness of each son is compared with that of the father from which it was generated, and the one of the two elements which exhibits the best fitness will serve as father for the next "generation". In more evolutionistic terms: the father and his son are placed in competition for survival, and selection eliminates the weaker for the next generation [29], that is, the one which gives the poorer solution. The process of generating new sons and selecting those which have the best fitness is iterated until the fitness does not increase anymore. To increase the resolution in the determination of the principal points, a deterministic annealing scheduling [30] of the search-region amplitude has been introduced by setting h^k as

$h^k = \dfrac{h}{\sqrt{\ln(k+1)}}.$  (46)
With this setting, the amplitude of the search region (Eq. (44)) decreases with the generations, allowing a denser sampling of the best region. The role of σ is to restrict the search region if a meaningful number of sons have given a better fitness, or to enlarge it if the fathers were fitter. For this purpose, the value of σ is set according to the ratio, r, between the number of winning sons and that of winning fathers [29], as follows:

$r < 1/5{:}\;\; \sigma^{k+1} = \sigma^k / c$,
$r = 1/5{:}\;\; \sigma^{k+1} = \sigma^k$,  (47)
$r > 1/5{:}\;\; \sigma^{k+1} = \sigma^k c$,

with $c = 0.85^{1/D}$, where D is the dimension of the solution space, in our case D = 4. To speed up convergence and to avoid bias in the results due to the particular set of calibration points adopted, not all the control points are used at the same time. At each generation, a different sub-set of points is extracted and used for calibration, and a different sub-set is extracted and used for evaluating the fitness. This procedure is commonly used in statistics and goes under the name of bootstrap [31].
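The following is a minimal sketch of the evolution strategy just described. The function signature, the cost-evaluation callback, the random seed, the default parameter values, and the exponent 1/D in the step-size constant c (garbled in the printed equation) are our assumptions.

```python
import numpy as np

def evolve_principal_points(eval_cost, centre, T=45.0, NP=100,
                            generations=40, sigma=10.0, D=4):
    """Evolution strategy of Section 3.2 for the two principal points.

    eval_cost maps a 4-vector [c1x, c1y, c2x, c2y] to the cost of
    Eq. (43) (lower is better); centre is the image-centre estimate.
    """
    c = 0.85 ** (1.0 / D)                    # step-size constant, Eq. (47)
    rng = np.random.default_rng(0)
    fathers = centre + rng.uniform(-T, T, size=(NP, D))  # region H, Eq. (44)
    cost = np.array([eval_cost(f) for f in fathers])
    for k in range(1, generations + 1):
        h = 1.0 / np.sqrt(np.log(k + 1))     # annealing schedule, Eq. (46)
        sons = fathers + h * rng.normal(0.0, sigma, size=(NP, D))  # Eq. (45)
        son_cost = np.array([eval_cost(s) for s in sons])
        win = son_cost < cost                # father/son competition
        fathers[win], cost[win] = sons[win], son_cost[win]
        r = win.mean()                       # fraction of winning sons
        sigma = sigma / c if r < 0.2 else sigma * c if r > 0.2 else sigma
    best = int(np.argmin(cost))
    return fathers[best], cost[best]
```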
4. Summary of the calibration procedure

The entire calibration procedure can be summarised as follows (Fig. 3):
1. A set of NC points, {P_NC}, extremes of a rigid bar, is surveyed by a pair of cameras, and their positions on the image planes are measured as {p1_NC} and {p2_NC}. These constitute the set of control points.
2. A population of NP pairs of principal points, F_A = [c1 | c2], is randomly generated inside the search area H. These constitute the fathers of the population.
3. The reference fitness is set to 0.
4. A subset of M (< NC/2) pairs of control points, {P_C}, {p1_C} and {p2_C}, is extracted from the set {P_NC}. This will be used to estimate the focal lengths, the external parameters and the object scale factor.
5. A second subset of T (< NC/2) pairs of points, {P_T}, {p1_T} and {p2_T}, is extracted from the set {P_NC}. This will be used to determine the fitness and it will serve as the test set.
6. A set of NP fundamental matrices, {F}, is computed from {p1_C} and {p2_C}, one for each element of F_A.
7. A set of NP pairs of focal lengths, {f1} and {f2}, is determined as reported in Section 2.3. When an imaginary value results for the focal length, a low value of fitness is associated with the related pair of fathers.
8. The coordinates of {p1} and {p2} are normalised through {f1} and {f2} (Eq. (5)). A set of NP essential matrices, {E}, is computed.
9. A set of NP relative orientations, {R}, and normalised relative positions, {T} (with ‖T‖ = 1), is computed from the matrices {E} according to Section 2.4.
10. A set of NP 3D scale factors, {k}, is determined.
11. The 3D positions of the test points are reconstructed through the estimated parameters {R}, {T}, {k}, {f1}, {f2} and F_A, originating NP sets of 3D points constituted of NT points each.
12. The cost function of Eq. (43) (fitness) is computed for each set of 3D test points.
13. A new population of NP pairs of principal points, the sons, S_A, is generated inside the search area H by random mutation of the fathers set.
14. The internal and external parameters, and the fitness values associated with the sons, are determined according to steps 6-12.
15. If the cost function is lower than the actual reference fitness, the reference fitness is updated and the corresponding parameters are saved (cf. Fig. 3, block diagram on the left).
16. The one between each father and its son which has the best fitness wins and will serve as father in the next generation. With these new fathers, steps 4-12 are repeated.

The search for the principal points (and for the other parameters) ends when the cost function drops under a predefined threshold or does not decrease anymore.
Fig. 3. The flow chart of the calibration procedure is reported on the left. On the right, the flow chart of the calibration and fitness-evaluation block is exploded into its main components.
5. Experimental results The accuracy obtained with this calibration method has been evaluated on real data and compared with
that obtained through calibration methods based on control points of known 3D coordinates and on the iterative solution of the perspective equations (ILSSC) [4,10,32].
To calibrate, a rigid bar is moved inside the working volume while it is surveyed by a pair of video cameras. A small passive marker is placed on each of its extremities. The marker coordinates on the image planes of the two cameras are measured automatically by the Elite system [33]. This recognises the markers owing to a template-matching algorithm implemented on a custom VLSI chip [34]. The correspondence between the points on the two cameras has been carried out automatically through the Smart3D tracking system [35]. With this procedure, a very large number of calibration points, spread inside the working volume, can be collected in a very short time. To increase the accuracy, the main distortion component, which is due to the different scales on the two target axes [2], has been corrected by multiplying the horizontal target coordinate by a shrinkage factor obtained through a 2D in-house calibration. This factor, which for the Elite system is 1.4779, is a characteristic of the cameras, as it depends only on the electronics and on the target dimensions, and it does not depend on the lenses. Therefore the same shrinkage factor applies to different optics, different apertures and different focusing of the cameras. The same in-house calibration allows the determination of the search region H (Eq. (44)) as a hyperbox centred at the point C = [(128, 128) | (128, 128)] with side T = ±45 pixels. Two experiments have been carried out, whose quantitative results are reported in Tables 1 and 2. In the first experiment, a zoom lens was mounted on the cameras (focal length approximately 30 mm) and a bar carrying two markers 99.1 mm apart was used. In the second one, a pair of wide-angle lenses with focal length 8.5 mm and a bar carrying two markers 199.8 mm apart were adopted. The two cameras were positioned approximately at the same height with a relative orientation of 60°. Two sets of control points were acquired for each experiment. The first set, S1, was obtained surveying the
bar in motion inside the calibration volume for 40 s, collecting a total of 4000 matched pairs of 2D points for each camera. After discarding those frames where the two markers were not visible on both cameras, a total of NC = 2749 and NC = 2530 calibration points, respectively, were left for calibration. From this set of data points, a sub-set of M = 150 pairs of calibration points and a sub-set of NT = 100 pairs of test points were randomly extracted at each optimisation step of the calibration procedure. The second set of control points, S2, consists of markers positioned on a planar grid surveyed in three known parallel positions inside the working volume. For the first experiment, each grid carried 30 markers (5×6) with an inter-marker distance of 50 mm and an inter-plane distance of 200 mm; in the second experiment, each grid carried 56 markers (7×8) with an inter-marker distance of 150 mm and an inter-planar distance of 400 mm. The camera set-up was the same used for the acquisition of the bar. The accuracy was quantitatively assessed as follows: (a) The 3D coordinates of the markers on the bar extremities (set S1 of control points) are reconstructed with the parameters determined through ILSSC and through bar calibration. From the 3D coordinates of the markers, their 3D distance is determined frame by frame, obtaining the sets of distances D1 and D2, respectively. The difference between the true bar length and D1 or D2 is computed. (b) The 3D coordinates of the points on the grids (set S2 of control points) are reconstructed with the parameters determined through ILSSC and through bar calibration. These coordinates cannot be directly compared because the reference systems are different: for ILSSC it is solid with the calibration grids, while for bar calibration it is solid with one of the two cameras. Relative measurements have therefore been adopted: the error in the X and Y directions is
Table 1
The accuracy obtained with ILSSC and bar calibration when the cameras were equipped with zoom lenses (focal length 30 mm). The data from two different calibrations using the bar are indicated as Bar, 1st and 2nd calibration.
Bar length: 99.1 mm
Grid: 30 markers (5×6), surveyed in three parallel positions 200 mm apart
Distance between two consecutive markers on the grid: 50 mm
Calibrated volume: 0.25×0.20×0.40 m
Working volume: 0.6×0.9×0.9 m
Evolutionary optimisation: 100 elements and 40 generations
                     ILSSC        Bar, 1st calibration   Bar, 2nd calibration
Bar, mean error      0.00±0.50    0.00±0.25              0.00±0.21
Grid, x error        0.04±0.84    0.00±0.81              0.02±0.81
Grid, y error        0.02±0.07    0.05±0.02              0.04±0.02
Grid, z error        0.07±0.61    0.15±0.58              0.00±0.58
Table 2
The accuracy obtained with ILSSC and bar calibration when the cameras were equipped with wide-angle lenses (focal length = 8.5 mm). The data from two different calibrations using the bar are indicated as Bar, 1st and 2nd calibration.
Bar length: 199.8 mm
Grid: 56 markers (7×8), surveyed in three parallel positions 400 mm apart
Distance between two consecutive markers on the grid: 150 mm
Calibrated volume: 0.9×0.9×0.8 m
Working volume: 1.3×1.2×1.5 m
Evolutionary optimisation: 100 elements and 40 generations
                     ILSSC        Bar, 1st calibration   Bar, 2nd calibration
Bar, mean error      0.00±1.12    0.00±1.12              0.01±1.16
Grid, x error        0.26±0.93    0.47±1.08              0.56±0.93
Grid, y error        0.18±0.07    0.15±0.02              0.08±0.04
Grid, z error        0.59±2.94    0.60±2.99              0.52±2.99
Fig. 4. The reconstructed distances between the two markers on the bar extremes, frame by frame. They have been computed from the parameters obtained with bar calibration (a) and with ILSSC (b). The data refer to the second experiment (wide-angle lens with focal length = 8.5 mm).
computed as the mean and standard deviation of the error on the distance between each pair of consecutive markers on the horizontal and vertical marker lines of the calibration grid. The error in the Z direction is computed as the mean and standard deviation of the error in the distance between each pair of markers in the same position on two consecutive parallel grids. As can be seen in Tables 1 and 2, and in Fig. 4, bar calibration gives an accuracy which is equal or even slightly superior to that of ILSSC, both on the bar length and on the relative distances on the grids. An example of the sequence of principal points examined by the evolutionary optimisation is reported in Figs. 5a and b. Although the search is carried out in the whole search region, H (Eq. (44)), the ensemble of the principal points is driven into a small target region which includes
the true position of the two principal points. This assures a high resolution in the estimate. Moreover, as the region H is explored in parallel, using NP elements at each step, the error decreases very fast. The resulting calibration parameters are already adequate at the first step when high accuracy is not required. As can be seen in Fig. 5c, for the best pair of principal points the mean error is only 6.5 times the final error on the grid and 1.7 times on the bar length. The total time required by the forty optimisation steps of the algorithm is 74 s on a Pentium MMX, 200 MHz.

6. Discussion and conclusion

The only metric information required by bar calibration is the true 3D distance between the bar extremes, and this is in fact a critical parameter. When the distance is not given correctly, a degradation in the accuracy is
Fig. 5. The trajectory of the principal points is reported in (a) for TV1 and in (b) for TV2. The initial position of each pair of principal points is plotted as empty circles and the final one as filled circles. The 2D coordinates are expressed in target units (t.u.). In (c), the fitness, the mean error on the bar length and the mean intersection error are reported for each step of the calibration in experiment 1.
introduced. However, as this measurement is taken only once, it is reasonable to assume that it can be performed with adequate accuracy. Another critical parameter is the angle between the optical axes of the two cameras. When it departs from 90°, the accuracy decreases with a power law, up to 6 times worse for angles of 20° or 160° (e.g. Fig. 3 of Ref. [32]) [16,32,36]. This can be a problem for bar calibration, where the 3D reconstruction is an essential part of the calibration procedure, while it was not for ILSSC, where each camera is calibrated separately. Therefore, when the relative angles are very small or very large, many control points are required to achieve a good estimate. When the stereo-pair has been properly set up and the length of the bar is measured with adequate precision, the accuracy of bar calibration is comparable and even superior to that obtained with points of known 3D coordinates. This allows a simple rigid bar to be substituted for the cumbersome calibration structures required by classical calibration methods, and it suggests introducing bar calibration as a standard tool for those stereo systems which have to be calibrated before use.
Appendix A. What is represented in the fundamental matrix?

We summarise here the considerations of projective geometry which limit to seven the number of independent parameters which can be determined using the epipolar constraint. From Eq. (10) it can be seen that the fundamental matrix F is not of full rank. Let $e_1 = [u_{e1}, v_{e1}, w_{e1}]$ and $e_2 = [u_{e2}, v_{e2}, w_{e2}]$ be the two epipoles. Let us now consider a point p1 on the image plane of the first camera. The epipolar line lp1 to which p1 belongs (cf. Fig. 1) has the homogeneous representation $[l_u, l_v, l_w]^T = (e_1 \times p_1)$ [3]. The transformations p1 → lp1 and p2 → lp2 are therefore linear and they can be expressed as

$lp_j = [l_{uj}, l_{vj}, l_{wj}]^T = Q_j p_j$,  (A.1)

where $Q_j$ is a 3×3 matrix of rank two, function of the epipole coordinates $[u_{ej}, v_{ej}, w_{ej}]$, two of which are independent:

$Q_j = (e_j)_\wedge = \begin{bmatrix} 0 & -w_{ej} & v_{ej} \\ w_{ej} & 0 & -u_{ej} \\ -v_{ej} & u_{ej} & 0 \end{bmatrix}$.  (A.2)

We now search for the transformation between two corresponding epipolar lines on the two image planes. This is defined by a 3×3 matrix, G, defined up to a scale factor, such that

$lp_2 = G\, lp_1$.  (A.3)

Noticing that $e_2^T lp_2 = 0$, it follows that

$e_2^T G Q_1 p_1 = 0$.  (A.4)

Since $Q_1 p_1 \neq 0$ (Eq. (A.1)), it follows that the right kernel of G must be equal to $Q_2^T$ up to a scale factor [37]:

$G = Q_2^T J$  (A.5)

and the following relationships between p1 and p2, and lp1 and lp2, hold:

$p_2^T Q_2^T J Q_1 p_1 = 0 \;\Rightarrow\; F = Q_2^T J Q_1$,  (A.6a)
$lp_2^T J\, lp_1 = 0$.  (A.6b)

J represents a projective mapping between two corresponding epipolar lines [3]. As this transformation has the property of preserving the cross ratio, it will be uniquely identified by four parameters (a, b, c, d), three of which are independent. Every pair of corresponding epipolar lines can be identified as

$t_2 = -\dfrac{a t_1 + c}{b t_1 + d}$,  (A.7)

where $t_1 = l_{u1}/l_{v1}$ and $t_2 = l_{u2}/l_{v2}$ are the projective coordinates of the two corresponding epipolar lines ($l_{w1}$ and $l_{w2}$ have been assumed equal to one). It follows that the matrix J assumes the following expression:

$J = \begin{bmatrix} a & b & 0 \\ c & d & 0 \\ 0 & 0 & 0 \end{bmatrix}$  (A.8)

and, from Eq. (A.6a), the fundamental matrix can be written as

$F = \begin{bmatrix} d & -c & -(d u_{e1} - c v_{e1}) \\ -b & a & -(a v_{e1} - b u_{e1}) \\ -(d u_{e2} - b v_{e2}) & -(a v_{e2} - c u_{e2}) & v_{e1}(a v_{e2} - c u_{e2}) + u_{e1}(d u_{e2} - b v_{e2}) \end{bmatrix}$.  (A.9)

Thus the epipolar transformation can be defined by the four independent coordinates of the two epipoles and the three parameters which define the bilinear transformation between corresponding epipolar lines, for a total of seven independent parameters. As the right-hand factor of F is the matrix $Q_1 = (e_1)_\wedge$, the product $F e_1$ is equal to zero (the same is true for $e_2$). This represents the fact that the epipoles do not have a corresponding epipolar line on the other camera.
References

[1] P.R. Wolf, Elements of Photogrammetry, McGraw-Hill, New York, 1983.
[2] J. Weng, P. Cohen, P.M. Henriou, Camera calibration with distortion models and accuracy evaluation, IEEE Trans. Pattern Anal. Machine Intell. 14 (10) (1992) 965–979.
[3] O.D. Faugeras, Three-Dimensional Computer Vision, MIT Press, Cambridge, MA, 1993.
[4] J. Weng, T.S. Huang, N. Ahuja, Motion and structure from two perspective views: algorithms, error analysis and error estimation, IEEE Trans. Pattern Anal. Machine Intell. 11 (11) (1989) 451–476.
[5] J.F. Kenefick, Ultra-precise analytics, Photogramm. Engng. 37 (1971) 1167–1187.
[6] A. Gruen, H.A. Beyer, System calibration through self-calibration, Invited paper, Workshop on Camera Calibration and Orientation in Computer Vision, XVIIth ISPRS Congress, Washington, DC, 1992.
[7] S. Ullman, The Interpretation of Visual Motion, MIT Press, Cambridge, MA, 1979.
[8] J. Dapena, A.H. Everett, J.A. Miller, Three-dimensional cinematography with control object of unknown shape, J. Biomech. 15 (1982) 11–19.
[9] B.K. Horn, Robot Vision, MIT Press, Cambridge, MA, 1986.
[10] J. Weng, T.S. Huang, N. Ahuja, Optimal motion and structure estimation, IEEE Trans. Pattern Anal. Machine Intell. 15 (9) (1993) 864–884.
[11] H.C. Longuet-Higgins, A computer algorithm for reconstructing a scene from two projections, Nature 293 (1981) 133–135.
[12] R.I. Hartley, Estimation of relative positions for uncalibrated cameras, Computer Vision – ECCV '92, Lecture Notes in Computer Science Series, vol. 588, Springer, Berlin, 1992, pp. 579–587.
[13] R.I. Hartley, Projective reconstruction and invariants from multiple images, IEEE Trans. Pattern Anal. Machine Intell. 16 (10) (1994) 1036–1041.
[14] O.D. Faugeras, Q.T. Luong, S.J. Maybank, Camera self-calibration: theory and experiments, Computer Vision – ECCV '92, Lecture Notes in Computer Science Series, vol. 588, Springer, Berlin, 1992, pp. 321–334.
[15] S.J. Maybank, O.D. Faugeras, A theory of self-calibration of a moving camera, Int. J. Comput. Vision 8 (2) (1992) 123–151.
[16] Z. Zhang, Determining the epipolar geometry and its uncertainty: a review, Int. J. Comput. Vision 27 (2) (1998) 161–195.
[17] N.A. Borghese, P. Perona, Calibration of a stereo system with points of unknown location, Proceedings of the XIVth International Conference Soc. Biomech., ISB, Paris, 1993, pp. 202–203.
[18] R.K. Lenz, R.Y. Tsai, Techniques for calibration of the scale factor and image center for high accuracy 3D machine vision metrology, IEEE Trans. Pattern Anal. Machine Intell. 10 (5) (1988) 713–720.
[19] G. Wei, S.D. Ma, Implicit and explicit camera calibration: theory and experiments, IEEE Trans. Pattern Anal. Machine Intell. 16 (5) (1994) 469–480.
[20] R. Tsai, A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE J. Robotics Automat. 3 (4) (1987) 35–47.
[21] S. Shah, J.K. Aggarwal, Intrinsic parameter calibration procedure for a (high distortion) fish-eye lens camera with distortion model and accuracy estimation, Pattern Recognition 29 (11) (1996) 1775–1788.
[22] Q.T. Luong, O.D. Faugeras, The fundamental matrix: theory, algorithms, and stability analysis, Int. J. Comput. Vision 17 (1) (1996) 43–76.
[23] O.D. Faugeras, Stratification of three-dimensional vision: projective, affine, and metric representations, J. Opt. Soc. Am. A 12 (3) (1995) 465–484.
[24] S. Soatto, P. Perona, R. Frezza, G. Picci, Recursive motion and structure estimation with complete error characterization, ECCV '92, Lecture Notes in Computer Science Series, vol. 588, Springer, Berlin, 1993, pp. 428–433.
[25] N.A. Borghese, SMART3D: tracking of movements surveyed by a multiple set of TV cameras, Proceedings of the Tenth ISBS Symposium, 1992, pp. 108–111.
[26] X. Hu, N. Ahuja, Matching point features with ordered geometric, rigidity, and disparity constraints, IEEE Trans. Pattern Anal. Machine Intell. 16 (10) (1994) 1041–1049.
[27] R.I. Hartley, In defense of the 8-point algorithm, IEEE Trans. Pattern Anal. Machine Intell. 19 (6) (1997) 580–593.
[28] D.B. Fogel, An introduction to simulated evolutionary optimization, IEEE Trans. Neural Networks 5 (1) (1994) 3–14.
[29] T. Bäck, G. Rudolph, H.P. Schwefel, Evolutionary programming and evolution strategies: similarities and differences, Proceedings of the Second Annual Conference on Evolutionary Programming, 1993, pp. 11–22.
[30] K. Rose, F. Gurewitz, G. Fox, Statistical mechanics and phase transitions in clustering, Phys. Rev. Lett. 65 (8) (1990) 945–948.
[31] P.J. Huber, Robust Statistics, Wiley, New York, 1981.
[32] N.A. Borghese, G. Ferrigno, An algorithm for 3D automatic movement detection by means of standard TV cameras, IEEE Trans. Biomed. Engng 37 (12) (1990) 1221–1225.
[33] G. Ferrigno, A. Pedotti, ELITE: a digital dedicated hardware system for movement analysis via real-time TV signal processing, IEEE Trans. Biomed. Engng 32 (1985) 943–950.
[34] N.A. Borghese, M. Di Rienzo, G. Ferrigno, A. Pedotti, Elite: a goal-oriented vision system for moving objects detection, Robotica 9 (1990) 275–282.
[35] N.A. Borghese, SMART3D: tracking of movements surveyed by a multiple set of TV cameras, Proceedings of the Tenth ISBS Symposium, Hermes Edition, Milano, 1992, pp. 108–111.
[36] J. Oliensis, Rigorous bounds for two-frame structure from motion, Proceedings ECCV '96, Lecture Notes in Computer Science Series, vol. 1065, Springer, Berlin, 1996, pp. 184–195.
[37] G.H. Golub, C.F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, MD, 1991.
About the Author: ALBERTO BORGHESE received the "laurea" in Electrical Engineering from the Politecnico di Milano with 100/100 cum laude and is currently Director of the Laboratory for Human Motion Study and Virtual Reality at the Institute of Neuroscience and Bioimages of CNR, Milano. He has been a visiting scientist at the Center for Neural Engineering of the University of Southern California and at the Department of Electrical Engineering of the California Institute of Technology. His research interests include quantitative human motion analysis, 3D modeling from range data, animation in virtual reality and artificial learning systems.

About the Author: PIETRO CERVERI received the "laurea" in Electrical Engineering from the Politecnico di Milano in the academic year 1993-1994. He is now a Ph.D. student at the Department of Bioengineering, Politecnico di Milano. His research is focused on image processing and retrieval applied to the Visible Human Data set.
Pattern Recognition 33 (2000) 97–104
Grading meat quality by image processing Kazuhiko Shiranita *, Kenichiro Hayashi , Akifumi Otsubo , Tsuneharu Miyajima, Ryuzo Takiyama Saga Prefectural Industrial Technology Center, 114 Yaemizo, Nabeshima, Saga 849-0932, Japan Saga Prefectural Livestock Experiment Station, 23242 Miyano, Yamauchi, Kishima 849-2305, Japan Kyushu Institute of Design, 4-9-1 Shiobaru, Minami-ku, Fukuoka 815-8540, Japan Received 16 June 1998; accepted 14 December 1998
Abstract

We study the implementation of a meat-quality grading system, using the concept of the "marbling score", as well as image processing, neural network techniques and multiple regression analysis. The marbling score is a measure of the distribution density of fat in the rib-eye region. We identify five features used for grading meat images. For the evaluation of the five features, we propose a method of image binarization using a three-layer neural network developed on the basis of inputs given by a professional grader, and a system of meat-quality grading based on the evaluation of two of the five features with multiple regression analysis. Experimental results show that the system is effective. 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Image processing; Neural network; Multiple regression analysis; Meat quality; Grading system; Marbling score
1. Introduction

The grade of meat traded in Japan is standardized by the Japan Meat Grading Association [1]. Generally, the determination of the grade of meat quality is carried out by visual inspection, which involves the collation of the actual meat with standard images of each grade, by authorized experts called graders. Since grading is performed in a refrigerator at low temperatures, the work is very difficult for the graders. Therefore, the development of a grading system which does not depend on visual inspection has long been awaited. The grade of meat is judged by taking two factors into account: one is the yield rate of the meat to be merchandised and the other is the grade of the meat quality. There are four features that determine the meat quality: marbling score, muscle color, fat color and tightness of the meat. In particular, the marbling score is the dominant parameter in deciding the meat quality.
* Corresponding author. Tel.: +81-952-30-8161; fax: +81-952-25-1694. E-mail address: [email protected] (K. Shiranita)
The marbling score is a measure of the distribution density of fat in the rib-eye region. The graders determine the grade of the marbling score by comparing the meat with the standard images of each grade, based on their experience. In the automatization of grading by the marbling score, the comparison of the meat with the standard images is the fundamental step. To date, several studies on grading systems based on the marbling score have been reported [2–4]. In these studies, the marbling score was determined by calculating the percentage of fat in the rib-eye region. The results, however, have not been satisfactory. On the other hand, the application of image processing to automatizing visual inspection, in areas which have long depended on the experience and intuition of professional workers, has been attracting considerable attention. For example, the development of inspection systems based on intuitive information about a product, such as its quality, properties and perceived quality, is required. Recently, studies on a system to evaluate pearl quality [5,6] and on an automatic inspection system for the shade of color in colored printed matter [7] were performed. In these studies, in order to develop a system with abilities highly similar to
those of professional workers, the experience and know-how of the workers were studied, and the results were utilized in the system development. This is necessary and highly important for accurate system development in these fields. In this work, we study the implementation of a meat-quality grading system, using the concept of the marbling score, as well as image processing, neural network techniques [8] and multiple regression analysis [9], while considering the methods used by professional graders. In Section 2, we describe the basis on which graders determine the grade of meat, drawing on the results of a questionnaire put to graders. From the results, we clarify that graders extract five features from the fat region of a meat image, and we present a design for a meat-quality grading system based on these findings. For the construction of a simple and cost-effective grading system, monochromatic images of meat should be used rather than color images. In Section 3, experimental results show that a 4-bit monochromatic image is sufficient for accurate grading, as confirmed by professional graders. The five features must be measured from the fat region of a meat image. For their evaluation, the fat region should be correctly separated from the muscle in the rib-eye region of the meat images to be examined. In Section 4, we propose a method of binarization of the 4-bit monochromatic meat image, using a three-layer neural network developed on the basis of inputs given by a professional grader. In Section 5, we identify, among the five features, those which are strongly related to the grade assigned
by the grader, and formulate, by means of multiple regression analysis, a multiple regression equation for determining the grade from the features obtained. In Section 6, we perform an experiment on grading unevaluated images by the multiple regression equation. The results obtained using the proposed method are confirmed to be in good agreement with the judgement of a grader.
2. System design

The marbling score is a measure of the distribution density of fat in the rib-eye region. The marbling score is classified into 12 grades, and a standard image representing each grade is provided, as shown in Fig. 1. Thus, the grade of marbling of an actual meat sample is determined solely on the basis of its comparison with these standard images, and no physical quantity for grading is explicitly given. Professional graders, however, can decide the grade of meat instantaneously based on their long experience. This leads us to infer that graders determine the grade of meat using appropriate features gathered through experience. On what basis do graders decide the grade of meat quality? We investigated their intuition by means of a questionnaire. As a result, we determined that the graders decided the grade of meat quality while focusing on the following points. First, the graders separated the fat from the muscle in the rib-eye region, and estimated the percentage of fat in the rib-eye region. Next, they counted the number of fat regions, and examined the size of each fat region. The graders call a fat region "marbling", and a marbling is judged to be large or small according to
Fig. 1. Beef marbling standard.
its size. Finally, the graders examine the amount of scatter of the distribution of marbling in the rib-eye region. Based on the points described above, the graders determine the grade of meat quality. Accordingly, if we can obtain the features described above, we can develop a grading system which has the same ability as a professional grader. Based on the results investigated, we will develop a grading system considering the following points.

1. We examine the kind of meat images to be input to a grading system.
2. We correctly separate marbling from the muscle in the rib-eye region.
3. We compute the following variables for grading meat quality:
   (a) the percentage of marbling in the rib-eye region,
   (b) the number of marblings,
   (c) the number of large marblings,
   (d) the number of small marblings,
   (e) the amount of scatter of the distribution of marblings in the rib-eye region.
4. We determine the grade of meat quality based on the features mentioned above.

In the next section, we examine these points. (A sketch of how the five variables of point 3 can be computed is given below.)
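As a concrete illustration of point 3, the following minimal sketch computes the five variables from a binary marbling mask (fat = 1, muscle = 0) of the rib-eye rectangle. It is our reconstruction, not the authors' code: the 1250-pixel large/small threshold and the four-subrectangle variance follow Section 5.1, while the connectivity used for labelling and the use of scipy.ndimage are assumptions.

```python
import numpy as np
from scipy import ndimage

def marbling_features(mask, large_threshold=1250):
    """Five grading features of point 3, from a binary marbling mask
    (1 = fat, 0 = muscle).  The 1250-pixel threshold and the
    four-subrectangle variance follow Section 5.1."""
    # (a) percentage of marbling in the region
    x1 = mask.mean()
    # label connected fat regions ("marblings"); 8-connectivity assumed
    labels, n = ndimage.label(mask, structure=np.ones((3, 3), dtype=int))
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    x2 = n                                        # (b) number of marblings
    x3 = int(np.sum(sizes > large_threshold))     # (c) large marblings
    x4 = x2 - x3                                  # (d) small marblings
    # (e) scatter: variance of the fat-pixel counts over the four equal
    # subrectangles of the region
    h, w = mask.shape
    quads = [mask[:h // 2, :w // 2], mask[:h // 2, w // 2:],
             mask[h // 2:, :w // 2], mask[h // 2:, w // 2:]]
    x5 = float(np.var([int(q.sum()) for q in quads]))
    return x1, x2, x3, x4, x5
```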
3. Images used

We examine the kind of image to be input to a grading system, as described in point 1 of Section 2. What kind of image should be used for determining the grade of meat, a color image or a monochromatic image? How many bits should it have? A monochromatic image results in a simpler system than a color image, and the smaller the number of bits, the easier the implementation of the system. To answer these questions, we examined 34 different meat samples, each of which had been photographed in color and in monochrome with 1 to 8 bits; i.e., for one meat sample there are nine different images, so there are 34 × 9 = 306 images in total. Examples of images are shown in Fig. 2. In the figures, rectangles show the regions (340 × 212 pixels) to which the image processing techniques are applied. A grader determined the meat grade by observing the images; the results are tabulated in Table 1. Fig. 3 is produced from the values in Table 1 and illustrates the average difference in grade between examinations using color images and monochromatic images. We can see that the magnitude of the difference in grade between color images and monochromatic images with 8 and 4 bits is less than 1, which is a tolerable difference in practice. Thus, we can use a 4-bit monochromatic image, which is sufficient for grading meat quality.
Fig. 2. Examples of monochromatic images: (a) 8-bit image, (b) 4-bit image, and (c) 2-bit image.
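In practice, the k-bit monochromatic images can be derived from an 8-bit original by dropping low-order bits. The paper does not state how its test images were produced, so the following helper is an assumption, shown only to make the bit-depth comparison concrete:

```python
import numpy as np

def reduce_bits(img8, k):
    """Keep the k most significant bits of an 8-bit grayscale image,
    e.g. k = 4 for the 4-bit images used in the rest of the paper."""
    return (np.asarray(img8, dtype=np.uint8) >> (8 - k)).astype(np.uint8)
```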
4. Binarization of image

In this section, we discuss the method used to correctly separate the fat from the muscle in the rib-eye region of meat images, which involves a binarization of the image,
Table 1
Results graded by the grader

No.   Color   Monochrome (bits)
              8    7    6    5    4    3    2    1
1     3       4    3    3    3    4    3    4    5
2     7       7    6    6    6    6    4    6    7
3     10      9    10   10   10   9    9    9    8
4     4       5    5    4    5    5    4    4    4
5     5       6    6    6    6    6    6    7    9
6     9       9    8    8    8    9    9    7    5
7     8       7    8    8    8    8    8    7    7
8     6       6    6    6    6    7    6    4    4
9     4       4    4    4    4    5    5    8    8
10    3       3    3    3    3    3    3    4    5
11    8       9    9    8    9    9    8    6    5
12    7       5    5    6    6    6    6    4    4
13    7       6    6    6    6    5    5    4    6
14    6       6    4    4    4    5    5    6    6
15    7       6    7    7    7    7    7    6    5
16    6       6    5    5    6    6    6    5    6
17    6       7    5    6    6    7    7    7    7
18    7       8    7    8    8    8    8    6    6
19    6       7    6    7    6    7    7    6    5
20    8       7    8    7    7    8    6    4    4
21    5       3    3    4    4    4    4    3    3
22    5       4    3    3    4    4    5    3    3
23    7       6    6    7    5    7    6    3    4
24    5       6    5    5    5    6    7    9    3
25    5       6    5    6    6    6    5    4    5
26    8       7    7    7    7    7    6    7    4
27    4       5    4    4    4    5    5    4    5
28    9       8    7    7    7    8    7    8    4
29    7       8    7    7    6    7    8    6    6
30    4       5    4    4    5    4    4    3    3
31    4       4    4    4    4    5    4    5    2
32    5       7    6    6    6    7    5    4    4
33    6       7    6    6    6    6    6    5    4
34    9       8    8    8    8    8    8    8    4
Fig. 3. Average difference in grade between color and monochromatic images.
Fig. 4. Examples of density histograms: (a) muscle and (b) fat.
as described in point 2 of Section 2. To binarize the image, we introduce the concepts of a "fat-pixel pattern" and a "muscle-pixel pattern" in the 4-bit images described in Section 3. If a pixel in a 4-bit image is regarded as fat, the accompanying fat-pixel pattern is defined as follows. A 3 × 3 pixel block centered on the fat pixel is selected, and the density histogram of the nine pixels in the block is formed. Since the pixels belong to a 4-bit image, the histogram can be regarded as a 16-dimensional vector; this vector is the fat-pixel pattern. A muscle-pixel pattern is defined in a similar manner using a muscle pixel. Examples of fat- and muscle-pixel patterns are shown in Fig. 4 as histograms. We can see in the figure that the fat-pixel pattern and the muscle-pixel pattern show characteristic distributions. Using the fat-pixel pattern and the muscle-pixel pattern described above, we propose a method of binarizing the meat image by a neural network. First, to select representative or standard fat and muscle pixels, a professional grader identified 100 fat pixels and 100 muscle pixels in a meat image with 4-bit pixels, which generates 200 fat- and muscle-pixel patterns. Using these patterns, we trained a three-layer neural net to be used as a binarizing machine. Fig. 5 shows the structure of the three-layer neural net used for the binarization of a meat image. In
Fig. 5. Structure of the neural network used.

Fig. 6. Binary image obtained by the proposed method.
the net, the numbers of input, hidden and output units are 16, 6 and 2, respectively. In the net, the input–output relations in the hidden layer and the output layer are, respectively, defined as

y_j = f(Σ_i ω_ji x_i + θ_j),    (1)

z_k = f(Σ_j a_kj y_j + γ_k),    (2)

where i, j and k index neurons of the input, hidden and output layers, respectively; x_i, y_j and z_k are the output values of neurons i, j and k; and ω_ji and a_kj are the weights which connect the input layer to the hidden layer, and the hidden layer to the output layer, respectively. The threshold value of neuron j is θ_j, and that of neuron k is γ_k. The function f(x) between the input and output of a neuron is the sigmoid

f(x) = 1/(1 + exp(−x)).    (3)

The learning algorithm used is the backpropagation method [8]. In learning, the error function is defined as

E = Σ_p E_p = Σ_p Σ_k (t_pk − z_pk)²,    (4)

where t_pk is the teaching signal of neuron k for input signal p. In learning, the weights and threshold values are updated so as to decrease the value of E; learning stops when the value of E is less than a predetermined constant. Learning converged rapidly. The neural net with the parameters (weights and threshold values) obtained in the above manner was used to binarize the 4-bit images; the binarization was applied to the rectangles shown in Fig. 2. The 34 meat images mentioned in Section 3 were binarized by the above method. Fig. 6 shows the result of binarizing the rectangular area in Fig. 2b; the white region is the fat region. Comparison of Fig. 6 with Fig. 2b confirms that the fat region was correctly separated from the muscle in the image, which shows that the proposed method is effective for binarizing meat images.
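The binarizer of this section is small enough to sketch in full: each pixel's 16-bin 3 × 3 density histogram is pushed through the 16-6-2 sigmoid network of Eqs. (1)–(3), trained by backpropagation on the error function (4). The learning rate, epoch count and weight initialization below are illustrative assumptions; the paper fixes only the architecture, the sigmoid, and the error function (the constant factor 2 from differentiating Eq. (4) is absorbed into the learning rate).

```python
import numpy as np

def pixel_pattern(img4, r, c):
    """16-bin density histogram of the 3x3 block centered on (r, c)
    in a 4-bit image: the fat-/muscle-pixel pattern of this section."""
    block = img4[r - 1:r + 2, c - 1:c + 2]
    return np.bincount(block.ravel(), minlength=16).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                  # Eq. (3)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 0.1, (6, 16)), np.zeros(6)  # input -> hidden
W2, b2 = rng.normal(0.0, 0.1, (2, 6)), np.zeros(2)   # hidden -> output

def forward(x):
    y = sigmoid(W1 @ x + b1)                         # Eq. (1)
    z = sigmoid(W2 @ y + b2)                         # Eq. (2)
    return y, z

def train(patterns, targets, lr=0.1, epochs=1000):
    """Backpropagation on the error function (4); targets are one-hot
    pairs, e.g. (1, 0) = fat and (0, 1) = muscle."""
    global W1, b1, W2, b2
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            y, z = forward(x)
            dz = (z - t) * z * (1 - z)               # output-layer delta
            dy = (W2.T @ dz) * y * (1 - y)           # hidden-layer delta
            W2 -= lr * np.outer(dz, y); b2 -= lr * dz
            W1 -= lr * np.outer(dy, x); b1 -= lr * dy
```

A pixel can then be classified as fat when the first output unit exceeds the second; the 200 grader-labelled patterns of this section would serve as the training set.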
5. Deriving a standard for grading meat quality using multiple regression analysis

In this section, we evaluate the five features described in point 3 of Section 2, and derive a multiple regression equation for grading meat quality from these features by means of multiple regression analysis.

5.1. Five variables

We compute the five features described in point 3 of Section 2 from the binarized images of the 34 meat samples described in Section 4. We represent these features as five variables (x_1, x_2, x_3, x_4, x_5) in the multiple regression analysis. x_1 is the percentage of marbling in the binarized region, x_2 is the number of marblings, x_3 is the number of large marblings of fat, and x_4 is the
number of small marblings of fat. A large marbling is defined as a fat region with more than 1250 pixels in the binarized images, and a small marbling as one with less than 1250 pixels. x_5 is the amount of scatter of the distribution of marblings in the binarized region: subdividing the rectangle shown in Fig. 6 into four equal rectangles, we define x_5 as the variance of the number of fat pixels in the four regions. The five variables computed for the above 34 images are listed in Table 2. The values under "Grade" in the table represent the grades assigned by a grader.

5.2. Multiple regression equation

Using the values shown in Table 2, we computed the correlation coefficients between the variables, and between the variables and the grades assigned by a grader. The results are tabulated in Table 3. The correlation coefficients between "Grade" and x_1, and between "Grade" and x_3, are high. Furthermore, the correlation coefficients between x_1 and x_3, and between x_2 and x_4, are also high. Accordingly, variables x_1 and x_3 are strongly related to the grade of the criterion variable. In multiple regression analysis, in general, a variable whose correlation coefficient with the criterion variable is more than 0.7 should be used as an explanation variable; in Table 3, x_1 and x_3 are such variables. Therefore, using x_1 and x_3, the multiple regression equation for computing the criterion variable y is defined as

y = a_0 + a_1 x_1 + a_2 x_3,    (5)

where a_0 is a constant, and a_1 and a_2 are the partial regression coefficients of the explanation variables x_1 and x_3, respectively. Computing the values of a_0, a_1 and a_2 from the measured values given in Table 2, we obtain a_0 = −2.162, a_1 = 23.288 and a_2 = 0.172.
Table 2
Value of each variable in the meat images

No.   Grade   x_1     x_2   x_3   x_4   x_5
1     3       0.227   40    4     36    2677
2     7       0.335   48    5     43    3029
3     10      0.488   25    7     18    10,686
4     4       0.248   38    2     36    774
5     5       0.318   34    5     29    516
6     9       0.428   47    7     40    9906
7     8       0.408   44    5     39    2972
8     6       0.305   42    5     37    1196
9     4       0.268   55    3     52    4484
10    3       0.187   36    3     33    920
11    8       0.387   39    6     33    3782
12    7       0.353   57    4     53    814
13    7       0.369   54    4     50    821
14    6       0.287   41    4     37    3310
15    7       0.341   43    5     38    4494
16    6       0.320   40    3     37    1944
17    6       0.314   41    5     36    6100
18    7       0.372   42    6     36    3142
19    6       0.316   51    4     47    2357
20    8       0.368   42    6     36    901
21    5       0.263   57    3     54    1468
22    5       0.265   49    5     44    1593
23    7       0.373   45    7     38    3174
24    5       0.262   42    4     38    2444
25    5       0.294   62    5     57    739
26    8       0.382   46    7     39    1158
27    4       0.230   55    5     50    1438
28    9       0.392   49    7     42    7388
29    7       0.377   43    5     38    502
30    4       0.226   27    5     22    2404
31    4       0.253   40    4     36    2484
32    5       0.292   37    4     33    7375
33    6       0.302   43    4     39    493
34    9       0.397   48    8     40    1763
a "0.172.
5.3. Evaluation of analytical accuracy

To evaluate the accuracy of the grade of meat quality obtained using the multiple regression equation, we tested the coefficient of determination. A coefficient of determination R² can be obtained from the multiple regression sum of squares (S_R) and the residual sum of squares (S_E) of the criterion variable y as R² = 1 − S_E/(S_R + S_E). For the values of a_0, a_1 and a_2 of Eq. (5), the coefficient of determination was 0.93. In general, the analytical accuracy of a multiple regression analysis is evaluated as excellent when the coefficient of determination is 0.8 or higher. We therefore tested the significance of the obtained coefficient of determination. When the unbiased variances of S_R and S_E are defined as V_R = S_R/p and V_E = S_E/(n − p − 1) (where n is the number of samples and p is the number of explanation variables), if the value of F_0 = V_R/V_E is larger than a predetermined value, then the coefficient of determination is significant, indicating that the analytical accuracy is high. We obtained F_0 = 217, which is larger than the predetermined value of 5.36 (the value of F(p, n − p − 1, α) = F(2, 31, 0.01) for the F distribution at a significance level of α = 0.01) [9]; thus the obtained coefficient of determination is significant. Accordingly, we concluded that the level of accuracy of the analysis was high.
6. Experimental results on grading meat quality

We conducted an experiment in which the grade of an unknown meat sample was determined from its image using the multiple regression equation given above.
Table 3
Correlation coefficients between the variables

        x_1      x_2      x_3      x_4      x_5      Grade
x_1     1.000   −0.028    0.716   −0.149    0.452    0.961
x_2    −0.028    1.000   −0.072    0.985   −0.279    0.004
x_3     0.716   −0.072    1.000   −0.240    0.385    0.753
x_4    −0.149    0.985   −0.240    1.000   −0.337   −0.124
x_5     0.452   −0.279    0.385   −0.337    1.000    0.434
Grade   0.961    0.004    0.753   −0.124    0.434    1.000
Table 4
Experimental results of grading

No.   x_1     x_3   y       Grade
35    0.380   7     7.893   8
36    0.331   5     6.407   6
37    0.262   3     4.456   4
38    0.353   5     6.919   7
39    0.256   3     4.316   4
40    0.355   6     7.138   7
41    0.233   3     3.780   3
42    0.398   8     8.484   9
43    0.305   4     5.629   6
44    0.211   3     3.268   3
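As a quick consistency check on Table 4, substituting sample No. 35 into Eq. (5) with the coefficients of Section 5.2 gives

y = −2.162 + 23.288 × 0.380 + 0.172 × 7 ≈ 7.89,

which reproduces the tabulated y = 7.893 up to coefficient rounding, and lies within 1 of the grader's grade of 8.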
Given 10 unknown images of meat, the grades of meat quality were determined using Eq. (5). Table 4 shows the results. The variables x_1 and x_3 were obtained by the same process as described in Section 5. The values of y are the grades obtained using Eq. (5), and the values under "Grade" were determined by a grader. The values of y agree with the grades assigned by the grader, and the difference in grades was always less than 1. We can conclude that the results obtained using the proposed method agree well with the grades assigned by a professional grader, and that the proposed method is useful for determining the grade of meat quality.

7. Conclusions

We proposed a method of implementing a system for determining the grade of meat quality by observing images of meat. In the method, the marbling score, which is the dominant parameter in deciding the grade of meat quality, plays an important role. We investigated the grading work of professional graders, identified the features that the graders use in grading, and designed a system for grading meat quality. In particular, we defined the concepts of the "fat-pixel pattern" and the "muscle-pixel pattern", and using the three-layer neural network developed on the basis of these concepts, we binarized 4-bit monochromatic images. We obtained
values for two of the features from the binarized images, and derived, by means of multiple regression analysis, the multiple regression equation for determining the grade of meat quality from these features. Experimental results on grading unknown images showed the proposed method to be useful. Further experiments on a larger number of samples are planned for future studies.
8. Summary

In this paper, we study the implementation of a meat-quality grading system, using the concept of the "marbling score", as well as image processing, neural network techniques and multiple regression analysis. The grade of meat traded in Japan is standardized by the Japan Meat Grading Association. Generally, the determination of the grade of meat quality is carried out by visual inspection, which involves the collation of the actual meat with standard images of each grade, by authorized experts called graders. Since grading is performed in a refrigerator at low temperatures, the work is very difficult for the graders. Therefore, the development of a grading system which does not depend on visual inspection has long been awaited. The grade of meat is judged by taking two factors into account: one is the yield rate of the meat to be merchandised and the other is the grade of the meat quality. There are four features that determine the meat quality: marbling score, muscle color, fat color and tightness of the meat. In particular, the marbling score is the dominant parameter in deciding the meat quality. The marbling score is a measure of the distribution density of fat in the rib-eye region. The graders determine the grade of the marbling score by comparing the meat with the standard images of each grade, based on their experience. In the automatization of grading by the marbling score, comparing the meat with the standard images is the fundamental step. To date, several studies on grading systems based on the marbling score have been reported. In those studies, the marbling score was determined by calculating the percentage of fat in the rib-eye region. The results, however, have not been satisfactory.
In this work, we study the implementation of a meat-quality grading system, using the concept of the marbling score, as well as image processing, neural network techniques and multiple regression analysis. The marbling score is a measure of the distribution density of fat in the rib-eye region. The graders determine the grade of the marbling score by comparing the meat with the standard images of each grade, based on their experience. In Section 2, we describe the basis on which graders determine the grade of meat, drawing on the results of a questionnaire put to graders. From the results, we clarify that graders extract five features from the fat region of a meat image, and we present a design for a meat-quality grading system based on these findings. For the construction of a simple and cost-effective grading system, monochromatic images of meat should be used rather than color images. In Section 3, experimental results show that a 4-bit monochromatic image is sufficient for accurate grading, as confirmed by professional graders. The five features must be measured from the fat region of a meat image. For their evaluation, the fat region should be correctly separated from the muscle in the rib-eye region of the meat images to be examined. In Section 4, we propose a method of binarization of the 4-bit monochromatic meat image, using a three-layer neural network developed on the basis of inputs given by a professional grader. In Section 5, we identify, among the five features, those which are strongly related to the grade assigned by the grader, and formulate a multiple regression equation for the determination of the grade using the features obtained with multiple regression analysis. In Section 6, we perform an experiment on grading unevaluated images by the multiple regression equation. The results obtained using the proposed method are confirmed to be in good agreement with the judgement of a grader.
References

[1] Japan Meat Grading Association, New Standard on Meat Trading, Japan Meat Grading Association, Tokyo, 1988.
[2] K. Kuchita, K. Yamagi, T. Yamagishi, Meat quality evaluation method by image analysis and its applications, Chikusan-Kenkyu 47 (1993) 71–73.
[3] K. Shiranita, A study on evaluation of meat quality by image analysis, in: Proceedings of the Nagasaki Meeting of JSME, no. 958-2, 1995, pp. 161–162.
[4] K. Shiranita, T. Miyajima, R. Takiyama, Grading meat quality by image processing and neural network techniques, in: Proceedings of the International Symposium on Nonlinear Theory and its Application (NOLTA'96), 1996, pp. 437–440.
[5] N. Nagata, M. Kamei, M. Akane, H. Nakajima, Development of a pearl quality evaluation system based on an instrumentation of "Kansei", Trans. Inst. Electrical Engrs. Japan 112-C (2) (1992) 111–116.
[6] N. Nagata, M. Kamei, T. Usami, Factors identification using sensitivity of layered neural networks and its application to pearl color evaluation, Trans. Inst. Electrical Engrs. Japan 116-C (1996) 556–562.
[7] K. Tanimizu, S. Meguro, A. Ishii, High-speed defect detection method for color printed matter, in: Proceedings of IECON'90, 1990, pp. 653–658.
[8] H. Aso, Information Processing by Neural Network, Sangyoutosho, Tokyo, 1988.
[9] T. Kan, Practice of Multiple Regression Analysis, Gendaisugakusha, Tokyo, 1993.
About the Author: KAZUHIKO SHIRANITA was born in Saga, Japan in 1961. He received the B.E. degree from Saga University in 1984 and the Dr.E. degree from Kyushu Institute of Design in 1998. Since 1984 he has been with the Saga Prefectural Industrial Technology Center. His research interests include pattern recognition and image processing.

About the Author: KENICHIRO HAYASHI was born in Saga, Japan in 1954. He received the B.E. degree from Kyushu University in 1978 and the Dr.E. degree from Saga University in 1994. Since 1981 he has been with the Saga Prefectural Industrial Technology Center. His research interests include image processing and neural networks.

About the Author: AKIFUMI OTSUBO was born in Saga, Japan in 1965. He received the B.E. degree from Saga University in 1988 and the Dr.E. degree from Kyushu Institute of Technology in 1998. Since 1988 he has been with the Saga Prefectural Industrial Technology Center. His research interests include image processing and neural networks.

About the Author: TSUNEHARU MIYAJIMA was born in Saga, Japan in 1960. He received the B.E. degree from Saga University in 1983. Since 1984 he has been with the Saga Prefectural Livestock Experiment Station. His research interests are in the areas of zootechnical science.

About the Author: RYUZO TAKIYAMA was born in Fukuoka, Japan in 1940. He received the B.E., M.E. and Dr.E. degrees from Kyushu University in 1963, 1965 and 1978, respectively. From 1968 to 1969, he was a Research Associate at the Department of Communication Engineering, Kyushu University. From 1969 to 1985 he was an Associate Professor at the Department of Visual Communication Design, Kyushu Institute of Design. Since 1985 he has been a Professor at Kyushu Institute of Design. His research interests include pattern recognition, image processing, neural networks and psychoinformation space theory.
Pattern Recognition 33 (2000) 105–118

Determining simplicity and computing topological change in strongly normal partial tilings of R² or R³

Punam K. Saha, Azriel Rosenfeld*

Medical Image Processing Group, University of Pennsylvania, Philadelphia, PA 19104-6021, USA
Computer Vision Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA

Received 26 August 1998; accepted 5 January 1999
Abstract

A convex polygon in R², or a convex polyhedron in R³, will be called a tile. A connected set 𝒫 of tiles is called a partial tiling if the intersection of any two of the tiles is either empty, or is a vertex or edge (in R³: or face) of both. 𝒫 is called strongly normal (SN) if, for any partial tiling 𝒫′ ⊆ 𝒫 and any tile P ∈ 𝒫′, the neighborhood N(P, 𝒫′) of P (the union of the tiles of 𝒫′ that intersect P) is simply connected. Let 𝒫 be SN, and let N*(P, 𝒫) be the excluded neighborhood of P in 𝒫 (i.e., the union of the tiles of 𝒫, other than P itself, that intersect P). We call P simple in 𝒫 if N(P, 𝒫) and N*(P, 𝒫) are topologically equivalent. This paper presents methods of determining, for an SN partial tiling 𝒫, whether a tile P ∈ 𝒫 is simple, and if not, of counting the numbers of components and holes (in R³: components, tunnels and cavities) in N*(P, 𝒫). © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Partial tiling; Simple tile; Simply connected neighborhood; Strong normality
1. Introduction

In n-dimensional Euclidean space Rⁿ, a convex hyperpolyhedron P is a bounded set which is the intersection of a finite number of closed half-spaces and which has a nonempty interior. Let H_1, …, H_m be the hyperplanes ((n−1)-dimensional subspaces of Rⁿ) that bound the half-spaces whose intersection is P. It can be shown that m must be at least n + 1. Each nonempty P ∩ H_i is called an (n−1)-dimensional hyperface of P. An (n−1)-dimensional hyperface of P is a convex hyperpolyhedron in Rⁿ⁻¹; its (n−2)-dimensional hyperfaces are also called (n−2)-dimensional hyperfaces of P, and so on. For n − d = 2, 1, or 0, an (n−d)-dimensional hyperface of P is called a face, an edge, or a vertex of P, respectively. It can be shown that a face is a convex planar polygon, an edge is a line segment, and a vertex is a point. Let 𝒫 be a nonempty set of convex hyperpolyhedra whose union is connected. We call 𝒫 normal if, for all

* Corresponding author. Tel.: +1-301-405-4526; fax: +1-301-314-9115. E-mail address: [email protected] (A. Rosenfeld)
P, Q ∈ 𝒫, P ∩ Q is either empty or an (n−d)-dimensional hyperface of P and of Q, for some d. [For example, when n = 2, the hyperpolyhedra are polygons; if a set of convex polygons is normal, the intersection of any two of them is an edge, a vertex, or empty; their intersection cannot have an interior point, and if it is contained in an edge, it must be either the entire edge or one of the vertices of the edge.] If 𝒫 is normal and its union is all of Rⁿ, it is called a tessellation or tiling of Rⁿ; if its union is not all of Rⁿ, it is called a partial tiling of Rⁿ. In either case, the P's in 𝒫 are called tiles. It is not hard to see that any partial tiling is a subset of a tiling, because the space not occupied by the tiles can be partitioned into convex polyhedra whose pairwise intersections are hyperfaces. The neighborhood N(P, 𝒫) of P in 𝒫 is the union of all Q ∈ 𝒫 that intersect P (including P itself). The excluded neighborhood N*(P, 𝒫) is defined similarly, except that Q is not allowed to be P itself. 𝒫 is called strongly normal (SN) if it is normal and, for all P, P_1, …, P_m (m > 1) ∈ 𝒫, if each P_i intersects P and I = P_1 ∩ ⋯ ∩ P_m is nonempty, then I intersects P. Note that both normality and strong normality are hereditary: if they hold for 𝒫, they hold for any 𝒫′ ⊆ 𝒫.
From now on we assume that n"2 or 3, so that the P's are convex polygons or polyhedra. The concept of strong normality was introduced by the authors [1] for the case of a partial tiling of R by tetrahedra, and was then generalized to convex polyhedra [2]. It has been shown [2] that P is an SN (partial) tiling of R or R by convex polygons or polyhedra i!, for any P-P and any P3P, N(P, P) is simply connected. (It was also observed [2] that the regular square and hexagonal tessellations of R are SN, but the regular triangular tessellation is not.) If P is SN and NH(P, P) is simply connected, P is called simple in P. When P is not simple, NH(P, P) may be disconnected, or it may have holes (in 2D), tunnels or cavities (in 3D). In this paper we present methods of determining whether P is simple, and if not, of counting the numbers of connected components, holes, tunnels, and cavities (as appropriate) in NH(P, P). From now on we assume that P is SN. To avoid the need for awkward formulations or arti"cial terminology, we treat the twoand three-dimensional cases separately (in Sections 2 and 3, respectively), even though their treatments are to a great extent analogous.
2. The two-dimensional case

2.1. Numbers of components and holes

The interior of a polygon P, denoted by interior(P), is defined as P − (the union of the edges of P). If e is an edge, we define interior(e) as e − (the vertices of e). It will be convenient, in what follows, to define interior(v) as v if v is a vertex. A vertex or edge s of P is called shared if there exist two polygons P_i, P_j ∈ 𝒫 such that P_i ∩ P_j = s. Otherwise, s is called bare. An edge that belongs to two polygons must be shared, but a bare vertex v can belong to two polygons if they intersect in an edge containing v. The boundary of P, denoted by B(P), is the union of all edges of P. The set B(P) − N*(P, 𝒫), denoted by B_b(P, 𝒫), is called the bare boundary of P. The union of the shared vertices and edges of P is called the shared boundary of P and is denoted by B_s(P, 𝒫). It is not hard to see that for all P, B_b(P, 𝒫) and B_s(P, 𝒫) are disjoint and their union is B(P). Note that B_b(P, 𝒫) is not the union of the bare vertices and edges of P; a vertex of a shared edge may be bare, and a vertex of a bare edge may be shared. An edge of P is shared iff one of the polygons sharing it is P; but a vertex of P may be shared by two polygons neither of which is P. If an edge of P is bare, its interior (at least) is contained in B_b(P, 𝒫); and it is not hard to see that if every edge of P is shared, B_b(P, 𝒫) must be empty. This proves
Lemma 1. B_b(P, 𝒫) is nonempty if and only if P has a bare edge.

The union of all the polygons in 𝒫 will be denoted by U(𝒫). The numbers of components and holes in U(𝒫) are its 0th and 1st Betti numbers, respectively; its higher Betti numbers are zero. As shown in Ref. [1], because of SN, N(P, 𝒫) always consists of a single connected component and has no holes. In this section we will determine the numbers of components and holes in N*(P, 𝒫). We first establish a useful lemma:

Lemma 2. The numbers of components and holes in N*(P, 𝒫) are, respectively, the same as the numbers of components and holes in B_s(P, 𝒫).

Proof. Let disc(ε, p) denote the disc with radius ε and center p (p ∈ R²). For every point p ∈ N(P, 𝒫) − P there exists a finite value δ such that for all ε < δ, disc(ε, p) ∩ N(P, 𝒫) = disc(ε, p) ∩ N*(P, 𝒫). Now P and N(P, 𝒫) are both simply connected and P ⊆ N(P, 𝒫). Therefore, there exists a Betti number preserving transformation (a deformation retract) from N(P, 𝒫) to P. This transformation basically removes N(P, 𝒫) − P from N(P, 𝒫). The transformation is also Betti number preserving for N*(P, 𝒫). This is true because: (a) the transformation removes N(P, 𝒫) − P; (b) for every point p ∈ N(P, 𝒫) − P there exists a finite value δ such that for all ε < δ, disc(ε, p) ∩ N(P, 𝒫) = disc(ε, p) ∩ N*(P, 𝒫); (c) both N(P, 𝒫) and P are simply connected, and thus the transformation is Betti number preserving throughout (i.e., it is not the case that at a certain step it removes a hole and at some other step it creates a hole). Since the transformation removes N(P, 𝒫) − P, it transforms N*(P, 𝒫) to N*(P, 𝒫) − (N(P, 𝒫) − P) = N*(P, 𝒫) ∩ P = N*(P, 𝒫) ∩ B(P) = B_s(P, 𝒫). Therefore the numbers of components and holes in N*(P, 𝒫) are the same as those numbers in B_s(P, 𝒫). □

Now B_s(P, 𝒫) is a subset of B(P), and B(P) is topologically equivalent to a circle, which has exactly one hole. Therefore B_s(P, 𝒫) can have at most one hole, and this occurs only when B_s(P, 𝒫) is topologically equivalent to a circle. In other words, B_s(P, 𝒫) has at most one hole, and this occurs only when B_s(P, 𝒫) = B(P), i.e., B_b(P, 𝒫) is empty. Based on this discussion we have

Proposition 1. N*(P, 𝒫) can have at most one hole, and this occurs only when B_b(P, 𝒫) is empty.

From Proposition 1 and Lemma 1 we have

Proposition 2. N*(P, 𝒫) can have at most one hole, and this occurs if and only if all edges of P are shared.
2.2. Simple polygons and measures of topological change

A polygon P is simple in 𝒫 if and only if deleting P from 𝒫 does not change the topology of U(𝒫). Evidently, deleting P changes the topology of U(𝒫) if and only if it changes the topology of its neighborhood (i.e., the union of the polygons that intersect it). Thus P is simple if and only if the numbers of components and holes are the same in N(P, 𝒫) and in N*(P, 𝒫). As shown in Ref. [1], since 𝒫 is SN, the numbers of components and holes in N(P, 𝒫) are always 1 and 0, respectively. Therefore P is simple if and only if N*(P, 𝒫) has exactly one component and has no holes. Criteria for the existence of holes in N*(P, 𝒫) were given in Section 2.1. These criteria give us the following characterization of a simple polygon:

Theorem 1. P is simple if and only if it satisfies the following two conditions:

1. B_s(P, 𝒫) has exactly one component.
2. P has a bare edge.

In general, we can define P to be simple if N(P, 𝒫) can be continuously contracted into N*(P, 𝒫), or equivalently, if P can be continuously contracted into B_s(P, 𝒫). It can be shown that P is simple iff N*(P, 𝒫) has the same numbers of components and holes as N(P, 𝒫), and that Theorem 1 holds even if 𝒫 is normal (but not necessarily SN). If a polygon is nonsimple, its deletion changes the topology of U(𝒫), but this tells us nothing about the nature of the change. The following theorem is a straightforward consequence of the discussion in Section 2.1.
Theorem 2. When P is deleted,

1. The change in the number of components is one less than the number of components of B_s(P, 𝒫).
2. The change in the number of holes is one when P has no bare edge, and zero otherwise.

Let the configuration of an edge or vertex refer to whether it is shared or bare. By Theorems 1 and 2, the changes in the numbers of components and holes when P is deleted (and hence the simplicity of P) are determined if we know the configurations of the vertices and edges of P. [Note that this is not true if 𝒫 is not SN (even if it is normal). For example, the configurations of the vertices and edges of the triangle P in Figs. 1a and b are the same, but in Fig. 1a N*(P, 𝒫) has only one component, while in Fig. 1b it has two components.] An alternative formulation of Theorems 1 and 2 can be obtained by regarding the given partial tiling 𝒫 as a subset of a tiling 𝒫̄, and distinguishing the polygons Q in N*(P, 𝒫) that share an edge with P from those that share only a vertex with P. The former polygons will be called b-adjacent to P, while all the polygons in N*(P, 𝒫) will be called a-adjacent to P. [As usual, the transitive closure of x-adjacency (where x = a or b) is called x-connectedness.] We denote the set of polygons in 𝒫 that are a-adjacent to P by N*_a(P), and the set of polygons in 𝒫̄ − 𝒫 that are b-adjacent to P by N̄*_b(P). In terms of this notation, we can restate Theorems 1 and 2 as follows:

Theorem 1′. P is simple if and only if it satisfies the following two conditions:

1. N*_a(P) is nonempty and a-connected.
2. N̄*_b(P) is nonempty.

Theorem 2′. When P is deleted,

1. The change in the number of components is one less than the number of a-components in N*_a(P).
2. The change in the number of holes is one when N̄*_b(P) is empty, and zero otherwise.
2.3. Efficient computation

As we have seen, the numbers of components and holes in N(P, 𝒫) are always 1 and 0, respectively. Thus the
Fig. 1. The con"gurations of the vertices and edges of triangle P are the same in (a) and (b), but NH(P, P) has only one component in (a), while it has two components in (b). Note that this set of polygons is not SN.
numbers m(P, 𝒫) and d(P, 𝒫) of components and holes in N*(P, 𝒫) define local measures of topological change when P is deleted. Let simple(P, 𝒫) be a predicate which has the value 1 when P is simple in 𝒫 and 0 otherwise. m(P, 𝒫), d(P, 𝒫) and simple(P, 𝒫) will be referred to as the local topological parameters of P. In this section we develop an efficient approach to computing these parameters.

Proposition 3. If an edge e of P is shared then m(P, 𝒫) is independent of the configurations of the vertices v ⊂ e.

Proof. Let 𝒫 contain a polygon P_v such that P_v ∩ P = v. To establish the proposition we shall show that the numbers of components of N*(P, 𝒫) and N*(P, 𝒫 − {P_v}) are the same. If this is not true then one of the following two cases must occur:

Case 1: P_v is an isolated polygon in N*(P, 𝒫), so that the number of components of N*(P, 𝒫) is greater than that of N*(P, 𝒫 − {P_v}).

Case 2: Two or more components of N*(P, 𝒫 − {P_v}) are adjacent to P_v, so that the number of components of N*(P, 𝒫 − {P_v}) is greater than that of N*(P, 𝒫).

Since e is shared, there exists a polygon P_e ⊆ N*(P, 𝒫) such that P_e ∩ P = e. Now P_v ∩ P = v implies that v ⊂ P_v. Also v ⊂ e ⊂ P_e, and thus v ⊂ P_e ∩ P_v, i.e., P_e ∩ P_v ≠ ∅. Hence P_v is adjacent to P_e and Case 1 never occurs. As regards Case 2, suppose there exist two polygons P_1, P_2 ⊆ N*(P, 𝒫 − {P_v}) that belong to different components of N*(P, 𝒫 − {P_v}) and are adjacent to P_v. P_1 and P_2 are adjacent to P_v and P_1 ∩ P_v ≠ ∅; thus by SN, (P_1 ∩ P_v) ∩ P ≠ ∅. Obviously, (P_1 ∩ P_v) ∩ P ⊆ P_v ∩ P = v, and thus P_1 ∩ P_v ∩ P ⊆ v ⊂ e ⊂ P_e, which implies that P_1 ∩ P_e ≠ ∅; hence P_1 and P_e are adjacent. In the same way, it can be shown that P_2 and P_e are adjacent. Therefore P_1 ∪ P_e ∪ P_2 is connected, i.e., P_1 and P_2 belong to the same component of N*(P, 𝒫 − {P_v}), a contradiction. Thus Case 2 cannot occur either. □

By Proposition 2, d(P, 𝒫) depends on the configurations of the edges of P, and is independent of the configurations of the vertices of P. By Proposition 3, m(P, 𝒫) is independent of the configurations of the vertices belonging to shared edges of P. Based on this, we have the following definition and proposition:

Definition 1. A vertex v of P is called trapped if it belongs to a shared edge of P; otherwise, it is called free.

Proposition 4. m(P, 𝒫) and d(P, 𝒫) are independent of the configurations of the trapped vertices of P.
Since simple(P, 𝒫) depends on m(P, 𝒫) and d(P, 𝒫) (specifically, simple(P, 𝒫) is 1 if and only if m(P, 𝒫) and d(P, 𝒫) are 1 and 0, respectively), we also have

Corollary 1. The predicate simple(P, 𝒫) is independent of the configurations of the trapped vertices of P.

Lemma 3. Each free, shared vertex of P contributes one component to N*(P, 𝒫).

Proof. Let v be a free, shared vertex of P. Let P_1, P_2, …, P_n be the polygons of 𝒫 such that P_i ∩ P = v. [Since v is shared, it is in at least one polygon Q other than P. Now Q ∩ P cannot be an edge, since v is free; hence Q ∩ P = v.] Evidently, the P_i's are connected. We show that no other polygon in N(P, 𝒫) meets any of the P_i's. In fact, suppose Q is not one of the P_i's (so that Q ∩ P ≠ v), but Q meets some P_i, i.e., Q ∩ P_i ≠ ∅. Since v is free, both of the edges of P that meet at v are bare; hence (Q ∩ P_i) ∩ P = ∅, while strong normality requires that Q ∩ P_i intersect P. This contradicts the fact that 𝒫 is SN. □

Using these results, we now describe a simple method of computing the local topological parameters m(P, 𝒫), d(P, 𝒫) and simple(P, 𝒫). We treat three cases, depending on the presence of shared edges in P.

Case 1: All edges of P are shared. In this case all vertices are trapped and no more computation is needed. The local topological parameters have the following values: m(P, 𝒫) = 1, d(P, 𝒫) = 1, simple(P, 𝒫) = 0.

Case 2: Some, but not all, edges of P are shared. In this case d(P, 𝒫) is always zero, but m(P, 𝒫) (and therefore simple(P, 𝒫)) depends on the configurations of the edges and free vertices (if any) of P. We can compute m(P, 𝒫) by traversing the successive edges, counting "1" for each transition from a bare edge to a shared edge (so that each maximal run of shared edges is counted once), and counting "1" for each shared vertex which is the intersection of two bare edges. Finally, simple(P, 𝒫) is 1 if m(P, 𝒫) is 1, and is 0 otherwise.

Case 3: No edge of P is shared. In this case all vertices of P are free. The number of holes d(P, 𝒫) is always zero; the number of components m(P, 𝒫) is exactly the number of shared vertices; and simple(P, 𝒫) is 1 when exactly one vertex is shared, and is 0 otherwise.
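The three cases translate directly into code. The sketch below is our reading of the counting rule in Case 2, not the authors' implementation; the polygon is described by the cyclic list of its edge configurations together with, for each vertex, whether it is shared, with vertex i taken to lie between edges i−1 and i (indices mod n).

```python
def local_parameters(edge_shared, vertex_shared):
    """Local topological parameters (m, d, simple) of a polygon P in an
    SN partial tiling, following the three cases of Section 2.3.
    edge_shared[i]: True if edge i is shared; vertex_shared[i]: True if
    vertex i (between edges i-1 and i) is shared."""
    n = len(edge_shared)
    if all(edge_shared):                   # Case 1: every edge shared
        return 1, 1, 0                     # m = 1, d = 1, not simple
    if any(edge_shared):                   # Case 2: some, not all, shared
        # each maximal run of shared edges yields one component; count it
        # once by detecting the bare -> shared transition that starts it
        m = sum(1 for i in range(n)
                if edge_shared[i] and not edge_shared[i - 1])
        # a shared vertex lying between two bare edges is free, and by
        # Lemma 3 contributes a component of its own
        m += sum(1 for i in range(n)
                 if vertex_shared[i]
                 and not edge_shared[i - 1] and not edge_shared[i])
        return m, 0, int(m == 1)
    # Case 3: no edge shared -- every vertex is free, and each shared
    # vertex is one component
    m = sum(bool(v) for v in vertex_shared)
    return m, 0, int(m == 1)
```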
3. The three-dimensional case

3.1. Numbers of components, tunnels and cavities

The interior of a polyhedron P, denoted by interior(P), is defined as P − (the union of the faces of P). If f is a face,
we de"ne interior ( f ) as f } (the edges of f ). The interior of an edge or a vertex is de"ned as in the two-dimensional case. A vertex, edge, or face s of P is called shared if there exist two polyhedra P , P 3P such that P 5P "s. G H G H Otherwise, s is called bare. A face that belongs to two polyhedra must be shared, but a bare edge (respectively, vertex) x can belong to two polyhedra if they intersect in a face (respectively, face or edge) containing x. The boundary of P, denoted by B(P), is the union of all faces of P. The set B(P!NH(P, P), denoted by B (P, P), @ is called the bare boundary of P. The union of the shared vertices, edges, and faces of P is called the shared boundary of P and is denoted by B (P, P). It is not hard to see Q that for all P, B (P, P) and B (P, P) are disjoint and their @ Q union is B(P). Note that B (P, P) is not the union of the @ bare vertices, edges, and faces of P; a vertex of a shared face or shared edge may be bare; an edge of a shared face may be bare; a vertex of a bare edge or bare face may be shared; and an edge of a bare face may be shared. A face of P is shared i! one of the polyhedra sharing it is P; but a vertex or edge of P may be shared by two polyhedra neither of which is P. If a face of P is bare, its interior (at least) is contained in B (P, P); and it is not hard to see that if every face of P is @ shared, B (P, P) must be empty. This proves @ Lemma 4. B (P, P) is nonempty if and only if P has a bare @ face. The union of all the polyhedra in P will be denoted by U(P). The numbers of components, tunnels and cavities in U(P) are its 0th, 1st and 2nd Betti numbers, respectively. As shown in Ref. [1], because of SN, N(P, P) always consists of a single connected component without tunnels and cavities. In this section we will determine the numbers of components, tunnels and cavities in NH(P, P). We "rst establish a useful lemma: Lemma 5. The numbers of components, tunnels and cavities in NH(P, P) are the same as the numbers of components, tunnels and cavities in B (P, P), respectively. Q Proof. Analogous to that of Lemma 2, using a ball instead of a disc. 䊐 B (P, P) is a subset of B(P), and B(P) is topologically Q equivalent to a sphere, which has exactly one cavity. Therefore B (P, P) can have at most one cavity, and this Q occurs only when B (P, P) is topologically equivalent to Q a sphere. In other words, B (P, P) has at most one cavity, Q and this occurs only when B (P, P)"B(P), i.e., B (P, P) Q @ is empty. Based on this discussion we have Proposition 5. NH(P, P) can have at most one cavity, and this occurs only when B (P, P) is empty. @
From Proposition 5 and Lemma 4 we have

Proposition 6. N*(P, 𝒫) can have at most one cavity, and this occurs if and only if all faces of P are shared.

The remainder of this section deals with methods of determining the number of tunnels in N*(P, 𝒫). Let W be a connected subset of the boundary B of a convex set in R³, for example, a connected region on the surface of a convex polyhedron. W has a tunnel if and only if B − W is not connected. In fact, W has n > 0 tunnels if and only if B − W has n + 1 components. Now P is a convex set in R³, B(P) is the boundary of P, and B_s(P, 𝒫) is a subset of B(P). Also B_b(P, 𝒫) = B(P) − N*(P, 𝒫) = B(P) − (N*(P, 𝒫) ∩ B(P)) = B(P) − B_s(P, 𝒫). Thus B_s(P, 𝒫) has no tunnel when B_b(P, 𝒫) is connected or empty; otherwise, the number of tunnels in B_s(P, 𝒫) is one less than the number of components in B_b(P, 𝒫). According to Lemma 5, the number of tunnels in N*(P, 𝒫) is equal to the number in B_s(P, 𝒫). Based on this discussion we have

Proposition 7. N*(P, 𝒫) has no tunnel when B_b(P, 𝒫) is connected or empty; otherwise, the number of tunnels in N*(P, 𝒫) is one less than the number of components in B_b(P, 𝒫).

If f is a bare face of P, interior(f) must be contained in B_b(P, 𝒫), and since interior(f) is connected, it must be a subset of some component of B_b(P, 𝒫). Conversely, we have

Lemma 6. Every component of B_b(P, 𝒫) contains interior(f) for some bare face f of P.

Proof. Suppose component c of B_b(P, 𝒫) does not contain the interior of any bare face of P. Then c must contain the interior of some bare edge or bare vertex. Suppose c contains interior(e) for some bare edge e. Let e be a subset of the face f. If f is bare, interior(f) is a subset of B_b(P, 𝒫). But interior(f) ∪ interior(e) is connected, so it must be contained in c, a contradiction. On the other hand, if f is not bare, it is a subset of N*(P, 𝒫). Hence interior(e) ⊂ e ⊂ f ⊂ N*(P, 𝒫) cannot be a subset of B_b(P, 𝒫), a contradiction. The proof if c contains the interior of some bare vertex is analogous. □

Lemma 7. For any two bare faces f, f′ of P, interior(f) and interior(f′) are connected in B_b(P, 𝒫) if and only if there exists an alternating sequence f = f_0, e_0, f_1, …, f_i, e_i, f_{i+1}, …, e_{n−1}, f_n = f′ of bare faces and bare edges of P such that e_i = f_i ∩ f_{i+1} for 0 ≤ i < n.

Proof. Suppose first that there exists such an alternating sequence. Since e_i is a subset of both f_i and f_{i+1},
interior(f_i) ∪ interior(e_i) ∪ interior(f_{i+1}) is connected, and obviously interior(f_i) ∪ interior(e_i) ∪ interior(f_{i+1}) is a subset of B_b(P, 𝒫). Thus interior(f_i) and interior(f_{i+1}) are connected in B_b(P, 𝒫) for 0 ≤ i < n, and thus interior(f) = interior(f_0) and interior(f′) = interior(f_n) are connected in B_b(P, 𝒫). Hence if there exists such an alternating sequence, interior(f) and interior(f′) are connected in B_b(P, 𝒫).

We shall show that, conversely, if interior(f) and interior(f′) are connected in B_b(P, 𝒫), there exists such an alternating sequence. A curve connecting them in B_b(P, 𝒫) passes through the interiors of a sequence f = s_0, s_1, …, s_m = f′ of simplexes of P. Since the interiors of the s_i's intersect B_b(P, 𝒫), by normality it is not hard to see that they must be subsets of B_b(P, 𝒫); hence the s_i's are bare simplexes. Evidently interior(s_i) ∪ interior(s_{i+1}) must be connected for 0 ≤ i < m, since the curve passes through the consecutive s_i's. Now the union of the interiors of two faces, of two edges, or of two vertices of a polyhedron can never be connected. Thus, if we have a sequence x_0, x_1, …, x_k of faces and edges of a polyhedron in which interior(x_i) ∪ interior(x_{i+1}) is connected for 0 ≤ i < k, it must be an alternating sequence of faces and edges. Hence it remains only to show that we can replace every vertex in the sequence s_0, s_1, …, s_m by a sequence of bare faces and bare edges which satisfies the connectivity condition.

A vertex in s_0, s_1, …, s_m must occur in one of the following four contexts (note that s_0 and s_m are not vertices): (a) face, vertex, edge; (b) edge, vertex, face; (c) face, vertex, face; (d) edge, vertex, edge. We show how to do the replacement in case (a); the other three cases can be treated similarly. Let x, y, z be the face, the vertex and the edge, respectively. If z ⊂ x, i.e., the edge is a subset of the face, then interior(x) ∪ interior(z) is connected and we can replace x, y, z by x, z, which is an alternating sequence of bare faces and bare edges satisfying the connectivity condition. If z ⊄ x then we proceed as follows. Since interior(x) ∪ interior(y) ∪ interior(z) is connected and y is a vertex, y must be a subset of both x and z. Since y is a vertex of P, at least three faces and at least three edges of P must meet at y; moreover, these n ≥ 3 faces and edges can be arranged in a sequence such that the ith edge is the intersection of the ith face and the (i+1)st face (mod n). Let the faces be x_1, x_2, …, x_n and the edges be z_1, z_2, …, z_n, where z_i = x_i ∩ x_{i+1}. Since x and z both meet y, x is one of the x_i's and z is one of the z_i's. Suppose x = x_j and z = z_k, where j > k; we have already defined the replacement method for j = k, in which z is a subset of x, and the replacement method for j < k is similar to that for j > k. Since interior(y) is a subset of B_b(P, 𝒫), x_1, x_2, …, x_n and z_1, z_2, …, z_n must all be bare simplexes. [Suppose this were not true, e.g., suppose x_i is not bare, so that it is a subset of N*(P, 𝒫). Then interior(y) = y ⊂ x_i ⊆ N*(P, 𝒫); thus
interior(y) ⊄ B(P) − N*(P, 𝒫), i.e., interior(y) ⊄ B_b(P, 𝒫), a contradiction.] Now z_i = x_i ∩ x_{i+1} implies that interior(x_i) ∪ interior(z_i) ∪ interior(x_{i+1}) is connected; hence interior(x_j) ∪ interior(z_j) ∪ interior(x_{j+1}) ∪ ⋯ ∪ interior(x_k) ∪ interior(z_k) is connected. Therefore we can replace the sequence x, y, z by the alternating sequence x = x_j, z_j, x_{j+1}, …, x_k, z_k of bare faces and bare edges, which satisfies the connectivity condition. □

Definition 2. Let O(P, 𝒫) be the set of bare faces of P. Two bare faces f, f′ of P are called face-adjacent if the edge f ∩ f′ is bare. Two bare faces f, f′ of P are called face-connected if there exists a sequence f = f_0, f_1, …, f_n = f′ in O(P, 𝒫) such that f_i and f_{i+1} are face-adjacent for 0 ≤ i < n. A face component of O(P, 𝒫) is the union of a maximal face-connected subset of O(P, 𝒫).

Theorem 3. The number of tunnels in N*(P, 𝒫) is zero when P has no bare face. Otherwise, the number of tunnels in N*(P, 𝒫) is one less than the number of face components in O(P, 𝒫).

Proof. This follows from Proposition 7 and Lemmas 6 and 7. □

The following corollaries are straightforward consequences of this theorem:

Corollary 2. N*(P, 𝒫) contains no tunnel if and only if O(P, 𝒫) is face-connected.

Corollary 3. The number of tunnels in N*(P, 𝒫) is independent of whether the vertices of P are bare or shared.
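Theorem 3 reduces tunnel counting to counting face components of the bare faces, which is a single union-find pass. In the sketch below the data structures are hypothetical: faces of P are indexed, bare_faces is the set O(P, 𝒫), and bare_edges maps each bare edge of P to the pair of faces of P that meet along it.

```python
def count_tunnels(bare_faces, bare_edges):
    """Number of tunnels in N*(P, P) by Theorem 3: one less than the
    number of face components of O(P, P), or zero if P has no bare face."""
    if not bare_faces:
        return 0                            # no bare face: no tunnel
    parent = {f: f for f in bare_faces}

    def find(f):                            # union-find with path halving
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    # two bare faces meeting along a bare edge are face-adjacent
    for f, g in bare_edges.values():
        if f in parent and g in parent:
            parent[find(f)] = find(g)
    n_components = len({find(f) for f in bare_faces})
    return n_components - 1
```

By Proposition 6, the same inputs also settle the cavity count: there is one cavity exactly when bare_faces is empty.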
Theorem 4. P is simple if and only if it satisxes the following two conditions:
Theorem 4. P is simple if and only if it satisxes the following two conditions:
1. B (P, P) has exactly one component. Q 2. O(P, P) is nonempty and face-connected.
1. NH(P) is nonempty and a-connected ? 2. Exactly one c-component of NM H(P) intersects NM H(P). @ A
If we de"ne `simplea in terms of contractability (see Section 2.2, just after Theorem 1), it can be shown that even if P is normal (but not necessarily SN), P is simple i! conditions 1 and 2 of Theorem 4 hold. For such a P, NH(P, P) must have the same numbers of components, tunnels, and cavities as N(P, P); but these numbers can also be the same even if P is not simple (unlike the situation in two dimensions, where equality of the numbers of components and holes implies simplicity). If a polyhedron is nonsimple, its deletion changes the topology of U(P), but this tells us nothing about the nature of the change. The following theorem is a straightforward consequence of the discussion in Section 3.1. Theorem 5. When P is deleted, 1. The change in the number of components is one less than the number of components of B (P, P). Q 2. The change in the number of tunnels is one less than the number of face components in O(P, P) when O(P, P) is nonempty, and zero otherwise. 3. The change in the number of cavities is one when O(P, P) is empty and zero otherwise. Let the conxguration of a face, edge, or vertex refer to whether it is shared or bare. By Theorems 4 and 5, the changes in the numbers of components, tunnels, and cavities when P is deleted (and hence the simplicity of P) are determined if we know the con"gurations of the vertices, edges, and faces of P. (As in Section 2, this is not true if P is not SN, even if it is normal; examples can be easily given.) An alternative formulation of Theorems 4 and 5 can be obtained if we regard the given partial tiling P, as a subset of a tiling P, and distinguish the polyhedra Q in NH(P, P) that share a face with P, those that share only an edge with P, and those that share only a vertex with P. All three types of Q's will be called a-adjacent to P; those that share a face or edge with P will be called b-adjacent to P; and those that share a face with P will be called c-adjacent to P. (a-, b-, or c-connectedness is the transitive closure of a-, b-, or c-adjacency.) We denote the set of polyhedra in P that are a-adjacent to P by NH(P); the set of polyhedra in P!P that ? are b-adjacent to P by NM H(P); and the set of polyhedra @ in P!P that are c-adjacent to P by NM H(P). In terms A of this notation, we can restate Theorems 4 and 5 as follows:
(Note that by condition 2, N̄*_c(P) cannot be empty; hence P must have at least one bare face. Condition 2 then readily implies that the set of bare faces of P is face-connected.)

Theorem 5 (restated). When P is deleted,

1. The change in the number of components is one less than the number of a-components in N*_a(P).
2. The change in the number of tunnels is one less than the number of c-components of N̄*_b(P) that intersect N̄*_c(P) when N̄*_c(P) is nonempty, and zero otherwise.
3. The change in the number of cavities is one when N̄*_c(P) is empty, and zero otherwise.

3.3. Efficient computation

As we have seen, the numbers of components, tunnels and cavities in N(P, 𝒫) are always 1, 0 and 0, respectively. Thus the numbers m(P, 𝒫), g(P, 𝒫) and d(P, 𝒫) of components, tunnels and cavities in N*(P, 𝒫) define local measures of topological change when P is deleted. Let simple(P, 𝒫) be a predicate which has the value 1 when P is simple and 0 otherwise. m(P, 𝒫), g(P, 𝒫), d(P, 𝒫) and simple(P, 𝒫) will be referred to as the local topological parameters of P. In this section we develop an efficient approach to computing these parameters. For brevity, in this section the configuration of a face, edge, or vertex refers to whether it is shared or bare.

Proposition 8. If a face f of P is shared then m(P, 𝒫) is independent of the configurations of the vertices or edges s ⊂ f.

Proof. Analogous to that of Proposition 3. □

We can similarly prove

Proposition 9. If an edge e of P is shared then m(P, 𝒫) is independent of the configurations of the vertices of e.

Proposition 10. If a face f of P is shared then g(P, 𝒫) is independent of the configurations of the vertices or edges s ⊂ f.

Proof. According to Corollary 3, g(P, 𝒫) is independent of the configurations of the vertices of P. To establish the proposition we shall show that g(P, 𝒫) is independent of the configurations of all edges of f. Let e be such an edge and let P_e be a polyhedron (if any) such that P_e ∩ P = e.
According to Theorem 3, g(P, 𝒫) is zero when O(P, 𝒫) is empty, and otherwise it is one less than the number of face components of O(P, 𝒫), so that g(P, 𝒫) is also zero when O(P, 𝒫) contains exactly one face. Thus to prove the proposition we need only show that g(P, 𝒫) is independent of the configuration of e when O(P, 𝒫) contains two or more faces. For this purpose we show that two faces of O(P, 𝒫) are face-connected in N*(P, 𝒫 − {P_e}) if and only if they are face-connected in N*(P, 𝒫). Suppose there were two faces x, y in O(P, 𝒫) that were face-connected in N*(P, 𝒫) but not in N*(P, 𝒫 − {P_e}). Then there must exist an alternating sequence of bare faces and bare edges of P, say x = f_0, e_0, f_1, …, e_{n−1}, f_n = y, such that interior(f_i) and interior(f_{i+1}) are adjacent to interior(e_i) for 0 ≤ i < n, and e_j = e for some j, 0 ≤ j < n. Now for any edge z of a polyhedron Q, interior(z) is adjacent to the interiors of exactly two faces of Q. By assumption, face f is shared and e is an edge of f, so that interior(e) is adjacent to interior(f). Therefore P cannot have two bare faces whose interiors are adjacent to interior(e), contradiction. □

By Proposition 6, d(P, 𝒫) depends on the configurations of the faces of P, and is independent of the configurations of the edges and vertices of P. By Propositions 8 and 10, m(P, 𝒫) and g(P, 𝒫) are independent of the configurations of the edges belonging to shared faces of P. According to Corollary 3, g(P, 𝒫) is independent of the configurations of the vertices of P. We also see from Propositions 8 and 9 that m(P, 𝒫) is independent of the configurations of the vertices belonging to shared faces or shared edges. Based on this, we have the following definition and proposition:

Definition 3. An edge e of P is called trapped if it belongs to a shared face of P; otherwise, it is called free. A vertex v of P is called trapped if it belongs to a shared face or edge of P; otherwise, it is called free.

Proposition 11. m(P, 𝒫), g(P, 𝒫) and d(P, 𝒫) are independent of the configurations of the trapped edges and vertices of P.

Since simple(P, 𝒫) depends on m(P, 𝒫), g(P, 𝒫) and d(P, 𝒫) (specifically, simple(P, 𝒫) is 1 if and only if m(P, 𝒫), g(P, 𝒫) and d(P, 𝒫) are 1, 0 and 0, respectively), we also have

Corollary 4. The predicate simple(P, 𝒫) is independent of the configurations of the trapped edges and vertices of P.

Lemma 8. Each free, shared vertex of P contributes one component to N*(P, 𝒫).

Proof. Analogous to that of Lemma 3. □
Lemma 9. If the edges of P are all trapped then m(P, 𝒫) = 1.

Proof. Since all edges of P are trapped, P has at least one shared face, so that m(P, 𝒫) ≥ 1. Suppose m(P, 𝒫) > 1, so that B_s(P, 𝒫) is not connected. Now B(P) is topologically equivalent to a sphere and B_s(P, 𝒫) ⊆ B(P). Therefore B_s(P, 𝒫) not connected implies that B_b(P, 𝒫) = B(P) − B_s(P, 𝒫) contains a tunnel, so that B_b(P, 𝒫) contains a closed curve c that is not reducible to a point in B_b(P, 𝒫). Suppose c passes through the interiors of exactly n faces, edges and/or vertices of P, all of which are contained in B_b(P, 𝒫), say s_1, s_2, …, s_n, where n is as small as possible. Evidently n > 1. (The interior of a face, edge, or vertex is simply connected; hence a simple closed curve entirely contained in the interior of a single face, edge, or vertex is always reducible to a point.) It is not hard to see that at least one of the s_i's must be either an edge or a vertex. (Indeed, the interiors of two faces are always disjoint, so that c cannot pass directly from the interior of one face to the interior of another.) Suppose s_i is a vertex. Let e be any edge that contains s_i. If interior(e) meets B_s(P, 𝒫), by normality e must be contained in B_s(P, 𝒫); hence s_i ⊆ B_s(P, 𝒫), contradicting the fact that interior(s_i) ⊆ B_b(P, 𝒫). Hence interior(e) must be contained in B_b(P, 𝒫). Thus both the faces containing e are bare, so that e cannot be trapped, contradiction. A similar contradiction can be derived if s_i is an edge. Hence B_b(P, 𝒫) cannot contain a closed curve that is not reducible to a point in B_b(P, 𝒫); thus B_b(P, 𝒫) cannot contain a tunnel. Therefore B_s(P, 𝒫) is connected and hence m(P, 𝒫) ≤ 1. □

To conclude this section, we describe the general structure of an algorithm that computes the local topological parameters m(P, 𝒫), g(P, 𝒫), d(P, 𝒫) and simple(P, 𝒫). We treat the following cases, depending on the presence of shared faces in P.

Case 1: All faces of P are shared. In this case all the edges and vertices are trapped and no more computation is needed. The local topological parameters have the following values: m(P, 𝒫) = 1, g(P, 𝒫) = 0, d(P, 𝒫) = 1, simple(P, 𝒫) = 0.

Case 2: Some faces of P are bare but all edges of P are trapped. In this case, by Lemma 9, m(P, 𝒫) = 1. Since some faces are bare, by Theorem 5 we must have d(P, 𝒫) = 0. As regards g(P, 𝒫), note that since every edge is trapped, two bare faces cannot intersect in an edge; hence no two bare faces are face-adjacent, so the number of face components in O(P, 𝒫) is equal to the number of bare faces.
Hence by Theorem 5, g(P, 𝒫) is one less than the number of bare faces. Therefore simple(P, 𝒫) = 1 iff the number of bare faces is 1, and simple(P, 𝒫) = 0 otherwise.

Case 3: Some or all faces of P are bare and some edges of P are free. (Note that if all the faces are bare, all the edges must be free.) Here again, d(P, 𝒫) = 0. m(P, 𝒫) is the number of components in the union of the shared faces and free shared edges of P, plus the number of free shared vertices of P (Theorem 5 and Lemma 8). As for g(P, 𝒫), by Theorem 3 it is one less than the number of face components in O(P, 𝒫). (Unfortunately, the computation of m(P, 𝒫) and g(P, 𝒫) can be complex.) Finally, simple(P, 𝒫) = 1 if m(P, 𝒫) = 1 and g(P, 𝒫) = 0, and simple(P, 𝒫) = 0 otherwise.
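The case analysis above translates directly into code. The following minimal Python sketch assumes the counts it needs (numbers of bare faces, free edges, components of the shared boundary, and face components of O(P, 𝒫)) are computed elsewhere from the tiling data structure; all names are illustrative, not notation from the paper.

```python
# Sketch of the three-case computation of the local topological
# parameters (m, g, d, simple) described above. The counters are
# assumed to be supplied by the tiling data structure.
def local_topological_parameters(num_bare_faces, num_free_edges,
                                 num_face_components=None,
                                 num_m_components=None):
    # Case 1: all faces shared -> deleting P creates a cavity.
    if num_bare_faces == 0:
        return dict(m=1, g=0, d=1, simple=False)
    # Case 2: some bare faces, but every edge is trapped: no two bare
    # faces are face-adjacent, so each bare face is its own component.
    if num_free_edges == 0:
        g = num_bare_faces - 1
        return dict(m=1, g=g, d=0, simple=(g == 0))
    # Case 3: general case; m and g must be computed from the shared
    # boundary and from the face components of O(P, P), respectively.
    m, g = num_m_components, num_face_components - 1
    return dict(m=m, g=g, d=0, simple=(m == 1 and g == 0))
```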
3.4. An example: partial tilings by cubes

The computation of the measures of topological change is complicated even if 𝒫 is a partial tiling derived from a regular tessellation. In this section we describe an algorithm for computing these measures in the case of a partial cubic tiling.

In a cubic tessellation 𝐏, every cube P has twenty-six a-neighbors, eighteen b-neighbors and six c-neighbors. An a-neighbor that is not a b-neighbor will be called a vertex neighbor; a b-neighbor that is not a c-neighbor will be called an edge neighbor; and a c-neighbor will be called a face neighbor. We denote the cubes in the neighborhood of P as shown in Fig. 2. In this section we give an efficient algorithm that computes the local topological parameters of P in a partial tiling 𝒫 that is a subset of the cubic tessellation 𝐏. We first determine the configurations of the six face neighbors of P. (In the remainder of this section, the configuration of a cube specifies whether the cube is in 𝒫 or in 𝐏 − 𝒫. A cube will be referred to as black if it belongs to 𝒫; otherwise, it will be referred to as white. For brevity, m, g, d and simple will refer to m(P, 𝒫), g(P, 𝒫), d(P, 𝒫) and simple(P, 𝒫), respectively.) There are sixty-four possible configurations of the six face neighbors; they can be grouped into the following ten cases depending upon the number and relative positions of the black face neighbors of P:

Case 1: All six face neighbors are black. Only one face neighbor configuration belongs to this category. All edge and vertex neighbors are trapped. No further computation is necessary and the values of the local topological parameters are m = 1; g = 0; d = 1; simple = 0.

Case 2: Five face neighbors are black. Six face neighbor configurations belong to this category. All edge and vertex neighbors are trapped. No further computation is necessary and the values of the local topological parameters are m = 1; g = 0; d = 0; simple = 1.
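The 26-neighborhood classification described above can be expressed compactly: an a-neighbor differs from the cube by ±1 in one, two, or three coordinates, and the neighbor type is determined by how many coordinates differ. A minimal sketch (the offsets and names follow the standard cubic-grid convention, not notation from the paper):

```python
from itertools import product

# Classify the 26 neighbors of a cube at integer position (x, y, z):
# 6 face neighbors (one coordinate differs), 12 edge neighbors (two
# differ) and 8 vertex neighbors (three differ); face + edge neighbors
# are the 18 b-neighbors, and all 26 are a-neighbors.
def cube_neighbors(x, y, z):
    kinds = {1: "face", 2: "edge", 3: "vertex"}
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        if (dx, dy, dz) == (0, 0, 0):
            continue
        k = abs(dx) + abs(dy) + abs(dz)  # number of differing coordinates
        yield (x + dx, y + dy, z + dz), kinds[k]

counts = {}
for _, kind in cube_neighbors(0, 0, 0):
    counts[kind] = counts.get(kind, 0) + 1
print(counts)  # {'face': 6, 'edge': 12, 'vertex': 8}
```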
Fig. 2. Notation for the cubes in the neighborhood of P. (a) is the layer of cubes above P, (b) is the layer containing P, and (c) is the layer below P. The F's are face neighbors, the E's are edge neighbors, and the V's are vertex neighbors.

…VH), `Between (attribute X) and (attribute Y)' (M||H) (see Table 1) are supported. The feature generation process is not elaborated here; please refer to Ref. [29] for details.
The last requirement for constructing a FOHDEL rule is to select the appropriate operators. These can be of a pessimistic type like `&' (`and': MIN), in which case the rule is valid only if all features are detected and the overall evaluation is related to the worst fit value; or they can be of an optimistic type like `|' (`or': MAX), in which case just one fit among the possible features activates the rule with the best fit value. Other combinations between these two extremes, e.g. the weighted average, are also supported.
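The MIN/MAX/weighted-average aggregation behind these operators (Table 4 in Appendix B gives the exact formulas) can be sketched in a few lines of Python; the function names are illustrative:

```python
# Fuzzy aggregation operators over membership values in [0, 1].
# Table 4 writes AND as (a+b-|a-b|)/2 and OR as (a+b+|a-b|)/2,
# which are simply min and max written without a branch.
def fuzzy_and(a, b):          # pessimistic: the worst fit dominates
    return (a + b - abs(a - b)) / 2.0   # == min(a, b)

def fuzzy_or(a, b):           # optimistic: the best fit dominates
    return (a + b + abs(a - b)) / 2.0   # == max(a, b)

def fuzzy_avg(a, wa, b, wb):  # compromise between the two extremes
    return (a * wa + b * wb) / (wa + wb)

print(fuzzy_and(0.8, 0.3), fuzzy_or(0.8, 0.3), fuzzy_avg(0.8, 1, 0.3, 1))
# 0.3 0.8 0.55
```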
4. Automatic generation of the FOHDEL rule-base

The advantage of the proposed language lies not only in its compactness and readability but also in its generative power and high classification speed. This can be illustrated by the application of FOHDEL to the automatic generation and modification of a FOHDEL rule base under changing handwriting variability. Automatic rule generation schemes are discussed in Refs. [16,24,30]. We will concentrate here only on the aspects related to the language itself and its support for automatic rule generation.

The primary goal of such an automatic rule-base generation system is to create an optimal number of effective rules by utilizing a minimum amount of input information. In other words, the best features and the best rules should be selected and the rest of the unnecessary information should be discarded. Very frequently, structure identification methods are used to extract the knowledge automatically from the raw information data, and subsequently these are converted into a special set of rules.

Our first objective in automating the rule generation is to retain the `good' features and discard the redundant ones. The criterion for a `good' feature is that it should be invariant to writing style, i.e. give a good possibility answer within a class, and have a highly discriminating character relative to other features. The discrimination factor can be enhanced through the linguistic attributes. The second objective is to create an automatic mechanism which can change the rule-base and thus adapt to new writers and handwriting styles. Finally, the automatic rule generation method should ensure the consistency of the generated rule-base.

We chose a hybrid approach that combines the knowledge of an expert with statistically generated fuzzy rules, due to the time and computing power constraints imposed by our application [24]. The statistical information was extracted from the UNIPEN database [31], which offers five million characters of various handwriting styles. The statistical information of these features is in the form of the fuzzy feature space. The target of this step is to distinguish the qualitatively superior features with a positive or negative discriminatory power. The selection of good features is accomplished by histogram analysis
comprising the fuzzy average, fuzzy variance and fuzzy correlation matrix [24]. The fuzzy average measure provides a global feature distribution from the histogram analysis, while the fuzzy variance (σ) determines the redundancy of a particular feature. Hence the features with extremely high standard deviation are discarded because of their low discrimination power. Similarly, the low-average features are also discarded, unless the expert manually specifies that certain features should be negated. For proper utilization of an extracted feature, a corresponding attribute is very important: the mere presence of a feature in a pattern is not enough, so a quality measure of the extracted feature is represented by the linguistic attributes, e.g. `A character b has an excellent vertical line'. This is especially useful in ambiguous situations, e.g. when the training data is a compilation of various handwriting styles [31].

The initial rule base is generated from the extracted features. This rule base uses the `&' (AND) operator to combine different feature elements. Through an iterative process using the additional operators `||' (Between), `>' (More Than) and `<' (Less Than), the rules with neighboring attributes are merged to reduce the redundant information (see Appendix C). Very often a particular feature has different attributes for the same symbol; without attributes this feature could be discarded as partially relevant or irrelevant. Another aspect is the smooth variation from one symbol to a related neighbor, as shown in Figs. 2 and 3. The attributes define the border between the different
Algorithm: Automatic rule generation

Step 1: Acquire a prototype feature set of a large handwriting data collection.
Step 2: Compute the statistical parameters.
Step 3: Set the thresholds for the statistical parameters.
Step 4: Discard the redundant features by histogram analysis and correlation; determine the FOHDEL features and linguistic terms.
Step 5: Construct the FOHDEL rules.
(a) Automatic: Construct the FOHDEL rule-base with the extracted primitives and linguistic terms.
(b) Manual correction: Edit the rule-base by removing unnecessary rules and primitives.
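A minimal sketch of the statistical selection in Steps 2-4, assuming each feature's fit values over the training prototypes have already been extracted; the thresholds and names are illustrative placeholders, not values from the paper:

```python
import statistics

# Keep features whose fit values are stable within a class (low
# variance) and not uniformly weak (reasonable average), mirroring
# the histogram-based selection of Steps 2-4.
def select_features(fits_per_feature, max_sigma=0.25, min_mean=0.2):
    selected = {}
    for name, fits in fits_per_feature.items():
        mean = statistics.mean(fits)
        sigma = statistics.pstdev(fits)
        if sigma <= max_sigma and mean >= min_mean:
            selected[name] = (mean, sigma)  # kept, with its statistics
    return selected

fits = {"VL_L": [0.9, 0.85, 0.8], "HSL": [0.1, 0.9, 0.2]}
print(select_features(fits))  # only the stable feature VL_L survives
```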
Fig. 2. Selection of the linguistic attributes.

Fig. 3. Variation of `a' to `u'.
Fig. 4. (a) More than (`>') and (b) Less than (`<') modifiers.

Fig. 5. Between (`||') operator.
features. What ranges of values should A_L and U_L have to be able to discriminate `u' from `a'? By the automatic extraction process based on the proposed fuzzy variance measure, the features are extracted with their corresponding attributes and converted into FOHDEL rules. This effect was also confirmed by the test results, which showed an improvement in the recognition rate of 3%. The above steps for automatic rule generation are summarized in the Algorithm: Automatic rule generation.

The maximum frequency interval, or fuzzy average peak, is chosen as the central value of the corresponding linguistic term. The width of the membership function of this linguistic term is partially dependent on the fuzzy variance. If multiple peaks lie in the same range of values, the linguistic term is enhanced by enlarging its boundary with the help of the operator `between', `more than' or `less than' (see Figs. 4 and 5). If the peaks are not adjacent, the enlargement of the attribute is achieved through the operator `or', which reflects alternative solutions. Through this attribute aggregation the number of generated rules is reduced, and problems related to the selection between several features with high variance that correspond to alternative prototypes are solved.
5. Experimental results

After the rule generation phase, in the form of a fuzzy rule base, recognition of an unknown handwriting sample can be undertaken. (The rule base from Appendix C is used for the results.) The FOHDEL rule-base is first parsed by an LALR-2 parser. In order to reduce the number of inferences needed to recognize a symbol, the prototype rules in the fuzzy knowledge base are sorted into symbol categories. The global FOHDEL features pre-classify the unknown symbols into categories, e.g. symbols with pen-ups, wide characters, thin characters, complex characters (many specific features), simple characters (few specific features), etc. Each such category consists of several symbols, and each symbol is described by one or more FOHDEL rules. This permits a pre-classification based
on global feature evaluation and thus reduces the computation steps. After the identification of the possible categories of the unknown symbol, the final classification task is accomplished. For the final classification, aggregated features for the unknown symbol are evaluated and a possible output membership matrix is generated. The membership values of the unknown symbol are compared with each of the selected symbol categories, and the possible characters with maximum membership values are selected. The output of the classification process is a list of prototypes with their corresponding membership values. As the estimation of the character's membership of different features is not normalized to 100%, answers of the following type are possible: 87% to belong to `a' and 60% to `u' (see Figs. 2 and 4). This is the correct answer and reflects the following situation: the circle in the letter `a' is not closed and the recognition of the isolated character is ambiguous. The more the circle opens, the more the possibility of being an `a' decreases and the possibility of being a `u' increases. This ambiguity can be solved either by choosing `a' (using a winner-take-all approach) or by adding a second, context-dependent parser which decides the final solution from the given short list of possible characters with corresponding membership values.

The real-time on-line handwriting recognition was tested on various platforms. The character recognition time on SPARC-2 platforms was between 0.2 and 0.6 s, which is sufficient for real-time operation on personal digital assistants. The on-line performance of FOHRES in recognizing isolated handwritten characters by various users was better than that of the other methods tested. In Table 1
we summarize a comparison of FOHRES [32] with a neural network back-propagation algorithm (NN-BP) introduced in Ref. [17]. This performance evaluation is only qualitative and based on on-line recognition operation. Therefore, to establish a quantitative performance evaluation of FOHRES, the initial set of UNIPEN benchmark data was selected. This set consisted of data from altogether ten different writers and 100 character sets of 36 symbols (lower-case latin alphabet and numerals). The classification results on this UNIPEN test data set (100 × 36 × 10 = 36,000 symbols) are shown in Table 2. The training set was not included in the test set and amounted to less than 10% of the test set. FOHRES provides a membership matrix of the five best matches. Some of the falsely classified characters are shown in Table 3. The first column shows the input symbol, the second and third columns show the first and second priorities of the recognizer, and the last column shows the intended character.

Table 1
Comparison with other methods

Method         Recognition rate with UNIPEN data
               1st possibility    1st and 2nd possibilities
NN-BP [17]     94.3%              95.5%
FOHRES [32]    93.7%              96.1%

(FOHRES results use the proposed FOHDEL language; the UNIPEN data are available at ftp://sequoyah.ncsl.nist.edu/outgoing/unipen/train_r01_v05/train_r01_v05.tar.Z)

Table 2
UNIPEN test-set results

            Rejection rate    Recognition rate with 36,000 test symbols
                              One choice    Two choices    Three choices
Alphabet    1.5%              93.7%         96.1%          96.4%
Numerals    1.7%              95.1%         95.9%          97.2%

Table 3
Falsely classified unknown symbols (the input symbol is shown as an image in the original)

First recognition choice    Second recognition choice    Ground truth
l  76%                      e  59%                       e
y  75%                      g  45%                       g
b  80%                      h  67%                       h
p  34%                      h  12%                       b
w  56%                      u  44%                       u
w  46%                      n  23%                       m
a  78%                      None                         t
f  86%                      q  47%                       t

The experiments have shown that in
the case of given context information, the recognition rate was considerably improved by the supplied choice of characters, with the help of fuzzy logic.

6. Conclusions

We have presented the requirements and constraints that have influenced the development of the fuzzy pattern description language FOHDEL. Through the introduced attributed features and various operators, not only manual but also automatic rule-base generation became possible. The power of the presented language lies not only in the simple syntax and the large space of attribute, feature and operator combinations, but also in the possibility to fine-tune uncertainty. The next step of our research work is to apply the proposed language to describe and recognize cursive handwritten words. For this purpose some features have to be modified to suit the word lengths, and the repetitive features have to be accommodated so as to give, as in the case of the attributes, a relative position between the identified discriminatory features.

Appendix A. S- and Π-membership functions
The name S-function refers to its strong resemblance to the character `S' (Fig. 8, Eq. (6)):

S(x; a, b, c) =
  0                          for x ≤ a,
  2((x − a)/(c − a))²        for a ≤ x ≤ b,
  1 − 2((x − c)/(c − a))²    for b ≤ x ≤ c,
  1                          for x ≥ c.    (6)
Similarly, the Π-function is defined in terms of the S-function (Fig. 8, Eq. (7)): by putting two S-functions back to back, a Π-function is obtained.
Fig. 6. Or (`|') operator (μ_MAX).

Fig. 7. And (`&') operator (μ_MIN).
Fig. 8. S- and Π-functions.

Table 4
FOHDEL operators

Operator   Meaning                                    Function
&          `and' of two primitives                    μ_MIN(a, b) = (a + b − |a − b|)/2  (Fig. 7)
|          `or' of two primitives                     μ_MAX(a, b) = (a + b + |a − b|)/2  (Fig. 6)
#          `average' of primitives                    μ_AVG(a, w_a, b, w_b) = (a·w_a + b·w_b)/(w_a + w_b)
>          more than a linguistic attribute           μ_MT(a, x) = S(x; a − 0.5, a − 0.25, a) for x ≤ a, 1 otherwise  (Fig. 4a)
<          less than a linguistic attribute           μ_LT(a, x) = S(x; a, a − 0.25, a − 0.5) for x ≤ a, 0 otherwise  (Fig. 4b)
||         between two linguistic attributes          μ_BT(a, b, x) = S(x; a − 0.5, a − 0.25, a) for x ≤ a; 1 for a ≤ x ≤ b; a mirrored, decreasing S for x ≥ b  (Fig. 5)
C          separator between a linguistic term and the primitive
()         brackets are used to differentiate the hierarchies, as in the case of # operators
Π(x; b, c) =
  S(x; c − b, c − b/2, c)        for x ≤ c,
  1 − S(x; c, c + b/2, c + b)    for x ≥ c.    (7)

In S(x; a, b, c) the parameter b, b = (a + c)/2, is a crossover point. A crossover point is a value of x at which S attains equilibrium, i.e. 0.5. In Π(x; b, c), b is the bandwidth, i.e. the distance between the two crossover points of Π. At the center, the height of Π is unity; this demonstrates that Π represents normal fuzzy functions.
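Eqs. (6) and (7) translate directly into code; a minimal Python rendering (the function names are illustrative):

```python
# Zadeh's S-function, Eq. (6): rises smoothly from 0 at x=a to 1 at
# x=c, with crossover value 0.5 at b = (a+c)/2.
def s_function(x, a, b, c):
    if x <= a:
        return 0.0
    if x <= b:
        return 2.0 * ((x - a) / (c - a)) ** 2
    if x <= c:
        return 1.0 - 2.0 * ((x - c) / (c - a)) ** 2
    return 1.0

# Pi-function, Eq. (7): two S-functions back to back, peaking at
# unity at x=c, with bandwidth b between the two crossover points.
def pi_function(x, b, c):
    if x <= c:
        return s_function(x, c - b, c - b / 2.0, c)
    return 1.0 - s_function(x, c, c + b / 2.0, c + b)

# e.g. the linguistic attribute Medium of Table 5: Pi(x; 0.30, 0.50)
print(round(pi_function(0.5, 0.30, 0.50), 3))  # 1.0 at the center
```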
Appendix B. Elements of the FOHDEL language are given in Tables 4-6.
Table 5
FOHDEL linguistic attributes

Zero             Z     μ_Z(x)   = Π(x; 0.30, 0.00)    b = 0.30, c = 0.00
Very Very Low    VVL   μ_VVL(x) = Π(x; 0.30, 0.15)    b = 0.30, c = 0.15
Very Low         VL    μ_VL(x)  = Π(x; 0.30, 0.30)    b = 0.30, c = 0.30
Low              L     μ_L(x)   = Π(x; 0.30, 0.40)    b = 0.30, c = 0.40
Medium           M     μ_M(x)   = Π(x; 0.30, 0.50)    b = 0.30, c = 0.50
High             H     μ_H(x)   = Π(x; 0.30, 0.60)    b = 0.30, c = 0.60
Very High        VH    μ_VH(x)  = Π(x; 0.30, 0.70)    b = 0.30, c = 0.70
Very Very High   VVH   μ_VVH(x) = Π(x; 0.30, 0.85)    b = 0.30, c = 0.85
Excellent        E     μ_E(x)   = Π(x; 0.30, 1.00)    b = 0.30, c = 1.00
Table 6 FOHDEL features VSL
Vertical Straight Line
HSL
Horizontal Straight Line
STL
PS
Positive Slant
NS
Negative Slant
STR
O}L
O-Like Curve
C}L
C-Like Curve
HOL
D}L
D-Like Curve
A}L
A-Like Curve
HOR
U}L
U-Like Curve
HL}T
VL}L
HMN
Horizontal Motion
HL}M
VL}M
Vertical Line on the Middle
VMN
Vertical Motion
HL}B
VL}R
Vertical Line on the Right
PS}TL
Positive Slant on the Top-Left Positive Slant on the Middle-Left Positive Slant on the Bottom-Left Negative Slant on the Top-Left Negative Slant on the Middle-Left Negative Slant on the Bottom-Left O-Like Curve on the Top-Lef O-Like Curve on the Middle-Left O-Like Curve on the Bottom-Left D-Like Curve on the Top-Left D-Like Curve on the Middle-Left D-Like Curve on the Bottom-Left C-Like Curve on the Top-Left C-Like Curve on the Middle-Left
PS}TM
Horizontal Line on the Top Horizontal Line on the Middle Horizontal Line on the Bottom Positive Slant on the Top-Middle Positive Slant on the Middle-Middle Positive Slant on the Bottom-Middle Negative Slant on the Top-Middle Negative Slant on the Middle-Middle Negative Slant on the Bottom-Middle O-Like Curve on the Top-Middle O-Like Curve on the Middle-Middle O-Like Curve on the Bottom-Middle D-Like Curve on the Top-Middle D-Like Curve on the Middle-Middle D-Like Curve on the Bottom-Middle C-Like Curve on the Top-Middle C-Like Curve on the Middle-Middle
Walking Stick tilted to the Left Walking Stick tilted to the Right Hockey stick tilted to the Left Hockey stick tilted to the Right Vertical Line on the Left
PS}TR
Positive Slant on the Top-Right Positive Slant on the Middle-Right Positive Slant on the Bottom-Right Negative Slant on the Top-Right Negative Slant on the Middle-Right Negative Slant on the Bottom-Right O-Like Curveon the Top-Right O-Like Curve on the Middle-Right O-Like Curve on the Bottom-Right D-Like Curve on the Top-Right D-Like Curve on the Middle-Right D-Like Curve on the Bottom-Right C-Like Curve on the Top-Right C-Like Curve on the Middle-Right
PS}ML PS}BL NS}TL NS}ML NS}BL O}TL O}ML O}BL D}TL D}ML D}BL C}TL C}ML
PS}MM PS}BM NS}TM NS}MM NS}BM O}TM O}MM O}BM D}TM D}MM D}BM C}TM C}MM
PS}MR PS}BR NS}TR NS}MR NS}BR O}TR O}MR O}BR D}TR D}MR D}BR C}TR C}MR
A. Malaviya, L. Peters / Pattern Recognition 33 (2000) 119}131
129
Table 6 (Continued) C}BL A}TR A}ML A}BL U}TL U}ML U}BL HOL}L HOR}L STL}L STR}L LPb
C-Like Curve on the Bottom-Left A-Like Curve on the Top-Left A-Like Curve on the Middle-Left A-Like Curve on the Bottom-Left U-Like Curve on the Top-Left U-Like Curve on the Middle-Left U-Like Curve on the Bottom-Left Left sided Hockey stick on the Left Right sided Hockey stick on the Left Left sided Walking Stick on the Left Right sided Walking Stick on the Left Loop as in &b' (Upwards)
C}BM A}TM A}MM A}BM U}TM U}MM U}BM HOL}R HOR}R STL}R STR}R AR
C-Like Curve on the Bottom-Middle A-Like Curve on the Top-Middle A-Like Curve on the Middle-Middle A-Like Curve on the Bottom-Middle U-Like Curve on the Top-Middle U-Like Curve on the Middle-Middle U-Like Curve on the Bottom-Middle Left sided Hockey stick on the Right Right sided Hockey stick on the Right Left sided Walking Stick on the Right Right sided Walking Stick on the Right Aspect Ratio
Appendix C. Example set of FOHDEL rules

Rule a1: ((>MCO_MM) | (>MCO_TM)) & (HOR_R | VL_R | C_BR | NS_MR | NS_MM) & (<LCS_Y) & (>MCS_X)
Rule a2: (<LCS_Y) & ((>MCO_ML) | (>HCC_MM)) & (HOR_R | VL_R | C_MR | NS_MM | NS_MR)
Rule a3: (>MCO_MM) & (HOR_R | VL_R | C_MR) & (>HCS_Y) & (>VVHCS_X) & ZCVL_R
Rule b1: VL_L & (D_MR | D_BM | O_MR | O_BM | O_BR)
Rule b2: VL_L & MCLPb & (HOR_M | C_MR | C_MM) & VVLCSEG
Rule b3: VL_L & (A_BM | A_BR) & LCSEG & (HCNS_BM | HSL)
Rule b4: VL_M & O_BM
Rule b5: PS_ML & C_MM & (HL_M | U_MR)
Rule c1: C_MM & (<VLCSEG) & (<LCHSL)
Rule c2: C_MM & (<VLCSEG)
Rule d1: (C_ML | C_MM | O_MM) & (HOR_R | VL_R) & (>MCS_Y) & (<VHCS_X) & (<LCHSL)
Rule d2: STR & VL_R & PS_MM
Rule d3: ((>MCO_MM) | (>MCO_TM)) & (HOR_R | VL_R | C_BR | NS_MR | NS_MM) & (>LCS_Y) & (<HCS_X)
Rule e1: ZCVSL & ZCHSL & ZCPS & ZCO_L & ECE_X & ECE_Y & (<MCS_X) & (<LCS_Y)
Rule e2: ((>MCHL_M) | (>MCPS_MM)) & C_L & VLCSEG & (<VHCO_L)
Rule f1: LPb & STR
Rule f2: LPb & LPy & HL_M
Rule f3: ZCPEN & ZCO_L & ((VH||VVH)CVL_M) & ((M||H)CAR)
Rule f4: ZCPEN & VVHCC_L & VVHCLPb & HCAR
Rule g1: (O_TL | O_TM) & (>VHCLPy) & (<MCPS)
Rule g2: (O_TL | O_TM) & MCVL_M
Rule h1: VL_L & (A_BM | A_BR | A_MM) & (>HCE_X)
Rule h2: PS_MM & ECE_X & ZCVSL & HSL
Rule h3: ZCPEN & ((VH||VVH)CLPb) & ((VH||VVH)CA_L) & LCSEG
Rule i1: HOR & (<VLCSEG)
Rule i2: HOR_R & PS_ML
Rule j1: LPy & (<VLCSEG) & (<VHCAR)
Rule j2: LPy & PS_ML & ZCHSL
Rule k1: HCPS & HCPS & LCD_L & (VH||VVH)CHOR & VVHCSEG
Rule k2: (VH||VVH)CSTR & VVHCSEG & VLCA_L
Rule l1: LPb & (<VLCSEG)
Rule l2: (>VHCVL_R) & ((>HCVL_M) | (>HCPS)) & (<HCAR)
Rule l3: VL_L & (>HCLPb) & (<VVLCHSL) & VVLCSEG
Rule m1: A_ML & A_MM & A_MR & (<VLCSEG)
Rule n1: (A_ML | A_MR) & (<MCA_MM) & NS_ML
Rule n2: STL & A_L & VVLCSEG
Rule o1: O_MM & HL_T
Rule o2: O_MM & (<LCSEG)
Rule p1: VL_L & (D_MM | O_MM | D_TM)
Rule p2: VL_L & LPb & ((>HCD_MM) | (>HCD_TR) | (>HCD_MR))
Rule q1: O_TL & (VL_M | VL_L) & (PS_BM | PS_BR)
Rule q2: VL_M & (C_MM | O_MM | O_TM | C_TM) & PS_BR
Rule r1: VL_L & HL_T & (PS_ML | VL_M)
Rule r2: VL_L & STR
Rule s1: (>VHCS_X) & (<LCS_Y) & (<LCE_X) & (>VHCE_Y) & VVLCSEG
Rule t1: HOR & LCPEN & (HSL | PS_MM)
Rule u1: ZCO_L & ((U_ML & (HOR | (<VLCSEG))) | (U_MM & ZCSEG)) & (<VLCHSL)
Rule u2: ZCO_L & (VHCC_ML | VHCU_ML) & (HCNS & (>VHCE_Y))
Rule v1: U_BL & (>HCHL_T) & (<LCSEG)
Rule w1: U_ML & U_MR & (<LCSEG)
Rule x1: (>MCPS) & (>MCNS) & (>ZCPEN)
Rule x2: D_ML & C_MR & (>ZCPEN)
Rule y1: ZCPEN & (U_TM | U_TL) & (>VHCLPy)
Rule z1: ZCPEN & VVLCSEG & (<VLCS_X) & (<VLCS_Y) & (>VHCE_X) & (>VHCE_Y)

References

[1] K.S. Fu, Syntactic Pattern Recognition and Applications, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[2] R. Lopez de Mantaras, Reasoning under uncertainty and learning in knowledge based systems: imitating human problem solving behaviour, in: J. Jurada, R. Marks II, C.J. Robinson (Eds.), Computational Intelligence – Imitating Life, IEEE Press, New York, 1994, pp. 104–115.
[3] E.T. Lee, L.A. Zadeh, Note on fuzzy languages, Inform. Sci. 1 (1969) 421–434.
[4] M. Mizumoto et al., Some considerations on fuzzy automata, J. Comput. System Sci. 3 (1969) 409–422.
[5] S. Clergeau-Tournmire, R. Plamondon, Integration of lexical and syntactical language in a handwriting recognition system, Mach. Vision Appl. 8 (1995) 249–259.
[6] L. Zadeh, The key roles of information granulation and fuzzy logic in human reasoning, concept formulation and computing with words, Fifth FUZZ-IEEE, Louisiana, 1996, p. 1.
[7] S. Edelman, S. Ullman, T. Flash, Reading cursive script by alignment of letter prototypes, Int. J. Comput. Vision 5 (3) (1990) 303–331.
[8] B.A. Yanikoglu, P.A. Sandon, Off-line cursive handwriting recognition using style parameters, Tech. Rep. PCS-TR93-192, Dartmouth College, 1993.
[9] A.W. Senior, Off-line cursive handwriting recognition using recurrent neural networks, Ph.D. Thesis, Cambridge University, 1994.
[10] A.C. Downton, S. Impedovo, Progress in Handwriting Recognition, World Scientific, Colchester, 1996.
[11] S. Impedovo, P.S.P. Wang, H. Bunke, Automatic Bankcheck Processing, World Scientific, Singapore, 1997.
[12] M. Jamshidi, Fuzzy logic software and hardware, in: M. Jamshidi, N. Vadiee, T.F. Ross (Eds.), Fuzzy Logic and Control, Prentice-Hall, Englewood Cliffs, NJ, 1993, pp. 112–148.
[13] H. Surmann, K. Heesche, M. Hoh, K. Goser, R. Rudolf, Entwicklungsumgebung für Fuzzy-Controller mit neuronaler Komponente, Proceedings of VDE-Fachtagung: Technische Anwendungen von Fuzzy-Systemen, Dortmund, 1992, pp. 288–297.
[14] A.C. Shaw, A formal picture description scheme as a basis for picture processing systems, Inform. Control 14 (1969) 9–52.
[15] M. Parizeau, R. Plamondon, A fuzzy-syntactic approach to allograph modeling for cursive script recognition, IEEE-PAMI 17 (7) (1995) 707–712.
[16] A. Malaviya, On-line handwriting recognition with a fuzzy feature description language, GMD-Report No. 271, R. Oldenbourg Verlag, München/Wien, ISBN 3-486-24072-2, 1996.
[17] A. Malaviya, C. Leja, L. Peters, Multi-script handwriting recognition with FOHDEL, Proceedings of NAFIPS'96, IEEE Press, Berkeley, 1996, pp. 147–151.
[18] J. Feder, Plex languages, Inform. Sci. 3 (1971) 225–241.
[19] R. Narsimhan, Labelling schemata and syntactic description of pictures, Inform. Control 7 (1964) 151–179.
[20] G.F. DePalma, S.S. Yau, Fractionally fuzzy grammars with applications to pattern recognition, in: L.A. Zadeh et al. (Eds.), Fuzzy Sets and Their Applications to Cognitive Processes, Academic Press, New York, 1975.
[21] K.S. Fu, Stochastic automata, stochastic languages and pattern recognition, J. Cybernet. 1 (3) (1971) 31–49.
[22] A. Malaviya, L. Peters, Fuzzy feature description of handwriting patterns, Pattern Recognition 30 (10) (1997) 1591–1604.
[23] A. Malaviya, L. Peters, M. Theissinger, FOHDEL: a new fuzzy language for on-line handwriting recognition, FUZZ-IEEE, Orlando, 1994, pp. 624–629.
[24] A. Malaviya, C. Leja, L. Peters, A hybrid approach of automatic rule generation for handwriting recognition, in: A. Downton, S. Impedovo (Eds.), Progress in Handwriting Recognition, World Scientific, Singapore, 1997.
[25] A. Malaviya, R. Klette, A fuzzy syntactic method for on-line handwriting recognition, in: P. Perner, P.S.P. Wang, A. Rosenfeld (Eds.), Lecture Notes in Computer Science 1121, Advances in Structural and Syntactical Pattern Recognition, Sixth International Workshop SSPR'96, Springer, Berlin, Leipzig, 1996, pp. 381–392.
[26] A. Malaviya, L. Peters, Extracting meaningful handwriting features with fuzzy aggregation method, in: Third ICDAR, Montreal, 1995, pp. 841–844.
[27] E.T. Lee, Fuzzy tree automata and syntactic pattern recognition, IEEE-PAMI 4 (4) (1982).
[28] K.C. Yau, K.S. Fu, A syntactic approach to shape recognition using attributed grammars, IEEE-SMC 9 (6) (1979) 334–345.
[29] F. Ivancic, A. Malaviya, L. Peters, An automatic rule base generation method for fuzzy pattern recognition with multi-phased clustering, in: Proceedings of IEEE-KES-98, Adelaide, 1998.
[30] A. Malaviya, H. Surmann, L. Peters, Automatic generation of fuzzy rule base for online handwriting recognition, in: Second EUFIT, September, Aachen, 1994, pp. 1060–1065.
[31] I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, S. Janet, UNIPEN project of on-line data exchange and recognition benchmarks, in: 13th IEEE-ICPR, Israel, 1994, pp. 29–33.
[32] A. Malaviya, L. Peters, R. Camposano, A fuzzy online handwriting recognition system: FOHRES, in: Second International Conference on Fuzzy Theory and Technology: Control and Decision, Durham, NC, 1993.
About the Author – ASHUTOSH MALAVIYA is currently Chief Technology Officer at Yatra Corporation in Minneapolis. He has worked as a scientist at the German National Research Center for Information Technology (GMD) for the past six years. Previously he worked at the Fraunhofer Institute for Production and Construction Engineering in Berlin, Germany. His main research interests include fuzzy pattern recognition, computational intelligence and decision support systems. Dr. Malaviya obtained the Bachelor's degree in Electrical Engineering from the University of Roorkee, India. He obtained the Master's degree in Information Engineering and a Ph.D. in Computer Science from the Technical University of Berlin, Germany.

About the Author – LILIANE PETERS is a senior scientist at GMD-SET. Previously she worked as a Research Assistant in the Department of Electrical Engineering at the Technical University in Aachen, Germany. Her main interests are fuzzy systems and real-time aspects of computer vision. Dr. Peters received the Dipl.-Ing. degree in Electrical Engineering from the Technical University in Bucharest and the Ph.D. degree from the Technical University in Aachen, Germany.
Pattern Recognition 33 (2000) 133–147
Writer independent on-line handwriting recognition using an HMM approach
Jianying Hu*, Sok Gek Lim, Michael K. Brown
Bell Laboratories, Lucent Technologies, 700 Mountain Avenue, Murray Hill, NJ 07974, USA
Received 26 August 1998; accepted 12 January 1999
Abstract

In this paper we describe a Hidden Markov Model (HMM) based writer independent handwriting recognition system. A combination of signal normalization preprocessing and the use of invariant features makes the system robust with respect to variability among different writers as well as different writing environments and ink collection mechanisms. A combination of point oriented and stroke oriented features yields improved accuracy. Language modeling constrains the hypothesis space to manageable levels in most cases. In addition, a two-pass N-best approach is taken for large vocabularies. We report experimental results for both character and word recognition on several UNIPEN datasets, which are standard datasets of English text collected from around the world. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Handwriting recognition; Hidden Markov models; Invariant features; Segmental features; N-best decoding; UNIPEN
1. Introduction

Hidden Markov Models (HMMs) have been used with great success for stochastic modeling of speech for more than a decade now [1]. More recently they have also been applied to on-line handwriting recognition [2–5] with varying degrees of success. Some of the earliest works (e.g., Ref. [3]) consisted essentially of direct application of speech recognition systems to handwriting recognition, with only substitution of some simple point oriented features for the speech features.

In the last few years we have been experimenting with HMM-based methods for on-line handwriting recognition. For some characteristics, the results have been better than those achieved with speech recognition. For example, writer independence is achieved for handwriting recognition by a combination of preprocessing to remove much of the variation in handwriting due to varying personal styles and writing influences, and by
* Corresponding author. Tel.: +908-582-5660; fax: +908-582-7308. E-mail address:
[email protected] (J. Hu)
feature invariance to reduce sensitivity to the remaining variations. Both of these techniques are more mature than the corresponding methods for speech recognition. Indeed, except for pre-emphasis there is almost no processing before feature extraction in most speech recognition systems. Probably a major reason for the greater success of preprocessing in handwriting is the visual nature of the results, where uniformity can be easily seen even by untrained observers, while speech preprocessing may yield a normalized voice that would require great expertise to judge for uniformity. Invariance in handwriting features is also more readily obtained, by taking advantage of known geometrical invariance relations.

In this paper we first give a complete overview of an on-line handwriting recognition system which we have developed over the last few years. The main novel aspects of this system compared to other HMM-based systems [2–4,6,7] include: signal preprocessing with a new word normalization method; application of invariant features; combination of high-level and low-level features; stochastic modeling with HMMs that incorporates a dynamically evolving language model; and N-best decoding combined with two-stage delayed stroke modeling for
0031-3203/99/$20.00 © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00043-6
large vocabulary word recognition. We then present experimental results for both character and word recognition on the newly released UNIPEN data [8,9], making these results useful for future comparison to other systems.

The next section gives a brief overview of the preprocessing methods. A new method using the Hough transform is used; details are referenced. In Section 3 HMMs for handwriting recognition are described. The Viterbi decoding method is used for efficiency. A four-stage segmental means training method is discussed. Section 4 covers handwriting features of two types: point based, or local, features and stroke based, or high-level, regional features. We do not use any global features; global properties of the handwriting are addressed only in preprocessing. Delayed strokes and efficiency issues are discussed in Section 5. Experimental results are presented in Section 6 and we conclude in Section 7.
2. Preprocessing

Preprocessing of on-line handwriting can be classified into two types: noise reduction and normalization [10]. Noise reduction attempts to reduce imperfections caused mainly by hardware limits of electronic tablets, through operations such as smoothing, wild point reduction, hook removal, etc. We employ spline filtering for smoothing after the standard wild point reduction and dehooking procedures. A spline kernel is convolved with the input data points, which results in new sample points from a local approximate cubic spline fitting of the original data points [5]. Cusps (points of sharp directional change) are detected beforehand using a dynamic detection algorithm [5] and treated as boundary points during smoothing, so that the important information contained in cusps is preserved.

Normalization refers to the reduction of geometric variance in the handwriting data that is typically due to differences in writing style and rendering environment. Examples of normalization include scaling to a standard size, rotation of the text baseline, deskewing of slanted text, etc. The key to high performance writer-independent recognition is to either perform preprocessing well enough that the writer-dependent variations are removed, or to select features that are invariant with respect to writer-dependent variations. Since neither strategy is perfect by itself, some aspects of both are applied in our work. Different normalization procedures are used for words and isolated characters.
2.1. Word normalization

Four boundary lines are defined for handwritten words: the base line (joining the bottom of small lower-case letters such as `a'); the core line (joining the top of small lower-case letters); the ascender line (joining the top of letters with ascenders such as `l'); and the descender line (joining the bottom of letters with descenders such as `g'). For each input word these boundary lines are detected (the base and core lines are assumed to be present in all words, while the ascender and descender lines are not), then the word is rotated such that the base line is horizontal, and scaled such that the core height (the distance between the base and core lines) equals a predefined value. Deskewing is then applied to correct the overall slant of the word.

The two main aspects of word normalization, size and orientation normalization, are closely related: a good estimate of the core height relies on accurate estimation of the orientation. Both tasks become very difficult for unconstrained handwritten words, where the boundary lines are often not well defined, as shown in Fig. 1. Various techniques have been developed in the past to tackle this difficult problem. The main approaches include histogram-based methods [10–13], linear-regression-based methods [7], and model-based methods [14].

We have developed a new approach based on the Hough transform [15]. First, local maximum and minimum points along the y coordinate are extracted, then a modified Hough transform is applied to extract parallel lines corresponding to the boundary lines. In order to handle the problems caused by the large variation in natural handwriting, coupled with the sparse nature of extremum points in each word sample, one-dimensional Gaussian smoothing with adjustable variance is applied in the Hough space, followed by parameter refinement using linear regression. Compared with previous techniques, the Hough transform based method has the advantage that it simultaneously provides optimal estimates of both the orientation and the core height of an input word. Fig. 2 illustrates the estimated base line and core line of the word sample shown in Fig. 1.
Fig. 1. A handwritten word whose boundary lines are not well defined.
Fig. 2. A handwritten word and the estimated base line and core line.
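A simplified sketch of this idea: vote over candidate baseline angles, projecting the y-extrema onto the direction normal to each candidate line and smoothing the resulting offset histogram; the two strongest parallel offsets at the winning angle then give the base and core lines. This only illustrates the voting scheme with illustrative parameters; it is not the paper's exact algorithm:

```python
import numpy as np

# Vote in (angle, offset) space using the word's local y-extrema.
# The strongest angle gives the orientation; the two strongest
# smoothed offsets at that angle give the base and core lines.
def estimate_baselines(extrema, angles=np.radians(np.arange(-30, 31)),
                       offset_bin=1.0, sigma_bins=2):
    pts = np.asarray(extrema, dtype=float)
    best = None
    for theta in angles:
        # Signed distance of each point to a line of slope theta.
        offs = pts[:, 1] * np.cos(theta) - pts[:, 0] * np.sin(theta)
        bins = np.round(offs / offset_bin).astype(int)
        lo = bins.min()
        hist = np.bincount(bins - lo).astype(float)
        # 1-D Gaussian smoothing compensates for the sparse extrema.
        k = np.arange(-3 * sigma_bins, 3 * sigma_bins + 1)
        kernel = np.exp(-0.5 * (k / sigma_bins) ** 2)
        smooth = np.convolve(hist, kernel / kernel.sum(), mode="same")
        score = np.sort(smooth)[-2:].sum()   # two strongest parallel lines
        if best is None or score > best[0]:
            peaks = np.argsort(smooth)[-2:] + lo
            best = (score, theta, sorted(p * offset_bin for p in peaks))
    # (a real implementation would also enforce a minimum peak
    # separation and refine the estimates by linear regression)
    _, theta, (base, core) = best
    return theta, base, core
```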
The skew angle estimation algorithm is similar to that of Brocklehurst [16]. First, points of maximum and minimum y coordinates are identified and downward strokes (strokes going from a maximum point to a minimum point and longer than a threshold) are isolated. The straight piece of each downward stroke is then obtained by truncating the stroke from both ends until the angles formed by the two halves of the stroke differ by less than 10°. The estimated skew angle θ is the average of the skew angles of all the straight pieces, each weighted by the length of the piece. After the skew angle estimation, each point in the script is corrected by replacing x by x′ = x − y tan θ. This deskewing algorithm is simple and fast, and proves to be very effective.

2.2. Character normalization

When the input consists of isolated characters, the above normalization procedures can no longer be applied reliably. In this case we simply normalize the bounding box of the character to a standard size, and no orientation or skew correction is attempted.

2.3. Resampling

At the end of preprocessing, an equi-arc-length resampling procedure is applied to remove variations in writing speed. The sampling distance is proportional to the core height in the case of word recognition, and to the height of the bounding box in the case of character recognition. Local features such as the tangent slope angle are computed at each sample point. Each handwriting sample is then represented as a time-ordered sequence of observations in the form of feature vectors: O = [o_1, o_2, …, o_T]. The details of the features used in our system will be explained in Section 4.
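A minimal sketch of the deskew step and the equi-arc-length resampling described above (NumPy-based; the sampling step delta would in practice be tied to the core height or bounding-box height):

```python
import numpy as np

def deskew(points, theta):
    """Shear correction: replace x by x' = x - y * tan(theta)."""
    pts = np.array(points, dtype=float)   # copy, do not mutate input
    pts[:, 0] -= pts[:, 1] * np.tan(theta)
    return pts

def resample_equal_arc_length(points, delta):
    """Resample a stroke at equal arc-length spacing delta."""
    pts = np.array(points, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])  # arc length per point
    targets = np.arange(0.0, s[-1], delta)
    x = np.interp(targets, s, pts[:, 0])
    y = np.interp(targets, s, pts[:, 1])
    return np.column_stack([x, y])
```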
3. The recognition framework

The general recognition framework is composed of Hidden Markov Models (HMMs), representing strokes and characters, embedded in a grammar network representing the vocabulary. The main characteristic of the system is that segmentation and recognition of handwritten words are carried out simultaneously in an integrated process, which provides an ideal mechanism for handling cursive and mixed handwriting.

3.1. Model descriptions

A discrete hidden Markov model with N states and M distinct observation symbols v_1, v_2, …, v_M is described by the state transition probability matrix A = [a_ij]_{N×N}, the state-conditional observation probabilities b_j(k) = Pr(o_t = v_k | s_t = j), k = 1, …, M, where o_t is the observation
Fig. 3. A left-to-right HMM.
and s_t is the active state at time t, and the initial state distribution π_i = Pr(s_1 = i).

The subcharacter and character models we have adopted are so-called left-to-right HMMs without state skipping (a_ij = 0 for j > i + 1; π_i = 0 for i > 1), as shown in Fig. 3. We have selected this relatively simple topology because it has been shown to be successful in speech recognition, and there has not been sufficient proof that more complex topologies would necessarily lead to better recognition performance. Furthermore, in unconstrained handwriting, skipping of segments seems to happen most often for ligatures, which in our system is handled by treating ligatures as special `connecting' characters and allowing them to be bypassed in the language model.

Given an HMM as described above, the probability of an observation sequence O = [o_1, o_2, …, o_T] along a particular state sequence q = [s_1, s_2, …, s_T] is

P(O, q | λ) = P(O | q, λ) P(q | λ) = π_{s_1} ∏_{t=1}^{T} b_{s_t}(o_t) ∏_{t=1}^{T−1} a_{s_t s_{t+1}},

where λ is the set of model parameters. The overall probability of this observation sequence is P(O | λ) = Σ_q P(O | q, λ) P(q | λ). For left-to-right models, this sum is usually dominated by the optimal path probability, attained at q* = argmax_q P(O | q, λ) P(q | λ).
The last probability can be efficiently computed using the Viterbi algorithm. A more detailed description of HMMs can be found in Refs. [1,17].

3.2. Model definitions

The basic model units used in our system are subcharacter models called nebulous stroke models. A character model is the concatenation of several such stroke models, as specified in a character lexicon. The advantage of using subcharacter models is that sharing among characters can be easily implemented, simply by referring to the same stroke models in the character lexicon. A stroke could be any segment of a handwritten script. We do not attempt to impose rigidly defined, presegmented strokes as is the case in Ref. [2]. Instead, we make the system learn indefinite stroke models through training, hence the term nebulous. No manual segmentation is
involved in training; the stroke models are trained first on isolated characters and later on whole word samples, but never on isolated or segmented strokes (which would be impossible to obtain, since there is no specific definition of a stroke).

Using shared stroke models among characters has two advantages. The first is that it results in a reduced model set. The second is an extension of the first: since shared models also share training samples, the models can be adequately trained with fewer training samples. On the other hand, excessive sharing can also result in smearing of the model space and thus compromise the recognition performance. Our experiments based on the UNIPEN [8,9] data show that when there is a large amount of training data, better results are obtained with a smaller amount of sharing: the best performance was obtained with sharing only among different classes of the same character which are composed of the same strokes but rendered in different orders (e.g., the `t' with the cross bar drawn before the vertical hook and the `t' with the cross bar drawn after the vertical hook). If the model size becomes an issue (e.g., in applications run on thin clients such as PDA devices), various clustering techniques can be applied to cluster and merge similar stroke models to achieve more aggressive sharing.

Currently each stroke is modeled by one HMM state. The number of classes for each character and the number of strokes in each class model are selected by hand first and then adjusted empirically. As expected, our experiments show that assigning more states to character classes with more complicated shapes (e.g., `g') than to those with simpler shapes (e.g., `i') leads to better recognition results than assigning the same number of states to all classes. Currently the number of states per character ranges from 1 to 8. How to automate the process of choosing the number of classes, as well as the number of states per class, is a difficult problem and is a topic of future research.

Ligatures are attached to characters only during training. At recognition time they are treated as special, one-stroke `connecting' characters inserted between `core' characters, and can be skipped with no penalty. This treatment ensures that our system can handle mixed-style handwriting, as opposed to pure cursive only. The handling of delayed strokes is explained in detail in Section 5.

3.3. Decoding with grammar constraints

The left-to-right character models as described above are embedded in a grammar network, which can represent a variety of grammatical constraints, e.g., word dictionaries, statistical character N-gram models and context-free grammars. In the following we will explain the concepts in the case of a dictionary. Other forms of grammar constraints can be implemented similarly.
Fig. 4. A grammar network implementing a word dictionary.
Fig. 4 gives a simplified diagram of the grammar network representing a word dictionary. Here each arc represents a character (in the actual network a group of parallel arcs represents all character classes corresponding to the same character). Each path from the start node to the final node corresponds to a word in the dictionary. At recognition time, each arc is replaced by the corresponding HMM.

Given an observation sequence O, the task of recognition is to find the word W in the given dictionary which maximizes the posterior probability P(W | O) = P(O | W) P(W) / P(O). Since P(O) is the same for all words, and assuming that all words in the dictionary occur with equal probability, the problem is reduced to maximizing P(O | W) = Σ_{q ∈ Q_W} P(O | q) P(q), where Q_W is the set of state sequences that correspond to the word W. As mentioned before, when left-to-right models are used, P(O | W) is often dominated by the probability along the optimal path, so the problem can be further reduced to maximizing the term P(O | W, q*) = P(O | q* ∈ Q_W) P(q* ∈ Q_W). The solution to the last problem can be found efficiently via Viterbi search, as explained in more detail later.

The embedding of character HMMs with grammar constraints is implemented using an automatic evolutional grammar interpretation system (AEGIS). The three primary components of the evolutional grammar (EG) are a grammar arc table, a null arc (unlabeled grammar arc) table, and a recursive transition network (RTN) table which records the grammar rules for expanding the nonterminals [18]. Labeled grammar arcs can have either a non-terminal label representing another RTN subnetwork or an HMM label representing an HMM defined in another table called the lexicon, which describes the structure of the HMM for each character class. The EG initially contains a degenerate grammar consisting of only a start node, an end node and a single arc with a non-terminal label. During the evolution process, upon encountering a non-terminal arc, the arc is first replaced with the subnetwork it represents and is examined again. This process continues until all non-terminal references on the earliest arcs are eliminated. If a resulting label references an HMM, the appropriate model structure is built as indicated by the lexicon. Once all leading HMM references are built, HMM score integration proceeds.
As a score emerges from an HMM and needs to be propagated further in the network, additional evolution of the EG may occur. In this way, only those regions of the grammar touched by the HMM search are expanded. Beam search methods [19] can be used to limit the amount of grammar expansion. For any given dictionary, a grammar compiler [20] is used to convert a list of word specifications into an optimized network with shared prefixes and suffixes. In the case of isolated character recognition, the grammar network is simply composed of parallel arcs leading from the start node to the final node, each representing one character in the alphabet.

The Viterbi algorithm is used to search for the most likely state sequence corresponding to the given observation sequence and to give the accumulated likelihood score along this best path [1]. Suppose that for any state i, q_i(t) denotes the selected state sequence (hypothesis) leading to i at sample point t, and d_i(t) denotes the accumulated log-likelihood score of that hypothesis. O_t represents the observation at sample point t, and \ell_i(O_t) represents the log-likelihood score of O_t in state i. Since each character model is a left-to-right HMM with no state skipping, updating the hypotheses and likelihood scores within each character model is straightforward. Score propagation through grammar nodes is a little more complicated. Suppose g is a grammar node, and p(g) and s(g) denote the sets of preceding and succeeding character classes corresponding to the incoming and outgoing arcs, respectively. For each character class l, m(l) denotes the HMM used to model the class, h(l) denotes the initial state of the model, and f(l) denotes the final state of the model. At each sample point t during the Viterbi search, the maximum of all the accumulated scores at the final states of the preceding character models, also called incoming scores, is found and propagated to the initial state of each of the succeeding models, along with the corresponding state sequence. The operation is carried out as follows:

k = \arg\max_{l \in p(g)} d_{f(l)}(t-1),   (1)

and for each state j = h(l), l \in s(g):

q_j(t) = \begin{cases} q_{f(k)}(t-1) \cdot j, & \text{if } d_{f(k)}(t-1) > d_j(t-1), \\ q_j(t-1) \cdot j, & \text{otherwise,} \end{cases}   (2)

d_j(t) = \begin{cases} d_{f(k)}(t-1) + \ell_j(O_t), & \text{if } d_{f(k)}(t-1) > d_j(t-1), \\ d_j(t-1) + \ell_j(O_t), & \text{otherwise.} \end{cases}   (3)
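To make the propagation step concrete, the following sketch implements Eqs. (1)-(3) directly. The data layout (a node object with `preceding` and `succeeding` state lists, score and hypothesis dictionaries `d` and `q`, and a `log_lik` callback) is our own illustrative assumption, not the authors' implementation.

```python
import math

def propagate_through_grammar_node(g, d, q, log_lik, t, obs):
    """One step of score propagation through grammar node g (Eqs. (1)-(3)).

    d[s] / q[s] hold the accumulated log-likelihood and the state sequence
    of the best hypothesis ending in state s at sample point t-1;
    g.preceding / g.succeeding list the final / initial states f(l), h(l)
    of the character models on the incoming / outgoing arcs.
    """
    # Eq. (1): best incoming score over final states of preceding models.
    k = max(g.preceding, key=lambda f_l: d.get(f_l, -math.inf))
    d_in, q_in = d.get(k, -math.inf), q.get(k, [])

    new_d, new_q = {}, {}
    for j in g.succeeding:                      # j = h(l), l in s(g)
        stay = d.get(j, -math.inf)              # score of staying in j
        best_d, best_q = (d_in, q_in) if d_in > stay else (stay, q.get(j, []))
        # Eqs. (2)-(3): extend the winning hypothesis by state j and add
        # the local log-likelihood of the observation at sample point t.
        new_q[j] = best_q + [j]
        new_d[j] = best_d + log_lik(j, obs[t])
    return new_d, new_q
```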
3.4. Model training

Models are trained using the well-known iterative segmental training method based on Viterbi decoding [1]. Given a set of training samples, the HMM for each sample is instantiated by concatenating the HMMs for the appropriate character classes, ligatures and delayed strokes, which are in turn instantiated by concatenating the composing stroke models specified in the character lexicon. The training procedure is then carried out through iterations of segmentation of the training samples by the Viterbi algorithm using the current model parameters, followed by parameter re-estimation using the means along the path. The iterative procedure stops when the difference between the likelihood scores of the current iteration and those of the previous one is smaller than a preset threshold. Training is conducted first on isolated character samples and then on whole word samples. No manual segmentation is involved throughout the whole process. We define four different stages of training based on the training samples used and the granularity of training labels; a sketch of the overall loop follows the list.

1. Initial character training is carried out on isolated character samples including ligatures and delayed strokes. Each sample is labeled by the specific character class, thus the corresponding HMM instantiated consists of a linear sequence of states. The initial model parameters are computed using equal-arc-length segmentations of all samples. This stage essentially serves as a model initializer.

2. Lattice character training is also carried out on isolated character samples; however, here each sample is labeled only by the character truth, as opposed to the specific class. The HMM instantiated for each sample is no longer a linear model, but rather a lattice containing all classes used to model the same character. Starting from parameters obtained from initial character training, the Viterbi algorithm is used to "pick" the best class for each sample and update model parameters accordingly. Since this stage requires much less supervision than the first, it can easily accommodate large amounts of training data and thus capture variations among many different writers.

3. Linear word training is carried out on whole word samples. Similar to stage 1, each sample is labeled not only by the corresponding word, but also by the exact character class sequence representing the particular style of that sample, which is then converted to a unique stroke sequence according to the character lexicon. This highly constrained training can be used to provide a reliable starting point for whole word training. However, the labeling process is tedious and error prone.

4. Lattice word training is carried out on whole word samples. Similar to stage 2, each sample is labeled by the word truth only. Each word is represented by a lattice, or finite state network, that includes all possible character class sequences that can be used to model the word. The sequence that best matches the sample is chosen by the decoding algorithm and the resulting segmentation is used for parameter re-estimation. Again like stage 2, this process can be applied to a large number of training samples, and this is where most characteristics of cursive or mixed handwriting are captured.
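The overall training loop can be summarized as follows; `decode` and `reestimate` are hypothetical stand-ins for the Viterbi segmentation and parameter re-estimation steps described above, so this is a sketch of the procedure, not the authors' code.

```python
import math

def segmental_training(samples, models, decode, reestimate, tol=1e-3):
    """Iterative segmental (Viterbi) training loop of Section 3.4.

    samples: (observation sequence, label) pairs.
    decode: runs the Viterbi algorithm against the HMM instantiated for
        the label and returns (segmentation, log-likelihood).
    reestimate: updates model parameters from the pooled segmentations.
    """
    prev_score = -math.inf
    while True:
        segmentations, total = [], 0.0
        for obs, label in samples:
            seg, score = decode(models, obs, label)   # segment with current models
            segmentations.append((obs, seg))
            total += score
        models = reestimate(models, segmentations)    # means along the best paths
        if total - prev_score < tol:                  # stop on small improvement
            return models
        prev_score = total
```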
4. Features for on-line handwriting

After preprocessing and resampling, seven features are computed at each sample point. Two of these features are conventional ones widely used in HMM-based on-line handwriting recognizers: tangent slope angle and normalized vertical coordinate. The others are more complex features which are classified into two categories: two invariant features and three high-level features (with local representations). On top of these features computed at each sample point, the recognition system is further augmented with a character-level segmental feature designed to capture whole-character shape characteristics. These features are explained in detail in the following.

4.1. Invariant features

The concept of invariant features arises frequently in various machine vision tasks. Depending on the specific task, the geometric transformation ranges from simple rigid plane motion to general affine transformation, to perspective mapping, etc. [21]. In the case of handwriting recognition, the transformation of interest is the similitude transformation, which is a combination of translation, rotation and scaling. Although the variance caused by similitude transformations can be somewhat reduced through the normalization process in preprocessing, as discussed in Section 2, it is well known that normalization is never perfect, especially for unconstrained cursive or mixed handwriting. Therefore, including invariant features can greatly improve the robustness of the system. We introduce two features, normalized curvature and ratio of tangents, which are invariant under arbitrary similitude transformations. Similar features have appeared previously in the literature for planar shape recognition under partial occlusion [21].

A similitude transformation of the Euclidean plane \mathbb{R}^2 \to \mathbb{R}^2 is defined by

w = cUr + v, \quad U = \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}, \quad v = [v_x, v_y]^T,   (4)

where c is a positive scalar, representing a transformation that includes scaling by c, rotation by angle \varphi and translation by v. We regard two curves as equivalent if they can be obtained from each other through a similitude transformation. Invariant features are features that have the same value at corresponding points on different equivalent curves.

Suppose that a smooth planar curve P(t) = (x(t), y(t)) is mapped into \tilde P(\tilde t) = (\tilde x(\tilde t), \tilde y(\tilde t)) by a reparameterization t(\tilde t) and a similitude transformation, i.e. \tilde P(\tilde t) = cUP(t(\tilde t)) + v. Without loss of generality, assume that both curves are parameterized by arc length (the natural parameter), i.e. t = s and \tilde t = \tilde s. Obviously, d\tilde s = c\,ds. It can be shown [21] that the curvature (the reciprocal of the radius) at corresponding points of the two curves is scaled by 1/c, i.e. \tilde\kappa(\tilde s) = \kappa((\tilde s - \tilde s_0)/c)/c. It follows that

\frac{\dot{\tilde\kappa}(\tilde s)}{(\tilde\kappa(\tilde s))^2} = \frac{\dot\kappa((\tilde s - \tilde s_0)/c)}{(\kappa((\tilde s - \tilde s_0)/c))^2},   (5)

where \dot\kappa = d\kappa/ds and \dot{\tilde\kappa} = d\tilde\kappa/d\tilde s, thus eliminating the scale factor from the value of the ratio. Eq. (5) defines an invariant feature which we call normalized curvature.

The computation of the normalized curvature defined above involves derivative estimation up to the third order. Another set of invariants that requires lower orders of derivatives can be obtained by using the invariance of distance ratios between corresponding points. Consider again the two equivalent curves P(t) and \tilde P(\tilde t) defined above. Suppose P_1 and P_2 are two points on P(t) whose tangent slope angles differ by \theta; \tilde P_1 and \tilde P_2 are two points on \tilde P(\tilde t) with the same tangent slope angle difference. P and \tilde P are the intersections of the two tangents on P(t) and \tilde P(\tilde t), respectively (Fig. 5).

Fig. 5. Ratio of tangents.

Since angles, and hence turns of the curve, are invariant under the similitude transformation, it can be shown that if point \tilde P_1 corresponds to point P_1, then points \tilde P_2 and \tilde P correspond to points P_2 and P, respectively [21]. It follows from Eq. (4) that

|\tilde P_1 \tilde P| \, / \, |\tilde P_2 \tilde P| = |P_1 P| \, / \, |P_2 P|.   (6)

Eq. (6) defines another invariant feature which we call ratio of tangents. In order to use the ratio of tangents as an invariant feature in handwriting recognition, a fixed angle difference \theta = \theta_0 has to be used for all sample points in all scripts.
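As a rough illustration of Eq. (5), the sketch below estimates a normalized-curvature sequence from equal-arc-length resampled pen coordinates. The finite-difference derivatives are a crude substitute for careful derivative estimation, and the small regularizing constant is our own addition; treat this as an approximation, not the system's feature extractor.

```python
import numpy as np

def normalized_curvature(x, y):
    """Discrete approximation of the invariant in Eq. (5).

    x, y: arrays of equal-arc-length resampled pen coordinates.
    Finite differences stand in for higher-quality derivative estimates,
    so this is only a rough sketch.
    """
    dx, dy = np.gradient(x), np.gradient(y)          # first derivatives
    ddx, ddy = np.gradient(dx), np.gradient(dy)      # second derivatives
    kappa = (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5   # curvature
    dkappa = np.gradient(kappa)                      # third-order information
    eps = 1e-8                                       # guard against flat spots
    return dkappa / (kappa**2 + eps)                 # scale factor cancels
```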
Since in real applications we normally have only scattered sample points instead of a continuous script, and in general we cannot find two sample points whose slope angle difference is exactly \theta_0, a carefully designed interpolation algorithm is used to compute this feature [22]. Obviously the choice of \theta_0 greatly affects the tangent ratio values. If \theta_0 is too small, the feature tends to be too sensitive to noise. On the other hand, if \theta_0 is too large, the feature becomes too global, missing important local shape characteristics. This value is chosen heuristically, and \theta_0 = 10° is used in the current system.

To evaluate the invariant features described above accurately, high-quality derivative estimates up to the third order have to be computed from the sample points. We used spline filters of up to the 5th degree for this purpose and obtained satisfying results [22].

4.2. High-level features

The trade-off between high-level, long-range features and low-level, local features is common among many pattern recognition problems: the former are usually more powerful but less robust, while the latter are less informative but more reliable. In on-line handwriting recognition, some attempts have been made on approaches based on high-level features such as loops, crossings and cusps [2,23], where each handwritten word is converted into a sequence of primitives based on extraction of high-level features. These methods do not work well on unconstrained handwriting, because the results of high-level feature extraction tend to be highly erroneous due to the large shape variations in natural cursive handwriting, especially among different writers. More recently, many powerful systems have been developed using local features such as slope angle and curvature, sampled at equal-arc-length points along an input script [3,4,6,7,24]. Our system falls into this category. In such approaches, each sample point along the script is uniformly represented by a local feature vector. The sequence of time-ordered feature vectors is then processed through a network of statistical models (HMMs [3,4,5] or TDNNs [6,7]) and dynamic programming techniques are applied to find the best alignment between the input sequence and the models, thus providing the recognition result. These approaches, which we call point-oriented methods, are much more robust because no premature shape classification is attempted before the recognition of the whole word; in other words, segmentation and recognition are integrated into a single process. On the other hand, they suffer from the loss of information carried by high-level features. While features formed by temporally nearby points, such as cusps, are represented implicitly to a certain extent through the slope angle feature, those formed by points that are spatially or temporally far apart, such as crossings and loops, are not presented to the recognition system at all. It is clear that to further improve such systems, high-level long-range features need to be incorporated.
Several approaches have been proposed before to incorporate certain long-range features into a point-oriented system. Manke et al. [25] proposed using local bitmap images called context bitmaps to model temporally long-range but spatially short-range phenomena such as crossings. However, this method cannot model features such as loops, which are long-range both spatially and temporally.

We have developed a new method for combining high-level long-range features and local features. First, high-level features such as crossings, loops and cusps are extracted. Then, a localization procedure is applied to spread these high-level features over the neighboring sample points, resulting in local representations of nearby high-level features. These features are then combined with the usual local features at each sample point.

We extract three high-level features commonly used in handwriting recognition: cusps, crossings and loops. Examples of these features are shown in Fig. 6. Details of the extraction algorithm can be found in Ref. [26]. The difficult problem in using these features in an HMM-based system is how to represent them in the feature vector assigned to each sample point. The loop feature is relatively straightforward: we can simply assign a binary value to each sample point indicating whether or not the point is on a loop. However, the cusp and crossing features are not so trivial.
Fig. 6. Examples of high-level features.
If we apply the same scheme and define a feature with value 1 at a cusp (crossing) and 0 at all other points, the corresponding feature sequence will be composed of consecutive 0's with occasional occurrences of single 1's. These occasional 1's will most likely not make any difference in recognition because they are statistically insignificant. Our solution to this problem is to "spread" the influence of the crossing or cusp feature to its neighboring points. Instead of using the feature itself, we use a derived feature which measures the distance along the script from each sample point to the nearest cusp or crossing point. The value of this derived feature is 0 at the labeled high-level feature point and positive for all nearby points, where a larger value indicates that the point is further away from the nearest labeled sample point. If at any point there is no labeled high-level feature point within a predefined maximum distance w, then the value w is assigned for that point. The two features derived using this scheme are called cusp distance and crossing distance, respectively (a small sketch of this localization is given below). Although high-level features are not robust features in themselves, our experiments show that by combining their local representations with the "true" local features, more information is represented by the feature vectors while the robustness of the system is still preserved, resulting in improved recognition performance [26].

In our discrete HMM system, the seven features computed at each sample point are treated as independent of each other. During training, a separate probability distribution is estimated for each feature at each model state. At recognition time, a combined log-likelihood score of a particular feature vector being generated by a certain state is calculated as the weighted sum of the individual feature log-likelihood scores.
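The localization referenced above amounts to a capped distance transform along the script; the two-pass sketch below is our own illustration, measuring distance in sample points rather than true arc length along the trace.

```python
import numpy as np

def feature_distance(n_points, feature_indices, w):
    """Distance from each sample point to the nearest labeled high-level
    feature point (cusp or crossing), capped at w (Section 4.2).
    Illustrative sketch: distance is counted in sample points."""
    dist = np.full(n_points, float(w))
    for i in feature_indices:
        dist[i] = 0.0
    for i in range(1, n_points):                 # forward pass
        dist[i] = min(dist[i], dist[i - 1] + 1)
    for i in range(n_points - 2, -1, -1):        # backward pass
        dist[i] = min(dist[i], dist[i + 1] + 1)
    return np.minimum(dist, w)

# Example: crossings at points 3 and 10 of a 15-point script, w = 4.
print(feature_distance(15, [3, 10], 4))
```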
4.3. Segmental features

The purpose of introducing character-level segmental features, which are features computed over the whole segment representing a character, is to capture shape characteristics of the whole character. This is desirable because, after all, characters are the basic shape units used to construct a word. However, it is also difficult to implement in a system based on integrated segmentation and recognition. As we stressed before, in such a system no segmentation is carried out before recognition. Therefore, segmental features cannot be computed before recognition begins, since character boundaries are unknown. Our solution to this problem is a method called interleaved segmental matching, where a modified Viterbi algorithm is applied to generate partial segmentation hypotheses based on the local features (including local representations of high-level features), which are then augmented with segmental features computed on the hypothesized segments. The resulting system is called an augmented HMM system.

In Section 3.3 we explained how hypotheses are propagated through a grammar node g in a commonly used implementation (Eqs. (1)-(3)). In order to incorporate character shape information into the search, we augment the incoming scores with character matching scores, computed using global character shape models. To be more specific, let a_l(t_1, t_2) be the likelihood score of the segment from sample point t_1 to t_2 of the observation sequence being matched to character class l. The augmented incoming scores are defined as

\tilde d_{f(l)}(t-1) = d_{f(l)}(t-1) + a_l(t_1, t-1),

where t_1 = t - d_{m(l)}(t-1) and d_{m(l)}(t-1) is the number of sample points assigned to character model m(l) up to sample point t-1 (the character duration). Using these augmented scores, Eqs. (1)-(3) are modified simply by replacing d with \tilde d. It should be pointed out that in such an augmented HMM system, the state sequence resulting from the Viterbi search is no longer guaranteed to be the optimal sequence, because now the accumulated score of a sequence leading to state i at sample point t depends not only on the previous state, but also on how the previous state was reached (history reflected in the character duration). This dependence violates the basic condition for the Viterbi algorithm to yield an optimal solution. However, our experiments showed that the gain obtained by incorporating segmental features by far outweighs the loss in the optimality of the algorithm [27].

The segmental matching score a_l(t_1, t_2) is computed using a correlation-based metric inspired by the metrics developed by Sinden and Wilfong [28]. Given a segment a with coordinate sequence g_a = \langle (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \rangle, the instance vector of a is defined as v_a = (\bar x_1, \bar y_1, \bar x_2, \bar y_2, \ldots, \bar x_n, \bar y_n), where \bar x_i = x_i - x_a and \bar y_i = y_i - y_a for 1 \le i \le n, and (x_a, y_a) is the centroid of a. The normalized instance vector of a, u_a = v_a / |v_a|, is a translation- and scale-independent representation of segment a in \mathbb{R}^{2n}. Through a resampling procedure, a sample segment of arbitrary length can be mapped to a vector in \mathbb{R}^{2N}, where N is a predetermined number. The difference between any two sample segments a and b is defined as D(a, b) = 0.5(1 - u_a \cdot u_b), whose value ranges from 0 (when a and b are identical) to 1. The segmental matching score is then defined by a_l(t_1, t_2) = -w_a D(a_l, a_{t_1 t_2}), where a_l is the model segment for character class l, a_{t_1 t_2} is the segment from sample point t_1 to t_2 of the input sample sequence, and w_a is a weight factor.

In order to compute the above segment matching score, a single model segment needs to be derived for each character class. Let u_1, u_2, \ldots, u_M be the normalized instance vectors of a set of prototypes for the character class l (which can be easily obtained as side products of segmental training). A single model segment representing this class is represented by the vector w which minimizes the sum of distances from the individual prototypes. It can be easily shown that w = \bar u / |\bar u|, where \bar u = \sum_{i=1}^{M} u_i.
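A minimal sketch of the correlation-based segment metric follows, assuming segments arrive as (x, y) point arrays; the linear interpolation used to resample to a fixed dimension is an illustrative choice of ours, since the paper does not specify its resampling procedure.

```python
import numpy as np

def normalized_instance_vector(points, n_resample=20):
    """u_a for a segment: resample to a fixed number of points, subtract
    the centroid, and normalize to unit length (translation- and
    scale-independent representation)."""
    pts = np.asarray(points, dtype=float)
    t = np.linspace(0, 1, len(pts))
    t_new = np.linspace(0, 1, n_resample)
    resampled = np.column_stack(
        [np.interp(t_new, t, pts[:, 0]), np.interp(t_new, t, pts[:, 1])]
    )
    v = (resampled - resampled.mean(axis=0)).ravel()   # instance vector
    return v / np.linalg.norm(v)

def segment_difference(a_pts, b_pts):
    """D(a, b) = 0.5 (1 - u_a . u_b): 0 for identical shapes, up to 1."""
    u_a = normalized_instance_vector(a_pts)
    u_b = normalized_instance_vector(b_pts)
    return 0.5 * (1.0 - u_a @ u_b)
```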
An alternative method of incorporating character-level shape information is character matching in post-processing. In this method, an N-best decoding algorithm is applied with the local features to obtain the N most likely candidates. This stage is followed by a post-processing step, where character-level matching is applied to the character segments of the top candidates. The character matching score is then combined with the likelihood score to obtain a weighted-sum score for each candidate, and the candidate with the highest weighted-sum score is chosen as the final result. Our experiments showed that this approach gives inferior recognition performance compared to the interleaved segmental matching method [29]. The reason is that in the interleaved segmental method, local and segmental shape features are considered simultaneously during decoding to yield a solution that is favorable in both respects. In the post-processing approach, on the other hand, local features alone are considered during decoding, and as a result the segmentation hypotheses are dictated by the local features. During post-processing, only one particular segmentation for each candidate word is considered for character matching, although a different segmentation could result in a better character matching score. This premature limitation of the hypothesis space reduces the chance of the correct word being chosen.
5. A two-stage approach for large vocabulary word recognition

When recognition has to be carried out on a large vocabulary, the size of the grammar network increases, and so does the recognition time. This poses a serious problem especially when explicit delayed stroke modeling is used, as is the case in our system. Delayed strokes refer to strokes such as the cross in "t" or "x" and the dot in "i" or "j", which are sometimes drawn last in a handwritten word, separated in time sequence from the main body of the character. In most on-line handwriting recognition systems delayed strokes are first detected in preprocessing and then either discarded or used in post-processing. There are two drawbacks to these approaches. First, the information contained in delayed strokes is wasted or inadequately utilized, because stroke segmentation cannot be influenced by this data. Second, it is often difficult to detect delayed strokes reliably during preprocessing.

We have developed an approach where delayed strokes are treated as special characters in the alphabet. A word with delayed strokes is given alternative spellings to accommodate the different sequences that result when delayed strokes are drawn in different orders. During recognition, delayed strokes are considered inherent parts of a script just like normal characters and contribute directly to the scoring of the hypotheses. This is the technique which we
refer to as exact delayed stroke modeling. Good recognition results have been obtained using this approach in a small vocabulary task [5]. However, the exact delayed stroke modeling method has the disadvantage that it can dramatically increase the hypothesis space. Since each possible position of a delayed stroke causes a new branch in the grammar network, the total number of paths (hypotheses) potentially increases exponentially with the number of delayed strokes. Although the actual increase of the search space can be controlled to a certain extent through a beam search mechanism, the grammar network could still become unmanageable for a large vocabulary task. Furthermore, exact delayed stroke modeling can be achieved only with a full grammar representing a limited vocabulary. Since statistical grammars such as N-gram grammars used to represent unlimited vocabularies inevitably contain loops, exact delayed stroke modeling becomes infeasible when these grammars are applied.

We use a two-stage approach to solve this problem. First an N-best decoding algorithm is applied using simplified delayed stroke modeling to narrow down the choices of words. Then, in the detailed matching stage, a grammar network is constructed covering the top N candidates using exact delayed stroke modeling, and a best-path search is applied to this reduced network to find the final optimal candidate. In simplified delayed stroke modeling, a character with a delayed stroke is given two groups of parallel sub-paths: one group contains the patterns corresponding to the complete character written in sequence, with the delayed stroke immediately preceding or following the main body of the character; the other group contains only the patterns corresponding to the main body of the character, representing the scenario where the delayed stroke is not rendered until the end of the word. A shared delayed stroke module is attached at the end of the grammar network, which allows zero, one or multiple occurrences of all types of delayed strokes. To demonstrate the idea, partial diagrams of two different grammar networks, both containing the word "ate", are given in Fig. 7. Fig. 7a shows exact delayed stroke modeling, and Fig. 7b shows simplified delayed stroke modeling. Labels "dot", "crs" and "sls" refer to the dot in "i" or "j", the cross in "t" and the cross in "x", respectively. Each labeled arc actually represents a group of parallel arcs corresponding to different models for different ways of drawing the symbol; for the sake of clarity only one representative is shown in the examples. As can be seen, simplified delayed stroke modeling is inexact, e.g., not every path in the network corresponds to a "valid" rendering of a word; however, it is much more efficient and avoids the problem of exponential growth of the hypothesis space (a sketch of the spelling expansion used by exact modeling is given below).

Fig. 7. Example grammar networks.
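To illustrate how exact modeling inflates the hypothesis space, the sketch referenced above enumerates alternative spellings of a word by letting each delayed stroke either stay adjacent to its main body or migrate to the end of the word, in any order. The decomposition table is a made-up stand-in for the system's lexicon.

```python
from itertools import permutations

# Hypothetical decomposition: main body plus optional delayed stroke.
DECOMP = {"a": ("a", None), "t": ("t-body", "crs"), "e": ("e", None),
          "i": ("i-body", "dot"), "x": ("x-body", "sls")}

def exact_spellings(word):
    """All symbol sequences for exact delayed stroke modeling: each delayed
    stroke is written either right after its main body or after the word,
    with the trailing strokes appearing in any order."""
    spellings = set()
    delayed = [DECOMP[c][1] for c in word if DECOMP[c][1]]
    for mask in range(2 ** len(delayed)):          # attached vs. deferred
        seq, deferred, k = [], [], 0
        for c in word:
            body, stroke = DECOMP[c]
            seq.append(body)
            if stroke:
                (seq if mask >> k & 1 else deferred).append(stroke)
                k += 1
        for tail in permutations(deferred):
            spellings.add(tuple(seq) + tail)
    return spellings

print(len(exact_spellings("ate")))   # cross of "t" attached or trailing -> 2
```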
Since the interleaved segmental method requires more computation than conventional Viterbi decoding, for a large vocabulary task we propose to apply it only at the detailed matching stage as well.

N-best decoding has become a standard technique in speech recognition and, more recently, handwriting recognition. Many algorithms have been developed over the years. Most of these algorithms can be classified into three categories: those that produce the N best candidates in one forward, frame-synchronous, Viterbi-like search [30-32]; those that are based on A*, or stack decoding [33,34]; and the tree-trellis based approach [35]. We have developed a new algorithm based on the tree-trellis approach [35]. The difference is that instead of always retrieving the top N candidates, where N is a predefined value, our algorithm retrieves a variable number of candidates depending on the distribution of the top scores, thus achieving higher efficiency [29].

6. Experiments on UNIPEN data

Experiments have been carried out using the UNIPEN data. UNIPEN is an international project for on-line handwriting data exchange and recognizer benchmarks which was initiated at the 12th ICPR in 1994 [8]. Several standard training and test sets of the UNIPEN data have just been made available, but we have not yet seen any report of experimental results using those datasets. The UNIPEN data is organized into 11 categories. We have used 4 of them in our experiments: 1a (isolated digits), 1b (isolated upper-case letters), 1c (isolated lower-case letters) and 6 (isolated cursive or mixed-style words in mixed cases).

6.1. Character recognition
Character recognition experiments were carried out using the 1a-1c categories of UNIPEN training data release train_r01_v06 [9]. Some cleaning was carried out to remove samples that were mislabeled (these constitute about 4% of the total data). The cleaned data sets were then divided into two parts: around 2/3 of the samples in each category were used for training and the remaining 1/3 for testing. The partitions were made such that there are no shared writers between the training and test sets.

For character recognition, models were trained using training stage 1 (initial character training) followed by stage 2 (lattice character training), as described in Section 3.4. Initial character training was carried out using 10 samples per character class, and lattice character training was then carried out on all character training samples. The results are tabulated in Table 1. The whole recognition process (including preprocessing) takes about 0.3 seconds per character on a 180 MHz SGI station.

Table 1
Character recognition results

Test set           Training samples   Test samples   Recognition rate (%)
                                                     Top 1     Top 5
1a (digits)        10,098             2684           96.8      100.0
1b (upper-case)    13,610             7351           93.6      99.4
1c (lower-case)    31,956             12,498         85.9      98.4

6.2. Word recognition

Word recognition experiments were carried out using category 6 of the latest UNIPEN training and development test data releases train_r01_v07 and devtest_r01_v02. No cleaning was carried out for either training or testing. Models were trained using training stages 1 (initial character training), 2 (lattice character training) and 4 (lattice word training), as described in Section 3.4. Stage 3 (linear word training) was skipped at this point because we did not have the time and resources for the massive manual labeling of word samples. Stages 1 and 2 were carried out in the same way as in the character recognition experiments, using the same datasets. The models were then further trained using around 45,000 word samples from about 320 writers randomly drawn from train_r01_v07.

Three different word test sets were composed, of vocabulary sizes 500, 1000 and 2000, respectively. All test sets contain only samples from writers not used in training. In order to examine the effect of vocabulary and sample set sizes on recognition, as well as to reflect the sample distribution of the complete devtest_r01_v02 set, the following constraints were imposed in composing the test sets. First, the test sets contain samples from the
same set of writers. Second, each vocabulary is a strict subset of the vocabulary of the next larger size. Third, the samples in each set were chosen such that the word length (in terms of number of characters) histogram of each set bears a profile similar to that of the word length histogram of the complete devtest_r01_v02 set. Fig. 8 shows the word length histogram of the complete devtest_r01_v02 set.

Fig. 8. Word length histogram for UNIPEN data set devtest_r01_v02.

Fourth, the percentages of word samples written in lower case only, mixed case and upper case only for each test set were kept close to those of the complete devtest_r01_v02 set, which are given in Table 2.

Table 2
Distribution of UNIPEN data set devtest_r01_v02 in terms of cases

Exclusively lower-case   Mixed case   Exclusively upper-case
82.5%                    5.0%         12.5%

Finally, the size of each test set was made larger than 1500 samples to ensure that the recognition results are statistically meaningful. In the end, samples from 100 writers (out of the 120 writers in devtest_r01_v02 that are not in train_r01_v07) were used to compose the three test sets satisfying these constraints.

Recognition was carried out in two stages: N-best decoding with simplified delayed stroke modeling and no segmental matching, followed by re-scoring using exact delayed stroke modeling and segmental matching. The results are summarized in Table 3.

Table 3
Word recognition results on small to medium-size vocabularies

Vocabulary size   Test samples   Recognition rates (%)
                                 Top 1     Top 10
500               1685           91.8      98.4
1000              2447           90.5      97.6
2000              3267           87.2      96.5

A close examination of the errors reveals that the recognizer does a much better job on exclusively lower-case samples than on samples containing upper-case letters. For example, on the 1000-word vocabulary test set, the top 1 recognition rate on exclusively lower-case samples is 91.5%, while that on samples containing upper-case
letters is only 84.1%. We believe there are two reasons for this discrepancy. First, the lower-case letter models are better trained, since the majority (75%) of the training word samples are in lower case only, and another 18% of the samples contain only one upper-case letter at the beginning. Although there are enough upper-case samples for character training (stages 1 and 2), there are not adequate instances of upper-case letters in lattice word training (stage 4), thus the upper-case letter models are not well trained in the context of words. One possible way to alleviate this problem is to artificially increase the instances of upper-case letters in the training set. Second, the current size and orientation normalization algorithm tends to perform poorly on words containing upper-case letters. This is because one basic assumption in the boundary line detection algorithm is that the base and core lines are present and dominant in all words, while the ascender and descender lines may or may not be present (see Section 2.1). This assumption is often violated in words containing upper-case letters, especially those written in upper case only (in which case the core line is not present at all and instead the ascender line is always present). How to solve this problem and improve the robustness of the size and orientation normalization algorithm remains a topic of future research.

6.3. Large vocabulary experiments

Preliminary experiments have been carried out on large vocabularies of up to 20,000 words. Speed is a major concern in large vocabulary recognition when full dictionaries are used. We tested two different methods to reduce the computational load. Since vocabulary size mainly affects the first stage of N-best decoding, both methods modify this stage only. The rescoring stage remains the same as described in the previous section. The idea behind both methods is to apply somewhat "coarser" processing for N-best decoding to reduce the computational load without degrading the performance too much. This seems feasible because for the first stage of processing we only need to make sure that the correct answer is among the top N, as opposed to being precisely at the top.

In the first method, "coarse" HMM models are defined where the number of states used to model each character
class is reduced to no more than two. These models are trained in the same way as the original models and then used in the first stage N-best decoding procedure. The rationale is that reducing the number of states per character model reduces the network size and hence the amount of computation. Unfortunately, our experiments showed that this is not quite the case. Although the total network size is reduced by roughly half, N-best decoding takes even longer to complete while giving worse recognition results. Careful analysis led to the following explanation. Since a beam search mechanism is used in our system to prune the network during decoding, the computational load is determined by the number of active states, as opposed to the total number of states in the network. When only two states are used per model, the models provide rather poor discrimination among different characters. As a result, many more competing paths have very similar accumulated scores even after decoding has proceeded far along, causing more paths, and thus more HMMs, to remain active until the very end. Thus the computational load, which depends on the number of active states, is increased despite the smaller size of the network.

In the second method, down-sampling is applied in feature extraction to reduce the number of observations to be processed for each sample word. In our experiments the average number of sample points per character in N-best decoding was reduced to about 9 from about 14, which was the sampling rate used for rescoring. This turned out to be much more effective than the first method: the decoding time was reduced by roughly one third with only a small degradation in performance. This method was tested using three dictionaries, of sizes 5000, 10,000 and 20,000, respectively. For all three tasks the test set is the same as the one used for the 1000-word vocabulary test, and each dictionary was composed by adding randomly chosen words from the standard UNIX dictionary to the original 1000 words. Therefore, unlike the previous group of experiments, here not all words in the dictionaries have instances in the test set. However, these tests still provide valid indications of the system's performance on large vocabulary tasks. The recognition results are summarized in Table 4.

Table 4
Word recognition results on large vocabularies

Vocabulary size   Test samples   Recognition rates (%)
                                 Top 1     Top 10
5000              2447           83.2      94.8
10,000            2447           79.8      92.9
20,000            2447           76.3      91.0

To get a rough idea of the system's performance as a function of vocabulary size, we plot all the recognition results in Fig. 9. Bear in mind that this is only a rough plot, since the test set varies for different vocabulary sizes (as specified in Tables 3 and 4).

Fig. 9. Word recognition rates at different vocabulary sizes.

The plot shows that the performance degradation for the top one candidate is much more severe than that for the top ten. This indicates that the second stage rescoring still has much room for improvement, possibly through more careful modeling, better features, and more sophisticated training techniques such as discriminative training [36,37]. Speed is still a problem for large vocabulary tasks: it currently takes about two minutes on average
to recognize a word with the 20,000-word dictionary on a 180 MHz SGI station. Obviously, down-sampling can be applied only to a limited extent. Other techniques, such as smarter pruning, will have to be investigated. Another possibility is to use statistical language models such as variable-length N-grams [38,39], which require much smaller networks than complete large dictionaries, for N-best decoding.
7. Conclusions

These are early results on the UNIPEN data. We think the results are good, but we have not yet seen any published UNIPEN results from other researchers, so we look forward to seeing comparable reports. Error rates for the character recognition tests reported in Table 1 are accurate to within less than 1% with 95% confidence. Test 1a has 10 classes, and tests 1b and 1c are 26-class problems. We think the error rate for test 1c is higher because there is more shape confusion among lower-case letters than among upper-case ones.

The word recognition test sets are of modest size for small vocabularies due to the nature of the UNIPEN data. The database typically contains only one or a very small number of examples of each word. Examination of the
test data indicates that it is a challenging test because the writing style and legibility vary widely among the 100 different writers. Furthermore, the test sets were taken directly from devtest_r01_v02 with no cleaning. A quick examination shows that they indeed contain quite a few mislabeled or misspelled samples. These results are comparable to those published for other testing data, but of course there is no way to make a direct comparison, due to probable differences in the quality and other characteristics of the data.

Future work will be directed primarily toward refinement of the character models and training methods to improve recognition performance. Improvements in speed are also needed. Toward this end we are considering a major overhaul of the HMM decoder and automatic grammar reconstruction for the N-best two-stage processing method. We would also like to investigate more sophisticated pruning methods, as well as the use of statistical language models for the first stage N-best decoding. Our goal is real-time performance on very large vocabularies.
8. Summary

Hidden Markov models (HMMs) have been used with great success for stochastic modeling of speech for more than a decade now. More recently they have also been applied to on-line handwriting recognition with varying degrees of success. In this paper we describe an on-line handwriting recognition system which we have developed over the last few years. The main novel aspects of this system compared to other HMM-based systems include: signal preprocessing with a new word normalization method; application of invariant features; combination of high-level and low-level features; stochastic modeling with HMMs that incorporates a dynamically evolving language model; and N-best decoding combined with two-stage delayed stroke modeling for large vocabulary word recognition. We present experimental results for both character and word recognition on the newly released UNIPEN data. Character recognition experiments were carried out using dataset train_r01_v06. Word recognition experiments were carried out using datasets train_r01_v07 and devtest_r01_v02, with results for dictionary sizes varying from 500 to 20,000 words. The use of the standard UNIPEN datasets makes these results useful for future comparison with other systems.
References

[1] L.R. Rabiner, B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[2] S. Bercu, G. Lorette, On-line handwritten word recognition: an approach based on hidden Markov models, in: Proceedings of the Third International Workshop on Frontiers in Handwriting Recognition, Buffalo, USA, May 1993, pp. 385-390.
[3] J. Makhoul, T. Starner, R. Schwartz, G. Chou, On-line cursive handwriting recognition using speech recognition methods, in: Proceedings of IEEE ICASSP'94, Adelaide, Australia, April 1994, pp. v125-v128.
[4] K.S. Nathan, H.S.M. Beigi, J. Subrahmonia, G.J. Clary, H. Maruyama, Real-time on-line unconstrained handwriting recognition using statistical methods, in: Proceedings of IEEE ICASSP'95, Detroit, USA, June 1995, pp. 2619-2622.
[5] J. Hu, M.K. Brown, W. Turin, Handwriting recognition with hidden Markov models and grammatical constraints, in: Proceedings of the Fourth IWFHR, Taipei, Taiwan, December 1994, pp. 195-205.
[6] S. Manke, U. Bodenhausen, NPen++: a writer independent, large vocabulary on-line cursive handwriting recognition system, in: Proceedings of the Third ICDAR, Montreal, Canada, August 1995, pp. 403-408.
[7] M. Schenkel, I. Guyon, D. Henderson, On-line cursive script recognition using time delay neural networks and hidden Markov models, Machine Vision and Applications, Special Issue on Cursive Script Recognition, vol. 8, 1995.
[8] I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, S. Janet, UNIPEN project of on-line data exchange and recognizer benchmarks, in: Proceedings of the 12th ICPR, Jerusalem, Israel, October 1994, pp. 29-33.
[9] L. Schomaker, The UNIPEN project, http://unipen.nici.kun.nl, 1998.
[10] W. Guerfali, R. Plamondon, Normalizing and restoring on-line handwriting, Pattern Recognition 26 (3) (1993) 419-431.
[11] D.J. Burr, A normalizing transform for cursive script recognition, in: Proceedings of the Sixth ICPR, vol. 2, Munich, October 1982, pp. 1027-1030.
[12] M.K. Brown, S. Ganapathy, Preprocessing techniques for cursive script recognition, Pattern Recognition 16 (5) (1983) 447-458.
[13] H.S.M. Beigi, K. Nathan, G.J. Clary, J. Subrahmonia, Size normalization in on-line unconstrained handwriting recognition, in: Proceedings of ICASSP'94, Adelaide, Australia, April 1994, pp. 169-172.
[14] Y. Bengio, Y. LeCun, Word normalization for on-line handwritten word recognition, in: Proceedings of the 12th ICPR, vol. 2, Jerusalem, October 1994, pp. 409-413.
[15] A.S. Rosenthal, J. Hu, M.K. Brown, Size and orientation normalization of on-line handwriting using Hough transform, in: Proceedings of ICASSP'97, Munich, Germany, April 1997.
[16] E.R. Brocklehurst, P.D. Kenward, Preprocessing for cursive script recognition, NPL Report DITC 132/88, November 1988.
[17] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989) 257-286.
[18] M.K. Brown, S.C. Glinski, Stochastic context-free language modeling with evolutional grammars, in: Proceedings of ICSLP'94, vol. 2, September 1994, pp. 779-782.
[19] B.T. Lowerre, D.R. Reddy, The HARPY speech understanding system, in: W.A. Lea (Ed.), Trends in Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1980, pp. 340-360, Chapter 15.
[20] M.K. Brown, J.G. Wilpon, A grammar compiler for connected speech recognition, IEEE Trans. Signal Process. 39 (1) (1991) 17-28.
[21] A.M. Bruckstein, R.J. Holt, A.N. Netravali, T.J. Richardson, Invariant signatures for planar shape recognition under partial occlusion, CVGIP: Image Understanding 58 (1993) 49-65.
[22] J. Hu, M.K. Brown, W. Turin, Invariant features for HMM based handwriting recognition, in: Proceedings of ICIAP'95, Sanremo, Italy, September 1995, pp. 588-593.
[23] S.A. Guberman, V.V. Rozentsveig, Algorithm for the recognition of handwritten text, Automat. Remote Control 37 (5) (1976) 751-757.
[24] J. Hu, M.K. Brown, W. Turin, HMM based on-line handwriting recognition, IEEE Trans. PAMI 18 (10) (1996) 1039-1045.
[25] S. Manke, M. Finke, A. Waibel, Combining bitmaps with dynamic writing information for on-line handwriting recognition, in: Proceedings of the 12th ICPR, Jerusalem, October 1994, pp. 596-598.
[26] J. Hu, A.S. Rosenthal, M.K. Brown, Combining high-level features with sequential local features for on-line handwriting recognition, in: Proceedings of ICIAP'97, Florence, Italy, September 1997, pp. 647-654.
[27] J. Hu, M.K. Brown, W. Turin, Use of segmental features in HMM based handwriting recognition, in: Proceedings of IEEE SMC'95, Vancouver, Canada, October 1995, pp. 2778-2782.
[28] F. Sinden, G. Wilfong, L. Ruedisueli, On-line recognition of handwritten symbols, IEEE Trans. PAMI 18 (9) (1996).
[29] J. Hu, M.K. Brown, On-line handwriting recognition with constrained N-best decoding, in: Proceedings of the 13th ICPR, vol. C, Vienna, Austria, August 1996, pp. 23-27.
[30] C.-H. Lee, L.R. Rabiner, A frame-synchronous network search algorithm for connected word recognition, IEEE Trans. ASSP 37 (11) (1989) 1649-1658.
[31] V. Steinbiss, Sentence-hypotheses generation in a continuous-speech recognition system, in: Proceedings of EUROSPEECH'89, 1989, pp. 51-54.
[32] R. Schwartz, Y.-L. Chow, The N-best algorithm: an efficient and exact procedure for finding the N most likely sentence hypotheses, in: Proceedings of ICASSP'90, 1990, pp. 81-84.
[33] F. Jelinek, L.R. Bahl, R.L. Mercer, Design of a linguistic statistical decoder for the recognition of continuous speech, IEEE Trans. Inform. Theory IT-21 (3) (1975) 250-256.
[34] D. Sturtemant, A stack decoder for continuous speech recognition, in: Proceedings of the Speech and Natural Language Workshop, October 1989, pp. 193-198.
[35] F.K. Soong, E.-F. Huang, A tree-trellis based fast search for finding the N best sentence hypotheses in continuous speech recognition, in: Proceedings of ICASSP'91, Toronto, Canada, May 1991, pp. 705-708.
[36] W. Chou, B.H. Juang, C.H. Lee, Segmental GPD training of HMM based speech recognizers, in: Proceedings of ICASSP'92, San Francisco, CA, March 1992.
[37] B.H. Juang, P.C. Chang, Discriminative training of dynamic programming based speech recognizers, IEEE Trans. Speech Audio Process. SAP-1 (2) (1993) 135-143.
[38] I. Guyon, F. Pereira, Design of a linguistic postprocessor using variable memory length Markov models, in: Proceedings of the Third ICDAR, Montreal, Canada, August 1995, pp. 454-457.
[39] J. Hu, W. Turin, M.K. Brown, Language modeling using stochastic automata with variable length contexts, Comput. Speech Language 11 (1) (1997) 1-16.
About the Author: JIANYING HU studied Electrical Engineering at Tsinghua University in Beijing, China, from 1984 to 1988, and received her Ph.D. in Computer Science from SUNY Stony Brook in 1993. Since then she has been a member of technical staff at Bell Laboratories, Murray Hill, formerly of AT&T, now of Lucent Technologies. Dr. Hu has worked extensively on curve indexing, on-line handwriting recognition, language modeling and document processing. Her current research interests include document understanding, handwriting analysis, information retrieval and content-based image retrieval. She is a member of the IEEE Computer Society.

About the Author: SOK GEK LIM received the B.Sc. in Computer Science and Ph.D. in Electrical Engineering from The University of Western Australia in 1991 and 1997, respectively. She worked with NEC in Singapore in cooperation with Kent Ridge Digital Lab (formerly known as the Institute of Systems Science (ISS)) and the Supercomputing Centre at the National University of Singapore, from 1991 to 1992. She was a research officer with the Centre for Intelligent Information Processing Systems (CIIPS) in the Department of Electrical and Electronic Engineering at The University of Western Australia, from 1992 to 1996. She was awarded a post-doctoral fellowship from the University of West Florida in 1997. Dr. Lim is currently with Lucent Technologies Bell Laboratories, Murray Hill, NJ. Her research interests are image analysis and pattern recognition, http://ciips.ee.uwa.edu.au/~gek.

About the Author: MICHAEL K. BROWN is a Member of Technical Staff at Bell Labs and a Senior Member of the IEEE. He has over 50 publications and more than a dozen patents. From 1973 to 1976 Dr. Brown was with the Burroughs Corporation in Detroit (now Unisys Corp.) developing ink jet printing systems. This work resulted in the University of Michigan's first Master's Thesis, "Control Systems for Ink Jet Printers", which describes advances in high-speed electrostatically controlled jet printing techniques. From 1976 to 1980 he was self-employed as an Engineering Consultant with Burroughs Corp., working on unconstrained handwritten character recognition (courtesy amount reading of bank cheques) while pursuing his Ph.D. degree at the University of Michigan. His published Ph.D. thesis on on-line unconstrained "Cursive Script Word Recognition" describes new techniques in feature extraction, pattern recognition and writer adaptation.
In 1980 Dr. Brown joined the Speech Processing Group at Bell Telephone Laboratories, Murray Hill, where he developed speech recognition algorithms, designed VLSI ASICs and designed speech processing systems. In 1983 he transferred to the Robotics Principles Research Department, where he pursued interests in sensory perception, sensor integration, distributed computing, natural language (speech) understanding, dialogue management, task planning, dynamic simulation and electro-mechanical control. Since 1990, when the department was renamed Interactive Systems Research, Dr. Brown has concentrated more on human-machine and machine-environment interaction, and has also resumed his interest in unconstrained on-line handwriting recognition. In 1996 he joined the Multimedia Communications Research Laboratory to continue work on natural language understanding systems and pursue new interests in multimedia pattern analysis.
Pattern Recognition 33 (2000) 149-160
Personal identification based on handwriting

H.E.S. Said *, T.N. Tan, K.D. Baker

Department of Computer Science, University of Reading, Whiteknights, P.O. Box 225, Reading, Berkshire RG6 6AY, UK
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, People's Republic of China

Received 24 June 1998; received in revised form 15 December 1998; accepted 15 December 1998
Abstract

Many techniques have been reported for handwriting-based writer identification. The majority of techniques assume that the written text is fixed (e.g., in signature verification). In this paper we attempt to eliminate this assumption by presenting a novel algorithm for automatic text-independent writer identification. Given that the handwriting of different people is often visually distinctive, we take a global approach based on texture analysis, where each writer's handwriting is regarded as a different texture. In principle, this allows us to apply any standard texture recognition algorithm for the task (e.g., the multi-channel Gabor filtering technique). Results of 96.0% accuracy on the classification of 1000 test documents from 40 writers are very promising. The method is shown to be robust to noise and contents. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Personal identification; Writer identification; Texture analysis; Gabor filters; Handwriting processing; Document image processing
1. Introduction

Signature verification has been an active research topic for several decades in the image processing and pattern recognition community [1]. Despite continuous effort, signature verification remains a challenging issue. It provides a way of identifying the writer of a piece of handwriting in order to verify a claimed identity in security and related applications. It requires the writer to write the same fixed text. In this sense, signature verification may also be called text-dependent writer verification (which is a special case of text-dependent writer identification, where more than one writer has to be considered). In practice, the requirement and the use of fixed text make writer verification prone to forgery. Furthermore, text-dependent writer identification is inapplicable in many important practical applications, for example,
* Corresponding author. Tel.: +44-118-987-51-23 x7641; fax: +44-118-975-1994.
E-mail address: [email protected] (H.E.S. Said).
the identification of the writers of archived handwritten documents, crime suspect identification in forensic science, etc. In these applications, the writer of a piece of handwriting is often identified by professional handwriting examiners (graphologists). Although human intervention in text-independent writer identification has been effective, it is costly and prone to fatigue.

Research into writer identification has focused on two streams: off-line and on-line writer identification. On-line writer identification techniques are not well developed (as compared to on-line signature verification methods), and only a few papers (e.g. [2]) have been published on this subject. In comparison, off-line systems have been studied either as fully automated tools or as interactive tools. These systems are based on the use of computer image processing and pattern recognition techniques to solve the different types of problems encountered: pre-processing, feature extraction and selection, specimen comparison and performance evaluation.

This paper presents an off-line system based on computer image processing and pattern recognition techniques. There are two approaches to the off-line method,
namely text-dependent and text-independent. Our work is a text-independent approach where a texture analysis technique is introduced. The text-independent approach uses feature sets whose components describe global statistical features extracted from the entire image of a text. Hence it may be called texture analysis approach. Two general approaches have been proposed in the o!-line method: Histogram descriptions and Fourier transform techniques. In the "rst case, the frequency distribution of di!erent global and local properties is used [3]. Some of these properties are directly or indirectly related to speci"c features used in the forensic document analysis [4]. In the second case, Duverony et al. [5] have reported that the most important variation of the writers transfer function is re#ected in the low-frequency band of Fourier spectrum of the handwriting images. Similarly, Kuckuck [6] has used Fourier transform techniques to process handwritten text as texture. The features extracted were either composed of sequences of spectrum mean values per bandwidth, polynomial "tting coe$cients or a linear mapping of these coe$cients. The method has been tested on a set of 800 handwriting samples (20 writers, 40 samples per writer). An overall classi"cation rate of 90% for all features extracted was obtained. This paper uses multichannel spatial "ltering techniques to extract texture features from a handwritten text block. There are many "lters available for use in the multichannel technique. In this paper we use Gabor "lters, since they have proven to be successful in extracting features for similar applications [7}11]. We also use grey-scale co-occurrence matrices (GSCM) for feature extraction as a comparison. For classi"cation two classi"ers are adopted, namely the weighted Euclidean distance (WED) and the (K-NN) classi"ers. The subsequent sections describe the normalisation of the handwriting images, the extraction of writer features, the experimental results and "nally the conclusions.
2.1. Normalisation of handwriting images Texture analysis cannot be applied directly to handwriting images, as texture is a!ected by di!erent word spacing, varying line spacing, etc. The in#uence of such factors is minimised by normalisation. The input to this normalisation stage is a binary image of any handwritten document. The handwriting may contain lines of di!erent point size and di!erent spacing between lines, words and characters. The normalisation is performed as follows: Text lines are located using the horizontal projection pro"le [10]. Spaces between lines/words and margins are set to a prede"ned size. Then, incomplete lines are "lled by means of text padding. Random non-overlapping blocks (of pixels) are then extracted from the normalised image. Texture analysis is applied to these blocks. Further details on normalisation may be found in [10]. The main steps are illustrated in Fig. 2 below. 2.2. Features extraction In principle, any texture analysis technique can be applied to extract features from each uniform block of handwriting. Here two established methods are implemented to obtain texture features, namely the multichannel Gabor "ltering technique (MGF) [12] and the grey- scale co-occurrence matrix (GSCM) [13]. The former is a popular method which is well recognised and the latter is often used as a benchmark in texture analysis [13]. 2.2.1. Gabor xltering The multichannel Gabor "ltering technique is inspired by the psychophysical "ndings that the processing of pictorial information in the human visual cortex involves a set of parallel and quasi-independent mechanisms or cortical channels which can be modelled by bandpass "lters. Mathematically the 2D Gabor "lter can be expressed as follows: h(x, y)"g(x, y)e\LHSV>TW,
2.2. Feature extraction

In principle, any texture analysis technique can be applied to extract features from each uniform block of handwriting. Here two established methods are implemented to obtain texture features, namely the multichannel Gabor filtering technique (MGF) [12] and the grey-scale co-occurrence matrix (GSCM) [13]. The former is a popular method which is well recognised, and the latter is often used as a benchmark in texture analysis [13].

2.2.1. Gabor filtering

The multichannel Gabor filtering technique is inspired by the psychophysical finding that the processing of pictorial information in the human visual cortex involves a set of parallel and quasi-independent mechanisms, or cortical channels, which can be modelled by bandpass filters. Mathematically, the 2-D Gabor filter can be expressed as follows:

h(x, y) = g(x, y) e^(-2πj(ux + vy)),  (1)

where g(x, y) is a 2-D Gaussian function of the form

g(x, y) = e^(-(x² + y²)/σ²).  (2)
Fig. 1. Block diagram of the algorithm.

Fig. 2. The normalisation of the handwriting images.

A simple computational model for the cortical channels is described in [12]. Briefly stated, each cortical channel is modelled by a pair of Gabor filters h_e(x, y; f, θ) and h_o(x, y; f, θ). The two Gabor filters are of opposite symmetry and are given by

h_e(x, y; f, θ) = g(x, y) cos(2πf(x cos θ + y sin θ)),
h_o(x, y; f, θ) = g(x, y) sin(2πf(x cos θ + y sin θ)).  (3)
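For concreteness, the filter pair of Eqs. (2) and (3) can be generated as below. This is a sketch under assumptions: the kernel is sampled on a size × size grid, the radial frequency f is interpreted as cycles per kernel width, and the envelope width sigma is a free parameter not fixed at this point in the text.

```python
import numpy as np

def gabor_pair(size, f, theta, sigma):
    """Even and odd Gabor kernels of Eq. (3) on a size x size grid."""
    idx = np.arange(size) - size // 2
    x, y = np.meshgrid(idx, idx)
    g = np.exp(-(x**2 + y**2) / sigma**2)      # Gaussian envelope, Eq. (2)
    arg = 2 * np.pi * (f / size) * (x * np.cos(theta) + y * np.sin(theta))
    return g * np.cos(arg), g * np.sin(arg)    # h_e, h_o
```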
The Fourier transform of the filter is calculated and the resultant output image is obtained using the FFT; for example, if q_o denotes the odd filter's output, then

q_o(x, y) = FFT⁻¹[P(u, v) H_o(u, v)],  (4)

where P(u, v) is the Fourier transform (FT) of the input image p(x, y) and H_o(u, v) is the FT of the filter h_o(x, y; f, θ). The outputs of the two filters are combined using Eq. (5), giving a single value at each pixel of the image (for justification see [8]):

q(x, y) = √(q_e²(x, y) + q_o²(x, y)).  (5)

The two important parameters of the Gabor filter are f and θ (the radial frequency and orientation, respectively), which define the location of the channel in the frequency plane. Commonly used frequencies are powers of 2. In [12] it was shown that for any image of size N×N, the important frequency components are within f ≤ N/4 cycles/degree. In this paper we use frequencies of 4, 8, 16 and 32 cycles/degree. For each central frequency f, filtering is performed at θ = 0°, 45°, 90° and 135°. This gives a total of 16 output images (four for each frequency) from which the writer's features are extracted. These features are the mean and the standard deviation of each output image (therefore, 32 features per input image are calculated). Testing was performed using either all 32 features or various sub-sets (e.g., features associated with a particular radial frequency).
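A sketch of this feature-extraction pipeline, implementing Eqs. (4) and (5) and reusing gabor_pair from the previous sketch, might look as follows; the envelope width sigma and the square block shape are again assumptions.

```python
import numpy as np

FREQS = (4, 8, 16, 32)                               # frequencies used in the paper
THETAS = (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)  # 0, 45, 90, 135 degrees

def gabor_features(block, sigma=16.0):
    """32 features per block: mean and SD of the 16 combined channel outputs."""
    P = np.fft.fft2(block)                           # FT of the input image p(x, y)
    feats = []
    for f in FREQS:
        for theta in THETAS:
            h_e, h_o = gabor_pair(block.shape[0], f, theta, sigma)
            H_e, H_o = np.fft.fft2(h_e), np.fft.fft2(h_o)
            q_e = np.real(np.fft.ifft2(P * H_e))     # Eq. (4), even channel
            q_o = np.real(np.fft.ifft2(P * H_o))     # Eq. (4), odd channel
            q = np.sqrt(q_e**2 + q_o**2)             # combined output, Eq. (5)
            feats += [q.mean(), q.std()]
    return np.array(feats)                           # 4 freqs x 4 angles x 2 = 32
```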
2.2.2. Grey-scale co-occurrence matrices

GSCMs are also considered. Generally speaking, GSCMs are expensive to compute: for an image represented using N grey levels, each GSCM is of size N×N. Binary handwriting images contain only two grey levels, so it is reasonable to use the GSCM technique here. In this paper, GSCMs were constructed for five distances (d = 1, 2, 3, 4, 5) and four directions (α = 0°, 45°, 90°, 135°). This gives each input handwriting image 20 matrices of dimension 2×2. When the size of the GSCM is too large to allow the direct use of the matrix elements, measurements such as energy, entropy, contrast and correlation are computed from the matrix and used as features [13]. For each 2×2 GSCM derived from a binary handwriting image, there are only three independent values due to diagonal symmetry; these three values are used directly as features, giving 60 (= 20 × 3) features per handwriting image.
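A minimal sketch of this 60-feature GSCM computation is given below; the pixel offsets chosen for the four directions and the symmetrised raw counts (rather than normalised probabilities) are assumptions.

```python
import numpy as np

OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}  # (row, col) steps

def gscm_features(binary_block, distances=(1, 2, 3, 4, 5)):
    """60 GSCM features: 3 independent entries of each symmetric 2x2 matrix,
    for 5 distances x 4 directions."""
    img = binary_block.astype(int)
    h, w = img.shape
    feats = []
    for d in distances:
        for _, (dr, dc) in OFFSETS.items():
            m = np.zeros((2, 2))
            # clip index ranges so both pixels of each pair stay in bounds
            r0, r1 = max(0, -dr * d), min(h, h - dr * d)
            c0, c1 = max(0, -dc * d), min(w, w - dc * d)
            a = img[r0:r1, c0:c1]
            b = img[r0 + dr * d:r1 + dr * d, c0 + dc * d:c1 + dc * d]
            np.add.at(m, (a.ravel(), b.ravel()), 1)   # accumulate co-occurrences
            m = m + m.T                               # symmetrise the counts
            feats += [m[0, 0], m[0, 1], m[1, 1]]      # 3 independent values
    return np.array(feats)                            # 5 x 4 x 3 = 60 features
```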
d" T
d "min(d ), (8) 0 where (k"1,2, no of classes). Other more sophisticated measures and classi"ers such as neural network classi"ers could have been used. The emphasis in this paper, however, is on computational simplicity.
3. Experimental results

A number of experiments were carried out to show the effectiveness of the proposed algorithms. Forty people were selected and divided into two groups (each group consisting of 20 people). Examples of handwriting by these people are shown in Fig. 3. For the purpose of the classification experiments, 25 non-overlapping handwriting blocks were extracted for each person. Each sample was selected from an A4 page, scanned using an HP ScanJet 4c in custom mode with extra heavy lighting, at a resolution of 150 dpi. The sample images were divided first into 10 training and 15 test images per writer (Set A), and then into 15 training and 10 test images (Set B). Images in the test sets did not appear in the training sets. Testing was conducted using different combinations of features under both classifiers.

3.1. Results from Gabor filtering

The effects of the Gabor filtering on classification were investigated. Fig. 4 shows the primary images produced using channels at f = 4, 8, 16, 32 and θ = 0°, 45°, 90°, 135°. Tables 1-4 show the results of the multichannel Gabor filter features using the two classifiers. Features were extracted using the channels at f = 4, 8, 16, 32 and θ = 0°, 45°, 90°, 135° (hence a total of 32 features) and using combinations of different frequencies and orientations. The results from the weighted Euclidean distance (WED) classifier are tabulated in Tables 1 and 3; Tables 2 and 4 give the results from the K nearest-neighbours (K-NN) classifier. The outcomes of the two classifiers are compared in Figs. 5 and 6. Similar numbers of features were used for both classifiers.

When the WED classifier was used, generally higher identification accuracies were observed (especially for Set A). For example, in Group 2 a classification rate of 96.0% was obtained when all 32 features were used (for Set B). A rate of 96.0% was also observed when f = 16 and 32 were chosen (for Set A). It can be seen that the higher the channel frequency, the better the classification accuracy; this effect is demonstrated clearly for both Sets A and B. Under the K-NN classifier, a classification rate of 77.0% was achieved when all features were used (for Set B). The best K-NN results (86.0%) were achieved for Group 1 when the frequencies f = 16 and 32 were used (for Set B). For easier comparison, Figs. 5 and 6 plot the identification accuracies for the multichannel Gabor filtering features under both classifiers, where the feature sets are in the same order as in Tables 1-4 (e.g., feature set 1 for all features, set 2 for all SD features, etc.). It can easily be seen that the filtering technique using the WED classifier performs better.
Fig. 3. Handwriting examples of 30 different people.
3.2. Results from GSCM

In Tables 5-8, features were extracted using distances d = 1, 2, 3, 4, 5 and directions α = 0°, 45°, 90°, 135° (a total of 60 features (3 × 5 × 4)). Different combinations of feature sets were also used, e.g. features at d = 1, 2, 3 and the four directions given above (a total of 36 features (3 × 3 × 4)), etc. Tables 5-8 show the classification rates for the GSCM approach under both classifiers. Here the results can be seen to be much poorer than those for the multichannel Gabor filtering method. This observation is consistent
Fig. 4. The output images from the Gabor filter, at different frequencies and orientations, for the first writer.
Table 1
Group 1: Identification accuracy (%) of the Gabor filtering technique under WED

Features   All    SD     Mean   Mean at    All at     All at   All at   All at   All at
                                f=16,32    f=16,32    f=32     f=16     f=8      f=4
Set A      94.3   91.0   89.7   89.7       95.3       84.7     58.1     29.3     11.3
Set B      91.0   90.5   86.5   83.0       88.1       75.6     52.2     29.5     13.9
Table 2
Group 1: Identification accuracy (%) of the Gabor filtering technique under K-NN

Features   All    SD     Mean   Mean at    All at     All at   All at   All at   All at
                                f=16,32    f=16,32    f=32     f=16     f=8      f=4
Set A      56.0   56.0   49.7   56.7       56.3       59.7     50.7     31.0     22.3
Set B      76.0   75.5   73.5   85.6       86.0       82.0     58.5     44.5     28.5
Table 3
Group 2: Identification accuracy (%) of the Gabor filtering technique under WED

Features   All    SD     Mean   Mean at    All at     All at   All at   All at   All at
                                f=16,32    f=16,32    f=32     f=16     f=8      f=4
Set A      84.1   82.8   83.4   92.1       96.0       85.4     61.3     34.4     29.1
Set B      96.0   90.1   92.1   93.0       86.0       84.2     65.3     34.7     22.8
Table 4
Group 2: Identification accuracy (%) of the Gabor filtering technique under K-NN

Features   All    SD     Mean   Mean at    All at     All at   All at   All at   All at
                                f=16,32    f=16,32    f=32     f=16     f=8      f=4
Set A      54.7   59.3   40.0   56.7       54.0       66.7     66.7     37.6     21.1
Set B      77.0   78.0   60.0   75.0       85.0       90.0     88.0     47.0     27.7
Fig. 5. Group 1: results from the Gabor filter using the WED and K-NN classifiers.
with the "ndings in [10,12]. Figs. 7 and 8 show plots of the identi"cation accuracy for the GSCM features under both classi"ers. The features sets appear in the same order as those in Tables 5}8. The best results (shaded in the tables) show that only 72.2 and 63.6% (for groups 2 and 1, respectively) of the images were identi"ed correctly when using the WED classi"er, 36 texture features were required. In comparison, a classi"cation rate of 74.0 and 66.0% (for groups 2 and group 1, respectively) are obtained when the K-NN classi"er was used (with a total of 24 texture features). Note that under the K-NN performance in Set B (using 10 testing images) is far better than Set A.
Fig. 6. Group 2: results from the Gabor filter using the WED and K-NN classifiers.
3.3. Performance evaluation

3.3.1. Sample invariance
The remarkable effectiveness of the multi-channel Gabor filtering technique for identification is partially due to the invariance of the writer texture samples. For this reason the relative sample invariance might be a useful parameter to compute. It is calculated as the ratio of the standard deviation (SD) of a writer's sample features to the mean of those sample features:

Sample invariance = (writer sample SD / writer sample mean) × 100.  (9)

Table 9 shows these statistics for the proposed algorithms, the multi-channel Gabor filtering (MGF) and the GSCM. The results shown in Figs. 9-11 are given for 10 writers. It can easily be seen that MGF gave the best sample invariance.
3.3.2. Types I and II errors
The task of verifying a writer is essentially that of recognising a genuine writer while rejecting imitations.
Table 5
Group 1: Identification accuracy (%) of the GSCM technique under WED

Distances   d=1,2,3,4,5   d=1,2,3   d=2,3,4   d=3,4,5   d=1,2   d=4,5   d=1    d=4
Set A       59.8          63.6      53.5      45.6      56.0    43.2    59.4   41.8
Set B       52.2          58.8      50.0      46.0      56.4    46.4    59.5   46.0
Table 6
Group 1: Identification accuracy (%) of the GSCM technique under K-NN

Distances   d=1,2,3,4,5   d=1,2,3   d=2,3,4   d=3,4,5   d=1,2   d=4,5   d=1    d=4
Set A       43.3          45.3      40.3      39.7      37.7    37.3    44.3   38.3
Set B       60.5          68.0      57.0      54.0      74.0    53.5    62.5   58.5
Table 7
Group 2: Identification accuracy (%) of the GSCM technique under WED

Distances   d=1,2,3,4,5   d=1,2,3   d=2,3,4   d=3,4,5   d=1,2   d=4,5   d=1    d=4
Set A       65.1          72.2      67.6      57.0      71.5    58.3    76.8   57.0
Set B       71.0          65.2      75.8      60.4      62.2    60.4    73.3   59.4
Table 8
Group 2: Identification accuracy (%) of the GSCM technique under K-NN

Distances   d=1,2,3,4,5   d=1,2,3   d=2,3,4   d=3,4,5   d=1,2   d=4,5   d=1    d=4
Set A       50.7          64.0      58.7      51.3      62.0    31.4    61.3   31.4
Set B       63.0          75.0      66.0      60.0      66.0    49.5    60.0   56.0
Fig. 7. Group 1: results from GSCM using the WED and K-NN classifiers.
Fig. 8. Group 2: results from GSCM using the WED and K-NN classifiers.
The performance of a system in achieving this is measured in two ways: false rejection of genuine writers and false acceptance of imitations. These two measures are termed type I and type II errors, respectively. The performance of any verification system is often characterised by combining the type I and type II errors into an overall performance measure known as the equal error rate, which is calculated at the point where the type I error equals the type II error [14].
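A threshold-sweep sketch of the equal error rate computation is given below, assuming the match scores are distances (lower is better); the source does not describe how the crossing point is found, so the sweep over observed scores is an assumption.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Return the error rate where false rejection (type I) and false
    acceptance (type II) are closest to equal."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        frr = np.mean(genuine_scores > t)     # genuine writers rejected
        far = np.mean(impostor_scores <= t)   # imitations accepted
        if abs(frr - far) < best_gap:         # track the crossing point
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer
```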
Table 9
Group 2: The deviation-to-mean ratios of the Gabor filtering (MGF) and GSCM techniques

Writer   1      2      3      4      5      6      7      8      9      10
MGF      1.04   0.89   0.94   0.92   0.92   0.91   0.98   1.02   0.92   0.93
GSCM     1.93   2.04   1.95   1.89   2.03   1.95   1.82   1.94   1.93   1.99
3.4. Remarks

It is clear that the identification accuracy is much higher when the Gabor filtering technique is used, whereas poor results are obtained with the GSCM method.
Fig. 9. The performance of the MGF and GSCM techniques.
The equal error rate is calculated for the optimum identification rates of the Gabor filtering and GSCM techniques shown in Tables 1-8. The resulting equal error rates are shown in Tables 10-13. It can be seen that the best error rate is 0.57%, for the MGF technique using the WED classifier with all the features considered. In comparison, an error rate of 2.32% is obtained for the GSCM technique under the WED classifier, using distances d = 1, 2, 3 for Set B.
In summary, the following observations can be made on the basis of the results presented above:

1. Under all circumstances the multichannel filtering technique outperforms the GSCM technique (although it is computationally more expensive).
2. The overall performance of the WED classifier appears to be better than that of the K-NN classifier.
3. Set A gave higher identification accuracy when the WED classifier is used, but Set B recorded better identification results when the K-NN classifier is used.
4. The results clearly demonstrate the potential of the proposed texture-based approach to personal identification from handwriting images.

4. Future work

The approach adopted here is mainly text-independent. In the future, text-dependent writer identification will be introduced; this will cover writer signature verification approaches. A comparison between the two approaches will then be drawn.
Fig. 10. The equal error rates for Groups 1 and 2 using the MGF technique.
Fig. 11. The equal error rates for Groups 1 and 2 using the GSCM technique.
Table 10
Group 1: Error rate of the Gabor filtering technique under the WED and K-NN classifiers

Classifier   WED                          K-NN
Features     All      All at f=16,32      All      All at f=16,32
Set A        0.57%    0.45%               4.4%     3.6%
Set B        1.35%    1.79%               4.37%    2.1%
Table 11
Group 2: Error rate of the Gabor filtering technique under the WED and K-NN classifiers

Classifier   WED                          K-NN
Features     All      All at f=16,32      All      All at f=16,32
Set A        1.59%    0.6%                3.33%    3.33%
Set B        0.6%     2.1%                1.8%     1.5%
Table 12
Group 1: Error rate of the GSCM technique under the WED and K-NN classifiers

Classifier   WED                   K-NN
Features     d=1,2,3   d=1,2       d=1,2,3   d=1,2
Set A        3.64%     4.4%        5.47%     4.8%
Set B        6.18%     6.45%       4.8%      3.9%
Table 13
Group 2: Error rate of the GSCM technique under the WED and K-NN classifiers

Classifier   WED                   K-NN
Features     d=1,2,3   d=1,2       d=1,2,3   d=1,2
Set A        3.24%     3.63%       3.6%      3.8%
Set B        2.32%     3.71%       3.75%     5.1%
Currently our work is based on the extraction of global features, but further work will focus on the use of local features. An integrated system will be considered to combine both local and global features to produce more reliable classification accuracy. Other work on writer identification might include the normalisation or pre-processing of skewed handwriting images [15]. In this field, work to detect the skew angles of writers' documents is in progress. Research on skew angle detection for printed document images has been extensively reported in the field of document analysis, but little has been achieved for handwritten documents [16-18].
5. Conclusion

We have described a new approach to handwriting-based personal identification. Most existing approaches assume implicitly that the handwritten texts are fixed; the novel approach introduced in this paper eliminates that assumption. The algorithm is based on the observation that the handwriting of different people is visually distinctive, so a global approach based on texture analysis can be adopted. The approach is therefore text- or content-independent. A number of experiments were conducted using 40 different writer classes. Features were extracted from handwriting images using the multichannel Gabor filtering technique and the grey-scale co-occurrence matrix (GSCM) technique. Identification was performed using two different classifiers, namely the weighted Euclidean distance (WED) and the K nearest-neighbours (K-NN) classifiers. The results achieved were very promising, with an identification accuracy of 96.0% obtained using the WED classifier; the K-NN classifier gave comparatively poor results. Several factors affect the performance of such global approaches, including graphics and skewed handwriting. Other factors include the handwriting styles of different people and similarities between different handwritings (see Fig. 1). We are currently investigating ways of reducing the impact of such factors on the performance of the proposed global approach. We will also consider local approaches which seek writer-specific features to improve the recognition accuracy. In the future, both global and local approaches will be integrated in one system for better identification accuracy.
References

[1] R. Plamondon, G. Lorette, Automatic signature verification and writer identification: the state of the art, Pattern Recognition 22 (2) (1989) 107-131.
[2] F.J. Maarse, L.R.B. Schomaker, H.L. Teulings, Automatic identification of writers, in: G.C. van der Veer, G. Mulder (Eds.), Human-Computer Interaction: Psychonomic Aspects, Springer, Heidelberg, 1986.
[3] B. Azari, Automatic handwriting identification based on the external properties of the samples, IEEE Trans. Systems Man Cybernet. 13 (1983) 635-642.
[4] V. Klement, K. Steine, R. Naske, The application of image processing and pattern recognition techniques to the forensic analysis of handwriting, in: Proceedings of the International Conference on Security Through Science and Engineering, West Berlin, 1980, pp. 5-11.
[5] J. Duvernoy, Handwriting synthesis and classification by means of space-variant transform and Karhunen-Loeve analysis, J. Opt. Soc. Am. 65 (1975) 1331-1336.
[6] W. Kuckuck, Writer recognition by spectra analysis, in: Proceedings of the International Conference on Security Through Science and Engineering, West Berlin, Germany, 1980, pp. 1-3.
[7] T.N. Tan, Written language recognition based on texture analysis, in: Proceedings of the IEEE ICIP'96, Lausanne, Switzerland, vol. 2, September 1996, pp. 185-188.
[8] T.N. Tan, Texture edge detection by modelling visual cortical channels, Pattern Recognition 28 (9) (1995) 1283-1298.
[9] A.K. Jain, S. Bhattacharjee, Text segmentation using Gabor filters for automatic document processing, Mach. Vision Appl. (1992) 169-184.
[10] G.S. Peake, T.N. Tan, Script and language identification from document images, in: Proceedings of BMVC'97, Essex, UK, vol. 2, September 1997, pp. 610-619.
[11] T.R. Reed, J.M. Hans Du Buf, A review of recent texture segmentation and feature extraction techniques, CVGIP: Image Understanding 57 (1993) 359-372.
[12] T.N. Tan, Texture feature extraction via visual cortical channel modelling, in: Proceedings of the 11th IAPR International Conference on Pattern Recognition, vol. III, 1992, pp. 607-610.
[13] R.M. Haralick, Statistical and structural approaches to texture, Proc. IEEE 67 (1979) 786-804.
[14] M. Golfarelli, D. Maio, D. Maltoni, On the error-reject trade-off in biometric verification systems, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 786-796.
[15] H.E.S. Said, G.S. Peake, T.N. Tan, K.D. Baker, Writer identification from non-uniformly skewed handwriting images, in: Proceedings of the British Machine Vision Conference BMVC98, vol. 2, 1998, pp. 478-487.
[16] D.S. Le, G.R. Thoma, H. Wechsler, Automatic orientation and skew angle detection for binary document images, Pattern Recognition 27 (10) (1994) 1325-1344.
[17] H.S. Baird, The skew angle of printed documents, Proc. SPIE, vol. 40, 1987, pp. 21-24.
[18] W. Postl, Detection of linear oblique structures and skew in digitised documents, in: Proceedings of ICPR'86, 1986, pp. 464-468.
About the Author: HUWIDA SAID received a B.Eng. honours degree in Electrical and Electronic Engineering from the University of Wales, Swansea, in 1995. She is currently writing up her PhD at the University of Reading and will graduate in July 1999. Her research interests include personal identification from facial features, handwritten characters and audio features. She is an associate member of the IEE and a member of the BCS.

About the Author: TIENIU TAN received his B.Sc. (1984) in Electronic Engineering from Xi'an Jiaotong University, China, and M.Sc. (1986), DIC (1986) and Ph.D. (1989) in Electronic Engineering from Imperial College of Science, Technology and Medicine, London, England. In October 1989 he joined the Computational Vision Group at the Department of Computer Science, the University of Reading, England, where he worked as Research Fellow, Senior Research Fellow and Lecturer. In January 1998 he returned to China to take up a professorship at the National Laboratory of Pattern Recognition, located at the Institute of Automation of the Chinese Academy of Sciences, Beijing, China. Dr. Tan has published widely on image analysis and computer vision. He is a Senior Member of the IEEE and an elected member of the Executive Committee of the British Machine Vision Association and Society for Pattern Recognition (BMVA). He is the Asia Editor of the International Journal of Image and Vision Computing and an Associate Editor of the Pattern Recognition journal. His current research interests include speech and image processing, machine and computer vision, pattern recognition, multimedia, and robotics.

About the Author: KEITH BAKER studied both Electronics Engineering and Mathematical Physics at undergraduate level before receiving the PhD degree in Physics in 1969 from the University of Sussex, Brighton, England. After a period as a postdoctoral fellow, he worked as a Systems Analyst on real-time air traffic control systems with the Software Systems Division of the Plessey Company. Later he was with the Burroughs Corporation, Detroit, MI, as a Supervisor of Software Engineering and Project Manager. He also spent some time as a Lecturer in Computer Science at the University of Sussex before taking up an appointment as Professor of Information Engineering Systems and head of the Electrical and Electronic Engineering Department at the University of Plymouth, England. In 1986 he moved to the University of Reading as full Professor of Computer Science, later becoming head of the Computer Science Department. He is currently Dean of the Faculty of Sciences. His research interests include contributions to software engineering, computer vision, and intelligent systems. Dr. Baker is the Editor-in-Chief of the international journal Image and Vision Computing, a Fellow of the IEE, and a member of the IEEE, the ACM and the BCS.
Pattern Recognition 33 (2000) 161-171
Estimation of hybrid reflectance properties and shape reconstruction using the LMS method

Tae-Eun Kim, Hyun-Ki Hong, Jong-Soo Choi*

Department of Electronic Engineering, Chung-Ang University, 221 Huksuk-dong, Dongjak-ku, Seoul, 156-756, South Korea

Received 28 August 1998; accepted 30 November 1998
Abstract

This paper presents a new method to estimate the reflectance properties of a non-Lambertian surface by the least-mean-square (LMS) algorithm. A sample sphere with the same surface properties as those of the object is used, and the hybrid reflectance of the object is represented by the Torrance-Sparrow model. Since the size of the sample sphere is known, the intensity image of the object in the experimental setup can be generated. We determine the reflectance parameters that minimise the sum-squared difference of the intensity distribution between the image of the sample sphere and the generated image. With the estimated reflectance parameters, three reference images of the sample sphere are obtained from the same viewpoint with different light directions. Direct matching of the object images to the references can then precisely reconstruct the shape of the object. This paper uses plate diffuse illumination to alleviate the effects of the specular spike and highlights. The experimental results show that the proposed method can estimate the reflectance properties of a hybrid surface and also recover the object shape. 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Photometric stereo; LMS method; Reflectance property; Plate diffuse illumination; Shape reconstruction
1. Introduction

One of the most important topics in computer vision is extracting three-dimensional (3D) information from the intensity image. The photometric stereo method uses several intensity images of objects with different light directions to determine the surface shape [1]. The apparent intensity of a point depends on the surface orientation, the reflectance properties, and the distribution of light sources. In particular, surface reflectance properties are important inspection criteria in many industrial applications. Since automatic extraction of surface properties and recovery of the object shape can solve several problems of manual inspection, a number of automatic inspection techniques have been developed.
* Corresponding author. Tel.: +82-02-820-5295; fax: +82-02-825-1584. E-mail address: [email protected] (J.S. Choi).
Previous research is often based on the assumption that the object surface is Lambertian in reflectance [2-4]. However, most object surfaces encountered in practice have hybrid reflectance, which is a combination of diffuse and specular reflection. Nayar et al. [5] proposed the photometric sampling method, which extracts the reflectance properties of a hybrid surface and recovers the object shape. However, this method must process a sequence of object images generated by scanning an array of extended sources. In addition, a system using 8-127 point sources was proposed to determine the shape of specular objects [6,7]. Though this method is robust to noise, the use of multiple light sources is not practical in many real-world applications. This paper presents a new method to estimate the reflectance properties of the object surface by the least-mean-square (LMS) algorithm. Our method determines the
reflectance parameters which minimise the sum-squared error (SSE) of the intensity distributions between the sample sphere image and the calculated one. The estimated reflectance properties are used to generate the reference images for the given light source directions. Direct matching of three reference images to the object images can reconstruct the shape of any object with the same surface properties as those of the sample sphere. The proposed method uses plate diffuse illumination to alleviate the effects of the specular spike and highlights. In this paper, the object of interest has a hybrid reflectance property described by the Torrance-Sparrow model, such as plastic, acrylic, and electrical parts [8-10]. It is expected that the proposed algorithm is applicable to vision inspection systems and other fields. A block diagram of our algorithm is shown in Fig. 1.
2. Reflectance properties and diffuse illumination

2.1. Torrance-Sparrow reflectance model

Previous researchers have based their work on the Lambertian model, which mainly consists of the diffuse reflection component. However, most objects have hybrid reflectance properties, a combination of the Lambertian and specular components. To represent the reflectance properties of a hybrid surface, two different approaches, the Torrance-Sparrow model and the Beckmann-Spizzichino model, have been proposed. Owing to its simple mathematical formulation, the Torrance-Sparrow model is preferable to other models [11]. Fig. 2 shows the geometry of the Torrance-Sparrow model.
Fig. 1. A block diagram for the proposed algorithm.
Fig. 2. Torrance-Sparrow model.
In the Torrance-Sparrow model, the specular roughness and the relative strength between the diffuse and specular components determine the surface property. The Lambertian model represents the diffuse reflection, which consists of multiple reflections and internal scattering. Eq. (1) gives the brightness distribution of the object:

I = I_d + I_s = k_d f_d + k_s f_s,   k_d + k_s = 1,   (1)

where k_d, k_s, f_d and f_s are the diffuse and specular weighting constants and intensities, respectively. The bi-directional reflectance distribution function (BRDF) in the Torrance-Sparrow model can be represented as follows:

f_d = (ρ/π) n · s,   f_s = F G P(α, σ) / (4 cos θ_i cos θ_r),   (2)

where ρ, n, s and F are the albedo constant, the surface normal vector, the position vector of the light source and the Fresnel coefficient; G, θ_i and θ_r are the geometric attenuation factor, the angle of incidence and the viewing angle; and P is the Gaussian distribution of microfacets according to Eq. (3):

P(α, σ) = (1/(√(2π) σ)) exp(−α²/(2σ²)),   (3)
where the facet slope σ represents the surface roughness by its mean value ⟨σ⟩ and standard deviation α. In previous studies, the specular reflectance has been decomposed into two primary components: the specular lobe and the specular spike. The lobe component spreads around the specular direction, while the spike component is zero in all directions except for a very narrow range around the specular direction. The relative strength of the two components depends on the physical roughness of the surface [12]. Since the Torrance-Sparrow model is not capable of describing the mirror-like behaviour of the specular spike, it would yield erroneous results for such surfaces. In this paper, we use plate diffuse illumination to decrease the effects of the specular spike and highlights.
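A sketch evaluating the hybrid model of Eqs. (1)-(3) for one surface point is given below; treating the geometric attenuation G and the Fresnel term F as scalar constants is a simplifying assumption.

```python
import numpy as np

def hybrid_intensity(n, s, v, k_s, F, sigma, rho=1.0, G=1.0):
    """Brightness of Eqs. (1)-(3) for unit vectors n (surface normal),
    s (light source direction) and v (viewing direction)."""
    k_d = 1.0 - k_s                                   # Eq. (1): k_d + k_s = 1
    f_d = (rho / np.pi) * max(np.dot(n, s), 0.0)      # Lambertian term, Eq. (2)
    h = (s + v) / np.linalg.norm(s + v)               # bisector of s and v
    alpha = np.arccos(np.clip(np.dot(n, h), -1, 1))   # facet angle
    P = np.exp(-alpha**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)  # Eq. (3)
    cos_i, cos_r = np.dot(n, s), np.dot(n, v)
    f_s = F * G * P / (4 * max(cos_i * cos_r, 1e-6))  # specular lobe, Eq. (2)
    return k_d * f_d + k_s * f_s                      # Eq. (1)
```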
Fig. 3. Geometry of the plate diffuse illumination.
2.2. The plate diffuse illumination

In the plate diffuse illumination, the irradiance E of a point can be modelled as E = I_source cos φ / r², where I_source is the intensity of the light source, r is the distance between the light source and the plate diffuser, and φ is the angle between the normal vector of the diffuser and the position vector of the light source. The radiance transmitted through the plate diffuser is obtained by L = ∫_S τE ds, where τ is an attenuation coefficient and the integration is over the diffuser surface S.
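The irradiance model above can be illustrated as follows; this is a sketch of the geometry only, not the authors' code, and it assumes the diffuser normal is given as a unit vector.

```python
import numpy as np

def diffuser_irradiance(I_source, src_pos, patch_pos, patch_normal):
    """Irradiance E = I_source * cos(phi) / r^2 at a point on the plate
    diffuser (all geometry in world coordinates; patch_normal is unit length)."""
    d = np.asarray(src_pos, float) - np.asarray(patch_pos, float)
    r = np.linalg.norm(d)                            # source-to-patch distance
    cos_phi = max(np.dot(d / r, patch_normal), 0.0)  # angle to diffuser normal
    return I_source * cos_phi / r**2
```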
is the angle between the normal vector of the di!user and the position vector of the light source. The radiance which transmits the plate di!user is obtained by ¸" qE ds, where s is an attenuation coe$cient. 1 In Fig. 3, our di!use illumination device consists of the light source and a plate di!user. The previous research using the spherical di!user needs many light sources. [12]. By using a #at paper and three lights sources, our illumination system is simpler for implementation than the previous ones. When the light source is far away from the sample sphere, our illumination system can be modeled as a parallel ray. [13].
3. Estimation of reflectance properties and shape reconstruction

3.1. Estimation of reflectance properties using the LMS method

The Torrance-Sparrow model has the following factors: the surface roughness parameter σ, the diffuse and specular components k_d and k_s, and the Fresnel coefficient F. In general, the roughness of the object is either assumed or measured using a calibration method. However, both the assumption approach and calibration are difficult in many real-world applications. In this paper, a sample sphere with the same surface properties as those of the object is used. Since the size of the sample sphere is known, we can generate its intensity image in the experimental setup. First, we divide the
Fig. 4. LMS algorithm for estimation of reflectance parameters.
sample sphere image into some regions, and the k_s and F parameters are initialised to small values close to 0. In each region, we update the reflectance parameters to minimise the error function in Eq. (4). The error function is defined as the sum-squared difference of the intensity values between the sample image and the calculated one. After this operation, we can estimate the reflectance properties of the object surface:

e(x, y) = Σ_X Σ_Y (I_real(x, y) − I_estimate(x, y; k_s, F, σ))².  (4)

The surface roughness parameter σ is fixed, and the reflectance parameters are updated with step size μ. In Eq. (5), the second terms on the right-hand side involve the partial derivatives of the error sum with respect to the reflectance parameters:

k_s(n+1) = k_s(n) + μ ∇k_s,
F(n+1) = F(n) + μ ∇F,  (5)

where ∇k_s = Σ_{x∈block} Σ_{y∈block} ∂e(x, y)/∂k_s and ∇F = Σ_{x∈block} Σ_{y∈block} ∂e(x, y)/∂F. Fig. 4 shows the diagram of the LMS algorithm used to estimate the reflectance parameters.
3.2. The direct matching for shape reconstruction

Fig. 5. Determination of surface normal by photometric matching.

In photometric stereo, more than one image is taken from a constant viewpoint with different light directions. Three intensity images of an arbitrary object, obtained by varying the illumination direction at 120° intervals, are used to recover the shape of the object. Our method generates the intensity image of the range sphere, which has the same diameter as that of the samples. The estimated reflectance parameters provide the intensity distribution of the range data under the same source direction and illumination. Therefore, these reference images contain normal vectors as well as intensity values in all directions. In order to reconstruct the object shape, the proposed method uses a direct matching of the obtained intensity images to the reference images. In this matching, we take into account the neighbouring intensity distribution of the candidate sets in the reference images. A point (x, y) in the first object image corresponds to the candidate point set C = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)} in the reference image. In order to find the candidate point set in the other two reference images, we calculate a sum of absolute errors between the reference and object images at the point (x, y). If the point having the smallest error in the reference image is found, we can determine the surface normal V of the point in the object image. When the search region under consideration is N×N, the surface normal is computed by
Σ_{x=x_i}^{x_i+N} Σ_{y=y_i}^{y_i+N} …