Pattern Recognition Letters 26 (2005) 1632–1640 www.elsevier.com/locate/patrec
On the compact computational domain of fuzzy-rough sets Rajen B. Bhatt *, M. Gopal
Control Group, Department of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110 016, India Received 23 June 2004; received in revised form 13 November 2004 Available online 24 February 2005 Communicated by W. Pedrycz
Abstract

Based on some properties of fuzzy t-norm and t-conorm operators, the concept of fuzzy-rough sets on a compact computational domain is put forward. Various mathematical properties of this new definition of fuzzy-rough sets are discussed from a pattern classification viewpoint. It is established that the proposed approach identifies various patterns in the sense of fuzzy-roughness, in addition to providing deeper insight into various concepts of fuzzy-rough sets.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Compact computational domain; Fuzzy-rough sets; Pattern recognition
1. Introduction

Rough set theory provides a systematic approach to knowledge discovery from inexact, noisy, or incomplete information (Pawlak, 1991, 1999, 2002a,b). The central concept in rough set theory is to approximate a target set, through crisp partitions generated by an equivalence relation, by a pair of exact sets called the lower and the upper approximations. Roughness emerges when there is a one-to-many mapping from equivalence classes to target sets. Several applications of rough sets in data classification, data analysis, machine learning, and knowledge discovery are of substantial importance (Slowinski, 1992; Pal and Skowron, 1999). Since rough set theory relies upon crisp partitioning of the dataset and does not consider the degree of belongingness of real-valued data, important information may be lost as a result of the crisp quantization. Dubois and Prade (1990, 1992) investigated the problem of fuzzification of rough sets, and proposed rough-fuzzy and fuzzy-rough sets. Banerjee and Pal (1996) proposed a measure of roughness of a fuzzy set defined in the partitioned domain by making use of the concept of rough-fuzzy sets. Banerjee and Pal's roughness measure depends on parameters that are designed as thresholds of definiteness and possibility in membership of the objects to a fuzzy set. Huynh and Nakamori (2004) introduced a parameter-free version of the roughness measure for fuzzy sets based on the notion of the mass assignment of a fuzzy set. A measure of fuzziness in rough sets and some characterizations of this measure have been introduced by Chakrabarty et al. (2000). Radzikowska and Kerre (2002, 2004) defined fuzzy-rough sets in terms of fuzzy conjunction and fuzzy implication operators by applying the extension principle. Sarkar and Yagnanarayana (1998a,b) generalized rough membership functions by defining rough-fuzzy and fuzzy-rough membership functions.

In this paper, we propose a modified definition of fuzzy-rough sets by utilizing properties of fuzzy t-norm and t-conorm operators. We prove that the modified definition is consistent with the definition proposed by Dubois and Prade (1992). However, the proposed approach has a few advantages. The proposed definition requires very little computational effort to perform set approximation. Further, the proposed compact computational domain identifies various patterns in the sense of the fuzzy-roughness associated with a given fuzzy partitioning of data, and provides deeper insight into fuzzy-rough sets. The proposed definition can also be utilized to make pattern recognition algorithms faster. Various mathematical properties of this new definition are discussed and analyzed in the sense of fuzzy-roughness. The fuzzy-rough sets approach to feature selection (Bhatt and Gopal, 2004b) and fuzzy decision tree induction algorithms (Bhatt and Gopal, 2004a) are some significant applications of the proposed fuzzy-rough sets on compact computational domain.

The rest of the paper is organized as follows. Section 2 provides the required theoretical background of fuzzy-rough sets and two limiting versions of it. In Section 3 we discuss the concept of fuzzy-rough sets on compact computational domain. Various properties of this modified definition are derived in Section 4. A brief note on applications of the proposed concepts and related references is given in Section 5. Section 6 concludes the paper.

* Corresponding author. Tel.: +91 11 26596133.
E-mail addresses: [email protected] (R.B. Bhatt), [email protected] (M. Gopal).

0167-8655/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.01.006
2. Theoretical foundations

I = (U, P ∪ Q, V_P, V_Q, f) is a fuzzy information system, where U is a non-empty finite set of objects called the universe, P and Q are non-empty finite sets of variables, P = {P_1, ..., P_j, ..., P_p} corresponds to the p inputs and Q is the output. f : U × (P ∪ Q) → V_P ∪ V_Q is an information function such that f(U, P_j) ∈ V_{P_j}, ∀j, and f(U, Q) ∈ V_Q. Training patterns represented by the variables P and Q are clustered into a finite number of fuzzy partitions by using any of the standard fuzzy clustering algorithms (Gath and Geva, 1989; Krishnapuram and Keller, 1993; Barni et al., 1996; Pal and Bezdek, 1995). Let U/P = {F_i}, ∀i, be the fuzzy partitions generated by P on U using a fuzzy similarity relation (Radzikowska and Kerre, 2002). Here μ_{F_i} : U → [0, 1], ∀i, is a normal fuzzy set, i.e., max_x μ_{F_i}(x) = 1 and inf_x max_i μ_{F_i}(x) > 0. In addition, sup_x min{μ_{F_i}(x), μ_{F_j}(x)} < 1, ∀i, j. μ_{F_i}(x) is the fuzzy membership of the pattern x in the fuzzy set F_i. The set U/P builds the basic partitions of knowledge about the domain of interest.

Given an arbitrary fuzzy set A on the universe U, we can approximate it using the basic fuzzy partitions. However, due to the limited discernibility of objects, the fuzzy set approximation may not be perfect. Fuzzy-rough sets are a pair of approximation degrees, ⟨μ̲, μ̄⟩, with which we can 'certainly' and 'possibly' approximate an arbitrary fuzzy set A; certainty is approximated by μ̲ and possibility by μ̄. The formal definition of fuzzy-rough sets is given below.

Definition 1. Given an arbitrary fuzzy set μ_A : U → [0, 1] and F_i ∈ U/P. According to Dubois and Prade (1990, 1992), the fuzzy-rough set is a tuple ⟨μ̲_A, μ̄_A⟩, where the lower and upper approximation membership functions are defined by

$$\underline{\mu}_A(F_i) = \inf_{x \in U} \max\{1 - \mu_{F_i}(x),\ \mu_A(x)\}$$
$$\overline{\mu}_A(F_i) = \sup_{x \in U} \min\{\mu_{F_i}(x),\ \mu_A(x)\} \qquad (1)$$
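Eq. (1) is simple to evaluate directly. The following minimal sketch (illustrative only, not code from the paper) takes the memberships as plain Python lists indexed by pattern:

```python
# Sketch of Eq. (1): Dubois-Prade lower/upper approximation memberships.
# mu_F and mu_A hold the memberships of each pattern x in F_i and in A.
def fuzzy_rough_pair(mu_F, mu_A):
    lower = min(max(1 - f, a) for f, a in zip(mu_F, mu_A))  # 'certain' degree
    upper = max(min(f, a) for f, a in zip(mu_F, mu_A))      # 'possible' degree
    return lower, upper

# e.g. one fuzzy partition element and a target fuzzy set on a 4-pattern universe
print(fuzzy_rough_pair([1.0, 0.8, 0.3, 0.0], [0.9, 0.2, 0.7, 1.0]))  # (0.2, 0.9)
```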
Two limiting versions of fuzzy-rough sets are rough-fuzzy sets and rough sets.
1. Rough-fuzzy sets (Dubois and Prade, 1990, 1992): In this case, the partitions U/P are crisp rather than fuzzy. Let μ_A : U → [0, 1], ∀x ∈ U, and let R be an equivalence relation (Pawlak, 1991) on U, with [x]_R = {y ∈ U | xRy}, ∀x ∈ U. A rough-fuzzy set is an approximation of a fuzzy set A on the crisp approximation space U/P:

$$\underline{\mu}_A([x]_R) = \inf\{\mu_A(x) \mid x \in [x]_R\}, \qquad \overline{\mu}_A([x]_R) = \sup\{\mu_A(x) \mid x \in [x]_R\} \qquad (2)$$

2. Rough sets (Pawlak, 1991): Let μ_A : U → {0, 1}, ∀x ∈ U, and let R be an equivalence relation on U. A rough set is an approximation of a crisp set A on the crisp approximation space U/P:

$$\underline{A} = \{x \in U \mid [x]_R \subseteq A\}, \qquad \overline{A} = \{x \in U \mid [x]_R \cap A \neq \emptyset\} \qquad (3)$$

where A̲ is called the lower approximation and Ā the upper approximation of A.

3. Proposed fuzzy-rough sets on compact computational domain

In this section, we propose the concept of fuzzy-rough sets on a compact computational domain by utilizing properties of fuzzy t-norm and t-conorm operators.

3.1. Fuzzy logical operators

Definition 2 (Radzikowska and Kerre, 2002). A fuzzy t-norm T and t-conorm S are increasing, associative and commutative mappings T : [0, 1]² → [0, 1] and S : [0, 1]² → [0, 1] that satisfy the boundary conditions

$$T(x, 1) = x, \quad T(x, 0) = 0; \qquad S(x, 0) = x, \quad S(x, 1) = 1 \qquad (4)$$

Due to their associative properties we can extend them to n-ary operations as

$$T_n(a_1, \ldots, a_{n-1}, a_n) = T(T_{n-1}(a_1, \ldots, a_{n-1}),\ a_n)$$
$$S_n(a_1, \ldots, a_{n-1}, a_n) = S(S_{n-1}(a_1, \ldots, a_{n-1}),\ a_n)$$

The n-ary operators T_n and S_n, which are the extensions of the triangular norm and conorm to n arguments, satisfy the same properties as the original binary T and S. The most popular t-norm and t-conorm used in fuzzy reasoning approaches are the standard 'min' and 'max' operators, respectively. For this family of fuzzy t-norm operators, 'one' is the neutral element, and for fuzzy t-conorm operators, 'zero' is the neutral element.

3.2. Modified fuzzy-rough sets

Using the neutral-element properties of the fuzzy logical operators, we suggest new definitions of the lower and upper approximation membership functions.

Definition 3. Given an arbitrary fuzzy set μ_A : U → [0, 1] and F_i ∈ U/P. The fuzzy-rough set on compact computational domain is a tuple ⟨μ̲_A, μ̄_A⟩ computed over x ∈ D̲_A(F_i) and x ∈ D̄_A(F_i), respectively. The lower and upper approximation membership functions are defined by

$$\underline{\mu}_A(F_i) = \begin{cases} \inf\limits_{x \in \underline{D}_A(F_i)} \max\{1 - \mu_{F_i}(x),\ \mu_A(x)\}, & \underline{D}_A(F_i) \neq \emptyset \\ 1, & \underline{D}_A(F_i) = \emptyset \end{cases}$$
$$\overline{\mu}_A(F_i) = \begin{cases} \sup\limits_{x \in \overline{D}_A(F_i)} \min\{\mu_{F_i}(x),\ \mu_A(x)\}, & \overline{D}_A(F_i) \neq \emptyset \\ 0, & \overline{D}_A(F_i) = \emptyset \end{cases} \qquad (5)$$

where D̲_A(F_i) ⊆ U and D̄_A(F_i) ⊆ U are the compact computational domains for the lower and upper approximation membership functions:

$$\underline{D}_A(F_i) = \{x \in U \mid \mu_{F_i}(x) \neq 0 \wedge \mu_A(x) \neq 1\}$$
$$\overline{D}_A(F_i) = \{x \in U \mid \mu_{F_i}(x) \neq 0 \wedge \mu_A(x) \neq 0\} \qquad (6)$$

Here '∧' is the logical AND connective. By Eq. (6),

$$\underline{D}_A(F_i) = f_i \cap A_C^{\,c}, \qquad \overline{D}_A(F_i) = f_i \cap A_S \qquad (7)$$

where A_C = {x ∈ U | μ_A(x) = 1} is the core of A, A_S = {x ∈ U | μ_A(x) > 0} its support, A_C^c the complement of A_C, and f_i = {x ∈ U | μ_{F_i}(x) > 0}. It is called a fuzzy-rough set on compact computational domain because it suffices to
calculate the lower and upper approximation memberships only on a certain compact domain, defined in Eq. (7), instead of on all x ∈ U. The definition given above is consistent with that given in Eq. (1), but requires less computational effort for calculating the lower and upper approximations.

For fuzzy t-norm operators, 'one' is the neutral element. The region where max{1 − μ_{F_i}(x), μ_A(x)} = 1 has no impact on the formation of the lower approximation membership, because of the 'inf' operator acting on it. We therefore find the region where max{1 − μ_{F_i}(x), μ_A(x)} ≠ 1 and compute the lower and upper approximation membership degrees only on this region; if this region is empty we show that μ̲_A(F_i) = 1. Let

$$\max\{1 - \mu_{F_i}(x),\ \mu_A(x)\} \neq 1 \;\Rightarrow\; \mu_{F_i}(x) \neq 0 \wedge \mu_A(x) \neq 1,\ x \in U \;\Rightarrow\; x \in f_i \cap A_C^{\,c}$$

It is obvious that D̲_A(F_i) = ∅ iff there is no x ∈ U such that max{1 − μ_{F_i}(x), μ_A(x)} ≠ 1, i.e., all the patterns x ∈ F_i belong to the core of fuzzy set A. In this case f_i ⊆ A_C and μ̲_A(F_i) = 1. The consistency and compactness of the upper approximation definition can be proved in a similar way. If D̄_A(F_i) = ∅, then F_i and A do not have any pattern in common (i.e., f_i ∩ A_S = ∅).

Further, if D̲_A(F_i) = U, then μ̲_A(F_i) never becomes 1. Let

$$\underline{D}_A(F_i) = U \;\Rightarrow\; f_i \cap A_C^{\,c} = U \;\Rightarrow\; f_i = U \wedge A_C^{\,c} = U \;\Rightarrow\; f_i = U \wedge A_C = \emptyset$$

For μ̲_A(F_i) to be 1, max{1 − μ_{F_i}(x), μ_A(x)} = 1, ∀x ∈ U. Now μ_A(x) ≠ 1 (since A_C = ∅) and μ_{F_i}(x) ≠ 0 (since f_i = U). Thus μ̲_A(F_i) never becomes 1 if D̲_A(F_i) = U. Qualitatively, f_i = U ∧ A_C = ∅ indicates a situation where no element belongs to the core of fuzzy set A while support(F_i) = U. f_i = U is possible, though rare, but A_C = ∅ never happens in practice. When the classification is crisp (i.e., A is a crisp set), where either an element belongs to the set or not (i.e., crisp partitioning of the decision region), saying that the core of a certain classification is empty is equivalent to saying that the classification does not exist at all. When A is fuzzy, there will be some training patterns which belong to the core of A. So, for real-world applications, the compact computational domain is strictly a subset of U:

$$\underline{D}_A(F_i) \subset U \quad \text{and} \quad |\underline{D}_A(F_i)| < n$$

This shows that the number of elements on which calculations are required by the proposed definition will always be less than the total number n of training patterns.

To explain the computation of the proposed domains, a small example dataset (Table 1) has been taken from Jensen and Shen (2002). Q classifies each of the objects into one of the classes 'No' or 'Yes'. The two equivalence classes formed by Q are U/Q = {No, Yes} = {[1, 3, 6], [2, 4, 5]}. These two classes can be thought of as fuzzy sets, with patterns belonging to the class having a membership of one, and zero otherwise. The variable P1 has been partitioned into two fuzzy sets, shown in Fig. 1. Let us consider the approximation of the equivalence classes formed by 'No' and 'Yes' with fuzzy set F1:

$$f_1 = \{x \in U \mid \mu_{F_1}(x) > 0\} = \{1, 2, 3\}$$
Table 1
Example dataset

Pattern   P1    P2    P3   Q
1         0.4   0.3   0.5  No
2         0.4   0.2   0.1  Yes
3         0.3   0.4   0.3  No
4         0.3   0.3   0    Yes
5         0.2   0.3   0    Yes
6         0.2   0     0    No
Fig. 1. Fuzzy sets for P1.
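The computation walked through below can be cross-checked with a short script. This is a minimal sketch, not code from the paper; the F1 membership values (0.8, 0.8 and 0.6 for patterns 1-3, zero elsewhere) are assumptions read off Fig. 1 and the worked numbers below:

```python
# Sketch of Definition 3 on the Table 1 data; patterns are indexed 1..6.
mu_F1 = {1: 0.8, 2: 0.8, 3: 0.6, 4: 0.0, 5: 0.0, 6: 0.0}  # assumed from Fig. 1
mu_No = {1: 1.0, 2: 0.0, 3: 1.0, 4: 0.0, 5: 0.0, 6: 1.0}  # crisp class 'No'

def compact_fuzzy_rough(mu_F, mu_A):
    # Eq. (6): compact domains for the lower and upper approximations.
    D_low = [x for x in mu_F if mu_F[x] != 0 and mu_A[x] != 1]
    D_up = [x for x in mu_F if mu_F[x] != 0 and mu_A[x] != 0]
    # Eq. (5): memberships evaluated only on the compact domains.
    low = min(max(1 - mu_F[x], mu_A[x]) for x in D_low) if D_low else 1.0
    up = max(min(mu_F[x], mu_A[x]) for x in D_up) if D_up else 0.0
    return D_low, D_up, low, up

# ([2], [1, 3], 0.2, 0.8) up to float rounding, matching the derivation below
print(compact_fuzzy_rough(mu_F1, mu_No))
```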
By the proposed definition, the lower and upper approximation compact computational domains for the equivalence class formed by 'No' are

$$\underline{D}_{No}(F_1) = f_1 \cap No_C^{\,c} = \{1, 2, 3\} \cap \{2, 4, 5\} = \{2\}$$
$$\overline{D}_{No}(F_1) = f_1 \cap No_S = \{1, 2, 3\} \cap \{1, 3, 6\} = \{1, 3\}$$

This shows that only pattern {2} belongs to the lower approximation effective region and patterns {1, 3} to the upper approximation effective region. Calculating the lower and upper approximation membership degrees only on this compact region, we get

$$\underline{\mu}_{No}(F_1) = \inf\{\max\{1 - 0.8,\ 0\}\} = \inf\{0.2\} = 0.2$$
$$\overline{\mu}_{No}(F_1) = \sup\{\min\{0.8, 1\},\ \min\{0.6, 1\}\} = \sup\{0.8, 0.6\} = 0.8$$

This example shows that the proposed definition identifies the effective region for the formation of the lower and upper approximations and then calculates membership degrees only on this compact domain rather than over all the patterns. For larger databases, this results in a substantial computational gain.

4. Properties of proposed version of fuzzy-rough sets

As an outgrowth of the presented definition of fuzzy-rough sets, here are some qualitative and quantitative observations which further explore the significance of the proposed approach from a pattern classification viewpoint.

Observation 1. D̲_A(F_i) = ∅ or D̄_A(F_i) = ∅ if and only if no fuzzy-roughness is associated with the patterns x, ∀x ∈ F_i.

Proof. No fuzzy-roughness is associated with pattern x if the fuzzy set F_i in which x has non-zero membership either belongs only to the one class A, or does not share any pattern with the class A. Let

$$\underline{D}_A(F_i) = \emptyset \iff f_i \cap A_C^{\,c} = \emptyset \iff f_i \subseteq A_C \iff \underline{\mu}_A(F_i) = 1$$

In this case, all the patterns from fuzzy set F_i belong completely to the core of A; fuzzy set F_i perfectly approximates the arbitrary fuzzy set A. By the same logic one can show that with D̄_A(F_i) = ∅ there is no x such that x ∈ f_i ∧ x ∈ A_S, i.e., f_i and A_S do not have any pattern in common and thus μ̄_A(F_i) = 0; fuzzy set F_i completely fails to approximate the arbitrary fuzzy set A. □

Observation 2. If no fuzzy linguistic uncertainty is associated with the patterns x, ∀x ∈ U, then the proposed definition reduces to the rough-fuzzy lower and upper approximation memberships on the respective compact computational domains.

Proof. If no fuzzy linguistic uncertainty is associated with any pattern, then each partition is crisp with μ_{F_i} : U → {0, 1}, and let μ_A : U → [0, 1], ∀x ∈ U. F_i can be regarded as an equivalence class [x]_R, where [x]_R = {y ∈ U | xRy}, ∀x ∈ U:

$$\underline{\mu}_A([x]_R) = \begin{cases} \inf\limits_{x \in [x]_R \cap A_C^{\,c}} \mu_A(x), & [x]_R \cap A_C^{\,c} \neq \emptyset \\ 1, & [x]_R \cap A_C^{\,c} = \emptyset \end{cases}$$
$$\overline{\mu}_A([x]_R) = \begin{cases} \sup\limits_{x \in [x]_R \cap A_S} \mu_A(x), & [x]_R \cap A_S \neq \emptyset \\ 0, & [x]_R \cap A_S = \emptyset \end{cases}$$

It is clear from the above definition that μ̄_A([x]_R) = 0 if and only if [x]_R ∩ A_S = ∅, i.e., [x]_R and A do not have any pattern in common, and μ̄_A([x]_R) = 1 if and only if [x]_R ∩ A_C ≠ ∅, i.e., at least one element from the equivalence class [x]_R belongs to the core of fuzzy set A. □

Observation 3. If no fuzzy linguistic and no fuzzy classification uncertainties are associated with the patterns x ∈ U, then μ_{F_i} : U → {0, 1} and μ_A : U → {0, 1}, ∀x ∈ U. In this case the proposed definition reduces to the classical rough set representation with the corresponding compact domain:

$$\underline{\mu}_A([x]_R) = \begin{cases} 1, & [x]_R \subseteq A \\ 0, & [x]_R \nsubseteq A \end{cases} \qquad \overline{\mu}_A([x]_R) = \begin{cases} 1, & [x]_R \cap A \neq \emptyset \\ 0, & [x]_R \cap A = \emptyset \end{cases}$$
The lower and upper compact computational domains are in this case both expressed through the crisp set A itself, since the set to be approximated is crisp, i.e., A_C = A_S = A; thus

$$\underline{D}_A([x]_R) = [x]_R \cap A^c \quad \text{and} \quad \overline{D}_A([x]_R) = [x]_R \cap A$$

Observation 4. If the partitioning is fine, then the proposed computational domains identify the regions where the degree of belongingness of a pattern x to the fuzzy set A is equal to one, greater than zero but not equal to one, and equal to zero.

Proof. When the partitioning is fine, each pattern x is treated individually rather than as belonging to a fuzzy or crisp partition. The membership degree of such a fine partition is therefore represented by μ(x) alone and is equal to 1. The lower and upper approximation membership degrees of fuzzy set A by such partitions are denoted by μ̲_A(x) and μ̄_A(x) for all x ∈ U. For this formulation it is easily proved that μ̲_A(x) = μ̄_A(x) = μ_A(x). From the proposed definition,

$$\underline{D}_A(x) = \emptyset \iff \{x\} \cap A_C^{\,c} = \emptyset,\ \text{i.e.},\ x \in A_C \iff \mu_A(x) = 1$$
$$\overline{D}_A(x) = \emptyset \iff \{x\} \cap A_S = \emptyset,\ \text{i.e.},\ x \notin A_S \iff \mu_A(x) = 0$$

If D̲_A(x) ≠ ∅ and D̄_A(x) ≠ ∅, then x ∉ A_C and x ∈ A_S, which implies 0 < μ_A(x) < 1. □

To compute the lower approximation using more than one variable, the Cartesian product of the fuzzy partitions generated by the given set of variables should be considered. In this case

$$\mu_{\{F_1, \ldots, F_j\}}(x) = \min\{\mu_{F_1}(x), \ldots, \mu_{F_j}(x)\}$$

and

$$\underline{\mu}_A(F_1, \ldots, F_j) = \begin{cases} \inf\limits_{x \in \underline{D}_A(F_1, \ldots, F_j)} \max\{1 - \min(\mu_{F_1}(x), \ldots, \mu_{F_j}(x)),\ \mu_A(x)\}, & \underline{D}_A(F_1, \ldots, F_j) \neq \emptyset \\ 1, & \underline{D}_A(F_1, \ldots, F_j) = \emptyset \end{cases}$$

Observation 5. The compact computational domains for set approximation with the Cartesian product of fuzzy partitions can be defined as

$$\underline{D}_A(F_1, \ldots, F_j) = f_1 \cap \cdots \cap f_j \cap A_C^{\,c} = \underline{D}_A(F_1) \cap \cdots \cap \underline{D}_A(F_j)$$
$$\overline{D}_A(F_1, \ldots, F_j) = f_1 \cap \cdots \cap f_j \cap A_S = \overline{D}_A(F_1) \cap \cdots \cap \overline{D}_A(F_j)$$

Proof. To prove this, we find the region where max{1 − min(μ_{F_1}(x), ..., μ_{F_j}(x)), μ_A(x)} ≠ 1, and define the compact computational domain as the set-theoretic difference of U with the excluded region:

$$\max\{1 - \min(\mu_{F_1}(x), \ldots, \mu_{F_j}(x)),\ \mu_A(x)\} \neq 1$$
$$\Rightarrow \min(\mu_{F_1}(x), \ldots, \mu_{F_j}(x)) \neq 0 \wedge \mu_A(x) \neq 1, \quad \forall x \in U$$
$$\Rightarrow (\mu_{F_1}(x) \neq 0 \wedge \cdots \wedge \mu_{F_j}(x) \neq 0) \wedge (\mu_A(x) \neq 1), \quad \forall x \in U$$
$$\Rightarrow x \in f_1 \cap \cdots \cap f_j \cap A_C^{\,c}$$

Thus D̲_A(F_1, ..., F_j) = D̲_A(F_1) ∩ ⋯ ∩ D̲_A(F_j). The proof for D̄_A(F_1, ..., F_j) can be developed on similar lines. □

Note that

$$\underline{D}_A(F_1, F_2, \ldots, F_j) \subseteq \underline{D}_A(F_i), \quad |\underline{D}_A(F_1, F_2, \ldots, F_j)| \leq |\underline{D}_A(F_i)|, \quad i = 1, \ldots, j$$

From this observation, the following two results can be developed trivially:

1. D̲_A(F_1 ∪ F_2) = D̲_A(F_1) ∪ D̲_A(F_2)
2. D̄_A(F_1 ∪ F_2) = D̄_A(F_1) ∪ D̄_A(F_2)

We prove the first; the second can be derived on similar lines:

$$\underline{\mu}_A(F_1 \cup F_2) = \inf_{x \in U} \max\{1 - \max(\mu_{F_1}(x), \mu_{F_2}(x)),\ \mu_A(x)\}$$

Proceeding as in Observation 5,

$$\max\{1 - \max(\mu_{F_1}(x), \mu_{F_2}(x)),\ \mu_A(x)\} \neq 1$$
$$\iff (\mu_{F_1}(x) \neq 0 \vee \mu_{F_2}(x) \neq 0) \wedge \mu_A(x) \neq 1, \quad x \in U$$
$$\iff x \in (f_1 \cup f_2) \cap A_C^{\,c} \iff x \in \underline{D}_A(F_1) \cup \underline{D}_A(F_2)$$

Observation 6.

1. D̲_{A∪B}(F_i) = D̲_A(F_i) ∩ D̲_B(F_i)
2. D̲_{A∩B}(F_i) = D̲_A(F_i) ∪ D̲_B(F_i)
3. D̄_{A∪B}(F_i) = D̄_A(F_i) ∪ D̄_B(F_i)
4. D̄_{A∩B}(F_i) = D̄_A(F_i) ∩ D̄_B(F_i)
Proof. We prove the first and the fourth; the other two can be proved on similar lines of argument.

1. μ̲_{A∪B}(F_i) = inf_{x∈U} max{1 − μ_{F_i}(x), max(μ_A(x), μ_B(x))}. We find the region where this max differs from 1:

$$\max\{1 - \mu_{F_i}(x),\ \max(\mu_A(x), \mu_B(x))\} \neq 1$$
$$\iff \mu_{F_i}(x) \neq 0 \wedge (\mu_A(x) \neq 1 \wedge \mu_B(x) \neq 1), \quad \forall x \in U$$
$$\iff x \in f_i \cap A_C^{\,c} \cap B_C^{\,c} \iff x \in \underline{D}_A(F_i) \cap \underline{D}_B(F_i)$$

4. μ̄_{A∩B}(F_i) = sup_{x∈U} min{μ_{F_i}(x), min(μ_A(x), μ_B(x))}. We find the region where this min differs from 0:

$$\min\{\mu_{F_i}(x),\ \min(\mu_A(x), \mu_B(x))\} \neq 0$$
$$\iff \mu_{F_i}(x) \neq 0 \wedge (\mu_A(x) \neq 0 \wedge \mu_B(x) \neq 0), \quad \forall x \in U$$
$$\iff x \in f_i \cap (A_S \cap B_S) \iff x \in \overline{D}_A(F_i) \cap \overline{D}_B(F_i) \qquad \square$$

The following is an interesting observation on the basis of the above two propositions.

Observation 7. Observations 5 and 6 can be combined to define fuzzy-rough sets on compact computational domain for approximating multiple fuzzy sets involving intersection and union operations. For example, for two fuzzy sets A and B approximated through two fuzzy sets F_1 and F_2, the compact computational domain can be obtained as

$$\underline{D}_{A \cap B}(F_1, F_2) = \underline{D}_A(F_1, F_2) \cup \underline{D}_B(F_1, F_2) = \{\underline{D}_A(F_1) \cap \underline{D}_A(F_2)\} \cup \{\underline{D}_B(F_1) \cap \underline{D}_B(F_2)\}$$

Observation 8.

1. D̲_A(F_i) = ∅ ⟹ T^f_A(x) = 1, ∀x ∈ F_i
2. D̄_A(F_i) = ∅ ⟹ T^f_A(x) = 0, ∀x ∈ F_i

where

$$T^f_A(x) = \frac{\sum_{j=1}^{H} \mu_{F_j}(x)\, \mu_A^j}{\sum_{j=1}^{H} \mu_{F_j}(x)}, \qquad \mu_A^j = \frac{\|A \cap F_j\|}{\|F_j\|} = \frac{\sum_{x \in U} \min(\mu_{F_j}(x), \mu_A(x))}{\sum_{x \in U} \mu_{F_j}(x)}$$

T^f_A(x) is the constrained fuzzy-rough membership function defined by Sarkar and Yagnanarayana (1998a), and H is the number of fuzzy sets in which x has non-zero membership value.

Proof. T^f_A(x) = 1 if and only if μ^j_A = 1, ∀j = 1, ..., H, which implies that F_j ⊆ A, ∀j = 1, ..., H. In Observation 1 we proved that f_j ⊆ A_C (and hence F_j ⊆ A_S) iff D̲_A(F_j) = ∅, ∀j; this in turn implies μ^j_A = 1, ∀j = 1, ..., H, which completes the first part. The proposition shows that if D̲_A(F_j) = ∅, then fuzzy-roughness is absent for all the patterns x ∈ F_j. In the same way, one can show that T^f_A(x) = 0 requires μ^j_A = 0, ∀j = 1, ..., H. In Observation 1 we also showed that if D̄_A(F_j) = ∅, ∀j = 1, ..., H, then F_j, ∀j = 1, ..., H, and A do not have any pattern in common, which indicates that μ^j_A = 0, ∀j = 1, ..., H. □

Observation 9. From the definition of the compact computational domains of the lower and upper approximation memberships, the classification properties of a fuzzy set F_i in classifying an arbitrary fuzzy set A can be grouped into three basic categories:

1. A is completely F_i-observable in a fuzzy-rough manner iff D̲_A(F_i) = ∅.
2. A is partially F_i-observable in a fuzzy-rough manner iff D̲_A(F_i) ≠ ∅ and D̄_A(F_i) ≠ ∅.
3. A is completely F_i-unobservable in a fuzzy-rough manner if D̄_A(F_i) = ∅.

The defined compact computational domain for fuzzy-rough sets has a close link with the minima of the classification entropy (Umano et al., 1994) and classification ambiguity (Yuan and Shaw, 1995) functions.
Let, for the given fuzzy set F_i, q discrete classes of concern, {1, 2, ..., q}, be required to classify. Consider the two functions

$$Entr_i = -\sum_{l=1}^{q} p_{il} \log_2 p_{il} \quad \text{and} \quad Ambg_i = \sum_{l=1}^{q} (p^*_{il} - p^*_{i,l+1}) \ln l$$

where Entr_i and Ambg_i are the classification entropy and the classification ambiguity of F_i, respectively. p_{il} is the certainty of F_i concerning the l-th class, defined as

$$p_{il} = \frac{\sum_{x \in U} \min(\mu_{F_i}(x), \mu_l(x))}{\sum_{x \in U} \mu_{F_i}(x)}$$

with the normalization p̂_{il} = p_{il} / max_{1≤l≤q} p_{il}, ∀l = 1, ..., q, where (p*_{i1}, ..., p*_{iq}) is the descending-order arrangement of (p̂_{i1}, ..., p̂_{iq}).

Observation 10. If ∃l ∈ Q such that D̲_l(F_i) = ∅, the functions Entr_i and Ambg_i attain their minima simultaneously.

Proof. From Observation 1, if ∃l ∈ Q such that D̲_l(F_i) = ∅, then F_i ⊆ l and F_i ∩ j = ∅, ∀j ∈ Q, j ≠ l. This is the situation where F_i approximates one class perfectly and does not share any pattern with the other classes. From the definition of p_{il}, it is self-evident that {p_{i1}, ..., p_{il}, ..., p_{iq}} = {0, ..., 1, ..., 0}. Now, with the convention 0 · log_2 0 = lim_{x→0}(x · log_2 x) = 0, the function Entr_i attains its minimum at a vector of certainty factors in which each component is either 0 or 1, and Ambg_i attains its minimum at a vector in which exactly one component is 1 and all the other components are 0. To prove this, consider the second derivative of the function Entr_i, which is negative in each component, so that Entr_i is minimized at the vertices, and rewrite Ambg_i as

$$g(p_{i1}, p_{i2}, \ldots, p_{iq}) = \sum_{l=2}^{q} p^*_{il} \ln \frac{l}{l-1}$$

It is easy to verify that g attains its minimum only at (1, 0, ..., 0). □

5. Applications

R.C. Staunton / Pattern Recognition Letters 26 (2005) 1609–1619

Increasing σ beyond 2.5 was found to increase TP% to over 70 and to reduce FP% to below 8. A TP% of 85 was the maximum found, because the non-maximal suppression stage, which limits the detected edge to be one pixel thick, does not always result in the retention of the best pixel to match the positions marked in the ground truth. Similarly, the FP% never falls below 4%. Thus for high SNR images the FPs tended to lie close to
the ground truth positions, whereas for low SNR images the FPs were randomly distributed throughout each image and resulted where noisy responses exceeded the upper threshold.

Fig. 8. TP% (—) and FP% (- - -) vs. σ, for SNR from 10 dB to 40 dB.

In the second simulation, the upper threshold was varied while the lower threshold and σ were held constant, and in the third simulation, the lower threshold was varied while the upper threshold and σ were held constant. Fig. 9 shows TP% against variation from 0.1 to 0.9 of both the upper threshold with the lower constant at 0.1, and the lower threshold with the upper constant at 0.9. Smoothing was disabled (σ = 0). The results show a small increase in TP% for a decreasing upper threshold; however, a large increase can be obtained if the lower threshold is reduced below 0.8 for high SNR images, below 0.6 for low SNR images, and below 0.3 for very low SNR images. Fig. 10 shows FP% against variation from 0.1 to 0.9 of both the upper threshold with the lower constant at 0.1, and the lower threshold with the upper constant at 0.9. Smoothing was disabled. For a high SNR, the FP% can be kept below 5 if an upper threshold greater than 0.9 is used; the lower threshold is then immaterial. For the low SNR cases, the upper threshold should again be kept above 0.9, and the lower threshold above 0.8, to keep the FP% below 12.

For this acquisition system with high SNR (30 dB) images and no smoothing filter, Figs. 9 and 10 indicate that an upper threshold of 0.9 and a lower threshold of 0.8 will give good results; TP% = 83.6 and FP% = 3.9 were achieved. Fig. 8 indicates that smoothing (σ = 1.0) should improve these results, and it was found to increase TP% to 84.6 and reduce FP% to 3.7. However, a conflict occurs when Figs. 9 and 10 are used to estimate the lower threshold for low SNR images, where a low lower threshold results in a high TP%, but also a high FP%.
Fig. 9. TP% vs. upper (—) and lower (- - -) threshold, for SNR from 10 dB to 40 dB.
Fig. 10. FP% vs. upper (—) and lower (- - -) threshold, for SNR from 10 dB to 40 dB.
For example, with SNR = 20 dB, Fig. 9 indicates a lower threshold of 0.6 to maximize TP% at 60.4 (FP% = 14.3), but Fig. 10 indicates a threshold of 0.9 to minimize FP% at 6.9 (TP% = 44.5). The final choice will be problem dependent, but Fig. 8 indicates that improvement can be achieved in both TP% and FP% by increasing σ to 1.8. Then, by setting an intermediate lower threshold of 0.75, TP% = 83.4 and FP% = 4.0 were obtained. Table 1 shows optimum detector settings for the tested system.

Table 1
Optimum detector settings for tested system

SNR (dB)   σ     Lower threshold   Upper threshold   TP%    FP%
20         1.8   0.75              0.9               83.4   4.0
30         1.0   0.8               0.9               84.6   3.7

5.1. Example of use
The acquisition system measured in the first part of Section 5 was used to image a soccer ball made from stitched panels. The image is displayed in Fig. 11(a). The characteristics displayed in Figs. 8–10 were used to choose parameters for the Canny edge detector. The characteristics were compiled from noise-free data to which random white noise was added to give results for the various SNRs. For this test, the standard deviation (SD) of the system's intensity noise was measured so that the actual SNR of the image could be calculated. The SD varies with the intensity of the signal, and SD-vs.-intensity calibration curves can be compiled (Healey and Kondepudy, 1994). However, here just the soccer ball image was considered, and the SD of the noise in the light and dark image areas was measured using 64 time-sequenced images of the scene to estimate the SD for each image pixel (Wittels et al., 1988). An SD = 1.8883 was the maximum found for the light areas, and SD = 1.4082 for the dark areas.

Fig. 11. Soccer ball edge maps. (a) Original, (b) σ = 0.5, LT = 0.8, and UT = 0.9, (c) σ = 1.0, LT = 0.8, and UT = 0.9, (d) σ = 2.7, LT = 0.2, and UT = 0.7.

Examining the image, we want the detector to find edges around the ball and between the panels. The panels are stitched and then folded, so we can expect up to three parallel edges at each join. The average intensity step height between the ball and the background was measured to be 120 units. Using Eq. (2) with SD = 1.8883 resulted in an SNR = 36 dB. Fig. 8 was used to estimate a suitable range of σ values. The 30 dB curve was used because it represents the next worst case below 36 dB. Considering true positive edge points, 0.5 ≤ σ ≤ 1.0 will result in 80 ≤ TP% ≤ 85 and 4 ≤ FP% ≤ 12. This range of FP% is large, but σ = 0.5 results in a more compact filter support and was chosen for the first test. Fig. 9 was used to estimate suitable lower (LT) and upper (UT) threshold values, with LT ≤ 0.8 indicating a TP% = 80. This constrained UT ≥ LT. Fig. 10 was used to refine the thresholds and ensure a low FP%. It indicated UT = 0.9 to reduce FP% to 5, and that any LT can be used. Finally σ = 0.5, LT = 0.8, and UT = 0.9 were chosen, and the resulting edge map is shown in Fig. 11(b). There was no ground truth for this image, but from Figs. 8–10, TP% = 80 and FP% = 12 were estimated. The outline of the ball appears to be correct, but there are many noise edge points, as might be expected with a high FP%. The experiment was repeated with σ = 1.0. We expected FP% to reduce to 4, and the results in Fig. 11(c) now show an almost noise-free background.

Now, considering the missing edge between two of the panels in the left of the ball in Fig. 11(c), we can use Figs. 8–10 to estimate detector parameters to optimally detect this edge. The average step height was measured to be 7. The areas are both dark, so the noise SD = 1.4082. This resulted in an SNR = 14 dB, a very low figure. The nearest SNR to this, SNR = 10 dB, was used with Fig. 8 to identify that σ = 2.7 will give TP% = 72 and FP% = 7. Fig. 9 indicates UT ≤ 0.7 and LT ≤ 0.2 to maximize TP%, whereas Fig. 10 indicates 0.7 ≤ UT ≤ 0.9 and LT as high as possible to minimize FP%. The final choice was σ = 2.7, LT = 0.2, and UT = 0.7, and the resulting edge map is shown in Fig. 11(d). The edges between the panels have now been partly detected. It would have been possible to determine the detection parameters by trial and error, but using the proposed method, TP% and FP% estimates have been made available and can be maximized for a particular edge.
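The SNR arithmetic used above is easy to reproduce. A minimal sketch, assuming that Eq. (2), which is not shown in this excerpt, is the usual step-height-to-noise-SD ratio SNR = 20 log10(h/SD); that form reproduces both quoted figures:

```python
import math

def snr_db(step_height, noise_sd):
    # Assumed form of Eq. (2); it matches both values quoted in the text:
    # 20*log10(120/1.8883) ~ 36 dB and 20*log10(7/1.4082) ~ 14 dB.
    return 20 * math.log10(step_height / noise_sd)

print(round(snr_db(120, 1.8883)))  # 36 -> read the 30 dB curves (next worst case)
print(round(snr_db(7, 1.4082)))    # 14 -> read the 10 dB curves
```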
6. Conclusions

Sets of realistic test edges with accurately known positions were produced and used to measure the performance of the Canny detector. The edges were realistic in that they were synthesized by convolving a step edge with the measured PSF of the particular image acquisition system under investigation. The test edge images were identical to those which would be captured by the acquisition system when viewing an object, but with the advantage that the actual positions of the edges were known and available for comparison with the detected result. With this technique it was also possible to place a synthesized edge at different sub-pixel positions and orientations within the central image pixel, and thus a more comprehensive set of tests could be performed. The detector results were expressed in a true positive and false positive format. The optimum values of the detector's parameters were found for a series of edges to which increasing levels of noise had been added. Overall the Canny detector was found to work well. With low noise images, a few extra edge points were detected connected to the edges that were not part of the ground truth line. However, with high noise images, extra edge points were detected randomly distributed within the image. The method will improve the setting-up procedure of image acquisition systems. Given minimum required true positive and maximum false positive system
performance figures, the detector parameters can be chosen to minimize σ for the Gaussian filtering stage, and thus maximize computational efficiency.
References

Bowyer, K., Kranenburg, C., Dougherty, S., 2001. Edge detector evaluation using empirical ROC curves. Comput. Vision Image Underst. 84, 77–103.
Canny, J., 1986. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8 (6), 679–698.
Davies, E.R., 1984. Circularity – a new principle underlying the design of accurate edge orientation operators. Image Vision Comput. 2 (3), 134–142.
Davis, L.S., 1975. A survey of edge detection techniques. Comput. Graphics Image Process. 4, 248–270.
Healey, G.E., Kondepudy, R., 1994. Radiometric CCD camera calibration and noise estimation. IEEE Trans. Pattern Anal. Mach. Intell. 16 (3), 267–276.
Lyvers, E.P., Mitchell, O.R., 1988. Precision edge contrast and orientation estimation. IEEE Trans. Pattern Anal. Mach. Intell. 10 (6), 927–937.
Reichenbach, S.E., Park, S.K., Narayanswamy, R., 1991. Characterizing digital image acquisition devices. Opt. Eng. 30 (2), 170–177.
Staunton, R.C., 1998. Edge operator error estimation incorporating measurements of CCD TV camera transfer function. IEE Proc. Vision Image Signal Process. 145 (3), 229–235.
Tzannes, A.P., Mooney, J.M., 1995. Measurement of the modulation transfer function of infrared cameras. Opt. Eng. 34 (6), 1808–1817.
White, R.G., Schowengerdt, R.A., 1994. Effect of point spread functions on precision step edge measurement. J. Opt. Soc. Am. A: Opt. Image Sci. Vis. 11 (10), 2593–2603.
Wittels, N., McClelland, J.R., Cushing, K., Howard, W., Palmer, A., 1988. How to select cameras for machine vision. Proc. SPIE 1197, 44–53.
Pattern Recognition Letters 26 (2005) 1620–1631 www.elsevier.com/locate/patrec
A fast and robust feature-based 3D algorithm using compressed image correlation Sheng S. Tan *, Douglas P. Hart Mechanical Engineering Department, Massachusetts Institute of Technology, 77 Mass Avenue, 3-243, Cambridge, MA 02139, USA Received 19 March 2004; received in revised form 19 December 2004 Available online 19 April 2005 Communicated by E. Backer
Abstract

Two objectives of 3D computer vision are high processing speed and precise recovery of object boundaries. This paper addresses these issues by presenting an algorithm that combines feature-based 3D matching with Compressed Image Correlation. The algorithm uses an image compression scheme that retains pixel values in high intensity gradient areas while eliminating pixels with little correlation information in smooth surface regions. The remaining pixels are stored in sparse format along with their relative locations encoded into 32-bit words. The result is a highly reduced image data set containing distinct features at object boundaries. Consequently, far fewer memory calls and data entry comparisons are required to accurately determine edge movement. In addition, by utilizing an error correlation function, pixel comparisons are made through single integer calculations, eliminating time-consuming multiplication and floating point arithmetic. Thus, this algorithm typically results in much higher correlation speeds than spectral correlation and SSD algorithms. Unlike the traditional fixed window sorting scheme, adaptive correlation window positioning is implemented by dynamically placing object boundaries at the center of each correlation window. Processing speed is further improved by compressing and correlating the images in only the direction of disparity motion between frames. Test results on both simulated disparities and a real motion image pair are presented.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Compressed image correlation; Gradient-based compression; Feature matching; Adaptive window; 3D vision
* Corresponding author. Tel.: +1 617 253 0229; fax: +1 240 214 8665.
E-mail address: [email protected] (S.S. Tan).

1. Introduction
Features such as edges and corners play an important role in human vision. The visual cortex is especially responsive to strong features in a
0167-8655/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.01.009
scene (Sharma et al., 2003). Together with related abilities such as correspondence matching and tracking, humans are able to react quickly to the environment and focus attention on objects of interest. The significance of features is fully recognized in computer vision. For example, one traditional class of techniques applied to facial recognition is based on the computation of a set of geometrical features from a picture of the face, such as the sizes and relative positions of eyes, mouth, nose and chin (Brunelli and Poggio, 1993; Gao and Leung, 2002). There is even a belief that edge representations may contain all of the information required for the majority of higher-level tasks (Elder, 1999). Feature-based 2D tracking is extensively implemented in automated surveillance, robotic manipulation and navigation. Because real-time processing is a necessity in these applications, only perceptually significant information such as contours is retained from video feeds. If the target's 3D model is known, its detected contours are compared against its geometrical model to determine the object's current position and orientation (Drummond and Cipolla, 2002). If there is no a priori knowledge of the target, it is tracked by finding the contours' disparity between frames using cross correlation (Deriche and Faugeras, 1990) or level sets (Mansouri, 2002). Passive 3D imaging can be reduced to the problem of resolving disparities between image frames from one or several cameras. Some key issues involved are lack of texture, discontinuity and speed. Numerous algorithms have been proposed to address these issues; they fall into three broad categories: feature-based, area-based and volume-based algorithms (Scharstein and Szeliski, 2002). As in 2D tracking, feature-based 3D imaging techniques are able to process an extensive amount of video data in real time while providing enough latency for high-level tasks such as object recognition (Hoff and Ahuja, 1989; Olsen, 1990). This group of methods generates sparse but accurate depth maps at feature points and excels at determining object boundary position where area-based techniques often fail. When a full-field depth map is desirable, the sparse 3D representation provides a solid foundation for additional area- or volume-
based algorithms to fill in the voids when there is ample surface texture; otherwise, when texture is scarce or highly repetitive, object segmentation methods and interpolation are preferable (Izquierdo, 1999; Mahamud et al., 2003). In the emerging field of image-based 3D editing, which has many applications in architectural design and entertainment, long and tedious human efforts are required to manually extract layers and assign depths for a 2D image (Oh et al., 2001). Automatic feature-based depth detection would greatly facilitate this process. Broad adoption of 3D imaging technology is currently limited by speed and robustness. Applications such as robotic surgery (Hoznek et al., 2002) and autonomous navigation or tracking demand real-time processing. As an example, a texture-mapped 3D scene would greatly aid a surgeon's tactile sense. 3D reconstructed views enable better object recognition without turning the camera and taking more images. 3D object tracking would be much more robust than its 2D counterpart if reliable depth information is available.

1.1. Relation to prior work

Current methods of finding feature correspondence can be categorized into global or local techniques. Global approaches to the sparse correspondence problem handle the entire set of sparse points by forming a global optimization function. Various constraints such as color constancy, continuity, uniqueness and epipolar constraints guide the search for a global solution (Bartoli et al., 2003; Maciel and Costeira, 2003). Global techniques are usually robust but relatively slow due to the iteration and optimization process. Local methods find each pixel's correspondence by computing a cost function in a small interrogation window around the pixel of interest. Popular cost functions include SSD (Sum of Squared Differences) and cross correlation. Sparse Array Image Correlation is commonly implemented in the field of Particle Image Velocimetry (PIV), where fluid fields are seeded with fluorescent tracer particles and illuminated with a laser sheet. Flow motion is measured by tracking particle displacement. PIV images are comprised of millions of bright spots
over a dark background. Before correlation, each image is compressed into a subset of pixels that includes only high-gradient areas (Hart, 1998). This technique is especially fast and robust at handling large data sets. The algorithm presented in this paper shares the same computational grounds as Sparse Array Image Correlation. Depth discontinuities have been a major concern in area-based stereo matching. Boundary overreach, where the detected boundary locations deviate from the real boundaries, often occurs when the interrogation window contains both the boundary and its adjacent smooth surfaces. Adaptive window techniques have been developed to solve this problem (Kanade and Okutomi, 1994; Okutomi et al., 2002). An asymmetrical window is set around the pixel of interest so that the interrogation window does not cover the object boundary. A cost function is calculated for each possible window location around the pixel of interest, and the window with the optimized result is chosen. The disadvantage of such adaptive window schemes is that the computational load is increased by an order of magnitude due to traversing all the possible windows.

1.2. Contribution of this paper

Speed and precise recovery of edge locations and disparities are the two main goals of the algorithm presented here. The purpose of edge detection in the algorithm is to aid the compressed correlation process, which distinguishes it from typical correlation- or SSD-based techniques. The remaining pixels are stored in sparse format along with their relative locations encoded into 32-bit words. Compression dramatically increases speed because only a fraction of the original pixels are retained for correlation. By introducing the well-established gradient-based compressed image correlation algorithm from the computational fluids community to the computer vision field, real-time scene reconstruction may gain new momentum. Coarse correlation is first performed to obtain an integer-pixel disparity estimate using large adaptive interrogation windows. Then fine correlation with smaller windows resolves each on-edge
pixel's disparity to sub-pixel resolution based on the rough estimate from coarse correlation. Error correlation is chosen over standard cross correlation because pixel comparisons are made through simple integer calculations rather than the computationally expensive multiplication and floating point arithmetic. In order to avoid the boundary overreach problem, adaptive window positioning is also utilized. However, in the proposed algorithm, interrogation window selection is integrated with edge detection. The optimum window location is explicitly determined at the moment an edge is detected, without testing through a series of possible windows. In this paper, Section 2 describes the intensity gradient compression methodology and the significance of threshold setting. Section 3 explains error correlation in a compressed format. These two sections provide the computational grounds of the proposed algorithm. Section 4 shows how the appropriate window location is adaptively selected and its advantages. In Section 5, fine correlation combined with depth-based segmentation and interpolation is presented as a possible approach to generate a complete depth map. Section 6 provides experimental results of both simulated disparities and real image pairs. The quality of the disparity maps obtained demonstrates the effectiveness of the proposed algorithm.
2. Image compression

The first step in compressed image correlation is to generate a data array that contains just enough information to determine the disparity between two images. Consider the statistical correlation function

$$\Phi_{\Delta i, \Delta j} = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N} I_{m+\Delta i,\, n+\Delta j}\, I_{m,n}}{\sqrt{\sum_{m=1}^{M} \sum_{n=1}^{N} I_{m,n}^2}\ \sqrt{\sum_{m=1}^{M} \sum_{n=1}^{N} I_{m+\Delta i,\, n+\Delta j}^2}} \qquad (1)$$

It is clear that pixels of low intensity contribute little to the correlation coefficient, while pixels with high intensities have a much more significant weight due to squaring. This is the reason why
cross correlation produces spurious vectors when there is a flare in one image caused by environmental lighting fluctuations. Correlation also fails in featureless, low intensity gradient regions where camera noise becomes significant. Much of the sub-pixel accuracy in image disparity comes from the pixels residing on edges. Thus, discarding low intensity areas and taking into account only strong features that are relatively insensitive to noise improves correlation robustness.

In the algorithm presented herein, local spatial gradients are calculated for each pixel by comparing the intensities of every other pixel instead of two neighboring pixels, in order to preserve a wider edge. For gradients in both the horizontal and vertical directions, the local gradient is approximated as

$$\mathrm{grad}(i, j) = |I(i+2, j) - I(i, j)| + |I(i, j+2) - I(i, j)| \qquad (2)$$

If disparities between the two images only occur in a known direction, for example horizontally, the gradient formula reduces to

$$\mathrm{grad}(i, j) = |I(i+2, j) - I(i, j)| \qquad (3)$$

For the sake of simplicity, this paper only deals with the horizontal disparity case. When the local gradient is larger than a preset threshold, e.g., 20 grayscales, the pixel of interest is retained and saved into a sparse array comprised of 32-bit long words along with its relative location. Each 32-bit word is divided into three sections: the last 8 bits store the pixel intensity, the middle 12 bits the x-index i, and the first 12 bits the y-index j. For example, a pixel of intensity I = 60 at location i = 1078 and j = 395 is saved as 000110001011 010000110110 00111100 binary. Storing data in this compressed format significantly reduces the number of memory calls that must be made during correlation. The values of i, j and I can be quickly retrieved in a couple of CPU clock cycles by bit-shifting, which is optimized for speed in most processors.

The above gradient criterion is chosen for its simplicity and small region of support. The major concern here is extracting high intensity gradient pixels for correlation, not a complete edge map of enclosed contours. Other popular edge detectors such as Canny, Sobel and Gaussian are not only computationally expensive but also require global filtering before edge detection (Horn, 1986). This is impossible for the sparse array format, where a simple block transfer cannot be done as in uncompressed-format correlation.

A proper threshold is essential to both speed and robustness. The higher the threshold, the faster the algorithm, because fewer pixels are stored in the sparse array for correlation; robustness is also better because only major object boundaries are detected and minor image textures are omitted. The overall compression ratio is determined by both the image complexity and the threshold. Fig. 1 illustrates the threshold's role in extracting strong features. At a low threshold, not only the object boundaries but also texture-less areas such as the wall and table are detected. At a higher threshold only the clean edges are extracted.

Fig. 1. (a) The original right image. (b) Compressed image at threshold = 5 grayscales. Data retained = 24.1%. (c) Compressed image at threshold = 15 grayscales. Data retained = 2.86%.
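As a concrete reading of the compression step, the sketch below applies the horizontal gradient test of Eq. (3) and packs surviving pixels into 32-bit words. It is illustrative rather than the authors' code, and the exact bit ordering (j in the top 12 bits) is an assumption chosen to match the worked binary example above:

```python
def compress_rows(image, threshold=20):
    # image: 2D list of 8-bit grayscale values, indexed as image[j][i].
    # Keep pixels whose horizontal gradient (Eq. (3)) exceeds the threshold,
    # packing j (bits 31..20), i (bits 19..8) and intensity I (bits 7..0).
    sparse = []
    for j, row in enumerate(image):
        for i in range(len(row) - 2):
            if abs(row[i + 2] - row[i]) > threshold:
                sparse.append((j << 20) | (i << 8) | row[i])
    return sparse

def unpack(word):
    # Recover (i, j, I) by bit-shifting, as described in the text.
    return (word >> 8) & 0xFFF, (word >> 20) & 0xFFF, word & 0xFF

# The worked example: I = 60 at i = 1078, j = 395.
w = (395 << 20) | (1078 << 8) | 60
print(f"{w:032b}")  # 00011000101101000011011000111100
print(unpack(w))    # (1078, 395, 60)
```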
3. Cross correlation in compressed format

For each correlation window in the second image, the local gradient is first checked at each pixel in the window and qualified pixels are stored
in compressed format. If the total number of pixels retained is small, the region is considered empty and the window is discarded without correlating. If the number is large, e.g., 9 px in a 32 × 32 px window, then each pixel in the same correlation window in the first image is immediately compressed and cross correlated against the saved sparse array of the second image before the program moves on to the next window. This compressed correlation technique is most efficient when there is a minimum amount of overlap among interrogation windows. If there is significant overlap, the number of redundant memory calls and arithmetic calculations from repetitive gradient checking and correlation entries greatly slows processing.

Error correlation is implemented rather than the traditional statistical correlation function because it replaces multiplication with the much faster addition and subtraction. In addition to being faster, it does not place an unduly significant weight on high intensity pixels as the statistical correlation function does. It has been shown that error correlation significantly improves processing speed while maintaining the level of accuracy of the statistical correlation function (Roth et al., 1995). The 2D error correlation function can be expressed as

$$\Phi_{\Delta i, \Delta j} = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N} \left[ I_{m,n} + I_{m+\Delta i,\, n+\Delta j} - |I_{m,n} - I_{m+\Delta i,\, n+\Delta j}| \right]}{\sum_{m=1}^{M} \sum_{n=1}^{N} \left[ I_{m,n} + I_{m+\Delta i,\, n+\Delta j} \right]} \qquad (4)$$

The 1D error correlation function in the horizontal direction simplifies to

$$\Phi_{\Delta i} = \frac{\sum_{m=1}^{M} \left[ I_m + I_{m+\Delta i} - |I_m - I_{m+\Delta i}| \right]}{\sum_{m=1}^{M} \left[ I_m + I_{m+\Delta i} \right]} \qquad (5)$$

While the typical statistical correlation function computes one entry at a time, error correlation is calculated at the same time as the sparse array is being generated. The entire correlation table is constructed by summing entries as they are found in one interrogation window while traversing the sparse image array generated from the other corresponding interrogation window. The resulting disparity is obtained by searching
for the peak in the correlation coefficient plane. Simple bilinear interpolation is used to determine the correlation maximum to sub-pixel resolution. Compressed error correlation gives a very steep peak, which is ideal for bilinear interpolation.
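The 1D error correlation of Eq. (5) needs only addition, subtraction and absolute values per pixel pair. A minimal sketch over two intensity sequences (illustrative; the sparse-array bookkeeping of the real implementation is omitted):

```python
def error_correlation_1d(I1, I2, max_disp):
    # Eq. (5): Phi(di) = sum[(a + b) - |a - b|] / sum[a + b], per disparity di.
    table = []
    for di in range(max_disp + 1):
        num = den = 0
        for m in range(len(I1) - di):
            a, b = I1[m], I2[m + di]
            num += (a + b) - abs(a - b)  # equals 2*min(a, b)
            den += a + b
        table.append(num / den if den else 0.0)
    return table  # peak index = integer disparity; refine by bilinear interpolation
```

Since (a + b) − |a − b| = 2 min(a, b), the function rewards overlapping intensity without the squared weighting that makes Eq. (1) sensitive to flares.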
4. Adaptive window positioning

In the traditional fixed window position scheme, the entire image is evenly divided into uniformly spaced correlation windows, with or without overlap. Such a window sorting method is easy to implement; however, it produces more spurious vectors than adaptive window approaches. Gross errors occur when an edge sits across two fixed correlation windows (Fig. 2). Fig. 3 is a real example. The edge in the first image is shifted by 8 px to the right. Both the original images and their corresponding extracted edge maps of two neighboring fixed correlation windows are shown. The two windows cut the edge in the second image. As a result, the measured disparity in the left window tends to be smaller than the true disparity, because the edge is fully present in the first image and thus has a higher weight in the correlation table. Following the same logic, the right window gives a larger disparity estimate. The correlation result of the left two blocks is 7.23 px, and of the right two blocks 8.88 px.

In contrast, the adaptive window positioning technique enhances robustness by intelligently placing the edges about the center of each correlation window. Each window is dynamically selected at the time an edge is detected. A searching scheme is devised so that when an edge pixel is extracted in the second image, a correlation window is immediately placed around this pixel. All the pixels in this window are now accounted for, and no more searching for additional interrogation windows is performed in this block, in order to maximize speed and minimize window overlap. The edge in Fig. 3 is covered by only one correlation window using adaptive window positioning (Fig. 4) rather than two as with fixed windows (Fig. 3). Consequently, the correlation result of this single window gives the correct 8 px.

Fig. 2. Demonstration of a scenario when an edge sits across two fixed neighboring correlation windows.

Fig. 3. An edge is shifted by 8 px to the right inside an image pair. The top shows the original images and the bottom their corresponding extracted edge maps. The two fixed neighboring windows cut the edge in the second image. Correlation of the left two blocks gives a disparity of 7.23 px, while the right gives 8.88 px.
Fig. 4. Adaptive window positioning is applied to the image pair in Fig. 3. The top shows the original images and the bottom their corresponding extracted edge maps. The one dynamically positioned correlation window holds the complete edge in both images. Correlation of this one block gives the true disparity of 8 px.
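The placement rule lends itself to a compact sketch. The helper below is a hypothetical reading of the scheme, not the authors' code: scan the edge pixels in order, open a window centered on the first uncovered one, and mark its pixels as accounted for so no further window is opened in that block:

```python
def place_windows(edge_pixels, win=32):
    # edge_pixels: iterable of (i, j) coordinates that passed the gradient
    # test, in scan order. Each new window is centered on the first pixel
    # not yet covered; everything inside it is then considered accounted for.
    windows, covered = [], set()
    for i, j in edge_pixels:
        if (i, j) in covered:
            continue
        x0, y0 = i - win // 2, j - win // 2  # window top-left corner
        windows.append((x0, y0, win, win))
        covered.update((x, y) for x in range(x0, x0 + win)
                       for y in range(y0, y0 + win))
    return windows
```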
The following example demonstrates the effectiveness of adaptive windows over fixed ones. The first image in the image pair is captured by a camera with an image size of 500 × 500 px (Fig. 5). The second image is obtained by simulating a uniform lateral disparity of 8 px relative to the first image. Integer disparity simulation is simply obtained by pixel index shifting. Compression and correlation are only performed in the X direction using the proposed edge matching algorithm, as there is no vertical shift. The measured disparity results using fixed correlation windows are illustrated in Fig. 6. The interrogation window size is 32 × 32 px. The gradient threshold for compression is set at 20 grayscales. Notice that a number of edges are positioned across neighboring windows in the second image. The measured sparse disparity field has a mean of 7.67 px and a standard deviation of 1.56 px. The measured disparity results based on adaptively selected correlation windows are illustrated in Fig. 7. Each cross correlation window location is determined based on the edge map of the second image. Note the significantly improved accuracy with adaptive windows. The measured disparity field of valid vectors has a mean of 7.99 px and a standard deviation of 0.0505 px. The time required for both block finding and compressed correlation is 7.5 ms on a Xeon 2.8 GHz desktop using C++. For the purpose of statistical analysis, the vector validation threshold is set at ±0.5 px from the true disparity. Any measured disparity that falls
Fig. 5. Left: the original image of several building blocks captured by a camera. The image size is 500 × 500 px. Downloaded from the CMU Image Database at http://vasc.ri.cmu.edu//idb/html/stereo/arch/. Right: the second image is artificially shifted to the right by 8 px.
Fig. 6. Compressed coarse correlation results of an image pair with a simulated horizontal disparity of 8 px using fixed windows. The edge map of the second image is shown. Each block represents a non-empty correlation window. Interrogation window size is 32 × 32 px. Gradient threshold for compression is 20 grayscales. The measured disparity field has a mean of 7.67 px with a standard deviation of 1.56 px.
Fig. 8. Demonstration of the image boundary effect. Gross error occurs when some edges are entering or leaving the field of view between exposures. The top shows the original images and the bottom their corresponding extracted edge maps. The measured disparity is 5.91 px compared to the ground truth of 8 px.
outside this tolerance range is classified as an outlier; otherwise, it is a valid vector. For example, if the true disparity is 8 px, the range of valid measured vectors is 7.5–8.5 px; if the true disparity is 1 px, the valid range is 0.5–1.5 px. The only invalid vector in Fig. 7 is due to the image boundary effect: gross error occurs when edges are entering or leaving the field of view between exposures. This issue is probably unsolvable with only two images. Fig. 8 shows both the original images and the extracted edge maps of the invalid correlation window in Fig. 7. The measured disparity is 5.91 px, compared to the true disparity of 8 px.
5. Fine correlation and complete depth map
Fig. 7. Compressed coarse correlation results of an image pair with a simulated horizontal disparity of 8 px using adaptive windows. Correlation window location is determined based on the edge map of the second image. Interrogation window size is 32 × 32 px. Gradient threshold for compression is 20 grayscales. The measured disparity field of valid vectors has a mean of 7.99 px with a standard deviation of 0.0505 px. Number of valid vectors = 50. Number of invalid vectors = 1.
Coarse compressed correlation provides an averaged disparity estimate for each window. A large window size is necessary in order to accommodate edges of different shapes and orientations in both images. As explained in Section 4, cross-correlation generates spurious vectors if an edge is fully present in one correlation window but partly missing from the corresponding window in the other image. Usually, the coarse correlation window size is chosen to be roughly twice the size of the largest expected output disparity. In general,
a larger window produces fewer errors but slows processing. Once a disparity estimate from coarse correlation is known, fine correlation can be performed for each on-edge pixel in the primary correlation window using a much smaller window size. The fine correlation window in the second image is placed around the pixel of interest. The corresponding fine correlation window in the first image is shifted by the integer amount of the coarse correlation output, as illustrated in Fig. 9. Here, the fine correlation window size is chosen to be 7 × 7 px. As a general rule, fine correlation speed improves dramatically with reduced fine correlation window size, at the cost of reduced accuracy. Other aspects of fine compressed correlation are the same as for coarse correlation, except for a lower gradient threshold. The compression threshold is lowered to avoid the loss of any useful information in the reduced correlation window. This practice does not compromise robustness, since a valid disparity has already been identified using a higher threshold in coarse correlation. Fig. 10 illustrates the fine correlation results up to single-pixel resolution based on the coarse correlation output shown in Fig. 7. Fine correlation has improved accuracy over coarse correlation because of the window shifting. In this simulated disparity case, the disparity of every on-edge pixel is correctly recovered, with a standard deviation of 0 px.
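The coarse-to-fine window pairing can be sketched as follows, assuming a purely lateral disparity as in the examples here; the function name and arguments are illustrative:

```python
import numpy as np

def fine_window_pair(img1, img2, y, x, d_coarse, half=3):
    """Return the 7 x 7 fine correlation windows for the on-edge pixel
    (y, x) of the second image: the window in image 2 is centred on the
    pixel, and the corresponding window in image 1 is shifted left by the
    integer part of the coarse disparity (boundary checks omitted)."""
    d = int(round(d_coarse))
    win2 = img2[y - half:y + half + 1, x - half:x + half + 1]
    win1 = img1[y - half:y + half + 1, x - d - half:x - d + half + 1]
    return win1, win2
```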
Fig. 10. Sparse fine correlation results up to single-pixel resolution of an image pair with a simulated disparity of 8 px. Fine correlation window size is set at 7 × 7 px, and threshold = 15 grayscales. The sparse disparity field has a mean of 8 px with a standard deviation of 0 px.
Fig. 9. Example of a fine correlation window selection in a coarse correlation window with a calculated disparity of 8 px. The pixel of interest is at the center of the fine correlation window in the second image. The corresponding fine correlation window in the first image is shifted to the left by 8 px. Fine correlation window size is 7 × 7 px.
Fig. 11. Overall flow of the proposed algorithm.
In some applications, such as 3D feature-based facial recognition (Gao and Leung, 2002) and object tracking (Drummond and Cipolla, 2002), a sparse disparity map of object boundaries provides enough information. In other applications, such as 3D scene reconstruction, a complete disparity map is preferable. There are two general approaches to filling the regions of unknown disparity among the recovered edges. If there is sufficient surface texture, compressed cross-correlation at a lower threshold can be performed in such areas. However, in many real-world cases, such as the scene in Fig. 1, there is a lack of the fine texture that is critical to a reliable correlation. Instead, depth-driven object segmentation and interpolation is necessary to obtain a full-field disparity rendering. The overall flow of the proposed edge-matching algorithm with complete disparity map output is shown schematically in Fig. 11.
Fig. 12. Standard deviation of measured disparities for a sequence of images with a simulated horizontal disparity from 0.2 to 11 px relative to the original image. Coarse correlation window size = 32 × 32 px. Fine correlation window size = 7 × 7 px and threshold = 15 grayscales.
6. Experiments

6.1. Simulated disparity

The original image in Fig. 5 is artificially shifted laterally to the right from 0.2 to 11 px at intervals of 0.2 px. Sub-pixel disparity is approximated using the shift theorem in the frequency domain. Fine correlation is calculated at each on-edge pixel. Fig. 12 shows the standard deviation of the fine correlation results throughout the entire sequence of lateral disparities. The maximum standard deviation among integer disparities is ±0.0812 px. The maximum error among simulated sub-pixel disparities is ±0.147 px. The overall mean value of the standard deviations is ±0.098 px. Given the pixel intensity rounding error introduced in the sub-pixel shifting simulation, it is fair to conclude that the proposed algorithm has an accuracy upper limit of ±0.1 px, which is consistent with the best obtainable accuracy from interpolating a single correlation plane calculated with only two images.
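As a hedged illustration of the shift-theorem simulation described above (our own minimal code, not the authors'):

```python
import numpy as np

def subpixel_shift(img, dx):
    """Shift an image laterally by dx pixels (fractional values allowed)
    using the Fourier shift theorem: a spatial shift corresponds to a
    linear phase ramp in the frequency domain. Rounding the output back
    to integer intensities introduces the small error noted in the text."""
    fx = np.fft.fftfreq(img.shape[1])              # cycles per pixel
    phase = np.exp(-2j * np.pi * fx * dx)          # phase ramp along x
    spectrum = np.fft.fft(img.astype(float), axis=1)
    return np.real(np.fft.ifft(spectrum * phase, axis=1))
```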
6.2. Real motion image pair

In the following "MIT" scene, shown in Fig. 16, both images of the image pair are captured by a camera. The image size is 1152 × 864 px. The camera has a lateral displacement between the two exposures, resulting in right-hand disparities in the image plane. The three objects are placed on different depth planes. The closer an object is to the camera, the larger its disparity between frames, due to the different magnification ratios. Choosing a proper coarse correlation window size is critical to the proposed algorithm's performance in speed and accuracy. Overall, a larger window size produces fewer gross errors but slows processing, because the correlation load increases as the square of the window size. On the other hand, a smaller correlation window size results in higher spatial resolution and thus less of an averaging effect in areas where a number of objects at different depths are close to each other. In the following example, a rectangular window shape is able to take advantage of both large and small window sizes, since disparities are known to occur in only one direction. The horizontal correlation window size is chosen to be 64 px, while the vertical window size is 8 px. Fig. 13 shows the vector field of the coarse correlation results as well as their adaptively selected corresponding windows. Fig. 14 provides a side view of the coarse correlation disparity
Fig. 13. Coarse correlation results of a motion image pair with lateral disparity. CorrSizeX = 64, CorrSizeY = 8, threshold = 15.
Fig. 15. Fine correlation results calculated based on the coarse correlation output shown in Fig. 13. (a) top view; (b) front view.
Fig. 14. Side view of the coarse correlation disparity field shown in Fig. 13.
field. The processing time for both block finding and coarse correlation is 29.0 ms on a Xeon 2.8 GHz desktop for an image pair with a size of 1152 × 864 px. Fig. 15 illustrates both the top and front views of the fine compressed correlation disparity field comprised of all the detected on-edge pixels. After a fine disparity map of the object boundaries is acquired, a simple segmentation and interpolation algorithm is implemented to fill the voids
among the edges, since there is insufficient texture on smooth surfaces for a reliable correlation. Fig. 16 shows the full disparity field rendering of the "MIT" scene both with and without texture mapping. The results are encouraging considering that only a single image pair is used as input. If there is sufficient texture on smooth surfaces, compressed correlation over the entire image plane can be performed to obtain a complete disparity map. Two different levels of compression would then be applied. In boundary regions, strong compression is applied, which results in the precise recovery of edges. In regions of small gradient variations, which often correspond to smooth surfaces, mild compression is applied, in which case any information useful for correlation is retained, including the minor features in surface texture.
7. Conclusions

In this paper, a new 3D algorithm has been proposed that can recover precise object boundaries at high speed by utilizing compressed image correlation and adaptive windows. Although the algorithm is relatively simple, the experimental results obtained are encouraging. An important feature to note is that this algorithm does not include any global optimization. Compressed image correlation is a technique by which stereo or motion image pairs can be accurately processed at high speed. It is based on the compression of images, in which the number of data set entries is reduced to contain only strong features. Very high correlation speeds are obtained by encoding the reduced data set into sparse arrays and correlating the data entries using an error correlation function to eliminate time-consuming multiplication, division and floating-point arithmetic. The performance of compressed image correlation, however, is largely dependent on image complexity. For applications requiring extremely high speeds, such as real-time tracking and video-rate stereo vision, the proposed edge-based 3D algorithm appears to be a viable processing technique. Future work includes integrating adaptive window shape and size into the proposed algorithm, as well as extending it to multiple image pairs. An extensive analysis of error rate vs. spatial frequency would fully demonstrate its effectiveness and limitations.
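As an aside on the error correlation function mentioned above: one published form (Hart, 1998) replaces the products of standard cross-correlation with sums and absolute differences. The sketch below follows our reading of that form and should be checked against the reference; all names are ours:

```python
import numpy as np

def error_correlation(w1, w2):
    """Error correlation coefficient of two equal-size windows using only
    additions, subtractions and absolute values (no multiplications of
    pixel pairs). For non-negative intensities this equals
    2 * sum(min(a, b)) / sum(a + b), which is 1 for identical windows."""
    a = w1.astype(np.int64).ravel()
    b = w2.astype(np.int64).ravel()
    num = np.sum(np.abs(a + b) - np.abs(a - b))
    den = np.sum(a + b)
    return num / den if den else 0.0
```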
References
Fig. 16. Full disparity field rendering of the ‘‘MIT’’ scene. (a) left image; (b) right image; (c) and (d) top view and 3D rendering of the complete disparity field after segmenting and interpolating the sparse correlation output shown in Fig. 15; (e) and (f) two views of the complete disparity field with texture mapping.
Bartoli, A., Hartley, R., Kahl, F., 2003. Motion from 3D line correspondences: Linear and non-linear solutions. IEEE Internat. Conf. on Computer Vision and Pattern Recognition (CVPR'03), Madison, Wisconsin.
Brunelli, R., Poggio, T., 1993. Face recognition: Features versus templates. IEEE Trans. Pattern Anal. Machine Intell. 15 (10), 1042–1052.
Deriche, R., Faugeras, O.D., 1990. Tracking line segments. European Conference on Computer Vision (ECCV'90), Antibes, France, Springer.
Drummond, T., Cipolla, R., 2002. Real-time visual tracking of complex structures. IEEE Trans. Pattern Anal. Machine Intell. 24 (7), 932–946.
Elder, J.H., 1999. Are edges incomplete? Internat. J. Comput. Vision 34 (2–3), 97–122.
Gao, Y., Leung, M.K.H., 2002. Face recognition using line edge map. IEEE Trans. Pattern Anal. Machine Intell. 24 (6), 764–779.
Hart, D.P., 1998. High-speed PIV analysis using compressed image correlation. J. Fluids Eng.—Trans. ASME 120 (3), 463–470.
Hoff, W., Ahuja, N., 1989. Surfaces from stereo: Integrating feature matching, disparity estimation, and contour detection. IEEE Trans. Pattern Anal. Machine Intell. 11 (2), 121–136.
Horn, B.K.P., 1986. Robot Vision. MIT Press, Cambridge, MA.
Hoznek, A., Zaki, S., Samadi, D., Salomon, L., Lobontiu, A., Lang, P., Abbou, C.-C., 2002. Robotic assisted kidney transplantation: An initial experience. J. Urology 167 (4), 1604–1606.
Izquierdo, M.E., 1999. Disparity segmentation analysis: Matching with an adaptive window and depth-driven segmentation. IEEE Trans. Circuits Syst. Video Technol. 9 (4), 589–607.
Kanade, T., Okutomi, M., 1994. A stereo matching algorithm with an adaptive window: Theory and experiment. IEEE Trans. Pattern Anal. Machine Intell. 16 (9), 920–932.
Maciel, J., Costeira, J.P., 2003. A global solution to sparse correspondence problems. IEEE Trans. Pattern Anal. Machine Intell. 25 (2), 187–199.
Mahamud, S., Williams, L.R., Thornber, K.K., Xu, K., 2003. Segmentation of multiple salient closed contours from real images. IEEE Trans. Pattern Anal. Machine Intell. 25 (4), 433–444.
Mansouri, A.-R., 2002. Region tracking via level set PDEs without motion computation. IEEE Trans. Pattern Anal. Machine Intell. 24 (7), 947–961.
Oh, B.M., Chen, M., Dorsey, J., Durand, F., 2001. Image-based modeling and photo editing. SIGGRAPH 2001.
Okutomi, M., Katayama, Y., Oka, S., 2002. A simple stereo algorithm to recover precise object boundaries and smooth surfaces. Internat. J. Comput. Vision 47 (1–3), 261–273.
Olsen, S.I., 1990. Stereo correspondence by surface reconstruction. IEEE Trans. Pattern Anal. Machine Intell. 12 (3), 309–315.
Roth, G., Hart, D.P., Katz, J., 1995. Feasibility of using the L64720 video motion estimation processor (MEP) to increase efficiency of velocity map generation for particle image velocimetry (PIV). ASME/JSME Fluids Engineering and Laser Anemometry Conference, Hilton Head, South Carolina.
Scharstein, D., Szeliski, R., 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Internat. J. Comput. Vision 47 (1–3), 7–42.
Sharma, J., Dragoi, V., Tenenbaum, J.B., Miller, E.K., Sur, M., 2003. V1 neurons signal acquisition of an internal representation of stimulus location. Sci. Mag. 300, 1758–1763.
Pattern Recognition Letters 26 (2005) 1641–1649 www.elsevier.com/locate/patrec
Robust face detection using Gabor filter features Lin-Lin Huang *, Akinobu Shimizu, Hidefumi Kobatake Graduate School of BASE, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi, Tokyo 184-8588, Japan Received 4 March 2004; received in revised form 30 November 2004 Available online 14 April 2005 Communicated by E. Backer
Abstract In this paper, we present a classification-based face detection method using Gabor filter features. Taking advantage of the desirable spatial locality and orientation selectivity of Gabor filters, we design four filters corresponding to four orientations for extracting facial features from local images in sliding windows. The feature vector based on Gabor filters is used as the input of the face/non-face classifier, which is a polynomial neural network (PNN) on a reduced feature subspace learned by principal component analysis (PCA). The effectiveness of the proposed method is demonstrated by experiments on a large number of images. We show that using both the magnitude and the phase of the Gabor filter response as features gives better detection performance than using the magnitude only, while using the real part only also performs fairly well. Our detection performance is competitive with those reported in the literature. © 2005 Elsevier B.V. All rights reserved.
Keywords: Face detection; Classification; Gabor filter; Polynomial neural network
1. Introduction

Machine recognition of human faces is motivated by wide applications, ranging from static matching of controlled-format photographs such as passports, credit cards and mug shots to real-time matching of surveillance video images (Chellappa et al., 1995). An automated face recognition
* Corresponding author. Tel./fax: +81 42 388 7438. E-mail address: [email protected] (L.-L. Huang).
system often consists of three modules: face detection, facial feature detection and face recognition. Nevertheless, most existing methods assume that human faces have been located and focus on the recognition algorithm only. To build fully automated face recognition systems, it is essential to develop robust and efficient algorithms to detect faces regardless of pose, scale, orientation, expression, occlusion and lighting condition (Hjelmas and Low, 2001; Yang et al., 2002). The methods proposed for face detection so far generally fall into two major categories:
0167-8655/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.01.015
feature-based methods and classification-based ones. Feature-based methods detect faces by searching for facial features (eyes, nose and mouth) and grouping them into faces according to their geometrical relationships (Lin and Fan, 2001; Wong et al., 2001). Since the performance of feature-based methods depends primarily on the reliable location of facial features, they are susceptible to partial occlusion, excessive deformation and low image quality. In classification-based methods, face detection is performed by shifting a search window over the input image and classifying each local image in the window as face or non-face. As such, the feature extraction from local images and the design of the classifier are important to the detection performance. Many classification models have been proposed for face detection, such as Bayesian classifiers (Yang et al., 2001; Liu, 2003), neural networks (Rowley et al., 1998; Sung and Poggio, 1998; Feraud et al., 2001), support vector machines (SVM) (Osuna et al., 1997; Heisele et al., 2003), etc. As to feature extraction, most classification-based methods have used the intensity values of the windowed image as the input features of the classifier. Since a pixel value does not account for any shape or texture characteristics, more effective features that incorporate neighborhood relationships should be exploited to further improve the detection performance. Such features can be extracted by spatial filtering using Gabor filters, which are selective to both frequency and orientation. Gabor filters have been considered a very useful tool in computer vision and image analysis due to their optimal localization properties in both the spatial domain and the spatial frequency domain. The impulse response functions of Gabor filters have been shown to model well the receptive field profiles of visual cortical cells in mammals and to exhibit desirable properties of spatial locality and orientation selectivity (Daugman, 1988). Adini et al. (1997) indicated that face representation based on 2D Gabor filters is more robust against illumination variations in face recognition than intensity values. Gabor filter-based features have been applied to face recognition (Zhang et al., 1997; Liu and Wechsler, 2002)
and facial expression classification (Donato et al., 1999) with great success. However, little work has been done to apply them to face detection. This paper proposes a classification-based approach using Gabor filter features for detecting faces in cluttered images. We design four Gabor filters corresponding to four orientations for extracting facial features from the local image in a sliding window. The feature vector based on Gabor filters is used as the input of the classifier, which is a polynomial neural network (PNN) on a reduced feature subspace learned by principal component analysis (PCA) (Huang et al., 2003). The effectiveness of the proposed method is demonstrated by experimental results on a large number of test images. In order to investigate the effectiveness of the different parts of the Gabor filter response, three feature vectors are constructed, based respectively on the real part of the response, its magnitude, and the combination of magnitude and phase. The experimental results show that the combination of magnitude and phase performs best. The rest of this paper is organized as follows. Section 2 gives an overview of our face detection system; Section 3 describes the Gabor feature extraction method; Section 4 explains the PNN structure and the learning algorithm; the experimental results are presented in Section 5, and Section 6 provides concluding remarks.
2. System overview

The system diagram is shown in Fig. 1. To detect faces of variable sizes and locations, the detector needs to examine shifted regions (scanned by a window) of the test image at multiple scales. The classifier is used to classify the local image in the window into one of two classes: face or non-face. To disambiguate overlapping detected regions, each local image is assigned a face likelihood measure, which should be high for a face region and low for a non-face region. The overlapping detected regions (candidate regions) within one scale or across different scales compete with each other, such that only the candidate region with the highest face likelihood is retained as the detected face. In our work, the size of the sliding window is 20 × 20 pixels.
Fig. 1. Diagram of the face detection system.
The input image is re-scaled to multiple scales such that a face region becomes approximately 20 × 20 pixels in one of the re-scaled images. In order to alleviate variations in lighting conditions, a pre-processing procedure similar to Sung and Poggio (1998) (subtracting from each local image an optimally fitted linear plane) is employed. The intensity values of the local image can then serve as feature values for classification, as in our previous work (Huang et al., 2003). To achieve better performance, more discriminative features can be extracted by spatial filtering using Gabor filters. We extract facial features using four Gabor filters corresponding to four orientations and construct the feature vector used as the input of the classifier. The classifier in our system is a polynomial neural network (PNN), which uses the polynomial terms of the features as the inputs of a single-layer network with one output unit that assigns the face likelihood. To overcome the polynomial complexity of the PNN, which is formidable for a large number of features, the dimensionality of the feature vector is reduced by PCA.
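The plane-subtraction pre-processing can be illustrated with a short least-squares sketch (our own code, following the cited idea rather than reproducing Sung and Poggio's implementation):

```python
import numpy as np

def subtract_best_fit_plane(window):
    """Fit a linear plane a*x + b*y + c to a local window by least squares
    and subtract it, reducing the effect of illumination gradients."""
    h, w = window.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, window.ravel().astype(float), rcond=None)
    return window - (A @ coeffs).reshape(h, w)
```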
3. Gabor filter-based feature extraction

3.1. Gabor filter design

Gabor filters have been applied to various image recognition problems for feature extraction due to their optimal localization properties in both the spatial and the spatial frequency domain. The general functional form of a 2D Gabor filter specified in the space and spatial frequency domains is given by

$$g(x, y) = \frac{1}{2\pi\sigma_{xy}^{2}} \exp\!\left(-\frac{x'^{2} + y'^{2}}{2\sigma_{xy}^{2}}\right) \left[\exp(2\pi i r_{0} x') - \exp\!\left(-\frac{r_{0}^{2}}{2\sigma_{uv}^{2}}\right)\right], \qquad (1)$$

where

$$x' = x\cos\theta + y\sin\theta, \qquad y' = -x\sin\theta + y\cos\theta, \qquad (2)$$

and $\sigma_{xy}$ is the standard deviation of the Gaussian envelope, which characterizes the spatial extent and the bandwidth of the filter. The parameters $(u_0, v_0)$ define the spatial frequency of a sinusoidal plane wave, which can also be expressed in polar coordinates as radial frequency $r_0$ and orientation $\theta$:

$$r_{0}^{2} = u_{0}^{2} + v_{0}^{2}, \qquad \tan\theta = \frac{v_{0}}{u_{0}}. \qquad (3)$$

The frequency and orientation-selective properties of a Gabor filter are more explicit in its frequency domain representation, Eq. (4), which specifies the amount by which the filter modifies or modulates each frequency component of the input image.
( "
ðu u0 Þ2 þ ðv v0 Þ2 Gðu; vÞ ¼ exp 2r2uv 2 r exp 02 ; 2ruv 1 . ruv ¼ 2prxy
#)
Consequently, the radial frequency of Gabor filters is given by r0 ¼
3.2. Gabor feature extraction
pðk 1Þ ; k ¼ 1; 2; 3; 4. ð5Þ 4 The frequency responses of the four filters are depicted in Fig. 2. The frequency band of each filter is a Gaussian centered at (u0, v0). We determine the parameters of Gabor filters according to the sampling theorem (Castlman, 1996). When the sampling rate is 0.5 (we sample 10 · 10 values on filtered local images of 20 · 20 pixels), the maximum frequency rmax of Gabor filters is 0.25. In our work, the half-peak magnitude bandwidth is adopted, which is approximately 1.177ruv. Then we have:
hk ¼
where a ¼ 1.177;
rmax ¼ 0.25. ð6Þ
In addition, we can see from Fig. 2 that radial frequency r0 and ruv has the relation of: aruv ¼ r0 tanðp=8Þ.
ð8Þ
ð4Þ
The property of Gabor filter is defined by radial frequency r0, orientation and filter bandwidth. In this paper, Gabor filters are designed in the sense of the effectiveness of facial feature extraction. Specifically, in frontal or near frontal face image, despite the variation of face identity and expression, the facial contour is approximately an oval and the eyes and mouth are approximately in horizontal while the nose is in vertical. Therefore, Gabor filters with four orientations are considered:
r0 þ aruv ¼ rmax ;
1 4½1 þ 1.177 tanðp=8Þ
ð7Þ
Fig. 2. The four filters in the spatial frequency domain.
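A minimal sketch of the resulting filter bank, built directly from Eqs. (4)–(8) in the frequency domain (parameter and function names are ours):

```python
import numpy as np

A = 1.177                                            # bandwidth factor, Eq. (6)
R0 = 1.0 / (4.0 * (1.0 + A * np.tan(np.pi / 8)))     # radial frequency, Eq. (8)
SIGMA_UV = R0 * np.tan(np.pi / 8)                    # Gaussian std, Eq. (7)

def gabor_filter_bank(size=20):
    """Frequency responses G(u, v) of the four filters of Eq. (4), one per
    orientation theta_k = pi*(k-1)/4, on a size x size grid of normalised
    frequencies; the constant term removes the DC response."""
    u = np.fft.fftshift(np.fft.fftfreq(size))
    U, V = np.meshgrid(u, u)
    dc = np.exp(-R0 ** 2 / (2 * SIGMA_UV ** 2))
    bank = []
    for k in range(4):
        theta = np.pi * k / 4.0
        u0, v0 = R0 * np.cos(theta), R0 * np.sin(theta)
        G = np.exp(-((U - u0) ** 2 + (V - v0) ** 2) / (2 * SIGMA_UV ** 2)) - dc
        bank.append(G)
    return bank
```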
3.2. Gabor feature extraction

Gabor features of a local image in the sliding window are extracted from its Gabor representations, which are obtained by convolving the image with the Gabor filters. Let $I(x, y)$ be an image; the convolution of $I(x, y)$ with a Gabor filter is denoted as

$$O(x, y; r_{0}, \theta_{k}) = I(x, y) * g(x, y; r_{0}, \theta_{k}), \qquad k = 1, 2, 3, 4, \qquad (9)$$
where $O(x, y; r_0, \theta_k)$ is called the Gabor representation of the image $I(x, y)$. Fig. 3 shows the Gabor representations of a face sample image (convolved with the real part of the Gabor filter response). The images are visualized by coding the output values of the Gabor filter in gray levels. We can see that the orientation properties of the face pattern are well represented by the filtered images. In order to investigate the properties of the different parts of the Gabor representation, three kinds of feature vectors, namely cosine, magnitude and mag + phase, are formed. Specifically, the feature vector cosine is constructed using the real part of the Gabor representation, $\mathrm{Real}\{O(x, y; r_0, \theta_k)\}$; magnitude is computed as the square root of the sum of the squared real and imaginary parts; and mag + phase is formed by appending the phase of $O(x, y; r_0, \theta_k)$ to the feature vector magnitude. To encompass the orientation selectivities, we concatenate all four orientation Gabor representations to construct the feature vector. Before the concatenation, we down-sample the Gabor representation $O(x, y; r_0, \theta_k)$ by a factor q = 2. In our work, the size of the search window is 20 × 20 pixels, so each Gabor representation used for constructing the feature vector has 10 × 10 pixel values after down-sampling. Therefore, the dimensionality of both cosine and magnitude is 400, while the feature vector mag + phase is 800-dimensional.
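A sketch of this feature construction (assuming complex spatial-domain kernels obtained, e.g., by inverse-transforming the frequency responses above; all names are illustrative):

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_feature_vectors(window, kernels, q=2):
    """Build the three feature vectors of this section from a 20 x 20
    window: cosine (real part), magnitude, and mag + phase, Eq. (9).
    Each of the four responses is down-sampled by q = 2, giving 10 x 10
    samples per orientation before concatenation."""
    cos_p, mag_p, pha_p = [], [], []
    for g in kernels:
        resp = fftconvolve(window.astype(float), g, mode='same')
        resp = resp[::q, ::q]
        cos_p.append(np.real(resp).ravel())
        mag_p.append(np.abs(resp).ravel())
        pha_p.append(np.angle(resp).ravel())
    cosine = np.concatenate(cos_p)                    # 400-D
    magnitude = np.concatenate(mag_p)                 # 400-D
    mag_phase = np.concatenate(mag_p + pha_p)         # 800-D
    return cosine, magnitude, mag_phase
```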
Fig. 3. A face sample image and its four Gabor filtered images.
4. Classification method

For classification, we use a polynomial neural network (PNN) to assign a face likelihood to the sliding window, so as to classify the window image as face or non-face. The feature vectors described in Section 3 are employed as the input of the PNN. The PNN is a single-layer network which uses as inputs not only the feature measurements of the input pattern but also polynomial terms of the measurements (Schürmann, 1996). Compared with the multi-layer perceptron (MLP), the PNN is faster in learning and less susceptible to local minima due to its single-layer structure. In our previous experiments, the PNN was shown to outperform the MLP in face detection (Huang et al., 2003). Denoting the input pattern as a feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)^{\mathrm{T}}$, the output of the PNN is computed by

$$y(\mathbf{x}) = s\left(\sum_{i=1}^{d} w_{i}x_{i} + \sum_{i=1}^{d}\sum_{j=i}^{d} w_{ij}x_{i}x_{j} + w_{0}\right), \qquad (10)$$

where $s(\cdot)$ is a sigmoid activation function,

$$s(a) = \frac{1}{1 + \exp(-a)}.$$

In our problem, the dimensionality of the input vector is high. To reduce the complexity of the PNN for high-dimensional data, the dimensionality of the raw feature vector is reduced by PCA. The raw vector is projected onto a linear subspace:

$$z_{j} = (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\phi}_{j}, \qquad j = 1, 2, \ldots, m, \qquad (11)$$

where $z_j$ denotes the projection of $\mathbf{x}$ onto the jth axis of the subspace, $\boldsymbol{\phi}_j$ denotes the eigenvector of that axis, and $\boldsymbol{\mu}$ denotes the mean vector of the pattern space. The eigenvectors corresponding to the m largest eigenvalues are selected such that the error of pattern reconstruction from the subspace is minimized. Meanwhile, the reconstruction error of the feature space (the distance from the feature subspace, DFFS) is an important indicator of the deviation of the input pattern from the subspace. When the subspace is learned from face samples, the DFFS indicates the dissimilarity of a pattern from a face. Hence, we integrate the DFFS into the PNN:

$$y(\mathbf{x}) = s\left(\sum_{i=1}^{m} w_{i}z_{i} + \sum_{i=1}^{m}\sum_{j=i}^{m} w_{ij}z_{i}z_{j} + w_{D}D_{f} + w_{0}\right), \qquad (12)$$

$$D_{f} = \|\mathbf{x} - \boldsymbol{\mu}\|^{2} - \sum_{j=1}^{m} z_{j}^{2}. \qquad (13)$$

The connection weights are initialized randomly and updated by stochastic gradient descent to minimize the empirical mean square error (MSE) loss (Robbins and Monro, 1951):

$$E = \sum_{n=1}^{N_{x}} \left[y(\mathbf{x}_{n}) - t_{n}\right]^{2} + \lambda\|\mathbf{w}\|^{2} = \sum_{n=1}^{N_{x}} E_{n}, \qquad (14)$$

where $t_n$ denotes the target output for the input pattern $\mathbf{x}_n$, with value 1 for face patterns and 0 for non-face patterns; $\lambda$ is a weight decay coefficient, which helps to improve the generalization performance. The PNN is trained on face and non-face samples. Since the PNN is a single-layer network, the training process is quite fast and not influenced by the random initialization of the weights.
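A compact sketch of the forward pass of Eqs. (11)–(13); the weight layout is our own choice:

```python
import numpy as np

def pnn_output(x, mu, Phi, w_lin, w_quad, w_D, w0):
    """PNN forward pass: project x onto the m PCA axes (columns of Phi),
    compute the DFFS term D_f of Eq. (13), then apply the sigmoid to the
    linear terms, the quadratic terms z_i*z_j (j >= i) and the DFFS."""
    z = Phi.T @ (x - mu)                               # Eq. (11)
    d_f = np.sum((x - mu) ** 2) - np.sum(z ** 2)       # Eq. (13)
    quad = np.outer(z, z)[np.triu_indices(z.size)]     # z_i z_j, j >= i
    a = w_lin @ z + w_quad @ quad + w_D * d_f + w0     # Eq. (12)
    return 1.0 / (1.0 + np.exp(-a))
```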
5. Experiments

5.1. Training sample collection

The algorithm for training sample collection is the same as that in our previous work (Huang
et al., 2003). We use 2987 images to extract face samples, which contain 2990 real faces. Each hand-cropped face box is normalized to 20 × 20 pixels and gives one face sample. The local image within the box is stretched and compressed to give four more samples. In addition, the mirror images of the above five face samples are also included. This helps to ensure that the final face samples cover the variations of face patterns in the real world as much as possible. In total, 29,900 face samples are collected. Besides being used for training, they were first used to compute the feature subspace by PCA. The non-face samples were collected with a preliminary classifier to identify the background patterns that resemble faces. The non-face samples were collected in three phases. In the first phase, the local images of search windows in background areas are compared with the mean vector of the face samples, and a window with a Euclidean distance under a threshold is considered a confusing non-face sample; 38,834 non-face samples were gathered in this way. These samples and the 29,900 face samples are used to train the first-phase PNN. In the second and third phases, the local window images are classified by the trained PNN, and a local image which does not contain a face but has an output value higher than a threshold is considered a confusing non-face sample. About 60,000 non-face samples are collected in the second and third phases. Finally, the PNN is re-trained with the face samples and all the non-face samples collected in the three phases.

5.2. Experimental results

We test the proposed method on two sets of images, which are totally different from those used in the training procedure. Test Set1 consists of 109 CMU images;¹ most of the images contain more than one face with cluttered backgrounds, such that Test Set1 has a total of 487 faces. Test Set2 consists of 270 images with simple backgrounds, each of which contains only one face, so there are 270 faces in Test Set2. The images of Test Set2 are downloaded from several websites. In our experiments, a face is successfully detected if the local image of the search window contains both eyes and the mouth; otherwise, a detected window is counted as a false positive. The detection rate is the ratio between the number of successful detections and the total number of faces in the test set. The false positive rate is the ratio between the number of false positives and the total number of search windows.

¹ Test set1 of Rowley et al. (1998) contains 130 images. We did not use the 7 images that contain extremely large or small faces and the 14 scene images that were used in the training procedure.
Fig. 4. ROC curves on Test Set1.
Table 1
Detection results on Test Set1 (109 images)

Feature        Dimensionality   Detection rate (%)   False positives   False positive rate
cosine         400              86.4                 88                1.1 × 10⁻⁶
magnitude      400              86.4                 79                1.0 × 10⁻⁶
mag + phase    800              86.4                 57                7.6 × 10⁻⁷
Table 2
Detection results on Test Set2 (270 images)

Feature        Dimensionality   Detection rate (%)   False positives   False positive rate
cosine         400              100                  11                4.6 × 10⁻⁷
magnitude      400              100                  7                 2.9 × 10⁻⁷
mag + phase    800              100                  3                 1.2 × 10⁻⁷
In detection, the images are resized at 10 scales. For the images in Set1, the scales start from 0.2 and increase by a factor of 1.21. The images in Set2 have larger faces, so the scales start from 0.1. In our previous work (Huang et al., 2003), using image intensity as the raw feature (a 368-D vector composed of the intensity values of the local image of
the search window after the pre-processing procedure, referred to as Inten below), the PNN yielded promising detection performance. To further improve the detection performance, we exploit Gabor filter features in this paper. Using the four kinds of feature vectors (Inten, cosine, magnitude and mag + phase) as the input
Fig. 5. Detection examples on the images of Test Set1.
of the PNN, we tested the 109 images of Test Set1 and drew the ROC (receiver operating characteristic) curves of the tradeoff between the detection rate and the false positive rate under variable decision thresholds. In order to make a fair comparison, the dimensionalities of the subspaces employed by the four feature vectors are set to be the same, m = 100. The ROC curves are shown in Fig. 4. From the ROC curves, we can see that the Gabor filter features result in significant improvements compared to intensity features, which confirms that Gabor filter-based features are more effective than intensity-based features. This can be explained by the fact that the orientation properties of the face pattern captured by Gabor filters are visually prominent and highly stable; therefore, the Gabor filter-based features are more informative and more robust to illumination and facial expression changes. Among the three Gabor filter feature vectors, mag + phase performs the best, while magnitude is superior to cosine. The reason could be that the real and imaginary parts of the Gabor representation provide complementary information for classification. The detection rates and false positive rates on Test Set1 and Test Set2 are listed in Tables 1 and 2, respectively. We can see that on the simple images of Test Set2, the detection rate is 100% and the false positive rate is very low. On Test Set1, the detection rate is lower. The reason is that the
images in Test Set1 mostly have lower resolution and more complex backgrounds than those in Test Set2. Some examples of face detection using cosine are shown in Figs. 5 and 6. From these examples, we can see that the feature vector based on Gabor filters is quite robust against low image quality and face shape variation. The missed faces are inherently blurry or rotated excessively, while the false positives mostly resemble the face pattern when viewed in isolation.

5.3. Comparison of performance

For comparison with other systems, we would like to mention the method proposed by Sung and Poggio (1998), which models the distributions of face and non-face patterns and uses a multi-layer perceptron (MLP) to make the final decision. The detection result of Sung and Poggio (1998) has been considered one of the best results in detecting faces from images with complex backgrounds (Hjelmas and Low, 2001).

Table 3
Detection results on the images of Sung and Poggio (1998) (23 images)

Method                    True positives   False positives
cosine                    126              10
magnitude                 128              8
mag + phase               130              7
Sung and Poggio (1998)    126              13
Fig. 6. Detection examples on the images of Test Set2.
The detection results of Sung and Poggio (1998) on a set of 23 cluttered images, along with the results of our method on the same images, are listed in Table 3. Although it is difficult to give a fair comparison between different methods due to differences in training samples, training procedures, execution times, etc., based on the reported results the detection performance of our method is comparable with that of Sung and Poggio (1998), yet with fewer false positives. In addition, for processing one search window (19 × 19 pixels), the classifier of Sung and Poggio (1998) needs more than 254,700 multiplications, while our method needs about 125,000 for one window of 20 × 20 pixels when using mag + phase. Therefore, our method consumes far fewer computational resources than that of Sung and Poggio (1998).

6. Conclusion

This paper proposes a classification-based face detection approach using Gabor filter features. Four Gabor filters are designed for facial feature extraction, and the feature vector based on Gabor filters is used as the input of the classifier. The underlying classifier is a polynomial neural network (PNN) on a reduced feature subspace learned by PCA. The detection performances of the different parts of the Gabor representations have been investigated. The effectiveness of the proposed method is justified by experiments on a large number of test images.

References

Adini, Y., Moses, Y., Ullman, S., 1997. Face recognition: The problem of compensating for changes in illumination direction. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 721–732.
Castleman, K.R., 1996. Digital Image Processing. Prentice Hall.
Chellappa, R., Wilson, C.L., Sirohey, S., 1995. Human and machine recognition of faces: A survey. Proc. IEEE 83, 705–740.
Daugman, J.G., 1988. Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression.
IEEE Trans. Acoust. Speech Signal Process. 36 (7), 1169–1179.
Donato, G., Bartlett, M.S., Hager, J.C., 1999. Classifying facial actions. IEEE Trans. Pattern Anal. Machine Intell. 21, 974–989.
Feraud, R., Bernier, O.J., Viallet, J.E., Collobert, M., 2001. A fast and accurate face detector based on neural networks. IEEE Trans. Pattern Anal. Machine Intell. 23 (1), 42–53.
Heisele, B., Serre, T., Prentice, S., Poggio, T., 2003. Hierarchical classification and feature reduction for fast face detection with support vector machines. Pattern Recognition 36, 2007–2017.
Hjelmas, E., Low, B.K., 2001. Face detection: A survey. Comput. Vision and Image Understanding 83, 236–274.
Huang, L.L., Shimizu, A., Hagihara, Y., Kobatake, H., 2003. Face detection from cluttered images using a polynomial neural network. Neurocomputing 51, 197–211.
Lin, C., Fan, K., 2001. Triangle-based approach to the detection of human face. Pattern Recognition 34, 1271–1284.
Liu, C., 2003. A Bayesian discriminating features method for face detection. IEEE Trans. Pattern Anal. Machine Intell. 25, 725–740.
Liu, C., Wechsler, H., 2002. Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Trans. Image Process. 11 (4), 467–476.
Osuna, E., Freund, R., Girosi, F., 1997. Training support vector machines: An application to face detection. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 130–136.
Robbins, H., Monro, S., 1951. A stochastic approximation method. Ann. Math. Stat. 22, 400–407.
Rowley, H.A., Baluja, S., Kanade, T., 1998. Neural network-based face detection. IEEE Trans. Pattern Anal. Machine Intell. 20 (1), 23–38.
Schürmann, J., 1996. Pattern Classification: A Unified View of Statistical Pattern Recognition and Neural Networks. Wiley Interscience.
Sung, K.K., Poggio, T., 1998. Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Machine Intell. 20 (1), 39–50.
Wong, K., Lam, K., Siu, W., 2001. An efficient algorithm for human face detection and facial feature extraction under different conditions. Pattern Recognition 34, 1993–2004.
Yang, M.H., Kriegman, D.J., Ahuja, N., 2001. Face detection using multimodal density models. Comput. Vision and Image Understanding 84, 264–284.
Yang, M.H., Kriegman, D.J., Ahuja, N., 2002. Detecting faces in images: A survey. IEEE Trans. Pattern Anal. Machine Intell. 24 (1), 34–58.
Zhang, J., Yan, Y., Lades, M., 1997. Face recognition: Eigenface, elastic matching, and neural nets. Proc. IEEE 85, 1423–1435.
Pattern Recognition Letters 26 (2005) 1650–1657 www.elsevier.com/locate/patrec
Color text image binarization based on binary texture analysis Bin Wang *, Xiang-Feng Li, Feng Liu, Fu-Qiao Hu Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, PR China Received 10 October 2003; received in revised form 13 October 2004 Available online 14 April 2005 Communicated by E. Backer
Abstract In this paper, a novel binarization algorithm for color text images is presented. The algorithm effectively integrates color clustering and binary texture analysis, and is capable of handling situations with complex backgrounds. In this algorithm, dimensionality reduction and graph theoretical clustering are first employed; as a result, binary images related to the clusters are obtained. Binary texture analysis is then performed on each candidate binary image. Two kinds of effective texture features, related respectively to the run-length histogram and the spatial-size distribution, are extracted and explored. Cooperating with an LDA classifier, the candidate with the best binarization effect is selected. Experiments with images collected from the Internet have been carried out, together with comparisons against existing techniques; both show the effectiveness of the algorithm. © 2005 Elsevier B.V. All rights reserved.
Keywords: Color text image; Binarization; Binary texture features
1. Introduction

Text is an important feature for information retrieval applications. In these applications, as a first step, the text in the image should be recognized for its semantic meaning. But current optical character recognition (OCR) technologies are
* Corresponding author. Tel.: +86 216 236 2527; fax: +86 216 236 2529. E-mail address: [email protected] (B. Wang).
mostly restricted to recognizing text against clean backgrounds. Thus binarization techniques, which aim to separate text from the image background and obtain a clean representation, are usually adopted as an indispensable preprocessing step. Most existing binarization techniques are thresholding related. Basically, these techniques fall into two categories: global thresholding and local (adaptive) thresholding. Global thresholding methods attempt to binarize the image with a single threshold. Among the most powerful global techniques, Otsu's algorithm can achieve
0167-8655/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2004.12.006
high performance in terms of the uniformity of thresholded regions and the correctness of segmentation boundaries (Sahoo et al., 1988). Liu and Srihari (1997) proposed a document binarization method in which Otsu's algorithm was used to obtain candidate thresholds; texture features were then measured from each thresholded image, based on which the best threshold was picked. In contrast to global methods, adaptive or local methods change the threshold dynamically over the image field according to local information. In Wellner (1993), an adaptive algorithm was developed for the DigitalDesk; the method calculated a threshold at each point as an estimate of the background illumination, based on a moving average of local pixel intensities (a sketch follows the list below). For images with low contrast, variable background intensity and noise, local algorithms work well. However, techniques of both categories perform poorly against complex backgrounds. In recent years, text information extraction from images and video has attracted particular research attention, and a variety of related methods have been proposed. A survey of existing methods for text detection in images and video can be found in (Chen and Luettin, 2000), and another, focusing more generally on text information extraction from images and videos, was given in (Jung et al., 2004). Text detection/location results are sub-images containing only the text regions of the original image; thus they are still gray-scale or color images. Generally, as mentioned above, these sub-images, especially color ones, cannot be directly fed into OCR tools for recognition unless appropriate binarization has been carried out. Such images will be called text block images, or simply text images, in this paper, to distinguish them from conventional document images. Compared with conventional document images, text images possess the following properties:

• they contain only a few words, but these words often convey important information related to the content of the original image;
• they have complex backgrounds, especially color text images;
• they exhibit varying text sizes, fonts and colors, even within the same word.
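Returning to the adaptive scheme of Wellner mentioned above: a minimal moving-average sketch might look as follows; the parameter values are ours, not Wellner's:

```python
import numpy as np

def wellner_binarize(gray, s_frac=0.125, t=15):
    """Adaptive thresholding in the spirit of Wellner (1993): compare each
    pixel with a running average of roughly the last s pixels on its scan
    line and mark it as foreground if it is t percent darker."""
    h, w = gray.shape
    s = max(1, int(w * s_frac))
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        avg = float(gray[y, 0])
        for x in range(w):
            avg += (float(gray[y, x]) - avg) / s       # moving average
            out[y, x] = 1 if gray[y, x] < avg * (100 - t) / 100.0 else 0
    return out
```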
These properties mean that no satisfactory results are guaranteed when applying existing methods, even those that work well on conventional document images, to such text images, especially color ones. In this paper, a novel binarization algorithm is proposed. It is designed to overcome the limitations of existing techniques for color text images. The proposed algorithm efficiently integrates color clustering and binary texture feature analysis. Two kinds of features capable of effectively characterizing text-like binary textures, run-length histogram and spatial-size distribution related features, are extracted and explored. In addition, in order to handle varying text colors in the image, a combination strategy is applied among the binary images obtained by color clustering. The effective cooperation of these techniques enables the algorithm to survive complex background conditions where existing techniques tend to fail.
2. Binary texture analysis

Texture is the term used to characterize the contextual properties of areas in images, and it is one of the main features intensively utilized in image processing and pattern recognition. Most known texture models are based on gray-level images, and image texture is generally defined as a function of the spatial variation in pixel intensities, or more specifically, the spatial variation of pixel intensities forming certain repeated patterns. Texture features are numerical values calculated to describe properties of the image texture under a certain texture model. In (Mihran and Jain, 1993), texture features were categorized as statistical, geometrical, structural, model-based and signal processing features, in terms of the texture model employed. For general gray-level images, texture features based on statistical information about the variation of pixel intensities are the most explored. As a special texture type, binary or boolean texture is the texture appearing in binary images. Compared with general gray-level images, a binary image consists of only two intensity levels:
"1" for foreground pixels and "0" for background pixels. Furthermore, if only the foreground (or background) regions of the binary image are of interest, as is the focus of this paper, there is no variation of pixel intensities at all. In such a situation, it is apparent that general statistical models focusing mainly on pixel intensity variation are no longer suitable for characterizing binary textures. For binary image textures, geometrical or shape properties and spatial distribution patterns become the most important. Thus in this paper, rather than the numerous existing texture description methods based on second-order statistics as given in (Mihran and Jain, 1993), two kinds of new features are taken into account in the proposed binarization method: one is based on the run-length description, the other is spatial-size distribution (SSD) related. As will be shown, both of these features can effectively characterize text-like textures in binary images.

2.1. Run-length based features

In the boolean model for texture modeling, the one-dimensional version is the simplest form. As texture is a two-dimensional property, directional scannings of the image are used to bypass the problem of dimensionality (Garcia-Sevilla and Petrou, 1999). One of the direct outputs of such
scannings is the run, and run-based binary texture characterization. A run is a maximal contiguous set of constant-gray-level pixels located in a scan line, which can be described by gray level a, length r, and direction θ, denoted B(a, r, θ). Let B(a, r) be the number of runs in all directions of length r and gray level a. Texture description features can be obtained from the probabilities of the lengths and gray levels of runs in the image (Sonka et al., 1999). Compared with conventional statistical methods, texture features based on run-length analysis have not attracted the same attention in image analysis applications. In this paper, the study objects are binary images and only foreground horizontal runs are of interest; thus a ≡ 1 and r ∈ [1, …, L], where L is the maximum acceptable run-length. It is then convenient to shorten B(a, r) to B(r). Denote by B(R), R = [1, …, L], the one-dimensional array containing the frequencies of each run-length, called the run-length histogram. Examples of run-length histograms for different binary images are shown in Fig. 1. The features of interest are the maximum probability and the stroke width. The maximum probability is defined as the highest frequency in the run-length histogram, excluding the unit run-length, which is mostly related to noise. That is,

$$\mathrm{MAXPROB} = \max_{r \in R,\, r \neq 1} B(r). \qquad (1)$$
Fig. 1. Different binary images and their corresponding run-length histogram. (a) Normal binary image and (b) binary text image.
The run-length of the highest frequency is

$$\mathrm{SW} = \arg\max_{r \in R,\, r \neq 1} B(r). \qquad (2)$$
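A direct sketch of how B(R), MAXPROB and SW might be computed on a binary image (the cap L and the function name are our own):

```python
import numpy as np

def run_length_features(binary, L=50):
    """Horizontal foreground run-length histogram B(r), r in [1, L], with
    the MAXPROB and SW features of Eqs. (1) and (2); unit-length runs are
    excluded from the maximisation, as they mostly reflect noise."""
    B = np.zeros(L + 1, dtype=int)
    for row in binary:
        run = 0
        for v in list(row) + [0]:          # trailing 0 closes the last run
            if v:
                run += 1
            elif run:
                B[min(run, L)] += 1
                run = 0
    maxprob = int(B[2:].max())
    sw = 2 + int(B[2:].argmax())
    return B, maxprob, sw
```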
The SW feature reflects the average stroke width of the dominant text in the image; thus in (Liu and Srihari, 1997) it was directly called the stroke width feature.

2.2. Spatial-size distribution related features

Binary images are the main concern of this paper. Thus, there is no variation of intensity as in general gray-level texture analysis. What matters most here are the objects or components in the binary image and their spatial arrangement patterns. Like the grayscale histogram of an image, the run-length histogram is a 1D global representation of the image. Although it contains the essential information for stroke-like features, it cannot convey the true 2D distribution patterns in the image. Thus, mathematical morphology based binary texture modeling is also studied in this paper. Mathematical morphology has been widely recognized as a powerful tool for binary image analysis. Under the framework of mathematical morphology, granulometries are defined as ordered sets of morphological openings or closings, each of which removes image details below a certain size (Vincent, 2000). The granulometric size distribution, or pattern spectrum, which shows how the number of foreground pixels in the image changes as a function of the size parameters (Maragos, 1989), is one of the main mathematical-morphology-related approaches to texture description. One of the main drawbacks of the pattern spectrum for binary texture modeling, however, is that it conveys only the object size distribution in the image, and very different textures may share the same pattern spectrum. This has been overcome by a new spatial-size distribution (SSD) descriptor (Ayala and Domingo, 2001), which incorporates spatial distribution information into its conventional counterpart. The (p, q)-SSD of an image G with respect to a convex and compact set U containing the origin is defined as a multiple integral of the areas v(·) of granulometries in terms of spatial position variables h₁, …, h_q (Ayala and Domingo, 2001):

$$F_{G,U}(\lambda_1,\ldots,\lambda_p;\, l_1,\ldots,l_q) = \frac{1}{v(G)^{q+1}} \int_{l_q U}\!\!\cdots\!\int_{l_1 U} \big[ v(G \cap (G+h_1) \cap \cdots \cap (G+h_q)) - v(\widetilde{\Psi}(G) \cap (\widetilde{\Psi}(G)+h_1) \cap \cdots \cap (\widetilde{\Psi}(G)+h_q)) \big] \, \mathrm{d}h_1 \cdots \mathrm{d}h_q, \qquad (3)$$

where λ stands for the scale of the elements, l stands for the scale of the support set, and

$$\widetilde{\Psi}(G) = \Psi^{(p)}_{\lambda_p}\big(\cdots \Psi^{(1)}_{\lambda_1}(G)\cdots\big) \qquad (4)$$

is the composition of the different granulometries. The elemental operation $\Psi_\lambda(G)$ is defined as

$$\Psi_\lambda(G) = G \circ \lambda E(0,1) = G \circ E(0,\lambda), \qquad (5)$$

that is, a morphological opening of the image G with a structuring element E at scale λ. The joint density function associated with $F_{G,U}$ is

$$f_{G,U}(\lambda_1,\ldots,\lambda_p;\, l_1,\ldots,l_q) = \frac{\partial^{p+q} F_{G,U}(\lambda_1,\ldots,\lambda_p;\, l_1,\ldots,l_q)}{\partial\lambda_1 \cdots \partial\lambda_p\, \partial l_1 \cdots \partial l_q}. \qquad (6)$$

For discrete sets or discrete images, this joint density function can be written as

$$f_{G,U}(\lambda_1,\ldots,\lambda_p;\, l_1,\ldots,l_q) = \sum_{(u,v) \in N} (-1)^{\mathrm{sgn}(u,v)} F_{G,U}(u,v), \qquad (7)$$

where $N = \{(u, v) : u_i = \lambda_i \text{ or } u_i = \lambda_i - 1;\ v_j = l_j \text{ or } v_j = l_j - 1\}$ and $\mathrm{sgn}(u,v) = \#\{u_i : u_i = \lambda_i - 1\} + \#\{v_j : v_j = l_j - 1\}$, where #{·} is the counting operation, so that $\#\{u_i : u_i = \lambda_i - 1\}$ is the number of $u_i$ satisfying $u_i = \lambda_i - 1$. In this paper, SSD features of order (1, 1) and (2, 1) are of interest. The value of the element size λ is set the same as that of the stroke-width feature
SW of the image; the support-set size l varies from 0 to 9 in increments of 3. Thus, a total of eight SSD features can be obtained for each binary image (a simplified sketch of one such computation follows).
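As a hedged illustration, here is the (1,1)-order case with the opening scale tied to SW; this is our simplified discrete reading of Eq. (3), with np.roll wrap-around accepted as a border approximation, and it may differ in detail from the authors' implementation:

```python
import numpy as np
from scipy.ndimage import binary_opening

def ssd_11(G, lam, l):
    """Approximate (1,1)-SSD value F_{G,U}(lambda; l) on a binary image G:
    the integral over l*U (U the unit square) is replaced by a sum over
    integer offsets, and the granulometry is a single opening at scale
    lambda (set to the stroke width SW in the text)."""
    opened = binary_opening(G, structure=np.ones((lam, lam), dtype=bool))
    vG = float(G.sum())
    if vG == 0:
        return 0.0
    total = 0.0
    for dy in range(-l, l + 1):
        for dx in range(-l, l + 1):
            Gs = np.roll(np.roll(G, dy, axis=0), dx, axis=1)
            Os = np.roll(np.roll(opened, dy, axis=0), dx, axis=1)
            total += np.logical_and(G, Gs).sum() - np.logical_and(opened, Os).sum()
    return total / vG ** 2
```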
3. Color text image binarization

The proposed binarization method for color text images consists of four main steps: color space dimensionality reduction, color clustering, texture feature extraction, and selection of the optimal binary image. The flowchart is shown in Fig. 2.

3.1. Dimensionality reduction and clustering

Considering the properties of human vision, there is a large amount of redundancy in the 24-bit RGB representation of color images. Jones and Rehg (2002) reported that 77% of the possible 24-bit
RGB colors were never encountered. In this paper, their result is employed as an effective preprocessing step. In our application, we found that representing each of the RGB channels with only 4 bits introduces little, or even no, perceptible visual degradation, as shown in Fig. 3(b). Another attractive feature of this operation is its convenience: it is performed simply by a 4-bit right-shift on each RGB channel. Though the dimensionality of the color space has been dramatically reduced, it is still 16 × 16 × 16; an unsupervised graph theoretical clustering (Matas and Kittler, 1995) is employed for further information congregation. Unlike other techniques, the graph theoretical clustering does not act on image pixels directly, but on the condensed 3D color histogram. Thereby, it suffers little from variation in image size.
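The reduction step is small enough to show in full; a sketch (the function name is ours):

```python
import numpy as np

def reduce_color_space(rgb):
    """4-bit right shift on each 8-bit RGB channel, giving a 16x16x16
    color space, plus the condensed 3D color histogram on which the
    graph-theoretical clustering operates."""
    q = (rgb >> 4).astype(np.int32)                   # 4-bit right shift
    idx = q[..., 0] * 256 + q[..., 1] * 16 + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=4096).reshape(16, 16, 16)
    return q.astype(np.uint8), hist
```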
Fig. 2. Flowchart of the proposed algorithm.
Fig. 3. Color dimensionality reduction and graph-theoretical clustering. (a) Original image, (b) dimensionality reduction and (c) color clustering.
Fig. 4. The effect of the combination strategy. (a) Original image, (b) no combination and (c) with combination.
graph-theoretical clustering: image (a) is the original color image with 49,789 colors; in image (b), dimensionality reduction cuts the number of colors to 816; after further clustering, in image (c), only 5 colors (clusters) remain.

3.2. Binary image generation and feature extraction

For each of the clusters obtained by the graph-theoretical clustering, a binary image is constructed. Additional binary images are built through combinations between/among different original binary images. Before combination takes effect, connected component analysis is performed on the original binary images, which wipes off some background components. Depending on the combination strategy and the input image, from roughly ten to several tens of binary images can be obtained. The problem becomes how to identify the one giving the best binarization among these candidates. To evaluate the goodness of the candidates, the aforementioned two kinds of texture features are extracted and analyzed.

3.3. Optimal binary image selection

In Liu's method (Liu and Srihari, 1997), a simple decision tree was employed to identify the best threshold. In the proposed method, the limitations of threshold-based methods, including Otsu's and Liu's, are overcome through combination among clusters, instead of selecting a single cluster-related binary image as the final output. Multiple steps are implemented to obtain the best binary image. First, initial binary images with small stroke-width frequencies, or small maximum-probability features, are discarded. Thus only clusters with prominent stroke-like features remain for
further processing. However, non-text binary images can also possess prominent stroke-like features in terms of the run-length description and thus pass the initial evaluation; this is why Liu's method (Liu and Srihari, 1997) still suffers from complex backgrounds. In our method, SSD features of the remaining images are fed into a linear discriminant analysis (LDA) classifier for further verification. Images that pass the classifier are combined again to generate the final result. A comparison between results with and without combination is shown in Fig. 4. It is obvious that integrating the combination strategy into the algorithm brings more satisfactory binarization, especially for images with several text blocks of different colors.
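Below is the small sketch promised in Section 3.1 (our own illustrative code, assuming NumPy; the function names are ours): the 4-bit right shift per RGB channel, followed by the condensed 16 × 16 × 16 histogram on which the graph-theoretical clustering operates.

import numpy as np

def reduce_colors(rgb):
    """24-bit RGB -> 12-bit representation via a 4-bit right shift per channel."""
    return (rgb >> 4).astype(np.uint8)        # rgb: H x W x 3 uint8 array

def condensed_histogram(reduced):
    """3D histogram over the 16 x 16 x 16 color cube (4096 bins); clustering
    acts on these bins rather than on individual pixels, so its cost hardly
    depends on the image size."""
    idx = (reduced[..., 0].astype(np.int64) * 256
           + reduced[..., 1].astype(np.int64) * 16
           + reduced[..., 2].astype(np.int64))
    return np.bincount(idx.ravel(), minlength=4096).reshape(16, 16, 16)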
4. Experimental results

The proposed algorithm has been evaluated on a data set of 520 color images gathered from the Internet. All these images contain prominent text blocks, but with varying complexity of image backgrounds and text colors. Compared with traditional document images, most of the images gathered for the experiment contain only a few words. On such images, existing binarization methods that perform well on conventional document images tend to fail. In the experiment, binarization results and performance of the proposed method were evaluated and compared with two existing methods: Otsu's method and the run-length-histogram-based thresholding method of Liu and Srihari (1997). Examples of the binarization results are shown in Fig. 5. As expected, Otsu's algorithm, Fig. 5(b), did not perform well in separating text from
Fig. 5. Comparison among different algorithms. (a) Original image, (b) Otsu's algorithm, (c) Liu's method and (d) proposed method.
complicated backgrounds. Liu's method, Fig. 5(c), performed better; but, like Otsu's algorithm, it is in essence a single-threshold technique and still suffered from complex backgrounds. The method proposed in this paper, as shown in Fig. 5(d), survives such situations and yields satisfactory results. Furthermore, Liu's method considers only text of uniform stroke width, while our algorithm can effectively handle text with varying stroke width in a single image. Apparently, the rough visual comparison above is far from sufficient for an objective and convincing evaluation. In the experiment, an objective performance comparison among the proposed method and the existing ones was therefore carried out in two ways: one at the image level, the other at the word level. In the image-level evaluation strategy, the criteria for accepting a given binarization result are: (1) above 80% of the text in the image is correctly separated from the background, and (2) the total area of non-text background regions in the resulting binary image is no more than that of the accepted text regions. In the word-level evaluation
scheme, recognizable words in the original image that have been correctly separated from the background in the resulting binary image were counted. The word-level accuracy was then defined as the ratio of the number of correctly separated words to the total number of recognizable words in the test image set. The final statistics of the experimental results are shown in Table 1, divided into two categories corresponding to the two evaluation strategies: column 2 for image-level evaluation and column 3 for word-level evaluation.

Table 1
Statistics of the experimental results

Scheme                                   Image-level evaluation   Word-level evaluation
Total                                    520 images               3519 words
Otsu algorithm                           353 (67.9%)              2397 (68.1%)
Liu's method (Liu and Srihari, 1997)     376 (72.3%)              2441 (69.4%)
Proposed method                          446 (85.8%)              2850 (81%)

The data set consists of 520 color text images.
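The word-level figures in Table 1 follow directly from the accuracy definition above; for instance, for the proposed method:

# word-level accuracy = correctly separated words / total recognizable words
correct, total = 2850, 3519
print(round(correct / total, 3))   # 0.81, the value reported in Table 1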
Fig. 6. Time consumption with respect to image size.
From the table, it can be seen that Otsu's algorithm gained an accuracy of only 67.9% in the image-level evaluation, while Liu's method obtained a better accuracy of 72.3%; superior to both techniques, the proposed algorithm reached an accuracy of 85.8%. In the word-level evaluation, the accuracy of Otsu's algorithm was 68.1% and that of Liu's method 69.4%; both existing methods fell below the proposed one, which reached 81%, a gap of more than 10%. The experimental results show that, compared with existing techniques, the proposed algorithm can effectively separate text from color images with complex backgrounds and is capable of producing cleaner binarization results. To further illustrate the efficiency of the proposed method, computational time consumption was also evaluated. Our method was implemented in C++ and run on a computer with a 2.0 GHz Pentium-4M CPU and 256 MB DRAM. Time consumption as a function of the size of the processed image is plotted in Fig. 6. The plot shows that the required time does not vary smoothly with image size; the oscillation results from variation in the complexity of the images.
5. Conclusion

Binarization is an important processing step in computer vision, especially for document- and text-image-related applications. Most existing techniques for document image binarization utilize thresholding, either globally or locally. For these techniques, no satisfactory binarization results are guaranteed when they are applied to images with complex backgrounds. In this paper, we have presented a new scheme for color text image binarization which efficiently integrates color clustering and binary texture analysis. Compared with existing methods, the proposed one is robust enough to survive such complicated situations. First results on our data set of more than 500 color text images collected from the Internet are very encouraging.
References Ayala, G., Domingo, J., 2001. Spatial size distributions: Applications to shape and texture analysis. IEEE Trans. Pattern Anal. Machine Intell. 23 (12), 1430–1442. Chen, D.T., Luettin, J., 2000. A survey of text detection and recognition in images and videos. IDIAP-RR-00 38, IDIAP. Garcia-Sevilla, P., Petrou, M., 1999. Classification of binary textures using the one-dimensional boolean model. IEEE Trans. Image Process. 8 (10), 1457–1462. Jones, M.J., Rehg, J.M., 2002. Statistical color models with application to skin detection. Int. J. Comput. Vision 46 (1), 81–96. Jung, K., Kim, K.I., Jain, A.K., 2004. Text information extraction in images and video: A survey. Pattern Recog. 37 (5), 977–997. Liu, Y., Srihari, S.N., 1997. Document image binarization based on texture features. IEEE Trans. Pattern Anal. Machine Intell. 19 (5), 540–544. Maragos, P., 1989. Pattern spectrum and multiscale shape representation. IEEE Trans. Pattern Anal. Machine Intell. 11 (7), 701–716. Matas, J., Kittler, J., 1995. Spatial and feature space clustering: Applications in image analysis. In: Proceedings of the 6th International Conference on Computer Analysis of Images and Patterns, Prague, Czech Republic, September, pp. 162– 173. Mihran, T., Jain, A.K., 1993. Texture analysis. In: Handbook of Pattern Recognition and Computer Vision. World Scientific Publishing, pp. 235–276. Sahoo, P.K., Soltani, S., Wong, A.K.C., 1988. A survey of thresholding techniques. Comput. Vision Graph. Image Process. 41 (2), 233–260. Sonka, M., Hlavac, V., Boyle, R., 1999. Image Processing, Analysis, and Machine Vision. Brooks and Cole Publishing, Pacific Grove, CA. Vincent, L., 2000. Granulometries and opening trees. Fundamenta Informaticae 41 (1-2), 57–90, January. Wellner, P.D., 1993. Adaptive thresholding on the digitaldesk. Technical Report EPC-93-110, Rank Xerox Research Center, Cambridge Laboratory.
Pattern Recognition Letters 26 (2005) 1658–1674 www.elsevier.com/locate/patrec
Online mining maximal frequent structures in continuous landmark melody streams

Hua-Fu Li a,*, Suh-Yin Lee a, Man-Kwan Shan b

a Department of Computer Science and Information Engineering, National Chiao-Tung University, 1001 Ta Hsueh Road, Hsin-Chu 300, Taiwan
b Department of Computer Science, National Chengchi University, 64, Sec. 2, Zhi-nan Road, Wenshan, Taipei 116, Taiwan

Received 10 January 2004; received in revised form 13 November 2004
Available online 14 April 2005
Communicated by E. Backer
Abstract

In this paper, we address the problem of online mining maximal frequent structures (Type I & II melody structures) in unbounded, continuous landmark melody streams. An efficient algorithm, called MMSLMS (Maximal Melody Structures of Landmark Melody Streams), is developed for online incremental mining of maximal frequent melody substructures in one scan of the continuous melody streams. In MMSLMS, a space-efficient scheme, called CMB (Chord-set Memory Border), is proposed to constrain the upper bound of the space requirement of maximal frequent melody structures in such a streaming environment. Theoretical analysis and experimental study show that our algorithm is efficient and scalable for mining the set of all maximal melody structures in a landmark melody stream. © 2005 Elsevier B.V. All rights reserved.

Keywords: Machine learning; Data mining; Landmark melody stream; Maximal melody structure; Online algorithm
* Corresponding author. Tel.: +886 35731901; fax: +886 35724176. E-mail addresses: [email protected] (H.-F. Li), [email protected] (S.-Y. Lee), [email protected] (M.-K. Shan).

1. Introduction

Recently, database and knowledge discovery communities have focused on a new data model,
where data arrives in the form of continuous, rapid, huge, unbounded streams, often referred to as data streams or streaming data. Many applications generate large amounts of streaming data in real time, such as sensor data from sensor networks, transaction flows in retail chains, Web records and click streams in Web applications, performance measurements in network monitoring and traffic management, and call records in telecommunications. In such a data
0167-8655/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.01.016
Fig. 1. Computation model for music melody streams.
stream model, knowledge discovery has two major characteristics (Babcock et al., 2002). First, the volume of a continuous stream over its lifetime can be huge and fast-changing. Second, continuous queries (not just one-shot queries) require timely answers, and the response time must be short. Hence, it is not possible to store all the data in main memory or even in secondary storage. This motivates the design of in-memory summary data structures with small memory footprints that can support both one-time and continuous queries. In other words, data stream mining algorithms have to sacrifice the exactness of their analysis results by allowing some counting error. Although several techniques have been developed recently for discovering and analyzing the content of static music data (Bakhmutora et al., 1997; Hsu et al., 2001; Shan and Kuo, 2003; Yoshitaka and Ichikawa, 1999; Zhu et al., 2001), new techniques are needed to analyze and discover the content of streaming music data. Thus, this paper studies the new problem of discovering the maximal melody structures in a continuous, unbounded melody stream. The problem comes from the context of online music-downloading services (such as Kuro at www.music.com.tw), where the streams in question are streams of queries, i.e., music-downloading requests, sent to the server, and we are interested in finding the maximal melody structures requested by most customers during some period of time. In the computation model of music melody streams presented in Fig. 1, the melody stream processor and the summary data structure are the two major components of the
music melody streaming environment. The user query processor receives user queries in the form of ⟨Timestamp, Customer-ID, Music-ID⟩ and transforms them into music data (i.e., melody sequences) of the form ⟨Timestamp, Customer-ID, Music-ID, Melody-Sequence⟩ by consulting the music database. Note that a buffer can optionally be set up for temporary storage of recent melodies from the music melody streams. In this paper, we present a novel algorithm MMSLMS (Maximal Melody Structures of Landmark Melody Streams) for mining the set of all maximal melody structures in a landmark melody stream. The music melody data and patterns are represented as sets of chord-sets (Type I melody structures) or strings of chord-sets (Type II melody structures). While providing a general framework for music stream mining, algorithm MMSLMS has two major features, namely one scan of the music melody streams for online frequency collection, and a prefix-tree-based compact pattern representation. With these two features, MMSLMS is able to work continuously on unbounded streams for an arbitrarily long time with bounded resources, and to quickly answer users' queries at any time.
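A minimal sketch of the query-resolution step in Fig. 1 (illustrative Python; the generator and the dictionary-based music database are our own assumptions, not part of the paper):

def resolve_queries(query_stream, music_db):
    """Turn <Timestamp, Customer-ID, Music-ID> user queries into
    <Timestamp, Customer-ID, Music-ID, Melody-Sequence> records by
    looking the melody sequence up in the music database."""
    for timestamp, customer_id, music_id in query_stream:
        yield timestamp, customer_id, music_id, music_db[music_id]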
2. Preliminaries

2.1. Music terminologies

In this section, we describe several features of music data used in this paper. For the basic
terminologies on music, we refer to (Jones, 1974). A chord is a sounding combination of three or more notes at the same time. A note is a single symbol on a musical score, indicating the pitch and duration of what is to be sung or played. A chord-set is a set of chords (Shan and Kuo, 2003).

Definition 1. A type I melody structure is represented as a set of chord-sets. A type II melody structure is represented as a string of chord-sets.

2.2. Problem statement

Let W = {i1, i2, ..., in} be a set of chord-sets, called items for simplicity. A melody sequence S with m chord-sets is denoted by S = ⟨x1 x2 ... xm⟩, where xi ∈ W, ∀i = 1, 2, ..., m. A block is a set of melody sequences.

Definition 2. A landmark melody stream LMS = [B1, B2, ..., BN) is an infinite sequence of blocks, where each block Bi is associated with a block identifier i, and N is the identifier of the "latest" block BN. The current length of LMS, written |LMS|, is N. The blocks arrive in some order (implicitly by arrival time or explicitly by timestamp) and may be seen only once.

Definition 3. A set Y ⊆ W is called an item-set, i.e., a set of chord-sets. A k-item-set is represented by (y1 y2 ... yk). The support of an item-set Y, denoted r(Y), is the number of melody sequences containing Y as a subset in the LMS seen so far. An item-set is frequent if its support is greater than or equal to minsup · |LMS|, where minsup is a user-specified minimum support threshold in the range [0, 1] and |LMS| is the current length of the landmark melody stream.

Definition 4. A string Z is called an item-string, i.e., a string of chord-sets. A k-item-string is represented by ⟨z1 z2 ... zk⟩, where zi ∈ W, ∀i = 1, 2, ..., k. The support of an item-string Z, denoted r(Z), is the number of melody sequences containing Z as a substring in the LMS seen so far. An item-string is frequent if its support is greater than or equal to minsup · |LMS|,
where minsup is a user-specified minimum support threshold in the range [0, 1] and |LMS| is the current length of the landmark melody stream seen so far.

Definition 5. A frequent item-set (or item-string) is called maximal if it is not a subset (or substring) of any other frequent item-set (or item-string).

The total number of maximal melody structures is smaller than that of frequent melody structures; hence, maximal melody structures are better suited to the performance requirements of music stream mining.

Definition 6 (Problem definition of online mining maximal melody structures in continuous landmark melody streams). Given a landmark melody stream LMS = [B1, B2, ..., BN) and a user-specified minimum support minsup in the range [0, 1], the problem of online mining maximal melody substructures is to discover the set of all maximal melody structures, i.e., maximal item-sets or maximal item-strings, in a single scan of the landmark melody stream.

2.3. Main performance requirements of music melody stream mining

The main performance challenges of mining melody streams are: (1) online, one-pass processing: each sequence in the landmark melody stream is examined once; (2) bounded storage: limited memory for storing crucial, compressed information in the summary data structure; (3) real time: the per-item processing time must be low. The proposed MMSLMS algorithm possesses all of these characteristics, while none of the previously published methods (Bakhmutora et al., 1997; Hsu et al., 2001; Shan and Kuo, 2003; Yoshitaka and Ichikawa, 1999; Zhu et al., 2001) can claim the same.
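To make Definitions 3-5 concrete, the following Python sketch (our own helper functions, not part of MMSLMS; sequences are given as strings or lists of items) computes the two support counts over a collection of melody sequences:

def itemset_support(sequences, Y):
    """r(Y): number of melody sequences containing the item-set Y as a subset."""
    Y = set(Y)
    return sum(1 for s in sequences if Y.issubset(set(s)))

def itemstring_support(sequences, Z):
    """r(Z): number of melody sequences containing the item-string Z as a
    contiguous substring."""
    Z = list(Z)
    def occurs(s):
        s = list(s)
        return any(s[i:i + len(Z)] == Z for i in range(len(s) - len(Z) + 1))
    return sum(1 for s in sequences if occurs(s))

# e.g. with the block of Example 2 below, itemset_support(block, "cef") == 3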
3. Online mining maximal frequent structures in landmark melody streams

3.1. Chord-set memory border

In this section, the upper bound on the number of candidate maximal melody structures is discussed, and an efficient algorithm for chord-set memory border construction is proposed.

Theorem 1. Given a set of k frequent chord-sets from a landmark melody stream, an upper bound on the number of maximal frequent melody structures is $C^k_{\lceil k/2\rceil}$.

Proof. Assume that there are k frequent chord-sets, i.e., k frequent items, in the current landmark melody stream. The solution space of mining all frequent item-sets in the worst case is $C^k_1 + C^k_2 + \cdots + C^k_i + \cdots + C^k_{\lceil k/2\rceil} + \cdots + C^k_k$, where $C^k_1$ is the total number of frequent 1-item-sets, $C^k_i$ is that of frequent i-item-sets, and $C^k_k$ is that of frequent k-item-sets. We observe that $C^k_{\lceil k/2\rceil}$ is the maximum among all the binomial coefficients $C^k_i$, ∀i = 1, 2, ..., k; in other words, the number of frequent ⌈k/2⌉-item-sets is the largest possible. We now prove that the number of maximal frequent item-sets cannot be greater than $C^k_{\lceil k/2\rceil}$, i.e., that $C^k_{\lceil k/2\rceil}$ is the upper bound. We prove this by contradiction. Assume that $C^k_{\lceil k/2\rceil}$ is not the maximum number of maximal frequent item-sets, i.e., that a larger upper bound U exists, with $U > C^k_{\lceil k/2\rceil}$. Consider one or more frequent melody structures with length L, where L > ⌈k/2⌉. If F is a frequent melody structure with length ⌈k/2⌉ + i and it is maximal, where i = 1, 2, ..., k − ⌈k/2⌉, then all of the substructures of F are frequent, by the anti-monotone Apriori heuristic (Agrawal and Srikant, 1994): if any i-item-set (or i-item-string) is not frequent, its (i + 1)-item-set (or (i + 1)-item-string) can never be frequent. These substructures are frequent but not maximal, by Definition 5: a frequent item-set (or item-string) is maximal only if it is not a subset (or substring) of any other frequent item-set (or item-string). In other words, when one
maximal frequent structure with length L, where L > ⌈k/2⌉, is added, at most L frequent melody structures with length L − 1 are removed from the current collection of maximal frequent melody structures found so far. Hence, the maximum number of maximal melody structures changes from U to U′, where U′ = U + 1 − L, which is not greater than $C^k_{\lceil k/2\rceil}$. This conflicts with the assumption $U > C^k_{\lceil k/2\rceil}$ and results in a contradiction; the statement is thus proven to be true. Therefore, we conclude that the maximum number of maximal melody structures is $C^k_{\lceil k/2\rceil}$ in the problem of online mining maximal melody structures in a landmark melody stream. □
Example 1. Assume that there are five frequent items (i.e., frequent 1-item-sets) a, b, c, d, and e in the landmark melody stream, as shown in Fig. 2. Let MF denote the total number of maximal frequent item-sets. At this point, a, b, c, d and e are maximal and MF = $C^5_1$. Based on the Apriori heuristic, $C^5_2$ frequent 2-item-sets are discovered in the worst case. In this case, these frequent 2-item-sets are also maximal and the frequent 1-item-sets are no longer maximal. The current MF is $C^5_1 + C^5_2 - C^5_1 = C^5_2$. Next, $C^5_3$ frequent 3-item-sets are found in the worst case. These frequent 3-item-sets are maximal, but their subsets, the frequent 2-item-sets, are no longer maximal. Now MF becomes $C^5_2 + C^5_3 - C^5_2 = C^5_3$. At this time, suppose the frequent 4-item-set abcd exists in this instance and is also a maximal 4-item-set. Its frequent subsets of length three, i.e., abc, abd, acd and bcd, are no longer maximal. Now MF becomes $C^5_3 + 1 - 4 = 7$; i.e., abcd, abe, ace, ade, bce, bde and cde are the maximal frequent item-sets. The new MF is smaller than the upper bound $C^5_{\lceil 5/2\rceil}$. Hence, whenever one or more frequent item-sets with length L, where L > ⌈5/2⌉, are added into the collection of maximal frequent item-sets found so far, the value of MF changes but remains less than $C^5_{\lceil 5/2\rceil}$. Consequently, $C^5_{\lceil 5/2\rceil}$ is the upper bound on the number of maximal melody structures.
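The bound of Theorem 1 is easy to evaluate numerically; a tiny check for the five-item example (standard-library Python):

import math

k = 5
print(math.comb(k, math.ceil(k / 2)))   # C(5, 3) = 10 >= the 7 maximal item-sets above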
Fig. 2. Item-set enumeration lattice with respect to five items: a, b, c, d and e.
The key property of algorithm MMSLMS derives from recent work (Karp et al., 2003) on finding frequent elements in streaming data. The basic scheme for mining chord-sets from music data streams generalizes the well-known algorithm (Fisher and Salzberg, 1982) for determining whether a value (a majority element) occurs more than n/2 times, i.e., minsup = 0.5, in a data stream of length n; the method can be extended to an arbitrary value of minsup. The scheme proceeds as follows. At any given time, a superset of k potentially frequent chord-sets, with at most ⌈1/minsup⌉ entries, is maintained. Initially, the set is empty. As a chord-set is read from the melody sequence in the current block, two operations are possible. First, if the current chord-set is not contained in the superset and some entries are free, it is inserted into the superset with a count set to one. Second, if the chord-set is already in the superset, its count is incremented by one. However, if the superset is full, the count of every entry in the superset is decremented by one, and the chord-sets whose counts reach zero are dropped. The method thus identifies at most k candidates for having appeared more than n/(k + 1) times, and uses O(1/minsup) memory entries.
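A compact sketch of this counting scheme (our own Python rendering of the idea in Karp et al. (2003), with k on the order of ⌈1/minsup⌉ counters):

def frequent_candidates(stream, k):
    """Maintain at most k counters; any element occurring more than
    n/(k + 1) times in a stream of length n survives in the result."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1                 # a free entry is available
        else:
            for key in list(counts):         # superset full: decrement all
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts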
3.2. The proposed algorithm: MMSLMS

Algorithm MMSLMS has three modules: MMSLMS-buffer, MMSLMS-summary, and MMSLMS-mine. MMSLMS-buffer repeatedly reads a block of melody sequences into available main memory. All compressed, essential information about the maximal melody structures is maintained in MMSLMS-summary. Finally, MMSLMS-mine finds the maximal melody structures in a depth-first manner in the current MMSLMS-summary. The challenges of online mining of landmark melody streams therefore lie in the design of a space-efficient representation of the in-memory summary data structure and of a fast discovery algorithm for finding maximal melody structures in real time.

3.2.1. MMSLMS-summary

First, the in-memory data structure MMSLMS-summary is defined and its construction process is discussed; a running example then illustrates it.

Definition 7. An MMSLMS-summary is an extended prefix-tree-based summary data structure defined below.
1. MMSLMS-summary consists of a CMB (Chord-set Memory Border) and a set of MPI-trees (Maximal Prefix-Item trees of item-suffixes), denoted MPI-trees(item-suffixes).
2. Each node in an MPI-tree(item-suffix) consists of four fields: item-id, support, block-id and node-link, where item-id is the item identifier of the inserted item, support registers the number of melody sequences represented by the portion of the path reaching the node with that item-id, the block-id assigned to a new node is the block identifier of the current block, and node-link links the node to the next node with the same item-id in the same MPI-tree, or is null if there is none.
3. Each entry in the CMB consists of four fields: item-id, support, block-id, and head of node-link (a pointer to the root node of the MPI-tree with the same item-id), abbreviated head-link, where item-id registers which item identifier the entry represents, support records the number of sequences containing the item carrying the item-id, the block-id assigned to a new entry is the block identifier of the current block, and head-link points to the root node of the MPI-tree(item-suffix). Notice that each entry with item-id in the CMB is an item-suffix and is also the root node of the MPI-tree(item-id).
4. Each MPI-tree(item-suffix) has a specific CMB-table (Chord-set Memory Border table) with respect to the item-suffix, denoted CMB-table(item-suffix). The CMB-table(item-suffix) is composed of the same four fields, namely item-id, support, block-id, and head-link, and operates like the CMB except that its head-link links to the first node carrying the item-id in the MPI-tree(item-suffix). Notice that |CMB-table(item-suffix)| = |CMB| in the worst case, where |CMB| denotes the total number of entries in the CMB.

The construction of the MMSLMS-summary is described as follows. First, MMSLMS reads a melody sequence S from the current block. Then MMSLMS projects the sequence S into subsequences and inserts these subsequences into the CMB and the MPI-trees. In detail, each melody
sequence S, such as ⟨x1, x2, ..., xm⟩, in the current block is projected by inserting m item-suffix melody subsequences into the MMSLMS-summary. In other words, the melody sequence S = ⟨x1, x2, ..., xm⟩ is converted into the m melody subsequences ⟨x1, x2, ..., xm⟩, ⟨x2, x3, ..., xm⟩, ..., ⟨x(m−1), xm⟩, and ⟨xm⟩. These m melody subsequences are called item-suffix sequences, since the first item of each is an item-suffix of the original melody sequence S. This step is called sequence projection and is denoted Sequence-Projection(S) = {x1|S, x2|S, ..., xi|S, ..., xm|S}, where xi|S = ⟨xi, xi+1, ..., xm⟩, ∀i = 1, 2, ..., m. The cost of the sequence projection of a melody sequence with length m is (m² + m)/2, i.e., m + (m − 1) + ⋯ + 2 + 1. After Sequence-Projection(S), the MMSLMS algorithm removes the original melody sequence S from the MMSLMS-buffer. Next, the items in these item-suffix sequences are inserted into the CMB and the MPI-trees(item-suffixes) as branches, and the CMB-tables(item-suffixes) are updated according to the item-suffixes. If an item-set (or item-string) shares a prefix with an item-set (or item-string) already in the tree, the new item-set (or item-string) shares the prefix of the branch representing it. In addition, a support counter is associated with each node in the tree; the counter is updated when an item-suffix sequence causes the insertion of a new branch. To limit the memory size of the summary data structure MMSLMS-summary, a space pruning technique is applied. Let the minimum support threshold be minsup, in the range [0, 1], and the current length of the landmark melody stream be N. The pruning rule is as follows: a melody structure E is deleted if E.support < minsup · N; such an E is called an infrequent melody structure. After pruning all infrequent melody structures from the CMB, the CMB-tables(item-suffix) and the MPI-trees, the MMSLMS-summary contains all information about the frequent melody structures of the landmark melody stream generated so far. Example 2 below illustrates the algorithm step by step. Note that the ⟨ ⟩ of sequences are omitted for clarity of presentation.
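Sequence projection itself is simple to state in code; a sketch matching the description above (our own illustration):

def sequence_projection(S):
    """Sequence-Projection(S): the m item-suffix subsequences of S.
    Inserting them touches m + (m - 1) + ... + 1 = (m^2 + m) / 2 items."""
    return [S[i:] for i in range(len(S))]

# sequence_projection("acdef") -> ["acdef", "cdef", "def", "ef", "f"]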
Example 2. Let a block Bj of the landmark melody stream LMS consist of ⟨acdef⟩, ⟨abe⟩, ⟨cef⟩, ⟨acdf⟩, ⟨cef⟩ and ⟨df⟩, and let the minimum support threshold be 0.5 (i.e., minsup = 0.5), where a, b, c, d, e and f are chord-sets (i.e., items) of the landmark melody stream seen so far. The MMSLMS algorithm constructs the MMSLMS-summary with respect to the incoming block Bj and prunes all infrequent item-sets from the current MMSLMS-summary in the following steps. Note that each node or entry, represented as (f1:f2:f3), is composed of three fields: item-id, support, and block-id; for example, (a:2:j) indicates that, in block Bj, item a appeared twice.

Step 1: MMSLMS reads the current block Bj into main memory for constructing the MMSLMS-summary.
(a) First melody sequence acdef: MMSLMS reads the first melody sequence acdef and calls Sequence-Projection(acdef). MMSLMS then inserts the item-suffix sequences acdef, cdef, def, ef, and f into
the CMB, [MPI-tree(a), CMB-table(a)], [MPI-tree(c), CMB-table(c)], [MPI-tree(d), CMB-table(d)], [MPI-tree(e), CMB-table(e)], and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 3. In the following sub-steps, as demonstrated in Figs. 4-9, the head-links of each CMB-table(item-suffix) are omitted for concise presentation.
(b) Second melody sequence abe: MMSLMS reads the second melody sequence abe and calls Sequence-Projection(abe). Next, MMSLMS inserts the item-suffix sequences abe, be and e into the CMB, [MPI-tree(a), CMB-table(a)], [MPI-tree(b), CMB-table(b)] and [MPI-tree(e), CMB-table(e)], respectively. The result is shown in Fig. 4.
(c) Third melody sequence cef: MMSLMS reads the third melody sequence cef and calls Sequence-Projection(cef). Then MMSLMS inserts the item-suffix sequences cef, ef and f into the CMB,
Fig. 3. MMSLMS-summary construction after inserting the first melody sequence acdef in block Bj.
Fig. 4. MMSLMS-summary construction after inserting second melody sequence abe.
[MPI-tree(c), CMB-table(c)], [MPI-tree(e), CMB-table(e)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 5.
(d) Fourth melody sequence acdf: MMSLMS reads the fourth melody sequence acdf and calls Sequence-Projection(acdf). Next, MMSLMS inserts the item-suffix sequences acdf, cdf, df and f into the CMB, [MPI-tree(a), CMB-table(a)], [MPI-tree(c), CMB-table(c)], [MPI-tree(d), CMB-table(d)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 6.
(e) Fifth melody sequence cef: MMSLMS reads the fifth melody sequence cef and calls Sequence-Projection(cef). Then MMSLMS inserts the item-suffix sequences cef, ef and f into the CMB, [MPI-tree(c), CMB-table(c)], [MPI-tree(e), CMB-table(e)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 7.
(f) Sixth melody sequence df: MMSLMS reads the sixth melody sequence df and calls Sequence-Projection(df). Next, MMSLMS inserts the item-suffix sequences df and f into the CMB, [MPI-tree(d), CMB-table(d)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 8.

Step 2: After processing the current block Bj, all infrequent melody structures are pruned by MMSLMS from the current MMSLMS-summary. At this point, MMSLMS deletes the MPI-tree(b) and its corresponding CMB-table(b), and prunes the entry b from the CMB, since item b is infrequent; that is, r(b) < minsup · |LMS|, where r(b) = 1 and minsup · |LMS| = 0.5 · 6 = 3. Next, MMSLMS reconstructs the MPI-tree(a) by eliminating the information about the infrequent item b. The result is shown in Fig. 9.

The description stated above is the construction process of the MMSLMS-summary with respect to the
Fig. 5. MMSLMS-summary construction after inserting third melody sequence cef.
Fig. 6. MMSLMS-summary construction after inserting fourth melody sequence acdf.
incoming block over a landmark melody stream. The MMSLMS-summary construction algorithm is depicted in Fig. 10.
3.2.2. MMSLMS-mine

In this section, we discuss MMSLMS-mine, the module that mines maximal melody item-sets and
Fig. 7. MMSLMS-summary construction after inserting fifth melody sequence cef.
Fig. 8. MMSLMS-summary construction after inserting sixth melody sequence df.
maximal melody item-strings from the current MMSLMS-summary (Fig. 11).
First of all, given an entry id (from left to right, for example) in the current CMB, MMSLMS-mine
Fig. 9. Current MMSLMS-summary after pruning all infrequent melody structures with respect to infrequent item b.
generates candidate maximal melody structures by a top-down approach. The top-down method uses the frequent items (i.e., chord-sets) of CMB-table(id) together with the item id to generate the candidates. The generating order of these candidates is determined by the item-set size, from size 1 + |CMB-table(id)| down to size 2. Note that the generating order ends at 2-item-sets because all frequent entries in the current CMB-table are already frequent 1-item-sets. MMSLMS-mine then checks whether these candidates are frequent by traversing the MPI-tree(id). The MPI-tree traversal principle is as follows. First, MMSLMS-mine generates a candidate maximal melody item-set, a (j + 1)-item-set, containing the item id and all items of CMB-table(id), where |CMB-table(id)| = j. Second, MMSLMS-mine traverses the MPI-tree via the node-links of the candidate. If the candidate is not a frequent item-set, MMSLMS-mine generates its substructure candidates, the j-item-sets, and executes the same MPI-tree traversal scheme for item-set counting. The process stops when MMSLMS-mine has found all maximal
frequent melody structures from the current MMSLMS-summary. Moreover, MMSLMS-mine stores these maximal melody structures in a temporary pattern list, called the MMSLMS-list. Notice that MMSLMS-mine can find the set of frequent 2-item-sets by combining the item-suffix id with the frequent items of CMB-table(id).

Example 3. This example illustrates the mining of the maximal melody item-sets from the current MMSLMS-summary in Fig. 9. Let the minimum support threshold be 0.5, i.e., minsup = 0.5. (1) We start the maximal melody item-set mining scheme from the frequent item a. Here the only frequent item-set is the 1-item-set (a), since the supports of items c, d, e and f in CMB-table(a) are less than minsup · |LMS|, where |LMS| = |Bj| = 6. (2) Next, MMSLMS-mine moves to the second entry c. It generates a candidate maximal 3-item-set (cef) and traverses the MPI-tree(c) to count its support. As a result,
Algorithm 1 (MMSLMS-summary construction)
Input: A landmark melody stream, LMS = [B1, B2, ..., BN), and a user-specified minimum support threshold, minsup.
Output: A current MMSLMS-summary.
1: CMB = ∅; /* initialize the CMB to empty */
2: foreach block Bj do /* j = 1, 2, ..., N */
3:   foreach melody sequence S = ⟨x1, x2, ..., xm⟩ ∈ Bj do
4:     foreach item xi ∈ S do /* CMB maintenance */
5:       if xi ∉ CMB then
6:         create a new entry (xi, 1, j, head-link) in the CMB; /* entry form: (item-id, support, block-id, head-link) */
7:       else /* the entry is already in the CMB */
8:         xi.support = xi.support + 1; /* increment the support of item-id xi by one */
9:       end if
10:    end for
11:    call Sequence-Projection(S); /* project the sequence with every prefix-item xi for the construction of MPI-tree(xi) */
12:  end for
13:  call MMSLMS-summary-pruning(MMSLMS-summary, minsup, |LMS|);
14: end for

Subroutine Sequence-Projection
Input: A melody sequence S = ⟨x1, x2, ..., xm⟩ and the current block-id j;
Output: MPI-trees(xi), ∀i = 1, 2, ..., m;
1: foreach item xi (i = 1, 2, ..., m) do
2:   call MPI-tree-maintenance([xi|X], MPI-tree(xi), j); /* X = x1, x2, ..., xm is the original melody sequence; [xi|X] is the item-suffix melody sequence with item-suffix xi */
3: end for

Subroutine MPI-tree-maintenance
Input: An item-suffix melody sequence ⟨xi, xi+1, ..., xm⟩ (i = 1, 2, ..., m), MPI-tree(xi) and the current block-id j;
Output: The modified MPI-tree(xi), where i = 1, 2, ..., m;
1: foreach item xi do /* i = 1, 2, ..., m */
2:   if the MPI-tree has a child Y such that Y.item-id = xi.item-id then
3:     Y.support = Y.support + 1; /* increment Y's support by one */
4:   else
5:     create a new node Y = (item-id, 1, j, node-link); /* initialize Y's support to 1, link its parent to the MPI-tree, and link its node-link to the nodes with the same item-id via the node-link structure */
6:   end if
7: end for

Fig. 10. Algorithm of MMSLMS-summary construction.
the candidate (cef) is a maximal frequent item-set, since its support is 3, and it is not a sub-structure of any other maximal melody structures within the MMSLMS-list. Now, MMSLMS-mine stores the maximal item-set (cef) into the MMSLMS-list.
(3) MMSLMS-mine starts on the third entry d and generates a maximal frequent 2-itemset (df). We store this item-set (df) into the MMSLMS-list because it is not a sub-structure of any other maximal melody structures within the current MMSLMS-list.
Subroutine MMSLMS-summary-pruning
Input: An MMSLMS-summary, a user-specified minimum support threshold, minsup, and the current length of LMS, |LMS|;
Output: An MMSLMS-summary which contains the set of all frequent melody structures.
1: foreach entry xi (i = 1, 2, ..., d) ∈ CMB, where d = |CMB|, do
2:   if xi.support < minsup · |LMS| then /* xi is not a frequent item */
3:     delete the nodes with item-id = xi via the node-link structure;
4:     merge the fragmented sub-trees; /* a simple way is to reinsert or join the remaining sub-trees into the MPI-tree */
5:     delete MPI-tree(xi);
6:     delete the entry xi from the CMB;
7:   end if
8: end for

Fig. 10 (continued)
Algorithm 2 (MMSLMS-mine)
Input: A current MMSLMS-summary, the current length of the landmark melody stream |LMS|, and a minimum support threshold minsup.
Output: A temporary pattern list, MMSLMS-list, of maximal melody structures.
1: MMSLMS-list = ∅;
2: foreach entry e in the current CMB do
3:   do generate a candidate maximal melody structure E with size |E| /* |E| = 1 + |CMB-table(e)| */
4:     count E.support by traversing the MPI-tree(e);
5:     if E.support ≥ minsup · |LMS| then
6:       if E ∉ MMSLMS-list and E is not a substructure of any other maximal frequent structure contained in the MMSLMS-list then
7:         add E to the MMSLMS-list;
8:         remove E's substructures from the MMSLMS-list;
9:       end if
10:    else /* E is not a frequent melody structure */
11:      enumerate E into melody substructures of size |E| − 1;
12:    end if
13:  until MMSLMS-mine finds the set of all maximal frequent structures with respect to the item e;
14: end for

Fig. 11. Algorithm of MMSLMS-mine.
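The maximality test in lines 6-8 of Algorithm 2 can be illustrated with a small sketch (our own Python illustration for item-sets; item-strings would need a substring test in place of the subset test):

def add_if_maximal(mmslms_list, E):
    """Insert the frequent item-set E unless it is a sub-structure of a stored
    structure; drop any stored sub-structures of E (cf. Definition 5)."""
    E = frozenset(E)
    if any(E <= other for other in mmslms_list):
        return False                          # E is covered by an existing maximal set
    mmslms_list[:] = [m for m in mmslms_list if not m < E]
    mmslms_list.append(E)
    return True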
(4) On the fourth entry e, since its maximal melody item-set (ef) is a substructure of the previously found maximal melody item-set (cef), MMSLMS-mine does not store it in the MMSLMS-list. (5) Finally, MMSLMS-mine processes the entry f and generates a maximal frequent 1-item-set (f) directly, since the CMB-table(f) is
empty. MMSLMS-mine does not store it in the MMSLMS-list, because it is a substructure of the generated maximal item-set (cef). In conclusion, the maximal Type I melody structures determined by algorithm MMSLMS are (a), (cef) and (df). Now, we describe the mining
principle of maximal melody item-strings, i.e., maximal Type II melody structures, as follows. MMSLMS-mine generates maximal melody item-strings from the current MMSLMS-summary, as shown in Fig. 9, by a depth-first-search (DFS) approach. The maximal Type II melody structures determined by algorithm MMSLMS are hence ⟨a⟩, ⟨c⟩, ⟨d⟩ and ⟨ef⟩. Note that ⟨f⟩ is not a maximal melody item-string, since it is a substring of the existing maximal melody 2-item-string ⟨ef⟩. Based on the algorithm MMSLMS-mine in Fig. 11, we have the following lemma.

Lemma 2. A melody structure is a maximal melody structure if and only if it is generated by algorithm MMSLMS-mine.

Proof. Algorithm MMSLMS-mine is composed of two major steps: frequent melody structure selection (step 1) and maximal melody structure verification (step 2), performed in sequence. First, in the frequent melody structure selection step, MMSLMS-mine finds frequent melody structures based on the Apriori property: if any i-item-set (or i-item-string) is not frequent, its (i + 1)-item-set (or (i + 1)-item-string) can never be frequent. Hence MMSLMS-mine does not miss any frequent melody structures. Next, in step 2, MMSLMS-mine checks the frequent melody structures generated in step 1 against the maximal melody structures of the MMSLMS-list, a temporary pattern list of maximal melody structures. If a frequent melody structure is a substructure (i.e., a subset or substring) of any other structure within the MMSLMS-list, then it is not a maximal melody structure, according to Definition 5; otherwise, it is a candidate maximal melody structure until the next execution of step 2. By repeating steps 1 and 2, MMSLMS-mine generates all the maximal melody structures contained in the MMSLMS-list. Hence we have the lemma: a melody structure is a maximal melody structure if and only if it is generated by algorithm MMSLMS-mine.

Space complexity analysis: The space requirement of the MMSLMS-summary consists of two parts: the working space needed to create the CMB and the CMB-tables, and the storage space needed to
maintain the set of MPI-trees. Assume that the CMB contains k frequent chord-sets e1, e2, ..., ei, ..., ek at any time. By Theorem 1, there are at most $C^k_{\lceil k/2\rceil}$ maximal frequent chord-sets in the landmark melody stream seen so far. If we construct the MMSLMS-summary for all these maximal frequent melody structures, the maximum height of all the MPI-trees is ⌈k/2⌉. There are $1 + C^{k-1}_1 + C^{k-1}_2 + \cdots + C^{k-1}_{\lceil k/2\rceil-1}$ nodes in the MPI-tree(e1), where the value 1 counts the root node e1 of the MPI-tree(e1), and $C^{k-1}_1 + C^{k-1}_2 + \cdots + C^{k-1}_{\lceil k/2\rceil-1}$ counts its internal and leaf nodes. Moreover, there are $1 + C^{k-2}_1 + C^{k-2}_2 + \cdots + C^{k-2}_{\lceil k/2\rceil-1}$ nodes in the MPI-tree(e2), ..., $1 + C^{k-i}_1 + C^{k-i}_2 + \cdots + C^{k-i}_{\lceil k/2\rceil-1}$ nodes in the MPI-tree(ei), ..., $1 + C^{k-(k-1)}_1$ nodes in the MPI-tree(e(k-1)), and 1 (root) node in the MPI-tree(ek). Thus, the total number of nodes of the MPI-trees in the MMSLMS-summary is

$(1 + C^{k-1}_1 + \cdots + C^{k-1}_{\lceil k/2\rceil-1}) + (1 + C^{k-2}_1 + \cdots + C^{k-2}_{\lceil k/2\rceil-1}) + \cdots + (1 + C^{k-i}_1 + \cdots + C^{k-i}_{\lceil k/2\rceil-1}) + \cdots + (1 + C^{k-(k-1)}_1) + 1$

$= (C^{k-1}_0 + C^{k-1}_1 + \cdots + C^{k-1}_{\lceil k/2\rceil-1}) + (C^{k-2}_0 + C^{k-2}_1 + \cdots + C^{k-2}_{\lceil k/2\rceil-1}) + \cdots + (C^{k-i}_0 + C^{k-i}_1 + \cdots + C^{k-i}_{\lceil k/2\rceil-1}) + \cdots + (C^{k-(k-1)}_0 + C^{k-(k-1)}_1) + C^{k-k}_0.$

This number equals $C^k_1 + C^k_2 + \cdots + C^k_{\lceil k/2\rceil}$ by Pascal's identity: for positive integers x and y with x ≥ y, $C^{x+1}_y = C^x_{y-1} + C^x_y$. Moreover, the worst-case working space requires at most (k² + k)/2 entries, which follows from the Sequence-Projection process. Thus, the space requirement of the MMSLMS-summary is $(k^2 + k)/2 + C^k_1 + C^k_2 + \cdots + C^k_{\lceil k/2\rceil}$, and the upper bound of the space requirement is O(2^k). □

The worst-case space complexity of algorithm MMSLMS can also be analyzed in terms of melody sequence size, as described below. Assume that the average melody sequence size is m, the current
length of the landmark melody stream is N, and the minimum support threshold is minsup. The space requirement of algorithm MMSLMS is composed of two parts, working space and storage space. The working space is used to store the CMB and the CMB-tables, and the storage space is used to store the MPI-trees. The working space requirement is m + (m − 1) + (m − 2) + ⋯ + 1, and the storage space requirement is also about m + (m − 1) + (m − 2) + ⋯ + 1. Hence, the space required by MMSLMS for inserting a melody sequence of average size m into the MMSLMS-summary is 2[m + (m − 1) + (m − 2) + ⋯ + 1] = m² + m, and the space requirement for the stream generated so far is, in the worst case, N · (m² + m). Note that in this analysis we assume that minsup · N is just 1, so that every item of the incoming melody sequence is a frequent item, which is the worst case. However, the value of N increases as time progresses; hence, the pruning mechanism of the MMSLMS-summary is deployed to keep the memory requirement from exceeding an upper bound. From the space complexity analysis, it is not surprising that the space complexity grows exponentially with the number of frequent items in the CMB, as all frequent item-sets are represented in the data structure; this is also the solution space of the problem.

Time complexity analysis: From the construction process of the MMSLMS-summary, we can see that exactly one scan of the landmark melody stream is required. The cost, denoted Time-cost(S), of inserting a melody sequence S into the MMSLMS-summary by sequence projection is |S| + (|S| − 1) + ⋯ + 1 = (|S|² + |S|)/2, that is, O(|freq(S)|²), where freq(S) is the set of frequent items in the melody sequence S; note that |freq(S)| ≤ |S|, where |S| denotes the size of S. Because the items within the CMB are frequent items, the cost of inserting a melody sequence S can be stated in terms of the size of the CMB: Time-cost(S) = O(|S′|²), where |S′| is the number of chord-sets of S within the CMB. In the worst case, if the melody sequence S contains all the frequent items within the CMB, Time-cost(S) = O(|CMB|²).
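The node-count bound derived above is easy to evaluate; a sketch in standard-library Python (function names are ours):

import math

def mpi_tree_node_bound(k):
    """Worst-case total MPI-tree nodes: C(k,1) + C(k,2) + ... + C(k, ceil(k/2))."""
    return sum(math.comb(k, i) for i in range(1, math.ceil(k / 2) + 1))

def summary_space_bound(k):
    """Working space (k^2 + k)/2 plus MPI-tree storage; O(2^k) overall."""
    return (k * k + k) // 2 + mpi_tree_node_bound(k)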
4. Experimental results

In this section, we first describe the data and the experimental set-up used to evaluate the performance of the proposed algorithm, and then report our experimental results.

4.1. Synthetic data and experiment set-up

To evaluate the performance of the MMSLMS algorithm, two experiments were performed. The experiments were carried out on data produced by the IBM synthetic market-basket test data generator proposed by Agrawal and Srikant (1994). Two data streams, denoted S10.I5.D1000K and S30.I15.D1000K, of 1 million melody sequences each, are studied. The first one, S10.I5.D1000K, with 1 K unique items, has an average melody sequence size of 10 and an average maximal potentially frequent structure size of 5. The second one, S30.I15.D1000K, with 10 K unique items, has an average melody sequence size of 30 and an average maximal potentially frequent structure size of 15. In all experiments, the melody sequences of each dataset are read in sequence to simulate the environment of a landmark melody stream. All the experiments were performed on a 1066-MHz Pentium III processor with 128 megabytes of main memory, running Microsoft Windows XP; all programs were written in Microsoft Visual C++ 6.0.

4.2. Experimental results

In the first experiment, two primary factors, memory and execution time, are examined in the online mining of a landmark melody stream, since both should remain bounded online as time advances. In Fig. 12(a), the execution time grows smoothly as the dataset size increases. This is because the average execution times for datasets S10.I5 and S30.I15 are about 12 and 25 s per block, respectively, where a block is composed of 50,000 melody sequences. In other words, the computation time of algorithm MMSLMS on dataset S10.I5 is 12 s per 50,000 melody sequences, and on dataset S30.I15 it is 25 s per 50,000 melody sequences. Hence, the total time grows smoothly as the dataset size increases. The memory usage in Fig. 12(b) for both
Fig. 12. Required resources for synthetic datasets: (a) execution time and (b) memory.
Fig. 13. (a) Linear scalability of the data stream size and (b) relative error of mining results.
synthetic datasets is stable as time progresses, indicating the feasibility of algorithm MMSLMS. Note that the synthetic landmark melody stream is partitioned into blocks of size 50 K. In the second experiment, we investigate the scalability and relative error of algorithm MMSLMS with respect to varying minimum supports. The relative error is defined as the difference between the measured support and the actual support, divided by the actual support. In Fig. 13(a), the execution time grows smoothly as the dataset size increases (with minsup = 0.01%), indicating linear scalability. Fig. 13(b) shows that the relative error decreases as minsup decreases, i.e., as the size of the CMB increases. Generally, the more frequent items are
maintained in the CMB, the more accurate the mining result is.
5. Conclusions In this paper, we proposed a single-pass algorithm, MMSLMS, to discover and maintain all maximal melody structures in a landmark model that contains all the melody sequences in a data stream. In the MMSLMS algorithm, an efficient in-memory summary data structure, MMSLMS-summary, is developed to record all maximal frequent structures in the current landmark model. In addition, MMSLMS uses a space-efficient scheme, the Chord-set Memory Border (CMB), to guarantee
the upper bound of the space requirement for mining maximal melody sequences in a streaming environment. Theoretical analysis and experimental results with synthetic data show that the MMSLMS algorithm can meet the performance requirements of data stream mining: one scan, bounded space and real time. Further work includes online mining of maximal melody structures in count-based and time-based sliding windows that contain the most recent melody sequences in a data stream.

Acknowledgements

The authors thank the reviewers for their precious comments, which improved the quality of the paper. The research is supported by the National Science Council of R.O.C. under grant no. NSC93-2213-E-009-043.
References Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of 20th International Conference on Very Large Data Bases, pp. 487–499.
Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J., 2002. Models and issues in data stream systems. In: Proceedings of 21st ACM Symposium on Principles of Database Systems, pp. 1-16. Bakhmutora, V., Gusev, V.U., Titkova, T.N., 1997. The search for adaptations in song melodies. Computer Music Journal 21 (1), 58-67. Fisher, M.J., Salzberg, S.L., 1982. Finding a majority among n votes: solution to problem 81-5. Journal of Algorithms 3 (4), 362-380. Hsu, J.L., Liu, C.C., Chen, A.L.P., 2001. Discovering nontrivial repeating patterns in music data. IEEE Transactions on Multimedia 3 (3), 311-325. Jones, G.T., 1974. Music Theory. Harper & Row, Publishers, New York. Karp, R.M., Papadimitriou, C.H., Shenker, S., 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems 28 (1), 51-55. Shan, M.-K., Kuo, F.-F., 2003. Music style mining and classification by melody. IEICE Transactions on Information and Systems E86-D (4), 655-659. Yoshitaka, A., Ichikawa, T., 1999. A survey on content-based retrieval for multimedia databases. IEEE Transactions on Knowledge and Data Engineering 11 (1), 81-93. Zhu, Y., Kankanhalli, M.S., Xu, C., 2001. Pitch tracking and melody slope matching for song retrieval. In: Proceedings of the Second IEEE Pacific Rim Conference on Multimedia: Advances in Multimedia Information, pp. 530-537.
Pattern Recognition Letters 26 (2005) 1675–1683 www.elsevier.com/locate/patrec
Correcting the Kullback–Leibler distance for feature selection

Frans M. Coetzee *

GenuOne, Inc., 2 Copley Square, Boston, MA 02216, USA

Received 31 December 2003; received in revised form 23 October 2004
Available online 14 April 2005
Communicated by E. Backer
Abstract

A frequent practice in feature selection is to maximize the Kullback–Leibler (K–L) distance between target classes. In this note we show that this common custom is frequently suboptimal, since it fails to take into account the fact that classification occurs using a finite number of samples. In classification, the variance and higher order moments of the likelihood function should be taken into account to select feature subsets, and the Kullback–Leibler distance only relates to the mean separation. We derive appropriate expressions and show that these can lead to major increases in performance. © 2005 Elsevier B.V. All rights reserved.

Keywords: Kullback–Leibler distance; Feature selection; Classification error; Receiver operating curve (ROC); Neyman–Pearson classification
1. Introduction A common approach to feature selection is to select between feature subsets such that the Kullback–Leibler distance between the resulting class conditional probability densities is maximized. In this paper we show that this approach does not
properly account for the finite sample sizes that arise when classifying finite sequences of mixed samples. We present corrected criteria for evaluating feature subsets, which can lead to major performance improvements. The following notation frames the problem. We assume that we are provided groups of N < ∞ samples to be classified, all taken from one of two class-dependent distributions p(x|Hi), i = 0, 1, over a finite alphabet X of size M defined by a specific subset. For clarity of notation, we may, as appropriate, henceforth also denote p(x|H0) by q, and p(x|H1) by p, respectively.

* Address: GenuOne, Inc., ADL, 33 McComb Road, Princeton, NJ 08540, USA. Tel.: +1 609 921 3681; fax: +1 609 258 2799. E-mail address: [email protected]
0167-8655/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.01.014
The Kullback–Leibler distance is a scalar summarizing the dissimilarity of two density functions. For densities p and q it is given by

$D(p:q) = \sum_{i=1}^{M} p(v_i)\, \ln\big(p(v_i)/q(v_i)\big) \qquad (1)$
with the standard assumption of non-zero probability of all alphabet elements for the two class distributions.1 Since the K–L divergence is a single deterministic scalar that summarizes the difference of arbitrarily high-dimensional functions, it is a useful construct in algorithms. It has a number of uses in coding theory and arises most naturally in the study of frequentist approaches to probability (e.g., the method of types; Cover and Thomas, 1993). Its use in feature selection appears straightforward: when selecting between two sets of features for building a classifier, select the set on which the two classes have the most widely separated densities, as measured by the K–L divergence. In this paper we investigate this claim more carefully. For optimal classification, a classifier compares the likelihood ratio computed on the data to a threshold. It turns out that the set of outputs of any optimal classifier evaluated on input data from a class has a mean expressed by a K–L divergence of the class-conditional pdfs. However, as a random variable, the likelihood ratio also has higher-order moments, which affect how it compares with the optimal threshold for class separation. This paper shows that by accounting for these higher-order moments, the performance in classifying samples drawn from two sources can be much improved. This argument is developed in the next few sections; we return to further discussion in Sections 5 and 6. To formally frame the problem, let us denote two possible feature subsets by a and b, and the class-conditional densities on the symbol set defined by the feature product by pa(x) and pb(x) under hypothesis H1, and qa(x) and qb(x) under
¹ Problems with zero probability of a symbol can be reduced to this type of problem, with aberrant symbols processed independently.
hypothesis H_0, respectively. The common procedure is: estimate either the one-sided distance D(p:q) or, more generally, the symmetric K–L distance (also known as the J-distance) J(p:q) = D(p:q) + D(q:p) for different subsets of features, and select the feature subset that maximizes the chosen distance. In short, the rule is:

If J(p_a : q_a) > J(p_b : q_b) select a; else select b   (2)
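For concreteness, the following Python sketch (illustrative only; the function and variable names are ours, not the paper's) computes the one-sided distance of Eq. (1), the J-distance, and the standard selection rule of Eq. (2):

import numpy as np

def kl_distance(p, q):
    """One-sided Kullback-Leibler distance D(p:q) of Eq. (1).

    p and q are probability vectors over the same alphabet of size M,
    assumed strictly positive (the paper's non-zero-probability assumption).
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def j_distance(p, q):
    """Symmetric K-L distance (J-distance): J(p:q) = D(p:q) + D(q:p)."""
    return kl_distance(p, q) + kl_distance(q, p)

def select_subset_standard(pa, qa, pb, qb):
    """The common rule of Eq. (2): keep the subset with the larger J-distance."""
    return "a" if j_distance(pa, qa) > j_distance(pb, qb) else "b"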
This approach has long been used, either explicitly or implicitly, as a building block in other approaches (see, for example, the summaries in Kanal, 1974; Boekee and Van Der Lubbe, 1979). The most extensive analysis of this approach was performed in the classic paper by Novovicova et al. (1996), where the authors assumed a particular parametric form of density for the classes and optimized the J-divergence using an embedded EM algorithm. However, even there the optimality of the J-divergence for finite sample classification is explicitly assumed, rather than proved (Eq. (12) in the reference). Our approach further differs from earlier approaches in that we directly account for the likely desired operating point of the classification step; the optimal features in a classification problem are generally a function of the desired operating point. We will defer further discussion of references until our method has been explained.
2. Classification and the K–L distance

In this section we consider the relationships between the statistics of the likelihood ratio test on a source classification problem, the K–L distance, and the error that occurs when the optimal Bayes classifier is used for finite sample classification. The classical Bayesian classification solution minimizes the error of choosing the class from which the samples were obtained, based on the class-conditional densities and the priors of the classes. For linear costs on Type I and II errors, the well-known theory (Van Trees, 1971) yields a likelihood-based test of the form:
If l(\mathbf{x}) = \ln\big( p(x_1, \ldots, x_N \,|\, H_1) / p(x_1, \ldots, x_N \,|\, H_0) \big) > T select H_1; else select H_0   (3)

Here T is a scalar threshold determined by the cost of different errors and the relative priors of the two distributions. Note that in the ideal Bayesian design case the discriminator system directly implements the above function. Since all samples in (3) are assumed to arise from one source, the likelihood ratio is a random variable that can be conditioned on the source classes. With the additional assumption of independence of samples, we find:

l(\mathbf{x} \,|\, H_i) = \ln\big( p(x_1, x_2, \ldots, x_N) / q(x_1, x_2, \ldots, x_N) \big)\big|_{H_i} = \sum_{j=1}^{N} \ln\big( p(x_j) / q(x_j) \big)\Big|_{H_i}   (4)
The Kullback–Leibler distance arises naturally by converting the summation above to a frequency formulation (the continuous analogue being Lebesgue integration). Denoting the number of times that symbol v_j occurs in the N samples by #(v_j), rearranging the summation over the alphabet, and normalizing by the number of samples N, we find:

l_N(\mathbf{x} \,|\, H_i) = \left[ \sum_{j=1}^{M} \big( \#(v_j)/N \big) \ln\big( p(v_j)/q(v_j) \big) \right]\Bigg|_{H_i}   (5)

Since #(v_j) is a binomial variable, it follows directly that:

\#(v_j)/N \,\big|_{H_1} \to p(v_j)   (6)

\#(v_j)/N \,\big|_{H_0} \to q(v_j)   (7)
as N → ∞ by almost any metric of convergence that is of interest (e.g. expectation, probability, mean square, and uniform all included). As a result:

E\{ l_N(\mathbf{x}) \,|\, H_i \} = \begin{cases} -D(q:p), & i = 0 \\ \;\;\, D(p:q), & i = 1 \end{cases}   (8)

where we ignore that, strictly speaking, equality holds only for those densities where the probability of a symbol is an integer multiple of 1/N. From this equation we see that the standard approach
uses the separation of the means of the two class-conditional likelihood ratio distributions as a proxy for the degree of overlap of the same distributions. However, for finite N, the likelihood ratio l_N(x) is itself a non-trivial random variable, and some metric that more accurately captures the gross outlines of its distribution should be used. For simplicity, later we will use a Gaussian approximation that yields metrics similar to the Mahalanobis distance, although other metrics could be explored. Using the constraint that the frequency counts have to lie on a simplex, the joint probability of the frequency counts is multinomial. The class-conditional means are given by Eq. (8), while application of some algebra shows that the variance is given by

\mathrm{var}\{ l_N(\mathbf{x}) \,|\, H_1 \} = \frac{1}{N} \big[ W(p:q) - D^2(p:q) \big]   (9)

\mathrm{var}\{ l_N(\mathbf{x}) \,|\, H_0 \} = \frac{1}{N} \big[ W(q:p) - D^2(q:p) \big]   (10)

where we define:

W(p:q) = \sum_{j=1}^{M} p(v_j) \big[ \ln\big( p(v_j)/q(v_j) \big) \big]^2   (11)
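A minimal sketch of these second-order statistics, reusing the kl_distance helper from the snippet above (names are again ours), might read:

import numpy as np

def w_term(p, q):
    """W(p:q) of Eq. (11): the p-weighted second moment of the per-sample
    log-likelihood ratio ln(p(v_j)/q(v_j))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q) ** 2))

def var_likelihood_ratio(p, q, n):
    """var{l_N(x) | H_1} of Eq. (9); call with the arguments swapped to
    obtain the H_0 variance of Eq. (10)."""
    return (w_term(p, q) - kl_distance(p, q) ** 2) / n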
We now consider the error in classification that can be expected when the likelihood ratio is used to implement the optimal Bayes classifier for finite N. While for general densities no closed form solution exists, we can make headway by assuming each class-conditioned likelihood ratio is approximately Gaussian. This assumption is usually well motivated, since a linear combination of binomial distributions approaches a Gaussian distribution under rather general conditions for even moderate N (Johnson and Kotz, 1969; Feller, 1950). It is straightforward to show that given two one-dimensional Gaussian distributions G(μ_0, σ_0) and G(μ_1, σ_1), where we assume means μ_1 > μ_0, variances σ_0² and σ_1², and threshold T, the misclassification error has two major components:

P = p(H_0)\, \Phi\!\left\{ \frac{T - \mu_0}{\sigma_0} \right\} + p(H_1)\, \Phi\!\left\{ \frac{\mu_1 - T}{\sigma_1} \right\}   (12)
where Φ{·} = (1/2) erfc(·/√2) and erfc is the standard complementary error function. Further, the ROC curve defined by false alarm rate P_F{T} and detection rate P_D{T} is given by

P_F\{T\} = P\{ l_N(\mathbf{x} \,|\, H_0) > T \} = \Phi\!\left\{ \frac{T - \mu_0}{\sigma_0} \right\}   (13)

P_D\{T\} = P\{ l_N(\mathbf{x} \,|\, H_1) > T \} = \Phi\!\left\{ \frac{T - \mu_1}{\sigma_1} \right\}   (14)
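Under the Gaussian approximation, the error of Eq. (12) and the ROC point of Eqs. (13) and (14) can be evaluated directly; a hedged sketch (assuming SciPy is available, with names of our choosing):

import numpy as np
from scipy.special import erfc

def phi(x):
    """Upper-tail Gaussian probability: Phi{x} = (1/2) erfc(x / sqrt(2))."""
    return 0.5 * erfc(x / np.sqrt(2.0))

def gaussian_error_and_roc(mu0, sigma0, mu1, sigma1, T, prior0=0.5):
    """Misclassification error of Eq. (12) and the ROC point of Eqs. (13)-(14)
    for the class-conditional Gaussians G(mu0, sigma0) and G(mu1, sigma1)."""
    p_error = prior0 * phi((T - mu0) / sigma0) \
        + (1.0 - prior0) * phi((mu1 - T) / sigma1)   # Eq. (12)
    p_false_alarm = phi((T - mu0) / sigma0)           # Eq. (13)
    p_detection = phi((T - mu1) / sigma1)             # Eq. (14)
    return p_error, p_false_alarm, p_detection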
We now consider two specific cases of interest.

2.1. No class preference

The class-conditional errors on each of the two classes are equal when the threshold is

T^* = \frac{\sigma_0 \sigma_1}{\sigma_0 + \sigma_1} \left( \frac{\mu_0}{\sigma_0} + \frac{\mu_1}{\sigma_1} \right)   (15)

whence the error is given by

P^* = \Phi\!\left\{ \frac{\mu_1 - \mu_0}{\sigma_1 + \sigma_0} \right\}   (16)

Returning to the original problem, it follows that the limiting classification error when sample classification is performed is given by

P^* \simeq \Phi\{ \sqrt{N}\, C(p:q) \}   (17)

where

C(p:q) = \frac{D(p:q) + D(q:p)}{[W(p:q) - D^2(p:q)]^{1/2} + [W(q:p) - D^2(q:p)]^{1/2}}   (18)

2.2. Class preference: Neyman–Pearson classification

In this approach the false alarm rate P_F{T} is maintained at a constant level. Define κ = Φ^{-1}{P_F{·}}. It follows that:

T = \mu_0 + \sigma_0 \kappa   (19)

P_D\{ P_F\{\cdot\} \} \simeq \Phi\{ \sqrt{N}\, \Xi(p:q, \kappa) \}   (20)

where

\Xi(p:q, \kappa) = \frac{(\mu_1 - \mu_0) - \sigma_0 \kappa}{\sigma_1}   (21)

Returning to the feature subset selection problem, we see that at a fixed operating point, detection will be improved by optimizing the following criterion:

\Xi(p:q, \kappa) = \frac{D(p:q) + D(q:p)}{[W(p:q) - D^2(p:q)]^{1/2}} - \frac{\kappa\, [W(q:p) - D^2(q:p)]^{1/2}}{\sqrt{N}\, [W(p:q) - D^2(p:q)]^{1/2}}   (22)

3. Application to feature selection

From the previous section, we can now produce two modified rules for feature selection that account for the finite value of N. When Type I and II errors are to be minimized simultaneously:

If C(p_a : q_a) > C(p_b : q_b) select a; else select b   (23)

Similarly, when a fixed false alarm rate is to be achieved:

If \Xi(p_a : q_a, \kappa) > \Xi(p_b : q_b, \kappa) select a; else select b   (24)

The careful reader will require that one issue still be addressed: namely, that these rules are not coincidental with Eq. (2). In short, we have to address whether the numerator and denominator in Eq. (18) are subject to an implicit dependency such that if

J(p_a : q_a) > J(p_b : q_b)   (25)

then it follows that

C(p_a : q_a) > C(p_b : q_b)   (26)

and similarly for Eq. (24). We basically have to prove the rules are not simply the same in disguise. We do not expect them to always disagree, but there should be cases where they disagree with each other. This independence is easily proved by example; such an example is discussed in the next section. We note this independence is true in general: the constraints implied by Eqs. (25) and (26) are regular and impose 2 constraints on a
set of dimension 2(M − 1) spanned by the simplexes from which the class densities are drawn. Hence neither the density pairs for which the conditions hold, nor those for which they do not hold, can be considered exceptional. Combined with the fact that there is always at least one multidimensional density that exists for a set of marginal distributions, we can expect the selection rules to disagree on a large set of distributions.
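The two finite-sample criteria and the modified rules (23) and (24) are easy to compute from histogram estimates; here is a sketch reusing the kl_distance, j_distance and w_term helpers above (our names, not the paper's):

import numpy as np

def c_criterion(p, q):
    """Finite-sample Bayesian criterion C(p:q) of Eq. (18)."""
    s1 = np.sqrt(w_term(p, q) - kl_distance(p, q) ** 2)
    s0 = np.sqrt(w_term(q, p) - kl_distance(q, p) ** 2)
    return j_distance(p, q) / (s1 + s0)

def xi_criterion(p, q, kappa, n):
    """Neyman-Pearson criterion Xi(p:q, kappa) of Eq. (22) for run-length n."""
    s1 = np.sqrt(w_term(p, q) - kl_distance(p, q) ** 2)
    s0 = np.sqrt(w_term(q, p) - kl_distance(q, p) ** 2)
    return j_distance(p, q) / s1 - kappa * s0 / (np.sqrt(n) * s1)

def select_subset_bayes(pa, qa, pb, qb):
    """Rule (23): minimize Type I and II errors simultaneously."""
    return "a" if c_criterion(pa, qa) > c_criterion(pb, qb) else "b"

def select_subset_np(pa, qa, pb, qb, kappa, n):
    """Rule (24): fixed false alarm rate (Neyman-Pearson operating point)."""
    return ("a" if xi_criterion(pa, qa, kappa, n) > xi_criterion(pb, qb, kappa, n)
            else "b")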
4. Examples

In this section we present results on a reference data set that show the effect of using the appropriate selection rule for classification problems. The agaricus-lepiota or "mushroom" dataset from the UCI Machine Learning repository (Blake and Merz, 1998) is widely used for evaluating discrete feature selection approaches. The dataset has a large number of data points (8124) and 22 features (numbered 1 through 22), all of which are discrete, with each feature having a distinct vocabulary of up to 12 symbols. These features are shown in Table 1.

Table 1
Features of mushroom data set, numbered 1 through 22
1  Cap-shape
2  Cap-surface
3  Cap-color
4  Bruises?
5  Odor
6  Gill-attachment
7  Gill-spacing
8  Gill-size
9  Gill-color
10 Stalk-shape
11 Stalk-root
12 Stalk-surface-above-ring
13 Stalk-surface-below-ring
14 Stalk-color-above-ring
15 Stalk-color-below-ring
16 Veil-type
17 Veil-color
18 Ring-number
19 Ring-type
20 Spore-print-color
21 Population
22 Habitat

The rules generated on this
database are widely used in practice to distinguish edible and poisonous mushrooms; if you buy wild mushrooms, your continued health may in fact depend on the accuracy of this database. The large amount of data relative to the feature space allows for accurate evaluations of performance, and bounds on performance for different feature sets and rules are known (Duch et al., 1997). The data was used to construct histograms for subsets of groups of three features, for which the histograms are highly accurate. An extensive search over all pairs of groups of three features requires significant computation but is possible. We studied randomly selected triplets of features, where the new criteria disagreed with the standard J-divergence when selecting the optimal feature set, until we had 1000 cases. In each case we calculated the full ROC curve on each feature subset and evaluated the optimal classifiers in full. For clarity, and due to the limited space here, we only consider feature subsets where both of the new criteria differed from the J-divergence (i.e. the new criteria indicate that both at the Bayesian operating point and at the P_f = 0.15 point, the newly selected feature subset would outperform the commonly selected subset). This rather stringent joint condition occurred for roughly 10% of the feature subset pairings. Table 2 contains 10 examples of feature triplets, the values of the different selection criteria calculated for the class-conditional distributions on the subset, and the Bayesian error rate P* (equal class errors) as well as the Neyman–Pearson detection error rate P_d^0.15 at a false alarm rate of P_f = 0.15 (the full ROC curve was calculated and interpolated to find these error rates). Table 3 contains 10 examples of feature triplets where the new criteria yielded a different subset selection from the common procedure, but the performance was in fact worse. We ascribe this error to the fact that the new criteria model the likelihood statistics only up to second order; while usually this yields an improvement over just using the mean separation (J-divergence), it is not always adequate. We note that such inferior performance is not the norm: Table 4 shows that in most cases (80%+) the new criteria result in an improvement, and on average the improvement in
Table 2
Selected triplets of features and performance, where selection criteria differ, and new criteria yield an improvement

Set a         Set b         J_a   J_b   C_a   C_b   Ξ_a   Ξ_b   P*_a  P*_b  P_d^0.15 a  P_d^0.15 b
(2, 15, 16)   (4, 6, 8)     3.18  2.29  0.70  0.86  1.32  1.63  73.6  78.1  59.4        67.5
(11, 13, 16)  (8, 12, 17)   5.01  4.85  0.88  1.21  1.53  1.90  83    90.8  79.9        92.8
(6, 11, 19)   (4, 11, 18)   6.61  6.11  1.15  1.41  2.21  3.76  87.5  92.4  88.1        93.6
(7, 13, 18)   (11, 16, 22)  5.63  4.98  0.95  1.00  1.37  1.76  74.2  83.4  67.9        82.6
(10, 12, 21)  (8, 10, 21)   6.83  5.84  1.12  1.33  2.05  3.90  80.9  89.6  76.7        93.3
(4, 6, 15)    (7, 16, 20)   5.52  4.06  0.89  1.38  1.59  2.35  77.2  88.7  69.7        89.3
(2, 4, 12)    (16, 18, 20)  5.61  5.43  0.99  1.55  1.88  3.00  84.2  89.6  84.2        90.3
(7, 13, 22)   (16, 18, 20)  6.44  5.43  1.18  1.55  2.05  3.00  86.6  89.6  87.6        90.3
(13, 17, 18)  (10, 12, 13)  5.11  4.09  0.96  0.99  1.22  1.41  71.7  78.2  64.5        74.6
(13, 16, 19)  (4, 17, 21)   5.89  5.45  1.07  1.25  1.62  3.54  83.4  87.3  82.9        89
Table 3
Selected triplets of features and performance, where selection criteria differ, and new criteria cause under-performance (fewer cases overall)

Set a         Set b         J_a   J_b   C_a   C_b   Ξ_a   Ξ_b   P*_a  P*_b  P_d^0.15 a  P_d^0.15 b
(10, 18, 22)  (7, 10, 21)   4.13  4.10  0.91  0.96  1.42  3.96  81.5  76.4  80.5        57.8
(1, 14, 16)   (11, 16, 18)  2.83  2.42  0.65  0.68  1.20  2.04  71.7  70.3  59          56
(10, 12, 17)  (6, 13, 17)   2.83  2.03  0.72  0.73  0.97  1.16  70    69.4  62.7        61.7
(8, 10, 11)   (4, 18, 20)   8.01  5.48  1.45  1.59  2.62  2.86  92    90.2  93.3        91.4
(9, 10, 16)   (3, 11, 16)   6.43  5.04  1.06  1.17  1.82  3.75  85.7  83.6  85.7        80.7
(4, 10, 11)   (4, 8, 11)    8.38  7.28  1.63  1.68  2.89  3.97  92.5  91.6  93.9        93.2
(3, 4, 16)    (2, 8, 21)    4.95  3.67  0.86  0.87  1.52  2.03  79    75.7  68.7        68.5
(6, 8, 22)    (12, 13, 16)  3.15  2.64  0.78  0.86  1.25  1.39  77.5  75.3  74.2        71.7
(6, 7, 9)     (3, 11, 16)   5.80  5.04  1.12  1.17  1.63  3.75  85.2  83.6  85.2        80.7
(10, 18, 19)  (6, 11, 18)   3.74  3.45  0.79  0.80  1.10  1.78  79.4  72.9  70.1        60.6
Table 4
Mean change in performance at the two operating points on sets where the selection criteria differ and the new criteria are used, split into the cases where the new criteria yield superior or inferior performance, with the percentage of sets in each case

Case      % of sets  ΔP* (%)  ΔP^0.15 (%)
Superior  81.9        6.34    10.2
Inferior  18.1       −2.24    −5.23
Overall   100         4.79     7.37
performance at both operating points is higher than when the new criteria result in inferior performance. Fig. 1 shows the receiver operating curves for two representative feature subsets a = (2, 15, 16) and b = (4, 6, 8). This case is the first in Table 2.

5. Feature space search

Up to now the paper discussed only the problem of choosing between two subsets of features
for later use in a classification scheme. The general feature selection problem involves an absolute, or more generally a co-final directed set, ranking (Munkres, 1975) of all 2^N subsets of a set of N features. Such a ranking of the subsets requires, first, a method for traversing the feature subset space, and second, a local method for comparing selected pairs of subsets. Our approach addresses the second part of the problem, namely how to better rank any two subsets; the global search is still responsible for correctly interpreting the pairwise preference, and for being robust against errors. For any reasonable alphabet size, the selection problem is computationally insoluble; in addition to the sheer computational complexity, the ranking further depends on the desired operating point, a point that frequently goes unacknowledged. Even in our selected examples in Table 2, different operating points favor different features.
Fig. 1. Receiver operating curves (ROC), ROC1 for features a = (2, 15, 16) and ROC2 for features b = (4, 6, 8) on the mushroom data set. The new criteria yield improved performance. The operating points for an optimal Bayesian classifier minimizing Type I and II error simultaneously (straight lines), and for a Neyman–Pearson classifier with P_f = 0.15 (dashed lines), are shown.
True feature selection requires the ordering of ROC curves, not points. The brute force solution requires calculating the ROC curves for all feature subsets and sorting based on performance at the desired operating point. And if that is not enough of a challenge, in wrapper approaches the additional load of a restricted classifier architecture is also imposed. The plethora of feature selection procedures all try to reduce the complexities involved (see Blum and Langley, 1997 for a review). Some taxonomies for grouping methods exist; the major separation is between approaches that explicitly perform selection before classification and those where the feature selection is implicit. Explicit approaches include methods that compare the performance of a seed set and the set formed by adding features to (forward approach) or trimming features from (backward approach) the seed set, often with some probabilistic or backtracking annealing components (Koller and Sahami, 1996; Pudil et al., 1994a). Our new criteria are obvious candidates for replacing the J-divergence or KL-divergence
in these approaches. In other applications the architecture, estimation and evaluation are intermingled; for example, in Attribute Value Taxonomy generation and tree building, split variable selection corresponds to maximizing class divergence (Baker and McCallum, 1998; Kang et al., 2004). Properly accounting for the separation of finite run-length samples is relevant there, and our new criteria should be considered. A further issue is performing feature selection with limited data. Little effort has been expended in this area. We note that the author previously proposed using the directed-set structure that exists on the ROC curves to bound the feature subset search to only a statistically valid region of the feature space supported by the finite data (Coetzee et al., 2001). This latter approach also sorts ROC curves, and selects families of subsets based on operating point. The approach was required in our applications, where the operating point varies over a wide range. The approach used in the current paper is extremely helpful in pre-sorting large data sets for such computationally complex searches, since we can select for particular operating points, yet both criteria (23) and (24) use scalar statistics that retain the benefit of having a single number that captures performance on the dataset.
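As an illustration of how the pairwise rules above slot into a subset search, the following sketch scores every feature triplet of a discrete dataset under J, C and Xi, in the spirit of the Section 4 experiments. It reuses the earlier helpers; the histogram construction, the smoothing floor alpha, and all names are our assumptions, not the paper's published procedure.

import numpy as np
from itertools import combinations

def triplet_densities(X, y, triplet, alpha=1e-6):
    """Aligned class-conditional histograms p (labels 1) and q (labels 0)
    on the symbol set induced by a feature triplet. X is an (n_samples,
    n_features) array of discrete codes; alpha keeps every symbol
    probability non-zero, matching the paper's standing assumption."""
    cols = X[:, list(triplet)]
    symbols = sorted({tuple(row) for row in cols.tolist()})
    index = {s: k for k, s in enumerate(symbols)}
    p = np.full(len(symbols), alpha)
    q = np.full(len(symbols), alpha)
    for row, label in zip(cols.tolist(), y):
        (p if label == 1 else q)[index[tuple(row)]] += 1.0
    return p / p.sum(), q / q.sum()

def rank_triplets(X, y, n_features, kappa, run_length):
    """Score every triplet under the three criteria; a global search would
    then sort these scores for the desired operating point."""
    scores = []
    for t in combinations(range(n_features), 3):
        p, q = triplet_densities(X, y, t)
        scores.append((t, j_distance(p, q), c_criterion(p, q),
                       xi_criterion(p, q, kappa, run_length)))
    return scores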
6. Conclusion

It is helpful here to discuss why the effects described in this communication are not widely recognized, especially when the standard J-divergence based approaches are so widespread. It appears that a remarkable degree of confusion exists as to the number N in the likelihood ratio. In introductions of the K–L distance, such as in Cover and Thomas (1993), the problem that is analyzed (correctly) is how many samples must be drawn consecutively from a single distribution to distinguish between two possible source distributions. As N becomes large, equipartition theorems hold, and the K–L distance appears naturally in various error exponents. However, in a practical classification problem, samples are obtained from two distributions and
the problem is more accurately represented by a mixture distribution. Practitioners frequently confuse the two problems. Even if a large number N′ of samples is obtained in the latter problem, what matters for the classification error bound is the run-length N of batches of samples guaranteed to be pulled sequentially from one of the classes, which is typically small, or even one, in most applications. The large-number asymptotics captured by the K–L distance do not apply simply because one has a large number of combined samples from the two distributions. Further, this effect of finite run-length is not the same as that of estimating the K–L divergence from a finite set of labeled data. The latter is also a difficult problem, but it is a secondary effect relative to the focus of this paper. The other frequent source of confusion results from using the K–L divergence implicitly as part of a larger feature selection approach that aims to retain information. In these approaches features are eliminated such that the per-class densities are disturbed minimally by the features being eliminated (Pudil et al., 1994b; Koller and Sahami, 1996). While intuitively these approaches should yield reasonable results, there is no reason to believe that minimizing distortion of per-class distributions as measured by K–L divergence generally minimizes the subsequent classification error between the resulting class-conditional distributions. Similarly, ranking bounds on classification performance on feature subsets does not necessarily translate into selecting the optimal feature subset, especially since bounds such as Bhattacharya bounds pay a heavy penalty in discrimination and continuity to achieve their general applicability. We note that if the features of the samples are not independent, our approach still functions perfectly. The important independence requirement relates to the drawing of data samples. If the samples are drawn dependently, converting the likelihood ratio to a frequentist formulation (the Lebesgue integration in (5)) is not always possible, and the K–L distance does not naturally arise. Little work exists on analyzing the dependent case, beyond some work on higher order extensions of the method of "types", notably by Csiszar (see Csiszar, 1998 for a review). The independence con-
dition is however not an issue in most problems typically considered as classification problems; the dependence problem usually crops up in areas of stochastic game theory and time series prediction. We close with one final interesting point: the selection of subsets does not change with N when neither class is favored (Eq. (18)). In contrast, when the false alarm rate has to be minimized, the run-length may influence the choice of feature subsets (Eq. (22)). In both cases, though, the overall rate of error decreases with run-length (Eqs. (17) and (20)), as one would expect.
Acknowledgment The author thanks the editor and two anonymous reviewers for suggestions that markedly clarified and improved the paper.
References

Baker, L.D., McCallum, A.K., 1998. Distributional clustering of words for text classification. In: Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, Melbourne, AU, pp. 96–103.
Blake, C., Merz, C., 1998. UCI repository of machine learning databases. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
Blum, A., Langley, P., 1997. Selection of relevant features and examples in machine learning. Artif. Intell. 97 (1–2), 245–271.
Boekee, D.E., Van Der Lubbe, J.C.A., 1979. Some aspects of error bounds in feature selection. Pattern Recognit. 11, 353–360.
Coetzee, F., Glover, E., Lawrence, S., Giles, C.L., 2001. Feature selection in web applications using ROC inflections and power set pruning. In: Symposium on Applications and the Internet (SAINT 2001), San Diego, CA.
Cover, T.M., Thomas, J.A., 1993. Principles of Information Theory. Wiley and Sons.
Csiszar, I., 1998. The method of types. IEEE Trans. Inf. Theory 44 (6), 2502–2523.
Duch, W., Adamczak, R., Grąbczewski, K., Ishikawa, M., Ueda, H., 1997. Extraction of crisp logical rules using constrained backpropagation networks—comparison of two new approaches. In: Proceedings of the European Symposium on Artificial Neural Networks (ESANN97), pp. 109–114.
Feller, W., 1950. An Introduction to Probability Theory and Its Applications, third ed. Probability and Mathematical Statistics, vol. 1. John Wiley and Sons, ISBN 0-471-25708-7.
Johnson, N.L., Kotz, S., 1969. Discrete Distributions. Probability and Mathematical Statistics. John Wiley and Sons, ISBN 0-471-44360-3.
Kanal, L.N., 1974. Patterns in pattern recognition. IEEE Trans. Inf. Theory 20, 697–722.
Kang, D., Silvescu, A., Zhang, J., Honavar, V., 2004. Generation of attribute value taxonomies from data for data-driven construction of accurate and compact classifiers. In: Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM 04), Brighton, UK, in print.
Koller, D., Sahami, M., 1996. Toward optimal feature selection. In: Proceedings of the 13th International Conference on Machine Learning, Morgan Kaufmann, pp. 284–292.
Munkres, J.R., 1975. Topology: a First Course. Prentice-Hall, ISBN 0-13-925495-1.
Novovicova, J., Pudil, P., Kittler, J., 1996. Divergence based feature selection for multimodal class densities. IEEE Trans. Pattern Anal. Mach. Intell. 18 (2), 218–223.
Pudil, P., Novovicova, J., Kittler, J., 1994a. Floating search methods in feature-selection. Pattern Recognit. Lett. 15 (11), 1119–1125.
Pudil, P., Novovicova, J., Kittler, J., 1994b. Simultaneous learning of decision rules and important attributes for classification problems in image analysis. Image Vis. Comput. 12 (3), 193–198.
Van Trees, H.L., 1971. Detection, Estimation and Modulation Theory, vols. 1–3. Wiley and Sons.
Pattern Recognition Letters 26 (2005) 1684–1690 www.elsevier.com/locate/patrec
Recursive computation method for fast encoding of vector quantization based on 2-pixel-merging sum pyramid data structure

Zhibin Pan a,*, Koji Kotani b, Tadahiro Ohmi a

a New Industry Creation Hatchery Center, Tohoku University, Aza-aoba 10, Aramaki, Aoba-ku, Sendai 980-8579, Japan
b Department of Electronic Engineering, Graduate School of Engineering, Tohoku University, Aza-aoba 10, Aramaki, Aoba-ku, Sendai 980-8579, Japan
Received 16 February 2004; received in revised form 21 October 2004 Available online 7 April 2005 Communicated by E. Backer
Abstract Vector quantization (VQ) is a popular signal compression method. In the framework of VQ, a fast search method is one of the key issues because it is the time bottleneck for VQ applications. In order to speed up the VQ encoding process, it becomes important to construct lower dimensional feature vectors for a k-dimensional original vector, so that the distortion between any two vectors can be measured cheaply. To reduce the dimension for approximately representing a k-dimensional vector, the multi-resolution concept is a natural consideration. By introducing a pyramid data structure, the multi-resolution concept used in fast VQ encoding includes two aspects: (1) a multi-resolution distortion check method and (2) a multi-resolution distortion computation method. Some fast search methods based on a 4-pixel-merging (4-PM) mean pyramid data structure [Lin, S.J., Chung, K.L., Chang, L.C. 2001. An improved search algorithm for vector quantization using mean pyramid structure. Pattern Recognition Lett. 22 (3/4) 373] and a 2-pixel-merging (2-PM) sum pyramid data structure [Pan, Z., Kotani, K., Ohmi, T. 2004. An improved fast encoding algorithm for vector quantization using 2-pixel-merging sum pyramid data structure. Pattern Recognition Lett. 25 (3) 459] have already been proposed. Both of them realized the multi-resolution concept by using a multi-resolution distortion check method. However, both of them ignored the multi-resolution distortion computation method, which can also be supported by the multi-resolution concept if recursive computation is introduced. In principle, a multi-resolution distortion computation method can completely reuse the computation result already obtained at a lower resolution level, so that none of it is wasted. This paper aims at
* Corresponding author. Tel.: +81 22 217 3981; fax: +81 22 217 3986. E-mail addresses: pzb@fff.niche.tohoku.ac.jp, [email protected] (Z. Pan).
0167-8655/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.01.011
improving the search efficiency of the previous work [Pan, Z., Kotani, K., Ohmi, T. 2004. An improved fast encoding algorithm for vector quantization using 2-pixel-merging sum pyramid data structure. Pattern Recognition Lett. 25 (3) 459] further, by introducing a multi-resolution distortion computation method into the multi-resolution distortion check method, so that about half of its computational cost can be reduced mathematically. Experimental results confirm that the proposed method clearly outperforms the previous works. © 2005 Elsevier B.V. All rights reserved. Keywords: Recursive computation; Fast search; Vector quantization; 2-pixel-merging; Sum pyramid
1. Introduction

Vector quantization (VQ) is a widely used asymmetric signal compression method (Nasarabadi and King, 1988). The encoding process of VQ is computationally very heavy. In a conventional VQ method, an N × N input image is first divided into a series of non-overlapping smaller n × n image blocks. Then VQ encoding is implemented block by block sequentially. The distortion between an input image block and a codeword can be measured by the squared Euclidean distance for simplicity as

d^2(I, C_i) = \sum_{j=1}^{k} (I_j - C_{i,j})^2, \quad i = 1, 2, \ldots, N_c   (1)
where k (= n × n) is the vector dimension, I = [I_1, I_2, ..., I_k] is the current image block, C_i = [C_{i,1}, C_{i,2}, ..., C_{i,k}] is the ith codeword in the codebook C = {C_i | i = 1, 2, ..., N_c}, j represents the jth element of a vector, and N_c is the codebook size. Then a best-matched codeword with minimum distortion, called the winner in what follows, can be determined straightforwardly by

d^2(I, C_w) = \min_{C_i \in C} \big[ d^2(I, C_i) \big], \quad i = 1, 2, \ldots, N_c   (2)

where C_w denotes the winner and the subscript "w" is the winner index. This process for finding the winner is called full search (FS) because the matching is executed over the whole codebook. FS is computationally very heavy because it performs N_c k-dimensional Euclidean distance computations. Once "w" has been found, VQ transmits only this index "w", which conventionally uses far fewer bits than C_w itself, instead of the winner C_w, to reduce the amount of image data and thereby realize image compression. Because the same codebook is also stored at the receiver in a VQ system, the receiver can use the received winner index "w" to retrieve the corresponding C_w for each input block, and it is then very easy to reconstruct the image by pasting the codewords C_w one by one.
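As a baseline for what follows, here is a minimal full-search encoder in Python (an illustrative sketch; the names are ours):

import numpy as np

def full_search(block, codebook):
    """Exhaustive VQ encoding: return the winner index "w" of Eq. (2).

    block:    length-k vector (a flattened n-by-n image block).
    codebook: (Nc, k) array holding one codeword per row.
    """
    d2 = np.sum((codebook - block) ** 2, axis=1)  # Eq. (1) for every codeword
    return int(np.argmin(d2))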
Because the high dimension k of the vectors is the main obstacle to computing the Euclidean distance, it is very important to use lower dimensional feature vectors that approximately express a k-dimensional original vector, so that the distortion between the two vectors I and C_i can be measured roughly and obviously unlikely codewords can be rejected. There exist many fast search methods for VQ. A class of them is based on the multi-resolution concept, constructing an appropriate pyramid data structure (Lee and Chen, 1995; Lin et al., 2001; Song and Ra, 2002; Pan et al., 2004) to roughly describe the original k-dimensional vector with a series of lower dimensional feature vectors, or levels, in a pyramid. All of these previous works adopted a multi-resolution distortion check method, in which a series of rejection checks is conducted from the top level towards the bottom level one by one, until a rejection decision is made or the bottom level is reached. They are very search-efficient. However, all of these previous works ignored the multi-resolution distortion computation method, which implies that the distortion at a higher resolution level can be computed by completely reusing the distortion already obtained at a lower resolution level and then just adding some refinements to enhance the resolution, instead of computing it from the very beginning once again. As a result, none of the distortion computation performed at a lower resolution level is wasted. This paper aims at improving the method proposed in (Pan et al.,
2004) by introducing a recursive computation method, so as to further reduce about half of its computational cost.

2. Previous work

A 4-pixel-merging (4-PM) mean pyramid data structure for fast VQ encoding, as shown in Fig. 1(a), was proposed in (Burt and Adelson, 1983; Lee and Chen, 1995; Lin et al., 2001) to realize a (u1 + 1) = log_4(n × n) + 1 resolution description of a k-dimensional vector, where the "Averaging" is implemented by a simple arithmetic average operation. Instead of using this 4-PM mean pyramid, a 2-pixel-merging (2-PM) sum pyramid, as shown in Fig. 1(b), was proposed in (Pan et al., 2004), where the "Summing" is a summation operation. A 2-PM sum pyramid can double the resolutions of a 4-PM mean pyramid, since (u2 + 1) = log_2(n × n) + 1 = 2 × u1 + 1. It has already been experimentally confirmed in (Pan et al., 2004) that it is more search-efficient to use a 2-PM sum pyramid than a 4-PM mean pyramid for fast VQ encoding when a recursive computation method is not used. Then, a new hierarchical rejection rule (Pan et al., 2004) based on a multi-resolution distortion check method can be obtained as

d^2_{s,u2}(I, C_i) \ge \cdots \ge 2^{(u2-v)} d^2_{s,v}(I, C_i) \ge \cdots \ge 2^{(u2-1)} d^2_{s,1}(I, C_i) \ge 2^{u2} d^2_{s,0}(I, C_i)   (3)

The distortion at the vth level for v ∈ [0, u2] is defined as d^2_{s,v}(I, C_i) = \sum_{m=1}^{2^v} (SI_{v,m} - SC_{i,v,m})^2, where SI_{v,m} is the mth pixel (sum) at the vth level in the 2-PM sum pyramid of I, and SC_{i,v,m} means the same for C_i, with m ∈ [1, 2^v]. The real Euclidean distance satisfies d^2(I, C_i) = d^2_{s,u2}(I, C_i). For a 4 × 4 block, u2 = 4. Then, at any vth level with v ∈ [0, u2], if 2^{(u2-v)} d^2_{s,v}(I, C_i) > d^2_min holds, the current real Euclidean distance d^2_{s,u2}(I, C_i) will definitely be larger than d^2_min. As a result, the search can be terminated at this vth level and C_i can be rejected safely to save computational cost. Therefore, how to efficiently compute the distortion d^2_{s,v}(I, C_i) becomes an important issue. Obviously, the search process based on Eq. (3) uses only the multi-resolution distortion check method and completely ignores the multi-resolution distortion computation method; this wastes computation. This paper aims at introducing the multi-resolution distortion computation method into Eq. (3) in order to fully exploit the potential power of the two aspects included in the multi-resolution concept. How to reuse the obtained d^2_{s,v}(I, C_i) to compute d^2_{s,v+1}(I, C_i) for the next distortion check recursively is the key of this paper.
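A minimal sketch of the 2-PM sum pyramid and of the level-by-level check of Eq. (3), under the assumption that k is a power of two (function names are ours):

import numpy as np

def sum_pyramid(block):
    """Build the 2-PM sum pyramid of a flattened k-dimensional vector,
    k a power of two. The returned list is indexed so that level v holds
    2**v sums; level 0 is the total sum and level u2 = log2(k) is the
    raw vector itself."""
    levels = [np.asarray(block, dtype=float)]
    while levels[-1].size > 1:
        v = levels[-1]
        levels.append(v[0::2] + v[1::2])  # merge adjacent pairs by summation
    return levels[::-1]

def can_reject(pyr_x, pyr_c, d2_min):
    """Multi-resolution distortion check of Eq. (3): scan from the top
    level down and reject as soon as 2**(u2-v) * d2_{s,v} exceeds d2_min."""
    u2 = len(pyr_x) - 1
    for v in range(u2 + 1):
        d2_v = float(np.sum((pyr_x[v] - pyr_c[v]) ** 2))
        if (2 ** (u2 - v)) * d2_v > d2_min:
            return True
    return False

Note that can_reject recomputes each d2_{s,v} from scratch; this is exactly the wasted work that the next section removes.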
Fig. 1. For an n × n (k = n × n) block: (a) a 4-PM mean pyramid with bottom level u1 = log_4(n × n) and (b) a 2-PM sum pyramid with bottom level u2 = log_2(n × n) = 2 × u1. One new level is sandwiched in between every two levels of a conventional 4-PM pyramid (shaded).
3. Proposed method

During a winner search process, in order to avoid discarding the obtained d^2_{s,v}(I, C_i) value, a recursive computation of the next d^2_{s,v+1}(I, C_i) can be realized based on the multi-resolution concept, which implies that the distortion at a higher resolution level can be computed by reusing the distortion at a lower resolution level. Based on the 2-PM sum pyramid data structure shown in Fig. 1(b), it is clear that SI_{v,m} = SI_{v+1,2m-1} + SI_{v+1,2m} and SC_{i,v,m} = SC_{i,v+1,2m-1} + SC_{i,v+1,2m} hold for m ∈ [1, 2^v]. Therefore, a recursive relation can be derived by splitting d^2_{s,v+1}(I, C_i) into its odd and even terms and then using the identity (a^2 + b^2) = (a + b)^2 - 2ab:

d^2_{s,v+1}(I, C_i) \overset{\mathrm{Def}}{=} \sum_{m=1}^{2^{v+1}} (SI_{v+1,m} - SC_{i,v+1,m})^2
  = \sum_{m=1}^{2^v} \big[ (SI_{v+1,2m-1} - SC_{i,v+1,2m-1})^2 + (SI_{v+1,2m} - SC_{i,v+1,2m})^2 \big]
  = \sum_{m=1}^{2^v} \big[ (SI_{v+1,2m-1} - SC_{i,v+1,2m-1}) + (SI_{v+1,2m} - SC_{i,v+1,2m}) \big]^2
    - 2 \sum_{m=1}^{2^v} (SI_{v+1,2m-1} - SC_{i,v+1,2m-1})(SI_{v+1,2m} - SC_{i,v+1,2m})
  = \sum_{m=1}^{2^v} (SI_{v,m} - SC_{i,v,m})^2 - 2 \sum_{m=1}^{2^v} (SI_{v+1,2m-1} - SC_{i,v+1,2m-1})(SI_{v+1,2m} - SC_{i,v+1,2m})
  = d^2_{s,v}(I, C_i) - \sum_{m=1}^{2^v} (2\,SI_{v+1,2m-1} - 2\,SC_{i,v+1,2m-1})(SI_{v+1,2m} - SC_{i,v+1,2m})   (4)

Eq. (4) is the core of this paper. From Eq. (4), it is obvious that the obtained d^2_{s,v}(I, C_i) is completely reused for computing d^2_{s,v+1}(I, C_i) in a recursive way. Eq. (4) is, in fact, a realization of the multi-resolution distortion computation method.

An analysis of the computational complexity of Eq. (4) follows. When d^2_{s,v+1}(I, C_i) is computed directly, it needs 2^{v+1} + (2^{v+1} - 1) = 2 × 2^{v+1} - 1 addition (±) and 2^{v+1} multiplication (×) operations. However, by storing the doubled odd terms at each level (i.e. storing 2 × SI_{v+1,2m-1} and 2 × SC_{i,v+1,2m-1} as a whole, instead of the raw SI_{v+1,2m-1} and SC_{i,v+1,2m-1}, for v ∈ [1, u2]), Eq. (4) only needs [1 + 2 × 2^v + (2^v - 1)] = (3/2) × 2^{v+1} addition (±) and 2^v = (1/2) × 2^{v+1} multiplication (×) operations. Mathematically, addition (±) operations are reduced by a factor of [(3/2) × 2^{v+1}]/[2 × 2^{v+1} - 1] ≈ 3/4 and multiplication (×) operations by a factor of [(1/2) × 2^{v+1}]/[2^{v+1}] = 1/2 when Eq. (4) is introduced for each distortion computation at the (v+1)th level for v ∈ [0, u2 - 1]. Because a multiplication (×) operation is much heavier than an addition (±) operation, it is clear that Eq. (4) can save about half of the computational burden compared to computing d^2_{s,v+1}(I, C_i) directly.

In principle, the 2-PM sum pyramid is an extension of the 4-PM mean pyramid. However, in order to realize a multi-resolution distortion computation method recursively and thereby reduce more computational cost, only the 2-PM sum pyramid can be used; the 4-PM mean pyramid cannot. The reason is that, by using the formula a^2 + b^2 = (a + b)^2 - 2ab, one multiplication (×) can be saved when the value of (a + b)^2 is known in a 2-pixel-merging pyramid structure. In contrast, the formula a^2 + b^2 + c^2 + d^2 = (a + b + c + d)^2 - 2ab - 2ac - 2ad - 2bc - 2bd - 2cd cannot save any computational cost even though the value of (a + b + c + d)^2 is known in a 4-pixel-merging pyramid structure. Therefore, the 2-PM sum pyramid is a more promising data structure than the 4-PM mean pyramid for fast VQ encoding.
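In code, the recursive step of Eq. (4) is a one-line correction to the previous level's distortion; a sketch building on the sum_pyramid layout above (the explicit factor of 2 would disappear if, as the paper suggests, the doubled odd terms were stored off-line):

import numpy as np

def distortion_next_level(pyr_x, pyr_c, v, d2_v):
    """Eq. (4): obtain d2_{s,v+1} from the already known d2_{s,v} plus a
    correction built from the odd/even difference pairs of level v+1."""
    dx = pyr_x[v + 1] - pyr_c[v + 1]
    odd, even = dx[0::2], dx[1::2]   # the (2m-1)th and (2m)th differences
    return d2_v - float(np.sum(2.0 * odd * even))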
Based on the discussions above, the search flow can be summarized as follows:

(1) Construct an accompanying 2-PM sum pyramid for each C_i based on Fig. 1(b) off-line. Store the doubled values of the odd terms for the levels v ∈ [1, u2] instead of their raw values, so as to take Eq. (4) into account.
(2) Sort all codewords by the real sum at the L0 level in ascending order and then rearrange them along this order off-line.
(3) For an input I, construct its accompanying 2-PM sum pyramid based on Fig. 1(b) on-line. Similarly, store the doubled values of the odd terms for the levels v ∈ [1, u2].
(4) Find an initial nearest-neighbor (NN) codeword C_N among the sorted codewords by a binary search; this is the closest codeword in the sense that the real sum difference d_{s,0}(I, C_N) = |SI_{0,1} - SC_{N,0,1}| is minimum. It needs log_2(N_c) comparison (cmp) operations. Then compute and store the "so far" d^2_min = d^2(I, C_N), and all d^2_{v,min} = 2^{-(u2-v)} d^2_min, in order to simplify future distortion checks at the vth level for v ∈ [0, u2 - 1]. This step needs (2k - 1) additions (±) and [k + log_2(n × n)] multiplications (×).
(5) Continue the winner search up and down around C_N one codeword at a time. Once (SI_{0,1} - SC_{i,0,1})^2 ≥ d^2_{0,min} holds, terminate the search for the upper part of the sorted codebook when i < N, or for the lower part when i > N.
(6) If the winner search has been terminated in both the upper and lower directions, the search is complete. Clearly, the current "so far" best-matched codeword must be the winner. Output the winner index. The search flow then returns to Step 3 to encode a new input.
(7) Otherwise, check whether d^2_{s,v+1}(I, C_i) ≥ d^2_{v+1,min} is true for v ∈ [0, u2 - 1]. If any distortion check is true, reject C_i safely. (Note: Eq. (4) must be introduced here for computing d^2_{s,v+1}(I, C_i) in a recursive way.) If all distortion checks fail to reject, the current C_i is a better-matched codeword; update d^2_min with d^2_{s,u2}(I, C_i) and all d^2_{v,min} for v ∈ [0, u2 - 1], and update the winner index found "so far". Then return to Step 5 to check the next codeword.
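A compact sketch of steps (4)-(7), reusing sum_pyramid and distortion_next_level from above; the up-and-down termination of steps (5)-(6) is simplified to a scan of the full sorted order, and the pre-computed d2_{v,min} thresholds are folded back into d2_min, so this is illustrative rather than the paper's optimized bookkeeping:

import numpy as np

def encode_block(block, codebook, pyramids, order, sums):
    """Return the winner index for one input block.

    pyramids: off-line 2-PM sum pyramids of the codewords (step 1);
    order:    codeword indices sorted by total sum (step 2);
    sums:     total sum of each codeword at level L0.
    """
    pyr_x = sum_pyramid(block)                      # step 3
    u2 = len(pyr_x) - 1
    sx = pyr_x[0][0]
    pos = min(int(np.searchsorted(sums[order], sx)), len(order) - 1)
    best = order[pos]                               # binary-search init (step 4)
    d2_min = float(np.sum((block - codebook[best]) ** 2))
    for i in order:
        pyr_c = pyramids[i]
        d2 = (pyr_x[0][0] - pyr_c[0][0]) ** 2       # level-0 check (step 5)
        if d2 * 2 ** u2 > d2_min:
            continue
        rejected = False
        for v in range(u2):                         # recursive checks (step 7)
            d2 = distortion_next_level(pyr_x, pyr_c, v, d2)
            if d2 * 2 ** (u2 - v - 1) > d2_min:
                rejected = True
                break
        if not rejected and d2 < d2_min:            # d2 is now the true distance
            d2_min, best = d2, i
    return int(best)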
4. Experimental results

Simulation experiments were conducted with MATLAB. Codebooks were generated using the 512 × 512 8-bit Lena image as a training set. Since it is mathematically clear how much computational cost can be reduced by introducing the multi-resolution distortion computation method of Eq. (4) into Eq. (3), only the practical 4 × 4 block size is used in the experiments of this paper. The reason for this selection is that a 2 × 2 block size results in a very low compression ratio and an 8 × 8 block size results in a rather low PSNR, so they are seldom adopted in practical VQ encoding. Because only the 2-PM sum pyramid data structure, and not the 4-PM mean pyramid data structure, can use the multi-resolution distortion computation method, a performance comparison of the fast encoding method with and without Eq. (4) is made. The search efficiency is evaluated by the total computational cost, in numbers of addition (±), multiplication (×) and comparison (cmp) operations per input vector, which consists of (1) finding the initial best-matched codeword C_N and computing the initial d^2_min and d^2_{v,min} for v ∈ [0, u2 - 1]; (2) computing the distortion at each level of the pyramids for a possible rejection, based on Eq. (3) (non-recursive way) or Eq. (3) combined with Eq. (4) (recursive way); (3) updating the "so far" d^2_min and d^2_{v,min} for v ∈ [0, u2 - 1] if all rejection checks ultimately fail. The results are shown in Table 1, where "Add" means the addition operation, "Mul" the multiplication operation and "Cmp" the comparison operation. From Table 1, it is obvious that the total computational cost when Eq. (4) is introduced into Eq. (3) is much less than when Eq. (3) is used alone. In particular, the number of multiplication operations is reduced significantly, which benefits from computing d^2_{s,v+1}(I, C_i) by Eq. (4) for v ∈ [0, u2 - 1] in a recursive way. However, a full 50% reduction ratio cannot be achieved, because the multiplication operations for computing the initial d^2_min(I, C_N), for generating or updating d^2_{v,min} for v ∈ [0, u2 - 1], and for computing d^2_{s,0}(I, C_i) are fixed. In addition, the comparison operations cannot be reduced, because the number of distortion checks does not change when Eq. (4) is used.
Table 1
Comparison of total computational cost per input vector in the arithmetical operations for 2-PM sum pyramid data structure with full search (FS) as a baseline

Codebook  Method                          Operation  Lena     F-16     Pepper   Baboon
128       Full search                     Add        3968     3968     3968     3968
                                          Mul        2048     2048     2048     2048
                                          Cmp        128      128      128      128
          Eq. (3) only (non-recursive)    Add        171.16   136.46   176.87   404.36
                                          Mul        113.81   91.23    117.06   253.33
                                          Cmp        29.13    24.78    30.89    72.25
          Eqs. (3) and (4) (recursive)    Add        148.34   119.73   153.45   347.20
                                          Mul        77.62    64.12    79.48    156.43
                                          Cmp        29.13    24.78    30.89    72.25
256       Full search                     Add        7936     7936     7936     7936
                                          Mul        4096     4096     4096     4096
                                          Cmp        256      256      256      256
          Eq. (3) only (non-recursive)    Add        252.98   201.78   272.09   705.39
                                          Mul        169.05   135.08   180.17   439.77
                                          Cmp        46.43    39.78    51.02    131.26
          Eqs. (3) and (4) (recursive)    Add        218.18   175.83   234.79   604.44
                                          Mul        112.08   91.53    118.43   265.16
                                          Cmp        46.43    39.78    51.02    131.26
512       Full search                     Add        15,872   15,872   15,872   15,872
                                          Mul        8192     8192     8192     8192
                                          Cmp        512      512      512      512
          Eq. (3) only (non-recursive)    Add        363.53   313.50   427.89   1303.10
                                          Mul        243.26   209.70   281.19   804.56
                                          Cmp        73.98    67.43    87.55    249.76
          Eqs. (3) and (4) (recursive)    Add        314.22   273.00   369.67   1116.50
                                          Mul        158.79   138.17   180.25   475.49
                                          Cmp        73.98    67.43    87.55    249.76
1024      Full search                     Add        31,744   31,744   31,744   31,744
                                          Mul        16,384   16,384   16,384   16,384
                                          Cmp        1024     1024     1024     1024
          Eq. (3) only (non-recursive)    Add        492.05   478.77   657.85   2127.22
                                          Mul        334.36   322.16   434.96   1324.27
                                          Cmp        112.72   114.30   149.68   447.91
          Eqs. (3) and (4) (recursive)    Add        427.80   418.78   571.47   1836.61
                                          Mul        219.44   210.42   277.76   785.73
                                          Cmp        112.72   114.30   149.68   447.91
5. Conclusion

In this paper, three contributions are made. First, the multi-resolution concept that is realized by a 2-PM sum pyramid data structure in VQ encoding is classified into two aspects: a multi-resolution distortion check method and a multi-resolution distortion computation method conducted in a recursive way. Second, it is mathematically made clear that only a 2-PM sum
pyramid data structure, and not a 4-PM mean pyramid data structure, can benefit from this recursive computation method. Third, by completely reusing the obtained distortion d^2_{s,v}(I, C_i) to compute d^2_{s,v+1}(I, C_i) for the next distortion check, nothing needs to be discarded, so the waste of computation is avoided. Experimental results confirmed that the proposed method clearly outperforms the previous work (Pan et al., 2004).

References

Burt, P.J., Adelson, E., 1983. The Laplacian pyramid as a compact image code. IEEE Trans. Commun. 31 (4), 532–540.
Lee, C.H., Chen, L.H., 1995. A fast search algorithm for vector quantization using mean pyramids of codewords. IEEE Trans. Commun. 43 (2/3/4), 1697–1702.
Lin, S.J., Chung, K.L., Chang, L.C., 2001. An improved search algorithm for vector quantization using mean pyramid structure. Pattern Recognition Lett. 22 (3/4), 373–379.
Nasarabadi, N.M., King, R.A., 1988. Image coding using vector quantization: A review. IEEE Trans. Commun. 36 (8), 957–971.
Pan, Z., Kotani, K., Ohmi, T., 2004. An improved fast encoding algorithm for vector quantization using 2-pixel-merging sum pyramid data structure. Pattern Recognition Lett. 25 (3), 459–468.
Song, B.C., Ra, J.B., 2002. A fast search algorithm for vector quantization using L2-norm pyramid of codewords. IEEE Trans. Image Process. 11 (1), 10–15.
Pattern Recognition Letters 26 (2005) 1691–1700 www.elsevier.com/locate/patrec
Design and implementation of a multi-PNN structure for discriminating one-month abstinent heroin addicts from healthy controls using the P600 component of ERP signals

Ioannis Kalatzis a, Nikolaos Piliouras a, Eric Ventouras a, Charalabos C. Papageorgiou b, Ioannis A. Liappas b, Chrysoula C. Nikolaou b, Andreas D. Rabavilas b, Dionisis D. Cavouras a,*

a Department of Medical Instrumentation Technology, Technological Educational Institution of Athens, Ag. Spyridonos Street, Egaleo GR-122 10, Athens, Greece
b Psychophysiology Laboratory, Eginition Hospital, Department of Psychiatry, Medical School, University of Athens, Greece

Received 16 February 2004; received in revised form 5 October 2004
Available online 7 April 2005
Communicated by E. Backer
Abstract A multi-probabilistic neural network (multi-PNN) classification structure has been designed for distinguishing one-month abstinent heroin addicts from normal controls by means of the Event-Related Potentials' P600 component, selected at 15 scalp leads, elicited under a Working Memory (WM) test. The multi-PNN structure consisted of 15 optimally designed PNN lead-classifiers feeding an end-stage PNN classifier. The multi-PNN structure classified correctly all subjects. When leads were grouped into compartments, highest accuracies were achieved at the frontal (91.7%) and left temporo-central region (86.1%). Highest single-lead precision (86.1%) was found at the P3, C5 and F3 leads. These findings indicate that cognitive function, as represented by P600 during a WM task and explored by the PNN signal processing techniques, may be involved in short-term abstinent heroin addicts. Additionally, these findings indicate that these techniques may significantly facilitate computer-aided analysis of ERPs. © 2005 Elsevier B.V. All rights reserved. Keywords: Heroin addicts; Event-related potentials (ERPs); P600 component; Pattern recognition
* Corresponding author. Tel.: +30 210 5385 375; fax: +30 210 5910 975. E-mail address: [email protected] (D.D. Cavouras).
0167-8655/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.01.012
1. Introduction

Event-related potentials (ERPs) are electrical potentials, usually measured on the scalp, and are distinguished for their high temporal resolution, allowing for real-time and non-invasive observation of electrical activity changes in the brain during the processing of information related to the presentation of stimuli (or events) (Fabiani et al., 2000). Late positive components of the ERP waveform, such as the P300 and the P600 components, have attracted special attention in ERP research. Both components are related to working memory (WM) processes, i.e. keeping information actively in mind, the P300 being more related to the on-line updating of working memory and/or attentional operations involved in this function (Polich, 1998), while the P600, elicited between 500 and 800 ms after stimulus presentation, has been linked to hippocampal function (Guillem et al., 1998; Grunwald et al., 1999), having much in common with WM operation (Garcia-Larrea and Cezanne-Bert, 1998; Guillem et al., 1999; Frisch et al., 2003). P600 is thought to reflect the response selection stage of information processing (Falkenstein et al., 1994), i.e. the stage that 'assigns a specific response to a specific stimulus'. The relationship between substance dependence, such as cocaine and/or heroin abuse, and neurophysiological functions has been previously addressed by various workers, using the P300 (Easton and Bauer, 1997; Martin and Siddle, 2003; Papageorgiou et al., 2003; Kouri et al., 1996; Bauer, 1997, 2002; Biggins et al., 1997; Attou et al., 2001) and, to a lesser extent, the P600 (Papageorgiou et al., 2001) concerning six-month, i.e. long-term, abstinent heroin addicts. As regards the application of the P600 component of ERPs to picking up relevant aspects of addiction, in association with neuropsychological operation, Papageorgiou et al. (2001) provided evidence indicating that abstinent heroin addicts manifest abnormal aspects of second-pass parsing processes, as reflected by the P600 latencies, elicited during a WM test. The aim of the present study is twofold: first, to search deeper into the P600 signals by extracting new P600-signal characteristics and by employing
powerful classification procedures, to develop a pattern recognition system for discriminating drug-abusers from controls. The P600 component has only previously been employed (Vasios et al., 2002) for computer-based discrimination of normal controls from patients suffering from schizophrenia. Second, to design a novel classification system, according to which composite information is collected from all fifteen leads simultaneously and is fed into a multi-classifier structure to achieve the highest classification accuracies.
2. Material and methods

2.1. Subjects

Sixteen one-month abstinent heroin-abusers (4 females and 12 males), matched on age and educational level to 20 normal controls (5 females and 15 males), were examined. The former were recruited from the outpatient university clinic of Eginition Hospital of Athens, Greece. Drug abstinence was verified by urine tests. The addicts were mainly long users of heroin, they had not made prolonged use of other drugs, and had no history of mental retardation. The controls were recruited from hospital staff and local volunteer groups. All participants had no history of any neurological or hearing problems and were right-handed as assessed by the Edinburgh Inventory Test (Oldfield, 1971). Written informed consent was obtained from both patients and control subjects.

2.2. ERP generation procedure

All subjects were evaluated by a computerized version of the digit span subtest of the Wechsler Adult Intelligence Scale (Wechsler, 1955). The examination procedure followed for each subject is detailed in a previous work by members of our research team (Papageorgiou et al., 2003; Papageorgiou and Rabavilas, 2003). ERPs were recorded using Ag/AgCl electrodes (leads), during the 1 s interval between the warning stimulus and the first administered number, and were digitized at a sampling rate of 500 Hz. EEG activity was recorded from 15 scalp leads based on the Inter-
I. Kalatzis et al. / Pattern Recognition Letters 26 (2005) 1691–1700
national 10–20 system of Electroencephalography (Jasper, 1958), referred to both earlobes (leads at Fp1, Fp2, F3, F4, C3, C4, C5, C6, P3, P4, O1, O2, Pz, Cz, and Fz) (see Fig. 4). 2.3. Conventional statistical analysis To investigate whether the two groups of subjects could be discriminated by conventional statistical analysis methods, a step-wise discriminant method was employed, utilizing the amplitudes (parameter AMP) and latencies (parameter LAT) of the P600 component at all 15 leads. It should be noted that the equality of the covariance matrices of the variables entered for the two groups was ascertained with BoxÕs M-test. 2.4. Pattern recognition methods 2.4.1. Feature generation Nineteen features related to the P600 component (500–800 ms time interval) were extracted from each ERP-signal at each lead by means of a dedicated computer software developed in C++ for the purposes of the present study. The description and relations of the features are given in Table 1. All features were normalized to zero mean and unit standard deviation (Theodoridis and Koutroumbas, 1998), according to relation: xi l x0i ¼ ð1Þ r where xi and x0i are the ith feature values before and after the normalization respectively, and l and r are the mean value and standard deviation respectively of feature x over all subjects (addicts and normal controls). 2.4.2. The PNN classifier The probabilistic neural networks (PNNs) (Specht, 1990) are implemented by a feed-forward and one-pass structure and encapsulate the BayesÕ decision rule together with the use of Parzen estimators of dataÕs probability distribution function (PDF). The PNN classifier was chosen due to its non-parametric nature and because its training is easy and instantaneous (Specht, 1990), especially in comparison with the back-propagation neural
network and the support vector machine classifiers. The discriminant equation of a PNN (equipped with the widely used Gaussian weighting function) for class k is given by the following relation, as described in (Tsai, 2000; Hajmeer and Basheer, 2002):

g_k(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} \prod_{j=1}^{d} \sigma_j}\, \frac{1}{N_k} \sum_{i=1}^{N_k} \exp\!\left[ -\frac{1}{2} \sum_{j=1}^{d} \left( \frac{x_j - x_{ij}}{\sigma_j} \right)^2 \right]   (2)
where x = [x_1 x_2 ... x_d]^T is the test pattern vector to be classified, x_i is the ith training pattern vector, N_k is the number of patterns in class k, σ_j are the standard deviations of the distributions of the pattern vector element variables, and d is the feature space dimensionality. The test pattern x is classified to the class with the larger discriminant function value.

2.4.3. Best feature selection procedure

To optimize classification performance, the best feature combination had to be determined at each one of the 15 leads, giving the highest class-discrimination with the least number of features. This was accomplished by employing the exhaustive search method (Theodoridis and Koutroumbas, 1998) over all possible 2-, 3-, 4-, 5-, and 6-feature combinations. Combinations with higher numbers of features were also tested, employing the forward stepwise feature selection technique (Theodoridis and Koutroumbas, 1998) and the PNN classifier. The best-feature combinations thus determined were used to design the PNN classifier at each lead, using the leave-one-out method, for discriminating heroin addicts from normal controls.

2.4.4. Compartmental classification

Neural activity leading to the production of P600 signals has been associated with different brain structures such as the frontal, temporal and parietal regions, which participate in information processing during recognition memory (Papageorgiou et al., 2003). A careful observation of the lead placements on the scalp in Fig. 4 will
1694
I. Kalatzis et al. / Pattern Recognition Letters 26 (2005) 1691–1700
Table 1
Signal waveform feature descriptions and definitions

1. Latency (LAT): time interval to maximum signal value; t_smax = {t : s(t) = s_max}
2. Amplitude (AMP): maximum signal value; s_max = max{s(t)}
3. Latency/amplitude ratio (LAR): LAT/AMP ratio; LAR = t_smax / s_max
4. Absolute amplitude (AAMP): the absolute value of AMP; AAMP = |s_max|
5. Absolute latency/amplitude ratio (ALAR): the absolute value of LAR; ALAR = |t_smax / s_max|
6. Positive area (PAR): the sum of the positive signal values; A_p = Σ_{t=500 ms}^{800 ms} 0.5 (s(t) + |s(t)|)
7. Negative area (NAR): the sum of the negative signal values; A_n = Σ_{t=500 ms}^{800 ms} 0.5 (s(t) − |s(t)|)
8. Absolute negative area (ANAR): the absolute value of NAR; ANAR = |A_n|
9. Total area (TAR): the sum of all signal values; TAR = A_pn = A_p + A_n
10. Absolute total area (ATAR): the absolute value of TAR; ATAR = |A_pn|
11. Total absolute area (TAAR): the sum of absolute signal values; A_p|n| = A_pos + |A_neg|
12. Average absolute signal slope (TAAS): the mean of consecutive signal-value slopes; |ṡ| = (1/n) Σ_{t=500 ms}^{800 ms − τ} (1/τ) |s(t + τ) − s(t)|, where τ is the sampling interval of the signal (τ = 2 ms, for the sampling rate of 500 Hz), n is the number of samples of the digital signal (n = (800 ms − 500 ms)/2 ms = 150), and s(t) is the signal value of the tth sample
13. Peak-to-peak (PP): the difference between maximum and minimum signal values; pp = s_max − s_min, where s_max = max{s(t)} and s_min = min{s(t)} are the maximum and the minimum signal values, respectively
14. Peak-to-peak time window (PPT): time interval between the moments where the maximum and minimum signal values appear; t_pp = t_smax − t_smin
15. Peak-to-peak slope (PPS): the slope of the line connecting the maximum and the minimum signal points; ṡ_pp = pp / t_pp
16. Zero crossings (ZC): the number of times the signal is equal to zero; n_zc = Σ_{t=500 ms}^{800 ms} d_s, where d_s = 1 if s(t) = 0, and 0 otherwise
17. Zero crossings in peak-to-peak time (ZCPP): the number of times the signal is equal to zero within the peak-to-peak time window; n_zc = Σ_{t=t_smin}^{t_smax} d_s
18. Zero crossings density in peak-to-peak time (ZCDPP): the frequency of zero crossings within the peak-to-peak time window; d_zc = n_zc / t_pp, where n_zc are the zero crossings and t_pp is the peak-to-peak time window
19. Slope sign alterations (SSA): the number of slope sign alterations of two adjacent signal values; n_sa = Σ_{t=500 ms + τ}^{800 ms − τ} 0.5 |(s(t − τ) − s(t)) / |s(t − τ) − s(t)| + (s(t + τ) − s(t)) / |s(t + τ) − s(t)||, where τ is the sampling interval of the signal (τ = 2 ms, for the sampling rate of 500 Hz)
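For illustration, here is a sketch computing several of the Table 1 features from a digitized P600 segment; the signal is assumed sampled every τ = 2 ms within the 500–800 ms window, and the helper name is hypothetical.

```python
import numpy as np

def p600_features(s, t0=500.0, tau=2.0):
    """Compute a subset of the Table 1 features for a sampled signal s(t)."""
    t = t0 + tau * np.arange(len(s))
    i_max, i_min = int(np.argmax(s)), int(np.argmin(s))
    f = {
        "AMP": s[i_max],                                    # maximum value
        "LAT": t[i_max],                                    # latency of max
        "PP":  s[i_max] - s[i_min],                         # peak-to-peak
        "PPT": t[i_max] - t[i_min],                         # PP time window
        "PAR": float(np.sum(0.5 * (s + np.abs(s)))),        # positive area
        "NAR": float(np.sum(0.5 * (s - np.abs(s)))),        # negative area
        "TAAS": float(np.mean(np.abs(np.diff(s)) / tau)),   # avg. abs. slope
    }
    f["PPS"] = f["PP"] / f["PPT"] if f["PPT"] != 0 else 0.0
    signs = np.sign(np.diff(s))                             # slope signs
    f["SSA"] = int(np.sum(signs[1:] * signs[:-1] < 0))      # sign changes
    return f
```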
Since these compartments participate in various brain functions, it was thought appropriate to investigate compartmental P600 differences. Accordingly, compartmental classifications were carried out, with each lead participating with its best feature combination.
2.4.5. Multi-PNN classification structure
As shown in Fig. 4, a multi-PNN classification system was developed for classifying a subject as belonging to either the ‘‘heroin addicts’’ or the ‘‘controls’’ category. At each lead there was a PNN classifier designed to use the lead's particular best P600 features and to assign the P600 component to one of the two classes. Table 2 gives the best feature combinations at each lead, and Table 1 gives an explicit account of each feature. The outcome from each lead (either ‘‘heroin addict’’ or ‘‘control’’) was fed into an end-stage PNN classifier in the form of 1 or 0 (meta-features for addict or control, respectively), which was trained to make the final decision on the class of a particular subject. The discriminant function of the end-stage PNN then takes the form (by normalizing the feature values as described in Section 2.4.1, we can assume an equal standard deviation σ for all pattern-vector distributions):

g_k(z) = [1 / ((2π)^{p/2} σ^p)] (1/N_k) Σ_{i=1}^{N_k} exp( −‖z − z_i‖² / (2σ²) )   (3)

where z is the above-mentioned meta-feature input vector and p is the number of participating leads. The overall system was evaluated by the leave-one-subject-out method.
Table 2
Best feature combination after exhaustive search, with and without the leave-one-out (LOO) method, using the PNN classifier (σ = 0.24) at each lead

1. Fp1: LOO 80.6%, w/o LOO 97.2%; LAT, AMP, PPT
2. Fp2: LOO 77.8%, w/o LOO 97.2%; AAMP, ATAR, PPT
3. F3: LOO 86.1%, w/o LOO 100.0%; LAT, PP, PPT
4. F4: LOO 75.0%, w/o LOO 97.2%; ALAR, TAAS, PP
5. C3: LOO 77.8%, w/o LOO 94.4%; AMP, LAR, AAMP
6. C4: LOO 77.8%, w/o LOO 100.0%; LAT, PPT, ZC
7. C5: LOO 86.1%, w/o LOO 97.2%; ALAR, TAR, TAAS
8. C6: LOO 75.0%, w/o LOO 100.0%; LAT, NAR, TAAS
9. P3: LOO 86.1%, w/o LOO 100.0%; PAR, PPS, SSA
10. P4: LOO 69.4%, w/o LOO 83.3%; PAR, TAAS
11. O1: LOO 77.8%, w/o LOO 94.4%; AAMP, PAR, ZC
12. O2: LOO 80.6%, w/o LOO 100.0%; PAR, TAAS, SSA
13. Pz: LOO 80.6%, w/o LOO 97.2%; LAR, TAAS, ZCDPP
14. Cz: LOO 75.0%, w/o LOO 94.4%; AMP, ATAR, ZC
15. Fz: LOO 80.6%, w/o LOO 100.0%; LAT, ZCDPP, SSA

For feature descriptions see Table 1.
Accordingly, each time a subject was left out, the overall system was re-designed with the remaining subjects, and the left-out subject was classified by the re-designed system. Then the left-out subject was re-inserted into its class, the next subject was removed, and the whole procedure was repeated for all subjects. Finally, the classification results were presented in a truth table.
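The leave-one-subject-out procedure just described can be sketched as a generic wrapper; the classifier argument could be, for instance, the PNN sketch above. Names are hypothetical.

```python
import numpy as np

def leave_one_out(X, y, classify):
    """Each subject in turn is left out, the system is re-designed on the
    remaining subjects, and the left-out subject is classified; the results
    accumulate in a 2x2 truth table (rows: true class, columns: predicted).
    `classify(x, X_train, y_train)` returns a predicted label in {0, 1}."""
    truth = np.zeros((2, 2), dtype=int)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        truth[y[i], classify(X[i], X[keep], y[keep])] += 1
    return truth
```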
3. Results and discussion

Fig. 1 shows the grand averages of the ERP signals of the two groups of subjects. Dashed lines represent the heroin addicts and solid lines the controls. As may be observed, differences between the two groups may be visible at some leads but may be insignificant due to large variations about their grand-average signals. This can be seen in the P600 amplitude and latency scatter diagram (Fig. 2) at the Fp1 lead. These two features are usually employed by psychiatrists in assessing ERP signals by visual inspection. Discriminant analysis regarding the P600 amplitude (parameter AMP) revealed that only one lead entered the discriminant function (C6), being able to classify correctly 69.4% of the cross-validated grouped cases. Comparison of the latencies (parameter LAT) revealed that only lead P3 entered the discriminant function, being able to classify correctly 69.4% of the cross-validated grouped cases. Employing a PNN classifier at each lead, the highest classification accuracy was determined employing the smallest number of features for discriminating heroin addicts from controls. Optimal numbers of features at each lead were determined by the exhaustive search method. Accuracies are presented in Table 2 and concern results obtained with and without the leave-one-subject-out (LOO) method. Classification accuracies varied between a maximum of 86.1% (F3, P3, and C5) and a minimum of 75% (F4, C6, Cz), signifying the difficulty, at many leads, of effectively discriminating the two groups by means of the P600 component. The discriminatory ability of the PNN classifier was tested for various values of the σ parameter, retaining at each classification
Fig. 1. Grand averages of ERPs of heroin addicts (dashed lines) and normal controls (solid lines) recorded at each lead. The lead notation is based on the International 10–20 system of Electroencephalography (Jasper, 1958).
test the value that provided the highest accuracy. Fig. 3 shows the scatter diagram and the decision boundary of the PNN classification achieved at the F3 lead. For the highest classification accuracy, the PNN had to employ three features and to draw a non-linear surface through the points. For comparison purposes, the PNN algorithm was tested against the multilayer perceptron
Fig. 2. Feature 'Latency' against feature 'Amplitude' (see Table 1) plot of P600 signals at the Fp1 lead for heroin addicts (triangles) and normal controls (circles). Feature values are normalized to zero mean and unit standard deviation (Theodoridis and Koutroumbas, 1998).
(MLP) (Khotanzad and Lu, 1990) and the support vector machine (SVM) (Kecman, 2001) classifiers. The maximum overall classification accuracy per lead ranged between 61.1% and 80.6% for the MLP classifier, with two hidden layers and four nodes per layer, and between 69.4% and 83.3%
Fig. 3. Best feature/lead (F3) combination scatter diagram with PNN decision boundary.
for the SVM classifier, with the radial basis function as kernel. The duration of the training phase for the SVM classifier was about three times longer than the PNN's, while the MLP required over 800 times more computational time than the PNN. An attempt to use the P600 signals of all leads concurrently to design a PNN classifier employing the LOO and exhaustive search methods gave a classification accuracy of 86.1% (see Table 3).
Fig. 4. Schematic diagram of leads distribution and multi-PNN classification system steps: First, a PNN classifier is employed at each lead to classify each subject to one of two classes (heroin addicts and normal controls). Then, on the basis of those lead subclassifications, each subject is assigned to a particular class using a second PNN classifier.
Table 3
Truth table for the best feature combination (PPT, PPS, ZCDPP; see Table 1 for descriptions), using the leave-one-out method with the PNN (σ = 0.24) classifier performed at all leads together

Controls: 19 classified as controls, 1 as addicts; specificity 95.0%
Addicts: 4 classified as controls, 12 as addicts; sensitivity 75.0%
Overall accuracy: 86.1%
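The specificity, sensitivity and overall accuracy reported in Tables 3–5 follow directly from such a truth table; a small sketch using the Table 3 counts:

```python
def truth_table_metrics(tn, fp, fn, tp):
    """Specificity, sensitivity and overall accuracy from a 2x2 truth table;
    controls are treated as the negative class, addicts as the positive."""
    return {"specificity": tn / (tn + fp),
            "sensitivity": tp / (tp + fn),
            "overall": (tn + tp) / (tn + fp + fn + tp)}

# Counts from Table 3: 19/20 controls and 12/16 addicts classified correctly.
print(truth_table_metrics(tn=19, fp=1, fn=4, tp=12))
# -> specificity 0.95, sensitivity 0.75, overall ~0.861
```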
The best features (PPT, PPS, ZCDPP; see Table 1) were associated with the peak-to-peak time and slope of the P600 signal, i.e. with the P600 amplitude and rate-of-change differences that may exist between heroin addicts and normal controls. As shown in Table 3, the PNN could classify the controls with high precision (95%), misclassifying only one in 20 (high specificity), although it failed to correctly distinguish 4 of the 16 heroin addicts (low sensitivity). Based on these results, it is difficult to draw definite conclusions as to existing differences between the two groups. Table 4 shows the results obtained by employing the multi-PNN classification structure on all leads, using the LOO method, as described in Section 2.4.5.
Table 4
Multi-PNN classification truth table, using the leave-one-out method with the second-stage PNN (σ = 1.0) classifier performed at all leads together

Controls: 20 classified as controls, 0 as addicts; specificity 100.0%
Addicts: 0 classified as controls, 16 as addicts; sensitivity 100.0%
Overall accuracy: 100.0%
Based on the meta-features (0 or 1) from each lead, the end-stage PNN could predict the class of each subject (control or heroin addict) with 100% accuracy (Table 4), even when that subject was not involved in its design. The meaning of this outcome is that a complex structure may be designed to discriminate one-month abstinent heroin addicts from normal controls; however, since that structure is based on many different features of the P600 signals selected from 15 leads, it is difficult to reach conclusive reasoning about between-group differences. It was thus thought appropriate to proceed to a compartmental investigation of probable P600 deviations between the two classes, as described in Section 2.4.4. Table 5 presents the results of regional classification employing the multi-PNN structure and using the LOO method. Results varied between 91.7% at the frontal and 80.6% at the occipital regions. On the other hand, a more careful examination of Table 2 reveals that, of the five leads involved in the frontal compartment, F3 showed the highest classification accuracy (86.1%). Considering the best-feature combination in F3 (latency, peak-to-peak magnitude and peak-to-peak time interval), it may be said that it is their combined involvement that provided that valuable between-groups discriminatory power, which would not have been otherwise easily discernible by visual inspection. The next highest compartmental classification accuracy was found in the combined left and right temporo-central compartments in Table 5, giving a discriminatory precision of 86.1% by misclassifying 3 controls and 2 heroin addicts. However, a more careful examination of Table 2 reveals that the main contributor to that combined precision is the left temporo-central region, whose accuracy is significantly higher (86.1%) than that of the corresponding right region (75%).
Table 5
Multi-PNN classification results, using the leave-one-out method with the second-stage PNN (σ = 1.0) classifier performed at several scalp areas (sensitivity / specificity / overall accuracy)

Frontal (Fp1, Fp2, F3, F4, Fz): 87.5% / 95.0% / 91.7%
Central (C3, C4, Cz): 75.0% / 90.0% / 83.3%
Parietal (P3, P4, Pz): 68.7% / 95.0% / 83.3%
Occipital (O1, O2): 87.5% / 75.0% / 80.6%
Central-temporal (C5, C6): 87.5% / 85.0% / 86.1%
Considering the best-feature combination in C5 (absolute latency/amplitude ratio, total area and average absolute signal slope), it may be concluded that P600 signal differences, related to the combined effect of amplitude, latency and rate of change, provide the discriminatory power achieved between groups. Another reassuring fact is that the means of the P600 amplitudes at lead C5 differed statistically between the two groups (p < 0.05), with the heroin addicts group showing higher amplitudes. These differences are important, since ERPs from the temporo-parietal region have been associated with the subject's effort to respond to evoked stimuli (Papageorgiou et al., 2003). In fact, as may be observed from Table 2, the discriminatory power of the left parietal lead was equally high (86.1%). Finally, taking into consideration that significant between-group discriminations were obtained using the left-side leads (F3, C5, P3) separately (see Table 2), when these leads were employed by the multi-PNN structure, a high discrimination accuracy of 94.4% was achieved, misclassifying only 1 heroin addict and 1 control. Associating the meaning attributed to the P600 component with the present findings, especially the localization of findings at the left hemisphere (an effect most accentuated in the frontal region), it seems reasonable to suggest that short-term opioid abstinence is related to disruption of neural circuits coupled with the left hemisphere and underlying processes which 'assign a specific response to a specific stimulus'. This concurs with the view of Davidson et al. (1990) that 'heroin stimuli elicit more pronounced activation in the left hemisphere in heroin abusers'. As far as the more noticeable classification accuracy present in frontal regions is concerned, i.e. the one located at the left prefrontal cortex (F3 lead site), it appears to be in accordance with results of recent neuroimaging studies demonstrating neurobiological changes of the frontal cortex that accompany drug addiction (Goldstein and Volkow, 2002). This idea is broadly consistent with neuropsychological models of information processing, postulating that the valence of information (positive or negative) and its associated action tendency (approach or withdrawal) are coupled with the operation of the prefrontal brain.
Specifically, the right prefrontal region is conceptualized as a key part of a brain circuit mediating withdrawal-associated behavior, and the equivalent left prefrontal area is considered a main part of a brain circuit mediating approach-associated behavior (Lane et al., 1997). Furthermore, the present findings, considered in association with the corresponding patterns of P600 waveforms observed in long-term heroin abstinence (Papageorgiou et al., 2001), seem to show both common and distinct features. In particular, the results of the present study imply that short-term heroin abstinence connotes left-hemisphere processes associated with the assignation of specific responses to specific stimuli, as thought to be indexed by the P600 component. In contrast, although long-term heroin abstinence has also been reported to be associated with the assignation of specific responses to specific stimuli, that association is connected with right frontal operation. In conclusion, these findings indicate that cognitive function, as represented by the P600 component during a WM task and explored by the PNN signal processing techniques, may be affected in short-term abstinent heroin addicts. Additionally, these findings indicate that these techniques may significantly facilitate computer-aided analysis of ERPs.
Acknowledgement The present research was funded by the Greek Ministry of Education and the European Union, under research fund ARCHIMEDES—‘‘Development of an Evoked Potentials and Intracranial Current Classification System using Support Vector Machines (SVM) and Probabilistic Neural Networks (PNN)’’.
References

Attou, A., Figiel, C., Timsit-Berthier, M., 2001. ERPs assessment of heroin detoxification and methadone treatment in chronic heroin users. Clin. Neurophysiol. 31, 171–180.
Bauer, L.O., 1997. Frontal P300 decrements, childhood conduct disorder, family history, and the prediction of relapse
among abstinent cocaine abusers. Drug Alcohol Depend. 44, 1–10.
Bauer, L.O., 2002. Differential effects of alcohol, cocaine, and opioid abuse on event-related potentials recorded during a response competition task. Drug Alcohol Depend. 66, 137–145.
Biggins, C.A., MacKay, S., Clark, W., Fein, G., 1997. Event-related potential evidence for frontal cortex effects of chronic cocaine dependence. Biol. Psychiat. 42, 472–485.
Davidson, R.J., Ekman, P., Saron, C.D., Senulis, J.A., Friesen, W.V., 1990. Approach-withdrawal and cerebral asymmetry: emotional expression and brain physiology. I. J. Pers. Soc. Psychol. 58, 330–341.
Easton, C.J., Bauer, L.O., 1997. Beneficial effects of thiamine on recognition memory and P300 in abstinent cocaine-dependent patients. Psychiat. Res. 70, 165–174.
Fabiani, M., Gratton, G., Coles, M., 2000. Event-related potentials: methods, theory, and applications. In: Cacioppo, J., Tassinary, L., Bernston, G. (Eds.), Handbook of Psychophysiology. Cambridge University Press, New York.
Falkenstein, M., Hohnsbein, J., Hoormann, J., 1994. Effects of choice complexity on different subcomponents of the late positive complex of the event-related potential. Electroencephalogr. Clin. Neurophysiol. 92, 148–160.
Frisch, S., Kotz, S., von Cramon, D., Friederici, A., 2003. Why the P600 is not just a P300: the role of the basal ganglia. Clin. Neurophysiol. 114, 336–340.
Garcia-Larrea, L., Cezanne-Bert, G., 1998. P3, positive slow wave and working memory load: a study on the functional correlates of slow wave activity. Electroencephalogr. Clin. Neurophysiol. 108, 260–273.
Goldstein, R.Z., Volkow, N.D., 2002. Drug addiction and its underlying neurobiological basis: neuroimaging evidence for the involvement of the frontal cortex. Am. J. Psychiat. 159, 1642–1652.
Grunwald, T., Beck, H., Lehnertz, K., Blumcke, I., Pezer, N., Kutas, M., Kurthen, M., Karakas, H.M., Van Roost, D., Wiestler, O.D., Elger, C.E., 1999. Limbic P300s in temporal lobe epilepsy with and without Ammon's horn sclerosis. Eur. J. Neurosci. 11, 1899–1906.
Guillem, F., N'Kaoua, B., Rougier, A., Claverie, B., 1998. Location of the epileptic zone and its physiopathological effects on memory-related activity of the temporal lobe structures: a study with intracranial event-related potentials. Epilepsia 39, 928–941.
Guillem, F., Rougier, A., Claverie, B., 1999. Short- and long-delay intracranial ERP repetition effects dissociate memory systems in the human brain. J. Cogn. Neurosci. 11, 437–458.
Hajmeer, M., Basheer, I., 2002. A probabilistic neural network approach for modeling and classification of bacterial growth/no-growth data. J. Microbiol. Methods 51, 217–226.
Jasper, H., 1958. The ten–twenty electrode system of the international federation. Electroencephalogr. Clin. Neurophysiol. 10, 371–375.
Kecman, V., 2001. Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models. MIT Press, Cambridge, MA.
Khotanzad, A., Lu, J.H., 1990. Classification of invariant image representations using a neural network. IEEE Trans. Acoust., Speech, Signal Process. 38 (6), 1028–1038.
Kouri, E.M., Lukas, S.E., Mendelson, J.H., 1996. P300 assessment of opiate and cocaine users: effects of detoxification and buprenorphine treatment. Biol. Psychiat. 40, 617–628.
Lane, R.D., Reiman, E.M., Ahern, G.L., Schwartz, G.E., Davidson, R.J., 1997. Neuroanatomical correlates of happiness, sadness, and disgust. Am. J. Psychiat. 154, 926–933.
Martin, F.H., Siddle, D.A.T., 2003. The interactive effect of alcohol and temazepam on P300 and reaction time. Brain Cognition 53, 58–65.
Oldfield, R.C., 1971. The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia 9, 97–113.
Papageorgiou, C., Liappas, I., Asvestas, P., Vasios, C., Matsopoulos, G.K., Nikolaou, C., Nikita, K.S., Uzunoglu, N., Rabavilas, A., 2001. Abnormal P600 in heroin addicts with prolonged abstinence elicited during a working memory test. Neuroreport 12, 1773–1778.
Papageorgiou, C., Rabavilas, A., Liappas, I., Stefanis, C., 2003. Do obsessive-compulsive patients and abstinent heroin addicts share a common psychophysiological mechanism? Neuropsychobiology 47, 1–11.
Papageorgiou, C.C., Rabavilas, A.D., 2003. Abnormal P600 in obsessive–compulsive disorder. A comparison with healthy controls. Psychiat. Res. 119, 133–143.
Polich, J., 1998. P300 clinical utility and control of variability. J. Clin. Neurophysiol. 15, 14–33.
Specht, D.F., 1990. Probabilistic neural networks. Neural Networks 3, 109–118.
Theodoridis, S., Koutroumbas, K., 1998. Pattern Recognition. Academic Press, UK.
Tsai, C.-Y., 2000. An iterative feature reduction algorithm for probabilistic neural networks. Omega 28, 513–524.
Vasios, C., Papageorgiou, C., Matsopoulos, G.K., Nikita, K.S., Uzunoglu, N., 2002. A decision support system of evoked potentials for the classification of patients with first-episode schizophrenia. German J. Psychiat. 5, 78–84.
Wechsler, D., 1955. Manual for the Wechsler Adult Intelligence Scale. Psychological Corporation, New York.
Zang, X.L., Begleiter, H., Porjesz, B., 1997. Do chronic alcoholics have intact implicit memory? An ERP study. Electroencephalogr. Clin. Neurophysiol. 103, 457–473.
Pattern Recognition Letters 26 (2005) 1701–1709 www.elsevier.com/locate/patrec
Stochastic texture analysis for monitoring stochastic processes in industry

Jacob Scharcanski *

Instituto de Informática, UFRGS–Universidade Federal do Rio Grande do Sul, Av. Bento Gonçalves, 9500, Porto Alegre, RS, 91501-970, Brazil

Received 2 February 2004; received in revised form 28 September 2004
Available online 14 April 2005
Communicated by E. Backer
Abstract Several continuous manufacturing processes use stochastic texture images for quality control and monitoring. Large amounts of pictorial data are acquired, providing both important information about the materials produced and about the manufacturing processes involved. However, it is often difficult to measure objectively the similarity among such images, or to discriminate between texture images of materials with distinct properties. The degree of discrimination required by industrial processes sometimes goes beyond the limits of human visual perception. This work presents a new method for multi-resolution stochastic texture analysis, interpretation and discrimination based on the wavelet transform. A multi-resolution distance measure for stochastic textures is proposed, and applications of our method in the non-woven textiles industry are reported. The conclusions include ideas for future work. 2005 Elsevier B.V. All rights reserved. Keywords: Stochastic textures; Wavelets; Anisotropy; Nonwoven textiles
1. Introduction

In several continuous processes, static and dynamic stochastic texture images are acquired and used in quality control (Wang, 1999). Often, industrial machine operators try to visually interpret stochastic texture images and estimate the manufacturing process condition using their experience in the field. This empirical approach is subjective and prone to failure, mainly because human vision is limited in its ability to distinguish between stochastic textures (Scharcanski and Dodson, 2000). Despite advances in texture representation and classification over the past three decades

* Tel.: +55 51 3316 7128; fax: +55 51 3316 7308.
E-mail address: [email protected]
0167-8655/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.01.017
(Fan and Xia, 2003; Arivazhagan and Ganesan, 2003), the problem of stochastic texture feature interpretation and classification remains a challenge for researchers (Zhu et al., 1998), and for a large segment of the industry (Scharcanski and Dodson, 2000). There have been a variety of methods for extracting texture features from textured images, e.g. geometric, random field, fractal, and signal processing models for textures. A significant part of the recent work on textures concentrates on statistical modelling (Fan and Xia, 2003), which characterizes textures as probability distributions, and uses statistical theories to formulate and solve texture processing problems mathematically. Wavelet-based texture characterization has attracted attention recently because of its usefulness in several important applications, such as texture classification (Arivazhagan and Ganesan, 2003; Do and Vetterli, 2002) and texture segmentation (Choi and Baraniuk, 2001). Several approaches have been proposed to extract features in the wavelet domain with application in texture analysis, such as: (a) wavelet energy signatures, which were found useful for texture classification (Arivazhagan and Ganesan, 2003); (b) second-order statistics of the wavelet transform, which were used to improve the accuracy of texture characterization (Wouver et al., 1999); and (c) higher-order dependencies of wavelet coefficients, which were studied for texture analysis (Fan and Xia, 2003; Choi and Baraniuk, 2001; Romberg et al., 2001). These wavelet-based approaches have been found more effective than other methods based on second-order statistics or random fields, which analyze textures at a single resolution without considering the human visual perception of textures (Fan and Xia, 2003; Scharcanski and Dodson, 2000). Most of the work on texture in the wavelet domain has concentrated on the analysis of visual textures, and was not designed for stochastic texture analysis. For example, feature extraction is often carried out assuming sub-band independence at each resolution (e.g. Arivazhagan and Ganesan, 2003; Do and Vetterli, 2002), which is not verified experimentally. Also, the methods based on higher-order statistics generally do not make explicit the relevant stochastic texture features
that are important for industrial applications, where process conditions are estimated based on specific texture parameters (e.g. Fan and Xia, 2003; Choi and Baraniuk, 2001; Romberg et al., 2001; Scharcanski and Dodson, 2000). In this work, a multi-resolution scheme for stochastic texture representation and analysis is proposed. We begin by describing how we measure the image gradients in multiple resolutions. Based on this technique, the local grayscale variability and texture anisotropy are measured in multiple resolutions. Next, a multi-resolution distance measure for stochastic textures is introduced. Finally, we present some applications, experimental results and conclusions.

2. Our proposed texture representation

In this work, we emphasize specific stochastic texture features that could be used to facilitate texture interpretation, and the discrimination between distinct process conditions.¹ Our method relies on multiple-resolution texture gradients and their magnitudes. To estimate the local gradients in multiple resolutions, we apply the redundant two-dimensional WT proposed by Mallat and Zhong (1992). The coefficients W^1_{2^j} f(x, y) and W^2_{2^j} f(x, y) represent the details in the x and y directions, respectively, and approximate the image gradient at the resolution 2^j. Since we are dealing with digital images f[n, m], we use the discrete version of the WT (Mallat and Zhong, 1992), and the discrete wavelet coefficients are denoted in this work by W^i_{2^j} f[n, m], for i = 1, 2. The gradient magnitudes at the resolution 2^j are computed from

M_{2^j} f[n, m] = √( (W^1_{2^j} f[n, m])² + (W^2_{2^j} f[n, m])² ).   (1)

¹ One or more samples can be used as reference images, indicating a particular process condition.
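As an illustration of Eq. (1), the sketch below approximates the detail coefficients with derivative-of-Gaussian filters at dyadic scales; this is an assumption standing in for the Mallat–Zhong redundant WT that the paper actually uses, and the function name is hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multiscale_gradient_magnitudes(f, J=3):
    """Approximate W^1 (x-detail), W^2 (y-detail) and the magnitude M of
    Eq. (1) at resolutions 2^j, j = 1..J, via Gaussian derivatives."""
    results = []
    for j in range(1, J + 1):
        s = 2.0 ** j                                    # dyadic scale
        w1 = gaussian_filter(f, sigma=s, order=(0, 1))  # derivative along x
        w2 = gaussian_filter(f, sigma=s, order=(1, 0))  # derivative along y
        results.append((w1, w2, np.hypot(w1, w2)))      # magnitude M_{2^j}
    return results
```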
2.1. Texture directionality in multiple resolutions

Texture directionality, i.e. anisotropy, is an important parameter in the manufacture of foil-like materials. It correlates well with several mechanical and transport properties of such materials, as well as with commonly monitored manufacturing variables. In this work, the occurrences of the coefficients W^1_{2^j}[n, m] and W^2_{2^j}[n, m] are approximated by Gaussian distributions. The normal plots in Fig. 2(a) and (b) show that the Gaussian model can represent a wide range of wavelet coefficient values in stochastic textures, with some deviation from Gaussianity occurring in the tails of the distribution, which was confirmed on a large-sample test set. It was also verified that, in general, wavelet coefficients associated with the same image position [n, m] in different sub-bands (i.e. W^1_{2^j}[n, m] and W^2_{2^j}[n, m]), at the same resolution 2^j, are correlated (i.e. are not independent). Therefore, we represent the joint distribution of wavelet coefficients by a bivariate Gaussian G^{12}_{2^j}(W^1_{2^j}[n, m], W^2_{2^j}[n, m]), denoted simply by G^{12}_j. The iso-probability curves of the bivariate Gaussian G^{12}_j are typically elliptic for anisotropic samples, and tend to be circular for samples that are isotropic. The joint coefficient distribution G^{12}_j determines two orthogonal axes of extremal variance, which coincide with the directions of the eigenvectors v_max and v_min of the covariance matrix. In order to estimate the covariance matrix at resolution 2^j, let us denote the means of the coefficients W^1_{2^j} and W^2_{2^j} by μ_1 and μ_2, respectively. The covariance matrix is then calculated as follows:

C_{2^j} = E[ [W^1_{2^j} − μ_1, W^2_{2^j} − μ_2]^T [W^1_{2^j} − μ_1, W^2_{2^j} − μ_2] ] = [ σ_1², ρ_12 σ_1 σ_2 ; ρ_12 σ_1 σ_2, σ_2² ],   (2)

where ρ_12 is the correlation coefficient of W^1_{2^j} and W^2_{2^j}, and σ_1 and σ_2 are the standard deviations of W^1_{2^j} and W^2_{2^j}, respectively. The shape and orientation of the coefficient distribution at resolution 2^j is described by a Gaussian ellipse with covariance matrix C_{2^j} and mean [μ_1 μ_2]^T. The orientation θ_0 of its main semi-axis can be obtained from Mix (1999):

tan(2θ_0) = 2 ρ_12 σ_1 σ_2 / (σ_1² − σ_2²),   (3)

from which θ_0 follows by inverting the tangent. Fig. 1(c) and (d) shows Gaussian ellipses fitting the actual angular distri-
bution for isotropic and anisotropic texture images. Both ellipses were calculated at the finest resolution 2^1, and provide a visual indication of the degree of anisotropy of each texture. It should be noticed that often the joint distribution of coefficient values does not have zero means (i.e. μ_1 ≠ 0 and μ_2 ≠ 0), as illustrated in Fig. 1(d). In order to measure the distribution eccentricity quantitatively, we calculate the eigenvectors v_max and v_min from (2), as well as the corresponding eigenvalues λ_max and λ_min. The eigenvalues λ_max and λ_min define the semi-axes of a Gaussian ellipse aligned with the eigenvector directions. The distribution eccentricity e is given by the ratio of the eigenvalues:

e = λ_max / λ_min,   (4)

which provides an estimate of the texture anisotropy at scale 2^j. For example, the measured eccentricity e for the isotropic sample in Fig. 1(a) is 1.008, and for the anisotropic sample in Fig. 1(b) it is 1.227. It should be noticed that the eccentricity can be calculated at multiple resolutions.

2.2. Texture local graylevel variability in multiple resolutions

The texture local graylevel variability in multiple resolutions encodes important information about local density variability, and consequently about the material homogeneity. As discussed in the previous section, the stochastic texture image coefficients W^1_{2^j} f[n, m] and W^2_{2^j} f[n, m], for j = 1, 2, ..., J, or simply W^1_{2^j} f and W^2_{2^j} f, may be considered as Gaussian distributed (see Fig. 1). As a consequence, the corresponding distribution of magnitudes M_{2^j} f = √( (W^1_{2^j} f)² + (W^2_{2^j} f)² ), denoted simply by M_{2^j}, may be approximated by a Rayleigh probability density function (Larson and Shubert, 1979):

R_j(M_{2^j}) = ( M_{2^j} / b_j² ) exp( − M²_{2^j} / (2 b_j²) ),   (5)

where

b_j = √( 2 σ²_{M_{2^j}} / (4 − π) ),   (6)

and σ_{M_{2^j}} is the standard deviation of the gradient magnitudes at scale 2^j.
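A sketch gathering Eqs. (2), (4) and (6): the covariance of the two sub-bands yields the eccentricity, and the variance of the magnitudes yields the Rayleigh parameter. Function and variable names are assumptions made for the example.

```python
import numpy as np

def anisotropy_and_rayleigh(w1, w2):
    """Return the eccentricity e = lambda_max / lambda_min (Eq. (4)) of the
    bivariate Gaussian fitted to the coefficients, and the Rayleigh
    parameter b_j (Eq. (6)) of the gradient-magnitude distribution."""
    C = np.cov(np.stack([w1.ravel(), w2.ravel()]))  # 2x2 covariance, Eq. (2)
    lam = np.linalg.eigvalsh(C)                     # eigenvalues, ascending
    e = lam[1] / lam[0]                             # eccentricity, Eq. (4)
    m = np.hypot(w1, w2)                            # gradient magnitudes
    b = np.sqrt(2.0 * m.var() / (4.0 - np.pi))      # Eq. (6)
    return e, b
```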
Fig. 1. Comparative results for isotropic and anisotropic test samples. Stochastic texture images: (a) nearly isotropic and (b) anisotropic; Gaussian ellipses: (c) nearly isotropic and (d) anisotropic; histograms of local gradient magnitudes (solid-Rayleigh model): (e) nearly isotropic and (f) anisotropic.
Fig. 1(e) and (f) illustrates the Rayleigh model fitting the actual gradient-magnitude distributions. In fact, stochastic textures may vary in anisotropy, in local variability, or in both, at different resolutions, and a wide range of stochastic textures can be represented in terms of these features. Typically, samples of similar textures also have similar b_j parameters, and this information may be used for texture discrimination and similarity matching, as discussed later. A distance measure to compare these texture representations is introduced next.
3. A stochastic texture distance measure

In our approach, at each resolution, a texture image I_i is represented by a histogram with k disjoint intervals of equal length, {S_1, S_2, ..., S_k}. Under these circumstances, Do and Vetterli (2002) showed that the Kullback–Leibler distance ranks, or classifies, a texture I consistently with the maximum likelihood rule. The empirical distributions (i.e. data histograms) usually require large representation overheads (i.e. storing large numbers of histogram bins). Therefore, as detailed before, we model these distributions, at each resolution, by bivariate Gaussian G^{12}_j and Rayleigh R_j probability density functions. The Kullback–Leibler distance between two bivariate Gaussian distributions G(C_1, M_1) and G(C_2, M_2) is given by Yoshizawa and Tanabe (1999):

D_Gauss2 = (1/2) log( det(C_2) / det(C_1) ) + (1/2) trace( C_1 C_2^{−1} ) + (M_1 − M_2)^T C_2^{−1} (M_1 − M_2),   (7)

where C_1 and C_2 are the covariance matrices, M_1 and M_2 are the means of the bivariate distributions, and det(C) denotes the matrix determinant. On the other hand, the Kullback–Leibler distance between two Rayleigh distributions R(b_1) and R(b_2) is:

D_Rayl = log( b_2 / b_1 ) + (1/π) ( b_2² / b_1² + 1 ),   (8)

where b_1 and b_2 are the Rayleigh parameters of the two textures, respectively. The Kullback–Leibler distance has some nice properties. Its convexity guarantees that a minimum exists; and, in order to calculate the Kullback–Leibler distance from multiple data sets, such as from different sub-bands or feature sets, we can use the chain rule (Do and Vetterli, 2002). This rule states that the Kullback–Leibler distance D_KL between two pairs of conditional probability density functions is simply the sum of the distances between the corresponding probability density function marginals. Therefore, the proposed multi-resolution stochastic texture distance measure is:

D_j = Σ_{j=1}^{J} (1 − a) D̄^j_Gauss2 + a D̄^j_Rayl,   (9)

where D̄^j_Gauss2 (∈[0, 1]) and D̄^j_Rayl (∈[0, 1]) are the symmetric and normalized Kullback–Leibler distances between the bivariate Gaussian and univariate Rayleigh distributions at each resolution j, and a (∈[0, 1]) is a parameter that controls the weights attributed to D̄^j_Gauss2 and D̄^j_Rayl.
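A sketch of the distance of Eq. (9) under simplifying assumptions: the Gaussian term follows Eq. (7), while the Rayleigh term uses the standard closed-form Kullback–Leibler divergence between Rayleigh densities; the paper additionally symmetrizes and normalizes both terms per resolution, which is omitted here. All names are hypothetical.

```python
import numpy as np

def kl_gauss2(C1, M1, C2, M2):
    """Distance between bivariate Gaussians, following Eq. (7)."""
    d = 0.5 * np.log(np.linalg.det(C2) / np.linalg.det(C1))
    d += 0.5 * np.trace(C1 @ np.linalg.inv(C2))
    diff = M1 - M2
    return d + diff @ np.linalg.inv(C2) @ diff

def kl_rayleigh(b1, b2):
    """Closed-form KL divergence between Rayleigh(b1) and Rayleigh(b2),
    standing in for Eq. (8)."""
    return np.log(b2**2 / b1**2) + b1**2 / b2**2 - 1.0

def texture_distance(feats1, feats2, a=0.15):
    """Eq. (9): weighted sum over resolutions j = 1..J of the Gaussian and
    Rayleigh distances; feats* are lists of (C, M, b) tuples, one per scale."""
    return sum((1.0 - a) * kl_gauss2(C1, M1, C2, M2) + a * kl_rayleigh(b1, b2)
               for (C1, M1, b1), (C2, M2, b2) in zip(feats1, feats2))
```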
4. Experimental results

Stochastic texture images are widely used in the manufacture of foil-like materials such as non-woven textiles, paper, polymer membranes, and conductor and semiconductor coatings. Paper and non-woven textiles have a definite ‘‘grain’’ caused by the greater orientation of fibers in the machine direction and by the stress/strain imposed during pressing and drying. The directionality of such materials substantially affects their physical properties. Due to fluctuations and irregularities in system operation, there are corresponding variations in the makeup and orientation of fibers across the machine (i.e. web) (Smook, 1994). Consequently, the physical properties may vary in both the MD (machine direction) and the CD (cross direction), and it is often necessary to sample and test at several locations across the machine. The orientation of paper and non-woven textiles can be determined by testing the mechanical resistance
to tearing in the MD and CD. During the past two decades, automated non-woven testing became a new trend in industry, and a popular orientation test is the ‘‘TSI/TSO’’ test (Smook, 1994), which measures orientation via the sound propagation speed in the samples, in different orientations. All the orientation tests mentioned above are destructive, or require contact with the sample. Non-destructive and non-contact testing has been a challenge for researchers (Scharcanski and Dodson, 2000). To illustrate the performance of our approach in stochastic industrial texture classification, we used 315 distinct β-radiographic images of non-woven textile and paper samples, obtained from a standard industrial image database for non-woven textiles and paper (Dodson et al., 1995). We chose nine samples of each type, from a variety of forming machines, with different furnish and grammage² (e.g. among the 315 samples we have samples of headbox handsheets, repulped machine sheets, standard handsheets, board handsheets, gap formers, Fourdrinier formers, speedformers, tissue papers, and glass fiber mats). All these images have a resolution of 140 × 140 pixels, with a spatial resolution of 0.2 × 0.2 mm²/pixel (Dodson et al., 1995). Each type of texture indicates a particular production condition (i.e. in terms of operating parameters and furnish). The sample textures representing different operating conditions were classified, and the system's current condition could be identified. Potentially, this approach could help non-woven textile industry operators, who usually rely on ‘‘ad-hoc’’ texture interpretation methods, identify the current system condition. In our classification experiments we compared our approach (i.e. Eq. (9)) with the Spatial Gray Level Dependence Method (SGLDM)³ and Gabor filters⁴; in both cases the Euclidean distance was used as the metric. We also compared our approach
² Grammage is defined as mass density in g/m².
³ Each sample is represented by gray-level co-occurrence matrices for the 0°, 45° and 90° orientations; the matrices are described by five features: contrast, homogeneity, angular second moment, entropy and variance.
⁴ The Gabor features are the mean magnitudes of the 24 sub-bands (i.e. four scales and six orientations were used).
Table 1
Measured correct classification rate and computational costs. GRD, KL is our approach, i.e. the bivariate Gaussian and Rayleigh models with the Kullback–Leibler distance (a = 0.15); 2G, KL is the bivariate Gaussian model for the wavelet coefficients with the Kullback–Leibler distance; Gabor, Euclid. is the Gabor filter approach with the Euclidean distance; SGLDM, Euclid. is the SGLDM approach with the Euclidean distance; and Rayl., KL is the Rayleigh model for the gradient magnitudes with the Kullback–Leibler distance

GRD, KL: correct classification rate 0.8127; classification time 0.71 s; feature extraction time 1.37 s
2G, KL: correct classification rate 0.7746; classification time 0.66 s; feature extraction time 0.66 s
Gabor, Euclid.: correct classification rate 0.6508; classification time 0.33 s; feature extraction time 10.65 s
SGLDM, Euclid.: correct classification rate 0.6476; classification time 0.10 s; feature extraction time 15.99 s
Rayl., KL: correct classification rate 0.6476; classification time 0.10 s; feature extraction time 15.99 s
with the performance of its two individual components, i.e. the bivariate Gaussian model for wavelet coefficients and the Rayleigh model for gradient magnitudes, using as metrics the symmetric and normalized Kullback–Leibler distances for bivariate Gaussians and Rayleighs, respectively. Table 1 shows the correct classification rates obtained, indicating that our approach attains the highest correct classification rate (with a = 0.15). With the other texture representation approaches we obtained lower correct classification rates, at similar, or even higher, computational costs, as shown in Table 1. Computational costs were evaluated in terms of the running times (in seconds) of the Matlab routines implementing the above-mentioned methods (on a PIII 870 MHz notebook with 128 MBytes of RAM). We also illustrate our approach using five sets of eight samples equally spaced across the manufacturing web (i.e. 15 cm² each, spaced 50 cm from each other). Three sample sets represent standard operating conditions, and two sample sets represent deviations from the standard operating conditions (i.e. they were obtained before stopping production for maintenance). The central part of each sample was scanned using a scanner with a transparency unit, at 600 dpi, producing images of 1000 × 1100 pixels. Three consecutive dyadic scales were used (2^j, for j = 1, 2, 3) in the texture analysis. These experiments were conducted to
Fig. 2. Normal plots: (a) nearly isotropic and (b) anisotropic. Web uniformity testing results: (c) sample anisotropy ranking (our approach: solid; tensile test: dotted); (d) anisotropy measurements across the web (our approach: solid; tensile test: dotted); (e) distance D^acc_j profile (standard operating conditions); (f) distance D^acc_j profile (non-standard operating conditions).
evaluate web uniformity and to rank samples based on their anisotropy, as illustrated in Fig. 2(c) and (d). Our results correlate with the tensile test (ratio maximum/minimum tensile) in anisotropy measurements across the web (i.e. coefficient of correlation = 0.9059), and in sample ranking based on anisotropy (i.e. coefficient of correlation = 0.8844). We also verified that our texture analysis results correlate less with TSI/TSO measurements across the web (i.e. coefficient of correlation = 0.7825). However, it should be reported that we found variability in our results, which we attribute to several factors, such as: noisy laboratory mechanical measurements, light dispersion introduced by optical sample scanning, and sample mass density and thickness variability. It should be noticed that TSI/TSO focuses on sample sound propagation properties, which may not correlate well with the optical properties of the sample. Therefore, it was necessary to take averages of the measurements to reduce their variability. In order to quantify fluctuations and irregularities in system operation, it is often necessary to sample and test at several locations across the machine (i.e. web), and to estimate the web homogeneity (Smook, 1994). Given a set S of k samples across the web, the sum of pairwise distances between each sample s (∈S) and the remaining k − 1 samples in S, namely D^acc_j(s), can be used as a web homogeneity indicator:

D^acc_j(s) = Σ_{p=1}^{k} Σ_{j=1}^{J} (1 − a) D̄^j_Gauss2(s, p) + a D̄^j_Rayl(s, p).   (10)

A homogeneous web is characterized by a small mean and a small variability of D^acc_j(s) across the web. Fig. 2(e) and (f) shows that we obtain different D^acc_j(s) profiles for different operating conditions. Under regular operating conditions, we obtain smaller D^acc_j(s) values as well as smaller variability across the web, even at web extremities, where higher variability is expected (see Fig. 2(e)); on the other hand, for non-standard operating conditions, we obtain higher D^acc_j(s) values and variability, as illustrated in Fig. 2(f). It should be noticed that this web homogeneity test compares stochastic texture images, and only requires that imaging conditions remain the same.
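A sketch of the indicator of Eq. (10); `dist` is any multi-resolution texture distance such as the `texture_distance` sketch above, and the names are assumptions.

```python
def web_homogeneity_profile(samples, dist, a=0.15):
    """Eq. (10): for each sample s across the web, accumulate its
    multi-resolution distances to the remaining k - 1 samples; small values
    with small spread indicate a homogeneous web."""
    k = len(samples)
    return [sum(dist(samples[s], samples[p], a) for p in range(k) if p != s)
            for s in range(k)]
```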
5. Concluding remarks

In conclusion, we may say that stochastic texture images are acquired in large quantities in continuous industrial processes, and encode important quality and process information. Consequently, methods for objective stochastic texture interpretation and discrimination are important for a large segment of the industry. This work presented a new multi-resolution method for stochastic texture interpretation and discrimination based on the wavelet transform. Also, a multi-resolution distance measure for stochastic textures was proposed, and applications of our method in the non-woven textiles industry were reported. The preliminary experimental results obtained with our approach are encouraging, and our texture features tend to correlate well with industrial procedures. However, more research is needed to estimate formation anisotropy using texture analysis methods that are statistically robust and correlate better with the physical properties of stochastic materials. As future work, we intend to use our approach as a support for data mining operations in stochastic data repositories, with applications in preventive maintenance and personnel training.

Acknowledgments
The author thanks CNPq (Brazilian Research Council) for financial support, and Mr. Osmar Machado (Riocell, Brazil) for providing experimental data; thanks are due also to Professor Roberto da Silva (Instituto de Informatica, UFRGS, Brazil) and to Professor Robin T. Clarke (Instituto de Pesquisas Hidraulicas, UFRGS, Brazil) for advice and useful discussions.

References

Arivazhagan, S., Ganesan, L., 2003. Texture classification using wavelet transform. Pattern Recognition Lett. 24, 1513–1521.
Choi, H., Baraniuk, R.G., 2001. Multiscale image segmentation using wavelet-domain hidden Markov models. IEEE Trans. Image Process. 10, 1309–1321.
Do, M.N., Vetterli, M., 2002. Wavelet-based texture retrieval using generalized Gaussian density and Kullback–Leibler distance. IEEE Trans. Image Process. 11 (2), 146–158.
Dodson, C., Ng, W.K., Singh, R.R., 1995. Paper stochastic structure analysis—archive 2 (CD-ROM). University of Toronto, Toronto, Canada.
Fan, G., Xia, X., 2003. Wavelet-based texture analysis and synthesis using hidden Markov models. IEEE Trans. Circ. Syst.—I: Fundam. Theory Appl. 50 (1), 106–120.
Larson, H.J., Shubert, B.O., 1979. Probabilistic Models in Engineering Sciences, vol. 1. John Wiley & Sons, New York.
Mallat, S.G., Zhong, S., 1992. Characterization of signals from multiscale edges. IEEE Trans. Pattern Anal. Machine Intell. 14 (7), 710–732.
Mix, D.F., 1999. Random Signal Processing. Prentice-Hall, New York.
Romberg, J.K., Choi, H., Baraniuk, R.G., 2001. Bayesian tree-structured image modeling using wavelet-domain hidden Markov models. IEEE Trans. Image Process. 10, 1056–1068.
Scharcanski, J., Dodson, C.T.J., 2000. Local spatial anisotropy and its variability. IEEE Trans. Instrum. Measure. 49 (5), 971–979.
Smook, G.A., 1994. Handbook for Pulp and Paper Technologists, second ed. Angus Wilde Publications, Vancouver.
Wang, X.Z., 1999. Data Mining and Knowledge Discovery for Process Monitoring and Control. Springer-Verlag, London.
Wouver, G.V., Sheunders, P., Dyck, D.V., 1999. Statistical texture characterization from wavelet representations. IEEE Trans. Image Process. 8, 592–598.
Yoshizawa, S., Tanabe, K., 1999. Dual differential geometry associated with the Kullback–Leibler information on the Gaussian distributions and its 2-parameter deformations. SUT J. Math. 35 (1), 113–137.
Zhu, S.C., Wu, Y., Mumford, D., 1998. Filters, random fields and maximum entropy. Internat. J. Comput. Vision 27 (2), 1–20.
Pattern Recognition Letters 26 (2005) 1720–1731 www.elsevier.com/locate/patrec
Registration and retrieval of highly elastic bodies using contextual information ☆

J. Amores *, P. Radeva

Computer Vision Center, Dept. Informàtica, UAB, Bellaterra, Spain

Received 18 December 2004; received in revised form 18 December 2004
Available online 14 April 2005
Communicated by E. Backer
Abstract In medical imaging, comparing and retrieving objects is non-trivial because of the high variability in shape and appearance. Such variety leads to poor performance of retrieval algorithms only based on local or global descriptors (shape, color, texture). In this article, we propose a context-based framework for medical image retrieval on the grounds of a global object context based on the mutual positions of local descriptors. This characterization is incorporated into a fast non-rigid registration process to provide invariance against elastic transformations. We apply our method to a complex domain of images—retrieval of intravascular ultrasound images according to vessel morphology. Final results are very encouraging. 2005 Elsevier B.V. All rights reserved. Keywords: Retrieval; Contextual information; Registration; Elastic matching; Medical imaging; IVUS
1. Introduction

In the last decade, imaging technology has achieved strong progress in developing novel and powerful systems for image acquisition, processing and storage. Users are exploiting the opportunity to access and retrieve images in large and varied collections. Medical imaging represents a real content-based image retrieval (CBIR) application over large image repositories, where image retrieval can be of high importance in helping image diagnosis and therapy. Images constitute a big part of the clinical data; case-based reasoning (recovering similar pathological cases) is one of the usual diagnostic procedures; moreover,

☆ Work supported by Ministerio de Ciencia y Tecnologia of Spain, grant TIC2000-1635-C04-04.
* Corresponding author. Fax: +34 935811670.
E-mail address: [email protected] (J. Amores).
0167-8655/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2004.12.007
constructing digital atlases based on automatic retrieval of images with the same clinical interpretation is very useful as a didactic program. Medical image retrieval is still an emerging field. Because human organs can vary in position and appearance, CBIR systems based on global descriptors have poor performance. Some authors suggest using more elaborate approaches, recovering similar patient cases based on contextual information (Shyu et al., 1999; Hou et al., 1992; Petrakis and Faloutsos, 1997; Tagare et al., 1995). The most common descriptor is the attributed relational graph (ARG), which deals with objects and their spatial relations (Petrakis and Faloutsos, 1997). The common characteristic of the reported approaches is that they rely on very well segmented regions (often segmented by hand) (Shyu et al., 1999; Hou et al., 1992; Petrakis and Faloutsos, 1997; Tagare et al., 1995). Looking for context-based CBIR that avoids the need for precise object segmentation, we can mention the general CBIR approach presented by Huang et al. (1997), which uses a contextual descriptor called the color correlogram. This correlogram takes into account the spatial relations of local properties such as pixel colors. However, it is not conceived for considering relations between structures (such as pathological regions) of medical images. Another example of context-based general CBIR is presented by Belongie et al. (2002), who use correlograms for shape matching, but on binary images. Our goal is to develop CBIR on medical images where our objects are very elastic bodies, i.e. they can significantly vary in position, shape and appearance. In particular, our system analyzes intravascular ultrasound (IVUS) images, which represent cross-sectional views of the artery showing normal and diseased tissues (plaques) on the wall of the vessel (Europe, 1998). In order to represent the IVUS morphology, local descriptors are necessary to deal with small pathological regions. Information about the spatial arrangement of pathological regions should also be considered, due to its importance in diagnosis. This can be achieved by constructing contextual object (vessel) descriptors. Our CBIR system integrates all these characteristics: local, global and contextual information,
and the image comparison is invariant against elastic transformations. These characteristics are also important in general image retrieval (Smeulders et al., 2000). Regarding transformation invariance in medical imaging, only works focused on non-rigid registration (without considering retrieval) deal with smooth and elastic transformations, where traditional elastic matching methods are computationally very expensive (Bajcsy and Kovacic, 1989; Gee, 1999; Christensen et al., 1996). Due to this fact, works in medical image retrieval only use fast registration methods that do not enforce smooth or elastic alignments (Dahmen et al., 2000; Robinson et al., 1996; Liu et al., 2001). In this paper we present a novel CBIR system for highly elastic bodies that makes two main contributions. First, the object description is constructed in an optimal feature space that represents contextual information, needing just a weak object segmentation. In contrast to other context-based approaches (Huang et al., 1997; Belongie et al., 2002), we generalize the correlogram descriptor to deal with the mutual positions of different structures or parts of the object; hence (1) it incorporates all the types of information mentioned above (global, local and contextual); (2) it does not need accurate segmentations/classifications of the structures inside the image; and (3) it is flexible: it incorporates specific descriptors of the application domain, which is mandatory in medical images, where general descriptors perform poorly. The second contribution is that our approach provides invariance to highly elastic transformations, without allowing changes in the topology of the object, using a computationally efficient registration. Our fast registration approach combines the use of contextual information, thin-plate splines (TPS) applied on a sparse set of landmarks, and a feedback scheme that achieves an elastic and regular, smooth transformation. Using the generalized correlogram in the registration enforces matchings between parts with the same context and removes ambiguities in the possible matchings, leading to faster convergence. The paper is organized as follows: Section 2 describes the feature space, in Section 3 the registration algorithm is analyzed in detail, in Section 4 we explain the final distance used in image
comparison, in Section 5 results are provided for the different components of the system, and in Section 6 we conclude and discuss future lines of work.
2. Feature space

In order for the feature space to take into account all the types of information relevant to retrieving the image, we include local, global and contextual information by using generalized correlograms.

2.1. Local information

By using local information we aim at describing the different types of structures inside the image. In the IVUS case, the discriminating structures are placed around the wall of the vessel (Europe, 1998). A snake is placed at the center of the image after applying an anisotropic diffusion, and it is attracted to this wall. The set of landmarks is then obtained by sampling this snake (see Fig. 1). Associated with each landmark, a local feature vector is computed that describes the type of structure where the landmark lies. In this way we include specific information about our domain, as these local feature vectors are chosen to characterize the biological structures we deal with (Dy et al., 2003). For discriminating plaques and normal tis-
sue, a good descriptor is the gray-level profile along the normal to the wall at the landmark (Nair et al., 2002) (see Fig. 1). Empirically we chose a profile of 32 pixels, i.e. a feature vector of 32 dimensions. These feature vectors are classified and labels are assigned to each landmark, giving more compact local information. For doing so, non-parametric discriminant analysis (NDA) (Bressan and Vitria, 2003) and then K-nearest neighbors are applied. NDA reduces the dimension from 32 to 10, a number chosen empirically. A set of labelled descriptors is necessary for the K-NN computation. To generate them, a group of physicians segmented and labelled each biological tissue in the images. Based on this, we took landmarks located at these structures, extracted a feature vector for each landmark, and applied the corresponding label.

2.2. Global and contextual information

We incorporate the local information into a generalization of correlograms that provides this information along with contextual and global information about the image. Correlograms are histograms which not only measure statistics of the features of the image, but also take into account the spatial distribution of these features. We show here that, using a generalization of correlograms, we can encode the spatial relations between
Fig. 1. From left to right: IVUS, its anisotropic diffusion where landmarks are extracted, and gray-level profiles.
Let $C = \{p_i\}_{i=1}^{n}$, $p_i \in \mathbb{R}^2$, be a set of n landmarks. Let $l_i$ be the label of the type of structure where the landmark $p_i$ lies. Associated with $p_i$ we compute a generalized correlogram $h_i$ as follows. For every other landmark $p_j$ we consider its label $l_j$ and the spatial relation between $p_i$ and $p_j$ expressed in polar coordinates: $(p_i - p_j) = (\alpha_{ij}, r_{ij})$. We gather the label and the spatial relation into one triplet $(\alpha_{ij}, r_{ij}, l_j)$. Based on the triplets of the $n - 1$ landmarks $p_j$, $j \neq i$, the correlogram $h_i$ measures the joint distribution of local and spatial properties. This distribution is calculated by a histogram based on a quantization of the resulting space with dimensions angle, radius and label. Each dimension is partitioned separately. Let $A_u$, $u = 1, \ldots, n_u$, be the bins for the angles and let $R_v$, $v = 1, \ldots, n_v$, be the bins for the radius. The third dimension (the label) already represents a partition into $n_c$ clusters obtained by the classifier described previously for local descriptors. The correlogram $h_i$ is a histogram of dimension $n_u \times n_v \times n_c$ expressed as:

$$h_i(u, v, c) = \#\{(\alpha_{ij}, r_{ij}, l_j) : \alpha_{ij} \in A_u,\ r_{ij} \in R_v,\ l_j = c,\ j \neq i\},$$
$$u = 1, \ldots, n_u, \quad v = 1, \ldots, n_v, \quad c = 1, \ldots, n_c \qquad (1)$$

In this work we use the same log-polar spatial quantization for our correlogram as Belongie et al. (2002) use for their shape context, which makes the correlogram more sensitive to local context. Fig. 2(a) shows the spatial bins used in the correlogram associated with a landmark of an IVUS image. The landmarks in this image have been classified into two types of structures: those belonging to calcium plaque (squares) and those belonging to adventitia (circles). This correlogram has 12 intervals of angles and five intervals of radius. Fig. 2(b) shows a log-polar representation of the correlogram for each of the values in the third dimension: type of structure c = 1 and c = 2. In this plot, bins with a high density of landmarks from a particular type of structure are represented by a high gray level. This figure illustrates how the generalized correlogram measures the density over relative positions of the different types of structures, making it a contextual descriptor. The correlogram is made scale and orientation invariant by normalizing the radius $r_{ij}$ by the size of our object and by orienting the correlogram along the tangent of the contour (Belongie et al., 2002). The main disadvantage of the spatial quantization is that the resulting correlograms are not robust against large shape changes of the object. This lack of robustness is most pronounced before the images are registered; however, it can be avoided by using an appropriate feedback scheme in the registration, such as the one we explain later. Correlograms are special types of histograms, so an appropriate distance, used in this work, is the $\chi^2$ distance (Duda et al., 2001).
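For illustration, Eq. (1) can be computed directly with numpy. The sketch below is a simplified reading of the descriptor: the log-spaced radius bin edges, the normalization of the radii by their mean (as a proxy for the object size) and the omission of the tangent-based orientation are all assumptions.

```python
# Minimal sketch of the generalized correlogram h_i of Eq. (1).
# points: (n, 2) landmark coordinates; labels: (n,) integer structure labels
# in {0, ..., n_c - 1}. The exact bin edges used by the authors are not
# reproduced here; these follow the log-polar quantization in spirit.
import numpy as np

def generalized_correlogram(points, labels, i, n_u=12, n_v=5, n_c=2,
                            r_min=0.125, r_max=2.0):
    d = np.delete(points - points[i], i, axis=0)    # p_j - p_i for all j != i
    l = np.delete(labels, i)
    ang = np.mod(np.arctan2(d[:, 1], d[:, 0]), 2 * np.pi)   # alpha_ij
    rad = np.hypot(d[:, 0], d[:, 1])
    rad = rad / rad.mean()            # crude scale normalization (assumption)
    a_edges = np.linspace(0.0, 2 * np.pi, n_u + 1)
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), n_v + 1)
    u = np.clip(np.digitize(ang, a_edges) - 1, 0, n_u - 1)
    v = np.digitize(rad, r_edges) - 1
    ok = (v >= 0) & (v < n_v)         # radii outside the log-polar rings dropped
    h = np.zeros((n_u, n_v, n_c))
    np.add.at(h, (u[ok], v[ok], l[ok]), 1)  # count triplets per (A_u, R_v, c)
    return h
```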
Fig. 2. Correlogram: spatial bins (a) and log-polar representation (b).
3. Registration: obtaining invariance against elastic transformations

We obtain invariance against elastic deformations by registering the images before their comparison. The scheme followed in the registration is so-called point-mapping. First, a set of landmarks is extracted from each image. The landmarks are described in some feature space (in our case by correlograms), and a set of correspondences is computed that globally minimizes the distance between matching landmarks in this feature space. Finally, a transformation is obtained based on the correspondences. Our registration also includes a search strategy for the final transformation. Let $I_1$ and $I_2$ be two images, where $I_1$ is to be matched against $I_2$. Let $X = \{p_i\}_{i=1}^{n}$ be the set of landmarks from $I_1$ and $Y = \{q_i\}_{i=1}^{n}$ the set from $I_2$. Let $d_F(p_i, q_j)$ be the distance in the feature space
between landmarks $p_i$ and $q_j$. We seek the correspondence function $\varphi : \{1, \ldots, n\} \to \{1, \ldots, n\}$ that minimizes $\sum_{i=1}^{n} d_F(p_i, q_{\varphi(i)})$. This can be obtained by using an assignment optimization algorithm such as the Hungarian method (Papadimitriou and Steiglitz, 1982). This method regards the assignment problem as a bipartite graph matching problem: we have nodes in the first set to be linked with nodes in the second, each link has an associated weight expressing the cost of matching the two nodes, and we want to obtain the set of links (matches) that minimizes the total cost and results in a bijective mapping from the first set to the second. In our problem, nodes are represented by landmarks and weights by the distances $d_F(p_i, q_j)$ between landmarks. If we have n landmarks in each image, the cost of the algorithm is of order $O(n^3)$. This cost is manageable for a small set of landmarks representing the image. The set of correspondences $\varphi$ is not necessarily regular, because geometric restrictions are not considered in the Hungarian method (in the results this fact is illustrated clearly). A regular spatial transformation $T_\lambda$ is derived based on $\varphi$. Given $X = \{p_i\}$ and the corresponding set $Y_\varphi = \{q_{\varphi(i)}\}$, the transformation $T_\lambda : \mathbb{R}^2 \to \mathbb{R}^2$ maps X close to $Y_\varphi$ and varies smoothly in the rest of the plane $\mathbb{R}^2$. We use the thin-plate spline (TPS) as an efficient elastic transformation $T_\lambda$ that involves just inverting an $n \times n$ matrix, where n is the number of landmarks. The TPS transformation $T_\lambda$ is obtained by minimizing $\sum_i \|T_\lambda(p_i) - q_{\varphi(i)}\| + \lambda J$, where the first term forces approximation to $q_{\varphi(i)}$, the second term forces smoothness, and $\lambda$ represents a tradeoff between both terms. A high $\lambda$ provides a smooth mapping but a coarser approximation. J is based on the second derivatives of $T_\lambda$ (Bookstein, 1989) and represents the bending energy of the transformation. The regular transformation $T_\lambda$ is used to obtain a more regular next set of correspondences $\varphi$ using an iterative method. Initially, $\varphi$ is obtained based on the distances $d_F(p_i, q_j)$. Because $\varphi$ is not necessarily regular, we apply $T_\lambda$ with a high $\lambda$ (regularity term). $T_\lambda$ then does not approach $\varphi$ accurately (i.e. $T_\lambda$ maps $p_i$ far from $q_{\varphi(i)}$). We use $T_\lambda$ only to incorporate regularity into a new $\varphi$. This is done by recomputing the distance between landmarks
$d(p_i, q_j)$ and deriving a new $\varphi$ by the Hungarian method. We use the following updating formula for the distance: $d^{k+1}(p_i, q_j) = d_F(p_i, q_j) + \alpha \|T_\lambda^k(p_i) - q_j\|$, where $d^{k+1}(p_i, q_j)$ is the distance at iteration $k + 1$ and $T_\lambda^k(p_i)$ is the regular mapping for $p_i$ obtained at iteration k. The term $\|T_\lambda^k(p_i) - q_j\|$ is the spatial distance from the potential correspondence $q_j$ to the regular mapping $T_\lambda^k(p_i)$, and thus represents the amount of irregularity introduced by matching $q_j$ with $p_i$. The parameter $\alpha$ represents the tradeoff between the two terms: similarity in the feature space and regularity. With this updating formula, the next $\varphi$ is more regular, and the regularity parameter $\lambda$ can be decreased in the next iteration (so that the subsequent $T_\lambda$ is more accurate). As $T_\lambda$ becomes more accurate, we force the next $\varphi$ to be close to $T_\lambda$ by increasing $\alpha$. In this way, $\lambda$ is decreased and $\alpha$ is increased through the iterations, with an exponential rate of change typical of annealing schemes. Through this feedback, the neighbors of $p_i$ influence its correspondence through the term $T_\lambda^k(p_i)$, because the TPS considers the spatial dependencies between $p_i$ and its neighbors. We are therefore performing a (fast) relaxation-like cooperative search.
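The feedback loop just described can be sketched as follows. The Hungarian assignment is available as scipy's linear_sum_assignment; fit_tps is a hypothetical routine standing in for a regularized thin-plate-spline fit (Bookstein, 1989); and since the paper starts α at 0, starting its exponential schedule near zero is an approximation.

```python
# Sketch of the iterative correspondence/TPS feedback scheme (Section 3).
# P, Q: (n, 2) landmark sets; cost_F[i, j] = d_F(p_i, q_j) in feature space.
# fit_tps(P, Q_matched, lam) is hypothetical: it should return a callable
# T with T(P) -> (n, 2), fitted with smoothness weight lam.
import numpy as np
from scipy.optimize import linear_sum_assignment

def register(P, Q, cost_F, fit_tps, n_iter=8,
             lam0=1500.0, lam1=0.1, alpha0=1e-3, alpha1=0.25):
    lams = np.geomspace(lam0, lam1, n_iter)        # lambda decreases
    alphas = np.geomspace(alpha0, alpha1, n_iter)  # alpha increases
    cost = cost_F.copy()
    for lam, alpha in zip(lams, alphas):
        _, phi = linear_sum_assignment(cost)       # bijective matching, O(n^3)
        T = fit_tps(P, Q[phi], lam)                # smooth map toward matches
        # d^{k+1}(p_i, q_j) = d_F(p_i, q_j) + alpha * ||T(p_i) - q_j||
        spatial = np.linalg.norm(T(P)[:, None, :] - Q[None, :, :], axis=2)
        cost = cost_F + alpha * spatial
    return T, phi
```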
4. Similarity measure in the final comparison between images

The registration produces a transformation T which is regular and maps the characteristic points $p_i$ of $I_1$ close to their corresponding ones in $I_2$. However, the mapped points are not exactly the characteristic points $q_i$ of $I_2$. In order to obtain a regular final set of correspondences $\varphi$ from $\{p_i\}_{i=1}^{n}$ to $\{q_i\}_{i=1}^{n}$, we simply take the Euclidean distances between mapped points and destination points, $d(p_i, q_j) = \|T(p_i) - q_j\|$, and compute the correspondences using the Hungarian algorithm over this matrix of distances. Our similarity measure is based on the sum of three factors: the distance in the feature space, the amount of deformation necessary to align both objects by TPS (Bookstein, 1989), and a local appearance difference between the aligned image and the destination image.
The distance in the feature space is in our case the distance between the correlograms. We recompute these correlograms, orienting them now along the x axis of the image. This removes the dependence on the computation of the tangents, to which the correlograms are not very robust, now that the alignment makes rotation invariance unnecessary. Let $d_F(p_i, q_j)$ be the $\chi^2$ distance between the correlograms of $p_i$ and $q_j$. Denoting by bold d the global distance between images, we take as distance $\mathbf{d}_F(I_1, I_2)$ the bidirectional Chamfer distance in the feature space:

$$\mathbf{d}_F(I_1, I_2) = \frac{1}{n} \sum_{i=1}^{n} \min_j d_F(p_i, q_j) + \frac{1}{n} \sum_{j=1}^{n} \min_i d_F(p_i, q_j)$$

For computing the local appearance difference between both images, let $I_W$ be the image $I_1$ warped according to the transformation obtained from the registration. We take local windows around the mapped points $T(p_i)$ in $I_W$ and the matching points $q_{\varphi(i)}$ in $I_2$. The local appearance difference is expressed as:

$$d_A(I_1, I_2) = \sum_{i=1}^{n} \sum_{x=-w}^{w} \sum_{y=-w}^{w} G(\|(x, y)\|)\, [I_W(T(p_i) + (x, y)) - I_2(q_{\varphi(i)} + (x, y))]^2$$

where G(r) is a Gaussian-like function of the radius r, more sensitive to close positions. The warped image $I_W$ does not respect the original pattern of the textures, so it is better to remove them in the comparison. Therefore, we take as images $I_1$ and $I_2$ the anisotropic diffusions of the original images. Finally, the total distance between both images is computed as a combination of the distance components defined above: $\mathbf{d}(I_1, I_2) = a_F \mathbf{d}_F(I_1, I_2) + a_A d_A(I_1, I_2) + a_E E$, where E is the deformation energy of the TPS. The weights $a_F$, $a_A$, $a_E$ are computed as the ones minimizing the classification error on the IVUS database, following a leave-one-out procedure. The values obtained for these weights can be found in Section 5.5.
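A minimal sketch of this final distance follows, under the assumption that the per-landmark correlograms, the local appearance term $d_A$ and the TPS energy E have already been computed by the preceding steps.

```python
# Sketch of the final image distance (Section 4).
# H1, H2: per-landmark correlogram stacks, shape (n, n_u, n_v, n_c).
import numpy as np

def chi2(h, g, eps=1e-12):
    """Chi-square distance between two histograms (flattened)."""
    h, g = h.ravel(), g.ravel()
    return 0.5 * np.sum((h - g) ** 2 / (h + g + eps))

def chamfer_feature_distance(H1, H2):
    """Bidirectional Chamfer distance d_F(I1, I2) in the feature space."""
    D = np.array([[chi2(h, g) for g in H2] for h in H1])
    return D.min(axis=1).mean() + D.min(axis=0).mean()

def total_distance(d_F, d_A, E, a_F=0.4, a_A=0.25, a_E=0.35):
    """d(I1, I2) = a_F*d_F + a_A*d_A + a_E*E (average weights, Section 5.5)."""
    return a_F * d_F + a_A * d_A + a_E * E
```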
5. Results

In this section we present the results of applying each of the components of the registration
algorithm: the feature space, the feedback scheme, the final registration results and the final retrieval results. All the experiments have been conducted on a database of 100 IVUS images, all of them presenting calcium plaque structures. Studying the registration of images presenting calcium plaque is very interesting for the following reasons: first, there is great difficulty in differentiating between plaques and adventitia tissue; second, there is high variability in the shapes of both the entire vessel and the calcium plaque structures; and third, it has been clinically observed that the relative spatial position of the calcium is important in the diagnosis of heart diseases.

5.1. Performance of the correlograms

Our correlograms work with the labels of classified landmarks (see Section 2.1). For landmark classification we use a K-nearest neighbor classifier with K = 7, a parameter obtained experimentally. To exploit the complete dataset of 100 IVUS images, a procedure similar to leave-one-out is used: descriptors from a new image are classified based on the descriptors from the other 99 images. As we have 100 landmarks per image, each image has 100 descriptors (one per landmark). Therefore the set of 100 descriptors from a new image is classified using the 9900 descriptors from the rest of the images. The classification hit rate for landmarks is 90.1%. Fig. 3(a) and (b) show a couple of IVUS images to be registered: Fig. 3(a) displays the image $I_1$ to be aligned and Fig. 3(b) the destination image $I_2$. Fig. 3(c) and (d) show the anisotropic diffusion of (a) and (b), respectively. The thick curve represents the contour of the vessel from which the landmarks are extracted in each image. The image $I_1$ has two calcium plaques on both sides (indicated in Fig. 3(a)), and the image $I_2$ has three calcium plaques: two on both sides and one at the bottom (indicated in Fig. 3(b)). Taking into account global characteristics, the plaques on both sides should be matched in both images, leaving the small plaque at the bottom of image $I_2$ unmatched. We show that the global description is included in our correlograms by comparing the result of an initial coarse transformation using contextual information (correlograms) and then
Fig. 3. Couple of images to register (a,b). Their anisotropic diffusions (c,d).
Fig. 4. Intermediate alignment using contextual information (a,b), and using only local information (c,d).
using only local information (our local feature vectors). We show the transformation results on the anisotropic diffusion of the images because it is visually clearer. Fig. 4(a) shows the warped $I_1$ when using correlograms. Fig. 4(b) shows the edges of the warped $I_1$ superposed with thick lines onto the image $I_2$. Note that both calcium plaques of $I_1$ are mapped close to the big plaques of $I_2$. Fig. 4(c) and (d) show the warping result when using only local feature vectors. One of the mapped calcium plaques (indicated by a thick arrow in Fig. 4(d)) lies at an intermediate position between a big plaque and a small one (thin arrows). Using correlograms this matching is avoided, as the size characteristic is included.

5.2. Evaluation of the feedback scheme

The result of applying the feedback scheme is that the transformation becomes more and more accurate and, at the same time, the set of correspondences becomes more and more regular. The iterative feedback algorithm includes two parameters, $\alpha$ and $\lambda$, that change at an exponential rate (see Section 3). Experimentally we chose the following
values: $\alpha$ changes from 0 to 0.25 and $\lambda$ changes from 1500 to 0.1. The number of iterations used in the experiments is 8. Fig. 5(a) shows the initial set of correspondences in the registration of the pair of images of Fig. 3, and Fig. 5(b) shows the final set of correspondences. The initial set is very irregular, but holds information about the correct global matching. The final set is completely regular. Fig. 5(c) shows the final transformation: the contours of the aligned image are superposed with thick lines on the destination image. Note that the plaques (indicated by arrows) are completely aligned. We now provide a quantitative evaluation of the feedback scheme. Given one query image and one target image that belongs to the same category as the query, we want the query to be closely aligned to the target. Here we quantify how this alignment evolves over the iterations of the feedback scheme. We randomly take 100 pairs such that in every pair both images are from the same category. Let $(I_1, I_2)$ be one such pair: $I_1$ represents a possible query and $I_2$ an image from the same category as the query. Given a transformation T that aligns $I_1$ to $I_2$, we measure the goodness
Fig. 5. Evolution of the feedback scheme.
of this transformation by the spatial distance between the structures of the aligned $T(I_1)$ and the homologous structures of $I_2$. The concept of homologous structures is based on whether two structures are of the same type (e.g. both are plaque structures). The resulting measure represents the amount of misalignment of T. We also take another measure that considers the amount of irregularity of T, which we explain below. Both measures are based on the transformation T obtained after each iteration: let $T_i$ be this transformation after iteration i. We can then observe how the goodness of the alignment increases throughout the iterations of the algorithm. Let us express the amount of misalignment more concretely. Let $C_1$ be the set of landmarks of $I_1$, and $C_2$ the set of $I_2$. Given the transformation $T_i$, the misalignment for the type of structure s is expressed as:

$$E_s(I_1, I_2, T_i) = \max(E_s^1, E_s^2),$$
$$E_s^1 = \frac{1}{n_s^1} \sum_{p \in C_1,\, l(p) = s}\ \min_{q \in C_2,\, l(q) = s} \|T_i(p) - q\|,$$
$$E_s^2 = \frac{1}{n_s^2} \sum_{q \in C_2,\, l(q) = s}\ \min_{p \in C_1,\, l(p) = s} \|T_i(p) - q\| \qquad (2)$$

where $n_s^1$ is the number of landmarks in $I_1$ located at the type of structure s, $n_s^2$ is the corresponding number for the landmarks of $I_2$, and $l(p) = s$ expresses the fact that landmark p has label s. Given Eq. (2), the total misalignment of $I_1$ to $I_2$ given $T_i$ is the average misalignment considering all the types of structures in both images:

$$E(I_1, I_2, T_i) = \frac{1}{n_c} \sum_{s=1}^{n_c} E_s(I_1, I_2, T_i) \qquad (3)$$

where $n_c$ is the number of types of structure present in both $I_1$ and $I_2$.
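Eqs. (2) and (3) translate directly into a few lines of numpy; the sketch below assumes the landmark arrays and their a priori labels are given.

```python
# Sketch of the misalignment measures of Eqs. (2) and (3).
# TP1: (n1, 2) transformed landmarks T_i(p) of I1; P2: (n2, 2) landmarks of
# I2; l1, l2: their a priori structure labels.
import numpy as np
from scipy.spatial.distance import cdist

def misalignment(TP1, l1, P2, l2):
    labels = np.intersect1d(l1, l2)           # structure types in both images
    E_s = []
    for s in labels:
        D = cdist(TP1[l1 == s], P2[l2 == s])  # ||T_i(p) - q|| within label s
        E1 = D.min(axis=1).mean()             # Eq. (2), I1 -> I2 direction
        E2 = D.min(axis=0).mean()             # Eq. (2), I2 -> I1 direction
        E_s.append(max(E1, E2))               # E_s = max(E_s^1, E_s^2)
    return np.mean(E_s)                       # Eq. (3): average over s
```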
For evaluating Eq. (2) we use a priori knowledge of the correct type of structure of every landmark in both images. The amount of irregularity of the alignment $T_i$ is computed by the deformation energy of the thin-plate spline (see Section 3). For each couple $(I_1, I_2)$ the algorithm iterates eight times in this experiment, resulting in eight alignments $T_i$, i = 1, ..., 8. For each iteration the median is computed across the results on the 100 couples. Fig. 6 shows the median evolution of the amount of misalignment (Fig. 6(a)) and the amount of irregularity (Fig. 6(b)). The horizontal axes represent the iteration number i = 1, ..., 8, and the vertical axes represent the misalignment in (a) and the irregularity in (b). As can be seen, both the misalignment and the irregularity decrease throughout the iterations, due to the simultaneous maximization of accuracy and regularity in our feedback scheme (see Section 3). Finally, we show that the shape-context registration algorithm of Belongie et al. (2002), without any cooperative feedback scheme, leads to poor results when dealing with the types of images we have. We take the same couple displayed in Fig. 3 and use only the landmarks from one type of structure (calcium plaque in this case) for the registration. We do so because the shape context in (Belongie et al., 2002) is not suited to different types of structures. Fig. 7(a) and (b) show the disposition of the mentioned landmarks in both images; note the shape differences. Fig. 7(c) shows the final set of correspondences.
Fig. 6. Evolution of the misalignment (a) and irregularity (b) of the transformation.
Fig. 7. (a,b) Points from the plaque structures of Fig. 3(a) and (b). (c) Final set of correspondences obtained with the shape-context registration algorithm (Belongie et al., 2002).
The landmarks are not mapped close to their destinations, and the final correspondences are quite irregular. The main cause is that, with the high shape difference between both objects, we need to enforce some regularity in the correspondences step by step, or else the resulting transformation will map the points without preserving the spatial coherence. Correlograms remain valid for modelling contextual and global information (as shown by our result in Fig. 5), but only if we strengthen the spatial coherence of the mapping by some feedback algorithm such as the one explained above.

5.3. Registration results

We provide here the final registration results (i.e. the results after the last iteration). From our database of 100 IVUS images, there are 100 × 100 pairs of images, of which 1646 pairs are formed by homologous images (i.e. both images in the pair belong to the same category), not including pairs that contain the same image, e.g. $(I_i, I_i)$.
As homologous images should be aligned completely by the registration step, we take these 1646 couples for computing the statistics of the registration accuracy. We obtained a mean misalignment of 4.6 pixels, a median misalignment of 2.04 pixels and a standard deviation of 7.6 pixels. The mean distance between two neighboring landmarks is 3.1 pixels. Thus, the misalignment is just 1.5 times the distance between neighboring landmarks. Experimentally we have seen that a misalignment below 6 pixels is acceptable to the physicians; 75% of the registrations have a misalignment below 4.23 pixels.

5.4. Computational cost: scaling the system

The system has been implemented in Matlab code. On a Pentium IV at 2.4 GHz, the average time for all the steps prior to registration is 14 s. The bottleneck of the system is the registration done for every image in the database. The average
registration time per image is 12 s, so the total time spent for the 100 images in the database adds up to 20 min on average. This is not a high cost compared to the time spent by a physics-based registration using iterative PDEs, which lasts several hours. Scaling the system to a large number of images requires choosing prototypes representing all the images in the database. In this way, each category in the database is represented by a set of prototypes, so that each prototype represents the appearance of a set of images in the category and is thus a subcategory representative. Given a distance such as the one we propose, an automatic algorithm for prototype selection is given in (Belongie et al., 2002).

5.5. Retrieval results

In order to assess the retrieval efficiency, a database of 100 IVUS images has been used. For this database, a group of physicians grouped the images into categories according to clinical properties (see Amores et al., 2003). We extract each image from the database, present it as a query, and the system computes the distance from this query to every other image in the database. The distance is computed as explained in Section 4 on the 100 images, using a leave-one-out procedure. We obtained weights $a_F$, $a_A$, $a_E$ with average values 0.4, 0.25 and 0.35 respectively, so the stress is put on the distance between correlograms. Based on a particular query, the system orders the rest of the images of the database by similarity to this query. From this ordered list, we take only the first K images (i.e. the K images most similar to our query). Two measures of retrieval efficiency are used. The first one is the estimated number of images we need to retrieve in order to include an image from the same category as the query. We obtained an average of 2.33 images necessary to include one of the same category; for K = 2 retrieved images, the proportion of queries for which an image from the same category is included is 89.7%. The second measure is the recall vs scope (Huang et al., 1997): if a query Q has N images from the same category, we compute for Q:
$$E_Q(K) = \frac{\#\{I : \mathrm{rank}(I) \le K,\ \mathrm{category}(I) = \mathrm{category}(Q)\}}{N}$$
and we average over all the queries presented: $E(K) = \frac{1}{N_Q} \sum_Q E_Q(K)$, where $N_Q$ is the number of query images presented to the database. Our system is compared with other approaches using contextual information. Retrieval with shape contexts (Belongie et al., 2002) is only suitable for binary images. Huang et al. developed a correlogram that considers, for every pair of colors, the co-occurrence of pixels at a given distance from each other. They do not use the angle in the spatial relations, do not extract landmarks but use all the pixels, and use as local information pixel-level properties such as color (thus the type of structure cannot be used as local information). They report a better performance in color retrieval than other contextual descriptors such as color coherence vectors (Huang et al., 1997). In Table 1 the E(K) efficiency is shown for Huang's correlogram (computed on gray level for IVUS), for Huang's auto-correlogram, and for our context-based retrieval. Our method outperforms both types of contextual descriptors. Huang's correlogram is included for completeness, although this descriptor has proved to be slightly worse than the auto-correlogram (indeed, Huang et al. do not report results for the former descriptor). Although Huang's approach showed good performance on a non-medical color database, it performs poorly on IVUS images, and a more sophisticated method such as the one presented here is necessary for such a domain. For K = 10, we obtained 22% recall (see Table 1) and 36.5% precision, typical values for other non-medical CBIR applications. For K = 30 we obtained 50% recall at 27.7% precision, and for K = 50, 73% recall at 24% precision.
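For reference, the recall-vs-scope measure can be sketched as follows, assuming a precomputed matrix of image distances and an array of category labels; here the query itself is excluded from its own ranking, which is an assumption.

```python
# Sketch of the recall-vs-scope measure E(K) (Section 5.5).
# dist: (N, N) matrix of image distances; cat: (N,) category labels.
import numpy as np

def recall_vs_scope(dist, cat, K):
    scores = []
    for q in range(len(cat)):
        order = np.argsort(dist[q])
        order = order[order != q][:K]          # K most similar, query excluded
        n_same = np.sum(cat[order] == cat[q])  # relevant images retrieved
        n_avail = np.sum(cat == cat[q]) - 1    # relevant images available (N)
        scores.append(n_same / n_avail)        # E_Q(K)
    return float(np.mean(scores))              # E(K), averaged over queries
```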
Table 1
Recall vs scope measure E(K)

K     Huang's correlograms    Huang's auto-correlograms    Present method
10    0.15                    0.17                         0.22
30    0.36                    0.38                         0.50
50    0.54                    0.57                         0.73
Fig. 8. Example of optimal retrieval result (a) and worst result (b) for K = 3.
Fig. 8(a) shows three queries with high performance. In these queries, the first K = 3 retrieved images are all of the same category as the query. Fig. 8(b) shows three queries with low performance. Although in this example none of the retrieved images strictly belongs to the same category as the query, the similarity between the retrieved images and the queries is quite high. For six of the nine retrieved images, the degree of embracement of the plaque is the same in both the query and the retrieved image. Most of the time, the failure of the retrieval system is due to the interpretation of big false structures (in the local landmark classification). The correlogram fails if a big region of the image is entirely misclassified, as there is no valid information in that case. Still, the correlogram is robust when part (but not all) of the landmarks are misclassified. For example, if all the landmarks from a big region of the image are wrongly classified as adventitia structure, the correlogram fails because it interprets that there is adventitia structure in that region of the image. However, if part of the landmarks are wrongly classified as adventitia and a bigger part are correctly classified as plaque structure, then the correlogram is robust, because it measures the density of landmarks belonging to plaque in that region of the image.
6. Conclusions We have introduced a content-based retrieval system that deals with complex medical images
where contextual information is mandatory. We showed that by using generalized correlograms we can incorporate this contextual information so that we do not need manual segmentation, which differs from previous work (Shyu et al., 1999; Hou et al., 1992; Petrakis and Faloutsos, 1997; Tagare et al., 1995). We reported quantitative results on every component of the system and showed that our system outperforms other context-based approaches such as the one by Huang et al. (1997). The important contributions are as follows. First, we introduced a new definition of the correlogram that extends the descriptor to deal with the context of structures. Using this type of correlogram we avoid the need for very accurate segmentations of the structures. Our correlogram can easily incorporate specific information about the medical domain, which is fundamental in medical image retrieval. Second, we achieved invariance against elastic transformations. We used an efficient registration that achieves elastic and accurate alignments that are at the same time smooth, which differs from other medical retrieval systems (Dahmen et al., 2000; Robinson et al., 1996; Liu et al., 2001). Our method combines the use of thin-plate splines, efficient compared to solving a Navier-Stokes PDE (Bajcsy and Kovacic, 1989; Gee, 1999; Christensen et al., 1996), and a feedback scheme that enforces smoothness and accuracy. As future work, faster indexing methods for correlograms are under investigation. An analysis
of the impact of choosing a small set of prototypes for large databases would also be of high interest.

References

Amores, J., Radeva, P., 2003. Elastic matching retrieval in medical images using contextual information. Technical Report, CVC, September 2003.
Bajcsy, R., Kovacic, S., 1989. Multiresolution elastic matching. Comput. Vision Graphics Image Process. 46 (1), 1-21.
Belongie, S., Malik, J., Puzicha, J., 2002. Shape matching and object recognition using shape contexts. IEEE Trans. PAMI 24 (4), 509-522.
Bookstein, F.L., 1989. Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. PAMI 11 (6).
Bressan, M., Vitria, J., 2003. Nonparametric discriminant analysis and nearest neighbor classification. Pattern Recognition Lett. 24 (15), 2743-2749.
Christensen, G.E., Rabbitt, R.D., Miller, M.J., 1996. Deformable templates using large deformation kinematics. IEEE Trans. Image Process. 5 (10).
Dahmen, J., Theiner, T., Keysers, D., Ney, H., Lehmann, T., Wein, B., 2000. Classification of radiographs in the image retrieval in medical applications system (IRMA). In: Proc. 6th Internat. RIAO Conf. on Content-Based Multimedia Information Access, Paris, pp. 551-566.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification. John Wiley & Sons.
Dy, J.G., Brodley, C.E., Kak, A., Broderick, L.S., Aisen, A.M., 2003. Unsupervised feature selection applied to content-based retrieval of lung images. IEEE Trans. PAMI 25 (3).
Europe, B.S. (Ed.), 1998. Beyond Angiography. Intravascular Ultrasound: State of the Art, vol. 1. XX Congress of the ESC.
Gee, J.C., 1999. On matching brain volumes. Pattern Recognition 32, 99-111.
Hou, T.-Y., Liu, P., Hsu, A., Chiu, M.-Y., 1992. Medical image retrieval by spatial features. IEEE Internat. Conf. Syst. Man Cybernet. 2, 1364-1369.
Huang, J., Kumar, S., Mitra, M., Zhu, W., Zabih, R., 1997. Image indexing using color correlograms. In: Proc. CVPR, pp. 762-768.
Liu, Y., Dellaert, F., Rothfus, W.E., Moore, A., Schneider, J., Kanade, T., 2001. Classification-driven pathological neuroimage retrieval using statistical asymmetry measures. In: Proc. MICCAI '01, Utrecht, The Netherlands, pp. 655-665.
Nair, A., Barry, D.K., Tuzcu, E.M., Schoenhagen, P., Nissen, S.E., Vince, D.G., 2002. Coronary plaque classification with intravascular ultrasound radiofrequency analysis. Circulation 22, 2200-2206.
Papadimitriou, C., Steiglitz, K., 1982. Combinatorial Optimization: Algorithms and Complexity. Prentice Hall.
Petrakis, E.G.M., Faloutsos, C., 1997. Similarity searching in medical image databases. IEEE Trans. Knowledge Data Engrg. 9 (3).
Robinson, G.P., Tagare, H.D., Duncan, J.S., Jaffe, C.C., 1996. Medical image collection indexing: shape-based retrieval using kd-trees. Comput. Med. Imaging Graphics 20 (4), 209-217.
Shyu, C.R., Brodley, C.E., Kak, A.C., Kosaka, A., Aisen, A.M., Broderick, L.S., 1999. ASSERT: a physician-in-the-loop content-based retrieval system for HRCT image databases. Comput. Vision Image Understand. 75 (1-2), 111-132.
Smeulders, A.W., Worring, M., Santini, S., Gupta, A., Jain, R., 2000. Content-based image retrieval at the end of the early years. IEEE Trans. PAMI 22 (12), 1349-1380.
Tagare, H.D., Vos, F.M., Jaffe, C.C., Duncan, J.S., 1995. Arrangement: a spatial relation between parts for evaluating similarity of tomographic section. IEEE Trans. PAMI 17 (9), 880-893.
Pattern Recognition Letters 26 (2005) 1732–1739 www.elsevier.com/locate/patrec
Symmetry parameters for 3D pattern classification

Frédérique Robert-Inacio *

Department of Physics, ISEN-Toulon/L2MP CNRS UMR 6137, Place Georges Pompidou, F-83000 Toulon, France

Received 22 September 2003; received in revised form 12 October 2004
Available online 7 April 2005
Communicated by E. Backer
Abstract

Winternitz's parameter makes it possible to evaluate a degree of symmetry for a given 2D convex body with respect to a particular point, this point being the best-centered point according to symmetry criteria. The generalization of this parameter to 3D is studied, and a second 3D parameter is derived from the 2D Winternitz parameter. © 2005 Elsevier B.V. All rights reserved.

Keywords: Pattern recognition; Shape classification; Symmetry parameter; Central symmetry; Winternitz's parameter
* Tel.: +33 494 03 89 97; fax: +33 494 03 89 51. E-mail address: [email protected]

1. Introduction

1.1. Overview of some existing computational methods for the evaluation of symmetry

In many fields, evaluating a degree of symmetry for the objects under study is a good way of classifying them. For example, the degree of symmetry with respect to an axis or a point is a means of evaluating the beauty of shapes in architecture. In the recent literature, symmetry has been considered as a good geometric property to classify
shapes. Actually, the human eye is very sensitive to the degree of symmetry of a given pattern in 2D or 3D. These shapes can be well defined, as in our case, or not (Sun, 1995; Wang and Suter, 2003). For example, symmetry can be useful in the detection of particular shapes, such as circles or ellipses (Ho and Chen, 1995). In the nineties, most works were based on the detection of an axis (or a plane in 3D) or a center of symmetry. In this way, a shape was classified as symmetric, almost symmetric or not symmetric (Marola, 1989; Sun and Si, 1999; Wolter et al., 1985). This approach is very restrictive, as it gives only a qualitative estimation of symmetry. That is why a measure of symmetry with respect to an axis in 2D or a plane in 3D can be set up and considered as a
signature for shapes (Kazhdan et al., 2002, 2004). Such measures are more discriminant than a qualitative estimation, as shapes can be ordered according to the values of the symmetry measure. Our work is similar to those previous ones, which define a measure of reflective symmetry, in that it sets up a measure of central symmetry for 2D and 3D shapes, and not only a qualitative estimation.

1.2. The Winternitz parameter as a symmetry measure

Several mathematicians established theoretical definitions and properties for symmetry parameters at the beginning of the last century. These parameters give an estimation of the degree of symmetry for shapes and a localization of the best-centered point according to symmetry criteria. In image processing, some of them are very useful for studying planar objects, such as Besicovitch's, Minkowski's, Blaschke's and Winternitz's parameters (Grünbaum, 1963). The properties of these parameters are well known in two dimensions, but their extensions to the third dimension are less studied. In this study, the symmetry parameter makes it possible to compute a symmetry coefficient value giving an estimation of the degree of symmetry of a given shape in 2D or 3D. The considered shapes are well defined as sets of points, and their degree of symmetry is evaluated with respect to a particular point called the characteristic point. This characteristic point is the center of symmetry for centrally symmetric convex bodies. Fig. 1 shows an example of a polygonal shape with a center of symmetry O. In this paper, we study a shape parameter that makes it possible to evaluate the degree of symmetry of a given shape and its best-centered point. The aim is to determine symmetry or quasi-symmetry features for convex shapes. We chose to pay more attention to the Winternitz parameter, first in 2D and then in 3D. Afterwards, we present two different 3D parameters inspired by the 2D Winternitz parameter. The first one is directly derived from the generalized definition for an n-dimensional space, whereas
Fig. 1. Polygonal shape with center of symmetry O.
the second one is given by an original formula using the 2D Winternitz map as a basis of calculation. Results are given for some classical convex bodies: theoretical results for some of them and experimental results for all of them. Setting up the implementation of these symmetry parameters is one of the original points of the paper, because it leads to solving some problems related to sampling, especially in the choice of angles.

2. The Winternitz parameter in two dimensions

Grünbaum (1963) recalled from Winternitz (1923) a definition of the Winternitz parameter, which makes it possible to evaluate a degree of symmetry for a given object. He described how to determine a particular point of the object, called the characteristic point, corresponding to the best-centered point C according to symmetry criteria. This degree of symmetry is a value between 0 and 1. If the considered shape is centrally symmetric, the parameter is equal to 1 and the characteristic point is the center of symmetry. Let us recall the definition of the Winternitz parameter in 2D.

2.1. Definition

Let K be a convex body and μ the surface measure on $\mathbb{R}^2$.
For each direction $\theta \in [0, 2\pi[$ and each point $x \in K$, we define a partition of K by considering the straight line $D(x, \theta)$ passing through x with direction $\theta$ (see Fig. 2). This line cuts K into two disconnected parts: $w_l(K, x, \theta)$, the left part with respect to $D(x, \theta)$, and $w_r(K, x, \theta)$, the right part. Then, for each point $x \in K$, the Winternitz map $w_K(x)$ is given by the following formula:

$$w_K : K \to \mathbb{R}^+, \quad x \mapsto w_K(x) = \inf\left\{ \frac{\mu(w_r(K, x, \theta))}{\mu(w_l(K, x, \theta))} : \theta \in [0, 2\pi[ \right\} \qquad (1)$$

The Winternitz parameter is then given by:

$$W(K) = \sup(w_K(x),\ x \in K) \qquad (2)$$

2.2. Properties of this parameter

As for other measures of symmetry:
• $0 \le W(K) \le 1$ for all $K \subset \mathbb{R}^2$.
• $W(K) = 1$ if and only if $K \subset \mathbb{R}^2$ has a center of symmetry.
• $W(K) = W(T(K))$ for every $K \subset \mathbb{R}^2$ and every non-singular affine transformation T.

More precisely, $4/5 \le W(K) \le 1$, and the lower bound is reached for triangles (Grünbaum, 1960; Hammer, 1963). For most symmetry parameters, the values for triangles are equal to the lower bound; triangles are thus the least centrally symmetric sets.

2.3. Characteristic point

There exists a particular point $x_W$ at which the value W(K) is reached (Süss, 1950). We can prove that this point is unique. In this way, a duality between the Winternitz parameter value and the best-centered point is pointed out. In order to prove the unicity of the characteristic point, we must establish that the sequence of sets $L_a$ defined by:

$$L_a = \{x \in K;\ w_K(x) \ge a\} \quad \forall\, 0 < a \le 1 \qquad (3)$$

is a decreasing sequence of strictly convex sets as a increases. This sequence converges to the characteristic point, with $L_a \neq \emptyset$. The whole proof was given by Fillère (1995).
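As a concrete illustration, Eqs. (1) and (2) admit a direct discretization on a binary image. The following sketch is illustrative only: the finite sampling of directions and the use of pixel counts as the area measure μ are assumptions of the discretization, and the exhaustive scan over x is costly (which anticipates the computation-time issues discussed in Section 5).

```python
# Discrete sketch of the 2D Winternitz map (Eq. (1)) and parameter (Eq. (2)).
# mask: 2-D boolean array, True inside the convex body K.
import numpy as np

def winternitz_map_at(mask, x, n_theta=36):
    pts = np.argwhere(mask).astype(float) - np.asarray(x, float)
    best = np.inf
    # Directions theta and theta + pi swap the two parts, so sweeping
    # [0, pi[ and taking min(ratio, 1/ratio) covers the inf over [0, 2*pi[.
    for theta in np.linspace(0.0, np.pi, n_theta, endpoint=False):
        s = -pts[:, 0] * np.sin(theta) + pts[:, 1] * np.cos(theta)
        left, right = np.sum(s > 0), np.sum(s < 0)   # pixel-count areas
        if left and right:
            best = min(best, left / right, right / left)
    return best

def winternitz_parameter(mask, n_theta=36):
    xs = np.argwhere(mask)            # exhaustive scan: O(|K|^2) overall cost
    vals = [winternitz_map_at(mask, x, n_theta) for x in xs]
    i = int(np.argmax(vals))
    return vals[i], tuple(xs[i])      # W(K) and the best-centered point x_W
```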
Some results are given in Table 1 for well-known convex sets (Labouré et al., 1993). For the first three shapes, the theoretical value of W(K) is 1, and for the triangle it is 0.8; the computed values are very close to the theoretical ones.

Table 1
Results for the 2D Winternitz parameter

Convex sets K                  Centroid of K   Point x_W    W(K)
Disk (r = 100)                 (170, 259)      (170, 261)   0.959
Square (170 × 170)             (170, 239)      (170, 242)   0.963
Rectangle (120 × 180)          (170, 249)      (170, 252)   0.964
Reuleaux triangle (r = 120)    (170, 277)      (170, 274)   0.886
Triangle                       (170, 274)      (171, 274)   0.786
Half-disk + half-square        (182, 239)      (176, 242)   0.902
Fig. 2. Partition of the convex body K according to the straight line D(x, θ).
Fig. 3. Winternitz map for (a) a polygon, (b) an ellipse, (c) a triangle, (d) a hexagon.
Fig. 3 shows the Winternitz map for three polygons and an ellipse. The computed coefficient values are 0.981 for the polygon, 0.988 for the ellipse, 0.842 for the triangle and 0.973 for the hexagon. White points are the best-centered points. For the other points, the map value estimates how well centered the point is with respect to central symmetry.
3. The Winternitz parameter in three dimensions

3.1. Definition

More generally, we can take μ as a volume measure on $\mathbb{R}^3$. We can extend the partition of K defined by a straight line $D(x, \theta)$ in 2D to a partition of K by planes in 3D. For each $x \in K$, we consider the planes $H(x, \theta, \phi)$ cutting the space into two half-spaces, $\theta \in [0, 2\pi[$, $\phi \in [0, \pi[$. Then K is cut into two parts, $w_r(K, x, \theta, \phi)$ and $w_l(K, x, \theta, \phi)$ (see Fig. 4). The Winternitz map is given by the following formula:

$$w_K : K \to \mathbb{R}^+, \quad x \mapsto w_K(x) = \inf\left\{ \frac{\mu(w_r(K, x, \theta, \phi))}{\mu(w_l(K, x, \theta, \phi))} : \theta \in [0, 2\pi[,\ \phi \in [0, \pi[ \right\} \qquad (4)$$

and the coefficient value by:

$$W(K) = \sup(w_K(x),\ x \in K) \qquad (5)$$
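The discretization of Eq. (4) is analogous to the 2D case, with planes indexed by sampled normals (θ, φ) and volumes counted in voxels. The sketch below deliberately uses a small direction sample, in the spirit of the six directions used for the first results of Section 5; the sampling itself is an assumption.

```python
# Discrete sketch of the 3D Winternitz map of Eq. (4).
# vox: 3-D boolean array, True inside the convex body K.
import numpy as np

def winternitz_map_at_3d(vox, x, n_theta=6, n_phi=3):
    pts = np.argwhere(vox).astype(float) - np.asarray(x, float)
    best = np.inf
    for theta in np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False):
        for phi in np.linspace(0.0, np.pi, n_phi, endpoint=False):
            # Unit normal of the cutting plane H(x, theta, phi).
            n = np.array([np.sin(phi) * np.cos(theta),
                          np.sin(phi) * np.sin(theta),
                          np.cos(phi)])
            s = pts @ n                       # signed side of each voxel
            a, b = np.sum(s > 0), np.sum(s < 0)   # voxel-count volumes
            if a and b:
                best = min(best, a / b, b / a)
    return best
```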
3.2. Properties of this parameter

The usual properties of measures of symmetry are verified. It was proved (Grünbaum, 1963) that the extremal bodies for which the lower bound is reached are simplexes, in other words tetrahedrons. In this case, the exact lower value has not been determined; all we know is that it is greater than or equal to 27/37 (Ehrart, 1955). The property that W(K) = 1 if and only if K has a center of symmetry
Fig. 4. Partition of K in 3D.
was established by Funk (1915), Blaschke (1923) and Bonnesen and Fenchel (1934).

4. A new parameter called P

4.1. Definition

This parameter P is directly derived from the 2D Winternitz parameter. The idea is to consider, for every plane $H_i$, its intersection $K_i$ with K, which is a 2D convex body. We then compute the 2D Winternitz map $w_{K_i}(x)$ associated with every set $K_i$, $i \in I$, where $I = [0, 2\pi[ \times [0, \pi[$ and $i = (\theta, \phi)$ (see Fig. 5). The parameter P is given by:

$$P(K) = \sup(\inf(w_{K_i}(x),\ i \in I),\ x \in K) \qquad (6)$$

The associated map is defined by:

$$p_K(x) = \inf(w_{K_i}(x),\ i \in I) \qquad (7)$$

and the shape parameter P by:

$$P(K) = \sup(p_K(x),\ x \in K) \qquad (8)$$
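A sketch of the map $p_K$ follows. It samples each plane $H_i$ through x on an in-plane grid (the grid size and direction counts are assumptions) and applies the 2D Winternitz ratio within the slice; P(K) of Eq. (8) is then simply the maximum of this map over $x \in K$.

```python
# Sketch of the map p_K(x) of Eq. (7) on a voxelized convex body.
# vox: 3-D boolean array, True inside K; x: a point of K.
import numpy as np

def p_map_at(vox, x, n_dirs=6, n_lines=12, half=64):
    x = np.asarray(x, float)
    grid = np.arange(-half, half + 1)
    A, B = np.meshgrid(grid, grid, indexing="ij")    # in-plane coordinates
    best = np.inf
    for theta in np.linspace(0.0, 2 * np.pi, n_dirs, endpoint=False):
        for phi in np.linspace(0.0, np.pi, max(n_dirs // 2, 1), endpoint=False):
            n = np.array([np.sin(phi) * np.cos(theta),
                          np.sin(phi) * np.sin(theta), np.cos(phi)])
            u = np.cross(n, [0.0, 0.0, 1.0])
            if np.linalg.norm(u) < 1e-9:             # n parallel to the z axis
                u = np.cross(n, [0.0, 1.0, 0.0])
            u /= np.linalg.norm(u)
            v = np.cross(n, u)                       # (u, v) spans the plane H_i
            pts = x + A[..., None] * u + B[..., None] * v
            idx = np.round(pts).astype(int)
            ok = np.all((idx >= 0) & (idx < np.array(vox.shape)), axis=-1)
            inside = np.zeros(ok.shape, bool)
            inside[ok] = vox[tuple(idx[ok].T)]       # slice K_i of K by H_i
            a, b = A[inside], B[inside]
            w = np.inf
            for psi in np.linspace(0.0, np.pi, n_lines, endpoint=False):
                s = -a * np.sin(psi) + b * np.cos(psi)
                l, r = np.sum(s > 0), np.sum(s < 0)
                if l and r:
                    w = min(w, l / r, r / l)         # 2D Winternitz ratio in K_i
            best = min(best, w)                      # Eq. (7): inf over i
    return best
```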
4.2. Properties of this parameter

• $0 \le P(K) \le 1$ for all $K \subset \mathbb{R}^3$.
• $P(K) = 1$ if and only if $K \subset \mathbb{R}^3$ has a center of symmetry.

Proof.
• $0 \le P(K) \le 1$ for all $K \subset \mathbb{R}^3$. Obviously, $0 \le P(K)$, as P(K) is the supremum of a set of positive values.
Fig. 5. A 2D-convex set Ki corresponding to a plane Hi.
Furthermore, these positive values are less than or equal to 1 by definition, so $P(K) \le 1$.
• $P(K) = 1$ if and only if $K \subset \mathbb{R}^3$ has a center of symmetry:

$$P(K) = 1 \iff \sup(p_K(x),\ x \in K) = 1$$
$$\iff \exists x_0 \in K,\ p_K(x_0) = 1$$
$$\iff \exists x_0 \in K,\ \inf(w_{K_i}(x_0),\ i \in I) = 1$$
$$\iff \exists x_0 \in K,\ \forall i \in I,\ w_{K_i}(x_0) = 1$$
$$\iff \exists x_0 \in K,\ \forall i \in I,\ x_0 \in K_i \text{ and } x_0 \text{ is the center of symmetry of } K_i$$
$$\iff x_0 \text{ is the center of symmetry of } K.$$
5. Results of computation

5.1. First results

We consider three 3D convex bodies: a symmetric polyhedron, a tetrahedron and a pyramid. The computed values for W and P in 3D are given in Tables 2 and 3. These results can be improved by taking into account more directions: in Table 2, the coefficient values are computed with only six directions. The coefficient values are rough approximations, although the computed point is very close to the centroid
(the centroid is a particular point very close to the characteristic points of most symmetry parameters). Furthermore, we must deal with long computation times. For example, when considering 3D shapes included in a cube with edges of 128 pixels and only six directions for θ, the computation of W(K) or P(K) and their associated characteristic points takes about eight or nine hundred times as long as the computation of the centroid. When considering more directions, the computation time is proportional to the number of planes. These computation times are given for the determination of the whole parameter maps in 3D. Although only a few directions are considered, a good approximation of the characteristic point position is obtained. This is induced by the fact that the coefficients P and W are computed as the supremum of infimum values: as the characteristic point is scanned for every chosen direction, the supremum value is reached at this point, or in its very close neighborhood, at least once. Furthermore, this point being the best-centered point means that it is the best-centered point for several planes (maybe every plane, as was proved when the convex body has a center of symmetry). So, by considering only a few planes, we nevertheless find a computed characteristic point close to the theoretical one, which is close to the centroid.
Table 2
Results for the 3D Winternitz parameter W

Shapes K       Centroid of K   Th. W(K)   Comp. W(K)   Comp. point
Polyhedron     (32, 64, 64)    1          0.912        (31, 63, 64)
Tetrahedron    (6, 79, 81)     >27/37     0.901        (5, 82, 84)
Pyramid        (5, 63, 62)     ?          0.909        (8, 60, 62)

Th.: theoretical value, Comp.: computed value.
Table 3
Results for the 3D parameter P

Shapes K       Centroid of K   Th. P(K)   Comp. P(K)   Comp. point
Polyhedron     (32, 64, 64)    1          0.944        (32, 64, 64)
Tetrahedron    (6, 79, 81)     ?          0.745        (5, 80, 84)
Pyramid        (5, 63, 62)     ?          0.774        (10, 62, 66)

Th.: theoretical value, Comp.: computed value.

5.2. Shape classification

As a symmetry parameter evaluates the degree of symmetry of a shape, classification can be set up by computing this parameter and determining a threshold value that cuts the shapes under study into two subsets: on the one hand the more symmetric shapes, and on the other the less symmetric ones. The threshold value should be fixed according to the application and the precision it requires.

5.3. Improvements of the algorithms

We can make a coarse approximation of the characteristic point position by considering only a few planes; the calculation can then be made more accurate in a neighborhood of this first estimation.
That could be a good way of decreasing the computation time, while still obtaining a coarse evaluation of the whole parameter map. So, a first improvement consists in computing a coarse estimation of the parameter map and then focusing on the obtained characteristic point, in order to evaluate more accurate values for the parameter and its characteristic point position. However, most of the computation time is spent in the determination of this parameter map, and computing the whole map is not very useful for shape classification. We know that the centroid is close to the theoretical characteristic point. So, the evaluation of the parameter can be made very efficient by considering a neighborhood of the centroid, where the map is computed accurately using a large sample of directions; the coefficient value is then deduced from it, as well as the characteristic point position. This second method was used with eighteen directions and a centroid neighborhood of thirteen-pixel width. Thirteen pixels represent about 10% of the edge of the cube containing the shapes under study. The results are given in Table 4 for W(K) and in Table 5 for P(K). The computation time decreases to less than fifty times the computation time of the centroid.
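This centroid-neighborhood strategy reduces to scanning a small window around the centroid. The sketch below illustrates it under the stated settings; eval_map_at stands for any of the per-point map routines sketched earlier, and this indirection is an assumption rather than the author's interface.

```python
# Sketch of the centroid-neighborhood evaluation (Section 5.3).
# eval_map_at(vox, x) -> float stands for a per-point map routine, e.g. one
# of the sketches above evaluated with a large sample of directions.
import numpy as np
from itertools import product

def estimate_parameter(vox, eval_map_at, half_width=6):
    c = np.argwhere(vox).mean(axis=0).round().astype(int)   # centroid of K
    best_val, best_pt = -np.inf, None
    offs = range(-half_width, half_width + 1)   # 13-voxel-wide window
    for d in product(offs, repeat=3):
        x = c + np.array(d)
        if np.all((x >= 0) & (x < np.array(vox.shape))) and vox[tuple(x)]:
            val = eval_map_at(vox, x)
            if val > best_val:
                best_val, best_pt = val, tuple(x)
    return best_val, best_pt    # coefficient value and characteristic point
```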
Table 4
Improved results for the 3D Winternitz parameter W

Shapes K       Centroid of K   Th. W(K)   Comp. W(K)   Comp. point
Polyhedron     (32, 64, 64)    1          0.971        (31, 63, 64)
Tetrahedron    (6, 79, 81)     >27/37     0.907        (5, 82, 82)
Pyramid        (5, 63, 62)     ?          0.911        (8, 61, 62)

Th.: theoretical value, Comp.: computed value.
Table 5
Improved results for the 3D parameter P

Shapes K       Centroid of K   Th. P(K)   Comp. P(K)   Comp. point
Polyhedron     (32, 64, 64)    1          0.977        (32, 64, 64)
Tetrahedron    (6, 79, 81)     ?          0.747        (5, 80, 82)
Pyramid        (5, 63, 62)     ?          0.784        (7, 63, 64)

Th.: theoretical value, Comp.: computed value.
For the symmetric polyhedron, the parameter values move closer to 1, which is the theoretical value. For the two other shapes, the parameter values do not change in a very significant way, but the position of the characteristic point becomes closer to the centroid.
6. Conclusion

Since it is now possible to compute 3D data, the definition of 3D symmetry parameters is of great interest in image processing. The computation of these coefficients is unfortunately time-consuming: the number of planes to consider must be sufficient to give an accurate estimation of the coefficient, yet not too great, so as to bound the computation time. This paper is a first approach to 3D parameters inspired by the 2D Winternitz parameter. The first experimental results are not really close to the known theoretical values, but that is due to the coarse sampling of directions. However, as the computed points are very close to the centroid, we expect these 3D parameters to be good estimators of the degree of symmetry. In order to complete this study, we set up algorithms computing with more than six directions in a neighborhood of the centroid. This makes it possible to find the position of the characteristic point and the parameter value without wasting too much time. For the 2D coefficients and the 3D Winternitz coefficient, estimating the symmetry map at the centroid gives a symmetry parameter, and the same classification of the shapes is obtained as with the coefficient itself. The only drawback of such a simplification is that we lose the duality between the coefficient and the best-centered point; the choice to compute the one or the other therefore depends on the application. In order to obtain a good approximation of both the coefficient value and the position of the characteristic point, a good compromise is the evaluation of the parameter map in the neighborhood of the centroid, as proposed in Section 5.3. For the coefficient P, the fact that, for 2D polygonal shapes, the Winternitz level lines are made of hyperbola arcs can be used to speed the algorithm up. Whether polyhedrons have similar advantageous properties remains to be studied. This is interesting if the
whole parameter map is required by the application, in other words, if each point of the shape must be evaluated in terms of central symmetry. The theoretical results must be completed too, especially for the parameter P. We have to establish the lower bound of the coefficient P, as symmetry parameters allow shapes to be classified from the least symmetric to the most symmetric. For the 2D coefficients and the 3D Winternitz parameter, simplexes appear to be the extremal shapes; this still has to be proved for the coefficient P.
References

Blaschke, W., 1923. Vorlesungen über Differentialgeometrie. II. Affine Differentialgeometrie. Springer, Berlin.
Bonnesen, T., Fenchel, W., 1934. Theorie der konvexen Körper. Springer, Berlin.
Ehrart, E., 1955. Sur les ovales et les ovoïdes. C. R. Acad. Sci., Paris 240, 583-585.
Fillère, I., 1995. Outils mathématiques pour la reconnaissance de formes, propriétés et applications. Ph.D. Thesis, Université de St-Etienne, France.
Funk, P., 1915. Ueber eine geometrische Anwendung der Abelschen Integralgleichung. Math. Ann. 77, 129-135.
Grünbaum, B., 1960. Partitions of mass-distributions and of convex bodies by hyperplanes. Pacific J. Math. 10, 1257-1261.
Grünbaum, B., 1963. Measures of symmetry for convex sets. Proc. Symp. Pure Math. 7, 233-270.
Hammer, P.C., 1963. Volumes cut from convex bodies by planes. Mathematika 10.
Ho, C.T., Chen, L.H., 1995. A fast ellipse/circle detector using geometric symmetry. Pattern Recognition 28 (1), 117-124.
Kazhdan, M., Chazelle, B., Dobkin, D., Finkelstein, A., Funkhouser, T., 2002. A reflective symmetry descriptor. In: European Conference on Computer Vision (ECCV).
Kazhdan, M., Funkhouser, T., Rusinkiewicz, S., 2004. Symmetry descriptors and 3D shape matching. In: Scopigno, R., Zorin, D. (Eds.), Eurographics Symposium on Geometry Processing.
Labouré, M.J., Fillère, I., Becker, J.M., Jourlin, M., 1993. Asymmetry definition of a convex body by means of characteristic points. Acta Stereol. 12/2, 103-108.
Marola, G., 1989. On the detection of the axis of symmetry of almost symmetric planar images. IEEE Trans. PAMI 11, 104-107.
Süss, W., 1950. Ueber Eibereiche mit Mittelpunkt. Math. Phys. Semesterber. 1, 273-287.
Sun, C., 1995. Symmetry detection using gradient information. Pattern Recognition Lett. 16, 987-996.
Sun, C., Si, D., 1999. Fast reflectional symmetry detection using orientation histograms. Real-time Imaging 5, 63-74.
Wang, H., Suter, D., 2003. Using symmetry in robust model fitting. Pattern Recognition Lett. 24, 2953-2966.
Winternitz, A., 1923. See Blaschke's book (Blaschke, 1923), pp. 54-55.
Wolter, J.D., Woo, T.C., Volz, R.A., 1985. Optimal algorithms for symmetry detection in two and three dimensions. Visual Comput. 1, 37-48.
Pattern Recognition Letters 26 (2005) 1740–1751 www.elsevier.com/locate/patrec
Texture classification via conditional histograms

Eugenia Montiel a,*, Alberto S. Aguado a, Mark S. Nixon b

a Electronic and Electrical Engineering, University of Surrey, Guildford, Surrey GU2 5XH, United Kingdom
b Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, United Kingdom

Received 21 September 2003; received in revised form 1 October 2004
Available online 29 April 2005
Communicated by E. Backer
Abstract This paper presents a non-parametric discrimination strategy based on texture features characterised by one-dimensional conditional histograms. Our characterisation extends previous co-occurrence matrix encoding schemes by considering a mixture of colour and contextual information obtained from binary images. We compute joint distributions that define regions that represent pixels with similar intensity or colour properties. The main motivation is to obtain a compact characterisation suitable for applications requiring on-line training. Experimental results show that our approach can provide accurate discrimination. We use the classification to implement a segmentation application based on a hierarchical subdivision. The segmentation handles mixture problems at the boundary of regions by considering windows of different sizes. Examples show that the segmentation can accurately delineate image regions. 2005 Elsevier B.V. All rights reserved. Keywords: Texture classification; Co-occurrence features; Image analysis; Image segmentation; Region detection; Non-linear image analysis
* Corresponding author. Tel.: +44 1483 68 6044; fax: +44 1483 68 6031.
E-mail addresses: [email protected] (E. Montiel), [email protected] (M.S. Nixon).

1. Introduction

Previous works have shown that histograms can be used as powerful descriptions for non-parametric classification (Unser, 1986a; Valkealahti and Oja, 1998; Ojala et al., 1996, 2000; Hofmann et al., 1998; Puzicha et al., 1999). In contrast to parametric features (Haralick, 1979), histograms contain all the information of the distributions, avoiding the problem of feature selection. In general, the development of a method for automatic feature selection is not trivial, since optimal performance requires a careful selection of features according to the particular types of textures (Ohanian and Dubes, 1992; Jain and Zongker, 1997; Sullis,
1990; Ng et al., 1998). However, histograms cannot be used directly as texture descriptors: their computation and dimensionality impose prohibitive computational resources on applications (Rosenfeld et al., 1982; Augusteijn et al., 1995). The computation of histograms of low dimensionality is considered an open problem (Unser, 1986a; Valkealahti and Oja, 1998; Rosenfeld et al., 1982; Ojala et al., 2001). Besides making texture descriptors useful for applications, the reduction of histogram dimensionality has two important implications. First, it avoids sparse histograms due to insufficient training data. Secondly, if histograms are compact, it is possible to consider more complex pixel interdependencies, increasing the discrimination power. It is important to notice that if the histogram reduction is effective, we should expect good discrimination with features similar to the ones used to codify the high-dimensional description. As such, it is important to distinguish between the effectiveness of the reduction process and the additional power obtained by including more complex pixel interdependencies. A powerful approach to histogram reduction is to perform a quantisation that adapts the histogram bins according to the distribution (Puzicha et al., 1999). In (Valkealahti and Oja, 1998; Ojala et al., 2001) the adaptation is defined by using techniques of vector quantisation (Gersho and Gray, 1992). Results on texture discrimination have shown that this is a very powerful technique for reducing multiplex (i.e., >2) co-occurrences. However, although a tree structure can be used to handle the complexity required by the encoder, the encoder still requires a significant number of operations and a large amount of sample data. The search uses more memory than a full-search vector quantisation, and the process can lead to sub-optimal solutions (Gersho and Gray, 1992). In this paper, we simplify histograms by considering combinations of the random variables defining the joint probability function (i.e., grey tone or colour dependence) (Unser, 1986a; Rosenfeld et al., 1982; Ojala et al., 2001). In (Rosenfeld et al., 1982), and later (Unser, 1986a), histograms define the probability of the differences of grey levels for pixel pairs. The motivation is that these
1741
operations define the principal axes of the second order joint probability function (Unser, 1986a). However, they have had an unpredictable success in applications (Schulerud and Carstensen, 1995; Chetverikov, 1994). The main caveat of this representation is that, in general, random variables in a texture do not define Gaussian independent distributions. Additionally, although the average error of the difference is well approximated by the factorised probability (Ojala et al., 2001), the difference operation loses spatial information and histograms can become bad approximations of joint probability functions. In order to maintain spatial information, we propose to encode the textureÕs random structure by computing joint probabilities defining the dependence between pixels forming regions sharing intensity or colour properties. To reduce dimensionality, statistical distributions are computed for binary images. We combine joint distributions of binary values and the probability of intensity values to define a collection of histograms. These histograms are normalised, thus they can be used for non-parametric texture discrimination independently of the size of the sampled region (Puzicha et al., 1999). We use the non-parametric classification to implement a segmentation application based on a hierarchical quadtree scheme. Hierarchical strategies have been very effective for image segmentation. The hierarchical approach has two main advantages. First, it performs a fast partition by considering regions rather than individual pixels in fixed overlapping windows. The partition is defined by considering regions of different sizes at different levels of the hierarchy. The second advantage is that it reduces classification errors due to mixture of features computed in a fixed window size. This is convenient to delineate accurate region borders (Ojala et al., 2000; Hsu, 1978; Dutra and Mascarenhas, 1984; Marceau et al., 1990; Briggs and Nellis, 1991; Ma and Manjunath, 1997). These properties have been exploited in algorithms of segmentation based on intensity (Horowitz and Pavlidis, 1976; Wu, 1992), motion (Szeliski and Shum, 1996; Lee, 1998) and texture information (Ojala et al., 2000; Chen and Pavlidis, 1979). Our segmentation is based on the technique presented
1742
E. Montiel et al. / Pattern Recognition Letters 26 (2005) 1740–1751
in (Chen and Pavlidis, 1979), but it divides an image using non-parametric classification.
considering that all the possible values of c1 have the same probability of occurrence and that the probabilities of P{c(x1) = c1} and P{c1 c2 = a} are independent. In this case
2. Statistical characterisation of textures
f ðc1 ; c1 c2 ;sÞ ffi f ða;sÞ for a ¼ c1 c2
The interdependence of pixels in a texture can be defined by the joint probabilities computed for random variables associated to pixel intensities. Given n discrete random variables c(x1), c(x2), . . ., c(xn) at positions x1, x2, . . ., xn, the nthorder density function defines the probability that the variables take the values c1, c2, . . ., cn, respectively (Papoulis, 1991). That is, f(c1, . . ., cn; x1, . . ., xn) = P{c(x1) = c1, . . ., c(xn) = cn}. Here, c takes values within the range of possible grey levels or colours. Since textures are stationary processes, then distributions can be expressed independently of their position. Thus, f(c1, . . ., cn; x1, . . ., xn) depends on the distance between x1 and any other variable xi. That is, f(c1, . . ., cn; s2, . . ., sn) for si = jxi x1j. This function represents the probability of obtaining the grey level values c1, c2, . . ., cn at distances s1, s2, . . ., sn measured from x1. The general form can be limited to different orders to define alternative texture descriptors. For example, co-occurrence matrices (Haralick et al., 1973a; Kovalev and Petrou, 1996; Strand and Taxt, 1994) are defined by considering only two points. That is,
This description can be extended to large neighbourhoods or to differences of higher order. The main advantage is that it is simpler than Eq. (1), thus it can make the classification faster and it requires less training data. The main drawback is that many combinations of values c1, c2 map into the same value of a, thus information about permutations is lost. Additionally, dependence on intensity information is lost making classification on small regions difficult. In the next section we present a characterisation that simplifies Eq. (1) keeping dependence in intensity information. Previous works have shown the importance of intensity values for classification (Dubuisson-Jolly and Gupta, 2000).
f ðc1 ; c2 ;sÞ
for s ¼ jx2 x1 j
ð1Þ
for values of s defining neighbourhoods of 3 · 3 or 4 · 4 pixels. These descriptors have powerful discrimination properties. However, the computation of the probability for each combination c1 and c2 requires a large number of samples and computations. Previous works have considered simplifications obtained by replacing the dependence on the values c1 and c2 for arithmetical combinations (Unser, 1986a; Rosenfeld et al., 1982; Ojala et al., 2001). For example difference histograms (Ojala et al., 2001) are defined as f ðc1 ; c2 c1 ;sÞ
ð2Þ
This equation characterises the same information as Eq. (1), but the changes in intensities are given relative to the value at x1. This can be simplified by
ð3Þ
3. Conditional histograms We can describe the interdependence of pixels in a texture by considering ideas of binary feature selection. Previous works have shown that binarisation can be very effective to characterise the spatial dependence of pixels in images (Wang and He, 1990; Chen et al., 1995; Hepplewhite and Stonham, 1997; Ojala and Pietika¨inen, 1999). As suggested in (Ojala et al., 2000), we use the joint distribution of binary patterns. However, in order to be able to locate a texture embedded in different images, we use a global threshold strategy (Chen et al., 1995). A global threshold divides an image into regions with similar intensity properties. Thus, we can expect that the joint probabilities computed in a binary image to contain much of the information about the spatial structure of the texture. If c1 and c2 represent binary features at x1 and x2, then the structural information given by the permutations in two locations can be written as f ðc1 jc1 ; c2 ;sÞ for s ¼ jx2 x1 j
ð4Þ
This represents the probability that a pixel has intensity c1 conditioned to the intensities of
E. Montiel et al. / Pattern Recognition Letters 26 (2005) 1740–1751
neighbouring pixels. Since we consider binary features, the second-order spatial and intensity interdependencies of neighbouring pixels are defined by the four histograms f 1 ðc1 ;sÞ ¼ f ðc1 j0; 0;sÞ f 2 ðc1 ;sÞ ¼ f ðc1 j0; 1;sÞ f 3 ðc1 ;sÞ ¼ f ðc1 j1; 0;sÞ
ð5Þ
f 4 ðc1 ;sÞ ¼ f ðc1 j1; 1;sÞ This description can be extended to large neighbourhoods or n-order interdependencies. That is, f ðc1 jc1 ; c2 ; . . . ; cn ;s2 ; . . . ; sn Þ
for si ¼ jxi x1 j ð6Þ n
Thus a texture is characterised by 2 one-dimensional histograms per binary feature. For colour or multispectral images, we can maintain a low dimensionality by considering each colour component separately. Thus, for an image with m chromatic components, we have m2n histograms. Although it is possible to consider directly the colour components in the definition of the histograms, in general, it is better to perform a pre-processing such that each component has a higher discriminatory ability. For example, if c1,R, c1,G and c1,B are the red, blue and green values of an image (i.e., m = 3) then the description in Eq. (6) can be extended to f ðc1j jc1 ; c2 ; . . . ; cn ;s2 ; . . . ; sn Þ
ð7Þ
for c1;1 ¼ ðc1;R þ c1;G þ c1;B Þ=3 c1;2 ¼ c1;R c1;B
ð8Þ
c1;3 ¼ ð2c1;G c1;R c1;B Þ=3 These definitions can obtain a set of components with discriminatory ability as good as that obtained by the Karhunen Loeve transformation (Ohta et al., 1980).
4. Classification In general, the classification performance depends on the discrimination approach. Numerous discrimination approaches are possible (Devijver
1743
and Kittler, 1982; Schalkoff, 1992) and classifiers can improve the results at the expense of complexity, computational resources and requirements in the size and quality of the training data. However, it is beyond the scope of this work to evaluate classification schemes. We are interested in evaluating the discrimination properties of conditional density histograms. We have chosen to use a nonparametric classification based on the dissimilarity between the histograms of the training classes and the histograms of the sample to be assigned to the class. The non-parametric approach is particularly suitable for low dimensionality feature spaces and can provide good classification results with relative low computational resources (Puzicha et al., 1999). In non-parametric discrimination techniques, histograms define the feature vectors that form the basis for the classification. Thus, each element in the histogram corresponds to the value of a feature and a texture class is characterised by m2n features per binary operator. An important difference with previous approaches is that in our classifier we assume that training samples define the same distribution. That is, instead of forming different distributions for each training sample, we increment the estimate of a single collection of distributions for each texture class. Thus, the training data of a class characterises a single point in the feature space, making the classification to be more dependent on the selection of good texture characterisation rather than on the power of the classifier. If we included a sophisticate classifier that can distinguish between no-linear separable classes defined by several features, then we would compensate for poor features by using complex discrimination for disperse collections in the feature space. Thus, the performance would be directly related to the size of the training data and to the complexity of the classifier and not to the quality of the features. An advantage of computing features incrementally is that we can obtain an estimate of the distributions with a small number of training samples avoiding sparse histograms. It is well known that histograms with few entries per bin produce a poor performance in non-parametric tests (Ojala et al., 1996; Puzicha et al., 1999). In some applications, it is important to be able to classify by using reduced training data. For example, in image editing
1744
E. Montiel et al. / Pattern Recognition Letters 26 (2005) 1740–1751
or region tracking applications, it is important to minimise the number of times the user selects samples to delineate a region of interest. Additionally, in these applications, training cannot be performed off-line, thus we have to minimise the time spent in computing features and in creating the texture database. In non-parametric classification techniques, there are alternative ways to define the dissimilarity between histograms. Valuable experimental work has evaluated the performance of alternative dissimilarity measures for k-NN classifiers (Ojala et al., 1996; Puzicha et al., 1999). The results suggest that the performance depends on several factors such as the size of the training data, type of features, type of images and particular applications. In general, when histograms are not sparse (e.g., by using marginal distributions or by considering many training samples (Puzicha et al., 1999)), the difference in performance is not significant for most measures. For small sample data, it is better to use measures based on statistics or aggregate measures less sensitive to sample noise. Since we are using only one set of histograms per class, thus we expect (and in fact this is part of the motivation to reduce dimensionality) to have histograms with a significant number of entries per bin. Thus, we choose to use the L1 norm that provides a good classification with a small computational load. In the case of windows with few pixels, the performance can be maintained if the number of bins is reduced. This can be achieved by adaptive binning, but this is computationally expensive (Ojala et al., 1996). In a simpler strategy, we constrain the similarity measure to bins that have a significant number of entries. By considering Eq. (6), a test sample S is assigned to the class of the model Mj that minimises the absolute difference between corresponding bins in each histogram. DðS; M j Þ ¼
m2n X X h
jfS h ðci jc1 ; c2 ; . . . ; cn ; s2 ; . . . ; sn Þ
i
fM j;h ðci jc1 ; c2 ; . . . ; cn ; s2 ; . . . ; sn Þj
ð9Þ
The fist summation indicates all the histograms whilst the second summation iterates for each bin. The sub-index h is used to index the histogram
of the sample and the model. This definition measures whether the pixels in two textures have similar intensities with similar spatial organisation. That is, the value of D(S, M) will be small if the intensity values of the two textures are similar and they are grouped into regions of similar contrast.
5. Segmentation We used the classification to implement a segmentation application based on a top-down hierarchical subdivision. This approach searches for an optimum partition by dividing the image in a quad-tree homogenous decomposition. This comprises three steps. First, a region is classified. Secondly, it is partitioned and each partition is classified. Finally, it is necessary to measure the homogeneity of the partition. If the region is homogenous, then the whole region is assigned to the same class and the subdivision is stopped. If the region is not homogenous, the region is subdivided. The subdivision is repeated until the image region is equal to one pixel. In a hierarchical approach, the segmentation performance depends on the classification and on the ability of computing an optimum partition. An optimal partition divides the image into regions of roughly uniform texture. Thus, the success depends on performing an appropriate decision about the homogeneity of a region. Unfortunately, there has not been a practical evaluation of homogeneity measurements that could help us to choose an optimum partition framework. We base our criterion of homogeneity in two heuristic rules. First, we consider that classification at boundaries contains a mixture of two textures (i.e., non-homogeneous). Accordingly, we subdivide regions that have at least one neighbouring region of a different class. This criterion of subdivision delineates the boundaries between texture regions. The second criterion of homogeneity is based on the classification in successive levels of the quad-tree. Similar to (Ojala et al., 2000), we measure the uniformity by computing the differences between histograms of the four subblocks in the subdivision. However, we do not
E. Montiel et al. / Pattern Recognition Letters 26 (2005) 1740–1751
measure the variance of the sub-blocks, but the confidence that we have in the result. We consider the uncertainty in classification as the ratio between the similarity values of the two textures in the database that are most similar to the distribution computed from a region. We denote as j the element that minimises D(S, Mj) and we denote as DðS; M j Þ the corresponding minimum value. We denote as j+ the class that minimises D(S, Mj) when the element j is not considered. Thus, DðS; M jþ Þ denotes the distance value for the class j+. The uncertainty of classifying S is defined as U ðS; j Þ ¼
DðS; M jþ Þ DðS; M j Þ
ð10Þ
This measure will be close to one if the classification is vague. The uncertainty in the classification decreases when the difference between classification distances increases. Thus, the subdivision must minimise the uncertainty in the entire image. We perform the subdivision based on the comparison between the uncertainty of the region and the uncertainty of the new regions. However, this comparison cannot be based on the uncertainty between different levels in the quad-tree. Since each level of the quad-tree has regions of smaller size, then the uncertainty in lower levels is always higher. In order to evaluate the subdivision, we consider the uncertainty when regions are classified by considering the class in the upper level and the new classification in the low level. If S1, S2, S3 and S4 are the regions in a subdivision, then the change in uncertainty due to a splitting operation can be measured as U ðS 1 ; j Þ þ U ðS 2 ; j Þ þ U ðS 3 ; j Þ þ U ðS 4 ; j Þ ðU ðS 1 ; j1 Þ þ U ðS 2 ; j2 Þ þ U ðS 3 ; j3 Þ þ U ðS 4 ; j4 ÞÞ ð11Þ where j minimises in the upper level and, j1 ; j2 ; j3 and j4 minimise each one of the new sub-regions. The sub-division is performed if the absolute difference in Eq. (11) is lower than a fixed threshold. The basic idea is to subdivide the region only if it is composed of several textures. In this case, the classification obtained by smaller regions
1745
composing a mixture has less uncertainty than when we consider a single class for the whole region. Although uncertainty is capable of giving a useful measure of homogeneity, still it is extremely difficult to classify small blocks in an image. If the window is to small, then it probably does not contain sufficient information to characterise the region, thus increasing the probability of misclassification (Ojala et al., 2000). In order to obtain an accurate delineation of texture regions, we reduce the number of potential classes. When the subdivision is due to a boundary and the window size is smaller than a fixed threshold, the classification is made by considering only the texture classes of current and neighbour regions. That is, we assume that there are not new regions smaller than the fixed threshold, thus the segmentation can be stopped and the subdivision can only be used to delineate the existing coarse regions. The threshold size determines the minimum data necessary to obtain a good classification and it is strongly dependent on the number of texture categories. As more classes are included, the probability of misclassification of small data increases. Thus, the threshold should be increased such that, the classification of regions is made by including enough information. The threshold should be set to the minimum window size for which the classification obtains reliable results. This value is an input parameter of the classification application.
6. Experimental results and examples 6.1. Experimental data In order to assess the discrimination capabilities, we have performed two experimental tests based on the data presented in (Valkealahti and Oja, 1998; Ojala et al., 1996, 2001; Ohanian and Dubes, 1992). The first test (Valkealahti and Oja, 1998; Ojala et al., 2001) defines 32 texture categories from selected images of the Brodatz collection. The second test (Ojala et al., 1996; Ohanian and Dubes, 1992) defines 16 texture categories from four types of images.
1746
E. Montiel et al. / Pattern Recognition Letters 26 (2005) 1740–1751
For the 32-category problem, texture samples are obtained from a 256 · 256 image with 256 grey levels. Each image is subdivided into 16 blocks of 64 · 64 pixels and each block is transformed into three new blocks by 90 rotation, scale from the 45 · 45 pixels in the middle and by combining rotation and scaling. This produces 2048 blocks. Half of the data is selected by randomly choosing 8 blocks and the corresponding transformed blocks. This data is used to define class histograms and the remaining blocks to evaluate the classification process. The 16-category problem uses four types of images containing four distinct textures. Two types of textures are generated from fractal and Gaussian Markov random field models. The other two types were obtained from leather and painted surfaces. Images are divided into 256 non-overlapping regions of 32 · 32 pixels. Only 200 regions of each image are selected to obtain 3200 samples. Performance is measured by considering the leave-one-out error. Thus, 3184 samples are used for training and 16 to assess the discrimination. 6.2. Computation of features In contrast to (Valkealahti and Oja, 1998) where feature histograms are computed for each texture block separately by considering a 4 · 4 neighbourhood, we obtain histograms for each texture category by considering only a 2 · 2 neighbourhood. Thus, the 32 training blocks of each class are used to compute 4 · 4 conditional histograms. The first collection of four represents the class without any geometric transformation; the second collection is used for the scale, and the last two for rotation and combination of scale and rotation. To obtain binary features, we use three fixed thresholds with values of 128, 64 and 32. Thus, each class is represented by 4 · 4 · 3 histograms. 6.3. Classification results Table 1 shows the average classification results obtained for 10 random selected test sets for the first test. The table shows the average for each class.
Table 1 Average classification accuracies (%) over 10 experiments for 32 texture categories Texture
Accuracy
bark beachsand beans burlap d10 d11 d4 d5 d51 d52 d6 d95 fieldstone grass ice image09 image15 image17 image19 paper peb54 pigskin pressdcl raffia raffia2 reptile ricepaper seafan straw2 tree water woodgrain
100.00 85.00 100.00 100.00 61.56 95.63 96.25 96.88 100.00 100.00 100.00 100.00 76.25 99.06 63.13 100.00 96.56 92.50 100.00 99.38 97.81 92.50 100.00 100.00 100.00 100.00 100.00 93.13 100.00 92.19 100.00 100.00
In general, the classification performance is very good with exception of the classes: beachsand, D10, ice and fieldstone. A detailed observation of the results showed that most misclassifications for these classes are for blocks obtained by the scale transformation. Fig. 1 shows six examples of these misclassifications. Fig. 1(a) shows a block of beachsand that was misclassified as the class grass. The second texture in Fig. 1(a) shows an example of this class. Fig. 1(b)–(f) shows other misclassifications obtained for the scaling classes. The remarkable similarity between the textures in Fig. 1 can explain the misclassifications. Additionally, since scaling is obtained from digital images, there is some lost in resolution and as consequence aliasing produces
E. Montiel et al. / Pattern Recognition Letters 26 (2005) 1740–1751
1747
Fig. 1. Examples of misclassification. (a) Beachsand and grass, (b) ice and fieldstone, (c) D10 and tree, (d) fieldstone and peb54, (e) peb54 and fieldstone and (f) pigskin and beachsand.
new blocks with very similar values for neighbouring pixels. Thus, spatial information obtained by considering neighbouring pixels is not so good as the original blocks. Table 2 shows the global performance of the classification. The table includes the results presented in (Valkealahti and Oja, 1998; Ojala et al., 2001) for signed differences (Valkealahti and Oja, 1998), co-occurrences (Haralick et al., 1973b), absolute differences (Ojala and Pietika¨inen, 1996), Gaussian random field model (Van Gool et al., 1985), reduced histograms (Ojala et al., 2001) and channel histograms (Unser, 1986b). In general, the accuracy of the proposed features compares to the most successful techniques. However, it is important to notice that quantisation and signed grey level distributions required 16th and 9th order probabilities. We have used second order joint probabilities in a 2 · 2 neighbourhood. This makes the complexity adequate for applications requiring online training. Table 3 shows the result for the second test. The table includes the results presented in (Ojala et al., 1996; Ohanian and Dubes, 1992) for local
binary patterns (Wang and He, 1990), co-occurrences (Haralick et al., 1973b), grey level differences (Unser, 1986a) and classification based on a combination of three different texture features (Ohanian and Dubes, 1992). For the leave-oneout classification, the classification accuracy for conditional histograms was of 100%. In order to highlight the advantage of conditional histograms, we perform the same test by reducing the training data. In our results good performance can be maintained with only 10% to 5% of the data. Table 4 shows the classification results for only 20 training samples and windows of 32 · 32 and 16 · 16. For the 32 · 32 case this represents the 10% of training data. When the window is reduced to 16 · 16 only 2.5% of the original data is used. Classification performance is maintained with less training data because, as explained in Section 4, histograms are computed by using data incrementally. Thus, when we reduce training data, we are not reducing the number of features, but the accuracy of the features. But since histograms are just one dimensional, then they can be populated with few data.
1748
E. Montiel et al. / Pattern Recognition Letters 26 (2005) 1740–1751
Table 2 Global performance for 16 texture categories
Table 4 Average classification accuracy (%) for 16 texture categories
Average classification Signed differences (LPB) Second order Fourth order Eighth order
93.3 95.7 96.8
Co-occurrence matrices Third order Fifth order Ninth order
90.8 93.8 94.4
Absolute differences Second order Fourth order Eighth order
85.3 92.1 93.2
MRF Seventh order Combined features
71.3 90.0
32 · 32
16 · 16
Fractal1 Fractal2 Fractal3 Fractal4 mrf1 mrf2 mrf3 mrf4 Leather1 Leather2 Leather3 Leather4 paint1 paint2 paint3 paint4
100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
100.00 81.11 100.00 100.00 100.00 92.78 100.00 94.44 100.00 100.00 100.00 84.44 100.00 100.00 78.33 100.00
Results obtained by considering 20 training samples per class.
Reduced histograms TSOM, cosine transform VQ, cosine transform TSOM, grey levels
93.9 93.4 92.8
Channel histograms Multi-dimensional One-dimensional
90.4 78.2
Conditional histograms Second order
94.91
6.4. Segmentation
Table 3 Global performance for 16 texture categories Local binary patterns LBP LPB and contrast LBP and covariance
81.40 87.62 185.03
Co-occurrence features 4 Features 9 Features
188.25 90.69
Grey level difference DIFFX and DIFFY Combined features MRF, Gabor, Fractal 4 Features 9 Features Conditional histogram 32 · 32 (199 samples per class) 32 · 32 (20 samples per class) 16 · 16 (20 samples per class)
Texture
96.56 191.07 195.41 100.00 1100.00 195.69
Figs. 2 and 3 show selected examples of the segmentation application for grey level and colour images. In these examples, training data was obtained by considering two windows of 32 · 32 for each class. For each example, we show the segmentation regions, the borders between classes and the uncertainty in the classification. Uncertainty is shown as a colour image whose change in brightness represents the confidence of the classification of each pixel. Brighter colours represent a high degree of confidence, whilst darker colours show areas whose classification is more uncertain. We can see that in general large areas of uniform texture are classified in regions of large size with low uncertainty, whilst high detailed areas are divided into small regions with high uncertainty. The examples show that texture features can be used in applications to obtain well-delineated borders. Notice that as boundaries are refined, regions reduce the confidence in the classification. Figs. 2(a) and 3(a) show two examples of segmentation for synthetic images composed of grey level and colour textures, respectively. Each image has a resolution of 256 · 256 pixels. We can observe that the refinement of regions provides an analysis capable of producing accurate segmentation results. In these examples, the colour information
E. Montiel et al. / Pattern Recognition Letters 26 (2005) 1740–1751
1749
Fig. 2. Examples of texture images. (a) Synthetic image composed of five grey level textures, (b) image containing grey level natural textures, (c) infrared band of a satellite image and (d) intensity image.
Fig. 3. Examples of colour texture images. (a) Synthetic image composed of five colour textures, (b) satellite image with three colour bands, (c) and (d) images containing colour natural textures.
produces a clearer delineation of regions since small regions are more accurately classified. The detailed evaluation of the segmentation results shown that 98.7% and 99.2% of pixels in the grey level and colour images, were correctly classified. Although the difference between the results of the final classification is very small, we can observe a clear distinction in the uncertainty maps. We can observe that whilst for the colour image a high confidence in the classification results is maintained for regions of reduced size, grey level information tends to produce less contrasted and darker regions. Thus, when regions are small, the lack of information increases the probability of erroneous classification for grey level textures.
However, since the increase in probability is only significant for small regions, then the error of the whole segmentation process is small. The example in Fig. 2(b) contains a grey level image with four types of regions. Two types of regions have a rather regular appearance with white and black intensities, respectively. In spite of the lack of texture, the intensity component in the characterisation produces a successful classification. Fig. 2(c) corresponds to the infrared band in a multispectral LANDSAT image. A training set was defined by selecting two regions as representative members of urban and agricultural land covers. In this example, we can see that the model of texture developed can be used to accurately
1750
E. Montiel et al. / Pattern Recognition Letters 26 (2005) 1740–1751
delineate textured regions. The uncertainty map provides useful information for classification. For example, although the top left region was classified as urban area, the low confidence value shows that it probably contains a different type of land cover. In the same manner, small black regions in the red areas indicate possible fields inside an urban area. The results in Fig. 2(d) show an accurate segmentation capable of discriminating between regions of constant intensity and regions with textured patterns. Fig. 3(b) shows a similar example to Fig. 2(d). In this case, the uncertainty is reduced due to colour data. We can observe an accurate segmentation even for small regions. Fig. 3(c) and (d) show two examples of colour texture landscapes. In Fig. 3(d), the green tones in two texture classes produce a considerable similarity between some regions. Accordingly, the segmentation process subdivides several times the regions that contain green areas. Although small green regions have a large uncertainty with respect to larger areas, the classification results lead to accurately delineated borders. The segmentation of the image in Fig. 3(c) has a more clear distinction of classes. Thus, larger regions with low uncertainty are obtained. We can notice that in regions where rock and grass merge, the uncertainty increases and the subdivision becomes finer in order to identify pure classes.
7. Conclusions and further work We have proposed a characterisation of textures based on a mixture of colour and contextual information obtained from binary features. The characterisation defines one-dimensional histograms that represent the conditional probability of intensity values given the joint probabilities of pixels in image regions. Experimental results show that a non-parametric classification based on conditional histograms produces a compact and powerful set of features. High classification performance is obtained by considering only second order distributions. The compactness of the representation has three main interests. First, compactness is important to make texture analysis practical. This is particularly relevant for applica-
tions requiring on-line database construction. Secondly, it avoids sparse histograms that can reduce the classification performance. Finally, since the number of bins is reduced, compactness minimises the data required during the training step. We reinforce this last point by considering training data incrementally. We have included examples that show the application of the classification to region delineation by means of a hierarchical subdivision. Examples show that the classification is useful to obtain well-delineated borders. The dependence of the representation on intensity data is suitable to classify regions of small sizes. Our current work is considering the potential implications of incremental training. We think that distributions can be used to determine when training can be stopped and to detect when training data agree with a single distribution. If data do not agree with a single distribution, then several classes should be used to represent a texture. Additionally, we consider extending the approach to applications on nonsupervised segmentation. There are recent studies where efficient unsupervised segmentation is performed using feature distributions (Ojala et al., 2000).
References Augusteijn, M.F., Clemens, L.E., Shaw, K.A., 1995. Performance evaluation of texture measures for ground cover identification in satellite images by means of a neural network classifier. IEEE Trans. Geosci. Remote Sensing 33 (3), 616–626. Briggs, J.M., Nellis, M.D., 1991. Seasonal variation of heterogeneity in the tallgrassprairie: A quantitative measure using remote sensing. Photogramm. Eng. Remote Sensing 57 (4), 407–411. Chen, P.C., Pavlidis, T., 1979. Segmentation by texture using a co-occurrence matrix and a split and merge algorithm. Comput. Vision Graphics Image Process. 10, 172–182. Chen, Y.Q., Nixon, M.S., Thomas, D.W., 1995. Statistical geometric features for texture classification. Pattern Recognition 28, 537–552. Chetverikov, D., 1994. GLDH based analysis of texture anisotropy and symmetry: An experimental study. Proc. Internat. Conf. on Pattern Recognition I, 444–448. Devijver, P.A., Kittler, J., 1982. Pattern Recognition, a Statistical Approach. Prentice Hall, Englewood Cliffs, London.
E. Montiel et al. / Pattern Recognition Letters 26 (2005) 1740–1751 Dubuisson-Jolly, M.-P., Gupta, A., 2000. Color and texture fusion: Application to aerial image segmentation and GIS updating. Image Vision Comput. 18 (10), 823–832. Dutra, L.V., Mascarenhas, N.D.A., 1984. Some experiments with spatial feature extraction methods in multispectral classification. Int. J. Remote Sensing 5 (2), 303–313. Gersho, A., Gray, R.M., 1992. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Dordrecht, Netherlands. Haralick, R., 1979. Statistical and structural approaches to texture. Proc. IEEE 67, 786–804. Haralick, R., Shanmugam, K., Distein, I., 1973a. Textural features for image classification. IEEE Trans. Systems Man Cybernet. 3, 610–621. Haralick, R., Shanmugam, K., Dinstein, I., 1973b. Textural features for image classification. IEEE Trans. Systems Man Cybernet. SMC-3, 610–621. Hepplewhite, L., Stonham, T.J., 1997. N-tuple texture recognition and the zero crossing sketch. Electron. Lett. 33 (1), 45–46. Hofmann, T., Puzicha, J., Buhmann, J., 1998. Unsupervised texture segmentation in a deterministic annealing framework. IEEE Trans. Pattern Anal. Mach. Intell. 20 (8), 803– 818. Horowitz, S.L., Pavlidis, T., 1976. Picture segmentation by a tree traversal algorithm. J. ACM 23 (2), 368–388. Hsu, S., 1978. Texture-tone analysis for automated land-use mapping. Photogramm. Eng. Remote Sensing 44 (11), 1393–1404. Jain, A., Zongker, D., 1997. Feature selection: Evaluation, application and small sample performance. IEEE Trans. Pattern Anal. Mach. Intell. 19 (2), 153–158. Kovalev, V., Petrou, M., 1996. Multidimensional co-occurrence matrices for object recognition and matching. Graph. Models Image Process. 58 (3), 187–197. Lee, J.W., 1998. Joint optimization of block size and quantization for quadtree based motion estimation. Image Process. 7 (6), 909–912. Ma, W., Manjunath, B., 1997. Edge flow: A framework of boundary detection and image segmentation, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 744–749. Marceau, D.J., Howarth, P.J., Dubois, J.M., Grattton, D.J., 1990. Evaluation of the grey-level co-occurrence matrix method for land-cover classification using SPOT imagery. IEEE Trans. Geosci. Remote Sensing 28 (4), 513–519. Ng, L.S., Nixon, M.S., Carter, J.N., 1998. Combining feature sets for texture extraction. Proc. IEEE Southwest Symposium on Image Analysis and Interpretation, Texas, pp. 103– 108. Ohanian, P.P., Dubes, R.C., 1992. Performance evaluation for four classes of textural features. Pattern Recognition 25 (8), 819–833. Ohta, Y.-I., Kanade, T., Sakai, T., 1980. Color information for region segmentation. Comput. Graph. Image Process. 13, 222–241.
1751
Ojala, T., Pietika¨inen, M., 1996. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29 (1), 51–59. Ojala, T., Pietika¨inen, M., 1999. Unsupervised texture segmentation using feature distributions. Pattern Recognition 32, 477–486. Ojala, T., Pietika¨inen, M., Harwood, D., 1996. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29, 51–59. Ojala, T., Pietika¨inen, M., Ma¨enpa¨a¨, T., 2000. Gray scale and rotation invariant texture classification with local binary patterns, Proc. Sixth European Conf. on Computer Vision, Dublin, Ireland, vol. 1, pp. 404–420. Ojala, T., Valkealahti, K., Oja, E., Pietika¨inen, M., 2001. Texture discrimination with multidimensional distributions of signed gray level differences. Pattern Recognition 34 (3), 727–739. Papoulis, A., 1991. Probability, Random Variables, and Stochastic Processes. McGraw–Hill. Puzicha, J., Rubner, Y., Tomasi, C., Buhmann, J.M., 1999. Empirical evaluation of dissimilarity measures for color and texture. Proc. IEEE Internat. Conf. on Computer Vision (ICCVÕ99), pp. 1165–1173. Rosenfeld, A., Wang, C.Y., Wu, A.Y., 1982. Multispectral texture. IEEE Trans. Systems Man Cybernet. 12 (1), 79– 84. Schalkoff, R., 1992. Pattern Recognition. Statistical, Structural and Neural Approaches. Wiley. Schulerud, H., Carstensen, J., 1995. Multiresolution texture analysis of four classes of mice liver cells using different cell cluster representations. Proc. 9th Scandinavian Conf. on Image Analysis, Uppsala, Sweden, pp. 121–129. Strand, J., Taxt, T., 1994. Local frequency features for texture classification. Pattern Recognition 27 (10), 1397– 1406. Sullis, J.R., 1990. Distributed learning of texture classification. Lect. Notes Comput. Sci. 427, 349–358. Szeliski, R., Shum, H.-Y., 1996. Motion estimation with quadtree splines. IEEE Trans. Pattern Anal. Mach. Intell. 18 (12), 1199–1210. Unser, M., 1986a. Sum and difference histograms for texture classification. IEEE Trans. Pattern Anal. Mach. Intell. 8, 118–125. Unser, M., 1986b. Local linear transforms for texture measurements. Signal Process. 11 (1), 61–79. Valkealahti, K., Oja, E., 1998. Reduced multi-dimensional co-occurrence histograms in texture classification. IEEE Trans. Pattern Anal. Mach. Intell. 20 (1), 90–94. Van Gool, L., Dewaele, P., Oosterlinck, A., 1985. Texture analysis anno 1983. Comput. Vision Graphics Image Process. 29, 336–357. Wang, L., He, D.C., 1990. Texture classification using texture spectrum. Pattern Recognition 23 (8), 905–910. Wu, X., 1992. Image coding by adaptive tree-structured segmentation. IEEE Trans. Inform. Theory 38 (6), 1755– 1767.
Pattern Recognition Letters 26 (2005) 1752–1760 www.elsevier.com/locate/patrec
The Naive Bayes Mystery: A classification detective story Adrien Jamain, David J. Hand
*
Imperial College, Department of Mathematics, 180 Queen’s Gate, London, SW7 2AZ, United Kingdom Received 7 January 2005; received in revised form 15 February 2005 Available online 14 April 2005
Abstract Many studies have been made to compare the many different methods of supervised classification which have been developed. While conducting a large meta-analysis of such studies, we spotted some anomalous results relating to the Naive Bayes method. This paper describes our detailed investigation into these anomalies. We conclude that a very large comparative study probably mislabelled another method as Naive Bayes, and that the Statlog project used the right method, but possibly incorrectly reported its provenance. Such mistakes, while not too harmful in themselves, can become seriously misleading if blindly propagated by citations which do not examine the source material in detail. 2005 Elsevier B.V. All rights reserved. Keywords: Supervised classification; Naive Bayes; Statlog; Comparative studies
1. The setting Many different competing methods have been developed for supervised classification. As a corollary, there have also been many comparative studies of the performance of classification methods. The aims of these studies range from analysing the relative merits of different variations of the same methodology (e.g., Dietterich, 2000) to providing large-scale benchmarks (e.g., Zarndt, 1995). As a by-product of these specific aims, such *
Corresponding author. Fax: +44 20 7594 8517. E-mail addresses:
[email protected] (A. Jamain),
[email protected] (D.J. Hand).
studies have created the general body of opinion that researchers hold about the relative performance of the different methods. However, the question of whether the view that this literature provides is representative of the ÔrealÕ use to which classification methods will be put has been raised by a few (Duin, 1996; Salzberg, 1997). A related but often unrecognised problem is that studies, even if carried out in complete good faith and using sound methodology, are not always accurate enough in their reporting. Indeed, the literature abounds with clear evidence of harmless omissions; for example minor imprecision about which UCI dataset is used or about how data have been pre-processed in a study are commonplace. Other
0167-8655/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.02.001
bp
c4.5R
15
1
http://citeseer.nj.nec.com
ucn2
c4
smml
mml
ocn2
per bayes
id3
• The multi-layer perceptron—bp—on its own. • The noise-tolerant extensions of the nearestneighbour method, ib3 and ib4. • The nearest-neighbour and its condensed version, ib1 and ib2.
c4.5T
cart
ib2
ib1 ib4
ib3
2. The initial clue One tool which we have routinely used in the course of the meta-analysis described in (Jamain, 2004) is clustering. The basic idea is to consider a primary study as a set of n observations (the methods) in p dimensions (the datasets), and to cluster the methods according to their results. The method of our choice was hierarchical clustering, with complete linkage (that is, the distance between two clusters is taken to be the largest one between two points belonging to each cluster), but any other clustering method could also be used. The resulting dendrograms can be visually inspected to discover interesting patterns, and indeed they often offer considerable insights into the relative performance of the methods. We were producing such a clustering tree for the Zarndt study (Zarndt, 1995), which is as far as we know the largest comparative study ever made and which is available in the Citeseer online library,1 when we noticed a very curious pattern. We reproduce this tree in Fig. 1, so that the reader will be able to see if they can spot this anomalous pattern as well. As the figure clearly shows, six different clusters may be identified in the Zarndt study:
1753
10 5
more serious kinds of confusions are not straightforward to spot, and can only be established with some uncertainty (after all, the authors always have the benefit of the doubt). While we were carrying out a meta-analysis of classification studies (Jamain, 2004), we discovered what we think is a worrying confusion in two large comparative studies. We present here the evidence that has led us to this belief, and try to unveil what really happened in both comparative studies.
20
A. Jamain, D.J. Hand / Pattern Recognition Letters 26 (2005) 1752–1760
Fig. 1. Hierarchical clustering of the Zarndt study (using complete linkage).
• C4.5R on its own. • The decision trees—cart, id3, c4.5T, mml, c4 and smml—and Naive Bayes—bayes. • The perceptron—per—and the variants of CN2—ocn2, ucn2. So, where is the anomaly? Well, as the reader may have guessed (aided, perhaps by the title of the paper), it is in the location of the Naive Bayes method, which is deep inside the decision tree cluster. We have noted that in general apparent anomalies can occur in the clusters formed in this study and others. However, this one is stranger because it is not situated in a small cluster or on the edges of a large one. For example, the position of the perceptron within what can be called the rules cluster (next to the two variants of CN2) is potentially intriguing as well, but in fact it is situated on the periphery of this very small cluster and hence the pattern is not that intriguing. In contrast, the decision tree cluster is clearly built ÔaroundÕ what the author calls Naive Bayes.2 This peculiarity led us to carry out some further investigations.
3. The evidence We pursued our inquiry by carefully scrutinizing what the author of the study wrote in his report. Perhaps it is worth mentioning here that 2
See Appendix A for the classical definition of Naive Bayes.
1754
A. Jamain, D.J. Hand / Pattern Recognition Letters 26 (2005) 1752–1760
the study was the result of an M.Sc. project, and hence, fortunately for us, the report gives great detail about all the experimental steps. Had the study been reported in a journal, the story would probably have ended inconclusively here. In the body of his thesis (p. 22), Zarndt is silent about which implementation he used for Naive Bayes. However, in the appendices (p. 72) he reports having used the well-known IND package (Buntine, 1993), and describes which command lines he used for ÔNaive BayesÕ: mktree -e -v -s bytes The term mktree of course reinforced our suspicion. Our doubts were confirmed after further consultation of the IND documentation (Buntine, 1993, p. 22) since it appeared that this particular line creates a Bayesian tree, similar to the MML and SMML trees, but certainly not anything like the classical Naive Bayes method. This would explain the high experimental similarity between the so-called bayes method of Zarndt and all the decision trees observed in Fig. 1. To be fair and honest, this confusion has probably had little consequence due to the relatively limited availability of the study, and would not alone be the object of a communication such as the present one. Still, the Zarndt study has already been referenced four times within the Citeseer library itself, and will probably be more in the future since it has now been discovered and cited. Apart from the obvious similarity in name between Naive Bayes and Bayesian tree, it was still quite puzzling how such a misinterpretation could have happened; after all, Zarndt himself qualifies the IND package as Ôdecision tree softwareÕ on page 22, the same page on which he writes about Naive Bayes. However, further investigation unveiled what may have been the explanation.
4. The plot thickens Also on page 22, the same page on which Zarndt ÔdescribesÕ the Naive Bayes method, although he writes little of particular relevance (seven very general lines), he does makes a reference to the
famous Statlog study (Michie et al., 1994). This is perhaps not surprising since this study is generally considered to be the golden reference in comparative studies of classification methods. The reasons for StatlogÕs popularity are manifold, but perhaps the main one is that its interdisciplinary character makes it accessible to researchers in all areas interested in classification methods.3 Statlog produced results for 23 methods tested on 22 datasets from varied domains, and involved many research teams distributed all over Europe. As we will see shortly in more detail, a ÔNaive BayesÕ routine from the IND package was also apparently used in this study. Hence, it seems possible that Zarndt read the Statlog study, downloaded the IND package, and found that the most similar command to Ôsomething BayesianÕ was the one above. This raises the question of what happened in Statlog? When the Statlog authors describe the Naive Bayes method (p. 40), it is the usual, statistical one. They however write that the specific routine used came from the IND package. This is confirmed in the appendix (p. 263), where the following methods are allegedly taken from the IND package: INDCart, Bayes Tree and Naive Bayes. The first two probably correspond to (unspecified) sequences of IND instructions according to the respective tree building and pruning principles. But the third one is something of a mystery. We have tried the obvious approach to solving this mystery (i.e. contacted the authors) but, at the time of writing, received no answer. The problem is that, besides Statlog now being a rather old piece of work, it is probably difficult for anyone to know exactly what was done by each research team in a project of such a scale. This means that all that now remains is speculation, based on the evidence at our disposal. ÔNaive BayesÕ in Statlog could indeed be: (1) a Bayesian tree confusingly made via IND (similarly to what Zarndt seems to have done);
3 And the full book is downloadable for free from http:// www.amsta.leeds.ac.uk/~charles/statlog/
A. Jamain, D.J. Hand / Pattern Recognition Letters 26 (2005) 1752–1760
(2) a specifically programmed routine, or one from another package, implementing the correct statistical concept but erroneously reported as from IND; (3) a routine of the IND package which has escaped our inspection, or maybe which has been deleted since then (the package went from version 1.0 at the time of Statlog to 2.1 at present). However, the third hypothesis is very unlikely, given that the package is allegedly for the Ôcreation and manipulation of decision trees from dataÕ,4 and that the history of the modifications of IND does not make any reference to such a routine. Now, the question is: can we find any evidence supporting either of the other two hypotheses?
5. First witness: Direct comparison The uncertainty described above risks creating a confusing picture of the performance of Naive Bayes. Can we resolve it by finding evidence of what Naive Bayes is in Statlog by comparing results from Statlog with those from other studies, where we can ascertain—as far as possible—that the correct implementation of Naive Bayes has been used? We know of six other such studies (of course it is likely that there are many other studies in the literature that include some results related to Naive Bayes, but these are the ones we included in our meta-analysis; Jamain, 2004): • (Asparoukhov and Krzanowski, 2001): 30 results on 5 non-UCI datasets which have not been used elsewhere, to the best of our knowledge. • (Titterington et al., 1981): 12 results on the nonUCI head dataset which has not been used in Zarndt, but has been in Statlog.
4 From the IND website http://ic.arc.nasa.gov/projects/bayesgroup/ind/IND-program.html
1755
• (Kontkanen et al., 1998): 9 results on 9 different datasets, three of which have been used in Statlog (Cr.Aust—Australian credit scoring—and Diab—Pima indian diabetes—, Heart—heart disease, two-class version). • (Weiss and Kapouleas, 1989): 4 results on 4 different datasets. None of them have been used in Statlog and 3 of them have been by Zarndt (Ann-thyroid, Iris, Breast Yugoslavian). • (Cestnik et al., 1987): 4 results on 4 different datasets (Lymphography, Primary, Breast Yugoslavian, Hepatitis), all used by Zarndt but none in Statlog. • (Clark and Niblett, 1987): 3 results on 3 datasets (Breast Yugoslavian, Lymphography, Primary), again none of them are in Statlog, but all in Zarndt. Most of the studies do not share any dataset with Statlog, and hence an investigation of the similarity between Naive Bayes in Statlog and Naive Bayes in other studies can only be based a priori on the unique dataset of Titterington and 3 datasets of Kontkanen. Unfortunately, it turns out that the evidence is even weaker than this suggests: the results from the Heart dataset in Kontkanen are not comparable with those of Statlog since the two studies use different misclassification cost matrices (identity in Kontkanen, non-identity in Statlog). The other two datasets of Kontkanen are used with the same cost matrix and the same pre-processing as Statlog. As far as the head dataset common to Titterington and Statlog goes, although both studies use the same cost matrix and the same set of variables (set III in Titterington), there are a number of discrepancies between how the dataset has been processed which makes a completely fair comparison impossible. Indeed, the Titterington version has 1000 examples, split half-and-half between training and test samples, and no missing values replaced (methods included in the study were able to deal with missing values directly), whereas the Statlog version has only 900 examples, was processed by ninefold crossvalidation, and more importantly had all its missing values replaced by class medians.
1756
A. Jamain, D.J. Hand / Pattern Recognition Letters 26 (2005) 1752–1760
Table 1 Direct comparison of error rates between Naive Bayes in Statlog and Naive Bayes in other studies Dataset
Statlog
Other study
Statlog MAD
Cr.Aust Pima Head
0.151 0.262 23.950
0.150 (Kontkanen et al., 1998) 0.243 (Kontkanen et al., 1998) 21.900 (Titterington et al., 1981)
0.019 0.021 9.990
Misclassification costs are equal for Cr.Aust and Pima, and non-equal for Head (for the exact form of the cost matrix see Titterington et al., 1981, p. 154; or Michie et al., 1994, p. 150).
Anyway, completely fair or not, the direct evidence of the comparative performance of Naive Bayes amounts to a set of three pairwise comparisons, which we reproduce in Table 1. To try to make some sense of these numbers, we also show an estimate of the dispersion of observed results for each dataset. For a given dataset we took this estimate to be the MAD (the median absolute deviation about the median) of all the results related to the dataset in the Statlog study. Results by dataset are typically right-skewed; hence, the choice of the MAD estimator over the classical standard deviation—but using standard deviation would not change our conclusion anyway. Since all other results are within one MAD of Statlog there seems to be very little discrepancy of performance between the different applications of Naive Bayes. This tends to support our second hypothesis, which is that the mistake in Statlog was only in the reporting. However, this is only a very small sample of datasets on which to base any assertion, and besides, had we found some discrepancy, we could not have concluded anything due to the tendency of the data to be dispersed anyway. Indeed, to take one example, error rates for the real Naive Bayes on the Breast Yugoslavian dataset show more variation between themselves than with the presumably false Naive Bayes of Zarndt: 0.28 for Weiss and Kontkanen, 0.38 for Clark, 0.22 for Cestnik, and in the middle of all 0.31 for Zarndt. Hence, we have now to look for further evidence, which will have to be indirect.
6. Second witness: Overall accuracy One question which we may try to answer to shed some (indirect) light on the problem is the following: is there a difference in overall accuracy of
Naive Bayes between Statlog and the other studies? In other words, we may consider the global performance of Naive Bayes within each study instead of looking at results for individual datasets. A major difficulty with this approach is how to create some measure of this global performance. Bearing in mind that it has to remain simple (due to the small size of some studies), here is our proposal: • Within each study, scale each Ôset of comparable resultsÕ (see below) between 0 and 1, with 0 being the best and 1 the worst. • Still within each study, average all these scaled results by method. This gives a simple estimate of the overall performance of each method within each study, and we call this estimate Ôoverall error rateÕ. What we call a Ôset of comparable resultsÕ is, broadly speaking, a set of results which are on the same scale. Such a set usually consists of all the results related to a given dataset in a study, but not always. For example, when within a study a dataset is tested with different sets of variables, or different accuracy measures (e.g. different cost matrices), then the corresponding sets of results are not comparable in the strict sense because they are not on the same scale. Hence these sets need to be distinguished before scaling to [0, 1]. For example, Titterington et al. (1981) use four different accuracy measures (error rate with equal misclassification costs, error rate with non-equal misclassification costs, average logarithmic score, average quadratic score) and four different sets of variables, all with the same dataset head. This amounts to 16 different Ôsets of comparable resultsÕ. There are even more subtleties when one considers a study such as (Asparoukhov and
Fig. 2. Overall error rate of Naive Bayes and other methods in the literature (results for Naive Bayes are in large plain diamonds).
In this particular study, methods are used with different class priors. Since using different priors is equivalent to using different cost matrices, each set of results related to a given set of priors constitutes a different 'set of comparable results'. We show overall error rates for all methods in Fig. 2, including those of Zarndt for comparison. The striking feature is that Naive Bayes seems to perform much worse in Statlog than in other studies. Of course, the overall error rate is very variable in the case of small studies (and there are quite a few of them), and it is also relative to the particular choice of datasets and methods within each study, as we will see in the next section. For example, Naive Bayes could appear bad when compared with certain methods and good with others. At first sight, anyway, this would be an argument for our first hypothesis, which is a confusion between methods in Statlog; however, before drawing a perhaps hasty conclusion we should look more closely at the results of Statlog to see why Naive Bayes appears to perform so badly there.
7. Third witness: Inside Statlog

Using an approach similar to that which initially led to our suspicions about Naive Bayes in the Zarndt study, we may look at the clusters of methods in Statlog (Fig. 3).
Fig. 3. Hierarchical clustering of the Statlog study (using complete linkage).
These are quite interesting on their own; one can see clearly differentiated clusters for decision trees and rules (IndCART, Baytree, CART, Cal5, AC2, C4.5, NewID, CN2), statistical and neural methods (Discrim, Logdisc, SMART, Cascade, DIPOL92, Backprop), and other more minor patterns. Only the quadratic discriminant Quadisc, the rule-induction algorithm Itrule, and radial-basis functions RBF seem quite out of their logical place. Concerning Naive Bayes, it is located close to the Bayes causal network CASTLE, and away from all the decision trees and rules (except the aforementioned Itrule). Notably far away are the other methods taken from IND, namely IndCART and Baytree. This is of course strong evidence against our first hypothesis, and suggests that it was the correct statistical method which was used in Statlog. Now, to return to the question above, what can explain the difference between the overall accuracy of Naive Bayes in Statlog and in other studies? In fact, a possible explanation is simply in the choice of datasets that the different studies have made. For example, one noticeable fact is that all the smaller studies used medical diagnosis datasets, which usually incorporate a considerable amount of prior knowledge in variable selection. In particular, it is common in such situations that the variables have relatively little correlation (on the principle that highly correlated variables are likely
to contribute less unique information about class separability; see Hand and Yu, 2001). In contrast, Statlog has a wider range of datasets, including for example image datasets, and Statlog's Naive Bayes does badly on these. In Fig. 4 we have shown the 0–1 scaled error rates, where for each dataset the best method of Statlog is given 0 and the worst 1, as before in our measure of overall error rate. Naive Bayes is the worst method for five datasets: Vehicle (vehicle silhouette recognition), SatIm (satellite imaging), Dig44 (optical digit recognition), Cut50 (character segmentation from handwritten word images), and BelgII (industrial dataset of unknown domain). Among these, the first four are image-related, and the origins of the last one are unknown. Besides, Naive Bayes is not far from the worst method on KL (a processed version of Dig44), Letter (letter recognition), and Cut20 (a processed version of Cut50). In contrast, Naive Bayes is best or close to the best for Heart and Head, two medical diagnosis datasets, and does quite well on the credit scoring datasets also (Cr.Aust, Cr.Ger, Cred.Man). In fact, it is possible to go a bit further, thanks to the availability of some dataset characteristics in Statlog (pp. 171–172). If one looks at the averaged absolute correlation between variables (Fig. 5), one can see a fairly clear relationship between the performance of Statlog's Naive Bayes and this particular dataset characteristic. More precisely, it
Fig. 5. Averaged between-variable correlation for each Statlog dataset (same dataset order as Fig. 4—hence Naive Bayes has poor performance on the datasets on the right).
seems that the datasets on which it does relatively well are those with a very low averaged absolute correlation, and those on which it does badly tend to have high values for this characteristic. One may remark that some of these do not (BelgII is an example), but having a low value of this averaged absolute correlation does not necessarily mean that all pairs of variables are weakly correlated, and some other factor may play a part as well. Anyway, it seems that the method tends to do badly when variables are correlated and to do well when they are not, providing a further piece of evidence that the correct Naive Bayes method was used in Statlog.
Fig. 4. Scaled misclassification rate of Naive Bayes for each dataset in Statlog (datasets ordered by misclassification rate).

8. The verdict
In the present communication we have presented all the evidence that has led us to doubt what 'Naive Bayes' exactly is in two large comparative studies, Zarndt and Statlog. Now that we have studied the available evidence, perhaps we should tentatively draw a conclusion. Let us first recall the main points of this evidence.

For Zarndt's Naive Bayes:
• In hierarchical clustering, presence in a decision tree cluster.
• Explicit statement about the use of a computer routine from the IND decision tree package.

For Statlog's Naive Bayes:
• Vague statement about the use of a computer routine from IND.
• Similarity of performance with other reported results from 'true' Naive Bayes on three datasets.
• In hierarchical clustering, presence in a small cluster, next to a Bayes causal network method.
• Poor overall accuracy, but explained by the presence of datasets with high between-variable correlation.

Perhaps the reader will have drawn their own conclusion by now, but we believe that these points suggest that:
(1) Zarndt's Naive Bayes is not the correct method.
(2) Statlog's Naive Bayes is the right method, but reported with a wrong provenance.

This is a rather reassuring conclusion for the Naive Bayes method, and also for the prestige of the Statlog study. However, we may wonder where the reporting mistake came from. We suggest that it probably has something to do with the organization of the Statlog project: apparently, all datasets were sent to each research group, and each had some relative liberty in choosing which method they were going to use. There could thus have been some omission or confusion in the reporting of one particular group (the one which used the other methods from the IND package). On a slightly broader note, the moral of the story we have told is perhaps twofold: (a) mistakes do happen, even in well designed studies, and (b) if they do happen they might be propagated by further studies, in the same form as they first occur or in another form. The last point is perhaps the crucial one: one could argue that a single mistake is in itself not really harmful, but that the tendency of researchers to refer relatively blindly to previous work is. The present story illustrates that accurate reporting is a very important part of comparative studies. It perhaps also illustrates Murphy's Law: 'if something can go wrong, it will'.

Acknowledgements
The work described in this paper was sponsored by the MOD Corporate Research Programme, CISP. We would like to express our appreciation to Andrew Webb for his support and encouragement for this work.

Appendix A. The real Naive Bayes

The Naive Bayes method originally tackles problems where variables are categorical, although it has natural extensions to other types of variables. It assumes that variables are independent within each class, and simply estimates the probability of observing a certain value in a given class by the ratio of its frequency in the class of interest over the prior frequency of that class. That is, for any class c and vector X = (X_j), j = 1, . . ., k, of categorical variables,

$$P(X \mid c) = \prod_{j=1}^{k} P(X_j \mid c)$$
and

$$P(X_j = x \mid c) = \frac{\#\{\text{training examples of class } c \text{ with } x_j = x\}}{\#\{\text{training examples of class } c\}}.$$

Continuous variables are generally discretised, or a certain parametric form is assumed (e.g. normal). One can also use non-parametric density estimators like kernel functions. Then similar frequency ratios are derived.
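A minimal sketch of this frequency-ratio estimation for categorical variables (the function and variable names are ours; no smoothing is applied, exactly as in the plain method described above):

from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """X: list of tuples of categorical values; y: list of class labels.
    Returns class priors and a P(X_j = v | c) lookup function."""
    n = len(y)
    class_counts = Counter(y)
    cond = defaultdict(Counter)            # (variable j, class c) -> value counts
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            cond[(j, c)][v] += 1
    priors = {c: k / n for c, k in class_counts.items()}
    def p_xj_given_c(j, v, c):             # frequency in class / class frequency
        return cond[(j, c)][v] / class_counts[c]
    return priors, p_xj_given_c

def likelihood(xs, c, p_xj_given_c):
    """P(X = xs | c) under the within-class independence assumption."""
    p = 1.0
    for j, v in enumerate(xs):
        p *= p_xj_given_c(j, v, c)
    return p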
References

Asparoukhov, O.K., Krzanowski, W.J., 2001. A comparison of discriminant procedures for binary variables. Comput. Statist. Data Anal. 38, 139–160.
Buntine, W., 1993. IND documentation, version 2.1. NASA Ames Research Center. Available from: .
Cestnik, B., Kononenko, I., Bratko, I., 1987. ASSISTANT 86: A knowledge-elicitation tool for sophisticated users. In: Progress in Machine Learning: Proc. EWSL-87. Sigma Press, Bled, Yugoslavia, pp. 31–45.
Clark, P., Niblett, T., 1987. Induction in noisy domains. In: Proc. 2nd Eur. Work Session on Learning. Sigma Press, Glasgow, Scotland, pp. 11–30.
Dietterich, T.G., 2000. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learn. 40, 139–157.
Duin, R.P.W., 1996. A note on comparing classifiers. Pattern Recognition Lett. 17, 529–536.
Hand, D., Yu, K., 2001. Idiot's Bayes—not so stupid after all? Internat. Statist. Rev. 69, 385–398.
Jamain, A., 2004. Meta-analysis of classification methods. Unpublished Ph.D. thesis, Department of Mathematics, Imperial College, London.
Kontkanen, P., Myllymaki, P., Silander, T., Tirri, H., 1998. Bayes optimal instance-based learning. In: 11th Eur. Conf. on Machine Learning. Springer-Verlag, Berlin, pp. 77–88.
Michie, D., Spiegelhalter, D.J., Taylor, C.C., 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood.
Salzberg, S.L., 1997. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining Knowledge Discovery 1, 317–328.
Titterington, D.M., Murray, G.D., Murray, L.S., Spiegelhalter, D.J., Skene, A.M., Habbema, J.D.F., Gelpke, G.J., 1981. Comparison of discrimination techniques applied to a complex data set of head injured patients. J. Roy. Statist. Soc. Ser. A 144, 144–175.
Weiss, S.M., Kapouleas, I., 1989. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. In: IJCAI89: Proc. 11th Internat. Joint Conf. on Artificial Intelligence. Morgan Kaufmann, San Mateo, CA, pp. 781–787.
Zarndt, F., 1995. A comprehensive case study: An examination of machine learning and connectionist algorithms. Available from: .
Pattern Recognition Letters 26 (2005) 1761–1771 www.elsevier.com/locate/patrec
Bayesian network classification using spline-approximated kernel density estimation

Yaniv Gurwicz, Boaz Lerner *

Pattern Analysis and Machine Learning Lab, Department of Electrical and Computer Engineering, Ben-Gurion University, P.O. Box 653, 84105 Beer-Sheva, Israel

Received 7 November 2004; received in revised form 7 November 2004
Available online 8 April 2005
Communicated by E. Backer
Abstract

The likelihood for patterns of continuous features needed for probabilistic inference in a Bayesian network classifier (BNC) may be computed by kernel density estimation (KDE), letting every pattern influence the shape of the probability density. Although usually leading to accurate estimation, KDE suffers from a computational cost that makes it impractical in many real-world applications. We smooth the density using a spline, thus requiring for the estimation only very few coefficients rather than the whole training set, allowing rapid implementation of the BNC without sacrificing classifier accuracy. Experiments conducted over several real-world databases reveal acceleration in computational speed, sometimes by several orders of magnitude, in favor of our method, making the application of KDE to BNCs practical. 2005 Elsevier B.V. All rights reserved.

Keywords: Bayesian networks; Classification; Kernel density estimation; Naïve Bayesian classifier; Spline

0167-8655/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2004.12.008
1. Introduction

1.1. Density estimation for Bayesian network classifiers

* Corresponding author. E-mail addresses: [email protected] (Y. Gurwicz), [email protected] (B. Lerner).

A Bayesian network (BN) represents the joint probability distribution (density) p(X) over a set of n domain variables X = {X1, . . ., Xn} graphically (Pearl, 1988; Heckerman, 1995). An arc and a lack of an arc between two nodes in the graph demonstrate, respectively, dependency and independency between the variables corresponding to these nodes (Fig. 1). A connection between Xi and its parents Pai in the graph is quantified probabilistically using the data. A node having no parents embodies the prior probability of the corresponding variable. By ordering the variables topologically, extracting the general factorization of this ordering (using
Fig. 1. A graph of an example Bayesian network. Arcs manifest dependencies between nodes representing variables.
the chain rule of probability) and applying the directed Markov property, we can decompose the joint probability distribution (density)

$$p(\mathbf{X}) = p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid Pa_i). \quad (1)$$
The naïve Bayesian classifier (NBC) is a BN used for classification, thus belonging to the Bayesian network classifier (BNC) family (John and Langley, 1995; Heckerman, 1995; Friedman et al., 1998; Lerner, 2004). It predicts a class C for a pattern x using Bayes' theorem

$$P(C \mid \mathbf{X} = \mathbf{x}) = \frac{p(\mathbf{X} = \mathbf{x} \mid C)\, P(C)}{p(\mathbf{X} = \mathbf{x})} \quad (2)$$

i.e., it infers the posterior probability that x belongs to C, P(C|X = x), by updating the prior probability for that class, P(C), by the class-conditional probability density, or likelihood, for x to be generated from this class, p(X = x|C), normalized by the unconditional density (evidence), p(X = x). The NBC represents a restrictive assumption of conditional independence between the variables (domain features) given the class, allowing the decomposition and computation of the likelihood employing local probability densities

$$p(\mathbf{X} \mid C) = \prod_{i=1}^{n} p(X_i \mid C). \quad (3)$$
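In code, inference via (2) and (3) amounts to multiplying per-feature class-conditional densities and normalizing; a sketch of ours, where the density estimators are passed in as callables (an assumption on our part, since the paper considers several estimators):

import numpy as np

def nbc_posterior(x, priors, densities):
    """x: 1-D feature vector; priors: {class: P(C)};
    densities: {class: list of per-feature p(x_i | C) callables}.
    Implements Eqs. (2)-(3): posterior is prior times product of likelihoods,
    normalized by the evidence."""
    scores = {}
    for c, prior in priors.items():
        lik = np.prod([f(xi) for f, xi in zip(densities[c], x)])
        scores[c] = prior * lik
    evidence = sum(scores.values())        # p(X = x), the normalizer
    return {c: s / evidence for c, s in scores.items()}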
Estimating probability densities of variables accurately is a crucial task in many areas of machine learning (Silverman, 1986; Bishop, 1995).
While estimating the probability distribution of a discrete feature is easily performed by computing the frequencies of its values in a given database, the probability density of a continuous feature taking any value in an interval cannot be estimated similarly, thus requiring other, more complex methodologies. This is a major difficulty in the implementation of BNCs (John and Langley, 1995; Friedman et al., 1998; Elgammal et al., 2003; Lerner, 2004), and it requires either discretization of the variable into a collection of bins covering its range (Heckerman, 1995; Friedman et al., 1998; Yang and Webb, 2002; Malka and Lerner, 2004) or estimation using parametric, non-parametric or semi-parametric methods (John and Langley, 1995; Lerner, 2004). Discretization is usually chosen for problems having small sample sizes that cannot guarantee accurate density estimation (Yang and Webb, 2002). Noticeably, prediction based on discretization is prone to errors due to loss of information. Generally, the accuracy discretization methods provide will peak for a specific range of bin sizes, deteriorating as one moves away from the center of this range (Malka and Lerner, 2004). Too small a number of bins will oversmooth the estimated density, and too large a number of bins will lead to the curse of dimensionality, resulting in performance worsening in both cases. Besides, a too large number of bins will overload the calculation. In parametric density estimation we assume a model describing the density and look for the optimal parameters for this model. For example, for a Gaussian model we ought to estimate the data mean and variance. A single Gaussian estimation (SGE) is straightforward to implement and bears almost no computational load for the NBC, but its accuracy declines with the degree of deviation of the data from normality, which is expected in many real-world problems (John and Langley, 1995; Lerner, 2004). Extending parametric density estimation using Bayesian approaches (Heckerman, 1995), we update an a priori probability (e.g., a Dirichlet prior) on the parameters using the likelihood for the data, thus combining prior and acquired knowledge jointly. However, when enough data is available (and the number of parameters is not too large) the likelihood in Bayesian estimation
approaches quickly hammers the priors, making these approaches somewhat redundant.

1.2. Non-parametric density estimation using kernels

Non-parametric methods of density estimation assume no model at hand generating the data but allow the data itself to determine the density. The most common non-parametric method is kernel density estimation (KDE) (Silverman, 1986), computing the density by a linear combination of S kernel functions K having width h that are allocated around each training data point x_t, t = 1, . . ., S. Based on these S points, the one-dimensional KDE p_S(x) of p(x), required in order to compute each of the class-conditional probability densities on the right hand side of (3), is

$$p_S(x) = \frac{1}{S h} \sum_{t=1}^{S} K\!\left(\frac{x - x_t}{h}\right), \quad (4)$$

$$\int_{-\infty}^{+\infty} K(u)\,du = 1 \quad \text{and} \quad K(u) \ge 0 \;\; \forall u \quad (5)$$
such that it is strongly pointwise consistent, i.e.,

$$\int |p_S(x) - p(x)|\,dx \to 0 \quad \text{as } S \to \infty, \quad (6)$$

which means that in the limit, a posterior probability based on KDE (2) produces the Bayes optimal classification error rate. A kernel commonly used in KDE is the standard Gaussian, which when used with a width of h = 1/√S renders the KDE strongly pointwise consistent (John and Langley, 1995). Fig. 2 demonstrates SGE and KDE in comparison to a histogram representation of the Gamma-glutamyl transpeptidase feature of the liver-disorders database of the UCI repository (Merz et al., 1997). The KDE tracks the histogram accurately while the SGE fails to reconstruct the histogram, skewing toward the tail of the distribution. More evidence of the superiority of KDE over SGE for non-normal distributions in the context of the NBC can be found later in this paper and in John and Langley (1995) and Lerner (2004).
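A direct implementation of the one-dimensional estimator (4) with the standard Gaussian kernel and h = 1/√S takes only a few lines; this sketch (ours) is deliberately unoptimized so that the cost of touching all S training points per query, which motivates the rest of the paper, is explicit:

import numpy as np

def kde_gaussian(x, train, h=None):
    """Direct KDE of Eq. (4) at query points x, given 1-D training data."""
    train = np.asarray(train, dtype=float)
    S = len(train)
    if h is None:
        h = 1.0 / np.sqrt(S)               # width rendering the KDE consistent
    u = (np.asarray(x, dtype=float)[:, None] - train[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # standard Gaussian kernel
    return k.sum(axis=1) / (S * h)         # every training point contributes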
Fig. 2. SGE and KDE in comparison to a histogram representation of the gamma-glutamyl transpeptidase feature of the liver-disorders database.
Although providing superior accuracy for the NBC, the KDE suffers from extensive computational cost, limiting its implementation in real-world applications. Using KDE for the NBC, a class-conditional density for the ith variable, X_i, and kth class is computed for the mth test pattern, x^tst_im, using all training patterns x^tr_it:

$$p(X_i = x^{tst}_{im} \mid C = k) = \frac{1}{N_{trk}} \sum_{t=1}^{N_{trk}} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x^{tst}_{im} - x^{tr}_{it})^2}{2\sigma^2}\right) \quad (7)$$

for a Gaussian kernel having a width σ around each of the N_trk training patterns of class k. Thus, the time complexity of estimating the likelihood employing KDE is O(Nts · Ntr · Nf · Nd) for Nts test patterns, Ntr training patterns, Nf features (variables) and Nd the number of calculations involved in computing a Gaussian, which for the common case Ntr ≫ Nc, for Nc classes, is much larger than O(Nts · Nc · Nf · Nd), the complexity of SGE.

1.3. Related methodologies

To alleviate the complexity and enable fast implementation of non-parametric density estimation methods, several approaches have been developed. Since usually many of the kernels are close to each other in feature space, binning (gridding) methods (Silverman, 1986; Jianqing and Marron, 1994; Gray and Moore, 2003) reduce the number of kernel evaluations by chopping each dimension into a number of intervals (bins), M, and representing all training points falling within an interval using a single kernel established employing all of these points. The problem is that M must be large to maintain precise estimation, and the number of grid points increases as M^Nf (Gray and Moore, 2003). If, however, M is not large enough, the estimation may lose its accuracy. Silverman (1982, 1986) proposes an elaboration of binning using a fast Fourier transform performing discrete convolution to combine the grid counts and kernel weights. However, because a grid still underlies the method, it suffers from explosive scaling and error limitations (Gray and Moore, 2003). The fast Gauss transform (FGT) algorithm (Strain, 1991; Elgammal et al., 2003) expands the exponential of (7) using a Hermite series having a small number of terms around a small
number of centers of boxes clustering the training points. The fast multipole algorithm (FMA) (Greengard, 1988) relies on a spatial decomposition that separates the collection of patterns into regions. The effects of distant regions on test patterns are computed by the multipole expansion, and the effect of nearby regions is computed directly. Lambert et al. (1999) cast (7) using a Taylor expansion to a specific order, evaluating the approximation at a cost related to this order rather than to the size of the training set. Hoti and Holmstrom (2004) transform the data using principal component analysis (PCA) into non-Gaussian and Gaussian parts corresponding to the most and least significant PCA eigenvalues, respectively, and then apply density estimation only to the non-Gaussian part. This approach can relieve computational cost, although the calculation of the non-Gaussian part of the data is still needed. Moore et al. (1997) suggest a tree in which each node summarizes the relevant statistics of all the data points below it in the tree. Using this multiresolution data structure saves the need to employ most of the training points, increasing the speed of kernel regression. Unfortunately, none of the approaches developed to alleviate KDE and enable its fast implementation has ever been applied to BNs. Moreover, all of these methods aim at resolving the curse of dimensionality, which is unnecessary given the NBC decomposition (3). In this study, we propose a spline smoother to reduce the computational burden in KDE, making probabilistic inference using the NBC feasible for real-world applications. Section 2 of the paper describes the spline smoother and its application to KDE for the NBC. Section 3 outlines our experiments and their results for synthetic and real-world databases, while Section 4 concludes the paper.
2. Spline-approximated KDE for BNCs

Our approach differs from previous methods for diminishing KDE computationally: it relies on composing a spline from low-order polynomials, each of which smooths the density over a small interval, resulting in an approximation of the whole density using very few coefficients.
2.1. The spline smoother

Splines have been used in many applications, such as medical imaging (Wang and Amini, 2000), video segmentation (Precioso and Barlaud, 2002), image encoding and decoding (Wang et al., 2001) and moments of free-form surfaces (Soldea et al., 2002). Splines are smooth piecewise polynomial functions employed to approximate smooth functions locally (de Boor, 1978). The spline is used over a large interval for which a single polynomial approximation would require a high degree, complicating the implementation and possibly overfitting the data. Given the data y(d_1), . . ., y(d_P) with a = d_1 < · · · < d_j < · · · < d_P = b, we establish a piecewise interpolant f to y such that f agrees with low-degree polynomials f_j(x) on sufficiently small intervals [d_j, d_{j+1}], i.e.,

$$f(x) = f_j(x) \quad \text{for } d_j \le x \le d_{j+1}, \;\; \forall j = 1, \ldots, P-1 \quad (8)$$

and the jth polynomial f_j(x) coincides with y on the interval edges and its derivatives there satisfy some slope conditions set by the interpolation method being used. Using local polynomial coefficients a_{jl} derived from these slope conditions (de Boor, 1978), the polynomial of order N describing y within the jth interval is

$$f_j(x) = \sum_{l=1}^{N} (x - d_j)^{N-l}\, a_{jl}. \quad (9)$$

For example, a piecewise cubic function f agrees with y at d_1, . . ., d_P, is continuous and has a continuous first derivative on [a, b]. It makes use of cubic polynomials (N = 4):

$$f_j(x) = (x - d_j)^3 a_{j1} + (x - d_j)^2 a_{j2} + (x - d_j)\, a_{j3} + a_{j4}. \quad (10)$$

Keeping some boundary conditions at d_1, . . ., d_P enables composition of these low-order polynomials into a smooth piecewise polynomial function called a spline. Fig. 3 demonstrates such a composition of a spline from low-order polynomials. By approximating KDE using a spline instead of implementing it directly, we utilize those very few coefficients of the spline instead of the training set shaping the KDE, thus eliminating the computational complexity of KDE and facilitating classification using the NBC. Fig. 4 shows an example in which a cubic spline smoother of KDE provides an approximation identical to direct KDE for a section of the weight percent of sodium in oxide feature of the UCI Repository (Merz et al., 1997) Glass database (top). The figure also shows that the residual, i.e., the difference between the two densities, is negligible (bottom).

Fig. 3. A composition of a spline (bottom) from low-order polynomials (top) (inspired by the MathWorks MATLAB documentation).

Fig. 4. Cubic spline smoother approximating KDE almost identically to direct KDE for a section of the weight percent of sodium in oxide feature of the Glass database (top), both having a negligible difference (residual) (bottom).
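As a sketch of the corresponding training-time step, one might sample a density estimator on a knot grid d_1, . . ., d_P and fit a cubic spline to those samples (here using scipy's CubicSpline; the knot count P, the interval bounds, and the function names are our assumptions):

import numpy as np
from scipy.interpolate import CubicSpline

def fit_spline_to_density(density, lo, hi, P=64):
    """Sample a 1-D density estimator at P knots d_1..d_P and return a
    cubic-spline smoother; after this step the training set behind
    `density` is no longer needed for evaluation."""
    d = np.linspace(lo, hi, P)                 # knots d_1..d_P
    y = np.asarray([density(x) for x in d])    # density samples y(d_j)
    return CubicSpline(d, y)                   # 4 coefficients per interval, cf. Eq. (10)

# Illustrative usage with a stand-in density (an assumption of ours):
# spline = fit_spline_to_density(lambda x: np.exp(-x*x/2)/np.sqrt(2*np.pi), -4, 4)
# spline(0.3)   # density estimate from local coefficients only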
2.2. Spline-approximated KDE

We suggest applying splines to KDE in order to ease probabilistic inference in NBCs. The spline smoother is applied during the test. After training, we compute, for each of the P−1 consecutive intervals within the estimation range of each variable, the N coefficients needed to approximate an Nth-order polynomial. We establish a (P−1) × N look-up-table (LUT) matrix, A, holding the a_{jl} coefficients, i.e., all the information needed for the estimation of this variable's density. The value of N should be large enough to ensure satisfactorily fitted curves, but not too large in order to avoid the curse of dimensionality and maintain the simple implementation using low-order polynomials. During the test of the mth pattern represented by the ith variable, x^tst_im, we employ the N coefficients corresponding to the jth interval beginning at d_j and coinciding with x^tst_im in order to evaluate the spline-based estimation for this test point:

$$f_{ji}(x^{tst}_{im}) = \sum_{l=1}^{N} (x^{tst}_{im} - d_{ji})^{N-l}\, a_{jli}, \quad (11)$$

where a_{jli} is the lth spline coefficient of the jth interval of the ith variable. Using spline-based KDE for the NBC, each class-conditional density of (3) for the ith variable and kth class is derived for the mth test pattern using (11) and N spline coefficients rather than using (7) and the whole training set. Thus, the time complexity of estimating the likelihood employing the spline-based approximation is O(Nts · Nf · Nc · Nn) for Nts test patterns, Nf features, Nc classes and Nn calculations involved in computing (11). Direct KDE has a complexity of O(Nts · Nf · Ntr · Nd) for Ntr training patterns and Nd calculations involved in a single Gaussian distribution in (7). Since Nd and Nn are of the same order, the predominant difference in computational cost between the two estimation methods is attributed to the difference between Ntr and Nc, where Ntr ≫ Nc. Moreover, for Nd ≈ Nn, the complexity of the spline-based KDE approximation is identical to that of SGE.
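A sketch of the test-time evaluation (11) from the LUT alone; the coefficient ordering follows Eq. (10) (highest power first) and all names are ours:

import numpy as np

def eval_spline_lut(x, knots, A):
    """Evaluate the spline-based density estimate (11) at scalar x.
    knots: sorted d_1..d_P; A: (P-1) x N LUT of coefficients a_jl."""
    j = np.searchsorted(knots, x, side='right') - 1
    j = min(max(j, 0), len(knots) - 2)     # clamp to the valid intervals
    dx = x - knots[j]
    # sum over l of (x - d_j)^(N-l) * a_jl, evaluated in Horner form
    result = 0.0
    for a in A[j]:
        result = result * dx + a
    return result

Only the N coefficients of one interval are touched per test point, which is exactly why the cost no longer depends on the training-set size.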
3. Experiments and results

3.1. Databases and methodology

We tested one synthetic and ten real-world databases with continuous features. The synthetic database has two classes and ten continuous features, each having several states sampled according to some a priori probability. Nine of the real-world databases are taken from the UCI repository, which is well documented (Merz et al., 1997). The remaining database is taken from a cytogenetic domain, including more than 3000 patterns of four classes of signals represented using twelve features of size, shape, color and intensity (Lerner et al., 2001). In the experiments, we employed cross-validation (CV10) and hold-out (2/3 of the data for training) methodologies for databases having less and more than 3000 patterns, respectively. Patterns with missing values were deleted from the database. Table 1 summarizes important characteristics of the real-world databases. In addition, we chose for the
Table 1
Characteristics of the experimented real-world databases

Database          Number of   Number of   Continuous/         Database   Experiment
                  classes     features    discrete features   size       methodology
Glass             7           9           9/0                 214        CV10
Iris              3           4           4/0                 150        CV10
Wine              3           13          13/0                178        CV10
Pima              2           8           8/0                 768        CV10
Ionosphere        2           33          32/1                351        CV10
Letter            26          16          16/0                20,000     Hold-out
Adult             2           14          6/8                 45,222     Hold-out
Liver disorders   2           6           6/0                 345        CV10
Image             7           18          18/0                210        CV10
Cytogenetics      4           12          11/1                3144       Hold-out
KDE the standard Gaussian kernel with a width of h = 1/√S (Section 1.2).
3.2. Sensitivity to spline order

We investigated the influence of the spline order on the estimation error and the NBC accuracy. For this purpose, we evaluated splines of orders N = [1, 4] approximating the KDE of variables of each database, in comparison to direct KDE and SGE. Fig. 5 shows direct KDE and its 4th order spline approximation coinciding with each other for an example synthetic database feature. The figure also shows the residuals between the 1st and 4th order spline-based approximations and direct KDE. A similar experiment was performed with the weight percent of sodium in oxide feature of the UCI Repository Glass database. Fig. 6 reveals densities approximated by 1st and 4th order splines in comparison to direct KDE for a section of the density. The figure demonstrates the accuracy of the spline (especially the cubic one) in approximating KDE. We also measured the mean squared error (MSE) between direct KDE and the spline-based KDE approximation (i.e., the average residual),

$$\mathrm{MSE} = \frac{1}{P} \sum_{j=1}^{P} \left[ y(d_j) - f(d_j) \right]^2 \quad (12)$$

for spline orders in [1, 4]. As presented in Table 2, the spline-based KDE approximation demonstrates a negligible MSE compared to direct KDE, especially for orders greater than one. Next, we conducted classification experiments on the real-world databases using the NBC employing SGE, KDE and the spline-based KDE approximation.

Fig. 5. Direct KDE and 4th order spline-based KDE approximation coinciding with each other for an example feature of the synthetic database (top), and the residuals between direct KDE and 1st and 4th order spline-based approximations for this feature (bottom).

Fig. 6. 1st (top) and 4th (bottom) order spline-based KDE approximations for a section of the weight percent of sodium in oxide feature of the Glass database in comparison to direct KDE.
Table 2
The MSE between direct KDE and spline-based KDE approximations having orders in [1, 4] for the weight percent of sodium in oxide feature of the Glass database

Spline order   MSE (×10⁹)
1              146
2              10.6
3              9.99
4              9.15
Table 3
The NBC classification accuracy (mean ± std, %) for different real-world databases when densities are based on 1st or 4th order spline KDE approximations, in comparison to SGE^a

Database          SGE             1st order spline   4th order spline
Glass             49.0 (±8.45)    40.7 (±7.37)       65.5 (±11.19)
Iris              96.0 (±4.42)    76.7 (±10.00)      95.3 (±5.21)
Wine              97.7 (±2.76)    64.0 (±9.81)       95.5 (±4.17)
Pima              76.0 (±4.89)    58.1 (±2.70)       69.4 (±3.67)
Ionosphere        82.9 (±3.42)    57.9 (±10.33)      92.3 (±3.85)
Letter            65.5            73.3               73.4
Adult             82.6            79.5               83.1
Liver disorders   56.0 (±10.23)   50.4 (±7.24)       64.3 (±5.66)
Image             62.9 (±8.73)    41.9 (±10.39)      70.6 (±9.15)
Cytogenetics      67.5            41.0               74.5

^a Accuracy based on the 4th order spline KDE approximation is identical to that based on direct KDE.
Table 3 demonstrates the superiority, for most databases, of the 4th order spline over the 1st order spline and SGE in approximating the density for the NBC. The accuracy achieved using the 4th order spline is identical to that achieved using direct KDE. In three of the databases (Iris, Wine and Pima), the feature distribution is close to normal; SGE thus reaches asymptotic performance sooner than KDE (i.e., with a smaller sample size), yielding better accuracy than KDE and therefore better than the spline approximation. In those infrequent occasions of close-to-normal data distribution, the spline-based KDE approximation cannot ease the sample-size sensitivity of KDE compared to SGE. However, in most real-world applications KDE, and thus the suggested spline-based KDE approximation, will outperform SGE, leading to more accurate NBC performance.
3.3. Acceleration and sensitivity to sample size

We measured the acceleration (i.e., the ratio) in NBC run-time due to the spline-based approximation with respect to direct KDE for increasing sample sizes. Table 4 shows the run-time (on an Intel P-II, 450 MHz processor with 192 MB RAM) using both techniques while classifying the synthetic database for sample sizes in the range [100–200 K], along with the corresponding accelerations. Fig. 7 demonstrates a sharper increase with sample size of the KDE run-time compared to that of the spline approximation, as well as the acceleration achieved utilizing the latter. The change of slopes in both graphs is attributed to the switch of methodologies from CV to hold-out (Section 3.1), as each methodology employs different numbers of training and test patterns.

3.4. Sensitivity to dimensionality

Fig. 8 demonstrates the effect of increasing dimensionality on the NBC classification run-time when the classifier utilizes direct KDE in comparison to the spline-based KDE approximation for 300 patterns of the synthetic database. The acceleration due to the spline-based KDE approximation in comparison to direct KDE is constant for all dimensions (i.e., 54 for this database).
Table 4
The NBC run-time on the synthetic database for increasing sample sizes using direct KDE and a 4th order spline-based KDE approximation, as well as the run-time acceleration achieved

Sample size   NBC run-time (s)               Run-time acceleration
              Direct KDE     Spline-based
100           81             3.83            21
200           323            7.5             43
300           723            10.1            72
600           2899           19.8            146
1000          8070           36.9            219
2500          41,231         81              509
10,000        196,810        100             1968
50,000        4,897,232      513             9547
100,000       19,633,152     1041            18,863
200,000       77,459,336     2126            36,434
Fig. 8. The NBC classification run-time using direct KDE and spline-based KDE approximation for increasing dimensionality showing constant acceleration.
Fig. 7. The NBC run-time for KDE and 4th order spline-based KDE approximation (top), and accelerations due to the spline approximation (bottom) for increasing sample sizes of the synthetic database.
3.5. Accelerations for real-world databases

Experimenting with real-world databases of the UCI Repository and the cytogenetic domain, we compare in Table 5 the run-times of the NBC employing direct KDE or a cubic spline KDE approximation, as well as the corresponding acceleration achieved using the latter. For all databases we observe significant run-time acceleration spanning from 1 to 4 orders of magnitude, where large databases benefit from the most pronounced acceleration. For example, classifying the Adult database having 45,222 patterns using direct KDE requires
Table 5
The NBC run-times using a 4th order spline-based KDE approximation and direct KDE, and the corresponding accelerations, for several real-world databases

Database          NBC run-time (s)          Run-time acceleration
                  Direct KDE    Spline
Glass             235           20          12
Iris              50            3.5         14
Wine              238           11.3        21
Pima              3103          19.3        161
Ionosphere        2120          40          53
Letter            841,510       4429        190
Adult             3,237,669     301         10,746
Liver disorders   530           7.3         73
Image             475           40          12
Cytogenetics      15,690        75          209
approximately 37 days, compared to 5 min using the spline-based KDE approximation, leading to a significant acceleration of more than 10^4. We note again that the NBC employing each of these two estimation methods achieves identical classification accuracy.
4. Discussion

Frequently, classification using BNCs within a domain having continuous variables requires density estimation. Non-parametric density estimation
using kernels is accurate but computationally expensive, since all training patterns participate in testing each unseen pattern, sometimes rendering the estimation impractical for real-world applications. We have presented a method based on a spline smoother approximating KDE that, instead of using the training set, utilizes the spline coefficients (only four in the case of a cubic spline), thus providing rapid evaluation of KDE. Moreover, spline-approximated KDE provides the KDE accuracy at the cost of SGE. Classification experiments with an NBC on synthetic and real-world databases revealed an increase with sample size of the acceleration achieved using the spline approximation compared to direct KDE. The experiments showed a pronounced decrease of classification run-time, sometimes by several orders of magnitude, while preserving the predictive accuracy of the classifier, thereby making the suggested method practical for real-world applications. Although demonstrated for the NBC, the method is useful in reducing time complexity in other applications involving non-parametric density estimation. Finally, it would be interesting to compare the spline to other approximations of KDE.

Acknowledgement

This work was supported in part by the Paul Ivanier Center for Robotics and Production Management, Ben-Gurion University, Beer-Sheva, Israel.

References

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
de Boor, C., 1978. A Practical Guide to Splines. Appl. Math. Sci., vol. 27. Springer-Verlag.
Elgammal, A., Duraiswami, R., Davis, L.S., 2003. Efficient kernel density estimation using the fast Gauss transform with applications to color modeling and tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25, 1499–1504.
Friedman, N., Goldszmidt, M., Lee, T.J., 1998. Bayesian network classification with continuous attributes: Getting the best of both discretization and parametric fitting. In: Proceedings of the 15th International Conference on
Machine Learning, San Francisco, CA. Morgan Kaufmann, pp. 179–187.
Gray, G., Moore, A.W., 2003. Nonparametric density estimation: toward computational tractability. In: SIAM International Conference on Data Mining.
Greengard, L., 1988. The Rapid Evaluation of Potential Fields in Particle Systems. MIT Press, Cambridge, MA.
Heckerman, D., 1995. A tutorial on learning Bayesian networks. Microsoft Research Technical Report. MSR-TR-95-06.
Hoti, F., Holmstrom, L., 2004. A semi-parametric density estimation approach to pattern classification. Pattern Recognition Lett. 37, 409–419.
Jianqing, F., Marron, J.S., 1994. Fast implementation of nonparametric curve estimators. J. Comput. Graphical Statist. 3, 35–57.
John, G.H., Langley, P., 1995. Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, pp. 338–345.
Lambert, C.G., Harrington, S.E., Harvey, C.R., Glodjo, A., 1999. Efficient online nonparametric kernel density estimation. Algorithmica 25, 37–57.
Lerner, B., 2004. Bayesian fluorescence in-situ hybridization signal classification. Artificial Intelligence in Medicine 30, 301–316 (A special issue on Bayesian Models in Medicine).
Lerner, B., Clocksin, W.F., Dhanjal, S., Hulten, M.A., Bishop, C.M., 2001. Feature representation and signal classification in fluorescence in-situ hybridization image analysis. IEEE Trans. Syst. Man Cybernet. A 31, 655–665.
Malka, R., Lerner, B., 2004. Classification of fluorescence in situ hybridization images using belief networks. Pattern Recognition Lett. 25, 1777–1785.
Merz, C., Murphy, P., Aha, D., 1997. UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine. Available from: .
Moore, A.W., Schneider, J., Deng, K., 1997. Efficient locally weighted polynomial regression predictions. In: Proceedings of the International Conference on Machine Learning, pp. 236–244.
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan-Kaufman.
Precioso, F., Barlaud, M., 2002. B-spline active contour with handling of topology changes for fast video segmentation. J. Appl. Signal Process. 6, 555–560.
Silverman, B.W., 1982. Kernel density estimation using the fast Fourier transform. J. Roy. Statist. Soc. Ser. C: Appl. Statist. 33, 93–97.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC.
Soldea, O., Elber, G., Rivlin, E., 2002. Exact and efficient computation of moments of free-form surface and trivariate based geometry. Computer-Aided Des. 34, 529–539.
Strain, J., 1991. The fast Gauss transform with variable scales. SIAM J. Scientific Statist. Comput. 12, 1131–1139.
Wang, L.J., Hsieh, W.S., Truong, T.K., Reed, I.S., Cheng, T.C., 2001. A fast efficient computation of cubic-spline interpolation in image codec. IEEE Trans. Signal Process. 6, 1189–1197.
Wang, Y.P., Amini, A.A., 2000. Fast computation of tagged MRI motion fields with subspaces. In: Proceedings of the IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, 119–126.
Yang, Y., Webb, G.I., 2002. A comparative study of discretization methods for naïve Bayes classifiers. In: Proceedings of PKAW, The 2002 Pacific Rim Knowledge Acquisition Workshop, Tokyo, Japan, 159–173.
Pattern Recognition Letters 26 (2005) 1772–1781 www.elsevier.com/locate/patrec
Qualitative real-time range extraction for preplanned scene partitioning using laser beam coding

Didi Sazbon a,*, Zeev Zalevsky b, Ehud Rivlin a

a Department of Computer Science, Technion—Israel Institute of Technology, Technion City, Haifa 32000, Israel
b School of Engineering, Bar-Ilan University, Israel

* Corresponding author. Tel.: +972 4 8266077; fax: +972 4 8293900. E-mail address: [email protected] (D. Sazbon).

Received 21 September 2004; received in revised form 20 February 2005
Available online 29 April 2005
Communicated by R. Davies
Abstract

This paper proposes a novel technique to extract range using a phase-only filter for a laser beam. The workspace is partitioned according to M meaningful preplanned range segments, each representing a relevant range segment in the scene. The phase-only filter codes the laser beam into M different diffraction patterns, corresponding to the predetermined range of each segment. Once the scene is illuminated by the coded beam, each plane in it would irradiate in a pattern corresponding to its range from the light source. Thus, range can be extracted at acquisition time. This technique has proven to be very efficient for qualitative real-time range extraction, and is most appropriate for mobile robot applications where a scene can be partitioned into a set of meaningful ranges, such as obstacle detection and docking. The hardware consists of a laser beam, a lens, a filter, and a camera, implying a simple and cost-effective technique. 2005 Elsevier B.V. All rights reserved.

Keywords: Range estimation; Laser beam coding

0167-8655/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.02.008
1. Introduction

Range estimation is a basic requisite in Computer Vision, and thus, has been explored to a
great extent. One can undoubtedly find a large quantity of range estimation techniques. These techniques vary in characteristics such as density, accuracy, cost, speed, size, and weight. Each technique could be suitable for a group of applications and, at the same time, completely inappropriate for others. Therefore, the choice of the most suitable technique usually depends on the specific requirements of the desired application. For
example: 3D modeling of an object might need both dense and accurate estimation, where cost and speed would not be critical. On the contrary, dense and accurate estimation might have less importance in collision-free path planning, where cost, speed, and mobility would be essential. Range sensing techniques can be divided into two categories: passive and active (Jarvis, 1983, 1993). Passive sensing refers to techniques using the environmental light conditions, i.e., those that do not impose artificial energy sources. These techniques include: range from focus/defocus, range from attenuating medium, range from texture, range from stereo, and range from motion. Active sensing refers to techniques that impose structured energy sources, such as light, ultrasound, X-ray, and microwave. These techniques include: ultrasonic range sensors, radar range sensors, laser sensors (time-of-flight), range from brightness, pattern light range sensors (triangulation), grid coding, and Moiré fringe range contours. The technique presented here fits in the pattern light category. Pattern light is commonly used in a stereo configuration in order to facilitate the correspondence procedure, which forms the challenging part of triangulation. Usually, one camera is replaced by a device that projects pattern light (also known as structured light), while the scene is grabbed by the other camera. A very popular group of techniques is known as coded structured light. The coding is achieved either by projecting a single pattern or a set of patterns. The main idea is that the patterns are designed in such a way that each pixel is assigned a codeword (Salvi et al., 2004). There is a direct mapping between the codeword of a specific pixel and its corresponding coordinates, so correspondence becomes trivial. Different types of patterns are used for the coding process, such as black and white, gray scale, and RGB (Caspi et al., 1998; Horn and Kiryati, 1999; Manabe et al., 2002; Pages et al., 2003; Sato and Inokuchi, 1987; Valkenburg and McIvor, 1998). Coded structured light is considered one of the most reliable techniques for estimating range, but since usually a set of patterns is needed, it is not applicable to dynamic scenes. When using only one pattern, dynamic scenes
might be allowed, but the results are usually of poor resolution. Additional techniques implementing structured light to assist the correspondence procedure include sinusoidally varying intensities, stripes of different types (e.g. colored, cut), and projected grids (Albamont and Goshtasby, 2003; Fofi et al., 2003; Furukawa and Kawasaki, 2003; Guisser et al., 2000; Je et al., 2004; Kang et al., 1995; Maruyama and Abe, 1993; Scharstein and Szeliski, 2003). These methods, although projecting only one pattern, still require a time-consuming search procedure. Recently, efforts to estimate range using pattern light and only one image were made. In Winkelbach and Wahl (2002), objects were illuminated with a stripes pattern, and surface orientation was first estimated from the directions and the width of the stripes; then shape was reconstructed from orientations. The drawback of this technique is that it works only for a single object, and the reconstruction is relative, i.e. no absolute range is known. In Lee et al. (1999), objects were illuminated with a sinusoidal pattern, and depth was calculated from the frequency variation. The drawback of this technique is its heavy computational time. Here, pattern light is used with only one image to directly estimate range. No correspondence (triangulation) is needed, and the setup consists only of a laser beam, a lens, a single mask, and a camera. The main concept is to partition the workspace into a set of range segments, in a way that is meaningful for a working mobile robot. The motivation lies in the fact that in order to perform tasks such as obstacle detection or docking, it should be sufficient that the robot is able to distinguish between a set of predefined ranges. The idea is to code a laser beam into different patterns, where each pattern corresponds to a specific range segment. Once a scene is illuminated by the coded beam, each patch in it would irradiate with the pattern that corresponds to its range from the light source. The beam coding is realized by one special phase-only filter, and consequently, the technique is accurate, fast (a hardware solution), cost-effective, and, in addition, fits dynamic scenes.
2. Qualitative real-time range extraction for preplanned scene partitioning using laser beam coding

The proposed technique is based on an iterative design of a phase-only filter for a laser beam. The relevant range is divided into M meaningful planes. Each plane, once illuminated by a laser beam that propagates through the phase-only filter, would irradiate in a different, predetermined, pattern. The pattern chosen here consists of gratings at M different angles (slits), as depicted in Fig. 1. Each range is assigned slits having a unique angle. Once a plane is illuminated, it irradiates with the angular slit pattern that is proportional to its range. The iterative procedure is based on the Gerchberg–Saxton (GS) algorithm (Gerchberg and Saxton, 1972; Zalevsky et al., 1996), as schematically illustrated in Fig. 2. What follows is a description of the general concept of the algorithm. Assume we have a function denoted by f(x, y); then f(x, y) can be represented as

$$f(x, y) = |f(x, y)| \exp\{i \cdot \phi(x, y)\} \quad (2.1)$$

where |f(x, y)| is the amplitude of f(x, y), and φ(x, y) is the phase of f(x, y). We denote the Fourier transform of f(x, y) by F(u, v); thus

$$F(u, v) = |F(u, v)| \exp\{i \cdot \Phi(u, v)\} \quad (2.2)$$
Fig. 1. A pattern of four gratings in different angles, each implying a different range.
Fig. 2. A schematic description of the GS algorithm to obtain phase information.
where |F(u, v)| is the amplitude of F(u, v), and Φ(u, v) is the phase of F(u, v). Assume |f(x, y)| and |F(u, v)| are determined in advance and are denoted by a(x, y) and A(u, v), accordingly. In order to retrieve the phase, φ(x, y), of f(x, y), we start with a random estimation of φ(x, y), denoted by φ1(x, y). Thus, f(x, y) is estimated by a(x, y) · exp{i · φ1(x, y)}. The following iterative procedure is performed next, until a satisfactory retrieval is achieved:

1. Fourier transform a(x, y) · exp{i · φk(x, y)} (the current estimation of f(x, y)), resulting in a function denoted by |Fk(u, v)| · exp{i · Φk(u, v)}.
2. Replace the magnitude, |Fk(u, v)|, of the resulting Fourier transform with A(u, v), yielding A(u, v) · exp{i · Φk(u, v)}.
3. Inverse Fourier transform A(u, v) · exp{i · Φk(u, v)}, resulting in a function denoted by ak(x, y) · exp{i · φk+1(x, y)}. Note, the phase component is the estimation for the next iteration.
4. Replace the magnitude, ak(x, y), of the resulting inverse Fourier transform with a(x, y), giving a new estimation of f(x, y): a(x, y) · exp{i · φk+1(x, y)}.

Although not proven mathematically, the algorithm is known to give excellent practical results.
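A compact sketch of this basic GS iteration using numpy's FFT (our own illustration; a and A are the prescribed magnitude arrays, and the iteration count is an arbitrary choice):

import numpy as np

def gerchberg_saxton(a, A, iters=200, seed=0):
    """Retrieve a phase phi such that |FT{a * exp(i*phi)}| ~ A."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(-np.pi, np.pi, a.shape)    # random initial phase
    for _ in range(iters):
        F = np.fft.fft2(a * np.exp(1j * phi))    # step 1: forward transform
        F = A * np.exp(1j * np.angle(F))         # step 2: impose |F| = A
        f = np.fft.ifft2(F)                      # step 3: inverse transform
        phi = np.angle(f)                        # step 4: keep the phase,
    return phi                                   #         reset magnitude to a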
If we would like to use the GS concept to design a phase-only filter such that illuminating it with a laser beam results in a predefined pattern, we would use a(x, y) = 1, and A(u, v) would be a function depicting the desired pattern in which the beam, while propagating in free space, would illuminate. Here, we would like to use that concept, but to create a phase-only filter that illuminates in a pattern that changes slightly as a function of range. Thus, the GS algorithm should be modified to comply with these changes (Levy et al., 1999). The modified procedure is as follows: let a(x, y) = 1, let Z_j(u, v) be the pattern assigned to the j-th plane (j = 1, 2, . . ., M), and start with φ_1(x, y), a random estimate of φ(x, y). Proceed with the following iterative procedure:

1. Fourier transform a(x, y)·exp{i·φ_k(x, y)}, resulting in a function denoted by A(u, v)·exp{i·Φ_k(u, v)}.
2. Set F̃ = 0.
3. Iterate M times:
3.1. Free-space propagate A(u, v)·exp{i·Φ_k(u, v)} to the range assigned to the pattern Z_j(u, v), and replace the resulting magnitude with Z_j(u, v), giving Z_j(u, v)·exp{i·Φ_k^{FSP}(u, v)}.
3.2. Free-space propagate Z_j(u, v)·exp{i·Φ_k^{FSP}(u, v)} back to the origin, and then inverse Fourier transform it, resulting in a function denoted by z_j(x, y)·exp{i·φ_k^j(x, y)}.
3.3. Add z_j(x, y)·exp{i·φ_k^j(x, y)} to F̃.
4. Set F_estimated = F̃/M = z(x, y)·exp{i·φ_{k+1}(x, y)}.
5. Replace the magnitude z(x, y) of F_estimated with a(x, y), giving a new estimate of f(x, y): a(x, y)·exp{i·φ_{k+1}(x, y)}.

This modified procedure is depicted in Fig. 3. Note the term free-space propagation used in step 3 of the procedure. The laser beam is propagated to the position of the plane Z_j(u, v) by multiplying its spatial spectrum by the term

FS(x, y; d_j) = exp{(2πi·d_j/λ)·√(1 − (λx/D)² − (λy/D)²)}   (2.3)

where d_j is the range of the plane from the origin, λ is the wavelength, and D × D are the dimensions (in meters) of the detected plane. Note also that the plane parameters (i.e., the number of planes, the size of a plane, the location of a plane, the distances between planes, which can vary, and the patterns to be displayed on the planes) can be defined to meet specific requirements of the phase-only filter. The expected behavior of the laser beam, once illuminated, and according to its physical characteristics, would be as follows. The beam would be homogeneous until it propagates and encounters the first predefined plane, where it would exhibit the first designed slit pattern. It would keep the same pattern while propagating along the first segment until encountering the second predefined plane, where it would exhibit the second designed slit pattern. It would keep the same pattern while propagating along the second segment, and so on. When it meets the last predefined plane, it keeps propagating indefinitely with its corresponding slit pattern. Note that the range segments can differ in length, and the partitioning need not be uniform. For example, a docking mobile robot might want to decelerate first at 30 m from the target, then at 5 m, and again at 1 m. The resulting phase-only filter would consist of 3 slit patterns, corresponding to the range segments of 1, 5, and 30 m. Thus, each filter should be designed with range segments that meet the needs of the relevant task, the specific working robot, and the particular workspace.
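The following NumPy sketch mirrors the modified loop, under the assumption that free-space propagation is realized by multiplying the spatial spectrum by the factor of Eq. (2.3) (with −d_j for the propagation back to the origin); names are illustrative and the evanescent region is simply clamped, so this is a sketch of the procedure rather than the authors' implementation.

```python
import numpy as np

def fs_factor(shape, d, lam, D):
    # Eq. (2.3): exp{(2*pi*i*d/lam) * sqrt(1 - (lam*x/D)**2 - (lam*y/D)**2)},
    # evaluated on the sample grid and clamped to the propagating region.
    ny, nx = shape
    x = np.arange(nx) - nx / 2
    y = (np.arange(ny) - ny / 2)[:, None]
    root = np.sqrt(np.maximum(0.0, 1 - (lam * x / D) ** 2 - (lam * y / D) ** 2))
    return np.exp(2j * np.pi * d / lam * root)

def multiplane_gs(patterns, ranges, lam, D, n_iter=100, seed=0):
    shape = patterns[0].shape
    rng = np.random.default_rng(seed)
    phi = rng.uniform(-np.pi, np.pi, size=shape)   # random phi_1(x, y)
    a = np.ones(shape)                             # a(x, y) = 1 (phase-only)
    for _ in range(n_iter):
        AU = np.fft.fft2(a * np.exp(1j * phi))     # step 1
        acc = np.zeros(shape, dtype=complex)       # step 2: F~ = 0
        for Zj, dj in zip(patterns, ranges):       # step 3, M times
            prop = AU * fs_factor(shape, dj, lam, D)     # 3.1: propagate
            prop = Zj * np.exp(1j * np.angle(prop))      #      impose Z_j
            back = prop * fs_factor(shape, -dj, lam, D)  # 3.2: back to origin
            acc += np.fft.ifft2(back)                    # 3.3: accumulate
        est = acc / len(patterns)                  # step 4: F~ / M
        phi = np.angle(est)                        # step 5: keep phase only
    return phi
```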
3. Results

The proposed technique was tested with a phase-only filter designed to exhibit the patterns depicted in Fig. 4 on six equally spaced planes positioned between 0.5 and 1 m from the light source. The range between two consecutive planes equals 0.1 m. A laser beam with a wavelength of 0.5 × 10⁻⁶ m (green light) was used. The physical size of the filter is 4 × 4 mm, and the beam was scattered in order to cover the whole filter. Using the technique described in Section 2, the resulting filter has values in the range [−π, π]. In order to save in production costs, the filter was quantized into two levels: π and 0.
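Since the paper does not state the quantization rule, the following one-liner is only a plausible sketch, mapping each phase in [−π, π] to whichever of the two fabrication levels is nearer.

```python
import numpy as np

def quantize_binary(phi):
    # Phases closer to +/-pi go to pi; phases closer to 0 go to 0.
    return np.where(np.abs(phi) > np.pi / 2, np.pi, 0.0)
```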
Fig. 3. A schematic description of the modifications to the GS algorithm presented here.
As can be seen throughout the experiments, the results are satisfying, while the production is extremely cost effective. Fig. 5 shows images depicting the patterns irradiated by the phase-only filter on planes positioned at ranges 0.5, 0.6, 0.7, 0.8, 0.9, and 1 m from the light source. Fig. 6 shows enlargements of the same images around the areas where the irradiation patterns are captured. The directions of the slits are clearly visible to the human eye. In order to automate the application and deduce the range from the assignment of a specific direction to a particular image, a simple procedure can be invoked. Fig. 7 demonstrates the simple steps of this procedure on Fig. 6b, which was
Fig. 4. The patterns that were chosen to irradiate on planes at range: (a) 0.5, (b) 0.6, (c) 0.7, (d) 0.8, (e) 0.9, and (f) 1 m from the light source.
Fig. 5. Images capturing the different patterns irradiating on planes (white paper) at range: (a) 0.5, (b) 0.6, (c) 0.7, (d) 0.8, (e) 0.9, and (f) 1 m from the light source. Note that the plane in image (b) seems to be darker due to the shadow falling from the computer screen placed on the left.
Fig. 6. Enlargements of images (a)–(f) from Fig. 5 around the areas consisting of the irradiation patterns depicting the directions of the slits.
chosen for demonstration. Since these images were taken with a non-calibrated camera, they need simple preprocessing. The first step, depicted in Fig. 7a, consists of rotating the images so that their patterns are aligned horizontally, and normalizing their color so that they are comparable with the correlation patterns. The correlation patterns are merely images of the six possible slits. In the next step, taking into consideration the fact that the laser is of a bright green color that pops out in the images, a simple threshold is applied, as depicted in Fig. 7b. Note that only the relevant information now remains. The third step is to correlate the image with the six possible patterns (i.e., slits in different directions) to get the maximum response on the most compatible one, as depicted in Fig. 7c. Table 1 summarizes the correlation values between the aligned, normalized, and thresholded images depicted in Fig. 6 and the slit patterns corresponding to the images in Fig. 4. As is clearly confirmed, the maximum correlation values correspond to the expected patterns. The numbers in the table are scaled by a constant value.
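A compact sketch of the last two steps, assuming the alignment of the first step has already been applied and that the six slit templates are given as arrays of the same size as the image; the names, the green threshold, and the zero-lag correlation are all illustrative choices.

```python
import numpy as np

def classify_slit(image_rgb, templates, green_thresh=0.6):
    # Threshold on the bright green laser color.
    g = image_rgb[..., 1].astype(float) / 255.0
    binary = (g > green_thresh).astype(float)
    # Zero-lag correlation with each slit pattern; the largest
    # response identifies the most compatible direction.
    scores = [float(np.sum(binary * t)) for t in templates]
    return int(np.argmax(scores)), scores
```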
The accuracy of the specific phase-only filter was also investigated. Recall that the filter was designed to be used in the range between 0.5 and 1 m. As expected, and according to the programming of the filter, at distances shorter than 0.5 m no slit pattern was detected. The first pattern appeared at a distance of 0.5 m and kept appearing through the gap between the first and the second predetermined planes. Then, at a distance of 0.6 m, the second pattern appeared and lasted until the following pattern came out, and so on. The last pattern came into sight at a distance of 1 m and stayed on. Following this information, if a match to a certain slit pattern is established, it can be deduced that the irradiating plane lies between the corresponding range and the end of its gap. For example, if the third pattern irradiates, it can be deduced that a plane is present at a distance between 0.7 and 0.8 m. Note that the length of the gap can be controlled and determined in advance, at the filter design step, so that it meets the needs of the system.
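The range deduction itself then reduces to a lookup: a match to pattern j places the plane between the j-th design range and the next one. A sketch for the specific 0.5–1 m filter of this section:

```python
ranges_m = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # design ranges of the six patterns

def range_interval(pattern_index):
    lo = ranges_m[pattern_index]
    # The last pattern stays on indefinitely beyond its plane.
    hi = ranges_m[pattern_index + 1] if pattern_index + 1 < len(ranges_m) else float("inf")
    return lo, hi  # e.g. range_interval(2) -> (0.7, 0.8)
```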
Fig. 7. The steps taken in order to automatically classify the pattern direction demonstrated on image Fig. 6b. (a) The image is aligned horizontally and its color is normalized. (b) A threshold is applied, and since the laser is of bright green color that pops up in its neighborhood, only relevant information remains. (c) The slit pattern that gives maximum correlation response.
Table 1 Correlation values between the aligned, normalized, and thresholded images depicted in Fig. 6 and the pattern of the slits that corresponds to the images in Fig. 4
Image      Pattern (a)  Pattern (b)  Pattern (c)  Pattern (d)  Pattern (e)  Pattern (f)  Maximum      Should be
Image (a)  2.8815       2.805        1.989        2.2185       2.2695       2.04         Pattern (a)  Pattern (a)
Image (b)  1.0965       1.4535       1.0455       0.918        0.8925       0.714        Pattern (b)  Pattern (b)
Image (c)  1.326        1.887        2.2695       1.7595       1.3005       0.918        Pattern (c)  Pattern (c)
Image (d)  1.4535       1.5555       1.6575       2.0655       1.8105       1.3515       Pattern (d)  Pattern (d)
Image (e)  1.377        1.275        1.326        2.1675       2.3205       1.5045       Pattern (e)  Pattern (e)
Image (f)  1.632        1.275        0.9435       1.377        1.8105       1.836        Pattern (f)  Pattern (f)
As is clearly confirmed, the maximum correlation values correspond to the expected patterns.
In addition, we measured the accuracy of the technique at the border planes where the patterns were supposed to change, and found that all the ranges were accurate to within 1–3 mm, independent of the range itself. This implies that this specific filter has a reliability of 97% in finding the border ranges. Also, if a robot wants to find out when it is exactly on a border plane, it only needs to find a point of change between two successive patterns. Considering the directions of the two patterns, the range is
directly deduced. Note that, in general, if a different filter is designed, its accuracy and reliability should be measured individually. Fig. 8 depicts a semi-realistic scene where a mobile robot (the car) illuminates the scene using the proposed phase-only filter in order to detect obstacles. Two boxes, acting as obstacles, are positioned in front of it, and it can be clearly seen that the pattern of the filter is split between both of them,
where one half irradiates in a specific pattern and the other half irradiates in a different pattern. For a better view, Fig. 9a shows a closer image of the obstacles, while Fig. 9b and c show even closer images of each of the obstacles. By analyzing the patterns it can be deduced that the first obstacle is located at a distance between 0.7 and 0.8 m and the second at a distance between 0.8 and 0.9 m (with respect to the light source).

Fig. 8. A semi-realistic scene where a mobile robot (a car) illuminates the scene using the proposed phase-only filter in order to detect obstacles. Two boxes are positioned in front of it, and it can be clearly seen that the pattern of the filter is split between both of them, where one half irradiates in a specific pattern and the other half irradiates in a different pattern.

The results clearly demonstrate that, using the proposed technique, range is determined immediately, in real time. In addition to its accuracy, simplicity, and speed, the technique is extremely cost effective: it comprises only a laser beam, a lens, a filter, and a common camera.

4. Discussion
A technique for qualitative real-time range estimation for preplanned scene partitioning has been presented here. The setup consists of a laser beam, a lens, a single phase-only filter, and a camera. The phase-only filter is designed in such a way that a scene patch illuminated by it irradiates in a unique pattern corresponding to its range from the light source. The phase-only filter can be designed to meet the specific parameters of its working environment. Relevant parameters include: the location (range) of the first range segment, the number of range segments, the length of each segment (e.g., shorter for nearby environments), the uniformity of the gaps (e.g., equal or changing), the dimensions of the projected pattern (e.g., 10 cm, 1/2 m), and the density of the slits forming the pattern. If the environmental conditions require a stronger contrast, a stronger laser source can be used. Note that, since the physics of propagating light must be taken into consideration, the dimensions of the projected pattern get bigger as the range increases. The specific scanner implemented here, and described in the results section, is in fact a very simple one.
Fig. 9. A closer observation of the scene demonstrated in Fig. 8: (a) depicts both obstacles, while (b) and (c) depict each obstacle even more closely.
It could be assembled using available laboratory components. Thus, its main role is to prove the correctness of the technique, and as such it was designed with a relatively short total range (0.5–1 m) and relatively long range segments (0.1 m), best suited to the tasks of obstacle detection or docking. The environmental factors that might affect the accuracy or reliability of this scanner are lighting conditions and green obstacles. If the ambient light is too strong, the green slits can hardly be seen. Also, if the scene contains green obstacles, it might be difficult to separate the slits from the background. This problem, when appropriate, can be resolved by using a red laser beam. In general, the technique would mostly fit in the context of a mobile robot that needs a rough estimate of the scene structure. This would enable it to identify guidelines at predetermined ranges and, consequently, plan its path. The workspace can be partitioned in advance into a set of relevant ranges composed of near, intermediate, and far segments at the same time, with variable segment lengths. Near ranges would naturally be densely segmented, whereas far ranges would be segmented sparsely. The robot would have its range partitioned into appropriate and meaningful warning zones, so that when a match is achieved, a corresponding action can be invoked. The technique fits such scenarios extremely well by providing both qualitative and reliable results.
References Albamont, J., Goshtasby, A., 2003. A range scanner with a virtual laser. Image Vision Comput. 21, 271–284. Caspi, D., Kiryati, N., Shamir, J., 1998. Range imaging with adaptive color structured light. IEEE Trans. PAMI 20 (5), 470–480. Fofi, D., Salvi, J., Mouaddib, E.M., 2003. Uncalibrated reconstruction: An adaptation to structured light vision. Pattern Recogn. 36, 1631–1644.
Furukawa, R., Kawasaki, H., 2003. Interactive shape acquisition using marker attached laser projector, 3DIM03, 491–498.
Gerchberg, R.W., Saxton, W.O., 1972. A practical algorithm for the determination of phase from image and diffraction plane pictures. Optik 35, 237–246.
Guisser, L., Payrissat, R., Castan, S., 2000. PGSD: An accurate 3D vision system using a projected grid for surface descriptions. Image Vision Comput. 18, 463–491.
Horn, E., Kiryati, N., 1999. Toward optimal structured light patterns. Image Vision Comput. 17 (2), 87–97.
Jarvis, R.A., 1983. A perspective on range finding techniques for computer vision. IEEE Trans. PAMI 5 (2), 123–139.
Jarvis, R.A., 1993. Range sensing for computer vision. In: Jain, A.K., Flynn, P.J. (Eds.), Three-dimensional Object Recognition Systems. Elsevier Science Publishers B.V.
Je, C., Lee, S.W., Park, R.-H., 2004. High-contrast color-stripe pattern for rapid structured-light range imaging, ECCV04-1, 95–107.
Kang, S.B., Webb, J.A., Zitnick, C., Kanade, T., 1995. A multibaseline stereo system with active illumination and real-time image acquisition, ICCV95, 88–93.
Lee, S.-K., Lee, S.-H., Choi, J.-S., 1999. Depth measurement using frequency analysis with an active projection, ICIP99-3, 906–909.
Levy, U., Shabtay, G., Mendlovic, D., Zalevsky, Z., Marom, E., 1999. Iterative algorithm for determining optimal beam profiles in a 3-D space. Appl. Opt. 38, 6732–6736.
Manabe, Y., Parkkinen, J., Jaaskelainen, T., Chihara, K., 2002. Three dimensional measurement using color structured patterns and imaging spectrograph, ICPR02-3, 649–652.
Maruyama, M., Abe, S., 1993. Range sensing by projecting multiple slits with random cuts. IEEE Trans. PAMI 15 (6), 647–651.
Pages, J., Salvi, J., Matabosch, C., 2003. Implementation of a robust coded structured light technique for dynamic 3D measurements, ICIP03-3, 1073–1076.
Salvi, J., Pages, J., Batlle, J., 2004. Pattern codification strategies in structured light systems. Pattern Recogn. 37, 827–849.
Sato, K., Inokuchi, S., 1987. Range-imaging system utilizing nematic liquid crystal mask, ICCV87, 657–661.
Scharstein, D., Szeliski, R., 2003. High-accuracy stereo depth maps using structured light, CVPR03-1, 195–202.
Valkenburg, R.J., McIvor, A.M., 1998. Accurate 3D measurement using a structured light system. Image Vision Comput. 16, 99–110.
Winkelbach, S., Wahl, F.M., 2002. Shape from single stripe pattern illumination, DAGM02, 240–247.
Zalevsky, Z., Mendlovic, D., Dorsch, R.G., 1996. The Gerchberg–Saxton algorithm applied in the fractional Fourier or the Fresnel domains. Opt. Lett. 21, 842–844.
Pattern Recognition Letters 26 (2005) 1782–1791 www.elsevier.com/locate/patrec
A new coarse-to-fine rectification algorithm for airborne push-broom hyperspectral images Hongya Tuo *, Yuncai Liu Institute of Image Processing & Pattern Recognition, Shanghai Jiaotong University, 1954 Hua Shan Road, Shanghai 200030, PR China Received 8 July 2004; received in revised form 21 January 2005 Available online 14 April 2005 Communicated by R. Davies
Abstract
Rectification of airborne push-broom hyperspectral images is an indispensable and important task in the registration procedure. In this paper, a coarse-to-fine rectification algorithm is proposed. In the coarse rectification process, the data provided by the Positioning and Orientation System (POS) are used in a model of direct georeference position to adjust the displacement of each line of the images. In the fine rectification step, a polynomial model is applied so that the geometric relationships of all the pixels in the raw image space are the same as those in the reference image. Experiments show that the fine rectified images are satisfying and that our rectification algorithm is very effective. 2005 Elsevier B.V. All rights reserved.

Keywords: Coarse-to-fine rectification; Airborne push-broom hyperspectral image; Direct georeference position; Polynomial model
1. Introduction

Image rectification is an indispensable and important task in many fields, such as remote sensing, computer vision and pattern recognition, especially remote sensing image registration and fusion. Many research studies have been carried out on this topic during the last several decades.
* Corresponding author. Fax: +86 21 62932035. E-mail address: [email protected] (H. Tuo).
In this paper, we are interested in the rectification of airborne push-broom hyperspectral images. These kinds of data contain much information, with over 100 spectral bands at each pixel location, which is very useful for image fusion and classification. Unfortunately, due to the non-uniformity of the velocity, the gradient, and the altitude of the airplane during the flight, there are severe distortions in the acquired raw images. These images can only be useful if they are rectified. So rectification is a key step for post-processing, for example registration and fusion procedures.
0167-8655/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.02.005
The rectification process requires knowledge of the camera interior and exterior orientation parameters (three coordinates of the perspective center, and three rotation angles known as roll, pitch, and yaw). In an image of frame photography, where all pixels are exposed simultaneously, each point has the same orientation parameters. The interior orientation parameters are provided by camera calibration. The exterior orientation parameters are determined by using mathematical models between the object and image spaces (Ebner et al., 1992; Kratky, 1989). Rectification for this kind of image is achieved indirectly by adjusting a number of well-defined ground control points and their corresponding image coordinates. Unlike frame photography, each line of an airborne push-broom image is collected at a different time. The perspective geometry varies with each push-broom line. Therefore, each line has six different exterior orientation parameters, making the displacement of each line different and thus the rectification process much more difficult. Nowadays, a positioning and orientation system (POS) integrating a global positioning system (GPS) and an inertial measurement unit (IMU) is carried by the airplane to provide precise position and attitude parameters (Haala et al., 1998; Cramer et al.,
2000; Mostafa and Schwarz, 2000; Skaloud and Schwarz, 1998; Skaloud, 1999). Many studies have discussed sensor modeling to estimate precise exterior orientation parameters for each line from POS data (Daniela, 2002; Chen, 2001; Gruen and Zhang, 2002; Lee et al., 2000; Hinsken et al., 2002). The aerial triangulation method is the traditional rectification method. Its aim is to determine the exterior orientation parameters of a block of images and the object coordinates of the ground points. But it needs a sufficient number of well-distributed ground control points, which is a time-consuming requirement, and it is usually applied to full-frame imagery. The source images to be rectified here are from an airborne push-broom imager developed by the Shanghai Institute of Technical Physics, Chinese Academy of Sciences. The airborne images contain 652 columns and have 124 different spectral bands at each pixel location. At constant time intervals associated with each line, 652 elements with 124 spectral bands are acquired (see Fig. 1(a), Lee et al., 2000). The across-track IFOV (Instantaneous Field of View) is 0.6 mrad, and the along-track IFOV is 1.2 mrad. At each instant, the sensor is positioned along its flight trajectory at the instantaneous perspective center (see Fig. 1(b)).
Fig. 1. (a) Airborne push-broom hyperspectral image. (b) Instantaneous perspective center.
Table 1
The accuracy of POS/AV 510-DG

Position (unit: m)       0.05–0.3
Roll angle (unit: deg)   0.005
Pitch angle (unit: deg)  0.005
Yaw angle (unit: deg)    0.008

The POS data are obtained by the GPS/IMU equipment POS/AV 510-DG, whose accuracy is listed in Table 1. From Table 1, we can see that the accuracy of the POS data is relatively high, so the data can be used directly to rectify the raw images. Hence, a coarse-to-fine rectification algorithm using POS data is proposed; a reference image is required. In the coarse rectification process, the POS data are used in a model of direct georeference position to quickly adjust the displacement of each line. In the fine rectification step, a few ground control points are selected, and a polynomial model is applied so that the geometric relationships of all the pixels in the raw image space are the same as those in the reference image. Since the geometric distortions across the 124 bands are negligible, rectification is performed on just one of the bands; the same transformation is then applied to the other bands. This paper is organized as follows. Section 2 gives the coarse-to-fine rectification algorithm in detail: the model of direct georeference position is established, the resampling and interpolation techniques are discussed, the fine rectification method using the polynomial model is presented, and finally the algorithm is listed step by step. Experimental examples are given in Section 3. A check of the accuracy of rectification is presented in Section 4. Conclusions appear in Section 5.

2. Coarse-to-fine rectification algorithm

In this section, we propose the coarse-to-fine rectification algorithm for airborne push-broom hyperspectral images. This kind of image is acquired one line after another, and each line has different exterior orientation parameters, which makes the rectification process much more difficult. In the coarse rectification step, we use the POS data to establish the direct georeference position model. In the fine rectification step, we apply the polynomial model so that the geometric relationships of all the pixels in image space correspond to those in the reference image.

2.1. Coarse rectification

The coarse rectification step involves several key problems: introduction of the coordinate systems, establishment of the direct georeference position model for a push-broom line, and the resampling method.

2.1.1. Coordinate systems
In order to obtain the mathematical relationships between image points and ground reference points, coordinate systems must be established. As shown in Fig. 2, the major coordinate systems are: (1) the sensor coordinate system (S-UVW), with perspective center S, U-axis parallel to the flight trajectory, V-axis tangent to the flight trajectory, and W-axis parallel to the optical axis and pointing upwards, completing a right-handed coordinate system; (2) the ground coordinate system (O–XYZ); (3) the image coordinate system (o–xyf), in which x and y are the coordinates of the image plane and f is the focal length of the sensor. The image coordinate system and the sensor coordinate system are parallel.
Fig. 2. The coordinate systems.
2.1.2. Georeferencing for a push-broom line
According to the theory of photography, each line of airborne data has a perspective center and its own exterior orientation elements. At a certain interval, six exterior parameters for a line are recorded by the POS on board the aircraft. The center position (latitude, longitude and height) is measured by GPS in the ground coordinate system, with WGS84 as the reference ellipsoid. The aircraft orientation values (roll, pitch, and yaw) are supplied by the INS. Suppose that (X_G, Y_G, Z_G) are the coordinates of the perspective center G obtained by GPS, φ is the pitch angle of the aircraft, ω is the roll angle, and κ is the yaw angle measured by the INS. Plane P is the image space and the ground space is O–XYZ. The geometry among the perspective center G, projective points in image space, and corresponding points in the ground space for one line is shown in Fig. 3. Let line AB be one push-broom line in the image plane, and let C be the image projective center on line AB with coordinates (x_C, y_C, f) in the image coordinate system. Q is the corresponding ground point of C. Assume point D is selected from line AB, with coordinates (x_C, y, f) in the image space, and suppose the corresponding ground point of D is P. We will deduce the mathematical relations between the projective points in image space and the corresponding points on the ground according to Fig. 3. Firstly, we get the corresponding ground coordinates of the image projective center on one push-broom line.
In Fig. 3, line QR is perpendicular to the Y-axis, and point R is the intersection. The coordinates (X_Q, Y_Q) of point Q can be obtained by

X_Q = X_G + |QR| = X_G + Z_G·tan ω / cos φ   (1)
Y_Q = Y_G + |OR| = Y_G + Z_G·tan φ   (2)

The next aim is to get the ground coordinates (X_P, Y_P) of P. Line PN is perpendicular to the Y-axis, and point N is the intersection. Line QT is perpendicular to line PN, and point T is the intersection. Let θ be the angle between lines GQ and GP proceeding from the same point G. Assume that the across-track IFOV (Instantaneous Field of View) is q mrad; then

θ = (y − y_C)·q·(180/(1000·π))  (in degrees)   (3)

Let |GQ| = l, |GP| = s and |QP| = d; then we have

X_P = X_Q + d·sin κ   (4)
Y_P = Y_Q + d·cos κ   (5)

From the above, we obtain

|GQ| = l = Z_G/(cos φ·cos ω)   (6)

According to Fig. 3, there are

|PN| = l·sin ω + d·sin κ   (7)
|ON| = l·cos ω·sin φ + d·cos κ   (8)

In the right triangle GNP, we have

GN² = GP² − PN²   (9)

and in the right triangle GON,

GN² = OG² + ON²   (10)

Substituting (7) and (8) into (9) and (10), we obtain the following equation:

s² − (l·sin ω + d·sin κ)² = (l·cos ω·cos φ)² + (l·cos ω·sin φ + d·cos κ)²   (11)

Expanding the squares in (11) and using sin² + cos² = 1 to simplify, we obtain

l² + 2ld·(sin ω·sin κ + cos ω·sin φ·cos κ) + d² − s² = 0   (12)

Fig. 3. Geometry among the perspective center G, projective points in image space, and corresponding points on the ground.
In triangle GQP, we have

l² + s² − 2ls·cos θ − d² = 0   (13)

Let

b = sin ω·sin κ + cos ω·sin φ·cos κ   (14)

Adding (12) and (13) and dividing by two, we get

l² + b·ld − ls·cos θ = 0   (15)

From (15), we obtain

s = (l + bd)/cos θ   (16)

Substituting (16) into (12) and solving for d yields

d = l·tan θ/(√(1 − b²) − b·tan θ) = (Z_G/(cos φ·cos ω))·tan θ/(√(1 − b²) − b·tan θ)   (17)

Then, the coordinates (X_P, Y_P) of point P can be obtained by

X_P = X_G + Z_G·tan ω/cos φ + d·sin κ   (18)
Y_P = Y_G + Z_G·tan φ + d·cos κ   (19)

In the same way, for a point P to the left of Q, the coordinates (X_P, Y_P) of P are

X_P = X_G + Z_G·tan ω/cos φ − d·sin κ   (20)
Y_P = Y_G + Z_G·tan φ − d·cos κ   (21)

2.1.3. Resampling and interpolation
Once all the pixels (i, j) in the image space are transformed to the ground coordinates (X, Y), a new image F is obtained. Obviously, the points (X, Y) in F are not on a regular grid yet. Resampling is used to bring the transformed image onto a regular grid; denote the resampled image by F̂. Firstly, it is necessary to get the size of F̂. Assume that the size of F̂ is M × N, the across-track IFOV is q₁ mrad, and the along-track IFOV is q₂ mrad. Let the actual width of a pixel in ground space be Δx, its length be Δy, and the average height of the aircraft during the flight be H̄; then we have

Δx = int(H̄·q₁/100 + 0.5)/10   (22)
Δy = int(H̄·q₂/100 + 0.5)/10   (23)

Let

min_X = min({X : (X, Y) ∈ F}),  max_X = max({X : (X, Y) ∈ F}),
min_Y = min({Y : (X, Y) ∈ F}),  max_Y = max({Y : (X, Y) ∈ F})   (24)

We can get

M = int((max_X − min_X)/Δx) + 1   (25)
N = int((max_Y − min_Y)/Δy) + 1   (26)
i_X = int((X − min_X)/Δx),  0 ≤ i_X < M   (27)
j_Y = int((Y − min_Y)/Δy),  0 ≤ j_Y < N   (28)

where int is the truncation function that returns the closest integer less than or equal to its argument. Assume that the gray values at (i, j), (X, Y) and (i_X, j_Y) are f(i, j), F(X, Y), and F̂(i_X, j_Y), respectively; then

F(X, Y) = f(i, j)   (29)
F̂(i_X, j_Y) = F(X, Y)   (30)

Using the above method, we can quickly get the transformed image F̂ on a regular grid. On the other hand, some points on the rectified grid are not assigned gray values; the next step is interpolation to solve this problem. The interpolation method is as follows. For a point (i_X, j_Y), i_X = 1, . . ., M − 2, j_Y = 1, . . ., N − 2, select its 3 × 3 neighborhood as the search region Φ = {F̂(i_X + t, j_Y + s) : t = −1, 0, 1; s = −1, 0, 1}, and define

Num = the number of F̂ ≠ 0, F̂ ∈ Φ   (31)
Sum = Σ_{s=−1}^{1} Σ_{t=−1}^{1} F̂(i_X + t, j_Y + s)   (32)

If F̂(i_X, j_Y) = 0 and Num ≠ 0, the gray value F̂(i_X, j_Y) is replaced by

F̂(i_X, j_Y) = Sum/Num   (33)
Using a systematic search, from left to right and top to bottom, to interpolate all the pixels on the rectified grid, we obtain the coarse rectified image.
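The whole coarse step can be sketched as follows, transcribing Eqs. (3), (6), (14), (17)–(23), (25)–(28) and (33); angles are in radians, the left/right-of-Q sign choice of Eqs. (18)–(21) is encoded here by the sign of y − y_C, and all names are illustrative rather than the authors' code.

```python
import numpy as np

def ground_point(XG, YG, ZG, omega, phi, kappa, y, yC, q_mrad):
    # Eq. (3): q is in mrad, so q/1000 converts directly to radians.
    theta = abs(y - yC) * q_mrad / 1000.0
    l = ZG / (np.cos(phi) * np.cos(omega))                            # Eq. (6)
    b = (np.sin(omega) * np.sin(kappa)
         + np.cos(omega) * np.sin(phi) * np.cos(kappa))               # Eq. (14)
    d = l * np.tan(theta) / (np.sqrt(1 - b * b) - b * np.tan(theta))  # Eq. (17)
    sign = 1.0 if y >= yC else -1.0                                   # Eqs. (18)-(21)
    XP = XG + ZG * np.tan(omega) / np.cos(phi) + sign * d * np.sin(kappa)
    YP = YG + ZG * np.tan(phi) + sign * d * np.cos(kappa)
    return XP, YP

def resample(X, Y, gray, q1, q2, H_mean):
    dx = int(H_mean * q1 / 100 + 0.5) / 10.0                          # Eq. (22)
    dy = int(H_mean * q2 / 100 + 0.5) / 10.0                          # Eq. (23)
    M = int((X.max() - X.min()) / dx) + 1                             # Eq. (25)
    N = int((Y.max() - Y.min()) / dy) + 1                             # Eq. (26)
    iX = ((X - X.min()) / dx).astype(int)                             # Eq. (27)
    jY = ((Y - Y.min()) / dy).astype(int)                             # Eq. (28)
    F = np.zeros((M, N))
    F[iX, jY] = gray                                                  # Eqs. (29)-(30)
    # Eq. (33): systematic scan filling empty cells with the mean of the
    # nonzero values in their 3 x 3 neighborhood.
    for i in range(1, M - 1):
        for j in range(1, N - 1):
            if F[i, j] == 0:
                win = F[i - 1:i + 2, j - 1:j + 2]
                nz = np.count_nonzero(win)
                if nz:
                    F[i, j] = win.sum() / nz
    return F
```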
2.2. Fine rectification using polynomial model

Because of the non-uniformity of the velocity, the gradient, and the altitude of the airplane, airborne images usually have great distortions. Though the above method can eliminate some errors caused by many factors, a precise rectification step is essential for further applications such as fusion and classification. The most widely used method for rectification is the polynomial model. In this step, a reference image is required and some ground control points (GCPs) are selected from both the reference image and the coarse rectified image. Let (X, Y) be a point in the rectified image F̂ above, and (e, g) be the corresponding point in the reference image Ĝ. Also let (X̂, Ŷ) be the estimate of (X, Y) by a polynomial transformation. The general form of the model is expressed as follows:

X̂ = Σ_{i=0}^{n} Σ_{j=0}^{n−i} a_{i,j}·e^i·g^j   (34)
Ŷ = Σ_{i=0}^{n} Σ_{j=0}^{n−i} b_{i,j}·e^i·g^j   (35)

where a_{i,j} and b_{i,j} are unknown coefficients, and n is the degree of the model. In most cases, low-order polynomials are used to rectify images. For example, geometric distortions like scale, translation, rotation, and skew effects can be modeled by an affine transformation (n = 1). If the degree of the polynomial model is n, a set of at least M = (n + 1)(n + 2)/2 GCPs is needed to solve (34) and (35). Suppose that {(e_i, g_i) : i = 1, . . ., L} and {(X_i, Y_i) : i = 1, . . ., L} are GCPs selected from Ĝ and F̂, respectively. The least squares method can be used to estimate the coefficients a_{i,j} and b_{i,j}; the transformation between F̂ and Ĝ is then determined. Define the fine rectified image to be R̂. The bilinear interpolation technique is applied to get the gray values of the pixels in R̂. For any point (e, g), its corresponding point (X, Y) can be obtained according to (34) and (35). Define

I = int(X),  J = int(Y),  U = X − I,  V = Y − J   (36)

The gray value R̂(I, J) is obtained as follows:

R̂(I, J) = F̂(X, Y) = (1 − U)(1 − V)·F̂(I, J) + (1 − U)·V·F̂(I, J + 1) + U·(1 − V)·F̂(I + 1, J) + U·V·F̂(I + 1, J + 1)   (37)
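Both the least-squares fit of Eqs. (34)–(35) and the bilinear resampling of Eqs. (36)–(37) are standard, and can be sketched as follows (array names are illustrative):

```python
import numpy as np

def poly_terms(e, g, n):
    # All monomials e^i * g^j with i + j <= n, as in Eqs. (34)-(35).
    return np.column_stack([e ** i * g ** j
                            for i in range(n + 1) for j in range(n + 1 - i)])

def fit_poly(gcp_ref, gcp_img, n=2):
    # Least-squares estimate of the coefficients from >= (n+1)(n+2)/2 GCPs.
    T = poly_terms(gcp_ref[:, 0], gcp_ref[:, 1], n)
    ax = np.linalg.lstsq(T, gcp_img[:, 0], rcond=None)[0]   # a_{i,j}
    ay = np.linalg.lstsq(T, gcp_img[:, 1], rcond=None)[0]   # b_{i,j}
    return ax, ay

def fine_rectify(F_hat, ax, ay, out_shape, n=2):
    R = np.zeros(out_shape)
    for e in range(out_shape[0]):
        for g in range(out_shape[1]):
            t = poly_terms(np.array([e], float), np.array([g], float), n)
            X, Y = float(t @ ax), float(t @ ay)
            I, J = int(X), int(Y)                            # Eq. (36)
            if 0 <= I < F_hat.shape[0] - 1 and 0 <= J < F_hat.shape[1] - 1:
                U, V = X - I, Y - J
                R[e, g] = ((1 - U) * (1 - V) * F_hat[I, J]   # Eq. (37)
                           + (1 - U) * V * F_hat[I, J + 1]
                           + U * (1 - V) * F_hat[I + 1, J]
                           + U * V * F_hat[I + 1, J + 1])
    return R
```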
2.3. Algorithm

In this section, we give the coarse-to-fine rectification algorithm for airborne push-broom hyperspectral images in detail. The procedure involves several steps:

Step 1. Input an airborne push-broom image and select one band, defined as F, with size I × J.
Step 2. Let M be the matrix of referencing POS data, with size 6 × J, in which the elements of the j-th line are the six exterior orientation parameters of the j-th line in F, j = 1, . . ., J.
Step 3. The center of the j-th line in the image plane is (I/2, j). Calculate the perspective center coordinates (X_Gj, Y_Gj) in ground space by (1) and (2). Then for each point (i, j), i = 0, . . ., I − 1, compute its corresponding point coordinates (X_Pj, Y_Pj) in ground space according to (18) and (19) or (20) and (21).
Step 4. For each line of image F, repeat Step 3. Then get the coarse rectified image, defined as F̂, by using the resampling and interpolation techniques.
Step 5. Select some GCPs {(e_i, g_i) : i = 1, . . ., L} and {(X_i, Y_i) : i = 1, . . ., L} from the reference image and F̂, respectively. Use the polynomial distortion model and bilinear interpolation to obtain the fine rectified image R̂.
3. Experiments

In our experiments, the reference image is an airborne infrared image acquired by full-frame photography in 2002. It has been well rectified, and its geometric resolution is 0.41 m. The images that need to be rectified were obtained by the airborne push-broom hyperspectral imager in 2003, with 652 columns, over 3000 rows and 124 bands. The coarse rectification process requires the POS data. Because the whole images are very long, it is better to divide them into several pieces before coarse rectification. Fig. 4(a) is a subimage of size 600 × 652 cut from band 20 of the raw airborne push-broom hyperspectral image. Table 2 lists the POS data of the first line of Fig. 4(a), including the six exterior orientation parameters (three coordinates of the perspective center, and the roll, pitch, and yaw angles). Fig. 4(b) shows the coarse rectified image of Fig. 4(a); it indicates that the coarse rectification is very effective.
Table 2
The POS data of the first line of Fig. 4(a)

Latitude (deg)     3.1204589182E+001
Longitude (deg)    1.2150235801E+002
Altitude (m)       2.1747928669E+003
Roll angle (deg)   1.5322707188E+000
Pitch angle (deg)  3.8967691065E−001
Yaw angle (deg)    1.4760683542E+001
In the fine rectification process, a reference image is needed; Fig. 5 is the reference image. Fig. 6 is the same as Fig. 4(b). We select 12 sets of GCPs from Figs. 5 and 6 and label them in the two figures, respectively. Using a two-degree polynomial model, the result of the fine rectification of Fig. 6 is shown in Fig. 7. Many tests have been done with other subimages cut from the raw data. From the experiments, we can see that the final rectified images are satisfying.

4. Check for the accuracy of rectification

We can check the accuracy of rectification from two aspects: fusion effects and the residuals of the GCPs' geography coordinates. Since the geometric distortions existing in the 124 bands are negligible,
Fig. 4. (a) A 600 × 652 subimage cut from band 20 of the raw airborne push-broom hyperspectral image. (b) The coarse rectified image of (a).
Fig. 7. The fine rectified image of Fig. 4(a), obtained by the two-degree polynomial model.
Fig. 5. The reference image, with some selected GCPs labeled.
Fig. 6. The coarse rectified image of Fig. 4(a), with some selected GCPs labeled.
we can apply the same coarse-to-fine parameters to rectify the other bands. In our experiments, three fine rectified images with the same view are extracted from bands 20, 80 and 100, and a false color image can be composed of them, as shown in Fig. 8(a).
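Composing the false color image is a simple band-stacking operation; a hedged sketch follows (the band assignment to channels and the per-channel normalization are illustrative choices, not stated in the paper):

```python
import numpy as np

def false_color(band20, band80, band100):
    # Stack three rectified bands as R, G, B and normalize each channel.
    rgb = np.dstack([band20, band80, band100]).astype(float)
    rgb -= rgb.min(axis=(0, 1))
    rgb /= np.maximum(rgb.max(axis=(0, 1)), 1e-9)
    return (rgb * 255).astype(np.uint8)
```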
Fig. 8(b) is the corresponding part cut from the reference image. Fig. 8(c) shows the fusion effect of Fig. 8(a) and (b). Through visual checking, it can be seen that the coarse-to-fine rectification result is very good. The reference image is provided with local-plane geography coordinates (for a city, there is only one local-plane geography coordinate system). So the final rectified images also hold the local-plane geography coordinates, which makes a quantitative check for the accuracy of rectification possible. Suppose that {(e_i, g_i) : i = 1, . . ., L} and {(X_i, Y_i) : i = 1, . . ., L} are GCPs selected from the reference image and the rectified image, respectively. Assume that (Ge_i, Gg_i) are the geography coordinates of point (e_i, g_i) and (Gx_i, Gy_i) are the geography coordinates of (X_i, Y_i). Define the residuals as follows:

X_resd(X_i) = |Gx_i − Ge_i|,  i = 1, . . ., L   (38)
Y_resd(Y_i) = |Gy_i − Gg_i|,  i = 1, . . ., L   (39)
mean_residual = (1/L)·Σ_{i=1}^{L} √((X_resd(X_i))² + (Y_resd(Y_i))²)   (40)
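Eqs. (38)–(40) transcribe directly into code; the sketch below assumes the four geography-coordinate arrays are given in matching order.

```python
import numpy as np

def mean_residual(Ge, Gg, Gx, Gy):
    x_resd = np.abs(np.asarray(Gx) - np.asarray(Ge))            # Eq. (38)
    y_resd = np.abs(np.asarray(Gy) - np.asarray(Gg))            # Eq. (39)
    return float(np.mean(np.sqrt(x_resd ** 2 + y_resd ** 2)))   # Eq. (40)
```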
Fig. 8. (a) The false color image composed of the fine-rectified images of bands 20, 80 and 100. (b) The corresponding part of (a) cut from the reference image. (c) The fusion image of (a) and (b).
Table 3
The geography coordinates and the residuals of the selected GCPs

GCPs  Fig. 8(b)              Fig. 8(a)              X_resd  Y_resd
      Ge         Gg          Gx         Gy
1     4100.61E   3232.78S    4101.08E   3232.53S    0.47    0.25
2     4115.91E   3313.21S    4116.08E   3312.84S    0.17    0.37
3     3959.01E   3278.56S    3958.89E   3278.78S    0.45    0.22
4     3942.03E   3470.13S    3942.64E   3470.34S    0.61    0.21
5     3940.17E   3362.92S    3939.83E   3362.53S    0.34    0.39
6     4131.76E   3448.21S    4131.39E   3448.78S    0.37    0.57

The mean_residual: 0.6222
The mean_residual can be used to judge the accuracy of rectification. We selected and labeled six point pairs from Fig. 8(a) and (b); their geography coordinates and residuals are listed in Table 3. Given that the geometric resolution of the reference image is 0.41 m and the evaluated mean_residual is less than 0.82 m, the accuracy of rectification is within 2 pixels.
5. Conclusion

A coarse-to-fine rectification algorithm for airborne push-broom hyperspectral images has been proposed in this paper. The coarse rectification method can quickly adjust the displacement of each line of an image based on the POS data. By selecting a few ground control points, the polynomial model can be applied to realize the fine rectification. The experimental images are extracted from several bands of the 124-band raw images. The results show the effectiveness of our rectification algorithm. The accuracy of rectification is judged in two ways. The fusion effect of the reference image and the false color images indicates the high accuracy of rectification through visual checking. The residuals of the GCPs' geography coordinates give a quantitative analysis, which shows that the accuracy of rectification is within 2 pixels.
Acknowledgement This research is supported by Shanghai Science and Technology Development Funds of China (No. 02DZ15001).
References
Chen, T., 2001. High precision georeference for airborne Three-Line Scanner (TLS) imagery. In: Proc. 3rd Internat. Image Sensing Seminar on New Development in Digital Photogrammetry, Gifu, Japan, pp. 71–82.
Cramer, M., Stallmann, D., Haala, N., 2000. Direct georeferencing using GPS/INS exterior orientations for photogrammetric applications. Int. Arch. Photogramm. Remote Sensing 33 (Part B3), 198–205.
Daniela, P., 2002. General model for airborne and spaceborne linear array sensors. Int. Arch. Photogramm. Remote Sensing 34 (Part B1), 177–182.
Ebner, H., Kornus, W., Ohlhof, T., 1992. A simulation study on point determination for the MOMS-02/D2 space project using an extended functional model. Int. Arch. Photogramm. Remote Sensing 29 (Part B4), 458–464.
Gruen, A., Zhang, L., 2002. Sensor modeling for aerial mobile mapping with Three-Line-Scanner (TLS) imagery. Int. Arch. Photogramm. Remote Sensing 34 (Part 2), 139–146.
Haala, N., Stallmann, D., Cramer, M., 1998. Calibration of Directly Measured Position and Attitude by Aerotriangulation of Three-line Airborne Imagery. ISPRS, Ohio, pp. 28–30.
Hinsken, L., Miller, S., Tempelmann, U., Uebbing, R., Walker, S., 2002. Triangulation of LH Systems' ADS40 imagery using ORIMA GPS/IMU. Int. Arch. Photogramm. Remote Sensing 34 (Part 3A), 156–162.
Kratky, V., 1989. Rigorous photogrammetric processing of SPOT images at CCM Canada. ISPRS J. Photogramm. Remote Sensing (44), 53–71.
Lee, C., Theiss, H.J., Bethel, J.S., Mikhail, E.M., 2000. Rigorous mathematical modeling of airborne pushbroom imaging systems. Photogramm. Eng. Remote Sensing 66 (4), 385–392.
Mostafa, M.M.R., Schwarz, K., 2000. A multi-sensor system for airborne image capture and georeferencing. Photogramm. Eng. Remote Sensing 66 (12), 1417–1423.
Skaloud, J., 1999. Optimizing georeferencing of airborne survey systems by INS/DGPS. Ph.D. Thesis, UCGE Report 20216, University of Calgary, Alberta, Canada.
Skaloud, J., Schwarz, K.P., 1998. Accurate orientation for airborne mapping systems. Int. Arch. Photogramm. Remote Sensing 32 (Part 2), 283–290.