Pattern Recognition 33 (2000) 533
Editorial

Energy minimization methods represent a fundamental methodology in computer vision and pattern recognition, with roots in such diverse disciplines as physics, psychology, and statistics. Recent manifestations of the idea include Markov random fields, deformable models and templates, relaxation labelling, various types of neural networks, etc. These techniques are now finding application in almost every area of computer vision, from early to high-level processing.

This edition of Pattern Recognition contains some of the best papers presented at the International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR'97), held at the University of Venice, Italy, from May 21 through May 23, 1997. Our primary motivation in organizing this workshop was to offer researchers the chance to report their work in a forum that allowed for both consolidation of efforts and intensive informal discussions. Although the subject was hitherto well represented in major international conferences in the fields of computer vision, pattern recognition and neural networks, there had been no attempt to organize a specialized meeting on energy minimization methods.

The papers appearing in this special edition fall into a number of distinct areas. There are two papers on contour detection. Zucker and Miller take a biologically plausible approach by providing a theory of line detection based on cortical cliques. Thornber and Williams, on the other hand, describe a stochastic contour completion process and provide an analysis of its characteristics.

The next block of papers use Markov random fields. Molina et al. compare stochastic and deterministic methods for blurred image restoration. Perez and Laferte provide a means of sampling graph representations of energy functions. Barker and Rayner provide an image segmentation algorithm which uses Markov chain Monte Carlo for sampling.

Turning our attention to deterministic methods, Yuille and Coughlan provide a framework for comparing heuristic search procedures, including twenty questions and the A* algorithm. Hoffman et al. show how deterministic annealing can be used for texture segmentation. Rangarajan provides a new framework called self-annealing which unifies some of the features of deterministic annealing and relaxation labelling. The topic of deterministic annealing is also central to the paper of Klock and Buhmann, who show how it can be used for multidimensional scaling.

Next there are papers on object recognition. Zhong and Jain show how localization can be effected in large databases using deformable models based on shape, texture and colour. Myers and Hancock provide a genetic algorithm that can be used to explore the ambiguity structure of line labelling and graph matching. Lastly, under this heading, Kittler shows some theoretical relationships between relaxation labelling and the Hough transform.

The final pair of papers are concerned with maximum a posteriori probability estimation. Li provides a recombination strategy for population-based search. Gelgon and Bouthemy develop a graph representation for motion tracking.

We hope this special edition will prove useful to practitioners in the field. A sequel to the workshop will take place in July 1999, and we hope a second compendium of papers will result.

Edwin R. Hancock
Department of Computer Science
University of York
York YO1 5DD, England
E-mail address: [email protected]

Marcello Pelillo
Università Ca' Foscari
Venezia, Italy
Pattern Recognition 33 (2000) 535–542
Cliques, computation, and computational tractability†

Douglas A. Miller, Steven W. Zucker*
Center for Computational Vision and Control, Departments of Computer Science and Electrical Engineering, Yale University, P.O. Box 208285, New Haven, CT, USA

Received 15 March 1999

† Portions of this material were presented at the Snowbird Workshop on Neural Computing, April 1992, and at the Workshop on Computational Neuroscience, Marine Biological Laboratories, Woods Hole, MA, in August 1993.
* Corresponding author. E-mail address: [email protected] (S.W. Zucker). Research supported by AFOSR, NSERC, NSF, and Yale University.
Abstract

We describe a class of computations that is motivated by a model of line and edge detection in primary visual cortex, although the computations here are expressed in general, abstract terms. The model consists of a collection of processing units (artificial neurons) that are organized into cliques of tightly inter-connected units. Our analysis is based on a dynamic analog model of computation, a model that is classically used to motivate gradient descent algorithms that seek extrema of energy functionals. We introduce a new view of these equations, however, and explicitly use discrete techniques from game theory to show that such cliques can achieve equilibrium in a computationally efficient manner. Furthermore, we are able to prove that the equilibrium is the same as that which would have been found by a gradient descent algorithm. The result is a new class of computations that, while related to traditional gradient-following computations such as relaxation labeling and Hopfield artificial neural networks, enjoys a different and provably efficient dynamics. The implications of the model extend beyond efficient artificial neural networks (i) to timing considerations in biological neural networks; (ii) to building reliable networks from less-reliable elements; (iii) to building accurate representations from less-accurate components; and, most generally, (iv) to demonstrating an interplay between continuous "dynamical systems" and discrete "pivoting" algorithms. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Relaxation labeling; Energy minimization; Linear complementarity problem; Game theory; Early vision; Polymatrix games; Complexity; Dynamical system
1. Introduction

How are visual computations to be structured? The most popular approach is to define an energy functional that represents key aspects of the problem structure, and to formulate solutions as minima (or maxima) of this functional. Solutions are sought by a gradient-descent procedure, iteratively formulated, and differences between algorithms often center on the space over which the minimization takes place, as well as on the type of functional being extremized. In curve detection, for example, one can define the functional in terms of curvature, and then seek "curves of least bending energy" (e.g., [1]). We have developed a related (but different) approach in which the functional varies in proportion to the residual between estimates of tangents and curvatures [2,3]. By beginning with those points that are well-informed by initial (non-linear) operators [4], we have been able to find consistent points [5] in a small number of iterations.

These computations exemplify the popular "stable state" view of neural computation [6,7], and the energy-minimization view in computer vision. Its attraction is that, when suitable basins can be defined and an energy or potential function exists over a proper labeling space, the resultant algorithms that seek extremal points can be
formulated in gradient descent terms. The design of such networks lies mainly in the specification of a connection architecture and in specifying the synaptic weights or compatibilities between processing units, from which the energy form follows. When the compatibilities or synaptic connections between processing units are asymmetric, no such energy form exists, but a more general variational inequality can be defined to drive the evolution toward consistent points [5]. Pelillo [8] has used the Baum–Eagon inequalities to analyze the dynamics of such processes.

There is another, on the surface very different, perspective toward such processes. Relaxation labeling can be viewed from a game-theoretic perspective: consider nodes in the relaxation graph as players, labels associated with nodes as pure strategies for the players, and the compatibility function r_{ij}(λ, λ′) as the "payoff" that player i receives from player j when i plays pure strategy λ and j plays strategy λ′. The probability distribution over labels, p_i(λ), then defines the mixed strategy for each player i. Properly completed, such structures are instances of polymatrix games, and the variational inequality defining consistent points is related to the Nash equilibrium for such games [9,10]. Other investigations into using game theory for computer vision problems include Duncan [11,12] and Berthod [13].

The above relationship between relaxation labeling and game theory opens a curious connection between an analog system for finding stationary points of an energy functional (or, in relaxation terms, of the average local potential) and the discrete algorithms normally employed to find equilibria of games. This connection between continuous dynamical systems and discrete "pivoting" algorithms is exploited below in Section 3, and provides an example of the extremely interesting area of computation over real numbers [14].

In addition to these formal connections between energy functions, variational inequalities, and game theory, such networks are exciting because of the possibility that they actually might model the behaviour of real neurons. This connection arises from a simple model of computation in neurons that is additive and is modeled by voltage/current relationships [6,15]. Neurons fire in proportion to their membrane potential, and three factors are considered in deriving it: (i) changes in currents induced by neuronal activity in pre-synaptic neurons; (ii) leakage through the neuronal membrane; and (iii) additional input or bias currents. These considerations can be modeled as a differential equation (see below), and it was this equation that led Hopfield [16] to study the stable state view of neural computation. Hopfield and Tank [6] suggest that such a view is relevant for a wide range of problems in biological perception, as well as others in robotics, engineering, and commerce. This equation also corresponds to the support computation in relaxation labeling [10],
which has also been applied to a wide range of such problems.

The relationship between neural computation and the modeling of visual behaviour is exciting, but it raises a deep question. Consider, for instance, the following. Although we can readily interpret line drawings, it has been shown that these problems can be NP-hard [17]. The question is then whether it is possible that biological (or other physical) systems are solving problems that are NP-hard for Turing machines. Contrary to other trends in neuromodeling, we would like to suggest that there may be no need to assume the brain attempts to find heuristic solutions to NP-hard problems, but rather that it has reduced the problems it is trying to solve to a polynomial class.

In a companion paper ([18]; see also [19,20]) we have described an analog network model for the reliable detection of short line contours and edges on hyperacuity scales in primary visual cortex of primates. In our theory and model this is accomplished by the rapid saturation-level responses of highly interconnected self-excitatory groups of superficial-layer pyramidal neurons. We call these self-excitatory groups cliques, and our theory implies that they are a fundamental unit of visual representation in the brain. In this previous work we have shown that this theory is consistent with key aspects of both cortical neurobiology and the psychophysics of hyperacuity, particularly orientation hyperacuity. In this paper we describe this theory from a more computational viewpoint, and in particular we show that the clique-based computations which we have theorized as being consistent with the observed neurobiology of primary visual cortex in effect solve a class of computational problems which are solvable in polynomial time as a function of the problem size.
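For readers who want the relaxation-labeling connection above in concrete form, the following is a minimal sketch of one discrete relaxation-labeling step: compatibilities r_{ij}(λ, λ′) produce a support for each label, and label probabilities are re-weighted by their support and re-normalized. This is a common multiplicative variant of the update, not necessarily the exact operator of [5] or [10], and the toy compatibilities are hypothetical.

```python
import numpy as np

def relaxation_step(p, r):
    # p: (N, L) label probabilities; r: (N, N, L, L) compatibilities.
    # Support s_i(l) = sum_j sum_l' r[i, j, l, l'] * p[j, l'].
    s = np.einsum("ijlm,jm->il", r, p)       # support computation
    q = p * np.maximum(s, 1e-12)             # nonnegative supports assumed
    return q / q.sum(axis=1, keepdims=True)  # re-normalize per node

# Two nodes, two labels, each supporting the other's identical label.
r = np.zeros((2, 2, 2, 2))
r[0, 1], r[1, 0] = np.eye(2), np.eye(2)
p = np.array([[0.6, 0.4], [0.5, 0.5]])
for _ in range(10):
    p = relaxation_step(p, r)
print(p.round(2))                            # drifts toward a consistent labeling
```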
2. Computation by cliques: a model

We shall not present the full biological motivation here, because a simple cartoon of the early primate visual system suffices to present the intuition. To set the stage anatomically, recall that the retina projects mainly to the lateral geniculate nucleus (LGN), and the LGN projects to the visual cortex (V1). Physiologically, orientation selectivity emerges in visual cortex, but the orientation tuning is rather broad (typically 10–20°). This presents a problem because, behaviourally, we are able to distinguish orientations to a hyperacuity level [21,22]. Somehow networks of neurons working in concert must be involved to achieve this added precision, but the question is how. One recalls Hebb's [23] classical hypothesis about cell assemblies, and more recent contributions to cell assemblies by Braitenberg [24,25] and Palm [26]. However, Hebb's hypothesis was purely intuitive, and
did not address concrete questions in vision. Nor did Braitenberg and Palm consider analog systems. One part of our project is to develop a view of neural computation sufficiently rich to explain the above hyperacuity performance based on low-acuity measurements; another part, which we expand upon below, is to show that these computations are computationally tractable. It is this latter analysis, we believe, that is of interest to research in computer vision, because it leads to alternative methods for solving energy minimization problems as they arise in vision and pattern recognition.

Several basic facts about neuroanatomy are relevant to motivating our model. First, recall that the projection from the LGN to V1 is dominated by excitatory synapses, and most intra-cortical connections are excitatory. Second, inhibition is much less specific and, finally, is additive rather than multiplicative (extensive references in support of these observations are in [18,27,28]). We take the observations about hyperacuity and cell assemblies to indicate that the substrate for neural computation might not be simple networks of neurons, but rather might involve groups of tightly interconnected neurons considered as a unit. We formally define such units as cliques of neurons, where the term from graph theory is invoked to suggest that neurons in a clique are densely interconnected. The dominance of excitatory interactions over short distances further suggests that neurons within a clique could form dense excitatory feedback circuits [28], and the natural operation of these circuits is to bring all neurons to saturation response levels rapidly. Neuronal biophysics then limits the process (since regular spiking neurons cannot burst for very long). This model of groups of neurons raising themselves to saturation level and then hitting their biophysical limit has been studied [18,19]; the result is that a short burst of about 4 spikes in about 25 ms is achievable for each neuron, and a "computation", we submit, is achieved when all neurons within the clique fire as a cohort at this rate within this time period. Note that such firing rates are well beyond the average to be expected in visual cortex. The computation is initiated with a modest initial afferent current, as would occur, e.g., when the LGN projection stimulates a subset of neurons in the clique.

This mode of computation differs from the classical view, as discussed above, because the local circuit computations within the cortex are characterized by saturated responses, indicated by a rapid burst of spikes, rather than by following a gradient to an equilibrium. This cohort burst signals the "binding" of those neurons into a clique, and the excited clique represents the stimulus orientation to high precision.

More generally, the above computation is modeled as a two-phase process. In the first, saturating phase, the input current drives all neurons in the clique to saturation, and in the second, inhibiting phase, the input current is removed and all neurons not enjoying positive
feedback decay to their base level. We believe this model is relevant to circuits other than the cartoon model of the LGN-to-V1 projection used as an introductory example, especially to intra- and inter-columnar circuits, and shall be pursuing them elsewhere. Such neurophysiological modeling is not necessary for the theoretical developments in this paper.

A description of this computation is developed more fully in [18], but for completeness we now list several of its advantages. First, there is the question of how to obtain the precise representation underlying (orientation) hyperacuity from the coarse (orientation) tuning of individual simple cells. Our solution is to form a type of distributed code over collections of simple cells, and this collection is the "clique". Roughly, the idea is that different cells would cover the stimulus with slight variation in position and orientation of their receptive fields; the increased sensitivity to orientation derives from the composite covering; see Fig. 1. The organization is revealed by the barrage of action potentials from the cells comprising the clique. The conceptual appeal of the model is indicated by this example: highly accurate computations are derived from a "clique" of coarse ones. Although there are limits to the following analogy, the situation resembles the addition of "bits" to the accumulator in a digital computer: more bits lead to higher accuracy.

The second advantage of this model is reliability (cf. [29]), and here it differs substantially from the architectural considerations underlying digital accumulators. Each neuron can participate in many cliques, and the system is highly redundant. It is shown in [18] that responses at the clique level remain reliable to the 90% level even when individual neurons are only reliable to the 60% level. Here redundancy improves reliability and accuracy, which is very different from typical uses of redundancy to improve reliability only; see also [30].

The final advantage is computational efficiency, and proving this forms the remainder of this paper.
3. A polynomial-time algorithm for determining system response to input bias

This section contains the main contribution of this paper, and it is here that the primary differences from standard energy minimization computations are developed. In particular, we do not compute the trajectory that our dynamical system will follow to find an equilibrium, but rather the equilibrium itself. Miller and Zucker's [10] paper is helpful as background reading.

In analog artificial neuronal networks, "neurons" are modeled as amplifiers, and "synapses" between "neurons" are modeled as conductances. In symbols, let u_i denote the input voltage and V_i the output voltage of "neuron" i, and let V_i = g(u_i) denote its input–output
relationship. While this is often taken as sigmoidal, we have argued that piecewise-linear models work as well, and perhaps even offer advantages [10]. Further, if we let C_i denote the input capacitance of amplifier i, I_i be a fixed input bias for i, and if we define R_i by the relationship

$$\frac{1}{R_i} = \frac{1}{\rho_i} + \sum_{j \neq i} |T_{ij}|,$$

where ρ_i is the resistance across C_i and T_{ij} is the conductance between amplifiers i and j, then the system's dynamics are governed by the system (e.g., [16]):

$$C_i \frac{du_i}{dt} = \sum_{j \neq i} T_{ij} V_j - \frac{u_i}{R_i} + I_i, \qquad V_i = g(u_i). \tag{1}$$

If this system starts out from the state in which all amplifier input and output voltages are zero, then by the form of the piecewise-linear amplifier functions g_i(u_i) : [a_i, b_i] → R given by

$$g_i(u_i) = \begin{cases} 0, & u_i < a_{i,1}, \\ c_{i,1} u_i + d_{i,1}, & a_{i,1} \le u_i \le b_{i,1}, \\ \quad \vdots & \\ c_{i,\omega(i)} u_i + d_{i,\omega(i)}, & a_{i,\omega(i)} \le u_i \le b_{i,\omega(i)}, \\ 1, & u_i > b_{i,\omega(i)}, \end{cases} \tag{2}$$

where

$$a_i \equiv a_{i,1} < b_{i,1} = a_{i,2} < \cdots < a_{i,\omega(i)} < b_{i,\omega(i)} \equiv b_i,$$
$$c_{i,k} = [g_i(b_{i,k}) - g_i(a_{i,k})]/[b_{i,k} - a_{i,k}], \qquad d_{i,k} = g_i(a_{i,k}) - c_{i,k} a_{i,k}$$

for all integers k, 1 ≤ k ≤ ω(i), this is an asymptotically stable equilibrium if the bias terms I_i are all zero. However, if the bias terms are nonzero (as, for example, if they were to represent depolarizing input current originating from the LGN), then the system will evolve monotonically to a new equilibrium state in which some amplifier outputs may be nonzero. If we then remove the bias terms, the system output variables will monotonically decrease to a final equilibrium state, which we may view as the final output or computation of the system.

It is our purpose here to show that we can determine this final state, whatever it might be, in a number of computational steps which is polynomial in the number of bits needed to specify the system, thus showing that the problem is in class P [31], as opposed, for example, to NP-hard problems such as the traveling salesman problem, which very likely is not. Note that we are not computing the trajectory which the system (1) may follow to get to an equilibrium, but only the equilibrium state itself. We shall do this by in effect computing a parametrized continuum of equilibria, as in the parametric simplex method [32]. These equilibria will correspond, first, to slowly increasing the upper bounds on all variables in the presence of bias (Phase I), followed by slowly removing the bias (Phase II). We shall first show that this procedure is computable in polynomial time, and then show that the solutions obtained are in fact those which would have resulted from a time evolution of Eq. (1). We stress that we are especially interested in (nonempty) sets of amplifiers S which achieve an asymptotically stable unbiased equilibrium in which the amplifiers in S have output 1, and all other amplifiers have output 0. Such sets of amplifiers are called self-excitatory sets; conditions for their existence and certain of their properties are described in [18–20]. Loosely, we shall require the conductances and biases to be non-negative, and the resting state to be stable [33].

Fig. 1. Distributed representation for a thin line contour derives from a family of receptive fields covering it. Each of these receptive fields comes from a single cortical neuron, and a clique consists of about 33 neurons. In this example receptive fields are represented by rectangles, and a white slit contour stimulus (heavy black outline) excites a highly interconnected clique of simple cortical (S) cells to maximal saturated feedback response by crossing, in the appropriate direction and within a narrow time interval, the edge response region of a sufficiently large proportion of the clique's cells. Three such receptive fields, out of the approximately 33 required, are illustrated here.
To begin, note that we have previously shown [10] that for piecewise-linear amplifiers of the form (2) we may represent Eq. (1) as a constrained linear dynamical system

$$p' = Rp + c + \delta \tilde{c}, \qquad 0 \le p \le e, \tag{3}$$

where R is an n×n connectivity matrix, c is a vector of bias terms c_i not including I_i as a factor, c̃ is a vector of bias terms c̃_i which do include I_i as a factor, δ ∈ [0, 1] is a scalar, and e is a vector of 1's. We can thus let δ = 0 and δ = 1 correspond to zero bias and to a bias I_i, respectively.

It can be shown [9], as a variant of the well-known Kuhn–Tucker theorem [34], that p is an equilibrium for Eq. (3) if and only if there also exist vectors y, u, v such that p, y, u, v satisfy the system

$$\begin{bmatrix} R & -I_n & I_n & 0 \\ I_n & 0 & 0 & I_n \end{bmatrix} \begin{bmatrix} p \\ y \\ u \\ v \end{bmatrix} = \begin{bmatrix} -c \\ e \end{bmatrix} + \delta \begin{bmatrix} -\tilde{c} \\ 0 \end{bmatrix},$$
$$p, y, u, v \ge 0, \qquad p^{\top}u + y^{\top}v = 0. \tag{4}$$

Here I_n is the n×n identity matrix. The above system of equations is an example of a linear complementarity problem, which in general is NP-complete, but is polynomial in several important special cases, including linear and convex quadratic programming [35,36]. An important technique for solving these and other special cases of linear complementarity problems is called Lemke's algorithm [37], and we show in [10] that it may be used to find an equilibrium for any system of Eq. (1) with piecewise-linear amplifiers. We shall use a variation of Lemke's algorithm here as well, although one which is different from that which we have described previously. As opposed to the version of Lemke's algorithm which we have described previously [9,10], where it is assumed that the practical behavior is polynomial, based on previous experience with it and related algorithms such as the simplex method for linear programming, in this case we can actually show that this version of Lemke's algorithm must terminate in a number of steps which is linear in the number of model neurons. These steps are called pivots, and each pivot amounts to solving a 2n×2n subsystem of the 2n×4n system of linear equations in the first line of Eq. (4) (cf. any text on linear programming, e.g. [32], for a detailed description of pivoting). Therefore, if we assume the coefficients other than δ of Eq. (4) to be integers (or equivalently rationals with a single common denominator), each pivot can be shown to require at most a polynomial number of computational steps, and hence this will also be true of the entire procedure.

To describe this pivoting procedure, a more useful version of Eq. (4) will be

$$\begin{bmatrix} R & -I_n & I_n & 0 \\ I_n & 0 & 0 & I_n \end{bmatrix} \begin{bmatrix} p \\ y \\ u \\ v \end{bmatrix} = \begin{bmatrix} -c \\ 0 \end{bmatrix} + d_1 \begin{bmatrix} 0 \\ e \end{bmatrix} + d_2 \begin{bmatrix} -\tilde{c} \\ 0 \end{bmatrix},$$
$$p, y, u, v \ge 0, \qquad p^{\top}u + y^{\top}v = 0. \tag{5}$$

The procedure will have two phases. In Phase I we shall assume d_2 = 1, and d_1 will increase from 0 to 1. In Phase II, d_1 = 1, and d_2 will decrease from 1 to 0.

To describe Phase I, it will be convenient to rewrite Eq. (5) in yet another form:

$$\begin{bmatrix} 0 & R & -I_n & I_n & 0 \\ -e & I_n & 0 & 0 & I_n \end{bmatrix} \begin{bmatrix} d_1 \\ p \\ y \\ u \\ v \end{bmatrix} = \begin{bmatrix} -c \\ 0 \end{bmatrix} + d_2 \begin{bmatrix} -\tilde{c} \\ 0 \end{bmatrix},$$
$$d_1, p, y, u, v \ge 0, \qquad p^{\top}u + y^{\top}v = 0. \tag{6}$$

For d_1 = 0 we can trivially find a solution for the other variables of Eq. (6) by letting p, v = 0 and, for i = 1, …, n,

$$u_i = \begin{cases} -c_i - \tilde{c}_i, & \text{if } -c_i - \tilde{c}_i > 0, \\ 0, & \text{else}, \end{cases} \qquad y_i = \begin{cases} c_i + \tilde{c}_i, & \text{if } c_i + \tilde{c}_i > 0, \\ 0, & \text{else}. \end{cases}$$

Observe that by multiplying by −1 the rows of the first line of Eq. (6) which correspond to nonzero values of y_i we obtain a basic feasible tableau, i.e. there is a subset of n columns which is a permuted identity matrix, and the trivial solution to these equations whose nonzero elements correspond to these n columns is also a solution to the second line of (6), i.e. is nonnegative. The identity matrix or basis of this tableau consists of those columns corresponding to the nonzero elements of y and u, with the remainder of the identity columns taken from those corresponding to v. Note however that a nondegenerate solution corresponding to this basis (i.e. a solution with no basic
variables equal to zero) would not satisfy the third line of Eq. (6), i.e. would not be complementary. In order to obtain a basic feasible complementary tableau for Eq. (6) we can proceed by pivoting from each of the rightmost n columns which violates complementarity into the corresponding one of the leftmost n columns. We remark at this point that even though p is constrained to be zero, the basic feasible complementary solution which we have constructed does correspond to a nondegenerate solution for an infinitesimal relaxation of the constraints. That is, we can, keeping the same basis, add to the right-hand side of the first line of (6) a 2n-vector (ε, ε², …, ε^{2n})ᵀ, where ε is treated as arbitrarily small but positive. This infinitesimal relaxation to produce nondegeneracy is in fact the standard lexicographic pivoting method [32].

We can now begin the complementary pivoting procedure which characterizes Lemke's algorithm by pivoting into the leftmost column of the linear equations, corresponding to d_1, thus allowing d_1 to become positive. This causes a column i to leave the basis, thus creating a complementary pair i, ĩ outside the basis, where either ĩ = i + n or ĩ = i − n. Our next choice for a pivot column, in order to maintain complementarity, is therefore ĩ. We can continue this procedure until d_1 = 1 or we reach a column where there is no positive element to pivot on (geometrically an infinite ray), which represents a basis for which d_1 may be arbitrarily large. Note that this is actually Lemke's algorithm in reverse pivoting sequence, since usually we start on an infinite ray. However, in all other respects it is the same as the algorithm described by Lemke.

What does each pivot represent? From our construction of Eq. (3), the larger we make d_1, the larger the possible voltage outputs that each amplifier may have. Since all connections are nonnegative, by increasing d_1 we can either increase an amplifier output through biasing, or through outputs from other amplifiers. Thus each pivot either represents an amplifier output going from 0 to positive, or from positive to 1, its upper boundary. Altogether there can be at most two pivots for each amplifier. Therefore the result of Phase I can be computed in a number of steps which is polynomial in the number of bits needed to specify (1). In particular, if (as is the natural assumption for modeling the brain) the maximum specification size of individual components (resistors, capacitors, amplifiers) given in Eq. (3) is bounded and not a function of the number of components, this implies that Phase I can be computed in polynomial time in the number of model neurons.

To begin Phase II, we keep the tableau we had at the end of Phase I, but move the leftmost column of Eq. (6) back to the right-hand side, as in Eq. (5), with d_1 = 1 and d_2 = 1. This leaves us with a feasible basic complementary solution to Eq. (5). We can then rewrite
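The complementary pivoting at the heart of this procedure can be illustrated with a generic, textbook version of Lemke's algorithm for a standard-form LCP (w = Mz + q, w, z ≥ 0, wᵀz = 0). This sketch uses the usual covering-vector initialization rather than the authors' two-phase, reverse-ray variant, and it omits the lexicographic anti-cycling rule mentioned above; it is meant only to show what a pivot and the complementarity rule look like in code.

```python
import numpy as np

def _pivot(T, basis, r, c):
    # Gauss-Jordan pivot on entry (r, c); returns the index of the variable
    # that leaves the basis at row r.
    T[r] /= T[r, c]
    for i in range(len(basis)):
        if i != r:
            T[i] -= T[i, c] * T[r]
    basis[r], leaving = c, basis[r]
    return leaving

def lemke(M, q, max_pivots=500):
    # Solve the LCP  w = M z + q,  w, z >= 0,  w'z = 0,  by complementary
    # pivoting. Tableau columns: [w_1..w_n | z_1..z_n | z_aux | rhs].
    q = np.asarray(q, dtype=float)
    n = len(q)
    if q.min() >= 0:
        return np.zeros(n)                       # q >= 0: z = 0 already solves it
    T = np.hstack([np.eye(n), -np.asarray(M, float),
                   -np.ones((n, 1)), q.reshape(-1, 1)])
    basis = list(range(n))                       # the w-variables start basic
    left = _pivot(T, basis, int(np.argmin(q)), 2 * n)   # auxiliary enters
    entering = left + n                          # complement of the w that left
    for _ in range(max_pivots):
        col, rhs = T[:, entering], T[:, -1]
        pos = col > 1e-12
        if not pos.any():
            raise RuntimeError("ray termination")        # no solution found
        ratios = np.full(n, np.inf)
        ratios[pos] = rhs[pos] / col[pos]                # minimum-ratio test
        left = _pivot(T, basis, int(np.argmin(ratios)), entering)
        if left == 2 * n:                                # auxiliary left: done
            z = np.zeros(n)
            for row, b in enumerate(basis):
                if n <= b < 2 * n:
                    z[b - n] = T[row, -1]
            return z
        entering = left + n if left < n else left - n    # complementary rule
    raise RuntimeError("pivot limit exceeded")

# Example: M positive definite, q = [-1, -1]  ->  z = [1/3, 1/3].
print(lemke(np.array([[2.0, 1.0], [1.0, 2.0]]), np.array([-1.0, -1.0])))
```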
Eq. (5) as

$$\begin{bmatrix} -\tilde{c} & R & -I_n & I_n & 0 \\ 0 & I_n & 0 & 0 & I_n \end{bmatrix} \begin{bmatrix} d_2 \\ p \\ y \\ u \\ v \end{bmatrix} = \begin{bmatrix} -c \\ 0 \end{bmatrix} + d_1 \begin{bmatrix} 0 \\ e \end{bmatrix},$$
$$d_2, p, y, u, v \ge 0, \qquad p^{\top}u + y^{\top}v = 0, \tag{7}$$
and pivot into the leftmost column of the first line, which we know from our termination of Phase I must yield a basic feasible complementary solution with one missing pair. There will either be two possible pivot column choices, one increasing d_2 and the other decreasing it, or else there will be one nonpositive column (an infinite ray, for which solutions may be constructed for arbitrarily large d_2), and one pivoting column which will decrease d_2. In either case we can pivot into a unique column which will decrease d_2, thus initiating a unique complementary pivoting procedure which will reduce the bias parameter d_2 from 1 to 0, at which point there is no complementary missing pair and Phase II terminates. The argument that this phase is also polynomial is the same as that for Phase I, except that now each pivot monotonically decreases the voltage outputs.

It remains to show that the solutions obtained for Phase I and Phase II of the above procedure correspond to those which would be obtained from a time evolution of the system (3), first starting from the zero state in the presence of bias, and then removing the bias. With regard to Phase I, let x be the time-evolution equilibrium, and let x̄ correspond to the above parametric solution for d_1 = 1. If for some i, x_i > x̄_i, then the only way this can happen is if there exists a j ≠ i such that x_j > x̄_j. But now let us watch the time evolution of x from zero and compare it to x̄, and suppose that i is the first index in time such that x_i > x̄_i. Then this is an obvious contradiction, since another such index j must already have existed (excluding degenerate solutions). Therefore x ≤ x̄. But exactly the same argument can be used to show x̄ ≤ x, using the evolution of x̄ with respect to an increase in d_1 instead of the evolution of x with respect to time. Therefore x = x̄. A similar argument applies to the end result of Phase II.
4. Conclusions

The field of neural computation is dominated by a stable-attractor viewpoint in which energy forms are minimized. This viewpoint is attractive because of the gradient-descent interpretation of computations and the relevance for modeling perceptual and other complex
phenomena. However, serious questions about computational complexity arise for these models, such as how biological systems can actually compute such trajectories.

An alternative view of stable attractors and energy minimization is obtained by interpreting the relevant structures in game-theoretic terms. This sets up a duality between continuous dynamical systems and discrete pivoting algorithms for finding equilibria. We exploit this duality, and the biological metaphor, to motivate an alternative interpretation of what a "neural energy minimizing computation" might be. Starting with the standard Hopfield equations, we consider computations that are organized into excitatory cliques of neurons. The main result in this paper was to show how efficiently these neurons can bring each other to saturation response levels, and how these responses agree with the end result of gradient-descent computations. The result suggests that artificial neural networks can be designed for efficient and reliable computation using these techniques, and perhaps that biological neural networks have discovered a reliable and efficient approach to finding equilibria that differs substantially from common practice in computer vision and pattern recognition.
References

[1] S. Ullman, High Level Vision, MIT Press, Cambridge, MA, 1996.
[2] P. Parent, S.W. Zucker, Trace inference, curvature consistency, and curve detection, IEEE Trans. Pattern Anal. Machine Intell. 11 (1989) 823–839.
[3] S.W. Zucker, A. Dobbins, L. Iverson, Two stages of curve detection suggest two styles of visual computation, Neural Comput. 1 (1989) 68–81.
[4] L. Iverson, S.W. Zucker, Logical/linear operators for image curves, IEEE Trans. Pattern Anal. Machine Intell. 17 (10) (1995) 982–996.
[5] R.A. Hummel, S.W. Zucker, On the foundations of relaxation labeling processes, IEEE Trans. Pattern Anal. Machine Intell. PAMI-5 (1983) 267–287.
[6] J.J. Hopfield, D.W. Tank, Neural computation of decisions in optimization problems, Biol. Cybernet. 52 (1985) 141–152.
[7] D.J. Amit, Modeling Brain Function: the World of Attractor Neural Networks, Cambridge University Press, Cambridge, 1989.
[8] M. Pelillo, The dynamics of nonlinear relaxation labeling processes, J. Math. Imag. Vision 7 (1997) 309–323.
[9] D.A. Miller, S.W. Zucker, Copositive-plus Lemke algorithm solves polymatrix games, Oper. Res. Lett. 10 (1991) 285–290.
[10] D.A. Miller, S.W. Zucker, Efficient simplex-like methods for equilibria of nonsymmetric analog networks, Neural Comput. 4 (1992) 167–190.
[11] H.I. Bozma, J.S. Duncan, A game-theoretic approach to integration of modules, IEEE Trans. Pattern Anal. Machine Intell. 16 (1994) 1074–1086.
[12] A. Chakraborty, J.S. Duncan, Game theoretic integration for image segmentation, IEEE Trans. Pattern Anal. Machine Intell. 21 (1999) 12–30.
[13] S. Yu, M. Berthod, A game strategy approach for image labeling, Comput. Vision Image Understanding 61 (1995) 32–37.
[14] L. Blum, F. Cucker, M. Shub, S. Smale, Complexity and Real Computation, Springer, New York, 1998.
[15] T.J. Sejnowski, Skeleton filters in the brain, in: G.E. Hinton, J.A. Anderson (Eds.), Parallel Models of Associative Memory, Lawrence Erlbaum, Hillsdale, NJ, 1981.
[16] J.J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proc. Natl. Acad. Sci. USA 81 (1984) 3088–3092.
[17] L.M. Kirousis, C.H. Papadimitriou, The complexity of recognizing polyhedral scenes, J. Comput. System Sci. 37 (1988) 14–38.
[18] D.A. Miller, S.W. Zucker, Computing with self-excitatory cliques: a model and an application to hyperacuity-scale computation in visual cortex, Neural Comput. 11 (1) (1999) 21–66.
[19] D.A. Miller, S.W. Zucker, A model of hyperacuity-scale computation in visual cortex by self-excitatory cliques of pyramidal cells, Technical Report TR-CIM-93-13, Center for Intelligent Machines, McGill University, Montreal, August 1994.
[20] D. Miller, S.W. Zucker, Reliable computation and related games, in: M. Pelillo, E. Hancock (Eds.), Energy Minimization Methods in Computer Vision and Pattern Recognition, Lecture Notes in Computer Science, vol. 1223, Springer, Berlin, 1997, pp. 3–18.
[21] G. Westheimer, The spatial grain of the perifoveal visual field, Vision Res. 22 (1982) 157–162.
[22] G. Westheimer, S.P. McKee, Spatial configurations for visual hyperacuity, Vision Res. 17 (1977) 941–947.
[23] D.O. Hebb, The Organization of Behaviour, Wiley, New York, 1949.
[24] V. Braitenberg, Cell assemblies in the cerebral cortex, in: R. Heim, G. Palm (Eds.), Theoretical Approaches to Complex Systems, Lecture Notes in Biomathematics, vol. 21, Springer, New York, 1978, pp. 171–188.
[25] V. Braitenberg, A. Schuez, Anatomy of the Cortex: Statistics and Geometry, Springer, Berlin, 1991.
[26] G. Palm, Neural Assemblies: An Alternative Approach to Artificial Intelligence, Springer, Berlin, 1982.
[27] R.J. Douglas, K.A.C. Martin, Neocortex, in: G.M. Shepherd (Ed.), The Synaptic Organization of the Brain, 3rd ed., Oxford University Press, New York, 1990, pp. 389–438.
[28] R.J. Douglas, C. Koch, K.A.C. Martin, H. Suarez, Recurrent excitation in neocortical circuits, Science 269 (1995) 981–985.
[29] E.F. Moore, C.E. Shannon, Reliable circuits using less reliable relays, J. Franklin Inst. 262 (1956) 191–208, 281–297.
[30] S. Winograd, J.D. Cowan, Reliable Computation in the Presence of Noise, MIT Press, Cambridge, MA, 1963.
[31] M.R. Garey, D.S. Johnson, Computers and Intractability, Freeman, San Francisco, 1979.
[32] G.B. Dantzig, Linear Programming and Extensions, Princeton University Press, Princeton, NJ, 1963.
[33] M.W. Hirsch, S. Smale, Differential Equations, Dynamical Systems, and Linear Algebra, Academic Press, New York, 1974.
[34] H.W. Kuhn, A.W. Tucker, Nonlinear programming, in: J. Neyman (Ed.), 2nd Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, CA, 1951, pp. 481–492.
[35] K. Murty, Linear Complementarity, Linear and Nonlinear Programming, Heldermann, Berlin, 1988.
[36] R.W. Cottle, J.-S. Pang, R. Stone, The Linear Complementarity Problem, Academic Press, New York, 1992.
[37] C.E. Lemke, Bimatrix equilibrium points and mathematical programming, Management Sci. 11 (1965) 681–689.
About the Author: STEVEN W. ZUCKER is the David and Lucile Packard Professor of Computer Science and Electrical Engineering at Yale University. Before moving to Yale in 1996, he was Professor of Electrical Engineering at McGill University and Director of the Program in Artificial Intelligence and Robotics of the Canadian Institute for Advanced Research. He was elected a Fellow of the Canadian Institute for Advanced Research (1983), a Fellow of the IEEE (1988), and a By-Fellow of Churchill College, Cambridge (1993). Dr. Zucker obtained his education at Carnegie-Mellon University in Pittsburgh and at Drexel University in Philadelphia, and was a post-doctoral Research Fellow in Computer Science at the University of Maryland, College Park. He was Professeur Invité at the Institut National de Recherche en Informatique et en Automatique, Sophia-Antipolis, France, in 1989, a Visiting Professor of Computer Science at Tel Aviv University in January 1993, and an SERC Fellow of the Isaac Newton Institute for Mathematical Sciences, University of Cambridge. Prof. Zucker has authored or co-authored more than 130 papers on computational vision, biological perception, artificial intelligence, and robotics, and serves on the editorial boards of 8 journals.

About the Author: DOUGLAS MILLER obtained his Ph.D. at the University of California, Berkeley, in Operations Research. Following a brief period in industry with Pacific Gas and Electric, in California, he became a Post-Doctoral Fellow at the Center for Intelligent Machines, McGill University, in 1990. Douglas A. Miller passed away in 1994.
Pattern Recognition 33 (2000) 543–553
Characterizing the distribution of completion shapes with corners using a mixture of random processes

Karvel K. Thornber^a,*, Lance R. Williams^b
^a NEC Research Institute, 4 Independence Way, Princeton, NJ 08540, USA
^b Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, USA

Received 15 March 1999
Abstract

We derive an analytic expression for the distribution of contours x(t) generated by fluctuations in ẋ(t) = ∂x(t)/∂t due to random impulses of two limiting types. The first type are frequent but weak, while the second are infrequent but strong. The result has applications in computational theories of figural completion and illusory contours because it can be used to model the prior probability distribution of short, smooth completion shapes punctuated by occasional discontinuities in orientation (i.e., corners). This work extends our previous work on characterizing the distribution of completion shapes, which dealt only with the case of frequently acting weak impulses. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
1. Introduction

In a previous paper [1] we derived an analytic expression characterizing a distribution of short, smooth contours. This result has applications in ongoing work on figural completion [2] and perceptual saliency [3]. The idea that the prior probability distribution of boundary completion shapes can be characterized by a directional random walk was first described by Mumford [4]. A similar idea is implicit in Cox et al.'s use of the Kalman filter in their work on grouping of contour fragments [5]. More recently, Williams and Jacobs [6] introduced a representation they called a stochastic completion field: the probability that a particle undergoing a directional random walk will pass through any given position and orientation in the image plane on a path bridging a pair of boundary fragments. They argued that the mode, magnitude and variance of the stochastic completion field are related to the perceived shape, salience and sharpness of illusory contours.
* Corresponding author. Fax: 00609-951-2482
Both Mumford [4] and Williams and Jacobs [6] show that the maximum likelihood path followed by a particle undergoing a directional random walk between two positions and directions is a curve of least energy (see [7]). This is the curve that is commonly assumed to model the shape of illusory contours, and is widely used for semi-automatic region segmentation in many computer vision applications (see [8]).

The distribution of shapes considered by [1,4–6] basically consists of smooth, short contours. Yet there are many examples in human vision where completion shapes perceived by humans contain discontinuities in orientation (i.e., corners). Fig. 1 shows a display by Kanizsa [9]. This display illustrates the completion of a circle and square under a square occluder. The completion of the square is significant because it includes a discontinuity in orientation. Fig. 2 shows a pair of "Koffka Crosses". When the width of the arms of the Koffka Cross is increased, observers generally report that the percept changes from an illusory circle to an illusory square [10]. Although the distribution of completion shapes with corners has not previously been characterized analytically, the idea of including corners in completion shapes is not new.
Fig. 1. Amodal completion of a partially occluded circle and square (redrawn from [9]). In both cases, completion is accomplished in a manner which preserves tangent and curvature continuity at the ends of the occluded boundaries.
The fluctuations in ẋ(t) are due to random impulses drawn from a mixture of two limiting distributions. The first distribution consists of weak but frequently acting impulses (we call this the Gaussian-limit). The distribution of these weak random impulses has zero mean and variance equal to σ_g². The weak impulses act at Poisson times with rate R_g. The second consists of strong but infrequently acting impulses (we call this the Poisson-limit). The distribution of these strong random impulses has zero mean and variance equal to σ_p² (where σ_p ≫ σ_g).
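For intuition, the following sketch draws sample contours from this kind of mixture process: frequent weak Gaussian perturbations of the tangent direction plus rare strong impulses that produce corners. The discretization and all parameter values are illustrative assumptions, not the paper's analytic model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_completion(n_steps=400, dt=0.01, speed=1.0,
                      sigma_g=1.0, rate_p=0.5, sigma_p=2.0):
    # One contour x(t): the direction theta diffuses under weak, frequent
    # impulses (Gaussian limit) and occasionally takes a large jump under a
    # strong, infrequent impulse (Poisson limit) -- a smooth curve punctuated
    # by corners.
    x = np.zeros((n_steps, 2))
    theta = 0.0
    for t in range(1, n_steps):
        theta += sigma_g * np.sqrt(dt) * rng.standard_normal()  # weak, frequent
        if rng.random() < rate_p * dt:                          # rare event
            theta += sigma_p * rng.standard_normal()            # strong: a corner
        x[t] = x[t - 1] + speed * dt * np.array([np.cos(theta), np.sin(theta)])
    return x

paths = [sample_completion() for _ in range(100)]  # empirical prior over shapes
```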
f "uf #(1!u)[Af #const], t t~1 t~1 with u"(1!c)p2w/(p2n #p2w). We then have for this new iterative procedure
where 0(j(1, then
f "AI f #(1!u)const, t t~1 where AI "[I!o(I!/N)!(1!o)DTD],
>&N(m, p2). We then have for this iterative method that the transition probabilities are
C
1 n k ( f k D f k~1, l k, g)Jexp ! [ f !Mltkf k~1!Qltkg]t T(t ) t t t t 2¹(t ) tk k
with o"p2n /(p2n #p2w), is now a contraction mapping.
D
][Qltk]~1[ f k!Mltkf k~1!Qltkg] , 1 t t (20)
5. The modi5ed simulated annealing algorithm where Let us now examine how to obtain a contraction for our iterative procedure. Let us rewrite Eq. (12) as an iterative procedure and add (a(1!nnl*i+(i))#b(1!c)) f (i) to each side of the equation, we have (a#b) f (i)"(a(1!nnl*i+(i))#b(1!c)) f (i) t t~1 # a/ + f ( j)(1!l([i, j])) t~1 j /)"3 i #b((DTg)(i)!(DTDf ) (i)#cf (i)), (16) t~1 t~1
Mltk")ltk#(I!)ltk)(Cltk!(DTD)ltk), H Qltk"(I!)ltk)Bltk,
(21) (22)
where Cltk*i+*f k(i)"/jltk*i+(i) + t j /)"3 i and
(1!l([i, j])) f ( j) nnltk*i+(i) tk
A
(DTD)ltk*i+ * f k(i)"(1!jltk*i+(i)) H t
B
(DTDf )(i) !f (i) , c
Ω^{l_{t_k}} is a diagonal matrix with entries ω^{l_{t_k}}[i](i), and Q^{l_{t_k}} is a diagonal matrix with entries σ²_{m^{l[i]}}(i). In the next section we apply the modified SA and ICM algorithms, whose convergence is established in the appendix, to restore astronomical images. The algorithms are the following:
Algorithm 2 (MSA procedure). Let i_t, t = 1, 2, …, be the sequence in which the sites are visited for updating.

1. Set t = 0 and assign an initial configuration, denoted f_{−1}, l_{−1}, and initial temperature T(0) = 1.
2. The evolution l_{t−1} → l_t of the line process can be obtained by sampling the next point of the line process from the raster-scanning scheme, based on the conditional probability mass function defined in Eqs. (9) and (10), keeping the rest of l_{t−1} unchanged.
3. Set t = t + 1. Go back to step 2 until a complete sweep of the field l is finished.
4. The evolution f_{t−1} → f_t of the image system can be obtained by sampling the next value of the whole image based on the conditional probability mass function given in Eq. (17).
5. Go to step 2 until t > t_f, where t_f is a specified integer.

The following theorem guarantees that the MSA algorithm converges to a local MAP estimate, even in the presence of blurring.

Theorem 2. If the following conditions are satisfied:
1. |φ| < 0.25,
2. T(t) → 0 as t → ∞, such that
3. T(t) ≥ C_T / log(1 + k(t)),
then for any starting configuration f_{−1}, l_{−1}, we have p(f_t, l_t | f_{−1}, l_{−1}, g) → p_0(f, l) as t → ∞, where p_0(·, ·) is a probability distribution over local MAP solutions, C_T is a constant and k(t) is the sweep iteration number at time t.

We notice that if the method converges to a configuration (f̄, l̄), then f̄ = arg max_f p(f | l̄, g). Furthermore, l̄ = arg max_l p(l | f̄, g). We conjecture that the method we are proposing converges to a distribution over global maxima. However, the difficulty of using synchronous models prevents us from proving that result (see Ref. [22]).

The modified ICM procedure is obtained by selecting, in steps 2 and 4 of Algorithm 2, the mode of the corresponding transition probabilities.
6. Test examples

Let us first examine how the modified ICM algorithm works on a synthetic star image, blurred with an atmospheric point spread function (PSF), D, given by

$$d(i) \propto (1 + (u^2 + v^2)/R^2)^{-d}, \tag{23}$$

with d = 3, R = 3.5, i = (u, v), and Gaussian noise with σ_n² = 64. If we use σ_w² = 24415, which is realistic for this image, and take into account that, for the PSF defined in Eq. (23), c = 0.02, then A as defined in Eq. (15) is not a contraction. Figs. 2a and b depict the original and corrupted image, respectively. Restorations from the original and modified ICM methods with β = 2 for 2500 iterations are depicted in Figs. 2c and d, respectively. Similar results are obtained with 500 iterations.
The proposed methods were also tested on real images and compared with ARTUR, the method proposed by Charbonnier et al. [19]. ARTUR minimizes energy functions of the form

$$J(f) = \lambda^2 \Big\{ \sum_i \varphi[f(i) - f(i{:}{+}1)] + \sum_i \varphi[f(i) - f(i{:}{+}2)] \Big\} + \|g - Df\|^2, \tag{24}$$
where λ is a positive constant and φ is a potential function satisfying some edge-preserving conditions. The potential functions we used in our experiments, φ_GM, φ_HL, φ_HS and φ_GR, are shown in Table 1.

Charbonnier et al. [19] show that, for those φ functions, it is always possible to find a function J* such that J(f) = inf_l J*(f, l), where J* is a dual energy which is quadratic in f when l is fixed. Here l can be understood as a line process which, for those potential functions, takes values in the interval [0, 1]. To minimize Eq. (24), Charbonnier et al. propose the following iterative scheme (a code sketch follows the listing):

1. n = 0, f⁰ ≡ 0
2. Repeat
3.   l^{n+1} = arg min_l [J*(f^n, l)]
4.   f^{n+1} = arg min_f [J*(f, l^{n+1})]
5.   n = n + 1
6. Until convergence.
Fig. 2. (a) Original image, (b) observed image, (c) ICM restoration, (d) restoration with the proposed ICM method.
Table 1
Edge-preserving potential functions used with ARTUR

Potential function   Expression φ(t)
φ_GM                 t² / (1 + t²)
φ_HL                 log(1 + t²)
φ_HS                 2√(1 + t²) − 2
φ_GR                 2 log[cosh(t)]
In our experiments the convergence criterion used in step 6 above was ‖f^{n+1} − f^n‖² / ‖f^n‖² < 10⁻⁶. The solution of step 4 was found by a Gauss–Seidel algorithm. The stopping criterion was ‖f^{n+1,m+1} − f^{n+1,m}‖² / ‖f^{n+1,m}‖² < 10⁻⁶, where m is the iteration number.

We use images of Saturn which were obtained at the Cassegrain f/8 focus of the 1.52 m telescope at Calar Alto Observatory (Spain) in July 1991. Results are presented on an image taken through a narrow-band interference filter centered at the wavelength 9500 Å. The blurring function defined in Eq. (23) was used. The parameters d and R were estimated from the intensity profiles of satellites of Saturn that were recorded simulta-
neously with the planet, and of stars that were recorded very close in time and airmass to the planetary images. We found d ≈ 3 and R ≈ 3.4 pixels. Fig. 3 depicts the original image and the restorations after running the original ICM and our proposed ICM methods for 500 iterations, and the original SA and our proposed SA methods for 5000 iterations. In all the images the improvement in spatial resolution is evident. In particular, ring light contribution has been successfully removed from equatorial regions close to the actual location of the rings and amongst the rings of Saturn, the Cassini division is enhanced in contrast, and the Encke division appears on the ansae of the rings in all deconvolved images.

To examine the quality of the MAP estimate of the line process, we compared it with the position of the ring and disk of Saturn, obtained from the Astronomical Almanac, corresponding to our observed image. Although all the methods detect a great part of the ring and the disk, the ICM method (Fig. 4a) shows thick lines. The SA method, on the other hand, gives us thinner lines and the details are more resolved (Fig. 4b). Obviously, there are some gaps in the line process, but better results would be obtained by using eight neighbors instead of four or, in general, by adding more l-terms to the energy function. Fig. 5 depicts the results after running ARTUR using potential functions φ_GM, φ_HL, φ_HS and φ_GR on the Saturn image, together with the results obtained by the proposed
Fig. 3. (a) Original image, (b) restoration with the original ICM method and (c) its line process, (d) restoration with the original SA method and (e) its line process, (f) restoration with the proposed ICM method and (g) its line process, (h) restoration with the proposed SA method and (i) its line process.
Fig. 4. Comparison between the real edges (light) and the obtained line process (dark). (a) proposed ICM method, (b) proposed SA method.
ICM method. Note that the line processes obtained by the potential functions used with ARTUR are presented in inverse gray levels. The results suggest that φ_GM and φ_HL capture the lines of the image better than φ_HS and φ_GR. Lines captured by all these functions are thicker than those obtained by the proposed ICM method; notice that the line process produced by these potential functions is continuous on the interval [0, 1]. Furthermore, φ_GM also captures some low-intensity lines, due to the noise, that create some artifacts on the restoration, especially on the Saturn rings; see Fig. 5b. Finally, the potential functions used with ARTUR have captured the totality of the planet contour, although the line process intensity on the contour is quite low.

The methods were also tested on images of Jupiter, which were also obtained at the Cassegrain f/8 focus of the 1.52 m telescope at Calar Alto Observatory (Spain) in August 1992. The blurring function was the same as in the previous experiment. Fig. 6 depicts the original image and the restorations after running the original ICM and our proposed ICM method for 500 iterations and our proposed SA method for 5000 iterations. In all the images the improvement in spatial resolution is evident. Features like the equatorial plumes and the Great Red Spot are
Fig. 5. (a) Restoration with the proposed ICM method; (f) and (k) its corresponding horizontal and vertical line processes. (b)–(e) show the restorations when ARTUR is run with the potentials φ_GM, φ_HL, φ_HS and φ_GR, respectively. Their corresponding horizontal line processes are shown in (g)–(j) and their vertical processes in (l)–(o).
very well detected. ARTUR was also tested on these images, obtaining results similar to those obtained with Saturn.

In order to obtain a numerical comparison, ARTUR and our methods were tested and compared using the cameraman image. The image was blurred using the PSF defined in Eq. (23) with the parameters d = 3 and R = 4. Gaussian noise with variance 62.5 was added, obtaining an image with SNR = 20 dB. The original and observed images are shown in Fig. 7.
Fig. 6. (a) Original image, (b) restoration with the original ICM method and (c) its line process, (d) restoration with the proposed ICM method and (e) its line process, (f) restoration with the proposed SA method and (g) its line process.
In order to compare the quality of the restorations we used the peak signal-to-noise ratio (PSNR) which, for two images f and g of size M×N, is defined as

$$\mathrm{PSNR} = 10 \log_{10} \left[ \frac{M \times N \times 255^2}{\|g - f\|^2} \right].$$

Fig. 7. (a) Original cameraman image, (b) observed image.

Figs. 8 and 9 depict the restorations after running our proposed SA method for 5000 iterations, our proposed ICM method for 500 iterations, and ARTUR with different potential functions.
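In code, this metric is a direct transcription of the formula above (images assumed to be 8-bit arrays of the same shape):

```python
import numpy as np

def psnr(f, g):
    # PSNR = 10 * log10( M*N*255^2 / ||g - f||^2 )  for MxN 8-bit images.
    sq_err = np.sum((np.asarray(g, float) - np.asarray(f, float)) ** 2)
    return 10.0 * np.log10(f.size * 255.0**2 / sq_err)
```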
Results, shown in Table 2, are quite similar for all the methods, but they suggest that better results are obtained when our proposed SA method is used. For the two best methods in terms of PSNR, our proposed SA and ARTUR with φ_GM, we have included cross-sections of the original and restored images in Fig. 10. It can be observed that, although both profiles are quite similar, our proposed SA method obtains sharper edges than those obtained with φ_GM.
Fig. 8. (a) Restoration with the proposed SA method and (b), (c) its horizontal and vertical line process, (d) restoration with the proposed ICM method and (e), (f) its horizontal and vertical line process, (g) restoration with ARTUR with the φ_GM function and (h), (i) its horizontal and vertical line process.
Table 2
Comparison of the different restoration methods in terms of PSNR

Method      Observed   Proposed ICM   Proposed SA   ARTUR φ_GM   ARTUR φ_HL   ARTUR φ_HS   ARTUR φ_GR
PSNR (dB)   18.89      20.72          21.11         20.75        20.64        20.72        20.51
Table 3 shows the total computing time of the studied methods after running them on one processor of a Silicon Graphics Power Challenge XL. It also shows the execution time per iteration relative to that of the ICM method. The small difference between the ICM and SA methods is due to the fact that most of the time is spent in convolving images.
Fig. 9. (a) Restoration with ARTUR with the u_HL function and (b), (c) its horizontal and vertical line processes, (d) restoration with ARTUR with the u_HS function and (e), (f) its horizontal and vertical line processes, (g) restoration with ARTUR with the u_GR function and (h), (i) its horizontal and vertical line processes.

Table 3
Total computing time of the methods and relative time per iteration referred to the original ICM

                 Original ICM   Original SA   Proposed ICM   Proposed SA   ARTUR u_GM   ARTUR u_HL   ARTUR u_HS   ARTUR u_GR
Total time (s)   1149           12852         140            2250          198          38           29           29
Relative time    1.00           1.13          0.12           0.20          0.17         0.17         0.17         0.17
7. Conclusions

In this paper we have presented two new methods that can be used to restore high dynamic range images in the presence of severe blurring. These methods
extend the classical ICM and SA procedures, and the convergence of the algorithms is guaranteed. The experimental results verify the derived theoretical results. Further extensions of the algorithms are under consideration.
Fig. 10. Cross-section of line 153 of the original cameraman image (solid line) and reconstructed images (dotted line) with (a) the proposed SA and (b) ARTUR with u_GM.

Appendix. Convergence of the MSA procedure

In this section we shall examine the convergence of the MSA algorithm. It is important to make clear that in this new iterative procedure we simulate f(i) using Eq. (17), and to simulate l([i, j]) we keep using Eqs. (9) and (10). We shall denote by π_T the corresponding transition probabilities. That is, π_{T(t_k)}(f_{t_k} | f_{t_k−1}, l_{t_k}, g) is obtained from Eq. (20) and π_{T(t_k)}(l_{t_k} | f_{t_k−1}, l_{t_k−1}) is obtained from Eqs. (9) and (10). Since updating the whole image at the same time prevents us from having a stationary distribution, we will not be able to show convergence to the global MAP estimates using the same proofs as in Geman and Geman [1] and Jeng and Woods [3]. To prove the convergence of the chain we need some lemmas and definitions, as in Jeng [2] and Jeng and Woods [3]. We assume a measure space (Ω, Σ, μ) and a conditional density function π_n(s_n | s_{n−1}) which defines a Markov chain s_1, s_2, …, s_n, …. In our application, the s_i are vector valued, with a number of elements equal to the number of pixels in the image. For simplicity, we assume Ω is R^d and μ is the Lebesgue measure on R^d. Define a Markov operator P_n : L¹ → L¹ as follows:

P_n π(s_n) = ∫_Ω π_n(s_n | s_{n−1}) π(s_{n−1}) ds_{n−1}.   (A.1)

By P_n^m we mean the composite operation P_{n+m} P_{n+m−1} ⋯ P_{n+2} P_{n+1}. The convergence problem we are dealing with is the same as the convergence of P_0^m as m → ∞.
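A discrete-state analogue may clarify the role of the composite operator P_0^m: with row-stochastic transition matrices, composition is matrix multiplication, and convergence corresponds to the rows of the product becoming identical. The following sketch (ours, finite-state only) illustrates this:

```python
import numpy as np

def compose(kernels):
    """Composite operator P_{n+m} ... P_{n+1} for row-stochastic matrices:
    a row distribution pi is propagated as pi @ P_1 @ ... @ P_m."""
    P = np.eye(kernels[0].shape[0])
    for K in kernels:
        P = P @ K
    return P

# A fixed 3-state kernel: as m grows, the rows of the m-fold product agree,
# i.e. the chain forgets its initial distribution.
K = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
P50 = compose([K] * 50)
print(np.allclose(P50[0], P50[1]) and np.allclose(P50[1], P50[2]))  # True
```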
Definition A.1. Let x be a vector with components x(i) and Q a matrix with components q(i, j). We define ‖x‖_2 and ‖Q‖_2 as follows:

‖x‖_2 = ( Σ_i |x(i)|² )^{1/2},   ‖Q‖_2 = sup_x ‖Qx‖_2 / ‖x‖_2 = max_i (ρ(i))^{1/2},

where the ρ(i) are the eigenvalues of the matrix Qᵀ Q.

Definition A.2. A continuous nonnegative function V : Ω → R is a Lyapunov function if
it is easy to show, by successive marginalizations, that P(z_1, …, z_i) ∝ Π_{c: max c ≤ i} g_c(z_c), where ñ(i) denotes the past neighborhood of i, and:

P(z_i | z_{pa(i)}) = P(z_i | z_{ñ(i)}) = (1/κ_i) Π_{c ∋ i: max c = i} g_c(z_c).

The model then verifies Eq. (7) with past neighborhoods ñ(i) ⊂ n(i) ∩ pa(i), which are small when the n(i)'s are. This way of introducing causality is at the heart of the various bidimensional causal representations [7–15,34]. As we already said, this causal probabilistic decomposition allows one to draw samples recursively from P(z), starting from node 1, and all marginals can be computed exactly. However, when Eq. (9) holds for the prior model (z = x) (and therefore for the joint model z = (x, y) in the case of pointwise measurements), it does not hold in general for the posterior model (z = x | y), although the prior and posterior independence graphs are the same! This is particularly harmful since the posterior model is at the heart of inference and sampling procedures (at least in inverse problems).
3.2. Graphical characterization

Graphical considerations will allow us to point out an important class of interaction models for which the same conclusion (i.e., causality relative to some in-past parts of the original non-causal neighborhoods) systematically holds, whatever the actual factors are. To make notations general and simpler to handle in the following, we now write g_i(z_i, z_{n(i)∩pa(i)}), even though some of the components of z_{n(i)∩pa(i)} might be absent from the arguments of the function (i.e., if ∪_{c ∋ i: max c = i} c − {i} ⊊ n(i) ∩ pa(i)). Return to the successive marginalizations from z_n to z_1. As explained in Section 2, the summation of Π_i g_i w.r.t. z_n makes all sites of n(n) ∩ pa(n) = n(n) mutual neighbors through the function G_n(z_{n(n)}) ≜ Σ_{z_n} g_n(z_n, z_{n(n)}). A particular situation for which this structural change has no incidence is when n(n) is already a clique. As a consequence, the random vector z_{pa(n)} exhibits the subgraph generated by G on pa(n) as an independence graph. Its joint distribution is proportional to …

β Σ_{(i,ī)} δ(x_i ≠ x_ī) + Σ_i [ (y_i − μ_{x_i})² / (2σ²_{x_i}) + log σ_{x_i} ],   (15)

where x is a tree labeling with x_i ∈ {1, …, M}, ī denotes the parent of node i, β is a positive parameter, and {(μ_k, σ_k²)}_{k=1}^M are the means and variances of the M classes. First experiments were carried out on 256×256 synthetic images involving five classes with known means and variances (Fig. 6). The variances were set to a higher level in the second image (standard deviations range from
15 to 40 in the first image, and from 15 to 70 in the second one). We compared the three noniterative inference procedures on the quadtree, and the iterative ICM algorithm running on the spatial counterpart of energy (15). The obtained classifications are shown in Fig. 7, while Table 2 indicates the corresponding rates of misclassification and CPU times in seconds. On both images, the three noniterative estimators have provided very close classifications which are better than those obtained by iterative estimation with the grid-based model (and the noniterative estimations are less degraded than the iterative one for image d2), while taking two to three times less CPU time. Their noniterative nature results in a fixed computational complexity per site (i.e., they exhibit an O(n) complexity). We experimentally determined, using MATLAB implementations, that MAP, semi-MPM, and MPM inferences are achieved with respectively around 79, 94 and 107 floating point operations (flops) per site, when the x_i's can take two possible
Fig. 6. 5-class synthetic data.
Fig. 7. From left to right: MAP, semi-MPM, and MPM estimates on the quadtree, and the ICM iterative estimate on the pixel grid, for the classification (a) of synthetic image d1, (b) of synthetic image d2.
Table 2
Comparative percentages of misclassification, and CPU times in seconds, on synthetic images

            Quadtree                                        2D grid
            MAP            sMPM           MPM              ICM
Image d1    4.79% (3.5 s)  4.73% (5.7 s)  4.73% (8.4 s)    5.30% (10.1 s)
Image d2    8.01% (3.5 s)  7.96% (5.7 s)  7.97% (8.4 s)    9.65% (10.5 s)
values. With a similar implementation, standard ICM estimation on the bidimensional grid [1] costs around 52 flops/site, whereas the overall procedure is iterative with no guarantee on the required number of iterations. Among the three noniterative estimators, the MPM estimator is the most time consuming, due to the larger amount of calculation required in the downward sweep. However, this extra cost (for similar estimates) might be worth the pain, since the obtained knowledge of the posterior marginals P(x_i | y) allows one to assess, for each site, the degree of confidence that can be associated with the estimated value, e.g., through the entropy −Σ_{x_i} P(x_i | y) log P(x_i | y) of the marginal. Fig. 8 shows such "confidence maps". These confidence measures, reminiscent of the error covariance matrices of Gaussian models on trees [24], can be useful for a better appreciation and use of the obtained estimates. Visually, the classifications provided by the three noniterative estimators exhibit a "blocky" aspect, reflecting the underlying quadtree prior structure. The amount of such artifacts depends on the relative location
of spatial patterns with respect to the block partition induced on the pixel grid by the quadtree. Also, these artifacts are more apparent in the processing of noisier images, where the role of the quadtree-based prior has to be enforced to get rid of the noise. In the prospect of parameter estimation this is not a serious problem, provided that the overall estimate is good (i.e., the percentage of misclassification is low). However, if the visual rendering of the estimate is at the heart of the application concerned, a single ICM smoothing sweep suffices to remove the "blockyness" at reasonable cost.

There is another source of concern, lying in the huge number of successive summations/multiplications usually involved in the functions computed through upward sweeps. If no attention is paid to this aspect, one often ends up with quantities either too small or too large to be handled by computers. To prevent the algorithms from being trapped in these tricky situations, it might be necessary to devise a rescaling of the quantities of interest (namely the F_i's and their variants). A simple way to proceed consists in normalizing these functions such that summing out x_{ñ(i)} yields 1; for instance

F_i = Σ_{x_i} h_{f_i} Π_{j∈ī} F_j / ( Σ_{x_{ñ(i)}, x_i} h_{f_i} Π_{j∈ī} F_j ).

It is easy to see that these normalizations have no incidence whatsoever on the procedures we have described (a minimal code sketch of this rescaling is given below, after the Spot experiment).

Fig. 8. "Confidence maps" associated with the noniterative MPM classification of synthetic images d1 and d2, given by the entropy of the posterior marginals at the leaves (the darker, the lower the entropy and the higher the confidence).

Fig. 9. (a) 512×512 Spot image (courtesy of Costel, University of Rennes 2, and GSTB); (b) direct ML classification; (c) ICM iterative classification on the pixel grid; (d) MAP noniterative classification on the quadtree.

Finally, we consider the supervised classification of a 512×512 Spot image (Fig. 9a), provided by the Costel laboratory (University of Rennes 2), into 8 classes with physical meanings (mainly crop types). Maximum likelihood classification (often used in remote sensing applications) is poor (Fig. 9b), but provides a simple and sensible initial configuration for the iterative grid-based classification, whose final result is obtained after 65 s (Fig. 9c). In less time, the three tree-based noniterative estimators have provided close results of good quality. See for instance in Fig. 9d the MAP classification, obtained within 40 s.
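As promised above, here is a minimal sketch of the rescaling idea on a generic upward sweep, with our own generic names (the actual F_i functions depend on the model). A per-node normalisation keeps the quantities within machine range without changing any of the estimates:

```python
import numpy as np

def normalised(message):
    """Rescale an upward-sweep quantity (an array over the label set) so it
    sums to one; the discarded constant cancels in all subsequent ratios,
    so MAP and MPM decisions are unaffected."""
    s = message.sum()
    return message / s if s > 0 else message

def upward_combine(child_messages, local_factor):
    """Combine the children's messages with a local factor, renormalising
    after each product so that deep quadtrees never underflow."""
    m = np.array(local_factor, dtype=float)
    for c in child_messages:
        m = normalised(m * c)
    return m
```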
6. Conclusion

In this paper we intended to provide a comprehensive and unified picture of models on "causal graphs". We presented in detail the manipulation of such discrete models, with emphasis on (a) the use of graph-theoretic concepts as tools to devise models and gain insight into algorithmic procedures; and (b) the profound unity which underlies the different procedures, whether they compute probabilities, draw samples, or infer estimates. In particular, we presented three generic exact noniterative inference algorithms devoted to models exhibiting a triangulated independence graph. The first algorithm allows one to compute the MAP estimate (and can be considered apart from any probabilistic framework as performing global energy minimization). The second one, whose aim is intrinsically probabilistic, allows one to compute local posterior marginals, which can be used to get the MPM estimate or to estimate parameters within an EM-like algorithm [20]. The third one mixes, to some extent, the characteristics of the two others. On simple quadtrees, these two-sweep procedures provide a hierarchical framework suitable for discrete image analysis problems such as detection, segmentation or classification. Apart from providing a lower-cost alternative to iterative inference schemes, these tree-based models are good candidates for handling multiresolution data, as advocated in [21,27].
Acknowledgements

The authors gratefully acknowledge their debt to Eric Fabre for enlightening and stimulating discussions.
References

[1] J. Besag, Spatial interaction and the statistical analysis of lattice systems, J. Royal Statist. Soc. B 36 (1974) 192–236.
[2] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell. 6 (6) (1984) 721–741.
[3] B. ter Haar Romeny (Ed.), Geometry-driven Diffusion in Computer Vision, Kluwer Academic Publishers, Dordrecht, 1995.
[4] G. Celeux, D. Chauveau, J. Diebolt, On stochastic versions of the EM algorithm, Technical Report 2514, INRIA, March 1995.
[5] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statist. Soc. B 39 (1977) 1–38, with discussion.
[6] B. Chalmond, An iterative Gibbsian technique for reconstruction of M-ary images, Pattern Recognition 22 (6) (1989) 747–761.
[7] K. Abend, T.J. Harley, L.N. Kanal, Classification of binary random patterns, IEEE Trans. Inform. Theory 11 (1965) 538–544.
[8] H. Derin, P.A. Kelly, Discrete-index Markov-type random processes, Proc. IEEE 77 (10) (1989) 1485–1509.
[9] J. Goutsias, Mutually compatible Gibbs random fields, IEEE Trans. Inform. Theory 35 (6) (1989) 1233–1249.
[10] J. Goutsias, Unilateral approximation of Gibbs random field images, Graph. Mod. Image Proc. 53 (1991) 240–257.
[11] A. Habibi, Two-dimensional Bayesian estimate of images, Proc. IEEE 60 (1972) 878–883.
[12] J. Moura, N. Balram, Recursive structure of noncausal Gauss–Markov random fields, IEEE Trans. Inform. Theory 38 (2) (1992) 335–354.
[13] D. Pickard, A curious binary lattice, J. Appl. Probab. 14 (1977) 717–731.
[14] D. Pickard, Unilateral Markov fields, Adv. Appl. Probab. 12 (1980) 655–671.
[15] J. Woods, C. Radewan, Kalman filtering in two dimensions, IEEE Trans. Inform. Theory 23 (1977) 473–481.
[16] G.D. Forney, The Viterbi algorithm, Proc. IEEE 61 (3) (1973) 268–278.
[17] L. Baum, T. Petrie, G. Soules, N. Weiss, A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Statist. 41 (1970) 164–171.
[18] L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989) 257–285.
[19] C. Bouman, M. Shapiro, A multiscale image model for Bayesian image segmentation, IEEE Trans. Image Process. 3 (2) (1994) 162–177.
[20] J.-M. Laferté, F. Heitz, P. Pérez, A multiresolution EM algorithm for unsupervised image classification using a quadtree model, in: Proc. Int. Conf. on Pattern Recognition, Vienna, Austria, August 1996.
[21] J.-M. Laferté, F. Heitz, P. Pérez, E. Fabre, Hierarchical statistical models for the fusion of multiresolution image data, in: Proc. Int. Conf. Computer Vision, Cambridge, June 1995.
[22] K. Chou, A. Willsky, A. Benveniste, Multiscale recursive estimation, data fusion, and regularization, IEEE Trans. Autom. Control 39 (3) (1994) 464–477.
[23] K. Chou, A. Willsky, R. Nikoukhah, Multiscale systems, Kalman filters, and Riccati equations, IEEE Trans. Autom. Control 39 (3) (1994) 479–491.
[24] M. Luettgen, W. Karl, A. Willsky, Efficient multiscale regularization with applications to the computation of optical flow, IEEE Trans. Image Process. 3 (1) (1994) 41–64.
[25] M. Luettgen, A. Willsky, Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination, IEEE Trans. Image Process. 4 (2) (1995) 194–207.
[26] P. Fieguth, Application of multiscale estimation to large scale multidimensional imaging and remote sensing problems, Ph.D. thesis, MIT Dept. of EECS, June 1995.
[27] M. Daniel, A.S. Willsky, A multiresolution methodology for signal-level fusion and data assimilation with applications to remote sensing, Proc. IEEE 85 (1) (1997) 164–180.
[28] R. Kindermann, J.L. Snell, Markov Random Fields and their Applications, vol. 1, Amer. Math. Soc., Providence, RI, 1980.
[29] P. Pérez, F. Heitz, Restriction of a Markov random field on a graph and multiresolution statistical image modeling, IEEE Trans. Inform. Theory 42 (1) (1996) 180–190.
[30] S. Lauritzen, Graphical Models, Oxford Science Publications, Oxford, 1996.
[31] J. Whittaker, Graphical Models in Applied Multivariate Statistics, Wiley, Chichester, 1990.
[32] J. Woods, Two dimensional discrete Markovian fields, IEEE Trans. Inform. Theory 18 (1972) 232–240.
[33] S. Lauritzen, A. Dawid, B. Larsen, H.-G. Leimer, Independence properties of directed Markov fields, Networks 20 (1990) 491–505.
[34] P.A. Devijver, Real-time modeling of image sequences based on hidden Markov mesh random field models, Technical Report M-307, Philips Research Lab., June 1989.
[35] J.-M. Laferté, F. Heitz, P. Pérez, E. Fabre, Hierarchical statistical models for the fusion of multiresolution data, in: Proc. SPIE Conf. on Neural, Morphological, and Stochastic Methods in Image and Signal Processing, San Diego, USA, July 1995.
About the Author: PATRICK PÉREZ was born in 1968. He graduated from École Centrale Paris, France, in 1990. He received the Ph.D. degree in Signal Processing and Telecommunications from the University of Rennes, France, in 1993. He now holds a full-time research position at the INRIA centre in Rennes. His research interests include statistical and/or hierarchical models for large inverse problems in image analysis.

About the Author: ANNABELLE CHARDIN was born in 1973. She graduated from École Nationale Supérieure de Physique de Marseille. She is completing her Ph.D. degree in Signal Processing and Telecommunications at the University of Rennes, France.

About the Author: JEAN-MARC LAFERTÉ was born in 1968. He received the Ph.D. degree in Computer Science from the University of Rennes, France, in 1996. He now holds an assistant professor position in the computer science department of the University of Rennes.
Pattern Recognition 33 (2000) 587–602
Unsupervised image segmentation using Markov random field models

S.A. Barker*, P.J.W. Rayner

Signal Processing and Communications Group, Cambridge University Engineering Department, Cambridge CB2 1PZ, UK

Received 15 March 1999
Abstract

We present two unsupervised segmentation algorithms based on hierarchical Markov random field models for segmenting both noisy images and textured images. Each algorithm finds the most likely number of classes, their associated model parameters, and generates a corresponding segmentation of the image into these classes. This is achieved according to the maximum a posteriori criterion. To facilitate this, an MCMC algorithm is formulated to allow the direct sampling of all the above parameters from the posterior distribution of the image. To allow the number of classes to be sampled, a reversible jump is incorporated into the Markov chain. Experimental results are presented showing rapid convergence of the algorithm to accurate solutions. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Markov random field; Unsupervised segmentation; Reversible jump; Markov chain Monte Carlo; Simulated annealing
1. Introduction

The segmentation of noisy or textured images into a number of different regions comprises a difficult optimisation problem. This is compounded when the number of regions into which the image is to be segmented is also unknown. If each region within an image is described by a different model distribution, then the observed image may be viewed as a realisation from a map of these model distributions. This underlying map divides the image into regions which are labelled with different classes. Image segmentation can therefore be treated as an incomplete data problem [1], in which the intensity data is observed, the class map is missing and the model parameters associated with each class need to be estimated. In the unsupervised case, the number of model classes is also unknown. The unsupervised segmentation problem has been approached by several authors. Of these, most propose
* Corresponding author. E-mail address: [email protected] (S.A. Barker).
algorithms comprising two steps [2–4]. The image is assumed to be composed of an unknown number of regions, each modelled as an individual Markov random field. The first of these steps is a coarse segmentation of the image into the most 'likely' number of regions. This is achieved by dividing the image into windows, calculating features or estimating model parameters, and then using a measure to combine closely related windows. The resulting segmentation is then used to estimate model parameters for each of the classes, before a supervised high-resolution segmentation is carried out via some form of relaxation algorithm. A similar methodology is used by Geman et al. [5], but the measure used, the Kolmogorov–Smirnov distance, is a direct measure of the similarity of the distributions of grayscale values (in the form of histograms) between adjacent windows. Windows are then combined into a single region if the distance between their distributions is relatively small. A variant on this algorithm [6] is based on the same distance function, but the distribution of grayscales in each window is compared with the distribution functions of the samples comprising each class over the complete image. If the distribution of one class is
found to be close enough to that of the window, then it is designated as being a member of that class. Otherwise, a new outlier class is created. When the field stabilises, usually after several iterations, new classes are created from the outliers if they constitute more than one percent of the image. If not, the algorithm is re-run. A split-and-merge algorithm is proposed by Panjwani and Healey [7]. The image is initially split into large square windows, but these are then re-split to form four smaller windows if a uniformity test for each window is not met. The process ends when windows as small as 4×4 pixels are reached. The windows are then merged to form regions using a distance measure based on the pseudo-likelihood. Won and Derin [8] obtain segmentations and parameter estimates by alternately sampling the label field and calculating maximum pseudo-likelihood estimates of the parameter values. The process is repeated over differing numbers of label classes and the resulting estimates are applied to a model-fitting criterion to select the optimum number of classes and hence the image segmentation. The criterion used compensates the likelihood of the optimised model with a penalty term that offsets image size against the number of independent model parameters used. The penalty term and its associated parameter values were selected arbitrarily. This method of exhaustive search over a varying number of classes was developed further by Langan et al. [9]. Here, an EM algorithm is first used to estimate parameters while alternately segmenting the image. To select between the resulting optimisations of the differing models, the function of increasing likelihood against increasing model order is fitted to a rising exponential model. The exponential model parameters are selected in a least-squares sense and the optimum model order is then found at a pre-specified knee point in the exponential curve. The approach to unsupervised segmentation presented here comprises a Markov chain Monte Carlo (MCMC) algorithm to sample from the posterior distribution, so that simulated annealing may be used to estimate the MAP solution. This methodology is similar to that used in [10] to segment an image using a known MRF model. Here the method is extended so that the sampling scheme, and hence the MAP estimate, covers not just the segmentation of the image into classes but also the number of classes and their respective model parameters. The algorithm differs from those reviewed above in that no windowing is required to estimate region model parameters. The algorithm's MCMC methodology removes the necessity for an exhaustive search over a subsection of the parameter space. This ensures an improvement in efficiency over algorithms that require separate optimisations to be carried out for each model before a model order selection is made.
The remainder of this paper is organised as follows: Section 2 specifies the image models used throughout the paper. The posterior distributions for the noisy image and texture models are derived in Sections 2.1 and 2.2. Section 3 describes the algorithms employed to sample from these distributions. The segmentation process, or allocation of class labels to pixel sites, is given, as is the sampling scheme for noise and MRF model parameters from their conditional densities. The method by which reversible jumps are incorporated into Markov chains, to enable sampling of the number of classes into which an image might be segmented, is then described. This process is then detailed for both noisy and textured image models. Experimental results for the resulting algorithms are presented in Section 4 and the paper is concluded in Section 5.
2. Image models

Let Ω denote an M×N lattice indexed by (i, j), so that Ω = {(i, j); 1 ≤ i ≤ M, 1 ≤ j ≤ N}. Let Y = {Y_s = y_s; s ∈ Ω} be the observed grayscale image, where pixels take values from the interval (0, 1]. Then let X = {X_s = x_s; s ∈ Ω} correspond to the labels of the underlying Markov random field, which take values from Λ = {0, 1, …, k−1}. If η_s defines a neighbourhood at site s, then let the vector of labels comprising that neighbourhood be x_{η_s}. Similarly, let ρ_s define a second neighbourhood structure at site s, but let this be defined on the observed image Y, so that y_{ρ_s} is the vector of pixel grayscale values over that neighbourhood. Finally, let all model parameters be included in the parameter vector Ψ. If a Gibbs distribution is used to model the likelihood of observing the image Y given the label field X, and to model all a priori knowledge of spatial correlations within the image and label fields, then the conditional density for an observed pixel grayscale value and class label at site s, given the two neighbourhoods, is

p(Y_s = y_s, X_s = x_s | y_{ρ_s}, x_{η_s}, Ψ) = (1/Z(y_{ρ_s}, x_{η_s}, Ψ)) exp{−U(Y_s = y_s, X_s = x_s | y_{ρ_s}, x_{η_s}, Ψ)},   (1)

where U(·) is the energy function and Z(·) is the normalising function of the conditional distribution. If we divide the parameter vector so that Ψ = [φ_c; {c ∈ Λ}, γ], where φ_c corresponds to a vector of model parameters defining the likelihood of observing the pixel value y_s given its neighbourhood and label, and γ corresponds to a vector of hyper-parameters defining the prior on the label field X, then the conditional distribution may be
factorised so that

p(Y_s = y_s, X_s = x_s | y_{ρ_s}, x_{η_s}, Ψ) = (1/Z(y_{ρ_s}, φ_{x_s}, x_{η_s}, γ)) exp{−U_1(Y_s = y_s | y_{ρ_s}, X_s = x_s, φ_{x_s}) − U_2(X_s = x_s | x_{η_s}, γ)}.   (2)

When considering the complete Gibbs distribution for the entire image, the partition function (or normalising function) becomes far too complex to evaluate, making it unfeasible to compare the relative probabilities of two different MRF realisations. An approximation to the Gibbs distribution that allows an absolute probability for an MRF realisation to be calculated is the pseudo-likelihood, introduced by Besag [11]. The pseudo-likelihood is simply the product of the full conditionals, each given by Eq. (2), over the complete image Ω:

PL(Y = y, X = x | Ψ) = Π_{s∈Ω} (1/Z(φ_{x_s})) exp{−U_1(Y_s = y_s | y_{ρ_s}, X_s = x_s, φ_{x_s})} × exp{−Σ_{s∈Ω} U_2(X_s = x_s | x_{η_s}, γ)} / Π_{s∈Ω} Σ_{c∈Λ} exp{−U_2(X_s = c | x_{η_s}, γ)},   (3)

where Z(φ_c) is the normalising constant of the likelihood distribution for the observed pixel value given its neighbourhood and label. By applying Bayes' law, an approximation to the posterior distribution for the MRF image model may be formed from the pseudo-likelihood. To make this a function of the model order (or number of label classes) k, proper priors must be defined for all the model parameters. The distribution will then be given by

p(X = x, Ψ, k | Y = y) ∝ PL(Y = y, X = x | Ψ) p_r(k) p_r(γ) Π_{c=0}^{k−1} p_r(φ_c),   (4)

where the p_r(·) are the prior distributions. It is possible to incorporate various information criteria [8,9] into the posterior distribution by adding compensatory terms to the prior on the model order k. The Isotropic and Gaussian MRF models used as the basis for the segmentation algorithms throughout the remainder of this paper both take Potts models as their prior distribution on the label field X. The differences between the two models occur in their modelling of the likelihood of observing pixel grayscale values given the label field. The principal difference comprises the lack of conditioning on neighbouring pixel grayscale values in the Isotropic model. The two models are described in more detail in the following two subsections.

2.1. The Isotropic Markov random field model

The Isotropic MRF model is used to model an image consisting of regions of constant but different grayscales, corrupted by an i.i.d. noise process. For each pixel, the likelihood of its grayscale value given its underlying label is given by a Gaussian distribution whose parameters depend on the label class. Hence, the grayscale values of the pixels comprising a region labelled as a single class c may be considered a realisation of an i.i.d. Gaussian noise process whose parameter vector is given by φ_c = [μ_c, σ_c]. The Potts model chosen to model a priori knowledge of spatial correlations within the label field incorporates potential functions defined using both singleton and pairwise cliques on a nearest-neighbour type neighbourhood. If the hyper-parameter vector is γ = [α_c; {c ∈ Λ}, β], then the approximation to the posterior density given in Eq. (4) may be written

p(X = x, Ψ, k | Y = y) ∝ Π_{s∈Ω} (1/√(2πσ²_{x_s})) exp{−(y_s − μ_{x_s})² / (2σ²_{x_s})} × exp{−Σ_{s∈Ω} (α_{x_s} + β Σ_{s+q∈η_s} δ(x_s ≠ x_{s+q}))} …
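Although the paper gives no code, sampling a single site's label from the full conditional implied by the isotropic model above (Gaussian likelihood per class, Potts prior with singleton parameters α_c and pairwise weight β) can be sketched as follows; the function names and the 4-neighbourhood are our illustrative assumptions, not the authors':

```python
import numpy as np

def sample_label(y_s, neigh_labels, mu, sigma, alpha, beta, rng):
    """Draw x_s from p(x_s | y_s, x_{eta_s}): Gaussian likelihood for each
    class c times the Potts prior factor exp{-(alpha_c + beta * #disagree)}."""
    k = len(mu)
    log_p = np.empty(k)
    for c in range(k):
        log_lik = (-0.5 * np.log(2.0 * np.pi * sigma[c] ** 2)
                   - (y_s - mu[c]) ** 2 / (2.0 * sigma[c] ** 2))
        log_prior = -alpha[c] - beta * np.sum(neigh_labels != c)
        log_p[c] = log_lik + log_prior
    p = np.exp(log_p - log_p.max())     # stabilise before normalising
    p /= p.sum()
    return rng.choice(k, p=p)
```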
where e_s is a zero-mean Gaussian noise process with autocorrelation given by

E[e_s e_{s+q}] =
  σ²_c,           q = 0,
  −h_c(q) σ²_c,   s + q ∈ ρ_s,
  0,              otherwise.   (8)
It is possible, with no loss of generality, to halve the number of correlation parameters by making h_c(q) = h_c(−q). Letting Θ_c comprise the vector of correlation parameters for the GMRF labelled as c, the conditional distribution for a single pixel may be written

p(y_s | μ_c, σ_c, Θ_c) = (1/√(2πσ²_c)) exp{ −(1/(2σ²_c)) [ (y_s − μ_c) − Σ_{q: s+q∈ρ_s} h_c(q) ((y_{s+q} − μ_c) + (y_{s−q} − μ_c)) ]² }.   (9)

The interaction between regions is described by a Potts model. As with the Isotropic MRF, this models a priori knowledge of spatial correlations within the label field. The Potts model chosen here is identical to that used in the Isotropic case (given in the previous section) except that there are no singleton clique parameters. These are omitted to reduce the complexity of the model order sampling step (see Section 3.2) and to weaken the prior on X. Hence, the hyper-parameter vector simply consists of one term, β. The posterior distribution for the complete hierarchical image model may now be approximated using the expression given in Eq. (4) and the conditional distributions given by Eq. (9):

p(X = x, Ψ, k | Y = y) ∝ Π_{s∈Ω} p(y_s | μ_{x_s}, σ_{x_s}, Θ_{x_s}) exp{−Σ β …}

consists of applying a filter whose response y_a ∈ {1, …, 10} is large when the image near arc a is road-like. (An arc a is considered road-like if the intensity variation of pixels along the arc is smaller than the intensity differences between pixels perpendicular to the arc.) The distribution of the tests {Y_a} (regarded as random variables) depends only on whether or not
Fig. 4. A variation of Geman and Jedynak's tree structure with a different branching pattern. The prior probabilities may express a preference for certain paths, such as those which are straight.
the arc a lies on the road candidate X, and the tests are assumed to be conditionally independent given X. Thus the probabilities can be specified by
P(Y_a | X) =
  p_1(Y_a)   if a lies on X,
  p_0(Y_a)   otherwise.   (2)
The probability distributions p_1(·) and p_0(·) are determined by experiment (i.e. by running the tests on and off the road to gather statistics). These distributions overlap; otherwise the tests would give unambiguous results (i.e. "road" or "not-road") and the road could be found directly. The theoretical results we obtain are independent of the precise nature of the tests, and indeed the algorithm can be generalized to consider a larger class of tests, but this will not be done in this paper. The true road may be determined by finding the MAP estimate of P(X | all tests). However, there is an important practical difficulty in finding the MAP: the number of possible candidates to search over is 3^L, an enormous number, and the number of possible tests is even larger (of course, these numbers ignore the fact that some of the paths will extend outside the domain of the image and hence can be ignored; but, even so, the number of possible paths is exorbitant). To circumvent this problem, Geman and Jedynak propose the twenty questions algorithm, which uses an intelligent testing rule to select the most informative test at each iteration. They introduce the concept of partial paths and show that it is only necessary to calculate the probabilities of these partial paths rather than those of all possible road hypotheses. They define the set C_a to consist of all paths which pass through arc a. Observe, see Fig. 8, that this condition specifies a unique path from the root arc to a. Thus {X ∈ C_a} can be thought of as the set of all possible extensions of this partial path. Their algorithm only needs to store the probabilities of certain partial paths, z_a = P(X ∈ C_a | test results), rather than the probabilities of all the 3^L possible road paths. Geman and Jedynak describe rules for updating these probabilities z_a but, in fact, the relevant probabilities can be calculated directly (see next section). It should be emphasized that calculating these probabilities would be significantly more difficult for general graph structures, where the presence of closed loops introduces difficulties which require algorithms like dynamic programming to overcome [28]. The testing rule is the following: after having performed tests Y_{n_1} through Y_{n_k}, choose the next test Y_{n_{k+1}} = Y_c so as to minimize the conditional entropy H(X | b_k, Y_c) given by:
H(X | b_k, Y_c) = −Σ_{y_c} P(Y_c = y_c | b_k) Σ_x P(X | b_k, Y_c = y_c) log P(X | b_k, Y_c = y_c),   (3)
where b_k = {y_{n_1}, …, y_{n_k}} is the set of test results from steps 1 through k (we use capitals to denote random variables and lower case to denote numbers, such as the results of tests). The conditional entropy criterion causes tests to be chosen which are expected to maximally decrease the uncertainty of the distribution P(X | b_{k+1}). We also point out that their strategy for choosing tests has already been used in Bayes Nets [24]. Geman and Jedynak state that there is a relationship to Bayes Nets [17] but they do not make it explicit. This relationship can be seen from the following theorem.

Theorem 1. The test which minimizes the conditional entropy is the same test that maximizes the mutual information between the test and the road conditioned on the results of the preceding tests. More precisely, arg min_c H(X | b_k, Y_c) = arg max_c I(Y_c; X | b_k).

Proof. This result follows directly from standard identities in information theory [18]:

I(Y_a; X | b_k) = H(X | b_k) − H(X | b_k, Y_a) = H(Y_a | b_k) − H(Y_a | X, b_k).  □   (4)

This maximizing mutual information approach is precisely the focus-of-attention strategy used in Bayes Nets [24], see Fig. 5. It has proven an effective strategy in medical probabilistic expert systems, for example, where it can be used to determine which diagnostic test a doctor should perform in order to gain most information about a possible disease [28]. Therefore, the twenty questions algorithm can be considered a special case of this strategy. Focus of attention, however, is typically applied
to problems involving graphs with closed loops, and hence it is difficult to update probabilities after a question has been asked (a test has been performed). Moreover, on graphs it is both difficult to evaluate the mutual information and to determine which, of many possible, tests will maximize the mutual information with the desired hypothesis state X. By contrast, Geman and Jedynak are able to specify simple rules for deciding which tests to perform. This is because: (i) their tests, Eq. (2), are simpler than those typically used in Bayes Nets, and (ii) their tree structure (i.e. no closed loops) makes it easy to perform certain computations. The following theorem, which is stated and proven in their paper, simplifies the problem of selecting which test to perform. As we will show later, this result is also important for showing the relationship of twenty questions to A+. The theorem is valid for any graph (even with closed loops) and for arbitrary prior probabilities. It relies only on the form of the tests specified in Eq. (2). The key point is the assumption that roads either contain the arc which is being tested or they do not.

Theorem 2. The test Y_c which minimizes the conditional entropy is the test which minimizes a convex function φ(z_c), where φ(z) = H(p_1) z + H(p_0)(1 − z) − H(z p_1 + (1 − z) p_0).

Proof. From the information theory identities given in Eq. (4) it follows that minimizing H(X | b_k, Y_c) with respect to c is equivalent to minimizing H(Y_c | X, b_k) − H(Y_c | b_k). Using the facts that P(Y_c | X, b_k) = P(Y_c | X), z_c = P(X ∈ C_c | b_k), P(Y_c | b_k) = Σ_X P(Y_c | X) P(X | b_k) = p_1(Y_c) z_c + p_0(Y_c)(1 − z_c), where P(Y_c | X) = p_1(Y_c) if arc c lies on X and
Fig. 5. A Bayes Net is a directed graph with probabilities. This can be illustrated by a game show where the goal is to discover the job of a participant. In this case the jobs are "unemployed", "Harvard professor" and "Mafia Boss". The players are not allowed direct questions, but they can ask about causal factors (e.g. "bad luck" or "ambition") or about symptoms ("heart attack", "eating disorder", "big ego"). The focus-of-attention strategy is to ask the questions that convey the most information. Determining such questions is straightforward in principle, if the structure of the graph and all the probabilities are known, but may require exorbitant computational time if the network is large.
P(Y_c | X) = p_0(Y_c) otherwise, we find that

H(Y_c | X, b_k) = Σ_X P(X | b_k) {−Σ_{Y_c} P(Y_c | X) log P(Y_c | X)} = z_c H(p_1) + (1 − z_c) H(p_0),
H(Y_c | b_k) = H(z_c p_1 + (1 − z_c) p_0).   (5)

The main result follows directly. The convexity can be verified directly by showing that the second-order derivative is positive.  □

For the tests chosen by Geman and Jedynak it can be determined that φ(z) has a unique minimum at z̄ ≈ 0.51. For the game of twenty questions, where the tests give unambiguous results, it can be shown that the minimum occurs at z̄ = 0.5. (In this case the tests will obey p_1(Y_c = y_c) p_0(Y_c = y_c) = 0 for all y_c, and this enforces H(z_c p_1 + (1 − z_c) p_0) = z_c H(p_1) + (1 − z_c) H(p_0) − z_c log z_c − (1 − z_c) log(1 − z_c), and so φ(z) = z log z + (1 − z) log(1 − z), which is convex with minimum at z = 0.5.) Thus the minimal entropy criterion says that we should test the next untested arc which minimizes φ(z_c). By the nature of the tree structure and the prior, there can be very few (and typically no) untested arcs with z_c > z̄, and most untested arcs will satisfy z_c ≤ z̄. Restricting ourselves to this subset, we see that the convexity of φ(·), see Fig. 6, means that we need only find an arc c for which z_c is as close to z̄ as possible. It is straightforward to show that most untested arcs, particularly distant descendants of the tested arcs, will have probabilities far less than z̄ and so do not even need to be tested (each three-way split in the tree introduces a prior factor of 1/3 which
Fig. 6. Test selection for twenty questions is determined by the φ(z) function. This is convex with a minimum at z̄. Most untested arcs a will have probabilities z_a less than z̄, and twenty questions will prefer to explore the most probable of these paths. It is conceivable that a few untested arcs have probability greater than z̄; in this case they may or may not be tested. The exact form of the φ(·) function depends on specific details of the problem.
multiplies the probabilities of the descendant arcs, so the probabilities of descendants decay exponentially with distance from a tested arc). It is therefore simple to minimize φ(z_c) over all arcs such that z_c ≤ z̄, and then we need simply compare this minimum to the values for the few, if any, special arcs for which z_c > z̄. This, see [17], allows one to quickly determine the best test to perform. Observe that, because the prior is uniform, there may often be two or three arcs which have the same probability. To see this, consider deciding which arc to test when starting from the root node: all three arcs will be equally likely. It is not stated in [17] what their algorithm does in this case, but we assume that, in the event of a tie, the algorithm picks one winner at random.
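The selection rule of Theorem 2 is easy to state in code. The following sketch (ours) evaluates φ(z) for candidate arcs and picks the minimiser; the overlapping test distributions over ten responses are hypothetical:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def phi(z, p1, p0):
    """phi(z) = H(p1) z + H(p0)(1 - z) - H(z p1 + (1 - z) p0), as in Theorem 2."""
    p1 = np.asarray(p1, float); p0 = np.asarray(p0, float)
    return entropy(p1) * z + entropy(p0) * (1 - z) - entropy(z * p1 + (1 - z) * p0)

# Hypothetical test statistics: responses 1..10 with overlapping laws.
p1 = np.arange(1.0, 11.0); p1 /= p1.sum()
p0 = p1[::-1].copy()
z_candidates = {"a": 0.30, "b": 0.45, "c": 0.10}   # partial-path probabilities
best = min(z_candidates, key=lambda c: phi(z_candidates[c], p1, p0))
print(best)   # the arc whose z_c gives the smallest phi, i.e. closest to z-bar
```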
5. Twenty questions, A+ and A*

In this section we define an algorithm, which we call A+, which simply consists of testing the most probable untested arc. We show that this is usually equivalent to twenty questions. Then we show that A+ can be re-expressed as a variant of A*. The only difference between A* and A+ is that A+ (and twenty questions) makes use of prior expectations in an attempt to speed up the search. (Both A+ and twenty questions are formulated with prior probabilities which can be used to make these predictions.) The difference in search strategies can be thought of, metaphorically, as the distinction between eugenics and breeding like rabbits. A* proceeds by selecting the graph node which has greatest total cost (cumulative and heuristic) and then expands all the children of this node. This is the rabbit strategy. By contrast, A+ selects the best graph node and then expands only the best predicted child node. This is reminiscent of eugenics. The twenty questions algorithm occasionally goes one stage further and expands a grandchild of the best node (i.e. completely skipping the child nodes). In general, if the prior probabilities for the problem are known to be highly non-uniform, then the eugenic strategy will on average be more efficient than the rabbit strategy. The algorithm A+ is based on the same model and the same array of tests used in Geman and Jedynak's work. What is different is the rule for selecting the most promising arc c on which to perform the next test Y_c. The arc c that is chosen is the arc with the highest probability z_c that satisfies two requirements: test Y_c must not have been performed previously, and c must be the child of a previously tested arc. For twenty questions, the best test will typically be the child of a tested arc, though occasionally, as we will describe later, it might be a grandchild or some other descendant.

Theorem 3. A+ and twenty questions will test the same arc provided z_c ≤ z̄ for all untested arcs c. Moreover, the
only cases when the algorithms will differ are when A+ chooses to test an arc both of whose siblings have already been tested.

Proof. The first part of this result follows directly from Theorem 2: φ(z) is convex with minimum at z̄, so, provided z_c ≤ z̄ for all untested c, the most probable untested arc is the one that minimizes the conditional entropy, see Fig. 6. The second part is illustrated in Fig. 7. Let c be the arc that A+ prefers to test. Since A+ only considers an arc c that is the child of previously tested arcs, there are only three cases to consider: when none, one, or two of c's siblings have been previously tested. In the first two cases, when none or one of c's siblings has been tested, the probability z_c is bounded by z_c < z̄. Clearly, since c is the arc with the maximum probability, no other arc can have a probability closer to z̄; thus arc c minimizes φ(z_c) and both algorithms are consistent. In the third case, however, when both of c's siblings have been tested, it is possible for z_c to be larger than z̄. In this case it is possible that other arcs with smaller probabilities would lower φ more than φ(z_c). For example, if φ(z_c/3) < φ(z_c), then the twenty questions
Fig. 7. The three possible cases for A+'s preferred arc a, where dashed lines represent tested arcs. In A, both of a's siblings have been tested; in this case the twenty questions algorithm might prefer testing one of a's three children or some other arc elsewhere on the tree. In cases B and C, at most one of a's siblings has been tested, and so twenty questions and A+ agree.
algorithm would prefer any of c's (untested) children, having probability z_c/3, to c itself. But conceivably there may be another untested arc elsewhere with probability higher than z_c/3, and lower than z̄, which twenty questions might prefer.  □

Thus the only difference between the algorithms may occur when the previous tests have established c's membership of the road with such high certainty that the conditional entropy principle considers it unnecessary to test c itself. In this case twenty questions may perform a "leap of faith" and test c's children, or it may test another arc elsewhere. If twenty questions chooses to test c's children, then this would make it potentially more efficient than A+, which would waste one test by testing c. But from the backtracking histogram in [17] it seems that testing children in this way never happened in their experiments. There may, however, have been cases when untested arcs were more probable than z̄ and the twenty questions algorithm tested other unrelated arcs. If this did indeed happen, and the structure of the problem might make this impossible, then it seems that twenty questions might be performing an irrelevant test. We expect therefore that A+ and twenty questions will usually pick the same test and so should have almost identical performance on the road tracking problem. This analysis can be generalized to alternative branching structures and prior probabilities. For example, for a binary tree we would expect that the twenty questions algorithm might often make leaps of faith and test grandchildren. Conversely, the larger the branching factor, the more similar A+ and twenty questions will become. In addition, a non-uniform prior might also make it advisable to test other descendants. Of course, we can generalize A+ to allow it to skip children too, if the children have high probability of being on the path. But we will not do this here because, as we will see, such a generalization would reduce the similarity of A+ with A*. Our next theorem shows that we can give an analytic expression for the probabilities of the partial paths. Recall that these are the probabilities z_a that the road goes through a particular tested arc a, see Fig. 8. (Geman and Jedynak give an iterative algorithm for calculating these probabilities.) This leads to a formulation of the A+ algorithm which makes it easy to relate to A*. The result holds for arbitrary branching and priors.

Theorem 4. The probabilities z_a = P(X ∈ C_a | y_1, …, y_M) of partial paths to an untested arc a, whose parent arc has been tested, can be expressed as

P(X ∈ C_a | y_1, …, y_M) = (1/Z_M) Π_{j=1}^{M_a} [p_1(y_{a_j}) / p_0(y_{a_j})] t(a_j, a_{j−1}),   (6)
where A_a = {a_j : j = 1, …, M_a} is the set of (tested) arcs lying on the path to a, see Fig. 8, and t(a_i, a_{i−1}) is the prior
probability of arc a_i following arc a_{i−1} (a_0 is the initialization arc).

Fig. 8. For any untested arc a, there is a unique path a_1, a_2, … linking it to the root arc. As before, dashed lines indicate arcs that have been tested.

Proof. Suppose a is an arc which has not yet been tested but which is a child of one that has. Assume we have test results (y_1, …, y_M); then there must be a unique subset A_a = {a_1, …, a_{M_a}} of tests which explore all the arcs from the starting point to arc a, see Fig. 8. The probability that the path goes through arc a is given by

P(X ∈ C_a | y_1, …, y_M) = Σ_{X∈C_a} P(X | y_1, …, y_M) = Σ_{X∈C_a} P(y_1, …, y_M | X) P(X) / P(y_1, …, y_M).   (7)

The factor P(y_1, …, y_M) is independent of a, and so we can remove it (we will only be concerned with the relative values of different probabilities and not their absolute values). Recall that the tests are independent and that if arc i lies on, or off, the road then a test result y_i is produced with probability p_1(y_i) or p_0(y_i), respectively. We obtain:

P(X ∈ C_a | y_1, …, y_M) ∝ Σ_{X∈C_a} P(X) { Π_{i=1,…,M: X∈C_i∩C_a} p_1(y_i) } { Π_{i=1,…,M: X∉C_i∩C_a} p_0(y_i) }
  = Σ_{X∈C_a} P(X) { Π_{i=1,…,M: X∈C_i∩C_a} p_1(y_i)/p_0(y_i) } { Π_{i=1,…,M} p_0(y_i) },   (8)

where the notation X ∈ C_i ∩ C_a means the set of all roads which contain the (tested) arc i and the arc a. The final factor Π_i p_0(y_i) can be ignored, since it is also independent of a. Now suppose none of arc a's children have been tested. Then, since the sum in Eq. (8) is over all paths which go through arc a, the set of arcs i : X ∈ C_i on the road X for which tests are performed must be precisely those in the unique subset A_a going from the starting point to arc a. More precisely, {i = 1, …, M : X ∈ C_i ∩ C_a} = A_a. Therefore:

Π_{i=1,…,M: X∈C_i∩C_a} p_1(y_i)/p_0(y_i) = Π_{i∈A_a} p_1(y_i)/p_0(y_i) = Π_{j=1}^{M_a} p_1(y_{a_j})/p_0(y_{a_j}).   (9)

Now Σ_{X∈C_a} P(X) is simply the prior probability that the path goes through arc a. We can denote it by P_a. Because of the tree structure, it can be written as P_a = Π_{i=1}^{M_a} t(a_i, a_{i−1}), where t(a_i, a_{i−1}) is the prior probability that the road takes the child arc a_i given that it has reached its parent a_{i−1}. If all paths passing through a are equally likely (using Geman and Jedynak's prior on the ternary graph), then t(a_i, a_{i−1}) = 1/3 for all a_i and we have:

P_a ≜ Σ_{X∈C_a} P(X) = 3^{L−|A_a|−1} / 3^L,   (10)

where L is the total length of the road and |A_a| is the length of the partial path. Therefore, in the general case,

P(X ∈ C_a | y_1, …, y_M) = (1/Z_M) Π_{j=1}^{M_a} [p_1(y_{a_j})/p_0(y_{a_j})] t(a_j, a_{j−1}),   (11)

where Z_M is a normalization factor.  □
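Eq. (6) also suggests how z_a can be accumulated arc by arc; a small log-domain sketch (ours; the likelihood-ratio values are hypothetical):

```python
import numpy as np

def log_partial_path_score(llr_path, t_path):
    """Unnormalised log z_a of Eq. (6): sum of log likelihood ratios
    log(p1(y_aj)/p0(y_aj)) plus log prior transitions log t(a_j, a_{j-1});
    the log domain avoids underflow on long paths."""
    return float(np.sum(np.log(llr_path)) + np.sum(np.log(t_path)))

# Ternary tree with uniform prior: every transition contributes log(1/3).
llrs = [2.0, 1.5, 3.0]                     # hypothetical p1/p0 ratios
print(log_partial_path_score(llrs, [1.0 / 3.0] * len(llrs)))
```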
We now show that A+ is just a simple variant of A*. We first design an admissible A* algorithm using the smallest heuristic which guarantees admissibility. Then we show that A+ is a variant of A* with a smaller heuristic and hence is inadmissible. To adapt A* to the road tracking problem, we must convert the ternary road representation tree into a graph by introducing a terminal node to which all the
leaves of the tree are connected. We set the cost of getting to this terminal node from any of the leaves of the tree to be constant. Then deciding to go from one node to an adjacent node, and evaluating the cost, is equivalent to deciding to test the arc between these nodes and evaluating the test result. It follows directly from Theorem 4, or see [17], that the best road is the one which maximizes the log of Eq. (11):
E(X) = Σ_i { log [p_1(y_{a_i}) / p_0(y_{a_i})] + log t(a_i, a_{i−1}) },   (12)
where X = {a_1, …, a_L} and t(a_i, a_{i−1}) is the prior probability that arc a_i follows a_{i−1}. Observe that the Z_M factor from Eq. (11) is the same for all paths, so we have dropped it from the right-hand side of Eq. (12). By Theorem 4 again, the cost for a partial path of length M_a which terminates at arc a is given by
g(a) = Σ_{j=1}^{M_a} { log [p_1(y_{a_j}) / p_0(y_{a_j})] + log t(a_j, a_{j−1}) }.   (13)
To determine the smallest admissible heuristic for A*, we observe that the cost to the end of the road has a least upper bound of

h(a) = (L − M_a) {λ_0 + λ_p},   (14)
where λ_0 = max_y log{p_1(y)/p_0(y)} and λ_p = max log t(·, ·) over all possible prior branching factors in the tree. It is clearly possible to have paths which attain this upper bound, though, of course, they are highly unlikely because they require all the future path segments to have the maximal possible log-likelihood ratio; we will return to this issue in Section 6. The heuristic given by Eq. (14) is therefore the smallest possible admissible heuristic. We can therefore define the admissible A* algorithm with smallest heuristic to have a cost f given by
f(a) = g(a) + h(a) = Σ_{j=1}^{M_a} { log [p_1(y_{a_j}) / p_0(y_{a_j})] + log t(a_j, a_{j−1}) } + (L − M_a) {λ_0 + λ_p}.   (15)
Observe that we can obtain Dijkstra by rewriting Eq. (15) as

f = Σ_{j=1}^{M_a} ( log [p_1(y_{a_j}) / p_0(y_{a_j})] + log t(a_j, a_{j−1}) − λ_0 − λ_p ) + L {λ_0 + λ_p},   (16)
and, because the lengths of all roads are assumed to be the same, the final term L{λ_0 + λ_p} is a constant and can be ignored. This can be directly reformulated as Dijkstra with g = Σ_{j=1}^{M_a} (log [p_1(y_{a_j})/p_0(y_{a_j})] + log t(a_j, a_{j−1}) − λ_0 − λ_p) and h = 0. (Though, strictly speaking, the term Dijkstra
only applies if the terms in the sum are all guaranteed to be non-negative. Only with this additional condition is Dijkstra guaranteed to converge.) The size of λ_0 + λ_p has a big influence in determining the order in which paths get searched by the admissible A* algorithm. The bigger λ_0 + λ_p, the bigger the overestimated cost, see Eq. (14). The larger the overestimate, the more the admissible A* will prefer to explore paths with a small number of tested arcs (because these paths will have overgenerous estimates of their future costs). This induces a breadth-first search strategy [26] and may slow down the search. We now compare A+ to A*. The result is summarized in Theorem 5.

Theorem 5. A+ is an inadmissible variant of A*.

Proof. From Theorem 4 we see that A+ picks the path which maximizes Eq. (13). In other words, it uses the g(a) part of A* but has no overestimate term; it sets h(a) = 0 by default. There is no reason to believe that this is an overestimate for the remaining part of the path. Of course, if λ_0 + λ_p ≤ 0 then h(a) = 0 would be an acceptable overestimate. For the special case considered by Geman and Jedynak this would require that max_y log{p_1(y)/p_0(y)} − log 3 ≤ 0. For their probability distributions it seems that this condition is not satisfied, see Fig. 5 in Ref. [17].  □

This means that A+, and hence twenty questions, are typically suboptimal and are not guaranteed to converge to the optimal solution. On the other hand, the admissible A*'s use of upper bounds means that it prefers paths with few arcs to those with many, so it may waste time by exploring in breadth. We will return to these issues in the next section.
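The contrast between the two scores can be written down directly; a small sketch (ours, following Eqs. (13)–(15), with self-explanatory argument names):

```python
def f_astar(g_a, M_a, L, lam0, lamp):
    """Admissible A* score of Eq. (15): the achieved partial-path score g(a)
    plus the least upper bound (L - M_a)(lambda_0 + lambda_p) on the rest."""
    return g_a + (L - M_a) * (lam0 + lamp)

def f_aplus(g_a, M_a, L, lam0, lamp):
    """A+ drops the heuristic entirely (h = 0), which in general is not an
    overestimate of the remaining score and is hence inadmissible."""
    return g_a
```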
6. Heuristics

So far we have shown that Dynamic Programming, Dijkstra and twenty questions are either exact or approximate versions of A*. The only difference lies in their choice of heuristic and their breeding strategies (rabbits or eugenics). Both DP and Dijkstra choose admissible heuristics and are therefore guaranteed to find the optimal solution. The downside is that they are rather conservative and hence may be slower than necessary. By contrast, twenty questions is closely related to A+, which uses an inadmissible heuristic. It is not necessarily guaranteed to converge, but empirically it is very fast and finds the optimal solution in linear time (in terms of the solution length). The choice of heuristics is clearly very important. What principles can be used to determine good heuristics for a specific problem domain?
Fig. 9. In the example from Pearl, the goal is to find the shortest path in the lattice from (0, 0) to (m, n). The choice of heuristic will greatly influence the search strategy.
An example from Pearl [21] illustrates how different admissible heuristics can affect the speed of search. Pearl's example is formulated in terms of finding the shortest path in a two-dimensional lattice from position (0, 0) to the point (m, n), see Fig. 9. The first algorithm Pearl considers is Dijkstra, so it has heuristic h(·, ·) = 0, which is admissible since the cost of all path segments is positive. It is straightforward to show that this requires us to expand Z_a(m, n) = 2(m + n)² nodes before we reach the target (by the nature of A* we must expand all nodes whose cost is less than, or equal to, the cost of the goal node). The second algorithm uses a shortest-distance heuristic h(x, y) = √((x − m)² + (y − n)²), which is also admissible. In this case the number of nodes expanded, Z_b(m, n), can be calculated. The expression is complex, so we do not write it down. The bottom line, however, is that the ratio Z_b(m, n)/Z_a(m, n) is always less than 1, and so the shortest-distance heuristic is preferable to Dijkstra for this problem. The maximum value of the ratio, approximately 0.18, is obtained when n = m. Its minimum value occurs when n = 0 (or equivalently when m = 0) and is given by 1/(2m), which tends to zero for large m. Thus choosing one admissible heuristic in preference to another can yield significant speed-ups without sacrificing optimality. It has long been known [21] that inadmissible algorithms, which use probabilistic knowledge of the domain, can converge in linear expected time to close to the optimal solution, even when admissible algorithms will take provably exponential time. The problems we are interested in solving are already formulated as Bayesian estimation problems, which means that probabilistic knowledge of the domain is already available. How can it be exploited?
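Pearl's comparison is easy to reproduce numerically; the following sketch (ours; the lattice is restricted to the nonnegative quadrant, and exact counts depend on tie-breaking) counts node expansions under the two heuristics:

```python
import heapq, math

def astar_expansions(m, n, h):
    """Run A* from (0,0) to (m,n) with unit-cost 4-neighbour moves on the
    nonnegative quadrant; return how many nodes are popped before the goal."""
    start, goal = (0, 0), (m, n)
    g = {start: 0}
    pq = [(h(*start), start)]
    closed = set()
    while pq:
        _, u = heapq.heappop(pq)
        if u in closed:
            continue
        closed.add(u)
        if u == goal:
            return len(closed)
        x, y = u
        for v in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if min(v) < 0:
                continue
            if g[u] + 1 < g.get(v, float("inf")):
                g[v] = g[u] + 1
                heapq.heappush(pq, (g[v] + h(*v), v))
    return len(closed)

m = n = 20
Za = astar_expansions(m, n, lambda x, y: 0)                        # Dijkstra
Zb = astar_expansions(m, n, lambda x, y: math.hypot(x - m, y - n))
print(Za, Zb, Zb / Za)   # the ratio stays well below 1
```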
Consider, for example, the admissible A* algorithm which we defined for the road tracking problem, see Section 5. Let us consider the special case when the branching factor is always three and all paths are equally likely. Then we demonstrated that the smallest admissible heuristic is h(a) = (L − M_a) max_y log{p_1(y)/p_0(y)}. This heuristic, however, is really a worst-case bound, because the chances of getting such a result if we measure the response on the true path are very small. In fact, if we assume that the image measurements are independent (as our Bayesian model does, see Section 2), then the law of large numbers says the total response of all test results along the true path should be close to h_p(a) = (L − M_a) Σ_y p_1(y) log{p_1(y)/p_0(y)}. This average bound will typically be considerably smaller than the worst-case bound used to determine h(a) above. We would therefore expect that, in general, the average-case heuristic h_p(a) will be far quicker than the worst-case heuristic h(a) and should usually lead to equally good results. (Of course, the average-case heuristic will become poor as we approach the end point, where L − M_a is small, and so we will have to replace it by the worst-case heuristic in such situations.) Our recent work [22,23] has made these intuitions precise by exploiting results from the theory of types, see [18], which quantify how rapidly the law of large numbers starts being effective. For example, Sanov's theorem can be used to determine the probability that the average cost for a set of n samples from the true road differs from the expected average cost Σ_y p_1(y) log{p_1(y)/p_0(y)}. The theorem shows that the probability of any difference will decrease exponentially with the number of samples n. Conversely, we can ask with what probability we will get an average cost close to Σ_y p_1(y) log{p_1(y)/p_0(y)} from a set of n samples not from the true path (i.e. the probability that the algorithm will be fooled into following a false path). Again, it follows from Sanov's theorem that the probability of this happening decreases exponentially with n, where the coefficient in the exponent is the Kullback–Leibler distance between p_1(y) and p_0(y). Our papers [22,23] prove expected convergence time bounds for optimization problems of the type we have analyzed in this paper. For example, in [23] we prove that inadmissible heuristic algorithms can be expected to solve these optimization problems (with a quantified expected error) while examining a number of nodes which varies linearly with the size of the problem. (The expected sorting time per node is shown to be constant.) Moreover, the difficulty of the problem is determined by an order parameter which is specified in terms of the characteristics of the domain (the distributions P_on, P_off, P(X) and the branching factor of the search tree). At critical values of the order parameter the problem undergoes a phase transition and becomes impossible to solve by any algorithm.
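The gap between the two heuristics is easy to quantify for any concrete pair of distributions. The snippet below is illustrative only; the discretized filter-response distributions p1 and p0 are made up, not those of Ref. [17]. It compares the worst-case per-arc term max_y log{p_1(y)/p_0(y)} with the expected per-arc term Σ_y p_1(y) log{p_1(y)/p_0(y)}, which is the Kullback–Leibler divergence D(p_1‖p_0).

    import math

    # Hypothetical discretized "on-road" and "off-road" response distributions.
    p1 = [0.05, 0.10, 0.25, 0.60]   # p_on: high responses more likely on the road
    p0 = [0.40, 0.30, 0.20, 0.10]   # p_off

    log_ratios = [math.log(a / b) for a, b in zip(p1, p0)]
    worst_case = max(log_ratios)                          # per-arc worst-case bound
    average = sum(a * r for a, r in zip(p1, log_ratios))  # KL divergence D(p1 || p0)

    print(f"worst-case per-arc term: {worst_case:.3f}")
    print(f"expected per-arc term (KL): {average:.3f}")
    # With L - M_a arcs remaining, h(a) scales the worst-case term and
    # h_p(a) the expected term, so h_p(a) is proportionally tighter.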
7. Summary

In summary, we argue that A* and heuristic algorithms [21] give a good framework to compare and evaluate different optimization algorithms for deformable templates. We describe how both Dijkstra and Dynamic Programming can be expressed in terms of A* [21,26]. We then prove a close relationship between the twenty questions algorithm [17] and a novel algorithm which we call A+. In turn, we prove that A+ is an inadmissible variant of A*. We note that both A+ and twenty questions, unlike A* and Dijkstra, maintain explicit probabilities of partial solutions, which allows them to keep track of how well the algorithm is doing and to warn of faulty convergence. In addition, their explicit use of prior knowledge allows them to improve their search strategy (in general) by making predictions. However, both A* and Dijkstra are designed to work on graphs, which include closed loops, and it may be difficult to extend twenty questions and A+ to such representations.

From the A* perspective, the role of heuristics is very important. Most algorithms, implicitly or explicitly, make use of heuristics, the choice of which can have a big effect on the speed and effectiveness of the algorithm. It is therefore important to specify them explicitly and analyze their effectiveness. For example, it appears that probabilistic knowledge of the problem domain can lead to heuristics, adapted to the domain, which are provably very effective [21–23]. By contrast, algorithms such as Dijkstra, which have no explicit heuristics, have no mechanism for adapting the algorithm to a specific domain. For example, Dijkstra's algorithm applied to detecting visual shapes is very effective at low noise levels but can break down (in terms of memory and time) at high noise levels (Geiger, private communication). Characterizing the noise probabilistically and using this to guide the choice of heuristic can lead to more efficient algorithms, see [25].
Acknowledgements

This report was inspired by an interesting talk by Donald Geman and by David Mumford wondering what relation there was between the twenty questions algorithm and A*. One of us (ALY) is grateful to Davi Geiger for discussions of the Dijkstra algorithm and to M.I. Miller and N. Khaneja for useful discussions about the results in this paper and their work on Dynamic Programming. This work was partially supported by NSF Grant IRI 92-23676 and the Center for Imaging Science funded by ARO DAAH049510494. ALY was employed by Harvard while some of this work was being done and would like to thank Harvard University for many character-building experiences. He would also like to thank Profs. Irwin King and Lei Xu for their hospitality at the Engineering Department of the Chinese University of Hong Kong and, in particular, Lei Xu for mentioning Pearl's book on heuristics and for the use of his copy.
References

[1] M.A. Fischler, R.A. Erschlager, The representation and matching of pictorial structures, IEEE Trans. Comput. C-22 (1973) 67–92.
[2] U. Grenander, Y. Chow, D.M. Keenan, Hands: a Pattern Theoretic Study of Biological Shapes, Springer, Berlin, 1991.
[3] U. Grenander, M.I. Miller, Representation of knowledge in complex systems, J. Roy. Statist. Soc. 56 (4) (1994) 569–603.
[4] T.F. Cootes, C.J. Taylor, Active shape models – 'Smart Snakes', British Machine Vision Conference, Leeds, UK, September 1992, pp. 266–275.
[5] B.D. Ripley, Classification and clustering in spatial and image data, in: M. Schader (Ed.), Analyzing and Modeling Data and Knowledge, Springer, Berlin, 1992.
[6] L.H. Straib, J.S. Duncan, Parametrically deformable contour models, Proceedings of Computer Vision and Pattern Recognition, San Diego, CA, 1989, pp. 98–103.
[7] L. Wiscott, C. Von der Marlsburg, A neural system for the recognition of partially occluded objects in cluttered scenes, Neural Comput. 7 (4) (1993) 935–948.
[8] A.L. Yuille, Deformable templates for face recognition, J. Cognitive Neurosci. 3 (1) (1991) 59–70.
[9] R.E. Bellman, Applied Dynamic Programming, Princeton University Press, Princeton, NJ, 1962.
[10] U. Montanari, On the optimal detection of curves in noisy pictures, Comm. ACM 14 (5) (1971) 335–345.
[11] M. Barzohar, D.B. Cooper, Automatic finding of main roads in aerial images by using geometric–stochastic models and estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1993, pp. 459–464.
[12] D. Geiger, A. Gupta, L.A. Costa, J. Vlontzos, Dynamic programming for detecting, tracking and matching elastic contours, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-17 (1995) 294–302.
[13] J. Coughlan, Global optimization of a deformable hand template using dynamic programming, Harvard Robotics Laboratory, Technical Report 95-1, 1995.
[14] N. Khaneja, M.I. Miller, U. Grenander, Dynamic programming generation of geodesics and sulci on brain surfaces, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) 1260–1265.
[15] D. Bertsekas, Dynamic Programming and Optimal Control, vol. 1, Second Ed., Athena Scientific Press, 1995.
[16] D. Geiger, T.-L. Liu, Top-Down Recognition and Bottom-Up Integration for Recognizing Articulated Objects, Preprint, Courant Institute, New York University, 1996.
[17] D. Geman, B. Jedynak, An active testing model for tracking roads in satellite images, IEEE Trans. Pattern Anal. Mach. Intell. 18 (1) (1996) 1–14.
[18] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley Interscience Press, New York, 1991.
[19] L. Kontsevich, Private Communication, 1996.
[20] W. Richards, A. Bobick, Playing twenty questions with nature, in: Z. Pylyshyn (Ed.), Computational Processes in Human Vision: An Interdisciplinary Perspective, Ablex, Norwood, NJ, 1988.
[21] J. Pearl, Heuristics, Addison-Wesley, Reading, MA, 1984.
[22] A.L. Yuille, J.M. Coughlan, Convergence rates of algorithms for visual search: detecting visual contours, Proceedings NIPS'98, 1998.
[23] A.L. Yuille, J.M. Coughlan, Visual search: fundamental bounds, order parameters, phase transitions, and convergence rates, IEEE Trans. Pattern Anal. Mach. Intell. (1999), submitted for publication.
[24] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo, CA, 1988.
[25] J.M. Coughlan, A.L. Yuille, C. English, D. Snow, Efficient deformable template detection and localization without user initialization, CVIU (1998), submitted for publication.
[26] P.H. Winston, Artificial Intelligence, Addison-Wesley, Reading, MA, 1984.
[27] S. Russell, P. Norvig, Artificial Intelligence: a Modern Approach, Prentice-Hall, New York, 1995.
[28] S.L. Lauritzen, D.J. Spiegelhalter, Local computations with probabilities on graphical structures and their application to expert systems, J. Roy. Statist. Soc. B 50 (2) (1988) 157–224.
About the Author: ALAN YUILLE received his BA in Mathematics at the University of Cambridge in 1976. He completed his Ph.D. in Theoretical Physics at Cambridge in 1980 and worked as a postdoc in Physics at the University of Texas at Austin and the Institute for Theoretical Physics at Santa Barbara. From 1982 to 1986 he worked at the Artificial Intelligence Laboratory at MIT before joining the Division of Applied Sciences at Harvard from 1986 to 1995, rising to the rank of Associate Professor. In 1995 he joined the Smith-Kettlewell Eye Research Institute in San Francisco. His research interests are in mathematical modelling of artificial and biological vision. He has over one hundred peer-reviewed publications in vision, neural networks, and physics. He has co-authored a book with J.J. Clark ("Data Fusion for Sensory Information Processing Systems") and edited the book "Active Vision" with A. Blake.
About the Author: JAMES COUGHLAN received his BA in physics at Harvard in 1998. He is currently working as a postdoc with Alan Yuille at the Smith-Kettlewell Eye Research Institute in San Francisco. His research interests are in computer vision and the applications of Bayesian probability theory to artificial intelligence.
Pattern Recognition 33 (2000) 617–634
A theory of proximity based clustering: structure detection by optimization

Jan Puzicha^a,*, Thomas Hofmann^b, Joachim M. Buhmann^a

^a Institut für Informatik III, University of Bonn, Bonn, Germany
^b Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA

Received 15 March 1999

* Corresponding author. E-mail addresses: [email protected] (J. Puzicha), [email protected] (T. Hofmann).
Abstract

In this paper, a systematic optimization approach for clustering proximity or similarity data is developed. Starting from fundamental invariance and robustness properties, a set of axioms is proposed and discussed to distinguish different cluster compactness and separation criteria. The approach covers the case of sparse proximity matrices, and is extended to nested partitionings for hierarchical data clustering. To solve the associated optimization problems, a rigorous mathematical framework for deterministic annealing and mean-field approximation is presented. Efficient optimization heuristics are derived in a canonical way, which also clarifies the relation to stochastic optimization by Gibbs sampling. Similarity-based clustering techniques have a broad range of possible applications in computer vision, pattern recognition, and data analysis. As a major practical application we present a novel approach to the problem of unsupervised texture segmentation, which relies on statistical tests as a measure of homogeneity. The quality of the algorithms is empirically evaluated on a large collection of Brodatz-like micro-texture Mondrians and on a set of real-world images. To demonstrate the broad usefulness of the theory of proximity based clustering, the performances of different criteria and algorithms are compared on an information retrieval task for a document database. The superiority of optimization algorithms for clustering is supported by extensive experiments. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Clustering; Proximity data; Similarity; Deterministic annealing; Texture segmentation; Document retrieval
1. Introduction

Data clustering is one of the core methods for numerous tasks in pattern recognition, exploratory data analysis, computer vision, machine learning, data mining, and in many other related fields. In a rather informal sense, the goal of clustering is to partition a given set of data into homogeneous groups. Cluster homogeneity is thus the central notion which needs to be formalized in order to give data clustering a precise meaning. In this paper, we focus on homogeneity measures which are defined in terms of pairwise similarities or dissimilarities between data entities or objects. The underlying data is usually called similarity or proximity data [1]. In proximity data, the elementary measurements are comparisons between two objects of a given data set. This data format differs from vectorial data, where each measurement corresponds to a certain 'feature' evaluated at an external scale. Notice, however, that pairwise dissimilarities can be canonically generated from vectorial data whenever a distance function or metric is available. There exist numerous approaches to the data clustering problem, to mention only some of the most important: central clustering or vector quantization (e.g., the K-means algorithm [2]), linkage or agglomerative methods [3], mixture models [4], fuzzy clustering [5], and competitive learning [6,7]. These and other approaches offer a variety of clustering algorithms, and,
more fundamentally, differ with respect to the framework in which data clustering is formulated. Typically, there are limitations to which kind of data a method can be applied. It is a commonly expressed opinion [1] that partitional clustering methods based on explicit objective functions are only suitable for vectorial data. In addition, optimization methods are supposed to be inherently non-hierarchical. In this paper, a theory is outlined which advocates a formulation of proximity-based data clustering as a combinatorial optimization problem. The theory proposes solutions to the two fundamental problems: (i) the specification of suitable objective functions, and (ii) the derivation of efficient optimization algorithms. To address the modeling problem, an axiomatization of clustering objective functions based on fundamental invariance and robustness properties is presented (Section 2). As will be rigorously shown, the proposed axioms imply restrictions on both the way cluster homogeneities are calculated from pairwise dissimilarities and the final combination of contributions from different clusters. These ideas are extended to cover two generalizations: clustering with sparse or incomplete proximity data and hierarchical clustering. These extensions are indispensable for the analysis of large data sets, since it is in general prohibitive (and also unnecessary due to redundancy) to exhaustively perform all N² pairwise comparisons for N objects. Moreover, in large-scale applications group structure typically occurs at different resolution levels, which strongly favors hierarchical partitioning schemes. Once a suitable objective function has been identified, in principle any known optimization technique could be applied to find optimal solutions. Yet, the NP-hardness of most data partitioning problems renders the application of exact methods for large-scale problems intractable [8]. Therefore, heuristic optimization techniques are promising candidates to find at least approximate clustering solutions. In particular, stochastic optimization (Section 3) offers robustness and matches the peculiarities of data analysis problems [9]. Two closely related methods will be derived: a Monte Carlo algorithm known as the Gibbs sampler [10] and a deterministic variant known as mean-field annealing [11]. Both approaches rely on the introduction of a computational temperature, and offer a number of advantages: (i) they are general enough to cover all clustering objective functions, (ii) they yield scalable algorithms (in terms of the complexity-quality tradeoff), and (iii) the temperature defines a 'natural' resolution scale for hierarchical clustering problems [12]. The theory and the derived algorithms are tested and validated in two application areas: unsupervised segmentation of textured images [13,14] and information retrieval in document databases.
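As a rough illustration of the stochastic approach, the following is a minimal sketch, not the algorithm derived in Section 3: it resamples one cluster label at a time from the Boltzmann weights exp(−H/T) of a simple, illustrative intra-cluster cost. The toy data and the naive full recomputation of H are assumptions made for brevity.

    import numpy as np

    rng = np.random.default_rng(0)

    def intra_cluster_cost(labels, D):
        # Sum of pairwise dissimilarities within clusters (an illustrative H).
        same = labels[:, None] == labels[None, :]
        return D[same].sum()

    def gibbs_sweep(labels, D, K, T):
        """One Gibbs-sampling sweep at temperature T: each object's label is
        resampled from the Boltzmann distribution over the K clusters.
        (A practical implementation would update H incrementally.)"""
        for i in range(len(labels)):
            energies = np.empty(K)
            for l in range(K):
                labels[i] = l
                energies[l] = intra_cluster_cost(labels, D)
            w = np.exp(-(energies - energies.min()) / T)  # shift for stability
            labels[i] = rng.choice(K, p=w / w.sum())
        return labels

    # Toy run: squared distances between random points, annealed by lowering T.
    N, K = 30, 3
    X = rng.standard_normal((N, 2))
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    labels = rng.integers(0, K, N)
    for T in (5.0, 1.0, 0.2, 0.05):
        labels = gibbs_sweep(labels, D, K, T)
    print(labels)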
2. Axiomatization for clustering objective functions

Assume the data is given in the form of a proximity matrix D ∈ R^{N²} with entries D_ij quantifying the dissimilarity between objects o_i and o_j from a given domain of N objects. Furthermore, assume the number of clusters K to be fixed. A Boolean representation of data partitionings is introduced in terms of assignment matrices M ∈ M_{N,K}, where

    M_{N,K} = { M ∈ {0,1}^{N×K} : Σ_{l=1}^{K} M_il = 1, 1 ≤ i ≤ N }.    (1)
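In code, such an assignment matrix is simply a one-hot encoding of a label vector. The small sketch below (illustrative, numpy-based) constructs M from labels and verifies the completeness-and-uniqueness constraint of Eq. (1).

    import numpy as np

    def assignment_matrix(labels, K):
        """One-hot (Boolean) assignment matrix with M[i, l] = 1 iff
        object o_i belongs to cluster C_l, as in Eq. (1)."""
        M = np.zeros((len(labels), K), dtype=int)
        M[np.arange(len(labels)), labels] = 1
        assert (M.sum(axis=1) == 1).all()  # each object in exactly one cluster
        return M

    M = assignment_matrix(np.array([0, 2, 1, 0]), K=3)
    print(M)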
M_il is an indicator variable for an assignment of object o_i to cluster C_l; hence M_il = 1 if and only if object o_i belongs to cluster C_l. The assignment constraints in the definition of M_{N,K} assure that the data assignment is complete and unique.

2.1. Elementary axioms

General principles of proximity-based clustering are expressed by the following axioms:

Definition 1 (Clustering criterion). A cost function H_{N,K} : M_{N,K} × R^{N²} → R is a clustering criterion if the following set of axioms is fulfilled:

Axiom 1 (Permutation invariance). H_{N,K} is invariant with respect to permutations of (a) object indices and (b) label indices. More precisely, for all D ∈ R^{N²}, M ∈ M_{N,K}, permutations π over {1,…,N}, and π̄ over {1,…,K}:

    H_{N,K}(M; D) = H_{N,K}(M_π^π̄; D_π),

where A_π^π̄ is obtained from A by row permutation with π and column permutation with π̄.

Axiom 2 (Monotonicity). For all D ∈ R^{N²}, M ∈ M_{N,K}, Δd ∈ R⁺, and pairs of data indices (i, j):

    H_{N,K}(M; D) ≤ H_{N,K}(M; D^{ij})  if Σ_{l=1}^K M_il M_jl = 1,
    H_{N,K}(M; D) ≥ H_{N,K}(M; D^{ij})  if Σ_{l=1}^K M_il M_jl = 0,    (2)

where D^{ij} is obtained from D by the local modification D_ij → D_ij + Δd.

H_{N,K} is a strict clustering criterion if for each (i, j) at least one of the inequalities in (2) is strict.

Axiom 1 prevents the quality of a data partitioning from depending on additional information hidden in the data labels or cluster labels. Axiom 2 states that increasing the dissimilarity between objects in the same cluster can never be advantageous. The same is true for decreasing the dissimilarity between objects belonging to different clusters.

To further limit the functional dependency on proximities, an additivity axiom is introduced. Additivity reduces the noise sensitivity of a clustering criterion by averaging, as opposed to other approaches which completely discard the dissimilarity values and only keep their order relation (e.g., single linkage).

Definition 2 (Additivity). A clustering criterion H_{N,K} is additive if it has the following functional form:¹

    H_{N,K}(M; D) = Σ_{i=1}^N Σ_{j=1, j≠i}^N t_{N,K}(i, j, D_ij, M).

t_{N,K} : {1,…,N}² × R × M_{N,K} → R is called the contribution function. Furthermore, for M_ia = 1 and M_jl = 1, t_{N,K}(i, j, D_ij, M) does not depend on assignments to clusters k ≠ a, l.

The contribution function t_{N,K} of an additive clustering criterion is in fact further restricted by Axioms 1 and 2.

Proposition 1. Every additive clustering criterion can be rewritten as a combination of (D-monotone) intra- and inter-cluster contribution functions, t^(1) and t^(2) respectively,

    H_{N,K}(M; D) = Σ_{l=1}^K Σ_{i,j=1, j≠i}^N M_il [ M_jl t^(1)_{N,K}(D_ij, n_l) − Σ_{k=1, k≠l}^K M_jk t^(2)_{N,K}(D_ij, n_l, n_k) ].

Here n_l = Σ_{i=1}^N M_il denotes the size of cluster l.

Each additive clustering criterion can thus be linearly decomposed into one part measuring intra-cluster compactness (to be minimized) and a second part measuring inter-cluster separation (to be maximized). A proof of Proposition 1 is given in the Appendix.

2.2. Invariance axioms

While Axioms 1 and 2 ensure elementary requirements for a clustering criterion, and additivity narrows the focus to a particularly simple class of objective functions, the following invariance and robustness properties have to be considered as the core of the proposed axiomatization. Assuming N and K to be fixed, the explicit dependency in our notation is dropped whenever possible.

Definition 3 (Invariance). An objective function H is invariant with respect to linear data transformations if the following set of axioms is fulfilled:

Axiom 3 (Scale invariance). For all D ∈ R^{N²}, M ∈ M, c ∈ R⁺: H(M; cD) = cH(M; D).

Axiom 4 (Shift invariance). For all D ∈ R^{N²}, M ∈ M, Δd ∈ R: H(M; D + Δd) = H(M; D) + NΔd.
¹ Self-dissimilarities D_ii are excluded for simplicity. All objective functions can, however, be easily modified if diagonal contributions should be included.
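To make the invariance axioms concrete, here is a small numerical check (an illustrative sketch, not from the paper) using the graph-partitioning-style cost Σ_{i≠j} Σ_l M_il M_jl D_ij, i.e. the H^{gp} of Eq. (11) below, which satisfies permutation and scale invariance (Axioms 1 and 3) but fails shift invariance (Axiom 4):

    import numpy as np

    rng = np.random.default_rng(1)

    def H_gp(M, D):
        """Sum of intra-cluster dissimilarities, i != j (cf. Eq. (11))."""
        S = M @ M.T                 # S[i, j] = 1 iff i and j share a cluster
        np.fill_diagonal(S, 0)
        return (S * D).sum()

    N, K = 8, 3
    D = rng.random((N, N)); D = (D + D.T) / 2
    M = np.eye(K, dtype=int)[rng.integers(0, K, N)]

    pi, pibar = rng.permutation(N), rng.permutation(K)
    # Axiom 1: permuting object and label indices leaves H unchanged.
    assert np.isclose(H_gp(M, D), H_gp(M[pi][:, pibar], D[np.ix_(pi, pi)]))
    # Axiom 3: H(M; cD) = c * H(M; D).
    assert np.isclose(H_gp(M, 3.0 * D), 3.0 * H_gp(M, D))
    # Axiom 4 fails here: the offset depends on cluster sizes, not uniformly on N.
    print(H_gp(M, D + 0.5) - H_gp(M, D))  # generally != N * 0.5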
Axiom 3 ensures that rescaling the data solely rescales the cost function. A scale-invariant criterion has the advantage of not introducing an implicit bias towards a particular data scale. Axiom 4 is crucial for data that is only meaningful on an interval scale and not on an absolute or ratio scale. Invariant clustering criteria are thus non-committal with respect to scale and origin of the data, a property which is especially useful in applications where these quantities are not a priori known.²

For additive clustering criteria, scale invariance restricts t^(1) and t^(2) in Proposition 1 to a linear data dependency. The first argument of t^(1,2) is therefore dropped, with the understanding t^(1,2)(D_ij, ·) = D_ij t^(1,2)(·). The number of additive clustering criteria is further reduced by the shift invariance axiom. For cluster compactness measures the following result is obtained.

Proposition 2. For every invariant additive clustering criterion with t^(2) = 0 (intra-cluster compactness measure), the contribution function t^(1) can be written as

    t^(1)(n_l) = λ · 1/(n_l − 1) + (1 − λ) · 1/(n_l(n_l − 1)),    λ ∈ R.

The number of admissible weighting functions t^(2) is less significantly reduced by the invariance property. Therefore, a special class of contribution functions is considered which possess a natural decomposition.

Definition 4 (Decomposability). An inter-cluster contribution function t^(2) is decomposable if there exists a function f(n_l) such that either

    f(n_l) = Σ_{k≠l} n_k t^(2)(n_l, n_k)  or  f(n_l) = Σ_{k≠l} n_k t^(2)(n_k, n_l).

Proposition 3. For every invariant additive clustering criterion with t^(1) = 2C and decomposable t^(2), the contribution
² Both invariance axioms can in fact be weakened with respect to additional multiplicative or additive constants, but this does not result in qualitatively different criteria.
function can be written as an affine combination

    t^(2)(n_l, n_k) = Σ_{r=1}^{7} λ_r t_r(n_l, n_k) − C,

with an additive offset³ C = 1/(N − 1) and the following elementary functions and their (l, k)-symmetric counterparts (t_5, t_6, t_7):

    t_1 = 1/(N − n_l),    t_2 = 1/((K − 1) n_l),
    t_3 = N/(K(K − 1)(N − n_l) n_l),    t_4 = N/(K(K − 1) n_l n_k).

³ The constant C is a technical requirement to obtain the correct sign for the additive offset in Axiom 4; it will be dropped in the sequel.

Proofs of both propositions are given in the Appendix. For simplicity, we focus on the special case of symmetric proximity matrices in the sequel, for which one may set λ_5 = λ_6 = λ_7 = 0 without loss of generality. In summary, we have derived two elementary intra-cluster compactness criteria

    H^{pc1}(M; D) = Σ_{l=1}^K n_l D̄_l,    H^{pc2}(M; D) = (N/K) Σ_{l=1}^K D̄_l,    (3)

where

    D̄_l = [Σ_{i=1}^N Σ_{j=1, j≠i}^N M_il M_jl D_ij] / [Σ_{i=1}^N Σ_{j=1, j≠i}^N M_il M_jl]    (4)

is the average dissimilarity in cluster C_l. Moreover, by restriction to decomposable contribution functions and symmetric data we have derived four elementary inter-cluster separation criteria:

    H^{ps1a}(M; D) = −Σ_{l=1}^K n_l [Σ_{k=1, k≠l}^K D̄_lk / (K − 1)],
    H^{ps1b}(M; D) = −Σ_{l=1}^K n_l [Σ_{k=1, k≠l}^K n_k D̄_lk / (N − n_l)],    (5)

    H^{ps2a}(M; D) = −(N/K) Σ_{l=1}^K [Σ_{k=1, k≠l}^K D̄_lk / (K − 1)],
    H^{ps2b}(M; D) = −(N/K) Σ_{l=1}^K [Σ_{k=1, k≠l}^K n_k D̄_lk / (N − n_l)],    (6)

where

    D̄_lk = [Σ_{i=1}^N Σ_{j=1}^N M_il M_jk D_ij] / [Σ_{i=1}^N Σ_{j=1}^N M_il M_jk]    (7)

denotes the average inter-cluster dissimilarity. The most fundamental distinction between the different criteria concerns the weighting of the average cluster compactness (D̄_l) or separation (Σ_{k≠l} n_k D̄_lk) with either the cluster size (H^{pc1}, H^{ps1a}, H^{ps1b}) or a constant (H^{pc2}, H^{ps2a}, H^{ps2b}). The second distinction concerns the way the average cluster separation is computed for separation measures: either these averages are performed by pooling all dissimilarities together (H^{ps1b}, H^{ps2b}), or by a two-stage procedure which first calculates averages for every pair of clusters (D̄_lk) and combines those with a constant weight (H^{ps1a}, H^{ps2a}). These differences are crucial for the robustness properties, formalized in the following axioms.

2.3. Robustness axioms

The following set of axioms is concerned with the sensitivity of the quality criterion with respect to perturbations of the proximities. Since inaccurate (e.g., quantized) or noisy measurements are common, robustness is a key issue. Two different notions of robustness are distinguished:

Axiom 5 (Weak robustness). A family of objective functions H = (H_{N,K})_{N∈ℕ} is robust in the weak sense if for all Δd ∈ R, ε ∈ R⁺ there exists N_0 ∈ ℕ such that for all N ≥ N_0, M ∈ M_{N,K}, D ∈ R^{N²}, and pairs of data indices (i, j):

    (1/N) |H_{N,K}(M; D) − H_{N,K}(M; D^{ij})| < ε,    (8)

where D^{ij} is defined as in Axiom 2.

Axiom 6 (Strong robustness). H is robust in the strong sense if condition (8) holds for all D^i ∈ R^{N²} defined by D^i = D + Δd (δ_ik + δ_il − δ_ik δ_il)_{k,l}.

More intuitively, robustness assures that single measurements (weak robustness) or measurements belonging to a single object (strong robustness) do not have a macroscopic influence on the costs of a configuration. The robustness properties of the invariant criteria are summarized without proof in the following table:

                       H^{pc1}  H^{pc2}  H^{ps1a}  H^{ps1b}  H^{ps2a}  H^{ps2b}
    Weak robustness    Yes      No       Yes       Yes       No        Yes
    Strong robustness  Yes      No       No        No        No        No

We emphasize the most remarkable facts: (i) H^{pc1} is the only criterion which fulfills the strong robustness axiom, (ii) no invariant inter-cluster separation criterion is robust in the strong sense, (iii) H^{ps2b} is the only criterion with constant cluster weights that is robust in the weak sense. All cluster separation criteria lack strong robustness. The reason for this lies in configurations with only one large (macroscopic) cluster and (K − 1) small (microscopic) clusters, where the number of inter-cluster dissimilarities scales only with O(N), compared to the total number scaling with O(N²). Strong robustness can be obtained by restricting M to the following sets of admissible solutions: (1) at least two macroscopic clusters for H^{ps1b}, and (2) K macroscopic clusters for H^{ps1a} and H^{ps2b}. These considerations yield the following ranking of invariant clustering criteria with respect to their asymptotic robustness properties:

    H^{pc1} ≻ H^{ps1b} ≻ {H^{ps1a}, H^{ps2b}} ≻ {H^{ps2a}, H^{pc2}}.    (9)
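The six criteria are straightforward to evaluate from (D, M). The sketch below is an illustrative numpy implementation of Eqs. (3)–(6) under simplifying assumptions (symmetric D, zero diagonal, and every cluster containing at least two objects); the data and partition are hypothetical.

    import numpy as np

    def criteria(M, D):
        """Evaluate H^{pc1}, H^{pc2}, H^{ps1a}, H^{ps1b}, H^{ps2a}, H^{ps2b}
        from Eqs. (3)-(6) for a Boolean assignment matrix M and symmetric D."""
        N, K = M.shape
        n = M.sum(0)                          # cluster sizes n_l
        pair = M.T @ D @ M                    # sums of D over cluster pairs
        cnt = np.outer(n, n) - np.diag(n)     # pair counts; n_l(n_l - 1) on diag
        Dbar = pair / cnt                     # avg: D-bar_l on diag, D-bar_lk off
        off = Dbar - np.diag(np.diag(Dbar))
        pc1 = (n * np.diag(Dbar)).sum()
        pc2 = (N / K) * np.diag(Dbar).sum()
        ps1a = -(n * off.sum(1) / (K - 1)).sum()
        ps1b = -(n * (off @ n) / (N - n)).sum()
        ps2a = -(N / K) * (off.sum(1) / (K - 1)).sum()
        ps2b = -(N / K) * ((off @ n) / (N - n)).sum()
        return pc1, pc2, ps1a, ps1b, ps2a, ps2b

    rng = np.random.default_rng(2)
    N, K = 12, 3
    D = rng.random((N, N)); D = (D + D.T) / 2; np.fill_diagonal(D, 0.0)
    M = np.eye(K, dtype=int)[np.arange(N) % K]  # balanced partition of size 4 each
    print(criteria(M, D))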
By the axiomatization, the criterion H^{pc1} is thus clearly distinguished from all other additive criteria. Among the cluster separation measures, H^{ps1b} has been identified as the most promising candidate due to its robustness properties. Interestingly, there is an intrinsic connection to the K-means objective function for central clustering. Assume that the data were generated from a vector space representation v_i ∈ R^d by D_ij = (v_i − v_j)²; then H^{pc1}(M, D) ≈ H^{km}(M, D), with

    H^{km}(M, D) = Σ_{l=1}^K Σ_{i=1}^N M_il (v_i − y_l)²,    (10)

and the usual centroid definition

    y_l = Σ_{j=1}^N M_jl v_j / Σ_{j=1}^N M_jl.⁴

⁴ The almost-equal relation '≈' refers to the additional diagonal contributions D_ii, which are negligible for large N. Alternatively, the definition of additivity could be extended to cover the reflexive case to get a true identity.

Moreover, there exists an intimate relation to Ward's agglomerative clustering method [15]. If the distance between a pair of clusters is defined by the cost increase after merging both clusters, any objective function H is heuristically minimized by greedy merging starting from singleton clusters. In the case of H^{pc1} this procedure exactly yields Ward's method. It was often conjectured that Ward's method depends on the definition of centroids and the usage of a squared error measure; however, as demonstrated, Ward's method can be understood as a greedy algorithm to minimize H^{pc1}, which does not involve centroids or any other geometrical concepts.

It is worth mentioning that the graph partitioning objective function defined by

    H^{gp}(M; D) = Σ_{l=1}^K Σ_{i=1}^N Σ_{j=1, j≠i}^N M_il M_jl D_ij    (11)
is scale invariant and robust in the strong sense, but not shift invariant. H^{gp} has been utilized for data analysis in the operations research context [16] and also for texture segmentation [17]. The missing shift invariance is obvious: in the limit Δd → ∞ the optimal solution is an equipartitioning, while in the limit Δd → −∞ the configuration inevitably collapses into one cluster. A 'good' balance between positive and negative contributions is necessary to avoid this type of degeneration [17] (see Section 4). This is a consequence of the ratio-scale interpretation of dissimilarities, which requires the specification of a scale origin. The recently proposed normalized cut approach [18] provides interesting normalized cluster criteria which are not additive. The normalized cut cost function has been introduced only for two clusters. But as it is equal to the minimization of the normalized association

    H^{nc}(M; D) = Σ_{l=1}^K [Σ_{i=1}^N Σ_{j=1, j≠i}^N M_il M_jl D_ij] / [Σ_{i=1}^N Σ_{j=1, j≠i}^N (M_il + M_jl − M_il M_jl) D_ij],    (12)

it is naturally extended to multiple clusters. It should be noted that by using similarities and maximizing (12) a qualitatively different criterion is obtained, which is well defined even for highly sparse dissimilarity data. Both criteria are scale invariant and robust in the strong sense, but they are not shift invariant. The normalized cut is well defined only for positive proximities and is thus only defined on a ratio scale.

2.4. Sparse proximity data

As discussed in the introduction, it is important for large-scale applications of proximity-based clustering to develop methods which apply to arbitrary sparse proximity matrices. In order to distinguish between known and unknown dissimilarities an irreflexive graph (