Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2314
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Shi-Kuo Chang Zen Chen Suh-Yin Lee (Eds.)
Recent Advances in Visual Information Systems 5th International Conference, VISUAL 2002 Hsin Chu, Taiwan, March 11-13, 2002 Proceedings
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Volume Editors Shi-Kuo Chang Knowledge Systems Institute 3420 Main Street, Skokie, IL 60076, USA E-mail:
[email protected] Zen Chen Suh-Yin Lee National Chiao Tung University, Dept. of Comp. Science & Information Engineering 1001 Ta Hsueh Road, Hsin Chu, Taiwan E-mail: {zchen/sylee}@csie.nctu.edu.tw
Cataloging-in-Publication Data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Recent advances in visual information systems : 5th international conference, VISUAL 2002, Hsin Chu, Taiwan, March 11 - 13, 2002 ; proceedings / Shi-Kuo Chang ... (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2002 (Lecture notes in computer science ; Vol. 2314) ISBN 3-540-43358-9
CR Subject Classification (1998): H.3, H.5, H.2, I.4, I.5, I.3, I.7 ISSN 0302-9743 ISBN 3-540-43358-9 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002 Printed in Germany Typesetting: Camera-ready by author, data conversion by Olgun Computergrafik Printed on acid-free paper SPIN: 10846602 06/3142 543210
Preface
Visual information systems are information systems for visual computing, and visual computing is computing on visual objects. Some visual objects, such as images, are inherently visual in the sense that their primary representation is the visual one. Other visual objects, such as data structures, are derivatively visual: their primary representation is not visual, but can be transformed into a visual representation. Images and data structures are the two extremes; other visual objects, such as maps, may fall somewhere in between. Visual computing often involves the transformation of one type of visual object into another type, or into the same type, to accomplish objectives such as information reduction, object recognition, and so on. In visual information systems design it is also important to ask: who performs the visual computing? The answer to this question determines the approach to visual computing. It is possible that the computer primarily performs the visual computing and the human merely observes the results. It is also possible that the human primarily performs the visual computing and the computer plays a supporting role. Often the human and the computer are both involved as equal partners in visual computing, and there are visual interactions; formal or informal visual languages are usually needed to facilitate such interactions. This conference explored various research issues in visual information systems design and visual computing, and the papers are collectively published in this volume. We would like to express our special thanks to our sponsors: the National Science Council, ROC; the Lee and MTI Center of National Chiao Tung University, ROC; and the Knowledge Systems Institute, USA.
January 2002
Shi-Kuo Chang, Zen Chen, and Suh-Yin Lee
VISUAL 2002 Conference Organization
General Chair American General Co-chair European General Co-chair Asian General Co-chair
Shi-Kuo Chang, USA Ramesh Jain, USA Arnold Smeulders, The Netherlands Horace Ip, Hong Kong
Program Co-chairs
Zen Chen, ROC Suh-Yin Lee, ROC
Steering Committee
Shi-Kuo Chang, USA Horace Ip, Hong Kong Ramesh Jain, USA Tosiyasu Kunii, Japan Robert Laurini, France Clement Leung, Australia Arnold Smeulders, The Netherlands
Program Committee Jan Biemond, The Netherlands Josef Bigun, Switzerland Shih Fu Chang, USA David Forsyth, USA Theo Gevers, The Netherlands Luc van Gool, Belgium William Grosky, USA Glenn Healey, USA Nies Huijsmans, The Netherlands Yannis Ioanidis, Greece Erland Jungert, Sweden Rangachar Kasturi, USA Toshi Kato, Japan Martin Kersten, The Netherlands Inald Lagendijk, The Netherlands Robert Laurini, France Yi-Bin Lin, ROC
Carlo Meghini, Italy Erich Neuhold, Germany Eric Pauwels, Belgium Fernando Pereira, Portugal Dragutin Petkovic, USA Hanan Samet, USA Simone Santini, USA Stan Sclaroff, USA Raimondo Schettini, Italy Stephen Smoliar, USA Aya Soffer, USA Michael Swain, USA Hemant Tagare, USA George Thoma, USA Marcel Worring, The Netherlands Jian Kang Wu, Singapore Wei-Pan Yang, ROC
Sponsors
National Science Council, ROC National Chiao Tung University, ROC Knowledge Systems Institute, USA
Table of Contents
I Invited Talk

Multi-sensor Information Fusion by Query Refinement . . . . . . . . . . . . . . . . . . . 1
Shi-Kuo Chang, Gennaro Costagliola, and Erland Jungert

II Content-Based Indexing, Search and Retrieval
MiCRoM: A Metric Distance to Compare Segmented Images . . . . . . . . . . . . . . 12
Renato O. Stehling, Mario A. Nascimento, and Alexandre X. Falcão

Image Retrieval by Regions: Coarse Segmentation and Fine Color Description . . . 24
Julien Fauqueur and Nozha Boujemaa

Fast Approximate Nearest-Neighbor Queries in Metric Feature Spaces by Buoy Indexing . . . 36
Stephan Volmer

A Binary Color Vision Framework for Content-Based Image Indexing . . . . . 50
Guoping Qiu and S. Sudirman

Region-Based Image Retrieval Using Multiple-Features . . . . . . . . . . . . . . . . . 61
Veena Sridhar, Mario A. Nascimento, and Xiaobo Li

A Bayesian Method for Content-Based Image Retrieval by Use of Relevance Feedback . . . 76
Ju-Lan Tao and Yi-Ping Hung

Color Image Retrieval Based on Primitives of Color Moments . . . . . . . . . . . . 88
Jau-Ling Shih and Ling-Hwei Chen

Invariant Feature Extraction and Object Shape Matching Using Gabor Filtering . . . 95
Shu-Kuo Sun, Zen Chen, and Tsorng-Lin Chia
III Visual Information System Architectures
A Framework for Visual Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . 105
Horst Eidenberger, Christian Breiteneder, and Martin Hitz

Feature Extraction and a Database Strategy for Video Fingerprinting . . . . . 117
Job Oostveen, Ton Kalker, and Jaap Haitsma

ImageGrouper: Search, Annotate and Organize Images by Groups . . . . . . . 129
Munehiro Nakazato, Lubomir Manola, and Thomas S. Huang
Toward a Personalized CBIR System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Chih-Yi Chiu, Hsin-Chih Lin, and Shi-Nine Yang
IV Image/Video Databases
An Efficient Storage Organization for Multimedia Databases . . . . . . . . . . . . 152
Philip K.C. Tse and Clement H.C. Leung

Unsupervised Categorization for Image Database Overview . . . . . . . . . . . . . . 163
Bertrand Le Saux and Nozha Boujemaa

A Data-Flow Approach to Visual Querying in Large Spatial Databases . . . 175
Andrew J. Morris, Alia I. Abdelmoty, Baher A. El-Geresy, and Christopher B. Jones

MEDIMAGE – A Multimedia Database Management System for Alzheimer's Disease Patients . . . 187
Peter L. Stanchev and Farshad Fotouhi
V Networked Video
Life after Video Coding Standards: Rate Shaping and Error Concealment . . . 194
Trista Pei-chun Chen, Tsuhan Chen, and Yuh-Feng Hsu

A DCT-Domain Video Transcoder for Spatial Resolution Downconversion . . . 207
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao

A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast of Streaming Video . . . 219
Chin-Ying Kuo, Chen-Lung Chan, Vincent Hsu, and Jia-Shung Wang

Video Object Hyper-Links for Streaming Applications . . . . . . . . . . . . . . . . . . 229
Daniel Gatica-Perez, Zhi Zhou, Ming-Ting Sun, and Vincent Hsu
VI Application Areas of Visual Information Systems
Scalable Hierarchical Summarization of News Using Fidelity in MPEG-7 Description Scheme . . . 239
Jung-Rim Kim, Seong Soo Chun, Seok-jin Oh, and Sanghoon Sull

MPEG-7 Descriptors in Content-Based Image Retrieval with PicSOM System . . . 247
Markus Koskela, Jorma Laaksonen, and Erkki Oja

Fast Text Caption Localization on Video Using Visual Rhythm . . . . . . . . . . 259
Seong Soo Chun, Hyeokman Kim, Jung-Rim Kim, Sangwook Oh, and Sanghoon Sull
A New Digital Watermarking Technique for Video . . . . . . . . . . . . . . . . . . . . . . 269
Kuan-Ting Shen and Ling-Hwei Chen

Automatic Closed Caption Detection and Font Size Differentiation in MPEG Video . . . 276
Duan-Yu Chen, Ming-Ho Hsiao, and Suh-Yin Lee

Motion Activity Based Shot Identification and Closed Caption Detection for Video Structuring . . . 288
Duan-Yu Chen, Shu-Jiuan Lin, and Suh-Yin Lee

Visualizing the Construction of Generic Bills of Material . . . . . . . . . . . . . . . . 302
Peter Y. Wu, Kai A. Olsen, and Per Saetre

Data and Knowledge Visualization in Knowledge Discovery Process . . . . . . 311
TrongDung Nguyen, TuBao Ho, and DucDung Nguyen
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
Multi-sensor Information Fusion by Query Refinement

Shi-Kuo Chang1, Gennaro Costagliola2, and Erland Jungert3

1 Department of Computer Science, University of Pittsburgh, chang@cs.pitt.edu
2 Dipartimento di Matematica ed Informatica, Università di Salerno, gencos@unisa.it
3 Swedish Defense Research Agency (FOI), jungert@lin.foi.se
Abstract. To support the retrieval and fusion of multimedia information from multiple real-time sources and databases, a novel approach to sensor-based query processing is described. The sensor dependency tree is used to facilitate query optimization. Through query refinement, one or more sensors may provide feedback information to the other sensors. The approach is also applicable to evolutionary queries, which change in time and/or space depending upon the temporal/spatial coordinates of the query originator.
1 Sensor-Based Query Processing for Information Fusion
In recent years the fusion of multimedia information from multiple real-time sources and databases has become increasingly important because of its practical significance in many application areas such as telemedicine, community networks for crime prevention, health care, emergency management, e-learning, digital libraries, and field computing for scientific exploration. Information fusion is the integration of information from multiple sources and databases, in multiple modalities, located in multiple spatial and temporal domains. Generally speaking, the objectives of information fusion are: a) to detect certain significant events [29, 30], and b) to verify the consistency of detected events [10, 20, 25]. As an example, Figure 1(a) is a laser radar image of a parking lot with a moving vehicle (encircled). The laser radar is manufactured by SAAB Dynamics in Sweden. It generates image elements from a laser beam that is split into short pulses by a rotating mirror. The laser pulses are transmitted to the ground in a scanning movement, and when they are reflected back to the platform on the helicopter, a receiver collects the returning pulses, which are stored and analyzed. The results are points with x, y, z coordinates and time t. The resolution is about 0.3 m. In Figure 1(a) the only moving vehicle is in the lower right part of the image with a north-south orientation, while all other vehicles have an east-west orientation. Figure 1(b) shows two video frames with a moving white vehicle (encircled) entering a parking lot in the middle of the upper left frame, and between some of the parked vehicles in the lower right frame.

S.-K. Chang, Z. Chen, and S.-Y. Lee (Eds.): VISUAL 2002, LNCS 2314, pp. 1–11, 2002.
© Springer-Verlag Berlin Heidelberg 2002

Moving objects can be detected from
the video sequence [18]. On the other hand, the approximate 3D shape of an object or of the terrain can be obtained from the laser radar image [12]. Therefore the combined analysis of a laser radar image and a video frame sequence provides better information to detect a certain type of object and/or to verify the consistency of the detected object from both sources. To accomplish the objectives of information fusion, novel sensor-based query processing techniques to retrieve and fuse information from multiple sources are needed. In sensor-based query processing, the queries are applied to both stored databases and real-time sources that include different types of sensors. Since most sensors can generate large quantities of spatial information within short periods of time, sensor-based query processing requires query optimization. We describe a novel approach to sensor-based query processing and query optimization using the sensor dependency tree. Another aspect to consider is that queries may involve data from more than one sensor. In our approach, one or more sensors may provide feedback information to the other sensors through query refinement. The status information, such as position, time and certainty, can be incorporated in multi-level views and formulated as constraints in the refined query. In order to accomplish sensor data independence, an ontological knowledge base is employed.
Fig. 1. (a) A laser radar image of a parking lot with a moving vehicle (encircled). (b) Two video frames showing a moving white vehicle (encircled) while entering a parking lot.
There is an important class of queries that requires more sophisticated query refinement. We will call this class of queries evolutionary queries. An evolutionary query is a query that may change in time and/or space. For example, when an emergency management worker moves around in a disaster area, a predefined query can be executed repeatedly to evaluate the surrounding area and find objects of threat. Depending upon the position of the person or agent (the query originator) and the time of day, the query can be quite different. Our approach is also applicable to evolutionary queries, which may be modified depending upon the temporal/spatial coordinates of the query originator.
This paper is organized as follows. The background and related research are described in Section 2. The notion of sensor data independence is discussed in Section 3, and the sensor dependency tree and simple query processing are introduced in Section 4. Section 5 illustrates the query refinement approach. Section 6 discusses view management and the sensor data ontological knowledge base. An empirical study is described in Section 7.
2 Background and Related Research
In our previous research, a spatial/temporal query language called ΣQL was developed to support the retrieval and fusion of multimedia information from real-time sources and databases [5, 6, 9, 19]. ΣQL allows a user to specify powerful spatial/temporal queries for both multimedia data sources and multimedia databases, thus eliminating the need to write separate queries for each. ΣQL can be seen as a tool for handling spatial/temporal information for sensor-based information fusion, because most sensors generate spatial information in a temporally sequential manner [16]. A powerful visual user interface called the Sentient Map allows the user to formulate spatial/temporal σ-queries using gestures [7, 8]. For the empirical study we collaborated with the Swedish Defense Research Agency, which has collected information from different types of sensors, including laser radar, infrared video (similar to video but generated at 60 frames/sec), and a CCD digital camera. In our preliminary analysis, when we applied ΣQL to the fusion of the above described sensor data, we discovered that in the fusion process data from a single sensor yields poor results in object recognition. For instance, the target object may be partially hidden by an occluding object such as a tree, rendering certain types of sensors ineffective. Object recognition can be significantly improved if a refined query is generated to obtain information from another type of sensor while allowing the target to be partially hidden. In other words, one or more sensors may serve as guides to the other sensors by providing status information such as position, time and accuracy, which can be incorporated in multiple views and formulated as constraints in the refined query. In the refined query, the source(s) can be changed, and additional constraints can be included in the where-clause of the σ-query.
This approach provides better object recognition results because the refined query can improve the result from the various sensor data, which in turn leads to a better result in the fusion process. A refined query may also send a request for new data and thus lead to a feedback process. In early research on query modification, queries are modified to deal with integrity constraints [26]. In query augmentation, queries are augmented by adding constraints to speed up query processing [13]. In query refinement [28], multiple-term queries are refined by dynamically combining pre-computed suggestions for single-term queries. Recently, query refinement techniques have been applied to content-based retrieval from multimedia databases [3]. In our approach, the refined queries are created to deal with the lack of information from a certain source or sources, and therefore not only the constraints but also the source(s) can be changed. This approach has not been
considered previously in database query optimization, because usually the sources are assumed to provide the complete information needed by the queries. In addition to the related approaches in query augmentation, there is also recent work on agent-based techniques that is relevant to our approach. Many mobile agent systems have been developed [1, 2, 22], and recently mobile agent technology has begun to be applied to information retrieval from multimedia databases [21]. It is conceivable that sensors can be handled by different agents that exchange information and cooperate with each other to achieve information fusion. However, mobile agents are highly domain-specific and depend on ad hoc, 'hardwired' programs to implement them. In contrast, our approach offers a theoretical framework for query optimization and is applicable to different types of sensors, thus achieving sensor data independence.
3 Sensor Data Independence
As mentioned in the previous sections, sensor data independence is an important new concept in sensor-based query processing. In database design, data independence was first introduced to allow modifications of the physical databases without affecting the application programs [27]. It was a very powerful innovation in information technology: its main purpose was to simplify the use of databases from an end-user's perspective while at the same time allowing a more flexible administration of the databases themselves [11]. In sensor-based information systems [29], no similar concept has yet been suggested, because this area is still less mature with respect to the design and development of information systems integrated with databases in which sensor data are stored. Another reason is that the users are supposed to be domain experts, and consequently they have not yet requested sensor-based information systems with this property. In current sensor-based information systems, detailed knowledge about the sensors is required in order to formulate queries concerning the various objects and attributes registered by the sensors. Sensor selection is therefore left to the users, who supposedly are also experts on sensors. In real life this is not always the case: a user cannot be an expert on all sensors and all sensor data types. Therefore, systems with the ability to hide this kind of low-level information from the users need to be developed. User interfaces also need to be designed that allow users to formulate queries with ease and to request information at a high level of abstraction. An approach to overcome these problems and accomplish sensor data independence is described below, through the use of the sensor dependency tree, the query refinement technique, the multi-level view databases, and, above all, an ontological knowledge base for the sensors and the objects to be sensed.
4 The Sensor Dependency Tree
In database theory, query optimization is usually formulated with respect to a query execution plan where the nodes represent the various database operations to be
performed [14]. The query execution plan can then be transformed in various ways to optimize query processing with respect to certain cost functions. In sensor-based query processing, a concept similar to the query execution plan is introduced, called the sensor dependency tree. Each node Pi of the tree has the following parameters:

- objecti: the object type to be recognized
- sourcei: the information source
- recogi: the object recognition algorithm to be applied
- sqoi: the spatial coordinates of the query originator
- tqoi: the temporal coordinates of the query originator
- aoii: the spatial area-of-interest for object recognition
- ioii: the temporal interval-of-interest for object recognition
- timei: the estimated computation time in some unit such as seconds
- rangei: the range of certainty in applying the recognition algorithm, represented by two numbers min, max from the closed interval [0,1]

These parameters provide detailed information on a computation step to be carried out in sensor-based query processing. The query originator is the person or agent who issues a query. For evolutionary queries, the spatial/temporal coordinates of the query originator are required; for other types of queries, these parameters are optional. If the computation results of a node P1 are the required input to another node P2, there is a directed arc from P1 to P2. The directed arcs originate from the leaf nodes and terminate at the root node. The leaf nodes of the tree are the information sources such as laser radar, infrared camera, CCD camera, and so on. They have parameters such as (none, LR, NONE, sqoi, tqoi, aoiall, ioiall, 0, (1,1)). Sometimes we represent such leaf nodes by their symbolic names such as LR, IR, CCD, etc. The intermediate nodes of the tree are the objects to be recognized. For example, suppose the object type is 'truck'. An intermediate node may have parameters (truck, LR, recog315, sqoi, tqoi, aoiall, ioiall, 10, (0.3, 0.5)).
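As a minimal sketch, the node parameters listed above could be captured in a data structure like the following; the field names and Python encoding are our own illustration, not part of the original system:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SensorNode:
    """One node Pi of the sensor dependency tree (field names follow the text)."""
    object_type: Optional[str]   # e.g. "truck"; None for a raw source node
    source: str                  # "LR", "IR", "CCD", or "ALL" for the fusion node
    recog: Optional[str]         # recognition/fusion algorithm, e.g. "recog315"
    sqo: Optional[tuple] = None  # spatial coordinates of the query originator
    tqo: Optional[float] = None  # temporal coordinate of the query originator
    aoi: str = "aoi-all"         # spatial area-of-interest
    ioi: str = "ioi-all"         # temporal interval-of-interest
    time: float = 0.0            # estimated computation time (e.g., seconds)
    range_: Tuple[float, float] = (1.0, 1.0)  # (min, max) certainty in [0,1]

# A leaf source node and an intermediate recognition node from the text:
lr_source = SensorNode(None, "LR", None)
lr_truck = SensorNode("truck", "LR", "recog315", time=10, range_=(0.3, 0.5))
```

The directed arcs of the tree would then be represented separately, e.g. as a prerequisite list per node.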
The root node of the tree is the result of information fusion, for example, a node with parameters (truck, ALL, fusion7, sqoi, tqoi, aoiall, ioiall, 2000, (0,1)), where the parameter ALL indicates that information is drawn from all the sources. In what follows, the spatial/temporal coordinates sqoi and tqoi of the query originator, the all-inclusive area-of-interest aoiall, and the all-inclusive interval-of-interest ioiall are omitted for the sake of brevity, so that the examples are easier to read. Query processing is accomplished by the repeated computation and update of the sensor dependency tree. During each iteration, one or more nodes are selected for computation. The selected nodes must not depend on any other nodes. After the computation, one or more nodes are removed from the sensor dependency tree, and the process iterates. As an example, by analyzing the initial query, the following sensor dependency tree is constructed:

(none, LR, NONE, 0, (1,1)) → (truck, LR, recog315, 10, (0.3, 0.5)) →
(none, IR, NONE, 0, (1,1)) → (truck, IR, recog144, 2000, (0.5, 0.7)) → (truck, ALL, fusion7, 2000, (0,1))
(none, CCD, NONE, 0, (1,1)) → (truck, CCD, recog11, 100, (0, 1)) →

This means the information is from the three sources - laser radar, infrared camera and CCD camera - and the information will be fused for recognizing the object type 'truck'. Next, we select some of the nodes to compute. For instance, all three leaf nodes can be selected, meaning information will be gathered from all three sources.
After this computation, the processed nodes are dropped and the following updated sensor dependency tree is obtained:

(truck, LR, recog315, 10, (0.3, 0.5)) →
(truck, IR, recog144, 2000, (0.5, 0.7)) → (truck, ALL, fusion7, 2000, (0,1))
(truck, CCD, recog11, 100, (0, 1)) →
We can then select the next node(s) to compute. Since LR has the smallest estimated computation time, it is selected and recognition algorithm 315 is applied. The updated sensor dependency tree is:

(truck, IR, recog144, 2000, (0.5, 0.7)) →
(truck, CCD, recog11, 100, (0, 1)) → (truck, ALL, fusion7, 2000, (0,1))
In the updated tree, the LR node has been removed. We can now select the CCD node and, after its removal, select the IR node.

(truck, IR, recog144, 2000, (0.5, 0.7)) → (truck, ALL, fusion7, 2000, (0,1))
Finally, the fusion node is selected.

(truck, ALL, fusion7, 2000, (0,1))
After the fusion operation, there are no unprocessed (i.e., unselected) nodes, and query processing terminates.
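The iteration just walked through can be reproduced by a simple greedy scheduler that repeatedly selects a ready node (all prerequisites already processed) with the smallest estimated computation time. This is only one scheduling policy consistent with the example; the node names and dict encoding below are our own:

```python
# Dependency tree as adjacency: node -> set of prerequisite nodes.
# Estimated times follow the example: LR=10, CCD=100, IR=2000 for recognition.
deps = {
    "LR": set(), "IR": set(), "CCD": set(),
    "truck@LR": {"LR"}, "truck@IR": {"IR"}, "truck@CCD": {"CCD"},
    "fusion": {"truck@LR", "truck@IR", "truck@CCD"},
}
times = {"LR": 0, "IR": 0, "CCD": 0,
         "truck@LR": 10, "truck@IR": 2000, "truck@CCD": 100, "fusion": 2000}

def schedule(deps, times):
    """Repeatedly select the ready node with the smallest estimated time."""
    done, order = set(), []
    while len(done) < len(deps):
        ready = [n for n in deps if n not in done and deps[n] <= done]
        nxt = min(ready, key=lambda n: times[n])
        done.add(nxt)
        order.append(nxt)
    return order

print(schedule(deps, times))
# → ['LR', 'IR', 'CCD', 'truck@LR', 'truck@CCD', 'truck@IR', 'fusion']
```

The resulting order matches the text: the three sources first, then recognition on LR, CCD and IR in order of estimated cost, then fusion.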
5 Query Refinement
In the previous section, a straightforward approach to sensor-based query processing was described. This straightforward approach misses the opportunity to utilize incomplete and imprecise knowledge gained during query processing. Let us re-examine the above scenario. After LR is selected and recognition algorithm 315 is applied, suppose the result of recognition is not very good, and only some partially occluded large objects are recognized. If we follow the original approach, the reduced sensor dependency tree becomes:

(truck, IR, recog144, 2000, (0.5, 0.7)) →
(truck, CCD, recog11, 100, (0, 1)) → (truck, ALL, fusion7, 2000, (0,1))
But this misses the opportunity of utilizing the incomplete and imprecise knowledge gained by recognition algorithm 315. If the query is to find un-occluded objects and the sensor reports only an occluded object, then the query processor is unable to continue unless we modify the query to find occluded objects. Therefore a better approach is to refine the original query, so that the updated sensor dependency tree becomes: (truck, IR, recog144, aoi-23, 2000, (0.6, 0.8)) → (truck, ALL, fusion7, aoi-23, 2000, (0, 1))
This means recognition algorithm 315 is applied to detect objects in an area-of-interest aoi-23. After this is done, recognition algorithm 144 is applied to recognize objects of the type 'truck' in this specific area-of-interest. Finally, the fusion algorithm fusion7 is applied. Given a user query in a high-level language (natural language, a visual language, or a form), the query refinement approach is outlined below, where italic
words indicate operations for the second (and subsequent) iterations. Its flowchart is illustrated in Figure 2.

Step 1. Analyze the user query to generate/update the sensor dependency tree based upon the ontological knowledge base and the multi-level view database, which contains up-to-date contextual information in the object view, local view and global view, respectively.

Step 2. If the sensor dependency tree is reduced to a single node, perform the fusion operation (if multiple sensors have been used) and then terminate query processing. Otherwise build/refine the σ-query based upon the user query, the sensor dependency tree and the multi-level view database.

Step 3. Execute the portion of the σ-query that is executable according to the sensor dependency tree.

Step 4. Update the multi-level view database and go back to Step 1.
Fig. 2. Flowchart for the query refinement algorithm.
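The four steps above can be sketched as a loop over the sensor dependency tree. The dict-based node encoding and the caller-supplied callbacks (execute, fuse, update_views) are our own illustration, standing in for the ΣQL machinery and the view manager:

```python
def process_query(tree, execute, fuse, update_views, views):
    """Skeleton of the four-step refinement loop of Fig. 2 (names are ours).

    Nodes are dicts {"id": ..., "deps": [...]}; execute/fuse/update_views
    are caller-supplied callbacks.
    """
    while True:
        if len(tree) == 1:                  # Step 2: only the fusion node remains
            return fuse(tree[0], views)
        ready = [n for n in tree if not n["deps"]]   # executable portion (Step 3)
        if not ready:
            raise RuntimeError("no executable portion; the query must be refined")
        results = [execute(n, views) for n in ready]
        update_views(views, results)        # Step 4: update the view database
        done = {n["id"] for n in ready}     # Step 1: update the dependency tree
        tree = [n for n in tree if n["id"] not in done]
        for n in tree:
            n["deps"] = [d for d in n["deps"] if d not in done]
```

In a real system, Step 1 would also consult the ontological knowledge base, and a refinement step could rewrite the remaining nodes (changing sources or adding area-of-interest constraints) rather than merely pruning them.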
As mentioned above, there is another class of queries that requires more sophisticated query refinement. An evolutionary query is a query that changes in time and/or space. Depending upon the position of the query originator and the time of day, the query can be different. In other words, queries and query processing are affected by the spatial/temporal relations among the query originator, the sensors and the sensed objects. In query processing/refinement, these spatial/temporal relations must be taken into consideration in the construction/update of the sensor dependency tree. The temporal relations include "followed by", "preceded by", and so on. The spatial relations
include the usual spatial relations, and special ones such as "occluded by", and so on [24]. As mentioned above, if in the original query we are interested only in finding un-occluded objects, then the query processor must report failure when only an occluded object is found. If, however, the query is refined to "find both un-occluded and occluded objects", then the query processor can still continue.
6 Multi-level View Database and Ontological Knowledge Base
A multi-level view database (MLVD) is needed to support sensor-based query processing. The status information obtained from the sensors includes object type, position, orientation, time, accuracy and so on. The positions of the query originator and the sensors may also change. This information is processed and integrated into the multi-level view database. Whenever the query processor needs some information, it asks the view manager. The view manager also shields the rest of the system from the details of managing sensory data, thus achieving sensor data independence. The multiple views may include the following three views in a resolution pyramid structure: the global view, the local view and the object view. The global view describes where the target object is situated in relation to some other objects, e.g. a road from a map. This enables the sensor analysis program to find the location of the target object with greater accuracy and thus make a better analysis. The local view provides information such as whether the target object is partially hidden; it can be described, for example, in terms of Symbolic Projection [4] or other representations. Finally, there is also a need for a symbolic object description. The views may include information about the query originator and can be used later on in other important tasks such as situation analysis. The multi-level views are managed by the view manager, which can be regarded as an agent or as middleware, depending upon the system architecture. The global view is obtained primarily from the geographic information system (GIS). The local view and object view are more detailed descriptions of local areas and objects. The results of query processing and the movements of the query originator may both lead to the updating of all three views. For any single sensor the sensed data usually does not fully describe an object; otherwise there would be no need to utilize other sensors.
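One way to sketch the view manager and its three-level pyramid is given below; the class, method names and example entries are our own assumption, not the paper's implementation:

```python
class ViewManager:
    """Sketch of a multi-level view database (MLVD) behind a view manager."""

    def __init__(self):
        # Resolution pyramid: global (map context), local (scene), object (symbolic).
        self.views = {"global": {}, "local": {}, "object": {}}

    def update(self, level, key, value):
        self.views[level][key] = value

    def lookup(self, level, key):
        # The query processor asks the view manager instead of the raw sensors,
        # which is what provides sensor data independence.
        return self.views[level].get(key)

mlvd = ViewManager()
mlvd.update("global", "truck-1", {"near": "road-7"})           # from GIS
mlvd.update("local", "truck-1", {"occluded_by": "tree"})       # scene description
mlvd.update("object", "truck-1", {"length_m": 8.5, "certainty": (0.3, 0.5)})
```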
In the general case the system should be able to detect that some sensors are not giving a complete view of the scene, and automatically select those sensors that can help the most in providing additional information to describe the whole scene. To do so, the system should have a collection of facts and conditions that constitute its working knowledge about the real world and the sensors. This knowledge is stored in the ontological knowledge base, whose content includes object knowledge structure, sensor knowledge and sensor data control knowledge. The ontological knowledge base consists of three parts: the sensor part, describing the sensors, recognition algorithms and so on; the external conditions part, describing external conditions such as weather and light; and the sensed objects part, describing the objects to be sensed. Given the external conditions and the object to be sensed, we can determine which sensor(s) and recognition algorithm(s) may be applied. For example, IR and Laser can be used at night (time condition), while CCD cannot. IR can probably be used in foggy weather, but Laser and CCD cannot (weather condition). However, such determination is often uncertain. Therefore certainty factors should be associated with items in the ontological knowledge base to deal with this uncertainty.
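The condition-to-sensor determination described above can be sketched as a small lookup structure. All sensor names, conditions and certainty values below are illustrative assumptions based on the examples in the text (IR and Laser usable at night but not CCD; IR probably usable in fog), not part of the actual system:

```python
# Illustrative sketch of ontological knowledge base entries mapping
# (sensor, external condition) to a certainty factor in [0, 1] that the
# sensor is usable under that condition. Values are hypothetical.
ONTOLOGY = {
    ("IR",    "night"): 0.9,
    ("Laser", "night"): 0.9,
    ("CCD",   "night"): 0.0,
    ("IR",    "fog"):   0.6,   # "probably can be used" -> moderate certainty
    ("Laser", "fog"):   0.1,
    ("CCD",   "fog"):   0.1,
}

def applicable_sensors(condition, threshold=0.5):
    """Return sensors whose certainty of being usable under `condition`
    meets the threshold, ranked by decreasing certainty."""
    ranked = sorted(
        ((s, cf) for (s, c), cf in ONTOLOGY.items() if c == condition),
        key=lambda x: -x[1],
    )
    return [(s, cf) for s, cf in ranked if cf >= threshold]
```

A query processor could call `applicable_sensors("night")` to obtain the usable sensors together with the certainty factors that must later be propagated into the query result.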
7 An Empirical Study
For the empirical study we collected over 700 GB of data from three types of sensors: laser radar, an infrared camera (similar to video, but generating 60 frames/sec), and a CCD digital camera. Figure 3 shows an example of an infrared image, a laser radar image and a CCD image of the same area. The experimental data was provided by the Swedish Defense Research Agency for the evaluation of the sensor-based query processing approach. High-resolution terrain elevation models for synthetic environments are produced using laser-radar data [17]. GIS data are also available, so that the multi-level view databases and the ontological knowledge base can be constructed. Researchers at the Swedish Defense Research Agency plan to collect a substantial number of test queries. Three different types of test queries will be of interest: a) queries for the recognition of objects from multiple sensors; b) spatial/temporal queries; and c) evolutionary queries.
Fig. 3. An infrared image (left), a laser radar image (middle) and a CCD image (right).
As mentioned above, certainty factors should be associated with the nodes in the sensor dependency tree, with the items in the ontological knowledge base, and with the data acquired by the sensors, due to technical imperfections in the sensors and other practical considerations. Certainty factors (or confidence values) are normalized as real numbers in the interval [0,1] and interpreted as the certainty (or confidence) a user may have in a query result. Therefore, in query processing, all computation steps should take uncertainty management into consideration. Different approaches to uncertainty management, including Bayesian networks [15] and fuzzy logic [23], can be considered. Since the certainty factor of a node may change after a computation step, there may be multiple ways of deciding the precedence among the nodes, and the sensor dependency tree may have to be replaced by a sensor dependency graph. Query processing then does not proceed by simply eliminating nodes successively from the sensor dependency graph. We need to investigate generalized solutions, such as relaxation algorithms, to attack the problem of query processing and optimization under uncertainty management.
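The text leaves the combination rule open (Bayesian networks and fuzzy logic are both mentioned as candidates). As one illustration only, a simple certainty-factor calculus multiplies the factors accumulated along a chain of computation steps, which stays within the normalized [0,1] interval:

```python
# Illustrative combination rule for certainty factors along a chain of
# computation steps: multiply them, as in simple certainty-factor calculi.
# This is an assumption for illustration; the paper does not fix a rule.
def chain_certainty(factors):
    """Certainty of a result derived through steps with the given factors."""
    cf = 1.0
    for f in factors:
        assert 0.0 <= f <= 1.0, "certainty factors are normalized to [0, 1]"
        cf *= f
    return cf
```

For example, a result derived from a sensor reading with certainty 0.9 through a recognition algorithm with certainty 0.8 would carry certainty 0.72 under this rule.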
Shi-Kuo Chang, Gennaro Costagliola, and Erland Jungert
References

1. J. Baumann et al., “Mole – Concepts of a Mobile Agent System”, World Wide Web, Vol. 1, No. 3, 1998, pp 123-137.
2. C. Baumer, “Grasshopper – A Universal Agent Platform based on MASIF and FIPA Standards”, First International Workshop on Mobile Agents for Telecommunication Applications (MATA’99), Ottawa, Canada, October 1999, World Scientific, pp 1-18.
3. K. Chakrabarti, K. Porkaew and S. Mehrotra, “Efficient Query Refinement in Multimedia Databases”, 16th International Conference on Data Engineering, San Diego, California, February 28 – March 3, 2000.
4. S. K. Chang and E. Jungert, Symbolic Projection for Image Information Retrieval and Spatial Reasoning, Academic Press, London, 1996.
5. S. K. Chang and E. Jungert, “A Spatial/Temporal Query Language for Multiple Data Sources in a Heterogeneous Information System Environment”, The International Journal of Cooperative Information Systems (IJCIS), Vol. 7, Nos. 2 & 3, 1998, pp 167-186.
6. S. K. Chang, G. Costagliola and E. Jungert, “Querying Multimedia Data Sources and Databases”, Proceedings of the 3rd International Conference on Visual Information Systems (Visual’99), Amsterdam, The Netherlands, June 2-4, 1999.
7. S. K. Chang, “The Sentient Map”, Journal of Visual Languages and Computing, Vol. 11, No. 4, August 2000, pp 455-474.
8. S. K. Chang, T. H. Chen and C. S. Li, “Gesture-Enhanced Information Retrieval and Presentation in a Distributed Learning Environment”, Proceedings of the International Conference on Multimedia (ICME’2000), New York, July 31 – August 2, 2000.
9. S. K. Chang, G. Costagliola and E. Jungert, “Spatial/Temporal Query Processing for Information Fusion Applications”, Proceedings of the 4th International Conference on Visual Information Systems (Visual’2000), Lyon, France, November 2000, Lecture Notes in Computer Science 1929, Robert Laurini (Ed.), Springer, Berlin, pp 127-139.
10. C.-Y. Chong, S. Mori, K.-C. Chang and W. H. Baker, “Architectures and Algorithms for Track Association and Fusion”, Proceedings of Fusion’99, Sunnyvale, CA, July 6-8, 1999, pp 239-246.
11. C. Date, An Introduction to Database Systems, Addison-Wesley, 1995.
12. M. Elmqvist, E. Jungert et al., “Terrain Modelling and Analysis using Laser Scanner Data”, Proceedings of the Conference on Land Surface Mapping and Characterization using Laser Altimetry, Annapolis, MD, USA, October 22-24, 2001, pp 219-226, published by Dept. of Geography, University of Maryland, MD, 2001.
13. G. Graefe, “Query Evaluation Techniques for Large Databases”, ACM Computing Surveys, Vol. 25, No. 2, June 1993.
14. M. Jarke and J. Koch, “Query Optimization in Database Systems”, ACM Computing Surveys, Vol. 16, No. 2, 1984.
15. F. V. Jensen, An Introduction to Bayesian Networks, Springer-Verlag, New York, 1996.
16. E. Jungert, “An Information Fusion System for Object Classification and Decision Support Using Multiple Heterogeneous Data Sources”, Proceedings of the 2nd International Conference on Information Fusion (Fusion’99), Sunnyvale, California, USA, July 6-8, 1999.
17. E. Jungert, U. Söderman, S. Ahlberg, P. Hörling, F. Lantz, G. Neider, “Generation of High Resolution Terrain Elevation Models for Synthetic Environments Using Laser-Radar Data”, Proceedings of SPIE No. 3694, Modeling, Simulation and Visualization for Real and Virtual Environments, Orlando, Florida, April 7-8, 1999, pp 12-20.
18. E. Jungert, “A Qualitative Approach to Reasoning about Objects in Motion Based on Symbolic Projection”, Proceedings of the Conference on Multimedia Databases and Image Communication (MDIC’99), Salerno, Italy, October 4-5, 1999.
Multi-sensor Information Fusion by Query Refinement
19. E. Jungert, “A Data Fusion Concept for a Query Language for Multiple Data Sources”, Proceedings of the 3rd International Conference on Information Fusion (FUSION 2000), Paris, France, July 10-13, 2000.
20. L. A. Klein, “A Boolean Algebra Approach to Multiple Sensor Voting Fusion”, IEEE Transactions on Aerospace and Electronic Systems, Vol. 29, No. 2, April 1993, pp 317-327.
21. H. Kosch, M. Doller and L. Boszormenyi, “Content-based Indexing and Retrieval supported by Mobile Agent Technology”, Multimedia Databases and Image Communication, LNCS 2184 (M. Tucci, Ed.), Springer-Verlag, Berlin, 2001, pp 152-166.
22. D. B. Lange and M. Oshima, Programming and Deploying Java Mobile Agents with Aglets, Addison-Wesley, Reading, MA, USA, 1999.
23. Lawrence Livermore National Laboratory, “Multisensor Data Fusion System Using Fuzzy Logic”, in the web site on sensor technology at http://www.llnl.gov/sensor_technology/STR25.html, 2001.
24. S. Y. Lee and F. J. Hsu, “Spatial Reasoning and Similarity Retrieval of Images Using 2D C-string Knowledge Representation”, Pattern Recognition, Vol. 25, No. 3, 1992, pp 305-318.
25. J. R. Parker, “Multiple Sensors, Voting Methods and Target Value Analysis”, Proceedings of the SPIE Conference on Signal Processing, Sensor Fusion and Target Recognition VI, SPIE Vol. 3720, Orlando, Florida, April 1999, pp 330-335.
26. M. Stonebraker, “Implementation of Integrity Constraints and Views by Query Modification”, in SIGMOD, 1975.
27. J. D. Ullman, Database and Knowledge-Base Systems, Vol. 1, Computer Science Press, Rockville, Maryland, USA, 1988, pp 11-12.
28. B. Vélez, R. Weiss, M. A. Sheldon and D. K. Gifford, “Fast and Effective Query Refinement”, Proceedings of the 20th ACM Conference on Research and Development in Information Retrieval (SIGIR’97), Philadelphia, Pennsylvania, July 1997.
29. E. Waltz and J. Llinas, Multisensor Data Fusion, Artech House, Boston, 1990.
30. F. E. White, “Managing Data Fusion Systems in Joint and Coalition Warfare”, Proceedings of EuroFusion98 – International Conference on Data Fusion, October 1998, Great Malvern, United Kingdom, pp 49-52.
MiCRoM: A Metric Distance to Compare Segmented Images

Renato O. Stehling¹, Mario A. Nascimento², and Alexandre X. Falcão¹

¹ Institute of Computing, University of Campinas, Brazil ({renato.stehling,afalcao}@ic.unicamp.br)
² Department of Computer Science, University of Alberta, Canada ([email protected])

Abstract. Recently, several content-based image retrieval (CBIR) systems that make use of segmented images have been proposed. In these systems, images are segmented and represented as a set of regions, and the distance between images is computed according to the visual features of their regions. A major problem of existing distance functions used to compare segmented images is that they are not metrics. Hence, it is not possible to exploit filtering techniques and/or access methods to speed up query processing, as both techniques make extensive use of the triangular inequality property, one of the metric axioms. In this work, we propose microm (Minimum-Cost Region Matching), an effective metric distance which models the comparison of segmented images as a minimum-cost network flow problem. To our knowledge, this is the first time a true metric distance function is proposed to evaluate the distance between segmented images. Our experiments show that microm is at least as effective as existing non-metric distances. Moreover, we have been able to use the recently proposed Omni-sequential filtering technique, and have achieved nearly 2/3 savings in retrieval/query processing time.
1 Introduction
Image databases are becoming more and more common in several distinct application domains, such as (multimedia) search engines, digital libraries, medical and geographic databases, and criminal investigation. The evolution of techniques for the acquisition, transmission and storage of images has also allowed the construction of very large image databases. All these factors have spurred great interest in content-based image retrieval (CBIR) techniques. Existing CBIR systems based on low-level features (such as color and texture) can be classified into three main categories: (1) global approaches (e.g. [1,2,3]), (2) partition-based approaches (e.g. [4,5,6]) and (3) regional approaches (e.g. [7,8,9]). Each of these categories poses a distinct compromise among the complexity of the visual feature extraction algorithms, the complexity of the distance function used to compare images, the amount of space required to represent the visual features, and the retrieval effectiveness.
Research partially supported by NSERC, Canada, and by CNPq/FINEP, Brazil, under the PRONEX SAI Project.
S.-K. Chang, Z. Chen, and S.-Y. Lee (Eds.): VISUAL 2002, LNCS 2314, pp. 12–23, 2002. © Springer-Verlag Berlin Heidelberg 2002
Global approaches describe the visual content of an image as a whole, without spatial or topological information. Partition-based approaches introduce some spatial information about the visual content of the images by decomposing them into spatial cells, according to a fixed partition scheme, and describing the content of each cell individually. Regional approaches are a natural evolution of partition-based approaches in the sense that, instead of decomposing images in a fixed way, they exploit the visual content to achieve a more flexible and robust segmentation. Unlike partition cells, segmented regions of two distinct images may have different sizes, positions and shapes. Moreover, the number of regions of two images may be different. Our focus in this paper is on the comparison of segmented images in the context of regional CBIR approaches. To the best of our knowledge, existing distance functions that compare segmented images are not metrics. More specifically, they do not satisfy the triangular inequality property. This property is essential to reduce the query processing time using filtering techniques [10] and/or access methods [11,12]. Our contribution in this paper is an effective metric distance to compare segmented images, called microm (Minimum-Cost Region Matching). The main advantage of microm is the possibility, for the first time (as far as we know), of comparing segmented images using an effective, true-metric distance function. As a consequence, microm allows the use of filtering techniques and/or access methods to reduce the query time. The remainder of this paper is organized as follows. Section 2 describes in detail the problems related to the comparison and indexing of segmented images, identifying existing distances for this purpose. In Section 3, we propose microm, our new metric distance to compare segmented images. The effectiveness of microm is evaluated in Section 4. Experimental results related to the use of filtering techniques based on the microm metric are presented in Section 5. Finally, Section 6 states our conclusions and directions for future work.
2 Comparison of Segmented Images
One important aspect of any CBIR system is the distance function used to compare the visual features extracted from the images. Such a distance directly affects the time spent processing a visual query and the quality of the retrieval (effectiveness). The better the distance simulates the human perception of similarity using the available visual features, the more effective the CBIR system is in retrieving images relevant to the user's needs. The computational complexity of the distance function is another important issue when processing a visual query. Depending on the function's complexity, the time to compute the distance between images might exceed the time to access the disk pages where the visual features are stored. The distance function also restricts the universe of filtering techniques and access methods that can be used to speed up query processing. When the visual features of images are represented as points in a k-dimensional space (each vector element corresponds to a spatial coordinate), it is possible to exploit geometric distances like L1 (City-Block) and L2 (Euclidean) to compare images. Moreover, it is possible to use spatial access methods (SAMs) [12] to reduce the search space at query time.
Unfortunately, regional CBIR approaches cannot be adequately modeled in a vectorial space, because the number of regions of two images may differ and the obtained regions may also have different sizes. Regional CBIR approaches are better modeled in a metric space. A metric space is composed of a set of elements (in our case, the visual features) and a metric distance to compare these elements. In metric spaces there are no restrictions on the representation of the visual features; what really matters are the metric properties of the distance used to compare them. A distance d is considered a metric if, for any (images) X, Y and Z, the following properties hold:

– Positiveness: d(X, Y) ≥ 0
– Symmetry: d(X, Y) = d(Y, X)
– Reflexivity: d(X, X) = 0
– Triangular inequality: d(X, Z) ≤ d(X, Y) + d(Y, Z)
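These axioms can be checked empirically on a finite sample of elements. Such a check is only a necessary condition (a distance may pass on a sample and still fail on other elements), but it is a useful sanity test for a candidate distance function; a sketch:

```python
import itertools

def is_metric_on_sample(d, elems, tol=1e-9):
    """Empirically check the four metric axioms for distance d on a
    finite sample of elements (a necessary, not sufficient, condition)."""
    for x in elems:
        if abs(d(x, x)) > tol:                 # reflexivity
            return False
    for x, y in itertools.combinations(elems, 2):
        if d(x, y) < -tol:                     # positiveness
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # symmetry
            return False
    for x, y, z in itertools.permutations(elems, 3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangular inequality
            return False
    return True

# Example: the L1 (City-Block) distance over 2-D points is a metric.
l1 = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
```

As a contrast, the squared difference d(x, y) = (x − y)² fails the triangular inequality (e.g. on the sample 0, 1, 2), which such a check detects.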
Metric spaces can be efficiently indexed using metric access methods (MAMs) [11]. These methods make extensive use of the triangular inequality property to reduce the search space and also the number of distance computations at query time. The main problem in modeling a regional CBIR approach in a metric space lies in the distance function used to compare segmented images. To the best of our knowledge, there are only a few works dedicated to this topic. In general, the most common approach is to perform comparisons based on individual regions, as in the Blobworld system [7]. In that system, although querying based on a limited number of regions is allowed, the query is performed by merging single-region query results. Even if it were possible to combine the results obtained with each individual region of an image, there is no guarantee that the full content of the images is compared. It is possible that most of the regions in one image match the same region of the other. Moreover, if the comparison is performed in the opposite direction, a completely different distance may be obtained. In order to reduce the influence of inaccurate segmentation, and to guarantee the comparison of the full content of the images, systems like SIMPLIcity [8] and CBC [9] compare images according to the properties of all segmented regions simultaneously, not only on a region-by-region basis. SIMPLIcity compares images according to the irm (Integrated Region Matching) distance. An equivalent distance function is used in CBC; the main difference is that the visual features used to compare individual regions in CBC and SIMPLIcity are not the same.

2.1 IRM Distance
The irm distance between two images X and Y is algorithmically described in Table 1. The main problem of the irm distance function is that it does not satisfy the triangular inequality property. This problem stems from the greedy approach of matching the most similar regions first: the greedy algorithm does not guarantee that the obtained distance is the best (smallest) one. Figure 1 shows a counterexample where the results obtained with the greedy irm distance do not satisfy the triangular inequality property. In this example, images X, Y and Z are compared two-by-two, according to their regions. Each image has exactly two
Table 1. The irm distance

    irm(X, Y):
        for each pair of regions Xi ∈ X and Yj ∈ Y:
            Xi.status = Yj.status = 0
            compute dreg(Xi, Yj)
        β = 0
        for each dreg(Xi, Yj) in non-decreasing order:
            if Xi.status = 0 and Yj.status = 0:
                if Xi.size < Yj.size:
                    w = Xi.size
                    Yj.size = Yj.size − Xi.size
                    Xi.status = 1
                else:
                    w = Yj.size
                    Xi.size = Xi.size − Yj.size
                    Yj.status = 1
                    if Xi.size = 0 then Xi.status = 1
                β = β + w × dreg(Xi, Yj)
        return β
regions of the same size (0.5). For illustrative purposes only, each region has its visual feature represented by a single numerical value. This number could be, for example, the average gray level of the region. The size and also the visual feature of each region are normalized between 0 and 1. The distance between two regions (dreg) is given by the absolute difference of their visual features. The edges between images in the figure show the matched regions according to the irm distance. On the right of Figure 1, the results of the comparisons are organized in a triangular shape.

Fig. 1. The comparison of images X, Y and Z using the irm distance does not satisfy the triangular inequality property. (In the figure, image X has regions a = 0.2 and b = 0.6, image Y has regions c = 1.0 and d = 0.5, and image Z has regions e = 0.3 and f = 0.8. The irm distances are irm(X, Y) = 0.45, irm(Y, Z) = 0.2 and irm(X, Z) = 0.15; the optimal match between X and Y, shown in dotted lines, gives 0.35.)
Thus, the triangular comparison of the images gives the inequality 0.45 > 0.2 + 0.15, which contradicts the triangular inequality property. The problem in this example lies in the distance between images X and Y: the greedy approach adopted in irm results in a non-optimal distance when X and Y are compared, because there is another match which reduces the distance between them.
The optimal comparison which minimizes the distance between images X and Y is shown in dotted lines, and gives the result optimal(X, Y) = 0.5 × |0.2 − 0.5| + 0.5 × |0.6 − 1.0| = 0.35. The result of this optimal comparison is shown in parentheses in the triangular representation of the distances among the three images. If the optimal distance is used, we have 0.35 ≤ 0.2 + 0.15, which satisfies the triangular inequality property.
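The counterexample can be checked mechanically. The following sketch implements the greedy matching of Table 1 together with a brute-force optimal matching (valid here because both images have the same number of equal-size regions); the region values are those of Figure 1, and the scalar dreg is the illustrative absolute difference used in the example:

```python
from itertools import permutations

def irm(X, Y, dreg=lambda a, b: abs(a - b)):
    """Greedy irm distance of Table 1. X, Y: lists of (size, feature)
    regions whose sizes each sum to 1; dreg compares two features."""
    X = [list(r) for r in X]               # [remaining size, feature]
    Y = [list(r) for r in Y]
    pairs = sorted(((dreg(x[1], y[1]), x, y) for x in X for y in Y),
                   key=lambda t: t[0])     # non-decreasing dreg order
    beta = 0.0
    for d, x, y in pairs:
        w = min(x[0], y[0])                # match as much size as possible
        if w > 0:
            x[0] -= w
            y[0] -= w
            beta += w * d
    return beta

def optimal(X, Y, dreg=lambda a, b: abs(a - b)):
    """Optimal matching when both images have equally many equal-size
    regions: brute force over all one-to-one assignments."""
    return min(sum(x[0] * dreg(x[1], y[1]) for x, y in zip(X, perm))
               for perm in permutations(Y))

# Region (size, feature) pairs from Fig. 1.
X = [(0.5, 0.2), (0.5, 0.6)]               # regions a, b
Y = [(0.5, 1.0), (0.5, 0.5)]               # regions c, d
Z = [(0.5, 0.3), (0.5, 0.8)]               # regions e, f
```

On these images, the greedy distances give irm(X, Y) = 0.45 > irm(Y, Z) + irm(X, Z) = 0.35, while the optimal distances (0.35, 0.2, 0.15) satisfy the triangular inequality.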
3 The microm Metric Distance

In this section, we propose microm (Minimum-Cost Region Matching), a new true-metric distance function to compare the visual content of segmented images. As will be shown in Section 4, microm is at least as effective as irm, the distance function used in the SIMPLIcity and CBC systems, and has the advantage that it can be adequately indexed using existing MAMs [11] such as the M-tree [13]. It is also possible to use a combination of filtering techniques and SAMs to speed up query processing, as will be discussed in Section 5.

The main idea of microm consists of modeling the comparison of segmented images as a minimum-cost network flow problem [14]. More specifically, the comparison of images is modeled as a transportation problem. The transportation problem is an optimization problem which can be informally expressed as follows. Assume that we have a number of consumers with a certain demand for a product. This product is made by a number of producers with certain production capacities. The system is balanced in the sense that the total demand equals the total production capacity. The production should be transported from the producers to the consumers, such that every consumer gets exactly as much product as it needs. The transportation costs from all producers to all consumers are known in advance. The transportation problem is to find the optimal (cheapest) way to bring the products from the producers to the consumers.

Next, a formal definition of the transportation problem is given. A network is a directed graph G = (V, E) composed of a set V of n nodes and a set E of m arcs. Each node represents either a producer or a consumer. Assuming that there are p producers and c consumers, we have n = p + c. Each node has an associated number pd which represents its production (positive values) or its demand (negative values), depending on whether the node is a producer or a consumer. The system is balanced, so Σ_{i=1}^{p} pd_i + Σ_{j=1}^{c} pd_j = 0.
There is a directed arc (i, j) for every pair of producer i and consumer j; thus m = p × c. Each arc (i, j) has two associated values: its transportation capacity cap_ij and its transportation cost cost_ij. The arc capacity is given by cap_ij = min(|pd_i|, |pd_j|). The decision variable in the transportation problem is the flow flow_ij in each arc (i, j). These flows should satisfy 0 ≤ flow_ij ≤ cap_ij, and should minimize the function Σ_{i=1}^{p} Σ_{j=1}^{c} (cost_ij × flow_ij). The minimum value of this function corresponds to the microm distance between the two images, that is, µ = min Σ_{i=1}^{p} Σ_{j=1}^{c} (cost_ij × flow_ij). Despite the differences in the modeling of the problem, microm gives the optimal solution for the comparison of segmented images that the greedy approach adopted in irm sometimes fails to obtain. In fact, the irm distance can be thought of as a greedy function to solve the
transportation problem (as defined above) which gives as much flow as possible to the arcs with the smallest cost. The minimum-cost network flow problem is a linear program with a very special structure [14]. As such, specialized algorithms can find solutions much faster than plain linear programming algorithms. A large number of efficient algorithms for this specialized instance of the problem are available. In our case, we used the CS2 code developed by Cherkassky and Goldberg¹. CS2 is an efficient implementation of a scaling push-relabel algorithm for the minimum-cost flow/transportation problem [15]. An example of two images and the modeling of their comparison as a transportation problem can be viewed in Figure 2. Image X is composed of three regions a, b and c, and image Y is composed of regions d and e. The visual feature of each region is represented by a number. This number and also the size of the regions are normalized between [0,1]. For example, size(a) = 0.5 and size(b) = 0.25. The comparison of images X and Y is modeled as a transportation problem in the following way.

Fig. 2. Modeling the comparison of segmented images as a transportation problem. (In the figure, image X has regions a = 1.0 (size 0.5), b = 0.0 (size 0.25) and c = 0.8 (size 0.25); image Y has regions d = 0.8 and e = 0.3, each of size 0.5. The minimum-cost solution gives µ(X, Y) = 0.175 + 0.05 + 0.0 + 0.075 = 0.3.)
Each region of image X is modeled as a producer node, where the production is given by the normalized size of the region. Similarly, each region of image Y is modeled as a consumer node, with a demand given by its size (remember that a demand is represented by a negative value). Each arc between a pair of producer/consumer nodes has a cost given by the distance (dreg) between the corresponding regions. In this example, this distance is given by the absolute difference of the numerical properties of the regions.

¹ http://www.intertrust.com/star/goldberg/soft.html
A solution for the transportation problem modeled on the top of Figure 2 can be viewed on the bottom part of the same figure. As can be seen, half of node a's production (0.25) was transported to node d with cost 0.2. The other half (0.25) was transported to node e with cost 0.7. All of node b's production (0.25) was transported to node e with cost 0.3, filling the demand of that node. Finally, the total production of node c (0.25) was transported to node d with cost 0. The minimum transportation cost in this network is thus (0.25 × 0.2) + (0.25 × 0.7) + (0.25 × 0.3) + (0.25 × 0.0) = 0.3. The bottom-right part of Figure 2 shows how the solution of the transportation problem maps back onto the compared images. In this particular example, the irm distance is exactly the same as microm, i.e., µ(X, Y) = irm(X, Y). However, as shown in the previous section, this is not always the case.

3.1 microm Metric Properties

The microm distance decomposes the “real” regions of the images into “virtual” subregions to compute the minimum distance between them. The regions obtained after the virtual decomposition have very interesting properties:

– The number of regions of the compared images becomes the same.
– The obtained regions are the ones which minimize the distance between the two images, according to the model adopted (the transportation problem).
– There is a one-to-one match between the regions of the two images.
– Matched regions have the same size.

The above properties ensure that the distance between images is optimal and that the full content of the images is compared. These properties are also useful to show that the microm distance is a metric. By construction, it is clear that the microm distance satisfies the axioms of positiveness, symmetry and reflexivity. Next, it will be shown that this distance also satisfies the triangular inequality property. The demonstration assumes that the distance dreg (used to compare individual regions of images) is a metric.
Consider the triangular comparison of three images X, Y and Z, at the level of virtual regions. Assume that a virtual region Xi of image X matches a virtual region Yj of image Y. Similarly, assume that the virtual region Yj matches a virtual region Zk of image Z, and that the virtual region Zk matches a virtual region Xl of image X, closing a triangular match for a particular virtual region. In this scenario, there are two possible relations between the virtual regions Xi and Xl of image X: either Xi = Xl or Xi ≠ Xl. We call the first case a cyclic match, because the virtual region which started the triangular match is the same one that ends the process. The second case is called an acyclic match, as the regions which start and end the triangular match are different. Initially, let us suppose that the application of the microm distance to compare images X, Y and Z results only in cyclic matches (Xl = Xi) at the level of virtual regions. As we are assuming the cyclic property only when images X and Z are compared (closing the triangular comparison of the images), this specific microm distance (with the additional restriction of cyclic matches) is denoted µ_cyclic(X, Z). We know that for cyclic matches, dreg(Xi, Zk) ≤ dreg(Xi, Yj) + dreg(Yj, Zk) for any regions Xi, Yj and Zk, since dreg is a metric. We also know that the microm
distance is only a linear combination of dreg distances. As a linear combination of metric distances is also a metric, we have that, for the case of cyclic matches of virtual regions, µ_cyclic(X, Z) ≤ µ(X, Y) + µ(Y, Z). The assumption of cyclic matches at the level of virtual regions does not guarantee that the obtained distance is optimal, because this is not a restriction of our model. However, as the microm distance is optimal, we have µ(X, Z) ≤ µ_cyclic(X, Z) ≤ µ(X, Y) + µ(Y, Z), i.e., independently of the use of acyclic matches of virtual regions, the optimality of the microm distance always guarantees that the triangular inequality property holds.
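In general, computing microm requires a min-cost-flow solver (the authors use CS2). For the one-dimensional illustrative feature used in the running examples, where dreg(x, y) = |x − y|, the transportation problem happens to be solvable exactly by matching region mass in sorted feature order, a standard property of one-dimensional transportation. The following sketch uses that shortcut, so it illustrates the model rather than the paper's general implementation:

```python
def microm_1d(X, Y):
    """Minimum transportation cost between two images whose regions are
    (size, feature) pairs with a scalar feature and dreg(x, y) = |x - y|.
    For this 1-D cost, moving mass between regions in sorted feature
    order is an optimal transportation plan."""
    xs = [list(r) for r in sorted((r[1], r[0]) for r in X)]  # (feature, size)
    ys = [list(r) for r in sorted((r[1], r[0]) for r in Y)]
    i = j = 0
    total = 0.0
    while i < len(xs) and j < len(ys):
        w = min(xs[i][1], ys[j][1])          # transportable amount (flow)
        total += w * abs(xs[i][0] - ys[j][0])
        xs[i][1] -= w
        ys[j][1] -= w
        if xs[i][1] <= 1e-12:                # producer exhausted
            i += 1
        if ys[j][1] <= 1e-12:                # consumer satisfied
            j += 1
    return total

# Fig. 2 example: X = {a=1.0 (0.5), b=0.0 (0.25), c=0.8 (0.25)},
#                 Y = {d=0.8 (0.5), e=0.3 (0.5)}.
X = [(0.5, 1.0), (0.25, 0.0), (0.25, 0.8)]
Y = [(0.5, 0.8), (0.5, 0.3)]
```

On the Figure 2 example this reproduces µ(X, Y) = 0.3 (through a different, equally cheap plan than the one drawn in the figure), and on the Figure 1 images it yields the optimal 0.35 rather than the greedy 0.45.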
4 Effectiveness Evaluation
This section presents our experimental results related to the effectiveness of the microm metric distance. We compared microm with the irm distance under the same segmentation scheme. In order to have a reference, we also included the results obtained when images are represented by their global color histograms (GCH) and compared with the L1 vectorial distance; histograms with 64 uniformly quantized colors were adopted. The experiments used a collection of about 20,000 heterogeneous images (Corel GALLERY Magic 65,000 - Stock Photo Library 2), composed of 200 distinct image domains, each one with 100 JPEG images. The microm and irm distances were used to compare regions obtained with the CBC(3, 0.1) configuration of the CBC clustering algorithm [9]. This configuration offers an intermediate compromise between the number of obtained regions (which affects the space overhead and the query processing time) and the retrieval effectiveness. With this configuration, each image in our reference collection was segmented (on average) into 40 connected regions. Each region of an image is represented by its average color in the Lab color-space (3 values), its size, and the spatial coordinates of its geometric center (2 values). Thus, each region is represented by 6 floating-point numbers (fpns) and each image is represented, on average, by 6 × 40 = 240 fpns. The distance between regions (dreg) adopted is a weighted composition of the distance between the average colors in the CIE Lab color-space and the distance between the spatial positions of the compared regions. Since it is generally difficult to express low-level features of images, the Query-By-Example (QBE) paradigm was adopted, where an image is given as an example and the system retrieves the most similar matches for this image. The effectiveness of the approaches was evaluated using a set of 18 query images, selected from our reference collection.
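A region distance of this kind can be sketched as follows. The text specifies only that dreg is a weighted composition of a color distance and a spatial distance; the Euclidean form of each component and the weight values below are illustrative assumptions, not the paper's actual parameters:

```python
import math

# Hypothetical sketch of dreg: a weighted composition of the Lab-color
# distance and the spatial distance between region centers. Weights and
# the Euclidean components are assumptions for illustration.
def dreg(r1, r2, w_color=0.7, w_pos=0.3):
    """r1, r2: regions as dicts with 'lab' = (L, a, b) average color and
    'center' = (x, y) geometric center, in comparable normalized ranges."""
    dc = math.dist(r1["lab"], r2["lab"])        # color component
    dp = math.dist(r1["center"], r2["center"])  # spatial component
    return w_color * dc + w_pos * dp
```

Note that a weighted sum of two metrics with non-negative weights is itself a metric, which is exactly the property the microm proof in Section 3.1 requires of dreg.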
The set of images accepted as relevant for each query image (RRSet) was determined a priori, using a technique similar to the pooling method adopted in TREC conferences [16,17]. We extracted the set of relevant images (for a given query) from a pool of possible relevant images. This pool is created by taking the top 30 images retrieved by each compared approach. The pool of candidate images was then visually analyzed to ultimately decide on the relevance of each image. The subset of relevant images in the pool is the RRSet of the query image. We evaluated the effectiveness of the approaches using Precision vs. Recall (P×R) curves [16]. Precision is a measure which evaluates the accuracy of the search (how many
Renato O. Stehling, Mario A. Nascimento, and Alexandre X. Falcão
of the retrieved images are relevant). Recall measures how exhaustive the retrieval is (how many of the relevant images were retrieved). The results of the effectiveness comparison are shown in Figure 3. The best overall results were obtained with the microm metric distance, followed by the irm distance; in both cases, the comparison was based on the regions obtained with the CBC clustering algorithm. As can be seen, both results are better than representing images by a GCH and comparing these histograms with a geometric distance (L1). The advantage of microm over irm is evident, but not very large. This means that the irm distance, although not a metric, is a good approximation of the microm metric distance in terms of effectiveness. It is also efficient, since it is less expensive to compute. However, the microm metric distance, besides being slightly better in terms of effectiveness, has the advantage that its metric properties can be used to speed up query processing through filtering techniques and/or access methods.
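The P×R evaluation above can be sketched as follows; this is the standard computation as we read it, not code from the paper:

```python
def precision_recall(ranked_ids, rrset):
    """Precision and recall after each retrieved image, given a ranked result
    list for a query and the query's set of relevant images (RRSet)."""
    relevant = set(rrset)
    hits, points = 0, []
    for k, img in enumerate(ranked_ids, start=1):
        if img in relevant:
            hits += 1
        # recall: fraction of relevant images retrieved so far
        # precision: fraction of retrieved images that are relevant
        points.append((hits / len(relevant), hits / k))
    return points
```

Averaging such curves over the 18 query images yields the P×R plot of Figure 3.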
Fig. 3. Effectiveness results (Precision vs. Recall curves for CBC + MiCRoM, CBC + irm, and GCH + L1)
For small collections, the combination of an efficient distance like irm and a linear scan of the image database is an interesting approach. However, for large databases, independently of its computational complexity, the use of a metric distance like microm becomes more attractive, as it is possible to reduce the query time by making extensive use of the triangular inequality property. In the next section, we investigate a filtering technique that reduces the CPU time needed to process a visual query when complex distances like microm are used to compare images.
5 Filtering Based on Metric Distances
Since there are efficient techniques to cope with vector spaces, application designers try to give their problems a vector space structure. A common reduction consists of mapping a general metric space into a projected vector space. A query processed in the vectorial space generates a candidate list of images that must be analyzed in the original metric space in order to eliminate false positives. The space reduction discussed above is obtained by defining k images of the database as references, computing and storing the microm distances between the database
images and the reference images as k-dimensional vectors, and then using a simple and efficient geometric distance to filter out non-relevant images in the vectorial space (at query time). Santos et al. [10] called this space reduction the Omni-concept. They proposed the HF-algorithm to define the k reference images (foci) used to generate the k-dimensional vectorial space (omni-space). The sequential scan of the omni-space was called Omni-sequential. The omni-sequential algorithm makes extensive use of the triangular inequality property to eliminate non-relevant images at query time. To illustrate this process, let Q be a query image, D a database image, Fi the ith focus used to generate the k-dimensional omni-space (1 ≤ i ≤ k), and r a query radius. The database image D is a candidate image only if the following inequality holds:

max_{1 ≤ i ≤ k} |µ(Q, Fi) − µ(Fi, D)| ≤ r     (1)
Notice that the distances µ(Q, Fi) and µ(Fi, D) are known at query time, as they correspond to the ith omni-coordinate (in the omni-space) of images Q and D, respectively. In our filtering experiments, we adopted the omni-sequential algorithm. As discussed in the previous section, our reference collection has 20,000 images. The results presented are relative to the 18 query images used in the effectiveness evaluation discussed in the previous section. The proportion of the database filtered out using the omni-sequential algorithm was evaluated by varying the number of foci between 1 and 10. The foci images were selected according to the HF-algorithm. We used query radii varying between 0.005 and 0.1 (as the distances are normalized, the maximum distance between two images is 1.0). The left of Figure 4 shows the relation between the query radius and the average number of images retrieved, i.e., the number of images with a microm distance to the query image smaller than the query radius.
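The pruning step behind inequality (1) can be sketched as follows; names and the dictionary layout are illustrative:

```python
def omni_filter(query_coords, db_coords, r):
    """Sketch of the omni-sequential pruning step (inequality (1)).
    query_coords: list of mu(Q, F_i) for the k foci, computed at query time.
    db_coords: {image_id: [mu(D, F_i), ...]} precomputed at indexing time.
    By the triangular inequality, |mu(Q, F_i) - mu(F_i, D)| is a lower bound
    on mu(Q, D), so D can be discarded whenever this bound exceeds r."""
    candidates = []
    for img, coords in db_coords.items():
        bound = max(abs(q - c) for q, c in zip(query_coords, coords))
        if bound <= r:
            # survivor: must still be checked with the real microm distance
            candidates.append(img)
    return candidates
```

Only the surviving candidates pay the cost of a full microm computation, which is where the CPU-time gains reported below come from.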
Fig. 4. Filtering results (left: average number of images retrieved as a function of the query radius; right: percentage of the database filtered out as a function of the number of foci, for query radii between 0.005 and 0.045)
As can be seen, in order to retrieve the top 100 most similar images to a query image, a query radius of 0.045 is, on average, enough. A query radius of 0.1 (not shown in the figure) is sufficient to retrieve, on average, the top 9039 most similar images to the query image, which is approximately half of the database.
The right of Figure 4 shows the degree of filtering obtained with query radii between 0.005 and 0.045, according to the number of foci used. As can be seen, independently of the query radius used, the ideal number of foci seems to be 4; after this point, the proportion of the database filtered out does not increase substantially. For example, for a query radius of 0.045, 63.45% of the image database was filtered out using only 4 foci. This means that about 2/3 of the database was pruned without computing the microm distance, using only the L1 distance in the 4-dimensional omni-space. This proportion grows to only 67.34% when 10 foci are used. The behavior is the same for all query radii tested. As comparing two 4-dimensional vectors with the L1 distance is much cheaper than comparing the regions of two images with the microm distance, the gain in CPU time using omni-sequential (for a query radius of 0.045) is almost 2/3 compared to a linear scan of the image database. In order to reduce the I/O time to process a visual query, it is possible to index the generated 4-dimensional vectorial space using a spatial access method (SAM) such as the R∗-tree [18]. SAMs restrict the comparison to images near the query image; in this way, only a portion of the omni-space needs to be read from disk, further reducing the number of I/O operations required to process a visual query.
6 Conclusions and Future Work
This paper presented microm (Minimum-Cost Region Matching), an effective metric distance to compare the visual content of segmented images. microm models the comparison of the regions of two images as a minimum-cost network flow problem [14]. Our experimental results show that the microm metric is at least as effective as the irm distance [8,9]. This suggests that the greedy approach adopted in irm, although not optimal, gives results very close to those obtained with the microm metric, with the advantage of being less complex. However, the main disadvantage of irm is that it is not a metric distance and is therefore useful only when the image database is relatively small. The microm metric, although computationally more complex than irm, is not only slightly more effective but, more importantly, has the great advantage of allowing the use of the triangular inequality property in filtering techniques [10] and/or access methods [11,12]. This yields substantial reductions in query processing time and a much broader context of application than irm. In the near future, we plan to investigate in more detail indexing based on the microm metric distance, in order to define the best filtering technique/access method to speed up query processing. Another possibility is to investigate alternative segmentation techniques that result in regions which maximize the benefits of comparison with the microm metric.
Acknowledgments. Renato O. Stehling carried out this work while visiting the University of Alberta and was supported by a Graduate Scholarship from FAPESP, Brazil.
References
1. Androutsos, D., Plataniotis, K.N., Venetsanopoulos, A.N.: Vector angular distance measure for indexing and retrieval of color. In: Proc. of SPIE – Storage and Retrieval for Image and Video Databases VII. Volume 3656. (1999) 604–613
2. Sethi, I.K., Coman, I., Day, B., et al.: Color-wise: A system for image similarity retrieval using color. In: Proc. of SPIE – Storage and Retrieval for Image and Video Databases IV. Volume 3312. (1998) 140–149
3. Zhang, Y.J., Liu, Z.W., He, Y.: Comparison and improvement of color-based image retrieval techniques. In: Proc. of SPIE – Storage and Retrieval for Image and Video Databases VI. Volume 3312. (1998) 371–382
4. Sciascio, E.D., Mingolla, G., Mongiello, M.: Content-based image retrieval over the web using query by sketch and relevance feedback. In: Proc. of the VISUAL'99 Intl. Conf. (1999) 123–130
5. Sebe, N., Lew, M.S., Huijsmans, D.P.: Multi-scale sub-image search. In: Proc. of ACM Multimedia'99 Intl. Conf. (1999) 79–82
6. Stehling, R.O., Nascimento, M.A., Falcão, A.X.: On 'shapes' of colors for content-based image retrieval. In: Proc. of the ACM MIR'00 Intl. Workshop. (2000) 171–174
7. Carson, C., Thomas, M., Belongie, S., et al.: Blobworld: A system for region-based image indexing and retrieval. In: Proc. of the VISUAL'99 Intl. Conf. (1999) 509–516
8. Li, J., Wang, J.Z., Wiederhold, G.: IRM: Integrated region matching for image retrieval. In: Proc. of ACM Multimedia'00 Intl. Conf. (2000) 147–156
9. Stehling, R.O., Nascimento, M.A., Falcão, A.X.: An adaptive and efficient clustering-based approach for content-based retrieval in image databases. In: Proc. of IDEAS'01 Intl. Symposium. (2001) 356–365
10. Santos, R.F., Traina, A., Traina, C., Faloutsos, C.: Similarity search without tears: The omni-family of all-purpose access methods. In: Proc. of ICDE'01. (2001) 623–630
11. Chavez, E., Navarro, G., Baeza-Yates, R., Marroquin, J.L.: Searching in metric spaces. ACM Computing Surveys (2001) To appear.
12. Gaede, V., Guenther, O.: Multidimensional access methods. ACM Computing Surveys 30 (1998) 123–169
13. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. In: Proc. of the VLDB'97 Intl. Conf. (1997) 426–435
14. Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows: Theory, Algorithms, and Applications. Prentice Hall (1993)
15. Goldberg, A.V.: An efficient implementation of a scaling minimum-cost flow algorithm. Journal of Algorithms 22 (1997) 1–29
16. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley (1999)
17. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann (1999)
18. Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R∗-tree: An efficient and robust access method for points and rectangles. In: Proc. of ACM SIGMOD Intl. Conference. (1990) 322–331
Image Retrieval by Regions: Coarse Segmentation and Fine Color Description

Julien Fauqueur and Nozha Boujemaa

INRIA, Imedia Research Group, BP 105, F-78153 Le Chesnay, France
{Julien.Fauqueur,Nozha.Boujemaa}@inria.fr
http://www-rocq.inria.fr/imedia/
Abstract. In Content-Based Image Retrieval systems, region-based queries allow more precise search than global ones: the user can retrieve similar regions of interest regardless of their background in the images. The definition of regions in thousands of generic images is a difficult key point, since it should require no user interaction for each image and should nevertheless be as close as possible to the regions of interest (to the user). In this paper we first propose a new technique for unsupervised coarse detection of regions which improves their visual specificity. The Competitive Agglomeration (CA) classification algorithm, which has the advantage of automatically determining the optimal number of classes, is used. The second key point is the region description, which must be finer for regions than for whole images. We present a novel region descriptor of fine color variability: the Adaptive Distribution of Color Shades (ADCS). It is based on color shades adaptively determined for each region at high resolution: 5 million potential different colors, against the few hundred predefined colors of existing descriptors. Successful results of segmentation and region queries are presented on a database of 2500 generic images involving landscapes, people, objects, architecture, flora, etc.
1 Introduction
S.-K. Chang, Z. Chen, and S.-Y. Lee (Eds.): VISUAL 2002, LNCS 2314, pp. 24–35, 2002. © Springer-Verlag Berlin Heidelberg 2002

The primary functionality of a Content-Based Image Retrieval system is the global query-by-example approach, in which visual features are extracted from the entire image. But in many cases the user's goal is to retrieve similar regions rather than similar images as a whole. In a generic image database, the search for similar regions using global features over images can be strongly biased by the surrounding regions and background. Region-based query systems allow the user to select a region in an image and retrieve images containing a similar region. The two major points to consider are the definition of regions and their description. A manual extraction of regions was proposed in [1] but is unviable for huge databases. Automatic region detection can be performed on-line using feature back projection (see [2] and [3]), but such methods are inaccurate and time consuming at
query phase. Off-line methods include systematic image subdivision into squares (see [4]) and image segmentation. The latter was adopted in a couple of systems such as Blobworld [5] and Netra [6]. In Blobworld [5], segmentation is performed by classification with the EM algorithm, which requires a predefined number of classes. A contour-based segmentation proposed in [7] and integrated in a CBIR system ([6] and [8]) provides an accurate segmentation, but with very homogeneous regions. We can also cite the work of Wang [9] in SIMPLIcity, which performs a color segmentation of images to describe an image as a set of regions, but single-region queries cannot be performed. Existing region color descriptors are based on histograms computed on a predefined subsampling of a color space: a uniform subsampling of HSV into 166 bins in VisualSeek [2], a uniform subsampling of Lab into 218 bins in Blobworld [10], or a 256-color codebook predetermined for a given database in Netra [8]. Our approach differs from the above in our conception of regions and in the techniques for extracting and describing them. We think regions should integrate more intrinsic variability to provide a better characterization, and their color description should not depend on a predefined color set. The key idea is coarse region detection with fine description: the relatively high visual variability inside regions is accurately described by the fine resolution of color shades, so that regions are really specific from each other in the database. The Competitive Agglomeration classification algorithm, used for both segmentation and indexing, is detailed in the next section. In section 3, the coarse image segmentation for automatic region detection is presented. Region indexing and matching are explained in section 4. Then tests and results are presented and discussed in sections 5 and 6. Finally, conclusions are drawn.
2 CA Clustering Algorithm
Competitive Agglomeration classification, originally presented in [11], has the major advantage of determining the optimal number of clusters. In [12] an application of this algorithm to image segmentation is proposed. Using notations from [11] and [12], we call {x_j, ∀j ∈ {1, ..., N}} the set of N data points we want to cluster and C the number of clusters. {β_i, ∀i ∈ {1, ..., C}} denotes the set of prototypes to be determined. The distance between data point x_j and prototype β_i is d(x_j, β_i). CA classification is then performed by minimizing the following quantity J:

J = J1 + αJ2, where J1 = Σ_{i=1}^{C} Σ_{j=1}^{N} u_ij² d²(x_j, β_i) and J2 = − Σ_{i=1}^{C} [ Σ_{j=1}^{N} u_ij ]²     (1)

where u_ij represents the membership degree of feature point x_j to prototype β_i. Minimizing J1 separately is equivalent to performing an FCM classification [13], which determines C optimal prototypes and the fuzzy partition U given x_j and C, using distance d. J2 is a complexity reduction term which guarantees the cluster validity (see [12]). Therefore J is written as a combination of two opposite
effect terms (J1 and J2). So minimizing J with an over-specified number of initial clusters simultaneously performs the data clustering and optimizes the number of clusters. α is the competition weight, which should balance the terms J1 and J2 in (1). J is minimized iteratively and, at iteration k, the weight α is written as:

α(k) = η0 exp(−k/τ) · [ Σ_{i=1}^{C} Σ_{j=1}^{N} u_ij² d²(x_j, β_i) ] / [ Σ_{i=1}^{C} ( Σ_{j=1}^{N} u_ij )² ]     (2)
As the iterations go, α decreases, so emphasis is first given to the agglomeration process and then to the classification optimization. α is fully determined by η0 and τ. During the algorithm, spurious clusters are discarded. Convergence is reached when the prototypes are stable. The classification granularity is controlled by the factor α through its magnitude η0 and its decay rate τ: the higher η0 and τ, the higher α, and thus the more classes are merged. So, for a given classification granularity, CA determines the optimal number of classes. CA will be used in three steps of our work, with different levels of granularity and different input data: first to perform image quantization, then to roughly segment the image by computing LDQC prototypes, and then to finely describe regions with color shades.
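A rough sketch of a CA-style iteration may help fix ideas. The update rules follow [11] in spirit, but the parameter values, the pruning rule, and several simplifications are ours, not the paper's settings:

```python
import numpy as np

def ca_cluster(X, C=8, eta0=5.0, tau=10.0, n_iter=50, min_card=2.0, seed=0):
    """Rough sketch of Competitive Agglomeration: FCM-style updates plus a
    cardinality bias that starves spurious clusters, which are then discarded.
    All parameter values and the pruning threshold are illustrative."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    B = X[rng.choice(len(X), size=C, replace=False)].copy()  # initial prototypes
    for k in range(n_iter):
        d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1) + 1e-9  # (N, C)
        inv = 1.0 / d2
        u_fcm = inv / inv.sum(axis=1, keepdims=True)  # FCM memberships, rows sum to 1
        card = u_fcm.sum(axis=0)                      # cluster cardinalities N_i
        # decaying competition weight alpha(k), in the spirit of eq. (2)
        alpha = eta0 * np.exp(-k / tau) * (u_fcm ** 2 * d2).sum() / (card ** 2).sum()
        # bias term: rewards populous clusters and starves spurious ones
        card_bar = (inv * card).sum(axis=1) / inv.sum(axis=1)
        u = u_fcm + alpha * inv * (card[None, :] - card_bar[:, None])
        keep = u.sum(axis=0) >= min_card              # discard spurious clusters
        u, B = u[:, keep], B[keep]
        w = u ** 2                                    # prototype update (weighted means)
        B = (w.T @ X) / w.sum(axis=0)[:, None]
    return B
```

Starting from an over-specified C, duplicate prototypes inside the same natural cluster lose cardinality and are pruned, which is how CA converges to the number of classes on its own.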
3 Coarse Region Detection
Extracted regions should encompass a certain visual diversity to be visually characteristic, which calls for a coarse segmentation. We want to stay above a too fine level of spatial and feature details. This choice is also motivated by the drawbacks of an oversegmentation, which produces small and homogeneous regions:
– a small region is rarely visually salient in a scene
– a statistics-based description computed on a small region cannot be accurate
– if all regions are homogeneous, it is harder to differentiate them from one another
– too many regions needlessly grow the database size

We define a region of interest as an area of connected pixels that is perceptually salient, i.e., covering a minimum surface in the image and presenting a certain visual "homogeneous diversity". To group pixels into such regions, we perform a CA-classification of the local color distributions of the image. This feature naturally integrates the diversity of colors in a pixel's neighbourhood. The choice of the color set used to compute the local color distributions is crucial: it must be compact, to gain speed in classification, and be representative of a small pixel neighbourhood. If all original colors are kept (an image can contain thousands of different colors), the classification becomes computationally too expensive. Classic color histograms, computed on a uniform subsampling of a color space, are too long (they contain useless empty bins). So we define the
color set as the adaptive set representing the quantized colors of a given image, obtained by color classification. All neighbourhoods in the image give a set of Local Distributions of Quantized Colors (referred to as LDQCs), which are classified. The LDQC prototypes are back-projected onto the image, then small regions are either merged or discarded.

3.1 Image Color Quantization
Image colors are CA-classified as (L,u,v) triples using the Euclidean distance. The classification granularity was chosen such that big areas of images with a strong texture are represented by at least 2 color shades. At classification convergence, the color prototypes define the set Cqc of nqc color shades. Since CA automatically determines the right number of clusters, the number of color shades nqc is representative of the image's color diversity. The quantized image is obtained by back-projecting the color prototypes onto the image.

3.2 Determination of LDQC Prototypes in the Image
To determine all the LDQCs, we slide a window over the pixels of the quantized image and evaluate the corresponding local distribution over the Cqc color set. Let us denote S_W the window surface and S_TOT the image surface. LDQCs are evaluated every w_r pixels, where w_r is the window radius, so that all pixels participate in the determination of the LDQC prototypes. A suitable distribution distance must be used for the classification. Lp distances are widely used to measure similarity between color distributions computed over entire images, but are not adapted to distributions computed over small pixel neighbourhoods. Indeed, the distribution of a natural image is rather smooth and flatter than that of a small neighbourhood, which presents a couple of peaks. Since there are few colors in a neighbourhood, it is necessary to have a distance for LDQCs which takes into account the inter-bin color similarity. This is what the color quadratic form distance presented in [14] does. Its expression is given for two distributions {x_i} and {y_i} evaluated on a set of nqc colors:

d_q(x, y)² = (x − y)ᵀ A (x − y) = Σ_{i=1}^{nqc} Σ_{j=1}^{nqc} (x_i − y_i)(x_j − y_j) a_ij     (3)
where a_ij is the similarity between colors i and j, determined with the Euclidean distance in the Luv space. This distance is used during classification to compare the LDQC histograms (we have d = d_q in the CA formulae (1) and (2)). After classification, the segmented image is obtained by assigning to the S_TOT/w_r² pixels the label of the LDQC prototype minimizing the quadratic distance to the LDQC around that pixel. A maximum-vote filter is applied to the image of labels to discard isolated pixels. The window surface S_W defines the spatial level of detail of the segmentation: the higher S_W, the bigger the patterns we extract. w_r was set to 8 pixels for a 500x500 image.
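A sketch of the quadratic form distance of eq. (3); the exact mapping from Luv distance to the similarity a_ij is our assumption, since the paper only says A is derived from Euclidean Luv distances:

```python
import numpy as np

def quad_distance2(x, y, colors):
    """Squared color quadratic form distance (eq. (3)) between two LDQC
    histograms x and y over the same color set `colors` (n_qc Luv triples).
    The distance-to-similarity mapping below is an illustrative choice."""
    colors = np.asarray(colors, dtype=float)
    d = np.linalg.norm(colors[:, None, :] - colors[None, :, :], axis=-1)
    A = 1.0 - d / (d.max() + 1e-12)  # a_ij: 1 for identical colors, ~0 for the farthest pair
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ A @ diff)
```

Unlike a binwise Lp distance, two peaked neighbourhood histograms over perceptually close colors come out close under this distance, which is exactly what the LDQC classification needs.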
3.3 Adjacency Information
The segmented image gives us a complete partition of the image into adjacent regions formed from the back projection of the LDQC prototypes. Very small regions correspond to salient areas detected by the LDQC classification but are too small to constitute regions of interest, so they needlessly increase the total number of regions in the database. Besides, in complex scenes, they are often located at the frontier between two regions of interest or inside a region of interest, so they should be merged to improve the topology of the regions of interest. Region attributes (surface, color distribution) and region adjacency information (list of neighbours) are stored in a Region Adjacency Graph (RAG) structure used to merge regions. We want final regions of interest to have a minimum size S_Mmin = 0.015 · S_TOT (i.e., 1.5% of the image surface). Below this threshold, a region is merged with its closest visual neighbour if it has one, and is discarded otherwise. Two small regions are said to be visually close if they have close mean quantized color distributions. After the merging process, remaining regions of size below S_Mmin are salient but too small, so they are discarded from the graph and not indexed. The region extraction workflow is the following:
1. image quantization by CA-classification of color pixels
2. computation and CA-classification of LDQCs to obtain LDQC prototypes
3. determination of connected components and generation of the RAG
4. merging and discarding of regions

4 Region Indexing and Retrieval

4.1 Fine Color Region Description
Once regions are detected in a coarse way, we have to finely describe their visual appearance. Existing region color descriptors are generally histograms evaluated on a few hundred bins obtained by a subsampling of the color space: a uniform subsampling in [2], [10] or a database-dependent subsampling in [8]. See the illustration of a 216-bin Luv histogram region description in the left part of figures (1) and (2). Such a description forces the minimum distance between two colors to be high, because the subsampling is fixed and because only a few hundred colors among millions in a full color space are considered. This low granularity of color description is suitable for complex images, as they contain a wide range of different colors. But regions are by definition more homogeneous than an image, so their color description should be finer. To represent the shades of any given hue, a high-granularity color set must be found. A fine uniform subsampling of a color space raises the problems of numerous useless empty bins and heavy matching computation. We want to select for each region an adaptive color set providing the color shades which are relevant for that region. We should get a single color shade on a perfectly
uniform region and many on a highly textured region. We decide to index regions with the distribution of their color shades, determined with the CA algorithm at a high classification granularity. To achieve this, for each region, its color pixels in the original image are classified with low τ and η0 to catch representative color shades. The optimal number of color shades found by CA is in itself information about the region's visual diversity. The color shade triples are determined from the whole Luv color space, which contains 5.6 million colors, while a classic color descriptor picks colors from around 200 given colors. The descriptor index consists of the list of color shades as Luv triples with their respective percentages in the region. The top-right parts of figures (1) and (2) show examples of such descriptors. Note: the image quantized colors determined in section 3.1 are unsatisfactory candidates to index regions, for two reasons: they are determined with a too low granularity (suitable for a coarse segmentation), and all image color pixels are in competition, which favours colors from big regions and biases the determination of the color prototypes.
4.2 Matching Regions
For a given query region with color shade distribution X, similar regions are those whose distribution Y minimizes the distance between X and Y. Let us write the distributions X and Y as pairs of color/percentage:

X = {(c^X_1, p^X_1), ..., (c^X_ncsX, p^X_ncsX)} and Y = {(c^Y_1, p^Y_1), ..., (c^Y_ncsY, p^Y_ncsY)}     (4)

and define a_{c^X_i c^Y_j} as the color similarity between c^X_i (the ith color of X) and c^Y_j (the jth color of Y). Since color shades are finely determined, the quadratic distance is again a good choice to take into account the inter-bin color similarity. Formula (5) gives the quadratic distance between two color distributions x and y evaluated on the same color set. But when measuring the distribution distance between two regions from two different images, the two color sets are different. So we rewrite the expression of the quadratic distance to discard the binwise differences of the distributions. Let us consider x the extension of distribution X over the entire color space and y the extension of Y. The extension consists in setting the bin values to zero for colors which are not color shades, so we have d_q(x, y) = d_q(X, Y).
d_q(x, y)² = (x − y)ᵀ A (x − y)
           = xᵀAx − xᵀAy − yᵀAx + yᵀAy
           = xᵀAx + yᵀAy − 2xᵀAy
           = Σ_{i=1}^{N} Σ_{j=1}^{N} x_i x_j a_ij + Σ_{i=1}^{N} Σ_{j=1}^{N} y_i y_j a_ij − 2 Σ_{i=1}^{N} Σ_{j=1}^{N} x_i y_j a_ij     (5)
Then we finally have the following expression of the quadratic distance used to compare two color shade distributions X and Y evaluated on any color sets:

d_q(X, Y)² = Σ_{i=1}^{ncsX} Σ_{j=1}^{ncsX} p^X_i p^X_j a_{c^X_i c^X_j} + Σ_{i=1}^{ncsY} Σ_{j=1}^{ncsY} p^Y_i p^Y_j a_{c^Y_i c^Y_j} − 2 Σ_{i=1}^{ncsX} Σ_{j=1}^{ncsY} p^X_i p^Y_j a_{c^X_i c^Y_j}     (6)
5
Tests
Our system was tested on IDS database provided by courtesy of Images Du Sud Photo Stock company. It contains 2500 generic images of flowers, portraits, landscapes, seascapes, architecture, people, fruit, gardens. Images size are between 400x400 and 600x600 pixels.
6 Results

6.1 Region Detection
A few segmented images are presented in figure (3). More examples can be seen at http://www-rocq.inria.fr/~fauqueur/ADCS/ . Images for which an obvious segmentation could be decided are correctly segmented. More generally, images in the database are complex natural scenes, and the extracted regions present a coherent color diversity. The coarse segmentation proves its ability to integrate within regions areas formed by many shades of the same hue, strong textures, and isolated spatial details, which make their specificity. 15248 regions were automatically extracted from the 2483 images (an average of 6 regions per image). Segmenting an image took an average of 5.6s. Discarded regions (shown as small grey regions in the examples) represent a very small percentage of the image surfaces. For comparison, a 6*6*6=216 uniform subsampling of the Luv color space was also tested to compute the local color distributions: the resulting regions were inaccurate and the histogram vectors were too long to classify.

6.2 Region Description
The top-right parts of figures (1) and (2) illustrate the fine granularity of the color shade representation and its fidelity to the original colors. In figure (3), the segmented images show the detected regions, followed by the corresponding images
Fig. 1. Color description of the lavender region: with a classic 216 bin Luv distribution (left) and with the ADCS descriptor (top right). Because of the strong subsampling into 216 bins, wrong colors appear in the classic Luv distribution: blue shades rather than purple ones. The ADCS descriptor represents the purple color shades accurately and yields a more compact descriptor. Note that the color bins in the ADCS distribution have no specific order.
Fig. 2. Color description of the sky region: with a classic 216 bin Luv distribution (left) and with the ADCS descriptor (top right). Both distributions represent real colors, but the ADCS has a finer dynamic of blue shades while remaining a more compact descriptor.
Fig. 3. First row: original images. Second row: images of regions with their mean color. Third row: images of regions with the color shades used for indexing. Non-indexed regions are shown with random color pixels.
formed by each region's color shades used for its description. More examples of such images can be seen at http://www-rocq.inria.fr/~fauqueur/ADCS/ . The global appearance of these quantized images shows the precision of the ADCS region color descriptor. A total of 261219 color shades from the Luv space were used to index the 15248 regions (an average of 17 colors per region); 168912 of these colors were unique (to be compared to the couple of hundred fixed bins in a classic histogram). Extracting an ADCS index from a region took around 0.5s. Since an average of 17 colors is used to represent a region, we can estimate the number of bytes needed to store an ADCS index: for one region, it contains the number of color shades, the list of color shades (as Luv triples) and the population of each shade, i.e., 1 + 17 ∗ (3 + 1) = 69 bytes. This makes an ADCS index around three times more compact than a classic color histogram.

6.3 Retrieval
Region queries are performed by exhaustive comparison with the 15248 regions; the average query time is 1.3 s. Hundreds of region queries in our system consistently returned regions with a perceptually similar color distribution, for various kinds of regions: uniform or textured, containing different hues. Regions described by many color shades returned regions with many color shades, and conversely for single-colored regions. We observed that the number of color shades itself carries exploitable information about the color diversity of a region. Screenshots in figures (4) and (5) show the result of a query on a lavender region. The ADCS descriptor is used in figure (4) and, in figure (5), a classic 216-bin Luv
Fig. 4. Retrieval from top-left lavender region using ADCS.
Fig. 5. Retrieval from top-left lavender region using classic 216 bin Luv histogram.
histogram matched with the L1 distance. We can observe that the classic histogram did not return top-ranked regions with colors as similar to the query as those retrieved with color shades.
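The exhaustive matching step described above can be sketched as follows. The actual ADCS region distance is not specified in this section, so a placeholder L1-style comparison over matching color shades stands in for it; the descriptor contents (Luv triples and populations) are illustrative assumptions, not data from the paper.

```python
# Sketch of exhaustive region retrieval over ADCS-like descriptors.
# A descriptor maps each color shade (a Luv triple) to its population.
# NOTE: the distance below is a placeholder assumption, not the paper's
# actual ADCS similarity measure.
def placeholder_dist(d1, d2):
    """L1-style distance over the union of the two shade sets."""
    shades = set(d1) | set(d2)
    return sum(abs(d1.get(s, 0.0) - d2.get(s, 0.0)) for s in shades)

def region_query(query, index, top=3):
    """Rank all indexed regions by distance to the query descriptor."""
    return sorted(index, key=lambda name: placeholder_dist(query, index[name]))[:top]

# Hypothetical indexed regions (names and values invented for illustration).
index = {
    "lavender_1": {(60, 20, -40): 0.7, (55, 25, -35): 0.3},
    "sky_1":      {(80, -5, -20): 0.9, (75, 0, -15): 0.1},
    "lavender_2": {(60, 20, -40): 0.6, (58, 22, -38): 0.4},
}
query = {(60, 20, -40): 0.65, (55, 25, -35): 0.35}
print(region_query(query, index, top=2))  # -> ['lavender_1', 'lavender_2']
```

As in the system described above, every indexed region is compared; with an average of 17 shades per region, each comparison stays cheap even though the scan is exhaustive.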
7 Conclusions
The key idea is to detect visually specific regions of interest and match them with a fine descriptor to improve retrieval results. We presented a scheme combining coarse automatic image segmentation with fine color description to perform region-based queries in a generic image database. The novel segmentation scheme detects regions that are both potential regions of interest for the user (they are visually salient in the image) and specific with respect to one another in the database (they encompass a visual "homogeneous diversity"). The new ADCS signature represents region color variability more accurately than existing descriptors.
Fast Approximate Nearest-Neighbor Queries in Metric Feature Spaces by Buoy Indexing

Stephan Volmer

Fraunhofer Institute for Computer Graphics, Fraunhoferstr. 5, 64283 Darmstadt, Germany
[email protected]

Abstract. An indexing scheme for solving the problem of nearest-neighbor queries in generic metric feature spaces for content-based retrieval is proposed, aiming to break the "dimensionality curse". The basis of the proposed method is the partitioning of the feature dataset into a fixed number of clusters, each represented by a single buoy. Upon submission of a query request, only a small number of clusters whose buoys are close to the query object are considered for the approximate query result, effectively cutting down the amount of data to be processed. Results from extensive experimentation concerning retrieval accuracy are given. The influence of control parameters is investigated with respect to the tradeoff between retrieval accuracy and query execution time.
1 Introduction
Interest in digital multimedia has increased enormously over the last few years with the evolution of today's information and communication technologies. Users exploit the opportunities offered by the ability to access and manipulate remotely stored multimedia objects (e.g. text, images, audio, and video) in all imaginable ways. This has fuelled the emergence of large multimedia repositories. Finding a multimedia object whose content is truly relevant to the user's need has become the focal point of recent research in multimedia information technology. Large repositories cannot be meaningfully queried in the classical sense, because it is very difficult to structure the information contained in multimedia objects as alphanumeric keys or records (either manually or computationally) for traditional relational databases. The concept of searching for information at a semantic level by matching alphanumeric strings no longer applies to multimedia objects, because they are abstract representations of sensorial data at a syntactic level. In most multimedia applications, queries are commonly formulated by asking for objects that are similar to a given one [15]. The concept of similarity poses severe problems because sensorial data is encoded differently than humans perceive it. Content information must be abstracted and translated into an encoding that can be compared. Problems with traditional methods have led to the rise of techniques for retrieving multimedia objects on the basis of content descriptors, a technology now generally referred to as content-based retrieval (CBR). CBR systems employ unsupervised algorithms that analyze the raw digital data representations of multimedia objects. This analysis results in compact content descriptors that convey specific aspects of the object's most salient features.
The similarity between two objects is then determined by some well-defined similarity measure between their associated content descriptors.

S.-K. Chang, Z. Chen, and S.-Y. Lee (Eds.): VISUAL 2002, LNCS 2314, pp. 36–49, 2002. © Springer-Verlag Berlin Heidelberg 2002
Upon presentation of a query, the content descriptor of the query object is compared with the descriptors in the database. A linear search through all descriptors contained in the database would be very time-consuming and inefficient. Therefore, an indexing scheme becomes necessary in order to limit the number of potential target descriptors from the database and to reduce the computational effort of sequentially determining their similarity to the query descriptor. This task is generally referred to as similarity indexing [14]. The goal of similarity indexing is to reduce the amount of data to be processed by categorizing or grouping similar objects together.
2 Preliminaries
This paper focuses on providing a general-purpose spatial indexing scheme applicable to any content descriptors derived from digital representations of media objects that comply with the postulates of the metric feature model. Such content descriptors are compact in nature and conserve the most salient features and properties of the media object's content without accounting for any kind of knowledge, interpretation, or reasoning. The metric feature model is based on the assumption that human similarity perception corresponds approximately to the measurement of an appropriate metric distance between content descriptors.

2.1 Metric Feature Model
Let ∆ be a feature extraction algorithm that transforms digital representations of media objects M into content feature descriptors ω:

    ∆ : M → ω    (1)
Let the feature domain Ω denote the universe of all feature descriptors that can be generated by ∆. Depending on the specific characteristics of ∆, Ω can be finite or infinite, and discrete or continuous. Then (Ω, δ) is called a generic feature space, where

    δ : Ω × Ω → IR₀⁺    (2)

is a metric on Ω called the dissimilarity measure. The metric δ must satisfy the properties

    (i)   δ(ωi, ωi) = 0
    (ii)  δ(ωi, ωj) = δ(ωj, ωi)    (3)
    (iii) δ(ωi, ωj) + δ(ωj, ωl) ≥ δ(ωi, ωl)

for all ωi, ωj, ωl ∈ Ω. This common framework includes the definition of ubiquitous d-dimensional feature vector spaces (Ω ≡ IRᵈ), but is not necessarily limited to them. A finite subset

    S = { ω1, ω2, . . . , ωN } ⊆ Ω    (4)

of the feature domain is called the feature dataset, whose elements are the feature descriptors extracted from a set of |S| = N media objects.
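Since any descriptor type may be plugged into (Ω, δ), it can be useful to sanity-check a candidate dissimilarity measure against properties (3) by brute force. The sketch below does this on a small dataset; the Euclidean distance and tuple descriptors are stand-in assumptions, not a prescription of the paper.

```python
# Sketch: brute-force check of the metric properties (3) of a dissimilarity
# measure delta on a small feature dataset. Euclidean distance on tuples
# stands in for a generic (Omega, delta); both are assumptions.
import math
import itertools

def delta(a, b):
    """Euclidean distance as an example dissimilarity measure on R^d."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_metric(dataset, dist):
    """Check identity, symmetry, and the triangle inequality on all triples."""
    eps = 1e-12
    for w in dataset:
        if dist(w, w) > eps:                                 # (i) delta(w, w) = 0
            return False
    for wi, wj in itertools.combinations(dataset, 2):
        if abs(dist(wi, wj) - dist(wj, wi)) > eps:           # (ii) symmetry
            return False
    for wi, wj, wl in itertools.permutations(dataset, 3):
        if dist(wi, wj) + dist(wj, wl) < dist(wi, wl) - eps:  # (iii) triangle
            return False
    return True

S = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
print(is_metric(S, delta))  # -> True
```

The check is O(N³), so it only serves as a quick test on a sample of the dataset, not as part of the index itself.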
2.2 K-Nearest Neighbor Queries
By far the most common query to a CBR system is a request like "Find the K most similar objects to the query example!" Such a request can be formulated as a K-nearest neighbor query (K-NN query) in metric space: given a query object ωQ ∈ Ω and an integer K ≥ 1, the K-NN query SNN(ωQ, K) selects the K elements of the feature dataset S that have the smallest distance from ωQ, i.e. with the properties

    (i)   SNN(ωQ, K) ⊂ S
    (ii)  |SNN(ωQ, K)| = K    (5)
    (iii) ∀ ω ∈ SNN(ωQ, K): ∄ ω′ ∈ S \ SNN(ωQ, K) with δ(ωQ, ω′) < δ(ωQ, ω)
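A literal linear-scan implementation of SNN(ωQ, K) follows directly from properties (5); this is the inefficient baseline that the indexing scheme of Sect. 4 is designed to avoid. The toy descriptors and Euclidean metric below are assumptions for illustration.

```python
# Sketch: a linear-scan K-NN query S_NN(omega_Q, K) over a feature dataset,
# per properties (5). Any (Omega, delta) satisfying Sect. 2.1 would do; the
# Euclidean metric on tuples is only a placeholder.
import math

def delta(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_query(S, omega_q, K):
    """Return the K elements of S with the smallest distance to omega_q."""
    ranked = sorted(S, key=lambda w: delta(omega_q, w))
    return ranked[:K]

S = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
print(knn_query(S, (0.0, 0.0), 2))  # -> [(0.0, 0.0), (0.5, 0.0)]
```

Every query costs N distance evaluations plus a sort; the buoy index proposed later trades a small loss of exactness for a large reduction of this cost.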
3 State-of-the-Art
The history of recent research on similarity indexing techniques can be traced back to the mid-1970s, when hierarchical tree structures (e.g. the k-d tree) for indexing multidimensional vector spaces were first introduced. In 1984, Guttman proposed the R-tree indexing structure [7], which was the basis for the development of many other variants. In the following years, Sellis et al. proposed the R⁺-tree [10], and Beckmann et al. proposed the best dynamic R-tree variant, the R∗-tree [2]. A very extensive review and comparison of various spatial indexing techniques for feature vector spaces can be found in [14]. Motivated by the k-d tree and the R-tree, the authors of [14] proposed the VAM k-d tree and the VAMSplit R-tree. Experimentally, they found that the VAMSplit R-tree provided the best performance; however, this came at the cost of losing the dynamic nature of the R-tree. Common to all of the cited research is the idea that feature descriptors are stored at the leaf level of a hierarchical index tree structure. Each leaf corresponds to a partition of the feature space and each node to a convex subspace spanning the partitions created by its children. Tree branches that do not meet certain distance requirements are pruned during a similarity query in order to reduce the search space. The main problem of this approach is that it requires evaluating the distance from the query point to the arbitrarily shaped convex subspace represented by the node being examined. The most common way of simplifying this problem is to split partitions along hyperplanes orthogonal to a coordinate axis, ultimately creating hyper-rectangular partitions whose sides are aligned parallel to the feature space's axes. Most hierarchical spatial indexing methods work satisfactorily in lower dimensions, but suffer from the dimensionality curse [9] when applied to feature vectors in medium- or high-dimensional feature spaces (d > 20). The dimensionality curse is closely related to the distribution of the dissimilarity measures between the feature dataset and the query object. If the variance of the dissimilarities for a given query object is low, then conducting an indexed K-NN query becomes a difficult task. A way to
obviate this situation is to conduct queries that produce an approximate solution to the K-NN query problem [1]. In recent research, there have been many attempts to get a grip on the problem of the dimensionality curse. One of them is the reduction of the dimensionality of the underlying feature domain with a principal component analysis (PCA) or its variants. In [8], Ng and Sedighian followed this approach to reduce the dimensionality, and in [6] Faloutsos and Lin proposed a fast approximation of the Karhunen-Loève transform (KLT) to perform the dimension reduction. However, even though experimental results from their research showed that some real feature datasets could be considerably reduced in dimension without significant degradation in retrieval quality, the queries become proportionally less accurate with the loss of dimensions. The biggest shortcoming of the techniques mentioned above is that they are applicable to vector spaces only, that is, each descriptor must be suitably represented by a vector of fixed dimensionality. This paper, however, aims at the rather more general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Therefore, the indexing scheme has to rely solely on the distance relationships among the objects of the dataset, without any information about its topology. Two different ways have been pursued in recent research to solve the problem of similarity indexing for pure metric spaces. The first approach consists in mapping the metric space into a vector space. In [12], for each object in metric space, its distance to a set of d predetermined so-called vantage objects is calculated. The vector of these distances specifies a point in the d-dimensional vantage space. The selection of vantage objects, i.e. their number as well as their location in metric space, is critical for this approach.
In [12], an approach is described that attempts to constitute a set of vantage objects that spreads well in metric space. However, the central problems of this mapping technique remain: how well can a metric space be transformed into a vector space, and how many dimensions are required for the target vector space? These are difficult questions that have not been satisfactorily answered. The second approach involves the generation of hierarchical tree structures similar to the ones used for vector spaces. The M-tree presented in [5], as well as the vantage point tree (VPT) presented in [4], are examples of this approach. These tree structures partition the metric space recursively into smaller subspaces that all have the shape of regular hyper-spheres. Each partition is represented by its centroid and its corresponding covering radius. At query time, the query is compared against all the representatives of a node, and the search algorithm enters recursively into all those that cannot be discarded using the covering-radius criterion. There are many proposed variations of this approach in the literature; most of them differ in how the centroids of partitions are selected and in how partitions are split. In the following section, an indexing scheme is proposed that follows some of the basic ideas of these approaches, but focuses strongly on a pragmatic solution that delivers reasonable performance in conjunction with a relational database.
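The vantage-object mapping can be sketched as follows: each object is represented by the vector of its distances to d fixed vantage objects. To emphasize that no vector structure is required of the original space, the example uses strings under the Levenshtein (edit) distance; the particular vantage objects are arbitrary choices for illustration, not the selection strategy of [12].

```python
# Sketch: mapping a metric space (strings, edit distance) into a
# d-dimensional vantage space. Vantage objects are assumed/arbitrary here.
def edit_distance(s, t):
    """Levenshtein distance, a metric on strings (dynamic programming)."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                 # deletion
                         cur[j - 1] + 1,              # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitution
        prev = cur
    return prev[n]

def vantage_embed(obj, vantage_objects, dist):
    """Map a metric-space object to a point in d-dimensional vantage space."""
    return tuple(dist(obj, v) for v in vantage_objects)

vantage = ["cat", "house"]  # d = 2 predetermined vantage objects (assumed)
print(vantage_embed("cart", vantage, edit_distance))  # -> (1, 5)
```

By the triangle inequality, the L∞ distance between two vantage vectors lower-bounds the original metric distance, which is what makes such embeddings usable for filtering candidates.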
4 Buoy Indexing
The proposed indexing scheme is based on the idea that the feature dataset is decomposed into disjoint non-empty partitions of arbitrary convex shape. Each partition is not
Fig. 1. Schematic diagram of partitioning a dataset of 44 descriptors into 3 clusters in 2-dimensional vector space. Cluster buoys are depicted by black discs with white marker symbols; their associated descriptors by marker symbols of the same shape. Lines representing equidistant points between cluster buoys denote virtual cluster borders. Concentric circles around the cluster buoys illustrate the covering hyper-spheres of the clusters; an arrow marks each radius, from a buoy to its most distant member descriptor. It should be noted that the covering hyper-spheres do not represent the real shape of the cluster extensions.
represented by a complex description of its extension or its boundaries in the feature domain, but rather by a single prototype element that is itself an element of the feature domain. The prototype element serves as a buoy in feature space for its associated partition. Ideally, the partitions should be distributed in feature space so that they cover the dataset well. Each partition should have approximately the same number of feature descriptors as members, and the number of partitions should be an order of magnitude smaller than the number of feature descriptors in the dataset. The membership of an element of the feature dataset in a specific partition is determined solely by its metric distances to the buoys placed in feature space: a feature descriptor belongs exclusively to the partition with the closest associated buoy. A partition is represented not only by its associated buoy, but also by its covering hyper-sphere. The covering hyper-sphere is sufficiently identified by a single-valued parameter: the maximum distance from the partition's associated buoy to its most distant member descriptor. Fig. 1 shows the principle of the described partitioning method.

4.1 Index Generation
In general, the task of partitioning a particular feature dataset S into k disjoint non-empty subsets S1, S2, . . . , Sk (hereafter called clusters) with the following properties

    (i)   ⋃ᵢ₌₁ᵏ Si = S
    (ii)  Si ≠ ∅,  ∀ 1 ≤ i ≤ k    (6)
    (iii) Si ∩ Sj = ∅,  ∀ 1 ≤ i, j ≤ k, i ≠ j
is performed by any k-clustering algorithm, where the total number of clusters k is assumed to be selected a priori as a constant. Each descriptor of the feature dataset belongs to exactly one cluster (crisp membership). By far the most common type of k-clustering algorithm is the optimization algorithm. The optimization algorithm defines a cost criterion

    c : {S1, S2, . . . , Sk} → IR₀⁺    (7)

which associates a non-negative cost with each cluster. The goal of the optimization algorithm is then to minimize the global cost

    c(S) = Σ_{i=1..k} c(Si)    (8)

for a given feature dataset. If each cluster Si is represented by a buoy ω̂i that is an element of the feature domain Ω itself, then the cost criterion of a cluster can be defined as

    c(Si) = Σ_{m=1..|Si|} δ(ω̂i, ωim)    (9)

where ωim is the mth element of Si, and |Si| is the number of elements in Si. Commonly, the centroid of the cluster would be chosen as the buoy ω̂i (the k-means clustering algorithm [3]). However, since many types of dataset do not belong to feature spaces in which the mean is defined¹, a different type of buoy must be chosen in order to remain generically applicable to the metric feature space model defined in Sect. 2.1. Consequently, the median of each cluster is selected as its representative buoy (the k-medians clustering algorithm). Note that ω̂i ∈ Si ⊂ S ⊂ Ω and that ω̂i is chosen to minimize the cost c(Si) of the cluster itself. The classic implementation of the optimization problem is an algorithm that tries to minimize (8) iteratively. The algorithm terminates if c(S) remains constant for two consecutive iterations. The result is a local minimum of the optimization problem. Techniques like simulated annealing can be employed to further improve the result.

Additional Constraints. The pure k-medians clustering algorithm produces clusters with sizes 1 ≤ |Si| ≤ N − k + 1. In order to support the development of clusters of approximately the same size, an additional constraint on the cluster size

    Smin ≤ |Si| ≤ Smax    (10)

has to be imposed on the algorithm during every iteration, where Smin and Smax are empirically selected thresholds for the minimum and maximum accepted cluster sizes respectively. If any cluster's size exceeds Smax, the cluster is randomly split into two equally sized clusters and the smallest existing cluster is deleted. If any cluster's size still falls below Smin, the cluster is deleted and the largest existing cluster is randomly split into two equally sized clusters. The member descriptors of deleted clusters are immediately reassigned to the clusters with the closest associated buoys. A high-level description of the constrained optimization algorithm is shown in Fig. 2. The selection of Smin and Smax directly impacts the convergence of the constrained optimization algorithm. The expected value of the average cluster size is

    |Si| = N / k    (11)

¹ The mean of two elements of the feature domain is required to be an element of the feature domain itself; this is not always the case for feature spaces that are not vector spaces.
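The iterative k-medians optimization with size constraints can be sketched as follows, under simplifying assumptions: 1-D feature descriptors with absolute difference as the metric δ, and a crude version of the constraints (10) in which oversized clusters are split randomly and undersized ones are dissolved with their members reassigned (so the cluster count may drift from k, unlike the paper's variant, which deletes a cluster for every split). This is an illustration, not the author's implementation.

```python
# Sketch of a constrained k-medians iteration: assign descriptors to the
# closest buoy, enforce rough size constraints (10), re-select each buoy as
# the member minimizing the cluster cost (9), and stop when the global cost
# (8) stops decreasing. Data and metric are placeholder assumptions.
import random

def k_medians(S, delta, k, s_min, s_max, seed=0):
    rng = random.Random(seed)
    buoys = rng.sample(S, k)
    prev_cost = None
    while True:
        # crisp assignment: each descriptor joins its closest buoy
        clusters = [[] for _ in buoys]
        for w in S:
            i = min(range(len(buoys)), key=lambda j: delta(buoys[j], w))
            clusters[i].append(w)
        clusters = [c for c in clusters if c]
        # rough size constraints (10): split oversized, dissolve undersized
        kept, orphans = [], []
        for c in clusters:
            if len(c) > s_max:
                rng.shuffle(c)
                kept.extend([c[:len(c) // 2], c[len(c) // 2:]])
            elif len(c) < s_min:
                orphans.extend(c)
            else:
                kept.append(c)
        clusters = kept
        # re-select each buoy as the cost-minimizing member of its cluster
        buoys = [min(c, key=lambda b: sum(delta(b, w) for w in c))
                 for c in clusters]
        for w in orphans:  # reassign members of dissolved clusters
            i = min(range(len(buoys)), key=lambda j: delta(buoys[j], w))
            clusters[i].append(w)
        # terminate when the global cost (8) remains constant
        cost = sum(delta(b, w) for b, c in zip(buoys, clusters) for w in c)
        if cost == prev_cost:
            return buoys, clusters
        prev_cost = cost

S = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
buoys, clusters = k_medians(S, lambda a, b: abs(a - b), k=2, s_min=1, s_max=5)
print(sorted(buoys))  # -> [2.0, 11.0]
```

As in the text, the result is only a local minimum: the outcome depends on the random initialization, and each buoy is always an element of its own cluster, so the scheme never leaves the metric feature domain.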
Obviously, the size constraints should be selected in a way that 1 ≤ Smin