editor
Andrea F. Abate
Michele Nappi
Proceedings of the Workshop on
Monica Sebillo
mdic
2004
Multimedia Databases and Image Communication
Series on Software Engineering and Knowledge Engineering
Erratum
Proceedings of the Workshop on mdic 2004
Multimedia Databases and Image Communication
This title is Vol. 17 of the World Scientific Series on Software Engineering and Knowledge Engineering. On the front cover, the volume number should be 17, not 15.
Multimedia Databases and Image Communication
SERIES ON SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING
Series Editor-in-Chief: S. K. CHANG (University of Pittsburgh, USA)
Vol. 1
Knowledge-Based Software Development for Real-Time Distributed Systems Jeffrey J.-P. Tsai and Thomas J. Weigert (Univ. Illinois at Chicago)
Vol. 2
Advances in Software Engineering and Knowledge Engineering edited by Vincenzo Ambriola (Univ. Pisa) and Genoveffa Tortora (Univ. Salerno)
Vol. 3
The Impact of CASE Technology on Software Processes edited by Daniel E. Cooke (Univ. Texas)
Vol. 4
Software Engineering and Knowledge Engineering: Trends for the Next Decade edited by W. D. Hurley (Univ. Pittsburgh)
Vol. 5
Intelligent Image Database Systems edited by S. K. Chang (Univ. Pittsburgh), E. Jungert (Swedish Defence Res. Establishment) and G. Tortora (Univ. Salerno)
Vol. 6
Object-Oriented Software: Design and Maintenance edited by Luiz F. Capretz and Miriam A. M. Capretz (Univ. Aizu, Japan)
Vol. 7
Software Visualisation edited by P. Eades (Univ. Newcastle) and K. Zhang (Macquarie Univ.)
Vol. 8
Image Databases and Multi-Media Search edited by Arnold W. M. Smeulders (Univ. Amsterdam) and Ramesh Jain (Univ. California)
Vol. 9
Advances in Distributed Multimedia Systems edited by S. K. Chang, T. F. Znati (Univ. Pittsburgh) and S. T. Vuong (Univ. British Columbia)
Vol. 10
Hybrid Parallel Execution Model for Logic-Based Specification Languages Jeffrey J.-P. Tsai and Bing Li (Univ. Illinois at Chicago)
Vol. 11
Graph Drawing and Applications for Software and Knowledge Engineers Kozo Sugiyama (Japan Adv. Inst. Science and Technology)
Vol. 12
Lecture Notes on Empirical Software Engineering edited by N. Juristo & A. M. Moreno (Universidad Politécnica de Madrid, Spain)
Vol. 13
Data Structures and Algorithms edited by S. K. Chang (Univ. Pittsburgh, USA)
Vol. 14
Acquisition of Software Engineering Knowledge. SWEEP: An Automatic Programming System Based on Genetic Programming and Cultural Algorithms edited by George S. Cowan and Robert G. Reynolds (Wayne State Univ.)
Vol. 15
Image: E-Learning, Understanding, Information Retrieval and Medical. Proceedings of the First International Workshop edited by S. Vitulano (Università di Cagliari, Italy)
Vol. 16
Machine Learning Applications in Software Engineering edited by Du Zhang (California State Univ.) and Jeffrey J. P. Tsai (Univ. Illinois at Chicago)
Vol. 17
Multimedia Databases and Image Communication: Proceedings of the Workshop on MDIC 2004 edited by A. F. Abate, M. Nappi & M. Sebillo (Università di Salerno)
Proceedings of the Workshop on
mdic
2004
Multimedia Databases and Image Communication
Salerno, Italy
22 June 2004
editors
Andrea F. Abate, Michele Nappi, Monica Sebillo
Università di Salerno, Italy
Sponsors:
DMI (Dipartimento di Matematica e Informatica)
Brand Innova
Composite media items are represented by thick rectangles enclosing the component items.
Synchronization relationships label the arrows connecting the involved media items.
Figure 1. The graphical symbols used to represent synchronized multimedia documents.
Context constraints

Context constraints describe how media items are selected for building an instance of a multimodal document. Due to context variants an item can be defined as mandatory or optional, and can be context-independent, context-dependent or context-selectable; these terms will be discussed in Section 3.

A multimodal database is a collection of virtual documents which are made of virtual components, i.e., composite and atomic items, according to a structure independent from the context. Each virtual component is a collection of instances, which are elements with which context constraints are associated. The instantiation of a virtual document into a concrete document consists in the identification of the document components and, for each of them, in the selection of the proper instances compatible with the given context. Figure 2 illustrates five versions of a document presenting a meteo forecast. Each version is contextualized in content (short/long descriptions, large/small images), media (video animations, images, audio, text), user device (a desktop computer, a PDA, a cellular phone), and situation (silence). A virtual multimodal document collects into a unique structure all the variants with the associated context information (Figure 3).
(1) A multimodal document with full audio and animated video, suited for a desktop computer; the satellite image animation loops as long as the audio describes it; when the audio ends, a forecast map is displayed and described by another audio comment. (2) In this document the audio comment is substituted by a text, making the presentation suitable for silent environments; the duration of the text is set by the user, who reads it and manually advances to the second part; the forecast is presented as in case 1. (3) The PDA version of the document replaces the satellite animation with a small image; the forecast map is also a small image. As in case 2, the user controls timing by advancing manually from the first part to the second.
(4) An audio-only document to be delivered to a cellular phone.
(5) A text-only document for delivery on a cellular phone as a sequence of short messages (SMS). The user advances from one message to the next.

Figure 2. A meteo forecast document for different contexts: a description of the meteorological situation is followed by the forecast. The presentation structure is constant, but different media and different synchronization schemas are used in different contexts.
Figure 3. A virtual document containing cases 1-5 of Figure 2. Composites collect multiple elements for different contexts; the numbers in the upper right corner of the composites refer to the cases of Figure 2. The instances are selected by interpreting the context information associated to them, represented by circles in the figure.
A multimodal database is also accompanied by the definition of the context in which it operates. A database context is a set of features {f1, f2, ..., fn}, each feature describing a property of a device, of the user, of the environment, etc., that can affect the document instantiation. A document context is a collection of pairs (f, v) where f is a context feature and v is an instance of that feature (a value), e.g., (displaysize, large) or (video, no). Contexts are stored in the database as usual relational tables. We do not elaborate further here, noting however that contexts are structured along classes of features, possibly hierarchically defined, which belong to different domains like the user profile, the user device, and so on.

3. Context-awareness in document modelling

A component of a virtual document can be mandatory or optional according to its role in the document semantics. A mandatory component must always be present in the document for any context; if it is missing, the document is incomplete and cannot be delivered. An optional component can be present or not, i.e., it can be delivered or not, according to its compatibility with the specific context attributes during the concrete document instantiation.

The properties of a mandatory component are the following: for each value or set of values of a context attribute an instance must be defined; conversely, an instance is associated to a list of context attributes and, for each attribute, to a list of values which define the compatible contexts. At instantiation time, if an appropriate context is not available, instantiation is not possible and the virtual document to which it belongs cannot be delivered.
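To make the feature/value model and the mandatory-component check concrete, a minimal sketch follows. All names (`compatible`, `select_instances`) and the example data are ours, not the paper's; the paper stores contexts in relational tables, while here a context is simply a dictionary of feature/value pairs.

```python
# Illustrative sketch of context-based instance selection: a document context
# is a set of (feature, value) pairs; each instance lists, per feature, the
# values it is compatible with. All names here are invented for illustration.

def compatible(constraints, context):
    """An instance is compatible when, for every feature it constrains,
    the context's value is among the admissible ones."""
    return all(context.get(f) in values for f, values in constraints.items())

def select_instances(component, context):
    """Return the instances of a virtual component usable in `context`.
    For a mandatory component, an empty result means instantiation fails."""
    return [inst for inst in component["instances"]
            if compatible(inst["constraints"], context)]

# Example loosely modelled on the meteo forecast of Figure 2.
forecast = {
    "mandatory": True,
    "instances": [
        {"name": "video+audio", "constraints": {"device": {"desktop"},
                                                "audio": {"yes"}}},
        {"name": "video+text",  "constraints": {"device": {"desktop"},
                                                "audio": {"no"}}},
        {"name": "sms",         "constraints": {"device": {"phone"}}},
    ],
}
ctx = {"device": "desktop", "audio": "no"}   # silent environment (case 2)
```

Here `select_instances(forecast, ctx)` keeps only the text variant, mirroring how case 2 of Figure 2 is selected for a silent desktop context.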
An optional component can be instantiated (hence delivered) or not without preventing the document from being understandable and useful; the semantics and correctness of the document are not affected by the component presence. The delivery can be a system choice, according to context compatibility, but also a user choice according to a selection made at document request time; delivery is in any case bound to the availability of an instance compatible with the current context.

A context-independent component exists in only one instance, which is compatible in principle with all contexts, even if it cannot be delivered on some channels (e.g., an audio device cannot play a written text, but a written text can be displayed in the same way independently from other types of context). At instantiation time the system does not perform any selection but simply picks up the item or not depending on the available channel.

A component is context-dependent when it exists in several instances, one for each context or list of contexts; in some contexts the content can be void.^a For example, a video-only component could be void in a mobile phone context, while in a desktop or PDA context it could exist in several resolutions and frame speeds. The selection is performed by the system.

A context-selectable component exists in several instances, possibly more than one for some contexts: for example, a text at several degrees of detail, an image at several resolutions, or alternative spoken/written versions of the same text. The selection is partly performed by the system (by picking only the instances compatible with the current context), and partly by the user, who selects among the alternatives the one best suited to his/her taste. An intelligent agent approach could give support here, but we do not elaborate on this issue.

A concrete document is an instance of a virtual document for a specific context or combination of contexts.

4. Multimodal document instantiation

As we have seen in Section 2, the instantiation of a concrete document from a virtual one consists in the identification of the relevant virtual components and, for each of them, in the selection of the instances according to the context. The identification of the components is trivial if the whole document has to be delivered, since they are listed in the document static structure. However, the delivery of a whole document is not the most frequent case, due to the limitations of some devices or communication channels, and to the context variations in some environments like the mobile ones, where the user situation can change during a session. As an example, a user moving in a museum with a portable device could ask for a detailed description of an artwork, made of text, audio and images compatible with the device, which are better delivered in chunks; for each chunk the resource availability is checked, and the user is asked to confirm before receiving further information, in order to avoid being stuck in a long download. Therefore, delivering a document generally requires splitting it into parts (modules, sections, etc.) which are instantiated and delivered separately. Each part is identified by a main component, which is the item that starts the document playback. The identification of the relevant items of each part is in fact a retrieval task: the database is queried for resolving the context dependencies, based on information extracted by the analysis of the static and the dynamic relationships among the virtual document components. For space reasons we do not discuss such issues here, referring the reader to a previous work by the same authors.^5

^a A void instance is different from an optional component, since optionality refers to a semantic role in the document, while the instance is related to the possibility of delivering an item to the user within the current contexts.
In database terms the virtual document instantiation is a view-building operation that requires several steps:

• identify the set of components needed for delivering a document section; i.e., given the main component of the section, identify the media items bound to it by a synchronization relationship, in order to build up a coherent and complete segment for the user;
• for each mandatory virtual component, check that one or more instances exist for the specified context; if some of them do not exist the instantiation fails;
• process the optional virtual components only if at least one concrete instance exists for the specified context;
• select the set of concrete documents compatible with the specified context;
• if context-selectable instances exist, build appropriate combinations by asking the user or through some intelligent assistant, heuristic, or other means;^b
• deliver the concrete document to the user according to the defined dynamics and synchronization constraints.

^b The details are not relevant since the instances are assumed functionally equivalent for that context.

5. Conclusion
We have presented a context-aware framework for designing multimodal documents adaptable to different user and resource situations. Context-awareness results in the definition (at design time) and selection (at delivery time) of a set of concrete document components, according to a multimodal document model which associates context information to components. Among the issues that deserve further investigation, the consistency of a complex document under different context conditions is of paramount importance. For example, in two different virtual components, each containing several context-selectable instances, only specific combinations of instances could be meaningful for the user, who should not be involved in explicit selection operations. Consistency can be approached by attaching rules to the components (both virtual and concrete) that describe mutual consistency relationships, much as referential integrity in a traditional database defines mutual constraints between database records.

References

1. J. F. Allen. Maintaining knowledge about temporal intervals. Comm. ACM, 26(11):832-843, November 1983.
2. E. Bertino and E. Ferrari. Temporal Synchronization Models for Multimedia Data. IEEE Transactions on Knowledge and Data Engineering, 10(4):612-631, July/August 1998.
3. M. M. Blattner and E. P. Glinert. Multimodal integration. IEEE Multimedia, 3(4):14-24, 1996.
4. A. Celentano and O. Gaggi. Template-based generation of multimedia presentations. International Journal of Software Engineering and Knowledge Engineering, 13(4):419-445, 2003.
5. A. Celentano, O. Gaggi, and M. L. Sapino. Retrieving Consistent Multimedia Presentation Fragments. In Workshop on Multimedia Information Systems (MIS 2002), pages 146-154, Tempe, Arizona, USA, November 2002.
6. G. Chen and D. Kotz. A survey of context-aware mobile computing. Technical Report TR2000-381, Dartmouth College, Department of Computer Science, 2000.
7. M. Delato, A. Martelli, M. Martelli, V. Mascardi, and A. Verri. A multimedia, multichannel and personalized news provider. In G. Ventre and R. Canonico, editors, Proc. of the First Int. Workshop on Multimedia Interactive Protocols and Systems, MIPS 2003, pages 388-399. Springer-Verlag, 2003. LNCS 2899.
8. Anind K. Dey. Understanding and Using Context. Personal Ubiquitous Computing, 5(1):4-7, 2001.
9. O. Gaggi and A. Celentano. Modelling Synchronized Hypermedia Presentations. Multimedia Tools and Applications, Kluwer, in press, 2004. Preliminary version: Technical Report CS-2001-11, Dipartimento di Informatica, Università Ca' Foscari di Venezia, 2002, http://www.dsi.unive.it/~auce/docs/cs0211.pdf.
10. S. Izadi, M. Fraser, M. Flintham, S. Benford, and C. Greenhalgh. Citywide: Supporting interactive digital experiences. In Dunlop and Brewster, editors, Mobile HCI 01 - Proceedings of the Third International Workshop on Human-Computer Interaction with Mobile Devices, pages 41-46, November 2001.
11. K. Mitchell, N. Davies, K. Cheverst, and Alon Efrat. Using and determining location in a context-sensitive tour guide. IEEE Computer, 34(8):35-41, 2001.
12. F. Pittarello. Multi sensory 3D tours for cultural heritage: the Palazzo Grassi experience. In Proc. of ICHIM 2001 - Cultural Heritage and Technologies in the 3rd Millennium, 2001.
13. I. Rakkolainen and T. Vainio. A 3D city info for mobile users. Computers & Graphics, Special Issue on Multimedia Appliances, 25(4):619-625, 2001.
14. Anand Ranganathan and Roy H. Campbell. An infrastructure for context-awareness based on first order logic. Personal Ubiquitous Computing, 7(6):353-364, 2003.
15. B. N. Schilit, N. Adams, and R. Want. Context-aware computing applications. In Proc. Workshop on Mobile Computing Systems and Applications. IEEE, December 1994.
16. A. Schmidt, M. Beigl, and Hans-W. Gellersen. There is more to context than location. Computers and Graphics, 23, 1999.
17. Synchronized Multimedia Working Group of W3C. Synchronized Multimedia Integration Language (SMIL) 2.0 Specification, August 2001.
18. L. Villard, C. Roisin, and N. Layaïda. An XML-based multimedia document processing model for content adaptation. In Proceedings of Digital Documents and Electronic Publishing (DDEP00), volume 2023 of Lecture Notes in Computer Science, Munich, Germany, September 2000. Springer.
19. R. Want, A. Hopper, V. Falcao, and J. Gibbons. The active badge location system. ACM Trans. Inf. Syst., 10(1):91-102, 1992.
20. M. Weiser and J. Seely Brown. The coming age of calm technology. In Beyond Calculation: The Next Fifty Years of Computing. Springer-Verlag, 1997.
21. H. Yan and T. Selker. Context-aware office assistant. In Proceedings of the 5th International Conference on Intelligent User Interfaces, pages 276-279. ACM Press, 2000.
ENDOWING GEOGRAPHIC INFORMATION SYSTEMS WITH A COGNITIVE LEVEL
ALESSIO DE SIMONE
RCOST, Centro di Eccellenza in Ingegneria del Software, Università del Sannio, 83100, Italy

FERRANTE FORMATO
LatticeLogicAI, Italy
formato@latticelogic.com

NICLA PALLADINO
Dipartimento di Matematica ed Applicazioni, Università di Napoli, Via Cintia, Montesantangelo, Italy
Geographical Information Systems bundle together data of different nature, such as text, images and multimedia. This is because reality is commonly thought of as a categorical set of perceivable information: time, space, colours. As a matter of fact, we use present information systems for reading the railroad timetable and, in the case of GIS, to realize that our car is lost somewhere on the surface of the Earth. At present the problem of higher level layers in Geographic Information Systems has been tackled by using pictorial languages and data mining techniques. Recently other models for knowledge representation were proposed, such as mathematical concept models or concept lattices (see for example [10]). In this work we point out some problems inherent to picture languages and we propose an alternative knowledge representation model. The advancement of cognitive science has yielded some computational models, called "conceptual spaces" (cf. [5], [6]), in which concepts are modelled as a geometric manifold which has evolved inherent to the human mind. Each concept is modelled through a partition of flexible surfaces (NURBS) whose control points are prototypical elements.

1. Enhancing GIS: from Visual Languages to Conceptual Spaces
From an architectural point of view, a Geographical Information System is a database management system in which it is possible to store and retrieve spatial information. This is possible either by directly coding spatial information in the form of strings of "pictorial languages" or by mining data stored in raster or vector form. Pictorial languages (sometimes also called visual languages, "overlapping with any form of communication that relies on graphics rather than simply linear text" [8]) are an information retrieval system that is used to recognize some spatial patterns out of a discretisation of space. Formally, a pictorial language is
a language recognizable with a relational grammar (see for example [4]). Firstly we would like to focus on some limitations of this kind of visual languages, which can be ascribed, among other things, to the definition of relational grammar itself. In fact, as noted in [8], most visual languages lack semantics. For example, the following relational grammar recognizes the directed graphs:

G = (V_N, V_T, V_R, S, P, R)

V_N = {G}
V_T = {O, →}
V_R = {start, end}

P =
1. G ::= {O}
2. G ::= {O, G}
3. G ::= {→, G} {start(→, G), end(→, G)}

R =
1. start(→, G) :- (G ⇒ {O}), start(→, O)
2. start(→, G1) :- (G1 ⇒ {O, G2}), start(→, O)
3. start(→, G1) :- (G1 ⇒ {O, G2}), start(→, G2)
4. start(→1, G1) :- (G1 ⇒ {→2, G2}), start(→1, G2)
5. end(→, G) :- (G ⇒ {O}), end(→, O)
6. end(→, G1) :- (G1 ⇒ {O, G2}), end(→, O)
7. end(→, G1) :- (G1 ⇒ {O, G2}), end(→, G2)
8. end(→1, G1) :- (G1 ⇒ {→2, G2}), end(→1, G2)

(1)
Figure 1. A derivation tree of the graph G
But the following figure is not recognizable by any relational grammar
Figure 2. An ambiguous image
This is because vision -unlike the model grasped by picture languages - is a complex phenomenon in which interpretation and partial information play a crucial role. In fact -according to the interpretation and to the amount of information - the figure can be recognized both as a face and as a body.
Figure 3. A sequence of ambiguous images with a bifurcation point
By mistaking syntax and semantics, the notions of "model" and "completeness" are missing. The list of pictures in Figure 3 is a particular case of a general theory of information retrieval formulated in [7]. In fact they can be interpreted as a succession of "interpretations" or "constraints" or "information tokens" that tends to a complete piece of information: a model. This is a rather crucial point since, for example, we recognize the girl (face) because we have a, perhaps incomplete, information processing apparatus that, although incomplete, can approximate the complete model of the girl (face), although this complete model may not be effectively computable. An interesting point, observed for example in [3], is that the recognition of patterns like Figure 3 generates cuspidal bifurcations in the sense of chaos theory. Also, an interesting application of incomplete information systems to 3D graphics is HyperProof [2]. Venn-Euler diagrams [9] are also interesting, although with the limitations of computable set theory.

We now propose an alternative definition of relational grammar that provides the concepts of partial information and models.

Definition 1. An L-relational grammar is a structure G = (V_N, V_T, V_R, S, P, Γ, L) where
• V_N is the set of non-terminal symbols
• V_T is the set of terminal symbols
• S is the start symbol
• α ∈ L is a set of information called initial information
• P is a set of production rules of the kind p^α, where p is a classical context-free production rule and α is an element in L. Intuitively the meaning is "apply rule p when you know α".
• Γ is a closure operator on a lattice L whose elements we call worlds.

We call model for the grammar G a world ω ∈ L such that Γ(ω) = ω. Also, we say that information α in G is complete w.r.t. a world ω provided that Γ(α) = ω.

Definition 1 separates pattern recognition into parsing (pattern detection, on which a good amount of research has been done) and the problem of representation of knowledge.
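The closure operator Γ and the fixpoint condition Γ(ω) = ω of Definition 1 can be illustrated with a small sketch. The rules and tokens below are invented for illustration (loosely echoing the face/body ambiguity of Figure 3); the paper itself gives no concrete Γ.

```python
# Illustrative sketch of a closure operator over worlds of information
# tokens: Γ repeatedly adds every token derivable by the rules, and a
# model is a world ω with Γ(ω) = ω. All rules/tokens here are invented.

def make_closure(rules):
    """rules: list of (premises, conclusion) pairs over hashable tokens."""
    def gamma(world):
        world = set(world)
        changed = True
        while changed:                     # iterate up to the least fixpoint
            changed = False
            for premises, conclusion in rules:
                if premises <= world and conclusion not in world:
                    world.add(conclusion)
                    changed = True
        return frozenset(world)
    return gamma

# Enough "strokes" (tokens) force an interpretation, as in Figure 3.
rules = [({"eyes", "mouth"}, "face"),
         ({"torso", "legs"}, "body")]
gamma = make_closure(rules)
```

Since `gamma` saturates its input, every output is already a fixpoint, i.e., a model in the sense of Definition 1; a world like `{"eyes"}` that triggers no rule is its own (trivial) model.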
Re, then the document wj is introduced in SFS. Note that the result documents can be simply presented on the basis of the calculated linguistic compatibility LDifj. Step 2) Then we refine the obtained clustering through the computation of the similarity index on the documents that belong to the same cluster: this index is calculated between a document and the user profile in order to associate a numerical similarity value (between 0 and 1) with which it is possible to organize the documents linguistically grouped in SFS.
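The two-step organization can be sketched as follows. The function name `organize` and the triple format are ours; the paper's actual data structures are not specified at this level of detail.

```python
# Sketch of the two-step result organization: Step 1 groups the retrieved
# documents by their linguistic compatibility label (LDif); Step 2 orders
# each group by the numeric similarity index. Names are illustrative only.
from collections import defaultdict

def organize(results):
    """results: list of (doc, ldif_label, similarity) triples."""
    clusters = defaultdict(list)
    for doc, label, sim in results:        # Step 1: linguistic clustering
        clusters[label].append((doc, sim))
    for label in clusters:                 # Step 2: similarity refinement
        clusters[label].sort(key=lambda p: p[1], reverse=True)
    return dict(clusters)

# Data from the case study of Section 4.
ordered = organize([("w2", "Very Good", 0.9556),
                    ("w4", "Very Good", 0.9667),
                    ("w5", "Very Good", 0.9750),
                    ("w1", "Medium", 0.8334)])
```

With the case-study values this reproduces the final ordering w5, w4, w2 inside the "Very Good" cluster.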
4. Case study
We illustrate with a simple example how the system works. Let us define: p1: literary contents, p2: poetry, p3: scientific contents, p4: IA concepts, p5: logic concepts, p6: physics contents, p7: formality, p8: technical language, p9: student, p10: researcher, p11: long. P = {p1, p2, p3, p4, p5, p6, p7, p8, p9}, F = {p3, p4, p5, p7, p8, p9, p10, p11} and so P ∩ F = {p3, p4, p5, p7, p8, p9}. Then we can consider the following choice of triangular fuzzy numbers and linguistic terms of Example 1 and Example 2 for the lv Interest and the lv Compatibility, respectively. As said, we show only the features present in P ∩ F for both documents and profile. Suppose the user selects the following profile: U1 = vi/{p3, p4} + i/{p8} + si/{p7, p9} + li/{p8}, while Re = Sufficient. Let us consider the following singled out documents:

Table 4. An example of documents and their semantic information.

Document  FLD(wi)
w1        vi/{p4, p9} + i/{p5, p7} + li/{p3, p8}
w2        vi/{p3, p8} + fi/{p4} + si/{p5, p7, p9}
w3        vi/{p5, p7, p9} + ni/{p3, p4, p8}
w4        vi/{p3, p4, p8} + fi/{p9} + si/{p5, p7}
w5        vi/{p4} + i/{p3, p8} + fi/{p9} + li/{p5, p7}
On these documents, our classification algorithm is applied as follows: T_Dif1 = ([0.6, 0.8, 0.9] + [0.0, 0.2, 0.3] + [0.6, 0.8, 0.9] + [0.2, 0.4, 0.6] + [0.4, 0.6, 0.8] + [0.4, 0.6, 0.8])/6 = [0.366, 0.566, 0.716] and so, using our algorithm ApprLing with k = 3 (briefly described in par. 2.3), applied on the linguistic terms defined in Example 2, we have LDif1 = "Medium". In the same way: LDif2 = "Very Good", LDif3 = "Almost Sufficient", LDif4 = "Very Good", LDif5 = "Very Good". So the set of "compatible" documents is RFS = {"Very Good"/{w2, w4, w5}, "Medium"/{w1}}, whereas the document w3 is excluded because its compatibility level is Almost Sufficient, less than the chosen Re. We can now calculate the similarity indexes: δ(U1, w2) = 1 - ((0+2+1+0+1+0)/3)/(6*5) = 0.9556. In the same way, we have: δ(U1, w4) = 0.9667; δ(U1, w5) = 0.9750. Now, for the sake of completeness, we calculate δ(U1, w1) = 0.8334. Then we can organize the documents in the cluster labelled as Included Between Interested-Very Interested as follows: w5, w4, w2; hence w5 is the document nearest to user needs. Finally we obtain the ordered RFS:

Table 5. The final result of the method: ordered RFS.

Documents  Document-User Compatibility  Document-User Similarity
w5         Very Good                    0.9750
w4         Very Good                    0.9667
w2         Very Good                    0.9556
w1         Medium                       0.8334
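The T_Dif1 computation above is a componentwise average of triangular fuzzy numbers, which can be sketched directly. The nearest-centroid linguistic approximation below is only a plausible stand-in for the paper's ApprLing algorithm (its details are in par. 2.3, not reproduced here), and the term TFNs are invented for illustration.

```python
# Componentwise average of triangular fuzzy numbers (TFNs), as in the
# T_Dif1 computation above; appr_ling is a hypothetical nearest-centroid
# stand-in for ApprLing, and the linguistic-term TFNs below are invented.

def tfn_average(tfns):
    """Average a list of TFNs (a, b, c) componentwise, rounded to 3 digits."""
    n = len(tfns)
    return tuple(round(sum(t[i] for t in tfns) / n, 3) for i in range(3))

def appr_ling(tfn, terms):
    """Pick the term whose TFN centroid is nearest (our assumption)."""
    centroid = sum(tfn) / 3
    return min(terms, key=lambda kv: abs(sum(kv[1]) / 3 - centroid))[0]

# The six TFNs summed for T_Dif1 in the case study.
tdif1 = tfn_average([(0.6, 0.8, 0.9), (0.0, 0.2, 0.3), (0.6, 0.8, 0.9),
                     (0.2, 0.4, 0.6), (0.4, 0.6, 0.8), (0.4, 0.6, 0.8)])

terms = [("Almost Sufficient", (0.2, 0.35, 0.5)),   # invented TFNs
         ("Medium", (0.4, 0.55, 0.7)),
         ("Very Good", (0.7, 0.85, 1.0))]
```

With these (invented) term definitions, `appr_ling(tdif1, terms)` indeed yields "Medium", matching LDif1 in the case study.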
5. Concluding remarks
In this paper we have illustrated a fuzzy-based methodology for organizing the results of document searches on the web. Our methodology, through type-2 fuzzy sets, introduces linguistic terms to enrich the documents metadata and to represent a user profile. Then an algorithm for matching user profile and documents metadata, and for clustering and ordering the results as a function of user needs, is presented. Both the meta-data representation and the selection
algorithm illustrated in this paper present several aspects deserving further investigation:

• A possible extension of the methodology concerns the introduction of a weighting function. In such a way the final user could associate higher weights with features he/she considers more important for his/her interests;
• We could introduce more linguistic variables to give more expressivity to the documents representations and to deal with the complexity of the user profile;
• It is also possible to tackle the problem of coherence between the attributes used for documents meta-data and those for the user profile, by introducing special labels that represent no information or not compatible to complete the matching;
• In some situations, it could be useful to use the rejected results of the search; the user, in fact, could also be interested in something different or even opposite to his profile, to get general information on a context;
• Another possible extension regards the introduction of a grouped clustering, in which the selection is made not on the single attributes, but on main sets of them (such as contents, form and so on, or general, technical, educational, annotations, classification, as in [2, 10, 11, 12]).
References

1. P. P. Bonissone, 2001. Fuzzy Sets and Expert Systems in Computer Engineering. On-line Course ECSE 6710. http://www.rpi.edu/~bonisp/fuzzy-course/2000/course00.html.
2. G. Casella, L. Di Lascio, A. Gisolfi, 2003. Una procedura per la rappresentazione della conoscenza in un ipertesto mediante insiemi fuzzy di tipo 2. Atti AICA 2003, Trento, Italy, pp. 53-60.
3. N. Dessì, B. Pes, 2003. Learning Objects e Semantic Web. Atti AICA 2003, Trento, Italy, pp. 61-66.
4. L. Di Lascio, A. Gisolfi, P. Ciamillo, 200?. A new approach to Soft Computing. Elsevier (submitted).
5. L. Di Lascio, E. Fischetti, A. Gisolfi, V. Loia and A. Nappi, 2004. Linguistic resources and fuzzy algebraic computing in adaptive hypermedia systems, in E. Damiani, L. Jain (Eds.), Soft Computing and Software Engineering, Springer Verlag, Berlin.
6. L. Di Lascio, A. Gisolfi and G. Rosa, 2002. A commutative l-monoid for classifications with fuzzy attributes. Int. J. of Approximate Reasoning, 26, pp. 1-46.
7. L. Di Lascio, E. Fischetti, A. Gisolfi, 2001. An Algebraic Tool for Classification in Fuzzy Environments, in A. Di Nola, G. Gerla (Eds.), Advances in Soft Computing. Physica-Verlag, Berlin, pp. 129-156.
8. L. Di Lascio, E. Fischetti and A. Gisolfi, 1999. A fuzzy-based approach to stereotype selection in hypermedia. User Modelling and User-Adapted Interaction, 9, pp. 285-320.
9. A. Gisolfi and G. Nunez, 1993. An algebraic approximation to the classification with fuzzy attributes. International Journal of Intelligent Systems, 9, pp. 75-95.
10. IEEE 1484.12.1-2002, 2002. Draft Standard for Learning Object Metadata. http://www.ieee.org.
11. IMS Learning Resource Meta-Data Information Model, Version 1.2.1 Final Specification, 2001. http://www.imsglobal.org/metadata.
12. World Wide Web Consortium (W3C), 2001. Semantic Web. http://www.w3c.org.
13. Z. Yao, B. Wang, 2000. Using section-semantic relation structures to enhance the performance of Web search. Database and Expert Systems Applications, Proceedings.
14. L. A. Zadeh, 1975. The Concept of a Linguistic Variable and its Application to Approximate Reasoning, I-III. Information Sciences 8, pp. 199-249; 8, pp. 301-357; 9, pp. 43-80.
DEVELOPING A SYSTEM FOR THE RETRIEVAL OF MELODIES FROM WEB REPOSITORIES

RICCARDO DISTASI, LUCA PAOLINO and GIUSEPPE SCANNIELLO
Dipartimento di Matematica e Informatica (DMI), Università di Salerno, Italy.
Email: {ricdis, lpaolino, gscanniello}@unisa.it
This paper presents a system called WebMelodyFinder for content-based retrieval of melodies from repositories on the world wide web. The search is based on a least squares fit. The system considers only the (exact) melodic shape rather than the actual notes, and it is therefore invariant to transposition. Using this system, the melody, automatically extracted from a MIDI file or manually entered by a knowledgeable operator, can be used as the main search key to locate the best matching melodies. Other applications might include musicological archives, where other difficult-to-search information is stored (e.g., scores or audio recordings).
1. Introduction

It would be nice to be able to search for a specific tune by providing the melody to a system, by humming it or by playing it into a computer by means of a MIDI-enabled instrument. In most cases, however, the only form of search actually available to the end user is based on metadata (performer, composer, genre, title, etc.) rather than on the actual content, although several systems for music matching and retrieval exist. Most of the existing systems are based on a symbolic representation and perform some form of string matching, often adopting the 'edit distance' as a metric (the number of editing steps necessary to obtain a string from another). This choice makes it possible to manipulate the musical objects in useful ways,^{9,10,8} but it makes it harder to account for transposition or melodic variations (staccato or tenuto articulations, etc.). A brief summary of the concepts relevant to MIR (Music Information Retrieval) is sketched in [6], while many string-matching based techniques are described in depth in [5]. On the other hand, so-called query-by-humming systems perform an analysis of the melody as sung by the operator in order to extract the
information needed for the search [3]. This is an interesting approach indeed, but the process is usually prone to significant errors at several stages: the operator might not be a trained singer, pitch or timing recognition could be problematic, and so on. As a first step, then, it would probably be better to use some different kind of data entry, so that it is possible to assume that the system's input is really what the operator wanted it to be. Among the desirable characteristics of a content-based music retrieval system are invariance to transposition (i.e., the music should be recognized no matter which key it is played in) and invariance or robustness to tempo change (faster or slower). The proposed system, named WebMelodyFinder, performs melodic matching and retrieval based on the actual content, represented in numerical, rather than symbolic, form. Section 2 explains how melodies are represented and how the search for the best match is performed, while Section 3 discusses issues related to the ongoing implementation of WebMelodyFinder. Finally, Section 4 draws some conclusions.
2. The Underlying Technique

In order to search a melody repository for the best-matching element, the key and the candidates must all be represented in a suitable form. In WebMelodyFinder, a melody is represented as a sequence of integers, one integer for each 'tick' of time. The value of the integer associated with a specific tick reflects the chromatic pitch of the note that sounds during that tick, with middle C (C4) equal to 60, as in the MIDI standard [7]. Thus, the B below middle C (B3) is 59, while C#4 is 61. The ticks are a musical, rather than absolute, unit of time, in the sense that the duration of a tick is expressed as a fraction of a quarter note rather than in milliseconds or multiples thereof. The number of ticks per quarter note is called the temporal resolution. Any given tick contains exactly one integer (i.e., one note); therefore, the representation is strictly monodic. This might be considered a limitation, but for melodic searching it is better to have only the relevant data in the index keys, rather than having to wade through information which is useless for the task at hand. Furthermore, the representation makes no provision for pauses: each note is 'held' until the next one chimes in. This, too, is a design choice, since as long as the following note starts right on time, the exact articulation length of a given note in a melody can be significantly altered without altering the perceived melodic shape, which is
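The per-tick encoding just described can be illustrated with a short sketch (hypothetical Python, not part of WebMelodyFinder; the helper name and the input format are invented for the example):

```python
# Hypothetical sketch of the per-tick representation described above.
# A melody is given as (MIDI pitch, duration in quarter notes) pairs;
# each note is held until the next one starts (no rests), so the output
# is exactly one integer per tick.

RESOLUTION = 24  # ticks per quarter note, the value adopted by WebMelodyFinder

def to_tick_sequence(notes, resolution=RESOLUTION):
    """Expand (pitch, duration_in_quarters) pairs into per-tick pitches."""
    ticks = []
    for pitch, quarters in notes:
        ticks.extend([pitch] * round(quarters * resolution))
    return ticks

# First two notes of a C-major scale: C4 (60) and D4 (62), one quarter note each
seq = to_tick_sequence([(60, 1.0), (62, 1.0)])
print(len(seq), seq[0], seq[-1])  # 48 60 62
```

At a resolution of 24, each quarter note occupies 24 ticks, so the two-note example above yields a sequence of 48 integers.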
what this technique aims at capturing. Typical resolution values in actual MIDI files are 48, 96, 192, 240, 384 and 480. Desirable resolution values are multiples of both 3 and some power of 2, so that triplets (ternary time divisions of a note), as well as the usual binary divisions, can be represented without roundoff. The same goes for WebMelodyFinder, which generally adopts a resolution of 24 in order to limit the search time, since the length of keys and melodies is proportional to the resolution. A detailed picture of the representation is given in Figs. 1, 2 and 3. For simplicity, these illustrations were prepared using the somewhat atypical resolution of 100 ticks per quarter note. As can be seen, melodic shapes are markedly different from melody to melody and characterize the melody fully, in the sense of being informationally equivalent to traditional notation. In fact, with a little training it is possible to recognize a familiar melody visually. The idea of a mathematical curve plotting pitch vs. time is not new; see for instance Goldstein [4] for a treatment of this concept oriented towards musical analysis. The idea of using the information in the melodic curve in order to perform a search is a small step further. Perhaps it would have been possible to include at least part of the harmonic aspect, for instance by considering the harmonic function of selected melody notes, but this would introduce a layer of ambiguities that can only be resolved at another, higher level. For instance, even simple questions such as "Is this chord an Fmaj6 or a Dmin7?" require the intervention of a human expert, while extracting melody information is a much easier task, e.g., by picking the relevant track from a MIDI file, or even by playing a MIDI keyboard to reproduce the melody which is sought. Furthermore, melody reharmonization is a frequent practice in many styles of music, and this would add a further level of ambiguity, stacking difficulties over difficulties.
In conclusion, given the goal (namely, melodic matching), the representation adopted by WebMelodyFinder is a reasonably simple and effective choice.

2.1. Melody Matching

If $u = (u_0, \ldots, u_{n-1})$ and $v = (v_0, \ldots, v_{n-1})$ are two melodies of length n, let us define

$$d(u, v) = \min_c \left\{ \sqrt{\sum_{i=0}^{n-1} (u_i - v_i - c)^2} \right\}, \qquad (1)$$
Figure 1. O sole mio
that is, the minimum Euclidean distance achievable by a suitable transposition interval c, expressed as a signed number of chromatic steps. Determining the optimal transposition value c* that yields the minimum in Eq. (1) only requires solving a least squares problem:

$$c^* = \frac{1}{n} \sum_{i=0}^{n-1} (u_i - v_i). \qquad (2)$$
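As an illustration only (hypothetical Python, not the authors' implementation), Eqs. (1) and (2) amount to a few lines; the square root is omitted here, since squared distances compare identically:

```python
def optimal_transposition(u, v):
    """c* of Eq. (2): the least-squares transposition, (1/n) * sum(u_i - v_i)."""
    n = len(u)
    return sum(ui - vi for ui, vi in zip(u, v)) / n

def transposition_invariant_distance(u, v):
    """Squared distance of Eq. (1) after removing the optimal transposition c*
    (the square root is skipped, as squared distances compare the same way)."""
    c = optimal_transposition(u, v)
    return sum((ui - vi - c) ** 2 for ui, vi in zip(u, v))

u = [60, 62, 64, 65]   # C D E F
v = [67, 69, 71, 72]   # G A B C: the same shape, transposed up 7 semitones
print(optimal_transposition(u, v))             # -7.0
print(transposition_invariant_distance(u, v))  # 0.0
```

The example shows the transposition invariance directly: a melody compared against itself shifted by any number of semitones yields a distance of zero.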
Figure 4. Assorted melodic curves: A Foggy Day, Freedom Jazz Dance, Guantanamera, I Fall In Love Too Easily, O sole mio, Tico-tico
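Matching a key of length n against every window of a longer melody of length m, whose cost is analyzed below, can be sketched as follows (hypothetical Python; the function name and example data are invented for illustration):

```python
def best_match_offset(key, melody):
    """Slide the key (length n) along the melody (length m), computing the
    transposition-invariant squared distance at each of the m - n + 1 offsets,
    and return (best_offset, best_distance). Square roots are never taken,
    since squared distances compare the same way."""
    n, m = len(key), len(melody)
    best = (None, float("inf"))
    for off in range(m - n + 1):
        window = melody[off:off + n]
        c = sum(k - w for k, w in zip(key, window)) / n         # Eq. (2)
        d = sum((k - w - c) ** 2 for k, w in zip(key, window))  # Eq. (1), squared
        if d < best[1]:
            best = (off, d)
    return best

melody = [60, 60, 62, 64, 65, 67, 67]
key = [67, 69, 70]  # ticks 2..4 of the melody, transposed up 5 semitones
print(best_match_offset(key, melody))  # (2, 0.0)
```

This direct version recomputes each window from scratch; the cost breakdown in the text corresponds to a more careful variant that reuses running sums across adjacent offsets.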
Therefore, the total number of operations for this step is 2m additions (the division by n is not actually performed until necessary).

(2) Having obtained the optimal transposition c*, it can now be used to compute the distance as given in (1). This requires 2 subtractions and 1 multiplication for each of the n elements, that is, 2n subtractions and n multiplications for each of the m − n + 1 values of the offset. The total operations for this step are therefore 2mn − 2n² + 2 additions and mn − n² + 1 multiplications (the square root is not actually computed, as distances can be compared while still squared).

Putting together the operations for Steps 1 and 2, we have 2(n + 1)(m − n + 1) additions and n(m − n) + 1 multiplications. In other words, the number of operations necessary for one match is asymptotically proportional to n(m − n).

4. Conclusions and Future Work

This paper has presented a system called WebMelodyFinder. The system can be used for the content-based retrieval of melodies in a transposition-invariant way. What would be most useful at the moment is some extensive experimental work aimed at assessing the strong points and the possible weaknesses
Figure 5. The structure of WebMelodyFinder (block diagram: MIDI melodies and searching keys go through a transformation step into a melody representation and a searching-key representation, which are fed to the Melody Finder to produce the result report)
of the system. In this way, it would be possible to investigate the behaviour of the system with regard to missing notes, short notes, slightly altered phrases and other kinds of variation between musical objects. Surfing the web, it is often easy to find different versions of the same tune, and it would be interesting to find out how easily one version can be retrieved when searching for another. This would be a very realistic use case for WebMelodyFinder. Such a series of experiments is currently under way. As for improvements in the underlying search engine, it could be made invariant or robust with respect to time stretching, so that melody snippets can be recognized as identical, or at least close enough, even if metrically different. A time-stretching factor of 2 is especially desirable, since in many cases the same melody can be written in different rhythmic units differing by a factor of 2, with no perceptual or conceptual difference (e.g., 3/4 vs. 3/8, or 2/4 vs. 2/2). Additionally, explicit constraints could be added to make sure c* is an integer. At present, a non-integer value of c* signals a non-exact match, but such a condition can also be inferred from a nonzero resulting distance. Finally, in order to speed up the search in large databases, some kind of mark-and-sweep search might be employed, based on a tree scheme similar
to, e.g., Tiger Tree Hashing [2], with the hash function replaced by the average note value, or perhaps by some transposition-invariant quantity obtained from the sequence of melodic intervals in semitones.
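One candidate for such a transposition-invariant quantity is the sequence of successive intervals itself; a minimal sketch (hypothetical Python, invented for illustration and not part of the system described above):

```python
def interval_signature(melody):
    """Transposition-invariant signature: successive differences in semitones.
    Shifting every pitch by the same amount leaves the signature unchanged."""
    return tuple(b - a for a, b in zip(melody, melody[1:]))

original   = [60, 62, 64, 65]            # C D E F
transposed = [p + 7 for p in original]   # the same line, up a perfect fifth
print(interval_signature(original))                                  # (2, 2, 1)
print(interval_signature(original) == interval_signature(transposed))  # True
```

Such a signature, or a short hash of it, could play the role of the per-node digest in the tree scheme mentioned above.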
References

1. David Bainbridge, Rodger J. McNab and Lloyd A. Smith, "Melody Based Tune Retrieval over the World Wide Web." Last version: December 1997. http://www.cs.waikato.ac.nz/~nzdl/publications/1998/Bainbridge-McNab-Smith-Melody.pdf
2. Justin Chapweske, "Tree Hash EXchange Format (THEX)." Last version: March 2003. http://open-content.net/specs/draft-jchapweske-thex-02.html
3. Asif Ghias, Jonathan Logan, David Chamberlin and Brian C. Smith, "Query by Humming: Musical Information Retrieval in an Audio Database." In Proc. ACM Multimedia '95, San Francisco, CA, 5-9 Nov. 1995.
4. Gil Goldstein, The Jazz Composer's Companion. Advance Music, 1993.
5. Kjell Lemstrom, "String Matching Techniques for Music Retrieval." Ph.D. Thesis, Series of Publications A, Report A-2000-04, University of Helsinki, Nov. 2000. ISSN: 1238-8645. ISBN: 951-45-9573-4.
6. Kjell Lemstrom, "In Search of a Lost Melody: Computer-Assisted Music Identification and Retrieval." Finnish Music Quarterly, March/April 2000.
7. The MIDI Manufacturers Association (MMA), Complete MIDI 1.0 Detailed Specification, 2001. URL for ordering: http://www.midi.org/
8. Lloyd Smith and Richard Medina, "Discovering Themes by Exact Pattern Matching." In Proc. 2nd Annual International Symposium on Music Information Retrieval (ISMIR 2001), University of Indiana, Bloomington, IN, 15-17 Oct. 2001.
9. Alexandra Uitdenbogerd and Justin Zobel, "Manipulation of Music for Melody Matching." In Proc. 6th ACM Int'l Conf. on Multimedia, pp. 235-240, Bristol, UK, 1998. ISBN: 0-201-30990-4.
10. Alexandra Uitdenbogerd and Justin Zobel, "Melodic Matching Techniques for Large Music Databases." In Proc. 7th ACM Int'l Conf. on Multimedia, pp. 57-66, Orlando, FL, 1999. ISBN: 1-58113-151-8.
FAST FACE RECOGNITION USING FRACTAL RANGE/DOMAIN CLASSIFICATION
DANIEL RICCIO
Dipartimento di Matematica e Informatica, Universita di Salerno, 84084 Fisciano (SA), Italy
[email protected]

In this paper we introduce a new method, namely FFR (Fast Face Recognition Using Fractal Range/Domain Classification), for the face recognition problem. FFR is based on IFS (Iterated Function Systems) theory, which is also used for still image compression and indexing but has not yet been extensively tested in the biometric field. It characterizes in a fast way the similarities between faces, associating to each range extracted from the eye, nose or mouth regions the topological map of the best-fitting domains. FFR is fast and robust with respect to meaningful variations of expression and to small changes of illumination and pose, as demonstrated by the experimental results.
1. Introduction

Automatic Face Recognition (AFR) is a complicated object recognition problem due to the variability of face expressions, face position and lighting changes. Several methods have been proposed in order to solve this problem, but the available recognition methods are still very far from human capability, in terms of precision and time spent. These methodologies can be classified as:
• Image based: analyze the image as an array of gray-level pixels. ICA (Independent Component Analysis) [1], Neural Networks [7], Eigenfaces [7].
• Feature based: analyze anthropomorphic face features and their geometry. Elastic Graph Matching [7].
• Combined: extract feature areas, and apply image-based algorithms on these areas. Fractals [6].
Fractal-based techniques usually support lossy coding, where a given input image I is partitioned into a set R of disjoint square regions named
ranges. From the same image I, another set D of overlapped regions called domains is extracted. Generally, ranges and domains are classified by means of feature vectors in order to reduce the cost of the linear search on the set of domains: for a range r ∈ R, only the domains d ∈ D having a close feature vector have to be compared. Recently, IFS has demonstrated its effectiveness also in image indexing [2], because of some desirable properties such as brightness, color and contrast invariance, just to cite some of them. In this context, a new fast face recognition technique based on IFS, namely FFR, is introduced in the following. The core of FFR is the MC-DRDC algorithm [5], a recent coding technique that proceeds by means of the following two phases:
• In the first phase, all the domains are compared with the preset block d, computing the approximation error according to (1) and then storing it in a KD-Tree:

e_{r,d} = inf{ d − (
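The classification of ranges and domains by feature vectors, described above as a way to avoid a linear search over all domains, can be sketched as follows (hypothetical Python; the quadrant-mean feature and the tolerance are invented for the example and are not the MC-DRDC feature set):

```python
# Hypothetical sketch of range/domain classification by feature vectors:
# instead of comparing a range against every domain, only domains whose
# feature vector is close to that of the range are considered.

def quadrant_means(block):
    """Feature vector: the mean gray level of each 2x2 quadrant of a square block."""
    n = len(block)
    h = n // 2
    feats = []
    for r0 in (0, h):
        for c0 in (0, h):
            vals = [block[r][c] for r in range(r0, r0 + h) for c in range(c0, c0 + h)]
            feats.append(sum(vals) / len(vals))
    return feats

def candidate_domains(rng, domains, tol=8.0):
    """Return the indices of the domains worth comparing against the range,
    i.e. those whose feature vector is within tol of the range's, componentwise."""
    fr = quadrant_means(rng)
    return [i for i, d in enumerate(domains)
            if all(abs(a - b) <= tol for a, b in zip(fr, quadrant_means(d)))]

rng = [[10] * 4 for _ in range(4)]            # uniform range block
domains = [[[12] * 4 for _ in range(4)],      # similar domain
           [[200] * 4 for _ in range(4)]]     # very different domain
print(candidate_domains(rng, domains))        # [0]: only the similar domain survives
```

In the actual scheme the surviving feature vectors would be organized in a KD-tree so that close domains can be found without even this reduced linear scan.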