XML for Data Warehousing: Chances and Challenges (Extended Abstract)
Peter Fankhauser and Thomas Klement
Fraunhofer IPSI, Integrated Publication and Information Systems Institute, Dolivostr. 15, 64293 Darmstadt, Germany
{fankhaus,klement}@fraunhofer.ipsi.de
http://www.ipsi.fraunhofer.de
The prospects of XML for data warehousing are staggering. Since a primary purpose of data warehouses is to store non-operational data in the long term, i.e., to exchange them over time, the key reasons for the overwhelming success of XML as an exchange format also hold for data warehouses.
– Expressive power: XML can represent relational data, EDI messages, report formats, and structured documents directly, without information loss, and with uniform syntax.
– Self-describing: XML combines data and metadata. Thereby, heterogeneous and even irregular data can be represented and processed without a fixed schema, which may become obsolete or simply get lost.
– Openness: As a text format with full support for Unicode, XML is not tied to a particular hardware or software platform, which makes it ideally suited for future-proof long-term archival.
But what can we do with an XML data warehouse beyond long-term archival? How can we make sense of these data? How can we cleanse them, validate them, aggregate them, and ultimately discover useful patterns in XML data? A natural first step is to bring the power of OLAP to XML. Unfortunately, even though in principle XML is well suited to represent multidimensional data cubes, there does not yet exist a widely agreed-upon standard either for representing data cubes or for querying them. XQuery 1.0 has resisted standardizing even basic OLAP features: grouping and aggregation require nested for-loops, which are difficult to optimize. XSLT 2.0 (XSL Transformations) has introduced basic grouping mechanisms. However, these mechanisms make it difficult to take hierarchical dimensions into account and, accordingly, to compute derived aggregations at different levels. In the first part of the talk we will introduce a small XML vocabulary for expressing OLAP queries that allows aggregation at different levels of granularity and can fully exploit the document order and nested structure of XML. Moreover, we will illustrate the main optimization and processing techniques for such queries.
Data cubes constitute only one possible device to deal with the key challenge of XML data warehouses. XML data are notoriously noisy. They often come without a schema or with highly heterogeneous schemas, they rarely explicate dependencies and therefore are often redundant, and they can contain missing and inconsistent values. Data mining provides a wealth of established methods to deal with this situation. In the second part of the talk, we will illustrate by way of a simple experiment how data mining techniques can help in combining multiple data sources and bringing them to effective use. We explore to which extent stable XML technology can be used to implement these techniques. The experiment deliberately focuses on data and mining techniques that cannot be readily represented and realized with standard relational technology. It combines a bilingual dictionary, a thesaurus, and a text corpus (altogether about 150 MB of data) in order to support bilingual search and thesaurus-based analysis of the text corpus. We proceeded in three steps: None of the data sources was in XML form; therefore they needed to be structurally enriched to XML with a variety of tools. State-of-the-art schema mining combined with an off-the-shelf XML Schema validator has proven to be very helpful in ensuring quality for this initial step by ruling out enrichment errors and spurious structural variations in the initial data. In the next step, the data were cleansed. The thesaurus contained spurious cycles and missing relationships, and the dictionary suffered from incomplete definitions. These inconsistencies significantly impeded further analysis steps. XSLT, extended with appropriate means to efficiently realize fixpoint queries guided by regular path expressions, turned out to be a quick and dirty means for this step. However, even though cleansing did not go very far, the developed stylesheets reached a considerable level of complexity, indicating the need for better models to express and detect such inconsistencies. In the final step, the thesaurus was used to enrich the text corpus with so-called lexical chains, which cluster a text into sentence groups that contain words in sufficiently close semantic neighborhood. These chains can be used to understand the role of lexical cohesion for text structure, to deploy this structure for finer-grained document retrieval and clustering, and ultimately to enhance the thesaurus with additional relationships. Again, XSLT turned out to be a suitable means to implement the enrichment logic in an ad-hoc fashion, but the lack of higher-level abstractions for both the data structures and the analysis rules resulted in fairly complex stylesheets. On the other hand, XSLT's versatility w.r.t. expressing different structural views on XML turned out to be extremely helpful to flexibly visualize lexical chains. The main lessons learned from this small experiment are that state-of-the-art XML technology is mature and scalable enough to realize a fairly challenging text mining application. The main benefits of XML show especially in the early steps of data cleansing and enrichment, and the late steps of interactive analysis. These steps are arguably much harder to realize with traditional data warehouse technology, which requires significantly more data cleansing and restructuring as
a prerequisite. On the other hand, the thesaurus-based analysis in Step 3 suffers from the lack of XML-based interfaces to mining methods and tools. Realizing these in XSLT, which has some deficiencies w.r.t. compositionality and expressive power, turns out to be unnecessarily complex.
CPM: A Cube Presentation Model for OLAP
Andreas Maniatis¹, Panos Vassiliadis², Spiros Skiadopoulos¹, Yannis Vassiliou¹
¹ National Technical Univ. of Athens, Dept. of Elec. and Computer Eng., 15780 Athens, Hellas
{andreas,spiros,yv}@dblab.ece.ntua.gr
² University of Ioannina, Dept. of Computer Science, 45110 Ioannina, Hellas
[email protected]

Abstract. On-Line Analytical Processing (OLAP) is a trend in database technology, based on the multidimensional view of data. In this paper we introduce the Cube Presentation Model (CPM), a presentational model for OLAP data which, to the best of our knowledge, is the only formal presentational model for OLAP found in the literature until today. First, our proposal extends a previous logical model for cubes, to handle more complex cases. Then, we present a novel presentational model for OLAP screens, intuitively based on the geometrical representation of a cube and its human perception in the space. Moreover, we show how the logical and the presentational models are integrated smoothly. Finally, we describe how typical OLAP operations can be easily mapped to the CPM.
1. Introduction
In the last years, On-Line Analytical Processing (OLAP) and data warehousing have become a major research area in the database community [1, 2]. An important issue faced by vendors, researchers and - mainly - users of OLAP applications is the visualization of data. Presentational models are not really a part of the classical conceptual-logical-physical hierarchy of database models; nevertheless, since OLAP is a technology facilitating decision-making, the presentation of data is of major importance. Research-wise, data visualization is presently a quickly evolving field dealing with the presentation of vast amounts of data to the users [3, 4, 5]. In the OLAP field, though, we are aware of only two approaches towards a discrete and autonomous presentation model for OLAP. In the industrial field, Microsoft has already issued a commercial standard for multidimensional databases, where the presentational issues form a major part [6]. In this approach, a powerful query language is used to provide the user with complex reports, created from several cubes (or actually subsets of existing cubes). An example is depicted in Fig. 1. The Microsoft standard, however, suffers from several problems, with two of them being the most prominent: First, the logical and presentational models are mixed, resulting in a complex language which is difficult to use (although powerful enough).
Secondly, the model is formalized, but not thoroughly: for instance, to our knowledge, there is no definition for the schema of a multicube.
SELECT CROSSJOIN({Venk,Netz},{USA_N.Children,USA_S,Japan}) ON COLUMNS
       {Qtr1.CHILDREN,Qtr2,Qtr3,Qtr4.CHILDREN} ON ROWS
FROM SalesCube
WHERE (Sales,[1991],Products.ALL)
[Fig. 1 renders this query as a crosstab report: the columns combine the salesmen Venk and Netz with the geography members USA_N (expanded into its cities, together with a Size(city) attribute), USA_S and Japan, forming the vertical tapes C1-C6; the rows list the children of Qtr1 and Qtr4 (months) together with Qtr2 and Qtr3, forming the horizontal tapes R1-R4; Year = 1991 and Product = ALL act as slicers.]
Fig. 1: Motivating example for the cube model (taken from [6]).
Apart from the industrial proposal of Microsoft, an academic approach has also been proposed [5]. However, the proposed Tape model seems to be limited in its expressive power (with respect to the Microsoft proposal) and its formal aspects are not yet publicly available. In this paper we introduce a cube presentation model (CPM). The main idea behind CPM lies in the separation of logical data retrieval (which we encapsulate in the logical layer of CPM) and data presentation (captured by the presentational layer of CPM). The logical layer that we propose is based on an extension of a previous proposal [8] to incorporate more complex cubes. Replacing the logical layer with any other model compatible with classical OLAP notions (like dimensions, hierarchies and cubes) can be easily performed. The presentational layer, at the same time, provides a formal model for OLAP screens. To our knowledge, there is no such result in the related literature. Finally, we show how typical OLAP operations like roll-up and drill-down are mapped to simple operations over the underlying presentational model. The remainder of this paper is structured as follows. In Section 2, we present the logical layer underlying CPM. In Section 3, we introduce the presentational layer of the CPM model. In Section 4, we present a mapping from the logical to the presentational model and finally, in Section 5 we conclude our results and present topics for future work. Due to space limitations, we refer the interested reader to a long version of this report for more intuition and rigorous definitions [7].
2. The logical layer of the Cube Presentation Model
The Cube Presentation Model (CPM) is composed of two parts: (a) a logical layer, which involves the formulation of cubes and (b) a presentational layer that involves the presentation of these cubes (normally, on a 2D screen). In this section, we present
the logical layer of CPM; to this end, we extend a logical model [8] in order to compute more complex cubes. We briefly repeat the basic constructs of the logical model and refer the interested reader to [8] for a detailed presentation of this part of the model. The most basic constructs are:
− A dimension is a lattice of dimension levels (L, ≺), where ≺ is a partial order defined among the levels of L.
− A family of monotone, pairwise consistent ancestor functions anc_{L1}^{L2} is defined, such that for each pair of levels L1 and L2 with L1 ≺ L2, the function anc_{L1}^{L2} maps each element of dom(L1) to an element of dom(L2).
− A data set DS over a schema S=[L1,…,Ln,A1,…,Am] is a finite set of tuples over S such that [L1,…,Ln] are levels, the rest of the attributes are measures, and [L1,…,Ln] is a primary key. A detailed data set DS0 is a data set where all levels are at the bottom of their hierarchies.
− A selection condition φ is a formula involving atoms and the logical connectives ∧, ∨ and ¬. The atoms involve levels, values and ancestor functions, in clauses of the form x θ y. A detailed selection condition involves levels at the bottom of their hierarchies.
− A primary cube c (over the schema [L1,…,Ln,M1,…,Mm]) is an expression of the form c=(DS0, φ, [L1,…,Ln,M1,…,Mm], [agg1(M0_1),…,aggm(M0_m)]), where: DS0 is a detailed data set over the schema S=[L0_1,…,L0_n,M0_1,…,M0_k], m≤k; φ is a detailed selection condition; M1,…,Mm are measures; L0_i and Li are levels such that L0_i ≺ Li, 1≤i≤n; and agg_i ∈ {sum,min,max,count}, 1≤i≤m.
The limitation of primary cubes is that, although they accurately model SELECT-FROM-WHERE-GROUP BY queries, they fail to model (a) ordering, (b) computation of values through functions and (c) selection over computed or aggregate values (i.e., the HAVING clause of a SQL query). To compensate for this shortcoming, we extend the aforementioned model with the following entities:
− Let F be a set of functions mapping sets of attributes to attributes. We distinguish the following major categories of functions: property functions, arithmetic functions and control functions. For example, for the level Day, we can have the property function holiday(Day) indicating whether a day is a holiday or not. An arithmetic function is, for example, Profit=(Price-Cost)*Sold_Items.
− A secondary selection condition ψ is a formula in disjunctive normal form. An atom of the secondary selection condition is true, false or an expression of the form x θ y, where x and y can be one of the following: (a) an attribute Ai (including RANK), (b) a value l, or an expression of the form fi(Ai), where Ai is a set of attributes (levels and measures), and (c) θ is an operator from the set {>, <, =, ≥, ≤, ≠}. Secondary selection conditions can thus express comparisons between attributes (e.g., Cost > Price), ranking and range selections (ORDER BY…; STOP after 200, RANK[20:30]), measure selections (sales>3000), and property-based selections (Color(Product)='Green').
− Assume a data set DS over the schema [A1,A2,…,Az]. Without loss of generality, consider a non-empty subset of the schema S=A1,…,Ak, k≤z. Then, there is a set of ordering operations O_S^θ, used to sort the values of the data set with respect to the set of attributes participating in S. θ belongs to the set {<, >, ∅} in order to denote ascending, descending and no order, respectively. An ordering operation is applied over a data set and returns another data set which obligatorily encompasses the measure RANK.
− A secondary cube over the schema S=[L1,…,Ln,M1,…,Mm,Am+1,…,Am+p,RANK] is an expression of the form s=[c, [Am+1:fm+1(Am+1),…,Am+p:fm+p(Am+p)], O_A^θ, ψ], where c=(DS0, φ, [L1,…,Ln,M1,…,Mm], [agg1(M0_1),…,aggm(M0_m)]) is a primary cube, [Am+1,…,Am+p]⊆[L1,…,Ln,M1,…,Mm], A⊆S-{RANK}, fm+1,…,fm+p are functions belonging to F and ψ is a secondary selection condition.
With these additions, primary cubes are extended to secondary cubes that incorporate: (a) computation of new attributes (Am+i) through the respective functions (fm+i), (b) ordering (O_A^θ) and (c) the HAVING clause, through the secondary selection condition ψ.
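To make these definitions concrete, the following Python sketch is ours and not code from [8] or from this paper; the function names, tuple layouts and sample values are illustrative assumptions. It evaluates a primary cube as a detailed selection followed by a roll-up and group-by aggregation, and then applies the secondary-cube extensions: computed attributes from F, ordering with the derived RANK measure, and a HAVING-like secondary selection ψ.

def primary_cube(ds0, phi, dims, aggs, anc):
    # dims: list of (dimension, base_level, target_level); aggs: list of (name, fn, base_measure)
    groups = {}
    for t in filter(phi, ds0):                      # detailed selection condition phi
        key = tuple(anc.get((base, lvl), {}).get(t[base], t[base]) for _, base, lvl in dims)
        groups.setdefault(key, []).append(t)
    cube = []
    for key, rows in groups.items():
        cell = {d: v for (d, _, _), v in zip(dims, key)}
        cell.update({name: fn([r[m] for r in rows]) for name, fn, m in aggs})
        cube.append(cell)
    return cube

def secondary_cube(cube, functions, order_key, descending, psi):
    rows = [dict(c, **{n: f(c) for n, f in functions.items()}) for c in cube]
    rows.sort(key=order_key, reverse=descending)
    for rank, r in enumerate(rows, 1):
        r["RANK"] = rank                            # ordering obligatorily introduces RANK
    return [r for r in rows if psi(r)]              # secondary selection plays the role of HAVING

ds0 = [{"day": "1/1/01", "product": "beer", "store": "s1", "UnitSales": 10, "SalePrice": 50},
       {"day": "2/1/01", "product": "beer", "store": "s1", "UnitSales": 4,  "SalePrice": 8},
       {"day": "2/1/01", "product": "milk", "store": "s2", "UnitSales": 2,  "SalePrice": 3}]
anc = {("day", "month"): {"1/1/01": "Jan-01", "2/1/01": "Jan-01"},
       ("product", "type"): {"beer": "drink", "milk": "food"}}
c = primary_cube(ds0, lambda t: True,
                 [("Date", "day", "month"), ("Product", "product", "type")],
                 [("TotalSales", sum, "UnitSales"), ("SumPrice", sum, "SalePrice")], anc)
s = secondary_cube(c, {"AvgPrice": lambda r: r["SumPrice"] / r["TotalSales"]},
                   lambda r: r["TotalSales"], True, lambda r: r["TotalSales"] > 3)
print(s)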
3. The presentational layer of the Cube Presentation Model
In this section, we present the presentation layer of CPM. First, we will give an intuitive, informal description of the model; then we will present its formal definition. Throughout the paper, we will use the example of Fig. 1 as our reference example. The most important entities of the presentational layer of CPM include:
− Points: A point over an axis resembles the classical notion of points over axes in mathematics. Still, since we group more than one attribute per axis (in order to make things presentable on a 2D screen), formally, a point is a pair comprising a set of attribute groups (with one of them acting as primary key) and a set of equality selection conditions for each of the keys.
− Axis: An axis can be viewed as a set of points. We introduce two special-purpose axes, Invisible and Content. The Invisible axis is a placeholder for the levels of the data set which are not found in the "normal" axes defining the multicube. The Content axis has a more elaborate role: in the case where no measure is found in any axis, the measure which will fill the content of the multicube is placed there.
− Multicubes: A multicube is a set of axes, such that (a) all the levels of the same dimensions are found in the same axis, (b) the Invisible and Content axes are taken into account, (c) all the measures involved are tagged with an aggregate function and (d) all the dimensions of the underlying data set are present in the multicube definition. In our motivating example, the multicube MC is defined as MC={Rows,Columns,Sections,Invisible,Content}.
− 2D-slice: Consider a multicube MC, composed of K axes. A 2D-slice over MC is a set of (K-2) points, each from a separate axis. Intuitively, a 2D-slice pins the axes of
the multicube to specific points, except for 2 axes, which will be presented on the screen (or a printout). In Fig. 2, we depict such a 2D-slice over a multicube.
− Tape: Consider a 2D-slice SL over a multicube MC, composed of K axes. A tape over SL is a set of (K-1) points, where (K-2) of the points are the points of SL. A tape is always parallel to a specific axis: out of the two "free" axes of the 2D-slice, we pin one of them to a specific point, which distinguishes the tape from the 2D-slice.
− Cross-join: Consider a 2D-slice SL over a multicube MC, composed of K axes, and two tapes t1 and t2 which are not parallel to the same axis. A cross-join over t1 and t2 is a set of K points, where (K-2) of the points are the points of SL and each of the two remaining points is a point on a different axis of the remaining axes of the slice.
The query of Fig. 1 is a 2D-slice, say SL. In SL one can identify 4 horizontal tapes (denoted as R1, R2, R3 and R4 in Fig. 1) and 6 vertical tapes (numbered from C1 to C6). The meaning of the horizontal tapes is straightforward: they represent the Quarter dimension, expressed either as quarters or as months. The meaning of the vertical tapes is somewhat more complex: they represent the combination of the dimensions Salesman and Geography, with the latter expressed at the City, Region and Country levels. Moreover, two constraints are superimposed over these tapes: the Year dimension is pinned to a specific value and the Product dimension is ignored. In this multidimensional world of 5 axes, the tapes C1 and R1 are defined as:
C1 = [(Salesman='Venk' ∧ anc^region_city(City)='USA_N'), (Year='1991'), (anc^ALL_item(Products)='all'), (Sales,sum(Sales))]
R1 = [(anc^month_day(Month)='Qtr1' ∧ Year='1991'), (Year='1991'), (anc^ALL_item(Products)='all'), (Sales,sum(Sales))]
One can also consider the cross-join t1 defined by the common cells of the tapes R1 and C1. Remember that City defines an attribute group along with [Size(City)].
t1 = ([SalesCube, (Salesman='Venk' ∧ anc^region_city(City)='USA_N' ∧ anc^month_day(Month)='Qtr1' ∧ Year='1991' ∧ anc^ALL_item(Products)='all'), [Salesman, City, Month, Year, Products.ALL, Sales], sum], [Size(City)], true)
In the rest of this section, we will describe the presentation layer of CPM formally. First, we extend the notion of dimension to incorporate any kind of attribute (i.e., results of functions, measures, etc.). Consequently, we consider every attribute not already belonging to some dimension to belong to a single-level dimension (with the same name as the attribute), with no ancestor functions or properties defined over it. We will distinguish between the dimensions comprising levels and functionally dependent attributes through the terms level dimensions and attribute dimensions, wherever necessary. The dimensions involving arithmetic measures will be called measure dimensions. An attribute group AG over a data set DS is a pair [A,DA], where A is a list of attributes belonging to DS (called the key of the group) and DA is a list of attributes dependent on the attributes of A. With the term dependent we mean (a) measures dependent over the respective levels of the data set and (b) function results depending
on the arguments of the function. One can consider examples of attribute groups such as ag1=([City],[Size(City)]) and ag2=([Sales,Expenses],[Profit]).
Fig. 2: The 2D-Slice SL for the example of Fig. 1.
A dimension group DG over a data set DS is a pair [D,DD], where D is a list of dimensions over DS (called the key of the dimension group) and DD is a list of dimensions dependent on the dimensions of D. With the term dependent we simply extend the respective definition of attribute groups to cover also the respective dimensions. For reasons of brevity, wherever possible, we will denote an attribute/dimension group comprising only its key simply by the respective attribute/dimension. An axis schema is a pair [DG,AG], where DG is a list of K dimension groups and AG is an ordered list of K finite ordered lists of attribute groups, where the keys of each (inner) list belong to the same dimension, found in the same position in DG, with K>0. The members of each ordered list are not necessarily different. We denote an axis schema as a pair
AS_K = ([DG1×DG2×…×DGK], [[ag_1^1,ag_2^1,…,ag_{k1}^1]×[ag_1^2,ag_2^2,…,ag_{k2}^2]×…×[ag_1^K,ag_2^K,…,ag_{kK}^K]]).
In other words, one can consider an axis schema as the Cartesian product of the respective dimension groups, instantiated at a finite number of attribute groups. For instance, in the example of Fig. 1, we can observe two axes schemata, having the following definitions:
Row_S = {[Quarter], [Month, Quarter, Quarter, Month]}
Column_S = {[Salesman×Geography], [Salesman]×[[City,Size(City)], Region, Country]}
Consider a detailed data set DS. An axis over DS is a pair comprising an axis schema over K dimension groups, where all the keys of its attribute groups belong to DS, and an ordered list of K finite ordered lists of selection conditions (primary or
secondary), where each member of the inner lists involves only the respective key of the attribute group:
a = (AS_K, [φ1,φ2,...,φK]), K≤N, or
a = {[DG1×DG2×…×DGK], [[ag_1^1,ag_2^1,…,ag_{k1}^1]×[ag_1^2,ag_2^2,…,ag_{k2}^2]×…×[ag_1^K,ag_2^K,…,ag_{kK}^K]], [[φ_1^1,φ_2^1,…,φ_{k1}^1]×[φ_1^2,φ_2^2,…,φ_{k2}^2]×...×[φ_1^K,φ_2^K,…,φ_{kK}^K]]}
Practically, an axis is a restriction of an axis schema to specific values, through the introduction of specific constraints for each occurrence of a level. In our motivating example, we have two axes:
Rows = {Row_S, [anc^month_day(Month)=Qtr1, Quarter=Qtr2, Quarter=Qtr3, anc^month_day(Month)=Qtr4]}
Columns = {Column_S, {[Salesman='Venk', Salesman='Netz'], [anc^region_city(City)='USA_N', Region='USA_S', Country='Japan']}}
We will denote the set of dimension groups of each axis a by dim(a). A point over an axis is a pair comprising a set of attribute groups and a set of equality selection conditions for each one of their keys. For example,
p1 = ([Salesman, [City,Size(City)]], [Salesman='Venk', anc^region_city(City)='USA_N'])
An axis can be reduced to a set of points if one calculates the Cartesian products of the attribute groups and their respective selection conditions. In other words, a = ([DG1×DG2×...×DGK], [p1,p2,…,pl]), with l = k1×k2×…×kK.
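As a small illustration (our own sketch, not part of the paper; the condition strings are just labels), the reduction of the Columns axis of Fig. 1 to its points is simply the Cartesian product of the per-dimension-group pairs of attribute groups and selection conditions:

from itertools import product

# Per dimension group: a list of (attribute_group, equality_condition) pairs,
# following the Columns axis of Fig. 1.
columns = {
    "Salesman":  [(("Salesman",), "Salesman='Venk'"),
                  (("Salesman",), "Salesman='Netz'")],
    "Geography": [(("City", "Size(City)"), "anc^region_city(City)='USA_N'"),
                  (("Region",),            "Region='USA_S'"),
                  (("Country",),           "Country='Japan'")],
}

def axis_points(axis):
    """Reduce an axis to its points: the Cartesian product of the per-dimension
    (attribute group, selection condition) pairs."""
    dims = list(axis)
    points = []
    for combo in product(*(axis[d] for d in dims)):
        groups = [g for g, _ in combo]
        conds = [c for _, c in combo]
        points.append((groups, conds))
    return points

print(len(axis_points(columns)))   # 6 points, matching C1..C6 in Fig. 1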
Two axes schemata are joinable over a data set if their key dimensions (a) belong to the set of dimensions of the data set and (b) are disjoint. For instance, Row_S and Column_S are joinable. A multicube schema over a detailed data set is a finite set of axes schemata fulfilling the following constraints:
1. All the axes schemata are pair-wise joinable over the data set.
2. The key of each dimension group belongs only to one axis.
3. Similarly, from the definition of the axis schema, the attributes belonging to a dimension group are all found in the same axis.
4. Two special-purpose axes called Invisible and Content exist. The Content axis can take only measure dimensions.
5. All the measure dimensions of the multicube are found in the same axis. If more than one measure exists, they cannot be found in the Content axis.
6. If no measure is found in any of the "normal" axes, then a single measure must be found in the axis Content.
7. Each key measure is tagged with an aggregate function over a measure of the data set.
8. For each attribute participating in a group, all the members of the group are found in the same axis.
9. All the level dimensions of the data set are found in the union of the axes schemata (if some dimensions are not found in the "normal" axes, they must be found in the Invisible axis).
The role of the Invisible axis follows: it is a placeholder for the levels of the data set which are not to be taken into account in the multicube. The Content axis has a more elaborate role: in the case where no measure is found in any axis (like in the example of Fig. 1), the measure which will fill the content of the multicube is placed there. If more than one measure is found, they must all be placed in the same axis (not Content), as placing them in the Content axis would cause a problem of presentation on a two-dimensional space. A multicube over a data set is defined as a finite set of axes whose schemata define a multicube schema. The following constraints must be met:
1. Each point from a level dimension, not in the Invisible axis, must have an equality selection condition, returning a finite number of values.
2. The rest of the points can have arbitrary selection conditions (including "true" for the measure dimensions, for example).
For example, suppose a detailed data set SalesCube under the schema
S = [Quarter.Day, Salesman.Salesman, Geography.City, Time.Day, Product.Item, Sales, PercentChange, BudgetedSales]
Suppose also the following axes schemata over DS0:
Row_S = {[Quarter], [Month, Quarter, Quarter, Month]}
Column_S = {[Salesman×Geography], [Salesman]×[[City,Size(City)], Region, Country]}
Section_S = {[Time], [Year]}
Invisible_S = {[Product], [Product.ALL]}
Content_S = {[Sales], [sum(Sales0)]}
and their respective axes:
Rows = {Row_S, [anc^month_day(Month)=Qtr1, Quarter=Qtr2, Quarter=Qtr3, anc^month_day(Month)=Qtr4]}
Columns = {Column_S, {[Salesman='Venk', Salesman='Netz'], [anc^region_city(City)='USA_N', Region='USA_S', Country='Japan']}}
Sections = {Section_S, [Year=1991, Year=1992]}
Invisible = {Invisible_S, [ALL='all']}
Content = {Content_S, [true]}
Then, a multicube MC can be defined as MC = {Rows, Columns, Sections, Invisible, Content}.
Consider a multicube MC, composed of K axes. A 2D-slice over MC is a set of (K-2) points, each from a separate axis, where the points of the Invisible and the Content axes are comprised within the points of the 2D-slice. Intuitively, a 2D-slice pins the axes of the multicube to specific points, except for 2 axes, which will be presented on a screen (or a printout). Consider a 2D-slice SL over a multicube MC, composed of K axes. A tape over SL is a set of (K-1) points, where (K-2) of the points are the points of SL. A tape is always parallel to a specific axis: out of the two "free" axes of the 2D-slice, we pin one of them to a specific point, which distinguishes the tape from the 2D-slice. A tape is more restrictively defined with respect to the 2D-slice by a single point: we will call this point the key of the tape with respect to its 2D-slice. Moreover, if a 2D-slice has two free axes a1 and a2 with size(a1) and size(a2) points each, then one can define size(a1)+size(a2) tapes over this 2D-slice. Consider a 2D-slice SL over a multicube MC, composed of K axes. Consider also two tapes t1 and t2 which are not parallel to the same axis. A cross-join over t1 and t2 is a set of K points, where (K-2) of the points are the points of SL and each of the two remaining points is a point on a different axis of the remaining axes of the slice. Two tapes are joinable if they can produce a cross-join.
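The following sketch is ours, not part of the paper; the point labels simply reuse R1-R4 and C1-C6 from Fig. 1. It enumerates the tapes and cross-joins of a 2D-slice directly from these definitions, for the two free axes of the running example.

from itertools import product

# A 2D-slice leaves two "free" axes; each free axis is represented here by its list of points.
rows_points    = ["R1", "R2", "R3", "R4"]
columns_points = ["C1", "C2", "C3", "C4", "C5", "C6"]

def tapes(free_axis_a, free_axis_b):
    """A tape pins one point on one of the two free axes (and keeps the slice's other points)."""
    return [("rows", p) for p in free_axis_a] + [("columns", p) for p in free_axis_b]

def cross_joins(free_axis_a, free_axis_b):
    """A cross-join pins one point on each of the two free axes."""
    return list(product(free_axis_a, free_axis_b))

print(len(tapes(rows_points, columns_points)))        # 10 tapes: R1..R4 and C1..C6
print(len(cross_joins(rows_points, columns_points)))  # 24 cross-joins, one per visible cell of Fig. 1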
4. Bridging the presentation and the logical layers of CPM
Cross-joins form the bridge between the logical and the presentational model. In this section we provide a theorem proving that a cross-join is a secondary cube. Then, we show how common OLAP operations can be performed on the basis of our model. The proofs can be found in [7].
Theorem 1. A cross-join is equivalent to a secondary cube.
The only difference between a tape and a cross-join is that the cross-join restricts all of its dimensions with equality constraints, whereas the tape constrains only a subset of them. Moreover, from the definition of joinable tapes it follows that a 2D-slice contains as many cross-joins as the number of pairs of joinable tapes belonging to this particular slice. This observation also helps us to understand why a tape can also be viewed as a collection of cross-joins (or cubes). Each of these cross-joins is defined by the (K-1) points of the tape and one point from one of its joinable tapes; this point belongs to the points of the axis the tape is parallel to. Consequently, we are allowed to treat a tape as a set of cubes: t=[c1,…,ck]. Thus we have the following lemma.
Lemma 1. A tape is a finite set of secondary cubes.
We briefly describe how usual operations of OLAP tools, such as roll-up, drill-down, pivot, etc., can be mapped to operations over 2D-slices and tapes.
− Roll-up. Roll-up is performed over a set of tapes. Initially, the key points of these tapes are eliminated and replaced by their ancestor values. Then the tapes are also eliminated and replaced by tapes defined by the respective keys of these ancestor values. The cross-joins that emerge can be computed through the appropriate aggregation of the underlying data.
− Drill-down. Drill-down is exactly the opposite of the roll-up operation. The only difference is that normally, the existing tapes are not removed, but rather complemented by the tapes of the lower-level values.
− Pivot. Pivot means moving one dimension from an axis to another. The contents of the 2D-slice over which pivot is performed are not recomputed; instead, they are just reorganized in their presentation.
− Selection. A selection condition (primary or secondary) is evaluated against the points of the axes, or the content of the 2D-slice. In every case, the calculation of the new 2D-slice is based on the propagation of the selection to the already computed cubes.
− Slice. Slice is a special form of roll-up, where a dimension is rolled up to the level ALL. In other words, the dimension is not taken into account any more in the groupings over the underlying data set. Slicing can also mean the reconstruction of the multicube by moving the sliced dimension to the Invisible axis.
− ROLLUP [9]. In the relational context, the ROLLUP operator takes all combinations of attributes participating in the grouping of a fact table and produces all the
possible tables, with these marginal aggregations, out of the original query. In our context, this can be done by producing all combinations of Slice operations over the levels of the underlying data set. One can even go further by combining roll-ups to all the combinations of levels in a hierarchy.
5. Conclusions and Future Work
In this paper we have introduced the Cube Presentation Model, a presentation model for OLAP data which formalizes previously proposed standards for a presentation layer and which, to the best of our knowledge, is the only formal presentational model for OLAP in the literature. Our contributions can be listed as follows: (a) we have presented an extension of a previous logical model for cubes to handle more complex cases; (b) we have introduced a novel presentational model for OLAP screens, intuitively based on the geometrical representation of a cube and its human perception in space; (c) we have discussed how these two models can be smoothly integrated; and (d) we have suggested how typical OLAP operations can be easily mapped to the proposed presentational model. Next steps in our research include the introduction of suitable visualization techniques for CPM, complying with current standards and recommendations as far as usability and user interface design are concerned, and its extension to address the specific visualization requirements of mobile devices.
References
[1] S. Chaudhuri, U. Dayal: An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, 26(1), March 1997.
[2] P. Vassiliadis, T. Sellis: A Survey of Logical Models for OLAP Databases. ACM SIGMOD Record, 28(4), Dec. 1999.
[3] D.A. Keim: Visual Data Mining. Tutorials of the 23rd International Conference on Very Large Data Bases, Athens, Greece, 1997.
[4] A. Inselberg: Visualization and Knowledge Discovery for High Dimensional Data. 2nd Workshop Proceedings UIDIS, IEEE, 2001.
[5] M. Gebhardt, M. Jarke, S. Jacobs: A Toolkit for Negotiation Support Interfaces to Multi-Dimensional Data. ACM SIGMOD 1997, pp. 348-356.
[6] Microsoft Corp.: OLEDB for OLAP, February 1998. Available at: http://www.microsoft.com/data/oledb/olap/.
[7] A. Maniatis, P. Vassiliadis, S. Skiadopoulos, Y. Vassiliou: CPM: A Cube Presentation Model (Long Version). http://www.dblab.ece.ntua.gr/~andreas/publications/CPM_dawak03.pdf.
[8] P. Vassiliadis, S. Skiadopoulos: Modeling and Optimization Issues for Multidimensional Databases. Proc. of CAiSE-00, Stockholm, Sweden, 2000.
[9] J. Gray et al.: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals. Proc. of ICDE 1996.
Computation of Sparse Data Cubes with Constraints
Changqing Chen¹, Jianlin Feng², and Longgang Xiang³
¹ School of Software, Huazhong Univ. of Sci. & Tech., Wuhan 430074, Hubei, China
[email protected]
² School of Computer Science, Huazhong Univ. of Sci. & Tech., Wuhan 430074, Hubei, China
[email protected]
³ School of Computer Science, Huazhong Univ. of Sci. & Tech., Wuhan 430074, Hubei, China
[email protected]

Abstract. For a data cube there are always constraints between dimensions or between attributes in a dimension, such as functional dependencies. We introduce the problem of how, when such functional dependencies exist, to use them to speed up the computation of sparse data cubes. A new algorithm, CFD, is presented to satisfy this demand. CFD determines the order of dimensions by considering their cardinalities and the functional dependencies between them together. It makes dimensions with functional dependencies adjacent and their codes satisfy a monotonic mapping, thus reducing the number of partitions for such dimensions. It also combines bottom-up partitioning with top-down aggregate computation to speed up the computation further. In addition, CFD can efficiently compute a data cube with hierarchies from the smallest granularity to the coarsest one, with at most one attribute in a dimension taking part in the computation each time. The experiments show that CFD achieves a significant performance improvement.
1 Introduction

OLAP often pre-computes a large number of aggregates to improve the performance of aggregation queries. A new operator, CUBE BY [5], was introduced to represent a set of group-by operations, i.e., to compute aggregates for all possible combinations of attributes in the CUBE BY clause. The following Example 1 shows a cube computation query on a relation SALES (employee, product, customer, quantity).
Example 1:
SELECT employee, product, customer, SUM (quantity)
FROM SALES
CUBE BY employee, product, customer
It will compute group-bys for (employee, product, customer), (employee, product), (employee, customer), (product, customer), (employee), (product), (customer) and ALL (no GROUP BY). The attributes in the CUBE BY clause are called dimensions and the attributes aggregated are called measures. For n dimensions, 2^n group-bys are
computed. The number of distinct values of a dimension is its cardinality. Each combination of attribute values from different dimensions constitutes a cell. If empty cells are a majority of the whole cube, then the cube is sparse. Relational normal forms are hardly suitable for OLAP cubes because of the different goals of operational and OLAP databases. The main goal of operational databases is to avoid update anomalies, and the relational normal forms are well suited to this goal. But for OLAP databases the efficiency of queries is the most important issue. So for a cube there are always constraints between dimensions or between attributes in a dimension, such as functional dependencies. Sparsity clearly depends on actual data. However, functional dependencies between dimensions may imply potential sparsity [4]. A tacit assumption of all previous algorithms is that dimensions are independent of each other, and so none of these algorithms considered the effect of functional dependencies on computing cubes. Algebraic functions COUNT, SUM, MIN and MAX have the key property that more detailed aggregates (i.e., more dimensions) can be used to compute less detailed aggregates (i.e., fewer dimensions). This property induces a partial ordering (i.e., a lattice) on all group-bys of the CUBE. A group-by is called a child of some parent group-by if the parent can be used to compute the child (and no other group-by is between the parent and the child). The algorithms [1, 2, 3, 6] recognize that group-bys with common attributes can share partitions, sorts, or partial sorts. The difference between them is how they exploit such properties. Among these algorithms, BUC [1] computes bottom up, while the others compute top down. This paper addresses full cube computation over sparse data cubes and makes the following contributions:
1. We introduce the problem of computing sparse data cubes with constraints, which allows us to use such constraints to speed up the computation. A new algorithm, CFD (Computation by Functional Dependencies), is presented to satisfy this demand. CFD determines the partitioning order of dimensions by considering their cardinalities and the functional dependencies between them together. Therefore the correlated dimensions can share sorts.
2. CFD partitions the group-bys of a data cube from bottom up, and at the same time it computes aggregate values from top down by summing up the return values of smaller partitions. Even if all the dimensions are independent of each other, CFD is still faster than BUC at computing full cubes.
3. Few algorithms deal with hierarchies in dimensions. CFD can compute a sparse data cube with hierarchies in dimensions. In this situation, CFD efficiently computes from the smallest granularity to the coarsest one.
The rest of this paper is organized as follows: Section 2 presents the problem of sparse cubes with constraints. Section 3 illustrates how to decide the partitioning order of dimensions. Section 4 presents a new algorithm for the computation of sparse cubes, called CFD. Our performance analysis is described in Section 5. Related work is discussed in Section 6. Section 7 contains conclusions.
2 The Problem

Let C = (D, M) be an OLAP cube schema, where D is the set of dimensions and M the set of measures. Two attributes X and Y with a one-to-one or many-to-one relation have a functional dependency X→Y, where X is called a determining attribute and Y is called a depending attribute. Such a functional dependency can exist between two dimensions or between two attributes in a dimension. The problem is how, when such constraints (functional dependencies) exist, to use them to speed up the computation of sparse cubes. The dependencies considered in CFD are only those whose left and right sides each contain a single attribute. Such functional dependencies help in data pre-processing (see Section 3.2) and in partitioning dimensions (see Section 4.1). Functional dependencies between dimensions imply the structural sparsity of a cube [4]. With no functional dependencies, the structural sparsity is zero. Considering the cube in Example 1, if we know that one employee sells only one product, we get a functional dependency employee→product. Assume we have 6 employees, 4 customers, and 3 different products; then the size of the cube is 72 cells. Further, the total number of occupied cells in the whole cube is at most 6×4=24, thus the structural sparsity is 67%.
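The arithmetic of this example can be checked with a few lines of Python (ours, purely illustrative): with employee→product, the number of non-empty cells is bounded by the product of the cardinalities of the determining dimensions only.

# Cube dimensions and cardinalities from Example 1.
card = {"employee": 6, "product": 3, "customer": 4}
fds = [("employee", "product")]          # employee -> product

full_size = 1
for c in card.values():
    full_size *= c                        # 6 * 3 * 4 = 72 cells

# Depending dimensions contribute no extra combinations: at most one product per employee.
depending = {rhs for _, rhs in fds}
bound = 1
for d, c in card.items():
    if d not in depending:
        bound *= c                        # 6 * 4 = 24 occupied cells at most

sparsity = 1 - bound / full_size
print(full_size, bound, round(sparsity, 2))   # 72 24 0.67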
3 Data Preprocessing

CFD partitions from bottom up just like BUC, i.e., it first partitions on one dimension, then on two dimensions, etc. One difference between CFD and BUC is that CFD chooses the order of dimensions by functional dependencies and cardinalities together.

3.1 Deciding the Order of Dimensions

First we build a directed graph from the functional dependencies between dimensions, called the FD graph. The graph ignores all transitive dependencies (i.e., dependencies that can be deduced from other dependencies). A node in the graph is a dimension. Once the graph has been built, we try to classify the nodes. We find the longest path in the graph in order to make the most of the dependencies. The nodes in such a path form a dependency set and are deleted from the graph. This process is repeated until the graph is empty. The time complexity of this process is O(n^2), where n is the number of dimensions.
Example 2: A cube has six dimensions from A to F with cardinalities in descending order and functional dependencies A→C, A→D, C→E, B→F. Figure 1 is the corresponding FD graph. From Figure 1, we first get the dependency set {A, C, E}, since its nodes form the longest path, then {B, F} and at last {D}. The elements in each set are ordered by the dependencies. Although there is a functional dependency between A and D, it is not considered, so the dependency set {D} contains only the dimension D itself. After getting the dependency sets, CFD sorts them in descending order by the biggest cardinality of a dimension in each set. Then we merge the sets sequentially to determine
the order of dimensions. By this approach, CFD can make the depending dimension share the sort of the determining dimension, because the two dimensions are put together. If there is no functional dependency, the partitioning order of CFD is just the same as that of BUC.
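The following sketch is our reading of this procedure, not the authors' code; the cardinalities assigned to A-F are invented just to make them descending. It builds the FD graph, repeatedly extracts the longest path as a dependency set, and then sorts and merges the sets, reproducing the outcome of Example 2.

def longest_path(edges, nodes):
    """Longest path in the (acyclic) FD graph, found by depth-first search."""
    adj = {n: [v for u, v in edges if u == n] for n in nodes}
    best = []
    def dfs(node, path):
        nonlocal best
        path = path + [node]
        if len(path) > len(best):
            best = path
        for nxt in adj[node]:
            if nxt in path:                # guard against cycles in malformed input
                continue
            dfs(nxt, path)
    for n in nodes:
        dfs(n, [])
    return best

def dependency_sets(dimensions, fds):
    nodes, edges = set(dimensions), list(fds)
    sets = []
    while nodes:
        path = longest_path([(u, v) for u, v in edges if u in nodes and v in nodes], nodes)
        if not path:                       # isolated node: forms its own set
            path = [nodes.pop()]
        sets.append(path)
        nodes -= set(path)
    return sets

dims = {"A": 100, "B": 90, "C": 80, "D": 70, "E": 60, "F": 50}   # illustrative cardinalities
fds = [("A", "C"), ("A", "D"), ("C", "E"), ("B", "F")]
sets = dependency_sets(list(dims), fds)
sets.sort(key=lambda s: max(dims[d] for d in s), reverse=True)
order = [d for s in sets for d in s]
print(sets, order)    # [['A', 'C', 'E'], ['B', 'F'], ['D']] merged to the order A C E B F D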
Fig. 1. FD graph (nodes A-F with edges A→C, A→D, C→E, B→F)

Fig. 2. The encoding of two dimensions with a functional dependency (the (employee, product) pairs of Example 1 are sorted on product and then mapped to integer codes)
3.2 Data Encoding

Like other algorithms for computing a data cube, CFD assumes that each dimension value is an integer between zero and its cardinality, and that the cardinality is known in advance. A usual data encoding does not consider the correlations between dimensions and simply maps each dimension value to an integer between zero and its cardinality. This operation is similar to sorting on the values of a dimension. In order to share sorts, CFD encodes adjacent dimensions with functional dependencies jointly, to make their codes satisfy a monotonic mapping. For example, let X and Y be two dimensions and f a functional dependency from X to Y. Assume there are two arbitrary values xi and xj on dimension X, and yi = f(xi) and yj = f(xj) are the corresponding values on dimension Y. If xi > xj, we have yi ≥ yj; then y = f(x) is monotonic. Due to the functional dependency between X and Y, the approach of encoding is to sort on dimension Y first; then the values of X and Y can be mapped sequentially to integers between zero and their cardinalities respectively. Figure 2 shows the encoding of two dimensions with the functional dependency employee→product of Example 1. Obviously, if the left or right side of a functional dependency has more than one attribute, it is difficult to encode like this. Note that the mapping relations can be reflected in the fact table for correlated dimensions, but for hierarchies in a dimension the mapping relations should be stored separately.
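A minimal sketch of this joint encoding (ours; the pairs echo Fig. 2, though the particular code values depend on the sort order chosen): sort the (X, Y) pairs on the depending attribute Y and assign codes in that order, so that the code of Y is a monotonic function of the code of X.

# (employee, product) pairs; employee -> product holds (each employee sells one product).
pairs = [("Tom", "towel"), ("Bob", "soap"), ("Smith", "soap"),
         ("White", "shaver"), ("Louis", "soap"), ("Ross", "towel")]

# Step 1: sort on the depending attribute (product).
rows = sorted(pairs, key=lambda p: p[1])

# Step 2: assign codes sequentially in that order.
product_code = {}
employee_code = {}
for emp, prod in rows:
    product_code.setdefault(prod, len(product_code))
    employee_code[emp] = len(employee_code)

# The mapping employee code -> product code is now monotonic (non-decreasing).
encoded = sorted((employee_code[e], product_code[p]) for e, p in pairs)
print(encoded)
assert all(encoded[i][1] <= encoded[i + 1][1] for i in range(len(encoded) - 1))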
4 Algorithm CFD

We propose a new algorithm called CFD for the computation of full sparse data cubes with constraints. The idea of CFD is to take advantage of functional dependencies to share partitions and to make use of the property of algebraic functions to reduce aggregation costs. CFD was inspired by the BUC algorithm and is similar to a version of BUC except for the aggregation computation and the partition function. After data preprocessing, we can now compute a sparse data cube.
CFD(input, dim)
Inputs:
  input: the relation to aggregate.
  dim: the starting dimension to partition.
Globals:
  numdims: the total number of dimensions.
  dependentable[]: the dependency sets obtained from Section 3.1.
  hierarchy[numdims]: the height of the hierarchies in each dimension.
  cardinality[numDims][]: the cardinality of each dimension.
  dataCount[numdims]: the size of each partition.
  aggval[numdims]: sums the results of smaller partitions.
1: if (dim == numdims) aggval[dim] = Aggregate(input); // the result of a thinnest partition
2: FOR d = dim; d < numdims; d++ DO
3:   FOR h = 0; h

Fig. 4. Execution Plan, language and content (the grammar of the Execution Plan language, covering the sources, the derivation relationship, cost estimation, and the synchronization scheme).
A Multidimensional Aggregation Object (MAO) Framework
An example of an EP is shown in Figure 5. The system uses this EP to establish and maintain relationships between the source and target MAOs. The derivation relationship reflects that square sums of total passengers by different weekdays, aircrafts, and time blocks can be derived from the square sums by different dates, aircrafts, and time blocks. Since square sum is a distributive function, this is a distributive relationship and enables the incremental compensation approach for data synchronization. Cost estimations provide information to guide the system for cache placement and synchronization.

((Date,AC,TB), SqrSum, # of passengers) -> ((WD,AC,TB), SqrSum, # of passengers)
Derivation_Relationship:
  Source: ((Date,AC,TB), SqrSum, # of passengers)
  Target: ((WD,AC,TB), SqrSum, # of passengers)
  ## W: cursor on target entry; X,Y: cursor on source entry; S: accumulating result.
  Distributive_Aggregation:
    if sources are raw data (most detailed fact) => W = X*X + S
    else W = Y + S
Cost_Estimation:
  For_caching:
    Caching_computation: f1(size_of_source)
    Caching_retrieval: f2(size_of_source)
  For_synchronization:
    Compensation_cost:
      computation_cost: f1(size_of_inserted) + f3(size_of_deleted)
      access_cost: f2(size_of_inserted + size_of_deleted)
    Recomputation_cost:
      compute: f1(size_of_updated_source)
      access: f2(size_of_updated_source)
Synchronization_scheme:
  direct_compensation:
    if sources are raw data => W = W + I*I - D*D
    else W = W + I - D
  indirect_compensation: *pointer to source's compensation plan.
  ## recomputation_scheme is the same as in Distributive_Aggregation except setting the source's cursors on the updated source

Fig. 5. An example of an Execution Plan.
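To illustrate the direct compensation scheme of Fig. 5, here is a small sketch of ours (not the system's code; the raw tuples are invented and, for brevity, keyed directly by weekday rather than rolled up from dates). It maintains the cached square sums incrementally under insertions and deletions.

# Cached aggregate: square sum of passenger counts per (weekday, aircraft, time block).
cache = {}

def build(cache, raw):
    for (weekday, aircraft, timeblock), passengers in raw:
        key = (weekday, aircraft, timeblock)
        cache[key] = cache.get(key, 0) + passengers * passengers   # W = X*X + S on raw data

def compensate(cache, inserted, deleted):
    # Direct compensation on raw data: W = W + I*I - D*D
    for key, passengers in inserted:
        cache[key] = cache.get(key, 0) + passengers * passengers
    for key, passengers in deleted:
        cache[key] = cache.get(key, 0) - passengers * passengers

raw = [(("Mon", "B747", "am"), 120), (("Mon", "B747", "am"), 80), (("Tue", "A320", "pm"), 60)]
build(cache, raw)
compensate(cache, inserted=[(("Mon", "B747", "am"), 50)], deleted=[(("Tue", "A320", "pm"), 60)])
print(cache)   # {('Mon', 'B747', 'am'): 23300, ('Tue', 'A320', 'pm'): 0}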
7 Conclusion
In this paper, we introduced a Multidimensional Aggregation Object (MAO) model which consists of the aggregation function, the aggregation values, and the aggregation scope in a multidimensional environment. MAO represents the aggregated values in a multidimensional structure, and provides information to reuse lower-level and simpler aggregations for composite aggregations. This information can improve performance and maintain potential data dependencies. The caching placement algorithm is proposed to efficiently reuse intermediate aggregation results. Because the MAO model provides more information on
aggregation than presenting data at different levels by scope, caching MAOs provides significant performance improvements compared to conventional techniques that cache scopes. To maintain the cached data while the raw data are updated, two techniques can be used to synchronize the cached data. If an inverse aggregation function exists, then the incremental approach should be used, which uses the inverse function to compensate the cached results. If the inverse aggregation function is not available, then a full reaggregation is needed using the newly updated data. The information for processing MAOs can be specified in an Execution Plan (EP). By providing the derivation relationships, cost estimating functions, and synchronization plans in the EP, a system can efficiently reuse and maintain intermediate data. Experimental results show that the application of a caching method using MAO can yield close to an order of magnitude of improvement in computations as compared with a method that does not use the MAO model. By tracing derivation relationships among the MAOs, the system provides related aggregations at all levels and can therefore be systematically maintained. Therefore, our proposed methodology provides a more versatile, efficient, and coherent environment for complex aggregation tasks.
The GMD Data Model for Multidimensional Information: A Brief Introduction
Enrico Franconi and Anand Kamble
Faculty of Computer Science, Free Univ. of Bozen-Bolzano, Italy
[email protected] [email protected]

Abstract. In this paper we introduce a novel data model for multidimensional information, GMD, generalising the MD data model first proposed by Cabibbo et al. (EDBT-98). The aim of this work is not to propose yet another multidimensional data model, but to find the general precise formalism encompassing all the proposals for a logical data model in the data warehouse field. Our proposal is compatible with all these proposals, therefore making possible a formal comparison of the differences of the models in the literature, and the study of formal properties or extensions of such data models. Starting with a logic-based definition of the semantics of the GMD data model and of the basic algebraic operations over it, we show how the most important approaches in DW modelling can be captured by it. The star and the snowflake schemas, Gray's cube, Agrawal's and Vassiliadis' models, MD and other multidimensional conceptual data models can be captured uniformly by GMD. In this way it is possible to formally understand the real differences in expressivity of the various models, their limits, and their potentials.
1 Introduction
In this short paper we introduce a novel data model for multidimensional information, GMD, generalising the MD data model first proposed in [2]. The aim of this work is not to propose yet another data model, but to find the most general formalism encompassing all the proposals for a logical data model in the data warehouse field, as for example summarised in [10]. Our proposal is compatible with all these proposals, making therefore possible a formal comparison of the different expressivities of the models in the literature. We believe that the GMD data model is already very useful since it provides a very precise and, in our view, elegant and uniform way to model multidimensional information. It turns out that most of the proposals in the literature make many hidden assumptions which may harm the understanding of the advantages or disadvantages of the proposal itself. An embedding in our model would make all these assumptions explicit.
So far, we have considered, together with the classical basic star and snowflake ER-based models and multidimensional cubes, the logical data models introduced in [2, 5, 1, 6, 9, 11, 3, 7, 8]. A complete account of both the GMD data model (including an extended algebra) and of the various encodings can be found in [4]; in this paper we just give a brief introduction to the basic principles of the data model. GMD is completely defined using a logic-based approach. We start by introducing a data warehouse schema, which is nothing else than a set of fact definitions which restricts (i.e., constrains) the set of legal data warehouse states associated with the schema. By systematically defining how the various operators used in a fact definition constrain the legal data warehouse states, we give a formal logic-based account of the GMD data model.
2 The Syntax of the GMD Data Model
We introduce in this Section the notion of data warehouse schema. A data warehouse schema basically introduces the structures of the cubes that will populate the warehouse, together with the types allowed for the components of the structures. The definition of a GMD schema that follows is explained step by step.
Definition 1 (GMD schema). Consider the signature <F, D, L, M, V, A>, where F is a finite set of fact names, D is a finite set of dimension names, L is a finite set of level names – each one associated to a finite set of level element names, M is a finite set of measure names, V is a finite set of domain names – each one associated to a finite set of values, and A is a finite set of level attributes.
➽ We have just defined the alphabet of a data warehouse: we may have fact names (like SALES, PURCHASES), dimension names (like Date, Product), level names (like year, month, product-brand, product-category) and their level elements (like 2003, 2004, heineken, drink), measure names (like Price, UnitSales), domain names (like integers, strings), and level attributes (like is-leap, country-of-origin).
A GMD schema includes:
– a finite set of fact definitions of the form
F ≐ E {D1|L1, …, Dn|Ln} : {M1|V1, …, Mm|Vm},
where E, F ∈ F, Di ∈ D, Li ∈ L, Mj ∈ M, Vj ∈ V. We call the fact name F a defined fact, and we say that F is based on E. A fact name not appearing at the left-hand side of a definition is called an undefined fact. We will generally call fact either a defined fact or an undefined fact. A fact based on an undefined fact is called a basic fact. A fact based on a defined fact is called an aggregated fact. A fact is dimensionless if n = 0; it is measureless if m = 0. The orderings in a defined fact among dimensions and among measures are irrelevant.
➽ We have here introduced the building block of a GMD schema: the fact definition. A basic fact corresponds to the base data of any data warehouse: it is the cube structure that contains all the data on which any other cube will be built. In the following example, BASIC-SALES is a basic fact, including base data about sale transactions, organised by date, product, and store (which are the dimensions of the fact), respectively restricted to the levels day, product, and store, and with unit sales and sale price as measures:
BASIC-SALES ≐ SALES {Date|day, Product|product, Store|store} : {UnitSales|int, SalePrice|int}
– a partial order (L, ≤) on the levels in L. We call ≺ the immediate predecessor relation on L induced by ≤.
➽ The partial order defines the taxonomy of levels. For example,
month quarter and day week; product type category – a finite set of roll-up partial functions between level elements → Lj ρLi ,Lj : Li for each Li , Lj such that Li Lj . We call ρ∗Li ,Lj the reflexive transitive closure of the roll-up functions inductively defined as follows: ρ∗Li ,Li = id ρ∗Li ,Lj = k ρLi ,Lk ◦ ρ∗Lk ,Lj where (ρLp ,Lq ∪ ρLr ,Ls )(x) = y
for each k such that Li Lk ρLp ,Lq (x) = ρLr ,Ls (x) = y, or ρL ,L (x) = y and ρLr ,Ls (x) = ⊥, or iff p q ρLp ,Lq (x) = ⊥ and ρLr ,Ls (x) = y
➽ When in a schema various levels are introduced for a dimension, it is also necessary to introduce roll-up functions for them. A roll-up function defines how elements of one level map to elements of a superior level. Since we only require the roll-up functions to be partial, it is possible that some elements of a level roll up to an upper level, while other elements skip that upper level and are mapped directly to a superior one. For example, ρday,month(1/1/01) = Jan-01, ρday,month(2/1/01) = Jan-01, ..., ρquarter,year(Qtr1-01) = 2001, ρquarter,year(Qtr2-01) = 2001, ... (a small computational sketch of these roll-up functions follows the example in Section 2.1 below).
– a finite set of level attribute definitions:
  L ≐ {A1|V1, ..., An|Vn}
where L ∈ L, Ai ∈ A and Vi ∈ V for each i, 1 ≤ i ≤ n.
➽ Level attributes are properties associated with levels. For example,
  product ≐ {prodname|string, prodnum|int, prodsize|int, prodweight|int}
– a finite set of measure definitions of the form
  N ≐ f(M)
where N, M ∈ M, and f is an aggregation function f : B(V) → W, for some V, W ∈ V. B(V) is the finite set of all bags obtainable from values in V whose cardinality is bound by some finite integer Ω.
➽ Measure definitions are used to compute the values of measures in an aggregated fact from the values of the fact it is based on. For example:
  Total-UnitSales ≐ sum(UnitSales) and Avg-SalePrice ≐ average(SalePrice)
Levels and facts are subject to additional syntactical well-foundedness conditions:
– The connected components of (L, ≤) must each have a unique least element, which is called a basic level.
➽ The basic level contains the finest grained level elements, on top of which all the facts are identified. For example, store ≺ city ≺ country; store is a basic level.
– For each undefined fact there can be at most one basic fact based on it.
➽ This allows us to disregard undefined facts, which are in one-to-one correspondence with basic facts.
– Each aggregated fact must be congruent with the defined fact it is based on, i.e., for each aggregated fact G and for the defined fact F it is based on such that
  F ≐ E {D1|L1, ..., Dn|Ln} : {M1|V1, ..., Mm|Vm}
  G ≐ F {D1|R1, ..., Dp|Rp} : {N1|W1, ..., Nq|Wq}
the following must hold (for some reordering of the dimensions):
• the dimensions in the aggregated fact G are among the dimensions of the fact F it is based on: p ≤ n
• the level of a dimension in the aggregated fact G is above the level of the corresponding dimension in the fact F it is based on: Li ≤ Ri for each i ≤ p
• each measure in the aggregated fact G is computed via an aggregation function from some measure of the defined fact F it is based on:
  N1 ≐ f1(Mj(1)), ..., Nq ≐ fq(Mj(q))
Moreover, the range and the domain of the aggregation function should be in agreement with the domains specified respectively in the aggregated fact G and in the fact F it is based on.
➽ Here we give a more precise characterisation of an aggregated fact: its dimensions should be among the dimensions of the fact it is based on, its levels should be generalised from the corresponding ones in the fact it is based on, and its measures should all be computed from the fact it is based on. For example, given the basic fact BASIC-SALES:
  BASIC-SALES ≐ SALES {Date|day, Product|product, Store|store} : {UnitSales|int, SalePrice|int}
the following SALES-BY-MONTH-AND-TYPE is an aggregated fact computed from the BASIC-SALES fact:
  SALES-BY-MONTH-AND-TYPE ≐ BASIC-SALES {Date|month, Product|type} : {Total-UnitSales|int, Avg-SalePrice|real}
with the following aggregated measures:
  Total-UnitSales ≐ sum(UnitSales)
  Avg-SalePrice ≐ average(SalePrice)

2.1 Example
The following GMD schema summarises the examples shown in the previous Section: – Signature: • F = {SALES, BASIC-SALES, SALES-BY-MONTH-AND-TYPE, PURCHASES} • M = {UnitSales, Price, Total-UnitSales, Avg-Price} • D = {Date, Product, Store} • L = {day, week, month, quarter, year, product, type, category, brand, store, city, country } day = {1/1/01, 2/1/01, . . . , 1/1/02, 2/1/02, . . . } month = {Jan-01, Feb-01, . . . , Jan-02, Feb-02, . . . } quarter = {Qtr1-01, Qtr2-01, . . . , Qtr1-02, Qtr2-02, . . . } year = {2001, 2002} ···
• V = {int, real, string}
• A = {dayname, prodname, prodsize, prodweight, storenumb}
– Partial order over levels:
• day ≺ month ≺ quarter ≺ year, day ≺ week; day is a basic level
• product ≺ type ≺ category, product ≺ brand; product is a basic level
• store ≺ city ≺ country; store is a basic level
– Roll-up functions:
ρday,month(1/1/01) = Jan-01, ρday,month(2/1/01) = Jan-01, ...
ρmonth,quarter(Jan-01) = Qtr1-01, ρmonth,quarter(Feb-01) = Qtr1-01, ...
ρquarter,year(Qtr1-01) = 2001, ρquarter,year(Qtr2-01) = 2001, ...
ρ*day,year(1/1/01) = 2001, ρ*day,year(2/1/01) = 2001, ...
···
– Level Attributes:
day ≐ {dayname|string, daynum|int}
product ≐ {prodname|string, prodnum|int, prodsize|int, prodweight|int}
store ≐ {storename|string, storenum|int, address|string}
– Facts:
BASIC-SALES ≐ SALES {Date|day, Product|product, Store|store} : {UnitSales|int, SalePrice|int}
SALES-BY-MONTH-AND-TYPE ≐ BASIC-SALES {Date|month, Product|type} : {Total-UnitSales|int, Avg-SalePrice|real}
– Measures:
Total-UnitSales ≐ sum(UnitSales)
Avg-SalePrice ≐ average(SalePrice)
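As a quick illustration of the roll-up machinery used in this example, the following Python sketch (ours, not part of the GMD paper) encodes the roll-up functions listed above and computes the reflexive transitive closure ρ* by following the immediate-predecessor relation; level and element names are taken from the example, and the week level is omitted for brevity.

```python
# Immediate-predecessor relation between levels: level -> next coarser levels
PRED = {"day": ["month"], "month": ["quarter"], "quarter": ["year"]}

# Roll-up partial functions rho[(Li, Lj)], with values from the running example
RHO = {
    ("day", "month"): {"1/1/01": "Jan-01", "2/1/01": "Jan-01"},
    ("month", "quarter"): {"Jan-01": "Qtr1-01", "Feb-01": "Qtr1-01"},
    ("quarter", "year"): {"Qtr1-01": "2001", "Qtr2-01": "2001"},
}

def rollup_star(li, lj, x):
    """rho*_{Li,Lj}(x): reflexive transitive closure; None when undefined."""
    if li == lj:                       # rho*_{Li,Li} = id
        return x
    for lk in PRED.get(li, []):        # union over the levels immediately above Li
        y = RHO.get((li, lk), {}).get(x)
        if y is not None:
            z = rollup_star(lk, lj, y)
            if z is not None:          # the schema guarantees overlapping paths agree
                return z
    return None

print(rollup_star("day", "year", "1/1/01"))   # -> 2001
```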
3 GMD Semantics
Having just defined the syntax of GMD schemas, we now introduce their semantics through a well-founded model theory. We define the notion of a data warehouse state, namely a specific data warehouse, and we formalise when a data warehouse state is actually in agreement with the constraints imposed by a GMD schema.

Definition 2 (Data Warehouse State). A data warehouse state over a schema with the signature ⟨F, D, L, M, V, A⟩ is a tuple I = ⟨∆, Λ, Γ, ·^I⟩, where
– ∆ is a non-empty finite set of individual facts (or cells) of cardinality smaller than Ω;
➽ Elements in ∆ are the object identifiers for the cells in a multidimensional cube; we call them individual facts.
– Λ is a finite set of level elements;
– Γ is a finite set of domain elements;
– ·^I is a function (the interpretation function) such that
  F^I ⊆ ∆   for each F ∈ F, where F^I is disjoint from any other E^I such that E ∈ F
  L^I ⊆ Λ   for each L ∈ L, where L^I is disjoint from any other H^I such that H ∈ L
  V^I ⊆ Γ   for each V ∈ V, where V^I is disjoint from any other W^I such that W ∈ V
  D^I = ∆ → Λ   for each D ∈ D
  M^I = ∆ → Γ   for each M ∈ M
  A_i^I = L^I → Γ   for each L ∈ L and each A_i ∈ A that is an attribute of L
(Note: in the paper we will omit the ·^I interpretation function applied to a symbol whenever this is non-ambiguous.)
➽ The interpretation function defines a specific data warehouse state given a GMD signature, regardless of any fact definition. It associates to a fact name a set of cells (individual facts), which are meant to form a cube. To each cell corresponds a level element for some dimension name: the sequence of these level elements is meant to be the “coordinate” of the cell. Moreover, to each cell corresponds a value for some measure name. Since fact definitions in the schema are not considered yet at this stage, the dimensions and the measures associated to cells are still arbitrary. In the following, we will introduce the notion of legal data warehouse state, which is the data warehouse state which conforms to the constraints imposed by the fact definitions. A data warehouse state will be called legal for a given GMD schema if it is a data warehouse state in the signature of the GMD schema and it satisfies the additional conditions found in the GMD schema. A data warehouse state is legal with respect to a GMD schema if:
– for each fact F ≐ E {D1|L1, ..., Dn|Ln} : {M1|V1, ..., Mm|Vm} in the schema:
• the function associated to a dimension which does not appear in a fact is undefined for its cells:
  ∀f. F(f) → f ∉ dom(D)   for each D ∈ D such that D ≠ Di for each i ≤ n
➽ This condition states that the level elements associated to a cell of a fact should correspond only to the dimensions declared in the fact definition of the schema. That is, a cell has only the declared dimensions in any legal data warehouse state.
• each cell of a fact has a unique set of dimension values at the appropriate level:
  ∀f. F(f) → ∃l1, ..., ln. D1(f) = l1 ∧ L1(l1) ∧ ... ∧ Dn(f) = ln ∧ Ln(ln)
➽ This condition states that the level elements associated to a cell of a fact are unique for each dimension declared for the fact in the schema. So, a cell has a unique value for each declared dimension in any legal data warehouse state.
• a set of dimension values identifies a unique cell within a fact:
  ∀f, f′, l1, ..., ln. F(f) ∧ F(f′) ∧ D1(f) = l1 ∧ D1(f′) = l1 ∧ ... ∧ Dn(f) = ln ∧ Dn(f′) = ln → f = f′
➽ This condition states that a sequence of level elements associated to a cell of a fact is associated only to that cell. Therefore, the sequence of dimension values can really be seen as an identifying coordinate for the cell. In other words, these conditions force the legal data warehouse state to really model a cube according to the specification given in the schema.
• the function associated to a measure which does not appear in a fact is undefined for its cells:
  ∀f. F(f) → f ∉ dom(M)   for each M ∈ M such that M ≠ Mi for each i ≤ m
➽ This condition states that the measure values associated to a cell of a fact in a legal data warehouse state should correspond only to the measures explicitly declared in the fact definition of the schema.
• each cell of a fact has a unique set of measures:
  ∀f. F(f) → ∃m1, ..., mm. M1(f) = m1 ∧ V1(m1) ∧ ... ∧ Mm(f) = mm ∧ Vm(mm)
➽ This condition states that the measure values associated to a cell of a fact are unique for each measure explicitly declared for the fact in the schema. So, a cell has a unique measure value for each declared measure in any legal data warehouse state.
– for each aggregated fact and for the defined fact it is based on in the schema:
  F ≐ E {D1|L1, ..., Dn|Ln} : {M1|V1, ..., Mm|Vm}
  G ≐ F {D1|R1, ..., Dp|Rp} : {N1|W1, ..., Nq|Wq}
  N1 ≐ f1(Mj(1)), ..., Nq ≐ fq(Mj(q))
each aggregated measure function should actually compute the aggregation of the values in the corresponding measure of the fact the aggregation is based on:
  ∀g, v. Ni(g) = v ↔ ∃r1, ..., rp. G(g) ∧ D1(g) = r1 ∧ ... ∧ Dp(g) = rp ∧
    v = fi({| Mj(i)(f) | ∃l1, ..., lp. F(f) ∧ D1(f) = l1 ∧ ... ∧ Dp(f) = lp ∧ ρ*_{L1,R1}(l1) = r1 ∧ ... ∧ ρ*_{Lp,Rp}(lp) = rp |})
for each i ≤ q, where {| · |} denotes a bag.
➽ This condition guarantees that if a fact is the aggregation of another fact, then in a legal data warehouse state the measures associated to the cells of the aggregated cube are actually computed by applying the aggregation function to the measures of the corresponding cells in the original cube. The correspondence between a cell in the aggregated cube and a set of cells in the original cube is found by looking at how their coordinates – which are level elements – are mapped through the roll-up functions, dimension by dimension.
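As a quick illustration of the cube-shaping conditions above, the following Python sketch (ours, with an assumed dictionary representation of cells) checks that every cell of a fact carries exactly the declared dimensions and that a coordinate identifies at most one cell.

```python
def is_legal_fact(cells, declared_dims):
    """cells: dict cell_id -> {dimension: level element}; True iff the
    declared-dimensions and unique-coordinate conditions hold for this fact."""
    seen = {}
    for cell_id, coords in cells.items():
        if set(coords) != set(declared_dims):   # only, and all of, the declared dimensions
            return False
        key = tuple(coords[d] for d in declared_dims)
        if seen.get(key, cell_id) != cell_id:   # a coordinate identifies a unique cell
            return False
        seen[key] = cell_id
    return True

cells = {
    "s1": {"Date": "1/1/01", "Product": "Organic-milk-1l", "Store": "Fair-trade-central"},
    "s2": {"Date": "7/1/01", "Product": "Organic-yogh-125g", "Store": "Fair-trade-central"},
}
print(is_legal_fact(cells, ["Date", "Product", "Store"]))   # -> True
```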
According to the definition, a legal data warehouse state for a GMD schema is a collection of multidimensional cubes, whose cells carry measure values. Each cube conforms to the fact definition given in the GMD schema, i.e., the coordinates are in agreement with the dimensions and the levels specified, and the measures are of the correct type. If a cube is the aggregation of another cube, in a legal data warehouse state it is enforced that the measures of the aggregated cube are correctly computed from the measures of the original cube.

3.1 Example
A possible legal data warehouse state for (part of) the previous example GMD schema is shown in the following.

BASIC-SALES^I = {s1, s2, s3, s4, s5, s6, s7}
SALES-BY-MONTH-AND-TYPE^I = {g1, g2, g3, g4, g5, g6}

Date(s1) = 1/1/01    Product(s1) = Organic-milk-1l     Store(s1) = Fair-trade-central
Date(s2) = 7/1/01    Product(s2) = Organic-yogh-125g   Store(s2) = Fair-trade-central
Date(s3) = 7/1/01    Product(s3) = Organic-milk-1l     Store(s3) = Ali-grocery
Date(s4) = 10/2/01   Product(s4) = Organic-milk-1l     Store(s4) = Barbacan-store
Date(s5) = 28/2/01   Product(s5) = Organic-beer-6pack  Store(s5) = Fair-trade-central
Date(s6) = 2/3/01    Product(s6) = Organic-milk-1l     Store(s6) = Fair-trade-central
Date(s7) = 12/3/01   Product(s7) = Organic-beer-6pack  Store(s7) = Ali-grocery

UnitSales(s1) = 100   EuroSalePrice(s1) = 71,00
UnitSales(s2) = 500   EuroSalePrice(s2) = 250,00
UnitSales(s3) = 230   EuroSalePrice(s3) = 138,00
UnitSales(s4) = 300   EuroSalePrice(s4) = 210,00
UnitSales(s5) = 210   EuroSalePrice(s5) = 420,00
UnitSales(s6) = 150   EuroSalePrice(s6) = 105,00
UnitSales(s7) = 100   EuroSalePrice(s7) = 200,00

Date(g1) = Jan-01   Product(g1) = Dairy   Total-UnitSales(g1) = 830   Avg-EuroSalePrice(g1) = 153,00
Date(g2) = Feb-01   Product(g2) = Dairy   Total-UnitSales(g2) = 300   Avg-EuroSalePrice(g2) = 210,00
Date(g3) = Jan-01   Product(g3) = Drink   Total-UnitSales(g3) = 0     Avg-EuroSalePrice(g3) = 0,00
Date(g4) = Feb-01   Product(g4) = Drink   Total-UnitSales(g4) = 210   Avg-EuroSalePrice(g4) = 420,00
Date(g5) = Mar-01   Product(g5) = Dairy   Total-UnitSales(g5) = 150   Avg-EuroSalePrice(g5) = 105,00
Date(g6) = Mar-01   Product(g6) = Drink   Total-UnitSales(g6) = 100   Avg-EuroSalePrice(g6) = 200,00

Level attribute values include, e.g., daynum(day) = 1, prodweight(product) = 100gm, storenum(store) = S101.
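The following Python sketch (ours, with an assumed tuple layout for cells) recomputes the aggregated fact SALES-BY-MONTH-AND-TYPE from BASIC-SALES exactly as the legality condition prescribes: base cells are grouped by their rolled-up coordinates and the aggregation functions are applied per group, reproducing the totals and averages listed above.

```python
BASIC_SALES = [  # (Date, Product, UnitSales, EuroSalePrice) for cells s1..s7
    ("1/1/01",  "Organic-milk-1l",    100,  71.00),
    ("7/1/01",  "Organic-yogh-125g",  500, 250.00),
    ("7/1/01",  "Organic-milk-1l",    230, 138.00),
    ("10/2/01", "Organic-milk-1l",    300, 210.00),
    ("28/2/01", "Organic-beer-6pack", 210, 420.00),
    ("2/3/01",  "Organic-milk-1l",    150, 105.00),
    ("12/3/01", "Organic-beer-6pack", 100, 200.00),
]

# rho*_{day,month} and rho*_{product,type}, restricted to the elements used above
DAY_TO_MONTH = {"1/1/01": "Jan-01", "7/1/01": "Jan-01", "10/2/01": "Feb-01",
                "28/2/01": "Feb-01", "2/3/01": "Mar-01", "12/3/01": "Mar-01"}
PRODUCT_TO_TYPE = {"Organic-milk-1l": "Dairy", "Organic-yogh-125g": "Dairy",
                   "Organic-beer-6pack": "Drink"}

groups = {}
for day, product, units, price in BASIC_SALES:
    key = (DAY_TO_MONTH[day], PRODUCT_TO_TYPE[product])
    groups.setdefault(key, []).append((units, price))

for (month, ptype), base_cells in sorted(groups.items()):
    total = sum(u for u, _ in base_cells)                  # Total-UnitSales = sum(UnitSales)
    avg = sum(p for _, p in base_cells) / len(base_cells)  # Avg-SalePrice = average(SalePrice)
    print(month, ptype, total, round(avg, 2))
# Prints, e.g., ("Jan-01", "Dairy") -> 830, 153.0, matching g1 in the state above.
# The empty coordinate (Jan-01, Drink) of g3, with 0 values, has no base cells and
# would have to be added explicitly.
```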
4 GMD Extensions
For lack of space, in this brief report it is impossible to introduce the full GMD framework [4], which includes a full algebra in addition to the basic aggregation operation introduced in this paper. We will just mention the main extensions with respect to what has been presented here, and the main results. The full GMD schema language includes also the possibility to define aggregated measures with respect to the application of a function to a set of original
measures, pretty much like in SQL. For example, it is possible to have an aggregated cube with a measure total-profit being the sum of the differences between the cost and the price in the original cube; the difference is applied cell by cell in the original cube (generating a profit virtual measure), and then the aggregation computes the sum of all the profits. Two selection operators are also in the full GMD language. The slice operation simply selects the cells of a cube corresponding to a specific value for a dimension, resulting in a cube which contains a subset of the cells of the original one and one dimension less. The multislice allows for the selection of ranges of values for a dimension, so that the resulting cube contains a subset of the cells of the original one but retains the selected dimension. A fact-join operation is defined only between cubes sharing the same dimensions and the same levels. We argue that a more general join operation is meaningless in a cube algebra, since it may lead to cubes whose measures are no longer understandable. For similar reasons we do not allow a general union operator (like the one proposed in [6]). As mentioned in the introduction, one main result is the full encoding of many data warehouse logical data models as GMD schemas. In this way we are able to give a homogeneous semantics (in terms of legal data warehouse states) to the logical models and the algebras proposed in all these different approaches, we are able to clarify ambiguous parts, and we can argue about the utility of some of the operators presented in the literature. The other main result is the proposal of a novel conceptual data model for multidimensional information, which extends and clarifies the one presented in [3].
References
[1] R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In Proc. of ICDE-97, 1997.
[2] Luca Cabibbo and Riccardo Torlone. A logical approach to multidimensional databases. In Proc. of EDBT-98, 1998.
[3] E. Franconi and U. Sattler. A data warehouse conceptual data model for multidimensional aggregation. In Proc. of the Workshop on Design and Management of Data Warehouses (DMDW-99), 1999.
[4] Enrico Franconi and Anand S. Kamble. The GMD data model for multidimensional information. Technical report, Free University of Bozen-Bolzano, Italy, 2003. Forthcoming.
[5] M. Golfarelli, D. Maio, and S. Rizzi. The dimensional fact model: a conceptual model for data warehouses. IJCIS, 7(2-3):215–247, 1998.
[6] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: a relational aggregation operator generalizing group-by, cross-tabs and subtotals. In Proc. of ICDE-96, 1996.
[7] M. Gyssens and L. V. S. Lakshmanan. A foundation for multi-dimensional databases. In Proc. of VLDB-97, pages 106–115, 1997.
[8] A. Tsois, N. Karayiannidis, and T. Sellis. MAC: Conceptual data modelling for OLAP. In Proc. of the International Workshop on Design and Management of Data Warehouses (DMDW-2001), pages 5-1–5-13, 2001.
[9] P. Vassiliadis. Modeling multidimensional databases, cubes and cube operations. In Proc. of the 10th SSDBM Conference, Capri, Italy, July 1998.
[10] P. Vassiliadis and T. Sellis. A survey of logical models for OLAP databases. SIGMOD Record, volume 28, pages 64–69, December 1999.
[11] P. Vassiliadis and S. Skiadopoulos. Modelling and optimisation issues for multidimensional databases. In Proc. of CAiSE-2000, pages 482–497, 2000.
An Application of Case-Based Reasoning in Multidimensional Database Architecture*

Dragan Simić1, Vladimir Kurbalija2, Zoran Budimac2

1 Novi Sad Fair, Hajduk Veljkova 11, 21000 Novi Sad, Yugoslavia
[email protected]
2 Department of Mathematics and Informatics, Fac. of Science, Univ. of Novi Sad, Trg D. Obradovića 4, 21000 Novi Sad, Yugoslavia
[email protected], [email protected]

ABSTRACT. A concept of a decision support system is considered in this paper. It provides data needed for fast, precise and good business decision making to all levels of management. The aim of the project is the development of a new online analytical processing oriented on case-based reasoning (CBR), where previous experience is taken into account for every new problem. Methodological aspects have been tested in practice as a part of the management information system development project of "Novi Sad Fair". A case study of an application of CBR in prediction of future payments is discussed in the paper.
1 Introduction
In recent years, there has been an explosive growth in the use of databases for decision support systems. This phenomenon is a result of the increased availability of new technologies to support efficient storage and retrieval of large volumes of data: data warehouse and online analytical processing (OLAP) products. A data warehouse can be defined as an online repository of historical enterprise data that is used to support decision-making. OLAP refers to technologies that allow users to efficiently retrieve data from the data warehouse. In order to help an analyst focus on important data and make better decisions, case-based reasoning (CBR – an artificial intelligence technology) is introduced for making predictions based on previous cases. CBR will automatically generate an answer to the problem using stored experience, thus freeing the human expert of the obligation to analyse numerical or graphical data. The use of CBR in predicting the rhythm of issuing invoices and receiving actual payments, based on the experience stored in the data warehouse, is presented in this paper. Predictions obtained in this manner are important for future planning of a company such as the ”Novi Sad Fair” because achievement of sales plans, revenue and
* Research was partially supported by the Ministry of Science, Technologies and Development of Republic of Serbia, project no. 1844: ”Development of (intelligent) techniques based on software agents for application in information retrieval and workflow”
company liquidation are measures of success in business. Performed simulations show that predictions made by CBR differ by only 8% from what actually happened. With the inclusion of more historical data in the warehouse, the system gets better at prediction. Furthermore, the system uses not only a data warehouse but also previous cases and previous predictions in future predictions, thus learning during the operating process. The combination of CBR and data warehousing, i.e. making an OLAP intelligent by the use of CBR, is a rarely used approach, if used at all. The system also uses a novel CBR technique to compare graphical representations of data, which greatly simplifies the explanation of the prediction process to the end-user [3]. The rest of the paper is organized as follows. The following section elaborates more on motivations and reasons for the inclusion of CBR in a decision support system. This section also introduces our case study, on which we shall describe the usage of our system. Section three overviews the case-based reasoning technique, while section four describes the original algorithm for searching the previous cases (curves) looking for the most similar one. The fifth section describes the actual application of our technique to the given problem. Section six presents the related work, while the seventh section concludes the paper.
2 User requirements for decision support system
“Novi Sad Fair” represents a complex organization considering the fact that it is engaged in a multitude of activities. The basic Fair activity is organizing fair exhibitions, although it has particular activities throughout the year. Ten times a year, 27 fair exhibitions are organized where nearly 4000 exhibitors take part, both from the country and abroad. Besides designing a ‘classical’ decision support system based on a data warehouse and OLAP, the requirements of the company management clearly showed that this would not be enough for good decision making. The decision to include artificial intelligence methods in general, and CBR in particular, into the whole system was driven by the results of a survey. The survey was made on a sample of 42 individuals (users of the current management information system) divided into three groups: strategic-tactical management (9 people), operational managers (15 people), and transactional users (18 people). After a statistical evaluation of the survey [5], the following conclusions (among others) were drawn: Development of the decision support system should be focussed on problems closely related to financial estimates and the tracking of financial market trends spanning several years. The key influences on business (management) are the political and economic environment of the country and region, which induces the necessity of exact implementation of those influences in the observed model (problem). It is also necessary to take them into account in future event estimations.
The behavior of the observed case does not depend on its pre-history but only on its initial state.
Implementation of this non-exact mathematical model is a very complex problem. As an example, let us take a look at the problem pointed out to us by company managers. During any fair exhibition the total of actual income is only 30% to 50% of the total invoice value. Therefore, managers want to know how high the payment of some fair services will be at some future time, with respect to invoicing. If they could predict reliably enough what will happen in the future, they could undertake important business activities to ensure faster arrival of invoiced payments and plan future activities and exhibitions better. The classical methods cannot explain influences on business and management well enough. There are political and economic circumstances of the country and region that cannot be successfully explained and used with classical methods: war in Iraq, oil deficiency, political assassinations, terrorism, spiral growth in the mobile telecommunication industry, general human occupation and motivation. And this is even more true in an enterprise such as the Fair, whose success depends on many external factors. One possible approach to dealing with external influences is observing the case histories of similar problems (cases) for a longer period of time, and making estimations according to that observation. This approach, generally speaking, represents intelligent search applied to solving new problems by adapting solutions that worked for similar problems in the past – case-based reasoning.
3 Case based reasoning
Case-Based Reasoning is a relatively new and promising area of artificial intelligence, and it is also considered a problem solving technology (or technique). This technology is used for solving problems in domains where experience plays an important role [2]. Generally speaking, case-based reasoning is applied to solving new problems by adapting solutions that worked for similar problems in the past. The main supposition here is that similar problems have similar solutions. The basic scenario for nearly all CBR applications looks as follows. In order to find a solution to an actual problem, one looks for a similar problem in an experience base, takes the solution from the past and uses it as a starting point to find a solution to the actual problem. In CBR systems experience is stored in the form of cases. A case is a recorded situation where a problem was totally or partially solved, and it can be represented as an ordered pair (problem, solution). The whole experience is stored in a case base, which is a set of cases, each of which represents some previous episode where the problem was successfully solved. The main problem in CBR is to find a good similarity measure – the measure that can tell to what extent two problems are similar. Functionally, similarity can be defined as a function sim : U × CB → [0, 1], where U refers to the universe of all objects (from a given domain), while CB refers to the case base (just those objects which were examined in the past and saved in the case memory). A higher value of the similarity function means that the objects are more similar [1]. The case-based reasoning system does not only have the goal of providing solutions to problems, but also takes care of other tasks occurring when used in practice. The main phases of the case-based reasoning activities are described in the CBR-cycle (fig. 1) [1].
Fig. 1. The CBR-Cycle after Aamodt and Plaza (1994)
In the retrieve phase the most similar case (or the k most similar cases) to the problem case is retrieved from the case memory, while in the reuse phase some modifications to the retrieved case are made in order to provide a better solution to the problem (case adaptation). As case-based reasoning only suggests solutions, there may be a need for a correctness proof or an external validation. That is the task of the revise phase. In the retain phase the knowledge learned from this problem is integrated into the system by modifying some knowledge containers. The main advantage of this technology is that it can be applied to almost any domain. A CBR system does not try to find rules between parameters of the problem; it just tries to find similar problems (from the past) and to use the solutions of the similar problems as a solution of the actual problem. So, this approach is extremely suitable for less examined domains – for domains where rules and connections between parameters are not known. The second very important advantage is that the CBR approach to learning and problem solving is very similar to human cognitive processes – people take into account and use past experiences to make future decisions.
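To make the retrieve and reuse phases concrete, here is a minimal Python sketch (ours, not the system described in this paper): a case is a (problem, solution) pair, retrieval ranks cases by a user-supplied similarity function, and reuse naively adopts the best case's solution.

```python
def retrieve(case_base, problem, similarity, k=1):
    """Return the k stored cases most similar to the problem, best first."""
    ranked = sorted(case_base, key=lambda case: similarity(problem, case[0]), reverse=True)
    return ranked[:k]

def reuse(retrieved_cases):
    """Naive adaptation: propose the solution of the single most similar case."""
    return retrieved_cases[0][1]

# Toy example with a numeric problem description and sim = 1 / (1 + |difference|)
case_base = [(10.0, "solution-A"), (25.0, "solution-B")]
sim = lambda p, q: 1.0 / (1.0 + abs(p - q))
print(reuse(retrieve(case_base, 12.0, sim, k=1)))   # -> "solution-A"
```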
4 CBR for predicting curves behaviour
The CBR system uses graphics in presenting both the problem and the cases [3]. The reason is that in many practical domains some decisions depend on the behaviour of time diagrams, charts and curves. The system therefore analyses curves, compares them to similar curves from the past and predicts the future behaviour of the current curve on the basis of the most similar curves from the past. The main problem here, as in almost every CBR system, was to create a good similarity measure for curves, i.e. a function that can tell to what extent two curves are similar. In many practical domains data are represented as a set of points, where a point is an ordered pair (x,y). Very often the pairs are (t,v), where t represents time and v represents some value at time t. When the data are given in this way (as a set of points) they can be graphically represented. When the points are connected, they represent some kind of curve. If the points are connected only with straight lines this is linear interpolation, but if smoother curves are wanted then some other kind of interpolation with polynomials must be used. There was a choice between a classical interpolating polynomial and a cubic spline. The cubic spline was chosen for two main reasons:
• Power: for n+1 points the classical interpolating polynomial has power n, while the cubic spline always has power 4.
• Oscillation: if only one point is moved (which can be the result of a bad experiment or measurement), the classical interpolating polynomial changes significantly (oscillates), while the cubic spline only changes locally (which is more appropriate for real-world domains).
Fig. 2. Surface between two curves
When the cubic spline is calculated for the curves, one very intuitive and simple similarity (or distance – which is the dual notion of similarity¹) measure can be used.
¹ When the distance d is known, the similarity sim can be easily computed using, for example, the function sim = 1/(1+d).
The distance between two curves can be represented as the surface between these curves, as seen in fig. 2. This surface can be easily calculated using the definite integral. Furthermore, the calculation of the definite integral for polynomials is a very simple and efficient operation.
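A possible implementation of this distance, assuming SciPy's cubic spline interpolation is available (the paper does not prescribe a library), samples both splines over their common time range and integrates the absolute difference numerically:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def curve_distance(points_a, points_b, n_samples=400):
    """points_*: lists of (t, value) pairs; returns the area between the two splines."""
    ta, va = zip(*sorted(points_a))
    tb, vb = zip(*sorted(points_b))
    spline_a, spline_b = CubicSpline(ta, va), CubicSpline(tb, vb)
    t0, t1 = max(ta[0], tb[0]), min(ta[-1], tb[-1])           # overlapping time range
    ts = np.linspace(t0, t1, n_samples)
    gap = np.abs(spline_a(ts) - spline_b(ts))
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(ts)))  # trapezoidal rule

def curve_similarity(points_a, points_b):
    return 1.0 / (1.0 + curve_distance(points_a, points_b))  # sim = 1/(1+d), see footnote
```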
5 Application of the system
A data warehouse of ”Novi Sad Fair” contains data about the payment and invoicing processes from the last 3 years for every exhibition – between 25 and 30 exhibitions every year. Processes are represented as sets of points, where every point is given by the time of the measurement (day from the beginning of the process) and the value of payment or invoicing on that day. It follows that these processes can be represented as curves. Note that the case base consists of cases of all exhibitions and that such a case base is used in solving concrete problems for concrete exhibitions. The reason for this is that environmental and external factors influence the business processes of the fair to a high extent. The measurement of the payment and invoicing values was done every 4 days from the beginning of the invoice process over a duration of 400 days; therefore every curve consists of approximately 100 points. Analysis of these curves shows that the process of invoicing usually starts several months before the exhibition and that the value of invoicing grows rapidly up to approximately the time of the beginning of the exhibition. After that time the value of invoicing remains approximately the same until the end of the process. That moment, when the value of invoicing reaches some constant value and stays the same to the end, is called the time of saturation for the invoicing process, and the corresponding value – the value of saturation. The process of payment starts several days after the corresponding process of invoicing (the processes of payment and invoicing for the same exhibition). After that the value of payment grows, but not as rapidly as the value of invoicing. At the moment of the exhibition the value of payment is between 30% and 50% of the value of invoicing. After that, the value of payment continues to grow up to some moment when it reaches a constant value and stays approximately constant till the end of the process. That moment is called the time of saturation for the payment process, and the corresponding value – the value of saturation. The payment time of saturation is usually a couple of months after the invoice time of saturation, and the payment value of saturation is always less than or equal to the invoice value of saturation. The analysis shows that the payment value of saturation is between 80% and 100% of the invoice value of saturation. The maximum represents the total of services invoiced, and that amount is to be paid. The same stands for the invoicing curve, where the maximum amount of payment represents the amount paid by regular means. The rest will be paid later by court order, other special business agreements or, perhaps, will not be paid at all (debtor bankruptcy).
Fig. 3. The curves from the data mart, as the "Old payment curve" and the "Old invoice curve"
One characteristic invoice curve and the corresponding payment curve, shown as the "Old payment curve" and "Old invoice curve" from the ”curve base”, are shown in fig. 3. The points of saturation (time and value) are represented by the emphasised points on the curves. At the beginning the system reads the input data from two data marts: one data mart contains the information about all invoice processes for every exhibition in the past 3 years, while the other data mart contains the information about the corresponding payment processes. After that, the system creates splines for every curve (invoice and payment) and internally stores the curves in a list of pairs containing the invoice curve and the corresponding payment curve. In the same way the system reads the problem curves from the third data mart. The problem is an invoice curve and the corresponding payment curve at the moment of the exhibition. At that moment, the invoice curve reaches its saturation point, while the payment curve is still far away from its saturation point. These curves are shown as the "Actual payment curve" and the "Actual invoice curve" (fig. 4). The solution of this problem would be the saturation point for the payment curve. This means that the system helps experts by suggesting and predicting the level of future payments. At the end of the total invoicing for a selected fair exhibition, the operational exhibition manager can get a prediction from the CBR system of a) the time period when payment of a debt will be made and b) the amount paid regularly.
Fig. 4. Problem payment and invoice curves, as the "Actual payment curve" and the "Actual invoice curve", and the prediction for the future payments

The time point and the amount of payment of a debt are marked on the graphic by a big red dot (fig. 4). When used with subsets of already known values, CBR predicted results that differed by around 10% in time and 2% in value from what actually happened.

5.1 Calculation of saturation points and system learning

The saturation point for one prediction is calculated by using the 10% most similar payment curves from the database of previous payment processes. The similarity is calculated by using the previously described algorithm. Since the values of saturation are different for each exhibition, every curve from the database must be scaled with a particular factor so that the invoice values of saturation of the old curve and the actual curve are the same. That factor is easily calculated as:

  Factor = actual_value_of_saturation / old_value_of_saturation

where the actual value of saturation is in fact the value of the invoice at the time of the exhibition. The final solution is then calculated by using the payment saturation points of the 10% most similar payment curves. Saturation points of the similar curves are multiplied with the appropriate goodness and then summed. The values of goodness are directly proportional to the similarity between the old and actual curves, but the sum of all goodnesses must be 1. Since the system calculates the distance, the similarity is calculated as:
  sim = 1 / (1 + dist)

The goodness for every old payment curve is calculated as:

  goodness_i = sim_i / Σ_{all j} sim_j

At the end, the final solution – the payment saturation point – is calculated as:

  sat_point = Σ_{all i} goodness_i · sat_point_i
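Putting the scaling factor, the similarity and the goodness weights together, a compact Python sketch of the prediction step could look as follows (the data layout and the scaling of the old saturation values are our assumptions, not taken from the paper):

```python
def predict_saturation(actual_payment_curve, actual_invoice_saturation, old_cases, distance):
    """old_cases: dicts with 'invoice_saturation', 'payment_curve' (list of (t, v)) and
    'payment_saturation' ((time, value)); distance: callable on two curves."""
    scored = []
    for case in old_cases:
        factor = actual_invoice_saturation / case["invoice_saturation"]   # Factor = actual / old
        scaled_curve = [(t, v * factor) for t, v in case["payment_curve"]]
        sim = 1.0 / (1.0 + distance(actual_payment_curve, scaled_curve))  # sim = 1/(1+dist)
        sat_time, sat_value = case["payment_saturation"]
        scored.append((sim, sat_time, sat_value * factor))
    scored.sort(reverse=True)
    top = scored[:max(1, len(scored) // 10)]                 # the 10% most similar curves
    total = sum(sim for sim, _, _ in top)
    goodness = [sim / total for sim, _, _ in top]            # weights sum to 1
    sat_time = sum(g * t for g, (_, t, _) in zip(goodness, top))
    sat_value = sum(g * v for g, (_, _, v) in zip(goodness, top))
    return sat_time, sat_value                               # predicted saturation point
```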
The system draws the solution point on the diagram, combining the saturation time and value. The system also supports solution revising and retaining (fig. 1). By memorizing a) the problem, b) the suggested solution, c) the number of similar curves used for obtaining the suggestion and d) the real solution (obtained later), the system uses this information in the phase of reusing the solution for future problems. The system will then use not only the 10% most similar curves but will also inspect the previous decisions in order to find a ‘better’ number of similar curves that would lead to a better prediction.
6 Related work
The system presented in the paper represents a useful coexistence of a data warehouse and case-based reasoning, resulting in a decision support system. The data warehouse (part of the described system) has been in operation in “Novi Sad Fair” since 2001 and is described in more detail in [5][6][7]. The part of the system that uses CBR in comparing curves was developed during the stay of the second author at Humboldt University in Berlin and is described in more detail in [3]. Although CBR is successfully used in many areas (aircraft conflict resolution in air traffic control, optimizing rail transport, subway maintenance, optimal job search, support to help-desks, intelligent search on the internet) [4], it is not very often used in combination with a data warehouse and in collaboration with classical OLAP, probably due to the novelty of this technique. CBR does not require a causal model or deep understanding of a domain, and therefore it can be used in domains that are poorly defined, where information is incomplete or contradictory, or where it is difficult to get sufficient domain knowledge. All this is typical for business processing. Besides CBR, other possibilities are rule-based knowledge or knowledge discovery in databases, where knowledge evaluation is based on rules [1]. The rules are usually generated by combining propositions. As the complexity of the knowledge base increases, maintenance becomes problematic because changing rules often implies a lot of reorganization in a rule-based system. On the other hand, it is easier to add or delete a case in a CBR system, which finally provides advantages in terms of learning and explicability. Applying CBR to curves and its usage in decision making is also a novel approach. According to the authors' findings, the usage of CBR, looking for similarities in curves and predicting future trends, is by far superior to other currently used techniques.
7 Conclusion
The paper presented a decision support system that uses CBR as an OLAP layer on top of the data warehouse. The paper described the CBR part of the system in greater detail, giving a thorough explanation of one case study. There are numerous advantages of this system. For instance, based on CBR predictions, operational managers can undertake important business activities, so they can: a) make payment delays shorter, b) make the total payment amount bigger, c) secure payment guarantees on time, d) reduce the risk of payment cancellation and e) inform senior managers on time. By combining the graphical representation of predicted values with the most similar curves from the past, the system enables a better and more focussed understanding of predictions with respect to real data from the past. Senior managers can use these predictions to better plan possible investments and new exhibitions, based on the amount of funds and the time of their availability, as predicted by the CBR system. The presented system is not limited to this case study; it can be applied to other business values as well (expenses, investments, profit) and it guarantees the same level of success.
Acknowledgement The CBR system that uses graphical representation of problem and cases [3] was implemented by V. Kurbalija at Humboldt University, Berlin (AI Lab) under the leadership of Hans-Dieter Burkhard and sponsorship of DAAD (German academic exchange service). Authors of this paper are grateful to Prof. Burkhard and his team for their unselfish support without which none of this would be possible.
References
1. Aamodt, A., Plaza, E.: Case-Based Reasoning: Foundational Issues, Methodological Variations and System Approaches. AI Communications, pp. 39-58, 1994.
2. Zoran Budimac, Vladimir Kurbalija: Case-based Reasoning – A Short Overview. Conference of Informatics and IT, Bitola, 2001.
3. Vladimir Kurbalija: On Similarity of Curves – project report. Humboldt University, AI Lab, Berlin, 2003.
4. Mario Lenz, Brigitte Bartsch-Spörl, Hans-Dieter Burkhard, Stefan Wess (eds.): Case-Based Reasoning Technology: From Foundations to Applications. Springer Verlag, October 1998.
5. Dragan Simic: Financial Prediction and Decision Support System Based on Artificial Intelligence Technology. Ph.D. thesis, draft text – manuscript, Novi Sad, 2003.
6. Dragan Simic: Reengineering Management Information Systems, Contemporary Information Technologies Perspective. Master thesis, Novi Sad, 2001.
7. Dragan Simic: Data Warehouse and Strategic Management. Strategic Management and Decision Support Systems, Palic, 1999.
MetaCube XTM: A Multidimensional Metadata Approach for Semantic Web Warehousing Systems

Thanh Binh Nguyen1, A Min Tjoa1, and Oscar Mangisengi2

1 Institute of Software Technology (E188), Vienna University of Technology, Favoritenstr. 9-11/188, A-1040 Vienna, Austria
{binh,tjoa}@ifs.tuwien.ac.at
2 Software Competence Center Hagenberg, Hauptstrasse 99, A-4232 Hagenberg, Austria
[email protected]

Abstract. Providing access and search among multiple, heterogeneous, distributed and autonomous data warehouses has become one of the main issues in current research. In this paper, we propose to integrate data warehouse schema information by using metadata represented in XTM (XML Topic Maps) to bridge possible semantic heterogeneity. A detailed description of an architecture that enables the efficient processing of user queries involving data from heterogeneous data warehouses is presented. As a result, interoperability is accomplished by a schema integration approach based on XTM. Furthermore, important implementation aspects of the MetaCube-XTM prototype, which makes use of the Meta Data Interchange Specification (MDIS) and the Open Information Model, complete the presentation of our approach.
1 Introduction
The advent of the World Wide Web (WWW) in the mid-1990s has resulted in an even greater demand for effectively managing data, information, and knowledge. Web sources consist of very large information resources that are distributed over different locations, sites, and systems. According to [15], Web warehousing is a novel and very active research area, which combines two rapidly developing technologies, i.e. data warehousing and Web technology, as depicted in figure 1. However, the emerging challenge of Web warehousing is how to manage Web OLAP data warehouse sites in a unified way and to provide unified access among different Web OLAP resources [15]. Therefore, a multidimensional metadata standard or framework is necessary to enable data warehousing interoperability. As a result, we are addressing the following issues:
Fig. 1. The hybrid of Web warehousing systems: data warehousing contributes the data management / warehousing approach, while the Web contributes Web technology and text and multimedia management
Multidimensional Metadata Standard. In the database community there exist some research efforts towards formal multidimensional data models and their corresponding query languages [1,4,6,9,12,13,19]. However, each approach presents its own view of multidimensional analysis requirements, terminology and formalism. As a result, none of the models is capable of encompassing the others.
Data Warehousing Interoperability. The relevance of interoperability for future data warehouse architectures is described in detail in [5]. Interoperability not only has to resolve the differences in data structures; it also has to deal with semantic heterogeneity. In this context, the MetaCube concept is proposed in [19] as a multidimensional metadata framework for cross-domain data warehousing systems. In a further development, the MetaCube concept is extended to MetaCube-X by using XML [20], to support interoperability for web warehousing applications. MetaCube-X is an XML (extensible markup language) instance of MetaCube, and provides a “neutral” syntax for interoperability among different web warehousing systems. In the described framework, we define a global MetaCube-X stored at the server site and local MetaCube-X(s), each of which is stored in a local Web warehouse. The issues to be handled in the global MetaCube-X mainly concern the semantic heterogeneities of the local MetaCube-X, while the capability for accessing data at any level of complexity should still be provided by the local Web data warehouses. In this paper we extend the concept of MetaCube-X using Topic Maps (TMs) [23] (MetaCube-XTM). Research is showing that topic maps can provide a sound basis for the Semantic Web. In addition, Topic Maps also build a bridge between the domains of knowledge representation and information management. The MetaCube-XTM system provides a unified view for users that addresses the semantic heterogeneities. On the other hand, it also supports data access of any level of complexity on the local data warehouses using local MetaCube-XTMs.
Prototyping. Both the MetaCube-XTM concept and the web technologies are now sufficiently mature to move from proof of concept towards a semi-operational prototype – the MetaCube-XTM prototype. The remainder of this paper is organized as follows. Section 2 presents related work. In Section 3 we summarize the concepts of MetaCube [19] and introduce the MetaCube-XTM protocol. Hereafter, we show the implementation of the MetaCube-XTM prototype. The conclusion and future work appear in Section 5.
2 Related Works
The DARPA Agent Markup Language (DAML) [21] developed by DARPA aims at developing a language and tools to facilitate the concept of the Semantic Web [22]: the idea of having data on the Web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications. The DAML language is being developed as an extension to XML and the Resource Description Framework (RDF). The latest extension step of this language (DAML + OIL – Ontology Inference Layer) provides a rich set of constructs to create ontologies and to mark up information so that it is machine readable and understandable. The Ontology Inference Layer (OIL) [7,11] from the On-To-Knowledge Project is a proposal for such a standard way of expressing ontologies, based on the use of web standards like XML Schema and RDF Schema. OIL is the first ontology representation language that is properly grounded in W3C standards such as RDF/RDF-Schema and XML/XML-Schema. DAML and OIL are general concepts not specifically related to database or data warehouse interoperability. In the field of federated data warehouses, a variety of approaches to interoperability have been proposed. In [14] the authors describe the usage of XML to enable interoperability of data warehouses through an additional architectural layer used for exchanging schema metadata. Distributed DWH architectures based on CORBA [2], and centralized virtual data warehouses based on CORBA and XML [3], have been proposed recently. All of these approaches propose distributed data warehouse architectures based on a kind of restricted data and metadata interchange format using particular XML terms and RDF extensions, respectively. Basically, they achieve syntactical integration – but these concepts do not address semantic heterogeneity to enable a thorough description of mappings between federated, heterogeneous data warehouse systems. [14] presents distributed and parallel computing issues in data warehousing. [2] also presents the prototypical distributed OLAP system developed in the context of the CUBE-STAR project. In [20], MetaCube-X is proposed as an XML instance of the MetaCube concept [19] for supporting data warehouses in a federated environment. It provides a framework for supporting integration and interoperability of data warehouses. Moreover, in this paper MetaCube-XTM, a new MetaCube generation, addresses semantic heterogeneity for data warehousing interoperability.
3 The Concepts of MetaCube-XTM
In this section MetaCube-XTM is presented as a framework for DWH interoperability. Based on this concept, a protocol is studied and proposed as a generic framework to support data access of any level of complexity on local data warehouses.

3.1 MetaCube Conceptual Data Model
In [19], a conceptual multidimensional data model that facilitates a precise and rigorous conceptualization for OLAP has been introduced. This approach is built on basic mathematical concepts, i.e. partial orders and partially ordered sets (posets) [10]. The mathematical foundation provides the basis for handling natural hierarchical relationships among data elements along (OLAP) dimensions with many levels of complexity in their structures. We summarize the MetaCube concepts introduced in [19] as follows:

Dimension Concepts. In [19] we introduced hierarchical relationships among dimension members by means of one hierarchical domain per dimension. A hierarchical domain is a poset (partially ordered set), denoted by ⟨dom(D), ⪯_D⟩, of dimension elements dom(D) = {dm_all} ∪ {dm_1, ..., dm_n}, organized in a hierarchy of levels, corresponding to different levels of granularity. An example of the hierarchy domain of the dimension Time with an unbalanced and multiple hierarchical structure is shown in figure 2. Afterwards, this allows us to consider a dimension schema as a poset of levels, denoted by DSchema(D) = ⟨Levels(D), ⪯_L⟩. Figure 3 shows examples of dimension schemas of the three dimensions Product, Geography and Time. Furthermore, a family of sets {dom(l_0), ..., dom(l_h)} is a partition [10] of dom(D). In this concept, a dimension hierarchy is a path along the dimension schema, beginning at the root level and ending at a leaf level [19].
Fig. 2. An example of the hierarchy domain of the dimension Time with unbalanced and multiple hierarchical structure (day elements such as 1.Jan.1999 and 3.Mar.1999 roll up through weeks such as W1.1999, W5.1999 and W9.1999 and months such as Jan.1999, Feb.1999 and Mar.1999 towards quarters, the year 1999, and the root element all)
Fig. 3. Examples of dimension schemas of three dimensions Product, Geography and Time (Product: Item – Type – Category – All; Geography: City – State – Country – All; Time: Day – Week/Month – Quarter – Year – All)
The Concept of Measures
[Measure Schema] A schema of a measure M is a tuple MSchema(M) = ⟨Fname, O⟩, where:
• Fname is a name of a corresponding fact,
• O ∈ Ω ∪ {NONE, COMPOSITE} is an operation type applied to a specific fact [2]. Furthermore:
  – Ω = {SUM, COUNT, MAX, MIN} is a set of aggregation functions.
  – COMPOSITE is an operation (e.g. average) where measures cannot be utilized in order to automatically derive higher aggregations.
  – NONE measures are not aggregated. In this case, the measure is the fact.
[Measure Domain] Let N be a numerical domain where a measure value is defined (e.g. N, Z, R or a union of these domains). The domain of a measure is a subset of N. We denote it by dom(M) ⊂ N.
The Concept of MetaCube
First, a MetaCube schema is defined by a triple of a MetaCube name, an x-tuple of dimension schemas, and a y-tuple of measure schemas, denoted by CSchema(C) = ⟨Cname, DSchemas, MSchemas⟩. Furthermore, the hierarchy domain of a MetaCube, denoted by dom(C) = ⟨Cells(C), ⪯_C⟩, is a poset, where each data cell is an intersection among a set of dimension members and measure data values, each of which belongs to one dimension or one measure. Afterwards, data cells within the MetaCube hierarchy domain are grouped into a set of associated granular groups, each of which expresses a mapping from the domains of an x-tuple of dimension levels (independent variables) to the y numerical domains of a y-tuple of numeric measures (dependent variables). Hereafter, a MetaCube is constructed based on a set of dimensions, consists of a MetaCube schema, and is associated with a set of groups.
Fig. 4. Sales MetaCube is constructed from three dimensions: Store, Product and Time and one fact: TotalSale (the cube sketch shows product categories Alcoholic, Dairy, Beverage, Baked Food, Meat and Seafood, stores in the USA and Mexico, and time points 1–6)
3.2 MetaCube-XTM Protocol
The MetaCube-XTM protocol is proposed to handle the design, integration, and maintenance of the heterogeneous schemas of the local data warehouses. It describes each local schema including its dimensions, dimension hierarchies, dimension levels, cubes, and measures. With the means of the MetaCube-XTM protocol it should be possible to describe any schema represented by any multidimensional data model (i.e. star schema, snowflake model, etc.). Furthermore, it is also aimed at providing abilities for interoperable searching and data integration among web data warehouses, as shown in Figure 5. The architecture of MetaCube-XTM systems consists of clients and a server protocol (i.e., the global MetaCube-XTM at an information server, and several distributed local data warehouses and their local MetaCube-XTMs). The functionalities are given as follows:
• MetaCube-XTM Services. A set of MetaCube-XTM services at the information server is intended to provide searching and navigation abilities for clients and to manage the access to local DWHs from the federated information server (figure 5).
• Global MetaCube-XTM. The global MetaCube-XTM is stored at the server, and is intended to provide a multidimensional metadata framework for a class of local data warehouses managed by the MetaCube-XTM Services. Thus, it has to resolve semantic heterogeneity and support the search facility over the local data warehouses.
• Local MetaCube-XTM. Each local MetaCube-XTM is used to describe the multidimensional data model of a local data warehouse based on the global MetaCube-XTM. The local MetaCube-XTM is stored in the local data warehouse.
Fig. 5. MetaCube-XTM architecture (clients send Web data warehouse queries to the MetaCube-X server, which hosts the MetaCube-XTM services, the global MetaCube-XTM and a locator DB; XML is exchanged with the local MetaCube-XTMs of Data Warehouse 1 ... Data Warehouse n)
4 MetaCube-XTM Prototype
The entire idea behind prototyping is to cut down on the complexity of implementation by eliminating parts of the full system. In this context, the MetaCube-XTM prototype has been implemented. First, UML (Unified Modelling Language) is used to model the MetaCube concept. UML modeling provides a framework to implement MetaCube in XTM (XML Topic Maps). Hereafter, we describe the local MetaCube-XTM as a local representation of DWH schemas using topic maps (bottom-up approach). Then we describe the integration of heterogeneous schemas in subsection 4.2. We are going to use only predefined XTM tags as proposed by the XTM standard (topicMap, topic, baseName, association, occurrence, topicRef, etc.). Therefore it will be possible to use tools based on the XTM standard to create, generate, and maintain such XTM descriptions easily. In this section we also present the process of building the MetaCube-XTM prototype.

4.1 Modeling MetaCube-XTM with UML
The common MetaCube-XTM is a model that is used for expressing all schema objects available in the different local data warehouses. To model the MetaCube-XTM, UML is used to describe dimensions, measures and data cubes in the context of the MetaCube data model (figure 6) [19,20]. The approach is implemented by a mapping into XML Schema based on the following standard specifications: the Meta Data Interchange Specification (MDIS) [16] and the Open Information Model (OIM) [17,18] of the Meta Data Coalition (MDC). 4.2
Implementation with XML Topic Maps
Topic Maps (TMs) provide a solution for organizing and navigating information resources into a unified view on the Web. In this paper we use XTM to represent the MetaCube concept, to model data at any dimensional level of complexity, to check data for structural correctness, to define new tags corresponding to a new dimension, and to show hierarchical information corresponding to dimension hierarchies. These functionalities are necessary for data warehouse schema handling and OLAP applications.
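As an informal illustration (ours, not prescribed by the paper; the topic identifiers and the cube/dimension names are hypothetical), a minimal local description using only predefined XTM elements such as topicMap, topic, baseName, association and topicRef could be emitted like this:

import xml.etree.ElementTree as ET

def make_local_metacube_xtm(cube, dimensions):
    # emit a minimal XTM fragment for one cube and its dimensions
    tm = ET.Element("topicMap", {"xmlns": "http://www.topicmaps.org/xtm/1.0/",
                                 "xmlns:xlink": "http://www.w3.org/1999/xlink"})
    def topic(tid, name):
        t = ET.SubElement(tm, "topic", {"id": tid})
        bn = ET.SubElement(t, "baseName")
        ET.SubElement(bn, "baseNameString").text = name
    topic("cube-" + cube, cube)
    for d in dimensions:
        topic("dim-" + d, d)
        assoc = ET.SubElement(tm, "association")   # cube-dimension association
        for ref in ("#cube-" + cube, "#dim-" + d):
            member = ET.SubElement(assoc, "member")
            ET.SubElement(member, "topicRef", {"xlink:href": ref})
    return ET.tostring(tm, encoding="unicode")

print(make_local_metacube_xtm("Sales", ["Store", "Product", "Time"]))

A real MetaCube-XTM would additionally type the topics and associations (dimension levels, hierarchies, measures) as described above.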
Fig. 6. The MetaCube-XTM model with UML (classes Cube, Groupby, GSchema, DimensionSchema, MeasureSchema, Dimension, Hierarchy, Level, DimensionElement, MDElement, NestedElement, Cell, MeasureValue, IntegerValue and FloatValue, connected by "belongs to", "refers to" and father/child associations)
Fig. 7. An example of local MetaCube-XTM (a local XTM document defining the topics Meta Cube-1, Meta Dimension, Meta Cube-Dimension and Meta Dimension-Level, and an instance of Cube-1)
The MetaCube-XTM is an XML Topic Maps (XTM) instance of the MetaCube concept for supporting interoperability and integration among data warehouse systems. This metadata provides a description of different multidimensional data models. It addresses heterogeneity problems such as syntactical, data model, semantic, schematic, and structural heterogeneities. 4.2.1 Schema Integration
Schema integration is intended to overcome semantic heterogeneity and to provide a global view for clients. The process of schema integration consists of integrating the local MetaCube-XTM(s) into the global MetaCube-XTM, and merging. The following section discusses issues concerning the local MetaCube-XTM(s) and the global MetaCube-XTM.
• Local MetaCube-XTM. With reference to the MetaCube-XTM UML model given in figure 6, each local MetaCube-XTM is represented as an XTM document describing the multidimensional data model, i.e., cubes, dimensions, dimension schemas, hierarchies, and measures, for one data warehouse. The local MetaCube-XTM is intended to provide data access at any level of data complexity. Figure 7 shows an example of a local MetaCube-XTM describing a local Web warehouse.
• Global MetaCube-XTM. The global MetaCube-XTM is aimed at providing a common framework for describing a multidimensional data model for Web warehouses. Therefore, the global MetaCube-XTM is the result of the integration of the local MetaCube-XTMs. In the integration process, merging tools resolve heterogeneity problems (e.g., naming conflicts among different local MetaCube-XTMs). The merging process is based on the subjects of the topics available in the local MetaCube-XTMs (a sketch of this subject-based merging is given after Fig. 8). The global MetaCube-XTM provides the
logic to reconcile differences, and drive Web warehousing systems conforming to a global schema. In addition, the global MetaCube-XTM represents metadata that is used for query processing. If there is a query posted by users, the MetaCube-XTM service receives the query from the user, parses, checks, and compares it with the global MetaCubeXTM, and distributes it to selected local Web warehouses. Thus, in this model the global MetaCube-XTM must be able to represent heterogeneity of dimensions and measures from local Web warehouses in relation to the MetaCube-XTM model. An example of global MetaCube-XTM is given in figure 8. 4.2.2 Prototyping
To demonstrate the capability and efficiency of the proposed concept we use the prototype for the International Union of Forest Research Organizations (IUFRO) data warehouses, which are distributed over different Asian, African and European countries. Because of the genesis of the local (national) data warehouses, they are by nature heterogeneous. Local topic maps are used as an additional layer for the representation of local schema information, whereas in the global layer topic maps act as a mediator to hide conflicts between different local schemas. Currently we have implemented an incremental prototype with full data warehousing functionality as a proof of concept for the feasibility of our approach of an XTM-based data warehouse federation (http://iufro.ifs.tuwien.ac.at/metacube2.0/). In detail, the MetaCube-XTM prototype has been implemented to function as follows: each time a query is posed to the MetaCube-XTM system, the following six steps are required (see figure 9). Steps 1-3 belong to the build-time stage at the global MetaCube-XTM server. This stage covers the modeling and the design of the search metadata; all definitions of the metadata required to search the local DWHs are completed here. Step 4 belongs to the run-time stage, i.e., the searching processes at the local DWH systems by means of the local MetaCube-XTMs. Steps 5 and 6 are used for displaying the retrieved information. All steps are described in more detail below.
Fig. 8. An example of the global MetaCube-XTM (a global XTM document defining the topic Global Meta Cube and referring to the local topic Meta Cube-1)
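The following sketch (our illustration, not code from the paper; the topic structure and the function name are assumptions) shows the idea of the subject-based merging of Section 4.2.1: topics with the same subject identity collapse into one global topic, and differing base names are kept as synonyms, which resolves naming conflicts.

def merge_local_topic_maps(local_maps):
    # local_maps: warehouse name -> list of topics, each a dict with "subject" and "names"
    global_topics = {}
    for warehouse, topics in local_maps.items():
        for t in topics:
            g = global_topics.setdefault(t["subject"], {"names": set(), "sources": set()})
            g["names"].update(t["names"])
            g["sources"].add(warehouse)
    return global_topics

local_maps = {
    "DWH1": [{"subject": "dim:time", "names": ["Time"]}],
    "DWH2": [{"subject": "dim:time", "names": ["Period"]}],
}
print(merge_local_topic_maps(local_maps))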
• MetaCube-XTM Definitions. Depending on the characteristics of the required data, different global MetaCube-XTM structures can be defined at the MetaCube-XTM server. In this step, the user can select a number of dimensions and measures to define a MetaCube-XTM schema.
• MetaCube-XTM Browser. Based on the tree representation of the selected dimension domains, the user can roll up or drill down along each dimension to select elements for searching. These selected dimension elements will be used to query the local Web warehouses.
• Local MetaCube-XTM DWH Selections. This step provides flexibility in support of interoperable searching among multiple heterogeneous DWHs.
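A rough sketch of the multi-host search process of Fig. 9 (ours; the query format, the per-warehouse search callback and the selection rule are assumptions):

def multi_host_search(query, local_dwhs):
    # Steps 1-3 (build time): the MetaCube-XTM schema has been defined and browsed;
    # select the local DWHs whose local MetaCube-XTM covers the requested dimensions.
    selected = [dwh for dwh in local_dwhs
                if set(query["dimensions"]) <= set(dwh["local_xtm"]["dimensions"])]
    # Step 4 (run time): distribute the query to the selected local warehouses.
    results = []
    for dwh in selected:
        results.extend(dwh["search"](query))   # assumed per-DWH search callback
    # Steps 5-6: return the result list; detail results are fetched on demand.
    return results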
Fig. 9. MetaCube-XTM Multi-Host Search Processes (searching preparation: 1 MetaCube-XTM definitions, 2 MetaCube-XTM browser / dimension definitions, 3 DWH selections; multi-host search: 4 local MetaCube-XTMs and sample data definitions at DWH 1 … DWH n; result information retrieval: 5 list of results, 6 detail result)
5 Conclusion and Future Works
In this paper we have presented the concept of MetaCube-XTM, which is an XML Topic Maps instance of the MetaCube concept [19]. The MetaCube-XTM provides a framework to achieve interoperability between heterogeneous schema models, which enables the joint querying of distributed web data warehouse (OLAP) systems. We also describe how to use topic maps to deal with these issues. Local topic maps are used as an additional layer for the representation of local schema information, whereas in the global layer topic maps act as a mediator to hide conflicts between different local schemas. This concept facilitates semantic integration.
Acknowledgment The authors are very indebted to IUFRO for supporting our approach from the very beginning in the framework of GFIS (Global Forest Information Systems http://www.gfis.net).
References
[1] Agrawal, R., Gupta, A., Sarawagi, A.: Modeling Multidimensional Databases. IBM Research Report, IBM Almaden Research Center, September 1995.
[2] Albrecht, J., Lehner, W.: On-Line Analytical Processing in Distributed Data Warehouses. International Databases Engineering and Applications Symposium (IDEAS), Cardiff, Wales, U.K., July 8-10, 1998.
[3] Ammoura, A., Zaiane, O., Goebel, R.: Towards a Novel OLAP Interface for Distributed Data Warehouses. Proc. of DaWaK 2001, Springer LNCS 2114, pp. 174-185, Munich, Germany, Sept. 2001.
[4] Blaschka, M., Sapia, C., Höfling, G., Dinter, B.: Finding your way through multidimensional data models. In Proceedings of the 9th International DEXA Workshop, Vienna, Austria, August 1998.
[5] Bruckner, R. M., Ling, T. W., Mangisengi, O., Tjoa, A M.: A Framework for a Multidimensional OLAP Model using Topic Maps. In Proceedings of the Second International Conference on Web Information Systems Engineering (WISE 2001), Web Semantics Workshop, Vol. 2, pp. 109-118, IEEE Computer Society Press, Kyoto, Japan, December 2001.
Computing Surveys, Vol. 22, No. 3, September 1990.
[6] Chaudhuri, S., Dayal, U.: An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, Volume 26, Number 1, September 1997.
[7] Fensel, D., Horrocks, I., Van Harmelen, F., Decker, S., Erdmann, M., Klein, M.: OIL in a Nutshell. In: Knowledge Acquisition, Modeling, and Management, Proc. of the 12th European Knowledge Acquisition Conference (EKAW 2000), R. Dieng et al. (eds.), Springer-Verlag LNAI 1937, pp. 1-16, Oct. 2000.
[8] Garcia-Molina, H., Labio, W., Wiener, J.L., Zhuge, Y.: Distributed and Parallel Computing Issues in Data Warehousing. In Proceedings of the ACM Principles of Distributed Computing Conference, 1999. Invited Talk.
[9] Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tabs, and Sub-Totals. Proceedings of ICDE '96, New Orleans, February 1996.
[10] Gross, J., Yellen, J.: Graph Theory and its Applications. CRC Press, 1999.
[11] Horrocks, I., Fensel, D., Broekstra, J., Decker, S., Erdmann, M., Goble, C., van Harmelen, F., Klein, M., Staab, S., Studer, R., Motta, E.: The Ontology Inference Layer OIL.
[12] Li, C., Wang, X.S.: A Data Model for Supporting On-Line Analytical Processing. CIKM 1996.
[13] Mangisengi, O., Tjoa, A M., Wagner, R.R.: Multidimensional Modelling Approaches for OLAP. Proceedings of the Ninth International Database Conference "Heterogeneous and Internet Databases" 1999, ISBN 962-937-046-8, Ed. J. Fong, Hong Kong, 1999.
[14] Mangisengi, O., Huber, J., Hawel, Ch., Essmayr, W.: A Framework for Supporting Interoperability of Data Warehouse Islands Using XML. Proc. of DaWaK 2001, Springer LNCS 2114, pp. 328-338, Munich, Germany, Sept. 2001.
[15] Mattison, R.: Web Warehousing and Knowledge Management. McGraw-Hill, 1999.
[16] Meta Data Coalition: Metadata Interchange Specification (MDIS) Version 1.1, August 1997.
[17] Meta Data Coalition: Open Information Model XML Encoding. Version 1.0, December 1999. http://www.mdcinfo.com/.
[18] Meta Data Coalition: Open Information Model. Version 1.1, August 1999. http://www.mdcinfo.com/.
[19] Nguyen, T.B., Tjoa, A M., Wagner, R.R.: Conceptual Multidimensional Data Model Based on MetaCube. In Proc. of the First Biennial International Conference on Advances in Information Systems (ADVIS 2000), Izmir, Turkey, October 2000. Lecture Notes in Computer Science (LNCS), Springer, 2000.
[20] Nguyen, T.B., Tjoa, A M., Mangisengi, O.: MetaCube-X: An XML Metadata Foundation for Interoperability Search among Web Warehouses. In Proceedings of the 3rd Intl. Workshop DMDW 2001, Interlaken, Switzerland, June 4, 2001.
[21] The DARPA Agent Markup Language Homepage. http://daml.semanticweb.org/.
[22] The Semantic Web Homepage. http://www.semanticweb.org/.
[23] XML Topic Maps (XTM) 1.0 Specification. http://www.topicmaps.org/xtm/1.0/.
Designing Web Warehouses from XML Schemas
Boris Vrdoljak (1), Marko Banek (1), and Stefano Rizzi (2)
(1) FER – University of Zagreb, Unska 3, HR-10000 Zagreb, Croatia, {boris.vrdoljak,marko.banek}@fer.hr
(2) DEIS – University of Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
[email protected] Abstract. Web warehousing plays a key role in providing the managers with up-to-date and comprehensive information about their business domain. On the other hand, since XML is now a standard de facto for the exchange of semi-structured data, integrating XML data into web warehouses is a hot topic. In this paper we propose a semi-automated methodology for designing web warehouses from XML sources modeled by XML Schemas. In the proposed methodology, design is carried out by first creating a schema graph, then navigating its arcs in order to derive a correct multidimensional representation. Differently from previous approaches in the literature, particular relevance is given to the problem of detecting shared hierarchies and convergence of dependencies, and of modeling many-to-many relationships. The approach is implemented in a prototype that reads an XML Schema and produces in output the logical schema of the warehouse.
1 Introduction
The possibility of integrating data extracted from the web into data warehouses (which in this case will be more properly called web warehouses [1]) is playing a key role in providing the enterprise managers with up-to-date and comprehensive information about their business domain. On the other hand, the Extensible Markup Language (XML) has become a standard for the exchange of semi-structured data, and large volumes of XML data already exist. Therefore, integrating XML data into web warehouses is a hot topic. Designing a data/web warehouse entails transforming the schema that describes the source operational data into a multidimensional schema for modeling the information that will be analyzed and queried by business users. In this paper we propose a semiautomated methodology for designing web warehouses from XML sources modeled by XML Schemas, which offer facilities for describing the structure and constraining the content of XML documents. As HTML documents do not contain semantic description of data, but only the presentation, automating design from HTML sources is unfeasible. XML models semi-structured data, so the main issue arising is that not Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 89-98, 2003. Springer-Verlag Berlin Heidelberg 2003
all the information needed for design can be safely derived. In the proposed methodology, design is carried out by first creating a schema graph, then navigating its arcs in order to derive a correct multidimensional representation in the form of a dependency graph where arcs represent inter-attribute relationships. The problem of correctly inferring the needed information is solved by querying the source XML documents and, if necessary, by asking the designer's help. Some approaches concerning related issues have been proposed in the literature. In [4] a technique for conceptual design starting from DTDs [12] is outlined. That approach is now partially outdated due to the increasing popularity of XML Schema; besides, some complex modeling situations were not specifically addressed in the paper. In [5] and [6] DTDs are used as a source for designing multidimensional schemas (modeled in UML). Though that approach bears some resemblance to ours, the unknown cardinalities of relationships are not verified against actual XML data, but they are always arbitrarily assumed to be to-one. Besides, the id/idref mechanism used in DTDs is less expressive than key/keyref in XML Schema. The approach described in [8] is focused on populating multidimensional cubes by collecting XML data, but assumes that the multidimensional schema is known in advance (i.e., that conceptual design has been already carried out). In [9], the author shows how to use XML to directly model multidimensional data, without addressing the problem of how to derive the multidimensional schema. Differently from previous approaches in the literature, in our paper particular relevance is given to the problem of detecting shared hierarchies and convergence of dependencies, and of modeling many-to-many relationships within hierarchies. The approach is implemented in a prototype that reads an XML Schema and produces in output the star schema for the web warehouse.
2 Relationships in XML Schema
The structure of XML data can be visualized by using a schema graph (SG) derived from the Schema describing the data. The method is adopted from [10], where simpler, but less efficient DTD is still used as a grammar. The SG for the XML Schema describing a purchase order, taken from the W3C's document [14] and slightly extended, is shown in Fig. 1. In addition to the SG vertices that correspond to elements and attributes in the XML Schema, the operators inherited from the DTD element type declarations are also used because of their simplicity. They determine whether the sub-element or attribute may appear one or more (“+”), zero or more (“*”), or zero or one times (“?”). The default cardinality is exactly one and in that case no operator is shown. Attributes and sub-elements are not distinguished in the graph. Since our design methodology is primarily based on detecting many-to-one relationships, in the following we will focus on the way those relationships can be expressed. There are two different ways of specifying relationships in XML Schemas. •
First, relationships can be specified by sub-elements with different cardinalities. However, given an XML Schema, we can express only the cardinality of the relationship from an element to its sub-elements and attributes. The cardinality in the opposite direction cannot be discovered by exploring the Schema; only by exploring the data that conforms to the Schema, or by having some knowledge about the domain described, can the cardinality in the direction from a child element to its parent be determined.
• Second, the key and keyref elements can be used for defining keys and their references. The key element indicates that every attribute or element value must be unique within a certain scope and not null. If the key is an element, it should be of a simple type. By using keyref elements, keys can be referenced. Not just attribute values, but also element content and their combinations can be declared to be keys, provided that the order and type of those elements and attributes is the same in both the key and keyref definitions. In contrast to the id/idref mechanism in DTDs, key and keyref elements are specified to hold within the scope of particular elements.
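To make the two mechanisms concrete, here is a small sketch (ours, not the authors' prototype) that scans an XML Schema with Python's standard library and lists the declared sub-element cardinalities and the key/keyref definitions; the file name is an assumed local copy of the W3C purchase order Schema:

import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def scan_schema(xsd_path):
    root = ET.parse(xsd_path).getroot()
    # 1) cardinalities of sub-elements (minOccurs/maxOccurs default to 1)
    for el in root.iter(XS + "element"):
        name = el.get("name") or el.get("ref")
        print("element", name, "cardinality", el.get("minOccurs", "1"), "..", el.get("maxOccurs", "1"))
    # 2) key and keyref definitions (a keyref refers to a key by name)
    for k in root.iter(XS + "key"):
        print("key", k.get("name"))
    for kr in root.iter(XS + "keyref"):
        print("keyref", kr.get("name"), "refers to", kr.get("refer"))

scan_schema("purchaseOrder.xsd")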
3 From XML Schema to Multidimensional Schema
In this section we propose a semi-automatic approach for designing a web warehouse starting from an XML Schema. The methodology consists of the following steps:
1. Preprocessing the XML Schema.
2. Creating and transforming the SG.
3. Choosing facts.
4. For each fact:
   4.1 Building the dependency graph from the SG.
   4.2 Rearranging the dependency graph.
   4.3 Defining dimensions and measures.
   4.4 Creating the logical schema.
Given a fact, the dependency graph (DG) is an intermediate structure used to provide a multidimensional representation of the data describing the fact. In particular, it is a directed rooted graph whose vertices are a subset of the element and attribute vertices of the SG, and whose arcs represent associations between vertices. The root of the DG corresponds to the fact.
Fig. 1. The Schema Graph (purchaseOrder with orderDate, comment, the shipTo and billTo addresses (name, street, city, state, zip, country) and items/item; item carries partNum (a keyref), productName, quantity, USPrice and shipDate; product carries the key productCode with brand, weight and size)
While in most cases the hierarchies included in the multidimensional schema represent only to-one associations (sometimes called roll-up relationships since they support the roll-up OLAP operator), in some applications it is important to model also many-to-many associations. For instance, suppose the fact to be modeled is the sales of books, so book is one of the dimensions. Although books that have many authors certainly exist, it would be interesting to aggregate the sales by author. It is remarkable that summarizability is maintained through many-to-many associations, if a normalized weight is introduced [7]. Besides, some specific solutions for logical design in presence of many-to-many associations were devised [11]. However, since modeling many-to-many associations in a warehouse should be considered an exception, their inclusion in the DG is subject to the judgment of the designer, who is supposed to be an expert of the business domain being modeled. After the DG has been derived from the SG, it may be rearranged (typically, by dropping some uninteresting attributes). This phase of design necessarily depends on the user requirements and cannot be carried out automatically; since it has already been investigated (for instance in [2]), it is considered to be outside the scope of this paper. Finally, after the designer has selected dimensions and measures among the vertices of the DG, a logical schema can be immediately derived from it. 3.1
Choosing Facts and Building Dependency Graphs
The relationships in the Schema can be specified in a complex and redundant way. Therefore, we transform some structures to simplify the Schema, similarly as DTD was simplified in [10] and [6]. A common example of Schema simplification concerns the choice element, which denotes that exactly one of the sub-elements must appear in a document conforming to that Schema. The choice element is removed from the schema and a minOccurs attribute with value 0 is added to each of its subelements. The resulting simplified structure, although not being equivalent to the choice expression, preserves all the needed information about the cardinalities of relationships. After the initial SG has been created [10], it must undergo two transformations. First, all the key attributes or elements are located and swapped with their parent vertex in order to explicitly express the functional dependency relating the key with the other attributes and elements. Second, some vertices that do not store any value are eliminated. A typical case is an element that has only one sub-element of complex type and no attributes, and the relationship with its sub-element is to-many. We name such an element a container. Note that, when a vertex v is deleted, the parent of v inherits all the children of v and their cardinalities. The next step is choosing the fact. The designer chooses the fact among all the vertices and arcs of the SG. An arc can be chosen as a fact if it represents a many-tomany relationship. For the purchase order SG presented in Fig. 1, after the items element has been eliminated as a container, the relationship between purchaseOrder and item is chosen as a fact, as in Fig. 2. For each fact f, the corresponding DG must be built by including a subset of the vertices of the SG. The DG is initialized with the root f, to be enlarged by recursively navigating the relationships between vertices in the SG. After a vertex v of the SG is inserted in the DG, navigation takes place in two steps:
Fig. 2. Choosing a fact (the to-many arc between purchaseOrder and item is chosen as the fact)
1. For each vertex w that is a child of v in the SG: When examining relationships in the direction expressed by arcs of the SG, the cardinality information is expressed either explicitly by "?", "*" and "+" vertices, or implicitly by their absence. If w corresponds to an element or attribute in the Schema, it is added to the DG as a child of v; if w is a "?" operator, its child is added to the DG as a child of v. If w is a "*" or "+" operator, the cardinality of the relationship from u, child of w, to v is checked by querying the XML documents (see Section 3.2): if it is to-many, the designer decides whether the many-to-many relationship between v and u is interesting enough to be inserted into the DG or not.
2. For each vertex z that is a parent of v in the SG: When examining relationships in this direction, vertices corresponding to "?", "*" and "+" operators are skipped, since they only express the cardinality in the opposite direction. Since the Schema yields no further information about the relationship cardinality, it is necessary to examine the actual data by querying the XML documents conforming to the Schema (see Section 3.2). If a to-one relationship is detected, z is included in the DG.
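A compact sketch of this navigation (our Python rendering under simplifying assumptions: the SG is a dict of vertex records with child/parent links and operators, and the document queries are abstracted into an is_to_one callback):

def build_dg(sg, fact, is_to_one, ask_designer):
    # sg[v] = {"children": [(w, op)], "parents": [z]}, where op is None, "?", "*" or "+"
    dg = {fact: []}          # adjacency: vertex -> children in the DG
    todo = [fact]
    while todo:
        v = todo.pop()
        for w, op in sg[v]["children"]:                    # step 1: children of v
            include = op in (None, "?") or (not is_to_one(w, v) and ask_designer(v, w))
            if include and w not in dg:
                dg[v].append(w); dg[w] = []; todo.append(w)
        for z in sg[v]["parents"]:                         # step 2: parents of v
            if z not in dg and is_to_one(v, z):            # to-one towards the ancestor
                dg[v].append(z); dg[z] = []; todo.append(z)
    return dg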
Whenever a vertex corresponding to a keyref element is reached, the navigation algorithm “jumps” to its associated key vertex, so that descendants of the key become descendants of the keyref element. A similar approach is used in [3], where the operational sources are represented by a relational schema, when a foreign key is met during navigation of relations. See for instance Fig. 3, showing the resulting DG for the purchase order example. From the fact, following to-one relationship, the item vertex is added to the DG. Vertex productCode is defined to be a key (Fig.1). It is swapped with product, which then is dropped since it carries no value. The partNum vertex is a child of item and is defined as a key reference to the productCode attribute. size, weight and brand, the children of productCode, become descendants of the partNum attribute in the DG. 3.2
Querying XML Documents
In our approach, XQuery [15] is used to query the XML documents in three different situations:
1. examination of convergence and shared hierarchies,
2. searching for many-to-many relationships between the descendants of the fact in the SG,
3. searching for to-many relationships towards the ancestors of the fact in the SG.
Fig. 3. The DG for the purchase order example (rooted in the purchaseOrder-item fact; purchaseOrder with orderDate, comment and the shared USAddress hierarchy (name, street, zip, city, state, country); item with partNum, productName, quantity, USPrice, shipDate and, via the key, size, weight and brand)
Note that, since in all three cases querying the documents is aimed at counting how many distinct values of an attribute v are associated to a single value of an attribute w, it is always preliminarily necessary to determine a valid identifier for both v and w. To this end, if no key is specified for an attribute, the designer is asked to define an identifier by selecting a subset of its non-optional sub-elements. Convergence and Shared Hierarchies. Whenever a complex type has more than one instance in the SG, and all of the instances have a common ancestor vertex, either a convergence or a shared hierarchy may be implied in the DG. A convergence holds if an attribute is functionally determined by another attribute along two or more distinct paths of to-one associations. On the other hand, it often happens that whole parts of hierarchies are replicated two or more times. In this case we talk of a shared hierarchy, to emphasize that there is no convergence. In our approach, the examination is made by querying the available XML documents conforming to the given Schema. In the purchase order example, following a to-one relationship from the fact, the purchaseOrder vertex is added to the DG. It has two children, shipTo and billTo (Fig. 1), that have the same complex type USAddress. The purchaseOrder element is the closest common ancestor of shipTo and billTo, thus all the instances of the purchaseOrder element have to be retrieved. For each purchaseOrder instance, the content of the first child, shipTo, is compared to the content of the second one, billTo, using the deep-equal XQuery operator as shown in Fig. 4.
let $x :=
  for $c in $retValue
  where not(deep-equal($c/first/content, $c/second/content))
  return $c
return count($x)
Fig. 4. A part of the XQuery query for distinguishing convergence from shared hierarchy
By using the COUNT function, the query returns the number of couples with different contents. If at least one couple with different contents is counted, a shared hierarchy is introduced. Otherwise, since in principle there still is a possibility that documents in which the contents of the complex type instances are not equal will exist, the designer has to decide about the existence of convergence by leaning on her knowledge of the application domain. In our example, supposing it is found that shipTo and billTo have different values in some cases, a shared hierarchy is introduced. Many-to-Many Relationships between the Descendants of the Fact. While in most cases only to-one associations are included into the DG, there are situations in which it is useful to model many-to-many associations. Consider the SG in Fig. 5, modeling the sales of the books, where the bookSale vertex is chosen as the fact. After the book vertex is included into the DG, a to-many relationship between book and author is detected. Since including a one-to-many association would be useless for aggregation, the available XML documents conforming to the bookSale Schema are examined by using XQuery to find out whether the same author can write multiple books. A part of the query is presented in Fig. 6: it counts the number of distinct books (i.e. different parent elements) for each author (child) and returns the maximum number. If the returned number is greater than one, the relationship is many-to-many, and the designer may choose whether it should be included in the DG or not. If the examination of the available XML documents has not proved that the relationship is many-to-many, the designer can still, leaning on his or her knowledge, state the relationship as many-to-many and decide if it is interesting for aggregation. bookSale
Fig. 5. The book sale example (bookSale with date, quantity, price, book (title, year, publisher, author+ with nameLast and nameFirst) and store (city, storeNo, address))
max( ...
  for $c in distinct-values($retValue/child)
  let $p := for $exp in $retValue
            where deep-equal($exp/child, $c)
            return $exp/parent
  return count(distinct-values($p))
)
Fig. 6. A part of a query for examining many-to-many relationships
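The same check can be phrased outside XQuery; the sketch below (ours, with an assumed list of (parent, child) pairs extracted from the documents, e.g. (book title, author name)) returns the maximum number of distinct parents per child, i.e. the value compared against 1 in the text:

from collections import defaultdict

def max_parents_per_child(pairs):
    # a result greater than 1 indicates a many-to-many relationship
    parents = defaultdict(set)
    for parent, child in pairs:
        parents[child].add(parent)
    return max((len(s) for s in parents.values()), default=0)

pairs = [("Book A", "Smith"), ("Book B", "Smith"), ("Book B", "Jones")]
print(max_parents_per_child(pairs))   # 2 -> the same author wrote two books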
Fig. 7. The star schema for the purchase order example (fact table PURCHASE_ORDER with shipToCustomerKey, billToCustomerKey, orderDateKey, productKey, USPrice, quantity and income; dimension tables TIME (timeKey, orderDate, dayOfWeek, holiday, month), PRODUCT (productKey, partNum, productName, size, weight, brand) and CUSTOMER (customerKey, customer, name, street, zip, city, state, country))
To-Many Relationships towards the Ancestors of the Fact. This type of search should be done because the ancestors of the fact element in the SG will not always form a hierarchically organized dimension in spite of the nesting structures in XML. When navigating the SG upwards from the fact, the relationships must be examined by XQuery since we have no information about the relationship cardinality, which is not necessarily to-one. The query is similar to the one for examining many-to-many relationships, and counts the number of distinct values of the parent element corresponding to each value of the child element. 3.3
Creating the Logical Scheme
Once the DG has been created, it may be rearranged as discussed in [3]. Considering for instance the DG in Fig. 3, we observe that there is no need for the existence of both purchaseOrder and purchaseOrder-item, so only the former is left. Considering item and partNum, only the latter is left. The comment and shipDate attributes are dropped to eliminate unnecessary details. Finally, attribute USAddress is renamed into customer in order to clarify its role. The final steps of building a multidimensional schema include the choice of dimensions and measures as described in [2]. In the purchase order example, USPrice and quantity are chosen as measures, while orderDate, partNum, shipToCustomer, and billToCustomer are the dimensions. Finally, the logical schema is easily obtained by including measures in the fact table and creating a dimension table for each hierarchy in the DG. Fig. 7 shows the resulting star schema corresponding to the DG in Fig. 3; note how the shared hierarchy on customer is represented in the logical model by only one dimension table named CUSTOMER, and how a derived measure, income, has been defined by combining quantity and USPrice. In the presence of many-to-many relationships one of the logical design solution proposed in [11] is to be adopted.
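As an illustration of this last step only (ours; table and column names follow Fig. 7, the generation logic is a simplification that leaves out the role-playing keys shipToCustomerKey/billToCustomerKey and derived measures):

def star_schema_ddl(fact, measures, dimensions):
    # dimensions: dimension table name -> list of attributes of its hierarchy
    stmts = []
    for dim, attrs in dimensions.items():
        cols = [dim.lower() + "Key INTEGER PRIMARY KEY"] + [a + " VARCHAR" for a in attrs]
        stmts.append("CREATE TABLE %s (%s);" % (dim, ", ".join(cols)))
    fk = [d.lower() + "Key INTEGER REFERENCES " + d for d in dimensions]
    ms = [m + " DECIMAL" for m in measures]
    stmts.append("CREATE TABLE %s (%s);" % (fact, ", ".join(fk + ms)))
    return "\n".join(stmts)

print(star_schema_ddl("PURCHASE_ORDER",
                      ["USPrice", "quantity"],
                      {"TIME": ["orderDate", "dayOfWeek", "holiday", "month"],
                       "PRODUCT": ["partNum", "productName", "size", "weight", "brand"],
                       "CUSTOMER": ["customer", "name", "street", "zip", "city", "state", "country"]}))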
4 Conclusion
In this paper we described an approach to design a web warehouse starting from the XML Schema describing the operational source. As compared to previous approaches
based on DTDs, the higher expressiveness of XML Schema allows more effective modeling. Particular relevance is given to the problem of detecting shared hierarchies and convergences; besides, many-to-many relationships within hierarchies can be modeled. The approach is implemented in a Java-based prototype that reads an XML Schema and produces in output the star schema for the web warehouse. Since all the needed information cannot be inferred from XML Schema, in some cases the source XML documents are queried using XQuery language, and if necessary, the designer is asked for help. The prototype automates several parts of the design process: preprocessing the XML Schema, creating and transforming the schema graph, building the dependency graph, querying XML documents. All phases are controlled and monitored by the designer through a graphical interface that also allows some restructuring interventions on the dependency graph.
References
[1] S. S. Bhowmick, S. K. Madria, W.-K. Ng, and E. P. Lim, "Web Warehousing: Design and Issues", Proc. DWDM'98, Singapore, 1998.
[2] M. Golfarelli, D. Maio, and S. Rizzi, "Conceptual design of data warehouses from E/R schemes", Proc. HICSS-31, vol. VII, Kona, Hawaii, pp. 334-343, 1998.
[3] M. Golfarelli, D. Maio, and S. Rizzi, "The Dimensional Fact Model: a Conceptual Model for Data Warehouses", International Journal of Cooperative Information Systems, vol. 7, n. 2&3, pp. 215-247, 1998.
[4] M. Golfarelli, S. Rizzi, and B. Vrdoljak, "Data warehouse design from XML sources", Proc. DOLAP'01, Atlanta, pp. 40-47, 2001.
[5] M. Jensen, T. Møller, and T. B. Pedersen, "Specifying OLAP Cubes On XML Data", Journal of Intelligent Information Systems, 2001.
[6] M. Jensen, T. Møller, and T. B. Pedersen, "Converting XML Data To UML Diagrams For Conceptual Data Integration", Proc. DIWeb'01, Interlaken, 2001.
[7] R. Kimball, "The data warehouse toolkit", John Wiley & Sons, 1996.
[8] T. Niemi, M. Niinimäki, J. Nummenmaa, and P. Thanisch, "Constructing an OLAP cube from distributed XML data", Proc. DOLAP'02, McLean, 2002.
[9] J. Pokorny, "Modeling stars using XML", Proc. DOLAP'01, 2001.
[10] J. Shanmugasundaram et al., "Relational Databases for Querying XML Documents: Limitations and Opportunities", Proc. 25th VLDB, Edinburgh, 1999.
[11] I. Y. Song, W. Rowen, C. Medsker, and E. Ewen, "An analysis of many-to-many relationships between fact and dimension tables in dimensional modeling", Proc. DMDW, Interlaken, Switzerland, pp. 6.1-6.13, 2001.
[12] World Wide Web Consortium (W3C), "XML 1.0 Specification", http://www.w3.org/TR/2000/REC-xml-20001006.
[13] World Wide Web Consortium (W3C), "XML Schema", http://www.w3.org/XML/Schema.
[14] World Wide Web Consortium (W3C), "XML Schema Part 0: Primer", http://www.w3.org/TR/xmlschema-0/.
[15] World Wide Web Consortium (W3C), "XQuery 1.0: An XML Query Language (Working Draft)", http://www.w3.org/TR/xquery/.
Building XML Data Warehouse Based on Frequent Patterns in User Queries
Ji Zhang (1), Tok Wang Ling (1), Robert M. Bruckner (2), A Min Tjoa (2)
(1) Department of Computer Science, National University of Singapore, Singapore 117543, {zhangji, lingtw}@comp.nus.edu.sg
(2) Institute of Software Technology, Vienna University of Technology, Favoritenstr. 9/188, A-1040 Vienna, Austria, {bruckner, tjoa}@ifs.tuwien.ac.at
Abstract. With the proliferation of XML-based data sources available across the Internet, it is increasingly important to provide users with a data warehouse of XML data sources to facilitate decision-making processes. Due to the extremely large amount of XML data available on the web, unguided warehousing of XML data turns out to be highly costly and usually cannot accommodate the users' needs in XML data acquirement well. In this paper, we propose an approach to materialize XML data warehouses based on frequent query patterns discovered from historical queries issued by users. The schemas of integrated XML documents in the warehouse are built using these frequent query patterns represented as Frequent Query Pattern Trees (FreqQPTs). Using a hierarchical clustering technique, the integration approach in the data warehouse is flexible with respect to obtaining and maintaining XML documents. Experiments show that the overall processing of the same queries issued against the global schema becomes much more efficient by using the XML data warehouse built than by directly searching the multiple data sources.
1. Introduction A data warehouse (DWH) is a repository of data that has been extracted, transformed, and integrated from multiple and independent data source like operational databases and external systems [1]. A data warehouse system, together with its associated technologies and tools, enables knowledge workers to acquire, integrate, and analyze information from different data sources. Recently, XML has rapidly emerged as a standardized data format to represent and exchange data on the web. The traditional DWH has gradually given way to the XML-based DWH, which becomes the mainstream framework. Building a XML data warehouse is appealing since it provides users with a collection of semantically consistent, clean, and concrete XML-based data that are suitable for efficient query and analysis purposes. However, the major drawback of building an enterprise wide XML data warehouse system is that it is usually extremely time and cost consuming that is unlikely to be successful [10]. Furthermore, without proper guidance on which information is to be stored, the resulting data warehouse cannot really well accommodate the users’ needs in XML data acquirement. In order to overcome this problem, we propose a novel XML data warehouse approach by taking advantage of the underlying frequent patterns existing in the query Y. Kambayashi, M. Mohania, W. W¨oß (Eds.): DaWaK 2003, LNCS 2737, pp. 99-108, 2003. c Springer-Verlag Berlin Heidelberg 2003
history of users. The historical user queries can ideally provide us with guidance regarding which XML data sources are more frequently accessed by users, compared to others. The general idea of our approach is: Given multiple distributed XML data sources and their globally integrated schema represented as a DTD (data type definition) tree, we will build a XML data warehouse based on the method of revealing frequent query patterns. In doing so, the frequent query patterns, each represented as a Frequent Query Pattern Tree (FreqQPT), are discovered by applying a rule-mining algorithm. Then, FreqQPTs are clustered and merged to generate a specified number of integrated XML documents. Apparently, the schema of integrated XML documents in the warehouse is only a subset of the global schema and the size of this warehouse is usually much smaller than the total size of all distributed data sources. A smaller sized data warehouse can not only save storage space but also enable query processing to be performed more efficiently. Furthermore, this approach is more user-oriented and is better tailored to the user’s needs and interests. There has been some research in the field of building and managing XML data warehouse. The authors of [2] present a semi-automated approach to building a conceptual schema for a data mart starting from XML sources. The work in [3] uses XML to establish an Internet-based data warehouse system to solve the defects of client/server data warehouse systems. [4] presents a framework for supporting interoperability among data warehouse islands for federated environments based on XML. A change-centric method to manage versions in a web warehouse of XML data is published in [5]. Integration strategies and their application to XML Schema integration has been discussed in [6]. The author of [8] introduces a dynamic warehouse, which supports evaluation, change control and data integration of XML data. The remainder of this paper is organized as follows. Section 2 discusses the generation of XML data warehouses based on frequent query patterns of users’ queries. In Sections 3, query processing using the data warehouse is discussed. Experimental results are repoeted in Section 4. The final section conclude this paper.
2. Building a XML DWH Based on Frequent Query Patterns 2.1.
Transforming Users’ Queries into Query Path Transactions
XQuery is a flexible language commonly used to query a broad spectrum of XML information sources, including both databases and documents [7]. The following XQuery-formatted query aims to extract the ISBN, Title, Author and Price of books with a price over 20 dollars from a set of XML documents about book-related information. The global DTD tree is shown in Figure 1.
FOR $a IN DOCUMENT (book XML documents)/book
SATISFIES $a/Price/data() > 20
RETURN {$a/ISBN, $a/Title, $a/Author, $a/Price}
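For illustration only (ours, not the authors' implementation), decomposing such a query into query paths needs little more than the binding path of the FOR clause and the relative paths used in the RETURN (and predicate) clauses; the names below mirror the sample query:

def query_paths(binding_path, selected_paths):
    # binding_path: path bound in the FOR clause, e.g. "Book";
    # selected_paths: relative paths used in RETURN/WHERE, e.g. "Author/Name"
    return [binding_path + "/" + rel for rel in selected_paths]

qps = query_paths("Book", ["ISBN", "Title", "Author/Name", "Author/Affiliation", "Price"])
print(qps)   # ['Book/ISBN', 'Book/Title', 'Book/Author/Name', 'Book/Author/Affiliation', 'Book/Price']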
Fig. 1. Global DTD Tree of multiple XML documents (Book with ISBN, Title, Author+ (Name, Affiliation), Publisher, Price, Year, and Section+ (Title, Para*, Figure* (Title, Image)))
QP1: Book/ISBN
QP2: Book/Title
QP3: Book/Author/Name
QP4: Book/Author/Affiliation
QP5: Book/Price
Fig. 2. QPs of the XQuery sample
A Query Path (QP) is a path expression of a DTD tree that starts at the root of the tree. QPs can be obtained from the query script expressed using XQuery statements. The sample query above can be decomposed into five QPs, as shown in Figure 2. The root of a QP is denoted as Root(QP), and all QPs in a query have the same root. Please note that two QPs with different roots are regarded as different QPs, even though these two paths may have some common nodes. This is because different roots of paths often indicate dissimilar contexts of the queries. For example, the two queries Author/Name and Book/Author/Name are different because Root(Author/Name) = Author ≠ Book = Root(Book/Author/Name). A query can be expressed using a set of QPs which includes all the QPs that this query consists of. For example, the above sample query, denoted as Q, can be expressed using a QP set such as Q = {QP1, QP2, QP3, QP4, QP5}. By transforming all the queries into QP sets, we obtain a database containing all these QP sets of queries, denoted as DQPS. We will then apply a rule-mining technique to discover significant rules among the users' query patterns. 2.2.
Discovering Frequent Query Path Sets in DQPS
The aim of applying a rule mining technique in DQPS is to discover Frequent Query Path Sets (FreqQPSs) in DQPS. A FreqQPS contains frequent QPs that jointly occur in DQPS. Frequent Query Pattern Trees (FreqQPTs) are built from these FreqQPSs and serve as building blocks of the schemas of the integrated XML documents in the data warehouse. The formal definition of FreqQPSs is given as follows.
Definition 1. Frequent Query Path Set (FreqQPS): From all the occurring QPs in DQPS transformed from users' queries, a Frequent Query Path Set (FreqQPS) is a set of QPs {QP1, QP2, …, QPn} that satisfies the following two requirements:
(1) Support requirement: Support({QP1, QP2, …, QPn}) ≥ minsup;
(2) Confidence requirement: for each QPi, Freq({QP1, QP2, …, QPn}) / Freq(QPi) ≥ minconf,
where Freq(s) counts the occurrences of set s in DQPS. In (1), Support({QP1, QP2, …, QPn}) = Freq({QP1, QP2, …, QPn}) / N(DQPS), where N(DQPS) is the total number of QPs in DQPS. The constants minsup and minconf are the minimum support and confidence thresholds, specified by the user. A FreqQPS that consists of n QPs is termed an n-itemset FreqQPS. The definition of a FreqQPS is similar to that of association rules. The support requirement is identical to the traditional definition of large association rules. The confidence requirement is, however, more rigid than the traditional definition. Setting a more rigid confidence requirement ensures that the joint occurrence of QPs in a FreqQPS is significant enough with respect to an individual occurrence of any
QP. Since the number of QPs in a FreqQPS is unknown in advance, we will mine all FreqQPSs containing various numbers of itemsets. The FreqQPS mining algorithm is presented in Figure 3. The n-itemset QPS candidates are generated by joining (n-1)-itemset FreqQPSs. A pruning mechanism is devised to delete those candidates of the n-itemset QPSs that do not have n (n-1)-itemset subsets in the (n-1)-itemset FreqQPS list. The reason is that if one or more (n-1)-subsets of an n-itemset QPS candidate are missing in the (n-1)-itemset FreqQPS list, this n-itemset QPS cannot become a FreqQPS. This is more rigid than the pruning mechanism used in conventional association rule mining. For example, if one or more of the 2-itemset QPSs {QP1, QP2}, {QP1, QP3} and {QP2, QP3} are not frequent, then the 3-itemset QPS {QP1, QP2, QP3} cannot become a frequent QPS. After pruning, the n-itemset QPS candidates are evaluated in terms of the support and confidence requirements to decide whether or not they are FreqQPSs. The (n-1)-itemset FreqQPSs are finally deleted if they are subsets of some n-itemset FreqQPSs. For example, the 2-itemset FreqQPT {QP1, QP2} will be deleted from the 2-itemset FreqQPT list if the 3-itemset {QP1, QP2, QP3} exists in the 3-itemset FreqQPT list.
Algorithm MineFreqQPS
Input: DQPS, minsup, minconf.
Output: FreqQPSs of varied numbers of itemsets.
FreqQPS1 = {QP in DQPS | SatisfySup(QP) = true};
i = 2;
WHILE (CanFreqQPS(i-1) is not empty) {
  CanQPS(i) = CanQPSGen(FreqQPS(i-1));
  CanQPS(i) = CanQPS(i) - {QPS(i) | NoSubSet(QPS(i), FreqQPS(i-1))} …
… x_j ∈ TS, x_j = (v_j1, v_j2, …, v_jn), 1 ≤ j ≤ m. If A_i is a numerical attribute, w_jk = v_ji, k = k + 1. If A_i is a nominal attribute, w_jk = μ_C1^Ai(x_j), w_j(k+1) = μ_C2^Ai(x_j), …, w_j(k+K-1) = μ_CK^Ai(x_j), k = k + K; repeat Step 2 until y_j = (w_j1, w_j2, …, w_jn') is generated, where n' = (n - q) + qK and q is the number of nominal attributes in A. Step 3: j = j + 1; if j ≤ m, go to Step 1. Step 4: Generate the new training set TS'. TS' = {(y_j, c_j) | y_j = (w_j1, w_j2, …, w_jn'), c_j ∈ C, 1 ≤ j ≤ m}. Step 5: Initialize the population. Let gen = 1 and generate the initial set of individuals Ω1 = {h1(1), h2(1), …, hp(1)}, where Ω(gen) is the population in generation gen and hi(gen) stands for the i-th individual of generation gen. Step 6: Evaluate the fitness value of each individual on the training set. For all hi(gen) ∈ Ω(gen), compute the fitness values Ei(gen) = fitness(hi(gen), TS'), where the fitness evaluating function fitness() is defined as in Section 3.3. Step 7: Does it satisfy the conditions of termination? If the best fitness value Ei(gen) satisfies the termination condition (Ei(gen) = 0) or gen is equal to the specified maximum generation, the hi(gen) with the best fitness value is returned and the algorithm halts; otherwise, gen = gen + 1. Step 8: Generate the next generation of individuals and go to Step 6. The new population Ω(gen) of the next generation is generated according to the ratios Pr, Pc and Pm, where Pr, Pc and Pm represent the probabilities of the reproduction, crossover and mutation operations, respectively.
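A compressed sketch of the evolutionary loop in Steps 5-8 (ours; the fitness function, the zero-fitness termination threshold and the operator probabilities Pr, Pc, Pm are taken as given, the genetic operators are only stubbed, and crossover is assumed to return a list of offspring):

import random

def evolve(init_population, fitness, reproduce, crossover, mutate,
           pr=0.1, pc=0.8, pm=0.1, max_gen=10000):
    population = list(init_population)                      # Step 5
    for gen in range(1, max_gen + 1):
        scores = [fitness(h) for h in population]           # Step 6 (lower is better)
        best_score, best_i = min(zip(scores, range(len(population))))
        if best_score == 0:                                 # Step 7: termination
            return population[best_i]
        nxt = []                                            # Step 8: next generation
        while len(nxt) < len(population):
            r = random.random()
            if r < pr:
                nxt.append(reproduce(population))
            elif r < pr + pc:
                nxt.extend(crossover(population))
            else:
                nxt.append(mutate(population))
        population = nxt[:len(population)]
    return population[min(range(len(population)), key=lambda i: fitness(population[i]))]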
4 The Classification Algorithm After the learning phase, we obtain a set of classification functions F that can recognize the classes in TSc. However, these functions may still conflict each other in practical cases. To avoid the situations of conflict and rejection, we proposed a
198
Been-Chian Chien et al.
scheme based on the Z-score of statistical test. In the classification phase, we calculate all the Z-values of every classification function for the unknown class data and compare these Z-values. If the Z-value of an unknown class object yj for classification fi is minimum, then yj belongs to the class Ci. We present the classification approach in the following. For a classification function fi F corresponding to the class Ci and positive instances TSc with cj = Ci. Let X i be the mean of values of fi(yj) defined in Section 3.3. The standard deviation of values of fi(yj), 1 d j d mi, is defined as
¦( f (y
Vi
i y j , c j !TS' , c j Ci
j
) X i )2
mi
, 1 d j d mi , 1 d i d K
For a data x S and a classification function fi, let y Sc be the data with all numerical values transformed from x using rough attribute membership. The Z-value of data y for fi is defined as | f i ( y) X i | Z i ( y) ,
Vi
where 1 didK. We used the Z-value to determine the class to which the data should be assigned. The detailed classification algorithm is listed as follows. Algorithm: The classification algorithm Input: A data x. Output: The class Ck that x is assigned. Step 1: Initial value k = 1. Step 2: Transform nominal attributes of x into numerical attributes. Assume that the data x S, x = (v1, v2, …, vn). If Ai is a numerical attribute, wjk = vji, k = k + 1. If Ai is a nominal attribute, wjk = P CA ( x j ) , wj(k+1) = P CA2 ( x j ) , ... , wj(k+K-1)= P CAK ( x j ) , i i k = k + K, repeat Step 2 until yj = (wj1, wj2, …, wjn’) is generated, nc = (n - q) + qK, q is the number of nominal attributes in A. Step 3: Initially, i = 1. Step 4: Compute Zi(y) with classification function fi(y). Step 5: If i < K, then i = i + 1, go to Step 4. Otherwise, go to Step 6. Step 6: Find the k Arg min{ Z i ( y )} , the data x will be assigned to the class Ck. 1
i
1d i d K
Table 1. The parameters of GPQuick used in the experiments Parameters Node mutate weight Mutate constant weight Mutate shrink weight Selection method Tournament size Crossover weight Crossover weight annealing
Values 43.5% 43.5% 13% Tournament 7 28% 20%
Parameters Values Mutation weight 8% Mutation weight annealing 40% Population size 1000 Set of functions {+, -, u, y} 0 Initial value of X i Initial value of ri 10 Generations 10000
Generating Effective Classifiers with Supervised Learning of Genetic Programming
199
5 The Experimental Results The proposed learning algorithm based on genetic programming is implemented by modifying the GPQuick 2.1 [15]. Since the GPQuick is an open source on Internet, it is more confidential for us to build and evaluate the algorithm of learning classifiers. The parameters used in our experiments are listed in Table 1. We define only four basic operations {+, -, u, y} for final functions. That is, each classification function contains only the four basic operations. The population size is set to be 1000 and the number of maximum generations of evolution is set to be 10000 for all data sets. Although the number of generations is high, the GPQuick still have good performance in computation time. The experimental data sets are selected from UCI Machine Learning repository [1]. We take 15 data sets from the repository totally including three nominal data sets, four composite data sets (with nominal and numeric attributes), and eight numeric data sets. The size of data and the number of attributes in the data sets are quite diverse. The related information of the selected data sets is summarized in Table 2. Since the GPQuick is fast in the process of evolving, the training time for each classification function in 10000 generations can be done in few seconds or minutes depending on the number of cases in the training data sets. The proposed learning algorithm is efficient when it is compared with the training time in [11] having more than an hour. We don’t know why the GPQuick is so powerful in evolving speed, but it is easy to get the source [15] and modify the problem class to obtain the results for everyone. The performance of the proposed classification scheme is evaluated by the average classification error rate of 10-fold cross validation for 10 runs. We figure out the experimental results and compare the effectiveness with different classification models in Table 3. These models include statistical model like Naïve Bayes[3], NBTree [7], SNNB [16], the decision tree based classifier C4.5 [14] and the association rule-based classifier CBA [10]. The related error rates in Table 3 are cited from [16] except the GP-based classifier. Since the proposed GP-based classifier is random based, we also show the standard deviations in the table for reference of readers. From the experimental results, we observed that the proposed method obtains lower error rates than CBA in 12 out of the 15 domains, and higher error rates in three domains. It obtains lower error rates than C4.5 rules in 13 domains, only one domain has higher error rate and the other one results in a draw. While comparing our method with NBTree and SNNB, the results are also better for most cases. While comparing with Naïve Bayes, the proposed method wins 13 domains and loses in 2 domains. Generally, the classification results of proposed method are better than others on an average. However, in some data sets, the test results in GP-based is much worse than others, for example, in the “labor” data, we found that the average error rate is 20.1%. The main reason of high error rate terribly in this case is the small size of samples in the data set. The “labor” contains only 57 data totally and is divided into two classes. While the data with small size is tested in 10-fold cross validation, the situation of overfitting will occur in both of the two classification functions. That’s also why the rule based classifiers like C4.5 and CBA have the similar classification results as ours in the labor data set.
200
Been-Chian Chien et al.
6 Conclusions Classification is an important task in many applications. The technique of classification using genetic programming is a new classification approach developed recently. However, how to handling nominal attributes in genetic programming is a difficult problem. We proposed a scheme based on the rough membership function to classify data with nominal attribute using genetic programming in this paper. Further, for improving the accuracy of classification, we proposed an adaptive interval fitness function and use the minimum Z-value to determine the class to which the data should be assigned. The experimental results demonstrate that the proposed scheme is feasible and effective. We are trying to reduce the dimensions of attributes for any possible data sets and cope with the data having missing values in the future. Table 2. The information of data sets Attributes Attributes Data set Classes Cases Data set Classes Cases nominal numeric nominal numeric australian 8 6 2 690 lymph 18 0 4 148 german 13 7 2 1000 pima 0 8 2 768 glass 0 9 7 214 sonar 0 60 2 208 heart 7 6 2 270 tic-tac-toe 9 0 2 958 ionosphere 0 34 2 351 vehicle 0 18 4 846 iris 0 4 3 150 waveform 0 21 3 5000 labor 8 8 2 57 wine 0 13 3 178 led7 7 0 10 3200
Data sets australian german glass heart ionosphere iris labor led7 lymph pima sonar tic-tac-toe vehicle waveform wine
Table 3. The average error rates (%) for compared classifiers NB NBTree SNNB C4.5 CBA GP-Ave. 14.1 14.5 14.8 15.3 14.6 9.5 24.5 24.5 26.2 27.7 26.5 16.7 28.5 28.0 28.0 31.3 26.1 22.1 18.1 17.4 18.9 19.2 18.1 11.9 10.5 12.0 10.5 10.0 7.7 7.2 5.3 7.3 5.3 4.7 5.3 4.7 5.0 12.3 3.3 20.7 13.7 20.1 26.7 26.7 26.5 26.5 28.1 18.7 19.0 17.6 17.0 26.5 22.1 13.7 24.5 24.9 25.1 24.5 27.1 18.3 21.6 22.6 16.8 29.8 22.5 5.6 30.1 17.0 15.4 0.6 0.4 5.2 40.0 29.5 28.4 27.4 31 24.7 19.3 16.1 17.4 21.9 20.3 11.7 1.7 2.8 1.7 7.3 5.0 4.5
S.D. 1.2 2.2 2.9 2.7 2.3 1.0 3.0 2.7 1.5 2.8 1.9 1.6 2.4 1.8 0.7
Generating Effective Classifiers with Supervised Learning of Genetic Programming
201
References 1.
2.
3. 4.
5. 6.
7.
8. 9.
10.
11.
12. 13.
14. 15. 16.
17.
Blake, C., Keogh, E. and Merz, C. J.: UCI repository of machine learning database, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, University of California, Department of Information and Computer Science (1998) Bramrier, M. and Banzhaf, W.: A Comparison of Linear Genetic Programming and Neural Networks in Medical Data Mining, IEEE Transaction on Evolutionary Computation, 5, 1 (2001) 17-26 Duda, R. O. and Hart, P. E.: Pattern Classification and Scene Analysis, New York: Wiley, John and Sons Incorporated Publishers (1973) Freitas, A. A.: A Genetic Programming Framework for Two Data Mining Tasks: Classification and Generalized Rule Induction, Proc. the 2nd Annual Conf. Genetic Programming. Stanford University, CA, USA: Morgan Kaufmann Publishers (1997) 96101 Heckerman, D. M. and Wellman, P.: Bayesian Networks, Communications of the ACM, 38, 3 (1995) 27-30 Kishore, J. K., Patnaik, L. M., and Agrawal, V. K.: Application of Genetic Programming for Multicategory Pattern Classification, IEEE Transactions on Evolutionary Computation, 4, 3 (2000) 242-258 Kohavi, R.: Scaling Up the Accuracy of Naïve-Bayes Classifiers: a Decision-Tree Hybrid. Proc. Int. Conf. Knowledge Discovery & Data Mining. Cambridge/Menlo Park: AAAI Press/MIT Press Publishers (1996) 202-207 Koza, J. R.: Genetic Programming: On the programming of computers by means of Natural Selection, MIT Press Publishers (1992) Koza, J. R.: Introductory Genetic Programming Tutorial, Proc. the First Annual Conf. Genetic Programming, Stanford University. Cambridge, MA: MIT Press Publishers (1996) Liu, B., Hsu, W., and Ma, Y.: Integrating Classification and Association Rule Mining. Proc. the Fourth Int. Conf. Knowledge Discovery and Data Mining. Menlo Park, CA, AAAI Press Publishers (1998) 443-447 Loveard, T. and Ciesielski, V.: Representing Classification Problems in Genetic Programming, Proc. the Congress on Evolutionary Computation. COEX Center, Seoul, Korea (2001) 1070-1077 Pawlak, Z.: Rough Sets, International Journal of Computer and Information Sciences, 11 (1982) 341-356 Pawlak, Z. and Skowron, A.: Rough Membership Functions, in: R.R. Yager and M. Fedrizzi and J. Kacprzyk (Eds.), Advances in the Dempster-Shafer Theory of Evidence (1994) 251-271 Quinlan, J. R.: C4.5: Programs for Machine Learning, San Mateo, California, Morgan Kaufmann Publishers (1993) Singleton A. Genetic Programming with C++, http://www.byte.com/art/9402/sec10/art1.htm, Byte, Feb., (1994) 171-176 Xie, Z., Hsu, W., Liu, Z., and Lee, M. L.: SNNB: A Selective Neighborhood Based Naïve Bayes for Lazy Learning, Proc. the sixth Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining (2002) 104-114 Zhang, G. P.: Neural Networks for Classification: a Survey, IEEE Transaction on Systems, Man, And Cybernetics-Part C: Applications and Reviews, 30, 4 (2000) 451462
Clustering by Regression Analysis

Masahiro Motoyoshi¹, Takao Miura¹, and Isamu Shioya²

¹ Dept. of Elect. & Elect. Engr., HOSEI University, 3-7-2 KajinoCho, Koganei, Tokyo 184-8584, Japan
{i02r3243,miurat}@k.hosei.ac.jp
² Dept. of Management and Informatics, SANNO University, 1573 Kamikasuya, Isehara, Kanagawa 259-1197, Japan
[email protected]

Abstract. In data clustering, many approaches have been proposed, such as the K-means method and hierarchical methods. One of the problems is that the results depend heavily on initial values and on the criterion used to combine clusters. In this investigation, we propose a new method to avoid this deficiency. We assume that there exist aspects of local regression in the data, and we develop a theory to combine clusters using F values obtained by regression analysis as the criterion. We report experiments and show how well the theory works.

Keywords: Data Mining, Multivariable Analysis, Regression Analysis, Clustering
1 Introduction
It is well known that stocks in a securities market are classified according to industry genre (classification of industries). Such genres appear very often in security investment. The movements of stocks in the same genre would be similar to each other, but the classification should be maintained in line with economic conditions, trends and activities in society, and regulations. Sometimes we see a mismatch between the classification and the real world. When an analyst tries to classify using a more effective criterion, she/he will attempt a quantitative classification. Cluster analysis is one of the methods based on multivariate analysis that performs such a quantitative classification. Cluster analysis is a general term for algorithms that classify similar objects into groups (clusters), where the objects within one cluster share homogeneous features. In every research activity, a researcher is faced with the problem of how observed data should be systematically organized, that is, how to classify them. Generally, the higher the similarity of objects within a cluster and the lower the similarity between clusters, the better the clustering. This means that the quality of clustering depends on the definition of similarity and on the computational complexity, and there is no guarantee that the similarity can be interpreted easily.
The same is true for similarity from the viewpoint of analysts. It is the analyst's responsibility to apply methods accurately to specific applications; the point is how to find hidden patterns.
Roughly speaking, cluster analysis is divided into hierarchical and non-hierarchical methods [4]. In non-hierarchical methods, the data are decomposed into k clusters that best satisfy an evaluation criterion. To obtain the best solution we would have to examine all possible partitions, which takes too much time, so heuristic methods have been investigated. The K-means method [6] generates clusters based on their centers. The fuzzy K-means method [1] takes an approach based on fuzzy classification. AutoClass [3] automatically determines the number of clusters and classifies the data stochastically. Recently an interesting approach called "Local Dimensionality Reduction" has been proposed [2]. In this approach, data are assumed to have local correlation, as in our case, but the clustering technique is based on Principal Component Analysis (PCA) and the algorithm is completely different from ours.
In this investigation, we assume that there exist aspects of local regression in the data, i.e., we assume the observed data are structured as local sub-linear spaces. We propose a new clustering method that uses variance and the F value obtained by regression analysis as the criterion to build suitable clusters.
In the next section we discuss why conventional approaches are not suitable for our situation. In section 3 we give some definitions and discuss preliminary processing of the data. Section 4 contains the method to combine clusters and the criterion. In section 5 we examine experimental results and a comparative experiment with the K-means method. After reviewing related works in section 6, we conclude our work in section 7.
2 Clustering Data with Local Trends
As previously mentioned, we assume a collection of data in which several trends appear at the same time. Such data can be regressed locally by partial linear functions, and each trend forms an elliptic cluster in multidimensional space. Of course, these clusters may cross each other. The naive solution is to put clusters together using the nearest-neighbor method found in hierarchical approaches. However, when clusters cross, the result is not good: they will be divided at the crossing, while clusters that have different trends but are close to each other may be combined. An approach based on the general Minkowski distance has the same problem. In the K-means method, a collection of objects is represented by its center of gravity, and every textbook notes that it is not suitable for non-convex clusters. The notion of a center comes from the notion of variance, but if we look for points that improve the linearity of two clusters by moving objects, we cannot always obtain such a point. More seriously, we must decide the number of clusters in advance.
These two approaches share a common problem: how to define similarity between clusters. In our case we want to capture local aspects of sub-linearity, so a new technique should satisfy the following properties: (1) a similarity suited to classifying sub-linear spaces, and (2) convergence at a suitable level (i.e., a number of clusters) that can be interpreted easily. Regression analysis is one of the techniques of multivariable analysis by which we can predict future phenomena in the form of mathematical functions. We introduce the F value as the similarity criterion for combining clusters, so that a cluster can be considered as a line; that is, our approach is clustering by lines, while the K-means method is clustering by points. In this investigation, by examining F values as the similarity measure, we combine linear clusters one by one, taking a bottom-up approach toward the target clusters.
3 The Choice of Initial Clusters
In this section, let us explain the difference between our approach and hierarchical clustering by existing agglomerative nesting. An object is a thing in the world of interest; data is a set of objects, each of which has several variables. We assume that all variables are inputs given from the surroundings and that there are no other external criteria for a classification. There are two kinds of variables: a criterion variable is an attribute that plays the role of the criterion of the regression analysis, given by the analyst, and the others are called explanatory variables. In this investigation we discuss only numerical data; for categorical data, readers may think of quantification theory or dummy variables.
We deal with the data as a matrix (X|Y). An object is represented as a row of the matrix, while the explanatory/criterion variables are described as columns. We denote the explanatory variables and the criterion variable by x_1, x_2, ..., x_m and y respectively, and the number of objects by n:

(X|Y) = \begin{pmatrix} x_{11} & \cdots & x_{1m} & y_1 \\ \vdots & \ddots & \vdots & \vdots \\ x_{k1} & \cdots & x_{km} & y_k \\ \vdots & \ddots & \vdots & \vdots \\ x_{n1} & \cdots & x_{nm} & y_n \end{pmatrix} \in \mathbb{R}^{n \times (m+1)}   (1)

where X denotes the explanatory variables and Y denotes the criterion variable. Each variable is assumed to be normalized (Z score). An initial cluster is a set of objects, and each object is exclusively contained in one initial cluster.
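As a small illustration of the preprocessing just described (this is not the authors' code; the variable names are our own), the Z-score standardization of a data matrix can be sketched in Python as follows:

```python
import numpy as np

def zscore(matrix):
    """Standardize each column (variable) of an n x (m+1) data matrix to Z scores."""
    data = np.asarray(matrix, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0.0] = 1.0          # guard against constant columns
    return (data - mean) / std

# Example: 4 objects, 2 explanatory variables and 1 criterion variable.
raw = [[1019.2,  5.0, 4.1],
       [1018.6,  5.2, 4.3],
       [1018.3,  5.4, 4.4],
       [1014.6, -5.8, 0.9]]
print(zscore(raw))
```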
In the first stage of agglomerative nesting, each object represents its own cluster and similarity is defined as the distance between objects. However, we want every cluster to have a variance, because we deal with "data as a line", and we pose this assumption on the initial clusters. To make "our" initial clusters, we divide the objects into small groups. We obtain the initial clusters dynamically by an inner product (cosine), which measures the difference of angle between two vectors, as the following algorithm shows:
0. Let the input vectors be s_1, s_2, ..., s_n.
1. Let the first input vector s_1 be the center of cluster C_1 and a member of C_1.
2. Calculate the similarity between s_k and the existing clusters C_1, ..., C_i by (2). If every similarity is below a given threshold Θ, generate a new cluster and let s_k be its center. Otherwise, let s_k be a member of the cluster with the highest similarity and recompute the center of that cluster by (3).
3. Repeat until all assignments are completed.
4. Remove clusters that have no F value or fewer than m + 2 members.
where

cos(k, j) = (s_k · c_j) / (|s_k| |c_j|)   (2)

c_j = ( \sum_{s_k \in C_j} s_k ) / M_j   (3)

Note that M_j means the number of members in C_j and m means the number of explanatory variables.
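A minimal Python sketch of steps 0-3 above (our own paraphrase, not the authors' implementation; the pruning of step 4 is left to the caller) could look as follows:

```python
import numpy as np

def initial_clusters(vectors, theta):
    """Assign each input vector to the existing cluster whose center is most
    similar by cosine (eq. 2); open a new cluster when every similarity is
    below the threshold theta. Centers are recomputed as member means (eq. 3)."""
    centers, members = [], []
    for s in map(np.asarray, vectors):
        sims = [float(s @ c) / (np.linalg.norm(s) * np.linalg.norm(c) + 1e-12)
                for c in centers]
        if not sims or max(sims) < theta:
            centers.append(s.astype(float))      # s becomes the center of a new cluster
            members.append([s])
        else:
            j = int(np.argmax(sims))
            members[j].append(s)
            centers[j] = np.mean(members[j], axis=0)
    return centers, members
```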
4 Combining Clusters
Now let us define similarities between clusters and describe how the similarity criterion relates to combining. We define the similarity between clusters from two aspects. The first aspect is the distance between cluster centers. We simply take the Euclidean distance as the distance measure (in line with the least-squares method used by regression analysis):

d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + ... + |x_{im} - x_{jm}|^2 + |y_i - y_j|^2 }   (4)

Then we define the non-similarity matrix as the lower-triangular matrix of pairwise distances:

\begin{pmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & d(n,n-1) & 0 \end{pmatrix} \in \mathbb{R}^{n \times n}   (5)
Clearly one of the candidate clusters to combine with is the one at the smallest distance, and we have to examine whether the combination is suitable in our case. For this purpose we define a new similarity by the F value of the regression, to keep the regression effective. In the following, let us quickly review the F test and estimation by regression based on the least-squares method in multiple regression analysis.
Given a cluster represented by a data matrix like (1), we define the corresponding multiple regression model as

y = b_1 x_1 + b_2 x_2 + ... + b_m x_m + e   (6)

The least-squares estimator \tilde{b}_i of b_i is given by

B = (\tilde{b}_1, \tilde{b}_2, ..., \tilde{b}_m) = (X^T X)^{-1} X^T Y   (7)

This is called the regression coefficient; actually it is a standardized partial regression coefficient, because it is based on Z scores. Let y_k be an observed value and Y_k be the value predicted from the regression coefficients. Then, for the variation explained by the regression, the sum of squares S_R and the mean square V_R are defined as

S_R = \sum_{k=1}^{n} (Y_k - \bar{Y})^2 ;   V_R = S_R / m   (8)

For the residual variation, the sum of squares S_E and the mean square V_E are

S_E = \sum_{k=1}^{n} (y_k - Y_k)^2 ;   V_E = S_E / (n - m - 1)   (9)

Then we define the F value F_0 by

F_0 = V_R / V_E   (10)

It is well known that F_0 obeys an F distribution whose first and second degrees of freedom are m and n - m - 1 respectively.
Given clusters A and B whose numbers of members are a and b respectively, the data matrix of the combined cluster A ∪ B is

(X|Y) = \begin{pmatrix} x_{A11} & \cdots & x_{A1m} & y_{A1} \\ \vdots & \ddots & \vdots & \vdots \\ x_{Aa1} & \cdots & x_{Aam} & y_{Aa} \\ x_{B11} & \cdots & x_{B1m} & y_{B1} \\ \vdots & \ddots & \vdots & \vdots \\ x_{Bb1} & \cdots & x_{Bbm} & y_{Bb} \end{pmatrix} \in \mathbb{R}^{n \times (m+1)}   (11)
where n = a + b. As previously mentioned, we can calculate the regression by (7) and the F value by (10). Let us examine the relationship between the F values of clusters before and after combining. Let A, B be two clusters, F_A, F_B their F values, and F the F value after combining A and B. Then we observe some interesting properties.
Property 1. F_A > F, F_B > F. When F decreases, the gradients are significantly different; we can say that the similarity between A and B is low and the linearity of the combined cluster decreases. In the special case F_A = F_B, F = 0, both A and B have the same number of objects and the same coordinates, and the regressions are orthogonal at the center of gravity.
Property 2. F_A ≤ F, F_B ≤ F. When F increases, the gradients are not significantly different, the similarity between A and B is high, and the linearity of the combined cluster increases. When F_A = F_B and F = 2 × F_A, A and B have the same number of objects and the same coordinates.
Property 3. F_A ≤ F, F_B > F, or F_B ≤ F, F_A > F. One of F_A, F_B increases while the other decreases; this happens when there is a big difference between the variances of A and B, or between F_A and F_B. In this case we cannot say anything about combining.
Thus we had better combine clusters only if F is bigger than both F_A and F_B. Non-similarity based on Euclidean distance is a useful device to prevent the combination of clusters whose distance is larger than the local scale. Since our algorithm proceeds based on the F-value criterion, it continues to look for candidate clusters at increasing distances until the F-value criterion is satisfied. However, in the case of defective initial clusters, or when there is no cluster that regresses locally, the process might combine clusters that should not be combined. To overcome this problem, we introduce a threshold ∆ on the distance, that is, ∆ acts as a criterion on the variance:

∆ > (Var(A) + Var(B)) × D

where Var(A) and Var(B) are the variances of A and B respectively and D is the distance between their centers of gravity. When A and B satisfy both the F-value criterion and the ∆ criterion, we combine the two clusters. In our experiments, we give ∆ the average of the internal variances of the initial clusters as a default. By ∆ we manage the internal variances of clusters and avoid combining distant clusters.
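To make the F-value criterion concrete, here is a minimal Python sketch (our own, not the authors' code; it uses numpy's least-squares solver rather than the explicit matrix inverse of eq. 7) that computes F_0 for a cluster from its standardized data:

```python
import numpy as np

def f_value(X, y):
    """F value (eq. 10) of a cluster given explanatory variables X (n x m)
    and criterion variable y (length n), both already standardized (Z scores)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, m = X.shape
    if n < m + 2:                                # no residual degrees of freedom
        return None
    b, *_ = np.linalg.lstsq(X, y, rcond=None)    # eq. (7)
    pred = X @ b
    s_r = np.sum((pred - pred.mean()) ** 2)      # eq. (8)
    s_e = np.sum((y - pred) ** 2)                # eq. (9)
    v_r, v_e = s_r / m, s_e / (n - m - 1)
    return v_r / v_e if v_e > 0 else float("inf")
```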
Now our algorithm is given as follows.
1. Standardize the data.
2. Calculate the initial clusters that satisfy Θ. Remove clusters whose number of members does not reach the number of explanatory variables.
3. For each cluster, calculate the center of gravity, the variance, the regression coefficients and the F value, and calculate the distances between the centers of gravity.
4. Choose the closest clusters as candidates for combining. Standardize the pair and calculate the regression coefficients and the F value again.
5. Combine the pair if the F value of the combined cluster is bigger than the F value of each cluster and if it satisfies the ∆ criterion. Otherwise, go to step 4 to choose other candidates; if there are no more candidates, stop.
6. Recalculate the center of gravity of each cluster and the distances between the centers of gravity, and go to step 4.
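A compact Python sketch of the agglomeration loop (again our own paraphrase of steps 3-6, under the assumption that the clusters are already standardized; the re-standardization of each candidate pair in step 4 is omitted for brevity, and f_value is the function sketched above):

```python
import numpy as np
from itertools import combinations

def combine_clusters(clusters, delta, f_value):
    """clusters: list of (X, y) pairs of standardized member data, where X is an
    (n x m) array of explanatory variables and y the criterion variable.
    Repeatedly merge the closest pair whose merged F value exceeds both original
    F values and whose variances satisfy the Delta criterion."""
    def as_matrix(X, y):
        X, y = np.asarray(X, float), np.asarray(y, float)
        return np.hstack([X, y[:, None]])

    def center(X, y):
        return as_matrix(X, y).mean(axis=0)

    def variance(X, y):
        data = as_matrix(X, y)
        return float(np.mean(np.sum((data - data.mean(axis=0)) ** 2, axis=1)))

    merged = True
    while merged and len(clusters) > 1:
        merged = False
        # candidate pairs ordered by increasing distance between centers (step 4)
        pairs = sorted(combinations(range(len(clusters)), 2),
                       key=lambda ij: float(np.linalg.norm(
                           center(*clusters[ij[0]]) - center(*clusters[ij[1]]))))
        for i, j in pairs:
            (Xi, yi), (Xj, yj) = clusters[i], clusters[j]
            Xu = np.vstack([Xi, Xj])
            yu = np.concatenate([np.ravel(yi), np.ravel(yj)])
            fi, fj, fu = f_value(Xi, yi), f_value(Xj, yj), f_value(Xu, yu)
            d = float(np.linalg.norm(center(Xi, yi) - center(Xj, yj)))
            ok_delta = delta > (variance(Xi, yi) + variance(Xj, yj)) * d
            if None not in (fi, fj, fu) and fu > fi and fu > fj and ok_delta:
                clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                clusters.append((Xu, yu))
                merged = True
                break                 # recompute centers and distances (step 6)
    return clusters
```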
5 Experiments
In this section, let us show some experiments to demonstrate the feasibility of our theory. We use Weather Data in Japan [5]: data from two meteorological observatories, Wakkanai in Hokkaido (northern Japan) and Niigata in Honshu (central Japan), measured in January 1997. Each observatory contributes 744 records. To apply our method under the assumption that there are clusters to regress locally, we simply joined the two datasets, giving 1488 records of about 180 KB. Each record contains 22 attributes observed every hour. Among them we use "day" (day), "hour" (hour), "pressure" (hPa), "sea-level pressure" (hPa), "air temperature" (°C), "dew point" (°C), "steam pressure" (hPa) and "relative humidity" (%) as candidate variables. All of them are numerical and have no missing values. We additionally use "observation point number", but only for the purpose of evaluation. Table 1 contains examples of the data. Before processing, we standardized all variables for our algorithm. We take "air temperature" as the criterion variable and the other values as explanatory variables.
Table 1. Weather Data

Point  Day  Hour  Pressure  Sea pressure  Temperature  ...
604    1    1     1019.2    1020.0         5.0         ...
604    1    2     1018.6    1019.4         5.2         ...
604    1    3     1018.3    1019.1         5.4         ...
...
401    31   24    1014.6    1016.0        -5.8         ...
Let Θ = 0.8 and ∆ = 15. Table 2 shows the results. In this experiment we obtained 40 initial clusters from 1364 objects using the inner product; the other 124 objects were excluded because they fell into clusters that were too small. Convergence took 35 loops, and eventually we got 5 clusters.
Let us go into more detail. Cluster 1 was obtained by combining 19 initial clusters, whereas clusters 2 and 3 involve no combining; clusters 4 and 5 contain 10 and 9 initial clusters respectively. Generally the results reflect features of the observation points. In fact, cluster 1 contains 519 (69.8%) of the 744 objects from the Niigata point, and cluster 5 holds 469 (63.0%) of the 744 objects from the Wakkanai point. Thus we can say that cluster 1 reflects trends peculiar to Niigata well, and cluster 5 reflects trends peculiar to Wakkanai well. For example, in Table 3 both "pressure" and "temperature" of cluster 1 are higher than in the other clusters, so this cluster contains objects observed in a region of high pressure and high temperature. Also, "temperature" and "humidity" in cluster 5 are relatively low, so this cluster contains objects observed in a region of low precipitation and low temperature. In the case of cluster 4, the "day" value is high (the data were observed in January), the "pressure" is low and the "humidity" is high. We can say that cluster 4 is unrelated to the observation region; we might characterize it by the state of the weather, such as low-pressure systems. In fact, this cluster contains almost the same number of objects from the Niigata and Wakkanai points.
In Table 4, the absolute values of the regression coefficients in cluster 5 are overall higher than in clusters 1 and 4: compared to the change of the weather variables, the change of temperature is large, i.e., temperature varies over a wide range. Since Hokkaido is the region with the maximum annual temperature difference in Japan¹, our results agree well with the actual classification. Cluster 1 is similar to cluster 5, but the gradients are smaller, so the temperature in cluster 1 does not vary very much. Cluster 4 is clearly different from the other clusters; it has correlation only with dew point and relative humidity.
Let us summarize the experiment. We got 5 clusters and, in particular, extracted regional features in clusters 1 and 5. The information on the observation points in Table 2 shows that the clustering has classified the objects very well, which means that the results of our experiment satisfy the initial assumption.
Let us discuss some further aspects by comparing our technique with others. We analyzed the same data using the K-means method with the statistics tool SPSS. We gave the centers of the initial clusters by random numbers and specified 10 as the maximum number of iterations. We then analyzed the two cases k = 2 and k = 3. Table 5 shows the results for k = 2 and Table 6 the results for k = 3.
¹ In the Hokkaido area, the lowest temperature decreases to about -20 °C in winter, while the maximum air temperature exceeds 30 °C in summer.
Table 2. Final clusters (Θ = 0.8, ∆ = 15)

          Variance  F-value  Contained clusters  Niigata  Wakkanai
Cluster1  4.958     8613.28  19                  519      74
Cluster2  2.12926   2043.62  1                   29       1
Cluster3  2.42196   78.2235  1                   45       0
Cluster4  5.1034    85603.6  10                  135      189
Cluster5  5.50085   17964.9  9                   11       469
Table 3. Center of gravity for each cluster

                     Cluster1     Cluster4     Cluster5
Day                  -0.0103696   0.487242     -0.0987597
Hour                 0.0622433    -0.148476    0.0358148
Pressure             0.599712     -0.899464    -0.0542843
Sea-level pressure   0.580779     -0.902211    -0.023735
Dew point            0.512113     0.294745     -1.05024
Steam pressure       0.468126     0.245389     -0.993099
Relative humidity    -0.179054    0.975926     -0.392923
Air temperature      0.692525     -0.234597    -0.976859
Table 4. Standardised regression coefficients of clusters

                     Cluster1      Cluster4      Cluster5
Day                  -0.0163907    -0.0029574    0.0108929
Hour                 -0.00316974   -0.00182585   -0.00965708
Pressure             1.18154       0.0357092     1.77679
Sea-level pressure   -1.15683      -0.0393822    -1.77494
Dew point            0.909799      1.103         1.22524
Steam pressure       0.421526      -0.016361     0.212142
Relative humidity    -1.25678      -0.36817      -0.804252
Table 5. Clustering by the K-means method (k = 2)

          Niigata  Wakkanai
Cluster1  496      359
Cluster2  248      385
In the case of k = 2, it seems hard for readers (and for us) to extract significant differences between the two final clusters with respect to observation points. Similarly, in the case of k = 3, we cannot extract any sharp features from the results. Thus, our technique can be an alternative when the data cannot be clustered well by the K-means method.
Table 6. Clustering by the K-means method (k = 3)

          Niigata  Wakkanai
Cluster1  293      149
Cluster2  158      353
Cluster3  293      243

6 Conclusion
In this investigation, we have discussed clustering for data in which objects with different local trends exist together. We have proposed a way to extract the trend of a cluster by regression analysis and to measure the similarity of clusters by the F value of the regression, and we have introduced a threshold on the distance between clusters to keep the precision of the clusters. By examining real data, we have shown that we can extract a moderate number of clusters that are easy to interpret, characterized by their centers of gravity and regression coefficients. We have examined experimental results and compared our method with other methods to show the feasibility of our approach. We have previously discussed how to mine temporal class schemes to model a collection of time-series data [7], and we are now developing integrated methodologies for time-series and stream data.
Acknowledgements We would like to acknowledge the financial support by Grant-in-Aid for Scientific Research (C)(2) (No.14580392).
References
[1] Bezdek, J. C.: Numerical taxonomy with fuzzy sets. Journal of Mathematical Biology, Vol. 1, pp. 57-71, 1974.
[2] Chakrabarti, K. and Mehrotra, S.: Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. Proc. VLDB, 2000.
[3] Cheeseman, P., et al.: Bayesian classification. Proc. ACM Artificial Intelligence, 1988, pp. 607-611.
[4] Jain, A. K., Murty, M. N. and Flynn, P. J.: Data Clustering: A Review. ACM Computing Surveys, Vol. 31-3, 1999, pp. 264-323.
[5] Japan Weather Association: Weather Data HIMAWARI. Maruzen, 1998.
[6] MacQueen, J. B.: Some methods for classification and analysis of multivariate observations. Proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, 1967.
[7] Motoyoshi, M., Miura, T., Watanabe, K., Shioya, I.: Mining Temporal Classes from Time Series Data. Proc. ACM Conf. on Information and Knowledge Management (CIKM), 2002.
[8] Wallace, C. S. and Dowe, D. L.: Intrinsic classification by MML - the Snob program. Proc. 7th Australian Joint Conference on Artificial Intelligence, 1994, pp. 37-44.
Handling Large Workloads by Profiling and Clustering

Matteo Golfarelli

DEIS - University of Bologna, 40136 Bologna, Italy
[email protected]

Abstract. View materialization is recognized to be one of the most effective ways to increase the Data Warehouse performance; nevertheless, due to the computational complexity of the techniques aimed at choosing the best set of views to be materialized, this task is mainly carried out manually when large workloads are involved. In this paper we propose a set of statistical indicators that can be used by the designer to characterize the workload of the Data Warehouse, thus driving the logical and physical optimization tasks; furthermore, we propose a clustering algorithm that allows the cardinality of the workload to be reduced and uses these indicators for measuring the quality of the reduced workload. Using the reduced workload as the input to a view materialization algorithm allows large workloads to be efficiently handled.
1 Introduction

During the design of a data warehouse (DW), the phases aimed at improving the system performance are mainly the logical and physical ones. One of the most effective ways to achieve this goal during logical design is view materialization. The so-called view materialization problem consists of choosing the best subset of the possible (candidate) views to be precomputed and stored in the database while respecting a set of system and user constraints (see [8] for a survey). Even if the most important constraint is the disk space available for storing aggregated data, the quality of the result is usually measured in terms of the number of disk pages necessary to answer a given workload.
Despite the efforts made by research in recent years, view materialization remains a task whose success depends on the experience of the designer who, adopting rules of thumb and a trial-and-error approach, may arrive at acceptable solutions. Unlike other issues in the Data Warehouse (DW) field, understanding why the large set of techniques available in the literature has not been engineered and included in commercial tools is fundamental to solving the problem. The main reason is the computational complexity of view materialization, which makes all the proposed approaches unsuitable for workloads larger than about forty queries. Unfortunately, real workloads are much larger and are usually not available during DW design but only when the system is on-line. The designer can estimate the core of the workload at design time, but such a rough approximation will lead to a largely sub-optimal solution.
We believe that the best solution is to carry out a rough optimization at design time and to refine it, manually or automatically, when the system is on-line, on the basis of the real workload. The main difficulty with this approach is the huge size of the workload, which cannot be handled by the algorithms known in the literature. In this context the contribution of the paper is twofold. First, we propose a technique for profiling large workloads, which can be obtained from the log file produced by the DBMS when the DW is on-line; the statistical indicators obtained can be used by the designer to characterize the DW workload, thus driving the logical and physical optimization tasks. The second contribution is a clustering algorithm that reduces the cardinality of the workload and uses the indicators to measure the quality of the reduced workload. Using the reduced workload as the input to a view materialization algorithm allows large workloads to be handled efficiently. Since clustering is an independent preprocessing step, all the algorithms presented in the literature can be adopted during the view selection phase. Figure 1 shows the framework we assume for our approach: OLAP applications generate SQL queries, whose logs are periodically elaborated to determine the statistical indicators and a clustered workload that can be handled by a view materialization algorithm, which produces new patterns to be materialized.
Fig. 1. Overall framework for the view materialization process
To the best of our knowledge only a few works have directly faced the workload size problem. In particular, in [5] the authors propose a polynomial-time algorithm that explores only a subset of the candidate views and delivers a solution whose quality is comparable with that of other techniques that run in exponential time. In [1] the authors propose a heuristic reduction technique that is based on the functional dependencies between attributes and excludes from the search space those views that are "similar" to views already considered. With respect to ours, this approach does not produce any representative workload to be used for further optimizations. Clustering of queries in the field of DWs has recently been used to reduce the complexity of the plan selection task [2]: each cluster has a representative whose execution plan, as determined by the optimizer, is persistently stored. Here the concept of similarity is based on a complex set of features that encode whether different queries can be efficiently solved using the same execution plan. This idea has been implicitly used in several previous works where a global optimization plan was obtained for a given set of queries [7].
The rest of the paper is organized as follows: Section 2 presents the necessary background, Section 3 defines the statistical indicators for workload profiling, Section 4 presents the algorithm for query clustering, and Section 5 reports a set of experiments aimed at proving its effectiveness. Finally, Section 6 draws the conclusions.
2 Background

It is recognized that DWs lean on the multidimensional model to represent data, meaning that the indicators that measure a fact of interest are organized according to a set of dimensions of analysis; for example, sales can be measured by the quantity sold and the price of each sale of a given product that took place in a given store on a given day. Each dimension is usually related to a set of attributes describing it at different aggregation levels; the attributes are organized in a hierarchy defined according to a set of functional dependencies. For example, a product can be characterized by the attributes PName, Type, Category and Brand, among which the following functional dependencies are defined: PName→Type, Type→Category and PName→Brand; on the other hand, stores can be described by their geographical and commercial location: SName→City, City→Country, SName→CommArea, CommArea→CommZone.
In relational solutions, the multidimensional nature of data is implemented on the logical model by adopting the so-called star scheme, composed of a set of fully denormalized dimension tables, one for each dimension of analysis, and a fact table whose primary key is obtained by composing the foreign keys referencing the dimension tables. The most common class of queries used to extract information from a star schema is GPSJ [3], which consists of a selection over a generalized projection over a selection over a join between the fact table and the dimension tables involved. It is easy to see that grouping heavily contributes to the global query cost and that this cost can be reduced by precomputing (materializing) the aggregated information useful to answer a given workload. Unfortunately, in real applications, the size of such views never fits the constraint given by the available disk space, and it is very hard to choose the best subset to be actually materialized. When working on a single fact scheme, and assuming that all the measures contained in the elemental fact table are replicated, a view is completely defined by its aggregation level.
Definition 1 The pattern of a view consists of a set of dimension table attributes such that no functional dependency exists between attributes in the same pattern.
Possible patterns for the sales fact are: P1 = {Month, Country, Category}, P2 = {Year, SName}, P3 = {Brand}. In the following we will use the terms pattern and view interchangeably, and we will refer to the query pattern as the coarsest pattern that can be used to answer the query.
Definition 2 Given two views Vi, Vj with patterns Pi, Pj respectively, we say that Vi can be derived from Vj (Vi ≤ Vj) if the data in Vi can be calculated from the data in Vj.
Derivability determines a partial-order relationship between the views, and thus between the patterns, of a fact scheme. This partial order can be represented by the so-called multidimensional lattice [1], whose nodes are the patterns and whose arcs show a direct derivability relationship between patterns.
Definition 3 We denote with Pi ⊕ Pj the least upper bound (ancestor) of two patterns in the multidimensional lattice.
In other words, the ancestor of two patterns corresponds to the coarsest pattern from which both can be derived. Given a set of queries, the ancestor operator can be used to determine the set of views that are potentially useful to reduce the workload cost (candidate views). The candidate set can be obtained, starting from the workload queries, by iteratively adding to the set the ancestors of each pair of patterns until a fixed point is reached; a sketch of this step is given below. Most approaches to view materialization first determine the candidate views and later choose the best subset that fits the constraints. Both problems have exponential complexity.
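The following Python sketch illustrates one way the ancestor operator and the candidate-set fixed point could be computed; it is our own illustration (not code from the paper), using the example hierarchies of the sales fact given above as hypothetical input data:

```python
from itertools import chain

# Hierarchies of the running example, as direct functional dependencies.
FDS = {"PName": {"Type", "Brand"}, "Type": {"Category"},
       "SName": {"City", "CommArea"}, "City": {"Country"},
       "CommArea": {"CommZone"}}
DIMENSIONS = {"product": {"PName", "Type", "Category", "Brand"},
              "store": {"SName", "City", "Country", "CommArea", "CommZone"}}

def closure(attr):
    """All attributes functionally determined by attr (attr included)."""
    result, frontier = {attr}, [attr]
    while frontier:
        for nxt in FDS.get(frontier.pop(), ()):
            if nxt not in result:
                result.add(nxt)
                frontier.append(nxt)
    return result

def dim_of(attr):
    return next(d for d, attrs in DIMENSIONS.items() if attr in attrs)

def ancestor(p1, p2):
    """P1 (+) P2: per dimension, the coarsest attribute that functionally
    determines all attributes of that dimension appearing in P1 or P2."""
    result = set()
    for dim, attrs in DIMENSIONS.items():
        needed = {a for a in chain(p1, p2) if dim_of(a) == dim}
        if not needed:
            continue
        candidates = [a for a in attrs if needed <= closure(a)]
        # the coarsest candidate is the one derivable from every other candidate
        result.add(next(c for c in candidates
                        if all(c in closure(x) for x in candidates)))
    return frozenset(result)

def candidate_views(patterns):
    """Close the workload patterns under the ancestor operator (fixed point)."""
    views = {frozenset(p) for p in patterns}
    changed = True
    while changed:
        changed = False
        for a in list(views):
            for b in list(views):
                anc = ancestor(a, b)
                if anc not in views:
                    views.add(anc)
                    changed = True
    return views

print(sorted(ancestor({"Category", "City"}, {"Type", "Country"})))      # ['City', 'Type']
print(sorted(ancestor({"Category", "Country"}, {"Brand", "CommZone"}))) # ['PName', 'SName']
```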
3 Profiling the workload

Profiling means determining a set of indicators that captures the workload features that have an impact on the effectiveness of different optimization techniques. In particular, we are interested in those relevant to the problem of view materialization, which help the designer to answer questions like: "How suitable to materialization is the workload?", "How much space do I need to obtain good results?". In the following we propose four indicators that have proved to properly capture all the relevant aspects and that can be used as guidance by the designer who manually tunes the DW, or as input to an optimization algorithm for materialized view selection. All the indicators are based on the cardinality of the view associated with a given pattern, which can be estimated knowing the data volume of the fact scheme; we assume the data volume contains the cardinality of the base fact table and the number of distinct values of each attribute in the dimension tables. The cardinality of an aggregate view can be estimated using Cardenas' formula: the objects are the tuples of the elemental fact table with pattern P0 (whose number |P0| is assumed to be known), while the number of buckets is the maximum number of tuples, |P|Max, that can be stored in a view with pattern P, which can be easily calculated given the cardinalities of the attributes belonging to the pattern. Thus

Card(P) = Φ(|P|Max, |P0|)   (1)
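For concreteness, Cardenas' formula can be evaluated as in the following sketch (a standard formulation of the formula; the paper itself only states that the formula is used, so the function names and the example numbers are ours):

```python
def cardenas(buckets, objects):
    """Expected number of distinct buckets hit when `objects` tuples are
    distributed uniformly over `buckets` groups: b * (1 - (1 - 1/b)^o)."""
    return buckets * (1.0 - (1.0 - 1.0 / buckets) ** objects)

def card(pattern_max, base_cardinality):
    """Estimated cardinality of the view with pattern P (eq. 1):
    pattern_max = |P|Max, base_cardinality = |P0|."""
    return cardenas(pattern_max, base_cardinality)

# e.g. a pattern whose attributes allow at most 2,400 distinct combinations,
# estimated over a base fact table of 100,000 tuples:
print(round(card(2400, 100_000)))
```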
3.1 Aggregation level of the workload

The aggregation level of a pattern P is calculated as

Agg(P) = 1 - Card(P) / |P0|   (2)

Agg(P) ranges in [0,1[; the higher the value, the coarser the pattern. The average aggregation level (AAL) of the full workload W = {Q1,...,Qn} can be calculated as

AAL = (1/n) \sum_{i=1}^{n} Agg(Pi)   (3)

where Pi is the pattern of query Qi. In order to partially capture how the queries are distributed across aggregation levels, we also include the aggregation level standard deviation (ALSD), i.e., the standard deviation of the aggregation levels around AAL:

ALSD = \sqrt{ (1/n) \sum_{i=1}^{n} (Agg(Pi) - AAL)^2 }   (4)
AAL and ALSD characterize to what extent the information required by the users is aggregated, and express the disposition of the workload to be optimized using materialized views. Intuitively, workloads with high values of AAL will be efficiently optimized using materialized views, since aggregated views determine a strong reduction of the number of tuples to be read; furthermore, the limited size of such views allows a higher number of them to be materialized. On the other hand, a low value of ALSD denotes that most of the views share the same aggregation level, further improving the usefulness of view materialization.

3.2 Skewness of the workload

Measuring the aggregation level is not sufficient to characterize the workload; workloads with similar values of AAL and ALSD can behave differently with respect to materialization, depending on the attributes involved in the queries. Consider for example two workloads W1 = {Q1, Q2} and W2 = {Q3, Q4} formulated on the Sales fact, and the patterns of their queries together with the corresponding cardinalities:
− P1 = {Category, City}, Card(P1) = 2100
− P2 = {Type, Country}, Card(P2) = 1450
− P3 = {Category, Country}, Card(P3) = 380
− P4 = {Brand, CommZone}, Card(P4) = 680
Materializing a single view to answer both queries in the workload is much more useful for W1 than for W2, since in the first case the ancestor is very "close" to the queries (P1 ⊕ P2 = {Type, City}) and still coarse, while in the second case it is "far" and fine (P3 ⊕ P4 = {SName, PName}). This difference is captured by the distance between two patterns, which we calculate as

Dist(Pi, Pj) = Agg(Pi) + Agg(Pj) - 2 Agg(Pi ⊕ Pj)   (5)
Dist(Pi, Pj) is calculated in terms of the distance of Pi and Pj from their ancestor, which is the point of the multidimensional lattice closest to both views. Figure 2 shows two different situations on the same multidimensional lattice: even if the aggregation levels of the patterns are similar, the distance between each couple changes significantly. The average skewness (ASK) of the full workload W = {Q1,...,Qn} can be calculated as

ASK = 2/(n(n-1)) \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Dist(Pi, Pj)   (6)

where Pz is the pattern of query Qz. ASK ranges in [0,2[.¹ For the skewness indicator it is also useful to calculate the standard deviation (Skewness Standard Deviation, SKSD), in order to evaluate how the distances between queries are distributed with respect to their mean value:

SKSD = \sqrt{ 2/(n(n-1)) \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (Dist(Pi, Pj) - ASK)^2 }   (7)
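Putting equations (2)-(7) together, the four indicators could be computed as in the following Python sketch (our own illustration; `agg` maps a query pattern to its aggregation level and `ancestor` implements the ⊕ operator, both assumed given, and at least two queries are assumed):

```python
from math import sqrt
from itertools import combinations

def profile(patterns, agg, ancestor):
    """Return (AAL, ALSD, ASK, SKSD) for a workload given as a list of at least
    two query patterns, following eqs. (3), (4), (6) and (7)."""
    n = len(patterns)
    levels = [agg(p) for p in patterns]
    aal = sum(levels) / n
    alsd = sqrt(sum((a - aal) ** 2 for a in levels) / n)

    def dist(pi, pj):                                  # eq. (5)
        return agg(pi) + agg(pj) - 2.0 * agg(ancestor(pi, pj))

    dists = [dist(pi, pj) for pi, pj in combinations(patterns, 2)]
    ask = sum(dists) / len(dists)
    sksd = sqrt(sum((d - ask) ** 2 for d in dists) / len(dists))
    return aal, alsd, ask, sksd
```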
Intuitively, workloads with low values of ASK will be efficiently optimized using materialized views, since the similarity of the query patterns makes it possible to materialize few views to optimize several queries.

Fig. 2. Distance between close and far patterns
4 Clustering of queries

Clustering is one of the most common techniques for classifying features into groups. Several algorithms have been proposed in the literature (see [4] for a survey), each suitable for a specific class of problems. In this paper we adopt the hierarchical approach, which recursively agglomerates the two most similar clusters, forming a dendrogram whose creation can be stopped at different levels to yield different clusterings of the data, each related to a different level of similarity that will be evaluated using the statistical indicators introduced so far. Each initial cluster contains a single query of the workload, which represents it. At each step the algorithm looks
218
Matteo Golfarelli
for the two most similar clusters that are collapsed forming a new one that is represented by the query whose pattern is the ancestor of their representative. Figure 3 shows the output of this process. With a little abuse of terminology we write qx⊕qy meaning that the ancestor operator is applied to the pattern of the queries. c10=q1⊕q2⊕q3⊕q4⊕q5⊕q6 c9=q4⊕q5⊕q6
level 5 level 4
c8=q1⊕q2⊕q3
level 3
c7=q4⊕q5
level 2
c7=q1⊕q2
level 1 c1=q1
c2=q2
c3=q3
c4=q4
c5=q5
c6=q6
level 0
Fig. 3. A possible dendogram for a workload with 6 queries
Similarity between clusters is expressed in terms of the distance, as defined in Section 3.2, between the patterns of their representatives. Each cluster is represented by the ancestor of all the queries belonging to it and is labeled with the sum of the frequencies of its queries. This simple, but effective, solution reflects the criteria adopted by the view materialization algorithms that rely on the ancestor concept when choosing one view to answer several queries. The main drawback here is that the value of AAL tends to decrease when the initial workload is strongly aggregated. Nevertheless the ancestor solution is the only one ensuring that the cluster representative effectively characterizes its queries with respect to materialization (i.e. all the queries in the cluster can be answered on a view on which the representative can also be answered). Adding new queries to a cluster inevitably induces heterogeneity in the aggregation level of its queries thus reducing its capability to represent all of them. Given a clustering Wc ={C1,…Cm}, we measure the compactness of the clusters in terms of similarity of the aggregation levels of the queries in each cluster as: (8) 1 m IntraALSD = ∑ ALSDi m i =1 where ALSDi is the standard deviation of the aggregation level for queries in the cluster Ci. The lower IntraALSD the closer the queries in the clusters. As to the behavior of ASK, it tends to increase when the number of clusters decreases since the closer queries are collapsed earlier than the far ones. While this is an obvious effect of clustering a second relevant measure of the compactness of the clusters in Wc ={C1,…Cm} can be expressed in terms of internal skewness:
Handling Large Workloads by Profiling and Clustering
IntraASK =
219
(9)
1 m ∑ ASK i m i =1
where ASKi is the skewness of the queries in the cluster Ci. The lower IntraASK the closer the queries in the clusters. The ratio between the statistical indicators and the corresponding intra cluster ones can be used to evaluate how well the clustering models the original workload; in particular we adopted this technique to define when the clustering process must be stopped; the stop rule we adopt is as follows: AAL ASK > T AL ∨ > TSK Stop if IntraAAL IntraASK In our tests both TAL and TSK have been set to 5.
5 Tests and discussion In this section we present four different tests aimed at proving the effectiveness of both profiling and clustering. The tests have been carried out on the LINEITEM fact scheme described in the TPC-H/R benchmark [9] using a set of generated workloads. Since selections are rarely take into account by view materialization algorithms our queries do not present any selection clause. As to the materialization algorithm, we adopted the classic one in the literature proposed by Baralis et al. [1]; the algorithm first determines the set of candidate views and then heuristically chooses the best subset that fits given space constraints. Splitting the process into two phases allows us to estimate both the difficulty of the problem, that we measure in terms of the number of candidate views, and the effectiveness of materialization that is calculated in terms of the number of disk pages saved by materialization. The cost function we adopted computes the cost of a query Q on a star schema S composed by a fact table FT and a set {DT1,…, DTn} of dimension tables as (10) (Size( DT ) + Size(PK )) Cost (Q, S ) = Size( FT ) +
∑
i∈Dim(Q )
i
i
where Size( ) function returns the size of a table/index expressed in disk pages, Dim(Q) returns the indexes of the dimension tables involved in Q and PKi is the primary index on DTi. This cost function assumes the execution plan that is adopted by Redbrick 6.0 when no select conditions are present in a query on a star schema.
5.1 Workload features fitting The first test shows that the statistical indicators proposed in Section 3 effectively summarize the features of a workload. Four workloads, each made up of 20 queries, have been generated with different values for the indicators. Table 1 reports the value of the parameters and the resulting number of candidate views that confirms the considerations made in Section 3. The complexity of the problem mainly depends on the value of the ASK and is more slightly influenced by AAL. The simplest workloads
220
Matteo Golfarelli
to be elaborated will be those with highly aggregated queries with similar patterns, while the most complex will be those with very different patterns with a low aggregation level. It should be noted that on increasing the size of the worklfoads, those with a “nice” profile still perform well, while the others quickly become too complex. For example workloads WKL5, WKL6, whose profile follows those of WKL1 and WKL4 respectively, in Table 1 contains 30 queries: while the number of candidate views remains low for WKL5, it explodes for WKL6. Actually, we stopped the algorithm after two days of computation on a PENTIUM IV CPU (1GHz). The profile is also useful to evaluate how well the workload will behave with respect to view materialization. Figure 4.a shows that, regardless of the difficulty of the problems, workloads with high values of AAL are strongly optimized even when a limited disk space is available for storing materialized views. This behavior is induced by the dimension, and thus by the number, of the materialized views that fits the space constraint as it can be verified in Figure 4.b. Table 1. Number of candidate views for workloads with different profiles
12
ALSD 0.307 0.245 0.278 0.153 0.297 0.276
ASK 0.348 0.327 0.810 0.751 0.316 0.668
SKSD N. Candidate views 0.393 97 0.269 124 0.391 596 0.216 868 0.371 99 0.354 > 36158
20
(a)
10
Millions of disk pages
AAL 0.835 0.186 0.790 0.384 0.884 0.352
(b)
N. of materialized views
Name WKL1 WKL2 WKL3 WKL4 WKL5 WKL6
15
8
10
6
5
4 2
0
0 1.1
1.4
WKL1
1.7 2 2.3 Disk space constraint (GB)
2.6
WKL2
2.9
1.1 1.4 1.7 2 2.3 2.6 2.9 Disk space constraint (GB)
WKL3
WKL4
Fig. 4. Cost of the workloads (a) and number of materialized views (b) on varying the disk space constraint for the workloads in Table 1
5.2 Clustering suboptimality The second test is aimed at proving that clustering produces a good approximation of the input workload, meaning that applying view materialization to the original and clustered workload does not induce a too heavy suboptimality. With reference to the workloads in Table 1, Table 2 shows how change the behavior and the effectiveness of the view materialization algorithm changes for an increasing level of clustering. It
Handling Large Workloads by Profiling and Clustering
221
should be noted that the number of candidate views can be strongly reduced inducing, in most cases, a limited suboptimality. By comparing the suboptimality percentages with the statistical indicator trends presented in Figure 5, it is clear that suboptimality arises earlier for workloads where IntraASDL and IntraASK increase earlier. 5.3 Handling large workloads When workloads with hundred of queries are considered it is not possible to measure the suboptimality induced by the clustered solution since the original workloads cannot be directly optimized. On the other hand, it is still possible to compare the increase of the performance with respect to the case with no materialized views and it is also interesting to show how the workload costs change depending on the number of queries included in the clustered workload and how the cost is related to the statistical indicators. Table 2. Effects of clustering on the view materialization algorithm applied to workload in Table 1 WKL WKL1
WKL2
WKL3
WKL4
#. Cluster # Cand.Views #. Mat.Views % SubOpt Stop rule at 15 90 12 0.001 3 10 68 7 0.308 5 25 3 40.511 15 79 2 0.000 6 10 38 2 2.561 5 6 2 4.564 15 549 10 1.186 7 10 156 7 22.146 5 16 4 65.407 15 321 2 0.0 4 10 129 2 0.0 5 17 2 0.0
Table 3 reports the view materialization results for two workloads, WKL 7 (AAL:0.915, ALSD:0.266, ASK: 0.209, SKSD: 0.398) - WKL 8 (AAL: 0.377, ALSD: 0.250, ASK: 0.738, SKSD: 0.345), containing 200 queries. The data in the table and the graphs in Figure 6 confirm the behaviors deduced from previous tests: the effectiveness of view materialization is higher for workloads with high value of AAL and low value of ASK. Also the capability of the clustering algorithm to capture the features of the original workload depends on its profile, in fact workloads with higher values of ASK require more queries (7 for WKL7 vs. 20 for WKL8) in the clustered workload to effectively model the original one. On the other hand it is not useful to excessively increase the clustered workload cardinality since the performance improvement is much lower than the increase of the computation time.
Matteo Golfarelli
1.5
1 0.8 0.6 0.4 0.2 0
WKL1
1 0.5
2
4
6
8
10
12
14
16
18
N. Clusters
N. Clusters 1
WKL3
WKL4
0.8 0.6 0.4 0.2
N. Clusters
AAL
ASK
2
4
6
8
14
16
18
20
2
4
6
8
10
12
14
16
0 18
20
1.2 1 0.8 0.6 0.4 0.2 0
10
20
0
WKL2
12
222
N. Clusters
IntraAAL
IntraASK
Fig. 5. Trends of the statistical indicators for increasing levels of clustering and for different workloads. Table 3. Effects of clustering on the view materialization algorithm applied to workload in Table 1
#. # #. Cluster Cand.Views Mat.Views 30 12506 17 20 4744 15 WKL7 10 384 9 7 64 6 30 17579 5 WKL8 20 2125 5 10 129 2 WKL
%Cost Reduction 90.6 89.0 83.3 38.9 19.1 17.8 2.4
Comp. Time Stop rule (sec.) at 43984 439 6 39 24 78427 19 304 25
6 Conclusions In this paper we have discussed two techniques that make it possible to carry out view materialization when the high cardinality of the workload does not allow the problem to be faced directly. In particular, the set of statistical indicators proposed have proved to capture those workload features that are relevant to the view materialization problem, thus driving the designer choices. The clustering algorithm allows large workloads to be handled by automatic techniques for view materialization since it reduces its cardinality slightly corrupting the original characteristics. We believe that the use of the information carried by the statistical indicators we proposed can be
Handling Large Workloads by Profiling and Clustering
223
profitably used to increase the effectiveness of the optimization algorithms used in both logical and physical design. For example, in [6] the authors propose a technique for splitting a given quantity of disk space into two parts used for creating views and indexes respectively. Since the technique takes account of only information relative to a single query our indicators can improve the solution by providing the bent of the workload to be optimized by indexing or view materializing. 1
2
0.6
1
20 0 18 0 16 0 14 0 12 0 10 0
40 20
80 60
0
20 0 18 0 16 0 14 0 12 0 10 0
0.2
0
N. Clusters
ASK
40 20
0.4
0.5
AAL
WKL8
0.8
80 60
1.5
WKL7
N. Clusters
IntraAAL
IntraASK
Fig. 6. Trends of the statistical indicators for increasing levels of clustering and for different workloads.
References [1] E. Baralis, S. Paraboschi and E. Teniente. Materialized view selection in a multidimensional database. In Proc. 23rd VLDB, Greece, 1997. [2] A. Ghosh, J. Parikh, V.S. Sengar and J. R. Haritsa. Plan Selection Based on Query Clustering, In Proc. 28th VLDB, Hong Kong, China, 2002. [3] A. Gupta, V. Harinarayan and D. Quass. Aggregate-query processing in data-warehousing environments. In Proc. 21st VLDB, Switzerland, 1995. [4] A.K. Jain, M.N. Murty and P.J. Flynn. Data Clustering A Review. ACM Computing Surveys, Vol. 31, N. 3, September 1999. [5] T. P. Nadeau and T. J. Teorey. Achieving scalability in OLAP materialized view selection. In Proc. DOLAP’02, Virginia USA, 2002. [6] S. Rizzi and E. Saltarelli. View materialization vs. Indexing: balancing space constraints in Data Warehouse Design. To appear in Proc. CAISE’03, Austria, 2003. [7] T. K. Sellis. Global query Optimization. In Proc. SIGMOD Conference Washington D.C. 1986, pp. 191-205. [8] D. Theodoratos, M. Bouzeghoub. A General Framework for the View Selection Problem for Data Warehouse Design and Evolution. In Proc. DOLAP’00, Washington D.C. USA, 2000. [9] Transaction Processing Performance Council. TPC Benchmark H (Decision Support) Standard Specification, Revision 1.1.0, 1998, http://www.tpc.org.
Incremental OPTICS: Efficient Computation of Updates in a Hierarchical Cluster Ordering

Hans-Peter Kriegel, Peer Kröger, and Irina Gotlibovich

Institute for Computer Science, University of Munich, Germany
{kriegel,kroegerp,gotlibov}@dbs.informatik.uni-muenchen.de
Abstract. Data warehouses are a challenging field of application for data mining tasks such as clustering. Usually, updates are collected and applied to the data warehouse periodically in a batch mode. As a consequence, all mined patterns discovered in the data warehouse (e.g. clustering structures) have to be updated as well. In this paper, we present a method for incrementally updating the clustering structure computed by the hierarchical clustering algorithm OPTICS. We determine the parts of the cluster ordering that are affected by update operations and develop efficient algorithms that incrementally update an existing cluster ordering. A performance evaluation of incremental OPTICS based on synthetic datasets as well as on a real-world dataset demonstrates that incremental OPTICS gains significant speed-up factors over OPTICS for update operations.
1 Introduction
Many companies gather a vast amount of corporate data. This data is typically distributed over several local databases. Since the knowledge hidden in this data is usually of great strategic importance, more and more companies integrate their corporate data into a common data warehouse. In this paper, we do not assume any special warehousing architecture but simply address an environment which provides derived information for the purpose of analysis and which is dynamic, i.e. in which many updates occur. Usually, manual or semi-automatic analysis such as OLAP cannot make use of the entire information stored in a data warehouse; automatic data mining techniques are more appropriate to fully exploit the knowledge hidden in the data. In this paper, we focus on clustering, the data mining task of grouping the objects of a database into classes such that objects within one class are similar and objects from different classes are not (according to an appropriate similarity measure). In recent years, several clustering algorithms have been proposed [1,2,3,4,5]. A data warehouse is typically not updated immediately when insertions or deletions on a member database occur. Usually updates are collected locally and applied to the common data warehouse periodically in a batch mode, e.g. each night. As a consequence, all clusters explored by clustering methods have to be updated as well. The update of the mined patterns has to be efficient, because it should be finished by the time the warehouse has to be available to its users again, e.g. the next morning. Since a warehouse usually
stores a large amount of data, it is highly desirable to perform updates incrementally [6]. Instead of recomputing the clusters by applying the algorithm to the entire (very large) updated database, only the old clusters and the objects inserted or deleted during a given period are considered. In this paper, we present an incremental version of OPTICS [5] which is an efficient clustering algorithm for metric databases. OPTICS combines a density-based clustering notion with the advantages of hierarchical approaches. Due to the density-based nature of OPTICS, the insertion or deletion of an object usually causes expensive computations only in the neighborhood of this object. A reorganization of the cluster structure thus affects only a limited set of database objects. We demonstrate the advantage of the incremental version of OPTICS based on a thorough performance evaluation using several synthetic and a real-world dataset. The remainder of this paper is organized as follows. We review related work in Section 2. Section 3 briefly introduces the clustering algorithm OPTICS. The incremental algorithms for insertions and deletions are presented in Section 4. In Section 5, the results of our performance evaluation are reported. Conclusions are presented in Section 6.
2 Related Work
Besides the tremendous number of clustering algorithms (e.g. [1,2,3,4,5]), the problem of incrementally updating mined patterns is a rather new area of research. Most work has been done in the area of incremental algorithms for the task of mining association rules, e.g. [7]. In [8], algorithms for incremental attribute-oriented generalization are presented. The only algorithm for incrementally updating clusters detected by a clustering algorithm is IncrementalDBSCAN, proposed in [6]. It is based on the algorithm DBSCAN [4], which models clusters as density-connected sets. Due to the density-based nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in the neighborhood of this object. Based on these observations, IncrementalDBSCAN yields a significant speed-up over DBSCAN [6]. In this paper, we propose IncOPTICS, an incremental version of OPTICS [5], which combines the density-based clustering notion of DBSCAN with the advantages of hierarchical clustering concepts. Since OPTICS is an extension of DBSCAN and yields much more information about the clustering structure of a database, IncOPTICS is much more complex than IncrementalDBSCAN. However, IncOPTICS yields a considerable speed-up over OPTICS without any loss of effectiveness, i.e. quality.
3 Density-Based Hierarchical Clustering
R
In the following, we assume that D is a database of n objects, dist : D × D → is a metric distance function on objects in D and Nε (p) := {q ∈ D | dist(p, q) ≤ ε} denotes the ε-neighborhood of p ∈ D where ε ∈ . OPTICS extends the density-connected clustering notion of DBSCAN [4] by hierarchical concepts. In contrast to DBSCAN, OPTICS does not assign cluster memberships
R
226
Hans-Peter Kriegel et al.
but computes a cluster order in which the objects are processed and additionally generates the information which would be used by an extended DBSCAN algorithm to assign cluster memberships. This information consists of only two values for each object, the core-level and the reachability-distance (or short: reachability).
N
R
Definition 1 (core-level). Let p ∈ D, MinPts ∈ , ε ∈ , and MinPts-dist(p) be the distance from p to its MinPts-nearest neighbor. The core-level of p is defined as follows: ∞ if |Nε (p)| < MinPts CLev(p) := MinPts-dist(p) otherwise.
N
R
Definition 2 (reachability). Let p, q ∈ D, MinPts ∈ , and ε ∈ . The reachability of p wrt. q is defined as RDist(p, q) := max{CLev(q), dist(q, p)}.
N
R
Definition 3 (cluster ordering). Let MinPts ∈ , ε ∈ , and CO be a totally ordered permutation of the n objects of D. Each o ∈ D has additional attributes Pos(o), Core(o) and Reach(o), where Pos(o) symbolizes the position of o in CO. We call CO a cluster ordering wrt. ε and MinPts if the following three conditions hold: (1) ∀p ∈ CO : Core(p) = CLev(p) (2) ∀o, x, y ∈ CO : Pos(x) < Pos(o) ∧ Pos(y) > Pos(o) ⇒ RDist(y, x) ≥ RDist(o, x) (3) ∀p, o ∈ CO : Reach(p) = min{RDist(p, o) | Pos(o) < Pos(p)}, where min ∅ = ∞. Intuitively, Condition (2) states that the order is built on selecting at each position i in CO that object o having the minimum reachability to any object before i. A cluster ordering is a powerful tool to extract flat, density-based decompositions for any ε ≤ ε. It is also useful to analyze the hierarchical clustering structure when plotting the reachability values for each object in the cluster ordering (cf. Fig. 1(a)). Like DBSCAN, OPTICS uses one pass over D and computes the ε-neighborhood for each object of D to determine the core-levels and reachabilities and to compute the cluster ordering. The choice of the starting object does not affect the quality of the result [5]. The runtime of OPTICS is actually higher than that of DBSCAN because the computation of a cluster ordering is more complex than simply assigning cluster memberships and the choice of the parameter ε affects the runtime of the range queries (for OPTICS, ε has typically to be chosen significantly higher than for DBSCAN).
4
Incremental OPTICS
The key observation is that the core-level of some objects may change due to an update. As a consequence, the reachability values of some objects have to be updated as well. Therefore, condition (2) of Def. 3 may be violated, i.e. an object may have to move to another position in the cluster ordering. We will have to reorganize the cluster ordering such that condition (2) of Def. 3 is re-established. The general idea for an incremental version of OPTICS is not to recompute the ε-neighborhood for each object in D but restrict the reorganization on a limited subset of the objects (cf. Fig. 1(b)). Although it cannot be ensured in general, it is very likely that the reorganization is bounded to a limited part of the cluster ordering due to the density-based nature of
Incremental OPTICS
(a)
227
(b)
Fig. 1. (a) Visual analysis of the cluster ordering: clusters are valleys in the according reachability plot. (b) Schema of the reorganization procedure.
OPTICS. IncOPTICS therefore proceeds in two major steps. First, the starting point for the reorganization is determined. Second, the reorganization of the cluster ordering is worked out until a valid cluster ordering is re-established. In the following, we will first discuss how to determine the frontiers of the reorganization, i.e. the starting point and the criteria for termination. We will determine two sets of objects affected by an update operation. One set called mutating objects, contains objects that may change its core-level due to an update operation. The second set of affected objects contains objects that move forward/backwards in the cluster ordering to re-establish condition (2) of Def. 3. Movement of objects may be caused by changing reachabilities — as an effect of changing core-levels — or by moving predecessors/successors in the cluster ordering. Since we can easily compute a set of all objects possibly moving, we call this set moving objects, containing all objects that may move forward/backwards in the cluster ordering due to an update. 4.1
Mutating Objects
Obviously, an object o may change its core-level only if the update operation affects the ε-neighborhood of o. From Def. 1 it follows that if the inserted/deleted object is one of o’s MinPts-nearest neighbors, Core(o) increases in case of a deletion and decreases in case of an insertion. This observation led us to the definition of the set M UTATING(p) of mutating objects: Definition 4 (mutating objects). Let p be an arbitrary object either in or not in the cluster ordering CO. The set of objects in CO possibly mutating their core-level after the insertion/deletion of p is defined as: M UTATING(p) := {q | p ∈ Nε (q)}. Let us note that p ∈ M UTATING(p) since p ∈ Nε (p). In fact, M UTATING(p) can be computed rather easily. Lemma 1. ∀p ∈ D : M UTATING(p) = Nε (p).
228
Hans-Peter Kriegel et al.
Proof. Since dist is a metric, the following conclusions hold: ∀ q ∈ Nε (p) : dist(q, p) ≤ ε ⇔ dist(p, q) ≤ ε ⇔ p ∈ Nε (q) ⇔ q ∈ M UTATING(p). Lemma 2. Let C be a cluster ordering and p ∈ CO. M UTATING(p) is a superset of the objects that change their core-level due to an insertion/deletion of p into/from CO. Proof. (Sketch) Let q ∈ M UTATING(p): Core(q) changes if p is one of q’s MinPts-nearest neighbors. Let q ∈ M UTATING(p): According to Lemma 1, p ∈ Nε (q) and thus p either cannot be any of q’s MinPts-nearest neighbors or Core(q) = ∞ remains due to Def. 1. Due to Lemma 2, we have to test for each object o ∈ M UTATING(p) whether Core(o) increases/decreases or not by computing Nε (o) (one range query). 4.2
Moving Objects
The second type of affected objects move forward or backwards in the cluster ordering after an update operation. In order to determine the objects that may move forward or backwards after an update operation occurs, we first define the predecessor and the set of successors of an object: Definition 5 (predecessor). Let CO be a cluster ordering and o ∈ CO. For each entry p ∈ CO the predecessor is defined as o if Reach(p) = RDist(o, p) Pre(p) = UNDEFINED if Reach(p) = ∞. Intuitively, Pre(p) is the object in CO from which p has been reached. Definition 6 (successors). Let CO be a cluster ordering. For each object p ∈ CO the set of successors is defined as Suc(p) := {q ∈ CO | Pre(q) = p}. Lemma 3. Let CO be a cluster ordering and p ∈ CO. If Core(p) changes due to an update operation, then each object o ∈ Suc(p) may change its reachability values. [Def. 6]
[Def. 5]
Proof. ∀o ∈ CO: o ∈ Suc(p) =⇒ Pre(o) = p =⇒ Reach(o) = RDist(o, p) [Def. 2]
=⇒ Reach(o) = max{Core(p), dist(p, o)}. Since the value Core(p) has changed, Reach(o) may also have changed. As a consequence of a changed reachability value, objects may move in the cluster ordering. If the reachability-distance of an object decreases, this object may move forward such that Condition (2) of Def. 3 is not violated. On the other hand, if the reachabilitydistance of an object increases, this object may move backwards due to the same reason. In addition, if an object has moved, all successors of this objects may also move although their reachabilities remain unchanged. All such objects that may move after an insertion or deletion of p are called moving objects:
Incremental OPTICS
229
Definition 7 (moving objects). Let p be an arbitrary object either in or not in the cluster ordering CO. The set of objects possibly moving forward/backwards in CO after insertion/deletion of p is defined recursively: (1) If x ∈ M UTATING(p) and q ∈ Suc(x) then q ∈ M OVING(p). (2) If y ∈ M OVING(p) and q ∈ Suc(y) then q ∈ M OVING(p). (3) If y ∈ M OVING(p) and q ∈ Pre(y) then q ∈ M OVING(p). Case (1) states, that if an object is a successor of a mutating object, it is a moving object. The other two cases state, that if an object is a successor or predecessor of a moving object it is also a moving object. Case (3) is needed, if a successor of an object o is moved to a position before o during reorganization. For the reorganization of moving objects we do not have to compute range queries. We solely need to compare the old reachability values to decide whether these objects have to move or not. 4.3
Limits of Reorganization
We are now able to determine between which bounds the cluster ordering must be reorganized to re-establish a valid cluster ordering according to Def. 3. Lemma 4. Let CO be a cluster ordering and p be an object either in or not in CO. The set of objects that have to be reorganized due to an insertion or deletion of p is a subset of M UTATING(p) ∪ M OVING(p). Proof. (Sketch) Let o be an object which has to be reorganized. If o has to be reorganized due to a change of Core(o), then o ∈ M UTATING(p). Else o has to be reorganized due to a changed reachability or due to moving predecessor/successors. Then o ∈ M OVING(p). Since OPTICS is based on the formalisms of DBSCAN, the determination of the start position for reorganization is rather easy. We simply have to determine the first object in the cluster ordering whose core-level changes after the insertion or deletion because reorganization is only initiated by changing core-levels. Lemma 5. Let CO be a cluster ordering which is updated by an insertion or deletion of object p. The object o ∈ D is the start object where reorganization starts if the following conditions hold: (1) o ∈ M UTATING(p) = q : Pos(o) ≤ Pos(q). (2) ∀q ∈ M UTATING(p), o Proof. Since reorganization is caused by changing core-levels, the start object must change its core-level due to the update. (1) follows from Def. 4. According to Def. 7, each q ∈ Suc(p) can by affected by the reorganization. To ensure, that no object is lost by the reorganization procedure, o has to be the first object, whose core-level has changed (⇒(2)). In addition, all objects before o are neither element of M UTATING(p) nor of M OVING(p). Therefore, they do not have to be reorganized.
230
Hans-Peter Kriegel et al.
WHILE NOT Seeds.isEmpty() DO // Decide which object is at next added to COnew IF currObj.reach > Seeds.first().reach THEN COnew .add(Seeds.first()); Seeds.removeFirst(); ELSE COnew .add(currObj); currObj = next object in COold which has not yet been inserted into COnew // Decide which objects are inserted into Seeds q = COnew .lastInsertedObject(); IF q∈ M UTATING(p) THEN FOR EACH o∈Nε (p) which has not yet been inserted into COnew DO Seeds.insert(o, max{q.core, dist(q,o)}); ELSE IF q∈ M OVING(p) THEN FOR EACH o∈Pre(p) OR o∈Suc(p) and o has not yet been inserted into COnew DO Seeds.insert(o, o.reach); Fig. 2. IncOPTICS: Reorganization of the cluster ordering
4.4
Reorganizing a Cluster Ordering
In the following, COold denotes the old cluster ordering before the update and COnew denotes the updated cluster ordering which is computed by IncOPTICS. After the start object so has been determined according to Lemma 5, all objects q ∈ COold with Pos(q) < Pos(so) can be copied into COnew (cf. Fig. 1(b)) because up to the position of so COold is a valid cluster ordering. The reorganization of CO begins at so and imitates OPTICS. The pseudo-code of the procedure is depicted in Fig. 2. It is assumed that each not yet handled o ∈ Nε (so) is inserted into the priority queue Seeds which manages all not yet handled objects from M OVING(p) ∪ M UTATING(p) (i.e. all o ∈ M OVING(p) ∪ M UTATING(p) with Pos(o) ≥ Pos(so)) sorted in the order of ascending reachabilities. In each step of the reorganization loop, the reachability of the first object in Seeds is compared with the reachability of the current object in COold . The entry with the smallest reachability is inserted into the next free position of COnew . In case of a delete operation, this step is skipped if the considered object is the update object. After this insertion, Seeds has to be updated depending on which object has recently been inserted. If the inserted object is an element of M UTATING(p), all neighbors that are currently not yet handled may change their reachabilities. If the inserted object is an element of M OVING(p), all predecessors and successors that are currently not yet handled may move. In both cases, the corresponding objects are inserted into Seeds using the method Seeds::insert which inserts an object with its current reachability or updates the reachability of an object if it is already in the priority queue. If a predecessor is inserted into Seeds, its reachability has to be recomputed (which means a distance calculation in the worst-case) because RDist(., .) is not symmetric. According to Lemma 4, the reorganization terminates if there are no more objects in Seeds, i.e. all objects in M OVING(p) ∪ M UTATING(p), that have to be processed, are
Incremental OPTICS
(a) Insertion
231
(b) Deletion
Fig. 3. Runtime of OPTICS vs. average and maximum runtime of IncOPTICS.
handled. COnew is filled with all objects from COold which are not yet handled (and thus need not to be considered by the reorganization) maintaining the order determined by COold (cf. Fig. 1(b)). The resulting COnew is valid according to Def. 3.
5
Experimental Evaluation
We evaluated IncOPTICS using four synthetic datasets consisting of 100,000, 200,000, 300,000, and 500,000 2-dimensional points and a real-world dataset consisting of 112,361 TV snapshots encoded as 64-dimensional color histograms. All experiments were run on a workstation featureing a 2 GHz CPU and 3,5 GB RAM. An X-Tree was used to speed up the range queries computed by OPTICS and IncOPTICS. We performed 100 random updates (insertions and deletions) on each of the synthetic datasets and compared the runtime of OPTICS with the maximum and average runtimes of IncOPTICS (insert/delete) on the random updates. The results are depicted in Fig. 3. We observed average speed-up factors of about 45 and 25 and worst-case speed-up factors of about 20 and 17 in case of insertion and deletion, respectively. A similar observation, but on a lower level, can be made when evaluating the performance of OPTICS and IncOPTICS applied to the real world dataset. The worst ever observed speed-up factor for the real-world dataset was 3. In Fig. 5(a)) the average runtimes of IncOPTICS of the best 10 inserted and deleted objects are compared with the runtime of OPTICS using the TV dataset. A possible reason for the large speed up is that IncOPTICS saves a lot of range queries. This is shown in Fig. 4(a) and 4(b) where we compared the average and maximum number of range queries and moved objects, respectively. The cardinality of the set M UTATING(p) is depicted as “RQ” and the cardinality of the set M OVING(p) is depicted as “MO” in the figures. It can be seen, that IncOPTICS saves a lot of range queries compared to OPTICS. For high dimensional data this observation is even more important since the logarithmic runtime of most index structures for a single range query degenerates to a linear runtime. Fig. 5(b) presenting the average cardinality of
232
Hans-Peter Kriegel et al.
(a) Insertion
(b) Deletion
Fig. 4. Comparison of average and maximum cardinalities of M OVING(p) vs. M UTATING(p)
the sets of mutating objects and moving objects of incremental insertion/deletion, illustrates this effect. Since the number of objects which have to be reorganized is rather high in case of insertion or deletion the runtime speed-up is caused by the strong reduction of range queries (cf. bars “IncInsert RQ” and “IncDelete RQ” in Fig. 5(b)). We separately analyzed the objects o whose insertions/deletions caused the highest runtime. Thereby, we found out that the biggest part of the high runtimes originated from the reorganization step due to a high cardinality of the set M OVING(o). We further observed that these objects causing high update runtimes usually are located between two clusters and objects in M UTATING(o) belong to more than one cluster. Since spatially neighboring clusters need not to be adjacent in the cluster ordering, the reorganization affects a lot more objects. This observation is important because it indicates that the runtimes are more likely near the average case than near the worst case especially for insert operations since most inserted objects will probably reproduce the distribution of the already existing data. Let us note, that since the tests on the TV Dataset were run using unfavourable objects, the performance results are less impressive than the results on the synthetic datasets.
6
Conclusions
In this paper, we proposed an incremental algorithm for mining hierarchical clustering structures based on OPTICS. Due to the density-based notion of OPTICS, insertions and deletions affect only a limited subset of objects directly, i.e. a change of their corelevel may occur. We identified a second set of objects which are indirectly affected by update operations and thus they may move forward or backwards in the cluster ordering. Based on these considerations, efficient algorithms for incremental insertions and deletions of a cluster ordering were suggested. A performance evaluation of IncOPTICS using synthetic as well as real-world databases demonstrated the efficiency of the proposed algorithm.
Incremental OPTICS
(a) Runtimes
233
(b) Affected objects
Fig. 5. Runtimes and affected objects of IncOPTICS vs. OPTICS applied on the TV Data.
Comparing these results to the performance of IncrementalDBSCAN which achieves much higher speed-up factors over DBSCAN, it should be mentioned that incremental hierarchical clustering is much more complex than incremental “flat” clustering. In fact, OPTICS generates considerably more information than DBSCAN and thus IncOPTICS is suitable for a much broader range of applications compared to IncrementalDBSCAN.
References 1. McQueen, J.: ”Some Methods for Classification and Analysis of Multivariate Observations”. In: 5th Berkeley Symp. Math. Statist. Prob. Volume 1. (1967) 281–297 2. Ng, R., J., H.: ”Efficient and Affective Clustering Methods for Spatial Data Mining”. In: Proc. 20st Int. Conf. on Very Large Databases (VLDB’94), Santiago, Chile. (1994) 144–155 3. Zhang, T., Ramakrishnan, R., M., L.: ”BIRCH: An Efficient Data Clustering Method for Very Large Databases”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’96), Montreal, Canada. (1996) 103–114 4. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: ”A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. In: Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD’96), Portland, OR, AAAI Press (1996) 291–316 5. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: ”OPTICS: Ordering Points to Identify the Clustering Structure”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’99), Philadelphia, PA. (1999) 49–60 6. Ester, M., Kriegel, H.P., Sander, J., Wimmer, M., Xu, X.: ”Incremental Clustering for Mining in a Data Warehousing Environment”. In: Proc. 24th Int. Conf. on Very Large Databases (VLDB’98). (1998) 323–333 7. Feldman, R., Aumann, Y., Amir, A., Mannila, H.: ”Efficient Algorithms for Discovering Frequent Sets in Incremental Databases”. In: Proc. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, AZ. (1997) 59–66 8. Ester, M., Wittman, R.: ”Incremental Generalization for Mining in a Data Warehousing Environment”. In: Proc. 6th Int. Conf. on Extending Database Technology, Valencia, Spain. Volume 1377 of Lecture Notes in Computer Science (LNCS)., Springer (1998) 135–152
On Complementarity of Cluster and Outlier Detection Schemes Zhixiang Chen1 , Ada Wai-Chee Fu2 , and Jian Tang2 Department of Computer Science, University of Texas-Pan American, Edinburg TX 78539 USA. 1
[email protected] Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, N.T., Hong Kong.
[email protected],
[email protected] 2
1 Introduction We are interested in the problem of outlier detection, which is the discovery of data that deviate a lot from other data patterns. Hawkins [7] characterizes an outlier in a quite intuitive way as follows: An outlier is an observation that
deviates so much from other observations as to arouse suspicion that it was generated by a dierent mechanism.
Most methods in the early work that detects outliers independently have been developed in the eld of statistics [3]. These methods normally assume that the distribution of a data set is known in advance. A large amount of the work was done under the general topic of clustering [6,12, 15, 8,17]. These algorithms can also generate outliers as by-products. Recently, researchers have proposed distance-based and density-based as well as connectivity-based outlier detection schemes [10,11,13,4,16], which distinguish objects that are likely to be outliers from those that are not based on the number of objects in the neighborhood of an object These schemes do not make assumptions about the data distribution. In this paper, we want to nd out if the indirect clustering approach such as [6] and the direct approach such as [10,4,16] are similar in the eects of outlier detection and also the cases where they may dier. When a direct approach and an indirect one have similar eects, we say that they are complementary. We consider the comparison of DBSCAN clustering method and the DB-Outlier, LOF and COF de nitions of outliers. These methods are chosen based on their more superior powers on handling clusters or patterns of dierent shapes with no apriori distribution assumption. We believe that these methods are better equipped to handle the varieties of outlier natures. Some interesting discoveries are made. First, we show that DBSCAN and DB-Outlier approaches are almost complementary, and we also show an extension of the DB-outlier scheme so that it is complementary with DBSCAN. Second, we show that DBSCAN approach is complementary with density-based and connectivity-based outlier schemes within a density cluster or far away from some clusters. Finally, we show that there are cases where DBSCAN approach is not complementary with density-based and connectivity-based outlier schemes. Y. Kambayashi, M. Mohania, W. W¨oß (Eds.): DaWaK 2003, LNCS 2737, pp. 234-243, 2003. c Springer-Verlag Berlin Heidelberg 2003
On Complementarity of Cluster and Outlier Detection Schemes
235
2 Density Based Clustering and Outlier Detection Let D be a data set of points in a data space. For any p 2 D and any positive value v, the v-neighborhood of p is Nv (p) = fo : dist(p; o) v & o 2 Dg. For a given threshold n > 0; p is called a core point with respect to v and n (or core point if no confusion arises) if jNv (p)j n. Given v > 0 and n > 0, a point p is directly density-reachable from a point q with respect to v and n, if p 2 Nv (q) and q is a core point. A point p is densityreachable from a point q with respect to v and n, if there is a chain of points p1; p2; : : :; ps, p1 = q, ps = p such that pi+1 is directly density-reachable from pi. A point p is density-connected to a point q with respect to v and n, if there is a point o such that both p and q are density-reachable from o with respect to v and n.
De nition 1 (DBSCAN Clusters and Outliers [6]). Let D be a set of objects. A cluster C with respect to v and n in D is a non-empty subset of D satisfying the following conditions: (1) Maximality: 8p; q 2 D, if p 2 C and q is density-reachable from p with respect to v and n, the also q 2 C . (2) Connectivity: 8p; q 2 C , q is density-connected to p with respect to v and n. Finally, every object not contained in any cluster is an outlier (or a noise).
3 Outlier Detection Schemes
Distance Based Outliers:Knorr and Ng [10] proposed a distance-based scheme, called DB(n; v)-outliers. Let D be a data set. For any p 2 D and any positive values v and n, p is a DB(n; v)-outlier if jNv (p)j n, otherwise it is not. The weakness is that it is not powerful enough to cope with certain scenarios with dierent densities in data clusters [4]. Similar weakness is found in the scheme proposed by Ramaswamy, et al. [13], which is actually a special case of DB(n; v)outlier. Density Based Outliers: Breuning, et al. [4] proposed a density-based outlier detection scheme as follows. Let p; o 2 D. Let k be a positive integer, and k-distance(o) be the distance from o to its k-th nearest neighbor. The reachability distance of p with respect to o for k is reach-diskk (p; o) = maxfk-distance(o); dist(p; o)g: The reachability distance smoothes the uctuation of the distances between p and its \close" neighbors. The local reachability density of p for k is:
P
o2N
k (p) =
p) (p)
k-distance(
k (p; o)
reach-dist
!;
1
: jN p (p)j That is, lrdk (p) is the inverse of the average reachability distance from p to the objects in its k-distance neighborhood. The local outlier factor of p is lrd
k-distance(
k (p) =
LOF
)
P
o2N
p) (p)
k-distance(
jN
p) (p)j
k-distance(
o p
lrdk ( ) lrdk ( )
:
236
Zhixiang Chen et al.
The LOF value measures how strong an object can be an outlier. A threshold on LOF value can be set to de ne outliers. Connectivity Based Outliers: The connectivity based outlier detection scheme was proposed in Tang et al. [16]. This scheme is based on the idea of dierentiating \low density" from \isolativity". While low density normally refers to the fact that the number of points in the \close" neighborhood of an object is (relatively) normalsize, isolativity refers to the degree that an object is \connected" to other objects. As a result, isolation can imply low density, but the other direction is not always true.
De nition 2 Let P; Q D, P \ Q = ; and P; Q 6= ;. We de ne dist(P; Q) = minfdist(x; y) : x 2 P & y 2 Qg, and call dist(P; Q) the distance between P and Q. For any given q 2 Q, we say that q is the nearest neighbor of P in Q if there is a p 2 P such that dist(p; q) = dist(P; Q): De nition 3 Let G = fp ; p ; : : :; pr g be a subset of D. A set-based nearest path, or SBN-path, from p on G is a sequence hp ; p ; : : :; pr i such that for all 1 i r ; 1; pi is a nearest neighbor of set fp ; : : :; pig in fpi ; : : :; pr g. 1
2
1
1
+1
2
1
+1
In the above, if the nearest neighbor is not unique, we can impose a prede ned order among the neighbors to break the tie. Thus an SBN-path is uniquely determined.
De nition 4 Let s = hp ; p ; : : :; pr i be an SBN-path. A set-based nearest trail, or SBN-trail, with respect to s is a sequence he ; : : :; er; i such that for all 1 i r ; 1, ei = (oi ; pi ) where oi 2 fp ; : : :; pig, and dist(ei ) = dist(oi ; pi ) = dist(fp ; : : :; pi g; fpi ; : : :; pr g). 1
2
1
+1
1
1
1
+1
+1
Again, if oi is not uniquely determined, we should break the tie by a prede ned order. Thus the SBN-trail is unique for any SBN-path.
De nition 5 Let G = fp ; p ; : : :; pr g be a subset of D. Let s = hp ; p ; : : :; pr i be an SBN-path from p and e = he ; : : :; er; i be the SBN-trail with respect to s. The average chaining distance from p to G ; fp g, denoted by ac-distG (p ), 1
2
1
1
1
1
is de ned as
G(p1 ) =
ac-dist
2
1
1
1
r; X 2(r ; i) 1
i=1
( ; 1) dist(ei ):
r r
De nition 6 Let p 2 D and k be a positive integer. The connectivity-based outlier factor (COF) at p with respect to its k-neighborhood is de ned as COF
k (p) =
jNk (p)j ac-distNk p (p) : P ac-dist (o) ( )
o2Nk (p)
Nk (o)
A threshold on COF can be set to de ne outliers.
On Complementarity of Cluster and Outlier Detection Schemes
237
4 Complementarity of DB-outlier and DBSCAN When the clustering approach and outlier detection approach both give the same result about a data point (as outlier or non-outlier) we say that they are complementary. Since both techniques typically require some parameter settings, it is of interest to see if there exist some parameter settings for each approach so that the methods are complementary. In this section, we rst show that the DBSCAN clustering scheme and the DB-outlier detection scheme are almost complementary. We then propose an extended DB-outlier detection scheme and show that it is complementary with the DBSCAN clustering scheme.
Theorem 1 If there is a parameter setting for DBSCAN clustering scheme to detect clusters and outliers, then there is a parameter setting for the DB-outlier detection scheme such that the following is true: For any object p 2 D, if DBSCAN identi es p as an outlier then DB-outlier detection scheme also identi es it as an outlier. (Note that this implies that if DB-outlier scheme detection scheme identi es p not to be an outlier, then DBSCAN identi es it not to be an outlier (i.e., inside some cluster).)
Proof. Let a parameter setting for DBSCAN be v and n. For any object p 2 D, if DBSCAN identi es p as an outlier, then we have by de nition that jNv (p)j < n. When we choose the same parameter setting v and n for the DBoutlier detection scheme, it identi es p as an outlier too, because jNv (p)j < n. It is easy to see that objects identi ed by DBSCAN to be inside clusters can be identi ed as outliers by the DB-outlier detection scheme with the same parameter setting. In order to avoid such in-complementarity on border objects, we propose the following extension of the DB-outlier detection scheme:
De nition 7 (EDB-Outliers). Given any object p in a data set D, p is an extended distance-based outlier, denoted as EDB -outlier, with respect to v and n, if jNv (p)j < n and 8q 2 Nv (p), jNv (q)j < n. Following the work in [10], one can easily design an EDB-outlier detection scheme to detect EDB-outliers with respect to the parameter setting of v and n. The following result shows that EDB-outliers and DBSCAN-outliers are complementary.
Theorem 2 If there is a parameter setting for DBSCAN to detect clusters and outliers, then there is a parameter setting for EDB-outlier detection scheme such that the following is true: For any object p 2 D, DBSCAN identi es p as an
outlier if and only if EDB-outlier detection scheme also identi es it as an outlier.
Proof. Let a parameter setting for DBSCAN be v and n. For any object p 2 C , if DBSCAN identi es p as an outlier, then we have by de nition jNv (p)j < n
238
Zhixiang Chen et al.
and p is not density reachable from any core object in D with respect to v and n. The latter property means that 8q 2 Nv (p), q is not a core object, i.e., jNv (q)j < n. Hence, when we choose the same parameter setting of v and n for the EDB-outlier detection scheme, it identi es p as an outlier too. On the other hand, if the EDB-outlier detection scheme identi es p as an outlier with respect to v and n, then we have by de nition jNv (p)j < n and 8q 2 Nv (p), jNv (q)j < n. Now consider that the same parameter setting of v and n is chosen for DBSCAN. Suppose that p is inside a cluster C, then by de nition there is a core object q 2 C such that p is density-reachable from q with respect to v and n. I.e., there is sequence of objects q1 = q; q2; : : :; qs; qs+1 = p, qi 2 C, such that qi+1 is directly density-reachable from qi. In particular, p is directly density-reachable from qs . By de nition, jNv (qs)j n and p 2 Nv (qs). Thus, we have jNv (qs)j n and qs 2 Nv (p), a contradiction to the given fact that p is an EDB-outlier with respect to v and n. The above argument implies that p must not be in any cluster with respect to v and n. Therefore, DBSCAN will identify p as an outlier with respect to v and n.
5 Complementarity of COF and LOF Let D be a data set. For any integer k > 0 and any object p 2 D, we use Nk (p) to denote Nreach-distk (p) (p) for convenience. In the following lemma, C can be viewed as a cluster. We rst consider complementarity inside a density cluster. Lemma 3 Given a subset C of a data set D and an integer k > 0, assume that Nk (p) C and there is a positive value d such that 8p 2 C , d ; reach-distk (p) d + for a very normalsize xed positive value 0 < < d. Assume further that there exists a positive value f > 0 such that for any p; q 2 C , f ; dist(p; q) f + for some positive normalsize value with 0 < < f . Then, 8p 2 C , we have d; LOFk (p) dd +; ; and ff ;+ COFk (p) ff ;+ : (1) d+ Proof. 8p 2 C, since Np C, we have 8q 2 NkP (p), q 2 C, hence d ; reach-distk (p;q) : reach-distk (q) d + . By de nition, lrdk1(p) = P q2Nk p jNk (p)j ( )
lrdk (q)
k p ; we have Hence, d ; lrdk (p) d + . Since LOFk (p) = q2NjNkkp(p)lrd j d; LOFk (p) d+ , this implies the left part of (1). d+ d; Given any p 2 C, let s = fe1; : : :; er;1 g be the SBN-trail with respect to the SBN-path from p on Nk (p). It follows from the de nition and the given conditions that f ; dist(ei ) f + for i = 1; 2; : : :; r ; 1. Hence, by de nition we have r; r; X 2(r ; i) dist(e ) X 2(r ; i) (f + ) = f + : ac-distk (p) = i r (r ; 1) r (r ; 1) i i ( )
1
1
=1
=1
( )
On Complementarity of Cluster and Outlier Detection Schemes
239
Similarly, we have k (p) =
ac-dist
r; X 2(r ; i)
r; X 2(r ; i)
i=1
i=1
1
dist(ei) r (r ; 1)
1
( ; 1) (f ; ) = f ;
r r
:
Thus, k (p) =
COF
PjNk (p)j
k (p) k (o)
ac-dist
o2Nk (p)
ac-dist
ff +; ;
k (p) =
COF
PjNk (p)j
k (p) k (o)
ac-dist
o2Nk (p)
ac-dist
ff ;+
Hence, these together implies the right part of (1).
Theorem 4 Let C be any cluster in a data set D satisfying the conditions in
Lemma 1. We can choose a parameter setting d + and k for the DBSCAN clustering scheme so that all points in C will be identi ed as cluster points, i.e., non-outliers. We can choose a parameter setting of k and dd+; for the LOFoutlier detection scheme so that it identi es all points in C as cluster points. Finally, we can choose a parameter setting of k and ff ;+ for the COF-outlier detection scheme so that it identi es all points in C as cluster points as well.
Proof. For any p 2 C, it follows from the conditions of Lemma 3 that
reach-distk (p) d + . By the de nition of reachability distance, Nk (p) will
have at least k objects. Since Nk (p) Nd+ (p), we have jNd+ (p)j k. Hence, p is a core object with respect to d + and k. It follows from the de nition of DBSCAN clusters that p will be identi ed as an object inside a cluster, hence a non-outlier. From (1) of Lemma 1 we have dd;+ LOFk (p)and ff ;+ COFk (p): This means that when the parameter setting of k and dd+; is chosen for the LOFoutlier detection scheme, (Precisely, k is used to de ne the reachability distance, and dd;+ is a threshold for the LOF values to select outliers.) p will be identi ed as a non-outlier. Similarly, when the parameter setting of k and ff ;+ is chosen for the COF-outlier detection scheme, p will be identi ed a non-outlier as well.
Next we show some cases where LOF and COF are both complementary with DBSCAN in detecting points outside some clusters as outliers. For any two sets of objects A and B, let dist(A; B) = minfdist(x; y) : x 2 A & y 2 B g. Again, for any object p, we let Nm (p) denote the m-reachability neighborhood Nreach-distm (p) of p.
Theorem 5 Given two subsets O and C of a data set D, let d = minfdist(x; y) : x 2 O&y 2 C g, and m = jOj. Assume that 8o 2 O, Nm (o) = O, and N m (o) ; O C . Moreover, 8p 2 C , lrd m (p) d and ac-dist m (p) < 1; and for 8p 2 O, d ac-dist m (p) d. Then, there exist parameter settings for DBSCAN, LOF 2 3
2
2
4
2
2
and COF respectively such that each of the three methods will identify O as outliers.
240
Zhixiang Chen et al.
Proof. First, let r be the diameter of O, i.e., r = maxfdist(x; y) : x 2 O&y 2 Og. Since for any o 2 O, Nm (o) = O, we have reach-distm (o) r and Nm (o) = Nr (o) = O. When the parameter setting of r and m + 1 is chosen for the DBSCAN clustering scheme, then o is not a core object. Since Nr (o) = O, o is not directly reachable from any object outside O with the distance r. Hence, DBSCAN will identi es O as outliers. For any o 2 O, given conditions Nm (o) = O and N2m (o) ; O C it follows that, for any p 2 N2m (o), reach-dist2m (o; p) = maxf2m-distance(p); dist(o; p)g jN m (o)j 1 d. Thus, lrd2m (o) = P reach-dist m (p;o) d : Hence, 2
p2N2m (o)
2
P lrd2m (p) d(P lrd2m (p) + p 2 N p 2 O ;f o g p2C \N2m (o) lrd2m (p)) 2m (o) lrd2m (o) LOF2m (o) = = jN2m (o)j P jN2m (o)j P 4 d( p2O ;fog 1d + ) p2C \N2m (o) d = jO ; fogj + 4jC \ N2m (o)j > 1: jN2m (o)j jN2m (o)j P
Therefore, when the parameter setting of 2m and 1 is chosen for the LOFoutlier detection scheme, each object in O will be identi ed as an outlier. Similarly, For any o 2 O, let y = jN2m (o)j, then y 2m. It follows from the given conditions that COF2m (o) 2d y P 3 ac-dist (p) + 2 m p2O;fog p2N m (o)\C ac-dist2m (p)
jN2m (o)j ac-dist2m (o) P =P ac-dist (p) p2N2m (o)
d
m
2
2
(m ; 1)d +y (y ; m + 1) 1; when d 6: Hence, when the parameter setting of 2m and 1 is chosen for the COF-outlier detection scheme, every object in O will be identi ed as outliers. 2 3
6 Non-Complementarity In this section, we shall show two non-complementarity results of DBSCAN cluster and outlier schemes LOF and COF, which reveal that in general DBSCAN scheme is not complementary with the LOF-outlier detection scheme, nor with the COF-outlier detection scheme. We use two approaches to obtain our results, one is by actual computation and the other is detailed analysis. The computing environment of our computation is a Dell Precision 530 Workstation with dual Xeon 1.5 GHz processors. In order to have more precise results, all arithmetic operations were carried out with 10 decimal digit precision. Example 1. Let us consider a data set D1 consisting of data objects as shown in Fig. 1. A is the set composed of objects on the line patterns, and B is the set composed of all objects in the disk pattern. o refers to the object p outside B. A has 402 objects such that the 1-distance of any object in it isp 2. B has 44 objects such that the 1-distance of any object in it is less than p 2 and the distance between A and B (which are respectively, p and q) is 2. Finally, the
On Complementarity of Cluster and Outlier Detection Schemes
241
A p
LOF
B
o q
(a) Data set Fig. 1.
COF 1.50 1.40 1.30 1.20 1.10 1.00 0.90 0.80
1.80 1.60 1.40 1.20 1.00 0.80
(b) LOF Values
(c) COF Values
LOF and COF Values for Example 1 (k = 3)
p
distance p between o and B is 2, and the distance between o and A is greater than 2. Non-Complementarity Result 1. When the parameter setting of k = 3 and threshold = 1:4 is used, the LOF-outlier detection scheme identi es o as an outliers, and all objects in A or B as non-outliers. When the parameter setting of k = 3 and threshold = 1:13 is used, the COF-outlier detection scheme identi es o as an outliers, and all objects in A or B as non-outliers. Finally, for any parameter setting of v and n, the DBSCAN scheme will not be able to identify o as an outlier and objects in A or B as non-outliers. The rst two results are obtained through actual computation. We have implemented the LOF outlier detection algorithm in [4] and the COF outlier detection algorithm in [16]. We used k=3 to compute LOF values and the COF values. The results are shown in Figure 1. We found that the LOF value of o is 1.8146844215 and all other objects have LOF values the same or almost the same as 1, except several objects have LOF values larger than 1 and less than 1.4. Hence, the threshold 1.4 enables the LOF-outlier detection scheme to distinguish the outlier o from all the non-outliers in A or B. We also found that the COF value of o is 1.2057529817, and all the other objects have COF values the same or almost the same as 1, except several objects have COF values larger than 1 but less than 1.13. Hence, the threshold value 1:13 enables the COF-outlier detection scheme to distinguish the outlier o from all the other non-outliers. p Recall thatp the distance between any two adjacent objects in A is 2 and p dist(A; B) = 2. If the radius parameter v is less than 2, then the DBSCAN scheme will identify all objects in A as outliers. In order to identify p objects in A as non-outliers, the radius parameter v must be greater than 2. Note that p the distance between o and B is 2 and the 1-distance of any object in B is p p less than 2. Let w 2 B such that dist(o; w) = 3. For any radius parameter p v > 2, o is directly reachable from w with respect to v. Hence, if DBSCAN scheme identi es w as an non-outliers then it also identi es o as an non-outlier. In summary, for any parameter setting of v and n, the DBSCAN scheme cannot
242
Zhixiang Chen et al.
p
distinguish p o from objects in A if v 2; it cannot distinguish o from objects in B if v > 2. Therefore, no parameter setting of v and n enables the DBSCAN scheme to identi es outliers and clusters for objects in D. Outlier q=(-88,0) Non-outlier p=(67,45) Outlier o=(26,0) Non-outlier w=(22,0)
14.00
l 1 ... l2
12.00
p 10.00
D LOF
C q
w
o
8.00
6.00
4.00
2.00
l4 l3 ...
(a) Data Set of Example 2 Fig. 2.
20
40
60
80
100
120
140
160
180
k
(b) LOF Values of Four Objects
Illustrations of Example 2
Example 2. Let us consider a data set D2 as shown in Fig. 2(a). D2 has 192 objects. Set C composed of eight objects on the border p of a diamond pattern such that any two adjacent objects have a distance 2. Set D composed of 182 objects on four lines l1 ; l2; l3 and l3 . l1 and l2 lie 45 degrees above the horizontal line, and l3 and l4 lie 45 degrees below the horizontal line. l2 . l1 and l3 meet at (20; 0), and l2 and l4 meet at wp= (22; 0). Any two adjacent objects on any of the four lines have a distance of 2. Two additional objects lie below the l2 , the rst is (22; 4) and the second is o = (26; 0). Let q = (;88; 0) and p = (67; 45) as shown in the gure. According to Hawkin's de nition, is obvious that q and o are outliers, but p and w are not. p Non-Complementarity Result 2. When a parameter setting of v = 2 and n = 4 is chosen, the DBSCAN scheme identi es objects in C (including q) and o as outliers and all other objects as non-outliers. However, for any parameter setting of k and threshold, the LOF-outlier detection scheme cannot identify
outliers and non-outliers correctly.
Because pobjects in C lie at the border of a diamond shape and an equal distance p of 2 separates any two adjacent objects in the diamond shape in C, the 2-neighborhood of any object inpC has exactly 3 objects. Thus, every object in C is an outlier p with respect to v = 2 and n = 4. It follows from the condition of D2 , the 2-neighborhood of any object in D is exactly 4 except the six end objects (20; 0); w; q; (66;46); (66; ;46) and (67; ;45). But, those six objects are reachable from some other points on the lines with respect to v, and so does the object (22; 4). Hence, p the DBSCAN identi ed all objects in D and (22; 4) as nonoutliers. Since 2-neighborhood of o has exactly 3 objects and o is not reachable from any core object with respect to v and pct (o is reachable from the non-core object (22; 4)), the DBSCAN p identi es o as an outlier. In summary, with the parameter setting of v = 2 and pct = 4, the DBSCAN identi es outliers and clusters in D2 correctly.
On Complementarity of Cluster and Outlier Detection Schemes
243
In order to obtain the result for LOF-outlier detection scheme, we ran the LOF outlier detection algorithm for k = 1; 2; 3; : ::; 191 (191 = jD2j ; 1) to compute LOF values for the four objects p; q; o and w. The LOF values are shown in Fig. 2(b). We found the following results: for 1 k 7, LOFk (q) LOFk (p); for 8 k 182, LOFk (o) LOFk (p); and for 183 k 191, LOFk (o) LOFk (w). This implies that any given parameter threshold cannot separate both q and o from p and w. Hence, for any parameter setting of k and threshold, the LOF-outlier detection scheme cannot identify outliers and nonoutliers in D2 correctly.
References 1. M. Ankerst, M. Breunig, H.P. Kriegel, and J. Sander: \OPTICS: Ordering points to identify the cluster structure", Proc. of ACM-SIGMOD Conf., pp. 49-60, 1999. 2. A. Arning, R. Agrawal, P. Raghavan: "A Linear Method for Deviation detection in Large Databases", Proc. of 2nd Intl. Conf. On Knowledge Discovery and Data Mining, 1996, pp 164 - 169. 3. V. Barnett, T. Lewis: "Outliers in Statistical Data", John Wiley, 1994. 4. M. Breuning, Hans-Peter Kriegel, R. Ng, J. Sander: "LOF: Identifying densitybased Local Outliers", Proc. of the ACM SIGMOD Conf., 2000. 5. W. DuMouchel, M. Schonlau: "A Fast Computer Intrusion Detection Algorithm based on Hypothesis Testing of Command Transition Probabilities", Proc.of 4th Intl. Conf. On Knowledge Discovery and Data Mining, 1998, pp. 189 - 193. 6. M. Ester, H. Kriegel, J. Sander, X. Xu: "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. of 2nd Intl. Conf. On Knowledge Discovery and Data Mining, 1996, pp 226 - 231. 7. T. Fawcett, F. Provost: "Adaptive Fraud Detection", Data Mining and Knowledge Discovery Journal, Kluwer Academic Publishers, Vol. 1, No. 3, 1997, pp 291 - 316. 8. S. Guha, R. Rastogi, K. Shim: "Cure: An Ecient Clustering Algorithm for Large Databases", In Proc. of the ACM SIGMOD Conf., 1998, pp 73 - 84. 9. D. Hawkins: "Identi cation of Outliers", Chapman and Hall, London, 1980. 10. E. Knorr, R. Ng: "Algorithms for Mining Distance-based Outliers in Large Datasets", Proc. of 24th Intl. Conf. On VLDB, 1998, pp 392 - 403. 11. E. Knorr, R. Ng: "Finding Intensional Knowledge of Distance-based Outliers", Proc. of 25th Intl. Conf. On VLDB, 1999, pp 211 - 222. 12. R. Ng, J. Han: "Ecient and Eective Clustering Methods for Spatial Data Mining", Proc. of 20th Intl. Conf. On Very Large Data Bases, 1994, pp 144 - 155. 13. S. Ramaswamy, R. Rastogi, S. Kyuseok: "Ecient Algorithms for Mining Outliers from Large Data Sets", Proc. of ACM SIGMOD Conf., 2000, pp 427 - 438. 14. N. Roussopoulos, S. Kelley, F. Vincent, "Nearest Neighbor Queries", Proc. of ACM SIGMOD Conf., 1995, pp 71 - 79. 15. G. Sheikholeslami, S. Chatterjee, A. Zhang: "WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases", Proc. of 24th Intl. Conf. On Very Large Data Bases, 1998, pp 428 - 439. 16. J. Tang, Z. Chen, A. Fu, D. Cheung, \A Robust Outlier Detection Scheme in Large Data Sets", PAKDD, 2002. 17. T. Zhang, R. Ramakrishnan, M. Linvy: "BIRCH: An Ecient Data Clustering Method for Very Large Databases", Proc. of ACM SIGMOD Intl. Conf., , 1996, pp 103 - 114.
Cluster Validity Using Support Vector Machines Vladimir Estivill-Castro1 and Jianhua Yang2 1
2
Griffith University, Brisbane QLD 4111, Australia The University of Western Sydney, Campbelltown, NSW 2560, Australia
Abstract. Gaining confidence that a clustering algorithm has produced meaningful results and not an accident of its usually heuristic optimization is central to data analysis. This is the issue of validity and we propose here a method by which Support Vector Machines are used to evaluate the separation in the clustering results. However, we not only obtain a method to compare clustering results from different algorithms or different runs of the same algorithm, but we can also filter noise and outliers. Thus, for a fixed data set we can identify what is the most robust and potentially meaningful clustering result. A set of experiments illustrates the steps of our approach.
1
Introduction
Clustering is challenging because normally there is no a priori information about structure in the data or about potential parameters, like the number of clusters. Thus, assumptions make possible to select a model to fit to the data. For instance, k-Means fits mixture models of normals with covariance matrices set to the identity matrix. k-Means is widely applied because of its speed; but, because of its simplicity, it is statistically biased and statistically inconsistent, and thus it may produce poor (invalid) results. Hence, clustering depends significantly on the data and the way the algorithm represents (models) structure for the data [8]. The purpose of clustering validity is to increase the confidence about groups proposed by a clustering algorithm. The validity of results is up-most importance, since patterns in data will be far from useful if they were invalid [7]. Validity is a certain amount of confidence that the clusters found are actually somehow significant [6]. That is, the hypothetical structure postulated as the result of a clustering algorithm must be tested to gain confidence that it actually exists in the data. A fundamental way is to measure how “natural” are the resulting clusters. Here, formalizing how “natural” a partition is, implies fitting metrics between the clusters and the data structure [8]. Compactness and separation are two main criteria proposed for comparing clustering schemes [17]. Compactness means the members of each cluster should be as close to each other as possible. Separation means the clusters themselves should be widely spaced. Novelty detection and concepts of maximizing margins based on Support Vector Machines (SVMs) and related kernel methods make them favorable for verifying that there is a separation (a margin) between the clusters of an algorithm’s output. In this sense, we propose to use SVMs for validating data models, Y. Kambayashi, M. Mohania, W. W¨ oß (Eds.): DaWaK 2003, LNCS 2737, pp. 244–256, 2003. c Springer-Verlag Berlin Heidelberg 2003
Cluster Validity Using Support Vector Machines
245
and confirm that the structure revealed in clustering results is indeed of some significance. We propose that an analysis of magnitude of margins and (relative) number of Support Vectors increases the confidence that a clustering output does separate clusters and creates meaningful groups. The confirmation of separation in the results can be gradually realized by controlling training parameters. At a minimum, our approach is able to discriminate between two outputs of two clustering algorithms and identify the more significant one. Section 2 presents relevant aspects of Support Vector Machines for our clustering validity approach. Section 3 presents our techniques. Section 4 presents experimental results. We then conclude with a discussion of related work.
2
Support Vector Machines
Our cluster validity method measures margins and analyzes the number of Support Vectors. Thus, a summary of Support Vector Machines (SVMs) is necessary. The foundations of SVMs were developed by Vapnik [16] and are gaining popularity due to promising empirical performance [9]. The approach is systematic, reproducible, and motivated by statistical learning theory. The training formulation embodies optimization of a convex cost function, thus all local minima are global minimum in the learning process [1]. SVMs can provide good generalization performance on data mining tasks without incorporating problem domain knowledge. SVM have been successfully extended from basic classification tasks to handle regression, operator inversion, density estimation, novelty detection, clustering and to include other desirable properties, such as invariance under symmetries and robustness in the presence of noise [15, 1, 16]. In addition to their accuracy, a key characteristic of SVMs is their mathematical tractability and geometric interpretation. Consider the supervised problem of finding a separator for a set of training samples {(xi , yi )}li=1 belonging to two classes, where xi is the input vector for the ith example and yi is the target output. We assume that for the positive subset yi = +1 while for the negative subset yi = −1. If positive and negative examples are “linearly separable”, the convex hulls of positive and negative examples are disjoint. Those closest pair of points in respective convex hulls lie on the hyper-planes wT x + b = ±1. The separation between the hyper-plane and the closest data point is called the margin of separation and is denoted by γ. The goal of SVMs is to choose the hyper-plane whose parameters w and b maximize γ = 1/w; essentially a quadratic minimization problem (minimize w). Under these conditions, the decision surface w T x + b is referred to as the optimal hyper-plane. The particular data points (xi , yi ) that satisfy yi [w t xi +b] = 1 are called Support Vectors, hence the name “Support Vector Machines”. In conceptual terms, the Support Vectors are those data points that lie closest to the decision surface and are the most difficult to classify. As such, they directly influence the location of the decision surface [10].
246
Vladimir Estivill-Castro and Jianhua Yang
If the two classes are nonlinearly separable, the variants called φ-machines map the input space S = {x1 , . . . , xl } into a high-dimensional feature space F = {φ(x)|i = 1, . . . , l}. By choosing an adequate mapping φ, the input samples become linearly or mostly linearly separable in feature space. SVMs are capable of providing good generalization for high dimensional training data, since the complexity of optimal hyper-plane can be carefully controlled independently of the number of dimensions [5]. SVMs can deal with arbitrary boundaries in data space, and are not limited to linear discriminants. For our cluster validity, we make use of the features of ν-Support Vector Machine (ν-SVM). The ν-SVM is a new class of SVMs that has the advantage of using a parameter ν on effectively controlling the number of Support Vectors [14, 18, 4]. Again consider training vectors xi ∈ d , i = 1, · · · , l labeled in two classes by a label vector y ∈ l such that yi ∈ {1, −1}. As a primal problem for ν-Support Vector Classification (ν-SVC), we consider the following minimization: Minimize 12 w2 − νρ + 1l li=1 ξi subject to yi (wT φ(xi ) + b) ≥ ρ − ξi , ξi ≥ 0, i = 1, · · · , l, ρ ≥ 0,
(1)
where 1. Training vectors xi are mapped into a higher dimensional feature space through the function φ, and 2. Non-negative slack variables ξi for soft margin control are penalized in the objective function. The parameter ρ is such that when ξT = (ξ1 , · · · , ξl ) = 0, the margin of separation is γ = ρ/w. The parameter ν ∈ [0, 1] has been shown to be an upper bound of the fraction of margin errors and a lower bound of the fraction of Support Vectors [18, 4]. In practice, the above prime problem is usually solved through its dual by introducing Lagrangian multipliers and incorporating kernels: Minimize 12 αT (Q + yy T )α subject to 0 ≤ αi ≤ 1/l, i = 1, · · · , l (2) eT α ≥ ν where Q is a positive semidefinite matrix, Qij ≡ yi yj k(xi , xj ), and k(xi , xj ) = φ(xi )T · φ(xj ) is a kernel, e is a vector of all ones. The context for solving this dual problem is presented in [18, 4], some conclusions are useful for our cluster validity approach. Proposition 1. Suppose ν-SVC leads to ρ > 0, then regular C-SVC with parameter C set a priori to 1/ρ, leads to the same decision function. Lemma 1. Optimization problem (2) is feasible if and only if ν ≤ νmax , where νmax = 2 min(#yi = 1, #yi = −1)/l, and (#yi = 1), (#yi = −1) denote the number of elements in the first and second classes respectively. Corollary 1. If Q is positive definite, then the training data are separable.
Cluster Validity Using Support Vector Machines
247
Thus, we note that νl is a lower bound on the number of Support Vectors (SVs) and an upper bound on the number of misclassified training data. These misclassified data are treated as outliers and called Bounded Support Vectors (BSVs). The larger we select ν, the more points are allowed to lie inside the margin; if ν is smaller, the total number of Support Vectors decreases accordingly. Proposition 1 describes the relation between standard C-SVC and ν-SVC and gives an interesting interpretation of the regularization parameter C: the increase of C in C-SVC is like the decrease of ν in ν-SVC. Lemma 1 shows that the size of νmax depends on how balanced the training set is. If the numbers of positive and negative examples match, then νmax = 1. Corollary 1 helps us verify whether a training problem is separable under a given kernel. We do not assume the original cluster results are separable, but it is favorable to use balls to describe the data in feature space by choosing RBF kernels. If the RBF kernel is used, Q is positive definite [4]. Also, RBF kernels yield appropriately tight contour representations of a cluster [15]. Again, we can try to put most of the data into a small ball, and the bound on the probability of points falling outside the ball can be controlled by the parameter ν. For a kernel k(x, x′) that only depends on x − x′, k(x, x) is constant, so the linear term in the dual target function is constant; this simplifies computation. So in our cluster validity approach, we will use the Gaussian kernels kq(x, x′) = e^{q‖x − x′‖²} with width parameter q = −1/(2σ²) (note q < 0). In this situation, the number of Support Vectors depends on both ν and q. When the magnitude of q increases, boundaries become rough (the derivative oscillates more), since a large fraction of the data turns into SVs, especially those potential outliers that are broken off from core data points in the form of SVs. But no outliers will be allowed if ν = 0. By increasing ν, more SVs will be turned into outliers or BSVs. The parameters ν and q will be used alternatively in the following sections.
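To make the interplay of ν and q concrete, the following short Python sketch (our illustration, not part of the original paper) trains a ν-SVC with a Gaussian kernel on the output of a clustering algorithm and reports the fraction of support vectors. It assumes scikit-learn's NuSVC; note that scikit-learn writes the RBF kernel as exp(−gamma · ‖x − x′‖²), so gamma plays the role of −q here, and the synthetic blobs merely stand in for clusters produced by some clustering algorithm.

from sklearn.svm import NuSVC
from sklearn.datasets import make_blobs

# Two clusters as they might come out of a clustering algorithm.
X, labels = make_blobs(n_samples=400, centers=2, cluster_std=1.5, random_state=0)

for nu in (0.01, 0.05, 0.2):
    for q in (-0.001, -0.01, -0.1):                 # width parameter, q < 0
        clf = NuSVC(nu=nu, kernel="rbf", gamma=-q)  # sklearn: k = exp(-gamma * ||x - x'||^2)
        clf.fit(X, labels)
        frac_sv = clf.support_.size / len(X)
        print(f"nu={nu:<5} q={q:<7} fraction of SVs = {frac_sv:.3f}")

# nu lower-bounds the fraction of SVs; a larger |q| probes the data at a finer
# scale, turning more points (in particular potential outliers) into SVs.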
3 Cluster Validity Using SVMs
We apply SVMs to the output of clustering algorithms and show that they learn the structure inherent in clustering results. By checking the complexity of boundaries, we are able to verify whether there are significant “valleys” between data clusters and how outliers are distributed. All of this is readily computable from the data in a supervised manner through SVMs training. Our approach is based on three properties of clustering results. First, good clustering results should separate clusters well; thus in good clustering results we should find separation (relatively large margins between clusters). Second, there should be a high density concentration in the core of each cluster (what has been named compactness). Third, removing a few points in the core should not affect its shape. However, points on cluster boundaries lie in sparse regions, and perturbing them does change the shape of the boundaries.
To verify separation pairwise, we learn the margin γ from SVMs training; then we choose the top-ranked SVs (we propose 5) from a pair of clusters and their k (also 5) nearest neighbors. We measure the average distance of these SVs from their projected neighbors from each cluster (projected along the normal of the optimal hyperplane). We let this average be γ1 for the first cluster in a pair and denote it as γ2 for the second cluster. We compare γ with each γi. Given scalars t1 and t2, the relation between the local measures and the margin is evaluated by analyzing whether any of the following conditions holds:

Condition 1: γ1 < t1 · γ or γ2 < t1 · γ;    Condition 2: γ1 > t2 · γ or γ2 > t2 · γ.    (3)

If either of them holds for carefully selected control parameters t1 and t2, the clusters are separable; otherwise they are not separable (we recommend t1 = 0.5 and t2 = 2). This separation test can discriminate between two results of a clustering algorithm. That is, when facing two results, maybe because the algorithm is randomized or because two clustering methods are applied, we increase the confidence (and thus the preference to believe one is more valid than the other) by selecting the clustering result that shows fewer pairs of non-separable classes. To verify the compactness of each cluster, we control the number of SVs and BSVs. As mentioned before, the parameter q of the Gaussian kernel determines the scale at which the data is probed, and as its magnitude increases, more SVs result; in particular, potential outliers tend to appear isolated as BSVs. However, to allow for BSVs, the parameter ν should be greater than 0. This parameter enables analyzing points that are hard to assign to a class because they are away from high-density areas. We refer to these as noise or outliers, and they will usually host BSVs. As shown by the theorems cited above, controlling q and ν provides us with a mechanism for verifying the compactness of clusters. We verify robustness by checking the stability of the cluster assignment. After removing a fraction of BSVs, if reclustering results in repeatable assignments, we conclude that the cores of the classes exist and outliers have been detected. We test the confidence of the result of applying an arbitrary clustering algorithm A to a data set as follows. If the clustering result is repeatable (compact and robust to our removal of BSVs and their nearest neighbors) and separable (in the sense of having a margin a fraction larger than the average distance between SVs), this maximizes our confidence that the data does reflect this clustering and is not an artifact of the clustering algorithm. We say the clustering result has a maximum sense of validity. On the other hand, if reclustering results are not quite repeatable but well separable, or repeatable but not quite separable, we still call the current run a valid run. Our approach may still find valid clusters. However, if reclustering shows output that is neither separable nor repeatable, we call the current run an invalid run. In this case, the BSVs removed in the last run may not be outliers, and they should be recovered for a reclustering. We discriminate runs further by repeating the above validity test for several rounds. If consecutive clustering results converge to a stable assignment (i.e., the result from each run is repeatable and separable), we claim that potential outliers have been removed and cores of clusters have emerged.
If repetition of the analysis still produces invalid runs (clustering solutions differ across runs without good separation), the clustering results are not interesting. In order to set the parameters of our method we conducted a series of experiments, which we summarize here¹. We determined parameters for separation and compactness checking first. The data sets used had different shapes to ensure generality. The LibSVM [3] library has been used in our implementation of the cluster validity scheme. The first evaluation of separation accurately measured the margin between two clusters. To keep the error bound low, we use a hard-margin training strategy by setting ν = 0.01 and q = 0.001. This allows for few BSVs. In this evaluation, six data sets, each with 972 points uniformly and randomly generated in two boxes, were used. The margin between the boxes decreases across the data sets. To verify the separation of a pair of clusters, we calculated the values of γ1 and γ2. Our process compared them with the margin γ and inspected the difference. The experiment showed that the larger the discrepancies between γ1 and γ (or γ2 and γ), the more separable the clusters are. In general, if γ1 < 0.5γ or γ2 < 0.5γ, the two clusters are separable; hence the choice of value for t1. Secondly, we analyzed other possible cases of the separation test. This included (a) both γ1 and γ2 much larger than γ; (b) a small difference between γ1 and γ, but a significant difference between γ2 and γ; (c) a significant difference between γ1 and γ, although there is not much difference between γ2 and γ. Again, we set t1 = 0.5 and t2 = 2 for this test. Then, according to the verification rules of separation (in Equation (3)), all of these examples were declared separable, coinciding with our expectation. Third, we tested noisy situations and non-convex clusters. Occasionally, clustering results might not accurately describe the groups in the data or are hard to interpret because noise is present and outliers may mask data models. When these potential outliers are tested and removed, the cores of the clusters appear. We performed a test that showed that, in the presence of noise, our approach works as a filter and the structure or model fit to the data becomes clearer. A ring-shaped cluster with 558 points surrounded by noise and another spherical cluster were in the data set. A ν-SVC trained with ν = 0.1 and q = 0.001 results in 51 BSVs. After filtering these BSVs (outliers are more likely to become BSVs), our method showed a clear data model that has two significantly isolated dense clusters. Moreover, if a ν-SVC is trained again with ν = 0.05 and q = 0.001 on the clearer model, fewer BSVs (17) are generated (see Fig. 1)³. As we discussed, the existence of outliers complicates clustering results. These may be valid, but separation and compactness are also distorted. The repeated performance of a clustering algorithm depends on the previous clustering results. If these results have recognized compact clusters with cores, then they become robust to our removal of BSVs. There are two cases. In the first case, the last two consecutive runs of algorithm A (separated by an application of BSVs removal) are consistent. That is, the clustering results are repeatable.
¹ The reader can obtain an extended version of this submission with large figures at www.cit.gu.edu.au/~s2130677/publications.html
Fig. 1. Illustration of outlier checking. Circled points are SVs.
Fig. 2. For an initial clustering (produced by k-Means) that gives non-compact classes, reclustering results are not repeated when outliers are removed. 2(a) Clustering structure C1: results of the original first run. 2(b) Test for outliers (SVs in circles). 2(c) Clustering structure C2: reclustering results; R = 0.5077, J = 0.3924, FM = 0.5637.

The alternative case is that reclustering with A after BSVs removal is not concordant with the previous result. Our check for repeated performance of clustering results verifies this. We experimented with 1000 points drawn from a mixture data model³; with training parameters for ν-SVC set to ν = 0.05 and q = 0.005, we showed that the reclustering results can become repeatable, leading to valid results (see Figs. 3(a), 3(c) and 3(d))³. However, we also showed cases where an initial invalid clustering does not lead to repeatable results (see Figs. 2(a), 2(b) and 2(c))³. To measure the degree of repeated performance between clustering results of two different runs, we adopt indexes of external criteria used in cluster validity. External criteria are usually used for comparing a clustering structure C with a predetermined partition P for a given data set X. Instead of referring to a predetermined partition P of X, we measure the degree of agreement between two consecutively produced clustering structures C1 and C2. The indexes we use are the Rand statistic R, the Jaccard coefficient J and the Fowlkes-Mallows index FM [12]. The values of these three statistics are between 0 and 1. The larger their value, the higher the degree to which C1 matches C2.
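For reference, these three pair-counting indexes can be computed directly from the two label vectors. The sketch below is ours, not the authors'; it follows the standard definitions, with a = pairs placed together in both C1 and C2, b = together only in C1, c = together only in C2, d = apart in both: R = (a + d)/(a + b + c + d), J = a/(a + b + c), FM = a/sqrt((a + b)(a + c)).

from itertools import combinations
from math import sqrt

def external_indices(c1, c2):
    """Rand R, Jaccard J and Fowlkes-Mallows FM between two label vectors."""
    a = b = c = d = 0
    for i, j in combinations(range(len(c1)), 2):
        same1 = c1[i] == c1[j]
        same2 = c2[i] == c2[j]
        if same1 and same2:
            a += 1
        elif same1:
            b += 1
        elif same2:
            c += 1
        else:
            d += 1
    total = a + b + c + d
    R = (a + d) / total
    J = a / (a + b + c) if a + b + c else 1.0
    FM = a / sqrt((a + b) * (a + c)) if (a + b) and (a + c) else 1.0
    return R, J, FM

# Identical clusterings (up to label renaming) give R = J = FM = 1.
print(external_indices([0, 0, 1, 1], [1, 1, 0, 0]))   # -> (1.0, 1.0, 1.0)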
4 Experimental Results
First, we use a 2D data set for a detailed illustration of our cluster validity testing using SVMs (Fig. 3). The 2D data set is from a mixture model and consists of 1000 points. The k-medoids algorithm assigns two clusters. The validity process will be conducted in several rounds. Each round consists of reclustering and our SVMs analysis (compactness checking, separation verification, and outlier splitting and filtering). The process stops when a clear clustering structure appears (this is identified because it is separable and repeatable), or after several rounds (we recommend six). Several runs that do not suggest a valid result indicate the clustering method is not finding reasonable clusters in the data. For the separation test in this example, we train ν-SVC with parameters ν = 0.01 and q = 0.0005. To filter potential outliers, we conduct ν-SVC with ν = 0.05 but a different q in every round. The first round starts with q = 0.005, and q is doubled in each following round. Fig. 3(b) and Fig. 3(c)³ show the separation test and compactness evaluation, respectively, for the first round. We observed that the cluster results are separable. Fig. 3(b) indicates γ1 > 2γ and γ2 > 2γ. Fig. 3(c) shows the SVs generated, where 39 BSVs will be filtered as potential outliers. We perform reclustering after filtering outliers, and match the current cluster structure to the previous clustering structure. The values of the indexes R = 1 (J = 1 and FM = 1) indicate compactness. Similarly, the second round up to the fourth round also show a repeatable and separable clustering structure. We conclude that the original cluster results can be considered valid. We now show our cluster validity testing using SVMs on a 3D data set (see Fig. 4)³. The data set is from a mixture model and consists of 2000 points. The algorithm k-Means assigns three clusters. The validity process is similar to that in the 2D example. After five rounds of reclustering and SVMs analysis, the validity process stops, and a clear clustering structure appears. For the separation test in this example, we train ν-SVC with parameters ν = 0.01 and q = 0.0005. To filter potential outliers, we conduct ν-SVC with ν = 0.05 but a different q in every round. The first round starts with q = 0.005, and q is doubled in each following round. In the figure, we show the effect of a round with a 3D view of the data followed by the separation test and the compactness verification. To give a 3D view effect, we construct convex hulls of clusters. For the separation and the compactness checking, we use projections along the z axis. Because of the pairwise analysis, we denote by γi,j the margin between clusters i and j, while γi(i,j) is the neighborhood dispersion measure of SVs in cluster i with respect to the pair of clusters i and j. Fig. 4(a) illustrates a 3D view of the original clustering result. Fig. 4(b) and Fig. 4(c)³ show the separation test and compactness evaluation, respectively, for the first round. Fig. 4(b) indicates γ1(1,2)/γ1,2 = 6.8, γ1(1,3)/γ1,3 = 11.2 and γ2(2,3)/γ2,3 = 21.2. Thus, we conclude that the cluster results are separable in the first run. Fig. 4(c) shows the SVs generated, where 63 BSVs will be filtered as potential outliers. We perform reclustering after filtering outliers, and match the current cluster structure to the previous clustering structure. Index values
Fig. 3. A 2D example of cluster validity through the SVMs approach. Circled points are SVs. The original first run results in compact classes. 3(a) Original clustering structure C1. 3(b) Separation check of the first round: γ = 0.019004, γ1 = 0.038670, γ2 = 0.055341. 3(c) Compactness verification (test for outliers) of the first round: SVs in circles, BSVs = 39, R = J = FM = 1. 3(d) Structure C2 from reclustering; R = 1.0, J = 1.0, FM = 1.0. 3(e) and 3(f) Separation check and compactness verification of the second round: γ = 0.062401, γ1 = 0.002313, γ2 = 0.003085; BSVs = 39, R = J = FM = 1. 3(g) and 3(h) Third round: γ = 0.070210, γ1 = 0.002349, γ2 = 0.002081; BSVs = 41, R = J = FM = 1. 3(i) and 3(j) Fourth round: γ = 0.071086, γ1 = 0.005766, γ2 = 0.004546; BSVs = 41, R = J = FM = 1. 3(k) Clearly separable and repeatable clustering structure: γ = 0.071159, γ1 = 0.002585, γ2 = 0.003663.
R = 1 indicate the compactness of the result in the previous run. Similarly, the second round up to the fifth round also show a repeatable and separable clustering structure. Thus, the original cluster results can be considered valid.
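The round-based procedure used in both examples can be outlined in code. The sketch below is our own illustration rather than the authors' implementation: it assumes scikit-learn's NuSVC and KMeans, approximates the bounded support vectors by the support vectors with the largest dual coefficients (the paper identifies BSVs exactly from the bound on the dual variables), doubles the magnitude of q each round as described above, and uses the Rand statistic to check repeatability between consecutive rounds.

import numpy as np
from itertools import combinations
from sklearn.svm import NuSVC
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def rand_index(c1, c2):
    agree = pairs = 0
    for i, j in combinations(range(len(c1)), 2):
        agree += (c1[i] == c1[j]) == (c2[i] == c2[j])
        pairs += 1
    return agree / pairs

def one_round(X, labels, nu=0.05, q=-0.005, bsv_quantile=0.95):
    """Train a nu-SVC on the current clustering, treat the SVs with the largest
    dual coefficients as candidate outliers (a rough stand-in for the BSVs),
    drop them, recluster, and measure repeatability with the Rand statistic."""
    clf = NuSVC(nu=nu, kernel="rbf", gamma=-q).fit(X, labels)
    coefs = np.abs(clf.dual_coef_).ravel()
    cut = np.quantile(coefs, bsv_quantile)
    outliers = clf.support_[coefs >= cut]
    keep = np.setdiff1d(np.arange(len(X)), outliers)
    new_labels = KMeans(n_clusters=len(set(labels)), n_init=10,
                        random_state=0).fit_predict(X[keep])
    return keep, new_labels, rand_index(labels[keep], new_labels)

X, _ = make_blobs(n_samples=1000, centers=2, cluster_std=1.2, random_state=1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for r in range(4):                      # a few rounds, doubling |q| each time
    keep, new_labels, R = one_round(X, labels, q=-0.005 * 2 ** r)
    print(f"round {r + 1}: removed {len(X) - len(keep)} points, R = {R:.3f}")
    X, labels = X[keep], new_labels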
5 Related Work and Discussion
Various methods have been proposed for cluster validity. The most common approaches are formal indexes of cohesion or separation (and their distribution with respect to a null hypothesis). A clear and comprehensive description of these statistical tools is available in [11, 17]. These tools have been designed to carry out hypothesis testing to increase the confidence that the results of clustering algorithms are actual structure in the data (structure understood as discrepancy from the null hypothesis). However, even these mathematically defined indexes face many difficulties. In almost all practical settings, this statistic-based methodology for validity faces the challenging computation of the probability density function of the indexes, which complicates the hypothesis testing approach around the null hypothesis [17]. Bezdek [2] realized that it seemed impossible to formulate a theoretical null hypothesis used to substantiate or repudiate the validity of algorithmically suggested clusters. The information contained in data models can also be captured using concepts from information theory [8]. In specialized cases, like conceptual schema clustering, formal validation has been used for suggesting and verifying certain properties [19]. While formal validity guarantees the consistency of clustering operations in some special cases like information system modeling, it is not a general-purpose method. On the other hand, if the use of more sophisticated mathematics requires more specific assumptions about the model, and if these assumptions are not satisfied by the application, the performance of such a validity test could degrade beyond usefulness. In addition to theoretical indexes, empirical evaluation methods [13] are also used in some cases where sample data sets with similar known patterns are available. The major drawback of empirical evaluation is the lack of benchmarks and a unified methodology. In addition, in practice it is sometimes not so simple to get reliable and accurate ground truth. External validity [17] is common practice amongst researchers, but it is hard to contrast algorithms whose results are produced on different data sets from different applications. The nature of clustering is exploratory rather than confirmatory; the task of data mining is to find novel patterns. Intuitively, if clusters are isolated from each other and each cluster is compact, the clustering results are somehow natural. Cluster validity provides a certain amount of confidence that the cluster structure found is significant. In this paper, we have applied Support Vector Machines and related kernel methods to cluster validity. SVMs training based on clustering results can obtain insight into the structure inherent in data. By analyzing the complexity of boundaries through support information, we can verify separation performance and potential outliers. After several rounds of reclustering and outlier filtering, we will confirm clearer clustering structures
Fig. 4. 3D example of cluster validity through the SVMs approach; SVs shown as circled points. 4(a) 3D view of the original clustering result. 4(b), 4(c) and 4(d) belong to the 1st run; 4(e), 4(f) and 4(g) to the 2nd run; 4(h), 4(i) and 4(j) to the 3rd run; 4(k), 4(l) and 4(m) to the 4th run; and 4(n), 4(o) and 4(p) to the 5th run, arriving at a clearly separable and repeatable clustering structure. Separation tests in 4(b), 4(e), 4(h), 4(k) and 4(n); compactness verification in 4(c), 4(f), 4(i), 4(l) and 4(o); 3D views of the reclustering results in 4(d), 4(g), 4(j) and 4(m), each with R = 1. First round: γ1(1,2)/γ1,2 = 6.8, γ1(1,3)/γ1,3 = 11.2, γ2(2,3)/γ2,3 = 21.2; SVs = 184, BSVs = 63. The later rounds show separation ratios of 0.47/0.25/0.17, 0.12/0.02/0.01, 0.06/0.09/0.31 and 0.02/0.08/0.18 for the cluster pairs (1,2)/(1,3)/(2,3), with SVs/BSVs of 155/57, 125/44, 105/36 and 98/26, and reclustering R = 1 after each round.
when we observe they are repeatable and compact. Counting the number of valid runs and matching results from different rounds in our process contributes to verifying the goodness of a clustering result. This provides us with a novel mechanism for cluster evaluation and for addressing cluster validity problems that require more elaborate analysis, as demanded by a number of clustering applications. The intuitive interpretability of support information and boundary complexity makes practical cluster validity easy to carry out.
References
[1] K. P. Bennett and C. Campbell. Support vector machines: Hype or hallelujah. SIGKDD Explorations, 2(2):1–13, 2000. 245
[2] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, NY, 1981. 253
[3] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001. 249
[4] C. C. Chang and C. J. Lin. Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13(9):2119–2147, 2001. 246, 247
[5] V. Cherkassky and F. Muller. Learning from Data — Concept, Theory and Methods. Wiley, NY, USA, 1998. 246
[6] R. C. Dubes. Cluster analysis and related issues. C. H. Chen, L. F. Pau, and P. S. P. Wang, eds., Handbook of Pattern Recognition and Computer Vision, 3–32, NJ, 1993. World Scientific. Chapter 1.1. 244
[7] V. Estivill-Castro. Why so many clustering algorithms - a position paper. SIGKDD Explorations, 4(1):65–75, June 2002. 244
[8] E. Gokcay and J. Principe. A new clustering evaluation function using Renyi's information potential. R. O. Wells, J. Tian, R. G. Baraniuk, D. M. Tan, and H. R. Wu, eds., Proc. of IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP 2000), 3490–3493, Istanbul, 2000. 244, 253
[9] S. Gunn. Support vector machines for classification and regression. Tech. Report ISIS-1-98, Univ. of Southampton, Dept. of Electronics and Computer Science, 1998. 245
[10] S. S. Haykin. Neural networks: a comprehensive foundation. Prentice-Hall, NJ, 1999. 245
[11] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, NJ, 1998. 253
[12] R. Koschke and T. Eisenbarth. A framework for experimental evaluation of clustering techniques. Proc. Int. Workshop on Program Comprehension, 2000. 250
[13] A. Rauber, J. Paralic, and E. Pampalk. Empirical evaluation of clustering algorithms. M. Malekovic and A. Lorencic, eds., 11th Int. Conf. Information and Intelligent Systems (IIS'2000), Varazdin, Croatia, Sep. 20–22, 2000. Univ. of Zagreb. 253
[14] B. Schölkopf, R. C. Williamson, A. J. Smola, and J. Shawe-Taylor. SV estimation of a distribution's support. T. K. Leen, S. A. Solla, and K. R. Müller, eds., Advances in Neural Information Processing Systems 12. MIT Press, forthcoming. mlg.anu.edu.au/~smola/publications.html. 246
[15] H. Siegelmann, A. Ben-Hur, D. Horn, and V. Vapnik. Support vector clustering. J. Machine Learning Research, 2:125–137, 2001. 245, 247
[16] V. N. Vapnik. The nature of statistical learning theory. Springer Verlag, Heidelberg, 1995. 245
[17] M. Vazirgiannis, M. Halkidi, and Y. Batistakis. On clustering validation techniques. Intelligent Information Systems J., 17(2):107–145, 2001. 244, 253
[18] R. Williamson, B. Schölkopf, A. Smola, and P. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000. 246
[19] R. Winter. Formal validation of schema clustering for large information systems. Proc. First American Conference on Information Systems, 1995. 253
FSSM: Fast Construction of the Optimized Segment Support Map Kok-Leong Ong, Wee-Keong Ng, and Ee-Peng Lim Centre for Advanced Information Systems, Nanyang Technological University, Nanyang Avenue, N4-B3C-13, Singapore 639798, SINGAPORE
[email protected] Abstract. Computing the frequency of a pattern is one of the key operations in data mining algorithms. Recently, the Optimized Segment Support Map (OSSM) was introduced as a simple but powerful way of speeding up any form of frequency counting satisfying the monotonicity condition. However, the construction cost to obtain the ideal OSSM is high, and makes it less attractive in practice. In this paper, we propose the FSSM, a novel algorithm that constructs the OSSM quickly using a FP-Tree. Given a user-defined segment size, the FSSM is able to construct the OSSM at a fraction of the time required by the algorithm previously proposed. More importantly, this fast construction time is achieved without compromising the quality of the OSSM. Our experimental results confirm that the FSSM is a promising solution for constructing the best OSSM within user given constraints.
1 Introduction
Frequent set (or pattern) mining plays a pivotal role in many data mining tasks including associations [1] and its variants [2, 4, 7, 13], sequential patterns [12] and episodes [9], constrained frequent sets [11], emerging patterns [3], and many others. At the core of discovering frequent sets is the task of computing the frequency (or support) of a given pattern. In all cases above, we have the following abstract problem for computing support. Given a collection I of atomic patterns or conditions, compute for collections C ⊆ I the support σ(C) of C, where the monotonicity condition σ(C) ≤ σ({c}) holds for all c ∈ C. Typically, the frequencies of patterns are computed in a collection of transactions, i.e., D = {T1, . . . , Ti}, where a transaction can be a set of items, a sequence of events in a sliding time window, or a collection of spatial objects. One class of algorithms finds the above patterns by generating candidate patterns C1, . . . , Cj, and then checking them against D. This process is known to be tedious and time-consuming. Thus, novel algorithms and data structures were proposed to improve the efficiency of frequency counting. However, most solutions do not address the problem in a holistic manner. As a result, extensive efforts are often needed to incorporate a particular solution into an existing algorithm.
This work was supported by SingAREN under Project M48020004.
Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 257-266, 2003. c Springer-Verlag Berlin Heidelberg 2003
Recently, the Optimized Segment Support Map (OSSM) [8, 10] was introduced as a simple yet powerful way of speeding up any form of frequency counting satisfying the monotonicity condition. It is a light-weight, easy-to-compute structure that partitions D into n segments, i.e., D = S1 ∪ . . . ∪ Sn and Sp ∩ Sq = ∅, with the goal of reducing the number of candidate patterns for which frequency counting is required. The idea of the OSSM is simple: the frequencies of patterns in different parts of the data are different. Therefore, computing the frequencies separately in different parts of the data makes it possible to obtain tighter support bounds for the frequencies of the collections of patterns. This enables one to prune more effectively, thus improving the speed of counting. Although the OSSM is an attractive solution for a large class of algorithms, it suffers from one major problem: the construction cost to obtain the best OSSM of a user-defined segment size for a given large collection is high. This makes the OSSM much less attractive in practice. For practicality, the authors proposed hybrid algorithms that use heuristics to contain the runtime and to construct the "next best" OSSM. Although the solution guarantees an OSSM that improves performance, the quality of estimation is sub-optimal. This translates to a weaker support bound estimated for a given pattern and hence reduces the probability of pruning an infrequent pattern. Our contribution to the above is to show the possibility of constructing the best OSSM within limited time for a given segment size and a large collection. Our proposal, called the FSSM, is an algorithm that constructs the OSSM from the FP-Tree. With the FSSM, we need not compromise the quality of estimation in favor of a shorter construction time. The FSSM may therefore make obsolete the sub-optimal algorithms originally proposed. Our experimental results support these claims.
2 Background
The OSSM is a light-weight structure that holds the support of all singleton itemsets in each segment of the database D. A segment in D is a partition containing a set of transactions such that D = S1 ∪ . . . ∪ Sn and Sp ∩ Sq = ∅. In each segment, the support of each singleton itemset is registered and thus the support of an item 'c' can be obtained by Σ_{i=1..n} σi({c}). While the OSSM contains only segment supports of singleton itemsets, it can be used to give an upper bound on the support (σ̂) of any itemset C using the formula given below, where On is the OSSM constructed with n segments:

σ̂(C, On) = Σ_{i=1..n} min({σi({c}) | c ∈ C})
Let us consider the example in Figure 1. Assume that in this configuration each segment has exactly two transactions. Then we have the OSSM (right table), where the frequency of each item in each segment is registered. By the equation above, the estimated support of the itemset C = {a, b} would be σ̂(C, On) =
TID  Contents  Segment
1    {a}       1
2    {a, b}    1
3    {a}       2
4    {a}       2
5    {b}       3
6    {b}       3

       S1   S2   S3   D = S1 ∪ S2 ∪ S3
{a}    2    2    0    4
{b}    1    0    2    3

Fig. 1. A collection of transactions (left) and its corresponding OSSM (right). The OSSM is constructed with a user-defined segment size of n = 3.
min(2, 1) + min(2, 0) + min(0, 2) = 1. Although this estimate is the support bound of C, it turns out to be the actual support of C for this particular configuration of segments. Suppose we now switch T1 and T5 in the OSSM, i.e., S1 = {T2, T5} and S3 = {T1, T6}; then σ̂(C, On) = 2! This observation suggests that the way transactions are selected into a segment can affect the quality of estimation. Clearly, if each segment contains only one transaction, then the estimate will be optimal and equal the actual support. However, this number of segments is practically infeasible. The ideal alternative is to use a minimum number of segments that maintains the optimality of our estimate. This leads to the following problem formulation.

Definition 1. Given a collection of transactions, the segment minimization problem is to determine the minimum value nm for the number of segments in the OSSM Onm, such that σ̂(C, Onm) = σ(C) for all itemsets C, i.e., the upper bound on the support for any itemset C is exactly its actual support.

With the FSSM, the minimum number of segments can be obtained quickly in two passes of the database. However, knowing the minimum number of segments is at best a problem of academic interest. In practice, this number is still too large to consider the OSSM as light-weight. It is thus desirable to construct the OSSM based on a user-defined segment size nu. And since nu ≤ nm, we expect a drop in the accuracy of the estimate. The goal then is to find the best configuration of segments, such that the quality of every estimate is the best within the bounds of nu. This problem is formally stated as follows.

Definition 2. Given a collection of transactions and a user-defined segment size nu ≤ nm to be formed, the constrained segmentation problem is to determine the best composition of the nu segments that minimizes the loss of accuracy in the estimate.
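The estimate and its sensitivity to the segment composition are easy to reproduce in a few lines of code. The following Python sketch is ours, not part of the paper; it rebuilds the OSSM of Figure 1 and recomputes the upper bound after switching T1 and T5.

from collections import Counter

def build_ossm(transactions, segment_of):
    """segment_of[tid] gives the (0-based) segment number of transaction tid."""
    n = max(segment_of) + 1
    ossm = [Counter() for _ in range(n)]
    for tid, items in enumerate(transactions):
        ossm[segment_of[tid]].update(items)
    return ossm

def upper_bound(ossm, itemset):
    # sum over segments of the minimum singleton support within the itemset
    return sum(min(seg[c] for c in itemset) for seg in ossm)

T = [{"a"}, {"a", "b"}, {"a"}, {"a"}, {"b"}, {"b"}]                # T1 .. T6 of Fig. 1
print(upper_bound(build_ossm(T, [0, 0, 1, 1, 2, 2]), {"a", "b"}))  # -> 1
# switching T1 and T5, i.e. S1 = {T2, T5} and S3 = {T1, T6}:
print(upper_bound(build_ossm(T, [2, 0, 1, 1, 0, 2]), {"a", "b"}))  # -> 2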
3 FSSM: Algorithm for Fast OSSM Construction
In this section, we present our solutions to the above problems. For the ease of discussion, we assume the reader is familiar with the FP-Tree and the OSSM. If not, a proper treatment can be obtained in [5, 10].
3.1 Constructing the Ideal OSSM
Earlier, we mentioned that the FSSM constructs the optimal OSSM from the FP-Tree. Therefore, we begin by showing the relationship between the two.

Lemma 1. Let Si and Sj be two segments of the same configuration from a collection of transactions. If we merge Si and Sj into one segment Sm, then Sm has the same configuration, and σ̂(C, Sm) = σ̂(C, Si) + σ̂(C, Sj).

The term configuration refers to the characteristic of a segment that is described by the descending frequency order of the singleton itemsets. As an example, suppose the database has three unique items and two segments, i.e., S1 = {b(4), a(1), c(0)} and S2 = {b(3), a(2), c(2)}, where the number in parentheses is the frequency of each item in the segment. In this case, both segments are described by the same configuration ⟨σ({b}) ≥ σ({a}) ≥ σ({c})⟩, and therefore can be merged (by Lemma 1) without losing accuracy. In the more general case, the lemma solves the segment minimization problem. Suppose each segment begins with a single transaction, i.e., the singleton frequency registered in each segment is either '1' or '0'. We begin by merging two single-transaction segments of the same configuration. From this merged segment, we continue merging other single-transaction segments as long as the configuration is not altered. When no other single-transaction segments can be merged without losing accuracy, we repeat the process on another configuration. The number of segments found after processing all distinct configurations is the minimum number of segments required to build the optimal OSSM.

Theorem 1. The minimum number of segments required for the upper bound on σ(C) to be exact for all C is the number of segments with distinct configurations. Proof: As shown in [10].

Notice that the process of merging two segments is very similar to the process of FP-Tree construction. First, the criterion used to order items in a transaction is the same as that used to determine the configuration of a segment (specifically a single-transaction segment). Second, the merging criterion of two segments is implicitly carried out by the overlaying of a transaction on an existing unique path¹ in the FP-Tree. An example will illustrate this observation. Let T1 = {f, a, m, p}, T2 = {f, a, m} and T3 = {f, b, m} such that the transactions are already ordered, and σ({b}) ≤ σ({a}). Based on FP-Tree characteristics, T1 and T2 will share the same path in the FP-Tree, while T3 will have a path of its own. The two transactions overlaid on the same path in the FP-Tree actually have the same configuration, ⟨σ({f}) ≥ σ({a}) ≥ σ({m}) ≥ σ({p}) ≥ σ({b}) ≥ . . .⟩, since σ({b}) = 0 in both T1 and T2 and σ({p}) = 0 for T2. For T3, the configuration is ⟨σ({f}) ≥ σ({b}) ≥ σ({m}) ≥ σ({a}) ≥ σ({p}) ≥ . . .⟩, where σ({a}) = σ({p}) = 0. Clearly, this is a different configuration from that of T1 and T2 and hence a different path in the FP-Tree.
¹ A unique path in the FP-Tree is a distinct path that starts from the root node and ends at one of the leaf nodes of the FP-Tree.
Theorem 2. Given an FP-Tree constructed from some collection, the number of unique paths (or leaf nodes) in the FP-Tree is the minimum number of segments achievable without compromising the accuracy of the OSSM.

Proof: Suppose the number of unique paths in the FP-Tree is not the minimum number of segments required to build the optimal OSSM. Then there will be at least one unique path that has the same configuration as another path in the FP-Tree. However, two paths Pi and Pj in the FP-Tree can have the same configuration if and only if there exist transactions in both paths that have the same configuration. If Ti ∈ Pi and Tj ∈ Pj are of the same configuration, they must satisfy the condition Ti ⊆ Tj and ∀c ∈ Tj − Ti, σ({c}) ≤ σ({x|Ti|}), where x|Ti| ∈ Ti is the last item of Ti (or vice versa). However, by the principle of FP-Tree construction, if Ti and Tj satisfy the above condition, then they must be overlaid on the same path. Therefore, each unique path in the FP-Tree must be of a distinct configuration. Hence, we may now apply Theorem 1 to complete the proof of Theorem 2.

Corollary 1. The transactions that are fully contained in each unique path of the FP-Tree constitute a distinct segment in the optimal OSSM.

Proof: By Theorem 2, every unique path in the FP-Tree must have a distinct configuration, and all transactions contained in a unique path are transactions with the same configuration. In addition, since every transaction in the collection must lie completely along one of the paths in the FP-Tree, it follows that there is an implicit and complete partition of the collection by the unique path to which each transaction belongs. By this observation, every unique path and its set of transactions must therefore correspond to a distinct segment in the optimal OSSM. Hence, we have the above corollary of Theorem 2.

From Theorem 1, we shall give an algorithmic sketch of the construction algorithm for the optimal OSSM. Although this has little practical utility, its result is an intermediate step towards the goal of finding the optimal OSSM within the bounds of the user-defined segment size. Hence, its efficient construction is still important. The algorithm to construct the optimal OSSM is given in Figure 2. Notice that the process is very much based on FP-Tree construction. In fact, the entire FP-Tree is constructed along with the optimal OSSM. Therefore, the efficiency of the algorithm is bounded by the time needed to construct the FP-Tree, i.e., within two scans of the database. The results of the above are important for solving the constrained segmentation problem. As we will show in the next subsection, the overlapping of unique paths in the FP-Tree exhibits an important property that will allow us to construct the best OSSM within the bounds of the user-defined segment size. As before, we shall present the formal discussion that leads to the algorithm.
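As an executable companion to the pseudocode of Figure 2, the following Python sketch (ours, not the authors') groups the sorted transactions by FP-Tree-like unique paths. It makes one simplification: a transaction is attached to the first stored path it is prefix-compatible with, mirroring the pseudocode rather than a full FP-Tree implementation, and the item order is derived from the transactions themselves rather than being given in advance.

from collections import Counter

def build_optimal_ossm(transactions):
    freq = Counter(i for t in transactions for i in t)               # pass 1
    order = lambda t: tuple(sorted(t, key=lambda i: (-freq[i], i)))
    paths = []                                  # one entry per unique path
    for t in transactions:                      # pass 2
        st = order(t)
        for p in paths:
            a = p["items"]
            if a[:len(st)] == st or st[:len(a)] == a:    # lies along this path
                p["items"] = a if len(a) >= len(st) else st
                p["support"].update(st)
                break
        else:
            paths.append({"items": st, "support": Counter(st)})
    return paths       # len(paths) = n_m; each "support" is one segment

segs = build_optimal_ossm([{"f", "a", "m", "p"}, {"f", "a", "m"}, {"f", "b", "m"}])
print(len(segs))       # -> 2 distinct configurations, as in the example above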
3.2 Constructing OSSM with User-Defined Segment Size
Essentially, Theorem 1 states the lower bound nm on the number of segments allowable before the OSSM becomes sub-optimal in its estimation.
Algorithm BuildOptimalOSSM(Set of transactions D)
begin
  Find the singleton frequency of each item in D;                         // Pass 1
  foreach transaction T ∈ D do                                            // Pass 2
    Sort T according to descending frequency order;
    if (T can be inserted completely along an existing path Pi in the FP-Tree) then
      Increment the counter in segment Si for each item in T;
    else
      Create the new path Pj in the FP-Tree, and the new segment Sj;
      Initialize the counter in segment Sj for each item in T to 1;
    endif
  endfor
  return optimal OSSM and FP-Tree;
end

Fig. 2. Algorithm to construct the optimal OSSM via FP-Tree construction.
It was also mentioned that the value of nm is too high to construct the OSSM as a light-weight and easy-to-compute structure. The alternative, as proposed, is to introduce a user-defined segment size nu where nu ≤ nm. Clearly, when nu < nm, the accuracy can no longer be maintained. This means merging segments of different configurations so as to reach the user-defined segment size. Of course, the simplest approach is to randomly merge any distinct configurations. However, this will result in an OSSM with poor pattern pruning efficiency. As such, we are interested in constructing the best OSSM within the bounds of the user-defined segment size. Towards this goal, the following measure was proposed:

SubOp(S) = Σ_{ci, cj} [ σ̂({ci, cj}, O1) − σ̂({ci, cj}, Ov) ]

In the equation, S = {S1, . . . , Sv} is a set of segments with v ≥ 2. The first term is the upper bound on σ({ci, cj}) based on O1, which consists of one combined segment formed by merging all v segments in S. The second term is the upper bound based on Ov, which keeps the v segments separated. The difference between the two terms quantifies the amount of sub-optimality in the estimation on the set {ci, cj} incurred by having the v segments merged, and the sum over all pairs of items measures the total loss. Generally, if all v segments are of the same configuration, then SubOp(S) = 0, and if there are at least two segments with different configurations, then SubOp(S) > 0. What this means is that we would like to merge segments having smaller sub-optimality values, i.e., segments that incur a reduced loss when the v segments are merged. This measure is the basis of operation for the algorithms proposed by the authors. Clearly, this approach is expensive. First, computing a single sub-optimality value requires a sum over all pairs of items in the segment. If there are k items, then there are k·(k−1)/2 terms to be summed. Second, the number of distinct segments for which the sub-optimality is to be computed is also very large. As a result, the runtime to construct the best OSSM within the bounds
of the user-defined segment size becomes very high. To contain the runtime, hybrid algorithms were proposed. These algorithms first create segments of larger granularity by randomly merging existing segments before the sub-optimality measure is used to reach the user-defined segment size. The consequence is an OSSM with an estimation accuracy that cannot be predetermined and that is often not the best possible for the given user-defined segment size. With regard to the above, the FP-Tree has some interesting characteristics. Recall from Theorem 2 that segments having the same configuration share the same unique path. Likewise, it is not difficult to observe that two unique paths are similar in configuration if they have a high degree of overlapping (i.e., sharing of prefixes). In other words, as the overlapping increases, the sub-optimality value approaches zero. To illustrate this, suppose T1 = {f, a, m}, T2 = {f, a, c, p} and T3 = {f, a, c, q}. An FP-Tree constructed over these transactions will have three unique paths due to their distinct configurations. Assuming that T2 is to be merged with either T1 or T3, we observe that T2 should be merged with T3. This is because T3 has a longer shared prefix than T1, i.e., there is more overlapping between the two paths. This can be confirmed by calculating the sub-optimality: SubOp(T1, T2) = 2 and SubOp(T2, T3) = 1.

Lemma 2. Given a segment Si and its corresponding unique path Pi in the FP-Tree, the segment(s) that have the lowest sub-optimality value (i.e., the most similar configuration) with respect to Si are the segment(s) whose unique path has the most overlap with Pi in the FP-Tree.

Proof: Let Pj be a unique path with a configuration distinct from that of Pi. Without loss of distinction in the configuration, let the first k items in both configurations share the same item and frequency ordering. Then the sub-optimality computed with or without the k items will be the same, since computing all pairs among the first k items (of the same configuration) contributes a zero result. Furthermore, the sub-optimality of Pi and Pj has to be non-zero. Therefore, a non-zero sub-optimality depends on the remaining L = max(|Pi|, |Pj|) − k items, where each pair (formed from the L items) contributes a non-zero partial sum. As k grows, the number of pairs that can be formed from the L items reduces, and the sub-optimality thus approaches zero. Clearly, max(|Pi|, |Pj|) > k > 0 when Pi and Pj in the FP-Tree partially overlap one another, and k = 0 when they do not overlap at all. Hence, with more overlapping between the two paths, i.e., a larger k, there is less overall loss in accuracy; hence Lemma 2.

Figure 3 shows the FSSM algorithm that constructs the best OSSM based on the user-defined segment size nu. Instead of creating segments of larger granularity by randomly merging existing ones, we begin with the nm segments in the optimal OSSM constructed earlier. From these nm segments, we merge two segments at a time such that the loss of accuracy is minimized. Clearly, this is costly if we compare each segment against every other as proposed in [10]. Rather, we utilize Lemma 2 to cut the search space down to comparing only a few segments. More importantly, the FSSM begins with the optimal OSSM and will
Algorithm BuildBestOSSM(FP-Tree T, Segment Size nu, Optimal OSSM Om)
begin
  while (number of segments in Om > nu) do
    select node N from lookup table H
      where N is the next furthest from the root of T and has > 1 child nodes;
    foreach possible pair of direct child nodes (ci, cj) of N do
      Let Si/Sj be the segment for path Pi/Pj containing ci/cj respectively;
      Compute the sub-optimality as a result of merging Si and Sj;
    endfor
    Merge the pair Sp and Sq whose sub-optimality value is smallest;
    Create unique path Ppq in T by merging Pp and Pq;
  endwhile
  return best OSSM with nu segments;
end

Fig. 3. FSSM: algorithm to build the best OSSM for any given segment size nu < nm.
always merge segments with minimum loss of accuracy. This ensures that the best OSSM is always constructed for any value of nu. Each pass through the while-loop merges two segments at a time, and this continues until the OSSM of nm segments reduces to nu segments. At the start of each pass, we first find the set of unique paths having the longest common prefix (i.e., the biggest k value). This is satisfied by the condition in the select-where statement, which returns N, the last node in the common prefix. This node is important because, together with its direct children, we can derive the set of all unique paths sharing this common prefix. The for-loop then computes the sub-optimality for each pair of segments in this set of unique paths. Instead of searching the FP-Tree (which would be inefficient), our implementation uses a lookup table H to find N. Each entry in H records the distance of a node having more than one child, and a reference to the actual node in the FP-Tree. All entries in H are then ordered by their distance so that the select-where statement can find the next furthest node by iterating through H. Although the number of segment pairs to process is substantially reduced, the efficiency of the for-loop can be further enhanced with a more efficient method of computing the sub-optimality. As shown in the proof of Lemma 2, the first k items in the common prefix do not contribute to a non-zero sub-optimality. By the same rationale, we can also exclude the h items whose singleton frequencies are zero in both segments. Hence, the sub-optimality can be computed by considering only the remaining |I| − k − h or max(|Pi|, |Pj|) − k items. After comparing all segments under N, we merge the two segments represented by the two unique paths with the least loss in accuracy. Finally, we merge the two unique paths whose corresponding segments were combined earlier. This new path will then correspond to the merged segment in the OSSM, where all nodes in the path are arranged according to their descending singleton frequency. The rationale for merging the two paths is to consistently reflect the state of the OSSM required for the subsequent passes.
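For reference, the pairwise sub-optimality used throughout this section can be computed directly from the singleton supports of two segments. The sketch below is ours; restricting the sum to items that actually occur in the merged segment is safe because absent items contribute nothing, and it reproduces the values SubOp(T1, T2) = 2 and SubOp(T2, T3) = 1 quoted earlier for the single-transaction segments T1 = {f, a, m}, T2 = {f, a, c, p}, T3 = {f, a, c, q}.

from collections import Counter
from itertools import combinations

def sub_op(si, sj):
    """SubOp({Si, Sj}): loss in the pairwise upper bounds when merging Si and Sj."""
    merged = si + sj
    loss = 0
    for ci, cj in combinations(sorted(set(merged)), 2):
        merged_bound = min(merged[ci], merged[cj])
        split_bound = min(si[ci], si[cj]) + min(sj[ci], sj[cj])
        loss += merged_bound - split_bound
    return loss

t1, t2, t3 = Counter("fam"), Counter("facp"), Counter("facq")
print(sub_op(t1, t2), sub_op(t2, t3))    # -> 2 1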
Fig. 4. (a) Runtime performance comparison for constructing the OSSM based on a number of given segment sizes: runtime in seconds (log scale) vs. number of segments, for FSSM, Random-RC and Greedy. (b) Corresponding speedup relative to Apriori without the OSSM, achieved using the OSSMs constructed in the first set of experiments; curves for FSSM/Greedy and Random-RC.
4 Experimental Results
The objective of our experiment is to evaluate the cost-effectiveness of our approach against the Greedy and Random-RC algorithms proposed in [10]. We conducted two sets of experiments using a real data set, BMS-POS [6], which has 515,597 transactions. In the first set of experiments, we compare the FSSM against the Greedy and Random-RC in terms of their performance in constructing the OSSM based on different user-defined segment sizes. In the second set of experiments, we compare their speedup contribution to Apriori using the OSSMs constructed by the three algorithms at varying segment sizes. Figure 4(a) shows the results of the first set of experiments. As we expected from our previous discussion, the Greedy algorithm experiences extremely poor runtime when it comes to constructing the best OSSM within the bounds of the given segment size. Compared to the Greedy algorithm, the FSSM produces the same results in significantly less time, showing the feasibility of pursuing the best OSSM in a practical context. Interestingly, our algorithm is even able to outperform the Random-RC on larger user-defined segment sizes. This can be explained by observing that the Random-RC first randomly merges segments into larger-granularity segments before constructing the OSSM based on the sub-optimality measure. As the user-defined segment size becomes larger, the granularity of each segment formed from random merging becomes finer. With more combinations of segments, the cost to find the best segments to merge in turn becomes higher. Although we are able to construct the OSSM at the performance level of the Random-RC algorithm, it does not mean that the OSSM produced is of poor quality. As a matter of fact, the FSSM guarantees the best OSSM by the same principle that the Greedy algorithm used to build the best OSSM for the given user-defined segment size. Having shown this by a theoretical discussion, our experimental results in Figure 4(b) provide the empirical evidence. While the Random-RC takes approximately the same amount of time as the FSSM during construction, it fails to deliver the same level of speedup as the FSSM in
all cases of our experiments. On the other hand, our FSSM is able to construct the OSSM very quickly, and yet deliver the same level of speedup as the OSSM produced by the Greedy algorithm.
5 Conclusions
In this paper, we present an important observation about the construction of an optimal OSSM with respect to the FP-Tree. We show, by means of formal analysis, the relationship between them, and how the characteristics of the FP-Tree can be exploited to construct high-quality OSSMs. We demonstrated, both theoretically and empirically, that our proposal is able to consistently produce the best OSSM within limited time for any given segment size. More importantly, with the best within reach, the various compromises suggested to balance construction time and speedup become unnecessary.
References
1. R. Agrawal and R. Srikant. Fast Algorithm for Mining Association Rules. In Proc. of VLDB, pages 487–499, Santiago, Chile, August 1994.
2. C. H. Cai, Ada W. C. Fu, C. H. Cheng, and W. W. Kwong. Mining Association Rules with Weighted Items. In Proc. of IDEAS Symp., August 1998.
3. G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In Proc. of ACM SIGKDD, San Diego, CA, USA, August 1999.
4. J. Han and Y. Fu. Discovery of Multiple-Level Association Rules from Large Databases. In Proc. of VLDB, Zurich, Switzerland, 1995.
5. J. Han, J. Pei, Y. Yin, and R. Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-pattern Tree Approach. J. of Data Mining and Knowledge Discovery, 7(3/4), 2003.
6. R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86–98, 2000.
7. K. Koperski and J. Han. Discovery of Spatial Association Rules in Geographic Information Databases. In Proc. of the 14th Int. Symp. on Large Spatial Databases, Maine, August 1995.
8. L. Lakshmanan, K-S. Leung, and R. T. Ng. The Segment Support Map: Scalable Mining of Frequent Itemsets. SIGKDD Explorations, 2:21–27, December 2000.
9. H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering Frequent Episodes in Sequences. In Proc. of ACM SIGKDD, Montreal, Canada, August 1995.
10. C. K.-S. Leung, R. T. Ng, and H. Mannila. OSSM: A Segmentation Approach to Optimize Frequency Counting. In Proc. of IEEE Int. Conf. on Data Engineering, pages 583–592, San Jose, CA, USA, February 2002.
11. R. T. Ng, L. V. S. Lakshmanan, and J. Han. Exploratory Mining and Pruning Optimizations of Constrained Association Rules. In Proc. of SIGMOD, Washington, USA, June 1998.
12. R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. of the 5th Int. Conf. on Extending Database Technology, Avignon, France, March 1996.
13. O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. In Proc. of ICDE, San Diego, March 2000.
Using a Connectionist Approach for Enhancing Domain Ontologies: Self-Organizing Word Category Maps Revisited Michael Dittenbach1 , Dieter Merkl1,2 , and Helmut Berger1 1
E-Commerce Competence Center – EC3, Donau-City-Straße 1, A-1220 Wien, Austria 2 Institut für Softwaretechnik, Technische Universität Wien, Favoritenstraße 9–11/188, A-1040 Wien, Austria {michael.dittenbach,dieter.merkl,helmut.berger}@ec3.at
Abstract. In this paper, we present an approach based on neural networks for organizing words of a specific domain according to their semantic relations. The terms, which are extracted from domain-specific text documents, are mapped onto a two-dimensional map to provide an intuitive interface displaying semantically similar words in spatially similar regions. This representation of a domain vocabulary supports the construction and enrichment of domain ontologies by making relevant concepts and their relations evident.
1 Introduction
Ontologies have gained increasing importance in many fields of computer science. Especially for information retrieval systems, ontologies can be a valuable means of representing and modeling domain knowledge to deliver search results of a higher quality. However, a crucial problem is an ontology's increasing complexity with the growing size of the application domain. In this paper, we present an approach based on a neural network to assist domain engineers in creating or enhancing ontologies for information retrieval systems. We show an example from the tourism domain, where free-form text descriptions of accommodations are used as a basis to enrich the ontology of a tourism information retrieval system with highly specialized terms that are hardly found in general-purpose thesauri or dictionaries. We exploit information inherent in the textual descriptions that are accessible but separated from the structured information the search engine operates on. The vector representations of the terms are created by generating statistics about local contexts of the words occurring in natural language descriptions of accommodations. These descriptions have in common that words belonging together with respect to their semantics are found spatially close together regarding their position in the text, even though the descriptions are written by different authors, i.e., the accommodation providers themselves in the case of our application. Therefore, we think that the approach presented in this paper can be applied to a variety of domains, since,
Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 267–277, 2003. c Springer-Verlag Berlin Heidelberg 2003
for instance, product descriptions generally have similarly structured content. Consider, for example, typical computer hardware descriptions, where information about, say, storage devices is normally grouped together rather than being intertwined with input and display devices. More specifically, we use the self-organizing map to cluster terms relevant to the application domain to provide an intuitive representation of their semantic relations. With this kind of representation at hand, finding synonyms, adding new relations between concepts, or detecting new concepts that would be important to add to the ontology is facilitated. More traditional clustering techniques are used in the DARE system [3] as methods supporting combined top-down and bottom-up ontology engineering [11]. The remainder of the paper is structured as follows. In Section 2 we provide a brief review of our natural language tourism information retrieval system along with some results of a field trial in which the interface has been made publicly accessible. Section 3 gives an overview of the SOM and how it can be used to create a word category map. Following a description of our experiments in Section 4, we provide some concluding remarks in Section 5.
2 A Tourism Information Retrieval System
2.1 System Architecture
We have developed a natural language interface for the largest Austrian web-based tourism platform Tiscover (http://www.tiscover.com) [12]. Tiscover is a well-known tourism information system and booking service in Europe that already covers more than 50,000 accommodations in Austria, Germany, Liechtenstein, Switzerland and Italy. In contrast to the original form-based Tiscover interface, our natural language interface allows users to search for accommodations throughout Austria by formulating the query in natural language sentences, either in German or English. The language of the query is automatically detected and the result is presented accordingly. For the task of natural language query analysis we followed the assumption that shallow natural language processing is sufficient in restricted and well-defined domains. In particular, our approach relies on the selection of query concepts, which are modeled in a domain ontology, followed by syntactic and semantic analysis of the parts of the query where the concepts appear. To improve the retrieval performance, we used a phonetic algorithm to find and correct orthographic errors and misspellings. It is furthermore an important issue to automatically identify proper names consisting of more than one word, e.g., "Gries am Brenner", without requiring the user to enclose them in quotes. This also applies to phrases and multi-word denominations like "city center" or "youth hostel", to name but a few. In the next query processing step, the relevant concepts and modifiers are tagged. For this purpose, we have developed an XML-based ontology covering the semantics of domain-specific concepts and modifiers and describing linguistic concepts like synonymy. Additionally,
a lightweight grammar describes how particular concepts may be modified by prepositions and adverbial or adjectival structures that are also specified in the ontology. Finally, the query is transformed into an SQL statement to retrieve information from the database. The tagged concepts and modifiers together with the rule set and parameterized SQL fragments, also defined in the knowledge base, are used to create the complete SQL statement reflecting the natural language query. A generic XML description of the matching accommodations is transformed into a device-dependent output, customized to fit screen size and bandwidth. Our information retrieval system covers a part of the Tiscover database that, as of October 2001, provides access to information about 13,117 Austrian accommodations. These are described by a large number of characteristics including the respective numbers of various room types, different facilities and services provided in the accommodation, or the type of food. The accommodations are located in 1,923 towns and cities that are again described by various features, mainly information about sports activities offered, e.g. mountain biking or skiing, but also the number of inhabitants or the sea level. The federal states of Austria are the higher-level geographical units. For a more detailed report on the system we refer to [2].
2.2 A Field Trial and Its Implications
The field trial was carried out over ten days in March 2002. During this time our natural language interface was promoted on and linked from the main Tiscover page. We obtained 1,425 unique queries through our interface, i.e., identical queries from the same client host were reduced to one entry in the query log to eliminate a possible bias in our evaluation of the query complexity. In more than half of the queries, users formulated complete, grammatically correct sentences, about one fifth were partial sentences, and the remaining set were keyword-type queries. Several of the queries consisted of more than one sentence. This confirms our assumption that users accept the natural language interface and are willing to type more than just a few keywords to search for information. More than this, a substantial portion of the users types complete sentences to express their information needs. To inspect the complexity of the queries, we considered the number of concepts and the usage of modifiers like "and", "or", "not", "near" and some combinations of those as quantitative measures. We found that the level of sentence complexity is not very high. This confirms our assumption that shallow text parsing is sufficient to analyze the queries emerging in a limited domain like tourism. Even more important for the research described in this paper, we found that regions or local attractions are essential information that has to be integrated into such systems. We also noticed that users' queries contained vague or highly subjective criteria like "romantic", "cheap" or "within walking distance to". Even "wellness", a term broadly used in tourism nowadays, is far from being exactly defined. A more detailed evaluation of the results of the field
trial can be found in [1]. It furthermore turned out that a deficiency of our ontology was the lack of diversity of the terminology. To provide better-quality search results, it is necessary to enrich the ontology with additional synonyms. Besides the structured information about the accommodations, the web pages describing the accommodations offer a lot more information in the form of natural language descriptions. Hence, the words occurring in these texts constitute a very specialized vocabulary for this domain. The next obvious step is to exploit this information to enhance the domain ontology for the information retrieval system. Due to the size of this vocabulary, some intelligent form of representation is necessary to express semantic relations between the words.
3 Word Categorization
3.1 Encoding the Semantic Contexts
Ritter and Kohonen [13] have shown that it is possible to cluster terms according to their syntactic category by encoding word contexts of terms in an artificial data set of three-word sentences that consist of nouns, verbs and adverbs, such as, e.g. “Jim speaks often” and “Mary buys meat”. The resulting maps clearly showed three main clusters corresponding to the three word classes. It should furthermore be noted that within each cluster, the words of a class were arranged according to their semantic relation. For example, the adverbs poorly and well were located closer together on the map than poorly and much, the latter was located spatially close to little. An example from a different cluster would be the verbs likes and hates. Other experiments using a collection of fairy tales by the Grimm Brothers have shown that this method works well with real-world text documents [5]. The terms on the SOM were divided into three clusters, namely nouns, verbs and all other word classes. Again, inside these clusters, semantic similarities between words were mirrored. The results of these experiments have been elaborated later to reduce the vector dimensionality for document clustering in the WEBSOM project [6]. Here, a word category map has been trained with the terms occurring in the document collection to subsume words with similar context to one semantic category. These categories, obviously fewer than the number of all words of the document collection, have then been used to create document vectors for clustering. Since new methods of dimensionality reduction have been developed, the word category map has been dropped for this particular purpose [9]. Nevertheless, since our objective is to disclose semantic relations between words, we decided to use word category maps. For training a self-organizing map in order to organize terms according to their semantic similarity, these terms have to be encoded as n-dimensional numerical vectors. As shown in [4], the random vectors are quasi-orthogonal in case of n being large enough. Thus, unwanted geometrical dependence of the word representation can be avoided. This is a necessary condition, because otherwise the clustering result could be dominated by random effects overriding the semantic similarity of words.
We assume that, in textual descriptions dominated by enumerations, semantic similarity is captured by contextual closeness within the description. For example, when describing the attractions offered for children, things like a playground, a sandbox or the availability of a baby-sitter will be enumerated together. Analogously, the same is true for recreation equipment like a sauna, a steam bath or an infrared cabin. To capture this contextual closeness, we use word windows where a particular word i is described by the set of words that appear a fixed number of words before and after word i in the textual description. Given that every word is represented by a unique n-dimensional random vector, the context vector of a word i is built as the concatenation of the averages of all words preceding as well as succeeding word i. Technically speaking, an (n × N)-dimensional vector xi representing word i is a concatenation of vectors xi^(dj) denoting the mean vectors of terms occurring at the set of displacements {d1, ..., dN} of the term, as given in Equation 1. Consequently, the dimensionality of xi is n × N. This kind of representation has the effect that words appearing in similar contexts are represented by similar vectors in a high-dimensional space.

xi = ( xi^(d1), ..., xi^(dN) )^T    (1)

With this method, a statistical model of word contexts is created. Consider, for example, the term Skifahren (skiing). The set of words occurring directly before the term at displacement −1 consists of words like Langlaufen (cross-country skiing), Rodeln (toboggan), Pulverschnee (powder snow) or Winter, to name but a few. By averaging the respective vectors representing these terms, a statistical model of word contexts is created.
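To make the encoding concrete, the following is a small illustrative Python sketch of the context-vector construction of Equation 1. It assumes pre-tokenized descriptions; the variable names and the handling of description boundaries are our own choices rather than anything prescribed above.

```python
import numpy as np

def build_context_vectors(docs, n=90, displacements=(-2, -1, +1, +2), seed=0):
    # docs: list of token lists. Every vocabulary term gets an n-dimensional random
    # vector; the context vector of a term is the concatenation of the mean random
    # vectors of the words found at each displacement (cf. Equation 1).
    rng = np.random.default_rng(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_vec = {w: rng.normal(size=n) for w in vocab}  # quasi-orthogonal for large n

    sums = {w: np.zeros((len(displacements), n)) for w in vocab}
    counts = {w: np.zeros(len(displacements)) for w in vocab}
    for doc in docs:
        for i, w in enumerate(doc):
            for k, d in enumerate(displacements):
                j = i + d
                if 0 <= j < len(doc):
                    sums[w][k] += word_vec[doc[j]]
                    counts[w][k] += 1
    context = {}
    for w in vocab:
        means = sums[w] / np.maximum(counts[w], 1)[:, None]  # mean vector per displacement
        context[w] = means.ravel()                           # concatenation: n * N dimensions
    return context
```

Averaging the displacements −2/−1 and +1/+2 together, as done in the experiments later on, simply amounts to summing the corresponding slices of these vectors before concatenation.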
3.2 Self-Organizing Map Algorithm
The self-organizing map (SOM) [7, 8] is an unsupervised neural network providing a mapping from a high-dimensional input space to a usually two-dimensional output space while preserving topological relations as faithfully as possible. The SOM consists of a set of units arranged in a two-dimensional grid, with a weight vector mi ∈ R^n attached to each unit i. Data from the high-dimensional input space, referred to as input vectors x ∈ R^n, are presented to the SOM, and the activation of each unit for the presented input vector is calculated using an activation function. Commonly, the Euclidean distance between the weight vector of the unit and the input vector serves as the activation function, i.e. the smaller the Euclidean distance, the higher the activation. In the next step the weight vector of the unit showing the highest activation is selected as the winner and is modified so as to more closely resemble the presented input vector. Pragmatically speaking, the weight vector of the winner is moved towards the presented input by a certain fraction of the Euclidean distance, as indicated by a time-decreasing learning rate α(t), as shown in Equation 2.
mi(t + 1) = mi(t) + α(t) · hci(t) · [x(t) − mi(t)]    (2)
Thus, this unit's activation will be even higher the next time the same input signal is presented. Furthermore, the weight vectors of units in the neighborhood of the winner are modified accordingly, as described by a neighborhood function hci(t) (cf. Equation 3), yet to a lesser degree than the winner. The strength of adaptation depends on the Euclidean distance ||rc − ri|| between the winner c and a unit i regarding their respective locations rc, ri ∈ R^2 on the 2-dimensional map, and on a time-decreasing parameter σ.

hci(t) = exp( −||rc − ri||^2 / (2 · σ^2(t)) )    (3)

Starting with a rather large neighborhood for a general organization of the weight vectors, this learning procedure finally leads to a fine-grained, topologically ordered mapping of the presented input signals. Similar input data are mapped onto neighboring regions on the map.
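A compact sketch of this training loop is given below; the linear decay schedules for α(t) and σ(t) are illustrative assumptions, since the text above does not fix them.

```python
import numpy as np

def train_som(data, grid=(20, 20), epochs=50, alpha0=0.5, sigma0=5.0, seed=0):
    # data: array of shape (num_inputs, dim). Returns weights of shape (rows, cols, dim).
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.normal(size=(rows, cols, data.shape[1]))
    # Grid coordinates r_i of every unit, used by the neighborhood function (Eq. 3).
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            t = step / total_steps
            alpha = alpha0 * (1.0 - t)          # time-decreasing learning rate alpha(t)
            sigma = sigma0 * (1.0 - t) + 1e-3   # time-decreasing neighborhood width sigma(t)
            # Winner c: the unit whose weight vector is closest (Euclidean) to the input.
            dists = np.linalg.norm(weights - x, axis=-1)
            c = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighborhood h_ci(t) around the winner (Eq. 3).
            grid_dist2 = np.sum((coords - np.array(c)) ** 2, axis=-1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Update rule of Eq. 2: move all weight vectors toward the input.
            weights += (alpha * h)[..., None] * (x - weights)
            step += 1
    return weights
```

Feeding the context vectors of the previous subsection into such a map yields the word category maps used in the experiments below.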
4 Experiments
4.1 Data
The data provided by Tiscover consist, on the one hand, of structured information as described in Section 2, and, on the other hand, of free-form texts describing the accommodations. Because accommodation providers themselves enter the data into the system, the descriptions vary in length and style and are not uniform or even quality-controlled with regard to spelling. HTML tags, which are allowed to format the descriptions, had to be removed to have plain-text files for further processing. For the experiments presented hereafter, we used the German descriptions of the accommodations since they are more comprehensive than the English ones. Especially small and medium-sized accommodations provide only a very rudimentary English description, many being far from correctly spelled. It has been shown with a text collection consisting of fairy tales that, with free-form text documents, the word categories dominate the cluster structure of such a map [5]. To create semantic maps primarily reflecting the semantic similarity of words rather than categorizing word classes, we removed words other than nouns and proper names. Therefore, we used the characteristic, unique to the German language, of nouns starting with a capital letter to filter the nouns and proper names occurring in the texts. Obviously, using this method, some other words like adjectives, verbs or adverbs at the beginning of sentences or in improperly written documents are also filtered. Conversely, some nouns can be missed. A different method of determining nouns or other relevant word classes, especially for languages other than German, would be part-of-speech (POS) taggers. But even
Die Ferienwohnung Lage Stadtrand Wien Bezirk Mauer In Gehminuten Schnellbahn Fahrminuten Wien Mitte Stadt Die Wohnung Wohn Eßraum Kamin SAT TV Küche Geschirrspüler Schlafzimmer Einzelbetten Einbettzimmer Badezimmer Wanne Doppelwaschbecken Dusche Extra WC Terrasse Sitzgarnitur Ruhebetten Die Ferienwohnung Aufenthalt Wünsche
Fig. 1. A sample description of a holiday flat in a suburb of Vienna after removing almost all words not being nouns or proper names
the (fem.), holiday flat, location, outskirts, Vienna, district, Mauer, in, minutes to walk, urban railway, minutes to drive, Wien Mitte (station name), city, the (fem.), flat, living, dining room, fireplace, satellite tv, kitchen, dishwasher, sleeping room, single beds, single-bed room, bathroom, bathtub, double washbasin, shower, separate toilet, terrace, chairs and table, couches, the (fem.), holiday flat, stay, wishes
Fig. 2. English translation of the description shown in Figure 1
state-of-the-art POS taggers do not reach an accuracy of 100% [10]. For the rest of this section, the numbers and figures presented refer to the already preprocessed documents, if not stated otherwise. The collection consists of 12,471 documents with a total number of 481,580 words, i.e. on average, a description contains about 39 words. For the curious reader we shall note that not all of the 13,117 accommodations in the database provide a textual description. The vocabulary of the document collection comprises 35,873 unique terms, but for the sake of readability of the maps we reduced the number of terms by excluding those occurring fewer than ten times in the whole collection. Consequently, we used 3,662 terms for creating the semantic maps. In Figure 1, a natural language description of a holiday flat in Vienna is shown. Beginning with the location of the flat, the accessibility by public transport is mentioned, followed by some terms describing the dining and living room together with enumerations of the respective furniture and fixtures. Other parts of the flat are the sleeping room, a single bedroom and the bathroom. In this particular example, the only words not being nouns or proper names are the determiner Die and the preposition In at the beginning of sentences. For the sake of convenience, we have provided an English translation in Figure 2.
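The noun-filtering and frequency-threshold step can be approximated with a few lines of Python. The capitalization heuristic and the threshold of ten occurrences come from the description above; the tokenizer itself is our simplification.

```python
import re
from collections import Counter

def extract_terms(descriptions, min_count=10):
    # Keep tokens starting with a capital letter (exploiting German noun capitalization)
    # and drop terms occurring fewer than min_count times in the whole collection.
    token_re = re.compile(r"[A-Za-zÄÖÜäöüß]+")
    docs = []
    for text in descriptions:
        tokens = token_re.findall(text)
        docs.append([t for t in tokens if t[0].isupper()])
    counts = Counter(t for doc in docs for t in doc)
    keep = {t for t, c in counts.items() if c >= min_count}
    return [[t for t in doc if t in keep] for doc in docs], sorted(keep)
```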
4.2 Semantic Map
For encoding the terms we have chosen 90-dimensional random vectors. The vectors used for training the semantic map depicted in Figure 3 were created by using a context window of length four, i.e. two words before and two words after a term. But instead of treating all four sets of context terms separately, we have put terms at displacements −2 and −1 as well as those at displacements +1 and +2 together. Then the average vectors of both sets were calculated and
finally concatenated to create the 180-dimensional context vectors. Further experiments have shown that this setting yielded the best result. For example, using a context window of length four but considering all displacements separately, i.e. the final context vector has length 360, has led to a map where the clusters were not as coherent as on the map shown below. A smaller context window of length two, taking only the surrounding words at displacements −1 and +1 into account, had a similar effect. This indicates that the amount of text available for creating such a statistical model is crucial for the quality of the resulting map. By subsuming the context words at displacements before as well as after the word, the disadvantage of having an insufficient amount of text can be alleviated, because having twice the number of contexts with displacements −1 and +1 is simulated. Due to the enumerative nature of the accommodation descriptions, the exact position of the context terms can be disregarded. The self-organizing map depicted in Figure 3 consists of 20 × 20 units. Due to space considerations, only a few clusters can be detailed in this description and enumerations of terms in a cluster will only be exemplary. The semantic clusters shaded gray have been determined by manual inspection. They consist of very homogeneous sets of terms related to distinct aspects of the domain. The parts of the right half of the map that have not been shaded mainly contain proper names of places, lakes, mountains, cities or accommodations. However, it should be noted that, e.g., names of lakes or mountains are homogeneously grouped in separate clusters. In the upper left corner, mostly verbs, adverbs, adjectives or conjunctions are located. These are terms that have been inadvertently included in the set of relevant terms as described in the previous subsection. In the upper part of the map, a cluster containing terms related to pricing, fees and reductions can be found. Other clusters in this area predominantly deal with words describing types of accommodation and, in the top-right corner, a strong cluster of accommodation names can be found. On the right-hand border of the map, geographical locations, such as central, outskirts, or close to a forest, have been mapped, and a cluster containing skiing- and mountaineering-related terms is also located there. A dominant cluster containing words that describe room types, furnishing and fixtures can be found in the lower left corner of the map. The cluster labeled advertising terms in the bottom-right corner of the map predominantly contains words that are found at the beginning of the documents, where the pleasures awaiting the potential customer are described. Interesting inter-cluster relations showing the semantic ordering of the terms can be found in the bottom part of the map. The cluster labeled farm contains terms describing, amongst other things, typical goods produced on farms like organic products, jam, grape juice or schnapps. In the upper left corner of the cluster, names of farm animals (e.g. pig, cow, chicken) as well as animals usually found in a petting zoo (e.g. donkey, dwarf goats, cats, calves) are located. This cluster describing animals adjoins a cluster primarily containing terms related
(Figure 3 residue: the original page shows a 20 × 20 semantic map whose manually labeled clusters include verbs/adjectives/adverbs/conjunctions/determiners, types of prices, rates and fees, reductions, types of private accommodation, proper names of accommodations, farms and cities, group travel, swimming, location, wellness, view, types of travelers, sports, outdoor sports, games, children, skiing, animals, mountaineering, kitchen, farm, food, room types, furnishing and fixtures, and advertising terms.)
Fig. 3. A self-organizing semantic map of terms in the tourism domain with labels denoting general semantic clusters. The cluster boundaries have been drawn manually
to children, toys and games. Some terms are playroom, tabletop soccer, sandbox and volleyball, to name but a few. This representation of a domain vocabulary supports the construction and enrichment of domain ontologies by making relevant concepts and their relations evident. To provide an example, we found a wealth of terms describing sauna-like recreational facilities having in common that the vacationer sojourns in a closed room with a well-tempered atmosphere, e.g. sauna, tepidarium, bio sauna, herbal sauna, Finnish sauna, steam sauna, thermarium or infrared cabin. On the one hand, major semantic categories identified by inspecting and evaluating the semantic map can be used as a basis for a top-down ontology engineering approach. On the other hand, the clustered terms, extracted from domain-relevant documents, can be used for bottom-up engineering of an existing ontology.
5 Conclusions
In this paper, we have presented a method, based on the self-organizing map, to support the construction and enrichment of domain ontologies. The words occurring in free-form text documents from the application domain are clustered according to their semantic similarity based on statistical context analysis. More precisely, we have shown that when a word is described by the words that appear within a fixed-size context window, semantic relations of words unfold in the self-organizing map. Thus, words that refer to similar objects can be found in neighboring parts of the map. The two-dimensional map representation provides an intuitive interface for browsing through the vocabulary to discover new concepts or relations between concepts that are still missing in the ontology. We illustrated this approach with an example from the tourism domain. The clustering results revealed a number of relevant tourism-related terms that can now be integrated into the ontology to provide better retrieval results when searching for accommodations. We achieved this by analysis of self-descriptions written by accommodation providers, thus substantially assisting the costly and time-consuming process of ontology engineering.
References
[1] M. Dittenbach, D. Merkl, and H. Berger. What customers really want from tourism information systems but never dared to ask. In Proc. of the 5th Int'l Conference on Electronic Commerce Research (ICECR-5), Montreal, Canada, 2002.
[2] M. Dittenbach, D. Merkl, and H. Berger. A natural language query interface for tourism information. In A. J. Frew, M. Hitz, and P. O'Connor, editors, Proceedings of the 10th International Conference on Information Technologies in Tourism (ENTER 2003), pages 152–162, Helsinki, Finland, 2003. Springer-Verlag.
[3] W. Frakes, R. Prieto-Díaz, and C. Fox. DARE: Domain analysis and reuse environment. Annals of Software Engineering, Kluwer, 5:125–141, 1998.
[4] T. Honkela. Self-Organizing Maps in Natural Language Processing. PhD thesis, Helsinki University of Technology, 1997.
[5] T. Honkela, V. Pulkki, and T. Kohonen. Contextual relations of words in Grimm tales, analyzed by self-organizing map. In F. Fogelman-Soulie and P. Gallinari, editors, Proceedings of the International Conference on Artificial Neural Networks (ICANN 1995), pages 3–7, Paris, France, 1995. EC2 et Cie.
[6] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM – self-organizing maps of document collections. Neurocomputing, Elsevier, 21:101–117, November 1998.
[7] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 1982.
[8] T. Kohonen. Self-organizing maps. Springer-Verlag, Berlin, 1995.
[9] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela. Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3):574–585, May 2000.
[10] C. Manning and H. Schütze. Foundations of statistical natural language processing. MIT Press, 2000.
[11] R. Prieto-Díaz. A faceted approach to building ontologies. In S. Spaccapietra, S. T. March, and Y. Kambayashi, editors, Proc. of the 21st Int'l Conf. on Conceptual Modeling (ER 2002), LNCS, Tampere, Finland, 2002. Springer-Verlag.
[12] B. Pröll, W. Retschitzegger, R. Wagner, and A. Ebner. Beyond traditional tourism information systems – TIScover. Information Technology and Tourism, 1, 1998.
[13] H. Ritter and T. Kohonen. Self-organizing semantic maps. Biological Cybernetics, 61(4):241–254, 1989.
Parameterless Data Compression and Noise Filtering Using Association Rule Mining
Yew-Kwong Woon (1), Xiang Li (2), Wee-Keong Ng (1), and Wen-Feng Lu (2,3)
(1) Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore
(2) Singapore Institute of Manufacturing Technology, 71 Nanyang Drive, Singapore 638075, Singapore
(3) Singapore-MIT Alliance
Abstract. The explosion of raw data in our information age necessitates the use of unsupervised knowledge discovery techniques to understand mountains of data. Cluster analysis is suitable for this task because of its ability to discover natural groupings of objects without human intervention. However, noise in the data greatly affects clustering results. Existing clustering techniques use density-based, grid-based or resolution-based methods to handle noise but they require the fine-tuning of complex parameters. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. There are several noise/outlier detection techniques but they too need suitable parameters. In this paper, we present a novel parameterless method of filtering noise using ideas borrowed from association rule mining. We term our technique, FLUID (Filtering Using Itemset Discovery). FLUID automatically discovers representative points in the dataset without any input parameter by mapping the dataset into a form suitable for frequent itemset discovery. After frequent itemsets are discovered, they are mapped back to their original form and become representative points of the original dataset. As such, FLUID accomplishes both data and noise reduction simultaneously, making it an ideal preprocessing step for cluster analysis. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID.
1 Introduction
The information age was hastily ushered in by the birth of the World Wide Web (Web) in 1990. All of a sudden, an abundance of information, in the form of web pages and digital libraries, was available at the fingertips of anyone who was connected to the Web. Researchers from the Online Computer Library Center found that there were 7 million unique sites in the year 2000 and the Web was predicted to continue its fast expansion [1]. Data mining becomes important because traditional statistical techniques are no longer feasible for handling such immense amounts of data. Cluster analysis, or clustering, becomes the data mining technique of choice because of its ability to function with little human supervision. Clustering is the process of grouping a set of physical/abstract objects
into classes of similar objects. It has been found to be useful for a wide variety of applications such as web usage mining [2], manufacturing [3], personalization of web pages [4] and digital libraries [5]. Researchers begin to analyze traditional clustering techniques in an attempt to adapt them to current needs. One such technique is the classic k-means algorithm [6]. It is fast but is very sensitive to the parameter k and noise. Recent clustering techniques that attempt to handle noise more effectively include density-based techniques [7], grid-based techniques [8] and resolution-based techniques [9, 10]. However, all of them require the fine-tuning of complex parameters to remove the adverse effects of noise. Empirical studies show that many adjustments need to be made and an optimal solution is not always guaranteed [10]. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. Since most data, such as digital library documents, web logs and manufacturing specifications, have many features or dimensions, this shortcoming is unacceptable. There are also several works on outlier/noise detection, but they too require the setting of non-intuitive parameters [11, 12]. In this paper, we present a novel unsupervised method of filtering noise using ideas borrowed from association rule mining (ARM) [13]. We term our technique FLUID (FiLtering Using Itemset Discovery). FLUID first maps the dataset into a set of items using binning. Next, ARM is applied to it to discover frequent itemsets. As there has been sustained intense interest in ARM since its conception in 1993, ARM algorithms have improved by leaps and bounds. Any ARM algorithm can be used by FLUID, and this allows the leveraging of the efficiency of the latest ARM methods. After frequent itemsets are found, they are mapped back to become representative points of the original dataset. This capability of FLUID not only eliminates the problematic need for noise removal in existing clustering algorithms but also improves their efficiency and scalability because the size of the dataset is significantly reduced. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID. The rest of the paper is organized as follows. The next section reviews related work in the areas of clustering, outlier detection and ARM, while Section 3 presents the FLUID algorithm. Experiments are conducted on both real and synthetic datasets to assess the feasibility of FLUID in Section 4. Finally, the paper is concluded in Section 5.
2 Related Work
In this section, we review prominent works in the areas of clustering and outlier detection. The problem of ARM and its representative algorithms are discussed as well.
2.1 Clustering and Outlier Detection
The k-means algorithm is the pioneering algorithm in clustering [6]. It begins by randomly generating k cluster centers known as centroids. Objects are iteratively
assigned to the cluster where the distance between itself and the cluster's centroid is the shortest. It is fast but sensitive to the parameter k and noise. Density-based methods are more noise-resistant and are based on the notion that dense regions are interesting regions. DBSCAN (Density Based Spatial Clustering of Applications with Noise) is the pioneering density-based technique [7]. It uses two input parameters to define what constitutes the neighborhood of an object and whether its neighborhood is dense enough to be considered. Grid-based techniques can also handle noise. They partition the search space into a number of cells/units and perform clustering on such units. CLIQUE (CLustering In QUEst) considers a unit to be dense if the number of objects in it exceeds a density threshold and uses an apriori-like technique to iteratively derive higher-dimensional dense units. CLIQUE requires the user to specify a density threshold and the size of grids. Recently, resolution-based techniques have been proposed and applied successfully on noisy datasets. The basic idea is that, when viewed at different resolutions, the dataset reveals different clusters, and by visualization or change detection of certain statistics, the correct resolution at which noise is minimal can be chosen. WaveCluster is a resolution-based algorithm that uses wavelet transformation to distinguish clusters from noise [9]. Users must first determine the best quantization scheme for the dataset and then decide on the number of times to apply the wavelet transform. The TURN* algorithm is another recent resolution-based algorithm [10]. It iteratively scales the data to various resolutions. To determine the ideal resolution, it uses the third differential of the series of cluster feature statistics to detect an abrupt change in the trend. However, it is unclear how certain parameters such as the closeness threshold and the step size of resolution scaling are chosen. Outlier detection is another means of tackling noise. One classic notion is that of DB (Distance-Based) outliers [11]. An object is considered to be a DB-outlier if a certain fraction f of the dataset lies greater than a distance D from it. A recent enhancement of it involves the use of the concept of k-nearest neighbors [12]; the top n points with the largest Dk (the distance of the k-th nearest neighbor of a point) are treated as outliers. The parameters f, D, k and n must be supplied by the user. In summary, there is currently no ideal solution to the problem of noise, and existing clustering algorithms require much parameter tweaking, which becomes difficult for high-dimensional datasets. Even if somehow their parameters can be optimally set for a particular dataset, there is no guarantee that the same settings will work for other datasets. The problem is similar in the area of outlier detection.
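As an illustration of the k-nearest-neighbor outlier notion of [12] (and of the parameters k and n it requires), a brute-force sketch might look as follows; this is our own toy rendering, not code from the cited work, and it computes all pairwise distances, so it only suits small datasets.

```python
import numpy as np

def top_n_knn_outliers(data, k=5, n=10):
    # Rank points by the distance to their k-th nearest neighbor (Dk)
    # and report the indices of the n most isolated points.
    diffs = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)          # ignore self-distances
    dk = np.sort(dist, axis=1)[:, k - 1]    # Dk for every point
    return np.argsort(-dk)[:n]
```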
2.2 Association Rule Mining
Since the concept of ARM is central to FLUID, we formally define ARM and then survey existing ARM algorithms in this section. A formal description of ARM is as follows. Let the universal itemset I = {a1, a2, ..., aU} be a set of literals called items. Let Dt be a database of transactions, where each transaction T contains a set of items such that T ⊆ I. A j-itemset is a set of j unique items.
For a given itemset X ⊆ I and a given transaction T, T contains X if and only if X ⊆ T. Let ψX be the support count of an itemset X, which is the number of transactions in Dt that contain X. Let s be the support threshold and |Dt| be the number of transactions in Dt. An itemset X is frequent if ψX ≥ |Dt| × s%. An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The association rule X ⇒ Y holds in Dt with confidence c% if no less than c% of the transactions in Dt that contain X also contain Y. The association rule X ⇒ Y has support s% in Dt if ψX∪Y = |Dt| × s%. The problem of mining association rules is to discover rules that have confidence and support greater than the thresholds. It consists of two main tasks: the discovery of frequent itemsets and the generation of association rules from frequent itemsets. Researchers usually tackle the first task only because it is more computationally expensive. Hence, current algorithms are designed to efficiently discover frequent itemsets. We will leverage the ability of ARM algorithms to rapidly discover frequent itemsets in FLUID. Introduced in 1994, the Apriori algorithm is the first successful algorithm for mining association rules [13]. Since its introduction, it has popularized ARM. It introduces a method to generate candidate itemsets in a pass using only frequent itemsets from the previous pass. The idea, known as the apriori property, rests on the fact that any subset of a frequent itemset must be frequent as well. The FP-growth (Frequent Pattern-growth) algorithm is a recent ARM algorithm that achieves impressive results by removing the need to generate candidate itemsets, which is the main bottleneck in Apriori [14]. It uses a compact tree structure called a Frequent Pattern-tree (FP-tree) to store information about frequent itemsets. This compact structure also removes the need for multiple database scans, and it is constructed using only two scans. The items in the transactions are first sorted and then used to construct the FP-tree. Next, FP-growth proceeds to recursively mine FP-trees of decreasing size to generate frequent itemsets. Recently, we presented a novel trie-based data structure known as the Support-Ordered Trie ITemset (SOTrieIT) to store support counts of 1-itemsets and 2-itemsets [15, 16]. The SOTrieIT is designed to be used efficiently by our algorithm, FOLDARM (Fast OnLine Dynamic Association Rule Mining) [16]. In our recent work on ARM, we propose a new algorithm, FOLD-growth (Fast OnLine Dynamic-growth), which is an optimized hybrid version of FOLDARM and FP-growth [17]. FOLD-growth first builds a set of SOTrieITs from the database and uses them to prune the database before building FP-trees. FOLD-growth is shown to outperform FP-growth by up to two orders of magnitude.
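The support and confidence definitions above can be illustrated with a deliberately naive Python sketch; it enumerates candidate itemsets by brute force and is in no way representative of Apriori, FP-growth or FOLD-growth.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, s, max_size=3):
    # An itemset X is frequent if its support count psi_X >= |Dt| * s.
    min_count = len(transactions) * s
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return {iset: c for iset, c in counts.items() if c >= min_count}

def confidence(freq, x, y):
    # Confidence of X => Y: support count of X union Y divided by support count of X.
    union = tuple(sorted(set(x) | set(y)))
    return freq[union] / freq[tuple(sorted(set(x)))]

# Toy example: {a, b} occurs in 2 of 3 transactions; confidence of a => b is 2/3.
db = [["a", "b", "c"], ["a", "b"], ["a", "c"]]
freq = frequent_itemsets(db, s=0.5)
print(confidence(freq, ("a",), ("b",)))
```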
3 Filtering Using Itemset Discovery (FLUID)
3.1 Algorithm
Given a d-dimensional dataset Do consisting of n objects o1, o2, ..., on, FLUID discovers a set of representative objects O1, O2, ..., Om, where m ≪ n, in three main steps:
1. Convert dataset Do into a transactional database Dt using procedure MapDB
2. Mine Dt for frequent itemsets using procedure MineDB
3. Convert the discovered frequent itemsets back to their original object form using procedure MapItemset

Procedure MapDB
1  Sort each dimension of Do in ascending order
2  Compute mean µx and standard deviation σx of the nearest object distance in each dimension x by checking the left and right neighbors of each object
3  Find range of values rx for each dimension x
4  Compute number of bins βx for each dimension x: βx = rx / ((µx + 3 × σx) × 0.005 × n)
5  Map each bin to a unique item a ∈ I
6  Convert each object oi in Do into a transaction Ti with exactly d items by binning its feature values, yielding a transactional database Dt

Procedure MapDB tries to discretize the features of dataset Do in a way that minimizes the number of required bins without losing the pertinent structural information of Do. Every dimension has its own distribution of values and thus, it is necessary to compute the bin sizes of each dimension/feature separately. Discretization is itself a massive area, but experiments reveal that MapDB is good enough to remove noise efficiently and effectively. To understand the data distribution in each dimension, the mean and standard deviation of the closest neighbor distance of every object in every dimension are computed. Assuming that all dimensions follow a Normal distribution, an object should have one neighboring object within three standard deviations of the mean nearest neighbor distance. To avoid having too many bins, there is a need to ensure that each bin would contain a certain number of objects (0.5% of the dataset size), and this is accomplished in step 4. In the event that the values are spread out too widely, i.e. the standard deviation is much larger than the mean, the number of standard deviations used in step 4 is reduced to 1 instead of 3. Note that if a particular dimension has fewer than 100 unique values, steps 2-4 would be unnecessary and the number of bins would be the number of unique values. As mentioned in step 6, each object becomes a transaction with exactly d items because each item represents one feature of the object. The transactions do not have duplicated items because every feature has its own unique set of bins. Once Do is mapped into transactions with unique items, it is in a form that can be mined by any association rule mining algorithm.
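A rough NumPy rendering of MapDB is given below. The handling of degenerate dimensions is our own guard and the fewer-than-100-unique-values special case is omitted, so this is a sketch of the binning idea rather than the authors' implementation.

```python
import numpy as np

def map_db(data):
    # data: array of shape (n, d). Bin each dimension using nearest-neighbor
    # statistics and give every bin a globally unique item id (steps 1-6).
    n, d = data.shape
    n_bins, lows, widths = [], [], []
    for x in range(d):
        col = np.sort(data[:, x])                               # step 1
        gaps = np.diff(col)
        nearest = np.minimum(np.insert(gaps, 0, np.inf),        # nearest-neighbor distance
                             np.append(gaps, np.inf))
        finite = nearest[np.isfinite(nearest)]
        mu, sigma = finite.mean(), finite.std()                 # step 2
        r = col[-1] - col[0]                                    # step 3
        k = 3 if sigma <= mu else 1                             # fewer std devs if values are very spread out
        width = (mu + k * sigma) * 0.005 * n
        beta = max(1, int(r / width)) if r > 0 and width > 0 else 1   # step 4
        n_bins.append(beta)
        lows.append(col[0])
        widths.append(r / beta if r > 0 else 1.0)
    offsets = np.cumsum([0] + n_bins[:-1])                      # step 5: unique item ids
    transactions = []
    for row in data:                                            # step 6: one item per dimension
        t = [int(offsets[x] + min(n_bins[x] - 1, int((row[x] - lows[x]) // widths[x])))
             for x in range(d)]
        transactions.append(t)
    return transactions
```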
Procedure MineDB
1   Set support threshold s = 0.1 (10%)
2   Set number of required frequent d-itemsets k = n
3   Let δ(A, B) be the distance between two j-itemsets A(a1, ..., aj) and B(b1, ..., bj): δ(A, B) = Σ_{i=1..j} (ai − bi)
4   An itemset A is a loner itemset if δ(A, Z) > 1, ∀Z ∈ L ∧ Z ≠ A
5   Repeat
6     Repeat
7       Use an association rule mining algorithm to discover a set of frequent itemsets L from Dt
8       Remove itemsets with less than d items from L
9       Adjust s using a variable step size to bring |L| closer to k
10    Until |L| = k or |L| stabilizes
11    Set k = ½ |L|
12    Set s = 0.1
13    Remove loner itemsets from L
14  Until abrupt change in number of loner itemsets
MineDB is the most time-consuming and complex step of FLUID. The key idea here is to discover the optimal set of frequent itemsets that represents the important characteristics of the original dataset; we consider important characteristics as dense regions in the original dataset. In this case, the support threshold s is akin to the density threshold used by density-based clustering algorithms and thus, it can be used to remove regions with low density (itemsets with low support counts). The crucial point here is how to automate the finetuning of s. This is done by checking the number of loner itemsets after each iteration (steps 6-14). Loner itemsets represent points with no neighboring points in the discretized d-dimensional feature space. Therefore, an abrupt change in the number of loner itemsets indicates that the current support threshold value has been reduced to a point where dense regions in the original datasets are being divided into too many sparse regions. This point is made more evident in Section 5 where its effect can be visually observed. The number of desired frequent d-itemsets (frequent itemsets with exactly d items), k, is initially set to the size of the original dataset as seen in step 2. The goal is to obtain the finest resolution of the dataset that is attainable after its transformation. The algorithm then proceeds to derive coarser resolutions in an exponential fashion in order to quickly discover a good representation of the original dataset. This is done at step 11 where k is being reduced to half of |L|. The amount of reduction can certainly be lowered to get more resolutions but this will incur longer processing time and may not be necessary. Experiments have revealed that our choice suffices for a good approximation of the representative points of a dataset. In step 8, notice that itemsets with less than d items are removed. This is because association rule mining discovers frequent itemsets with various sizes but we are only interested in frequent itemsets containing items that represent all the features of the dataset. In step 9, the support threshold s is incremented/decremented by a variable step size. The step size is variable as it must
be made smaller in order to zoom in on the best possible s to obtain the required number of frequent d-itemsets, k. In most situations, it is quite unlikely that |L| can be adjusted to equal k exactly and thus, if |L| stabilizes or fluctuates between similar values, its closest approximation to k is considered as the best solution as seen in step 10.

Procedure MapItemset
1  for each frequent itemset A ∈ L do
2    for each item i ∈ A do
3      Assign the center of the bin represented by i as its new value
4    end for
5  end for
The final step of FLUID is the simplest: it involves mapping the frequent itemsets back to their original object form. The filtered dataset would now contain representative points of the original dataset, excluding most of the noise. Note that the filtering is only an approximation, but it is sufficient to remove most of the noise in the data and retain pertinent structural characteristics of the data. Subsequent data mining tasks such as clustering can then be used to extract knowledge from the filtered and compressed dataset efficiently with few complications from noise. Note also that the types of clusters discovered depend mainly on the clustering algorithm used and not on FLUID.
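To make the interplay of the support threshold, k and loner itemsets in MineDB concrete, the following is a heavily simplified rendering of the loop. The step-size schedule and the "abrupt change" test are our own heuristics, and counting frequent d-itemsets reduces to a histogram only because every transaction contains exactly d items.

```python
from collections import Counter

def frequent_d_itemsets(transactions, s):
    # Each transaction has exactly d items (one bin per dimension), so the only
    # d-itemset it contains is the transaction itself: counting is a histogram.
    counts = Counter(map(tuple, transactions))
    min_count = len(transactions) * s
    return [iset for iset, c in counts.items() if c >= min_count]

def loners(itemsets):
    # A loner itemset has no other discovered itemset within distance 1
    # (read here as a coordinate-wise distance over bin indices).
    def near(a, b):
        return sum(abs(x - y) for x, y in zip(a, b)) <= 1
    return [a for a in itemsets if not any(near(a, b) for b in itemsets if b != a)]

def mine_db(transactions, n_resolutions=5):
    k = len(transactions)
    history, best = [], None
    for _ in range(n_resolutions):
        s, step, L = 0.1, 0.05, []
        for _ in range(30):                       # inner loop: adjust s toward k
            L = frequent_d_itemsets(transactions, s)
            if len(L) == k:
                break
            s += step if len(L) > k else -step    # too many itemsets -> raise s
            s, step = max(s, 1e-6), step / 2
        n_lone = len(loners(L))
        if history and n_lone > 2 * history[-1]:  # crude "abrupt change" detector
            break
        history.append(n_lone)
        best, k = L, max(1, len(L) // 2)          # halve k for the next, coarser resolution
    return [a for a in best if a not in set(loners(best))]
```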
3.2 Complexity Analysis
The following are the time complexities of the three main steps of FLUID:
1. MapDB: The main processing time is taken by step 1 and hence, its time complexity is O(n log n).
2. MineDB: As the total number of iterations used by the loops in the procedure is very small, the bulk of the processing time is attributed to the time to perform association rule mining, given by TA.
3. MapItemset: The processing time is dependent on the number of resultant representative points |L| and thus, it has a time complexity of O(n).
Hence, the overall time complexity of FLUID is O(n log n + TA + n).
3.3 Strengths and Weaknesses
The main strength of FLUID is its independence from user-supplied parameters. Unlike its predecessors, FLUID does not require any human supervision. Not only does it remove noise/outliers, it also compresses the dataset into a set of representative points without any loss of pertinent structural information of the original dataset. In addition, it is reasonably scalable with respect to both the size and
(Figure 1 consists of four scatter plots, panels (a)-(d), drawn over the same 0-700 by 0-500 coordinate range: the original noisy dataset and the representative points obtained at successively coarser resolutions.)
Fig. 1. Results of executing FLUID on a synthetic dataset.
dimensionality of the dataset as it inherits the efficient characteristics of existing association rule mining algorithms. Hence, it is an attractive preprocessing tool for clustering or other data mining tasks. Ironically, its weakness also stems from its use of association rule mining techniques. This is because association rule mining algorithms do not scale as well as resolution-based algorithms in terms of dataset dimensionality. Fortunately, since ARM is still receiving much attention from the research community, it is possible that more efficient ARM algorithms will be available to FLUID. Another weakness is that FLUID spends much redundant processing time in finding and storing frequent itemsets that have less than d items. This problem is inherent in association rule mining because larger frequent itemsets are usually formed from smaller frequent itemsets. Efficiency and scalability can certainly be improved greatly if there is a way to directly discover frequent d-itemsets.
4 Experiments
This section evaluates the viability of FLUID by conducting experiments on a Pentium-4 machine with a CPU clock rate of 2 GHz and 1 GB of main memory. We shall use FOLD-growth as our ARM algorithm in our experiments as it is fast, incremental and scalable [17]. All algorithms are implemented in Java.
The synthetic dataset (named t7.10k.dat) used here tests the ability of FLUID to discover clusters of various sizes and shapes amidst much noise; it has been used as a benchmarking test for several clustering algorithms [10]. It has been shown that prominent algorithms like k-means [6], DBSCAN [7], CHAMELEON [18] and WaveCluster [9] are unable to properly find the nine visually-obvious clusters and remove noise even with exhaustive parameter adjustments [10]. Only TURN* [10] manages to find the correct clusters, but it requires user-supplied parameters as mentioned in Section 2.1. Figure 1(a) shows the dataset with 10,000 points in nine arbitrary-shaped clusters interspersed with random noise. Figure 1 shows the results of running FLUID on the dataset. FLUID stops at the iteration when Figure 1(c) is obtained, but we show the rest of the results to illustrate the effect of loner itemsets. It is clear that Figure 1(c) is the optimal result as most of the noise is removed while the nine clusters remain intact. Figure 1(d) loses much of the pertinent information of the dataset. The number of loner itemsets for Figures 1(b), (c) and (d) is 155, 55 and 136 respectively. Figure 1(b) has the most loner itemsets because of the presence of noise in the original dataset. It is the finest representation of the dataset in terms of resolution. There is a sharp drop in the number of loner itemsets in Figure 1(c) followed by a sharp increase in the number of loner itemsets in Figure 1(d). The sharp drop can be explained by the fact that most noise is removed, leaving behind objects that are closely grouped together. In contrast, the sharp increase in loner itemsets is caused by too low a support threshold. This means that only very dense regions are captured, and this causes the disintegration of the nine clusters as seen in Figure 1(d). Hence, a change in the trend of the number of loner itemsets is indicative that the structural characteristics of the dataset have changed. FLUID took a mere 6 s to compress the dataset into 1,650 representative points with much of the noise removed. The dataset is reduced by more than 80% without affecting its inherent structure, that is, the shapes of its nine clusters are retained. Therefore, it is proven in this experiment that FLUID can filter away noise even in a noisy dataset with sophisticated clusters, without any user parameters and with impressive efficiency.
5 Conclusions
Clustering is an important data mining task, especially in our information age where raw data is abundant. Several existing clustering methods cannot handle noise effectively because they require the user to set complex parameters properly. We propose FLUID, a noise-filtering and parameterless algorithm based on association rule mining, to overcome the problem of noise as well as to compress the dataset. Experiments on a benchmarking synthetic dataset show the effectiveness of our approach. In our future work, we will improve and provide rigorous proofs of our approach and design a clustering algorithm that can integrate efficiently with FLUID. In addition, the problem of handling high-dimensional datasets will be addressed. Finally, more experiments involving larger datasets with more dimensions will be conducted to affirm the practicality of FLUID.
References
1. Dean, N., ed.: OCLC Researchers Measure the World Wide Web. Number 248. Online Computer Library Center (OCLC) Newsletter (2000)
2. Srivastava, J., Cooley, R., Deshpande, M., Tan, P.N.: Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations 1 (2000) 12–23
3. Gardner, M., Bieker, J.: Data mining solves tough semiconductor manufacturing problems. In: Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, Massachusetts, United States (2000) 376–383
4. Mobasher, B., Dai, H., Luo, T., Nakagawa, M., Sun, Y., Wiltshire, J.: Discovery of aggregate usage profiles for web personalization. In: Proc. Workshop on Web Mining for E-Commerce - Challenges and Opportunities, Boston, MA, USA (2000)
5. Sun, A., Lim, E.P., Ng, W.K.: Personalized classification for keyword-based category profiles. In: Proc. 6th European Conf. on Research and Advanced Technology for Digital Libraries, Rome, Italy (2002) 61–74
6. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability (1967) 281–297
7. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon (1996) 226–231
8. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. ACM SIGMOD Conf., Seattle, WA (1998) 94–105
9. Sheikholeslami, G., Chatterjee, S., Zhang, A.: WaveCluster: A wavelet based clustering approach for spatial data in very large databases. VLDB Journal 8 (2000) 289–304
10. Foss, A., Zaiane, O.R.: A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In: Proc. Int. Conf. on Data Mining, Maebashi City, Japan (2002) 179–186
11. Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In: Proc. 24th Int. Conf. on Very Large Data Bases (1998) 392–403
12. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proc. ACM SIGMOD Conf., Dallas, Texas (2000) 427–438
13. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. on Very Large Databases, Santiago, Chile (1994) 487–499
14. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. ACM SIGMOD Conf., Dallas, Texas (2000) 1–12
15. Das, A., Ng, W.K., Woon, Y.K.: Rapid association rule mining. In: Proc. 10th Int. Conf. on Information and Knowledge Management, Atlanta, Georgia (2001) 474–481
16. Woon, Y.K., Ng, W.K., Das, A.: Fast online dynamic association rule mining. In: Proc. 2nd Int. Conf. on Web Information Systems Engineering, Kyoto, Japan (2001) 278–287
17. Woon, Y.K., Ng, W.K., Lim, E.P.: Preprocessing optimization structures for association rule mining. Technical Report CAIS-TR-02-48, School of Computer Engineering, Nanyang Technological University, Singapore (2002)
18. Karypis, G., Han, E.H., Kumar, V.: Chameleon: Hierarchical clustering using dynamic modeling. Computer 32 (1999) 68–75
Performance Evaluation of SQL-OR Variants for Association Rule Mining*
P. Mishra and S. Chakravarthy
Information and Technology Laboratory and CSE Department, The University of Texas at Arlington, Arlington, TX 76019
{pmishra,sharma}@cse.uta.edu
Abstract. In this paper, we focus on the SQL-OR approaches. We study several additional optimizations for the SQL-OR approaches (Vertical Tid, Gather-join, and Gather count) and evaluate them using DB2 and Oracle RDBMSs. We evaluate the approaches analytically and compare their performance on large data sets. Finally, we summarize the results and indicate the conditions for which the individual optimizations are useful.
1 Introduction
The work on association rule mining started with the development of the AIS algorithm [1] and then some of its modifications as discussed in [2]. Since then, there have been continuous attempts at improving the performance of these algorithms [3, 4, 5]. However, most of these algorithms are applicable to data present in flat files. SETM [6] showed how data stored in an RDBMS can be mined using SQL and the corresponding performance gain achieved by optimizing these queries. Recent research in the field of mining over databases has been in integrating the mining functions with the database. The Data Mining Query Language DMQL [7] proposed a collection of such operators for classification rules, association rules etc. [8] proposed the MineRule operator for generating general/clustered/ordered association rules. [9] presents a methodology for tightly-coupled integration of data mining applications with a relational database system. In [10] and [11] the authors have tried to highlight the implications of various architectural alternatives for coupling data mining with relational database systems. Some of the research has focused on the development of SQL-based formulations for association rule mining. Relative performances and all possible combinations for optimizations of k-way join are addressed in [13, 14]. In this paper, we will analyze the characteristics of these optimizations in detail, both analytically and experimentally. We conclude why certain optimizations are always useful and why some perceived optimizations do not seem to work as intended.
* This work was supported, in part, by NSF grants IIS-0097517, IIS-0123730 and ITR 0121297.
1.1 Focus of This Paper
With more and more use of RDBMSs to store and manipulate data, mining directly on RDBMSs is critical. The goal of this paper is to study all aspects of the basic SQL-OR approaches for association rule mining and then explore additional performance optimizations to them. The other goal of our work is to use the results obtained from mining various relations to make the optimizer mining-aware. Also, the results collected from the performance evaluations of these algorithms are critical for developing a knowledge base that can be used for selecting appropriate approaches as well as optimizations within a given approach. The rest of the paper is organized as follows. Section 3 covers in detail various SQL-OR approaches for support counting and their performance analysis. Section 4 considers the optimizations and reports the main results only, due to space limitations. The details can be found in [13], available on the web. In Section 5 we compile the summary of results obtained from mining various datasets. We conclude and present future work in Section 6.
2 Association Rules
The problem of association rule mining was formally defined in [2]. In short, it can be stated as follows. Let I be the collection of all the items and D be the set of transactions. Let T be a single transaction involving some of the items from the set I. The association rule is of the form A ⇒ B (where A and B are sets). If the support of itemset AB is 30%, it means that "30% of all the transactions contain both itemsets – itemset A and itemset B". And if the confidence of the rule A ⇒ B is 70%, it means that "70% of all the transactions that contain itemset A also contain itemset B".
3 SQL-OR Based Approaches
The nomenclature of these datasets is of the form TxxIyyDzzzK, where xx denotes the average number of items present per transaction, yy denotes the average support of each item in the dataset, and zzzK denotes the total number of transactions in thousands. The experiments have been performed on Oracle 8i (installed on a Solaris machine with 384 MB of RAM) and IBM DB2/UDB (over Windows NT with 256 MB of RAM). Each experiment has been performed 4 times. The values from the first run are ignored so as to avoid the effect of previous experiments and other database setups. The average of the next 3 results is taken and used for analysis. This is done so as to avoid any false reporting of time due to system overload or other factors. For most of the experiments, we have found that the percentage difference of each run with respect to the average is less than one percent. Before feeding the input to the mining algorithm, if it is not in the (tid, item) format, it is converted to that format (using the algorithm and the approach presented in [12]). On completion of the mining, the results are remapped to their original values. Since the time taken for
mapping, rule generation and re-mapping the results to their original descriptions is not very significant, it is not reported. For the purpose of reporting the experimental results in this paper, for most of the optimizations we show the results only for three datasets – T5I2D500K, T5I2D1000K and T10I4D100K. Wherever there is a marked difference between the results for Oracle and IBM DB2/UDB, both are shown; otherwise the result from any one of the RDBMSs has been included.
3.1 VerticalTid Approach (Vtid)
This approach makes use of two procedures – SaveTid and CountAndK. The SaveTid procedure is called once to create CLOBs (character large objects) representing lists of transactions. This procedure scans the input table once and, for every unique item id, generates a CLOB containing the list of transactions in which that item occurs (TidList). These item ids, along with their corresponding TidList, are then inserted into the TidListTable relation, which has the following schema: (Item: number, TidList: CLOB). Once the TidListTable is generated, this relation is used for support counting in all the passes. Figure 1 shows the time for mining the relation T5I2D100K with different support values on DB2. Figure 2 shows the same for Oracle. A pass-wise analysis of these figures shows that the second pass consumes most of the time. This is where the TidLists of the items constituting the 2-itemsets are compared to find the transactions common to them. Though the counting process seems straightforward, the process of reading and intersecting these CLOBs is time consuming. As the number of 2-candidate itemsets is very large, the total time taken for support counting in pass 2 is very high. We also checked how this approach scales up as the size of the datasets increases for support values of 0.20%, 0.15% and 0.10% on DB2 and Oracle respectively. From these figures [13] it is clear that VerticalTid does not do well as the size of the datasets increases.
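To make the role of the two procedures concrete, the following Python sketch mimics SaveTid and the pass-2 use of CountAndK, with in-memory sets standing in for the CLOBs. The function names and the (tid, items) input format are illustrative assumptions, not the paper's actual stored-procedure code.

```python
from collections import defaultdict
from itertools import combinations

def save_tid(transactions):
    """Analogue of SaveTid: build a TidList (here a set of transaction ids) per item."""
    tid_list = defaultdict(set)
    for tid, items in transactions:
        for item in items:
            tid_list[item].add(tid)
    return tid_list

def count_and_k(itemset, tid_list):
    """Analogue of CountAndK: support = size of the intersection of all TidLists."""
    return len(set.intersection(*(tid_list[item] for item in itemset)))

def frequent_2_itemsets(tid_list, min_support_count):
    """The expensive second pass: intersect the TidLists of every 2-candidate itemset."""
    frequent = {}
    for cand in combinations(sorted(tid_list), 2):
        support = count_and_k(cand, tid_list)
        if support >= min_support_count:
            frequent[cand] = support
    return frequent
```

Even in this simplified form, the quadratic number of 2-candidates makes pass 2 dominate, which mirrors the behavior observed in Figures 1 and 2.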
Fig. 1. VertTid on T5I2D100K (DB2)
Fig. 2. VertTid on T5I2D100K (Oracle)
3.2 Gather Join Approach (Gjn)
In this approach, for candidate itemset generation, Thomas [11], Dudgikar [12], and our implementation for DB2 use the SaveItem procedure. This procedure is similar to the SaveTid procedure, the only difference being that here a CLOB object represents a list of item ids. The SaveItem procedure scans the input dataset and, for every unique transaction, generates a CLOB object representing the list of items bought in that transaction (called ItemList). The transaction along with its corresponding ItemList is then inserted into the ItemListTable relation, which has the following schema: (Tid: number, ItemList: CLOB). The ItemList column is then read in every pass for the generation of k-candidate itemsets. In our implementation for Oracle, we skip the generation of the ItemListTable and the CombinationK stored procedure has been modified. The CombinationK udf for DB2 uses the ItemList column from the ItemListTable to generate k-candidate itemsets, while in Oracle, in any pass k, this stored procedure reads the input dataset ordered by the "Tid" column and inserts all item ids corresponding to a particular transaction into a vector. This vector is then used to generate all the possible k-candidate itemsets. This is done to avoid the usage of CLOBs, as working with CLOBs in Oracle has been found to be very time consuming; also, the implementation in Oracle had to be done as a stored procedure, which does not necessarily need the inputs as CLOBs. In pass 2 and pass 3, the Combination2 and Combination3 stored procedures read the input dataset and generate candidate itemsets of length 2 and length 3 respectively. For DB2 the process of candidate itemset generation is as follows: in any pass k, for each tuple of the ItemListTable, the CombinationK udf is invoked. This udf receives the ItemList as input and returns all k-item combinations. Figure 3 and Figure 4 show the time taken for mining the dataset T5I2D100K with different support values, using this approach on Oracle and DB2 respectively. The legend "ItemLT" corresponds to the time taken in building the ItemListTable. Since the building of the ItemListTable is skipped for our Oracle implementation, the time taken for building the ItemListTable for Oracle is zero.
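A rough Python analogue of one Gjn pass is sketched below. Here combination_k plays the role of the CombinationK udf, the "group by ... having" step is emulated by a counter, and the ItemListTable is assumed to be an in-memory list of (tid, item_list) pairs; these are simplifications for illustration only.

```python
from collections import Counter
from itertools import combinations

def combination_k(item_list, k):
    """Analogue of the CombinationK udf: all k-item combinations of one ItemList."""
    return combinations(sorted(item_list), k)

def gather_join_pass(item_list_table, k, min_support_count):
    """One Gjn pass: generate Ck from every transaction's complete ItemList, then a
    group-by-with-having style count yields the frequent k-itemsets."""
    support = Counter()
    for tid, item_list in item_list_table:
        support.update(combination_k(item_list, k))
    return {itemset: count for itemset, count in support.items()
            if count >= min_support_count}
```

Because the complete ItemList of every transaction is expanded in every pass, the number of generated candidates grows rapidly, which is exactly the weakness addressed by the optimizations in Section 4.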
3.3 Gather Count Approach (Gcnt)
This approach has been implemented for Oracle only. It is a slight modification of the Gather Join approach. Here, the support of candidate itemsets is counted directly in memory, so as to save the time spent in materializing the candidate itemsets and then counting their support. In pass 2, Gcnt uses the GatherCount2 procedure, which is a modification of the Combination2 procedure. In the second pass, instead of simply generating all the candidate itemsets of length 2 (as is done in the Combination2 procedure of Gjn), the GatherCount2 procedure uses a 2-dimensional array to count the occurrence of each itemset, and only those itemsets whose support count exceeds the user specified minimum support value are inserted into the frequent itemsets table. This reduces the time taken for generating frequent itemsets of length 2, as it skips the materialization of the C2 relation. In pass 2, a 2-D array of dimensions [# of items] * [# of items] is built and all of its cells are initialized to zero. The GatherCount2 procedure generates all 2-item combinations (similar to the way it is done in the Combination2 procedure of Gjn) and increments the count of each itemset in the array. Thus, if an itemset {2,3} is generated, the value in the cell
[Item2][Item3] is incremented by 1. As the itemsets are generated in such a way that the item in position 1 < the item in position 2, half of the cells in the 2-D array will always be zero. However, this method of support counting cannot be used for higher passes, because building an array of 3 or more dimensions would require far too much memory.
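The pass-2 counting scheme of GatherCount2 can be sketched as follows; the use of a NumPy array and integer item ids in the range [0, num_items) are assumptions of this sketch rather than details taken from the paper's PL/SQL implementation.

```python
import numpy as np
from itertools import combinations

def gather_count_2(transactions, num_items, min_support_count):
    """Pass 2 of Gcnt: count 2-item combinations directly in a 2-D array and report
    only the frequent ones, skipping the materialization of C2."""
    counts = np.zeros((num_items, num_items), dtype=np.int64)
    for tid, items in transactions:
        for a, b in combinations(sorted(set(items)), 2):  # a < b, so half the array stays zero
            counts[a, b] += 1
    frequent = {}
    for a, b in zip(*np.nonzero(counts >= min_support_count)):
        frequent[(int(a), int(b))] = int(counts[a, b])
    return frequent
```

The n × n array makes the memory cost explicit: it grows quadratically with the number of distinct items, which is why the same trick is not applied in higher passes.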
Fig. 3. Gather Join on T5I2D100K (O)
Fig. 4. Gather Join on T5I2D100K (DB2)
Fig. 5. Naïve SQL-OR Based Approaches (O)
Fig. 6. Ck and Fk for Gjn (DB2)
4 Analysis and Optimizations to the SQL-OR Based Approaches
Figure 5 compares the time taken for mining by the naïve SQL-OR based approaches for a support value of 0.10% on the datasets T5I2D10K, T5I2D100K, T5I2D500K and T5I2D1000K on Oracle. From this figure it is very clear that, of the 3 approaches, VerticalTid has the worst performance. This is because Vtid blows up at the second pass, where the overall time taken for support counting of all the 2-itemsets by intersecting their TidLists is very large. So the optimization to Vtid is to reduce the number of TidLists processed by the CountAndK procedure in each pass. This optimization is explained in more detail in Section 4.1. The other two approaches, though they complete for large datasets, take a lot of time. The difference between the candidate itemset generation process in
these approaches and the way it is done in any SQL-92 based approach is that here, in any pass k, all the items bought in a transaction (the complete ItemList) are used for the generation of candidate itemsets, whereas in the SQL-92 based approaches, in the kth pass, only frequent itemsets of length k-1 are extended. The significance of this lies in the number of candidate itemsets that are generated at each pass and the way support counting is done. In the SQL-92 based approaches, frequent itemsets of length k-1 are used to generate candidate itemsets of length k, and then additional joins are done to consider only those candidate itemsets whose subsets of length k-1 are also frequent (because of the subset property). This reduces the number of candidate itemsets generated at each pass significantly. But then, for support counting, the input dataset has to be joined k times with an additional join condition to ensure that the items constituting an itemset come from the same transaction. In Gjn and Gcnt, since the candidate itemsets are generated from the complete ItemList of a transaction, there is no need to join the input dataset; a single group by on the items constituting an itemset, with a having clause, is sufficient to identify all those candidate itemsets that are frequent. However, in any pass k, there is no easy way to identify the frequent itemsets of length k-1 and use them selectively to generate candidate itemsets of length k; rather, the entire ItemList is used for the generation of k-candidate itemsets. This generates a huge number of unwanted candidate itemsets and hence an equivalent increase in the time for support counting. Figure 6 compares the time taken for the generation of these candidate itemsets and their support counting for each pass for dataset T5I2D100K, for a support value of 0.10% on DB2. These figures suggest that most of the time is taken in the generation of the large number of candidate itemsets. So a way to optimize would be to reduce the number of candidate itemsets. This optimization is explained in detail in Sections 4.2 and 4.3.
4.1 Improved VerticalTid Approach (IM_Vtid)
In the Vtid approach, for support counting in any pass k, the TidList of each item constituting an itemset is passed to the CountAndK procedure. As the length of the itemsets increases, the number of TidLists passed as parameters to the CountAndK procedure also increases (in pass k, the CountAndK procedure receives k TidLists).
Fig. 7. % Gain of Im_Vtid over Vtid
Fig. 8. IM_Vtid on T5I2D1000K
So, to enhance the process of support counting, this optimization does the following. In pass 2, frequent itemsets of length two are generated directly by performing a self-join of the input dataset, the join condition being that the item from the first copy < the item from the second copy and that both items belong to the same Tid. From pass 3 onwards, for those itemsets whose count > the minimum support value, the CountAndK procedure again builds a list of transactions (as a CLOB) that have been found common in all the TidLists, to represent that itemset as a whole. (We have implemented this for Oracle only and have modified the CountAndK stored procedure to reflect the above change; hence, for this optimization, the CountAndK procedure is used only in reference to the implementation for Oracle.) In pass k, the itemset along with its TidList is materialized in an intermediate relation. In the next pass (pass k+1), during the support counting of the candidate itemsets (which are one-extensions of the frequent itemsets of length k that were materialized in pass k), there is no need to pass the TidLists of all the items constituting the itemset. Instead, just two TidLists are passed – one representing the k-itemset and the other representing the item extending this itemset. This saves considerable time in searching for the list of common transactions in the TidLists received by the CountAndK procedure. Figure 7 shows the performance gained (in percentages) by using IM_Vtid over Vtid for datasets T5I2D10K and T5I2D100K for support values of 0.20%, 0.15% and 0.10% (for the other datasets Vtid didn't complete). Figure 8 shows the overall time taken for mining the relation T5I2D1000K with the IM_Vtid approach for different support values on Oracle. The legend TidLT represents the time taken in building the TidListTable from the input dataset (T5I2D1000K). This phase basically represents the time taken in building the TidList (a CLOB object) for each item id. From Figure 8 it is clear that the time taken in building the TidListTable is a huge overhead; it accounts for nearly 60 to 80 percent of the total time spent for mining. Though this optimization is very effective, the time taken for building the TidListTable shows that the efficiency of the RDBMS in manipulating CLOBs is a bottleneck.
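The itemset-level TidList idea can be sketched in Python as follows. Sets stand in for the CLOBs, and generating candidates by extending each frequent k-itemset with a single larger item is an assumption made for brevity; the paper's CountAndK-based implementation differs in detail.

```python
def im_vtid_pass2(tid_list, min_support_count):
    """Pass 2 of IM_Vtid: frequent 2-itemsets produced by a 'self-join' style nested
    loop; each frequent itemset keeps its own TidList (the materialized CLOB)."""
    items = sorted(tid_list)
    frequent = {}
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            common = tid_list[a] & tid_list[b]
            if len(common) >= min_support_count:
                frequent[(a, b)] = common
    return frequent

def im_vtid_next_pass(frequent_k, tid_list, min_support_count):
    """Pass k+1: only two TidLists are intersected per candidate - the materialized
    list of the k-itemset and the list of the single extending item."""
    frequent = {}
    for itemset, itemset_tids in frequent_k.items():
        for item, item_tids in tid_list.items():
            if item > max(itemset):          # extend with larger items only
                common = itemset_tids & item_tids
                if len(common) >= min_support_count:
                    frequent[itemset + (item,)] = common
    return frequent
```

The saving is visible in the second function: regardless of k, only two lists are intersected per candidate, instead of k+1 item-level lists as in plain Vtid.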
4.2 Improved Gather Join Approach (IM_Gjn)
In the Gjn approach, in any pass k, all the items that occur in a transaction are used for the generation of candidate itemsets of length k. In subsequent passes, the items which did not participate in the generation of frequent itemsets are not eliminated from the list of items for that transaction; there is no easy way of scanning and eliminating from the ItemList of a transaction all those items that did not participate in the formation of frequent itemsets in any pass. As there is no pruning of items, a huge number of unwanted candidate itemsets is generated in every pass. One possible way to optimize this is, in any pass k, to use the tuples of only those transactions (instead of the entire input table) which have contributed to the generation of frequent itemsets in pass k-1. For this we use an intermediate relation FComb. In any pass k, this relation contains the tuples of only those transactions whose items have contributed to the formation of frequent itemsets in pass k-1. This is done by joining the candidate itemsets table (Ck-1) with the frequent itemsets table (Fk-1). To identify the candidate itemsets that belong to the same transaction, the CombinationK stored procedure has been modified to insert the transaction id, along with the item combinations generated from the ItemList of that transaction, into the Ck
relation. In any pass k, the FComb table is thus generated and is then used by the CombinationK stored procedure (instead of the input dataset) to generate candidate itemsets of length k. Figure 9 compares the time required for mining relation T5I2D100K on Oracle when the FComb table is materialized (IM_Gjn) and used for the generation of candidate itemsets, and when the input table is used as it is (Gjn). We see that the total mining time using the FComb relation is considerably less than the total mining time using the input dataset as it is. Also, in Gjn, for different support values (0.20%, 0.15% and 0.10%) the time taken in each pass is nearly the same. This is because in Gjn there is no pruning of candidate itemsets, so irrespective of the user specified support values the entire ItemList is used for generating all the candidate itemsets of length k. Figure 10 compares the number of candidate itemsets generated for relation T5I2D100K when the input relation and when the FComb relation are used by the CombinationK stored procedure for a support value of 0.10%. From this figure, we see that in higher passes, when the input relation is used, the number of candidate itemsets is significantly larger than when the FComb relation is used, which accounts for the difference in the total time taken for mining by these two methods. Figure 11 shows the performance gained (in percentages) by using IM_Gjn over Gjn on datasets T5I2D10K, T5I2D100K, T5I2D500K and T5I2D1000K for different support values. From this figure we see that, on average, the gain for different support values is 1500% on the different datasets.
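In relational terms, FComb is produced by a join of Ck-1 with Fk-1. The following Python fragment states the same pruning rule over in-memory collections, with Ck-1 as (tid, itemset) pairs, Fk-1 as a set of frequent itemsets and the input as (tid, item) rows; this is a simplification for illustration, not the SQL actually issued by IM_Gjn.

```python
def build_fcomb(ck_minus_1, fk_minus_1, input_table):
    """FComb for pass k: keep only the (tid, item) tuples of transactions whose item
    combinations contributed to a frequent (k-1)-itemset, i.e. the join of Ck-1
    (which now carries the tid) with Fk-1."""
    contributing_tids = {tid for tid, itemset in ck_minus_1 if itemset in fk_minus_1}
    return [(tid, item) for tid, item in input_table if tid in contributing_tids]
```

Feeding only these rows to CombinationK is what shrinks Ck in the higher passes, as seen in Figure 10.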
4.3 Improved Gather Count Approach (IM_Gcnt)
As the Gcnt approach is a slight modification of the Gjn approach, the optimization suggested for Gjn can be used for this approach as well. In the Gcnt approach, the second pass uses a 2-dimensional array to count the occurrence of all item combinations of length 2, and those item combinations whose count > the user specified support value are directly inserted into the frequent itemsets relation (F2). The materialization of the candidate itemsets of length 2 (C2) at this step is skipped, and in the third pass F2 is joined with two copies of the input dataset to generate FComb, which is then used by the modified Combination3 stored procedure. For subsequent passes, materialization of the FComb relation is done in the same manner as for the IM_Gjn approach.
Fig. 9. Gjn & IM_Gjn on T5I2D100K (O)
Fig. 10. Size of Ck (Gjn & IM_Gjn)
Fig. 11. Performance Gain for IM_Gjn
Fig. 13. Gjn & Gcnt for T5I2D1000K (O)
Fig. 12. Vtid, Gjn, Gcnt on T5I2D100K (O)
Fig. 14. Performance Gain for IM_Gcnt
Figure 13 compares the mining time for the table T5I2D1000K on Oracle using the IM_Gcnt approach for different support values and also compares it with the IM_Gjn approach. This figure shows that, of the two approaches, IM_Gcnt performs better than IM_Gjn. This is because of the time saved in the second pass of the IM_Gcnt approach. For the rest of the passes, the time taken by both is almost the same, as both use the same modified CombinationK stored procedure for the generation of candidate itemsets. Thus, if memory is available for building the 2-D array, performance can be improved by counting the support in memory. Remember that the size of the array needed is of the order of n², where n is the number of distinct items in the dataset. Figure 14 shows the performance gained (in percentages) by using IM_Gcnt over Gcnt on datasets T5I2D10K, T5I2D100K, T5I2D500K and T5I2D1000K for different support values. From this figure we see that, on average, the gain for different support values is 2500% on the different datasets.
5 Summary of Results
The SQL-OR based approaches use a simple approach to candidate itemset generation and support counting. But when compared with the SQL-92 based approaches [14], they do not even come close. The time taken by the naïve SQL-OR based approaches, using stored procedures and udfs, is much more than that of the basic k-way join approach for support counting. In the SQL-OR approaches, although the use of complex data structures makes the process of mining simpler, it also makes it quite inefficient. Among the naïve SQL-OR approaches, we found that the Gather Count approach is the best while the VerticalTid approach has the worst performance. Figure 12 shows this for dataset T5I2D100K on Oracle, and Figure 5 compares the total time taken by these approaches for different datasets. Gather Count outperforms the Gather Join approach because in the second pass it uses main memory for support counting and hence skips the generation of candidate itemsets of length 2. The other optimizations (IM_Gjn and IM_Gcnt), as implemented in Oracle, avoid the usage of CLOB objects, and hence these improved versions seem very promising. The Gather Count approach, which makes use of system memory in the second pass for support counting, is an improvement over the optimization for the Gather Join approach. Figure 8 and Figure 13 show the performance of IM_Vtid, IM_Gjn and IM_Gcnt for dataset T5I2D1000K for different support values. From these figures it is clear that IM_Gcnt is the best of the three SQL-OR approaches and their optimizations discussed in this paper. We have compiled the results obtained from mining different relations into a tabular format. This can be converted into metadata and made available to the mining optimizer so that it can use these values as a cue for choosing a particular optimization for mining a given input relation.
6 Conclusion and Future Work
In the SQL-OR based approaches, if we have enough memory to build a 2-dimensional array for counting support in the second pass, then the Gather Count approach has been found to be the best of all the naïve SQL-OR based approaches. If building an in-memory 2-dimensional array is a problem, then Gather Join is a better alternative; the same applies when we have enough space to materialize intermediate relations (on disk). When the optimizations to the SQL-OR based approaches are considered, the optimized Gather Count approach (IM_Gcnt) is the best of all the optimizations. Also, in most of the cases IM_Gcnt has been found to be the best of all the approaches and their optimizations (including those for the SQL-92 based approaches).
References
[1] Agrawal, R., Imielinski, T., and Swami, A. Mining Association Rules between Sets of Items in Large Databases. In ACM SIGMOD, 1993.
[2] Agrawal, R. and Srikant, R. Fast Algorithms for Mining Association Rules. In 20th Int'l Conference on Very Large Databases (VLDB), 1994.
[3] Savasere, A., Omiecinski, E., and Navathe, S. An Efficient Algorithm for Mining Association Rules in Large Databases. In 21st Int'l Conf. on Very Large Databases (VLDB), 1995.
[4] Shenoy, P., et al. Turbo-charging Vertical Mining of Large Databases. In ACM SIGMOD, 2000.
[5] Han, J., Pei, J., and Yin, Y. Mining Frequent Patterns without Candidate Generation. In ACM SIGMOD, 2000.
[6] Houtsma, M. and Swami, A. Set-Oriented Mining for Association Rules in Relational Databases. In ICDE, 1995.
[7] Han, J., et al. DMQL: A Data Mining Query Language for Relational Databases. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1996.
[8] Meo, R., Psaila, G., and Ceri, S. A New SQL-like Operator for Mining Association Rules. In Proc. of the 22nd VLDB Conference, India, 1996.
[9] Agrawal, R. and Shim, K. Developing Tightly-Coupled Data Mining Applications on a Relational Database System. IBM Report, 1995.
[10] Sarawagi, S., Thomas, S., and Agrawal, R. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. In ACM SIGMOD, 1998.
[11] Thomas, S. Architectures and Optimizations for Integrating Data Mining Algorithms with Database Systems. CSE, University of Florida, 1998.
[12] Dudgikar, M. A Layered Optimizer for Mining Association Rules over RDBMS. CSE Department, University of Florida, Gainesville, 2000.
[13] Mishra, P. Evaluation of K-way Join and its Variants for Association Rule Mining. MS Thesis, Information and Technology Lab and CSE Department, UT Arlington, TX, 2002.
[14] Mishra, P. and Chakravarthy, S. "Performance Evaluation and Analysis of SQL-92 Approaches for Association Rule Mining". In BNCOD Proc., 2003.
A Distance-Based Approach to Find Interesting Patterns
Chen Zheng¹ and Yanfen Zhao²
¹ Department of Computer Science, National University of Singapore, 3 Science Drive 2, Singapore 117543
[email protected]
² China Construction Bank, No.142, Guping Road, Fujian, P.R. China 350003
[email protected]
Abstract. One of the major problems in knowledge discovery is producing too many trivial and uninteresting patterns. Subjective and objective measures of interestingness have been proposed to address this problem. In this paper, we propose a novel method to discover interesting patterns by incorporating the domain user's preconceived knowledge. The prior knowledge constitutes a set of hypotheses about the domain. A new parameter, called the distance, is proposed to measure the gap between the user's existing hypotheses and the system-generated knowledge. To evaluate the practicality of our approach, we apply it to some real-life data sets and present our findings.
1 Introduction
In the field of knowledge discovery in databases, most of the previous research work focuses on the validity of the discovered patterns; little consideration has been given to the interestingness problem. Among the huge number of patterns in a database, most are useless or common-sense rules, and it is difficult for domain users to identify the patterns that are interesting to them manually. To address this problem, researchers have proposed many useful and novel approaches according to their different understandings of interesting patterns. In [14], interestingness is defined as the unexpectedness of a pattern, expressed in probabilistic terms: patterns are interesting if they can affect the degree of the users' beliefs. In [5, 6], the definition of interestingness is based on the syntactic comparison between system-generated rules and beliefs. In [9], a new definition of interestingness is given in terms of logical contradiction between rule and belief. In this paper, we follow the research on subjective measures and give a new definition of interestingness in terms of the distance between the discovered knowledge and an initial set of user hypotheses. We believe that interesting knowledge is the surprising pattern, which is the deviation from the general conforming rules.
Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 299-308, 2003. © Springer-Verlag Berlin Heidelberg 2003
Thus, the further the distance between the generated rules and the user's hypotheses, the more interesting the pattern will be. To calculate the distance, we first transform the original data set into (fuzzy linguistic variable, linguistic term) pairs according to different levels of certainty. The existing hypotheses also form a set of fuzzy rules, since domain users usually have vague ideas about the domain beforehand. The distance is calculated between the hypotheses and the rules generated from the transformed data set. The rest of this paper is organized as follows. Section 2 describes related work in developing different measures of interestingness. Section 3 describes our proposed fuzzy distance measure and the methodology to find interesting patterns. Section 4 describes our implementation and presents the experimental results. Finally, Section 5 concludes our work.
2 Related Work
Generally speaking, there are two categories of interestingness measures: objective measures and subjective measures. Objective measures aim to find interesting patterns by exploring the data and its underlying structure during the discovery process. Such measures include the J-measure [13], certainty factor [2] and strength [16]. However, interestingness also depends on the users who examine the pattern, i.e., a pattern that is interesting to one group of users may not make any sense to another group. Even the same user may feel differently about the same rule as time passes. Thus, subjective measures are useful and necessary. In the field of data mining, subjective interestingness has been identified as an important problem in [3, 7, 11, 12]. The domain-specific system KEFIR [1] is one example. KEFIR uses actionability to measure interestingness and analyzes healthcare insurance claims for uncovering "key findings". In [14], probabilistic belief is used to describe subjective interestingness. In [10, 15], the authors propose two subjective measures of interestingness: unexpectedness and actionability, which means the pattern can help users do something to their advantage. Liu et al. reported a technique for rule analysis against user expectations [5]; the technique is based on syntactic comparisons between a rule and a belief, and requires the user to provide precise knowledge. However, in real-life situations it may be difficult to supply such information. In [6], Liu et al. analyze the discovered classification rules against a set of general impressions that are specified using a special representation language. The unexpected rules are defined as those that fail to conform to the general impressions. Different from the above approaches, our proposed method is domain-independent; it uses a fuzzy α-level cut to transform the original data set and then uses the generated fuzzy rules to compare with the fuzzy hypotheses. A new measurement is defined to calculate the degree of interestingness.
3 The Proposed Approach
3.1 Specifying Interesting Patterns
Let R be the set of system-generated knowledge and H be the set of user hypotheses. Our proposed method calculates the distance between R and H; the discovered
rules are classified into four sub-groups based on the distance (Section 3.2): conforming rules, similar rules, covered rules, and deviated rules. Below, we give the definition of each sub-group.
Definition 1 Conforming Rules. A discovered rule r (r ∈ R) is said to be a conforming rule w.r.t. the hypothesis h (h ∈ H) if both the antecedent and the consequent part of the two rules are exactly the same. r and h have no distance in this situation.
Definition 2 Similar Rules. A discovered rule r (r ∈ R) is said to be a similar rule w.r.t. the hypothesis h (h ∈ H) if they have similar attribute values in the antecedent part of the rules and the same consequent. We say that they are "close" to each other in this situation.
Definition 3 Covered Rules. A discovered rule r (r ∈ R) is said to be a covered rule w.r.t. the hypothesis h (h ∈ H) if the antecedent part of h is a subset of that of r, and r and h have the same attribute values in the common attributes and the same consequent part. In this situation, r can be inferred from h; they have no distance.
Definition 4 Deviated Rules. A discovered rule r (r ∈ R) is said to be a deviated rule w.r.t. the hypothesis h (h ∈ H) in three situations: (1) Same antecedent part, different consequent: r and h have the same conditions, but their class labels are different. This means r has a surprising result for the user. (2) Different antecedent part, same consequent: r and h have different conditions and the same class label, but r is not covered by h. This means r has a surprising reason for the user. (3) Different antecedent part, different consequent: r and h have different class labels as well as different conditions. Difference means they can differ in attribute values, attribute names or both.
Among these four sub-groups, since some knowledge and interests are embedded in the users' expected hypotheses, whether these patterns are interesting or not depends on the degree to which the system-generated knowledge and the users' hypotheses are apart from each other. The rules that are far apart from the hypotheses surprise the user, contradict user expectations and trigger the user to investigate further, i.e., they are more interesting than the trivial rules, which are common sense, similar to, or derivable from the hypotheses.
3.2 Measures of Interestingness
We use the distance measure to identify interesting patterns. The computation of the distance between a rule ri in the system-generated knowledge base R and the hypotheses consists of three steps: (1) calculate the attribute distance; (2) calculate the tuple distance; (3) calculate the rule distance.
3.2.1 Attribute Distance
The value of the attribute distance varies from 0 to 1, where 1 represents complete difference and 0 represents no difference. The higher the value, the greater the difference between rule and hypothesis in the attribute comparison. Suppose an attribute K in rule r has a value r.k and a hypothesis h is given; distk(r, h) denotes the distance between r and h in attribute K. We consider the following factors during attribute comparison: attribute type, attribute name, attribute value, and class label difference.
Discrete attribute value distance: The distance between discrete attribute values is either 0 or 1 – "0" for the same attribute value, "1" for different attribute values.
Continuous attribute value distance: The distance between continuous attribute values is calculated as follows. Since we have changed the original data tuples into (linguistic variable, linguistic term) pairs, we assign an ordered list {l1, l2, ..., li} to the
linguistic term set, where l1 < l2 < ... < li, lj ∈ [0,1] and j ∈ [1, i]. The distance between linguistic terms termj and termk of a continuous attribute is |lj − lk|. For example, given the term set (short, middle, tall) of the linguistic variable "height", we can assign the list [0, 0.5, 1]; then the distance between short and middle is 0.5.
Attribute name distance: Suppose the antecedent parts of rule r and hypothesis h are r.ant and h.ant respectively. The set of attributes common to both the antecedent part of r and h is denoted IS (for intersection), i.e. IS(r, h) = r.ant ∩ h.ant. Let |r.ant| be the number of attributes in r.ant. The distance
between the attribute names in r and h (denoted distname(r, h)) is computed as follows:

distname(r, h) = (|r.ant| − |IS(r, h)|) / (|r.ant| + |h.ant|)    (1)
Class distance: For classification rule generation, the class attribute name of each tuple is the same. The distance between the class attribute values in r and h (denoted distclass(r, h)) is either 0 or 1. Since the class attribute is an important attribute, we use the maximum attribute weight for it, i.e. wmax = max(w1, w2, ..., wn), given n attributes (excluding the class attribute) in the original data set.
3.2.2 Tuple Distance
The tuple distance is computed from attribute distance and attribute weight. We introduce the concept of attribute weight to indicate the relative importance of some
attributes during the calculation of the tuple distance between r and h. For example, in credit card risk analysis, we may consider the "salary" attribute more important than the "sex" attribute, so that it contributes more to the interestingness of the rule. We define a data set with attributes attr1, attr2, ..., attrn and attribute weights w1, w2, ..., wn respectively. The attribute weights are given by the users, and the sum of all attribute weights is 1. Given a rule r and a hypothesis h, let dist1(r, h), dist2(r, h), ..., distn(r, h) be the attribute value distances between r and h. Simply using a syntactic comparison between r and h cannot distinguish the covered rules, which are redundant. For example, suppose we have the rule r: age=young, sex=male, salary=low, occupation=student → risk=high and h: salary=low → risk=high; although they have different attribute names and attribute values, r is covered by h. This means r is not a surprise to us if we already know h. So the tuple distance between r and h is defined according to two situations; the top part of formula (2) is used to calculate the distance between covered rules and the hypothesis:

d(r, h) = distclass(r, h) × wmax,   if distk(r, h) = 0 for all k ∈ IS(r, h)

d(r, h) = distclass(r, h) × wmax + ( Σk∈IS(r,h) distk(r, h) × wk ) / |IS(r, h)| + distname(r, h),   otherwise    (2)
3.2.3 Rule Distance
Finally, we calculate the average tuple distance between rule ri and the set of existing user hypotheses. Let R and H be the system-generated knowledge base and the existing hypotheses respectively, and let |H| denote the size of H. Given a rule ri ∈ R, the distance between ri and H (denoted Di) is defined as follows:

Di = ( Σj=1..|H| d(ri, hj) ) / |H|    (3)
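A direct transcription of equations (1)–(3) into Python might look as follows. The rule representation (antecedent as a dict, separate class label) and the per-attribute distance callback are assumptions made so that the sketch is self-contained; they are not prescribed by the paper.

```python
def tuple_distance(r_ant, r_class, h_ant, h_class, weights, attr_dist):
    """Equation (2). r_ant/h_ant map attribute names to values, weights maps attribute
    names to weights (class attribute excluded), and attr_dist(a, x, y) is an assumed
    per-attribute distance in [0, 1] (0/1 for discrete values, |lj - lk| for terms)."""
    w_max = max(weights.values())                       # class attribute gets the maximum weight
    common = set(r_ant) & set(h_ant)                    # IS(r, h)
    dist_class = 0.0 if r_class == h_class else 1.0
    if common and all(attr_dist(a, r_ant[a], h_ant[a]) == 0 for a in common):
        return dist_class * w_max                       # covered rule: top part of (2)
    dist_name = (len(r_ant) - len(common)) / (len(r_ant) + len(h_ant))      # equation (1)
    avg_attr = (sum(attr_dist(a, r_ant[a], h_ant[a]) * weights[a] for a in common)
                / len(common)) if common else 0.0       # guard against an empty IS(r, h)
    return dist_class * w_max + avg_attr + dist_name

def rule_distance(rule, hypotheses, weights, attr_dist):
    """Equation (3): average tuple distance of a rule (antecedent dict, class label)
    to the set of hypotheses, each given in the same representation."""
    return sum(tuple_distance(rule[0], rule[1], h[0], h[1], weights, attr_dist)
               for h in hypotheses) / len(hypotheses)
```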
3.3 Discovery Strategy
The key idea of our proposed approach is to use a fuzzy α-cut to transform the original data set and generate fuzzy certainty rules from the transformed data. On the other hand, the user's imprecise prior knowledge is expressed as a set of fuzzy rules, so the distance is calculated by comparing rules in the same format. Let us first review the definition of some fuzzy terms: linguistic variable, linguistic terms, degree of membership, α-level cut, and fuzzy certainty rules. According to the definition given by Zimmermann, a fuzzy linguistic variable is a quintuple (x, T(x), U, G, M̃) in which x is the name of the variable; T(x) denotes the term set of x, that is, the set of names
of linguistic values of x, with each value being a fuzzy variable denoted generically by X and ranging over a universe of discourse U; G is a syntactic rule for generating the names X of values of x; and M̃ is a semantic rule for associating with each value X its meaning M̃(X), which is a fuzzy subset of U. A particular X, that is, a name generated by G, is called a term [18]. For example, given the linguistic variable "age", the term set T(x) could be "very young", "young", "middle-age", "old". The base variable u is the age in years of life, and µF(u) is interpreted as the degree of membership of u in the fuzzy set F. M̃(X) assigns a meaning to the fuzzy terms. For example, M̃(old) can be defined as M̃(old) = {(u, µold(u)) | u ∈ [0,120]}, where µold(u) denotes the membership function of u in the term old: µold(u) equals 0 when u belongs to [0, 70] and equals (1 + ((u − 70)/5)⁻²)⁻¹ when u belongs to [70, 120]. Given a certainty level α
(α ∈ [0,1]), we can define the set Fα [17] as Fα = {u ∈ U | µF(u) ≥ α}. Fα is called the α-level cut, which contains all the elements of U that are compatible with the fuzzy set F above the level α. The syntax of the fuzzy certainty rule A → B is "If X is A, then Y is B with certainty α", where A and B are fuzzy sets. Compared with traditional classification rules, our method uses the fuzzy α-cut concept to generate certainty rules. Now we present an overview of our approach. It consists of the four phases below:
Step 1. Given a certainty level α, transform the continuous attributes in the original dataset into (linguistic variable, linguistic term) pairs according to the α-level cut and keep the categorical attributes unchanged. This yields the transformed dataset T.
Step 2. Generate the fuzzy certainty rules R based on T, compare them with the hypotheses given by the users, and calculate the distance according to the formulas given in Section 3.2.
Step 3. Sort the fuzzy certainty rules according to the distance and choose the fuzzy rules with distance larger than the threshold δ.
Step 4. Check the α-level cut of the linguistic terms and defuzzify the fuzzy certainty rules into crisp if-then rules.
Given the original data set D, each tuple d belongs to D. For each continuous attribute Ai in D, we first specify the linguistic term set Lik for Ai given K linguistic terms; then we generate the membership of Ai in d for every element Lij belonging to Lik according to user specification or the methods in [8]. After that, a certainty level α is given and we construct the α-cut (denoted Lijα) of the linguistic term Lij. If the value of Ai in tuple d falls into Lijα, we say Lij is a possible linguistic
term. The original tuple d is split and inserted into the transformed data set T according to the combinations of the possible linguistic terms of the different attributes. Then a traditional data mining tool, for example [4], is applied to the transformed data set T and used to generate the fuzzy certainty rules. The next step is to use the formulas in Section 3.2 to calculate the distance Di between each rule ri (ri ∈ R) and all the hypothesis rules h belonging to H. We specify a distance threshold δ to identify those interesting rules r*, which have Di greater than δ. The user then chooses the explainable interesting rules from r* and updates the hypothesis rule base H. Finally, we check the α-level cut of each linguistic term Lijα, return the data points belonging to Lijα, and defuzzify the fuzzy certainty rules into crisp if-then rules. Similarly, we can generate rules at different certainty levels and compare them with the hypotheses.
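Step 1 of the strategy can be sketched as follows; the representation of the membership functions as a dict per attribute and the function names are assumptions of this sketch, not part of the paper's implementation.

```python
from itertools import product

def alpha_cut_terms(value, term_memberships, alpha):
    """All linguistic terms whose membership degree for `value` is at least alpha,
    i.e. the terms whose alpha-level cut contains the value."""
    return [term for term, mu in term_memberships.items() if mu(value) >= alpha]

def transform_tuple(row, fuzzy_attrs, alpha):
    """Step 1 of the discovery strategy: replace every continuous attribute by its
    possible (linguistic variable, linguistic term) pairs and split the tuple once per
    combination; categorical attributes are kept unchanged. `fuzzy_attrs` maps an
    attribute name to {term: membership function} and is an assumed input format."""
    per_attribute_choices = []
    for attr, value in row.items():
        if attr in fuzzy_attrs:
            terms = alpha_cut_terms(value, fuzzy_attrs[attr], alpha)
            per_attribute_choices.append([(attr, term) for term in terms])
        else:
            per_attribute_choices.append([(attr, value)])
    return [dict(combo) for combo in product(*per_attribute_choices)]
```

For instance, a tuple whose age lies in the α-cuts of two terms would be split into two transformed tuples, one per surviving term, while its categorical attributes are copied unchanged.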
4 Experiments
To evaluate the effectiveness of our approach, we implemented it on the "credit card risk analysis" system of China Construction Bank, which is the third biggest bank of China. We have four years of historical customer credit card information. The bank users knew that the number of malicious overdraft cases had steadily gone up over the years. In order to decide on suitable actions, they were interested in whether some specific groups of people were responsible for this or whether such cases happened randomly. In particular, they wanted to know the unexpected patterns among thousands of rules. We let the users input ten to twenty hypotheses; after generating the fuzzy rules, we used the Oracle 8 database system to store the discovered rules and the user hypotheses, and the system runs on Windows 2000. The experiments on the database were performed on a 500 MHz Pentium II machine with 128 MB memory. Table 1 gives a summary of the number of conforming and interesting rules discovered at some certainty levels (minimum support is 1%, minimum confidence is 50% and δ is 0.5). Column "Cert." shows different certainty levels. Column "Rul." shows the number of fuzzy certainty rules. Column "Conf." shows the number of conforming, similar and covered rules. Column "#num" shows the number of interesting rules. Column "#expl." shows the number of explainable interesting rules. Column "false" shows the number of interesting rules that are not surprising to the user. After we showed the results to our users, they all agreed that our fuzzy rules are concise and intuitive compared to CBA crisp rules, and the hypotheses were verified by the conforming rules generated. On the other hand, part of the interesting rules are explainable, and users showed great interest in investigating the unexpected rules; a few rules which are not interesting were mis-identified because they have statistical significance (large support, confidence and distance values). Figure 1 shows different minimum support thresholds on the X-axis for the sub_99 data. Figure 2 shows the total execution time with respect to different data sizes sampled from the year 2000 data, which
contains 270000 tuples. "50k" means we sample 50000 tuples to perform our task. The legend in the right corner specifies different certainty levels.
Table 1. Results of conforming and deviated rule mining
Data    #attr  #rec   Cert.  Rul.  Conf.  #num  #expl.  false
Sub_98  27     4600   0.7    365   232    133   57      9
Sub_98  27     4600   0.5    448   291    157   62      5
Sub_99  31     12031  0.8    511   397    114   79      13
Sub_99  31     12031  0.4    821   693    128   87      25
Sub_00  28     13167  0.8    773   487    286   53      32
Sub_00  28     13167  0.6    1015  688    327   78      11
Sub_01  28     17210  0.7    776   559    217   65      17
Sub_01  28     17210  0.9    341   220    121   45      3
(#num, #expl. and false together form the "Interesting Rules" group of columns.)

Fig. 1. Total execution time (sec.) with respect to different minsup values (0.5% to 3.0%) at certainty levels 0.6 to 0.9
Fig. 2. Total execution time (sec.) with respect to different data sizes (50k to 250k tuples) at certainty levels 0.6 to 0.9
5 Conclusion
This paper proposes a novel domain-independent approach to help domain users find conforming rules and identify interesting patterns. The system transforms the data set based on different levels of belief, and the generated fuzzy certainty rules are compared with the imprecise knowledge given by the users in the same format. The distance measure considers both semantic and syntactic factors during the comparison. Since users have different background knowledge, interests and hypotheses, our new approach is flexible enough to satisfy their needs. In future work, we will carry out more experiments and compare our algorithm with other methods of computing interestingness.
References
[1] Matheus, C. J., Piatetsky-Shapiro, G., and McNeil, D. An application of KEFIR to the analysis of healthcare information. In Proceedings of the AAAI-94 Workshop on Knowledge Discovery in Databases, 1994.
[2] Hong, J. and Mao, C. Incremental discovery of rules and structure by hierarchical and parallel clustering. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
[3] Klemettinen, M., Mannila, H., et al. Finding interesting rules from large sets of discovered association rules. In Proceedings of the Third International Conference on Information and Knowledge Management, 401-407, 1994.
[4] Liu, B., et al. Integrating Classification and Association Rule Mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 80-86, 1998.
[5] Liu, B. and Hsu, W. Post-Analysis of Learned Rules. In Proc. of the Thirteenth National Conf. on Artificial Intelligence (AAAI'96), 828-834, 1996.
[6] Liu, B., Hsu, W., and Chen, S. Using General Impressions to Analyze Discovered Classification Rules. In Proc. of the Third Intl. Conf. on Knowledge Discovery and Data Mining, 31-36, 1997.
[7] Major, J. and Mangano, J. Selecting among rules induced from a hurricane database. KDD-93, 28-41, 1993.
[8] Kaya, M., et al. Efficient Automated Mining of Fuzzy Association Rules. DEXA, 133-142, 2002.
[9] Padmanabhan, B. and Tuzhilin, A. On the Discovery of Unexpected Rules in Data Mining Applications. In Procs. of the Workshop on Information Technology and Systems, 81-90, 1997.
[10] Padmanabhan, B. and Tuzhilin, A. A belief-driven method for discovering unexpected patterns. In Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining, 27-31, 1998.
[11] Piatetsky-Shapiro, G. and Matheus, C. The interestingness of deviations. KDD-94, 25-36, 1994.
[12] Piatetsky-Shapiro, G., Matheus, C., Smyth, P., and Uthurusamy, R. KDD-93: progress and challenges ..., AI Magazine, Fall, 77-87, 1994.
[13] Smyth, P. and Goodman, R. M. Rule induction using information theory. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
[14] Silberschatz, A. and Tuzhilin, A. On Subjective Measures of Interestingness in Knowledge Discovery. In Proc. of the First International Conference on Knowledge Discovery and Data Mining, 275-281, 1995.
[15] Silberschatz, A. and Tuzhilin, A. What Makes Patterns Interesting in Knowledge Discovery Systems. IEEE Trans. on Knowledge and Data Engineering, Special Issue on Data Mining, 5(6), 970-974, 1996.
[16] Dhar, V. and Tuzhilin, A. Abstract-driven pattern discovery in databases. IEEE Transactions on Knowledge and Data Engineering, 5(6), 1993.
[17] Zadeh, L. A. Similarity relations and fuzzy orderings. Inf. Sci., 3, 159-176, 1971.
[18] Zimmermann, H. J. Fuzzy set theory and its applications. Kluwer Academic Publishers, 1991.
Similarity Search in Structured Data
Hans-Peter Kriegel and Stefan Schönauer
University of Munich, Institute for Computer Science
{kriegel, schoenauer}@informatik.uni-muenchen.de
Abstract. Recently, structured data has become more and more important in database applications, such as molecular biology, image retrieval or XML document retrieval. Attributed graphs are a natural model for the structured data in those applications. For the clustering and classification of such structured data, a similarity measure for attributed graphs is necessary. All known similarity measures for attributed graphs are either limited to a special type of graph or computationally extremely complex, i.e. NP-complete, and are therefore unsuitable for data mining in large databases. In this paper, we present a new similarity measure for attributed graphs, called the matching distance. We demonstrate how the matching distance can be used for efficient similarity search in attributed graphs. Furthermore, we propose a filter-refinement architecture and an accompanying set of filter methods to reduce the number of necessary distance calculations during similarity search. Our experiments show that the matching distance is a meaningful similarity measure for attributed graphs and that it enables efficient clustering of structured data.
1 Introduction
Modern database applications, like molecular biology, image retrieval or XML document retrieval, are mainly based on complex structured objects. Those objects have an internal structure that is usually modeled using graphs or trees, which are then enriched with attribute information (cf. Figure 1). In addition to the data objects, those modern database applications can also be characterized by their most important operations, which extract new knowledge from the database – in other words, data mining. The data mining tasks in this context require some notion of similarity or dissimilarity of objects in the database. A common approach is to extract a vector of features from the database objects and then use the Euclidean distance or some other Lp-norm between those feature vectors as the similarity measure. But often this results in very high-dimensional feature vectors, which even index structures for high-dimensional feature vectors, like the X-tree [1], the IQ-tree [2] or the VA-file [3], can no longer handle efficiently due to a number of effects usually described by the term 'curse of dimensionality'.
Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 309-319, 2003. © Springer-Verlag Berlin Heidelberg 2003
Fig. 1. Examples of attributed graphs: an image together with its graph and the graph of a molecule.
Especially for graph-modeled data, the additional problem arises of how to include the structural information in the feature vector. As the structure of a graph cannot be modeled by a low-dimensional feature vector, the dimensionality problem gets even worse. A way out of this dilemma is to define similarity directly for attributed graphs. Consequently, there is a strong need for similarity measures for attributed graphs. Several approaches to this problem have been proposed in recent time. Unfortunately, all of them have certain drawbacks, like being restricted to special graph types or having NP-complete time complexity, which makes them unusable for data mining applications. Therefore, we present a new similarity measure for attributed graphs, called the edge matching distance, which is not restricted to special graph types and can be evaluated efficiently. Additionally, we propose a filter-refinement architecture for efficient query processing and provide a set of filter methods for the edge matching distance. The paper is organized as follows: in the next section, we describe the existing similarity measures for attributed graphs and discuss their strengths and weaknesses. The edge matching distance and its properties are presented in Section 3, before the query architecture and the filter methods are introduced in Section 4. In Section 5, the effectiveness and efficiency of our methods are demonstrated in experiments with real data from the domain of image retrieval, before we finish with a short conclusion.
2 Related Work
As graphs are a very general object model, graph similarity has been studied in many fields. Similarity measures for graphs have been used in systems for shape retrieval [4], object recognition [5] and face recognition [6]. All of those measures exploit graph features specific to the graphs in the application in order to define graph similarity. Examples of such features are a given one-to-one mapping between the vertices of different graphs or the requirement that all graphs are of the same order. A very common similarity measure for graphs is the edit distance. It uses the same principle as the well-known edit distance for strings [7, 8]. The idea is to determine the minimal number of insertions and deletions of vertices and edges
to make the compared graphs isomorphic. In [9] Sanfeliu and Fu extended this principle to attributed graphs by introducing vertex relabeling as a third basic operation besides insertions and deletions. In [10] this measure is used for data mining in a graph. Unfortunately, the edit distance is a very time-complex measure. Zhang, Statman and Shasha proved in [11] that the edit distance for unordered labeled trees is NP-complete. Consequently, in [12] a restricted edit distance for connected acyclic graphs, i.e. trees, was introduced. Papadopoulos and Manolopoulos presented another similarity measure for graphs in [13]. Their measure is based on histograms of the degree sequences of graphs and can be computed in linear time, but it does not take the attribute information of vertices and edges into account. In the field of image retrieval, the similarity of attributed graphs is sometimes described as an assignment problem [14], where the similarity distance between two graphs is defined as the minimal cost for mapping the vertices of one graph to those of the other graph. With an appropriate cost function for the assignment of vertices, this measure takes the vertex attributes into account and can be evaluated in polynomial time. This assignment measure, which we will call the vertex matching distance in the rest of the paper, obviously completely ignores the structure of the graphs, i.e. they are just treated as sets of vertices.
3 The Edge Matching Distance
As we have just described, all the known similarity measures for attributed graphs have certain drawbacks. Starting from the edit distance and the vertex matching distance, we propose a new method to measure the similarity of attributed graphs. This method solves the problems mentioned above and is useful in the context of large databases of structured objects.
3.1 Similarity of Structured Data
The similarity of attributed graphs has several major aspects. The first one is the structural similarity of the graphs and the second one is the similarity of the attributes. Additionally, the weighting of the two just mentioned aspects is significant, because it is highly application dependent to what extent the structural similarity determines the object similarity and to what extent the attribute similarity has to be considered. With the edit distance between attributed graphs there exists a similarity measure that fulfills all those conditions. Unfortunately, the computational complexity of this measure is too high to use it for clustering databases of arbitrary size. The vertex matching distance, on the other hand, can be evaluated in polynomial time, but this similarity measure does not take the structural relationships between the vertices into account, which results in a too coarse model for the similarity of attributed graphs.
Fig. 2. An example of an edge matching between the graphs G1 and G2 .
For our similarity measure, called the edge matching distance, we also rely on the principle of graph matching. But instead of matching the vertices of two graphs, we propose a cost function for the matching of edges and then derive a minimal weight maximal matching between the edge sets of two graphs. This way not only the attribute distribution but also the structural relationships of the vertices are taken into account. Figure 2 illustrates the idea behind our measure, while the formal definition of the edge matching distance is as follows:
Definition 1 (edge matching, edge matching distance). Let G1(V1, E1) and G2(V2, E2) be two attributed graphs. Without loss of generality, we assume that |E1| ≥ |E2|. The complete bipartite graph Gem(Vem = E1 ∪ E2 ∪ ∆, E1 × (E2 ∪ ∆)), where ∆ represents an empty dummy edge, is called the edge matching graph of G1 and G2. An edge matching between G1 and G2 is defined as a maximal matching in Gem. Let there be a non-negative metric cost function c : E1 × (E2 ∪ ∆) → IR0+. We define the matching distance between G1 and G2, denoted by dmatch(G1, G2), as the cost of the minimum-weight edge matching between G1 and G2 with respect to the cost function c.
Through the use of an appropriate cost function, it is possible to adapt the edge matching distance to the particular application needs. This implies how individual attributes are weighted or how the structural similarity is weighted relative to the attribute similarity.
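A minimum-weight edge matching of this kind can be computed with a standard assignment solver. The sketch below uses SciPy's linear_sum_assignment (an implementation of the Kuhn/Munkres method referred to in Section 3.2) and pads the smaller edge set with dummy edges; the two cost functions are placeholders to be supplied by the application, not part of the paper's definition.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edge_matching_distance(edges1, edges2, edge_cost, dummy_cost):
    """Minimum-weight edge matching between two attributed graphs given as edge lists.
    edge_cost(e1, e2) is the metric cost function c, and dummy_cost(e) is the cost
    c(e, Delta) of matching an edge with the empty dummy edge."""
    if len(edges1) < len(edges2):                 # w.l.o.g. |E1| >= |E2|
        edges1, edges2 = edges2, edges1
    n, m = len(edges1), len(edges2)
    cost = np.empty((n, n))
    for i, e1 in enumerate(edges1):
        for j in range(n):                        # columns m..n-1 are dummy edges
            cost[i, j] = edge_cost(e1, edges2[j]) if j < m else dummy_cost(e1)
    rows, cols = linear_sum_assignment(cost)      # Kuhn/Munkres ("Hungarian") method
    return cost[rows, cols].sum()
```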
3.2 Properties of the Edge Matching Distance
In order to use the edge matching distance for the clustering of attributed graphs, we need to investigate a few of the properties of this measure. The time complexity of the measure is of great importance for the applicability of the measure in data mining applications. Additionally, the proof of the following theorem also provides an algorithm for how the matching distance can be computed efficiently.
Theorem 1. The matching distance can be calculated in O(n³) time in the worst case.
Proof. To calculate the matching distance between two attributed graphs G1 and G2, a minimum-weight edge matching between the two graphs has to be determined. This is equivalent to determining a minimum-weight maximal matching
in the edge matching graph of G1 and G2. To achieve this, the method of Kuhn [15] and Munkres [16] can be used. This algorithm, also known as the Hungarian method, has a worst case complexity of O(n³), where n is the number of edges in the larger one of the two graphs.
Apart from the complexity of the edge matching distance itself, it is also important that there are efficient search algorithms and index structures to support the use in large databases. In the context of similarity search two query types are most important, which are range queries and (k-)nearest-neighbor queries. Especially for k-nearest-neighbor search, Roussopoulos, Kelley and Vincent [17] and Hjaltason and Samet [18] proposed efficient algorithms. Both of these require that the similarity measure is a metric. Additionally, those algorithms rely on an index structure for the metric objects, such as the M-tree [19]. Therefore, the following theorem is of great importance for the practical application of the edge matching distance.
Theorem 2. The edge matching distance for attributed graphs is a metric.
Proof. To show that the edge matching distance is a metric, we have to prove the three metric properties for this similarity measure.
1. dmatch(G1, G2) ≥ 0: The edge matching distance between two graphs is the sum of the costs of the individual edge matchings. As the cost function is non-negative, any sum of cost values is also non-negative.
2. dmatch(G1, G2) = dmatch(G2, G1): The minimum-weight maximal matching in a bipartite graph is symmetric if the edges in the bipartite graph are undirected. This is equivalent to the cost function being symmetric. As the cost function is a metric, the cost for matching two edges is symmetric. Therefore, the edge matching distance is symmetric.
3. dmatch(G1, G3) ≤ dmatch(G1, G2) + dmatch(G2, G3): As the cost function is a metric, the triangle inequality holds for each triple of edges in G1, G2 and G3 and for those edges that are mapped to an empty edge. The edge matching distance is the sum of the costs of the matchings of individual edges. Therefore, the triangle inequality also holds for the edge matching distance.
Definition 1 does not require that the two graphs are isomorphic in order to have a matching distance of zero. But the matching of the edges together with an appropriate cost function ensures that graphs with a matching distance of zero have a very high structural similarity. Even if the application requires that only isomorphic graphs are considered identical, the matching distance is still of great use. The following lemma allows us to use the matching distance between two graphs as a filter for the edit distance in a filter-refinement architecture, as will be described in Section 4.1. This way, the number of expensive edit distance calculations during query processing can be greatly reduced.
Lemma 1. Given a cost function for the edge matching which is always less than or equal to the cost for editing an edge, the matching distance between attributed graphs is a lower bound for the edit distance between attributed graphs:

∀G1, G2: dmatch(G1, G2) ≤ dED(G1, G2)

Proof. The edit distance between two graphs is the number of edit operations necessary to make those graphs isomorphic. To be isomorphic, the two graphs have to have identical edge sets and identical vertex sets. As the cost function for the edge matching distance is always less than or equal to the cost of transforming two edges into each other through an edit operation, the edge matching distance is a lower bound for the number of edit operations necessary to make the two edge sets identical. As the cost of making the vertex sets identical is not covered by the edge matching distance, it follows that the edge matching distance is a lower bound for the edit distance between attributed graphs.
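To make the computation in the proof of Theorem 1 concrete, the following sketch computes the edge matching distance with an off-the-shelf Hungarian-method solver. It is an illustration rather than the authors' implementation; the cost function edge_cost, the edge lists and the mismatch penalty w_mismatch are assumed inputs.

    import numpy as np
    from scipy.optimize import linear_sum_assignment  # Kuhn/Munkres (Hungarian method), O(n^3)

    def edge_matching_distance(edges1, edges2, edge_cost, w_mismatch):
        # Minimum-weight maximal matching between two edge sets. Edges of the
        # larger graph that remain unmatched are assigned to the empty edge
        # Delta at cost w_mismatch, as in the definition above.
        n = max(len(edges1), len(edges2))
        cost = np.full((n, n), float(w_mismatch))   # padding rows/columns model Delta
        for i, e1 in enumerate(edges1):
            for j, e2 in enumerate(edges2):
                cost[i, j] = edge_cost(e1, e2)
        rows, cols = linear_sum_assignment(cost)    # minimum-weight assignment
        return cost[rows, cols].sum()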
4 Efficient Query Processing Using the Edge Matching Distance
While the edge matching distance already has polynomial time complexity, as compared to the exponential time complexity of the edit distance, a matching distance calculation is still a complex operation. Therefore, it makes sense to try to reduce the number of distance calculations during query processing. This goal can be achieved by using a filter-refinement architecture.

4.1 Multi-Step Query Processing
Query processing in a filter-refinement architecture is performed in two or more steps, where the first steps are filter steps that return a number of candidate objects from the database. For those candidate objects, the exact similarity distance is determined in the refinement step, and the objects fulfilling the query predicate are reported. To reduce the overall search time, the filter steps have to be easy to perform, and a substantial part of the database objects has to be filtered out. Additionally, the completeness of the filter step is essential, i.e., there must be no false drops during the filter steps. Available similarity search algorithms guarantee completeness if the distance function in the filter step fulfills the lower-bounding property. This means that the filter distance between two objects must always be less than or equal to their exact distance. Using a multi-step query architecture requires efficient algorithms which actually make use of the filter step. Agrawal, Faloutsos and Swami proposed such an algorithm for range search [20]. In [21] and [22], multi-step algorithms for k-nearest-neighbor search were presented which are optimal in the number of exact distance calculations necessary during query processing. Therefore, we employ the latter algorithms in our experiments.
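As an illustration of the filter-refinement idea (not the paper's code), a range query could be organized as follows; filter_dist is assumed to be a cheap, lower-bounding distance and exact_dist the edge matching distance.

    def range_query(database, query, eps, filter_dist, exact_dist):
        # Filter step: the lower-bounding property guarantees that no true
        # result is dropped here (completeness, no false drops).
        candidates = [obj for obj in database if filter_dist(query, obj) <= eps]
        # Refinement step: the expensive exact distance is computed only for
        # the candidates that survived the filter.
        return [obj for obj in candidates if exact_dist(query, obj) <= eps]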
4.2 A Filter for the Edge Matching Distance
To employ a filter-refinement architecture, we need filters for the edge matching distance which cover the structural as well as the attribute properties of the graphs in order to be effective. A way to derive a filter for a similarity measure is to approximate the database objects and then determine the similarity of those approximations. As an approximation of the structure of a graph G we use the size of that graph, denoted by s(G), i.e., the number of edges in the graph. We define the following similarity measure for our structural approximation of attributed graphs:

dstruct(G1, G2) = |s(G1) − s(G2)| · wmismatch

Here wmismatch is the cost for matching an edge with the empty edge ∆. When the edge matching distance between two graphs is determined, all edges of the larger graph which are not mapped onto an edge of the smaller graph are mapped onto an empty dummy edge ∆. Therefore, the above measure fulfills the lower-bounding property, i.e., ∀G1, G2: dstruct(G1, G2) ≤ dmatch(G1, G2).

Our filters for the attribute part of graphs are based on the observation that the difference between the attribute distributions of two graphs influences their edge matching distance. This is due to the fact that, during the distance calculation, edges of the two graphs are assigned to each other. Consequently, the more edges with the same attribute values the two graphs have, i.e., the more similar their attribute value distributions are, the smaller their edge matching distance. Obviously, it is too complex to determine the exact difference of the attribute distributions of two graphs in order to use it as a filter; an approximation of those distributions is therefore needed.

We propose a filter for the attribute part of graphs which exploits the fact that |x − y| ≥ ||x| − |y||. For attributes which are associated with edges, we add up all the absolute values of an attribute in a graph. For two graphs G1 and G2 with s(G1) = s(G2), the difference between those sums, denoted by da(G1, G2), is the minimum total difference between G1 and G2 for the respective attribute. Weighted appropriately according to the cost function that is used, this is a lower bound for the edge matching distance. For graphs of different size, this is no longer true, as an edge causing the attribute difference could also be assigned to an empty edge. Therefore, the difference in size of the graphs, multiplied with the maximum cost for this attribute, has to be subtracted from da(G1, G2) in order to obtain a lower bound in all cases.

When considering attributes that are associated with vertices in the graphs, we have to take into account that during the distance calculation a vertex v is compared with several vertices of the second graph, namely exactly degree(v) many vertices. To take care of this effect, the absolute attribute value of a vertex attribute has to be multiplied with the degree of the vertex which carries this attribute value, before the attribute values are added up in the same manner as for edge attributes. Obviously, the appropriately weighted size difference has to be subtracted in order to achieve a lower-bounding filter value for a vertex attribute.
Fig. 3. Result of a 10-nearest-neighbor query for the pictograph dataset. The query object is shown on top, the result for the vertex matching distance is in the middle row and the result for the edge matching distance is in the bottom row.
With the above methods it is ensured that the sum of the structural filter distance plus all attribute filter distances is still a lower bound for the edge matching distance between two graphs. Furthermore, it is possible to precompute the structural and all attribute filter values and store them in a single vector. This supports efficient filtering during query processing.
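One plausible way to precompute such a filter vector and to evaluate the combined lower bound is sketched below. The dictionary-based graph representation, the attribute names and the per-attribute maximal costs are assumptions made for the illustration, and the attribute weights implied by the cost function are omitted for brevity.

    def filter_vector(graph, edge_attrs, vertex_attrs):
        vec = {"size": len(graph["edges"])}                      # structural part: s(G)
        for a in edge_attrs:                                     # edge attributes: sum of |values|
            vec[a] = sum(abs(e[a]) for e in graph["edges"])
        for a in vertex_attrs:                                   # vertex attributes: weight by degree(v)
            vec[a] = sum(abs(v[a]) * v["degree"] for v in graph["vertices"])
        return vec

    def filter_distance(v1, v2, w_mismatch, max_attr_cost):
        size_diff = abs(v1["size"] - v2["size"])
        dist = size_diff * w_mismatch                            # structural lower bound d_struct
        for a, c_max in max_attr_cost.items():
            # subtract the size difference times the maximal cost for this
            # attribute so the term stays lower bounding for graphs of
            # different size, as described above
            dist += max(0.0, abs(v1[a] - v2[a]) - size_diff * c_max)
        return dist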
5 Experimental Evaluation
To evaluate our new methods, we chose an image retrieval application and ran tests on two real-world data sets:
– 705 black-and-white pictographs
– 9818 full-color TV images
To extract graphs from the images, they were segmented with a region-growing technique, and neighboring segments were connected by edges to represent the neighborhood relationship. Each segment was assigned four attribute values: the size, the height and width of the bounding box, and the color of the segment. The values of the first three attributes were expressed as a percentage relative to the image size, height and width in order to make the measure invariant to scaling. We implemented all methods in Java 1.4 and performed our tests on a workstation with a 2.4 GHz Xeon processor and 4 GB RAM. To calculate the cost for matching two edges, we sum, over all attributes, the differences between the attribute values of the corresponding terminal vertices of the two edges, each divided by the maximal possible difference for the respective attribute. This way, relatively small differences in the attribute values of the vertices result in a small matching cost for the compared edges. The cost for matching an edge with an empty edge is equal to the maximal cost for matching two edges. This results in a cost function which fulfills the metric properties.
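In code, the cost function just described might look as follows. The attribute names, the value ranges used for normalization and the pairing of the terminal vertices are assumptions for the sake of illustration, not the authors' actual Java implementation.

    MAX_DIFF = {"size": 1.0, "height": 1.0, "width": 1.0, "color": 1.0}   # assumed value ranges

    def edge_cost(e1, e2):
        # sum of normalized attribute differences of the corresponding terminal vertices
        cost = 0.0
        for v1, v2 in zip(e1["vertices"], e2["vertices"]):
            for attr, max_diff in MAX_DIFF.items():
                cost += abs(v1[attr] - v2[attr]) / max_diff
        return cost

    # Matching an edge with the empty edge costs as much as the maximal
    # possible edge matching cost, which keeps the distance a metric.
    W_MISMATCH = 2 * len(MAX_DIFF)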
Fig. 4. A cluster of portraits in the TV-images.
Figure 3 shows a comparison between the results of a 10-nearest-neighbor query in the pictograph dataset with the edge matching distance and the vertex matching distance. As one can see, the result obtained with the edge matching distance contains fewer false positives, due to the fact that the structural properties of the images are taken into account more strongly by this measure. It is important to note that this better result was obtained at the cost of a query processing runtime increase of as little as 5%. To demonstrate the usefulness of the edge matching distance for data mining tasks, we determined clusterings of the TV-images by using the density-based clustering algorithm DBSCAN [23]. In figure 4 one cluster found with the edge matching distance is depicted. Although the cluster contains some other objects, it clearly consists mainly of portraits. When clustering with the vertex matching distance, we found no comparable cluster, i.e., this cluster could only be found with the edge matching distance as similarity measure. To measure the selectivity of our filter method, we implemented a filter-refinement architecture as described in [21]. For each of our datasets, we measured the average filter selectivity for 100 queries which retrieved various fractions of the database. The results for the experiment using the full-color TV-images are depicted in figure 5(a). They show that the selectivity of our filter is very good: for example, for a query result which is 5% of the database size, more than 87% of the database objects are filtered out. The results for the pictograph dataset, shown in figure 5(b), underline the good selectivity of the filter method. Even for a quite large result size of 10%, more than 82% of the database objects are removed by the filter. As the calculation of the edge matching distance is far more complex than that of the filter distance, it is not surprising that the reduction in runtime resulting from the filter use was proportional to the number of database objects which were filtered out.
6 Conclusions
In this paper, we presented a new similarity measure for data modeled as attributed graphs. Starting from the vertex matching distance, well known from the field of image retrieval, we developed the so-called edge matching distance, which
Fig. 5. Average filter selectivity for the TV-image dataset (a) and the pictograph dataset (b).
is based on a minimum-weight maximal matching of the edge sets of the graphs. This measure takes the structural and the attribute properties of the attributed graphs into account and can be calculated in O(n^3) time in the worst case, which allows it to be used in data mining applications, unlike the common edit distance. In our experiments, we demonstrate that the edge matching distance reflects the similarity of graph-modeled objects better than the similar vertex matching distance, while having an almost identical runtime. Furthermore, we devised a filter-refinement architecture and a filter method for the edge matching distance. Our experiments show that this architecture reduces the number of necessary distance calculations during query processing by between 87% and 93%. In our future work, we will investigate different cost functions for the edge matching distance as well as their usefulness for different applications. This especially includes the field of molecular biology, where we plan to apply our methods to the problem of similarity search in protein databases.
7 Acknowledgement
Finally let us acknowledge the help of Stefan Brecheisen, who implemented part of our code.
References
1. Berchtold, S., Keim, D., Kriegel, H.P.: The X-tree: An index structure for high-dimensional data. In: Proc. 22nd VLDB Conf., Bombay, India (1996) 28–39
2. Berchtold, S., Böhm, C., Jagadish, H., Kriegel, H.P., Sander, J.: Independent quantization: An index compression technique for high-dimensional data spaces. In: Proc. of the 16th ICDE. (2000) 577–588
3. Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proc. 24th VLDB Conf. (1998) 194–205
4. Huet, B., Cross, A., Hancock, E.: Shape retrieval by inexact graph matching. In: Proc. IEEE Int. Conf. on Multimedia Computing Systems. Volume 2, IEEE Computer Society Press (1999) 40–44
5. Kubicka, E., Kubicki, G., Vakalis, I.: Using graph distance in object recognition. In: Proc. ACM Computer Science Conference. (1990) 43–48
6. Wiskott, L., Fellous, J.M., Krüger, N., von der Malsburg, C.: Face recognition by elastic bunch graph matching. IEEE PAMI 19 (1997) 775–779
7. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady 10 (1966) 707–710
8. Wagner, R.A., Fisher, M.J.: The string-to-string correction problem. Journal of the ACM 21 (1974) 168–173
9. Sanfeliu, A., Fu, K.S.: A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man and Cybernetics 13 (1983) 353–362
10. Cook, D.J., Holder, L.B.: Graph-based data mining. IEEE Intelligent Systems 15 (2000) 32–41
11. Zhang, K., Statman, R., Shasha, D.: On the editing distance between unordered labeled trees. Information Processing Letters 42 (1992) 133–139
12. Zhang, K., Wang, J., Shasha, D.: On the editing distance between undirected acyclic graphs. International Journal of Foundations of Computer Science 7 (1996) 43–57
13. Papadopoulos, A., Manolopoulos, Y.: Structure-based similarity search with graph histograms. In: Proc. DEXA/IWOSS Int. Workshop on Similarity Search, IEEE Computer Society Press (1999) 174–178
14. Petrakis, E.: Design and evaluation of spatial similarity approaches for image retrieval. Image and Vision Computing 20 (2002) 59–76
15. Kuhn, H.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2 (1955) 83–97
16. Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the SIAM 6 (1957) 32–38
17. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest neighbor queries. In: Proc. ACM SIGMOD, ACM Press (1995) 71–79
18. Hjaltason, G.R., Samet, H.: Ranking in spatial databases. In: Advances in Spatial Databases, 4th International Symposium, SSD'95, Portland, Maine. Volume 951 of Lecture Notes in Computer Science, Springer (1995) 83–95
19. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. In: Proc. of 23rd VLDB Conf. (1997) 426–435
20. Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: Proc. of the 4th Int. Conf. of Foundations of Data Organization and Algorithms (FODO), Springer Verlag (1993) 69–84
21. Seidl, T., Kriegel, H.P.: Optimal multi-step k-nearest neighbor search. In: Proc. ACM SIGMOD, ACM Press (1998) 154–165
22. Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., Protopapas, Z.: Fast and effective retrieval of medical tumor shapes. IEEE TKDE 10 (1998) 889–904
23. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, AAAI Press (1996) 226–231
Using an Interest Ontology for Improved Support in Rule Mining
Xiaoming Chen1, Xuan Zhou1, Richard Scherl2, and James Geller1
1 CS Dept., New Jersey Institute of Technology, Newark, NJ 07102
2 Monmouth University, West Long Branch, New Jersey 07764
Abstract. This paper describes the use of a concept hierarchy for improving the results of association rule mining. Given a large set of tuples with demographic information and personal interest information, association rules can be derived that associate ages and gender with interests. However, there are two problems. Some data sets are too sparse to yield rules with high support. Secondly, some data sets with abstract interests do not represent the actual interests well. To overcome these problems, we preprocess the data tuples using an ontology of interests. Interests within tuples that are very specific are replaced by more general interests retrieved from the interest ontology. This results in many more tuples at a more general level. Feeding those tuples to an association rule miner results in rules that have better support and that better represent the reality.
1 Introduction
Data mining has become an important research tool for the purpose of marketing. It makes it possible to draw far-reaching conclusions from existing customer databases about connections between different products purchased. If demographic data are available, data mining also allows the generation of rules that connect them with products. However, companies are not just interested in the behavior of their existing customers; they would like to find out about potential customers. Typically, there is no information about potential customers available in a company database that can be used for data mining. It is possible to perform data mining on potential customers if one makes the following two adjustments: (1) instead of looking at products already purchased, we may look at the interests of a customer; (2) many people express their interests freely and explicitly on their Web home pages. The process of mining data of potential customers thus becomes a process of Web mining. In this project, we extract raw data from home pages on the Web. In the second stage, we raise specific but sparse data to higher levels to make it denser. In the third stage, we apply traditional rule mining algorithms to the data. When mining real data, what is available is often too sparse to produce rules with reasonable support. In this paper we describe a method to
This research was supported by the NJ Commission for Science and Technology. Contact author: James Geller, [email protected]
improve the support of mined rules by using a large ontology of interests that are related to the extracted raw data.
2 Description of Project, Data and Mining
Our Web Marketing system consists of six modules. (1) The Web search module extracts home pages of users from several portal sites. Currently, the following portal sites are used: LiveJournal, ICQ and Yahoo, as well as a few major universities. (2) The Object-Relational database stores the cleaned results of this search. (3) The data mining module uses the WEKA [13] package for extracting association rules from the table data. (4) The ontology is the main knowledge representation of this project [4, 11]. It consists of interest hierarchies based on Yahoo and ICQ. (5) The advanced extraction component processes Web pages which do not follow simple structure rules. (6) The front end is a user-friendly, Web-based GUI that allows users with no knowledge of SQL to query both the raw data in the tables and the derived rules. The data that we are using for data mining consists of records of real personal data that contain either demographic data and expressed interest data or two different items of interest data. In most cases, we are using triples of age, gender and one interest as input for data mining. In other cases we are using pairs of interests. Interests are derived from one of sixteen top level interest categories. These interest categories are called interests at level 1. Examples of level 1 interests (according to Yahoo) include RECREATION SPORTS, HEALTH WELLNESS, GOVERNMENT POLITICS, etc. Interests are organized as a DAG (Directed Acyclic Graph) hierarchy. As a result of the large size of the database, the available data goes well beyond the capacity of the data mining program. Thus, the data sets had to be broken into smaller data sets. A convenient way to do this is to perform data mining on the categories divided at level 1 (top level) or the children of level 1. Thus there are 16 interest categories at level 1, and the interest GOVERNMENT POLITICS has 20 children, including LAW, MILITARY, ETHICS, TAXES, etc. At the time when we extracted the data, ENTERTAINMENT ARTS was the largest data file at level 1. It had 176218 data items, which is not too large to be handled by the data mining program. WEKA generates association rules [1] using the Apriori algorithm first presented by [2]. Since WEKA only works with clean data converted to a fixed format, called .arff format, we have created customized programs to do data selection and data cleaning.
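As a rough illustration of such a conversion step (the attribute names, helper function and quoting conventions are assumptions, not the project's actual code), (age, gender, interest) tuples can be written to WEKA's .arff format like this:

    def write_arff(tuples, path, relation="interests"):
        names = ["age", "gender", "interest"]
        columns = list(zip(*tuples))
        with open(path, "w") as f:
            f.write("@relation %s\n\n" % relation)
            for name, values in zip(names, columns):
                # declare each attribute as nominal; values are quoted because
                # interest names may contain spaces
                domain = ",".join("'%s'" % v for v in sorted(set(values)))
                f.write("@attribute %s {%s}\n" % (name, domain))
            f.write("\n@data\n")
            for row in tuples:
                f.write(",".join("'%s'" % v for v in row) + "\n")

    # write_arff([("B", "M", "BUSINESS FINANCE"), ("C", "F", "ALUMNI")], "interests.arff")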
3 Using Raising for Improved Support
A concept hierarchy is present in many databases either explicitly or implicitly. Some previous work utilizes a hierarchy for data mining. Han [5] discusses data mining at multiple concept levels. His approach is to use discovered associations at one level (e.g., milk → bread) to direct the search for associations at a different level (e.g., milk of brand X → bread of brand Y). As most of our data mining involves only one interest, our problem setting is quite different. Han et al. [6] introduce a top-down progressive deepening method for mining multiple-level association rules. They utilize the hierarchy to collect large item sets at different concept levels. Our approach utilizes an interest ontology to improve support in rule mining by means of concept raising. Fortin et al. [3] use an object-oriented representation for data mining. Their interest is in deriving multi-level association rules. As we are typically using only one data item in each tuple for raising, the possibility of multi-level rules does not arise in our problem setting. Srikant et al. [12] present Cumulative and EstMerge algorithms to find associations between items at any level by adding all ancestors of each item to the transaction. In our work, items of different levels do not coexist in any step of mining. Psaila et al. [9] describe a method for improving association rule mining by using a generalization hierarchy. Their hierarchy is extracted from the schema of the database and used together with mining queries [7]. In our approach, we are making use of a large pre-existing concept hierarchy, which contains concepts from the data tuples. Páircéir et al. also differ from our work in that they are mining multi-level rules that associate items spanning several levels of a concept hierarchy [10]. Joshi et al. [8] are interested in situations where rare instances are really the most interesting ones, e.g., in intrusion detection. They present a two-phase data mining method with a good balance of precision and recall. For us, rare instances are not by themselves important; they are only important because they contribute with other rare instances to result in frequently occurring instances for data mining. There are 11 levels in the Yahoo interest hierarchy. Every extracted interest belongs somewhere in the hierarchy, and is at a certain level. The lower the level value, the higher up it is in the hierarchy. Level 0 is the root. Level 1 is the top level, which includes 16 interests. For example, FAMILY HOME is an interest at level 1. PARENTING is an interest at level 2. PARENTING is a child of FAMILY HOME in the hierarchy. If a person expressed an interest in PARENTING, it is common sense that he or she is interested in FAMILY HOME. Therefore, at level 1, when we count those who are interested in FAMILY HOME, it is reasonable to count those who are interested in PARENTING. This idea applies in the same way to lower levels. A big problem in the derivation of association rules is that available data is sometimes very sparse and biased as a result of the interest hierarchy. For example, among over a million interest records in our database only 11 people expressed an interest in RECREATION SPORTS, and nobody expressed an interest in SCIENCE. The fact that people did not express interests with more general terms does not mean they are not interested. The data file of
RECREATION SPORTS has 62734 data items. In other words, 62734 interest expressions of individuals are in the category of RECREATION SPORTS. Instead of saying "I'm interested in Recreation and Sports," people prefer saying "I'm interested in basketball and fishing." They tend to be more specific with their interests. We analyzed the 16 top level categories of the interest hierarchy. We found users expressing interests at the top level only in two categories, MUSIC and RECREATION SPORTS. When mining data at higher levels, it is important to include data at lower levels, in order to gain data accuracy and higher support. In the following examples, the first letter stands for an age range. The age range from 10 to 19 is represented by A, 20 to 29 is B, 30 to 39 is C, 40 to 49 is D, etc. The second letter stands for Male or Female. Text after a double slash (//) is not part of the data. It contains explanatory remarks.

Original Data File:
B,M,BUSINESS FINANCE //level=1
D,F,METRICOM INC //level=7
E,M,BUSINESS SCHOOLS //level=2
C,F,ALUMNI //level=3
B,M,MAKERS //level=4
B,F,INDUSTRY ASSOCIATIONS //level=2
C,M,AOL INSTANT MESSENGER //level=6
D,M,INTRACOMPANY GROUPS //level=3
C,M,MORE ABOUT ME //wrong data

The levels below 7 do not have any data in this example. Raising will process the data level-by-level starting at level 1. It is easiest to see what happens if we look at the processing of level 3. First the result is initialized with the data at level 3 contained in the source file. With our data shown above, that means that the result is initialized with the following two lines.

C,F,ALUMNI
D,M,INTRACOMPANY GROUPS

In order to perform the raising we need to find ancestors at level 3 of the interests in our data. Table 1 shows all ancestors of our interests from levels 4, 5, 6, 7, such that the ancestors are at level 3. The following lines are now added to our result.

D,F,COMMUNICATIONS AND NETWORKING // raised from level=7 (1st ancestor)
D,F,COMPUTERS // raised from level=7 (2nd ancestor)
B,M,ELECTRONICS // raised from level=4
C,M,COMPUTERS // raised from level=6

That means, after raising we have the following occurrence counts at level 3.
ALUMNI: 1
INTRACOMPANY GROUPS: 1
COMMUNICATIONS AND NETWORKING: 1
COMPUTERS: 2
ELECTRONICS: 1

Before raising, we only had two items at level 3. Now, we have six items at level 3. That means that we now have more data as input for data mining than before raising. Thus, the results of data mining will have better support and will much better reflect the actual interests of people.

Table 1. Relevant Ancestors

Interest Name            Its Ancestor(s) at Level 3
METRICOM INC             COMMUNICATIONS AND NETWORKING
METRICOM INC             COMPUTERS
MAKERS                   ELECTRONICS
AOL INSTANT MESSENGER    COMPUTERS
Due to the existence of multiple parents and common ancestors, the precise method of raising is very important. There are different ways to raise a data file. One way is to get the data file of the lowest level and raise interests bottom-up, one level at a time, until we finish at level 1. The data raised from lower levels is combined with the original data from the given level to form the data file at that level. If an interest has multiple parents, we include these different parents in the raised data. However, if those parents have the same ancestor at some higher level, duplicates of data appear at the level of common ancestors. This problem is solved by adopting a different method: we raise directly to the target level, without raising to any intermediate level. After raising to a certain level, all data at this level can be deleted and never have to be considered again for lower levels. This method solves the problem of duplicates caused by multiple parents and common ancestors. The data file also becomes smaller when the destination level becomes lower. In summary, the raising algorithm is implemented as follows: Raise the original data to level 1. Do data mining. Delete all data at level 1 from the original data file. Raise the remaining data file to level 2. Do data mining. Delete all data at level 2 from the data file, etc. Continue until there is no more valid data. Any data remaining in the file at that point are wrong data.
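A minimal sketch of the direct raising step is shown below. The helper functions level_of and ancestors_at, which query the interest ontology, are assumed to exist and are not part of the paper.

    def raise_to_level(tuples, target_level, level_of, ancestors_at):
        raised = []
        for age, gender, interest in tuples:
            level = level_of(interest)           # assumed to return None for wrong data
            if level == target_level:
                raised.append((age, gender, interest))
            elif level is not None and level > target_level:
                # raise directly to the target level; using the set of ancestors
                # avoids duplicates caused by multiple parents with common ancestors
                for ancestor in ancestors_at(interest, target_level):
                    raised.append((age, gender, ancestor))
            # interests above the target level were already mined and deleted
        return raised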
4 Results
The quality of association rules is normally measured by specifying support and confidence. Support may be given in two different ways [13], as absolute support and as relative support. Witten et al. write:
The coverage of an association rule is the number of instances for which it predicts correctly – this is often called its support. ... It may also be convenient to specify coverage as a percentage of the total number of instances instead. (p. 64)

For our purposes, we are most interested in the total number of tuples that can be used for deriving association rules; thus we will use the absolute number for support only. The data support is substantially improved by means of raising. Following are two rules from RECREATION SPORTS at level 2 without raising:

age=B interest=AVIATION 70 ⇒ gender=M 55 conf:(0.79) (1)
age=C interest=OUTDOORS 370 ⇒ gender=M 228 conf:(0.62) (2)

Following are two rules from RECREATION SPORTS at level 2 with raising:

age=A gender=F 13773 ⇒ interest=SPORTS 10834 conf:(0.79) (3)
age=C interest=OUTDOORS 8284 ⇒ gender=M 5598 conf:(0.68) (4)

Rule (2) and Rule (4) have the same attributes and rule structure. Without raising, the absolute support is 228, while with raising it becomes 5598. The improvement of the absolute support of this rule is 2355%. Not all rules for the same category and level have the same attributes and structure. For example, rule (1) appeared in the rules without raising, but not in the rules with raising. Without raising, 70 people are of age category B and chose AVIATION as their interest. Among them, 55 are male. The confidence for this rule is 0.79. After raising, there is no rule about AVIATION, because its support is too small compared with other interests such as SPORTS and OUTDOORS. In other words, one effect of raising is that rules that appear in the result of WEKA before raising might not appear after raising, and vice versa. Two factors in combination explain why rules may disappear after raising. First, this may be a result of how WEKA orders the rules that it finds by confidence and support. WEKA primarily uses confidence for ordering the rules. There is a cut-off parameter, so that only the top N rules are returned. Thus, by raising, a rule in the top N might drop below the top N. There is a second factor that affects the change of order of the mined rules. Although the Yahoo ontology ranks both AVIATION and SPORTS as level-2 interests, the hierarchy structure underneath them is not balanced. According to the hierarchy, AVIATION has 21 descendants, while SPORTS has 2120 descendants, which is about 100 times more. After raising to level 2, all nodes below level 2 are replaced by their ancestors at level 2. As a result, SPORTS becomes an interest with overwhelmingly high support, whereas the improvement rate for AVIATION is so small that it disappeared from the rule set after raising. There is another positive effect of raising. Rule (3) above appeared in the rules with raising. After raising, 13773 people are of age category A and gender category F. Among them, 10834 are interested in SPORTS. The confidence is 0.79. These data look good enough to generate a convincing rule. However, there were no rules about SPORTS before raising. Thus, we have uncovered a rule with strong support that also agrees with our intuition. However, without raising, this
rule was not in the result of WEKA. Thus, raising can uncover new rules that agree well with our intuition and that also have better absolute support. To evaluate our method, we compared the support and confidence of raised and unraised rules. The improvement of support is substantial. Table 2 compares support and confidence for the same rules before and after raising for RECREATION SPORTS at level 2. There are 58 3-attribute rules without raising, and 55 3-attribute rules with raising. 18 rules are the same in both results. Their support and confidence are compared in the table. The average support is 170 before raising, and 4527 after raising. The average improvement is 2898%. Thus, there is a substantial improvement in absolute support. After raising, the lower average confidence is a result of the expanded data. Raising affects not only the data that contributes to a rule, but all other data as well. Thus, confidence was expected to drop. Even though the confidence is lower, the improvement in support by far outstrips this unwanted effect.

Table 2. Support and Confidence Before and After Raising

Rule (int = interest, gen = gender)    Supp. w/o rais.  Supp. w/ rais.  Improv. of supp.  Conf. w/o rais.  Conf. w/ rais.  Improv. of Conf.
age=C int=AUTOMOTIVE ⇒ gen=M           57               3183            5484%             80               73              -7%
age=B int=AUTOMOTIVE ⇒ gen=M           124              4140            3238%             73               65              -8%
age=C int=OUTDOORS ⇒ gen=M             228              5598            2355%             62               68              6%
age=D int=OUTDOORS ⇒ gen=M             100              3274            3174%             58               67              9%
age=B int=OUTDOORS ⇒ gen=M             242              5792            2293%             54               61              7%
age=C gen=M ⇒ int=OUTDOORS             228              5598            2355%             51               23              -28%
gen=M int=AUTOMOTIVE ⇒ age=B           124              4140            3238%             47               37              -10%
age=D gen=M ⇒ int=OUTDOORS             100              3274            3174%             46               27              -19%
age=B int=OUTDOORS ⇒ gen=F             205              3660            1685%             46               39              -7%
age=B gen=M ⇒ int=OUTDOORS             242              5792            2293%             44               18              -26%
gen=F int=OUTDOORS ⇒ age=B             205              3660            1685%             42               39              -3%
gen=M int=OUTDOORS ⇒ age=B             242              5792            2293%             38               34              -4%
int=AUTOMOTIVE ⇒ age=B gen=M           124              4140            3238%             35               25              -10%
gen=M int=OUTDOORS ⇒ age=C             228              5598            2355%             35               33              -2%
age=D ⇒ gen=M int=OUTDOORS             100              3274            3174%             29               19              -10%
gen=M int=AUTOMOTIVE ⇒ age=C           57               3183            5484%             22               28              6%
int=OUTDOORS ⇒ age=B gen=M             242              5792            2293%             21               22              1%
int=OUTDOORS ⇒ age=C gen=M             228              5598            2355%             20               21              1%
Table 3 shows the comparison of all rules that are the same before and after raising. The average improvement of support is calculated at level 2, level 3, level 4, level 5 and level 6 for each of the 16 categories. As explained in Sect. 3, few people expressed an interest at level 1, because these interest names are too general. Before raising, there are only 11 level-1 tuples with the interest RECREATION SPORTS and 278 tuples with the interest MUSIC. In the other
14 categories, there are no tuples at level 1 at all. However, after raising, there are 6,119 to 174,916 tuples at level 1, because each valid interest in the original data can be represented by its ancestor at level 1, no matter how low the interest is in the hierarchy. All the 16 categories have data down to level 6. However, COMPUTERS INTERNET, FAMILY HOME and HEALTH WELLNESS have no data at level 7. In general, data below level 6 is very sparse and does not contribute a great deal to the results. Therefore, we present the comparison of rules from level 2 through level 5 only. Some rules generated by WEKA are the same with and without raising. Some are different. In some cases, there is not a single rule in common between the rule sets with and without raising. The comparison is therefore not applicable. Those conditions are denoted by “N/A” in the table.
Table 3. Support Improvement Rate of Common Rules

Category                  Level2   Level3   Level4   Level5
BUSINESS FINANCE          122%     284%     0%       409%
COMPUTERS INTERNET        363%     121%     11%      0%
CULTURES COMMUNITY        N/A      439%     N/A      435%
ENTERTAINMENT ARTS        N/A      N/A      N/A      N/A
FAMILY HOME               148%     33%      0%       0%
GAMES                     488%     N/A      108%     0%
GOVERNMENT POLITICS       333%     586%     0%       N/A
HEALTH WELLNESS           472%     275%     100%     277%
HOBBIES CRAFTS            N/A      0%       0%       0%
MUSIC                     N/A      2852%    N/A      0%
RECREATION SPORTS         2898%    N/A      76%      N/A
REGIONAL                  6196%    123%     N/A      0%
RELIGION BELIEFS          270%     88%      634%     0%
ROMANCE RELATIONSHIPS     224%     246%     N/A      17%
SCHOOLS EDUCATION         295%     578%     N/A      297%
SCIENCE                   1231%    0%       111%     284%
Average Improvement       1086%    432%     104%     132%
Table 4 shows the average improvement of support of all rules after raising to level 2, level 3, level 4 and level 5 within the 16 interest categories. This is computed as follows. We sum the support values for all rules before raising and divide them by the number of rules, i.e., we compute the average support before raising, Sb . Similarly, we compute the average support of all the rules after raising. Then the improvement rate R is computed as:
R=
Sa − S b ∗ 100 [percent] Sb
(1)
The average improvement rate for level 2 through level 5 is, respectively, 279%, 152%, 68% and 20%. WEKA ranks the rules according to the confidence, and discards rules with lower confidence even though the support may be higher. In Tab. 4 there are three values where the improvement rate R is negative. This may happen if the total average relative support becomes lower after raising. That in turn can happen because, as mentioned before, the rules before and after raising may be different rules. The choice of rules by WEKA is primarily made based on relative support and confidence values.

Table 4. Support Improvement Rate of All Rules

Category                  Level2   Level3   Level4   Level5
BUSINESS FINANCE          231%     574%     -26%     228%
COMPUTERS INTERNET        361%     195%     74%      -59%
CULTURES COMMUNITY        1751%    444%     254%     798%
ENTERTAINMENT ARTS        4471%    2438%    1101%    332%
FAMILY HOME               77%      26%      56%      57%
GAMES                     551%     1057%    188%     208%
GOVERNMENT POLITICS       622%     495%     167%     1400%
HEALTH WELLNESS           526%     383%     515%     229%
HOBBIES CRAFTS            13266%   2%       7%       60%
MUSIC                     13576%   3514%    97%      62%
RECREATION SPORTS         6717%    314%     85%      222%
REGIONAL                  7484%    170%     242%     -50%
RELIGION BELIEFS          285%     86%      627%     383%
ROMANCE RELATIONSHIPS     173%     145%     2861%    87%
SCHOOLS EDUCATION         225%     550%     1925%    156%
SCIENCE                   890%     925%     302%     317%
Average Improvement       279%     152%     68%      20%

5 Conclusions and Future Work
In this paper, we showed that the combination of an ontology of the mined concepts with a standard rule mining algorithm can be used to generate data sets with orders of magnitude more tuples at higher levels. Generating rules from these tuples results in much larger (absolute) support values. In addition, raising often produces rules that, according to our intuition, better represent the domain than rules found without raising. Formalizing this intuition is a subject of future work. According to our extensive experiments with tuples derived from Yahoo interest data, data mining with raising can improve absolute support for rules up to over 6000% (averaged over all common rules in one interest category). Improvements in support may be even larger for individual rules. When averaging
over all support improvements for all 16 top level categories and levels 2 to 5, we get a value of 438%. Future work includes using other data mining algorithms and integrating the raising process directly into the rule mining algorithm. Besides mining for association rules, we can also perform classification and clustering at different levels of the raised data. The rule mining algorithm itself needs adaptation to our domain. For instance, there are over 31,000 interests in our version of the interest hierarchy. Yahoo has meanwhile added many more interests. Finding interest–interest associations becomes difficult using WEKA, as the interests of persons appear as sets, which are hard to map onto the .arff format.
References
1. R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., 1993.
2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 1994.
3. S. Fortin and L. Liu. An object-oriented approach to multi-level association rule mining. In Proceedings of the fifth international conference on Information and knowledge management, pages 65–72. ACM Press, 1996.
4. J. Geller, R. Scherl, and Y. Perl. Mining the web for target marketing information. In Proceedings of CollECTeR, Toulouse, France, 2002.
5. J. Han. Mining knowledge at multiple concept levels. In CIKM, pages 19–24, 1995.
6. J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. of 1995 Int'l Conf. on Very Large Data Bases (VLDB'95), Zürich, Switzerland, September 1995, pages 420–431, 1995.
7. J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. DMQL: A data mining query language for relational databases, 1996.
8. M. V. Joshi, R. C. Agarwal, and V. Kumar. Mining needle in a haystack: classifying rare classes via two-phase rule induction. SIGMOD Record (ACM Special Interest Group on Management of Data), 30(2):91–102, 2001.
9. G. Psaila and P. L. Lanzi. Hierarchy-based mining of association rules in data warehouses. In Proceedings of the 2000 ACM symposium on Applied computing, pages 307–312. ACM Press, 2000.
10. R. Páircéir, S. McClean, and B. Scotney. Discovery of multi-level rules and exceptions from a distributed database. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 523–532. ACM Press, 2000.
11. R. Scherl and J. Geller. Global communities, marketing and web mining. Journal of Doing Business Across Borders, 1(2):141–150, 2002. http://www.newcastle.edu.au/journal/dbab/images/dbab 1(2).pdf.
12. R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of 1995 Int'l Conf. on Very Large Data Bases (VLDB'95), Zürich, Switzerland, September 1995, pages 407–419, 1995.
13. I. H. Witten and E. Frank. Data Mining. Morgan Kaufmann Publishers, San Francisco, 2000.
Fraud Formalization and Detection
Bharat Bhargava, Yuhui Zhong, and Yunhua Lu
Center for Education and Research in Information Assurance and Security (CERIAS) and Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA
{bb,zhong,luy}@cs.purdue.edu
Abstract. A fraudster can be an impersonator or a swindler. An impersonator is an illegitimate user who steals resources from the victims by "taking over" their accounts. A swindler is a legitimate user who intentionally harms the system or other users by deception. Previous research efforts in fraud detection concentrate on identifying frauds caused by impersonators. Detecting frauds conducted by swindlers is a challenging issue. We propose an architecture to catch swindlers. It consists of four components: profile-based anomaly detector, state transition analysis, deceiving intention predictor, and decision-making component. The profile-based anomaly detector outputs fraud confidence indicating the possibility of fraud when there is a sharp deviation from usual patterns. State transition analysis provides a state description to users when an activity results in entering a dangerous state leading to fraud. The deceiving intention predictor discovers malicious intentions. Three types of deceiving intentions, namely uncovered deceiving intention, trapping intention, and illusive intention, are defined. A deceiving intention prediction algorithm is developed. A user-configurable risk evaluation function is used for decision making. A fraud alarm is raised when the expected risk is greater than the fraud investigation cost.
1 Introduction
Fraudsters can be classified into two categories: impersonators and swindlers. An impersonator is an illegitimate user who steals resources from the victims by “taking over” their accounts. A swindler, on the other hand, is a legitimate user who intentionally harms the system or other users by deception. Taking superimposition fraud in telecommunication [7] as an example, impersonators impose their usage on the accounts of legitimate users by using cloned phones with Mobile Identification Numbers (MIN) and Equipment Serial Numbers (ESN) stolen from the victims. Swindlers obtain legitimate accounts and use the services without the intention to pay bills, which is called subscription fraud. Impersonators can be forestalled by utilizing cryptographic technologies that provide strong protection to users’ authentication information. The idea of separation of duty may be applied to reduce the impact of a swindler. The essence
This research is supported by NSF grant IIS-0209059.
is to restrict the power an entity (e.g., a transaction partner) can have, to prevent him from abusing it. An empirical example of this idea is that laws are set, enforced and interpreted by different parties. Separation of duty can be implemented by using access control mechanisms such as role-based access control or the lattice-based access control model [8]. Separation of duty policies and other mechanisms, like dual-log bookkeeping [8], reduce fraud but cannot eliminate it. For example, in online auctions such as eBay, sellers and buyers have restricted knowledge about the other side. Although eBay, as a trusted third party, has authentication services to check the information provided by sellers and buyers (e.g., phone numbers), it is impossible to verify all of them due to the high volume of online transactions. Fraud is a persistent issue in such an environment. In this paper, we concentrate on swindler detection. Three approaches are considered: (a) detecting an entity's activities that deviate from normal patterns, which may imply the existence of a fraud; (b) constructing state transition graphs for existing fraud scenarios and detecting fraud attempts similar to the known ones; and (c) discovering an entity's intention based on his behavior. The first two approaches can also be used to detect frauds conducted by impersonators. The last one is applicable only to swindler detection. The rest of this paper is organized as follows. Section 2 introduces the related work. Definitions for fraud and deceiving intentions are presented in Section 3. An architecture for swindler detection is proposed in Section 4. It consists of a profile-based anomaly detector, a state transition analysis component, a deceiving intention predictor, and a decision-making component. The functionalities and design considerations for each component are discussed. An algorithm for predicting deceiving intentions is designed and studied via experiments. Section 5 concludes the paper.
2 Related Work
Fraud detection systems are widely used in telecommunications, online transactions, the insurance industry, computer and network security [1, 3, 6, 11]. The majority of research efforts addresses detecting impersonators (e.g. detecting superimposition fraud in telecommunications). Effective fraud detection uses both fraud rules and pattern analysis. Fawcett and Provost proposed an adaptive rule-based detection framework [4]. Rosset et al. pointed out that standard classification and rule generation were not appropriate for fraud detection [7]. The generation and selection of a rule set should combine both user-level and behavior-level attributes. Burge and Shawe-Taylor developed a neural network technique [2]. The probability distributions for current behavior profiles and behavior profile histories are compared using Hellinger distances. Larger distances indicate more suspicion of fraud. Several criteria exist to evaluate the performance of fraud detection engines. ROC (Receiver Operating Characteristics) is a widely used one [10, 5]. Rosset et al. use accuracy and fraud coverage as criteria [7]. Accuracy is the number
of detected instances of fraud over the total number of classified frauds. Fraud coverage is the number of detected frauds over the total number of frauds. Stolfo et al. use a cost-based metric in commercial fraud detection systems [9]. If the loss resulting from a fraud is smaller than the investigation cost, this fraud is ignored. This metric is not suitable in circumstances where such a fraud happens frequently and causes a significant accumulative loss.
3 Formal Definitions
Frauds by swindlers occur in cooperations where each entity makes a commitment. A swindler is an entity that has no intention to keep his commitment. Commitment is the integrity constraints, assumptions, and conditions an entity promises to satisfy in a process of cooperation. A commitment is described by using a conjunction of expressions. An expression is (a) an equality with an attribute variable on the left-hand side and a constant representing the expected value on the right-hand side, or (b) a user-defined predicate that represents certain complex constraints, assumptions and conditions. A user-defined Boolean function is associated with the predicate to check whether the constraints, assumptions and conditions hold. An outcome is the actual result of a cooperation. Each expression in a commitment has a corresponding one in the outcome. For an equality expression, the actual value of the attribute is on the right-hand side. For a predicate expression, if the user-defined function is true, the predicate itself is in the outcome. Otherwise, the negation of the predicate is included. Example: A commitment of a seller for selling a vase is (Received by = 04/01) ∧ (Price = $1000) ∧ (Quality = A) ∧ ReturnIfAnyQualityProblem. This commitment says that the seller promises to send out one "A" quality vase at the price of $1000. The vase should be received by April 1st. If there is a quality problem, the buyer can return the vase. A possible outcome is (Received by = 04/05) ∧ (Price = $1000) ∧ (Quality = B) ∧ ¬ReturnIfAnyQualityProblem. This outcome shows that a vase of quality "B" was received on April 5th. The return request was refused. We may conclude that the seller is a swindler. Predicates or attribute variables play different roles in detecting a swindler. We define two properties, namely intention-testifying and intention-dependent. Intention-testifying: A predicate P is intention-testifying if the presence of ¬P in an outcome leads to the conclusion that a partner is a swindler. An attribute variable V is intention-testifying if one can conclude that a partner is a swindler when V's expected value is more desirable than the actual value. Intention-dependent: A predicate P is intention-dependent if it is possible that a partner is a swindler when ¬P appears in an outcome. An attribute variable V is intention-dependent if it is possible that a partner is a swindler when its expected value is more desirable than the actual value. An intention-testifying variable or predicate is intention-dependent. The opposite direction is not necessarily true.
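To make the definitions concrete, the following sketch (an illustration with assumed data structures, not part of the formal model) checks an outcome against a commitment and separates intention-testifying from intention-dependent violations:

    def check_outcome(commitment, outcome, testifying, is_satisfied):
        # commitment/outcome: dicts mapping variable or predicate names to
        # expected/actual values; testifying: set of intention-testifying names;
        # is_satisfied(name, actual, expected): True if the promise was kept.
        violated = [name for name, expected in commitment.items()
                    if not is_satisfied(name, outcome[name], expected)]
        if any(name in testifying for name in violated):
            return "swindler", violated              # an intention-testifying item was violated
        if violated:
            return "possibly a swindler", violated   # only intention-dependent items violated
        return "ok", []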
Fig. 1. Deceiving intention: (a) uncovered deceiving intention, (b) trapping intention, (c) illusive intention. Each panel plots the satisfaction rating against the number of observations.
In the above example, ReturnIfAnyQualityProblem can be intention-testifying or intention-dependent. The decision is up to the user. Price is intention-testifying since, if the seller charges more money, we believe that he is a swindler. Quality and Received by are defined as intention-dependent variables, considering that a seller may not have full control over them.

3.1 Deceiving Intentions
Since the intention-testifying property is usually too strong in real applications, variables and predicates are specified as intention-dependent. A conclusion that a partner is a swindler cannot be drawn with 100% certainty based on one intention-dependent variable or predicate in one outcome. Two approaches can be used to increase the confidence: (a) consider multiple variables or predicates in one outcome; and (b) consider one variable or predicate in multiple outcomes. The second approach is applied in this paper. Assume a satisfaction rating ranging from 0 to 1 is given for the actual value of each intention-dependent variable in an outcome. The higher the rating is, the more satisfied the user is. The value of 0 means totally unacceptable, and the value of 1 indicates that the actual value is not worse than the expected value. For example, if the quality of the received vase is B, the rating is 0.5. If the quality is C, the rating drops to 0.2. For each intention-dependent predicate P, the rating is 0 if ¬P appears. Otherwise, the rating is 1. A satisfaction rating is related to an entity's deceiving intention as well as to some unpredictable factors. It is modelled by using random variables with a normal distribution. The mean function fm(n) determines the mean value of the normal distribution at the nth rating. Three types of deceiving intentions are identified.

Uncovered deceiving intention: The satisfaction ratings associated with a swindler having an uncovered deceiving intention are stably low. The ratings vary in a small range over time. The mean function is defined as fm(n) = M, where M is a constant. Figure 1a shows satisfaction ratings with fm(n) = 0.2. The fluctuation of the ratings results from the unpredictable factors.
Trapping intention: The rating sequence can be divided into two phases: preparing and trapping. A swindler behaves well to achieve a trustworthy image before he conducts frauds. The mean function can be defined as:

fm(n) = mhigh if n ≤ n0, and fm(n) = mlow otherwise, where n0 is the turning point.

Figure 1b shows satisfaction ratings for a swindler with trapping intention. fm(n) is 0.8 for the first 50 interactions and 0.2 afterwards.

Illusive intention: A smart swindler with illusive intention, instead of misbehaving continuously, attempts to cover the bad effects by intentionally doing something good after misbehaviors. He repeats the process of preparing and trapping. fm(n) is a periodic function. For simplicity, we assume the period is N; the mean function is defined as:

fm(n) = mhigh if (n mod N) < n0, and fm(n) = mlow otherwise.

Figure 1c shows satisfaction ratings with a period of 20. In each period, fm(n) is 0.8 for the first 15 interactions and 0.2 for the last five.
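The three rating models can be reproduced with a short simulation; the standard deviation and the clipping of ratings to [0, 1] are assumptions made only to mimic the unpredictable factors shown in Figure 1.

    import random

    def mean_uncovered(n, m=0.2):
        return m                                          # constant, stably low

    def mean_trapping(n, n0=50, m_high=0.8, m_low=0.2):
        return m_high if n <= n0 else m_low               # preparing, then trapping

    def mean_illusive(n, period=20, n0=15, m_high=0.8, m_low=0.2):
        return m_high if (n % period) < n0 else m_low     # repeated preparing/trapping

    def ratings(mean_fn, count=150, sigma=0.05):
        return [min(1.0, max(0.0, random.gauss(mean_fn(n), sigma)))
                for n in range(1, count + 1)]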
4 Architecture for Swindler Detection
Swindler detection consists of profile-based anomaly detector, state transition analysis, deceiving intention predictor, and decision-making. Profile-based anomaly detector monitors suspicious actions based upon the established patterns of an entity. It outputs fraud confidence indicating the possibility of a fraud. State transition analysis builds a state transition graph that provides state description to users when an activity results in entering a dangerous state leading
Fig. 2. Architecture for swindler detection
to a fraud. The deceiving intention predictor discovers deceiving intentions based on satisfaction ratings. It outputs DI-confidence to characterize the belief that the target entity has a deceiving intention. DI-confidence is a real number ranging over [0,1]. The higher the value is, the greater the belief is. The outputs of these components are fed into the decision-making component, which assists users in reaching decisions based on predefined policies. The decision-making component passes warnings from the state transition analysis to the user and displays the description of the next potential state in a readable format. The expected risk is computed as follows:

f(fraud confidence, DI-confidence, estimated cost) = max(fraud confidence, DI-confidence) × estimated cost

Users can replace this function according to their specific requirements. A fraud alarm is raised when the expected risk is greater than the fraud-investigation cost. In the rest of this section, we concentrate on the other three components.
4.1 Profile-Based Anomaly Detector
As illustrated in Fig. 3, the profile-based anomaly detector consists of rule generation and weighting, user profiling, and online detection.

Rule generation and weighting: Data mining techniques such as association rule mining are applied to generate fraud rules. The generated rules are assigned weights according to their frequency of occurrence. Both entity-level and behavior-level attributes are used in mining fraud rules and in weighting. Normally, a large volume of rules will be generated.

User profiling: Profile information characterizes both the entity-level information (e.g., financial status) and an entity's behavior patterns (e.g., products of interest). There are two sets of profiling data, one for history profiles and the other for current profiles. Two steps, variable selection followed by data filtering, are used for user profiling. The first step chooses variables characterizing normal behavior. Selected variables need to be comparable among different entities.
Fig. 3. Profile-based anomaly detector (components: record preprocessor, case selection, rule generation and weighting, user profiling, and online detection producing fraud confidence).
The profile of each selected variable must show a pattern under normal conditions. These variables need to be sensitive to anomalies (i.e., at least one of these patterns is violated when an anomaly occurs). The objective of data filtering for history profiles is data homogenization (i.e., grouping similar entities). The current profile set is dynamically updated according to behaviors. As behavior-level data are voluminous, decay is applied to reduce the data volume. This part also involves rule selection for a specific entity based on the profiling results and the rules. The rule selection triggers the measurement of normal behaviors with respect to the rules. These statistics are stored in history profiles for online detection.
Online detection: The detection engine retrieves the related rules from the profiling component when an activity occurs. It may retrieve the entity's current behavior patterns and behavior pattern history as well. Analysis methods such as the Hellinger distance can be used to calculate the deviation of current profile patterns from historical profile patterns. These results are combined to determine the fraud confidence.
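The deviation measure is left open in the text; below is a small sketch of the Hellinger distance between an entity's historical and current profile distributions, assuming both have already been binned into aligned frequency vectors (the binning is our assumption).

import math

def hellinger(p, q):
    # Hellinger distance between two discrete distributions given as aligned
    # lists of non-negative weights (normalized internally); result lies in [0, 1].
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# Example: historical vs. current distribution over four product categories.
history = [40, 30, 20, 10]
current = [5, 10, 25, 60]
print(hellinger(history, current))   # about 0.47: a marked shift from the historical profile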
4.2 State Transition Analysis
State transition analysis models fraud scenarios as a series of states changing from an initial secure state to a final compromised state. The initial state is the start state prior to the actions that lead to a fraud. The final state is the resulting state of completion of the fraud. There may be several intermediate states between them. An action that causes one state to transition to another is called a signature action. Signature actions are the minimal actions that lead to the final state; without such actions, the fraud scenario cannot be completed. This model requires collecting fraud scenarios and identifying the initial states and the final states. The signature actions for a scenario are identified by working backwards from the final state. The fraud scenario is represented as a state transition graph by the states and signature actions. A danger factor is associated with each state. It is determined by the distance from the current state to a final state; if one state leads to several final states, the minimum distance is used. For each activity, state transition analysis checks the potential next states. If the maximum value of the danger factors associated with the potential states exceeds a threshold, a warning is raised and a detailed state description is sent to the decision-making component.
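To illustrate how the danger factor could be derived from the state transition graph, the sketch below computes for each state the minimum number of signature actions to any final state and warns when a potential next state is too close to a final state; the graph encoding and the convention danger = 1/(1 + distance) are our assumptions, since the text only requires the factor to grow as the distance shrinks.

from collections import deque

def min_distance_to_final(graph, final_states):
    # graph: dict mapping a state to its possible next states (edges = signature actions).
    # Returns, for every state that can reach a final state, the minimum number of actions needed.
    reverse = {}
    for s, nexts in graph.items():
        for t in nexts:
            reverse.setdefault(t, []).append(s)
    dist = {s: 0 for s in final_states}
    queue = deque(final_states)
    while queue:
        s = queue.popleft()
        for p in reverse.get(s, []):
            if p not in dist:
                dist[p] = dist[s] + 1
                queue.append(p)
    return dist

def danger_factor(state, dist):
    # Assumed convention: danger grows as the remaining distance shrinks.
    return 1.0 / (1.0 + dist[state]) if state in dist else 0.0

def warn_on_activity(current_state, graph, dist, threshold=0.5):
    return any(danger_factor(s, dist) > threshold for s in graph.get(current_state, []))

# Toy scenario: initial -> intermediate -> final (compromised).
g = {"initial": ["intermediate"], "intermediate": ["final"], "final": []}
d = min_distance_to_final(g, ["final"])
print(warn_on_activity("intermediate", g, d))   # True: the next state is a final state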
4.3 Deceiving Intention Predictor
The kernel of the predictor is the deceiving intention prediction (DIP) algorithm. DIP views the belief in deceiving intention as the complement of trust belief. The trust belief about an entity is evaluated based on the satisfaction sequence R1, ..., Rn, where Rn is the most recent rating; Rn contributes a portion α to the trust belief, and the remaining portion comes from the previous trust belief, which is determined recursively. For each entity, DIP maintains a pair of factors (i.e., a current construction factor Wc and a current destruction factor Wd). If integrating Rn will increase the trust belief, α = Wc. Otherwise, α = Wd. Wc and
Wd satisfy the constraint Wc < Wd, which implies that more effort is needed to gain a given amount of trust than to lose it [12]. Wc and Wd are modified when a foul event is triggered, i.e., when the incoming satisfaction rating is lower than a user-defined threshold. Upon a foul event, the target entity is put under supervision: its Wc is decreased and its Wd is increased. If the entity does not conduct any foul event during the supervision period, Wc and Wd are reset to their initial values. Otherwise, they are further decreased and increased, respectively. The current supervision period of an entity increases each time it conducts a foul event, so that it will be punished longer the next time; that is, an entity with a worse history is treated more harshly. The DI-confidence is computed as 1 − current trust belief.
The DIP algorithm accepts seven input parameters: the initial construction factor Wc and destruction factor Wd; the initial supervision period p; penalty ratios r1, r2 and r3 for the construction factor, the destruction factor and the supervision period, such that r1, r2 ∈ (0, 1) and r3 > 1; and the foul-event threshold fThreshold. For each entity k, we maintain a profile P(k) consisting of five fields: current trust value tValue, current construction factor Wc, current destruction factor Wd, current supervision period cPeriod, and rest of supervision period sRest.

DIP algorithm (Input parameters: Wd, Wc, r1, r2, r3, p, fThreshold; Output: DI-confidence)
    initialize P(k) with the input parameters
    while there is a new rating R
        ...   // update P(k).tValue and, on a foul event (R < fThreshold),
              // adjust P(k).Wc, P(k).Wd, P(k).cPeriod and P(k).sRest (see text)
        if R ≥ fThreshold then
            P(k).sRest = P(k).sRest − 1
            if P(k).sRest = 0 then   // restore Wc and Wd
                P(k).Wd = Wd and P(k).Wc = Wc
            end if
        end if
        return (1 − P(k).tValue)
    end while
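A runnable sketch of one plausible reading of DIP follows. Only the parameter roles, the choice of α, the foul-event test and the restore-after-a-clean-supervision-period behaviour are taken from the text; the exponential-moving-average trust update, the exact way Wd and the supervision period grow (dividing by r2, multiplying by r3) and the neutral initial trust value are our assumptions.

class DIP:
    # Sketch of the deceiving intention predictor for a single entity.
    def __init__(self, Wc=0.05, Wd=0.1, r1=0.9, r2=0.1, r3=2, p=10, fThreshold=0.18):
        self.Wc0, self.Wd0 = Wc, Wd       # initial factors, restored after a clean supervision period
        self.r1, self.r2, self.r3 = r1, r2, r3
        self.fThreshold = fThreshold
        self.tValue = 0.5                 # assumed neutral initial trust
        self.Wc, self.Wd = Wc, Wd
        self.cPeriod = p                  # current supervision period length
        self.sRest = 0                    # remaining supervised interactions

    def update(self, R):
        # Weight the new rating by Wc if it would raise trust, by Wd otherwise (Wc < Wd).
        alpha = self.Wc if R > self.tValue else self.Wd
        self.tValue = alpha * R + (1 - alpha) * self.tValue
        if R < self.fThreshold:           # foul event: punish and (re)start supervision
            self.Wc *= self.r1                       # harder to gain trust ...
            self.Wd = min(1.0, self.Wd / self.r2)    # ... and easier to lose it (assumed rule)
            self.cPeriod *= self.r3                  # longer supervision next time
            self.sRest = self.cPeriod
        elif self.sRest > 0:
            self.sRest -= 1
            if self.sRest == 0:           # clean supervision period: restore the factors
                self.Wc, self.Wd = self.Wc0, self.Wd0
        return 1.0 - self.tValue          # DI-confidence

dip = DIP()
for R in [0.8] * 50 + [0.2] * 100:        # trapping-style rating sequence
    di = dip.update(R)
print(round(di, 2))                        # DI-confidence ends high (about 0.8)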
Experimental Study
DIP's capability of discovering the deceiving intentions defined in Section 3.1 is investigated through experiments.

Fig. 4. Experiments to discover deceiving intentions: (a) uncovered deceiving intention, (b) trapping intention, (c) illusive intention. Each plot shows the DI-confidence (y-axis, from 0 to 1) against the number of ratings (x-axis, from 0 to 150).
The initial construction factor is 0.05 and the initial destruction factor is 0.1. The penalty ratios for the construction factor, the destruction factor and the supervision period are 0.9, 0.1 and 2, respectively. The threshold for a foul event is 0.18. The results are shown in Fig. 4. The x-axis of each figure is the number of ratings; the y-axis is the DI-confidence.
Swindler with uncovered deceiving intention: The satisfaction rating sequence of the generated swindler is shown in Fig. 1a. The result is illustrated in Fig. 4a. Since the probability that the swindler conducts foul events is high, he is under supervision most of the time. The construction and destruction factors become close to 0 and 1, respectively, because of the punishment for foul events. The trust value stays close to the minimum rating of the interactions, which is 0.1, and the DI-confidence is around 0.9.
Swindler with trapping intention: The satisfaction rating sequence of the generated swindler is shown in Fig. 1b. As illustrated in Fig. 4b, DIP responds to the sharp drop of fm(n) very quickly. After fm(n) changes from 0.8 to 0.2, it takes only 6 interactions for the DI-confidence to increase from 0.2239 to 0.7592.
Swindler with illusive intention: The satisfaction rating sequence of the generated swindler is shown in Fig. 1c. As illustrated in Fig. 4c, when the mean function fm(n) changes from 0.8 to 0.2, the DI-confidence increases. When fm(n) changes back from 0.2 to 0.8, the DI-confidence decreases. DIP is able to catch this smart swindler in the sense that his DI-confidence eventually increases to about 0.9. The swindler's effort to cover a fraud with good behaviors has less and less effect as the number of frauds grows.
5
Conclusions
In this paper, we classify fraudsters as impersonators and swindlers and present a mechanism to detect swindlers. The concepts relevant to frauds conducted by swindlers are formally defined. Uncovered deceiving intention, trapping intention, and illusive intention are identified. We propose an approach for swindler detection, which integrates the ideas of anomaly detection, state transition analysis, and history-based intention prediction. An architecture that realizes this approach is presented. The experimental results show that the proposed deceiving
intention prediction (DIP) algorithm accurately detects the uncovered deceiving intention. Trapping intention is captured promptly, within about 6 interactions after a swindler enters the trapping phase. The illusive intention of a swindler who attempts to cover frauds with good behaviors can also be caught by DIP.
References
[1] R. J. Bolton and D. J. Hand. Statistical fraud detection: A review. Statistical Science, 17(3):235–255, 2002.
[2] P. Burge and J. Shawe-Taylor. Detecting cellular fraud using adaptive prototypes. In Proceedings of the AAAI-97 Workshop on AI Approaches to Fraud Detection and Risk Management, 1997.
[3] M. Cahill, F. Chen, D. Lambert, J. Pinheiro, and D. Sun. Detecting fraud in the real world. In Handbook of Massive Datasets, pages 911–930. Kluwer Academic Publishers, 2002.
[4] T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1997.
[5] J. Hollmén and V. Tresp. Call-based fraud detection in mobile communication networks using a hierarchical regime-switching model. In Proceedings of Advances in Neural Information Processing Systems (NIPS'11), 1998.
[6] Bertis B. Little, Walter L. Johnston, Ashley C. Lovell, Roderick M. Rejesus, and Steve A. Steed. Collusion in the U.S. crop insurance program: applied data mining. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 594–598. ACM Press, 2002.
[7] Saharon Rosset, Uzi Murad, Einat Neumann, Yizhak Idan, and Gadi Pinkas. Discovery of fraud rules for telecommunications: challenges and solutions. In Proceedings of the Fifth ACM SIGKDD, pages 409–413. ACM Press, 1999.
[8] Ravi Sandhu. Lattice-based access control models. IEEE Computer, 26(11):9–19, 1993.
[9] Salvatore J. Stolfo, Wenke Lee, Philip K. Chan, Wei Fan, and Eleazar Eskin. Data mining-based intrusion detectors: an overview of the Columbia IDS project. ACM SIGMOD Record, 30(4):5–14, 2001.
[10] M. Taniguchi, J. Hollmén, M. Haft, and V. Tresp. Fraud detection in communications networks using neural and probabilistic methods. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1998.
[11] David Wagner and Paolo Soto. Mimicry attacks on host-based intrusion detection systems. In Proceedings of the 9th ACM Conference on Computer and Communications Security, pages 255–264. ACM Press, 2002.
[12] Y. Zhong, Y. Lu, and B. Bhargava. Dynamic trust production based on interaction sequence. Technical Report CSD-TR 03-006, Department of Computer Sciences, Purdue University, 2003.
Combining Noise Correction with Feature Selection*
Choh Man Teng
Institute for Human and Machine Cognition, University of West Florida
40 South Alcaniz Street, Pensacola FL 32501, USA
[email protected]
Abstract. Polishing is a noise correction mechanism which makes use of the inter-relationship between attribute and class values in the data set to identify and selectively correct components that are noisy. We applied polishing to a data set of amino acid sequences and associated information on point mutations of the gene COLIA1 for the classification of the phenotypes of the genetic collagenous disease Osteogenesis Imperfecta (OI). OI is associated with mutations in one or both of the genes COLIA1 and COLIA2. There are at least four known phenotypes of OI, of which type II is the severest and often lethal. Preliminary results of polishing suggest that it can lead to a higher classification accuracy. We further investigated the use of polishing as a scoring mechanism for feature selection, and the effect of the features so derived on the resulting classifier. Our experiments on the OI data set suggest that combining polishing and feature selection is a viable mechanism for improving data quality.
1
Approaches to Noise Handling
Imperfections in data can arise from many sources, for instance, faulty measuring devices, transcription errors, and transmission irregularities. Except in the most structured and synthetic environment, it is almost inevitable that there is some noise in any data we have collected. Data quality is a prime concern for many tasks in learning and induction. The utility of a procedure is limited by the quality of the data we have access to. For a classification task, for instance, a classifier built from a noisy training set might be less accurate and less compact than one built from the noise-free version of the same data set using an identical algorithm. Imperfections in a data set can be dealt with in three broad ways. We may leave the noise in, filter it out, or correct it. On the first approach, the data set is taken as is, with the noisy instances left in place. Algorithms that make use of the data are designed to be robust; that is, they can tolerate a certain amount of noise in the data. Robustness is typically accomplished by avoiding overfitting,
* This work was supported by NASA NCC2-1239 and ONR N00014-03-1-0516.
Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 340–349, 2003.
© Springer-Verlag Berlin Heidelberg 2003
so that the resulting classifier is not overly specialized to account for the noise. This approach is taken by, for example, c4.5 [Quinlan, 1987] and CN2 [Clark and Niblett, 1989]. On the second approach, the data is filtered before being used. Instances that are suspected of being noisy according to certain evaluation criteria are discarded [John, 1995; Gamberger et al., 1996; Brodley and Friedl, 1999]. A classifier is then built using only the retained instances in the smaller but cleaner data set. Similar ideas can be found in robust regression and outlier detection techniques in statistics [Rousseeuw and Leroy, 1987].
On the first approach, robust algorithms do not require preprocessing of the data, but the noise in the data may interfere with the mechanism, and a classifier built from a noisy data set may be of less utility than it could have been if the data were not noisy. On the second approach, by filtering out the noisy instances from the data, there is a tradeoff between the amount of information available for building the classifier and the amount of noise retained in the data set. Filtering is not information-efficient; the more noisy instances we discard, the less data remains. On the third approach, the noisy instances are identified, but instead of tossing these instances out, they are repaired by replacing the corrupted values with more appropriate ones. These corrected instances are then reintroduced into the data set. Noise correction has been shown to give better results than simply removing the noise from the data set in some cases [Drastal, 1991; Teng, 2001].
We have developed a data correction method called polishing [Teng, 1999]. Data polishing, when carried out correctly, would preserve the maximal information available in the data set, approximating the noise-free ideal situation. A classifier built from this corrected data should have a higher predictive power and a more streamlined representation. Polishing has been shown to improve the performance of classifiers in a number of situations [Teng, 1999; Teng, 2000]. In this paper we study in more detail a research problem in the biomedical domain, using a data set which describes the genetic collagenous disease Osteogenesis Imperfecta (OI). We have previously applied polishing to this data set, with some improvement in the accuracy and size of the resulting classifiers [Teng, 2003]. Here we in addition explore the selection and use of relevant features in the data set in conjunction with noise correction.
2
Feature Selection
Feature selection is concerned with the problem of identifying a set of features or attributes that are relevant or useful to the task at hand [Liu and Motoda, 1998, for example]. Spurious variables, either irrelevant or redundant, can affect the performance of the induced classifier. In addition, concentrating on a reduced set of features improves the readability of the classifier, which is desirable when our
goal is to achieve not only a high predictive accuracy but also an understanding of the underlying structure relating the attributes and the prediction. There are several approaches to feature selection. The utility of the features can be scored using a variety of statistical and experimental measures, for instance, correlation and information entropy [Kira and Rendell, 1992; Koller and Sahami, 1996]. The wrapper approach uses the learning algorithm itself to iteratively search for sets of features that can improve the performance of the algorithm [Kohavi and John, 1997]. Feature scoring is typically faster and the resulting data set is independent of the particular learning algorithm to be used, since the selection of the features is based on scores computed using the characteristics of the data set alone. The wrapper approach in addition takes into account the bias of the learning algorithm to be deployed by utilizing the algorithm itself in the estimation of the relevance of the features. We study the effect of feature selection when combined with noise correction. The polishing mechanism was used in part to score the features in the data set, and the reduced and polished data set was compared to the unreduced and/or unpolished data sets. In the following sections we will first describe the polishing mechanism and the application domain (the classification of the genetic disease OI), and then we will discuss the experimental procedure together with the feature selection method employed.
3
Polishing
Machine learning methods such as the naive Bayes classifier typically assume that different components of a data set are (conditionally) independent. It has often been pointed out that this assumption is a gross oversimplification of the actual relationship between the attributes; hence the word "naive" [Mitchell, 1997, for example]. Extensions to the naive Bayes classifier have been introduced to loosen the independence criterion [Kononenko, 1991; Langley et al., 1992], but some have also investigated alternative explanations for the success of this classifier [Domingos and Pazzani, 1996]. Controversy aside, most will agree that in many cases there is a definite relationship within the data; otherwise any effort to mine knowledge or patterns from the data would be ill-advised. Polishing takes advantage of this interdependency between the components of a data set to identify the noisy elements and suggest appropriate replacements. Rather than utilizing the features only to predict the target concept, we can as well turn the process around and utilize the target together with selected features to predict the value of another feature. This provides a means to identify noisy elements together with their correct values. Note that except for totally irrelevant elements, each feature would be at least related to some extent to the target concept, even if not to any other features. The basic algorithm of polishing consists of two phases: prediction and adjustment. In the prediction phase, elements in the data that are suspected of
being noisy are identified together with a nominated replacement value. In the adjustment phase, we selectively incorporate the nominated changes into the data set. In the first phase, the predictions are carried out by systematically swapping the target and particular features of the data set, and performing a ten-fold classification using a chosen classification algorithm for the prediction of the feature values. If the predicted value of a feature in an instance is different from the stated value in the data set, the location of the discrepancy is flagged and recorded together with the predicted value. This information is passed on to the next phase, where we institute the actual adjustments. Since the polishing process itself is based on imperfect data, the predictions obtained in the first phase can contain errors as well. We should not indiscriminately incorporate all the nominated changes. Rather, in the second phase, the adjustment phase, we selectively adopt appropriate changes from those predicted in the first phase, using a number of strategies to identify the best combination of changes that would improve the fitness of a datum. Given a training set, we try to identify suspect attributes and classes and replace their values according to the polishing procedure. The bare-bones description of polishing is given in Figure 1. Polishing makes use of a procedure flip to recursively try out selective combinations of attribute changes. The function classify(Classifiers, xj, c) returns the number of classifiers in the set Classifiers which classify the instance xj as belonging to class c. Further details of polishing can be found in [Teng, 1999; Teng, 2000; Teng, 2001].
4
Osteogenesis Imperfecta
Osteogenesis Imperfecta (OI), commonly known as brittle bone disease, is a genetic disorder characterized by bones that fracture easily for little or no reason. This disorder is associated with mutations in one or both of the genes COLIA1 and COLIA2, which are associated with the production of peptides of type I collagen. Type I collagen is a protein found in the connective tissues in the body. A mutation in COLIA1 or COLIA2 may lead to a change in the structure and expression of the type I collagen molecules produced, which in turn affects the bone structure. There are at least four known phenotypes of osteogenesis imperfecta, namely, types I, II, III, and IV. Of these four, type II is the severest form of OI and is often lethal. At least 70 different kinds of point mutations in COLIA1 and COLIA2 have been found to be associated with OI, and of these approximately half of the mutations are related to type II, the lethal form of OI [Hunter and Klein, 1993]. While OI may be diagnosed with collagenous or DNA tests, determining the relevant structure and the relationship between the point mutations and the types of OI remains an open research area [Klein and Wong, 1992; Mooney et al., 2001].
Polishing(OldData, votes, changes, cutoff)
Input   OldData: (possibly) noisy data
        votes: #classifiers that need to agree
        changes: max #changes per instance
        cutoff: size of attribute subset considered
Output  NewData: polished data

for each attribute ai
    AttListi ← ∅;
    tmpData ← swap ai and class c in OldData;
    10-fold cross-validation of tmpData;
    for each instance xj misclassified
        new ← value of ai predicted for xj;
        AttListi ← AttListi ∪ {⟨j, new⟩};
    end
end
NewData ← ∅;
AttSorted ← relevant attributes sorted in ascending order of |AttListi|;
Classifiers ← classifiers from 10-fold cross-validation of OldData;
for each instance xj
    for k from 0 to changes
        adjusted ← flip(j, votes, k, cutoff, 0);
        if adjusted then break;
    end
    if (not adjusted) then NewData ← NewData ∪ {xj};
end
return NewData;

flip(j, votes, k, cutoff, starti)
Input   j: index of the instance to be adjusted
        votes: #classifiers that need to agree
        k: #changes yet to be made
        cutoff: size of attribute subset considered
        starti: index of AttSorted containing first attribute to be adjusted
Output  true/false: whether a change has been made (also modifies NewData)

if k = 0 then
    if classify(Classifiers, xj, class of xj) ≥ votes then
        NewData ← NewData ∪ {xj};
        return true;
    else
        return false;
    end
else
    for i from starti to cutoff
        ai′ ← AttSorted[i];
        if ⟨j, new⟩ ∈ AttListi′ then
            attribute ai′ of xj ← new;
            adjusted ← flip(j, votes, k − 1, cutoff, i + 1);
            if adjusted then return true;
            reset ai′ of xj;
        end
    end
    return false;
end

Fig. 1. The polishing algorithm.
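For concreteness, a minimal sketch of the prediction phase is given below using scikit-learn; the choice of a decision tree as the underlying classifier and the encoding of instances as an integer matrix whose last column is the class are our assumptions, and the adjustment phase (the flip procedure) is omitted.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_predict

def prediction_phase(data, n_folds=10):
    # data: 2-D integer array; the last column is the class, the others are attributes.
    # Returns, for each attribute index, the list of (row, predicted value) pairs whose
    # cross-validated prediction disagrees with the stored value.
    n_attrs = data.shape[1] - 1
    att_list = {}
    for a in range(n_attrs):
        # Swap roles: predict attribute a from the remaining attributes plus the class.
        X = np.delete(data, a, axis=1)
        y = data[:, a]
        cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
        pred = cross_val_predict(DecisionTreeClassifier(), X, y, cv=cv)
        att_list[a] = [(j, int(p)) for j, p in enumerate(pred) if p != y[j]]
    return att_list

# Toy demo with random categorical data (30 instances, 4 attributes plus a class column).
rng = np.random.default_rng(0)
demo = rng.integers(0, 3, size=(30, 5))
print(prediction_phase(demo, n_folds=5))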
4.1 Data Description
Below we will describe a data set consisting of information on sequences of amino acids, each with a point mutation in COLIA1. The sequences are divided into lethal (type II) and non-lethal (types I, III, and IV) forms of OI. The objective is to generate a classification scheme that will help us understand and differentiate between lethal and non-lethal forms of OI. Each instance in the data set contains the following attributes.
A1, ..., A29: a sequence of 29 amino acids. These are the amino acids at and around the site of the mutation. The mutated residue is centered at A15, with 14 amino acids on each side in the sequence. Each attribute Ai can take on one of 21 values: each of the 20 regular amino acids (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V), and hydroxyproline (O), a modified proline (P) which can be found in collagen molecules. Four attributes provide supplementary information regarding hydrogen bonds in the molecules.
S-W: number of solute hydrogen bonds, wild type;
S-M: number of solute hydrogen bonds, mutated type;
SS-W: number of solute-solvent hydrogen bonds, wild type;
SS-M: number of solute-solvent hydrogen bonds, mutated type.
These are the numbers of hydrogen bonds of the specified types that are present in the wild (un-mutated) and mutated protein molecules more than 80% of the time. The class of each instance can be one of two values.
y: lethal OI (type II); n: non-lethal OI (types I, III, or IV).
Thus, each instance contains 33 attributes and a binary classification.
4.2 Data Characteristics
A number of characteristics of the OI data set suggest that it is an appropriate candidate for polishing and feature selection. First of all, the amino acid sequence and associated information are prone to noise arising from the clinical procedures. Thus, there is a need for an effective measure for noise handling. The number of possible values for many of the attributes is fairly large, resulting in a data set that is sparse with little redundancy. This makes it undesirable to use an information-inefficient mechanism such as filtering for noise handling, since discarding any data instance is likely to lose some valuable information that is not duplicated in the remaining portion of the data set. While the precise relationship between the different amino acid blocks is not clear, we do know that they interact, and this inter-relationship between amino acids in a sequence can be exploited to nominate replacement values for the attributes using the polishing mechanism. In addition, the conformation of collagen molecules is exceptionally linear, and thus we can expect that each attribute may be predicted to a certain extent by considering only the values of the adjacent attributes in the sequence. Furthermore, we are interested not only in the predictive accuracy of the classifier but also in identifying the relevant features contributing to the lethal phenotype of OI and the relationship between these features. We have previously observed that many of the attributes may not be relevant [Hewett et al., 2002; Teng, 2003], in the sense that the classifier may make use of only a few of the available attributes. This makes it desirable to incorporate a feature selection procedure that may increase the intelligibility of the resulting classifier as well as improve the accuracy of the prediction by removing potentially confounding attributes.
5
Experiments
We used the decision tree builder c4.5 [Quinlan, 1993] to provide our basic classifiers, and performed ten-fold cross-validation on the OI data set described in the previous section. In each trial, nine parts of the data were used for training and a tenth part was reserved for testing. The training data was polished and the polished data was
Table 1. Average classification accuracy and size of the decision trees constructed from the unpolished and polished data. (a) All attributes were used. The difference between the classification accuracies of the pruned unpolished and polished cases is significant at the 0.05 level. (b) Only those attributes reported in Table 2 were used. The differences between the corresponding classification accuracies of the unpruned trees in (a) and (b) are significant at the 0.05 level.

(a) Using all attributes
              Unpruned                 Pruned
              Accuracy   Tree Size     Accuracy   Tree Size
Unpolished    46.5%      91.4          60.0%      11.6
Polished      53.0%      94.8          66.0%      11.4

(b) Using only attributes reported in Table 2
              Unpruned                 Pruned
              Accuracy   Tree Size     Accuracy   Tree Size
Unpolished    71.0%      34.4          62.0%      16.7
Polished      73.5%      76.4          66.0%      8.8
then used to construct a decision tree. The unseen (and unpolished) instances in the test data set were classified according to this tree. For each trial a tree was also constructed from the original unpolished training data for comparison purposes. Below we analyze a number of aspects of the results obtained from the experiments, namely, the classifier characteristics (accuracy and size) and the list of relevant attributes selected by the classifiers. We observed that few of the attributes were considered relevant according to this procedure. The experiments were rerun using only the selected attributes as input, and the results were compared to those obtained using all the attributes in the original data set, with and without polishing.
5.1 Classifier Characteristics
The average classification accuracy and size of the decision trees constructed from the unpolished and polished data, using all the available attributes as input, are reported in Table 1(a). The difference between the classification accuracies of the pruned trees constructed from unpolished and polished data is statistically significant at the 0.05 level, using a one-tailed paired t-test. Even though previously we found that polishing led to a decrease in tree size [Teng, 1999; Teng, 2000], in this study the tree sizes resulting from the two approaches do not differ much.
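As an aside, the significance test used throughout is a one-tailed paired t-test over the ten cross-validation folds; the sketch below shows how such a test can be computed, with made-up per-fold accuracies standing in for the actual values, which are not listed in the paper.

from scipy import stats

# Hypothetical per-fold pruned-tree accuracies (not the paper's actual values).
unpolished = [0.55, 0.60, 0.58, 0.62, 0.57, 0.61, 0.63, 0.59, 0.60, 0.65]
polished   = [0.62, 0.66, 0.64, 0.68, 0.63, 0.66, 0.70, 0.65, 0.66, 0.70]

# Paired t-test; the one-tailed p-value (polished > unpolished) is half the
# two-tailed value when the mean difference points in the expected direction.
t, p_two_sided = stats.ttest_rel(polished, unpolished)
p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
print(t, p_one_sided)   # significant at the 0.05 level if p_one_sided < 0.05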
5.2 Relevant Attributes
We looked at the attributes used in constructing the unpolished and polished trees, as these attributes were the ones that were considered predictive of the OI phenotype in the decision tree setting.
Table 2. Relevant attributes, in decreasing order of the average percentage of occurrence in the decision trees.

Unpolished                           Polished
Attribute         % Occurrence       Attribute     % Occurrence
S-M               33.3%              A15           50.0%
S-W               16.7%              A11, A14      25.0%
A15, A20, A22     16.7%
We used the number of trees involving a particular attribute as an indicator of the relevance of that attribute. Table 2 gives the normalized percentages of occurrence of the attributes, averaged over the cross-validation trials, obtained from the trees constructed using the unpolished and polished data sets respectively. The relevant attributes picked out from using the unpolished and polished data are similar, although the rank orders and the percentages of occurrence differ to some extent. We expected A15, the attribute denoting the mutated amino acid in the molecule, to play a significant role in the classification of OI disease types. This was supported by the findings in Table 2. We also noted that A15 was used more frequently in the decision trees constructed from the polished data than in those constructed from the unpolished data. The stronger emphasis placed on this attribute may partially account for the increase in the classification accuracy resulting from polishing. Other attributes that were ranked high in both the unpolished and polished cases include S-M (the number of solute hydrogen bonds, mutated) and S-W (the number of solute hydrogen bonds, wild). The amino acids in the sequence that were of interest in the unpolished and polished trees differed. Domain expertise is needed to further interpret the implications of these results.
5.3 Rebuilding with Selected Attributes
As we discussed above, the results in Table 2 indicated that only a few of the attributes were used in the decision trees. Even though the rest of the attributes were not retained in the pruned trees, they nonetheless entered into the computation, and could have had a distracting effect on the tree building process. We used as a feature scoring mechanism the decision trees built using all the attributes as input. This was similar to the approach taken in [Cardie, 1993], although in our case the same learning method was used for both feature selection and the final classifier induction. We adopted a binary scoring scheme: all and only those attributes that were used in the construction of the trees were selected. These were the attributes reported in Table 2. The classification accuracy and size of the decision trees built using only the features selected from the unpolished and polished data are reported in Table 1(b). The differences between the corresponding classification accuracies
of the unpruned trees in Tables 1(a) and (b) are significant at the 0.05 level, using a one-tailed paired t-test. The accuracy and size of the pruned trees constructed using only the selected attributes do not differ much from those obtained by using all the attributes as input. Pruning was not helpful in this particular set of experiments, perhaps because the data set had already been "cleaned" to some extent by the various preprocessing techniques. In both the unpolished and polished cases, using only the selected attributes gave rise to trees with significantly higher classification accuracy and smaller size than those obtained when all the attributes were included. This suggests that the additional refinement of thinning out the irrelevant attributes is beneficial. In addition, using the polished data as a basis for feature selection can improve to some extent the performance of the learning algorithm over the use of unpolished data for the same task.
6
Remarks
We investigated the effects of polishing and feature selection on a data set describing the genetic disease osteogenesis imperfecta. Both mechanisms, when applied individually, were shown to improve the predictive accuracy of the resulting classifiers. Better performance was obtained by combining the two techniques so that the relevant features were selected based on classifiers built from a polished data set. This suggests that the two methods combined can have a positive impact on the data quality by both correcting noisy values and removing irrelevant and redundant attributes from the input.
References
[Brodley and Friedl, 1999] Carla E. Brodley and Mark A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.
[Cardie, 1993] Claire Cardie. Using decision trees to improve case-based learning. In Proceedings of the Tenth International Conference on Machine Learning, pages 25–32, 1993.
[Clark and Niblett, 1989] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–283, 1989.
[Domingos and Pazzani, 1996] Pedro Domingos and Michael Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 105–112, 1996.
[Drastal, 1991] George Drastal. Informed pruning in constructive induction. In Proceedings of the Eighth International Workshop on Machine Learning, pages 132–136, 1991.
[Gamberger et al., 1996] Dragan Gamberger, Nada Lavrac, and Saso Dzeroski. Noise elimination in inductive concept learning: A case study in medical diagnosis. In Proceedings of the Seventh International Workshop on Algorithmic Learning Theory, pages 199–212, 1996.
[Hewett et al., 2002] Rattikorn Hewett, John Leuchner, Choh Man Teng, Sean D. Mooney, and Teri E. Klein. Compression-based induction and genome data. In Proceedings of the Fifteenth International Florida Artificial Intelligence Research Society Conference, pages 344–348, 2002.
[Hunter and Klein, 1993] Lawrence Hunter and Teri E. Klein. Finding relevant biomolecular features. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pages 190–197, 1993.
[John, 1995] George H. John. Robust decision trees: Removing outliers from databases. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 174–179, 1995.
[Kira and Rendell, 1992] Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In Proceedings of the Ninth International Conference on Machine Learning, pages 249–256, 1992.
[Klein and Wong, 1992] Teri E. Klein and E. Wong. Neural networks applied to the collagenous disease osteogenesis imperfecta. In Proceedings of the Hawaii International Conference on System Sciences, volume I, pages 697–705, 1992.
[Kohavi and John, 1997] Ron Kohavi and George H. John. Wrappers for feature selection. Artificial Intelligence, 97(1–2):273–324, 1997.
[Koller and Sahami, 1996] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 284–292, 1996.
[Kononenko, 1991] Igor Kononenko. Semi-naive Bayesian classifier. In Proceedings of the Sixth European Working Session on Learning, pages 206–219, 1991.
[Langley et al., 1992] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 223–228, 1992.
[Liu and Motoda, 1998] Huan Liu and Hiroshi Motoda, editors. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998.
[Mitchell, 1997] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[Mooney et al., 2001] Sean D. Mooney, Conrad C. Huang, Peter A. Kollman, and Teri E. Klein. Computed free energy differences between point mutations in a collagen-like peptide. Biopolymers, 58:347–353, 2001.
[Quinlan, 1987] J. Ross Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies, 27(3):221–234, 1987.
[Quinlan, 1993] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[Rousseeuw and Leroy, 1987] Peter J. Rousseeuw and Annick M. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, 1987.
[Teng, 1999] Choh Man Teng. Correcting noisy data. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 239–248, 1999.
[Teng, 2000] Choh Man Teng. Evaluating noise correction. In Proceedings of the Sixth Pacific Rim International Conference on Artificial Intelligence. Springer-Verlag, 2000.
[Teng, 2001] Choh Man Teng. A comparison of noise handling techniques. In Proceedings of the Fourteenth International Florida Artificial Intelligence Research Society Conference, pages 269–273, 2001.
[Teng, 2003] Choh Man Teng. Noise correction in genomic data. In Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning. Springer-Verlag, 2003. To appear.
Pre-computing Approximate Hierarchical Range Queries in a Tree-Like Histogram
Francesco Buccafurri and Gianluca Lax
DIMET, Università degli Studi Mediterranea di Reggio Calabria
Via Graziella, Località Feo di Vito, 89060 Reggio Calabria, Italy
{bucca,lax}@ing.unirc.it
Abstract. Histograms are a lossy compression technique widely applied in various application contexts, like query optimization, statistical and temporal databases, OLAP applications, and so on. This paper presents a new histogram based on a hierarchical decomposition of the original data distribution kept in a complete binary tree. This tree, thus containing a set of pre-computed hierarchical queries, is encoded in a compressed form using bit saving in representing integer numbers. The approach, extending a recently proposed technique based on the application of such a decomposition to the buckets of a pre-existing histogram, is shown by several experiments to improve the accuracy of the state-of-the-art histograms.
1
Introduction
Histograms are a lossy compression technique widely applied in various application contexts, like query optimization [9], statistical [5] and temporal databases [12], and, more recently, OLAP applications [4, 10]. In OLAP, compression allows us to obtain fast approximate answers by evaluating queries on reduced data in place of the original ones. Histograms are well suited to this purpose, especially in the case of range queries. Indeed, the buckets of a histogram basically correspond to a set of pre-computed range queries, allowing us to estimate the remaining possible range queries. Estimation is needed when the range query partially overlaps a bucket. As a consequence, the problem of minimizing the estimation error becomes crucial in the context of OLAP applications. In this work we propose a new histogram, extending the approach used in [2] for the estimation inside the bucket. The histogram, called nLT, consists of a tree-like index, with a number of levels depending on the fixed compression ratio. Nodes of the index contain, hierarchically, pre-computed range queries, stored by an approximate (via bit saving) encoding. Compression derives both from the aggregation implemented by the leaves of the tree and from the saving of bits obtained by representing range queries with fewer than 32 bits (assumed enough for an exact representation). The number of bits used for representing range queries decreases for increasing levels of the tree. Peculiar characteristics of our histogram are the following:
Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 350–359, 2003.
© Springer-Verlag Berlin Heidelberg 2003
1. Due to bit saving, the number of pre-computed range queries embedded in our histogram is larger than in a bucket-based histogram occupying the same storage space. Observe that such queries are stored in an approximate form. However, the hierarchical organization of the index allows us to express the value of a range query as a fraction of the range query including it (i.e., the one corresponding to the parent node in the tree), and this allows us to keep the numeric approximation low. In the absence of the tree, the values of range queries would be expressed as a fraction of the maximum value (i.e., the query involving the entire domain).
2. The histogram directly supports hierarchical range queries, which represent a meaningful type of query in the OLAP context [4].
3. The evaluation of a range query can be executed by visiting the tree from the root to a leaf (in the worst case), thus with a cost logarithmic in the number of smallest pre-computed range queries (this number is the counterpart of the number of buckets of a classic histogram, on which the cost of query evaluation depends linearly).
4. The update of the histogram (we refer here to the case of the change of a single occurrence frequency) can be performed without reconstructing the entire tree, but only by updating the nodes on the path connecting the leaf involved by the change with the root of the tree. Also this task is hence feasible in logarithmic time.
While the last three points above describe evidently positive characteristics of the proposed method, the first point needs some kind of validation to be considered effectively a point in favor of our proposal. Indeed, it is not a priori clear whether having a larger set of approximate pre-computed queries (even if this approximation is reduced by the hierarchical organization) is better than having a smaller set of exact pre-computed range queries. In this work we try to give an answer to this question through an experimental comparison with the most relevant histograms proposed in the literature. Thus, the main contribution of the paper is to conclude that keeping pre-computed hierarchical range queries (with a suitable numerical approximation done by bit saving) improves the accuracy of histograms, not only when the hierarchical decomposition is applied to the buckets of pre-existing histograms (as shown in [2]), but also when the technique is applied to the entire data distribution. The paper is organized as follows. In the next section we illustrate histograms. Our histogram is presented in Section 3. Section 4 reports results of experiments conducted on our histogram and several other ones. Finally, we give conclusions in Section 5.
2
Histograms
Histograms are used for reducing relations in order to give approximate answers to range queries on such relations. Let X be an attribute of a relation R. W.l.o.g., we assume that the domain U of the attribute X is the interval of integer numbers from 1 to |U |1 . The set of frequencies is the set F = {f (1), ..., f (|U |)} where f (i) 1
|U | denotes the cardinality of the set U
is the number of occurrences of the value i in the relation R, for each 1 ≤ i ≤ |U|. The set of values is V = {i ∈ U such that f(i) > 0}. From now on, consider given R, X, F and V. A bucket B on X is a 4-tuple ⟨lb, ub, t, c⟩, with 1 ≤ lb < ub ≤ |U|, t = |{i ∈ V : lb ≤ i ≤ ub}| and c = Σ_{i=lb}^{ub} f(i). lb and ub are said, respectively, lower bound and upper bound of B, t is said number of non-null values of B and c is the sum of frequencies of B. A histogram H on X is an h-tuple ⟨B1, ..., Bh⟩ of buckets such that (1) ∀ 1 ≤ i < h, the upper bound of Bi precedes the lower bound of Bi+1, and (2) ∀ j with 1 ≤ j ≤ |U|, f(j) > 0 ⇒ ∃ i ∈ [1, h] such that j ∈ Bi. Given a histogram H and a range query Q, it is possible to return an estimation of the answer to Q using the information contained in H. At this point the following problem arises: how to partition the domain U into b buckets in order to minimize the estimation error? According to the criterion used for partitioning the domain, there are different classes of histograms (we report here only the most important ones):
1. Equi-sum Histograms [9]: buckets are obtained in such a way that the sum of occurrences in each bucket is equal to 1/b times the total sum of occurrences.
2. MaxDiff Histograms [9, 8]: each bucket has its upper bound in Vi ∈ V (the set of attribute values actually appearing in the relation R) if |φ(Vi) − φ(Vi+1)| is one of the b − 1 highest computed values, for each i. φ(Vi) is said area and is obtained as f(Vi) · (Vi+1 − Vi).
3. V-Optimal Histograms [6]: the boundaries of each bucket, say lbi and ubi (with 1 ≤ i ≤ b), are fixed in such a way that Σ_{i=1}^{b} SSE_i is minimum, where SSE_i = Σ_{j=lbi}^{ubi} (f(j) − avg_i)^2 and avg_i is the average of the frequencies occurring in the i-th bucket.
In the part of the work devoted to experiments (see Section 4), among the above presented bucket-based histograms, we have considered only MaxDiff and V-Optimal histograms, as it was shown in the literature that they have the best performances in terms of accuracy. In addition, we will also consider two further bucket-based histograms, called MaxDiff4LT and V-Optimal4LT. Such methods have been proposed in [2], and consist of adding a 32-bit tree-like index, called 4LT, to each bucket of either a MaxDiff or a V-Optimal histogram. The 4LT is used for computing, in an approximate way, the frequency sums of 8 non-overlapping sub-ranges of the bucket. We observe that the idea underlying the proposal presented in this paper takes its origin just from the 4LT method, extending the application of such an approach to the construction of the entire histogram instead of single buckets. There are other kinds of histograms whose construction is not driven by the search of a suitable partition of the attribute domain and, further, whose structure is more complex than simply a set of buckets. We call such histograms non bucket-based. Two important examples of histograms of this type are wavelet-based and binary-tree histograms. Wavelets are mathematical transformations implementing a hierarchical decomposition of functions, originally used in different
research and application contexts, like image and signal processing [7, 13]. Recent studies have shown the applicability of wavelets to selectivity estimation [6] as well as to the approximation of OLAP range queries over datacubes [14, 15]. A wavelet-based histogram is not a set of buckets; it consists of a set of wavelet coefficients and a set of indices by which the original frequency set can be reconstructed. Histograms are obtained by applying one of these transformations to the original cumulative frequency set (extended over the entire attribute domain) and selecting, among the N wavelet coefficients, the m < N most significant coefficients, for m corresponding to the desired storage usage. The binary-tree histogram [1] is also based on a hierarchical multiresolution decomposition of the data distribution operating in a quad-tree fashion, adapted to the mono-dimensional case. Besides the bucket-based histograms, both the above types of histograms are compared experimentally in this paper with our histogram, which is a non bucket-based histogram too.
3
The nLT Histogram
In this section we describe the proposed histogram, called nLT. Like wavelet and binary-tree histograms, the nLT is a non bucket-based histogram. Given a positive integer n, an nLT histogram on the attribute X is a full binary tree with n levels such that each node N is a 3-tuple ⟨l(N), u(N), val(N)⟩, where 1 ≤ l(N) < u(N) ≤ |U| and val(N) = Σ_{i=l(N)}^{u(N)} f(i). l(N) and u(N) are said, respectively, lower bound and upper bound of N, and val(N) is said value of N. Observe that the interval of the domain of X with boundaries l(N) and u(N) is associated to N. We denote by r(N) such an interval. Moreover, val(N) is the sum of the occurrence frequencies of X within such an interval. The root node, denoted by N0, is such that l(N0) = 1 and u(N0) = |U|. Given a non-leaf node N, the left-hand child node, say Nfs, is such that l(Nfs) = l(N) and u(Nfs) = ⌊(u(N)+l(N))/2⌋, while the right-hand child node, say Nfd, is such that l(Nfd) = ⌊(u(N)+l(N))/2⌋ + 1 and u(Nfd) = u(N). Concerning the implementation of the nLT, we observe that it is not necessary to keep the lower and upper bounds of the nodes, since they can be derived from the knowledge of n and the position of the node in the tree. Moreover, we do not have to keep the value of any right-hand child node either, since such a value can be obtained as the difference between the value of the parent node and the value of the sibling node. In Figure 1 an example of nLT with n = 3 is reported. The nLT of this example refers to a domain of size 12 with 3 null elements. For each node (represented as a box), we report the boundaries of the associated interval (on the left side and on the right side, respectively) and the value of the node (inside the box). Grey nodes can be derived from white nodes; thus, they are not stored.
⌊x⌋ denotes the application of the floor operator to x.
Fig. 1. Example of nLT
The storage space required by the nLT, in case integers are encoded using t bits, is t · 2^(n−1). We assume that t = 32 is enough for representing integer values with no scaling approximation. In the following we will refer to this kind of nLT implementation as the exact implementation of the nLT or, for short, the exact nLT. In the next section we will illustrate how to reduce the storage space by varying the number of bits used for encoding the values of the nodes. Of course, to the lossy compression due to the linear interpolation needed for retrieving all the non pre-computed range queries, we add another lossy compression given by the bit saving.
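To make the structure concrete, the sketch below builds an exact nLT over a small frequency vector and answers a prefix range query (X ≤ d) by descending from the root, interpolating linearly inside the reached leaf; the dictionary-based node layout and the function names are our own illustration.

def build_nlt(freq, n_levels):
    # Build an exact nLT over freq (1-based domain positions, 0-based list indexing).
    def make(l, u, level):
        node = {"l": l, "u": u, "val": sum(freq[l - 1:u])}
        if level < n_levels and l < u:
            mid = (l + u) // 2
            node["left"] = make(l, mid, level + 1)
            node["right"] = make(mid + 1, u, level + 1)
        return node
    return make(1, len(freq), 1)

def prefix_query(node, d):
    # Estimate sum(f(1..d)); inside a leaf the covered fraction is interpolated linearly.
    if d >= node["u"]:
        return node["val"]
    if "left" not in node:
        width = node["u"] - node["l"] + 1
        covered = d - node["l"] + 1
        return node["val"] * covered / width if covered > 0 else 0
    if d <= node["left"]["u"]:
        return prefix_query(node["left"], d)
    return node["left"]["val"] + prefix_query(node["right"], d)

freq = [3, 0, 1, 4, 0, 0, 2, 5, 1, 0, 6, 2]   # domain of size 12 with null elements
root = build_nlt(freq, n_levels=3)
print(prefix_query(root, 5))                   # approximate answer to X <= 5 (exact value is 8)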
3.1 Approximate nLT
In this section we describe the approximate nLT, that is, an implementation of the nLT which uses variable-length encoding of integer numbers. In particular, all nodes which belong to the same level of the tree are represented with the same number of bits. Each time we go down one level, we reduce by 1 the number of bits used for representing the nodes of that level. This bit saving allows us to increase the nLT depth (w.r.t. the exact nLT), once the total storage space is fixed, and to have a larger set of pre-computed range queries and thus higher resolution. Substantially, the approach is based on the assumption that, on average, the sum of occurrences over a given interval of the frequency vector is twice the sum of the occurrences over each half of that interval. This assumption is chosen as a heuristic criterion for designing the approximate nLT, and this explains the choice of reducing by 1 per level the number of bits used for representing numbers. Clearly, the sum contained in a given node is represented as a fraction of the sum contained in the parent node. Observe that, in principle, one could also use a representation allowing a possibly different number of bits for nodes belonging to the same level, depending on the actual values contained in the nodes. However, we would then have to deal with the spatial overhead due to these variable codes. The reduction of 1 bit per level appears to be a reasonable compromise. We now describe in more detail how to encode with a certain number of bits, say k, the value of a given node N, denoting by P the parent node of N.
With such a representation, the value val(N) of the node will in general not be recovered exactly: it is affected by a certain scaling approximation. We denote by val~_k(N) the encoding of val(N) done with k bits and by val_k(N) the approximation of val(N) obtained from val~_k(N). We have that:

val~_k(N) = Round( (val(N) / val(P)) × (2^k − 1) )

Clearly, 0 ≤ val~_k(N) ≤ 2^k − 1. Concerning the approximation of val(N), it results:

val_k(N) = (val~_k(N) / (2^k − 1)) × val(P)

The absolute error due to the k-bit encoding of the node N, with parent node P, is ε_a(val(N), val(P), k) = |val(N) − val_k(N)|. It can be easily verified that 0 ≤ ε_a(val(N), val(P), k) ≤ val(P) / 2^(k+1). The relative error is defined as ε_r(val(N), val(P), k) = ε_a(val(N), val(P), k) / val(N). Define now the average relative error (for variable value of the node N) as:

avg ε_r(val(P), k) = (1 / val(P)) × Σ_{i=1}^{val(P)} ε_r(i, val(P), k)

We observe that, for the root node N0, we use 32 bits. Thus, no scaling error arises for such a node, i.e., val(N0) = val_k(N0). It can be proven that the average relative error is null until val(P) reaches the value 2^k and then, after a number of decreasing oscillations, converges to a value independent of val(P) and depending on k. Before proceeding to the implementation of an nLT, we should set the two parameters n and k, that are, we recall, the number of levels of the nLT and the number of bits used for encoding each child node of the root (for the successive levels, as already mentioned, we drop 1 bit per level). Observe that, according to the above observation about the average relative error, setting the parameter k also fixes the average relative error due to the scaling approximation. Thus, in order to reduce such an error, we should set k to a value as large as possible. However, for a fixed compression ratio, this may limit the depth of the tree and thus the resolution of the leaves. As a consequence, the error arising from the linear interpolation done inside leaf nodes increases. The choice of k has thus to solve the above trade-off.
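A small sketch of this k-bit encoding and decoding step (the names and example values are ours):

def encode(val, parent_val, k):
    # Store val(N) as a k-bit fraction of the parent's value.
    return round(val / parent_val * (2 ** k - 1)) if parent_val else 0

def decode(code, parent_val, k):
    # Recover the approximate value val_k(N) from the k-bit code.
    return code / (2 ** k - 1) * parent_val if parent_val else 0

parent, child = 1000, 437
code = encode(child, parent, k=6)          # 6-bit code in [0, 63]
approx = decode(code, parent, k=6)
print(code, approx, abs(child - approx))   # error here is about 7.4, below val(P)/2^(k+1) = 7.8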
n−2
(n − h) × 2h
(1)
h=0
recalling that the root node is encoded with 32 bits. For instance, a nLT with n = 4 and k = 11 uses 32+20 ·11+21 ·10+22 ·9 = 99 bit for representing its nodes.
356
4
Francesco Buccafurri and Gianluca Lax
Experiments on Histograms
In this section we shall conduct several experiments on synthetic data in order to compare the effectiveness of several histograms in estimating range queries.
Available Storage: For our experiments, we shall use a fixed storage space, that is, 42 four-byte numbers, to be in line with the experiments reported in [9], which we replicate.
Techniques: We compare nLT with 6 new and old histograms, fixing the total space required by each technique:
– MaxDiff (MD) and V-Optimal (VO) produce 21 buckets; for each bucket both the upper bound and the value are stored.
– MaxDiff with 4LT (MD4LT) and V-Optimal with 4LT (VO4LT) produce 14 buckets; for each bucket the upper bound, the value, and the 4LT index are stored.
– Wavelet (WA) histograms are constructed using the bi-orthogonal 2.2 decomposition of the MATLAB 5.2 wavelet toolbox. The wavelet approach needs 21 four-byte wavelet coefficients plus another 21 four-byte numbers for storing coefficient positions. We have stored the 21 largest (in absolute value) wavelet coefficients and, in the reconstruction phase, we have set the remaining coefficients to 0.
– Binary-Tree (BT) produces 19 terminal buckets (to reproduce the experiments reported in [1]).
– nLT is obtained by fixing n = 9 and k = 11. Using (1) shown in Section 3.1, the stored space is about 41 four-byte numbers. The choice of k = 11 and, consequently, of n = 9 is done by fixing the average relative error of the highest level of the tree to about 0.15%.
Data Distributions: A data distribution is characterized by a distribution for frequencies and a distribution for spreads. The frequency set and the value set are generated independently, then frequencies are randomly assigned to the elements of the value set. We consider 3 data distributions: (1) D1 = Zipf-cusp max(0.5,1.0). (2) D2 = Zipf-zrand(0.5,1.0): frequencies are distributed according to a Zipf distribution with the z parameter equal to 0.5; spreads follow a ZRand distribution [8] with z parameter equal to 1.0 (i.e., spreads following a Zipf distribution with z parameter equal to 1.0 are randomly assigned to attribute values). (3) D3 = Gauss-rand: frequencies are distributed according to a Gauss distribution; spreads are randomly distributed.
Histogram Populations: A population is characterized by the values of three parameters, T, D and t, and represents the set of histograms storing a relation of cardinality T, attribute domain size D and value set size t (i.e., number of non-null attribute values).
(a)
method/popul.    P1     P2     P3    avg
WA              3.50   3.42   2.99   3.30
MD              4.30   5.78   8.37   6.15
VO              1.43   1.68   1.77   1.63
MD4LT           0.70   0.80   0.70   0.73
VO4LT           0.29   0.32   0.32   0.31
BT              0.26   0.27   0.27   0.27
nLT             0.24   0.24   0.22   0.23

(b)
method/popul.    P1     P2     P3    avg
WA             13.09  13.06   6.08  10.71
MD             19.35  16.04   2.89  12.76
VO              5.55   5.96   2.16   4.56
MD4LT           1.57   1.60   0.59   1.25
VO4LT           1.33   1.41   0.56   1.10
BT              1.12   1.15   0.44   0.90
nLT             0.63   0.69   0.26   0.53
Fig. 2. (a): Errors for distribution 1. (b): Errors for distribution 2
Population P1 is characterized by the following parameter values: D = 4100, t = 500 and T = 100000. Population P2 is characterized by D = 4100, t = 500 and T = 500000. Population P3 is characterized by D = 4100, t = 1000 and T = 500000.

Data Sets: Each data set included in the experiments is obtained by generating, under one of the above described data distributions, 10 histograms belonging to one of the populations specified above. We consider the 9 data sets that are generated by combining all data distributions and all populations. All queries belonging to the query set below are evaluated over the histograms of each data set.

Query Set and Error Metric: In our experiments, we use the query set {X ≤ d : 1 ≤ d ≤ D} (recall that X is the histogram attribute and 1..D is its domain) for evaluating the effectiveness of the various methods. We measure the approximation error made by the histograms on the above query set by using the average relative error (1/Q) · Σ_{i=1}^{Q} e_i^rel, where Q is the cardinality of the query set and e_i^rel = |S_i − S̃_i| / S_i is the relative error of the i-th query, with S_i and S̃_i denoting the actual answer and the estimated answer of the i-th query, respectively. For each population and distribution we have calculated the average relative error.

The table in Figure 2(a) shows good accuracy of all index-based methods on the Zipf-cusp max distribution. In particular, nLT has the best performance, even though the gap w.r.t. the other methods is not large. The error is considerably low for nLT (less than 0.25%) although the compression ratio is very high (about 100). With the second distribution, Zipf rand (see Figure 2(b)), the behavior of the methods becomes more varied: Wavelet and MaxDiff show unsatisfactory accuracy, V-Optimal performs better but its errors are still high, while the index-based methods show very low errors. Once again, nLT reports the minimum error.
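As a small illustration of the error metric just defined, here is a Python sketch of our own (not the authors' code); how queries with a zero actual answer are treated is not stated in the paper, so skipping them is an assumption.

```python
# Sketch: average relative error over the query set {X <= d : 1 <= d <= D}.
def avg_relative_error(actual, estimated):
    """actual[i], estimated[i]: answers of the i-th range query.
    Queries with a zero actual answer are skipped (an assumption)."""
    errs = [abs(a - e) / a for a, e in zip(actual, estimated) if a != 0]
    return sum(errs) / len(errs)

# toy usage with made-up cumulative frequencies
print(avg_relative_error([10, 25, 40], [9, 27, 40]))  # ~0.06, i.e. 6%
```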
method/popul.    P1     P2     P3    avg
WA             14.53   5.55   5.06   8.38
MD             11.65   6.65   3.30   7.20
VO             10.60   6.16   2.82   6.53
MD4LT           3.14   2.32   1.33   2.26
VO4LT           2.32   4.85   1.24   2.80
BT              1.51   3.50   0.91   1.97
nLT             1.38   0.87   0.70   0.99
Fig. 3. Errors for distribution 3

[Figure: average relative error (%) plotted versus data density (%) in the left-hand graph and versus storage space (four-byte numbers) in the right-hand graph, comparing Wavelet, Maxdiff, V-Optimal and nLT.]
Fig. 4. Experimental results
In Figure 3 we report the results of the experiments performed on Gauss data. Due to the high variance, all methods perform worse. nLT also shows a slightly higher error than on the Zipf data, but it remains below 1% on average and still below the error of the other methods. In Figure 4, the average relative error is plotted versus the data density (left-hand graph) and versus the histogram size (right-hand graph). By data density we mean the ratio |V|/|U| between the cardinality of the non-null value set and the cardinality of the attribute domain. By histogram size we mean the number of four-byte numbers used for storing the histogram; this measure is hence related to the compression ratio. In both cases nLT, compared with the classical bucket-based histograms, shows the best performance, with a considerable improvement gap.
5 Conclusion
In this paper we have presented a new non-bucket-based histogram, which we have called nLT. It is based on a hierarchical decomposition of the data distribution kept in a complete n-level binary tree. The nodes of the tree store, in an approximate form (via bit saving), pre-computed range queries on the original data distribution. Besides the capability of the histogram to directly support hierarchical range queries and efficient updating and query answering, we have shown experimentally that it significantly improves on the state of the art in terms of accuracy in estimating range queries.
References
[1] F. Buccafurri, L. Pontieri, D. Rosaci, D. Saccà. Binary-tree Histograms with Tree Indices. DEXA 2002, Aix-en-Provence, France.
[2] F. Buccafurri, L. Pontieri, D. Rosaci, D. Saccà. Improving Range Query Estimation on Histograms. ICDE 2002, San Jose (CA), USA.
[3] F. Buccafurri, D. Rosaci, D. Saccà. Compressed datacubes for fast OLAP applications. DaWaK 1999, Florence, 65-77.
[4] N. Koudas, S. Muthukrishnan, D. Srivastava. Optimal Histograms for Hierarchical Range Queries. Proc. of the Symposium on Principles of Database Systems (PODS), pp. 196-204, Dallas, Texas, 2000.
[5] F. Malvestuto. A Universal-Scheme Approach to Statistical Databases Containing Homogeneous Summary Tables. ACM TODS, 18(4), 678-708, December 1993.
[6] Y. Matias, J. S. Vitter, M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
[7] A. Natsev, R. Rastogi, K. Shim. WALRUS: A Similarity Retrieval Algorithm for Image Databases. In Proceedings of the 1999 ACM SIGMOD Conference on Management of Data, 1999.
[8] V. Poosala. Histogram-based Estimation Techniques in Database Systems. PhD dissertation, University of Wisconsin-Madison, 1997.
[9] V. Poosala, Y. E. Ioannidis, P. J. Haas, E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 294-305, 1996.
[10] V. Poosala, V. Ganti, Y. E. Ioannidis. Approximate Query Answering using Histograms. IEEE Data Engineering Bulletin, Vol. 22, March 1999.
[11] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, T. T. Price. Access path selection in a relational database management system. In Proc. of the ACM SIGMOD International Conference, pages 23-24, 1979.
[12] I. Sitzmann, P. J. Stuckey. Improving Temporal Joins Using Histograms. Proc. of the International Conference on Database and Expert Systems Applications (DEXA), 2000.
[13] E. J. Stollnitz, T. D. DeRose, D. H. Salesin. Wavelets for Computer Graphics. Morgan Kaufmann, 1996.
[14] J. S. Vitter, M. Wang, B. Iyer. Data Cube Approximation and Histograms via Wavelets. In Proceedings of the 1998 CIKM International Conference on Information and Knowledge Management, Washington, 1998.
[15] J. S. Vitter, M. Wang. Approximate Computation of Multidimensional Aggregates of Sparse Data using Wavelets. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999.
Comprehensive Log Compression with Frequent Patterns

Kimmo Hätönen1, Jean-François Boulicaut2, Mika Klemettinen1, Markus Miettinen1, and Cyrille Masson2

1 Nokia Research Center, P.O. Box 407, FIN-00045 Nokia Group, Finland
{kimmo.hatonen,mika.klemettinen,markus.miettinen}@nokia.com
2 INSA de Lyon, LIRIS CNRS FRE 2672, F-69621 Villeurbanne, France
{Jean-Francois.Boulicaut,Cyrille.Masson}@insa-lyon.fr
Abstract. In this paper we present a comprehensive log compression (CLC) method that uses frequent patterns and their condensed representations to identify repetitive information in large log files generated by communications networks. We also show how the identified information can be used to separate and filter out frequently occurring events that hide other events which are unique or occur only a few times. The identification can be done without any prior knowledge about the domain or the events; for example, no pre-defined patterns or value combinations are needed. This separation makes it easier for a human observer to perceive and analyse large amounts of log data. The applicability of the CLC method is demonstrated with real-world examples from data communication networks.
1 Introduction
In the near future telecommunication networks will deploy an open packet-based infrastructure which was originally developed for data communication networks. The monitoring of this new packet-based infrastructure will be a challenge for operators. The old networks will remain up and running for some time still. At the same time, the rollout of the new infrastructure will take place, introducing many new information sources over which the information needed for, e.g., security monitoring and fault analysis will be scattered. These sources can include different kinds of event logs, e.g., firewall logs, operating systems' system logs and different application server logs, to name a few. The problem is becoming worse every day as operators are adding new tools for logging and monitoring their networks. As the requirements for the quality of service perceived by customers gain more importance, the operators are starting to seriously utilise the information that is hidden in these logs. Their interest in analysing their own processes and the operation of their networks increases concurrently. Data mining and knowledge discovery methods are a promising alternative for operators to gain more out of their data. Based on our experience, however,
simple-minded use of discovery algorithms in network analysis poses problems with the amount of generated information and its relevance. In the KDD process [6, 10, 9], it is often reasonable or even necessary to constrain the discovery using background knowledge. If no constraints are applied, the discovered result set of, say, association rules [1, 2] might become huge and contain mostly trivial and uninteresting rules. Also, association and episode rule mining techniques can only capture frequently recurring events according to some frequency and confidence thresholds. This is needed to restrict the search space and thus to keep the computation tractable. Clearly, the thresholds that can be used are not necessarily the ones that denote objective interestingness from the user's point of view. Indeed, rare combinations can be extremely interesting. When considering previously unknown domains, explicit background knowledge is missing, e.g., about the possible or reasonable values of attributes and their relationships.

When it is difficult or impossible to define and maintain a priori knowledge about the system, there is still the possibility to use meta information that can be extracted from the logs. Meta information characterizes different types of log entries and log entry combinations. It can not only be used to help an expert in filtering and browsing the logs manually but also to automatically identify and filter out insignificant log entries. It is possible to reduce the size of an analysed data set to a fraction of its original size without losing any critical information. One type of meta information is frequent patterns. They capture the common value combinations that occur in the logs. Furthermore, such meta information can be condensed by means of, e.g., the closed frequent itemsets [12, 3]. Closed sets form natural inclusion graphs between different covering sets. This type of presentation is quite understandable for an expert and can be used to create hierarchical views. These condensed representations can be extracted directly from highly correlated and/or dense data, i.e., in contexts where the approaches that compute the whole collection FS of the frequent patterns are intractable [12, 3, 17, 13]. They can also be used to regenerate efficiently the whole FS collection, possibly partially and on the fly.

We propose here our Comprehensive Log Compression (CLC) method. It is based on the computation of frequent pattern condensed representations, and we use this presentation as an entry point to the data. The method provides a way to dynamically characterize and combine log data entries before they are shown to a human observer. It finds frequently occurring patterns from dense log data and links the patterns to the data as a data directory. It is also possible to separate recurring data and analyse it separately. In most cases, this reduces the amount of data needed to be evaluated by an expert to a fraction of the original volume. This type of representation is general w.r.t. different log types. Frequent sets can be generated from most logs that have structure and contain repeating symbolic values in their fields, e.g., in Web Usage Mining applications [11, 16]. The main difference between the proposed method and those applications is the objective of the mining task. Most of the web usage applications try to identify and somehow validate common access patterns in web sites. These patterns are then used to do some sort of optimization of the site.
...
777;11May2000; 0:00:23;a_daemon;B1;12.12.123.12;tcp;;
778;11May2000; 0:00:31;a_daemon;B1;12.12.123.12;tcp;;
779;11May2000; 0:00:32;1234;B1;255.255.255.255;udp;;
781;11May2000; 0:00:43;a_daemon;B1;12.12.123.12;tcp;;
782;11May2000; 0:00:51;a_daemon;B1;12.12.123.12;tcp;;
...
Fig. 1. An example of a firewall log

The proposed method, however, does not say anything about the semantic correctness of, or the relations between, the found frequent patterns. It only summarizes the most frequent value combinations in the entries. This gives either a human expert or computationally more intensive algorithms a chance to continue with data that does not contain the overly common and trivial entries. Based on our experience with real-life log data, e.g., large application and firewall logs, an original data set of tens of thousands of rows can often be represented by just a couple of identified patterns and the exceptions not matching these patterns.
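To make the data model concrete, the following Python sketch (ours) parses raw lines such as those of Figure 1 into (field, value) pairs. The field names and their order are assumptions made for illustration; the paper does not name the raw columns, so this mapping is only our reading of the sample.

```python
# Sketch: turn raw firewall log lines into (field, value) pairs.
# The field list below is an assumption based on the sample in Figure 1.
FIELDS = ["No", "Date", "Time", "Service", "Src", "Destination", "Proto", "SPort"]

def parse_entry(line: str) -> dict:
    values = [v.strip() for v in line.strip().split(";")]
    return dict(zip(FIELDS, values))

entry = parse_entry("777;11May2000; 0:00:23;a_daemon;B1;12.12.123.12;tcp;;")
print(entry["Service"], entry["Proto"])   # a_daemon tcp
```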
2 Log Data and Log Data Analysis
Log data consists of entries that represent a specific condition or an event that has occurred somewhere in the system. The entries have several fields, which are called variables from now on. The structure of the entries might change over time from one entry to another, although some variables are common to all of them. Each variable has a set of possible values called a value space. The values of one value space can be considered as binary attributes. The value spaces of different variables are kept separate. A small example of log data is given in Figure 1. It shows a sample from a log file produced by CheckPoint's Firewall-1.

In a data set the value range of a variable value space might be very large or very limited. For example, there may be only a few firewalls in an enterprise, but every IP address on the internet might try to contact the enterprise domain. There are also several variables that have such a large value space but contain only a fraction of the possible values. Therefore, it is impractical and almost impossible to fix the size of the value spaces as a priori knowledge. A log file may also be very large: during one day, millions of lines might accumulate in a log file. One way to browse the data is either to search for patterns that are known to be interesting with high probability or to filter out patterns that most probably are uninteresting. A system can assist in this, but the evaluation of interestingness is left to an expert. To be able to make the evaluation, an expert has to check the found log entries by hand. He has to return to the original log file and iteratively check all the probably interesting entries and their surroundings. Many of the most dangerous attacks are new and unseen for an enterprise defense system. Therefore, when the data exploration is limited only to known patterns, it may be impossible to find new attacks.

Comprehensive Log Compression (CLC) is an operation where meta information is extracted from the log entries and used to summarize redundant entries without losing any important information.
{Proto:tcp, Service:a_daemon, Src:B1} 11161
{Proto:tcp, SPort:, Src:B1} 11161
{Proto:tcp, SPort:, Service:a_daemon} 11161
{SPort:, Service:a_daemon, Src:B1} 11161
...
{Destination:123.12.123.12, SPort:, Service:a_daemon, Src:B1} 10283
{Destination:123.12.123.12, Proto:tcp, Service:a_daemon, Src:B1} 10283
{Destination:123.12.123.12, Proto:tcp, SPort:, Src:B1} 10283
{Destination:123.12.123.12, Proto:tcp, SPort:, Service:a_daemon} 10283
{Proto:tcp, SPort:, Service:a_daemon, Src:B1} 11161
...
{Destination:123.12.123.12, Proto:tcp, SPort:, Service:a_daemon, Src:B1} 10283
Fig. 2. A sample of frequent sets extracted from a firewall log
By combining log entries with their frequencies and identifying recurring patterns, we are able to separate correlating entries from infrequent ones and display them with accompanying information. Thus, an expert gets a more comprehensive overview of the logged system and can identify interesting phenomena and concentrate on his analysis. The summary has to be understandable for an expert and must contain all the relevant information that is available in the original log. The presentation also has to provide a mechanism to move back and forth between the summary and the original logs.

Summarization can be done by finding correlating value combinations from a large amount of log entries. Due to the nature of the logging mechanism, there are always several value combinations that are common to a large number of the entries. When these patterns are combined with information about how the uncorrelated values change w.r.t. these correlating patterns, the result is a comprehensive description of the contents of the logs. In many cases it is possible to detect such patterns by browsing the log data, but unfortunately this is also tedious: e.g., a clever attack against the firewall cluster of an enterprise is scattered over all of its firewalls and executed slowly from several different IP addresses using all the possible protocols alternately.

Figure 2 provides a sample of frequent sets extracted from the data introduced in Figure 1. In Figure 2, the last pattern, which contains five attributes, has five subpatterns, out of which four have the same frequency as the longer pattern and only one has a larger frequency. In fact, many frequent patterns have the same frequency, and it is the key idea of the frequent closed set mining technique to consider only some representative patterns, i.e., the frequent closed itemsets (see the next section for a formalization). Figure 3 gives a sample of frequent closed sets that correspond to the frequent patterns shown in Figure 2. An example of the results of applying the CLC method to a firewall log data set can be seen in Table 1. It shows the three patterns with the highest coverage values found from the firewall log introduced in Figure 1. If the supports of these patterns are combined, then 91% of the data in the log is covered. The blank fields in the table are intentionally left empty in the original log data. The fields marked with '*' can have varying values.
{Proto:tcp, SPort:, Service:a_daemon, Src:B1} 11161
{Destination:123.12.123.12, Proto:tcp, SPort:, Service:a_daemon, Src:B1} 10283
{Destination:123.12.123.13, Proto:tcp, SPort:, Service:a_daemon, Src:B1} 878
Fig. 3. A sample of closed sets extracted from a firewall log

Table 1. The three most frequent patterns found from a firewall log

No   Destination       Proto  SPort  Service    Src  Count
1.   *                 tcp           A_daemon   B1   11161
2.   255.255.255.255   udp           1234       *     1437
3.   123.12.123.12     udp           B-dgm      *     1607
For example, in pattern 1 the field 'Destination' gets two different values on the lines matched by it, as shown in Figure 3.
3 Formalization
The definition of a LOG pattern domain consists of the definition of a language of patterns L, evaluation functions that assign a description to each pattern in a given log r, and languages for primitive constraints that specify the desired patterns. We introduce some notations that are used for defining the LOG pattern domain. A so-called log contains the data in the form of log entries, and patterns are the so-called itemsets, which are sets of (field, value) pairs of log entries.

Definition 1 (Log). Assume that Items is a finite set of (field, value) pairs denoted by the field name combined with a value, e.g., Items = {A:a_i, B:b_j, C:c_k, ...}. A log entry t is a subset of Items. A log r is a finite and non-empty multiset r = {e_1, e_2, ..., e_n} of log entries.

Definition 2 (Itemsets). An itemset is a subset of Items. The language of patterns for itemsets is L = 2^Items.

Definition 3 (Constraint). If T denotes the set of all logs and 2^Items the set of all itemsets, an itemset constraint C is a predicate over 2^Items × T. An itemset S ∈ 2^Items satisfies a constraint C in the database r ∈ T iff C(S, r) = true. When it is clear from the context, we write C(S).

Evaluation functions return information about the properties of a given itemset in a given log. These functions provide an expert with information about the events and conditions in the network. They also form the basis for summary creation. They are used to select proper entry points to the log data.

Definition 4 (Support for Itemsets). A log entry e supports an itemset S if every item in S belongs to e, i.e., S ⊆ e. The support (denoted support(S, r)) of an itemset S is the multiset of all log entries of r that support S (e.g., support(∅) = r).
Definition 5 (Frequency). The frequency of an itemset S in a log r is defined by F(S, r) = |support(S, r)|, where |.| denotes the cardinality of the multiset.

Definition 6 (Coverage). The coverage of an itemset S in a log r is defined by Cov(S, r) = F(S, r) · |S|, where |.| denotes the cardinality of the itemset S.

Definition 7 (Perfectness). The perfectness of an itemset S in a log r is defined by Perf(S, r) = Cov(S, r) / Σ_{e_i ∈ support(S,r)} |e_i|, where |e_i| denotes the cardinality of the log entry e_i. Notice that if the cardinality of all log entries is constant, then Perf(S, r) = Cov(S, r) / (F(S, r) · |e|), where e is an arbitrary log entry.

Primitive constraints are a tool set that is used to create and control summaries. For instance, the summaries are composed by using the frequent (closed) sets, i.e., sets that satisfy a conjunction of a minimal frequency constraint and the closeness constraint, plus the original data.

Definition 8 (Minimal Frequency). Given an itemset S, a log r, and a frequency threshold γ ∈ [1, |r|], Cminfreq(S, r) ≡ F(S, r) ≥ γ. Itemsets that satisfy Cminfreq are called γ-frequent or frequent in r.

Definition 9 (Minimal Perfectness). Given an itemset S, a log r, and a perfectness threshold π ∈ [0, 1], Cminperf(S, r) ≡ Perf(S, r) ≥ π. Itemsets that satisfy Cminperf are called π-perfect or perfect in r.

Definition 10 (Closures, Closed Itemsets and Constraint Cclose). The closure of an itemset S in r (denoted by closure(S, r)) is the maximal (for set inclusion) superset of S which has the same support as S. In other terms, the closure of S is the set of items that are common to all the log entries which support S. A closed itemset is an itemset that is equal to its closure in r, i.e., we define Cclose(S, r) ≡ closure(S, r) = S.

Closed itemsets are maximal sets of items that are supported by a multiset of log entries. If we consider the equivalence classes that group all the itemsets having the same closure (and thus the same frequency), the closed sets are the maximal elements of each equivalence class. Thus, when the collection FS of the frequent itemsets is available, a simple post-processing technique can be applied to compute only the frequent closed itemsets. When the data is sparse, it is possible to compute FS, e.g., by using Apriori-like algorithms [2]. However, the number of frequent itemsets can be extremely large, especially in dense logs that contain many highly correlated field values. In that case, computing FS might not be feasible, while the frequent closed sets CFS can often be computed for the same frequency threshold or even a lower one: CFS = {φ ∈ L | Cminfreq(φ, r) ∧ Cclose(φ, r) satisfied}. On one hand, FS can be efficiently derived from CFS without scanning the data again [12, 3]. On the other hand, CFS is a compact representation of the information about every frequent set and its frequency and thus fulfills the needs of CLC. Several algorithms can compute the frequent closed sets efficiently. In this work, we compute the frequent closed sets by
computing the frequent free sets and providing their closures [4, 5]. This is efficient since the freeness property is anti-monotonic, a key property for efficient processing of the search space.

For a user, displaying the adequate information is the most important phase of the CLC method. This phase gets the original log file and a condensed set of frequent patterns as input. An objective of the method is to select the most informative patterns as starting points for navigating the condensed set of patterns and the data. As has been shown [12], the frequent closed sets give rise to a lattice structure, ordered by set inclusion. These inclusion relations between patterns can be used as navigational links. Which patterns are the most informative depends on the application and the task at hand. There are at least three possible measures that can be used to sort the patterns: frequency, i.e., on how many lines the pattern exists in a data set; perfectness, i.e., how large a part of each matching line has been fixed by the pattern; and coverage of the pattern, i.e., how large a part of the database is covered by the pattern. Coverage is a measure which balances the trade-off between patterns that are short but whose frequency is high and patterns that are long but whose frequency is lower. Selection of the most informative patterns can also be based on optimality w.r.t. coverage: it is possible that an expert wishes to see only the n most covering patterns, or the most covering patterns that together cover more than m% of the data. Examples of optimality constraints are considered in [14, 15].

An interesting issue is the treatment of patterns whose perfectness is close to zero. It is often the case that the support of such a small pattern is almost entirely covered by the supports of larger patterns of which the small pattern is a subset. The most interesting property of such lines is the possibility of finding those rare and exceptional entries that are not covered by any of the frequent patterns. In the domain that we are working on, log entries of telecommunication applications, we have found that coverage and perfectness are very good measures for finding good and informative starting points for pattern and data browsing. This is probably due to the fact that if there are too many fields without fixed values, then the meaning of the entry is not clear and such patterns are not understandable for an expert. On the other hand, in those logs there are a lot of repeating patterns whose coverage is high and whose perfectness is close to 100 percent.
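The following Python sketch (ours, not the authors' implementation) computes the evaluation functions of this section by brute force on a toy log; it is only meant to make the definitions concrete, and the tiny example data are made up.

```python
# Sketch: frequency, coverage, perfectness and closure, computed naively.
# Log entries and itemsets are sets of (field, value) pairs.
def support(S, log):
    return [e for e in log if S <= e]                # Definition 4

def frequency(S, log):
    return len(support(S, log))                      # Definition 5

def coverage(S, log):
    return frequency(S, log) * len(S)                # Definition 6

def perfectness(S, log):
    covered = support(S, log)                        # Definition 7
    return coverage(S, log) / sum(len(e) for e in covered) if covered else 0.0

def closure(S, log):
    covered = support(S, log)                        # Definition 10
    return set.intersection(*map(set, covered)) if covered else set()

log = [
    {("Proto", "tcp"), ("Service", "a_daemon"), ("Src", "B1")},
    {("Proto", "tcp"), ("Service", "a_daemon"), ("Src", "B1")},
    {("Proto", "udp"), ("Service", "1234"), ("Src", "B1")},
]
S = {("Proto", "tcp")}
print(frequency(S, log), coverage(S, log), perfectness(S, log))  # 2 2 0.333...
print(closure(S, log))   # the closed itemset containing ('Proto', 'tcp')
```

Sorting the closed sets found this way by coverage gives exactly the kind of "most covering patterns" selection discussed above.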
4 Experiments
Our experiments were done with two separate log sets. The first of them was a firewall log that was divided into several files so that each file contained entries logged during one day. From this collection we selected logs of four days with which we executed the CLC method with different frequency thresholds. The purpose of this test was to find out how large a portion of the original log it is possible to cover with the patterns found and what the optimal value for the
Table 2. Summary of the CLC experiments with firewall data

            Day 1                       Day 2                       Day 3                       Day 4
Sup    Freq Clsd Sel Lines   %     Freq Clsd Sel Lines   %     Freq Clsd Sel Lines   %     Freq Clsd Sel Lines   %
100    8655   48   5  5162  96.3   9151   54   5 15366  98.6  10572   82   7 12287  97.1   8001   37   4  4902  97.3
 50    9213   55   6  5224  97.5   9771   66   7 15457  99.2  11880   95  11 12427  98.2   8315   42   5  4911  97.5
 10   11381   74  12  5347  99.8  12580   88  12 15537  99.7  19897  155  19 12552  99.2  10079   58   8  4999  99.2
  5   13013   82  13  5351  99.9  14346  104  14 15569  99.9  22887  208  20 12573  99.3  12183   69  10  5036  99.9
Tot                   5358                      15588                       12656                       5039
frequency threshold would be. A summary of the experiment results is presented in Table 2. It shows, for each daily firewall log file, the number of frequent sets (Freq), the number of closed sets (Clsd) derived from those, the number of selected closed sets (Sel), the number of lines that the selected sets cover (Lines) and how large a part of the log these lines cover (%). The tests were executed with several frequency thresholds (Sup). The pattern selection was based on the coverage of each pattern. As can be seen from the results, already with the rather high frequency threshold of 50 lines the coverage percentage is high. With this threshold there were, e.g., only 229 (1.8%) lines not covered in the log file of day 3. This was basically because there was an exceptionally well distributed port scan during that day. Those entries were so fragmented that they escaped the CLC algorithm, but they were clearly visible when all the other information was taken away. Table 2 also shows the sizes of the different representations compared to each other. As can be seen, the reduction from the number of frequent sets to the number of closed sets is remarkable. However, by selecting the most covering patterns, it is possible to reduce the number of shown patterns to very few without losing the descriptive power of the representation.

Another data set that was used to test our method was an application log of a large software system. The log contains information about the execution of different application modules. The main purpose of the log is to provide information for system operation, maintenance and debugging. The log entries provide a continuous flow of data, not the occasional bursts that are typical for firewall entries. The interesting things in the flow are the possible error messages that are rare and often hidden in the mass. The size of the application log was more than 105 000 lines, which were collected during a period of 42 days. From these entries, with a frequency threshold of 1000 lines (about 1%), the CLC method was able to identify 13 interesting patterns that covered 91.5% of the data. When the frequency threshold was lowered further to 50 lines, the coverage rose to 95.8%. With that threshold value, 33 patterns were found. The resulting patterns, however, started to be so fragmented that they were not very useful anymore.

These experiments show the usefulness of the condensed representation of the frequent itemsets by means of the frequent closed itemsets. In a data set like a firewall log, it is possible to select only a few of the most covering frequent closed sets found and still cover the majority of the data. After this bulk has been
removed from the log, it is much easier for any human expert to inspect the rest of the log, even manually.

Notice also that the computation of our results has been easy. This is partly because the test data sets reported here are not very large, the largest set being a little over 100 000 lines. However, in the real environment of a large corporation, the daily firewall logs might contain millions of lines and many more variables. The amount of data, both the number of lines and the number of variables, will continue to grow in the future as the number of service types, different services and their use grows. The scalability of the algorithms that compute the frequent closed sets is quite good compared to the Apriori approach: fewer data scans are needed and the search space can be drastically reduced in the case of dense data [12, 3, 5]. In particular, we have done preliminary testing with ac-miner, designed by A. Bykowski [5]. It discovers free sets, from which it is straightforward to compute closed sets. These tests have shown promising results w.r.t. execution times. This approach seems to scale up more easily than the search for the whole set of frequent sets. Also, other condensed representations have recently been proposed, like the δ-free sets, the ∨-free sets or the Non-Derivable Itemsets [5, 7, 8]. They could be used in even more difficult contexts (very dense and highly correlated data). Notice, however, that from the end user's point of view, these representations do not have the intuitive semantics of the closed itemsets.
5 Conclusions and Future Work
The Comprehensive Log Compression (CLC) method provides a powerful tool for any analysis that inspects data with a lot of redundancy. Only very little a priori knowledge is needed to perform the analysis: a minimum frequency threshold for the discovery of closed sets and, e.g., the number of displayed patterns to guide the selection of the most covering patterns. The method provides a mechanism to separate different information types from each other. The CLC method identifies frequent repetitive patterns from a log database and can be used to emphasize either the normal course of actions or exceptional log entries or events in the normal course of actions. This is especially useful for getting knowledge out of previously unknown domains or for analyzing logs that are used to record unstructured and unclassified information. In the future we are interested in generalizing and testing the described method with frequent episodes and in how to utilize the relations between selected closed sets. Other interesting issues concern the theoretical foundations of the CLC method as well as ways to utilize the method in different real-world applications.
Acknowledgements The authors have partly been supported by the Nokia Foundation and the consortium on discovering knowledge with Inductive Queries (cInQ), a project
funded by the Future and Emerging Technologies arm of the IST Programme (Contract no. IST-2000-26469).
References
[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In SIGMOD'93, pages 207-216, Washington, USA, May 1993. ACM Press.
[2] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307-328. AAAI Press, 1996.
[3] Jean-François Boulicaut and Artur Bykowski. Frequent closures as a concise representation for binary data mining. In PAKDD'00, volume 1805 of LNAI, pages 62-73, Kyoto, JP, April 2000. Springer-Verlag.
[4] Jean-François Boulicaut, Artur Bykowski, and Christophe Rigotti. Approximation of frequency queries by means of free-sets. In PKDD'00, volume 1910 of LNAI, pages 75-85, Lyon, F, September 2000. Springer-Verlag.
[5] Jean-François Boulicaut, Artur Bykowski, and Christophe Rigotti. Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery, 7(1):5-22, 2003.
[6] Ronald J. Brachman and Tej Anand. The process of knowledge discovery in databases: A first sketch. In Advances in Knowledge Discovery and Data Mining, July 1994.
[7] Artur Bykowski and Christophe Rigotti. A condensed representation to find frequent patterns. In PODS'01, pages 267-273. ACM Press, May 2001.
[8] Toon Calders and Bart Goethals. Mining all non-derivable frequent itemsets. In PKDD'02, volume 2431 of LNAI, pages 74-83, Helsinki, FIN, August 2002. Springer-Verlag.
[9] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27-34, November 1996.
[10] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, pages 1-34. AAAI Press, Menlo Park, CA, 1996.
[11] R. Kosala and H. Blockeel. Web mining research: A survey. SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, ACM, 2(1):1-15, 2000.
[12] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25-46, January 1999.
[13] Jian Pei, Jiawei Han, and Runying Mao. CLOSET: an efficient algorithm for mining frequent closed itemsets. In SIGMOD Workshop DMKD'00, Dallas, USA, May 2000.
[14] Tobias Scheffer. Finding association rules that trade support optimally against confidence. In PKDD'01, volume 2168 of LNCS, pages 424-435, Freiburg, D, September 2001. Springer-Verlag.
[15] Jun Sese and Shinichi Morishita. Answering the most correlated N association rules efficiently. In PKDD'02, volume 2431 of LNAI, pages 410-422, Helsinki, FIN, August 2002. Springer-Verlag.
[16] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12-23, 2000.
[17] Mohammed Javeed Zaki. Generating non-redundant association rules. In SIGKDD'00, pages 34-43, Boston, USA, August 2000. ACM Press.
Non Recursive Generation of Frequent K-itemsets from Frequent Pattern Tree Representations

Mohammad El-Hajj and Osmar R. Zaïane

Department of Computing Science, University of Alberta, Edmonton AB, Canada
{mohammad, zaiane}@cs.ualberta.ca

Abstract. Existing association rule mining algorithms suffer from many problems when mining massive transactional datasets. One major problem is the high memory dependency: the gigantic data structures built are assumed to fit in main memory, and, in addition, the recursive mining process used to mine these structures is also too voracious in memory resources. This paper proposes a new association rule-mining algorithm based on the frequent pattern tree data structure. Our algorithm does not use much more memory over and above the memory used by the data structure. For each frequent item, a relatively small independent tree called a COFI-tree is built summarizing co-occurrences. Finally, a simple and non-recursive mining process mines the COFI-trees. Experimental studies reveal that our approach is efficient and allows the mining of larger datasets than those limited by FP-Tree.
1 Introduction
Recent years have witnessed an explosive growth in data generation in all fields of science, business, medicine, the military, etc. The processing power for evaluating and analyzing the data has not grown at the same rate. Due to this phenomenon, a tremendous volume of data is still kept without being studied. Data mining, a research field that tries to ease this problem, proposes solutions for the extraction of significant and potentially useful patterns from these large collections of data. One of the canonical tasks in data mining is the discovery of association rules. Discovering association rules, considered one of the most important tasks, has been the focus of many studies in the last few years. Many solutions have been proposed using sequential or parallel paradigms. However, the existing algorithms depend heavily on massive computation that might cause high dependency on the memory size or repeated I/O scans of the data sets. Association rule mining algorithms currently proposed in the literature are not sufficient for extremely large datasets, and new solutions, in particular ones less reliant on memory size, still have to be found.
1.1 Problem Statement
The problem consists of finding associations between items or itemsets in transactional data. The data could be retail sales in the form of customer transactions
or any collection of sets of observations. Formally, as defined in [2], the problem is stated as follows. Let I = {i_1, i_2, ..., i_m} be a set of literals, called items; m is considered the dimensionality of the problem. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. A unique identifier TID is given to each transaction. A transaction T is said to contain X, a set of items in I, if X ⊆ T. An association rule is an implication of the form "X ⇒ Y", where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. An itemset X is said to be large or frequent if its support s is greater than or equal to a given minimum support threshold σ. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. In other words, the support of the rule is the probability that X and Y hold together among all the possible presented cases. The rule X ⇒ Y is said to hold in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. In other words, the confidence of the rule is the conditional probability that the consequent Y is true under the condition of the antecedent X. The problem of discovering all association rules from a set of transactions D consists of generating the rules that have a support and confidence greater than a given threshold. These rules are called strong rules. This association-mining task can be broken into two steps:
1. A step for finding all frequent k-itemsets, known for its extreme I/O scan expense and massive computational costs.
2. A straightforward step for generating strong rules.
In this paper, we are mainly interested in the first step.
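As a minimal illustration of these definitions, the following Python sketch (ours) computes the support and confidence of a rule over a toy transaction set; the data are made up.

```python
# Sketch: support and confidence of a rule X => Y, checked by brute force.
def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    return support(x | y, transactions) / support(x, transactions)

D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
X, Y = {"A"}, {"B"}
print(support(X | Y, D))    # 0.5   -> rule A => B has support 50%
print(confidence(X, Y, D))  # 0.666 -> and confidence about 67%
```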
1.2 Related Work
Several algorithms have been proposed in the literature to address the problem of mining association rules [2, 6]. One of the key algorithms, which seems to be the most popular in many applications for enumerating frequent itemsets, is the Apriori algorithm [2]. This Apriori algorithm also forms the foundation of most known algorithms. It uses a monotone property stating that for a k-itemset to be frequent, all its (k-1)-itemsets have to be frequent. The use of this fundamental property reduces the computational cost of candidate frequent itemset generation. However, in the case of extremely large input sets with a big frequent 1-itemset, the Apriori algorithm still suffers from two main problems: repeated I/O scanning and high computational cost. One major hurdle observed with most real datasets is the sheer size of the candidate frequent 2-itemsets and 3-itemsets.

TreeProjection is an efficient algorithm presented in [1]. This algorithm builds a lexicographic tree in which each node represents a frequent pattern. The authors of this algorithm report that it is one order of magnitude faster than the existing techniques in the literature. Another innovative approach for discovering frequent patterns in transactional databases, FP-Growth, was proposed by Han et al. in [6]. This algorithm creates a compact tree structure, the FP-Tree, representing frequent patterns, which alleviates the multi-scan problem and improves candidate itemset generation. The algorithm requires only two full I/O scans of the dataset to build the prefix tree in main memory and then mines this structure directly. The authors
of this algorithm report that their algorithm is faster than the Apriori and TreeProjection algorithms. Mining the FP-tree structure is done recursively by building conditional trees that are of the same order of magnitude in number as the frequent patterns. This massive creation of conditional trees makes the algorithm not scalable to mining large datasets beyond a few million transactions. [7] proposes a new algorithm, H-mine, that invokes FP-Tree to mine condensed data. This algorithm is still not scalable, as reported by its authors in [8].
1.3 Preliminaries, Motivations and Contributions
The Co-Occurrence Frequent Item Tree (or COFI-tree for short) algorithm that we present in this paper is based on the core idea of the FP-Growth algorithm proposed by Han et al. in [6]. A compact tree structure, the FP-Tree, is built based on an ordered list of the frequent 1-itemsets present in the transactional database. However, rather than using FP-Growth, which recursively builds a large number of relatively large trees called conditional trees [6] from the built FP-Tree, we successively build one small tree (called a COFI-tree) for each frequent 1-itemset and mine the trees with simple non-recursive traversals. We keep only one such COFI-tree in main memory at a time. The COFI-tree approach is a divide-and-conquer approach, in which we do not seek to find all frequent patterns at once, but instead independently find all frequent patterns related to each frequent item in the frequent 1-itemset. The main differences between our approach and the FP-Growth approach are the following: (1) we only build one COFI-tree for each frequent item A, and this COFI-tree is non-recursively traversed to generate all frequent patterns related to item A; (2) only one COFI-tree resides in memory at one time, and it is discarded as soon as it is mined to make room for the next COFI-tree.

FP-Tree-based algorithms depend heavily on memory size, as the memory size plays an important role in defining the size of the problem that can be handled. Memory is not only needed to store the data structure itself, but also to recursively generate the set of conditional trees in the mining process. This phenomenon is often overlooked. As argued by the authors of the algorithm, this is a serious constraint [8]. Other approaches, such as in [7], build yet another data structure from which the FP-Tree is generated, thus doubling the need for main memory. The current association rule mining algorithms handle only relatively small sizes with low dimensions. Most of them scale up to only a couple of million transactions and a few thousand dimensions [8, 5]. None of the existing algorithms scales beyond 15 million transactions and hundreds of thousands of dimensions, in which each transaction has an average of at least a couple of dozen items.

The remainder of this paper is organized as follows: Section 2 describes the Frequent Pattern tree, its design and construction. Section 3 illustrates the design, construction and mining of the Co-Occurrence Frequent Item trees. Experimental results are given in Section 4. Finally, Section 5 concludes by discussing some issues and highlights our future work.
2 Frequent Pattern Tree: Design and Construction
The COFI-tree approach we propose consists of two main stages. Stage one is the construction of the Frequent Pattern tree, and stage two is the actual mining of this data structure, much like the FP-Growth algorithm.
2.1 Construction of the Frequent Pattern Tree
The goal of this stage is to build the compact data structure called the Frequent Pattern Tree [6]. This construction is done in two phases, where each phase requires a full I/O scan of the dataset. A first initial scan of the database identifies the frequent 1-itemsets. The goal is to generate an ordered list of frequent items that is used when building the tree in the second phase. This phase starts by enumerating the items appearing in the transactions. After enumerating these items (i.e. after reading the whole dataset), infrequent items with a support less than the support threshold are weeded out and the remaining frequent items are sorted by their frequency. This list is organized in a table, called the header table, where the items and their respective supports are stored along with pointers to the first occurrence of each item in the frequent pattern tree. Phase 2 then constructs the frequent pattern tree.

Table 1. Transactional database
[Table flattened by extraction: 18 transactions T1-T18 over the items A-R; the transactions discussed in the running example below are T1 = (A, G, D, C, B), T2 = (B, C, H, E, D) and T3 = (B, D, E, A, M).]

Step 1: A 11, B 10, C 10, D 9, E 8, F 7, G 4, H 3, I 3, J 3, K 3, L 3, M 3, N 3, O 3, P 3, Q 2, R 2
Step 2: A 11, B 10, C 10, D 9, E 8, F 7
Step 3: F 7, E 8, D 9, C 10, B 10, A 11
Fig. 1. Steps of phase 1.
Phase 2 of constructing the Frequent Pattern tree structure is the actual building of this compact tree. This phase requires a second complete I/O scan
from the dataset. For each transaction read, only the set of frequent items present in the header table is collected and sorted in descending order of frequency. These sorted transaction items are used in constructing the FP-Tree as follows: for the first item of the sorted transaction, check if it exists as one of the children of the root. If it exists, then increment the support of this node; otherwise, add a new node for this item as a child of the root node with a support of 1. Then consider the current item node as the new temporary root and repeat the same procedure with the next item of the sorted transaction. During the process of adding any new item-node to the FP-Tree, a link is maintained between this item-node in the tree and its entry in the header table. The header table holds one pointer per item, which points to the first occurrence of this item in the FP-Tree structure.

For illustration, we use an example with the transactions shown in Table 1. Let the minimum support threshold be set to 4. Phase 1 starts by accumulating the support of all items that occur in the transactions. Step 2 of phase 1 removes all non-frequent items, in our example (G, H, I, J, K, L, M, N, O, P, Q and R), leaving only the frequent items (A, B, C, D, E, and F). Finally, all frequent items are sorted according to their support to generate the sorted frequent 1-itemset. This last step ends phase 1 of the COFI-tree algorithm and starts the second phase. In phase 2, the first transaction (A, G, D, C, B) read is filtered to consider only the frequent items that occur in the header table (i.e. A, D, C and B). This frequent list is sorted according to the items' supports (A, B, C and D). This ordered transaction generates the first path of the FP-Tree, with all item-node supports initially equal to 1. A link is established between each item-node in the tree and its corresponding item entry in the header table. The same procedure is executed for the second transaction (B, C, H, E, and D), which yields the sorted frequent item list (B, C, D, E) that forms the second path of the FP-Tree. Transaction 3 (B, D, E, A, and M) yields the sorted frequent item list (A, B, D, E), which shares the prefix (A, B) with an existing path in the tree. The support of the item-nodes A and B is incremented by 1, making the support of (A) and (B) equal to 2, and a new sub-path is created with the remaining items of the list (D, E), all with support equal to 1. The same process occurs for all transactions until the FP-Tree for the transactions given in Table 1 is built. Figure 2 shows the result of the tree building process.
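The following Python sketch (ours, not the authors' code) follows the two-phase FP-Tree construction just described; class and variable names are our own, the header-table node links are omitted, and ties in the support ordering are broken arbitrarily.

```python
# Sketch: minimal two-phase FP-Tree construction.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_sup):
    # Phase 1: count item supports, drop infrequent items, order by descending support.
    counts = Counter(i for t in transactions for i in t)
    header = {i: c for i, c in counts.items() if c >= min_sup}
    order = sorted(header, key=lambda i: -header[i])
    # Phase 2: insert each filtered, support-ordered transaction as a path from the root
    # (the header-table links to first node occurrences are omitted in this sketch).
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root, header

# Toy usage with the three example transactions discussed above (min support 1 here).
root, header = build_fp_tree([set("AGDCB"), set("BCHED"), set("BDEAM")], min_sup=1)
```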
[Figure: the FP-Tree built from the transactions of Table 1, consisting of a header table (A 11, B 10, C 10, D 9, E 8, F 7) whose entries link into a prefix tree rooted at a Root node, with each tree node carrying an item label and a support count.]

Fig. 2. Frequent Pattern Tree.
3 Co-Occurrence Frequent-Item-trees: Construction and Mining
Our approach for computing frequencies relies first on building independent, relatively small trees for each frequent item in the header table of the FP-Tree, called COFI-trees. Then we mine each one of the trees separately as soon as it is built, minimizing candidacy generation and without building conditional sub-trees recursively. The trees are discarded as soon as they are mined. At any given time, only one COFI-tree is present in main memory.
3.1 Construction of the Co-Occurrence Frequent-Item-trees
The small COFI-trees we build are similar to the conditional FP-trees in general, in the sense that they have a header with ordered frequent items and horizontal pointers pointing to a succession of nodes containing the same frequent item, and the prefix tree per se with paths representing sub-transactions. However, the COFI-trees have bidirectional links in the tree allowing bottom-up scanning as well, and the nodes contain not only the item label and a frequency counter, but also a participation counter, as explained later in this section. The COFI-tree for a given frequent item x contains only nodes labeled with items that are as frequent as x or more frequent.

To illustrate the idea of the COFI-trees, we explain step by step the process of creating COFI-trees for the FP-Tree of Figure 2. In our example, the first Co-Occurrence Frequent Item tree is built for item F, as it is the least frequent item in the header table. In this tree for F, all frequent items which are more frequent than F and share transactions with F participate in building the tree. They can be found by following the chain of item F in the FP-Tree structure. The F-COFI-tree starts with the root node containing the item in question, F. For each sub-transaction or branch in the FP-Tree containing item F together with other frequent items that are more frequent than F (and thus are parent nodes of F), a branch is formed starting from the root node F. The support of this branch is equal to the support of the F node in its corresponding branch in the FP-Tree. If multiple frequent items share the same prefix, they are merged into one branch and a counter for each node of the tree is adjusted accordingly. Figure 3 illustrates all COFI-trees for the frequent items of Figure 2.

In Figure 3, the rectangular nodes are tree nodes with an item label and two counters. The first counter is a support-count for that node, while the second counter, called the participation-count, is initialized to 0 and is used by the mining algorithm discussed later. Each node also has a horizontal link which points to the next node that has the same item-name in the tree, and a bi-directional vertical link that links a child node with its parent and a parent with its child. The bi-directional pointers facilitate the mining process by making the traversal of the tree easier. The squares are cells from the header table, as with the FP-Tree. This is a list made of all frequent items that participate in building the tree structure, sorted in ascending order of their global support. Each entry in this list contains
[Figure: the F-, E-, D-, C- and B-COFI-trees built for the frequent items of Figure 2; each rectangular node shows its support-count and participation-count, e.g. the root of the F-COFI-tree is F (7 0), and each tree has its own header table of locally participating items.]

Fig. 3. COFI-trees
the item-name, the item-counter, and a pointer to the first node in the tree that has the same item-name.

To explain the COFI-tree building process, we highlight the building steps for the F-COFI-tree in Figure 3. Frequent item F is read from the header table and its first location in the FP-Tree is found using the pointer in the header table. The first location of item F indicates that it shares a branch with item A, with support = 1 for this branch, as the support of the F item is considered the support of this branch (following the upper links for this item). Two nodes are created, for FA:1. The second location of F indicates a new branch FECA:2, as the support of F is 2. Three nodes are created for the items E, C and A with support = 2, and the support of the F node is incremented by 2. The third location indicates the sub-transaction FEB:1. Nodes for F and E already exist and only a new node for B is created as another child of E. The support of all these nodes is incremented by 1: B becomes 1, E becomes 3 and F becomes 4. FEDB:1 is read after that; the FE branch already exists and a new child branch DB is created as a child of E with support = 1. The support of the E node becomes 4 and F becomes 5. Finally, FC:2 is read, and a new node for item C is created with support = 2, and the F support becomes 7. As with FP-Trees, the header constitutes a list of all frequent items, maintaining the location of the first entry for each item in the COFI-tree. A link is also made from each node in the tree to the next location of the same item in the tree, if it exists. The mining process is the last step done on the F-COFI-tree before removing it and creating the next COFI-tree for the next item in the header table.
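A minimal Python sketch (ours) of the COFI-tree node layout and of inserting one branch with its count, replaying the F-COFI-tree branches walked through above; the names are our own, and the same-item horizontal links and the header table are omitted.

```python
# Sketch: COFI-tree nodes carry a support-count and a participation-count.
class CofiNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.support, self.participation = 0, 0
        self.children = {}   # item -> CofiNode (same-item links omitted here)

def insert_branch(root, items, count):
    """items: ancestors of the root item, from most to least frequent;
    count: the support of the root item's node in that FP-Tree branch."""
    root.support += count
    node = root
    for item in items:
        node = node.children.setdefault(item, CofiNode(item, node))
        node.support += count
    return root

# Replaying the F-COFI-tree branches from the walkthrough above:
f_root = CofiNode("F", None)
for items, count in [(["A"], 1), (["E", "C", "A"], 2), (["E", "B"], 1),
                     (["E", "D", "B"], 1), (["C"], 2)]:
    insert_branch(f_root, items, count)
print(f_root.support)                 # 7, as in the walkthrough
print(f_root.children["E"].support)   # 4
```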
3.2 Mining the COFI-trees
The COFI-trees of all frequent items are not constructed together. Each tree is built, mined, then discarded before the next COFI-tree is built.
Fig. 4. Steps needed to generate frequent patterns related to item E (steps 1-4 of mining the E-COFI-tree)
The mining process is done for each tree independently, with the purpose of finding all frequent k-itemset patterns in which the item at the root of the tree participates. The steps needed to produce the frequent patterns related to item E, for example, are illustrated in Figure 4. From each branch of the tree, using the support-count and the participation-count, candidate frequent patterns are identified and stored temporarily in a list; the non-frequent ones are discarded at the end, when all branches have been processed. The mining process for the E-COFI-tree starts from the most locally frequent item in the header table of the tree, which is item B. Item B occurs in three branches of the E-COFI-tree, namely (B:1, C:1, D:5, E:8), (B:4, D:5, E:8) and (B:1, E:8). The frequency of a branch is the frequency of its first item minus the participation value of the same node. Item B in the first branch has a frequency of 1 and a participation value of 0, which makes the frequency of the first pattern EDB equal to 1. The participation values of all nodes in this branch are incremented by 1, the frequency of this pattern. From the first pattern EDB:1 we need to generate all sub-patterns in which item E participates, which are ED:1, EB:1 and EDB:1. The second branch containing B generates the pattern EDB:4, as the frequency of B on this branch is 4 and its participation value is 0. All participation values on these nodes are incremented by 4. Sub-patterns are also generated from this EDB pattern, namely ED:4, EB:4 and EDB:4; all of these patterns already exist with a support of 1, so only their support values need to be updated, to 5. The last branch, EB:1, generates only one pattern, EB:1, whose support is consequently updated to 6. The second locally frequent item in this tree, D, occurs in one branch (D:5, E:8) with a participation value of 5 for the D node. Since the participation value of this node equals its support value, no patterns can be generated from it. Finally, all non-frequent patterns are discarded, leaving only the frequent patterns in which item E participates, which are ED:5, EB:6 and EDB:5. The COFI-tree of item E can be removed at this time
and another tree can be generated and tested to produce all the frequent patterns related to its root node. The same process is executed for the remaining items. The D-COFI-tree is created after the E-COFI-tree; mining it generates the frequent patterns DB:8, DA:5 and DBA:5. The C-COFI-tree generates one frequent pattern, CA:6. Finally, the B-COFI-tree is created and the frequent pattern BA:6 is generated.
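For concreteness, here is a hedged sketch of this mining pass in code. It is not the authors' implementation: the tree is represented with plain dictionaries, the processing order and minimum support are passed in explicitly, and the tiny example tree mirrors the E-COFI-tree walked through above.

```python
# A sketch of the non-recursive COFI mining pass for one tree (assumptions, not
# the authors' code). Nodes are dicts: 'sup' is the support-count and 'par' the
# participation-count. Each branch is walked bottom-up once per locally
# frequent item, as in the walkthrough above.
from itertools import combinations

def mine_cofi_tree(root_item, header, order, min_sup):
    candidates = {}
    for item in order:                      # header-table order (item B first here)
        for node in header[item]:
            freq = node["sup"] - node["par"]
            if freq <= 0:
                continue
            path, cur = [], node            # collect the branch up to the root
            while cur["item"] != root_item:
                path.append(cur)
                cur = cur["parent"]
            items = [n["item"] for n in path]
            for k in range(1, len(items) + 1):      # all sub-patterns with the root
                for combo in combinations(items, k):
                    key = (root_item,) + combo
                    candidates[key] = candidates.get(key, 0) + freq
            for n in path:                  # update participation counts
                n["par"] += freq
    return {p: s for p, s in candidates.items() if s >= min_sup}

# Tiny E-COFI-tree mirroring the example: root E with branches D-C-B, D-B and B.
E = {"item": "E", "sup": 8, "par": 0, "parent": None}
D = {"item": "D", "sup": 5, "par": 0, "parent": E}
C = {"item": "C", "sup": 1, "par": 0, "parent": D}
B1 = {"item": "B", "sup": 1, "par": 0, "parent": C}
B2 = {"item": "B", "sup": 4, "par": 0, "parent": D}
B3 = {"item": "B", "sup": 1, "par": 0, "parent": E}
header = {"B": [B1, B2, B3], "C": [C], "D": [D]}
print(mine_cofi_tree("E", header, ["B", "C", "D"], min_sup=5))
# {('E', 'B'): 6, ('E', 'D'): 5, ('E', 'B', 'D'): 5}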
4 Experimental Evaluations and Performance Study
To test the efficiency of the COFI-tree approach, we conducted experiments comparing our approach with two well-known algorithms, namely Apriori and FP-Growth. To avoid implementation bias, a third-party Apriori implementation by Christian Borgelt [4] and the FP-Growth code [6] written by its original authors are used. The experiments were run on a 733-MHz machine with a relatively small RAM of 256 MB. Transactions were generated using the IBM synthetic data generator [3]. We conducted different experiments to test the COFI-tree algorithm when mining extremely large transactional databases, assessing both its applicability and its scalability. In one of these experiments, we mined, using a support threshold of 0.01%, transactional databases of sizes ranging from 1 million to 25 million transactions with an average transaction length of 24 items. The dimensionality of the 1 and 2 million transaction datasets was 10,000 items, while the datasets ranging from 5 million to 25 million transactions had a dimensionality of 100,000 unique items. Figure 5A illustrates the comparative results obtained with Apriori, FP-Growth and the COFI-tree. Apriori failed to mine the 5 million transaction database, and FP-Growth could not mine beyond the 5 million transaction mark. The COFI-tree, however, demonstrates good scalability, mining 25 million transactions in 2921 s (about 48 minutes). None of the tested algorithms, nor any result reported in the literature, reaches such a size. To test the behavior of the COFI-tree vis-à-vis different support thresholds, a set of experiments was conducted on a database of one million transactions, with 10,000 items and an average transaction length of 24 items. The mining process was tested at different support levels: 0.0025%, which revealed almost 125K frequent patterns; 0.005%, which revealed nearly 70K frequent patterns; 0.0075%, which generated 32K frequent patterns; and 0.01%, which returned 17K frequent patterns. Figure 5B depicts the time needed in seconds for each of these runs. The results show that the COFI-tree algorithm outperforms both the Apriori and FP-Growth algorithms in all cases.
5 Discussion and Future Work
Finding scalable algorithms for association rule mining in extremely large databases is the main goal of our research. To reach this goal, we propose a new FP-Tree-based algorithm. This algorithm addresses the main problem of the FP-Growth algorithm, namely the recursive creation and mining of many conditional
Fig. 5. Computational performance and scalability of Apriori, FP-Growth and the COFI-tree algorithm: (A) time in seconds versus database size (1M-25M transactions); (B) time in seconds versus support threshold (0.0025%-0.01%)
pattern trees, which are equal in number to the distinct frequent patterns generated. We have replaced this step by creating one COFI-tree for each frequent item. A simple non-recursive mining process is applied to generate all frequent patterns related to the tested COFI-tree. The experiments we conducted showed that our algorithm scales to mining tens of millions of transactions, if not more. We are currently studying the possibility of parallelizing the COFI-tree algorithm, to investigate the opportunity of mining hundreds of millions of transactions in reasonable time and with acceptable resources.
References
1. R. Agarwal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing, 2000.
2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487-499, Santiago, Chile, September 1994.
3. IBM Almaden. Quest synthetic data generation code. http://www.almaden.ibm.com/cs/quest/syndata.html.
4. C. Borgelt. Apriori implementation. http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori/apriori.html.
5. E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Engineering, 12(3):337-352, May-June 2000.
6. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM-SIGMOD, Dallas, 2000.
7. H. Huang, X. Wu, and R. Relue. Association analysis with one scan of databases. In IEEE International Conference on Data Mining, pages 629-636, December 2002.
8. J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In Eighth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pages 229-238, Edmonton, Alberta, August 2002.
A New Computation Model for Rough Set Theory Based on Database Systems
Jianchao Han¹, Xiaohua Hu², T. Y. Lin³
¹ Dept. of Computer Science, California State University Dominguez Hills, 1000 E. Victoria St., Carson, CA 90747, USA
² College of Information Science and Technology, Drexel University, 3141 Chestnut St., Philadelphia, PA 19104, USA
³ Dept. of Computer Science, San Jose State University, One Washington Square, San Jose, CA 94403, USA

Abstract. We propose a new computation model for rough set theory using relational algebra operations in this paper. We present the necessary and sufficient conditions on data tables under which an attribute is a core attribute and those under which a subset of condition attributes is a reduct, respectively. With this model, two algorithms for core attribute computation and reduct generation are suggested. The correctness of both algorithms is proved and their time complexity is analyzed. Since relational algebra operations have been efficiently implemented in most widely-used database systems, the algorithms presented can be extensively applied to these database systems and adapted to a wide range of real-life applications with very large data sets.
1 Introduction
Rough set theory was first introduced by Pawlak in the 1980's [10] and has since been widely applied in different real applications such as machine learning, knowledge discovery, and expert systems [2, 6, 7, 11]. Rough set theory is especially useful for domains where the data collected are imprecise and/or inconsistent. It provides a powerful tool for data analysis and data mining from imprecise and ambiguous data. Many rough set models have been developed in the rough set community [7, 8]. Some of them have been applied in industrial data mining projects such as stock market prediction, patient symptom diagnosis, telecommunication churner prediction, and financial bank customer attrition analysis to solve challenging business problems. These rough set models focus on extensions of the original model proposed by Pawlak [10, 11] and attempt to deal with its limitations, but have not paid much attention to the efficiency of the model implementation, such as core and reduct generation. One of the serious drawbacks of existing rough set models is the inefficiency and unscalability of their implementations to compute the core and reducts and identify the dispensable attributes, which limits their suitability in data mining applications with large data sets. Further investigation reveals that existing rough set methods perform the computations of core and reduct on flat files rather than integrate with efficient and high-performance relational database set
operations, while some authors have proposed ideas to reduce data using relational database system techniques [4, 6]. To overcome this problem, we propose a new computation model of rough set theory to efficiently compute the core and reducts by means of relational database set-oriented operations such as Cardinality and Projection. We prove and demonstrate that our computation model is equivalent to the traditional rough set model, but much more efficient and scalable. The rest of the paper is organized as follows: We briefly overview traditional rough set theory in Section 2. A new computation model of rough set theory by means of relational database set-oriented operations is proposed in Section 3. In Section 4, we describe our new algorithms to compute core attributes and construct reducts based on our new model, and analyze their time complexity. Related work is discussed in Section 5. Finally, Section 6 concludes the paper and outlines future work.
2 Overview of Rough Set Theory
An information system is defined as IS = ⟨U, C, D, {V_a}_{a∈C∪D}, f⟩, where U = {u_1, u_2, ..., u_n} is a non-empty set of tuples, called the data set or data table, C is a non-empty set of condition attributes, and D is a non-empty set of decision attributes with C ∩ D = ∅. V_a is the domain of attribute a, with at least two elements. f is a function U × (C ∪ D) → V = ∪_{a∈C∪D} V_a, which maps each pair of tuple and attribute to an attribute value.
Let A ⊆ C ∪ D and t_i, t_j ∈ U. We define a binary relation R_A, called an indiscernibility relation, as follows: R_A = {⟨t_i, t_j⟩ ∈ U × U : ∀a ∈ A, t_i[a] = t_j[a]}, where t[a] denotes the value of attribute a ∈ A in tuple t. The indiscernibility relation, denoted IND, is an equivalence relation on U. The ordered pair ⟨U, IND⟩ is called an approximation space. It partitions U into equivalence classes, each of which is labeled by a description A_i and called an elementary set. Any finite union of elementary sets is called a definable set in ⟨U, IND⟩.
Definition 1. Let X be a subset of U representing a concept. Assume A is a subset of attributes, A ⊆ C ∪ D, and [A] = {A_1, A_2, ..., A_m} is the set of elementary sets based on A. The lower approximation of X based on A, denoted Lower_A(X), is defined as Lower_A(X) = ∪{A_i ∈ [A] | A_i ⊆ X, 1 ≤ i ≤ m}, which contains all the tuples in U that can be definitely classified to X, and is therefore called the positive region of X w.r.t. A. The upper approximation of X based on A, denoted Upper_A(X), is defined as Upper_A(X) = ∪{A_i ∈ [A] | A_i ∩ X ≠ ∅, 1 ≤ i ≤ m}, which contains those tuples in U that can possibly be classified to X. The set of those tuples that can possibly but not definitely be classified to X is called the boundary area of X, denoted Boundary_A(X), defined as Boundary_A(X) = Upper_A(X) − Lower_A(X). The negative region of X is defined as Negative_A(X) = ∪{A_i ∈ [A] | A_i ⊆ U − X, 1 ≤ i ≤ m}, which contains the tuples that cannot be classified to X. □
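As a concrete illustration of Definition 1, the following is a small hedged sketch (not from the paper) that computes elementary sets and the lower and upper approximations for a toy data table; the attribute names and values are invented.

```python
# Hedged illustration of Definition 1 on a toy table; all data are invented.
from collections import defaultdict

table = [  # each tuple: (Outlook, Windy, Play)
    {"Outlook": "sunny", "Windy": "no",  "Play": "yes"},
    {"Outlook": "sunny", "Windy": "no",  "Play": "yes"},
    {"Outlook": "rainy", "Windy": "yes", "Play": "no"},
    {"Outlook": "rainy", "Windy": "no",  "Play": "yes"},
    {"Outlook": "rainy", "Windy": "no",  "Play": "no"},
]

def elementary_sets(rows, attrs):
    """Partition row indices into equivalence classes of the indiscernibility relation."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(i)
    return list(classes.values())

def lower_upper(rows, attrs, concept):
    lower, upper = set(), set()
    for cls in elementary_sets(rows, attrs):
        if cls <= concept:
            lower |= cls          # definitely in the concept
        if cls & concept:
            upper |= cls          # possibly in the concept
    return lower, upper

X = {i for i, r in enumerate(table) if r["Play"] == "yes"}   # the concept
low, up = lower_upper(table, ["Outlook", "Windy"], X)
print(low, up - low)   # positive region {0, 1} and boundary area {3, 4} of X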
Thus, the positive and negative regions contain positive and negative examples of the concept X, respectively, while the boundary region contains the uncertain examples. If Lower_A(X) = Upper_A(X), then the boundary region of the set X disappears and the rough set becomes equivalent to a standard set. Generally, for any concept X, we can derive two kinds of classification rules from the lower and upper approximations of X based on a subset of condition attributes. The former are deterministic, because they definitely determine that the tuples satisfying the rule condition must be in the target concept, while the latter are non-deterministic, because the tuples satisfying the rule condition are only possibly in the target concept. Specifically, let [D] = {D_1, D_2, ..., D_k} be the set of elementary sets based on the decision attribute set D. Assume A is a subset of condition attributes, A ⊆ C, and [A] = {A_1, A_2, ..., A_h} is the set of elementary sets based on A.
Definition 2. ∀ D_j ∈ [D], 1 ≤ j ≤ k, the lower approximation of D_j based on A, denoted Lower_A(D_j), is defined as Lower_A(D_j) = ∪{A_i | A_i ⊆ D_j, 1 ≤ i ≤ h}. All tuples in Lower_A(D_j) can be certainly classified to D_j. The lower approximation of [D], denoted Lower_A([D]), is defined as Lower_A([D]) = ∪_{j=1}^{k} Lower_A(D_j). All tuples in Lower_A([D]) can be certainly classified. Similarly, ∀ D_j ∈ [D], 1 ≤ j ≤ k, the upper approximation of D_j based on A, denoted Upper_A(D_j), is defined as Upper_A(D_j) = ∪{A_i | A_i ∩ D_j ≠ ∅, 1 ≤ i ≤ h}. All tuples in Upper_A(D_j) can be probably classified to D_j. The upper approximation of [D], denoted Upper_A([D]), is defined as Upper_A([D]) = ∪_{j=1}^{k} Upper_A(D_j). All tuples in Upper_A([D]) can be probably classified. The boundary of [D] based on A ⊆ C, denoted Boundary_A([D]), is defined as Boundary_A([D]) = Upper_A([D]) − Lower_A([D]). All tuples in Boundary_A([D]) cannot be classified in terms of A and D. □
Rough set theory can tell us whether the information for the classification of tuples is consistent based on the data table itself. If the data is inconsistent, it suggests that more information about the tuples needs to be collected in order to build a good classification model for all tuples. If there exists a pair of tuples in U that have the same condition attribute values but different decision attribute values, U is said to contain contradictory tuples.
Definition 3. U is consistent if no contradictory pair of tuples exists in U, that is, ∀ t_1, t_2 ∈ U, if t_1[D] ≠ t_2[D] then t_1[C] ≠ t_2[C]. □
Usually, the existence of contradictory tuples indicates that the information contained in U is not enough to classify all tuples, and there must be some contradictory tuples contained in the boundary area; see Proposition 1. On the other hand, if the data is consistent, rough set theory can also determine whether there is more than sufficient, or redundant, information in the data, and it provides approaches to finding the minimum data needed for a classification model. This property of rough set theory is very important for applications where domain knowledge is limited or data collection is expensive/laborious, because it ensures the data collected is just right (not more or less) to build a good
classification model without sacrificing the accuracy of the classification model or wasting time and effort to gather extra information. Furthermore, rough set theory classifies all the attributes into three categories: core attributes, reduct attributes, and dispensable attributes. Core attributes carry the essential information needed to make a correct classification for the data set and should be retained in the data set; dispensable attributes are the redundant ones in the data set and can be eliminated without loss of any useful information; reduct attributes lie in between. A reduct attribute may or may not be essential.
Definition 4. A condition attribute a ∈ C is a dispensable attribute of C in U w.r.t. D if Lower_C([D]) = Lower_{C−{a}}([D]). Otherwise, a ∈ C is called a core attribute of C w.r.t. D. □
A reduct of the condition attributes set is a minimum subset of the entire condition attributes set that has the same classification capability as the original attributes set.
Definition 5. A subset R of C, R ⊆ C, is defined as a reduct of C in U w.r.t. D if Lower_R([D]) = Lower_C([D]) and ∀ B ⊂ R, Lower_B([D]) ≠ Lower_C([D]). A condition attribute a ∈ C is said to be a reduct attribute if ∃ R ⊆ C such that R is a reduct of C and a ∈ R. □
For a given data table, there may exist more than one reduct. Finding all reducts of the condition attributes set is NP-hard [2].
3 A New Computation Model for Rough Set Theory
Some limitations of rough set theory have been presented [7, 8], which restrict its suitability in practice. One of these limitations is the inefficiency of the computation of core attributes and reducts, which limits its suitability for large data sets. In order to find core attributes, dispensable attributes, or reducts, the rough set model needs to construct all the equivalence classes based on the values of the condition and decision attributes of all tuples in the data set. This is very time-consuming and infeasible, since most data mining applications require efficient algorithms that deal with scalable data sets. Our experience and investigation show that current implementations of the rough set model are based on flat-file-oriented computations to calculate core attributes and reducts. As is known, however, set-oriented operations in existing relational database systems such as Oracle, Sybase, and DB2 are much more efficient and scalable for large data sets. These high-performance set-oriented operations can be integrated with the rough set model to improve the efficiency of the various operations of rough set theory. We propose a computation model based on relational algebra in this section, which provides the necessary and sufficient conditions, with respect to database operations, for computing core attributes and constructing reducts, and
then describe the algorithms to compute the core attributes and generate reducts of the given attribute sets in the next section. For simplicity and convenience, we make the following conventions. Let a ∈ C ∪ D be an attribute and t ∈ U be a tuple; t[a] denotes t's value of the attribute a. If t_1 ∈ U and t_2 ∈ U are two tuples and t_1[a] = t_2[a], this is denoted t_1 ≈_a t_2. Let A = {a_1, a_2, ..., a_k} ⊆ C ∪ D be a subset of attributes and t ∈ U be a tuple; t[A] denotes the sequence ⟨t[a_1], t[a_2], ..., t[a_k]⟩. For t_1, t_2 ∈ U, we say t_1[A] = t_2[A], denoted t_1 ≈_A t_2, if and only if t_1[a_i] = t_2[a_i] for i = 1, 2, ..., k. To start with, let us review two set-oriented operations used in relational database systems: Count and Projection [5]. Assume Y is a data table. Count (Cardinality): Card(Y) is the number of distinct tuples in Y. Projection: Assume Y has columns C and E ⊆ C; Π_E(Y) is a data table that contains all tuples of Y but only the columns in E.
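To make the two primitives concrete, here is a small hedged sketch (not from the paper) expressing Card and the projection with Python sets; the table and column names are invented.

```python
# Hedged sketch of Card and projection over a table stored as a list of dicts.
def project(rows, cols):
    """Π_cols(Y): the sequence of column-restricted tuples (duplicates kept)."""
    return [tuple(r[c] for c in cols) for r in rows]

def card(tuples):
    """Card(·): number of distinct tuples."""
    return len(set(tuples))

Y = [{"a": 1, "b": "x", "d": "yes"},
     {"a": 1, "b": "x", "d": "no"},
     {"a": 2, "b": "y", "d": "no"}]
print(card(project(Y, ["a", "b"])))        # 2 distinct tuples on {a, b}
print(card(project(Y, ["a", "b", "d"])))   # 3 distinct tuples on {a, b, d}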
Proposition 1. The data table U is consistent if and only if U = Lower_C([D]) = Upper_C([D]) and Boundary_C([D]) = ∅.
Proof. Let [C] = {C_1, C_2, ..., C_m} and [D] = {D_1, D_2, ..., D_n} be the sets of equivalence classes induced by C and D, respectively. Assume U is consistent. On the one hand, by Definitions 1 and 2, it is obvious that Lower_C([D]) ⊆ Upper_C([D]) ⊆ U. On the other hand, ∀ t ∈ U = ∪_{i=1}^{n} D_i, ∃ 1 ≤ j ≤ n such that t ∈ D_j. Similarly, ∃ 1 ≤ i ≤ m such that t ∈ C_i, for U = ∪_{i=1}^{m} C_i. ∀ t′ ∈ C_i, t[C] = t′[C]. By Definition 3, t[D] = t′[D]. So t′ ∈ D_j, for t ∈ D_j. Thus C_i ⊆ D_j, which leads to t ∈ Lower_C(D_j) and t ∈ Lower_C([D]). Hence U ⊆ Lower_C([D]), and therefore U = Lower_C([D]) = Upper_C([D]). Furthermore, Boundary_C([D]) = Upper_C([D]) − Lower_C([D]) = ∅. □
Theorem 1. U is consistent if and only if Card(Π_C(U)) = Card(Π_{C+D}(U)).
Proof. By Proposition 1, U is consistent if and only if Boundary_C([D]) = ∅, if and only if ∀ t, s ∈ U, t[C] = s[C] is equivalent to t[C + D] = s[C + D], if and only if Card(Π_C(U)) = Card(Π_{C+D}(U)). □
Proposition 2. Let A ⊆ B ⊆ C. Assume [A] = {A_1, A_2, ..., A_m} and [B] = {B_1, B_2, ..., B_n} are the sets of equivalence classes induced by A and B, respectively. Then ∀ B_i ∈ [B], i = 1, 2, ..., n, and A_j ∈ [A], j = 1, 2, ..., m, either B_i ∩ A_j = ∅ or B_i ⊆ A_j. [B] is said to be a refinement of [A]. □
Proposition 3. If U is consistent, then ∀ A ⊆ C, Card(Π_A(U)) ≤ Card(Π_{A+D}(U)).
Proof. ∀ t, s ∈ U, if t and s are projected to be the same in Π_{A+D}(U), then they must be projected to be the same in Π_A(U). □
Theorem 2. If U is consistent, then ∀ A ⊆ C, Lower_C([D]) ≠ Lower_{C−A}([D]) if and only if Card(Π_{C−A}(U)) ≠ Card(Π_{C−A+D}(U)).
Proof. Let [C] = {C_1, C_2, ..., C_m} and [C−A] = {C′_1, C′_2, ..., C′_k} be the sets of equivalence classes induced by C and C−A, respectively, and let [D] = {D_1, D_2, ..., D_n} be the set of equivalence classes induced by D. According to Definition 2, for a given 1 ≤ j ≤ n, we have Lower_{C−A}(D_j) = ∪{C′_q | C′_q ⊆ D_j, 1 ≤ q ≤ k}. Thus, ∀ t ∈ Lower_{C−A}(D_j), ∃ 1 ≤ q ≤ k such that t ∈ C′_q and C′_q ⊆ D_j. Because U = ∪_{i=1}^{k} C′_i = ∪_{i=1}^{m} C_i, ∃ 1 ≤ p ≤ m with t ∈ C_p. Hence we have t ∈ C′_q ∩ C_p ≠ ∅. By Proposition 2, it can easily be seen that C_p ⊆ C′_q ⊆ D_j because C−A ⊆ C. Hence t ∈ ∪{C_i | C_i ⊆ D_j, 1 ≤ i ≤ m} = Lower_C(D_j). Therefore Lower_{C−A}(D_j) ⊆ Lower_C(D_j), and thus Lower_{C−A}([D]) ⊆ Lower_C([D]). Because Lower_{C−A}([D]) ≠ Lower_C([D]) by the given condition, we must have Lower_{C−A}([D]) ⊂ Lower_C([D]). So it can be inferred that ∃ t_0 ∈ U such that t_0 ∈ Lower_C([D]) and t_0 ∉ Lower_{C−A}([D]). Thus ∃ D_j, 1 ≤ j ≤ n, such that t_0 ∈ Lower_C(D_j), which means ∃ C_p, 1 ≤ p ≤ m, with t_0 ∈ C_p ⊆ D_j. And ∀ 1 ≤ i ≤ n, t_0 ∉ Lower_{C−A}(D_i), that is, t_0 ∉ ∪{C′_q | C′_q ⊆ D_i, 1 ≤ q ≤ k}. However, t_0 ∈ U = ∪_{q=1}^{k} C′_q. Hence ∃ 1 ≤ q ≤ k with t_0 ∈ C′_q, but ∀ 1 ≤ i ≤ n, C′_q ⊄ D_i. It is known that t_0 ∈ D_j. Thus we have ∃ t_0 ∈ U with t_0 ∈ C′_q ∩ D_j ≠ ∅ and C′_q ⊄ D_j, which means ∃ t ∈ U such that t ∈ C′_q but t ∉ D_j. Because U = ∪_{i=1}^{n} D_i, ∃ 1 ≤ s ≤ n such that t ∈ D_s, s ≠ j. Thus t ∈ C′_q ∩ D_s, s ≠ j. Therefore we obtain t_0 ≈_{C−A} t, that is, t_0[C−A] = t[C−A], for t_0 ∈ C′_q and t ∈ C′_q; but t_0 ≉_{C−A+D} t, that is, t_0[C−A+D] ≠ t[C−A+D], for t_0 ∈ D_j and t ∈ D_s, s ≠ j. From the above, one can see that t_0 and t are projected to be the same by Π_{C−A}(U) but different by Π_{C−A+D}(U). Thus Π_{C−A+D}(U) has at least one more distinct tuple than Π_{C−A}(U), which means Card(Π_{C−A}(U)) < Card(Π_{C−A+D}(U)).
On the other hand, if Card(Π_{C−A}(U)) ≠ Card(Π_{C−A+D}(U)), one can infer Card(Π_{C−A}(U)) < Card(Π_{C−A+D}(U)) by Proposition 3. Hence ∃ t, s ∈ U such that t and s are projected to be the same by Π_{C−A}(U) but distinct by Π_{C−A+D}(U), that is, t[C−A] = s[C−A] and t[C−A+D] ≠ s[C−A+D]. Thus we have t[D] ≠ s[D], that is, t ≉_D s. Therefore ∃ 1 ≤ q ≤ k such that t, s ∈ C′_q, and 1 ≤ i ≠ j ≤ n such that t ∈ D_i and s ∈ D_j. So ∀ 1 ≤ p ≤ n, C′_q ⊄ D_p (otherwise t, s ∈ D_p). By Definition 2, we have ∀ 1 ≤ p ≤ n, t, s ∉ Lower_{C−A}(D_p). Thus t, s ∉ Lower_{C−A}([D]). U is consistent, however; by Definition 3 and Proposition 1, t, s ∈ U = Lower_C([D]), which leads to Lower_C([D]) ≠ Lower_{C−A}([D]). □
Corollary 1. If U is consistent, then a ∈ C is a core attribute of C in U w.r.t. D if and only if Card(Π_{C−{a}}(U)) ≠ Card(Π_{C−{a}+D}(U)). □
Corollary 2. If U is consistent, then a ∈ C is a dispensable attribute of C in U w.r.t. D if and only if Card(Π_{C−{a}+D}(U)) = Card(Π_{C−{a}}(U)). □
Corollary 3. If U is consistent, then ∀ A ⊆ C, Lower_C([D]) = Lower_{C−A}([D]) if and only if Card(Π_{C−A+D}(U)) = Card(Π_{C−A}(U)). □
Thus, in order to check whether an attribute a ∈ C is a core attribute, we only need to take two projections of the table, one on C − {a} + D and the other on C − {a}, and then count the distinct number of tuples in each projection. If the cardinalities of the two projections are the same, then no information is lost in removing the dispensable attribute a; otherwise, a is a core attribute. Put more formally, in database terms, the cardinalities of the two projections being compared will differ if and only if there exist at least two tuples x and y such that ∀ c ∈ C − {a}, x[c] = y[c], but x[a] ≠ y[a] and x[D] ≠ y[D]. In this case, the number of distinct tuples in the projection on C − {a} will be one fewer than that in the projection on C − {a} + D, for x and y are identical in the former while they are still distinguishable in the latter. So eliminating attribute a loses the ability to distinguish tuples x and y. Intuitively, this means that some classification information will be lost if a is eliminated.
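Corollary 1 and the discussion above translate directly into a cardinality test. The following hedged sketch (not the authors' code) checks whether a single attribute is a core attribute of a toy table; the data and helper names are invented.

```python
# Hedged sketch of the core-attribute test of Corollary 1 on an invented table.
def distinct_count(rows, cols):
    """Card of the projection of rows onto cols."""
    return len({tuple(r[c] for c in cols) for r in rows})

def is_core(rows, cond_attrs, dec_attrs, a):
    rest = [c for c in cond_attrs if c != a]
    return distinct_count(rows, rest) != distinct_count(rows, rest + dec_attrs)

U = [{"a": 0, "b": 0, "d": "yes"},
     {"a": 0, "b": 1, "d": "no"},
     {"a": 1, "b": 0, "d": "no"}]
print(is_core(U, ["a", "b"], ["d"], "a"))   # True: dropping a merges tuples with different d
print(is_core(U, ["a", "b"], ["d"], "b"))   # True as well in this tiny example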
Definition 6. Let B ⊆ C. The degree of dependency between B and D in the data table U, denoted K(B, D), is defined as K(B, D) = Card(Π_B(U)) / Card(Π_{B+D}(U)). □
K (C; D) = 1.
U
is consistent, then
8 B C; 0 < K (B; D)
1, and
Proof. By Proposition 3 and De nition 6, one can infer K (B; D) 1. By Theorem 1, Card(C (U )) = Card(C +D (U )). Therefore, K (C; D) = 1. 2
U is consistent, then R C is a reduct of C w.r.t. D if and K (R; D) = K (C; D), and 8B R; K (B; D) 6= K (C; D). Proof. K (R; D) = K (C; D) if and only if, by Proposition 4, K (R; D) = 1 if and only if, by De nition 6, Card(R (U )) = Card(R D (U ) if and only if, by Corollary 3, LowerR ([D]) = LowerC ([D]). Similarly, 8 B R; K (B; D) 6= K (C; D) if and only if LowerB ([D]) 6= LowerC ([D]). By De nition 5, one can Theorem 3 If only if
+
see that this theorem holds.
4
2
Algorithms for Finding Core Attributes and Reducts
In classi cation, two kinds of attributes are generally perceived as unnecessary: attributes that are irrelevant to the target concept (like the customer ID), and attributes that are redundant given other attributes. These unnecessary attributes can exist simultaneously, but the redundant attributes are more diÆcult to eliminate because of the correlations between them. In rough set community, we eliminate unnecessary attributes by constructing reducts of condition attributes. As proved [10], a reduct of condition attributes set C must contain all core attributes of C . So it is important to develop an eÆcient algorithm to nd all core attributes in order to generate a reduct. In traditional rough set models, this is achieved by constructing a decision matrix, and then nd all entries with only one attribute in the decision matrix. The corresponding attributes of the entries containing only one attribute, are core attributes [2]. This method is ineÆcient
388
Jianchao Han et al.
and not realistic to construct a decision matrix for millions of tuples, which is a typical situation for data mining applications. Before we present the algorithms, we review the implementation of Count and Projection in relational database systems using SQL statements. One can verify that both of them run in time of O(n) [5].
Card(C D (U )): SELECT DISTINCT COUNT(*) FROM U { Card(X (U )): {
+
SELECT DISTINCT COUNT(*) FROM (SELECT X FROM U)
Algorithm 1 FindCore: Find the set of core attributes of a data table Input: A consistent data table U with conditional attributes set C and decision attributes set D
Output: Core { the set of core attributes of C w.r.t. D in U 1. 2. 3. 4. 5.
Set Core ; For each attribute a 2 C If Card(C fag (U )) < Card(C fag+D (U ))
Then Core
Return Core
Core [ fag
Theorem 2 ensures that the outcome Core of the algorithm FindCore contains all core attributes and only those attributes.
Theorem 4 The algorithm FindCore can be implemented in O(mn) time, where
m is the number of attributes and n is the number of tuples (rows). Proof. The For loop is executed m times, and inside each loop, nding the cardinality takes O(n). Therefore, the total running time is O(mn). 2 Algorithm 2 FindReduct: Find a reduct of the conditional attributes set Input: A consistent data table U with conditional attributes set C and decision attributes set D, and the Core of C w.r.t. D in U
Output: REDU { a reduct of conditional attributes set of C w.r.t. D in U 1. 2. 3. 4. 5.
REDU
C; DISP
C Core
For each attribute a 2 DISP Do If K (REDU fag; D) = 1 Then
REDU REDU fag
Return REDU
Proposition 5 Assume U is consistent and R C . If K (R; D) < 1 then 8 B
R; K (B; D) < 1. Proof. Since K (R; D) < 1, we have Card(R (U )) < Card(R D (U )) by De nition 6. Thus, 9 t; s 2 U such that t[R] = s[R] but t[R + D] 6= s[R + D], so t[D] 6= s[D], and 8 B R; t[B] = s[B]. Therefore, Card(B (U )) < Card(B D (U )) and K (B; D) < 1 by De nition 6. 2 +
+
A New Computation Model for Rough Set Theory Based on Database Systems
389
Theorem 5 The outcome of Algorithm 2 is a reduct of C w.r.t. D in U . Proof. Assume the output of Algorithm 2 is REDU . From the algorithm it can be easily observed that K (REDU; D) = 1, and 8 a 2 REDU; K (REDU fag; D) < 1. By Proposition 5, one can see 8 B REDU; K (B; D) < 1. Therefore, by Proposition 4 and Theorem 3, we conclude that REDU is a reduct of C w.r.t. D in U . 2
Theorem 6 Algorithm 2 runs in time of O(mn), where
m is the number of U and n is the number of tuples in U . Proof. The For loop executes at most m times and each loop takes O(n) time to calculate K (REDU fag; D). Thus, the total running time of the algorithm is O(mn). 2 attributes in
One may note that the outcome of the algorithm FindReduct is an arbitrary reduct of the condition atrributes set C , if C has more than one reduct. Which reduct is generated depends on the order of attributes that are checked for dispensibility in Step 2 of the FindReduct algorithm. Some authers propose algorithms for constructing the best reduct, but what is the best depends on how to de ne the criteria, such as the number of attributes in the reduct, the number of possible values of attributes, etc. Given the criteria, FindReduct can be easily adapted to construct the best reduct, only if we choose an attribute to check for dispensibility based on the criteria. This will be one of our future works.
5
Related Work
Currently, there are few papers on the algorithm for nding core attributes. The traditional method is to construct a decision matrix and then search all the entries in the matrix. If an entry in the matrix contains only one attribute, that attribute is a core attribute [2]. Constructing the decision matrix, however, is not realistic in real-world applications. Our method for nding all attributes is much more eÆcient and scalable, especially when used with relational database systems, and only takes O(mn) time. There are algorithms for nding reducts in the literature, although nding all reducts is NP-hard [11]. Feature selection algorithms for constructing classi ers have been proposed [1, 3, 9], which are strongly related with nding reducts. However, very few of those literature address the time complexity analysis of algorithms. The algorithm for nding a reduct proposed in [1] takes O(m2 n2 ), while four algorithms for nding subset attributes are developed in [3], each of which takes O(m3 n2 ). Our algorithm for nding a reduct presented in this paper runs only in time of O(mn). Moreover, our algorithm utilizes the relational database system operations, and thus much more scalable. What we present in this paper is originally activated by [4, 6], both of which propose new techniques of using relational database systems to implement some rough set operations.
390
6
Jianchao Han et al.
Concluding Remarks
Most existing rough set models do not integrate with database systems but perform computational intensive operations such as generating core, reduct, and rule induction on at les, which limits their applicability for large data set in data mining applications. In order to take advantage of eÆcient data structures and algorithms developed in database systems, we proposed a new computation model for rough set theory using relational algebra operations. Two algorithms for computing core attributes and constructing reducts were presented. We proved the their correctness and analyzed their time complexity. Since relational algebra operations have been eÆciently implemented in most widely-used database systems, the algorithms presented can be extensively applied to these database systems and adapted to a wide range of real-life applications. Moreover, our algorithms are scalable, because existing database systems have demonstrated the capability of eÆciently processing very large data sets. However, the FindReduct algorithm can only generate an arbitrary reduct, which may not be the best one. To nd the best reduct, we should gure out how to de ne the seletion criteria, which is usually dependent on the application and bias. Our future work will focus on following two aspects: de ning the reduct selection criteria and nding the best reduct in terms of the criteria; and applying this model to feature selection and rule induction for knowledge discovery.
References 1. Bell, D., Guan, J., Computational methods for rough classi cation and discovery, J. of ASIS 49:5, pp. 403-414, 1998. 2. Cercone, N., Ziarko, W., Hu, X., Rule Discovery from Databases: A Decision Matrix Approach, Proc. Int'l Sym. on Methodologies for Intelligent System, 1996. 3. Deogun, J., Choubey, S., Taghavan, V., Sever, H., Feature selection and eective classi ers, J. of ASIS 49:5, pp. 423-434, 1998. 4. Hu, X., Lin, T. Y., Han, J., A New Rough Sets Model Based on Database Systems, Proc. of the 9th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, 2003. 5. Garcia-Molina, H., Ullman, J. D., Widom, J., Database System Implementation, Prentice Hall, 2000. 6. Kumar, A., A New Technique for Data Reduction in A Database System for Knowledge Discovery Applications, J. of Intelligent Systems, 10(3). 7. Lin, T.Y and Cercone, N., Applications of Rough Sets Theory and Data Mining, Kluwer Academic Publishers, 1997. 8. Lin, T. Y., Yao, Y. Y., and Zadeh, L. A., Data Mining, Rough Sets and Granular Computing, Physical-Verlag, 2002. 9. Modrzejewski, M., Feature Selection Using Rough Sets Theory, in Proc. ECML, pp.213-226, 1993. 10. Pawlak, Z., Rough Sets, International Journal of Information and Computer Science, 11(5), pp.341-356, 1982. 11. Pawlak, Z., Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers, 1991.
Computing SQL Queries with Boolean Aggregates Antonio Badia Computer Engineering and Computer Science Department University of Louisville
Abstract. We introduce a new method for optimization of SQL queries with nested subqueries. The method is based on the idea of Boolean aggregates, aggregates that compute the conjunction or disjunction of a set of conditions. When combined with grouping, Boolean aggregates allow us to compute all types of non-aggregated subqueries in a uniform manner. The resulting query trees are simple and amenable to further optimization. Our approach can be combined with other optimization techniques and can be implemented with a minimum of changes in any cost-based optimizer.
1
Introduction
Due to the importance of query optimization, there exists a large body of research in the subject, especially for the case of nested subqueries ([10, 5, 13, 7, 8, 17]). It is considered nowadays that existing approaches can deal with all types of SQL subqueries through unnesting. However, practical implementation lags behind the theory, since some transformations are quite complex to implement. In particular, subqueries where the linking condition (the condition connecting query and subquery) is one of NOT IN, NOT EXISTS or a comparison with ALL seem to present problems to current optimizers. These cases are assumed to be translated, or are dealt with using antijoins. However, the usual translation does not work in the presence of nulls, and even when fixed it adds some overhead to the original query. On the other hand, antijoins introduce yet another operator that cannot be moved in the query tree, thus making the job of the optimizer more difficult. When a query has several levels, the complexity grows rapidly (an example is given below). In this paper we introduce a variant of traditional unnesting methods that deals with all types of linking conditions in a simple, uniform manner. The query tree created is simple, and the approach extends neatly to several levels of nesting and several subqueries at the same level. The approach is based on the concept of Boolean aggregates, which are an extension of the idea of aggregate function in SQL ([12]). Intuitively, Boolean aggregates are applied to a set of predicates and combine the truth values resulting from evaluation of the predicates. We show how two simple Boolean predicates can take care of any type of SQL subquery in
This research was sponsored by NSF under grant IIS-0091928.
Y. Kambayashi, M. Mohania, W. W¨ oß (Eds.): DaWaK 2003, LNCS 2737, pp. 391–400, 2003. c Springer-Verlag Berlin Heidelberg 2003
a uniform manner. The resulting query trees are simple and amenable to further optimization. Our approach can be combined with other optimization techniques and can be implemented with a minimum of changes in any cost-based optimizer. In section 2 we describe in more detail related research on query optimization and motivate our approach with an example. In section 3 we introduce the concept of Boolean aggregates and show its use in query unnesting. We then apply our approach to the example and discuss the differences with standard unnesting. Finally, in section 4 we offer some preliminary conclusions and discuss further research.
2
Related Research and Motivation
We study SQL queries that contain correlated subqueries1 . Such subqueries contain a correlated predicate, a condition in their WHERE clause introducing the correlation. The attribute in the correlated predicate provided by a relation in an outer block is called the correlation attribute; the other attribute is called the correlated attribute. The condition connecting query and subquery is called the linking condition. There are basically four types of linking condition in SQL: comparisons between an attribute and an aggregation (called the linking aggregate); IN and NOT IN comparisons; EXISTS and NOT EXISTS comparisons; and quantified comparisons between an attribute and a set of attribute through the use of SOME and ALL. We call linking conditions involving an aggregate, IN, EXISTS, and comparisons with SOME positive linking conditions, and the rest (those involving NOT IN, NOT EXISTS, and comparisons with ALL) negative linking conditions. All nested correlated subqueries are nowadays executed by some variation of unnesting. In its original approach ([10]), the correlation predicate is seen as a join; if the subquery is aggregated, the aggregate is computed in advance and then join is used. Kim’s approach had a number of shortcomings; among them, it assumed that the correlation predicate always used equality and the linking condition was a positive one. Dayal’s ([5]) and Muralikrishna’s ([13]) work solved these shortcomings; Dayal introduced the idea of using an outerjoin instead of a join (so values with no match would not be lost), and proceeds with the aggregate computation after the outerjoin. Muralikrishna generalizes the approach and points out that negative linking aggregates can be dealt with using antijoin or translating them to other, positive linking aggregates. These approaches also introduce some shortcomings. First, outerjoins and antijoins do not commute with regular joins or selections; therefore, a query tree with all these operators does not offer many degrees of freedom to the optimizer. The work of [6] and [16] has studied conditions under which outerjoins and antijoins can be moved; alleviating this problem partially. Another problem with this approach is that by carrying out the (outer)join corresponding to the correlation predicate, other predicates in the WHERE clause of the main query, which may restrict the total computation to be carried out, are postponed. The magic sets 1
The approach is applicable to non-correlated subqueries as well, but does not provide any substantial gains in that case.
approach ([17, 18, 20]) pushes these predicates down past the (outer)join by identifying the minimal set of values that the correlating attributes can take (the magic set), and computing it in advance. This minimizes the size of other computation but comes at the cost of building the magic set in advance. However, all approaches in the literature assume positive linking conditions (and all examples shown in [5, 13, 19, 20, 18] involve positive linking conditions). Negative linking conditions are not given much attention; it is considered that queries can be rewritten to avoid them, or that they can be dealt with directly using antijoins. But both approaches are problematic. About the former, we point out that the standard translation does not work if nulls are present. Assume, for instance, the condition attr > ALL Q, where Q is a subquery, with attr2 the linked attribute. It is usually assumed that a (left) antijoin with condition attr ≤ attr2 is a correct translation of this condition, since for a tuple t to be in the antijoin, it cannot be the case that t.attr ≤ attr2, for any value of attr2 (or any value in a given group, if the subquery is correlated). Unfortunately, this equivalence is only true for 2-valued logics, not for the 3-valued logic that SQL uses to evaluate predicates when null is present. The condition attr ≤ attr2 will fail if attr is not null, and no value of attr2 is greater than or equal to attr, which may happen because attr2 is the right value or because attr2 is null. Hence, a tuple t will be in the antijoin in the last case above, and t will qualify for the result. Even though one could argue that this can be solved by changing the condition in the antijoin (and indeed, a correct rewrite is possible, but more complex than usually considered ([1]), a larger problem with this approach is that it produces plans with outerjoins and antijoins, which are very difficult to move around on the query tree; even though recent research has shown that outerjoins ([6]) and antijoins ([16]) can be moved under limited circumstances, this still poses a constraint on the alternatives that can be generated for a given query plan -and it is up to the optimizer to check that the necessary conditions are met. Hence, proliferation of these operations makes the task of the query optimizer difficult. As an example of the problems of the traditional approach, assume tables R(A,B,C,D), S(E,F,G,H,I), U(J,K,L), and consider the query Select * From R Where R.A > 10 and R.B NOT IN (Select S.E From S Where S.F = 5 and R.D = S.G and S.H > ALL (Select U.J From U Where U.K = R.C and U.L != S.I)) Unnesting this query with the traditional approach has the problem of introducing several outerjoins and antijoins that cannot be moved, as well as extra
394
Antonio Badia Project(R.*) Select(A>10 & F=5) AJ(B = E) AJ(H =< J) Project(R.*,S.*) Project(S.*,T.*) LOJ(K = C and L != I) T
LOJ(D = G) R
S
Fig. 1. Standard unnesting approach applied to the example
operations. To see why, note that we must outerjoin U with S and R, and then group by the keys of R and S, to determine which tuples of U must be tested for the ALL linking condition. However, should the set of tuples of U in a group fail the test, we cannot throw the whole group away: for that means that some tuples in S fail to qualify for an answer, making true the NOT IN linking condition, and hence qualifying the R tuple. Thus, tuples in S and U should be antijoined separately to determine which tuples in S pass or fail the ALL test. Then the result should separately antijoined with R to determine which tuples in R pass or fail the NOT IN test. The result is shown in figure 1, with LOJ denoting a left outer join and AJ denoting an antijoin (note that the tree is actually a graph!). Even though Muralikrishna ([13]) proposes to extract (left) antijoins from (left) outerjoins, we note that in general such reuse may not be possible: here, the outerjoin is introduced to deal with the correlation, and the antijoin with the linking, and therefore they have distinct, independent conditions attached to them (and such approaches transform the query tree in a query graph, making it harder for the optimizer to consider alternatives). Also, magic sets would be able to improve on the above plan pushing selections down to the relations; however, this approach does not improve the overall situation, with outerjoins and antijoins still present. Clearly, what is called for is an approach which uniformly deals with all types of linking conditions without introducing undue complexity.
3
Boolean Aggregates
We seek a uniform method that will work for all linking conditions. In order to achieve this, we define Boolean aggregates AND and OR, which take as input a comparison, a set of values (or tuples), and return a Boolean (true or false) as output. Let attr be an attribute, θ a comparison operator and S a set of values.
Then AND(S, attr, θ) = ∧_{attr2 ∈ S} (attr θ attr2).
We define AND(∅, att, θ) to be true for any att, θ. Also, OR(S, attr, θ) = ∨_{attr2 ∈ S} (attr θ attr2).
We define OR(∅, att, θ) to be false for any att, θ. It is important to point out that each individual comparison is subject to the semantics of SQL’s WHERE clause; in particular, comparisons with null values return unknown. The usual behavior of unknown with respect to conjunction and disjunction is followed ([12]). Note also that the set S will be implicit in normal use. When the Boolean aggregates are used alone, S will be the input relation to the aggregate; when used in conjunction with a GROUP-BY operator, each group will provide the input set. Thus, we will write GBA,AN D(B,θ) (R), where A is a subset of attributes of the schema of R, B is an attribute from the schema of R, and θ is a comparison operator; and similarly for OR. The intended meaning is that, similar to other aggregates, AND is applied to each group created by the grouping. We use boolean aggregates to compute any linking condition which does not use a (regular) aggregate, as follows: after a join or outerjoin connecting query and subquery is introduced by the unnesting, a group by is executed. The grouping attributes are any key of the relation from the outer block; the Boolean aggregate used depends on the linking condition: for attr θ SOM E Q, where Q is a correlated subquery, the aggregate used is OR(attr, θ). For attr IN Q, the linking condition is treated as attr = SOM E Q. For EXIST S Q, the aggregate used in OR(1, 1, =)2 . For attr θ ALL Q, where Q is a correlated subquery, the aggregate used is AN D(attr, θ). For attr N OT IN Q, the linking condition is treated as attr = ALL Q. Finally, for N OT EXIST S Q, the aggregate used is AN D(1, 1, =). After the grouping and aggregation, the Boolean aggregates leave a truth value in each group of the grouped relation. A selection then must be used to pick up those tuples where the boolean is set to true. Note that most of this work can be optimized in implementation, an issue that we discuss in the next subsection. Clearly, implementing a Boolean aggregate is very similar to implementing a regular aggregate. The usual way to compute the traditional SQL aggregates (min, max, sum, count, avg) is to use an accumulator variable in which to store temporary results, and update it as more values come. For min and max, for instance, any new value is compared to the value in the accumulator, and replaces it if it is smaller (larger). Sum and count initialize the accumulator to 0, and increase the accumulator with each new value (using the value, for sum, using 1, for count). Likewise, a Boolean accumulator is used for Boolean 2
Note that technically this formulation is not correct since we are using a constant instead of attr, but the meaning is clear.
aggregates. For ALL, the accumulator is started as true; for SOME, as false. As new values arrive, a comparison is carried out, and the result is ANDed (for AND) or ORed (for OR) with the accumulator. There is, however, a problem with this straightforward approach. When an outerjoin is used to deal with the correlation, tuples in the outer block that have no match appear in the result exactly once, padded on the attributes of the inner block with nulls. Thus, when a group by is done, these tuples become their own group. Hence, tuples with no match actually have one (null) match in the outer join. The Boolean aggregate will then iterate over this single tuple and, finding a null value on it, will deposit a value of unknown in the accumulator. But when a tuple has no matches the ALL test should be considered successful. The problem is that the outer join marks no matches with a null; while this null is meant to be no value occurs, SQL is incapable of distinguishing this interpretation from others, like value unknown (for which the 3-valued semantics makes sense). Note also that the value of attr2 may genuinely be a null, if such a null existed in the original data. Thus, what is needed is a way to distinguish between tuples that have been added as a pad by the outer join. We stipulate that outer joins will pad tuples without a match not with nulls, but with a different marker, called an emptymarker, which is different from any possible value and from the null marker itself. Then a program like the following can be used to implement the AND aggregate: acc = True; while (not (empty(S)){ t = first(S); if (t.attr2 != emptymark) acc = acc AND attr comp attr2; S = rest(S); } Note that this program implements the semantics given for the operator, since a single tuple with the empty marker represents the empty set in the relational framework3. 3.1
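The accumulator program above can be written out more fully as follows. This is a hedged elaboration rather than the paper's actual code: tuples are dicts, the empty-marker is a module-level sentinel, and SQL's three-valued logic is modeled with None standing for unknown; all names are illustrative.

```python
# Hedged sketch of the Boolean aggregates AND/OR with an explicit empty-marker,
# using None to model SQL's 'unknown'.
EMPTYMARK = object()          # padding used instead of NULL for unmatched outer-join tuples

def compare(left, op, right):
    if left is None or right is None:
        return None           # unknown, as in SQL 3-valued logic
    return {"<": left < right, "<=": left <= right, "=": left == right,
            ">": right < left, ">=": right <= left, "!=": left != right}[op]

def and3(a, b):               # 3-valued AND
    if a is False or b is False: return False
    if a is None or b is None:   return None
    return True

def or3(a, b):                # 3-valued OR
    if a is True or b is True:  return True
    if a is None or b is None:  return None
    return False

def bool_aggregate(group, attr_value, op, inner_attr, kind="AND"):
    acc = (kind == "AND")     # AND starts at true, OR at false
    for t in group:
        if t[inner_attr] is EMPTYMARK:
            continue          # a padded tuple represents the empty set: skip it
        result = compare(attr_value, op, t[inner_attr])
        acc = and3(acc, result) if kind == "AND" else or3(acc, result)
        if kind == "AND" and acc is False:
            return False      # early exit, as in the optimizations discussed below
        if kind == "OR" and acc is True:
            return True
    return acc

# attr > ALL over a group with only a padded (no-match) tuple: the group passes
group = [{"attr2": EMPTYMARK}]
print(bool_aggregate(group, attr_value=7, op=">", inner_attr="attr2", kind="AND"))  # True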
Query Unnesting
We unnest using an approach that we call quasi-magic. First, at every query level the WHERE clause, with the exception of any linking condition(s), is transformed into a query tree. This allows us to push selections before any unnesting, as in the magic approach, but we do not compute the magic set, just the complementary set ([17, 18, 20]). This way, we avoid the overhead associated with the magic method. Then, correlated queries are treated as in Dayal’s approach, by adding 3
The change of padding in the outer join should be of no consequence to the rest of query processing. Right after the application of the Boolean aggregate, a selection will pick up only those tuples with a value of true in the accumulator. This includes tuples with the marker; however, no other operator up the query tree operates on the values with the marker -in the standard setting, they would contain nulls, and hence no useful operation can be carried out on these values.
a join (or outerjoin, if necessary), followed by a group by on key attributes of the outer relation. At this point, we apply boolean aggregates by using the linking condition, as outlined above. In our previous example, a tree (call it T1 ) will be formed to deal with the outer block: σA>10 (R). A second tree (call it T2 ) is formed for the nested query block at first level: σF =5 (S). Finally, a third tree is formed for the innermost block: U (note that this is a trivial tree because, at every level, we are excluding linking conditions, and there is nothing but linking conditions in the WHERE clause of the innermost block of our example). Using these trees as building blocks, a tree for the whole query is built as follows: 1. First, construct a graph where each tree formed so far is a node and there is a direct link from node Ti to node Tj if there is a correlation in the Tj block with the value of the correlation coming from a relation in the Ti block; the link is annotated with the correlation predicate. Then, we start our tree by left outerjoining any two nodes that have a link between them (the left input corresponding to the block in the outer query), using the condition in the annotation of the link, and starting with graph sources (because of SQL semantics, this will correspond to outermost blocks that are not correlated) and finishing with sinks (because of SQL semantics, this will correspond to innermost blocks that are correlated). Thus, we outerjoin from the outside in. An exception is made for links between Ti and Tj if there is a path in the graph between Ti and Tj on length ≥ 1. In the example above, our graph will have three nodes, T1 , T2 and T3 , with links from T1 to T2 , T1 to T3 and T2 to T3 . We will create a left outerjoin between T2 and T3 first, and then another left outerjoin of T1 with the previous result. In a situation like this, the link from T1 to T3 becomes a condition just another condition when we outerjoin T1 to the result of the previous outerjoin. 2. On top of the tree obtained in the previous step, we add GROUP BY nodes, with the grouping attributes corresponding to keys of relations in the left argument of the left outerjoins. On each GROUP BY, the appropriate (boolean) aggregate is used, followed by a SELECT looking for tuples with true (for boolean aggregates) or applying the linking condition (for regular aggregates). Note that these nodes are applied from the inside out, ie. the first (bottom) one corresponds to the innermost linking condition, and so on. 3. A projection, if needed, is placed on top of the tree. The following optimization is applied automatically: every outerjoin is considered to see if it can be transformed into a join. This is not possible for negative linking conditions (NOT IN, NOT EXISTS, ALL), but it is possible for positive linking conditions and all aggregates except COUNT(*)4 . 4
This rule coincides with some of Galindo-Legaria rules ([6]), in that we know that in positive linking conditions and aggregates we are going to have selections that are null-intolerant and, therefore, the outerjoin is equivalent to a join.
398
Antonio Badia
PROJECT(R.*) SELECT(Bool=True) GB(Rkey,AND(R.B != S.E)) Select(Bool=True) GB(Rkey,Skey, AND(S.H > T.J)) LOJ(K = C and D = G) SELECT(A>10) LOJ(L = I) R
Select(F=5) T S
Fig. 2. Our approach applied to the example
After this process, the tree is passed on to the query optimizer to see if further optimization is possible. Note that inside each subtree Ti there may be some optimization work to do; note also that, since all operators in the tree are joins and outerjoins, the optimizer may be able to move around some operators. Also, some GROUP BY nodes may be pulled up or pushed down ([2, 3, 8, 9]). We show the final result applied to our example above in figure 2. Note that in our example the outerjoins cannot be transformed into joins; however, the group bys may be pushed down depending on the keys of the relation (which we did not specify). Also, even if groupings cannot be pushed down, note that the first one groups the temporal relation by the keys of R and S, while the second one groups by the keys of R alone. Clearly, this second grouping is trivial; the whole operation (grouping and aggregate) can be done in one scan of the input. Compare this tree with the one that is achieved by standard unnesting (shown in figure 1), and it is clear that our approach is more uniform and simple, while using to its advantage the ideas behind standard unnesting. Again, magic sets could be applied to Dayal’s approach, to push down the selections in R and S like we did. However, in this case additional steps would be needed (for the creation of the complementary and magic sets), and the need for outerjoins and antijoins does not disappear. In our approach, the complementary set is always produced by our decision to process first operations at the same level, collapsing each query block (with the exception of linking conditions) to one relation (this is the reason we call our approach a quasi-magic strategy). As more levels and more subqueries with more correlations are added, the simplicity and clarity of our approach is more evident.
3.2 Optimizations
Besides algebraic optimizations, there are some particular optimizations that can be applied to Boolean aggregates. Obviously, AND evaluation can stop as soon as some predicate evaluates to false (with final result false), and OR evaluation can stop as soon as some predicate evaluates to true (with final result true). The subsequent selection on Boolean values can be done on the fly: since we know that the selection condition is going to look for groups with a value of true, groups with a value of false can be thrown away directly, in essence pipelining the selection into the GROUP BY. Note also that by pipelining the selection, we eliminate the need for a Boolean attribute. In our example, once both left outerjoins have been carried out, the first GROUP BY is executed by sorting or hashing on the keys of R and S. On each group, the Boolean aggregate AND is computed as tuples arrive. As soon as a comparison returns false, computation of the Boolean aggregate stops, and the group is marked so that any further tuples belonging to it are ignored; no output is produced for that group. Groups that do not fail the test are added to the output. Once this temporary result is created, it is read again and scanned, looking only at the values of the keys of R to create the groups; the second Boolean aggregate is computed as before. Also as before, as soon as a comparison returns false, the group is flagged for dismissal. The output is composed of the groups that were not flagged when the input was exhausted. Therefore, the cost of our plan, considering only operations above the second left outerjoin, is that of grouping the temporary relation by the keys of R and S, writing the output to disk and reading this output into memory again. In traditional unnesting, the cost after the second left outerjoin is that of executing two antijoins, which is on the order of executing two joins.
4 Conclusion and Further Work
We have proposed an approach to unnesting SQL subqueries which builds on top of existing approaches. Therefore, our proposal is very easy to implement in existing query optimization and query execution engines, as it requires very little in the way of new operations, cost calculations, or implementation in the back-end. The approach allows us to treat all SQL subqueries in a uniform and simplified manner, and meshes well with existing approaches, letting the optimizer move operators around and apply advanced optimization techniques (like outerjoin reduction and push-down/pull-up of GROUP BY nodes). Further, because it extends to several levels easily, it simplifies the resulting query trees. Optimizers are becoming quite sophisticated and complex; a simple and uniform treatment of all queries is certainly worth examining. We have argued that our approach yields better performance than traditional approaches when negative linking conditions are present. We plan to analyze the performance of our approach by implementing Boolean aggregates in a DBMS and/or developing a detailed cost model, to offer further support for the conclusions reached in this paper.
References
[1] Cao, B. and Badia, A., Subquery Rewriting for Optimization of SQL Queries, submitted for publication.
[2] Chaudhuri, S. and Shim, K., Including Group-By in Query Optimization, in Proceedings of the 20th VLDB Conference, 1994.
[3] Chaudhuri, S. and Shim, K., An Overview of Cost-Based Optimization of Queries with Aggregates, Data Engineering Bulletin, 18(3), 1995.
[4] Cohen, S., Nutt, W. and Serebrenik, A., Algorithms for Rewriting Aggregate Queries Using Views, in Proceedings of the Design and Management of Data Warehouses Conference, 1999.
[5] Dayal, U., Of Nests and Trees: A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers, in Proceedings of the VLDB Conference, 1987.
[6] Galindo-Legaria, C. and Rosenthal, A., Outerjoin Simplification and Reordering for Query Optimization, ACM TODS, vol. 22, n. 1, 1997.
[7] Ganski, R. and Wong, H., Optimization of Nested SQL Queries Revisited, in Proceedings of the ACM SIGMOD Conference, 1987.
[8] Goel, P. and Iyer, B., SQL Query Optimization: Reordering for a General Class of Queries, in Proceedings of the 1996 ACM SIGMOD Conference.
[9] Gupta, A., Harinarayan, V. and Quass, D., Aggregate-Query Processing in Data Warehousing Environments, in Proceedings of the VLDB Conference, 1995.
[10] Kim, W., On Optimizing an SQL-Like Nested Query, ACM Transactions on Database Systems, vol. 7, n. 3, September 1982.
[11] Gupta, A. and Mumick, I. S. (eds.), Materialized Views: Techniques, Implementations and Applications, MIT Press, 1999.
[12] Melton, J., Advanced SQL: 1999, Understanding Object-Relational and Other Advanced Features, Morgan Kaufmann, 2003.
[13] Muralikrishna, M., Improving Unnesting Algorithms for Join Aggregate Queries in SQL, in Proceedings of the VLDB Conference, 1992.
[14] Ross, K. and Rao, J., Reusing Invariants: A New Strategy for Correlated Queries, in Proceedings of the ACM SIGMOD Conference, 1998.
[15] Ross, K. and Chatziantoniou, D., Groupwise Processing of Relational Queries, in Proceedings of the 23rd VLDB Conference, 1997.
[16] Rao, J., Lindsay, B., Lohman, G., Pirahesh, H. and Simmen, D., Using EELs, a Practical Approach to Outerjoin and Antijoin Reordering, in Proceedings of ICDE 2001.
[17] Seshadri, P., Pirahesh, H. and Leung, T. Y. C., Complex Query Decorrelation, in Proceedings of ICDE 1996, pages 450-458.
[18] Seshadri, P., Hellerstein, J. M., Pirahesh, H., Leung, T. Y. C., Ramakrishnan, R., Srivastava, D., Stuckey, P. J. and Sudarshan, S., Cost-Based Optimization for Magic: Algebra and Implementation, in Proceedings of the SIGMOD Conference, 1996, pages 435-446.
[19] Mumick, I. S. and Pirahesh, H., Implementation of Magic-sets in a Relational Database System, in Proceedings of the SIGMOD Conference, 1994, pages 103-114.
[20] Mumick, I. S., Finkelstein, S. J., Pirahesh, H. and Ramakrishnan, R., Magic is Relevant, in Proceedings of the SIGMOD Conference, 1990, pages 247-258.
Fighting Redundancy in SQL Antonio Badia and Dev Anand Computer Engineering and Computer Science Department University of Louisville Louisville KY 40292
Abstract. Many SQL queries with aggregated subqueries exhibit redundancy (overlap in FROM and WHERE clauses). We propose a method, called the for-loop, to optimize such queries by ensuring that redundant computations are done only once. We specify a procedure to build a query plan implementing our method, give an example of its use and argue that it offers performance advantages over traditional approaches.
1 Introduction
In this paper, we study a class of Decision-Support SQL queries, characterize them and show how to process them in an improved manner. In particular, we analyze queries containing subqueries, where the subquery is aggregated (type-A and type-JA in [8]). In many of these queries, SQL exhibits redundancy in that the FROM and WHERE clauses of query and subquery show a great deal of overlap. We argue that these patterns are currently not well supported by relational query processors. The following example gives some intuition about the problem; the query used is Query 2 from the TPC-H benchmark ([18]); we will refer to it as query TPCH2:

select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from part, supplier, partsupp, nation, region
where p_partkey = ps_partkey and s_suppkey = ps_suppkey
  and p_size = 15 and p_type like '%BRASS' and r_name = 'EUROPE'
  and s_nationkey = n_nationkey and n_regionkey = r_regionkey
  and ps_supplycost = (select min(ps_supplycost)
                       from partsupp, supplier, nation, region
                       where p_partkey = ps_partkey and s_suppkey = ps_suppkey
                         and s_nationkey = n_nationkey and n_regionkey = r_regionkey
                         and r_name = 'EUROPE')
order by s_acctbal desc, n_name, s_name, p_partkey;
This research was sponsored by NSF under grant IIS-0091928.
This query is executed in most systems by using unnesting techniques. However, the commonality between query and subquery will not be detected, and all operations (including common joins and selections) will be repeated (see an in-depth discussion of this example in subsection 2.3). Our goal is to avoid this duplication of effort. For lack of space, we do not discuss related research in query optimization ([3, 11, 6, 7, 8, 15]); we point out, however, that detecting and dealing with redundancy is not attempted in this body of work. Our method applies only to aggregated subqueries whose WHERE clauses overlap with the main query's WHERE clause. This may seem a very narrow class of queries until one realizes that all types of SQL subqueries can be rewritten as aggregated subqueries (EXISTS, for instance, can be rewritten as a subquery with COUNT; all other types of subqueries can be rewritten similarly ([2])), as illustrated below. Therefore, the approach is potentially applicable to any SQL query with subqueries. Also, it is important to point out that the redundancy is present because of the structure of SQL, which necessitates a subquery in order to declaratively state the aggregation to be computed. Thus, we argue that such redundancy is not infrequent ([10]). We describe an optimization method geared towards detecting and optimizing this redundancy. Our method not only computes the redundant part only once, but also proposes a new special operator to compute the rest of the query very effectively. In section 2 we describe our approach and the new operator in more detail. We formally describe the operator (subsection 2.1), show how query trees with the operator can be generated for a given SQL query (subsection 2.2), and describe an experiment run in the context of the TPC-H benchmark ([18]) (subsection 2.3). Finally, in section 3 we propose some further research.
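As a hedged illustration of that rewriting claim, consider the hypothetical tables X(a, k) and Y(k, b), which are not taken from the paper:

-- EXISTS form
SELECT x.a
FROM   X x
WHERE  EXISTS (SELECT 1 FROM Y y WHERE y.k = x.k AND y.b = 5);

-- Equivalent aggregated (correlated) form with COUNT as the linking aggregate;
-- COUNT(*) is never NULL, so the two queries return the same rows.
SELECT x.a
FROM   X x
WHERE  0 < (SELECT COUNT(*) FROM Y y WHERE y.k = x.k AND y.b = 5);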
2 Optimization of Redundancy
In this section we define patterns which detect redundancy in SQL queries. We then show how to use the matching of patterns and SQL queries to produce a query plan which avoids repeating computations. We represent SQL queries in a schematic form, or pattern. With the keywords SELECT ... FROM ... WHERE we will use L, L1, L2, ... as variables over lists of attributes; T, T1, T2, ... as variables over lists of relations; F, F1, F2, ... as variables over aggregate functions; and ∆, ∆1, ∆2, ... as variables over (complex) conditions. Attributes will be represented by attr, attr1, attr2, .... If there is a condition in the WHERE clause of the subquery which introduces correlation, it will be shown explicitly; this is called the correlation condition. The table to which the correlated attribute belongs is called the correlation table, and is said to introduce the correlation; the attribute compared to the correlated attribute is called the correlating attribute. The condition that connects query and subquery (called a linking condition) is also shown explicitly. The operator in the linking condition is called the linking operator, its attributes the linking attributes, and the aggregate function on the subquery side is called the linking aggregate. We will say that a pattern
matches an SQL query when there is a correspondence g between the variables in the pattern and the elements of the query. As an example, the pattern

SELECT L FROM T WHERE ∆1 AND attr1 θ (SELECT F(attr2) FROM T WHERE ∆2)

would match query TPCH2 by setting g(∆1) = {p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 15 and p_type like '%BRASS' and r_name = 'EUROPE' and s_nationkey = n_nationkey and n_regionkey = r_regionkey}, g(∆2) = {p_partkey = ps_partkey and s_suppkey = ps_suppkey and r_name = 'EUROPE' and s_nationkey = n_nationkey and n_regionkey = r_regionkey}, g(T) = {part, supplier, partsupp, nation, region}, g(F) = min and g(attr1) = g(attr2) = ps_supplycost. Note that the T symbol appears twice, so the pattern forces the query to have the same FROM clauses in the main query and in the subquery (for correlated subqueries, the correlation table is counted as present in the FROM clause of the subquery). The correlation condition is p_partkey = ps_partkey; the correlation table is part, and ps_partkey is the correlating attribute. The linking condition here is ps_supplycost = min(ps_supplycost); thus ps_supplycost is the linking attribute, '=' the linking operator and min the linking aggregate. The basic idea of our approach is to divide the work to be done into three parts: one that is common to query and subquery, one that belongs only to the subquery, and one that belongs only to the main query (we are assuming that all relations mentioned in a query are connected, i.e., that there are no Cartesian products present, only joins; therefore, when there is overlap between the FROM clauses of query and subquery, we are very likely to find common conditions in both WHERE clauses, at least the joins). The part that is common to both query and subquery can be done only once; however, as we argue in subsection 2.3, in most systems today it would be done twice. We calculate the three parts above as follows: the common part is g(∆1) ∩ g(∆2); the part proper to the main query is g(∆1) − g(∆2); and the part proper to the subquery is g(∆2) − g(∆1). For query TPCH2, this yields {p_partkey = ps_partkey and s_suppkey = ps_suppkey and r_name = 'EUROPE' and s_nationkey = n_nationkey and n_regionkey = r_regionkey}, {p_size = 15 and p_type like '%BRASS'} and ∅, respectively. We use this matching in constructing a program to compute this query. The process is explained in the next subsection.
2.1 The For-Loop Operator
We start out with the common part, called the base relation, in order to ensure that it is not done twice. The base relation can be expressed as an SPJ query. Our strategy is to compute the rest of the query starting from this base relation. This strategy faces two difficulties. First, if we simply divide the query based
on common parts, we obtain a plan where redundancy is eliminated at the price of fixing the order of some operations. In particular, some selections not in the common part would not be pushed down. Hence, it is unclear whether this strategy will provide significant improvements by itself (this situation is similar to that of [13]). Second, when starting from the base relation, we face a problem in that this relation has to be used for two different purposes: it must be used to compute an aggregate after finishing up the WHERE clause in the subquery (i.e., after computing g(∆2) − g(∆1)); and it must be used to finish up the WHERE clause in the main query (i.e., to compute g(∆1) − g(∆2)) and then, using the result of the previous step, compute the final answer to the query. However, it is extremely hard in relational algebra to combine the operators involved. For instance, the computation of an aggregate must be done before the aggregate can be used in a selection condition. In order to solve this problem, we define a new operator, called the for-loop, which combines several relational operators into a new one (i.e., a macro-operator). The approach is based on the observation that some basic operations appear frequently together and could be implemented more efficiently as a whole. In our particular case, we show in the next subsection that there is an efficient implementation of the for-loop operator which allows it, in some cases, to compute several basic operators with one pass over the data, thus saving considerable disk I/O.

Definition 1. Let R be a relation, sch(R) the schema of R, L ⊆ sch(R), A ∈ sch(R), F an aggregate function, α a condition on R (i.e., involving only attributes of sch(R)) and β a condition on sch(R) ∪ {F(A)} (i.e., involving attributes of sch(R) and possibly F(A)). Then the for-loop operator is defined as either one of the following:
1. FL_{L, F(A), α, β}(R). The meaning of the operator is defined as follows: let Temp be the relation GB_{L, F(A)}(σ_α(R)) (GB is used to indicate a group-by operation). Then the for-loop yields the relation σ_β(R ⋈_{R.L = Temp.L} Temp), where the condition of the join is understood as the pairwise equality of each attribute in L. This is called a grouped for-loop.
2. FL_{F(A), α, β}(R). The meaning of the operator is given by σ_β(AGG_{F(A)}(σ_α(R)) × R), where AGG_{F(A)}(R) indicates the aggregate F computed over all A values of R. This is called a flat for-loop.

Note that β may contain aggregated attributes as part of a condition; in fact, in the typical use in our approach, it does contain an aggregation. The main use of a for-loop is to calculate the linking condition of a query with an aggregated subquery on the fly, possibly with additional selections. Thus, for instance, for query TPCH2, the for-loop takes the grouped form FL_{p_partkey, min(ps_supplycost), ∅, p_size=15 ∧ p_type LIKE '%BRASS' ∧ ps_supplycost=min(ps_supplycost)}(R), where R is the relation obtained by computing the base relation (again, note that the base relation contains the correlation as a join). The for-loop is equivalent to the relational expression σ_{p_size=15 ∧ p_type LIKE '%BRASS' ∧ ps_supplycost=min(ps_supplycost)}(AGG_{min(ps_supplycost)}(R) × R).
It can be seen that this expression will compute the original SQL query; the aggregation will compute the aggregate function of the subquery (the conditions in the WHERE clause of the subquery have already been computed in R, since in this case ∆2 ⊆ ∆1 and hence ∆2 − ∆1 = ∅), and the Cartesian product will put a copy of this aggregate on each tuple, allowing the linking condition to be stated as a regular condition over the resulting relation. Note that this expression may not be better, from a cost point of view, than other plans produced by standard optimization. What makes this plan attractive is that the for-loop operator can be implemented in such a way that it computes its output with one pass over the data. In particular, the implementation will not carry out any Cartesian product, which is used only to explain the semantics of the operator. The operator is written as an iterator that loops over the input implementing a simple program (hence the name). The basic idea is simple: in some cases, computing an aggregation and using the aggregate result in a selection can be done at the same time. This is due to the behavior of some aggregates and the semantics of the conditions involved. Assume, for instance, that we have a comparison of the type attr = min(attr2), where both attr and attr2 are attributes of some table R. In this case, as we compute the minimum over a series of values, we can actually decide, as we iterate over R, whether some tuples can ever make the condition true. This is due to the fact that min is monotonically non-increasing, i.e., as we iterate over R and carry a current minimum, this value will always stay the same or decrease, never increase. Since equality imposes a very strict constraint, we can take a decision on the current tuple t based on the values of t.attr and the current minimum, as follows: if t.attr is greater than the current minimum, we can safely get rid of it. If t.attr is equal to the current minimum, we should keep it, at least for now, in a temporary result temp1. If t.attr is less than the current minimum, we should keep it, in case our current minimum changes, in a temporary result temp2. Whenever the current minimum changes, we know that temp1 should be deleted, i.e., tuples there cannot be part of a solution. On the other hand, temp2 should be filtered: some tuples there may be thrown away, some may go into a new temp1, some may remain in temp2. At the end of the iteration, the set temp1 gives us the correct solution. Of course, as we go over the tuples in R we may keep some tuples that we need to get rid of later on; but the important point is that we never have to go back and recover a tuple that we dismissed, thanks to the monotonic behavior of min. This behavior generalizes to max, sum and count, since they are all monotonically non-decreasing (for sum, it is assumed that all values in the domain are positive numbers); average, however, is not monotonic (in either an increasing or a decreasing manner). For this reason, our approach does not apply to average. For the other aggregates, though, we argue that we can successfully take decisions on the fly without having to recover discarded tuples later on.
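For concreteness, the rows the grouped for-loop produces for TPCH2 can also be written declaratively over the TPC-H schema. This is a hedged sketch only: the base relation is factored out once, the original ORDER BY is omitted, and the actual for-loop implementation computes the same result in a single pass without materializing the grouped aggregate separately.

-- Hedged declarative rendering of the grouped for-loop for TPCH2.
WITH base AS (
  SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone,
         s_comment, p_size, p_type, ps_supplycost
  FROM   part, supplier, partsupp, nation, region
  WHERE  p_partkey = ps_partkey AND s_suppkey = ps_suppkey
    AND  s_nationkey = n_nationkey AND n_regionkey = r_regionkey
    AND  r_name = 'EUROPE'                      -- the common part (base relation)
)
SELECT b.s_acctbal, b.s_name, b.n_name, b.p_partkey, b.p_mfgr,
       b.s_address, b.s_phone, b.s_comment
FROM   base b
JOIN  (SELECT p_partkey, MIN(ps_supplycost) AS min_cost    -- Temp = GB over the base relation
       FROM   base
       GROUP  BY p_partkey) t
  ON   b.p_partkey = t.p_partkey
WHERE  b.p_size = 15                                        -- beta: conditions proper to the main query
  AND  b.p_type LIKE '%BRASS'
  AND  b.ps_supplycost = t.min_cost;                        -- plus the linking condition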
2.2 Query Transformation
The general strategy to produce a query plan with for-loops for a given SQL query Q is as follows: we classify Q into one of two categories, according to its structure. For each category, a pattern p is given. As before, if Q fits into p there is a mapping g between constants in Q and variables in p. Associated with each pattern there is a for-loop program template t. A template is different from a program in that it has variables and options. Using the information in the mapping g (including the particular linking aggregate and linking condition in Q), a concrete for-loop program is generated from t. The process to produce a query tree containing a for-loop operator is then simple: our patterns allow us to identify the part common to query and subquery (i.e., the base relation), which is used to start the query tree. Standard relational optimization techniques can be applied to this part. Then a for-loop operator which takes the base relation as input is added to the query tree, and its parameters are determined. We describe each step separately. We distinguish between two types of queries: type A queries, in which the subquery is not correlated (this corresponds to type A in [8]); and type B queries, where the subquery is correlated (this corresponds to type JA in [8]). Queries of type A are interesting in that usual optimization techniques cannot do anything to improve them (obviously, unnesting does not apply to them). Thus, our approach, whenever applicable, offers a chance to create an improved query plan. In contrast, queries of type B have been dealt with extensively in the literature ([8, 3, 6, 11, 17, 16, 15]). As we will see, our approach is closely related to other unnesting techniques, but it is the only one that considers redundancy between query and subquery and its optimization. The general pattern a type A query must fit is given below:

SELECT L FROM T
WHERE ∆1 and attr1 θ (SELECT F(attr2) FROM T WHERE ∆2)
{GROUP BY L2}

The braces around the GROUP BY clause indicate that the clause is optional (obviously, SQL syntax requires that L2 ⊆ L, where L and L2 are lists of attributes; in the following, we assume that queries are well formed). We create a query plan for this query in two steps:
1. A base relation is defined by g(∆1) ∩ g(∆2)(g(T)). Note that this is an SPJ query, which can be optimized by standard techniques.
2. We apply a for-loop operator defined by FL(g(F(attr2)), g(∆2) − g(∆1), g(∆1) − g(∆2) ∧ g(attr1 θ F(attr2))).
It can be seen that this query plan computes the correct result for this query by using the definition of the for-loop operator. Here, the aggregate is F(attr2),
α is g(∆2 − ∆1) and β is g(∆1) − g(∆2) ∧ g(attr1 θ F(attr2)). Thus, this plan will first apply ∆1 ∩ ∆2 to T, in order to generate the base relation. Then, the for-loop will compute the aggregate F(attr2) on the result of selecting g(∆2 − ∆1) on the base relation. Note that (∆2 − ∆1) ∪ (∆1 ∩ ∆2) = ∆2, and hence the aggregate is computed over the conditions in the subquery only, as it should be. The result of this aggregate is then "appended" to every tuple in the base relation by the Cartesian product (again, note that this description is purely conceptual). After that, the selection on g(∆1) − g(∆2) ∧ g(attr1 θ F(attr2)) is applied. Here we have that (∆1 − ∆2) ∪ (∆1 ∩ ∆2) = ∆1, and hence we are applying all the conditions in the main clause. We are also applying the linking condition attr1 θ F(attr2), which can be considered a regular condition now because F(attr2) is present in every tuple. Thus, the for-loop operator computes the query correctly. This for-loop operator will be implemented by a program that carries out all needed operations with one scan of the input relation. Clearly, the concrete program is going to depend on the linking operator (θ, assumed to be one of {=, ≠, <, ≤, >, ≥}) and the aggregate function (F, assumed to be one of min, max, sum, count, avg). The general pattern for type B queries is given next:

SELECT L FROM T1
WHERE ∆1 and attr1 θ (SELECT F1(attr2) FROM T2 WHERE ∆2 and S.attr3 θ R.attr4)
{GROUP BY L2}

where R ∈ T1 − T2, S ∈ T2, and we are assuming that T1 − {R} = T2 − {S} (i.e., the FROM clauses contain the same relations except the one introducing the correlated attribute, called R, and the one introducing the correlating attribute, called S). We call T = T1 − {R}. As before, a GROUP BY clause is optional. In our approach, we consider the table containing the correlated attribute as part of the FROM clause of the subquery too (i.e., we effectively decorrelate the subquery). Thus, the outer join is always part of our common part. In our plan, there are two steps:
1. Compute the base relation, given by g(∆1 ∩ ∆2)(T ∪ {R, S}). This includes the outer join of R and S.
2. Compute a grouped for-loop defined by FL(attr4, F(attr2), ∆2 − ∆1, ∆1 − ∆2 ∧ attr1 θ F(attr2)), which computes the rest of the query.
Our plan has two main differences with traditional unnesting: the parts common to query and subquery are computed only once, at the beginning of the plan, and computing the aggregate, the linking predicate, and possibly some selections is carried out by the for-loop operator in one step. Thus, we potentially deal with larger temporary results, as some selections (those not in ∆1 ∩ ∆2) are not pushed down, but we may be able to effect several computations at once (and do not repeat any computation).
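For the simpler type A case, here is a hedged illustration of the flat for-loop; the query, the column choices and the linking condition are our own assumptions over the TPC-H schema, not taken from the paper.

-- Hypothetical type A query (uncorrelated aggregated subquery):
--   SELECT l_orderkey, l_quantity
--   FROM   lineitem, part
--   WHERE  l_partkey = p_partkey AND p_size = 15
--     AND  l_quantity >= (SELECT MAX(l_quantity)
--                         FROM lineitem, part
--                         WHERE l_partkey = p_partkey AND p_type LIKE '%BRASS');
-- Base relation = the common join; alpha = {p_type LIKE '%BRASS'};
-- beta = {p_size = 15 AND l_quantity >= aggregate}. The flat for-loop is equivalent to:
WITH base AS (
  SELECT l_orderkey, l_quantity, p_size, p_type
  FROM   lineitem, part
  WHERE  l_partkey = p_partkey
)
SELECT b.l_orderkey, b.l_quantity
FROM   base b,
       (SELECT MAX(l_quantity) AS agg          -- one-row aggregate over sigma_alpha(base)
        FROM   base
        WHERE  p_type LIKE '%BRASS') m
WHERE  b.p_size = 15
  AND  b.l_quantity >= m.agg;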
Fig. 1. Standard query plan: Select(ps_supplycost = min(ps_supplycost)) over the join of the outer-query branch (Part with Select(size=15 & type LIKE %BRASS) joined to PartSupp, Supplier, Nation and Region with Select(name='Europe')) and the unnested subquery branch (GB_{ps_partkey, min(ps_supplycost)} over the joins of PartSupp, Supplier, Nation and Region with Select(name='Europe')).
Clearly, which plan is better depends on the amount of redundancy between query and subquery, the linking condition (which determines how efficient the for-loop operator is), and traditional optimization parameters, like the size of the input relations and the selectivity of the different conditions.
2.3 Example and Analytical Comparison
We apply our approach to query TPCH2; this is a typical type B query. For our experiment we created a TPC-H benchmark database of the smallest size (1 GB) using two leading commercial DBMS. We created indices on all primary and foreign keys, updated system statistics, and captured the query plan for query 2 on each system. Both query plans were very similar, and they are represented by the query tree in figure 1. Note that the query is unnested based on Kim's approach (i.e., first group and then join). Note also that all selections are pushed all the way down; they were executed by pipelining with the joins. The main differences between the two systems were the choices of implementations for the joins and different join orderings (to make sure that the particular linking condition was not an issue, the query was changed to use different linking aggregates and linking operators; the query plan remained the same, except that for operators other than equality Dayal's approach was used instead of Kim's; also, memory size was varied from a minimum of 64 MB to a maximum of 512 MB, to determine whether memory size was an issue, and again the query plan remained the same through all memory sizes). For our purposes, the main observation about this query plan is that operations in query and subquery are repeated, even though there clearly is a large amount of overlap between them (we have disregarded the final Sort needed to complete the query, as this would be necessary in any approach, including ours). We created a query plan for this query, based on our approach (shown in figure 2). Note that our approach does not dictate how the base relation is optimized; the particular plan shown uses the same tree as the original query tree to facilitate comparisons. It is easy to see that our approach avoids any duplication of work.
Fig. 2. For-loop query plan: FL(p_partkey, min(ps_supplycost), ∅, p_size=15 & p_type LIKE '%BRASS' & ps_supplycost=min(ps_supplycost)) applied to the base relation, i.e., the joins of Part, PartSupp, Supplier, Nation and Region with Select(name='Europe').
However, this comes at the cost of fixing the order of some operations (i.e., operations in ∆1 ∩ ∆2 must be done before other operations). In particular, some selections get pushed up because they do not belong in the common part, which increases the size of the relation created as input for the for-loop. Here, TPCH2 returns 460 rows, while the intermediate relation that the for-loop takes as input has 158,960 tuples. Thus, the cost of executing the for-loop may grow more than that of other operations because of its larger input. However, grouping and aggregating took both systems about 10% of the total time (this and all other timing data come from measuring the performance of appropriate SQL queries executed against the TPC-H database on both systems; details are left out for lack of space). Another observation is that the duplicated operations do not take double the time, because of cache usage. But this can be attributed to the excellent main memory/database size ratio in our setup; with a more realistic setup this effect is likely to be diminished. Nevertheless, our approach avoids duplicated computation and does result in some time improvement (it takes about 70% of the time of the standard approach). In any case, it is clear that a plan using the for-loop is not guaranteed to be superior to traditional plans under all circumstances. Thus, it is very important to note that we assume a cost-based optimizer which will generate a for-loop plan if at least some amount of redundancy is detected, and will compare the for-loop plan to others based on cost.
3 Conclusions and Further Research
We have argued that Decision-support SQL queries tend to contain redundancy between query and subquery, and this redundancy is not detected and optimized by relational processors. We have introduced a new optimization mechanism to deal with this redundancy, the for-loop operator, and an implementation for it, the for-loop program. We developed a transformation process that takes us from SQL queries to for-loop programs. A comparative analysis with standard relational optimization was shown. The for-loop approach promises a more efficient implementation for queries falling in the patterns given. For simplicity and lack of space, the approach is introduced here applied to a very restricted
class of queries. However, we have already worked out extensions to widen its scope (mainly, the approach can work with overlapping (not just identical) FROM clauses in query and subquery, and with different classes of linking conditions). We are currently developing a precise cost model, in order to compare the approach with traditional query optimization using different degrees of overlap, different linking conditions, and different data distributions as parameters. We are also working on extending the approach to several levels of nesting, and studying its applicability to OQL.
References
[1] Badia, A. and Niehues, M., Optimization of Sequences of Relational Queries in Decision-Support Environments, in Proceedings of DAWAK'99, LNCS 1676, Springer-Verlag.
[2] Cao, B. and Badia, A., Subquery Rewriting for Optimization of SQL Queries, submitted for publication.
[3] Dayal, U., Of Nests and Trees: A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers, in Proceedings of the VLDB Conference, 1987.
[4] Fegaras, L. and Maier, D., Optimizing Queries Using an Effective Calculus, ACM TODS, vol. 25, n. 4, 2000.
[5] Freytag, J. and Goodman, N., On the Translation of Relational Queries into Iterative Programs, ACM Transactions on Database Systems, vol. 14, no. 1, March 1989.
[6] Ganski, R. and Wong, H., Optimization of Nested SQL Queries Revisited, in Proceedings of the ACM SIGMOD Conference, 1987.
[7] Goel, P. and Iyer, B., SQL Query Optimization: Reordering for a General Class of Queries, in Proceedings of the 1996 ACM SIGMOD Conference.
[8] Kim, W., On Optimizing an SQL-Like Nested Query, ACM Transactions on Database Systems, vol. 7, n. 3, September 1982.
[9] Lieuwen, D. and DeWitt, D., A Transformation-Based Approach to Optimizing Loops in Database Programming Languages, in Proceedings of the ACM SIGMOD Conference, 1992.
[10] Lu, H., Chan, H. C. and Wei, K. K., A Survey on Usage of SQL, SIGMOD Record, 1993.
[11] Muralikrishna, M., Improving Unnesting Algorithms for Join Aggregate Queries in SQL, in Proceedings of the VLDB Conference, 1992.
[12] Park, J. and Segev, A., Using Common Subexpressions to Optimize Multiple Queries, in Proceedings of the 1988 IEEE CS ICDE.
[13] Ross, K. and Rao, J., Reusing Invariants: A New Strategy for Correlated Queries, in Proceedings of the ACM SIGMOD Conference, 1998.
[14] Rao, J., Lindsay, B., Lohman, G., Pirahesh, H. and Simmen, D., Using EELs, a Practical Approach to Outerjoin and Antijoin Reordering, in Proceedings of ICDE 2001.
[15] Seshadri, P., Pirahesh, H. and Leung, T. Y. C., Complex Query Decorrelation, in Proceedings of ICDE 1996.
[16] Seshadri, P., Hellerstein, J. M., Pirahesh, H., Leung, T. Y. C., Ramakrishnan, R., Srivastava, D., Stuckey, P. J. and Sudarshan, S., Cost-Based Optimization for Magic: Algebra and Implementation, in Proceedings of the SIGMOD Conference, 1996.
[17] Mumick, I. S. and Pirahesh, H., Implementation of Magic-sets in a Relational Database System, in Proceedings of the SIGMOD Conference, 1994.
[18] TPC-H Benchmark, TPC Council, http://www.tpc.org/home.page.html.
“On-the-fly” VS Materialized Sampling and Heuristics Pedro Furtado 1
Centro de Informática e Sistemas (DEI-CISUC) Universidade de Coimbra
[email protected] http://www.dei.uc.pt/~pnf
Abstract. Aggregation queries can take hours to return answers in large data warehouses (DW). A user interested in exploring data in several iterative steps using decision support or data mining tools may be frustrated by such long response times. The ability to return approximate answers quickly, accurately and efficiently is important to these applications. Samples for use in query answering can be obtained "on-the-fly" (OS) or from a materialized summary of samples (MS). While MS are typically faster than OS summaries, they have the limitation that sampling rates are predefined upon construction. This paper analyzes the use of OS versus MS for approximate answering of aggregation queries and proposes a sampling heuristic that chooses the appropriate sampling rate to provide answers as fast as possible while simultaneously guaranteeing accuracy targets. The experimental section compares OS to MS, analyzing response time and accuracy (TPC-H benchmark), and shows the heuristic strategy in action.
1 Introduction
Applications that analyze data in today's large organizations typically access very large volumes of data, pushing the limits of traditional database management systems in performance and scalability. Sampling summaries return fast approximate answers to aggregation queries, can easily be implemented in a DBMS with no or only minor changes, and make use of the query processing and optimization strategies and structures of the DBMS. Materialized sampling (MS), as in AQUA [6], implies that summaries are constructed in one phase and used subsequently. Although these summaries can be very fast, they have an important limitation: the summary size must be defined at summary construction time. The statistical answer estimation strategy used by sampling means that, while a very detailed query pattern can only be answered accurately with a large number of samples, more aggregated patterns can be answered with very small, extremely fast summaries. Therefore, it is useful to be able to choose a specific sampling rate for a specific query. Sampling can also be achieved using a common SAMPLE operator that extracts a percentage of rows from a table randomly, using for instance a sequential one-pass
strategy [10] over a table directory or index. This operator typically exists for the collection of statistics over schema objects for cost-based optimization (e.g., the Oracle 9i SAMPLE operator). It is based on specifying the desired sampling rate (e.g., SAMPLE 1%), scanning only a subset of the table blocks and extracting samples from those blocks. A faster but less uniform sampling alternative uses all the tuples from each scanned block as samples (e.g., SAMPLE BLOCK 1%). Materialized sampling (MS) has an important advantage over "on-the-fly" sampling (OS) and "online aggregation" (OA) [9]: while OS and OA retrieve random samples, requiring non-sequential I/O, MS can use faster sequential scans over the materialized samples. In this paper we analyze the use of the SAMPLE operator for OS approximate answering of aggregation queries and compare it with MS, revealing the advantages and shortcomings of OS. We also propose sampling heuristics to choose the appropriate sampling rate to provide answers as fast as possible while simultaneously guaranteeing accuracy targets. The paper is organized as follows: section 2 discusses related work; section 3 discusses summarizing approaches and heuristics issues; section 4 presents an experimental analysis and comparison using the TPC-H decision support benchmark; section 5 contains concluding remarks.
2 Related Work
There are several recent works on approximate query answering strategies, including [9, 8, 2, 1]. There has also been a considerable amount of work on developing statistical techniques for data reduction in large data warehouses, as can be seen in the survey [3]. Summaries immensely reduce the amount of data that must be processed. Materialized views (MVs) can also achieve this by pre-computing quantities, and they are quite useful, for instance, to obtain pre-defined reports. However, while summaries work well in any ad-hoc environment, MVs have a more limited, pre-defined scope. The Approximate Query Answering (AQUA) system [6, 7] provides approximate answers using small, pre-computed synopses of the underlying base data. The system provides probabilistic error/confidence bounds on the answer [2, 8]. [9] proposed a technique for online aggregation, in which the base data is scanned in random order at query time and the approximate answer is continuously updated as the scan proceeds, until all tuples are processed or the user is satisfied with the answer. A graphical display depicts the answer and a confidence interval as the scan proceeds, so that the user may stop the process at any time. In order to achieve the random-order scanning, there must be an index on the base tuples ordered by the grouping columns (typically a large index), and a specific functionality that scans this index iteratively (a possibly enormous number of runs) end-to-end in order to retrieve individual tuples from each group in each run. The authors claim that, with appropriate buffering, index striding is at least as fast as scanning a relation via an unclustered index, with each tuple of the relation being fetched only once, although each fetch requires a random I/O, which is typically much slower than a full table scan with sequential I/O.
avg = avg(samples) ± 1.65 × σ(l_quantity)/√count(*)
count = count(samples)/SP ± 1.65 × √count(*)/SP
sum = sum(samples)/SP
max = max(samples)
min = min(samples)

SELECT brand, yearmonth, avg(l_quantity), sum(l_quantity)/SP, count(*)/SP
FROM lineitem, part
WHERE l_partkey = p_partkey
GROUP BY yearmonth, brand
SAMPLE SP;

Fig. 1. Estimation and Query Rewriting
On-the-fly sampling (OS) also retrieves samples online, but uses a common SAMPLE operator and no specific structures or functionality over such structures. For this operator to be used in approximate answering, the appropriate sampling rate must be used depending on the aggregation pattern and it should deliver estimations and accuracy measures. The insufficiency of samples in summaries is an important issue in the determination of sampling rates and our previous work includes a strategy for appropriate sampling rate choice [5]. This problem has also been the main driver of proposals on improving the representation capability of summaries based on query workloads [2, 1, 4].
3 Summarizing Approaches and Heuristics
In this section we describe the structure and procedure for "on-the-fly" (OS) and materialized (MS) sampling, compare the approaches, and develop the heuristic strategy used to choose a convenient sampling rate.
3.1 Sampling Rate and Accuracy/Speed (A/S) Limitations
Summary approaches use a middle-layer software component that analyses each query, rewrites it to execute against a sampled data set and returns a very fast estimation. The sampling strategy itself is based either on pre-computed materialized samples (MS) or on "on-the-fly" sampling (OS) from the DW. The pre-computed MS summary is put into a structure similar to the DW (typically, a set of star schemas), but facts are replaced with samples taken with a sampling rate (or sampling percentage, SP; e.g., a summary can have SP = 1% of the original data), and dimensions contain the subset of rows that are referenced by the fact samples. "On-the-fly" sampling, on the other hand, is obtained by specifying a sampling rate ("SAMPLE SP%") at the end of the query, which is then submitted against the DW. The estimation procedure is based on producing an individual estimation and error bound for each group of a typical group aggregation query. Figure 1 shows the formulas used and a rewritten query to estimate from samples and provide confidence intervals. Intuitively, or from the analysis of the query rewrites, it is possible to conclude that the summaries do not involve any complex computations and hold the promise of extremely fast response times against a much smaller data set than the original DW.
Fig. 2. Materialized Summary Construction
For either OS or MS with a set of summary sizes, an appropriate sampling rate should be determined. The sampling rate should be as small as possible for fast response times. However, it is crucial to have enough samples in order to return accurate estimations. Going back to the rewritten query of figure 1, the estimation procedure is applied individually within each group aggregated in the query. A certain sampling rate can estimate the sales of some brands by year but fail completely to estimate by month or week. Additionally, groups can have different sizes (and data distribution as well), so that some may lack samples. The heuristics strategy proposes solutions not only to determine the best SP but also to deal with these issues.
3.2 Structure and Comparison of Materialized and "On-the-fly" Sampling
MS: Figure 2 shows the Materialized samples (MS) construction strategy. MS can be obtained easily by setting up a schema similar to the base schema and then sampling the base fact table(s) into the MS fact table(s). Dimensions are then populated from the base dimensions by importing the rows that are referenced by the MS fact(s), resulting in facts with SP% of the base fact and dimensions typically much smaller than the base dimensions as well. The query that would be submitted against the base DW is rewritten by replacing the fact and dimension table names by the corresponding summary fact and dimensions, proceeding then with expression substitution (query rewriting). OS: Figure 3 shows the basic strategy used to obtain “on-the-fly” samples (OS) to answer the query. The fact table source is replaced by a sub-expression selecting samples from that fact source. Query expressions are also substituted exactly as with MS but, unlike MS, the base dimensions are maintained in the query. The query processor samples the fact table by selecting tuples randomly. In order to be more efficient, this sampling should be done over a row directory or index structure to avoid scanning the whole table.
Fig. 3. Answering Queries with On-the-fly Sampling
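An Oracle-style sketch of the two alternatives follows. The summary table names ms_lineitem and ms_part, the 1% rate and the scaled estimates are illustrative assumptions (following Fig. 1 and Fig. 2), not code taken from the paper.

-- MS: materialize 1% of the fact table, then keep only the referenced dimension rows.
CREATE TABLE ms_lineitem AS SELECT * FROM lineitem SAMPLE (1);
CREATE TABLE ms_part AS
  SELECT * FROM part WHERE p_partkey IN (SELECT l_partkey FROM ms_lineitem);

-- OS: leave the schema untouched and replace the fact source with a sampling
-- sub-expression at query time; estimates that depend on row counts are scaled
-- by 1/SP (SP = 0.01 for SAMPLE (1)).
SELECT p_brand,
       TO_CHAR(s.l_shipdate, 'yyyy-mm')  AS year_month,
       AVG(s.l_quantity)                 AS avg_estimate,
       SUM(s.l_quantity) / 0.01          AS sum_estimate,
       COUNT(*)          / 0.01          AS count_estimate
FROM   (SELECT * FROM lineitem SAMPLE (1)) s, part
WHERE  s.l_partkey = p_partkey
GROUP  BY p_brand, TO_CHAR(s.l_shipdate, 'yyyy-mm');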
From the previous description, it is easy to see why materialized summaries (MS) are typically faster than "on-the-fly" sampling (OS) with the same sampling rate. In MS the summary facts are available for fast sequential scanning and dimensions are smaller than the original data warehouse dimensions, while OS must retrieve samples using non-sequential I/O and join them with complete base dimensions. The exact speedup difference between MS and OS depends on a set of factors related to the schema and the size of facts and dimensions, but the difference is frequently large, as we show in the experimental section. How can we reduce the response time disadvantage of OS? It is advantageous to reduce I/O by sampling blocks rather than individual rows, but the samples will not be completely random. The overhead of joining the sample rows with complete (large) dimensions in OS, instead of joining with the subset of dimension rows corresponding to summary facts (MS), is more relevant in many situations. The only straightforward way to reduce this problem would be to materialize a reasonably small summary (MS) and then sample that summary "on-the-fly" (OS) for a smaller sampling rate.
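A hedged sketch of the hybrid just mentioned, reusing the hypothetical ms_lineitem/ms_part summary from the previous sketch: sampling 10% of a 1% summary yields an effective rate of about 0.1% (SP = 0.001) while scanning only the small materialized tables.

SELECT p_brand,
       TO_CHAR(s.l_shipdate, 'yyyy-mm')  AS year_month,
       AVG(s.l_quantity)                 AS avg_estimate,
       SUM(s.l_quantity) / 0.001         AS sum_estimate,
       COUNT(*)          / 0.001         AS count_estimate
FROM   (SELECT * FROM ms_lineitem SAMPLE (10)) s, ms_part
WHERE  s.l_partkey = p_partkey
GROUP  BY p_brand, TO_CHAR(s.l_shipdate, 'yyyy-mm');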
3.3 Sampling Rate Decision
The objective of the sampling heuristic (SH) is simple: to find the most appropriate sampling rate (SPQ) for a query Q. This is typically the fastest (therefore smallest) summary that is still capable of answering within a desired accuracy target. If OS is being used, the heuristic then uses SPQ to sample the base data; otherwise (MS) it chooses the (larger) summary size closest to SPQ. The accuracy target can be defined by parameters (CI%, FG%). The CI% value is a relative confidence interval target, CI% = CI(estimation)/estimation (e.g., the error should be within 10% of the estimated value). The fraction of groups that must be within CI% (FG%) is important to enable sampling even when a few groups are too badly represented (have too few samples). Without this parameter (or equivalently when FG%=100%), the accuracy target would have to be met by all groups in the response set, including the smallest one, which can result in large sampling rates. For instance, (CI%=10%, FG%=90%) means that at least 90% of groups are expected to answer within the 10% CI% target. Minimum and maximum sampling rates can be useful to enclose the range of possible choices for SPQ (e.g., SPmin=0.05%, SPmax=30%). The sampling rate SPmax is a value above which it is not very advantageous to sample, and the minimum corresponds to a very fast summary. In practice, the sampling rate SPmax would depend on the specific query and should be modified accordingly. For instance, a full table scan on a base fact is as fast as a sampling rate that requires every block of the table to be read (SP = 1/average number of tuples per block). However, queries with heavy joining can still improve execution time immensely with that sampling rate. Given a query Q, the heuristic must find the suitable sampling rate SPQ in the spectrum of Figure 4 based on the accuracy targets CI% and FG%. If SPQ is below SPmin, it is replaced by SPmin, which provides additional accuracy without a large response time. Otherwise, SPQ is used, unless SPQ > SPmax, in which case it is better to go to the DW directly.
Fig. 4. SP ranges and query processing choices
3.4 Determining SPQ from Accuracy Targets
If we know the minimum number of samples needed to estimate within a given accuracy target (nmin) and we also know the number of values in the most demanding aggregation group (ng), then a sampling rate of SPQ = nmin/ng should be enough. For instance, if nmin = 45 and the number of elements of that group is 4500, then SPQ ≥ 1%. Instead of the most demanding group, we can determine ng as the FG% percentile of the distribution of the number of elements. For instance, for FG=75%, ng is a number such that 75% of the aggregation groups have at least ng samples. Then SPQ = nmin/ng should be able to estimate at least 75% of the group results within the accuracy target CI%. We call ng the characteristic number of elements, as it is a statistical measure of the number of elements in groups. Next we show how ng and nmin are determined. There are three alternatives for the determination of SPQ or ng:
• Manual: the user can specify SPQ manually;
• Selectivity: ng can be estimated using count statistics;
• Trial: a trial SPQ (SPtry) can be used; if the result is not sufficiently accurate, another SPQ is estimated from statistics on ng collected during the execution of SPtry.
The determination of ng based on statistical data is a selectivity estimation problem, with the particularity that what must be estimated is the number of elements within each aggregation group. Selectivity estimation is a recurring theme in RDBMS, with many alternative strategies. We propose that statistics be collected when queries are executed so that they become available later on. Count statistics are collected and put into a structure identifying the aggregation and containing percentiles of the cumulative distribution of the number of elements (e.g., in the following example 75% of the groups have at least 6000 elements):

brand/month: 10%=17000, 25%=9400, 50%=6700, 75%=6000, 90%=1750, max=20000, min=1000, stdev=710, SPS=100%

These statistics are useful to determine the ng value that should be used, based on the minimum fraction of groups (FG%) that are expected to return confidence intervals below CI%. For instance, in the above example, supposing that nmin=45, if FG%=75% then SPQ=45/6000=0.75%, whereas if FG%=100%, SPQ=45/1000=4.5%. The system should be able to collect this information when the query is executed against a summary, as this is the most frequent situation. In that case, if the sampling rate used to query was SPS, this value should be stored together with the statistics, to be used in inferring the probable number of elements (scaling each value by 1/SPS). If a query has not been executed before and it is impossible to estimate the group selectivity, the strategy uses a trial approach. The query is executed against a reasonably
small and fast sampling rate SPtry in a first step (SPtry should be predefined). Either way, response statistics are collected on the number of elements (ngtry) for posterior queries. If the answer from this first try is not sufficiently accurate, a second try uses SP2 = nmin / (ngtry/SPtry), or this value multiplied by a factor for additional guarantees (e.g., 1.1 × nmin / (ngtry/SPtry)). Iterations go on until the accuracy targets are met. If SPtry is too small for a query pattern, the system simply reiterates using the inferral process until the CI% accuracy target is guaranteed. The other parameter that must be determined is nmin. This value is obtained from the confidence interval formulas by solving for the number of samples and considering the relative confidence interval ci. For instance, for the average and count functions:

nmin(avg) = (zp/ci)² × (σ/µ)²
nmin(count) = (zp/ci)²
The unknown ratio (σ/µ) is typically replaced by 50% in the statistics literature for estimation purposes. The minimum number of samples varies between different aggregation functions. If an expression contains all the previous aggregation functions, nmin = min[nmin(AVG), nmin(SUM), nmin(COUNT)].
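A hedged Oracle-style sketch of how the count statistics for ng could be gathered for the brand/month aggregation follows; the percentile function, the 75% choice and the layout are illustrative assumptions. When the statistics come from a run against a summary with rate SPS, the counts are later scaled by 1/SPS.

-- Group-size statistics for the brand/month pattern (run against base data or a summary).
SELECT MIN(cnt)                                          AS min_cnt,
       MAX(cnt)                                          AS max_cnt,
       STDDEV(cnt)                                       AS stdev_cnt,
       PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY cnt) AS ng_75  -- ~75% of groups have at least this many elements
FROM  (SELECT p_brand, TO_CHAR(l_shipdate, 'yyyy-mm') AS year_month, COUNT(*) AS cnt
       FROM   lineitem, part
       WHERE  l_partkey = p_partkey
       GROUP  BY p_brand, TO_CHAR(l_shipdate, 'yyyy-mm')) g;

With ng taken from such statistics and nmin from the formulas above, SPQ = nmin/ng, as in the 0.75% example.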
4 Experimental Analysis and Comparison
This section analyses experimental results on an Intel Pentium III 800 MHz CPU with 256 MB of RAM, running the Oracle 9i DBMS and the TPC-H benchmark with a scale factor (SF) of 5 (5 GB). The Oracle SAMPLE operator was used directly. We have used template aggregation queries over TPC-H (Figure 5), with different time granularities. Query Qa (above) involves joining only two tables, whereas query Qb (below) engages in extensive joining, including the very large ORDERS table. The base fact table that was sampled in OS was LINEITEM.
4.1 Response Time Results
Our objective in this section is to evaluate the response time improvement using OS and MS summaries, in order to have a comparison and a measure of the effectiveness of these approaches. The response time is very dependent on several factors, including the query plan and the amount of memory available for sorting or for hash joining. We ran experiments repeatedly and under exactly the same conditions. We optimized the execution plans for queries Qa (19 mins, 20 mins) and Qb (47 mins, 52 mins) for monthly and yearly aggregations, respectively.

SELECT p_brand, year_month, avg(l_quantity), sum(l_quantity), count(*)
FROM lineitem, part
WHERE l_partkey = p_partkey
GROUP BY to_char(l_shipdate,'yyyy-mm'), p_brand;

SELECT n_name, year_month, avg(l_extendedprice), sum(l_extendedprice), count(*)
FROM lineitem, customer, orders, supplier, nation, region
WHERE <join conditions>
GROUP BY n_name, to_char(l_shipdate,'yyyy-mm');

Fig. 5. Queries Qa (above) and Qb (below) used in the experiments
Fig. 6. % of Resp. Time VS % Sampling Rate for Qa and Qb Using OS (response time as a percentage of the base response time, for monthly and yearly aggregations; the right panel is a 0% to 1% detail)
(Figure 7: response time (%) versus sampling rate (%), comparing OS and MS for Qa with monthly and yearly aggregations.)
Figure 6 displays the query response time using OS, as a percentage of the response time over the base data (y-axis), for each sampling rate (x-axis). Linear speedup (1/SP) is indicated in the picture as a solid line. The right picture is a 0% to 1% detail. The most important observation from the figure is that the speedup is typically much less than linear; for instance, a summary with SP=1% takes about 12% of the DW response time. Other observations: the speedup-to-sampling-rate ratio improves as the sampling rate increases; query Qb (with heavy joining) exhibited a worse ratio than query Qa for SP below 1% (detail) and a better ratio for larger SP. Figure 7 compares on-the-fly summaries (OS) to materialized summaries (MS) in the same setup as the previous experiment (the right picture is a detail for SP