Intelligent Data Analysis 10 (2006) 457–472 IOS Press


Supporting bi-cluster interpretation in 0/1 data by means of local patterns

Ruggero G. Pensa∗, Céline Robardet and Jean-François Boulicaut
INSA Lyon, LIRIS CNRS UMR 5205, F-69621 Villeurbanne cedex, France

Abstract. Clustering and co-clustering techniques have proved useful in many application domains. A weakness of these techniques remains the poor support for grouping characterization. As a result, interpreting clustering results and discovering knowledge from them can be quite hard. We consider potentially large Boolean data sets which record properties of objects and we assume the availability of a bi-partition which has to be characterized by means of a symbolic description. Our generic approach exploits collections of local patterns which satisfy some user-defined constraints in the data, and a measure of the accuracy of a given local pattern as a bi-cluster characterization pattern. We consider local patterns which are bi-sets, i.e., sets of objects associated to sets of properties. Two concrete examples are formal concepts (i.e., associated closed sets) and the so-called δ-bi-sets (i.e., an extension of formal concepts towards fault-tolerance). We introduce the idea of characterizing queries which can be used by experts to support knowledge discovery from bi-partitions thanks to available local patterns. The added-value is illustrated on benchmark data and three real data sets: a medical data set and two gene expression data sets.

Keywords: Co-clustering, characterization, closed sets, fault-tolerant formal concept

1. Introduction

Exploratory data analysis processes are often based on clustering techniques to get insights about global patterns within the data. Clustering has been studied extensively, including for the special case of Boolean data which record properties of objects (see a toy example in Table 1). Its main goal is to identify a partition of objects and/or properties such that an objective function which specifies its quality is optimized (e.g., maximizing intra-cluster similarity and inter-cluster dissimilarity) [16]. Looking for optimal solutions is intractable but heuristic local search optimizations can be performed. As a result, many efficient algorithms which compute good partitions are available and widely used.

In this paper, we assume that clustering results are available and we are interested in knowledge discovery from such results. For example, in our running example r, we could get {{o1, o3, o4}, {o2, o5, o6, o7}} as a partition on objects. Our thesis is that expert users need symbolic descriptions to characterize the computed groups. Indeed, it is well known that using various settings for a given clustering algorithm and/or using different algorithms can provide quite different clustering results. The interpretation phase is then tedious. In fact, many clustering approaches suffer from the lack of an explicit cluster characterization, which has motivated the research on conceptual clustering [12]. Among others, it has been studied in the context of co-clustering (see [19] for a survey), including for the special case of categorical or Boolean data. The goal is to identify bi-clusters or bi-partitions in the data, i.e., a mapping between a partition of objects and

∗ Corresponding author. Tel.: +33 4 72 43 70 24; Fax: +33 4 72 43 87 13; E-mail: [email protected]

1088-467X/06/$17.00 © 2006 – IOS Press and the authors. All rights reserved


R.G. Pensa et al. / Supporting bi-cluster interpretation in 0/1 data by means of local patterns

a partition of properties. For instance, an algorithm like COCLUSTER [11] can compute in r the interesting bi-partition {{{o1, o3, o4}, {p1, p3, p4}}, {{o2, o5, o6, o7}, {p2, p5}}}. The first bi-cluster indicates that the characterization for objects from {o1, o3, o4} is that they almost always share properties {p1, p3, p4}. Also, properties {p2, p5} are characteristic for objects {o2, o5, o6, o7}. Our experience is that this first step towards characterization is not sufficient, especially in high dimensional data sets in which global patterns like bi-partitions do not reflect unexpected but strong local associations between some sets of objects and some sets of properties.

Our proposal is to combine bi-clustering with a characterization phase based on collections of local patterns. We assume that a bi-partition on a Boolean data set is available (e.g., computed using COCLUSTER [11]). Our contribution to bi-partition characterization is as follows. First, we introduce an original and generic cluster characterization technique which is based on constraint-based bi-set mining, i.e., mining bi-sets whose set components satisfy some constraints. We show how to measure that a given bi-set is an accurate characterization pattern for a given bi-cluster. Thanks to such accuracy measures, it is possible to consider characterizing queries which can support knowledge discovery from co-clustering results. The method is illustrated on two kinds of bi-sets, the well-known formal concepts (i.e., associated closed sets [25]) and the so-called δ-bi-sets. This latter pattern type is new and it is based on a previous work about approximate condensed representations for frequent patterns [7]. Intuitively, a formal concept is a maximal rectangle of true values modulo arbitrary permutations of rows and columns.
Following that perspective, a δ-bi-set is a fault-tolerant extension of a formal concept for which a bounded number of exceptions (i.e., 0 values) is accepted per column. The added-value of our characterizing method is illustrated not only on a benchmark data set but also on three real-life data sets. The obtained characterizations are consistent with the available knowledge.

This paper extends the preliminary version [21] by further developments on the motivation and the possible applications of the method, the study of some δ-bi-set properties, and further experiments. Indeed, we added the application of our approach to a gene expression data analysis task for which COCLUSTER provides unstable bi-partitions.

Section 2 formally defines the characterizing framework. Section 3 discusses which type of local pattern can be used. Section 4 is dedicated to our empirical validation of the proposed method. Finally, Section 5 concludes.

2. Bi-cluster characterization using bi-sets

Let us consider a set of objects O = {o1, ..., om} and a set of Boolean properties P = {p1, ..., pn}. The Boolean context to be mined is r ⊆ O × P, where $r_{ij} = 1$ if the property pj is true for object oi. We assume that a co-clustering algorithm, e.g. [11], provides a bijective mapping between K clusters of objects and K clusters of properties forming K bi-clusters {($C_1^o$, $C_1^p$), ..., ($C_K^o$, $C_K^p$)} with $C_k^o$ ⊂ O and $C_k^p$ ⊂ P. A first characterization comes from this mapping. Our goal is to support each bi-cluster interpretation by collections of bi-sets which locally point out interesting associations between groups of objects and groups of properties.

Formally, a bi-set is an element of $2^O \times 2^P$. We assume that a collection of N bi-sets B = {b1, ..., bN} has been extracted from the data. First, we associate each of them to one of the K bi-clusters. Each bi-set characterizes the bi-cluster to which it is associated with some degree of accuracy.

We can now define a similarity measure between a bi-set (T, G) (T ⊆ O, G ⊆ P) and a bi-cluster ($C_k^o$, $C_k^p$) as follows:

\[ \mathrm{sim}\big((T, G), (C_k^o, C_k^p)\big) = \frac{|T \cap C_k^o| \cdot |G \cap C_k^p|}{|T \cup C_k^o| \cdot |G \cup C_k^p|} \]

Table 1
A Boolean context r

      p1  p2  p3  p4  p5
o1     1   0   1   1   0
o2     0   1   0   0   1
o3     1   0   1   1   0
o4     0   0   1   1   0
o5     1   1   0   0   1
o6     0   1   0   0   1
o7     0   0   0   0   1

Intuitively, (T, G) and ($C_k^o$, $C_k^p$) denote rectangles in the matrix (modulo permutations over the rows and the columns) and we measure the area of the intersection of the two rectangles normalized by the area of their union. Each bi-set b which is a candidate characterization pattern can now be assigned to the bi-cluster ($C_k^o$, $C_k^p$) for which sim(b, ($C_k^o$, $C_k^p$)) is maximal. Doing so, we get K groups of potentially characterizing bi-sets.

Example 1. In r from Table 1, a possible bi-partition is

{($C_1^o$, $C_1^p$), ($C_2^o$, $C_2^p$)} = {({o1, o3, o4}, {p1, p3, p4}), ({o2, o5, o6, o7}, {p2, p5})}

If we consider the bi-set b1 = ({o1, o3, o5}, {p1}), its similarity measures w.r.t. ($C_1^o$, $C_1^p$) and ($C_2^o$, $C_2^p$) are:

\[ \mathrm{sim}(b_1, (C_1^o, C_1^p)) = \frac{2 \cdot 1}{3 \cdot 1 + 3 \cdot 3 - 2 \cdot 1} = 0.2 \]
\[ \mathrm{sim}(b_1, (C_2^o, C_2^p)) = \frac{1 \cdot 0}{3 \cdot 1 + 4 \cdot 2 - 1 \cdot 0} = 0 \]

The bi-set b1 is then associated to the first bi-cluster. If we consider now the bi-set b2 = ({o5}, {p1, p2, p5}), we get:

\[ \mathrm{sim}(b_2, (C_1^o, C_1^p)) = \frac{0 \cdot 1}{1 \cdot 3 + 3 \cdot 3 - 0 \cdot 1} = 0 \]
\[ \mathrm{sim}(b_2, (C_2^o, C_2^p)) = \frac{1 \cdot 2}{1 \cdot 3 + 4 \cdot 2 - 1 \cdot 2} = 0.22 \]

This bi-set b2 is thus associated to the second bi-cluster.

Finally, we can use an accuracy measure to select the most relevant bi-sets. For that purpose, we propose to measure the exception ratios for the two set components of the bi-sets, i.e., the proportion of elements of each component that fall outside the associated cluster. Given a bi-set (T, G) and a bi-cluster ($C_k^o$, $C_k^p$), they can be computed as follows:

\[ \varepsilon_o(T, C_k^o) = \frac{|\{o_i \in T \mid o_i \notin C_k^o\}|}{|T|} \qquad \varepsilon_p(G, C_k^p) = \frac{|\{p_i \in G \mid p_i \notin C_k^p\}|}{|G|} \]
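As an illustration, the assignment step of Example 1 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `sim` and `assign` are ours, bi-sets are modelled as pairs of Python sets, and the denominator follows the arithmetic of Example 1, where the intersection area is divided by the area of the union of the two rectangles (|T|·|G| + |C^o|·|C^p| minus the intersection area). Reading the normalization that way is an assumption.

```python
# A bi-set or bi-cluster is modelled as (set of objects, set of properties).

def sim(bi_set, bi_cluster):
    """Intersection area over union area, as in the arithmetic of Example 1."""
    (T, G), (Co, Cp) = bi_set, bi_cluster
    inter = len(T & Co) * len(G & Cp)
    union = len(T) * len(G) + len(Co) * len(Cp) - inter
    return inter / union if union else 0.0

def assign(bi_set, bi_clusters):
    """Index of the bi-cluster with maximal similarity (first one on ties)."""
    return max(range(len(bi_clusters)), key=lambda k: sim(bi_set, bi_clusters[k]))

# Bi-partition and bi-sets of Example 1 (Table 1).
bi_partition = [
    ({"o1", "o3", "o4"}, {"p1", "p3", "p4"}),
    ({"o2", "o5", "o6", "o7"}, {"p2", "p5"}),
]
b1 = ({"o1", "o3", "o5"}, {"p1"})
b2 = ({"o5"}, {"p1", "p2", "p5"})

print(round(sim(b1, bi_partition[0]), 2))  # 0.2
print(round(sim(b2, bi_partition[1]), 2))  # 0.22
print(assign(b1, bi_partition), assign(b2, bi_partition))  # 0 1
```

Indices 0 and 1 stand for the first and second bi-cluster, so the sketch reproduces the associations stated in Example 1.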


Example 2. In our toy example from Table 1, the bi-set b1 = ({o1, o3, o5}, {p1}) contains the object o5 which does not belong to $C_1^o$: we have the exception ratio εo({o1, o3, o5}, $C_1^o$) = 1/3 ≈ 0.33. The bi-set b2 contains property p1 which does not belong to $C_2^p$: we have εp({p1, p2, p5}, $C_2^p$) = 1/3 ≈ 0.33.

It is then possible to consider thresholds to select only the bi-sets that have small exception ratios, i.e., bi-sets whose εo and εp values are each below a user-defined threshold in [0, 1]. There are several possible interpretations for these measures. If we are interested in characterizing a cluster of objects (resp. properties), we can look for all the sets of properties (resp. objects) for which the εo (resp. εp) values of the related bi-sets are less than the corresponding threshold. Alternatively, we can consider the whole bi-cluster and characterize it with all the bi-sets for which the two exception ratios εo and εp are both less than their respective thresholds.
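The selection step can be sketched as follows. This is an illustrative sketch under our own naming: `exception_ratio`, `select` and the threshold parameters `max_eps_o` and `max_eps_p` are hypothetical, and returning 0 for an empty component is a convention chosen here, not something the text specifies.

```python
def exception_ratio(component, cluster):
    """Fraction of the elements of `component` that lie outside `cluster`."""
    if not component:
        return 0.0  # convention for empty components (an assumption)
    return len(component - cluster) / len(component)

def select(bi_sets, bi_cluster, max_eps_o, max_eps_p):
    """Keep the bi-sets whose two exception ratios are below the thresholds."""
    Co, Cp = bi_cluster
    return [(T, G) for (T, G) in bi_sets
            if exception_ratio(T, Co) < max_eps_o
            and exception_ratio(G, Cp) < max_eps_p]

# Data of Example 2 (Table 1).
C1 = ({"o1", "o3", "o4"}, {"p1", "p3", "p4"})
C2 = ({"o2", "o5", "o6", "o7"}, {"p2", "p5"})
b1 = ({"o1", "o3", "o5"}, {"p1"})
b2 = ({"o5"}, {"p1", "p2", "p5"})

eps_o_b1 = exception_ratio(b1[0], C1[0])  # o5 is the only exception: 1/3
eps_p_b2 = exception_ratio(b2[1], C2[1])  # p1 is the only exception: 1/3
kept_for_C1 = select([b1, b2], C1, max_eps_o=0.5, max_eps_p=0.5)
```

With both thresholds set to 0.5, only b1 is kept for the first bi-cluster, since the single object of b2 lies outside that cluster.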

3. Choosing a bi-set type for characterization

We now discuss the type of bi-sets which will be post-processed for bi-cluster characterization. It is clear that bi-clusters are, by construction, interesting characterizing bi-sets but they only support a global interpretation. We are interested in strong associations between sets of objects and sets of properties that can locally explain the global behavior. Clearly, formal concepts are candidates.

3.1. Using formal concepts

Definition 1. (formal concept [25]) If T ⊆ O and G ⊆ P, assume φ(T, r) = {p ∈ P | ∀o ∈ T, (o, p) ∈ r} and ψ(G, r) = {o ∈ O | ∀p ∈ G, (o, p) ∈ r}. A bi-set (T, G) is a formal concept in r when T = ψ(G, r) and G = φ(T, r).

By construction, G and T are closed sets, i.e., G = φ ∘ ψ(G, r) and T = ψ ∘ φ(T, r). Formal concepts are maximal associations of sets of objects and sets of properties: if one adds a property (resp. an object), one must remove at least one object (resp. one property) to keep only true values in the encoded Boolean relation.

Example 3. Eight formal concepts hold in r from Table 1. ({o1, o3}, {p1, p3, p4}), ({o1, o3, o4}, {p3, p4}), and ({o2, o5, o6}, {p2, p5}) are among them.

Efficient algorithms have been developed to extract complete collections of formal concepts which also satisfy user-defined constraints (e.g., minimal size constraint on set components) [3,24]. Indeed, the popular frequent closed set mining task for a frequency threshold ν fundamentally computes each formal concept (T, G) such that |T| ≥ ν. A major problem with formal concepts is that the Galois connection (φ, ψ) is, in some sense, too strong: we have to capture every maximal set of objects and its maximal set of associated properties. As a result, it is common to get several millions of formal concepts even from rather small matrices. A solution is to look for "dense" rectangles in the matrix, i.e., bi-sets with mainly true values but also a bounded (and small) number of false values or exceptions. Some approaches for dense bi-set mining have been recently discussed (see, e.g., [4] for a starting point). We now propose a new type of bi-set which can be efficiently computed and which is an extension of formal concepts towards fault-tolerance.
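To make Definition 1 concrete, the following Python sketch enumerates the formal concepts of Table 1 by brute force: it closes every subset of properties with ψ and then φ. It is only meant for toy contexts (it visits all 2^|P| property sets) and is unrelated to the efficient miners cited above; the names `psi`, `phi` and `formal_concepts` are ours.

```python
from itertools import combinations

# Table 1: each object is mapped to the set of properties that are true for it.
R = {
    "o1": {"p1", "p3", "p4"}, "o2": {"p2", "p5"}, "o3": {"p1", "p3", "p4"},
    "o4": {"p3", "p4"}, "o5": {"p1", "p2", "p5"}, "o6": {"p2", "p5"},
    "o7": {"p5"},
}
P = {"p1", "p2", "p3", "p4", "p5"}

def psi(G):
    """Objects for which every property of G is true."""
    return frozenset(o for o, props in R.items() if set(G) <= props)

def phi(T):
    """Properties that are true for every object of T."""
    return frozenset(p for p in P if all(p in R[o] for o in T))

def formal_concepts():
    """All pairs (T, G) with T = psi(G) and G = phi(T)."""
    concepts = set()
    for size in range(len(P) + 1):
        for G in combinations(sorted(P), size):
            T = psi(G)
            concepts.add((T, phi(T)))  # phi(psi(G)) is the closure of G
    return concepts

concepts = formal_concepts()
print(len(concepts))  # 8
```

The count includes the two extreme concepts (O, ∅) and (∅, P), which is how the sketch reaches the eight concepts of Example 3.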


3.2. Mining δ-bi-sets

We want to compute efficiently smaller collections of bi-sets which still capture strong associations. We recall some definitions about the association rule mining task [1] since it is used for both the definition of the δ-bi-set pattern type and for bi-cluster characterization.

Definition 2. (association rule) An association rule in r is an expression of the form X ⇒ Y, where X, Y ⊆ P, Y ≠ ∅ and X ∩ Y = ∅. Its absolute frequency is |ψ(X ∪ Y, r)| and its confidence is |ψ(X ∪ Y, r)|/|ψ(X, r)|.

In an association rule X ⇒ Y with high confidence, the properties in Y are almost always true for an object when the properties in X are true. Intuitively, X ∪ Y associated to ψ(X, r) is then a dense bi-set: it contains few false values. We now consider our technique for computing association rules with high confidence, the so-called δ-strong rules [6,7].

Definition 3. (δ-strong rule) Given an integer δ, a δ-strong rule in r is an association rule X ⇒ Y (X, Y ⊂ P) s.t. |ψ(X, r)| − |ψ(X ∪ Y, r)| ≤ δ, i.e., the rule is violated in no more than δ objects.

Interesting collections of δ-strong rules with minimal left-hand side can be computed efficiently from the so-called δ-free-sets [6,7,10] and their δ-closures.

Definition 4. (δ-free set, δ-closure) Let δ be an integer and X ⊂ P. X is a δ-free-set in r iff there is no δ-strong rule which holds between two of its own proper subsets. The δ-closure of X in r, hδ(X, r), is the maximal superset Y of X s.t. ∀p ∈ Y \ X, |ψ(X ∪ {p}, r)| ≥ |ψ(X, r)| − δ.

In other terms, the frequency of the δ-closure of X in r is almost the same as the frequency of X when δ << |O| and X is frequent. Moreover, ∀p ∈ hδ(X, r) \ X, X ⇒ p is a δ-strong rule.

Example 4. In the data from Table 1, the 1-free itemsets are {p1}, {p2}, {p3}, {p4}, {p5}, {p1, p2}, and {p1, p5}. For example, the 1-closure of {p1} adds {p3, p4} to it, giving {p1, p3, p4}. The association rules {p1} ⇒ {p3} and {p1} ⇒ {p4} have only one exception.
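Definition 4 can be checked on Table 1 with a brute-force Python sketch. It is not AC-MINER: it tests δ-freeness through the support criterion used later in Proof 1 (every proper subset Y of X must satisfy |ψ(Y, r)| − |ψ(X, r)| > δ), and it only keeps sets supported by at least one object. The `min_support` parameter and all function names are our assumptions.

```python
from itertools import combinations

# Table 1: each object is mapped to the set of properties that are true for it.
R = {
    "o1": {"p1", "p3", "p4"}, "o2": {"p2", "p5"}, "o3": {"p1", "p3", "p4"},
    "o4": {"p3", "p4"}, "o5": {"p1", "p2", "p5"}, "o6": {"p2", "p5"},
    "o7": {"p5"},
}
P = ["p1", "p2", "p3", "p4", "p5"]

def support(X):
    """Number of objects for which every property of X is true."""
    return sum(1 for props in R.values() if set(X) <= props)

def is_delta_free(X, delta):
    """True when every proper subset of X loses more than delta objects."""
    sx = support(X)
    return all(support(Y) - sx > delta
               for k in range(len(X))
               for Y in combinations(sorted(X), k))

def delta_closure(X, delta):
    """X plus every property whose addition removes at most delta objects."""
    X = set(X)
    sx = support(X)
    return X | {p for p in P if support(X | {p}) >= sx - delta}

def delta_free_sets(delta, min_support=1):
    """Brute-force enumeration; only suitable for toy contexts."""
    return [set(X) for k in range(len(P) + 1) for X in combinations(P, k)
            if support(X) >= min_support and is_delta_free(X, delta)]

free_1 = delta_free_sets(1)
print(sorted(map(sorted, free_1)))
print(sorted(delta_closure({"p1"}, 1)))  # ['p1', 'p3', 'p4']
```

Besides the seven sets listed in Example 4, the enumeration also returns the empty set, which trivially satisfies the criterion.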
δ-freeness is an anti-monotonic property, so that it is possible to compute δ-free sets (possibly combined with a minimal frequency constraint) in very large data sets. Notice that h0 ≡ φ ∘ ψ, i.e., the classical closure operator. Looking for a 0-free-set, say X, and its 0-closure, say Y, provides the closed set X ∪ Y and thus the formal concept (ψ(X ∪ Y, r), X ∪ Y).

Definition 5. (δ-bi-set) A δ-γ-bi-set (T, G) in r is built on each δ-free-set X ⊂ P with T = ψ(X, r) and G = hγ(X, r). When δ = γ we call them δ-bi-sets.

Example 5. In the data from Table 1, the 1-bi-sets derived from the 1-free-sets {p3} and {p5} are ({o1, o3, o4}, {p1, p3, p4}) and ({o2, o5, o6, o7}, {p2, p5}).

When δ << |T|, δ-bi-sets are dense bi-sets with a small number of exceptions per column. In order to experiment, we implemented a straightforward extension of AC-MINER [7] which provides the supporting set for each extracted δ-free-set.

Let us now discuss some properties of δ-bi-sets. It is clear that 0-bi-sets are formal concepts. However, some important properties of formal concepts do not hold for δ-bi-sets when δ > 0. In particular we lack a function which associates the set G to the set T and vice-versa. As a result, we do not have a Galois connection anymore. For example, in Table 1, ({o2, o5, o6, o7}, {p2, p5}) (the δ-bi-set generated by the δ-free set {p5}), and ({o2, o5, o6}, {p2, p5}) (the δ-bi-set generated by the δ-free set {p2}) have the same property set, while the first set of objects


Table 2
Boolean context r

      p1  p2  p3  p4
o1     0   1   1   1
o2     0   1   1   0
o3     1   0   0   0
o4     1   1   1   0
o5     1   1   1   1
o6     1   0   1   0
o7     0   0   1   0

Table 3
0-free sets, 1-closures and supporting sets of objects in r

X           h1(X, r)            ψ(X, r)
∅           {p3}                {o1, o2, o3, o4, o5, o6, o7}
{p1}        {p1, p3}            {o3, o4, o5, o6}
{p2}        {p2, p3}            {o1, o2, o4, o5}
{p3}        {p3}                {o1, o2, o4, o5, o6, o7}
{p4}        {p1, p2, p3, p4}    {o1, o5}
{p1, p2}    {p1, p2, p3, p4}    {o4, o5}
{p1, p3}    {p1, p2, p3}        {o4, o5, o6}
{p1, p4}    {p1, p2, p3, p4}    {o5}

Table 4
1-free sets, 1-closures and supporting sets of objects in r

X           h1(X, r)            ψ(X, r)
∅           {p3}                {o1, o2, o3, o4, o5, o6, o7}
{p1}        {p1, p3}            {o3, o4, o5, o6}
{p2}        {p2, p3}            {o1, o2, o4, o5}
{p4}        {p1, p2, p3, p4}    {o1, o5}
{p1, p2}    {p1, p2, p3, p4}    {o4, o5}

includes the second one. Among others, it makes the interpretation process (in terms of characterization) less natural.

We now consider how the parameters δ and γ influence the properties of the δ-γ-bi-set collection.

Property 1. Given a Boolean context r and two positive integers µ and δ such that µ < δ, let us denote by Freeδ(r) the collection of the δ-free sets on r and by Freeµ(r) the collection of the µ-free sets on r. We have: Freeδ(r) ⊆ Freeµ(r).

Proof 1. X is a δ-free-set iff ∀Y ⊂ X, |ψ(Y, r)| − |ψ(X, r)| > δ. Since µ < δ, we also have |ψ(Y, r)| − |ψ(X, r)| > µ and X is also a µ-free-set.

As a consequence, any collection of δ-free sets (δ > 0) is included in the collection of 0-free sets.

Example 6. Let us consider the Boolean data set given in Table 2. The set of properties A = {p1, p3} is 0-free (see Table 3), but not 1-free (see Table 4). Its 1-closure is {p1, p2, p3}. The corresponding bi-set is bA = ({o4, o5, o6}, {p1, p2, p3}) with one exception on p2. Since A is not in the collection of 1-free sets, this bi-set cannot be built using A, and there is neither another 1-free set which can generate bA nor any other bi-set covering bA (see Table 4).

Property 2. Given a Boolean context r and two positive integers ρ and γ such that ρ ≤ γ. Given a set X ⊆ P we have: hρ(X) ⊆ hγ(X).
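Example 6 can be replayed with a small brute-force Python sketch that builds δ-γ-bi-sets on the context of Table 2. As before, this is an illustration under our own naming (`psi`, `is_free`, `closure`, `bi_sets`), it uses the support criterion of Proof 1 to test freeness, and it assumes that only sets supported by at least one object are considered.

```python
from itertools import combinations

# Table 2: each object is mapped to the set of properties that are true for it.
TABLE2 = {
    "o1": {"p2", "p3", "p4"}, "o2": {"p2", "p3"}, "o3": {"p1"},
    "o4": {"p1", "p2", "p3"}, "o5": {"p1", "p2", "p3", "p4"},
    "o6": {"p1", "p3"}, "o7": {"p3"},
}
PROPS = ["p1", "p2", "p3", "p4"]

def psi(X):
    """Supporting set of objects of the property set X."""
    return frozenset(o for o, props in TABLE2.items() if set(X) <= props)

def is_free(X, delta):
    """Proof 1 criterion: each proper subset loses more than delta objects."""
    return all(len(psi(Y)) - len(psi(X)) > delta
               for k in range(len(X)) for Y in combinations(X, k))

def closure(X, gamma):
    """gamma-closure of X: properties that remove at most gamma objects."""
    return frozenset(p for p in PROPS
                     if len(psi(set(X) | {p})) >= len(psi(X)) - gamma)

def bi_sets(delta, gamma, min_support=1):
    """(psi(X), h_gamma(X)) for every delta-free set X with enough support."""
    return [(psi(X), closure(X, gamma))
            for k in range(len(PROPS) + 1) for X in combinations(PROPS, k)
            if len(psi(X)) >= min_support and is_free(X, delta)]

b_A = (frozenset({"o4", "o5", "o6"}), frozenset({"p1", "p2", "p3"}))
print(b_A in bi_sets(0, 1))  # True: generated by the 0-free set {p1, p3}
print(b_A in bi_sets(1, 1))  # False: {p1, p3} is not 1-free
```

The two calls return eight and five bi-sets respectively, one per row of Tables 3 and 4.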


Table 5
A Boolean context r1

      p1  p2  p3  p4  p5  co1  co2
o1     1   0   1   1   0    1    0
o2     0   1   0   0   1    0    1
o3     1   0   1   1   0    1    0
o4     0   0   1   1   0    1    0
o5     1   1   0   0   1    0    1
o6     0   1   0   0   1    0    1
o7     0   0   0   0   1    0    1
cp1    1   0   1   1   0    −    −
cp2    0   1   0   0   1    −    −

When the parameter γ increases, the size of the property component of the δ-γ-bi-set increases too.

Property 3. Given a δ-free set X, ∀Y ⊂ X, X ⊈ hδ(Y, r), i.e., X is not included in the δ-closure of any of its own proper subsets.

Proof 2. If Y ⊂ X and X ⊆ hδ(Y, r), then there exists Z ⊂ X, Z ∩ Y = ∅, s.t. Y ⇒ Z is a δ-strong rule, i.e., there exists a δ-strong rule which holds between two proper subsets of X (Y and Z), but this contradicts that X is a δ-free set.

As a consequence, when γ > δ, a set X can belong to the γ-closure of one of its subsets, which is not the case when γ ≤ δ.

Example 7. In the data from Table 1, we have eight 1-free sets: ∅, {p1}, {p2}, {p3}, {p4}, {p5}, {p1, p2}, and {p1, p5}. The collection of 0-free sets contains two more sets, {p1, p3} and {p1, p4}, which are contained in the 1-closure of {p1} (i.e., {p1, p3, p4}, which is also the 1-closure of {p3} and {p4}). The supporting set of objects for both {p1, p3} and {p1, p4} is {o1, o3}, and it is a subset of the supporting set of objects for {p1} (i.e., {o1, o3, o5}) and of the supporting set of objects for {p3} and {p4} as well. Indeed, the two corresponding bi-sets are already included in larger bi-sets obtained from 1-free sets.

We have considered several settings for computing δ-bi-sets. As the γ-closure of a 0-free set X is equal to the γ-closure of h0(X, r), by computing 0-δ-bi-sets we get either formal concepts (0-closure of a 0-free set) or their extension towards fault-tolerance (number of exceptions bounded per column). Computing formal concepts by extracting the free sets and their closures may become intractable in some data sets, while δ-free set mining for δ > 0 remains quite feasible at the price of missing some associations. On the other hand, using a value of δ greater than γ may result in a further loss of information, even if the size of the collection of produced bi-sets could be reduced.
Our conclusion on the choice of these parameters is that using the same δ value for computing the free sets and their closures is a good trade-off to preserve information and to reduce both the search space and the size of the extracted bi-set collection.

3.3. Formal concepts vs. δ-bi-sets

To study the relevancy of δ-bi-sets w.r.t. formal concepts, we have considered the addition of noise to a synthetic data set. Hereafter, r denotes a reference data set from which we generate noisy data sets by adding a given quantity of uniform random noise. Then, we compare the collection of formal concepts which are "built-in" within r with various collections of formal concepts and δ-bi-sets extracted from the noisy matrices. To measure the relevancy of each extracted collection w.r.t. the reference one, we look for subsets of the reference collection in each of them. Since both set components of each formal


concept can be changed when adding noise, we identify those having the largest area in common with the reference ones, and we compute the σ measure which takes into account the common area:

\[ \sigma(C_r, C_e) = \frac{\rho(C_r, C_e) + \rho(C_e, C_r)}{2} \]

ρ is defined as follows:

\[ \rho(C_1, C_2) = \frac{1}{|C_1|} \sum_{(X_i, Y_i) \in C_1} \; \max_{(X_j, Y_j) \in C_2} \frac{|X_i \cap X_j| \cdot |Y_i \cap Y_j|}{|X_i \cup X_j| \cdot |Y_i \cup Y_j|} \]

$C_r$ is the collection of formal concepts computed on the reference data set, and $C_e$ is a collection of patterns extracted from a noisy data set. When σ($C_r$, $C_e$) = 1, all the bi-sets of $C_r$ have identical instances in $C_e$.

In the experiment, r has 30 objects and 15 properties and it contains 3 formal concepts of the same size which are pairwise disjoint: the built-in formal concepts are ({o1, ..., o10}, {p1, ..., p5}), ({o11, ..., o20}, {p6, ..., p10}), and ({o21, ..., o30}, {p11, ..., p15}). We generated 40 different data sets by adding to r increasing quantities of noise (from 1% to 40% of the matrix). Then, for each data set, we have extracted a collection of formal concepts and different collections of δ-bi-sets with increasing values of δ (from 1 to 6). Finally, we looked for the occurrence of the 3 formal concepts in each of these extracted collections by using our σ measure.

Results are in Fig. 1. The σ measure decreases when the noise level increases. Interestingly, its values for δ-bi-set collections are always greater than or similar to the values for the collections of formal concepts. The collections of δ-bi-sets always contain fewer patterns than the collections of formal concepts (for a noise level greater than 7%). For δ = 2, the size is halved. For greater values of δ, noise does not influence the size of the collections of δ-bi-sets. This experiment confirms that δ-bi-sets are more robust to noise than formal concepts. Furthermore, they significantly reduce the size of the extracted collections, which is important to support the interpretation process.
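The σ measure is straightforward to compute once the collections are available. The Python sketch below follows the two formulas above literally (product of intersections over product of unions); the function names and the small perturbed collection `Ce` are ours and serve only as a hypothetical example, not as data from the experiment.

```python
def overlap(b1, b2):
    """Common area of two bi-sets, normalized as in the definition of rho."""
    (X1, Y1), (X2, Y2) = b1, b2
    den = len(X1 | X2) * len(Y1 | Y2)
    return len(X1 & X2) * len(Y1 & Y2) / den if den else 0.0

def rho(C1, C2):
    """Average, over C1, of the best overlap found in C2."""
    if not C1 or not C2:
        return 0.0
    return sum(max(overlap(b, c) for c in C2) for b in C1) / len(C1)

def sigma(Cr, Ce):
    """Symmetric agreement between a reference and an extracted collection."""
    return (rho(Cr, Ce) + rho(Ce, Cr)) / 2

def block(first_obj, first_prop):
    """One built-in 10 x 5 formal concept of the reference data set."""
    objs = {f"o{i}" for i in range(first_obj, first_obj + 10)}
    props = {f"p{j}" for j in range(first_prop, first_prop + 5)}
    return objs, props

Cr = [block(1, 1), block(11, 6), block(21, 11)]

# Hypothetical extracted collection: the first concept has lost object o10.
Ce = [(Cr[0][0] - {"o10"}, Cr[0][1]), Cr[1], Cr[2]]

print(sigma(Cr, Cr))            # 1.0
print(round(sigma(Cr, Ce), 4))  # 0.9667
```

Losing one of the ten objects of a single concept lowers its overlap to 0.9, hence the average of (0.9 + 1 + 1)/3 in both directions.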

Fig. 1. Size of different collections of bi-sets (top) and related values of σ (bottom) depending on noise level.

3.4. Using association rules

Association rules can be derived from extracted bi-sets and used for bi-cluster characterization. For characterization but also classification, heuristics have been studied which select relevant association rules based on their frequency and confidence values [10,17,18,23]. In our case, we propose to use exception ratios on the extracted bi-sets to provide characterization rules. They have the form X ⇒ k where X is a set of properties (resp. objects) and k is a property denoting a cluster of objects (resp. an object denoting a cluster of properties). When considering formal concepts, deriving characterization rules from them is straightforward.

Property 4. Given a bi-cluster ($C_k^o$, $C_k^p$), if (T, G) is a formal concept, then G ⇒ k (resp. T ⇒ k) is a rule with frequency equal to |T| · (1 − εo(T, $C_k^o$)) (resp. |G| · (1 − εp(G, $C_k^p$))) and confidence equal to 1 − εo(T, $C_k^o$) (resp. 1 − εp(G, $C_k^p$)).

Example 8. Consider the toy example from Table 1 with two new columns (resp. rows) to denote the values of the object cluster variable $v^o$ ∈ {co1, co2} (resp. the property cluster variable $v^p$ ∈ {cp1, cp2}). For each object belonging to $C_1^o$ (resp. $C_2^o$), we have co1 = 1 and co2 = 0 (resp. co1 = 0 and co2 = 1). We obtain the Boolean data in Table 5. Bi-set b1 = (T1, G1) = ({o1, o3, o5}, {p1}) is a formal concept that can be used to form the association rule p1 ⇒ co1. Its relative frequency is |T1| · (1 − εo(T1, $C_1^o$))/|O| = 3 · (1 − 1/3)/7 = 29%, its confidence is 1 − εo(T1, $C_1^o$) = 1 − 1/3 = 67%. The formal concept b2 = (T2, G2) = ({o5}, {p1, p2, p5}) forms the association rule o5 ⇒ cp2. Its relative frequency is |G2| · (1 − εp(G2, $C_2^p$))/|P| = 3 · (1 − 1/3)/5 = 40%, its confidence is 1 − εp(G2, $C_2^p$) = 1 − 1/3 = 67%.

When we use δ-bi-sets instead of formal concepts, Property 4 does not hold because |ψ(G, r)| < |T|. However, if we are interested in characterizing a cluster of objects, we can use the following property.

Property 5. Given a cluster $C_k^o$, if (T, G) is a δ-bi-set, and X ⊆ G is a δ-free-set, then X ⇒ k is a rule with frequency equal to |T| · (1 − εo(T, $C_k^o$)) and confidence equal to 1 − εo(T, $C_k^o$).

Example 9. Consider our toy example in Table 1, and its extension (Table 5). Bi-set bδ = (Tδ, Gδ) = ({o1, o3, o4}, {p1, p3, p4}) is a 1-bi-set generated by the 1-free set Xδ = {p3}. It is associated with ($C_1^o$, $C_1^p$) since it coincides with this bi-cluster, i.e., sim(bδ, ($C_1^o$, $C_1^p$)) = 1, and εo(Tδ, $C_1^o$) = 0. Indeed, Xδ forms an association rule p3 ⇒ co1 with relative frequency |Tδ| · (1 − 0)/|O| = 43% and confidence 100%.

Such rules are interesting in practice because X is often a rather small set, which makes its interpretation easier. However, this approach cannot be applied to data sets with large numbers of properties (e.g., gene expression data sets where we can have thousands of properties). In such cases, we propose to use the εo and εp measures. Notice however that a recent work studies δ-free set mining for very large numbers of properties [15].
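The frequency and confidence values of Examples 8 and 9 can be recomputed with the short Python sketch below. It only covers rules that conclude on a cluster of objects; the function name `rule_quality` and its signature are hypothetical.

```python
def rule_quality(T, cluster_objects, n_objects):
    """Relative frequency and confidence of a rule X => k derived from the
    object component T of a bi-set (T, G), for the object cluster k."""
    eps_o = len(T - cluster_objects) / len(T)       # exception ratio on objects
    rel_frequency = len(T) * (1 - eps_o) / n_objects
    confidence = 1 - eps_o
    return rel_frequency, confidence

C1_objects = {"o1", "o3", "o4"}  # first cluster of objects of Table 1
N_OBJECTS = 7

# Example 8: formal concept ({o1, o3, o5}, {p1}), rule p1 => co1.
freq_p1, conf_p1 = rule_quality({"o1", "o3", "o5"}, C1_objects, N_OBJECTS)

# Example 9: 1-bi-set ({o1, o3, o4}, {p1, p3, p4}), rule p3 => co1.
freq_p3, conf_p3 = rule_quality({"o1", "o3", "o4"}, C1_objects, N_OBJECTS)

print(f"{freq_p1:.0%} {conf_p1:.0%}")  # 29% 67%
print(f"{freq_p3:.0%} {conf_p3:.0%}")  # 43% 100%
```

The same function applies to rules concluding on a cluster of properties by passing G, the property cluster and |P| instead.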

4. Experimental validation

4.1. Mining a benchmark data set

First, we applied our characterization method to the well-known benchmark voting-records [5]. It contains 435 objects and 48 Boolean attributes (removing class variables). We used COCLUSTER [11] to get 2 bi-clusters:

bi-cluster     |τ|   rep.   dem.   |γ|
bi-cluster1    193    153     40    16
bi-cluster2    242     15    227    32
total          435    168    267    48

To characterize each bi-cluster, we used D-MINER [3] to extract all formal concepts, and our slight extension of AC-MINER to extract two collections of δ-bi-sets (δ = 1, 2). We obtained 227 031 formal concepts, 130 313 1-bi-sets and 66 908 2-bi-sets. The collections have been post-processed by looking for rules with increasing values of the relative minimal frequency (15% up to 40%) and confidence (90% up to 100%). Results for the first bi-cluster are in Fig. 2. Results for the second one look similar. The number of characterizing rules decreases when we increase the frequency and confidence thresholds. When we use δ-bi-sets, we have to process significantly smaller collections.

Two examples of characterizing rules which are consistent with the domain knowledge associated to voting-records are now given. The first one (resp. the second) has a 42% relative frequency (resp. 31%) and both have a 100% confidence, i.e., we have εo = 0.

el-salvador-aid = yes ∧ anti-satellite-test-ban = yes ∧ aid-to-nicaraguan-contras = yes ⇒ bi-cluster2
handicapped-infants = no ∧ physician-fee-freeze = yes ∧ el-salvador-aid = yes ⇒ bi-cluster1


Fig. 2. Characterizing patterns for bi-cluster1 in voting-records w.r.t. different values of minimal frequency and confidence.

4.2. Mining a medical data set

We applied the method to the real world medical data set meningitis already used in [23]. It has been gathered from children hospitalized for acute meningitis. The pre-processed Boolean data set is composed of 329 examples described by 60 Boolean attributes encoding clinical signs (hemodynamic troubles, consciousness troubles, ...), cytochemical analysis of the cerebrospinal fluid (C.S.F. proteins, C.S.F. glucose, ...), and blood analysis (sedimentation rate, white blood cell count, ...). In meningitis, the majority of the cases are known to be viral infections whereas about one quarter are known to be caused by bacteria. Furthermore, medical knowledge is available which can be used to assess characterization relevancy. Using COCLUSTER, we got two bi-clusters:

bi-cluster     |τ|   bact.   vir.   |γ|
bi-cluster1    100      81     19    21
bi-cluster2    229       3    226    39
total          329      84    245    60

The first bi-cluster contains a majority of bacterial cases while the second one contains almost only viral cases. We selected characterization rules based on a collection of formal concepts and 2 collections of δ-bi-sets (δ = 1, 2). We obtained the results in Fig. 4. Here again, using δ-bi-sets leads to smaller collections of candidate characterization patterns. The number of characterization rules for the first bi-cluster is always very low and it does not significantly change when using δ-bi-sets instead of formal concepts. If we select the rules with a minimal body, a 10% frequency threshold, a 98% confidence threshold, and for which the property exception ratio εp is zero, we obtain only 9 rules which are consistent with the medical knowledge (see [23] for details). Examples of rules are:


presence of bacteria in C.S.F. analysis = yes ⇒ bi-cluster1
polynuclear percent > 80 ∧ C.S.F. proteins > 0.8 ⇒ bi-cluster1
C.S.F. proteins > 0.8 ∧ C.S.F. glucose < 1.5 ⇒ bi-cluster1

4.3. Mining Boolean gene expression data

Another experiment concerns the analysis of plasmodium, a public gene expression data set concerning Plasmodium falciparum (i.e., a causative agent of human malaria) described in [9]. It records the expression profile of 3 719 genes in 46 biological samples. Each sample corresponds to a time point of the developmental cycle, which is divided into 3 phases: the ring, the trophozoite and the schizont stages. The numerical expression data have been preprocessed by using one of the property encoding methods described in [22]. We used COCLUSTER to get the following bi-clusters:

bi-cluster     |τ|   ring   troph.   schiz.    |γ|
bi-cluster1     20     15        5        0    558
bi-cluster2     16      0        5       11   1699
bi-cluster3     10      6        0        4   1462
total           46     21       10       15   3719

We extracted collections of bi-sets to characterize clusters of samples by means of sets of genes. Here, the number of properties was too large, so we extracted the δ-bi-sets on the transposed matrix. As a consequence, the frequency and confidence measures cannot be used, since they are computed on samples while we are looking for patterns on genes. Therefore, to evaluate a bi-set (T, G), we considered |T|, |G|, εo, and εp. Results for a minimal size from 10% up to 25% of |O| and for maximal values of εo from 0% up to 10% are in Fig. 3. Considering bi-cluster1, we analyzed the characterizing 2-bi-sets when the minimal size for their sets of objects was 25% of |O| and the maximal exception ratio was εo = 0. Among the 442 bi-sets characterizing bi-cluster1, only 4 concern genes that belong to the same bi-cluster. In each of them, we found at least one gene belonging to the cytoplasmic translation machinery group, which is known to be active in the ring stage (see [9] for details), i.e., the main developmental phase corresponding to bi-cluster1.

4.4. Characterization of unstable bi-partitions

In some applications, clustering results are quite ambiguous. Algorithms generally return locally optimal solutions for the considered objective function. Usually, such local optima are close to the global one, and the bi-partitions computed over many randomly initialized executions of the algorithm are quite similar. However, in some cases, the local optima may give rise to very different bi-partitions. How does our technique behave in this particular situation? How does the characterization change between two different bi-partitions? Does bi-partition quality influence the relevancy of the characterizing patterns? To answer these questions, we analyzed the data set described in [2]. It concerns the expression profiles of 3 433 genes during 10 time points of the adult drosophila melanogaster life cycle. The expression levels are measured for both males and females, i.e., the data involve 20 biological situations. We again applied a discretization method from [22] for gene expression property encoding. We then executed 100 randomly initialized instances of COCLUSTER (to find 2 bi-clusters), and compared the results by considering both the Goodman-Kruskal's τ coefficient [14] and the loss in mutual information [11]. Notice that the latter is the objective function which COCLUSTER minimizes. Both coefficients are evaluated in


Fig. 3. Characterizing bi-sets for bi-cluster1 in plasmodium w.r.t. different values of minimal size and maximal exception ratio.

a contingency table p. Let $p_{ij}$ be the frequency of relations between an object of a cluster $C_i^o$ and a property of a cluster $C_j^p$, and let $p_{i.} = \sum_j p_{ij}$ and $p_{.j} = \sum_i p_{ij}$. The Goodman-Kruskal's τ coefficient, which evaluates the proportional reduction in error given by the knowledge of $C^o$ on the prediction of $C^p$ and vice versa, is defined as follows:

$$\tau = \frac{\frac{1}{2} \sum_i \sum_j \left(p_{ij} - p_{i.}\, p_{.j}\right)^2 \dfrac{p_{i.} + p_{.j}}{p_{i.}\, p_{.j}}}{1 - \frac{1}{2} \sum_i p_{i.}^2 - \frac{1}{2} \sum_j p_{.j}^2}$$

The mutual information, which computes the amount of information $C^o$ contains about $C^p$, is:

$$I(C^o; C^p) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{p_{i.}\, p_{.j}}$$

Then, given two different bi-partitions $(C^o, C^p)$ and $(\hat{C}^o, \hat{C}^p)$, the loss in mutual information is given by:

$$I(C^o; C^p) - I(\hat{C}^o; \hat{C}^p)$$
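As an illustration, the two measures defined above can be computed directly from a contingency table. The following is a minimal sketch rather than the implementation used in the experiments: the function names are hypothetical, the logarithm is assumed to be in base 2, and cells with a zero marginal or a zero frequency are assumed to contribute nothing to the sums.

```python
import math


def from_counts(counts):
    """Turn a table of co-occurrence counts into relative frequencies p_ij."""
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]


def marginals(p):
    """Return the row marginals p_i. and the column marginals p_.j."""
    rows = [sum(r) for r in p]
    cols = [sum(c) for c in zip(*p)]
    return rows, cols


def goodman_kruskal_tau(p):
    """Symmetric Goodman-Kruskal tau of a contingency table of frequencies."""
    rows, cols = marginals(p)
    num = 0.0
    for i, r in enumerate(p):
        for j, pij in enumerate(r):
            # Assumption: cells with an empty marginal are skipped.
            if rows[i] > 0 and cols[j] > 0:
                diff = pij - rows[i] * cols[j]
                num += diff * diff * (rows[i] + cols[j]) / (rows[i] * cols[j])
    num *= 0.5
    den = 1.0 - 0.5 * sum(x * x for x in rows) - 0.5 * sum(x * x for x in cols)
    return num / den


def mutual_information(p):
    """Mutual information I(C^o; C^p), in bits (base 2 is an assumption)."""
    rows, cols = marginals(p)
    return sum(
        pij * math.log2(pij / (rows[i] * cols[j]))
        for i, r in enumerate(p)
        for j, pij in enumerate(r)
        if pij > 0  # 0 * log(0) is taken to be 0
    )


def mi_loss(p, p_hat):
    """Loss in mutual information between two contingency tables."""
    return mutual_information(p) - mutual_information(p_hat)


# Two extreme 2x2 tables: a perfectly aligned bi-partition and an independent one.
aligned = from_counts([[2, 0], [0, 2]])
independent = from_counts([[1, 1], [1, 1]])
```

On the perfectly aligned table both measures reach their maximum for two balanced clusters (τ = 1 and one bit of mutual information), while on the independent table both are zero.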

When computing these coefficients on the 100 bi-partitions returned by COCLUSTER, we found that the results were markedly unstable (see Table 6). It seems that there are two optimum points for which the two measures are distant. For 56 runs, we got a high τ coefficient (the mean is about 0.5605), while for the other 44 runs the τ coefficient was noticeably smaller (about 0.1156). If we consider each group of results separately, the standard deviation is much smaller. It means that these two results are two local optima for the COCLUSTER heuristic.

Table 6
Clustering results on adult drosophila individuals

bi-partition        instances   τ mean   τ std.dev   I − Î mean   I − Î std.dev
males vs. females   56          0.5605   0.0381      1.6615       0.0390
mixed               44          0.1156   0.0166      2.0256       0.0258
overall             100         0.3648   0.2240      1.8217       0.1847
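As a quick consistency check on Table 6, the overall means can be recovered as averages of the two group means weighted by the number of instances. This worked computation uses only the figures reported in the table:

$$\bar{\tau}_{\text{overall}} = \frac{56 \times 0.5605 + 44 \times 0.1156}{100} = \frac{31.3880 + 5.0864}{100} \approx 0.3647$$

$$\overline{(I - \hat{I})}_{\text{overall}} = \frac{56 \times 1.6615 + 44 \times 2.0256}{100} = \frac{93.0440 + 89.1264}{100} \approx 1.8217$$

Both values agree with the overall row, up to the rounding of the group means. The much larger overall standard deviation of τ (0.2240 against 0.0381 and 0.0166) is what one would expect when two well-separated groups of runs are pooled.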

Fig. 4. Characterizing patterns for the bi-cluster2 in meningitis w.r.t. different values of minimal frequency and confidence.

From a semantic point of view, the first group of solutions reflects the separation between male and female individuals, while in the second group each cluster contains both male and female individuals. Indeed, it seems that the first co-clustering is more relevant w.r.t. the biological knowledge. We then used our characterization technique to post-process the collection of all formal concepts contained in the matrix. Obviously the characterization changes, but we also want to evaluate this change in co-clustering interpretation. To do that, we computed the means of all our interestingness measures (frequency, confidence) for one instance from each group of solutions. The two instances were chosen as those with the minimum deviation from the mean of their group. The interestingness measures were computed on all the 5 936 formal concepts, without setting any frequency or confidence constraint. Results are in Table 7. In the first bi-partition, the average frequency and confidence of the characterizing rules are higher than in the second one. This is true for rules computed on both object and property clusters. This means that local patterns (formal concepts) reflect the first bi-partition more than the second one. In other words, the consistency of the first global model is validated by the local associations within the matrix. The fact that both the Goodman-Kruskal and the mutual information loss measures are better in the first group of solutions is a further means to link global and local consistency. It means that characterizing a

Table 7
Characterization interestingness measures on adult drosophila data

bi-partition        |B1|/|B2|   P-freq   P-conf   O-freq   O-conf
males vs. females   0.87        0.6%     91%      6%       78%
mixed               0.97        0.4%     73%      3%       47%

global bi-partition by means of local patterns makes sense, and could be a new way to assess bi-partition quality.

5. Conclusion

We presented a new (bi-)cluster characterization method based on extracted local patterns, more precisely formal concepts and δ-bi-sets. It is now possible to use quite efficient constraint-based mining techniques for computing various local patterns. While a bi-partition provides a global and somewhat expected characterization, selected collections of characterizing bi-sets point out local associations which might lead to more unexpected yet relevant information. Global and local patterns are both useful during a knowledge discovery process, and it is important to support these intrinsically interactive processes. Our approach suggests the use of characterizing queries, i.e., queries in which analysts can use the proposed accuracy measures to select relevant characterizing patterns. Examples of typical characterizing queries are as follows:

– Select all the bi-sets which characterize bi-cluster (C o , C p ) with a maximum exception ratio of ε for both objects and properties;
– Select all the association rules with minimal body characterizing bi-cluster (C o , C p ) with a minimal frequency f , a minimal confidence c, and a maximal exception ratio ε for the set of properties;
– Select all the association rules with minimal body characterizing bi-cluster (C o , C p ) with a minimal frequency f , a minimal confidence c, and a minimal exception ratio ε for the set of properties.

The first two types of queries are obviously useful for bi-cluster characterization. The third one concerns knowledge discovery thanks to unexpectedness. Indeed, it might return patterns that are exceptions, i.e., they concern objects belonging to bi-cluster (C o , C p ) that are characterized by some properties from other bi-clusters.
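The characterizing queries listed above can be sketched as simple filters over a collection of bi-sets. This is only an illustrative sketch under stated assumptions: the `BiSet` class and all function names are hypothetical, the exception ratio is assumed to be the fraction of a bi-set's objects (resp. properties) lying outside the bi-cluster, and the third query is adapted here to bi-sets rather than association rules, so frequency and confidence constraints are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiSet:
    """A local pattern: a set of objects associated with a set of properties."""
    objects: frozenset
    properties: frozenset


def exception_ratio(items, cluster):
    """Fraction of `items` that fall outside `cluster` (assumed definition)."""
    if not items:
        return 0.0
    return len(items - cluster) / len(items)


def characterizing_bisets(bisets, c_o, c_p, max_eps):
    """First query type: at most `max_eps` exceptions on objects and properties."""
    return [
        b for b in bisets
        if exception_ratio(b.objects, c_o) <= max_eps
        and exception_ratio(b.properties, c_p) <= max_eps
    ]


def unexpected_bisets(bisets, c_o, c_p, max_eps_o, min_eps_p):
    """Variant of the third query type: objects mostly inside the bi-cluster,
    but properties mostly taken from other bi-clusters."""
    return [
        b for b in bisets
        if exception_ratio(b.objects, c_o) <= max_eps_o
        and exception_ratio(b.properties, c_p) >= min_eps_p
    ]


# Toy bi-cluster: the object cluster follows the running example,
# the property cluster {p1, p2} is made up for illustration.
c_o = frozenset({"o1", "o3", "o4"})
c_p = frozenset({"p1", "p2"})

b1 = BiSet(frozenset({"o1", "o3"}), frozenset({"p1", "p2"}))
b2 = BiSet(frozenset({"o1", "o2", "o3", "o4"}), frozenset({"p1"}))
b3 = BiSet(frozenset({"o3", "o4"}), frozenset({"p3", "p4"}))
bisets = [b1, b2, b3]

strict = characterizing_bisets(bisets, c_o, c_p, max_eps=0.0)
tolerant = characterizing_bisets(bisets, c_o, c_p, max_eps=0.25)
surprising = unexpected_bisets(bisets, c_o, c_p, max_eps_o=0.0, min_eps_p=0.5)
```

In this toy setting, `strict` keeps only `b1`, `tolerant` also accepts `b2` (one object out of four lies outside the cluster), and `surprising` returns `b3`, whose objects all belong to the bi-cluster while its properties all come from elsewhere.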
If a global pattern like a bi-partition captures some important structures in the data, it also seems interesting to look at the collections of local associations which are somewhat far from it. Assume that the popular association R, which points out frequent transactions with beers and diapers among male customers, is somewhat valid. A co-clustering on a complete basket data set may group beer together with male customers within one bi-cluster, and diapers with female customers in a second bi-cluster. In such a case, a query which would select frequent and high-confidence association rules with a high exception ratio on properties (ε > ) would support the discovery of the “unexpected” association R. Characterizing queries might be studied further. They are interesting examples of queries which have to process both the data and multiple types of patterns holding in the data, i.e., interesting objects for the study of the promising inductive database framework [8,20]. Another perspective is to better understand the convergent techniques developed for (conceptual) clustering, subgroup discovery [13], and association rule discovery.

Acknowledgements

The authors thank P. Francois and B. Crémilleux who provided the data set meningitis. They also thank C. Rigotti and J. Besson for exciting discussions. This research is partially funded by ACI MD 46 BINGO


(French government funding) and by EU contract IQ FP6-516169 (FET arm of the IST programme).

References

[1] R. Agrawal, T. Imielinski and A. Swami, Mining Association Rules Between Sets of Items in Large Databases, In Proceedings of ACM SIGMOD’93, Washington (USA), 1993, 207–216.
[2] M.N. Arbeitman, E.E. Furlong, F. Imam, E. Johnson, B.H. Null, B.S. Baker, M.A. Krasnow, M.P. Scott, R.W. Davis and K.P. White, Gene expression during the life cycle of drosophila melanogaster, Science 297 (September 2002), 2270–2275.
[3] J. Besson, C. Robardet and J.-F. Boulicaut, Constraint-Based Mining of Formal Concepts in Transactional Data, In Proceedings PaKDD’04, Volume 3056 of LNAI, Sydney (Australia), May 2004, 615–624, Springer-Verlag.
[4] J. Besson, C. Robardet and J.-F. Boulicaut, Mining Formal Concepts with a Bounded Number of Exceptions from Transactional Data, In Proceedings KDID’04, Volume 3377 of LNCS, Pisa (I), 2004, 33–45, Springer-Verlag.
[5] C.L. Blake and C.J. Merz, UCI repository of machine learning databases, 1998.
[6] J.-F. Boulicaut, A. Bykowski and C. Rigotti, Approximation of Frequency Queries by Mean of Free-Sets, In Proceedings PKDD’00, Volume 1910 of LNAI, Lyon (F), September 2000, 75–85, Springer-Verlag.
[7] J.-F. Boulicaut, A. Bykowski and C. Rigotti, Free-sets: a condensed representation of boolean data for the approximation of frequency queries, Data Mining and Knowledge Discovery 7(1) (2003), 5–22.
[8] J.-F. Boulicaut, L. De Raedt and H. Mannila, editors, Constraint-Based Mining and Inductive Databases, Springer-Verlag LNAI 3848, 2006, 405.
[9] Z. Bozdech, M. Llinás, B. Lee Pulliam, E.D. Wong, J. Zhu and J.L. DeRisi, The transcriptome of the intraerythrocytic developmental cycle of plasmodium falciparum, PLoS Biology 1(1) (October 2003), 1–16.
[10] B. Crémilleux and J.-F. Boulicaut, Simplest Rules Characterizing Classes Generated by Delta-Free Sets, In Proceedings ES 2002, Cambridge (UK), December 2002, 33–46, Springer-Verlag.
[11] I.S. Dhillon, S. Mallela and D.S. Modha, Information-Theoretic Co-Clustering, In Proceedings ACM SIGKDD 2003, Washington (USA), 2003, 89–98.
[12] D.H. Fisher, Knowledge acquisition via incremental conceptual clustering, Machine Learning 2 (1987), 139–172.
[13] D. Gamberger and N. Lavrac, Expert-guided subgroup discovery: Methodology and application, Journal on Artificial Intelligence Research 17 (2002), 501–527.
[14] L.A. Goodman and W.H. Kruskal, Measures of association for cross classification, Journal of the American Statistical Association 49 (1954), 732–764.
[15] C. Hébert and B. Crémilleux, Mining Frequent Delta-Free Patterns in Large Databases, In Proceedings DS’05, Volume 3735 of LNCS, Singapore, October 2005, 124–136, Springer-Verlag.
[16] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, New Jersey, 1988.
[17] W. Li, J. Han and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, In Proceedings IEEE ICDM’01, San Jose (USA), November 2001, 369–376.
[18] B. Liu, W. Hsu and Y. Ma, Integrating Classification and Association Rule Mining, In Proceedings KDD’98, New York (USA), August 1998, AAAI Press, 80–86.
[19] S.C. Madeira and A.L. Oliveira, Biclustering algorithms for biological data analysis: A survey, IEEE/ACM Trans. Comput. Biol. Bioinf. 1(1) (2004), 24–45.
[20] R. Meo, P.-L. Lanzi and M. Klemettinen, editors, Database Support for Data Mining Applications: Discovering Knowledge with Inductive Queries, Springer-Verlag LNCS 2682, 2004, 385.
[21] R. Pensa and J.-F. Boulicaut, From Local Pattern Mining to Relevant Bi-Cluster Characterization, In Proceedings IDA’05, Volume 3646 of LNCS, Madrid (E), 2005, 293–304, Springer-Verlag.
[22] R.G. Pensa, C. Leschi, J. Besson and J.-F. Boulicaut, Assessment of Discretization Techniques for Relevant Pattern Discovery from Gene Expression Data, In Proceedings ACM BIOKDD’04, Seattle (USA), August 2004, 24–30.
[23] C. Robardet, B. Crémilleux and J.-F. Boulicaut, Characterization of Unsupervized Clusters by Means of the Simplest Association Rules: An Application for Child’s Meningitis, In Proceedings IDAMAP’02 co-located with ECAI’02, Lyon (F), July 2002, 61–66.
[24] G. Stumme, R. Taouil, Y. Bastide, N. Pasquier and L. Lakhal, Computing iceberg concept lattices with TITANIC, Data & Knowledge Engineering 42 (2002), 189–222.
[25] R. Wille, Restructuring lattice theory: an approach based on hierarchies of concepts, in: Ordered Sets, I. Rival, ed., Reidel, 1982, pp. 445–470.