Fundamenta Informaticae 34 (1998) 1-16, IOS Press

PATTERN EXTRACTION FROM DATA
Sinh Hoa Nguyen, Hung Son Nguyen
Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
Email: hoa@mimuw.edu.pl, son@alfa.mimuw.edu.pl

Abstract. Searching for patterns is one of the main goals in data mining. Patterns have important applications in many KDD domains such as rule extraction and classification. In this paper we present some methods of rule extraction that generalize the existing approaches to the pattern problem. These methods, called partition of attribute values or grouping of attribute values, can be applied to decision tables with symbolic value attributes. If data tables contain both symbolic and numeric attributes, some of the proposed methods can be used jointly with discretization methods. Moreover, these methods are applicable to incomplete data. The optimization problems for grouping of attribute values are either NP-complete or NP-hard. Hence we propose some heuristics returning approximate solutions for such problems.

1. Introduction
We consider decision tables containing objects represented by vectors of attribute values. Formally, attributes are defined as functions from the set of objects into a corresponding set of values (domains). We distinguish two types of attributes, called continuous (numeric) and nominal (symbolic), with respect to their domains. We also assume that every object belongs to one of the decision classes. The classification problem is defined as the problem of assigning new unseen objects to proper decision classes using accessible information about objects from a training set. There are many classification methods, e.g. AQ [10], C4.5, ID3 [16], which are based on two main approaches, called decision trees and pattern extraction. All decision tree methods are based on searching for classifiers labeling internal nodes. Usually, classifiers are descriptors (i.e. pairs of attributes and values); hence this approach is more suitable for nominal attributes than for continuous ones, because the ramification number of a node is equal to the number of possible values of the attribute domain. In the case of continuous attributes, either some discretization process must be applied or another type of classifier must be used; for example, one can use a simple test of the form "attribute value is less than a given value" to create a binary internal node. The pattern-based classification methods consist in looking for descriptions of decision classes by logical formulas, and they also require some discretization process for continuous attributes. In general, every discretization method aims at removing superfluous attribute information by unifying values within intervals while preserving necessary information (e.g. discernibility between objects). Hence finding suitable interval boundaries is the main goal of all discretization methods. The following question arises: "What can we do in the case of symbolic attributes with a large number of possible values?"
One of the possible solutions is to partition the attribute domains into smaller ones. In previous papers [11, 13, 14] we presented discretization methods based on rough set theory and the Boolean reasoning approach, as well as methods for pattern extraction when patterns are conjunctions of simple descriptors. This paper focuses on the symbolic value partition problem, which is a generalization of both the discretization and the pattern extraction problems. To solve this problem we adopt well known techniques based on Boolean reasoning and decision trees. Moreover, the computational complexity of some optimal partition problems is studied, and we show that they are either NP-complete or NP-hard. Therefore, we propose some heuristics and evaluate their accuracy on different data sets.

2. Preliminaries
An information system [15] is a pair A = (U, A), where U is a nonempty, finite set called the universe and A is a nonempty, finite set of attributes, i.e. a : U → V_a for a ∈ A, where V_a is called the value set of a. Elements of U are called objects. Any information system of the form A = (U, A ∪ {d}) is a decision table, where d ∉ A is called the decision and the elements of A are called conditions. Let V_d = {1, …, r(d)}. The decision d determines the partition {C_1, …, C_{r(d)}} of the universe U, where C_k = {x ∈ U : d(x) = k} for 1 ≤ k ≤ r(d). The set C_k is called the k-th decision class of A. With any subset of attributes B ⊆ A, an equivalence relation called the B-indiscernibility relation [15], denoted by IND(B), is defined by

    IND(B) = {(x, y) ∈ U × U : ∀a ∈ B, a(x) = a(y)}.

Objects x, y satisfying the relation IND(B) are indiscernible by attributes from B. By [x]_IND(B) we denote the equivalence class of IND(B) defined by x. Any minimal subset B of A such that IND(A) = IND(B) is called a reduct of A. Let A = (U, A ∪ {d}) be a decision table and B ⊆ A. We define a function ∂_B : U → 2^{1,…,r(d)}, called the generalized decision in A, by ∂_B(x) = d([x]_IND(B)). A decision table A is called consistent (deterministic) if card(∂_A(x)) = 1 for any x ∈ U; otherwise A is inconsistent (nondeterministic).

3. Searching for patterns in data
We have mentioned in Section 1 that discretization of real value attributes is a necessary process for almost all classification methods. In this section we recall some discretization techniques which will be adopted to solve partition problems for symbolic value attributes. We give special consideration to the Boolean reasoning approach, for both the discretization problem and the partition problem. Moreover, we recall the problem of searching for templates in data [14] to show that partition methods also produce generalized templates. In this way the partition problem becomes a common generalization of both the discretization and the template extraction problems.

3.1. Discretization

Let A = (U, A ∪ {d}) be a decision table, where A = {a_1, …, a_k} and d : U → {1, …, r}. Any pair (a, c), where a ∈ A and c ∈ ℝ, will be called a cut on V_a. Any set of cuts

    P_a = {(a, c_1^a), (a, c_2^a), …, (a, c_{k_a}^a)} ⊆ {a} × ℝ

on V_a = [l_a, r_a) ⊆ ℝ (for a ∈ A) uniquely defines a partition of V_a into subintervals, i.e.
    V_a = [c_0^a, c_1^a) ∪ [c_1^a, c_2^a) ∪ … ∪ [c_{k_a}^a, c_{k_a+1}^a),

where l_a = c_0^a < c_1^a < c_2^a < … < c_{k_a}^a < c_{k_a+1}^a = r_a. Therefore, any set of cuts P = ⋃_{a∈A} P_a defines a new decision table A^P = (U, A^P ∪ {d}) called the P-discretization of A, where A^P = {a^P : a ∈ A} and a^P(x) = i ⇔ a(x) ∈ [c_i^a, c_{i+1}^a) for x ∈ U and i ∈ {0, …, k_a}.
Two sets of cuts P′, P are equivalent, written P′ ≡_A P, iff A^{P′} = A^P. The equivalence relation ≡_A has finitely many equivalence classes. In the sequel we will not discern between equivalent sets of cuts. The set of cuts P is A-consistent if ∂_A = ∂_{A^P}, where ∂_A and ∂_{A^P} are the generalized decisions of A and A^P, respectively. An A-consistent set of cuts P_irr is A-irreducible if P is not A-consistent for any P ⊊ P_irr. An A-consistent set of cuts P_opt is A-optimal if card(P_opt) ≤ card(P) for any A-consistent set of cuts P.
Any cut (a, c) splits the set of values (l_a, r_a) of the attribute a into two intervals I_1 = (l_a, c), I_2 = (c, r_a). For a fixed cut (a, c) we use the following notation:

    r    = the number of decision classes of A
    A_ij = the set of objects of the j-th class in the i-th interval
    n_ij = the number of objects in A_ij
    R_i  = the number of objects in the i-th interval, for i ∈ {1, 2} (i.e. R_i = Σ_{j=1}^r n_ij)
    c_j  = Σ_{i=1}^2 n_ij = |C_j| = the cardinality of the j-th decision class (see Section 2)
    n    = the total number of objects (i.e. n = Σ_{i=1}^2 R_i = Σ_{j=1}^r c_j)
    E_ij = the expected frequency of A_ij (i.e. E_ij = (R_i · c_j)/n),

where i ∈ {1, 2}, j ∈ {1, …, r}.

3.1.1. Statistical test methods
Statistical tests allow one to check the probabilistic independence between the object partition defined by the decision attribute and the partition defined by the cut (a, c). The independence degree is estimated by the χ² test:

    χ² = Σ_{i=1}^2 Σ_{j=1}^r (n_ij − E_ij)² / E_ij.

Intuitively, if the partition defined by c does not depend on the partition defined by the decision d, then

    P(C_j) = P(C_j | I_1) = P(C_j | I_2)    (1)

for any j ∈ {1, …, r}. The condition (1) is equivalent to n_ij = E_ij for any i ∈ {1, 2} and j ∈ {1, …, r}; hence we have χ² = 0. When a cut c "properly" separates objects from different decision classes, the value of the χ² test for c is maximal. Discretization methods based on the χ² test choose only the cuts with large values of this test (and delete the cuts with small values of the χ² test). Some versions of this method can be found in [9].

3.1.2. Entropy methods
A number of methods based on entropy measures have been developed for the discretization problem. They use class entropy as a criterion to extract a list of the best cuts which, together with the attribute domain, induce the desired intervals. The class information entropy of the partition induced by a cut point c on attribute a is defined by

    E(a, c, U) = (|U_1|/n) · Ent(U_1) + (|U_2|/n) · Ent(U_2),

where U_1, U_2 are the sets of objects on the two sides (left and right) of the cut c and the function Ent is defined by

    Ent(U_i) = − Σ_{j=1}^r (n_ij / R_i) · log(n_ij / R_i)    for i = 1, 2.

For a given feature a, a cut c_min minimizing the entropy function over all possible cuts is selected. This method can be applied recursively to both object sets U_1, U_2 induced by c_min until some stopping condition is achieved. There is a number of discretization methods based on information entropy (see e.g. [2, 6, 3, 16]). We mention as an example a method using the Minimal Description Length Principle to determine a stopping criterion for the recursive discretization strategy (see [6]). First the Information Gain of the cut (a, c_min) over the set of objects U is defined by

    Gain(a, c_min, U) = Ent(U) − E(a, c_min, U),

and the recursive partitioning within a set of objects U stops iff Gain(a, c_min, U) is too small to continue the splitting process, i.e.

    Gain(a, c, U) < log₂(n − 1)/n + Δ(a, c, U)/n,

where

    Δ(a, c, U) = log₂(3^r − 2) − [r · Ent(U) − r_1 · Ent(U_1) − r_2 · Ent(U_2)],

and r, r_1, r_2 are the numbers of decision classes in U, U_1, U_2, respectively.

3.1.3. Boolean reasoning method
In [11, 12] we have presented a discretization method based on rough set theory and Boolean reasoning. The main idea of this method is based on the construction and analysis of a new decision table A* = (U*, A* ∪ {d*}), where

    U* = {(u, v) ∈ U² : d(u) ≠ d(v)} ∪ {⊥};
    A* = {c* : c is a cut on A}, where c*(⊥) = 0, and c*((u, v)) = 1 if c discerns u, v and 0 otherwise;
    d*(⊥) = 0 and d*((u, v)) = 1 for (u, v) ∈ U*.

It has been shown in [11] that any relative reduct of A* is an irreducible set of cuts for A, and any minimal relative reduct of A* is an optimal set of cuts for A. The Boolean function f_A corresponding to the minimal relative reduct problem has O(nk) variables (cuts) and O(n²) clauses. Hence:

Theorem 3.1. [11] The decision problem of checking, for a given decision table A and an integer k, whether there exists an irreducible set of cuts P in A such that card(P) < k is NP-complete. The problem of searching for an optimal set of cuts P of a given decision table A is NP-hard.
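As an illustration of this construction (a sketch, not the authors' implementation), the pairs U* and the columns of A* can be materialized directly; here candidate cuts are taken as midpoints between consecutive distinct attribute values, a common convention that we assume for the example:

```python
from itertools import combinations

def candidate_cuts(values):
    """Midpoints between consecutive distinct values of a numeric attribute."""
    vs = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(vs, vs[1:])]

def build_star_table(objects, decisions):
    """Build the universe U* of object pairs with different decisions and,
    for each candidate cut (attribute index, cut value), the set of pairs
    that the cut discerns (the columns of the table A*)."""
    n_attrs = len(objects[0])
    pairs = [(i, j) for i, j in combinations(range(len(objects)), 2)
             if decisions[i] != decisions[j]]
    discerns = {}
    for a in range(n_attrs):
        for c in candidate_cuts([obj[a] for obj in objects]):
            discerns[(a, c)] = {(i, j) for (i, j) in pairs
                                if (objects[i][a] < c) != (objects[j][a] < c)}
    return pairs, discerns

# One numeric attribute; the two classes separate around the middle cut.
objects = [(0.8,), (1.0,), (1.3,), (1.5,)]
decisions = [0, 0, 1, 1]
pairs, discerns = build_star_table(objects, decisions)
```

In this toy table the middle cut discerns all four mixed-class pairs, so a single cut suffices.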
The discernibility degree of a cut (a, c) is defined as the number of pairs of objects from different decision classes (i.e. the number of objects in the table A*) discerned by c. This number can be computed by

    Disc(a, c) = Σ_{i≠j} n_1i · n_2j = R_1 · R_2 − Σ_{i=1}^r n_1i · n_2i.

The discretization algorithm called the MD-heuristic¹ performs a search for a cut (a, c) ∈ A* with the largest discernibility degree Disc(a, c). The cut c is then moved from A* to the resulting set of cuts P, and all pairs of objects discerned by c are removed from U*. The algorithm runs until U* = {⊥}. It has been shown in [12] that the MD-heuristic is very efficient: it determines the best cut in O(kn) steps using only O(kn) space.
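The greedy MD loop can be sketched as follows (an illustrative transcription: this naive version recomputes pair counts on every iteration and does not attempt the O(kn) bookkeeping of [12]):

```python
from collections import Counter
from itertools import combinations

def disc(objects, decisions, a, c):
    """Disc(a, c) = R1*R2 - sum_j n1j*n2j: the number of pairs from
    different decision classes separated by the cut (a, c)."""
    left = Counter(d for obj, d in zip(objects, decisions) if obj[a] < c)
    right = Counter(d for obj, d in zip(objects, decisions) if obj[a] >= c)
    r1, r2 = sum(left.values()), sum(right.values())
    return r1 * r2 - sum(left[j] * right[j] for j in left)

def md_heuristic(objects, decisions, cuts):
    """Greedy MD loop: repeatedly take the cut discerning the most not yet
    discerned pairs, until all mixed-class pairs are discerned."""
    todo = {(i, j) for i, j in combinations(range(len(objects)), 2)
            if decisions[i] != decisions[j]}
    chosen = []
    while todo and cuts:
        def gain(ac):
            a, c = ac
            return sum(1 for (i, j) in todo
                       if (objects[i][a] < c) != (objects[j][a] < c))
        best = max(cuts, key=gain)
        if gain(best) == 0:
            break  # remaining pairs cannot be discerned by any cut
        cuts = [ac for ac in cuts if ac != best]
        a, c = best
        todo = {(i, j) for (i, j) in todo
                if (objects[i][a] < c) == (objects[j][a] < c)}
        chosen.append(best)
    return chosen
```

On later iterations the gain is evaluated only over the pairs still in U*, exactly as the removal step of the heuristic prescribes.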
¹ Abbreviation of "Maximal Discernibility heuristic".

    Objects |ated by| a1 | a2 | a3 |  a4  |  a5   | d
    x1      |  5 |  1 |  0 | 1.16 | black | 1
    x2      |  5 |  0 |  0 | 8.33 | red   | 0
    x3      |  1 |  1 |  1 | 3.13 | red   | 1
    x4      |  4 |  0 |  0 | 3.22 | black | 1
    x5      |  5 |  0 |  0 | 3.24 | red   | 0

    Table 1. An example of a template with fitness 2 and length 3

3.2. Searching for patterns in data

A template T of A is any propositional formula ⋀ᵢ pᵢ, where the pᵢ are propositional variables, called descriptors, of the form aᵢ = v; here aᵢ is an attribute of A, aᵢ ≠ aⱼ for i ≠ j, and v is a value from the value set of the attribute aᵢ. Assuming A = {a₁, …, a_m}, any template T = (a_{i₁} = v_{i₁}) ∧ … ∧ (a_{i_k} = v_{i_k}) can be represented by a sequence [x₁, …, x_m], where x_p is v_p if p ∈ {i₁, …, i_k}, and "∗" (the "don't care" symbol) otherwise. An object x satisfies the descriptor a = v if a(x) = v. An object x satisfies (matches) a template if it satisfies all the descriptors of the template. For any template T, by length(T) we denote the number of different descriptors a = v occurring in T, and by fitness_A(T) its fitness, i.e. the number of objects from the universe U satisfying T. If T consists of one descriptor a = v only, we also write n_A(a, v) (or n(a, v)) instead of fitness_A(T). The quality of a template T can be taken to be the number fitness_A(T) · length(T). If s is an integer, then we denote by Template_A(s) the set of all templates of A with fitness not less than s.

4. Symbolic value partition problem
We have considered the real value attribute discretization problem. It is the problem of searching for a partition of real values into intervals (the natural linear order "<" in the real space ℝ is assumed). In the case of symbolic value attributes (i.e. without any preassumed order in the value sets of attributes), the problem of searching for partitions of value sets into a "small" number of subsets is more complicated than for continuous attributes. Once again, we apply the Boolean reasoning approach to construct a partition of symbolic value sets into a small number of subsets.
Let A = (U, A ∪ {d}) be a decision table, where A = {aᵢ : U → V_{aᵢ} = {v₁^{aᵢ}, …, v_{nᵢ}^{aᵢ}}} for i ∈ {1, …, k}. Any function P_{aᵢ} : V_{aᵢ} → {1, …, mᵢ} (where mᵢ ≤ nᵢ) is called a partition of V_{aᵢ}. The rank of P_{aᵢ} is the value rank(P_{aᵢ}) = |P_{aᵢ}(V_{aᵢ})|. The function P_{aᵢ} defines a new partition attribute bᵢ = P_{aᵢ} ∘ aᵢ, i.e. bᵢ(u) = P_{aᵢ}(aᵢ(u)) for any object u ∈ U.
Let B ⊆ A be an arbitrary subset of attributes. The family of partitions {P_a}_{a∈B} is called B-consistent iff

    ∀u,v∈U [d(u) ≠ d(v) ∧ (u, v) ∉ IND(B)] ⇒ ∃a∈B [P_a(a(u)) ≠ P_a(a(v))].    (2)

It means that if two objects u, v are discerned by B and d, then they are also discerned by the partition attributes defined by {P_a}_{a∈B}. We consider the following optimization problem, called the symbolic value partition problem:

Symbolic Value Partition Problem: For a given decision table A = (U, A ∪ {d}) and a set of nominal attributes B ⊆ A, search for a minimal B-consistent family of partitions, i.e. a B-consistent family {P_a}_{a∈B} with the minimal value of Σ_{a∈B} rank(P_a).
This concept is useful when we want to reduce attribute domains with large cardinalities. The discretization problem can be derived from the partition problem by adding the following monotonicity condition for the family {P_a}_{a∈A}:

    ∀v₁,v₂∈V_a [v₁ ≤ v₂ ⇒ P_a(v₁) ≤ P_a(v₂)].
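Condition (2) can be checked directly once a candidate family of partitions is given. A minimal sketch (illustrative, not the authors' code); the values and partitions used in the test below follow the example worked out later in Section 4.3:

```python
from itertools import combinations

def apply_partitions(objects, partitions):
    """Replace each attribute value by its group number, b_i(u) = P_ai(a_i(u)).
    `partitions` maps an attribute index to a dict from value to group."""
    return [tuple(partitions[a][obj[a]] for a in range(len(obj)))
            for obj in objects]

def is_consistent(objects, decisions, partitions):
    """Condition (2) with B = all attributes: whenever two objects differ in
    decision and in some original attribute, some partition attribute must
    still discern them."""
    grouped = apply_partitions(objects, partitions)
    for i, j in combinations(range(len(objects)), 2):
        if decisions[i] != decisions[j] and objects[i] != objects[j]:
            if grouped[i] == grouped[j]:
                return False
    return True
```

A family that merges everything into one group fails the check as soon as two objects with different decisions exist, which matches the intuition that rank reduction trades against discernibility.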
We discuss three approaches to the solution of this problem, namely the local partition method, the global partition method and the "divide and conquer" method. The first approach is based on grouping the values of each attribute independently, whereas the second approach groups attribute values simultaneously for all attributes. The third method is similar to decision tree techniques: the original data table is divided into two subtables by selecting the "best binary partition of some attribute domain", and this process is continued for all subtables until some stop criterion is satisfied.

4.1. Local partition

The local partition strategy is very simple. For any fixed attribute a ∈ A, we search for a partition P_a that preserves the consistency condition (2) for the attribute a (i.e. B = {a}). Any partition P_a defines an equivalence relation ≈_{P_a} by v₁ ≈_{P_a} v₂ ⇔ P_a(v₁) = P_a(v₂), for all v₁, v₂ ∈ V_a. We consider the relation UNI_a defined on V_a as follows:

    v₁ UNI_a v₂ ⇔ ∀u,u′∈U [(a(u) = v₁ ∧ a(u′) = v₂) ⇒ d(u) = d(u′)].    (3)

It is obvious that the relation UNI_a defined by (3) is an equivalence relation. Local partitions are created using the following:

Proposition 4.1. If P_a is a-consistent, then ≈_{P_a} ⊆ UNI_a. The equivalence relation UNI_a defines a minimal a-consistent partition on a.

We consider the discernibility matrix M(A) = [m_{i,j}]_{i,j=1}^n (see [17]) of the decision table A, where m_{i,j} = {a ∈ A : a(uᵢ) ≠ a(uⱼ)} is the set of attributes discerning the two objects uᵢ, uⱼ. Observe that if we want to discern an object uᵢ from another object uⱼ, we need to preserve one of the attributes in m_{i,j}. To put it more precisely: for any two objects uᵢ, uⱼ there must exist an attribute a ∈ m_{i,j} such that the values a(uᵢ), a(uⱼ) are discerned by P_a.

4.2. Global partition

Hence, instead of cuts as in the case of continuous values (defined by pairs (aᵢ, cⱼ)), we can discern objects by triples (aᵢ, v_{i₁}^{aᵢ}, v_{i₂}^{aᵢ}) called chains, where aᵢ ∈ A for i = 1, …, k and i₁, i₂ ∈ {1, …, nᵢ}. We can build a new decision table A⁺ = (U⁺, A⁺ ∪ {d⁺}) (analogously to the table A* from Section 3.1.3) by setting U⁺ = U*, d⁺ = d* and A⁺ = {(a, v₁, v₂) : (a ∈ A) ∧ (v₁, v₂ ∈ V_a)}. Again, e.g. the Johnson heuristic can be applied to A⁺ to search for a minimal set of chains discerning all pairs of objects in different decision classes. It can be seen that our problem can then be solved by efficient heuristics for graph coloring. The "graph k-colorability" problem is formulated as follows:

    input: a graph G = (V, E) and a positive integer k ≤ |V|;
    output: 1 if G is k-colorable (i.e. if there exists a function f : V → {1, …, k} such that f(v) ≠ f(v′) whenever (v, v′) ∈ E), and 0 otherwise.

This problem is solvable in polynomial time for k = 2, but is NP-complete for all k ≥ 3. However, similarly to discretization, some efficient heuristics searching for optimal graph colorings, and thereby determining optimal partitions of attribute value sets, can be applied. For any attribute aᵢ occurring in the semi-minimal set X of chains returned by the above heuristic, we construct a graph Γ_{aᵢ} = ⟨V_{aᵢ}, E_{aᵢ}⟩, where E_{aᵢ} is the set of all chains of the attribute aᵢ in X. Any coloring of all the graphs Γ_{aᵢ} defines an A-consistent partition of the value sets. Hence heuristics searching for minimal graph colorings also return suboptimal partitions of attribute value sets. The corresponding Boolean formula has O(knl²) variables and O(n²) clauses, where l is the maximal value of card(V_a) for a ∈ A. Once prime implicants of the Boolean formula have been constructed, a heuristic for graph coloring should be applied to generate the new features.

4.3. Example
Let us consider the decision table A presented in Figure 1, together with a reduced form of its discernibility matrix. First, from the Boolean function f_A with Boolean variables of the form a_{v₁}^{v₂} (corresponding to the chain (a, v₁, v₂) described in Section 4.2) we find a shortest prime implicant:

    a_{a1}^{a2} ∧ a_{a2}^{a3} ∧ a_{a1}^{a4} ∧ a_{a3}^{a4} ∧ b_{b1}^{b4} ∧ b_{b2}^{b4} ∧ b_{b2}^{b3} ∧ b_{b1}^{b3} ∧ b_{b3}^{b5},

which can be represented by graphs (Figure 1). Next we apply a heuristic to color the vertices of those graphs, as shown in Figure 1. The colors correspond to the partitions:

    P_a(a1) = P_a(a3) = 1;  P_a(a2) = P_a(a4) = 2;
    P_b(b1) = P_b(b2) = P_b(b5) = 1;  P_b(b3) = P_b(b4) = 2,

and at the same time one can construct the new decision table (see Figure 1). The following set of decision rules can be derived from the table A^P:

    if a(u) ∈ {a1, a3} and b(u) ∈ {b1, b2, b5} then d = 0 (supported by u1, u2, u4)
    if a(u) ∈ {a2, a4} and b(u) ∈ {b3, b4} then d = 0 (supported by u3)
    if a(u) ∈ {a1, a3} and b(u) ∈ {b3, b4} then d = 1 (supported by u5, u9)
    if a(u) ∈ {a2, a4} and b(u) ∈ {b1, b2, b5} then d = 1 (supported by u6, u7, u8, u10)
    A   | a  | b  | d
    u1  | a1 | b1 | 0
    u2  | a1 | b2 | 0
    u3  | a2 | b3 | 0
    u4  | a3 | b1 | 0
    u5  | a1 | b4 | 1
    u6  | a2 | b2 | 1
    u7  | a2 | b1 | 1
    u8  | a4 | b2 | 1
    u9  | a3 | b4 | 1
    u10 | a2 | b5 | 1

    A^P | aP | bP | d
        |  1 |  1 | 0
        |  2 |  2 | 0
        |  1 |  2 | 1
        |  2 |  1 | 1

Figure 1. The decision table A and the corresponding (reduced) discernibility matrix with entries given by chains; the attribute value graphs Γ_a (edges a1-a2, a2-a3, a1-a4, a3-a4) and Γ_b (edges b1-b4, b2-b4, b2-b3, b1-b3, b3-b5) with the coloring P_a(a1) = P_a(a3) = 1, P_a(a2) = P_a(a4) = 2, P_b(b1) = P_b(b2) = P_b(b5) = 1, P_b(b3) = P_b(b4) = 2; and the reduced table A^P.

4.4. Divide and Conquer method
For a fixed attribute a and an object set Z ⊆ U, we define the discernibility degree of disjoint sets V₁, V₂ of values from V_a (denoted by Disc_a(V₁, V₂ | Z)) as

    Disc_a(V₁, V₂ | Z) = |{(u₁, u₂) ∈ Z² : d(u₁) ≠ d(u₂) ∧ (a(u₁), a(u₂)) ∈ V₁ × V₂}|.
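This definition can be transcribed directly as a brute-force pair count (an illustrative sketch, not an optimized implementation):

```python
def disc_a(values, decisions, V1, V2):
    """Disc_a(V1, V2 | Z): ordered pairs (u1, u2) from Z^2 with different
    decisions whose attribute values fall into V1 x V2, respectively.
    `values` holds a(u) and `decisions` holds d(u) for the objects of Z."""
    V1, V2 = set(V1), set(V2)
    return sum(1 for x, dx in zip(values, decisions)
                 for y, dy in zip(values, decisions)
                 if dx != dy and x in V1 and y in V2)
```

Since V₁ and V₂ are disjoint, each unordered pair of objects is counted at most once, in the orientation (V₁, V₂), and the count is symmetric in the two value sets.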
Let P be an arbitrary partition of V_a. For any two objects u₁, u₂ ∈ U, we say that the pair (u₁, u₂) is discerned by P if [d(u₁) ≠ d(u₂)] ∧ [P(a(u₁)) ≠ P(a(u₂))]. The discernibility degree of the partition P over the set of objects Z ⊆ U is defined as the number of pairs of objects from Z discerned by P. Using the notation Disc_a(P | Z), we have

    Disc_a(P | Z) = |{(u₁, u₂) ∈ Z² : d(u₁) ≠ d(u₂) ∧ P(a(u₁)) ≠ P(a(u₂))}|.    (4)

In this section, similarly to the decision tree approach, we consider an optimization problem called the binary optimal partition (BOP) problem, described as follows:

BOP Problem: Given a set of objects Z and an attribute a, find a binary partition P of V_a (i.e. rank(P) = 2) such that Disc_a(P | Z) is maximal.

We will show that the BOP problem is NP-hard with respect to the size of V_a. The proof will suggest some natural search heuristics for optimal partitions. We apply these heuristics to construct binary decision trees over symbolic attributes.

4.4.1. Complexity of the binary partition problem
In this section we consider the problem of searching for an optimal binary partition of the domain of a fixed attribute a. Let us denote by s(V, σ | Z) the number of occurrences of the values from V among the values a(u) for u ∈ Z ∩ C_σ, i.e.

    s(V, σ | Z) = |{u ∈ Z : (a(u) ∈ V) ∧ (d(u) = σ)}|.

In the case rank(P) = 2 (without loss of generality one can assume that P : V_a → {1, 2}) and V_d = {0, 1}, the discernibility degree of P is expressed by

    Disc(P | Z) = Σ_{i∈V₁, j∈V₂} [s(i, 0 | Z) · s(j, 1 | Z) + s(i, 1 | Z) · s(j, 0 | Z)],    (5)

where V₁ = P⁻¹(1), V₂ = P⁻¹(2) and s(v, σ | Z) = s({v}, σ | Z). In this section we fix the set of objects Z and, for simplicity of notation, Z will be omitted in the functions described below.
To prove the NP-hardness of the BOP problem, we consider the corresponding decision problem, called the binary partition problem, defined as follows:

Binary Partition (BP) Problem:
Input: a value set V = {v₁, …, v_n}, two functions s₀, s₁ : V → ℕ and a positive integer K.
Question: is there a binary partition of V into two disjoint subsets P(V) = {V₁, V₂} such that the discernibility degree of P, defined by

    Disc(P) = Σ_{i∈V₁, j∈V₂} [s₀(i) · s₁(j) + s₁(i) · s₀(j)],

satisfies the inequality Disc(P) ≥ K?

Obviously, if the BP problem is NP-complete, then the BOP problem is NP-hard.

Theorem 4.1. The binary partition problem is NP-complete.

Proof: It is easy to see that the BP problem is in NP. The NP-completeness of the BP problem can be shown by a polynomial transformation from the Set Partition Problem (SPP), which is defined as the problem of checking, for a given finite set of positive integers S = {n₁, n₂, …, n_k}, whether there is a partition of S into two disjoint subsets S₁ and S₂ such that Σ_{i∈S₁} i = Σ_{j∈S₂} j. It is known that the SPP is NP-complete [8]. We will show that SPP is polynomially transformable to BP. Let S = {n₁, n₂, …, n_k} be an instance of SPP. The corresponding instance of the BP problem is:

    V = {1, 2, …, k};   s₀(i) = s₁(i) = nᵢ for i = 1, …, k;   K = (1/2) · (Σ_{i=1}^k nᵢ)².

One can see that for any partition P of the set V into two disjoint subsets V₁ and V₂, the discernibility degree of P can be expressed by

    Disc(P) = Σ_{i∈V₁, j∈V₂} [s₀(i) · s₁(j) + s₁(i) · s₀(j)] = Σ_{i∈V₁, j∈V₂} 2·nᵢ·nⱼ
            = 2 · (Σ_{i∈V₁} nᵢ) · (Σ_{j∈V₂} nⱼ) ≤ (1/2) · (Σ_{i∈V₁} nᵢ + Σ_{j∈V₂} nⱼ)² = (1/2) · (Σ_{i∈V} nᵢ)² = K,

i.e. for any partition P we have the inequality Disc(P) ≤ K, and the equality holds iff Σ_{i∈V₁} nᵢ = Σ_{j∈V₂} nⱼ. Hence P is a good partition of V (into V₁ and V₂) for the BP problem iff it defines a good partition of S (into S₁ = {nᵢ}_{i∈V₁} and S₂ = {nⱼ}_{j∈V₂}) for the SPP problem. Therefore the BP problem is NP-complete and the BOP problem is NP-hard. □

Now we describe several approximate solutions for the BOP problem. These heuristics are quite similar to 2-means clustering algorithms, but we use the discernibility degree function instead of the Euclidean measure. First, let us explore several properties of the Disc function. The function Disc(V₁, V₂) can be computed quickly from the function s, namely

    Disc(V₁, V₂) = Σ_{σ₁≠σ₂} s(V₁, σ₁) · s(V₂, σ₂).
It is easy to observe that s(V, σ) = Σ_{v∈V} s(v, σ); hence the function Disc is additive and symmetric, i.e.

    Disc(V₁ ∪ V₂, V₃) = Disc(V₁, V₃) + Disc(V₂, V₃),
    Disc(V₁, V₂) = Disc(V₂, V₁),

for arbitrary (pairwise disjoint) value sets V₁, V₂, V₃. Let V_a = {v₁, …, v_m} be the domain of an attribute a. The grouping algorithm by minimizing discernibility starts with the most detailed partition P_a = {{v₁}, …, {v_m}}, where with each V ∈ P_a a vector s(V) = [s(V, 0), s(V, 1), …, s(V, r)] of its occurrences in the decision classes C₀, C₁, …, C_r is associated. In every step we look for the two nearest sets V₁, V₂ in P_a with respect to the function Disc(V₁, V₂), and we replace them by the new set V = V₁ ∪ V₂ with the occurrence vector s(V) = s(V₁) + s(V₂). The algorithm runs until P_a contains two sets, and then the suboptimal binary partition is stored in P_a.
The second technique is called grouping by maximizing discernibility. The algorithm also starts with the family of singletons P_a = {{v₁}, …, {v_m}}, but first we look for the two singletons with the largest discernibility degree to create the kernels of the two groups; let us denote them by V₁ = {v₁} and V₂ = {v₂}. For any symbolic value vᵢ ∉ V₁ ∪ V₂, we compare the discernibility degrees Disc({vᵢ}, V₁) and Disc({vᵢ}, V₂) and attach vᵢ to the group with the smaller discernibility degree. This process ends when all the values in V_a are drawn out.

4.4.2. Decision tree construction
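Before turning to tree construction, the grouping-by-minimizing-discernibility heuristic of Section 4.4.1 can be sketched as follows (an illustrative transcription, not the authors' implementation; groups carry their per-class occurrence vectors s(V) as counters):

```python
from collections import Counter
from itertools import combinations

def disc(s1, s2):
    """Disc(V1, V2) = sum over sigma1 != sigma2 of s(V1, sigma1) * s(V2, sigma2),
    where s1, s2 are per-class occurrence counters of the two groups."""
    return sum(c1 * c2 for k1, c1 in s1.items()
                       for k2, c2 in s2.items() if k1 != k2)

def group_min_disc(values, decisions):
    """Agglomerative grouping: start from singletons and repeatedly merge the
    two groups with minimal Disc until only two groups remain."""
    groups = [(frozenset([v]),
               Counter(d for x, d in zip(values, decisions) if x == v))
              for v in sorted(set(values))]
    while len(groups) > 2:
        i, j = min(combinations(range(len(groups)), 2),
                   key=lambda ij: disc(groups[ij[0]][1], groups[ij[1]][1]))
        vi, si = groups[i]
        vj, sj = groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append((vi | vj, si + sj))  # merged group, s(V) = s(V1) + s(V2)
    return [set(v) for v, _ in groups]
```

On the attribute a of the example table from Section 4.3, this sketch happens to return the same grouping {a1, a3}, {a2, a4} as the coloring-based method.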
In [13] we have presented a very efficient method for decision tree construction, called the MD algorithm. It has been developed for continuous (numeric) data only. In this section we extend the MD algorithm to data with mixed attributes, i.e. numeric and symbolic, by using the binary partition algorithms presented in the previous sections. Given a decision table A = (U, A ∪ {d}), we assume that there are two types of attributes: numeric and symbolic. We also assume that the type of each attribute is given by a predefined function type : A → {N, S}, where type(a) = N if a is a numeric attribute and type(a) = S if a is a symbolic attribute.
We can use the grouping methods to generate a decision tree. The structure of the decision tree is defined as follows: internal nodes of the tree are labelled by logical test functions of the form T : U → {True, False}, and external nodes (leaves) are labelled by decision values. In this paper we consider two kinds of tests, related to the attribute types. In the case of a symbolic attribute a_j ∈ A we use test functions defined by a partition:

    T(u) = True ⇔ [a_j(u) ∈ V],
where V ⊆ V_{a_j}. For a numeric attribute a_i ∈ A we use test functions defined by discretization:

    T(u) = True ⇔ [a_i(u) ≤ c] ⇔ [a_i(u) ∈ (−∞, c]],

where c is a cut in V_{a_i}. Below we present the process of decision tree construction. During the construction we additionally use object sets to label nodes of the decision tree; this third kind of label is removed at the end of the construction process. The algorithm is presented in Figure 3.

4.5. Pattern for Incomplete Data
Now we consider data tables with incomplete value attributes. The problem is how to guess the unknown values in a data table so as to guarantee maximal discernibility of objects in different decision classes.

Figure 2. A node of the decision tree: the best binary partition P_a = {V₁, V₂} over the object set Z, with the left subtree L₁ reached when a(u) ∈ V₁ and the right subtree L₂ when a(u) ∈ V₂.
ALGORITHM
1. Create a decision tree with one node, labelled by the object set U.
2. For each leaf L labelled by an object set Z do:
3.   For each attribute a ∈ A do:
       if type(a) = S, then search for the optimal binary partition P_a = {V₁, V₂} of V_a;
       if type(a) = N, then search for the optimal cut c in V_a and set P_a = {V₁, V₂}, where V₁ = (−∞, c] and V₂ = (c, ∞).
4.   Choose an attribute a such that Disc(P_a | Z) is maximal and label the current node by the formula [a ∈ V₁].
5.   For i := 1 to 2 do:
       Z_i = {u ∈ Z : a(u) ∈ V_i};
       L_i = v if d(Z_i) = {v}, and L_i = Z_i otherwise.
6.   Create two successors of the current node and label them by L₁ and L₂.
7. If all leaves are labelled by decisions then STOP, else go to Step 2.

Figure 3. The decision tree construction algorithm

The idea of grouping values proposed in the previous sections can be used to solve this problem. We have shown how to extract patterns from data by using the discernibility of objects in different decision classes. At the same time, the information about the values in one group can be used to guess the unknown values. Below we define the search problem for unknown values in an incomplete decision table.
The decision table A = (U, A ∪ {d}) is called incomplete if the attributes in A are defined as functions a : U → V_a ∪ {∗}, where a(u) = ∗ denotes an unknown value of the attribute a. All values different from ∗ are called fixed values. We say that a pair of objects x, y ∈ U is inconsistent if d(x) ≠ d(y) ∧ ∀a∈A [a(x) = ∗ ∨ a(y) = ∗ ∨ a(x) = a(y)]. We denote by Conflict(A) the number of inconsistent pairs of objects in the decision table A. The problem is to search for fixed values which can be substituted for the ∗ entries in the table A in such a way that the number of conflicts Conflict(A′) in the new table A′ (obtained by replacing the ∗ entries of A with fixed values) is minimal.
The main idea is to group the values in the table A so that the discernibility of objects in different decision classes is maximized, and then to replace each ∗ by a value depending on the fixed values belonging to the same group. To group attribute values we can use the heuristics proposed in the previous sections. We assume that all the unknown values of attributes in A are pairwise different and different from the fixed values. Hence we can label the unknown values by different indices before applying the algorithms proposed in the previous sections. This assumption allows us to create the discernibility matrix for an incomplete table as in the case of complete tables, and we can then use the global partition method presented in Section 4.2 for grouping unknown values.
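The conflict count just defined can be transcribed directly (an illustrative sketch; `None` stands for the unknown value ∗):

```python
from itertools import combinations

def conflict(objects, decisions):
    """Conflict(A): the number of pairs of objects with different decisions
    that no attribute can discern, where None marks an unknown value (*)."""
    def indiscernible(x, y):
        # a(x) = * or a(y) = * or a(x) = a(y), for every attribute a
        return all(a is None or b is None or a == b for a, b in zip(x, y))
    return sum(1 for i, j in combinations(range(len(objects)), 2)
               if decisions[i] != decisions[j]
               and indiscernible(objects[i], objects[j]))
```

Substituting candidate fixed values for the `None` entries and recomputing `conflict` gives a direct way to compare completions of an incomplete table.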
The function Disc(V₁, V₂) can also be computed for pairs of subsets which contain unknown values. Hence we can apply both heuristics of the divide and conquer method for grouping unknown values. After the grouping step, we assign to each unknown value one (or all) of the fixed values in the group containing it. If there is no fixed value in the group, we choose an arbitrary value (or all possible values) from the attribute domain that does not belong to other groups. If no such value exists either, we can conclude that these unknown values have no influence on discernibility in the decision table, and we can assign to them an arbitrary value from the domain.

5. Experimental results
Experiments with the classification methods have been carried out on decision tables using two techniques, called "train-and-test" and "n-fold cross-validation". In Table 2 we present experimental results obtained by testing the classification quality of the proposed methods on well-known data tables from the "UC Irvine repository", together with execution times. Similar results obtained by alternative methods are reported in [7]. It is interesting to compare those results with regard to both classification quality and execution time.

                      Classification accuracies
Names of Tables     SID3    C4.5     MD      MDG
Australian         78.26   85.36   83.69   84.49
Breast (L)         62.07   71.00   69.95   69.95
Diabetes           66.23   70.84   71.09   76.17
Glass              62.79   65.89   66.41   69.79
Heart              77.78   77.04   77.04   81.11
Iris               96.67   94.67   95.33   96.67
Lympho             73.33   77.01   71.93   82.02
Monk1              81.25   75.70   100     93.05
Monk2              69.91   65.00   99.07   99.07
Monk3              90.28   97.20   93.51   94.00
Soybean            100     95.56   100     100
TicTacToe          84.38   84.02   97.70   97.70
Average            78.58   79.94   85.48   87.00

Table 2. The quality comparison between decision tree methods. MD: MD-heuristics; MDG: MD-heuristics with symbolic value partition.

6. Conclusions
We have presented symbolic (nominal) value partition problems as alternatives to discretization problems for numeric (continuous) data. We have discussed their computational complexity and proposed a number of approximate solutions for them. The proposed solutions are obtained by using well-known existing approaches like Boolean reasoning, entropy, and decision trees. Some of these methods can be used together with discretization methods for data containing both symbolic and numeric attributes. In our system for data analysis we have implemented efficient algorithms based on the methods discussed. The tests show that they are very efficient from the point of view of time complexity. They also assure high quality of recognition of new unseen cases. The proposed heuristics for symbolic value partition allow us to obtain a more compressed form of the decision algorithm. Hence, due to the minimum description length principle, we can expect them to return decision algorithms with high quality of classification of unseen objects.

Acknowledgment: This work was supported by the Polish State Committee for Scientific Research grant #8T11C01011 and the Research Program of the European Union ESPRIT-CRIT2 No. 20288.

References
[1] Brown, F.M. (1990). Boolean Reasoning. Kluwer, Dordrecht.
[2] Catlett, J. (1991). On changing continuous attributes into ordered discrete attributes. In: Y. Kodratoff (ed.), Machine Learning - EWSL-91, Porto, Portugal, LNAI, pp. 164-178.
[3] Chmielewski, M.R., Grzymala-Busse, J.W. (1994). Global discretization of attributes as preprocessing for machine learning. Proc. of the III International Workshop on RSSC'94, pp. 294-301.
[4] Dougherty, J., Kohavi, R., Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proc. of the 12th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, pp. 194-202.
[5] Fayyad, U.M., Irani, K.B. (1992). The attribute selection problem in decision tree generation. Proc. of AAAI-92, San Jose, CA, MIT Press, pp. 104-110.
[6] Fayyad, U.M., Irani, K.B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proc. of the 13th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pp. 1022-1027.
[7] Friedman, J., Kohavi, R., Yun, Y. (1996). Lazy decision trees. In: AAAI-96, pp. 717-724.
[8] Garey, M.R., Johnson, D.S. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, San Francisco.
[9] Kerber, R. (1992). ChiMerge: Discretization of numeric attributes. Proc. of the Tenth National Conference on Artificial Intelligence, MIT Press, pp. 123-128.
[10] Kodratoff, Y., Michalski, R. (1990). Machine Learning: An Artificial Intelligence Approach, vol. 3, Morgan Kaufmann.
[11] Nguyen, H.S., Skowron, A. (1995). Quantization of real value attributes: Rough set and Boolean reasoning approaches. Proc. of the Second Joint Annual Conference on Information Sciences, Wrightsville Beach, NC, pp. 34-37.
[12] Nguyen, S.H., Nguyen, H.S. (1996). Some efficient algorithms for rough set methods. Proc. of the Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems IPMU'96, Granada, Spain, pp. 1451-1456.
[13] Nguyen, H.S., Nguyen, S.H. (1997). Discretization methods for data mining. In: A. Skowron and L. Polkowski (eds.), Rough Sets in Data Mining and Knowledge Discovery. Springer-Verlag, Berlin.
[14] Nguyen, S.H., Skowron, A., Synak, P. (1997). Discovery of data patterns with applications to decomposition and classification problems. In: A. Skowron and L. Polkowski (eds.), Rough Sets in Data Mining and Knowledge Discovery. Springer-Verlag, Berlin.
[15] Pawlak, Z. (1991). Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht.
[16] Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
[17] Skowron, A., Rauszer, C. (1992). The discernibility matrices and functions in information systems. In: R. Slowinski (ed.), Intelligent Decision Support - Handbook of Applications and Advances of the Rough Sets Theory, Kluwer, Dordrecht, pp. 331-362.