# lecture20 - Data Mining CS57300 Purdue University Pattern...

Data Mining CS57300, Purdue University, November 23, 2010

## Pattern mining: learning association rules

### Algorithm to find frequent itemsets

- $C_k$: candidate itemsets of size $k$
- $L_k$: frequent itemsets of size $k$
- $L_1$ = {frequent single items}
- for ($k = 1$; $L_k \neq \emptyset$; $k$++)
  - $C_{k+1}$ = candidates generated from $L_k$
  - for each transaction $t$ in $D$
    - increment the count of all candidates in $C_{k+1}$ contained in $t$
  - $L_{k+1}$ = candidates in $C_{k+1}$ with min_support
- Return $\bigcup_k L_k$

### Example

Database $D$:

| TID | Items |
|---|---|
| 100 | 1, 3, 4 |
| 200 | 2, 3, 5 |
| 300 | 1, 2, 3, 5 |
| 400 | 2, 5 |

Each scan of $D$ counts the current candidates, and the frequent ones are kept. The frequent sets shown correspond to a minimum support count of 2.

| Set | Itemsets (support count) |
|---|---|
| $C_1$ | {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3 |
| $L_1$ | {1}: 2, {2}: 3, {3}: 3, {5}: 3 |
| $C_2$ | {1,2}: 1, {1,3}: 2, {1,5}: 1, {2,3}: 2, {2,5}: 3, {3,5}: 2 |
| $L_2$ | {1,3}: 2, {2,3}: 2, {2,5}: 3, {3,5}: 2 |
| $C_3$ | {2,3,5}: 2 |
| $L_3$ | {2,3,5}: 2 |

### Generating candidates

- Store the items of $L_{k-1}$ in order (e.g., lexicographic)
- Step 1: self-joining $L_{k-1}$ (each itemset in $L_{k-1}$ has $m = k-1$ items)
  - `insert into C_k`
  - `select p.item_1, ..., p.item_m, q.item_m`
  - `from L_{k-1} as p, L_{k-1} as q`
  - `where p.item_1 = q.item_1, ..., p.item_{m-1} = q.item_{m-1}, p.item_m < q.item_m`
- Step 2: pruning
  - For all itemsets $c$ in $C_k$
    - For all $(k-1)$-subsets $s$ of $c$
      - If $s$ is not in $L_{k-1}$, then delete $c$ from $C_k$

### Example

- $L_3$ = {abc, abd, acd, ace, bcd}
- Self-join
  - {abcd, acde}
- Pruning
  - acde is removed because ade is not in $L_3$
- $C_4$ = {abcd}

### Counting support

- Issues with counting support
  - The total number of candidates may be very large
  - One transaction may contain many candidates
- Approach #1:
  - Compare each transaction to every candidate itemset
- Approach #2:
  - Exploit lexicographic order to efficiently enumerate all $k$-itemsets of a given transaction
  - Increment the counts of the matching itemsets (sequential scan)

### Example enumeration

[Figure: given a transaction $t$ = {1, 2, 3, 5, 6}, what are the possible subsets of size 3? The subsets are enumerated level by level (levels 1, 2, 3) by fixing items in lexicographic order.]

(Slide source: © Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004)

### Counting support

- Approach #3:
  - Use approach #2, but look up candidates efficiently in a hash tree
  - Store the candidate itemsets in a hash tree
  - A leaf node contains a list of itemsets and counts
  - An interior node contains a hash table

### Example hash tree

[Figure: counting supports of candidates with the subset function. The hash function sends items 1, 4, 7 to one branch, 2, 5, 8 to a second, and 3, 6, 9 to a third. The transaction 1 2 3 5 6 is split at the root as 1 + 2356, 2 + 356, 3 + 56, and then further, e.g., 12 + 356 and 13 + 56. The leaves hold the candidate 3-itemsets 145, 124, 457, 125, 458, 159, 136, 234, 567, 345, 356, 357, 689, 367, 368.]

(Slide source: CS590D)

### Rule generation

- Given a frequent itemset $L$, find all non-empty subsets $f \subset L$ such that $f \rightarrow (L - f)$ satisfies the minimum confidence requirement
- If {A, B, C, D} is a frequent itemset, the candidate rules are:
  - ABC→D, ABD→C, ACD→B, BCD→A
  - A→BCD, B→ACD, C→ABD, D→ABC
  - AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB
- If $|L| = k$, then there are $2^k - 2$ candidate association rules (ignoring $L \rightarrow \emptyset$ and $\emptyset \rightarrow L$)

### Efficient rule generation

- In general, confidence is not monotonic
  - c(ABC→D) can be larger or smaller than c(AB→D)
- But the confidence of rules generated from the same itemset is monotonic with respect to the number of items in the consequent
  - E.g., for $L$ = {A, B, C, D}
  - c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD)

### Pruning rules

[Figure: rule generation for the Apriori algorithm, shown as a lattice of rules. Once a low-confidence rule is found, the rules below it in the lattice are pruned.]

### Algorithm to generate rules

For each frequent itemset $L_k$:

- $R_m$ = confident rules with $m$-variable consequents
- $H_m$ = candidate rules with $m$-variable consequents
- $H_1$ = candidate rules with a single-variable consequent
- for ($m = 1$; $H_m \neq \emptyset$; $m$++)
  - If $k > m + 1$:
    - $H_{m+1}$ = candidate rules generated from $R_m$
    - $R_{m+1}$ = candidates in $H_{m+1}$ with min_confidence
- Return $\bigcup_m R_m$

### Complexity

- The Apriori bottleneck is candidate itemset generation
  - $10^4$ 1-itemsets will produce $10^7$ candidate 2-itemsets
  - To discover a frequent pattern of size 100 you need to generate $2^{100} \approx 10^{30}$ candidates
- Solutions
  - Use alternative algorithms to search the itemset lattice
  - Use alternative data representations
  - E.g., mine patterns without candidate generation using the FP-tree structure

### Evaluation

- Association rule algorithms usually return a very large number of rules
  - Many are uninteresting or redundant (e.g., ABC→D and AB→D may have the same support and confidence)
- How to quantify interestingness?
  - Objective: statistical measures
  - Subjective: unexpected and/or actionable patterns (requires domain knowledge)

### Objective measures

- Given a rule $X \rightarrow Y$, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for $X \rightarrow Y$:

| | $Y$ | $\bar{Y}$ | |
|---|---|---|---|
| $X$ | $f_{11}$ | $f_{10}$ | $f_{1+}$ |
| $\bar{X}$ | $f_{01}$ | $f_{00}$ | $f_{0+}$ |
| | $f_{+1}$ | $f_{+0}$ | $\lvert T \rvert$ |

- $f_{11}$: support of $X$ and $Y$
- $f_{10}$: support of $X$ and $\bar{Y}$
- $f_{01}$: support of $\bar{X}$ and $Y$
- $f_{00}$: support of $\bar{X}$ and $\bar{Y}$
- Used to define various measures: support, confidence, lift, Gini, J-measure, etc.

### Drawback of support

- Support suffers from the rare item problem (Liu et al., 1999)
  - Infrequent items that do not meet minimum support are ignored, which is problematic if rare items are important
  - E.g., rarely sold products that account for a large part of revenue or profit
- Support falls rapidly with itemset size, so a threshold on support favors short itemsets

### Drawback of confidence

| | Coffee | $\overline{\text{Coffee}}$ | |
|---|---|---|---|
| Tea | 15 | 5 | 20 |
| $\overline{\text{Tea}}$ | 75 | 5 | 80 |
| | 90 | 10 | 100 |

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
- Although the confidence is high, the rule is misleading: P(Coffee | $\overline{\text{Tea}}$) = 0.9375

### Statistical-based measures

Measures that take statistical dependence into account:

$$\text{Lift} = \frac{P(Y \mid X)}{P(Y)}$$

$$\text{Interest} = \frac{P(X, Y)}{P(X)\,P(Y)}$$

$$PS = P(X, Y) - P(X)\,P(Y)$$

$$\phi\text{-coefficient} = \frac{P(X, Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1 - P(X)]\,P(Y)\,[1 - P(Y)]}}$$

### Lift example

For the Tea → Coffee table above:

- Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
- Lift = 0.75 / 0.9 = 0.8333, which is less than 1, so Tea and Coffee are negatively associated

### Unexpectedness

- Model the expectation of users (based on domain knowledge)
- Combine it with evidence from the data (i.e., the extracted patterns)

[Figure: each pattern is either expected to be frequent (+) or expected to be infrequent (-), and is either found to be frequent or found to be infrequent. Patterns whose finding agrees with the expectation are expected patterns; patterns whose finding contradicts it are unexpected patterns.]

## Graph mining

### Subgraph mining

- Frequent subgraphs
  - A subgraph is frequent if its support in a given dataset is no less than the min_support threshold
- Applications
  - Mining biochemical structures
  - Program control flow analysis
  - Mining XML structures or web communities
  - Building blocks for graph classification, clustering, compression, and comparison

### Algorithms

- Incomplete beam search (SUBDUE)
- Inductive logic programming (WARMR)
- Graph theory-based approaches
  - Apriori-based approach
  - Pattern-growth approach

### Algorithmic properties

- Search order: breadth vs. depth
- Candidate generation: apriori vs. pattern growth
- Duplicate elimination: passive vs. active
- Discovery order: path → tree → graph

### Apriori-based approach

[Figure: $k$-edge frequent graphs G, G', G'' are joined (JOIN) to form $(k+1)$-edge candidate graphs G1, G2, ..., Gn.]

(Slide source: Mining and Searching Graphs in Graph Databases, November 3, 2007)

### Pattern growth method

[Figure: a $k$-edge graph G is extended one edge at a time into $(k+1)$-edge graphs G1, G2, ..., Gn and then into $(k+2)$-edge graphs. The same graph can be reached along different extension paths, which produces duplicate graphs.]

### Breadth-first search

Apriori-based, breadth-first search:

- Methodology: breadth-first search, joining two graphs
- Generate new graphs with one additional node or one additional edge
- AGM (Inokuchi et al., PKDD'00): generates new graphs with one more node
- FSG (Kuramochi and Karypis, ICDM'01): generates new graphs with one more edge

### Complexity

- Graphs have the Apriori property
  - If a graph is frequent, then all of its subgraphs are frequent
- An $n$-edge frequent graph may have $2^n$ subgraphs
  - Among 422 chemical compounds that are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns with min_support = 5%
- Subgraph isomorphism problem
  - It is not efficiently solvable in general (the problem is NP-complete), but graphs in bio/chemistry have labels that make this easier

### Announcements

- Homework 5 due Dec 6
- Next class: Anomaly detection
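The level-wise search for frequent itemsets described in the lecture, together with the self-join and pruning steps for candidate generation, can be sketched in Python as follows. This is a minimal illustration and not the lecture's own code: the function names `generate_candidates` and `apriori`, the four toy transactions, and the threshold `min_support=2` (an absolute count) are assumptions made for the example, and support is counted with the simple "compare every transaction to every candidate" approach.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Build C_k from L_{k-1}: self-join, then prune (hypothetical helper)."""
    prev = sorted(prev_frequent)      # itemsets as sorted tuples, in lexicographic order
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: the first k-2 items agree and the last item of p is smaller.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must itself be frequent.
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

def apriori(transactions, min_support):
    """Return {itemset tuple: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_support}   # L_1
    result = dict(frequent)
    k = 2
    while frequent:
        candidates = generate_candidates(frequent.keys(), k)          # C_k
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}  # L_k
        result.update(frequent)
        k += 1
    return result

# Assumed toy database, for illustration only.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(transactions, min_support=2)
for itemset in sorted(frequent, key=lambda s: (len(s), s)):
    print(itemset, frequent[itemset])
```

Calling `generate_candidates` on the lecture's set {abc, abd, acd, ace, bcd} with `k=4` joins abc with abd and acd with ace, and then drops acde because ade is missing, which leaves only abcd.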
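The second counting approach from the lecture (enumerate the k-itemsets of each transaction in lexicographic order and increment the matching candidates) might look like the sketch below. Note the substitution: a plain Python `dict` stands in for the hash tree of the third approach, since both only serve to look up a candidate quickly. The function name `count_support` and the sample inputs are assumptions for illustration.

```python
from itertools import combinations

def count_support(transactions, candidates, k):
    """Count how many transactions contain each candidate k-itemset.

    A dict keyed by sorted tuples replaces the hash tree (assumption:
    this is a simplification, not the structure from the lecture).
    """
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in transactions:
        # combinations() on a sorted transaction yields its k-subsets
        # in lexicographic order, each as a sorted tuple.
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts

# Illustrative inputs: one transaction and a few candidate 3-itemsets.
candidates = [(1, 4, 5), (1, 2, 5), (1, 3, 6), (3, 5, 6), (2, 3, 4)]
counts = count_support([[1, 2, 3, 5, 6]], candidates, k=3)
print({c: n for c, n in counts.items() if n > 0})
```

The cost per transaction depends on the number of its k-subsets rather than on the number of candidates, which is the point of enumerating subsets instead of scanning the candidate list.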
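The level-wise rule generation procedure from the lecture can be sketched for a single frequent itemset as follows. It relies on the stated monotonicity: if a rule with consequent h fails the confidence test, no rule from the same itemset whose consequent contains h can pass, so larger consequents are built only from confident smaller ones. The function name `generate_rules`, the `support` dictionary, and its counts are assumptions chosen for illustration.

```python
from itertools import combinations

def generate_rules(itemset, support, min_confidence):
    """Return (antecedent, consequent, confidence) triples for one frequent itemset.

    `support` maps sorted item tuples to support counts (assumed to be
    available from the frequent-itemset phase).
    """
    items = tuple(sorted(itemset))
    k = len(items)
    rules = []
    consequents = [(i,) for i in items]   # H_1: single-item consequents
    m = 1
    while consequents and m < k:          # m < k keeps the antecedent non-empty
        confident = []
        for h in consequents:
            antecedent = tuple(i for i in items if i not in h)
            conf = support[items] / support[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, h, conf))
                confident.append(h)       # consequents of the confident rules R_m
        confident.sort()
        confident_set = set(confident)
        # H_{m+1}: join confident consequents that share their first m-1 items,
        # keeping a candidate only if all of its m-subsets were confident.
        consequents = [
            p + (q[-1],)
            for i, p in enumerate(confident)
            for q in confident[i + 1:]
            if p[:-1] == q[:-1]
            and all(s in confident_set for s in combinations(p + (q[-1],), m))
        ]
        m += 1
    return rules

# Assumed support counts for the subsets of {2, 3, 5}.
support = {(2,): 3, (3,): 3, (5,): 3, (2, 3): 2, (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}
for antecedent, consequent, conf in generate_rules((2, 3, 5), support, 0.7):
    print(antecedent, "->", consequent, round(conf, 3))
```

With `min_confidence=0` nothing is pruned, so an itemset of size k yields all 2^k - 2 candidate rules.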
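The objective measures discussed in the lecture are all functions of the four cells of a 2x2 contingency table, which the sketch below makes explicit. The function name `measures` is hypothetical, and the counts passed in (15, 5, 75, 5) are assumed values chosen to be consistent with the Tea → Coffee figures quoted in the lecture (confidence 0.75, P(Coffee) = 0.9).

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Compute rule measures for X -> Y from contingency counts.

    f11: X and Y, f10: X and not Y, f01: not X and Y, f00: neither.
    """
    n = f11 + f10 + f01 + f00
    p_xy = f11 / n                    # P(X, Y), the support of the rule
    p_x = (f11 + f10) / n             # P(X)
    p_y = (f11 + f01) / n             # P(Y)
    confidence = p_xy / p_x           # P(Y | X)
    lift = confidence / p_y           # equals P(X, Y) / (P(X) P(Y))
    ps = p_xy - p_x * p_y             # PS measure
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return {"support": p_xy, "confidence": confidence,
            "lift": lift, "ps": ps, "phi": phi}

# Assumed counts for Tea -> Coffee, for illustration.
result = measures(f11=15, f10=5, f01=75, f00=5)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

A lift below 1, together with negative PS and φ values, signals a negative association even when the confidence of the rule looks high.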