Lect29-Associat2 (1)

Lect29-Associat2 (1) - DATA MINING Susan Holmes © Stats202...


DATA MINING
Susan Holmes © Stats202
Lecture 29, Fall 2010

Special Announcements

All requests should be sent to [email protected]

The Kaggle competition is under way at http://inclass.kaggle.com/stat202, with 12 teams competing.

This week is the last active week, with office hours and responses to email. During dead week there is no email and there are no office hours, so please plan accordingly.

Last time

Mining association data, based on the dependence of categorical variables.

ASSOCIATION RULES

A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (itemsets for short) X and Y are called the antecedent (left-hand side, LHS) and the consequent (right-hand side, RHS) of the rule, respectively.

The support supp(X) of an itemset X is defined as the proportion of transactions in the data set that contain the itemset. In the example database, the itemset {diapers, milk, bread} has a support of 3/6 = 0.5, since it occurs in 50% of all transactions (3 out of 6 transactions).

The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y)/supp(X). For example, the rule {milk, bread} ⇒ {diapers} has a confidence of (3/6)/(4/6) = 0.75 in the database, which means that the rule is correct for 75% of the transactions containing milk and bread.

ASSOCIATION RULES: EVALUATION

Conditional probability: $P(Y \mid X) = \frac{P(Y \cap X)}{P(X)} = \mathrm{confidence}(X \Rightarrow Y)$

Monotonicity: if X ⊂ Y then s(X) ≥ s(Y)

Lift: $\frac{P(Y \mid X)}{P(Y)} = \frac{P(Y \cap X)}{P(X)\,P(Y)} = \mathrm{lift}(X \Rightarrow Y)$

$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{Conf}}{\mathrm{Expected}_0(\mathrm{Conf})}$, where $\mathrm{Expected}_0(\mathrm{Conf}) = P(Y)$ is the confidence expected when X and Y are independent.

Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split into two steps:

1. First, minimum support is applied to find all frequent itemsets in a database.
2. Second, these frequent itemsets and the minimum confidence constraint are used to form rules.
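The definitions and the two-step procedure above can be tried by hand on the six-transaction example database. The following is a minimal base-R sketch, not the Apriori implementation used by arules: it enumerates candidate itemsets by brute force, and the names `trans`, `supp`, `frequent_itemsets` and `make_rules` are hypothetical helpers introduced only for this illustration.

```r
# Six transactions of the example database (illustrative copy of the data).
trans <- list(
  c("beer", "diapers", "bread", "milk"),
  c("diapers", "bread", "milk"),
  c("diapers", "soda"),
  c("beer", "bread", "milk", "tuna"),
  c("bread", "butter", "tuna"),
  c("diapers", "bread", "milk", "butter", "tuna")
)

# supp(X): proportion of transactions that contain every item of X.
supp <- function(x) mean(sapply(trans, function(tr) all(x %in% tr)))

# Worked numbers for the rule {milk, bread} => {diapers}.
supp_xy <- supp(c("milk", "bread", "diapers"))  # 3/6
conf    <- supp_xy / supp(c("milk", "bread"))   # (3/6)/(4/6)
lift    <- conf / supp("diapers")               # confidence divided by P(Y)

# Step 1 (brute force): keep every itemset whose support reaches minsup.
frequent_itemsets <- function(minsup) {
  items <- sort(unique(unlist(trans)))
  out <- list()
  for (k in seq_along(items)) {
    cand <- combn(items, k, simplify = FALSE)
    keep <- Filter(function(s) supp(s) >= minsup, cand)
    # Monotonicity: if no k-itemset is frequent, no larger itemset can be.
    if (length(keep) == 0) break
    out <- c(out, keep)
  }
  out
}

# Step 2: split each frequent itemset into LHS => single-item RHS and
# keep the rules whose confidence reaches minconf.
make_rules <- function(freq, minconf) {
  rules <- character(0)
  for (s in freq) {
    if (length(s) < 2) next
    for (rhs in s) {
      lhs <- setdiff(s, rhs)
      if (supp(s) / supp(lhs) >= minconf) {
        rules <- c(rules, paste0("{", paste(lhs, collapse = ", "),
                                 "} => {", rhs, "}"))
      }
    }
  }
  rules
}

freq  <- frequent_itemsets(0.5)
rules <- make_rules(freq, 0.9)
print(c(support = supp_xy, confidence = conf, lift = lift))
print(rules)
```

With a minimum support of 0.5 and a minimum confidence of 0.9 the sketch keeps four rules, the same settings and rule count as the apriori() run on shop.trans in this lecture. Brute-force enumeration is only workable for a handful of items, which is why the first step needs a dedicated algorithm in practice.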
While the second step is straightforward, the first step needs more attention.

ID  beer  diapers  bread  milk  butter  soda  tuna
1   1     1        1      1     0       0     0
2   0     1        1      1     0       0     0
3   0     1        0      0     0       1     0
4   1     0        1      1     0       0     1
5   0     0        1      0     1       0     1
6   0     1        1      1     1       0     1

Here the items are I = {beer, diapers, bread, milk, butter, soda, tuna}.

> library(arules)   # provides the "transactions" class, apriori() and inspect()
> shopping.list=list(c("beer","diapers","bread","milk"),
+   c("diapers","bread","milk"),
+   c("diapers","soda"),
+   c("beer","bread","milk","tuna"),
+   c("bread","butter","tuna"),
+   c("diapers","bread","milk","butter","tuna"))
> shop.trans <- as(shopping.list, "transactions")
> summary(shop.trans)
transactions as itemMatrix in sparse format with
 6 rows (elements/itemsets/transactions) and
 7 columns (items) and a density of 0.5

most frequent items:
  bread diapers    milk    tuna    beer (Other)
      5       4       4       3       2       3

element (itemset/transaction) length distribution:
sizes
2 3 4 5
1 2 2 1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    2.0     3.0     3.5     3.5     4.0     5.0

includes extended item information - examples:
  labels
1   beer
2  bread
3 butter

includes extended transaction information - examples:
  transactionID
1           Tr1
2           Tr2
3           Tr3

> as(shop.matrix,"matrix")
    beer bread butter diapers milk soda tuna
Tr1    1     1      0       1    1    0    0
Tr2    0     1      0       1    1    0    0
Tr3    0     0      0       1    0    1    0
Tr4    1     1      0       0    1    0    1
Tr5    0     1      1       0    0    0    1

> shop.rules=apriori(shop.trans,parameter = list(supp = 0.5, conf = 0.9,
+   target = "rules"))

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target
        0.9    0.1    1 none FALSE            TRUE     0.5      1      5  rules

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[7 item(s), 6 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [4 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

> summary(shop.rules)
set of 4 rules

rule length distribution (lhs + rhs):sizes
2 3
2 2

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    2.0     2.0     2.5     2.5     3.0     3.0

summary of quality measures:
    support         confidence      lift
 Min.   :0.5000   Min.   :1     Min.   :1.200
 1st Qu.:0.5000   1st Qu.:1     1st Qu.:1.200
 Median :0.5000   Median :1     Median :1.200
 Mean   :0.5417   Mean   :1     Mean   :1.275
 3rd Qu.:0.5417   3rd Qu.:1     3rd Qu.:1.275
 Max.   :0.6667   Max.   :1     Max.   :1.500

mining info:
       data ntransactions support confidence
 shop.trans             6     0.5        0.9

> inspect(shop.rules)
  lhs                 rhs      support   confidence lift
1 {tuna}           => {bread}  0.5000000 1          1.2
2 {milk}           => {bread}  0.6666667 1          1.2
3 {diapers, milk}  => {bread}  0.5000000 1          1.2
4 {bread, diapers} => {milk}   0.5000000 1          1.5

#####UCI data on house votes#######################
> library(ade4)   # provides acm.disjonctif() and dudi.acm()
> # stringsAsFactors=TRUE keeps the columns as factors under R >= 4.0
> votes=read.csv("house-votes-84.data",header=FALSE,na.strings="?",
+   stringsAsFactors=TRUE)

 1. Class Name: 2 (democrat, republican)
 2. handicapped-infants: 2 (y,n)
 3. water-project-cost-sharing: 2 (y,n)
 4. adoption-of-the-budget-resolution: 2 (y,n)
 5. physician-fee-freeze: 2 (y,n)
 6. el-salvador-aid: 2 (y,n)
 7. religious-groups-in-schools: 2 (y,n)
 8. anti-satellite-test-ban: 2 (y,n)
 9. aid-to-nicaraguan-contras: 2 (y,n)
10. mx-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)

> votes.disj=acm.disjonctif(votes)
> dim(votes.disj)
> votes.matrix=as(as.matrix(votes.disj), "itemMatrix")
> votes.trans=as(votes.matrix, "transactions")
> votes.rules <- apriori(votes.trans, parameter = list(supp = 0.5, conf = 0.8,
+   target = "rules"))
> inspect(votes.rules)
   lhs                    rhs            support   confidence lift
1  {V9.y}              => {V1.democrat}  0.5011494 0.9008264  1.467639
2  {V1.democrat}       => {V9.y}         0.5011494 0.8164794  1.467639
3  {V5.n}              => {V4.y}         0.5034483 0.8866397  1.524460
4  {V4.y}              => {V5.n}         0.5034483 0.8656126  1.524460
5  {V5.n}              => {V1.democrat}  0.5632184 0.9919028  1.616021
6  {V1.democrat}       => {V5.n}         0.5632184 0.9176030  1.616021
7  {V4.y}              => {V1.democrat}  0.5310345 0.9130435  1.487543
8  {V1.democrat}       => {V4.y}         0.5310345 0.8651685  1.487543
9  {V4.y, V5.n}        => {V1.democrat}  0.5034483 1.0000000  1.629213
10 {V1.democrat, V5.n} => {V4.y}         0.5034483 0.8938776  1.536904
11 {V1.democrat, V4.y} => {V5.n}         0.5034483 0.9480519  1.669646

Note:
 4. adoption-of-the-budget-resolution: 2 (y,n)
 5. physician-fee-freeze: 2 (y,n)
 9. aid-to-nicaraguan-contras: 2 (y,n)

Visualization of Categorical Data

MDS/PCA-type visualization for contingency tables: Multiple Correspondence Analysis (uses the chi-square distance between the rows of the contingency table).

> votes.acm=dudi.acm(votes)
Select the number of axes: 2
> s.label(votes.acm$co,boxes=FALSE)

[Figure: s.label plot of the category coordinates votes.acm$co on the first two axes, grid scale d = 0.5, with one label per category: V1.democrat, V1.republican, and V2.y/V2.n through V17.y/V17.n.]
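The chi-square distance mentioned for Multiple Correspondence Analysis can be illustrated on a small table. This is a sketch under stated assumptions: the counts in `N` are made up for the illustration and are not taken from the votes data, and `chisq_dist` is a hypothetical helper. It compares row profiles, weighting each column by the inverse of its overall mass.

```r
# Toy contingency table with made-up counts (illustration only).
N <- matrix(c(20, 10,  5,
              10, 20, 15,
               5, 10, 30),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("a", "b", "c")))

# Row profiles: each row divided by its own total.
row_profiles <- N / rowSums(N)

# Column masses: share of the grand total falling in each column.
col_mass <- colSums(N) / sum(N)

# Chi-square distance between rows i and j: Euclidean distance between
# the two row profiles, with each squared difference divided by the column mass.
chisq_dist <- function(i, j) {
  sqrt(sum((row_profiles[i, ] - row_profiles[j, ])^2 / col_mass))
}

# Full matrix of pairwise distances between the rows.
D <- outer(1:3, 1:3, Vectorize(chisq_dist))
print(round(D, 3))
```

Dividing by the column mass gives rare categories more weight than common ones, which is what distinguishes this distance from the plain Euclidean distance between profiles.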
[Figure: grid of panels, one per variable V1 to V17, with the points in each panel labelled by category: republican/democrat for V1 and y/n for V2 to V17.]
Example from the UCI data repository: census data on employment, demographics and education.

> data("AdultUCI")
> dim(AdultUCI)
[1] 48842    15
> AdultUCI[1:2, ]
  age        workclass fnlwgt education education-num     marital-status
1  39        State-gov  77516 Bachelors            13      Never-married
2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
       occupation  relationship  race  sex capital-gain capital-loss
1    Adm-clerical Not-in-family White Male         2174            0
2 Exec-managerial       Husband White Male            0            0
  hours-per-week native-country income
1             40  United-States  small
2             13  United-States  small
> AdultUCI[["fnlwgt"]] <- NULL
> AdultUCI[["education-num"]] <- NULL

We need to change the continuous variables to categorical:

> AdultUCI[["age"]]=ordered(cut(AdultUCI[["age"]],
+   c(15, 25, 45, 65, 100)),
+   labels = c("Young", "Middle-aged", "Senior", "Old"))
> AdultUCI[["hours-per-week"]]=ordered(cut(AdultUCI[["hours-per-week"]],
+   c(0, 25, 40, 60, 168)),
+   labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
> AdultUCI[["capital-gain"]]=ordered(cut(AdultUCI[["capital-gain"]],
+   c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] >
+   0]), Inf)), labels = c("None", "Low", "High"))
> AdultUCI[["capital-loss"]]=ordered(cut(AdultUCI[["capital-loss"]],
+   c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] >
+   0]), Inf)), labels = c("none", "low", "high"))

Transform the factor-based data frame into transaction data:

> Adult <- as(AdultUCI, "transactions")
> Adult
transactions in sparse format with
 48842 transactions (rows) and
 115 items (columns)
> summary(Adult)
transactions as itemMatrix in sparse format with
 48842 rows (elements/itemsets/transactions) and
 115 columns (items) and a density of 0.1089939

most frequent items:
           capital-loss=none            capital-gain=None
                       46560                        44807
native-country=United-States                   race=White
                       43832                        41762
           workclass=Private                      (Other)
                       33906                       401333

element (itemset/transaction) length distribution:
sizes
    9    10    11    12    13
   19   971  2067 15623 30162
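The discretisation of the continuous AdultUCI columns relies on cut() followed by ordered(). A minimal sketch on two small made-up vectors shows how the break points map values to labels; `ages` and `gains` are hypothetical data, not rows of AdultUCI.

```r
# Made-up ages, chosen to sit on and around the break points.
ages <- c(17, 25, 26, 45, 46, 65, 70, 90)

# cut() builds right-closed intervals (15,25], (25,45], (45,65], (65,100];
# ordered() then replaces the interval names with readable ordered labels.
age_group <- ordered(cut(ages, c(15, 25, 45, 65, 100)),
                     labels = c("Young", "Middle-aged", "Senior", "Old"))
print(data.frame(ages, age_group))

# Made-up capital gains: mostly zero, as in the census data.
gains <- c(0, 0, 0, 100, 500, 2000)

# Zero becomes "None"; positive values are split at the median of the
# positive values only, so "Low" and "High" are balanced among non-zero cases.
split_point <- median(gains[gains > 0])
gain_group <- ordered(cut(gains, c(-Inf, 0, split_point, Inf)),
                      labels = c("None", "Low", "High"))
print(data.frame(gains, gain_group))
```

Because the intervals are closed on the right, a value equal to a break point (25, 45, 65, or the median gain) falls into the lower category.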
The summary of the transaction data set shows the most frequent items and the length distribution of the transactions.

To see which items are important in the data set we can use itemFrequencyPlot(). To reduce the number of items, we only plot the item frequency for items with a support greater than 15% (using the parameter support).

> itemFrequencyPlot(Adult, support = 0.15, cex.names = 0.8)

[Figure: bar plot of item frequency (relative), axis from 0.0 to 0.8, with one bar per item above the support threshold: age=Young, age=Middle-aged, age=Senior, workclass=Private, education=HS-grad, education=Some-college, education=Bachelors, marital-status=Married-civ-spouse, marital-status=Never-married, relationship=Husband, relationship=Not-in-family, relationship=Own-child, race=White, sex=Female, sex=Male, capital-gain=None, capital-loss=none, hours-per-week=Full-time, hours-per-week=Over-time, native-country=United-States, income=small, income=large.]

> rules <- apriori(Adult, parameter = list(support = 0.05,
+   confidence = 0.6))

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen
        0.6    0.1    1 none FALSE            TRUE    0.05      1      5
 target   ext
  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[115 item(s), 48842 transaction(s)] done [
sorting and recoding items ... [36 item(s)] done [0.01s].
creating transaction tree ... done [0.06s].

> summary(rules)
set of 14783 rules

rule length distribution (lhs + rhs):sizes
   1    2    3    4    5
   6  224 1650 5021 7882

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    4.00    5.00    4.39    5.00    5.00

summary of quality measures:
    support           confidence          lift
 Min.   :0.05004   Min.   :0.6000   Min.   :0.8124
 1st Qu.:0.06238   1st Qu.:0.7695   1st Qu.:0.9983
 Median :0.08249   Median :0.8964   Median :1.0298
 Mean   :0.11208   Mean   :0.8544   Mean   :1.1860
 3rd Qu.:0.12714   3rd Qu.:0.9441   3rd Qu.:1.1132
 Max.   :0.95328   Max.   :1.0000   Max.   :4.2464

mining info:
  data ntransactions support confidence
 Adult         48842    0.05        0.6

> rulesIncomeSmall <- subset(rules, subset = rhs %in% "income=small" &
+   lift > 1.2)
> # SORT() is the spelling used on the slide; current arules releases provide sort()
> inspect(head(SORT(rulesIncomeSmall, by = "confidence"),n = 3))
  lhs                               rhs            support confidence lift
1 {age=Young,
   workclass=Private,
   relationship=Own-child,
   capital-gain=None}            => {income=small} 0.054   0.686      1.35
2 {age=Young,
   workclass=Private,
   relationship=Own-child,
   native-country=United-States} => {income=small} 0.0515  0.68
3 {age=Young,
   workclass=Private,
   relationship=Own-child}       => {income=small} 0.0552  0.683      1.35

> adultpart=AdultUCI[,c(3,5,8,11,12,13)]
> adultpart.trans=as(adultpart,"transactions")
> partrules=apriori(adultpart.trans,parameter = list(support = 0.05,
+   confidence = 0.6))

First rows of adultpart:

  education        occupation  sex hours-per-week native-country income
1 Bachelors      Adm-clerical Male      Full-time  United-States  small
2 Bachelors   Exec-managerial Male      Part-time  United-States  small
3   HS-grad Handlers-cleaners Male      Full-time  United-States  small
4      11th Handlers-cleaners Male      Full-time  United-States  small

> partrulesIncomeSmall <- subset(partrules, subset = rhs %in% "income=small" &
+   lift > 1.2)
> partrulesIncomeLarge <- subset(partrules, subset = rhs %in% "income=large" &
+   lift > 1.2)
> inspect(head(SORT(partrulesIncomeSmall, by = "confidence"),n = 3))
  lhs                               rhs            support confidence lift
1 {occupation=Other-service}     => {income=small} 0.0647  0.641      1.267
2 {occupation=Other-service,
   native-country=United-States} => {income=small} 0.0547  0.639      1.26
3 {education=Some-college,
   sex=Female}                   => {income=small} 0.0534  0.624      1.234

Measures of interestingness

For rules the following measures are implemented:

  Chi-square measure
  Conviction
  Confidence
  Difference of Confidence (DOC)
  Leverage (Piatetsky-Shapiro, 1991)
  Lift
  Improvement (Bayardo, Agrawal and Gunopulos, 2000)
  Support
  Several other measures (e.g., cosine, Gini index, ϕ-coefficient, odds ratio)
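Several of the listed measures follow directly from three supports. As a hedged illustration, the sketch below evaluates them for the rule {milk, bread} ⇒ {diapers} from the shopping example, using the standard textbook definitions of leverage and conviction. The variable names are chosen for this illustration only. In arules, measures of this kind are available through interestMeasure(), which is not used here so that the sketch runs in base R.

```r
# Supports for X = {milk, bread}, Y = {diapers} in the six-transaction example.
supp_x  <- 4 / 6   # transactions containing milk and bread
supp_y  <- 4 / 6   # transactions containing diapers
supp_xy <- 3 / 6   # transactions containing all three items

# Confidence: supp(X u Y) / supp(X).
confidence <- supp_xy / supp_x

# Lift: confidence relative to the confidence expected under independence, P(Y).
lift <- confidence / supp_y

# Leverage: observed joint support minus the joint support expected
# if X and Y were independent.
leverage <- supp_xy - supp_x * supp_y

# Conviction: (1 - supp(Y)) / (1 - confidence); infinite when confidence is 1.
conviction <- (1 - supp_y) / (1 - confidence)

print(round(c(confidence = confidence, lift = lift,
              leverage = leverage, conviction = conviction), 4))
```

A lift above 1, a positive leverage and a conviction above 1 all point the same way here: diapers appear with milk and bread somewhat more often than independence would predict.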

This note was uploaded on 07/29/2011 for the course STAT 202 at Stanford.
