34 Pages

assoc-rules1-3

Course: CS 345, Fall 2001
School: Stanford

Association Rules
Market Baskets, Frequent Itemsets, A-Priori Algorithm

The Market-Basket Model
A large set of items, e.g., things sold in a supermarket.
A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.

Support
Simplest question: find sets of items that appear "frequently" in the baskets.
Support for itemset I = the number of baskets containing all items in I.
Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.

Example: Frequent Itemsets
Items = {milk, coke, pepsi, beer, juice}.
Support threshold = 3 baskets.
B1 = {m, c, b}       B2 = {m, p, j}
B3 = {m, b}          B4 = {c, j}
B5 = {m, p, b}       B6 = {m, c, b, j}
B7 = {c, b, j}       B8 = {b, c}
Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}.

Applications (1)
Real market baskets: chain stores keep terabytes of information about what customers buy together.
Tells how typical customers navigate stores, lets them position tempting items.
Suggests tie-in "tricks," e.g., run a sale on diapers and raise the price of beer.
High support needed, or no $$'s.

Applications (2)
Baskets = sentences; items = words in those sentences.
Lets us find words that appear together unusually frequently, i.e., linked concepts.
Baskets = sentences; items = documents containing those sentences.
Items that appear together too often could represent plagiarism.

Applications (3)
Baskets = people; items = genes or blood-chemistry factors.
Has been used to detect combinations of genes that result in diabetes, e.g.
But requires extension: absence of an item needs to be observed as well as presence.

Many-Many Relationships
"Market baskets" is an abstraction that models any many-many relationship between two concepts: "items" and "baskets."
Items need not be "contained" in baskets.
The only distinction is that we count co-occurrences of items, not baskets.

Scale of Problem
WalMart sells 100,000 items and can store billions of baskets.
The Web has over 100,000,000 words and billions of pages.

Association Rules
If-then rules about the contents of baskets.
{i1, i2, ..., ik} → j means: "if a basket contains all of i1, ..., ik, then it is likely to contain j."
Confidence of this association rule is the probability of j given i1, ..., ik.

Example: Confidence
B1 = {m, c, b} (+)   B2 = {m, p, j}
B3 = {m, b} (-)      B4 = {c, j}
B5 = {m, p, b} (-)   B6 = {m, c, b, j} (+)
B7 = {c, b, j}       B8 = {b, c}
(+) marks baskets that contain {m, b} and also c; (-) marks baskets that contain {m, b} but not c.
An association rule: {m, b} → c.
Confidence = 2/4 = 50%.

Interest
The interest of an association rule X → Y is the absolute value of the amount by which the confidence differs from the probability of Y being in a given basket.

Example: Interest
B1 = {m, c, b}       B2 = {m, p, j}
B3 = {m, b}          B4 = {c, j}
B5 = {m, p, b}       B6 = {m, c, b, j}
B7 = {c, b, j}       B8 = {b, c}
For association rule {m, b} → c, item c appears in 5/8 of the baskets.
Interest = |2/4 - 5/8| = 1/8, which is not very interesting.

Relationships Among Measures
Rules with high support and confidence may be useful even if they are not "interesting."
We don't care if buying bread causes people to buy milk, or whether simply a lot of people buy both bread and milk.
But high interest suggests a cause that might be worth investigating.

Finding Association Rules
A typical question: "find all association rules with support ≥ s and confidence ≥ c."
Note: "support" of an association rule is the support of the set of items it mentions.
Hard part: finding the high-support (frequent) itemsets.
Checking the confidence of association rules involving those sets is relatively easy.

Computation Model
Typically, data is kept in a flat file rather than a database system.
Stored on disk.
Stored basket-by-basket.
Expand baskets into pairs, triples, etc. as you read baskets.
Use k nested loops to generate all sets of size k.

File Organization
[Figure: the file is a sequence of items, grouped into Basket 1, Basket 2, Basket 3, etc.]
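
The support, confidence, and interest definitions above can be checked against the eight example baskets with a short sketch. This is an illustration only, not part of the original slides: the function names (support, confidence, interest, frequent_itemsets) are made up here, and the brute-force enumeration over all itemsets is an assumption that is only sensible for a toy data set of this size.

```python
from itertools import combinations

# Baskets B1..B8 from the example (m=milk, c=coke, p=pepsi, b=beer, j=juice).
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets=BASKETS):
    """Number of baskets containing every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)

def confidence(lhs, rhs, baskets=BASKETS):
    """Fraction of the baskets containing `lhs` that also contain item `rhs`.

    Assumes `lhs` occurs in at least one basket (otherwise division by zero).
    """
    return support(set(lhs) | {rhs}, baskets) / support(lhs, baskets)

def interest(lhs, rhs, baskets=BASKETS):
    """|confidence - fraction of all baskets that contain `rhs`|."""
    return abs(confidence(lhs, rhs, baskets) - support({rhs}, baskets) / len(baskets))

def frequent_itemsets(s, baskets=BASKETS):
    """Brute force: every non-empty itemset with support at least s."""
    items = sorted(set().union(*baskets))
    return [
        set(combo)
        for k in range(1, len(items) + 1)
        for combo in combinations(items, k)
        if support(combo, baskets) >= s
    ]

print(support({"m", "b"}))          # 4
print(confidence({"m", "b"}, "c"))  # 0.5
print(interest({"m", "b"}, "c"))    # 0.125
print(len(frequent_itemsets(3)))    # 7
```

The brute-force enumeration touches every subset of the items, which is exactly what becomes infeasible at the scale described above.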
Computation Model (2)
The true cost of mining disk-resident data is usually the number of disk I/O's.
In practice, association-rule algorithms read the data in passes: all baskets are read in turn.
Thus, we measure the cost by the number of passes an algorithm takes.

Main-Memory Bottleneck
For many frequent-itemset algorithms, main memory is the critical resource.
As we read baskets, we need to count something, e.g., occurrences of pairs.
The number of different things we can count is limited by main memory.
Swapping counts in/out is a disaster.

Finding Frequent Pairs
The hardest problem often turns out to be finding the frequent pairs.
We'll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.

Naive Algorithm
Read file once, counting in main memory the occurrences of each pair.
From each basket of n items, generate its n(n-1)/2 pairs by two nested loops.
Fails if (#items)^2 exceeds main memory.
Remember: #items can be 100K (WalMart) or 10B (Web pages).

Details of Main-Memory Counting
Two approaches:
1. Count all pairs, using a triangular matrix.
2. Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c."
(1) requires only 4 bytes/pair. Note: assume integers are 4 bytes.
(2) requires 12 bytes, but only for those pairs with count > 0.
[Figure: Method (1) uses 4 bytes per pair; Method (2) uses 12 bytes per occurring pair.]

Triangular-Matrix Approach (1)
Number items 1, 2, ...
Requires table of size O(n).
Keep pairs in the order {1,2}, {1,3}, ..., {1,n}, {2,3}, {2,4}, ..., {2,n}, {3,4}, ..., {3,n}, ..., {n-1,n}.

Triangular-Matrix Approach (2)
Find pair {i, j} (with i < j) at the position (i-1)(n - i/2) + j - i.
Total number of pairs n(n-1)/2; total bytes about 2n^2.

Details of Approach #2
Total bytes used is about 12p, where p is the number of pairs that actually occur.
Beats triangular matrix if at most 1/3 of possible pairs actually occur.
May require extra space for retrieval structure, e.g., a hash table.
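
As a rough sketch of the naive algorithm combined with the triangular-matrix layout, the code below counts every pair in one pass. It is an assumption-laden illustration rather than the course's implementation: the names pair_index and count_pairs_naive are hypothetical, items are assumed to be already numbered 1..n, itertools.combinations stands in for the two nested loops, and the position formula is shifted by one so that it indexes a 0-based array.

```python
from array import array
from itertools import combinations

def pair_index(i, j, n):
    """0-based slot of pair {i, j}, with 1 <= i < j <= n.

    The 1-based position from the slides is (i-1)(n - i/2) + j - i.
    Written with integers: (i-1)(2n - i)/2 + j - i, where (i-1)(2n - i)
    is always even, so floor division is exact. Subtract 1 for 0-based.
    """
    return (i - 1) * (2 * n - i) // 2 + (j - i) - 1

def count_pairs_naive(baskets, n):
    """One pass over the baskets, counting all n(n-1)/2 possible pairs."""
    # array('i') stores C ints (typically 4 bytes each), one per pair.
    counts = array("i", [0]) * (n * (n - 1) // 2)
    for basket in baskets:
        # Sorting guarantees i < j for every generated pair.
        for i, j in combinations(sorted(basket), 2):
            counts[pair_index(i, j, n)] += 1
    return counts

# Example baskets with items renumbered: m=1, c=2, p=3, b=4, j=5.
baskets = [{1, 2, 4}, {1, 3, 5}, {1, 4}, {2, 5},
           {1, 3, 4}, {1, 2, 4, 5}, {2, 4, 5}, {2, 4}]
counts = count_pairs_naive(baskets, n=5)
print(counts[pair_index(1, 4, 5)])  # 4, the support of {m, b}
```

Every possible pair gets a slot whether or not it ever occurs, which is why this layout only pays off when a large fraction of the pairs actually appear.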
A-Priori Algorithm (1)
A two-pass approach called a-priori limits the need for main memory.
Key idea: monotonicity: if a set of items appears at least s times, so does every subset.
Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

A-Priori Algorithm (2)
Pass 1: Read baskets and count in main memory the occurrences of each item.
Requires only memory proportional to #items.
Pass 2: Read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to be frequent.
Requires memory proportional to square of frequent items only.

Picture of A-Priori
[Figure: main-memory use in each pass. Pass 1: item counts. Pass 2: frequent items and counts of candidate pairs.]

Detail for A-Priori
You can use the triangular matrix method with n = number of frequent items.
Saves space compared with storing triples.
Trick: number frequent items 1, 2, ... and keep a table relating new numbers to original item numbers.

Frequent Triples, Etc.
For each k, we construct two sets of k-tuples:
Ck = candidate k-tuples = those that might be frequent sets (support ≥ s) based on information from the pass for k-1.
Lk = the set of truly frequent k-tuples.

[Figure: C1 (all items) → Filter (count the items, first pass) → L1 → Construct → C2 (all pairs of items from L1) → Filter (count the pairs, second pass) → L2 → Construct (to be explained) → C3.]

A-Priori for All Frequent Itemsets
One pass for each k.
Needs room in main memory to count each candidate k-tuple.
For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory.

Frequent Itemsets (2)
C1 = all items.
L1 = those counted on first pass to be frequent.
C2 = pairs, both chosen from L1.
In general, Ck = k-tuples, each k-1 of which is in Lk-1.
Lk = members of Ck with support ≥ s.
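
The Ck / Lk construction above can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the course's code: the function name apriori is hypothetical, baskets are held in a list instead of being re-read from disk on each pass, and candidates are generated by simply testing every k-combination of the surviving items, which is simple but not efficient.

```python
from itertools import combinations

def apriori(baskets, s):
    """Return {itemset: support} for every itemset in at least s baskets."""
    baskets = [frozenset(b) for b in baskets]

    # Pass 1: C1 = all items; count them and keep the frequent ones as L1.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {iset: c for iset, c in counts.items() if c >= s}  # L1
    result = dict(frequent)

    k = 2
    while frequent:
        # Construct Ck: k-sets all of whose (k-1)-subsets are in L(k-1).
        items = sorted({item for iset in frequent for item in iset})
        candidates = {
            frozenset(combo)
            for combo in combinations(items, k)
            if all(frozenset(sub) in frequent
                   for sub in combinations(combo, k - 1))
        }
        # Pass k: count only the candidates, then filter to get Lk.
        counts = dict.fromkeys(candidates, 0)
        for basket in baskets:
            for cand in candidates:
                if cand <= basket:
                    counts[cand] += 1
        frequent = {iset: c for iset, c in counts.items() if c >= s}  # Lk
        result.update(frequent)
        k += 1
    return result

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
frequent = apriori(baskets, s=3)
print(len(frequent))  # 7
```

On the example baskets with s = 3, pepsi is dropped after the first pass, so no pair containing it is ever counted, and no triple survives candidate construction because at least one of its pairs is infrequent.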