1 What Is Data Mining?
Originally, "data mining" was a statistician's term for overusing data to draw invalid inferences. Bonferroni's theorem warns us that if there are too many possible conclusions to draw, some will be true for purely statistical reasons, with no physical validity.

Famous example: David Rhine, a "parapsychologist" at Duke in the 1950's, tested students for "extrasensory perception" by asking them to guess 10 cards, each red or black. He found that about 1/1000 of them guessed all 10 and, instead of realizing that this is what you'd expect from random guessing (10 correct red/black guesses have probability $2^{-10} = 1/1024$), declared them to have ESP. When he retested them, he found they did no better than average. His conclusion: telling people they have ESP causes them to lose it!

Our definition: "discovery of useful summaries of data."

1.1 Applications

Some examples of "successes":

1. Decision trees constructed from bank-loan histories to produce algorithms to decide whether to grant a loan.
2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.
3. "Diapers and beer." The observation that customers who buy diapers are more likely to buy beer than average allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips between them increased sales of all three items.
4. Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.
5. Comparison of the genotype of people with/without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining will become much more important as the human genome is constructed.

1.2 The Data-Mining Communities
As data mining has become recognized as a powerful tool, several different communities have laid claim to the subject:

1. Statistics.
2. AI, where it is called "machine learning."
3. Researchers in clustering algorithms.
4. Visualization researchers.
5. Databases. We'll be taking this approach, of course, concentrating on the challenges that appear when the data is large and the computations complex. In a sense, data mining can be thought of as algorithms for executing very complex queries on non-main-memory data.

1.3 Stages of the Data-Mining Process

1. Data gathering, e.g., data warehousing, Web crawling.
2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.
3. Feature extraction: obtaining only the interesting attributes of the data, e.g., "date acquired" is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery. This is the stage that is often thought of as "data mining," and is where we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results; not every discovered fact is useful, or even true! Judgement is necessary before following your software's conclusions.

2 Association Rules and Frequent Itemsets
The market-basket problem assumes we have some large number of items, e.g., "bread," "milk." Customers fill their market baskets with some subset of the items, and we get to know what items people buy together, even if we don't know who they are. Marketers use this information to position items, and control the way a typical customer traverses the store.

In addition to the marketing application, the same sort of question has the following uses:

1. Baskets = documents; items = words. Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.
2. Baskets = sentences, items = documents. Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.

2.1 Goals for Market-Basket Mining

1. Association rules are statements of the form $\{X_1, X_2, \ldots, X_n\} \Rightarrow Y$, meaning that if we find all of $X_1, X_2, \ldots, X_n$ in the market basket, then we have a good chance of finding $Y$. The probability of finding $Y$ for us to accept this rule is called the confidence of the rule. We normally would search only for rules that had confidence above a certain threshold. We may also ask that the confidence be significantly higher than it would be if items were placed at random into baskets. For example, we might find a rule like $\{\text{milk}, \text{butter}\} \Rightarrow \text{bread}$ simply because a lot of people buy bread. However, the beer/diapers story asserts that the rule $\{\text{diapers}\} \Rightarrow \text{beer}$ holds with confidence significantly greater than the fraction of baskets that contain beer.
2. Causality. Ideally, we would like to know that in an association rule the presence of $X_1, \ldots, X_n$ actually "causes" $Y$ to be bought. However, "causality" is an elusive concept. Nevertheless, for market-basket data, the following test suggests what causality means. If we lower the price of diapers and raise the price of beer, we can lure diaper buyers, who are more likely to pick up beer while in the store, thus covering our losses on the diapers.
That strategy works because "diapers causes beer." However, working it the other way round, running a sale on beer and raising the price of diapers, will not result in beer buyers buying diapers in any great numbers, and we lose money.
3. Frequent itemsets. In many (but not all) situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets. For example, we cannot run a good marketing strategy involving items that no one buys anyway. Thus, much data mining starts with the assumption that we only care about sets of items with high support; i.e., they appear together in many baskets. We then find association rules or causalities only involving a high-support set of items [i.e., $\{X_1, \ldots, X_n, Y\}$ must appear in at least a certain percent of the baskets, called the support threshold].

2.2 Framework for Frequent Itemset Mining

We use the term frequent itemset for "a set $S$ that appears in at least fraction $s$ of the baskets," where $s$ is some chosen constant, typically 0.01 or 1%.

We assume data is too large to fit in main memory. Either it is stored in an RDB, say as a relation Baskets(BID, item), or as a flat file of records of the form (BID, item1, item2, ..., itemn). When evaluating the running time of algorithms we:

Count the number of passes through the data. Since the principal cost is often the time it takes to read data from disk, the number of times we need to read each datum is often the best measure of running time of the algorithm.

There is a key principle, called monotonicity or the a-priori trick, that helps us find frequent itemsets:

If a set of items $S$ is frequent (i.e., appears in at least fraction $s$ of the baskets), then every subset of $S$ is also frequent.

To find frequent itemsets, we can:

1. Proceed levelwise, finding first the frequent items (sets of size 1), then the frequent pairs, the frequent triples, etc. In our discussion, we concentrate on finding frequent pairs because:
(a) Often, pairs are enough.
(b) In many data sets, the hardest part is finding the pairs; proceeding to higher levels takes less time than finding frequent pairs.
Levelwise algorithms use one pass per level.
2. Find all maximal frequent itemsets (i.e., sets $S$ such that no proper superset of $S$ is frequent) in one pass or a few passes.

2.3 The A-Priori Algorithm

This algorithm proceeds levelwise.

1. Given support threshold $s$, in the first pass we find the items that appear in at least fraction $s$ of the baskets. This set is called $L_1$, the frequent items. Presumably there is enough main memory to count occurrences of each item, since a typical store sells no more than 100,000 different items.
2. Pairs of items in $L_1$ become the candidate pairs $C_2$ for the second pass. We hope that the size of $C_2$ is not so large that there is not room for an integer count per candidate pair. The pairs in $C_2$ whose count reaches $s$ are the frequent pairs, $L_2$.
3. The candidate triples, $C_3$, are those sets $\{A, B, C\}$ such that all of $\{A, B\}$, $\{A, C\}$, and $\{B, C\}$ are in $L_2$. On the third pass, count the occurrences of triples in $C_3$; those with a count of at least $s$ are the frequent triples, $L_3$.
4. Proceed as far as you like (or until the sets become empty). $L_i$ is the frequent sets of size $i$; $C_{i+1}$ is the set of sets of size $i + 1$ such that each subset of size $i$ is in $L_i$.

2.4 Why A-Priori Helps

Consider the following SQL on a Baskets(BID, item) relation with $10^8$ tuples involving $10^7$ baskets of 10 items each; assume 100,000 different items (typical of Wal-Mart, e.g.).
    SELECT b1.item, b2.item, COUNT(*)
    FROM Baskets b1, Baskets b2
    WHERE b1.BID = b2.BID AND b1.item < b2.item
    GROUP BY b1.item, b2.item
    HAVING COUNT(*) >= s;

Note: $s$ is the support threshold, and the second term of the WHERE clause is to prevent pairs of items that are really one item, and to prevent pairs from appearing twice.

In the join Baskets $\bowtie$ Baskets, each basket contributes $\binom{10}{2} = 45$ pairs, so the join has $4.5 \times 10^8$ tuples.

A-priori "pushes the HAVING down the expression tree," causing us first to replace Baskets by the result of
    SELECT * FROM Baskets
    WHERE item IN (SELECT item FROM Baskets
                   GROUP BY item
                   HAVING COUNT(*) >= s);

If $s = 0.01$, then at most 1000 items' groups can pass the HAVING condition. Reason: there are $10^8$ item occurrences, and an item needs $0.01 \times 10^7 = 10^5$ of those to appear in 1% of the baskets, so at most $10^8 / 10^5 = 1000$ items can qualify.

Although 99% of the items are thrown away by a-priori, we should not assume the resulting Baskets relation has only $10^6$ tuples. In fact, all the tuples may be for the high-support items. However, in real situations, the shrinkage in Baskets is substantial, and the size of the join shrinks in proportion to the square of the shrinkage in Baskets.

2.5 Improvements to A-Priori

Two types:

1. Cut down the size of the candidate sets $C_i$ for $i \geq 2$. This option is important, even for finding frequent pairs, since the number of candidates must be sufficiently small that a count for each can fit in main memory.
2. Merge the attempts to find $L_1, L_2, L_3, \ldots$ into one or two passes, rather than a pass per level.

2.6 PCY Algorithm

Park, Chen, and Yu proposed using a hash table to determine on the first pass (while $L_1$ is being determined) that many pairs are not possibly frequent. The algorithm takes advantage of the fact that main memory is usually much bigger than the number of items. During the two passes to find $L_2$, the main memory is laid out as in Fig. 1.
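As a baseline for the PCY memory layout, the two A-Priori passes for frequent pairs (Section 2.3) can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: the baskets are a small in-memory list rather than a file on disk, the threshold is given as a basket count (`min_count`) rather than a fraction, and the function name `apriori_pairs` and the sample baskets are made up for the example.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(baskets, min_count):
    """Return (L1, L2): the frequent items and the frequent pairs with counts.

    `min_count` is the support threshold expressed as a number of baskets.
    """
    # Pass 1: count every item and keep those that meet the threshold (L1).
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= min_count}

    # Pass 2: count only the candidate pairs (C2), i.e. pairs of frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_count}
    return frequent_items, frequent_pairs


# Hypothetical toy data, not taken from the notes.
baskets = [
    ["bread", "milk", "beer"],
    ["bread", "milk"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["diapers", "beer"],
]
L1, L2 = apriori_pairs(baskets, min_count=3)
print(sorted(L1))  # ['beer', 'bread', 'diapers', 'milk']
print(L2)          # {('bread', 'milk'): 3}
```

Sorting each basket before forming pairs plays the same role as the `b1.item < b2.item` condition in the SQL of Section 2.4: every pair is counted once, in a canonical order.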
[Figure 1: Two passes of the PCY algorithm. Pass 1: count items, hash table. Pass 2: frequent items, bitmap, counts for candidate pairs.]

Assume that data is stored as a flat file, with records consisting of a basket ID and a list of its items.

1. Pass 1:
(a) Count occurrences of all items.
(b) For each basket, consisting of items $\{i_1, \ldots, i_k\}$, hash all pairs to a bucket of the hash table, and increment the count of the bucket by 1.
(c) At the end of the pass, determine $L_1$, the items with counts at least $s$.
(d) Also at the end, determine those buckets with counts at least $s$. Key point: a pair $(i, j)$ cannot be frequent unless it hashes to a frequent bucket, so pairs that hash to other buckets need not be candidates in $C_2$. Replace the hash table by a bitmap, with one bit per bucket: 1 if the bucket was frequent, 0 if not.
2. Pass 2:
(a) Main memory holds a list of all the frequent items, i.e., $L_1$.
(b) Main memory also holds the bitmap summarizing the results of the hashing from pass 1. Key point: The buckets must use 16 or 32 bits for a count, but these are compressed to 1 bit. Thus, even if the hash table occupied almost the entire main memory on pass 1, its bitmap occupies no more than 1/16 of main memory on pass 2.
(c) Finally, main memory also holds a table with all the candidate pairs and their counts. A pair $(i, j)$ can be a candidate in $C_2$ only if all of the following are true:
i. $i$ is in $L_1$.
ii. $j$ is in $L_1$.
iii. $(i, j)$ hashes to a frequent bucket.
It is the last condition that distinguishes PCY from straight a-priori and reduces the requirements for memory in pass 2.
(d) During pass 2, we consider each basket, and each pair of its items, making the test outlined above. If a pair meets all three conditions, add to its count in memory, or create an entry for it if one does not yet exist.

When does PCY beat a-priori?
When there are too many pairs of items from $L_1$ to fit a table of candidate pairs and their counts in main memory, yet the number of frequent buckets in the PCY algorithm is sufficiently small that it reduces the size of $C_2$ below what can fit in memory (even with 1/16 of it given over to the bitmap).

When will most of the buckets be infrequent in PCY? When there are a few frequent pairs, but most pairs are so infrequent that even when the counts of all the pairs that hash to a given bucket are added, they still are unlikely to sum to $s$ or more.

2.7 The "Iceberg" Extensions to PCY

1. Multiple hash tables: share memory between two or more hash tables on pass 1, as in Fig. 2. On pass 2, a bitmap is stored for each hash table; note that the space needed for all these bitmaps is exactly the same as what is needed for the one bitmap in PCY, since the total number of buckets represented is the same. In order to be a candidate in $C_2$, a pair must:
(a) Consist of items from $L_1$, and
(b) Hash to a frequent bucket in every hash table.
[Figure 2: Multiple hash tables memory utilization. Pass 1: count items, hash table 1, hash table 2. Pass 2: frequent items, bitmaps, counts for candidate pairs.]

2. Iterated hash tables (Multistage): Instead of checking candidates in pass 2, we run another hash table (different hash function!) in pass 2, but we only hash those pairs that meet the test of PCY; i.e., they are both from $L_1$ and hashed to a frequent bucket on pass 1. On the third pass, we keep bitmaps from both hash tables, and treat a pair as a candidate in $C_2$ only if:
(a) Both items are in $L_1$.
(b) The pair hashed to a frequent bucket on pass 1.
(c) The pair also was hashed to a frequent bucket on pass 2.
Figure 3 suggests the use of memory. This scheme could be extended to more passes, but there is a limit, because eventually the memory becomes full of bitmaps, and we can't count any candidates.
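The PCY passes of Section 2.6 and the multistage variant just described might be sketched together as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the data is an in-memory list instead of a flat file on disk, the threshold is a basket count, and the names `bucket`, `pcy_pairs`, and `stages`, as well as the use of CRC32 with a per-stage salt as the "different hash function," are choices made for illustration. With `stages=1` the sketch is plain PCY; with `stages=2` it adds the second hashing pass of the multistage scheme.

```python
import zlib
from collections import Counter
from itertools import combinations


def bucket(pair, n_buckets, salt):
    """Hash a pair to a bucket; a different `salt` acts as a different hash function."""
    key = f"{salt}|{pair[0]}|{pair[1]}".encode()
    return zlib.crc32(key) % n_buckets


def pcy_pairs(baskets, min_count, n_buckets, stages=1):
    """Frequent pairs via PCY (stages=1) or multistage hashing (stages>1)."""
    baskets = [sorted(set(b)) for b in baskets]

    # Pass 1, part (a): count items to obtain L1. On disk this would share
    # a single pass with the first hashing stage; it is a separate loop here.
    item_counts = Counter(i for b in baskets for i in b)
    frequent_items = {i for i, c in item_counts.items() if c >= min_count}

    bitmaps = []  # one bitmap (list of bools) per completed hashing stage

    def passes_tests(pair):
        # Both items are frequent, and the pair fell into a frequent bucket
        # in every hashing stage completed so far.
        return (pair[0] in frequent_items and pair[1] in frequent_items
                and all(bm[bucket(pair, n_buckets, k)]
                        for k, bm in enumerate(bitmaps)))

    for stage in range(stages):
        counts = [0] * n_buckets
        for b in baskets:
            for pair in combinations(b, 2):
                # Stage 0 hashes every pair; later stages hash only survivors.
                if stage == 0 or passes_tests(pair):
                    counts[bucket(pair, n_buckets, stage)] += 1
        # Compress the bucket counts to one bit per bucket.
        bitmaps.append([c >= min_count for c in counts])

    # Final pass: keep exact counts for candidate pairs only.
    pair_counts = Counter()
    for b in baskets:
        pair_counts.update(p for p in combinations(b, 2) if passes_tests(p))
    return {p: c for p, c in pair_counts.items() if c >= min_count}


# Hypothetical toy data, not taken from the notes.
baskets = [
    ["bread", "milk", "beer"],
    ["bread", "milk"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["diapers", "beer"],
]
print(pcy_pairs(baskets, min_count=3, n_buckets=8))
print(pcy_pairs(baskets, min_count=3, n_buckets=8, stages=2))
```

Whatever hash function and bucket count are chosen, the final answer is the same set of frequent pairs, because a frequent pair always lands in a frequent bucket; hashing only changes how many infrequent pairs survive as candidates and therefore how much memory the final counting pass needs.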
[Figure 3: Multistage hash tables memory utilization. Pass 1: count items, hash table. Pass 2: frequent items, bitmap, another hash table. Pass 3: frequent items, two bitmaps, counts for candidate pairs.]

When do multiple hash tables help? When most buckets on the first pass of PCY have counts way below the threshold $s$. Then, we can double the counts in buckets and still have most buckets below threshold.

When does multistage help? When the number of frequent buckets on the first pass is high (e.g., 50%), but not all buckets. Then, a second hashing with some of the pairs ignored may reduce the number of frequent buckets significantly.

2.8 All Frequent Itemsets in Two Passes

The methods above are best when you only want frequent pairs, a common case. If we want all maximal frequent itemsets, including large sets, too many passes may be needed. There are several approaches to getting all frequent itemsets in two passes or fewer. They each rely on randomness of data in some way.

1. Simple approach: Take a main-memory-sized sample of the data. Run a levelwise algorithm in main memory (so you don't have to pay for disk I/O), and hope that the sample will give you the truly frequent sets. Note that you must scale the threshold $s$ back; e.g., if your sample is 1% of the data, use $s/100$ as your support threshold. You can make a complete pass through the data to verify that the frequent itemsets of the sample are truly frequent, but you will miss a set that is frequent in the whole data but not in the sample. To minimize false negatives, you can lower the threshold a bit in the sample, thus finding more candidates for the full pass through the data. Risk: you will have too many candidates to fit in main memory.
2. SON95 (Savasere, Omiecinski, and Navathe, from 1995 VLDB; referenced by Toivonen). Read subsets of the data into main memory, and apply the "simple approach" to discover candidate sets. Every basket is part of one such main-memory subset. On the second pass, a set is a candidate if it was identified as a candidate in any one or more of the subsets. Key point: A set cannot be frequent in the entire data unless it is frequent in at least one subset.
3. Toivonen's Algorithm:
(a) Take a sample that fits in main memory. Run the simple approach on this data, but with a threshold lowered so that we are unlikely to miss any truly frequent itemsets (e.g., if the sample is 1% of the data, use $s/125$ as the support threshold).
(b) Add to the candidates of the sample the negative border: those sets of items $S$ such that $S$ is not identified as frequent in the sample, but every immediate subset of $S$ is. For example, if $ABCD$ is not frequent in the sample, but all of $ABC$, $ABD$, $ACD$, and $BCD$ are frequent in the sample, then $ABCD$ is in the negative border.
(c) Make a pass over the data, counting all the candidate itemsets and the negative border. If no member of the negative border is frequent in the full data, then the frequent itemsets are exactly those candidates that are above threshold.
(d) Unfortunately, if there is a member of the negative border that turns out to be frequent, then we don't know whether some of its supersets are also frequent, so the whole process needs to be repeated (or we accept what we have and don't worry about a few false negatives).

3 Low-Support, High-Correlation Mining
We continue to assume a "market-basket" model for data, and we visualize the data as a boolean matrix, where rows = baskets and columns = items. Key assumptions:

1. Matrix is very sparse; almost all 0's.
2. The number of columns (items) is sufficiently small that we can store something per column in main memory, but sufficiently large that we cannot store something per pair of items in main memory (same assumption we've made in all association-rule work so far).
3. The number of rows is so large that we cannot store the entire matrix in memory, even if we take advantage of sparseness and compress (again, same assumption as always).
4. We are not interested in high-support pairs or sets of columns; rather we want highly correlated pairs of columns.

3.1 Applications

While marketing applications generally care only about high support (it doesn't pay to try to market things that nobody buys anyway), there are several applications that meet the model above, especially the point about pairs of columns/items with low support but high correlation being interesting:

1. Rows and columns are Web pages; $(r, c) = 1$ means that the page of row $r$ links to the page of column $c$. Similar columns may be pages about the same topic.
2. Same as (1), but the page of column $c$ links to the page of row $r$. Now, similar columns may represent mirror pages.
3. Rows = Web pages or documents; columns = words. Similar columns are words that appear almost always together, e.g., "phrases."
4. Same as (3), but rows are sentences. Similar columns may indicate mirror pages or plagiarisms.

3.2 Similarity
Think of a column as the set of rows in which the column has a 1. Then the similarity of two columns $C_1$ and $C_2$ is $Sim(C_1, C_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$.

Example 3.1: [Two-column 0/1 matrix garbled in extraction.] The two columns have $|C_1 \cap C_2| = 2$ and $|C_1 \cup C_2| = 5$, so they are 2/5 = 40% similar.

3.3 Signatures

Key idea: map ("hash") each column $C$ to a small amount of data [the signature, $Sig(C)$] such that:

1. $Sig(C)$ is small enough that a signature for each column can be fit in main memory.
2. Columns $C_1$ and $C_2$ are highly similar if and only if $Sig(C_1)$ and $Sig(C_2)$ are highly similar. (But note that we need to define "similarity" for signatures.)

An idea that doesn't work: Pick 100 rows at random, and make that string of 100 bits be the signature for each column. The reason is that the matrix is assumed sparse, so many columns will have an all-0 signature even if they are quite dissimilar.

Useful convention: given two columns $C_1$ an...
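The column-similarity measure of Section 3.2 is easy to compute when each column is held as the set of row numbers in which it has a 1. The Python sketch below is illustrative only: the function name `jaccard` and the two example columns are assumptions chosen so that the result matches the 2/5 figure quoted in Example 3.1, not a reconstruction of the example's actual matrix.

```python
def jaccard(c1, c2):
    """Similarity of two columns given as sets of row indices: |C1 & C2| / |C1 | C2|."""
    union = c1 | c2
    if not union:
        # Two all-zero columns: similarity is undefined; 0.0 is used by convention here.
        return 0.0
    return len(c1 & c2) / len(union)


# Hypothetical columns: rows where each column has a 1.
c1 = {2, 3, 5}
c2 = {1, 3, 5, 6}
print(jaccard(c1, c2))  # 0.4, i.e. 2 shared rows out of 5 rows with any 1
```

Storing only the rows that contain a 1 is what makes the sparse-matrix assumption pay off: the cost of comparing two columns depends on the number of 1's, not on the number of rows.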