the A-Priori Algorithm, one pass is taken for each set-size k. If no frequent itemsets of a certain size are
found, then monotonicity tells us there can be no larger frequent itemsets, so we can stop. The pattern
of moving from one size k to the next size
2. There are one million items, represented by the integers 0, 1, . . . , 999999. All items are frequent; that is,
they occur at least 10,000 times.
they occur at least 10,000 times.
3. There are one million pairs that occur 10,000 times or more.
4. There are P pairs that occur exactly once.
5. N
which has a number of important properties useful for clustering. In particular, a Euclidean space's points
are vectors of real numbers. The length of the vector is the number of dimensions of the space. The
components of the vector are commonly called coordinates.
! Exercise 7.1.2: If you choose two points uniformly in the unit square, what is their expected Euclidean
distance?
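The exercise asks for an exact answer, but a quick Monte Carlo check is easy to write and useful for validating any derivation. The function name and sample size below are our own choices, not anything from the text:

```python
import math
import random

def mc_expected_distance(trials=200_000, seed=42):
    """Estimate the expected Euclidean distance between two points
    drawn uniformly in the unit square, by simulation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x1, y1 = rng.random(), rng.random()
        x2, y2 = rng.random(), rng.random()
        total += math.hypot(x2 - x1, y2 - y1)
    return total / trials

print(mc_expected_distance())  # simulation lands near 0.5214, the known closed-form mean
```

With 200,000 trials the standard error is well under 0.001, so the estimate is a reliable check.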
! Exercise 7.1.3: Suppose we have a d-dimensional Euclidean space. Consider vectors whose components
are only +1 or −1 in each dimension. No
Exercise 6.4.2: Apply Toivonen's Algorithm to the data of Exercise 6.3.1, with a support threshold of 4.
Take as the sample the first row of baskets: {1,2,3}, {2,3,4}, {3,4,5}, and {4,5,6}, i.e., one-third of the file.
Our scaled-down support threshold w
are some efficiencies we can make by careful implementation. When the space is non-Euclidean, there
are additional problems associated with hierarchical clustering. We therefore consider clustroids and
the way we can represent a cluster when there is no centroid
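As a concrete illustration of the clustroid idea — a real point of the cluster standing in for the nonexistent centroid — here is a minimal sketch. The helper names and the toy Hamming metric are ours, and minimizing the sum of distances is only one common criterion (sum of squares or maximum distance are alternatives):

```python
def clustroid(points, dist):
    """Pick the cluster member minimizing the sum of distances
    to all the other members (one common clustroid criterion)."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

def hamming(a, b):
    """Toy non-Euclidean distance on equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

print(clustroid(["abc", "abd", "xbd"], hamming))  # "abd" is nearest to both others
```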
Exercise 6.3.3: Suppose we run the Multihash Algorithm on the data of Exercise 6.3.1. We shall use two
hash tables with five buckets each. For one, the set {i, j} is hashed to bucket 2i+3j+4 mod 5, and for the
other, the set is hashed to i+4j mod 5. Since
1. Distances are always nonnegative, and only the distance between a point and itself is 0.
2. Distance is symmetric; it doesn't matter in which order you consider the points when computing their
distance.
3. Distance measures obey the triangle inequality;
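These three axioms are easy to spot-check mechanically for the ordinary Euclidean distance. The sketch below is only a sanity check on a few sample points, not a proof:

```python
import itertools
import math

def euclid(p, q):
    """Ordinary Euclidean distance (math.dist, Python 3.8+)."""
    return math.dist(p, q)

points = [(0.0, 0.0), (3.0, 4.0), (1.0, -2.0)]

# 1. Nonnegativity, with zero only for a point and itself.
for p in points:
    assert euclid(p, p) == 0.0
for p, q in itertools.combinations(points, 2):
    assert euclid(p, q) > 0.0
    # 2. Symmetry: order of arguments does not matter.
    assert euclid(p, q) == euclid(q, p)

# 3. Triangle inequality, on every ordered triple.
for p, q, r in itertools.permutations(points, 3):
    assert euclid(p, r) <= euclid(p, q) + euclid(q, r) + 1e-12
print("all three axioms hold on this sample")
```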
within the current basket, and all its immediate proper subsets already are being counted. As the
window is decaying, we multiply all counts by (1 − c) and eliminate those that are less than 1/2.
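A minimal sketch of this decaying-window update, restricted here to counting single items (the function name and the tiny example stream are ours; extending to itemsets would also require, as the text says, that all immediate proper subsets already be counted):

```python
def process_basket(counts, basket, c):
    """One decaying-window step: multiply every existing count by
    (1 - c), drop any count that falls below 1/2, then add 1 for
    each item in the newly arrived basket.  c is the small decay
    constant."""
    decayed = {k: v * (1 - c) for k, v in counts.items() if v * (1 - c) >= 0.5}
    for item in basket:
        decayed[item] = decayed.get(item, 0.0) + 1.0
    return decayed

counts = {}
for basket in [{"a", "b"}, {"a"}, {"a", "c"}]:
    counts = process_basket(counts, basket, c=1e-6)
print(counts)  # "a" has the largest decayed count, just under 3
```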
6.7 References for Chapter 6
The market-basket data model, inclu
be one of the points of the cluster, but that situation is coincidental. The state of the clusters is shown in
Fig. 7.4. Now, there are several pairs of centroids that are at distance √5, and these are the closest
centroids. We show in Fig. 7.5 the result
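The merge step being described — combine the pair of clusters whose centroids are nearest — might be sketched as follows. All names here are illustrative, and the three sample points are not those of Fig. 7.4:

```python
import itertools
import math

def centroid(cluster):
    """Component-wise average of the cluster's points."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def merge_closest(clusters):
    """One step of centroid-based hierarchical clustering:
    merge the pair of clusters whose centroids are nearest."""
    (i, a), (j, b) = min(
        itertools.combinations(enumerate(clusters), 2),
        key=lambda pair: math.dist(centroid(pair[0][1]), centroid(pair[1][1])),
    )
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [a + b]

clusters = [[(4.0, 10.0)], [(4.0, 8.0)], [(12.0, 3.0)]]
clusters = merge_closest(clusters)
print(clusters)  # the two nearby points merge into one cluster
```

Repeating `merge_closest` until some stopping criterion holds gives the full agglomerative algorithm; a real implementation would cache centroids rather than recompute them each step.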
[Figure: Pass 2 of Multistage holds Bitmap 1 and the second hash table with counts for buckets; Pass 3 holds Bitmap 1, Bitmap 2, and the data structure for counts of pairs.]
Figure 6.6: The Multistage Algorithm uses additional hash tables to reduce the number of candidate
pairs
The first pass of Multistage is the same
6.3.4 Exercises for Section 6.3
Exercise 6.3.1: Here is a collection of twelve baskets. Each contains three of the six items 1 through 6.
{1,2,3} {2,3,4} {3,4,5} {4,5,6} {1,3,5} {2,4,6} {1,3,4} {2,4,5} {3,5,6} {1,2,4} {2,3,5} {3,4,6}
5. A. Savasere, E. Omiecinski, and S.B. Navathe, An efficient algorithm for mining association rules in
large databases, Intl. Conf. on Very Large Databases, pp. 432–444, 1995.
6. H. Toivonen, Sampling large databases for association rules
[Figure: Pass 1 holds the table translating item names to integers, the array of item counts, and the hash table with counts for buckets; Pass 2 holds the item-name table, the frequent-items table, the bitmap, and the data structure for counts of pairs.]
Figure 6.5: Organization of main memory for the first two passes of the PCY Algorithm
we
2. Find Lk by making a pass through the baskets and counting all and only the itemsets of size k that are
in Ck. Those itemsets that have count at least s are in Lk.
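The two steps can be sketched together as one function, assuming the frequent (k−1)-itemsets are already in hand as sorted tuples. The function name and the toy data below are our own illustration:

```python
from itertools import combinations

def apriori_pass(baskets, prev_frequent, k, s):
    """One A-Priori pass: build candidates C_k whose every (k-1)-subset
    is frequent (monotonicity), count only those candidates in a pass
    over the baskets, and keep those with support at least s."""
    items = sorted({i for f in prev_frequent for i in f})
    ck = [
        c for c in combinations(items, k)
        if all(sub in prev_frequent for sub in combinations(c, k - 1))
    ]
    counts = {c: 0 for c in ck}
    for b in baskets:
        bs = set(b)
        for c in counts:
            if bs.issuperset(c):
                counts[c] += 1
    return {c for c, n in counts.items() if n >= s}

baskets = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {2, 3, 4}]
l1 = {(1,), (2,), (3,), (4,)}  # assume the singletons were counted on pass 1
print(sorted(apriori_pass(baskets, l1, 2, 2)))  # [(1, 2), (2, 3), (2, 4)]
```

A real implementation would not hold all of C_k as a dict when memory is tight — that constraint is exactly what motivates PCY and its refinements below.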
6.2.7 Exercises for Section 6.2
Exercise 6.2.1: If we use a triangular matrix to count pairs
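For reference when working this exercise: the triangular-matrix method stores the count of pair {i, j}, with 1 ≤ i < j ≤ n, at position k = (i − 1)(n − i/2) + j − i of a one-dimensional array. A short check (our own code) that this formula really is a bijection onto 1, . . . , n(n − 1)/2:

```python
def pair_index(i, j, n):
    """Index of pair {i, j}, 1 <= i < j <= n, in the one-dimensional
    triangular array; integer form of k = (i-1)(n - i/2) + j - i.
    The product (i-1)(2n-i) is always even, so // is exact."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + j - i

n = 6
seen = [pair_index(i, j, n) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
print(seen == list(range(1, n * (n - 1) // 2 + 1)))  # True: a bijection onto 1..15
```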
and/or where the space either is high-dimensional or is not Euclidean at all. We shall
therefore discuss several algorithms that assume the data does not fit in main memory. However, we
begin with the basics: the two general approaches to clustering
hand, the total number of pairs among all the baskets is 10^7 × C(10, 2) = 4.5 × 10^8. Even in the extreme case
that every pair of items appeared only once, there could be only 4.5 × 10^8 pairs with
[Footnote 1] Here, and throughout the chapter, we shall use the approximation that
! Exercise 6.2.4: How would you count all itemsets of size 3 by a generalization of the triangular-matrix
method? That is, arrange that in a one-dimensional array there is exactly one element for each set of
three items.
! Exercise 6.2.5: Suppose the supp
between all but a vanishingly small fraction of the pairs of points. However, the maximum distance
between two points is √d, and one can argue that all but a vanishingly small fraction of the pairs do not
have a distance close to this upper limit. In fact,
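One can watch this concentration of distances happen experimentally. In the sketch below (function name and sample sizes are our own choices), the relative spread of pairwise distances among random points in the unit d-cube shrinks as d grows, while the mean stays far below the √d maximum:

```python
import math
import random

def distance_stats(d, npoints=100, seed=0):
    """Pairwise Euclidean distances among random points in the
    d-dimensional unit cube; returns (mean, stddev/mean)."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(npoints)]
    dists = [
        math.dist(p, q)
        for k, p in enumerate(pts) for q in pts[k + 1:]
    ]
    mean = sum(dists) / len(dists)
    var = sum((x - mean) ** 2 for x in dists) / len(dists)
    return mean, math.sqrt(var) / mean

for d in (2, 10, 1000):
    mean, rel_spread = distance_stats(d)
    print(d, round(mean, 2), round(rel_spread, 3))  # relative spread shrinks with d
```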
4. There are one million pairs that occur 10,000 times or more.
5. There are 2M pairs that occur exactly once. Of these pairs, M consist of two frequent items; the other
M each have at least one nonfrequent item.
6. No other pairs occur at all.
7. Integers are always represented by 4 bytes.
Example 7.1: Classical applications of clustering often involve low-dimensional Euclidean spaces. For
example, Fig. 7.1 shows height and weight measurements of dogs of several varieties. Without knowing
which dog is of which variety, we can see just by looking
Sometimes, we can get most of the benefit of the extra passes of the Multistage Algorithm in a single
pass. This variation of PCY is called the Multihash Algorithm. Instead of using two different hash tables
on two successive passes, use two hash functions and
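A single-pass sketch of the idea follows. The two hash functions, the table size, and all names here are invented for illustration; as in PCY, item counting shares the first pass, and a pair survives to the second pass only if both of its buckets are frequent:

```python
from itertools import combinations

NBUCKETS = 100003  # size of each of the two in-memory hash tables (illustrative)

def h1(i, j):
    return (i * 31 + j) % NBUCKETS

def h2(i, j):
    return (i * 17 + j * 7) % NBUCKETS

def multihash_pass1(baskets):
    """First Multihash pass: count items and, at the same time,
    hash every pair once with each of the two hash functions."""
    item_counts = {}
    t1 = [0] * NBUCKETS
    t2 = [0] * NBUCKETS
    for b in baskets:
        for i in b:
            item_counts[i] = item_counts.get(i, 0) + 1
        for i, j in combinations(sorted(b), 2):
            t1[h1(i, j)] += 1
            t2[h2(i, j)] += 1
    return item_counts, t1, t2

def candidate_pairs(baskets, s):
    """A pair is a candidate for the counting pass only if both items
    are frequent AND it hashes to a frequent bucket in BOTH tables."""
    item_counts, t1, t2 = multihash_pass1(baskets)
    freq = {i for i, n in item_counts.items() if n >= s}
    cands = set()
    for b in baskets:
        for i, j in combinations(sorted(b), 2):
            if (i in freq and j in freq
                    and t1[h1(i, j)] >= s and t2[h2(i, j)] >= s):
                cands.add((i, j))
    return cands

print(sorted(candidate_pairs([{1, 2}, {1, 2}, {1, 3}], 2)))  # [(1, 2)]
```

In a real implementation the two tables would each be half the size a single PCY table could be, which is the trade-off Example 6.10 analyzes.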
[Figure: Pass 1 holds Hash Table 1 and Hash Table 2; Pass 2 holds Bitmap 1, Bitmap 2, and the data structure for counts of pairs.]
Figure 6.7: The Multihash Algorithm uses several hash tables in one pass
Example 6.10: Suppose that if we run PCY, the average bucket will have a count s/10, wher
6. No other pairs occur at all.
7. Integers are always represented by 4 bytes.
8. When we hash pairs, they distribute among buckets randomly, but as evenly as possible; i.e., you may
assume that each bucket gets exactly its fair share of the P pairs that
1. No member of the negative border is frequent in the whole dataset. In this case, the correct set of
frequent itemsets is exactly those itemsets from the sample that were found to be frequent in the
whole.
2. Some member of the negative border is freque
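The negative border itself — itemsets that are not frequent in the sample but all of whose immediate proper subsets are — can be computed directly. The sketch below uses our own names and a toy example; singletons qualify vacuously when they are not sample-frequent:

```python
from itertools import combinations

def negative_border(sample_frequent, items):
    """Itemsets NOT frequent in the sample whose every immediate
    proper subset IS frequent.  sample_frequent is a set of sorted
    tuples; a non-frequent singleton is always in the border."""
    border = {(i,) for i in items if (i,) not in sample_frequent}
    sizes = {len(f) for f in sample_frequent}
    for k in sizes:
        for cand in combinations(sorted(items), k + 1):
            if cand in sample_frequent:
                continue
            if all(sub in sample_frequent for sub in combinations(cand, k)):
                border.add(cand)
    return border

# Toy illustration: items 1..3; suppose the sample made {1}, {2}, {1,2} frequent.
sf = {(1,), (2,), (1, 2)}
print(negative_border(sf, [1, 2, 3]))  # {(3,)}
```

With the border in hand, the decision rule above is mechanical: count the sample's frequent itemsets and the border in the full dataset; if no border member turns out frequent, report the sample's answer as exact, otherwise repeat with a new sample or a lower threshold.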