1. No member of the negative border is frequent in the whole dataset. In this case, the correct set of
frequent itemsets is exactly those itemsets from the sample that were found to be frequent in the
[Diagram labels: Pass 2, Pass 3; Bitmap 1, Bitmap 2; second hash table for bucket counts; data structure for counts of pairs.]
Figure 6.6: The Multistage Algorithm uses additional hash tables to reduce the number of
be one of the points of the cluster, but that situation is coincidental. The state of the clusters is shown in
Fig. 7.4. Now, there are several pairs of centroids that are at distance $\sqrt{5}$, and these are
within the current basket, and all its immediate proper subsets already are being counted. As the
window is decaying, we multiply all counts by $1 - c$ and eliminate those that are less than 1/2.
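The decay step described above can be sketched as follows. This is a minimal illustration, not the text's own implementation: the function name `decay_counts`, the dictionary representation of the counts, and the sample values are all assumptions made for the example.

```python
def decay_counts(counts, c):
    """Scale every count by (1 - c), then drop counts that fall below 1/2.

    `counts` maps an itemset (here a frozenset) to its current decayed count.
    Both the name and the data layout are hypothetical.
    """
    decayed = {itemset: value * (1.0 - c) for itemset, value in counts.items()}
    # Keep only the itemsets whose decayed count is still at least 1/2.
    return {itemset: value for itemset, value in decayed.items() if value >= 0.5}


# Illustrative values only.
counts = {frozenset({"a"}): 3.0, frozenset({"a", "b"}): 0.5}
after = decay_counts(counts, c=0.1)
```

With these sample values, the count for {a} shrinks but survives, while the count for {a, b} drops below 1/2 and is removed.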
6.7 Refer
1. Distances are always nonnegative, and only the distance between a point and itself is 0.
2. Distance is symmetric; it doesn't matter in which order you consider the points when computing their
dista
Exercise 6.3.3: Suppose we run the Multihash Algorithm on the data of Exercise 6.3.1. We shall use two
hash tables with five buckets each. For one, the set {i, j} is hashed to bucket 2i+3j+4 mod 5, an
are some efficiencies we can make by careful implementation. When the space is non-Euclidean, there
are additional problems associated with hierarchical clustering. We therefore consider clustroids and
t
Exercise 6.4.2: Apply Toivonen's Algorithm to the data of Exercise 6.3.1, with a support threshold of 4.
Take as the sample the first row of baskets: {1,2,3}, {2,3,4}, {3,4,5}, and {4,5,6}, i.e.,
! Exercise 7.1.2: If you choose two points uniformly in the unit square, what is their expected Euclidean
distance?
! Exercise 7.1.3: Suppose we have a d-dimensional Euclidean space. Consider vectors
which has a number of important properties useful for clustering. In particular, a Euclidean space's points
are vectors of real numbers. The length of the vector is the number of dimensions of the spac
2. There are one million items, represented by the integers 0, 1, ..., 999999. All items are frequent; that is,
they occur at least 10,000 times.
3. There are one million pairs that occur 10,000 times or m
6.3.4 Exercises for Section 6.3
Exercise 6.3.1: Here is a collection of twelve baskets. Each contains three of the six items 1 through 6.
{1,2,3} {2,3,4} {3,4,5} {4,5,6} {1,3,5} {2,4,6} {
5. A. Savasere, E. Omiecinski, and S.B. Navathe, "An efficient algorithm for mining association rules in
large databases," Intl. Conf. on Very Large Databases, pp. 432-444, 1995.
[Diagram labels: Pass 1, Pass 2; item names to integers; item counts; frequent items; hash table for bucket counts; bitmap; data structure for counts of pairs.]
Figure 6.5: Organization of mai
6. No other pairs occur at all.
7. Integers are always represented by 4 bytes.
8. When we hash pairs, they distribute among buckets randomly, but as evenly as possible; i.e., you may
assume that each
[Diagram labels: Pass 1, Pass 2; Hash Table 1, Hash Table 2; Bitmap 1, Bitmap 2; data structure for counts of pairs.]
Figure 6.7: The Multihash Algorithm uses several hash tables in one pass
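The idea in Figure 6.7 can be sketched roughly as below. This is a small in-memory illustration under stated assumptions, not the book's implementation: the function name `multihash_frequent_pairs`, the second hash function `h2`, and the sample baskets are hypothetical. Only the first hash function, 2i+3j+4 mod 5, is taken from the exercise text on this page.

```python
from collections import Counter
from itertools import combinations


def multihash_frequent_pairs(baskets, support, hash_fns, n_buckets):
    """Two-pass sketch: several hash tables are filled during the first pass."""
    item_counts = Counter()
    bucket_counts = [[0] * n_buckets for _ in hash_fns]

    # Pass 1: count items and hash every pair into each table.
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            for table, h in zip(bucket_counts, hash_fns):
                table[h(i, j) % n_buckets] += 1

    frequent_items = {x for x, n in item_counts.items() if n >= support}
    # Summarize each hash table as a bitmap of frequent buckets.
    bitmaps = [[n >= support for n in table] for table in bucket_counts]

    # Pass 2: count a pair only if both items are frequent and the pair
    # hashes to a frequent bucket in every table.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for i, j in combinations(items, 2):
            if all(bm[h(i, j) % n_buckets] for bm, h in zip(bitmaps, hash_fns)):
                pair_counts[(i, j)] += 1

    return {pair: n for pair, n in pair_counts.items() if n >= support}


def h1(i, j):
    return 2 * i + 3 * j + 4  # hash function from the exercise text


def h2(i, j):
    return i * j  # hypothetical second hash function, for illustration only


# Illustrative baskets only.
baskets = [(1, 2, 3), (1, 2, 4), (1, 2), (3, 4)]
result = multihash_frequent_pairs(baskets, support=3, hash_fns=[h1, h2], n_buckets=5)
```

Because a frequent pair always lands in a frequent bucket, the bitmaps can only rule out pairs that are not frequent; here the only pair that survives is (1, 2).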
Example 6.10: Suppose that if we r
Sometimes, we can get most of the benefit of the extra passes of the Multistage Algorithm in a single
pass. This variation of PCY is called the Multihash Algorithm. Instead of using two different hash tab
Example 7.1: Classical applications of clustering often involve low-dimensional Euclidean spaces. For
example, Fig. 7.1 shows height and weight measurements of dogs of several varieties. Without knowi
4. There are one million pairs that occur 10,000 times or more.
5. There are 2M pairs that occur exactly once. Of these pairs, M consist of two frequent items; the other
M each have at least one nonfr
between all but a vanishingly small fraction of the pairs of points. However, the maximum distance
between two points is $\sqrt{d}$, and one can argue that all but a vanishingly small fraction of the pairs do
! Exercise 6.2.4: How would you count all itemsets of size 3 by a generalization of the triangular-matrix
method? That is, arrange that in a one-dimensional array there is exactly one element for each
hand, the total number of pairs among all the baskets is $10^7 \binom{10}{2} = 4.5 \times 10^8$. Even in the extreme case
that every pair of items appeared only once, there could be only $4.5 \times 10^8$ pairs with
Footnote 1: Here, and throug
and/or where the space either is high-dimensional, or the space is not Euclidean at all. We shall
therefore discuss several algorithms that assume the data does not fit in main memory. However, we
begin
2. Find Lk by making a pass through the baskets and counting all and only the itemsets of size k that are
in Ck. Those itemsets that have count at least s are in Lk.
6.2.7 Exercises for Section 6.2
Ex
the A-Priori Algorithm, one pass is taken for each set-size k. If no frequent itemsets of a certain size are
found, then monotonicity tells us there can be no larger frequent itemsets, so we can stop.
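The pass-per-size loop and the stopping rule described above can be sketched as follows. This is a simplified in-memory illustration, assuming the baskets fit in memory and items are comparable values such as integers; the function name `apriori` and the sample baskets are hypothetical.

```python
from itertools import combinations


def apriori(baskets, support):
    """Return a dict mapping each frequent itemset to its count."""
    baskets = [frozenset(b) for b in baskets]
    # Candidates of size 1: every item that appears in some basket.
    candidates = {frozenset([x]) for b in baskets for x in b}
    frequent = {}
    k = 1
    while candidates:
        # One pass over the baskets, counting only the current candidates.
        counts = dict.fromkeys(candidates, 0)
        for b in baskets:
            for cand in candidates:
                if cand <= b:
                    counts[cand] += 1
        level = {s: n for s, n in counts.items() if n >= support}
        if not level:
            break  # monotonicity: no larger itemset can be frequent
        frequent.update(level)
        k += 1
        # Next candidates: size-k sets whose size-(k-1) subsets are all frequent.
        items = sorted({x for s in level for x in s})
        candidates = {
            frozenset(c)
            for c in combinations(items, k)
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


# Illustrative baskets only.
baskets = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
result = apriori(baskets, support=3)
```

In this example every singleton and every pair reaches the threshold, but the triple {1, 2, 3} appears only twice, so the third pass finds nothing and the loop stops.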