In our project, memory is the most critical issue, since an algorithm with
O(n^2) space complexity cannot handle a large dataset. The following features,
however, can be used to improve the algorithm:
1. If an item pair is a frequent pair, each item in the pair must be frequent
This is not surprising. In any dataset, count(a,b) cannot be greater than either
count(a) or count(b). Taking advantage of this property, a frequency list of single items can be
created first. Based on this list, the nonfrequent items are removed and only the remaining
frequent items are considered when generating the frequent item pairs. As a result, the first
pass needs only linear space, O(n), to maintain the count of each item. In the second
pass, only the item pairs generated from frequent items are counted. In this way
a large number of nonfrequent items can be removed. Two methods were developed, as
follows:
Method 1
Generate a list of all possible item pairs based on the list of frequent single items, which
takes O(m^2) space. Then, for each basket, iterate through this list and check whether each pair
exists in the basket. This method takes O(m^2*L*N) time, where m is the number of frequent
single items, L is the average length of a basket and N is the number of baskets.
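
As a rough illustration of Method 1, the Python sketch below enumerates every candidate pair of frequent items up front and then tests each candidate against each basket. The function name count_pairs_method1 and the representation of baskets as lists of strings are assumptions made for this example; they are not taken from the project code.

```python
from itertools import combinations


def count_pairs_method1(baskets, frequent_items):
    """Count candidate pairs by scanning the full candidate list per basket.

    Hypothetical helper for illustration; names and data layout are assumed.
    """
    # All m*(m-1)/2 candidate pairs of frequent items: O(m^2) space.
    candidates = list(combinations(sorted(frequent_items), 2))
    counts = dict.fromkeys(candidates, 0)
    for basket in baskets:
        for a, b in candidates:
            # Membership tests on a list cost O(L), which gives the
            # O(m^2*L*N) total time quoted in the text.
            if a in basket and b in basket:
                counts[(a, b)] += 1
    return counts


# Toy data, invented for this example.
baskets = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "d"]]
pair_counts = count_pairs_method1(baskets, {"a", "b", "c"})
```

Every basket pays for the whole candidate list, even a basket that contains only one or two frequent items, which is where the m^2 factor in the running time comes from.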
Method 2
For each basket, generate the list of frequent single items it contains and all possible item
pairs from that list, and count those pairs. Repeat this for every basket. This method takes
O(L^2*N) time, where L is the average length of a basket and N is the number of baskets. In
practical datasets with reasonable thresholds, L is typically much smaller than m^2.
Therefore, Method 2 is the more efficient way to count item pairs in the second pass.
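
The two passes, with Method 2 used for the second one, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the function name frequent_pairs is hypothetical, baskets are assumed to be lists of strings, and an item or pair is assumed to be frequent when its count is at least the threshold.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, threshold):
    """Two-pass frequent-pair count (illustrative sketch, names assumed)."""
    # Pass 1: count single items, using O(n) space for n distinct items.
    # set(basket) ensures an item is counted once per basket.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    # Assumption: "frequent" means count >= threshold.
    frequent = {item for item, c in item_counts.items() if c >= threshold}

    # Pass 2 (Method 2): per basket, keep only the frequent items and count
    # every pair among them, which is O(L^2) work per basket.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent)  # sorted so (a, b) == (b, a)
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= threshold}


# Toy data, invented for this example.
baskets = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "d"]]
result = frequent_pairs(baskets, threshold=2)
```

Only pairs that actually occur in some basket ever get an entry in pair_counts, so no candidate list of size O(m^2) is built in advance.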
2. Typically only a small fraction of the item pairs are frequent
Usually only a very small fraction of all the possible item pairs is frequent. So
why waste a large proportion of memory storing the counts of nonfrequent item pairs?
Here a hash table is implemented which creates a set of buckets and hashes each item to...
This document was uploaded on 02/27/2014 for the course CS 18.337 at MIT.
 Fall '13
 Edelman