6.338_Project_Report_Runmi&Lu


In our project, memory is the most critical issue, since an algorithm with O(n^2) space complexity cannot handle a large dataset. The following features, however, help to improve the algorithm:

1. If an item pair is frequent, each item in the pair must be frequent

This is not surprising. In any dataset, count(a,b) cannot be greater than either count(a) or count(b). Taking advantage of this feature, a frequency list of single items can be created first. Based on this list, the non-frequent items are removed and only the remaining frequent items are considered when generating the frequent item pairs. The first pass therefore needs only linear space, O(n), to maintain the count of each item. In the second pass, only the item pairs generated from frequent items are counted. By doing this, a large number of non-frequent items are excluded from pair counting. Two methods are developed as follows:

Method 1
Generate a list of possible item pairs based on the list of frequent single items, which takes O(m^2) space. Then, for each basket, iterate through this list and check whether each pair exists in the basket. This method takes O(m^2*L*N) time, where m is the number of frequent single items, L is the average length of the baskets and N is the number of baskets.

Method 2
For each basket, generate a list of the frequent single items it contains and all possible item pairs based on this list, and count those pairs. Repeat for all the baskets. This method takes O(L^2*N) time, where L is the average length of the baskets and N is the number of baskets.

In practical datasets with reasonable thresholds, L is typically much smaller than m^2. Therefore, Method 2 is a more efficient way to count item pairs in the second pass.

2. Typically only a small fraction of the item pairs are frequent

Usually only a very small fraction of all the possible item pairs are frequent. So why waste a large proportion of memory storing the counts of non-frequent item pairs?
Here a hash table is implemented, which creates a set of buckets and hashes each item to...
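
The two-pass counting described under feature 1 can be sketched as follows. This is a minimal illustration and not the report's actual code: the function names are hypothetical, baskets are assumed to be lists of hashable items, and an item or pair is assumed to be frequent when its count is at least the threshold. Both Method 1 and Method 2 are shown for the second pass so that their results can be compared.

```python
from collections import Counter
from itertools import combinations


def frequent_items(baskets, threshold):
    """Pass 1: count single items (O(n) space) and keep the frequent ones."""
    counts = Counter(item for basket in baskets for item in set(basket))
    return {item for item, c in counts.items() if c >= threshold}


def count_pairs_method1(baskets, frequent):
    """Method 1: build every candidate pair first, then test each per basket."""
    candidates = list(combinations(sorted(frequent), 2))  # O(m^2) space
    counts = Counter()
    for basket in baskets:
        # A set makes each membership test O(1); scanning the basket as a
        # list instead would give the O(m^2 * L * N) bound from the text.
        items = set(basket)
        for a, b in candidates:
            if a in items and b in items:
                counts[(a, b)] += 1
    return counts


def count_pairs_method2(baskets, frequent):
    """Method 2: per basket, pair up only the frequent items it contains."""
    counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent)  # sorting gives canonical pairs
        counts.update(combinations(kept, 2))   # O(L^2) work per basket
    return counts


def frequent_pairs(baskets, threshold):
    """Run both passes, using Method 2 for the second pass."""
    frequent = frequent_items(baskets, threshold)
    counts = count_pairs_method2(baskets, frequent)
    return {pair: c for pair, c in counts.items() if c >= threshold}
```

Calling frequent_pairs(baskets, threshold) on a list of baskets returns a dictionary that maps each frequent pair, stored as a sorted tuple, to its count. Method 1 loops over all m^2 candidate pairs for every basket, whereas Method 2 only touches pairs that actually occur, which is why it is preferred when baskets are short.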