a
certain bucket. In general, the number of buckets is much smaller than the number of possible
item pairs, which saves a large amount of memory. With this hash mechanism, only the
count of each bucket is stored, not the count of each individual item pair.
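What do we gain from the hash table? An item pair can be frequent only if the bucket it
hashes to is frequent, since the count of a bucket is the sum of the counts of all item pairs
hashed to it and is therefore at least as large as the count of any one of them. Any pair that
falls in a nonfrequent bucket can thus be discarded without being counted individually, which
removes a large number of nonfrequent item pairs.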
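The bucket filter described above can be sketched as follows. This is a minimal illustration in Python, not the report's Julia code: the function names, the CRC32-based hash, the number of buckets, and the sample baskets are all assumptions made for the example.

```python
import zlib
from collections import Counter
from itertools import combinations


def bucket_of(pair, num_buckets):
    """Map an item pair to a bucket index (the hash choice is an assumption)."""
    return zlib.crc32(repr(pair).encode("utf-8")) % num_buckets


def count_buckets(baskets, num_buckets):
    """First pass: store one counter per bucket, not one per item pair."""
    counts = [0] * num_buckets
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[bucket_of(pair, num_buckets)] += 1
    return counts


def count_candidate_pairs(baskets, bucket_counts, threshold):
    """Second pass: count a pair only if its bucket reached the threshold."""
    num_buckets = len(bucket_counts)
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            if bucket_counts[bucket_of(pair, num_buckets)] >= threshold:
                pair_counts[pair] += 1
    # A frequent bucket may still hold nonfrequent pairs, so filter again.
    return {p: c for p, c in pair_counts.items() if c >= threshold}


# Illustrative data only.
baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a", "b", "d"]]
bucket_counts = count_buckets(baskets, num_buckets=4)
frequent_pairs = count_candidate_pairs(baskets, bucket_counts, threshold=3)
```

The final filter in `count_candidate_pairs` matters because a frequent bucket is a necessary condition for a frequent pair, not a sufficient one: several nonfrequent pairs can share a bucket whose total reaches the threshold.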
Improving computational time: parallelization
A MapReduce framework is applied to parallelize the 2-pass method, as shown in Figure
1. It is intended to reduce the running time when multiple processors are available.
The framework is applied twice: when counting the frequencies of single items, and
when counting the frequencies of item pairs after nonfrequent single items have been filtered out.
The procedure is described below:
Step 1. On a single computer, load the dataset into main memory and divide it into chunks of
similar size; the number of chunks is determined by the number of processors to use. On
a cluster of multiple computers, divide the data into chunks beforehand and load one chunk
into the main memory of each computer. Then, on each computer, divide its chunk into smaller
chunks based on the number of processors to use.
Step 2. For each data chunk, use a mapper function to count items (item pairs) while traversing
the chunk and save the results in dictionaries.
Step 3. Use a reducer function to merge the dictionaries produced by the mapper functions into
a single dictionary holding the total count of each item (item pair) over the entire dataset.
Step 4. Apply a filter to get frequent items (item pairs) based on a threshold.
Figure 1: The MapReduce framework to parallelize the 2-pass method
III. Implementation
The 2-pass method is implemented in Julia, a high-level, high-performance technical computing
language. It is very convenient to implement parallel c...
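Steps 1 to 4 of the MapReduce procedure can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Julia implementation described in this report: all function names and the sample baskets are hypothetical, and a thread pool stands in for the worker processors only to keep the example self-contained (in CPython, a process pool would be needed to actually use multiple processors for this kind of counting).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from itertools import combinations


def split_into_chunks(baskets, num_chunks):
    """Step 1: divide the data into chunks of similar size."""
    size = max(1, -(-len(baskets) // num_chunks))  # ceiling division
    return [baskets[i:i + size] for i in range(0, len(baskets), size)]


def map_count_items(chunk):
    """Step 2 (mapper, first pass): count single items in one chunk."""
    counts = Counter()
    for basket in chunk:
        counts.update(set(basket))
    return counts


def make_pair_mapper(frequent_items):
    """Step 2 (mapper, second pass): count pairs of frequent items only."""
    def map_count_pairs(chunk):
        counts = Counter()
        for basket in chunk:
            kept = sorted(set(basket) & frequent_items)
            counts.update(combinations(kept, 2))
        return counts
    return map_count_pairs


def reduce_counts(dicts):
    """Step 3 (reducer): merge per-chunk dictionaries into total counts."""
    return reduce(lambda a, b: a + b, dicts, Counter())


def keep_frequent(counts, threshold):
    """Step 4: keep the entries whose total count reaches the threshold."""
    return {k: c for k, c in counts.items() if c >= threshold}


def map_reduce(mapper, chunks, threshold, workers):
    """Run the mapper on every chunk, then reduce and filter the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(mapper, chunks))
    return keep_frequent(reduce_counts(partial), threshold)


# Illustrative data only.
baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a", "b", "d"]]
chunks = split_into_chunks(baskets, num_chunks=2)
frequent_items = set(map_reduce(map_count_items, chunks, threshold=3, workers=2))
frequent_pairs = map_reduce(make_pair_mapper(frequent_items), chunks,
                            threshold=3, workers=2)
```

The same `map_reduce` helper serves both passes; only the mapper changes, which mirrors the way the framework is applied once for single items and once for item pairs.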
This document was uploaded on 02/27/2014 for the course CS 18.337 at MIT.
- Fall '13