6.338_Project_Report_Runmi&Lu

What can we benefit from the hash table intuitively

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: a certain bucket, In general, the number of bucket is much smaller than the number of possible item pairs, which saves a large amount of memory. With this hash mechanism, only the frequency of each bucket are counted and stored, not the item pairs. What can we benefit from the hash table? Intuitively, an item pair is frequent only if the bucket it was hashed to is frequent, since the count of a bucket is the sum of the counts of each item pair in it. By doing this a large amount of non­frequent item pairs can be removed. Improving computational time: parallelization A MapReduce framework is applied to parallelize the 2­pass method as shown in Figure 1. It is supposed to improve the running time performance of the process when running with multi­processors. This framework is applied when counting the frequencies of single items and when counting the frequencies of item pairs after filtering out non­frequent single items. The procedure is described in below: Step 1. If on one computer, load the dataset into main memory, divide the data into chunks with similar size. The number of chunks is decided based on the number of processors to use. If on a cluster with multiple computers, divide the data into chunks beforehand and load each chunk into the main memory of each computer. Then, on each computer, divide the data into chunks based on the number of processors to use. Step 2. For each data chunk, use a mapper function to count items (item pairs) while traversing the chunk and save the results in dictionaries. Step 3. Use a reducer function to combine different dictionaries from the mapper functions and get a dictionary to save the total counts of each item (item pair) over the entire dataset. Step 4. Apply a filter to get frequent items (item pairs) based on a threshold. Figure 1: The MapReduce framework to parallelize the 2­pass method III. Implementation The 2­pass method is implemented using Julia. Julia is a high­level technical computing language with high performance. It is very convenient to implement parallel c...
View Full Document

This document was uploaded on 02/27/2014 for the course CS 18.337 at MIT.

Ask a homework question - tutors are online