t number of processes in Julia were tested.
Memory usage saving
Different hash functions can lead to different levels of memory usage. Implementing
the 2-pass method with hashing reduced memory usage by roughly 76% on average.
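To illustrate how a two-pass method with hashing can cut memory, here is a minimal sketch in Python (not the Julia used in the project). It assumes a PCY-style scheme, where the first pass hashes every pair into a small bucket array and the second pass counts only pairs that can still be frequent. Whether this matches the project's implementation is an assumption, and the names `frequent_pairs_two_pass`, `n_buckets`, and `pair_hash` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs_two_pass(transactions, min_support, n_buckets=1024, pair_hash=hash):
    """Return {pair: count} for item pairs appearing in >= min_support baskets.

    Assumption: a PCY-style two-pass scheme. `transactions` must be
    re-iterable (e.g. a list), because it is scanned twice.
    """
    # Pass 1: count single items and hash every pair into a fixed-size
    # bucket array. The array uses far less memory than a full pair table.
    item_counts = Counter()
    buckets = [0] * n_buckets
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            buckets[pair_hash(pair) % n_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: keep an exact counter only for candidate pairs, i.e. pairs whose
    # items are both frequent and whose bucket reached the support threshold.
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if buckets[pair_hash(pair) % n_buckets] >= min_support:
                pair_counts[pair] += 1

    return {p: c for p, c in pair_counts.items() if c >= min_support}
```

The choice of `pair_hash` and `n_buckets` does not change the result, only how many candidate pairs survive into the second pass. That is one way a different hash function could lead to a different memory footprint.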
Computational time saving
Figure 2 shows the running times of single-item counting, item-pair counting, and the total,
for different numbers of processes. Single-item counting required much less time than item-pair
counting, so the trend in total time is determined mainly by the running time of item-pair
counting. The running time decreases as more processes are used. However, once the number of
processes exceeds 5, the overhead of the multiprocessing framework outweighs the time saved,
and the total running time increases as more processes are added. Testing of this kind on a
relatively small dataset is important for deciding how many processes to use in real
large-scale cases.
Figure 2: Running time with different numbers of processes
V. Conclusions
In this project, we attempted to improve on the naive algorithm for a realistic
problem: finding frequent item pairs in massive transaction datasets. Our methods, though
relatively intuitive, significantly reduce both the required memory and the running time.
The work helped us to better understand the MapReduce process and to appreciate
Julia’s extremely easy-to-use parallel computing features. In future research, Julia
will be a strong candidate when we choose a scientific computing language.
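As an illustration of the MapReduce pattern mentioned above, here is a minimal sketch in Python (not the project's Julia code). It maps each chunk of transactions to partial pair counts in a separate process and then reduces by merging the partial counters. The function names and the contiguous chunking strategy are hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations
from multiprocessing import Pool


def count_pairs(chunk):
    """Map step: count item pairs within one chunk of transactions."""
    counts = Counter()
    for basket in chunk:
        counts.update(combinations(sorted(set(basket)), 2))
    return counts


def split_into_chunks(transactions, n_chunks):
    """Split the transaction list into at most n_chunks contiguous slices."""
    size = max(1, -(-len(transactions) // n_chunks))  # ceiling division
    return [transactions[i:i + size] for i in range(0, len(transactions), size)]


def parallel_pair_counts(transactions, n_procs=1):
    """Count item pairs using n_procs worker processes."""
    chunks = split_into_chunks(transactions, n_procs)
    if n_procs == 1:
        # Serial path: avoids process start-up and data transfer overhead.
        partials = map(count_pairs, chunks)
    else:
        with Pool(n_procs) as pool:
            partials = pool.map(count_pairs, chunks)
    # Reduce step: merge the partial counters into one total.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

When `n_procs` is greater than 1, the call should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Timing `parallel_pair_counts` for several values of `n_procs` on a small dataset is one way to see where process overhead starts to outweigh the speedup.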
There are research papers focusing on more sophisticated methods with rigorous
mathematical derivations and proofs. For example:
Pasquier, Nicolas, et al. "Discovering frequent closed itemsets for association rules." Database
Theory—ICDT’99. Springer Berlin Heidelberg, 1999. 398-416.
Zaki, Mohammed J. "Efficiently mining frequent trees in a forest." Proceedings of the eighth ACM
SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002.
Bodon, Ferenc. "A trie-based APRIORI implementation for mining frequent item sequences."
Proceedings of the 1st international workshop on open source data mining: frequent pattern
mining implementations. ACM, 2005.
This document was uploaded on 02/27/2014 for the course CS 18.337 at MIT.
- Fall '13