MapReduce Algorithm Design
Introduction
A large part of the power of MapReduce
comes from its simplicity: the programmer
needs only to implement the mapper, the
reducer, and optionally, the combiner
MapReduce Algorithm Design
Adapted from Jimmy Lin's slides
MapReduce: Recap
Programmers must specify:
map (k, v) → <k', v'>*
reduce (k', v') → <k', v'>*
All values with the same key are reduced together
Optio
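The map and reduce signatures above can be illustrated with word count. This is a minimal single-process sketch, not a Hadoop program: `run_mapreduce` is a hypothetical toy driver that stands in for the framework's shuffle and sort, and the two input documents are made up for illustration.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map (k, v) -> <k', v'>*: emit (word, 1) for every word in the document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """reduce (k', [v']) -> <k', v'>*: sum the counts collected for one word."""
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Toy driver (an assumption): map, group values by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = {}
    for out_key in sorted(groups):  # keys are presented to reducers in sorted order
        for k, v in reducer(out_key, groups[out_key]):
            output[k] = v
    return output

docs = [("d1", "a rose is a rose"), ("d2", "a daisy is a daisy")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

All values that share a key end up in the same `reduce_fn` call, which is the "reduced together" guarantee stated in the recap.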
Spark Big Data Ecosystem
MapReduce enables big data analytics using
large, unreliable clusters
There are multiple implementations of
MapReduce
 Based on HDFS
 Based on NoSQL databases
 Based on c
Naive Bayes Algorithm For
Supervised Learning
An Illustrative Example
Congressional voting data by party affiliation.
Question 1: randomly choose a representative.
What is the probability this representa
MapReduce Basics
Principle of Parallel Process
The only feasible approach to tackling large-data problems
today is to divide and conquer, a fundamental concept in
computer science.
The basic idea is
Clustering Analysis
Dr. Ying Xie
What is clustering analysis?
Group elements into clusters such that:
1) Elements inside a cluster are highly similar
to each other
2) Elements from different cluster
Pairs and Stripes
One common approach for synchronization in
MapReduce is to construct complex keys and
values in such a way that data necessary for a
computation are naturally brought together by
th
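As a rough illustration of complex keys versus complex values, the sketch below counts word co-occurrences two ways. It assumes that "neighbors" means all other words in the same sentence, and the function names and sample sentences are hypothetical. In the pairs version the key is a tuple; in the stripes version the value is a map.

```python
from collections import Counter, defaultdict

def pairs_mapper(words):
    """Pairs: the key is the complex object (w, u); the value is a count of 1."""
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                yield (w, u), 1

def stripes_mapper(words):
    """Stripes: the value is the complex object, a map {u: count} for word w."""
    for i, w in enumerate(words):
        yield w, Counter(u for j, u in enumerate(words) if j != i)

def reduce_pairs(emitted):
    """Sum the counts emitted for each (w, u) key."""
    totals = defaultdict(int)
    for pair, count in emitted:
        totals[pair] += count
    return dict(totals)

def reduce_stripes(emitted):
    """Element-wise sum of all stripes emitted for each word w."""
    totals = defaultdict(Counter)
    for w, stripe in emitted:
        totals[w].update(stripe)
    return {w: dict(stripe) for w, stripe in totals.items()}

sentences = [["a", "b", "c"], ["a", "b"]]
pair_counts = reduce_pairs(kv for s in sentences for kv in pairs_mapper(s))
stripe_counts = reduce_stripes(kv for s in sentences for kv in stripes_mapper(s))
print(pair_counts[("a", "b")], stripe_counts["a"])
```

Both versions compute the same counts. Pairs emits many small records, while stripes emits fewer but larger ones, so the choice mainly affects how much intermediate data is shuffled and how much memory each record needs.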
PageRank
Introduction
PageRank is a measure of web page quality
based on the structure of the hyperlink graph.
PageRank is one of the best known and most
studied algorithms used in Google's search
en
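To make the link-structure idea concrete, here is a small sketch of iterative PageRank written in a map and reduce style. Several things are assumptions for illustration: the damping factor of 0.85, the three-node graph, the fixed iteration count, and the simplification that every node has at least one outgoing link.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed value, not taken from the slides

def pagerank_map(node, rank, out_links):
    """Distribute a node's current rank evenly over its outgoing links."""
    share = rank / len(out_links)  # assumes out_links is never empty
    for target in out_links:
        yield target, share

def pagerank_reduce(incoming, num_nodes):
    """Combine the random-jump term with the rank mass received over links."""
    return (1 - DAMPING) / num_nodes + DAMPING * sum(incoming)

def pagerank(graph, iterations=20):
    """Run a fixed number of map/reduce rounds over an adjacency-list graph."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        received = defaultdict(list)
        for node, out_links in graph.items():
            for target, share in pagerank_map(node, ranks[node], out_links):
                received[target].append(share)
        ranks = {node: pagerank_reduce(received[node], n) for node in graph}
    return ranks

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical link graph
ranks = pagerank(graph)
print(ranks)
```

Each round corresponds to one MapReduce job: the map step sends rank along hyperlinks and the reduce step sums what each page receives.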
Inverted Index
Recall the basic implementation
Issues with the basic implementation:
For efficient retrieval, postings need to be sorted
by document id.
However, as collections become larger, posting
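The sketch below shows one plausible reading of the basic implementation, assuming the reducer buffers every posting for a term and sorts the list in memory. The document ids, texts, and function names are hypothetical. The `sorted` call in the reducer is the step that becomes a bottleneck once a term's postings no longer fit in memory.

```python
from collections import Counter, defaultdict

def map_fn(doc_id, text):
    """Emit (term, (doc_id, term_frequency)) for each distinct term in a document."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)

def reduce_fn(term, postings):
    """Buffer all postings for the term, then sort them by document id."""
    yield term, sorted(postings)

def build_index(docs):
    """Toy driver (an assumption): group postings by term and reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in docs:
        for term, posting in map_fn(doc_id, text):
            groups[term].append(posting)
    index = {}
    for term in sorted(groups):
        for t, posting_list in reduce_fn(term, groups[term]):
            index[t] = posting_list
    return index

docs = [(3, "fish fish"), (1, "one fish"), (2, "two fish")]
index = build_index(docs)
print(index["fish"])
```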
Graph Algorithms
Introduction
Graphs are ubiquitous in modern society:
hyperlink structure of the web (simply known as
the web graph),
social networks
transportation networks (roads, bus routes, e
Illustration of parallel breadth-first search with the MapReduce framework.
Sample graph for illustration purposes.
Let the source node be 1.
[Figure: sample graph with nodes 1 through 9]
3: Represent the graph to the following fo
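A minimal sketch of parallel breadth-first search follows, assuming the graph is stored as adjacency lists and every edge has weight 1. The edge list here is hypothetical and is not the graph from the illustration. Each node record holds its current distance from the source and its neighbor list, and each loop iteration plays the role of one MapReduce job.

```python
import math
from collections import defaultdict

def bfs_map(node, record):
    """Pass the node's structure along and offer distance + 1 to each neighbor."""
    distance, neighbors = record
    yield node, ("graph", neighbors)
    yield node, ("dist", distance)
    if distance < math.inf:
        for neighbor in neighbors:
            yield neighbor, ("dist", distance + 1)

def bfs_reduce(node, values):
    """Recover the adjacency list and keep the smallest distance seen."""
    neighbors, best = [], math.inf
    for tag, payload in values:
        if tag == "graph":
            neighbors = payload
        else:
            best = min(best, payload)
    return node, (best, neighbors)

def bfs_iteration(state):
    """One simulated job: map every node, group by key, reduce every group."""
    groups = defaultdict(list)
    for node, record in state.items():
        for key, value in bfs_map(node, record):
            groups[key].append(value)
    return dict(bfs_reduce(node, values) for node, values in groups.items())

def parallel_bfs(adjacency, source):
    """Iterate until no distance changes, then return the distances."""
    state = {n: (0 if n == source else math.inf, nbrs)
             for n, nbrs in adjacency.items()}
    while True:
        new_state = bfs_iteration(state)
        if new_state == state:
            return {n: d for n, (d, _) in new_state.items()}
        state = new_state

# Hypothetical directed graph; node 1 is the source.
adjacency = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: [6], 6: []}
distances = parallel_bfs(adjacency, source=1)
print(distances)
```

The search frontier expands by one hop per iteration, so the number of jobs needed grows with the longest shortest path from the source.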
Introduction
This course will focus on advanced algorithm
design for processing big unstructured data
with MapReduce.
The reference book will be Data-Intensive Text
Processing with MapReduce. This b
Value to Key Conversion
MapReduce sorts intermediate key-value pairs
by the keys during the shuffle and sort phase.
What if, in addition to sorting by key, we also
need to sort by value?
Consider t
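One way to picture value-to-key conversion: move the value into a composite key so the framework's sort orders it, then group on the original (natural) key only. The sketch below simulates this in a single process. The sensor ids and readings are hypothetical, and `shuffle_and_sort` is a stand-in for the framework's sort, not a real API.

```python
from itertools import groupby

def to_composite(records):
    """Move the value into the key: (k, v) becomes ((k, v), None)."""
    for key, value in records:
        yield (key, value), None

def shuffle_and_sort(composite_records):
    """Stand-in for the framework's sort: order by the full composite key."""
    return sorted(composite_records, key=lambda kv: kv[0])

def reduce_by_natural_key(sorted_records):
    """Group on the natural key only; the values arrive already sorted."""
    for natural_key, group in groupby(sorted_records, key=lambda kv: kv[0][0]):
        yield natural_key, [composite[1] for composite, _ in group]

readings = [("s2", 7), ("s1", 9), ("s1", 3), ("s2", 1), ("s1", 5)]
result = dict(reduce_by_natural_key(shuffle_and_sort(to_composite(readings))))
print(result)
```

In a real job this also requires a custom partitioner that looks only at the natural key, so that all composite keys for the same natural key reach the same reducer. The reducer then never has to buffer and sort values itself.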
Inverted Index
Introduction
An inverted index is the core data structure of an
information retrieval system (search engine).
An information retrieval system can be
simplified as two major components

Order Inversion
Issues
Absolute co-occurrence frequency can sometimes
be misleading.
A common word may appear frequently
with many other words.
A possible solution to this issue is that we
ca
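One remedy described in Data-Intensive Text Processing with MapReduce is to report relative frequencies, f(u | w) = N(w, u) / N(w), and to use order inversion so the marginal N(w) reaches the reducer before any joint count. The sketch below is a single-process illustration under stated assumptions: `MARGINAL` is a hypothetical marker chosen because the empty string sorts before any word, and co-occurrence means appearing in the same sentence.

```python
from collections import defaultdict

MARGINAL = ""  # hypothetical marker; the empty string sorts before any word

def map_fn(words):
    """Emit each joint count plus a matching contribution to the marginal of w."""
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                yield (w, u), 1
                yield (w, MARGINAL), 1

def relative_frequencies(sentences):
    """Sum counts per key, then divide joint counts by the marginal seen first."""
    groups = defaultdict(int)
    for words in sentences:
        for key, count in map_fn(words):
            groups[key] += count
    result, marginal = {}, None
    # Sorted key order guarantees (w, MARGINAL) arrives before every (w, u).
    for (w, u) in sorted(groups):
        if u == MARGINAL:
            marginal = groups[(w, u)]
        else:
            result[(w, u)] = groups[(w, u)] / marginal
    return result

sentences = [["a", "b", "c"], ["a", "b"]]
freqs = relative_frequencies(sentences)
print(freqs[("a", "b")])
```

Because the marginal arrives first, the reducer can compute each ratio as soon as a joint count shows up, without holding all counts for a word in memory. In a distributed job the partitioner would need to route keys by the left word only.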
Local Aggregation
Introduction
In the context of data-intensive distributed processing, the single most
important aspect of synchronization is the exchange of intermediate
results, from the processes
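A small sketch of local aggregation, assuming word count as the task: the second mapper sums counts inside the map task before emitting anything (often called in-mapper combining), which reduces the number of intermediate pairs that have to be exchanged. The function names and input documents are hypothetical.

```python
from collections import Counter, defaultdict

def plain_mapper(text):
    """Emit one (word, 1) pair per token, with no local aggregation."""
    for word in text.split():
        yield word, 1

def in_mapper_combining(text):
    """Aggregate counts inside the mapper and emit one pair per distinct word."""
    local_counts = Counter(text.split())
    yield from local_counts.items()

def reduce_counts(pairs):
    """Sum the partial counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["to be or not to be", "to do is to be"]
plain = [pair for doc in docs for pair in plain_mapper(doc)]
combined = [pair for doc in docs for pair in in_mapper_combining(doc)]
print(len(plain), len(combined))
```

Both versions produce the same final counts; only the volume of intermediate data differs.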
Hypotheses generation as supervised link
discovery with automated class labeling on large
scale biomedical concept networks
Dr. Ying Xie
Computer Science Department
Kennesaw State University
Outline
B