CSCI 5510 Big Data Analytics
Lecture 2: MapReduce and
Frequent Itemsets
Prof. Irwin King and Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
Grade Assessment Scheme and
Deadlines
Assignments (20%)
Written
assignments
Coding
Midterm
Examination (30%)
Nov. 4, 9:3
Chapter 3
Finding Similar Items
A fundamental data-mining problem is to examine data for similar items. We
shall take up applications in Section 3.1, but an example would be looking at a
collection of Web pages and finding near-duplicate pages. These pag
Chapter 2
Map-Reduce and the New
Software Stack
Modern data-mining applications, often called big-data analysis, require us
to manage immense amounts of data quickly. In many of these applications, the
data is extremely regular, and there is ample opportu
CSCI5510 In-Class Practice
Social Graph
Date:
_
Students' Names:
IDs:
_
_
_
_
For the graph on the right, compute:
The adjacency matrix
The degree matrix
The Laplacian matrix
Answer:
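The worksheet's graph is not reproduced in this text, so the following is only a minimal sketch of the three computations on a hypothetical undirected 4-node graph. The edge list and node numbering are assumptions made for illustration, not the graph from the exercise. It uses the standard definitions: A[i, j] = 1 when nodes i and j are adjacent, D is the diagonal matrix of node degrees, and L = D - A.

```python
import numpy as np

# Hypothetical undirected graph (an assumption, NOT the worksheet's graph).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Adjacency matrix: symmetric because the graph is undirected.
A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u, v] = 1
    A[v, u] = 1

# Degree matrix: row sums of A placed on the diagonal.
D = np.diag(A.sum(axis=1))

# Unnormalized graph Laplacian.
L = D - A

print(A)
print(D)
print(L)
```

Every row of L sums to zero, which is a quick sanity check when working the exercise by hand.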
CSCI 5510 Big Data Analytics
Lecture 11: Online Learning
Prof. Irwin King and Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
Outline
Introduction
Learning paradigms
Online learning and its applications
CSCI 5510 Big Data Analytics
Lecture 7: Matrix Factorization
Methods
Prof. Irwin King and Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
Outline
Introduction
LU Decomposition
Singular Value Decomposition
Pr
CSCI 5510 Big Data Analytics
Lecture 4: Mining Data Streams
Prof. Irwin King and Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
Motivation
In many data mining situations, we know the
entire data set in adv
Chapter 1
Data Mining
In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute
to this field. We cover Bonferroni's Principle, which is really a warning abo
CSCI 5510 Big Data Analytics
Mining Data Streams
Prof. Irwin King and Prof. Michael R. Lyu
Computer Science & Engineering Dept.
Motivation
In many data mining situations, we know the entire data set in advance
Stream management is important when the i
CSCI5510 In-Class Practice
Dimensionality Reduction
Date:
_
Students' Names:
IDs:
_
_
_
_
1. Describe briefly (informally or formally) the relationship between singular value
decomposition and eigenvalue decomposition.
2.1 Compute the eigenvalues and eigen
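For question 1, one standard relationship can be checked numerically: if M = U Σ V^T, then M^T M = V Σ^2 V^T, so the eigenvalues of M^T M are the squared singular values of M and its eigenvectors are the right singular vectors (up to sign). The sketch below uses a hypothetical 3x2 matrix chosen only for illustration; it is not taken from the worksheet.

```python
import numpy as np

# Hypothetical matrix (an assumption, not from the worksheet).
M = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

# Thin SVD: singular values come back in descending order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Eigendecomposition of the symmetric matrix M^T M.
# eigh returns eigenvalues in ascending order, so reverse them.
eigvals, eigvecs = np.linalg.eigh(M.T @ M)
eigvals_desc = eigvals[::-1]
eigvecs_desc = eigvecs[:, ::-1]

# Squared singular values match the eigenvalues of M^T M.
print(np.allclose(s**2, eigvals_desc))

# Right singular vectors match the eigenvectors up to sign.
print(np.allclose(np.abs(Vt.T), np.abs(eigvecs_desc)))
```

The same argument applied to M M^T gives the left singular vectors.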
CSCI 5510 Big Data Analytics
Lecture 8: Massive Link Analysis
Prof. Irwin King and Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
What's the Mechanism Behind
Google?
How does Google return
this kind of ranking
(
CSCI 5510 Big Data Analytics
Lecture 6: Data Representation for
High Dimensional Data
Prof. Irwin King and Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
Outline
Motivation
SVD
CUR
Application of SVD an
CSCI5510 In-Class Practice
MapReduce and Hadoop
Date:
_
Students' Names:
IDs:
_
_
_
_
1. Given the following input:
I spent long spells at sea on all types of vessel; I followed officer training with the
Surface Fleet and with the Royal Marines.
Problem:
1
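The problem statement is cut off in this text, so what it asks is not known. Purely as an assumption, the sketch below shows word count, a common first MapReduce exercise, applied to the input sentence above. It simulates the map, shuffle, and reduce phases in plain Python rather than using Hadoop. The function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and lowercasing and stripping punctuation are design choices of this sketch.

```python
import re
from collections import defaultdict

text = ("I spent long spells at sea on all types of vessel; I followed officer "
        "training with the Surface Fleet and with the Royal Marines.")

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in re.findall(r"[A-Za-z]+", line.lower()):
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts for one word."""
    return key, sum(values)

counts = dict(reduce_phase(k, vs) for k, vs in shuffle(map_phase(text)).items())

for word in sorted(counts):
    print(word, counts[word])
```

In a real job the map and reduce functions run on different machines and the shuffle is handled by the framework; only the two user-written functions change from problem to problem.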
CSCI5510 In-Class Practice
Mining Data Streams
Date:
_
Students' Names:
IDs:
_
_
_
_
1. There are several ways that the bit-stream 1001011011101 could be
partitioned into buckets. Find all of them.
1001011 0 11 1 0 1
100101 101 11 0 1
1 00 101101 11 0 1
2.
CSCI 5510 Big Data Analytics
Lecture 9: Large Scale Support
Vector Machines
Prof. Irwin King and Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
Motivation
Introduce the widely used classification tool:
Sup
CSCI5510 In-Class Practice
Scalable Clustering
Date:
Students' Names:
_
IDs:
_
_
_
_
Given the 8 points in the 2D space on the left,
suppose that the initial seeds
(centers of each cluster) are A1, A4
and A7. Run the k-means
algorithm.
1.
Using Euclidean distance sho
CSCI5510 Tutorial 3
Introduction to Numpy
Guang Ling
Sept. 24, 2013
Announcement
Lecture 4 (Mining Data Streams) is shifted
to next Monday's lecture time (when Lecture
5 was scheduled)
Tutorial on Oct. 1 is cancelled since it is a
public holiday (makeup tuto
CSCI5510 Tutorial 5
Hints on Assignment1 and
FAQs
Guang Ling
Oct. 14, 2013
Late Penalty
Submitted before Oct. 11 23:59:59, no
penalty
Submitted before Oct. 12 23:59:59, 20%
mark deduction
Submitted before Oct. 13 23:59:59, 30%
mark deduction
Submitted bef
Chapter 4
Mining Data Streams
Most of the algorithms described in this book assume that we are mining a
database. That is, all our data is available when and if we want it. In this
chapter, we shall make another assumption: data arrives in a stream or str
CSCI 5510: Tutorial 7
Mid-term Preview
Tutor: Robbie
Oct. 29, 2013
Announcement
Assignment 2 is posted online
Deadline: 23:59:59 Nov. 17
Late penalty is the same as for Assignment 1
Marks of assignment 1 will be returned to
you this week
Sign up for your projec
CSCI5510 Tutorial 6
The Netflix Prize and Related
Recommendation Models
Jieming Zhu
Oct. 14, 2013
Outline
The Netflix Prize & the Recommendation
Problem
Some Interesting Contests on Machine
Learning Applications
Training data
100 million ratings, 480,000
Chapter 10
Mining Social-Network
Graphs
There is much information to be gained by analyzing the large-scale data that
is derived from social networks. The best-known example of a social network
is the friends relation found on sites like Facebook. How
Chapter 6
Frequent Itemsets
We turn in this chapter to one of the major families of techniques for characterizing data: the discovery of frequent itemsets. This problem is often viewed as
the discovery of association rules, although the latter is a more c