Mining of Massive Datasets
Jure Leskovec, Stanford Univ.
Anand Rajaraman, Milliway Labs
Jeffrey D. Ullman, Stanford Univ.
Copyright © 2010, 2011, 2012, 2013, 2014 Anand Rajaraman, Jure Leskovec,
and Jeffrey D. Ullman
Preface
This book evolved from materi
COMP9318: Data Warehousing
and Data Mining
L6: Association Rule Mining
Modified from Prof. Jiawei Han's Slides
Chapter 6: Mining Association Rules
in Large Databases
- Association rule mining
- Algorithms for
FP-tree Essential Idea
Illustrate the FP-tree algorithm without
using the FP-tree
FreqPattern(DB):
    If boundary condition, then
    X = FindLocallyFrequentItems(DB)
    Also output each item in X (appended with the conditional pattern)
    Also gets rid of items
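The recursion outlined above can be sketched in Python as a pattern-growth procedure over projected (conditional) databases, with no FP-tree involved. This is only an illustrative sketch: the name `freq_pattern`, the `min_sup` parameter, and the alphabetical item ordering are assumptions, not taken from the slides.

```python
from collections import Counter

def freq_pattern(db, min_sup, suffix=()):
    """Yield (pattern, support_count) pairs by recursive database projection."""
    # Count each item once per transaction to find the locally frequent items.
    counts = Counter(item for t in db for item in set(t))
    frequent = sorted(i for i, c in counts.items() if c >= min_sup)
    for idx, item in enumerate(frequent):
        # Output the item appended with the current conditional pattern.
        pattern = suffix + (item,)
        yield pattern, counts[item]
        # Conditional DB: transactions containing `item`, restricted to the
        # frequent items ordered before it (infrequent items are dropped).
        allowed = set(frequent[:idx])
        cond_db = [[i for i in t if i in allowed] for t in db if item in t]
        cond_db = [t for t in cond_db if t]
        if cond_db:  # boundary condition: stop when nothing is left to project
            yield from freq_pattern(cond_db, min_sup, pattern)
```

Calling `list(freq_pattern(db, 2))` on a small list of transactions enumerates every itemset contained in at least two of them, each paired with its support count.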
COMP9318 Review
Wei Wang @ UNSW
October 24, 2016
Course Logistics
- THE formula:
    mark = 0.55 × exam + 0.15 × (ass1 + ass2 + proj1)
    mark = FL, if exam < 40
- ass2 will be marked ASAP; we aim to deliver the result before the
  exam
- Pre-exam consultations:
Logistic Regression and MaxEnt
Wei Wang @ CSE, UNSW
September 22, 2016
Generative vs. Discriminative Learning
- Generative models:
    $\Pr[y \mid x] = \dfrac{\Pr[x \mid y]\,\Pr[y]}{\Pr[x]}$
    $\Pr[x \mid y]\,\Pr[y] = \Pr[x, y]$
- Discriminative models:
The key is to mod
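As an illustration of the generative route, the sketch below applies Bayes' rule to turn a class prior Pr[y] and a class-conditional likelihood Pr[x | y] into the posterior Pr[y | x]. The function name `posterior` and the toy probabilities are hypothetical and are not taken from the lecture.

```python
def posterior(prior, likelihood):
    """Return Pr[y | x] for one observed x.

    prior:      {y: Pr[y]}
    likelihood: {y: Pr[x | y]} evaluated at the observed x
    """
    joint = {y: likelihood[y] * prior[y] for y in prior}  # Pr[x, y]
    evidence = sum(joint.values())                        # Pr[x]
    return {y: joint[y] / evidence for y in joint}        # Pr[y | x]

# Hypothetical two-class example.
post = posterior({"spam": 0.2, "ham": 0.8}, {"spam": 0.5, "ham": 0.1})
```

A discriminative model would instead parameterise Pr[y | x] directly, without modelling Pr[x | y] or Pr[y].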
COMP9318: Data Warehousing
and Data Mining
L7a: Classification and Prediction
Chapter 7. Classification and Prediction
- What is classification? What is prediction?
- Classification by decision tree induction
- Bayesia
COMP9318 Assignment 1
Due Date: 23:59 11 Sept, 2016 (SUN)
Note
Modified parts are marked in this style.
DESCRIPTION
Q1 (40%)
Consider the following ER diagram and data.¹
[ER diagram; visible labels: year, lecturer, session, cname, cid, Course, grade, Offers, Enrolls, sname, start date, sid, did]
COMP9318: Data Warehousing
and Data Mining
L2: Data Warehousing and OLAP
- Why and What are Data Warehouses?
Data Analysis Problems
- The same data found in many different systems
  - Example: customer data across different departments
- The same concep
Name: (Family name), (Given name)
Student ID:
THE UNIVERSITY OF NEW SOUTH WALES
Final Exam
COMP9318
Data Warehousing and Data Mining
SESSION 1, 2008
Time allowed: 10 minutes reading time + 3 hours
Total number of questions: 7 + 1
Total number of marks: 1
COMP9318 Tutorial 3: Clustering
Wei Wang @ UNSW
Q1 I
Consider eight tuples represented as points in the two dimensional space as
follows:
[Figure: points a, b, c, d, e, f, g, h plotted on a two-dimensional grid]
Assume that (1) each point lies within the center of the grid; (2) the grid is a
uniform partition of
COMP9318 Tutorial 4: Association Rule Mining
Wei Wang @ UNSW
Q1 I
Show that if A → B does not meet the minconf constraint, A → BC does not
either.
Solution to Q1 I
$$\mathrm{conf}(A \to BC) = \frac{\mathrm{supp}(ABC)}{\mathrm{supp}(A)} \le \frac{\mathrm{supp}(AB)}{\mathrm{supp}(A)} = \mathrm{conf}(A \to B)$$
Like Apriori, we can utilize th
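The inequality in Q1 can be checked numerically. The sketch below is illustrative only: the helper names `supp_count` and `conf` and the five toy transactions are hypothetical, not part of the tutorial.

```python
def supp_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if set(itemset) <= set(t))

def conf(lhs, rhs, db):
    """Confidence of the rule lhs -> rhs, i.e. supp(lhs ∪ rhs) / supp(lhs)."""
    return supp_count(set(lhs) | set(rhs), db) / supp_count(lhs, db)

# Hypothetical toy database.
db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A"}]
c_ab = conf({"A"}, {"B"}, db)        # 2 / 4
c_abc = conf({"A"}, {"B", "C"}, db)  # 1 / 4
```

Because every transaction containing ABC also contains AB, `c_abc` can never exceed `c_ab`, whatever database is used.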
COMP9318 Tutorial 2: Classification
Wei Wang @ UNSW
September 22, 2016
Q1 I
Consider the following training dataset and the original decision tree induction
algorithm (ID3).
Risk is the class label attribute. The Height values have been already
discretize
COMP9318 Tutorial 1
Wei WANG
The University of New South Wales
[email protected]
Data Warehouse and OLAP
A
Q1
Create a star schema diagram that will enable FITWORLD GYM
INC. to analyze their revenue.
The fact table will include for every instance o
COMP9318 Tutorial 4: Association Rule Mining
Wei Wang @ UNSW
Q1 I
Show that if A → B does not meet the minconf constraint, A → BC does not
either.
Q2 I
Given the following transactional database
TID  Items
1    C, B, H
2    B, F, S
3    A, F, G
4    C, B, H
5    B, F, G
6    B, E, O
1. W
A Brief MDX Tutorial Using Mondrian
Wei Wang
weiw AT cse.unsw.edu.au
School of Computer Science & Engineering
University of New South Wales
[Pentaho].[Mondrian]
Pentaho: Open source business intelligence suite
Mondrian 
Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram
  - A tree-like diagram that records the sequences of merges or splits
[Figure: six points (1-6) shown as nested clusters and as a dendrogram with merge heights from 0 to 0.2]
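To make the "sequence of merges" concrete, here is a minimal agglomerative sketch. It assumes single-link distance and Euclidean points; both choices, and the name `single_link`, are assumptions for illustration, since the slide does not fix a linkage criterion.

```python
from itertools import combinations
from math import dist

def single_link(points):
    """Agglomerative clustering; returns the merge sequence (a dendrogram as a list).

    Each entry is (members_of_cluster_a, members_of_cluster_b, merge_distance).
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters whose closest members are nearest.
        (a, b), d = min(
            (((a, b), min(dist(points[i], points[j])
                          for i in clusters[a] for j in clusters[b]))
             for a, b in combinations(clusters, 2)),
            key=lambda pair_d: pair_d[1],
        )
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges
```

Reading the returned list from first to last entry corresponds to reading a dendrogram from the leaves up to the root; the third field of each entry is the height at which the merge is drawn.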
Takeaway Message
Multidimensional model is designed for data
analyses
DB vs DW/OLAP
Item
Database
Conceptual model Entity-Relationship
ROLAP
MOLAP
Multidimensional
Logical model
Relational (3NF+)
Physical model
Tables
COMP9318: Data Warehousing
and Data Mining
L2b: Data Warehousing and OLAP
Modified from Prof. Jiawei Han's Slides
Relational View of Data Cube
Sales, indexed by Store and Product:
        1     2     3     4     ALL
1       454   -     -     925   1379
2       468   800   -     -     126
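The ALL entries in the cube view above are aggregates over one or both dimensions. The sketch below shows one way to compute them; the function name `cube`, the neutral `row`/`col` naming, and the toy fact table are assumptions for illustration (the cell values only loosely follow the visible part of the slide).

```python
from collections import defaultdict
from itertools import product

def cube(facts):
    """facts: {(row, col): sales}. Returns sums for every (row|ALL, col|ALL) cell."""
    out = defaultdict(int)
    for (row, col), value in facts.items():
        # Each base fact contributes to four cells: itself, its row total,
        # its column total, and the grand total.
        for key in product((row, "ALL"), (col, "ALL")):
            out[key] += value
    return dict(out)

# Hypothetical toy fact table.
cells = cube({(1, 1): 454, (1, 4): 925, (2, 1): 468, (2, 2): 800})
```

Here `cells[(1, "ALL")]` is 454 + 925 = 1379, the same kind of marginal total that the ALL column reports.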
COMP9318: Data Warehousing
and Data Mining
L2a: Data Warehousing and OLAP
Modified from Slides of
Prof. Jiawei Han and Dr. Yannis Kotidis
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
n
What
COMP9318: Data Warehousing
and Data Mining
L3: Data Preprocessing and Data Cleaning
Abridged from Prof. Jiawei Han's Slides
Chapter 3: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integ
About COMP9318 (2016 s2)
Wei Wang @ CSE, UNSW
July 26, 2016
Introduction
Lecturer-in-charge:
A/Prof. Wei Wang
School of Computer Science and Engineering
Office: K17 507
Email: [email protected]
Ext: 9385 7162
http://www.cse.unsw.edu.au/~weiw
Research Inter
COMP9318: Data Warehousing
and Data Mining
L1: Introduction
Modified from Prof. Jiawei Han's Slides
Textbook
Cover page of the
1st ed.
Chapter 1. Introduction
- Motivation: Why data mining?
- What is data mining?
- Data Mining: On what kind of data?
- Da
# L7: Information Extraction
import nltk

def ie_preprocess(document):
    """Split a document into sentences, tokenize each one, and POS-tag the tokens."""
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

def tokpos2
COMP9318 Assignment 1
Due Date: 23:59 11 Sept, 2016 (SUN)
DESCRIPTION
Q1 (40%)
Consider the following ER diagram and data.¹
[ER diagram; visible labels: year, lecturer, session, cname, cid, Course, grade, Offers, Enrolls, sname, sid, start date, did, Majors, Department, dname, street, location, suburb]
ID: z3471402 Jiang Quan
COMP9318 Assignment 2
Question 1
(1)
The original class table is shown at right.
The class column contains 6 examples of the + class and 4 of the − class.
The entropy of this whole sample is:
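As a check on this step, the following sketch (not part of the original answer) computes the entropy of a sample with 6 positive and 4 negative examples from the standard definition H = -Σ p_i log2 p_i. The function name `entropy` is an assumption.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

h = entropy([6, 4])
print(round(h, 3))  # prints 0.971
```

The same function can be reused on the class counts of each attribute partition when computing information gain.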
Then we consider the different cases for A, B and C.
For A, we h