Lecture 1
Introduction to Algorithms
1.1
Overview
The purpose of this lecture is to give a brief overview of the topic of Algorithms and the kind of
thinking it involves: why we focus on the subjects that we do, and why we emphasize proving
guarantees. We
Clustering &
Retrieval:
A machine learning perspective
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization
Part of a specialization
Recap &
Look ahead
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
What we've learned
Latent Dirichlet
Allocation: Mixed
Membership Modeling
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
Mixed membership models
for documents
Clustering:
Grouping
Related Docs
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
Motivating clustering approaches
Nearest Neighbor
Search:
Retrieving Documents
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
Retrieving documents of interest
Mixture Models:
Model-Based Clustering
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
Why a probabilistic approach?
Big Data Analysis with
Apache Spark
UC Berkeley
This Lecture
Course Objectives and Prerequisites
Brief History of Data Analysis
Correlation, Causation, and Confounding Factors
Big Data and Data Science: Why All the Excitement?
So What is Data Science?
Doin
Big Data Analysis with
Apache Spark
UC Berkeley
Data Science Roles
This Lecture
Data Cleaning
Data Quality: Problems, Sources, and Continuum
Data Gathering, Delivery, Storage, Retrieval, Mining/Analysis
Data Quality Constraints and Metrics
Data Integratio
Big Data Analysis with
Apache Spark
UC Berkeley
This Lecture
Resilient Distributed Datasets (RDDs)
Creating an RDD
Spark RDD Transformations and Actions
Spark RDD Programming Model
Spark Shared Variables
Review: Python Spark (pySpark)
We are using the Pyt
Lecture 2
Asymptotic Analysis and Recurrences
2.1
Overview
In this lecture we discuss the notion of asymptotic analysis and introduce O, Ω, Θ, and o notation.
We then turn to the topic of recurrences, discussing several methods for solving them. Recurrences
Lecture 3
Probabilistic Analysis and
Randomized Quicksort
3.1
Overview
In this lecture we begin by introducing randomized (probabilistic) algorithms and the notion of
worst-case expected time bounds. We make this concrete with a discussion of a randomized
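
As a rough illustration of the randomized quicksort named in this lecture's title, here is a minimal Python sketch. It assumes the usual formulation in which the pivot is drawn uniformly at random from the current sublist; the function name `randomized_quicksort` and the optional `rng` parameter are illustrative choices, not taken from the notes. This list-building version favors clarity over the in-place partitioning a textbook implementation would typically use.

```python
import random


def randomized_quicksort(items, rng=None):
    """Return a new sorted list, choosing each pivot uniformly at random."""
    if rng is None:
        rng = random.Random()
    if len(items) <= 1:
        return list(items)
    # Random pivot: the expected running time bound then holds for every input.
    pivot = items[rng.randrange(len(items))]
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return randomized_quicksort(less, rng) + equal + randomized_quicksort(greater, rng)
```

Because the randomness comes from the algorithm rather than from the input, no single input is consistently bad, which is the sense of a worst-case expected time bound.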
Lecture 4
Selection (deterministic &
randomized): finding the median in
linear time
4.1
Overview
Given an unsorted array, how quickly can one find the median element? Can one do it more quickly
than by sorting? This was solved affirmatively in 1972 by (Manuel) B
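
To make the question concrete, the sketch below shows the randomized variant of selection (often called quickselect), which finds the element of a given rank without fully sorting. It is a minimal sketch under assumptions: the names `quickselect` and `median` are illustrative, the "median" is taken to be the lower median, and the deterministic linear-time method mentioned above is not shown.

```python
import random


def quickselect(items, k, rng=None):
    """Return the element of rank k (k = 0 is the minimum) in expected linear time."""
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    if rng is None:
        rng = random.Random()
    candidates = list(items)
    while True:
        pivot = candidates[rng.randrange(len(candidates))]
        less = [x for x in candidates if x < pivot]
        equal = [x for x in candidates if x == pivot]
        if k < len(less):
            # The answer lies among the elements smaller than the pivot.
            candidates = less
        elif k < len(less) + len(equal):
            return pivot
        else:
            # Discard the smaller and equal elements and adjust the rank.
            k -= len(less) + len(equal)
            candidates = [x for x in candidates if x > pivot]


def median(items):
    """Lower median: the element of rank (n - 1) // 2."""
    return quickselect(items, (len(items) - 1) // 2)
```

Unlike quicksort, only one side of each partition is examined further, which is why the expected work is linear rather than n log n.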
Lecture 5
Concrete models and tight
upper/lower bounds
5.1
Overview
In this lecture, we will examine some simple, concrete models of computation, each with a precise
definition of what counts as a step, and try to get tight upper and lower bounds for a numb
Game Theory
15-451
- Zero-sum games
- General-sum games
Plan for Today
09/13/12
2-Player Zero-Sum Games (matrix games)
Minimax optimal strategies
Connection to randomized algorithms
Minimax theorem and proof
Game Theory
General-Sum Games (bimatrix ga
Lecture 7
Universal and Perfect Hashing
7.1
Overview
Hashing is a great practical tool, with an interesting and subtle theory too. In addition to its use as
a dictionary data structure, hashing also comes up in many different areas, including cryptography
a
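
As one hedged illustration of what a universal hash family can look like, the sketch below draws a random function of the form h(x) = ((a·x + b) mod p) mod m. This is a standard textbook family for integer keys and is an assumption here; the lecture may use a different construction. The names `P` and `make_hash` are hypothetical.

```python
import random

# Assumption: keys are integers in the range [0, P). P is the prime 2**31 - 1.
P = 2_147_483_647


def make_hash(m, rng=None):
    """Draw one hash function into m buckets from the family ((a*x + b) mod P) mod m."""
    if rng is None:
        rng = random.Random()
    a = rng.randrange(1, P)  # a is nonzero
    b = rng.randrange(0, P)

    def h(x):
        return ((a * x + b) % P) % m

    return h
```

The point of choosing `a` and `b` at random is that the collision guarantee holds over the choice of function, for any fixed pair of distinct keys, rather than relying on the keys themselves being random.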
Lecture 8
Amortized Analysis
8.1
Overview
In this lecture we discuss a useful form of analysis, called amortized analysis, for problems in which
one must perform a series of operations, and our goal is to analyze the time per operation. The
motivation for
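
A common illustration of amortized analysis, offered here as an assumed example rather than one taken from these notes, is an array that doubles its capacity when full. The class name `GrowableArray` and the cost model (one unit per element written or copied) are illustrative choices.

```python
class GrowableArray:
    """Append-only array that doubles its capacity whenever it fills up."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.slots = [None]
        self.total_cost = 0  # one unit per element written or copied

    def append(self, value):
        if self.size == self.capacity:
            # Expensive step: copy every existing element into a larger array.
            new_slots = [None] * (2 * self.capacity)
            for i in range(self.size):
                new_slots[i] = self.slots[i]
            self.total_cost += self.size
            self.slots = new_slots
            self.capacity *= 2
        self.slots[self.size] = value
        self.size += 1
        self.total_cost += 1
```

A single append can cost time proportional to the current size, but the copies over n appends sum to less than 2n, so the total cost stays below 3n and the amortized cost per operation is constant.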
Linear Regression
Regression
Goal: Learn a mapping from observations (features) to
continuous labels given a training set (supervised learning)
Example: Height, Gender, Weight → Shoe Size
Audio features → Song year
Processes, memory → Power consumption
Histo
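
The goal stated above, learning a mapping from features to a continuous label, can be sketched for the simplest case of one feature and a straight-line model fitted by ordinary least squares. This is a minimal sketch under assumptions: the function name `fit_line` is hypothetical, and the closed-form one-feature fit stands in for whatever method the slides go on to describe.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ≈ slope * x + intercept for a single feature."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Given a training set of observed pairs, the returned `slope` and `intercept` define the learned mapping, and a prediction for a new observation `x` is `slope * x + intercept`.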
RDD Fundamentals
Workloads
DataFrames API and Spark SQL
Spark Streaming
RDD API
Spark Core
Data Sources
MLlib
GraphX
[Figure: a Driver Program and two Worker Machines; each worker box is labeled W, Ex, RDD, RDD, T, T]
Resilient Distributed Datasets (RDDs)
Write programs in te
Online Advertising
Online Advertising is Big Business
Multi-billion dollar industry
$43B in 2013 in USA, 17% increase over 2012
[PWC, Internet Advertising Bureau, April 2013]
Higher revenue in USA than cable TV and nearly
the same as broadcast TV
[PWC
PCA Derivation
(Optional)
Eigendecomposition
All covariance matrices have an eigendecomposition
$C_X = U \Lambda U^\top$ (eigendecomposition)
$U$ is $d \times d$ (columns are eigenvectors, sorted by their eigenvalues)
$\Lambda$ is $d \times d$ (diagonals are eigenvalues, off-diagonals are zero)
Eige