Lecture 1
Introduction to Algorithms
1.1
Overview
The purpose of this lecture is to give a brief overview of the topic of Algorithms and the kind of
thinking it involves: why we focus on the subjects that we do, and why we emphasize proving
guarantees. We
PCA Derivation
(Optional)
Eigendecomposition
All covariance matrices have an eigendecomposition
C_X = U Λ U^T (eigendecomposition)
U is d × d (columns are eigenvectors, sorted by their eigenvalues)
Λ is d × d (diagonals are eigenvalues, off-diagonals are zero)
Eige
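As a small illustration of the eigendecomposition described above, the following sketch builds a covariance matrix from made-up data and checks that C_X = U Λ U^T holds numerically. The data, the 1/n normalization, and the choice of decreasing eigenvalue order are assumptions made for the example, not details taken from the slides.

```python
import numpy as np

# Toy data: n = 200 observations with d = 3 features (values are made up).
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# Center each feature, then form the d x d sample covariance matrix C_X.
X_centered = X - X.mean(axis=0)
C_X = (X_centered.T @ X_centered) / X.shape[0]

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, U = np.linalg.eigh(C_X)

# Re-sort so the columns of U are ordered by decreasing eigenvalue (an assumption).
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]
Lambda = np.diag(eigenvalues)

# Check the decomposition C_X = U Lambda U^T.
reconstruction_ok = np.allclose(C_X, U @ Lambda @ U.T)
```

Because C_X is symmetric, the columns of U come out orthonormal, so U^T U is the identity matrix.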
Big Data Analysis with
Apache Spark
UC Berkeley
This Lecture: Relation between Variables
An association
A trend
Positive association or Negative association
A pattern
Could be any discernible shape
Could be Linear or Non-linear
Visualize, then quantify
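One common way to quantify a linear association is the Pearson correlation coefficient. The sketch below is an assumed illustration (the slides do not name a specific measure here) using made-up points: a positive trend, a negative trend, and a non-linear pattern whose linear correlation is zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data for illustration only.
xs = [-3, -2, -1, 0, 1, 2, 3]
r_positive = pearson_r(xs, [2 * x + 1 for x in xs])   # increasing line
r_negative = pearson_r(xs, [-x for x in xs])          # decreasing line
r_quadratic = pearson_r(xs, [x * x for x in xs])      # symmetric U-shape
```

The U-shaped pattern is clearly discernible in a plot yet has zero linear correlation, which is one reason to visualize before quantifying.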
Neuroscience
Introduction
The brain
As humans, we can identify galaxies light
years away, we can study particles smaller
than an atom. But we still haven't unlocked
the mystery of the three pounds of matter
that sits between our ears.
President Obama
The b
Big Data Analysis with
Apache Spark
UC Berkeley
This Lecture
Resilient Distributed Datasets (RDDs)
Creating an RDD
Spark RDD Transformations and Actions
Spark RDD Programming Model
Spark Shared Variables
Review: Python Spark (pySpark)
We are using the Pyt
Big Data Analysis with
Apache Spark
UC Berkeley
Data Science Roles
This Lecture
Data Cleaning
Data Quality: Problems, Sources, and Continuum
Data Gathering, Delivery, Storage, Retrieval, Mining/Analysis
Data Quality Constraints and Metrics
Data Integratio
Big Data Analysis with
Apache Spark
UC Berkeley
This Lecture
Course Objectives and Prerequisites
Brief History of Data Analysis
Correlation, Causation, and Confounding Factors
Big Data and Data Science: Why All the Excitement?
So What is Data Science?
Doin
Online Advertising
Online Advertising is Big Business
Multiple billion dollar industry
$43B in 2013 in USA, 17% increase over 2012
[PWC, Internet Advertising Bureau, April 2013]
Higher revenue in USA than cable TV and nearly
the same as broadcast TV
[PWC
RDD Fundamentals
Workloads
DataFrames API and Spark SQL
Spark Streaming
RDD API
Spark Core
Data Sources
MLlib
GraphX
Figure: Driver Program and two Worker Machines (W), each with an executor (Ex), RDDs, and tasks (T)
Resilient Distributed Datasets (RDDs)
Write programs in te
Lecture 2
Asymptotic Analysis and Recurrences
2.1
Overview
In this lecture we discuss the notion of asymptotic analysis and introduce O, Ω, Θ, and o notation.
We then turn to the topic of recurrences, discussing several methods for solving them. Recurrences
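To make the link between recurrences and asymptotic notation concrete, here is a minimal sketch that evaluates one recurrence exactly and compares it with n log n. The recurrence T(n) = 2T(n/2) + n with T(1) = 1 is an assumed example chosen for illustration, not necessarily one the lecture uses.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def T(n: int) -> int:
    """T(n) = 2 T(n/2) + n with T(1) = 1, defined for n a power of two."""
    if n == 1:
        return 1
    return 2 * T(n // 2) + n

def ratio_to_n_log_n(n: int) -> float:
    """T(n) divided by n * log2(n); a bounded ratio suggests T(n) is Θ(n log n)."""
    return T(n) / (n * math.log2(n))

# Ratios for n = 2, 4, ..., 2^20.
ratios = [ratio_to_n_log_n(2 ** k) for k in range(1, 21)]
```

For n = 2^k this recurrence solves to T(n) = n(k + 1), so the ratio is (k + 1)/k and tends to 1 as n grows.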
Lecture 3
Probabilistic Analysis and
Randomized Quicksort
3.1
Overview
In this lecture we begin by introducing randomized (probabilistic) algorithms and the notion of
worst-case expected time bounds. We make this concrete with a discussion of a randomized
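A minimal sketch of a randomized quicksort, under the assumption that the pivot is chosen uniformly at random from the current sublist. It favors clarity over in-place partitioning, and the function name and `rng` parameter are illustrative.

```python
import random

def randomized_quicksort(items, rng=random):
    """Return a sorted copy of items, choosing each pivot uniformly at random."""
    if len(items) <= 1:
        return list(items)
    pivot = rng.choice(items)
    # Three-way partition around the pivot handles duplicate keys cleanly.
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return randomized_quicksort(less, rng) + equal + randomized_quicksort(greater, rng)
```

Because the pivot is random, no fixed input forces bad splits every time; the running time depends on the algorithm's coin flips rather than on the input order.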
Lecture 4
Selection (deterministic &
randomized): finding the median in
linear time
4.1
Overview
Given an unsorted array, how quickly can one find the median element? Can one do it more quickly
than by sorting? This was solved affirmatively in 1972 by (Manuel) B
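The sketch below shows only the randomized variant of selection (often called quickselect), assuming a uniformly random pivot; it is not the deterministic 1972 algorithm. Function names and the lower-median convention are choices made for this example.

```python
import random

def quickselect(items, k, rng=random):
    """Return the k-th smallest element of items (k = 0 is the minimum)."""
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    candidates = list(items)
    while True:
        pivot = rng.choice(candidates)
        less = [x for x in candidates if x < pivot]
        equal = [x for x in candidates if x == pivot]
        if k < len(less):
            # The answer lies among the elements smaller than the pivot.
            candidates = less
        elif k < len(less) + len(equal):
            return pivot
        else:
            # Discard the pivot and everything smaller; shift the rank to match.
            k -= len(less) + len(equal)
            candidates = [x for x in candidates if x > pivot]

def median(items):
    """Lower median: the element of rank (n - 1) // 2 in sorted order."""
    return quickselect(items, (len(items) - 1) // 2)
```

Unlike sorting, each round recurses into only one side of the partition, which is where the expected linear running time comes from.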
Lecture 5
Concrete models and tight
upper/lower bounds
5.1
Overview
In this lecture, we will examine some simple, concrete models of computation, each with a precise
definition of what counts as a step, and try to get tight upper and lower bounds for a numb
Game Theory
15-451
- Zero-sum games
- General-sum games
Plan for Today
09/13/12
2-Player Zero-Sum Games (matrix games)
Minimax optimal strategies
Connection to randomized algorithms
Minimax theorem and proof
Game Theory
General-Sum Games (bimatrix ga
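As a rough illustration of a minimax optimal strategy in a 2-player zero-sum matrix game, the sketch below searches a grid of mixed strategies for the row player. Matching pennies is an assumed example game, and grid search is a simple stand-in for illustration rather than the method the lecture develops.

```python
# Payoff to the row player; the column player receives the negative.
# Matching pennies is an assumed example, not taken from the lecture.
PAYOFF = [[1, -1],
          [-1, 1]]

def guaranteed_payoff(p, payoff=PAYOFF):
    """Worst-case expected payoff when row 0 is played with probability p."""
    column_values = [
        p * payoff[0][j] + (1 - p) * payoff[1][j]
        for j in range(len(payoff[0]))
    ]
    # The column player picks whichever column hurts the row player most.
    return min(column_values)

def best_mixed_strategy(steps=100, payoff=PAYOFF):
    """Grid search over p in {0, 1/steps, ..., 1} for the maximin strategy."""
    grid = [i / steps for i in range(steps + 1)]
    best_p = max(grid, key=lambda p: guaranteed_payoff(p, payoff))
    return best_p, guaranteed_payoff(best_p, payoff)

best_p, value = best_mixed_strategy()
```

For matching pennies the search settles on playing each row with probability 1/2, which guarantees an expected payoff of 0 no matter what the column player does.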
Lecture 7
Universal and Perfect Hashing
7.1
Overview
Hashing is a great practical tool, with an interesting and subtle theory too. In addition to its use as
a dictionary data structure, hashing also comes up in many different areas, including cryptography
a
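To give a flavor of universal hashing, here is a sketch of one standard universal family, h(x) = ((a x + b) mod p) mod m with random a and b. This construction is swapped in for illustration and may differ from the one the lecture presents; the prime `P`, the key range, and the function names are assumptions.

```python
import random

# The Mersenne prime 2^31 - 1; keys are assumed to be integers in [0, P).
P = 2_147_483_647

def make_hash(m, rng=random):
    """Draw a random hash function into m buckets from the family above."""
    a = rng.randrange(1, P)   # a is nonzero
    b = rng.randrange(0, P)

    def h(x):
        return ((a * x + b) % P) % m

    return h

def estimate_collision_rate(x, y, m, trials=20000, seed=0):
    """Fraction of randomly drawn hash functions under which x and y collide."""
    rng = random.Random(seed)
    collisions = 0
    for _ in range(trials):
        h = make_hash(m, rng)
        if h(x) == h(y):
            collisions += 1
    return collisions / trials
```

For any fixed pair of distinct keys, the collision probability over the random choice of a and b is at most 1/m, so `estimate_collision_rate(12345, 67890, 10)` should come out near or below 0.1.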
Lecture 8
Amortized Analysis
8.1
Overview
In this lecture we discuss a useful form of analysis, called amortized analysis, for problems in which
one must perform a series of operations, and our goal is to analyze the time per operation. The
motivation for
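A classic illustration of this kind of analysis is an array that doubles its capacity when full. The sketch below is an assumed example (the class name and cost model are choices made here): it charges 1 unit per element written and 1 unit per element copied during a resize, then tracks the total over a series of appends.

```python
class DoublingArray:
    """Append-only array that doubles its capacity whenever it fills up."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.slots = [None]
        self.total_cost = 0  # 1 per element written, 1 per element copied

    def append(self, value):
        if self.size == self.capacity:
            # Resize: copy every existing element into a table twice as large.
            new_slots = [None] * (2 * self.capacity)
            for i in range(self.size):
                new_slots[i] = self.slots[i]
            self.total_cost += self.size
            self.slots = new_slots
            self.capacity *= 2
        self.slots[self.size] = value
        self.size += 1
        self.total_cost += 1

def cost_of_appends(n):
    """Total cost of n appends starting from an empty array."""
    arr = DoublingArray()
    for i in range(n):
        arr.append(i)
    return arr.total_cost
```

A single append can cost as much as the current size, but the copies over n appends sum to less than 2n, so the total stays below 3n and the amortized cost per append is constant.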
Linear Regression
Regression
Goal: Learn a mapping from observations (features) to
continuous labels given a training set (supervised learning)
Example: Height, Gender, Weight → Shoe Size
Audio features → Song year
Processes, memory → Power consumption
Histo
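As a rough sketch of learning a mapping from a feature to a continuous label, the example below fits a line by least squares with NumPy. The data are synthetic and the coefficients 0.25 and -2.0 are invented for the illustration; a single feature (height) stands in for the fuller feature set in the slide's example.

```python
import numpy as np

# Synthetic training set: heights in cm and noisy "shoe sizes".
# The true slope and intercept below are made up for illustration.
rng = np.random.default_rng(0)
heights_cm = rng.uniform(150, 200, size=50)
true_w, true_b = 0.25, -2.0
shoe_sizes = true_w * heights_cm + true_b + rng.normal(scale=0.5, size=50)

# Design matrix with a column of ones so the model includes an intercept.
A = np.column_stack([heights_cm, np.ones_like(heights_cm)])

# Least-squares fit: minimize the sum of squared errors ||A [w, b] - y||^2.
(w, b), *_ = np.linalg.lstsq(A, shoe_sizes, rcond=None)

def predict(height_cm):
    """Predicted label for a new observation under the fitted line."""
    return w * height_cm + b
```

With this little noise the fitted slope and intercept land close to the values used to generate the data, and `predict` can then be applied to heights not seen in training.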