Practicing Data Science
Bill Basener
wfb2m@eservices.virginia.edu
UVA Data Science Institute
Main Data Science Tasks:
Exploratory Data Analysis
Dimension Reduction / Feature Extraction
Mahalanobis Distance from a sample mean is the most common anomaly det
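A minimal sketch of the Mahalanobis distance from a sample mean, restricted to 2-D data so the covariance inverse can be written out by hand. The function name `mahalanobis_2d` and the use of the sample (n - 1) covariance are assumptions made for illustration, not something the slides specify.

```python
from statistics import mean

def mahalanobis_2d(point, data):
    """Mahalanobis distance from `point` to the sample mean of 2-D `data`."""
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    mx, my = mean(xs), mean(ys)
    n = len(data)
    # Sample covariance entries (n - 1 denominator; an assumption).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(S) * [dx dy]^T with the 2x2 inverse expanded.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 ** 0.5
```

Points with a large distance are candidates for anomalies; the cutoff is a separate choice that this sketch does not make.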

More on two-sample t-tests
I. Quick review:
A. The basic procedure for hypotheses tests:
- set up your hypotheses
- decide on
- verify assumptions (which is what's covered in this set of notes)
- calculate t* (or whatever your test statistic is)
- compa

What to do if your assumptions fail
I. Review - assumptions of the t-test:
1) equal variances - usually you don't even bother to check this; unless you have a good reason
to think the variances are equal just use the t-test for unpooled variances.
2) data
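The unpooled-variance t-test mentioned in 1) can be sketched as follows. This is the standard Welch form with the Welch-Satterthwaite degrees of freedom; the function name `welch_t` is hypothetical, and the sketch stops at t* and the degrees of freedom rather than looking up a p-value.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for the two-sample t-test with unpooled variances."""
    # Squared standard error contributed by each sample.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

The degrees of freedom are usually not an integer; t* is then compared against a t table or a software routine.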

Supplement on paired tests:
There are two other simple tests available for paired data:
The sign test.
- this works under almost any circumstances, but the power is not terribly great.
Here's how it works:
- get the sign of the difference in your paired sa
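A minimal sketch of the sign test, assuming the usual conventions: differences of zero are dropped, and under H0 the number of positive signs is Binomial(n, 1/2). The function name `sign_test` and the two-sided p-value are illustrative choices, not taken from the notes.

```python
from math import comb

def sign_test(before, after):
    """Return (positives, negatives, two-sided p-value) for paired data."""
    diffs = [a - b for b, a in zip(before, after)]
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg                 # ties (zero differences) are dropped
    k = min(pos, neg)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return pos, neg, min(1.0, 2 * tail)
```

Only the signs enter the calculation, which is why the test needs so few assumptions and also why its power is modest.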

One-sided tests
I. So what is a one-sided test?
1) in our original set-up, we formulated a hypothesis such as:
H0: $\mu_1 = \mu_2$
our alternative was:
H1: $\mu_1 \neq \mu_2$
we also discussed, briefly, that one could have:
H1: $\mu_1 < \mu_2$
or
H1: $\mu_1 > \mu_2$
2) The last two alternatives give
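To see how the choice of alternative changes the p-value, here is a rough sketch that uses a normal approximation in place of the t distribution (an assumption that is only reasonable for large samples). The statistic `z` is taken to be (mean of sample 1 - mean of sample 2) / SE, and the function name `p_values` is hypothetical.

```python
from statistics import NormalDist

def p_values(z):
    """P-values for the three alternatives, given a standardized statistic z."""
    cdf = NormalDist().cdf(z)
    return {
        "two_sided": 2 * min(cdf, 1 - cdf),  # H1: means differ
        "less": cdf,                         # H1: mean 1 < mean 2
        "greater": 1 - cdf,                  # H1: mean 1 > mean 2
    }
```

When the data fall in the direction of the alternative, the one-sided p-value is half the two-sided one.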

Homework 2:
Same instructions apply as on the last homework.
2nd edition references are given in normal type.
[3rd edition references are given in bold and in square brackets].
{4th edition references are given in italics, underlined and in curly brack

Homework assignment # 1
General instructions to be followed on all future homework assignments:
1) You may work with a partner (only one!) on this assignment. If you do, please put both your names
on the assignment and hand in only one copy. Do not copy a

Homework # 4
Note: Please circle your answers when appropriate!
As usual, 2nd edition references are given in normal type following the 3rd edition references. This set of
homework is probably a little less intense than usual - but then you have an exam t

Homework # 3:
Same instructions apply as on the last homework.
2nd edition references are given in normal type.
[3rd edition references are given in bold and in square brackets].
{4th edition references are given in italics, underlined and in curly bra

Homework # 5
Note: Please circle your answers when appropriate!
Note - there are more problems here than usual, but most of them are really pretty simple.
1) 10.3, p. 399 (10.3, p. 392) {9.4.3, p. 357}. But use a non-directional (two-sided) test.
2) 10.

Instructions for article interpretation assignment:
Summary: find an article from the medical literature, summarize it, and give details as to the statistics used.
Details: You need to go through the medical literature (= medical journals) and find an art

INTERVIEWER: Jae Ru
INTERVIEWEE: Sungmin Choi
INSTRUCTIONS Each student will interview and be interviewed twice. A separate report must be turned in for each
interview. The interviewer should write seven unique interview questions for each interview, b

INTERVIEWER: Sungmin Choi
INTERVIEWEE:
INSTRUCTIONS Each student will interview and be interviewed twice. A separate report must be turned in for each
interview. The interviewer should write seven unique interview questions for each interview, based on ma

Study Tips Homework
Name:
Sungmin Choi
At this level, quantity is an issue. Learn to adapt.
Schedule one week at a time, by the hour.
Schedule 35 or more hours of study time per week
Schedule at least 2 hours on each day (35 minimum/week)
Spend 50 minutes

Study Skills Calendar
Name: Sungmin Choi
INSTRUCTIONS:
Plan your week beforehand, for studying, recreational activities, chores, classes, in one hour blocks. Be specific where possible.
On the second page, keep a record of how you actually spent your t

ESSAY HOMEWORK
INSTRUCTIONS: Each student will write a personal statement (essay) suitable for the type of school s/he will be applying
to (DDS, MD, DO, DVM, etc.) An essay written for a past (or current) application cycle may be used. You may also
substi

CHAPTER 5
Sampling Distributions
5.1 The possible values of $\hat{p}$ are 0, 1/3, 2/3, and 1. These correspond to getting 0 persons with lung cancer, 1
with lung cancer, 2 with lung cancer, and all 3 with lung cancer.
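As an illustration of 5.1, the sampling distribution of the sample proportion for n = 3 can be tabulated from the binomial formula. The function name `p_hat_distribution` and the population proportion 0.1 in the usage note are made up for the example.

```python
from fractions import Fraction
from math import comb

def p_hat_distribution(n, p):
    """Map each possible sample proportion k/n to its binomial probability."""
    return {
        Fraction(k, n): comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
    }
```

Calling `p_hat_distribution(3, 0.1)` gives a dictionary whose keys are exactly 0, 1/3, 2/3, and 1, matching the four possible values listed above.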
5.2 (a) $\Pr\{\hat{p} = 0\} = \Pr\{\text{no mutants}\} =$

CHAPTER 6
Confidence Intervals
6.1 (a) $\bar{y} = 1269$; $s = 145$; $n = 8$.
The standard error of the mean is
$SE_{\bar{y}} = \frac{s}{\sqrt{n}} = \frac{145}{\sqrt{8}} = 51.3$ ng/gm.
(b) $\bar{y} = 1269$; $s = 145$; $n = 30$.
The standard error of the mean is
$SE_{\bar{y}} = \frac{s}{\sqrt{n}} = \frac{145}{\sqrt{30}} = 26.5$ ng/gm.
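The two standard errors in 6.1 can be checked with a few lines of code. The helper name `standard_error` is just for illustration.

```python
from math import sqrt

def standard_error(s, n):
    """Standard error of the mean: sample SD divided by sqrt(sample size)."""
    return s / sqrt(n)

print(round(standard_error(145, 8), 1))   # 51.3
print(round(standard_error(145, 30), 1))  # 26.5
```

With the same s, the larger sample gives the smaller standard error.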
6.2 (a) $15/\sqrt{25} = 3.0$

What is Data Science? related fields:
Artificial Intelligence (AI)
Study of systems that perceive their environment and take actions to maximize
success. (percept

Definitions:
Classification: The process of assigning a set of objects to classes. Usually this is
done so that the objects in each class are similar and the clas

K-Nearest Neighbors
Given Training Data with Known Classes:
For each new data point, compute the distance to points in the training data.
Determine which class ma
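A minimal sketch of k-nearest neighbors, assuming Euclidean distance and a majority vote among the k closest training points. The function name `knn_predict`, the default k = 3, and the `(point, label)` layout of the training data are illustrative assumptions.

```python
from collections import Counter
from math import dist

def knn_predict(train, new_point, k=3):
    """Classify `new_point` by majority vote among its k nearest neighbors.

    `train` is a list of (point, label) pairs.
    """
    # Sort the training data by Euclidean distance to the new point.
    nearest = sorted(train, key=lambda item: dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Sorting the whole training set is the simplest approach; it is fine for small data but becomes slow as the training set grows.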

Networks Basic Definition:
DEFINITION: A network is a set of objects (usually called nodes or vertices) and a
collection of relations (usually called edges) betwe
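One common way to store a network in code is an adjacency list: each node maps to the set of nodes it shares an edge with. The sketch below assumes undirected edges, and the function name `build_network` is hypothetical.

```python
from collections import defaultdict

def build_network(edges):
    """Build an adjacency list (node -> set of neighbors) from edge pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)  # undirected edge (an assumption of this sketch)
    return dict(adjacency)
```

For example, `build_network([("A", "B"), ("B", "C")])` has three nodes, and the neighbors of "B" are "A" and "C".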

Example:
How do we compute variance for an n-dimensional random variable?
NOTATION: $N$ = # elements in dataset
Mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Covariance matrix: $\Sigma = \mathrm{Var}(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T$
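A small sketch of the mean vector and covariance matrix for data stored as a list of rows. It divides by the number of elements in the dataset, as the "# elements in dataset" note suggests; many libraries divide by one less than that instead, so treat the normalizer as an assumption. The function names are hypothetical.

```python
def mean_vector(data):
    """Component-wise mean of a list of equal-length rows."""
    n = len(data)
    return [sum(col) / n for col in zip(*data)]

def covariance_matrix(data):
    """Covariance matrix with a 1/N normalizer (N = number of rows)."""
    n = len(data)
    mu = mean_vector(data)
    # Subtract the mean from every row.
    centered = [[x - m for x, m in zip(row, mu)] for row in data]
    dims = len(mu)
    # Entry (i, j) averages the products of centered coordinates i and j.
    return [
        [sum(r[i] * r[j] for r in centered) / n for j in range(dims)]
        for i in range(dims)
    ]
```

The diagonal entries are the variances of the individual coordinates, and the off-diagonal entries are the covariances between pairs of coordinates.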

Suppose we go out and collect some data. Now what?
a) first, figure out what kind of data you have:
categorical/ordinal/quantitative (discrete/continuous).
b) second, how many records (= cases) do you have?
n = sample size
c) third, look at your data (org

Samples, estimates & random sampling
I. Samples and populations.
Suppose we tried to figure out the weights of everyone on campus. How could we do
this?
1) Weigh everyone. Is this practical? Possible? Accurate?
- try counting every word in your textbook.

Descriptive statistics:
Note: I'm assuming you know some basics. If you don't, please read chapter 1 on your own. It's
pretty easy material, and it gives you a good background as to why we need statistics.
First, some definitions:
sample:
- a bunch of dat