High Dimensional Statistics in Genomics
Some Statistical Problems and Solutions
Hongzhe Li
[email protected], http://www.cceb.upenn.edu/hli/
Department of Biostatistics and Epidemiology
Graduate Group in Genomics and Computational Biology
University of Pen
BSTA 670 (Fall 2012) - Statistical Computing
Lectures 9-11
Optimization
Defining the Problem
Given some function F(x), we wish to find values of x for
which F(x) attains a minimum or maximum. We will assume
a minimum; if you want a maximum, just min
BSTA 670 Statistical Computing
27 October 2010
Lecture 13:
Random Number Generation
Presented by:
Paul Wileyto, Ph.D.
A good analog random number generator.
Anyone who uses software to produce random
numbers is in a state of sin.
John von Neumann
Why do w
BSTA 670 (Fall 2012) - Statistical Computing
Lecture 17
Resampling Methods:
The Bootstrap, The Jackknife,
& Cross Validation
Bootstrap Confidence Intervals
CIs for parameter estimates are calculated using statistics
from original sample and bootstrap
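As a rough illustration of combining the original-sample estimate with bootstrap replicates, here is a minimal sketch of a percentile bootstrap interval. The percentile method is only one common choice and is assumed here, not taken from the slides; the function name `bootstrap_percentile_ci`, the sample values, and the defaults (`n_boot`, `alpha`, `seed`) are all hypothetical.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Return (estimate, lower, upper) using the percentile bootstrap.

    Assumption: observations are i.i.d., so resampling with replacement
    from the observed data mimics sampling from the population.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    reps = sorted(
        stat([rng.choice(data) for _ in range(n)]) for _ in range(n_boot)
    )
    # Empirical alpha/2 and 1 - alpha/2 quantiles, clamped to valid indices.
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return stat(data), reps[lo_idx], reps[hi_idx]


# Hypothetical sample, for illustration only.
sample = [4.1, 5.3, 2.8, 6.0, 5.5, 3.9, 4.7, 5.1, 6.4, 3.2]
estimate, lower, upper = bootstrap_percentile_ci(sample, statistics.mean)
print(estimate, lower, upper)
```

The estimate comes from the original sample, while the interval endpoints come from the sorted bootstrap replicates.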
BSTA 670 (Fall 2012) - Statistical Computing
Lectures 7-8
Roots of Nonlinear Equations
Roots of Nonlinear Equations
Given a known function f, the zeros of f are the values of x
that satisfy the equation f(x) = 0; the same values of x are
also cal
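To make the idea of solving f(x) = 0 numerically concrete, here is a minimal sketch using bisection. Bisection is assumed here as a simple illustrative method and may not be the one the lecture emphasizes; the function name `bisect`, the tolerance, and the test function x^2 - 2 are hypothetical choices.

```python
def bisect(f, lo, hi, tol=1e-10, max_iter=200):
    """Approximate a zero of f in [lo, hi].

    Assumptions: f is continuous on [lo, hi] and f(lo), f(hi) have
    opposite signs, so a zero is guaranteed to lie in the interval.
    """
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        # Stop when the midpoint is a zero or the bracket is small enough.
        if f_mid == 0.0 or 0.5 * (hi - lo) < tol:
            return mid
        # Keep the half-interval on which f still changes sign.
        if f_lo * f_mid < 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


# Hypothetical example: the positive zero of f(x) = x^2 - 2.
root = bisect(lambda x: x * x - 2.0, 0.0, 2.0)
print(root)
```

Each iteration halves the bracketing interval, so the error bound shrinks by a factor of two per step regardless of the shape of f.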
Supervised Learning
Evolutionary Computation
Artificial Neural Networks
Support Vector Machines
John H. Holmes, Ph.D.
Center for Clinical Epidemiology and Biostatistics
University of Pennsylvania School of Medicine
CCEB
What's on the agenda for today
Revie
BSTA 670 (Fall 2012) - Statistical Computing
Lecture 5
Concepts of Linear Algebra
Why Study Linear Algebra?
Vector spaces (aka linear algebra) are the natural setting for
many problems in statistics and mathematics, including both
analytic and nume
BSTA 670 (Fall 2012) - Statistical Computing
Lecture 4
Programming and Software
Some Key Issues in Numerical Software
What makes a good program/method/algorithm?
Accuracy and precision of the results
Run time and memory requirements of the software
BSTA 670 (Fall 2012) - Statistical Computing
Lecture 6
Numerical Solution of Linear Equations
Solving Systems of Linear Equations
The problem is to find x such that Ax = b for a given matrix
A and one or more column vectors b.
We will concentrate mos
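The problem of finding x with Ax = b can be sketched for a small dense system as follows. This is a minimal illustration using Gaussian elimination with partial pivoting, assumed here as a representative direct method rather than taken from the slides; `solve_linear`, the singularity threshold, and the 2 x 2 example are hypothetical.

```python
def solve_linear(A, b):
    """Solve Ax = b for a square matrix A (list of rows) and vector b.

    Sketch only: Gaussian elimination with partial pivoting followed by
    back substitution. Assumes A is square and nonsingular.
    """
    n = len(A)
    # Work on an augmented copy [A | b] so the inputs are not modified.
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


# Hypothetical system: 2x + y = 3, x + 3y = 5.
A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
x = solve_linear(A, b)
print(x)
```

For several right-hand sides b with the same A, one would normally factor A once and reuse the factorization instead of repeating the elimination.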
BSTA 670 (Fall 2012) - Statistical Computing
Lectures 2 and 3
Computer Arithmetic and Numerical Precision
Computer Arithmetic and Numerical Precision
Computers can help perform arithmetic rapidly, but caution is
warranted.
Computer numbers differ sig
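A short illustration of why caution is warranted, assuming IEEE 754 double precision (which Python floats use on common platforms). The variable names are hypothetical.

```python
import math
import sys

# 0.1, 0.2 and 0.3 have no exact binary representation, so the rounded
# sum of the first two differs from the rounded value of the third.
total = 0.1 + 0.2
sum_is_exact = total == 0.3
sum_is_close = math.isclose(total, 0.3)

# Machine epsilon: the gap between 1.0 and the next larger double.
eps = sys.float_info.epsilon

# Absorption: near 1e16 adjacent doubles are 2 apart, so adding 1.0 is lost.
big = 1e16
absorbed = (big + 1.0) - big

print(sum_is_exact, sum_is_close, eps, absorbed)
```

Comparing floating-point results with a tolerance, as `math.isclose` does, is usually safer than testing for exact equality.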
Nonparametric estimation of basic quantities
Sample of right-censored data
Data:
T-time under observation,
δ-indicator of failure/censoring
Assumption: potential censoring time unrelated to potential event time: methods
appropriate for type I, II, and pro
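For right-censored data of this form (a time under observation plus a failure/censoring indicator), a standard nonparametric estimate of the survival function is the Kaplan-Meier product-limit estimator. The sketch below is illustrative only: `kaplan_meier` and the toy data are hypothetical, and the indicator is assumed to be 1 for an observed failure and 0 for censoring.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed failure time.

    times  : time under observation for each subject
    events : 1 if the failure was observed, 0 if the subject was censored
    Assumes censoring is unrelated to the potential event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose observation ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the conditional survival at t.
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Failures and censored subjects both leave the risk set after t.
        n_at_risk -= removed
    return curve


# Hypothetical data: six subjects, two of them censored (at times 3 and 7).
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]
km = kaplan_meier(times, events)
print(km)
```

Censored subjects contribute to the risk set up to their censoring time but never trigger a drop in the estimated curve.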
Parametric models for survival functions, hazards
can be used for estimating, comparing survival functions, hazards
Common functions/distributions:
Mathematical form, appropriateness for applied settings
1. Exponential
constant hazard: h(t) = λ
survival fu
Nonignorable or informative missingness/censoring
Often ill-defined concept
Discussed concept of independent censoring
formalizations/expressions for idea:
censoring not dependent on future failure time
can make conditional on baseline covariates:
1
alter
Suppose that we need to control for some covariate X but don't want to use it in prediction; we still want to use modeling (not nonparametric estimation)
1. Two options:
Model-based standardization
Weighted estimation
Model-based standardization
As before, get mo
Model building and selection
variable selection
causal questions:
is some treatment or exposure harmful or beneficial? What is its effect?
Examples:
HIP trial:
comparison of group randomized to receive screening with group not
receiving screening
does scr
Competing risks
until now treated competing risk same as loss to follow-up
utilized both types of losses in same fashion in estimates, tests, and calculations:
Kaplan-Meier, Nelson-Aalen
log-rank type tests
proportional hazards models
problems with treati
Multivariate survival analysis (chapter 13)
Issues here similar to issues for multivariate or longitudinal data analysis
classification:
multiple events in same individual
recurrent events (of same type)
different types of events
separate individuals
give
Nonparametric hypothesis tests
What are types of hypotheses (both null and alternative) one would like to test
with survival data?
1
One sample tests (7.2)
Two-sample tests (7.3)
Ordered categories (7.4)
Stratified tests (7.5)
Tests for crossing hazards (
parametric estimation, univariate:
by maximum likelihood
estimation in standard software
no special procedures for univariate estimation
how might one do univariate estimation using regression program?
1. use regression program, no regressors
SAS syntax:
p
Regression models for survival data
Why not use standard models (e.g., linear regression) for failure-time data?
1. Typically don't restrict failure times to be positive
How can one impose restriction?
2. Use transformation (especially log); accelerated fail
Proportional hazards models
purposes of modeling
1. causal; predictive;
parsimony
causal questions:
is some treatment or exposure harmful or beneficial? What is its effect?
Examples:
HIP trial:
comparison of group randomized to receive screening with group
Counting process approach:
Used in most theoretical modern work:
developing estimators,
proving their large-sample properties, especially consistency and asymptotic
normality
Will not give a thorough account in this course; neither does book
Nonetheless,
Likelihoods for censored data
piece together from various parts
for uncensored data, same as always: f(t) (in likelihood, formally
)
for right-censored data, probability of surviving beyond Cr: pr(T > Cr) = S(Cr)
for left-censored data, probability of hav
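The uncensored and right-censored contributions listed above can be pieced together in a small sketch. It assumes an exponential model with constant hazard (rate), so that f(t) = rate * exp(-rate * t) and S(t) = exp(-rate * t); the model choice, the function names `exp_loglik` and `exp_mle`, and the toy data are assumptions for illustration, not part of the slides.

```python
import math


def exp_loglik(rate, times, events):
    """Log-likelihood for exponential data with right censoring.

    events[i] = 1: failure observed at times[i], contributes log f(t).
    events[i] = 0: right-censored at times[i], contributes log S(t).
    """
    ll = 0.0
    for t, d in zip(times, events):
        if d:
            ll += math.log(rate) - rate * t  # log f(t)
        else:
            ll += -rate * t                  # log S(t)
    return ll


def exp_mle(times, events):
    """Closed-form maximizer: observed failures / total time at risk."""
    return sum(events) / sum(times)


# Hypothetical data: six subjects, two of them right-censored.
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]
rate_hat = exp_mle(times, events)
print(rate_hat, exp_loglik(rate_hat, times, events))
```

Under this model the log-likelihood reduces to d * log(rate) - rate * (total observed time), where d is the number of observed failures, which is why the maximizer has the simple closed form used in `exp_mle`.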
Survival Analysis
Also known as failure-time analysis, event history analysis
Outcome: time until the occurrence of an event
What is (almost) a defining characteristic of survival analysis?
1. censoring: outcome not observed, but known to occur in particul
Censoring and truncation
Definitions and types
Censoring: time of event is not known precisely
Truncation: eligibility or observation of subject or subject-time for study
depends on event
Use terms left, right, and interval
Based on time moving from left