Lecture 2

1. SAS Procedures: Proc Ttest, Proc NPar1Way, Proc Freq
2. SAS Data step
3. Data step: arithmetic
4. IF-THEN-ELSE
5. Data step: comparisons and logical conditions
6. Subsetting IF
7. Missing values

Class email group: did you get an email?

Proc TTEST

Compare IQ between boys and girls with a two-sample t test.

    Proc TTEST data = dataset ci = none ;   /* ci=none: omit CI for standard deviation */
        class group-variable ;              /* gives a two-sample test */
        var response-variable(s) ;
    run;

Example:

    Proc TTEST data = ph6470.child_iq ci = none ;
        class mom_HS_grad;
        var child_iq;
    run;
Note use of permanent data: no need to import it again.

The TTEST Procedure

Variable: child_IQ (child IQ)

mom_HS_grad      N       Mean    Std Dev    Std Err    Minimum    Maximum
0               93    77.5484    22.5738     2.3408    20.0000      136.0
1              341    89.3196    19.0495     1.0316    38.0000      144.0
Diff (1-2)           -11.7713    19.8525     2.3224

mom_HS_grad    Method               Mean        95% CL Mean
0                                77.5484     72.8994    82.1974
1                                89.3196     87.2906    91.3487
Diff (1-2)     Pooled           -11.7713    -16.3359    -7.2066
Diff (1-2)     Satterthwaite    -11.7713    -16.8321    -6.7105

Method           Variances        DF    t Value    Pr > |t|
Pooled           Equal           432      -5.07      <.0001
Satterthwaite    Unequal      129.88      -4.60      <.0001

Equality of Variances
Method      Num DF    Den DF    F Value    Pr > F
Folded F        92       340       1.40    0.0326

SAS performs 2 t tests:
Pooled t test assumes the two population standard deviations are equal and uses the pooled standard deviation, which is the Root Mean Square Error in one-factor ANOVA:

$$\hat{\sigma} = \sqrt{\frac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{n_1 + n_2 - 2}}$$

$\hat{\sigma}$ is used for the test and for confidence intervals. The test statistic is:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\hat{\sigma}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$

with df = $(n_1 + n_2 - 2)$.

Satterthwaite t test does not assume the two population standard deviations are equal. It adjusts its degrees of freedom for differences in the group SDs and uses the unpooled standard error:

$$SE_U = \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}$$

with approximate degrees of freedom:

$$df_U = SE_U^4 \left( \frac{SD_1^4}{n_1^2\,(n_1 - 1)} + \frac{SD_2^4}{n_2^2\,(n_2 - 1)} \right)^{-1}$$

When $SD_1$ and $SD_2$ are different, $df_U$ is smaller than the pooled df = $(n_1 + n_2 - 2)$.

The Satterthwaite test statistic:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{SE_U},$$

with df = $df_U$.
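As a check on the two formulas, here is a minimal sketch that recomputes both tests in a DATA step from the summary statistics printed by Proc TTEST (N, mean and SD of each group). The dataset name ttest_check and all variable names are illustrative assumptions, not part of the lecture. The Satterthwaite df is written in terms of v1 = SD1²/n1 and v2 = SD2²/n2, which is algebraically the same as the formula above and needs only squares.

```sas
data ttest_check;
    /* summary statistics copied from the Proc TTEST output */
    n1 = 93;   mean1 = 77.5484;  sd1 = 22.5738;
    n2 = 341;  mean2 = 89.3196;  sd2 = 19.0495;

    /* pooled test */
    sd_pooled = sqrt(((n1 - 1)*sd1**2 + (n2 - 1)*sd2**2) / (n1 + n2 - 2));
    t_pooled  = (mean1 - mean2) / (sd_pooled * sqrt(1/n1 + 1/n2));
    df_pooled = n1 + n2 - 2;

    /* Satterthwaite test */
    v1   = sd1**2 / n1;
    v2   = sd2**2 / n2;
    se_u = sqrt(v1 + v2);
    df_u = (v1 + v2)**2 / (v1**2/(n1 - 1) + v2**2/(n2 - 1));
    t_u  = (mean1 - mean2) / se_u;

    /* two-sided p-values from the t distribution CDF */
    p_pooled = 2 * (1 - probt(abs(t_pooled), df_pooled));
    p_u      = 2 * (1 - probt(abs(t_u), df_u));
run;

proc print data = ttest_check;
    var sd_pooled t_pooled df_pooled se_u t_u df_u p_pooled p_u;
run;
```

Up to rounding of the inputs, the printed values should agree with the Proc TTEST output: a pooled SD near 19.85, t near -5.07 on 432 df, and a Satterthwaite t near -4.60 on about 129.9 df.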
When are the variances unequal?

The "Folded F test" needs normally distributed data, so it is usually not helpful. Instead, when the ratio

    (larger standard deviation) / (smaller standard deviation) > 3

then the SDs are probably different, and different enough to matter. In that case:
• Report the Satterthwaite p-value
• Take logs and perform t test on logged data
• Wilcoxon rank-sum test (nonparametric)

Wilcoxon rank-sum test

Rank-sum test compares two samples: combine all the data and order from smallest to largest.
H0: two population distributions equal (same center).
If so, neither sample should be clustered in lower ranks.
Test statistic is sum of ranks in one sample.

[Photo: Frank Wilcoxon (1892–1965), from Wikipedia]

    Proc NPar1Way data = pubh.child_iq wilcoxon ;
        class mom_HS_grad;
        var child_iq;
    run;
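To make the rank-sum idea concrete, the following sketch ranks the combined data and then sums the ranks within each group. It assumes the pubh library is already assigned; the output dataset name ranked and the variable name iq_rank are hypothetical. Ties receive the average rank (ties=mean), which is also how Proc NPar1Way scores them.

```sas
/* rank all children together, ignoring group membership */
proc rank data = pubh.child_iq out = ranked ties = mean;
    var child_iq;
    ranks iq_rank;          /* new variable holding the rank of child_iq */
run;

/* sum (and average) the ranks within each group */
proc means data = ranked n sum mean;
    class mom_HS_grad;
    var iq_rank;
run;
```

The sum of ranks in each group should correspond to the Sum of Scores column that Proc NPar1Way reports for the Wilcoxon test.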
The NPAR1WAY Procedure

Wilcoxon Scores (Rank Sums) for Variable child_IQ
Classified by Variable mom_HS_grad

mom_HS_grad      N    Sum of Scores    Expected Under H0    Std Dev Under H0    Mean Score
1              341         78993.50             74167.50          1071.98173    231.652493
0               93         15401.50             20227.50          1071.98173    165.607527

Average scores were used for ties.

Wilcoxon Two-Sample Test

Statistic                  15401.5000

Normal Approximation
Z                             -4.5015
One-Sided Pr < Z               <.0001
Two-Sided Pr > |Z|             <.0001

t Approximation
One-Sided Pr < Z               <.0001
Two-Sided Pr > |Z|             <.0001

Chi-square test in Proc Freq
Proc Freq (for frequency) makes tables of counts, performs the chi-square test of association, and also calculates relative risk and odds ratio, tests of trend, and measures of association.

    Proc FREQ data = dataset ;
        tables row-variable * column-variable ;
    run;

Example:

    Proc Freq data = one;
        tables mom_HS_grad * IQ_over_100;
    run;

The manual chapter for Freq is in
SAS Help and Documentation > SAS Products > Base SAS > SAS Procedures
The FREQ Procedure

Table of mom_HS_grad by IQ_over_100

mom_HS_grad(mom HS grad)     IQ_over_100

Frequency |
Percent   |
Row Pct   |
Col Pct   |       0 |       1 |   Total
----------+---------+---------+
        0 |      80 |      13 |      93
          |   18.43 |    3.00 |   21.43
          |   86.02 |   13.98 |
          |   25.24 |   11.11 |
----------+---------+---------+
        1 |     237 |     104 |     341
          |   54.61 |   23.96 |   78.57
          |   69.50 |   30.50 |
          |   74.76 |   88.89 |
----------+---------+---------+
Total           317       117       434
              73.04     26.96    100.00

Better to omit unwanted percents:

    Proc Freq data = one;
        tables mom_HS_grad * IQ_over_100 / nopercent nocol chisq ;
    run;

nopercent = omit percents of grand total
nocol = omit column percents
norow = omit row percents

Table of mom_HS_grad by IQ_over_100

mom_HS_grad(mom HS grad)     IQ_over_100

Frequency |
Row Pct   |       0 |       1 |   Total
----------+---------+---------+
        0 |      80 |      13 |      93
          |   86.02 |   13.98 |
----------+---------+---------+
        1 |     237 |     104 |     341
          |   69.50 |   30.50 |
----------+---------+---------+
Total           317       117       434

Statistics for Table of mom_HS_grad by IQ_over_100

Statistic                       DF      Value      Prob
Chi-Square                       1    10.1275    0.0015
Likelihood Ratio Chi-Square      1    11.2096    0.0008
Continuity Adj. Chi-Square       1     9.3059    0.0023
Mantel-Haenszel Chi-Square       1    10.1042    0.0015
Phi Coefficient                        0.1528
Contingency Coefficient                0.1510
Cramer's V                             0.1528

Fisher's Exact Test
Cell (1,1) Frequency (F)            80
Left-sided Pr <= F              0.9998
Right-sided Pr >= F          7.236E-04
Table Probability (P)        4.742E-04
Two-sided Pr <= P               0.0014

Sample Size = 434

Pearson's Chi-square Test for No Association.
H0: no association between row probabilities and column probabilities.

1. Individual's chance of being in a particular column does not depend on which row they belong to.
   Within a column, all row percents should be roughly equal, except for sampling variability.

2. Equivalently, individual's chance of being in a particular row does not depend on which column they belong to.
   Within a row, all column percents should be roughly equal, except for sampling variability.

Using this no-association assumption, we can compute an expected count for each cell:

$$\text{expected count} = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$$

The test statistic compares expected to observed counts:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

$$\text{degrees of freedom} = (\text{number of rows} - 1) \times (\text{number of columns} - 1).$$
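For the table in this lecture, the expected count in the first cell is 93 × 317 / 434 ≈ 67.93, against an observed count of 80. A minimal sketch that lets Proc Freq print every expected count and the chi-square statistic from the four cell counts alone is given here. The dataset name counts and the variable name count are hypothetical; the WEIGHT statement tells Proc Freq that each input row stands for that many subjects.

```sas
/* one row per cell of the 2 x 2 table, with its observed count */
data counts;
    input mom_HS_grad IQ_over_100 count;
    datalines;
0 0 80
0 1 13
1 0 237
1 1 104
;
run;

proc freq data = counts;
    weight count;                                   /* rows are cell counts, not subjects */
    tables mom_HS_grad * IQ_over_100
           / expected chisq nopercent nocol norow;  /* show observed and expected counts */
run;
```

The Chi-Square line in the output should match the value computed from the raw data, since the test depends only on the cell counts.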
Each cell's contribution to the chi-square sum measures its departure from the null hypothesis (no association).
Pearson residual is the square root of the cell's contribution to chi-square, with the sign of (observed - expected).
In a large table (bigger than 2 × 2), examine Pearson residuals for each cell to find which cells depart from no-association.

    Proc Freq data = one;
        tables mom_HS_grad * IQ_over_100 / nopercent nocol norow cellchi2 deviation ;
    run;

mom_HS_grad(mom HS grad)     IQ_over_100

Frequency      |
Deviation      |
Cell Chi-Square|       0 |       1 |   Total
---------------+---------+---------+
             0 |      80 |      13 |      93
               |  12.071 |  -12.07 |
               |  2.1452 |  5.8122 |
---------------+---------+---------+
             1 |     237 |     104 |     341
               |  -12.07 |  12.071 |
               |  0.5851 |  1.5851 |
---------------+---------+---------+
Total                317       117       434

The same table, showing only the frequency and the cell chi-square, which is the squared Pearson residual:

Frequency        |
(Squared) Pearson|
Residual         |       0 |       1 |   Total
-----------------+---------+---------+
               0 |      80 |      13 |      93
                 |  2.1452 |  5.8122 |
-----------------+---------+---------+
               1 |     237 |     104 |     341
                 |  0.5851 |  1.5851 |
-----------------+---------+---------+
Total                  317       117       434

The DATA step (LSB §1.4)

Data two;
set one;
statement A;
statement B;
statement C;
1. Read observation 1 (row 1) from one. Perform statements A, B, C, using
observation 1.
2. Write output to data two.
3. Set variables to missing values.
4. Repeat steps 1, 2, 3 using observation 2 from one.
5. Repeat steps 1, 2, 3 using observation 3 from one.
6. Continue through all rows of one.
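The steps above can be watched in the SAS log with a small sketch like the following. The dataset names heights and heights_in and the three data values are hypothetical; _N_ is the automatic variable that counts passes through the data step, and PUT writes to the log without changing the output dataset.

```sas
/* hypothetical input data so the sketch is self-contained */
data heights;
    input height_cm;
    datalines;
150
165
180
;
run;

data heights_in;
    set heights;                          /* read the next observation */
    height_inches = height_cm / 2.54;     /* uses only the current observation */
    put "Processing observation " _N_ height_cm= height_inches=;
run;                                      /* each observation is written out at the end of its pass */
```

The log should show one line per observation, in order, which reflects the one-row-at-a-time processing described above.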
Data step is sequential, only looks at one observation at a time. _N_ is the internal variable that counts observations.
There is an implicit DO-loop in every data step:

    Data two;
        DO from _N_ = 1 to last-observation;
            set one;
            statement A;
            statement B;
            statement C;
        END;

Calculations in the DATA step: arithmetic

Calculation with data to create new variables is done in the data step:

    Data Z;
        height_inches = height_cm/2.54;   /* convert cm to inches */

SAS arithmetic symbols:

    **   exponentiation
    *    multiplication
    /    division
    +    addition
    -    subtraction
(Source of the operator table: SAS documentation, http://support.sas.com/onlinedoc/913/docMainpage.jsp)

Expressions in parentheses are evaluated first, starting with the innermost of nested groups. Here is the order of evaluating operations, from left to right in an expression:
1. Exponents
2. Multiplication and division
3. Addition and subtraction

(One exception to left-to-right: a chain of ** operators is evaluated from right to left.)

Evaluate:

    2*3 + 4/5 - 1
    2*(3 + 4)/(5 - 1)
    2*(3 + 4/5) - 1

    Data A;
        C = 2*3+4/5-1;
        D = 2*(3+4)/(5-1);
        E = 2*(3+4/5)-1;
    run;
    proc print data=A;
    run;

    Obs      C      D      E
      1    5.8    3.5    6.6

Exponents: Don't use ** for any exponents except 2 or 3. To compute x = a^b, don't use x = a**b. Instead, use the more numerically stable log and exp functions (this form requires a > 0):

    x = exp(b * log(a));

log is the natural log function; log10 is common log or base 10 log.
Inverse of log is exp.
Inverse of b = log10(c) is c = exp(b*log(10.0)).

See LSB §3.3 for a short list of SAS functions. For a complete list, see the SAS Documentation:
SAS Products > Base SAS > SAS Language Dictionary >
Dictionary of Language Elements > Functions and CALL Routines
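A short sketch of the log and exp identities above, with illustrative values that are not from the lecture (a = 2, b = 10, c = 1000). The dataset name powers and the variable names are hypothetical; the ** line is included only so the two results can be compared in the log.

```sas
data powers;
    a = 2;
    b = 10;
    x_power  = a**b;              /* direct exponentiation, for comparison only */
    x_explog = exp(b * log(a));   /* recommended form; needs a > 0 */

    c      = 1000;
    b10    = log10(c);            /* base 10 log */
    c_back = exp(b10 * log(10.0));/* inverse of log10, written with exp and log */

    put x_power= x_explog= b10= c_back=;   /* write the values to the log */
run;
```

The two power calculations should agree to within floating-point rounding, and c_back should return a value essentially equal to c.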
IF-THEN-ELSE (LSB §3.5-3.6)

In the data step, you can make evaluation of a command depend on a condition:

    IF ( condition ) THEN statement A ;

When condition is true, statement A is performed.
Parentheses around condition are not required, but make code easier to read.

    detection_limit = 0.025;
    IF (0 < x < detection_limit) THEN x = detection_limit/2.0;

In the data step, you can make a branch depend on a condition:

    IF ( condition ) THEN statement A ;
    ELSE statement B ;

When condition is true, statement A is performed.
When condition is false, statement B is performed.

    if (score > 70) then grade = "S";
    else grade = "N";

Here are the SAS symbols for comparisons and logical relations.
Letters are easier to read and remember.

Operator Type            Symbol or Mnemonic      Description
Comparison               = or EQ                 equal to
                         ^=, ¬=, ~=, or NE       not equal to
                         > or GT                 greater than
                         < or LT                 less than
                         >= or GE                greater than or equal to
                         <= or LE                less than or equal to
                         IN                      equal to one of a list
Logical (Boolean)        & or AND                logical and
                         | or OR                 logical or
                         ~, ^, ¬, or NOT         logical not
Other                    ||                      concatenation of character variables
                         ( )                     indicate order of evaluation
                         + prefix                positive number
                         - prefix                negative number
WHERE Expression Only    BETWEEN-AND             an inclusive range
                         ? or CONTAINS           a character string

Subsetting IF

Data A;
set B;
IF ( condition );
If condition is true, the observation is kept.
This data step says: Make a copy of B and call it A.
Include in A only those observations that satisfy the condition.
Equivalent: Data A;
set B;
if ( NOT condition ) then delete;
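A concrete sketch of both forms, using the child IQ data from earlier in the lecture and assuming the pubh library is already assigned. The output dataset names high_iq and high_iq2 are hypothetical.

```sas
/* subsetting IF: keep only children with IQ above 100 */
data high_iq;
    set pubh.child_iq;
    if (child_iq > 100);
run;

/* equivalent form: delete everyone who fails the condition */
data high_iq2;
    set pubh.child_iq;
    if not (child_iq > 100) then delete;
run;
```

Both data steps should produce the same observations. A child with a missing child_iq fails the condition in either form and is dropped.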
Create an indicator 0/1 variable:

    x = (condition);      x = 1 when condition is true, x = 0 when false

    data one;
        set pubh.child_iq;
        IQ_over_100 = (child_iq > 100.0);
    run;

Missing values

Numeric variables: missing is indicated by a period, x = .

Comparisons with missing values: in a sort or a comparison of a numeric variable, missing values are treated as minus infinity (-∞), smaller than any number.

    detection_limit = 0.025;
    IF (x < detection_limit) THEN x = detection_limit/2.0;

What happens to a subject who is missing x? How could we fix this?

Create an indicator 0/1 variable:

    IQ_under_100 = (child_IQ < 100);

What is the value for a child with missing IQ score? 0 or 1? How should we fix this?

Arithmetic with missing values

Find mean diastolic blood pressure (DBP) measured at 4 clinic visits.
Data from 2 subjects in visits:

    ID    DBP1    DBP2    DBP3    DBP4
    11      95      90      98      92
    14      94       .      91      95

    data G;
        set visits;
        DBP_mean = (DBP1 + DBP2 + DBP3 + DBP4)/4.0 ;
    run;

Results:

    Obs    ID    DBP1    DBP2    DBP3    DBP4    DBP_mean
      1    11      95      90      98      92       93.75
      2    14      94       .      91      95           .

Arithmetic with a missing value has a missing result.

Usually we want to ignore missing values and average the rest of the numbers, not have the mean be missing.
SAS procedures (Proc Ttest, Proc Reg) omit observations with missing values.

Many SAS functions correctly handle missing values (see the Manual):
MEAN (argument list) returns the average of the nonmissing values; for example, MEAN(3, ., ., 1) = 2

    DBP_mean1 = mean(DBP1, DBP2, DBP3, DBP4) ;

Results:

    ID    DBP1    DBP2    DBP3    DBP4    DBP_mean    DBP_mean1
    11      95      90      98      92       93.75      93.7500
    14      94       .      91      95           .      93.3333
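Returning to the two questions about comparisons with missing values, one possible fix is to test for missingness explicitly before comparing. The sketch below is an assumption about how one might do this, not the lecture's stated answer. The dataset name demo, the three data rows, and the variable naive_flag are hypothetical; MISSING is the SAS function that returns 1 for a missing argument.

```sas
data demo;
    input x child_IQ;
    detection_limit = 0.025;

    /* unguarded comparison: a missing IQ counts as less than 100 */
    naive_flag = (child_IQ < 100);

    /* guarded versions: act only on nonmissing values */
    if not missing(child_IQ) then IQ_under_100 = (child_IQ < 100);
    if not missing(x) and (x < detection_limit) then x = detection_limit/2.0;

    datalines;
0.010 95
. .
0.500 110
;
run;

proc print data = demo;
run;
```

For the second row, where both values are missing, naive_flag comes out as 1 because missing sorts below every number, while IQ_under_100 and x stay missing.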