Handout02 - Lecture 2 1. SAS Procedures: Proc Ttest, Proc...


Lecture 2

1. SAS Procedures: Proc Ttest, Proc NPar1Way, Proc Freq
2. SAS Data step
3. Data step: arithmetic
4. IF-THEN-ELSE
5. Data step: comparisons and logical conditions
6. Subsetting IF
7. Missing values

Class email group: did you get an email?

Proc TTEST

Compare IQ between boys and girls with a two-sample t-test.

    Proc TTEST data = dataset ci = none ;    /* ci=none: omit CI for standard deviation */
      class group-variable ;                 /* gives a two-sample test */
      var response-variable(s) ;

    Proc TTEST data = ph6470.child_iq ci = none ;
      class mom_HS_grad;
      var child_iq;

Note the use of permanent data: there is no need to import it again.

The TTEST Procedure

Variable: child_IQ (child IQ)

    mom_HS_grad    N     Mean       Std Dev    Std Err    Minimum    Maximum
    0              93    77.5484    22.5738    2.3408     20.0000    136.0
    1              341   89.3196    19.0495    1.0316     38.0000    144.0
    Diff (1-2)           -11.7713   19.8525    2.3224

    mom_HS_grad    Method           Mean        95% CL Mean
    0                               77.5484     72.8994    82.1974
    1                               89.3196     87.2906    91.3487
    Diff (1-2)     Pooled           -11.7713    -16.3359   -7.2066
    Diff (1-2)     Satterthwaite    -11.7713    -16.8321   -6.7105

    Method           Variances    DF        t Value    Pr > |t|
    Pooled           Equal        432       -5.07      <.0001
    Satterthwaite    Unequal      129.88    -4.60      <.0001

    Equality of Variances
    Method      Num DF    Den DF    F Value    Pr > F
    Folded F    92        340       1.40       0.0326

SAS performs 2 t-tests:

Pooled t-test

Assumes the two population standard deviations are equal and uses the pooled standard deviation, which is the Root Mean Square Error in one-factor ANOVA:

\[ \hat{\sigma} = \sqrt{\frac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{n_1 + n_2 - 2}} \]

\hat{\sigma} is used for the test and for confidence intervals. The test statistic is:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\hat{\sigma}\,\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \]

with df = (n_1 + n_2 - 2).
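The pooled formulas above can be checked by hand against the Proc TTEST output. The following is a minimal sketch, not part of the original handout: the data set name pooled_check and all variable names are made up for illustration, and the numbers are copied from the summary table above. The printed values should agree with the Pooled row of the output up to rounding.

```sas
/* Sketch: recompute the pooled t-test from the summary statistics.   */
/* Data set and variable names are hypothetical.                      */
data pooled_check;
  /* group sizes, SDs and means copied from the Proc TTEST output */
  n1 = 93;   sd1 = 22.5738;  mean1 = 77.5484;   /* mom_HS_grad = 0 */
  n2 = 341;  sd2 = 19.0495;  mean2 = 89.3196;   /* mom_HS_grad = 1 */

  df        = n1 + n2 - 2;                                   /* pooled df */
  sd_pooled = sqrt(((n1 - 1)*sd1**2 + (n2 - 1)*sd2**2) / df);
  se_pooled = sd_pooled * sqrt(1/n1 + 1/n2);                 /* Std Err of the difference */
  t         = (mean1 - mean2) / se_pooled;
  p_two_sided = 2 * (1 - probt(abs(t), df));                 /* two-sided p-value */
run;

proc print data = pooled_check;
  var df sd_pooled se_pooled t p_two_sided;
run;
```

Because the inputs are rounded to four decimals, small differences in the last printed digit are to be expected.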
Satterthwaite t-test

Does not assume the two population standard deviations are equal. It adjusts its degrees of freedom for differences in the group SDs and uses the unpooled standard error:

\[ SE_U = \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}} \]

with approximate degrees of freedom:

\[ df_U = SE_U^4 \left( \frac{SD_1^4}{n_1^2\,(n_1 - 1)} + \frac{SD_2^4}{n_2^2\,(n_2 - 1)} \right)^{-1} \]

When SD_1 and SD_2 are different, df_U is smaller than the pooled df = (n_1 + n_2 - 2).

The Satterthwaite test statistic:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{SE_U} \]

with df = df_U.

When are the variances unequal?

The "Folded F-test" needs normally distributed data, so it is usually not helpful. Instead, when the ratio

    (larger standard deviation) / (smaller standard deviation) > 3

the SDs are probably different, and different enough to matter. In that case, the options are:

• Report the Satterthwaite p-value
• Take logs and perform the t-test on logged data
• Wilcoxon rank-sum test (non-parametric)

Wilcoxon rank-sum test

The rank-sum test compares two samples: combine all the data and order from smallest to largest.

H0: the two population distributions are equal (same center). If so, neither sample should be clustered in the lower ranks.

The test statistic is the sum of ranks in one sample.

[Photo: Frank Wilcoxon (1892–1965). Source: Wikipedia]

    Proc NPar1Way data = pubh.child_iq wilcoxon ;
      class mom_HS_grad;
      var child_iq;

The NPAR1WAY Procedure

Wilcoxon Scores (Rank Sums) for Variable child_IQ
Classified by Variable mom_HS_grad

    mom_HS_grad    N      Sum of Scores    Expected Under H0    Std Dev Under H0    Mean Score
    ------------------------------------------------------------------------------------------
    1              341    78993.50         74167.50             1071.98173          231.652493
    0              93     15401.50         20227.50             1071.98173          165.607527

Average scores were used for ties.
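The "Expected Under H0" and "Mean Score" columns in the table above follow directly from the group sizes and rank sums, and can be checked with a short sketch. This is an illustration only: the data set and variable names are made up, the standard deviation under H0 is copied from the output rather than recomputed (it includes a correction for ties), and the 0.5 continuity correction in the normal approximation is an assumption about the procedure's default behaviour.

```sas
/* Sketch: check the rank-sum table by hand. Names are hypothetical. */
data ranksum_check;
  n_small = 93;                  /* mom_HS_grad = 0 */
  n_large = 341;                 /* mom_HS_grad = 1 */
  n_total = n_small + n_large;

  rank_sum_small = 15401.5;      /* Sum of Scores for mom_HS_grad = 0 */
  sd_h0          = 1071.98173;   /* Std Dev Under H0, copied from the output */

  /* Under H0 each observation has expected rank (n_total + 1)/2 */
  expected_small = n_small * (n_total + 1) / 2;
  expected_large = n_large * (n_total + 1) / 2;
  mean_score_small = rank_sum_small / n_small;

  /* Normal approximation; the +0.5 continuity correction is an assumption */
  z = (rank_sum_small - expected_small + 0.5) / sd_h0;
  p_two_sided = 2 * probnorm(-abs(z));
run;

proc print data = ranksum_check;
  var expected_small expected_large mean_score_small z p_two_sided;
run;
```

The rank sum for the smaller group falls well below its expected value, which is what the test detects.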
Wilcoxon Two-Sample Test

    Statistic                  15401.5000

    Normal Approximation
    Z                          -4.5015
    One-Sided Pr <  Z          <.0001
    Two-Sided Pr > |Z|         <.0001

    t Approximation
    One-Sided Pr <  Z          <.0001
    Two-Sided Pr > |Z|         <.0001

Chi-square test in Proc Freq

Proc Freq (for frequency) makes tables of counts, performs the chi-square test of association, and also calculates relative risk and odds ratio, tests of trend, and measures of association.

    Proc FREQ data = dataset ;
      tables row-variable * column-variable ;

    Proc Freq data = one;
      tables mom_HS_grad * IQ_over_100;

Manual chapter for Freq: SAS Help and Documentation > SAS Products > Base SAS > SAS Procedures

The FREQ Procedure

Table of mom_HS_grad by IQ_over_100

    mom_HS_grad(mom HS grad)     IQ_over_100

    Frequency|
    Percent  |
    Row Pct  |
    Col Pct  |       0|       1|  Total
    ---------+--------+--------+
            0|     80 |     13 |     93
             |  18.43 |   3.00 |  21.43
             |  86.02 |  13.98 |
             |  25.24 |  11.11 |
    ---------+--------+--------+
            1|    237 |    104 |    341
             |  54.61 |  23.96 |  78.57
             |  69.50 |  30.50 |
             |  74.76 |  88.89 |
    ---------+--------+--------+
    Total         317      117      434
                73.04    26.96   100.00

Better to omit unwanted percents:

    Proc Freq data = one;
      tables mom_HS_grad * IQ_over_100 / nopercent nocol chisq ;

    nopercent = omit percents of grand total
    nocol     = omit column percents
    norow     = omit row percents

Table of mom_HS_grad by IQ_over_100

    mom_HS_grad(mom HS grad)     IQ_over_100

    Frequency|
    Row Pct  |       0|       1|  Total
    ---------+--------+--------+
            0|     80 |     13 |     93
             |  86.02 |  13.98 |
    ---------+--------+--------+
            1|    237 |    104 |    341
             |  69.50 |  30.50 |
    ---------+--------+--------+
    Total         317      117      434

Statistics for Table of mom_HS_grad by IQ_over_100

    Statistic                      DF    Value      Prob
    ------------------------------------------------------
    Chi-Square                     1     10.1275    0.0015
    Likelihood Ratio Chi-Square    1     11.2096    0.0008
    Continuity Adj. Chi-Square     1      9.3059    0.0023
    Mantel-Haenszel Chi-Square     1     10.1042    0.0015
    Phi Coefficient                       0.1528
    Contingency Coefficient               0.1510
    Cramer's V                            0.1528

    Fisher's Exact Test
    ----------------------------------
    Cell (1,1) Frequency (F)          80
    Left-sided Pr <= F            0.9998
    Right-sided Pr >= F        7.236E-04

    Table Probability (P)      4.742E-04
    Two-sided Pr <= P             0.0014

    Sample Size = 434

Pearson's Chi-square Test for No Association

H0: no association between row probabilities and column probabilities.

1. An individual's chance of being in a particular column does not depend on which row they belong to. Within a column, all row percents should be roughly equal, except for sampling variability.
2. Equivalently, an individual's chance of being in a particular row does not depend on which column they belong to. Within a row, all column percents should be roughly equal, except for sampling variability.

Using this no-association assumption, we can compute an expected count for each cell:

\[ \text{expected count} = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}} \]

The test statistic compares expected to observed counts:

\[ X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} \]

\[ \text{degrees of freedom} = (\text{number of rows} - 1) \times (\text{number of columns} - 1) \]

Each cell's contribution to the chi-square sum measures its departure from the null hypothesis (no association).

The Pearson residual is the square root of the cell's contribution to chi-square, with the sign of (observed - expected).

In a large table (bigger than 2 × 2), examine the Pearson residuals for each cell to find which cells depart from no-association.
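The expected counts, cell contributions, and Pearson residuals described above can be computed directly from the four cell counts. The sketch below is an illustration under stated assumptions: the data set name chisq_check and all new variable names are made up, and the row, column, and grand totals are typed in by hand from the table rather than computed. The running total in the last row should come out close to the Chi-Square value reported by Proc Freq.

```sas
/* Sketch: chi-square pieces for the 2 x 2 table. Names are hypothetical. */
data chisq_check;
  input mom_HS_grad IQ_over_100 observed row_total col_total;
  grand_total = 434;

  expected      = row_total * col_total / grand_total;  /* count expected under H0 */
  deviation     = observed - expected;
  cell_chi2     = deviation**2 / expected;              /* this cell's contribution */
  pearson_resid = deviation / sqrt(expected);           /* signed square root of cell_chi2 */

  chi2_total + cell_chi2;                               /* sum statement: running total */

  /* after the fourth cell, the running total is the full statistic (df = 1) */
  if _n_ = 4 then p_value = 1 - probchi(chi2_total, 1);
  datalines;
0 0  80  93 317
0 1  13  93 117
1 0 237 341 317
1 1 104 341 117
;
run;

proc print data = chisq_check;
run;
```

In a 2 × 2 table all four deviations have the same absolute size, so the cells with the smallest expected counts contribute most to the statistic.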
    Proc Freq data = one;
      tables mom_HS_grad * IQ_over_100 / nopercent nocol norow cellchi2 deviation ;

    mom_HS_grad(mom HS grad)     IQ_over_100

    Frequency      |
    Deviation      |
    Cell Chi-Square|       0|       1|  Total
    ---------------+--------+--------+
                  0|     80 |     13 |     93
                   | 12.071 | -12.07 |
                   | 2.1452 | 5.8122 |
    ---------------+--------+--------+
                  1|    237 |    104 |    341
                   | -12.07 | 12.071 |
                   | 0.5851 | 1.5851 |
    ---------------+--------+--------+
    Total               317      117      434

    Frequency      |
    (Squared)      |
    Pearson        |
    Residual       |        0|        1|  Total
    ---------------+---------+---------+
                  0|      80 |      13 |     93
                   |  2.1452 | -5.8122 |
    ---------------+---------+---------+
                  1|     237 |     104 |    341
                   | -0.5851 |  1.5851 |
    ---------------+---------+---------+
    Total                317       117      434

The DATA step (LSB §1.4)

    Data two;
      set one;
      statement A;
      statement B;
      statement C;

1. Read observation 1 (row 1) from one. Perform statements A, B, C, using observation 1.
2. Write output to data two.
3. Set variables to missing values.
4. Repeat steps 1, 2, 3 using observation 2 from one.
5. Repeat steps 1, 2, 3 using observation 3 from one.
6. Continue through all rows of one.

The data step is sequential: it only looks at one observation at a time.

_N_ is the internal variable that counts observations. There is an implicit DO-loop in every data step:

    Data two;
      DO from _N_ = 1 to last-observation;
        set one;
        statement A;
        statement B;
        statement C;
      END;

Excerpt from the SAS documentation on the WHERE statement shown with these slides (http://support.sas.com/onlinedoc/913/docMainpage.jsp, page 4 of 8, printed 01/23/2007 10:23 PM):

    ... values created within the WHERE expression itself.

    You cannot use variables that are created within the DATA step (for example, FIRST.variable, LAST.variable, _N_, or variables that are created in assignment statements) in a WHERE expression because the WHERE statement is executed before the SAS System brings observations into the DATA or PROC step. When WHERE expressions contain comparisons, the unformatted values of variables are compared.

    Use operands in WHERE statements as in the following examples:

        where score>50;
        where date>='01jan1999'd and time>='9:00't;
        where state='Mississippi';

    As in other SAS expressions, the names of numeric variables can stand alone. SAS treats values of 0 or missing as false; other values are true. These examples are WHERE expressions that contain the numeric variables EMPNUM and SSN:

        where empnum;
        where empnum and ssn;

    Character literals or the names of character variables can also stand alone in WHERE expressions. If you use the name of a character variable by itself as a WHERE expression, SAS selects observations where the value of the character variable is not blank.

    Operators Used in the WHERE Expression: You can include both SAS operators and special WHERE-expression operators in the WHERE statement. For a complete list of the operators, see WHERE Statement Operators. For the rules SAS follows when it evaluates WHERE expressions, see WHERE-Expression Processing in SAS Language Reference: Concepts.

Calculations in the DATA step: arithmetic

Calculation with data to create new variables is done in the data step:

    Data Z;
      height_inches = height_cm/2.54;    /* convert cm to inches */

SAS arithmetic symbols:

    *     multiplication
    /     division
    +     addition
    -     subtraction
    **    exponentiation

Expressions in parentheses are evaluated first, starting with the innermost of nested groups. Here is the order of evaluating operations, from left to right in an expression:

1. Exponents
2. Multiplication and division
3. Addition and subtraction

Evaluate:

    2 * 3 + 4/5 - 1
    2 * (3 + 4)/(5 - 1)
    2 * (3 + 4/5) - 1

    Data A;
      C = 2*3+4/5-1;
      D = 2*(3+4)/(5-1);
      E = 2*(3+4/5) -1;
    proc print data=A;

    -----------------------------------------------------------
    Obs      C      D      E
      1    5.8    3.5    6.6

Exponents: Don't use ** for any exponents except 2 or 3. To compute x = a^b, don't use x = a**b. Instead, use the more numerically stable log and exp functions (this requires a > 0):

    x = exp(b * log(a));

log is the natural log function; log10 is the common (base 10) log.

The inverse of log is exp. The inverse of b = log10(c) is c = exp(b*log(10.0)).

See LSB §3.3 for a short list of SAS functions. For a complete list, see the SAS Documentation: SAS Products > Base SAS > SAS Language Dictionary > Dictionary of Language Elements > Functions and CALL Routines

IF-THEN-ELSE (LSB §3.5-3.6)

In the data step, you can make evaluation of a command depend on a condition:

    IF ( condition ) THEN statement A ;

When condition is true, statement A is performed. Parentheses around condition are not required, but make code easier to read.

    detection_limit = 0.025;
    IF (0 < x < detection_limit) THEN x = detection_limit/2.0;

In the data step, you can make a branch depend on a condition:

    IF ( condition ) THEN statement A ;
    ELSE statement B ;

When condition is true, statement A is performed.
When condition is false, statement B is performed.
    if (score > 70) then grade = "S";
    else grade = "N";

Here are the SAS symbols for comparisons and logical relations. Letters are easier to read and remember.

WHERE Statement Operators (from the SAS documentation, http://support.sas.com/onlinedoc/913/docMainpage.jsp, page 4 of 8):

    Operator type        Symbol or mnemonic     Description
    -----------------------------------------------------------------------------
    Arithmetic           *                      multiplication
                         /                      division
                         +                      addition
                         -                      subtraction
                         **                     exponentiation
    Comparison           = or EQ                equal to
                         ^=, ¬=, ~=, or NE      not equal to
                         > or GT                greater than
                         < or LT                less than
                         >= or GE               greater than or equal to
                         <= or LE               less than or equal to
                         IN                     equal to one of a list
    Logical (Boolean)    & or AND               logical and
                         | or OR                logical or
                         ~, ^, ¬, or NOT        logical not
    Other                ||                     concatenation of character variables
                         ( )                    indicate order of evaluation
                         + prefix               positive number
                         - prefix               negative number
    WHERE expression     BETWEEN-AND            an inclusive range
    only                 ? or CONTAINS          a character string

Subsetting IF

    Data A;
      set B;
      IF ( condition );

If condition is true, the observation is kept.
This data step says: Make a copy of B and call it A. Include in A only those observations that satisfy the condition.

Equivalent:

    Data A;
      set B;
      if ( NOT condition ) then delete;

Create an indicator 0/1 variable:

    x = (condition);

x = 1 when condition is true, x = 0 when false.

    data one;
      set pubh.child_iq;
      IQ_over_100 = (child_iq > 100.0);

Missing values

Numeric variables: a missing value is indicated by a period, x = .

Comparisons with missing values: in a sort of a numeric variable, missing values are treated as -∞.

    detection_limit = 0.025;
    IF (x < detection_limit) THEN x = detection_limit/2.0;

What happens to a subject who is missing x? How could we fix this?

Create an indicator 0/1 variable:

    IQ_under_100 = (child_IQ < 100);

What is the value for a child with a missing IQ score? 0 or 1? How should we fix this?

Arithmetic with missing values

Find the mean diastolic blood pressure (DBP) measured at 4 clinic visits. Data from 2 subjects in visits:

    ID    DBP1    DBP2    DBP3    DBP4
    11    95      90      98      92
    14    94      .       91      95

    data G;
      set visits;
      DBP_mean = (DBP1 + DBP2 + DBP3 + DBP4)/4.0 ;

Results:

    Obs    ID    DBP1    DBP2    DBP3    DBP4    DBP_mean
    1      11    95      90      98      92      93.75
    2      14    94      .       91      95      .

Arithmetic with a missing value has a missing result. Usually we want to ignore missing values and average the rest of the numbers, not have the mean be missing.

SAS procedures (Proc Ttest, Proc Reg) omit observations with missing values.

Many SAS functions correctly handle missing values; see the Manual:

MEAN(argument list) returns the average of the non-missing values; for example, MEAN(3, ., ., 1) = 2

    DBP_mean1 = mean(DBP1, DBP2, DBP3, DBP4) ;

Results:

    ID    DBP1    DBP2    DBP3    DBP4    DBP_mean    DBP_mean1
    11    95      90      98      92      93.75       93.7500
    14    94      .       91      95      .           93.3333

...
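The two questions posed earlier about missing values in comparisons can be explored with a small sketch. Because a missing numeric value compares as smaller than any number, both x < detection_limit and child_IQ < 100 are true when the value is missing. The code below is an illustration, not the handout's own answer: the data set missing_demo, its three made-up observations, and the new variable names are all hypothetical, and guarding with the MISSING function is one possible fix among several.

```sas
/* Sketch: missing values in comparisons. Data and names are hypothetical. */
data missing_demo;
  input id x child_IQ;
  detection_limit = 0.025;

  /* Unguarded: a missing x satisfies x < detection_limit and gets replaced */
  x_naive = x;
  if (x < detection_limit) then x_naive = detection_limit/2.0;

  /* Guarded: replace x only when it is present and below the limit */
  x_fixed = x;
  if (not missing(x) and x < detection_limit) then x_fixed = detection_limit/2.0;

  /* Unguarded indicator: equals 1 when child_IQ is missing */
  IQ_under_100_naive = (child_IQ < 100);

  /* Guarded indicator: stays missing when child_IQ is missing */
  if missing(child_IQ) then IQ_under_100 = .;
  else IQ_under_100 = (child_IQ < 100);
  datalines;
1 0.010  95
2 0.300 110
3 .       .
;
run;

proc print data = missing_demo;
run;
```

For the third observation, the unguarded versions quietly turn a missing measurement into a real-looking value, while the guarded versions keep it missing.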

This note was uploaded on 11/21/2011 for the course PUBH 6470 taught by Professor Williamthomas during the Fall '11 term at University of Florida.
