Handout20 - Lecture 20 1. Working with multiple...

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: Lecture 20 1. Working with multiple observations: long form and wide form 2. Proc SQL 3. Counting observations with SQL and Transpose 4. Proc SQL to combine datasets: fuzzy merge 5. Longitudinal data: response feature analysis 6. Area under the curve (AUC) 1 Long form: multiple observations per subject, one measurement (income) in each observation, with variables identifying subject ID and observation number (year) Obs 1 2 3 family_ id 1 1 1 income 66483 69146 74643 year 1990 1991 1992 Wide form: one observation per subject, all measurements (income) in one observation, with variables identifying subject ID, and observation number (year) encoded in variable name Obs 1 2 family_ id 1 2 income_ 1990 66483 17510 income_ 1991 69146 . income_ 1992 74643 19484 2 income_ 1993 79783 20979 income_ 1994 81710 21268 income_ 1995 86143 22998 Long form. Required for graphing and analysis procedures Wide form. All measurements from a subject are available for computation because they are all in the same row. SAS Data step works with one observation (row) at a time. 3 ways to access the values in a column from more than one observation: • using RETAIN statement to store values “across” rows • Proc Transpose pivots data to put multiple values in the same row (= wide form) • Proc SQL (Structured Query Language) 3 Transpose Help chapter: SAS Help > SAS Products > Base SAS > Base SAS 9.2 Procedures Guide > Procedures > Proc Transpose SQL Help chapter: SAS Help > SAS Products > Base SAS > SAS 9.2 SQL Procedure User’s Guide 4 Proc SQL applied to one dataset Family economic data, in long form. 3 to 6 annual observations per family. Obs 1 2 3 4 5 6 7 8 9 10 family_ id 1 1 1 1 1 2 2 2 2 2 income 66483 69146 74643 81710 86143 17510 17947 20979 21268 22998 year 1990 1991 1992 1994 1995 1990 1991 1993 1994 1995 expenses 49804 65634 61820 85504 75640 21609 12085 22985 11097 21768 debt no no no yes no yes no yes no no cohort A A A 1993 missing A A B B 1992 missing B B B Count number of observations per family. 5 Proc SQL; create table pubh.M as select from family_id, count(income) as n_years_income family_economic_data group by family_id; quit; Create SAS permanent dataset pubh.M by selecting these variables from family_economic_data: group observations for each distinct family_ID, create n_years_income = COUNT(income) make one observation per family_ID with 2 variables: family_ID, n_years_income Ends with quit , not run. 6 family_ n_years_ id income 1 1 5 2 2 5 3 3 6 4 4 6 5 5 5 Obs Summary: pattern of missing observations from Proc n_years_income 3 4 5 6 Frequency 2 2 17 29 Cumulative Frequency 2 4 21 50 Percent 4.00 4.00 34.00 58.00 Cumulative Percent 4.00 8.00 42.00 100.00 7 Select variables that do not repeat within family_id groups: Proc SQL; create table M2 as select *, * = all variables count(income) as n_years_income no comma after last variable selected from family_economic_data group by family_id; quit; Summary count is merged back: Obs 1 2 3 4 5 family_ id income year expenses debt cohort n_years_ income 1 1 1 1 1 66483 81710 74643 86143 69146 1990 1994 1992 1995 1991 49804 85504 61820 75640 65634 no yes no no no A A A A A 5 5 5 5 5 8 Another way to count nonmissing observations Use Proc Transpose to reshape from long to wide form: Proc Transpose ID prefix =income_ ; income; BY 1 2 3 4 5 6 out=wide year; VAR Obs data=long family_id; family_ id 1 2 3 4 5 6 _NAME_ income_ 1990 income_ 1991 income_ 1992 income_ 1994 income_ 1995 income_ 1993 income income income income income income 66483 17510 57947 64831 18904 32057 69146 17947 62964 71060 19949 34770 74643 . 68717 71918 21335 35834 81710 21268 75198 73100 23829 40899 86143 22998 75722 74379 23913 42372 . 20979 70957 72514 . 37387 9 In wide format, we can use the N function Data count; set wide; n_years_income = N(of income_1990 - income_1995) ; 10 Combining datasets (tables) in SQL Join two datasets without conditions in Proc SQL to get all combinations of rows = Cartesian product. Data A id color Data B mass id mass pH 12 orange 3650 13 11267 7.8 13 blue 3877 14 3568 8.2 15 yellow 4103 15 4103 5.1 3 rows in A × 3 rows in B = 9 combinations Just as in MERGE, order of joining matters. 11 Proc SQL; create table D as select * from A, B ; select * = select all variables Obs id color mass pH 1 2 3 4 5 6 7 8 9 12 12 12 13 13 13 15 15 15 orange orange orange blue blue blue yellow yellow yellow 3650 3650 3650 3877 3877 3877 4103 4103 4103 7.8 8.2 5.1 7.8 8.2 5.1 7.8 8.2 5.1 ID and mass are common variables: only IDs and masses from first dataset (A) in product. 12 Proc SQL; create table E as select * from B, A ; Obs id mass pH color 1 2 3 4 5 6 7 8 9 13 13 13 14 14 14 15 15 15 11267 11267 11267 3568 3568 3568 4103 4103 4103 7.8 7.8 7.8 8.2 8.2 8.2 5.1 5.1 5.1 orange blue yellow orange blue yellow orange blue yellow ID and mass are common variables: only IDs and masses from first dataset (B) in product. 13 Merge with SQL No sorting is required, which can speed merging of very large datasets. Merge observations from A and B with matching IDs, and keep mass measurements separate: Proc SQL; create table F as select * from A(rename=(mass=mass_A)), B(rename=(mass=mass_B)) where A.id = B.id ; Since variable ID is in both A and B, need to identify dataset.variable 14 Data A id color Data B mass id mass pH 12 orange 3650 13 11267 7.8 13 blue 3877 14 3568 8.2 15 yellow 4103 15 4103 5.1 Merge where A.id = B.id Obs id color mass_A mass_B pH 1 13 blue 3877 11267 7.8 2 15 yellow 4103 4103 5.1 Gives the intersection of the two datasets: only IDs that appear in both. 15 Data-step merge by ID gives non-matching observations: Data E; MERGE A(rename=(mass=mass_A)) B(rename=(mass=mass_B)); BY id; Obs id color mass_A 1 12 orange 3650 . 2 13 blue 3877 11267 7.8 3 14 . 3568 8.2 4 15 4103 4103 5.1 yellow mass_B To get this from SQL, use a full join 16 pH . Proc SQL; create table F1 as select * from A(rename=(mass=mass_A)) full join on A.id B(rename=(mass=mass_B)) = B.id; Obs id color mass_A 1 12 orange 3650 . 2 13 blue 3877 11267 7.8 3 . . 3568 8.2 4 15 4103 4103 5.1 yellow mass_B pH . Because A is read first, ID = 14 from B is missing. 17 Use COALESCE function: takes first non-missing value from arguments. Proc SQL; create table F2 as select * , coalesce(A.id, B.id) as item from A(rename=(mass=mass_A)) full join on A.id B(rename=(mass=mass_B)) = B.id; Obs id color mass_A 1 12 orange 3650 . 2 13 blue 3877 3 . pH item . 12 11267 7.8 13 . 4 15 yellow mass_B 3568 8.2 14 4103 4103 5.1 15 18 Fuzzy merge in Proc SQL Patients in a study are supposed to have measurements of their weight and blood pressure within 7 days. Weight data Blood pressure data Obs 1 2 3 id 33 33 33 date 10MAR2009 12APR2009 15MAY2009 kg 78 81 80 Obs 1 2 3 patient date dbp 33 15MAR2009 72 33 10APR2009 68 33 30MAY2009 71 sbp 112 105 110 4 5 34 34 02FEB2009 05JUN2009 65 66 4 5 34 34 07FEB2009 15JUN2009 55 60 95 99 6 7 8 35 35 35 21MAY2009 08JUN2009 14AUG2009 71 71 70 6 7 35 35 21MAY2009 06JUN2009 66 68 110 105 SQL can merge on variables with different names (ID, patient) and on dates that are within 7 days. 19 Start by merging on ID = patient Proc SQL; create table G1 as select * from weights(rename=(date=date_wt)), BP (rename=(date=date_BP)) where weights.id = BP.patient; Use dataset option rename to save dates from weights and blood pressures. Want to check all combinations of observations where ID = patient: Result has 9 rows for ID=33, 4 for ID=34, and 6 for ID=35 20 Obs id date_wt kg patient date_BP dbp sbp 1 2 3 4 5 6 7 8 9 33 33 33 33 33 33 33 33 33 10MAR2009 15MAY2009 12APR2009 10MAR2009 15MAY2009 12APR2009 10MAR2009 15MAY2009 12APR2009 78 80 81 78 80 81 78 80 81 33 33 33 33 33 33 33 33 33 15MAR2009 15MAR2009 15MAR2009 10APR2009 10APR2009 10APR2009 30MAY2009 30MAY2009 30MAY2009 72 72 72 68 68 68 71 71 71 112 112 112 105 105 105 110 110 110 10 11 12 13 34 34 34 34 02FEB2009 05JUN2009 02FEB2009 05JUN2009 65 66 65 66 34 34 34 34 07FEB2009 07FEB2009 15JUN2009 15JUN2009 55 55 60 60 95 95 99 99 14 15 16 17 18 19 35 35 35 35 35 35 21MAY2009 14AUG2009 08JUN2009 21MAY2009 14AUG2009 08JUN2009 71 70 71 71 70 71 35 35 35 35 35 35 21MAY2009 21MAY2009 21MAY2009 06JUN2009 06JUN2009 06JUN2009 66 66 66 68 68 68 110 110 110 105 105 105 21 To check whether weight dates and blood pressure dates are within 7 days, calculate interval: Proc SQL; create table G2 as select * , ABS(date_wt - date_BP) as interval from weights(rename=(date=date_wt)), BP (rename=(date=date_BP)) where weights.id = BP.patient ; Add comma after * for variable list ABS is absolute value function ABS(date_wt - date_BP) is positive number of days between dates Note: used new names for the dates in calculating interval 22 Obs id date_wt kg patient date_BP dbp sbp 1 2 3 4 5 6 7 8 9 33 33 33 33 33 33 33 33 33 10MAR2009 15MAY2009 12APR2009 10MAR2009 15MAY2009 12APR2009 10MAR2009 15MAY2009 12APR2009 78 80 81 78 80 81 78 80 81 10 11 12 13 34 34 34 34 02FEB2009 05JUN2009 02FEB2009 05JUN2009 14 15 16 17 18 19 35 35 35 35 35 35 21MAY2009 14AUG2009 08JUN2009 21MAY2009 14AUG2009 08JUN2009 interval 33 33 33 33 33 33 33 33 33 15MAR2009 15MAR2009 15MAR2009 10APR2009 10APR2009 10APR2009 30MAY2009 30MAY2009 30MAY2009 72 72 72 68 68 68 71 71 71 112 112 112 105 105 105 110 110 110 5 61 28 31 35 2 81 15 48 65 66 65 66 34 34 34 34 07FEB2009 07FEB2009 15JUN2009 15JUN2009 55 55 60 60 95 95 99 99 5 118 133 10 71 70 71 71 70 71 35 35 35 35 35 35 21MAY2009 21MAY2009 21MAY2009 06JUN2009 06JUN2009 06JUN2009 66 66 66 68 68 68 110 110 110 105 105 105 0 85 18 16 69 2 23 Add condition that interval between dates (for same ID) must be ≤ 7 days Proc SQL; create table G3 as select * , ABS(date_wt - date_BP) as interval from weights(rename=(date=date_wt)), BP (rename=(date=date_BP)) where weights.id = BP.patient and (calculated interval LE 7); Must refer to interval as calculated interval because it is not in either data set 24 Obs id date_wt kg patient date_BP dbp sbp interval 1 2 33 33 10MAR2009 12APR2009 78 81 33 33 15MAR2009 10APR2009 72 68 112 105 5 2 3 34 02FEB2009 65 34 07FEB2009 55 95 5 4 5 35 35 21MAY2009 08JUN2009 71 71 35 35 21MAY2009 06JUN2009 66 68 110 105 0 2 Last, fix the redundant variables ID and patient 25 Proc SQL; create table G4 as select * , ABS(date_wt - date_BP) as interval from weights(rename=(date=date_wt)), BP (rename=(date=date_BP patient=id )) where weights.id = BP.id and (calculated interval LE 7); Obs id date_wt kg date_BP dbp sbp interval 1 2 3 4 5 33 33 34 35 35 10MAR2009 12APR2009 02FEB2009 21MAY2009 08JUN2009 78 81 65 71 71 15MAR2009 10APR2009 07FEB2009 21MAY2009 06JUN2009 72 68 55 66 68 112 105 95 110 105 5 2 5 0 2 This is a fuzzy merge on ID and visit dates within 7-day interval. Very difficult in SAS data step. 26 Longitudinal data: Response Feature Analysis Response feature analysis replaces repeated measurements with one outcome: no more longitudinal data, apply simpler analysis method: ANOVA, regression, t -test. Common response features: • mean • area under the curve (AUC) • for peaked data: maximum or minimum value • for peaked data: time to maximum or minimum value • for growth data: slope of regression line More than one feature can be used, with multiple analyses to compare groups. 27 Visual Analog Scale (VAS) example A nutrition study compared the immediate effect on feelings of hunger after a breakfast muffin containing 0, 5, or 8 g of short-chain fructooligosaccharides (scFOS). To measure hunger, participants marked a visual analog scale (VAS) to indicate how hungry they felt: Distance from zero on scale was numeric response. Participants completed the VAS at 0, 15, 30, 45, 60, 90, 120, 180, and 240 minutes after eating the muffin. 28 100 Hunger Satisfaction Hunger VAS mean ± SE (mm) 80 !! ! ! ! ! 60 ! 40 ! !! ! ! ! ! ! ! ! ! ! 0 g scFOS ! 20 Satisfaction VAS mean ± SE (mm) 100 80 ! ! 60 ! ! ! !! 40 5 g scFOS 20 ! ! 8 g scFOS 0 0 0 1 2 3 4 0 Hours From Baseline 100 100 80 Fullness ! 29 ! ! ! !! 60 ! ! ! ! Finding the!area under a curve: trapezoid rule ! ! ! ! 40 !! ! Sequence of individual’s measurements m i , taken at times t i 20 80 Fullness VAS mean ± SE (mm) Prospective Food Intake VAS mean ± SE (mm) Prospective Food Intake Mean curves in response to each treatment. Differences? ! ! 60 40 20 ! ! m2 ! Measurements 0 0 0 1 2 3 4 Hours From Baseline m3 ! m4 ! m1 ! t1 t2 t3 t4 Measurement times Trapezoid rule: connect measurements with line segments, find area below in gray. 30 0 !! ! ! ! Trapezoid: 4-sided plane figure with 2 parallel sides. Duplicate trapezoid on top gives rectangle that has twice the area. Trapezoid area = ￿ 1￿ (t 2 − t 1)(m 1 + m 2) 2 31 m2 Measurements ! m3 ! m4 ! m1 ! t1 t2 t3 t4 Measurement times ￿ 1￿ Area under the curve = (t 2 − t 1)(m 1 + m 2) + (t 3 − t 2)(m 2 + m 3) + (t 4 − t 3)(m 3 + m 4) 2 Approximates area under true curve of measured quantity m . 32 Calculating AUC (Area Under Curve) In example, VAS hunger measured 9 times: at 0, 15, 30, 45, 60, 90, 120, 180, and 240 minutes after eating the muffin. Convert times to hours: t i = 0, .25, .5, .75, 1, 1.5, 2, 3, 4. ￿ 1￿ AUC = (t 2 − t 1)(m 1 + m 2) + (t 3 − t 2)(m 2 + m 3) + · · · + (t 9 − t 8)(m 8 + m 9) 2 How many trapezoids? Use one array for times, one for measurements. 33 data AUC; set VAS_hunger; array m[9] hunger1-hunger9; array t[9] time1-time9; time1=0; time2=0.25; time3=0.5; time4=0.75; time5=1; time6=1.5; time7=2; time8=3; time9=4; AUC = 0; do j=1 to 8; why 8 instead of 9? next_trapezoid = 0.5 * (t[j+1] - t[j])*(m[j] + m[j+1]); AUC = sum(AUC, next_trapezoid); end; How should we adapt this to find maximum hunger score? 34 What if some observations are missing? Suppose a subject is missing VAS hunger measurement m 2 at time 0.25 hours. What happens to the AUC calculation in SAS? AUC = ￿ 1￿ (t 2−t 1)(m 1+m 2)+(t 3−t 2)(m 2+m 3)+(t 4−t 3)(m 3+m 4)+ · · · +(t 9−t 8)(m 8+m 9) 2 35 Write code to alert you to problems: write observations with missing data to a separate data set. To create 2 datasets, give 2 names, and separate output statements. data hunger_AUC missing; create 2 data sets set VAS_hunger; array m[9] hunger1-hunger9; array t[9] time1-time9; time1=0; time2=0.25; time3=0.5; time4=0.75; time5=1; time6=1.5; time7=2; time8=3; time9=4; 36 AUC = 0; do j=1 to 8; next_trapezoid = 0.5 * (t[j+1] - t[j])*(m[j] + m[j+1]); if (next_trapezoid = .) then do; output deal with a missing value missing; GOTO Duluth; jump to label ’Duluth’ end; AUC= sum(AUC,next_trapezoid); end; output hunger_AUC; Duluth: SAS label ends with full colon, not semicolon 37 proc print data=missing; Obs subject hunger1 hunger2 hunger3 hunger4 1 1 81 . 10 15 hunger5 ... 37 ... Common practice: replace missing k -th value at t k by linear interpolation from measurements m k −1, m k +1 on either side. Solve for x : m k −1 − x t k −1 − t k = m k −1 − m k +1 t k −1 − t k +1 Use imputation for missing measurements at the ends. 38 ...
View Full Document

This note was uploaded on 11/21/2011 for the course PUBH 6470 taught by Professor Williamthomas during the Fall '11 term at University of Florida.

Ask a homework question - tutors are online