1303_0607sem2

1303_0607sem2 - THE UNIVERSITY OF HONG KONG DEPARTMENT OF...

Info iconThis preview shows pages 1–11. Sign up to view the full content.

View Full Document Right Arrow Icon
Background image of page 1

Info iconThis preview has intentionally blurred sections. Sign up to view the full version.

View Full DocumentRight Arrow Icon
Background image of page 2
Background image of page 3

Info iconThis preview has intentionally blurred sections. Sign up to view the full version.

View Full DocumentRight Arrow Icon
Background image of page 4
Background image of page 5

Info iconThis preview has intentionally blurred sections. Sign up to view the full version.

View Full DocumentRight Arrow Icon
Background image of page 6
Background image of page 7

Info iconThis preview has intentionally blurred sections. Sign up to view the full version.

View Full DocumentRight Arrow Icon
Background image of page 8
Background image of page 9

Info iconThis preview has intentionally blurred sections. Sign up to view the full version.

View Full DocumentRight Arrow Icon
Background image of page 10
Background image of page 11
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: THE UNIVERSITY OF HONG KONG DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCE STAT1303 DATA MANAGEMENT May 31, 2007 Time: 9:30 a.m. — 11:30 a.m. Candidates taking examinations that permit the use of calculators may use any calculator which fulfils the following criteria: (a) it should be self-contained, silent, battery-operated and pocket- sized and (b) it should have numeral—display facilities only and should be used only for the purposes of calculation. It is the candidate ’s responsibility to ensure that the calculator operates satisfactorily and the candidate must record the name and type of the calculator 0n the fi'ont page of the examination scripts. Lists of permitted/prohibited calculators will not be made available to candidates for reference, and the onus will be on the candidate to ensure that the calculator used will not be in violation of the criteria listed above. Answer ALL FOUR questions. The questions are of equal value. An abridged version of SAS syntax is provided in ANNEX. 1. The raw data file ‘MEDDAT’ consists of the medical information on a number of patients. The variables are defined as follows. Variable Details Type Length Missing Values X1 Gender Char 1 ‘U' X2 Age Numeric 2 0 X3 Group Char 1 'U’ SYM1 to SYM3 Three different symptoms Numeric 1 9 DX1 to DX4 Four different diagnosis Numeric 2 99 __ values {0-10) Variable Code Label X1 'F’ Female ‘M’ Male X2 0 Unknown X3 ‘1’ Mild ‘2’ Moderate ‘3’ Severe ‘U’ Unknown SYM1 to SYM3 I 1 Symptom exists 0 Symptom does not exist 9 Unknown DX1 to DX4 99 Unknown S&AS: STAT1303 Data Management 2 The values are stored in column format with the length defined in the above table. They are ordered according to the appearance in the table. There is no delimiter between any values of the observations. Each observation is stored in one row. Three observations are given as follows. ' l ( column index, not in the data file 123456789012345 ( F45100001020303 ( observations in the data file M34210005030403 ( F0 U9900599 00 ( Write SAS programs to answer the follow questions. (a) Create a SAS data set named D1 for the data in the library named LIBl linked to the folder ‘c:\libl ’. (b) Create a new data set D2 in the library LIBl which consist of (i) the contents copied from the data set D1 (ii) a new variable SYMCOUNT which is the number of symptoms (If any symptom is unknown, set SYMCOUNT to missing.) (iii) a new variable DX which is equal to the average of DXl and DX2 if the patient is in group 1 equal to the average of DX3 and DX4 if the patient is in group 2 equal to the average of DXl to DX4 if the patient is in group 3 missing if group is missing. * o (if any required value of DXl to DX4 in the above definition is missing, DX is also missing.) (0) For the data set D2 (i) Find the number of valid cases for each variable (ii) Find the percentages of patients with a different number of symptoms for each gender (iii) Find the mean and SD of the age of the patients who were in group 1 and the value of DX is either smaller than 3 or greater than 12 S&AS: STAT1303 Data Management 3 2. A data set contains the systolic blood pressure (SBP) and diastolic blood pressure (DBP) for each subject. There are 2 groups of subjects, A and B. The following SAS program is to create the data set and express each measurement as a percentage of the group mean. DATA TEST; , INPUT GROUP 5 SB? DBP @@; CARDS; A 160 100 B 150 90 A 140 80 B 200 180 A 180 140 B 140 80 PROC SORT DATA=TEST; BY GROUP; RUN; PROC MEANS DATA=TEST NOPRINT NWAY; CLASS GROUP; VAR sap DBP; OUTPUT OUT=MOUT(DROP=_TYPE_ FREQ ) MEAN=M_SBP M_DBP; T — RUN; DATA NEW; MERGE TEST MOUT; BY GROUP; SBPPER=100*SBP/M_SBP; DBPPER=100*DBP/M_DBP; DROP M SBP M DBP; RUN; * fl PROC PRINT DATA=NEW NOOBS; RUN; Answer the following question when the whole program is submitted. (a) What are the types and lengths of the variables in the data set TEST? (b) How many observations are there in the data set TEST? (0) Could we drop PROC SORT? Why? (d) How many observations are there in the data set MOUT? (e) What are the values of the variables for each observation in the data set MOUT? (D What is the final output from this SAS program? Show all the values in the output if there is any. (g) Could we drop “@@” to achieve the same result in (0? Why? 3. The raw data file IADL.DAT consists of a number of cases in list format. The variables are defined as follows. Variable Description Valid values ID ID number Numerals only Gender Gender of the elderly ‘M’ or ‘F’ AGE ' Age of the elderly (year) 60 or above IADL IADL score of the elderly 0 to 100 Write SAS programs for the following. (a) Check the missing values by listing out the ID and the variables with missing values (b) Check the ranges of all variables by listing out the ID and the variables outside the valid range (c) Check if there are any outliers by listing out the ID and the variables outside of the meani3 SD S&AS: 4. :- STAT1303 Data Management 4 The data set HEI consists of the height of a number of children for various ages. The age range is 0 yr, 1 yr, 2 yr and 3 yr. There are 3 variables in the data set which are given as follows. Variable Description id ID number age Age (year) height Height g cm) Observations for two of the children are given as follows. ID Age Height 50 0 l 55 2 58 3 62 0 47 l 51 2 54 3 59 NNNNt—‘Ht—‘H Write SAS programs for the following. (a) Sort the data set HEI according to the ID number and the age. (b) Create a new data set HEI_H as a transpose of the data set HEI such that all the height measurements of one child are stored in only one observation and each observation consists of the height of a child at age 0, age 1, age 2 and age 3, respectively. The variables of HEI_H are defined as follows. Variable Description ID ID number H00 Height at age 0 H01 Height at age 1 H02 Height at age 2 H03 Height at age 3 The observations for the two of the children in HEI H would look like this:- id H00 H01 H02 H03 1 50 55 58 62 2 47 51 54 59 (c) Based on HEI_H, save the mean and SD of the height at different ages to a new data set MSDHEI in which the variables of the means are M00 to M03 and the variables for the SD are SD00 to SD03. Given that to save the statistics, the keyword for mean is MEAN: and that for standard deviation is ST D=. Create a new data set HSD by combining the data set MSDHEI and HEI_H, in which the standardized values of the height at different ages are calculated. The variables of the standardized height should be HSDOO to HSD03. The standardized value is given as (height — mean) / SD. (d) ************M END OF pApER ************** S&AS: STAT1303 Data Management ANNEX : An abridged version of SAS Syntax A. DATA STEP LIBNAME libref ’SAS—data-lz‘bmry’ ; DATA dataset-I <(data-set-0ptions)> V . . ; INPUT variablefi) <format> . . . ; LENGTH variable-1 <$>length. . . ; INFORMAT variable-I <inf0rmat> . . . ; LABEL variable-Iz’label-I’ . . . ; FORMAT variable—1 <format> . . . ; CARDSIDATALINES ; data 1 RUN; LIBNAME libref ’SAS—data-libmry’ ; DATA dataset—J <(data—set-options)> . . . ; INFILE filename; INPUT variable(3) <f0rmat> ; LENGTH variable-1 <$>length. . . ; INFORMAT variable-1 <inf0rmat> . . . ; LABEL variable-1=’label-1’. . . ; FORMAT variable—1 <f0rmat> . . . ; RUN; S&AS:‘ STAT1303 Data Management 6 LIBNAME libref ’SAS—data—library’ ; DATA dataset-J <(data—set-options)>. . . ; MERGEISET dataset-I <(data—set—options)> <dataset-2 <(data—set-options)> > ...; UPDATE dataset-J <(data—set—options)> dataset—2 <(data—set-0ptions)> ; BY <DESCENDING> variable-I . . . ; DROP variable(s) ; KEEP variablds) ; variable=e$presston ; ARRAY army-name (subscript) <a7‘my—elements> ; DELETE ; FILE fileref. . . ; OUTPUT dataset-J . . . ; PUT ’chamcter-stm'ng’ variable-1: . . . RENAME old-name-1=new-name—1 . . , ; RETAIN variable(5) ; STOP; WHERE where-expression ; IF expression ; IF expression THEN statement ; <ELSE statement ; > DO; . . . more statements. . . END; DO index—vamablez'sta'rt TO stop ; . more SAS statements . . . END; RUN; 1. data set options in DATA step and other SAS PROCS: DROPz, FIRSTOBS=, IN=, KEEP=, OBS=, RENAMEz, WHERE: S&AS: STAT1303 Data Managemeht ‘ 7 B. The following statements are common to All SAS PROCs A 1. FORMAT statement: FORMAT variable-1 <f0rmat> . . . ; 2. LABEL statement: LABEL variable-1=’label-J’ . ‘ . ; 3. WHERE statement: WHERE where-expression ; C. CONTENTS PROC CONTENTS <DATA=datasetname> <VARNUM> ; RUN; D. EXPORT PROC EXPORT DATA: datasetname OUTFILE=“filename” I OUTTABLE=“tablename” <DBMS=identifier><REPLACE> ; <data—source—statements ;> RUN; E. FORMAT PROC FORMAT <options1 > ; INVALUE <$> name value—or—mnge-Izinformat-value-I < value-or-mnge-n==inf0rmat-value-n> ; VALUE <$>name value-or-mnge-J=format—value—1 < . .. value-or-Tange-n=format—'ualue-n> ; RUN; S&AS: ISTAT1303 Data Management 8 1. options in PROC FORMAT: CNTLIN==, CNTLOUT=, LIBRARY: F. FREQ PROC FREQ <DATA=datasetname <data set options>> <optionsl>; TABLES variable] variableQ variableQWam’ablel </optionsz> ; WEIGHT variable; BY <DESCENDING> variable—1 < <DESCENDING> variable-71>; RUN; 1. options in PROC FREQ: FORMCHAR(1,2,7)=formcha7"—st'rmg, PAGE, NOPRINT 2. options in TABLE statement: NOCOL, NOROW, NOPRECENT, NOFREQ, NOCUM, NOPRINT, TESTP=(p1p2 . . .), EXPECTED, CHISQ, FISHERIEXACT, MEASURES, MISSING, MISSPRINT, OUT=datasetname <data set options> G. IMPORT PROC IMPORT DATAFILE=“filename” I TABLE=“tablename” OUT=datasetname <DBMS=identifier><REPLACE> ; RUN; S&AS: STAT1303 Data Management 9 H. MEANS PROC MEANS <DATA=datasetname <data set options>> <options1 > statistic—keywordg; BY <DESCENDING> variable-1 < <DESCENDING> variable-71>; CLASS grouping—variable(s); VAR variablefis); FREQ variable; ID variablds); I OUTPUT OUT=datasetname <data set 0ptions> statistic—keyword3<(variable(s))> = <name(s)>; RUN; 1. options in PROC MEANS: ALPHA=, MISSING, NONOBS, NOPRINT, NWAY 2. statistic—keyword in PROC MEANS: CLM CSS CV ‘KURTOSIS LCLM MAX MEAN MIN N NMISS RANGE “SKEWNESS STD STDERR SUM SUMWGT UCLM USS VAR MEDIAN P1 P5 P10 Q1 Q3 P90 P95 P99 QRANGE PROBT T 3. statistic—keyword in OUTPUT statement: CSS CV KURTOSIS LCLM MAX MEAN MIN N NMISS RANGE SKEWNESS STD STDERR SUM SUMWGT UCLM USS VAR MEDIAN P1 P5 P10 Q1 Q3 P90 P95 P99 QRANGE PROBT T I. PRINT PROC PRINT <DATA=datasetname <data set 0pti0ns>> <options1 > ; VAR variablefls); BY <DESCENDING> variablds); ID variablefis); SUM variablefis); RUN; 1. options in PROC PRINT: NOOBS, LABEL S&AS: STAT1303 Data Management _ 10 J. SORT PROC SORT <DATA=datasetname <data set 0ptions>> <OUT=datasetname <data set 0ptions>> ; BY <DESCENDING> variable-J < <DESCENDING> variable-n>; RUN; K. SQL PROC SQL ; CREATE TABLE table-name AS query-expression <ORDER BY order-by-z'tem <,0rder~by-item>...>; SELECT <DISTINCT> object-item <,0bject-item>... <INTO :macro-variable-specification <, :macro-vam‘able-specification>...> FROM from—list <WHERE sql—empressi0n> <GROUP BY group-by-item <,gr0up-by-item>...> <HAVING sql-eccpressz'on> <ORDER BY order-by-item <,0rder-by-item>...>; QUIT; S&AS: STAT1303 Data Management 11 L. UNIVARIA’I‘E PROC UNIVARIATE <DATA=datasetname <data set options>> <0ptions1 >; BY <DESCENDING> variable-1 < <DESCENDING> variable-72>; CLASS grouping-variablds); VAR variablefis}; FREQ variable; ID variable(s); HISTOGRAM variablds) / normal; QQPLOT variablefis) / normal (muzest sigmazest); OUTPUT OUT = datasetname statistic-keyword2< (vanable{s))> <name(s)>; RUN; 1. options in PROC UNIVARIATE: ALL, ALPHszalue, CIBASIC<TYPE=LOWER|UPPERITWOSIDE>, MUO=value(s), NORMAL, ROBUSTSCALE, FREQ, NOPRINT, PLOTS, NEXTROBS=n, NEXTRVALG 2. statistic—keyword in OUTPUT statement: CSS CV KURTOSIS MAX MEAN N MIN MODE RANGE NMISS NOBS STDMEAN SKEWNESS STD USS SUM SUMWGT VAR MEDIAN P1 P5 P10 P90 P95 P99 Q1 Q3 QRANGE GINI MAD QN SN STD_GINI STD_MAD STD_QN STD_QRANGE STD_SN ...
View Full Document

This document was uploaded on 03/18/2012.

Page1 / 11

1303_0607sem2 - THE UNIVERSITY OF HONG KONG DEPARTMENT OF...

This preview shows document pages 1 - 11. Sign up to view the full document.

View Full Document Right Arrow Icon
Ask a homework question - tutors are online