This preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full Document
Unformatted text preview: THE UNIVERSITY OF HONG KONG
DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCE STAT1303 DATA MANAGEMENT May 31, 2007 Time: 9:30 a.m. — 11:30 a.m. Candidates taking examinations that permit the use of calculators may use any calculator which
fulﬁls the following criteria: (a) it should be selfcontained, silent, batteryoperated and pocket
sized and (b) it should have numeral—display facilities only and should be used only for the purposes
of calculation. It is the candidate ’s responsibility to ensure that the calculator operates satisfactorily and the
candidate must record the name and type of the calculator 0n the ﬁ'ont page of the examination
scripts. Lists of permitted/prohibited calculators will not be made available to candidates for
reference, and the onus will be on the candidate to ensure that the calculator used will not be in
violation of the criteria listed above. Answer ALL FOUR questions. The questions are of equal value. An abridged version of SAS
syntax is provided in ANNEX. 1. The raw data ﬁle ‘MEDDAT’ consists of the medical information on a number of patients. The
variables are deﬁned as follows. Variable Details Type Length Missing Values
X1 Gender Char 1 ‘U'
X2 Age Numeric 2 0
X3 Group Char 1 'U’
SYM1 to SYM3 Three different symptoms Numeric 1 9
DX1 to DX4 Four different diagnosis Numeric 2 99 __ values {010) Variable Code Label X1 'F’ Female
‘M’ Male X2 0 Unknown X3 ‘1’ Mild
‘2’ Moderate
‘3’ Severe
‘U’ Unknown SYM1 to SYM3 I 1 Symptom exists
0 Symptom does not exist
9 Unknown DX1 to DX4 99 Unknown S&AS: STAT1303 Data Management 2 The values are stored in column format with the length deﬁned in the above table. They are
ordered according to the appearance in the table. There is no delimiter between any values of the observations. Each observation is stored in one row. Three observations are given as follows.
' l ( column index, not in the data file
123456789012345 ( F45100001020303 ( observations in the data file
M34210005030403 (
F0 U9900599 00 ( Write SAS programs to answer the follow questions.
(a) Create a SAS data set named D1 for the data in the library named LIBl linked to the folder
‘c:\libl ’. (b) Create a new data set D2 in the library LIBl which consist of (i) the contents copied from the data set D1 (ii) a new variable SYMCOUNT which is the number of symptoms (If any symptom is unknown, set SYMCOUNT to missing.) (iii) a new variable DX which is
equal to the average of DXl and DX2 if the patient is in group 1
equal to the average of DX3 and DX4 if the patient is in group 2
equal to the average of DXl to DX4 if the patient is in group 3
missing if group is missing. * o (if any required value of DXl to DX4 in the above deﬁnition is missing, DX is
also missing.) (0) For the data set D2 (i) Find the number of valid cases for each variable
(ii) Find the percentages of patients with a different number of symptoms for each gender (iii) Find the mean and SD of the age of the patients who were in group 1 and the value of
DX is either smaller than 3 or greater than 12 S&AS: STAT1303 Data Management 3 2. A data set contains the systolic blood pressure (SBP) and diastolic blood pressure (DBP) for
each subject. There are 2 groups of subjects, A and B. The following SAS program is to create
the data set and express each measurement as a percentage of the group mean. DATA TEST; , INPUT GROUP 5 SB? DBP @@;
CARDS;
A 160 100 B 150 90 A 140 80
B 200 180 A 180 140 B 140 80 PROC SORT DATA=TEST;
BY GROUP;
RUN;
PROC MEANS DATA=TEST NOPRINT NWAY;
CLASS GROUP;
VAR sap DBP;
OUTPUT OUT=MOUT(DROP=_TYPE_ FREQ )
MEAN=M_SBP M_DBP; T — RUN; DATA NEW;
MERGE TEST MOUT;
BY GROUP;
SBPPER=100*SBP/M_SBP;
DBPPER=100*DBP/M_DBP;
DROP M SBP M DBP; RUN; * ﬂ PROC PRINT DATA=NEW NOOBS; RUN; Answer the following question when the whole program is submitted.
(a) What are the types and lengths of the variables in the data set TEST?
(b) How many observations are there in the data set TEST? (0) Could we drop PROC SORT? Why? (d) How many observations are there in the data set MOUT? (e) What are the values of the variables for each observation in the data set MOUT? (D What is the ﬁnal output from this SAS program? Show all the values in the output if there
is any. (g) Could we drop “@@” to achieve the same result in (0? Why? 3. The raw data ﬁle IADL.DAT consists of a number of cases in list format. The variables are
deﬁned as follows. Variable Description Valid values
ID ID number Numerals only
Gender Gender of the elderly ‘M’ or ‘F’
AGE ' Age of the elderly (year) 60 or above
IADL IADL score of the elderly 0 to 100 Write SAS programs for the following. (a) Check the missing values by listing out the ID and the variables with missing values (b) Check the ranges of all variables by listing out the ID and the variables outside the valid
range (c) Check if there are any outliers by listing out the ID and the variables outside of the meani3 SD S&AS:
4. : STAT1303 Data Management 4 The data set HEI consists of the height of a number of children for various ages. The age range
is 0 yr, 1 yr, 2 yr and 3 yr. There are 3 variables in the data set which are given as follows. Variable Description
id ID number
age Age (year) height Height g cm)
Observations for two of the children are given as follows. ID Age Height
50 0 l 55
2 58
3 62
0 47
l 51
2 54
3 59 NNNNt—‘Ht—‘H Write SAS programs for the following. (a) Sort the data set HEI according to the ID number and the age. (b) Create a new data set HEI_H as a transpose of the data set HEI such that all the height
measurements of one child are stored in only one observation and each observation
consists of the height of a child at age 0, age 1, age 2 and age 3, respectively. The variables
of HEI_H are deﬁned as follows. Variable Description ID ID number H00 Height at age 0
H01 Height at age 1
H02 Height at age 2
H03 Height at age 3 The observations for the two of the children in HEI H would look like this: id H00 H01 H02 H03
1 50 55 58 62
2 47 51 54 59 (c) Based on HEI_H, save the mean and SD of the height at different ages to a new data set
MSDHEI in which the variables of the means are M00 to M03 and the variables for the SD
are SD00 to SD03. Given that to save the statistics, the keyword for mean is MEAN: and
that for standard deviation is ST D=. Create a new data set HSD by combining the data set MSDHEI and HEI_H, in which the
standardized values of the height at different ages are calculated. The variables of the
standardized height should be HSDOO to HSD03. The standardized value is given as
(height — mean) / SD. (d) ************M END OF pApER ************** S&AS: STAT1303 Data Management ANNEX : An abridged version of SAS Syntax A. DATA STEP LIBNAME libref ’SAS—datalz‘bmry’ ; DATA datasetI <(dataset0ptions)> V . . ;
INPUT variableﬁ) <format> . . . ;
LENGTH variable1 <$>length. . . ;
INFORMAT variableI <inf0rmat> . . . ;
LABEL variableIz’labelI’ . . . ;
FORMAT variable—1 <format> . . . ;
CARDSIDATALINES ; data 1 RUN; LIBNAME libref ’SAS—datalibmry’ ;
DATA dataset—J <(data—setoptions)> . . . ;
INFILE ﬁlename; INPUT variable(3) <f0rmat> ; LENGTH variable1 <$>length. . . ;
INFORMAT variable1 <inf0rmat> . . . ;
LABEL variable1=’label1’. . . ;
FORMAT variable—1 <f0rmat> . . . ;
RUN; S&AS:‘ STAT1303 Data Management 6 LIBNAME libref ’SAS—data—library’ ; DATA datasetJ <(data—setoptions)>. . . ; MERGEISET datasetI <(data—set—options)>
<dataset2 <(data—setoptions)> > ...; UPDATE datasetJ <(data—set—options)>
dataset—2 <(data—set0ptions)> ; BY <DESCENDING> variableI . . . ; DROP variable(s) ; KEEP variablds) ; variable=e$presston ; ARRAY armyname (subscript) <a7‘my—elements> ; DELETE ; FILE ﬁleref. . . ; OUTPUT datasetJ . . . ; PUT ’chamcterstm'ng’ variable1: . . . RENAME oldname1=newname—1 . . , ; RETAIN variable(5) ; STOP; WHERE whereexpression ; IF expression ; IF expression THEN statement ; <ELSE statement ; > DO; . . . more statements. . . END; DO index—vamablez'sta'rt TO stop ;
. more SAS statements . . . END; RUN; 1. data set options in DATA step and other SAS PROCS:
DROPz, FIRSTOBS=, IN=, KEEP=, OBS=, RENAMEz, WHERE: S&AS: STAT1303 Data Managemeht ‘ 7 B. The following statements are common to All SAS PROCs A 1. FORMAT statement:
FORMAT variable1 <f0rmat> . . . ; 2. LABEL statement:
LABEL variable1=’labelJ’ . ‘ . ; 3. WHERE statement:
WHERE whereexpression ; C. CONTENTS PROC CONTENTS <DATA=datasetname> <VARNUM> ; RUN; D. EXPORT PROC EXPORT DATA: datasetname
OUTFILE=“ﬁlename” I OUTTABLE=“tablename” <DBMS=identiﬁer><REPLACE> ; <data—source—statements ;>
RUN; E. FORMAT PROC FORMAT <options1 > ;
INVALUE <$> name value—or—mngeIzinformatvalueI < valueormngen==inf0rmatvaluen> ; VALUE <$>name valueormngeJ=format—value—1 < . .. valueorTangen=format—'ualuen> ;
RUN; S&AS: ISTAT1303 Data Management 8 1. options in PROC FORMAT:
CNTLIN==, CNTLOUT=, LIBRARY: F. FREQ PROC FREQ <DATA=datasetname <data set options>> <optionsl>;
TABLES variable] variableQ variableQWam’ablel </optionsz> ; WEIGHT variable;
BY <DESCENDING> variable—1 < <DESCENDING> variable71>;
RUN; 1. options in PROC FREQ:
FORMCHAR(1,2,7)=formcha7"—st'rmg, PAGE, NOPRINT 2. options in TABLE statement:
NOCOL, NOROW, NOPRECENT, NOFREQ, NOCUM, NOPRINT,
TESTP=(p1p2 . . .), EXPECTED, CHISQ, FISHERIEXACT, MEASURES,
MISSING, MISSPRINT, OUT=datasetname <data set options> G. IMPORT PROC IMPORT DATAFILE=“ﬁlename” I TABLE=“tablename”
OUT=datasetname <DBMS=identiﬁer><REPLACE> ; RUN; S&AS: STAT1303 Data Management 9 H. MEANS PROC MEANS <DATA=datasetname <data set options>>
<options1 > statistic—keywordg;
BY <DESCENDING> variable1 < <DESCENDING> variable71>;
CLASS grouping—variable(s);
VAR variableﬁs); FREQ variable; ID variablds); I OUTPUT OUT=datasetname <data set 0ptions>
statistic—keyword3<(variable(s))> = <name(s)>; RUN; 1. options in PROC MEANS:
ALPHA=, MISSING, NONOBS, NOPRINT, NWAY 2. statistic—keyword in PROC MEANS:
CLM CSS CV ‘KURTOSIS LCLM MAX MEAN MIN N NMISS RANGE
“SKEWNESS STD STDERR SUM SUMWGT UCLM USS VAR
MEDIAN P1 P5 P10 Q1 Q3 P90 P95 P99 QRANGE PROBT T 3. statistic—keyword in OUTPUT statement:
CSS CV KURTOSIS LCLM MAX MEAN MIN N NMISS RANGE
SKEWNESS STD STDERR SUM SUMWGT UCLM USS VAR
MEDIAN P1 P5 P10 Q1 Q3 P90 P95 P99 QRANGE PROBT T I. PRINT PROC PRINT <DATA=datasetname <data set 0pti0ns>> <options1 > ;
VAR variableﬂs);
BY <DESCENDING> variablds); ID variableﬁs);
SUM variableﬁs);
RUN; 1. options in PROC PRINT:
NOOBS, LABEL S&AS: STAT1303 Data Management _ 10 J. SORT PROC SORT <DATA=datasetname <data set 0ptions>>
<OUT=datasetname <data set 0ptions>> ; BY <DESCENDING> variableJ < <DESCENDING> variablen>;
RUN; K. SQL PROC SQL ;
CREATE TABLE tablename AS queryexpression
<ORDER BY orderbyz'tem <,0rder~byitem>...>;
SELECT <DISTINCT> objectitem <,0bjectitem>...
<INTO :macrovariablespeciﬁcation <, :macrovam‘ablespeciﬁcation>...>
FROM from—list
<WHERE sql—empressi0n> <GROUP BY groupbyitem <,gr0upbyitem>...> <HAVING sqleccpressz'on> <ORDER BY orderbyitem <,0rderbyitem>...>;
QUIT; S&AS: STAT1303 Data Management 11 L. UNIVARIA’I‘E PROC UNIVARIATE <DATA=datasetname <data set options>>
<0ptions1 >; BY <DESCENDING> variable1 < <DESCENDING> variable72>; CLASS groupingvariablds); VAR variableﬁs}; FREQ variable; ID variable(s); HISTOGRAM variablds) / normal; QQPLOT variableﬁs) / normal (muzest sigmazest); OUTPUT OUT = datasetname statistickeyword2< (vanable{s))> <name(s)>;
RUN; 1. options in PROC UNIVARIATE:
ALL, ALPHszalue, CIBASIC<TYPE=LOWERUPPERITWOSIDE>,
MUO=value(s), NORMAL, ROBUSTSCALE, FREQ, NOPRINT, PLOTS,
NEXTROBS=n, NEXTRVALG 2. statistic—keyword in OUTPUT statement:
CSS CV KURTOSIS MAX MEAN N MIN MODE RANGE NMISS NOBS
STDMEAN SKEWNESS STD USS SUM SUMWGT VAR
MEDIAN P1 P5 P10 P90 P95 P99 Q1 Q3 QRANGE
GINI MAD QN SN STD_GINI STD_MAD STD_QN STD_QRANGE STD_SN ...
View
Full Document
 Spring '11
 Statistics

Click to edit the document details