Chapter 9 Data Cleaning and Preparation

STAT1303 Data Management
HKU STAT1303 (2011-12, Semester 1)

9. Data Cleaning and Preparation

After data have been collected and transferred to electronic files, several problems may arise in the data files:

- Incomplete data (missing values).
- Noisy data, which may be caused by recording errors, coding errors, data transfer errors, etc. Even if no such errors are found, outliers may still be present in the data files.
- Data inconsistencies, which may be caused by different coding schemes or unobserved relations among data items.
- Duplicate records.

Consequently, data cleaning is essential to data management, because low-quality data can lead to unreliable analysis results and, in turn, to wrong decisions. Data cleaning is a process of improving data quality by

1. identifying potential errors and data problems in the data (detecting data problems);
2. confirming data errors (tracing back to the origin of the data to confirm error cases);
3. making the needed changes in the data (handling data problems).

Example 9.1. Suppose a patient visits a clinic or hospital and some information about the patient is collected, such as

1. Patient number (patient identification)
2. Gender
3. Visit date
4. Heart rate, systolic blood pressure, diastolic blood pressure, etc.
5. Diagnosis

The clinical trial data are subsequently transferred to an electronic file named PATIENTS.TXT, which is then read in by SAS for statistical analysis. Here is an example of the data file.
    001M11/11/1998 88140 80 10
    002F11/13/1998 84120 78 X0
    003X10/21/1998 68190100 31
    004F01/01/1999101200120 5A
    XX5M05/07/1998 68120 80 10
    006 06/15/1999 72102 68 61
    007M08/32/1998 88148102 0
     M11/11/1998 90190100 0
    008F08/08/1998210 70
    009M09/25/1999 86240180 41
    0l0fl0/19/1999 40120 10
    011M13/13/1998 68300 20 41
    012M10/12/98 60122 74 0
    013208/23/1999 74108 64 1
    014M02/02/1999 22130 90 1
    ...

The codebook of the data set is given by

    Variable Name  Description               Starting Column  Length  Variable Type  Valid Values
    -------------  ------------------------  ---------------  ------  -------------  ----------------------
    PATNO          Patient Number            1                3       Character      Numerals only
    GENDER         Gender                    4                1       Character      'M' or 'F'
    VISIT          Visit Date                5                10      MMDDYY10.      Any valid date
    HR             Heart Rate                15               3       Numeric        Between 40 and 100
    SBP            Systolic Blood Pressure   18               3       Numeric        Between 80 and 200
    DBP            Diastolic Blood Pressure  21               3       Numeric        Between 60 and 120
    DX             Diagnosis Code            24               3       Character      1 to 3 digit numeral
    AE             Adverse Event             27               1       Character      '0' or '1'

Note that PATNO is the identification variable.

After the data are read in, simple summarization reports are run.
    * Example 9.1 - simple summarization;
    * The libref must match the one used in the DATA statement;
    libname mylib 'D:/temp';

    data mylib.patients;
       infile 'D:/temp/patients.txt' pad;
       * Column positions and widths follow the codebook;
       input @1  patno  $3.
             @4  gender $1.
             @5  visit  mmddyy10.
             @15 hr     3.
             @18 sbp    3.
             @21 dbp    3.
             @24 dx     $3.
             @27 ae     $1.;
       label patno  = 'Patient Number'
             gender = 'Gender'
             visit  = 'Visit Date'
             hr     = 'Heart Rate'
             sbp    = 'Systolic Blood Pressure'
             dbp    = 'Diastolic Blood Pressure'
             dx     = 'Diagnosis Code'
             ae     = 'Adverse Event?';
       format visit mmddyy10.;
    run;

An initial examination of the data can be performed with the following procedures:

    proc freq data=mylib.patients;
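The valid values listed in the codebook can also be checked record by record, which corresponds to the first step of data cleaning (detecting data problems). The following is only a minimal sketch and not necessarily the approach taken in these notes. It assumes that the data set mylib.patients has been created by the DATA step above, and it treats missing values as invalid, which is an assumption rather than something the codebook states.

```sas
* Sketch only - flag values outside the valid values given in the codebook;
* Assumes mylib.patients exists and that missing values count as invalid;
data _null_;
   set mylib.patients;

   * NOTDIGIT returns the position of the first non-digit character, 0 if none;
   if notdigit(patno) > 0      then put 'Check PATNO:  ' patno= ;
   if gender not in ('M', 'F') then put 'Check GENDER: ' patno= gender= ;

   * An invalid date such as 08/32/1998 is read in as a missing value;
   if missing(visit)           then put 'Check VISIT:  ' patno= ;

   * A missing numeric value fails the range test and is flagged as well;
   if not (40 <= hr  <= 100)   then put 'Check HR:     ' patno= hr= ;
   if not (80 <= sbp <= 200)   then put 'Check SBP:    ' patno= sbp= ;
   if not (60 <= dbp <= 120)   then put 'Check DBP:    ' patno= dbp= ;

   * DX must be a numeral of 1 to 3 digits and AE must be 0 or 1;
   if missing(dx) or notdigit(strip(dx)) > 0
                               then put 'Check DX:     ' patno= dx= ;
   if ae not in ('0', '1')     then put 'Check AE:     ' patno= ae= ;
run;
```

The messages are written to the SAS log. Each flagged record is only a candidate error; it should be traced back to the original source before any value is changed.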

This note was uploaded on 02/09/2012 for the course STAT 1301 taught by Professor Smslee during the Spring '08 term at HKU.
