CH10 - STAT1303A Data Management 10. Data Cleaning 10 Data...


STAT1303A Data Management (HKU, 2009-10, Semester 1)

10 Data Cleaning

After data have been collected and transferred to electronic files, several problems may arise in the data files:

- Incomplete data (missing values).
- Noisy data, which may be caused by recording errors, coding errors, data transfer errors, etc. Even if no such errors are found, outliers may still be present in the data files.
- Data inconsistencies, which may be caused by different coding schemes or unobserved relations among data items.
- Duplicate records.

Consequently, data cleaning is essential to data management, because low-quality data can lead to unreliable analysis results and, in turn, to wrong decisions. Data cleaning is a process of improving data quality by

1. identifying potential errors and data problems in the data - detecting data problems;
2. confirming data errors - tracing back to the origin of the data to confirm error cases;
3. making the needed changes in the data - handling data problems.

Example 10.1. Suppose a patient visits a clinic or hospital and some information about the patient is collected, such as

1. Patient number (patient identification)
2. Gender
3. Visit date
4. Heart rate, systolic blood pressure, diastolic blood pressure, etc.
5. Diagnosis

The clinical trial data are then transferred to an electronic file named PATIENTS.TXT, which is read by SAS for statistical data analysis (PATIENTS.SAS). An example of the data file follows.

001M11/11/1998 88140 80 10
002F11/13/1998 84120 78 X0
003X10/21/1998 68190100 31
004F01/01/1999101200120 5A
XX5M05/07/1998 68120 80 10
006 06/15/1999 72102 68 61
007M08/32/1998 88148102
M11/11/1998 90190100
008F08/08/1998210 70
009M09/25/1999 86240180 41
0l0fl0/19/1999 40120 10
011M13/13/1998 68300 20 41
012M10/12/98 60122 74
013208/23/1999 74108 64 1
014M02/02/1999 22130 90 1
...
002F11/13/1998 84120 78 X0
003M11/12/1999 58112 74
015F 82148 88 31
017F04/05/1999208 84 20
019M06/07/1999 58118 70
123M15/12/1999 60 10
321F 900400200 51
020F99/99/9999 10 20 8
022M10/10/1999 48114 82 21
023f12/31/1998 22 34 78
024F11/09/199876 120 80 10
025M01/01/1999 74102 68 51
027FNOTAVAIL NA 166106 70
028F03/28/1998 66150 90 30
029M05/15/1998 41
006F07/07/1999 82148 84 10

The codebook of the data set is as follows:

Variable  Description               Starting  Length  Variable   Valid Values
Name                                Column            Type
PATNO     Patient Number            1         3       Character  Numerals only
GENDER    Gender                    4         1       Character  'M' or 'F'
VISIT     Visit Date                5         10      MMDDYY10.  Any valid date
HR        Heart Rate                15        3       Numeric    Between 40 and 100
SBP       Systolic Blood Pressure   18        3       Numeric    Between 80 and 200
DBP       Diastolic Blood Pressure  21        3       Numeric    Between 60 and 120
DX        Diagnosis Code            24        3       Character  1 to 3 digits numeral
AE        Adverse Event             27        1       Character  '0' or '1'

Note that PATNO is the identification variable.
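The detection step can be illustrated by checking each record against the codebook above. The notes read the file with SAS; what follows is only a minimal Python sketch of the same idea, not the course's PATIENTS.SAS program. The helper names (make, parse, check) and the four sample records are hypothetical, built for illustration from the codebook's column positions rather than taken from PATIENTS.TXT.

```python
from collections import Counter
from datetime import datetime

# (variable name, starting column (1-based), length), as listed in the codebook
LAYOUT = [
    ("PATNO", 1, 3), ("GENDER", 4, 1), ("VISIT", 5, 10), ("HR", 15, 3),
    ("SBP", 18, 3), ("DBP", 21, 3), ("DX", 24, 3), ("AE", 27, 1),
]
# Valid numeric ranges from the codebook
RANGES = {"HR": (40, 100), "SBP": (80, 200), "DBP": (60, 120)}


def make(patno, gender, visit, hr, sbp, dbp, dx, ae):
    """Hypothetical helper: format one 27-character fixed-width record."""
    return f"{patno:<3}{gender:<1}{visit:<10}{hr:>3}{sbp:>3}{dbp:>3}{dx:<3}{ae:<1}"


def parse(line):
    """Slice one fixed-width record into a dict of stripped strings."""
    line = line.ljust(27)
    return {name: line[start - 1:start - 1 + length].strip()
            for name, start, length in LAYOUT}


def is_numeral(text):
    """True for a non-empty string of ASCII digits."""
    return text.isascii() and text.isdigit()


def check(rec):
    """Return the names of fields that are missing or violate the codebook."""
    bad = []
    if not is_numeral(rec["PATNO"]):
        bad.append("PATNO")
    if rec["GENDER"] not in ("M", "F"):
        bad.append("GENDER")
    try:
        datetime.strptime(rec["VISIT"], "%m/%d/%Y")  # MM/DD/YYYY, 4-digit year
    except ValueError:
        bad.append("VISIT")
    for name, (low, high) in RANGES.items():
        value = rec[name]
        if not is_numeral(value) or not low <= int(value) <= high:
            bad.append(name)
    if not is_numeral(rec["DX"]):
        bad.append("DX")
    if rec["AE"] not in ("0", "1"):
        bad.append("AE")
    return bad


# Illustrative records only (assumed values, not rows of PATIENTS.TXT)
records = [
    make("101", "M", "11/11/1998", "88", "140", "80", "1", "0"),   # clean
    make("102", "X", "08/32/1998", "68", "190", "100", "3", "1"),  # bad gender, bad date
    make("103", "F", "01/01/1999", "210", "", "70", "X", "0"),     # HR too high, SBP missing, bad DX
    make("101", "M", "12/01/1998", "72", "120", "80", "1", "0"),   # repeated PATNO
]

parsed = [parse(r) for r in records]
problems = [check(rec) for rec in parsed]

# Duplicate detection on the identification variable
counts = Counter(rec["PATNO"] for rec in parsed)
duplicates = [patno for patno, n in counts.items() if n > 1]

for rec, bad in zip(parsed, problems):
    print(rec["PATNO"], bad)
print("Duplicate PATNO:", duplicates)
```

A flagged field is only a potential error. Following the three steps above, each case would still have to be traced back to its origin and confirmed before any value is changed.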