Chapter 10. Data cleaning

- What is data cleaning
- Validity checks
- Verification checks
- Check for missing values
- Check for duplicate data entries
- Consistency checks

What is data cleaning?

- When a patient visits a clinic or hospital, some information about the patient is collected:
    - Patient number (patient identification)
    - Gender
    - Visit date
    - Heart rate, systolic blood pressure, diastolic blood pressure, etc.
    - Diagnosis
- Suppose the clinical trial data are subsequently transferred to an electronic file named PATIENTS.TXT.
- The electronic file is then read in by SAS for statistical data analysis (PATIENTS.SAS).

PATIENTS.TXT

    001M11/11/1998 88140 80 10
    002F11/13/1998 84120 78 X0
    003X10/21/1998 68190100 31
    004F01/01/1999101200120 5A
    XX5M05/07/1998 68120 80 10
    006 06/15/1999 72102 68 61
    007M08/32/1998 88148102
    M11/11/1998 90190100
    008F08/08/1998210 70
    009M09/25/1999 86240180 41
    0l0fl0/19/1999 40120 10
    011M13/13/1998 68300 20 41
    012M10/12/98 60122 74
    013208/23/1999 74108 64 1
    014M02/02/1999 22130 90 1
    ...
    002F11/13/1998 84120 78 X0
    003M11/12/1999 58112 74
    015F 82148 88 31
    017F04/05/1999208 84 20
    019M06/07/1999 58118 70
    123M15/12/1999 60 10
    321F 900400200 51
    020F99/99/9999 10 20 8
    022M10/10/1999 48114 82 21
    023f12/31/1998 22 34 78
    024F11/09/199876 120 80 10
    025M01/01/1999 74102 68 51
    027FNOTAVAIL NA 166106 70
    028F03/28/1998 66150 90 30
    029M05/15/1998 41
    006F07/07/1999 82148 84 10

Code book

    Variable Name   Description                Starting Column   Length   Variable Type   Valid Values
    PATNO           Patient Number             1                 3        Character       Numerals only
    GENDER          Gender                     4                 1        Character       'M' or 'F'
    VISIT           Visit Date                 5                 10       MMDDYY10.       Any valid date
    HR              Heart Rate                 15                3        Numeric         Between 40 and 100
    SBP             Systolic Blood Pressure    18                3        Numeric         Between 80 and 200
    DBP             Diastolic Blood Pressure   21                3        Numeric         Between 60 and 120
    DX              Diagnosis Code             24                3        Character       1 to 3 digit numeral
    AE              Adverse Event              27                1        Character       '0' or '1'

PATNO is the identification variable.

Initial examination of the data

- After the data are read in, simple summary reports are run (eg10-1).

    GENDER   Frequency
    2        1
    F        12
    M        14
    X        1
    f        2
    Frequency Missing = 1

    AE       Frequency
    0        19
    1        10
    A        1
    Frequency Missing = 1

    DX       Frequency
    1        7
    2        2
    3        3
    4        3
    5        3
    6        1
    7        2
    X        2
    Frequency Missing = 8

The MEANS Procedure

    Variable   Label                      Mean          Minimum      Median        Maximum
    SBP        Systolic Blood Pressure    144.5185185   20.0000000   122.0000000   400.0000000
    DBP        Diastolic Blood Pressure   88.0714286    8.0000000    81.0000000    200.0000000
    HR         Heart Rate                 107.3928571   10.0000000   74.0000000    900.0000000

- What do you find about the data?
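The code book's column positions and valid values can be turned into a record-by-record check. The following is a minimal sketch in Python, not the SAS program (PATIENTS.SAS) that the course uses. The names `parse_record` and `check_record` are hypothetical, the numeric ranges are assumed to be inclusive, and the sample records are a handful of PATIENTS.TXT lines re-spaced by hand to match the code book's column layout, so their exact spacing is an assumption.

```python
from datetime import datetime

# Column layout from the code book: (name, 1-based start column, length).
LAYOUT = [
    ("PATNO", 1, 3), ("GENDER", 4, 1), ("VISIT", 5, 10), ("HR", 15, 3),
    ("SBP", 18, 3), ("DBP", 21, 3), ("DX", 24, 3), ("AE", 27, 1),
]
# Valid numeric ranges from the code book (assumed inclusive at both ends).
RANGES = {"HR": (40, 100), "SBP": (80, 200), "DBP": (60, 120)}

# Sample records, re-spaced to the code book columns (spacing is an assumption).
SAMPLE = [
    "001M11/11/1998 88140 80  10",
    "003X10/21/1998 68190100  31",
    "004F01/01/1999101200120  5A",
    "011M13/13/1998 68300 20  41",
    "029M05/15/1998" + " " * 11 + "41",
]


def parse_record(line):
    """Slice one fixed-width record into a dict of whitespace-stripped strings."""
    padded = line.ljust(27)  # short lines are treated as having trailing blanks
    return {name: padded[start - 1:start - 1 + length].strip()
            for name, start, length in LAYOUT}


def is_numeral(text):
    """True if text is non-empty and consists only of the digits 0-9."""
    return text != "" and all(ch in "0123456789" for ch in text)


def is_valid_date(text):
    """True if text is a real calendar date written as MM/DD/YYYY."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False


def check_record(rec):
    """List the fields that are blank and the fields that break the code book."""
    missing = [name for name, _, _ in LAYOUT if rec[name] == ""]
    invalid = []
    for name, _, _ in LAYOUT:
        value = rec[name]
        if value == "":
            continue  # blanks are reported under "missing", not "invalid"
        if name in ("PATNO", "DX"):
            ok = is_numeral(value)
        elif name == "GENDER":
            ok = value in ("M", "F")
        elif name == "VISIT":
            ok = is_valid_date(value)
        elif name == "AE":
            ok = value in ("0", "1")
        else:
            low, high = RANGES[name]
            ok = is_numeral(value) and low <= int(value) <= high
        if not ok:
            invalid.append(name)
    return {"missing": missing, "invalid": invalid}


for line in SAMPLE:
    rec = parse_record(line)
    print(rec["PATNO"], check_record(rec))
```

Reporting missing and invalid fields separately mirrors the chapter outline, which treats validity checks and checks for missing values as distinct steps.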