10/26/2019 W5 Data Cleaning and Preparation - Jupyter Notebook localhost:8888/notebooks/W5 Data Cleaning and Preparation.ipynb

DO NOT ASSUME GOOD QUALITY OF YOUR DATA

Data preparation often takes up more than 80% of an analyst's time. Data may be in the wrong format, of bad quality, or both. pandas provides high-level tools to manipulate data into the right form.

Handling missing data

For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. It is called a sentinel value and can be easily detected.

In [1]:
    import pandas as pd
    import numpy as np
    string_data = pd.Series(['a', 'b', np.nan, 'd'])
    string_data

Out[1]:
    0      a
    1      b
    2    NaN
    3      d
    dtype: object

In [2]:
    string_data.isnull()

Out[2]:
    0    False
    1    False
    2     True
    3    False
    dtype: bool

NaN is equivalent to NA in the R language. NA may be either data that does not exist or data that exists but was not observed, that is, missing data.

Analyse the missing data to identify data collection problems or potential bias caused by the missing values. For example, if very rich people decline to report their salaries when salary information is collected, the estimated average salary of the population will be biased downward.

In [3]:
    string_data[0] = None
    string_data

Out[3]:
    0    None
    1       b
    2     NaN
    3       d
    dtype: object
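The cell above stores None next to NaN in an object Series. As a minimal sketch of how the two values differ at the NumPy level, the example below builds two small arrays (`float_arr` and `object_arr` are illustrative names and made-up values, not part of the notebook) and compares their dtypes and behaviour under a vectorized sum.

```python
import numpy as np

# Hypothetical example arrays (not from the notebook).
float_arr = np.array([1, np.nan, 3])   # NaN is a float, so the array is float64
object_arr = np.array([1, None, 3])    # None forces the generic object dtype

print(float_arr.dtype)   # float64
print(object_arr.dtype)  # object

# Vectorized arithmetic works on the float array; NaN propagates through sum(),
# and nansum() skips it.
nan_propagates = bool(np.isnan(float_arr.sum()))
sum_ignoring_nan = float(np.nansum(float_arr))
print(nan_propagates)    # True
print(sum_ignoring_nan)  # 4.0

# The object array falls back to Python-level addition, which fails on None.
try:
    object_arr.sum()
    object_sum_failed = False
except TypeError:
    object_sum_failed = True
print(object_sum_failed)  # True
```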
In [4]:
    string_data[3] = np.nan
    string_data

Out[4]:
    0    None
    1       b
    2     NaN
    3     NaN
    dtype: object

What is the difference between NaN and None? np.nan is a float value, so it allows vectorized operations. None, by definition, forces the object dtype, which basically disables all of NumPy's efficiency. So repeat 3 times fast: object==bad, float==good.

In [5]:
    string_data.isnull()

Out[5]:
    0     True
    1    False
    2     True
    3     True
    dtype: bool

In [6]:
    string_data.isna()  # isna() is exactly the same as isnull()

Out[6]:
    0     True
    1    False
    2     True
    3     True
    dtype: bool

Filtering Out Missing Data

We always have the option to filter out missing data by hand using isnull and boolean indexing. The dropna function can be pretty useful too. For a Series, it returns the Series with only the non-null data and their index values.

For a DataFrame, it is a bit more complex. By default, dropna drops any row that contains even one missing value. Passing how='all' drops only the rows in which every value is NA. To drop columns instead of rows, pass axis=1.
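The column-wise options described above can be sketched as follows. The DataFrame `frame` is a hypothetical example chosen so that one column is entirely missing and another is only partly missing; it is not data from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 1 is entirely missing, column 2 is partly missing.
frame = pd.DataFrame({0: [1.0, 2.0, 3.0],
                      1: [np.nan, np.nan, np.nan],
                      2: [4.0, np.nan, 6.0]})

# axis=1 drops every column that contains at least one missing value.
any_missing_dropped = frame.dropna(axis=1)

# axis=1 combined with how='all' drops only columns that are entirely missing.
all_missing_dropped = frame.dropna(axis=1, how='all')

print(list(any_missing_dropped.columns))  # [0]
print(list(all_missing_dropped.columns))  # [0, 2]
```

Neither call modifies `frame`; dropna returns a new object unless asked to work in place.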
In [7]:
    from numpy import nan as NA
    data = pd.Series([1, NA, 3.5, NA, 7])
    data.dropna()

Out[7]:
    0    1.0
    2    3.5
    4    7.0
    dtype: float64

In [8]:
    data[data.notnull()]

Out[8]:
    0    1.0
    2    3.5
    4    7.0
    dtype: float64

In [9]:
    data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                         [NA, NA, NA], [NA, 6.5, 3.]])
    cleaned = data.dropna()
    data

Out[9]:
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0

In [10]:
    cleaned

Out[10]:
         0    1    2
    0  1.0  6.5  3.0

In [11]:
    data.dropna(how='all')

Out[11]:
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    3  NaN  6.5  3.0
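For comparison, the "by hand" route with notnull and boolean indexing can reproduce both dropna results on the same DataFrame. This is only a sketch; the names `complete_rows` and `non_empty_rows` are illustrative and do not appear in the notebook.

```python
import pandas as pd
from numpy import nan as NA

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

# Keep rows in which every value is present: matches data.dropna().
complete_rows = data[data.notnull().all(axis=1)]

# Keep rows in which at least one value is present: matches data.dropna(how='all').
non_empty_rows = data[data.notnull().any(axis=1)]

print(complete_rows.equals(data.dropna()))            # True
print(non_empty_rows.equals(data.dropna(how='all')))  # True
```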
In [12]:

In [13]:

Another DataFrame cleaning method concerns time series data.

