
# W5 Data Cleaning and Preparation - Jupyter Notebook


10/26/2019 W5 Data Cleaning and Preparation - Jupyter Notebook localhost:8888/notebooks/W5 Data Cleaning and Preparation.ipynb

## DO NOT ASSUME GOOD QUALITY OF YOUR DATA

Data preparation takes up more than 80% of an analyst's time. Data may be in the wrong format, of poor quality, or both. pandas provides high-level tools to manipulate data into the right form.

### Handling missing data

For numeric data, pandas uses the floating-point value NaN. It is called a sentinel value and can be easily detected.

In [1]:

```python
import pandas as pd
import numpy as np

string_data = pd.Series(['a', 'b', np.nan, 'd'])
string_data
```

Out[1]:

```
0      a
1      b
2    NaN
3      d
dtype: object
```

In [2]:

```python
string_data.isnull()
```

Out[2]:

```
0    False
1    False
2     True
3    False
dtype: bool
```

NaN is equivalent to NA in the R language. NA may be either data that does not exist or data that was not observed, also known as missing data.

Analyse the missing data to identify data collection problems or potential bias caused by the missing values. For example, if very rich people decline to provide their salary when salary information is collected, the estimated average salary of the population will be biased downward.

In [3]:

```python
string_data[0] = None
string_data
```

Out[3]:

```
0    None
1       b
2     NaN
3       d
dtype: object
```
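The points above about numeric data and biased averages can be sketched with a small, made-up example. The salary figures and the names `true_salaries` and `reported` are assumptions for illustration and do not come from the notebook. The sketch shows that a `None` in numeric data is stored as NaN, that it can be counted with `isnull()`, and that `mean()` skips it, which is where the downward bias comes from.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: what the population actually earns.
true_salaries = pd.Series([30_000, 40_000, 50_000, 60_000, 1_000_000],
                          dtype="float64")

# Hypothetical survey result: the highest earner declined to answer.
reported = pd.Series([30_000, 40_000, 50_000, 60_000, None])

# In numeric data, None is stored as the floating-point sentinel NaN.
print(reported.dtype)           # float64
print(reported.isnull().sum())  # 1

# mean() skips NaN by default, so the missing earner is ignored.
print(reported.mean())       # 45000.0
print(true_salaries.mean())  # 236000.0
```

Counting the missing entries first, as done here with `isnull().sum()`, is a cheap way to notice that an average may not represent the whole population.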
In [4]:

```python
string_data[3] = np.nan
string_data
```

Out[4]:

```
0    None
1       b
2     NaN
3     NaN
dtype: object
```

What is the difference between NaN and None? np.nan allows vectorized operations because it is a float value, while None, by definition, forces the object dtype, which basically disables all efficiency in NumPy. So repeat three times fast: object == bad, float == good.

In [5]:

```python
string_data.isnull()
```

Out[5]:

```
0     True
1    False
2     True
3     True
dtype: bool
```

In [6]:

```python
string_data.isna()  # exactly the same as isnull()
```

Out[6]:

```
0     True
1    False
2     True
3     True
dtype: bool
```

### Filtering out missing data

We always have the option to filter out missing data by hand using `isnull` and boolean indexing. The `dropna` function can be pretty useful too. For a Series, it returns the Series with only the non-null data and index values.

For a DataFrame, it is a bit more complex. By default, `dropna` drops any row that contains even one missing value. Passing `how='all'` drops only the rows in which every value is NA. To drop columns instead, pass `axis=1`.
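As a minimal sketch of the "by hand" approach, the snippet below rebuilds the values that `string_data` holds after In [4] and filters them with `isnull` and boolean indexing. The names `by_hand` and `dropped` are illustrative and not from the notebook. Both `None` and NaN count as missing, so only one entry remains.

```python
import numpy as np
import pandas as pd

# Same values the notebook's string_data holds after In [4].
string_data = pd.Series([None, "b", np.nan, np.nan])

# Filtering "by hand": keep the entries where isnull() is False.
by_hand = string_data[~string_data.isnull()]

# dropna() produces the same result in a single call.
dropped = string_data.dropna()

print(by_hand)
print(by_hand.equals(dropped))  # True
```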
In [7]:

```python
from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
```

Out[7]:

```
0    1.0
2    3.5
4    7.0
dtype: float64
```

In [8]:

```python
data[data.notnull()]
```

Out[8]:

```
0    1.0
2    3.5
4    7.0
dtype: float64
```

In [9]:

```python
data = pd.DataFrame([[1., 6.5, 3.],
                     [1., NA, NA],
                     [NA, NA, NA],
                     [NA, 6.5, 3.]])
cleaned = data.dropna()
data
```

Out[9]:

```
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
```

In [10]:

```python
cleaned
```

Out[10]:

```
     0    1    2
0  1.0  6.5  3.0
```

In [11]:

```python
data.dropna(how='all')
```

Out[11]:

```
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
```
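The cells above drop rows. The column case mentioned earlier (`axis=1`) can be sketched the same way. The extra all-missing column labelled `3` is a hypothetical addition made only so that there is a column to drop; it is not taken from the notebook cells shown here.

```python
import numpy as np
import pandas as pd

# The notebook's frame, rebuilt so this block runs on its own.
data = pd.DataFrame([[1.0, 6.5, 3.0],
                     [1.0, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3.0]])

# Hypothetical extra column in which every value is missing.
data[3] = np.nan

# Drop only the columns in which every value is missing.
print(data.dropna(axis=1, how="all"))

# With the default how="any", each column contains at least one NaN
# (row 2 is entirely missing), so no columns survive.
print(data.dropna(axis=1).shape)  # (4, 0)
```

The second call is a reminder that the default `how='any'` can be too aggressive when a single fully missing row touches every column.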
Another DataFrame cleaning method concerns time series data.
