W5 Data Cleaning and Preparation - Jupyter Notebook


10/26/2019 W5 Data Cleaning and Preparation - Jupyter Notebook (localhost:8888/notebooks/W5 Data Cleaning and Preparation.ipynb)

DO NOT ASSUME GOOD QUALITY OF YOUR DATA

Data preparation can take up more than 80% of an analyst's time. Data may be in the wrong format, of poor quality, or both. pandas provides high-level tools to manipulate data into the right form.

Handling missing data

For numeric data, pandas uses the floating-point value NaN to mark missing values. It is called a sentinel value and can be easily detected.

In [1]:
    import pandas as pd
    import numpy as np
    string_data = pd.Series(['a', 'b', np.nan, 'd'])
    string_data

Out[1]:
    0      a
    1      b
    2    NaN
    3      d
    dtype: object

In [2]:
    string_data.isnull()

Out[2]:
    0    False
    1    False
    2     True
    3    False
    dtype: bool

NaN plays the same role as NA in the R language. NA may be either data that does not exist or data that was not observed, also known as missing data.

Analyse the missing data to identify data collection problems or potential bias caused by the missing values. For example, if very rich people decline to report their salary in a survey, the average salary computed from the responses will be biased downward.

In [3]:
    string_data[0] = None
    string_data

Out[3]:
    0    None
    1       b
    2     NaN
    3       d
    dtype: object
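The advice above to analyse missing data can be sketched as follows. This is a minimal illustration, not part of the original notebook: the `survey` DataFrame, its column names, and its values are all invented. It counts missing values per column and then compares respondents who withheld their salary with those who reported it.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; names and values are invented for illustration.
survey = pd.DataFrame({
    "age": [25, 41, 37, 58, 46],
    "salary": [32000.0, 54000.0, np.nan, np.nan, 61000.0],
})

# Number and fraction of missing values in each column.
missing_count = survey.isna().sum()
missing_share = survey.isna().mean()

# Mean age of respondents, grouped by whether their salary is missing.
# A clear difference between the two groups hints that the values are
# not missing at random, which can bias summary statistics.
age_by_missing = survey.groupby(survey["salary"].isna())["age"].mean()

print(missing_count)
print(missing_share)
print(age_by_missing)
```

If the two groups differ noticeably, as they do in this made-up data, dropping the incomplete rows would change what population the remaining rows describe.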
In [4]:
    string_data[3] = np.nan
    string_data

Out[4]:
    0    None
    1       b
    2     NaN
    3     NaN
    dtype: object

What is the difference between NaN and None?

np.nan is a float value, so it allows vectorized operations. None is a Python object, so an array that contains it is forced to object dtype, which gives up most of NumPy's efficiency. So repeat 3 times fast: object==bad, float==good

In [5]:
    string_data.isnull()

Out[5]:
    0     True
    1    False
    2     True
    3     True
    dtype: bool

In [6]:
    string_data.isna()  # exactly the same as isnull()

Out[6]:
    0     True
    1    False
    2     True
    3     True
    dtype: bool

Filtering 'Out' Missing Data

We always have the option to filter out missing data by hand using 'isnull' and boolean indexing. The 'dropna' function is often more convenient. For a Series, it returns the Series with only the non-null data and their index values.

For a DataFrame, it is a bit more complex. By default, dropna drops any row that contains even one missing value. Passing "how='all'" drops only the rows in which every value is NA. To drop columns instead of rows, pass 'axis=1'.
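The NaN versus None distinction described above can be checked directly. The snippet below is a small sketch that is not taken from the original notebook, and the variable names are illustrative. It shows that a NumPy array containing None falls back to object dtype, that pandas converts None to NaN in a numeric Series, and that isna() treats both as missing.

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float.
nan_is_float = isinstance(np.nan, float)

# A NumPy array that contains None falls back to object dtype,
# while the same data with np.nan stays float64.
as_object = np.array([1.0, None, 3.5])
as_float = np.array([1.0, np.nan, 3.5])
print(as_object.dtype, as_float.dtype)

# NaN-aware reductions work directly on the float array.
total = np.nansum(as_float)
print(total)

# In a numeric Series, pandas converts None to NaN and keeps float64.
numeric = pd.Series([1.0, None, 3.5])
print(numeric.dtype)

# isna() flags both None and NaN as missing.
mixed = pd.Series(["a", None, np.nan])
print(mixed.isna().tolist())
```

In practice this means that None in numeric data usually ends up as NaN once the data is in pandas, so both can be handled with the same isna() and dropna() calls.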
In [7]:
    from numpy import nan as NA
    data = pd.Series([1, NA, 3.5, NA, 7])
    data.dropna()

Out[7]:
    0    1.0
    2    3.5
    4    7.0
    dtype: float64

In [8]:
    data[data.notnull()]

Out[8]:
    0    1.0
    2    3.5
    4    7.0
    dtype: float64

In [9]:
    data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                         [NA, NA, NA], [NA, 6.5, 3.]])
    cleaned = data.dropna()
    data

Out[9]:
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0

In [10]:
    cleaned

Out[10]:
         0    1    2
    0  1.0  6.5  3.0

In [11]:
    data.dropna(how='all')

Out[11]:
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    3  NaN  6.5  3.0
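The cells above drop rows. The column case mentioned earlier ('axis=1') might look like the sketch below. It is not taken from the original notebook: the extra all-NaN column with label 4 is a hypothetical addition made only so that there is something to drop.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1.0, 6.5, 3.0],
                     [1.0, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3.0]])

# Hypothetical extra column (label 4) in which every value is missing.
data[4] = np.nan

# axis=1 with how='all' drops only the columns that are entirely NaN.
no_empty_columns = data.dropna(axis=1, how='all')
print(no_empty_columns)

# axis=1 with the default how='any' drops every column that contains
# at least one NaN. Here that is all of them, because row 2 is all NaN.
no_missing_columns = data.dropna(axis=1)
print(no_missing_columns.shape)
```

The second call shows why the default can be too aggressive: a single fully empty row is enough to remove every column.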
In [12]:

In [13]:

Another DataFrame cleaning method concerns time series data.