Exploratory Data Analytics

What is EDA?
EDA stands for Exploratory Data Analysis
Used to analyze a dataset's features/attributes and summarize its key characteristics
What can the data tell us quickly, so that we can form some hypotheses?

What key characteristics?
5-number summary, i.e. min, 25th percentile, median, 75th percentile and max
Other basic statistics like the average and standard deviation
Understand how the data is distributed over various parameters
Data distribution is presented visually using graphs and charts

What will we do in this EDA exercise?
Number of movies, ratings and users
Genre distribution as a pie chart
5-number summary of the rating attribute
Rating distribution as a histogram
Top ranked movies
Find awesome masala movies to watch

Python Basics
Importing numpy, pandas, matplotlib and seaborn in Python
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
Fitting a straight line with scipy (df is a DataFrame with height and weight columns)
    from scipy import stats
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['height'], df['weight'])
Result: slope = 16.783524424282902, intercept = -37.45428562014031
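To make the 5-number summary and the linregress call concrete, here is a minimal sketch on a small made-up table. The height and weight values below are hypothetical and are not the data behind the slope and intercept quoted above; the %matplotlib inline line is left out because it only works inside a Jupyter notebook.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "height": [1.5, 1.6, 1.7, 1.8, 1.9],
    "weight": [50.0, 59.0, 65.0, 75.0, 81.0],
})

# 5-number summary: min, 25th percentile, median, 75th percentile, max.
five_num = df["weight"].quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_num)

# Other basic statistics: average and (sample) standard deviation.
mean_weight = df["weight"].mean()
std_weight = df["weight"].std()
print(mean_weight, std_weight)

# Least-squares line weight = slope * height + intercept.
slope, intercept, r_value, p_value, std_err = stats.linregress(
    df["height"], df["weight"]
)
print(slope, intercept)
```

df["weight"].describe() reports the same five numbers together with the count, mean and standard deviation in a single call, which is often the quicker option during exploration.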
EDA using Python
We will work on our Movielens dataset using the "Pandas" package. Pandas makes working with tabular data very easy, as we will see.
    import pandas as pd
Read the movies CSV file (13_Movies.csv) and create a Pandas DataFrame called movies_df
    movies_df = pd.read_csv('13_Movies.csv')
Now let's peek into this DataFrame using its head function
    movies_df.head()
Now let's look at its shape, i.e., the number of rows and the number of columns in the DataFrame
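The loading steps above can be tried without the CSV file at hand. The sketch below reads a few rows from an in-memory string instead of 13_Movies.csv. The column names (movieId, title, genres) and the three rows are assumptions made for illustration and may not match the actual file.

```python
import io

import pandas as pd

# Hypothetical stand-in for 13_Movies.csv; columns and rows are assumed.
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Heat (1995),Action|Crime|Thriller
"""

# read_csv accepts a file path or any file-like object.
movies_df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 by default).
print(movies_df.head(2))

# shape is an attribute, not a method: a (rows, columns) tuple.
n_rows, n_cols = movies_df.shape
print(n_rows, n_cols)
```

With the real file, only the read_csv argument changes back to the file path; head and shape are used the same way.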

