STATS 507
Data Analysis in Python
Lecture 10: Basics of pandas

Pandas
Open-source library of data analysis tools
Low-level ops implemented in Cython (C+Python=Cython, often faster)
Database-like structures, largely similar to those available in R
Optimized for most common operations
E.g., vectorized operations, operations on rows of a table
From the documentation: “pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.”

Basic Data Structures
Series: represents a one-dimensional labeled array
Labeled just means that there is an index into the array
Supports vectorized operations
DataFrame: table of rows, with labeled columns
Like a spreadsheet or an R data frame
Supports numpy ufuncs (provided the data are numeric)
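
As a rough illustration of the two structures above, here is a minimal sketch. The values, index labels, and column names (`height`, `width`) are made up for the example and are not from the lecture.

```python
import numpy as np
import pandas as pd

# A Series: one-dimensional values plus an index (the labels).
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# A DataFrame: a table whose columns are labeled.
# (Column names here are hypothetical.)
df = pd.DataFrame({"height": [1.0, 4.0], "width": [9.0, 16.0]})

# Vectorized arithmetic applies elementwise and keeps the labels.
doubled = s * 2

# numpy ufuncs accept numeric Series and DataFrames and return
# objects of the same kind.
root_s = np.sqrt(s)
root_df = np.sqrt(df)

print(doubled)
print(root_df)
```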

pandas Series
By default, indices are integers, starting from 0, just like you’re used to. But we can specify a different set of indices if we so choose.
Can create a pandas Series from any array-like structure (e.g., numpy array, Python list, dict). pandas tries to infer the data type (dtype) automatically.
Warning: providing too few or too many indices raises a ValueError.
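
A small sketch of these points, assuming made-up values and labels (the names `pi`, `e`, `phi` are only for illustration):

```python
import numpy as np
import pandas as pd

# Default index: integers starting from 0.
default_idx = pd.Series([3.14, 2.72, 1.62])

# Same data with a user-specified index.
custom_idx = pd.Series([3.14, 2.72, 1.62], index=["pi", "e", "phi"])

# The dtype is inferred from the data that is passed in.
from_array = pd.Series(np.arange(3))   # integer dtype
print(default_idx.dtype, from_array.dtype)

# Three values but only two index labels: pandas raises a ValueError.
try:
    pd.Series([1, 2, 3], index=["x", "y"])
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)
```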

pandas Series
Can create a Series from a dictionary. Keys become indices.
Index ‘cthulu’ doesn’t appear in the dictionary, so pandas assigns it NaN, the standard “missing data” symbol.
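
The slide's original code is not reproduced here, so the following is only a sketch of the same idea with a hypothetical dictionary. Only the index label `cthulu` comes from the slide.

```python
import pandas as pd

# Hypothetical data; the keys and values are made up for illustration.
populations = {"dagon": 120, "hastur": 45}

# Keys become the index.
s = pd.Series(populations)

# Requesting an index label that is not a key in the dictionary
# yields NaN for that entry.
s2 = pd.Series(populations, index=["dagon", "hastur", "cthulu"])
print(s2)
```

Note that the presence of NaN makes pandas store `s2` as floating point, even though the dictionary values are integers.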

pandas Series
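Indexing works like you’re used to and supports slices, but not negative indexing (with the default integer index, -1 is looked up as a label, not a position).
Indexing a single element returns a scalar (in the slide's example, an object of type np.int64).
Slicing returns another pandas Series.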
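
A minimal sketch of this behavior, using made-up values. The use of `.iloc` for positional access is an addition that the slide does not mention.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# A single element comes back as a numpy scalar.
single = s[1]
print(type(single))

# A slice comes back as another Series.
piece = s[1:3]
print(type(piece))

# With the default integer index, -1 is treated as a label,
# and no such label exists, so this raises a KeyError.
try:
    s[-1]
    negative_error = None
except KeyError as err:
    negative_error = err

# Positional access from the end is available via .iloc
# (not covered on the slide).
last = s.iloc[-1]
```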

pandas Series
Caution: indices need not be unique in pandas Series. This will only cause an error if/when you perform an operation that requires unique indices.
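
One way this can play out, as a sketch with made-up data. `reindex` is chosen here as an example of an operation that requires unique indices; the slide may have used a different one.

```python
import pandas as pd

# Constructing a Series with a repeated label is allowed.
s = pd.Series([1, 2, 3], index=["a", "a", "b"])

# Looking up the repeated label returns every matching entry.
both = s["a"]
print(both)

# The index itself reports whether its labels are unique.
unique = s.index.is_unique

# Reindexing needs unique labels, so the error shows up only here.
try:
    s.reindex(["a", "b"])
    reindex_error = None
except ValueError as err:
    reindex_error = err
    print("ValueError:", err)
```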

pandas Series
Series objects are like np.ndarray objects, so they support all the same kinds of slice operations, but note that the indices come along with the slices.
Series objects even support most numpy functions that act on arrays.
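
A short sketch with made-up values, showing that slices and masks keep their original index labels and that numpy functions can be applied directly:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0])

# Slicing keeps the original index labels (2 and 3 here),
# rather than renumbering from 0.
tail = s[2:]
print(tail)

# Boolean masks work as they do for numpy arrays.
big = s[s > 3]

# numpy functions that act on arrays accept a Series.
roots = np.sqrt(s)   # elementwise, returns a Series
total = np.sum(s)    # reduces to a single number
```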

pandas Series
Series objects are dict-like, in that we can access and update entries via their keys.
Like a dictionary, accessing a non-existent key raises a KeyError.
Note: I cropped out a bunch of the error message, but you get the idea.
Not shown: Series also supports the in operator: x in s checks if x appears as an index of Series s.
Series also supports the dictionary get method.
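
The dict-like operations above can be sketched as follows. The labels and values are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Access and update entries via their keys.
old_a = s["a"]
s["a"] = 100

# `in` checks the index labels, not the stored values.
has_b = "b" in s      # "b" is an index label
has_two = 2 in s      # 2 is a value, not a label

# A non-existent key raises a KeyError, as with a dict.
try:
    s["z"]
    missing_error = None
except KeyError as err:
    missing_error = err

# get returns a default instead of raising.
fallback = s.get("z", 0)
```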