You don't need to be a data scientist to use Pandas for some basic analysis.
Traditionally, people who program in Python use the data types that
come with the language, such as integers, strings, lists, tuples and
dictionaries. Sure, you can create objects in Python, but those
objects typically are built out of those fundamental data structures.
If you're a data scientist working with Pandas though, most of your
time is spent with NumPy. A NumPy array might feel like a Python data
structure, but it acts differently in many ways. That's not just
because its operations are vectorized, applying to a whole array at
once, but also because the underlying data is actually a C-style
array. This makes NumPy extremely fast and efficient, consuming far
less memory for a given array of numbers than the equivalent Python
objects would.
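To make the memory claim concrete, here is a rough sketch that compares a Python list of integers with a NumPy array holding the same values. The variable names and the array size are arbitrary choices for illustration, and the list figure is only an estimate, because it simply adds the size of the list to the size of each integer object it points to:

```python
import sys
import numpy as np

n = 100_000  # arbitrary size, chosen only for illustration

py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list holds pointers to separate int objects, so count both
# the list itself and every object it refers to (a rough estimate).
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# A NumPy array stores raw 8-byte integers in one contiguous buffer.
array_bytes = np_array.nbytes

print(f"list:  {list_bytes:,} bytes")
print(f"array: {array_bytes:,} bytes")

# Vectorized operation: one expression is applied to every element.
doubled = np_array * 2
```

The exact list figure depends on the Python version and platform, but the array's data buffer is always `n` times the element size.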
The thing is, NumPy is designed to be fast, but it's also a bit
low-level for some people. To get more functionality and a more
flexible interface, many people use Pandas, a Python package that
provides two basic wrappers around NumPy arrays: one-dimensional
Series objects and two-dimensional DataFrame objects.
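As a small illustration of those two wrappers, the following sketch builds a Series and a data frame from ordinary Python data and then pulls the underlying NumPy array back out. The labels and values are made up for the example:

```python
import numpy as np
import pandas as pd

# A one-dimensional Series with a custom (hypothetical) index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A two-dimensional data frame; each column behaves like a Series
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# to_numpy() returns the values as a NumPy array
values = s.to_numpy()

print(type(values))        # the NumPy ndarray class
print(s["b"])              # look up a value by its index label
print(df.shape)            # (rows, columns)
print(df["y"].to_numpy())  # one column as a NumPy array
```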
I often describe Pandas as "Excel within Python", in that you can
perform all sorts of calculations as well as sort data, search
through it and plot it.
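Here is a minimal sketch of that spreadsheet-like workflow, using a small invented table of items, prices and quantities. Plotting is left out because it also requires matplotlib to be installed:

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
df = pd.DataFrame({
    "item": ["apple", "banana", "cherry", "date"],
    "price": [1.20, 0.50, 3.00, 2.25],
    "quantity": [10, 24, 5, 8],
})

# Calculate: a new column computed from two existing ones
df["total"] = df["price"] * df["quantity"]

# Sort: rows ordered by the new column, largest first
by_total = df.sort_values("total", ascending=False)

# Search: keep only the rows matching a condition
expensive = df[df["price"] > 1.00]

# Summarize: the average of a column
average_price = df["price"].mean()

print(by_total)
print(expensive)
print(average_price)
```

Each step works on whole columns at once, much as a formula filled down a spreadsheet column would.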
For all of these reasons, it's no surprise that Pandas is a darling of
the data science community. But here's the thing: you don't need to
be a data scientist to enjoy Pandas. It has a lot of excellent
functionality that's good for Python developers who otherwise
would spend their time wrestling with lists, tuples and dictionaries.
So in this article, I describe some basic analysis that everyone can do
with Pandas, regardless of whether you're a data scientist. If you
ever work with CSV files (and you probably do), I definitely
recommend thinking about using Pandas to open, read, analyze and even
write to them. And although I don't cover it in this article, Pandas
handles JSON and Excel very well too.
Creating Data Frames
Although it's possible to create a data frame from scratch using Python
data structures or NumPy arrays, it's more common in my experience to
do so from a file. Fortunately, Pandas can load data from a variety of file formats.
Before you can do anything with Pandas, you have to load it. In a
Jupyter notebook, do:
import pandas as pd
Python itself comes with a csv module that knows how to handle
files in CSV (comma-separated values) format, but then you need to
iterate over the file and do something with each of its rows.
I often find it easier to use Pandas to work with
such files. For example, here's a CSV file:
You can turn this into a data frame with:
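As a minimal sketch, assuming a small hypothetical CSV (the contents, column names and values below are invented for illustration and are not the article's data), the file and the call that loads it might look like this. The text is wrapped in `io.StringIO` only so the example is self-contained; with a real file you would pass its path to `pd.read_csv` instead:

```python
import io
import pandas as pd

# Hypothetical CSV contents; the first line holds the column names
csv_text = """name,department,salary
Alice,Engineering,85000
Bob,Marketing,62000
Carol,Engineering,91000
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)         # the whole table, with an automatic integer index
print(df.dtypes)  # column types inferred from the data
print(df["salary"].mean())
```

Pandas uses the header line for column names and infers each column's type, so the salaries can be used in calculations without any explicit conversion.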