In this lesson I'm just going to describe the main libraries that we'll see when we work with data in Python.
Numpy
Numpy is the first library we work with. By convention, it's imported with import numpy as np. Numpy really provides two things to our workflow:
1. Math that goes faster than unadorned Python could do it, which is important when you're doing statistics, because under the hood computational stats can take a lot of calculations.
2. Convenient data structures, as well as functions that operate on them.
Let's talk about number 2 for a minute. Numpy provides special numeric types, which you don't need to worry about, but also the array, which is like a Python list with special properties that make it more useful for mathematical operations. The other main Python data libraries tend to assume that you're working with arrays, or something that can be converted to arrays. Lists can be pretty seamlessly converted to arrays, so that's ok.
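Here's a minimal sketch of that list-to-array conversion, going both ways. The variable names and numbers are made up for illustration.

```python
import numpy as np

# An ordinary Python list (made-up values)
plain_list = [1.5, 2.5, 3.5]

# np.array() builds a Numpy array from the list
as_array = np.array(plain_list)
print(type(as_array))

# .tolist() converts the array back into an ordinary list
back_to_list = as_array.tolist()
print(back_to_list)
```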
The other great part of number 2 is that numpy lets you do math on entire arrays as well as individual numbers. For example:
import numpy as np
mynums = np.array([1, 2, 3, 4, 5])
print(mynums * 2)
See what I did there? I just multiplied the entire array by 2 in one go. You couldn't do that with ordinary Python lists, where multiplying by 2 just repeats the list:
print([1, 2, 3, 4, 5] * 2)
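To spell out the difference, here's a small sketch comparing the two. With a list you'd need a loop or a list comprehension to double each number, while an array does the arithmetic on every element for you. The variable names are just for illustration.

```python
import numpy as np

nums_list = [1, 2, 3, 4, 5]

# With a list, "* 2" repeats the list, so doubling each number takes a comprehension
doubled_list = [n * 2 for n in nums_list]

# With an array, arithmetic applies to every element
nums_array = np.array(nums_list)
doubled_array = nums_array * 2

# Two arrays of the same length are combined element by element
summed = nums_array + doubled_array

print(doubled_list)
print(doubled_array)
print(summed)
```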
Numpy also provides a lot of convenience functions, for example for calculating the mean and standard deviation. We won't go through them here; I'll just introduce them as they come up in other lessons. If you're curious about the menu of options, however, check out the Numpy documentation.
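As a quick taste, here's a minimal sketch of the mean and standard deviation functions on some made-up numbers. One thing worth knowing: np.std divides by n by default, and passing ddof=1 gives the sample standard deviation instead.

```python
import numpy as np

# Made-up numbers, purely for illustration
scores = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_score = np.mean(scores)
std_score = np.std(scores)           # population standard deviation (ddof=0 is the default)
sample_std = np.std(scores, ddof=1)  # sample standard deviation (divides by n - 1)

print(mean_score, std_score, sample_std)
```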
Numpy is a huge package with tons more stuff, and also really complicated features to handle things like multidimensional data. But you won't need to worry about that stuff.
Pandas
Pandas is a library that helps us work with structured data (like Excel-spreadsheet-type data), which is what we'll be focusing on for the statistics work in this course. By convention, we import pandas as pd.
Pandas is the library you'll use to read in data from things like CSV files and Excel spreadsheets. (Note: it's usually better to just export an Excel spreadsheet, or Google Docs, or whatever else, to CSV format and then ingest it in Python from there. Excel can be a bit of a monster to deal with.)
The Pandas data format is called a DataFrame. You can think of it as Python's version of a spreadsheet. Let's look at one. I'll just pull down a CSV of data from my own book to play with.
import pandas as pd
df = pd.read_csv("http://rulelaw.net/downloads/rol-scores.csv")
We can look at the first few rows of data with the head() method on a dataframe:
df.head()
Incidentally, I apologize for the fact that the headings of this table might not be properly aligned with the data below them on the website. I'm working on this problem. But you can look in the lessons github repo, and it'll be better formatted.
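If you want to try the read-and-peek steps without downloading anything, here's a minimal sketch that reads a tiny CSV from a string instead of a URL. The column names are borrowed from the data above, but the rows are made up and toy_df is just a name for illustration.

```python
import io
import pandas as pd

# A tiny made-up CSV, standing in for a file on disk or a URL
csv_text = """State,RoLScore
A,0.5
B,0.25
C,0.75
"""

# read_csv accepts file paths, URLs, and file-like objects such as StringIO
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.head(2))  # head(n) shows the first n rows; the default is 5
print(toy_df.shape)    # (number of rows, number of columns)
```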
You can treat a Pandas DataFrame kind of like a dictionary where the keys are the columns. For example:
df["State"]
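Here's a small sketch of that dictionary-like behavior on a toy DataFrame. The rows are made up, and toy_df is a hypothetical stand-in for the real data.

```python
import pandas as pd

# A toy DataFrame built from a dictionary; the values are made up
toy_df = pd.DataFrame({
    "State": ["A", "B", "C"],
    "RoLScore": [0.5, 0.25, 0.75],
})

print(list(toy_df.columns))  # the "keys" are the column names
print("State" in toy_df)     # membership checks the column names, like a dict

state_col = toy_df["State"]  # a single column comes back as a pandas Series
print(type(state_col))
```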
You can also create new columns by assigning things to them, often by applying mathematical transformations to other columns. For example, we could create a column in our current dataframe that does a bunch of silly math to another.
(Under the hood, Pandas columns use Numpy arrays with some extra juice on them, so we can do the same stuff we did before, like multiplying a whole column by something in one fell swoop.)
df["stupid math"] = (df["RoLScore"] / 2) + df["per_auto"]
df.head()
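To see exactly what that column math does, here's the same calculation as a sketch on a toy DataFrame with made-up numbers, small enough to check by hand. The toy_df name and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up values, using the same column names as the real data
toy_df = pd.DataFrame({
    "RoLScore": [0.5, 0.25, 0.75],
    "per_auto": [1.0, 2.0, 3.0],
})

# The division applies to every row, and the addition pairs the columns up row by row
toy_df["stupid math"] = (toy_df["RoLScore"] / 2) + toy_df["per_auto"]
print(toy_df)

# to_numpy() hands back the column's values as a Numpy array
raw_values = toy_df["per_auto"].to_numpy()
print(type(raw_values))
```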
We can also access subtables of a DataFrame by passing a list of columns.
df[["State", "RoLScore"]].head()
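The double brackets matter here. As a rough sketch on a made-up toy DataFrame: a single string key gives you one column (a Series), while a list of keys gives you a smaller DataFrame, even if the list has only one item.

```python
import pandas as pd

# Made-up values, purely for illustration
toy_df = pd.DataFrame({
    "State": ["A", "B", "C"],
    "RoLScore": [0.5, 0.25, 0.75],
    "per_auto": [1.0, 2.0, 3.0],
})

one_col = toy_df["State"]                   # string key: a Series
sub_table = toy_df[["State", "RoLScore"]]   # list of keys: a DataFrame
single_col_table = toy_df[["State"]]        # one-item list: still a DataFrame

print(type(one_col))
print(sub_table.shape)
print(type(single_col_table))
```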
There's lots more to do with Pandas as well. I've assigned an introduction from DataCamp for this week in chapter 2 of this lesson.
Matplotlib/Seaborn
Matplotlib is the Python library that handles data visualization. For the most part, however, we won't be working with matplotlib directly. It has a really bad API. Like, terrible.
Instead, we'll be using seaborn. Again, the convention among Python data people is to import it using a short name: import seaborn as sns
Seaborn provides us with some very fancy and easy-to-use plots. It can handle Pandas columns, Numpy arrays, ordinary Python lists, you name it.
I won't show more than one example here because there's another lesson covering several visualizations, but check out the official seaborn example gallery for the cool stuff you can do.
Also, before you can get plots to show up in Jupyter notebooks, you probably have to do %matplotlib inline to tell the notebook to render plots within the webpage.
import seaborn as sns
%matplotlib inline
sns.pairplot(df[["RoLScore", "pol_plur", "free_expr", "assoc_org", "per_auto"]])
Statsmodels and/or Scipy
Statsmodels and Scipy are libraries that provide a bunch of statistical functionality. For example, if you want to do a hypothesis test, you'd go there. Statsmodels has more robust functionality, more or less, but also pretty terrible documentation, weird rules about what you have to import, etc. I'll probably grab bits and pieces of each library as we go forward into our stats section.
Here's an example from statsmodels. Don't worry too much about it for now.
import statsmodels.formula.api as smf
regression = smf.ols('RoLScore ~ assoc_org + per_auto + hprop', data=df).fit()
print(regression.summary())
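And for comparison, here's a minimal sketch of a hypothesis test from Scipy: a two-sample t-test using scipy.stats.ttest_ind. The two groups are made-up numbers, not data from the book, and the test here is just one I picked for illustration.

```python
import numpy as np
from scipy import stats

# Two made-up groups of measurements
group_a = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.3])
group_b = np.array([6.2, 6.8, 7.1, 6.5, 7.0, 6.6])

# Independent two-sample t-test (assumes equal variances by default)
result = stats.ttest_ind(group_a, group_b)

print(result.statistic)  # the t statistic
print(result.pvalue)     # the two-sided p-value
```

Like the regression above, don't worry about the details yet. The point is just that the test is a single function call once the data is in arrays.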