Here is a combination of some of the code Sam kindly showed us in class, plus the visualizations I showed you, for our Simpson's paradox example on 3/25/19.
Here are a few additional FYIs:
The source of the underlying dataset is an article entitled "Simpson’s Paradox: A Data Set and Discrimination Case Study" in the Journal of Statistics Education, Volume 22, Number 1 (2014), by Stanley A. Taylor and Amy E. Mickel.
Taylor and Mickel talk about pivot tables as a good solution for looking at these data in their article. That's a slightly more powerful version of some of the groupby code Sam showed us. For a nice explanation of how to do pivot tables in Pandas, see this web page.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
import numpy as np
%matplotlib inline
SECRET_PASSWORD = "INSERT PASSWORD FOR CLASS SERVER HERE"
endpoint = "https://gobbledygook.herokuapp.com/data?file={}&password={}".format("mickel.csv", SECRET_PASSWORD)
df = pd.read_csv(endpoint)
df.head()
Let's look at the (good) choices that Sam made for poking around in these data.
# begin sam's code
df.Ethnicity.unique()
df.groupby("Ethnicity")["Expenditures"].mean()
df.groupby("Age Cohort")["Expenditures"].mean()
df.groupby("Gender")["Expenditures"].mean()
df.groupby(["Age Cohort", "Ethnicity"])["Gender"].count()
age_buckets = df.groupby(["Age Cohort"])["Gender"].count()
df.groupby(["Age Cohort", "Ethnicity"])["Gender"].count() / age_buckets * 100
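The within-cohort percentages computed above can also be produced in one step with `pd.crosstab`. This is a minimal sketch on a tiny made-up frame: the rows are invented for illustration and only the column names mirror the class dataset.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names mirror the class dataset.
toy = pd.DataFrame({
    "Age Cohort": ["0-5", "0-5", "0-5", "0-5", "51+", "51+"],
    "Ethnicity": ["Hispanic", "Hispanic", "Hispanic", "White not Hispanic",
                  "White not Hispanic", "White not Hispanic"],
})

# normalize="index" divides each row by its row total,
# so every cohort's shares add up to 100 after scaling.
shares = pd.crosstab(toy["Age Cohort"], toy["Ethnicity"], normalize="index") * 100
print(shares.round(1))
```

With the real data, the same call on `df["Age Cohort"]` and `df["Ethnicity"]` should give one row per cohort and one column per ethnicity, which is easier to scan than the long stacked output.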
# recent seaborn versions require x and y to be passed as keywords
sns.scatterplot(x=df["Age"], y=df["Expenditures"])
mod = sm.ols(formula="Expenditures ~ Age", data=df)
res = mod.fit()
print(res.summary())
mod = sm.ols(formula="Expenditures ~ Ethnicity", data=df)
res = mod.fit()
print(res.summary())
mod = sm.ols(formula="Expenditures ~ Ethnicity + Age + Gender", data=df)
res = mod.fit()
print(res.summary())
# gowder code (visualizations) begins here
sns.countplot(x=df["Ethnicity"])
def bin_ethnicity(eth):
    if eth == "White not Hispanic":
        return "white"
    elif eth == "Hispanic":
        return "hispanic"
    return "other"
# there is doubtless a better way to do this involving the apply function in pandas or something.
# But I'm rusty with my Pandas data transformations.
df["binned_eth"] = np.array([bin_ethnicity(x) for x in list(df["Ethnicity"])])
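One candidate for the "better way" mentioned in the comment above is `Series.map` with a dictionary, followed by `fillna` for everything not listed. This is only a sketch on a few toy values, not the class data; the names `mapping` and `binned` are made up for the example.

```python
import pandas as pd

# Toy values for illustration; in the notebook this would be df["Ethnicity"].
ethnicity = pd.Series(["White not Hispanic", "Hispanic", "Asian", "Black"])

# Labels missing from the mapping come back as NaN,
# and fillna then turns them into "other".
mapping = {"White not Hispanic": "white", "Hispanic": "hispanic"}
binned = ethnicity.map(mapping).fillna("other")
print(binned.tolist())
```

In the notebook, the equivalent would be `df["binned_eth"] = df["Ethnicity"].map(mapping).fillna("other")`, which should give the same column as the list comprehension.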
df.head()
sns.countplot(x=df["binned_eth"])
sns.boxplot(x=df["binned_eth"], y=df["Expenditures"])
sns.violinplot(x=df["binned_eth"], y=df["Expenditures"])
cohorts = sorted(df["Age Cohort"].unique()) # just sorting this now like a sensible person
cohorts
I'm going to make a couple of changes from the code I showed in class here.
First, I'm going to sort our pandas dataframe by the value of binned ethnicity in order to try to get our columns in the violin plots to come out right.
Second, I'm going to sort the list of cohorts so that it's easy to generate plots in order.
Third, I'm going to change the function that generates the violin plot to let me loop over and show a plot for each cohort.
df.sort_values("binned_eth", inplace=True)
import re
def sorting_function(elem):
    e = elem.strip()
    e = re.split(r"[-\+\s]", e)
    return int(e[0])
cohorts = sorted(cohorts, key=sorting_function)
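To see why the custom key is needed, here is a small sketch comparing a plain string sort with the numeric sort. The cohort labels below are hypothetical; the exact spelling and spacing in the CSV may differ.

```python
import re

def sorting_function(elem):
    # Pull out the leading number of a label such as "6-12" or "51 +".
    parts = re.split(r"[-\+\s]", elem.strip())
    return int(parts[0])

# Hypothetical labels; the exact spelling in the CSV may differ.
labels = ["13-17", "51 +", " 0 - 5", "6-12", "22-50", "18-21"]

plain = sorted(labels)                         # alphabetical: "6-12" lands last
numeric = sorted(labels, key=sorting_function) # ordered by each cohort's lower bound
print(plain)
print(numeric)
```

The plain sort compares characters, so "6-12" sorts after "51 +"; the key function compares the lower bound of each cohort as an integer instead.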
import matplotlib.pyplot as plt # this is a change from my code in class to make it work in a loop
def subsetted_violin(cohort):
    temp_df = df[df["Age Cohort"] == cohort]
    plt.figure()
    sns.violinplot(x=temp_df["binned_eth"], y=temp_df["Expenditures"])
    plt.title(cohort)

for cohort in cohorts:
    subsetted_violin(cohort)
sns.violinplot(x=df["Age Cohort"], y=df["Expenditures"])
Anna showed us another very useful plot: the swarm plot. It is sort of like a violin plot, but with a dot for each individual item, and it can show an extra dimension by coloring those dots. That makes it very useful for seeing the relationship between ethnicity, age cohort, and expenditures in these data.
(Seaborn legends can be obnoxious; this second line of code uses matplotlib to take the existing legend and shove it over to the right so it doesn't end up on top of the figure.)
sns.swarmplot(x=df["binned_eth"], y=df["Expenditures"], hue=df["Age Cohort"], size=5)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
If we want to, we can even use a catplot to divide up our swarmplots by some other dimension, like gender, so we can eyeball whether there are any differences in there too. See the seaborn docs for more cool things we can do with these swarm and swarm-adjacent plots.
sns.catplot(x="binned_eth", y="Expenditures", hue="Age Cohort", col="Gender", data=df)
Finally, a quick look at how to get the same kinds of things that Sam's code showed us, but with a pivot table.
pd.pivot_table(df, index=["Age Cohort", "binned_eth"], values="Expenditures", aggfunc="mean")
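Putting the ethnicity bins in the columns of the pivot table makes the within-cohort comparison easier to read side by side. The sketch below uses invented numbers, chosen only so that the reversal shows up clearly; they are not the Mickel data, and the real differences within cohorts may look quite different.

```python
import pandas as pd

# Invented numbers chosen to exhibit the reversal; NOT the Mickel data.
toy = pd.DataFrame({
    "Age Cohort":   ["0-5", "0-5", "0-5", "0-5", "51+", "51+", "51+", "51+"],
    "binned_eth":   ["hispanic", "hispanic", "hispanic", "white",
                     "hispanic", "white", "white", "white"],
    "Expenditures": [1000, 1200, 1100, 1000, 50000, 48000, 49000, 47000],
})

# Within each cohort: one row per cohort, one column per group.
by_cohort = pd.pivot_table(toy, index="Age Cohort", columns="binned_eth",
                           values="Expenditures", aggfunc="mean")

# Ignoring cohort entirely: one overall mean per group.
overall = toy.groupby("binned_eth")["Expenditures"].mean()

print(by_cohort)
print(overall)
```

In this toy frame the "hispanic" mean is higher than the "white" mean inside each cohort, yet lower overall, because most of the "hispanic" rows sit in the low-expenditure young cohort. That difference in group composition is what drives Simpson's paradox.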