Here is a combination of some of the code Sam kindly showed us in class, plus the visualizations I showed you, for our simpson's paradox example on 3/25/19.

Here are a few additional FYIs:

The source of the underlying dataset is an article entitled "Simpson’s Paradox: A Data Set and Discrimination Case Study" in the Journal of Statistics Education, Volume 22, Number 1 (2014) by Stanley A. Taylor and Amy E. Mickel
Taylor and Mickel talk about pivot tables as a good solution for looking at these data in their article. That's a slightly more powerful version of some of the groupby code Sam showed us. For a nice explanation of how to do pivot tables in Pandas, see this web page.

In [1]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
import numpy as np
%matplotlib inline

SECRET_PASSWORD = "INSERT PASSWORD FOR CLASS SERVER HERE"
endpoint = "https://gobbledygook.herokuapp.com/data?file={}&password={}".format("mickel.csv", SECRET_PASSWORD)
df = pd.read_csv(endpoint)

In [2]:

df.head()

Out[2]:

	ID	Age Cohort	Age	Gender	Expenditures	Ethnicity
0	10210	13-17	17	Female	2113	White not Hispanic
1	10409	22-50	37	Male	41924	White not Hispanic
2	10486	0 - 5	3	Male	1454	Hispanic
3	10538	18-21	19	Female	6400	Hispanic
4	10568	13-17	13	Male	4412	White not Hispanic

Let's look at the (good) choices that Sam made for poking around in these data.

In [3]:

# begin sam's code
df.Ethnicity.unique()

Out[3]:

array(['White not Hispanic', 'Hispanic', 'Black', 'Multi Race', 'Asian',
       'American Indian', 'Other', 'Native Hawaiian'], dtype=object)

In [4]:

df.groupby("Ethnicity")["Expenditures"].mean()

Out[4]:

Ethnicity
American Indian       36438.250000
Asian                 18392.372093
Black                 20884.593220
Hispanic              11065.569149
Multi Race             4456.730769
Native Hawaiian       42782.333333
Other                  3316.500000
White not Hispanic    24697.548628
Name: Expenditures, dtype: float64

In [5]:

df.groupby("Age Cohort")["Expenditures"].mean()

Out[5]:

Age Cohort
 0 - 5     1415.280488
 51 +     53521.896226
13-17      3922.613208
18-21      9888.537688
22-50     40209.283186
6-12       2226.862857
Name: Expenditures, dtype: float64

In [6]:

df.groupby("Gender")["Expenditures"].mean()

Out[6]:

Gender
Female    18129.606362
Male      18001.195171
Name: Expenditures, dtype: float64

In [7]:

df.groupby(["Age Cohort", "Ethnicity"])["Gender"].count()

Out[7]:

Age Cohort  Ethnicity         
 0 - 5      Asian                   8
            Black                   3
            Hispanic               44
            Multi Race              7
            White not Hispanic     20
 51 +       American Indian         2
            Asian                  13
            Black                   7
            Hispanic               17
            Native Hawaiian         1
            White not Hispanic     66
13-17       American Indian         1
            Asian                  20
            Black                  12
            Hispanic              103
            Multi Race              7
            Other                   2
            White not Hispanic     67
18-21       Asian                  41
            Black                   9
            Hispanic               78
            Multi Race              2
            White not Hispanic     69
22-50       American Indian         1
            Asian                  29
            Black                  17
            Hispanic               43
            Multi Race              1
            Native Hawaiian         2
            White not Hispanic    133
6-12        Asian                  18
            Black                  11
            Hispanic               91
            Multi Race              9
            White not Hispanic     46
Name: Gender, dtype: int64

In [8]:

age_buckets = df.groupby(["Age Cohort"])["Gender"].count()
df.groupby(["Age Cohort", "Ethnicity"])["Gender"].count() / age_buckets * 100

Out[8]:

Age Cohort  Ethnicity         
 0 - 5      Asian                  9.756098
            Black                  3.658537
            Hispanic              53.658537
            Multi Race             8.536585
            White not Hispanic    24.390244
 51 +       American Indian        1.886792
            Asian                 12.264151
            Black                  6.603774
            Hispanic              16.037736
            Native Hawaiian        0.943396
            White not Hispanic    62.264151
13-17       American Indian        0.471698
            Asian                  9.433962
            Black                  5.660377
            Hispanic              48.584906
            Multi Race             3.301887
            Other                  0.943396
            White not Hispanic    31.603774
18-21       Asian                 20.603015
            Black                  4.522613
            Hispanic              39.195980
            Multi Race             1.005025
            White not Hispanic    34.673367
22-50       American Indian        0.442478
            Asian                 12.831858
            Black                  7.522124
            Hispanic              19.026549
            Multi Race             0.442478
            Native Hawaiian        0.884956
            White not Hispanic    58.849558
6-12        Asian                 10.285714
            Black                  6.285714
            Hispanic              52.000000
            Multi Race             5.142857
            White not Hispanic    26.285714
Name: Gender, dtype: float64

In [9]:

sns.scatterplot(df["Age"], df["Expenditures"])

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x10e5054a8>

In [10]:

mod = sm.ols(formula="Expenditures ~ Age", data=df)

res = mod.fit()

print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           Expenditures   R-squared:                       0.711
Model:                            OLS   Adj. R-squared:                  0.711
Method:                 Least Squares   F-statistic:                     2456.
Date:                Wed, 27 Mar 2019   Prob (F-statistic):          2.64e-271
Time:                        17:37:13   Log-Likelihood:                -10678.
No. Observations:                1000   AIC:                         2.136e+04
Df Residuals:                     998   BIC:                         2.137e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -2285.6473    528.305     -4.326      0.000   -3322.364   -1248.931
Age          892.6067     18.011     49.558      0.000     857.262     927.951
==============================================================================
Omnibus:                      165.825   Durbin-Watson:                   1.980
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              251.596
Skew:                           1.165   Prob(JB):                     2.33e-55
Kurtosis:                       3.780   Cond. No.                         46.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [11]:

mod = sm.ols(formula="Expenditures ~ Ethnicity", data=df)

res = mod.fit()

print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           Expenditures   R-squared:                       0.118
Model:                            OLS   Adj. R-squared:                  0.112
Method:                 Least Squares   F-statistic:                     18.94
Date:                Wed, 27 Mar 2019   Prob (F-statistic):           7.89e-24
Time:                        17:37:13   Log-Likelihood:                -11236.
No. Observations:                1000   AIC:                         2.249e+04
Df Residuals:                     992   BIC:                         2.253e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
Intercept                        3.644e+04   9209.742      3.956      0.000    1.84e+04    5.45e+04
Ethnicity[T.Asian]              -1.805e+04   9351.439     -1.930      0.054   -3.64e+04     304.995
Ethnicity[T.Black]              -1.555e+04   9516.817     -1.634      0.103   -3.42e+04    3121.748
Ethnicity[T.Hispanic]           -2.537e+04   9258.600     -2.740      0.006   -4.35e+04   -7203.990
Ethnicity[T.Multi Race]         -3.198e+04   9892.850     -3.233      0.001   -5.14e+04   -1.26e+04
Ethnicity[T.Native Hawaiian]     6344.0833   1.41e+04      0.451      0.652   -2.13e+04     3.4e+04
Ethnicity[T.Other]              -3.312e+04    1.6e+04     -2.076      0.038   -6.44e+04   -1818.719
Ethnicity[T.White not Hispanic] -1.174e+04   9255.562     -1.269      0.205   -2.99e+04    6422.027
==============================================================================
Omnibus:                       95.947   Durbin-Watson:                   2.015
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              104.142
Skew:                           0.747   Prob(JB):                     2.43e-23
Kurtosis:                       2.485   Cond. No.                         53.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [12]:

mod = sm.ols(formula="Expenditures ~ Ethnicity + Age + Gender", data=df)

res = mod.fit()

print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           Expenditures   R-squared:                       0.724
Model:                            OLS   Adj. R-squared:                  0.721
Method:                 Least Squares   F-statistic:                     288.3
Date:                Wed, 27 Mar 2019   Prob (F-statistic):          1.82e-269
Time:                        17:37:13   Log-Likelihood:                -10655.
No. Observations:                1000   AIC:                         2.133e+04
Df Residuals:                     990   BIC:                         2.138e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
Intercept                       -9478.0155   5254.102     -1.804      0.072   -1.98e+04     832.441
Ethnicity[T.Asian]               8083.2889   5270.650      1.534      0.125   -2259.641    1.84e+04
Ethnicity[T.Black]               9224.2697   5360.504      1.721      0.086   -1294.986    1.97e+04
Ethnicity[T.Hispanic]            5664.1767   5230.543      1.083      0.279   -4600.049    1.59e+04
Ethnicity[T.Multi Race]          5193.5583   5600.352      0.927      0.354   -5796.366    1.62e+04
Ethnicity[T.Native Hawaiian]     2.155e+04   7886.344      2.732      0.006    6071.684     3.7e+04
Ethnicity[T.Other]               -894.9578   8962.598     -0.100      0.920   -1.85e+04    1.67e+04
Ethnicity[T.White not Hispanic]  1.014e+04   5207.487      1.948      0.052     -75.692    2.04e+04
Gender[T.Male]                   -251.7477    653.446     -0.385      0.700   -1534.046    1030.550
Age                               863.4592     18.526     46.607      0.000     827.104     899.814
==============================================================================
Omnibus:                      158.613   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              236.989
Skew:                           1.121   Prob(JB):                     3.45e-52
Kurtosis:                       3.811   Cond. No.                     1.36e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.36e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [13]:

# gowder code (visualizations) begins here
sns.countplot(df["Ethnicity"])

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x10e62ac88>

In [14]:

def bin_ethnicity(eth):
    if eth == "White not Hispanic":
        return "white"
    elif eth == "Hispanic":
        return "hispanic"
    return "other"



# there is doubtless a better way to do this involving the apply function in pandas or something.  
# But I'm rusty with my Pandas data tranformations.

df["binned_eth"] =np.array([bin_ethnicity(x) for x in list(df["Ethnicity"])])

In [15]:

df.head()

Out[15]:

	ID	Age Cohort	Age	Gender	Expenditures	Ethnicity	binned_eth
0	10210	13-17	17	Female	2113	White not Hispanic	white
1	10409	22-50	37	Male	41924	White not Hispanic	white
2	10486	0 - 5	3	Male	1454	Hispanic	hispanic
3	10538	18-21	19	Female	6400	Hispanic	hispanic
4	10568	13-17	13	Male	4412	White not Hispanic	white

In [16]:

sns.countplot(df["binned_eth"])

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x1062f4fd0>

In [17]:

sns.boxplot(x=df["binned_eth"], y=df["Expenditures"])

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0x10639cc88>

In [18]:

sns.violinplot(x=df["binned_eth"], y=df["Expenditures"])

Out[18]:

<matplotlib.axes._subplots.AxesSubplot at 0x10e8089e8>

In [19]:

cohorts = sorted(df["Age Cohort"].unique())  # just sorting this now like a sensible person

In [20]:

cohorts

Out[20]:

[' 0 - 5', ' 51 +', '13-17', '18-21', '22-50', '6-12']

I'm going to make a couple of changes from the code I showed in class here.

First, I'm going to sort our pandas dataframe by the value of binned ethnicity in order to try to get our columns in the violin plots to come out right.

Second, I'm going to sort the list of cohorts so that it's easy to generate plots in order.

Third, I'm going to change the function that generates the violin plot to let me loop over and show a plot for each cohort.

In [21]:

df.sort_values("binned_eth", inplace=True)

In [22]:

import re
def sorting_function(elem):
    e = elem.strip()
    e = re.split(r"[-\+\s]", e)
    return int(e[0])
cohorts = sorted(cohorts, key=sorting_function)

In [23]:

import matplotlib.pyplot as plt # this is a change from my code in class to make it work in a loop
def subsetted_violin(cohort):
    temp_df = df[df["Age Cohort"] == cohort]
    plt.figure()
    sns.violinplot(x=temp_df["binned_eth"], y=temp_df["Expenditures"])
    plt.title(cohort)

In [24]:

for cohort in cohorts:
    subsetted_violin(cohort)

In [25]:

sns.violinplot(x=df["Age Cohort"], y=df["Expenditures"])

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x10ed8bda0>

Anna showed us another very useful plot---the swarm plot, which is sort of like a violin plot, but with dots for individual items as well as the capacity to see an extra dimension by colorizing those dots---which can be very useful for seeing the relationship between ethnicity, age cohort, and expenditures in these data.

(Seaborn legends can be obnoxious; this second line of code uses matplotlib to take the existing legend and shove it over to the right so it doesn't end up on top of the figure.)

In [26]:

sns.swarmplot(df["binned_eth"], df["Expenditures"], df["Age Cohort"], size=5)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Out[26]:

<matplotlib.legend.Legend at 0x10eafecc0>

If we want to, we can even use a catplot to divide up our swarmplots by some other dimension, like gender, so we can eyeball whether there are any differences in there too. See the seaborn docs for more cool things we can do with these swarm and swarm-adjacent plots.

In [27]:

sns.catplot(x="binned_eth", y="Expenditures", hue="Age Cohort", col="Gender", data=df)

Out[27]:

<seaborn.axisgrid.FacetGrid at 0x10eafeeb8>

Finally, a quick look at how to get a look at the same kinds of things that Sam's code showed us, but with a pivot table.

In [28]:

pd.pivot_table(df,index=["Age Cohort","binned_eth"], values="Expenditures", aggfunc=np.mean)

Out[28]:

		Expenditures
Age Cohort	binned_eth
0 - 5	hispanic	1393.204545
	other	1523.000000
	white	1366.900000
51 +	hispanic	55585.000000
	other	54440.347826
	white	52670.424242
13-17	hispanic	3955.281553
	other	3871.619048
	white	3904.358209
18-21	hispanic	9959.846154
	other	9457.115385
	white	10133.057971
22-50	hispanic	40924.116279
	other	39652.140000
	white	40187.624060
6-12	hispanic	2312.186813
	other	2233.894737
	white	2052.260870

Sociological Gobbledygook

Simpson's Paradox Homework Example

links