Hypothesis Tests on Experimental Data: Housing Discrimination Test Example

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(10)

Let's simulate some housing discrimination tester data. Suppose a testing organization hired 20 white people and 20 black people, trained them to behave identically and give identical answers to landlord questions, and sent them into a large housing complex to inquire about one-bedroom apartments. The testers were instructed to ask what the rent is on a one-bedroom in the complex, but not to ask for an application, and then to record how much they were quoted and whether they were offered an application anyway.

We'll assume that whether they're offered an application is a Bernoulli variable, i.e., a weighted coin flip, and that the rent they're quoted is normally distributed. We'll simulate some racist data, i.e., data where the probability that black testers are offered an application is lower and the mean rent they're quoted is higher.

In [2]:
def generate_test(race, mean_rent, prob_application):
    # Quoted rent: a normal draw around the group mean (sd = 20), truncated
    # to whole dollars.
    rent_charged = np.trunc(stats.norm.rvs(loc=mean_rent, scale=20, size=1))
    # Application offered: a Bernoulli draw, i.e., a weighted coin flip.
    application_offered = stats.bernoulli.rvs(p=prob_application, size=1)
    return {"race": race, "rent": rent_charged[0], "application": application_offered[0]}

testers = []
for white_tester in range(20):
    testers.append(generate_test("white", 500, 0.7))
for black_tester in range(20):
    testers.append(generate_test("black", 550, 0.4))
In [3]:
testers
Out[3]:
[{'race': 'white', 'rent': 526.0, 'application': 1},
 {'race': 'white', 'rent': 514.0, 'application': 1},
 {'race': 'white', 'rent': 512.0, 'application': 1},
 {'race': 'white', 'rent': 485.0, 'application': 1},
 {'race': 'white', 'rent': 505.0, 'application': 1},
 {'race': 'white', 'rent': 502.0, 'application': 1},
 {'race': 'white', 'rent': 508.0, 'application': 0},
 {'race': 'white', 'rent': 524.0, 'application': 1},
 {'race': 'white', 'rent': 504.0, 'application': 1},
 {'race': 'white', 'rent': 508.0, 'application': 1},
 {'race': 'white', 'rent': 529.0, 'application': 1},
 {'race': 'white', 'rent': 478.0, 'application': 1},
 {'race': 'white', 'rent': 505.0, 'application': 1},
 {'race': 'white', 'rent': 547.0, 'application': 1},
 {'race': 'white', 'rent': 501.0, 'application': 0},
 {'race': 'white', 'rent': 527.0, 'application': 1},
 {'race': 'white', 'rent': 494.0, 'application': 1},
 {'race': 'white', 'rent': 489.0, 'application': 0},
 {'race': 'white', 'rent': 502.0, 'application': 1},
 {'race': 'white', 'rent': 490.0, 'application': 0},
 {'race': 'black', 'rent': 558.0, 'application': 0},
 {'race': 'black', 'rent': 543.0, 'application': 1},
 {'race': 'black', 'rent': 563.0, 'application': 0},
 {'race': 'black', 'rent': 542.0, 'application': 0},
 {'race': 'black', 'rent': 533.0, 'application': 1},
 {'race': 'black', 'rent': 545.0, 'application': 0},
 {'race': 'black', 'rent': 561.0, 'application': 1},
 {'race': 'black', 'rent': 547.0, 'application': 0},
 {'race': 'black', 'rent': 539.0, 'application': 0},
 {'race': 'black', 'rent': 564.0, 'application': 1},
 {'race': 'black', 'rent': 597.0, 'application': 0},
 {'race': 'black', 'rent': 568.0, 'application': 0},
 {'race': 'black', 'rent': 545.0, 'application': 0},
 {'race': 'black', 'rent': 539.0, 'application': 1},
 {'race': 'black', 'rent': 548.0, 'application': 0},
 {'race': 'black', 'rent': 539.0, 'application': 1},
 {'race': 'black', 'rent': 542.0, 'application': 1},
 {'race': 'black', 'rent': 547.0, 'application': 1},
 {'race': 'black', 'rent': 545.0, 'application': 1},
 {'race': 'black', 'rent': 569.0, 'application': 0}]
In [4]:
df = pd.DataFrame(testers)
In [5]:
df.head()
Out[5]:
   application   race   rent
0            1  white  526.0
1            1  white  514.0
2            1  white  512.0
3            1  white  485.0
4            1  white  505.0
In [6]:
df.describe()
Out[6]:
       application        rent
count     40.00000   40.000000
mean       0.62500  529.600000
std        0.49029   27.392494
min        0.00000  478.000000
25%        0.00000  505.000000
50%        1.00000  536.000000
75%        1.00000  547.000000
max        1.00000  597.000000

The first thing we'll do is a simple t-test for difference of means. The data isn't paired (i.e., in our hypothetical research design, the housing organization didn't send in pairs of testers, one white and one black, matched on other characteristics), so we can use the independent t-test.

In [7]:
t, p = stats.ttest_ind(df[df.race == "white"].rent, df[df.race == "black"].rent)
In [8]:
print(p)
1.26829227017898e-10
In [9]:
print(t)
-8.736151260346574
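
One caveat: by default, ttest_ind assumes the two groups have equal variances. Our simulation happens to satisfy that (both groups were drawn with the same scale), but real data won't be so polite. If you'd rather not lean on the assumption, scipy will run Welch's t-test instead; a minimal sketch:

# Welch's t-test: same call, but without the equal-variances assumption.
t_w, p_w = stats.ttest_ind(
    df[df.race == "white"].rent,
    df[df.race == "black"].rent,
    equal_var=False,
)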

It looks like we can reject the null hypothesis that white and black rents are the same. How about offering applications? There, we have a binary categorical variable, so we should go for something like a chi-squared test.

I actually glossed over some complexities of the chi-squared test in the catalogue of hypothesis tests lesson. There are three different flavors of the chi-squared test: a test for independence, a test for homogeneity, and a test for goodness of fit. The goodness-of-fit one is for testing whether a set of categorical data came from a hypothesized discrete distribution. We don't need that here.
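
(For the curious, a goodness-of-fit test looks something like the sketch below; the die-roll counts are a made-up example of mine, nothing to do with our tester data.)

# Hypothetical example: are these 60 die rolls consistent with a fair die?
observed = np.array([8, 9, 13, 7, 12, 11])        # counts for faces 1 through 6
expected_counts = np.full(6, observed.sum() / 6)  # uniform null: 10 per face
chi2_gof, p_gof = stats.chisquare(f_obs=observed, f_exp=expected_counts)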

The difference between the test for independence and the test for homogeneity is a bit subtle. The most common explanation of the difference is that the test for independence is for research designs where you collect two observations for each member of the sample (i.e., their race and whether they were offered an application); whereas the test for homogeneity is for research designs where you sample the subpopulations separately, i.e., sample a bunch of white people and a bunch of black people and then observe who got an application. Here's a good explanation of the flavors of chi-squared tests.

The good news is that, mathematically, the tests of independence and homogeneity are the same; the only difference is in interpretation. So we can use what we've already learned.

In [10]:
crosstab = pd.crosstab(df["race"], df["application"])
In [11]:
crosstab
Out[11]:
application   0   1
race
black        11   9
white         4  16
In [12]:
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
In [13]:
print(p)
0.05004352124870519
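
Where does that borderline number come from? The chi-squared statistic compares the observed table to the counts we'd expect if race and application were independent (row total times column total, divided by the grand total). chi2_contingency already handed those back to us in the expected variable:

# Expected counts under independence. For our crosstab (row totals 20 and 20,
# column totals 15 and 25, n = 40), each row works out to 7.5 and 12.5.
print(expected)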

It looks like we can't quite reject (at the standard 0.05 level) the null hypothesis that white and black folks are offered applications at the same rate. That should suggest something to our testing organization about picking a research design with a bigger sample size, since the differences look pretty stark in the crosstab.
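
To put a rough number on "bigger sample size," statsmodels can do a quick power calculation. Here's a sketch, assuming the organization wanted 80% power to detect the gap we actually built into the simulation (application rates of 0.7 versus 0.4); in real life you'd have to guess at the effect size in advance:

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Cohen's h for the simulated application-offer rates.
h = proportion_effectsize(0.7, 0.4)
# Testers needed per group for 80% power at the 0.05 level.
n_per_group = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.8)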

But, of course, Fisher's exact test is also available to us, so let's do one more test just to see what we get. (In a scientific context, you should really precommit to the tests you'll carry out before running any of them, since, recall, if you look at the data enough times, you can get significant results just by chance. So what I'm about to do is technically a bit dirty; I really should have just done this first.)

In [14]:
oddsratio, p = stats.fisher_exact(crosstab)
In [15]:
print(p)
0.04837206505727086

Oh look, I managed to torture a significant result out of the data. See above as to why this is cheating. But also: with a small 2x2 table like this one, you should probably reach for Fisher's exact test first. We could also have gone for a z-test for the difference of two proportions.
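
That z-test lives in statsmodels rather than scipy. A minimal sketch, feeding it the application counts straight from our crosstab:

from statsmodels.stats.proportion import proportions_ztest

# 16 of 20 white testers and 9 of 20 black testers were offered applications.
offers = np.array([16, 9])
n_testers = np.array([20, 20])
z, p_z = proportions_ztest(count=offers, nobs=n_testers)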

Paired data

The hypothetical test I described doesn't quite match the examples given in many of our housing discrimination cases. Often, these organizations use paired/matched testers instead. That is, they send one black tester and one white tester, close together in time, matched on everything else that might be relevant to the landlord (income, number of pets, general personality and good looks, etc.).

This violates one of the assumptions of both the standard t-test and the chi-squared/Fisher tests, namely that the samples are independent. But we have tests for this purpose too, including the paired t-test and McNemar's test. (I didn't cover McNemar's test in the catalogue of hypothesis tests, but it's the equivalent of the chi-squared test for paired data.) I'll also break out the Wilcoxon signed-rank test, discussed at the end of the catalogue of hypothesis tests, just because I can.

Rather than simulate new data, let's just change our assumptions about the data we already have. Let's assume the testers were paired, and that the pairings correspond to equivalent positions in the subsets of the DataFrame (i.e., that the first white tester was paired with the first black tester, and so forth).

In [16]:
t, p = stats.ttest_rel(df[df.race == "white"].rent, df[df.race == "black"].rent)
In [17]:
print(p)
3.608791002314423e-08
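
If it helps the intuition: the paired t-test is just a one-sample t-test on the per-pair differences. A quick sketch (the .to_numpy() calls matter, because the two subsets have different DataFrame indices and would misalign if subtracted as Series):

# Per-pair rent differences; this is equivalent to the ttest_rel call above.
diffs = (df[df.race == "white"].rent.to_numpy()
         - df[df.race == "black"].rent.to_numpy())
t_alt, p_alt = stats.ttest_1samp(diffs, 0)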

McNemar's test is buried in statsmodels, and it has a really obnoxious API that doesn't take Pandas DataFrames and returns some idiotic object called a "bunch." It's very annoying. But we can work with it.

(Note: if you actually find yourself doing this kind of research on a regular basis, I'd probably recommend switching from Python to R; as a language it's better suited to this kind of bread-and-butter statistics, although less well-suited to lots of the other stuff we do in this class. Once you understand Python, it's pretty easy to learn R.)

In [18]:
from statsmodels.stats.contingency_tables import mcnemar 
print(mcnemar(np.array(crosstab)))
pvalue      0.26684570312499983
statistic   4.0
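
One caveat worth flagging: strictly speaking, McNemar's test wants a table of paired outcomes, with each pair's white-tester result cross-tabulated against its black-tester result, rather than the race-by-application crosstab above. Under our pairing assumption (equivalent positions in the two subsets), building that table would look roughly like this:

# Paired 2x2 table: rows are the white tester's outcome, columns the black
# tester's outcome, one count per pair.
white_apps = df[df.race == "white"].application.to_numpy()
black_apps = df[df.race == "black"].application.to_numpy()
paired_table = pd.crosstab(white_apps, black_apps)
print(mcnemar(np.asarray(paired_table)))

Read the crosstab-based numbers above with that caveat in mind.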
In [19]:
statistic, p = stats.wilcoxon(df[df.race == "white"].rent, df[df.race == "black"].rent)
In [20]:
print(p)
0.0001032027839634719

Again we see that we have more convincing evidence for a difference in rents than for a difference in application-offering rates. Looks good enough to file suit to me...

Now I'm going to save this dataset, so I can make you figure out how to work with it in class!

In [21]:
df.to_csv("simulated_housing_test.csv", index=False)
