Solutions to Data Scavenger Hunt

Here are some example ways to respond to the prompts in our data scavenger hunt.

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import seaborn as sns
import numpy as np
%matplotlib inline

I'm going to hide the password for the class dataset; the data itself ended up in a file called "Lawsuit.csv", so that's what we'll grab.

In [2]:
import os
with open(os.path.expanduser("~/gobbledygook_password.txt")) as gp:
    secret = gp.read().strip()
df = pd.read_csv("https://gobbledygook.herokuapp.com/data?file=Lawsuit.csv&password={}".format(secret))
In [3]:
df.head()
Out[3]:
ID Dept Gender Clin Cert Prate Exper Rank Sal94 Sal95
0 1 1 1 0 0 7.4 9 3 77836 84612
1 2 1 1 0 0 6.7 10 2 69994 78497
2 3 1 1 0 0 8.1 6 1 62872 67756
3 4 1 1 1 1 5.1 27 3 155196 173220
4 5 1 1 0 0 7.0 10 3 89268 96099

1. How many observations are there in the dataset?

In [4]:
df.describe()
Out[4]:
ID Dept Gender Clin Cert Prate Exper Rank Sal94 Sal95
count 261.00000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000
mean 131.00000 3.651341 0.593870 0.613027 0.720307 4.932184 10.229885 1.896552 153593.344828 168906.655172
std 75.48841 1.796655 0.492053 0.487993 0.449711 1.944430 6.227176 0.864262 80469.666720 88778.425411
min 1.00000 1.000000 0.000000 0.000000 0.000000 1.300000 1.000000 1.000000 34514.000000 38675.000000
25% 66.00000 2.000000 0.000000 0.000000 0.000000 3.200000 6.000000 1.000000 90771.000000 99972.000000
50% 131.00000 4.000000 1.000000 1.000000 1.000000 4.400000 9.000000 2.000000 133284.000000 148117.000000
75% 196.00000 5.000000 1.000000 1.000000 1.000000 6.900000 14.000000 3.000000 200543.000000 218955.000000
max 261.00000 6.000000 1.000000 1.000000 1.000000 8.700000 37.000000 3.000000 428876.000000 472589.000000

Looks like 261 to me!
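If you'd rather ask for the count directly than read it off the describe() table, a quick sketch (continuing with the same df):

len(df)     # number of rows, i.e. observations
df.shape    # (rows, columns), so df.shape[0] works too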

2. What can you say about the distribution of salaries? Does it look like the standard bell-shaped curve that we all know and love? If not, why not?

In [5]:
sns.distplot(df["Sal94"])
Out[5]:
[distribution plot (histogram + KDE) of Sal94]
In [6]:
sns.distplot(df["Sal95"])
Out[6]:
[distribution plot (histogram + KDE) of Sal95]

It looks pretty right-skewed: not really a normal distribution, but not so extreme as to be something like an exponential distribution either, as we'd expect given the restricted range of salaries.
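If you want a number to back up the eyeballing, pandas has a skew() method (positive means right-skewed). Note also that distplot has since been deprecated in newer seaborn releases in favor of histplot/displot; a rough modern equivalent is sketched in the comment.

print(df["Sal94"].skew(), df["Sal95"].skew())   # both should come out positive, i.e. right-skewed
# on newer seaborn (0.11+), where distplot is deprecated:
# sns.histplot(df["Sal95"], kde=True)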

3. What are the mean, median, and mode of salaries? How do they change from year to year?

We've already seen the first two of those in the describe() output above; both the mean and the median go up from 1994 to 1995. The mode is a bit of a trick question: the salaries are granular enough that they're essentially all unique, so there isn't really a mode. But we could use our histograms with different bin sizes to get a sense of where they clump.
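To put numbers on the year-to-year change, and to see the clumping with a coarser histogram, a quick sketch:

df[["Sal94", "Sal95"]].agg(["mean", "median"])   # center of each year's salary distribution
# a coarser histogram hints at where salaries clump, since there's no true mode:
sns.distplot(df["Sal95"], bins=20, kde=False)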

4. What data visualization would you use to get a look at the relationship between gender and salaries? What do we learn from that visualization?

In [7]:
sns.boxplot(x=df["Gender"], y=df["Sal95"])
Out[7]:
[boxplot of Sal95 by Gender]

A boxplot is always a good place to start.

What we can immediately observe is that there is a gender difference, but that it isn't necessarily huge. The boxes represent the interquartile range (under seaborn's defaults), i.e., the 25th to 75th percentiles, and there's a good amount of overlap between the two groups. (The whiskers extend up to 1.5 times that range beyond the box, and the dots are outliers beyond the whiskers.)

We can also observe that there are a lot more outliers among the women, which will tend to drag statistics like the mean for women upward; whether that's fine, or whether we should prefer the median instead, is something we could argue about.
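If you want the same comparison as numbers rather than boxes, a quick sketch:

df.groupby("Gender")["Sal95"].describe()   # count, mean, quartiles, etc. for each gender
df.groupby("Gender")["Sal95"].median()     # medians, which the outliers can't drag around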

5. Can we subdivide the dataset in some useful way to get more insight about the relationship between gender and salaries under different conditions? Come up with something that helps us learn more, and visualize it.

In [8]:
sns.boxplot(x=df["Dept"], y=df["Sal95"], hue=df["Gender"])
Out[8]:
[boxplot of Sal95 by Dept, with Gender as hue]
In [9]:
sns.boxplot(x=df["Rank"], y=df["Sal95"], hue=df["Gender"])
Out[9]:
[boxplot of Sal95 by Rank, with Gender as hue]

We could break up the data by department, by rank, and so on. There still appear to be gender differences within departments and within ranks, and the differences look most striking at the junior level, i.e., the assistant professor rank.

There are lots of other plausible ways to cut the data.
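A compact numeric version of the same cut, as a sketch (pivot_table defaults to the mean, so we pass the median explicitly):

df.pivot_table(values="Sal95", index="Rank", columns="Gender", aggfunc=np.median)
# swap in index="Dept" to see the department cut the same way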

6. How many people in the dataset make more than 450k a year? What else do we know about them?

In [10]:
rich = df[df["Sal95"] > 450000]
In [11]:
rich.describe()
Out[11]:
ID Dept Gender Clin Cert Prate Exper Rank Sal94 Sal95
count 3.000000 3.0 3.0 3.0 3.0 3.000000 3.000000 3.0 3.000000 3.000000
mean 241.333333 6.0 1.0 1.0 1.0 2.766667 21.333333 3.0 419741.333333 463466.333333
std 10.692677 0.0 0.0 0.0 0.0 0.404145 8.144528 0.0 8844.412153 8615.698134
min 232.000000 6.0 1.0 1.0 1.0 2.400000 12.000000 3.0 411219.000000 455468.000000
25% 235.500000 6.0 1.0 1.0 1.0 2.550000 18.500000 3.0 415174.000000 458905.000000
50% 239.000000 6.0 1.0 1.0 1.0 2.700000 25.000000 3.0 419129.000000 462342.000000
75% 246.000000 6.0 1.0 1.0 1.0 2.950000 26.000000 3.0 424002.500000 467465.500000
max 253.000000 6.0 1.0 1.0 1.0 3.200000 27.000000 3.0 428876.000000 472589.000000

There are only three of them. Judging by the fact that the minimum of the Gender column is 1, they're all men. They're also all in surgery (Dept 6) and all full professors (Rank 3), as one would expect.
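If reading that off the describe() table feels indirect, with only three rows we can just look at them (a sketch):

rich[["Gender", "Dept", "Rank", "Exper", "Sal94", "Sal95"]]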

7. How many standard deviations away from the mean is the most highly-paid person in the dataset? What about the lowest-paid person?

In [12]:
def standardize(column):
    # z-score each value: (value - mean) / standard deviation
    # note: np.std uses the population std (ddof=0), so describe() below,
    # which uses the sample std (ddof=1), reports a std slightly above 1 for std95
    std = np.std(column)
    mean = np.mean(column)
    return (column - mean) / std

df["std95"] = standardize(df["Sal95"])
df.describe()
Out[12]:
ID Dept Gender Clin Cert Prate Exper Rank Sal94 Sal95 std95
count 261.00000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000
mean 131.00000 3.651341 0.593870 0.613027 0.720307 4.932184 10.229885 1.896552 153593.344828 168906.655172 0.000000
std 75.48841 1.796655 0.492053 0.487993 0.449711 1.944430 6.227176 0.864262 80469.666720 88778.425411 1.001921
min 1.00000 1.000000 0.000000 0.000000 0.000000 1.300000 1.000000 1.000000 34514.000000 38675.000000 -1.469747
25% 66.00000 2.000000 0.000000 0.000000 0.000000 3.200000 6.000000 1.000000 90771.000000 99972.000000 -0.777972
50% 131.00000 4.000000 1.000000 1.000000 1.000000 4.400000 9.000000 2.000000 133284.000000 148117.000000 -0.234625
75% 196.00000 5.000000 1.000000 1.000000 1.000000 6.900000 14.000000 3.000000 200543.000000 218955.000000 0.564828
max 261.00000 6.000000 1.000000 1.000000 1.000000 8.700000 37.000000 3.000000 428876.000000 472589.000000 3.427249

We get the information we want from the min and max of the last column of this table: the most highly-paid person is about 3.43 standard deviations above the mean, and the lowest-paid person is about 1.47 standard deviations below it.
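To pull those two numbers out directly, and to see who they belong to, a quick sketch:

print(df["std95"].max(), df["std95"].min())              # about +3.43 and -1.47
df.loc[[df["std95"].idxmax(), df["std95"].idxmin()]]     # the rows for those two people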

8. What's the biggest department? The smallest? Can you order them from largest to smallest?

9. What's the most highly paid department? What's the lowest paid department? What measure did you use to make that decision, and could a different measure have yielded different results?

In [13]:
departments = []
for dept in df["Dept"].unique():
    out = {}
    out["department"] = dept
    subset = df[df["Dept"] == dept]
    out["size"] = len(subset)
    out["median_salary"] = np.median(subset["Sal95"])
    out["mean_salary"] = np.mean(subset["Sal95"])
    departments.append(out)
print(departments)
[{'department': 1, 'size': 50, 'median_salary': 97897.5, 'mean_salary': 103547.48}, {'department': 2, 'size': 40, 'median_salary': 84343.5, 'mean_salary': 91218.1}, {'department': 3, 'size': 21, 'median_salary': 126118.0, 'mean_salary': 125873.0}, {'department': 4, 'size': 30, 'median_salary': 127002.0, 'mean_salary': 132093.16666666666}, {'department': 5, 'size': 80, 'median_salary': 187601.5, 'mean_salary': 198546.1375}, {'department': 6, 'size': 40, 'median_salary': 318747.0, 'mean_salary': 319218.0}]

We could sort them however we want from there. Mean and median are both plausible ways to think about the highest-paid departments, and in principle they could give different orderings; here, though, department 6 comes out on top and department 2 at the bottom by either measure.
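For the actual orderings, a sketch that sorts the list we just built (or, equivalently, turns it into a DataFrame):

sorted(departments, key=lambda d: d["size"], reverse=True)                  # departments, largest to smallest
pd.DataFrame(departments).sort_values("median_salary", ascending=False)     # highest- to lowest-paid by median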

I'll leave question 10 and the (unnumbered) question 11 off for now, as those require a bit more interpretation and creativity.
