Solutions to Data Scavenger Hunt

Here are some example ways to respond to the prompts in our data scavenger hunt.

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import seaborn as sns
import numpy as np
%matplotlib inline

I'm going to hide the password for the class dataset; the data itself ended up in a file called "Lawsuit.csv", so that's what we'll grab.

In [2]:
import os
with open(os.path.expanduser("~/gobbledygook_password.txt")) as gp:
    secret = gp.read().strip()
df = pd.read_csv("https://gobbledygook.herokuapp.com/data?file=Lawsuit.csv&password={}".format(secret))
In [3]:
df.head()
Out[3]:
ID Dept Gender Clin Cert Prate Exper Rank Sal94 Sal95
0 1 1 1 0 0 7.4 9 3 77836 84612
1 2 1 1 0 0 6.7 10 2 69994 78497
2 3 1 1 0 0 8.1 6 1 62872 67756
3 4 1 1 1 1 5.1 27 3 155196 173220
4 5 1 1 0 0 7.0 10 3 89268 96099

1. How many observations are there in the dataset?

In [4]:
df.describe()
Out[4]:
ID Dept Gender Clin Cert Prate Exper Rank Sal94 Sal95
count 261.00000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000
mean 131.00000 3.651341 0.593870 0.613027 0.720307 4.932184 10.229885 1.896552 153593.344828 168906.655172
std 75.48841 1.796655 0.492053 0.487993 0.449711 1.944430 6.227176 0.864262 80469.666720 88778.425411
min 1.00000 1.000000 0.000000 0.000000 0.000000 1.300000 1.000000 1.000000 34514.000000 38675.000000
25% 66.00000 2.000000 0.000000 0.000000 0.000000 3.200000 6.000000 1.000000 90771.000000 99972.000000
50% 131.00000 4.000000 1.000000 1.000000 1.000000 4.400000 9.000000 2.000000 133284.000000 148117.000000
75% 196.00000 5.000000 1.000000 1.000000 1.000000 6.900000 14.000000 3.000000 200543.000000 218955.000000
max 261.00000 6.000000 1.000000 1.000000 1.000000 8.700000 37.000000 3.000000 428876.000000 472589.000000

Looks like 261 to me!
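If you'd rather ask for the count directly than read it off the describe() table, a quick sketch (continuing with the same df):

len(df)     # number of rows, i.e. observations
df.shape    # (rows, columns), so df.shape[0] works too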

2. What can you say about the distribution of salaries? Does it look like the standard bell-shaped curve that we all know and love? If not, why not?

In [5]:
sns.distplot(df["Sal94"])
Out[5]:
[distribution plot (histogram + KDE) of Sal94]
In [6]:
sns.distplot(df["Sal95"])
Out[6]:
[distribution plot (histogram + KDE) of Sal95]

It looks pretty right-skewed: not really a normal distribution, but not so extreme as to be something like an exponential distribution either, as we'd expect given the restricted range of salaries.
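If you want a number to back up the eyeballing, pandas has a skew() method (positive means right-skewed). Note also that distplot has since been deprecated in newer seaborn releases in favor of histplot/displot; a rough modern equivalent is sketched in the comment.

print(df["Sal94"].skew(), df["Sal95"].skew())   # both should come out positive, i.e. right-skewed
# on newer seaborn (0.11+), where distplot is deprecated:
# sns.histplot(df["Sal95"], kde=True)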

3. What are the mean, median, and mode of salaries? How do they change from year to year?

We've already seen the first two of those in the describe() output above; both the mean and the median go up from 1994 to 1995. The mode is a bit of a trick question: the salaries are granular enough that they're essentially all unique, so there isn't really a mode. But we could use our histograms with different bin sizes to get a sense of where they clump.
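To put numbers on the year-to-year change, and to see the clumping with a coarser histogram, a quick sketch:

df[["Sal94", "Sal95"]].agg(["mean", "median"])   # center of each year's salary distribution
# a coarser histogram hints at where salaries clump, since there's no true mode:
sns.distplot(df["Sal95"], bins=20, kde=False)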

4. What data visualization would you use to get a look at the relationship between gender and salaries? What do we learn from that visualization?

In [7]:
sns.boxplot(x=df["Gender"], y=df["Sal95"])
Out[7]:
[boxplot of Sal95 by Gender]

A boxplot is always a good place to start.

What we can immediately observe is that there is a gender difference, but that it isn't necessarily huge. The boxes represent the interquartile range (under seaborn's defaults), i.e., the 25th to 75th percentiles, and there's a good amount of overlap between the two groups. (The whiskers extend up to 1.5 times that range beyond the box, and the dots are outliers beyond the whiskers.)

We can also observe that there are a lot more outliers among the women, which will tend to drag statistics like the mean for women upward; whether that's fine, or whether we should prefer the median instead, is something we could argue about.
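If you want the same comparison as numbers rather than boxes, a quick sketch:

df.groupby("Gender")["Sal95"].describe()   # count, mean, quartiles, etc. for each gender
df.groupby("Gender")["Sal95"].median()     # medians, which the outliers can't drag around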

5. Can we subdivide the dataset in some useful way to get more insight about the relationship between gender and salaries under different conditions? Come up with something that helps us learn more, and visualize it.

In [8]:
sns.boxplot(x=df["Dept"], y=df["Sal95"], hue=df["Gender"])
Out[8]:
[boxplot of Sal95 by Dept, with Gender as hue]
In [9]:
sns.boxplot(x=df["Rank"], y=df["Sal95"], hue=df["Gender"])
Out[9]:
[boxplot of Sal95 by Rank, with Gender as hue]

We could break up the data by department, by rank, and so on. There still appear to be gender differences within departments and within ranks, and the differences look most striking at the junior level, i.e., the assistant professor rank.

There are lots of other plausible ways to cut the data.
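A compact numeric version of the same cut, as a sketch (pivot_table defaults to the mean, so we pass the median explicitly):

df.pivot_table(values="Sal95", index="Rank", columns="Gender", aggfunc=np.median)
# swap in index="Dept" to see the department cut the same way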

6. How many people in the dataset make more than 450k a year? What else do we know about them?

In [10]:
rich = df[df["Sal95"] > 450000]
In [11]:
rich.describe()
Out[11]:
ID Dept Gender Clin Cert Prate Exper Rank Sal94 Sal95
count 3.000000 3.0 3.0 3.0 3.0 3.000000 3.000000 3.0 3.000000 3.000000
mean 241.333333 6.0 1.0 1.0 1.0 2.766667 21.333333 3.0 419741.333333 463466.333333
std 10.692677 0.0 0.0 0.0 0.0 0.404145 8.144528 0.0 8844.412153 8615.698134
min 232.000000 6.0 1.0 1.0 1.0 2.400000 12.000000 3.0 411219.000000 455468.000000
25% 235.500000 6.0 1.0 1.0 1.0 2.550000 18.500000 3.0 415174.000000 458905.000000
50% 239.000000 6.0 1.0 1.0 1.0 2.700000 25.000000 3.0 419129.000000 462342.000000
75% 246.000000 6.0 1.0 1.0 1.0 2.950000 26.000000 3.0 424002.500000 467465.500000
max 253.000000 6.0 1.0 1.0 1.0 3.200000 27.000000 3.0 428876.000000 472589.000000

There are only three of them. Judging by the fact that the minimum of the Gender column is 1, they're all men. They're also all in surgery (Dept 6) and all full professors (Rank 3), as one would expect.
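If reading that off the describe() table feels indirect, with only three rows we can just look at them (a sketch):

rich[["Gender", "Dept", "Rank", "Exper", "Sal94", "Sal95"]]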

7. How many standard deviations away from the mean is the most highly-paid person in the dataset? What about the lowest-paid person?

In [12]:
def standardize(column):
    # z-score each value: (value - mean) / standard deviation
    # note: np.std uses the population std (ddof=0), so describe() below,
    # which uses the sample std (ddof=1), reports a std slightly above 1 for std95
    std = np.std(column)
    mean = np.mean(column)
    return (column - mean) / std

df["std95"] = standardize(df["Sal95"])
df.describe()
Out[12]:
ID Dept Gender Clin Cert Prate Exper Rank Sal94 Sal95 std95
count 261.00000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000 261.000000
mean 131.00000 3.651341 0.593870 0.613027 0.720307 4.932184 10.229885 1.896552 153593.344828 168906.655172 0.000000
std 75.48841 1.796655 0.492053 0.487993 0.449711 1.944430 6.227176 0.864262 80469.666720 88778.425411 1.001921
min 1.00000 1.000000 0.000000 0.000000 0.000000 1.300000 1.000000 1.000000 34514.000000 38675.000000 -1.469747
25% 66.00000 2.000000 0.000000 0.000000 0.000000 3.200000 6.000000 1.000000 90771.000000 99972.000000 -0.777972
50% 131.00000 4.000000 1.000000 1.000000 1.000000 4.400000 9.000000 2.000000 133284.000000 148117.000000 -0.234625
75% 196.00000 5.000000 1.000000 1.000000 1.000000 6.900000 14.000000 3.000000 200543.000000 218955.000000 0.564828
max 261.00000 6.000000 1.000000 1.000000 1.000000 8.700000 37.000000 3.000000 428876.000000 472589.000000 3.427249

We get the information we want from the min and max of the last column of this table: the most highly-paid person is about 3.43 standard deviations above the mean, and the lowest-paid person is about 1.47 standard deviations below it.
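To pull those two numbers out directly, and to see who they belong to, a quick sketch:

print(df["std95"].max(), df["std95"].min())              # about +3.43 and -1.47
df.loc[[df["std95"].idxmax(), df["std95"].idxmin()]]     # the rows for those two people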

8. What's the biggest department? The smallest? Can you order them from largest to smallest?

9. What's the most highly paid department? What's the lowest paid department? What measure did you use to make that decision, and could a different measure have yielded different results?

In [13]:
departments = []
for dept in df["Dept"].unique():
    out = {}
    out["department"] = dept
    subset = df[df["Dept"] == dept]
    out["size"] = len(subset)
    out["median_salary"] = np.median(subset["Sal95"])
    out["mean_salary"] = np.mean(subset["Sal95"])
    departments.append(out)
print(departments)
[{'department': 1, 'size': 50, 'median_salary': 97897.5, 'mean_salary': 103547.48}, {'department': 2, 'size': 40, 'median_salary': 84343.5, 'mean_salary': 91218.1}, {'department': 3, 'size': 21, 'median_salary': 126118.0, 'mean_salary': 125873.0}, {'department': 4, 'size': 30, 'median_salary': 127002.0, 'mean_salary': 132093.16666666666}, {'department': 5, 'size': 80, 'median_salary': 187601.5, 'mean_salary': 198546.1375}, {'department': 6, 'size': 40, 'median_salary': 318747.0, 'mean_salary': 319218.0}]

We could sort them however we want from there. Mean and median are both plausible ways to think about the highest-paid departments, and in principle they could give different orderings; here, though, department 6 comes out on top and department 2 at the bottom by either measure.
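For the actual orderings, a sketch that sorts the list we just built (or, equivalently, turns it into a DataFrame):

sorted(departments, key=lambda d: d["size"], reverse=True)                  # departments, largest to smallest
pd.DataFrame(departments).sort_values("median_salary", ascending=False)     # highest- to lowest-paid by median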

I'll leave question 10 and the (unnumbered) question 11 off for now, as those require a bit more interpretation and creativity.
