Here are some example ways to respond to the prompts in our data scavenger hunt.
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import seaborn as sns
import numpy as np
%matplotlib inline
I'm going to hide the password for the class dataset. The data ended up in a file called "Lawsuit.csv", so we'll grab that.
import os
with open(os.path.expanduser("~/gobbledygook_password.txt")) as gp:
    secret = gp.read().strip()
df = pd.read_csv("https://gobbledygook.herokuapp.com/data?file=Lawsuit.csv&password={}".format(secret))
df.head()
1. How many observations are there in the dataset?
df.describe()
Looks like 261 to me!
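One caveat: the "count" row from `describe()` only counts non-missing values in each numeric column, so it matches the number of observations only when nothing is missing. A more direct check is `len()` or `.shape`. Here's a minimal sketch on a toy frame (the values are made up, not the class data):

```python
import pandas as pd

# Toy stand-in for the class dataset; these values are made up for illustration.
toy = pd.DataFrame({
    "Gender": [0, 1, 1, 0],
    "Sal94": [100000, 150000, None, 90000],
})

n_rows = len(toy)                    # number of observations (rows)
n_rows_alt, n_cols = toy.shape       # same row count, plus the number of columns
non_missing = toy["Sal94"].count()   # what describe() reports as "count"

print(n_rows, n_cols, non_missing)
```

On the real data, `len(df)` would do the same job.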
2. What can you say about the distribution of salaries? Does it look like the standard bell-shaped curve that we all know and love? If not, why not?
sns.distplot(df["Sal94"])
sns.distplot(df["Sal95"])
It looks pretty right-skewed. It's not really a normal distribution, but it's not so extreme as to be something like an exponential distribution either, as we'd expect given the restricted range of salaries.
3. What are the mean, median, and mode of salaries? How do they change from year to year?
We've already seen the first two of those in the describe() output (the median is the 50% row). The mode is a trick question: the salaries are granular enough that they're essentially unique, so there isn't really a mode. But we could use our histograms with different bin sizes to get a sense of where they clump.
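If you'd rather compute these directly than read them off a table, something like the following would work. This is a sketch on made-up numbers; only the column names `Sal94` and `Sal95` come from the dataset, and the bin edges are arbitrary.

```python
import pandas as pd

# Made-up salaries for illustration; only the column names match the class data.
toy = pd.DataFrame({
    "Sal94": [100000, 120000, 150000, 400000],
    "Sal95": [110000, 130000, 165000, 430000],
})

# Mean and median for each year, side by side.
summary = toy[["Sal94", "Sal95"]].agg(["mean", "median"])

# Year-over-year change in each statistic.
change = summary["Sal95"] - summary["Sal94"]

# When every value is unique, mode() just returns every value,
# which is why it isn't informative here.
modes = toy["Sal95"].mode()

# Binning the values shows where they clump instead.
binned = pd.cut(toy["Sal95"], bins=[0, 200000, 400000, 600000])
counts = binned.value_counts().sort_index()

print(summary)
print(change)
print(counts)
```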
4. What data visualization would you use to get a look at the relationship between gender and salaries? What do we learn from that visualization?
sns.boxplot(x=df["Gender"], y=df["Sal95"])
A boxplot is always a good place to start.
What we can immediately observe is that there's a gender difference, but that the gender difference isn't necessarily huge. The highlighted boxes represent the interquartile range, the 25th to 75th percentiles (seaborn's default), and there's a good amount of overlap. (The whiskers extend to the most extreme points within 1.5 times that range of the box edges, and the dots are outliers beyond that.)
We can also observe that there are a lot more outliers among the women, which are likely to drag things like the mean for women up; whether this is OK (or whether we should prefer the median) is something we could argue about.
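To make the boxplot's outlier rule concrete, here's a small sketch that computes the quartiles, the 1.5 × IQR fences, and the points that would be drawn as dots. The numbers are made up (think salaries in thousands), and this mirrors the default rule rather than anything specific to our dataset.

```python
import numpy as np

# Made-up values for illustration (e.g. salaries in thousands).
values = np.array([90, 100, 110, 120, 130, 140, 150, 400], dtype=float)

# The box spans the 25th to 75th percentiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# One high outlier pulls the mean well above the median.
print(q1, q3, iqr, outliers)
print(values.mean(), np.median(values))
```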
5. Can we subdivide the dataset in some useful way to get more insight about the relationship between gender and salaries under different conditions? Come up with something that helps us learn more, and visualize it.
sns.boxplot(x=df["Dept"], y=df["Sal95"], hue=df["Gender"])
sns.boxplot(x=df["Rank"], y=df["Sal95"], hue=df["Gender"])
We could break up the data by department, by rank, etc. It looks like there are still gender differences across departments and ranks, although it looks like the differences are most striking at the junior level, assistant professor rank.
There are lots of other plausible cuts you could make at the data.
6. How many people in the dataset make more than 450k a year? What else do we know about them?
rich = df[df["Sal95"] > 450000]
rich.describe()
There are only three of them. Judging by the fact that the minimum of the gender column is 1, they're all men. They're all also in surgery, and they're all full professors, as one would expect.
7. How many standard deviations away from the mean is the most highly paid person in the dataset? What about the lowest-paid person?
def standardize(column):
    std = np.std(column)
    mean = np.mean(column)
    return (column - mean) / std
df["std95"] = standardize(df["Sal95"])
df.describe()
We get the information we want from the min and max of the last column of this table.
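You can also pull the extremes out directly instead of reading them off the table. A sketch on made-up numbers is below. One thing worth knowing: NumPy's `np.std` on an array defaults to the population standard deviation (`ddof=0`), while pandas' `Series.std()` defaults to the sample version (`ddof=1`), so the z-scores differ slightly depending on which you pick.

```python
import numpy as np
import pandas as pd

# Made-up salaries for illustration.
sal = pd.Series([100000.0, 120000.0, 150000.0, 400000.0])

# Population standard deviation (ddof=0), NumPy's default for arrays.
pop_std = np.std(sal.to_numpy())
# Sample standard deviation (ddof=1), pandas' default.
sample_std = sal.std()

# Z-scores using the population standard deviation.
z = (sal - sal.mean()) / pop_std

highest_z = z.max()       # how far above the mean the top earner is
lowest_z = z.min()        # how far below the mean the lowest earner is
who_highest = z.idxmax()  # index label of the top earner

print(highest_z, lowest_z, who_highest)
```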
8. What's the biggest department? The smallest? Can you order them from largest to smallest?
9. What's the most highly paid department? What's the lowest paid department? What measure did you use to make that decision, and could a different measure have yielded different results?
departments = []
for dept in df["Dept"].unique():
    out = {}
    out["department"] = dept
    subset = df[df["Dept"] == dept]
    out["size"] = len(subset)
    out["median_salary"] = np.median(subset["Sal95"])
    out["mean_salary"] = np.mean(subset["Sal95"])
    departments.append(out)
print(departments)
We could sort them however we want from there. Mean and median are both plausible ways to think about the highest paid departments.
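For the sorting step, one option is to let pandas do the grouping and ordering. This is a sketch with made-up data, and the department codes are hypothetical. It also shows how the "highest paid" department can change depending on whether you rank by mean or median.

```python
import pandas as pd

# Made-up data for illustration; department codes are hypothetical.
toy = pd.DataFrame({
    "Dept": [1, 1, 1, 2, 2, 3],
    "Sal95": [100000, 110000, 500000, 200000, 220000, 150000],
})

# One row per department with its size, median, and mean salary.
by_dept = (
    toy.groupby("Dept")["Sal95"]
    .agg(size="size", median_salary="median", mean_salary="mean")
    .reset_index()
)

# Order the departments by whichever measure we care about.
by_size = by_dept.sort_values("size", ascending=False)
by_median = by_dept.sort_values("median_salary", ascending=False)
by_mean = by_dept.sort_values("mean_salary", ascending=False)

print(by_size)
print(by_median)
print(by_mean)
```

In this toy example, one very high salary in department 1 puts it on top by mean, while department 2 comes out on top by median.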
I'll leave questions 10 and (unnumbered) 11 off for now, as they require a bit more interpretation and creativity.