Problem Set 2: Answers and Explanations

Problem 1: Fun with APIs, continued (30 points)

Remember problem 3 from the previous pset? I'd like you to go back and use the Caselaw Access Project API again, only this time I'd like you to plot the following two time series as lines on the same chart:

  • The total number of uses of the words "pork," "pig," "pigs," "hog," or "hogs" in the Iowa state courts within the CAP dataset, and

  • The total number of uses of the words "corn," "sweetcorn," "maize," or "ethanol" in the same courts.

In other words, I'd like you to produce a chart with two lines: one showing the number of uses of the pork words, the other showing the number of uses of the corn words. On the y axis is the number of uses; on the x axis is the year.

Hint: Check out the last example chart for the 'plottyprint' library, which I have released. You may feel free to use this library, or to borrow some of the code from it.

Answer

This is another use of the ngrams endpoint, but this time we have to combine data from several different words and then plot it. Let's start by getting the data and combining it. To do that, we'll write a couple of helper functions.

In [1]:
import requests

def extract_counts(api_result, word):  # the api results are messy, so I'd like to clean them up.
    # going to label them by word just to make sure that everything is nice and clear when we move on.
    return {"year": api_result["year"], "count": api_result["count"][0], "word": word}

def get_counts_by_year(word):
    endpoint = f'https://api.case.law/v1/ngrams/?q={word}&jurisdiction=iowa'
    result = requests.get(endpoint).json()['results'][word]['iowa']
    return [extract_counts(x, word) for x in result]

pork_words = ["pork", "pig", "pigs", "hog", "hogs"]
corn_words = ["corn", "maize", "ethanol"]

# I'm leaving off "sweetcorn" here because I happen to know, from having gotten 
# no results when I ran the query before, that it doesn't produce anything.

pork_results = []
corn_results = []

for word in pork_words:
    pork_results.append(get_counts_by_year(word))
    
for word in corn_words:
    corn_results.append(get_counts_by_year(word))

Ok, now we have two lists of lists of dicts, where each top-level list has one inner list for each word in its category, and then, in the inner lists, one dict for every year in which the word appears. Observe:

In [2]:
print(corn_results)
[[{'year': '1840', 'count': 1, 'word': 'corn'}, {'year': '1848', 'count': 1, 'word': 'corn'}, {'year': '1849', 'count': 3, 'word': 'corn'}, {'year': '1850', 'count': 7, 'word': 'corn'}, {'year': '1851', 'count': 5, 'word': 'corn'}, {'year': '1852', 'count': 4, 'word': 'corn'}, {'year': '1854', 'count': 1, 'word': 'corn'}, {'year': '1856', 'count': 38, 'word': 'corn'}, {'year': '1857', 'count': 6, 'word': 'corn'}, {'year': '1858', 'count': 1, 'word': 'corn'}, {'year': '1859', 'count': 6, 'word': 'corn'}, {'year': '1860', 'count': 1, 'word': 'corn'}, {'year': '1862', 'count': 6, 'word': 'corn'}, {'year': '1864', 'count': 26, 'word': 'corn'}, {'year': '1865', 'count': 11, 'word': 'corn'}, {'year': '1866', 'count': 38, 'word': 'corn'}, {'year': '1867', 'count': 34, 'word': 'corn'}, {'year': '1868', 'count': 6, 'word': 'corn'}, {'year': '1869', 'count': 10, 'word': 'corn'}, {'year': '1870', 'count': 41, 'word': 'corn'}, {'year': '1871', 'count': 13, 'word': 'corn'}, {'year': '1872', 'count': 25, 'word': 'corn'}, {'year': '1873', 'count': 10, 'word': 'corn'}, {'year': '1874', 'count': 55, 'word': 'corn'}, {'year': '1875', 'count': 36, 'word': 'corn'}, {'year': '1876', 'count': 24, 'word': 'corn'}, {'year': '1877', 'count': 5, 'word': 'corn'}, {'year': '1878', 'count': 27, 'word': 'corn'}, {'year': '1879', 'count': 52, 'word': 'corn'}, {'year': '1880', 'count': 49, 'word': 'corn'}, {'year': '1881', 'count': 58, 'word': 'corn'}, {'year': '1882', 'count': 108, 'word': 'corn'}, {'year': '1883', 'count': 146, 'word': 'corn'}, {'year': '1884', 'count': 45, 'word': 'corn'}, {'year': '1885', 'count': 53, 'word': 'corn'}, {'year': '1886', 'count': 51, 'word': 'corn'}, {'year': '1887', 'count': 55, 'word': 'corn'}, {'year': '1888', 'count': 105, 'word': 'corn'}, {'year': '1889', 'count': 49, 'word': 'corn'}, {'year': '1890', 'count': 29, 'word': 'corn'}, {'year': '1891', 'count': 21, 'word': 'corn'}, {'year': '1892', 'count': 37, 'word': 'corn'}, {'year': '1893', 'count': 71, 'word': 'corn'}, {'year': '1894', 'count': 66, 'word': 'corn'}, {'year': '1895', 'count': 33, 'word': 'corn'}, {'year': '1896', 'count': 69, 'word': 'corn'}, {'year': '1897', 'count': 91, 'word': 'corn'}, {'year': '1898', 'count': 46, 'word': 'corn'}, {'year': '1899', 'count': 70, 'word': 'corn'}, {'year': '1900', 'count': 19, 'word': 'corn'}, {'year': '1901', 'count': 36, 'word': 'corn'}, {'year': '1902', 'count': 44, 'word': 'corn'}, {'year': '1903', 'count': 61, 'word': 'corn'}, {'year': '1904', 'count': 53, 'word': 'corn'}, {'year': '1905', 'count': 11, 'word': 'corn'}, {'year': '1906', 'count': 27, 'word': 'corn'}, {'year': '1907', 'count': 20, 'word': 'corn'}, {'year': '1908', 'count': 34, 'word': 'corn'}, {'year': '1909', 'count': 31, 'word': 'corn'}, {'year': '1910', 'count': 44, 'word': 'corn'}, {'year': '1911', 'count': 50, 'word': 'corn'}, {'year': '1912', 'count': 41, 'word': 'corn'}, {'year': '1913', 'count': 45, 'word': 'corn'}, {'year': '1914', 'count': 124, 'word': 'corn'}, {'year': '1915', 'count': 77, 'word': 'corn'}, {'year': '1916', 'count': 190, 'word': 'corn'}, {'year': '1917', 'count': 87, 'word': 'corn'}, {'year': '1918', 'count': 57, 'word': 'corn'}, {'year': '1919', 'count': 253, 'word': 'corn'}, {'year': '1920', 'count': 76, 'word': 'corn'}, {'year': '1921', 'count': 182, 'word': 'corn'}, {'year': '1922', 'count': 114, 'word': 'corn'}, {'year': '1923', 'count': 73, 'word': 'corn'}, {'year': '1924', 'count': 99, 'word': 'corn'}, {'year': '1925', 'count': 49, 'word': 'corn'}, {'year': '1926', 'count': 96, 
'word': 'corn'}, {'year': '1927', 'count': 33, 'word': 'corn'}, {'year': '1928', 'count': 68, 'word': 'corn'}, {'year': '1929', 'count': 121, 'word': 'corn'}, {'year': '1930', 'count': 120, 'word': 'corn'}, {'year': '1931', 'count': 54, 'word': 'corn'}, {'year': '1932', 'count': 40, 'word': 'corn'}, {'year': '1933', 'count': 54, 'word': 'corn'}, {'year': '1934', 'count': 52, 'word': 'corn'}, {'year': '1935', 'count': 50, 'word': 'corn'}, {'year': '1936', 'count': 52, 'word': 'corn'}, {'year': '1937', 'count': 123, 'word': 'corn'}, {'year': '1938', 'count': 36, 'word': 'corn'}, {'year': '1939', 'count': 31, 'word': 'corn'}, {'year': '1940', 'count': 16, 'word': 'corn'}, {'year': '1941', 'count': 32, 'word': 'corn'}, {'year': '1942', 'count': 51, 'word': 'corn'}, {'year': '1943', 'count': 27, 'word': 'corn'}, {'year': '1944', 'count': 21, 'word': 'corn'}, {'year': '1945', 'count': 25, 'word': 'corn'}, {'year': '1946', 'count': 147, 'word': 'corn'}, {'year': '1947', 'count': 10, 'word': 'corn'}, {'year': '1948', 'count': 71, 'word': 'corn'}, {'year': '1949', 'count': 20, 'word': 'corn'}, {'year': '1950', 'count': 19, 'word': 'corn'}, {'year': '1951', 'count': 28, 'word': 'corn'}, {'year': '1952', 'count': 93, 'word': 'corn'}, {'year': '1953', 'count': 5, 'word': 'corn'}, {'year': '1954', 'count': 42, 'word': 'corn'}, {'year': '1955', 'count': 54, 'word': 'corn'}, {'year': '1956', 'count': 45, 'word': 'corn'}, {'year': '1957', 'count': 37, 'word': 'corn'}, {'year': '1958', 'count': 14, 'word': 'corn'}, {'year': '1959', 'count': 38, 'word': 'corn'}, {'year': '1960', 'count': 35, 'word': 'corn'}, {'year': '1961', 'count': 30, 'word': 'corn'}, {'year': '1962', 'count': 70, 'word': 'corn'}, {'year': '1963', 'count': 95, 'word': 'corn'}, {'year': '1964', 'count': 27, 'word': 'corn'}, {'year': '1965', 'count': 30, 'word': 'corn'}, {'year': '1966', 'count': 53, 'word': 'corn'}, {'year': '1967', 'count': 11, 'word': 'corn'}, {'year': '1968', 'count': 116, 'word': 'corn'}, {'year': '1969', 'count': 26, 'word': 'corn'}, {'year': '1970', 'count': 31, 'word': 'corn'}, {'year': '1971', 'count': 14, 'word': 'corn'}, {'year': '1972', 'count': 26, 'word': 'corn'}, {'year': '1973', 'count': 18, 'word': 'corn'}, {'year': '1974', 'count': 12, 'word': 'corn'}, {'year': '1975', 'count': 5, 'word': 'corn'}, {'year': '1976', 'count': 44, 'word': 'corn'}, {'year': '1977', 'count': 103, 'word': 'corn'}, {'year': '1978', 'count': 88, 'word': 'corn'}, {'year': '1979', 'count': 26, 'word': 'corn'}, {'year': '1980', 'count': 10, 'word': 'corn'}, {'year': '1981', 'count': 20, 'word': 'corn'}, {'year': '1982', 'count': 20, 'word': 'corn'}, {'year': '1983', 'count': 6, 'word': 'corn'}, {'year': '1984', 'count': 15, 'word': 'corn'}, {'year': '1985', 'count': 34, 'word': 'corn'}, {'year': '1986', 'count': 43, 'word': 'corn'}, {'year': '1987', 'count': 14, 'word': 'corn'}, {'year': '1988', 'count': 49, 'word': 'corn'}, {'year': '1989', 'count': 9, 'word': 'corn'}, {'year': '1990', 'count': 36, 'word': 'corn'}, {'year': '1991', 'count': 39, 'word': 'corn'}, {'year': '1992', 'count': 29, 'word': 'corn'}, {'year': '1993', 'count': 11, 'word': 'corn'}, {'year': '1994', 'count': 23, 'word': 'corn'}, {'year': '1995', 'count': 12, 'word': 'corn'}, {'year': '1996', 'count': 3, 'word': 'corn'}, {'year': '1997', 'count': 6, 'word': 'corn'}, {'year': '1998', 'count': 22, 'word': 'corn'}, {'year': '1999', 'count': 8, 'word': 'corn'}, {'year': '2000', 'count': 54, 'word': 'corn'}, {'year': '2001', 'count': 4, 'word': 'corn'}, {'year': 
'2002', 'count': 3, 'word': 'corn'}, {'year': '2004', 'count': 6, 'word': 'corn'}, {'year': '2006', 'count': 6, 'word': 'corn'}, {'year': '2008', 'count': 1, 'word': 'corn'}, {'year': '2009', 'count': 4, 'word': 'corn'}, {'year': '2010', 'count': 3, 'word': 'corn'}, {'year': '2012', 'count': 7, 'word': 'corn'}, {'year': '2013', 'count': 33, 'word': 'corn'}, {'year': '2014', 'count': 5, 'word': 'corn'}, {'year': '2015', 'count': 5, 'word': 'corn'}, {'year': '2016', 'count': 1, 'word': 'corn'}, {'year': '2017', 'count': 8, 'word': 'corn'}], [{'year': '1855', 'count': 1, 'word': 'maize'}, {'year': '1859', 'count': 3, 'word': 'maize'}, {'year': '1870', 'count': 1, 'word': 'maize'}, {'year': '1880', 'count': 1, 'word': 'maize'}, {'year': '1895', 'count': 3, 'word': 'maize'}, {'year': '1911', 'count': 1, 'word': 'maize'}, {'year': '1917', 'count': 3, 'word': 'maize'}, {'year': '1972', 'count': 1, 'word': 'maize'}, {'year': '1985', 'count': 1, 'word': 'maize'}, {'year': '2003', 'count': 1, 'word': 'maize'}, {'year': '2015', 'count': 1, 'word': 'maize'}], [{'year': '1988', 'count': 11, 'word': 'ethanol'}, {'year': '1995', 'count': 3, 'word': 'ethanol'}, {'year': '2000', 'count': 2, 'word': 'ethanol'}, {'year': '2002', 'count': 39, 'word': 'ethanol'}, {'year': '2008', 'count': 1, 'word': 'ethanol'}, {'year': '2009', 'count': 2, 'word': 'ethanol'}, {'year': '2010', 'count': 1, 'word': 'ethanol'}, {'year': '2011', 'count': 1, 'word': 'ethanol'}, {'year': '2012', 'count': 1, 'word': 'ethanol'}, {'year': '2015', 'count': 12, 'word': 'ethanol'}, {'year': '2017', 'count': 3, 'word': 'ethanol'}]]

We need to combine them, so that we have associated each year with the total number of words in its category. The easiest way to do that is probably to make one big dict for each group, with the key being the year and the value being the count. We can create that as follows:

In [3]:
def combine_inner_lists(biglist):
    out = {}
    for innerlist in biglist:
        for entry in innerlist:
            year = entry["year"]
            count = entry["count"]  # this is totally unnecessary and verbose, but makes it a little clearer
            current_count = out.get(year, 0)  # returns zero if there isn't anything for the year.
            out[year] = current_count + count
    return out

porkcounts = combine_inner_lists(pork_results)
corncounts = combine_inner_lists(corn_results)

Eyeballing the output printed above, it looks like 2015 had all three corn words: 5 uses of "corn," 1 use of "maize," and 12 uses of "ethanol," for a total of 18. So we can look at that year in our combined result to get a quick check on whether the code was correct.

In [4]:
print(corncounts["2015"])
18

Looks good to me! Now we have to graph them. We have effectively two choices: use my "plottyprint" library or use seaborn. I'll show you both.

In [5]:
import plottyprint

For plottyprint, we're going to need a list of datetime objects and a corresponding list of each of the counts for that year. This means a little bit more data manipulation, but that's ok.

In [6]:
from datetime import datetime
In [7]:
all_years = sorted(list(set(corncounts.keys()).union(set(porkcounts.keys()))))
print(all_years)
['1840', '1848', '1849', '1850', '1851', '1852', '1854', '1855', '1856', '1857', '1858', '1859', '1860', '1861', '1862', '1863', '1864', '1865', '1866', '1867', '1868', '1869', '1870', '1871', '1872', '1873', '1874', '1875', '1876', '1877', '1878', '1879', '1880', '1881', '1882', '1883', '1884', '1885', '1886', '1887', '1888', '1889', '1890', '1891', '1892', '1893', '1894', '1895', '1896', '1897', '1898', '1899', '1900', '1901', '1902', '1903', '1904', '1905', '1906', '1907', '1908', '1909', '1910', '1911', '1912', '1913', '1914', '1915', '1916', '1917', '1918', '1919', '1920', '1921', '1922', '1923', '1924', '1925', '1926', '1927', '1928', '1929', '1930', '1931', '1932', '1933', '1934', '1935', '1936', '1937', '1938', '1939', '1940', '1941', '1942', '1943', '1944', '1945', '1946', '1947', '1948', '1949', '1950', '1951', '1952', '1953', '1954', '1955', '1956', '1957', '1958', '1959', '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017']
In [8]:
# good examples on how to convert strings to dates: https://chrisalbon.com/python/basics/strings_to_datetime/ 
years_as_dt = [datetime.strptime(x, '%Y') for x in all_years]
print(years_as_dt[0:10])
[datetime.datetime(1840, 1, 1, 0, 0), datetime.datetime(1848, 1, 1, 0, 0), datetime.datetime(1849, 1, 1, 0, 0), datetime.datetime(1850, 1, 1, 0, 0), datetime.datetime(1851, 1, 1, 0, 0), datetime.datetime(1852, 1, 1, 0, 0), datetime.datetime(1854, 1, 1, 0, 0), datetime.datetime(1855, 1, 1, 0, 0), datetime.datetime(1856, 1, 1, 0, 0), datetime.datetime(1857, 1, 1, 0, 0)]
In [9]:
# now we will get the appropriate count from each category for each year, or 0 if nothing.

porknums = []
cornnums = []

for year in all_years:
    porknums.append(porkcounts.get(year, 0))
    cornnums.append(corncounts.get(year, 0))
In [10]:
# now that our data is in the right format, we can plot it!

plottyprint.timeseries(years_as_dt, [porknums, cornnums], ["pork", "corn"])
/opt/conda/lib/python3.7/site-packages/pandas/plotting/_matplotlib/converter.py:103: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.

To register the converters:
	>>> from pandas.plotting import register_matplotlib_converters
	>>> register_matplotlib_converters()
  warnings.warn(msg, FutureWarning)
Out[10]:

I'm not totally sure why the chart rendered twice; the way that matplotlib (the underlying plotting library) interacts with Jupyter notebooks isn't exactly transparent. But that doesn't really matter. The job is done!

It's also kind of interesting. It doesn't look like there are any particular trends with respect to pork and corn. I would have expected an increase over time in pork relative to corn, but apparently not! There's also something really weird going on around the Great Depression and the period of the two world wars...

Ok, now let's do seaborn. That'll look better anyway, since my plottyprint library is designed for greyscale for print, and with a dense chart like this greyscale isn't the best idea. It'll be easiest to use seaborn if we start with a Pandas DataFrame, so let's do that.

In [11]:
import seaborn as sns
import pandas as pd
In [12]:
df = pd.DataFrame({"date": years_as_dt, "pork": porknums, "corn": cornnums})
df = df.set_index(df.date)
df.head()
Out[12]:
date pork corn
date
1840-01-01 1840-01-01 1 1
1848-01-01 1848-01-01 0 1
1849-01-01 1849-01-01 4 3
1850-01-01 1850-01-01 19 7
1851-01-01 1851-01-01 0 5

Because this is actually a little difficult to get right in seaborn (one reason I wrote my own library for this), we'll build up to it. First we'll make just a pair of single-line plots.

In [13]:
sns.lineplot(data=df, x="date", y="pork")
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3a9eea6450>
In [14]:
sns.lineplot(data=df, x="date", y="corn")
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3a9edcae90>

Ok, now the annoying thing. To get a multiple-line lineplot, you have to change your data from "wide" to "long" format using the pandas "melt" function. This is a dumb design decision on seaborn's part, but so it goes. Let's look at what long format looks like first.

In [15]:
df_long = pd.melt(df, ['date'])
df_long.head()
Out[15]:
date variable value
0 1840-01-01 pork 1
1 1848-01-01 pork 0
2 1849-01-01 pork 4
3 1850-01-01 pork 19
4 1851-01-01 pork 0
In [16]:
print(len(df))
print(len(df_long))
170
340

We see that we have doubled the length of our dataset, because each year now has two rows: one row for the pork value, and one row for the corn value. With that in hand, we can make seaborn do what we want it to:

In [17]:
sns.lineplot(data=df_long, x="date", y="value", hue="variable")
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3a9d51a950>

And we're done!

Problem 2: The Inexpert Witness (70 points)

Harriet Hawkeye, the owner of a sports fan apparel store in Iowa City, is being prosecuted by the State of Iowa for sales tax fraud. The key piece of evidence is that the sales figures that Harriet reported to the state did not reliably increase on home football game days. There were 5 home games that year (note to students: I don't actually know or care how many football games there are, don't fight the hypo), and Harriet's sales didn't increase beyond her average range on any of them.

The prosecution calls Dr. Carl Cyclone, a professor of Economics at Iowa State, to testify, and Carl testifies as follows:

  • Based on his research, the average sales of sports fan apparel in Iowa increased substantially in almost every store in the state on home game days. Of 1000 store-homegame pairs in the state in the last year other than at Harriet's store (i.e., 1000 is the product of the number of stores and the number of home games per store, so 3 stores that each had 3 home games in their town would be 9 store-homegame pairs), 800 of them had a substantial increase in sales over the store in question's daily average.

  • Therefore, there's an 80% chance that sales on any given day will increase substantially if there's a home game.

  • The probability of Harriet not experiencing an increase in sales for all five days is thus $0.2^5 = 0.00032$, which is so vanishingly small that, in his expert opinion, Harriet must have actually had more sales than she reported on at least one home game day.

You represent Harriet.

Subquestion 1 (20 points): Describe in words what problematic mistake or assumption Carl makes.

Subquestion 2 (20 points): What kind of evidence could you look for to undermine Carl's argument?

Answer to subquestions 1 and 2

There are a few things that could potentially be wrong here. The key is to think about the assumptions Carl is making about the world and then translate them into math. Probably the low-hanging-fruit answer here is that Carl assumes each day of sales is independent of the others. But that might not be true. There might be some way that Harriet's sales on early days affect her sales on later days (for example, if she uses past sales to project future sales in order to decide how many hours she should be open).

Or, for an even simpler kind of assumption Carl makes, one that doesn't even particularly rely on talk about independence (though it could be translated into those terms): he assumes that there isn't anything about Harriet's store that makes it particularly less likely to see an increase on home game days. For example, suppose her store is really far from the stadium, and suppose that distance to the stadium is strongly associated with the size of the game-day increase in sales. Carl doesn't look at that factor, but as a lawyer, it's something you should think about. This is one of those pitfalls of observational data that we've been talking about (and which we'll learn to manage when we look at linear regression): there's always something that might screw up our observed association between cause (here, having a home game) and effect (here, an increase in sales).
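To make that back-and-forth concrete, here's a quick back-of-the-envelope calculation. The alternative probabilities are made up purely for illustration; the point is just to see how fast Carl's number moves if his 80% assumption doesn't hold for Harriet's store.

In [ ]:
# Illustrative only: Carl assumes Harriet's store has the same 80% chance of a
# game-day increase as the statewide average. If her true chance is lower (say,
# because of distance from the stadium), the probability of seeing no increase
# on any of the 5 home game days stops being "vanishingly small."
for p_increase in [0.8, 0.6, 0.4, 0.2]:
    p_no_increase_all_five = (1 - p_increase) ** 5
    print(f"P(game-day increase) = {p_increase:.1f}  ->  "
          f"P(no increase on all 5 game days) = {p_no_increase_all_five:.5f}")

Even at a 40% chance per game day, the probability of seeing no increase on all five is about 0.078, which is hardly the sort of number that proves fraud.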

Again, there are any number of things you could say here, and I'll accept lots of different answers, but the key is to be able to move back and forth between the math and the world.

For subquestion 2, then, it'll depend on what you have to say. If you talk about independence, you'd want to develop some kind of testimony from Harriet about what affects her sales, as well as what affects her decisions about things such as hours, prices, etc. If you talk about location or anything like that, you'd want your own expert to dig into Carl's data and figure out whether his results change if, for example, we limit the analysis to stores within a certain distance of the stadium. And so on, and so forth. Just think like a litigator.

Subquestion 3 (20 points): Write a simulation to illustrate the effects of correcting the problem that you identified in subquestion 1. I know that this subquestion is a little ambiguous and open-ended, so here's all I can really say as a hint without giving away the answer: there is at least one, and possibly several, parameter(s) that you could plausibly vary, which would change the ultimate probability calculation that Carl made. Your simulation should vary that parameter (or those parameters) and see how the outcomes change.

Subquestion 4 (10 points): Use an appropriate data visualization to illustrate the simulation you wrote in the previous subquestion.

Answer to subquestions 3 and 4.

Again, there are a lot of ways you could do this. I just want you to practice translating assumptions about the world/about math into code. Here, as a demonstration, I'll bang out a quick simulation of the location example. Yours doesn't need to be this complicated.

We're also going to do this in aggressively object-oriented fashion: I'm going to create a shopper class, a store class, etc. etc. Again, this is VERY heavy on the overkill; you don't need to go this hard.

We will also just look at the outcome in the form of a visualization, rather than wasting time trying to interpret our results in numbers.

In [18]:
import random
class Store(object):
    def __init__(self, distance = None):
        if distance: 
            self.distance = distance  # distance from stadium.  we probably won't need to fix this, but just in case
        else:
            self.distance = random.randint(2, 9)  # larger distances are further
        self.sales = 0
        self.game_day_sales = 0
        self.non_game_sales = 0
        
    def sell(self, game_day):  # register a sale.
        self.sales += 1
        if game_day:
            self.game_day_sales += 1
        else:
            self.non_game_sales += 1
            
    def report(self):
        return {"total_sales": self.sales, 
                "game_day_sales": self.game_day_sales, 
                "non_game_sales": self.non_game_sales, 
                "distance": self.distance}
In [19]:
class Shopper(object):
    def __init__(self, energy = None):
        if energy:
            self.energy = energy  # extent to which shopper is willing to shop further from current location
        else:
            self.energy = random.randint(10, 20)
            
    def shop(self, list_of_stores, game_day = False):
        if game_day:  # give every store a random score weighted to prefer closer stores
            best_store = {"store": list_of_stores[0], "score": 0}
            for current_store in list_of_stores:
                current_score = random.randint(0, self.energy - current_store.distance)
                if current_score > best_store["score"]:
                    best_store = {"store": current_store, "score": current_score}
            best_store["store"].sell(game_day)
        else:
            store = random.choice(list_of_stores)
            store.sell(game_day)
        
In [20]:
class Simulation(object):
    def __init__(self, days = 100, stores = 100, shoppers = 1000, games = 20, fixed_energy = None):
        self.stores = [Store() for x in range(stores)]
        self.games = games
        self.non_games = days - games
        if fixed_energy:
            self.shoppers = [Shopper(fixed_energy) for x in range(shoppers)]
        else:
            self.shoppers = [Shopper() for x in range(shoppers)]
            
    def round(self, game_day=False):
        for shopper in self.shoppers:
            shopper.shop(self.stores, game_day)
            
    def run(self):
        for x in range(self.games):
            self.round(game_day = True)
        for x in range(self.non_games):
            self.round()
        return pd.DataFrame([store.report() for store in self.stores])
In [21]:
simple_simulation = Simulation()
In [22]:
simple_output = simple_simulation.run()
In [23]:
simple_output.head()
Out[23]:
total_sales game_day_sales non_game_sales distance
0 816 0 816 8
1 803 0 803 9
2 807 0 807 8
3 748 0 748 9
4 795 0 795 7
In [24]:
sns.boxplot(x=simple_output["distance"], y=simple_output["game_day_sales"])
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3aa1246190>

Unsurprisingly, because of the way we wrote the simulation, we see that the game day sales crater once you get too far from the stadium. But writing the simulation also allows us to think about further assumptions that we're making, and which could be tested by evidence. For example: even if we do have evidence that Harriet's store is unusually far from the stadium, do we have evidence about how sensitive people are to stadium distance in Harriet's town? We can tweak that using the fixed_energy argument we built into our Simulation initializer.
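To give a flavor of what that looks like, here's a quick sketch that reruns the simulation with every shopper's energy pinned. The values 10 and 30 are arbitrary choices for illustration (anything smaller than the maximum store distance of 9 would break the randint call in Shopper.shop); I haven't tried to tune them to anything realistic.

In [ ]:
# A sketch: pin every shopper's energy to see how much the distance effect
# depends on that assumption. 10 and 30 are arbitrary illustrative values.
import matplotlib.pyplot as plt

low_energy_output = Simulation(fixed_energy=10).run()   # shoppers very sensitive to distance
high_energy_output = Simulation(fixed_energy=30).run()  # shoppers much less sensitive

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
sns.boxplot(x=low_energy_output["distance"], y=low_energy_output["game_day_sales"], ax=axes[0])
axes[0].set_title("fixed_energy = 10")
sns.boxplot(x=high_energy_output["distance"], y=high_energy_output["game_day_sales"], ax=axes[1])
axes[1].set_title("fixed_energy = 30")

If the far-away stores still see a healthy game-day bump when shoppers have plenty of energy, then distance alone does less work for Harriet, and we'd want evidence about which of those worlds her town actually looks like.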

We could also change the code to let us tweak any number of the parameters of our model in order to see how the outcomes change. And we could write code to parse our simulated dataset, or change what gets reported out, to more directly represent Carl Cyclone's claim about the unlikelihood of a store having low sales on each of five game days, for example by having each store report daily sales rather than just total sales. Again, this is just one of the many, many ways you might answer this problem. I won't take the time to write all of that code here (this sample answer is already very long), but here's a rough sketch of what the Store-side change might look like.
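A couple of caveats on the sketch: it isn't wired into the Simulation class (Simulation.round and run would also need to pass a day number through to Shopper.shop and the store), and the DailySalesStore name and sell_on_day method are just made up for illustration.

In [ ]:
# Rough sketch: a Store variant that records per-day sales, so we could later
# ask, store by store, whether sales on each individual game day beat that
# store's non-game-day average (which is what Carl's 0.2^5 claim is really about).
from collections import defaultdict

class DailySalesStore(Store):
    def __init__(self, distance=None):
        super().__init__(distance)
        self.daily_sales = defaultdict(int)  # day number -> sales on that day
        self.game_day_numbers = set()        # which day numbers were game days

    def sell_on_day(self, day, game_day):
        self.sell(game_day)                  # keep the running totals from Store
        self.daily_sales[day] += 1
        if game_day:
            self.game_day_numbers.add(day)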

