1. # Introduction to Distributions

## What's a Distribution, Anyway?

Statistics people often talk about distributions, like a normal distribution. Here's what they mean: suppose you could see all of the instances of the thing you're trying to study. What kind of pattern would their values have? That's the distribution.

For example, suppose you expect that most of the values of the thing you care about will be clustered around some average value. IQ is a good example: most IQs in the population are around 100, and the further a range of values lies from 100 in either direction, the smaller the fraction of the population that falls in it. There are lots of folks with an IQ between 85 and 115; fewer between 70 and 85 or between 115 and 130; many fewer between 55 and 70 or 130 and 145; and a (proportionally) truly tiny number between 40 and 55 or 145 and 160.
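Those shrinking bands can be computed directly. Here's a small sketch, assuming IQ follows a normal distribution with mean 100 and standard deviation 15 (the conventional scaling), using only the standard library's error function:

```python
import math

def normal_cdf(x, mu=100.0, sigma=15.0):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Fraction of the population falling in each IQ band:
for lo, hi in [(85, 115), (70, 130), (55, 145)]:
    frac = normal_cdf(hi) - normal_cdf(lo)
    print(f"IQ {lo}-{hi}: {frac:.1%}")
```

The three fractions are roughly 68%, 95%, and 99.7% — the familiar "one, two, three standard deviations" rule for normal distributions.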

2. # The Normal Distribution and the Central Limit Theorem

The main reason scientists like the normal distribution so much is a pair of little ideas called the law of large numbers (LLN) and the central limit theorem (CLT).

I'm not going to walk you through proofs of these; instead, we'll just look at some graphs and talk about some intuition.
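In that spirit, here's a quick simulation you can run yourself. We draw repeatedly from an exponential distribution (deliberately chosen because single draws are badly skewed, nothing like a bell curve) and look at the *means* of many samples. The LLN says those means should hover near the population mean of 1, and the CLT says they should pile up in a roughly normal pattern with standard deviation shrinking like one over the square root of the sample size:

```python
import random
import statistics

random.seed(0)

# 2000 samples of size 50 from a skewed distribution with mean 1.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(50))
    for _ in range(2000)
]

print(statistics.mean(sample_means))   # close to 1, the population mean (LLN)
print(statistics.stdev(sample_means))  # close to 1/sqrt(50) ≈ 0.141 (CLT)
```

If you histogram `sample_means`, you'll see a symmetric bell shape, even though a histogram of the raw draws would be heavily skewed.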

3. # When Regressions Attack

This lesson is all about what can go wrong in linear regression. Here's an outline of the ways things can go wrong.

• data isn't linear
• extreme outliers
• multicollinearity
• conditioning on a collider
• confounder bias
• non-normal residuals
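One of these failure modes — extreme outliers — is easy to demonstrate in a few lines. Here's a sketch using the closed-form least-squares slope in plain Python: ten points on a perfect line, then the same data with one wild value swapped in.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

xs = list(range(10))
ys = list(range(10))          # a perfect y = x relationship
print(ols_slope(xs, ys))      # slope 1.0

ys[9] = 100                   # one wild outlier...
print(ols_slope(xs, ys))      # ...and the slope balloons to about 5.96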

There's also a problem known as "autocorrelation" which mainly appears in time series data (i.e., when one tries to run a regression on something that changes over time, like stock market prices). Time series analysis is a fairly advanced topic that is beyond the scope of this course, but you should have alarm bells ringing if anyone tries to do ordinary linear regression on data that spans time like that.

4. # The Basics of Probability

## What is Probability?

Probability is the mathematical representation of the likelihood of an event under a given set of circumstances (conditions) in a given period of time. We will say, for example, that the probability of winning the jackpot in the lottery from buying one ticket this week is some …

5. # Hypothesis Testing: Conceptual Introduction (draft)

Now that we understand distributions and the central limit theorem, we’re in a good position to make sense of the notion of a hypothesis test. It’s actually very simple.

Suppose you do an experiment. Let’s say you want to find out whether a company is engaging in …

6. # P-Values and Bayes Rule

Recall from the previous lesson what a p-value is: it’s the probability of observing a value of your statistic as extreme (as far away from the null hypothesis statistic) as you in fact observed, if the null hypothesis were true.

In other words, if you’re doing a (two-sided …
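The two-sided calculation can be sketched in a few lines, assuming the test statistic follows a standard normal distribution under the null (a z test, in other words):

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null:
    the probability of a value at least this far from zero, in either
    direction, if the null hypothesis were true."""
    tail = 1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * tail

print(two_sided_p(1.96))  # ≈ 0.05, the conventional significance cutoff
```

Note the factor of two: a two-sided test counts extreme values in both tails, which is why the cutoff for significance at the 0.05 level sits at about 1.96 standard errors rather than 1.64.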

7. # Common Data Transformations

It's often useful in performing data analysis to transform some of your variables to fit a common scale; this is especially useful in exploratory data analysis, because these transformations often make it much easier to eyeball the relationship between variables. (Also, some statistical techniques require these transformations.)

In this short lesson, we'll introduce two common methods of transforming data---the log transform …
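To preview the log transform: when values span several orders of magnitude, taking logs turns "times ten" steps into "plus one" steps, which makes the spread far easier to eyeball. A minimal sketch:

```python
import math

# Revenue-like values spanning several orders of magnitude:
values = [3, 30, 300, 3000, 30000]

# A base-10 log transform puts them on a common scale, where each
# step up means "times ten" rather than "plus some huge amount".
logged = [math.log10(v) for v in values]
print(logged)  # evenly spaced, each value ~1 apart
```

On the original scale these points would crowd together near zero with one far-flung value; on the log scale they're evenly spaced.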

8. # Introduction to Linear Regression

The standard technique for measuring the relationship between one or more continuous independent variables and a continuous dependent variable is linear regression.

The basic idea of linear regression can be expressed simply. A linear regression is a line (or its higher-dimensional analogue, a plane or hyperplane) that maps the independent variables to the best predicted value for the dependent variable.
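Here's a minimal sketch of that idea for the one-variable case, using the closed-form least-squares formulas in plain Python (in practice you'd use a library, but the arithmetic is this simple):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Data generated from y = 2x + 1, so the fit should recover those numbers:
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)            # 2.0 1.0
print(slope * 10 + intercept)      # predicted y at x = 10: 21.0
```

The fitted line is then the machine that turns a new value of the independent variable into a predicted value of the dependent one.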

9. # Key Python Libraries for Working with Data

In this lesson I'm just going to describe the main libraries that we'll see when we work with data in Python.

## Numpy

Numpy is the first library we work with. By convention, it's imported with `import numpy as np`. Numpy really provides two things to our workflow:

1. Math that goes faster than unadorned Python could do it---which is important when you're doing statistics, because under the hood computational stats can take a lot of calculations.
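A quick taste of what that speed looks like in practice: numpy arrays support whole-array arithmetic that runs in compiled code, replacing the element-by-element loop you'd otherwise write in Python.

```python
import numpy as np

# Whole-array arithmetic, one operation over every element at once:
xs = np.arange(5)
print(xs * 2)           # [0 2 4 6 8]
print(xs ** 2)          # [0 1 4 9 16]
print((xs ** 2).sum())  # 30

# The same work in plain Python needs an explicit loop:
print([x * 2 for x in range(5)])  # [0, 2, 4, 6, 8]
```

On toy data like this the difference is invisible, but over millions of elements the vectorized versions are dramatically faster.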

10. # Practical Basic Hypothesis Tests

In this lesson, we're going to very quickly rip through the basic hypothesis tests, their uses, and how to achieve them in Python. I won't spend a lot of time on this, because the mathematical details are covered in the assigned reading, and, at any rate, I think for practical purposes regression analysis is more important for lawyers. Also, this is basically AP/undergrad stats material, so you've probably seen it somewhere already.
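As a preview of how simple the arithmetic behind these tests is, here's a sketch of a one-sample t statistic in plain Python (in practice you'd reach for a library function such as `scipy.stats.ttest_1samp`, which also hands you the p-value):

```python
import math
import statistics

def one_sample_t(data, mu0):
    """t statistic for testing whether the mean of `data` differs from mu0."""
    n = len(data)
    se = statistics.stdev(data) / math.sqrt(n)  # standard error of the mean
    return (statistics.mean(data) - mu0) / se

# Did these five measurements come from a population with mean 5?
t = one_sample_t([5, 6, 7, 8, 9], mu0=5)
print(t)  # ≈ 2.83: the sample mean sits 2.83 standard errors above 5
```

The statistic is just "how many standard errors is the sample mean from the hypothesized mean"; the rest of the test is looking that number up in the right t distribution.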

12. # Confidence Intervals and Bayesian Statistics, oh my!

One of the readings for week 13, "The Bayesian New Statistics," covers a variety of different approaches to statistics, as contrasted with the standard frequentist hypothesis-testing method. I don't expect you to come out of this class being able to work in any of those alternative paradigms, but you should be able to recognize them and understand broadly how they operate. That article is a very good summary of the landscape, but this supplemental lesson aims to provide a briefer and slightly more basic introduction.