Statistics people often talk about distributions, like a normal distribution. Here's what they mean: suppose you could see all of the instances of the thing you're trying to study. What kind of pattern would their values have? That's the distribution.

For example, suppose you expect that most of the values of the thing you care about will be clustered around some average value. IQ is a good example: most IQs in the population are around 100, and as values get further from 100 in either direction, the fraction of the population in that range gets smaller. There are lots of folks with an IQ between 85 and 115, fewer between 70 and 85 on one side and between 115 and 130 on the other, many fewer between 55 and 70 or between 130 and 145, and a (proportionally) truly tiny number between 40 and 55 or between 145 and 160.
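If you want to check those proportions yourself, here's a minimal sketch using scipy.stats, assuming the conventional model of IQ as normal with mean 100 and standard deviation 15:

```python
# Fraction of the population in each IQ band under a normal(100, 15) model.
from scipy.stats import norm

iq = norm(loc=100, scale=15)

for lo, hi in [(85, 115), (70, 85), (55, 70), (40, 55)]:
    # probability mass between lo and hi; bands above 100 mirror these by symmetry
    frac = iq.cdf(hi) - iq.cdf(lo)
    print(f"{lo}-{hi}: {frac:.1%}")
```

Running it reproduces the familiar pattern: about 68% of people fall within one standard deviation of the mean, and each further band holds a much smaller slice.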

Scientists like the normal distribution so much mainly because of two related ideas: the law of large numbers (LLN) and the central limit theorem (CLT).

I'm not going to walk you through proofs of these; instead, we'll just look at some graphs and talk about some intuition.
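In that spirit, here's a small simulation you can run (no proofs, just simulated draws): as the sample size n grows, sample means from a decidedly non-normal distribution cluster ever more tightly around the true mean (the LLN), and their spread shrinks like 1/sqrt(n), with a histogram that looks increasingly bell-shaped (the CLT):

```python
# Watching the LLN and CLT at work on a skewed (exponential) distribution.
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1000]:
    # 2,000 experiments, each taking the mean of n exponential(mean=1) draws
    means = rng.exponential(scale=1.0, size=(2000, n)).mean(axis=1)
    print(f"n={n:>5}: mean of sample means = {means.mean():.3f}, "
          f"SD of sample means = {means.std():.3f}, 1/sqrt(n) = {n ** -0.5:.3f}")
```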

This lesson is all about what can go wrong in linear regression. Here's an outline of the failure modes (a Python sketch for screening a couple of them follows below):

data isn't linear

extreme outliers

heteroskedasticity

multicollinearity

conditioning on a collider

confounder bias

non-normal residuals

There's also a problem known as "autocorrelation" which mainly appears in time series data (i.e., when one tries to run a regression on something that changes over time, like stock market prices). Time series analysis is a fairly advanced topic that is beyond the scope of this course, but you should have alarm bells ringing if anyone tries to do ordinary linear regression on data that spans time like that.
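A couple of the problems above can at least be screened for mechanically. Here's a minimal sketch on simulated data using statsmodels; the variables x1, x2, and y are invented for illustration:

```python
# Screening for heteroskedasticity and multicollinearity on simulated data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)     # nearly collinear with x1
y = 2 * x1 + rng.normal(scale=np.exp(x1), size=200)  # noise grows with x1

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# Heteroskedasticity: Breusch-Pagan test on the residuals (small p = trouble)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")

# Multicollinearity: variance inflation factors (VIF above ~10 is a red flag)
for i, name in [(1, "x1"), (2, "x2")]:
    print(f"VIF({name}) = {variance_inflation_factor(X, i):.1f}")
```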

Probability is the mathematical representation of the likelihood of an event under a given set of circumstances (conditions) in a given period of time. We will say, for example, that the probability of winning the jackpot in the lottery from buying one ticket this week is some …

Now that we understand distributions and the central limit theorem, we’re in a good position to make sense of the notion of a hypothesis test. It’s actually very simple.

Suppose you do an experiment. Let’s say you want to find out whether a company is engaging in …

Recall from the previous lesson what a p-value is: it's the probability of observing a value of your statistic at least as extreme as (that is, at least as far from the value the null hypothesis implies as) the one you in fact observed, if the null hypothesis were true.
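To make that concrete, here's a sketch that computes a p-value with a one-sample t-test; every number here is made up for illustration:

```python
# Null hypothesis: the true mean is 100. The data are actually centered at 104.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=104, scale=15, size=50)

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# p_value is the probability of a sample mean at least this far from 100,
# in either direction, if the null hypothesis were true.
```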

It's often useful in performing data analysis to transform some of your variables to fit a common scale; this is especially useful in exploratory data analysis, because these transformations often make it much easier to eyeball the relationship between variables. (Also, some statistical techniques require these transformations.)

In this short lesson, we'll introduce two common methods of transforming data---the log transform …
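As a taste of the first of these, here's a minimal sketch of the log transform on simulated right-skewed data (incomes are the classic case); note how the mean and median, far apart on the raw scale, nearly coincide after the transform:

```python
# The log transform on simulated right-skewed data.
import numpy as np

rng = np.random.default_rng(2)
incomes = rng.lognormal(mean=10, sigma=1, size=1000)  # heavily right-skewed

log_incomes = np.log(incomes)  # roughly symmetric after the transform
# (np.log1p, i.e. log(1 + x), is a common variant when zeros can occur)
print(f"raw: mean={incomes.mean():,.0f}, median={np.median(incomes):,.0f}")
print(f"log: mean={log_incomes.mean():.2f}, median={np.median(log_incomes):.2f}")
```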

The standard technique for measuring the relationship between one or more continuous independent variables and a continuous dependent variable is linear regression.

The basic idea of linear regression can be expressed simply. A linear regression is a line (or, with more independent variables, a higher-dimensional analogue like a plane or hyperplane) that maps the independent variables to the best predicted value of the dependent variable.
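In code, the one-variable case can be as simple as this; the data are simulated, and numpy.polyfit is just one of several ways to fit the line:

```python
# Fit a line y ≈ a + b*x by least squares to simulated data.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=100)  # true line: y = 3 + 2x

slope, intercept = np.polyfit(x, y, deg=1)  # a degree-1 polynomial is a line
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```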

Numpy is the first library we work with. By convention, it's imported with import numpy as np. Numpy really provides two things to our workflow:

Math that goes faster than unadorned Python could do it---which is important when you're doing statistics, because under the hood computational stats can involve a lot of calculations.
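To get a rough sense of that speed difference, here's a small benchmark; exact timings will vary by machine:

```python
# The same computation two ways: a plain Python loop vs. vectorized numpy.
import time
import numpy as np

xs = np.arange(1_000_000, dtype=np.float64)

start = time.perf_counter()
total_loop = sum(x * x for x in xs)  # one element at a time, in Python
loop_time = time.perf_counter() - start

start = time.perf_counter()
total_np = (xs * xs).sum()           # the whole array at once, in C
np_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s, numpy: {np_time:.4f}s")
```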

In this lesson, we're going to very quickly rip through the basic hypothesis tests, their uses, and how to run them in Python. I won't spend a lot of time on this, because the mathematical details are covered in the assigned reading, and, at any rate, I think for practical purposes regression analysis is more important for lawyers. Also, this is basically AP/undergrad stats material, so you've probably seen it somewhere already.
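Here's the whirlwind version in scipy.stats, run on made-up data; the reading covers the math behind each:

```python
# Three workhorse tests from scipy.stats, on simulated/invented data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.5, scale=1.0, size=40)
group_c = rng.normal(loc=5.2, scale=1.0, size=40)

# Two-sample t-test: do two groups share a mean?
print(stats.ttest_ind(group_a, group_b))

# One-way ANOVA: do three or more groups share a mean?
print(stats.f_oneway(group_a, group_b, group_c))

# Chi-squared test: are two categorical variables independent?
table = np.array([[30, 10], [20, 25]])  # e.g., counts of outcome by group
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```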

One of the readings for week 13, "The Bayesian New Statistics," covers a variety of different approaches to statistics, as contrasted with the standard frequentist hypothesis-testing method. I don't expect you to come out of this class being able to work any of those alternative paradigms, but you should be able to recognize them and understand broadly how they operate. That article is a very good summary of the landscape, but this supplemental lesson aims to provide a briefer and slightly more basic introduction.
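To give a taste of how the Bayesian paradigm operates, here's a minimal beta-binomial sketch; the coin-flip numbers are invented. We start with a uniform prior on a coin's bias and update it after seeing 7 heads in 10 flips:

```python
# A Bayesian update in a few lines: the beta-binomial conjugate pair.
from scipy import stats

heads, flips = 7, 10
prior_a, prior_b = 1, 1  # Beta(1, 1) is a uniform prior on the coin's bias

# Conjugacy: the posterior is Beta(prior_a + heads, prior_b + tails)
posterior = stats.beta(prior_a + heads, prior_b + (flips - heads))
lo, hi = posterior.interval(0.95)
print(f"posterior mean: {posterior.mean():.2f}")
print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")
```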

You've seen a lot of talk about controlling for things, e.g., in multiple regression (or, more generally, conditioning on things). It's worth having a quick list of some rules of thumb that will suffice for many cases. Of course, the design of observational studies is much more complicated …