What's a Distribution, Anyway?
Statistics people often talk about distributions, like a normal distribution. Here's what they mean: suppose you could see all of the instances of the thing you're trying to study. What kind of pattern would their values have? That's the distribution.
For example, suppose you expect that most of the values of the thing you care about will be clustered around some average value. IQ is a good example: most IQs in the population are around 100, and as values get further away from 100 in either direction, the fraction of people who fall in each range gets smaller. There are lots of folks with an IQ between 85 and 115, fewer between 70 and 85 on one side and between 115 and 130 on the other, many fewer between 55 and 70 or 130 and 145, and a (proportionally) truly tiny number between 40 and 55 or 145 and 160.
As it turns out, what I just described is a normal distribution, and IQs follow it. For reasons to be explained later, lots of things in the world follow it too. But before we dig into the normal, let's say a bit more about distributions in the abstract.
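As a quick aside, the IQ pattern above can be checked with a few lines of Python. This is only a sketch: it assumes IQ scores are exactly normal with a mean of 100 and a standard deviation of 15 (the 15 is my assumption, picked to match the 15-point bands in the example), and it uses `NormalDist` from Python's standard `statistics` module. The helper name `share_between` is made up for illustration.

```python
from statistics import NormalDist

# Assumption: IQ is normally distributed with mean 100 and standard deviation 15.
iq = NormalDist(mu=100, sigma=15)


def share_between(low, high):
    """Fraction of the population whose value falls between low and high."""
    return iq.cdf(high) - iq.cdf(low)


# The bands from the text, working outward from the average.
bands = [(85, 115), (70, 85), (115, 130), (55, 70), (130, 145), (40, 55), (145, 160)]
shares = {band: share_between(*band) for band in bands}

for (low, high), share in shares.items():
    print(f"IQ {low}-{high}: {share:.2%} of the population")
```

Under those assumptions, the middle band holds roughly 68 percent of people, each of the next bands out about 13.6 percent, then about 2.1 percent, then about 0.13 percent. That's the "lots, fewer, many fewer, truly tiny" pattern in numbers.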
There's a good bit of underlying mathematics to describe distributions, but it requires calculus, and won't be necessary for the introductory level of this course. That being said, you'll probably see some of this terminology elsewhere, so I'll give you a quick vocabulary list:
Continuous distribution: A distribution where the thing you care about can take any value in its range. The normal distribution is an example of a continuous distribution. Think about the distribution of incomes (which, incidentally, are not typically normally distributed), where
Discrete distribution: A distribution where the thing you care about can take only separate, countable values, like the number of heads in ten coin flips.
Probability Density Function (PDF): The function that describes the curve of a continuous distribution. If you've seen the famously scary normal distribution equation, with, like, e, and a square root, and all kinds of other craziness floating around, that's a PDF. Don't worry about it, though. We won't be using this directly in class.
Probability Mass Function (PMF): Like a PDF, but for a discrete distribution.
Cumulative Distribution Function (CDF): A much more readily understandable concept. Basically, you plug a value into a CDF and get the probability of seeing a value less than or equal to it. Hence the cumulative part. For those who remember calculus, the CDF is just the integral of the PDF up to that value. Again, we won't be working with this directly. There's also a CDF for discrete variables, which works the same way except it's a sum rather than an integral. You'll sometimes see people write "cumulative density function" instead; "distribution" is the standard name, and it covers both the continuous and the discrete case. (Stats people. Sigh.)
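If it helps to see the vocabulary in action, here's a minimal sketch of a PMF, a PDF, and both kinds of CDF. The examples are my own picks for illustration, not anything we'll need in class: ten flips of a fair coin for the discrete case, and a normal distribution with mean 100 and standard deviation 15 for the continuous case. The function names `coin_pmf` and `coin_cdf` are hypothetical; `NormalDist` and `comb` come from Python's standard library.

```python
from math import comb
from statistics import NormalDist

# Discrete example (assumed for illustration): heads in 10 flips of a fair coin.
N_FLIPS = 10
P_HEADS = 0.5


def coin_pmf(k):
    """PMF: probability of getting exactly k heads."""
    return comb(N_FLIPS, k) * P_HEADS**k * (1 - P_HEADS) ** (N_FLIPS - k)


def coin_cdf(k):
    """Discrete CDF: probability of k or fewer heads, a sum of PMF values."""
    return sum(coin_pmf(i) for i in range(k + 1))


# Continuous example (assumed for illustration): normal, mean 100, sd 15.
iq = NormalDist(mu=100, sigma=15)

pdf_at_100 = iq.pdf(100)  # height of the curve at 100; a density, not a probability
cdf_at_100 = iq.cdf(100)  # probability of seeing a value of 100 or less

print(f"P(exactly 5 heads)  = {coin_pmf(5):.4f}")
print(f"P(5 or fewer heads) = {coin_cdf(5):.4f}")
print(f"PDF height at 100   = {pdf_at_100:.4f}")
print(f"CDF at 100          = {cdf_at_100:.4f}")
```

Notice that the PMF hands back an actual probability, while the PDF hands back the height of the curve. To get a probability out of a continuous distribution you need an area under that curve, which is exactly what the CDF gives you.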