When you're doing data analysis, it's often helpful to transform some of your variables onto a common scale; this is especially handy in exploratory data analysis, because these transformations often make it much easier to eyeball the relationship between variables. (Also, some statistical techniques require them.)
In this short lesson, we'll introduce two common methods of transforming data: the log transform, and centering and scaling (a.k.a. standardizing). We'll look at examples of both using the data underlying part of my 2016 book, The Rule of Law in the Real World (shameless plug).
Let's start by loading the data and taking a quick look at it, as well as setting up the libraries we'll need.
import pandas as pd
import seaborn as sns
%matplotlib inline
df = pd.read_csv("http://rulelaw.net/downloads/rol-scores.csv")
df.head()
Looking at this table, we can see that some variables appear to be on radically different scales. In particular, the variable RoLScore (which is a measure of the rule of law) looks like it ranges from 40-ish to 80-ish, just on this snippet of information, while GDP is printed in that e+xx notation, which is scientific notation, so we're talking about numbers with something like ten zeroes after them.
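As an aside, if the scientific notation bugs you, one option is to tell pandas how to print floats. This is just a cosmetic sketch using pandas' display options; it doesn't change the underlying data at all.
# ask pandas to print floats with comma separators and two decimal places
pd.set_option("display.float_format", "{:,.2f}".format)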
Actually, this table might turn out ugly when I post it online, so let's just look at an abbreviated version of the table to see. (Sorry, I still haven't figured out how to get the table layout working right.)
df[["RoLScore", "2012GDP"]].head()
Let's create a GDP per capita variable first of all, and then let's look at its relationship to the rule of law.
df["gdp_percapita"] = df["2012GDP"] / (df["Pop. In Millions for 2012"] * 1000000)
df[["State", "gdp_percapita"]].head(10)
That seems pretty realistic. Australia is quite rich, Bangladesh is quite poor, and so forth.
sns.scatterplot(x=df["RoLScore"], y=df["gdp_percapita"])
What's going on here? If you look at this plot, it looks like rule of law scores are spread out across their range, while GDP seems to be clustered toward the bottom. Maybe we should inspect the distributions of the two variables?
sns.distplot(df["RoLScore"])
sns.distplot(df["gdp_percapita"])
So they're both obviously a bit skewed, but the rule of law score is much less skewed. In fact, what we see in the GDP data is pretty common for most data involving money: there are way, way, way more points at the low end than at the high end. Lots of poor countries, very few super-rich countries like the U.S. This often calls for a logarithmic transformation to understand the relationship between the two variables.
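If you want to put a rough number on that, pandas has a built-in skewness estimate; positive values mean a long right tail, and the GDP per capita figure should come out much larger than the rule of law one.
print(df["RoLScore"].skew())
print(df["gdp_percapita"].skew())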
Log Transformations
As you may or may not recall from high school math, taking a logarithm is the inverse of exponentiation. That is, the statement $2^3 = 8$ is equivalent to the statement $\log_2 8 = 3$.
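We can check that claim directly in Python with the standard library (math.log2 computes a base-2 logarithm):
from math import log2
print(log2(8))  # 3.0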
Logarithms are a bit tough for human psychology to really have intuitions about, but one good way to think of them is that they turn differences across orders of magnitude into differences within a single order of magnitude.
That is, suppose we had the following data: [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]. The range goes from one to ten million, and the difference between the first and last elements is obviously huge. But if we take the log base 10 of every element, we end up with much smaller differences, and a much less dramatically curved line when we graph the datapoints.
from math import log10
toydata = [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]
loggedtoydata = [log10(x) for x in toydata]
print(loggedtoydata)
sns.scatterplot(x=list(range(8)), y=toydata)  # newer seaborn wants x and y as keyword arguments
sns.scatterplot(x=list(range(8)), y=loggedtoydata)
So let's do that to our GDP per capita variable to see if it helps us make an interpretable graph. (We'll have to use numpy's log functions, because they can apply to a whole series of data just as well as to an individual point, so we'll just shadow (overwrite) the log10 name we originally imported from the Python standard library with numpy's version. This is normally a bad idea; we'll note the more usual approach below.)
from numpy import log10
df["log_gdp"] = log10(df["gdp_percapita"])
sns.scatterplot(x=df["RoLScore"], y=df["log_gdp"])
What we can see is that there's actually a pretty clear relationship between rule of law scores and GDP per capita, after a log transformation. The way to read this graph is that as GDP per capita increases by a factor of 10, rule of law scores seem to increase by, oh, 15 points or thereabouts. This relationship is much harder to see on the original scatterplot.
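If you'd rather not just eyeball that slope, a quick least-squares line gives a similar ballpark figure. This is a rough sketch using numpy's polyfit, assuming no missing values in these columns; it's not a substitute for a proper regression.
import numpy as np
slope, intercept = np.polyfit(df["log_gdp"], df["RoLScore"], 1)
print(slope)  # roughly the change in RoLScore per tenfold increase in GDP per capita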
Log-transforming data relating to money is often a good idea.
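By the way, the more usual way to avoid shadowing the standard library's log10 is to import numpy under its conventional alias and use the namespaced function. Here's a minimal sketch that should produce the same column:
import numpy as np
df["log_gdp"] = np.log10(df["gdp_percapita"])  # same result, no name shadowing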
Centering and Scaling
Centering and scaling is carried out as follows:
1. Subtract the mean of the variable from each of its values ("centering"; this effectively makes the variable's mean zero).
2. Divide each value by the variable's standard deviation ("scaling"; this makes the variable's standard deviation one).
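In symbols, writing $\bar{x}$ for the mean and $s$ for the standard deviation of a variable $x$, each value $x_i$ becomes $z_i = \frac{x_i - \bar{x}}{s}$.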
We can easily use more numpy functions to achieve this.
from numpy import mean, std
def center_scale(variable):  # this applies to an entire pandas column
    mu = mean(variable)
    sigma = std(variable)  # note: numpy's std() uses the population formula (ddof=0), unlike pandas' default
    return (variable - mu) / sigma
What this will ultimately do is reduce a variable to what we might call standard deviation units. Let's look at another example. In the dataset we're working with, the variable hprop is the Heritage Foundation property rights score for 2012, and assoc_org is the Freedom House associational and organizational rights score. Observe:
df[['State', 'hprop', 'assoc_org', 'RoLScore']].head(10)
As you can see, again, they're on different scales, which might make it a bit difficult to compare them, especially later on when we try to figure out the magnitude of their relationship with something else.
For example, suppose we learn from statistical analysis that a one-unit change in the Freedom House score predicts a one-unit change in the Rule of Law Score, but it takes a five-unit change in the Heritage Foundation score to predict a one-unit change in the Rule of Law Score. What does that mean in terms of the relative predictive power of those two variables? It's hard to tell, because they're not on a common scale. A one-unit change in the Freedom House score could be a really big difference, while a five-unit change in the Heritage Foundation score could be a really small difference. They're hard to compare.
Let's standardize (a.k.a. center and scale) them.
df["hprop_std"] = center_scale(df["hprop"])
df["assoc_std"] = center_scale(df["assoc_org"])
df[["State", "hprop_std", "assoc_std"]].head(10)
Now we can see that the two variables are on a common scale. The scale, as I said, is in standard deviation units. In other words, looking at the first row of the table above, we can say that Albania is 0.69 standard deviations below the mean on property rights, and 0.2 standard deviations below the mean on associational/organizational freedoms.
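If you want to double-check that the transformation did what it's supposed to, each standardized column should have a mean of (essentially) zero and a standard deviation of one; any tiny deviation from zero is just floating-point noise. For example, using the numpy mean and std we imported above:
print(mean(df["hprop_std"]), std(df["hprop_std"]))  # approximately 0 and 1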
Thinking about this in terms of the analytic question I posed above, this would allow us to compare the relationship of the two scales on some common third variable---we might say, for example, that a 1 standard deviation increase in property rights predicts a 1 standard deviation increase in the rule of law, while it takes a 3 standard deviation increase in freedom of association to do the same. This isn't a panacea for comparison, but it can help in a lot of cases.