Fourteen years ago, I was introduced to the process of normalising grades. Professors “fit” students’ marks into a normal distribution and assign grades based on that. (I still don’t know how they do it.)
Since then, I’ve encountered normalising a lot. My performance at work is normalised. I normalise my song ratings and movie ratings. I’ve normalised all kinds of things at work: lead-time of delivery of fans, movements in savings account balances, calls to a call centre, demand for a resource… you name it.
(What I mean by normalising is, I find the mean and standard deviation, and assume that it’s a normal distribution with that mean and standard deviation. For things under my control, like movie ratings, I revise the ratings to fit a normal distribution.)
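Here is a rough sketch of both senses of normalising described above, using only Python's standard library. The function names, the sample ratings, and the rank-based re-scoring are my own illustrative assumptions. It is one plausible way to "revise ratings to fit a normal distribution", not necessarily how anyone actually does it.

```python
from statistics import NormalDist, mean, stdev

def fit_normal(values):
    """Sense 1: take the mean and standard deviation and assume the data is normal."""
    return NormalDist(mean(values), stdev(values))

def force_normal(values, target_mean, target_sd):
    """Sense 2: re-score values onto a normal shape, keeping only their rank order.

    Ties are broken by position, so equal inputs get slightly different scores.
    """
    target = NormalDist(target_mean, target_sd)
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    revised = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n is the mid-point quantile for this rank, always inside (0, 1)
        revised[i] = target.inv_cdf((rank + 0.5) / n)
    return revised

ratings = [9, 8, 8, 7, 9, 10, 6, 8, 9, 3]  # made-up movie ratings, out of 10
model = fit_normal(ratings)
revised = force_normal(ratings, target_mean=5.0, target_sd=2.0)

print("assumed normal:", model)
print("share of ratings the model expects above 10:", 1 - model.cdf(10))
print("revised ratings:", [round(r, 2) for r in revised])
```

Note what the second function throws away: everything except the ordering. The original ratings were bunched up near the top, and the revised ones are spread out whether or not that reflects what I thought of the movies.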
In fact, I normalise everything I encounter by default.
A few years ago, I started feeling uncomfortable about this. I’ve now figured out why normalising is bad — at least when done blindly like I do.
First, let’s explore why normalising is good. Normalising eliminates biases. If the Prof in Section A grades higher than the Prof in Section B, normalising takes care of it. If a Prof is extremist (more A’s as well as F’s), normalising takes care of it. If a Prof is skewed (lots below average, few extremely high above average), normalising takes care of it.
Eliminating biases makes sense if Section A is fundamentally like Section B. It’s not better, nor more extremist, nor more skewed. If the sections are large enough and picked randomly, this assumption is correct. If Section A represents the smarter half, or people born in the second half of the year, or people from the Western states, or any other non-random selection, this need not be correct.
An aside: You may wonder why being born in the second half of the year is a non-random selection. If the school year starts in September and children are admitted at age 3, kids born in September will be nearly 4 years old when they join, while kids born in August will be just over 3. That one-year difference, to a three-year-old, is HUGE. For example, you will find a birth date bias in football, with a disproportionate share of Premiership players born in the months of September to November.
Normalising goes a step further than eliminating bias, however. Normalising forces a normal distribution. This would be right if the underlying data is normally distributed. But if not, we may be making a mistake by force-fitting.
The Central Limit Theorem says that if you add up enough independent random variables, each with a finite standard deviation, the sum is approximately normally distributed. All three conditions matter: a large sample, independence, and finite standard deviations.
This means that many things you get by adding random variables are normally distributed. For example:
- Number of heads when you toss a coin (add up each coin toss)
- Average age of an army platoon (add up each soldier’s age)
- Terminus-to-terminus time for a bus (add up the time between each stop)
- Price movement of a stock exchange index (add up each stock’s price movement)
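The coin toss case is easy to check by simulation. This is a minimal sketch, assuming a fair coin, 100 tosses per trial and 5,000 trials (all numbers picked for illustration). It compares how often the head count lands between 45 and 55 with what a normal curve of the same mean and standard deviation predicts.

```python
import random
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the run is repeatable

TOSSES, TRIALS = 100, 5000

def heads_in(tosses):
    """Add up individual coin tosses: 1 for heads, 0 for tails."""
    return sum(random.random() < 0.5 for _ in range(tosses))

counts = [heads_in(TOSSES) for _ in range(TRIALS)]

# Theory for a fair coin: mean n/2, standard deviation sqrt(n)/2
normal = NormalDist(TOSSES / 2, (TOSSES ** 0.5) / 2)

# Share of trials with 45..55 heads, versus the normal curve's prediction.
# The +/- 0.5 is a continuity correction, because head counts are whole numbers.
observed = sum(45 <= c <= 55 for c in counts) / TRIALS
predicted = normal.cdf(55.5) - normal.cdf(44.5)

print("average heads:", mean(counts))
print("observed share in 45..55: ", observed)
print("normal-curve prediction:  ", predicted)
```

The two shares should come out close to each other, which is the Central Limit Theorem doing its job: many independent tosses, each with a finite standard deviation, added up.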
But a lot of real-life data is NOT normally distributed. The usual reasons are:
- It’s not the sum of random variables
- It doesn’t satisfy the conditions of the central limit theorem (independence, large sample, finite standard deviations)
Here are some non-normal distributions that are NOT the sum of random variables:
- Soldier’s age within an army platoon. What random variables could you add up? You’ll probably find a lot of people at age 18, because that’s the minimum age. Slightly fewer at age 19 (last year’s recruits). Far fewer at age 20 (2 years minimum service accomplished). Certainly not a normal distribution.
- Price movement of a single stock. What random variables could you add up? You’ll find that there are far larger price movements than a normal distribution predicts.
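To get a feel for what "far larger movements than a normal distribution predicts" looks like, here is a toy sketch. The data is NOT real stock data. It is a made-up mixture that I am assuming purely for illustration: 95% calm days and 5% turbulent days with five times the spread. The sketch fits a normal distribution by mean and standard deviation, then counts moves beyond 4 standard deviations.

```python
import random
from statistics import NormalDist, mean, pstdev

random.seed(1)  # fixed seed so the run is repeatable

def toy_move():
    # ASSUMPTION: 95% calm days (sd 1), 5% turbulent days (sd 5). Not real market data.
    return random.gauss(0, 5 if random.random() < 0.05 else 1)

moves = [toy_move() for _ in range(20000)]

# "Normalise" blindly: keep only the mean and standard deviation
fitted = NormalDist(mean(moves), pstdev(moves))

# Count moves more than 4 standard deviations from the mean...
cutoff = 4 * fitted.stdev
big_moves = sum(abs(m - fitted.mean) > cutoff for m in moves)

# ...and compare with how many the fitted normal says we should see
expected_if_normal = len(moves) * 2 * (1 - fitted.cdf(fitted.mean + cutoff))

print("moves beyond 4 sd, observed:         ", big_moves)
print("moves beyond 4 sd, if it were normal:", round(expected_if_normal, 2))
```

The fitted normal expects roughly one such move in 20,000 days. The toy data produces far more, even though its mean and standard deviation match the fitted curve exactly.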
Here are some non-normal distributions that don’t satisfy the central limit theorem. (These are, in fact, things I said were normally distributed earlier. You see? It’s easy to think things are normal, but in reality they’re not.)
- The terminus-to-terminus time for a bus. The number of bus stops is quite small. More importantly, the times between stops aren’t independent. If there’s a traffic jam, an entire section of the route will take more time. If there’s a delay between points 2 and 3, it’s likely that there’ll be a delay between points 1 and 2 and between points 3 and 4 as well.
- The price movement of a stock exchange index. The price movements of stocks are often described as following a power-law distribution, which, if the tail is heavy enough, does not have a finite standard deviation. Also, the price movements of different stocks are not independent.
- See more non-normal distributions.
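The bus example can be sketched with a toy simulation. Every number here is an assumption I made up for illustration: 8 segments, about 5 to 6 minutes each, and a 10% chance of a traffic jam that adds 4 minutes to every segment of that trip. The point is to compare the actual spread of the total time with the spread you would expect if the segments were independent.

```python
import random
from statistics import pstdev, pvariance

random.seed(2)  # fixed seed so the run is repeatable

STOPS, TRIPS = 8, 10000

def trip_segments():
    """Minutes taken for each segment of one trip (all figures hypothetical)."""
    jam = 4.0 if random.random() < 0.1 else 0.0  # one jam slows every segment
    return [5.0 + random.expovariate(1.0) + jam for _ in range(STOPS)]

trips = [trip_segments() for _ in range(TRIPS)]
totals = [sum(t) for t in trips]

# If segments were independent, their variances would simply add up
segment_vars = [pvariance([t[i] for t in trips]) for i in range(STOPS)]
sd_if_independent = sum(segment_vars) ** 0.5

# What the end-to-end times actually do
sd_actual = pstdev(totals)

print("sd of total time, assuming independence:", round(sd_if_independent, 2))
print("sd of total time, actual:               ", round(sd_actual, 2))
```

Because the jam hits all segments at once, the delays pile up instead of averaging out, and the total time is much more variable than the segment-by-segment figures suggest.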
Summary: Don’t assume that anything you see is a normal distribution. It usually isn’t.
I’ll shortly talk about what happens when you assume something’s a normal distribution, when it really is not.