Non-normal distributions

Fourteen years ago, I was introduced to the process of normalising grades. Professors "fit" students' marks into a normal distribution and assign grades based on that. (I still don't know how they do it.)

Since then, I've encountered normalising a lot. My performance at work is normalised. I normalise my song ratings and movie ratings. I've normalised all kinds of things at work: lead-time of delivery of fans, movements in savings account balances, calls to a call centre, demand for a resource... you name it.

(What I mean by normalising is, I find the mean and standard deviation, and assume that it's a normal distribution with that mean and standard deviation. For things under my control, like movie ratings, I revise the ratings to fit a normal distribution.)
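
To make that concrete, here is a minimal sketch of both steps in Python. The ratings are made up, and revising by rank is only one plausible way to force the fit (it is my assumption here, not a description of what any professor does). The function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def fit_normal(values):
    """Summarise values by the normal curve with their mean and standard deviation."""
    return NormalDist(mu=mean(values), sigma=pstdev(values))

def force_normal(values):
    """Revise values so they follow the fitted normal curve, keeping rank order."""
    dist = fit_normal(values)
    n = len(values)
    # Indices from lowest to highest value; ties are broken by position.
    order = sorted(range(n), key=lambda i: values[i])
    revised = [0.0] * n
    for rank, i in enumerate(order):
        # Mid-point position keeps the quantile strictly between 0 and 1.
        revised[i] = dist.inv_cdf((rank + 0.5) / n)
    return revised

ratings = [5, 5, 4, 5, 3, 5, 4, 2, 5, 4]  # hypothetical movie ratings
revised = force_normal(ratings)
for old, new in zip(ratings, revised):
    print(old, round(new, 2))
```

Note that the five movies rated 5 all end up with different revised ratings. Force-fitting invents distinctions that were not in the data.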

In fact, I normalise everything I encounter by default.

A few years ago, I started feeling uncomfortable about this. I've now figured out why normalising is bad -- at least when done blindly like I do.

First, let's explore why normalising is good. Normalising eliminates biases. If the Prof in Section A grades higher than the Prof in Section B, normalising takes care of it. If a Prof is extremist (more A's as well as F's), normalising takes care of it. If a Prof is skewed (lots of marks below average, a few far above it), normalising takes care of it.
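
The first two cases can be sketched by standardising each section's marks (subtract the mean, divide by the standard deviation). The marks below are invented for illustration. A skewed Prof would need the marks re-fitted by rank, which this sketch does not attempt.

```python
from statistics import mean, pstdev

def standardise(marks):
    """Express each mark as standard deviations above or below the section mean."""
    mu, sigma = mean(marks), pstdev(marks)
    return [(m - mu) / sigma for m in marks]

# Hypothetical marks for the same three levels of performance.
lenient = [70, 80, 90]    # grades everyone 20 marks higher
strict = [50, 60, 70]
extremist = [30, 60, 90]  # same centre as strict, three times the spread

z_lenient = standardise(lenient)
z_strict = standardise(strict)
z_extremist = standardise(extremist)

for row in zip(z_lenient, z_strict, z_extremist):
    print([round(z, 3) for z in row])
```

After standardising, the three sections are indistinguishable. That is exactly what you want if the sections really are alike, and exactly what you don't want if they aren't.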

Eliminating biases makes sense if Section A is fundamentally like Section B: no better, no more extremist and no more skewed. If the sections are large enough and picked randomly, this assumption is reasonable. If Section A represents the smarter half, or people born in the second half of the year, or people from the Western states, or any other non-random selection, it need not hold.

An aside: You may wonder why "people born in the second half of the year" is a non-random selection. If the school year starts in September, and children are admitted at 3 years old, kids born in September will be nearly 4 years old when they join. Kids born in August will be just over 3 years old. That one-year difference, to a three-year-old, is HUGE. For example, you will find a birth date bias in football, with most Premiership players being born in the months of September to November.

Normalising goes a step further than eliminating bias, however. Normalising forces a normal distribution. This would be right if the underlying data is normally distributed. But if not, we may be making a mistake by force-fitting.

The Central Limit Theorem says that if you add up enough random variables, the sum is approximately normally distributed, provided the variables are independent and each has a finite standard deviation.
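
A quick simulation shows the effect. This is only a sketch: the choice of uniform random numbers, 30 terms and 10,000 trials are my assumptions, picked to keep it short. It checks one fingerprint of the normal curve, namely that about 68% of values fall within one standard deviation of the mean.

```python
import random
from statistics import NormalDist, mean, pstdev

def share_within_one_sd(samples):
    """Fraction of samples lying within one standard deviation of their mean."""
    mu, sigma = mean(samples), pstdev(samples)
    return sum(abs(x - mu) <= sigma for x in samples) / len(samples)

rng = random.Random(2006)   # fixed seed so the run is repeatable
trials, terms = 10_000, 30  # illustrative sizes

# One uniform random number per trial: flat, clearly not normal.
single = [rng.random() for _ in range(trials)]
# The sum of 30 independent uniform random numbers per trial.
sums = [sum(rng.random() for _ in range(terms)) for _ in range(trials)]

normal_share = NormalDist().cdf(1) - NormalDist().cdf(-1)  # about 0.683
print(f"normal curve : {normal_share:.3f}")
print(f"single draws : {share_within_one_sd(single):.3f}")
print(f"sums of {terms}   : {share_within_one_sd(sums):.3f}")
```

The single draws should come out near 58%, while the sums should land close to the normal curve's 68%, even though nothing about an individual draw is normal.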

This means that many things you get by adding random variables are normally distributed. For example:

But a lot of real-life data is NOT normally distributed. The usual reasons are:

  1. It's not the sum of random variables
  2. It doesn't satisfy the conditions of the Central Limit Theorem (independence, large sample, finite standard deviations)
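
The second reason is easy to underestimate, so here is a sketch of one condition failing. The Cauchy distribution is my choice of example (it is not from the list of cases in this post): its standard deviation is not finite, and the average of any number of Cauchy draws is just as wild as a single draw. The function names and the limit of 5 are hypothetical.

```python
import math
import random

def cauchy(rng):
    """One draw from a standard Cauchy distribution via its inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

def share_of_wild_averages(draw, rng, trials=2000, terms=100, limit=5.0):
    """Fraction of trials whose average of `terms` draws lands beyond +/- limit."""
    wild = 0
    for _ in range(trials):
        average = sum(draw(rng) for _ in range(terms)) / terms
        wild += abs(average) > limit
    return wild / trials

rng = random.Random(2006)  # fixed seed so the run is repeatable

# Finite standard deviation: averages of 100 draws settle near zero.
uniform_share = share_of_wild_averages(lambda r: r.uniform(-10, 10), rng)
# No finite standard deviation: averaging does not tame the extremes.
cauchy_share = share_of_wild_averages(cauchy, rng)

print(f"uniform averages beyond 5: {uniform_share:.3f}")
print(f"Cauchy averages beyond 5 : {cauchy_share:.3f}")
```

With the uniform draws, essentially no average strays beyond 5. With the Cauchy draws, roughly one average in eight does, no matter how many terms you add up.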

Here are some non-normal distributions that are NOT the sum of random variables:

Here are some non-normal distributions that don't satisfy the central limit theorem. (These are, in fact, things I said were normally distributed earlier. You see? It's easy to think things are normal, but in reality they're not.)

Summary: Don't assume that anything you see is a normal distribution. It usually isn't.

I'll shortly talk about what happens when you assume something's a normal distribution, when it really is not.

Written on 19 Jul 2006

S Anand, Infosys Consulting, London UK. +44 7957 440 260