Gap statistic

21st October 2014

Before we begin, the important point to remember is that computers are stupid. They will do exactly what you ask them to and no more.

My cats often ask me ‘What is the gap statistic?’. If I wasn’t so busy tending to their other, higher-priority demands, this is what I would tell them.

The gap statistic is an unsupervised machine learning method. Machine learning is a catch-all term for trying to persuade computers to do things that humans are generally quite good at, like spotting patterns and pulling signals out of noise. Computers are pretty rubbish at these things and only really come into their own when problems are too big or boring for people to focus on. Unsupervised machine learning is when the stupid computer is provided with an unbiased input data set and asked to find meaning in that data set.

FACT: I like examples and I like unicorns. Let’s assume that I also like gathering data about equidae but that I don’t know a zedonk from a zorse. I’ve got a big old data set, containing all of the information I’ve collected over the years. I’d love to know which animal was which species, so I decide to let my computer sort it out for me. After spending ten seconds on wikipedia, I decide that a clustering method sounds like the right approach. Also, I have a handy k-means accelerator kicking around somewhere that I’d like to make use of.

I give the clustering method all of my data, magic happens and all of my records are split into groups. Hopefully these will match with species or hydrids, and if they don’t then maybe the species-splitting people made a mistake somewhere and I can have a horsie named after me.

I know from an earlier paragraph that an unbiased data set is needed, that sounds easy. I showed absolutely no bias in scrupulously taking the following measurements for each equid I was lucky enough to meet. Because I’m English, I like to use as many different units of measurement as possible:

MeasurementUnitsMinimumMaximum
Heighthands716
Lengthcm140233
Sugarlumps consumedquantity012
Horn lengthcm00
Distance from my housemiles11265
Stripesnumber027

However, computers are stupid. If I want to give a clustering algorithm any chance of providing a useful output, I will need to convert all information about each equid to numbers and then force all measurements occupy the same range (generally 0-1 for simplicity). This process is known as normalisation, and is essential to make sure that the clustering algorithm isn’t biased towards one of my measurements. If I were to skip the normalisation step, differences in geographical distance would dominate the output of the clustering algorithm. I know that you can’t directly compare differences in geographical distances to differences in horn length directly but the stupid computer doesn’t.

Any fool knows that unicorns are equally likely to occur anywhere on the planet, whereas zebras are banned from visiting Belfast. I could use this information to adjust the normalised figures, but that would count as biasing the input data. As we wish to carry out an unsupervised learning experiment, this is against the rules.

So, we have a nice normalised data set, let’s go k-means clusterin’. All I need to do is decide how many groups I want the algorithm to produce – WHAT? If I knew how many varieties of the equine genus I hade measured then I wouldn’t be bothering with all this machine learning nonsense. Plus, doesn’t deciding the number of clusters make the method supervised rather than unsupervised, question-statement.

This is where the gap statistic comes to the rescue. It tries a range of different numbers of clusters, tries them all and tells you which best suits the data set. The approach was devised by Tibshirani, Walther and Hastie at Stanford University.

The gap statistic works by comparing the clustering results for your actual data set against a baseline. This baseline is formed by performing the same clustering operation on a number of imaginary data sets, a process known as bootstrapping. In my case, this would be a group of equids with randomly generated measurements which fall within the same range as my real data set. This seems ridiculous, but then so is doing anything with computers.

The real skill is in understanding the data that you are dealing with, because the computer won’t.

Share This

Tweet this!   Share on LinkedIn   Share on Facebook

Leave a Reply

Your email address will not be published. Required fields are marked *