8th August 2014
The basic k-means algorithm is extremely simple and efficient. However, it assumes prior knowledge of the data in order to choose the appropriate K. Further disadvantages are the sensitivity of the final clusters to the choice of the initial centroids and the fact that the algorithm can produce empty clusters.
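For reference, a minimal NumPy sketch of the basic algorithm (Lloyd's iteration); the function name and signature are my own, and the comments flag the two pitfalls mentioned above:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Basic k-means: returns (centroids, labels).

    Note the classic pitfalls: the result depends on the random
    initialization, and a cluster can end up empty."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K distinct data points (random choice:
    # a different seed can give different final clusters).
    mu = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
        # Update step: mean of each cluster; if a cluster is empty,
        # keep its old centroid rather than averaging no points.
        new_mu = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                           else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    # Recompute assignments so labels match the returned centroids.
    labels = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
    return mu, labels
```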
The gap statistic was developed by Stanford researchers Tibshirani, Walther and Hastie in their 2001 paper. The idea behind their approach is to standardize the comparison of log Wk, where Wk is the pooled within-cluster sum of squares around the cluster means, against its expectation under a null reference distribution of the data, i.e. a distribution with no obvious clustering. Wk can be computed in Python as follows:
import numpy as np

def Wk(mu, clusters):
    # Pooled within-cluster sum of squares around the cluster means:
    # mu is the list of K centroids, clusters[i] the points assigned
    # to centroid i. With squared Euclidean distances this equals the
    # paper's sum over clusters of D_r / (2 n_r).
    K = len(mu)
    return sum(np.linalg.norm(mu[i] - c)**2
               for i in range(K) for c in clusters[i])
The estimation of the optimal number of clusters within a set of data points is an important problem, as most clustering algorithms require that parameter as input in order to group the data.
Many methods have been proposed to find the proper K, among which the “elbow” method offers a clear, if naive, solution based on intra-cluster variance. The gap statistic, proposed by Tibshirani et al., formalizes this approach and offers an easy-to-implement algorithm that successfully finds the correct K in the case of globular, Gaussian-distributed, mildly disjoint data distributions.
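The elbow method itself can be sketched in a few lines: compute the total within-cluster sum of squares for a range of K and look for the point where the curve's decrease levels off. As an illustration (with SciPy's kmeans2 as the clustering step; the function name is my own):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def elbow_curve(X, k_max, seed=0):
    """Total within-cluster sum of squares for K = 1..k_max.

    The 'elbow' is the K after which adding more clusters stops
    reducing the intra-cluster variance appreciably; it is usually
    read off a plot of this curve."""
    curve = []
    for k in range(1, k_max + 1):
        mu, labels = kmeans2(X, k, minit='++', seed=seed)
        ss = sum(np.sum((X[labels == i] - mu[i])**2) for i in range(k))
        curve.append(ss)
    return curve
```

For two well-separated blobs the curve drops sharply from K = 1 to K = 2 and then flattens; the weakness of the method is that "sharply" is judged by eye, which is exactly what the gap statistic replaces with a standardized comparison.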