K-means clustering is a widely used clustering technique that seeks to minimize within-cluster sum of distance. Solving the problem exactly is computationally difficult, but Lloyd [6] proposed a local search solution, often referred to as “k-means” algorithm or Lloyd’s algorithm, that is still widely used today. It is the speed and simplicity of the k-means method that make it popular, not its accuracy. The algorithm can converge to a local optimum that can be very away from the global optimum (even under repeated random initializations). A typical example is given in Figure 1. In this example, k-means algorithm converges to a local minimum after five iterations, which contradicts the obvious cluster structure of the data set.Figure 1: A typical example of the k-means convergence to a local minimum. The result of kmeans clustering (the rightmost figure) contradicts the obvious cluster structure of the data set. The small circles are the data points, the four ray stars are the centroids (means). The initial configuration is on the leftmost figure. The algorithm converges after five iterations presented in the figures, from the left to the right. A proper initialization of k-means is crucial to obtain a good final solution. Arthur and Vassilvitskii proposed the k-means++ initialization algorithm that both improves the running time of Lloyd’s iterations and the quality of the final solution. Our Statistics Toolbox already offers the kmeans function which implements the Lloyd’s algorithm. We can improve the performance of the kmeans function by adding thek-means++ seeding to the initialization step. There are many other initialization methods for k-means, for example, K-means Refinement algorithm presented in and some others. I chose k-means++ because it is simple and can be extended in the MapReduce framework. The k-means refine algorithm can also be easily implemented in a MapReduce framework, but the initialization is rather complicated and has a non-empty clustering constraints. Although, the refinement method is salable and very suitable for big data, for small datasets, the run time can be much longer than the standard random sampling or the k-means++ algorithms due to the complexity and constraints.

You must be logged in to post a comment.