
Useful links

Julia

Clustering.jl

Documentation: https://juliastats.org/Clustering.jl/stable/kmeans.html

A native Julia implementation of the k-means clustering algorithm.
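
A minimal usage sketch, assuming a recent Clustering.jl release (the package expects the data as a d × n matrix with observations stored as columns):

```julia
using Clustering

X = rand(5, 1_000)                  # 1 000 points in 5 dimensions (d × n layout)

result = kmeans(X, 3; maxiter = 200, tol = 1e-6)

assignments(result)                 # cluster index (1..3) for every point
result.centers                      # 5 × 3 matrix of cluster centers
result.totalcost                    # total squared distance (inertia)
```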

Python Implementations

Scikit-Learn

KMeans: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

The k-means problem is solved using either Lloyd’s or Elkan’s algorithm. The average complexity is given by O(k n T), where n is the number of samples and T is the number of iterations. The worst-case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features (D. Arthur and S. Vassilvitskii, ‘How slow is the k-means method?’ SoCG 2006). In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it falls into local minima. That’s why it can be useful to restart it several times.
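
scikit-learn automates these restarts through its `n_init` parameter. A conceptual sketch of the same keep-the-best-run idea, written in Julia with Clustering.jl purely for illustration:

```julia
using Clustering

X = rand(5, 1_000)

# Run k-means from 10 different random initializations and keep the
# solution with the lowest total cost (inertia).
runs = [kmeans(X, 3) for _ in 1:10]
best = runs[argmin([r.totalcost for r in runs])]
```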

MiniBatchKMeans: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

Paper: https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast ε-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml

Yinyang K-Means

Paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf

This paper presents Yinyang K-means, a new algorithm for K-means clustering. By clustering the centers in the initial stage, and leveraging efficiently maintained lower and upper bounds between a point and centers, it more effectively avoids unnecessary distance calculations than prior algorithms. It significantly outperforms prior K-means algorithms consistently across all experimented data sets, cluster numbers, and machine configurations. The consistent, superior performance—plus its simplicity, user-control of overheads, and guarantee in producing the same clustering results as the standard K-means—makes Yinyang K-means a drop-in replacement of the classic K-means with an order of magnitude higher performance.
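
A conceptual sketch of the two-level filtering test described above; the names below are illustrative and are not taken from the paper or its reference implementation:

```julia
# Yinyang K-means clusters the k centers into a small number of groups during
# initialization. Each point keeps a global upper bound `ub` on the distance to
# its assigned center and one lower bound per group, `lb[g]`, on the distance to
# any center in that group. After the centers move, `ub` is increased by the
# drift of the assigned center and each `lb[g]` is decreased by the largest
# drift in group g before this test runs.

# Returns the indices of the center groups that still need exact distance
# computations for one point; an empty result means the point keeps its current
# assignment without computing any point-to-center distances.
function groups_to_check(ub::Real, lb::AbstractVector{<:Real})
    # Global filter: if the upper bound is below every group lower bound,
    # the assignment cannot change.
    ub <= minimum(lb) && return Int[]
    # Group filter: only groups whose lower bound falls below the upper bound
    # may contain a closer center.
    return [g for g in eachindex(lb) if lb[g] < ub]
end
```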

R implementations

Knor

The package was found via this Stack Overflow question: https://stackoverflow.com/questions/20416944/parallel-k-means-in-r

R package can be found here: https://cran.r-project.org/web/packages/knor/index.html

It is based on this research paper: https://arxiv.org/abs/1606.08905

From the abstract:

k-means is one of the most influential and utilized machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The k-means NUMA Optimized Routine knor library has (i) in-memory knori, (ii) distributed memory knord, and (iii) semi-external memory knors modules that radically improve the performance of k-means for varying memory and hardware budgets. knori boosts performance for single machine datasets by an order of magnitude or more. knors improves the scalability of k-means on a memory budget using SSDs. knors scales to billions of points on a single machine, using a fraction of the resources that distributed in-memory systems require. knord retains knori's performance characteristics, while scaling in-memory through distributed computation in the cloud. knor modifies Elkan's triangle inequality pruning algorithm such that we utilize it on billion-point datasets without the significant memory overhead of the original algorithm. We demonstrate knor outperforms distributed commercial products like H2O, Turi (formerly Dato, GraphLab) and Spark's MLlib by more than an order of magnitude for datasets of 10^7 to 10^9 points.

C implementations

CFXKmeans

Documentation is here: https://colfaxresearch.com/cfxkmeans/

Repository with source code: https://github.com/ColfaxResearch/CFXKMeans

Lloyd's algorithm variations

Elkan algorithm

Original paper can be found here: https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf

From the abstract:

The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k > 20 it is many times faster than the best previously known accelerated k-means method.
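
A sketch of the per-center pruning test behind this idea; the names are illustrative, and the full algorithm also maintains the bounds as the centers move:

```julia
# For one point currently assigned to center a:
#   u      – upper bound on the distance from the point to center a
#   l_c    – lower bound on the distance from the point to center c
#   cc_ac  – exact distance between centers a and c
# By the triangle inequality, center c can only be closer than center a when
# u > l_c and u > cc_ac / 2, so any center failing either test is skipped
# without computing a point-to-center distance.
center_can_be_skipped(u::Real, l_c::Real, cc_ac::Real) = u <= l_c || u <= cc_ac / 2
```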

Hamerly algorithm

Original paper can be found here: https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster

From the abstract:

The k-means algorithm is widely used for clustering, compressing, and summarizing vector data. In this paper, we propose a new acceleration for exact k-means that gives the same answer, but is much faster in practice. Like Elkan's accelerated algorithm (8), our algorithm avoids distance computations using distance bounds and the triangle inequality. Our algorithm uses one novel lower bound for point-center distances, which allows it to eliminate the innermost k-means loop 80% of the time or more in our experiments. On datasets of low and medium dimension (e.g. up to 50 dimensions), our algorithm is much faster than other methods, including methods based on low-dimensional indexes, such as k-d trees. Other advantages are that it is very simple to implement and it has a very small memory overhead, much smaller than other accelerated algorithms.
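
A sketch of the single-bound test that lets the algorithm skip the innermost loop; the names are illustrative:

```julia
# For one point currently assigned to center a:
#   u    – upper bound on the distance to center a
#   l    – single lower bound on the distance to any *other* center
#   s_a  – distance from center a to its nearest other center
# If u <= max(l, s_a / 2) no other center can be closer, so the loop over all
# k centers is skipped for this point. In the full algorithm, when the test
# fails, u is first tightened to the exact distance and the test is retried.
inner_loop_needed(u::Real, l::Real, s_a::Real) = u > max(l, s_a / 2)
```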

Geometric methods to accelerate k-means algorithms

Paper link: http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf

From the abstract:

The k-means algorithm is popular for data clustering applications. Most implementations use Lloyd’s algorithm, which does many unnecessary distance calculations. Several accelerated algorithms (Elkan’s, Hamerly’s, heap, etc.) have recently been developed which produce exactly the same answer as Lloyd’s, only faster. They avoid redundant work using the triangle inequality paired with a set of lower and upper bounds on point-centroid distances. In this paper we propose several novel methods that allow those accelerated algorithms to perform even better, giving up to eight times further speedup. Our methods give tighter lower bound updates, efficiently skip centroids that cannot possibly be close to a set of points, keep extra information about upper bounds to help the heap algorithm avoid more distance computations, and decrease the number of distance calculations that are done in the first iteration.

Additional information can be found on Dr. Hamerly's page: http://cs.baylor.edu/~hamerly/software/kmeans.php

GitHub repository with implementation: https://github.com/ghamerly/fast-kmeans

Coresets

Strong coresets

Original paper "Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures" can be found here: http://proceedings.mlr.press/v51/lucic16-supp.pdf

From the abstract:

Coresets are efficient representations of datasets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem. We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the Squared Euclidean distance, the Mahalanobis distance, KL-divergence, and Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and Exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in empirical evaluation.

Lightweight Coresets

Original paper "Scalable k-Means Clustering via Lightweight Coresets" can be found here: https://arxiv.org/abs/1702.08248

From the abstract:

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct lightweight coresets for k-means clustering as well as soft and hard Bregman clustering. The algorithm is substantially faster than existing constructions, embarrassingly parallel, and the resulting coresets are smaller. We further show that the proposed approach naturally generalizes to statistical k-means clustering and that, compared to existing results, it can be used to compute smaller summaries for empirical risk minimization. In extensive experiments, we demonstrate that the proposed algorithm outperforms existing data summarization strategies in practice.
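
A sketch of the lightweight coreset construction as described above, written in Julia; `lightweight_coreset` is an illustrative name and no numerical safeguards are included:

```julia
using Statistics, StatsBase

# Lightweight coreset for k-means: sample points from a mixture of the uniform
# distribution and a distribution proportional to the squared distance to the
# data mean, then attach importance weights. X is d × n (columns are points).
function lightweight_coreset(X::AbstractMatrix{<:Real}, m::Integer)
    n   = size(X, 2)
    μ   = mean(X, dims = 2)                             # data mean
    d2  = vec(sum(abs2, X .- μ, dims = 1))              # squared distances to the mean
    q   = 0.5 / n .+ 0.5 .* d2 ./ sum(d2)               # sampling distribution
    idx = sample(1:n, Weights(q), m; replace = true)    # i.i.d. sample of m points
    w   = 1 ./ (m .* q[idx])                            # importance weights
    return X[:, idx], w
end
```

The sampled points and weights can then be handed to a weighted k-means solver; Clustering.jl's `kmeans`, for example, accepts a `weights` keyword.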

Mini-batch KMeans

Paper: https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

From the abstract:

We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast ε-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml
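
A bare-bones sketch of the mini-batch update rule from the paper, written in Julia with illustrative names (no stopping criterion and no sparsity/projection step):

```julia
# X is d × n (columns are points); C is d × k (columns are centers) and is
# updated in place.
function minibatch_kmeans!(C::AbstractMatrix{Float64}, X::AbstractMatrix{Float64};
                           batchsize::Int = 100, iters::Int = 100)
    n, k = size(X, 2), size(C, 2)
    counts = zeros(Int, k)                        # how often each center was updated
    for _ in 1:iters
        batch = rand(1:n, batchsize)              # sample a mini-batch of point indices
        # Cache the nearest center for every point in the batch.
        nearest = [argmin([sum(abs2, X[:, j] .- C[:, c]) for c in 1:k]) for j in batch]
        for (j, c) in zip(batch, nearest)
            counts[c] += 1
            η = 1 / counts[c]                     # per-center learning rate from the paper
            C[:, c] .= (1 - η) .* C[:, c] .+ η .* X[:, j]
        end
    end
    return C
end

# Illustrative usage: 8 centers initialized from randomly chosen data points.
X = rand(5, 10_000)
C = X[:, rand(1:size(X, 2), 8)]
minibatch_kmeans!(C, X)
```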

Other considerations

KModes - Clustering with Categorical Values.

Paper: Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.15.4028&rep=rep1&type=pdf

From the abstract:

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
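
A sketch of the two ingredients the abstract highlights, the simple matching dissimilarity and the mode update, in Julia with illustrative names:

```julia
using StatsBase  # provides `mode`

# Simple matching dissimilarity: the number of attributes on which two
# categorical records disagree.
matching_dissim(a::AbstractVector, b::AbstractVector) = count(a .!= b)

# Mode update for one cluster: the most frequent category for each attribute.
# `members` is an attributes × points matrix of categorical values.
cluster_mode(members::AbstractMatrix) = [mode(view(members, i, :)) for i in 1:size(members, 1)]

matching_dissim(["red", "small", "round"], ["red", "large", "round"])  # == 1
```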

DeepCluster

Paper: Deep Clustering for Unsupervised Learning of Visual Features (https://arxiv.org/pdf/1807.05520.pdf).

From the abstract:

Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network. We apply DeepCluster to the unsupervised training of convolutional neural networks on large datasets like ImageNet and YFCC100M. The resulting model outperforms the current state of the art by a significant margin on all the standard benchmarks.