This blog post is the first in a two-part series describing work done on self-tuning clustering algorithms.
This first post describes the context of the problem and outlines the basic approach. The second part will focus on the experiments, implementation experiences and conclusions.
Any data analytics service provider that wants to run a high-margin, sustainable business has to deal with scalability, multi-tenancy and self-adaptability. Machine learning is a very powerful instrument for Big Data applications, but a bad choice of algorithm can lead to poor results from the intended analysis. One way to mitigate this is to automate the tuning process. Such a tuning process should require no a priori knowledge of the data and no human intervention. As a Big Data Engineer at AgilOne, I worked on solutions for this open self-tuning problem. The work led to the development of TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering. The result is a solution that automatically evaluates and tunes data clustering algorithms, so that clustering-based analytics services can self-adapt and scale in a cost-efficient manner.

Evaluating clusters

For the initial work we chose K-Means as our clustering algorithm. K-Means is a simple but popular algorithm, widely used in many data mining applications.
The K-Means method is an iterative local search algorithm that partitions n data points into k clusters. It starts by seeding k initial cluster centroids, then assigns every data point to its closest centroid and recomputes the new centroids. This process of assigning data points and readjusting the centroids is repeated until it stabilizes. We configured K-Means with values of k ranging from 2 to 40 and a set of admissible distance measures: Angular, Chebyshev, Cosine, Euclidean, Jaccard Index, Manhattan, Pearson Correlation Coefficient, Radial Basis Function Kernel, and Spearman Footrule. Any configuration of k and distance metric may produce very different output clusters, so in order to evaluate each configuration we need a way to numerically score the quality of the clustering output. We adopted four internal techniques, namely AIC, Dunn, Davies-Bouldin and Silhouette, and one external evaluation technique referred to as AdjustedRand.
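The assign-then-recompute loop described above can be sketched in a few lines. This is a minimal pure-Python illustration with a pluggable distance function, not TunUp's actual implementation; all names here are my own:

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans(points, k, distance, max_iter=100):
    """Lloyd-style K-Means: seed k centroids, assign, recompute, repeat."""
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stabilized
            break
        centroids = new_centroids
    return centroids, clusters
```

Swapping `euclidean` for `manhattan` (or any of the other admissible measures) changes which points end up grouped together, which is exactly why the choice of distance is part of the configuration being tuned.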
An internal evaluation typically assigns better scores to clusterings with high intra-cluster similarity and low inter-cluster similarity. External evaluation uses known or expected results: benchmarks consisting of a set of pre-classified items, utilized as a gold standard for the evaluation (AdjustedRand). AIC is an acronym for the Akaike Information Criterion, a measure of the relative fitness of a statistical model based on the concept of information entropy. Applied here, it models K-Means as a Gaussian mixture with hard assignment, uniform cluster priors and identical spherical covariance matrices. The AdjustedRand index is an external validation measure of the similarity between two data clusterings. We used it to measure the similarity between the K-Means output and the a priori known clusters.
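As an illustration of the external measure, the pair-counting behind the AdjustedRand index can be written out directly from the standard formula. This is a minimal pure-Python sketch (function name and structure are mine, not TunUp's code):

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items.

    1.0 means identical clusterings (up to label permutation);
    values near 0 mean agreement no better than chance.
    """
    n = len(labels_a)
    # Pairs of items co-assigned in both clusterings (contingency cells).
    cells = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total = comb(n, 2)
    expected = sum_a * sum_b / total        # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Because the index only counts pairwise co-assignments, it is insensitive to how the cluster labels themselves are numbered, which is what makes it suitable for comparing K-Means output against a priori known classes.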
An external validation technique, as explained before, is based on the assumption of having pre-classified labels for our data. This gives us a way to validate the algorithm against a known data set, which is different from evaluating the internal properties of an algorithm's results. Suppose we have a training dataset, and assume it has the same properties as the unlabelled datasets that we want to cluster. We can then perform a Pearson-coefficient Student's t-test in order to validate and benchmark our internal techniques against AdjustedRand.
Selecting a criterion.
The first problem is to find the best evaluation criterion for a specific test. To do that we generated a number of K-Means configurations for each dataset (we used 120 random samples), measured the Pearson coefficient (r) and computed the t-value (t) and p-value (p) of our statistical hypothesis test. The highest absolute value of r identifies the most accurate and reliable internal technique for further evaluations of similar unlabelled datasets. Given the best evaluation criterion, the main challenge of K-Means is setting the right value of k and the distance measure used to compute the distance between each pair of points in the data space. To address this problem we implemented a Genetic Evolutionary Algorithm that heuristically finds an optimal configuration of our clustering algorithm.
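The correlation test can be sketched as follows. In this hypothetical pure-Python illustration, `xs` would hold one internal technique's scores and `ys` the AdjustedRand scores over the sampled configurations; the t-value follows a Student distribution with n − 2 degrees of freedom:

```python
import math

def pearson_t(xs, ys):
    """Pearson r between two score series, plus the t-value of the
    significance test t = r * sqrt((n - 2) / (1 - r^2))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    r = cov / (sx * sy)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return r, t
```

Running this once per internal technique against the AdjustedRand series and keeping the technique with the highest |r| implements the selection step described above.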
Finding the right configuration.
Genetic Algorithms (GAs) are widely used in the field of artificial intelligence to solve optimization and search problems. A GA is a search heuristic belonging to the class of Evolutionary Algorithms (EAs), inspired by the biological model of evolution and natural selection first proposed by Charles Darwin. The concept is to let a species of individuals adapt to its environment without requiring a priori knowledge of the key factors that make an individual stronger or weaker. The species evolves over time, and each generation is produced by random mutations and feedback provided by deterministic natural selection. In this way the new generation recombines the stronger traits of its parents (non-random natural selection) and discovers potential new traits through random genetic mutations.
To solve a particular engineering problem we need to model an environment in which individuals can evolve. The environment is typically a black-box model with a set of input parameters that characterize a single individual and a fitness output function that encourages the evolution of good solutions. The workflow consists of a population of individuals that iteratively evolve over many generations. The probability of selection is governed by a fitness function that evaluates each individual and a selection strategy that favors the survival of the stronger individuals. If well designed, the fitness of the population should improve at each stage until reaching a convergence point where a good-enough solution is found. An outline of an Evolutionary Algorithm is illustrated by the following diagram:
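Specialized to our search space of (k, distance) pairs, the loop might be sketched like this. The fitness function here is a toy stand-in for the real black box (running K-Means with that configuration and scoring it with the chosen internal technique), and every name is illustrative rather than TunUp's actual code:

```python
import random

# Subset of the admissible distance measures, for brevity.
DISTANCES = ["euclidean", "manhattan", "cosine", "chebyshev"]

def fitness(individual):
    # Toy landscape that pretends k = 7 is optimal; the real fitness
    # would run K-Means with this (k, distance) pair and score the output.
    k, dist = individual
    return -abs(k - 7)

def evolve(pop_size=20, generations=30, mutation_rate=0.2):
    population = [(random.randint(2, 40), random.choice(DISTANCES))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Selection: tournament of two, the fitter individual survives.
        parents = [max(random.sample(population, 2), key=fitness)
                   for _ in range(pop_size)]
        # Crossover: a child takes k from one parent, distance from another.
        children = [(random.choice(parents)[0], random.choice(parents)[1])
                    for _ in range(pop_size)]
        # Mutation: occasionally perturb k or swap the distance measure.
        population = [
            (random.randint(2, 40) if random.random() < mutation_rate else k,
             random.choice(DISTANCES) if random.random() < mutation_rate else d)
            for k, d in children
        ]
        # Elitism: never lose the best configuration seen so far.
        best = max(population + [best], key=fitness)
    return best
```

The mutation step is what keeps the search exploring values of k and distance measures that selection alone would have discarded, which is how the GA avoids getting stuck in a locally good but globally poor configuration.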
In order to scale better as the number of parameters (the search space) grows, we implemented a parallel version of the GA. We first developed a REST API for executing and evaluating a K-Means configuration; the API provides a POST method that takes a JSON request as input and returns a JSON response. We then implemented a master-slave parallel architecture for our evaluator. The parallel evaluator sends requests to the pool of available servers and collects the responses before returning to the main thread, simulating serial evaluation on the local machine. In the second part of this two-part blog post we will describe and analyze the main numerical results of our experiments.
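The master-slave dispatch can be sketched with a thread pool standing in for the HTTP layer. Here `post_to_slave` is a hypothetical stub for the real POST to a slave's REST endpoint, and the toy fitness computed "remotely" is my own placeholder:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def post_to_slave(server, request_json):
    """Stand-in for an HTTP POST to one slave's REST endpoint: the real
    evaluator would send the JSON request over the wire and parse the
    JSON response; here we compute a toy fitness locally."""
    request = json.loads(request_json)
    score = -abs(request["k"] - 7)  # placeholder for the remote K-Means score
    return json.dumps({"individual": request, "fitness": score})

def parallel_evaluate(population, servers):
    """Master side: fan the requests out over the server pool, collect the
    responses, and hand the fitness values back as if evaluated serially."""
    requests = [json.dumps({"k": k, "distance": d}) for k, d in population]
    assigned = [servers[i % len(servers)] for i in range(len(requests))]
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        responses = pool.map(post_to_slave, assigned, requests)
        return [json.loads(r)["fitness"] for r in responses]
```

Because `pool.map` preserves input order, the caller sees exactly the list of scores it would get from a serial loop, which is what lets the GA's main thread stay unchanged while the evaluations run in parallel.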