Data Clustering? Don't worry about the algorithm.

May 07, 2013

We are constantly pushing to improve our underlying algorithms and make them as adaptive as possible. Taking a step back, our general problem is to fit classes of models and algorithms to customer data sets of varying quality. In addition, we need to automate this so that we can scale delivery of our offerings from a business perspective.

This high-level business goal boils down to a number of technical requirements. It means we need ways of automatically evaluating results across customer data sets and algorithm adaptations, and we need to do this in many different contexts.

One of our engineers, Gianmario Spacagna, took a fresh look at how to tune clustering algorithms. In this blog post, I will briefly introduce validation of clustering algorithms so that you can more easily appreciate and understand Gianmario’s upcoming post.

The American writer Mark Russell once said, “The scientific theory I like best is that the rings of Saturn are composed entirely of lost airline luggage.” I believe many developers and marketers feel just as lost when presented with some supposedly great clustering. It is very difficult to evaluate a claim of optimal data clustering without understanding some basic concepts about how to evaluate and dissect clustering results. Without at least apprentice-level knowledge of cluster validation, confirming or disproving an assertion of optimal clustering becomes a leap of faith rather than a reasoned judgment. I will attempt to give you enough information to ask the right questions.

Clustering involves grouping individuals in a population so that individuals in the same cluster share some properties of interest. Algorithmically, the goal of a clustering algorithm is to minimize or maximize some function over the individuals in the same cluster; we call this the objective function. Frequently, we think of it as some type of distance between the individuals. For example, say we group customers based on their lifetime value: once we have predicted the lifetime value for each customer, we can group customers based on these values, and potentially on other values such as gender and age. In addition to formulating an objective function, it is key to pick the appropriate values to consider in your clustering. These values are often called the features.
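To make this concrete, here is a minimal sketch of what such a clustering might look like in code. The tiny data set, the choice of predicted lifetime value and age as features, and the use of scikit-learn's k-means are all illustrative assumptions, not a description of our actual pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix: one row per customer, with predicted
# lifetime value and age as the two features. In practice you would
# scale features to comparable ranges before clustering.
customers = np.array([
    [120.0, 23.0],   # [predicted_lifetime_value, age]
    [115.0, 25.0],
    [480.0, 33.0],
    [900.0, 41.0],
    [950.0, 39.0],
])

# k-means minimizes its objective function: the sum of squared
# distances from each individual to its cluster's center.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(model.labels_)   # which cluster each customer landed in
print(model.inertia_)  # the objective function's value at convergence
```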

There are many different algorithms for actually performing the clustering. Clustering algorithms can be very computationally intensive, and the complexity grows quickly with the size of the data set. Each algorithm attempts to deal with this complexity in a different way.

One common technique in clustering is to use some kind of heuristic that allows the algorithm to reduce the number of calculations. As an example, the k-means algorithm starts from a guessed set of clusters. It uses this guess to keep the computation manageable, then iterates, moving individuals between clusters as the structure evolves, in the hope of converging on an adequate clustering.
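A bare-bones version of that loop (the standard Lloyd-style iteration, sketched from scratch with NumPy rather than taken from any particular library) makes the guess-then-refine heuristic easy to see:

```python
import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    """Guess k centers, then alternately assign each point to its
    nearest center and move each center to the mean of its points."""
    rng = np.random.default_rng(seed)
    # The initial "guess": k individuals drawn at random become centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # One distance per point/center pair -- k*n computations,
        # rather than comparing every pair of individuals.
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the individuals assigned to it.
        # (A production version would also handle clusters that go empty.)
        moved = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(moved, centers):
            break  # converged: no center moved, so assignments are stable
        centers = moved
    return labels, centers
```

The saving comes from that assignment step: each pass costs one distance per point/center pair instead of growing with the square of the population size.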

The key question I’d like to explore is how we can determine whether a clustering is valid. The procedure of evaluating the result of a clustering algorithm is called cluster validity. There are three different perspectives on cluster validation: external, internal and relative. External validation compares the result to some pre-defined external structure that reflects an intuition about what should characterize a cluster. It could be as simple as eyeing the clusters, or it could be more sophisticated, using pre-labeled data. Relative validation compares the results of multiple runs of an algorithm under different parameter settings.

Internal validation refers to some internal properties of the clustering. Compactness is one criterion, and it requires that clustered individuals be as close to each other as possible given an objective function. Separation is another criterion, and it requires that clusters be as clearly separated from each other as possible given an objective function.
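Both criteria are easy to measure directly once you have cluster assignments. Here is a sketch, assuming Euclidean distance as the objective; the function names are mine, not standard terminology from any library:

```python
import numpy as np

def compactness(points, labels):
    """Mean distance from each individual to its cluster's centroid;
    lower values mean tighter clusters."""
    total = 0.0
    for j in np.unique(labels):
        members = points[labels == j]
        total += np.linalg.norm(members - members.mean(axis=0), axis=1).sum()
    return total / len(points)

def separation(points, labels):
    """Smallest distance between any two cluster centroids;
    higher values mean more clearly separated clusters."""
    centroids = np.array([points[labels == j].mean(axis=0)
                          for j in np.unique(labels)])
    pairwise = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    return pairwise[pairwise > 0].min()
```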

Gianmario employs a few different techniques in his internal evaluation, including AIC, Davies-Bouldin and Silhouette. For external validation, he uses the Adjusted Rand index, a statistical measure of similarity between clusterings. He uses it to compare clustering results with previously known clusters determined to be of a certain quality. This way, he achieves an external validation of the clustering results.
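Off-the-shelf implementations of most of these measures exist. Here is a sketch using scikit-learn (not the tooling from Gianmario's thesis), with synthetic labeled data standing in for the previously known clusters; AIC is shown via a Gaussian mixture, since it is defined for probabilistic models:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.mixture import GaussianMixture

# Synthetic labeled data stands in for "previously known clusters
# determined to be of a certain quality".
points, known_labels = make_blobs(n_samples=300, centers=3, random_state=0)
predicted = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

# Internal validation: needs no ground truth.
print(silhouette_score(points, predicted))      # in [-1, 1], higher is better
print(davies_bouldin_score(points, predicted))  # lower is better

# AIC applies to probabilistic models, so a Gaussian mixture gives one view.
gmm = GaussianMixture(n_components=3, random_state=0).fit(points)
print(gmm.aic(points))                          # lower is better

# External validation: agreement with the known clusters, corrected
# for chance; 1.0 means the two partitions are identical.
print(adjusted_rand_score(known_labels, predicted))
```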

So, when presented with a clustering, don’t worry too much about the algorithm. Instead ask:

  1. What evaluation criteria were used in validating the clusters? Were they internal or external?
  2. What was the objective function of the algorithms?
  3. What properties were considered for internal validation, and what were the results?
  4. If external validation was done, what pre-defined structure or intuition was used?

If you are curious about the algorithm, ask what assumptions or heuristics it uses. Each scalable algorithm makes some assumptions; the question is whether they make sense for your problem. We will expand on this in a future blog post.

During 2012, Gianmario Spacagna looked into auto-tuning of clustering algorithms. He titled his thesis “TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering.” Gianmario was a student at the Polytechnic of Turin and the Royal Institute of Technology in Stockholm. He will provide a more detailed summary of his work in an upcoming blog post.

In the meantime, if you would like more information, you can read Gianmario’s thesis presentation, which he posted here: http://www.slideshare.net/gmspacagna. He and AgilOne also published the source code on GitHub if you are interested in looking into it in more detail: https://github.com/gm-spacagna/tunup.