Data Quality is King: When a Prediction API is Inadequate

Mar 4, 2013

Not as easy as it seems

There is a lot of excitement about being able to introduce predictions and recommendations into applications. The goal is to improve the user experience, increase adoption, and improve conversion rates. In some instances services want to analyze fairly structured data such as transactions and web clicks. Other times the analysis concerns unstructured data such as text or short posts with their own “language”, such as tweets. This has created an opportunity for cloud-based Prediction APIs. In the last few years I have seen many different Prediction APIs appear, and some of them disappear.

In this post I will argue that most Prediction APIs are not as easy to make useful as it may first appear. The “prediction” part is not the main challenge in incorporating prescriptive features into an application. I am not saying that you shouldn’t use these APIs, but don’t expect the current breed of Prediction APIs to give you an out-of-the-box experience. Let me explain why.

Steps of a machine-learning process

When you compute predictions you normally use some sort of machine-learning algorithm. These algorithms need good data to give good results. The GIGO idiom (garbage in, garbage out) is more applicable than ever. Supervised algorithms rely on good training data to create good models and provide adequate scoring. Unsupervised algorithms need representative features to work properly. The common theme is that garbage data in results in garbage predictions out.

The figure below illustrates the high-level steps and algorithms that data typically must pass through to provide adequate results. [Admittedly somewhat of a simplification]

Data Quality is the step that involves taking raw data and cleaning it so that we have a base data set that satisfies certain fundamental quality criteria. If we are concerned with semi-structured data, quality may encompass things like making sure addresses are valid and complete, names are sensible, numbers that are expected to be positive actually are positive, garbage rows are removed, and so on. It can also include entity resolution and record linkage so that we get a more accurate picture of what is going on.
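To make this concrete, here is a minimal sketch of such a quality pass using pandas; the column names (customer_id, amount, and so on) are hypothetical, and a real pipeline would be considerably more involved:

```python
# Minimal data-quality sketch; column names are illustrative assumptions.
import pandas as pd

def basic_quality_pass(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few fundamental quality checks to raw transaction records."""
    df = df.copy()

    # Drop garbage rows: records missing fields we cannot work without.
    df = df.dropna(subset=["customer_id", "amount"])

    # Numbers that are expected to be positive must actually be positive.
    df = df[df["amount"] > 0]

    # Normalize free-text fields so trivially different spellings line up,
    # a (very) crude stand-in for real entity resolution / record linkage.
    df["customer_name"] = df["customer_name"].fillna("").str.strip().str.lower()
    df["email"] = df["email"].fillna("").str.strip().str.lower()

    # Collapse exact duplicates that survived the steps above.
    df = df.drop_duplicates(subset=["customer_id", "order_id"])

    return df
```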

Data preparation is the next step in the process, and I consider it somewhat model-specific. By this I mean that what you do in this step depends largely on what type of model you are trying to build. For example, in this step we identify and remove outliers that would skew our model, and we may normalize the data. When records are missing data in specific fields, we may try to approximate the missing values. The goal is to have a representative data set that allows us to create as good a model as possible. Since models are also domain-specific, I believe it is safe to say that if you are building predictive models for a specific problem domain you can do a better job preparing the data, simply because you can make more accurate assumptions and hence prepare the data better.
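A minimal sketch of this step, again assuming pandas and purely illustrative choices (median imputation, a three-standard-deviation outlier rule, z-score normalization):

```python
# Hypothetical preparation step: imputation, outlier removal, and scaling.
import numpy as np
import pandas as pd

def prepare_features(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Prepare numeric columns for modeling; the choices here are model-specific."""
    df = df.copy()
    for col in columns:
        # Approximate missing values with the median, a simple imputation rule.
        df[col] = df[col].fillna(df[col].median())

        # Remove rows more than 3 standard deviations from the mean,
        # one common (and crude) outlier rule.
        mean, std = df[col].mean(), df[col].std()
        df = df[np.abs(df[col] - mean) <= 3 * std]

        # Normalize to zero mean and unit variance.
        df[col] = (df[col] - df[col].mean()) / df[col].std()
    return df
```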

Feature selection involves selecting the data from records that we use for model building. The term feature refers to the attributes or fields of the data that are most relevant for the model we are building. A feature can be taken straight from the input data or generated by a function applied to the input data. For example, we might decide that each noun in a text is a feature; a generated feature could map each word to its stem. One goal is to reduce dimensionality and avoid redundant or interrelated features. We also do not want to include features that are not relevant to the model at hand. For structured data, the features may be specific attributes such as revenue, last purchase date, number of purchases, etc. For unstructured data, features are often words, word stems, n-grams, etc.
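For unstructured text, a minimal sketch of stemming and n-gram extraction, assuming NLTK and scikit-learn are available (my choice of tooling for illustration):

```python
# Sketch of turning raw text into features: stems plus unigrams and bigrams.
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stem_tokens(text: str) -> str:
    """Map each word to its stem so 'purchase' and 'purchased' become one feature."""
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

documents = ["Customers purchased two items", "A customer purchases one item"]
stemmed = [stem_tokens(doc) for doc in documents]

# Unigrams and bigrams as features; the fitted vocabulary is our feature space.
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(stemmed)
print(sorted(vectorizer.vocabulary_))
```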

Model Building, Model Selection and Scoring refer to the traditional steps of machine learning. When you look at software and literature related to machine learning, these are the steps they tend to focus on. Based on the features, and on labeled target variables in the case of supervised learning, we create a model that is later used for scoring.
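A minimal supervised sketch of these steps, assuming scikit-learn and synthetic stand-in data in place of a real prepared data set:

```python
# Train on labeled data, hold out a test set, then score the held-out records.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Stand-in for prepared features and labels (e.g. "did the customer buy again?").
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 4))                          # 500 records, 4 features
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)   # labels with some signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Scoring: probability that each held-out record belongs to the positive class.
scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```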

I am sure you realize by now that if we feed in bad-quality data we will ultimately end up with models that do not perform well. Missing values or outliers can significantly impact models. Analysis of unstructured data will suffer if we have a lot of bad data such as noisy tokens, misspellings, and words that are not “real” words.

Predictive APIs

There is a long list of services that provide predictive APIs. Google Prediction API has been around for a while, and so has Saplo.com. The last year or so has brought some new ones such as BigML, Sociocast and PriorKnowledge. I am sure there are quite a few more that I omitted [sorry]. AgilOne also provides a Prediction API, although with a different twist than many other vendors.

All vendors naturally provide various ways of uploading or referring to data. Data can of course come in many different formats and at many different levels of quality, as illustrated by the diagram below. For the purpose of this discussion I view it as a continuum from completely unstructured data, such as text, to very structured data where we can expect a well-defined format and well-defined types. As far as I can see, Prediction APIs tend to address one of the two extremes. Either they process unstructured data such as texts of different sizes, providing tagging and various forms of text analysis, or they focus on fairly structured data, either requiring that users prepare and cleanse the data before uploading it or making predictions on very structured, machine-generated data such as web-click data.

By focusing on these two extremes they avoid as much of the data problem as possible. It means that all of these services mainly focus on the Model Building and Scoring steps, assuming customers can provide data of adequate quality. That said, some services do provide limited data preparation capabilities. If you are building an application that is business- and marketing-oriented, much of your data will fall into the middle, semi-structured category.

The importance of data quality

This essentially means that anyone who wishes to use predictive services with data that is semi-structured and dirty will need to handle the Data Quality through Feature Selection steps in some other way. And I claim that if you are doing anything related to business transactions, marketing analytics, etc., you will likely have data that needs preparation. You also need to understand enough of what you are trying to do to perform feature selection [separate post on this in the future], and that is not trivial. On the contrary, it often requires that you understand the data well enough to reduce the number of features, look at interdependencies, remove noise, etc. There are algorithms that can help you with this, but you still need significant in-depth knowledge about this step.
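As one example of the kind of algorithmic help available, here is a minimal sketch that drops one feature out of every highly correlated pair; the 0.9 threshold is an arbitrary illustration, and in practice you would combine a rule like this with domain knowledge:

```python
# Sketch of pruning interdependent features: drop one of each highly
# correlated pair. Assumes a numeric feature DataFrame; threshold is illustrative.
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Remove features that are nearly redundant with an earlier feature."""
    corr = df.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])
    return df.drop(columns=sorted(to_drop))
```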

The diagram below illustrates that Prediction APIs are great at handling certain machine-learning algorithms; however, they often lack what you need for the earlier steps of the data processing pipeline.

Ok? What now?

Before you make assumptions about what predictions you can do in your application and what service you should be using, make sure you know your data. Are you dealing with bad data that may need to be corrected and completed? Do you deal with structured, semi-structured or unstructured data, or all of the above? Even if the data is clean, is it likely that you have outliers or incorrect or missing values? If you, against all odds in my mind, have a good understanding of the structure of the data, and the data is clean and well prepared, you can go ahead and pick a Prediction API and off you go. If you think you may have data issues, you must first decide how you want to deal with the data. Do you want to use a service for preparing it? Do you do it in-house? What tools and algorithms do you need to prepare the data?

I am not suggesting you have to do a large, time-consuming analysis before you can get started. You should make sure you start with clean data for your initial trials, or at least manage your own expectations about the results based on the quality of the data. Ultimately you need to address your data if you want something useful. Having no predictions is better than having inaccurate or bad predictions that could be misleading.

At AgilOne we strongly believe in the truth of GIGO. That’s why we developed a sophisticated data quality infrastructure that can process billions of records effectively. The cleansed data feeds directly into our predictive pipeline so that our customers do not need to worry about GIGO. Predictions can drive tremendous benefits, but they must be based on the best possible input.