What you shouldn't need to know about Big Data and Machine Learning.

Feb 21, 2013

As an end-user of an analytics application, you should really not need to worry about how machine learning and Big Data relate. But if you are considering how to provide analytics services to your company or customers, you should at least have a high-level understanding of how these technologies interact. My goal in this post is to help clarify the relationship between Big Data and Machine Learning: when do you need to be able to handle Big Data to benefit from Machine Learning, and when does Machine Learning benefit from Big Data?

Big Data and machine learning are two hot areas that are driving loads of innovation. Some people think they need a lot of data to apply machine learning, and others think they need machine learning to do anything with their Big Data. But is this always true? You can learn a lot from your Big Data without complex machine learning algorithms; transforming data and calculating aggregates are two such examples. With that said, you can enable very powerful analysis by using machine learning techniques on large data sets. This post will help you better understand the relationship between Big Data and machine learning, and when to use them together.

What do we mean by Big Data?

Big Data is a very active topic right now. We generate more data than ever before, and everyone agrees that we should use it to improve decisions, understand customers, detect system anomalies, etc. A study by the IBM Institute for Business Value [Analytics: the real-world use of big data] states that businesses associate Big Data with a greater scope of data collection and analysis. Companies also associate it with the techniques for handling massive amounts of data.

According to the IBM study, only 10% of companies say that Big Data is about large volumes of data. 18% of organizations consider Big Data to be related to an increasing scope of available information. 17% associate it with new kinds of analysis while 16% associate real-time analysis with Big Data.

For the purpose of this post, Big Data refers to a system’s ability to handle a data set regardless of its size. These data sets can be huge, such as billions of purchase transactions, or much smaller. With Big Data methods, we can process the data, regardless of size, within a time that makes the results practical and useful.

Learning is the key word in Machine Learning.

Machine learning refers to many different techniques from the statistical, mathematical, and computer science communities. But at their core, these methods are all about creating data models that help us understand and predict the world. For example, we can create a language detection model using texts and their associated languages, such as English or Swedish [the training data]. Then, we can use this model to derive, with some acceptable accuracy, the language of a specific text.

There are many Machine Learning algorithms. At the highest level we categorize them into supervised and unsupervised. A supervised model learns to predict a specific outcome using labeled data. For example, we build models that predict propensity to buy, using training data that labels which people did and did not buy. Statistical regression is a typical example of a supervised learning algorithm. In contrast, an unsupervised algorithm finds patterns in your data when that data does not have any labels. For example, we can build a customer segmentation model when we do not know how many types of customers we have.

Both supervised and unsupervised learning have two major stages: model building and scoring. Model building is when we create the model using training data; scoring occurs when we use the model to make a prediction about some new data.
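To make the two stages concrete, here is a minimal sketch of the supervised case: a toy language detection model. It assumes scikit-learn is available, and the texts, labels, and pipeline are illustrative only, not taken from any particular product.

```python
# Toy supervised example: model building, then scoring.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Labeled training data: texts plus the language label we want to predict.
texts = ["the cat sat on the mat", "hello how are you",
         "katten satt på mattan", "hej hur mår du"]
labels = ["English", "English", "Swedish", "Swedish"]

# Model building: learn which character n-grams go with each language.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(texts, labels)

# Scoring: apply the trained model to new, unlabeled text.
print(model.predict(["hur mår katten"]))  # likely ['Swedish'], though with toy data any prediction is shaky
```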

What data is required to learn?

So what is the connection between machine learning and Big Data? Do you need Big Data to do machine learning? To cut to the chase: no. You do not always need Big Data. However, if you can apply machine learning to Big Data, then you can create some very powerful tools and insights you wouldn’t see any other way.

In many situations you can apply machine learning to a sample [subset] of the population that you care about. A sampled data set can be quite small, and in many cases you can do the processing on an ordinary computer. This works when you are looking for a simple relationship, such as one that follows a normal distribution. For example, will one campaign return at least $10 more per customer than another campaign? In this hypothetical situation, we only care about big effect sizes, so we should be able to reliably detect the effect in a small sample. If the effect size is small, we may not detect it, but such a small effect would be uninteresting anyway.
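As a rough illustration of that point, here is a small simulation with made-up numbers: a hypothetical $10 difference in mean return per customer, detected from a modest sample. It assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500                                              # customers sampled per campaign
campaign_a = rng.normal(loc=50, scale=30, size=n)    # mean return ~$50 per customer
campaign_b = rng.normal(loc=60, scale=30, size=n)    # mean return ~$60 per customer

t_stat, p_value = stats.ttest_ind(campaign_a, campaign_b)
print(f"observed difference: ${campaign_b.mean() - campaign_a.mean():.2f}")
print(f"p-value: {p_value:.4g}")                     # typically far below 0.05 at this sample size
```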

The key to good results: data, samples and effect size.

Problems occur when the distribution [in a statistical sense] we are trying to model is not a simple one. Another problem occurs if there is noise in the data. Noise makes it more difficult to detect patterns and leads to the need for more data to create accurate models. For instance, noise could mean that some customers have missing or incorrect data.
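One rough way to see why noise demands more data is a standard two-sample power approximation: the sample size needed to detect a fixed difference grows with the square of the noise. The numbers below are illustrative, not from the post.

```python
import math

def required_n(effect=10.0, sigma=30.0, z_alpha=1.96, z_power=0.84):
    """Approximate per-group sample size for ~80% power at the 5% level."""
    return math.ceil(2 * ((z_alpha + z_power) * sigma / effect) ** 2)

for sigma in (15, 30, 60, 120):
    print(f"noise sigma={sigma:>4}: need about {required_n(sigma=sigma):>5} customers per group")
```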

An even bigger problem occurs if you create a sample that is biased relative to the population. If the sample is biased, the model will be biased, which of course leads to biased scoring. Bias is not a problem if we use the complete data set (you have all the data!). Let’s assume we want to predict the lifetime of people across the whole world. If we limited our training set to the US population, we would create a large but biased training data set. If we used data from the whole world, we would be able to predict the lifetime of any newborn child anywhere in the world more accurately.
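Here is a toy simulation of that failure mode, with entirely made-up numbers: a large sample drawn from only one subpopulation misses the true worldwide average, while a smaller representative sample gets close. It assumes NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical lifetimes for one country and for the rest of the world (illustrative only).
one_country   = rng.normal(loc=79, scale=5, size=50_000)
rest_of_world = rng.normal(loc=71, scale=8, size=1_000_000)
world = np.concatenate([one_country, rest_of_world])

biased_sample = rng.choice(one_country, size=20_000)   # large but biased
random_sample = rng.choice(world, size=5_000)          # smaller but representative

print(f"true world mean:    {world.mean():.1f}")
print(f"biased sample mean: {biased_sample.mean():.1f}  (systematically too high)")
print(f"random sample mean: {random_sample.mean():.1f}")
```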

May we have some simple rules?

The table below provides a simplified synopsis of how Big Data and machine learning are related. In this context, Big Data denotes the ability to apply machine learning to complete data sets.

Situation                        Sample                             Complete data set [Big Data]
Large effect, simple relation    Sufficient                         Overkill
Complex or noisy relations       Large sample may be sufficient     Beneficial
Biased sample                    Acceptable in some situations      Prevents the bias

If you are looking for a large effect size and a simple relation, then Big Data is overkill. If the relations are complex or noisy, you will benefit from using the complete data set [Big Data], although large samples may be sufficient. If your samples are biased you are in trouble; Big Data prevents this problem, while a large sample may give acceptable results in some situations.

One should note that even if model building can sometimes be done with sampling, in many situations scoring is still done on the complete data set. Scoring is normally less computationally intensive, so the benefit of using subsets is diminished, even if scoring on a subset would make sense in a particular situation.

If you take away anything, take away this.

Machine learning has no direct dependency on Big Data. Machine learning can be just as useful when you have a small sample of data as when you process a massive data set. In general, the ability to process Big Data will provide more accurate and powerful predictions and insights. Sometimes you will need Big Data to produce useful results, due to issues with data quality, complex relations, small effect sizes, and so on, but Big Data is not always required.

Correspondingly, you can think of Big Data as completely independent of machine learning. For example, you might want to know how many of your customers have purchased items from at least three distinct product categories; no machine learning is involved in that calculation. Even for these types of data analytics, sampling can be useful. For example, let’s assume you want to know the average salary of individuals in a particular zip code. You can create a highly accurate estimate of this value by calculating the average over a sample.
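For completeness, here is what that first, machine-learning-free calculation might look like, sketched with pandas over a toy purchase table; the column names are assumptions, not from the post.

```python
import pandas as pd

# Toy purchase transactions; customer_id and category are hypothetical column names.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "category":    ["books", "toys", "garden", "books", "books", "toys"],
})

# How many customers bought from at least three distinct product categories?
categories_per_customer = purchases.groupby("customer_id")["category"].nunique()
print((categories_per_customer >= 3).sum())   # 1 in this toy data (customer 1)
```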

If you build your own data processing infrastructure, you need to know how to deal with all of this: when you can sample and when you really need Big Data. If you are one of our customers, we manage this for you. If you are a marketer, you really should not need to know all this detail to become a data-driven organization.

Thanks to Daragh Sibley for interesting feedback and discussions on the topic.