As tools in the big data world emerge and mature, the question becomes how much of the data to save at high versus low resolution. The answer depends on how the data will be used. I recently had lunch with someone from Yahoo, where they were doing modeling on full-resolution data and claimed that you need big-data tools (Hadoop, Mahout) to build predictive models.
The need for more data in predictive algorithms only arises when the number of predictive independent variables is large. More variables require larger datasets to train classification models (see the curse of dimensionality, described by Richard Bellman, the father of dynamic programming). In any case, big data gives us 1-2 orders of magnitude more processing power, which only buys room for a few more variables, since the volume of data required increases exponentially with each new variable.
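As a rough illustration of that exponential growth (the "10 bins per variable" figure below is an assumption for the sketch, not a number from the text), the samples needed to cover a joint feature space at fixed density grow as bins^variables:

```python
# Sketch: samples needed to observe each cell of the joint
# feature space once grow exponentially with dimensionality.
# The 10-bins-per-variable choice is purely illustrative.

def samples_needed(num_variables, bins_per_variable=10):
    """Samples required to cover every cell of the joint space once."""
    return bins_per_variable ** num_variables

for d in (2, 3, 5, 8):
    print(d, samples_needed(d))
```

So a 100x jump in processing power (two orders of magnitude) only covers about two extra variables at this density, which is the point made above.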
Perhaps the more important questions to ask are why we need the data and how much of it we need to do what we need to do. We provide marketing analytics to our clients, so our focus is marketing. In the case of mining web analytics logs, there are a few primary uses:
- Revenue Attribution
- Triggering marketing actions
- Building temporal statistics on customer actions
These topics require data to be saved for varying lengths of time.
Determining how much to keep after the initial 90 days or so depends on the modeling uses. If the models being built have a natural 3-4% response rate, you need to retain data at approximately double that rate, so that negative-outcome events are properly represented (in effect, oversampling the success events). This level of data retention is enough for most propensity and event modeling exercises, since the resulting dataset is still quite large.
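A minimal sketch of that retention policy (the 3% natural response rate and the 6% target rate below are illustrative assumptions, as is the `responded` field): keep every responder and a random sample of non-responders, sized so the retained set's positive rate is roughly double the natural one.

```python
import random

def retain(events, target_positive_rate=0.06, seed=42):
    """Keep all positive (response) events and downsample negatives so
    that positives make up roughly target_positive_rate of the result.
    Each event is assumed to be a dict with a boolean 'responded' key."""
    rng = random.Random(seed)
    positives = [e for e in events if e["responded"]]
    negatives = [e for e in events if not e["responded"]]
    # Number of negatives such that pos / (pos + neg) ~= target rate.
    n_neg = int(len(positives) * (1 - target_positive_rate) / target_positive_rate)
    kept_negatives = rng.sample(negatives, min(n_neg, len(negatives)))
    return positives + kept_negatives

# Example: 3% natural response rate; the retained set lands near 6%.
events = [{"responded": i < 300} for i in range(10000)]
sample = retain(events)
rate = sum(e["responded"] for e in sample) / len(sample)
```

The fixed seed just makes the sketch reproducible; in practice the sampling would happen as part of the log-archival job.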