Notes on our trial of columnar databases for analytics

Feb, 26 2011

In the past 3-4 years we started hearing about a new breed of database vendors. Unlike traditional row-based relational database engines such as , Microsoft SQL Server, Oracle, Sybase, these databases are column oriented.  What this means is that instead of writing and reading a record in row format.  Wikipedia has a deeper explanation of this technology at  http://en.wikipedia.org/wiki/Column-oriented_DBMS .  The theory is old, but the application and commercialization of it is quite new.  It is mostly driven by the change in need from write speed to read speed, especially for really large data sets, such as click stream data.

In marketing analytics, our typical approach is to combine data from multiple systems, such as

  • Transactional (order mgmt system, CRM system, Quoting System)
  • Telephone switches
  • Web analytics platforms (Coremetrics, Omniture etc.)
  • Email Service Providers (ExactTarget, Epsilon etc.)
  • Shipment Delivery (Fedex etc.)
  • Promotion Mgmt systems
  • Keyword Bidding platforms

among others.  The process we go through I believe is  a typical one, where we

  1. Combine data to come up with a 360 view of the customer, through properly putting the data together ("proper" means address correction, householding, geocoding, deduping, merging, standardizing, imputing etc.)
  2. Calculate various attibutes (such as first order source, first order date type customer and other related attributed required for marketing analytics)
  3. Transform the data into our universal data model and BI framework (fact and dimensional tables)
  4. Build multi-dimensional OLAP cubes and various other reports
  5. Iterate various statistical models for propensity modeling, product affinity etc. for targeting, testing, spend optimization, personalization etc.

We believe the steps we take are required to convert the raw data into information for decision support.  I've recently decided to test a few of these new platforms for any of these steps involved.

A few more background on our typical data environment.  Our backend database, ETL, OLAP and programming environment are all Microsoft (SQL Server 2008, Database, Integration Services --ETL, Analysis Services -- OLAP).  Our typical datasize for an average client is about 1TB in size for the active database (about 2-3x for the archived portion).

I do not want to publicly give out the two vendors we've tried, but we've considered replacing part of our platform with these technologies.  We've considered the following scenarios

  1. Eliminate Step 4 entirely: Completely eliminate OLAP cube building and issue queries to the columnar db environment
  2. Point OLAP process to Columnar DB: Keep the OLAP processes but point the OLAP building process to the columnar db environment, since the aggregation queries run faster
  3. Use Columnar DB for calculations in Steps 2 and 3: For calculating user attributes, use the Columnar DB to issue SQL commands

We've also considered some of the ETL process migration options, but we quickly gave up on that idea, since the functionality doesn't really exist in any of the tools we've reviewed.

Results

Results were quite disappointing.  In all cases, a properly built and maintained SQL Server has beaten the Columnar DB on the same HW.  Some vendors claim they do better as they are MPP (Massively Parallel Processing), but such claims require hardware investment and we don't know how they would compare since SQL Server could also be clustered.

We've talked to highest levels in their engineering teams and one of them actually visited our offices to really understand the issue.  At the end, SQL server with proper indexing, Storage configuration performs better, since most of the time full table scans are required for many queries, and the columnar db only performs well for ordered dimensions when partial scans are requested.

Based on the performance, we've seen no way that these technologies could actually eliminate the need for OLAP environments.  There might be special cases where a non-pre-calculated aggregation is requested, a columnar db might perform better, but for this off-chance 5-10% of the use-cases, the cost is not justified.

One nifty improvement that the columnar DB provide is the storage behavior.  Since data is always saved in column forms, and each column is the same data type, data is compressed much better and this saves significant space.  Although at $250 per 2TB SATA disks, I don't know how much of a cost saving this would actually bring about.