Have you ever bought something expensive, say a consumer electronics device, only to see it on sale at a significantly discounted price soon after? It recently happened to Charles Nicholls, founder and chief strategy officer of Web analytics company SeeWhy. Nicholls blames batch processing.
Why is batch processing to blame? It’s not real-time. Since SeeWhy specializes in real-time analytics for e-commerce, it’s not surprising to hear this story from Nicholls.
Nicholls is not alone. Stale data is a real issue for companies, which is why real-time and streaming analytics have been such hot topics at events like Hadoop World and Strata. With real-time systems, new data can be responded to as it comes in, and there’s a race on to see who can make the best big data stream processing system. HStreaming is using the batch processing system Apache Hadoop as a base, LexisNexis has its own HPCC solution, and Twitter has open sourced a project called Storm. It seems more companies are joining the race every day.
So what is the problem with batch processing?
Nicholls bought a satellite navigation device from the UK retailer Halfords. He browsed online for the model he wanted, then reserved it online and bought it at a brick-and-mortar location the same day. Not long after, Nicholls received an e-mail promotion for 10% off the device he’d just bought. He felt like a sucker, and it soured him on the otherwise great shopping experience.
“Batch processing” in this context means running automated analytics on a pre-defined data set. The upside to batch processing is that once the batch of data is determined, the end user need not do anything: the software just takes care of transforming the data into hopefully actionable information.
The trouble is that you have to predefine your data set, and if new data comes in while you’re processing it, that data has to wait until the next run. That’s fine if you have static, unchanging data such as sales from a particular month or other historical data. But what happens when you are trying to analyze a data set that is constantly changing?
What happened to Nicholls is that the retailer processed data showing that he had browsed satellite navigation systems. But because of the gap between when he browsed and when he purchased, the batch the retailer processed did not include the purchase, so its system didn’t know he’d already bought the device. Hence the mailing. There were mere hours between when Nicholls browsed the site and when he bought the device, but it was long enough for the data to get stale.
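To make the difference concrete, here is a minimal Python sketch of the scenario. It is an illustration only, not a description of how Halfords' systems actually work: the `Event` record, the function and class names, and the hours (browse at 9, batch snapshot at 12, purchase at 14) are all hypothetical. The batch function only sees events captured before its snapshot, while the streaming version updates its list of promotion targets as each event arrives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One customer action. All fields are hypothetical, for illustration."""
    hour: int        # hour of the day the event occurred
    customer: str
    kind: str        # "browse" or "purchase"
    product: str


def batch_promo_targets(events, snapshot_hour):
    """Batch style: analyze only the events captured up to the snapshot."""
    browsed, purchased = set(), set()
    for e in events:
        if e.hour > snapshot_hour:
            continue  # arrived after the batch was defined, so it is invisible
        key = (e.customer, e.product)
        if e.kind == "browse":
            browsed.add(key)
        elif e.kind == "purchase":
            purchased.add(key)
    # Promote to everyone who browsed but, as far as the batch knows, did not buy.
    return browsed - purchased


class StreamingPromoTargets:
    """Streaming style: update the target list as each event arrives."""

    def __init__(self):
        self.targets = set()
        self.purchased = set()

    def on_event(self, e):
        key = (e.customer, e.product)
        if e.kind == "purchase":
            self.purchased.add(key)
            self.targets.discard(key)  # a buyer no longer needs the promotion
        elif e.kind == "browse" and key not in self.purchased:
            self.targets.add(key)


events = [
    Event(hour=9, customer="nicholls", kind="browse", product="satnav"),
    Event(hour=14, customer="nicholls", kind="purchase", product="satnav"),
]

# The batch was cut at noon, so the 2 p.m. purchase is missing from it.
stale_targets = batch_promo_targets(events, snapshot_hour=12)

stream = StreamingPromoTargets()
for e in events:
    stream.on_event(e)

print(stale_targets)   # {('nicholls', 'satnav')}: the discount e-mail goes out
print(stream.targets)  # set(): no e-mail, the purchase was already seen
```

In this sketch the batch job is not wrong about the data it was given; the data was simply out of date by the time the result was acted on.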
“Delivering real-time analysis on large volumes of unstructured, streaming data is the elusive white whale of the Big Data industry,” Wikibon analyst Jeffrey Kelly has said before. “Such capabilities, once developed, will provide developers the opportunity to build smarter, highly reactive applications and systems.”
If you’re working with a service provider on a big data solution, be sure to find out what their plan is for real-time analysis.
Image by VectorPortal