Microsoft managed to momentarily steal the analytics limelight from Hadoop last week after a new report revealed that it’s planning to launch a cloud-based alternative to the batch processing platform, built on technology that’s been used internally for several years with great success.
Some 5,000 Microsoft engineers rely on a massive implementation of Cosmos, as the technology is known, to scan data from core services for useful insights into operations and user activity. Nearly as many knowledge workers use it through a built-in structured query component that exposes the information for conventional business intelligence and reporting purposes.
If and when Microsoft makes Cosmos available to the outside world, it should expect tough competition from Hadoop, which not only enjoys industry-wide backing but also offers a much broader choice of capabilities for analyzing unstructured data. That selection became even bigger a few days after the report about the upcoming service leaked, when Samza became an official member of the upstream ecosystem.
The LinkedIn-developed framework is designed to process complex real-time workloads that require special handling after ingestion. It embeds a local key-value store in every stream, which makes it possible to keep the kind of contextual information needed for advanced operations such as merging datasets close at hand instead of having to query a remote system every time that information is needed.
That noticeably speeds up each action, which can add up to a potentially massive performance improvement across the billions of data points that flow through the typical production-scale Hadoop cluster every day. Samza is already finding success at LinkedIn and other web companies like DoubleDutch Inc. in handling the machine-generated logs coming off their infrastructure.
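To make the local-state idea concrete, here is a minimal sketch of a stream task that enriches events from a store it keeps in-process rather than calling out to a remote system. It is an illustration of the general pattern only, not Samza's actual API: the `LocalJoinTask` class, the `run` helper, the stream names `profile-updates` and `page-views`, and the event shape are all hypothetical, and a plain Python dict stands in for the embedded key-value store.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (stream name, key, payload).
Event = Tuple[str, str, dict]


@dataclass
class LocalJoinTask:
    """Toy stream task that keeps its join state in a local key-value store."""

    # Stand-in for the embedded store; a plain dict, not Samza's real store.
    store: Dict[str, dict] = field(default_factory=dict)

    def process(self, event: Event) -> List[dict]:
        stream, key, payload = event
        if stream == "profile-updates":
            # Remember the latest context for this key locally.
            self.store[key] = payload
            return []
        if stream == "page-views":
            # Enrich from local state; no remote lookup is made.
            profile = self.store.get(key)
            if profile is None:
                return []  # no context yet, so this sketch drops the event
            return [{**payload, **profile, "user": key}]
        return []  # unknown streams are ignored


def run(task: LocalJoinTask, events: Iterable[Event]) -> List[dict]:
    """Feed events through the task in order and collect its output."""
    output: List[dict] = []
    for event in events:
        output.extend(task.process(event))
    return output


if __name__ == "__main__":
    sample = [
        ("profile-updates", "u1", {"country": "US"}),
        ("page-views", "u1", {"page": "/home"}),
        ("page-views", "u2", {"page": "/jobs"}),
    ]
    print(run(LocalJoinTask(), sample))
```

In this sketch the page view for `u1` comes out merged with the stored profile, while the one for `u2` is dropped because no context has arrived for that key. A real deployment would also need the store to survive restarts, which this toy version does not attempt.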
The project fills an important gap in the Hadoop ecosystem, which had previously accommodated only more lightweight streaming workloads. But not every developer requires a beefy event processor to monitor their environment. In fact, a sizable majority are content with pre-packaged services from the likes of DataDog Inc., which raised $31 million last week to drive adoption.
The company’s namesake platform aggregates data from practically every part of an organization’s infrastructure into a sleek dashboard that displays operational metrics and makes it possible to track the impact of new updates on the environment. That straightforward but effective approach has made DataDog a favorite among many organizations, including Netflix Inc., Electronic Arts Inc. and other big names, and the new capital is meant to sustain that momentum into 2015.