Google launches Cloud Dataflow pipeline for batch and stream processing

There was plenty of excitement at Google I/O yesterday, and not just because of the brief interruption by a protester calling on Google to “develop a conscience”. While much of the spotlight fell on Android, Google announced a number of new services on its cloud front, including something called Cloud Dataflow that makes it easier to create data-processing pipelines combining both stream and batch-processing capabilities.

Dataflow is based on several earlier Google projects, including its FlumeJava data-pipeline tool and MillWheel stream-processing technology. It has been designed to enable analysis of live data, allowing users to view trends and receive alerts about events in real time. The service is primarily aimed at developers who need to stream real-time data.

It’s possible to run your own Hadoop cluster atop Google Compute Engine, of course, but Google Cloud Platform marketing head Brian Goldfarb says Dataflow has been built to overcome the latency and complexity limitations inherent in MapReduce.

“[MapReduce] was good for simple jobs, but when you needed to run pipelines it wasn’t so easy,” he said. “Internally, we don’t use it anymore because we don’t think it’s the right solution for the overwhelming number of situations.”

Chiefly, Dataflow has been designed as an easy-to-use tool that’s capable of handling both complex workflows and very large datasets. Streaming and batch jobs both employ the same code, while Dataflow manages the infrastructure and optimizes the data pipeline. The service is compatible with multiple programming languages, though the first SDK is designed for Java.
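To illustrate what it means for streaming and batch jobs to share the same code, here is a minimal sketch in plain Python. It does not use the Dataflow SDK or reflect its API; the names `build_pipeline`, `parse`, `only_large` and `live_feed`, along with the sample records, are hypothetical and exist only for this example. The idea is that a pipeline written against a generic sequence of records does not care whether that sequence is a finite list or a feed that yields records as they arrive.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple[str, float]


def build_pipeline(*stages: Callable[[Iterable], Iterator]) -> Callable[[Iterable], Iterator]:
    """Chain stages so each one lazily consumes the output of the previous one."""
    def run(source: Iterable) -> Iterator:
        data = source
        for stage in stages:
            data = stage(data)
        return data
    return run


def parse(lines: Iterable[str]) -> Iterator[Record]:
    """Turn 'user,amount' text lines into (user, amount) records."""
    for line in lines:
        user, amount = line.split(",")
        yield user, float(amount)


def only_large(records: Iterable[Record]) -> Iterator[Record]:
    """Keep records whose amount is at least 100 (an arbitrary cutoff for this sketch)."""
    for user, amount in records:
        if amount >= 100.0:
            yield user, amount


# One pipeline definition, reused unchanged for both kinds of input.
pipeline = build_pipeline(parse, only_large)

# Batch: a finite dataset that is already complete.
batch_source = ["ann,250.0", "bob,12.5", "cy,100.0"]
batch_result = list(pipeline(batch_source))


def live_feed() -> Iterator[str]:
    """Stand-in for a stream: records are produced one at a time."""
    yield "dee,99.9"
    yield "eli,400.0"


# Stream: the same pipeline processes each record as the feed produces it.
stream_result = list(pipeline(live_feed()))
```

In this sketch only the data source changes between the two runs. A managed service such as the one described here would additionally take care of provisioning, scaling and optimization, none of which this toy example attempts.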

According to Google, the main focus is helping its users to get “actionable insights from your data while lowering operational costs without the hassles of deploying, maintaining or scaling infrastructure.”

Real-time anomaly detection was cited as one primary use for Dataflow. A live demo involved analyzing streamed World Cup data that was compared with historical data in an attempt to spot anomalies. Users can either investigate events themselves using Google BigQuery, or set Dataflow up so it automatically takes actions when it detects something.
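The article does not say how the demo decided what counted as an anomaly. As a rough sketch of the general idea of comparing live values against a historical baseline, the following plain-Python example flags any streamed value that lies more than a chosen number of standard deviations from the historical mean. The z-score method, the function name `detect_anomalies`, the threshold of 3 and the sample numbers are all assumptions made for illustration, not details of Google's demo.

```python
from statistics import mean, stdev
from typing import Iterable, Iterator, Sequence


def detect_anomalies(
    stream: Iterable[float],
    history: Sequence[float],
    threshold: float = 3.0,
) -> Iterator[float]:
    """Yield streamed values that deviate strongly from the historical baseline.

    Assumption: a simple z-score test stands in for whatever method a real
    system would use.
    """
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 points
    for value in stream:
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            yield value


# Hypothetical historical readings (mean 100, sample standard deviation 2).
history = [100, 102, 98, 101, 99, 100, 103, 97]

# Hypothetical live readings arriving one at a time.
live = [101, 99, 180, 100, 20]

alerts = list(detect_anomalies(iter(live), history))
```

With these made-up numbers, the readings 180 and 20 are flagged while the others pass. In a real deployment, the flagged values could either be stored for later investigation or used to trigger an automatic action, matching the two options described above.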

The service is important for Google’s cloud efforts because Amazon has had its own data pipeline service for some time already. AWS also has its Kinesis service, which specializes in real-time data processing. Dataflow is, in essence, Google’s combined answer to both.
