UPDATED 04:21 EDT / JUNE 26 2014

Google launches Cloud Dataflow pipeline for batch and stream processing

There was plenty of excitement at Google I/O yesterday, and not just because of the brief interruption by a protester calling on Google to “develop a conscience”. While much of the spotlight fell on Android, Google announced a number of new services on its cloud front, including something called Cloud Dataflow that makes it easier to create data-processing pipelines combining both stream and batch-processing capabilities.

Dataflow is based on several earlier Google projects, including its FlumeJava data-pipeline tool and MillWheel stream-processing technology. It’s been designed to enable analysis of live data, allowing users to view trends and receive alerts of events in real time. The service is primarily aimed at developers who need to stream real-time data.

It’s possible to run your own Hadoop cluster atop Google Compute Engine, of course, but Google Cloud Platform marketing head Brian Goldfarb says Dataflow has been built to overcome latency and complexity limitations that are inherent in MapReduce.

“[MapReduce] was good for simple jobs, but when you needed to run pipelines it wasn’t so easy,” he said. “Internally, we don’t use it anymore because we don’t think it’s the right solution for the overwhelming number of situations.”

Chiefly, Dataflow has been designed as an easy-to-use tool that’s capable of handling both complex workflows and very large datasets. Streaming and batch jobs both employ the same code, while Dataflow manages the infrastructure and optimizes the data pipeline. The service is compatible with multiple programming languages, though the first SDK is designed for Java.
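To illustrate the idea of one codebase serving both modes, here is a loose conceptual analogy in plain Java — not the actual Dataflow SDK, whose classes and method names are not described in this article — showing a single transform definition applied to a bounded "batch" source and an unbounded "stream" source:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Conceptual analogy only (NOT the Dataflow SDK): the same transform
// logic reused for a finite batch and an unbounded stream.
public class UnifiedPipeline {

    // One transform definition, shared by both modes.
    static Stream<String> transform(Stream<Integer> source) {
        return source.filter(n -> n % 2 == 0)  // keep even readings
                     .map(n -> "event-" + n);  // format as event labels
    }

    public static void main(String[] args) {
        // Batch mode: a finite, stored dataset.
        List<String> batch = transform(List.of(1, 2, 3, 4).stream())
                .collect(Collectors.toList());

        // Streaming mode: an unbounded source (truncated here for demo).
        List<String> streamed = transform(Stream.iterate(0, n -> n + 1))
                .limit(2)
                .collect(Collectors.toList());

        System.out.println(batch);     // [event-2, event-4]
        System.out.println(streamed);  // [event-0, event-2]
    }
}
```

The point of the analogy is the one Google emphasized: the transform itself is written once, and the runtime decides how to execute it over bounded or unbounded input.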

According to Google, the main focus is helping its users to get “actionable insights from your data while lowering operational costs without the hassles of deploying, maintaining or scaling infrastructure.”

Real-time anomaly detection was cited as one primary use for Dataflow. A live demo involved analyzing streamed World Cup data that was compared with historical data in an attempt to spot anomalies. Users can either investigate events themselves using Google BigQuery, or set up Dataflow so it automatically takes action when it detects something.
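The article doesn’t detail how the demo flagged anomalies, but a common approach to "compare live data with historical data" is a z-score test. The sketch below is a hypothetical illustration of that general technique (the 3-sigma threshold and the sample values are assumptions, not from the demo):

```java
import java.util.Arrays;

// Hypothetical sketch of a per-event anomaly check: compare a live value
// against the mean and standard deviation of a historical window.
public class AnomalyCheck {

    // Returns true if liveValue deviates from the historical mean by
    // more than `sigmas` standard deviations.
    static boolean isAnomaly(double[] history, double liveValue, double sigmas) {
        double mean = Arrays.stream(history).average().orElse(0.0);
        double variance = Arrays.stream(history)
                .map(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        double stdDev = Math.sqrt(variance);
        return Math.abs(liveValue - mean) > sigmas * stdDev;
    }

    public static void main(String[] args) {
        // Assumed example: goals per match in a historical window.
        double[] goalsPerMatch = {2, 3, 2, 4, 3, 2, 3};
        System.out.println(isAnomaly(goalsPerMatch, 3, 3));   // typical value
        System.out.println(isAnomaly(goalsPerMatch, 12, 3));  // clear outlier
    }
}
```

In a Dataflow-style pipeline, a check like this would run inside a transform over the live stream, with flagged events either surfaced for investigation (e.g., via BigQuery) or routed to an automated response.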

The service is important for Google’s cloud efforts because Amazon has had its own data pipeline service for some time already. AWS also has its Kinesis service, which specializes in real-time data processing – Dataflow is, in essence, Google’s combined answer to both.

Image credit: Google
