Google pitches Cloud Dataflow to the Apache Software Foundation
Google is making its first major open-source move of the year by offering up its Dataflow technology to the Apache Software Foundation (ASF) as an incubator project.
Google is hoping to spur more collaborative efforts and governance around its technology, which is used for writing large-scale data processing jobs. The end goal is to enable the development of data pipelines that can be ported across multiple execution engines, both in the cloud and on-premises. As such, Google is hoping that its Dataflow programming model and its Dataflow Software Development Kit (SDK) will be bundled together as a single Apache Incubator project.
The search giant has gathered the support of several big-name companies behind its bid, including Cloudera Inc., Data Artisans GmbH, PayPal Holdings Inc. and Talend.
Ultimately, Google is hoping that Dataflow will be accepted as a Top-Level Project under the ASF, but to get there it must first pass through the mandatory incubation stage, during which issues related to its future direction and licensing will be tackled.
“We believe this proposal is a step towards the ability to define one data pipeline for multiple processing needs, without tradeoffs, which can be run in a number of runtimes, on-premise, in the cloud, or locally,” wrote Google Software Engineer Frances Perry and Product Manager James Malone in a January 20 blog post.
Google’s Cloud Dataflow service, which is based on the technology, will not be affected by the proposal to open-source the programming model, SDK and other components, they added.
Google built Dataflow as a means of helping developers write applications and data pipelines that can run on multiple Big Data engines, including Apache Spark and Apache Flink, as well as its own Cloud Dataflow. The technology consists of a number of SDKs that are used to define data processing jobs in batch mode and in streaming for large data sets.
The company open-sourced the Dataflow SDK back in December 2014, in order to boost development activity around the technology and quell fears that it might help to lock users into Google’s infrastructure. Since that time, Google says the Dataflow SDK has been used to create a variety of “pluggable runners” that connect data pipelines to Spark, Flink and others.
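The "pluggable runner" idea those SDKs enable can be sketched in a few lines. The classes and method names below are purely illustrative and not the actual Dataflow SDK API: the point is only that a pipeline is defined once, and the runner chosen at execution time decides where the transforms run.

```python
# Illustrative sketch only -- Pipeline, DirectRunner, apply() and run() here
# are hypothetical names, not the real Dataflow SDK API.

class Pipeline:
    """A pipeline is just an input source plus an ordered list of transforms."""
    def __init__(self, source):
        self.source = source
        self.transforms = []

    def apply(self, transform):
        # Each transform maps a whole collection to a new collection.
        self.transforms.append(transform)
        return self  # allow chaining

    def run(self, runner):
        # The same pipeline definition can be handed to any runner.
        return runner.execute(self.source, self.transforms)


class DirectRunner:
    """Runs the transforms locally and in order -- the simplest backend."""
    def execute(self, source, transforms):
        data = list(source)
        for transform in transforms:
            data = transform(data)
        return data


# Define the pipeline once...
wordcount = (
    Pipeline(["the quick fox", "the lazy dog"])
    .apply(lambda lines: [w for line in lines for w in line.split()])
    .apply(lambda words: sorted(set(words)))
)

# ...then pick an execution engine. A Spark or Flink runner would implement
# the same execute() contract against its own cluster.
print(wordcount.run(DirectRunner()))
```

A runner targeting Spark or Flink would satisfy the same `execute()` contract, which is how one pipeline definition stays portable across engines.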
Perry and Malone pointed to a number of benefits if the ASF accepts Dataflow as an incubator project. The main one, they said, is that developers would be able to focus on their applications and data pipelines instead of worrying too much about which Big Data engine they're compatible with.
Google previously said Dataflow was a combination of several technologies it’s been using internally for years, including FlumeJava, a batch processing engine, MillWheel, a stream processing engine, and MapReduce.
Photo Credit: TMarieShines via Compfight cc