Google pitches Cloud Dataflow to the Apache Software Foundation
Google is making its first major open-source move of the year by offering up its Dataflow technology to the Apache Software Foundation (ASF) as an incubator project.
Google is hoping to spur more collaborative efforts and governance around its technology, which is used for writing large-scale data processing jobs. The end goal is to enable the development of data pipelines that can be ported across multiple execution engines, both in the cloud and on-premises. As such, Google is hoping that its Dataflow programming model and its Dataflow Software Development Kit (SDK) will be bundled together as a single Apache Incubator project.
The search giant has gathered the support of several big-name companies behind its bid, including Cloudera Inc., Data Artisans GmbH, PayPal Holdings Inc. and Talend.
Ultimately, Google hopes that Dataflow will be accepted as a Top-Level Project under the ASF, but to get there it must first pass through the mandatory incubation stage, during which issues related to its future direction and licensing will be addressed.
“We believe this proposal is a step towards the ability to define one data pipeline for multiple processing needs, without tradeoffs, which can be run in a number of runtimes, on-premise, in the cloud, or locally,” wrote Google Software Engineer Frances Perry and Product Manager James Malone in a January 20 blog post.
Google’s Cloud Dataflow service, which is based on the technology, will not be affected by the proposal to open-source the programming model, SDK and other components, they added.
Google built Dataflow as a means of helping developers write applications and data pipelines that can run on multiple Big Data engines, including Apache Spark and Apache Flink, as well as its own Cloud Dataflow. The technology consists of a number of SDKs that are used to define data processing jobs over large data sets in both batch and streaming modes.
The company open-sourced the Dataflow SDK back in December 2014, in order to boost development activity around the technology and quell fears that it might help to lock users into Google’s infrastructure. Since that time, Google says the Dataflow SDK has been used to create a variety of “pluggable runners” that connect data pipelines to Spark, Flink and others.
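The "pluggable runner" architecture described above can be illustrated with a toy sketch. This is not the actual Dataflow SDK API (which is Java-based); the `Pipeline`, `DirectRunner` and `LoggingRunner` names below are hypothetical stand-ins meant only to show the core idea: a pipeline is defined once as a series of transforms, and the engine that executes it is chosen separately at run time.

```python
# Conceptual sketch of the pluggable-runner idea -- NOT the real
# Dataflow SDK. All class names here are illustrative inventions.

class Pipeline:
    """Holds a source and an ordered list of transforms, independent of any engine."""
    def __init__(self, source):
        self.source = source        # input collection
        self.transforms = []        # ordered element-wise transforms

    def apply(self, fn):
        self.transforms.append(fn)
        return self                 # allow chained .apply() calls

    def run(self, runner):
        # The pipeline says *what* to compute; the runner decides *how*.
        return runner.execute(self.source, self.transforms)


class DirectRunner:
    """Executes everything eagerly in-process, like a local test runner."""
    def execute(self, source, transforms):
        data = list(source)
        for fn in transforms:
            data = [fn(x) for x in data]
        return data


class LoggingRunner:
    """Stand-in for a distributed engine such as a Spark or Flink runner."""
    def execute(self, source, transforms):
        print(f"submitting {len(transforms)} transforms to the cluster...")
        # A real runner would translate the transforms into engine-native jobs;
        # here we just execute them locally to keep the sketch self-contained.
        data = list(source)
        for fn in transforms:
            data = list(map(fn, data))
        return data


# One pipeline definition, two interchangeable runners:
pipeline = Pipeline([1, 2, 3]).apply(lambda x: x * 10).apply(lambda x: x + 1)
print(pipeline.run(DirectRunner()))   # [11, 21, 31]
print(pipeline.run(LoggingRunner()))  # same result via a different "engine"
```

Swapping `DirectRunner` for `LoggingRunner` changes nothing about the pipeline definition, which is the property that lets a Dataflow-style pipeline move between local execution, Spark, Flink or Cloud Dataflow.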
Perry and Malone pointed to a number of benefits if the ASF accepts Dataflow as an incubator project. The main one, they said, is that developers would be able to focus on their applications and data pipelines instead of worrying too much about which Big Data engine they're compatible with.
Google previously said Dataflow was a combination of several technologies it has been using internally for years, including FlumeJava, a batch processing engine; MillWheel, a stream processing engine; and MapReduce.