The Apache Software Foundation’s stable of open-source projects continues to grow, with the addition of two more top-level projects in the past 24 hours.
They include a Google Inc.-backed unified programming model for batch and streaming big data processing and a new monitoring tool for big data platforms such as Apache Hadoop and Apache Spark.
First up is Apache Beam, which is best described as a unified big data programming framework. The project was born out of Google and is designed to remove some of the complexity involved in building analytic pipelines that run across distributed systems. It’s compatible with popular stream processing engines such as Google’s Cloud Dataflow service, Apache Spark and Apache Flink. Like many projects that started out under Google, it was later donated to the ASF in order to generate more support and speed up its development.
Tyler Akidau, a member of the Apache Beam Project Management Committee and a staff software engineer at Google, said that Beam has become a much better project now that it has the support of people outside of Google.
“Though there were many motivations behind the creation of Apache Beam, the one at the heart of everything was a desire to build an open and thriving community and ecosystem around this powerful model for data processing that so many of us at Google spent years refining,” Akidau wrote in a blog post. “But taking a project with over a decade of engineering momentum behind it from within a single company and opening it to the world is no small feat. That’s why I feel today’s announcement is so meaningful.”
Akidau put the importance of this help from outside the company into perspective, saying that 10 of the 22 large modules that make up Beam were developed from scratch by the community, with virtually no contribution from Google’s engineers. “Since September no single organization has had more than 50% of the unique contributors per month,” he said.
The majority of new contributors added during Beam’s incubation period also came from outside Google. They include well-known names such as Cloudera Inc., the Hadoop systems provider; Data Artisans GmbH, the company behind Flink; Talend Inc.; and PayPal, among many others.
According to PayPal Director of Big Data Platform Assaf Pinhasi, Beam has served the company well by enabling it to make stream processing available to data engineers via a single application programming interface, decoupled from the underlying execution engine. “Our data engineers can now focus on what they do best – i.e. express their processing pipelines easily, and not have to worry about how these get translated to the complex underlying engine they run on,” Pinhasi said in a statement.
When Google first donated Beam to the ASF, it had three underlying execution engines, which Beam calls runners, that made it compatible with Cloud Dataflow, Spark and Flink. Now there are four, with one more added for Apache Apex, a real-time stream processing engine founded at Yahoo Inc. and now being developed by DataTorrent Inc.
“Becoming a top-level project is an indication that Apache Beam now has a development community that is ready for prime time,” Akidau wrote. “We’re ready to bring the promise of portability to programmatic data processing, much in the way SQL has done so for declarative data analysis. We’re ready to build the things that never would have gotten built had this project stayed confined within the walls of Google.”
The second project to graduate Tuesday was Apache Eagle, which is an open-source monitoring and alerting system designed to warn users about performance and security issues on big data platforms like Hadoop, Spark and others.
The project was first developed by engineers at eBay Inc., who needed a way to monitor their large-scale Hadoop clusters without doing so manually. It wasn’t long before eBay’s team realized the system would benefit from community help, so the company submitted the project to the ASF as an incubator project in October 2015.
In a press release, the ASF said Eagle is an analytics tool that’s able to identify performance and security issues “instantly.” It works by analyzing data activity, daemon logs, JMX metrics and YARN applications in order to spot security breaches and problems with performance and provide other insights.
“Eagle fills a very important role in providing top-notch security and performance monitoring and alerting for Big Data deployments,” said P. Taylor Goetz, an ASF member and Apache Eagle Project Management Committee member.