Apache Arrow zooms to top-level status with promise of fast memory access for all
The latest project to reach top-level status in the Apache Software Foundation is a no-lose proposition for everyone, or at least that’s how the project leaders are positioning it.
Apache Arrow achieves that elite ranking today under the direction of veterans of other notable Apache projects like Parquet, Drill, Pig and Calcite, and with the support of 13 major open source big data projects, including Cassandra, Drill, Hadoop and HBase.
Arrow proposes to give nearly every big data analysis and reporting tool a 10- to 100-fold speed boost by enabling them to use columnar in-memory analytics, a memory layout technique that arranges data in columns rather than rows.
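To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. It does not use Arrow's actual API or memory format; the records, the field names and the `to_columns` helper are all invented for illustration, and the standard-library `array` type stands in for a contiguous typed column.

```python
from array import array

# Hypothetical records, invented for illustration: (user_id, age, spend)
rows = [
    (1, 34, 12.5),
    (2, 27, 80.0),
    (3, 45, 7.25),
]


def to_columns(records):
    """Pivot row tuples into one contiguous, typed buffer per column."""
    user_id, age, spend = zip(*records)
    return {
        "user_id": array("q", user_id),  # 64-bit signed integers
        "age": array("q", age),
        "spend": array("d", spend),      # 64-bit floats
    }


columns = to_columns(rows)

# Row layout: reading one field still means visiting every whole record.
total_spend_rows = sum(record[2] for record in rows)

# Columnar layout: the same scan reads a single contiguous buffer.
total_spend_columns = sum(columns["spend"])
```

Both totals are the same; what changes is how the values sit in memory. An analytic query that touches one field out of many only has to read that field's buffer in the columnar case.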
Arrow is an embeddable technology that’s designed to work with existing systems like Drill, Spark, Kudu and many others. “It’s an accelerator for processing and storage systems,” said Jacques Nadeau, founder and chief technology officer at Dremio Corp. and chairman of the Apache Drill project management council (PMC). “It’s a set of data representations that are much more CPU-efficient.”
Nadeau has teamed up on Arrow with co-founder Tomer Shiran, formerly vice president of product at MapR Technologies Inc., and Julien Le Dem, founder of Apache Parquet and member of the Apache Pig PMC.
Columnar in-memory processing has a couple of big advantages over row-based processing, including the ability for applications to handle larger data sets and to work on a few big values instead of a lot of small ones.
The disadvantage? “Doing in-memory columnar is hard,” Nadeau said. “Doing rows is easy.”
But easy has its downside. Using row-based memory processing means applications either need to accept the limitations of that structure or reload data into a memory configuration of their own choosing. Such serialization and deserialization can waste up to 80% of CPU resources, the project leaders claim. Reformatting also creates multiple copies of the same data.
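The cost of reformatting can be sketched with a simple round trip. JSON is used here only as a stand-in for whatever format two engines might exchange, and the records are invented for illustration; the 80% figure above is the project leaders' claim and is not something this snippet measures.

```python
import json

# Hypothetical rows held in memory by one engine.
rows = [{"id": 1, "score": 0.5}, {"id": 2, "score": 0.75}]

# Handing the data to another engine in a different format means encoding it...
wire_bytes = json.dumps(rows).encode("utf-8")

# ...and then decoding it again on the receiving side.
received = json.loads(wire_bytes.decode("utf-8"))

# The receiver now owns a second, independent copy of the same data:
# changing it has no effect on the original.
received[0]["score"] = 9.0
```

Every value is visited twice, once to encode and once to decode, and the data ends up in memory twice.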
If developers of popular data processing engines can agree that in-memory columnar is superior, which is something the Arrow leadership team indicated is a no-brainer, then there is one more big advantage: Multiple applications can share the same memory.
“When Kudu wants to hand a data structure to Drill, it can use the same representation of memory,” Nadeau said. “You can share data between systems at no cost.” That improves interoperability, performance and memory efficiency. It also gives users more choice by taking memory-handling out of the equation when choosing a processing engine.
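The idea of two consumers reading one representation without copying can be sketched within a single Python process using `memoryview`. This is an analogy only, not how Kudu, Drill or Arrow exchange data; the `prices` column and the `consumer_total` function are hypothetical.

```python
from array import array

# A "producer" holds a column of values in one contiguous buffer.
prices = array("d", [1.5, 2.5, 4.0])

# A copy: the consumer gets its own separate list of values.
copied = prices.tolist()

# A view: the consumer reads the producer's buffer directly, no copy made.
shared_view = memoryview(prices)


def consumer_total(buffer):
    """Hypothetical consumer that sums a shared buffer without copying it."""
    return sum(memoryview(buffer))


# The producer updates a value in place.
prices[0] = 10.0

# The view reflects the update because it shares the same memory;
# the copy still holds the old value.
view_first = shared_view[0]
copy_first = copied[0]
total = consumer_total(prices)
```

The view costs almost nothing to create regardless of the column's size, which is the property the shared-representation argument relies on.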
Support for the C, C++, Python and Java languages is expected to be available at launch with additional languages to be added over the next few months. Arrow is expected to be adopted by the Drill, Ibis, Impala, Kudu, Parquet and Spark projects this year. In addition, “My expectation is that at launch we’ll see a number of commercial companies and projects get involved,” Nadeau said, declining to name names.
He likened Arrow to Apache Yarn, the open source cluster manager. “Yarn democratized access to the cluster,” he said. “This democratizes access to data.”