UPDATED 14:40 EDT / NOVEMBER 12 2014

LinkedIn’s latest open-source project supercharges Hadoop

LinkedIn Inc. is releasing yet another internally developed framework for Hadoop under an open-source license, in a bid to help organizations get the most out of their analytic clusters without hiring an army of expensive specialists to fine-tune every detail. The project adds to the already formidable pile of community contributions that the web-scale crowd has racked up over the course of its push to extend the boundaries of large-scale data processing.

Hadoop itself was born of that endless pursuit, along with many of the complementary technologies in the surrounding ecosystem, including the most recent addition, an engine called Kylin that eBay Inc. developed to spare internal users long delays when digging for data in its massive deployment. The newly revealed Cubert framework from LinkedIn extends that vision beyond queries to the full gamut of operations in Hadoop, from organizing information for analysis to carrying out the processing.

Cubert implements the lessons that the social networking powerhouse learned while laying the foundation for its XLNT engagement testing platform, whose workloads proved too taxing for existing Hadoop sub-projects to handle. After spending several months trying to make the tools already at their disposal work, to little avail, LinkedIn’s engineers decided to build an entirely new system to bear the brunt of the complex data manipulations in XLNT.

The technology served its purpose, but the developers found themselves having to rewrite large portions of the underlying code in order to accommodate the new use cases that the success of the project drew over time. So they set out to come up with an answer to the requirements of XLNT for the third time, and thus Cubert was born.

Tackling all three levels of the analytic stack


The framework provides an engine for finding simple solutions to complex analytical problems that might normally prove too resource-intensive to solve within an allocated time frame. It cuts across all three levels of the analytic stack.

In the storage layer, Cubert uses a combination of abstractions over the Hadoop Distributed File System to organize data as blocks structured for the most efficient access possible. These partitions are manipulated with operators located one level higher up, at the execution layer, which automate tasks not directly supported in other platforms, such as mapping out relationships between entities and calculating statistical positions. Finally, this functionality is exposed to developers through a simplified syntax dubbed Cubert Script, implemented at the top of the stack, which makes it possible to specify workload execution paths without writing any Java code.
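For context, the kind of hand-written Java that a scripting layer like Cubert Script is meant to spare developers resembles the sketch below: a minimal MapReduce job that sums a metric per key. This is a generic, hypothetical Hadoop example rather than code from Cubert or XLNT; the MetricSumJob class name and the tab-separated input format are assumptions made purely for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sums a numeric metric per key from tab-separated input lines ("key<TAB>value").
// Illustrates the boilerplate that higher-level scripting layers aim to hide.
public class MetricSumJob {

    public static class SumMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Text outKey = new Text();
        private final LongWritable outValue = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length < 2) {
                return; // skip malformed records
            }
            outKey.set(fields[0]);
            outValue.set(Long.parseLong(fields[1]));
            context.write(outKey, outValue);
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        private final LongWritable total = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }

    public static void main(String[] args) throws Exception {
        // Wire up the job: mapper, combiner, reducer, output types and I/O paths.
        Job job = Job.getInstance(new Configuration(), "metric-sum");
        job.setJarByClass(MetricSumJob.class);
        job.setMapperClass(SumMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Even a single aggregation requires dozens of lines of setup in raw Java; Cubert Script’s pitch is that such execution paths can be specified without writing any of it.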

That provides a relatively straightforward interface for optimizing data processing that LinkedIn says can help users accelerate analytics by up to 60 times. At launch, Cubert works only with Hadoop’s default MapReduce execution engine, but the company plans to leverage the extensibility of the framework to add support for the much faster Spark further down the road. More analytic functions and increased automation are in the works as well.

photo credit: Camil Tulcan via photopin cc
