Cascading 2.0: An Application Framework for Hadoop Winning the Attention of Twitter, Etsy and EMC


Cascading is an open-source application framework getting the attention of Twitter, Etsy, Airbnb and big data analytics companies such as EMC Greenplum and MapR. Its popularity stems from its ability to abstract away the complexities of MapReduce and make Hadoop clusters easier to manage.

Today Concurrent announced Cascading 2.0, an enterprise-grade development platform designed for Java developers to build big data applications on top of Hadoop.

The complexity of MapReduce makes deploying big data apps a time-consuming endeavor with multiple opportunities for error. With Cascading 2.0, data scientists and developers use high-level scripting languages and open APIs to handle data processing, integration and scheduling on complex Hadoop clusters.

Cascading reminds me a bit of platforms such as Yahoo! Pipes, which aggregates RSS feeds, Web pages and other data sources and pipes that data into Web-based applications that publish information to the Web.

Cascading follows similar principles. According to Wikipedia, Cascading users create descriptions of processes that often consist of business logic. Data is captured from different sources and run through pipes that use algorithms to process the data. Pipes are built independently from the data they will process. Once the pipes are tied to the data sources and “sinks,” the user can create flows that may be grouped into a “cascade.” These cascades run through a process scheduler so the clusters can be managed more easily.
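
To make the pipe, source, sink and flow vocabulary concrete, here is a minimal sketch in plain Java. It does not use the Cascading library or its API; the class `PipeSketch` and the methods `dropEmptyPipe`, `upperCasePipe` and `runFlow` are hypothetical names invented for illustration. It only shows the idea that a pipe is defined without knowing its data, and that a flow is what binds it to a source and a sink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Illustration only: these names are hypothetical and are not Cascading classes.
public class PipeSketch {

    // A "pipe" is a reusable transformation, defined independently of any data source.
    public static Function<List<String>, List<String>> dropEmptyPipe() {
        return lines -> {
            List<String> out = new ArrayList<>();
            for (String line : lines) {
                if (!line.isEmpty()) {
                    out.add(line);
                }
            }
            return out;
        };
    }

    // A second pipe that can be chained after the first one.
    public static Function<List<String>, List<String>> upperCasePipe() {
        return lines -> {
            List<String> out = new ArrayList<>();
            for (String line : lines) {
                out.add(line.toUpperCase(Locale.ROOT));
            }
            return out;
        };
    }

    // A "flow" ties a source and a sink to a pipe assembly and runs it once.
    public static void runFlow(List<String> source,
                               Function<List<String>, List<String>> pipe,
                               List<String> sink) {
        sink.addAll(pipe.apply(source));
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("alpha", "", "beta");
        List<String> sink = new ArrayList<>();

        // Chain two pipes into one assembly, then bind it to the source and sink.
        runFlow(source, dropEmptyPipe().andThen(upperCasePipe()), sink);

        System.out.println(sink); // prints [ALPHA, BETA]
    }
}
```

In the real framework the sources and sinks would be files or tables on a Hadoop cluster rather than in-memory lists, and several flows could be grouped into a cascade and handed to a scheduler; none of that is modeled here.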

Developers program in JVM-based languages and do not need to learn MapReduce. That in itself can make it far easier to deploy big data apps.

As a result, Cascading 2.0 is getting more attention from companies like EMC that are investing heavily in big data. EMC Greenplum is distributing Cascading as part of its Greenplum MR distribution, and plans to increase integration and support with other offerings in the future.