UPDATED 14:03 EDT / APRIL 25 2014

The Data Economy: Meet the hybrid data scientist-application developer

All the investment and innovation that’s occurred in the Big Data infrastructure space over the last decade will have gone for naught if data scientists and application developers can’t production-ize analytic insights.

That’s why YARN, a sub-project of Apache Hadoop released last fall, is such a big deal. YARN enables developers to build applications on Hadoop that process data in multiple new ways beyond just batch processing.

Still, YARN, like HDFS and MapReduce before it, is simply an enabler. Developers still need to actually build Big Data applications. This requires toolsets that allow developers to integrate multiple data streams, to apply predictive models at scale, to create intuitive user interfaces and more. And even then, significant training is needed, especially for data scientists whose expertise is in analyzing data, not building applications for end-users.

So I’ve been encouraged by a handful of recent announcements from Hadoop ecosystem vendors aimed at lowering the barriers to successful Big Data application development.

Hadoop meets application development tooling

On the tooling side of the equation, Hortonworks recently expanded its partnership with Concurrent, which sells support services for the open source Cascading application development framework. When I spoke with the company last fall, Concurrent Founder and CTO Chris Wensel described Cascading as a Java library used by application developers to quickly create complex, data-oriented applications. Concurrent’s Cascading SDK abstracts away the complexity of dealing with things like MapReduce and Pig, allowing developers to integrate data sources via APIs and easily migrate predictive models into Hadoop. (You can explore sample Cascading-based apps on GitHub.)

HDP with Cascading (Source: Hortonworks)

As part of the expanded partnership, Hortonworks said it will ensure ongoing compatibility of Cascading-based apps with the Hortonworks Data Platform and will provide level 1 and level 2 Cascading support for customers (Concurrent will still handle level 3 support). This compatibility includes the ability to execute Cascading-based apps on Apache Tez, a recently developed Hadoop-based execution engine for real-time Big Data workloads. While Concurrent itself is still in its early days, open source Cascading is quite popular with application developers, garnering over 90,000 downloads per month.

Training needed to make the most of Hadoop

Even with better tooling, application developers need to learn new skills in order to build enterprise-grade Big Data apps. This requires training, a cause Cloudera has taken on as its own. At its analyst day event in March, I learned from CEO Tom Reilly that Cloudera has trained over 50,000 practitioners on Hadoop since the company’s founding in 2009. But its training efforts really took off in December when it formed a partnership with Udacity, a provider of MOOCs focused on computer and data science. Since then, Cloudera has trained over 30,000 practitioners. Cloudera estimates its educational services, overseen by Sarah Sproehnle, have trained over 80% of all practitioners who have taken some form of Hadoop training.

Sarah Sproehnle, Vice President of Educational Services at Cloudera (Source: LinkedIn)

Earlier this month, Cloudera announced a new training service to provide developers with hands-on training in building Big Data applications on Hadoop. Cloudera says the purpose of the four-day course is to prepare “data professionals to use an EDH’s full capabilities to build custom, converged applications that enable their organizations to achieve greater value from data and solve real-world problems.”

EDH refers to Cloudera’s Enterprise Data Hub, which layers multiple data processing engines, including Cloudera Impala and Cloudera Search, on top of its core Hadoop distribution. While the ability to process data in multiple ways on a single platform is a positive in the abstract, it means Hadoop application developers must be fluent in a number of data processing approaches. Cloudera’s new training course is designed to help developers learn these new skills so they can take advantage of Hadoop’s new multi-data processing capabilities.

A new, hybrid role emerging

Both announcements are good signs. They signal to me that Hadoop is starting to move beyond the “feeds-and-speeds” stage of its life cycle and into a new stage focused on business value. What good is all that data you can now store, process and analyze relatively inexpensively in Hadoop if you can’t surface actionable insights to the end users who are responsible for moving the business forward? Not much. That’s why the development of Big Data applications is so critical.

But tools and tools-related training are just two legs of the stool. There are softer skills that data scientists and application developers need to learn, and a not insignificant change-management challenge lurking in the background.

Is this the face of the new hybrid data scientist-application developer? (Source: Wikibon)

Namely, as enterprise applications become more data-centric, the roles of data scientist and application developer are merging. In the short term, this means the two roles must learn to collaborate more effectively and both must adopt new ways of thinking. For data scientists, this means starting to think more about how the insights they uncover can be translated into repeatable form factors consumable by end users. And application developers need to gain a better understanding of data flows and how analytic requirements impact application performance.

CIOs, too, have a job to do in facilitating this transition. They should take steps now to enable and encourage collaboration between data scientists and application developers, and help each role understand the challenges of the other. This may require developing new incentives and ways of measuring (and rewarding) outcomes that focus on the mutual successes of these two previously siloed roles.

In the long term, this gradual merging of roles may result in a new role entirely, a hybrid data scientist-application developer. There are a few already out there (though they may not think of themselves in these terms), but they are rare. I don’t know exactly what we might call this role, but if you think data scientists are valuable and hard to find today, just think what the demand for this hybrid data scientist-application developer will be in the years to come.

feature image: Photo Extremist via photopin cc
