The broken promise of open-source Big Data software – and what might fix it

Standing amid the clamor of the Hadoop Summit trade show in San Jose in June, Rakesh Kant couldn’t conceal his annoyance.

As head of enterprise data management and analytics technology for U.S. Bank, Kant wanted to use some of the many emerging Big Data technologies hawked at the show — such as Hadoop, the open source software used to store huge amounts of data in clusters of computers — to help his employer make better use of all the data it collects. But the confusing morass of open source software and cloud services leaves him unsure how to do it and worried about buying technologies that may become obsolete before a project even launches.

“The industry is evolving more and more experiments that are confusing the market,” Kant said. “We don’t want to spend time choosing things, we want to deploy them.”

Open-source software, which anyone can modify and improve, has enabled an explosion of innovation throughout information technology in the past couple of decades. In particular, Big Data software developed this way helped Yahoo, Google, Facebook and other companies create services used by billions of people. Indeed, these companies and the startups spun out of them created much of that same open-source Big Data software, and many of the related services, precisely to solve problems that traditional IT couldn’t, such as indexing the Internet to facilitate search.

Downside of open source

The hope was that this data-driven open-source software also could unlock massive value inside more traditional corporations, from banks to retailers to manufacturers. But that hasn’t happened nearly as widely as expected. And it’s not likely to happen anytime soon. Gartner Inc. Research Director Nick Heudecker estimates that through 2018, some 70 percent of Hadoop deployments won’t produce hoped-for higher revenues or cost savings.

The situation exposes a downside of open source software, which saw earlier blockbuster successes such as the operating software Linux, sold in an enterprise version by Red Hat Inc. and others. Startups hoped to do the same with Hadoop, the data processing software Spark and other software. But they’ve struggled to win over many of the largest corporations, which are accustomed to integrated systems and support from the likes of IBM Corp. and Oracle Corp. Those enterprises don’t want to have to cobble together software from dozens of startups.

“It’s sort of like putting Frankenstein together,” said George Gilbert, Big Data analyst at Wikibon Research, owned by the same company as SiliconANGLE. As a result, added Peter Burris, chief research officer at SiliconANGLE Media, “Big Data pilot projects often are abandoned.” (The analysts, along with SiliconANGLE Media co-Chief Executive Dave Vellante, will explore this and other issues starting Tuesday at the BigDataNYC conference, which will be broadcast on theCUBE, the video unit of SiliconANGLE Media.)

All that means startups that bet on open source as the path to creating the next great software business are starting to see their growth sputter. In early August, Hortonworks Inc., one of the few open source Big Data companies that’s publicly held, saw its shares plunge 25 percent after missing second-quarter earnings expectations, losing $64.2 million on revenues of $43.6 million. Its top sales executive, President Herb Cunitz, also left the company.

“The open-source-only model isn’t working,” said John Schroeder, chairman of MapR Technologies Inc., which sells its own versions of Hadoop and other big data software. “It only really worked with Red Hat and I don’t think it’ll work again.”

That may be premature. But there are plenty of problems to overcome, especially for large enterprise buyers — chief among them the dizzying array of overlapping software, a situation endemic to open source. “A bunch of vendors are picking off parts of the problem and saying they’ll just solve that,” said Andrew Brust, senior director of market strategy and intelligence at Big Data analytics and visualization firm Datameer Inc. “The integration of these things is non-trivial.”

Before co-founding the Big Data management service Unravel Data, which launched publicly Sept. 21, Kunal Agarwal saw that problem up close and personal. Customers at another company where he sold Oracle Corp. software struggled to make their projects work. “People were spending half their time fighting Big Data issues such as debugging and performance instead of creating new applications,” he said. “Johnson & Johnson won’t use Big Data until all the kinks are worked out.”

Lack of Big Data talent

Many large companies are at least getting started on projects. Hortonworks says half the Fortune 100 are customers, including Macy’s, Ford Motor Co. and Royal Bank of Canada. One sign that some customers are increasing the size of their projects is that on average, customers spent about 50 percent more after they renewed their support subscriptions. “It’s very early days,” Hortonworks Chief Executive Rob Bearden said during the company’s second-quarter earnings call. “We’ve just begun to penetrate these markets.”

Moreover, some mainstream companies have gone ahead with larger deployments, with some success. Hortonworks customer Progressive Insurance created data centers using Hadoop computer clusters to analyze data from its Snapshot device, which tracks clients’ driving behavior, to do custom pricing. CapitalOne Financial Corp. uses another open-source technology to detect transaction fraud on its corporate network.

And truth be told, some companies have themselves to blame for the failures. “A lot of non-tech companies have waded in with unclear goals,” said Doug Henschen, vice president and principal analyst at Constellation Research. “They’ve fallen prey to the myths of these platforms, like you can just dump everything into a data lake.”

Making things even more difficult is a lack of Big Data engineering talent in large traditional enterprises. Analytics tools such as Hadoop and Spark are relatively new, so engineers and programmers who know how to use them are scarce. Technology companies such as Google and Facebook, as well as hot startups, also snap them up with big salaries and stock options. And the projects require lots of talent. Gilbert said Big Data projects deployed on a relatively small network of five servers can require five engineers to administer them.

The challenges are acute enough that they’re spurring the rise of a cottage industry of companies such as Unravel Data that offer to simplify various parts of Big Data projects. Cask Data Inc., for instance, whose software makes it easier to create applications on Hadoop, on Sept. 19 launched a Big Data app store with pre-built components such as connections to Salesforce.com’s services and Twitter’s sentiment analysis. Qubole Inc. offers a managed data service on several public clouds that hides some of the complexity of Hadoop and related software. Yet another startup, Iguazio, has created an entirely new data “stack” that bypasses bottlenecks in operating systems, networks and storage systems to simplify and speed up data flow and use.

Even traditional software companies smell an opportunity. Oracle, for instance, made a pitch at its OpenWorld conference last week that it can uniquely provide an integrated solution for customers’ cloud requirements, including Big Data analysis. A Wikibon study found that Oracle’s “Big Data Appliance” provides a simplified infrastructure much faster than homegrown on-premise Hadoop setups. And on Sept. 27, SAP confirmed plans to buy Big Data-as-a-service startup Altiscale.

Public cloud competition

Open source remains an important part of Big Data projects. And the rapid innovation and built-in hedge against getting locked into traditional software guarantee an ongoing role.

But its commanding role may be in jeopardy as corporate urgency to leverage Big Data becomes acute. Even the openness of open source may be in doubt. Yaron Haviv, Iguazio’s cofounder and chief technology officer, said the plethora of individual open-source projects ends up making it difficult for one piece of software to connect with others. That can mean, he said, that “the lock-in with open source is even worse” than with traditional software in many cases.

There’s a competitive aspect at play, too. Public cloud companies that have largely served tech-savvy startups are now getting the attention of enterprises with an appealing offer: Let us run everything and manage all that complexity in the background, and pay only when you need the services. Amazon, for instance, is pitching AWS as “the most complete platform for Big Data.” So is Microsoft with its Azure cloud. And analysts say they’re starting to fulfill their hype.

“At one point, the cloud vendors were less sophisticated,” said Mike Gualtieri, an analyst with Forrester Research. Not anymore, he added. As a result, said Slim Baltagi, CapitalOne’s director of enterprise architecture, eventually all Big Data applications will move to the cloud.

The threat hasn’t escaped big data software companies. Cloudera Inc. recently was said to be seeking $1 billion from investor Intel Corp. to build its own data centers from which to offer its software in the cloud.

All this skirmishing may be largely a function of the still-early stage of the market. Some observers, though, think Big Data applications are finally about to emerge in full production settings. “We are at the beginning of the mainstream part of the market,” said Gartner Inc. analyst Merv Adrian. But that new market may end up looking a little more like the old software industry than open-source advocates had hoped.

* * *

SiliconANGLE Media co-CEO Dave Vellante and Chief Research Officer Peter Burris offered deeper analysis of the challenges and opportunities in open-source Big Data software at SiliconANGLE’s BigDataNYC conference this week.

Image by SiliconANGLE