UPDATED 23:53 EDT / JULY 09 2018

BIG DATA

Hadoop’s star dims in the era of cloud object data storage and stream computing

One of the most noteworthy findings from Wikibon’s annual update to our big data market forecast was how seldom Hadoop was mentioned in vendors’ roadmaps.

I wouldn’t say that Hadoop — open-source software for storing data and running applications on large hardware clusters — is entirely dead. Most big-data analytics platform and cloud providers still support such Hadoop pillars as YARN, Pig, Hive, HBase, ZooKeeper and Ambari.

However, none of those really represents the core of this open-source platform in the way that the Hadoop Distributed File System, or HDFS, does. And HDFS is increasingly missing from big data analytics vendors’ core platform strategies.

The core reason why HDFS is receding in vendors’ big data roadmaps is that their customers have moved far beyond the data-at-rest architectures it presupposes. Data-at-rest architectures — such as HDFS-based data lakes — are becoming less central to enterprise data strategies. When you hear “data lake” these days, it’s far more likely to be in reference to some enterprise’s data storage in S3, Microsoft Azure Data Lake Storage, Google Cloud Storage and the like.

Even a Hadoop stalwart such as Hortonworks Inc. sees the writing on the wall, which is why, in its recent 3.0 release, it emphasized heterogeneous object storage. The new Hortonworks Data Platform 3.0 supports data storage in all of the major public-cloud object stores, including Amazon S3, Azure Blob Storage, Azure Data Lake, Google Cloud Storage and the EMR File System (EMRFS) on AWS.

HDP’s latest storage enhancements include a consistency layer, NameNode enhancements to support scale-out persistence of billions of files with lower storage overhead, and storage-efficiency enhancements such as support for erasure coding across heterogeneous volumes. HDP workloads access non-HDFS cloud storage environments via the Hadoop Compatible File System API.
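
A quick back-of-the-envelope comparison shows why the erasure-coding support mentioned above lowers storage overhead relative to HDFS’s classic three-way replication. The sketch below assumes a Reed-Solomon RS(6,3) layout, the default erasure-coding policy family in HDFS 3.x; the figures are simple arithmetic, not measurements from HDP itself:

```python
# Storage overhead: 3x replication vs. Reed-Solomon erasure coding.
# HDFS 3.x's default EC policy, RS(6,3), stores 6 data blocks plus
# 3 parity blocks per stripe -- the figures below follow from that.

def replication_overhead(copies: int) -> float:
    """Extra storage as a fraction of raw data (2.0 == 200% overhead)."""
    return copies - 1

def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Parity blocks stored per data block in one erasure-coded stripe."""
    return parity_blocks / data_blocks

print(replication_overhead(3))   # 3x replication: 200% overhead
print(erasure_overhead(6, 3))    # RS(6,3): 50% overhead
```

RS(6,3) tolerates the loss of any three blocks in a stripe while adding only 50 percent overhead, versus 200 percent for triple replication, which is why erasure coding is attractive for colder data tiers.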

So it was no surprise when MapR recently unveiled its 6.1 data platform update, still in beta, with scarcely a reference to HDFS or any other core component of the Hadoop ecosystem, apart from Hive 2.3. Though it had always distanced itself slightly from the by-the-book Hadoop vendors, MapR has gone even further in its latest release. It now offers a robust next-generation cloud data platform grounded in the following pillar technologies:

  • Object storage: Persisting heterogeneous data objects in heterogeneous multiclouds is the new data-at-rest fabric. It’s telling that MapR now positions the S3 API as the core of its data-at-rest fabric, though it also supports data reads and writes using the HDFS, NFS, POSIX, SMB and REST interfaces. MapR’s new Object Data Service gives data administrators the flexibility to integrate with their choice of public or private clouds that expose the S3 API (including but not limited to Amazon Web Services’ S3 public cloud service). MapR exposes one global namespace and enforces a common set of rules and policies — such as access control, automated data placement, volume encryption, and erasure coding — across a distributed object storage environment, regardless of the public, private or hybrid cloud tiers or the formats in which data is persisted. And it adds policy-based tiering to move data automatically across performance, capacity and archive storage on-premises and in the cloud.
  • Stream computing: Processing diverse data objects continuously is the new data-in-motion fabric. After object storage, stream computing is the most important news in MapR’s latest platform refresh. Kafka, in particular, is a key focus for MapR, as it is for most other big data analytics vendors now. The latest MapR release supports simplified development of streaming analytics and change data capture applications with the Kafka 1.1 and KStreams APIs, as well as friendlier querying of streaming data via Kafka’s KSQL language. At the same time, MapR now supports Spark Structured Streaming 2.3 for high-performance continuous processing with in-stream machine learning.
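
To make the “expose the S3 API” point above concrete: from a client’s perspective, every S3-compatible store answers the same HTTP requests, so only the endpoint URL changes. The sketch below builds (but does not send) an S3 ListObjectsV2 request; the endpoint and bucket names are hypothetical, not real MapR or AWS resources:

```python
# Sketch: what "exposing the S3 API" means in practice. Any S3-compatible
# store -- AWS S3, or (hypothetically) a MapR Object Data Service
# endpoint -- answers the same request shapes, so a client only swaps the
# endpoint URL. Names below are made up for illustration.
from urllib.parse import urlencode

def list_objects_url(endpoint: str, bucket: str, prefix: str = "") -> str:
    """Build an S3 ListObjectsV2 request URL (constructed, not sent)."""
    query = urlencode({"list-type": "2", "prefix": prefix})
    return f"{endpoint}/{bucket}?{query}"

# The same call shape works against any S3-compatible storage tier:
print(list_objects_url("https://s3.amazonaws.com", "my-data-lake", "logs/"))
print(list_objects_url("https://mapr.example.internal:9000", "my-data-lake", "logs/"))
```

This interface uniformity is what lets a platform enforce one set of placement, encryption and tiering policies across public, private and hybrid cloud stores.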

Object storage is the core platform for big data now, but it’s very likely that it will be eclipsed in importance by stream computing over the coming decade. As I noted in this recent SiliconANGLE article, streaming is as fundamental to today’s always-on economy as relational data architectures were to the prior era of enterprise computing. In Wikibon’s big data market update, we uncovered several business technology trends that point toward a new era in which stream computing is the foundation of most data architectures:

  • Data sources are incorporating more sensor and machine data acquired locally at “internet of things” endpoints.
  • Adoption of serverless computing is shifting workloads toward event-driven request-response flows over always-on fabrics.
  • Edge-facing application architectures require in-stream analytic processing, inferencing and training at mobile, embedded and IoT devices.
  • Shifts toward real-time, live, and interactive online sessions require end-to-end environments that support low-latency, continuous data processing.
  • Movement of transactional workloads into stream computing is bringing stateful and orchestrated semantics into these environments.
  • Decision support in always-on environments requires persistence of durable sources of truth in streaming platforms.
  • The maturation of open-source streaming environments such as Kafka, Flink and Spark Structured Streaming has put this technology in enterprise information technology professionals’ comfort zones.

Enterprises are expanding their investments in in-memory, continuous computing, change data capture and other low-latency solutions while converging those investments with their data-at-rest environments, including Hadoop, NoSQL and RDBMSs. Within the coming decade, the database as we used to know it will be ancient history, from an architectural standpoint, in a world where streaming, in-memory, edge and serverless infrastructures reign supreme.

The last wall of the Hadoop castle is built on at-rest architectures in support of stateful and transactional applications, but it appears likely that streaming environments such as Kafka will address more of those requirements robustly, perhaps in conjunction with blockchain as a persistent metadata log.
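
The “durable source of truth” role a stream can play is easy to sketch: rather than storing current state in a database, state is derived on demand by replaying an append-only event log, which is what a Kafka topic with long retention provides. A minimal, hypothetical example:

```python
# Sketch of the "durable log as source of truth" idea: current state is
# never stored directly; it is derived by folding over an append-only
# event log (the role a retained Kafka topic can play). Hypothetical data.
def replay(log):
    """Fold an append-only event log into current account balances."""
    balances = {}
    for event in log:
        acct = event["account"]
        if event["type"] == "deposit":
            balances[acct] = balances.get(acct, 0) + event["amount"]
        elif event["type"] == "withdraw":
            balances[acct] = balances.get(acct, 0) - event["amount"]
    return balances

log = [
    {"type": "deposit", "account": "a", "amount": 100},
    {"type": "withdraw", "account": "a", "amount": 30},
    {"type": "deposit", "account": "b", "amount": 50},
]
print(replay(log))   # state is fully reproducible from the log alone
```

Because any consumer can rebuild identical state from the same log, the log itself becomes the system of record, which is the foundation for the stateful and transactional semantics discussed above.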

In fact, a database-free world may await us in coming decades as streams, object stores, blockchain and the IoT pervade all applications. Check out this thought-provoking article for a discussion of how that’s already possible, and this article for a discussion of how different stream types can support transactional data apps.

Hadoop may still have plenty of useful life left in it. Databases may endure as pillars of many application architectures. But we’ve entered a new era where these familiar landmarks are receding. It’s an era in which stream computing cuts new channels through every application environment and massive object stores anchor it all.

Image: kalhh/Pixabay
