UPDATED 14:50 EST / JUNE 20 2018

BIG DATA

At DataWorks 2018, Hortonworks accelerates its shift toward public cloud deployments

ANALYSIS by James Kobielus

DataWorks Summit began as Hadoop Summit in 2008, when it was a developer event hosted by Yahoo Inc. Since then, the big-data market has exploded as established data vendors and hot startups drove a wave of innovation, and the core technology evolved and eventually gave way to a deeper open-source ecosystem.

What has remained the same is data’s centrality in the global economy and the leading big-data solution providers’ orientation toward entirely or predominantly open-source software.

Hadoop is the least of it. Among multicloud big-data solution providers, Hortonworks Inc. remains one of the most active participants and committers in the open-source ecosystem. Its solution portfolio now incorporates 26 open-source codebases. These include widely adopted data and analytics platforms such as Apache Hadoop, Apache Hive and Apache Spark, as well as a diverse range of code such as Apache Atlas, Apache Ranger, Apache Ambari and Apache Knox for managing, securing and governing big data environments.

The product focus at DataWorks (pictured) was on Hortonworks’ launch of the next generation of its core Hadoop platform, which is built on the Apache Hadoop 3.1 distribution. The new Hortonworks Data Platform 3.0 is now available for preview under an early access program and is expected to be generally available in the third quarter of 2018.

HDP 3.0 includes the following new features:

Containerization: For developers building the next generation of cloud-native data applications, HDP 3.0 supports faster building, training and deployment of containerized advanced analytics, machine learning, deep learning and artificial intelligence workloads and microservices, all the way to the edges of today’s increasingly distributed cloud environments. With containers running on HDP, developers can move fast, deploy more software efficiently and operate with increased velocity, which suits the DevOps environments in which more data applications are being built. Scalability enhancements in HDP 3.0 support running very large multitenant clusters and more packaged containerized microservices.
GPU support: Support for graphics processing units within Hadoop 3.1’s YARN scheduler enables AI, DL and ML workloads to run on supported Hadoop clusters. HDP 3.0 now allows Hortonworks customers to leverage GPUs in the cloud for scalable training and inferencing workloads. When developing and refining containerized TensorFlow applications in the cloud, data scientists can share access to GPU resources through pooling and isolation in HDP 3.0.
Hive 3.0 support: HDP 3.0 includes a real-time database built on Hive 3.0, which now incorporates Tez, LLAP and Druid to support real-time data warehousing. With this release, Hive has evolved into a full enterprise database with support for high concurrency, low latency, expanded SQL syntax and ACID compliance. Within HDP 3.0, Hive 3.0 provides a unified SQL layer that supports improved query optimization to process more data, both real-time and historical, more rapidly for low-latency and high-throughput applications. It enables scalable, interactive query of data that lives anywhere in private, public and hybrid clouds.
Heterogeneous cloud-storage optimization: HDP 3.0 includes the ability to separate storage clusters from compute clusters. As an alternative to HDFS when running in the cloud, HDP 3.0 supports data storage in all of the major public-cloud object stores, including Amazon S3, Azure Blob Storage, Azure Data Lake, Google Cloud Storage and AWS Elastic MapReduce File System. HDP workloads access cloud storage environments via the Hadoop Compatible File System API. The latest storage enhancements include a consistency layer for nonconsistent cloud stores. The release also offers improved storage scalability, leveraging enhancements in NameNode to support scale-out persistence of billions of files with lower storage overhead. It also includes storage-efficiency enhancements such as support for erasure coding.
Governance and compliance enhancements: HDP 3.0 enables enhanced governance and compliance with such mandates as GDPR, through support for a full chain of data custody and fine-grained event auditing. Users can now track the lineage of data from its origin all the way to its storage in data lakes built on HDP 3.0. This allows auditors to view data without making changes, enforce time-based policies and audit events around third parties with encryption protection. HDP 3.0 also supports shared enterprise security and data governance services across public clouds and automatic cluster scaling based on usage or time metrics.
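
To make the storage-efficiency point about erasure coding concrete, here is a minimal back-of-the-envelope sketch in Python. It is not Hortonworks code and does not call any Hadoop API. It assumes the comparison is between classic three-way HDFS replication and a Reed-Solomon layout with six data units and three parity units, RS(6,3), which is one of the erasure coding policies shipped with Apache Hadoop 3. The class and function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageScheme:
    """Hypothetical model of a redundancy scheme, for illustration only."""
    name: str
    data_units: int        # units holding original data
    redundancy_units: int  # extra replicas or parity units

    @property
    def raw_per_logical(self) -> float:
        # Raw bytes written to disk per byte of user data.
        return (self.data_units + self.redundancy_units) / self.data_units

    @property
    def overhead_pct(self) -> float:
        # Extra storage beyond the user data, as a percentage.
        return 100.0 * self.redundancy_units / self.data_units

    @property
    def tolerated_failures(self) -> int:
        # Number of lost units the scheme can survive.
        return self.redundancy_units


def raw_tb_needed(scheme: StorageScheme, logical_tb: float) -> float:
    """Raw capacity needed to hold `logical_tb` of user data."""
    return logical_tb * scheme.raw_per_logical


# Assumption: 3x replication = 1 original copy + 2 extra replicas.
REPLICATION_3X = StorageScheme("3x replication", data_units=1, redundancy_units=2)
# Assumption: RS(6,3) = 6 data units + 3 parity units per stripe.
RS_6_3 = StorageScheme("Reed-Solomon RS(6,3)", data_units=6, redundancy_units=3)

if __name__ == "__main__":
    for scheme in (REPLICATION_3X, RS_6_3):
        print(
            f"{scheme.name}: {scheme.overhead_pct:.0f}% overhead, "
            f"tolerates {scheme.tolerated_failures} lost units, "
            f"{raw_tb_needed(scheme, 100):.0f} TB raw per 100 TB of data"
        )
```

Under these assumptions the erasure-coded layout needs half the raw capacity of three-way replication for the same data while tolerating one more lost unit, which is the kind of saving the storage-efficiency claim refers to. Real clusters trade this against extra CPU and network cost during reconstruction, which the sketch does not model.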

There were no specific enhancements announced to Hortonworks DataFlow, or HDF, for data in motion, though Hortonworks discussed a new lightweight streaming technology, MiNiFi, that will enable customers to deploy containerized AI/DL/ML to IoT, edge and embedded endpoints in multiclouds.

Likewise, there was no specific product announcement regarding its DataPlane Service or DPS, which Hortonworks had previously launched as a “single pane of glass” for monitoring, managing and deploying data applications across complex hybrid-data multiclouds. However, Hortonworks extensively discussed a new compliance-relevant DPS solution, Data Steward Studio, that it rolled out a couple of months ago at DataWorks 2018 Berlin.

As in previous releases of the platform, HDP 3.0 enables customers to build hybrid data multiclouds that include any and all of the major public cloud providers. Last Friday, Hortonworks released Cloudbreak 2.7, which supports provisioning of HDP clusters into complex hybrid cloud architectures.

Hortonworks now puts its public cloud footprint front and center in its go-to-market message, though only 25 percent of its 1,400 paying customers currently run Hortonworks solutions in the public cloud and just 5 percent run exclusively there. By contrast, 95 percent of Hortonworks customers deploy its offerings entirely or predominantly on-premises.
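
To put those percentages in absolute terms, here is a small arithmetic sketch in Python. It assumes that both percentages apply to the same base of 1,400 paying customers, which the figures above imply but do not state outright. The variable and function names are illustrative, not anything published by Hortonworks.

```python
# Figures quoted in the article.
PAYING_CUSTOMERS = 1_400
PCT_ANY_PUBLIC_CLOUD = 25   # run Hortonworks solutions in public cloud at all
PCT_PUBLIC_CLOUD_ONLY = 5   # run exclusively in public cloud


def share(total: int, pct: int) -> int:
    """Whole-number count of customers for an integer percentage."""
    return total * pct // 100


# Assumption: both percentages use the same 1,400-customer base.
any_public_cloud = share(PAYING_CUSTOMERS, PCT_ANY_PUBLIC_CLOUD)
public_cloud_only = share(PAYING_CUSTOMERS, PCT_PUBLIC_CLOUD_ONLY)

# Customers that use public cloud alongside other deployments.
mixed_deployments = any_public_cloud - public_cloud_only

# Everyone who is not exclusively in public cloud (the 95 percent).
not_public_cloud_only = PAYING_CUSTOMERS - public_cloud_only

if __name__ == "__main__":
    print("Any public cloud:     ", any_public_cloud)
    print("Public cloud only:    ", public_cloud_only)
    print("Mixed deployments:    ", mixed_deployments)
    print("Not public-cloud-only:", not_public_cloud_only)
```

Read this way, the public-cloud-only group is a small minority of the customer base, which is why the go-to-market emphasis on public cloud is better understood as a bet on where workloads are heading than as a description of where they run today.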

Nevertheless, Hortonworks sees a customer trend toward putting more analytics workloads in public and hybrid clouds, and its entire product roadmap is focused on making that transition as seamless as possible for its customers. This is consistent with Wikibon’s finding from the recent annual update to our big-data market forecast. Our analysts found that hybrid clouds are becoming an intermediate stop for enterprise big-data analytics deployments on the way to more complete deployment in public clouds in the coming decade and beyond. Across the big-data market, traditionally premises-based platforms are being rearchitected to deploy primarily in public clouds.

Hortonworks had public-cloud partnership announcements this week at DataWorks San Jose designed to help customers make that transition when they’re ready:

IBM Corp.: The partners announced IBM Hosted Analytics for Hortonworks, or IHAH, which runs HDP 3.0 instances on IBM Cloud and incorporates IBM Db2, IBM Big SQL and IBM Data Science Experience, or DSX. This move builds on last year’s announcement by IBM and Hortonworks that they were incorporating HDP and DSX into a converged solution for the next generation of developers building AI-driven applications for multicloud deployment. IHAH brings that converged data management and analytics offering into the IBM Cloud as a hosted service. It enables quick setup, provisioning, security and deployment so that data scientists and other developers can rapidly operationalize their applications for production enterprise uses. It lets users run DSX workloads in virtual Python environments on all HDP clusters hosted in IBM Cloud without needing to install Python libraries on those nodes. Within IHAH, DSX workloads can easily consume the data and infrastructure services managed in HDP data lakes in IBM Cloud. The hosted service also enables data scientists to write ANSI SQL to invoke IBM Big SQL directly from DSX, avoiding the need to write Python scripts in order to bring together different types of data from different federated data stores in IBM Cloud.
Microsoft Corp.: The partners announced that customers can now deploy the complete Hortonworks portfolio (including HDP, HDF and DPS) natively on Microsoft Azure’s infrastructure-as-a-service public cloud. This gives joint customers greater flexibility in distributing big data workloads throughout complex hybrid multicloud scenarios, including edge deployment in the “internet of things.” Joint customers also retain the choice of running their analytic workloads, such as Hadoop and Spark, purely in the public cloud on the existing HDInsight offering in Microsoft Azure.
Google LLC: The partners announced expanded support for Google Cloud Platform public-cloud storage services. Hortonworks customers can now tap into Google Cloud Storage to support HDP, HDF and DPS workloads that run in diverse private, public and hybrid cloud environments. In the GCP public cloud, users can run fast, scalable analytics for interactive query, AI/ML/DL and streaming data analytics. At no upfront cost and in minutes, customers can provision HDP, HDF and DPS workloads in GCP with unlimited elastic scalability. They can automate and optimize the provisioning of GCP resources while configuring and securing workloads in the cloud. They now have the flexibility to run ephemeral, short-lived workloads in GCP. And they can securely move any data flow from any source between on-premises HDP/HDF/DPS deployments and GCP deployments.

In his interview on theCUBE, Rob Bearden, Hortonworks Chief Executive Officer, discussed key themes in the vendor’s go-to-market strategy now: edge, containerization, compliance, connected communities, stream computing, and a “single pane of glass” to manage data analytics assets distributed across hybrid clouds.

Image: Robert Hof/SiliconANGLE

Other interviews with Hortonworks executives and partners are here. (* Disclosure: Hortonworks Inc. sponsored these segments of theCUBE. Neither Hortonworks nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)
