UPDATED 15:14 EDT / JANUARY 25 2019

Kubeflow to the rescue: ML toolkit offers hope for data science and deep learning

The machine learning community faces a growing skills gap, and it will take some serious technology to close it.

Google Cloud executive Rajen Sheth recently voiced his agreement with estimates that the number of machine learning engineers capable of moving deep learning from concept to production amounts to only a few thousand. But there are millions of data scientists and significantly more developers. How can the gap be closed?

The answer may lie, in large part, in the current activity among major cloud players and key figures in the open-source community around a relatively new yet vitally important project: Kubeflow.

The Kubeflow project, co-founded by David Aronchick (pictured) at Google LLC in 2017, provides a toolkit that lets data scientists run machine learning jobs on Kubernetes clusters without a lot of extra work and adaptation.

“When it gets to really complex apps, like machine learning, you’re able to do that at an even higher-level using constructs like Kubeflow,” said Aronchick, as he described how data scientists can quickly create a model. “When they’re done they hit a button, and it will provision out all the machines necessary, all of the drivers, spin it up, run that training job, bring it back, and shut everything down.”
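
To make that workflow concrete, the following is a minimal sketch of how such a one-click training run can be described with the Kubeflow Pipelines Python SDK (kfp, v1-style ContainerOp API). The container image, training script and parameters are hypothetical placeholders, and the interview does not prescribe this exact API.

```python
# A minimal sketch (not Aronchick's exact workflow) of a one-click training run
# described with the Kubeflow Pipelines SDK, v1-style ContainerOp API.
# The image, script and parameter below are hypothetical placeholders.
import kfp
from kfp import dsl


@dsl.pipeline(
    name="train-and-tear-down",
    description="Provision resources, run a training job, then shut everything down.",
)
def training_pipeline(epochs: int = 10):
    # Each step runs in its own container on the Kubernetes cluster; Kubeflow
    # schedules the pod, and the cluster reclaims the resources when it finishes.
    train = dsl.ContainerOp(
        name="train-model",
        image="example.com/ml/trainer:latest",   # hypothetical training image
        command=["python", "train.py"],          # hypothetical training script
        arguments=["--epochs", epochs],
    )
    # Ask for an accelerator; device drivers come from the cluster's device
    # plugin rather than from the data scientist's laptop.
    train.set_gpu_limit(1)


if __name__ == "__main__":
    # Compile the pipeline into a package that can be uploaded to the
    # Kubeflow Pipelines UI and launched with a single click.
    kfp.compiler.Compiler().compile(training_pipeline, "training_pipeline.tar.gz")
```

Once compiled, the package can be uploaded to the Kubeflow Pipelines user interface and launched with a single click, which mirrors the “hit a button” experience Aronchick describes.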

Aronchick, who became head of open-source machine learning strategy at Microsoft Corp. in November, spoke with John Furrier (@furrier) and Stu Miniman (@stu), co-hosts of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the recent KubeCon + CloudNativeCon conference in Seattle. They discussed the impact of Kubeflow on workload portability, recent commercial contributions to support machine learning deployment, the importance of executing data training models at the edge, a growing need for improving data efficiency, and how corporate and open-source contributions are bringing Kubernetes to a new level of maturity.

This week, theCUBE features David Aronchick as its Guest of the Week.

Reaching portability and scalability

Kubeflow is a natural outgrowth of the Kubernetes movement, in which the popular container orchestration tool has made it easier to manage distributed workloads across the enterprise. Kubeflow deploys on top of the Kubernetes stack with the goal of making machine learning workloads portable and scalable across multiple nodes, as sketched below.
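
As an illustration of that idea, here is a hedged sketch of a distributed training job expressed as a Kubeflow TFJob custom resource and submitted with the standard Kubernetes Python client. The image name and replica count are hypothetical, and early Kubeflow releases exposed beta API versions rather than the v1 group shown here.

```python
# A minimal sketch of a distributed training job expressed as a Kubeflow TFJob
# custom resource and submitted with the standard Kubernetes Python client.
# The image and replica count are hypothetical; early Kubeflow releases used
# beta API versions (e.g. kubeflow.org/v1beta1) rather than the v1 shown here.
from kubernetes import client, config

config.load_kube_config()  # use the current kubectl context

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "distributed-train", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            # Two worker replicas: the Kubeflow training operator schedules one
            # pod per replica and wires up the distributed-training environment.
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "tensorflow",
                            "image": "example.com/ml/trainer:latest",  # hypothetical
                            "command": ["python", "train.py"],
                        }]
                    }
                },
            }
        }
    },
}

# Submit the custom resource; the operator running in the cluster does the rest.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="tfjobs", body=tfjob,
)
```

Because the job is just a Kubernetes object, the same definition can be applied on premises or on any public cloud’s managed Kubernetes service, which is the portability argument Aronchick makes.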

Workload portability is an essential ingredient in a world where enterprises are moving jobs between multiple clouds, and machine learning could help navigate an increasingly complicated environment. A survey of cloud computing trends conducted by RightScale Inc. last year found that 81 percent of enterprises had a multicloud strategy.

“I can’t overstate how valuable that portability is,” Aronchick said. “Kubernetes lets you compose these highly complex pipelines together that lets you do real training anywhere.”

New platform from Intel

The pace of innovation to facilitate precisely the kind of portable model that Aronchick describes is beginning to pick up. This month, Intel Corp. released a new platform — Nauta — designed to facilitate deep learning at scale for data scientists and developers.

Nauta will support both batch and streaming inference for model testing, while facilitating the use of Kubernetes to manage orchestration of machine learning pipelines in hybrid environments. Intel is a major code contributor to Kubeflow, and Nauta is built on the machine learning toolkit, according to statements from company executives during an artificial intelligence gathering in Munich, Germany, this month.

The latest news highlighted the need to execute data training models in a variety of locales. “You’re ultimately going to have models and training and inference move to many different locations,” Aronchick explained. “So you’ll do inference at the edge on my phone or on a little Bluetooth device in the corner of my house, saying whether it’s too hot or too cold. We’re going to need that kind of intelligence, and we’re going to do that kind of training and collection at the edge.”

Coping with data avalanche

While Intel’s latest announcement provides another boost for data scientists seeking to deploy machine learning using Kubeflow-based tools, engineers are still struggling with issues involving the sheer amount of data that must be processed and analyzed.

The solution may lie in academic and commercial research that is advancing artificial intelligence applications. If machines can discern anomalies faster than humans can, the potential grows for training models with less data rather than more.

Computational researchers at Vicarious Inc. have developed a model that trains computers to decipher CAPTCHAs, the jumbles of letters and numbers many websites use to determine whether a user is actually human. Their approach reached 67 percent accuracy while using less training data, according to a recent report in the “Harvard Business Review,” a success rate that may already surpass the ability of humans to decipher the often-confounding images.

Aronchick still holds out hope that needing less data will help the cause of machine learning and data scientists, since today’s scale has made problems difficult to troubleshoot. “It’s not a matter of whether you’re able to process it; you are,” Aronchick said. “But it’s so easy to get lost, to get caught in little anomalies. If you have a petabyte of data, and a megabit of it is causing your model to go sideways, that’s really hard to detect.”

Dual role for cloud providers

Aronchick’s new role with Microsoft represents a return for the open-source technology veteran. The Dartmouth graduate originally worked for the software giant for six years, starting in 2001, handling a variety of project roles.

That was followed by positions at Amazon and Google, where he played key roles in Kubernetes and Kubeflow. The experience has given Aronchick a perspective on the dual role that is evolving among major cloud providers, such as Microsoft Azure and Google Cloud, in providing commercial and open-source contributed tools.

“Much like Kubernetes has both a commercial offering and an open-source offering, I think that all of the major cloud providers will have that kind of duality,” Aronchick said. “They’ll work in open-source and you can measure how many contributions and the number of open-source projects. But then they’ll also have hosted other versions that make it easier for customers to migrate their data and adopt some of these new solutions.”

In December, Sourced Technologies, S.L., a company specializing in machine learning for large-scale code analysis, released a report that documented signs of maturity in the four-year-old Kubernetes project. These included reaching 2 million lines of code and stabilization of the API.

With maturity comes power, and ancillary tools like Kubeflow are only expanding the opportunities driven by Kubernetes, in machine learning deployment and beyond.

“You’re seeing Kubernetes become boring, and that is incredibly powerful,” Aronchick said. “People are building enormous businesses on top of it.”

Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s extensive coverage of KubeCon + CloudNativeCon:

Photo: SiliconANGLE
