UPDATED 09:47 EDT / DECEMBER 19 2018

BIG DATA

Attacking the 95% of machine learning that’s grunt work

The amount of labor that goes into machine learning is pretty daunting. And despite the obstacles tackled by open source contributions, some of the most hyped machine learning frameworks merely skim the surface of the work to be done. Does a technology exist that can collapse the sprawling processes of machine learning, from data ingestion to training to edge inferencing?

Today, there is growing focus on choosing the right machine learning framework, according to David Aronchick (pictured), head of open-source machine learning strategy at Microsoft Corp. Frameworks under consideration include TensorFlow, Microsoft Cognitive Toolkit and Apache MXNet, to name a few. They’re far from useless, but they may not yet warrant all the attention they get.

“The reality is, when you look at the overall landscape, that’s just 5 percent of the work that the average data scientist goes through,” Aronchick said. The remaining 95 percent is a big pile of rusty nuts and bolts that should be abstracted away already, he added.

That is the aim of Kubeflow, an open-source project for deploying and managing a machine learning stack on Kubernetes, the open-source platform for orchestrating containers (the virtualized units in which distributed applications run).
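
For a sense of what that abstraction looks like in code, the following is a minimal sketch (not drawn from the interview) of a two-step Kubeflow Pipelines workflow, assuming the kfp v1 Python SDK and hypothetical container images:

    # Minimal Kubeflow Pipelines sketch: a two-step train-then-serve workflow.
    # The container images below are hypothetical placeholders.
    from kfp import compiler, dsl

    @dsl.pipeline(name="train-and-serve", description="Toy ML workflow on Kubernetes")
    def train_and_serve(learning_rate: float = 0.01):
        # Each step runs as its own container (a pod) scheduled by Kubernetes.
        train = dsl.ContainerOp(
            name="train",
            image="gcr.io/example/train:latest",           # hypothetical image
            arguments=["--learning-rate", learning_rate],
            file_outputs={"model": "/tmp/model.txt"},
        )
        dsl.ContainerOp(
            name="serve",
            image="gcr.io/example/serve:latest",           # hypothetical image
            arguments=["--model", train.outputs["model"]],
        )

    if __name__ == "__main__":
        # Compile the pipeline into a workflow spec the cluster can execute.
        compiler.Compiler().compile(train_and_serve, "train_and_serve.yaml")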

Aronchick spoke with John Furrier and Stu Miniman, co-hosts of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the recent KubeCon + CloudNativeCon conference in Seattle. They discussed what’s cooking in open source and academia to shorten machine learning cycle times.

Kubernetes gives data scientists long-awaited abstraction layer

The grunt work we ask data scientists to do today would shock a lot of people in more abstracted areas of information technology. “We’re asking data scientists, ML engineers, to think about how to provision pods, how to work on drivers, how to do all these very, very low-level things,” Aronchick said.

Aronchick believes academic researchers will discover ways to reduce the amount of data and labor needed to train models. However, this may not solve all data-transport issues. Operations across multicloud environments call for Kubernetes’ abstraction layer, he added.

“The reality is, you can’t beat the speed of light,” he said. “If I have a petabyte of data here, it’s going to take a long time to move it over there. I think you’re ultimately going to have models and training and inference move to many, many different locations.”

Kubernetes and Kubeflow offer a high-level abstraction, so a data scientist can work on a model, see how it works, hit a button, and provision it on all the machines necessary.
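
The “hit a button” step might look something like the sketch below, which assumes a Kubeflow Pipelines endpoint at a hypothetical local address and reuses the pipeline compiled in the earlier example:

    # Submit the compiled pipeline; Kubeflow then provisions the pods it needs.
    import kfp

    client = kfp.Client(host="http://localhost:8080")   # hypothetical KFP endpoint
    experiment = client.create_experiment("demo")
    client.run_pipeline(
        experiment_id=experiment.id,
        job_name="train-and-serve-run",
        pipeline_package_path="train_and_serve.yaml",
        params={"learning_rate": 0.05},
    )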

No, Kubernetes doesn’t spread an application across Azure, Google Cloud Platform and Amazon Web Services Inc. like cream cheese. “What you really want to do is have isolated deployments to each place that enables you, in a single button, to deploy to all three of these locations,” Aronchick said.
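
As a rough illustration of that idea, the sketch below uses the official Kubernetes Python client to create an isolated copy of the same inference deployment in each of three clusters; the kubeconfig context names and the container image are assumptions, not anything Aronchick described:

    # Apply the same inference Deployment to several clusters, one isolated copy each.
    from kubernetes import client, config

    CONTEXTS = ["aks-prod", "gke-prod", "eks-prod"]  # hypothetical kubeconfig contexts

    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="model-inference"),
        spec=client.V1DeploymentSpec(
            replicas=2,
            selector=client.V1LabelSelector(match_labels={"app": "model-inference"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "model-inference"}),
                spec=client.V1PodSpec(containers=[
                    client.V1Container(
                        name="inference",
                        image="gcr.io/example/inference:latest",  # hypothetical image
                        ports=[client.V1ContainerPort(container_port=8080)],
                    )
                ]),
            ),
        ),
    )

    for ctx in CONTEXTS:
        # Each cluster gets its own isolated deployment, created from one script.
        api = client.AppsV1Api(config.new_client_from_config(context=ctx))
        api.create_namespaced_deployment(namespace="default", body=deployment)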

Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s extensive coverage of KubeCon + CloudNativeCon:

Photo: SiliconANGLE
