Apache Spark, the open-source analytics engine, is going mainstream in the data-driven enterprise. As prominent industries move into the Internet of Things market and adopt machine learning to capitalize on data, Spark ML — a uniform set of high-level application programming interfaces for building and tuning practical machine learning pipelines — gives companies a way to build real-time streaming solutions that deliver fast, advanced analytics and business-driving insights.
“We are going to be focused on how to use structured streaming for machine learning. I think that is really interesting, because stream learning is something that people want to do but aren’t yet doing in production. So it’s always fun to talk to people before they’ve built their systems,” said Holden Karau (pictured), principal software engineer at IBM Corp.
Karau, a Spark committer and noted authority on the platform, met with Jeff Frick (@JeffFrick) and George Gilbert (@ggilbert41), co-hosts of theCUBE, SiliconANGLE Media's mobile live streaming studio, during the BigData SV event in San Jose, California. (*Disclosure below.)
Machine learning: What is happening at the edge?
IoT and machine learning are consuming the technology industry, and Apache Spark's structured streaming is making an impact in this space. Karau noted, however, that certain aspects of Spark are not meant to be pushed out to the edge.
“Structured streaming for today, latency wise, is probably not something I would use [for IoT and real-time streaming]. It’s in the sub-second range, which is nice, but it’s not what you want for live surveying of decisions — like for your car. It’s just not going to be feasible,” Karau said.
She maintained that there is potential for it to become faster and spoke about renewed interest in MLlib local, an effort to let models trained with Spark's scalable machine learning library be pushed out and applied on edge devices without a cluster.
“I think for these IoT devices, it makes a lot more sense to do the predictions on the device itself,” Karau said.
Karau explained that trained models are only megabytes in size and do not require a cluster to make predictions, so using the cluster to train the models and pushing prediction out to the edge node is a reasonable pattern. Rather than using Spark itself to distribute the model, she recommends other tools for that step.
“Spark is not very well suited to large amounts of internet traffic, but it is well-suited to the training. With MLlib local, it will be able to provide both sides, and the copy part is left to whoever is doing the work,” Karau advised.
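The split Karau describes — heavyweight training off-device, lightweight prediction on-device — can be sketched without any Spark dependency, since the model artifact that ships to the edge is just a small bundle of parameters. The following is a minimal, hypothetical Python illustration (the function names and JSON payload format are assumptions for the sketch, not part of MLlib local's actual API):

```python
import json

# --- "Cluster" side: train a tiny linear model ---
# (stand-in for training a Spark ML pipeline on a cluster)
def train_linear_model(xs, ys):
    """Fit y = a*x + b by ordinary least squares on toy data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return {"slope": slope, "intercept": intercept}

model = train_linear_model([1, 2, 3, 4], [2, 4, 6, 8])

# The exported model is just a few numbers -- kilobytes, not a cluster.
payload = json.dumps(model)  # what would be copied to the edge device

# --- "Edge" side: load the model and predict locally, no cluster needed ---
def predict(serialized_model, x):
    m = json.loads(serialized_model)
    return m["slope"] * x + m["intercept"]

print(predict(payload, 10))  # -> 20.0
```

The "copy part" Karau mentions is the `payload` hand-off in the middle: how those bytes reach the device is left to whatever deployment tooling the team already uses.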
The reason for moving the models to the edge is to improve latency. The question that many people are asking is: Will there be a different programming model at the edge?
“I don’t think the answer is finished yet, but I think the work is being done to make it look the same. … Spark has done a really good job of making things look very similar on single node cases to multi-node cases, and I think we can bring the same things to machine learning,” she said.
At IBM, open-source work on Spark is underway to simplify and improve how other programming languages interoperate with the platform. Karau pointed out that Java is easy to use with Spark, but the aim of the project is to provide more comfortable experiences in other languages to increase adoption.
Predicting that the tools of the future will resemble today's tools but with more options, Karau expects the overall experience to become simpler.
“The main thing that we are lacking right now is good documentation — and good books and good resources for people to figure out how to use these tools,” she said.
Watch the complete video interview below, and be sure to check out more of SiliconANGLE’s and theCUBE’s coverage of BigData SV 2017. (*Disclosure: Some segments on SiliconANGLE Media’s theCUBE are sponsored. Sponsors have no editorial control over content on theCUBE or SiliconANGLE.)