UPDATED 11:00 EDT / APRIL 06 2016

NEWS

High-performance woman working on making Spark simple | #WomenInTech

by Marlene Den Bleyker

Big Data and open-source platforms such as Apache Spark still present many hurdles when it comes to simplifying them for the enterprise user. In this week’s SiliconANGLE Women in Tech Wednesday, we shine the spotlight on Holden Karau, principal software engineer of Big Data at IBM and coauthor of both Learning Spark: Lightning-Fast Big Data Analysis and High Performance Spark (Early Release, Raw and Unedited).

Karau spoke with John Furrier (@furrier) and George Gilbert (@ggilbert41), cohosts of theCUBE, from the SiliconANGLE Media team, at the BigDataSV 2016 event held in San Jose, California. The interview focused on the future of Spark, machine learning and Karau’s newly released book.

The future of Spark

Asked by Furrier what’s new with Spark, Karau was excited to discuss the upcoming release.

“I think there are a lot of really exciting things happening with Spark. And the big thing is Spark 2.0 is this year, and that’s what’s really exciting because it is an opportunity to get rid of some of the dead weight, some of the things that have built up … and some of the new exciting things like going from the RDD [Resilient Distributed Datasets] model to the dataset model and allowing people to mix functional and relational queries together really, really easily and bring their expertise together. So maybe the more traditional business analyst can more easily work with Spark and then have that work ‘productionized’ by traditional data engineers.”

Heavy lifting for the enterprise

Expanding further on adopting Spark in the enterprise, Karau noted:

“With any technology your first version is great, it’s wonderful, very lean; it doesn’t have a lot of security or other things like that. And as time goes on you have to add all of these things to make it a really good enterprise product. And that’s part of where companies like IBM come in, adding the things that the enterprise needs so they can adopt it. And Spark SQL is also opening up the kinds of people that can make Spark programs … there’s tons of business analysts who, if I asked them to write Scala code, would be like, ‘Oh no, that’s quite alright.’ But I don’t have time to help all those business analysts write; so it’s really powerful to be able to give them the tools that they are used to working with and being able to work with really large-scale data at the speeds that Spark is. So Spark SQL is really great there.”

IBM contributes to Spark

When asked about IBM’s contribution to open source, Karau explained:

“I work at the Spark Technology Center in San Francisco, and we’re focused on just open-source Spark … and if you look at where the contributions are coming, you can see that a lot of them from IBM have especially been focused on Spark SQL. And after that, the machine learning libraries are the next area of focus in terms of the number of contributions that we’ve been getting into Spark.”

Better machine learning: Is it rocket science?

Gilbert asked Karau about the rocket science of machine learning, and she responded:

“Different organizations are in very different places … but I think a lot of people are sort of at the data science on data science stage where they’re collecting all of these metrics, they have all these analysts, and then they realize maybe some of the stuff they are doing could be useful for each other. And so they start to do meta-analysis to figure out where their data is coming from, which pieces can be shared between the organization and how to be good in this department. As far as automating the models, we’re there-ish I guess would be the expression … but it is not a thing that a lot of people are doing right now. … But, essentially, we’re at the point where some people are sort of in that phase, but I would say most people aren’t really doing machine learning on their machine learning themselves.

“I think we’ll see a lot more people using machine learning to do a lot of their tuning for their machine-learning models much sooner than five years, maybe two, but I’m also an optimist, so who knows.”

High-performance Spark

What’s next for Karau?

“I’m working on High Performance Spark with my coauthor Rachel [Warren], who is wonderful, and we just got an early release of the first four chapters are out as of this week.”

Those four chapters include:

Introduction to High-Performance Spark
How Spark Works
DataFrames, Datasets and Spark SQL
Joins (SQL and Core)

Watch the video below to learn more about Spark and machine learning from our pick for Women in Technology Wednesday this week, Holden Karau.
