UPDATED 11:00 EDT / APRIL 06 2016


High-performance woman working on making Spark simple | #WomenInTech

Big Data and open-source platforms such as Apache Spark still present many hurdles when it comes to simplifying them for the enterprise user. In this week’s SiliconANGLE Women in Tech Wednesday, we shine the spotlight on Holden Karau, principal software engineer of Big Data at IBM and coauthor of both Learning Spark: Lightning-Fast Big Data Analysis and High Performance Spark (Early Release, Raw and Unedited).

Karau spoke with John Furrier (@furrier) and George Gilbert (@ggilbert41), cohosts of theCUBE, from the SiliconANGLE Media team, at the BigDataSV 2016 event held in San Jose, California. The interview focused on the future of Spark, machine learning and Karau’s newly released book.

The future of Spark

Responding to Furrier, who wanted to know what’s new with Spark, Karau was excited to discuss the next major version of Spark, which is on the way.

“I think there are a lot of really exciting things happening with Spark. And the big thing is Spark 2.0 is this year, and that’s what’s really exciting because it is an opportunity to get rid of some of the dead weight, some of the things that have built up … and some of the new exciting things like going from the RDD [Resilient Distributed Datasets] model to the dataset model and allowing people to mix functional and relational queries together really, really easily and bring their expertise together. So maybe the more traditional business analyst can more easily work with Spark and then have that work ‘productionized’ by traditional data engineers.”

Heavy lifting for the enterprise

Expanding further on adopting Spark in the enterprise, Karau noted:

“With any technology your first version is great, it’s wonderful, very lean; it doesn’t have a lot of security or other things like that. And as time goes on you have to add all of these things to make it a really good enterprise product. And that’s part of where companies like IBM come in, adding the things that the enterprise needs so they can adopt it. And Spark SQL is also opening up the kinds of people that can make Spark programs … there’s tons of business analysts who, if I asked them to write Scala code, would be like, ‘Oh no, that’s quite alright.’ But I don’t have time to help all those business analysts write; so it’s really powerful to be able to give them the tools that [they] are used to working with and being able to work with really large-scale data at the speeds that Spark is. So Spark SQL is really great there.”

IBM contributes to Spark

When asked about IBM’s contribution to open source, Karau explained:

“I work at the Spark Technology Center in San Francisco, and we’re focused on just open-source Spark … and if you look at where the contributions are coming, you can see that a lot of them from IBM have especially been focused on Spark SQL. And after that, the machine learning libraries are the next area of focus in terms of the number of contributions that we’ve been getting into Spark.”

Better machine learning: Is it rocket science?

Gilbert asked Karau about the rocket science of machine learning, and she responded:

“Different organizations are in very different places … but I think a lot of people are sort of at the data science on data science stage where they’re collecting all of these metrics, they have all these analysts, and then they realize maybe some of the stuff they are doing could be useful for each other. And so they start to do meta-analysis to figure out where their data is coming from, which pieces can be shared between the organization and how to be good in this department. As far as automating the models, we’re there-ish I guess would be the expression … but it is not a thing that a lot of people are doing right now. … But, essentially, we’re at the point where some people are sort of in that phase, but I would say most people aren’t really doing machine learning on their machine learning themselves.

“I think we’ll see a lot more people using machine learning to do a lot of their tuning for their machine-learning models much sooner than five years, maybe two, but I’m also an optimist, so who knows.”

High-performance Spark

What’s next for Karau?

“I’m working on High Performance Spark with my coauthor Rachel [Warren], who is wonderful, and we just got an early release of the first four chapters out as of this week.”

Those four chapters include:

  • Introduction to High-Performance Spark
  • How Spark Works
  • DataFrames, Datasets and Spark SQL
  • Joins (SQL and Core)

Watch the video below to learn more about Spark and machine learning from our pick for Women in Technology Wednesday this week, Holden Karau.

Photo by SiliconANGLE
