UPDATED 21:18 EDT / JUNE 08 2016


Why Spark is on fire: a conversation with creator Matei Zaharia

Apache Spark, the open source software framework for distributing computation across clusters of machines, has caught on like wildfire among companies looking to divine insights from their growing masses of data.

Nowhere was that more apparent than at a conference this week in San Francisco centered on the software. Spark Summit West drew a sold-out 2,500 software developers and data scientists, according to host and Spark cloud service provider Databricks. This week, IBM Corp. also reinforced a big commitment to Spark that it made last year, and Microsoft Corp. joined in.

Why the widespread enthusiasm over what is, after all, a highly technical set of software? Partly it’s simply how well it crunches data to provide fast analytics, but also, contended Matei Zaharia, Spark’s creator and chief technology officer at Databricks, because it’s getting easier to use–especially in its latest version, Spark 2.0, which he outlined in his keynote Tuesday.

In an interview with SiliconANGLE at the summit, Zaharia provided a primer on Spark, explained how he hopes to make it accessible to more mainstream business analysts, and gave his view on how open source business models are evolving. This is an edited version of the conversation. (* See disclosure below.)

Q: What kinds of applications did you design Spark for?

A: Spark is an engine for doing computation on a cluster of machines and distributing work across them. The most common use cases are data transformation and “extract, transform and load” (ETL). That’s what needs to be done to prepare data for any interesting application. For example, data comes in in a particular format and we have to look up the customer I.D. and figure out their name using our database.
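To make the lookup step concrete, here is a minimal sketch in plain Python rather than Spark's own API. The record layout, the `customers` table and the `enrich` function are hypothetical, invented only to illustrate the kind of transformation Zaharia describes: data arrives in one format, and each record is enriched with a customer name found by ID.

```python
# Hypothetical customer table standing in for "our database".
customers = {101: "Ada", 102: "Grace"}

# Hypothetical raw events in a comma-delimited format:
# customer ID, timestamp, action.
raw_events = [
    "101,2016-06-07T09:15:00,login",
    "102,2016-06-07T21:40:00,purchase",
    "999,2016-06-07T22:05:00,login",
]


def enrich(line, customers):
    """Parse one raw line and attach the customer name looked up by ID."""
    customer_id, timestamp, action = line.split(",")
    cid = int(customer_id)
    return {
        "customer_id": cid,
        "name": customers.get(cid, "unknown"),  # IDs not in the table
        "timestamp": timestamp,
        "action": action,
    }


# The "transform" step: every raw line becomes an enriched record.
enriched = [enrich(line, customers) for line in raw_events]
```

On a cluster, an engine such as Spark would apply the same kind of per-record function to partitions of the data spread across many machines.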

The other major focus of the Spark research project at UC Berkeley from the beginning was machine learning. MapReduce, the previous very popular computing model for clusters, wasn’t super-good at it.

Q: How is Spark especially useful for machine learning?

A: One is the speed. Spark takes advantage of distributed memory on all the machines, so you can load part of your data into memory and you can interactively search for it or you can update results in real time if you make a streaming application. People do a lot of this basic data preparation before they can apply one of those machine learning models.

The second part that really matters to developers is ease of use. It’s designed to be very high-level. It’s much like Java. One of the great things about Java when it came out was that it had very standard libraries built in for a lot of the things you’d want to do, from displaying an image on the screen to connecting to an Oracle database. It’s the same with Spark but for working on a distributed cluster.

Q: A big focus at the summit was Spark 2.0. Can you outline broadly the new features?

A: One interesting part is the performance: it just goes faster. The second thing is the streaming aspect, what we call Structured Streaming. Imagine you had a static amount of data, say all the data gathered last month, and you want to compute something on it, like a Spark application that takes all the events and counts them up by customer I.D. and the time of day they happened, so you can look for anomalies like unhappy users.

With Structured Streaming, you can describe the computation the same way, but now you can run it incrementally as new data arrives, as a stream. It’s a powerful way to go from a static application to one that can run continuously and can handle data coming in.
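The idea of writing a computation once and running it either over a fixed data set or incrementally can be sketched in plain Python. This is not the Structured Streaming API. The names `key_of`, `batch_counts` and `RunningCounts` are hypothetical, and the sample events are made up. Both versions count events by customer ID and hour of day, as in the example above.

```python
from collections import Counter
from datetime import datetime


def key_of(event):
    """Grouping key shared by both versions: (customer ID, hour of day)."""
    customer_id, timestamp = event
    return (customer_id, datetime.fromisoformat(timestamp).hour)


def batch_counts(events):
    """Static version: count over a complete, fixed set of events."""
    return Counter(key_of(e) for e in events)


class RunningCounts:
    """Incremental version: same grouping, updated as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        self.counts[key_of(event)] += 1


# Made-up events: (customer ID, ISO timestamp).
events = [
    (101, "2016-06-07T09:15:00"),
    (101, "2016-06-07T09:45:00"),
    (102, "2016-06-07T21:40:00"),
]

static_result = batch_counts(events)

running = RunningCounts()
for event in events:  # simulate events arriving one at a time
    running.update(event)
```

Because both paths share `key_of`, the incremental counts match the batch result once the same events have been seen. The difference is that the incremental version has an up-to-date answer after every event.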

Continuous applications

Q: What could people do differently with that capability?

A: A lot of business intelligence analysts, if they made a report once in Tableau or in Databricks using our graphing tools, can turn it into a streaming report and put it on the wall in the office or set up an alert when something bad happens.

That’s the vision. It’s still early on. It doesn’t support all the operations and all the input sources you may want.

Q: You also mentioned something called “continuous applications.” What do you mean?

A: It’s something this new engine supports. Right now there are many stream processing systems out there. There are the Complex Event Processing systems that have existed for a long time from IBM and Microsoft, as well as Apache Storm and Google Cloud Dataflow. All these things are only focused on streaming, so as data is flying by, you compute something on it, like look up a date and fix it to be in your time zone.

But whenever people use it, they want to integrate this streaming engine with other types of systems. So for example, they integrate it with a Web application where someone wants to see a user interface and be able to type in a phrase, like “I want to pick a customer and see all the events about that customer.”

The idea of continuous applications is that the same system should understand both pieces, the interactive query piece and the streaming piece and should be able to join them, instead of having to hook together two or three systems and figure out how to connect them.

Q: How does this idea get implemented in the real world?

A: When you use the streaming interface in Spark, you can view the results of a stream as a table, similar to a database. Anything that knows how to talk to a database, like Tableau and other business intelligence tools, can connect to Spark and see the latest version in the stream. So Spark is both serving the table to these external things and updating it as new data comes in. Before, with the streaming-only systems, you would have to install a database, install a streaming system, get that thing to update the database, and then you have all kinds of challenges with consistency and fault tolerance, like what if it did half the update and then crashed?
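The half-finished update problem can be illustrated with a small plain-Python sketch. It is an illustration of the all-or-nothing idea only, not a description of how Spark implements it. The `LiveTable` class and its methods are hypothetical. Each batch of events is applied to a private copy of the table, and the copy replaces the visible table only if the whole batch succeeds, so a reader never sees a partial update.

```python
class LiveTable:
    """A queryable table of per-customer event counts, fed by a stream."""

    def __init__(self):
        self._rows = {}

    def apply_batch(self, customer_ids):
        """Apply a batch of events all at once, or not at all."""
        staged = dict(self._rows)  # work on a copy, not the visible table
        for customer_id in customer_ids:
            if customer_id is None:
                # Simulates a failure partway through the batch.
                raise ValueError("malformed event")
            staged[customer_id] = staged.get(customer_id, 0) + 1
        self._rows = staged  # publish the finished result in one step

    def query(self, customer_id):
        """What an external tool would see: the latest complete result."""
        return self._rows.get(customer_id, 0)


table = LiveTable()
table.apply_batch([101, 101, 102])
```

If a later call such as `table.apply_batch([101, None])` fails midway, `table.query(101)` still returns the count from the last complete batch, because the staged copy is discarded.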

Democratizing Spark

Q: So many of the breakthroughs we’re seeing in big data seem so conceptual, seemingly independent of physical breakthroughs such as faster silicon or storage that have driven progress in computing. Why are these conceptual breakthroughs coming so fast and furious now?

A: It does actually have to do with hardware trends. In the past, compute kept up with storage, so as you got more data, you just got a faster processor every two years, so you took your fast application and got a new computer. Now, the cost of storage has kept falling and you’d be crazy not to store all the data you can.

But now you need large clusters of many servers to process that data; you can’t do it on one anymore, even with multi-core processors. It’s not keeping up with storage. All this great data analysis software for a single machine doesn’t work when you use multiple machines. That’s why people are building all sorts of distributed engines such as Spark.

The second reason you see a lot more action in this space lately is because the users have changed pretty dramatically. At the beginning, when people first heard about MapReduce from Google, the users of these systems were largely software engineers at Web companies, so they knew Java or C++, pretty hard-core systems engineers. But now, we have many non-programmers such as the data scientists or even the business analysts using it, so they need higher-level access.

Q: Spark still doesn’t look that easy to many companies. How are you trying to change that?

A: When we started the company, we decided to focus 100 percent on cloud as just a better way to deliver this kind of software. Literally within an hour of getting set up, you can connect it to your data and start getting some results. We also have pretty good integration with business intelligence tools like Tableau or Qlik. Once you’ve prepared the data for those, you can drag and drop and plot graphs and so on.

Q: You talked about a skills gap. Where is the gap?

A: Among users, and administrators as well. Our main approach is to have a lot of free training materials. We also have a lot of case studies. And we are starting to have more events focused on management, providing best practices.

Q: With improvements coming so quickly, how do you keep Spark relevant? Or is it inevitable that new technologies quickly supplant existing ones like Spark is doing to Hadoop/MapReduce?

A: It’s very unlikely that you’ll see a large sequence of these [technologies like Spark]. It’s kind of an anomaly that there were so many of them. Usually that doesn’t happen especially with open source software. It’s in everyone’s interest to have a common platform to build on because there’s a network effect. The more applications that run on Spark, the more valuable it is for vendors that offer Spark environments; likewise, the more vendors, the more applications there will be. And if you’re a user, you’re mixing together many applications and you want only one kind of infrastructure to manage.

If you look at other open source software such as operating systems, there’s really only Linux. It runs anywhere from a watch to a giant supercomputer. There’s only one thing to learn and be an expert in. Same with open source databases; there’s only MySQL and Postgres. I hope the same thing will happen with these distributed computing things.

Open source business models

Q: But you’re still seeing lots of new Apache projects in big data. Is that going to slow down at some point?

A: You also see a ton of projects on Spark or built around it. There are many groups that do R&D in distributed computing. But in terms of a unified platform, Spark is by far the largest one.

Q: How do you see open source business models evolving? There seems to be wide disagreement on the best model.

A: There’s temptation for vendors to have their open source software be a toy version and then you get the real one from them. That is bad for the community. Ideally you have something where many people want to contribute and work together, similar to Linux where Intel is happy to contribute because they make new hardware and they want to make sure the software can use it. Red Hat is happy to contribute because they sell Linux.

The approach we’ve taken is a pretty clear delineation. The project out of the box has all the power, all the libraries so you can do a lot of data analysis, on one to thousands of machines. We provide all the tools you need to manage it in an enterprise setting. It’s not just software, it’s also services. We actually operate it for you, keep it up 24/7 and so on. That’s not something you can download in a .zip file.

Q: But you’re not providing what you’d call support, as some open source companies do.

A: Support traditionally is you have someone you can call on the phone if it breaks. One issue with that is you still have to operate it yourself, so you have to get IT staff to do that. Also, every support vendor’s goal is to minimize the number of support calls, so there’s an inherent conflict. And if a market is big enough, it’s easy for anyone to come in, gain expertise and provide cheaper support in another country. It’s not an amazing business model.

With software as a service, the way we do it, we operate it for you, like Amazon, as a managed service. You don’t have to call us at all. It’s kind of like how the power company delivers electricity to you. They don’t need to hire 10 more people each time they get 10 more customers.

The main risk for us is, are we at the right point in the cloud adoption cycle.

Q: Where are we in that cycle? There’s a lot of argument about how much work to move to the public cloud and how quickly.

A: It’s hard to know for sure. But there are enough companies using Amazon Web Services, where we can put our software, to make a pretty viable market.

What’s next for Spark

Q: Where do you want to see Spark going next?

A: The goal is to have this standard library for doing parallel processing. We want to expand the libraries to make it easier to use. We also want to keep up with hardware platforms, such as GPUs (Graphics Processing Units, increasingly used for machine learning applications).

Q: You were at UC Berkeley’s AMPLab, where you developed Spark, and the lab has produced many other open source software innovations. What’s the secret of its success?

A: It was a great set of people. Most of those people are now at Databricks. It was also in the right place at the right time. We were one of the first academic labs to look at both Hadoop and MapReduce and to look at cloud computing. My first research paper as a grad student was we took Hadoop and ran it on this thing called Amazon EC2 that was new at the time. Our goal was a measurement system where you could instrument a distributed system and try to find bugs in it using this advanced tracing mechanism.

But it didn’t really work well because it was a multi-tenant environment with machines having different numbers of users on them, some heavily loaded and some not. Basically it ran as slow as the slowest machine. Also, the first time we ran it, we got a call from Amazon saying, Hey, you’re kind of taking down our network, what are you running? So we got to work with early users of Hadoop and cloud and see some of the problems, and figure out if there was a common solution across these problems.

Q: Google, Microsoft and other large companies seem to be talking the open source game a lot lately. Are they also walking the walk?

A: Early on, Google viewed their computing infrastructure as a competitive advantage. Sometimes they’d post a white paper with just hints of how something worked and how cool it was. What’s happening now is that the infrastructure outside is actually quite good, and everyone else who’s not a Google employee is using this thing that they’re not using. So it’s almost a disadvantage to have a custom infrastructure. It would be great if you could take a class on Hadoop and Spark in school, and go to Google and just use them.

Q: So we’re at a tipping point for open source? Even big enterprises such as Goldman Sachs are contributing their own software.

A: Same with companies like Bloomberg. They’re extremely tired of having their Oracle price increase every year. Basically 20 years ago they built something on Oracle, and now Oracle says, you’ve got this much extra revenue in your company, we’re going to bump up your license [fees].

Companies want something that in the next 20 years gives them some degree of control, so if push comes to shove, they can manage and develop and extend the software themselves. They don’t have to buy an expensive license to fix a small bug.

* Disclosure: TheCUBE, owned by the same company as SiliconANGLE, was the paid media partner at Spark Summit West. This interview was conducted independently and neither Databricks nor other summit sponsors have editorial influence on SiliconANGLE content.

Photo by Robert Hof
