UPDATED 10:00 EDT / AUGUST 19 2016

Spark innovation: Catalyst Optimizer simplifies complicated queries | #GuestOfTheWeek

Databricks Inc. is bringing Apache Spark to the enterprise, and Michael Armbrust, lead developer of Spark SQL at Databricks, created Spark SQL’s Catalyst Optimizer, an extensible query optimizer that is a fundamental piece of Spark. The Catalyst Optimizer enables much of the functionality in Spark and supports Spark’s new DataFrame API, which makes Big Data processing simpler and accessible to a wider range of users.

Armbrust joined George Gilbert, host of theCUBE, from the SiliconANGLE Media team, for two segments at Innovation Day at Databricks in San Francisco to discuss the Catalyst Optimizer and how it supports the new DataFrame API, as well as database integration scenarios. Armbrust is theCUBE’s Guest of the Week.

A cool new tool

Gilbert began the interview by asking for in-depth information about the Catalyst Optimizer and its core value to Spark. Armbrust provided a comprehensive look at the optimizer and explained how it simplifies complicated queries.

“Catalyst started initially as this kind of ‘out there’ research project. I was working on another relational database, and I realized that the query optimizer, while a very important part of the system, was very rigid and very difficult to change. I was trying to implement some of my thesis research on top of it, and I took a step back and said, ‘Wait a second; why is this so difficult to add new concepts into the query optimizer? Maybe there’s an easier way to do it.’ So we looked at a bunch of research on composable query optimizers and decided to try and build it. Once we had this really cool tool, applying it to Spark was the next logical step. … We took this next step in giving Spark its own query optimizer, and that was really how Catalyst began.

“I think a key part of this is that it also comes along with a logical language for describing what kind of dataflow you are trying to establish. So you can actually say things like, ‘I want to do a join. I want to do an aggregation,’ and this is a higher level than the RDD [Resilient Distributed Dataset] API, which is more like individual code operators that you are implementing yourself. The key difference here is what we like to call ‘declarative programming,’ where you are telling the system what you want to do but not necessarily how to do it. Once you have that language saying what you want to do, now the system has all of this new flexibility for trying different ways to accomplish that; [you] can actually explore the space and find the best way to do it.”
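To make that contrast concrete, here is a minimal sketch of the declarative style in Scala, assuming Spark 2.x; the datasets and column names are illustrative, not from the interview. The code states the join and aggregation that are wanted, and the explain call prints the plans Catalyst chose for how to run them:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("catalyst-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Two small illustrative datasets.
val users  = Seq((1, "us"), (2, "eu")).toDF("userId", "region")
val events = Seq((1, 10L), (1, 5L), (2, 7L)).toDF("userId", "bytes")

// Declarative: say WHAT you want -- a join and an aggregation...
val totals = events.join(users, "userId")
  .groupBy($"region")
  .agg(sum($"bytes").as("totalBytes"))

// ...and let Catalyst decide HOW: join strategy, pushdowns, code generation.
totals.explain(true) // prints the logical, optimized and physical plans
```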

The ‘skinny waist’ of expressing data

Gilbert and Armbrust then moved on to a deep dive into the DataFrame API and structured streaming. Armbrust provided an informative lesson on what makes structured streams so powerful.

“Data frames [are] like the skinny waist of expressing data transformations inside of Spark. … A little bit of history: So there was another project before structured streaming called Spark Streaming, which used this thing called the DStream API, very similar to RDDs, where you express how you want the streaming computation to proceed. After hearing a lot of feedback from users [about how] they liked the power of this API but they didn’t like having to express these complicated concepts on their own, adding data frames to streaming was kind of a natural next step.

“Once you’ve answered the question once, you want to answer the question over and over in real time, incrementally. Once you had this high-level way of expressing things that’s very general [and] lets you go back and forth between declarative and functional programming, you want to use it in many different places, not just for batch queries.”
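The “write it once, use it everywhere” idea he describes can be sketched as follows, assuming Spark 2.x structured streaming; the input file, host and port are placeholders for illustration. The same declarative transformation answers the question once over a batch, then over and over, incrementally, on a stream:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("unified-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// One declarative transformation, written once.
def wordCounts(lines: Dataset[String]) =
  lines.select(explode(split($"value", " ")).as("word")).groupBy("word").count()

// Batch: answer the question once over a static file.
val batchAnswer = wordCounts(spark.read.textFile("input.txt"))

// Streaming: answer it over and over as new lines arrive on a socket.
val streamLines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", "9999")
  .load().as[String]
val query = wordCounts(streamLines).writeStream
  .outputMode("complete").format("console").start()
```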

The best possible answer

Gilbert pointed out that streaming Big Data in real time doesn’t always give you an exact answer. He asked what the trade-off is in making data frames and streaming accessible to people who only understood batch processing. Armbrust described how the team adjusted its approach.

“We spent a lot of time thinking about the model. Traditional streaming systems often make you actually think about streaming, think about late data, think about how you are going to adjust things as data arrives out of sequence. With the DataFrame API, we wanted to take a slightly different tack.

“We came up with this notion of triggers, and the way to think about it is you have an ever-growing table as data arrives. You execute a query at certain trigger intervals, and we give you the best answer as of the time the trigger fires. But the best thing about this is that in five minutes you have another trigger fire, and now you have a new, up-to-date answer that is logically consistent.”
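In code, the trigger model looks roughly like the sketch below, assuming Spark 2.2 or later (where the Trigger class was introduced); the built-in rate source stands in for a real stream. Each firing re-runs the query over everything seen so far and emits a fresh, consistent answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("trigger-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// The stream behaves like an ever-growing table of rows.
val rates = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// Re-run the aggregation every five minutes; each trigger firing produces
// the best answer as of that moment, consistent over all input seen so far.
val query = rates.groupBy(($"value" % 10).as("bucket")).count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start()
```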

Integration over time

Gilbert inquired about how structured streaming will roll out across the product over time and integrate with all the different APIs. According to Armbrust, a great deal of thought was put into that question.

“In 2.0, our primary focus was putting a lot of thought into this model and the API we were presenting to users. We wanted to make sure there was going to be a solid foundation to build on so when people started building applications they wouldn’t have to rewrite them with each new release of Spark. Moving forward, I think you are going to see a lot of progress in a couple of areas. One is integrating with different sources. Right now we can read files from HDFS. We want tight integration with S3, Kinesis, Kafka and many of these other kinds of things in the streaming ecosystem.

“Another thing we are working on is integration with other parts of Spark, in particular machine learning. So Spark ML is great if you have a batch of data and you want to train a model on that batch of data. But usually you want to incrementally improve the model as new data arrives. I’m working on building model updating into the structured streaming framework.”
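Of the source integrations he mentions, Kafka is a representative example. Here is a minimal sketch, assuming the spark-sql-kafka-0-10 connector that shipped with later Spark 2.x releases is on the classpath; the broker address and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-sketch").master("local[*]").getOrCreate()

// Subscribe to a Kafka topic as a structured stream.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Kafka records arrive as binary key/value columns; cast them to strings to work with them.
val messages = kafkaStream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```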
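On the machine learning side, the batch workflow he contrasts with looks roughly like this sketch, assuming Spark ML 2.x; the training file and its contents are hypothetical. The incremental, streaming model update he describes was still work in progress at the time:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ml-sketch").master("local[*]").getOrCreate()

// Batch training: fit once on a static dataset with label/features columns.
val training = spark.read.format("libsvm").load("training.libsvm")
val model = new LogisticRegression().setMaxIter(10).fit(training)

// The structured streaming goal: refresh this model as new labeled data
// arrives, rather than re-running the whole batch job from scratch.
```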

Watch the complete video interviews below, and be sure to check out more of SiliconANGLE and theCUBE’s coverage of Innovation Day at Databricks.

Photo by SiliconANGLE
