

“We are trying to make it easy to plug these models into the same interface …”
“We have focused on making it easy for people to contribute to Spark …”
“Just compute the thing you need and get out of there …”
There’s a common theme to theCUBE’s conversation with Matei Zaharia, CTO of Databricks, Inc. and creator of Apache Spark: making life easier for data scientists. But rather than adding to the pool of existing solutions, Zaharia is focused on more complex processing to solve problems that do not have existing solutions.
In an interview at Spark Summit East 2016 at the New York Hilton Midtown in NYC, Zaharia talked with Jeff Frick and George Gilbert, cohosts of theCUBE , from the SiliconANGLE Media team.
“Everyone wants real-time,” stated Gilbert, as he invited Zaharia to clarify the confusion between real-time and streaming. Zaharia described the differences in detail, before the topic moved specifically to Spark streaming and the internal operations of the engine, which is designed for high-volume batch operations.
“We design our roadmap and engine based on what people want,” Zaharia said, and most use cases do not require both huge volume and true real-time. The latency of Spark – a few hundreds of milliseconds – is considered real time for most operations.
In this in depth discussion, Zaharia – the rockstar of Spark Summit East – shared his knowledge and opinions on Spark SQL [Spark’s module for working with structured data], Catalyst [a query optimization framework for Spark], DataFrames [a distributed collection of data organized into named columns], and machine learning systems, as well as the exponential growth of the Spark community, new solutions from Databricks, and the future of Big Data.
Watch the full video interview below, and be sure to check out more of SiliconANGLE and theCUBE’s coverage of Spark Summit East 2016. Also join in on the conversation by CrowdChatting with theCUBE hosts.
THANK YOU