UPDATED 20:37 EDT / JUNE 07 2016

NEWS

Spark Summit keynote explores structured streaming, innovation in deep learning | #SparkSummit

Spark Summit 2016 opened today at the Hilton San Francisco Union Square with Matei Zaharia, chief technology officer at Databricks Inc. and creator of Spark, revealing Spark 2.0, the project's largest release to date, which is due out this month. Developers who want a sneak peek can take an unstable preview release for a test drive at spark.apache.org. Zaharia explained that the new release remains highly compatible with Apache Spark 1.x, fixes dependency issues and includes more than 2,000 patches from 280 contributors.

The Spark 2.0 upgrade was guided by three key ideas: a unified engine that supports end-to-end applications; high-level APIs that are easy to use and enable rich optimizations; and broad integration.

“It’s agnostic to the storage system so you can run it on data you have anywhere … and integrates with many libraries,” said Zaharia.

New and improved Spark 2.0

The new version supports structured API improvements around DataFrame, Dataset and SparkSession, along with structured streaming that will allow users to query data in real time. Michael Armbrust, a software engineer at Databricks, later demonstrated these capabilities, showing how easy it is to start with unstructured JSON (JavaScript Object Notation) data, ETL (extract, transform and load) it into a structured table, and then apply the same code to a stream.
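
For readers who want a feel for the kind of workflow Armbrust described, here is a minimal PySpark 2.0 sketch of a batch ETL job built around the new SparkSession entry point; the file path and column names are illustrative assumptions, not code from his demo.

```python
from pyspark.sql import SparkSession

# SparkSession is the single entry point introduced in Spark 2.0,
# subsuming the older SQLContext and HiveContext.
spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Read semi-structured JSON; Spark infers a schema and returns a DataFrame.
events = spark.read.json("events.json")  # hypothetical input file

# A simple filter-and-aggregate step over the structured table.
counts = (events
          .filter(events.signal > 10)
          .groupBy("device")
          .count())

counts.show()
```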

For the broader community, Zaharia explained that Spark 2.0 will also facilitate deep learning libraries, GraphFrames, PyData integration, reactive streams, C# bindings and JavaScript bindings. He then took a deeper dive into the structured APIs, whose optimizer compiles query plans into specialized code and which give Datasets static typing. Also new in Spark 2.0 is whole-stage code generation, which fuses code across multiple operators, along with optimized input/output for Apache Parquet and the built-in cache.
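
The effect of whole-stage code generation can be seen from the query planner itself. Below is a hedged sketch, again with an illustrative path and columns: in Spark 2.0, operators that have been fused into generated code show up with a leading asterisk in the explain() output.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codegen-demo").getOrCreate()

# Parquet reads benefit from the optimized, vectorized I/O path in 2.0.
df = spark.read.parquet("signals.parquet")  # hypothetical Parquet file

# explain() prints the physical plan; stages fused by whole-stage code
# generation appear as *Project, *Filter, *HashAggregate and so on.
df.groupBy("device").sum("count").explain()
```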

High-level improvements

Structured streaming is the newest feature in this version of Spark. “Structured streaming is still very new and experimental,” said Zaharia. Built on the structured engine (DataFrames), the high-level streaming APIs support not only streaming but also interactive and batch queries, enabling continuous applications. In this model a stream is simply an infinite DataFrame, so the same code applies.
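
As a rough illustration of that idea, the batch query sketched earlier can be rewritten as a streaming one with only the input and output changed; the schema, directory and query name below are illustrative assumptions, not code from the keynote.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Streaming JSON sources require an explicit schema up front.
schema = StructType([StructField("device", StringType()),
                     StructField("signal", LongType())])

# readStream treats a directory of JSON files as an unbounded DataFrame.
events = spark.readStream.schema(schema).json("events/")

counts = events.filter(events.signal > 10).groupBy("device").count()

# The same aggregation now runs continuously, maintaining its results
# in an in-memory table that can be queried interactively.
query = (counts.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("device_counts")
         .start())
```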

Keeping pace with the industry, Spark 2.0 has also been upgraded to aid machine learning by allowing users to export models, load them in another program and move them to production. These enhancements span SparkR, MLlib 2.0 and new algorithms aimed at deep learning.
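
A minimal sketch of that export-and-reload workflow in PySpark 2.0 might look like the following; the tiny training set and the save path are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-export-demo").getOrCreate()

# Tiny illustrative training set: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0.1), (1.0, 1.0, -1.0), (1.0, 1.3, 1.0), (0.0, 1.2, -0.5)],
    ["label", "f1", "f2"])

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(train)

# New in Spark 2.0: the fitted pipeline can be saved and reloaded later,
# for example by a separate production scoring job.
model.save("/tmp/lr_pipeline")
reloaded = PipelineModel.load("/tmp/lr_pipeline")
```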

Growing the Apache Spark community

Zaharia identified the skills gap as the biggest challenge in applying Big Data. To help close it, Databricks is offering Community Edition, where developers can find interactive tutorials, Apache Spark and popular data science libraries, plus visualization and debugging tools.

Additionally, in conjunction with the University of California, Berkeley; UCLA; and edX, there is a free five-course series that includes Introduction to Apache Spark, Distributed Machine Learning, Big Data Analysis, Advanced Apache Spark for Data Science and Data Engineering, and Advanced Machine Learning. The company completed the beta in February, and the courses are now available at d.bricks.com/mooc16.

Deep learning with Google

Jeff Dean, senior fellow at Google Inc., spoke next about deep learning using data. He demonstrated how a machine can be taught to learn things never thought possible before, using examples of perceptual data that facilitated image recognition and eliminated the need to tag photos. Deep learning is a powerful class of machine learning, a modern reincarnation of artificial neural networks that uses collections of simple, trainable mathematical functions.

These systems build up layers of abstraction. The concept is loosely based on the human brain: based on what the network sees, it decides what it wants to say. Ultimately, much like neurons in our brains, the simple functions learn to cooperate to accomplish the task. According to Dean, results get better with more data, bigger models and more computation. “Better algorithms, new insights and improved techniques always help too,” he said.

Dean also showcased TensorFlow, an open source software library for numerical computation using data flow graphs. He demonstrated the flexible architecture, which allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. According to Dean, TensorFlow is improving voice recognition and photo searches. To learn more about this open source software, visit http://tensorflow.org.
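
To give a concrete, if toy-sized, sense of what a data flow graph is, here is a hedged sketch using the 2016-era TensorFlow Python API; the matrices and device string are illustrative.

```python
import tensorflow as tf

with tf.device("/cpu:0"):             # swap in "/gpu:0" to target a GPU
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0], [2.0]])
    product = tf.matmul(a, b)          # a node in the data flow graph

# Nothing executes until the graph is run in a session.
with tf.Session() as sess:
    print(sess.run(product))           # [[ 5.], [11.]]
```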

Deep learning trends

Andrew Ng, chief scientist at Baidu Inc., chairman and co-founder of Coursera and associate professor (research) at Stanford University, took to the stage to talk about deep learning trends and how AI will impact teams and industries. He compared large neural networks to the engine driving the trend, with data as the fuel. He spoke about how speech recognition has changed as end-to-end learning replaces the complex hand-built features of earlier speech systems.

His key takeaways were that scale drives AI progress and that learning complex outputs enables end-to-end learning. According to Ng, AI is the new electricity. He also predicted future trends in AI: in the short term, companies will build a centralized AI function and sprinkle it on the existing business; in the long term, AI will be deeply incorporated into the business, and novel business strategies will be built on AI.

The possibilities of production

Lastly, Marvin Theimer, distinguished engineer at Amazon Web Services Inc., delivered his talk “From Prototype to Production,” in which he covered how to bring ideas to market. He listed several qualities that need to be part of any solution: scalability, high availability, maintainability and evolvability. He spoke to the challenges and offered advice on making a prototype efficient and user-friendly.

Visit the Spark Summit 2016 event page for more information, and be sure to check out more of SiliconANGLE and theCUBE’s coverage of Spark Summit 2016.

Photo by HPE
