UPDATED 08:00 EDT / JUNE 15, 2015

NEWS

Databricks updates Spark with support for R and Python 3

Databricks has announced a major update to Apache Spark, the popular cluster computing framework for data analytics, adding support for the R statistical programming language in an effort to make life easier for data scientists.

As well as adding support for Python 3, Apache Spark 1.4 lets R users work directly on large datasets via the new SparkR API. With more than two million users worldwide, R is one of the most popular programming languages designed specifically for predictive analytics and statistical computing.
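For Python users, the upgrade means existing PySpark programs can now run under a Python 3 interpreter. A minimal sketch of what that looks like (the file path is hypothetical; pointing the PYSPARK_PYTHON environment variable at a Python 3 binary selects the interpreter):

```python
# wordcount.py -- runs under Python 3 when PYSPARK_PYTHON=python3 is set
from pyspark import SparkContext

sc = SparkContext(appName="py3-wordcount")

# Count words in a (hypothetical) text file spread across the cluster.
counts = (sc.textFile("hdfs:///data/corpus.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))  # print() as a function works in both Python 2 and 3
sc.stop()
```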

“Because SparkR uses Spark’s parallel engine underneath, operations take advantage of multiple cores or multiple machines, and can scale to data sizes much larger than standalone R programs,” Patrick Wendell, a software engineer at Databricks, wrote in a blog post.

SparkR is an R package, first developed at UC Berkeley's AMPLab, that provides a frontend from R to Apache Spark. By tapping Spark's distributed computation engine, users can now run large-scale data analysis workloads straight from the R shell, Wendell added.

Besides R support, Spark 1.4 also brings new capabilities to the DataFrame API, including window functions in Spark SQL and in the DataFrame library that let users compute statistics over ranges of rows.
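In PySpark, for instance, the new window support looks roughly like this (the data and column names are invented for illustration):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
from pyspark.sql.window import Window

sc = SparkContext(appName="window-demo")
sqlContext = SQLContext(sc)

# Hypothetical sales data: (category, revenue) pairs.
df = sqlContext.createDataFrame(
    [("books", 100), ("books", 250), ("books", 300),
     ("toys", 80), ("toys", 120)],
    ["category", "revenue"])

# A window covering rows in the same category, ordered by revenue,
# spanning from one row before to one row after the current row.
w = Window.partitionBy("category").orderBy("revenue").rowsBetween(-1, 1)

# Compute a moving average of revenue over that window for each row.
df.select("category", "revenue",
          F.avg("revenue").over(w).alias("moving_avg")).show()
```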

“In addition, we have also implemented many new features for DataFrames, including enriched support for statistics and mathematical functions – random data generation, descriptive statistics and correlations, and contingency tables – as well as functionalities for working with missing data,” Wendell continued.
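Several of those additions can be sketched in PySpark as follows (the DataFrames and column names are invented for illustration):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(appName="df-stats-demo")
sqlContext = SQLContext(sc)

# Random data generation: columns of uniform and normal random numbers.
df = sqlContext.range(0, 1000).select(
    "id",
    F.rand(seed=42).alias("uniform"),
    F.randn(seed=7).alias("normal"))

df.describe().show()                        # descriptive statistics
print(df.stat.corr("uniform", "normal"))    # correlation of two columns

# Contingency tables and missing data on a hypothetical DataFrame.
people = sqlContext.createDataFrame(
    [("alice", "ny"), ("bob", None), ("alice", "sf")], ["name", "city"])
people.stat.crosstab("name", "city").show() # counts for each name/city pair
people.na.fill({"city": "unknown"}).show()  # replace missing values
```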

“To make DataFrame operations execute quickly, this release also ships the initial pieces of Project Tungsten, a broad performance initiative which will be a central theme in Spark’s upcoming 1.5 release. Spark 1.4 adds improvements to serializer memory use and options to enable fast binary aggregations.”

Wendell revealed that the machine learning pipelines API, which was first introduced in Spark 1.2 and allows users to run complex workflows involving multiple steps, is now stable and production-ready. According to Wendell, with the new release the Python API has attained parity with the Java and Scala interfaces. Beyond that, the pipelines gain a range of new feature transformers such as OneHotEncoder, RegexTokenizer and VectorAssembler, plus new algorithms including tree models and linear models.
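Sketched in PySpark, a minimal pipeline built from those pieces might look like this (the training data and column names are invented for illustration):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

sc = SparkContext(appName="pipeline-demo")
sqlContext = SQLContext(sc)

# Invented training data: (text, label) pairs.
training = sqlContext.createDataFrame([
    ("spark is fast", 1.0),
    ("the weather is nice", 0.0),
    ("spark scales out", 1.0),
], ["text", "label"])

# Each stage's output column feeds the next stage's input column.
tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\s+")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# A Pipeline chains the stages into a single multistep workflow.
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)  # fits every stage in one call

# The fitted model applies all stages when scoring new data.
test = sqlContext.createDataFrame([("is spark fast",)], ["text"])
model.transform(test).select("text", "prediction").show()
```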

Spark 1.4 also adds visual debugging and monitoring utilities designed to help users understand how applications are running in Spark. A new application timeline viewer, for example, shows the completion of stages and tasks inside a running application, while another new tool provides a visual representation of the underlying computation graph, tied directly to the metrics of physical execution. The same feature also lets users track the latency and throughput of data streams.

Image credit: ClkerFreeVectorImages via Pixabay.com
