UPDATED 14:51 EST / FEBRUARY 09 2017

BIG DATA

Data on the move: getting the most value out of the numbers | #SparkSummit

While most organizations understand the inherent value of big data — the more data, the better — there can be issues around managing and moving that data. The true value comes from the analysis of the data, not from static data itself. Many are leaning on Apache Spark (an open-source cluster computing framework) to reduce data management complexity, according to Bryan Duxbury (pictured), vice president of engineering at StreamSets Inc.

“We’re seeing a lot of interest in the Spark arena. People want to add their complex event processing or their aggregation and analysis, like Spark SQL [Apache Spark’s module for working with structured data],” Duxbury said.

He explained that these customers are looking for continuous workloads and moving away from batch. Customers are seeking analytics occurring almost simultaneously at the time of ingest, he said. To help with that, StreamSets is building integration via their Spark processor, making it possible to do the ingest and capture real-time analytics along the way.

Duxbury recently joined Dave Vellante (@dvellante) and George Gilbert (@ggilbert41), co-hosts of theCUBE, SiliconANGLE Media’s mobile live streaming studio, during Spark Summit East 2017 Boston, held in Boston, MA. (*Disclosure below.)

The topic of discussion included how data movement software maximizes the value of data, including the use of Spark, and why Duxbury believes it’s better for organizations to buy than to build solutions.

Building a data pipeline without code

While many companies will build their own internal tools to move their data, and make it a science project of sorts, there’s better ways to allocate time and resources. “It’s not their job to build a world-class data movement tool; it’s their job to make the data valuable,” said Duxbury.

One of the advantages of StreamSets’ Data Collector software, according to Duxbury, is it allows users to build a data pipeline without code; it’s a graphical user interface (GUI). The software is heavy-duty and open source, made to integrate easily with other products, including Apache Kafka (an open-source stream processing platform) and Spark.

StreamSets’ Data Collector deploys every way imaginable, on-prem, in the cloud or on the edge of clusters. It focuses on the initial movement and ingestion of the data and then lets the analytical tools, such as Spark, take over and provide the business value to the data. For large scale deployments, the company offers StreamSets Dataflow Performance Manager as a way to manage the dozens or hundreds of Data Collectors including a live data map of the data flow topologies and enforcement of Data SLAs.

Watch the complete video interview below, and be sure to check out more of SiliconANGLE and theCUBE’s coverage of the Spark Summit East 2017 Boston(*Disclosure: TheCUBE is a media partner at the conference. Neither Databricks nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)

Photo by SiliconANGLE

A message from John Furrier, co-founder of SiliconANGLE:

Your vote of support is important to us and it helps us keep the content FREE.

One click below supports our mission to provide free, deep, and relevant content.  

Join our community on YouTube

Join the community that includes more than 15,000 #CubeAlumni experts, including Amazon.com CEO Andy Jassy, Dell Technologies founder and CEO Michael Dell, Intel CEO Pat Gelsinger, and many more luminaries and experts.

“TheCUBE is an important partner to the industry. You guys really are a part of our events and we really appreciate you coming and I know people appreciate the content you create as well” – Andy Jassy

THANK YOU