

Most organizations understand the inherent value of big data (the more data, the better), but managing and moving that data can be difficult. The true value comes from analyzing the data, not from the static data itself. Many are leaning on Apache Spark (an open-source cluster computing framework) to reduce data management complexity, according to Bryan Duxbury, vice president of engineering at StreamSets Inc.
“We’re seeing a lot of interest in the Spark arena. People want to add their complex event processing or their aggregation and analysis, like Spark SQL [Apache Spark’s module for working with structured data],” Duxbury said.
He explained that these customers are moving away from batch processing toward continuous workloads and want analytics to run almost simultaneously with ingest. To help with that, StreamSets is building an integration through its Spark processor, making it possible to ingest data and capture real-time analytics along the way.
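To make the contrast with batch concrete, here is a minimal sketch in plain Python of computing an aggregate while records are being ingested rather than after they have all landed. It is not StreamSets or Spark code; the function name `ingest_with_running_counts` and the `source` field are assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Record = Dict[str, str]


def ingest_with_running_counts(
    records: Iterable[Record],
) -> Iterator[Tuple[Record, Dict[str, int]]]:
    """Pass each record through and emit an up-to-date count per source.

    Hypothetical helper: the aggregate is refreshed as every record
    arrives, instead of in a separate batch job run after ingest.
    """
    counts: Dict[str, int] = defaultdict(int)
    for record in records:
        counts[record["source"]] += 1
        # Yield a snapshot so later updates do not change earlier results.
        yield record, dict(counts)


# Hypothetical event stream; in practice this would be an unbounded source.
events = [{"source": "web"}, {"source": "app"}, {"source": "web"}]
results = list(ingest_with_running_counts(events))
```

Each element of `results` pairs the ingested record with the counts as they stood at that moment, which is the "analytics along the way" idea in miniature.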
Duxbury recently joined Dave Vellante (@dvellante) and George Gilbert (@ggilbert41), co-hosts of theCUBE, SiliconANGLE Media’s mobile live streaming studio, during Spark Summit East 2017 in Boston. (*Disclosure below.)
They discussed how data movement software maximizes the value of data, including the use of Spark, and why Duxbury believes it’s better for organizations to buy solutions than to build them.
While many companies will build their own internal tools to move their data, and make it a science project of sorts, there are better ways to allocate time and resources. “It’s not their job to build a world-class data movement tool; it’s their job to make the data valuable,” said Duxbury.
One of the advantages of StreamSets’ Data Collector software, according to Duxbury, is that users can build a data pipeline through a graphical user interface (GUI) without writing code. The software is heavy-duty and open source, and is made to integrate easily with other products, including Apache Kafka (an open-source stream processing platform) and Spark.
StreamSets’ Data Collector can be deployed on-premises, in the cloud or on the edge of clusters. It focuses on the initial movement and ingestion of the data and then lets analytical tools such as Spark take over and extract business value from the data. For large-scale deployments, the company offers StreamSets Dataflow Performance Manager to manage dozens or hundreds of Data Collectors; it includes a live data map of the data flow topologies and enforcement of data SLAs.
(*Disclosure: TheCUBE is a media partner at the conference. Neither Databricks nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)