Kicking off today is our live broadcast from Spark Summit East, an event dedicated to the open-source community as Apache Spark tackles the biggest data-wrangling challenges in enterprise information technology. With ongoing enhancements for intelligent automation and code consolidation, Spark is expected to grow its influence in enterprise environments, continuing to close complexity gaps in the Hadoop open-source ecosystem.
Ahead of the conference we heard from George Gilbert, analyst with Wikibon (owned by the same company as SiliconANGLE), on Spark’s market opportunities and remaining obstacles. Below is a recap of Gilbert’s commentary, along with resources for viewing our live broadcast and archived interviews with some of the industry’s leading experts. (* Disclosure below.)
As Internet powerhouses such as Yahoo, eBay and Netflix deploy Spark at massive scale, Spark has seen rapid adoption by enterprises across a wide range of industries. It also claims to be the largest open-source community in big data, with more than 1,000 contributors from more than 250 organizations. The most recent contribution came from Intel: the chipmaker boosted Spark’s deep-learning capabilities by open sourcing BigDL, a distributed library that uses existing Spark clusters to run deep-learning computations while simplifying data loading from large datasets stored in Hadoop.
This dedication to simplification, along with advanced support for speed and security from the Apache Software Foundation’s latest top-level projects, Apache Beam and Apache Eagle, is what makes Spark so appealing, according to Gilbert. Spark’s deep integration among its libraries of code means it can “minimize the number of building blocks needed for developing machine-learning pipelines that would otherwise have to come from multiple vendors,” he explained.
Dual support for transforming data with SQL libraries, using either batch processing or streaming, gives Spark a leg up in readying data for machine-learning programs, he added. So even as Hadoop’s complexities continue to fragment its ecosystem, Spark’s ability to unify processes makes it the choice for engineering at scale.
Calling all third parties
Nevertheless, Spark’s rapid growth brings its own set of challenges if it is to avoid the pitfalls of Hadoop, which grew so quickly that it lacked much top-down guiding architecture, leaving cracks in the ecosystem. Despite a commitment to speed and fresh efforts supporting machine-learning methods, Spark isn’t yet fast enough to do hyper-scale predictions, leaving “developers to convert Spark’s machine-learning models into a language that’s faster, such as C++ or Java,” Gilbert noted. Without a native database to call its own, Spark also leaves gaps in its ecosystem, requiring ongoing integration with third-party services.
Gilbert called out two other areas where Spark will need third-party help. One is the process of ingesting data. With most usage scenarios defaulting to the open-source stream processing platform Apache Kafka, he said, “Spark still needs a fair amount of work under the covers to be able to handle this step.”
The other challenge is real-time data analytics in the Internet of Things era, where edge devices may need a single event analyzed and acted upon immediately. “As Spark’s stream processing can’t analyze one event at a time, Spark could make it difficult to support IoT workloads at the edge,” Gilbert explained.
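To make Gilbert’s point concrete, here is a minimal sketch in plain Python rather than Spark code. It assumes a micro-batch model in which events are handled only when a fixed-length batch window closes, and compares that with handling each event on arrival. The function names and the one-second window are hypothetical, chosen only for illustration.

```python
from typing import List


def micro_batch_waits(arrivals_ms: List[int], interval_ms: int) -> List[int]:
    """Wait time per event when processing happens only at batch boundaries.

    An event arriving at time t is handled when its window closes, i.e. at
    the next multiple of interval_ms strictly after the window start.
    """
    waits = []
    for t in arrivals_ms:
        window_end = (t // interval_ms + 1) * interval_ms
        waits.append(window_end - t)
    return waits


def per_event_waits(arrivals_ms: List[int]) -> List[int]:
    """Wait time per event when each event is handled the moment it arrives."""
    return [0 for _ in arrivals_ms]


if __name__ == "__main__":
    arrivals = [100, 950, 1000]  # event arrival times in milliseconds
    print("micro-batch:", micro_batch_waits(arrivals, interval_ms=1000))
    print("per-event:  ", per_event_waits(arrivals))
```

In this toy model, an event that lands early in a window waits almost the full interval before anything can act on it, which is the kind of delay that matters when an edge device needs an immediate response.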
How will Apache Spark face these challenges? SiliconANGLE will get the inside scoop at Spark Summit East, broadcasting live from its roving news desk, theCUBE. During the event, theCUBE hosts Dave Vellante and George Gilbert will talk with industry experts about the future of Apache Spark, how to use the Spark stack in a variety of applications, best practices for deploying Spark at scale, and use cases from leading organizations solving big data problems. TheCUBE guests are set to include:
- Ziya Ma, vice president of big data at Intel
- Mike Gualtieri, lead big data analyst at Forrester Research
- Alfred Essa, vice president of analytics and R&D at McGraw-Hill Education
- John Landry, distinguished technologist for HP Personal Systems Data Science
- Many more to come
Where to watch
Contributors: Cheryl Knight
(*Disclosure: TheCUBE is a media partner at the conference. Neither Databricks nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)