IBM bets big on Spark, calling it the Linux of Big Data analytics
IBM was late to the Hadoop party, and has had to settle for playing a supporting role. It isn’t going to make the same mistake again.
The company is putting a major stake in the ground in support of Apache Spark, the high-speed analytics and machine-learning engine that is the hottest thing in Big Data right now. IBM said it will embed Spark into all of its analytics and ecommerce platforms, commit more than 3,500 researchers and developers to work on Spark-related projects and open-source its SystemML machine learning technology to plug a key hole in the Spark technology stack. It will also offer courses to train more than one million data scientists and engineers to use Spark.
Regarded by some people as both a complement and competitor to Hadoop, Spark is actually one of many components of the large Hadoop ecosystem. It’s an in-memory analytics processing engine that works across many back-end file systems, including Hadoop’s native HDFS. Spark has rapidly gained popularity among businesses that are struggling to analyze data in multiple formats scattered across incompatible databases and file systems.
Because it runs in memory, Spark performs up to 100 times faster than Hadoop’s native MapReduce processing engine on native HDFS files. It also works just as fluidly on data stored in Amazon Web Services’ S3, HBase, Apache Cassandra, MySQL and several other popular data stores, meaning that applications don’t have to be rewritten for each storage back end. Spark is considered especially strong at working with unstructured data like Twitter streams.
In throwing its substantial weight behind Spark, IBM is casting a vote for simplicity, said George Gilbert, Wikibon’s Big Data analyst. One of the chief complaints about Hadoop is its complexity, a function of the large ecosystem that surrounds it, Gilbert said. Hadoop-related projects such as Hive, Pig, Spark and Impala all work on their own update schedules, which means users need to do the integration work. “That’s why there’s been a need for organizations like the Open Data Platform to bring some coherence to the process,” he said.
In contrast, Spark is composed of tightly integrated components that are designed to work with one another. That makes it relatively simple to maintain, Gilbert said. Its query language is also considered to be relatively easy to use.
IBM is addressing one of Spark’s biggest perceived weaknesses by open-sourcing SystemML, which is a component of the company’s Watson cognitive computing framework, and by collaborating with Databricks, Inc., a major Spark developer, which just added support for the R and Python 3 programming languages in a new release of Spark.
SystemML is one of the latest innovations to have emerged from the company’s ongoing work on Watson, which has seen its use expand from answering trivia questions to extracting complicated patterns out of vast quantities of unstructured data over the last few years. To keep up, SystemML provides a language that directly exposes the capabilities of the artificial intelligence for data scientists to harness.
Queries written in that language, whose syntax is deliberately modeled after the widely used R statistical programming framework, are automatically executed according to the most efficient mode of operation for the specific workload and operational characteristics of a Spark cluster. Needless to say, that has the potential to provide a tremendous boost for the project’s machine learning capabilities.
IBM’s endorsement can catalyze entire markets, Gilbert noted. He cited the example of Linux, which was a bit player in the desktop operating system market until IBM threw its support behind Linux as an engine for everything from embedded systems to mainframes.
IBM is also embedding Spark into its Bluemix platform-as-a-service stack, which will make the capabilities of the framework accessible on demand for developers and data scientists. Through a number of education partnerships announced alongside the initiative, the company hopes to bring the total number of professionals skilled in using Spark to more than a million within a few years, and it hopes those users will favor its implementation over the competition as a result.
IBM analytics executive Bob Picciano cited the same Linux example in a blog post announcing the Spark initiatives. IBM’s support of Linux “marked the beginning of its ascendancy in corporations and Internet-class data centers. The same sort of thing could happen now with Spark,” he wrote.
Picciano said Spark’s power and ease-of-use were demonstrated by a hackathon the company sponsored a few weeks ago. Thousands of IBM programmers with no background in Spark were given three weeks to learn Spark, form teams and create “moon shot” projects. They created more than 100 “impressive applications–software that could really matter in the world,” Picciano wrote.
In total, IBM’s commitment to Spark arguably represents the biggest milestone for the project since its inception at UC Berkeley four years ago. The framework is already a fixture of the analytics discussion thanks to its speed and extensibility, but if Big Blue’s past kingmaking role in other open-source projects such as Linux is anything to go by, its entry into the fray could take that to a whole different level.
IBM’s Bob Picciano and Inhi Cho Suh were guests on theCUBE at IBMz Next 2015 in January.
Maria Deutscher contributed to this report.
Photo by Dominik Brygier via Flickr