UPDATED 08:20 EDT / NOVEMBER 11 2015

NEWS

Piping hot: Why is Apache Kafka catching on?

Having been thrust into the spotlight by IBM last week, Apache Kafka is getting so much attention that it’s threatening to surpass Apache Spark as Big Data’s tech du jour and become a permanent fixture in the Hadoop landscape.

Kafka is described as an open-source, high-throughput publish and subscribe messaging system for managing real-time streams of data from websites, applications and sensors. It began life as an internal project at the professional networking website LinkedIn before being turned over to the Apache Software Foundation in 2012.

While Hadoop can be considered the primary data store, and Spark is the best-known analytics engine for processing that data, Kafka can be thought of as a kind of “circulatory” system that pumps Big Data throughout an organization. It collects data such as application metrics, user activity, logs and stock tickers and transforms it into a stream of data (like a bloodstream) that’s fed into Spark or other analytics software.
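To make the publish-and-subscribe idea concrete, here is a minimal sketch in plain Python. It is a toy, in-memory stand-in and not Kafka's actual API: the `ToyLog` class, its `publish` and `poll` methods, and the consumer names are all hypothetical, chosen only to show how several readers can consume the same stream at their own pace.

```python
from collections import defaultdict


class ToyLog:
    """Hypothetical in-memory stand-in for one topic: an append-only message list."""

    def __init__(self):
        self._messages = []
        # Each consumer tracks its own position (offset) in the log.
        self._offsets = defaultdict(int)

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Return the next unread messages for this consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


log = ToyLog()
for event in ["page_view", "login", "stock_tick"]:
    log.publish(event)

# Two independent readers of the same stream.
analytics_batch = log.poll("analytics")             # reads all three events
archive_first = log.poll("archive", max_messages=1)  # reads only the first

log.publish("logout")

analytics_next = log.poll("analytics")  # picks up where it left off
archive_rest = log.poll("archive")      # catches up on everything it missed
```

Because messages stay in the log and each reader only moves its own offset, a slow consumer such as the hypothetical "archive" reader never holds up a fast one. That is the loose coupling this kind of messaging layer is meant to provide.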

“We use it heavily as the messaging backbone that helps our applications work together in a loosely coupled manner,” explained Mammad Zadeh, head of real-time data infrastructure at LinkedIn. “We use it to move data between systems, and it touches virtually every server, every day.”

LinkedIn isn’t alone in making heavy use of Kafka. Since being open-sourced, the technology has been adopted by big-name companies including Cisco Systems Inc., Goldman Sachs, Netflix Inc., and Uber Technologies Inc. Most recently, IBM threw its weight behind the project with its new Kafka-on-Bluemix offering.

It’s all about scale

When asked why Kafka is suddenly grabbing so much attention, Todd Moore, Vice President of Open Technology at IBM, said its popularity was due to the changing nature of the way modern (cloud-native) applications are developed, and particularly the agile way in which those apps evolve. Enterprises are increasingly looking to build cloud-native applications that can scale, and many see Kafka as an essential component in making this happen.

“Kafka’s integration with Spark, the programmatic controls it gives to the application, and the way it scales out by just adding more partitions makes it ideal for these types of [modern] applications,” said Moore. “Kafka is a great technology for building cloud-native applications that scale horizontally on commodity hardware.”
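Moore's point about scaling out by adding partitions can be illustrated with a short sketch. This is an assumption-laden toy, not Kafka's own partitioner: the `partition_for` and `distribute` helpers are hypothetical, and `zlib.crc32` is used only as a convenient deterministic hash for the example.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index (hypothetical helper).

    crc32 is a stand-in hash chosen for this sketch because it is
    deterministic across runs; it is not the hash Kafka itself uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribute(keys, num_partitions):
    """Group keys into per-partition lists, as a producer's partitioner would."""
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partition_for(key, num_partitions)].append(key)
    return partitions


keys = [f"user-{i}" for i in range(1000)]

four = distribute(keys, 4)   # the same traffic spread over 4 partitions
eight = distribute(keys, 8)  # and over 8 partitions

for name, layout in (("4 partitions", four), ("8 partitions", eight)):
    print(name, [len(p) for p in layout])
```

Every message with a given key lands in the same partition, so per-key ordering is preserved, while the total load is divided among however many partitions exist. Adding partitions lets more consumers, on more machines, share the work.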

Kafka isn’t the only data messaging system out there. Competing technologies such as ActiveMQ and RabbitMQ have been around for a while and are both much more mature, but analysts agree Kafka is far superior to those technologies. For one thing, Kafka offers more than simple message queuing; it also decides what to do with data and where to put it, explained Constellation Research analyst Holger Mueller.

In addition, Kafka offers some significant advantages, like being able to make high-volume data available as a real-time stream for consumption in systems with varying requirements, said Matt Aslett, research director of data platforms and analytics at 451 Research Inc.

“In comparison with the likes of ActiveMQ, Kafka is probably best suited to ingesting a high-throughput ‘firehose’ of events from multiple sources, while other products are arguably better suited to more complex routing of lower volumes of events,” Aslett explained.

Streaming data made easy

Most analysts agree Kafka is complementary to and plays nicely with streaming technologies like Spark, Apache Samza (which was also developed by LinkedIn) and Apache Storm, rather than competing with them. That’s because stream processing is a separate layer in the stack, one which requires a message layer foundation that feeds data into it, 451 Research’s Aslett said.

An example of how the two technologies complement each other can be seen in Big Data vendor Teradata Corp.’s latest offering, a data-integration platform called Teradata Listener, which combines Spark with Kafka (and other technologies) for streaming Internet of Things (IoT) data. In that implementation, Kafka feeds IoT data into Spark, which then tries to make sense of it all.

“From a Streams perspective, Kafka is simply a source of streaming data,” said IBM’s Moore. He explained that while Kafka is the perfect messaging system for cloud-native applications, Spark and other stream processing systems are designed to consume those messages and perform complex analytics on them.

One company that could disrupt things is Confluent Inc., a startup headed by three ex-LinkedIn engineers and Kafka co-creators that aims to help enterprises use the platform in production at scale. Confluent, which picked up $24 million in a funding round this summer, is said to be building native stream-processing capabilities into Kafka that would, in theory, do away with the need for something like Spark or Samza.

“The simplicity of integrated messaging and analytics functionality could provide a competitive advantage for Confluent against the existing alternatives, which are notoriously difficult to implement,” wrote SiliconANGLE’s Maria Deutscher. “And the pitch will be made all the stronger by the fact that the company also intends to add storage capabilities into the mix.”

But although Confluent’s native stream-processing system might make Kafka simpler to implement, it’s unlikely to impact Kafka’s bigger users, who believe that a dedicated stream processing system like Spark or Samza is more desirable.

“We built Samza to work at our operational scale, as a service,” said LinkedIn’s Zadeh. “We also needed specific functionality for stateful processing, as well as the flexibility to ingest events other than Kafka. In essence, for the scale at which we operate at LinkedIn, our stream processing framework has to be operationally mature and functionally complete.”

It’s true that Apache Kafka still has a long way to go before it can be considered part of the furniture in enterprise IT shops. Still, with so many big-name backers and a highly active open-source community behind it, it’s also true that there are few technologies that show quite as much promise.

“Along with Hadoop and Spark, Kafka is poised to become one of the key building blocks of next-generation data processing platforms,” said 451 Research’s Aslett. “We certainly expect it to be baked into all the major Hadoop and Spark distributions at some point.”

Image credits: Unsplash via pixabay.com; Pashminu via pixabay.com; Savannah Sam Photography via Compfight cc
