Yahoo open-sources Pulsar, a low-latency alternative to Apache Kafka

Yahoo! Inc. has open-sourced a new distributed “publish and subscribe” messaging system called Pulsar that’s capable of scaling out while maintaining low latencies. Yahoo has long used Pulsar to back some of its own critical applications, and now wants the open-source community to help further its development.

Publish-and-subscribe is a messaging pattern in which senders, called publishers, do not address messages to specific receivers, called subscribers. Instead, they sort published messages into classes without knowing which subscribers, if any, exist. In short, publish-and-subscribe gives applications a highly scalable way to communicate with one another.
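To make the pattern concrete, here is a minimal in-memory sketch in Python. It is not Pulsar's API: the `TopicBus` class, its methods and the topic names are hypothetical, chosen only to show that a publisher names a topic rather than a receiver.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class TopicBus:
    """Toy in-memory publish-and-subscribe hub (hypothetical, not Pulsar's API)."""

    def __init__(self) -> None:
        # Maps a topic name to the callbacks subscribed to it.
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to every subscriber of the topic; return the count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


bus = TopicBus()
inbox_a: List[str] = []
inbox_b: List[str] = []
bus.subscribe("scores", inbox_a.append)
bus.subscribe("scores", inbox_b.append)

delivered = bus.publish("scores", "home 2, away 1")  # reaches both inboxes
unheard = bus.publish("weather", "sunny")            # no subscribers; publisher is unaffected
```

The publisher never learns who is listening, which is what lets subscribers be added or removed without changing the sending application.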

Yahoo developers Joe Francis and Matteo Merli introduced Pulsar on the company’s engineering blog, explaining some of the application requirements they had that led to its development.

“These applications provide real-time services, and need publish-latencies of 5ms on average and no more than 15ms at the 99th percentile,” they wrote. “At Internet scale, these applications require a messaging system with ordering, strong durability, and delivery guarantees.”
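A target stated as both an average and a 99th percentile can be met on one measure and missed on the other. The sketch below checks a set of latency samples against the two figures quoted above; the function names, the nearest-rank percentile method and the sample data are assumptions made for illustration, not anything Yahoo described.

```python
import math
from typing import List


def percentile_nearest_rank(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_target(samples_ms: List[float],
                         mean_limit_ms: float = 5.0,
                         p99_limit_ms: float = 15.0) -> bool:
    """True when the mean and the 99th percentile are both within their limits."""
    mean = sum(samples_ms) / len(samples_ms)
    return mean <= mean_limit_ms and percentile_nearest_rank(samples_ms, 99) <= p99_limit_ms


# Made-up sample: 98 fast publishes and 2 slow ones.
latencies_ms = [3.0] * 98 + [20.0, 40.0]
mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile_nearest_rank(latencies_ms, 99)
ok = meets_latency_target(latencies_ms)
```

In this made-up sample the average is comfortably under 5ms, yet the 99th percentile is 20ms, so the target is missed. That is why the tail figure is specified separately.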

They added that messages must also be committed to multiple disks or nodes to achieve the 99.999 percent durability that Pulsar guarantees.

“At the time we started, we could not find any existing open-source messaging solution that could provide the scale, performance, and features Yahoo required to provide messaging as a hosted service, supporting a million topics,” the engineers explained. As such, they “set out to build Pulsar as a general messaging solution, that also addresses these specific requirements.”

Pulsar was designed to run on commodity hardware and scale horizontally in order to provide messaging services to multiple apps. The system can be scaled up to accommodate millions of topics and many more millions of messages per second, according to Pulsar’s GitHub page.

Pulsar is accessible via a collection of APIs, and also comes with a client library that implements the messaging protocol and handles tasks such as service discovery and establishing and recovering connections.

How Pulsar works

Yahoo’s Francis and Merli also discussed Pulsar’s architecture in the blog, saying a cluster is composed of a set of brokers and BookKeeper storage nodes, plus ZooKeeper to manage coordination and configuration. A single Pulsar instance can consist of multiple clusters, which can be geographically separated from one another, the engineers explained.

A Pulsar cluster. Image via Yahoo! Inc.

Pulsar’s durable storage mechanism is provided by Apache BookKeeper, another Yahoo-built project that was open-sourced in 2011.

“With Bookkeeper, applications can create many independent logs, called ledgers,” it says on the Pulsar GitHub page. “A ledger is an append-only data structure with a single writer that is assigned to multiple storage nodes (or bookies) and whose entries are replicated to multiple of these nodes.”
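The ledger idea in that description can be sketched in a few lines of Python. This is a toy model rather than BookKeeper's implementation: the `Bookie` and `Ledger` classes, the `write_quorum` parameter and the round-robin placement of replicas are all assumptions made for illustration.

```python
from typing import Dict, List


class Bookie:
    """Hypothetical storage node that keeps entries by entry id."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: Dict[int, bytes] = {}


class Ledger:
    """Append-only log with one writer; each entry is copied to `write_quorum` bookies."""

    def __init__(self, ensemble: List[Bookie], write_quorum: int) -> None:
        if not 1 <= write_quorum <= len(ensemble):
            raise ValueError("write_quorum must be between 1 and the ensemble size")
        self._ensemble = ensemble
        self._write_quorum = write_quorum
        self._next_entry_id = 0

    def _replicas_for(self, entry_id: int) -> List[Bookie]:
        # Assumed placement: start at entry_id modulo the ensemble size, take the next bookies.
        size = len(self._ensemble)
        return [self._ensemble[(entry_id + i) % size] for i in range(self._write_quorum)]

    def append(self, data: bytes) -> int:
        """Write to the end of the ledger; there is no update or delete."""
        entry_id = self._next_entry_id
        for bookie in self._replicas_for(entry_id):
            bookie.entries[entry_id] = data
        self._next_entry_id += 1
        return entry_id

    def read(self, entry_id: int) -> bytes:
        # Any replica that holds the entry can serve the read.
        for bookie in self._replicas_for(entry_id):
            if entry_id in bookie.entries:
                return bookie.entries[entry_id]
        raise KeyError(entry_id)


bookies = [Bookie("bookie-1"), Bookie("bookie-2"), Bookie("bookie-3")]
ledger = Ledger(bookies, write_quorum=2)
first = ledger.append(b"m0")
second = ledger.append(b"m1")
copies_of_second = sum(1 for b in bookies if second in b.entries)
```

Because every entry lives on more than one bookie, losing a single storage node in this model does not lose the entry.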

Topics are assigned to individual brokers, which can serve up to thousands of topics at once, the company said.

“The broker accepts messages from writers, commits them to a durable store, and dispatches them to readers,” the GitHub page reads.
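As a rough illustration of that accept, commit and dispatch flow, the Python sketch below models a single-topic broker. The `Broker` class, its method names and the per-reader cursor are hypothetical; an in-memory list stands in for the durable store.

```python
from typing import Dict, List


class Broker:
    """Hypothetical single-topic broker: persist first, then dispatch to readers."""

    def __init__(self) -> None:
        self._store: List[str] = []         # stands in for the durable store
        self._cursors: Dict[str, int] = {}  # reader name -> index of its next message

    def add_reader(self, name: str) -> None:
        # A new reader starts at the current end of the log.
        self._cursors[name] = len(self._store)

    def publish(self, message: str) -> int:
        """Commit the message and return its position, which acts as the acknowledgement."""
        self._store.append(message)
        return len(self._store) - 1

    def dispatch(self, name: str, max_messages: int = 10) -> List[str]:
        """Hand the reader its next messages and advance its cursor."""
        start = self._cursors[name]
        batch = self._store[start:start + max_messages]
        self._cursors[name] = start + len(batch)
        return batch


broker = Broker()
broker.add_reader("mail-indexer")
broker.publish("msg-1")
broker.publish("msg-2")
broker.publish("msg-3")

first_batch = broker.dispatch("mail-indexer", max_messages=2)
second_batch = broker.dispatch("mail-indexer")
third_batch = broker.dispatch("mail-indexer")  # backlog drained, nothing left to send
```

Keeping a cursor per reader is what lets a slow reader fall behind and later drain its backlog without affecting writers.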

Apache ZooKeeper, meanwhile, handles coordination and configuration across Pulsar’s components. ZooKeeper was also a Yahoo internal project, open-sourced to the Apache Software Foundation back in 2008, and has now become an integral part of Apache Hadoop.

A Kafka challenger? 

BookKeeper is critical to Pulsar’s high level of durability, and also allows different elements of the system to be scaled independently. BookKeeper’s capabilities also help to explain why Yahoo built Pulsar in the first place, instead of using an existing messaging technology like Apache Kafka.

“By using separate physical disks (one for journal and another for general storage), bookies are able to isolate the effects of read operations from impacting the latency of ongoing write operations, and vice-versa,” Francis and Merli wrote. “Since read and write paths are decoupled, spikes in reads – which commonly occur when readers drain backlog to catch up – do not impact publish latencies in Pulsar. This sets Pulsar apart from other commonly-used messaging systems.”

In their blog post, Francis and Merli said the first instance of Pulsar was deployed in the spring of 2015. Since then, Yahoo has adopted Pulsar in many of its key services, such as Yahoo Mail, Finance, Gemini Ads, Sports and Sherpa, the company’s distributed key-value service. Pulsar now publishes around 100 billion messages a day across 1.4 million topics, with an average latency of less than five milliseconds.
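For a sense of scale, the daily totals quoted above can be converted into per-second and per-topic averages. The short Python calculation below uses only those two figures; real traffic is unlikely to be spread this evenly, so these are averages, not peak rates.

```python
MESSAGES_PER_DAY = 100_000_000_000  # "around 100 billion messages a day"
TOPICS = 1_400_000                  # "1.4 million topics"
SECONDS_PER_DAY = 24 * 60 * 60

avg_messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
avg_messages_per_topic_per_day = MESSAGES_PER_DAY / TOPICS

print(f"{avg_messages_per_second:,.0f} messages/second on average")
print(f"{avg_messages_per_topic_per_day:,.0f} messages/topic/day on average")
```

That works out to roughly 1.16 million messages per second overall, and about 71,000 messages per topic per day.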

Yahoo hopes that by open-sourcing Pulsar it will be able to speed up the system’s development. The company said it’s hoping the open-source community can help it to decrease the time it takes for topics to be migrated across brokers from the current 10 seconds to under one second. It also wants to provide additional language bindings for Pulsar, and improve publish latencies further.

Image credit: Stuart Rankin via flickr.com