UPDATED 12:00 EDT / JUNE 15 2017

BIG DATA

Yahoo unloads Bullet for querying streaming data in real time

Yahoo Inc. probably hasn’t been given enough credit for its contributions to the open-source software community over the last couple of decades. The company, which famously developed the Apache Hadoop software that’s at the heart of most big data projects today, is an active participant in numerous popular open-source projects.

Today, Oath Inc., the new Verizon Communications Inc. subsidiary that owns Yahoo and AOL, is unveiling Yahoo’s latest contribution to open source, and it could be a very important one: Bullet, a new general-purpose query engine for streaming data.

For the uninitiated, there are two kinds of data: streaming, which is data that arrives in a computer system in real time, and batch, which refers to information that’s been batched together over regular time intervals, for example hourly or daily. Batch data is usually quite easy to run queries against because it’s stored in a data warehouse where it can be accessed via commonly used SQL interfaces or business intelligence tools such as Tableau, Looker or Superset. But running queries on streaming data has always been much more challenging.

Bullet is aimed at changing that. In a blog post, Yahoo engineers Michael Natkovich, Akshai Sarma, Nathan Speidel, Marcus Svedman and Cat Utah explained the differences between Bullet and other engines. Most batch data querying engines have something called a persistence layer that includes in-memory storage. When the data is queried, you are effectively doing a “look-back” on data stored in the persistence layer, Yahoo’s engineers explained.

Bullet is different because it has no persistence layer and does not store any data. Instead, Bullet is a “forward-looking” query engine: a query sees only the data that passes through the system after the query has been submitted. It does not query any older data that has already passed through the system, making it “as real-time as real-time gets,” in the words of Yahoo’s team.
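To make the forward-looking idea concrete, here is a minimal Python sketch. It is not Bullet's code or API: the names `ForwardQuery`, `ForwardEngine`, `register` and `on_record` are hypothetical, and the engine is a toy that runs in a single process. It only illustrates the behavior described above, in which a query sees records that arrive after it is submitted and the engine keeps no history.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class ForwardQuery:
    """Hypothetical query object: collects only the records offered to it."""

    def __init__(self, predicate: Callable[[Record], bool]) -> None:
        self.predicate = predicate
        self.matches: List[Record] = []  # the query's own result, not engine storage

    def offer(self, record: Record) -> None:
        if self.predicate(record):
            self.matches.append(record)


class ForwardEngine:
    """Toy engine with no persistence layer: it never stores records itself."""

    def __init__(self) -> None:
        self.active: List[ForwardQuery] = []

    def register(self, query: ForwardQuery) -> None:
        # From this point on, the query is shown every record that arrives.
        self.active.append(query)

    def on_record(self, record: Record) -> None:
        # Show the record to each active query, then let it go.
        for query in self.active:
            query.offer(record)


engine = ForwardEngine()
engine.on_record({"event": "click", "id": 1})  # arrives before any query exists

clicks = ForwardQuery(lambda r: r["event"] == "click")
engine.register(clicks)

engine.on_record({"event": "click", "id": 2})
engine.on_record({"event": "view", "id": 3})

print([r["id"] for r in clicks.matches])  # prints [2]
```

The first click is never seen by the query because it passed through before the query was registered, and the engine has nowhere to look it up afterward.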

The company suggests a number of use cases for Bullet, including being able to quickly look at a range of metrics, check on assumptions, iterate on queries, check statuses and more. It also offered the following example of why Bullet is much better for these kinds of queries:

“Consider the following: if you had 1,000 queries in a traditional query system that operated on the same data, these query systems would most likely scan the data 1,000 times each,” Yahoo’s engineering team wrote. “By the very virtue of it being forward looking, 1,000 queries in Bullet scan the data only once because the arrival of the query determines and fixes the data that it will see. Essentially, the data is coming to the queries instead of the queries being farmed out to where the data is.”
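The arithmetic in that quote can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a benchmark or Bullet's implementation: the function names `look_back_reads` and `forward_reads` are hypothetical, the "data" is 50 integers, and each query is a simple predicate. The sketch counts how many times records are read in each model. Note that the number of predicate evaluations is the same in both cases; what changes is how often the data itself is scanned.

```python
from typing import Callable, Iterable, List

Query = Callable[[int], bool]


def look_back_reads(stored: List[int], queries: List[Query]) -> int:
    """Look-back model: every query scans the stored data on its own."""
    reads = 0
    for query in queries:
        for record in stored:
            reads += 1  # the record is read again for each query
            query(record)
    return reads


def forward_reads(stream: Iterable[int], queries: List[Query]) -> int:
    """Forward-looking model: each arriving record is read once and
    shown to every active query."""
    reads = 0
    for record in stream:
        reads += 1  # single read, regardless of how many queries exist
        for query in queries:
            query(record)
    return reads


# 1,000 hypothetical queries over the same 50 records.
queries: List[Query] = [lambda r, k=k: r % 1000 == k for k in range(1000)]
data = list(range(50))

print(look_back_reads(data, queries))  # prints 50000
print(forward_reads(data, queries))    # prints 50
```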

High-level Bullet architecture (Image: Yahoo)

Yahoo has already put Bullet into production on a number of projects. For example, the company is running an instance of Bullet against a small subset of its user engagement data stream to glean insights about user behavior in real time. Yahoo also uses Bullet to manually validate the instrumentation in its software applications, which produces user engagement data such as clicks, swipes and views.

In addition, Yahoo is using Bullet in continuous delivery pipelines to functionally test the instrumentation in product releases. This involves simulating usage of the new product so that Bullet can validate the data it generates.

“Bullet is orders of magnitude faster to use for this kind of validation and for general data exploration use cases, as opposed to waiting for the data to be available in Hive or other systems,” Yahoo’s team wrote.

Bullet has been made available to download on GitHub.

Main image: Sarah-L-B/flickr.com
