UPDATED 14:59 EDT / JUNE 29 2016


How data analysis opened up new possibilities for Yahoo Mail | #HS16SJ

There are some who say e-mail is dying, but is it true? Certainly, as technology advances, there are more ways than ever to connect and send messages that don’t rely on traditional e-mail servers. However, Peter Monaco, VP of Engineering for Communications Products at Yahoo, Inc., argued that instead of dying out, e-mail is transforming. As today’s keynote speaker at Hadoop Summit in San Jose, California, he had some insights into the future of e-mail.

There are two main kinds of e-mail messages, Monaco said. The first kind is personal and involves correspondence with people you actually know. The second is business-related and can include purchases made online, newsletters and promotions. While it’s true that the former is diminishing, with people preferring to send text messages or use one of the many messaging apps available, people still use e-mail for business all the time. Business use may even be growing, with more people than ever relying on e-mail for transactions instead of phone conversations.

Notifications for interactions with real people

But for those times you do correspond with a real human being, wouldn’t it be nice to get a notification, so you don’t have to keep checking and refreshing your inbox? And not have to worry that when you do get a notification, it’s simply about a shoe sale? This is a new feature that Yahoo has been working on. Knowing that no one would use such a feature if they got too many notifications about promotional e-mails or newsletters, Yahoo has been testing different ways it could offer such a tool.

First it tried looking at e-mail addresses themselves, but that didn’t work. There was no way for an algorithm to tell the difference between an address such as john@amazon.com and borders@amazon.com. There wasn’t anything in the syntax of the addresses themselves that would reveal if someone was a human or a bot. So the company turned to a new methodology.

Using the data set to understand user behavior

Yahoo’s next tactic, which has proved much more successful, was to look at the behavior of each e-mail address. By looking at how many messages a given e-mail address sends, receives, replies to and forwards, or even how many URLs can usually be found in its outgoing messages, it becomes clearer whether the sender is an actual human being.
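As a rough illustration of the idea, the sketch below turns those behavioral counts into a few ratios and takes a simple vote. It is not Yahoo's method, which the keynote did not describe in detail: the `SenderStats` fields, the `looks_human` function and every threshold are assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class SenderStats:
    """Hypothetical per-address counters; the field names are illustrative."""
    sent: int           # messages the address sent
    received: int       # messages the address received
    replies: int        # messages it replied to
    forwards: int       # messages it forwarded
    urls_in_sent: int   # total URLs found in its outgoing messages


def behavior_features(stats: SenderStats) -> dict:
    """Normalize raw counts by outgoing volume so large and small senders compare."""
    sent = max(stats.sent, 1)  # avoid division by zero for silent addresses
    return {
        "received_per_sent": stats.received / sent,
        "reply_rate": stats.replies / sent,
        "forward_rate": stats.forwards / sent,
        "urls_per_message": stats.urls_in_sent / sent,
    }


def looks_human(stats: SenderStats) -> bool:
    """Toy majority vote; all thresholds are made-up assumptions."""
    f = behavior_features(stats)
    votes = 0
    votes += f["received_per_sent"] >= 0.2   # people get mail back
    votes += f["reply_rate"] >= 0.05         # people answer some messages
    votes += f["urls_per_message"] <= 2.0    # bulk mail tends to be link-heavy
    return votes >= 2


person = SenderStats(sent=120, received=150, replies=40, forwards=5, urls_in_sent=30)
bulk = SenderStats(sent=500_000, received=20, replies=0, forwards=0, urls_in_sent=4_000_000)
print(looks_human(person), looks_human(bulk))
```

A production system would more plausibly learn such thresholds from labeled data than hard-code them, but the inputs would be the same kind of behavioral signals.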

While this method is successful, it is also harder to do. It takes a lot of data analysis to get results, and there is so much data in Yahoo Mail. On a daily basis, Monaco estimated, the service handles 4 billion messages, and if you add spam onto that, it’s 10 times as much. All in all, it equals about 30 terabytes of data a day. In addition to that daily deluge, the team wanted to look at the history as well. “Even for Yahoo it’s a big dataset,” said Monaco.

Protection of customer information

Monaco hastened to add that access to all the files is strictly controlled, with only a select few employees able to sort through them. In addition, all of the employees who are allowed access understand that were they to download the data onto a personal device or take any other unauthorized action, they would be terminated.

Compressing the data for manageability

To process such a huge amount of data, Yahoo had to do some optimization. One method it used was compression. First, just by eliminating things like attachments from the dataset, it achieved 6X compression. It then went a step further by grouping e-mails from similar sources.

Facebook alone was responsible for 23 percent of the dataset, said Monaco. So by identifying patterns in addresses and clustering them together, the Yahoo employees were able to compress it even further, raising the ratio from the initial 6X to 9X.
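To put the two ratios in perspective, the short calculation below shows how much of the original data remains at 6X and at 9X, and how much the clustering step saves on top of the first pass. It assumes "NX compression" means the data shrinks to 1/N of its original size; the helper name `remaining_fraction` is made up for this example.

```python
def remaining_fraction(ratio: float) -> float:
    """Fraction of the original size left after ratio-to-1 compression."""
    return 1.0 / ratio


after_filtering = remaining_fraction(6)    # attachments and similar removed
after_clustering = remaining_fraction(9)   # similar sources grouped as well

# Additional saving of the second step, relative to the 6X result.
extra_saving = 1 - after_clustering / after_filtering

print(f"left at 6X: {after_filtering:.1%}")
print(f"left at 9X: {after_clustering:.1%}")
print(f"extra saving from clustering: {extra_saving:.1%}")
```

Under that assumption, the move from 6X to 9X trims the already reduced dataset by a further third.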

Cultivating a better user experience

They are working on another similar feature called smart files, which would run clustering algorithms, find e-mails that are similar, combine them within a cluster, and then attach them to specific folders, such as “Travel.” Then the user would be able to easily access similar messages and find the information they need.
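The keynote did not say which clustering algorithms the feature would use, so the following is only a minimal sketch of the general idea: group messages whose subject lines share enough words. The greedy single-pass approach, the Jaccard similarity measure, the 0.4 threshold and the function names are all assumptions chosen for illustration.

```python
def tokens(subject: str) -> set:
    """Lowercased set of words in a subject line."""
    return set(subject.lower().split())


def jaccard(a: set, b: set) -> float:
    """Share of words two subjects have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_subjects(subjects, threshold=0.4):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []  # list of (representative tokens, member subjects)
    for subject in subjects:
        t = tokens(subject)
        for representative, members in clusters:
            if jaccard(t, representative) >= threshold:
                members.append(subject)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((t, [subject]))
    return [members for _, members in clusters]


inbox = [
    "Your flight confirmation to Boston",
    "Your flight confirmation to Denver",
    "Weekly newsletter from the shoe store",
]
groups = cluster_subjects(inbox)
print(groups)
```

Here the two flight confirmations land in one group, which a mail client could then attach to a folder such as “Travel.”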

It certainly brings to mind how much work goes into a single feature. But the work is well worth it to design a system that better serves what the user wants. “The dataset is very useful for us. We learn about each user’s behavior, which helps us to provide them with a better experience,” Monaco said.

