UPDATED 10:31 EDT / JUNE 28 2013


eHarmony Refines the Science of Love: Hadoop + Machine Learning | #hadoopsummit

For our flagship broadcast program theCUBE, Jeff Kelly interviewed Vaclav Petricek, Principal Data Specialist at eHarmony, live from Hadoop Summit 2013. They talked about long-term compatibility matching and the underlying architecture that makes it possible.

Petricek runs the machine learning applications at eHarmony that decide who should be introduced to whom, and when. For that, the company uses Hadoop and machine learning. “eHarmony is a bit different than your typical dating site,” brags Petricek. Typical dating sites are search-based: users see results generated by the search criteria they enter. The founder of eHarmony, Neil Clark Warren, is a marriage counselor. After years of counseling couples in failing marriages, he wanted to help people meet not only those they would be attracted to, but also those they are compatible with.

As for the underlying technology that makes this possible, Petricek explains: “To match people effectively, you need to solve three separate problems. The first one is long-term compatibility, then there’s the affinity matches (based on age and location), and finally, distribution (who to introduce to whom and when).”

An affinity for Hadoop


Hadoop and large-scale machine learning are used for the affinity part. To predict whether or not two people would be interested in talking to each other, eHarmony uses the historical data generated by its 10 years of operations. As for the data itself, Petricek clarifies: “Over the years the questionnaires have evolved, but certain questions have survived. It used to be 500 questions and now it’s down to 150, which is a lot of data, enough to ‘know’ someone. That’s how you can still make recommendations to people who joined the site recently.” The questionnaire is not the only tool. eHarmony also collects behavioral data: when users log in, how often, and what kinds of devices they use.
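To make the affinity idea concrete, here is a minimal sketch of predicting whether two people will communicate from historical outcomes. It assumes a plain logistic regression trained by stochastic gradient descent; the interview does not say which model eHarmony uses, and the two features (age gap and distance) and the toy data are hypothetical, chosen only because Petricek mentions age and location.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_affinity(weights, bias, features):
    """Estimated probability that a pair with these features communicates."""
    score = bias + sum(w * v for w, v in zip(weights, features))
    return sigmoid(score)


def train_affinity_model(pairs, labels, epochs=200, lr=0.1):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(pairs[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(pairs, labels):
            error = predict_affinity(weights, bias, features) - label
            bias -= lr * error
            weights = [w - lr * error * v for w, v in zip(weights, features)]
    return weights, bias


# Hypothetical history: [age gap in decades, distance in hundreds of km]
history = [[0.1, 0.1], [0.2, 0.3], [1.5, 2.0], [2.0, 1.0], [0.3, 0.2], [1.8, 2.5]]
# 1 = the pair exchanged messages, 0 = they did not (made-up outcomes)
communicated = [1, 1, 0, 0, 1, 0]

weights, bias = train_affinity_model(history, communicated)
close_pair = predict_affinity(weights, bias, [0.2, 0.2])
far_pair = predict_affinity(weights, bias, [1.9, 2.2])
print(f"close pair: {close_pair:.2f}, far pair: {far_pair:.2f}")
```

In this toy setup, a pair that is close in age and location scores higher than a distant pair. A production system would add questionnaire answers and behavioral signals as further features and train on far more data than fits in a Python list.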

Jeff Kelly next wanted to know how eHarmony deals with people who do not answer the 150 questions truthfully. “You cannot force someone to answer truthfully, but we offer incentives to do so, in order to get the right matches. It’s a science in itself to design the questions in such a way to get the underlying psychological traits, and not what the person would like to be.”

Talking about the technology itself, Petricek explained: “We store all of our data in-house, on a Hadoop cluster, in HDFS, and on top of that we run Hive, which provides the SQL interface, and then we do the machine learning modeling. We use a lot of Vowpal Wabbit, a large-scale open source machine learning system written by John Langford, that can scale on the Hadoop cluster. And lastly, we use some genetic algorithms.”
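As an illustration of the step between Hive and Vowpal Wabbit, the sketch below turns one row of pair data into a line of Vowpal Wabbit's plain-text input format (a label followed by `|namespace feature:value` groups). The function name `to_vw_line`, the namespaces, and the feature names are hypothetical; the interview does not describe eHarmony's actual features or export pipeline.

```python
def to_vw_line(label, namespaces):
    """Format one training example as a Vowpal Wabbit text input line.

    label: 1 or -1 (the convention for VW's logistic loss).
    namespaces: dict of namespace name -> dict of feature name -> number.
    Feature names must not contain spaces, ':' or '|'.
    """
    parts = [str(label)]
    for name, features in namespaces.items():
        pairs = " ".join(f"{key}:{value}" for key, value in features.items())
        parts.append(f"|{name} {pairs}")
    return " ".join(parts)


# Hypothetical row, e.g. the result of a Hive query joining two user profiles
example = to_vw_line(
    1,
    {
        "pair": {"age_gap": 3, "distance_km": 12.5},
        "behavior": {"logins_per_week": 5},
    },
)
print(example)
# prints: 1 |pair age_gap:3 distance_km:12.5 |behavior logins_per_week:5
```

Lines like this can be written to a file and passed to the `vw` command line tool for training.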
