UPDATED 10:31 EST / JUNE 28 2013


eHarmony Refines the Science of Love : Hadoop + Machine Learning | #hadoopsummit

For our flagship broadcast program theCUBE, Jeff Kelly interviewed Vaclav Petricek, Principal Data Specialist with eHarmony, live from Hadoop Summit 2013, talking about long-term compatibility and the underlying architecture that makes it possible.

Petricek runs the machine learning applications at eHarmony that decide whom to introduce to whom, and when. For that, the company uses Hadoop and large-scale machine learning. “eHarmony is a bit different than your typical dating site,” brags Petricek. Typical dating sites are search-based, with results generated by certain search criteria. The founder of eHarmony is Neil Clark Warren, a marriage counselor. After years of counseling couples in failing marriages, he wanted to help people meet not only the people they would be attracted to, but also the people they are compatible with.

As for the underlying technology that makes this possible, Petricek explains: “To match people effectively, you need to solve three separate problems. The first one is long-term compatibility, then there’s the affinity matches (based on age and location), and finally, distribution (who to introduce to whom and when).”

An affinity for Hadoop


Hadoop and large-scale machine learning are used for the affinity part. To predict whether or not two people would be interested in talking to each other, eHarmony uses the historical data generated by its 10 years of operation. As for the data itself, Petricek clarifies: “Over the years the questionnaires have evolved, but certain questions have survived. It used to be 500 questions and now it’s down to 150, which is a lot of data, enough to ‘know’ someone. That’s how you can still make recommendations to people who joined the site recently.” The questionnaire is not the only tool. eHarmony also collects behavioral data: when users log in, how often, and what kinds of devices they use.

Jeff Kelly then asked how eHarmony deals with people who do not answer the 150 questions truthfully. “You cannot force someone to answer truthfully, but we offer incentives to do so, in order to get the right matches. It’s a science in itself to design the questions in such a way to get the underlying psychological traits, and not what the person would like to be.”

Talking about the technology itself, Petricek explained: “We store all of our data in-house, on a Hadoop cluster, in HDFS, and on top of that we run Hive, which provides the SQL interface, and then we do the machine learning modeling. We use a lot of Vowpal Wabbit, a large-scale open source machine learning system written by John Langford, that can scale on the Hadoop cluster. And lastly, we use some genetic algorithms.”
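
To make the modeling step a little more concrete, here is a minimal sketch of how one candidate pair might be written out in Vowpal Wabbit's plain-text input format (a label followed by `|namespace feature:value` groups). The function name, the feature names, and the labeling convention are hypothetical and chosen only for illustration; the interview does not describe eHarmony's actual features or pipeline.

```python
def to_vw_line(label, user_a, user_b):
    """Serialize one candidate pair as a Vowpal Wabbit text-format example.

    label:  1 if the pair ended up communicating, -1 otherwise
            (a hypothetical labeling convention).
    user_a, user_b: dicts of numeric features with hypothetical names;
            both are assumed to contain an "age" entry.
    """
    def namespace(name, features):
        # VW namespaces look like "|name feat1:val1 feat2:val2".
        feats = " ".join(f"{key}:{value}" for key, value in sorted(features.items()))
        return f"|{name} {feats}"

    # One simple pairwise feature derived from both profiles.
    pair = {"age_gap": abs(user_a["age"] - user_b["age"])}

    return " ".join([
        str(label),
        namespace("a", user_a),
        namespace("b", user_b),
        namespace("pair", pair),
    ])


example = to_vw_line(
    1,
    {"age": 34, "logins_per_week": 5},
    {"age": 31, "logins_per_week": 2},
)
print(example)
```

Lines in this shape could be produced from historical rows exported out of Hive and then fed to Vowpal Wabbit for training. Keeping each person's features in a separate namespace is a design choice that would let the learner be asked to cross them, which is one plausible way to capture how two profiles interact.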
