UPDATED 08:15 EST / NOVEMBER 15 2011

Where’s Etsy’s Big Data Strategy? Musings of Hadoop World 2011

Big data’s so prevalent in the social media space that you hardly think of the two as separate.  At Hadoop World last week, social networks were well represented among Hadoop users, with sessions from Facebook, LinkedIn, Bit.ly and Etsy, amongst others.  These are the four I’m concentrating on, as the most high-profile social media networks at the conference.  As far as Hadoop and other big data technology implementation goes, social media networks are amongst the most eager to execute, a development born of necessity.  This is helping to standardize some of the processes around unstructured data analytics, providing an array of use case scenarios and delivering on the promise of monetization.

Facebook’s own big data solutions

Facebook was perhaps the crown jewel at Hadoop World, as far as social networks go.  Between their two sessions Jonathan Gray, a software developer at Facebook, went into detail about the ways in which Hadoop technology has prompted a number of developments and ongoing projects with his team.  With large and growing indexes being created for each individual user on Facebook, cataloging all their actions, Facebook was in desperate need of something that could enable them to process, access and catalog data quickly.  Through the process of developing their own solution around this, Facebook developers found themselves looking at Hadoop HBase, uncovering its many benefits over MySQL.

After considering several databases, Facebook chose HBase for its atomic read-modify-write operations, its support for multiple shards per server and its bulk importing.  The most important deciding factor for Facebook was HBase’s use of HDFS, which gives them all the benefits of HDFS as a storage system at no additional cost, along with fault tolerance, scalability, checksums that fix corruptions and more.
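To see why an atomic read-modify-write matters for per-user counters, here’s a minimal in-memory sketch.  It is not HBase’s client API and not Facebook’s code; the `CounterStore` class, the `hammer` helper and the row and column names are all hypothetical, and a single lock stands in for whatever the database does internally.

```python
import threading


class CounterStore:
    """Toy in-memory store; hypothetical, only illustrates the idea."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()

    def increment(self, row, column, amount=1):
        # Read, modify and write happen under one lock, so no other
        # writer can slip in between the read and the write.
        with self._lock:
            key = (row, column)
            new_value = self._cells.get(key, 0) + amount
            self._cells[key] = new_value
            return new_value

    def get(self, row, column):
        with self._lock:
            return self._cells.get((row, column), 0)


def hammer(store, row, column, times):
    """Call increment repeatedly, as one of many concurrent writers would."""
    for _ in range(times):
        store.increment(row, column)


if __name__ == "__main__":
    store = CounterStore()
    workers = [
        threading.Thread(target=hammer, args=(store, "user:42", "messages", 1000))
        for _ in range(8)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(store.get("user:42", "messages"))
```

If the read and the write were separate steps, two writers could read the same old value and one update would be lost.  Doing both as a single operation is what keeps the count honest.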

Aside from the two in-use examples of Hadoop, which have been implemented for things like organizing and searching your messages inbox, Facebook demonstrated two experimental uses of Hadoop, the first of which is an Operational Data Store similar to StumbleUpon’s HBase use cases.  This applies to system metrics and other data that will ultimately be utilized by advertisers and brands on the site.  Facebook’s looking for the most efficient way to graph this data over time, supporting complex aggregation and transformations.  Where Facebook runs into trouble is scaling with MySQL (there are millions of unique time series with billions of points, plus irregular data and growth patterns).  Facebook’s looking to Hadoop to transition away from MySQL in this case, finding another area of integration for HBase.
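For a rough sense of what aggregating an irregular time series involves, here’s a small sketch that rolls raw points up into fixed windows before graphing.  The `rollup` function, the 60-second bucket size and the sample points are all assumptions made for illustration; none of it comes from Facebook’s presentation.

```python
from collections import defaultdict


def rollup(points, bucket_seconds=60):
    """Group (timestamp, value) points into fixed-width time windows.

    Returns {bucket_start: {"count", "sum", "max"}}, ordered by bucket start.
    Points may arrive at irregular intervals and in any order.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Snap each timestamp down to the start of its window.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return {
        start: {"count": len(values), "sum": sum(values), "max": max(values)}
        for start, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Hypothetical metric samples: (seconds since start, value)
    samples = [(0, 1.0), (59, 3.0), (60, 5.0), (185, 2.0)]
    for start, stats in rollup(samples).items():
        print(start, stats)
```

With millions of series and billions of points, the hard part is doing this kind of rollup continuously and at scale, which is the gap Facebook is hoping HBase can fill.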

LinkedIn sees big data verticals

LinkedIn is also very active in its Hadoop exploration, finding the technology easy to implement once and then apply across a number of its verticals.  This has come in handy for determining which features to update, when and how, enabling the professional networking site to run models before pushing a new feature live on the site.  “No analytics platform is complete without Hadoop,” says Abhishek Gupta, the presenting software engineer from LinkedIn.  For LinkedIn, just like Facebook, data is quickly building around the individual user, and LinkedIn’s looking for the best way to put it to work on their behalf.  There’s no way they can leverage it on their own, right?

The verticals of most importance for LinkedIn right now circle around its recommendations capabilities, helping you find jobs of interest, the highest value recruits and people you should get to know.  There are certain trade-offs you face when you take data and use it to reach a conclusion on behalf of a user, and balancing those trade-offs is the best way LinkedIn’s found to put Hadoop to work, as far as recommendations go.  Hadoop has helped LinkedIn to scale its billions of recommendations, ensuring their relevancy as well as speed.

LinkedIn ended up blending a couple of techniques here, with Hadoop providing the right sandbox environment to play around with the site without the “down payment” of putting in a new system.  This helps LinkedIn better allocate resources all around, and also makes sure that existing site activity goes on uninterrupted, even during a period of implementation.

Where’s Etsy’s big data strategy?

Just like LinkedIn, Etsy is balancing a series of trade-offs in taking the leap to recommend something to a user, and this is most evident in its product search.  Etsy’s a unique marketplace with unique needs, holding great promise for big data solutions.  While Etsy’s shared its Hadoop usage, running dozens of workflows each night on Amazon’s cloud-based Elastic MapReduce service, none of that was discussed in detail at Hadoop World.  Their session looked primarily at data mining and the countless issues software engineer Aaron Beppu and his team face in creating useful and predictive product search results.

Etsy’s search is a learning process, dealing with a vast user base, specialized products and a wealth of crowd-sourced tags, descriptions and titles.  Etsy demonstrated a series of tweaks it’s made to its search tool, showing how the system deduces a user’s goals based on their activity.  Applying that data to their recommendation system, however, is where the magic can really happen.

It’s pretty evident Etsy’s eager to implement its big data strategy at scale, with its existing use of MapReduce, and also its recent employment of Splunk to manage and analyze up to a terabyte of data a day.  What’s not clear, however, is where Etsy plans to take all of this in the future.  Nothing Etsy demonstrated at Hadoop World has been pushed live to the site, and everything appears quite experimental at this point.  Beppu had little to say when I asked him about Etsy’s plans for leveraging its learning systems towards serendipitous discovery (a big thing for anyone who actually uses Etsy regularly) or personalized recommendations.  While Etsy’s Taste Test was a fun step towards individualizing its big data capabilities around each user, Etsy’s done little else on the product side to prove its big data goals, and even refused to comment on its use of Splunk.

Bit.ly keeps it organic

Bit.ly is one company that’s excited about its big data potential, something that’s been made all the more evident by the enthusiasm of their chief data scientist Hilary Mason.  She’s building quite a team of engineers around her big data vision, applying Hadoop technology to help bit.ly remain an organic and productive experience.  Bit.ly leverages data for a number of in-house services, including a search engine, but finds several end-user scenarios as well.

Bit.ly primarily uses Hadoop for storage, computation and infrastructure.  There’s heavy MapReduce integration, and the result is a technology that can spin up extra-large computing capacity at lower cost.  Bit.ly looks at a few things, like decoding Twitter and Facebook traffic, the distribution of times between decodes, which users have similar click histories, and of course, recommendations.  When it comes to preserving the Bit.ly user experience, however, big data tools can help determine organic versus inorganic activity.  “You want to predict how many clicks a link will have, and you also want to spot abnormalities,” says Bit.ly scientist Brian David Eoff.  “And you have to do it in real time.”

This requires an isolation of data, which is as complicated as it sounds.  Bit.ly tried a few different approaches in its focus on data isolation and cleaning, first looking at an auto-regression model that considers past history and has a decided time factor, then at another approach for robot spotting, which seeks out bots that keep hitting a link.
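To make those two ideas concrete, here’s a minimal sketch of what each could look like.  It is not Bit.ly’s actual model: the first-order auto-regression, the three-sigma threshold, the one-click noise floor and the function names `fit_ar1`, `is_abnormal` and `repeat_hitters` are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def fit_ar1(series):
    """Least-squares fit of x[t] = a + b * x[t-1]; returns (a, b)."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    prev, curr = series[:-1], series[1:]
    mean_prev, mean_curr = mean(prev), mean(curr)
    variance = sum((p - mean_prev) ** 2 for p in prev)
    if variance == 0:
        return mean_curr, 0.0  # flat history: predict the average
    slope = sum((p - mean_prev) * (c - mean_curr)
                for p, c in zip(prev, curr)) / variance
    return mean_curr - slope * mean_prev, slope


def is_abnormal(series, observed, k=3.0):
    """Flag a click count that strays too far from the AR(1) forecast."""
    a, b = fit_ar1(series)
    residuals = [c - (a + b * p) for p, c in zip(series[:-1], series[1:])]
    expected = a + b * series[-1]
    # Assumed floor of one click so a perfectly regular history
    # does not make every tiny deviation look abnormal.
    spread = max(pstdev(residuals), 1.0)
    return abs(observed - expected) > k * spread


def repeat_hitters(clicks, limit=100):
    """Return clients that hit any single link more than `limit` times.

    `clicks` is an iterable of (client_id, link) pairs.
    """
    counts = Counter(clicks)
    return {client for (client, _link), n in counts.items() if n > limit}
```

Given hourly click counts for a link, `is_abnormal(history, latest_count)` compares the newest count against what recent history predicts, while `repeat_hitters` is the blunt version of robot spotting: anything hammering one link past a chosen limit gets flagged.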

“What we’re trying to do is use Hadoop to run models to help us go through the data and basically put it into a format you need, and do your machine learning right then and there,” Eoff explains.

