UPDATED 13:57 EDT / DECEMBER 02 2013


Using Big Data to predict product launch success with dunnhumby stats at hack/reduce

Back in May, Kaggle ran a competition with hack/reduce and dunnhumby focused on applying big data predictive analysis to product launches. The objective: predict the level of success or failure of a product by looking at its first few weeks of sales. Called the “Product launch challenge,” the hackathon drew a number of hacker and data science teams eager to try their hand at the problem.

Dunnhumby supplied the data—around 100,000 rows—covering weekly sales per store, product category, the number of customers who purchased each product, and even customer segments. The hackers received 26 weeks of data in total. Their job: use the first 13 weeks to build a data mining model that could best predict sales across the last 13 weeks. Kaggle hosted the data-mining competition at the hack/reduce space, where participants parsed, modeled, and finally submitted predictions based on the dunnhumby data set.

More than 111 teams competed internationally, with 20 in house, and out of the scrum emerged a winner that paired Python scripting with Matlab to produce the most accurate predictive models of sales from the dunnhumby data.

SiliconANGLE spoke with Team SidPac, the top team at the hack/reduce hackathon in Boston and sixth overall. Team SidPac is made up of William Li, George Tucker, and Alex Levin, all CSAIL PhD students. The name “SidPac” comes from the graduate dormitory where all three members of the small, brilliant team lived.

Team SidPac’s teamwork won the day

To predict the number of sales in the last 13 weeks, the researchers looked at units sold per store in the historical data in order to make predictions about future units sold per store. They chose linear regression and random forests “because they are efficient, easy to tune, and the results are reasonably interpretable.”
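As a minimal sketch of this kind of setup—not Team SidPac’s actual code—the linear-regression half can be illustrated with numpy’s least-squares solver standing in for Matlab’s regression routines. The data below is synthetic, and the column layout is an assumption for illustration only.

```python
# Hypothetical sketch: fit a linear model mapping per-week unit sales in
# weeks 1-13 to total units sold in weeks 14-26. All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_products = 200

# Synthetic training data: weekly unit sales for weeks 1-13
X = rng.poisson(lam=50, size=(n_products, 13)).astype(float)
# Synthetic target: total units over weeks 14-26, a noisy function of early sales
y = X.sum(axis=1) * rng.uniform(0.9, 1.1, size=n_products)

# Add an intercept column and solve the least-squares problem
A = np.hstack([np.ones((n_products, 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict total late-period sales for a new product's first 13 weeks
new_weeks = rng.poisson(lam=50, size=13).astype(float)
predicted_total = coef[0] + new_weeks @ coef[1:]
```

The random-forest half would typically be handled by a library such as scikit-learn’s `RandomForestRegressor` (or Matlab’s `TreeBagger`), trained on the same feature matrix.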

Dunnhumby provided a wealth of tracking information on the would-be shoppers for predictive algorithms to work against, and Team SidPac set out to determine the relevance of all the different categories and labels. These included product categories (e.g., bread, coffee, video games), the number of stores selling a product, the number of units sold per week, unique customers who had bought a product, unique customers who had bought a product more than once, and total units sold to distinct customer segments (e.g., Family Focused, Finest, Grab and Go, Shoppers On A Budget, Watching the Waistline).

With only a short time to make sense of the full data set, though, Team SidPac reviewed a relevance visualization of the data, picked what intuitively seemed to be the strongest predictors, and ran with those:

The final set of features were:

  • for each of weeks 1 to 13, the sales in that week multiplied by the ratio of stores in week 26 to stores in week 13
  • the raw sales in weeks 1 through 13
  • the number of stores in weeks 14 through 26
  • three interaction terms (products of some of the above features)
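The feature construction above can be sketched in code. This is a hedged reconstruction, not the team’s implementation: the input layout is assumed to be per-product matrices of weekly sales and store counts, and the three interaction terms are illustrative guesses, since the write-up does not say which products of features were used.

```python
# Assumed inputs: (n_products, 26) arrays of weekly unit sales and weekly
# store counts. The interaction terms are invented for illustration.
import numpy as np

def build_features(sales, stores):
    """Assemble the 42-column feature matrix described in the article."""
    store_ratio = stores[:, 25] / stores[:, 12]          # stores wk26 / wk13
    scaled_sales = sales[:, :13] * store_ratio[:, None]  # ratio x weekly sales
    raw_sales = sales[:, :13]                            # raw sales, wks 1-13
    late_stores = stores[:, 13:26]                       # store counts, wks 14-26
    interactions = np.column_stack([                     # illustrative products
        raw_sales[:, 12] * store_ratio,
        raw_sales.sum(axis=1) * store_ratio,
        raw_sales[:, 12] * late_stores[:, 0],
    ])
    return np.hstack([scaled_sales, raw_sales, late_stores, interactions])

rng = np.random.default_rng(1)
sales = rng.poisson(40, size=(10, 26)).astype(float)
stores = rng.integers(5, 50, size=(10, 26)).astype(float)
features = build_features(sales, stores)  # shape (10, 42)
```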

The researchers describe the models as “fairly simple,” but they proved powerful nonetheless.

To make all this happen, the team combined Python scripting and Matlab for the heavy lifting. Since the data arrived in a format with a fair amount of text, SidPac used Python scripts to preprocess it before sending it to Matlab for modeling.

“Matlab isn’t so great with string manipulation,” George Tucker explains, “so for example when we wanted to see if there was a difference between frozen food and other product categories, we broke the data up and formatted it with Python and then fed it to Matlab.”
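The kind of preprocessing Tucker describes might look like the following. This is an illustrative example only—the field names and category labels are invented, not taken from the dunnhumby files:

```python
# Hypothetical sketch of splitting records by product category in Python
# so each numeric-only subset can be handed to Matlab separately.

def split_by_category(rows, target="FROZEN FOOD"):
    """Partition rows into (matching, other) by product category."""
    matching = [r for r in rows if r["category"] == target]
    other = [r for r in rows if r["category"] != target]
    return matching, other

rows = [
    {"product_id": "1", "category": "FROZEN FOOD", "week1_sales": "120"},
    {"product_id": "2", "category": "COFFEE", "week1_sales": "85"},
    {"product_id": "3", "category": "FROZEN FOOD", "week1_sales": "40"},
]
frozen, rest = split_by_category(rows)  # 2 frozen-food rows, 1 other
```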

For the predictive work, Team SidPac used linear regression and random forests to forecast the last 13 weeks of sales from the dunnhumby data. For the curious, the winning model is described in a presentation and the code is available on GitHub.

The Big Data angle

Businesses have long wanted a crystal ball somewhere in their organization to predict how well product launches will go, so they can cut costs and avoid sunk investments. Oracles have come and gone, but much of the power of prediction is still buried in the data. This is where data science comes into play, comparing historical data with current data; the Holy Grail will fall to the company that produces the most flexible model.

Marketing, production, and tracking customer interest produce a great deal of data. That data can fill enormous volumes within databases, and there are plenty of technologies available to move, sift, and store it (from Hadoop to flash), but when it comes down to analysis it is still wild, open country.

What Team SidPac has shown us is that often the most intuitive approach yields the cleanest results. In their own words, they went with what they thought would produce the best solution in the limited time available, and their approach didn’t even use most of the metadata dunnhumby provided. This suggests that for some tasks the simplest model may already be the most powerful, though it leaves open the possibility that fine-tuning will come from otherwise unseen indicators in the metadata.

Future participants: have fun and prepare well

Team SidPac’s George Tucker says he would give this advice to future teams:

“Set up your tools and workflow ahead of time, spend a lot of time visualizing data and checking where your models are making mistakes,” he said, adding that “crafting good features is more important than using fancy models, especially when time is limited.”

When asked if Team SidPac had fun and would participate again, William Li told SiliconANGLE that he would do it again.

“I definitely had fun and learned a lot, especially from George and Alex,” he said. “The event was a great introduction to the hack/reduce space in Cambridge. I’d definitely be interested in participating in future machine learning- and data science-related events there in the future. Since the competition, we’ve had a chance to go to some community group meetings to meet and learn from other people.”

