UPDATED 15:05 EST / MARCH 24 2018

BIG DATA

How to solve those really hard data science problems? Throw a contest!

As the NCAA’s March Madness basketball tournament nears its climax over the next two weeks, data scientists around the world will be watching to see if their predictive algorithms survived the upsets that characterize the annual tourney.

The winners will get bragging rights and recognition by SAP SE as part of its annual #ViztheMadness analytics competition. SAP has been running sports-related analytics competitions for about five years, recently branching out into topics that solve bigger structural problems, such as municipal water safety and power grid resiliency.

“We like to jump in and tackle interesting, trending topics to crunch algorithms and uncover hidden insights,” said Nic Smith, SAP’s global vice president of product marketing for cloud analytics. “It’s fun for the community and it shows how analytics can be used.”

It’s also an example of an increasingly popular way for many companies to complete big-data projects that are often fiendishly complex, at a time when there’s a huge shortage of the data scientists needed to see them through. And those data scientists command big bucks. As big data has gone mainstream, salaries for data scientists have rocketed to more than $120,000, assuming candidates can even be found. Competitions can be a cost-effective way to tap into top talent, since the prize money is often modest.

No one knows the size of the data science competition market, but Kaggle Inc. and Topcoder Inc., which operate the two largest competition platforms, collectively boast nearly 1.5 million members. Google LLC thought enough of Kaggle’s business to purchase it last year for an undisclosed amount.

At the end of this week, Kaggle had 17 active competitions with a top published prize of $100,000. DrivenData Inc., which specializes in competitions that attack social challenges, has six, including one that asks competitors to predict the failure rate of water pumps in remote parts of Tanzania. Both firms boast a roster of blue-chip clients, as does venture-backed competitor CrowdAnalytix Inc.

New perspectives

Data science competitions are also a way to find new perspectives that wouldn’t necessarily emerge from people who are steeped in a particular discipline. “No matter who you are, the best talent is beyond your organizational walls,” said Trevor Monroe, a program officer with the Innovation Labs at the World Bank, which has regularly conducted competitions since 2014.

For contestants in the SAP contests, the reward is a T-shirt and designation as an SAP “Data Genius,” but in other data science competitions the stakes are much higher. For example, online real estate listing service Zillow Inc. has put up $1.2 million in prize money for code-slingers who can improve upon its flagship Zestimate algorithm for estimating home values.

More than 75,000 entries were submitted by 4,400 competitors in the first round of the competition, which is hosted by Kaggle. The top 100 contestants have now moved to the second round, with winners to be chosen next year.

Zillow has its own staff of data scientists, but there are times when it consults the wisdom of crowds to find a different perspective. Given a range of data points about houses in the Los Angeles area – the square footage, number of bedrooms, distance to schools and the like – contestants have come up with strikingly different approaches to estimating home value.

“We found that the third-place winners didn’t correlate at all with the others; that was a total surprise,” said Andy Martin, a senior manager in Zillow’s Data Science and Engineering group. In fact, most of the finalists don’t even come from the real estate industry, he said.

Topcoder Chief Executive Mike Morris isn’t surprised. “We almost always find that the people who win these contests have nothing to do with that industry,” he said.

The U.S. Department of Homeland Security dangled $1.5 million in prize money in front of data scientists who could help improve its threat prediction algorithms. Honeywell Inc. offered $2,500 to the contestant who could build the best model of airplane fuel efficiency during different phases of a flight. Tennis Australia Ltd. will pay $5,000 to someone who can come up with a better algorithm for estimating how points in a tennis match will end.

Match made in science

Data science is a discipline that lends itself particularly well to a competitive format, experts say. “There’s usually no one right answer, so you can compare approaches,” said Divyabh Mishra, CEO of CrowdAnalytix, a crowdsourced analytics service that focuses on life sciences and professional services firms. “But unlike software development, one person can often deliver the solution.”

Many data scientists see themselves as lone wolves who enjoy solving the problems on their own and learning new disciplines. “These are mathematicians who like to work on multiple problems at the same time,” Mishra said. “The fun for them is that it’s part art and part science.”

Money matters, but experts agree it’s not a primary motivator. “Competitors say they’re here because they love competing,” said Morris. “Even when they lose, they learn something new.”

Topcoder competitor Wladimir Leite agreed. The Brazilian computer forensics expert has won 41 matches since 2003, none of which had anything to do with computer forensics. “I’ve learned a lot of things that I wouldn’t even have heard about otherwise,” he said. “These competitions are a great way to meet awesome people with amazing skills and to keep my programming skills in shape.”

Competitive spirit

The competitive format seems to bring out the most creative ideas and raise the performance level across the board, experts say. That’s why successful platforms all use a leaderboard and scoring mechanism to enable contestants to compare themselves to their peers. “Gamification is an important part of the experience,” said Greg Lipstein, co-founder of DrivenData.
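To make the leaderboard idea concrete, here is a minimal sketch of how a platform might score and rank submissions. It assumes entries are judged by root-mean-square error against held-out answers. The metric, the contestant names and the numbers are all hypothetical and are not taken from any platform named here.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists of numbers."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))

def leaderboard(submissions, actual):
    """Return (contestant, score) pairs sorted from lowest error to highest."""
    scores = {name: rmse(preds, actual) for name, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical held-out answers and three hypothetical entries.
actual = [3.0, 5.0, 7.0]
submissions = {
    "carol": [3.0, 5.0, 10.0],
    "alice": [3.0, 5.0, 7.0],
    "bob": [4.0, 6.0, 8.0],
}

board = leaderboard(submissions, actual)
for rank, (name, score) in enumerate(board, start=1):
    print(f"{rank}. {name}: {score:.3f}")
```

Because every entry is scored the same way against the same answers, contestants can see exactly where they stand relative to their peers.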

The competition format isn’t right for every problem. Goals and variables should be clearly stated, data scrubbed and scope limited. Competitions don’t work if the path to a solution is clear. The best problems are those that benefit from completely new perspectives.

A good example is a challenge DrivenData ran in 2015 in an attempt to correlate reviews on the consumer ratings site Yelp with health violations at Boston restaurants. Using linguistic analysis, star ratings and visit frequency data, contestants wrote predictive models that were compared to actual historical violation records. The winning algorithm enabled the city to uncover 25 percent more violations with the same number of inspectors.
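A result like that can be measured by replaying a model's ranking against historical records. The sketch below is an illustration only, with made-up restaurants, risk scores and inspection budget. It is not the method DrivenData or the city actually used. It counts how many known violations turn up when a fixed number of inspections follow the model's ranking rather than an arbitrary order.

```python
def rank_by_risk(predicted_risk):
    """Order restaurant IDs from highest predicted risk to lowest."""
    return sorted(predicted_risk, key=predicted_risk.get, reverse=True)

def violations_found(order, had_violation, inspections):
    """Count violations uncovered by inspecting the first `inspections` entries."""
    return sum(had_violation[r] for r in order[:inspections])

# Hypothetical historical records: 1 means a violation was on file.
had_violation = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 1, "F": 0}

# Hypothetical model output: a risk score per restaurant.
predicted_risk = {"A": 0.9, "B": 0.2, "C": 0.7, "D": 0.25, "E": 0.6, "F": 0.1}

budget = 3  # assumed number of inspections available

baseline_order = ["A", "B", "C", "D", "E", "F"]  # arbitrary, e.g. alphabetical
model_order = rank_by_risk(predicted_risk)

baseline = violations_found(baseline_order, had_violation, budget)
targeted = violations_found(model_order, had_violation, budget)
uplift = (targeted - baseline) / baseline

print(f"baseline: {baseline}, targeted: {targeted}, uplift: {uplift:.0%}")
```

The inspection budget stays fixed in both runs, so any gain comes purely from where the inspectors are sent.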

If the skills shortage in data science continues – and demand shows no sign of slowing – then competitions could become an even more attractive alternative for getting big-data projects done. Even if the objective is just to predict the outcome of a basketball tournament. “Our prediction accuracy has always been in the 70 percent to 80 percent range,” said SAP’s Smith. “We weren’t surprised this year when there were so many upsets.”

Image: Flickr CC
