UPDATED 14:47 EDT / NOVEMBER 06 2012

Tresata Promises Post-Grad Course in Data Scientist, the In-Demand Job in Big Data

As Big Data becomes a major driver in business, a new position, the data scientist, is coming into demand, and it may be the next really hot job in DevOps. But, said Tresata founder and CEO Abhi Mehta in an interview in The Cube at the Strata + Hadoop World 2012 conference, data scientists are thin on the ground; no training, degree, or credentialing program exists for the title; and the industry has not agreed on what a data scientist is.

To help fill this huge gap, Mehta announced, Tresata's new chief scientist, Roy Lowrance, Ph.D., will lead an effort to create the first master's- and Ph.D.-level postgraduate program in data science.

Mehta said that Tresata is leading this effort in part to meet the needs of its clients. They are coming to him saying, “I love it, I love the solution. But I don’t have the people. I don’t have the people to write the new models for all this data. So sampling is dead, and we need to analyze all of the population. Who’ll write it for me?”

And just how big is that opportunity? Gartner recently estimated that Big Data is a potential $200 billion market. Mehta thinks that is off by several orders of magnitude.

“We are talking to the head of credit cards at one of the biggest global banks. He is looking at our solution for underwriting, and he goes, ‘If we do this right the payoff is not in millions, it’s not in billions, it’s in the trillions, because we can redo the banking industry off of it.’” And that is just one vertical market. Big Data will have similar impacts on retail, manufacturing, and healthcare, among others.

And while this is not primarily a technological opportunity, it clearly means that people with knowledge of Hadoop-based development, data science, and related areas will be in huge demand and limited supply for some years to come.

So what is a data scientist? There is little actual agreement on the specifics, but in general a data scientist is someone who can work with machine learning systems to extract valuable information from massive amounts of data of a wide variety of types, most of it unstructured, coming from a wide variety of sources.

Mehta said the data science area is divided into two groups, the “marketing quants” and the “data quants”. The first group does about 20% of the work but gets most of the attention, formulating questions that often have to do with marketing and using emerging Big Data analysis tools to derive answers that often have huge value in the market. The second gets much less notice but does 80% of the work, and it is where the opportunity lies. Data quants are “a sort of cyborg” who are “part physicist, part mechanical engineer, part statistician, and a total reservoir of common sense”. They work with highly automated machine learning systems because only computers can keep up with the huge volume of data involved, for instance, in researching the needs of more than 300 million consumers (the U.S. population) to help a company design a successful new product or service or fix an unsuccessful one. However, machines can only do so much, and the important work has to be done by the data quants, who are the true data scientists.

One of the challenges they face is that today the very software stack they will use is not fully understood. Big Data creates big changes in how research is done. Instead of creating a statistically valid sample of a large population (for instance, the entire consumer population of the United States or the subscriber population of a mobile phone carrier), Big Data allows researchers to gather detailed information on the entire population. And the amount and types of data have expanded greatly. Instead of just structured data such as financial information, researchers can include social data such as tweets, Facebook “likes” and status updates, and machine logs that can show, for instance, what Web sites individuals visit regularly. And of course all of this is running on Hadoop, which is a very different technology from an RDBMS.
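The difference between the two approaches can be illustrated with a minimal Python sketch. It is not drawn from Tresata's software; the subscriber spending figures are randomly generated, and the function names `sample_estimate` and `full_population_mean` are hypothetical. The first function estimates an average from a random sample, as traditional research would; the second computes it over every record.

```python
import random
import statistics

def sample_estimate(population, sample_size, seed=0):
    """Estimate the mean from a simple random sample (the traditional approach)."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    return statistics.mean(sample)

def full_population_mean(population):
    """Compute the mean over every record (the whole-population approach)."""
    return statistics.mean(population)

# Hypothetical data: monthly spend for 100,000 made-up subscribers.
rng = random.Random(42)
spend = [rng.uniform(10, 200) for _ in range(100_000)]

estimate = sample_estimate(spend, sample_size=1_000)
exact = full_population_mean(spend)
print(f"sample estimate: {estimate:.2f}, full population: {exact:.2f}")
```

The sample gives an approximation with some error; the full pass gives the exact figure but has to touch every record, which is where the data volume becomes the problem.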

The software stack for traditional data research is very well understood. The stack for Hadoop Big Data research is still a work in progress. One major question is whether statistical models are needed at all, and if so, what they will look like. Traditional models were in large part designed to make up for the limitations of RDBMS-based analysis, which was basically a top-down approach.

Mehta thinks models may not be needed at all. “I think the fundamental principle that bottom-up analysis wins always holds true,” he said. “The question we ask ourselves is this: If I were to build the next generation analytics company, what am I building it on? If it’s HDFS [Hadoop Distributed File System], which, I think, is the right answer, then do I need layers on top of HDFS to make a solution like mine more useful? The answer, Dave, is, I don’t.”

This means that the software stack will be much simpler and that complex queries will be constructed differently than they are today. But the huge volumes of data and the speed at which it comes through the firehose will mean that data management will be more complex. While marketing quants and managers will have little problem writing simple queries in SQL and running them through an engine that translates them for Hadoop, complex queries will require work directly in MapReduce. That is the realm of the data quant, who will both manage that data and develop and run the complex queries that will provide the information on which businesses will run.
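For readers unfamiliar with the MapReduce style, here is a rough sketch in plain Python. It does not use Hadoop or any Hadoop API, and the purchase records and function names are invented for illustration. It computes what a simple SQL query such as SELECT customer, SUM(amount) FROM purchases GROUP BY customer would return, broken into the map, shuffle, and reduce steps that a MapReduce job makes explicit.

```python
from collections import defaultdict

def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for customer, amount in records:
        yield customer, amount

def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical purchase records: (customer, amount).
purchases = [("ann", 20), ("bob", 5), ("ann", 7), ("cy", 12), ("bob", 3)]

totals = reduce_phase(shuffle_phase(map_phase(purchases)))
print(totals)
```

In a real cluster the map and reduce steps run in parallel across many machines and the framework handles the grouping; the sketch only shows the shape of the logic a data quant would have to write by hand.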

