UPDATED 16:28 EDT / FEBRUARY 28 2013


Data Scientists are Like Forrest Gump, Scrubbing Data with Toothbrushes

“I am Forrest Gump, I have a toothbrush, I have a lot of data and I scrub,” said Josh Wills, data scientist at Cloudera, whimsically describing his profession to John Furrier and Dave Vellante inside theCube, live from Strata 2013. He added that while he thinks of himself mostly as a mathematician, a data scientist is a lot like a data janitor.

John Furrier pointed out that data is now part of the developer community and asked which tools are best for scrubbing data. Wills explained that when it comes to developer tools, the conversation could be described as a “religious debate,” as the choice depends largely on each developer’s preference. Python, Aurora and SAS are all good scripting languages, he said, with the first two being his personal choice. “Some kind of scripting language” is a basic tool for a data scientist, but there is no generally adopted best tool.

Turning to unstructured data that requires coding, the need to analyze multiple data sets, and the available solutions, Wills expressed a preference for in-memory tools such as Spark and SAS, which he said provide a great way of exploring data. As for sampling data sets, he stated that larger samples are preferable to smaller ones, especially when preparing data sets for other people to analyze.

Furrier asked about existing collaborative tools for data science and how they support teamwork, whether through the cloud or other vehicles. While such tools would be a great idea, Wills pointed out that nothing worth mentioning exists yet. He explained that although an inter-office, global collaboration solution is out of the question at this point, a lightweight tool allowing people in the same office to collaborate would be very useful. A tool that lets data scientists in one location share data analysis and data set preparation would be a great starting point.

One of the defining qualities of a data scientist is being relentless, Wills said. “If the tool does not answer my question, I google another tool.” A question without an answer is unacceptable to a data scientist.

Describing the Cloudera projects he is excited about, Wills said he is currently working on simplifying data science and making it easy to use, so that machine level techniques become available to a general audience and a programmer or a statistician could readily use data science in their daily activities.


