As technologists, we should consider ourselves very fortunate to be witnessing the rediscovery of the Internet and its behaviors. It’s the “Era of Data.”
The Internet and its users are evolving. The best way for companies and individuals to fully understand and respond to these changes is to carefully observe behavior and analyze data.
From application performance forecasting to marketing insights, the data we have serves one purpose: Improving the customer experience.
All of this data aggregation, statistical analysis, and data warehousing gave birth to a new breed of technologist known today as “The Data Scientist.”
What is data science and what does a data scientist do?
Not unlike Web 2.0 and Ajax, the term data science is slowly becoming a victim of industry hype. Trying to understand what a data scientist does has become increasingly confusing. What are the main tasks? What do they do on a day-to-day basis?
You might often hear the term data anthropologist instead of data scientist. That’s because the tasks of an anthropologist can be similar to those of a data scientist.
As a data scientist, my daily tasks include:
- Review the data we have collected.
- Identify which data we don’t have, and where we can get it.
- Work on storing that data and, more importantly, on retrieving it in an efficient and scalable manner.
- Automate the data collection.
- Explore the data (this part is very important).
- Talk to people in the organization to identify relevant questions.
- Prepare the data.
- Clean the data.
- Clean it more.
- Perform a split-apply-combine strategy. If you are familiar with MapReduce, the strategy is very similar: the split and apply steps correspond to the map phase, and the combine step corresponds to the reduce phase.
- Try merging various sources of information, and retrieve patterns.
- Formulate hypotheses, then accept or reject them based on reviewing the data.
- Present my findings to a wider audience. In my case, that can be our support team, engineering team, executives and, more importantly, our customers.
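The split-apply-combine step in the list above can be sketched in a few lines of plain Python. This is only an illustration under assumed data: the `records` list, the application names, and the helper functions are hypothetical and not taken from our actual pipeline.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (application name, response time in milliseconds).
records = [
    ("blog", 120),
    ("shop", 340),
    ("blog", 80),
    ("shop", 260),
    ("api", 45),
]


def split(rows):
    """Split: group the response times by application name."""
    groups = defaultdict(list)
    for app, response_ms in rows:
        groups[app].append(response_ms)
    return groups


def apply_combine(groups, func):
    """Apply func to each group, then combine the results into one dict."""
    return {key: func(values) for key, values in groups.items()}


def split_apply_combine(rows, func):
    """Run the full strategy: split the rows, apply func, combine the results."""
    return apply_combine(split(rows), func)


# Mean response time per application.
mean_response = split_apply_combine(records, mean)
print(mean_response)
```

In MapReduce terms, `split` plus the per-group call to `func` plays the role of the map phase, and building the final dictionary plays the role of the reduce phase. In R, the plyr package offers the same pattern out of the box.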
From that list, you might notice that a few skills and technologies are required. A well-rounded data scientist should have a strong foundation in data analysis, statistics, and application development, and be obsessively curious. In addition, basic economics and general business knowledge will help you convey your findings and potential decisions to a broader range of decision makers.
Technologies To Get Started
The technologies we use are as diverse as the skills required to be a data scientist. We use the following technologies:
- Hadoop: Hadoop has become the leader in all “big data” related discussions in the industry. It is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. There are various projects that we use from the Hadoop family: ZooKeeper, Pig, Hive, HBase and Mahout.
- Mahout: Mahout allows us to leverage the power of MapReduce with Hadoop to perform machine learning and data mining tasks. It has powerful clustering and classification core algorithms.
- NLTK: The Natural Language ToolKit for Python. NLTK is a leading platform for building Python programs to work with human language data.
- NumPy: NumPy is the fundamental package for scientific computing with Python. It provides an N-dimensional array object along with very useful capabilities for linear algebra, Fourier transforms, and much more.
- Python: Python is clean and easy to learn. It has a strong academic and scientific community, and therefore has popular scientific processing packages.
- R: R is the lingua franca of today’s statisticians. It’s a platform for statistical computing.
- SciPy: SciPy is open-source software for mathematics, science, and engineering. It provides efficient and user-friendly numerical routines and is designed to work with NumPy data-structures, namely array objects.
- Various R Packages: We use a large number of R packages when performing data analysis and visualization, including plyr, reshape, ggplot2, RCurl, RJSONIO, Hmisc, devtools, lattice, lubridate, forecast, quantmod, and PerformanceAnalytics. We also use RStudio, which is an IDE for R rather than a package.
Of course, there is a lot more technology behind it all, but this covers the general outline of our data science initiative.
Reality of Data Science
Even though being a data scientist has been proclaimed one of the sexiest jobs of the 21st century, there are a few unspoken facts about the job:
- There’s nothing glamorous about it. You are not a rockstar, but a janitor. You will clean a lot of data.
- There is a lot of mathematics involved. Get over it.
- Some people will be offended by your findings. People are volatile, especially when what the data says differs from their preconceived beliefs.
- If you are awesome at technology and math but can’t communicate, being a data scientist is probably not for you.
- All the business lingo you hear in the economic news and all the crazy, hyped buzzwords people use might very well become part of your vocabulary, as you may have to use them to convey your findings.
- Some days will be long, and you won’t find a thing. Get more coffee.
- You may read that classical statistics is outdated for today’s needs. That may be true, but classical statistics provides a strong basis and gets you well versed in the cryptic world and vocabulary of the statistician.
- Not everything is about Big Data. In fact, learning how to sample is a key skill you’ll need. Done properly, sampling has a lot of benefits, and a sample can be very representative of your actual dataset.
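To illustrate the point about sampling, here is a minimal sketch of a simple random sample drawn with Python's standard library. The dataset is synthetic and entirely hypothetical (normally distributed "response times"), and the sample size of 1,000 is an arbitrary choice. Real data is rarely this well behaved, so treat it as an illustration rather than a recipe.

```python
import random
from statistics import fmean


def simple_random_sample(population, k, seed=None):
    """Draw k items without replacement; every item is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)


# Hypothetical dataset: 100,000 synthetic response times in milliseconds.
generator = random.Random(42)
population = [generator.gauss(200, 50) for _ in range(100_000)]

# Look at only 1% of the data.
sample = simple_random_sample(population, 1_000, seed=7)

population_mean = fmean(population)
sample_mean = fmean(sample)

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

With a properly random draw, the sample mean should land close to the mean of the full dataset while touching only a fraction of the rows. If the data has important subgroups of very different sizes, a stratified sample may be a better fit than the simple random sample shown here.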
The most important tip of all:
Whenever you read about data science or data analysis, it’s about the ability to store petabytes of data, retrieve that data in nanoseconds, turn it into a rainbow with a unicorn dancing on it, and so on. There is one aspect that is consistently missing in any article or blog post: When you have data, any amount of data, how do you identify which questions are relevant to you?
I once heard an economist make the following statement about the field of economics:
Economics is a field with all the answers and none of the questions.
That quote applies surprisingly well to the field of data science. If there is one important tip, it’s this:
Spend time finding the right questions. Answers are easy to find once you have the question.
About the Author
David Coallier is a data scientist at Engine Yard, the leading Platform as a Service. David is a startup advisor, avid entrepreneur, and regularly speaks at industry conferences. His main field of expertise is evolutionary game theory, and he is a passionate contributor to PHP open source projects.