It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype cycle, even the American Statistical Association was forced to admit that data science is a real thing.
Gartner hype cycle (Wikipedia).
Given we agree data science exists, who is allowed to call themselves a data scientist?
There is a school of thought that you cannot call yourself a data scientist unless you have mastered all of the following:
- Statistical learning theory
- High dimensional geometry
- Optimization theory
- Petabyte scale operations
- Advanced programming
- Combinatorics and algebra
- Theoretical computer science
- Measure theory
- All of statistics
- Distributed system design
Many of these are topics covered in works such as Foundations of Data Science (John Hopcroft, Ravindran Kannan) and Mining of Massive Datasets (Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman).
These are topics I know, and many of these authors are personal heroes:
- John Hopcroft: One of the founders of modern design and analysis of algorithms. Coauthor of Introduction to Automata Theory, Languages, and Computation.
- Ravindran Kannan: My advisor! Definitely brilliant.
- Anand Rajaraman: CEO I had the honor of working for at Kosmix.com, one of the inventors of Mechanical Turk, also brilliant.
- Jeffrey David Ullman: One of the founders of modern design and analysis of algorithms. Coauthor of Introduction to Automata Theory, Languages, and Computation.
The theory is: only the unicorn who knows all of the above is to be allowed to call themselves a data scientist.
“a field that uses results from statistics, machine learning, and computer science to create predictive models.” (Practical Data Science with R, “about this book”, page xix.)
And here is why: outside of academia and some major labs, the task of data science is essentially looking at client data and building useful predictive models.
This is good news. Statisticians know that prediction is fundamentally easier than inference (as prediction dodges many issues of causality). And most real-world business clients have data at what we call “SQL scale” (it fits in a nice database that can quickly run complicated SQL aggregations, without requiring a petabyte infrastructure). Clients tend to need automated decision procedures yielding high ROI (return on investment) to free up analysts for new problems.
And that brings us to the point of this essay. Because all of the analyst jobs have been reclassified as “data science” jobs, we have to allow analysts to call themselves “data scientists”.