A bit about our upcoming book “Practical Data Science with R”. Nina and I share our current draft of the book’s front matter, a description that will help you decide if this is the book for you (we hope that it is). Or this could be the book that helps explain what you do to others.
What is Data Science?
The statistician William S. Cleveland defined data science as an interdisciplinary field larger than statistics itself. We define data science as the art of transforming hypotheses and data into actionable predictions. For example, we can use models and data to predict who will win an election, what products will sell well together, which loans will default, or which advertisements will be clicked on.
Data science draws on tools from the empirical sciences, statistics, reporting, analytics, visualization, business intelligence, expert systems, machine learning, databases, data warehousing, data mining, and big data. It is because we have so many tools that we need a discipline that covers them all. What distinguishes data science itself from the tools and techniques is the central goal of deploying effective decision-making models to a production environment.
Data science is often a “second calling.” Many of the best data scientists we meet started as programmers, statisticians, business intelligence analysts or scientists. By adding a few more techniques to their repertoire they became excellent data scientists. That observation drives this book: we will introduce the practical skills needed by the data scientist by concretely working through all of the common project steps on real data. Some steps you will know better than we do, some you will pick up quickly, and some you may need to research further.
Much of the theoretical basis of data science comes from statistics. However, data science as we know it is very much influenced by technology and software engineering methodologies, and has largely evolved in groups heavily driven by computer science and information technology. We can call out some of the engineering flavor of data science by listing some famous examples:
Amazon’s product recommendation systems.
Google’s advertisement valuation systems.
Linkedin’s contact recommendation system.
Twitter’s trending topics.
Walmart’s consumer demand projection systems.
These systems share a lot of features:
All of these systems are built off large data sets. That is not to say they are all in the realm of “big data.” But none of them could have been accomplished from only small data sets. To manage the data, these systems require concepts from computer science: database theory, parallel programming theory, streaming data techniques, and data warehousing.
Most of these systems are online or live. Rather than producing a single report or analysis, the data science team deploys a decision procedure or scoring procedure to either directly make decisions or directly show results to a large number of end users. The production deployment is the last chance to get things right, as the data scientist cannot always be around to explain defects.
All of these systems are allowed to make mistakes at some non-negotiable rate.
None of these systems are concerned with cause. They are successful when they find useful correlations and are not held to correctly sorting cause from effect.
This book will teach the principles and tools needed to build systems like these. We want to teach the common tasks, steps, and tools used to successfully deliver such projects. Our emphasis is on the whole process: project management, working with others, and presenting results to non-specialists.
Why this book?
This is the book for you if you want to work as a data scientist, or already do. This book will demonstrate the tools, habits and interactions of successful data scientists and data science projects. We have learned a lot from the many different people and fields that work in learning from data, and here we distill and share the best practices. Some chapters are elementary and some are advanced, but all chapters contain things we wish we had known a lot earlier.
This is the book we wish we had available to hand out to clients and peers. Its purpose is to explain the best parts of statistics, computer science and machine learning that are relevant to data science. Most data scientists have arrived recently from some other field, so can still benefit from being reminded of some of the best tools from the many fields that contribute to data science. A software engineer who works as a data scientist will likely benefit from seeing a bit of explanation about statistical testing procedures and machine learning procedures. A statistician may be unfamiliar with the software engineering techniques of version control and agile project management, and with how much these things can increase the chance of success in a project.
Throughout this book we are going to emphasize scientific principles such as repeatability of experiments. We will also emphasize software engineering principles such as automation of steps. We see scientific principles and software engineering principles as being co-equal ways to think about data science projects. You automate steps because you will have to repeat them and you can repeat steps because of your version control and automation.
We don’t want to invent techniques in this book, but to explain the best standard techniques. A real victory for this book would be for an experienced data scientist to say “I always knew to split my data into test and training sets, but I had no idea how many things that was protecting me from!” Or, perhaps, for a software engineer who is starting to work as a data scientist to invent a new tool to automate test and train splits.
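To make the test/train split mentioned above concrete, here is a minimal R sketch. The data frame, the 10% holdout fraction, and the variable names are our own illustrative assumptions, not prescriptions from the book:

```r
# Hypothetical example: a random test/train split in R.
# The data and the 10% holdout fraction are assumptions for illustration.
set.seed(2014)                       # fix the seed so the split is repeatable
d <- data.frame(x = rnorm(100), y = rnorm(100) > 0)
d$isTest <- runif(nrow(d)) < 0.1     # mark roughly 10% of rows as test
dTrain <- subset(d, !isTest)         # fit models only on the training rows
dTest  <- subset(d, isTest)          # evaluate only on the held-out rows
```

Note the `set.seed()` call: it ties the split into the repeatability principle, since rerunning the script reproduces exactly the same partition.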
Throughout we are going to write about concepts (both statistics and machine learning), include concrete code, and explore partnering with and presenting to non-specialists. We hope that when you don’t find one of these topics novel, we are able to share a wrinkle on one or two of the other topics that you may not have thought about recently. We encourage you to try the example R code as you read the text; even when we are discussing fairly abstract aspects of data science we will illustrate examples with concrete data and code. We are arranging topics in this book in an order that we feel increases understanding. This order may not always be the order of the tasks in sequence.
What is in this book?
We will explain the data science process itself. The data scientist must have the ability to measure and track their own project.
Together we will apply many of the most powerful statistical and machine learning techniques used in data science projects.
We will describe the model life-cycle including putting models into production and tracking versions.
We will show how to prepare presentations for the various stakeholders: management, users, deployment team and so on. You can’t get away with just throwing data science project results over the fence.
What is not in this book?
This book is not an R manual. We are going to use R to concretely demonstrate the important steps of data science projects. We will teach enough R for you to work through the examples, but a reader unfamiliar with R will want to refer to the appendix as well as to the many excellent R books and tutorials already available.
This book is not a set of case studies. We are going to emphasize methodology and technique. Example data and code are given only to make sure we are giving concrete, usable advice.
This book is not a “big data” book. We feel most significant data science occurs at a scale manageable in a database or in files. Valuable data that maps measured conditions to dependent outcomes tends to be expensive to produce, and that tends to bound its size. For some report generation, data mining and natural language processing you will have to move into the big data regime.
This is not a theoretical book. We are not going to emphasize the fully rigorous theory of any one technique. The goal of data science is to be flexible, to have a number of good techniques at hand, and to be willing to research a technique more deeply if it appears to apply to the problem at hand.
Who are the authors?
The first author, Nina Zumel, has worked as a scientist at one of the largest independent, nonprofit research institutes. She has worked as chief scientist of a price optimization company and founded a contract research company. Nina Zumel is now a principal consultant at Win-Vector LLC. She can be reached at
The second author, John Mount, has worked as a computational scientist in biotechnology and as a stock trading algorithm designer, and has managed a research team for a major online shopping site. John Mount is now a principal consultant at Win-Vector LLC. He can be reached at