Setting expectations in data science projects
How is it even possible to set expectations and launch data science projects?
Data science projects vary from “executive dashboards” through “automate what my analysts are already doing well” to “here is some data, we would like some magic.” That is, you may be called on to produce visualizations, analytics, data mining, statistics, machine learning, method research or method invention. Given the wide range of wants, diverse data sources, required levels of innovation and methods, it often feels like you cannot even set goals for data science projects.
Many of these projects either fail or become open ended (become unmanageable).
As an alternative we describe some of our methods for setting quantifiable goals and front-loading risk in data science projects.
The typical situation
Data science projects are often considered untrackable because either there is “some magic” expected, there is no prior bound on how dirty the incoming data is or there is no prior definition of what a good result would look like. An example might be “invent a method to use website visit data to predict who is likely to purchase from us.”
When magic is expected you are really talking about invention and not a data science deployment. The research that more naturally fits into a data science project is both: learning the nature of the domain, problem and data; and traditional literature research (are there known methods that help with our situation?). You can schedule intervals of invention spikes into a larger project; but you can not really specify outcomes of these spikes (“magic method solves problem X by February 7th”). So these should be seen as tasks in a larger project that may or may not help. The entire project should not fully depend on them. The project must be able to succeed even when the invention spikes fail.
Additional red flags include lack of description of the input data and no concrete definition of the outcome we are trying to predict (Purchase ever? Purchase in the next month? Spend at least $100?).
Often the first fix to the project ask happens as: “we better at least quantify expected performance: let’s insist on an accuracy of 95%.” This often happens in a business meeting late in the project launch when it is noticed that what is likely a large and important project has absolutely no acceptance criteria. Unfortunately this “bar” is often set without any research into whether accuracy is even the correct measure, whether 95% is easy or hard, or even whether the enterprise will be profitable at this accuracy. The intent is good, but the arbitrary goal (that nobody will really be held to) is a step backwards.
What we need to do is: schedule dedicated time to learn about the domain and data before writing project goals and scope. This itself can be part of a small concrete expectation setting project. To complete the expectation setting project we need reusable methods to set useful, realistic goals that really measure if a data science project is on track (i.e. that a data science project can be held to). We outline a few methods to generate prior estimates for two of the important data science project measures: model performance and business utility.
Minimal components to safely ensure success
At a minimum a project must have an observable, quantifiable measure of success. So it makes sense to work on setting this expectation first. That does not mean the success criterion needs to be set in stone, as this is often not possible. Instead it means you often have to commission an initial research project to quantify what sort of outcome is even possible and whether such an outcome would make sense for the business. This unknown result determines if the project even has a chance to succeed, so it makes sense to try to eliminate the hidden project risk it represents by determining success criteria as a separate project. The overall modeling project often should not even be commissioned until the expectation setting project is complete. The result may indicate no further data science work is appropriate until features are added to engineering systems or the business. But this is good: nobody wants to start doomed projects; they instead want to know what changes to implement to allow a successful project to be launched later.
You can in fact run data science projects as you would run any development project (all projects have risks and unknowns- so these problems are not in fact unique to data science projects). It is just that you can not, unfortunately, run data science projects in parallel with developing initial measurement and feedback systems. This is one case where starting a bridge from two shores to meet in the middle does not decrease project time. Specific measurement, control and feedback in a data science project requires running a few cars across the bridge (but won’t require all lanes be ready at the start).
Methods for expectation setting
The expectation setting part of a data science project is to estimate how well a very good model would perform without paying the time and cost of producing the model. This may seem impossible, but there are methods that estimate performance and utility with moderate effort. To show the flavor of this idea we list a few methods to estimate performance and a few methods to estimate utility.
Methods to prior estimate model performance
You need to know what prediction performance to commit to. Some ways to prior estimate this are given here:
- Current performance
Data scientists don’t work in a vacuum. Usually we are trying to build a model that will be used to improve a business process. This implies there is already some process in place. It could be something as simple as “offer all return visitors a recommendation” or even a hand tuned set of business rules. Do not be too proud, too polite or too rushed to measure the current system’s performance as if it were a classifier. For example, if the current policy is “offer all return visitors a recommendation,” measure what percentage of them buy (getting at precision) and what percentage of first time visitors buy (getting at recall).
If the current system looks like a very high performance classifier (near perfect precision and recall) then you are not going to be able to usefully improve on it (so nobody should want the task of improving on it). You may want the task of automating it (if parts of it are human driven)- but you now have guidance where to go for rules and training data. If the current system looks like a low performance classifier then you should not agree to develop a very high performance replacement as your immediate project. If the business is running with 50% accuracy, then it is plausible the business will run better with 70% accuracy and it does not make sense to propose hitting 95% accuracy as a first project (such a project may be unrealistic or may just take longer).
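As a minimal sketch of scoring an existing policy as if it were a classifier: the function below assumes you can reconstruct, for each visitor, whether the current rule made an offer and whether the visitor went on to buy. The names `offered` and `purchased` and the toy log are hypothetical, not taken from any particular system.

```python
def precision_recall(offered, purchased):
    """Score a business rule as a classifier.

    offered[i]   -- True if the current policy made an offer to visitor i
    purchased[i] -- True if visitor i went on to buy
    Returns (precision, recall); a value is NaN when its denominator is zero.
    """
    pairs = list(zip(offered, purchased))
    true_pos = sum(1 for o, p in pairs if o and p)
    false_pos = sum(1 for o, p in pairs if o and not p)
    false_neg = sum(1 for o, p in pairs if not o and p)
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else float("nan")
    recall = true_pos / actual_pos if actual_pos else float("nan")
    return precision, recall


# Hypothetical toy log: four return visitors (offered) and four
# first time visitors (not offered).
offered = [True, True, True, True, False, False, False, False]
purchased = [True, False, False, True, True, False, False, False]

precision, recall = precision_recall(offered, purchased)
print(f"current policy: precision={precision:.2f}, recall={recall:.2f}")
```

Numbers like these give the baseline that any proposed model target should be stated against.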
- 3 by 5 card estimation
This is an especially useful technique in trying to automate high quality human judgements (often done so they can be applied at a larger scale or higher speed). Get some time with the experts whose work you are trying to extend. If you can’t get the time for the initial project, then you already know any larger project would be doomed, so this itself is a good up-front test. Then ask them to do their job on paper. For example: suppose the task is to send a coupon to somebody likely to use it; prepare by sending a lot of coupons at random and recording who used the coupons (again this is a good gatekeeper or risk that should be front-loaded; if an organization is not ready for quick measurements and A/B tests, adding these capabilities is far more important than any model construction). Have the experts pick (using all information available) who should have gotten coupons. Ask what information they used. Use your known hidden outcomes to evaluate performance. Prepare 3 by 5 cards with only the chosen information and see if they can indeed predict at their historic rate with the limited information. If they can do it, a model may be able to do it (and the less you need to put into the model, the less engineering is needed). If they cannot do it, a model may not be able to do it either.
- Bayes error rate estimation
One property shared by most models is that they report the same prediction given two identical examples. This “same input produces same output” observation puts an upper bound on classifier performance, and the error rate at that bound is called the Bayes error rate. Even a perfect model cannot outperform the Bayes error rate. You can design a cross-validation study on your training data to estimate the Bayes error rate without building a real deployable model. Generate pairs of training examples with what you consider to be identical or nearly identical input patterns. You then see how good the known outcome from one example is at predicting the known outcome of the other. This is an instance of a permutation test or resampling simulation. We do not have to pair all of the training data, just find nearest neighbors for an appropriate random sample to estimate training data variation (with respect to the whole data set, this can be done with a single table scan).
An actual implementation of this estimate as a final method would be a nearest neighbor model, which may or may not be advisable for actual implementation. Two of the downsides of nearest neighbor algorithms are their computational cost and poor generalization. The computational cost is proportional to the number of queries times the training set size (which is okay for scoring an example sample but unacceptable in production, meaning you often must bring in some sort of sophisticated dimension reduction technique). Poor generalization (or over fitting) is why nearest neighbor algorithms tend not to perform as well on truly novel data as they do during naive cross validation.
Think of the Bayes error rate estimate as an upper bound on performance: you probably won’t do better than it, and you likely will (intentionally) do worse (as you trade performance on historic data for better performance on new data and efficiency). If your initial Bayes error rate is poor, then no amount of modeling will help: you need more features and feature engineering (a different sort of project).
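The pairing study described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive estimator: it assumes purely numeric features, uses squared Euclidean distance as the notion of “nearly identical,” and reports how often a sampled example disagrees with its nearest neighbor. The function name and toy data are hypothetical, and the resulting rate is only a rough proxy for irreducible error.

```python
import random


def nearest_neighbor_disagreement(rows, labels, sample_size, seed=0):
    """Fraction of sampled examples whose nearest other example
    (squared Euclidean distance) carries a different known outcome.

    rows   -- list of equal-length numeric tuples (assumed already scaled)
    labels -- known outcomes, one per row
    Requires at least two rows and sample_size >= 1.
    """
    rng = random.Random(seed)
    n = len(rows)
    sample = rng.sample(range(n), min(sample_size, n))
    disagreements = 0
    for i in sample:
        best_j, best_dist = None, float("inf")
        # One scan over the whole data set per sampled example.
        for j in range(n):
            if j == i:
                continue
            dist = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
            if dist < best_dist:
                best_j, best_dist = j, dist
        if labels[best_j] != labels[i]:
            disagreements += 1
    return disagreements / len(sample)


# Hypothetical toy data: three tight pairs; the last pair has mixed outcomes.
rows = [(0.0,), (0.1,), (1.0,), (1.1,), (2.0,), (2.1,)]
labels = [0, 0, 1, 1, 0, 1]

rate = nearest_neighbor_disagreement(rows, labels, sample_size=6)
print(f"nearest-neighbor disagreement rate: {rate:.3f}")
```

A high disagreement rate among near-identical inputs suggests the available features do not separate outcomes well, which points toward feature work rather than more modeling.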
Methods to prior estimate business utility
What the business really needs to know is whether the promised increase in classifier performance leads to a desired increase in business (customers, revenue or anything, as long as it is specific and measurable). From your performance calibration exercises you should have a reasonable target classification or modeling improvement in mind. You want to quantify the expected business impact of the proposed amount of classification improvement. This is where we are converting from statistical significance (is the math on our side?) to clinical significance (will it drive an appreciable change in outcome?).
- Retrospective simulation
From the historic business data choose uniformly at random a sample of interactions that meet your target classifier performance. That is, if you think your achievable goal is a 10% increase in precision and a 2% decrease in recall (often you want to or are forced to trade precision and recall): generate a sample of customers from historic offers that meet this pattern. This doesn’t require a model, just access to historic data. Then measure the change in revenue if these had been your entire customer set versus an unbiased sample of the same size.
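One way to sketch this retrospective sampling, under assumptions: each historic record is taken to be a dictionary with hypothetical `purchased` and `revenue` fields, and the target precision and recall are absolute levels you supply (for example, the current levels adjusted by the changes you think are achievable). Buyers and non-buyers are each drawn uniformly at random so that the sample matches those targets.

```python
import random


def retrospective_sample(records, target_precision, target_recall, seed=0):
    """Draw a sample of historic records that looks like the output of a
    classifier with the given precision and recall, plus an unbiased
    sample of the same size for comparison.
    """
    rng = random.Random(seed)
    buyers = [r for r in records if r["purchased"]]
    others = [r for r in records if not r["purchased"]]
    # Recall fixes how many buyers are selected; precision then fixes
    # the total number of records selected.
    n_buyers = round(target_recall * len(buyers))
    n_total = round(n_buyers / target_precision)
    n_others = n_total - n_buyers
    if n_others > len(others):
        raise ValueError("not enough non-buyers to reach the target precision")
    targeted = rng.sample(buyers, n_buyers) + rng.sample(others, n_others)
    baseline = rng.sample(records, len(targeted))
    return targeted, baseline


def total_revenue(sample):
    return sum(r["revenue"] for r in sample)


# Hypothetical history: 20 buyers worth 50.0 each and 80 non-buyers.
records = ([{"purchased": True, "revenue": 50.0} for _ in range(20)]
           + [{"purchased": False, "revenue": 0.0} for _ in range(80)])

targeted, baseline = retrospective_sample(
    records, target_precision=0.4, target_recall=0.5)
print("targeted revenue:", total_revenue(targeted))
print("unbiased revenue:", total_revenue(baseline))
```

The difference between the two totals is the quantity to bring to the business discussion; repeating the draw with several seeds gives a feel for its variability.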
- Secant line method
This can be a prospective or active study. If you feel you can build a model which increases precision by 10% then get permission to degrade the running site precision by 10% (for example: make 10% more offers at random to degrade the current model or procedure). If the degradation has no effect then probably the improvement will have no effect. You want to run this study on a small sub-population as it confirms utility by losing money. The idea is: it is easier to break things than improve them and the rate of change in one direction isn’t a bad approximation of the rate of change in the opposite direction.
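The extrapolation behind this method is simple arithmetic, sketched below. The assumption (and it is only an assumption) is that the business metric responds about as steeply to a small gain in precision as it does to a small loss. The function name, the metric and the numbers are all hypothetical.

```python
def secant_estimate(baseline_metric, degraded_metric,
                    precision_drop, planned_gain):
    """Estimate the lift from a planned precision gain using the slope
    observed when precision was deliberately degraded.

    baseline_metric -- business metric (e.g. revenue per visitor) as-is
    degraded_metric -- same metric on the degraded sub-population
    precision_drop  -- size of the deliberate degradation (e.g. 0.10)
    planned_gain    -- size of the improvement you hope to build
    """
    slope = (baseline_metric - degraded_metric) / precision_drop
    return slope * planned_gain


# Hypothetical experiment: revenue per visitor fell from 2.00 to 1.80
# when precision was degraded by 0.10 on a small sub-population.
expected_lift = secant_estimate(
    baseline_metric=2.00, degraded_metric=1.80,
    precision_drop=0.10, planned_gain=0.10)
print(f"estimated lift per visitor: {expected_lift:.2f}")
```

If the estimated lift is near zero, that is evidence the proposed improvement is unlikely to matter to the business.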
- Wizard of Oz method
Deploy a simulation of a better model by more expensive means (with the intention of doing the engineering work to get a reasonable implementation if the experiment yields good results). If your goal is to build a machine that “produces color combinations for you as good as a designer,” simulate the effect of success by paying a designer to work with a small subset of your customers. If the designer doesn’t improve revenue and/or customer satisfaction, then even an algorithm as smart as the designer will also fail. It is important that nobody confuses this experiment with a sustainable method that will be left in place.
Review of Purpose
In all cases you are trying to front-load both specificity and risk. You are trying to fill in unknowns (what is our current performance? what is our sensitivity? what do we need?) and front-load risk (do we have the data? does the data even differ between good and bad prospects?). The results of a project like this can serve both as a gatekeeper to and a source of specification for a follow-up data science research and implementation project. The first project itself should have a specific description like: “be confident in the estimates of the following measures of possible model quality, data availability and probable business impact by this date.” After that, those values can be used as part of a project scoping exercise for an actual data science implementation.