Given the range of wants, diverse data sources, required innovation and methods it often feels like data science projects are immune to planning, scoping and tracking. Without a system to break a data science project into smaller observable components you greatly increase your risk of failure. As a followup to the statistical ideas we shared in setting expectations in data science projects we share a few project planning ideas from software engineering. Continue reading
How is it even possible to set expectations and launch data science projects?
Data science projects vary from “executive dashboards” through “automate what my analysts are already doing well” to “here is some data, we would like some magic.” That is you may be called to produce visualizations, analytics, data mining, statistics, machine learning, method research or method invention. Given the wide range of wants, diverse data sources, required levels of innovation and methods it often feels like you can not even set goals for data science projects.
Many of these projects either fail or become open ended (become unmanageable).
As an alternative we describe some of our methods for setting quantifiable goals and front-loading risk in data science projects. Continue reading
We discuss a “medium scale data” technique that we call “SQL Screwdriver.”
Previously we discussed some of the issues of large scale data analytics. A lot of the work done at the MapReduce scale is necessarily limited to mere aggregation and report generation. But what of medium scale? That is data too large to perform all steps in your favorite tool (R, Excel or something else) but small enough that you are expected to produce sophisticated models, decisions and analysis. At this scale, if properly prepared, you don’t need large scale tools and their limitations. With extra preparation you can continue to use your preferred tools. We call this the realm of medium scale data and discuss a preparation tool style we call “screwdriver” (as opposed to larger hammers).
We stand the “no SQL” movement on its head and discuss the beneficial use of SQL without a server (as opposed to their vision of a key-value store without SQL). Database servers can be a nuisance- but that is not enough reason to give up the power of relational query languages.