One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your data.
There is no final “right shape.” In fact even your data is never right. You will always be called to re-do your analysis (new variables, new data, corrections) so you should always understand you are on your “penultimate analysis” (always one more to come). This is why we insist on using general methods and scripted techniques, as these methods are much much easier to reliably reapply on new data than GUI/WYSWYG techniques.
In this article we will work a small example and call out some R tools that make reshaping your data much easier. The idea is to think in terms of “relational algebra” (like SQL) and transform your data towards your tools (and not to attempt to adapt your tools towards the data in an ad-hoc manner). Continue reading Your Data is Never the Right Shape
A “for fun” piece, reposted from mzlabs.com.
I would like to comment on Duncan Jones’ movie “Moon” and compare some elements of “Moon” to earlier science fiction. Continue reading Gerty, a character in Duncan Jones’ “Moon.”
With the well deserved popularity of A/B testing computer scientists are finally becoming practicing statisticians. One part of experiment design that has always been particularly hard to teach is how to pick the size of your sample. The two points that are hard to communicate are that:
- The required sample size is essentially independent of the total population size.
- The required sample size depends strongly on the strength of the effect you are trying to measure.
These things are only hard to explain because the literature is overly technical (too many buzzwords and too many irrelevant concerns) and these misapprehensions can’t be relieved unless you spend some time addressing the legitimate underlying concerns they are standing in for. As usual explanation requires common ground (moving to shared assumptions) not mere technical bullying.
We will try to work through these assumptions and then discuss proper sample size. Continue reading What is a large enough random sample?
Our friends at Dataspora have a nice article on the more modern Map Reduce languages. A very good read and clearly a lot of thought went into preparing it. Continue reading Brevity is a Virtue
Stop and think: which of our tools are making us smarter and which of our tools are making us dumber. In my opinion tools and habits that support complexity literally train us to be dumber. Continue reading Do your tools support production or complexity?
The Win-Vector blog is experiencing a bit of a slow-down. All of our staff are very busy helping clients right now and we need to take a couple of extra weeks to get our next article out.
This is a tutorial on how to try out a new package in R. The summary is: expect errors, search out errors and don’t start with the built in examples or real data.
Suppose you want to try out a novel statistical technique? A good fraction of the time R is your best bet for a first trial. Take as an example general additive models (“Generalized Additive Models,” Trevor J Hastie, Robert Tibshirani, Statistical Science (1986) vol. 1 (3) pp. 297-318); R has a package named “gam” written by Trevor Hastie himself. But, like most R packages, trying the package from the supplied documentation brings in unfamiliar data and concerns. It is best to start small and quickly test if the package itself is suitable to your needs. We give a quick outline of how to learn such a package and quickly find out if the package is for you.
Continue reading The cranky guide to trying R packages
We discuss a “medium scale data” technique that we call “SQL Screwdriver.”
Previously we discussed some of the issues of large scale data analytics. A lot of the work done at the MapReduce scale is necessarily limited to mere aggregation and report generation. But what of medium scale? That is data too large to perform all steps in your favorite tool (R, Excel or something else) but small enough that you are expected to produce sophisticated models, decisions and analysis. At this scale, if properly prepared, you don’t need large scale tools and their limitations. With extra preparation you can continue to use your preferred tools. We call this the realm of medium scale data and discuss a preparation tool style we call “screwdriver” (as opposed to larger hammers).
We stand the “no SQL” movement on its head and discuss the beneficial use of SQL without a server (as opposed to their vision of a key-value store without SQL). Database servers can be a nuisance- but that is not enough reason to give up the power of relational query languages.
Continue reading SQL Screwdriver
A reason to care about the cloud: your credit card is now a supercomputer. Continue reading Your credit card as the supercomputer