One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your data.
There is no final “right shape.” In fact even your data is never right. You will always be called to re-do your analysis (new variables, new data, corrections) so you should always understand you are on your “penultimate analysis” (always one more to come). This is why we insist on using general methods and scripted techniques, as these methods are much much easier to reliably reapply on new data than GUI/WYSWYG techniques.
In this article we will work a small example and call out some R tools that make reshaping your data much easier. The idea is to think in terms of “relational algebra” (like SQL) and transform your data towards your tools (and not to attempt to adapt your tools towards the data in an ad-hoc manner). Continue reading Your Data is Never the Right Shape
This is a tutorial on how to try out a new package in R. The summary is: expect errors, search out errors and don’t start with the built in examples or real data.
Suppose you want to try out a novel statistical technique? A good fraction of the time R is your best bet for a first trial. Take as an example general additive models (“Generalized Additive Models,” Trevor J Hastie, Robert Tibshirani, Statistical Science (1986) vol. 1 (3) pp. 297-318); R has a package named “gam” written by Trevor Hastie himself. But, like most R packages, trying the package from the supplied documentation brings in unfamiliar data and concerns. It is best to start small and quickly test if the package itself is suitable to your needs. We give a quick outline of how to learn such a package and quickly find out if the package is for you.
Continue reading The cranky guide to trying R packages
One of the current best tools in the machine learning toolbox is the 1930s statistical technique called logistic regression. We explain how to add professional quality logistic regression to your analytic repertoire and describe a bit beyond that. Continue reading Learn Logistic Regression (and beyond)
Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX), Windows (NT4, 2000, XP, Vista and 7) for quite a while I have seen a lot of different software tools. I would like to quickly exhibit my “must have” list. These are the packages that I find to be the single “must have offerings” in a number of categories. I have avoided some categories (such as editors, email programs, programing language, IDEs, photo editors, backup solutions, databases, database tools and web tools) where I have no feeling of having seen a single absolute best offering.
The spirit of the list is to pick items such that: if you disagree with an item in this list then either you are wrong or you know something I would really like to hear about.
Continue reading Must Have Software
Readers returning to our blog will know that Win-Vector LLC is fairly “pro-R.” You can take that to mean “in favor or R” or “professionally using R” (both statements are true). Some days we really don’t feel that way. Continue reading R annoyances
IowaHawk has a excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produced similarly bad results using R.
Continue reading CRU graph yet again (with R)
This article is quick concrete example of how to use the techniques from Survive R to lower the steepness of The R Project for Statistical Computing‘s learning curve (so an apology to all readers who are not interested in R). What follows is for people who already use R and want to achieve more control of the software. Continue reading R examine objects tutorial
New PDF slides version (presented at the Bay Area R Users Meetup October 13, 2009).
We at Win-Vector LLC appear to like R a bit more than some of our, perhaps wiser, colleagues ( see: Choose your weapon: Matlab, R or something else? and R and data ). While we do like R (see: Exciting Technique #1: The “R” language ) we also understand the need to defend oneself against the abuse regularly dished out by R. Here we will quickly share a few fighting techniques.
Continue reading Survive R
What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective visualization that neither obscures important details, or drowns us in confusing clutter? In 1968, William Cleveland published a text called The Elements of Graphing Data, inspired by Strunk and White’s classic writing handbook The Elements of Style . The Elements of Graphing Data puts forward Cleveland’s philosophy about how to produce good, clear graphs — not only for presenting one’s experimental results to peers, but also for the purposes of data analysis and exploration. Cleveland’s approach is based on a theory of graphical perception: how well the human perceptual system accomplishes certain tasks involved in reading a graph. For a given data analysis task, the goal is to align the information being presented with the perceptual tasks the viewer accomplishes the best. Continue reading Good Graphs: Graphical Perception and Data Visualization
Our first “exciting technique” article is about a statistical language called “R.”
R is a language for statistical analysis available from http://cran.r-project.org/ . The things you can immediately do with it are incredible. You can import a spreadsheet and immediately spot relationships, trend and anomalies. R gives you instant access to top notch visualization methods and sophisticated statistical methods.
Continue reading Exciting Technique #1: The “R” language.