We have been writing a lot on higher-order data transforms lately:
What I want to do now is "write a bit more, so I finally feel I have been concise."
Continue reading Arbitrary Data Transforms Using cdata
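As a small taste of the kind of transform discussed there, here is a hedged sketch using cdata's `unpivot_to_blocks()` (argument names are from my reading of the current cdata API; the toy data is invented for illustration):

```r
library(cdata)

# wide "row record" data: one row per subject, one column per measurement
d <- data.frame(
  subject = c("s1", "s2"),
  meas1   = c(5, 7),
  meas2   = c(8, 3)
)

# move the measurement columns into key/value rows ("blocks")
d_long <- unpivot_to_blocks(
  d,
  nameForNewKeyColumn   = "measurement",
  nameForNewValueColumn = "value",
  columnsToTakeFrom     = c("meas1", "meas2")
)
print(d_long)
```

The point of the higher-order view is that this wide-to-tall move (and its inverse) is specified by a small description of the record shape, not by hand-written reshaping code.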
I have just released some simple RStudio add-ins that are handy for binding keyboard shortcuts to pipe operations in R. You can install the add-ins from here (which also includes installation instructions and usage examples).
Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template). Please check it out.
We have just released a major update of the cdata R package to CRAN. If you work with R and data, now is the time to check out the cdata package. Continue reading Update on coordinatized or fluid data
Our article "Let’s Have Some Sympathy For The Part-time R User" includes two points:
- Sometimes you have to write parameterized or re-usable code.
- The methods for doing this should be easy and legible.
The first point feels abstract until you find yourself wanting to re-use code on new projects. As for the second point: I feel the wrapr package is the easiest, safest, most consistent, and most legible way to achieve maintainable code re-use in R.
In this article we will show how wrapr makes code-rewriting even easier with its new let x=x automation. Continue reading Let X=X in R
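To make the idea concrete, here is a minimal sketch of `wrapr::let()` substitution (a toy example of mine, not taken from the article):

```r
library(wrapr)

d <- data.frame(x = c(1, 2, 3))

# the column of interest arrives as a string, as it would in
# parameterized or re-usable code
col_name <- "x"

# let() substitutes the placeholder symbol COL with the name stored
# in col_name, then executes the block
result <- let(
  c(COL = col_name),
  mean(d$COL)
)
print(result)
# prints 2
```

The substitution happens before execution, so the code inside the block reads like ordinary non-parameterized R.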
As part of our consulting practice, Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using R and substantial data stores (relational databases such as PostgreSQL, or big data systems). Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher-order data operators."
Continue reading Big Data Transforms
I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects. Continue reading Upcoming data preparation and modeling article series
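For readers new to the package, here is a minimal sketch of a typical vtreat workflow (the function names `designTreatmentsN()` and `prepare()` are from my reading of the vtreat API, and the tiny data frame is invented for illustration):

```r
library(vtreat)

# toy frame: a categorical variable with a missing value,
# and a numeric outcome
d <- data.frame(
  x = c("a", "a", "b", "b", NA),
  y = c(1, 1, 2, 2, 2)
)

# design a treatment plan for predicting the numeric outcome y from x
plan <- designTreatmentsN(d, varlist = c("x"), outcomename = "y")

# apply the plan: the result is all-numeric and free of NAs,
# ready for most modeling functions
d_treated <- prepare(plan, d)
print(d_treated)
```

The treatment plan is designed once (ideally on held-out calibration data) and then re-applied to training, test, and future application data.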
My favorite advice on debugging is from Professor Norman Matloff:
Finding your bug is a process of confirming the many things that you believe are true – until you find one that is not true.
Continue reading On debugging
There are substantial differences between ad-hoc analyses (be they machine learning research, data science contests, or other demonstrations) and production-worthy systems. Roughly: an ad-hoc analysis has to be correct only at the moment it is run (and often, once it is correct, that is the last time it is run; the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.
Demonstration systems merely need to glow in bright light among friends; production systems must be correct, even alone in the dark.
“Character is what you are in the dark.”
John Whorfin quoting Dwight L. Moody.
I have found that to deliver production-worthy data science and predictive analytic systems, one has to develop per-team and per-project field-tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.
What I want to do is share a single small piece of Win-Vector LLC’s current guidance on using the dplyr package. Continue reading My advice on dplyr::mutate()
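The article's specific advice is behind the link; as a hedged illustration of the kind of mutate() pattern such guidance tends to address (my own toy example, not the article's), consider splitting assignments that depend on each other into separate mutate() steps:

```r
library(dplyr)

d <- tibble(x = c(1, 2, 3))

# in a single mutate(), v is both created and then used to build w;
# whether that is safe can depend on the dplyr backend in use:
# d %>% mutate(v = x + 1, w = v * 2)

# splitting the dependent assignments makes the order of
# evaluation explicit
res <- d %>%
  mutate(v = x + 1) %>%
  mutate(w = v * 2)
print(res)
```

Each mutate() step now depends only on columns that already exist, which keeps the pipeline easy to reason about.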