Win Vector LLC’s Dr. Nina Zumel has had great success applying y-aware methods to machine learning problems, and working out the detailed cross-validation methods needed to make y-aware procedures safe. I thought I would try my hand at y-aware neural net or deep learning methods here.
For data science projects I recommend using source control or version control, and committing changes at a very fine level of granularity. This means checking in possibly broken code and writing possibly weak commit messages (so when working in a shared project, you may want a private branch or a second source control repository).
Please read on for our justification.
Here is an example of how easy it is to use cdata to re-layout your data.
Tim Morris recently tweeted the following problem (corrected).
Please will you take pity on me #rstats folks?
I only want to reshape two variables x & y from wide to long!

Starting with:
    id xa xb ya yb
     1  1  3  6  8
     2  2  4  7  9

How can I get to:
    id t x y
     1 a 1 6
     1 b 3 8
     2 a 2 7
     2 b 4 9

In Stata it's:
    . reshape long x y, i(id) j(t) string
In R, it's:
    . an hour of cursing followed by a desperate tweet 👆

Thanks for any help!

PS – I can make reshape() or gather() work when I have just x or just y.
This is not to make fun of Tim Morris: the above should be easy. Using diagrams and breaking the data transform down into small steps makes the process very easy.
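To make the small steps concrete, here is a minimal sketch of the transform, assuming the cdata package is installed and using its `rowrecs_to_blocks()` function. The variable names (`d`, `control_table`, `long`) are chosen for illustration only. The control table is the "diagram": its first column names the new key column `t`, and the remaining columns say which wide columns land in `x` and `y` for each key value.

```r
library(cdata)

# The wide data from the tweet: one row per id.
d <- data.frame(
  id = c(1, 2),
  xa = c(1, 2),
  xb = c(3, 4),
  ya = c(6, 7),
  yb = c(8, 9)
)

# Control table: each row describes one row of the long "block" that
# every wide record expands into. Entries are names of wide columns.
control_table <- data.frame(
  t = c("a", "b"),
  x = c("xa", "xb"),
  y = c("ya", "yb"),
  stringsAsFactors = FALSE
)

# Each wide row becomes a two-row block; id is copied onto every row.
long <- rowrecs_to_blocks(
  d,
  controlTable = control_table,
  columnsToCopy = "id"
)

print(long)
```

The design point is that the control table is ordinary data: it looks like one record of the desired result, with column names from the wide table in place of values.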
cdata implements an operator notation for data transforms.
The idea is simple, yet powerful.
With all of the excitement surrounding cdata style control table based data transforms (the cdata ideas being named as the “replacements” for tidyr's current methodology, by the tidyr authors themselves!), I thought I would take a moment to describe how they work.
I recently ran into something interesting on the R macros/quasi-quotation/substitution/syntax front:
Romain François: “.@_lionelhenry reveals planned double curly syntax At #satRdayParis as a possible replacement, addition to !! and enquo()”
!! is no longer the last word in substitution (it certainly wasn’t the first).
I am not sure if it is a good or bad idea. But let’s play with it a bit, and perhaps readers can submit their experience and opinions in the comments section.
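For readers who want to play along, here is a small sketch. It assumes the double curly syntax in the form it later shipped in rlang 0.4.0 (usable through dplyr), which may differ from what was shown at the talk. The helper names `mean_of` and `mean_of_bang` are hypothetical, made up for this comparison.

```r
library(dplyr)

# Hypothetical helper using the double curly syntax:
# the caller passes a column name unquoted.
mean_of <- function(data, var) {
  # {{ var }} captures the caller's expression and substitutes it here.
  summarise(data, mean_value = mean({{ var }}))
}

# The same helper written with the older enquo() / !! pair.
mean_of_bang <- function(data, var) {
  var <- enquo(var)
  summarise(data, mean_value = mean(!!var))
}

res_curly <- mean_of(mtcars, mpg)
res_bang <- mean_of_bang(mtcars, mpg)

print(res_curly)
print(res_bang)
```

Both helpers should return the same one-row summary, so the double curly form reads as a shorthand for the capture-then-unquote pattern rather than a new capability.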
This means I can time the exact same algorithm implemented nearly identically in each of these three languages. So I can extract some comparative “apples to apples” timings. Please read on for a summary of the results.