From https://tidyr.tidyverse.org/dev/articles/pivot.html (text by Hadley Wickham):
For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.
There are two important new features inspired by other R packages that have been advancing reshaping in R:
- The reshaping operation can be specified with a data frame that describes precisely how metadata stored in column names becomes data variables (and vice versa). This is inspired by the cdata package by John Mount and Nina Zumel. For simple uses of pivot_wide(), this specification is implicit, but for more complex cases it is useful to make it explicit and operate on the specification data frame.
- pivot_long() can work with multiple value variables that may have different types. This is inspired by the enhanced dcast() functions provided by the data.table package by Matt Dowle and Arun Srinivasan.
If you want to work in the above way, we suggest giving our cdata package a try. We named the functions unpivot_to_blocks. The idea was: by emphasizing the record structure one might eventually internalize what the transforms are doing. On the way to that we have a lot of documentation and tutorials.
We recently commented on excess package dependencies as representing risk in the R package ecosystem.
The question remains: how much risk? Is low dependency a mere talisman, or is there evidence it is a good practice (or at least correlates with other good practices)?
Continue reading Quantifying R Package Dependency Risk
I would like to once again recommend to our readers our note on wrapr::let(), an R function that can help you eliminate many problematic NSE (non-standard evaluation) interfaces (and their associated problems) from your R programming tasks.
The idea is to imitate the following lambda-calculus idea:
let x be y in z := ( λ x . z ) y
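As a rough base-R illustration of the identity above (this is not how wrapr::let() itself is implemented), binding x to y inside z amounts to calling a function of x on y. The names z_of_x, let_result, and local_result are made up for this sketch.

```r
# "let x be y in z" realised as applying (lambda x . z) to y.
y <- 5

# Here z is the expression x * 2 + 1, wrapped as a function of x.
z_of_x <- function(x) x * 2 + 1

# Function application performs the binding of x to y.
let_result <- z_of_x(y)

# The same binding written with local(), a base R way to scope a name.
local_result <- local({
  x <- y
  x * 2 + 1
})

print(let_result)
print(identical(let_result, local_result))
```

Both forms evaluate the same body with x bound to the value of y, which is the substitution the lambda-calculus line describes.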
Continue reading wrapr::let()
Dirk Eddelbuettel just shared an important point on software and analyses: dependencies are hard-to-manage risks.
If your software or research depends on many complex and changing packages, you have no way to establish your work is correct. This is because to establish the correctness of your work, you would need to also establish the correctness of all of the dependencies. This is worse than having non-reproducible research, as your work may have in fact been wrong even the first time.
Continue reading Software Dependencies and Risk
I am collecting here some notes on testing in R.
There seems to be a general (false) impression among non R-core developers that to run tests, R package developers need a test management system such as testthat, and a further false impression that testthat is the only R test management system. Neither is true: R itself has a capable testing facility in "R CMD check" (a command triggering R checks from outside of any given integrated development environment).
By a combination of skimming the R manuals ( https://cran.r-project.org/manuals.html ) and running a few experiments, I came up with a description of how R testing actually works. I have also adapted the available tools to fit my current preferred workflow. This may not be your preferred workflow, but I give my reasons below.
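A minimal sketch of a framework-free test, assuming a package that keeps plain R scripts in its tests/ directory: "R CMD check" runs each such script and reports a failure if the script stops with an error. The function add_one and the file name are hypothetical stand-ins for real package code.

```r
# Hypothetical file: tests/test_add_one.R
# In a real package, add_one() would come from library(yourpackage).
add_one <- function(x) x + 1

# stopifnot() signals an error when any condition is not TRUE,
# and an error in a tests/ script is what makes the check fail.
stopifnot(
  add_one(1) == 2,
  identical(add_one(c(1, 2)), c(2, 3))
)

# tryCatch() lets the script confirm that bad input raises an error.
raised <- tryCatch(
  {
    add_one("a")
    FALSE
  },
  error = function(e) TRUE
)
stopifnot(raised)
```

The design choice here is simply that a test is any script that errors when an expectation is violated; no extra test management package is involved.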
Continue reading Unit Tests in R
Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user might be running to the edge of where the package developer intended their package to work, and thus often where things can go wrong.
Let’s see what happens when we try to stick a fork in the power-outlet.
Continue reading Data Manipulation Corner Cases
The rquery R package has several places where the user can ask for what they have typed in to be substituted for a name or value stored in a variable.
This becomes important as many of the rquery commands capture column names from un-executed code. So it is important to know whether something is treated as a symbol/name (which will be translated to a data.frame column name or a database column name) or as a character/string (which will be translated to a constant).
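The symbol-versus-string distinction can be illustrated in base R alone. This sketch uses bquote() and eval(), not rquery's own substitution interface, and the data frame d and variable col are invented for the example.

```r
d <- data.frame(x = c(1, 5))
col <- "x"

# Substituting a symbol: the expression refers to the column named x.
as_symbol <- bquote(.(as.name(col)) > 2)

# Substituting a string: the expression contains the constant "x".
as_string <- bquote(.(col) == "x")

print(as_symbol)            # unevaluated code mentioning the name x
print(as_string)            # unevaluated code mentioning the string "x"

# Evaluated against d, the symbol is looked up as a column,
# while the string stays a constant.
symbol_result <- eval(as_symbol, d)
string_result <- eval(as_string, d)
print(symbol_result)
print(string_result)
```

The same captured text, "x", therefore leads to a per-row comparison in one case and a single constant comparison in the other.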
Continue reading rquery Substitution
Roz King just wrote an interesting article on binning data (a common data analytics step) in a database. They compare a case-based approach (where the bin divisions are stuffed into code) with a join-based approach. They share code and timings.
Best of all: rquery gets some attention and turns out to be the dominant solution at all scales measured.
[Figure: an example timing (lower times are better).]
So please check the article out.
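To make the two binning styles concrete, here is a small in-memory sketch in base R. It is not the article's database code and says nothing about timings; the data, bin boundaries, and column names are all invented for illustration.

```r
d <- data.frame(x = c(0.5, 2.5, 7))

# Case-based: the bin boundaries are written directly into the code.
d$bin_case <- ifelse(d$x < 1, "low", ifelse(d$x < 5, "mid", "high"))

# Join-based: the bin boundaries live in a table of their own.
bins <- data.frame(
  lo = c(-Inf, 1, 5),
  hi = c(1, 5, Inf),
  bin = c("low", "mid", "high"),
  stringsAsFactors = FALSE
)

# by = NULL asks merge() for a cross join; the filter then keeps,
# for each row of d, the single bin whose interval contains x.
crossed <- merge(d, bins, by = NULL)
joined <- crossed[crossed$x >= crossed$lo & crossed$x < crossed$hi, ]
joined <- joined[order(joined$x), ]

print(joined[, c("x", "bin_case", "bin")])
```

In the join-based form, changing the bins means editing a table rather than editing code, which is the contrast the article draws.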
To make teaching R quasi-quotation easier it would be nice if R string-interpolation and quasi-quotation both used the same notation. They are related concepts. So some commonality of notation would actually be clarifying, and help teach the concepts. We will define both of the above terms, and demonstrate the relation between the two concepts.
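A small base-R sketch of how close the two ideas are, using sprintf() for string interpolation and bquote() for quasi-quotation. These are stand-ins chosen for this example (not necessarily the tools the full note uses), and the variable names are made up.

```r
value <- 7

# String interpolation: substitute a value into a string template.
interpolated <- sprintf("x > %s", value)

# Quasi-quotation: substitute a value into unevaluated code.
quasi_quoted <- bquote(x > .(value))

print(interpolated)
print(quasi_quoted)

# Turning the code back into text gives the same characters.
same_text <- identical(interpolated, deparse(quasi_quoted))
print(same_text)
```

The two substitutions do the same job on different targets (a string versus code) yet are written with different markers, %s and .(), which is the notational gap the paragraph above is pointing at.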
Continue reading Make Teaching R Quasi-Quotation Easier
R Tip: use inline operators for legibility.
A Python feature I miss when working in R is the convenience of Python's + operator. In Python, + does the right thing for some built-in data types:
- It concatenates lists: [1,2] + [3] is [1, 2, 3].
- It concatenates strings: 'a' + 'b' is 'ab'.
- And, of course, it adds numbers: 1 + 2 is 3.
The inline notation is very convenient and legible. In this note we will show how to use a related notation in R.
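R lets users define their own inline operators by naming a function between percent signs. The sketch below shows the general mechanism only; the operator names %+% and %c% are chosen for this example and are not claimed to be the notation the full note introduces.

```r
# A user-defined inline operator for string concatenation.
`%+%` <- function(a, b) paste0(a, b)

# A user-defined inline operator for vector concatenation.
`%c%` <- function(a, b) c(a, b)

text_result <- "a" %+% "b"
vector_result <- c(1, 2) %c% 3

print(text_result)
print(vector_result)
```

Written inline, the operands read left to right in the same way as the Python expressions above, instead of being nested inside paste0() or c() calls.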
Continue reading R Tip: Use Inline Operators For Legibility