From https://tidyr.tidyverse.org/dev/articles/pivot.html (text by Hadley Wickham):
For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.
There are two important new features inspired by other R packages that have been advancing reshaping in R:
- The reshaping operation can be specified with a data frame that describes precisely how metadata stored in column names becomes data variables (and vice versa). This is inspired by the cdata package by John Mount and Nina Zumel. For simple uses of pivot_wide(), this specification is implicit, but for more complex cases it is useful to make it explicit, and operate on the specification data frame using
- pivot_long() can work with multiple value variables that may have different types. This is inspired by the enhanced dcast() functions provided by the data.table package by Matt Dowle and Arun Srinivasan.
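The first idea, a specification table that says how column names turn into data, can be sketched without any particular package. The following is a plain-Python illustration only, not the tidyr or cdata API: the function name `pivot_long_with_spec`, the layout of `spec`, and the toy data are all hypothetical, chosen just to keep the sketch self-contained.

```python
# Hypothetical illustration: a specification table drives a wide-to-long reshape.
# Each spec row says: values found in the wide column named by "column"
# become rows whose new variable "year" has the given value.
wide = [
    {"country": "A", "pop_1999": 10, "pop_2000": 12},  # made-up toy data
    {"country": "B", "pop_1999": 20, "pop_2000": 25},
]

spec = [
    {"column": "pop_1999", "year": 1999},
    {"column": "pop_2000", "year": 2000},
]


def pivot_long_with_spec(rows, spec, value_name):
    """Turn each wide row into one long row per spec entry."""
    spec_columns = {s["column"] for s in spec}
    long_rows = []
    for row in rows:
        # Columns not named in the spec identify the record and are copied as-is.
        ids = {k: v for k, v in row.items() if k not in spec_columns}
        for s in spec:
            new_row = dict(ids)
            # Metadata stored in the column name becomes ordinary data variables.
            new_row.update({k: v for k, v in s.items() if k != "column"})
            new_row[value_name] = row[s["column"]]
            long_rows.append(new_row)
    return long_rows


long = pivot_long_with_spec(wide, spec, value_name="population")
for r in long:
    print(r)
```

Because the specification is itself just data, it can be built, filtered, or edited with ordinary table operations before the reshape is run, which is the appeal of making it explicit.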
If you want to work in the above way, we suggest giving our cdata package a try. We named the functions unpivot_to_blocks. The idea was: by emphasizing the record structure one might eventually internalize what the transforms are doing. On the way to that we have a lot of documentation and tutorials.
We recently commented on excess package dependencies as representing risk in the R package ecosystem.
The question remains: how much risk? Is low dependency a mere talisman, or is there evidence it is a good practice (or at least correlates with other good practices)?
Continue reading Quantifying R Package Dependency Risk
I would like to once again point our readers to our note on wrapr::let(), an R function that can help you eliminate many problematic NSE (non-standard evaluation) interfaces (and their associated problems) from your R programming tasks.
The idea is to imitate the following lambda-calculus idea:
let x be y in z := ( λ x . z ) y
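As a small illustration of that identity (and only of the identity, not of how the R function is implemented), here is a Python sketch. The helper name `let` below is hypothetical and is defined just for this example.

```python
# "let x be y in z" written as the application (λ x . z) y.
def let(value, body):
    """Bind `value` to the parameter of `body` by applying `body` to it."""
    return body(value)


# let x be 3 in x * x + 1
result = let(3, lambda x: x * x + 1)
print(result)
```

The binding of a name to a value is expressed purely as function application, so no special scoping construct is needed.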
Continue reading wrapr::let()
Dirk Eddelbuettel just shared an important point on software and analyses: dependencies are hard-to-manage risks.
If your software or research depends on many complex and changing packages, you have no way to establish your work is correct. This is because to establish the correctness of your work, you would need to also establish the correctness of all of the dependencies. This is worse than having non-reproducible research, as your work may have in fact been wrong even the first time.
Continue reading Software Dependencies and Risk
I am collecting here some notes on testing in R.
There seems to be a general (false) impression among non-R-core developers that to run tests, R package developers need a test management system such as testthat, and a further false impression that testthat is the only R test management system. This is in fact not true, as R itself has a capable testing facility in "R CMD check" (a command triggering R checks from outside of any given integrated development environment).
By a combination of skimming the R manuals ( https://cran.r-project.org/manuals.html ) and running a few experiments I came up with a description of how R testing actually works. And I have adapted the available tools to fit my current preferred workflow. This may not be your preferred workflow, but I give my reasons for it below.
Continue reading Unit Tests in R
Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user might be running to the edge of where the package developer intended their package to work, and thus often where things can go wrong.
Let’s see what happens when we try to stick a fork in the power-outlet.
Continue reading Data Manipulation Corner Cases
Starting With Data Science
A rigorous hands-on introduction to data science for software engineers.
Win Vector LLC is now offering a 4-day on-site intensive data science course. The course targets software engineers familiar with Python and introduces them to the basics of current data science practice. This is designed as an interactive in-person (not remote or video) course.
Continue reading Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Software Engineers
The rquery R package has several places where the user can ask for what they have typed in to be substituted for a name or value stored in a variable.
This becomes important as many of the rquery commands capture column names from un-executed code. So knowing if something is treated as a symbol/name (which will be translated to a data.frame column name or a database column name) or a character/string (which will be translated to a constant) is important.
Continue reading rquery Substitution
Roz King just wrote an interesting article on binning data (a common data analytics step) in a database. They compare a case-based approach (where the bin divisions are stuffed into code) with a join-based approach. They share code and timings.
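To make the two approaches concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the article's code and makes no claim about timings; the table names `d` and `bins`, the bin boundaries, and the data values are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (x REAL)")
conn.executemany("INSERT INTO d VALUES (?)", [(1.0,), (12.5,), (25.0,), (7.0,)])

# Case-based: the bin boundaries are written directly into the query text.
case_sql = """
SELECT x,
       CASE WHEN x < 10 THEN 'low'
            WHEN x < 20 THEN 'mid'
            ELSE 'high' END AS bin
FROM d
ORDER BY x
"""

# Join-based: the bin boundaries live in a table, and rows are matched by range.
conn.execute("CREATE TABLE bins (lo REAL, hi REAL, bin TEXT)")
conn.executemany(
    "INSERT INTO bins VALUES (?, ?, ?)",
    [(-1e308, 10.0, "low"), (10.0, 20.0, "mid"), (20.0, 1e308, "high")],
)
join_sql = """
SELECT d.x, bins.bin
FROM d
JOIN bins ON d.x >= bins.lo AND d.x < bins.hi
ORDER BY d.x
"""

case_result = conn.execute(case_sql).fetchall()
join_result = conn.execute(join_sql).fetchall()
print(case_result)
print(join_result)
```

Both queries assign the same bins here. The difference is where the boundaries live: in the case-based form a change of bins means regenerating the query, while in the join-based form it means updating a table.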
Best of all: rquery gets some attention and turns out to be the dominant solution at all scales measured.
Here is an example timing (lower times better):
So please check the article out.
We’ve been getting some good uptake on our piping in R article announcement.
The article is necessarily a bit technical. But one of its key points comes from the observation that piping into names is a special opportunity to give general objects the following personality quiz: “If you were an R function, what function would you be?”
Continue reading “If You Were an R Function, What Function Would You Be?”