I thought I would give a personal update on our book: Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019.
Here is an example of how easy it is to use cdata to re-layout your data.
Tim Morris recently tweeted the following problem (corrected).
Please will you take pity on me #rstats folks? I only want to reshape two variables x & y from wide to long!
Starting with:
id  xa  xb  ya  yb
1   1   3   6   8
2   2   4   7   9
How can I get to:
id  t  x  y
1   a  1  6
1   b  3  8
2   a  2  7
2   b  4  9
In Stata it's:
. reshape long x y, i(id) j(t) string
In R, it's:
. an hour of cursing followed by a desperate tweet 👆
Thanks for any help! PS – I can make reshape() or gather() work when I have just x or just y.
This is not to make fun of Tim Morris: the above should be easy. Using diagrams and slowing down the data transform into small steps makes the process very easy.
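For reference, the target layout in the tweet can be reached in small steps. The following is a minimal sketch using base R's `reshape()`, chosen only because it needs no extra packages; it is not the cdata solution this post refers to. The variable names `d` and `long` are illustrative.

```r
# Tim Morris's example data (wide form: one row per id).
d <- data.frame(
  id = c(1, 2),
  xa = c(1, 2), xb = c(3, 4),
  ya = c(6, 7), yb = c(8, 9)
)

# Move to long form: one row per (id, t) pair, with x and y as columns.
long <- reshape(
  d,
  direction = "long",
  idvar     = "id",
  varying   = list(c("xa", "xb"), c("ya", "yb")),  # columns feeding x, then y
  v.names   = c("x", "y"),
  timevar   = "t",
  times     = c("a", "b")
)

# reshape() stacks all "a" rows before the "b" rows; sort to match the tweet.
long <- long[order(long$id, long$t), ]
rownames(long) <- NULL
print(long)
```

The `varying` list is the step that trips people up: it says which wide columns feed each long column, in the same order as `times`.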
What R users now call piping, popularized by Stefan Milton Bache and Hadley Wickham, is inline function application (this is notationally similar to, but distinct from, the powerful interprocess communication and concurrency tool introduced to Unix by Douglas McIlroy in 1973). In object oriented languages this sort of notation for function application has been called “method chaining” since the days of Smalltalk (~1972). Let’s take a look at method chaining in Python, in terms of pipe notation.
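To make the "inline function application" reading concrete on the R side, here is a small sketch. It assumes R 4.1 or later and uses the native `|>` operator, which is a different operator from the magrittr `%>%` pipe associated with Bache and Wickham; the example values are arbitrary.

```r
# Assumes R >= 4.1, which provides the native pipe operator |>.
# Two equivalent ways of writing "take x, square-root it, then round".
x <- c(2, 10)

nested <- round(sqrt(x), 1)           # ordinary nested function application
piped  <- x |> sqrt() |> round(1)     # the same calls, written left to right

# The native pipe is a syntax transform: the parser rewrites the piped
# form into the nested form before anything is evaluated.
same_code <- identical(quote(x |> sqrt() |> round(1)),
                       quote(round(sqrt(x), 1)))

print(nested)
print(piped)
print(same_code)
```

Comparing the quoted expressions, rather than only the results, is what shows the pipe to be notation for function application and not a separate mechanism.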
A good friend is now a professor at the University of Auckland and knew to photograph and send us this. Thanks!!!
A good friend shared with us a great picture of Practical Data Science with R, 1st Edition hanging out in Cambridge at the MIT Press Bookstore.
This is as good an excuse as any to share a book update.
From the recent developer.r-project.org “Staged Install” article:
Incidentally, there were just two distinct (very long) lists of methods in the warnings across all installed packages in my run, but repeated for many packages. It turned out that they were lists of exported methods from dplyr and rlang packages. These two packages take very long to install due to C++ code compilation.
dplyr indeed uses Rcpp, but rlang appears to currently be a C-package. So any problems associated with rlang are probably not due to Rcpp. Similarly other tidyverse packages such as tibble are currently C packages. I think purrr once used C++, but do not know about the others.
The (matter of opinion) claim:
“When the use of C++ is very limited and easy to avoid, perhaps it is the best option to do that […]”
(source discussed here)
got me thinking: does our own RcppDynProg package actually use C++ in a significant way? Could/should I port it to C? Am I informed enough to use something as complicated as C++ correctly?
Or put it another way: as R is a typical “the reference implementation is the specification” programming environment there is no true “de jure” R, only a de facto R.
To look at popular R packages I defined “popular” as used (Depends/Imports/LinkingTo) by other packages on CRAN. One could use other definitions (e.g. Github stars), but this is the one I used for this particular study.
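As an illustration of this definition, the sketch below counts direct and indirect users on a made-up dependency list. Everything in it is hypothetical: the package names other than Rcpp and stats, the `uses` list, and the helper functions `direct_users()` and `all_users()` are invented for the example and are not the calculation used for the study.

```r
# Hypothetical toy dependency data: each package maps to the packages it
# uses (standing in for its Depends/Imports/LinkingTo fields).
uses <- list(
  pkgA  = c("Rcpp"),
  pkgB  = c("pkgA"),
  pkgC  = c("pkgB", "stats"),
  pkgD  = character(0),
  Rcpp  = character(0),
  stats = character(0)
)

# Packages that list `target` directly.
direct_users <- function(target, uses) {
  names(uses)[vapply(uses, function(deps) target %in% deps, logical(1))]
}

# Packages that reach `target` through any chain of uses (direct or indirect).
all_users <- function(target, uses) {
  found <- character(0)
  frontier <- target
  while (length(frontier) > 0) {
    nxt <- as.character(unlist(lapply(frontier, direct_users, uses = uses)))
    frontier <- setdiff(unique(nxt), found)  # only packages not seen before
    found <- c(found, frontier)
  }
  sort(found)
}

print(direct_users("Rcpp", uses))
print(all_users("Rcpp", uses))
```

In this toy graph only one package uses Rcpp directly, while three depend on it once indirect use is followed, which is why the two measures can differ so much.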
My “quick look” (sure to anger everyone) is a couple of diagrams such as the following.
The r-project article “Use of C++ in Packages” stated as its own summary of recommendation:
don’t use C++ to interface with R.
A careful reading of the article exposes at least two possible meanings of this:
- Don’t use C++ to directly call R or directly manipulate R structures. A technical point directly argued (for right or wrong) in the article.
- Don’t use C++ in R packages. A point implicit in the article.
C++ and Rcpp (a package designed to allow the use of C++ from R) are not the same thing, but both are mentioned in the note.
One could claim the article is “all about point 1, which we can argue on its technical merits.” The technicalities involve discussion of longjmp and how this differs from C++’s treatment of RAII, destructors, and exceptions.
(edit: It has been pointed out to me that, as there is no C++ interface to R, the point-1 interpretation is in some sense not technically possible. All C++ is in some sense forced to go through the C interface. Yes, things can go wrong, but in a strict technical sense you can’t directly “use C++ to interface with R” […] .Call() just as […])
However, in my opinion the overall tone of the article unfortunately reads as being about point 2. In fact, after multiple readings of the article I remain uncomfortable saying whether the article is attempting to make point 2 or attempting to avoid point 2. Statements such as “Packages that are already using C++ would best be carefully reviewed and fixed by their authors” seem to accuse all existing C++ packages. But statements such as “one could use some of the tricks I’ve described here” seem to imply there are in fact correct ways to interface C++ with R (which, for all we know, many C++ packages may already be using).
I think a point 2 interpretation of the article does the R community a disservice. So I hope the note is not in fact about point 2. And if it isn’t about point 2, I wish that had been more strongly emphasized and made clearer.
Rcpp is the most popular package on CRAN. Based on CRAN data downloaded 2019/03/31:
Rcpp is directly used in 1605 CRAN packages (or about 11% of CRAN packages), and indirectly used (brought in through Import/Depends/LinkingTo) by 6337 packages (or about 45% of CRAN packages). It has the highest reach of any CRAN package under each of those measures (calculation shared here), and even under a pagerank style measure.
Rcpp is something R users should be appreciative of and grateful for. Rcpp should not become the subject of fear, uncertainty, and doubt.
I apologize if I am merely criticizing my own mis-reading of the note. However, others have also written about discomfort with this note, and the original note comes from a position of authority (so does have a greater responsibility to be fairly careful in how it might be plausibly read).
There is a lot of unnecessary worry over “Non Standard Evaluation” (NSE) in R versus “Standard Evaluation” (SE, or standard “variable names refer to values” evaluation). This very author is guilty of over-discussing the issue. But let’s give this yet another try.
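As a quick warm-up, the distinction can be seen in a few lines of base R. This is only a sketch; the helper names `show_value()` and `show_code()` are made up for the illustration.

```r
x <- 5

# Standard evaluation: the argument name is looked up and its value is used.
show_value <- function(v) v

# Non-standard evaluation: the function captures the code the caller typed
# instead of the value that code refers to.
show_code <- function(v) deparse(substitute(v))

print(show_value(x))   # works with the value bound to x
print(show_code(x))    # works with the text of the argument

# The same split shows up in everyday data frame access.
d <- data.frame(x = 1:3)
col <- "x"
print(d[[col]])   # SE: `col` is evaluated, giving the column named "x"
print(d$x)        # NSE: `x` is taken literally as a column name
```

The practical consequence is that the SE form can be driven by a variable holding a column name, while the NSE form cannot without extra tooling.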
The entire difference between NSE and regular evaluation can be summed up in the following simple table (which should be clear after we work some examples).