
Technical books are amazing opportunities

Nina and I have been sending out drafts of our book Practical Data Science with R, 2nd Edition, for technical review. A few of the reviews came back from reviewers who described themselves with variations of:

Senior Business Analyst for COMPANYNAME. I have been involved in presenting graphs of data for many years.

To us this reads as somebody with deep experience, confidence, and a bit of humility. They do something technical and valuable, but because they understand it they do not consider it to be arcane magic.

In this note we describe what can happen if such a person (or a junior version of such a person) acquires one or two technical books.



Timing Working With a Row or a Column from a data.frame

In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames.

We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical when planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work paradigms, so we use the porting of small tasks as an approximate stand-in for measuring the difficulty of porting whole systems.

We tend to work with medium-sized data (hundreds of columns and millions of rows in memory), so that is the scale we simulate and study.
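As a concrete illustration, here is a minimal sketch of the style of measurement used, assuming the microbenchmark package; the data and task names are illustrative stand-ins, not the full study.

library("microbenchmark")

# simulate data at roughly the scale discussed: many rows in memory
n <- 1000000
d <- data.frame(col_a = rnorm(n), col_b = rnorm(n))

microbenchmark(
  select_column = d$col_a,                               # select a column
  alter_column = { d2 <- d; d2$col_a <- d2$col_a + 1 },  # alter a column (on a copy)
  select_row = d[7, ]                                    # select a row
)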



Not Always C++’s Fault

From the recent developer.r-project.org “Staged Install” article:

Incidentally, there were just two distinct (very long) lists of methods in the warnings across all installed packages in my run, but repeated for many packages. It turned out that they were lists of exported methods from dplyr and rlang packages. These two packages take very long to install due to C++ code compilation.

Technical point.

While dplyr indeed uses C++ (via Rcpp), rlang currently appears to be a C package. So any problems associated with rlang are probably not due to C++ or Rcpp. Similarly, other tidyverse packages such as purrr and tibble are currently C packages. I think purrr once used C++, but I do not know about the others.
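One quick way to check this sort of claim (a sketch, assuming current CRAN metadata, which may have changed since this was written): packages that use C++ via Rcpp normally declare Rcpp in their LinkingTo field.

db <- available.packages(repos = "https://cloud.r-project.org")
# packages compiling against Rcpp headers list it in LinkingTo
db[c("dplyr", "rlang", "purrr", "tibble"), "LinkingTo"]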


Why RcppDynProg is Written in C++

The (matter of opinion) claim:

“When the use of C++ is very limited and easy to avoid, perhaps it is the best option to do that […]”

(source discussed here)

got me thinking: does our own RcppDynProg package actually use C++ in a significant way? Could/should I port it to C? Am I informed enough to use something as complicated as C++ correctly?



What are the Popular R Packages?

“R is its packages,” so to know R we should know its popular packages (those on CRAN).

Or to put it another way: as R is a typical “the reference implementation is the specification” programming environment, there is no true “de jure” R, only a de facto R.

To look at popular R packages I defined “popular” as: used (via Depends/Imports/LinkingTo) by other packages on CRAN. One could use other definitions (e.g., GitHub stars), but this is the one I used for this particular study.
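A minimal sketch of that count, assuming base R’s tools package (exact numbers shift as CRAN changes):

db <- tools::CRAN_package_db()
# concatenate the three dependency fields for every CRAN package
uses <- paste(db$Depends, db$Imports, db$LinkingTo, sep = ", ")
# split into individual package names, stripping version qualifiers such as "(>= 1.0)"
pkgs <- trimws(gsub("\\(.*\\)", "", unlist(strsplit(uses, ","))))
pkgs <- pkgs[!(pkgs %in% c("", "NA", "R"))]
# the most-used packages under this definition
head(sort(table(pkgs), decreasing = TRUE), 5)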

My “quick look” (sure to anger everyone) is a couple of diagrams such as the following.

[diagram]



C++ is Often Used in R Packages

The recent r-project article “Use of C++ in Packages” stated, as its summary recommendation:

don’t use C++ to interface with R.

A careful reading of the article exposes at least two possible meanings of this:

  1. Don’t use C++ to directly call R or directly manipulate R structures. This is a technical point argued directly (rightly or wrongly) in the article.
  2. Don’t use C++/Rcpp to write R packages. This is a point left implicit in the article. C++ and Rcpp (a package designed to allow the use of C++ from R) are not the same thing, but both are mentioned in the note.

One could claim the article is “all about point 1, which we can argue on its technical merits.” The technicalities involve C’s setjmp/longjmp and how it interacts with C++’s RAII idiom, destructors, and exceptions.

(Edit: it has been pointed out to me that, as there is no C++ interface to R, the point-1 interpretation is in some sense not technically possible. All C++ is in some sense forced to go through the C interface. Yes, things can go wrong, but in a strict technical sense you can’t directly “use C++ to interface with R”: C++ calls .C() or .Call() just as C does.)

However, in my opinion the overall tone of the article unfortunately reads as being about point 2. In fact, after multiple readings I remain unsure whether the article is attempting to make point 2 or attempting to avoid it. Statements such as “Packages that are already using C++ would best be carefully reviewed and fixed by their authors” seem to accuse all existing C++ packages. But statements such as “one could use some of the tricks I’ve described here” seem to imply there are in fact correct ways to interface C++ with R (which, for all we know, many C++ packages may already be using).

I think a point-2 interpretation of the article does the R community a disservice. So I hope the note is not in fact about point 2. And if it isn’t, I wish that had been more strongly emphasized and made clearer.

For context: Rcpp is the most popular package on CRAN. Based on CRAN data downloaded 2019/03/31, Rcpp is directly used by 1605 CRAN packages (about 11% of CRAN packages), and indirectly used (brought in through Imports/Depends/LinkingTo) by 6337 packages (about 45% of CRAN packages). It has the highest reach of any CRAN package under each of those measures (calculation shared here), and even under a PageRank-style measure.
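Such reach numbers can be recomputed with base R’s tools package; the following is a sketch, and counts will differ from the 2019/03/31 snapshot quoted above.

db <- available.packages(repos = "https://cloud.r-project.org")
# direct reverse dependencies: packages that Depend on / Import / LinkTo Rcpp
direct <- tools::package_dependencies("Rcpp", db = db, reverse = TRUE,
                                      which = c("Depends", "Imports", "LinkingTo"))
# indirect reach: the same relation followed transitively
indirect <- tools::package_dependencies("Rcpp", db = db, reverse = TRUE,
                                        which = c("Depends", "Imports", "LinkingTo"),
                                        recursive = TRUE)
c(direct = length(direct[["Rcpp"]]), indirect = length(indirect[["Rcpp"]]))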

Rcpp is something R users should be appreciative of and grateful for. Rcpp should not become the subject of fear, uncertainty, and doubt.

I apologize if I am merely criticizing my own misreading of the note. However, others have also written about discomfort with this note, and the original note comes from a position of authority (so it does have a greater responsibility to be careful about how it might plausibly be read).


Standard Evaluation Versus Non-Standard Evaluation in R

There is a lot of unnecessary worry over “Non-Standard Evaluation” (NSE) in R versus “Standard Evaluation” (SE, or standard “variable names refer to values” evaluation). This very author is guilty of over-discussing the issue. But let’s give this yet another try.

The entire difference between NSE and regular evaluation can be summed up in the following simple table (which should be clear after we work some examples).

[table]
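As a quick preview, here is a minimal sketch of the distinction using only base R (the data and names are illustrative):

d <- data.frame(x = c(1, 2, 3))

# standard evaluation: the column name is a value carried in a variable
col_name <- "x"
d[[col_name]]
#> [1] 1 2 3

# non-standard evaluation: the symbol x is captured unevaluated and
# resolved inside d, not looked up as a variable in the calling environment
with(d, x)
#> [1] 1 2 3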



Why we Did Not Name the cdata Transforms wide/tall/long/short

We recently saw this UX (user experience) question from the tidyr author as he adapts tidyr to cdata techniques.

[screenshots of the question]

While he is adopting the cdata methodology into tidyr, the terminology he is not adopting is cdata’s “unpivot_to_blocks()” and “pivot_to_rowrecs()”. One of the research ideas in the cdata package is that the important thing to call out is record structure.

The key point is: are we in a very de-normalized form where all facts about an instance are in a single row (which we call “row records”), or are we in a record-oriented form where all the facts about an instance are spread over several rows (which we call “block records”)? The point is that row records don’t necessarily have more columns than block records. This makes shape-based naming of the transforms problematic, no matter what names you pick for the shapes. There is an advantage to using intent-based or semantics-based naming.

Below is a simple example.

library("cdata")

# example 1 end up with more rows, fewer columns
d <- data.frame(AUC = 0.6, R2 = 0.7, F1 = 0.8)
print(d)
#>   AUC  R2  F1
#> 1 0.6 0.7 0.8
unpivot_to_blocks(d,
                  nameForNewKeyColumn= 'meas',
                  nameForNewValueColumn= 'val',
                  columnsToTakeFrom= c('AUC', 'R2', 'F1')) 
#>   meas val
#> 1  AUC 0.6
#> 2   R2 0.7
#> 3   F1 0.8

# example 2 end up with more rows, same number of columns
d <- data.frame(AUC = 0.6, R2 = 0.7)
print(d)
#>   AUC  R2
#> 1 0.6 0.7
unpivot_to_blocks(d,
                  nameForNewKeyColumn= 'meas',
                  nameForNewValueColumn= 'val',
                  columnsToTakeFrom= c('AUC', 'R2')) 
#>   meas val
#> 1  AUC 0.6
#> 2   R2 0.7

# example 3 end up with same number of rows, more columns
d <- data.frame(AUC = 0.6)
print(d)
#>   AUC
#> 1 0.6
unpivot_to_blocks(d,
                  nameForNewKeyColumn= 'meas',
                  nameForNewValueColumn= 'val',
                  columnsToTakeFrom= c('AUC'))
#>   meas val
#> 1  AUC 0.6

Notice that the width of the result relative to the input width varies as a function of the input data, even though we are always calling the same transform. This makes it incorrect to characterize these transforms as merely widening or narrowing.
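For completeness, here is a sketch of the inverse transform, pivot_to_rowrecs(); the argument names follow the cdata interface as of this writing, so check your installed version.

# block records: facts about each model are spread over several rows
long_form <- data.frame(model = c('m1', 'm1', 'm2', 'm2'),
                        meas = c('AUC', 'R2', 'AUC', 'R2'),
                        val = c(0.6, 0.7, 0.5, 0.8))
pivot_to_rowrecs(long_form,
                 columnToTakeKeysFrom = 'meas',
                 columnToTakeValuesFrom = 'val',
                 rowKeyColumns = 'model')
# expected shape: one row record per model, with columns model, AUC, R2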

There are still some subtle points (for instance, row records are in fact a special case of block records), but overall the scheme we (Nina Zumel and myself, John Mount) worked out, tested, and promoted is pretty good. A lot of our work researching this topic can be found here.


Quantifying R Package Dependency Risk

We recently commented on excess package dependencies as representing risk in the R package ecosystem.

The question remains: how much risk? Is a low dependency count a mere talisman, or is there evidence that it is a good practice (or at least correlates with other good practices)?
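For a sense of what such a measurement can look like, here is a minimal sketch that counts transitive dependencies with base R’s tools package; the packages named are arbitrary examples.

db <- available.packages(repos = "https://cloud.r-project.org")
deps <- tools::package_dependencies(c("Rcpp", "ggplot2", "data.table"),
                                    db = db, recursive = TRUE)
# number of packages each one pulls in: a crude proxy for dependency risk
sapply(deps, length)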
