
## Operator Notation for Data Transforms

As of version `1.0.8`, `cdata` implements an operator notation for data transforms.

The idea is simple, yet powerful.


## How cdata Control Table Data Transforms Work

With all of the excitement surrounding `cdata` style control table based data transforms (the `cdata` ideas being named as the “replacements” for `tidyr`'s current methodology, by the `tidyr` authors themselves!) I thought I would take a moment to describe how they work.


## Why we Did Not Name the cdata Transforms wide/tall/long/short

We recently saw this UX (user experience) question from the tidyr author as he adapts tidyr to cdata techniques.

While adopting the cdata methodology into tidyr, the terminology he is not adopting from cdata is “unpivot_to_blocks()” and “pivot_to_rowrecs()”. One of the research ideas in the cdata package is that the important thing to call out is record structure.

The key point is: are we in a very de-normalized form where all the facts about an instance are in a single row (which we called “row records”), or are we in a record-oriented form where all the facts about an instance are spread over several rows (which we called “block records”)? The point is: row records don’t necessarily have more columns than block records. This makes shape-based naming of the transforms problematic, no matter what names you pick for the shapes. There is an advantage to using intent- or semantics-based naming.

Below is a simple example.

```
library("cdata")

# example 1: end up with more rows, fewer columns
d <- data.frame(AUC = 0.6, R2 = 0.7, F1 = 0.8)
print(d)
#>   AUC  R2  F1
#> 1 0.6 0.7 0.8
unpivot_to_blocks(
  d,
  nameForNewKeyColumn = 'meas',
  nameForNewValueColumn = 'val',
  columnsToTakeFrom = c('AUC', 'R2', 'F1'))
#>   meas val
#> 1  AUC 0.6
#> 2   R2 0.7
#> 3   F1 0.8

# example 2: end up with more rows, the same number of columns
d <- data.frame(AUC = 0.6, R2 = 0.7)
print(d)
#>   AUC  R2
#> 1 0.6 0.7
unpivot_to_blocks(
  d,
  nameForNewKeyColumn = 'meas',
  nameForNewValueColumn = 'val',
  columnsToTakeFrom = c('AUC', 'R2'))
#>   meas val
#> 1  AUC 0.6
#> 2   R2 0.7

# example 3: end up with the same number of rows, more columns
d <- data.frame(AUC = 0.6)
print(d)
#>   AUC
#> 1 0.6
unpivot_to_blocks(
  d,
  nameForNewKeyColumn = 'meas',
  nameForNewValueColumn = 'val',
  columnsToTakeFrom = c('AUC'))
#>   meas val
#> 1  AUC 0.6
```

Notice the width of the result relative to the input width varies as a function of the input data, even though we were always calling the same transform. This makes it incorrect to characterize these transforms as merely widening or narrowing.

There are still some subtle points (for instance, row records are in fact instances of block records), but overall the scheme we (Nina Zumel and myself, John Mount) worked out, tested, and promoted is pretty good. A lot of our work researching this topic can be found here.

## Support Rotary to Support our World

Thank you to Win-Vector LLC General Partner Nina Zumel for stepping up her workload, allowing me to take some time off from Win-Vector LLC (and time off from revising chapter 8 of Practical Data Science with R, 2nd Edition) to help administer the Vietnam Rotary Global Grant mentioned below. This project is going to help over 1,600 farmers in Vietnam.

Heidi Kühn is a remarkable individual, and Roots of Peace and Rotary are remarkable organizations. It is an honor to work with all of you.

As with all projects, it feels like my part (the paperwork and supervision) is back under control. So back to research, clients, and the book (until something more is needed).


## Tidyverse users: gather/spread are on the way out

For some time, it’s been obvious that there is something fundamentally wrong with the design of `spread()` and `gather()`. Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.

There are two important new features inspired by other R packages that have been advancing reshaping in R:

• The reshaping operation can be specified with a data frame that describes precisely how metadata stored in column names becomes data variables (and vice versa). This is inspired by the `cdata` package by John Mount and Nina Zumel. For simple uses of `pivot_long()` and `pivot_wide()`, this specification is implicit, but for more complex cases it is useful to make it explicit, and operate on the specification data frame using `dplyr` and `tidyr`.
• `pivot_long()` can work with multiple value variables that may have different types. This is inspired by the enhanced `melt()` and `dcast()` functions provided by the `data.table` package by Matt Dowle and Arun Srinivasan.

If you want to work in the above way we suggest giving our `cdata` package a try. We named the functions `pivot_to_rowrecs` and `unpivot_to_blocks`. The idea was: by emphasizing the record structure one might eventually internalize what the transforms are doing. On the way to that we have a lot of documentation and tutorials.
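As a small illustration of the record-structure naming, here is a sketch of going back the other way: collecting a block record into a single row record with `pivot_to_rowrecs` (the `id` key column and the data values are invented for this example):

```
library("cdata")

# a block record: all facts about one instance spread over several rows,
# tied together by a record key column
d2 <- data.frame(
  id = c(1, 1, 1),
  meas = c('AUC', 'R2', 'F1'),
  val = c(0.6, 0.7, 0.8))

# collect the block record back into a single row record per id
pivot_to_rowrecs(
  d2,
  columnToTakeKeysFrom = 'meas',
  columnToTakeValuesFrom = 'val',
  rowKeyColumns = 'id')
```

This is the inverse of the `unpivot_to_blocks` examples above: the `meas` entries become column names, and `val` supplies the cell values.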


## Quantifying R Package Dependency Risk

We recently commented on excess package dependencies as representing risk in the `R` package ecosystem.

The question remains: how much risk? Is low dependency a mere talisman, or is there evidence it is a good practice (or at least correlates with other good practices)?


## wrapr::let()

I would like to once again recommend to our readers our note on `wrapr::let()`, an `R` function that can help you eliminate many problematic NSE (non-standard evaluation) interfaces (and their associated problems) from your `R` programming tasks.

The idea is to imitate the following lambda-calculus idea:

```
let x be y in z  :=  ( λ x . z ) y
```
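A minimal sketch of the idea in `R` (the data frame and the column name held in `cname` are invented for illustration): `let()` substitutes the symbol `COL` with the column name before evaluating the expression, so code that would otherwise need fragile NSE tricks becomes ordinary code.

```
library("wrapr")

d <- data.frame(measurement = c(1, 2, 3))

# the column we want to work with, known only as a string at runtime
cname <- 'measurement'

# let() rewrites COL -> measurement, then evaluates the expression
let(
  c(COL = cname),
  max(d$COL))
#> [1] 3
```

This mirrors the lambda-calculus reading: bind `COL` to the value of `cname`, then evaluate the body under that binding.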


## Software Dependencies and Risk

Dirk Eddelbuettel just shared an important point on software and analyses: dependencies are hard-to-manage risks.

If your software or research depends on many complex and changing packages, you have no way to establish your work is correct. This is because to establish the correctness of your work, you would need to also establish the correctness of all of the dependencies. This is worse than having non-reproducible research, as your work may have in fact been wrong even the first time.


## Unit Tests in R

I am collecting here some notes on testing in `R`.

There seems to be a general (false) impression among non R-core developers that to run tests, `R` package developers need a test management system such as `RUnit` or `testthat`. And a further false impression that `testthat` is the only `R` test management system. This is in fact not true, as `R` itself has a capable testing facility in "`R CMD check`" (a command triggering `R` checks from outside of any given integrated development environment).

By a combination of skimming the `R` manuals ( https://cran.r-project.org/manuals.html ) and running a few experiments, I came up with a description of how `R` testing actually works. And I have adapted the available tools to fit my current preferred workflow. This may not be your preferred workflow, but I have my reasons, which I give below.
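For example, "`R CMD check`" runs every `.R` script it finds in a package's `tests/` directory and fails the check if any script signals an error, so a test file needs no framework beyond base `R`. A minimal sketch (the function `my_add` is a hypothetical stand-in for a package function):

```
# tests/test_my_add.R
# R CMD check sources each .R file under tests/; an uncaught error fails the check.

my_add <- function(a, b) {  # stand-in for a function exported by the package
  a + b
}

# plain base-R assertions: stopifnot() throws on any FALSE condition
stopifnot(my_add(1, 2) == 3)
stopifnot(my_add(-1, 1) == 0)

cat("all tests passed\n")
```

Because a thrown error is the failure signal, any mechanism that errors on a bad result (including `stop()` guarded by an `if`) works as a test.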