
## A Quick Appreciation of the R transform Function

`R` users who also use the `dplyr` package will be able to quickly understand the following code that adds an estimated area column to a `data.frame`.

```
suppressPackageStartupMessages(library("dplyr"))

# add an estimated petal area column (ellipse area: pi/4 * width * length)
iris %>%
  mutate(
    .,
    Petal.Area = (pi/4)*Petal.Width*Petal.Length) %>%
  head(.)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
## 1          5.1         3.5          1.4         0.2  setosa  0.2199115
## 2          4.9         3.0          1.4         0.2  setosa  0.2199115
## 3          4.7         3.2          1.3         0.2  setosa  0.2042035
## 4          4.6         3.1          1.5         0.2  setosa  0.2356194
## 5          5.0         3.6          1.4         0.2  setosa  0.2199115
## 6          5.4         3.9          1.7         0.4  setosa  0.5340708
```

The notation we used above is the "explicit argument" variation we recommend for readability. What a lot of `dplyr` users do not seem to know is: base-`R` already has this functionality. The function is called `transform()`.

To demonstrate this, let’s first detach `dplyr` to show that we are not using functions from `dplyr`.

```
detach("package:dplyr", unload = TRUE)
```

Now let’s write the equivalent pipeline using exclusively base-`R`.

```
# same calculation, base-R only: each step right-assigns its result to "."
iris ->.
  transform(
    .,
    Petal.Area = (pi/4)*Petal.Width*Petal.Length) ->.
  head(.)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
## 1          5.1         3.5          1.4         0.2  setosa  0.2199115
## 2          4.9         3.0          1.4         0.2  setosa  0.2199115
## 3          4.7         3.2          1.3         0.2  setosa  0.2042035
## 4          4.6         3.1          1.5         0.2  setosa  0.2356194
## 5          5.0         3.6          1.4         0.2  setosa  0.2199115
## 6          5.4         3.9          1.7         0.4  setosa  0.5340708
```

The "`->.`" notation is the end-of-line variation of the Bizarro Pipe. The `transform()` function has been part of `R` since 1998. `dplyr::mutate()` was introduced in 2014.

```
git log --all -p --reverse --source -S 'transform <-'

commit 41c2f7338c45dbf9eac99c210206bc3657bca98a refs/remotes/origin/tags/R-0-62-4
Author: pd <pd@00db46b3-68df-0310-9c12-caf00c1e9a41>
Date:   Wed Feb 11 18:31:12 1998 +0000

    Added the frametools functions subset() and transform()

    git-svn-id: https://svn.r-project.org/R/trunk@709 00db46b3-68df-0310-9c12-caf00c1e9a41
```
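
The `->.` steps in the base-`R` pipeline above need no special operator. `->` is ordinary right-assignment and `.` is an ordinary (hidden) variable name, so each line stores its result in `.` for the next line to read. Here is a minimal sketch with made-up values; the name `bizarro_result` is hypothetical and used only for illustration.

```r
# Each step right-assigns its value to a variable named "."
c(1, 4, 9) ->.
  sqrt(.) ->.
  sum(.) ->.

# Copy the final value to a regular name (hypothetical, for illustration)
bizarro_result <- .
print(bizarro_result)
```

One consequence of this design is that the last intermediate value stays behind in `.` after the pipeline finishes.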

## R Tip: How to Pass a formula to lm

`R` tip: how to pass a `formula` to `lm()`.

Often when modeling in `R` one wants to build up a formula outside of the modeling call. This allows the set of columns being used to be passed around as a vector of strings, and treated as data. Being able to treat controls (such as the set of variables to use) as manipulable values allows for very powerful automated modeling methods.
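
As a rough sketch of the idea (not necessarily the method the full tip recommends), base-`R` can build a formula from strings with `paste()` plus `as.formula()`, or with `stats::reformulate()`. The `mtcars` data and the particular columns are chosen here only for illustration.

```r
# Column names held as plain data (chosen only for illustration)
outcome <- "mpg"
variables <- c("cyl", "disp", "hp", "wt")

# Build the formula from strings, outside of the modeling call
f <- as.formula(paste(outcome, paste(variables, collapse = " + "), sep = " ~ "))
print(f)

# stats::reformulate() builds the same formula directly from the vector
f2 <- reformulate(variables, response = outcome)

# Pass the formula object to lm()
model <- lm(f, data = mtcars)
print(coef(model))
```

Because `variables` is just a character vector, it can be filtered, extended, or generated by other code before the model is fit.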


## data.table is Really Good at Sorting

The `data.table` `R` package is really good at sorting. Below is a comparison of it versus `dplyr` for a range of problem sizes.


## How to use rquery with Apache Spark on Databricks

A big thank you to Databricks for working with us and sharing:

rquery: Practical Big Data Transforms for R-Spark Users
How to use rquery with Apache Spark on Databricks

rquery on Databricks is a great data science tool.

## Speed up your R Work

### Introduction

In this note we will show how to speed up work in `R` by partitioning data and process-level parallelization. We will show the technique with three different `R` packages: `rqdatatable`, `data.table`, and `dplyr`. The methods shown will also work with base-`R` and other packages.

For each of the above packages we speed up work by using `wrapr::execute_parallel`, which in turn uses `wrapr::partition_tables` to partition unrelated `data.frame` rows and then distributes them to different processors to be executed. `rqdatatable::ex_data_table_parallel` conveniently bundles all of these steps together when working with `rquery` pipelines.

The user specifies the partitioning by preparing a grouping column that tells the system which sets of rows must be kept together for the calculation to be correct. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
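
The same idea can be sketched in base-`R` with the `parallel` package, without relying on the `wrapr` helpers named above. This is a simplified illustration under stated assumptions, not the `wrapr` implementation: the column name `grp`, the function `score_group()`, and the example data are all hypothetical, and each group is sent as its own partition.

```r
library(parallel)

# Example data: "grp" is the grouping column (a hypothetical name)
set.seed(2018)
d <- data.frame(
  grp = rep(c("a", "b", "c", "d"), each = 5),
  x = runif(20),
  stringsAsFactors = FALSE)

# Per-group calculation (hypothetical): needs all rows of a group together
score_group <- function(di) {
  di$frac <- di$x / sum(di$x)
  di
}

# Partition rows by the grouping column; rows in different groups are unrelated
parts <- split(d, d$grp)

# Process the partitions in separate R worker processes
cl <- makeCluster(2)
res_parts <- parLapply(cl, parts, score_group)
stopCluster(cl)

# Re-assemble the per-partition results into one data.frame
res <- do.call(rbind, res_parts)
rownames(res) <- NULL
print(head(res))
```

Starting worker processes and copying data to them has a cost, so this pattern tends to pay off only when the per-partition work is substantial.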


## seplyr 0.5.8 Now Available on CRAN

We are pleased to announce that seplyr version 0.5.8 is now available on CRAN.

seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part-time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.


## wrapr 1.5.0 available on CRAN

The `R` package wrapr 1.5.0 is now available on CRAN.

wrapr includes a lot of tools for writing better `R` code:

I’ll be writing articles on a number of the new capabilities. For now I just leave you with the nifty operator coalesce notation.


## R Tip: use isTRUE()

R Tip: use `isTRUE()`.

A lot of R functions are type unstable, which means they return different types or classes depending on details of their values.

For example, consider `all.equal()`. It returns the logical value `TRUE` when the items being compared are equal:

```
all.equal(1:3, c(1, 2, 3))
# [1] TRUE
```

However, when the items being compared are not equal, `all.equal()` instead returns a message:

```
all.equal(1:3, c(1, 2.5, 3))
# [1] "Mean relative difference: 0.25"
```

This makes it inconvenient to use functions such as `all.equal()` as tests in `if()`-statements and other program control structures.

The saving function is `isTRUE()`. `isTRUE()` returns `TRUE` if its argument value is equivalent to `TRUE`, and returns `FALSE` otherwise. `isTRUE()` makes `R` programming much easier.
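
A minimal sketch of the pattern follows, using the same unequal vectors as above. The variable names `a`, `b`, `cmp`, and `status` are chosen only for illustration.

```r
a <- 1:3
b <- c(1, 2.5, 3)

# all.equal() returns a character string here, so using it directly
# as an if() condition would signal an error
cmp <- all.equal(a, b)

# isTRUE() always yields a single TRUE or FALSE, which is safe in if()
if (isTRUE(cmp)) {
  status <- "equal"
} else {
  status <- "different"
}
print(status)
```

Wrapping the comparison in `isTRUE()` means the `if()` statement behaves the same way whether `all.equal()` returns `TRUE` or a message.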


## WVPlots now at version 1.0.0 on CRAN!

Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. We are excited to announce that WVPlots is now at version 1.0.0 on CRAN!