Posted on Categories Coding, Opinion, Pragmatic Data Science, Statistics, Tutorials

## R Tip: Think in Terms of Values

`R` tip: first organize your tasks in terms of data, values, and desired transformation of values, not initially in terms of concrete functions or code.

I know I write a lot about coding in `R`. But it is in the service of supporting statistics, analysis, predictive analytics, and data science.

`R` without data is like going to the theater to watch the curtain go up and down.

(Adapted from Ben Katchor’s Julius Knipl, Real Estate Photographer: Stories, Little, Brown, and Company, 1996, page 72, “Excursionist Drama 2”.)

Usually you come to `R` to work with data. If you think and plan in terms of data and values (including introducing more data to control processing) you will usually work in a much faster, more explainable, and more maintainable fashion.
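
As a small illustration of "introducing more data to control processing", here is a minimal sketch in which a lookup table replaces a chain of `if`/`else` branches. The data frame, its column names, and the `to_meters` table are hypothetical and exist only for this example.

```r
# Hypothetical example data: lengths recorded in mixed units.
d <- data.frame(
  value = c(1, 2, 3),
  unit = c("km", "m", "cm"),
  stringsAsFactors = FALSE)

# The conversion rules are data (a named vector), not control flow.
to_meters <- c(km = 1000, m = 1, cm = 0.01)

# One vectorized lookup converts every row at once.
d$meters <- d$value * unname(to_meters[d$unit])
print(d$meters)
#> [1] 1e+03 2e+00 3e-02
```

Supporting a new unit then means adding one entry to `to_meters`, not editing the processing code.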

## R Tip: Use let() to Re-Map Names

Another R tip. Need to replace a name in some R code or make R code re-usable? Use `wrapr::let()`.
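
A minimal sketch of the idea, assuming the `wrapr` package is installed. The data frame `d`, the placeholder `COL`, and the variable `column_of_interest` are hypothetical names chosen for this example.

```r
library("wrapr")

d <- data.frame(x = c(1, 2, 3))

# The column name arrives as a string (a value), not hard-coded in the code.
column_of_interest <- "x"

# let() replaces the placeholder symbol COL with the name "x"
# and then evaluates the expression.
total <- let(
  c(COL = column_of_interest),
  sum(d$COL))
print(total)
#> [1] 6
```

The expression is written once against the stand-in name `COL`, and the mapping decides which actual column it refers to.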

## R Tip: Use `stringsAsFactors = FALSE`

R tip: use `stringsAsFactors = FALSE`.

R often uses a concept of `factor`s to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string.

It is often claimed Sigmund Freud said “Sometimes a cigar is just a cigar.”
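
A short sketch of the difference, using made-up data. Setting the argument explicitly in both calls means the example does not depend on the default of the `R` version in use.

```r
d_chr <- data.frame(name = c("a", "b"), stringsAsFactors = FALSE)
d_fct <- data.frame(name = c("a", "b"), stringsAsFactors = TRUE)

print(class(d_chr$name))
#> [1] "character"
print(class(d_fct$name))
#> [1] "factor"

# A string column accepts any new string.
d_chr$name[1] <- "c"

# A factor column only accepts its existing levels: this assignment
# issues an "invalid factor level" warning and stores NA instead.
d_fct$name[1] <- "c"
print(is.na(d_fct$name[1]))
#> [1] TRUE
```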

## R Tip: Use the vtreat Package For Data Preparation

If you are working with predictive modeling or machine learning in `R` this is the `R` tip that is going to save you the most time and deliver the biggest improvement in your results.

R Tip: Use the `vtreat` package for data preparation in predictive analytics and machine learning projects.

When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:

• Missing, invalid, or out of range values.
• Categorical variables with large sets of possible levels.
• Novel categorical levels discovered during test, cross-validation, or model application/deployment.
• Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
• Nested model bias poisoning results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real-world projects encounter all of these issues, which are often ignored, leading to degraded performance in production.

`vtreat` systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.

`vtreat` can fix or mitigate these domain-independent issues much more reliably and much faster than by-hand ad-hoc methods. This leaves the data scientist or analyst more time to research and apply critical domain-dependent (or knowledge-based) steps and checks.

If you are attempting high-value predictive modeling in `R`, you should try out `vtreat` and consider adding it to your workflow.
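
To give a feel for the workflow, here is a minimal sketch, assuming the `vtreat` package is installed. The tiny data frames and their column names are invented for illustration. The sketch only covers missing values and a novel categorical level; it does not address nested model bias, for which `vtreat` provides separate cross-frame methods.

```r
library("vtreat")

# Hypothetical training data with a missing value and a categorical column.
d_train <- data.frame(
  x = c(1, 2, NA, 4, 5, 6, 7, 8),
  cat = c("a", "a", "b", "b", "a", "b", "a", "b"),
  y = c(1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9),
  stringsAsFactors = FALSE)

# Design a treatment plan for a numeric outcome.
plan <- designTreatmentsN(
  d_train,
  varlist = c("x", "cat"),
  outcomename = "y",
  verbose = FALSE)

# New data with a missing value and a level ("c") never seen in training.
d_new <- data.frame(
  x = c(NA, 3),
  cat = c("c", "a"),
  stringsAsFactors = FALSE)

# Apply the plan: the result is an all-numeric frame without missing values.
d_new_treated <- prepare(plan, d_new)
print(d_new_treated)
```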

## R Tip: Introduce Indices to Avoid `for()` Class Loss Issues

Here is an R tip. Use loop indices to avoid `for()`-loops damaging classes.

Below is an R annoyance that occurs again and again: vectors lose class attributes when you iterate over them in a `for()`-loop.

```r
d <- c(Sys.time(), Sys.time())
print(d)
#> [1] "2018-02-18 10:16:16 PST" "2018-02-18 10:16:16 PST"

for(di in d) {
  print(di)
}
#> [1] 1518977777
#> [1] 1518977777
```

Notice we printed numbers, not dates/times. To avoid this problem introduce an index, and loop over that, not over the vector contents.

```r
for(ii in seq_along(d)) {
  di <- d[[ii]]
  print(di)
}
#> [1] "2018-02-18 10:16:16 PST"
#> [1] "2018-02-18 10:16:16 PST"
```

## R Tip: Use `vector(mode = "list")` to Pre-Allocate Lists

Another R tip. Use `vector(mode = "list")` to pre-allocate lists.

```r
result <- vector(mode = "list", 3)
print(result)
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL
```

The above used to be critical for writing performant R code (R seems to have greatly improved incremental list growth over the years). It remains a convenient thing to know.
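
A typical use, sketched with made-up inputs, is to pre-allocate the list and then fill each slot by index:

```r
inputs <- c(1, 4, 9)

# Pre-allocate one slot per input.
result <- vector(mode = "list", length(inputs))

# Fill the slots by index.
for(ii in seq_along(inputs)) {
  result[[ii]] <- sqrt(inputs[[ii]])
}
print(unlist(result))
#> [1] 1 2 3
```

One caveat: `result[[ii]] <- NULL` deletes the slot instead of storing `NULL`. To store `NULL` explicitly, write `result[ii] <- list(NULL)`.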

## R Tip: Get Out of the Habit of Calling `View()` Directly

R tip: get out of the habit of calling `View()` directly.

`View()` works correctly only in interactive environments; it does not currently work in RMarkdown contexts. It is better to call a wrapper that dispatches to `View()` in an interactive session and to a safe alternative in a non-interactive one.

The following code will work interactively, in `RMarkdown`, or even in a `reprex`.

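```r
#' Invoke a spreadsheet like viewer when appropriate.
#'
#' @param x R object to view
#' @param ... not used, forces later arguments to bind by name
#' @param title title for viewer
#' @param n number of rows to show
#' @return invoke view or format object
#'
view <- function(
  x,
  ...,
  title = as.character(substitute(x)),
  n = 200) {
  UseMethod("view", x)
}

view.data.frame <- function(
  x,
  ...,
  title = as.character(substitute(x)),
  n = 200) {
  # signal an error if any unexpected extra arguments were passed
  wrapr::stop_if_dot_args(substitute(list(...)),
                          "view")
  if(interactive()) {
    View(x, title = title)
  } else {
    if(require("knitr",
               character.only = TRUE,
               quietly = TRUE)) {
      # assumed reconstruction of a truncated call:
      # render the first n rows as a knitr table
      knitr::kable(head(x, n = n),
                   caption = title)
    } else {
      # assumed low dependency fallback: print the first n rows
      print(head(x, n = n))
    }
  }
}

view(mtcars)
```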

The above code is a nice, safe way to view data frames that falls back to a low-dependency solution when needed.

For more on `wrapr::stop_if_dot_args()` please see R Tip: Force Named Arguments.

## R Tip: Make Arguments Explicit in `magrittr`/`dplyr` Pipelines

I think this is the R Tip that is going to be the most controversial yet. Its potential pitfalls include: it is a style prescription (which makes it different from, and less immediately useful than, something like R Tip: Force Named Arguments), and it is heterodox (this is not how `magrittr`/`dplyr` is taught by the original authors, and not how it is commonly used). However, I have not been at all good at anticipating which tips get which sort of reception (and this valuable feedback, public and private, is part of what I get out of this series).

On to the tip (which only applies if you are a `magrittr` pipeline user).

R tip: when using `magrittr` pipelines consider making them more explicit, and more readable (especially to novices) by using explicit dot-arguments throughout.
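
A minimal sketch of the two styles side by side, assuming the `magrittr` package is installed (the input values are made up for the example):

```r
library("magrittr")

# Implicit style: the piped value is silently inserted as the first argument.
r_implicit <- c(1, 4, 9) %>% sqrt %>% sum

# Explicit style: each "." marks exactly where the piped value goes.
r_explicit <- c(1, 4, 9) %>% sqrt(.) %>% sum(.)

print(c(r_implicit, r_explicit))
#> [1] 6 6
```

Both pipelines compute the same value. The explicit form just makes visible where each intermediate result is used.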