Categories: Coding, Opinion, Programming, Statistics

## More on safe substitution in R

Let’s worry a bit about substitution in `R`. Substitution is very powerful, which means it can be both used and misused. However, that does not mean every use is unsafe or a mistake.
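As a small illustration of the kind of substitution in question, here is a minimal sketch using base `R`'s `bquote()` and `eval()`. The choice of the built-in `mtcars` data and the variable name `column_name` are assumptions made for this example only; the full post may use different tools.

```r
# Hypothetical example: pick a column by name at run time and
# substitute it into an expression before evaluating it.
column_name <- as.name("mpg")

# bquote() replaces .(column_name) with the symbol it holds.
expr <- bquote(mean(.(column_name)))
print(expr)

# Evaluate the substituted expression using mtcars as the environment.
result <- eval(expr, envir = mtcars)
print(result)
```

The substituted expression can be printed and inspected before it is evaluated, which is one way to keep this style of programming easier to audit.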

Categories: Opinion, Programming, Statistics

## There is usually more than one way in R

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

> There should be one-- and preferably only one --obvious way to do it.

Frankly, in `R` (especially once you add many packages) there is usually more than one way. As an example we will talk about the common `R` functions `str()` and `head()`, and the `tibble` package’s `glimpse()`.
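A quick sketch of the three functions side by side may help. The small data frame `d` is invented for this example, and the `glimpse()` call is guarded because the `tibble` package may not be installed.

```r
# Hypothetical toy data, used only to compare the three views.
d <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Base R: compact structural summary (types plus first values).
str(d)

# Base R: the first rows of the data itself.
first_rows <- head(d, n = 2)
print(first_rows)

# tibble: a transposed, str()-like view; run only if tibble is available.
if (requireNamespace("tibble", quietly = TRUE)) {
  tibble::glimpse(d)
}
```

All three answer roughly the same question ("what is in this data frame?") with different layouts.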

Categories: data science, Opinion, Statistics

## R summary() got better!

Here is a really nice feature found in the current 3.4.0 version of R: summary() has become a lot more reasonable.

```r
summary(15555)

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   15555   15555   15555   15555   15555   15555
```

Please read on for some background.


## Be careful evaluating model predictions

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.

This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio (one pound-second is about 4.4482216 newton-seconds), and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
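The point can be checked numerically. The following is a minimal sketch with made-up impulse values, assuming the standard conversion of roughly 4.4482216 newton-seconds per pound-second; the variable names are hypothetical.

```r
set.seed(2017)

# Hypothetical impulses, expressed in newton-seconds.
impulse_newton_s <- runif(20, min = 1, max = 100)

# The same impulses expressed in pound-seconds (assumed conversion factor).
newton_s_per_lbf_s <- 4.4482216
impulse_lbf_s <- impulse_newton_s / newton_s_per_lbf_s

# Correlation ignores the constant re-scaling, so it reports a perfect score.
correlation <- cor(impulse_newton_s, impulse_lbf_s)
print(correlation)

# A direct error measure compares the numbers actually in hand.
rmse <- sqrt(mean((impulse_newton_s - impulse_lbf_s)^2))
print(rmse)
```

The correlation is 1 (up to floating-point rounding), while the root mean squared error is far from zero: the values agree only after a re-scaling that nobody applied.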

The need for a convenient direct F-test, one that does not accidentally trigger the implicit re-scaling associated with calculating a correlation, is one of the reasons we supply the sigr R package. However, even then things can become confusing.
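To make "direct" concrete, here is a base `R` sketch of an R-squared computed on the predictions as given, with no re-fitting. This is not the sigr implementation; the helper `direct_r_squared()` and the data are hypothetical and only illustrate the quantity such a test would be built on.

```r
# Hypothetical outcome and a perfectly correlated but badly scaled prediction.
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
pred <- 3 * y

# Direct R-squared: one minus residual variation over total variation,
# using pred exactly as supplied (no re-scaling, no re-fitting).
direct_r_squared <- function(y, pred) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

squared_correlation <- cor(y, pred)^2
direct_score <- direct_r_squared(y, pred)

print(squared_correlation)  # 1 up to rounding: the re-scaled prediction would be perfect
print(direct_score)         # negative: the prediction in hand is worse than the mean
```

A negative direct R-squared is the signal that the predictions, as delivered, do worse than always guessing the mean of `y`.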

Please read on for a nasty little example.

Categories: Rants, Statistics, Tutorials

## My criticism of R numeric summary

My criticism of R’s numeric `summary()` method is: it is unfaithful to numeric arguments (due to bad default behavior) and frankly it should be considered unreliable. It is likely the way it is for historical and compatibility reasons, but in my opinion it does not currently represent a desirable set of tradeoffs. `summary()` likely represents good work by high-ability researchers, and the sharp edges are due to historically necessary trade-offs.
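The complaint can be sketched without depending on any particular `R` version. Assuming the older default of rounding summary values to four significant digits (my understanding of the behavior before `R` 3.4.0), the effect on a five-digit input is the same as calling `signif()` directly:

```r
x <- 15555

# Assumption: older summary() defaults rounded values to 4 significant digits.
shown_by_old_default <- signif(x, digits = 4)

print(shown_by_old_default)       # 15560
print(shown_by_old_default == x)  # FALSE: the reported value is not the input
```

The displayed number is not one that appears in the data, which is the sense in which the summary is "unfaithful."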

The Big Lebowski, 1998.

Please read on for some context and my criticism.

Edit 8/25/2016: Martin Maechler generously committed a fix! Assuming this works out in testing it looks like we could see an improvement on this core function in April 2017. I really want to say “thank you” to Martin Maechler and the rest of the team, not only for this, but for all the things they do, and for putting up with me.


## On ranger respect.unordered.factors

It is often said that “R is its packages.”

One package of interest is ranger, a fast parallel C++ implementation of random forest machine learning. Ranger is a great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest package. Actually, this appearance is due to the strange choice of default value `respect.unordered.factors=FALSE` in `ranger::ranger()`, which we strongly advise overriding to `respect.unordered.factors=TRUE` in applications.
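A minimal sketch of the suggested override follows. The toy data frame `d` is invented for illustration, and the model fit is guarded because the ranger package may not be installed. How the argument is interpreted can differ between ranger versions, so check the documentation of your installed version.

```r
set.seed(2016)

# Hypothetical data: one unordered categorical predictor and a numeric outcome.
d <- data.frame(
  group = factor(rep(c("a", "b", "c", "d"), each = 10)),
  y = rep(c(1, 5, 2, 8), each = 10) + rnorm(40, sd = 0.1)
)

if (requireNamespace("ranger", quietly = TRUE)) {
  # Explicitly override the default handling of unordered factors.
  model <- ranger::ranger(
    y ~ group,
    data = d,
    num.trees = 50,
    respect.unordered.factors = TRUE
  )
  print(model)
}
```

Setting the argument explicitly also documents the choice for anyone reading the modeling code later.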

Categories: Coding, Rants

## sample(): “Monkey’s Paw” style programming in R

The R functions `base::sample` and `base::sample.int` are functions that include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested workaround. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due to the need for backwards compatibility). However, that doesn’t mean we can’t take steps to write safer and easier-to-debug code.
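One well-known instance of such a "convenience": when `x` is a single number of at least 1, `sample(x)` draws from `1:x` instead of from the one-element vector. The sketch below shows the surprise and one possible workaround; the helper name `safe_sample()` is hypothetical and not necessarily the fix proposed in the full post.

```r
# Hypothetical helper: always treats x as the population to draw from,
# by sampling positions and then indexing.
safe_sample <- function(x, size = length(x), replace = FALSE) {
  x[sample.int(length(x), size = size, replace = replace)]
}

x <- c(10)  # a population that happens to contain a single value

set.seed(1)
surprising <- sample(x, 5, replace = TRUE)  # draws from 1:10, not from c(10)
safe <- safe_sample(x, 5, replace = TRUE)   # every draw is 10

print(surprising)
print(safe)
```

The failure only shows up when the input happens to have length one, which is exactly what makes it hard to catch in testing.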

“The Monkey’s Paw”, story: William Wymark Jacobs, 1902; illustration Maurice Greiffenhagen.



## Some programming language theory in R

Let’s take a break from statistics and data science to think a bit about programming language theory, and how the theory relates to the programming language used in the R analysis platform (the language is technically called “S”, but we are going to just call the whole analysis system “R”).

Our reasoning is: if you want to work as a modern data scientist you have to program (this is not optional for reasons of documentation, sharing and scientific repeatability). If you do program you are going to have to eventually think a bit about programming theory (hopefully not too early in your studies, but it will happen). Let’s use R’s powerful programming language (and implementation) to dive into some deep issues in programming language theory:

• References versus values
• Function abstraction
• Equational reasoning
• Recursion
• Substitution and evaluation
• Fixed point theory

To do this we will translate some common ideas from a theory called “the lambda calculus” into R (where we can actually execute them). This translation largely involves changing the word “lambda” to “function” and introducing some parentheses (which I think greatly improve readability; part of the mystery of the lambda calculus is how unreadable its preferred notation actually is).
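As a taste of that translation, here is a minimal sketch touching two of the topics listed above, function abstraction and fixed points. The names `identity_fn`, `fixed_point`, and `fact_step` are invented for this example and may not match the notation used in the full post.

```r
# "lambda x. x" becomes "function(x) x".
identity_fn <- function(x) x

# A fixed-point combinator: fixed_point(f) behaves like a function g
# satisfying g = f(g). The inner function(v) wrapper delays the
# self-application x(x) until a value is actually supplied.
fixed_point <- function(f) {
  (function(x) f(function(v) x(x)(v)))(function(x) f(function(v) x(x)(v)))
}

# Factorial written without ever referring to itself by name:
# recursion is supplied from outside through the "self" argument.
fact_step <- function(self) {
  function(n) if (n <= 1) 1 else n * self(n - 1)
}

factorial_fn <- fixed_point(fact_step)
print(factorial_fn(5))  # 120
```

Here recursion is not a built-in language feature being used; it is recovered from plain function abstraction and application.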

Recursive Opus (on a Hyperbolic disk)