Categories: Rants, Statistics, Tutorials

My criticism of R numeric summary

My criticism of R's numeric summary() method is that it is unfaithful to numeric arguments (due to bad default behavior) and, frankly, should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of tradeoffs. summary() likely represents good work by high-ability researchers, and the sharp edges are due to historically necessary tradeoffs.
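As a concrete illustration of the kind of default in question, here is a minimal sketch. It assumes the default being criticized is the historical rounding of summary() results to four significant digits, and it uses base R's signif() to show what such rounding does to a five-digit value. The variable names are illustrative only.

```r
# A single five-digit value; every order statistic of it should equal the value itself.
x <- 15555

# Round to four significant digits, as an aggressive display default would.
rounded <- signif(x, digits = 4)

print(x)               # the value actually supplied
print(rounded)         # the value after rounding
print(rounded %in% x)  # is the rounded "maximum" actually present in the data?
```

The point of the sketch is that the rounded number is not an element of the data, so a reader of the summary cannot tell what the true minimum or maximum was.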


The Big Lebowski, 1998.

Please read on for some context and my criticism.

Edit 8/25/2016: Martin Maechler generously committed a fix! Assuming this works out in testing, it looks like we could see an improvement to this core function in April 2017. I really want to say “thank you” to Martin Maechler and the rest of the team, not only for this, but for all the things they do, and for putting up with me.


Categories: Tutorials

Using geom_step

geom_step is an interesting geom supplied by the R package ggplot2. It is an appropriate rendering option for financial market data and we will show how and why to use it in this article.
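A minimal sketch of the idea, using made-up quote data (the data frame and its column names are hypothetical, not taken from the article). The reading of "why" assumed here is that a quoted price holds until the next quote arrives, so a step function represents it more honestly than sloped line segments.

```r
# Hypothetical quote data: a price is posted at a time and holds until the next quote.
quotes <- data.frame(
  time  = c(0, 1, 3, 4, 7),
  price = c(100, 102, 101, 105, 104)
)

# Only draw the plot if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # geom_line() would draw sloped segments, implying prices that were never quoted;
  # geom_step() holds each price flat until the next observation.
  p <- ggplot(quotes, aes(x = time, y = price)) +
    geom_step(direction = "hv") +
    geom_point()
  print(p)
}
```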


Categories: Expository Writing, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

On ranger respect.unordered.factors

It is often said that “R is its packages.”

One package of interest is ranger, a fast parallel C++ implementation of random forest machine learning. Ranger is a great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest package. Actually, this appearance is due to the strange choice of default value respect.unordered.factors=FALSE in ranger::ranger(), which we strongly advise overriding to respect.unordered.factors=TRUE in applications.
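The override itself is a single argument. The sketch below uses synthetic data (the data frame, level names, and outcome are all made up for illustration) with a 100-level categorical predictor, and only fits the model if ranger is installed.

```r
set.seed(2016)

# Hypothetical data: the outcome depends on an unordered categorical variable with many levels.
n <- 500
lev <- paste0("lev_", sample.int(100, n, replace = TRUE))
d <- data.frame(
  x = factor(lev),
  y = as.numeric(substring(lev, 5)) %% 2 + rnorm(n, sd = 0.1)
)

if (requireNamespace("ranger", quietly = TRUE)) {
  # Override the default, as the text advises.
  model <- ranger::ranger(y ~ x, data = d, respect.unordered.factors = TRUE)
  print(model$prediction.error)
}
```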

Categories: Uncategorized

For loops in R can lose class information

Did you know R's for() loop control structure drops class annotations from vectors?
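A small demonstration with a Date vector (chosen here for illustration; the full article may use a different class): iterating over the values directly yields bare numbers, while iterating over indices and subsetting keeps the class.

```r
dates <- as.Date(c("2016-01-01", "2016-01-02"))

# Iterating over the values directly: each d is a bare number, not a Date.
classes_direct <- character(0)
for (d in dates) {
  classes_direct <- c(classes_direct, class(d))
}

# Iterating over indices and subsetting keeps the class.
classes_indexed <- character(0)
for (i in seq_along(dates)) {
  classes_indexed <- c(classes_indexed, class(dates[[i]]))
}

print(classes_direct)
print(classes_indexed)
```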

Categories: Coding, Rants

sample(): “Monkey’s Paw” style programming in R

The R function base::sample and a related function include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested workaround. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due to the need for backwards compatibility). However, that doesn’t mean we can’t take steps to write safer and easier-to-debug code.
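One such “convenience” can be seen directly: when its first argument is a single number of at least 1, sample() draws from 1:x instead of from the vector itself. The sketch below shows this, followed by one common defensive pattern (sampling positions and then indexing). The helper name safe_sample is made up here and is not necessarily the workaround the article proposes.

```r
set.seed(1)

# Intended meaning: "draw from the values in x". That holds when x has several elements...
x <- c(10, 20, 30)
draw_many <- sample(x, 2)

# ...but a length-one numeric x >= 1 is reinterpreted as "draw from 1:x".
x <- c(10)
draw_one <- sample(x, 5)  # five values drawn from 1:10

# One defensive pattern: always sample positions, then index into x.
safe_sample <- function(x, size, replace = FALSE) {
  x[sample.int(length(x), size = size, replace = replace)]
}
safe_draw <- safe_sample(c(10), 1)

print(draw_many)
print(draw_one)
print(safe_draw)
```

The danger is that code which works on vectors of length two or more silently changes meaning when the data happens to shrink to one element.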

“The Monkey’s Paw”, story: William Wymark Jacobs, 1902; illustration Maurice Greiffenhagen.


Categories: Coding, Computer Science, Expository Writing, Programming, Tutorials

Some programming language theory in R

Let’s take a break from statistics and data science to think a bit about programming language theory, and how the theory relates to the programming language used in the R analysis platform (the language is technically called “S”, but we are going to just call the whole analysis system “R”).

Our reasoning is: if you want to work as a modern data scientist you have to program (this is not optional for reasons of documentation, sharing and scientific repeatability). If you do program you are going to have to eventually think a bit about programming theory (hopefully not too early in your studies, but it will happen). Let’s use R’s powerful programming language (and implementation) to dive into some deep issues in programming language theory:

  • References versus values
  • Function abstraction
  • Equational reasoning
  • Recursion
  • Substitution and evaluation
  • Fixed point theory

To do this we will translate some common ideas from a theory called “the lambda calculus” into R (where we can actually execute them). This translation largely involves changing the word “lambda” to “function” and introducing some parentheses (which I think greatly improve readability; part of the mystery of the lambda calculus is how unreadable its preferred notation actually is).
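To give a flavor of that translation, here is a short sketch (the function names are illustrative and not taken from the article). It renders two lambda terms as R functions and then uses a fixed-point combinator, in the variant usually called “Z”, to obtain recursion without any function referring to itself by name.

```r
# lambda x. x  becomes:
identity_fn <- function(x) x

# lambda f. lambda x. f (f x)  becomes a function returning a function:
twice <- function(f) function(x) f(f(x))
add_one <- function(n) n + 1
print(twice(add_one)(5))

# A fixed-point combinator (the "Z" variant) gives recursion without self-reference by name.
Z <- function(f) {
  (function(x) f(function(v) x(x)(v)))(function(x) f(function(v) x(x)(v)))
}

# One "step" of factorial: it receives the function to call for the recursive case.
fact_step <- function(self) function(n) if (n <= 1) 1 else n * self(n - 1)
factorial_fp <- Z(fact_step)
print(factorial_fp(5))
```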

Recursive Opus (on a Hyperbolic disk)

Categories: Programming, Tutorials

An R function return and assignment puzzle

Here is an R programming puzzle. What does the following code snippet actually do? And even harder: what does it mean? (See here for some material on the difference between what code does and what code means.)

f <- function() { x <- 5 }
f()

In R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" the code appears to call the function f() and return nothing (nothing is printed). When teaching I often state that you should explicitly use a non-assignment expression as your return value. You should write code such as the following:

f <- function() { x <- 5; x }
f()
## [1] 5

(We are showing an R output as being prefixed with ##.)

But take a look at this:

f <- function() { x <- 5 }
print(f())
## [1] 5

It prints! Read further for what is really going on.
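One way to probe this from the console is base R's withVisible(), which reports both the value a call returns and whether R would auto-print it. This is a sketch for experimentation, not necessarily the explanation the full article gives.

```r
f <- function() { x <- 5 }

# withVisible() reports both the returned value and whether it would auto-print.
res <- withVisible(f())
print(res$value)
print(res$visible)

# Wrapping the call in parentheses (or print()) makes the value visible.
(f())
```

The function does return the assigned value; an assignment simply marks its result as invisible, so nothing prints at top level unless something forces it.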


Categories: Coding, Statistics, Tutorials

Efficient accumulation in R

R has a number of very good packages for manipulating and aggregating data (dplyr, sqldf, ScaleR, data.table, and more), but when it comes to accumulating results the beginning R user is often at sea. The R execution model is a bit exotic so many R users are very uncertain which methods of accumulating results are efficient and which are inefficient.

Accumulating wheat (Photo: Cyron Ray Macey, some rights reserved)

In this latest “R as it is” (again in collaboration with our friends at Revolution Analytics) we will quickly become expert at efficiently accumulating results in R.
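As a generic illustration of the question (which patterns the full article recommends is not shown in this excerpt), the sketch below builds the same data frame two ways: by growing it one row at a time, and by collecting the pieces in a pre-sized list and binding once. The helper row_for is made up for the example.

```r
n <- 500
row_for <- function(i) data.frame(i = i, sq = i^2)

# Pattern 1: grow the result one row at a time (the data is recopied on each step).
grown <- NULL
for (i in seq_len(n)) {
  grown <- rbind(grown, row_for(i))
}

# Pattern 2: collect pieces in a pre-sized list, then bind once at the end.
pieces <- vector("list", n)
for (i in seq_len(n)) {
  pieces[[i]] <- row_for(i)
}
bound <- do.call(rbind, pieces)

# Both give the same values; only the amount of copying differs.
print(identical(grown$sq, bound$sq))
```

Wrapping each pattern in system.time() with a larger n is a quick way to compare their costs on your own machine.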

Categories: Opinion, Programming

R in a 64 bit world

32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?


We discuss this in this article, the first of a new series of articles discussing aspects of “R as it is” that we are publishing with cooperation from Revolution Analytics.
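The 32 bit integer limit is easy to see from the console. The sketch below shows the largest integer value, what happens one past it, and the limit of the usual fallback of storing large whole numbers in doubles. It is an illustration only, not a summary of the article's discussion.

```r
# The largest value R's integer type can hold (2^31 - 1).
print(.Machine$integer.max)

# Going one past it overflows to NA (with a warning, suppressed here).
overflow <- suppressWarnings(.Machine$integer.max + 1L)
print(overflow)

# Doubles can stand in for larger whole numbers, but only exactly up to 2^53.
big <- 2^53
print(big + 1 == big)
```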