
Another R [Non-]Standard Evaluation Idea

Jonathan Carroll had an interesting R language idea: to use @-notation to request value substitution in a non-standard evaluation environment (inspired by MySQL User-Defined Variables).

He even picked the right image:

(Image: Pandora's box)


vtreat: prepare data

This article is on preparing data for modeling in R using vtreat.


wrapr: for sweet R code

This article is on writing sweet R code using the wrapr package.



Iteration and closures in R

I recently read an interesting thread on unexpected behavior in R when creating a list of functions in a loop or iteration. The issue is solved, but I am going to take the liberty of restating the problem (and the fix) more slowly, for clarity.

The issue is: are references or values captured during iteration?

Many users expect values to be captured. Most programming language implementations capture variables or references (leading to strange aliasing issues). It is confusing (especially in R, which pushes so far in the direction of value oriented semantics) and best demonstrated with concrete examples.
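As a minimal sketch of the issue (base R only; the names `fs`, `gs`, and `j` are illustrative and not taken from the thread), a `for` loop that builds a list of functions shows the variable being captured rather than its value, and `local()` shows one common way to capture the value instead:

```r
# Build three functions in a loop; each closes over the *variable* i,
# not the value i had when the function was created.
fs <- list()
for (i in 1:3) {
  fs[[i]] <- function() i
}
# After the loop i is 3, so every function reports 3.
res_ref <- vapply(fs, function(f) f(), integer(1))

# One possible fix: copy the current value into a fresh environment with local().
gs <- list()
for (i in 1:3) {
  gs[[i]] <- local({
    j <- i          # evaluated now and stored in the new environment
    function() j
  })
}
res_val <- vapply(gs, function(f) f(), integer(1))

print(res_ref)  # 3 3 3
print(res_val)  # 1 2 3
```

The second loop works because each call to `local()` creates its own environment, so each function has its own private `j`.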



Please read on for some of the history and future of this issue.


Upgrading to macOS Sierra (nee OSX) for R users

A good fraction of R users use Apple computers. Apple machines historically have sat at a sweet spot of convenience, power, and utility:

  • Convenience: Apple machines are available at retail stores, come with purchasable support, and can run a lot of common commercial software.
  • Power: R packages such as parallel and Rcpp work better on top of a POSIX environment.
  • Utility: OSX was good at interoperating with the Linux your big data systems are likely running on, and some R packages expect a native operating system supporting a POSIX environment (which historically has not been a Microsoft Windows strength, despite claims to the contrary).

Frankly the trade-off is changing:

  • Apple is neglecting its computer hardware and operating system in favor of phones and watches. And (for claimed license prejudice reasons) the lauded OSX/macOS “Unix userland” is woefully out of date (try “bash --version” in an Apple Terminal; it is about 10 years out of date!).
  • Microsoft Windows Unix support is improving (Windows 10 bash is interesting, though R really can’t take advantage of that yet).
  • Linux hardware support is improving (though not fully there for laptops, modern trackpads, touch screens, or even some wireless networking).

Our current R platform remains Apple macOS. But our next purchase is likely a Linux laptop with the addition of a legal copy of Windows inside a virtual machine (for commercial software not available on Linux). It has been a while since Apple last “sparked joy” around here, and if Linux works out we may have a few Apple machines sitting on the curb with paper bags over their heads (Marie Kondo’s advice for humanely disposing of excess inanimate objects that “see”, such as unloved stuffed animals with eyes and laptops with cameras).


That being said: how does one update an existing Apple machine to macOS Sierra and then restore enough functionality to resume working? Please read on for my notes on the process.


Data Preparation, Long Form and tl;dr Form

Data preparation and cleaning are some of the most important steps of predictive analytics and data science tasks. They are laborious, they are where most of the errors are made, they are your last line of defense against wild data, and they hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self-contained or genteel as tweaking machine learning models or hyperparameter tuning; and that is one of the reasons data preparation represents such an important practical opportunity for improvement.



Photo: NY – http://nyphotographic.com/, License: Creative Commons 3 – CC BY-SA 3.0

Our group is distributing a detailed writeup of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so our intent is that you can cite it (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form.


Comparative examples using replyr::let

Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier.
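To make the problem concrete, here is a small base R sketch of parametric programming. It does not use replyr::let; it only illustrates the situation that replyr::let is meant to ease. The helper `add_scaled_column` and its arguments are hypothetical names chosen for illustration:

```r
# Hypothetical helper: column names arrive as strings, so the code can be
# written before the actual column names are known.
add_scaled_column <- function(d, col_name, new_name) {
  values <- d[[col_name]]                 # look up the column by name
  d[[new_name]] <- values / max(values)   # store the result under a supplied name
  d
}

d <- data.frame(x = c(1, 2, 4), y = c(10, 20, 40))
res <- add_scaled_column(d, "y", "y_scaled")
print(res)
```

Base R's `[[` indexing handles this case directly; the difficulty grows when the code uses non-standard evaluation interfaces that expect bare column names.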


Archie’s Mechanics #2 (1954) copyright Archie Publications

(edit: great news! CRAN just accepted our replyr 0.2.0 fix release!)

Please read on for examples comparing standard notations and replyr::let.


Organize your data manipulation in terms of “grouped ordered apply”

Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at once. It is also not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.
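As one possible sketch of the per-group rank computation (base R only, and not necessarily the approach this series develops), `ave()` applies a function within each group and returns the results in the original row order. The column name `rank_in_species` is an illustrative choice:

```r
# Per-Species rank of Sepal.Length using base R's ave().
# rank() uses its default tie handling (tied values share an average rank).
d <- iris
d$rank_in_species <- ave(d$Sepal.Length, d$Species, FUN = rank)

# Show the lowest-ranked rows, ordered by group and then by rank.
ord <- order(d$Species, d$rank_in_species)
print(head(d[ord, c("Species", "Sepal.Length", "rank_in_species")]))
```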


Iris germanica (purple bearded iris), Wakehurst Place, UK. Photo by Diliff, own work, CC BY-SA 3.0.

In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”.


magrittr’s Doppelgänger

R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice.

If you read my last article on assignment carefully you may have noticed I wrote some code that was equivalent to a magrittr pipeline without using the “%>%” operator. This note will expand (tongue in cheek) that notation into an alternative to magrittr that you should never use.



Superman #169 (May 1964, copyright DC)

What follows is a joke (though everything does work as I state it does, nothing is faked).


The Case For Using -> In R

R has a number of assignment operators (at least “<-”, “=”, and “->”; plus “<<-” and “->>”, which have different semantics).
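A short illustrative sketch of these operators in base R (the names `x`, `y`, `z`, `counter`, and `bump` are arbitrary):

```r
x <- 1      # leftward assignment
y = 2       # "=" also assigns at top level
3 -> z      # rightward assignment: value on the left, name on the right

# "<<-" assigns in an enclosing environment rather than the local one.
counter <- 0
bump <- function() {
  counter <<- counter + 1   # updates the existing `counter` outside the function
  invisible(counter)
}
bump()
bump()

print(c(x = x, y = y, z = z, counter = counter))
```

After the two calls to `bump()`, `counter` is 2, because “<<-” modified the variable defined outside the function instead of creating a local copy.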

The R-style guides routinely insist on “<-” as being the only preferred form. In this note we are going to try to make the case for “->” when using magrittr pipelines. [edit: After reading this article, please be sure to read Konrad Rudolph’s masterful argument for using only “=” for assignment. He also demonstrates a function to land values from pipelines (though that is not his preference). All joking aside, the value-landing part of the proposal does not violate current style guidelines.]



Don Quijote and Sancho Panza, by Honoré Daumier

