R has a number of assignment operators (at least “
=“, and “
->“; plus “
<<-” and “
->>” which have different semantics).
R-style guides routinely insist on “
<-” as being the only preferred form. In this note we are going to try to make the case for “
->” when using magrittr pipelines. [edit: After reading this article, please be sure to read Konrad Rudolph’s masterful argument for using only “
=” for assignment. He also demonstrates a function to land values from pipelines (though that is not his preference). All joking aside, the value-landing part of the proposal does not violate current style guidelines.]
Don Quijote and Sancho Panza, by Honoré Daumier
Continue reading The Case For Using -> In R
Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This is how
data.frames describe themselves (try “
str(data.frame(x=1:2))” in an
R-console to see this) and is part of the tidy data manifesto.
SQL (structured query language) and
dplyr can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below. Continue reading The case for index-free data manipulation
Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:
dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
Species sdlower mean sdupper iqrlower median iqrupper
1 setosa 4.653510 5.006 5.358490 4.8000 5.0 5.2000
2 versicolor 5.419829 5.936 6.452171 5.5500 5.9 6.2500
3 virginica 5.952120 6.588 7.223880 6.1625 6.5 6.8375
For a specific data frame, with known column names, such a table is easy to construct using
dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in
dplyr can get quite hairy, quite quickly. Try it yourself, and see.
let, from our new package
Continue reading Using replyr::let to Parameterize dplyr Expressions
When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using
R libraries that assume you know the variable names. The
R data manipulation library
dplyr currently supports parametric treatment of variables through “underbar forms” (methods of the form
dplyr::*_), but their use can get tricky.
Rube Goldberg machine 1931 (public domain).
Better support for parametric treatment of variable names would be a boon to
dplyr users. To this end the
replyr package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete
dplyr code to be used as if it was parametric. Continue reading Parametric variable names and dplyr
One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.
This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio of 1.4881639, and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
The need for a convenient direct F-test without accidentally triggering the implicit re-scaling that is associated with calculating a correlation is one of the reasons we supply the sigr R library. However, even then things can become confusing.
Please read on for a nasty little example. Continue reading Be careful evaluating model predictions
Nina Zumel and I are happy to announce a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP].
vtreat is an R
data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems
vtreat defends against include:
NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training).
vtreat::prepare should be your first choice for real world data preparation and cleaning.
We hope this article will make getting started with
vtreat much easier. We also hope this helps with citing the use of
vtreat in scientific publications. Continue reading vtreat data cleaning and preparation article now available on arXiv
It is a bit of a shock when R
dplyr users switch from using a
tbl implementation based on R in-memory
data.frames to one based on a remote database or service. A lot of the power and convenience of the
dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with
dplyr in one modality and hope to move to another back-end without significant debugging and work-arounds.
replyr attempts to provide a few helpful work-arounds.
Our new package
replyr supplies methods to get a grip on working with remote
tbl sources (SQL databases, Spark) through
dplyr. The idea is to add convenience functions to make such tasks more like working with an in-memory
data.frame. Results still do depend on which
dplyr service you use, but with
replyr you have fairly uniform access to some useful functions.
Continue reading New R package: replyr (get a grip on remote dplyr data services)
I have previously written on using containerized PostgreSQL with R. This show the steps for using containerized MySQL with R. Continue reading MySql in a container
Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of.
I have written before how I think this book stands out and why you should consider studying from it.
Please read on for a some additional comments on the intent of different sections of the book. Continue reading Teaching Practical Data Science with R
Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes.
In a sort of “burying the lede” way I feel we may not have sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.
In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, 1 being perfect and below zero worse than using an average). The training pane represents performance on the training data (perfect, but over-fit) and the test pane represents performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.
Please read on for how to fix this. Continue reading You should re-encode high cardinality categorical variables