Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, StatisticsTags , , , , , 4 Comments on Data Preparation, Long Form and tl;dr Form

Data Preparation, Long Form and tl;dr Form

Data preparation and cleaning are some of the most important steps of predictive analytic and data science tasks. They are laborious, where most of the errors are made, your last line of defense against a wild data, and hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self contained or genteel as tweaking machine learning models or hyperparameter tuning; and that is one of the reasons data preparation represents such an important practical opportunity for improvement.


Photo: NY –, License: Creative Commons 3 – CC BY-SA 3.0

Our group is distributing a detailed writeup of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form. Continue reading Data Preparation, Long Form and tl;dr Form

Posted on Categories StatisticsTags , , Leave a comment on Does replyr::let work with data.table?

Does replyr::let work with data.table?

I’ve been asked if the adapter “let” from our R package replyr works with data.table.

My answer is: it does work. I am not a data.table user so I am not the one to ask if data.table benefits a from a non-standard evaluation to standard evaluation adapter such as replyr::let. Continue reading Does replyr::let work with data.table?

Posted on Categories Coding, Opinion, Programming, Statistics, TutorialsTags , , , 7 Comments on Comparative examples using replyr::let

Comparative examples using replyr::let

Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier.

Archie’s Mechanics #2 (1954) copyright Archie Publications

(edit: great news! CRAN just accepted our replyr 0.2.0 fix release!)

Please read on for examples comparing standard notations and replyr::let. Continue reading Comparative examples using replyr::let

Posted on Categories Coding, Computer Science, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, StatisticsTags , Leave a comment on A Simple Example of Using replyr::gapply

A Simple Example of Using replyr::gapply

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, replyr::gapply, from our latest package, replyr.

4140852348 2ebe864822 z
Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.

The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.

Continue reading A Simple Example of Using replyr::gapply

Posted on Categories AdministrativiaTags , , 2 Comments on A bit on the formatting of code on this site and HTML/RSS

A bit on the formatting of code on this site and HTML/RSS

I am running into what looks like a WordPress bug involving formatting of code blocks. I think this is mostly affecting our RSS subscribers. They have been seeing posts rendered almost entirely in ugly fixed-font, the font error typically starting after the first substantial code-block in an article.

I’d like to apologize for any trouble this may be causing. I am looking into it, but I don’t currently have a solution. A work-around would be to not attempt to put pre-rendered code blocks into code font, but I would rather wait on a fix. I do have a diagnosis (it is likely a WordPress issue, and not user error, editor weirdness, or an RSS fault). (edit: please see the comments below for the solution, I was wrong to nest pre inside code- but I still think the WordPress transformations that made things much worse and are in fact a bug.) If you are interested in the details (or can help) please read on. Continue reading A bit on the formatting of code on this site and HTML/RSS

Posted on Categories Opinion, StatisticsTags , , , , , , 3 Comments on Organize your data manipulation in terms of “grouped ordered apply”

Organize your data manipulation in terms of “grouped ordered apply”

Consider the common following problem: compute for a data set (say the infamous iris example data set) per-group ranks. Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.

Iris germanica Purple bearded Iris Wakehurst Place UK DiliffIris, by DiliffOwn work, CC BY-SA 3.0, Link

In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”. Continue reading Organize your data manipulation in terms of “grouped ordered apply”

Posted on Categories Opinion, Programming, RantsTags , , , , 10 Comments on magrittr’s Doppelgänger

magrittr’s Doppelgänger

R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice.

If you read my last article on assignment carefully you may have noticed I wrote some code that was equivalent to a magrittr pipeline without using the “%>%” operator. This note will expand (tongue in cheek) that notation into an alternative to magrittr that you should never use.


Superman #169 (May 1964, copyright DC)

What follows is a joke (though everything does work as I state it does, nothing is faked). Continue reading magrittr’s Doppelgänger

Posted on Categories Opinion, Programming, Rants, StatisticsTags , , , 27 Comments on The Case For Using -> In R

The Case For Using -> In R

R has a number of assignment operators (at least “<-“, “=“, and “->“; plus “<<-” and “->>” which have different semantics).

The R-style guides routinely insist on “<-” as being the only preferred form. In this note we are going to try to make the case for “->” when using magrittr pipelines. [edit: After reading this article, please be sure to read Konrad Rudolph’s masterful argument for using only “=” for assignment. He also demonstrates a function to land values from pipelines (though that is not his preference). All joking aside, the value-landing part of the proposal does not violate current style guidelines.]

Honore Daumier 017 Don Quixote

Don Quijote and Sancho Panza, by Honoré Daumier

Continue reading The Case For Using -> In R

Posted on Categories Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , , 2 Comments on The case for index-free data manipulation

The case for index-free data manipulation

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This is how R data.frames describe themselves (try “str(data.frame(x=1:2))” in an R-console to see this) and is part of the tidy data manifesto.

Tools like SQL (structured query language) and dplyr can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below. Continue reading The case for index-free data manipulation