Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier.
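To make the idea concrete, here is a minimal base R sketch of name substitution. It is not replyr::let's implementation or interface; the helper name let_sketch and the placeholder COL are hypothetical, chosen only for illustration. The code is written against a placeholder name, and the real column name is supplied later as a string.

```r
# let_sketch is a hypothetical helper, not part of replyr.
# alias: named list mapping placeholder names to real column names (strings).
# expr:  code written in terms of the placeholder names.
let_sketch <- function(alias, expr) {
  e <- substitute(expr)                        # capture the code unevaluated
  replacements <- lapply(alias, as.name)       # "x" becomes the symbol x
  e2 <- do.call(substitute, list(e, replacements))  # swap placeholders for real names
  eval(e2, parent.frame())                     # run the rewritten code in the caller's scope
}

d <- data.frame(x = c(1, 2, 3))

# The expression mentions COL; only at call time do we say COL means "x".
result <- let_sketch(list(COL = "x"), mean(d$COL))
print(result)
```

The point of the design is that the body of the expression reads like ordinary interactive code, while the mapping from placeholder to column lives in one small list.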
It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, replyr::gapply, from our latest package, replyr.
The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.
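For readers new to the pattern, the three steps can be sketched in base R alone. This is not replyr::gapply, and the toy data frame (with assumed column names process and measurement) is invented for illustration; it only shows the shape of the computation.

```r
# Toy long-format data: one row per measurement, tagged by the process that produced it.
d <- data.frame(
  process = c("a", "a", "b", "b", "b"),
  measurement = c(1, 3, 10, 20, 30),
  stringsAsFactors = FALSE
)

# Split: one data frame per process.
pieces <- split(d, d$process)

# Apply: summarize each piece independently.
summaries <- lapply(pieces, function(di) {
  data.frame(
    process = di$process[[1]],
    n = nrow(di),
    mean_measurement = mean(di$measurement),
    stringsAsFactors = FALSE
  )
})

# Combine: stack the per-process results into one table for comparison.
result <- do.call(rbind, summaries)
rownames(result) <- NULL
print(result)
```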
Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:
dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375
For a specific data frame, with known column names, such a table is easy to construct using dplyr::group_by and dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in dplyr can get quite hairy, quite quickly. Try it yourself, and see.
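For comparison, here is one way such a function might look in base R, where column names passed as strings are easy to handle with `[[`. This is a sketch and not the dplyr or replyr::let version the article goes on to discuss. Two assumptions are ours: the sd interval is the mean plus/minus one standard deviation, and the IQR interval is the median plus/minus half the interquartile range (a choice that appears consistent with the numbers in the table above).

```r
# Sketch of dist_intervals in base R; value_col and group_col are column names given as strings.
dist_intervals <- function(d, value_col, group_col) {
  # Split the value column into one numeric vector per group.
  pieces <- split(d[[value_col]], d[[group_col]], drop = TRUE)
  rows <- lapply(names(pieces), function(g) {
    v <- pieces[[g]]
    m <- mean(v)
    s <- sd(v)
    med <- median(v)
    half_iqr <- IQR(v) / 2   # assumption: interval is median +/- IQR/2
    data.frame(
      group = g,
      sdlower = m - s, mean = m, sdupper = m + s,
      iqrlower = med - half_iqr, median = med, iqrupper = med + half_iqr,
      stringsAsFactors = FALSE
    )
  })
  res <- do.call(rbind, rows)
  names(res)[1] <- group_col   # label the grouping column with its original name
  res
}

res <- dist_intervals(iris, "Sepal.Length", "Species")
print(res)
```

The same function applies unchanged to any data frame, for example dist_intervals(mtcars, "mpg", "cyl").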
It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with dplyr in one modality and hope to move to another back-end without significant debugging and work-arounds. replyr attempts to provide a few helpful work-arounds.
Our new package replyr supplies methods to get a grip on working with remote tbl sources (SQL databases, Spark) through dplyr. The idea is to add convenience functions to make such tasks more like working with an in-memory data.frame. Results still do depend on which dplyr service you use, but with replyr you have fairly uniform access to some useful functions.
We are pleased to release a new free data science video lecture: Debugging R code using R, RStudio and wrapper functions. In this 8 minute video we demonstrate the incredible power of R using wrapper functions to catch errors for later reproduction and debugging. If you haven’t tried these techniques this will really improve your debugging game.
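As a rough illustration of the wrapper idea (not the code from the video), the base R sketch below wraps a function so that, on error, the function and its arguments are saved to a file for later replay. The names debug_wrap and fragile are hypothetical.

```r
# debug_wrap is a hypothetical helper: it returns a version of f that,
# on error, saves f and the failing arguments to save_file before re-raising.
debug_wrap <- function(f, save_file) {
  force(f)
  force(save_file)
  function(...) {
    args <- list(...)
    tryCatch(
      do.call(f, args),
      error = function(e) {
        saveRDS(list(f = f, args = args), save_file)
        stop("wrapped call failed, arguments saved to ", save_file,
             ": ", conditionMessage(e), call. = FALSE)
      }
    )
  }
}

# A toy function that fails on some inputs.
fragile <- function(x) {
  if (x > 2) stop("x too large")
  x
}

path <- tempfile(fileext = ".RDS")
wrapped <- debug_wrap(fragile, path)

ok <- wrapped(1)                          # succeeds, nothing is saved
res <- try(wrapped(5), silent = TRUE)     # fails, arguments are captured

# Later, possibly in a fresh session: reload and replay the failing call.
saved <- readRDS(path)
replay <- try(do.call(saved$f, saved$args), silent = TRUE)
```

With the failing inputs on disk, the replayed call can be stepped through at leisure with debug() or browser().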
A corporate site called NPM decided to remove control of a project called “Kik” from its author and give it to a company that claimed to own the trademark on “Kik.” This isn’t actually how trademark law works or we would see the Coca-Cola Company successfully saying we can’t call certain types of coal “coke” (though it is the sort of world the United States’s “Digital Millennium Copyright Act” assumes).
The author of “Kik” decided that, since he obviously never had true control of the distribution of his modules distributed through NPM, he would attempt to remove them (see here). This is the type of issue you worry about when you think about freedoms instead of mere discounts. We are thinking more about this as we recently had to “re-sign” an arbitrarily altered version of Apple’s software license just to run “git status” on our own code.
Tons of code broke because it is currently more stylish to include dependencies than to write code.
Egg was on a lot of faces when it was revealed that one of the modules so critical to include was something called “leftpad.”
NPM forcibly re-published some modules to try and mitigate the damage.
The R functions base::sample and base::sample.int are functions that include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested work around. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due to the need for backwards compatibility). However, that doesn’t mean we can’t take steps to write safer and easier to debug code.
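The excerpt above does not say which convenience it has in mind. One well-documented candidate, described on the help page for sample, is that a single number n >= 1 passed as x is treated as a request to sample from 1:n. The sketch below shows that behavior and a work-around along the lines of the one suggested in that help page; the name safe_sample is ours and hypothetical.

```r
# safe_sample is a hypothetical helper: it always samples from the elements of x,
# even when x happens to have length one.
safe_sample <- function(x, size = length(x), replace = FALSE, prob = NULL) {
  x[sample.int(length(x), size = size, replace = replace, prob = prob)]
}

x <- c(10)   # e.g. a vector of candidate ids that happened to shrink to one element

# base::sample treats the single number 10 as "sample from 1:10".
n_from_base <- length(sample(x))   # 10 values come back, not 1

# The wrapper can only ever return the one element that is actually in x.
from_safe <- safe_sample(x, 1)

print(n_from_base)
print(from_safe)
```

The error is easy to miss because code using sample(x) behaves as expected for every vector of length two or more and changes meaning only in the length-one case.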
A great number of readers reacted very positively to Nina Zumel’s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.
In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. For both, the implementation happened to be supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.
In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).
The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)
“Despite all my rage I am still just a rat in a cage!”
The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.
We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.
We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.
For the rest of this post, we give a quick how-to on using the RPostgreSQL package to interact with Postgres databases in R.