In our recent note What is new for
rquery December 2019 we mentioned an ugly processing pipeline that translates into
SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the
Continue reading Better SQL Generation via the data_algebra
I was working through Kyle Miller‘s excellent note: “Tail call recursion in Python”, and decided to experiment with variations of the techniques.
The idea is: one may want to eliminate use of the
Python language call-stack in the case of a “tail calls” (a function call where the result is not used by the calling function, but instead immediately returned). Tail call elimination can both speed up programs, and cut down on the overhead of maintaining intermediate stack frames and environments that will never be used again.
The note correctly points out that
Python purposely does not have a
goto statement, a tool one might use to implement true tail call elimination. So Kyle Miller built up a data-structure based replacement for the call stack, which allows one to work around the stack-limit for a specific function (without changing any
Python configuration, and without changing the behavior of other functions).
Python does have some exotic control-flow controls:
yield. So I decided to build an
exception based solution of our own using
Please read on for how we do this, and for some examples.
Continue reading Eliminating Tail Calls in Python Using Exceptions
It usually gives us a chuckle when we find some natural and seemingly easy data science question is NP-hard. For instance we have written that variable pruning is NP-hard when one insists on finding a minimal sized set of variables (and also why there are no obvious methods for exact large permutation tests).
In this note we show that finding a minimal set of columns that form a primary key in a database is also NP-hard.
Continue reading Minimal Key Set is NP hard
Here at Win-Vector LLC we like permutation tests. Our team has written on them (for example: How Do You Know if Your Data Has Signal?) and they are used to estimate significances in our sigr and WVPlots R packages. For example permutation methods are used to estimate the significance reported in the following ROC plot.
Permutation tests have their own literature and issues (examples: Permutation, Parametric and Bootstrap Tests of Hypotheses, Springer-Verlag, NY, 1994 (3rd edition, 2005), 2, 3, and 4).
In our R packages the permutation tests are estimated by a sampling procedure, and not computed exactly (or deterministically). It turns out this is likely a necessary concession; a complete exact permutation test procedure at scale would be big news. Please read on for my comments on this issue.
Continue reading Why No Exact Permutation Tests at Scale?
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”
The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.
Continue reading Base R can be Fast
Win-Vector LLC recently announced the
R package, an operator based query generator.
In this note I want to share some exciting and favorable initial rquery benchmark timings.
Continue reading rquery: Fast Data Manipulation in R
In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a very deep understanding of array indexing and the deep interpretation available when we realize indexing is an instance of function composition (or an example of permutation groups or semigroups: some very deep yet accessible pure mathematics).
A permutation of indices
In this article I will be working hard to convince you a very fundamental true statement is in fact true: array indexing is associative; and to simultaneously convince you that you should still consider this amazing (as it is a very strong claim with very many consequences). Array indexing respecting associative transformations should not be a-priori intuitive to the general programmer, as array indexing code is rarely re-factored or transformed, so programmers tend to have little experience with the effect. Consider this article an exercise to build the experience to make this statement a posteriori obvious, and hence something you are more comfortable using and relying on.
R‘s array indexing notation is really powerful, so we will use it for our examples. This is going to be long (because I am trying to slow the exposition down enough to see all the steps and relations) and hard to follow without working examples (say with
R), and working through the logic with pencil and a printout (math is not a spectator sport). I can’t keep all the steps in my head without paper, so I don’t really expect readers to keep all the steps in their heads without paper (though I have tried to organize the flow of this article and signal intent often enough to make this readable). Continue reading On indexing operators and composition
It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns
process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation,
replyr::gapply, from our latest package,
Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.
The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.
Continue reading A Simple Example of Using replyr::gapply
Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:
dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
Species sdlower mean sdupper iqrlower median iqrupper
1 setosa 4.653510 5.006 5.358490 4.8000 5.0 5.2000
2 versicolor 5.419829 5.936 6.452171 5.5500 5.9 6.2500
3 virginica 5.952120 6.588 7.223880 6.1625 6.5 6.8375
For a specific data frame, with known column names, such a table is easy to construct using
dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in
dplyr can get quite hairy, quite quickly. Try it yourself, and see.
let, from our new package
Continue reading Using replyr::let to Parameterize dplyr Expressions
I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing some examples is that summaries such as model quality (especially out of sample quality) and variable significances are not quite as simple as one would hope (they in fact lack a lot of the monotone structure or submodular structure that would make things easy).
That being said we have a lot of powerful and effective heuristics to discuss in upcoming articles. I am going to leave such positive results for my later articles and here concentrate on an instructive technical negative result: picking a good subset of variables is theoretically quite hard. Continue reading Variable pruning is NP hard