To understand computations in R, two slogans are helpful:
- Everything that exists is an object.
- Everything that happens is a function call.
In R, the “[” array access operator is a function call, and it is one a user can re-bind to a new effect of their own choosing.
Let’s see what sort of mischief we can get into using this capability.
Continue reading You Can Override Just About Anything in R
I was working through Kyle Miller’s excellent note: “Tail call recursion in Python”, and decided to experiment with variations of the techniques.
The idea is: one may want to eliminate use of the Python language call-stack in the case of “tail calls” (function calls where the result is not used by the calling function, but instead immediately returned). Tail call elimination can both speed up programs and cut down on the overhead of maintaining intermediate stack frames and environments that will never be used again.
The note correctly points out that Python purposely does not have a goto statement, a tool one might use to implement true tail call elimination. So Kyle Miller built up a data-structure based replacement for the call stack, which allows one to work around the stack limit for a specific function (without changing any Python configuration, and without changing the behavior of other functions).
Python does have some exotic control-flow constructs, such as yield. So I decided to build an exception-based solution of our own.
Please read on for how we do this, and for some examples.
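As a rough illustration of the exception-based idea (a minimal sketch, not the code from the full post): a function signals a tail call by raising an exception that carries the next call, and a small driver loop catches it and makes the call from a fixed stack depth. The names `TailCall`, `run`, and `count_down` are hypothetical, chosen only for this example.

```python
class TailCall(Exception):
    """Hypothetical exception that carries the next call to make."""

    def __init__(self, fn, *call_args, **call_kwargs):
        super().__init__()
        self.fn = fn
        self.call_args = call_args
        self.call_kwargs = call_kwargs


def run(fn, *args, **kwargs):
    """Driver loop: call fn, and keep re-dispatching while it raises TailCall."""
    while True:
        try:
            return fn(*args, **kwargs)
        except TailCall as tc:
            # The raising frame has already unwound, so the stack does not grow.
            fn, args, kwargs = tc.fn, tc.call_args, tc.call_kwargs


def count_down(n, acc=0):
    """Sum n + (n-1) + ... + 1, written in tail-call style."""
    if n <= 0:
        return acc
    # Instead of `return count_down(n - 1, acc + n)`, raise the tail call.
    raise TailCall(count_down, n - 1, acc + n)


# Far deeper than Python's default recursion limit would allow directly.
total = run(count_down, 100000)
```

The trade-off in this sketch is that the function must be started through `run`, and every tail call pays the cost of raising and catching an exception.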
Continue reading Eliminating Tail Calls in Python Using Exceptions
We at Win-Vector LLC have some big news.
We are finally porting a streamlined version of our R vtreat variable preparation package to Python.
vtreat is a great system for preparing messy data for supervised machine learning.
The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their limit. In particular we have found the .fit_transform() pattern is a great way to express building up a cross-frame to avoid nested model bias (in this case .fit_transform() != .fit().transform()). There is a bit of difference in how object oriented APIs compose versus how functional APIs compose. We are making an effort to research how to make this an advantage, and not a liability.
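To make the .fit_transform() != .fit().transform() point concrete, here is a toy sketch in plain Python. It is not the vtreat API: `MeanEncoder` is a hypothetical class that replaces a categorical level with the mean outcome for that level. Its `fit_transform()` returns a cross-frame, in which each row is encoded using only rows from other folds, so no row sees its own outcome.

```python
from collections import defaultdict


class MeanEncoder:
    """Hypothetical toy encoder: category -> mean outcome for that category."""

    def __init__(self, n_folds=3):
        self.n_folds = n_folds
        self.means_ = {}
        self.default_ = 0.0

    @staticmethod
    def _level_means(xs, ys):
        sums, counts = defaultdict(float), defaultdict(int)
        for x, y in zip(xs, ys):
            sums[x] += y
            counts[x] += 1
        return {level: sums[level] / counts[level] for level in sums}

    def fit(self, xs, ys):
        self.means_ = self._level_means(xs, ys)
        self.default_ = sum(ys) / len(ys)
        return self

    def transform(self, xs):
        # Used for new data: apply the encoding learned from all training rows.
        return [self.means_.get(x, self.default_) for x in xs]

    def fit_transform(self, xs, ys):
        # Fit on everything (for later use on new data), but return a
        # cross-frame: each row is encoded from the other folds only.
        self.fit(xs, ys)
        out = [None] * len(xs)
        for fold in range(self.n_folds):
            train = [i for i in range(len(xs)) if i % self.n_folds != fold]
            means = self._level_means([xs[i] for i in train],
                                      [ys[i] for i in train])
            default = sum(ys[i] for i in train) / len(train)
            for i in range(fold, len(xs), self.n_folds):
                out[i] = means.get(xs[i], default)
        return out


xs = ["a", "a", "b", "b", "a", "b"]
ys = [1, 0, 1, 1, 0, 1]
naive = MeanEncoder().fit(xs, ys).transform(xs)   # rows see their own outcome
cross = MeanEncoder().fit_transform(xs, ys)       # rows do not
```

In this sketch the two results differ on purpose: the naive encoding of the training data leaks each row's outcome into its own feature, which is the nested model bias the cross-frame is meant to avoid.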
The new repository is here. And we have a non-trivial worked classification example. Next up is multinomial classification. After that a few validation suites to prove the two vtreat systems work similarly. And then we have some exciting new capabilities.
The first application is going to be a shortening and streamlining of our current 4-day data science in Python course (while allowing more concrete examples!).
This also means data scientists who use both R and Python will have a few more tools that present similarly in each language.
Here is a simple modeling problem in R.
We want to fit a linear model where the names of the data columns carrying the outcome to predict (y), the explanatory variables (x2), and per-example row weights (wt) are given to us as string values in variables.
Continue reading Programming Over lm() in R
What R users now call piping, popularized by Stefan Milton Bache and Hadley Wickham, is inline function application (this is notationally similar to, but distinct from, the powerful interprocess communication and concurrency tool introduced to Unix by Douglas McIlroy in 1973). In object oriented languages this sort of notation for function application has been called “method chaining” since the days of Smalltalk (~1972). Let’s take a look at method chaining in Python, in terms of pipe notation.
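As a small illustration of the correspondence (a sketch with hypothetical names, not the code from the full post): `Numbers` is a toy class whose methods return new objects so they can be chained, and `Pipeable` is a toy wrapper that overloads `>>` so that `value >> f` means `f(value)`.

```python
class Numbers:
    """Hypothetical toy container whose methods support chaining."""

    def __init__(self, items):
        self.items = list(items)

    def keep_even(self):
        return Numbers(x for x in self.items if x % 2 == 0)

    def squared(self):
        return Numbers(x * x for x in self.items)

    def total(self):
        return sum(self.items)


class Pipeable:
    """Hypothetical wrapper: `Pipeable(v) >> f` applies f to v."""

    def __init__(self, value):
        self.value = value

    def __rshift__(self, fn):
        return Pipeable(fn(self.value))


# Method chaining: each call is applied to the result of the previous one.
chained = Numbers(range(6)).keep_even().squared().total()

# Pipe notation: the same sequence of steps, written as inline function application.
piped = (Pipeable(Numbers(range(6)))
         >> Numbers.keep_even
         >> Numbers.squared
         >> Numbers.total).value
```

Both expressions perform the same steps in the same left-to-right order; only the notation differs.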
Continue reading Piping is Method Chaining
The (matter of opinion) claim:
“When the use of C++ is very limited and easy to avoid, perhaps it is the best option to do that […]”
(source discussed here)
got me thinking: does our own RcppDynProg package actually use C++ in a significant way? Could/should I port it to C? Am I informed enough to use something as complicated as C++ correctly?
Continue reading Why RcppDynProg is Written in C++
There is a lot of unnecessary worry over “Non Standard Evaluation” (NSE) in R versus “Standard Evaluation” (SE, or standard “variable names refer to values” evaluation). This very author is guilty of over-discussing the issue. But let’s give this yet another try.
The entire difference between NSE and regular evaluation can be summed up in the following simple table (which should be clear after we work some examples).
Continue reading Standard Evaluation Versus Non-Standard Evaluation in R
From https://tidyr.tidyverse.org/dev/articles/pivot.html (text by Hadley Wickham):
For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.
There are two important new features inspired by other R packages that have been advancing reshaping in R:
- The reshaping operation can be specified with a data frame that describes precisely how metadata stored in column names becomes data variables (and vice versa). This is inspired by the cdata package by John Mount and Nina Zumel. For simple uses of pivot_wide(), this specification is implicit, but for more complex cases it is useful to make it explicit, and operate on the specification data frame.
- pivot_long() can work with multiple value variables that may have different types. This is inspired by the enhanced dcast() functions provided by the data.table package by Matt Dowle and Arun Srinivasan.
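The idea of driving a reshape from a specification table can be sketched in a few lines of plain Python. This is an illustration only, not the cdata or tidyr API: `unpivot_with_spec` is a hypothetical helper, and the specification is a list of dictionaries standing in for a data frame. Each specification row gives a literal key value, and its remaining cells name the wide columns to read values from.

```python
def unpivot_with_spec(rows, spec, id_cols, key_col):
    """Hypothetical helper: expand each wide row into one long row per spec row."""
    out = []
    for row in rows:
        for spec_row in spec:
            new_row = {c: row[c] for c in id_cols}
            # The key column carries the literal value from the specification.
            new_row[key_col] = spec_row[key_col]
            # Every other spec cell names the wide column to copy a value from.
            for col, source in spec_row.items():
                if col != key_col:
                    new_row[col] = row[source]
            out.append(new_row)
    return out


# Wide records: one row per id, one column per measurement.
wide = [
    {"id": 1, "AUC": 0.6, "R2": 0.2},
    {"id": 2, "AUC": 0.7, "R2": 0.3},
]

# Specification: how column names ("AUC", "R2") become data in a "measure" column.
spec = [
    {"measure": "AUC", "value": "AUC"},
    {"measure": "R2", "value": "R2"},
]

long_rows = unpivot_with_spec(wide, spec, id_cols=["id"], key_col="measure")
```

Because the specification is ordinary data, it can be inspected, edited, or generated programmatically before the reshape is run.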
If you want to work in the above way we suggest giving our cdata package a try. Among the function names we chose is unpivot_to_blocks. The idea was: by emphasizing the record structure one might eventually internalize what the transforms are doing. On the way to that we have a lot of documentation and tutorials.
We recently commented on excess package dependencies as representing risk in the R package ecosystem.
The question remains: how much risk? Is low dependency a mere talisman, or is there evidence it is a good practice (or at least correlates with other good practices)?
Continue reading Quantifying R Package Dependency Risk
I would like to once again recommend to our readers our note on wrapr::let(), an R function that can help you eliminate many problematic NSE (non-standard evaluation) interfaces (and their associated problems) from your R programming tasks.
The idea is to imitate the following lambda-calculus idea:
let x be y in z := ( λ x . z ) y
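The lambda-calculus identity above can be tried directly in Python, where "let x be y in z" becomes an immediately applied lambda. This sketch only illustrates that identity and is not how wrapr::let() itself works; the `let` helper below is a hypothetical name used for this example.

```python
# "let x be 5 in x * x", written as (lambda x: z)(y).
direct = (lambda x: x * x)(5)


def let(body, **bindings):
    """Hypothetical helper: evaluate body with the given names bound to values."""
    return body(**bindings)


# "let x be 3, y be 4 in x * y + 1"
via_helper = let(lambda x, y: x * y + 1, x=3, y=4)
```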
Continue reading wrapr::let()