Saw this the other day:
In defense of
wrapr::let() (originally part of
replyr, and still re-exported by that package) I would say:
let() was deliberately designed for a single real-world use case: working with data when you don’t know the column names when you are writing the code (i.e., the column names will come later in a variable). We can re-phrase that as: there is deliberately less to learn as
let() is adapted to a need (instead of one having to adapt to
R community already has months of experience confirming
let() working reliably in production while interacting with a number of different packages.
let() will continue to be a very specific, consistent, reliable, and relevant tool even after
dpyr 0.6.* is released, and the community gains experience with
tidyeval in production.
tidyeval is your thing, by all means please use and teach it. But please continue to consider also using
wrapr::let(). If you are trying to get something done quickly, or trying to share work with others: a “deeper theory” may not be the best choice.
An example follows. Continue reading In defense of wrapr::let()
Our next "R and big data tip" is: summarizing big data.
We always say "if you are not looking at the data, you are not doing science"- and for big data you are very dependent on summaries (as you can’t actually look at everything).
Simple question: is there an easy way to summarize big data in
The answer is: yes, but we suggest you use the
replyr package to do so.
Continue reading Summarizing big data in R
When working with big data with
R (say, using
sparklyr) we have found it very convenient to keep data handles in a neat list or
Please read on for our handy hints on keeping your data handles neat. Continue reading Managing Spark data handles in R
Win-Vector LLC has recently been teaching how to use
R with big data through
sparklyr. We have also been helping clients become productive on
R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big-data using
R, share some hints, and even admit to some of the warts found in this combination of systems.
The ability to perform sophisticated analyses and modeling on “big data” with
R is rapidly improving, and this is the time for businesses to invest in the technology. Win-Vector can be your key partner in methodology development and training (through our consulting and training practices).
J. Howard Miller, 1943.
The field is exciting, rapidly evolving, and even a touch dangerous. We invite you to start using
R and are starting a new series of articles tagged “R and big data” to help you produce production quality solutions quickly.
Please read on for a brief description of our new articles series: “R and big data.” Continue reading New series: R and big data (concentrating on Spark and sparklyr)
In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a very deep understanding of array indexing and the deep interpretation available when we realize indexing is an instance of function composition (or an example of permutation groups or semigroups: some very deep yet accessible pure mathematics).
A permutation of indices
In this article I will be working hard to convince you a very fundamental true statement is in fact true: array indexing is associative; and to simultaneously convince you that you should still consider this amazing (as it is a very strong claim with very many consequences). Array indexing respecting associative transformations should not be a-priori intuitive to the general programmer, as array indexing code is rarely re-factored or transformed, so programmers tend to have little experience with the effect. Consider this article an exercise to build the experience to make this statement a posteriori obvious, and hence something you are more comfortable using and relying on.
R‘s array indexing notation is really powerful, so we will use it for our examples. This is going to be long (because I am trying to slow the exposition down enough to see all the steps and relations) and hard to follow without working examples (say with
R), and working through the logic with pencil and a printout (math is not a spectator sport). I can’t keep all the steps in my head without paper, so I don’t really expect readers to keep all the steps in their heads without paper (though I have tried to organize the flow of this article and signal intent often enough to make this readable). Continue reading On indexing operators and composition
Continue reading dplyr in Context
R users often come to the false impression that the popular packages
tidyr are both all of
R and sui generis inventions (in that they might be unprecedented and there might no other reasonable way to get the same effects in
R). These packages and their conventions are high-value, but they are results of evolution and implement a style of programming that has been available in
R for some time. They evolved in a context, and did not burst on the scene fully armored with spear in hand.
I have written about referential transparency before. In this article I would like to discuss “leaky abstractions” and why
wrapr::let() supplies a useful (but leaky) abstraction for
Continue reading Why to use wrapr::let()
R is a very fluid language amenable to meta-programming, or alterations of the language itself. This has allowed the late user-driven introduction of a number of powerful features such as magrittr pipes, the foreach system, futures, data.table, and dplyr. Please read on for some small meta-programming effects we have been experimenting with.
Continue reading Programming over R
(or: how to correctly use
R has "one-hot" encoding hidden in most of its modeling paths. Asking an
R user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.
For example we can see evidence of one-hot encoding in the variable names chosen by a linear regression:
dTrain <- data.frame(x= c('a','b','b', 'c'),
y= c(1, 2, 1, 2))
summary(lm(y~x, data= dTrain))
## lm(formula = y ~ x, data = dTrain)
## 1 2 3 4
## -2.914e-16 5.000e-01 -5.000e-01 2.637e-16
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0000 0.7071 1.414 0.392
## xb 0.5000 0.8660 0.577 0.667
## xc 1.0000 1.0000 1.000 0.500
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared: 0.5, Adjusted R-squared: -0.5
## F-statistic: 0.5 on 2 and 1 DF, p-value: 0.7071
Continue reading Encoding categorical variables: one-hot and beyond
Authors: John Mount and Nina Zumel
In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.
One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering“) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.
Please read on for our thoughts on teaching pivoting data. Continue reading Teaching pivot / un-pivot