partition_mutate_qt(): query planners/optimizers that work over
dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and can sequence steps to avoid critical issues (the complementary problems of overly long in-mutate dependence chains, of too many mutate steps, and of incidental bugs; all explained in the linked tutorials).
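For instance, a minimal sketch of the idea, assuming the seplyr package's partition_mutate_qt() interface (the exact call shown here is an assumption; please check the package documentation):

```r
library("seplyr")

# Partition a set of assignments into stages such that no stage
# both creates and uses the same derived value. The "b" assignments
# depend on the "a" assignments, so they should land in a later stage.
plan <- partition_mutate_qt(
  a1 = x + 1,
  a2 = x + 2,
  b1 = a1 + 10,  # depends on a1
  b2 = a2 + 10   # depends on a2
)
print(plan)
```

The planner's job is exactly this: group independent assignments together and split dependent ones apart, keeping each mutate() stage both safe and as large as possible.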
if_else_device(): provides a
dplyr::mutate()-based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as is often seen when porting from SAS) to be directly and legibly translated into performant
dplyr::mutate() data-flow code that works on Spark (via Sparklyr) and databases.
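The translation can be sketched by hand (a minimal illustration of the simulation idea, not the if_else_device() API itself):

```r
library("dplyr")

# imperative pseudocode, as one might see in SAS:
#   if (x > 1) then { y = 'large'; z = 1 }
#   else            { y = 'small' }
# simulated as per-row conditional assignments in one mutate() flow:
d <- data.frame(x = c(0, 2, 5))
d %>%
  mutate(
    test_1 = x > 1,                             # record the condition once
    y = ifelse(test_1, "large", "small"),       # assignment in both branches
    z = ifelse(test_1, 1, NA_real_)             # assignment in one branch only
  )
```

Each branch of the imperative block becomes a conditional column assignment, so the whole construct stays inside a single data-flow pipeline that databases and Spark can execute.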
We have been writing a lot on higher-order data transforms lately:
- Coordinatized Data: A Fluid Data Specification
- Data Wrangling at Scale
- Fluid Data
- Big Data Transforms.
What I want to do now is "write a bit more, so I finally feel I have been concise."
As part of our consulting practice Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using
R and substantial data stores (relational database variants such as
PostgreSQL, or big data systems such as Apache Spark).
Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher order data operators."
- Dr. Nina Zumel will be presenting “Myths of Data Science: Things you Should and Should Not Believe”,
Sunday, October 29, 2017
10:00 AM to 12:30 PM at the She Talks Data Meetup (Bay Area).
- ODSC West 2017 is soon. It is our favorite conference and we will be giving both a workshop and a talk.
Thursday Nov 2 2017,
“Modeling big data with R, Sparklyr, and Apache Spark”,
Workshop/Training intermediate, 4 hours,
by Dr. John Mount (link).
Friday Nov 3 2017,
“Myths of Data Science: Things you Should and Should Not Believe”,
Data Science lecture beginner/intermediate, 45 minutes,
by Dr. Nina Zumel (link, length, abstract, and title to be corrected).
We really hope you can make these talks.
- On the “R for big data” front we have some big news: the replyr package now implements pivot/un-pivot (or what tidyr calls spread/gather) for big data (databases and Sparklyr). This data shaping ability adds a lot of user power. We call the theory “coordinatized data” and the work practice “fluid data”.
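For local data the shape transforms look like the following (tidyr is shown purely for illustration; the replyr versions supply the analogous pivot/un-pivot operations for databases and Sparklyr):

```r
library("tidyr")

# "un-pivoted" (tall) form: one measurement per row
tall <- data.frame(
  id    = c(1, 1, 2, 2),
  key   = c("height", "weight", "height", "weight"),
  value = c(180, 75, 165, 60)
)

# pivot / spread: distinct keys become columns
wide <- spread(tall, key, value)

# un-pivot / gather: back to the tall form
tall2 <- gather(wide, key, value, height, weight)
```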
One of the services that the
vtreat package provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.
vtreat level codes to the difference between the conditional means and the grand mean (
catN variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (
catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the
ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by
vtreat's coding. This often isn't a problem, but sometimes it may be.
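As a concrete hand-rolled illustration of the catN idea (without using vtreat itself): each level is replaced by the difference between its conditional mean of the outcome and the grand mean.

```r
# numeric outcome y, categorical variable x
d <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(1, 2, 3, 5, 10)
)

grand_mean <- mean(d$y)
cond_means <- tapply(d$y, as.character(d$x), mean)

# catN-style impact code: conditional mean minus grand mean, per level
d$x_catN <- cond_means[as.character(d$x)] - grand_mean
print(d)
```

Note how two levels with identical conditional means would receive identical codes, which is exactly the conflation discussed above.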
So the data scientist may want to use a level coding different from what
vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in
vtreat. We assume you are familiar with the basics of
vtreat: the types of derived variables, how to create and apply a treatment plan, etc.
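To keep this discussion self-contained, here is a hand-rolled sketch of the ranger-style ordinal coding mentioned earlier (this is an illustration of the encoding idea, not the vtreat custom-coder interface itself; see the package documentation for that):

```r
# re-code levels by the rank of their conditional mean of the outcome
d <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(5, 7, 1, 2, 10)
)

cond_means   <- tapply(d$y, as.character(d$x), mean)
ordinal_code <- rank(cond_means)   # smallest conditional mean -> 1

d$x_ord <- ordinal_code[as.character(d$x)]
print(d)
```

Unlike impact coding, this keeps all original levels distinct (ties aside), at the cost of only being faithful for order-respecting models such as tree-based methods.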
While working on a large client project using
Sparklyr and multinomial regression we recently ran into a problem:
Apache Spark chooses the order of multinomial regression outcome targets, whereas
R users are used to choosing the order of the targets (please see here for some details). So to make things more like
R users expect, we need a way to translate one order to another.
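The translation itself is a simple permutation. A minimal sketch (the target names here are hypothetical):

```r
# suppose Spark reports per-target results in its own chosen order,
# and we want them in the order the R user specified
spark_order <- c("setosa", "virginica", "versicolor")
r_order     <- c("setosa", "versicolor", "virginica")

# permutation taking Spark's order to the user's order
perm <- match(r_order, spark_order)

coefs_spark_order <- c(0.1, 0.7, 0.2)       # one value per target, Spark's order
coefs_r_order     <- coefs_spark_order[perm] # re-ordered to match r_order
```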
Please check them out (hint:
vtreat is our favorite).
To illustrate this we will work an example.
I have been writing a lot (too much) about
tidyeval lately. The reason is: major changes were recently announced. If you are going to use
dplyr well and correctly going forward you may need to understand some of the new issues (if you don’t use
dplyr you can safely skip all of this). I am trying to work out (publicly) how to best incorporate the new methods into:
- real world analyses,
- reusable packages,
- and teaching materials.
I think some of the apparent discomfort on my part comes from my feeling that
dplyr never really gave standard evaluation (SE) a fair chance. In my opinion:
dplyr is based strongly on non-standard evaluation (NSE, originally through
lazyeval and now through
tidyeval) more by taste and choice than by actual analyst benefit or need.
dplyr isn’t my package, so it isn’t my choice to make; but I can still have an informed opinion, which I will discuss below.
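To make the NSE/SE distinction concrete, here is a small example contrasting the two styles in current dplyr (the SE-style call uses the rlang sym/!! idiom):

```r
library("dplyr")
library("rlang")

d <- data.frame(x = 1:3)

# non-standard evaluation: column names are typed directly into the code
d %>% mutate(y = x + 1)

# standard-evaluation style: column names arrive as strings,
# which is what re-usable functions and packages typically need
xvar <- "x"
yvar <- "y"
d %>% mutate(!!sym(yvar) := !!sym(xvar) + 1)
```

The SE form is what you reach for when the column names are parameters rather than literals; the question is how much ceremony that should require.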