Dr. Nina Zumel will be presenting “Myths of Data Science: Things you Should and Should Not Believe”,
Sunday, October 29, 2017
10:00 AM to 12:30 PM at the She Talks Data Meetup (Bay Area).
ODSC West 2017 is soon. It is our favorite conference and we will be giving both a workshop and a talk.
Thursday Nov 2 2017,
“Modeling big data with R, Sparklyr, and Apache Spark”,
Workshop/Training intermediate, 4 hours,
by Dr. John Mount (link).
Friday Nov 3 2017,
“Myths of Data Science: Things you Should and Should Not Believe”,
Data Science lecture beginner/intermediate, 45 minutes,
by Dr. Nina Zumel (link, length, abstract, and title to be corrected).
We really hope you can make these talks.
On the “R for big data” front we have some big news: the replyr package now implements pivot/un-pivot (or what tidyr calls spread/gather) for big data (databases and Sparklyr). This data shaping ability adds a lot of user power. We call the theory “coordinatized data” and the work practice “fluid data”.
In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling in R.
We will use the lme4 package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The lme4 documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.
The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called random effects, a term that refers to the randomness in the probability model for the group-level coefficients….
The term fixed effects is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…
– Gelman and Hill 2007, Chapter 11.4
We will also restrict ourselves to the case that vtreat considers: partially pooled estimates of conditional group expectations, with no other predictors considered.
One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.
By default, vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat‘s coding. This often isn’t a problem — but sometimes, it may be.
So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.
There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.
Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.
I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.
The p-value is a valid frequentist statistical concept that is much abused and mis-used in practice. In this article I would like to call out a few features of p-values that can cause problems in evaluating summaries.
Keep in mind: p-values are useful and routinely taught correctly in statistics, but very often mis-remembered or abused in practice.
Question: how hard is it to count rows using the R package dplyr?
Answer: surprisingly difficult.
When trying to count rows using dplyr or dplyr controlled data-structures (remote tbls such as Sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task being to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "dplyr inferno").
While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.