We’ve been experimenting with this for a while, and the next R vtreat package will have a back-port of the Python vtreat package sklearn pipe step interface (in addition to the standard R interface).

# Tag: python

## New vtreat Feature: Nested Model Bias Warning

For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later *naively* used to train a model leads to an undesirable nested model bias. The `vtreat` package (both the `R` version and the `Python` version) incorporates a cross-frame method that allows one to use all the training data both to learn variable re-encodings and to correctly train a subsequent model (for an example please see our recent PyData LA talk).

The next version of `vtreat` will warn the user if they have improperly used the same data for both `vtreat` impact code inference and downstream modeling. So in addition to us warning you not to do this, the package now also checks and warns against this situation. `vtreat` has had methods for avoiding nested model bias for a very long time; we are now adding new warnings to confirm users are using them.
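To make the nested model bias concrete, here is a minimal pure-Python sketch. It is not how `vtreat` is implemented: the helper names (`fit_impact_code`, `apply_impact_code`, `cross_frame`), the fold scheme, and the synthetic data are all assumptions made for illustration. The categorical variable is pure noise, so an honest re-encoding of it should carry no information about the outcome.

```python
import random
from collections import defaultdict


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def fit_impact_code(levels, ys):
    """Hypothetical helper: level -> (mean outcome for level) - (overall mean)."""
    overall = sum(ys) / len(ys)
    sums, counts = defaultdict(float), defaultdict(int)
    for lev, y in zip(levels, ys):
        sums[lev] += y
        counts[lev] += 1
    return {lev: sums[lev] / counts[lev] - overall for lev in sums}


def apply_impact_code(code, levels):
    """Re-encode levels as numbers; levels not seen during fitting map to 0."""
    return [code.get(lev, 0.0) for lev in levels]


def cross_frame(levels, ys, n_folds=5):
    """Hypothetical helper: encode each row using a code fit on the other folds only."""
    n = len(levels)
    out = [0.0] * n
    for fold in range(n_folds):
        held = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        code = fit_impact_code([levels[i] for i in rest], [ys[i] for i in rest])
        for i in held:
            out[i] = code.get(levels[i], 0.0)
    return out


# Synthetic data (an assumption): a many-level categorical that is pure noise.
rng = random.Random(2019)
n_rows, n_levels = 1000, 500
levels = [f"lev_{rng.randrange(n_levels)}" for _ in range(n_rows)]
ys = [rng.randint(0, 1) for _ in range(n_rows)]

# Naive: fit the code and apply it to the very same rows.
naive = apply_impact_code(fit_impact_code(levels, ys), levels)
# Cross-frame: every row is encoded by a code that never saw that row.
crossed = cross_frame(levels, ys)

naive_corr = pearson(naive, ys)
cross_corr = pearson(crossed, ys)
print("naive encoding vs. outcome correlation:      ", round(naive_corr, 3))
print("cross-frame encoding vs. outcome correlation:", round(cross_corr, 3))
```

The naive encoding looks strongly predictive on the training rows even though the variable is noise, which is exactly what would mislead a downstream model. The cross-frame encoding should show a correlation near zero, because no row's outcome was used to compute that row's own code.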

## Set up the Example

This example is excerpted from some of our classification documentation.


## New Timings for a Grouped In-Place Aggregation Task

I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.


## A Richer Category for Data Wrangling

I’ve been writing a lot about category theory interpretations of data-processing pipelines and some of the improvements we feel they are driving in both the `data_algebra` and in `rquery`/`rqdatatable`.

I think I’ve found an even better category theory re-formulation of the package, which I will describe here.

## Better SQL Generation via the data_algebra

In our recent note "What is new for `rquery` December 2019" we mentioned an ugly processing pipeline that translates into `SQL` of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the `data_algebra`.

## data_algebra/rquery as a Category Over Table Descriptions

## Introduction

I would like to talk about some of the design principles underlying the `data_algebra` package (and also its sibling `rquery` package).

The `data_algebra` package is a query generator that can act on either `Pandas` data frames or on `SQL` tables. This is discussed on the project site and in the examples directory. In this note we will set up some technical terminology that will allow us to discuss some of the underlying design decisions. These are things that, when done well, the user doesn’t have to think much about. Discussing such design decisions at length can obscure some of their charm, but we would like to point out some features here.
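The idea of one pipeline description that can run against in-memory data or be rendered to `SQL` can be sketched in a few lines. This is not the `data_algebra` API: the `Extend` and `Select` classes and the `run_in_memory` and `to_sql` functions are hypothetical names invented for this illustration, lists of dicts stand in for data frames, and the standard library's `sqlite3` stands in for a database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Extend:
    """Hypothetical operator: add column new_col = left + right."""
    new_col: str
    left: str
    right: str

    def apply_rows(self, rows):
        return [{**r, self.new_col: r[self.left] + r[self.right]} for r in rows]

    def to_sql(self, source_sql):
        return (f'SELECT *, "{self.left}" + "{self.right}" AS "{self.new_col}" '
                f'FROM ({source_sql}) AS t')


@dataclass(frozen=True)
class Select:
    """Hypothetical operator: keep rows where col > threshold (an integer)."""
    col: str
    threshold: int

    def apply_rows(self, rows):
        return [r for r in rows if r[self.col] > self.threshold]

    def to_sql(self, source_sql):
        return f'SELECT * FROM ({source_sql}) AS t WHERE "{self.col}" > {self.threshold}'


def run_in_memory(ops, rows):
    """Apply each operator in turn to in-memory rows."""
    for op in ops:
        rows = op.apply_rows(rows)
    return rows


def to_sql(ops, table_name):
    """Render the same operator sequence as one nested SQL query."""
    sql = f'SELECT * FROM "{table_name}"'
    for op in ops:
        sql = op.to_sql(sql)
    return sql


# Toy data (an assumption) and a two-step pipeline described once.
rows = [{"x": 1, "y": 2}, {"x": 3, "y": 4}, {"x": 5, "y": 6}]
ops = [Extend("z", "x", "y"), Select("z", 5)]

# Path 1: execute the description directly on in-memory rows.
local = run_in_memory(ops, rows)

# Path 2: translate the same description to SQL and run it in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (x INTEGER, y INTEGER)")
conn.executemany("INSERT INTO d VALUES (?, ?)", [(r["x"], r["y"]) for r in rows])
cur = conn.execute(to_sql(ops, "d") + ' ORDER BY "x"')
cols = [c[0] for c in cur.description]
remote = [dict(zip(cols, rec)) for rec in cur.fetchall()]
conn.close()

print(to_sql(ops, "d"))
print(local == remote)
```

Each operator here knows only which columns it needs and which it produces, so the user writes the pipeline once and does not have to think about which back end runs it.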


## Slides for PyData LA 2019 vtreat Talk

Slides for PyData LA 2019 vtreat Talk are here!

## Slides from the PyData2019 data_algebra lightning talk

Slides from my PyData2019 data_algebra lightning talk are here.

## Free R/datascience Extract: Evaluating a Classification Model with a Spam Filter

We are excited to share a free extract of Zumel, Mount, *Practical Data Science with R, 2nd Edition*, Manning 2019: Evaluating a Classification Model with a Spam Filter.

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.
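As a small illustration of evaluation that is independent of model construction, the sketch below scores predictions using nothing but the true labels and the predicted labels. It is in Python rather than the book's R, it is not taken from the book, and the function name `evaluate`, the labels, and the two sets of "model" predictions are made-up toy values.

```python
def evaluate(truth, predicted):
    """Score 0/1 predictions against 0/1 truth, with no knowledge of the model."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }


# Toy values (assumptions): 1 = spam, 0 = not spam.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 0, 1]  # predictions from one modeling approach
model_b = [1, 1, 1, 1, 1, 0, 0, 0]  # predictions from a different approach

score_a = evaluate(truth, model_a)
score_b = evaluate(truth, model_b)
print("model A:", score_a)
print("model B:", score_b)
```

Because `evaluate` never sees how the predictions were produced, the same function compares any two approaches on equal terms. In this toy case the two models tie on accuracy but differ in precision and recall, which is the kind of distinction a spam filter evaluation needs to surface.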

This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design).

(Note: Nina Zumel leads on the course design, which is the heavy lifting; John Mount just got tasked to be the one delivering it.)

Zumel, Mount, *Practical Data Science with R, 2nd Edition* is coming out in print *very* soon. Here is a discount code to help you get a good deal on the book:

Take 37% off *Practical Data Science with R, Second Edition* by entering **fcczumel3** into the discount code box at checkout at manning.com.

## AI for Engineers

For the last year we (Nina Zumel, and myself: John Mount) have had the honor of teaching the AI200 portion of LinkedIn’s AI Academy.

John Mount at the LinkedIn campus

Nina Zumel designed most of the material, and John Mount has been delivering it and bringing her feedback. We’ve just started our 9th cohort. We adjust the course each time. Our students teach us a lot about how one thinks about data science. We bring that forward to each round of the course.

Roughly the goal is the following.

If every engineer, product manager, and project manager had some hands-on experience with data science and AI (deep neural nets), then they would be more likely both to think of using these techniques in their work *and* to introduce the instrumentation required to have useful data in the first place.

This will have huge downstream benefits for LinkedIn. Our group is thrilled to be a part of this.

We are looking for more companies that want an on-site data science intensive for their teams (either in Python or in R).