# Tag: python

## New improved cdata instructional video

Categories: data science, Statistics, Tutorials

## Data re-Shaping in R and in Python

Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

## sklearn Pipe Step Interface for vtreat

Categories: Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

## New vtreat Feature: Nested Model Bias Warning

Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

## Set up the Example

## New Timings for a Grouped In-Place Aggregation Task

Categories: data science, Opinion, Pragmatic Data Science, Tutorials

## A Richer Category for Data Wrangling

Categories: data science, Pragmatic Data Science, Tutorials

## Better SQL Generation via the data_algebra

Categories: Administrativia, Computer Science, Pragmatic Data Science

## data_algebra/rquery as a Category Over Table Descriptions

Categories: data science, Tutorials

## Introduction

## Slides for PyData LA 2019 vtreat Talk

Categories: Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

## Slides from the PyData2019 data_algebra lightning talk

Categories: Administrativia, data science, Programming

We have a new, improved version of the “how to design a cdata/data_algebra data transform” video up!

The original article, the Python example, and the R example have all been updated to use the new video.

Please check it out!

Nina Zumel and I have two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version of the tutorial.

This reflects our opinion on the question “which is better for data science, R or Python?”: both are great. So start with one, and expect to eventually work with both (if you are lucky).

We’ve been experimenting with this for a while, and the next R vtreat package will have a back-port of the Python vtreat package’s sklearn pipe step interface (in addition to the standard R interface).

For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later *naively* used to train a model leads to an undesirable nested model bias. The `vtreat` package (both the `R` version and the `Python` version) incorporates a cross-frame method that allows one to use all the training data both to learn variable re-encodings and to correctly train a subsequent model (for an example please see our recent PyData LA talk).
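To make the nested model bias concrete, here is a minimal pure-Python sketch. It is not `vtreat`'s implementation: the class `ToyImpactCoder`, its fold scheme, and the synthetic data are all assumptions made for illustration. It borrows only the sklearn-style convention of `fit()`, `transform()`, and `fit_transform()`, with `fit_transform()` returning a cross-frame (each row is coded using an encoding estimated without that row's fold).

```python
import random
from collections import defaultdict
from statistics import mean


def impact_codes(levels, y):
    """Map each level to (mean outcome for that level) - (grand mean)."""
    grand = mean(y)
    by_level = defaultdict(list)
    for lev, yi in zip(levels, y):
        by_level[lev].append(yi)
    return {lev: mean(vals) - grand for lev, vals in by_level.items()}


class ToyImpactCoder:
    """Hypothetical stand-in for a treatment step; NOT vtreat's implementation."""

    def __init__(self, n_folds=3):
        self.n_folds = n_folds
        self.codes_ = {}

    def fit(self, levels, y):
        self.codes_ = impact_codes(levels, y)
        return self

    def transform(self, levels):
        # Unseen levels get 0.0: no information beyond the grand mean.
        return [self.codes_.get(lev, 0.0) for lev in levels]

    def fit_transform(self, levels, y):
        # Cross-frame: code each row with a map estimated without its own fold.
        n = len(y)
        out = [0.0] * n
        for k in range(self.n_folds):
            train = [i for i in range(n) if i % self.n_folds != k]
            codes = impact_codes([levels[i] for i in train],
                                 [y[i] for i in train])
            for i in range(k, n, self.n_folds):
                out[i] = codes.get(levels[i], 0.0)
        self.fit(levels, y)  # keep full-data codes for use on future data
        return out


def correlation(a, b):
    """Pearson correlation, returning 0.0 if either input is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((z - mb) ** 2 for z in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


# Synthetic data: a high-cardinality variable that carries NO signal.
rng = random.Random(2019)
n = 600
levels = [f"lev_{rng.randrange(200)}" for _ in range(n)]
y = [rng.choice([0.0, 1.0]) for _ in range(n)]

# Naive: estimate the encoding and apply it to the very same rows.
naive = ToyImpactCoder().fit(levels, y).transform(levels)
# Cross-frame: out-of-fold codes for the same rows.
cross = ToyImpactCoder().fit_transform(levels, y)

naive_corr = correlation(naive, y)
cross_corr = correlation(cross, y)
print(f"naive: {naive_corr:.2f}  cross-frame: {cross_corr:.2f}")
```

The naive codes look strongly related to the outcome even though the variable is pure noise, which is exactly what would mislead a downstream model; the cross-frame codes do not show that spurious relation.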

The next version of `vtreat` will warn the user if they have improperly used the same data for both `vtreat` impact code inference and downstream modeling. So in addition to us warning you not to do this, the package now also checks and warns against this situation. `vtreat` has had methods for avoiding nested model bias for a very long time; we are now adding new warnings to confirm users are using them.

This example is excerpted from some of our classification documentation.


I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.
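One reading of a “grouped in-place aggregation” (an assumption here, since this excerpt does not spell out the task) is: compute a summary per group and write it back onto every row of that group, without collapsing the rows. Below is a minimal pure-Python sketch of that reading. The column names `g` and `x` and the helper `add_group_extremes` are hypothetical, and this is not the client's workflow or the code that was timed.

```python
# Hypothetical toy rows; the column names are illustrative only.
rows = [
    {"g": "a", "x": 1.0},
    {"g": "b", "x": 5.0},
    {"g": "a", "x": 3.0},
    {"g": "b", "x": 2.0},
    {"g": "a", "x": 2.0},
]


def add_group_extremes(rows, group_key, value_key):
    """Attach the per-group min and max of value_key to every row (mutates rows)."""
    lo, hi = {}, {}
    # Pass 1: accumulate the per-group summaries.
    for r in rows:
        g, v = r[group_key], r[value_key]
        lo[g] = v if g not in lo else min(lo[g], v)
        hi[g] = v if g not in hi else max(hi[g], v)
    # Pass 2: write the summaries back; the row count is unchanged.
    for r in rows:
        r[f"{value_key}_min"] = lo[r[group_key]]
        r[f"{value_key}_max"] = hi[r[group_key]]
    return rows


add_group_extremes(rows, "g", "x")
for r in rows:
    print(r)
```

The two-pass shape (summarize, then join back) is the same thing a windowed aggregate expresses in `SQL`, which is why the step is a natural candidate for timing across tools.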


I’ve been writing a lot about category theory interpretations of data-processing pipelines and some of the improvements we feel they are driving in both the `data_algebra` and in `rquery`/`rqdatatable`.

I think I’ve found an even better category theory re-formulation of the package, which I will describe here.

In our recent note “What is new for `rquery` December 2019” we mentioned an ugly processing pipeline that translates into `SQL` of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the `data_algebra`.

I would like to talk about some of the design principles underlying the `data_algebra` package (and also its sibling `rquery` package).

The `data_algebra` package is a query generator that can act on either `Pandas` data frames or on `SQL` tables. This is discussed on the project site and the examples directory. In this note we will set up some technical terminology that will allow us to discuss some of the underlying design decisions. These are things that, when done well, the user doesn’t have to think much about. Discussing such design decisions at length can obscure some of their charm, but we would like to point out some features here.
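As a rough illustration of what “a query generator that can act on either in-memory data or `SQL` tables” can mean, here is a toy sketch using only the Python standard library. It is not the `data_algebra` API: the `Pipeline` and `Extend` classes, their methods, and the one-operation expression language are all hypothetical. The idea it tries to show is that a pipeline is built against a table description (a name plus column names), checked before any data is seen, and then either applied to rows in memory or rendered as `SQL`.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Extend:
    """Toy step: add column new_col computed as left + right."""
    new_col: str
    left: str
    right: str


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical pipeline over a table description; NOT data_algebra's API."""
    table_name: str
    columns: tuple
    steps: tuple = ()

    def extend(self, new_col, left, right):
        # Check against the table description before any data is seen.
        for c in (left, right):
            if c not in self.columns:
                raise KeyError(f"unknown column: {c}")
        return Pipeline(self.table_name,
                        self.columns + (new_col,),
                        self.steps + (Extend(new_col, left, right),))

    def transform(self, rows):
        """In-memory backend: apply the steps to a list of dicts."""
        out = [dict(r) for r in rows]
        for s in self.steps:
            for r in out:
                r[s.new_col] = r[s.left] + r[s.right]
        return out

    def to_sql(self):
        """SQL backend: emit one nested SELECT per step."""
        sql = f'SELECT * FROM "{self.table_name}"'
        for i, s in enumerate(self.steps):
            sql = (f'SELECT *, "{s.left}" + "{s.right}" AS "{s.new_col}" '
                   f'FROM ({sql}) AS "t{i}"')
        return sql


# Build the pipeline from a description only: table "d" with columns x and y.
ops = Pipeline("d", ("x", "y")).extend("z", "x", "y").extend("w", "z", "x")

rows = [{"x": 1, "y": 2}, {"x": 10, "y": 20}]
in_memory = ops.transform(rows)

# Run the generated SQL against an in-memory SQLite table with the same data.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "d" ("x" INTEGER, "y" INTEGER)')
con.executemany('INSERT INTO "d" VALUES (?, ?)',
                [(r["x"], r["y"]) for r in rows])
cur = con.execute(ops.to_sql())
names = [c[0] for c in cur.description]
from_sql = [dict(zip(names, rec)) for rec in cur.fetchall()]
con.close()

print(ops.to_sql())
print(in_memory)
print(from_sql)
```

Because the pipeline carries its own column list, a step that refers to a missing column fails when the pipeline is composed rather than when it is run, which is the kind of design decision the user ideally never has to think about.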


Slides for PyData LA 2019 vtreat Talk are here!

Slides from my PyData2019 data_algebra lightning talk are here.