Posted on Categories Practical Data Science, Statistics, TutorialsTags , , , , , Leave a comment on The Advantages of Record Transform Specifications

The Advantages of Record Transform Specifications

Nina Zumel had a really great article on how to prepare a nice Keras performance plot using R.


Keras plot

I will use this example to show some of the advantages of cdata record transform specifications.

Continue reading The Advantages of Record Transform Specifications

Posted on Categories data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , , , , , , , ,

Advanced Data Reshaping in Python and R

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”).

The advantages of data_algebra and cdata are:

  • The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
  • The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
  • The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Continue reading Advanced Data Reshaping in Python and R

Posted on Categories Administrativia, data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , , , , , , , ,

Introducing data_algebra

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases.

In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable).

Continue reading Introducing data_algebra

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, ProgrammingTags , , ,

Big News: Porting vtreat to Python

We at Win-Vector LLC have some big news.

We are finally porting a streamlined version of our R vtreat variable preparation package to Python.

vtreat is a great system for preparing messy data for supervised machine learning.

The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their limit. In particular we have found the .fit_transform() pattern is a great way to express building up a cross-frame to avoid nested model bias (in this case .fit_transform() != .fit().transform()). There is a bit of difference in how object oriented APIs compose versus how functional APIs compose. We are making an effort to research how to make this an advantage, and not a liability.

The new repository is here. And we have a non-trivial worked classification example. Next up is multinomial classification. After that a few validation suites to prove the two vtreat systems work similarly. And then we have some exciting new capabilities.

The first application is going to be a shortening and streamlining of our current 4 day data science in Python course (while allowing more concrete examples!).

This also means data scientists who use both R and Python will have a few more tools that present similarly in each language.

Posted on Categories Programming, TutorialsTags , , ,

Piping is Method Chaining

What R users now call piping, popularized by Stefan Milton Bache and Hadley Wickham, is inline function application (this is notationally similar to, but distinct from the powerful interprocess communication and concurrency tool introduced to Unix by Douglas McIlroy in 1973). In object oriented languages this sort of notation for function application has been called “method chaining” since the days of Smalltalk (~1972). Let’s take a look at method chaining in Python, in terms of pipe notation.

Continue reading Piping is Method Chaining

Posted on Categories Coding, data science, Programming, StatisticsTags , , , , , , , 12 Comments on Is 10,000 Cells Big?

Is 10,000 Cells Big?

Trick question: is a 10,000 cell numeric data.frame big or small?

In the era of “big data” 10,000 cells is minuscule. Such data could be fit on fewer than 1,000 punched cards (or less than half a box).


Punch card

The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later.

Continue reading Is 10,000 Cells Big?