## Why R?

I was working with our copy editor on Appendix A of Practical Data Science with R, 2nd Edition (Zumel, Mount; Manning 2019), and ran into this little point, unfortunately buried in the back of the book.

In our opinion the R ecosystem is the fastest path to substantial data science, statistical, and machine learning accomplishment.

This is why we use and teach R (in addition to using and teaching Python).

## Introducing data_algebra

This article introduces the `data_algebra` project: a data processing tool family available in `R` and `Python`. These tools are designed to transform data either in-memory or on remote databases.

In particular we will discuss the `Python` implementation (also called `data_algebra`) and its relation to the mature `R` implementations (`rquery` and `rqdatatable`).


## Eliminating Tail Calls in Python Using Exceptions

I was working through Kyle Miller’s excellent note, “Tail call recursion in Python”, and decided to experiment with variations of the techniques.

The idea is: one may want to eliminate use of the `Python` language call-stack in the case of a “tail call” (a function call where the result is not used by the calling function, but instead immediately returned). Tail call elimination can both speed up programs and cut down on the overhead of maintaining intermediate stack frames and environments that will never be used again.

The note correctly points out that `Python` purposely does not have a `goto` statement, a tool one might use to implement true tail call elimination. So Kyle Miller built up a data-structure based replacement for the call stack, which allows one to work around the stack-limit for a specific function (without changing any `Python` configuration, and without changing the behavior of other functions).

Of course `Python` does have some exotic control-flow constructs: `raise` and `yield`. So I decided to build an exception-based solution of my own using `raise`.

Please read on for how we do this, and for some examples.
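As a rough illustration of the idea (a minimal sketch, not the code from the full post), a function can `raise` an exception describing its tail call instead of making the call, and a small driver loop can catch that exception and make the call on its behalf. The names `TailCall` and `tail_call_eliminated` are hypothetical and chosen only for this example.

```python
import functools


class TailCall(Exception):
    """Raised instead of making a tail call; carries the call to make next.

    Hypothetical helper for illustration only.
    """

    def __init__(self, fn, *args, **kwargs):
        super().__init__()
        self.fn = fn
        self.call_args = args
        self.call_kwargs = kwargs


def tail_call_eliminated(fn):
    """Wrap fn in a driver loop that performs raised tail calls iteratively."""

    @functools.wraps(fn)
    def driver(*args, **kwargs):
        target = fn
        while True:
            try:
                # A normal return ends the loop with the final result.
                return target(*args, **kwargs)
            except TailCall as tc:
                # Unwrap decorated functions so driver loops do not nest.
                target = getattr(tc.fn, "__wrapped__", tc.fn)
                args, kwargs = tc.call_args, tc.call_kwargs

    return driver


@tail_call_eliminated
def sum_to(n, acc=0):
    """Sum the integers 1..n, written in tail-call style."""
    if n == 0:
        return acc
    # Instead of `return sum_to(n - 1, acc + n)` we raise the call.
    raise TailCall(sum_to, n - 1, acc + n)


# Far deeper than the default recursion limit, yet the stack stays shallow.
total = sum_to(100_000)
```

Each raised `TailCall` unwinds the current frame before the next call starts, so the stack depth stays constant regardless of `n`. The cost is one exception per step, so this trades speed for the ability to recurse deeply.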

## What is vtreat?

`vtreat` is a `DataFrame` processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner.

`vtreat` takes an input `DataFrame` that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other input columns are possible explanatory variables (typically numeric or categorical/string-valued; these columns may have missing values) that the user later wants to use to predict “y”. In practice such an input `DataFrame` may not be immediately suitable for machine learning procedures, which often expect only numeric explanatory variables and may not tolerate missing values.

To solve this, `vtreat` builds a transformed `DataFrame` where all explanatory variable columns have been transformed into a number of numeric explanatory variable columns, without missing values. The `vtreat` implementation produces derived numeric columns that capture most of the information relating the explanatory columns to the specified “y” or dependent/outcome column through a number of numeric transforms (indicator variables, impact codes, prevalence codes, and more). This transformed `DataFrame` is suitable for a wide range of supervised learning methods, from linear regression through gradient boosted machines.

The idea is: you can take a `DataFrame` of messy real world data and easily, faithfully, reliably, and repeatably prepare it for machine learning with `vtreat`, using documented methods. Incorporating `vtreat` into your machine learning workflow lets you quickly work with very diverse structured data.
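To make the kind of transform described above concrete, here is a small dependency-free sketch of two of the simpler steps: replacing missing numeric values while recording where they were missing, and expanding a categorical column into indicator columns. This is not the `vtreat` implementation or its API; the function name, the derived column names, and the use of plain dictionaries in place of a `DataFrame` are all assumptions made for illustration, and the outcome-aware transforms (impact and prevalence codes) are omitted.

```python
from statistics import mean


def prepare_numeric_rows(rows, numeric_cols, categorical_cols, outcome):
    """Return rows whose explanatory values are all numeric and non-missing.

    Hypothetical illustration only; missing values are represented as None.
    """
    # "Fit" step: learn replacement values and category levels from the data.
    means = {}
    for col in numeric_cols:
        present = [r[col] for r in rows if r[col] is not None]
        means[col] = mean(present) if present else 0.0
    levels = {
        col: sorted({r[col] for r in rows if r[col] is not None})
        for col in categorical_cols
    }

    # "Transform" step: build the all-numeric rows.
    prepared = []
    for r in rows:
        new_row = {outcome: r[outcome]}  # the outcome is passed through as-is
        for col in numeric_cols:
            missing = r[col] is None
            new_row[col] = means[col] if missing else float(r[col])
            new_row[f"{col}_is_missing"] = 1.0 if missing else 0.0
        for col in categorical_cols:
            for level in levels[col]:
                new_row[f"{col}_lev_{level}"] = 1.0 if r[col] == level else 0.0
            new_row[f"{col}_lev_NA"] = 1.0 if r[col] is None else 0.0
        prepared.append(new_row)
    return prepared


raw = [
    {"x": 1.0, "c": "a", "y": 1.0},
    {"x": None, "c": "b", "y": 0.0},
    {"x": 3.0, "c": None, "y": 1.0},
]
prepared = prepare_numeric_rows(raw, ["x"], ["c"], outcome="y")
```

Keeping a separate "was missing" column lets a downstream model learn from the missingness itself instead of losing that signal to the imputed value.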

Worked examples can be found here.

For more detail please see here: arXiv:1611.09477 stat.AP (the documentation describes the `R` version, however all of the examples can be found worked in `Python` here).

`vtreat` is available as a `Python`/`Pandas` package, and also as an `R` package.

(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)

Some operational examples can be found here.

## Speaking at BARUG

We will be speaking at the Tuesday, September 3, 2019 BARUG. If you are in the Bay Area, please come see us.

Nina Zumel & John Mount
Practical Data Science with R

Practical Data Science with R (Zumel and Mount) was one of the first, and one of the most widely read, books on the practice of doing data science using R. We have been working hard on an improved and revised 2nd edition of our book (coming out this Fall). The book reflects more experience with data science, with teaching, and with R itself. We will talk about what direction we think the R community has been taking, how this affected the book, and what is new in the upcoming edition.

## Returning to Tides

Fred Viole shared a great “data only” R solution to the forecasting tides problem.

The methodology comes from a finance perspective, and has some great associated notes and articles.

This gives me a chance to comment on the odd relation between prediction and profit in finance.

## Florence Nightingale, Data Scientist

In 1858 Florence Nightingale published her now famous “rose diagram” breaking down causes of mortality.