
## R Tip: How To Look Up Matrix Values Quickly

R is a powerful data science language because, like Matlab, numpy, and Pandas, it exposes vectorized operations. That is, a user can perform operations on hundreds (or even billions) of cells by merely specifying the operation on the column or vector of values.

Of course, sometimes it takes a while to figure out how to do this. Please read on for a great R matrix lookup problem and solution.

## New Introduction to `rquery`

### Introduction

`rquery` is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of `R`’s `base::transform()`, or `dplyr`’s `dplyr::mutate()` and uses a pipe in the style popularized in `R` with `magrittr`. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional `SQL` “window functions.” More on the background and context of `rquery` can be found here.

The `R`/`rquery` version of this introduction is here, and the `Python`/`data_algebra` version of this introduction is here.

In transform formulations, data manipulation is written as transformations that produce new `data.frame`s, instead of as alterations of a primary data structure (as is the case with `data.table`). Transform systems can use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.

In `rquery`’s case the primary set of data operators is as follows:

• `drop_columns`
• `select_columns`
• `rename_columns`
• `select_rows`
• `order_rows`
• `extend`
• `project`
• `natural_join`
• `convert_records` (supplied by the `cdata` package).

These operations break into a small number of themes:

• Simple column operations (selecting and re-naming columns).
• Simple row operations (selecting and re-ordering rows).
• Creating new columns or replacing columns with new calculated values.
• Aggregating or summarizing data.
• Combining results between two `data.frame`s.
• General conversion of record layouts (supplied by the `cdata` package).

The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps. `rquery` supplies a high performance implementation of these methods that scales from in-memory scale up through big data scale (to just about anything that supplies a sufficiently powerful `SQL` interface, such as PostgreSQL, Apache Spark, or Google BigQuery).

We will work through simple examples/demonstrations of the `rquery` data manipulation operators.


## You Can Override Just About Anything in R

To understand computations in R, two slogans are helpful:

• Everything that exists is an object.
• Everything that happens is a function call.

John Chambers

In R, the “`[`” array access operator is a function call. And it is one a user can re-bind to a new effect of their own choosing.

Let’s see what sort of mischief we can get into using this capability.


## Eliminating Tail Calls in Python Using Exceptions

I was working through Kyle Miller’s excellent note: “Tail call recursion in Python”, and decided to experiment with variations of the techniques.

The idea is: one may want to eliminate use of the `Python` language call-stack in the case of “tail calls” (function calls whose result is not used by the calling function, but is instead immediately returned). Tail call elimination can both speed up programs and cut down on the overhead of maintaining intermediate stack frames and environments that will never be used again.

The note correctly points out that `Python` purposely does not have a `goto` statement, a tool one might use to implement true tail call elimination. So Kyle Miller built up a data-structure based replacement for the call stack, which allows one to work around the stack-limit for a specific function (without changing any `Python` configuration, and without changing the behavior of other functions).

Of course `Python` does have some exotic control-flow constructs: `raise` and `yield`. So I decided to build an exception-based solution of my own using `raise`.

Please read on for how we do this, and for some examples.
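As a rough illustration of the general idea (a minimal sketch, not necessarily the implementation developed in the full post), a tail call can be signaled by raising an exception that carries the next function and its arguments. A small driver loop catches it and makes the call from the bottom of the stack. The names `TailCall`, `run`, and `count_down` are hypothetical and chosen only for this example.

```python
class TailCall(Exception):
    """Hypothetical signal: carries the next call to make instead of making it."""

    def __init__(self, fn, *args, **kwargs):
        super().__init__()
        self.fn = fn
        self.call_args = args
        self.call_kwargs = kwargs


def run(fn, *args, **kwargs):
    """Driver loop: each raised TailCall unwinds the stack, then we call again."""
    while True:
        try:
            return fn(*args, **kwargs)
        except TailCall as tc:
            fn, args, kwargs = tc.fn, tc.call_args, tc.call_kwargs


def count_down(n, acc=0):
    """Sum the integers 1..n using an accumulator and a signaled tail call."""
    if n <= 0:
        return acc
    # Instead of `return count_down(n - 1, acc + n)`, raise the request.
    raise TailCall(count_down, n - 1, acc + n)


# 100000 levels would exceed the default recursion limit as a plain recursion.
total = run(count_down, 100000)
print(total)
```

Because every `raise` unwinds back to `run` before the next call is made, the stack depth stays constant regardless of `n`. The cost is one exception per step, so this trades speed for the ability to go arbitrarily deep.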


## Big News: Porting vtreat to Python

We at Win-Vector LLC have some big news.

We are finally porting a streamlined version of our R vtreat variable preparation package to Python.

vtreat is a great system for preparing messy data for supervised machine learning.

The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their limit. In particular we have found the `.fit_transform()` pattern is a great way to express building up a cross-frame to avoid nested model bias (in this case `.fit_transform() != .fit().transform()`). There is a bit of difference in how object oriented APIs compose versus how functional APIs compose. We are making an effort to research how to make this an advantage, and not a liability.
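To make the `.fit_transform() != .fit().transform()` point concrete, here is a toy sketch in plain Python. It is not vtreat's code or API: `ToyMeanEncoder` is a hypothetical class that replaces each categorical level with the mean outcome for that level. Its `.fit_transform()` encodes each training row using only rows from the other folds, which is the cross-frame idea, assuming a simple interleaved fold assignment.

```python
from collections import defaultdict


def level_means(xs, ys):
    """Mean outcome per categorical level."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    return {level: sums[level] / counts[level] for level in sums}


class ToyMeanEncoder:
    """Hypothetical encoder, for illustration only (not the vtreat API)."""

    def __init__(self, n_folds=2):
        self.n_folds = n_folds

    def fit(self, xs, ys):
        self.grand_mean_ = sum(ys) / len(ys)
        self.means_ = level_means(xs, ys)
        return self

    def transform(self, xs):
        # Used on new data: apply the encoding learned from all training rows.
        return [self.means_.get(x, self.grand_mean_) for x in xs]

    def fit_transform(self, xs, ys):
        # Keep the full fit for later use on new data ...
        self.fit(xs, ys)
        # ... but encode each training row from the *other* folds only.
        out = [None] * len(xs)
        for fold in range(self.n_folds):
            train = [i for i in range(len(xs)) if i % self.n_folds != fold]
            means = level_means([xs[i] for i in train], [ys[i] for i in train])
            fallback = sum(ys[i] for i in train) / len(train)
            for i in range(fold, len(xs), self.n_folds):
                out[i] = means.get(xs[i], fallback)
        return out


xs = ["a", "a", "b", "b"]
ys = [1.0, 0.0, 1.0, 1.0]

naive = ToyMeanEncoder().fit(xs, ys).transform(xs)
cross = ToyMeanEncoder().fit_transform(xs, ys)
print(naive)  # [0.5, 0.5, 1.0, 1.0]
print(cross)  # [0.0, 1.0, 1.0, 1.0]
```

In the naive frame each row's encoding has already seen that row's own outcome, which is the source of nested model bias. In the cross frame it has not, so a downstream model trained on `cross` sees encodings that behave more like those it will meet on new data.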

The new repository is here. And we have a non-trivial worked classification example. Next up is multinomial classification. After that a few validation suites to prove the two vtreat systems work similarly. And then we have some exciting new capabilities.

The first application is going to be a shortening and streamlining of our current 4 day data science in Python course (while allowing more concrete examples!).

This also means data scientists who use both R and Python will have a few more tools that present similarly in each language.


## Programming Over lm() in R

Here is a simple modeling problem in `R`.

We want to fit a linear model where the names of the data columns carrying the outcome to predict (`y`), the explanatory variables (`x1`, `x2`), and per-example row weights (`wt`) are given to us as string values in variables.


## Piping is Method Chaining

What `R` users now call piping, popularized by Stefan Milton Bache and Hadley Wickham, is inline function application (this is notationally similar to, but distinct from, the powerful interprocess communication and concurrency tool introduced to Unix by Douglas McIlroy in 1973). In object oriented languages this sort of notation for function application has been called “method chaining” since the days of `Smalltalk` (~1972). Let’s take a look at method chaining in `Python`, in terms of pipe notation.
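As a small sketch of the correspondence (the classes `Chain` and `Pipe` are hypothetical, written only for this illustration), the same three-step calculation can be written once as method chaining and once with an operator standing in for a pipe. Here `>>` is used as the pipe symbol by defining `__rrshift__`, so `value >> Pipe(f)` evaluates to `f(value)`.

```python
class Chain:
    """Hypothetical wrapper: each method returns a new Chain, so calls compose."""

    def __init__(self, values):
        self.values = list(values)

    def keep(self, pred):
        return Chain(v for v in self.values if pred(v))

    def apply(self, fn):
        return Chain(fn(v) for v in self.values)

    def total(self):
        return sum(self.values)


class Pipe:
    """Hypothetical pipe stage: `value >> Pipe(fn)` is inline application fn(value)."""

    def __init__(self, fn):
        self.fn = fn

    def __rrshift__(self, value):
        return self.fn(value)


def is_odd(v):
    return v % 2 == 1


def square(v):
    return v * v


# Method chaining: data.step1(...).step2(...).step3()
chained = Chain(range(1, 6)).keep(is_odd).apply(square).total()

# Pipe notation: data >> step1 >> step2 >> step3
piped = (
    range(1, 6)
    >> Pipe(lambda xs: [v for v in xs if is_odd(v)])
    >> Pipe(lambda xs: [square(v) for v in xs])
    >> Pipe(sum)
)

print(chained, piped)  # 35 35
```

Both forms read left to right and pass each intermediate result to the next step. The difference is where the steps live: as methods on the object being transformed, or as free-standing functions applied by the operator.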


## Why RcppDynProg is Written in C++

The (matter of opinion) claim:

“When the use of C++ is very limited and easy to avoid, perhaps it is the best option to do that […]”

(source discussed here)

got me thinking: does our own RcppDynProg package actually use C++ in a significant way? Could/should I port it to C? Am I informed enough to use something as complicated as C++ correctly?


## Standard Evaluation Versus Non-Standard Evaluation in R

There is a lot of unnecessary worry over “Non Standard Evaluation” (NSE) in `R` versus “Standard Evaluation” (SE, or standard “variable names refer to values” evaluation). This very author is guilty of over-discussing the issue. But let’s give this yet another try.

The entire difference between NSE and regular evaluation can be summed up in the following simple table (which should be clear after we work some examples).