We are *very* excited to announce a new (paid) Win-Vector LLC video training course: Supervised Learning in R: Regression, now available on DataCamp.

# Category: Pragmatic Data Science

## More documentation for Win-Vector R packages

The Win-Vector public R packages now all have new `pkgdown` documentation sites! (And, a thank-you to Hadley Wickham for developing the `pkgdown` tool.)

Please check them out (hint: `vtreat` is our favorite).

Continue reading More documentation for Win-Vector R packages

## Join Dependency Sorting

In our latest installment of “`R` and big data” let’s again discuss the task of left joining many tables from a data warehouse using `R` and a system called a "join controller" (last discussed here).

One of the great advantages to specifying complicated sequences of operations in data (rather than in code) is: it is often easier to transform and extend data. Explicit rich data beats vague convention and complicated code.
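As a rough base-`R` sketch of the underlying idea (the actual `replyr` join controller records the plan as data; here the plan is just a hypothetical ordered list of tables, combined with `Reduce()`):

```r
# Hypothetical stand-in for a join plan: an ordered list of tables to
# left join, all sharing the key column "id".
d1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
d2 <- data.frame(id = c(1, 2),    y = c(10, 20))
d3 <- data.frame(id = c(2, 3),    z = c(TRUE, FALSE))

tables <- list(d1, d2, d3)

# Left join the tables in sequence; merge(all.x = TRUE) is a left join.
res <- Reduce(function(a, b) merge(a, b, by = "id", all.x = TRUE), tables)
print(res)
```

Because the plan is data (a list), it is easy to inspect, reorder, or extend before any join runs, which is the point being made above.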

## Use a Join Controller to Document Your Work

This note describes a useful `replyr` tool we call a "join controller" (it is part of our "R and Big Data" series; please see here for the introduction, and here for one of our big data courses).

Continue reading Use a Join Controller to Document Your Work

## Managing intermediate results when using R/sparklyr

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.

## Managing Spark data handles in R

When working with big data with `R` (say, using `Spark` and `sparklyr`) we have found it very convenient to keep data handles in a neat list or `data_frame`.
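A minimal sketch of the pattern, using local `data.frame`s as stand-ins for remote `Spark` handles (with real `sparklyr` the list elements would instead be remote table handles):

```r
# Keep data handles in a named list so they can be managed as a group.
handles <- list(
  sales     = data.frame(id = 1:3, amount = c(10, 20, 30)),
  customers = data.frame(id = 1:3, name = c("a", "b", "c"))
)

# A uniform operation over all handles: report row counts per table.
counts <- vapply(handles, nrow, numeric(1))
print(counts)
```

The named list lets you enumerate, summarize, or drop all of your tables in one pass instead of tracking loose variables.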

Please read on for our handy hints on keeping your data handles neat. Continue reading Managing Spark data handles in R

## New series: R and big data (concentrating on Spark and sparklyr)

Win-Vector LLC has recently been teaching how to use `R` with big data through `Spark` and `sparklyr`. We have also been helping clients become productive on `R/Spark` infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big data using `R`, share some hints, and even admit to some of the warts found in this combination of systems.

The ability to perform sophisticated analyses and modeling on “big data” with `R` is rapidly improving, and this is the time for businesses to invest in the technology. Win-Vector can be your key partner in methodology development and training (through our consulting and training practices).

The field is exciting, rapidly evolving, and even a touch dangerous. We invite you to start using `Spark` through `R`, and are starting a new series of articles tagged “R and big data” to help you produce production quality solutions quickly.

Please read on for a brief description of our new articles series: “R and big data.” Continue reading New series: R and big data (concentrating on Spark and sparklyr)

## Encoding categorical variables: one-hot and beyond

## (or: how to correctly use `xgboost` from `R`)

`R` has "one-hot" encoding hidden in most of its modeling paths. Asking an `R` user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it, as it is everywhere.

For example, we can see evidence of one-hot encoding in the variable names chosen by a linear regression:

```
dTrain <- data.frame(x = c('a', 'b', 'b', 'c'),
                     y = c(1, 2, 1, 2))
summary(lm(y ~ x, data = dTrain))
```

```
##
## Call:
## lm(formula = y ~ x, data = dTrain)
##
## Residuals:
## 1 2 3 4
## -2.914e-16 5.000e-01 -5.000e-01 2.637e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0000 0.7071 1.414 0.392
## xb 0.5000 0.8660 0.577 0.667
## xc 1.0000 1.0000 1.000 0.500
##
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared: 0.5, Adjusted R-squared: -0.5
## F-statistic: 0.5 on 2 and 1 DF, p-value: 0.7071
```
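To see the encoding directly, a small sketch using base `R`'s `model.matrix()` (which performs the same treatment-contrast one-hot expansion `lm()` uses internally):

```r
dTrain <- data.frame(x = c('a', 'b', 'b', 'c'),
                     y = c(1, 2, 1, 2))

# model.matrix() exposes the indicator columns lm() builds: an
# intercept, plus one-hot indicators for levels 'b' and 'c'
# ('a' is the reference level, folded into the intercept).
m <- model.matrix(y ~ x, data = dTrain)
print(colnames(m))
```

The `xb` and `xc` columns here are exactly the coefficient names seen in the regression summary above.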

Continue reading Encoding categorical variables: one-hot and beyond

## Teaching pivot / un-pivot

## Authors: John Mount and Nina Zumel

## Introduction

In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.

One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting”, or “gathering”) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. The inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”), is harder to explain, as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations, we find pivoting can be usefully conceptualized as a simple single-row to single-row mapping followed by a grouped aggregation.

Please read on for our thoughts on teaching pivoting data. Continue reading Teaching pivot / un-pivot
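The factoring described above can be sketched in base `R`: each thin row maps to a sparse wide row, and a grouped aggregation then collapses each group (a toy illustration of the idea, not the `tidyr` implementation):

```r
# Thin (entity/attribute/value) form.
thin <- data.frame(
  id  = c(1, 1, 2, 2),
  key = c("x", "y", "x", "y"),
  val = c(10, 11, 20, 21)
)

# Step 1: single row -> single (sparse) wide row.
sparse <- data.frame(
  id = thin$id,
  x  = ifelse(thin$key == "x", thin$val, NA),
  y  = ifelse(thin$key == "y", thin$val, NA)
)

# Step 2: grouped aggregation collapses each id group to one row.
# sum() works as the collapse because each (id, key) cell holds
# exactly one non-NA value.
wide <- aggregate(cbind(x, y) ~ id, data = sparse,
                  FUN = sum, na.rm = TRUE, na.action = na.pass)
print(wide)
```

Seen this way, the "hard" pivot is just an easy row-to-row map followed by a familiar group-by aggregation.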

## Coordinatized Data: A Fluid Data Specification

## Authors: John Mount and Nina Zumel.

## Introduction

It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).

Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are *inessential* implementation details when *reasoning* about your data. Many *algorithms* are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.

In this article we will try to separate representation from semantics. We will advocate for thinking in terms of *coordinatized data*, and demonstrate advanced data wrangling in `R`.
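As a small base-`R` illustration of the claim that rows and columns are implementation details: the same facts stored in wide and thin layouts, with each value addressable by its coordinates in either one (the column names here are only for this example):

```r
# The same data in two layouts.
wide <- data.frame(id = c(1, 2), height = c(60, 70), weight = c(120, 150))
thin <- data.frame(
  id          = c(1, 1, 2, 2),
  measurement = c("height", "weight", "height", "weight"),
  value       = c(60, 120, 70, 150)
)

# The coordinates (id = 2, measurement = "height") locate the same
# value in either representation.
v_wide <- wide$height[wide$id == 2]
v_thin <- thin$value[thin$id == 2 & thin$measurement == "height"]
print(c(v_wide, v_thin))
```

The lookup logic differs, but the semantics (which value belongs to which coordinates) are identical, which is the separation argued for above.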

Continue reading Coordinatized Data: A Fluid Data Specification