
## Organize your data manipulation in terms of “grouped ordered apply”

Consider the following common problem: compute per-group ranks for a data set (say the infamous `iris` example data set). Suppose we want the rank of `iris` `Sepal.Length`s on a per-`Species` basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It is also unlikely to be the analyst’s end goal; it is a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.

Iris, by Diliff. Own work, CC BY-SA 3.0.

In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”.
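As a rough illustration of the pattern for the per-`Species` rank problem, here is a minimal base `R` sketch. It is one possible reading of "grouped ordered apply" (order the rows, then apply a per-group function), not necessarily the code the full article uses; the names `d` and `rank` are ours.

```r
# Order: sort rows by group, then by the value we want to rank.
d <- iris[order(iris$Species, iris$Sepal.Length), ]

# Grouped apply: within each Species, number the already-sorted rows.
# Ties are broken by sort order, so ranks run 1..n in every group.
d$rank <- ave(d$Sepal.Length, d$Species, FUN = seq_along)

head(d[, c("Species", "Sepal.Length", "rank")])
```

Note that nothing in the sketch refers to a specific row number: the ordering and the grouping columns carry all of the information.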


## The case for index-free data manipulation

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. This is how `R` `data.frame`s describe themselves (try “`str(data.frame(x=1:2))`” in an `R` console to see this) and is part of the tidy data manifesto. Getting to this “ready to model” format (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation.

Tools like `SQL` (structured query language) and `dplyr` can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below.
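To make the contrast concrete, here is a small base `R` sketch on a made-up data frame (the data and the names `d`, `totals_idx`, and `totals_free` are illustrative assumptions, not taken from the article). Both versions compute per-group totals; only the first one thinks in terms of row indices.

```r
# Toy data, invented for this illustration.
d <- data.frame(g = c("a", "b", "a", "b"),
                x = c(1, 2, 3, 4),
                stringsAsFactors = FALSE)

# Index-based thinking: visit each row by its position.
totals_idx <- c(a = 0, b = 0)
for (i in seq_len(nrow(d))) {
  totals_idx[d$g[i]] <- totals_idx[d$g[i]] + d$x[i]
}

# Index-free thinking: describe the result by grouping column only.
totals_free <- tapply(d$x, d$g, sum)
```

The index-free form says what is wanted (a sum per group) rather than how to walk the rows, which is the style that `SQL` and `dplyr` are built around.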


## Using replyr::let to Parameterize dplyr Expressions

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

```
dist_intervals(iris, "Sepal.Length", "Species")

# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper

1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375
```

For a specific data frame, with known column names, such a table is easy to construct using `dplyr::group_by` and `dplyr::summarize`. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in `dplyr` can get quite hairy, quite quickly. Try it yourself, and see.
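For reference, the known-column-names case might look like the sketch below. This is our reconstruction, not the article's code: the intermediate names (`avg`, `dev`, `med`, `half_iqr`) are hypothetical, and the interval definitions (mean plus/minus one standard deviation, median plus/minus half the IQR) are inferred from the numbers in the table above.

```r
library(dplyr)

intervals <- iris %>%
  group_by(Species) %>%
  # Per-group building blocks, under temporary names so that the
  # output columns `mean` and `median` do not shadow the functions.
  summarize(
    avg      = mean(Sepal.Length),
    dev      = sd(Sepal.Length),
    med      = median(Sepal.Length),
    half_iqr = IQR(Sepal.Length) / 2
  ) %>%
  # Assemble the final columns in the order shown in the table.
  transmute(
    Species,
    sdlower  = avg - dev,
    mean     = avg,
    sdupper  = avg + dev,
    iqrlower = med - half_iqr,
    median   = med,
    iqrupper = med + half_iqr
  )

intervals
```

Every column name here is hard-coded, which is exactly what makes turning this into a general function awkward.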

Enter `let`, from our new package `replyr`.


## Parametric variable names and dplyr

When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using `R` libraries that assume you know the variable names. The `R` data manipulation library `dplyr` currently supports parametric treatment of variables through “underbar forms” (methods of the form `dplyr::*_`), but their use can get tricky.

Rube Goldberg machine 1931 (public domain).

Better support for parametric treatment of variable names would be a boon to `dplyr` users. To this end the `replyr` package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete `dplyr` code to be used as if it was parametric.
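The re-mapping idea can be sketched in a few lines of base `R`. To be clear, this is not `replyr`'s implementation or its interface: `let_sketch` and `group_means` are hypothetical helpers written for this illustration, and the sketch only substitutes symbols inside the expression (it does not rename argument names such as the left-hand side of `summarize(NAME = ...)`).

```r
# Hypothetical helper: replace placeholder symbols in `expr` with the
# concrete column names given in `alias`, then evaluate the result.
let_sketch <- function(alias, expr) {
  expr <- substitute(expr)                      # capture unevaluated code
  mapping <- lapply(as.list(alias), as.name)    # "Sepal.Length" -> symbol
  concrete <- do.call(substitute, list(expr, mapping))
  eval(concrete, parent.frame())
}

# Code written against placeholder names VALUE and GROUP ...
group_means <- function(d, value_col, group_col) {
  let_sketch(
    list(VALUE = value_col, GROUP = group_col),
    with(d, tapply(VALUE, GROUP, mean))
  )
}

# ... runs as if the concrete column names had been typed in.
res <- group_means(iris, "Sepal.Length", "Species")
```

The body of `group_means` reads like ordinary concrete code, while the actual column names arrive as strings at run time.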


## New R package: replyr (get a grip on remote dplyr data services)

It is a bit of a shock when R `dplyr` users switch from using a `tbl` implementation based on R in-memory `data.frame`s to one based on a remote database or service. A lot of the power and convenience of the `dplyr` notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with `dplyr` in one modality and hope to move to another back-end without significant debugging and work-arounds. `replyr` attempts to provide a few helpful work-arounds.

Our new package `replyr` supplies methods to get a grip on working with remote `tbl` sources (SQL databases, Spark) through `dplyr`. The idea is to add convenience functions to make such tasks more like working with an in-memory `data.frame`. Results still depend on which `dplyr` service you use, but with `replyr` you have fairly uniform access to some useful functions.