A goal of the `cdata` `R` package is that very powerful and arbitrary record transforms should be convenient and take only one or two steps. In fact the goal is to take just about any record shape to any other in two steps: first convert to row-records, then re-block the data into arbitrary record shapes (please see here and here for the concepts).
But as with all general ideas, it is much easier to see what we mean by the above with a concrete example.

Let’s consider the following artificial (but simple) example. Suppose we have the following data.

```
library("cdata")

data <- build_frame(
  'record_id', 'row', 'col1', 'col2', 'col3' |
  1, 'row1', 1, 2, 3 |
  1, 'row2', 4, 5, 6 |
  1, 'row3', 7, 8, 9 |
  2, 'row1', 11, 12, 13 |
  2, 'row2', 14, 15, 16 |
  2, 'row3', 17, 18, 19 )
knitr::kable(data)
```

record_id | row | col1 | col2 | col3
---|---|---|---|---
1 | row1 | 1 | 2 | 3
1 | row2 | 4 | 5 | 6
1 | row3 | 7 | 8 | 9
2 | row1 | 11 | 12 | 13
2 | row2 | 14 | 15 | 16
2 | row3 | 17 | 18 | 19

In the above the records are the triples of rows with matching `record_id`, and the different rows within a record are identified by the value in the `row` column. So the data items are named by the triplet of `record_id`, `row`, and column name (`col1`, `col2`, or `col3`). This sort of naming of values is essentially Codd’s "guaranteed access rule".

Suppose we want to transpose each of the records, swapping the row and column notions. With `cdata` this is easy. First you design a transform to flatten each complex record into a single wide row (using the design steps taught here). Essentially that is just specifying the following control variables. We define how to identify records (the key columns) and the structure of the records (giving the interior of the record arbitrary names we will re-use later).

```
keyColumns <- 'record_id'
incoming_shape <- qchar_frame(
  row,  col1, col2, col3 |
  row1, v11,  v12,  v13  |
  row2, v21,  v22,  v23  |
  row3, v31,  v32,  v33  )
```

And we specify (using the same principles) the desired final record shape, re-using the interior names from the first step to show where values are to be mapped.

```
outgoing_shape <- qchar_frame(
  column, row1, row2, row3 |
  col1,   v11,  v21,  v31  |
  col2,   v12,  v22,  v32  |
  col3,   v13,  v23,  v33  )
```

Once you have done this the conversion is accomplished in two function calls.

```
rowrecs <- blocks_to_rowrecs(
  data,
  keyColumns = keyColumns,
  controlTable = incoming_shape)
transformed <- rowrecs_to_blocks(
  rowrecs,
  controlTable = outgoing_shape,
  columnsToCopy = keyColumns)
knitr::kable(transformed)
```

record_id | column | row1 | row2 | row3
---|---|---|---|---
1 | col1 | 1 | 4 | 7
1 | col2 | 2 | 5 | 8
1 | col3 | 3 | 6 | 9
2 | col1 | 11 | 14 | 17
2 | col2 | 12 | 15 | 18
2 | col3 | 13 | 16 | 19

And the transform is done: each record has been transposed. The principle is "draw a picture." First we draw a picture of the block record structure we have, and then we draw a picture of the block record structure we want. The intermediate form (`rowrecs`) is a special form where the concepts of records and rows exactly agree: each record is exactly one row, and each row is exactly one record. This data looks like the following.

`knitr::kable(rowrecs)`

record_id | v11 | v21 | v31 | v12 | v22 | v32 | v13 | v23 | v33
---|---|---|---|---|---|---|---|---|---
1 | 1 | 4 | 7 | 2 | 5 | 8 | 3 | 6 | 9
2 | 11 | 14 | 17 | 12 | 15 | 18 | 13 | 16 | 19

We have complete freedom to re-name columns and record-piece labels (the labels that tell us which portion of a block-record each row fits into).

In the development version of `cdata` (`1.0.5` or newer, install instructions here) we can make things even easier and use a convenience function that combines these steps.

```
t2 <- convert_records(
  data,
  keyColumns = keyColumns,
  incoming_shape = incoming_shape,
  outgoing_shape = outgoing_shape)
knitr::kable(t2)
```

record_id | column | row1 | row2 | row3
---|---|---|---|---
1 | col1 | 1 | 4 | 7
1 | col2 | 2 | 5 | 8
1 | col3 | 3 | 6 | 9
2 | col1 | 11 | 14 | 17
2 | col2 | 12 | 15 | 18
2 | col3 | 13 | 16 | 19

These conversions can also be translated into `rquery` operators, and therefore saved to be run either in memory or directly on a database.

```
table_description <- rquery::local_td(data)
ops <- table_description %.>%
  convert_records(
    .,
    keyColumns = keyColumns,
    incoming_shape = incoming_shape,
    outgoing_shape = outgoing_shape)
cat(format(ops))
#> table(data;
#>   record_id,
#>   row,
#>   col1,
#>   col2,
#>   col3) %.>%
#>  non_sql_node(., blocks_to_rowrecs(.)) %.>%
#>  non_sql_node(., rowrecs_to_blocks(.))
rquery::column_names(ops)
#> [1] "record_id" "column" "row1" "row2" "row3"
```

To make `R` quasi-quotation easier to teach, it would be nice if `R` string-interpolation and quasi-quotation both used the same notation. They are related concepts, so some commonality of notation would actually be clarifying, and help teach the concepts. We will define both of the above terms, and demonstrate the relation between the two concepts.

String-interpolation is the name for substituting a value into a string. For example:

```
library("wrapr")

variable <- as.name("angle")
sinterp(
  'variable name is .(variable)'
)
```

`## [1] "variable name is angle"`

Notice the "`.(variable)`" portion was replaced with the actual variable name "`angle`". For string interpolation we are intentionally using the "`.()`" notation that Thomas Lumley picked in 2003 when he introduced quasi-quotation into `R` (quasi-quotation being a different concept than string-interpolation, and the topic of our next section).

String interpolation is a common need, and there are many `R` packages that supply variations of such functionality:

- `base::sprintf()`
- `R.utils::gstring()`
- `rprintf::rprintf()`
- `stringr::str_interp()`
- `glue::glue()`
- `wrapr::sinterp()` (requires version 1.8.3, or newer).
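As a small point of comparison, the base entry in the list above can perform the same substitution; a minimal sketch (note `sprintf()` is positional, not name-based):

```r
# the same interpolation via base::sprintf(), converting the name to character
variable <- as.name("angle")
sprintf("variable name is %s", as.character(variable))
# [1] "variable name is angle"
```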

A related idea is "quasi-quotation" which substitutes a value into a general expression. For example:

```
angle <- 1:10
variable <- as.name("angle")
evalb(
  plot(x = .(variable),
       y = sin(.(variable)))
)
```

Notice how in the above plot the actual variable name "`angle`" was substituted into the `graphics::plot()` arguments, allowing this name to appear on the axis labels.

`evalb()` is a very simple function built on top of `base::bquote()`:

`print(evalb)`

```
## function(..., where = parent.frame()) {
##   force(where)
##   exprq <- bquote(..., where = where)
##   eval(exprq,
##        envir = where,
##        enclos = where)
## }
## <bytecode: 0x7fa0181b4470>
## <environment: namespace:wrapr>
```

All `evalb()` does is call `bquote()` and then evaluate the result. A way to teach this is to just call `bquote()` alone.

```
bquote(
  plot(x = .(variable),
       y = sin(.(variable)))
)
```

`## plot(x = angle, y = sin(angle))`

And we see the un-executed code with the substitutions performed.

There are many `R` quasi-quotation systems including:

- `base::bquote()`
- `gtools::strmacro()`
- `lazyeval`
- `wrapr::let()`
- `rlang::as_quosure()`
- `nseval`

If you don’t want to wrap your `plot()` call in `evalb()` you can instead pre-adapt the function. Below we create a new function `plotb()` that is intended as shorthand for `eval(bquote(plot(...)))`.

```
plotb <- bquote_function(graphics::plot)
plotb(x = .(variable),
      y = sin(.(variable)))
```

When string-interpolation and quasi-quotation use the same notation we can teach them quickly as simple related concepts.

`R` Tip: use inline operators for legibility.

A `Python` feature I miss when working in `R` is the convenience of `Python`’s inline `+` operator. In `Python`, `+` does the right thing for some built-in data types:

- It concatenates lists: `[1,2] + [3]` is `[1, 2, 3]`.
- It concatenates strings: `'a' + 'b'` is `'ab'`.

And, of course, it adds numbers: `1 + 2` is `3`.

The inline notation is very convenient and legible. In this note we will show how to use a related notation in `R`.

To be clear: when working in a language it is important to learn to write code that is idiomatic for that language. Otherwise you are fighting the language, and writing code that may be hard for other users to work with (as it won’t match the learnable expectations of the language). The Python community has formalized this concept as “Pythonic”, which means Python Enhancement Proposal (PEP) 8‘s style recommendations plus a number of community conventions. The R situation is less formal, but “R-like” can include some important concepts such as: writing in a functional style, working vectorized, and a number of other concepts.

My note on Timing the Same Algorithm in R, Python, and C++ was a deliberate example of “writing C/C++ style code” in C++ (where that makes sense) plus R and Python (where that can be awkward). In fact I left the semi-colons in the C-style (scalar oriented) to R transliteration to emphasize how alien to R this code is (and later removed them in the more “R-like” vectorized translation).

However, if a good idea from one language works well in another language, then there is a good argument for implementing an analogue. There is no strong reason to leave one language less convenient than another.

For example: in Python `range(a, b)` returns an iterator that enumerates the integers from `a` through `b-1` if `b > a`, and the empty iterator otherwise. This is exactly the right iterator in a zero-indexed language (such as Python) for driving for-loops and list comprehensions. R doesn’t have an operator so closely adapted to its indexing needs (the closest being `seq_len()` and `seq_along()`), so R is missing a bit of the convenience of this Python feature. However it is easy to add an R version of such a feature, and this is found in `wrapr::seqi()`. Note `wrapr::seqi()` is not a direct translation of Python’s `range()`; `wrapr::seqi(a, b)` generates the range of integers `a` through `b` *inclusive* (if `b >= a`), as *this* is the convenient interval notation for a one-indexed language (such as R).
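The inclusive, possibly-empty contract described above is easy to sketch in base R (`seq_incl` is a hypothetical name for illustration; this is not the `wrapr::seqi()` implementation):

```r
# integers a through b inclusive when b >= a, empty otherwise
seq_incl <- function(a, b) {
  if (b >= a) a:b else integer(0)
}

seq_incl(3, 5)  # 3 4 5
seq_incl(5, 3)  # integer(0)
```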

Now back to Python’s `+` features.

The `wrapr` package (available from CRAN) supplies some nice related inline operators including:

- `%c%`: `c(1,2) %c% 3` is `1, 2, 3` (named after R’s `c()` function).
- `%p%`: `"a" %p% "b"` is `"ab"` (named after R’s `paste0()` function).
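Such operators are plain R underneath: any function whose name has the form `%name%` can be used inline. A minimal sketch of the idea (with hypothetical names `%cc%` and `%pp%`, not the `wrapr` implementations):

```r
# user-defined infix operators
`%cc%` <- function(a, b) c(a, b)        # concatenate, modeled on c()
`%pp%` <- function(a, b) paste0(a, b)   # paste, modeled on paste0()

c(1, 2) %cc% 3   # 1 2 3
"a" %pp% "b"     # "ab"
```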

The above code assumes you have the `wrapr` package attached, via already having run `library('wrapr')`.

Notice we picked R-related operator names. We stayed away from overloading the `+` operator, as the arithmetic operators are somewhat special in how they dispatch in R. The goal wasn’t to make R more like Python, but to adapt a good idea from Python to improve R.

The general purpose of the `wrapr` package is to provide extensions that make working in R incrementally more convenient while preserving an “R-like” style. It *might* not seem worth it to bring in a whole package for one or two such features. However, `wrapr` is a very lightweight, low-dependency package. And `wrapr` includes *many* useful extensions, all documented with examples (and many of which are covered in earlier tips).

The second edition of our best-selling book *Practical Data Science with R*, by Zumel and Mount, is featured as deal of the day at Manning.

The second edition isn’t finished yet, but chapters 1 through 4 are available in the Manning Early Access Program (MEAP), and we have finished chapters 5 and 6 which are now in production at Manning (so they should be available soon). The authors are hard at work on chapters 7 and 8 right now.

The discount gets you half off. Also, the second edition comes with a free e-copy of the first edition (so you can jump ahead).

Here are the details in Tweetable form:

Deal of the Day January 13: Half off Practical Data Science with R, Second Edition. Use code dotd011319au at http://bit.ly/2SKAxe9.

`R` Tip: use `seqi()` for indexing.

`R`’s “`1:0` trap” is a mal-feature that confuses newcomers and is a reliable source of bugs. This note will show how to use `seqi()` to write more reliable code and document intent.

The issue is that, contrary to expectations (formed in working with other programming languages), the sequence `1:0` is not empty; it is instead a decreasing sequence. Data scientists typically work in many languages, so we should expect differences. However, having a sequence builder that returns empty when the bounds cross is a common useful tool for controlling loops and other indexing tasks.

We have written about this before. The usual defense is that `1:0` is the same as `seq(1, 0)`, but I see that more as doubling-down than an argument. Also, due to odd behavior when iterating over vectors or lists with class attributes, we sometimes must introduce indices (as it isn’t always safe to directly iterate over contents in `R`).

What this means is that in `R` there is no common safe, succinct way to write index vectors or loops where one of the end-points is passed in as an argument. For example, the following simple example is incorrect.

```
# sum reciprocals of squares of positive integers from 1 up to k
# converges to pi^2/6
sum_sq_recip_k <- function(k) {
  sum(1/((1:k)^2))
}

# should be zero, as the convention 1 up to -1 is the empty set
sum_sq_recip_k(-1)
# [1] Inf
```

There are plenty of ways to write reversed sequences (such as `rev(0:1)`), so writing reversed sequences isn’t a great unmet need. Previously we recommended using `seq_len()` as a solution. This is still good, however it only directly addresses upper-bound issues. For general ranges (where perhaps the lower bound is the parameter) we still have a problem.
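For the upper-bound case, the broken example above can be repaired with `seq_len()`; a minimal sketch (the `max(k, 0)` guard is our addition, since `seq_len()` rejects negative arguments):

```r
# sum reciprocals of squares of 1..k; an empty sum (zero) when k < 1
sum_sq_recip_k <- function(k) {
  sum(1 / (seq_len(max(k, 0))^2))
}

sum_sq_recip_k(-1)  # 0, as desired
```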

`Python` is one of the most popular programming languages, and it supplies a convenient function for the common task of iterating over increasing ranges of integers.

```
# Python code
[k for k in range(3, 5)]
# Out[1]: [3, 4]
[k for k in range(5, 3)]
# Out[2]: []
```

Now of course different programming languages made different choices. However, in my opinion, writing possibly empty sequences parametrically is a common programming need and it is nice to have this be convenient.

Our current advice to `R` users is: use `wrapr::seqi()`, which stands for “sequence, increasing integer(s)”. We needed such a capability when translating `C++` code to `R` code for our `RcppDynProg` example (otherwise we would have had to put guards around the loops so they don’t activate on what should be empty sequences).

`seqi()` is used as follows.

```
library("wrapr")

# print 3, 4, and then 5
for(i in seqi(3, 5)) {
  print(i)
}
#> [1] 3
#> [1] 4
#> [1] 5

# empty
for(i in seqi(5, 2)) {
  print(i)
}
```

This is clear, safe, and documents intent. It is a non-negotiable fact that in `R` `base::seq(1, 0)` is `[1, 0]`. Well, `wrapr::seqi(1, 0)` is `[]`.

While working out the `RcppDynProg` algorithm we derived the following beautiful identity of 2 by 2 real matrices:

The superscript “top” denotes the transpose operation, ||.||^2_2 denotes the sum of squares norm, and the single |.| denotes the determinant.

This is derived from one of the check equations for the Moore–Penrose inverse and we have details of the derivation here, and details of the messy algebra here.

While developing the `RcppDynProg` `R` package I took a little extra time to port the core algorithm from `C++` to both `R` and `Python`.

This means I can time the exact same algorithm implemented nearly identically in each of these three languages. So I can extract some comparative “apples to apples” timings. Please read on for a summary of the results.

The algorithm in question is the general dynamic programming solution to the “minimum cost partition into intervals” problem. As coded in `C++` it uses one-time allocation of large tables and then `for`-loops and index chasing to fill in the dynamic programming table solution. The `C++` code is given here.

I then directly transliterated (or line-for-line translated) this code into `R` (code here) and `Python` (code here). Both of these implementations are very direct translations of the `C++` solution, so they are possibly not what somebody starting in `R` or `Python` would design. So really we are coding in an imperative `C` style in `R` and `Python`. To emphasize the shallowness of the port I deliberately left the semi-colons from the `C++` in the `R` port. The `Python` can be taken to be equally “un-Pythonic” (for example, we are using `for` loops and not list comprehensions).

That being said we now have very similar code to compare in all three languages. We can summarize the timings (details here and here) as follows.

problem | solution language | time in seconds
---|---|---
500 point partition into intervals dynamic program | R | 21
500 point partition into intervals dynamic program | C++ (from R via Rcpp) | 0.088
500 point partition into intervals dynamic program | Python | 39

Notice for this example `C++` is 240 times faster than `R`, and `R` is almost twice as fast as `Python`.

Neither `R` nor `Python` is optimized for the type of index-chasing this dynamic programming solution depends on. So we also took a look at a simpler problem: computing the PRESS statistic, which is easy to vectorize (the preferred way of writing efficient code in `R` and `Python`). When we compare all three languages on this problem we see the following.

problem | solution method | time in seconds
---|---|---
3,000,000 point PRESS statistic calculation | R scalar code | 3.4
3,000,000 point PRESS statistic calculation | Rcpp scalar code | 0.26
3,000,000 point PRESS statistic calculation | R vectorized code | 0.35
3,000,000 point PRESS statistic calculation | Python vectorized (`numpy`) | 0.21

The timing details can be found here and here.

Ignoring the `R` scalar solution (which is *too* direct a translation from `C++` to `R`, but a stepping stone to the `R` vectorized solution, as we discuss here), we see: vectorized `Python` is now about 1.6 times faster than the vectorized `R`, and even 1.2 times faster than the `C++` (probably not due to `Rcpp`, but instead driven by my choice of container class in the `C++` code).

Obviously different code (and per-language tuning and optimization) will give different results. But the above is consistent with our general experience with `R`, `Python`, and `C++` in production.

In conclusion: `R` and `Python` are in fact much slower than `C++` for direct scalar manipulation (single values, indexes, and pointers). However, `R` and `Python` are effective *glue* languages that can be fast when they are orchestrating operations over higher-level abstractions (vectors, databases, data frames, Spark, TensorFlow, or Keras).

One sometimes hears that `R` can not be fast (false), or more correctly that for fast code in `R` you may have to consider “vectorizing.” A lot of knowledgeable `R` users are not comfortable with the term “vectorize”, and not really familiar with the method.

“Vectorize” is just a slightly high-handed way of saying:

`R` naturally stores data in columns (or in column-major order), so if you are not coding to that pattern you are fighting the language.

In this article we will make the above clear by working through a non-trivial example of writing vectorized code.

For our example problem we will take on the task of computing the PRESS statistic (a statistic we use in our new `RcppDynProg` package). The “predicted residual error sum of squares” or PRESS statistic is an estimate of the out-of-sample quality of a fit. We motivate the PRESS statistic as follows.

Suppose we are fitting a simple linear model. In such a case it is natural to examine the model residuals (differences between the actual values and the matching predictions):

```
d <- data.frame(
  x = 1:6,
  y = c(1, 1, 2, 2, 3, 3))
lm(y ~ x, data = d)$residuals
#          1          2          3          4          5          6
#  0.1428571 -0.3142857  0.2285714 -0.2285714  0.3142857 -0.1428571
```

However, because the model has seen the data it is being applied to, these residuals are not representative of the residuals we would see on new data (there is a towards-zero bias in this estimate). An improved estimate is the PRESS statistic: for each point the model is fit on all points except the point in question, and then the residuals are estimated. This is a form of cross-validation and is easy to calculate:

```
xlin_fits_lm <- function(x, y) {
  n <- length(y)
  d <- data.frame(x = x, y = y)
  vapply(
    seq_len(n),
    function(i) {
      m <- lm(y ~ x, data = d[-i, ])
      predict(m, newdata = d[i, ])
    }, numeric(1))
}

d$y - xlin_fits_lm(d$x, d$y)
# [1]  0.3000000 -0.4459459  0.2790698 -0.2790698  0.4459459 -0.3000000
```

Notice these values tend to be further from zero. They also better represent how an overall model might perform on new data.

At this point, from a statistical point of view, we are done. However, re-building an entire linear model for each point is computationally inefficient. Each of the models we are calculating share many training points, so we should be able to build a much faster hand-rolled calculation.

That faster calculation is given in the rather formidable function `xlin_fits_R()` found here. This function computes the summary statistics needed to solve the linear regression for all of the data. Then, for each point in turn, it subtracts that point’s contribution out of the summaries, performs the algebra needed to solve for the model, and then applies it. The advantage of this approach is that taking the point out and solving the model takes only a very small (constant) number of steps, independent of how many points are in the summary. So this code is, in *principle*, very efficient.
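The “subtract the point’s contribution out of the summaries” idea can be seen in miniature with the leave-one-out mean (our own toy analogue, much simpler than the linear-model case):

```r
# leave-one-out means for all points at once:
# remove each point's contribution from the shared total
loo_mean <- function(y) {
  (sum(y) - y) / (length(y) - 1)
}

loo_mean(c(1, 2, 3))  # 2.5 2.0 1.5
```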

And in fact timings on a small problem (300 observations) show that while the simple “`xlin_fits_lm()` calls `lm()` a bunch of times” approach takes 0.28 seconds, the more complicated `xlin_fits_R()` takes 0.00043 seconds: a speedup of over 600 times!

However this code is performing a separate calculation for each scalar data point. As we mentioned above, this is fighting `R`, which is specialized for performing calculations over large vectors. The exact same algorithm written in `C++`, instead of `R`, takes 0.000055 seconds: almost another multiple of 10 faster!

The timings are summarized below.

This sort of difference, scalar-oriented `C++` being so much faster than scalar-oriented `R`, is often distorted into “`R` is slow.”

This is just not the case. If we adapt the algorithm to be vectorized we get an `R` algorithm with performance comparable to the `C++` implementation!

Not all algorithms can be vectorized, but this one can, and in an incredibly simple way. The original algorithm itself (`xlin_fits_R()`) is a bit complicated, but the vectorized version (`xlin_fits_V()`) is literally derived from the earlier one by crossing out the indices. That is: in this case we can move from working over very many scalars (slow in `R`) to working over a small number of vectors (fast in `R`).

Let’s take a look at the code transformation.

We are *not* saying that `xlin_fits_R()` or `xlin_fits_V()` are easy to understand; we felt pretty slick when we derived them, and added a lot of tests to confirm they calculate the same thing as `xlin_fits_lm()`. What we are saying is that the transform from `xlin_fits_R()` to `xlin_fits_V()` is simple: just cross out the for-loop and all of the “`[k]`” indices!

Performing the exact same operation on every entry in a structure (but with different values) is the essence of “vectorized code.” When we wrote a `for`-loop in `xlin_fits_R()` to perform the same steps for each index, we were in fact fighting the language. Crossing out the `for`-loop and indices mechanically turned the scalar code into faster vector code.
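A toy version of the “cross out the loop and the indices” move (our own illustration, not the `xlin_fits_*()` code):

```r
# scalar style: explicit loop and "[k]" indexing (fights R)
f_scalar <- function(x) {
  r <- numeric(length(x))
  for (k in seq_len(length(x))) {
    r[k] <- x[k] * x[k] + 1
  }
  r
}

# vectorized: the same computation with the loop and indices crossed out
f_vector <- function(x) {
  x * x + 1
}

all(f_scalar(1:1000) == f_vector(1:1000))  # TRUE
```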

And that is our example of how and why to vectorize code in `R`.

`RcppDynProg` is a new `Rcpp`-based `R` package that implements simple, but powerful, table-based dynamic programming. This package can be used to optimally solve the minimum cost partition into intervals problem (described below) and is useful in building piecewise estimates of functions (shown in this note).

The primary problem `RcppDynProg::solve_interval_partition()` is designed to solve is formally given as follows.

Minimum cost partition into intervals.

Given: a positive integer `n` and an `n` by `n` matrix called `costs`.

Find: an increasing sequence of integers `soln` with `length(soln) == k` (`k >= 2`), `soln[1] == 1`, and `soln[k] == n+1`, such that `sum[i=1,...,k-1] costs[soln[i], soln[i+1]-1]` is minimized.

To rephrase: `costs[i,j]` specifies the cost of taking the interval of integers `{i,...,j}` (inclusive) as a single element of our solution. The problem is to find the minimum cost partition of the set of integers `{1,...,n}` into a sequence of intervals. A user supplies a matrix of costs of *every* possible interval of integers, and the solver then finds the disjoint *set* of intervals covering `{1,...,n}` with the lowest sum of costs. The user encodes their optimization problem as a family of interval costs (`n(n+1)/2` of them, which is a lot, but tractable) and the algorithm quickly finds the best simultaneous set of intervals (there are `2^(n-1)` partitions into intervals, so exhaustive search would not be practical).

We can illustrate this abstract problem as follows (if this is too abstract, please skip forward to the concrete application).

Suppose we have the following cost matrix.

```
costs <- matrix(c(1.5, NA, NA, 1, 0, NA, 5, -1, 1),
                nrow = 3)
print(costs)
#      [,1] [,2] [,3]
# [1,]  1.5    1    5
# [2,]   NA    0   -1
# [3,]   NA   NA    1
```

Then the optimal partition is found as follows.

```
library("RcppDynProg")
soln <- solve_interval_partition(costs, nrow(costs))
print(soln)
# [1] 1 2 4
```

The sequence `[1, 2, 4]` is just a compact representation for the following sequence of intervals.

```
lapply(
  seq_len(length(soln)-1),
  function(i) {
    soln[i]:(soln[i+1]-1)
  })
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2 3
```

This is saying the optimal partition into intervals is the sequence of sets `[{1}, {2, 3}]`, which has total cost `costs[1,1] + costs[2,3]`. The dynamic programming solver knew to take the expensive set `{1}` to allow the cheap set `{2, 3}` to be in its chosen partition. This is the essence of dynamic programming: finding an optimal *global* solution, even if it requires odd-looking local choices.
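For the curious, the interval-partition recurrence itself can be sketched in a few lines of base R (an illustration only, not the `RcppDynProg` implementation; `solve_parts` is a hypothetical name):

```r
# dynamic program over prefixes: best[j] is the minimum cost of
# partitioning {1, ..., j-1} into intervals; prev[j] records the
# start of the last interval used to achieve it
solve_parts <- function(costs) {
  n <- nrow(costs)
  best <- c(0, rep(Inf, n))
  prev <- rep(NA_integer_, n + 1)
  for (j in 2:(n + 1)) {
    for (i in 1:(j - 1)) {
      cand <- best[i] + costs[i, j - 1]
      if (!is.na(cand) && cand < best[j]) {
        best[j] <- cand
        prev[j] <- i
      }
    }
  }
  # walk back from n+1 to recover the cut points
  soln <- n + 1
  while (soln[1] > 1) soln <- c(prev[soln[1]], soln)
  soln
}

costs <- matrix(c(1.5, NA, NA, 1, 0, NA, 5, -1, 1), nrow = 3)
solve_parts(costs)  # 1 2 4, total cost costs[1,1] + costs[2,3] = 0.5
```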

The intended application of `RcppDynProg` is to find optimal piecewise solutions to single-variable modeling problems. For example, consider the following data.

In the above we have an input (or independent variable) `x` and an observed outcome (or dependent variable) `y_observed` (portrayed as points). `y_observed` is the unobserved ideal value `y_ideal` (portrayed by the dashed curve) plus independent noise. The modeling goal is to get close to the `y_ideal` curve using the `y_observed` observations. Obviously this can be done with a smoothing spline, but let’s use `RcppDynProg` to find a piecewise linear fit.

To encode this as a dynamic programming problem we need to build a cost matrix that, for every consecutive interval of `x`-values, estimates the out-of-sample quality of fit. This is supplied by the function `RcppDynProg::lin_costs()` (using the PRESS statistic), but let’s take a quick look at the idea.

The following interval is a good interval, as all the chosen points (shown in dark blue) are in a nearly linear arrangement. The in-sample price of the interval would be the total sum of squared residuals of a linear model fit on the selected region (and the out-of-sample price would be given by the PRESS statistic).

The "cost" (or loss) of this interval can be estimated as shown.

```
print(good_interval_indexes) # interval
# [1]  94 139
print(1 + good_interval_indexes[2] - good_interval_indexes[1]) # width
# [1] 46
fit <- lm(y_observed ~ x,
          data = d[good_interval_indexes[1]:good_interval_indexes[2], ])
sum(fit$residuals^2) # cost for interval
# [1] 2.807998
```

The following interval is a bad interval, as all the chosen points (shown in dark blue) are not in a nearly linear arrangement.

```
print(bad_interval_indexes) # interval
# [1] 116 161
print(1 + bad_interval_indexes[2] - bad_interval_indexes[1]) # width
# [1] 46
fit <- lm(y_observed ~ x,
          data = d[bad_interval_indexes[1]:bad_interval_indexes[2], ])
sum(fit$residuals^2) # cost for interval
# [1] 5.242647
```

The user would price all of the intervals individually, and then ask the solver to find the best simultaneous set of intervals.
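To make “price all of the intervals individually” concrete, here is a hedged sketch that prices each interval by in-sample squared residuals (simpler than the PRESS-based pricing of `RcppDynProg::lin_costs()`; `interval_costs` is a hypothetical name):

```r
# price every interval [i, j] by the sum of squared residuals of a
# linear fit on just that interval (NA for intervals too short to fit)
interval_costs <- function(x, y, min_width = 2) {
  n <- length(x)
  costs <- matrix(NA_real_, nrow = n, ncol = n)
  for (i in seq_len(n)) {
    for (j in i:n) {
      if (j - i + 1 >= min_width) {
        fit <- lm(y ~ x, data = data.frame(x = x[i:j], y = y[i:j]))
        costs[i, j] <- sum(fit$residuals^2)
      }
    }
  }
  costs
}
```

A matrix like this (with a penalty added to discourage tiny intervals) is the kind of input the solver consumes.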

The complete solution is worked as follows (using the `RcppDynProg::solve_for_partition()` function, which wraps all the steps together, converting from indices to `x`-coordinates).

```
x_cuts <- solve_for_partition(d$x, d$y_observed, penalty = 1)
print(x_cuts)
# x pred group what
# 1 0.05 -0.1570880 1 left
# 2 4.65 1.1593754 1 right
# 3 4.70 1.0653666 2 left
# 4 6.95 -0.9770792 2 right
# 5 7.00 -1.2254925 3 left
# 6 9.20 0.8971391 3 right
# 7 9.25 1.3792437 4 left
# 8 11.10 -1.1542021 4 right
# 9 11.15 -1.0418353 5 left
# 10 12.50 1.1519490 5 right
# 11 12.55 1.3964906 6 left
# 12 13.75 -1.2045219 6 right
# 13 13.80 -1.3791405 7 left
# 14 15.00 1.0195679 7 right
d$estimate <- approx(x_cuts$x, x_cuts$pred,
                     xout = d$x,
                     method = "linear", rule = 2)$y
d$group <- as.character(
  findInterval(d$x, x_cuts[x_cuts$what=="left", "x"]))
plt2 <- ggplot(data = d, aes(x = x)) +
  geom_line(aes(y = y_ideal), linetype = 2) +
  geom_point(aes(y = y_observed, color = group)) +
  geom_line(aes(y = estimate, color = group)) +
  ylab("y") +
  ggtitle("RcppDynProg piecewise linear estimate",
          subtitle = "dots: observed values, segments: observed group means, dashed line: unobserved true values") +
  theme(legend.position = "none") +
  scale_color_brewer(palette = "Dark2")
print(plt2)
```

`RcppDynProg::solve_for_partition()` finds a partition of a relation into a number of linear estimates. Each interval is priced using out-of-sample cost via the PRESS statistic plus the specified penalty (to discourage small intervals). Notice, however, the user did not have to specify a *k* (number of intervals) to get a good result.

The entire modeling procedure is wrapped as a `vtreat` custom-coder in the function `RcppDynProg::piecewise_linear()`. This allows such variable treatments to be easily incorporated into modeling pipelines (example here).

In addition to a piecewise linear solver we include a piecewise constant solver, which is demonstrated here. Other applications can include peak detection, or any other application where the per-segment metrics are independent.

The solver is fast through the use of three techniques:

- `RcppDynProg::solve_for_partition()` includes a problem reduction heuristic in the spirit of the parameterized complexity methodology.
- Ordered (or interval) partition problems are amenable to dynamic programming, because initial segments of an interval partition have succinct summaries (just the right-most index and how many segments were used to get to this point).
- `RcppDynProg` is a fast `C++` implementation using `Rcpp`.

Some basic timings show the `C++` implementation can be over 200 times faster than a direct transliteration of the same code into `R` (so not vectorized, not fully R-idiomatic, with some time lost to the `seqi()` abstraction), and over 400 times faster than a `Python` direct transliteration of the same code (so not optimized, and not "Pythonic"). The non-optimized and non-adapted nature of the code translations unfortunately exaggerates the speedup; however, `Rcpp` is likely buying us a solid factor of over 100, as `C++` is going to be much more efficient at all of the index chasing this dynamic programming solution is based on.

A note on problem complexity: general partition problems (where we do not restrict the subsets to be intervals) are NP-hard, so not thought to be amenable to efficient general solutions at scale (subset sum problems being good examples).

However, this post is going to be an exception.

I’ve just got back from photographing the Rotary Club of San Francisco‘s 2018 Holiday Party. We had a special guest, SF Mayor London Breed (shown here with Rotary Club of San Francisco President Rhonda Poppen).

I am proud to say I have been a member of this organization for over 10 years. It is where I do my volunteer work both in San Francisco and internationally.

In particular I am thrilled to be supporting the efforts of a number of Rotarians and Roots of Peace in their latest effort to remediate farmland in Vietnam (with the help and permission of the Vietnamese government). These people are working hard to undo some of the pain and misery of unexploded ordnance (UXO). I’ll be helping with some administrative tasks, and these people will be training hundreds of farmers to move into profitable world-market crops.

Pictured above: Heidi Kuhn and Christian Kuhn of Roots of Peace.
