Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don’t use the default cross validation settings. The default can be a deterministic, and even ordered, split, which is not in general what one wants or expects from a statistical point of view. From a software engineering point of view the defaults may be sensible: since they don’t touch the pseudo-random number generator, they are repeatable, deterministic, and side-effect free.

This issue falls under “read the manual”, but it is always frustrating when the defaults are not sufficiently generous.

To see what is going on, let’s work an example.

First we import our packages/modules.

```
import pandas
import numpy
import sklearn
import sklearn.model_selection
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import cross_val_predict
sklearn.__version__
```

`'0.21.3'`

Now let’s set up some simple example data.

```
nrow = 15
cv = 3
d = pandas.DataFrame({
    'const': ['a'] * nrow,
    'r1': numpy.random.normal(size=nrow),
    'row_id': range(nrow)
})
y = [2**i for i in range(nrow)]
d
```

| | const | r1 | row_id |
|---|---|---|---|
| 0 | a | -0.090306 | 0 |
| 1 | a | -0.062128 | 1 |
| 2 | a | 0.530181 | 2 |
| 3 | a | -0.769375 | 3 |
| 4 | a | -2.082851 | 4 |
| 5 | a | 0.703230 | 5 |
| 6 | a | 0.404206 | 6 |
| 7 | a | -0.648879 | 7 |
| 8 | a | 0.149515 | 8 |
| 9 | a | -0.697519 | 9 |
| 10 | a | -0.177883 | 10 |
| 11 | a | 0.809709 | 11 |
| 12 | a | 0.956048 | 12 |
| 13 | a | 0.621239 | 13 |
| 14 | a | -0.579699 | 14 |

We now use `sklearn.model_selection.cross_val_predict` to land some derived columns. In this case we are going to land the global average of the outcome `y` as our estimate.

```
class CopyYMeanTransform(BaseEstimator, TransformerMixin):
    """Predict the mean of the training y for every row."""

    def __init__(self):
        self.est = 0
        BaseEstimator.__init__(self)
        TransformerMixin.__init__(self)

    def fit(self, X, y):
        # remember the mean of the outcome seen during fitting
        self.est = numpy.mean(y)
        return self

    def transform(self, X):
        # one copy of the fitted mean per input row
        return [self.est] * X.shape[0]

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

    def predict(self, X):
        return self.transform(X)

    def fit_predict(self, X, y):
        return self.fit_transform(X, y)


ests1 = cross_val_predict(CopyYMeanTransform(), d, y, cv=cv)
pandas.DataFrame({'ests': ests1})
```

| | ests |
|---|---|
| 0 | 3273.6 |
| 1 | 3273.6 |
| 2 | 3273.6 |
| 3 | 3273.6 |
| 4 | 3273.6 |
| 5 | 3177.5 |
| 6 | 3177.5 |
| 7 | 3177.5 |
| 8 | 3177.5 |
| 9 | 3177.5 |
| 10 | 102.3 |
| 11 | 102.3 |
| 12 | 102.3 |
| 13 | 102.3 |
| 14 | 102.3 |

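In the result we notice two things: there are only three distinct estimates, and each estimate is repeated over a consecutive block of five rows.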

Let’s re-encode the output to see what is going on. We deliberately chose the `y` values to be powers of 2, so for each estimate `v` the product `v*(nrow-nrow/cv)` should give us the exact rows used in that calculation as bit positions. We can view this as follows.

```
pandas.DataFrame({
    'blocks':
        [format(int(v*(nrow-nrow/cv)), '#0' + str(nrow+2) + 'b') for v in ests1]
})
```

| | blocks |
|---|---|
| 0 | 0b111111111100000 |
| 1 | 0b111111111100000 |
| 2 | 0b111111111100000 |
| 3 | 0b111111111100000 |
| 4 | 0b111111111100000 |
| 5 | 0b111110000011111 |
| 6 | 0b111110000011111 |
| 7 | 0b111110000011111 |
| 8 | 0b111110000011111 |
| 9 | 0b111110000011111 |
| 10 | 0b000001111111111 |
| 11 | 0b000001111111111 |
| 12 | 0b000001111111111 |
| 13 | 0b000001111111111 |
| 14 | 0b000001111111111 |

The first row indicates it was derived from all rows except the first 5 (as the 5 lowest bit positions are zero). In fact the first five rows are all calculated in this manner. So we have 3-way cross validation (each row is calculated using 2/3rds of the data), but in consecutive blocks.
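The same decoding can be written to return row indices instead of a bit string. The following is a minimal sketch, not part of the original notebook: the helper name `rows_used` is hypothetical, and it uses `round` rather than `int` as a guard against floating point error in the product.

```python
nrow = 15
cv = 3
train_size = nrow - nrow // cv  # 10 rows are used to fit each fold


def rows_used(est, train_size):
    """Recover which rows were averaged, assuming y[i] == 2**i (hypothetical helper)."""
    total = round(est * train_size)  # undo the mean to get the sum of powers of 2
    return [i for i in range(total.bit_length()) if (total >> i) & 1]


# the three distinct estimates reported in the table above
for est in [3273.6, 3177.5, 102.3]:
    print(est, rows_used(est, train_size))
```

Each printed list is the set of training rows behind that estimate, so the rows missing from it are exactly the held-out block.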

This happens because `sklearn.model_selection.cross_val_predict` defaults to using one of `sklearn.model_selection.KFold` or `sklearn.model_selection.StratifiedKFold`. These in turn both default to `shuffle=False`, which explains the observed behavior.
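One way to see the unshuffled plan directly is to ask `KFold` for its splits without going through `cross_val_predict`. This is a small sketch under the assumption that only the number of rows matters to the splitter; the placeholder array `X` and the name `default_test_folds` are introduced here for illustration.

```python
import numpy
from sklearn.model_selection import KFold

X = numpy.zeros((15, 1))  # placeholder data; only the row count is used

# KFold with its default shuffle=False
default_test_folds = [
    test_idx.tolist() for _, test_idx in KFold(n_splits=3).split(X)
]
for fold in default_test_folds:
    print("held-out rows:", fold)
```

The held-out rows come out as three consecutive blocks, matching the block structure seen in the estimates.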

This is “as expected” in the sense that it is clearly documented. It is, however, not how a statistician would expect k-fold cross validation to work for a small k.

The solution is, as documented, to avoid the default by explicitly setting the cross validation strategy. We demonstrate this here.

```
cvstrat = sklearn.model_selection.KFold(shuffle=True, n_splits=3)
ests2 = sklearn.model_selection.cross_val_predict(CopyYMeanTransform(), d, y, cv=cvstrat)
pandas.DataFrame({
    'ests2':
        [format(int(v*(nrow-nrow/cv)), '#0' + str(nrow+2) + 'b') for v in ests2]
})
```

| | ests2 |
|---|---|
| 0 | 0b111101010110110 |
| 1 | 0b110111111001001 |
| 2 | 0b110111111001001 |
| 3 | 0b111101010110110 |
| 4 | 0b110111111001001 |
| 5 | 0b110111111001001 |
| 6 | 0b111101010110110 |
| 7 | 0b001010101111111 |
| 8 | 0b111101010110110 |
| 9 | 0b001010101111111 |
| 10 | 0b111101010110110 |
| 11 | 0b001010101111111 |
| 12 | 0b110111111001001 |
| 13 | 0b001010101111111 |
| 14 | 0b001010101111111 |

This is still a 3-fold cross validation strategy, as there are only 3 distinct calculations made. However, the arrangement is now random, subject to the important constraint that the `i`-th row is not an input to the `i`-th result.

We can also confirm that the shuffle option shuffles the cross-validation plan, and not the data set rows.

```
class CopyXTransform(BaseEstimator, TransformerMixin):
    """Return the input rows unchanged, to expose the order they arrive in."""

    def __init__(self):
        BaseEstimator.__init__(self)
        TransformerMixin.__init__(self)

    def fit(self, X, y):
        return self

    def transform(self, X):
        return X.copy()

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

    def predict(self, X):
        return self.transform(X)

    def fit_predict(self, X, y):
        return self.fit_transform(X, y)


preds = sklearn.model_selection.cross_val_predict(
    CopyXTransform(), d, y, cv=cvstrat)
pandas.DataFrame(preds)
```

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | a | -0.0903059 | 0 |
| 1 | a | -0.0621276 | 1 |
| 2 | a | 0.530181 | 2 |
| 3 | a | -0.769375 | 3 |
| 4 | a | -2.08285 | 4 |
| 5 | a | 0.70323 | 5 |
| 6 | a | 0.404206 | 6 |
| 7 | a | -0.648879 | 7 |
| 8 | a | 0.149515 | 8 |
| 9 | a | -0.697519 | 9 |
| 10 | a | -0.177883 | 10 |
| 11 | a | 0.809709 | 11 |
| 12 | a | 0.956048 | 12 |
| 13 | a | 0.621239 | 13 |
| 14 | a | -0.579699 | 14 |

The `CopyXTransform` copied out the input data in its original order, confirming that shuffle shuffles the plan, not the data rows.

And that concludes our tip: don’t use default cross validation settings.