*Please* help us get the word out by sharing/Tweeting!

- Random Test/Train Split is not Always Enough
- How Do You Know if Your Data Has Signal?
- How do you know if your model is going to work?
- A Simpler Explanation of Differential Privacy (explaining the reusable holdout set)
- Using differential privacy to reuse training data
- Preparing Data for Analysis using R: Basic through Advanced Techniques

What stands out in these presentations is: the simple practice of a static test/train split is merely a convenience to cut down on operational complexity and difficulty of teaching. It is in no way optimal. That is, using slightly more complicated procedures can build better models on a given set of data.

Suggested static cal/train/test experiment design from vtreat data treatment library.

When you think about data handling as being a part of the modeling process, you realize you can use your data with more statistical efficiency. You can build better models with the same amount of original data by trying one of:

- Jackknifing calibration/training/test data.
- Protecting calibration data by a significance threshold.
- Protecting calibration data by noising (differential privacy).

All of these techniques are demonstrated as examples in our articles. The idea is to get the most out of your data by fluidly re-arranging it for analysis (versus a rigid test/train split).

These techniques are more complicated than the traditional one-time test/train split (and much more complicated than the flawed approach of training and testing on the same single data set). For example, consider the following simple improvement: re-training a production model on all of your data *after* you are done scoring models on test/train splits. This produces the best possible model (as it used all of your data), whose performance you just happen not to know (as you have no data disjoint from training to score it on). This is a good practice, and can make quite a lot of difference when you have limited or expensive-to-produce data.
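As a concrete illustration, here is a minimal R sketch of that improvement (the data is synthetic and the variable names are ours, purely for illustration):

```r
set.seed(2015)
# Synthetic data: y depends linearly on x.
d <- data.frame(x = rnorm(100))
d$y <- 2*d$x + rnorm(100)

# Score the candidate model form on a one-time test/train split.
isTrain <- runif(nrow(d)) <= 0.7
model <- lm(y ~ x, data = d[isTrain, ])
testRMSE <- sqrt(mean((d$y[!isTrain] -
                       predict(model, newdata = d[!isTrain, ]))^2))
print(testRMSE)  # held-out performance estimate for the chosen model form

# Once model selection is done: re-fit the chosen form on ALL the data.
# We keep testRMSE as a (slightly pessimistic) performance estimate.
finalModel <- lm(y ~ x, data = d)
```

The final model is fit on all 100 rows, so it should be at least as good as the model we scored; we just can't measure it directly.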

Computer science dean and professor Dr. Merrick Furst taught:

The biggest difference between time and space is that you can’t reuse time.

For data science this might be:

The biggest difference between computation and data is you can’t always spin up more data.

Even in the “big data” era, data can be more valuable than processor cycles (such as when predicting rare events). Data handling is part of model design, and not something you can always leave to a framework.

]]>**Workshop at ODSC, San Francisco – November 14**

Both of us will be giving a two-hour workshop called *Preparing Data for Analysis using R: Basic through Advanced Techniques*. We will cover key issues in this important but often neglected aspect of data science, what can go wrong, and how to fix it. This is part of the Open Data Science Conference (ODSC) at the Marriott Waterfront in Burlingame, California, November 14-15. If you are attending this conference, we look forward to seeing you there!

You can find an abstract for the workshop, along with links to software and code you can download ahead of time, here.

**An Introduction to Differential Privacy as Applied to Machine Learning: Women in ML/DS – December 2**

I (Nina) will give a talk to the Bay Area Women in Machine Learning & Data Science Meetup group, on applying differential privacy for reusable hold-out sets in machine learning. The talk will also cover the use of differential privacy in effects coding (what we’ve been calling “impact coding”) to reduce the bias that can arise from the use of nested models. Information about the talk, and the meetup group, can be found here.

We’re looking forward to these upcoming appearances, and we hope you can make one or both of them.

]]>I thought I would take a peek to learn about the statistical methodology (see here for some commentary). I would say the kindest thing one can say about the paper is: its problems are not statistical.

At this time the authors don’t seem to have supplied their data preparation or analysis scripts and the paper “isn’t published yet” (though they have had time for a press release), so we have to rely on their pre-print. Read on for excerpts from the work itself (with commentary).

Using 2007-2008 Centers for Disease Control’s National Household and Nutrition Examination Survey, the consumption incidence of targeted foods on two non-continuous days was examined across discrete ranges of BMI.

(My understanding is that the NHANES is a “day later recall” survey, so at best we are measuring “reported consumption incidence,” not consumption. So even done well, the strongest conclusion such a study could support would be something like “people are bad at remembering how much they ate.” This reminds one of the title of an earlier book by Wansink: “Mindless Eating: Why We Eat More Than We Think.” Frankly, this sounds like a dataset unsuitable for establishing anything like the paper’s title.)

Data were analyzed in 2011.

(Okay, not a “fast” publication. So was this also published in 2011? Or was it something that has been claimed for four years and is now being substantiated?)

After excluding the clinically underweight and morbidly obese, consumption of fast food, soft drinks or candy was not positively correlated with measures of BMI.

(Eliminate enough outcome variation and there is no variation to measure/explain.)

We restrict our sample to adults, defined as age 18 or older, who completed two 24-hour dietary recall surveys.

(It plausibly takes more than two days of measurements to get a good image of long-term eating habits. Also, most “food regulation,” a topic these authors have written on, is targeted at children, so for a useful public policy analysis it would have been nice to leave them in.)

We focus on eating episode rather than amount eaten because it is less subject to recall bias.

(Breaking the actual relation between eating and health, by leaving out amount. Also some effective diets advise more sittings of much smaller portions. Finally haven’t changes in fast-food portions been a huge issue?)

We compare average eating episodes within food and across BMI categories.

(I am guessing this means they are modeling BMI category code instead of the BMI number. There are only about 3 BMI category codes left after “excluding the clinically underweight and morbidly obese.” Again eliminate variation in the measured outcome, and nothing will correlate to it.)

Missing data were omitted from the analysis …

(Just dropping missing data is not likely to work with interview data, unless you truly believe censoring is completely independent of health, diet, and health/diet interactions.)

Likewise, those with normal BMIs consume an average of 1.1 salty snacks over two days, while overweight, obese, and morbidly obese consume an average 0.9, 1.0, and 0.9 salty snacks, respectively.

(Uh, I thought we were “excluding the clinically underweight and morbidly obese.” I guess this is a different analysis. But here is a statistical issue: it really doesn’t look like the independent variable (“salty snacks”) is varying, so you are not going to be able to see if it drives an outcome. And since there isn’t a complete methods section, I really wonder if the analysis is actually looking at the claimed underlying data, or just at aggregate values.)

From: Table 1. Average Instances of Consumption in 48 Hours of Various Food Items, Sorted by BMI

(I’m not a statistician, but a negative p-value? Maybe that is some variation of z? But the weird values are not just in one column. Is all this just off one ANOVA table? Also, why not try a linear regression on BMI score using non-grouped data, or a logistic regression on BMI category?)
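For instance, the simpler regressions the comment asks about would look something like the following. The data here is simulated by us purely to illustrate the method (a small planted effect of snack count on BMI):

```r
set.seed(2015)
# Simulated stand-in data: per-respondent snack count and BMI,
# with a small planted effect of snacks on BMI.
n <- 1000
snacks <- rpois(n, lambda = 1)
bmi <- 25 + 0.5*snacks + rnorm(n, sd = 4)
isObese <- bmi >= 30

# Linear regression on the un-grouped BMI scores.
print(summary(lm(bmi ~ snacks)))

# Logistic regression on a BMI category.
print(summary(glm(isObese ~ snacks, family = binomial)))
```

Because the outcome is left un-grouped, the planted effect is recoverable; collapsing BMI into a few categories throws that variation away.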

Also, when the input (or “independent”) variables are not known to be independent of each other, ANOVA is variable-order dependent! Usually this is handled by experiment design, but in this case we are observing eating patterns, not assigning them.

Some R code showing the effect is given below. Notice all of the x’s have the same relation to y, but the ANOVA analysis assigns effect in variable order. It does not make any sense to say “x1 is significant, but x10 is not” as the F-scores are not about each variable in isolation.

```
set.seed(6326)
d <- data.frame(y=rnorm(100))
for(i in 1:10) {
  d[[paste('x',i,sep='')]] <- d$y + rnorm(nrow(d))
}
anova(lm(y~x1+x2+x3+x4+x5+x6+x7+x9+x10,data=d))
```

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 70.643 70.643 640.0173 < 2.2e-16 ***
## x2 1 22.647 22.647 205.1824 < 2.2e-16 ***
## x3 1 5.285 5.285 47.8821 6.425e-10 ***
## x4 1 6.588 6.588 59.6906 1.491e-11 ***
## x5 1 2.382 2.382 21.5771 1.155e-05 ***
## x6 1 3.027 3.027 27.4269 1.063e-06 ***
## x7 1 0.494 0.494 4.4757 0.03714 *
## x9 1 1.914 1.914 17.3441 7.137e-05 ***
## x10 1 0.376 0.376 3.4048 0.06830 .
## Residuals 90 9.934 0.110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

I am sure I got a few points wrong, but I just don’t see a strong result here.

I’ll just end with: it is of course difficult to prove a non-effect, but a single analysis failing to find an effect is not strong evidence against an effect. A single study not finding a relation doesn’t make two things unrelated. This analysis (seemingly entirely driven off one or two aggregated ANOVA tables, evidently without also trying the simple standard techniques of regression or logistic regression) does not in fact seem sensitive enough to see effects even if there were any.

]]>*Very* roughly: a Bitcoin is a cryptographic secret that is considered to have some value. Bitcoins are individual data tokens, and duplication is prevented through a distributed shared ledger (called the blockchain). As interesting as this is, we want to point out that notional value existing both in ledgers and as possessed tokens has quite a long precedent.

This helps us remember that important questions about Bitcoins (such as: are they a currency or a commodity?) will be determined by regulators, courts, and legislators. It will not be a simple inevitable consequence of some detail of implementation, as this has never been the case for other forms of value (gold, coins, bank notes, stock certificates, or bank account balances).

Value has often been recorded in combinations of ledgers and tokens, so many of these issues have been seen before (though they have never been as simple as one would hope). Historically the rules that apply to such systems are subtle, and not completely driven by whether the system primarily resides in ledgers or in portable tokens. So we shouldn’t expect determinations involving Bitcoin to be simple either.

What I would like to do with this note is point out some fun examples and end with the interesting case of Crawfurd v The Royal Bank, as brought up by “goonsack” in 2013.

Any time we visit an ATM, direct deposit our pay, or write a check we are converting money between ledgers or between tokens (bills) and ledgers. The fluidity is one of the reasons it is hard to even define terms when asking “how much money is there?” (see money supply).

Charlie Shrem is considering introducing a set of shared duplicate hand-maintained ledgers to replace the private portable currency of a prison (where he currently resides). The currency in question is largely cans of mackerel.

Canned fish (Wikipedia).

Frankly, his proposed Mackerelcoin (or MAK) ledger system seems too laborious, and would leave a long undesirable trail. It seems likely that anyone encountering a Mackerelcoin society would be very motivated to invent a currency to eliminate the ledgers and ledger recorders. Perhaps instead of transferring “one MAK” to obtain a haircut one could hand over one actual can of mackerel.

Barrel-shaped clay cylinder covered with lines of cuneiform text (Wikipedia).

An interesting item dating to one of the posited origins of notional money is the Sumerian bulla (see also Eleanor Robson and D.J. Melville, “Tokens: the origin of mathematics,” Mesopotamian Mathematics, St. Lawrence University).

Multi-stamped bulla (Wikipedia).

Essentially, small tokens or figurines were used to represent purchased or sold animals or food. To prevent fraud and tampering, all the tokens were stored in a trusted location inside a cast bulla. To avoid the trouble of having to crack open the bulla, one decorated the outside of the bulla with impressions of the tokens within. Eventually only the impressions (or stylizations of the impressions) were used, and you have writing and a public ledger.

So we have an example of something like a portable primitive currency (the tokens) being displaced by a centralized ledger around 8000 BC.

Rai stones are large punctured stone disks used as currency in Micronesia around 500AD through at least 1871AD.

Rai stone at Yap (Wikipedia).

Interestingly enough the stones do not actually need to be moved to change owner. From the Wikipedia:

While the monetary system of Yap appears to use these giant stones as tokens, in fact it relies on an oral history of ownership. Being too large to move, buying an item with these stones is as easy as saying it no longer belongs to you. As long as the transaction is recorded in the oral history, it will now be owned by the person you passed it on to—no physical movement of the stone is required.

Presentation of Yapese stone money for FSM inauguration (Wikipedia).

So we have an example of something like a public currency being converted into a shared ledger around 500AD.

In recording wealth and debts one often has to choose between ledgers (such as bank account balances) or tokens (such as coins or bills). Each has its advantages and disadvantages, and the trade-offs come out differently depending on what you are doing. Your coffee loyalty card may use simple ink stamps as tokens (as one needs low value and high convenience), while you purchase the coffee using state-supplied bills (as they are harder to forge or duplicate). At the other extreme are land deeds, which are more public records than portable tokens.

Now we ask: does your ledger or token behave more like property (where the rule of nemo dat quod non habet or recovery of stolen goods from innocent third parties applies) or a bank note (where such recovery is not assumed)?

What we have seen is that the law tends to be set to meet market needs, and not off of implementation details (such as the traceability of parties, tokens, or ledger entries). For details see Banknotes and Their Vindication in Eighteenth-Century Scotland, which describes the case of Crawfurd v The Royal Bank. Crawfurd found a bill stolen from him at the Royal Bank of Scotland (confirmed by the serial number) and sued for it to be restored to him (which would be the legal outcome if Crawfurd had found a pocket watch that had been stolen from him at the bank). The court ruled that Crawfurd could not recover the note (even after stipulating it had been stolen by a third party). Money changes color when it changes hands.

Bitcoin transactions are pseudonymous in that users can use (if they choose) arbitrary addresses for transactions, but input Bitcoins are linked to output Bitcoins. And some fraction of Bitcoins have known tainted ownership histories (despite having been mingled with other coins). So it is not immediately obvious if Bitcoins will be treated like bearer bonds or registered securities.

Notice how law (both common and statute) doesn’t decide these issues entirely on the implementation details. Coins historically are considered fungible due to their presumed identicalness and the impracticality of tracing. Bank notes (even those issued by private banks, not governments) got similar treatment despite having serial numbers, due to their importance to efficient commerce.

Bitcoin’s fate will be decided in the market, courts, and legislatures. It isn’t going to pivot on some convenient technical detail such as its similarity to private tokens or a public ledger. Bitcoin having a currently confusing legal status (US IRS treating it as property, versus a US Federal judge treating it as money) isn’t proof it is going to collapse. Most other currencies (including bank notes) went through similar travails. While I am not a fan of Bitcoin, I don’t claim it is obvious if it has a future or not.

]]>**A Simpler Explanation of Differential Privacy**: Quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in *Science* (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, “The reusable holdout: Preserving validity in adaptive data analysis”, *Science*, vol. 349, no. 6248, pp. 636-638, August 2015). Note that Cynthia Dwork is one of the inventors of differential privacy, originally used in the analysis of sensitive information.

**Using differential privacy to reuse training data**: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling.

**A simple differentially private-ish procedure**: The bootstrap as an alternative to Laplace noise to introduce privacy.

Our R code and experiments are available on Github here, so you can try some experiments and variations yourself.

]]>In this note we look at a time zone pitfall in R’s `stats::aggregate()`. Read on for our example.

For our example we create a data frame. The issue is: I am working in the Pacific time zone on Saturday October 31st 2015, and I have some time data that I want to work with that is in an Asian time zone.

`print(date())`

`## [1] "Sat Oct 31 08:14:38 2015"`

```
d <- data.frame(group='x',
                time=as.POSIXct(strptime('2006/10/01 09:00:00',
                                         format='%Y/%m/%d %H:%M:%S',
                                         tz="Etc/GMT+8"),
                                tz="Etc/GMT+8"))  # I'd like to say UTC+8 or CST
print(d)
```

```
## group time
## 1 x 2006-10-01 09:00:00
```

`print(d$time)`

`## [1] "2006-10-01 09:00:00 GMT+8"`

`str(d$time)`

`## POSIXct[1:1], format: "2006-10-01 09:00:00"`

`print(unclass(d$time))`

```
## [1] 1159722000
## attr(,"tzone")
## [1] "Etc/GMT+8"
```

Suppose I try to aggregate the data to find the earliest time for each group. I have a problem: `aggregate` loses the time zone and gives a bad answer.

```
d2 <- aggregate(time~group,data=d,FUN=min)
print(d2)
```

```
## group time
## 1 x 2006-10-01 10:00:00
```

`print(d2$time)`

`## [1] "2006-10-01 10:00:00 PDT"`

This is bad. Our time has lost its time zone and changed from `09:00:00` to `10:00:00`. This violates John M. Chambers’ “Prime Directive” that:

computations can be understood and trusted.

Software for Data Analysis, John M. Chambers, Springer 2008, page 3.

The issue is that the POSIXct time is essentially a numeric array carrying around its time zone as an attribute. Most base R code has problems if there are extra attributes on a numeric array, so R-stat code tends to have a habit of dropping attributes when it can. It is odd that the class() is kept (which is itself an attribute-style structure) while the time zone is lost, but R is full of hand-specified corner cases.
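If you must stay with `aggregate()`, one workaround (our own sketch, not a documented guarantee) is to re-attach the dropped attribute afterwards. Here we re-create the example data frame from above:

```r
# Re-create the example data, then repair aggregate()'s result by
# re-attaching the time zone attribute it dropped.
d <- data.frame(group='x',
                time=as.POSIXct('2006/10/01 09:00:00',
                                format='%Y/%m/%d %H:%M:%S',
                                tz="Etc/GMT+8"))
d2 <- aggregate(time~group, data=d, FUN=min)
attr(d2$time, "tzone") <- attr(d$time, "tzone")  # restore what was lost
print(d2$time)  # displays 09:00:00 again, in the original zone
```

The underlying epoch value was never wrong; only the display attribute was lost, so re-attaching it recovers the original rendering.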

dplyr gets the right answer.

`library('dplyr')`

```
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

```
by_group = group_by(d,group)
d3 <- summarize(by_group,min(time))
print(d3)
```

```
## Source: local data frame [1 x 2]
##
## group min(time)
## 1 x 2006-10-01 09:00:00
```

`print(d3[[2]])`

`## [1] "2006-10-01 09:00:00 GMT+8"`

And plyr also works.

`library('plyr')`

```
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
```

```
d4 <- ddply(d,.(group),summarize,time=min(time))
print(d4)
```

```
## group time
## 1 x 2006-10-01 09:00:00
```

`print(d4$time)`

`## [1] "2006-10-01 09:00:00 GMT+8"`

]]>Gartner hype cycle (Wikipedia).

Given we agree data science exists, who is allowed to call themselves a data scientist?

There is a school of thought that you cannot call yourself a data scientist unless you have mastered all of the following:

- Statistical learning theory
- High dimensional geometry
- Optimization theory
- Petabyte scale operations
- Advanced programming
- Combinatorics and algebra
- Theoretical computer science
- Measure theory
- All of statistics
- SQL
- noSQL
- Distributed System design
- …

Many of these are topics covered in works such as *Foundations of Data Science* (John Hopcroft, Ravindran Kannan) and *Mining of Massive Data Sets* (Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman).

These are topics I know, and many of these authors are personal heroes:

- John Hopcroft: One of the founders of modern design and analysis of algorithms. Coauthor of *Introduction to Automata Theory, Languages, and Computation*.
- Ravindran Kannan: My advisor! Definitely brilliant.
- Anand Rajaraman: CEO I had the honor of working for at Kosmix.com, one of the inventors of Mechanical Turk, also brilliant.
- Jeffrey David Ullman: One of the founders of modern design and analysis of algorithms. Coauthor of *Introduction to Automata Theory, Languages, and Computation*.

The theory is: only the unicorn who knows all of the above is to be allowed to call themselves a data scientist.

However, when Nina Zumel and I wrote *Practical Data Science with R* (Manning 2014) we took an opposite approach. We deliberately widened data science to:

a field that uses results from statistics, machine learning, and computer science to create predictive models.

Practical Data Science with R, “about this book”, page xix.

And here is why: outside of academia and some major labs the task of data science is essentially looking at client data and building useful predictive models.

This is good news. Statisticians know that prediction is fundamentally easier than inference (as prediction dodges many issues of causality). And most real world business clients have data at what we call “SQL scale” (fits in a nice database that can quickly run complicated SQL aggregations, not requiring a petabyte infrastructure). Clients tend to need automated decision procedures yielding high ROI (~~Radio Over the Internet~~ Return On Investment) to free up analysts for new problems.

And that brings us to the point of this essay. Because all of the analyst jobs have been re-classified as “data science” jobs, we have to allow analysts to call themselves “data scientists”.

]]>Then, for some very readable background material on SVMs I recommend section 13.4 of *Applied Predictive Modeling* and sections 9.3 and 9.4 of *Practical Data Science with R* by Nina Zumel and John Mount. You will be hard pressed to find an introduction to kernel methods and SVMs that is as clear and useful as this last reference.

For more on SVMs see the original article on the Revolution Analytics blog.

]]>Nina and I were noodling with some variations of differentially private machine learning, and think we have found a variation of a standard practice that is actually fairly efficient in establishing ~~differential privacy~~ a privacy condition (but, as commenters pointed out- not differential privacy).

Read on for the idea and a rough analysis.

A commonly discussed step in establishing differential privacy is to add some Laplace distributed noise to queries. It works (when used in conjunction with other steps), but it can seem mysterious. We think bootstrap resampling should be considered more seriously as a component in privacy preserving procedures (despite some references claiming it is not enough on its own).
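For contrast, the standard Laplace-noise count query can be sketched in a few lines of R. Base R has no Laplace sampler, so we simulate one as the difference of two exponentials; the function names here are ours, for illustration:

```r
# Laplace sample: the difference of two exponentials with mean b
# is Laplace-distributed with scale b.
rlaplace <- function(k, b) rexp(k, rate = 1/b) - rexp(k, rate = 1/b)

# Noisy "what fraction of rows match the predicate" query.
# The sensitivity of a fraction query is 1/n, giving scale 1/(n*epsilon).
noisyFraction <- function(testSet, predicate, epsilon) {
  n <- length(testSet)
  mean(predicate(testSet)) + rlaplace(1, b = 1/(n*epsilon))
}

set.seed(2015)
print(noisyFraction(c(rep(0, 99), 1), function(x) x != 0, epsilon = 0.1))
```

Each call returns the true fraction (here 0.01) plus fresh Laplace noise.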

We are going to simplify analysis and consider indistinguishability to be the good effect we are trying to establish, and variance to be the bad side-effect we are willing to put up with to achieve indistinguishability.

Throughout let `n` be the number of examples in the set you are using “as a test.” The scheme we are analyzing is allowing count queries against this test set of the form “what fraction of rows match a given predicate?” Returning exact answers to such queries is not differentially private, so some form of noise addition is used (usually adding a Laplace random variable as noise).

Consider the following alternate technique to defend the test set in a privacy scheme.

Pick a positive real number `Z < n`. Think of `Z` as taking on a value like 10.

When it comes time to compute a noisy test-set score, instead score a bootstrap re-sample of the test set of size `ceiling(n/Z)`, where `n` is the size of your actual test set. Each time a score is asked for, re-do the bootstrap. This re-sampling introduces empirical sampling noise (a standard trick), and by varying `Z` we can vary the amount of noise (similar to varying the Laplace distribution parameter in Laplace noise addition). The trick is we compute on these bootstrapped samples, but never share them (or the original set they are drawn from). We can use this for model scoring (as in stepwise regression) or for variable coding (as in effect/impact coding).
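A minimal R sketch of the bootstrap scheme just described (the function name `bootQuery` and all parameter choices are ours, for illustration):

```r
# Answer a count query from a bootstrap re-sample of size ceiling(n/Z),
# never exposing the test set itself.
bootQuery <- function(testSet, predicate, Z) {
  n <- length(testSet)
  idx <- sample.int(n, size = ceiling(n/Z), replace = TRUE)
  mean(predicate(testSet[idx]))  # fraction matching, on the re-sample
}

set.seed(2015)
testSet <- c(rep(0, 99), 1)  # n = 100, one non-zero row
# Each call re-does the bootstrap, so repeated queries see fresh noise.
print(replicate(3, bootQuery(testSet, function(x) x != 0, Z = 10)))
```

With `Z = 10` each query looks at a re-sample of only 10 rows, so the rare 1-row is usually absent from the sample.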

The question is: does this establish privacy, and does it do it while introducing only a reasonable amount of variance? It does not establish differential privacy, but it does establish a weak form of near indistinguishability (and one could also wonder about the delta-variant of differential privacy).

Recall that for epsilon differential privacy of our measurement `A` we must have, for every set `S` (chosen by an adversary) and for any data sets `D1`, `D2` that differ by only one row:

```
log(P[A(D1) in S] / P[A(D2) in S]) ≤ epsilon
```

We don’t establish that, but wonder if establishing epsilon indistinguishability of the form:

```
P[ (A(D1) in S) ≠ (A(D2) in S) ] ≤ epsilon
```

might not be enough to drive data re-use proofs (but not to protect sensitive data).

We work only one example, but the calculation gives the idea.

Suppose in a privacy proof the adversary submits two sets: one that is `n` zeros, and one that is `n-1` zeros and a single one. The query is what fraction of the rows are non-zero. We claim a bootstrap re-sample of size `ceiling(n/Z)` roughly establishes indistinguishability with `epsilon = 1/Z` and a variance of `Z/n^2` for `Z`, `epsilon`, and `n` in an appropriate range.

Consider what happens when sampling from the set with the 1-row. The number of times the unique 1-row is copied into the bootstrap sample is Poisson distributed with expected value `1/Z`. This follows from the linearity of expectation: we have a sum of `n/Z` rows in the bootstrap sample, each of which has an expected value of `1/n`.

By Markov’s inequality we know `P[count ≥ 1] ≤ E[count]`. So it follows the 1-row shows up at all in the bootstrap set with probability no more than `1/Z`. As the presence of this row is the only way to tell the sets apart, we have `epsilon` indistinguishability with `epsilon = 1/Z` (pretty much by definition). Or: with `Z = 1/epsilon` we can hope for `epsilon` indistinguishability.

Now consider the variance of the frequency estimate we are returning. Because the 1-row count is a Poisson process, we know it has variance equal to its mean. So the 1-row count is a random variable with mean `1/Z` and variance `1/Z`. Frequency is `count/setSize`, which is `count/(n/Z)`. So the frequency estimate is a random variable with mean `(1/Z)/(n/Z) = 1/n` and variance `(1/Z)/(n/Z)^2 = Z/n^2`. Substituting in `Z = 1/epsilon`, we get a variance of the frequency estimate of `1/(epsilon n^2)`.
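A quick simulation (our own sanity check, not from the references) agrees with these mean and variance claims:

```r
set.seed(2015)
n <- 100; Z <- 10
testSet <- c(rep(0, n-1), 1)  # the adversary's set with a single 1-row
m <- ceiling(n/Z)
# Repeatedly re-do the bootstrap query and look at the returned frequency.
freqs <- replicate(100000,
                   mean(sample(testSet, size = m, replace = TRUE) != 0))
print(mean(freqs))  # should be near 1/n   = 0.01
print(var(freqs))   # should be near Z/n^2 = 0.001
```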

So we should expect this re-sized bootstrap scheme with `Z = 1/epsilon` to achieve `epsilon` indistinguishability at a cost of introducing `1/(epsilon n^2)` units of variance. From reading the references this seems like a favorable privacy/variance trade (calculation here).

The reason indistinguishability is enough to drive good machine learning results (though it is not enough to establish differential privacy) is given by the following argument. Suppose we split our data into three sets: Calibration, Train, and Test. Then we perform one of the two training procedures:

- Build our effect codes on bootstrapped Calibration, train the model on Train, and score our model on Test.
- Build our effect codes on bootstrapped Train, train the model on Train, and score our model on Test.

The first method is correct with or without the bootstrap (the same data isn’t used for both variable design and training, so we don’t have the nested model problem). If the second method is indistinguishable from the first, then the second method would also be correct, showing the possibility of working without the Calibration set.
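A compressed R sketch of the two procedures (all names here — `mkData`, `impactCode`, `applyCode` — and the synthetic data are ours, for illustration; fuller effect-coding machinery lives in our vtreat library):

```r
set.seed(2015)
# Synthetic data with a high-cardinality categorical variable.
mkData <- function(n) {
  lev <- sample(paste0('lev', 1:50), n, replace = TRUE)
  data.frame(lev = lev,
             y = as.numeric(factor(lev)) %% 3 + rnorm(n),
             stringsAsFactors = FALSE)
}
# Impact code: per-level deviation of mean(y) from the grand mean.
impactCode <- function(codingData) {
  tapply(codingData$y, codingData$lev, mean) - mean(codingData$y)
}
applyCode <- function(code, lev) {
  v <- code[lev]
  v[is.na(v)] <- 0  # unseen levels fall back to the grand mean
  as.numeric(v)
}

cal <- mkData(500); train <- mkData(500); test <- mkData(500)

# Procedure 1: build codes on bootstrapped Calibration, fit on Train.
code1 <- impactCode(cal[sample.int(nrow(cal), replace = TRUE), ])
m1 <- lm(y ~ x, data = data.frame(x = applyCode(code1, train$lev),
                                  y = train$y))

# Procedure 2: build codes on a bootstrapped re-sample of Train, fit on Train.
boot <- train[sample.int(nrow(train), size = ceiling(nrow(train)/10),
                         replace = TRUE), ]
code2 <- impactCode(boot)
m2 <- lm(y ~ x, data = data.frame(x = applyCode(code2, train$lev),
                                  y = train$y))

# Both models are then scored on the disjoint Test set.
rmse <- function(m, dat, code) {
  sqrt(mean((dat$y -
             predict(m, data.frame(x = applyCode(code, dat$lev))))^2))
}
print(c(rmse(m1, test, code1), rmse(m2, test, code2)))
```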

It may be possible to replace Laplace noise methods with re-sized Bootstrap resampling *in* various indistinguishability establishing algorithms. Obviously bootstrapping is a fairly standard technique and others have noted its relation to differential privacy. Examples include:

- “A Bootstrap Mechanism for Response Masking in Remote Analysis Systems”, Krishnamurty Muralidhar, Christine M. O’Keefe, and Rathindra Sarathy, Article first published online: 14 SEP 2015, Decision Sciences DOI: 10.1111/deci.12168 [link]
- “Differential privacy based on importance weighting.”, Ji Z, Elkan C.; Machine learning. 2013;93(1):163-183. doi:10.1007/s10994-013-5396-x [link]
- “Differential Privacy in a Bayesian setting through posterior sampling”, Christos Dimitrakakis, Blaine Nelson, and Zuhe Zhang, Aikaterini Mitrokotsa, Benjamin Rubinstein, 2013 [link]

But perhaps the bootstrapping method (especially with the change of set size) still deserves to be used more often.

]]>