What stands out in these presentations is: the simple practice of a static test/train split is merely a convenience to cut down on operational complexity and difficulty of teaching. It is in no way optimal. That is, using slightly more complicated procedures can build better models on a given set of data.
Suggested static cal/train/test experiment design from vtreat data treatment library.
There is a lot of current interest in various “crypto currencies” such as Bitcoin, but that does not mean there have not been previous combined ledger and token recording systems. Others have noticed the relevance of Crawfurd v The Royal Bank (the case where money became money), and we are going to write about this yet again.
Very roughly: a Bitcoin is a cryptographic secret that is considered to have some value. Bitcoins are individual data tokens, and duplication is prevented through a distributed shared ledger (called the blockchain). As interesting as this is, we want to point out notional value existing both in ledgers and as possessed tokens has quite a long precedent.
This helps us remember that important questions about Bitcoins (such as: are they a currency or a commodity?) will be determined by regulators, courts, and legislators. It will not be a simple inevitable consequence of some detail of implementation as this has never been the case for other forms of value (gold, coins, bank notes, stocks certificates, or bank account balances).
Value has often been recorded in combinations of ledgers and tokens, so many of these issues have been seen before (though they have never been as simple as one would hope). Historically the rules that apply to such systems are subtle, and not completely driven by whether the system primarily resides in ledgers or primarily resides portable tokens. So we shouldn’t expect determinations involving Bitcoin to be simple either.
When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().
It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype-cycle, even the American Statistical Association was forced to admit that data science is actually a real thing and exists.
A bit of text we are proud to steal from our good friend Joseph Rickert:
Then, for some very readable background material on SVMs I recommend section 13.4 of Applied Predictive Modeling and sections 9.3 and 9.4 of Practical Data Science with R by Nina Zumel and John Mount. You will be hard pressed to find an introduction to kernel methods and SVMs that is as clear and useful as this last reference.
Nina and I were noodling with some variations of differentially private machine learning, and think we have found a variation of a standard practice that is actually fairly efficient in establishing differential privacy a privacy condition (but, as commenters pointed out- not differential privacy).
We will discuss a few more examples that have been in our mind, including one I am calling “baking priors.” This final example will demonstrate some of the advantages of allowing researchers to document their priors.
One of the things I like about R is: because it is not used for systems programming you can expect to install your own current version of R without interference from some system version of R that is deliberately being held back at some older version (for reasons of script compatibility). R is conveniently distributed as a single package (with automated install of additional libraries).
Want to do some data analysis? Install R, load your data, and go. You don’t expect to spend hours on system administration just to get back to your task.
Python, being a popular general purpose language does not have this advantage, but thanks to Anaconda from Continuum Analytics you can skip (or at least delegate) a lot of the system environment imposed pain. With Anaconda trying out Python packages (Jupyter, scikit-learn, pandas, numpy, sympy, cvxopt, bokeh, and more) becomes safe and pleasant. Continue reading Thumbs up for Anaconda