We share our admiration for a set of results called “locality sensitive hashing” by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Continue reading An Appreciation of Locality Sensitive Hashing
I re-read Fred Brooks’ “The Mythical Man-Month” over vacation. The book remains insightful about computer science and project management. Continue reading “The Mythical Man Month” is still a good read
We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies: the familiar, no matter how wonderful, is not perceived as mystical. Continue reading Kernel Methods and Support Vector Machines de-Mystified
I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting. The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior. Continue reading Increase your productivity
Nina Zumel recently gave a very clear explanation of logistic regression ( The Simpler Derivation of Logistic Regression ). In particular she called out the central role of log-odds ratios and demonstrated how the “deviance” (that mysterious quantity reported by fitting packages) is both a term in the “pseudo-R^2” (so it directly measures goodness of fit) and the quantity that is actually optimized during the fitting procedure. One great point of the writeup was how simple everything becomes once you start thinking in terms of derivatives (it isn’t so much the functional form of the sigmoid that is special as its relation to its own derivative).
We adapt these presentation ideas to make explicit the well known equivalence of logistic regression and maximum entropy models. Continue reading The equivalence of logistic regression and maximum entropy models
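The two facts the excerpt leans on — that the sigmoid is special because of its relation to its own derivative, and that the deviance feeds directly into the pseudo-R^2 — can be checked numerically. A minimal sketch (the labels and fitted probabilities below are invented for illustration, not taken from either post):

```python
import math

def sigmoid(x):
    # logistic sigmoid: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

# The key identity: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
# Check it against a numeric finite-difference derivative.
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x) * (1 - sigmoid(x))
assert abs(numeric - analytic) < 1e-9

# Deviance of fitted probabilities p for labels y:
# -2 times the log-likelihood of the observed labels.
def deviance(y, p):
    return -2.0 * sum(
        yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
        for yi, pi in zip(y, p)
    )

y = [1, 0, 1, 1]                      # invented outcomes
p = [0.8, 0.3, 0.6, 0.9]              # invented fitted probabilities
p_null = [sum(y) / len(y)] * len(y)   # null model: overall rate for everyone

# Pseudo-R^2: fraction of the null deviance explained by the model.
pseudo_r2 = 1.0 - deviance(y, p) / deviance(y, p_null)
```

Smaller deviance means a better fit, so a pseudo-R^2 near 1 means the model explains most of the null deviance.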
We hope to see our R content shared through this network.
One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your data.
There is no final “right shape.” In fact, even your data is never final. You will always be called on to re-do your analysis (new variables, new data, corrections), so you should always understand you are working on your “penultimate analysis” (there is always one more to come). This is why we insist on using general methods and scripted techniques, as these methods are much easier to reliably reapply to new data than GUI/WYSIWYG techniques.
In this article we will work a small example and call out some R tools that make reshaping your data much easier. The idea is to think in terms of “relational algebra” (like SQL) and transform your data towards your tools (and not to attempt to adapt your tools towards the data in an ad-hoc manner). Continue reading Your Data is Never the Right Shape
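The article’s examples are in R; purely as a language-neutral illustration of the same relational idea (melt a wide table into a long one, then pivot it back), here is a minimal sketch using Python’s pandas, with an invented toy table:

```python
import pandas as pd

# A small "wide" table: one row per student, one column per test
# (names and scores are made up for illustration).
wide = pd.DataFrame({
    "student": ["Ann", "Bob"],
    "test1": [90, 70],
    "test2": [85, 80],
})

# "Melt" into the long, relational form: one row per (student, test) fact.
long = wide.melt(id_vars="student", var_name="test", value_name="score")

# Pivot back to wide: the long form is the flexible hub
# through which you transform your data toward your tools.
wide_again = long.pivot(index="student", columns="test",
                        values="score").reset_index()
```

The long form is the one that behaves like a relational table: each row is a single fact, so joins, aggregations, and filters all apply uniformly.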
A “for fun” piece, reposted from mzlabs.com.
I would like to comment on Duncan Jones’ movie “Moon” and compare some elements of “Moon” to earlier science fiction. Continue reading Gerty, a character in Duncan Jones’ “Moon.”
With the well-deserved popularity of A/B testing, computer scientists are finally becoming practicing statisticians. One part of experiment design that has always been particularly hard to teach is how to pick the size of your sample. The two points that are hard to communicate are:
- The required sample size is essentially independent of the total population size.
- The required sample size depends strongly on the strength of the effect you are trying to measure.
These points are hard to explain largely because the literature is overly technical (too many buzzwords and too many irrelevant concerns), and the misapprehensions can’t be relieved unless you spend some time addressing the legitimate underlying concerns they stand in for. As usual, explanation requires common ground (moving to shared assumptions), not mere technical bullying.
We will try to work through these assumptions and then discuss proper sample size. Continue reading What is a large enough random sample?
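Both bullet points can be made concrete with the standard normal-approximation sample-size formula for comparing two proportions (two-sided alpha = 0.05, power = 0.80; this is a textbook approximation offered for illustration, not necessarily the calculation the article develops):

```python
import math

def sample_size_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate per-group sample size to distinguish rates p1 and p2,
    under the usual normal approximation. The defaults are the standard
    critical points for two-sided alpha = 0.05 and power = 0.80."""
    effect = abs(p1 - p2)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# A strong effect (5% vs 10% conversion) needs far fewer samples
# than a weak one (5% vs 6%). Note the total population size
# never appears anywhere in the formula.
n_strong = sample_size_per_group(0.05, 0.10)
n_weak = sample_size_per_group(0.05, 0.06)
```

Shrinking the effect from five percentage points to one drives the required sample size up by more than an order of magnitude, while the size of the underlying population is irrelevant — exactly the two points above.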