Categories: Computer Science, Exciting Techniques, Expository Writing, math programming, Opinion, Tutorials

An Appreciation of Locality Sensitive Hashing

We share our admiration for a set of results called “locality sensitive hashing” by demonstrating a greatly simplified example that exhibits the spirit of the techniques.

Categories: Computer Science, Computers, Opinion, Programming

“The Mythical Man Month” is still a good read

Re-read Fred Brooks’s “The Mythical Man Month” over vacation. The book remains insightful about computer science and project management.

Categories: Expository Writing, Mathematics, Opinion, Pragmatic Machine Learning, Statistics, Tutorials

Kernel Methods and Support Vector Machines de-Mystified

We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar, no matter how wonderful, is not perceived as mystical.

Categories: Opinion, Public Service Article

Increase your productivity

I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting. The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior.

Categories: Expository Writing, Statistics, Statistics To English Translation, Tutorials

The equivalence of logistic regression and maximum entropy models

Nina Zumel recently gave a very clear explanation of logistic regression (The Simpler Derivation of Logistic Regression). In particular she called out the central role of log-odds ratios and demonstrated how the “deviance” (that mysterious quantity reported by fitting packages) is both a term in “the pseudo-R^2” (so it directly measures goodness of fit) and the quantity that is actually optimized during the fitting procedure. One great point of the writeup was how simple everything is once you start thinking in terms of derivatives (and that what is special about the sigmoid is not so much its functional form as its relation to its own derivative).
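The relation between the sigmoid and its own derivative mentioned above is the standard identity s'(x) = s(x)(1 - s(x)). As a minimal sketch (the function names here are illustrative and not taken from the referenced writeup), the identity can be checked numerically against a finite-difference estimate:

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative_numeric(x: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of the sigmoid's derivative."""
    return (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)


def sigmoid_derivative_identity(x: float) -> float:
    """Derivative written in terms of the sigmoid itself: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def max_identity_gap(points) -> float:
    """Largest absolute gap between the two derivative computations."""
    return max(
        abs(sigmoid_derivative_numeric(x) - sigmoid_derivative_identity(x))
        for x in points
    )


if __name__ == "__main__":
    # The gap should be tiny (limited only by finite-difference error).
    print(max_identity_gap([-3.0, -1.0, 0.0, 0.5, 2.0]))
```

Because the derivative is expressed through the function's own value, gradient calculations for the logistic model need no extra exponentials once the predicted probability is known.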

We adapt these presentation ideas to make explicit the well known equivalence of logistic regression and maximum entropy models.

Categories: Administrativia

Win-Vector starts submitting content to r-bloggers.com

We have been consistently impressed by and enjoyed the wealth of R wisdom available on the R-bloggers aggregation site.

Therefore Win-Vector LLC is granting the right to reformat and redistribute (with attribution and link) our blog’s R content in the R-bloggers site and feeds.

We hope to see our R content shared through this network.

Categories: Computer Science, Programming, Statistics, Tutorials

Programmers Should Know R

Programmers should definitely know how to use R. I don’t mean they should switch from their current language to R, but they should think of R as a handy tool during development.

Categories: Pragmatic Machine Learning, Statistics, Tutorials

Your Data is Never the Right Shape

One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your data.

There is no final “right shape.” In fact even your data is never right. You will always be called to re-do your analysis (new variables, new data, corrections) so you should always understand you are on your “penultimate analysis” (always one more to come). This is why we insist on using general methods and scripted techniques, as these methods are much easier to reliably reapply on new data than GUI/WYSIWYG techniques.

In this article we will work a small example and call out some R tools that make reshaping your data much easier. The idea is to think in terms of “relational algebra” (like SQL) and transform your data towards your tools (and not to attempt to adapt your tools towards the data in an ad-hoc manner).

Categories: Opinion

Gerty, a character in Duncan Jones’ “Moon.”

A “for fun” piece, reposted from mzlabs.com.

I would like to comment on Duncan Jones’ movie “Moon” and compare some elements of “Moon” to earlier science fiction.

Categories: Statistics, Statistics To English Translation, Tutorials

What is a large enough random sample?

With the well deserved popularity of A/B testing, computer scientists are finally becoming practicing statisticians. One part of experiment design that has always been particularly hard to teach is how to pick the size of your sample. The two points that are hard to communicate are that:

  • The required sample size is essentially independent of the total population size.
  • The required sample size depends strongly on the strength of the effect you are trying to measure.

These things are only hard to explain because the literature is overly technical (too many buzzwords and too many irrelevant concerns) and these misapprehensions can’t be relieved unless you spend some time addressing the legitimate underlying concerns they are standing in for. As usual, explanation requires common ground (moving to shared assumptions), not mere technical bullying.

We will try to work through these assumptions and then discuss proper sample size.
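The two points in the list above can be illustrated with a small sketch. This is not necessarily the derivation the full article uses: it assumes the common textbook normal approximation for estimating a proportion, with an optional finite population correction, and it treats the margin of error as a stand-in for the strength of the effect you want to resolve. The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p=0.5, population=None):
    """Approximate sample size to estimate a proportion within +/- margin.

    Assumes the normal approximation n0 = z^2 * p * (1 - p) / margin^2.
    If a population size is given, applies the usual finite population
    correction n = n0 / (1 + (n0 - 1) / population).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n0 = (z * z) * p * (1.0 - p) / (margin * margin)
    if population is not None:
        n0 = n0 / (1.0 + (n0 - 1.0) / population)
    return math.ceil(n0)


if __name__ == "__main__":
    # Population size barely matters once the population is large.
    for population in (10**6, 10**8, None):
        print(population, sample_size_for_proportion(0.03, population=population))
    # Halving the margin (a weaker effect to resolve) roughly quadruples the sample.
    for margin in (0.06, 0.03, 0.015):
        print(margin, sample_size_for_proportion(margin))
```

Under these assumptions the required size changes very little between a population of a million and an unbounded one, while each halving of the margin multiplies the required size by about four.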