A lot of people consider the static typing found in languages such as C, C++, ML, Java and Scala as needless hairshirtism. They consider the dynamic typing of languages like Lisp, Scheme, Perl, Ruby and Python as a critical advantage (ignoring other features of these languages and other efforts at generic programming such as the STL).
I strongly disagree. I find the pain of having to type or read through extra declarations is small (especially if you know how to copy-paste or use a modern IDE). And certainly much smaller than the pain of the dynamic language driven anti-patterns of: lurking bugs, harder debugging and more difficult maintenance. Debugging is one of the most expensive steps in software development- so you want incur less of it (even if it is at the expense of more typing). To be sure, there is significant cost associated with static typing (I confess: I had to read the book and post a question on Stack Overflow to design the type interfaces in Automatic Differentiation with Scala; but this is up-front design effort that has ongoing benefits, not hidden debugging debt).
There is, of course, no prior reason anybody should immediately care if I do or do not like dynamic typing. What I mean by saying this is I have some experience and observations about problems with dynamic typing that I feel can help others.
I will point out a couple of example bugs that just keep giving. Maybe you think you are too careful to ever make one of these mistakes, but somebody in your group surely will. And a type checking compiler finding a possible bug early is the cheapest way to deal with a bug (and static types themselves are only a stepping stone for even deeper static code analysis). Continue reading Why I don’t like Dynamic Typing
It is a subtle point that statistical modeling is different than model based science. However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded). It is often claimed that model based science is being done when in fact statistics is what is being done (for instance some of the unfortunate distractions of flawed reports related to the important question of the magnitude of plausible anthropogenic global warming).
Both model based science and statistics are wonderful fields, but it is important to not receive the results of one when you have paid for the other.
We will pointedly discuss one of the differences. Continue reading Why you can not to use statistics to dispute magic
The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.
— William Cleveland, The Elements of Graphing Data, Chapter 2
In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.
I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.
Continue reading My Favorite Graphs
We share our admiration for a set of results called “locality sensitive hashing” by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Continue reading An Appreciation of Locality Sensitive Hashing
Re-read Fred Brooks “The Mythical Man Month” over vacation. Book remains insightful about computer science and project management. Continue reading “The Mythical Man Month” is still a good read
We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Continue reading Kernel Methods and Support Vector Machines de-Mystified
I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting. The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior. Continue reading Increase your productivity
Research surveys tend to fall on either end of the spectrum: either they are so high level and cursory in their treatment that they are useful only as a dictionary of terms in the field, or they are so deep and terse that the discussion can only be followed by those already experienced in the field. Ensemble Methods in Data Mining (Seni and Elder, 2010) strikes a good balance between these extremes. This book is an accessible introduction to the theory and practice of ensemble methods in machine learning, with sufficient detail for a novice to begin experimenting right away, and copious references for researchers interested in further details of algorithms and proofs. The treatment focuses on the use of decision trees as base learners (as they are the most common choice), but the principles discussed are applicable with any modeling algorithm. The authors also provide a nice discussion of cross-validation and of the more common regularization techniques.
The heart of the text is the chapter on the Importance Sampling. The authors frame the classic ensemble methods (bagging, boosting, and random forests) as special cases of the Importance Sampling methodology. This not only clarifies the explanations of each approach, but also provides a principled basis for finding improvements to the original algorithms. They have one of the clearest explanations of AdaBoost that I’ve ever read.
A major shortcoming of ensemble methods is the loss of interpretability, when compared to single-model methods such as Decision Trees or Linear Regression. The penultimate chapter is on “Rule Ensembles”: an attempt at a more interpretable ensemble learner. They also discuss measures for variable importance and interaction strength. The last chapter discusses Generalized Degrees of Freedom as an alternative complexity measure and its relationship to potential over-fit.
Overall, I found the book clear and concise, with good attention to practical details. I appreciated the snippets of R code and the references to relevant R packages. One minor nitpick: this book has also been published digitally, presumably with color figures. Because the print version is grayscale, some of the color-coded graphs are now illegible. Usually the major points of the figure are clear from the context in the text; still, the color to grayscale conversion is something for future authors in this series to keep in mind.
A “for fun” piece, reposted from mzlabs.com.
I would like to comment on Duncan Jones’ movie “Moon” and compare some elements of “Moon” to earlier science fiction. Continue reading Gerty, a character in Duncan Jones’ “Moon.”
Stop and think: which of our tools are making us smarter and which of our tools are making us dumber. In my opinion tools and habits that support complexity literally train us to be dumber. Continue reading Do your tools support production or complexity?