How sure are you that large margin implies low VC dimension (and good generalization error)? It is true. But even if you have taken a good course on machine learning you many have seen the actual proof (with all of the caveats and conditions). I worked through the literature proofs over the holiday and it took a lot of notes to track what is really going on in the derivation of the support vector machine.
Figure: the standard SVM margin diagram, this time with some un-marked data added.
I spend a lot of my time writing and teaching about the proper use and consequences of choosing different machine learning techniques in data science projects. Some of the experience comes from working with our clients (you don’t need a theory to tell you random forest can in fact overfit after you see it actually do so on client data, though it does pay to follow-up on such things). Studying implementation details is in fact useful, but it is only one source of insight. It is also an already over-represented teaching choice, and isn’t always the best first exposure for all students.
That being said, my background is as a “hacking theorist.” I do toy with experimental side implementations (some public examples here and here) and even more I like pushing some math around to find the edges of what is possible (see here).
Along these lines over the holiday I decided to re-study support vector machines, from primary and secondary sources. I wanted to see what was originally claimed, what the original proof ideas were, and try and see what was left open. What I found is the proof chains are bit longer than I had hoped and I feel we should really thank researchers who took the trouble to re-specialize re-write all of the proofs into a linear sequence of arguments (instead of merely citing). In particular I came to re-appreciate an item already in my library: Cristianini, N. and Shawe-Taylor, J., An Introduction to Support Vector Machines, Cambridge, 2000.
That being said: here are my new notes on the original proofs that large margin establishes low VC dimension (which in turn establishes good generalization error). To my mind there are a few twists and surprises that will have (necessarily) been smoothed over in any first course on support vector machines.