Principal Components Regression, Pt.1: The Standard Method

Posted on Categories data science, Expository Writing, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 14 Comments on Principal Components Regression, Pt.1: The Standard Method

In this note, we discuss principal components regression and some of the issues with it:

  • The need for scaling.
  • The need for pruning.
  • The lack of “y-awareness” of the standard dimensionality reduction step.

Continue reading Principal Components Regression, Pt.1: The Standard Method

For a short time: Half Off Some Manning Data Science Books

Posted on Categories Administrativia, Pragmatic Data Science, StatisticsTags , Leave a comment on For a short time: Half Off Some Manning Data Science Books

Our publisher Manning Publications is celebrating the release of a new data science in Python title Introducing Data Science by offering it and other Manning titles at half off until Wednesday, May 18.

As part of the promotion you can also use the supplied discount code mlcielenlt for half off some R titles including R in Action, Second Edition and our own Practical Data Science with R. Combine these with our half off code (C3) for our R video course Introduction to Data Science and you can get a lot of top quality data science material at a deep discount.

Coming up: principal components analysis

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , 2 Comments on Coming up: principal components analysis

Just a “heads-up.”

I’ve been editing a two-part three-part series Nina Zumel is writing on some of the pitfalls of improperly applied principal components analysis/regression and how to avoid them (we are using the plural spelling as used in following Everitt The Cambridge Dictionary of Statistics). The series is looking absolutely fantastic and I think it will really help people understand, properly use, and even teach the concepts.

The series includes fully worked graphical examples in R and is why we added the ScatterHistN plot to WVPlots (plot shown below, explained in the upcoming series).

s

Frankly the material would have worked great as an additional chapter for Practical Data Science with R (but instead everybody is going to get it for free).

Please watch here for the series.
The complete series is now up:

vtreat cross frames

Posted on Categories data science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , 2 Comments on vtreat cross frames

vtreat cross frames

John Mount, Nina Zumel

2016-05-05

As a follow on to “On Nested Models” we work R examples demonstrating “cross validated training frames” (or “cross frames”) in vtreat.

Continue reading vtreat cross frames

On Nested Models

Posted on Categories Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 1 Comment on On Nested Models

We have been recently working on and presenting on nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested, but are easy to get wrong.

Continue reading On Nested Models

Improved vtreat documentation

Posted on Categories Administrativia, Statistics, TutorialsTags , , Leave a comment on Improved vtreat documentation

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here).

Chrome Vanadium Adjustable Wrench

vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner. Continue reading Improved vtreat documentation

Free data science video lecture: debugging in R

Posted on Categories Coding, Programming, TutorialsTags , , 2 Comments on Free data science video lecture: debugging in R

We are pleased to release a new free data science video lecture: Debugging R code using R, RStudio and wrapper functions. In this 8 minute video we demonstrate the incredible power of R using wrapper functions to catch errors for later reproduction and debugging. If you haven’t tried these techniques this will really improve your debugging game.



All code and examples can be found here and in WVPlots. Continue reading Free data science video lecture: debugging in R

Half off Win-Vector data science books and video training!

Posted on Categories Administrativia, Practical Data Science, StatisticsTags , , 1 Comment on Half off Win-Vector data science books and video training!

We are pleased to announce our book Practical Data Science with R (Nina Zumel, John Mount, Manning 2014) is part of Manning’s “Deal of the Day” of April 9th 2016. This one day only offer gets you half off for physical book (with free e-copy) or paid e-copy (e-copy simultaneous pdf + ePub + kindle, and DRM free!).

Here is the discount count in Tweetable form (please Tweet/share!):

Deal of the Day April 9: Half off my book Practical Data Science with R. Use code dotd040916au at https://www.manning.com/books/practical-data-science-with-r

In celebration of this we are offering our video instruction course Introduction to Data Science (Nina Zumel, John Mount 2015) is also half off with “code C3” (https://www.udemy.com/introduction-to-data-science/?couponCode=C3).

A bit on the F1 score floor

Posted on Categories Mathematics, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , Leave a comment on A bit on the F1 score floor

At Strata+Hadoop World “R Day” Tutorial, Tuesday, March 29 2016, San Jose, California we spent some time on classifier measures derived from the so-called “confusion matrix.”

We repeated our usual admonition to not use “accuracy itself” as a project quality goal (business people tend to ask for it as it is the word they are most familiar with, but it usually isn’t what they really want).


NewImage
One reason not to use accuracy: an example where a classifier that does nothing is “more accurate” than one that actually has some utility. (Figure credit Nina Zumel, slides here)

And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more).

Please read on to see what stood out. Continue reading A bit on the F1 score floor

WVPlots: example plots in R using ggplot2

Posted on Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , 7 Comments on WVPlots: example plots in R using ggplot2

Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. The idea is: we sacrifice some of the flexibility and composability inherent to ggplot2 in R for a menu of prescribed presentation solutions (which we are sharing on Github).

For example the plot below showing both an observed discrete empirical distribution (as stems) and a matching theoretical distribution (as bars) is a built in “one liner.”

NewImage

Please read on for some of the ideas and how to use this package. Continue reading WVPlots: example plots in R using ggplot2