Did she know we were writing a book?

Posted on Categories Administrativia, Expository Writing, Opinion, Practical Data Science, StatisticsTags , , Leave a comment on Did she know we were writing a book?

Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge.

Nina Zumel and I definitely troubled over possibilities for some time before deciding to write Practical Data Science with R, Nina Zumel, John Mount, Manning 2014.

600 387630642

In the end we worked very hard to organize and share a lot of good material in what we feel is a very readable manner. But I think the first-author may have been signaling and preparing a bit earlier than I was aware we were writing a book. Please read on to see some of her prefiguring work. Continue reading Did she know we were writing a book?

The R community is awesome (and fast)

Posted on Categories Administrativia, StatisticsTags 2 Comments on The R community is awesome (and fast)

Recently I whined/whinged or generally complained about a few sharp edges in some powerful R systems.

In each case I was treated very politely, listened to, and actually got fixes back in a very short timeframe from volunteers. That is really great and probably one of the many reasons R is a great ecosystem.

Please read on for my list of n=3 interactions. Continue reading The R community is awesome (and fast)

The Win-Vector parallel computing in R series

Posted on Categories Administrativia, Programming, Statistics, TutorialsTags , Leave a comment on The Win-Vector parallel computing in R series

With our recent publication of “Can you nest parallel operations in R?” we now have a nice series of “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details.

For your convenience here they are in order:

  1. A gentle introduction to parallel computing in R
  2. Running R jobs quickly on many machines
  3. Can you nest parallel operations in R?

Please check it out, and please do Tweet/share these tutorials.

On accuracy

Posted on Categories Administrativia, Statistics, Statistics To English Translation, TutorialsTags , Leave a comment on On accuracy

In our last article on the algebra of classifier measures we encouraged readers to work through Nina Zumel’s original “Statistics to English Translation” series. This series has become slightly harder to find as we have use the original category designation “statistics to English translation” for additional work.

To make things easier here are links to the original three articles which work through scores, significance, and includes a glossery.

A lot of what Nina is presenting can be summed up in the diagram below (also by her). If in the diagram the first row is truth (say red disks are infected) which classifier is the better initial screen for infection? Should you prefer the model 1 80% accurate row or the model 2 70% accurate row? This example helps break dependence on “accuracy as the only true measure” and promote discussion of additional measures.


NewImage

vtreat version 0.5.26 released on CRAN

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , 1 Comment on vtreat version 0.5.26 released on CRAN

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.26 has been released on CRAN.

‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

(from the package documentation)

‘vtreat’ is an R package that incorporates a number of transforms and simulated out of sample (cross-frame simulation) procedures that can:

  • Decrease the amount of hand-work needed to prepare data for predictive modeling.
  • Improve actual model performance on new out of sample or application data.
  • Lower your procedure documentation burden (through ready vtreat documentation and tutorials).
  • Increase model reliability (by re-coding unexpected situations).
  • Increase model expressiveness (by allowing use of more variable types, especially large cardinality categorical variables).

‘vtreat’ can be used to prepare data for either regression or classification.

Please read on for what ‘vtreat’ does and what is new. Continue reading vtreat version 0.5.26 released on CRAN

Why you should read Nina Zumel’s 3 part series on principal components analysis and regression

Posted on Categories Administrativia, Exciting Techniques, Expository Writing, Statistics, TutorialsTags , , ,

Short form:

Win-Vector LLC’s Dr. Nina Zumel has a three part series on Principal Components Regression that we think is well worth your time.

  • Part 1: the proper preparation of data (including scaling) and use of principal components analysis (particularly for supervised learning or regression).
  • Part 2: the introduction of y-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.
  • Part 3: how to pick the number of components to retain for analysis.

Continue reading Why you should read Nina Zumel’s 3 part series on principal components analysis and regression

Installing WVPlots and “knitting R markdown”

Posted on Categories Administrativia, TutorialsTags

Some readers have been having a bit of trouble using devtools to install WVPlots (announced here and used to produce some of the graphs shown here). I thought I would write a note with a few instructions to help.

These are things you should not have to do often, and things those of us already running R have stumbled through and forgotten about. These are also the kind of finicky system dependent non-repeatable interactive GUI steps you largely avoid once you have a scriptable system like fully R up and running. Continue reading Installing WVPlots and “knitting R markdown”

For a short time: Half Off Some Manning Data Science Books

Posted on Categories Administrativia, Pragmatic Data Science, StatisticsTags ,

Our publisher Manning Publications is celebrating the release of a new data science in Python title Introducing Data Science by offering it and other Manning titles at half off until Wednesday, May 18.

As part of the promotion you can also use the supplied discount code mlcielenlt for half off some R titles including R in Action, Second Edition and our own Practical Data Science with R. Combine these with our half off code (C3) for our R video course Introduction to Data Science and you can get a lot of top quality data science material at a deep discount.

Coming up: principal components analysis

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , 2 Comments on Coming up: principal components analysis

Just a “heads-up.”

I’ve been editing a two-part three-part series Nina Zumel is writing on some of the pitfalls of improperly applied principal components analysis/regression and how to avoid them (we are using the plural spelling as used in following Everitt The Cambridge Dictionary of Statistics). The series is looking absolutely fantastic and I think it will really help people understand, properly use, and even teach the concepts.

The series includes fully worked graphical examples in R and is why we added the ScatterHistN plot to WVPlots (plot shown below, explained in the upcoming series).

s

Frankly the material would have worked great as an additional chapter for Practical Data Science with R (but instead everybody is going to get it for free).

Please watch here for the series.
The complete series is now up:

Improved vtreat documentation

Posted on Categories Administrativia, Statistics, TutorialsTags , ,

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here).

Chrome Vanadium Adjustable Wrench

vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner. Continue reading Improved vtreat documentation