It looks like some of the new in-line display behavior is back-ported to R Markdown and some of the difference is the delayed running and different level of interactivity in the HTML document. This makes it a bit hard to call out which RStudio’s improvements are “R notebooks” versus “R markdown”, but it means there is a lot of new functionality available. I’ve updated the video to reflect the subtlty (unfortunately on YouTube that means a new URL as you can’t replace videos).
And some just in case decelerations/clarifications/reminders: this video is not from RStudio (the company), and Rstudio client (the software) is a user interface that is separate from the R analysis system itself.
Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made of practitioners (who we hope are already planning to attend), so we are asking you our technical readers to help promote this talk to a broader audience of executives and managers.
Our messages is: if you have to manage data science projects, you need to know how to evaluate results.
In these talks we will lay out how data science results should be examined and evaluated. If you can’t make ODSC (or do attend and like what you see), please reach out to us and we can arrange to present an appropriate targeted summarized version to your executive team. Continue reading Data science for executives and managers
When we teach “R for statistics” to groups of scientists (who tend to be quite well informed in statistics, and just need a bit of help with R) we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms of a larger and more detailed version of the following list:
To test the quality of a numeric model to numeric outcome: F-test (as in linear regression).
To test the quality of a numeric model to a categorical outcome: χ2 or “Chi-squared” test (as in logistic regression).
To test the association of a categorical predictor to a categorical outcome: many tests including Fisher’s exact test and Barnard’s test.
To test the quality of a categorical predictor to a numeric outcome: t-Test, ANOVA, and Tukey’s “honest significant difference” test.
The above tests are all in terms of checking model results, so we don’t allow re-scaling of the predictor as part of the test (as we would have in a Pearson correlation test, or an area under the curve test). There are, of course, many alternatives such as Wald’s test- but we try to start with a set of tests that are standard, well known, and well reported by R. An odd exception has always been the χ2 test, which we will write a bit about in this note. Continue reading Adding polished significance summaries to papers using R
Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing effects of relative error distributions (so it isn’t an exotic idea you bring to analysis, it is a likely fact about the world that comes at you). The article is a good example of how to plot and reason about such situations.
Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge.
In the end we worked very hard to organize and share a lot of good material in what we feel is a very readable manner. But I think the first-author may have been signaling and preparing a bit earlier than I was aware we were writing a book. Please read on to see some of her prefiguring work. Continue reading Did she know we were writing a book?
Suppose we have the task of predicting an outcome y given a number of variables v1,..,vk. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve explain-ability, even reduce over-fit, and improve the quality of the resulting model.
For some informative discussion on such issues please see the following:
In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.
Recently I whined/whinged or generally complained about a few sharp edges in some powerful R systems.
In each case I was treated very politely, listened to, and actually got fixes back in a very short timeframe from volunteers. That is really great and probably one of the many reasons R is a great ecosystem.
Very roughly vtreat accepts an arbitrary “from the wild” data frame (with different column types, NAs, NaNs and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames (all independent variables numeric free of NA, NaNs, infinities, and so on) ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations such as random forest, and also bring in a danger of statistical over-fitting) and leaves the analyst more time to incorporate domain specific data preparation (as vtreat tries to handle as much of the common stuff as practical). For more of an overall description please see here.
We suggest any users please update (and you will want to re-run any “design” steps instead of mixing “design” and “prepare” from two different versions of vtreat).