Posted on Categories Administrativia, Programming, StatisticsTags , , , , 5 Comments on Announcing the wrapr packge for R

Announcing the wrapr packge for R

Recently Dirk Eddelbuettel pointed out that our R function debugging wrappers would be more convenient if they were available in a low-dependency micro package dedicated to little else. Dirk is a very smart person, and like most R users we are deeply in his debt; so we (Nina Zumel and myself) listened and immediately moved the wrappers into a new micro-package: wrapr.


WrapperImage: Friedensreich Hundertwasser
Continue reading Announcing the wrapr packge for R

Posted on Categories Administrativia, StatisticsTags , , , , , , 1 Comment on My recent BARUG talk: Parametric Programming in R with replyr

My recent BARUG talk: Parametric Programming in R with replyr

I want to share an edited screencast of my rehearsal for my recent San Francisco Bay Area R Users Group talk:



Posted on Categories Administrativia, StatisticsTags , , , , Leave a comment on Going to Strata / Hadoop World 2017 San Jose?

Going to Strata / Hadoop World 2017 San Jose?

Are you attending or considering attending Strata / Hadoop World 2017 San Jose? Are you interested in learning to use R to work with Spark and h2o? Then please consider signing up for my 3 1/2 hour workshop soon. We are about half full now, but I really want to fill the room, while making sure that people who really want to go get in.

Win-Vector LLC is partnering with RStudio to produce and present some awesome material that will allow you to perform data science at scale using R to control Spark and even h2o.

The links to the event are below. To make sure you get to participate please sign up soon!

  • Modeling big data with R, sparklyr, and Apache Spark (by RStudio and Win-Vector LLC)

    03/14/2017 1:30pm – 5:00pm PDT (210 minutes)

    Strata & Hadoop World West, San Jose Convention Center, CA; Room: LL21 C/D

    link

    Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling. In depth training to prepare you to use R, Spark, sparklyr, h2o, and rsparkling.

    This is going to be hands-on exercises with R, sparklyr, and h2o using RStudio Server Pro (generously provided by RStudio!).

    Sponsored by RStudio and
    Win-Vector LLC.

  • Office Hour with John Mount (Win-Vector LLC)

    03/15/2017 2:40pm – 3:20pm PDT (40 minutes)

    Strata & Hadoop World West, San Jose Convention Center, CA; Room: Table B

    link

    Come and ask me questions about data science, machine learning, R, statistics, or whatever you like.

Posted on Categories Administrativia, StatisticsTags , , , , , , , Leave a comment on Upcoming Win-Vector LLC public speaking engagements

Upcoming Win-Vector LLC public speaking engagements

I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements.

Hope to see you there!

Posted on Categories AdministrativiaTags , , 2 Comments on A bit on the formatting of code on this site and HTML/RSS

A bit on the formatting of code on this site and HTML/RSS

I am running into what looks like a WordPress bug involving formatting of code blocks. I think this is mostly affecting our RSS subscribers. They have been seeing posts rendered almost entirely in ugly fixed-font, the font error typically starting after the first substantial code-block in an article.

I’d like to apologize for any trouble this may be causing. I am looking into it, but I don’t currently have a solution. A work-around would be to not attempt to put pre-rendered code blocks into code font, but I would rather wait on a fix. I do have a diagnosis (it is likely a WordPress issue, and not user error, editor weirdness, or an RSS fault). (edit: please see the comments below for the solution, I was wrong to nest pre inside code- but I still think the WordPress transformations that made things much worse and are in fact a bug.) If you are interested in the details (or can help) please read on. Continue reading A bit on the formatting of code on this site and HTML/RSS

Posted on Categories Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags

Teaching Practical Data Science with R

Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of.

I have written before how I think this book stands out and why you should consider studying from it.

600 387630642

Please read on for a some additional comments on the intent of different sections of the book. Continue reading Teaching Practical Data Science with R

Posted on Categories Administrativia, data science, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , 2 Comments on You should re-encode high cardinality categorical variables

You should re-encode high cardinality categorical variables

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes.

In a sort of “burying the lede” way I feel we may not have sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.

NewImage

In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, 1 being perfect and below zero worse than using an average). The training pane represents performance on the training data (perfect, but over-fit) and the test pane represents performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.

Please read on for how to fix this. Continue reading You should re-encode high cardinality categorical variables

Posted on Categories Administrativia, data science, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , 1 Comment on Data science for executives and managers

Data science for executives and managers

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made of practitioners (who we hope are already planning to attend), so we are asking you our technical readers to help promote this talk to a broader audience of executives and managers.

Our messages is: if you have to manage data science projects, you need to know how to evaluate results.

In these talks we will lay out how data science results should be examined and evaluated. If you can’t make ODSC (or do attend and like what you see), please reach out to us and we can arrange to present an appropriate targeted summarized version to your executive team. Continue reading Data science for executives and managers

Posted on Categories Administrativia, data science, Statistics, TutorialsTags , 3 Comments on Upcoming Talks

Upcoming Talks

I (Nina Zumel) will be speaking at the Women who Code Silicon Valley meetup on Thursday, October 27.

The talk is called Improving Prediction using Nested Models and Simulated Out-of-Sample Data.

In this talk I will discuss nested predictive models. These are models that predict an outcome or dependent variable (called y) using additional submodels that have also been built with knowledge of y. Practical applications of nested models include “the wisdom of crowds”, prediction markets, variable re-encoding, ensemble learning, stacked learning, and superlearners.

Nested models can improve prediction performance relative to single models, but they introduce a number of undesirable biases and operational issues, and when they are improperly used, are statistically unsound. However modern practitioners have made effective, correct use of these techniques. In my talk I will give concrete examples of nested models, how they can fail, and how to fix failures. The solutions we will discuss include advanced data partitioning, simulated out-of-sample data, and ideas from differential privacy. The theme of the talk is that with proper techniques, these powerful methods can be safely used.

John Mount and I will also be giving a workshop called A Unified View of Model Evaluation at ODSC West 2016 on November 4 (the premium workshop sessions), and November 5 (the general workshop sessions).

We will present a unified framework for predictive model construction and evaluation. Using this perspective we will work through crucial issues from classical statistical methodology, large data treatment, variable selection, ensemble methods, and all the way through stacking/super-learning. We will present R code demonstrating principled techniques for preparing data, scoring models, estimating model reliability, and producing decisive visualizations. In this workshop we will share example data, methods, graphics, and code.

I’m looking forward to these talks, and I hope some of you will be able to attend.

Posted on Categories Administrativia, Expository Writing, Opinion, Practical Data Science, StatisticsTags , ,

Did she know we were writing a book?

Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge.

Nina Zumel and I definitely troubled over possibilities for some time before deciding to write Practical Data Science with R, Nina Zumel, John Mount, Manning 2014.

600 387630642

In the end we worked very hard to organize and share a lot of good material in what we feel is a very readable manner. But I think the first-author may have been signaling and preparing a bit earlier than I was aware we were writing a book. Please read on to see some of her prefiguring work. Continue reading Did she know we were writing a book?