Categories: Administrativia, Statistics, Tutorials

New screencast: using R and RStudio to install and experiment with Apache Spark

I have a new short screencast up: using R and RStudio to install and experiment with Apache Spark.

More material from my recent Strata workshop Modeling big data with R, sparklyr, and Apache Spark can be found here.

Categories: Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning

Practical Data Science with R errata update: Java SQLScrewdriver replaced by R procedures and article

We have updated the errata for Practical Data Science with R to reflect that it is no longer worth the effort to use the Java version of SQLScrewdriver as described.


We are very sorry for any confusion, trouble, or wasted effort that bringing in Java software (something we are very familiar with, but forget that not everybody uses) has caused readers. Also, database adapters for R have greatly improved, so we feel more confident depending on them alone. Practical Data Science with R remains an excellent book and a good resource to learn from, one that we are very proud of and fully support (hence the errata).

Categories: Administrativia, Statistics

Some Win-Vector R packages

This post concludes our mini-series of Win-Vector open source R packages. We end with WVPlots, a collection of ready-made ggplot2 plots we find handy.


Please read on for a list of some of the Win-Vector LLC open-source R packages that we are pleased to share.

Categories: Administrativia, Programming, Statistics

Announcing the wrapr package for R

Recently Dirk Eddelbuettel pointed out that our R function debugging wrappers would be more convenient if they were available in a low-dependency micro package dedicated to little else. Dirk is a very smart person, and like most R users we are deeply in his debt; so we (Nina Zumel and myself) listened and immediately moved the wrappers into a new micro-package: wrapr.


Image: Friedensreich Hundertwasser

Categories: Administrativia, Statistics

My recent BARUG talk: Parametric Programming in R with replyr

I want to share an edited screencast of my rehearsal for my recent San Francisco Bay Area R Users Group talk:



Categories: Administrativia, Statistics

Going to Strata / Hadoop World 2017 San Jose?

Are you attending or considering attending Strata / Hadoop World 2017 San Jose? Are you interested in learning to use R to work with Spark and h2o? Then please consider signing up for my 3 1/2 hour workshop soon. We are about half full now, but I really want to fill the room, while making sure that people who really want to go get in.

Win-Vector LLC is partnering with RStudio to produce and present some awesome material that will allow you to perform data science at scale using R to control Spark and even h2o.

The links to the event are below. To make sure you get to participate please sign up soon!

  • Modeling big data with R, sparklyr, and Apache Spark (by RStudio and Win-Vector LLC)

    03/14/2017 1:30pm – 5:00pm PDT (210 minutes)

    Strata & Hadoop World West, San Jose Convention Center, CA; Room: LL21 C/D

    link, materials (including slides)

    Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling. In-depth training to prepare you to use R, Spark, sparklyr, h2o, and rsparkling.

    This is going to be hands-on exercises with R, sparklyr, and h2o using RStudio Server Pro (generously provided by RStudio!).

    Sponsored by RStudio and Win-Vector LLC.

  • Office Hour with John Mount (Win-Vector LLC)

    03/15/2017 2:40pm – 3:20pm PDT (40 minutes)

    Strata & Hadoop World West, San Jose Convention Center, CA; Room: Table B

    link

    Come and ask me questions about data science, machine learning, R, statistics, or whatever you like.

Categories: Administrativia, Statistics

Upcoming Win-Vector LLC public speaking engagements

I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements.

Hope to see you there!

Categories: Administrativia

A bit on the formatting of code on this site and HTML/RSS

I am running into what looks like a WordPress bug involving formatting of code blocks. I think this is mostly affecting our RSS subscribers. They have been seeing posts rendered almost entirely in ugly fixed-font, the font error typically starting after the first substantial code-block in an article.

I’d like to apologize for any trouble this may be causing. I am looking into it, but I don’t currently have a solution. A work-around would be to not attempt to put pre-rendered code blocks into code font, but I would rather wait on a fix. I do have a diagnosis (it is likely a WordPress issue, and not user error, editor weirdness, or an RSS fault). (Edit: please see the comments below for the solution. I was wrong to nest pre inside code, but I still think the WordPress transformations made things much worse and are in fact a bug.) If you are interested in the details (or can help) please read on.

Categories: Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

Teaching Practical Data Science with R

Practical Data Science with R (Zumel, Mount; Manning 2014) is a book Nina Zumel and I are very proud of.

I have written before how I think this book stands out and why you should consider studying from it.


Please read on for some additional comments on the intent of different sections of the book.

Categories: Administrativia, data science, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

You should re-encode high cardinality categorical variables

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes.

In a sort of “burying the lede” way I feel we may not have sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.

[Graph: model performance (pseudo R-squared) on training data and on held-out test data]

In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, 1 being perfect and below zero worse than using an average). The training pane represents performance on the training data (perfect, but over-fit) and the test pane represents performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.
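To make the pseudo R-squared scale concrete, here is a minimal Python sketch. It is not the R code behind the graph and does not reproduce it: the function names, the 1 - SSE/SST definition of pseudo R-squared, and the synthetic setup (1000 levels, two training rows per level, an outcome that is pure noise) are all assumptions chosen for illustration. The "model" simply memorizes the mean outcome seen for each level, which stands in for throwing a high cardinality variable into a model without any pre-processing.

```python
import random


def pseudo_r2(y_true, y_pred):
    """1 - SSE/SST: 1 is perfect, 0 ties the mean, below 0 is worse than the mean."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - sse / sst


def fit_level_means(levels, y):
    """Naive per-level model: memorize the mean outcome observed for each level."""
    sums, counts = {}, {}
    for lev, val in zip(levels, y):
        sums[lev] = sums.get(lev, 0.0) + val
        counts[lev] = counts.get(lev, 0) + 1
    return {lev: sums[lev] / counts[lev] for lev in sums}


def predict_level_means(model, levels, default):
    """Look up each level's memorized mean; fall back to `default` if unseen."""
    return [model.get(lev, default) for lev in levels]


def simulate(n_levels=1000, rows_per_level=2, seed=2017):
    """Fit on noise and score on fresh noise that shares the same levels."""
    rng = random.Random(seed)
    levels = [f"id_{i % n_levels}" for i in range(n_levels * rows_per_level)]
    # The outcome is independent of the level, so there is nothing real to learn.
    y_train = [rng.gauss(0.0, 1.0) for _ in levels]
    y_test = [rng.gauss(0.0, 1.0) for _ in levels]
    model = fit_level_means(levels, y_train)
    grand_mean = sum(y_train) / len(y_train)
    train_r2 = pseudo_r2(y_train, predict_level_means(model, levels, grand_mean))
    test_r2 = pseudo_r2(y_test, predict_level_means(model, levels, grand_mean))
    return train_r2, test_r2


train_r2, test_r2 = simulate()
print(f"train pseudo R-squared: {train_r2:.3f}")
print(f"test  pseudo R-squared: {test_r2:.3f}")
```

In this setup the training score should come out positive even though the variable carries no signal, while the test score should fall below zero: the memorized per-level means are just noise, so on new data they do worse than predicting the overall average.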

Please read on for how to fix this.