Categories: Administrativia, Statistics

Going to Strata / Hadoop World 2017 San Jose?

Are you attending, or considering attending, Strata / Hadoop World 2017 San Jose? Are you interested in learning to use R to work with Spark and h2o? Then please consider signing up for my 3 1/2 hour workshop soon. The room is about half full now; I would like to fill it, while making sure that the people who really want to attend get in.

Win-Vector LLC is partnering with RStudio to produce and present some awesome material that will allow you to perform data science at scale using R to control Spark and even h2o.

The links to the event are below. To make sure you get to participate please sign up soon!

  • Modeling big data with R, sparklyr, and Apache Spark (by RStudio and Win-Vector LLC)

    03/14/2017 1:30pm – 5:00pm PDT (210 minutes)

    Strata & Hadoop World West, San Jose Convention Center, CA; Room: LL21 C/D

    link

    Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling. This is in-depth training to prepare you to use R, Spark, sparklyr, h2o, and rsparkling.

    The session will consist of hands-on exercises with R, sparklyr, and h2o using RStudio Server Pro (generously provided by RStudio!).

    Sponsored by RStudio and Win-Vector LLC.

  • Office Hour with John Mount (Win-Vector LLC)

    03/15/2017 2:40pm – 3:20pm PDT (40 minutes)

    Strata & Hadoop World West, San Jose Convention Center, CA; Room: Table B

    link

    Come and ask me questions about data science, machine learning, R, statistics, or whatever you like.

Categories: Administrativia, Statistics

Upcoming Win-Vector LLC public speaking engagements

I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements.

Hope to see you there!

Categories: Opinion, Statistics

Organize your data manipulation in terms of “grouped ordered apply”

Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It is also rarely the analyst’s end goal; it is a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.
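To make the task concrete, here is a minimal sketch in base R; it is not necessarily the approach the full article takes. The column name `rank_in_species` and the tie-breaking rule `ties.method = "first"` are assumptions chosen only for illustration.

```r
# Per-group rank of Sepal.Length within each Species, using base R only.
# Assumption: ties are broken by order of appearance (ties.method = "first").
d <- iris
d$rank_in_species <- ave(
  d$Sepal.Length,
  d$Species,
  FUN = function(x) rank(x, ties.method = "first")
)

# Inspect the lowest-ranked rows, ordered by group and then by rank.
head(d[order(d$Species, d$rank_in_species),
       c("Species", "Sepal.Length", "rank_in_species")])
```

Note that nothing in the sketch refers to a row number: the grouping column and the ordering column carry all of the information needed.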


Image: Iris germanica (purple bearded iris), Wakehurst Place, UK. By Diliff, own work, CC BY-SA 3.0.

In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”.

Continue reading Organize your data manipulation in terms of “grouped ordered apply”

Categories: Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

The case for index-free data manipulation

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. This is how R data.frames describe themselves (try “str(data.frame(x=1:2))” in an R console to see this) and is part of the tidy data manifesto. Getting to this “ready to model” format (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation.
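As a small illustration of that self-description, the snippet below runs the example from the text. The variable name `d` is only a placeholder for this sketch.

```r
# Build the two-row example from the text and ask R to describe it.
d <- data.frame(x = 1:2)
str(d)

# The same shape, stated as counts: rows (observations), then columns (variables).
dim(d)
```

The first line printed by `str()` describes the frame in terms of “obs.” (observations, the rows) and variables (the columns), which is exactly the row/column convention described above.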

Tools like SQL (structured query language) and dplyr can make the data arrangement process less burdensome, but using them effectively requires “index-free thinking”, where the data are not thought of in terms of row indices. We will explain and motivate this idea below.

Continue reading The case for index-free data manipulation

Categories: Coding, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming

New R package: replyr (get a grip on remote dplyr data services)

It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with dplyr in one modality and hope to move to another back-end without significant debugging and work-arounds. replyr attempts to provide a few helpful work-arounds.

Our new package replyr supplies methods to get a grip on working with remote tbl sources (SQL databases, Spark) through dplyr. The idea is to add convenience functions that make such tasks more like working with an in-memory data.frame. Results still depend on which dplyr service you use, but with replyr you have fairly uniform access to some useful functions.

Continue reading New R package: replyr (get a grip on remote dplyr data services)