Posted on Categories Administrativia, data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on Data Reshaping with cdata

Data Reshaping with cdata

I’ve just shared a short webcast on data reshaping in R using the cdata package.

(link)

We also have two really nifty articles on the theory and methods:

Please give it a try!

This is the material I recently presented at the January 2017 BARUG Meetup.

NewImage

Posted on Categories Administrativia, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , Leave a comment on Big cdata News

Big cdata News

I have some big news about our R package cdata. We have greatly improved the calling interface and Nina Zumel has just written the definitive introduction to cdata.

cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago.

KerasPlot

However, cdata is much more than that.

Continue reading Big cdata News

Posted on Categories Coding, data science, Exciting Techniques, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , , , 1 Comment on Win-Vector LLC announces new “big data in R” tools

Win-Vector LLC announces new “big data in R” tools

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN):

  • partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of too long in-mutate dependence chains, of too many mutate steps, and incidental bugs; all explained in the linked tutorials).
  • if_else_device(): provides a dplyr::mutate() based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as often seen in porting from SAS) to be directly and legibly translated into performant dplyr::mutate() data flow code that works on Spark (via Sparklyr) and databases.


Blacksmith working

Image by Jeff Kubina from Columbia, Maryland – [1], CC BY-SA 2.0, Link

Continue reading Win-Vector LLC announces new “big data in R” tools

Posted on Categories Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , ,

Data Wrangling at Scale

Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template).

Fd

Please check it out.

Posted on Categories Administrativia, Statistics, TutorialsTags , , , , , 8 Comments on Update on coordinatized or fluid data

Update on coordinatized or fluid data

We have just released a major update of the cdata R package to CRAN.

Cdata

If you work with R and data, now is the time to check out the cdata package. Continue reading Update on coordinatized or fluid data

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , ,

Fluid use of data

Nina Zumel and I recently wrote a few article and series on best practices in testing models and data:

What stands out in these presentations is: the simple practice of a static test/train split is merely a convenience to cut down on operational complexity and difficulty of teaching. It is in no way optimal. That is, using slightly more complicated procedures can build better models on a given set of data.


CalTrainTest
Suggested static cal/train/test experiment design from vtreat data treatment library.
Continue reading Fluid use of data