Categories: data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Tutorials

Data Layout Exercises

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real-life examples to demonstrate data layout transforms using the cdata R package. The examples are all demo examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they make a good set of examples or exercises to work through.


Categories: Administrativia, Practical Data Science

Practical Data Science with R Book Update (April 2019)

I thought I would give a personal update on our book: Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019.


Categories: Coding, Practical Data Science, Pragmatic Data Science, Tutorials

Controlling Data Layout With cdata

Here is an example of how easy it is to use cdata to re-lay out your data.

Tim Morris recently tweeted the following problem (corrected).

Please will you take pity on me #rstats folks?
I only want to reshape two variables x & y from wide to long!

Starting with:
    id xa xb ya yb
    1  1  3  6  8
    2  2  4  7  9

How can I get to:
    id t x y
    1  a 1 6
    1  b 3 8
    2  a 2 7
    2  b 4 9
    
In Stata it's:
 . reshape long x y, i(id) j(t) string
In R, it's:
 . an hour of cursing followed by a desperate tweet 👆

Thanks for any help!

PS – I can make reshape() or gather() work when I have just x or just y.

This is not to make fun of Tim Morris: the above should be easy. Using diagrams and breaking the data transform down into small steps makes the process very easy.
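To make the requested transform concrete, here is a minimal sketch in plain Python. It is not cdata and not the method of the full post; it only illustrates the small-steps idea, with each output row built from one id and one key. The function name `wide_to_long` and its parameters are hypothetical, chosen for this illustration.

```python
# Hypothetical stand-in for the wide table in the tweet (one dict per row).
wide = [
    {"id": 1, "xa": 1, "xb": 3, "ya": 6, "yb": 8},
    {"id": 2, "xa": 2, "xb": 4, "ya": 7, "yb": 9},
]


def wide_to_long(rows, id_col="id", measures=("x", "y"), keys=("a", "b")):
    """Emit one long row per (id, key).

    For each measure m and key k, the value is read from the wide
    column named m + k (for example "x" + "a" -> "xa").
    """
    long_rows = []
    for row in rows:
        for key in keys:
            out = {id_col: row[id_col], "t": key}
            for m in measures:
                out[m] = row[m + key]
            long_rows.append(out)
    return long_rows


long_rows = wide_to_long(wide)
for r in long_rows:
    print(r)
```

The column-naming rule (measure name followed by key) is the only thing the sketch assumes about the data. A tool such as cdata lets you state that mapping as data rather than as loops.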


Categories: Programming, Tutorials

Piping is Method Chaining

What R users now call piping, popularized by Stefan Milton Bache and Hadley Wickham, is inline function application (this is notationally similar to, but distinct from, the powerful interprocess communication and concurrency tool introduced to Unix by Douglas McIlroy in 1973). In object-oriented languages this sort of notation for function application has been called “method chaining” since the days of Smalltalk (~1972). Let’s take a look at method chaining in Python, in terms of pipe notation.
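As a rough illustration of the correspondence (a sketch, not the code from the full post), the same computation can be written as a method chain and as a pipe. The `Pipe` class below is a hypothetical helper written for this example, not part of any library. It uses Python's reflected right-shift operator so that `value >> Pipe(f, args)` evaluates `f(value, args)`.

```python
class Pipe:
    """Hypothetical wrapper: `value >> Pipe(f, *args)` evaluates f(value, *args)."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, value):
        # Called for `value >> self` when `value` does not implement `>>`.
        return self.func(value, *self.args, **self.kwargs)


text = "  Piping is Method Chaining  "

# Method chaining: each call is a method on the previous result.
chained = text.strip().lower().split(" ")

# Pipe notation: each step is a function applied to the previous result.
piped = text >> Pipe(str.strip) >> Pipe(str.lower) >> Pipe(str.split, " ")

print(chained)
print(piped)
```

Both forms read left to right and pass each intermediate result to the next step. The difference is only whether the next step is looked up as a method on the value or supplied as a free-standing function.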


Categories: Administrativia, Practical Data Science

Practical Data Science with R Book Update

A good friend shared with us a great picture of Practical Data Science with R, 1st Edition hanging out in Cambridge at the MIT Press Bookstore.

[Image: Practical Data Science with R, 1st Edition, at the MIT Press Bookstore in Cambridge]

This is as good an excuse as any to share a book update.


Categories: Opinion

Not Always C++’s Fault

From the recent developer.r-project.org “Staged Install” article:

Incidentally, there were just two distinct (very long) lists of methods in the warnings across all installed packages in my run, but repeated for many packages. It turned out that they were lists of exported methods from dplyr and rlang packages. These two packages take very long to install due to C++ code compilation.

Technical point.

While dplyr indeed uses C++ (via Rcpp), rlang currently appears to be a C package. So any problems associated with rlang are probably not due to C++ or Rcpp. Similarly, other tidyverse packages such as purrr and tibble are currently C packages. I think purrr once used C++, but I do not know about the others.

Categories: Opinion, Programming

Why RcppDynProg is Written in C++

The (matter of opinion) claim:

“When the use of C++ is very limited and easy to avoid, perhaps it is the best option to do that […]”

(source discussed here)

got me thinking: does our own RcppDynProg package actually use C++ in a significant way? Could/should I port it to C? Am I informed enough to use something as complicated as C++ correctly?


Categories: Opinion

What are the Popular R Packages?

“R is its packages”, so to know R we should know its popular packages (CRAN).

Or, to put it another way: as R is a typical “the reference implementation is the specification” programming environment, there is no true “de jure” R, only a de facto R.

To look at popular R packages I defined “popular” as used (Depends/Imports/LinkingTo) by other packages on CRAN. One could use other definitions (e.g. GitHub stars), but this is the one I used for this particular study.
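The definition above amounts to counting, for each package, how many other packages name it in one of those three fields. Here is a minimal sketch of that count in Python. The package names and dependency lists are invented toy data, not real CRAN metadata, and `count_users` is a hypothetical helper rather than the code used for the study.

```python
from collections import Counter

# Hypothetical toy metadata, NOT real CRAN data: package -> dependency fields.
packages = {
    "pkgA": {"Depends": ["R", "pkgC"], "Imports": ["pkgB"], "LinkingTo": []},
    "pkgB": {"Depends": [], "Imports": ["pkgC"], "LinkingTo": ["pkgD"]},
    "pkgC": {"Depends": [], "Imports": [], "LinkingTo": []},
    "pkgD": {"Depends": [], "Imports": ["pkgC"], "LinkingTo": []},
}

FIELDS = ("Depends", "Imports", "LinkingTo")


def count_users(pkgs):
    """Count how many distinct packages use each package via any of FIELDS."""
    counts = Counter()
    for name, meta in pkgs.items():
        used = set()
        for field in FIELDS:
            used.update(meta.get(field, []))
        used.discard("R")   # a dependency on R itself is not a package use
        used.discard(name)  # ignore any self-reference
        counts.update(used)  # each using package counts once per dependency
    return counts


popularity = count_users(packages)
print(popularity.most_common())
```

Collecting the three fields into a set before counting means a package that lists the same dependency in two fields is still counted as one user.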

My “quick look” (sure to anger everyone) is a couple of diagrams such as the following.

[Image: diagram of popular R packages]
