Practical Data Science with R, half off sale!

Our publisher, Manning, is running a Memorial Day sale this weekend (May 24-27, 2019), with a new offer every day.

  • Fri: Half off all eBooks
  • Sat: Half off all MEAPs
  • Sun: Half off all pBooks and liveVideos
  • Mon: Half off everything

The discount code is: wm052419au.

Many great opportunities to get Practical Data Science with R, 2nd Edition at a discount!

Free Video Lecture: Vectors for Programmers and Data Scientists

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas programmers find interesting about vectors, which concepts they consider safe starting points, and how best to condense and present the material.

Please check the lectures out.

Timing Working With a Row or a Column from a data.frame

In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames.

We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work-paradigms, so we use small tasks as approximate stand-ins for estimating the cost of porting whole systems.

We tend to work with medium-size data (hundreds of columns and millions of rows in memory), so that is the scale we simulate and study.
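To give the flavor of the measurements, below is a minimal sketch of the style of timing we run, scaled down from the above sizes so it finishes quickly (the full note uses larger data and more cases; the microbenchmark package is assumed to be installed):

library("microbenchmark")

# simulate a wide data.frame (scaled down from the sizes discussed above)
n_rows <- 100000
n_cols <- 100
d <- data.frame(matrix(runif(n_rows * n_cols),
                       nrow = n_rows, ncol = n_cols))

microbenchmark(
  select_column = d[["X1"]],               # select one column
  alter_column  = { d$X1 <- d$X1 + 1 },    # alter one column
  select_row    = d[1000, , drop = FALSE], # select one row
  times = 10L)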

Continue reading Timing Working With a Row or a Column from a data.frame

Data Layout Exercises

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real-life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to Stack Overflow. They represent a good cross-section of data layout problems, so they make a good set of examples or exercises to work through.

Continue reading Data Layout Exercises

Controlling Data Layout With cdata

Here is an example of how easy it is to use cdata to re-layout your data.

Tim Morris recently tweeted the following problem (corrected).

Please will you take pity on me #rstats folks?
I only want to reshape two variables x & y from wide to long!

Starting with:
    d xa xb ya yb
    1  1  3  6  8
    2  2  4  7  9

How can I get to:
    id t x y
    1  a 1 6
    1  b 3 8
    2  a 2 7
    2  b 4 9
    
In Stata it's:
 . reshape long x y, i(id) j(t) string
In R, it's:
 . an hour of cursing followed by a desperate tweet 👆

Thanks for any help!

PS – I can make reshape() or gather() work when I have just x or just y.

This is not to make fun of Tim Morris: the above should be easy. Using diagrams, and slowing the data transform down into small steps, makes the process straightforward.
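For example, here is a minimal sketch of a cdata solution of the sort the full note develops step by step (it reads the first column of Tim’s table as an “id” key; the note itself may present the steps differently, and the result is shown up to row order):

library("cdata")

# Tim's data, taking the first column as an id key
d <- wrapr::build_frame(
  "id", "xa", "xb", "ya", "yb" |
  1   , 1   , 3   , 6   , 8    |
  2   , 2   , 4   , 7   , 9    )

# The control table is a picture of one block record:
# the first column lists the values the new key column t takes,
# the remaining cells name which input columns supply x and y.
controlTable <- wrapr::build_frame(
  "t", "x" , "y"  |
  "a", "xa", "ya" |
  "b", "xb", "yb" )

rowrecs_to_blocks(d, controlTable, columnsToCopy = "id")
#>   id t x y
#> 1  1 a 1 6
#> 2  1 b 3 8
#> 3  2 a 2 7
#> 4  2 b 4 9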

Continue reading Controlling Data Layout With cdata

Why we Did Not Name the cdata Transforms wide/tall/long/short

We recently saw this UX (user experience) question from the tidyr author as he adapts tidyr to cdata techniques.

While he is adopting the cdata methodology into tidyr, the terminology he is not adopting from cdata is “unpivot_to_blocks()” and “pivot_to_rowrecs()”. One of the research ideas in the cdata package is that the important thing to call out is record structure.

The key point is: are we in a very de-normalized form where all facts about an instance are in a single row (which we called “row records”), or are we in a record-oriented form where all the facts about an instance are spread over several rows (which we called “block records”)? The point is: row records don’t necessarily have more columns than block records. This makes shape-based naming of the transforms problematic, no matter what names you pick for the shapes. There is an advantage to using intent-based or semantics-based naming.

Below is a simple example.

library("cdata")

# example 1 end up with more rows, fewer columns
d <- data.frame(AUC = 0.6, R2 = 0.7, F1 = 0.8)
print(d)
#>   AUC  R2  F1
#> 1 0.6 0.7 0.8
unpivot_to_blocks(d,
                  nameForNewKeyColumn= 'meas',
                  nameForNewValueColumn= 'val',
                  columnsToTakeFrom= c('AUC', 'R2', 'F1')) 
#>   meas val
#> 1  AUC 0.6
#> 2   R2 0.7
#> 3   F1 0.8

# example 2 end up with more rows, same number of columns
d <- data.frame(AUC = 0.6, R2 = 0.7)
print(d)
#>   AUC  R2
#> 1 0.6 0.7
unpivot_to_blocks(d,
                  nameForNewKeyColumn= 'meas',
                  nameForNewValueColumn= 'val',
                  columnsToTakeFrom= c('AUC', 'R2')) 
#>   meas val
#> 1  AUC 0.6
#> 2   R2 0.7

# example 3 end up with same number of rows, more columns
d <- data.frame(AUC = 0.6)
print(d)
#>   AUC
#> 1 0.6
unpivot_to_blocks(d,
                  nameForNewKeyColumn= 'meas',
                  nameForNewValueColumn= 'val',
                  columnsToTakeFrom= c('AUC'))
#>   meas val
#> 1  AUC 0.6

Notice that the width of the result relative to the input width varies as a function of the input data, even though we were always calling the same transform. This makes it incorrect to characterize these transforms as merely widening or narrowing.

There are still some subtle points (for instance, row records are in fact instances of block records), but overall the scheme we (Nina Zumel and myself: John Mount) worked out, tested, and promoted is pretty good. A lot of our work researching this topic can be found here.

Tidyverse users: gather/spread are on the way out

From https://twitter.com/sharon000/status/1107771331012108288, and from https://tidyr.tidyverse.org/dev/articles/pivot.html (text by Hadley Wickham):

For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.

There are two important new features inspired by other R packages that have been advancing reshaping in R:

  • The reshaping operation can be specified with a data frame that describes precisely how metadata stored in column names becomes data variables (and vice versa). This is inspired by the cdata package by John Mount and Nina Zumel. For simple uses of pivot_long() and pivot_wide(), this specification is implicit, but for more complex cases it is useful to make it explicit, and operate on the specification data frame using dplyr and tidyr.
  • pivot_long() can work with multiple value variables that may have different types. This is inspired by the enhanced melt() and dcast() functions provided by the data.table package by Matt Dowle and Arun Srinivasan.

If you want to work in the above way, we suggest giving our cdata package a try. We named the functions pivot_to_rowrecs() and unpivot_to_blocks(). The idea was that by emphasizing the record structure, one might eventually internalize what the transforms are doing. To help along the way, we have a lot of documentation and tutorials.
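For example, here is a small sketch of pivot_to_rowrecs() moving block records (one fact per row) back to row records (all facts about an instance in one row); the data is made up for illustration, and the column order of the result may vary:

library("cdata")

# block records: one measurement per row, keyed by model
d <- data.frame(
  model = c("m1", "m1", "m2", "m2"),
  meas  = c("AUC", "R2", "AUC", "R2"),
  val   = c(0.6, 0.7, 0.5, 0.8))

# move to row records: all facts about a model in a single row
pivot_to_rowrecs(d,
                 columnToTakeKeysFrom = "meas",
                 columnToTakeValuesFrom = "val",
                 rowKeyColumns = "model")
#>   model AUC  R2
#> 1    m1 0.6 0.7
#> 2    m2 0.5 0.8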

Data Manipulation Corner Cases

Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user is pushing the edge of what the package developer intended their package to support, and thus are often where things go wrong.

Let’s see what happens when we try to stick a fork in the power outlet.
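As a taste of what the note examines, here is one classic corner case of this sort (our pick for illustration; the note itself works through others):

d <- data.frame(x = 1:3, y = 4:6)

# base R's "[" silently converts a one-column selection to a vector
class(d[, c("x", "y")])        # "data.frame"
class(d[, "x"])                # "integer" -- the frame structure is gone
class(d[, "x", drop = FALSE])  # "data.frame" -- drop = FALSE keeps it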

Continue reading Data Manipulation Corner Cases

Starting With Data Science

A rigorous hands-on introduction to data science for software engineers.

Win Vector LLC is now offering a 4-day on-site intensive data science course. The course targets software engineers familiar with Python and introduces them to the basics of current data science practice. It is designed as an interactive, in-person (not remote or video) course.

Continue reading Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Software Engineers

PDSwR2: New Chapters!

We have two new chapters of Practical Data Science with R, Second Edition online and available for review!

The newly available chapters cover:

Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.

Choosing and Evaluating Models – The chapter starts by exploring machine learning approaches, and then moves on to key model evaluation topics, such as mapping business problems to machine learning tasks, evaluating model quality, and explaining model predictions.

If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.

For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.