In this note we will use five real-life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they make a good set of examples or exercises to work through.
Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user may be running up against the edge of what the package developer intended the package to handle, and thus where things often go wrong.
Let’s see what happens when we try to stick a fork in the power-outlet.
The rquery R package has several places where the user can ask for what they have typed in to be substituted for a name or value stored in a variable.
This matters because many of the rquery commands capture column names from un-executed code. So it is important to know whether something is treated as a symbol/name (which will be translated to a data.frame column name or a database column name) or as a character/string (which will be translated to a constant).
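To make the symbol-versus-string distinction concrete, here is a minimal base R sketch. It does not use rquery itself: the data frame `d` and the variable `col` are made up for illustration, and rquery's own substitution notation may differ in detail. The point is only that the same text can act as a column reference or as a constant, depending on whether it is a name or a string.

```r
# Illustrative data; not taken from the original note.
d <- data.frame(x = c(1, 2, 3))

# A symbol (name) is looked up as a column when evaluated against the data.
from_symbol <- eval(quote(x), d)

# A character string is just a constant; evaluating it returns the string.
from_string <- eval("x", d)

# Suppose the column name is stored in a variable.
col <- "x"

# Substituting it as a name yields an expression that refers to the column.
expr_name <- bquote(.(as.name(col)) + 1)
from_name_sub <- eval(expr_name, d)

# Substituting it as a value yields the string constant instead.
expr_value <- bquote(.(col))
from_value_sub <- eval(expr_value, d)

print(expr_name)       # the expression built from the name
print(from_name_sub)   # column values plus one
print(from_value_sub)  # the string itself
```

The two `bquote()` calls differ only in whether the stored text is first converted with `as.name()`, which is exactly the choice a query generator has to make when it captures un-executed code.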
R users have been enjoying the benefits of SQL query generators for quite some time, most notably using the dbplyr package. I would like to talk about some features of our own rquery query generator, concentrating on derived result re-use.
Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.
Choosing and Evaluating Models – The chapter starts with exploring machine learning approaches and then moves to studying key model evaluation topics like mapping business problems to machine learning tasks, evaluating model quality, and how to explain model predictions.
If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.
For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.
The idea is: many important calculations can be considered as a sequence of transforms applied to a data set. Each step may be a function taking many arguments. It is often the case that only one of each function’s arguments is primary, and the rest are parameters. For data science applications this is particularly common, so having convenient pipeline notation can be a plus. An example of a non-trivial data processing pipeline can be found here.
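As a rough sketch of this idea in base R (the step functions `keep_rows` and `add_scaled` and the data are hypothetical, chosen only for illustration), each step below takes the data as its primary argument and everything else as a parameter. Fixing the parameters leaves one-argument functions that can be applied in sequence.

```r
# Each step takes the data as its primary argument; the rest are parameters.
keep_rows <- function(d, min_x) d[d$x >= min_x, , drop = FALSE]
add_scaled <- function(d, factor) {
  d$y <- d$x * factor
  d
}

d <- data.frame(x = c(1, 5, 10))

# Nested-call form: reads inside-out.
nested <- add_scaled(keep_rows(d, min_x = 2), factor = 10)

# Pipeline form: fix the parameters, leaving one-argument functions,
# then apply them left to right.
steps <- list(
  function(d) keep_rows(d, min_x = 2),
  function(d) add_scaled(d, factor = 10)
)
piped <- Reduce(function(acc, f) f(acc), steps, d)

print(piped)
```

Both forms compute the same result; the pipeline form keeps the steps in reading order and makes the list of steps itself a value that can be stored and reused.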
In this note we will discuss the advanced R pipeline operator "dot arrow pipe" and an S4 class (wrapr::UnaryFn) that makes working with pipeline notation much more powerful and much easier.
One of the design goals of the cdata R package is that very powerful and arbitrary record transforms should be convenient and take only one or two steps. In fact, the goal is to take just about any record shape to any other in two steps: first convert to row-records, then re-block the data into arbitrary record shapes (please see here and here for the concepts).
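The two-step idea can be sketched in base R without cdata. This is not cdata's API, only an illustration of the concept; the column names (`key`, `split`, `AUC`, `R2`) and the values are invented. We start from a tall layout, convert to row-records (one row per `id`), and then re-block each record into a different shape.

```r
# Starting layout: four rows per record, one (key, value) pair per row.
tall <- data.frame(
  id = rep(c(1, 2), each = 4),
  key = rep(c("AUC_train", "AUC_test", "R2_train", "R2_test"), times = 2),
  value = c(0.9, 0.7, 0.6, 0.4, 0.8, 0.6, 0.5, 0.3),
  stringsAsFactors = FALSE
)

# Step 1: convert to row-records (one row per id, one column per key).
ids <- unique(tall$id)
rowrecs <- data.frame(id = ids)
for (k in unique(tall$key)) {
  part <- tall[tall$key == k, , drop = FALSE]
  rowrecs[[k]] <- part$value[match(ids, part$id)]
}

# Step 2: re-block each row-record into a new shape:
# two rows per id (train, test), with AUC and R2 as columns.
splits <- c("train", "test")
pieces <- lapply(splits, function(s) {
  data.frame(
    id = rowrecs$id,
    split = s,
    AUC = rowrecs[[paste0("AUC_", s)]],
    R2 = rowrecs[[paste0("R2_", s)]],
    stringsAsFactors = FALSE
  )
})
reblocked <- do.call(rbind, pieces)
reblocked <- reblocked[order(reblocked$id), , drop = FALSE]
rownames(reblocked) <- NULL

print(rowrecs)
print(reblocked)
```

Going through row-records means neither step needs to know about the other layout: the first step only gathers everything about a record into one row, and the second only describes the target block shape.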
But as with all general ideas, it is much easier to see what we mean by the above with a concrete example.