We have a new, improved version of the “how to design a cdata/data_algebra data transform” video up!

The original article, the Python example, and the R example have all been updated to use the new video.

Please check it out!

# Tag: cdata

## New improved cdata instructional video

## Data re-Shaping in R and in Python

## The Advantages of Record Transform Specifications

## Advanced Data Reshaping in Python and R

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through the use of coordinatized data concepts (relying heavily on Codd’s “rule of access”).

## Data Layout Exercises

## Controlling Data Layout With cdata

## Operator Notation for Data Transforms

## How cdata Control Table Data Transforms Work

## Why we Did Not Name the cdata Transforms wide/tall/long/short

## Tidyverse users: gather/spread are on the way out

Nina Zumel and I have two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version of the tutorial.

This reflects our opinion on the “which is better for data science, R or Python?” question: they are both great. So start with one, and expect to eventually work with both (if you are lucky).

Nina Zumel had a really great article on how to prepare a nice `Keras` performance plot using `R`.

I will use this example to show some of the advantages of `cdata` record transform specifications.

The advantages of data_algebra and cdata are:

- The user specifies their desired transform declaratively *by example* and *in data*. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
- The transform systems can print what a transform is going to do. This makes reasoning about data transforms *much* easier.
- The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).
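
To make the "transforms are data" point concrete, here is a minimal Python sketch. It does not use the data_algebra or cdata APIs: the dictionary layout, the field names `record_keys`, `control_table_keys`, and `control_table`, the helper `describe()`, and the column names are all assumptions invented for illustration. It only shows that a transform written as plain data can be printed for inspection and round-tripped through a text format another system could read.

```python
import json

# Hypothetical transform specification, written purely as data.
# Each control-table row says: for this value of "meas",
# fill "val" from the named source column.
spec = {
    "record_keys": ["model"],
    "control_table_keys": ["meas"],
    "control_table": [
        {"meas": "AUC", "val": "AUC"},
        {"meas": "R2", "val": "R2"},
    ],
}


def describe(spec):
    """Return a plain-text description of what the specification would do."""
    key = spec["control_table_keys"][0]
    lines = ["one output row per record per value of " + repr(key) + ":"]
    for ctrl in spec["control_table"]:
        for out_col, src_col in ctrl.items():
            if out_col != key:
                lines.append(
                    f"  when {key} == {ctrl[key]!r}: {out_col} <- column {src_col!r}"
                )
    return "\n".join(lines)


# Because the specification is data, we can print what it is going to do...
print(describe(spec))

# ...and serialize it to a text form that another system could load.
shared = json.dumps(spec)
restored = json.loads(shared)
print(restored == spec)
```

The design point is that nothing in `spec` is code, so inspecting it or moving it between languages does not require running the transform.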

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real life examples to demonstrate data layout transforms using the `cdata` `R` package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.

Here is an example of how easy it is to use `cdata` to re-layout your data.

Tim Morris recently tweeted the following problem (corrected).

Please will you take pity on me #rstats folks? I only want to reshape two variables x & y from wide to long!

Starting with:

    d xa xb ya yb
    1  1  3  6  8
    2  2  4  7  9

How can I get to:

    id t x y
    1  a 1 6
    1  b 3 8
    2  a 2 7
    2  b 4 9

In Stata it's:

    . reshape long x y, i(id) j(t) string

In R, it's:

    . an hour of cursing followed by a desperate tweet 👆

Thanks for any help!

PS – I can make reshape() or gather() work when I have just x or just y.

This is not to make fun of Tim Morris: the above *should* be easy. Using diagrams and breaking the data transform down into small steps makes the process very easy.
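
As a rough illustration of the small-steps idea, the tweet's reshape can be sketched in plain Python with a control table: a small table that names, for each value of the new key `t`, which source column supplies `x` and which supplies `y`. This is not the `cdata` implementation or API. The function `rowrecs_to_blocks()` and its arguments are hypothetical, and we assume the first input column is named `id`, as in the desired output.

```python
def rowrecs_to_blocks(rows, control_table, record_keys, control_key):
    """Expand each row record into one output row per control-table row."""
    out = []
    for row in rows:
        for ctrl in control_table:
            # Carry the record keys through unchanged.
            new_row = {k: row[k] for k in record_keys}
            # The control key labels which part of the block this row is.
            new_row[control_key] = ctrl[control_key]
            # Every other control-table cell names the source column to copy.
            for out_col, src_col in ctrl.items():
                if out_col != control_key:
                    new_row[out_col] = row[src_col]
            out.append(new_row)
    return out


# The starting data from the tweet (first column assumed to be "id").
wide = [
    {"id": 1, "xa": 1, "xb": 3, "ya": 6, "yb": 8},
    {"id": 2, "xa": 2, "xb": 4, "ya": 7, "yb": 9},
]

# The control table is a picture of one output record, with source
# column names written where the values should go.
control_table = [
    {"t": "a", "x": "xa", "y": "ya"},
    {"t": "b", "x": "xb", "y": "yb"},
]

blocks = rowrecs_to_blocks(wide, control_table, record_keys=["id"], control_key="t")
for r in blocks:
    print(r)
```

The printed rows match the `id t x y` table asked for in the tweet. The only problem-specific part is the control table, which is written down by looking at one example record.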

As of version `1.0.8`, `cdata` implements an operator notation for data transforms.

The idea is simple, yet powerful.

With all of the excitement surrounding `cdata` style control table based data transforms (the `cdata` ideas being named as the “replacements” for `tidyr`’s current methodology, by the `tidyr` authors themselves!) I thought I would take a moment to describe how they work.

We recently saw this UX (user experience) question from the tidyr author as he adapts tidyr to cdata techniques.

While adopting the cdata methodology into tidyr, the terminology that he is not adopting from cdata is “unpivot_to_blocks()” and “pivot_to_rowrecs()”. One of the research ideas in the cdata package is that the important thing to call out is record structure.

The key point is: are we in a very de-normalized form where all facts about an instance are in a single row (which we called “row records”), or are we in a record oriented form where all the facts about an instance are in several rows (which we called “block records”)? The point is: row records don’t necessarily have more columns than block records. This makes shape based naming of the transforms problematic, no matter what names you pick for the shapes. There is an advantage to using intent or semantic based naming.

Below is a simple example.

    library("cdata")

    # example 1 end up with more rows, fewer columns
    d <- data.frame(AUC = 0.6, R2 = 0.7, F1 = 0.8)
    print(d)
    #>   AUC  R2  F1
    #> 1 0.6 0.7 0.8
    unpivot_to_blocks(d,
                      nameForNewKeyColumn = 'meas',
                      nameForNewValueColumn = 'val',
                      columnsToTakeFrom = c('AUC', 'R2', 'F1'))
    #>   meas val
    #> 1  AUC 0.6
    #> 2   R2 0.7
    #> 3   F1 0.8

    # example 2 end up with more rows, same number of columns
    d <- data.frame(AUC = 0.6, R2 = 0.7)
    print(d)
    #>   AUC  R2
    #> 1 0.6 0.7
    unpivot_to_blocks(d,
                      nameForNewKeyColumn = 'meas',
                      nameForNewValueColumn = 'val',
                      columnsToTakeFrom = c('AUC', 'R2'))
    #>   meas val
    #> 1  AUC 0.6
    #> 2   R2 0.7

    # example 3 end up with same number of rows, more columns
    d <- data.frame(AUC = 0.6)
    print(d)
    #>   AUC
    #> 1 0.6
    unpivot_to_blocks(d,
                      nameForNewKeyColumn = 'meas',
                      nameForNewValueColumn = 'val',
                      columnsToTakeFrom = c('AUC'))
    #>   meas val
    #> 1  AUC 0.6

Notice the width of the result relative to input width varies as a function of the input data, even though we were always calling the same transform. This makes it incorrect to characterize these transforms as merely widening or narrowing.
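
The same shape bookkeeping can be checked with a small plain-Python sketch. The function `unpivot_sketch()` is a hypothetical stand-in written for this illustration, not the `cdata` function. It applies one and the same operation to the three inputs above and reports (rows, columns) before and after.

```python
def unpivot_sketch(rows, key_col, value_col, columns_to_take_from):
    """Move the named columns into (key, value) pairs, one output row each."""
    out = []
    for row in rows:
        # Columns not being unpivoted are carried along unchanged.
        carried = {k: v for k, v in row.items() if k not in columns_to_take_from}
        for col in columns_to_take_from:
            out.append({**carried, key_col: col, value_col: row[col]})
    return out


def shape(rows):
    """Return (number of rows, number of columns) for a list of dict rows."""
    return (len(rows), len(rows[0]))


values = {"AUC": 0.6, "R2": 0.7, "F1": 0.8}
shapes = []
for measures in (["AUC", "R2", "F1"], ["AUC", "R2"], ["AUC"]):
    d = [{m: values[m] for m in measures}]  # a single row record
    blocks = unpivot_sketch(d, "meas", "val", measures)
    shapes.append((shape(d), shape(blocks)))
    print(shape(d), "->", shape(blocks))
```

The result is narrower than the input in the first case, the same width in the second, and wider in the third, even though the call is identical each time.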

There are still some subtle points (for instance row records are in fact instances of block records), but overall the scheme we (Nina Zumel, and myself: John Mount) worked out, tested, and promoted is pretty good. A lot of our work researching this topic can be found here.

From https://twitter.com/sharon000/status/1107771331012108288:

From https://tidyr.tidyverse.org/dev/articles/pivot.html (text by Hadley Wickham):

For some time, it’s been obvious that there is something fundamentally wrong with the design of `spread()` and `gather()`. Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.

There are two important new features inspired by other R packages that have been advancing of reshaping in R:

- The reshaping operation can be specified with a data frame that describes precisely how metadata stored in column names becomes data variables (and vice versa). This is inspired by the `cdata` package by John Mount and Nina Zumel. For simple uses of `pivot_long()` and `pivot_wide()`, this specification is implicit, but for more complex cases it is useful to make it explicit, and operate on the specification data frame using `dplyr` and `tidyr`.
- `pivot_long()` can work with multiple value variables that may have different types. This is inspired by the enhanced `melt()` and `dcast()` functions provided by the `data.table` package by Matt Dowle and Arun Srinivasan.

If you want to work in the above way we suggest giving our `cdata` package a try. We named the functions `pivot_to_rowrecs` and `unpivot_to_blocks`. The idea was: by emphasizing the record structure one might eventually internalize what the transforms are doing. On the way to that we have a lot of documentation and tutorials.
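
For readers who think more easily in code, here is a plain-Python sketch of the "blocks to row records" direction. It is only an illustration of the record-structure idea. `pivot_sketch()` is a hypothetical helper, not the `cdata` function, and the `model` column and the values for `m2` are made up for the example.

```python
def pivot_sketch(blocks, key_col, value_col, record_keys):
    """Collect all rows that share the same record keys into a single row."""
    records = {}
    for row in blocks:
        # Rows with the same record-key values belong to the same record.
        rec_id = tuple(row[k] for k in record_keys)
        rec = records.setdefault(rec_id, {k: row[k] for k in record_keys})
        # Each block row contributes one named fact to its record.
        rec[row[key_col]] = row[value_col]
    return list(records.values())


# Block records: the facts about each model are spread over several rows.
blocks = [
    {"model": "m1", "meas": "AUC", "val": 0.6},
    {"model": "m1", "meas": "R2", "val": 0.7},
    {"model": "m2", "meas": "AUC", "val": 0.5},
    {"model": "m2", "meas": "R2", "val": 0.9},
]

# Row records: all the facts about each model are in a single row.
rowrecs = pivot_sketch(blocks, key_col="meas", value_col="val", record_keys=["model"])
for r in rowrecs:
    print(r)
```

Here the record keys (`model`) decide which rows belong together. That is the sense in which the names emphasize record structure rather than table shape.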