Posted on Categories Computer Science, TutorialsTags , , Leave a comment on Minimal Key Set is NP hard

Minimal Key Set is NP hard

It usually gives us a chuckle when we find some natural and seemingly easy data science question is NP-hard. For instance we have written that variable pruning is NP-hard when one insists on finding a minimal sized set of variables (and also why there are no obvious methods for exact large permutation tests).

In this note we show that finding a minimal set of columns that form a primary key in a database is also NP-hard.

Continue reading Minimal Key Set is NP hard

Posted on Categories Mathematics, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on Free Video Lecture: Vectors for Programmers and Data Scientists

Free Video Lecture: Vectors for Programmers and Data Scientists

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas do programmers find interesting about vectors, what concepts do they consider safe starting points, and how to condense and present the material.

Please check the lectures out.

NewImage

Posted on Categories Opinion, Pragmatic Data Science, TutorialsTags , Leave a comment on Timing Working With a Row or a Column from a data.frame

Timing Working With a Row or a Column from a data.frame

In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames.

We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work-paradigms, so we use porting small tasks as approximate stand-ins for measuring porting whole systems.

We tend to work with medium size data (hundreds of columns and millions of rows in memory), so that is the scale we simulate and study.

Continue reading Timing Working With a Row or a Column from a data.frame

Posted on Categories data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, TutorialsTags , , Leave a comment on Data Layout Exercises

Data Layout Exercises

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.

Continue reading Data Layout Exercises

Posted on Categories Coding, Practical Data Science, Pragmatic Data Science, TutorialsTags , , 4 Comments on Controlling Data Layout With cdata

Controlling Data Layout With cdata

Here is an example how easy it is to use cdata to re-layout your data.

Tim Morris recently tweeted the following problem (corrected).

Please will you take pity on me #rstats folks?
I only want to reshape two variables x & y from wide to long!

Starting with:
    d xa xb ya yb
    1  1  3  6  8
    2  2  4  7  9

How can I get to:
    id t x y
    1  a 1 6
    1  b 3 8
    2  a 2 7
    2  b 4 9
    
In Stata it's:
 . reshape long x y, i(id) j(t) string
In R, it's:
 . an hour of cursing followed by a desperate tweet 👆

Thanks for any help!

PS – I can make reshape() or gather() work when I have just x or just y.

This is not to make fun of Tim Morris: the above should be easy. Using diagrams and slowing down the data transform into small steps makes the process very easy.

Continue reading Controlling Data Layout With cdata

Posted on Categories Programming, TutorialsTags , , , Leave a comment on Piping is Method Chaining

Piping is Method Chaining

What R users now call piping, popularized by Stefan Milton Bache and Hadley Wickham, is inline function application (this is notationally similar to, but distinct from the powerful interprocess communication and concurrency tool introduced to Unix by Douglas McIlroy in 1973). In object oriented languages this sort of notation for function application has been called “method chaining” since the days of Smalltalk (~1972). Let’s take a look at method chaining in Python, in terms of pipe notation.

Continue reading Piping is Method Chaining

Posted on Categories Opinion, Programming, TutorialsTags , , 2 Comments on Standard Evaluation Versus Non-Standard Evaluation in R

Standard Evaluation Versus Non-Standard Evaluation in R

There is a lot of unnecessary worry over “Non Standard Evaluation” (NSE) in R versus “Standard Evaluation” (SE, or standard “variables names refer to values” evaluation). This very author is guilty of over-discussing the issue. But let’s give this yet another try.

The entire difference between NSE and regular evaluation can be summed up in the following simple table (which should be clear after we work some examples).

Tbl

Continue reading Standard Evaluation Versus Non-Standard Evaluation in R

Posted on Categories Coding, Exciting Techniques, TutorialsTags , 6 Comments on Operator Notation for Data Transforms

Operator Notation for Data Transforms

As of cdata version 1.0.8 cdata implements an operator notation for data transform.

The idea is simple, yet powerful.

Continue reading Operator Notation for Data Transforms

Posted on Categories Coding, TutorialsTags , , 2 Comments on How cdata Control Table Data Transforms Work

How cdata Control Table Data Transforms Work

With all of the excitement surrounding cdata style control table based data transforms (the cdata ideas being named as the “replacements” for tidyr‘s current methodology, by the tidyr authors themselves!) I thought I would take a moment to describe how they work.

Continue reading How cdata Control Table Data Transforms Work