Posted on Categories Computer Science, Expository Writing, Programming, TutorialsTags , , , , , , 1 Comment on On indexing operators and composition

On indexing operators and composition

In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a very deep understanding of array indexing and the deep interpretation available when we realize indexing is an instance of function composition (or an example of permutation groups or semigroups: some very deep yet accessible pure mathematics).


NewImage

A permutation of indices

In this article I will be working hard to convince you a very fundamental true statement is in fact true: array indexing is associative; and to simultaneously convince you that you should still consider this amazing (as it is a very strong claim with very many consequences). Array indexing respecting associative transformations should not be a-priori intuitive to the general programmer, as array indexing code is rarely re-factored or transformed, so programmers tend to have little experience with the effect. Consider this article an exercise to build the experience to make this statement a posteriori obvious, and hence something you are more comfortable using and relying on.

R‘s array indexing notation is really powerful, so we will use it for our examples. This is going to be long (because I am trying to slow the exposition down enough to see all the steps and relations) and hard to follow without working examples (say with R), and working through the logic with pencil and a printout (math is not a spectator sport). I can’t keep all the steps in my head without paper, so I don’t really expect readers to keep all the steps in their heads without paper (though I have tried to organize the flow of this article and signal intent often enough to make this readable). Continue reading On indexing operators and composition

Posted on Categories Coding, Opinion, Programming, StatisticsTags , , , , 5 Comments on Why to use wrapr::let()

Why to use wrapr::let()

I have written about referential transparency before. In this article I would like to discuss “leaky abstractions” and why wrapr::let() supplies a useful (but leaky) abstraction for R programmers.

Wraprs Continue reading Why to use wrapr::let()

Posted on Categories data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , Leave a comment on Teaching pivot / un-pivot

Teaching pivot / un-pivot

Authors: John Mount and Nina Zumel

Introduction

In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.

One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering“) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.

Please read on for our thoughts on teaching pivoting data. Continue reading Teaching pivot / un-pivot

Posted on Categories data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , , 1 Comment on Coordinatized Data: A Fluid Data Specification

Coordinatized Data: A Fluid Data Specification

Authors: John Mount and Nina Zumel.

Introduction

It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).

Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.

In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in R.

Continue reading Coordinatized Data: A Fluid Data Specification

Posted on Categories Programming, StatisticsTags , , , Leave a comment on sigr: Simple Significance Reporting

sigr: Simple Significance Reporting

sigr is a simple R package that conveniently formats a few statistics and their significance tests. This allows the analyst to use the correct test no matter what modeling package or procedure they use.

Sigr Continue reading sigr: Simple Significance Reporting

Posted on Categories Exciting Techniques, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , , , Leave a comment on Step-Debugging magrittr/dplyr Pipelines in R with wrapr and replyr

Step-Debugging magrittr/dplyr Pipelines in R with wrapr and replyr

In this screencast we demonstrate how to easily and effectively step-debug magrittr/dplyr pipelines in R using wrapr and replyr.



Continue reading Step-Debugging magrittr/dplyr Pipelines in R with wrapr and replyr

Posted on Categories Programming, StatisticsTags , , , Leave a comment on replyr: Get a Grip on Big Data in R

replyr: Get a Grip on Big Data in R

replyr is an R package that contains extensions, adaptions, and work-arounds to make remote R dplyr data sources (including big data systems such as Spark) behave more like local data. This allows the analyst to more easily develop and debug procedures that simultaneously work on a variety of data services (in-memory data.frame, SQLite, PostgreSQL, and Spark2 currently being the primary supported platforms).

Replyrs Continue reading replyr: Get a Grip on Big Data in R

Posted on Categories Opinion, Programming, TutorialsTags , , , , 1 Comment on Iteration and closures in R

Iteration and closures in R

I recently read an interesting thread on unexpected behavior in R when creating a list of functions in a loop or iteration. The issue is solved, but I am going to take the liberty to try and re-state and slow down the discussion of the problem (and fix) for clarity.

The issue is: are references or values captured during iteration?

Many users expect values to be captured. Most programming language implementations capture variables or references (leading to strange aliasing issues). It is confusing (especially in R, which pushes so far in the direction of value oriented semantics) and best demonstrated with concrete examples.


NewImage

Please read on for a some of the history and future of this issue. Continue reading Iteration and closures in R

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , , , 16 Comments on The Zero Bug

The Zero Bug

I am going to write about an insidious statistical, data analysis, and presentation fallacy I call “the zero bug” and the habits you need to cultivate to avoid it.


The zero bug

The zero bug

Here is the zero bug in a nutshell: common data aggregation tools often can not “count to zero” from examples, and this causes problems. Please read on for what this means, the consequences, and how to avoid the problem. Continue reading The Zero Bug

Posted on Categories Administrativia, Programming, StatisticsTags , , , , 5 Comments on Announcing the wrapr packge for R

Announcing the wrapr packge for R

Recently Dirk Eddelbuettel pointed out that our R function debugging wrappers would be more convenient if they were available in a low-dependency micro package dedicated to little else. Dirk is a very smart person, and like most R users we are deeply in his debt; so we (Nina Zumel and myself) listened and immediately moved the wrappers into a new micro-package: wrapr.


WrapperImage: Friedensreich Hundertwasser
Continue reading Announcing the wrapr packge for R