Posted on Categories Computer Science, Expository Writing, Programming, TutorialsTags , , , , , , 1 Comment on On indexing operators and composition

On indexing operators and composition

In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a very deep understanding of array indexing and the deep interpretation available when we realize indexing is an instance of function composition (or an example of permutation groups or semigroups: some very deep yet accessible pure mathematics).


A permutation of indices

In this article I will be working hard to convince you a very fundamental true statement is in fact true: array indexing is associative; and to simultaneously convince you that you should still consider this amazing (as it is a very strong claim with very many consequences). Array indexing respecting associative transformations should not be a-priori intuitive to the general programmer, as array indexing code is rarely re-factored or transformed, so programmers tend to have little experience with the effect. Consider this article an exercise to build the experience to make this statement a posteriori obvious, and hence something you are more comfortable using and relying on.

R‘s array indexing notation is really powerful, so we will use it for our examples. This is going to be long (because I am trying to slow the exposition down enough to see all the steps and relations) and hard to follow without working examples (say with R), and working through the logic with pencil and a printout (math is not a spectator sport). I can’t keep all the steps in my head without paper, so I don’t really expect readers to keep all the steps in their heads without paper (though I have tried to organize the flow of this article and signal intent often enough to make this readable). Continue reading On indexing operators and composition

Posted on Categories Coding, Computer Science, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, StatisticsTags ,

A Simple Example of Using replyr::gapply

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, replyr::gapply, from our latest package, replyr.

4140852348 2ebe864822 z
Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.

The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.

Continue reading A Simple Example of Using replyr::gapply

Posted on Categories Coding, Computer Science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, StatisticsTags , , 2 Comments on Using replyr::let to Parameterize dplyr Expressions

Using replyr::let to Parameterize dplyr Expressions


Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

dist_intervals(iris, "Sepal.Length", "Species")

# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

For a specific data frame, with known column names, such a table is easy to construct using dplyr::group_by and dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in dplyr can get quite hairy, quite quickly. Try it yourself, and see.

Enter let, from our new package replyr.

Continue reading Using replyr::let to Parameterize dplyr Expressions

Posted on Categories Computer Science, math programming, Statistics, TutorialsTags , , , 2 Comments on Variable pruning is NP hard

Variable pruning is NP hard

I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing some examples is that summaries such as model quality (especially out of sample quality) and variable significances are not quite as simple as one would hope (they in fact lack a lot of the monotone structure or submodular structure that would make things easy).

That being said we have a lot of powerful and effective heuristics to discuss in upcoming articles. I am going to leave such positive results for my later articles and here concentrate on an instructive technical negative result: picking a good subset of variables is theoretically quite hard. Continue reading Variable pruning is NP hard

Posted on Categories Computer ScienceTags , 2 Comments on The magrittr monad

The magrittr monad

Monads are a formal theory of composition where programmers get to invoke some very abstract mathematics (category theory) to argue the minutia of annotating, scheduling, sequencing operations, and side effects. On the positive side the monad axioms are a guarantee that related ways of writing code are in fact substitutable and equivalent; so you want your supplied libraries to obey such axioms to make your life easy. On the negative side, the theory is complicated.

In this article we will consider the latest entry of our mad “programming theory in R series” (see Some programming language theory in R, You don’t need to understand pointers to program using R, Using closures as objects in R, and How and why to return functions in R): category theory!

Continue reading The magrittr monad

Posted on Categories Coding, Computer Science, Expository Writing, Programming, TutorialsTags , , , , 1 Comment on Some programming language theory in R

Some programming language theory in R

Let’s take a break from statistics and data science to think a bit about programming language theory, and how the theory relates to the programming language used in the R analysis platform (the language is technically called “S”, but we are going to just call the whole analysis system “R”).

Our reasoning is: if you want to work as a modern data scientist you have to program (this is not optional for reasons of documentation, sharing and scientific repeatability). If you do program you are going to have to eventually think a bit about programming theory (hopefully not too early in your studies, but it will happen). Let’s use R’s powerful programming language (and implementation) to dive into some deep issues in programming language theory:

  • References versus values
  • Function abstraction
  • Equational reasoning
  • Recursion
  • Substitution and evaluation
  • Fixed point theory

To do this we will translate some common ideas from a theory called “the lambda calculus” into R (where we can actually execute them). This translation largely involves changing the word “lambda” to “function” and introducing some parenthesis (which I think greatly improve readability, part of the mystery of the lambda calculus is how unreadable its preferred notation actually is).

Opus hyp
Recursive Opus (on a Hyperbolic disk)
Continue reading Some programming language theory in R

Posted on Categories Administrativia, Computer Science, data science, Exciting Techniques, Statistics, UncategorizedTags , , , ,

Our Differential Privacy Mini-series

We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we’ve tried to touch on the highlights of the papers, and to play around with variations of our own.

Blurry snowflakes stock by cosmicgallifrey d3inho1

  • A Simpler Explanation of Differential Privacy: Quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in Science (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, “The reusable holdout: Preserving validity in adaptive data analysis”, Science, vol 349, no. 6248, pp. 636-638, August 2015).

    Note that Cynthia Dwork is one of the inventors of differential privacy, originally used in the analysis of sensitive information.

  • Using differential privacy to reuse training data: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling.
  • A simple differentially private-ish procedure: The bootstrap as an alternative to Laplace noise to introduce privacy.

Our R code and experiments are available on Github here, so you can try some experiments and variations yourself.

Image Credit

Posted on Categories Computer Science, Exciting Techniques, Opinion, Programming, RantsTags , 2 Comments on Thumbs up for Anaconda

Thumbs up for Anaconda

One of the things I like about R is: because it is not used for systems programming you can expect to install your own current version of R without interference from some system version of R that is deliberately being held back at some older version (for reasons of script compatibility). R is conveniently distributed as a single package (with automated install of additional libraries).

Want to do some data analysis? Install R, load your data, and go. You don’t expect to spend hours on system administration just to get back to your task.

Python, being a popular general purpose language does not have this advantage, but thanks to Anaconda from Continuum Analytics you can skip (or at least delegate) a lot of the system environment imposed pain. With Anaconda trying out Python packages (Jupyter, scikit-learn, pandas, numpy, sympy, cvxopt, bokeh, and more) becomes safe and pleasant. Continue reading Thumbs up for Anaconda

Posted on Categories Computer Science, Opinion, RantsTags , , , , , 2 Comments on Text encoding is a convoluted mess

Text encoding is a convoluted mess

Modern text encoding is a convoluted mess where costs can easily exceed benefits. I admit we are in a world that has moved beyond ASCII (which at best served only English, and even then without full punctuation). But modern text encoding standards (utf-x, Unicode) have metastasized to the point you spend more time working around them than benefiting from them.

ASCII Code Chart-Quick ref card” by Namazu-tron – See above description. Licensed under Public Domain via Wikimedia Commons
Continue reading Text encoding is a convoluted mess

Posted on Categories Computer Science, MathematicsTags , , 2 Comments on Erdős writing like a programmer

Erdős writing like a programmer

Here is a neat example of famous mathematician Pál Erdős (often rendered in English as Paul Erdős) writing like a programmer in 1961. He goes to some trouble to introduce notation that allows him to index everything from zero. Continue reading Erdős writing like a programmer