sigr is a simple
R package that conveniently formats a few statistics and their significance tests. This allows the analyst to use the correct test no matter what modeling package or procedure they use.
replyr is an
R package that contains extensions, adaptions, and work-arounds to make remote
dplyr data sources (including big data systems such as
Spark) behave more like local data. This allows the analyst to more easily develop and debug procedures that simultaneously work on a variety of data services (in-memory
Spark2 currently being the primary supported platforms).
I recently read an interesting thread on unexpected behavior in
R when creating a list of functions in a loop or iteration. The issue is solved, but I am going to take the liberty to try and re-state and slow down the discussion of the problem (and fix) for clarity.
The issue is: are references or values captured during iteration?
Many users expect values to be captured. Most programming language implementations capture variables or references (leading to strange aliasing issues). It is confusing (especially in R, which pushes so far in the direction of value oriented semantics) and best demonstrated with concrete examples.
Please read on for a some of the history and future of this issue. Continue reading Iteration and closures in R
I am going to write about an insidious statistical, data analysis, and presentation fallacy I call “the zero bug” and the habits you need to cultivate to avoid it.
The zero bug
Here is the zero bug in a nutshell: common data aggregation tools often can not “count to zero” from examples, and this causes problems. Please read on for what this means, the consequences, and how to avoid the problem. Continue reading The Zero Bug
Recently Dirk Eddelbuettel pointed out that our
R function debugging wrappers would be more convenient if they were available in a low-dependency micro package dedicated to little else. Dirk is a very smart person, and like most
R users we are deeply in his debt; so we (Nina Zumel and myself) listened and immediately moved the wrappers into a new micro-package:
Image: Friedensreich Hundertwasser
One of the distinctive features of the
R platform is how explicit and user controllable everything is. This allows the style of use of
R to evolve fairly rapidly. I will discuss this and end with some new notations, methods, and tools I am nominating for inclusion into your view of the evolving “current best practice style” of working with
R. Continue reading Evolving R Tools and Practices
Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of
replyr::let makes such programming easier.
Archie’s Mechanics #2 (1954) copyright Archie Publications
(edit: great news! CRAN just accepted our
replyr 0.2.0 fix release!)
Please read on for examples comparing standard notations and
replyr::let. Continue reading Comparative examples using replyr::let
It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns
process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation,
replyr::gapply, from our latest package,
Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.
The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.