Posted on 4 Comments on Factors are not first-class citizens in R

## Factors are not first-class citizens in R

The primary user-facing data types in the R statistical computing environment behave as vectors. That is: one dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so-on) but the most common data types are vectors. In fact vectors are so common in R that scalar values such as the number `5` are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not a R vector. In fact “factor” is not a first-class citizen in R, which can lead to some ugly bugs.

For example, consider the following R code.

``` levels <- c('a','b','c') f <- factor(c('c','a','a',NA,'b','a'),levels=levels) print(f) ## [1] c a a <NA> b a ## Levels: a b c print(class(f)) ## [1] "factor" ```

This example encoding a series of 6 observations into a known set of factor-levels (`'a'`, `'b'`, and `'c'`). As is the case with real data some of the positions might be missing/invalid values such as `NA`. One of the strengths of R is we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level `'a'` was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.

``` fRevised <- ifelse(is.na(f),'a',f) print(fRevised) ## [1] "3" "1" "1" "a" "2" "1" print(class(fRevised)) ## [1] "character" ```

Notice the new column `fRevised` is an absolute mess (and not even of class/type factor). This sort of fix would have worked if `f` had been a vector of characters or even a vector of integers, but for factors we get gibberish.

We are going to work through some more examples of this problem. Continue reading Factors are not first-class citizens in R

Posted on Tags , 3 Comments on R style tip: prefer functions that return data frames

## R style tip: prefer functions that return data frames

While following up on Nina Zumel’s excellent Trimming the Fat from glm() Models in R I got to thinking about code style in R. And I realized: you can make your code much prettier by designing more of your functions to return `data.frame`s. That may seem needlessly heavy-weight, but it has a lot of down-stream advantages. Continue reading R style tip: prefer functions that return data frames

Posted on Tags , , 10 Comments on Trimming the Fat from glm() Models in R

## Trimming the Fat from glm() Models in R

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, `glm` models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.

As many R users know (but often forget), a `glm` model object carries a copy of its training data by default. You can use the settings `y=FALSE` and `model=FALSE` to turn this off.

```set.seed(2325235)

# Set up a synthetic classification problem of a given size
# and two variables: one numeric, one categorical
# (two levels).
synthFrame = function(nrows) {
d = data.frame(xN=rnorm(nrows),
xC=sample(c('a','b'),size=nrows,replace=TRUE))
d\$y = (d\$xN + ifelse(d\$xC=='a',0.2,-0.2) + rnorm(nrows))>0.5
d
}

# first show that model=F and y=F help reduce model size

dTrain = synthFrame(1000)
y=FALSE)
y=FALSE, model=FALSE)

#
# Estimate the object's size as the size of its serialization
#
length(serialize(model1, NULL))
# [1] 225251
length(serialize(model2, NULL))
# [1] 206341
length(serialize(model3, NULL))
# [1] 189562

dTest = synthFrame(100)
p1 = predict(model1, newdata=dTest, type='response')
p2 = predict(model2, newdata=dTest, type='response')
p3 = predict(model3, newdata=dTest, type='response')
sum(abs(p1-p2))
# [1] 0
sum(abs(p1-p3))
# [1] 0

```
Posted on Categories Coding, Expository Writing, Programming, Tutorials9 Comments on You don’t need to understand pointers to program using R

## You don’t need to understand pointers to program using R

R is a statistical analysis package based on writing short scripts or programs (versus being based on GUIs like spreadsheets or directed workflow editors). I say “writing short scripts” because R’s programming language (itself called S) is a bit of an oddity that you really wouldn’t be using except it gives you access to superior analytics data structures (R’s data.frame and treatment of missing values) and deep ready to go statistical libraries. For longer pure programming tasks you are better off using something else (be it Python, Ruby, Java, C++, Javascript, Go, ML, Julia, or something else). However, the S language has one feature that makes it pleasant to learn (despite any warts): it can be initially used and taught without having the worry about the semantics of references or pointers. Continue reading You don’t need to understand pointers to program using R

Posted on Categories Coding, Computer Science, Programming, Rants, Tutorials

## Unit tests as penance

It recently hit me that I see unit tests as a form of penance (in addition to being a great tool for specification and test driven development). If you fix a bug and don’t add a unit test I suspect you are not actually sorry. Continue reading Unit tests as penance

Posted on Categories Coding, Programming, Tutorials2 Comments on Resolving git “pseudo conflicts”

## Resolving git “pseudo conflicts”

I strongly advise using version control, and usually recommend using git as your version control system. Usually I feel a bit guilty about this advice as git is so general that it is more of a toolkit for a version control system than a complete proscriptive version control system (the missing pieces being the selection and documentation of a workflow and conventions).

But I still feel git is the one to use. My requirements involve not writing dot files in every single directory (breaks some OSX tools, and both CVS and Subversion do this), being able to work disconnected (eliminates Perforce), being cross-platform, being actively maintained, and being able to easily change decision such as where the gold standard repository lives (or even changing your mind on collaborating or not). This makes me lean towards BZR, git and Mecurial. Git is the most popular one of the bunch and has the most popular repository aggregator: GitHub.

For beginners I teach treating git like old-school RCS or SCCS: just use git to maintain versions of your local files. Don’t worry about using it to share or distribute files (but do make sure to back-up you directory in some way). To use git in this way you only need to run three commands regularly: “git status,” “git add,” and “git commit” (see Minimal Version Control Lesson: Use It). Roughly status shows you what is going on and add/commit pairs checkpoint your work. To work in this way you don’t need to know anything about branching (version control nerds’ favorite confusing topic), merging and so on. The idea is that as long as you are running add/commit pairs often enough any other problem you run into can be solved (though it make take an hour of searching books and Stack Overflow to find the answer). Git’s user interface is horrible (in part) because “everything is possible,” but that also means you can (with difficulty) solve just about any problem you run into with git (except, it seems, nested or dependent repositories).

However eventually you want to work with a collaborator or distribute your results to a client. To do that effectively with git you need to start using additional commands such as “git pull,” “git rebase,” and “git push.” Things seem more confusing at this point (though you still do not yet need to worry about branching in its full generality), but are in fact far less confusing and far less error prone than ad-hoc solutions such as emailing zip files. I almost always advise sharing work in “star workflow” where each worker has their own repository and a single common “naked” repository (that is a repository with only git data structures and no ready to use files) is used to coordinate (thought of as a server or gold standard, often named “origin”). This is treating git as if it were just a better CVS or SVN (the difference being if you want to perform a truly distributed step like pushing code to a collaborator without using the main server, you can and git will actually help with the record keeping). The central repository can be GitHub, GitLab or even a directory on a machine with ssh access. A lot of ink is spilled on how such a workflow doesn’t feel like a “distributed workflow,” but it is (you can work when disconnected from the central repository, and if the central repository is lost any up to date worker can provision a new central repository).

To get familiar with git I recommend a good book such as Jon Loeliger and Matthew McCullough’s “Version Control with Git” 2nd Edition, O’Reilly 2012. Or, better yet, work with people who know git. In all cases you need to keep notes, git issues are often solved by sequences of of three to five esoteric commands. Even after working with git for some time I still run into major “hair pullers.” One of these major “hair pullers” I run into is what I call “pseudo conflicts” and is what I am going to describe in this article. Continue reading Resolving git “pseudo conflicts”

Posted on Categories Mathematics, ProgrammingTags , , 20 Comments on Prefer = for assignment in R

## Prefer = for assignment in R

We share our opinion that `=` should be preferred to the more standard `<-` for assignment in R. This is from a draft of the appendix of our upcoming book. This has the risk of becoming an R version of Javascript’s semicolon controversy, but here you have it. Continue reading Prefer = for assignment in R

Posted on Categories Programming, Statistics1 Comment on Checking claims in published statistics papers

## Checking claims in published statistics papers

When finishing Worry about correctness and repeatability, not p-values I got to thinking a bit more about what can you actually check when reading a paper, especially when you don’t have access to the raw data. Some of the fellow scientists I admire most have a knack for back of the envelope calculations and dimensional analysis style calculations. They could always read a few facts off a presentation that the presenter may not have meant to share. There is a joy in calculation and figuring, so I decided it would be a fun challenge to see if you could check any of the claims of “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et. al. BMJ 2008;337:a439 from just the summary tables supplied in the paper itself. Continue reading Checking claims in published statistics papers

Posted on Categories Coding, data science, Opinion, Programming, Rants, StatisticsTags , , 8 Comments on Please stop using Excel-like formats to exchange data

## Please stop using Excel-like formats to exchange data

I know “officially” data scientists all always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain.

I would like to make a plea to my fellow data scientists to stop using Excel-like formats for informal data exchange and become much stricter in producing and insisting on truly open machine readable files. Open files are those in an open format (not proprietary like Microsoft Excel) and machine readable in this case means readable by a very simple program (preferring simple escaping strategies to complicated quoting strategies). A lot of commonly preferred formats surprisingly do not meet these conditions: for example Microsoft Excel, XML and quoted CSV all fail the test. A few formats that do meet these conditions: SQL dumps, JSON and what I call “strong TSV.” I will illustrate some of the difficulty in using ad-hoc formats in R and suggest work-arounds. Continue reading Please stop using Excel-like formats to exchange data

Posted on

## Yet Another Java Linear Programming Library

From time to time we work on projects that would benefit from a free lightweight pure Java linear programming library. That is a library unencumbered by a bad license, available cheaply, without an infinite amount of file format and interop cruft and available in Java (without binary blobs and JNI linkages). There are a few such libraries, but none have repeatably, efficiently and reliably met our needs. So we have re-packaged an older one of our own for release under the Apache 2.0 license. This code will have its own rough edges (not having been used widely in production), but I still feel fills an important gap. This article is brief introduction to our WVLPSolver Java library. Continue reading Yet Another Java Linear Programming Library