Posted on Categories Coding, data science, Opinion, Programming, Rants, StatisticsTags , , 8 Comments on Please stop using Excel-like formats to exchange data

Please stop using Excel-like formats to exchange data

I know “officially” data scientists all always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain.

I would like to make a plea to my fellow data scientists to stop using Excel-like formats for informal data exchange and become much stricter in producing and insisting on truly open machine readable files. Open files are those in an open format (not proprietary like Microsoft Excel) and machine readable in this case means readable by a very simple program (preferring simple escaping strategies to complicated quoting strategies). A lot of commonly preferred formats surprisingly do not meet these conditions: for example Microsoft Excel, XML and quoted CSV all fail the test. A few formats that do meet these conditions: SQL dumps, JSON and what I call “strong TSV.” I will illustrate some of the difficulty in using ad-hoc formats in R and suggest work-arounds. Continue reading Please stop using Excel-like formats to exchange data

Posted on Categories Coding, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , 3 Comments on Error Handling in R

Error Handling in R

It’s often the case that I want to write an R script that loops over multiple datasets, or different subsets of a large dataset, running the same procedure over them: generating plots, or fitting a model, perhaps. I set the script running and turn to another task, only to come back later and find the loop has crashed partway through, on an unanticipated error. Here’s a toy example:

> inputs = list(1, 2, 4, -5, 'oops', 0, 10)

> for(input in inputs) {
+   print(paste("log of", input, "=", log(input)))
+ }

[1] "log of 1 = 0"
[1] "log of 2 = 0.693147180559945"
[1] "log of 4 = 1.38629436111989"
[1] "log of -5 = NaN"
Error in log(input) : Non-numeric argument to mathematical function
In addition: Warning message:
In log(input) : NaNs produced

The loop handled the negative arguments more or less gracefully (depending on how you feel about NaN), but crashed on the non-numeric argument, and didn’t finish the list of inputs.

How are we going to handle this?

Continue reading Error Handling in R

Posted on Categories data science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , 4 Comments on Level fit summaries can be tricky in R

Level fit summaries can be tricky in R

Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them. Continue reading Level fit summaries can be tricky in R

Posted on Categories Mathematics, StatisticsTags , , , , 1 Comment on Newton-Raphson can compute an average

Newton-Raphson can compute an average

In our article How robust is logistic regression? we pointed out some basic yet deep limitations of the traditional full-step Newton-Raphson or Iteratively Reweighted Least Squares methods of solving logistic regression problems (such as in R‘s standard glm() implementation). In fact in the comments we exhibit a well posed data fitting problem that can not be fit using the traditional methods starting at the traditional (0,0) start point. And we cited an example where the traditional methods fail to compute the average from a non-zero start. The question remained: can we prove the standard methods always compute the average correctly if started at zero? It turns out they can, and the proof isn’t as messy as I anticipated. Continue reading Newton-Raphson can compute an average

Posted on Categories data science, Expository Writing, Mathematics, Statistics, TutorialsTags , , , , , , 6 Comments on How robust is logistic regression?

How robust is logistic regression?

Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for all inputs, not in the technical statistical sense of immune to deviations from assumptions or outliers.)

Even a detailed reference such as “Categorical Data Analysis” (Alan Agresti, Wiley, 1990) leaves off with an empirical observation: “the convergence … for the Newton-Raphson method is usually fast” (chapter 4, section 4.7.3, page 117). This is a book that if there is a known proof that the estimation step is a contraction (one very strong guarantee of convergence) you would expect to see the proof reproduced. I always suspected there was some kind of Brouwer fixed-point theorem based folk-theorem proving absolute convergence of the Newton-Raphson method in for the special case of logistic regression. This can not be the case as the Newton-Raphson method can diverge even on trivial full-rank well-posed logistic regression problems. Continue reading How robust is logistic regression?

Posted on Categories data science, Expository Writing, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , 1 Comment on What does a generalized linear model do?

What does a generalized linear model do?

What does a generalized linear model do? R supplies a modeling function called glm() that fits generalized linear models (abbreviated as GLMs). A natural question is what does it do and what problem is it solving for you? We work some examples and place generalized linear models in context with other techniques. Continue reading What does a generalized linear model do?

Posted on Categories data science, Expository Writing, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , 4 Comments on Modeling Trick: Impact Coding of Categorical Variables with Many Levels

Modeling Trick: Impact Coding of Categorical Variables with Many Levels

One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — namely, the model’s explicit estimates of variables’ explanatory value, and explicit insight into and control of variable to variable dependence.

Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.

Continue reading Modeling Trick: Impact Coding of Categorical Variables with Many Levels

Posted on Categories data science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , ,

Modeling Trick: Masked Variables

A primary problem data scientists face again and again is: how to properly adapt or treat variables so they are best possible components of a regression. Some analysts at this point delegate control to a shape choosing system like neural nets. I feel such a choice gives up far too much statistical rigor, transparency and control without real benefit in exchange. There are other, better, ways to solve the reshaping problem. A good rigorous way to treat variables are to try to find stabilizing transforms, introduce splines (parametric or non-parametric) or use generalized additive models. A practical or pragmatic approach we advise to get some of the piecewise reshaping power of splines or generalized additive models is: a modeling trick we call “masked variables.” This article works a quick example using masked variables. Continue reading Modeling Trick: Masked Variables

Posted on Categories Mathematics, Opinion, TutorialsTags , , , 4 Comments on How to outrun a crashing alien spaceship

How to outrun a crashing alien spaceship

Hollywood movies are obsessed with outrunning explosions and outrunning crashing alien spaceships. For explosions the movies give the optimal (but unusable) solution: run straight away. For crashing alien spaceships they give the same advice, but in this case it is wrong. We demonstrate the correct angle to flee.

PrometheusRun

Running from a crashing alien spaceship, Prometheus 2012, copyright 20th Century Fox
Continue reading How to outrun a crashing alien spaceship

Posted on Categories Pragmatic Machine Learning, Rants, Statistics, TutorialsTags , , , , , , 3 Comments on Selection in R

Selection in R

The design of the statistical programming language R sits in a slightly uncomfortable place between the functional programming and object oriented paradigms. The upside is you get a lot of the expressive power of both programming paradigms. A downside of this is: the not always useful variability of the language’s list and object extraction operators.

Towards the end of our write-up Survive R we recommended using explicit environments with new.env(hash=TRUE,parent=emptyenv()), assign() and get() to simulate mutable string-keyed maps for storing results. This advice rose out of frustration with the apparent inconsistency with the user facing R list operators. In this article we bite the bullet and discuss the R list operators a bit more clearly. Continue reading Selection in R