## Error Handling in R

It’s often the case that I want to write an R script that loops over multiple datasets, or different subsets of a large dataset, running the same procedure over them: generating plots, or fitting a model, perhaps. I set the script running and turn to another task, only to come back later and find the loop has crashed partway through, on an unanticipated error. Here’s a toy example:

```
> inputs = list(1, 2, 4, -5, 'oops', 0, 10)
> for(input in inputs) {
+   print(paste("log of", input, "=", log(input)))
+ }
[1] "log of 1 = 0"
[1] "log of 2 = 0.693147180559945"
[1] "log of 4 = 1.38629436111989"
[1] "log of -5 = NaN"
Error in log(input) : Non-numeric argument to mathematical function
In addition: Warning message:
In log(input) : NaNs produced
```

The loop handled the negative argument more or less gracefully (depending on how you feel about NaN), but crashed on the non-numeric argument, and didn’t finish the list of inputs.

How are we going to handle this?
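One common way to keep such a loop alive is base R's `tryCatch()`. The sketch below is only an illustration: the helper name `safe_log` and the choice to return `NA` on any warning or error are assumptions of this example, not something taken from the excerpt above.

```r
# Hypothetical helper (not from the original post): returns log(x),
# or NA_real_ when log() signals a warning or an error.
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {
      # Reached for inputs such as -5 ("NaNs produced")
      message("warning for input ", x, ": ", conditionMessage(w))
      NA_real_
    },
    error = function(e) {
      # Reached for inputs such as "oops" (non-numeric argument)
      message("error for input ", x, ": ", conditionMessage(e))
      NA_real_
    }
  )
}

inputs <- list(1, 2, 4, -5, "oops", 0, 10)

# Every input is processed; problem inputs show up as NA in the result.
results <- vapply(inputs, safe_log, numeric(1))
print(results)
```

Treating warnings the same as errors is a deliberate simplification here; a script that wants the `NaN` result and only a log of the warning could use `withCallingHandlers()` instead.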

## Rudie can’t fail (if majorized)

We have been writing for a while about the convergence of Newton steps applied to a logistic regression (see: What does a generalized linear model do?, How robust is logistic regression? and Newton-Raphson can compute an average). This is all based on our principle of working examples for understanding. This eventually progressed to some writing on the nature of problem solving (a nice complement to our earlier writing on calculation). In the course of research we were directed to a very powerful technique called the MM algorithm (see: “The MM Algorithm” Kenneth Lange, 2007; “A Tutorial on MM Algorithms”, David R. Hunter, Kenneth Lange, Amer. Statistician 58:30–37, 2004; and “Monotonicity of Quadratic-Approximation Algorithms”, Dankmar Bohning, Bruce G. Lindsay, Ann. Inst. Statist. Math, Vol. 40, No. 4, pp 641-664, 1988). The MM algorithm introduces an essential idea: majorized functions (not to be confused with the majorized order on R^d). Majorization is an interesting way to modify Newton methods to be reliable contractions (and therefore converge in a manner similar to EM algorithms).
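For readers new to the term, the standard definition of majorization can be stated in a few lines. The notation here (an objective f to be minimized, a surrogate g, and iterates x_k) is chosen for this sketch and is not taken from the article itself.

```latex
% g majorizes f at the current iterate x_k when it lies above f everywhere
% and touches f at x_k:
g(x \mid x_k) \ge f(x) \quad \text{for all } x,
\qquad
g(x_k \mid x_k) = f(x_k).

% The MM update minimizes the surrogate instead of f:
x_{k+1} = \arg\min_x \; g(x \mid x_k).

% The objective then cannot increase from one step to the next:
f(x_{k+1}) \le g(x_{k+1} \mid x_k) \le g(x_k \mid x_k) = f(x_k).
```

For a maximization problem (such as a log-likelihood) the inequalities flip and the surrogate is said to minorize the objective.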

Here we will work an example of the MM method. We will not work it in its most general form, but in a form that quickly reveals much of the beauty of the method. We also introduce a “collared Newton step” which guarantees convergence without resorting to line-search (essentially resolving the issues in solving a logistic regression by Newton style methods). Continue reading Rudie can’t fail (if majorized)

## Level fit summaries can be tricky in R

Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with it. Continue reading Level fit summaries can be tricky in R
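As a hedged illustration of one way per-level summaries can mislead (an assumption about the kind of issue meant, since the excerpt does not spell it out): the rows reported for a factor depend on which level R uses as the reference, even though the fitted model is the same. The data, seed, and variable names below are all made up for this sketch.

```r
set.seed(2012)  # arbitrary seed; the data below is synthetic

# Three-level factor with group means 0, 0.1 and 1
d <- data.frame(g = factor(rep(c("a", "b", "c"), each = 20)))
d$y <- c(a = 0, b = 0.1, c = 1)[as.character(d$g)] + rnorm(nrow(d))

m_a <- lm(y ~ g, data = d)        # reference level "a" (the default)
d$g2 <- relevel(d$g, ref = "c")
m_c <- lm(y ~ g2, data = d)       # same model, reference level "c"

# The two fits make identical predictions ...
same_fit <- all.equal(unname(fitted(m_a)), unname(fitted(m_c)))
print(same_fit)

# ... but the per-level rows (and their p-values) describe different contrasts.
print(coef(summary(m_a)))
print(coef(summary(m_c)))

# A single test for the factor as a whole does not depend on the reference level.
print(anova(m_a))
```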

## I am done with 32 bit machines

I am going to come out and say it: I am emotionally done with 32 bit machines and operating systems. My sympathy for them is at an end.

I know that ARM is still 32 bit, but in that case you get something big back in exchange: the ability to deploy on smartphones and tablets. For PCs and servers, 32 bit addressing’s time is long past, yet we still have to code for and regularly run into these machines and operating systems. The time/space savings of 32 bit representations are nothing compared to the loss of capability in sticking with that architecture and the wasted effort in coding around it. My work is largely data analysis in a server environment, and it is just getting ridiculous to not be able to always assume at least a 64 bit machine. Continue reading I am done with 32 bit machines

## On Being a Data Scientist

When people asked me what it means to be a data scientist, I used to answer, “it means you don’t have to hold my hand.” By which I meant that as a data scientist (a consulting data scientist), I can handle the data collection, the data cleaning and wrangling, the analysis, and the final presentation of results (both technical and for the business audience) with a minimal amount of assistance from my clients or their people. Not *no* assistance, of course, but little enough that I’m not interfering too much with their day-to-day job.

This used to be a key selling point, because people with all the necessary skills used to be relatively rare. This is less true now; data science is a hot new career track. Training courses and academic tracks are popping up all over the place. So there is the question: what should such courses teach? Or more to the heart of the question — what does a data scientist do, and what do they need to know?

## On Writing Technical Articles for the Nonspecialist

*This was originally posted at ninazumel.com. I’m re-blogging it here.*

I came across a post from Emily Willingham the other day: “Is a PhD required for Good Science Writing?”. As a science writer with a science PhD, her answer is: it is not required, and it can often be an impediment. I saw a similar sentiment echoed once by Lee Gutkind, the founder and editor of the journal *Creative Nonfiction*. I don’t remember exactly what he wrote, but it was something to the effect that scientists are exactly the wrong people to produce literary, accessible writing about matters scientific.

I don’t agree with Gutkind’s point, but I can see where it comes from. Academic writing has a reputation for being deliberately obscure, prolix, and jargonistic. Very few people read journal papers for fun (well, except me, but I’m weird). On the other hand, a science writer with a PhD has been trained for critical thinking, and should have a nose for bullpucky, even outside their field of expertise. This can come in handy when writing about medical research or controversial new scientific findings. Any scientist — any person — is going to hype up their work. It’s the writer’s job to see through that hype.

I’m not a science writer in the sense that Dr. Willingham is. I write statistics and data science articles (blog posts) for non-statisticians. Generally, the audience that I write for is professionally interested in the topic, but its members aren’t necessarily experts at it. And as a writer, many of my concerns are the same as those of a popular science writer.

I want to cut through the bullpucky. I want you, the reader, to come away understanding something you thought you didn’t — or even couldn’t — understand. I want you, the analyst or data science practitioner, to understand your tools well enough to innovate, not just use them blindly. And if I’m writing about one of my innovations, I want you to understand it well enough to possibly use it, not just be awed at my supposed brilliance.

I don’t do these things perfectly; but in the process of trying, and of reading other writers with similar objectives, I’ve figured out a few things.

Continue reading On Writing Technical Articles for the Nonspecialist

## The Mathematician’s Dilemma

A recent run of too many articles on the same topic (exhibits: A, B and C) puts me in a position where I feel the need to explain my motivation. Which itself becomes yet another article related to the original topic. The explanation I offer is: this is the way mathematicians think. To us mathematicians the tension is that there are far too many observable patterns in the world to be attributed to mere chance. So our dilemma is: for which patterns/regularities should we derive some underlying law, and which ones are not worth worrying about? Or which conjectures should we try to work all the way to proof or counter-example? Continue reading The Mathematician’s Dilemma

## Newton-Raphson can compute an average

In our article How robust is logistic regression? we pointed out some basic yet deep limitations of the traditional full-step Newton-Raphson or Iteratively Reweighted Least Squares methods of solving logistic regression problems (such as in R’s standard glm() implementation). In fact, in the comments we exhibit a well-posed data fitting problem that cannot be fit using the traditional methods starting at the traditional (0,0) start point. And we cited an example where the traditional methods fail to compute the average from a non-zero start. The question remained: can we prove the standard methods always compute the average correctly if started at zero? It turns out we can, and the proof isn’t as messy as I anticipated. Continue reading Newton-Raphson can compute an average
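To make the "compute an average" setting concrete, here is a minimal sketch of full Newton steps on an intercept-only logistic regression, started at zero. The function name `newton_intercept`, the fixed step count, and the toy outcome vector are assumptions of this example; it illustrates the claim on one dataset and is not the proof.

```r
# Hypothetical helper: full-step Newton-Raphson for the model
# logit(P(y = 1)) = b, i.e. a logistic regression with only an intercept.
newton_intercept <- function(y, b = 0, steps = 25) {
  for (i in seq_len(steps)) {
    p <- 1 / (1 + exp(-b))             # current predicted probability
    grad <- sum(y - p)                 # derivative of the log-likelihood in b
    hess <- -length(y) * p * (1 - p)   # second derivative (always negative)
    b <- b - grad / hess               # full Newton step, no line search
  }
  b
}

y <- c(1, 0, 0, 1, 1, 0, 1, 1)         # toy outcomes with mean 5/8
b_hat <- newton_intercept(y)

# The fitted probability should match the plain average of y.
print(c(fitted = plogis(b_hat), average = mean(y)))
```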

## How robust is logistic regression?

Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for all inputs, not in the technical statistical sense of immune to deviations from assumptions or outliers.)

Even a detailed reference such as “Categorical Data Analysis” (Alan Agresti, Wiley, 1990) leaves off with an empirical observation: “the convergence … for the Newton-Raphson method is usually fast” (chapter 4, section 4.7.3, page 117). This is a book in which, if there were a known proof that the estimation step is a contraction (one very strong guarantee of convergence), you would expect to see the proof reproduced. I always suspected there was some kind of Brouwer fixed-point theorem based folk-theorem proving absolute convergence of the Newton-Raphson method for the special case of logistic regression. This cannot be the case, as the Newton-Raphson method can diverge even on trivial full-rank well-posed logistic regression problems. Continue reading How robust is logistic regression?