Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them. Continue reading Level fit summaries can be tricky in R
When people ask me what it means to be a data scientist, I used to answer, “it means you don’t have to hold my hand.” By which I meant that as a data scientist (a consulting data scientist), I can handle the data collection, the data cleaning and wrangling, the analysis, and the final presentation of results (both technical and for the business audience) with a minimal amount of assistance from my clients or their people. Not no assistance, of course, but little enough that I’m not interfering too much with their day-to-day job.
This used to be a key selling point, because people with all the necessary skills used to be relatively rare. This is less true now; data science is a hot new career track. Training courses and academic tracks are popping up all over the place. So there is the question: what should such courses teach? Or more to the heart of the question — what does a data scientist do, and what do they need to know?
This was originally posted at ninazumel.com. I’m re-blogging it here.
I came across a post from Emily Willingham the other day: “Is a PhD required for Good Science Writing?”. As a science writer with a science PhD, her answer is: is it not required, and it can often be an impediment. I saw a similar sentiment echoed once by Lee Gutkind, the founder and editor of the journal Creative Nonfiction. I don’t remember exactly what he wrote, but it was something to the effect that scientists are exactly the wrong people to produce literary, accessible writing about matters scientific.
I don’t agree with Gutkind’s point, but I can see where it comes from. Academic writing has a reputation for being deliberately obscure and prolix, jargonistic. Very few people read journal papers for fun (well, except me, but I’m weird). On the other hand, a science writer with a PhD has been trained for critical thinking, and should have a nose for bullpucky, even outside their field of expertise. This can come in handy when writing about medical research or controversial new scientific findings. Any scientist — any person — is going to hype up their work. It’s the writer’s job to see through that hype.
I’m not a science writer in the sense that Dr. Willingham is. I write statistics and data science articles (blog posts) for non-statisticians. Generally, the audience that I write for is professionally interested in the topic, but aren’t necessarily experts at it. And as a writer, many of my concerns are the same as those of a popular science writer.
I want to cut through the bullpucky. I want you, the reader, to come away understanding something you thought you didn’t — or even couldn’t — understand. I want you, the analyst or data science practitioner, to understand your tools well enough to innovate, not just use them blindly. And if I’m writing about one of my innovations, I want you to understand it well enough to possibly use it, not just be awed at my supposed brilliance.
I don’t do these things perfectly; but in the process of trying, and of reading other writers with similar objectives, I’ve figured out a few things.
In our article How robust is logistic regression? we pointed out some basic yet deep limitations of the traditional full-step Newton-Raphson or Iteratively Reweighted Least Squares methods of solving logistic regression problems (such as in R‘s standard glm() implementation). In fact in the comments we exhibit a well posed data fitting problem that can not be fit using the traditional methods starting at the traditional (0,0) start point. And we cited an example where the traditional methods fail to compute the average from a non-zero start. The question remained: can we prove the standard methods always compute the average correctly if started at zero? It turns out they can, and the proof isn’t as messy as I anticipated. Continue reading Newton-Raphson can compute an average
Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for all inputs, not in the technical statistical sense of immune to deviations from assumptions or outliers.)
Even a detailed reference such as “Categorical Data Analysis” (Alan Agresti, Wiley, 1990) leaves off with an empirical observation: “the convergence … for the Newton-Raphson method is usually fast” (chapter 4, section 4.7.3, page 117). This is a book that if there is a known proof that the estimation step is a contraction (one very strong guarantee of convergence) you would expect to see the proof reproduced. I always suspected there was some kind of Brouwer fixed-point theorem based folk-theorem proving absolute convergence of the Newton-Raphson method in for the special case of logistic regression. This can not be the case as the Newton-Raphson method can diverge even on trivial full-rank well-posed logistic regression problems. Continue reading How robust is logistic regression?
What does a generalized linear model do? R supplies a modeling function called
glm() that fits generalized linear models (abbreviated as GLMs). A natural question is what does it do and what problem is it solving for you? We work some examples and place generalized linear models in context with other techniques. Continue reading What does a generalized linear model do?
Dr. Nina Zumel recently published an excellent tutorial on a modeling technique she called impact coding. It is a pragmatic machine learning technique that has helped with more than one client project. Impact coding is a bridge from Naive Bayes (where each variable’s impact is added without regard to the known effects of any other variable) to Logistic Regression (where dependencies between variables and levels is completely accounted). A natural question is can pick up more of the positive features of each model? Continue reading A bit more on impact coding
One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — namely, the model’s explicit estimates of variables’ explanatory value, and explicit insight into and control of variable to variable dependence.
Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.
A primary problem data scientists face again and again is: how to properly adapt or treat variables so they are best possible components of a regression. Some analysts at this point delegate control to a shape choosing system like neural nets. I feel such a choice gives up far too much statistical rigor, transparency and control without real benefit in exchange. There are other, better, ways to solve the reshaping problem. A good rigorous way to treat variables are to try to find stabilizing transforms, introduce splines (parametric or non-parametric) or use generalized additive models. A practical or pragmatic approach we advise to get some of the piecewise reshaping power of splines or generalized additive models is: a modeling trick we call “masked variables.” This article works a quick example using masked variables. Continue reading Modeling Trick: Masked Variables
The design of the statistical programming language R sits in a slightly uncomfortable place between the functional programming and object oriented paradigms. The upside is you get a lot of the expressive power of both programming paradigms. A downside of this is: the not always useful variability of the language’s list and object extraction operators.
Towards the end of our write-up Survive R we recommended using explicit environments with
get() to simulate mutable string-keyed maps for storing results. This advice rose out of frustration with the apparent inconsistency with the user facing R list operators. In this article we bite the bullet and discuss the R list operators a bit more clearly. Continue reading Selection in R