Category Archives: Public Service Article

Excel spreadsheets are hard to get right

Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft Excel spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using Excel-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of gdata‘s underlying Perl script to work around a bug).

But one thing that continues to confound us is how hard it is to read Excel data correctly. When Excel exports into CSV/TSV style formats it uses fairly clever escaping rules about quotes and new-lines. Most CSV/TSV readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is Excel itself often transforms data without any user verification or control. For example: Excel routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable Excel transform: changing the strings “TRUE” and “FALSE” into 1 and 0 inside the actual “.xlsx” file. That is Excel does not faithfully store the strings “TRUE” and “FALSE” even in its native format. Most Excel users do not know about this, so they certainly are in no position to warn you about it.

This would be a mere annoyance, except it turns out Libre Office (or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.

We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug. Continue reading

On Writing Technical Articles for the Nonspecialist

This was originally posted at ninazumel.com. I’m re-blogging it here.


WatchPhoto: John Mount

I came across a post from Emily Willingham the other day: “Is a PhD required for Good Science Writing?”. As a science writer with a science PhD, her answer is: is it not required, and it can often be an impediment. I saw a similar sentiment echoed once by Lee Gutkind, the founder and editor of the journal Creative Nonfiction. I don’t remember exactly what he wrote, but it was something to the effect that scientists are exactly the wrong people to produce literary, accessible writing about matters scientific.

I don’t agree with Gutkind’s point, but I can see where it comes from. Academic writing has a reputation for being deliberately obscure and prolix, jargonistic. Very few people read journal papers for fun (well, except me, but I’m weird). On the other hand, a science writer with a PhD has been trained for critical thinking, and should have a nose for bullpucky, even outside their field of expertise. This can come in handy when writing about medical research or controversial new scientific findings. Any scientist — any person — is going to hype up their work. It’s the writer’s job to see through that hype.

I’m not a science writer in the sense that Dr. Willingham is. I write statistics and data science articles (blog posts) for non-statisticians. Generally, the audience that I write for is professionally interested in the topic, but aren’t necessarily experts at it. And as a writer, many of my concerns are the same as those of a popular science writer.

I want to cut through the bullpucky. I want you, the reader, to come away understanding something you thought you didn’t — or even couldn’t — understand. I want you, the analyst or data science practitioner, to understand your tools well enough to innovate, not just use them blindly. And if I’m writing about one of my innovations, I want you to understand it well enough to possibly use it, not just be awed at my supposed brilliance.

I don’t do these things perfectly; but in the process of trying, and of reading other writers with similar objectives, I’ve figured out a few things.

Continue reading

Minimal Version Control Lesson: Use It

There is no excuse for a digital creative person to not use some sort of version control or source control. In the past disk space was too dear, version control systems were too expensive and software was not powerful enough; this is no longer the case. Unless your work is worthless both back it up and version control it. We will demonstrate a minimal set of version control commands that will one day save your bacon. Continue reading

Public Service Article: Back Up

This is a public service article encouraging all of us to back up our data (which more and more is our lives). I sketch some methods and resources for doing this.

As more of our life becomes digital (work, finances, passwords, pictures, contacts,dairies,videos and email) we must be more diligent in backing up our data. If your hard drive fails at work you might lose some spreadsheets (and you might not lose anything if your IT department is on their toes) if you computer fails at home you lose your wedding album. Your hard disk will fail and try to take all of your data (life) with it- it is a matter of when not a matter of if. You want this to be an inconvenience, not a disaster. Become expert at backing up and take the time to help others.
Continue reading