Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft
Excel spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using
Excel-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of
Perl script to work around a bug).
But one thing that continues to confound us is how hard it is to read
Excel data correctly. When
Excel exports into
CSV/TSV style formats it uses fairly clever escaping rules about quotes and new-lines. Most
CSV/TSV readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is
Excel itself often transforms data without any user verification or control. For example:
Excel routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable
Excel transform: changing the strings “
TRUE” and “
FALSE” into 1 and 0 inside the actual “
.xlsx” file. That is
Excel does not faithfully store the strings “
TRUE” and “
FALSE” even in its native format. Most
Excel users do not know about this, so they certainly are in no position to warn you about it.
This would be a mere annoyance, except it turns out
Libre Office (or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.
We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug.
Excel spreadsheet was produced using Microsoft
Excel 2011 for OSX. We started a new sheet and typed in a few cells by hand. We formatted the header and the numeric column, but did not move off default settings for any of the
TRUE/FALSE cells. The spreadsheet looks like the following:
Excelspreadsheet (TRUE/FALSE typed in as text, no formatting commands on those cells).
You can also download the spreadsheet here.
Numbers can read the sheet correctly. We demonstrate this below.
Sheet looks okay in Apple Numbers.
Libre Office doesn’t reverse the encoding (as it may not know some details of
Excel‘s encoding practices) and also shows corrupted data as we see below.
Libre Office, and third row damaged.
In practice we have seen the data damage is pervasive and not limited to columns who’s original value was
FALSE. It may be a presentation problem as examining individual cells shows “
=TRUE()” and “
=FALSE()” as the contents of the affected cells (and apparently in the correct positions independent of what is being displayed).
Quick Look both also fail to understand the
Excel data encoding, as we show below.
Sheet damaged in Apple Preview (same for Apple Quick Look).
Our favorite analysis hammer (R) appears to read the data correctly (with only the undesired translation of
R appears to load what was stored correctly.
But what is going on? It turns out
.xlsx files are actually
zip archives storing a directory tree of
xml artificts. By changing the file extension from
.zip we can treat the spreadsheet as a
zip archive and inflate it to see the underlying files. The inflated file tree is shown below.
The file tree representing the
Excelworkbook on disk.
Of particular interest are the files
sheet1.xml contains the worksheet data and
sharedStrings.xml is a shared string table containing all strings used in the worksheet (the worksheet stores no user supplied strings, only indexes into the shared string table). Let’s look into
The XML representing the sheet data.
The sheet data is arranged into rows that contain columns. It is easy to match these rows and cells to our original spreadsheet. For cells containing uninterpreted strings the
<c> tag has has an attributed set to
t="s" (probably denoting type is “string” and to use the
<v> value as a string index). Notice floating point numbers are not treated as shared strings, but stored directly in the
<v> tag. Further notice that the last three columns are stored as
0/1 and have the attribute
t="b" set. My guess is this is declaring the type is “boolean” which then must have the convention that
This doesn’t seem that complicated, but clearly of all the “
Excel compatible” tools we tried only Apple
Numbers knew all of the details of this encoding (and was able to reverse it). Other than
gdata package was able to extract usable data (and even it only recovered the encoded version of the field, not the original user value).
And these are our issue with working with data that has passed through
Excelhas a lot of non-controllable data transforms including booleans, and dates (in fact mangling string fragments
Exceleven suspects could be made into dates). Some of these transforms are non-faithful or not reversible.
- Very few tools that claim to interoperate with
Excelactually get the corner cases right. Even for simple well-documented data types like
CSVexport. And definitely not for the native
These transforms and conventions make exporting data harder (and riskier) than it has to be. To add insult to injury you often run into projects that are sharing
.xlsx spreadsheets where neither the reader nor the writer is
Excel, so neither end is even good at working with the format. Because working with data that has passed through
Excel is hard to get right, data that has passed through
Excel is often wrong.
(Note: I definitely feel we do need to be thankful to open source and free software developers. These teams in addition to generously supplying software without charge are also working to preserve user freedoms and often the only way to read older data. However, when we are using software for work we do need it to work correctly and be faithful to data. This problem is small when you detect it, but large if hidden in a larger project.)