
R in a 64 bit world

32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?


We discuss this in this article, the first of a new series of articles discussing aspects of “R as it is” that we are publishing with cooperation from Revolution Analytics.

Currently R’s only integer data type is a 32 bit signed integer. Such integers can only count up to about 2 billion. This range is in fact ridiculously small and unworkable. Some examples:

What can we do about this in R?

Note we are not talking about big-data issues (which are addressed in R with tools like ff, data.table, dplyr, databases, streaming algorithms, Hadoop, see Revolutions big data articles for some tools and ideas). Fortunately, when working with data you really want to do things like aggregate, select, partition, and compute derived columns. With the right primitives you can (and should) perform all of these efficiently without ever referring to explicit row indices.

What we are talking about is the possible embarrassment of:

  • Not being able to represent 64 bit IDs as something other than strings.
  • Not being able to represent a 65536 by 65536 matrix in R (because R matrices are actually views over single vectors).
  • Not being able to index into 3 billion doubles (or about $300 worth of memory).
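To make the sizes in the list above concrete, here is a small sketch (the variable names are ours, chosen only for illustration) showing that the element count of a 65536 by 65536 matrix does not fit in a 32 bit integer, and roughly how much memory 3 billion doubles occupy:

```r
# Element count of a 65536 x 65536 matrix, computed two ways.
n <- 65536L
cells_as_int <- suppressWarnings(n * n)   # integer multiply overflows to NA
cells_as_dbl <- as.numeric(n)^2           # same count held in a double

is.na(cells_as_int)                       # the integer product overflowed
cells_as_dbl > .Machine$integer.max       # count exceeds the largest 32 bit integer

# Memory needed for 3 billion doubles at 8 bytes each, in GiB.
gib_for_3e9_doubles <- 3e9 * 8 / 2^30
```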

Now R actually dodged these bullets, but without introducing a proper 64 bit integer type. Let’s discuss how.

First, a lot of people think R has a 64 bit integer type. This is because R’s notation for integer constants, “L”, looks like Java’s notation for longs (64 bit), though it probably derives from C’s notation for long (“at least 32 bits”). But feast your eyes on the following R code:

 

3000000000L
## [1] 3e+09
Warning message:
non-integer value 3000000000L qualified with L; using numeric value 

 

Yes, “L” only means integer in R.
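A quick way to check this for yourself is to ask R for the storage type and the largest integer it can hold. This is only a sketch; the name `overflow` is ours:

```r
typeof(1L)                 # "integer": the L suffix asks for R's 32 bit integer
typeof(1)                  # "double": plain numeric constants are doubles

.Machine$integer.max       # 2147483647, that is 2^31 - 1

# Integer arithmetic that leaves the 32 bit range yields NA (with a warning).
overflow <- suppressWarnings(.Machine$integer.max + 1L)
is.na(overflow)            # TRUE

# Adding a double instead silently promotes the result to double.
typeof(.Machine$integer.max + 1)   # "double"
```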

What the R designers did to delay paying the price of having only 32 bit integers was to allow doubles to be used as array indices (and as the return value for length())! Take a look at the following R code:

 

c(1,2,3)[1.7]
## [1] 1

 

It looks like this was one of the big changes in moving from R 2.15.3 to R 3.0.0 in 2013. However, it feels dirty. In a more perfect world the above code would throw an error. This puts R in league with languages that force everything to be represented in way too few base-types (JavaScript, Tcl, and Perl). IEEE 754 doubles carry 53 bits of mantissa precision (52 stored bits plus an implicit leading bit, separate from the sign and exponent), so with a proper floating point implementation we expect a double can faithfully represent every integer in the range -2^53 through 2^53. But only as long as you don’t accidentally convert to or round-trip through a string/character type.
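The 2^53 boundary and the string hazard can both be seen in a few lines. This is a sketch under one loud assumption: we use `sprintf("%.15g", ...)` to stand in for any code path that writes a number with only 15 significant digits. It is an illustration, not a claim about which R functions do so.

```r
limit <- 2^53
limit - 1 == limit         # FALSE: integers below 2^53 are distinct doubles
limit + 1 == limit         # TRUE: 2^53 + 1 is not representable and rounds back

# A 16 digit ID that is exactly representable as a double.
id <- 2^53 - 1

# Assumption: a writer that keeps only 15 significant digits.
as_text <- sprintf("%.15g", id)
back <- as.numeric(as_text)
back == id                 # FALSE: the last digit was lost in the text form

# Writing every digit preserves the value.
full_text <- sprintf("%.0f", id)
as.numeric(full_text) == id   # TRUE
```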

One of the issues is that underlying C and Fortran code (often used to implement R packages/libraries) are not going to be able to easily use longs as indices. However, I still would much prefer the introduction of a proper 64 bit integer type.

Of course Java is in a much worse position going forward than R. Because of Java’s static type signatures any class that implements the Collection interface is stuck with “int size()” pretty much forever (this includes ArrayList, Vector, List, Set, and many more). In much better shape is Python, which has been working on unifying ints and longs since 2001 (PEP 237) and has a single arbitrary-precision integer type in Python 3 (just a matter of moving people from Python 2 to Python 3).

Enough about sizes and indexing; let’s talk a bit about representation. What should we do if we try to import data and we know one of the columns is 64 bit integers (assuming we are lucky enough to detect this and the column doesn’t get converted in a non-reversible way to doubles)?

R has always been a bit “loosey-goosey” with ints. That is why you see weird stuff in summary:

 

summary(55555L)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  55560   55560   55560   55560   55560   55560 

 

Here we are told that the integer 55555 is in the range 55560 to 55560 (in hindsight we can see R is printing as if the data were floating point and then, adding insult to injury, it is not signaling its use of four significant figures by having the decency to switch into scientific notation). This is also why I don’t really trust R numerics to reliably represent integers like 2^50: some function accidentally round-trips your value through a string representation (such as reading/writing data to a CSV/TSV table) and you may not get back the value you started with (for the worst possible reason: you never wrote the correct value out).
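The 55560 in that display is what you get by rounding 55555 to four significant figures, which can be reproduced directly with `signif()`. This sketch only illustrates the rounding; it does not claim to be how `summary()` is implemented:

```r
x <- 55555L

# Rounding to four significant figures reproduces the displayed value.
rounded <- signif(x, 4)
rounded == 55560           # TRUE

# The stored integer itself is untouched.
format(x)                  # "55555"
```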

In fact it becomes a bit of a bother to even check if a floating point number represents an integer. Some standard advice is “check if floor(a)==ceiling(a).” Which works until it fails:

 

a <- 2^50
b <- 2^50 + 0.5
c <- 5^25+1.5
for(v in c(a,b,c)) {
  print(floor(v)==ceiling(v))
}
## [1] TRUE
## [1] FALSE
## [1] TRUE

 

What went wrong for “c” is that “c” really is an integer; it just isn’t the number we typed in (due to the use of floating point). The true value is seen as follows:

 

# Let's try the "doubles are good enough" path
# 5^25 < 2^60, so would fit in a 64 bit integer
format(5L^25L,digits=20)
## [1] "298023223876953152"
# doesn't even end in a 5 as powers of 5 should
format(5^25+1.5,digits=20)
## [1] "298023223876953152"
# and can't see our attempt at addition

# let's get the right answer
library('Rmpfr')
mpfr(5,precBits=200)^25
## [1] 298023223876953125
# ends in 5 as we expect
mpfr(5,precBits=200)^25 + 1.5
## [1] 298023223876953126.5
# obviously not an integer!

 

Something like the following is probably pretty close to the correct test function:

 

is.int <- function(v) {
  is.numeric(v) & 
    v>-2^53 & v<2^53 & 
    (floor(v)==ceiling(v))
}

 

But that (or whatever IEEE math library function actually does this) is hard to feel very good about. The point is we should not have to study What Every Computer Scientist Should Know About Floating-Point Arithmetic when merely trying to index into arrays. However, every data scientist should read this paper to understand some of the issues of general numeric representation and manipulation!
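For what it is worth, here is the is.int() sketch from above applied to the three earlier values plus two small ones (the function body is repeated so the snippet runs on its own; `vals` and `res` are our names):

```r
is.int <- function(v) {
  is.numeric(v) &
    v > -2^53 & v < 2^53 &
    (floor(v) == ceiling(v))
}

vals <- c(2^50, 2^50 + 0.5, 5^25 + 1.5, 3, -7.25)
res <- is.int(vals)
res
# 5^25 + 1.5 is now rejected: it lies outside (-2^53, 2^53), where a double
# can no longer be trusted to hold the integer we meant.
```

Note the design choice: values at or beyond 2^53 are reported as "not an integer" even when the stored double happens to be a whole number, because we cannot tell whether it is the whole number the user intended.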

What are our faithful representation options in R?

  • Force to strings (and pray they don’t try to convert to factors).
  • Try to use doubles (this is what happens if you don’t know about the column, and will irreversibly mangle IDs).
  • Try a package like Google’s int64 package (kicked off CRAN in 2012 for lack of maintenance).
  • Try a bigint package such as gmp or a special math package such as Rmpfr.

Our advice is to first try representing 64 bit integers as strings. For functions like read.table() this means setting colClasses to "character" for the appropriate columns (as.is alone only stops strings from becoming factors; a numeric-looking column is still converted to doubles), and not converting a column back to string after it has already been damaged by the reader.
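A minimal sketch of the difference, using an in-memory CSV with a made-up column name `id` and a made-up 16 digit ID (both are our assumptions, purely for illustration):

```r
# One ID that a double cannot hold exactly (2^53 + 1).
csv_text <- "id,value\n9007199254740993,1.5\n"

# Ask for the id column as character: every digit survives.
as_strings <- read.csv(text = csv_text, colClasses = c(id = "character"))
as_strings$id

# Default import: the column becomes a double and the ID is altered.
as_default <- read.csv(text = csv_text)
sprintf("%.0f", as_default$id)

# as.is = TRUE does not help here; the column is still numeric.
as_asis <- read.csv(text = csv_text, as.is = TRUE)
class(as_asis$id)
```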

And this is our first taste of “R as it is.”

(Thank you to Joseph Rickert and Nina Zumel for helpful comments on earlier drafts of this article.)

4 thoughts on “R in a 64 bit world”

    1. Yes, but it’s also highly risky. Since it’s built on top of doubles, if at any point the class attribute is lost (or the internal C code simply doesn’t care) you will silently get incorrect results.

  1. The need for long integers is not uncommon in working with common datasets. For example, a manufacturer in my field (oceanography) creates instruments that measure time in milliseconds since 1970, storing the value in sqlite3 format. Getting these values into R with the DBI package is not straightforward.

    I’m not arguing that updating R will be an easy task, and am adding this note mainly to add to the list of application topics that would benefit from long ints in R.

    1. Dan, that is a really good example. Reliable succinct time representations tend to use longs as their underlying implementation. R’s POSIXlt is a big struct with a lot of fields, and POSIXct is a double (try typeof(unclass(Sys.time())) to confirm).
