What is integer overflow in R and how can it happen?

Solution 1

You can answer many of your questions by reading the help page ?integer. It says:

R uses 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9.

Expanding to larger integers is under consideration by R Core but it's not going to happen in the near future.

If you want a "bignum" capacity then install Martin Maechler's Rmpfr package. I recommend 'Rmpfr' because of its author's reputation: Martin Maechler is also heavily involved in the development of the Matrix package and is a member of R Core. Alternatives include the arithmetic packages 'gmp' and 'Brobdingnag', and 'Ryacas' (which also offers a symbolic math interface).
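As a rough illustration of what such a package buys you, here is a minimal sketch using 'gmp', one of the alternatives named above. It assumes the package is installed; the variable names are made up for the example, and the guard simply skips the computation when 'gmp' is unavailable.

```r
# Sketch only: exact big-integer arithmetic with the 'gmp' package.
# Assumes 'gmp' is installed; otherwise the result is left as NA.
if (requireNamespace("gmp", quietly = TRUE)) {
  big <- gmp::as.bigz(.Machine$integer.max)  # arbitrary-precision integer
  exact <- big + 1L                          # no overflow, no rounding
  exact_text <- as.character(exact)          # "2147483648"
} else {
  exact_text <- NA_character_
}
exact_text
```

The same addition on plain integers would return NA with a warning.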

Next, to respond to the critical comments in the answer you linked to, and how to assess the relevance to your work, consider this: If there were the same statistical functionality available in one of those "modern" languages as there is in R, you would probably see a user migration in that direction. But I would say that migration, and certainly growth, is in the R direction at the moment. R was built by statisticians for statistics.

There was at one time a Lisp variant with a statistics package, Xlisp-Stat, but its main developer and proponent is now a member of R-Core. On the other hand, one of the earliest R developers, Ross Ihaka, suggests working toward development in a Lisp-like language. There is a compiled language called Clojure (pronounced as English speakers would say "closure") with an experimental interface to R, Rincanter.

Update:

Recent versions of R (3.0.0 and later) have 53-bit integers of a sort, using the mantissa of a double. When an element of an "integer" vector is assigned a value in excess of '.Machine$integer.max', the entire vector is coerced to "numeric", a.k.a. "double". The maximum value for the integer type remains as it was; however, there may be coercion of integer vectors to doubles to preserve accuracy in cases that would formerly generate overflow. Unfortunately, each matrix and array dimension is still capped at integer.max, although 64-bit builds of R 3.0.0 and later do support "long" vectors with more elements than that.
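A small sketch of the two points above, coercion on assignment and the 53-bit limit of doubles, using only base R (the vector x is just an illustration):

```r
x <- c(1L, 2L, 3L)
typeof(x)                    # "integer"

# Assigning a double value (here 2^31, which no integer can hold)
# converts the whole vector to double instead of overflowing.
x[2] <- 2^31
typeof(x)                    # "double"
x[2] == 2147483648           # TRUE

# Doubles hold every integer exactly up to 2^53; beyond that, gaps appear.
2^53 - 1 == 2^53             # FALSE
2^53 + 1 == 2^53             # TRUE
```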

When reading large values from files, it is probably safer to read them as character and convert them afterwards. If the conversion coerces any values to NA, there will be a warning.
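One way this could look in practice is sketched below. The CSV contents and column names are invented for the example, and the warning from as.integer is suppressed here only to keep the output quiet.

```r
# Hypothetical file contents; the second amount exceeds .Machine$integer.max.
csv_text <- "id,amount\n1,2147483647\n2,3000000000\n"

# Read every column as character so nothing is converted on the way in.
raw <- read.csv(text = csv_text, colClasses = "character")

# Converting to integer cannot represent the second value: NA plus a warning.
as_int <- suppressWarnings(as.integer(raw$amount))
is.na(as_int)                 # FALSE  TRUE

# Converting to double keeps both values.
as_dbl <- as.numeric(raw$amount)
as_dbl == c(2147483647, 3e9)  # TRUE TRUE
```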

Solution 2

In short, integer is an exact type with a limited range, and numeric is a floating-point type that can represent a much wider range of values but is inexact. See the help pages (?integer and ?numeric) for further details.
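The difference can be seen directly at the console. This is only a sketch in base R; the variable names are chosen for the example.

```r
typeof(5L)                   # "integer"
typeof(5)                    # "double"

# Floating-point arithmetic is inexact: 0.1 + 0.2 is not exactly 0.3.
exact_equal <- (0.1 + 0.2 == 0.3)                  # FALSE
near_equal  <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

# The ranges differ enormously.
.Machine$integer.max         # 2147483647
.Machine$double.xmax         # roughly 1.8e308
```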

As to the overflow, here is an explanation by Brian D. Ripley:

It means that you are taking the mean [in your case, the sum -- @aix] of some very large integers, and the calculation is overflowing. It is just a warning.

This will not happen in the next release of R.

You can specify that a number is an integer by giving it the suffix L, for example, 1L is the integer one, as opposed to 1 which is a floating point one, with class "numeric".

The largest integer that you can create on your machine is given by .Machine$integer.max.

> .Machine$integer.max
[1] 2147483647
> class(.Machine$integer.max)
[1] "integer"

Adding a positive integer to this causes an overflow, returning NA.

> .Machine$integer.max + 1L
[1] NA
Warning message:
In .Machine$integer.max + 1L : NAs produced by integer overflow
> class(.Machine$integer.max + 1L)
[1] "integer"

You can get round this limit by adding floating point values instead.

> .Machine$integer.max + 1
[1] 2147483648
> class(.Machine$integer.max + 1)
[1] "numeric"

Since in your case the warning is issued by sum, this indicates that the overflow happens when the numbers are added together. The suggested workaround sum(as.numeric(.)) should do the trick.
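A minimal reproduction of the warning and the workaround, with a two-element vector standing in for myvar (the warning is suppressed here only so the sketch runs quietly):

```r
x <- c(.Machine$integer.max, 1L)   # stand-in for the real data
typeof(x)                          # "integer"

# Summing as integers overflows: the result is NA, with a warning.
s_int <- suppressWarnings(sum(x))
is.na(s_int)                       # TRUE

# Converting to double first gives the expected total.
s_dbl <- sum(as.numeric(x))
s_dbl == 2147483648                # TRUE
```

The total is still exact here, since doubles represent whole numbers of this size without rounding.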

Solution 3

What's the max length of integer or numeric?

Vectors are currently indexed with an integer, so the max length is given by .Machine$integer.max. As DWin noted, all versions of R currently use 32-bit integers, so this will be 2^31 - 1, or a little over 2 billion.

Unless you are packing some serious hardware (or you are reading this in the future; hello from 2012) you won't have enough memory to allocate vectors that long.
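To put a number on "serious hardware", here is a back-of-the-envelope estimate, assuming 4 bytes per integer and 8 bytes per double and ignoring the small per-vector overhead:

```r
n <- .Machine$integer.max      # 2^31 - 1 elements

# Multiply by doubles (4, 8) so the byte counts themselves do not overflow.
gib_int <- 4 * n / 2^30        # integer vector: just under 8 GiB
gib_dbl <- 8 * n / 2^30        # double vector: just under 16 GiB

round(c(integer = gib_int, double = gib_dbl))
```

Note that writing 4L * n instead would itself overflow and return NA.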

I remember a discussion where R-core (Brian Ripley, I think) suggested that the next step could be to index vectors with the mantissa of doubles, or something clever like that, effectively giving 48-bits of index. Sadly, I can't find that discussion.


In addition to the Rmpfr package, if you are suffering integer overflow, you might want to try the int64 package.

Author by

Matt Bannert

Updated on November 15, 2020

Comments

  • Matt Bannert
    Matt Bannert over 3 years

    I have some calculation going on and get the following warning (i.e. not an error):

    Warning messages:
    1: In sum(myvar, na.rm = T) :
    Integer overflow - use sum(as.numeric(.))
    

    In this thread people state that integer overflows simply don't happen. Either R isn't overly modern or they are not right. However, what am I supposed to do here? If I use as.numeric as the warning suggests, I might not account for the fact that information was lost well before that point. myvar is read from a .csv file, so shouldn't R figure out that some bigger field is needed? Does it already cut off something?

    What's the max length of integer or numeric? Would you suggest any other field type / mode?

    EDIT: I run:

    R version 2.13.2 (2011-09-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) within R Studio

  • Matt Bannert
    Matt Bannert over 12 years
    ok, what if I want to have an exact calculation and have big numbers? Exactly, overflows are created when numbers are added. Can I have an exact result anyway?
  • James
    James over 12 years
    The gmp package may also be of interest
  • Richie Cotton
    Richie Cotton over 12 years
    I've fixed the description of what happens when you add numbers to the largest integer.
  • Dason
    Dason over 12 years
    ... but try this: class(sum(c(.Machine$integer.max, as.integer(1)))) for me I get an integer overflow (using 2.14).
  • Richie Cotton
    Richie Cotton over 12 years
    @Dason: Yup, as.integer(1) is the same as 1L so you don't get conversion to floating point.
  • skan
    skan over 7 years
    I'm doing a DT[,sapply(.SD,sum,na.rm=T)] with a data.table filled with 0,1 and NA, with 2 million rows. And I get the overflow message, but the maximum number generated should be less than 2 million. What could happen?
  • IRTFM
    IRTFM over 7 years
    I think you should post more information. Offhand I would guess that creating a matrix (as sapply would attempt to do when the default for 'simplify' is unchanged) would require multiplying the number of rows by the number of columns to get the length of the argument supplied to sum. That might be more than you expected.