Fast Arc Cos algorithm?


Solution 1

A simple cubic approximation, the Lagrange polynomial for x ∈ {-1, -½, 0, ½, 1}, is:

double acos(double x) {
   return (-0.69813170079773212 * x * x - 0.87266462599716477) * x + 1.5707963267948966;
}

It has a maximum error of about 0.18 rad.
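For a quick sense of where that error lives, here is a small test harness (a sketch; acos_cubic is just a name for the function above). Since x = ±1, ±½ and 0 are the interpolation nodes, the cubic is exact there and the error peaks between nodes:

#include <math.h>
#include <stdio.h>

// The cubic from above, renamed so it doesn't clash with math.h's acos.
static double acos_cubic(double x) {
    return (-0.69813170079773212 * x * x - 0.87266462599716477) * x
           + 1.5707963267948966;
}

int main(void) {
    // Sample the error on [-1, 1]; it vanishes at the interpolation nodes.
    for (double x = -1.0; x <= 1.0; x += 0.125) {
        printf("x=%6.3f  cubic=%9.6f  math.h=%9.6f  err=%+9.6f\n",
               x, acos_cubic(x), acos(x), acos_cubic(x) - acos(x));
    }
    return 0;
}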

Solution 2

Got spare memory? A lookup table (with interpolation, if required) will be the fastest option.
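For illustration, here is a minimal sketch of such a table in C (the 1024-interval size, the helper names, and the linear interpolation are my own choices, not prescribed by this answer):

#include <math.h>

// Sketch of a lookup-table acos with linear interpolation.
#define ACOS_LUT_SIZE 1024
static float acos_lut[ACOS_LUT_SIZE + 1];

// Fill the table once at startup: entry i holds acos of x = -1 + 2*i/N.
void acos_lut_init(void) {
    for (int i = 0; i <= ACOS_LUT_SIZE; i++) {
        float x = -1.0f + 2.0f * (float)i / ACOS_LUT_SIZE;
        acos_lut[i] = acosf(x);
    }
}

// Map x in [-1, 1] onto the table and interpolate between neighbours.
float acos_lut_lookup(float x) {
    float t = (x + 1.0f) * (ACOS_LUT_SIZE / 2.0f);
    int i = (int)t;
    if (i < 0) return acos_lut[0];
    if (i >= ACOS_LUT_SIZE) return acos_lut[ACOS_LUT_SIZE];
    float frac = t - (float)i;
    return acos_lut[i] + frac * (acos_lut[i + 1] - acos_lut[i]);
}

That is 4 KB of table here; as the comments below note, the float-to-int conversion is cheap on modern x86, but a large table can lose to cache misses, so size it to your workload.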

Solution 3

nVidia has some great resources showing how to approximate otherwise very expensive math functions, such as acos, asin, and atan2.

These algorithms produce good results when speed of execution is more important (within reason) than precision. Here's their acos function:

// Absolute error <= 6.7e-5
// Cg-style reference code: float(x < 0) yields 1.0 when x < 0, else 0.0.
// In plain C, use (x < 0) ? 1.0f : 0.0f, and fabsf/sqrtf for abs/sqrt.
float acos(float x) {
  float negate = float(x < 0);
  x = abs(x);
  float ret = -0.0187293;
  ret = ret * x;
  ret = ret + 0.0742610;
  ret = ret * x;
  ret = ret - 0.2121144;
  ret = ret * x;
  ret = ret + 1.5707288;
  ret = ret * sqrt(1.0-x);
  ret = ret - 2 * negate * ret;
  return negate * 3.14159265358979 + ret;
}

And here are the results when calculating acos(0.5):

nVidia:   result: 1.0471513828611643
math.h:   result: 1.0471975511965976

That's pretty close! Depending on your required degree of precision, this might be a good option for you.

Solution 4

I have my own. It's pretty accurate and sort of fast. It works off of a theorem I built around quartic convergence. It's really interesting, and you can see the equation and how fast it can make my natural log approximation converge here: https://www.desmos.com/calculator/yb04qt8jx4

Here's my arccos code:

function acos(x)
    -- each line below is a linear seed followed by one Newton step:
    local a=1.43+0.59*x a=(a+(2+2*x)/a)/2   -- a ~ sqrt(2+2x)
    local b=1.65-1.41*x b=(b+(2-2*x)/b)/2   -- b ~ sqrt(2-2x)
    local c=0.88-0.77*x c=(c+(2-a)/c)/2     -- c ~ sqrt(2-a)
    -- extrapolation (8c - b)/3, with one extra Newton step folded into each root
    return (8*(c+(2-a)/c)-(b+(2-2*x)/b))/6
end

A lot of that is just square root approximation, and it works really well unless you get too close to taking the square root of 0. It has an average error of 0.0003 (excluding x = 0.99 to 1). The problem is that accuracy degrades quickly around x = 0.99, and by x = 1 the error grows to 0.05. You could fix that by doing more iterations on the square roots (too slow to be worth it), or by switching to a different set of square root linearizations when x > 0.99, but that makes the code long and ugly.
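Each statement of the form a=(a+S/a)/2 above is one Babylonian (Newton) step for √S, seeded by a cheap linear guess in x. The building block, written out in C for clarity (a sketch, name mine):

// One Babylonian (Newton) iteration for sqrt(s): refine guess g to (g + s/g)/2.
// Solution 4 seeds g with a linear fit, e.g. g = 1.43 + 0.59*x for s = 2 + 2*x,
// which is close enough for one or two steps to do the job away from s = 0.
static float sqrt_newton_step(float g, float s) {
    return 0.5f * (g + s / g);
}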

If you don't care about accuracy so much, you could do just one iteration per square root, which should still keep the error somewhere around 0.0162:

function acos(x)
    local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
    local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
    local c=0.88-0.77*x c=(c+(2-a)/c)/2
    return 8/3*c-b/3
end

If you're okay with it, you can use pre-existing square root code. That gets rid of the equation going crazy at x = 1:

function acos(x)
    local a = math.sqrt(2+2*x)
    local b = math.sqrt(2-2*x)
    local c = math.sqrt(2-a)
    return 8/3*c-b/3
end

Frankly, though, if you're really pressed for time, remember that you could linearize arccos into 1.57079 - 1.57079x and just do:

function acos(x)
    return 1.57079-1.57079*x
end

Anyway, if you want to see a list of my arccos approximation equations, you can go to https://www.desmos.com/calculator/tcaty2sv8l. I know my approximations aren't the best for certain things, but if you're doing something where they would be useful, please use them, and try to give me credit.

Solution 5

You can approximate the inverse cosine with a polynomial as suggested by dan04, but a polynomial is a pretty bad approximation near -1 and 1 where the derivative of the inverse cosine goes to infinity. When you increase the degree of the polynomial you hit diminishing returns quickly, and it is still hard to get a good approximation around the endpoints. A rational function (the quotient of two polynomials) can give a much better approximation in this case.

acos(x) ≈ π/2 + (ax + bx³) / (1 + cx² + dx⁴)

where

a = -0.939115566365855
b =  0.9217841528914573
c = -1.2845906244690837
d =  0.295624144969963174

has a maximum absolute error of 0.017 radians (0.96 degrees) on the interval (-1, 1). Here is a plot (the inverse cosine in black, cubic polynomial approximation in red, the above function in blue) for comparison:

[Plot of acos(x) (black), the cubic polynomial approximation (red), and the rational approximation above (blue)]
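A direct C translation of the formula might look like this (a sketch; the function name and the use of single precision are my choices):

// Rational approximation: acos(x) ≈ π/2 + (a·x + b·x³) / (1 + c·x² + d·x⁴),
// maximum absolute error about 0.017 rad on (-1, 1).
float acos_rational(float x) {
    float x2  = x * x;
    float num = x * (-0.939115566365855f + 0.9217841528914573f * x2);
    float den = 1.0f + x2 * (-1.2845906244690837f + 0.295624144969963174f * x2);
    return 1.5707963267948966f + num / den;
}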

The coefficients above have been chosen to minimise the maximum absolute error over the entire domain. If you are willing to allow a larger error at the endpoints, the error on the interval (-0.98, 0.98) can be made much smaller. A numerator of degree 5 and a denominator of degree 2 is about as fast as the above function, but slightly less accurate. At the expense of performance you can increase accuracy by using higher degree polynomials.

A note about performance: computing the two polynomials is still very cheap, and you can use fused multiply-add instructions. The division is not so bad, because you can use the hardware reciprocal approximation and a multiply. The error in the reciprocal approximation is negligible in comparison with the error in the acos approximation. On a 2.6 GHz Skylake i7, this approximation can do about 8 inverse cosines every 6 cycles using AVX. (That is throughput, the latency is longer than 6 cycles.)
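As a sketch of what that looks like with intrinsics (assuming a target with AVX2 and FMA; _mm256_rcp_ps is the hardware reciprocal approximation mentioned above):

#include <immintrin.h>

// Vectorized sketch of the rational approximation (8 floats per call).
// _mm256_rcp_ps has roughly 2^-12 relative error, negligible next to
// the 0.017 rad error of the approximation itself.
__m256 acos_rational_avx(__m256 x) {
    const __m256 a       = _mm256_set1_ps(-0.939115566365855f);
    const __m256 b       = _mm256_set1_ps( 0.9217841528914573f);
    const __m256 c       = _mm256_set1_ps(-1.2845906244690837f);
    const __m256 d       = _mm256_set1_ps( 0.295624144969963174f);
    const __m256 one     = _mm256_set1_ps(1.0f);
    const __m256 half_pi = _mm256_set1_ps(1.5707963267948966f);

    __m256 x2  = _mm256_mul_ps(x, x);
    __m256 num = _mm256_mul_ps(x, _mm256_fmadd_ps(b, x2, a));          // a·x + b·x³
    __m256 den = _mm256_fmadd_ps(_mm256_fmadd_ps(d, x2, c), x2, one);  // 1 + c·x² + d·x⁴
    return _mm256_fmadd_ps(num, _mm256_rcp_ps(den), half_pi);          // π/2 + num/den
}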



Comments

  • jmasterx
    jmasterx almost 2 years

    I have my own, very fast cos function:

    #include <math.h>  // for fabsf; 'pi' and 'abs' were assumed defined in the original

    static const float pi = 3.14159265f;

    float sine(float x)
    {
        // parabolic approximation of sine on [-pi, pi]
        const float B = 4 / pi;
        const float C = -4 / (pi * pi);

        float y = B * x + C * x * fabsf(x);

        // extra-precision step: y = Q*y + P*y*|y|, with Q = 0.775
        const float P = 0.225;

        y = P * (y * fabsf(y) - y) + y;

        return y;
    }

    float cosine(float x)
    {
        return sine(x + (pi / 2));
    }
    

    But now when I profile, I see that acos() is killing the processor. I don't need intense precision. What is a fast way to calculate acos(x)? Thanks.

    • Damon
      Damon almost 12 years
      Your very fast function has a mean error of 16% in [-pi,pi] and is entirely unusable outside that interval. The standard sinf from math.h on my system takes only about 2.5x as much time as your approximation. Considering your function is inlined and the lib call is not, that's really not much of a difference. My guess is if you added range reduction so it was usable in the same way as the standard function, you would have exactly the same speed.
    • jcwenger
      jcwenger over 11 years
      No, the maximum error is 0.001 (1/10th %). Did you forget to apply the correction? (y = P * bla...) Look at the original source and discussion here: devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine Second, sin and cos pre-bounded by +-pi is a VERY common case, especially in graphics and simulation, both of which often require a fast approximate sin/cos.
    • JosephDoggie
      JosephDoggie almost 5 years
      This is a really intriguing problem, thanks for asking!
  • jmasterx
    jmasterx almost 14 years
    How could I implement this as a C function?
  • Ben Voigt
    Ben Voigt almost 14 years
    @Jex: bounds-check your argument (it must be between -1 and 1). Then multiply by a nice power of 2, say 64, yielding the range (-64, 64). Add 64 to make it non-negative (0, 128). Use the integer part to index a lookup table, if desired use the fractional part for interpolation between the two closest entries. If you don't want interpolation, try adding 64.5 and take the floor, this is the same as round-to-nearest.
  • phkahler
    phkahler almost 14 years
    Lookup tables require an index, which is going to require a float to int conversion, which will probably kill performance.
  • Timo Kähkönen
    Timo Kähkönen about 11 years
    Maximum error is 10.31 degrees. Rather big, but it may be enough in some cases where computational speed is more important than precision. Maybe a quartic approximation would produce more precision and still be faster than native acos?
  • miho
    miho about 11 years
    Are you sure there isn't a mistake in this formula? I just tried it with Wolfram Alpha and it doesn't look right: wolframalpha.com/input/?i=y%3D%282%2F9*pixx-5*pi%2F18%29*x%2Bpi%2F2
  • FocusedWolf
    FocusedWolf almost 9 years
    Is this what you meant for the last one? 1.57079-1.57079*x.
  • FocusedWolf
    FocusedWolf almost 9 years
    Also for anyone using C#, this might be a good first line: if (x < -1D || x > 1D || Double.IsNaN(x)) return Double.NaN; to be consistent with the .NET Framework acos function: msdn.microsoft.com/en-us/library/…
  • Peter Cordes
    Peter Cordes about 8 years
    @phkahler: float-to-int conversion is very cheap on x86, almost as cheap as an FP add, as you can see in Agner Fog's latency/throughput/uop tables. Range-checking the index to make sure it doesn't index outside the table is probably about as expensive. int idx = x * 4096.0 will have ~ 9 cycle latency on Intel Haswell. By far the most expensive part of this is the cache miss from a decent-sized table. If there isn't a bunch of parallel computation that doesn't depend on the acos result, a large table will probably be slower (esp. with cache competition).
  • josch
    josch over 7 years
    In contrast to what the comment in the "reference implementation" on the nvidia site says, the absolute error is not <= 6.7e-5; I was able to observe an error of 6.759167e-05. Also, how sure are you that this function is actually faster than just plain acos? On a 4th generation core i5 (Haswell) the nvidia function was consistently 25% slower than acos from math.h.
  • ideasman42
    ideasman42 almost 7 years
    Any reason to use 3.14159265358979 instead of math.pi ?
  • Fnord
    Fnord almost 7 years
    @ideasman42 Python was irrelevant to my answer. My goal was just to point to nVidia's docs for approximation methods as a resource. So I edited my answer for clarity. But to answer your question: my guess is that these numbers were chosen to work well together. So using math.pi might not seem to make much of a difference in most cases until you hit the edge case where the error threshold would worsen.
  • ideasman42
    ideasman42 almost 7 years
    The reason for asking is there is a very small difference leading to the question: When porting this to any other language - should the constant for pi be used? Or is this an intentionally slightly modified pi, tuned to work better with the approximation? (can be tested without too much trouble of course)
  • mingzhao.pro
    mingzhao.pro almost 7 years
    What does float(x < 0) mean? How can I translate it to Scala?
  • Fnord
    Fnord almost 7 years
    @mingzhao.pro it converts the true/false result of the x<0 expression into a float. So if x<0 is true it converts it to 1.0; if false, 0.0.
  • Steve
    Steve over 6 years
    This is way off at x = -1, like by 0.5rad. Not usable.
  • Steve
    Steve over 6 years
    Your first implementation is way off at x = -1, like by 0.5rad.
  • Steve
    Steve over 6 years
    This approximation is within 10e-5 of the built-in acos() in a desktop opengl implementation I tried. Quite good! But it's also basically the same speed as the built-in acos() on a low end Android device I tested on.
  • Iris Technologies
    Iris Technologies over 3 years
    I doubt that the FPU will be faster than SSE instructions. Also, it's unusable for x64 targets, since MSVC won't allow inline asm blocks for those targets.
  • Gokul
    Gokul almost 3 years
    Is there a source for these coefficients?
  • Ruud
    Ruud almost 3 years
    @Gokul they are computed by this script: github.com/ruuda/convector/blob/…