Modular arithmetic on the GPU

Solution 1

A high-end Fermi GPU (e.g. a GTX 580) will likely give you the best performance among shipping cards for this. You would want all 32-bit operands to be of type "unsigned int" for best performance, as there is some additional overhead for the handling of signed divisions and modulos.

The compiler generates very efficient code for division and modulo with a fixed divisor. As I recall, it is usually around three to five machine instructions on Fermi and Kepler. You can check the generated SASS (machine code) with cuobjdump --dump-sass. You might be able to use templated functions with constant divisors if you only use a few different divisors.
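The fixed-divisor code works by replacing the division with a multiply by a precomputed reciprocal. As a host-side C illustration of the general idea (not the exact instruction sequence nvcc emits; the helper name fastmod_u32 is mine), here is one well-known variant of the trick; it needs a compiler with 128-bit integer support, such as GCC or Clang:

```c
#include <stdint.h>

/* Reciprocal-multiplication modulo for a 32-bit divisor d:
   precompute M = ceil(2^64 / d); the low 64 bits of M * n then
   hold the scaled fractional part of n / d, and multiplying that
   by d and taking the high 64 bits recovers n % d.
   Valid for any n < 2^32 and d >= 1. */
static uint32_t fastmod_u32(uint32_t n, uint32_t d)
{
    uint64_t M = UINT64_MAX / d + 1;     /* M = ceil(2^64 / d) */
    uint64_t lowbits = M * (uint64_t)n;  /* wraps mod 2^64 on purpose */
    return (uint32_t)(((__uint128_t)lowbits * d) >> 64);
}
```

When d is a compile-time constant, M folds to a constant as well, leaving only multiplies; that is why fixed-divisor modulo is so much cheaper than a general division.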

For unsigned 32-bit division and modulo with a variable divisor, you should see on the order of sixteen inlined SASS instructions generated on both Fermi and Kepler. The code is limited by the throughput of integer multiplies; on Fermi-class GPUs it is competitive with hardware solutions. Performance is somewhat reduced on currently shipping Kepler-class GPUs due to their lower integer-multiply throughput.

[Added later, after clarification of the question:]

Unsigned 64-bit division and modulo with a variable divisor, on the other hand, are implemented as called subroutines of about 65 instructions on Fermi and Kepler. They look close to optimal. On Fermi, this is still reasonably competitive with hardware implementations (note that 64-bit integer division is not exactly fast even on CPUs that provide it as a built-in instruction). Below is some code that I posted to the NVIDIA forums some time back for the kind of task described in the clarification. It avoids the expensive division, but assumes that fairly large batches of operands share the same divisor. It uses double-precision arithmetic, which is especially fast on Tesla-class GPUs (as opposed to consumer cards). I only did a cursory test of the code, so you might want to test it more carefully before deploying it.

// Let b, p, and A[i] be integers < 2^51
// Let N be an integer on the order of 10000
// for i from 1 to N
// A[i] <-- A[i] * b mod p

/*---- kernel arguments ----*/
unsigned long long *A;
double b, p; /* convert from unsigned long long to double before passing to kernel */
double oop;  /* pass precomputed 1.0/p to kernel */

/*---- code inside kernel -----*/
double a, q, h, l, rem;
const double int_cvt_magic = 6755399441055744.0; /* 2^52+2^51 */

a = (double)A[i];

/* approximate quotient and round it to the nearest integer */
q = __fma_rn (a * b, oop, int_cvt_magic);
q = q - int_cvt_magic;

/* back-multiply, representing p*q as a double-double h:l exactly */
h = p * q;
l = __fma_rn (p, q, -h);

/* remainder is double-width product a*b minus double-double h:l */
rem = __fma_rn (a, b, -h);
rem = rem - l;

/* remainder may be negative as quotient rounded; fix if necessary */
if (rem < 0.0) rem += p;

A[i] = (unsigned long long)rem;

Solution 2

There are tricks to perform mod operations efficiently, but only if m is a power of two.

For instance, x mod y == x & (y - 1) when y is 2^n. A bitwise AND is about as fast as an operation gets.
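In plain C this looks as follows (the identity is the same in device code; mod_pow2 is just an illustrative name):

```c
#include <stdint.h>

/* For a power-of-two modulus y = 2^n, x % y equals x & (y - 1):
   the low n bits of x are exactly the remainder. */
static uint32_t mod_pow2(uint32_t x, uint32_t y) /* y must be 2^n */
{
    return x & (y - 1);
}
```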

Otherwise, perhaps a look-up table? Below is a link to a discussion of efficient modulo implementations. You might need to implement the approach yourself to get the most out of it.

Efficient computation of mod

Author by

athspk

Learning MapReduce in Hadoop :-)

Updated on September 20, 2022

Comments

  • athspk
    athspk over 1 year

    I am working on a GPU algorithm which is supposed to do a lot of modular computations. In particular, various operations on matrices in a finite field which in the long run reduce to primitive operations like: (a*b - c*d) mod m or (a*b + c) mod m, where a, b, c and d are residues modulo m and m is a 32-bit prime.

    Through experimentation I learned that the performance of the algorithm is mostly limited by slow modular arithmetic because integer modulo (%) and division operations are not supported on the GPU in hardware.

    I would appreciate it if somebody could give me an idea of how to realize efficient modular computations with CUDA.

    To see how this is implemented on CUDA, I use the following code snippet:

    __global__ void mod_kernel(unsigned *gout, const unsigned *gin) {
    
    unsigned tid = threadIdx.x;
    unsigned a = gin[tid], b = gin[tid * 2], m = gin[tid * 3];
    
    typedef unsigned long long u64;
    
    __syncthreads();
    unsigned r = (unsigned)(((u64)a * (u64)b) % m);
    __syncthreads();
    gout[tid] = r;
    }
    

    This code is not supposed to produce meaningful results; I just wanted to see how modular reduction is implemented on CUDA.

    When I disassemble this with cuobjdump --dump-sass (thanks njuffa for advice!), I see the following:

    /*0098*/     /*0xffffdc0450ee0000*/     BAR.RED.POPC RZ, RZ;
    /*00a0*/     /*0x1c315c4350000000*/     IMUL.U32.U32.HI R5, R3, R7;
    /*00a8*/     /*0x1c311c0350000000*/     IMUL.U32.U32 R4, R3, R7;
    /*00b0*/     /*0xfc01dde428000000*/     MOV R7, RZ;
    /*00b8*/     /*0xe001000750000000*/     CAL 0xf8;
    /*00c0*/     /*0x00000007d0000000*/     BPT.DRAIN 0x0;
    /*00c8*/     /*0xffffdc0450ee0000*/     BAR.RED.POPC RZ, RZ;
    

    Note that between the two calls to BAR.RED.POPC there is a call to the 0xf8 procedure, which implements some sophisticated algorithm (about 50 instructions or even more). Not surprising that the mod (%) operation is slow!

    • njuffa
      njuffa over 11 years
      What platform are you on? A high-end Fermi GPU will likely give you the best performance among shipping cards. Are all operands 32-bit integers? You would want the operands to be of type "unsigned int" for best performance. The compiler generates very efficient code for division and modulo with fixed divisor, you can check the SASS (machine code) with cuobjdump --dump-sass. You may want to use templated functions with constant divisors if you only use a few modulos. For Fermi-based GPUs, division and modulo with variable divisors are performance-competitive with hardware-based solutions.
    • ArchaeaSoftware
      ArchaeaSoftware over 11 years
      Hey Norbert, copy-and-paste this into the answers section where it belongs :-)
    • njuffa
      njuffa over 11 years
      Unsigned 32-bit division and modulo with a variable divisor compile to about 16 inlined machine instructions on Fermi and Kepler GPUs (if memory serves). The code is limited by the throughput of integer multiplies so you should see very good performance on Fermi-class GPUs, and a very decent speedup compared to CPUs. What performance are you observing?
  • Admin
    Admin over 11 years
    Thanks for the suggestions: I found the cuobjdump disassembler a really handy tool. Unfortunately, in my case I need to perform modular reduction of the product of two 32-bit integers, which finally boils down to computing a 64-bit product modulo m. In the disassembler it really looks awful, that is, there is a call to a procedure with more than 50 instructions. I edited my question to add a small example with my investigations.
  • Admin
    Admin over 11 years
    Yes, it is clear that we can use bitwise arithmetic for power-of-two moduli, but in my case I have arbitrary 32-bit primes.
  • njuffa
    njuffa over 11 years
    Thanks for the clarification. Unsigned 64-bit division and modulo are implemented as called subroutines of about 65 instructions altogether, which is near optimal. On Fermi, that is almost competitive with hardware implementations (64-bit divisions aren't particularly fast even on CPUs that have this as a built-in instruction).
  • Admin
    Admin over 11 years
    This is a good solution, thanks. Now I have an idea of how this works. Though in my case I use 32-bit primes, because 51-bit primes would occupy too much register space.