What's the most efficient way to make bitwise operations in a C array

c performance micro-optimization

13,354

Solution 1

for ( i = 10 ; i-- > 0 ; )
    result_array[i] = byte_array[i] & byte_mask[i];

Going backwards pre-loads processor cache-lines.
Including the decrement in the compare can save some instructions.

This will work for all arrays and processors. However, if you know your arrays are word-aligned, a faster method is to cast to a larger type and do the same calculation.

For example, let's say n=16 instead of n=10. Then this would be much faster:

uint32_t* input32 = (uint32_t*)byte_array;
uint32_t* mask32 = (uint32_t*)byte_mask;
uint32_t* result32 = (uint32_t*)result_array;
for ( i = 4 ; i-- > 0 ; )
    result32[i] = input32[i] & mask32[i];

(Of course you need a proper type for uint32_t, and if n is not a power of 2 you need to clean up the beginning and/or ending so that the 32-bit stuff is aligned.)

Variation: The question specifically calls for the results to be placed in a separate array, however it would almost certainly be faster to modify the input array in-place.

Solution 2

If you want to make it faster, make sure that byte_array has length that is multiple of 4 (8 on 64-bit machines), and then:

char byte_array[12];
char byte_mask[12];
/* Checks for proper alignment */
assert(((unsigned int)(void *)byte_array) & 3 == 0);
assert(((unsigned int)(void *)byte_mask) & 3 == 0);
for (i = 0; i < (10+3)/4; i++) {
  ((unsigned int *)(byte_array))[i] &= ((unsigned int *)(byte_mask))[i];
}

This is much faster than doing it byte per byte.

(Note that this is in-place mutation; if you want to keep the original byte_array also, then you obviously need to store the results in another array instead.)

Solution 3

\#define CHAR_ARRAY_SIZE    (10)
\#define INT_ARRAY_SIZE     ((CHAR_ARRAY_SIZE/ (sizeof (unsigned int)) + 1)

typedef union _arr_tag_ {

    char          byte_array [CHAR_ARRAY_SIZE];
    unsigned int  int_array [INT_ARRAY_SIZE]; 

} arr_tag;

Now int_array for masking. This might work for both 32bit and 64 bit processors.

arr_tag arr_src, arr_result, arr_mask;

for (int i = 0; i < INT_ARRAY_SIZE; i ++) {
    arr_result.int_array [i] = arr_src.int_array[i] & arr_mask.int_array [i];
}

Try this, code might also look clean.

13,354

Author by

alvatar

Hi there.

Updated on June 09, 2022

Comments

alvatar almost 2 years
I have a C array like:
```
char byte_array[10];
```
And another one that acts as a mask:
```
char byte_mask[10];
```
I would like to do get another array that is the result from the first one plus the second one using a bitwise operation, on each byte.

What's the most efficient way to do this?

thanks for your answers.
Crashworks about 15 years

Wait, does the cache prefetcher work better in reverse? I thought it only prefetched going forwards.
Trent about 15 years

Worrying about pre-loading processor cache-lines seems like a severe premature optimization.
Jason Cohen about 15 years

@Trent -- the point of the question is optimization. Also going backwards is no slower, so you might as well. @Crashworks -- remember that cache lines are aligned, typically on massive boundaries, so typically it has to pull in bytes prior to the ones you're asking for.
Trent about 15 years

Any statements regarding cache is going to be processor specific. I don't see where the OP states what HW this code will execute on.
alvatar about 15 years

I appreciate this explanation. I will use this method. I don't fully understand the cache, so I can't really tell what's going on at that level.
Adam Rosenfield about 15 years

Another advantage of going backwards is that it's easier for the CPU to compare the counter to a constant 0 than to compare it with a variable. It avoids a memory access, or frees up a register, depending on if the count is stored in a register.
Ian Kelling about 15 years

Why a power of 2? Wouldn't a multiple of a word size work? You assume 32 bit word here?
sstock about 15 years

Nice answer Jason. I would add one other option for the aligned case: use vector operations if the processor supports them. Such as SSE on x86. GCC and Intel C++ both support intrinsics that make it easy to "vectorize" loops like the one above. Google "gcc sse instrinsics" for some good links.
Jason Cohen about 15 years

@Ian - Yes any multiple of word size works, ALSO ASSUMING that the char arrays in question are themselves word-aligned. Also you are right that I'm assuming 32-bit processor; it must be tuned to the processor in question. Although, assuming 32-bit is still faster than byte-wise on almost any proc.
bk1e about 15 years

10/4 == 2, so this only processes 8 chars. In addition, on some non-x86 architectures this may raise a bus error due to unaligned memory accesses.
Antti Huima about 15 years

bk1e: you are right, i < 10/4 is wrong. The comment about bus error is also correct. I will edit the answer.