What's the most efficient way to make bitwise operations in a C array

13,354

Solution 1

for ( i = 10 ; i-- > 0 ; )
    result_array[i] = byte_array[i] & byte_mask[i];
  • Going backwards pre-loads processor cache-lines.
  • Including the decrement in the compare can save some instructions.

This will work for all arrays and processors. However, if you know your arrays are word-aligned, a faster method is to cast to a larger type and do the same calculation.

For example, let's say n=16 instead of n=10. Then this would be much faster:

uint32_t* input32 = (uint32_t*)byte_array;
uint32_t* mask32 = (uint32_t*)byte_mask;
uint32_t* result32 = (uint32_t*)result_array;
for ( i = 4 ; i-- > 0 ; )
    result32[i] = input32[i] & mask32[i];

(Of course you need a proper type for uint32_t, and if n is not a power of 2 you need to clean up the beginning and/or ending so that the 32-bit stuff is aligned.)

Variation: The question specifically calls for the results to be placed in a separate array, however it would almost certainly be faster to modify the input array in-place.

Solution 2

If you want to make it faster, make sure that byte_array has length that is multiple of 4 (8 on 64-bit machines), and then:

char byte_array[12];
char byte_mask[12];
/* Checks for proper alignment */
assert(((unsigned int)(void *)byte_array) & 3 == 0);
assert(((unsigned int)(void *)byte_mask) & 3 == 0);
for (i = 0; i < (10+3)/4; i++) {
  ((unsigned int *)(byte_array))[i] &= ((unsigned int *)(byte_mask))[i];
}

This is much faster than doing it byte per byte.

(Note that this is in-place mutation; if you want to keep the original byte_array also, then you obviously need to store the results in another array instead.)

Solution 3

\#define CHAR_ARRAY_SIZE    (10)
\#define INT_ARRAY_SIZE     ((CHAR_ARRAY_SIZE/ (sizeof (unsigned int)) + 1)

typedef union _arr_tag_ {

    char          byte_array [CHAR_ARRAY_SIZE];
    unsigned int  int_array [INT_ARRAY_SIZE]; 

} arr_tag;

Now int_array for masking. This might work for both 32bit and 64 bit processors.

arr_tag arr_src, arr_result, arr_mask;

for (int i = 0; i < INT_ARRAY_SIZE; i ++) {
    arr_result.int_array [i] = arr_src.int_array[i] & arr_mask.int_array [i];
}

Try this, code might also look clean.

Share:
13,354
alvatar
Author by

alvatar

Hi there.

Updated on June 09, 2022

Comments

  • alvatar
    alvatar almost 2 years

    I have a C array like:

    char byte_array[10];
    

    And another one that acts as a mask:

    char byte_mask[10];
    

    I would like to do get another array that is the result from the first one plus the second one using a bitwise operation, on each byte.

    What's the most efficient way to do this?

    thanks for your answers.

  • Crashworks
    Crashworks about 15 years
    Wait, does the cache prefetcher work better in reverse? I thought it only prefetched going forwards.
  • Trent
    Trent about 15 years
    Worrying about pre-loading processor cache-lines seems like a severe premature optimization.
  • Jason Cohen
    Jason Cohen about 15 years
    @Trent -- the point of the question is optimization. Also going backwards is no slower, so you might as well. @Crashworks -- remember that cache lines are aligned, typically on massive boundaries, so typically it has to pull in bytes prior to the ones you're asking for.
  • Trent
    Trent about 15 years
    Any statements regarding cache is going to be processor specific. I don't see where the OP states what HW this code will execute on.
  • alvatar
    alvatar about 15 years
    I appreciate this explanation. I will use this method. I don't fully understand the cache, so I can't really tell what's going on at that level.
  • Adam Rosenfield
    Adam Rosenfield about 15 years
    Another advantage of going backwards is that it's easier for the CPU to compare the counter to a constant 0 than to compare it with a variable. It avoids a memory access, or frees up a register, depending on if the count is stored in a register.
  • Ian Kelling
    Ian Kelling about 15 years
    Why a power of 2? Wouldn't a multiple of a word size work? You assume 32 bit word here?
  • sstock
    sstock about 15 years
    Nice answer Jason. I would add one other option for the aligned case: use vector operations if the processor supports them. Such as SSE on x86. GCC and Intel C++ both support intrinsics that make it easy to "vectorize" loops like the one above. Google "gcc sse instrinsics" for some good links.
  • Jason Cohen
    Jason Cohen about 15 years
    @Ian - Yes any multiple of word size works, ALSO ASSUMING that the char arrays in question are themselves word-aligned. Also you are right that I'm assuming 32-bit processor; it must be tuned to the processor in question. Although, assuming 32-bit is still faster than byte-wise on almost any proc.
  • bk1e
    bk1e about 15 years
    10/4 == 2, so this only processes 8 chars. In addition, on some non-x86 architectures this may raise a bus error due to unaligned memory accesses.
  • Antti Huima
    Antti Huima about 15 years
    bk1e: you are right, i < 10/4 is wrong. The comment about bus error is also correct. I will edit the answer.