What's the most efficient way to make bitwise operations in a C array
Solution 1
for ( i = 10 ; i-- > 0 ; )
result_array[i] = byte_array[i] & byte_mask[i];
- Going backwards pre-loads processor cache-lines.
- Including the decrement in the compare can save some instructions.
This will work for all arrays and processors. However, if you know your arrays are word-aligned, a faster method is to cast to a larger type and do the same calculation.
For example, let's say n=16
instead of n=10
. Then this would be much faster:
uint32_t* input32 = (uint32_t*)byte_array;
uint32_t* mask32 = (uint32_t*)byte_mask;
uint32_t* result32 = (uint32_t*)result_array;
for ( i = 4 ; i-- > 0 ; )
result32[i] = input32[i] & mask32[i];
(Of course you need a proper type for uint32_t
, and if n
is not a power of 2 you need to clean up the beginning and/or ending so that the 32-bit stuff is aligned.)
Variation: The question specifically calls for the results to be placed in a separate array, however it would almost certainly be faster to modify the input array in-place.
Solution 2
If you want to make it faster, make sure that byte_array has length that is multiple of 4 (8 on 64-bit machines), and then:
char byte_array[12];
char byte_mask[12];
/* Checks for proper alignment */
assert(((unsigned int)(void *)byte_array) & 3 == 0);
assert(((unsigned int)(void *)byte_mask) & 3 == 0);
for (i = 0; i < (10+3)/4; i++) {
((unsigned int *)(byte_array))[i] &= ((unsigned int *)(byte_mask))[i];
}
This is much faster than doing it byte per byte.
(Note that this is in-place mutation; if you want to keep the original byte_array also, then you obviously need to store the results in another array instead.)
Solution 3
\#define CHAR_ARRAY_SIZE (10)
\#define INT_ARRAY_SIZE ((CHAR_ARRAY_SIZE/ (sizeof (unsigned int)) + 1)
typedef union _arr_tag_ {
char byte_array [CHAR_ARRAY_SIZE];
unsigned int int_array [INT_ARRAY_SIZE];
} arr_tag;
Now int_array for masking. This might work for both 32bit and 64 bit processors.
arr_tag arr_src, arr_result, arr_mask;
for (int i = 0; i < INT_ARRAY_SIZE; i ++) {
arr_result.int_array [i] = arr_src.int_array[i] & arr_mask.int_array [i];
}
Try this, code might also look clean.
Comments
-
alvatar almost 2 years
I have a C array like:
char byte_array[10];
And another one that acts as a mask:
char byte_mask[10];
I would like to do get another array that is the result from the first one plus the second one using a bitwise operation, on each byte.
What's the most efficient way to do this?
thanks for your answers.
-
Crashworks about 15 yearsWait, does the cache prefetcher work better in reverse? I thought it only prefetched going forwards.
-
Trent about 15 yearsWorrying about pre-loading processor cache-lines seems like a severe premature optimization.
-
Jason Cohen about 15 years@Trent -- the point of the question is optimization. Also going backwards is no slower, so you might as well. @Crashworks -- remember that cache lines are aligned, typically on massive boundaries, so typically it has to pull in bytes prior to the ones you're asking for.
-
Trent about 15 yearsAny statements regarding cache is going to be processor specific. I don't see where the OP states what HW this code will execute on.
-
alvatar about 15 yearsI appreciate this explanation. I will use this method. I don't fully understand the cache, so I can't really tell what's going on at that level.
-
Adam Rosenfield about 15 yearsAnother advantage of going backwards is that it's easier for the CPU to compare the counter to a constant 0 than to compare it with a variable. It avoids a memory access, or frees up a register, depending on if the count is stored in a register.
-
Ian Kelling about 15 yearsWhy a power of 2? Wouldn't a multiple of a word size work? You assume 32 bit word here?
-
sstock about 15 yearsNice answer Jason. I would add one other option for the aligned case: use vector operations if the processor supports them. Such as SSE on x86. GCC and Intel C++ both support intrinsics that make it easy to "vectorize" loops like the one above. Google "gcc sse instrinsics" for some good links.
-
Jason Cohen about 15 years@Ian - Yes any multiple of word size works, ALSO ASSUMING that the char arrays in question are themselves word-aligned. Also you are right that I'm assuming 32-bit processor; it must be tuned to the processor in question. Although, assuming 32-bit is still faster than byte-wise on almost any proc.
-
bk1e about 15 years10/4 == 2, so this only processes 8 chars. In addition, on some non-x86 architectures this may raise a bus error due to unaligned memory accesses.
-
Antti Huima about 15 yearsbk1e: you are right, i < 10/4 is wrong. The comment about bus error is also correct. I will edit the answer.