Quickly find whether a value is present in a C array?


Solution 1

In situations where performance is of utmost importance, the C compiler will most likely not produce the fastest code you could get with hand-tuned assembly language. I tend to take the path of least resistance: for small routines like this, I just write assembly code, and I have a good idea of how many cycles it will take to execute. You may be able to fiddle with the C code and get the compiler to generate good output, but you may end up wasting lots of time tuning the output that way. Compilers (especially from Microsoft) have come a long way in the last few years, but they are still not as smart as the compiler between your ears, because you're working on your specific situation and not just a general case. The compiler may not make use of certain instructions (e.g. LDM) that can speed this up, and it's unlikely to be smart enough to unroll the loop.

Here's a way to do it which incorporates the three ideas I mentioned in my comment: loop unrolling, cache prefetch, and use of the load-multiple (LDM) instruction. The instruction cycle count comes out to about 3 clocks per array element, but this doesn't take memory delays into account.

Theory of operation: ARM's CPU design executes most instructions in one clock cycle, but the instructions are executed in a pipeline. C compilers will try to eliminate the pipeline delays by interleaving other instructions in between. When presented with a tight loop like the original C code, the compiler will have a hard time hiding the delays, because the value read from memory must be compared immediately. My code below alternates between two sets of four registers to significantly reduce the delays of the memory itself and of the pipeline fetching the data. In general, when you work with a large data set and your code doesn't make use of most or all of the available registers, you're not getting maximum performance.

; r0 = count, r1 = source ptr, r2 = comparison value

   stmfd sp!,{r4-r11}   ; save non-volatile registers
   mov r3,r0,LSR #3     ; loop count = total count / 8
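   ; (assumes the count in r0 is a nonzero multiple of 8)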
   pld [r1,#128]
   ldmia r1!,{r4-r7}    ; pre load first set
loop_top:
   pld [r1,#128]
   ldmia r1!,{r8-r11}   ; pre load second set
   cmp r4,r2            ; search for match
   cmpne r5,r2          ; use conditional execution to avoid extra branch instructions
   cmpne r6,r2
   cmpne r7,r2
   beq found_it
   ldmia r1!,{r4-r7}    ; use 2 sets of registers to hide load delays
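   ; note: this pre-load also runs on the final pass, reading 16 bytes past the array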
   cmp r8,r2
   cmpne r9,r2
   cmpne r10,r2
   cmpne r11,r2
   beq found_it
   subs r3,r3,#1        ; decrement loop count
   bne loop_top
   mov r0,#0            ; return value = false (not found)
   ldmia sp!,{r4-r11}   ; restore non-volatile registers
   bx lr                ; return
found_it:
   mov r0,#1            ; return true
   ldmia sp!,{r4-r11}
   bx lr

Update: There are a lot of skeptics in the comments who think that my experience is anecdotal/worthless and require proof. I used GCC 4.8 (from the Android NDK 9C) to compile the original C code presented in the question above, with optimization -O2 (note that -O2 does not enable GCC's automatic loop unrolling; -funroll-loops has to be requested separately). Here's what GCC produced:

.L9: cmp r3, r0
     beq .L8
.L3: ldr r2, [r3, #4]!
     cmp r2, r1
     bne .L9
     mov r0, #1
.L2: add sp, sp, #1024
     bx  lr
.L8: mov r0, #0
     b .L2

GCC's output not only fails to unroll the loop, it also wastes a clock on a stall after the LDR, so it needs at least 8 clocks per array element. It does a good job of using the address to know when to exit the loop, but all of the magical things compilers are supposedly capable of are nowhere to be found in this code. I haven't run the code on the target platform (I don't own one), but anyone experienced in ARM code performance can see that my code is faster.

Update 2: I gave Microsoft's Visual Studio 2013 SP2 a chance to do better with the code. It was able to use NEON instructions to vectorize my array initialization, but the linear value search as written by the OP came out similar to what GCC generated (I renamed the labels to make it more readable):

loop_top:
   ldr  r3,[r1],#4  
   cmp  r3,r2  
   beq  true_exit
   subs r0,r0,#1 
   bne  loop_top
false_exit: xxx
   bx   lr
true_exit: xxx
   bx   lr

As I said, I don't own the OP's exact hardware, but I will be testing the performance of the three versions on an Nvidia Tegra 3 and a Tegra 4, and will post the results here soon.

Update 3: I ran my code and Microsoft's compiled ARM code on a Tegra 3 and a Tegra 4 (Surface RT and Surface RT 2). I ran 1,000,000 iterations of a loop which fails to find a match, so that everything stays in cache and it's easy to measure.

              My Code    MS Code
Surface RT     297 ns     562 ns
Surface RT 2   172 ns     296 ns

In both cases my code runs almost twice as fast. Most modern ARM CPUs will probably give similar results.
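For anyone who wants to reproduce this kind of measurement on their own hardware, a harness along these lines works. This is a sketch rather than my actual test code: the check() signature, element count, and iteration count are assumptions.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define COUNT      1024           /* number of array elements (assumed) */
#define ITERATIONS 1000000        /* same iteration count as the test above */

bool check(uint32_t *array, uint32_t count, uint32_t value);  /* routine under test */

int main(void)
{
    static uint32_t a[COUNT];     /* zero-initialized, so the search below always misses */
    struct timespec t0, t1;
    volatile bool found = false;  /* volatile keeps the calls from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++)
        found |= check(a, COUNT, 0x1234ABCD);   /* always takes the not-found path */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.0f ns per call (found=%d)\n", ns / ITERATIONS, (int)found);
    return 0;
}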

Solution 2

There's a trick for optimizing it (I was asked this in a job interview once):

  • If the last entry in the array holds the value that you're looking for, then return true
  • Write the value that you're looking for into the last entry in the array
  • Iterate the array until you encounter the value that you're looking for
  • If you've encountered it before the last entry in the array, then return true
  • Return false

#include <stdint.h>
#include <stdbool.h>

/* SIZE is the number of elements in theArray, defined elsewhere. */
bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    uint32_t x = theArray[SIZE-1];
    if (x == compareVal)                /* step 1: is the last entry a match? */
        return true;
    theArray[SIZE-1] = compareVal;      /* step 2: plant the sentinel */
    for (i = 0; theArray[i] != compareVal; i++)
        ;                               /* step 3: scan; guaranteed to terminate */
    theArray[SIZE-1] = x;               /* restore the original last entry */
    return i != SIZE-1;                 /* found before the sentinel position? */
}

This yields one branch per iteration instead of two: the loop-termination test disappears, because the sentinel guarantees that the scan always finds a match.


UPDATE:

If you're allowed to allocate the array with SIZE+1 entries, then you can get rid of the "last entry swapping" part:

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    theArray[SIZE] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    return i != SIZE;
}

You can also get rid of the additional indexing arithmetic embedded in theArray[i] by using a pointer instead:

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t *arrayPtr;
    theArray[SIZE] = compareVal;
    for (arrayPtr = theArray; *arrayPtr != compareVal; arrayPtr++);
    return arrayPtr != theArray+SIZE;
}

If the compiler doesn't already apply this transformation, this version forces it. On the other hand, it might make it harder for the optimizer to unroll the loop, so you will have to verify that in the generated assembly code...
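As discussed in the comments below, if compareVal is constant for the life of the program, the sentinel can be planted once at startup (or even baked into the executable image), leaving the time-critical path with nothing but the scan. A minimal sketch of that variant, where SIZE, the array, and the init function are assumptions:

#include <stdint.h>
#include <stdbool.h>

#define SIZE 1024                 /* assumed array length */

uint32_t theArray[SIZE + 1];      /* extra slot permanently holds the sentinel */

/* Called once at startup, outside the time-critical path. */
void plant_sentinel(uint32_t compareVal)
{
    theArray[SIZE] = compareVal;
}

/* The time-critical path: no writes, just the sentinel scan. */
bool check(uint32_t compareVal)
{
    const uint32_t *arrayPtr;
    for (arrayPtr = theArray; *arrayPtr != compareVal; arrayPtr++)
        ;
    return arrayPtr != theArray + SIZE;
}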

Solution 3

Keep the table in sorted order, and use Bentley's unrolled binary search:

#include <stdint.h>
#include <stdbool.h>

/* a[] is sorted and padded to exactly 1024 entries (a power of 2) */
bool check(const uint32_t a[1024], uint32_t key)
{
    uint32_t i = 0;
    if (key >= a[i+512]) i += 512;
    if (key >= a[i+256]) i += 256;
    if (key >= a[i+128]) i += 128;
    if (key >= a[i+ 64]) i +=  64;
    if (key >= a[i+ 32]) i +=  32;
    if (key >= a[i+ 16]) i +=  16;
    if (key >= a[i+  8]) i +=   8;
    if (key >= a[i+  4]) i +=   4;
    if (key >= a[i+  2]) i +=   2;
    if (key >= a[i+  1]) i +=   1;
    return (key == a[i]);
}

The point is,

  • if you know how big the table is, then you know how many iterations there will be, so you can fully unroll it.
  • Then, there's no point testing for the == case on each iteration because, except on the last iteration, the probability of that case is too low to justify spending time testing for it.**
  • Finally, by expanding the table to a power of 2, you add at most one comparison, and at most a factor of two storage.

** If you're not used to thinking in terms of probabilities: every decision point has an entropy, which is the average information you learn by executing it. For the >= tests, the probability of each branch is about 0.5, and -log2(0.5) is 1, which means that whichever branch you take, you learn 1 bit. The average is the sum of what you learn on each branch, weighted by the probability of that branch: 1*0.5 + 1*0.5 = 1, so the entropy of the >= test is 1 bit. Since you have 10 bits to learn, it takes 10 comparisons. That's why it's fast!

On the other hand, what if your first test is if (key == a[i+512])? The probability of it being true is 1/1024, while the probability of it being false is 1023/1024. So if it's true, you learn all 10 bits! But if it's false, you learn -log2(1023/1024) = .00141 bits, practically nothing! So the average amount you learn from that test is 10/1024 + .00141*1023/1024 = .0098 + .00141 = .0112 bits. About one hundredth of a bit. That test is not carrying its weight!
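For reference, the unrolled sequence above is just this loop with its ten iterations written out; a sketch assuming the same sorted, padded 1024-entry table (check_loop is a hypothetical name):

#include <stdint.h>
#include <stdbool.h>

/* Halve the step each time: 10 steps for 1024 entries. The unrolled
   version eliminates the loop-control overhead of this form. */
bool check_loop(const uint32_t a[1024], uint32_t key)
{
    uint32_t i = 0;
    for (uint32_t step = 512; step != 0; step >>= 1)
        if (key >= a[i + step])
            i += step;
    return key == a[i];
}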

Solution 4

You're asking for help with optimizing your implementation, which may push you to assembler. But your algorithm (a linear search) is not very clever, so you should consider changing the algorithm. For example:

Perfect hash function

If your 256 "valid" values are static and known at compile time, then you can use a perfect hash function. You need to find a hash function that maps your input value to a value in the range 0..n-1, with no collisions among the valid values you care about. That is, no two "valid" values hash to the same output value. When searching for a good hash function, you aim to:

  • Keep the hash function reasonably fast.
  • Minimise n. The smallest you can get is 256 (minimal perfect hash function), but that's probably hard to achieve, depending on the data.

Note that for efficient hash functions, n is often a power of 2, so the modulo is equivalent to a bitwise mask of the low bits (an AND operation). Example hash functions:

  • CRC of input bytes, modulo n.
  • ((x << i) ^ (x >> j) ^ (x << k) ^ ...) % n (picking as many i, j, k, ... as needed, with left or right shifts)

Then you make a fixed table of n entries, where the hash maps the input values to an index i into the table. For valid values, table entry i contains the valid value. For all other table entries, ensure that each entry of index i contains some other invalid value which doesn't hash to i.

Then in your interrupt routine, with input x:

  1. Hash x to index i (which is in the range 0..n-1)
  2. Look up entry i in the table and see if it contains the value x.

This will be much faster than a linear search of 256 or 1024 values.
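To make the two steps concrete, here is a minimal sketch; the shift-xor hash, the table size, and the table contents are assumptions, and a real table must be generated offline so that no two valid values collide:

#include <stdint.h>
#include <stdbool.h>

#define HASH_BITS 10
#define HASH_SIZE (1u << HASH_BITS)   /* n = 1024, a power of 2 */

/* Generated offline: table[hash(v)] == v for every valid value v, and every
   unused slot holds some value that does NOT hash to its own index. */
static const uint32_t table[HASH_SIZE] = { 0 /* ...generated offline... */ };

static uint32_t hash(uint32_t x)
{
    /* hypothetical shift-xor mix; the shifts are tuned offline until
       no two valid values collide */
    return ((x >> 16) ^ x ^ (x << 3)) & (HASH_SIZE - 1);
}

/* The ISR path: one hash, one table probe, one compare. */
bool check(uint32_t x)
{
    return table[hash(x)] == x;
}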

I've written some Python code to find reasonable hash functions.

Binary search

If you sort your array of 256 "valid" values, then you can do a binary search rather than a linear search. That means you should be able to search a 256-entry table in only 8 steps (log2(256)), or a 1024-entry table in 10 steps. Again, this will be much faster than a linear search of 256 or 1024 values.
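A sketch of that search, assuming the array has been sorted ahead of time:

#include <stdint.h>
#include <stdbool.h>

/* Classic binary search over a sorted array: log2(count) iterations,
   e.g. 8 for 256 entries, 10 for 1024. */
bool check(const uint32_t *sortedArray, uint32_t count, uint32_t compareVal)
{
    uint32_t lo = 0, hi = count;          /* half-open range [lo, hi) */
    while (lo < hi)
    {
        uint32_t mid = lo + (hi - lo) / 2;
        if (sortedArray[mid] == compareVal)
            return true;
        if (sortedArray[mid] < compareVal)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}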

Solution 5

If the set of constants in your table is known in advance, you can use perfect hashing to ensure that only one access is made to the table. Perfect hashing determines a hash function that maps every interesting key to a unique slot (the table isn't always dense, but you can decide how sparse a table you can afford, with less dense tables typically leading to simpler hashing functions).

Usually, the perfect hash function for the specific set of keys is relatively easy to compute; you don't want that to be long and complicated because that competes for time perhaps better spent doing multiple probes.

Perfect hashing is a "1-probe max" scheme. One can generalize the idea, with the thought that one should trade simplicity of computing the hash code with the time it takes to make k probes. After all, the goal is "least total time to look up", not fewest probes or simplest hash function. However, I've never seen anybody build a k-probes-max hashing algorithm. I suspect one can do it, but that's likely research.

One other thought: if your processor is extremely fast, the one probe to memory from a perfect hash probably dominates the execution time. If the processor is not very fast, then k>1 probes might be practical.


Comments

  • wlamers
    wlamers over 3 years

    I have an embedded application with a time-critical ISR that needs to iterate through an array of size 256 (preferably 1024, but 256 is the minimum) and check whether a value matches the array's contents. A bool will be set to true if this is the case.

    The microcontroller is an NXP LPC4357, ARM Cortex-M4 core, and the compiler is GCC. I have already combined optimisation level 2 (level 3 is slower) with placing the function in RAM instead of flash. I also use pointer arithmetic and a for loop which counts down instead of up (checking if i != 0 is faster than checking if i < 256). All in all, I end up with a duration of 12.5 µs, which has to be reduced drastically to be feasible. This is the (pseudo) code I use now:

    uint32_t i;
    uint32_t *array_ptr = &theArray[0];
    uint32_t compareVal = 0x1234ABCD;
    bool validFlag = false;
    
    for (i=256; i!=0; i--)
    {
        if (compareVal == *array_ptr++)
        {
            validFlag = true;
            break;
        }
    }
    

    What would be the absolute fastest way to do this? Using inline assembly is allowed. Other 'less elegant' tricks are also allowed.

  • Notlikethat
    Notlikethat over 9 years
    Bonus silly micro-optimisation: forget r3, use subs r0, r0, #8 for the decrement instead, then r0 will already be zero when you fall out of the 'not found' path.
  • BitBank
    BitBank over 9 years
    True - I'm used to needing to re-use the original count later in the function, so by default I send it to another register.
  • phuclv
    phuclv over 9 years
    in simple blocks of code you may win the compiler but in more complex situations you can hardly outperform it
  • sapi
    sapi over 9 years
    @LưuVĩnhPhúc - that's generally true, but tight ISRs are one of the biggest exceptions, in that you often know a lot more than the compiler does.
  • auselen
    auselen over 9 years
    Isn't it better if you drop the beq found_it branches, do two loads at the top, and always do cmpne on all of them regardless? I don't know the capabilities of the given Cortex-M4, but two branches still seem too much to have in the code.
  • BitBank
    BitBank over 9 years
    @auselen I wanted to give maximum time for the registers to get loaded since that's usually where the bottleneck is. If working with 0 wait state static RAM, then your idea would be better.
  • MSalters
    MSalters over 9 years
    -1, ARM has an indexed address mode so this is pointless. As for making the pointer const, GCC already spots that it doesn't change. The const doesn't add anything either.
  • barak manos
    barak manos over 9 years
    @auselen: Thanks... well if it's a read-only memory then this solution is not feasible.
  • MSalters
    MSalters over 9 years
    @auselen: Then you copy it to RAM. You'd want that anyway because RAM is faster.
  • auselen
    auselen over 9 years
    @MSalters copy from flash + compare on ram > compare on flash ? If you do it only once I guess that won't be useful.
  • unwind
    unwind over 9 years
    @MSalters OK, I didn't verify with the generated code, the point was to express something that makes it simpler at the C level, and I think just managing pointers instead of a pointer and an index is simpler. I simply disagree that "const doesn't add anything": it very clearly tells the reader that the value won't change. That is fantastic information.
  • ratchet freak
    ratchet freak over 9 years
    if you can copy into ram then allocate one extra space and put compareVal in there, no need for the first branch then
  • barak manos
    barak manos over 9 years
    @themik81: I rejected it because your post-edit was wrong (it changed the code to work incorrectly).
  • barak manos
    barak manos over 9 years
    @ratchetfreak: OP does not provide any details on how, where and when this array is allocated and initialized, so I gave an answer that does not depend on that.
  • wlamers
    wlamers over 9 years
    Array is in RAM, writes are not allowed though.
  • RevanProdigalKnight
    RevanProdigalKnight over 9 years
    If I remember correctly, xor rx, rx is faster than mov rx, #0
  • barak manos
    barak manos over 9 years
    @wlamers: If both &theArray and compareVal are constant throughout the execution of the program, then you can place the constant value of compareVal immediately after the array (as part of the executable image, not during runtime), and then use only a small portion of the code that I've provided - the for loop, and the return value being i < SIZE instead of i < SIZE-1.
  • artless noise
    artless noise over 9 years
    This is a nice answer for generic ARM. See: LPC4357 Datasheet. The ARM Cortex-M4 supports single-cycle digital signal processing and SIMD instructions. The SIMD will probably allow comparison of multiple values at a time.
  • BitBank
    BitBank over 9 years
    @artlessnoise - AFAIK the Cortex-M4 doesn't support NEON, but instead has some instructions which can do 8/16-bit SIMD inside 32-bit registers (which wouldn't help in this situation)
  • maxywb
    maxywb over 9 years
    It depends on how many times this lookup has to be done for this to be effective.
  • artless noise
    artless noise over 9 years
    Well, sort of true. Two 16-bit compares are the same as one 32-bit compare. However, you are correct, it is not NEON. It doesn't look like the M4 SIMD supports a SIMD vector compare like NEON does. As well, you can remove the pld if you cache-lock the memory. The M4 has I/D-TCM and I think it is as fast as cache. Positioning the array will also give a speed-up; but it is not clear who is creating the array.
  • EOF
    EOF over 9 years
    nice, but the array is no longer const, which makes this not thread-safe. Seems like a high price to pay.
  • barak manos
    barak manos over 9 years
    @Shadow503: Thanks :)
  • barak manos
    barak manos over 9 years
    @EOF: Where was const ever mentioned in the question?
  • supercat
    supercat over 9 years
    Could one remove the beq found_it instructions from the loop if one replaced cmp r8,r2 with cmpne r8,r2, and subs r3,r3,#1 with subsne r3,r3,#1? The loop will exit with r3 non-zero if the value was found, or zero if it wasn't.
  • EOF
    EOF over 9 years
    @barakmanos: If I pass an array and a value to you, and ask you whether the value is in the array, I don't usually assume you'll be modifying the array. The original question mentions neither const nor threads, but I think it's fair to mention this caveat.
  • barak manos
    barak manos over 9 years
    @EOF: Well, after OP wrote in one of the comments "writes are not allowed", I added an update, which you can read at the second part of the question. In short, you can simply add an entry at the end of the array, which no other thread uses but the thread which searches for compareVal. In addition, if theArray is located at the same memory address throughout the execution of the program, and the value of compareVal is constant throughout the execution of the program, then you can further optimize it by setting theArray[SIZE] = compareVal (or even as part of the executable image) once.
  • EOF
    EOF over 9 years
    I'll clarify what I mean: Say two threads search for two different values in the same array simultaneously (or concurrently, whichever). Now both try to modify the last value (or the sentinel value you want to write past the end of the array [btw, you'd have to allocate a +1 sized array to even do that, otherwise undefined behaviour]) and boom, equivalent to a non null-terminated string in one of the threads.
  • barak manos
    barak manos over 9 years
    @EOF: The allocation is mentioned as part of the answer (not that it is that critical to the question, as it's pretty obvious that you need to allocate a bigger array, or an element "sitting" right next to it). In any case, in the second part of the answer (which refers to the case of a read-only array), I have clearly mentioned that it is relevant for a constant compareVal.
  • Oliver Charlesworth
    Oliver Charlesworth over 9 years
    Devil's advocate: is there any quantitative evidence that this code is faster?
  • BitBank
    BitBank over 9 years
    @OliCharlesworth - I haven't run the code; I wrote it off the top of my head, but I can tell you from experience that due to the pipelined nature of ARM processors, working with one word at a time versus the way I've written the code will get very different performance. If memory is very fast, then this code can be as much as 6 times faster than the original C code.
  • MSalters
    MSalters over 9 years
    This is deeply embedded code; optimizations so far have included moving the code from flash to RAM. And yet it still needs to be faster. At this point, readability is not the goal.
  • MSalters
    MSalters over 9 years
    A Cortex-M is nowhere near extremely fast.
  • David Ongaro
    David Ongaro over 9 years
    In fact in this case he doesn't need any hash table at all. He only wants to know if a certain key is in the set, he doesn't want to map it to a value. So it's enough if the perfect hash function maps each 32 bit value to either 0 or 1 where "1" could be defined as "is in the set".
  • Ira Baxter
    Ira Baxter over 9 years
    Good point, if he can get a perfect hash generator to produce such a mapping. But that would be "an extremely dense set"; I doubt he can find a perfect hash generator that does that. He might be better off trying to get a perfect hash that produces some constant K if the value is in the set, and any value but K if not. I suspect it is hard to get a perfect hash even for the latter.
  • EOF
    EOF over 9 years
    If both the array and the value you look for are constant, why would you need to search for it in the first place?. And you can't get '[...] an element "sitting" right next to it' in C. You get variables with some storage. This really has to be the equivalent to a string (hence the +1 size...), and contiguous allocation from the beginning.
  • barak manos
    barak manos over 9 years
    @EOF: The address of the array is constant, not the contents of the array!!! Regarding the "element sitting next to it" - it's a common thing to do in embedded systems (which is what OP seems to be working on). You allocate a uint32_t right after the array. You make sure that it is right after the array through the linker command-file (assuming that they are both global variables, it is absolutely feasible, and in fact, a very common thing to do on these systems). In any case, you might as well declare the array to size SIZE+1. I might change it to this, just to make it simpler.
  • barak manos
    barak manos over 9 years
    @EOF: A couple of additional things related to your previous comments. OP mentions that this code runs in the context of an ISR, so it doesn't need to be thread-safe, and I doubt that it even needs to be ISR-safe. In any case, even if the code does need to be thread-safe, and assuming that we have a small number of N threads, we can simply allocate uint32_t theArray[SIZE+N], and let each thread write theArray[SIZE+threadId] = compareVal. This would make the whole thing thread-safe while keeping the complexity at O(n), since N is much smaller than SIZE.
  • wlamers
    wlamers over 9 years
    Thanks for that. The binary search option is the one I have chosen. See also an earlier comment in the first post. This does the trick very well without using assembly.
  • wlamers
    wlamers over 9 years
    Quite some discussion going on here. If I had known that all this would be important, I would have given some more details in the question. My apologies. The above suggestion could work in my application. To clarify some more: the array is in RAM because it changes once in a while, but never in the ISR itself. I could add an extra item which may be changed in the ISR; this is feasible. Thread safety is (luckily) not an issue. Both the array and pointer+compareVal can be accessed simultaneously by the CPU, since they are in different RAMs which can be accessed by two separate busses.
  • wlamers
    wlamers over 9 years
    This already speeds things up. In the meantime (see comment in the first post) I solved the issue using a binary search. This speeds things up from 12.4 µs to 3.9 µs in the worst case. Quite an improvement. It could probably be even faster, but the time/effort required does not weigh up against the gain. I am very happy with the result. Thanks all for your input!
  • barak manos
    barak manos over 9 years
    @wlamers: You're welcome. If you're allowed to allocate the array one entry larger, then it might work out for you without changing the value of that last entry (as explained in the second part of the answer). Of course, if you're allowed to sort the array beforehand (or if you have it sorted already), then it's a totally different thing, and I guess that a binary search beats all other methods. In any case, thank you for the interesting question, and for the (currently) second highest score I got for an answer here :)
  • OregonTrail
    OregonTrail over 9 years
    I really like this solution. It can be modified to run in a fixed number of cycles to avoid timing-based forensics if the location of the value is sensitive information.
  • Mike Dunlavey
    Mike Dunlavey over 9 years
    @OregonTrail: Timing-based forensics? Fun problem, but sad comment.
  • OregonTrail
    OregonTrail over 9 years
    You see unrolled loops like this in crypto libraries to prevent Timing Attacks en.wikipedia.org/wiki/Timing_attack. Here's a good example github.com/jedisct1/libsodium/blob/… In this case we are preventing an attacker from guessing the length of a string. Usually the attacker will take several million samples of a function invocation to perform a timing attack.
  • MSalters
    MSalters over 9 years
    How does this get 4 upvotes? The question states it's a Cortex M4. The thing has 136 KB RAM, not 262.144 KB.
  • msw
    msw over 9 years
    @BitBank You wrote "I haven't run the code". If so, you can't make assertions about its performance. Even extensive personal experience can only generate guesses.
  • msw
    msw over 9 years
    It's astounding how many upvotes were given to manifestly wrong answers because the answerer missed the forest for the trees. For the OP's largest case O(log n) << O(n).
  • ysdx
    ysdx over 9 years
    Indeed, before trying to optimize your code (such as using assembly or other tricks) you should probably see if you can reduce the algorithmic complexity. Usually reducing the algorithmic complexity will be more efficient than trying to shave a few cycles while keeping the same algorithmic complexity.
  • BitBank
    BitBank over 9 years
    @msw - you're right that they're guesses, but I've done this long enough to know that a C compiler is not going to generate what I wrote and mine will be much faster.
  • Lightness Races in Orbit
    Lightness Races in Orbit over 9 years
    @BitBank: That's not good enough. You have to back up your claims with evidence.
  • Rocketmagnet
    Rocketmagnet over 9 years
    I learned my lesson years ago. I crafted an amazing optimised inner loop for a graphics routine on a Pentium, using the U and V pipes optimally. Got it down to 6 clock cycles per loop (calculated and measured), and I was very proud of myself. When I tested it against the same thing written in C, the C was faster. I never wrote another line of Intel assembler again.
  • Voo
    Voo over 9 years
    So the only thing you've done is unrolled the loop and combined the branches. No idea about ARM compilers, but those are some of the most basic optimizations any serious optimizing compiler would do. Rather shocking if gcc for ARM couldn't manage that itself.
  • Exectron
    Exectron over 9 years
    I get very grumpy at programmers who burn ridiculous amounts of memory, when there are far better solutions available. Every 5 years it seems that my PC is running out of memory, where 5 years ago that amount was plenty.
  • Cole Tobin
    Cole Tobin over 9 years
    @CraigMcQueen Kids these days. Wasting memory. Outrageous! Back in my day, we had 1 MiB of memory and a word size of 16 bits. /s
  • Bogdan Alexandru
    Bogdan Alexandru over 9 years
    What's with the harsh critics? The OP clearly states the speed is absolutely critical for this portion of code, and StephenQuan already mentioned a "ridiculous amount of memory".
  • Benjamin Gruenbaum
    Benjamin Gruenbaum over 9 years
    Great answer, one thing that would make it better is actually benchmarking the two and showing us that your solution is faster (Compared to Barak Manos's solution in particular) - while it likely is, this should shut up the skeptics.
  • Cody Gray
    Cody Gray over 9 years
    "skeptics in the comments who think that my experience is anecdotal/worthless and require proof." Don't take their comments overly negatively. Showing the proof just makes your great answer all that much better.
  • Cody Gray
    Cody Gray over 9 years
    @Voo I've only played with it on one or two occasions, but it seems that GCC's ARM compiler frequently misses or fails to perform some fairly obvious optimizations. I assume this is the result of less programmer-years of development than the x86 compiler.
  • maxy
    maxy over 9 years
    Not sure about ARM, but many modern CPUs can predict fixed-step memory access patterns and will start prefetching after the first few iterations without explicit prefetch instruction. They also do fancy branch prediction and speculative execution, so an "if" statement can be much cheaper than you would think depending on the data. So in general it's very hard to predict performance on desktop PCs (maybe ARM is different).
  • artless noise
    artless noise over 9 years
    Some proof is here as gcc bug 48789; gcc doesn't understand the ldm nor stm instructions. These allow loading of multiple memory values to registers and are commonly coded in memset(), memcpy(), etc as they minimize code overhead for data bound ARM code. You can use inline assembler with 'C' to get the same effect.
  • Jim Balter
    Jim Balter over 9 years
    The OP ended up sorting the values beforehand and doing a binary search ... duh.
  • Jim Balter
    Jim Balter over 9 years
    The OP bypassed all the foolish answers focused on optimizing linear loops, and instead presorted the array and did binary search.
  • Jim Balter
    Jim Balter over 9 years
    Loops over pointers are idiomatic in C and good optimizing compilers can handle them just as well as indexing. But this whole thing is moot because the OP ended up doing a binary search.
  • Jim Balter
    Jim Balter over 9 years
    Er, lookup can run off the end of the array. And this sort of linear hashing has high collision rates -- no way you'll get O(1). Good hash sets aren't implemented like this.
  • Jim Balter
    Jim Balter over 9 years
    @DavidOngaro table[PerfectHash(value)] == value yields 1 if the value is in the set and 0 if it isn't, and there are well known ways to produce the PerfectHash function (see, e.g., burtleburtle.net/bob/hash/perfect.html). Trying to find a hash function that directly maps all values in the set into 1 and all values not in the set to 0 is a foolhardy task.
  • jpa
    jpa over 9 years
    @JimBalter True, not perfect code. More like the general idea; could have just pointed to existing hash set code. But considering that this is an interrupt service routine it may be useful to demonstrate that the lookup is not very complex code.
  • Jim Balter
    Jim Balter over 9 years
    You should just fix it so it wraps i around.
  • Jim Balter
    Jim Balter over 9 years
    @MSalters "ARM has an indexed address mode so this is pointless" -- well, if you completely miss the point ... the OP wrote "I also use pointer arithmetic and a for loop". unwind didn't replace indexing with pointers, he just eliminated the index variable and thus an extra subtract on every loop iteration. But the OP was wise (unlike many of the people answering and commenting) and ended up doing a binary search.
  • barak manos
    barak manos over 9 years
    @Christian: Thanks :)
  • Mixaz
    Mixaz over 9 years
    @Jim, it is obvious that that kind of optimization should be made first. 'Foolish' answers may look not so foolish in some use cases, for example when you do not have time to sort the array, or when the speed you get is not enough anyway.
  • Jim Balter
    Jim Balter over 9 years
    "it is obvious that that kind of optimization should be made first" -- obviously not to the people who went to great effort to develop linear solutions. "you do not have time to sort the array" -- I have no idea what that means. "Or if the speed you get, is not enough anyway" -- Uh, if the speed from a binary search is "not enough", doing an optimized linear search won't improve it. Now I'm done with this subject.
  • Mixaz
    Mixaz over 9 years
    @JimBalter, if I had such a problem as the OP, I certainly would consider using algorithms like binary search or something. I just couldn't imagine that the OP hadn't considered it already. "You do not have time to sort the array" means that sorting the array takes time. If you need to do it for each input data set, it may take longer than a linear loop. "Or if the speed you get is not enough anyway" means the following: the optimization hints above could be used to speed up the binary search code as well.
  • TonyK
    TonyK over 9 years
    @OregonTrail: I second your timing-based comment. I have more than once had to write cryptographic code that executes in a fixed number of cycles, to avoid leaking information to timing-based attacks.
  • Olof Forshell
    Olof Forshell almost 9 years
    A popular notion is that it takes too much effort to find an efficient hash routine so the "best practice" is a binary search. Sometimes though, "best practice" is not good enough. Suppose you are routing network traffic on the fly at the moment when a packet's header has arrived (but not its payload): using a binary search would make your product hopelessly slow. Embedded products usually have such constraints and requirements that what is "best practice" in, for example, an x86 execution environment is "taking the easy way out" in embedded.
  • The Paramagnetic Croissant
    The Paramagnetic Croissant over 8 years
    @Puppy You are a terrible hater.
  • Peter Cordes
    Peter Cordes over 7 years
    Does NEON have a packed-compare instruction? Since the array is small, you could always loop over the whole thing, and use 128-bit vector OR instructions to combine the vector compare results into a single vector that has a non-zero element if there was a match. (Then horizontal OR this down to a truth value in a scalar register). Unrolling this with multiple accumulators can hide latencies for in-order cores like this one.
  • BitBank
    BitBank over 7 years
    @PeterCordes - Yes, you can certainly take this approach with both NEON and SSE. I've written code to find the max/min value in a large unsorted list using SIMD and it definitely speeds it up quite a bit. Vectorizing compilers usually give up if there is a conditional statement in a loop, but it's easy to write intrinsics to handle the test case.
  • Peter Cordes
    Peter Cordes over 7 years
    IIRC, ternary operators can help with auto-vectorization, since compilers like the fact that the assignment is unconditional, and only the value is conditional. That's definitely true for convincing compilers to use CMOV instead of a branch. What really defeats gcc is loops with a trip-count that's not predictable (at run time) before entering the loop. So compilers suck at search loops usually, but always looping over the whole array should be fine. (Although I suspect you would need intrinsics for what I suggested, and it's probably easier to do that than wrestle with the compiler)
  • Exectron
    Exectron over 7 years
    @DavidOngaro: a perfect hash function has many "false positives", which is to say, values not in the set would have the same hash as values in the set. So you have to have a table, indexed by the hash value, containing the "in-the-set" input value. So to validate any given input value you (a) hash it; (b) use the hash value to do the table look-up; (c) check if the entry in the table matches the input value.
  • Ira Baxter
    Ira Baxter over 7 years
    The point of a perfect hash function is that it does one probe. Period.
  • David Ongaro
    David Ongaro over 7 years
    @CraigMcQueen: a simple counter example would be value & 1 which would be a perfect hash function for the set of even (or uneven) integers (given that value as type integer is fixed). But it simply depends on your usecase, e.g. if you want to have a perfect hash to use as a jumptable for a list of keywords and you can not be sure that your input string is in the list of your keywords, or at least is very constrained, then yes you need to store the value as well.
  • David Ongaro
    David Ongaro over 7 years
    @CraigMcQueen: Mathematically you can always have a "perfect hash" function of a set, by defining it as the list of all possible values and a function which maps all input values to 1 or 0 depending on whether or not they are in the list. In this way the "perfect hash" would be just a representation of the set. Practically it only makes sense to talk about a "perfect hash function of a set" when the complexity stays near O(1) and the space requirement somewhat below O(n).
  • Jeremy
    Jeremy almost 7 years
    "the array is in RAM because it changes once in a while, but never in the ISR itself." It's important to bear in mind that for a binary search solution to work, changing the array's contents requires maintaining its sort order. If the array is modified in-place, this requires inhibiting the ISR while the sort is being performed, at the cost of some interrupt responsiveness. Another option might be to prepare the updated table elsewhere and swap it in when it's ready, at the cost of memory. Other options also suggest themselves. Presumably your solution takes this consideration into account.
  • Peter Mortensen
    Peter Mortensen over 5 years
    Someone has to tell the truth (instead of repeating conventional wisdom).
  • Peter Cordes
    Peter Cordes over 5 years
    Interesting, Cortex-M7 has optional instruction/data caches, but before that definitely not. en.wikipedia.org/wiki/ARM_Cortex-M#Silicon_customization.
  • Peter Cordes
    Peter Cordes over 5 years
    Why is i signed int? Probably the compiler can prove that it stays non-negative (and thus % HASH_LEN can be implemented as & (HASH_LEN - 1)), but you might lead the compiler to emit code that accounts for signed remainder semantics.
  • Peter Cordes
    Peter Cordes over 5 years
    gcc -O2 doesn't include -funroll-loops in gcc4.8. That's only enabled by -fprofile-use, or if you specify it manually. -O3 used to include -funroll-loops, and I don't know when it changed, but adding that option does change code-gen for the OP's C loop with gcc4.6 or gcc8.2 on Godbolt godbolt.org/z/d6yDF7. (Still nowhere near as good as your hand-written asm, though; it just replicates multiple ldr/cmp/beq blocks). GCC / clang suck at loops where the trip-count isn't calculable before entering the loop, e.g. they can't auto-vectorize search loops like this.