Improving C circular buffer efficiency


Solution 1

As "Oli Charlesworth" suggested - you'd be able to simplify things if your buffer size is a power of 2. I'd like to write the read/write function bodies, so that the intent is more clear.

#define BUFF_SIZE (4U)
#define BUFF_SIZE_MASK (BUFF_SIZE-1U)

struct buffer {
    float buff[BUFF_SIZE];
    unsigned writeIndex;
};

void write(struct buffer *buffer, float value) {
    buffer->buff[(++buffer->writeIndex) & BUFF_SIZE_MASK] = value;
}

float readn(struct buffer *buffer, unsigned Xn){
    return buffer->buff[(buffer->writeIndex - Xn) & BUFF_SIZE_MASK];
}

Some explanation: note that there is no branching (if) at all. We don't clamp the index to the array bounds; instead we AND it with the mask, and because the index is unsigned, the arithmetic wraps in a well-defined way.
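For illustration, a minimal usage sketch assuming the definitions above: the struct only needs to be zero-initialized before use, and readn(&b, 0) then returns the most recently written sample.

#include <stdio.h>

#define BUFF_SIZE (4U)
#define BUFF_SIZE_MASK (BUFF_SIZE-1U)

struct buffer {
    float buff[BUFF_SIZE];
    unsigned writeIndex;
};

void write(struct buffer *buffer, float value) {
    buffer->buff[(++buffer->writeIndex) & BUFF_SIZE_MASK] = value;
}

float readn(struct buffer *buffer, unsigned Xn) {
    return buffer->buff[(buffer->writeIndex - Xn) & BUFF_SIZE_MASK];
}

int main(void) {
    struct buffer b = {{0.0f}, 0U};    /* zero-initialize data and index */

    write(&b, 1.0f);
    write(&b, 2.0f);
    write(&b, 3.0f);

    printf("%f\n", readn(&b, 0U));     /* prints 3.0, the most recent value */
    printf("%f\n", readn(&b, 1U));     /* prints 2.0, the value before that */
    return 0;
}

(On a hosted system the external name write may collide with the POSIX function of the same name, so it may be worth renaming it for a desktop test; on a bare-metal target such as the STM32F4 this doesn't matter.)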

Solution 2

If you can make your buffer size a power of 2, then the wrap-around checks (the comparison against BUFF_SIZE in write and the while loop in readn) can be replaced with unconditional bit-masking of the index. On most processors, this should be faster.
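As a rough sketch of what that replacement could look like on the code from the question, assuming BUFF_SIZE is raised from 3 to 4 and the index is made unsigned, while keeping the original store-then-increment convention:

#define BUFF_SIZE 4U                       /* must be a power of 2 */
#define BUFF_SIZE_MASK (BUFF_SIZE - 1U)

typedef struct buffer {
    float buff[BUFF_SIZE];
    unsigned writeIndex;                   /* unsigned, so wrapping and masking are well defined */
} buffer;

void write(buffer *b, float value) {
    b->buff[b->writeIndex] = value;
    /* the compare-and-reset branch becomes a single AND */
    b->writeIndex = (b->writeIndex + 1U) & BUFF_SIZE_MASK;
}

float readn(buffer *b, unsigned Xn) {
    /* the "while (tempIndex < 0) tempIndex += BUFF_SIZE" loop also
       becomes a single AND, because unsigned underflow wraps modulo 2^N */
    return b->buff[(b->writeIndex - (Xn + 1U)) & BUFF_SIZE_MASK];
}

The only difference from Solution 1 is where the mask is applied: here the stored index is kept in range on every update, whereas Solution 1 lets the index run free and masks it at access time.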

Solution 3

This may not seem elegant, but it is efficient. Accessing structure members through a pointer takes up a lot of instructions. Why not remove the structure altogether and make buff and writeIndex global variables? This will considerably decrease the size of your readn and write functions.

I tried this in gcc; here is the output with and without the structure.

With Structure

_write:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %ecx
    movl    8(%ebp), %eax
    movl    16(%eax), %edx
    movl    12(%ebp), %eax
    movl    %eax, (%ecx,%edx,4)
    movl    8(%ebp), %eax
    incl    16(%eax)
    movl    8(%ebp), %eax
    cmpl    $3, 16(%eax)
    jne L1
    movl    8(%ebp), %eax
    movl    $0, 16(%eax)
L1:
    popl    %ebp
    ret

Without Structure, i.e. making buff and writeIndex global

_write:
    pushl   %ebp
    movl    %esp, %ebp
    movl    _writeIndex, %edx
    movl    8(%ebp), %eax
    movl    %eax, _buff(,%edx,4)
    incl    _writeIndex
    cmpl    $3, _writeIndex
    jne L1
    movl    $0, _writeIndex
L1:
    popl    %ebp
    ret
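
For reference, the second listing corresponds roughly to source of this shape (a sketch reconstructed from the assembly, keeping the question's BUFF_SIZE of 3 and the wrap-around branch, since the comparison is only about the cost of going through the struct pointer):

#define BUFF_SIZE 3

float buff[BUFF_SIZE];     /* file-scope state instead of struct members */
int writeIndex = 0;

void write(float value) {
    buff[writeIndex] = value;
    writeIndex++;
    if (writeIndex == BUFF_SIZE)
        writeIndex = 0;
}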

Solution 4

Keeping track of the start and the end of the circular buffer with pointers is likely a bit faster than array indexing, since with indexing the element address has to be computed at run time. Try replacing readIndex and writeIndex with float * instead. The write code would then be:

*buffer->writeIndex = value;
buffer->writeIndex++;
if (buffer->writeIndex == buffer->buff + BUFF_SIZE)
    buffer->writeIndex = buffer->buff;

buffer->buff + BUFF_SIZE is still a constant offset from the buffer pointer, so the compiler can fold it into the comparison rather than recomputing an element address on every access.
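
A minimal sketch of the pointer-based variant, assuming the index member is replaced by a float * that points into the array (buffer_init is a hypothetical helper added here only so the example is complete):

#define BUFF_SIZE 4U

struct buffer {
    float buff[BUFF_SIZE];
    float *writeIndex;                          /* points at the next slot to write */
};

void buffer_init(struct buffer *b) {
    b->writeIndex = b->buff;
}

void write(struct buffer *b, float value) {
    *b->writeIndex = value;
    b->writeIndex++;
    if (b->writeIndex == b->buff + BUFF_SIZE)   /* wrap at one past the end */
        b->writeIndex = b->buff;
}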

Solution 5

The accepted answer contains code that is incorrect and will invoke undefined behavior. Correction below:

#define BUFF_SIZE (4U)
#define BUFF_SIZE_MASK (BUFF_SIZE-1U)

struct buffer {
    float buff[BUFF_SIZE];
    unsigned writeIndex;
};

void write(struct buffer *buffer, float value) {
    buffer->buff[(++buffer->writeIndex) & BUFF_SIZE_MASK] = value;
}

float readn(struct buffer *buffer, unsigned Xn){
    return buffer->buff[(buffer->writeIndex - Xn) & BUFF_SIZE_MASK];
}

The error in the original answer was to assume that a signed int index will simply wrap around; signed integer overflow is undefined behaviour in C. Using bit masks on signed int is also unwise.
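
To spell out the point: unsigned arithmetic is defined to wrap modulo 2^N, so both the free-running increment and the subtraction in readn stay well defined and the mask still selects the intended slot, whereas overflowing a signed int is undefined behaviour. A small illustration:

#include <assert.h>
#include <limits.h>

#define BUFF_SIZE_MASK 3U

int main(void) {
    unsigned writeIndex = 0U;

    /* subtraction that "goes negative" wraps modulo UINT_MAX + 1 ... */
    assert(writeIndex - 1U == UINT_MAX);
    /* ... and the mask still picks the intended slot (index 3) */
    assert(((writeIndex - 1U) & BUFF_SIZE_MASK) == 3U);

    /* incrementing past the maximum is also well defined for unsigned;
       the same operation on an int holding INT_MAX would be undefined behaviour */
    writeIndex = UINT_MAX;
    writeIndex++;
    assert(writeIndex == 0U);
    return 0;
}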


Comments

  • Gurba
    Gurba almost 2 years

    I'd like some help improving the efficiency of my circular buffer code.

    I had a look around stackoverflow and found that (nearly) all of the topics on circular buffers are about the uses of such a buffer or the basic implementation of a circular buffer. I really need information about how to make it super efficient.

    The plan is to use this buffer with the STM32F4 microcontroller, which has a single-precision FPU. I plan to make heavy use of the write() and readn() functions in particular. We're literally talking a few million calls a second here, so shaving off a few clock cycles here and there is really going to make a difference.

    I'll put the most important bits of code here, the full buffer code is available via http://dl.dropbox.com/u/39710897/circular%20buffer.rar

    Can anyone provide me with a few pointers on how to improve the efficiency of this buffer?

    #define BUFF_SIZE 3             // buffer size set at compile time
    
    typedef struct buffer{
        float buff[BUFF_SIZE];
        int readIndex;
        int writeIndex;
    }buffer;
    
    /********************************\
    * void write(buffer* buffer, float value)
    * writes value into the buffer
    * @param buffer* buffer
    *   pointer to buffer to be used
    * @param float value
    *   value to be written into the buffer
    \********************************/
    void write(buffer* buffer,float value){
        buffer->buff[buffer->writeIndex]=value;
        buffer->writeIndex++;
        if(buffer->writeIndex==BUFF_SIZE)
            buffer->writeIndex=0;
    }
    
    /********************************\
    * float readn(buffer* buffer, int Xn)
    * reads specified value from buffer
    * @param buffer* buffer
    *   pointer to buffer to be read from
    * @param int Xn
    *   specifies the value to be read from buffer counting backwards from the most recently written value
    *   i.e. the most recently written value can be read with readn(buffer, 0), the value written before that with readn(buffer, 1)
    \********************************/
    float readn(buffer* buffer, int Xn){
        int tempIndex;
    
        tempIndex=buffer->writeIndex-(Xn+1);
        while(tempIndex<0){
            tempIndex+=BUFF_SIZE;
        }
    
        return buffer->buff[tempIndex];
    }
    
  • valdo
    valdo about 12 years
    Agree. Also there'll be no need to check for index over/under-flow at all. Just increment and decrement it without any checking, just AND-mask it for the actual element access
  • Gurba
    Gurba about 12 years
    I'll be investigating this, making it a power of 2 is easy enough (just set BUFF_SIZE to 4). I'll have to dig a bit for the bit-masking though
  • valdo
    valdo about 12 years
    With optimizations enabled there shouldn't be such a big difference. The compiler can resolve the addresses of buff and writeIndex once; all the rest should be the same. OTOH, without optimizations any access such as buff->... would be long
  • Lundin
    Lundin about 12 years
    @Gurba I assume the final application will have a much larger buffer size? Otherwise I would strongly question the use of the ring buffer ADT in the first place.
  • Oliver Charlesworth
    Oliver Charlesworth about 12 years
    This won't work if the OP needs more than one buffer instance.
  • Pavan Manjunath
    Pavan Manjunath about 12 years
    @valdo Hmmm.. But the compiler seems to be doing something else. I tried with -O2 and the resultant code seems to be much bigger than the above ones :O
  • Pavan Manjunath
    Pavan Manjunath about 12 years
    @OliCharlesworth You are right. He would then have to create more global buffers/indices which would go on becoming clumsier. That's why I added the not so elegant phrase in my answer :)
  • Gurba
    Gurba about 12 years
    I'll be keeping this one in mind, but I doubt that I'll use it. I do indeed need multiple instances and this just seems unmaintainable to me.. I'll just get confused too often
  • Jens Gustedt
    Jens Gustedt about 12 years
    As soon as you use bit masks you should simply switch to unsigned index types. Then you don't have to argue about two's complement or stuff like that.
  • Jens Gustedt
    Jens Gustedt about 12 years
    The assembler really looks inefficient. I suspect you aren't using the correct options for a modern platform (-O3 -march=native should do for gcc). Also, the lengthy preamble is in that assembler because this is seen as a function; when declared inline and then inlined into the call site, most of this noise should go away.
  • Gurba
    Gurba about 12 years
    I thank you a LOT, this has made my code roughly 6.5 times faster! I'd call that a rather dramatic improvement
  • Gurba
    Gurba about 12 years
    Since I can set only one answer as accepted, I'll do that with this one because of the examples, even though Oli Charlesworth came up with a similar answer.
  • Jens Gustedt
    Jens Gustedt about 12 years
    @Stacker, I meant "assembler" as a short hand for "assembly listing".
  • Mtr
    Mtr about 11 years
    It might make sense to try to make ReadArray and WriteArray functions that work with arrays instead of single values, or to make write and readn macros or inline functions (in the case of a C++ compiler). That way you avoid some stack workload, which may save you a few CPU cycles.
  • 4pie0
    4pie0 about 8 years
    @valdo writeIndex will eventually overflow
  • Evan Benn
    Evan Benn about 5 years
    As @4pie0 noted, this answer is incorrect C code, and will invoke undefined behaviour. I tried to edit the answer: stackoverflow.com/review/suggested-edits/22381560 but was rejected. Please do not use this code.
  • valdo
    valdo about 5 years
    @EvanBenn: why do you think it invokes undefined behavior? There is no problem with writeIndex overflow.
  • Evan Benn
    Evan Benn about 5 years
    When you say there is no problem, what do you mean? If a signed writeIndex overflows, that is undefined behaviour; unsigned integers wrap around.