In C, is accessing an array by index faster, or accessing through a pointer?

Solution 1

templatetypedef has summed it up. To add some support to his response, take these example functions:

unsigned int fun1 ( unsigned int *x )
{
    unsigned int ra,rb;

    rb=0;
    for(ra=0;ra<1000;ra++) rb+=*x++;
    return(rb);
}

unsigned int fun2 ( unsigned int *x )
{
    unsigned int ra,rb;
    rb=0;
    for(ra=0;ra<1000;ra++) rb+=x[ra];
    return(rb);
}

Now gcc produced this:

00000000 fun1:
   0:   e52d4004    push    {r4}        ; (str r4, [sp, #-4]!)
   4:   e1a03000    mov r3, r0
   8:   e2804efa    add r4, r0, #4000   ; 0xfa0
   c:   e3a00000    mov r0, #0
  10:   e1a02003    mov r2, r3
  14:   e492c004    ldr ip, [r2], #4
  18:   e5931004    ldr r1, [r3, #4]
  1c:   e2823004    add r3, r2, #4
  20:   e080000c    add r0, r0, ip
  24:   e1530004    cmp r3, r4
  28:   e0800001    add r0, r0, r1
  2c:   1afffff7    bne 10 
  30:   e49d4004    pop {r4}        ; (ldr r4, [sp], #4)
  34:   e12fff1e    bx  lr

00000038 fun2:
  38:   e3a03000    mov r3, #0
  3c:   e1a02003    mov r2, r3
  40:   e790c003    ldr ip, [r0, r3]
  44:   e2833004    add r3, r3, #4
  48:   e7901003    ldr r1, [r0, r3]
  4c:   e2833004    add r3, r3, #4
  50:   e082200c    add r2, r2, ip
  54:   e3530efa    cmp r3, #4000   ; 0xfa0
  58:   e0822001    add r2, r2, r1
  5c:   1afffff7    bne 40 
  60:   e1a00002    mov r0, r2
  64:   e12fff1e    bx  lr

The code is different, but I am surprised at the missed opportunities for optimization.

Clang/llvm produced this:


00000000 fun1:
   0:   e3a01000    mov r1, #0
   4:   e3a02ffa    mov r2, #1000   ; 0x3e8
   8:   e1a03001    mov r3, r1
   c:   e2522001    subs    r2, r2, #1
  10:   e490c004    ldr ip, [r0], #4
  14:   e08c3003    add r3, ip, r3
  18:   e2c11000    sbc r1, r1, #0
  1c:   e182c001    orr ip, r2, r1
  20:   e35c0000    cmp ip, #0
  24:   1afffff8    bne c 
  28:   e1a00003    mov r0, r3
  2c:   e12fff1e    bx  lr

00000030 fun2:
  30:   e3a01000    mov r1, #0
  34:   e3a02ffa    mov r2, #1000   ; 0x3e8
  38:   e1a03001    mov r3, r1
  3c:   e2522001    subs    r2, r2, #1
  40:   e490c004    ldr ip, [r0], #4
  44:   e08c3003    add r3, ip, r3
  48:   e2c11000    sbc r1, r1, #0
  4c:   e182c001    orr ip, r2, r1
  50:   e35c0000    cmp ip, #0
  54:   1afffff8    bne 3c
  58:   e1a00003    mov r0, r3
  5c:   e12fff1e    bx  lr

You might notice that the compiler produced exactly the same code, pointer or offset, and that by changing compilers I was better off than by changing pointer vs. array indexing. I think llvm could have done a little better here; I will need to study this some more to understand what my code did to cause this.

EDIT:

I was hoping to get the compiler to at a minimum use the ldr rd,[rs],#4 instruction, which favors pointers, and hoped the compiler would see that it could destroy the array address, thus treating it like a pointer rather than an offset into an array (and use the above instruction, which is basically what clang/llvm did). Or, if it kept the array form, that it would use the ldr rd,[rm,rn] instruction. Basically, I was hoping one of the compilers would generate one of these solutions:


funa:
    mov r1,#0
    mov r2,#1000
funa_loop:
    ldr r3,[r0],#4
    add r1,r1,r3
    subs r2,r2,#1
    bne funa_loop
    mov r0,r1
    bx lr

funb:
    mov r1,#0
    mov r2,#0
funb_loop:
    ldr r3,[r0,r2]
    add r1,r1,r3
    add r2,r2,#4
    cmp r2,#4000
    bne funb_loop
    mov r0,r1
    bx lr

func:
    mov r1,#0
    mov r2,#4000
func_loop:
    subs r2,r2,#4
    ldr r3,[r0,r2]
    add r1,r1,r3
    bne func_loop
    mov r0,r1
    bx lr

Didn't quite get there, but got pretty close. This was a fun exercise. Note that the above is all ARM assembler.

In general (not my specific C code example, and not necessarily ARM), on a number of the popular architectures you will have a load from a register-based address (ldr r0,[r1]) and a load with a register index/offset (ldr r0,[r1,r2]), where the address is the sum of the two registers. Ideally one register holds the base address of the array and the second the index/offset. The former load lends itself to pointers, the latter to arrays. If your C program is NOT going to change or move the pointer or index, then in both cases you have a static address which is computed once and then accessed with a normal load, so array and pointer should produce the same instructions. The more interesting case is changing the pointer/index.

Pointer

ldr r0,[r1]
...
add r1,r1,some number

Array index

ldr r0,[r1,r2]
...
add r2,r2,some number

(replace the load with a store and the add with a sub as needed)

Some architectures do not have a three-register register-index load instruction, so there you have to do something like this:

array index:
mov r2,r1
...
ldr r0,[r2]
...
add r2,r2,some number

Or, depending on the compiler, it can get really bad, especially if you compile for debugging or without optimizations, and assuming you don't have a three-register add:

array index:
mov r2,#0
...
mov r3,r1
add r3,r2
ldr r4,[r3]
...
add r2,some number

So it is quite possible that the two approaches are equal. As seen on the ARM, the compiler can combine the two pointer instructions into one (within limits for the immediate), making the pointer version a little faster. The array-index solution burns more registers, and depending on the number of registers available on the architecture, that pushes you toward swapping registers out to the stack sooner and more often than you would with pointers, slowing you down even more. If you don't mind destroying the base address, the bottom line is that the pointer solution might give you an advantage from a performance perspective. It has a lot to do with your code and your compiler.

For me, readability also comes into play, and I feel arrays are easier to read and follow. Second, do I need to preserve that pointer, to free a malloc or to go through that memory again? If so, I will probably use an array with an index; if it is a one-time pass and I don't care about destroying the base address, I will use a pointer. As you saw above with the compiler-generated code, if performance is critical, hand-code the solution in assembler anyway (based on suggested approaches from letting the compilers try it first).

Solution 2

It's completely system-dependent which one is faster, but the two are functionally equivalent to one another and I'd be really surprised if one actually was faster. That is, the code

myArr[index]

is completely equivalent to

*(&myArr[0] + index)

Similarly, writing

*ptr

is equivalent to writing

ptr[0]

Most compilers are smart enough to figure this out, so I'd be amazed if one was faster than another.

More importantly, though, you probably shouldn't be too worried about this. Worry about optimizations after you have everything else working. If you find that array accesses really are killing you, then consider finding a faster alternative. Otherwise, don't worry about it; it's infinitely more valuable to have clean, readable, maintainable code than it is to have optimized code unless you have a pressing need for optimization.

Solution 3

Simple index operations compile to the same machine code on every compiler I've ever touched. Access by index is usually recommended for readability.

More complex cases that involve different logic for pointer access vs array indexing need to be examined on a case-by-case basis. If you are in doubt, profile your code - as always.

Solution 4

There's no meaningful answer to your question. Language-level operations have no specific "speed" associated to them. By themselves, they can't be "faster" or "slower".

Only CPU instructions can be faster or slower, and only CPU instructions can consume CPU cycles. To carry this concept of "speed" back from the CPU instructions to the language-level operations they were generated from, in the general case you need to know the context. This is because the same language-level operation can generate totally different CPU instructions in different contexts (not even mentioning that it might also depend on the compiler settings, etc.).

In other words, post the actual code. As an abstract context-less question it simply makes no sense.

Solution 5

At the lowest level, these operations mostly tend to compile to the same thing. If you're really interested, you should get your C compiler to generate assembly output (such as with gcc -S) so you can check, especially since it depends, at a bare minimum, on:

  • your target platform.
  • your compiler.
  • your optimisation level.

You'll find that, even if there was a difference (which is doubtful), that level of micro-optimisation is mostly not worth the effort you put into it. You're better off doing macro-optimisations such as improved algorithms since that's the sort of thing that offers more return on investment.

In these sorts of situations, where the effect is likely to be minimal, I always optimise for readability.


Author: David Gao

Updated on February 27, 2020

Comments

  • David Gao
    David Gao about 4 years

    In C, is accessing an array index faster, or accessing by pointer? By faster I mean: which one would take fewer clock cycles? The array is not a constant array.

    • Peyman
      Peyman about 13 years
      If you have a pointer to the exact location of an element, then pointers might be faster, because indexing has to calculate the address of the element based on the base address.
  • templatetypedef
    templatetypedef about 13 years
    Two operations being O(1) says nothing about their relative speed; both 1 and 1000000 are O(1).
  • Dervin Thunk
    Dervin Thunk about 13 years
    Given the abstract nature of the OP's question, this (and AndreyT's) answer is the only correct answer. You cannot answer this without going into assembly. You say what you know, that it is O(1). Therein lies the beauty of theoretical computer science.
  • Oystein
    Oystein about 13 years
    How do you know that it's O(1)? What if my compiler compiled a[i] to a loop that increased a pointer by 1 i times? That would make it O(n). If you want to be pedantic...
  • Dervin Thunk
    Dervin Thunk about 13 years
    @Oyster: Ah, yes, the famous case of memory being implemented as a linked-list... :)
  • Olof Forshell
    Olof Forshell about 13 years
    "the only correct answer" - that's a VERY broad statement which I'm not sure anyone could defend, not even you.
  • Olof Forshell
    Olof Forshell about 13 years
    "It's infinitely more valuable to have clean, readable, maintainable code than it is to have optimized code unless you have a pressing need for optimization" and then optimization is - what? A very sweeping statement which is more fitting to someone who has believed everything his profs have said at university than someone who has many years of experience. Everything is a balance of pros and cons and the sooner all our new colleagues learn this the sooner our clients will benefit from it.
  • Olof Forshell
    Olof Forshell about 13 years
    This answer is so uninformed that it shames our profession. Harsh words, but I stand behind them 100%. The compiler is not a random instruction generator as you appear to be implying, but often an excellent-quality tool that is imperative to our work. However, if you feed it bloated garbage source code it will faithfully produce bloated garbage machine code that performs poorly. Judging from your answer, this appears to be your general experience. If you feed it quality source code it will faithfully produce quality machine code that is fast. Perhaps you were unaware of this relationship?
  • Olof Forshell
    Olof Forshell about 13 years
    "You'll find that, even if there was a difference (which is doubtful), that level of micro-optimisation is not worth the effort you put into it" is a very absolute statement and I'm sure that it will fall like a deck of cards if challenged. Should you really be offering your (hardly informed) opinion here? I say that the correct answer is that "in some - though not all - cases it is well worth the effort and in the process you will have learned something useful for future projects."
  • AnT stands with Russia
    AnT stands with Russia about 13 years
    @Olof Forshell: Are you trolling or what? What you posted in your comment is some kind of irrelevant nonsense (talk about shame). All I said that the actual machine code will drastically depend on the context. Like when accessing all array elements sequentially in a cycle, both index access and "sliding-pointer" access will often produce the same code in many modern compilers. When accessing individual array elements outside of any mass-access context, the code might easily turn out to be different for pointer and index access.
  • AnT stands with Russia
    AnT stands with Russia about 13 years
    Where and how the "bloated garbage source code" came into the picture, and how it could possibly be relevant to the question and/or to my answer, is totally unclear to me. I suspect that someone's forgetting to drink their glass of juice in the morning might be the root of the problem here.
  • Olof Forshell
    Olof Forshell about 13 years
    A very good answer where different approaches were tested against each other. A- for an ambitious approach but you did not do it for arrays as the question specified. I'm also missing a comparison of relative execution times. I'll give you a 1 though!
  • paxdiablo
    paxdiablo about 13 years
    @Olof, you're right, I normally despise absolutes, so I changed it. However, my opinion is anything but uninformed. Short of a brain-dead compiler, pointers and array indexes will have exactly the same underlying assembly, and the benefits of macro optimisations vastly outweigh a couple of clock cycles. That's from long experience.
  • Olof Forshell
    Olof Forshell about 13 years
    Given an investigative approach, it is entirely possible to learn what sort of machine code the compiler will generate for different C constructs, if not the exact sequence (at least I can't). It is therefore also possible to choose the highest-performing construct and be assured that the performance will be there come execution time. A developer who does not investigate these matters will not be competent to say that it is not possible nor will he or she understand that they should refrain from offering advice except perhaps "I do not know. Isn't there a way you can investigate this?"
  • Olof Forshell
    Olof Forshell about 13 years
    You said that "language-level operations have no specific "speed"" and I say yes they do. A poorly constructed piece of C code will cause the compiler to generate poor and slow machine code. Ergo the opposite must be true as well. As to the "same language-level operation can generate totally different CPU-instructions in different contexts" I say that generally speaking they will produce more or less the same code however the context may add instructions not apart from the addressing itself. The optimization level however, can produce very different sequences, but that is not a "context."
  • Olof Forshell
    Olof Forshell about 13 years
    Let me clarify if I was unclear: language-level operations have everything to do with execution speed in that the compiler uses them to produce the executable code. I know of no commercial or free compiler that produces good machine code from poor source code. That does not mean that no such compiler exists, perhaps you've come across them? A speed optimization level will generally produce faster code than that for debug (low) optimization. There are very definitely meaningful answers to the question.
  • Olof Forshell
    Olof Forshell about 13 years
    Oops I meant "the context may add instructions that are not part of the addressing itself"
  • Olof Forshell
    Olof Forshell about 13 years
    "exactly" ... hmm. A not entirely hypothetical situation: assume your function is supposed to return the index of the first structure member where active is FALSE. The structure has one million members and is randomly loaded to 2/3. This is a fairly simple algorithm where a local optimization inside the loop can mean a halved or better average search time. Should this optimization not be taken because macro-optimisations should be performed but there are none because the algorithm is too simple?
  • old_timer
    old_timer about 13 years
    Clock cycles is as loaded a question as anything else. The instructions with memory cycles are probably going to cost more, so if one loop has 8 instructions and another 7, but the one with 8 also has an extra memory cycle or two, that may really change the results. Cache differences, etc.; it is a mess. Even within a single processor family, or even the same processor core, the speeds are going to vary widely based on factors outside the chip.
  • old_timer
    old_timer about 13 years
    Read Michael Abrash, Zen of Assembly Language, or others. Even if you think the code looks faster, you need to time it and test it. For a generalized faster-or-slower question without a mountain of details as to what it is running on, a generalized answer ("this will often cost you more; this is often easier for the compiler to implement") is about the best you can do. And unless you have a detailed reason, and usually a closed platform, you don't want more than a generalized performance solution.
  • old_timer
    old_timer about 13 years
    If you look at templatetypedef's (excellent) answer, I did respond to that and to the original question, arrays vs pointers. *ptr vs ptr[0] is not worth coding; we know the answer. ptr[index] vs *(ptr + index), or ptr[index++] vs *ptr++, are the use cases where a performance question that can be seen in the compiler output is interesting. I didn't cover *(ptr + index) vs ptr[index] because, like *ptr vs ptr[0], they are the same. So if you have another use case that is not covered by the question or by templatetypedef, I could talk to that.
  • paxdiablo
    paxdiablo about 13 years
    @Olof: That depends on the optimisation (you haven't stated what it is so it's a little hard to evaluate). If it's something simple like common subexpression elimination or constant folding, that's best left to the compiler (unless, as I stated, you have a brain-dead compiler). If your optimisation is changing the algorithm based on the fact it's 2/3 loaded, or maintaining the first FALSE index separately to the list to be changed only on certain updates (falsifying one before that point or truifying that particular one), that is a macro optimisation.
  • Olof Forshell
    Olof Forshell about 13 years
    Common subexpression elimination is relatively often performed by RISC compilers that have more "free" registers to allocate. For x86-32 (don't know about -64) it is relatively less often performed, since optimizations must be juggled with a quarter as many registers as RISC (8 vs 32). To say that the compiler is "brain-dead" seems to me to be somewhat of an over-simplification: it does the best it can with what it has to work with, I guess. My answer further down on the page illustrates an optimization using explicit common subexpression elimination and how it can be evolved further.
  • Olof Forshell
    Olof Forshell about 13 years
    As I've written in a comment here: in the same way that it is possible to write poor C code and get poorly performing machine code it is also possible to write good C code and get fast and efficient machine code. Many "don't-optimizers" (or whatever they should be called) seem to think that it doesn't matter what source code is written as the compiler will produce the same machine code anyway. This is pure rubbish. As to hand-writing assembly code I think that in general it is best left to the compiler. However, I can still help the compiler to do a better job.
  • Olof Forshell
    Olof Forshell about 13 years
    The x86-32 general purpose registers are eax, ebx, ecx, edx, esi, edi, ebp and esp. My OpenWatcom compiler uses them all except esp. On RISC two or more may have special purposes such as stack pointer, often-used constants etc.
  • paxdiablo
    paxdiablo about 13 years
    Not sure where the register requirement comes into it there. CSE doesn't require registers, it's just the act of memorising a calculation so that you only do it once, not every time in a loop. While register storage of the result would be even more optimal, it's not required. Memory storage of CSE is still a big performance boost.
  • Olof Forshell
    Olof Forshell about 13 years
    It's not the calculation that needs to be memorized, it's the result. To calculate a result you need registers. If it's used often enough, the result will be stored in a register on RISC processors. The same applies for x86 if the code isn't too big. Otherwise it will be re-calculated. An x86 compiler will not allocate a stack location to store it unless explicitly instructed to do so by the programmer. That's why explicit CSE may be necessary for x86 programs.
  • David Gao
    David Gao about 13 years
    Thank you, I understand a lot now!
  • galois
    galois over 8 years
    "premature optimization is the root of all evil"
