Atomic operations in ARM


Solution 1

Okay, I got the answer from ARM's website.

If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.
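In other words, the scheduler (or the exception-return path) clears the CPU's local exclusive monitor when it switches tasks, so the Store-Exclusive that the interrupted task performs when it resumes reports failure and the retry loop simply runs again. A minimal sketch of one way an OS can ensure this, assuming the CMSIS __CLREX() intrinsic and a hypothetical on_context_switch() hook (this is not any particular kernel's code):

// Called by the scheduler just before it switches to another task.
// Clearing the local exclusive monitor means a STREX belonging to the
// interrupted task cannot spuriously succeed once that task is resumed;
// it reports failure and the LDREX/STREX loop retries.
void on_context_switch(void)
{
  __CLREX();
}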

Solution 2

The idea behind the load-exclusive/store-exclusive paradigm is that if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:

do
{
  new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));

can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value requires significant computation, one should rewrite the loop as:

do
{
  old_value = *dest;

  new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);

... Assuming CompareAndStore is something like:

uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint32_t old_value)
{
  do
  {
    if (__LDREXW(dest) != old_value) return 1; // Failure: *dest no longer holds old_value
  } while (__STREXW(new_value, dest));
  return 0; // Success
}

This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason [which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and __STREXW].
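For what it's worth, on toolchains that don't provide the Keil/CMSIS __LDREXW/__STREXW intrinsics, both patterns can be expressed with C11 <stdatomic.h> and the compiler emits the LDREX/STREX (or LDAXR/STLXR on AArch64, as one of the comments below shows) retry loop itself. A rough sketch under that assumption; the function names atomic_increment and atomic_update are just illustrative:

#include <stdatomic.h>
#include <stdint.h>

// Simple case: the compiler generates the whole exclusive-access retry loop.
uint32_t atomic_increment(_Atomic uint32_t *dest)
{
  return atomic_fetch_add(dest, 1) + 1;   // returns the incremented value
}

// "Significant computation" case: a compare-and-swap loop equivalent to the
// CompareAndStore()-based pattern above.
void atomic_update(_Atomic uint32_t *dest, uint32_t (*complicated_function)(uint32_t))
{
  uint32_t old_value = atomic_load(dest);
  uint32_t new_value;
  do
  {
    new_value = complicated_function(old_value);
    // On failure, old_value is reloaded with the current contents of *dest,
    // so only the expensive computation is redone rather than a blind retry.
  } while (!atomic_compare_exchange_weak(dest, &old_value, new_value));
}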

Addendum: An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe or interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 256), one could use something like:

#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)

void do_update(void)
{
  // The foo_pool_alloc() method should return a slot number in the lower bits and
  // some sort of counter value in the upper bits so that once some particular
  // uint32_t value is returned, that same value will not be returned again unless
  // there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
  // the possibility that while one task is performing its update, a second task
  // changes the thing to a new one and releases the old one, and a third task gets
  // given the newly-freed item and changes the thing to that, such that from the
  // point of view of the first task, the thing never changed.)

  uint32_t new_thing = foo_pool_alloc();
  uint32_t old_thing;
  do
  {
    // Capture old reference
    old_thing = foo_current_thing;

    // Compute new thing based on old one
    update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
      &foo_pool[old_thing & FOO_POOL_SIZE_MASK]);
  } while(CompareAndStore(&foo_current_thing, new_thing, old_thing) != 0);
  foo_pool_free(old_thing);
}

If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship exists among the tasks that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, and so on. If one were using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the compare-and-swap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).
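The pool allocator itself is not shown above. Below is a minimal sketch of one way foo_pool_alloc()/foo_pool_free() could implement the slot-plus-counter scheme described in the comment, reusing the FOO_POOL_* macros defined earlier. struct foo, the bookkeeping arrays, and the use of the CMSIS __get_PRIMASK()/__disable_irq()/__set_PRIMASK() intrinsics to briefly mask interrupts (making the allocator itself interrupt-safe on a UP Cortex-M part) are all assumptions, not part of the original answer:

struct foo { int payload; };                     // placeholder for the real structure

static struct foo foo_pool[FOO_POOL_SIZE];       // the pool indexed by do_update()
static uint8_t    foo_pool_used[FOO_POOL_SIZE];  // 0 = free, 1 = allocated
static uint32_t   foo_pool_counter;              // advances by FOO_POOL_SIZE per allocation

uint32_t foo_pool_alloc(void)
{
  uint32_t handle = 0;                           // 0 means "pool exhausted" here
  uint32_t primask = __get_PRIMASK();
  __disable_irq();
  for (uint32_t slot = 0; slot < FOO_POOL_SIZE; slot++)
  {
    if (!foo_pool_used[slot])
    {
      foo_pool_used[slot] = 1;
      foo_pool_counter += FOO_POOL_SIZE;         // changes the upper bits on every
                                                 // allocation, so the same handle value
                                                 // cannot recur until the counter wraps
      handle = foo_pool_counter | slot;
      break;
    }
  }
  __set_PRIMASK(primask);                        // restore previous interrupt state
  return handle;
}

void foo_pool_free(uint32_t handle)
{
  foo_pool_used[handle & FOO_POOL_SIZE_MASK] = 0;
}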


Comments

  • sgupta almost 2 years

    I've been working on an embedded OS for ARM; however, there are a few things I didn't understand about the architecture even after referring to the ARM ARM and the Linux source.

    Atomic operations.

    The ARM ARM says that Load and Store instructions are atomic and their execution is guaranteed to complete before an interrupt handler executes. Verified by looking at

    arch/arm/include/asm/atomic.h :
        #define atomic_read(v)  (*(volatile int *)&(v)->counter)
        #define atomic_set(v,i) (((v)->counter) = (i))
    

    However, the problem comes in when I want to manipulate this value atomically using the CPU instructions (atomic_inc, atomic_dec, atomic_cmpxchg, etc.), which use LDREX and STREX on ARMv7 (my target).

    The ARM ARM doesn't say anything about interrupts being blocked in this section, so I assume an interrupt can occur between the LDREX and STREX. The thing it does mention is locking the memory bus, which I guess is only helpful for MP systems where more than one CPU may try to access the same location at the same time. But for UP (and possibly MP), if a timer interrupt (or IPI for SMP) fires in this small window between LDREX and STREX, the exception handler executes, possibly changes the CPU context and returns to a new task; however, the shocking part comes now: it executes 'CLREX' and hence removes any exclusive lock held by the previous thread. So how is using LDREX and STREX any better than LDR and STR for atomicity on a UP system?

    I did read something about an exclusive lock monitor, so I have a possible theory: when the thread resumes and executes the STREX, the OS monitor causes this call to fail, which can be detected, and the loop can be re-executed using the new value in the process (branch back to LDREX). Am I right here?

  • sgupta over 11 years
    I've been doing exactly the same thing, but the part where significant computation is required for the new value still puzzles me. Using a cmpxchg loop makes sense, because then the exclusive monitor won't be cleared by a context switch, but redoing the significant computation adds a lot of overhead, as I've observed STREX to fail for no apparent reason (UP with IRQs masked in the PSR), as mentioned in your post.
  • supercat over 11 years
    @user1075375: See addendum
  • ahcox over 9 years
    These (__LDREXW & __STREXW) are intrinsics supported in the Keil compilers for Cortex-M series microcontroller-level processors, not generally available for mainstream ARM targets (e.g., AArch64) and compilers (e.g., gcc, llvm), right? infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/…
  • supercat over 9 years
    @ahcox: ARM recommends that all compiler vendors targeting the Cortex-M3 series support those intrinsics, but can't force them; I would expect that GCC should support them, but I don't really know. As for other processors, I know the Cortex-M0 does not support those operations, but I would expect more advanced processors to do so. Generally I would suggest confining their use to small methods like "atomic increment" and such, which could easily be rewritten if needed to use other approaches (e.g. on the Cortex-M0, they'd be implemented by temporarily disabling interrupts; a rough sketch of that fallback appears after these comments).
  • ahcox over 9 years
    Thanks @supercat. Just now looking at some AArch64 disassembly, I see that the GCC builtin __sync_add_and_fetch(address, 1) generates optimal code using these ops, like this:

      top:  ldaxr w1, [x0]
            add   w1, w1, #0x1
            stlxr w2, w1, [x0]
            cbnz  w2, top
  • Peter Cordes over 5 years
    This is safe as long as your OS uses clrex or a dummy strex inside the context-switch, or itself uses a LDREX / STREX before returning to user-space. Otherwise you could get a false positive if you switched from one LL/SC retry loop to the middle of another. The CPU could see the LDREX in one loop and the STREX in the other and if no invalidations came in between them then it could succeed on simplistic implementations.
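Regarding the Cortex-M0 fallback supercat mentions above: ARMv6-M has no LDREX/STREX, so a short critical section is the usual substitute. A minimal sketch, assuming the CMSIS __get_PRIMASK()/__disable_irq()/__set_PRIMASK() intrinsics; the function name is illustrative:

#include <stdint.h>
#include "cmsis_compiler.h"              // or whichever device header provides the CMSIS intrinsics

// Atomic increment for ARMv6-M (Cortex-M0/M0+), which lacks exclusive accesses:
// briefly mask interrupts around the read-modify-write.
uint32_t atomic_increment_m0(volatile uint32_t *dest)
{
  uint32_t primask = __get_PRIMASK();    // remember whether interrupts were enabled
  __disable_irq();
  uint32_t new_value = *dest + 1;
  *dest = new_value;
  __set_PRIMASK(primask);                // restore the previous interrupt state
  return new_value;
}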