Fast ARM NEON memcpy


ARM has a great tech note on this.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

Your performance will definitely vary depending on the micro-architecture. ARM's note covers the Cortex-A8, but I think it will give you a decent idea, and the summary at the bottom is a good discussion of the pros and cons that go beyond the raw numbers, such as which methods use the fewest registers.

And yes, as another commenter mentions, prefetching is very difficult to get right, and it behaves differently on different micro-architectures depending on the cache size, the line size, and other details of the cache design. If you aren't careful you can end up thrashing lines you still need. I would recommend avoiding it in portable code.
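If you do want to experiment, GCC and Clang's `__builtin_prefetch` is a reasonably portable way to express the hint: it lowers to `PLD` on ARM and is dropped on targets without a prefetch instruction. A minimal sketch of a row copy with a software prefetch, assuming (as in the question) that the width is a multiple of 32; the 64-byte lookahead distance is only an illustrative starting point and would need tuning per micro-architecture:

```c
#include <stdint.h>
#include <string.h>

/* Prefetch distance: how far ahead of the current read pointer to hint.
 * Too small and the data doesn't arrive in time; too large and you may
 * evict lines you still need. 64 bytes is just a starting guess. */
#define PF_DIST 64

/* Copy an h-row image, w bytes per row (w assumed a multiple of 32),
 * with source/destination strides sp and dp. */
static void copy_image_prefetch(uint8_t *d, const uint8_t *s,
                                int w, int h, int dp, int sp)
{
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < w; j += 32) {
            /* Hint: read access (0), low temporal locality (0),
             * since each byte is touched exactly once. */
            __builtin_prefetch(s + j + PF_DIST, 0, 0);
            memcpy(d + j, s + j, 32);
        }
        d += dp;
        s += sp;
    }
}
```

The locality argument of 0 tells the compiler the data won't be reused, which on some targets selects a non-temporal prefetch variant; whether that helps or hurts is exactly the kind of thing that differs between micro-architectures.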

Author by robbie_c

Updated on August 11, 2022

Comments

  • robbie_c over 1 year

    I want to copy an image on an ARMv7 core. The naive implementation is to call memcpy per line.

    for(i = 0; i < h; i++) {
      memcpy(d, s, w);
      s += sp;
      d += dp;
    }
    

    I know that the following

    d, dp, s, sp, w
    

    are all 32-byte aligned, so my next (still quite naive) implementation was along the lines of

    for (int i = 0; i < h; i++) {
      uint8_t* dst = d;
      const uint8_t* src = s;
      int remaining = w;
      asm volatile (
        "1:                                               \n"
        "subs     %[rem], %[rem], #32                     \n"  // 32 bytes per iteration
        "vld1.u8  {d0, d1, d2, d3}, [%[src],:256]!        \n"  // load 32 bytes (256-bit aligned), post-increment
        "vst1.u8  {d0, d1, d2, d3}, [%[dst],:256]!        \n"  // store 32 bytes, post-increment
        "bgt      1b                                      \n"  // loop while remaining > 0
        : [dst]"+r"(dst), [src]"+r"(src), [rem]"+r"(remaining)
        :
        : "d0", "d1", "d2", "d3", "cc", "memory"
      );
      d += dp;
      s += sp;
    }
    

    This was ~150% faster than memcpy over a large number of iterations (on different images, so not taking advantage of caching). I feel this should be nowhere near the optimum because I have yet to use preloading, but when I do, I only seem to make performance substantially worse. Does anyone have any insight here?