Why is memmove faster than memcpy?
Solution 1
Your memmove
calls are shuffling memory along by 2 to 128 bytes, while your memcpy
source and destination are completely different. Somehow that's accounting for the performance difference: if you copy to the same place, you'll see memcpy
ends up possibly a smidge faster, e.g. on ideone.com:
memmove (002) 0.0610362
memmove (004) 0.0554264
memmove (008) 0.0575859
memmove (016) 0.057326
memmove (032) 0.0583542
memmove (064) 0.0561934
memmove (128) 0.0549391
memcpy 0.0537919
Hardly anything in it though - no evidence that writing back to an already faulted in memory page has much impact, and we're certainly not seeing a halving of time... but it does show that there's nothing wrong making memcpy
unnecessarily slower when compared apples-for-apples.
Solution 2
When you are using memcpy
, the writes need to go into the cache. When you use memmove
where when you are copying a small step forward, the memory you are copying over will already be in the cache (because it was read 2, 4, 16 or 128 bytes "back"). Try doing a memmove
where the destination is several megabytes (> 4 * cache size), and I suspect (but can't be bothered to test) that you'll get similar results.
I guarantee that ALL is about cache maintenance when you do large memory operations.
Solution 3
Historically, memmove and memcpy are the same function. They worked in the same way and had the same implementation. It was then realised that memcpy doesn't need to be (and frequently wasn't) defined to handle overlapping areas in any particular way.
The end result is that memmove was defined to handle overlapping regions in a particular way even if this impacts performance. memcpy is supposed to use the best algorithm available for non-overlapping regions. The implementations are normally almost identical.
The problem you have run into is that there are so many variations of the x86 hardware that it is impossible to tell which method of shifting memory around will be the fastest. And even if you think you have a result in one circumstance something as simple as having a different 'stride' in the memory layout can cause vastly different cache performance.
You can either benchmark what you're actually doing or ignore the problem and rely on the benchmarks done for the C library.
Edit: Oh, and one last thing; shifting lots of memory contents around is VERY slow. I would guess your application would run faster with something like a simple B-Tree implementation to handle your integers. (Oh you are, okay)
Edit2: To summarise my expansion in the comments: The microbenchmark is the issue here, it isn't measuring what you think it is. The tasks given to memcpy and memmove differ significantly from each other. If the task given to memcpy is repeated several times with memmove or memcpy the end results will not depend on which memory shifting function you use UNLESS the regions overlap.
Solution 4
"memcpy is more efficient than memmove." In your case, you most probably are not doing the exact same thing while you run the two functions.
In general, USE memmove only if you have to. USE it when there is a very reasonable chance that the source and destination regions are over-lapping.
Reference: https://www.youtube.com/watch?v=Yr1YnOVG-4g Dr. Jerry Cain, (Stanford Intro Systems Lecture - 7) Time: 36:00
cruppstahl
I'm the author of upscaledb (https://upscaledb.com), a NoSQL key-value store.
Updated on July 08, 2022Comments
-
cruppstahl almost 2 years
I am investigating performance hotspots in an application which spends 50% of its time in memmove(3). The application inserts millions of 4-byte integers into sorted arrays, and uses memmove to shift the data "to the right" in order to make space for the inserted value.
My expectation was that copying memory is extremely fast, and I was surprised that so much time is spent in memmove. But then I had the idea that memmove is slow because it's moving overlapping regions, which must be implemented in a tight loop, instead of copying large pages of memory. I wrote a small microbenchmark to find out whether there was a performance difference between memcpy and memmove, expecting memcpy to win hands down.
I ran my benchmark on two machines (core i5, core i7) and saw that memmove is actually faster than memcpy, on the older core i7 even nearly twice as fast! Now I am looking for explanations.
Here is my benchmark. It copies 100 mb with memcpy, and then moves about 100 mb with memmove; source and destination are overlapping. Various "distances" for source and destination are tried. Each test is run 10 times, the average time is printed.
https://gist.github.com/cruppstahl/78a57cdf937bca3d062c
Here are the results on the Core i5 (Linux 3.5.0-54-generic #81~precise1-Ubuntu SMP x86_64 GNU/Linux, gcc is 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5). The number in brackets is the distance (gap size) between source and destination:
memcpy 0.0140074 memmove (002) 0.0106168 memmove (004) 0.01065 memmove (008) 0.0107917 memmove (016) 0.0107319 memmove (032) 0.0106724 memmove (064) 0.0106821 memmove (128) 0.0110633
Memmove is implemented as a SSE optimized assembler code, copying from back to front. It uses hardware prefetch to load the data into the cache, and copies 128 bytes to XMM registers, then stores them at the destination.
(memcpy-ssse3-back.S, lines 1650 ff)
L(gobble_ll_loop): prefetchnta -0x1c0(%rsi) prefetchnta -0x280(%rsi) prefetchnta -0x1c0(%rdi) prefetchnta -0x280(%rdi) sub $0x80, %rdx movdqu -0x10(%rsi), %xmm1 movdqu -0x20(%rsi), %xmm2 movdqu -0x30(%rsi), %xmm3 movdqu -0x40(%rsi), %xmm4 movdqu -0x50(%rsi), %xmm5 movdqu -0x60(%rsi), %xmm6 movdqu -0x70(%rsi), %xmm7 movdqu -0x80(%rsi), %xmm8 movdqa %xmm1, -0x10(%rdi) movdqa %xmm2, -0x20(%rdi) movdqa %xmm3, -0x30(%rdi) movdqa %xmm4, -0x40(%rdi) movdqa %xmm5, -0x50(%rdi) movdqa %xmm6, -0x60(%rdi) movdqa %xmm7, -0x70(%rdi) movdqa %xmm8, -0x80(%rdi) lea -0x80(%rsi), %rsi lea -0x80(%rdi), %rdi jae L(gobble_ll_loop)
Why is memmove faster then memcpy? I would expect memcpy to copy memory pages, which should be much faster than looping. In worst case I would expect memcpy to be as fast as memmove.
PS: I know that I cannot replace memmove with memcpy in my code. I know that the code sample mixes C and C++. This question is really just for academic purposes.
UPDATE 1
I ran some variations of the tests, based on the various answers.
- When running memcpy twice, then the second run is faster than the first one.
- When "touching" the destination buffer of memcpy (
memset(b2, 0, BUFFERSIZE...)
) then the first run of memcpy is also faster. - memcpy is still a little bit slower than memmove.
Here are the results:
memcpy 0.0118526 memcpy 0.0119105 memmove (002) 0.0108151 memmove (004) 0.0107122 memmove (008) 0.0107262 memmove (016) 0.0108555 memmove (032) 0.0107171 memmove (064) 0.0106437 memmove (128) 0.0106648
My conclusion: based on a comment from @Oliver Charlesworth, the operating system has to commit physical memory as soon as the memcpy destination buffer is accessed for the very first time (if someone knows how to "proof" this then please add an answer!). In addition, as @Mats Petersson said, memmove is cache friendlier than memcpy.
Thanks for all the great answers and comments!