What does `rep ret` mean?
Solution 1
There's a whole blog named after this instruction. And the first post describes the reason behind it: http://repzret.org/p/repzret/
Basically, there was an issue in AMD's branch predictor when a single-byte `ret` immediately followed a conditional jump, as in the code you quoted (and in a few other situations). The workaround was to add the `rep` prefix, which the CPU ignores but which avoids the predictor penalty.
Solution 2
Apparently, some AMD processors' branch predictors behave badly when a branch's target or fall-through is a `ret` instruction, and adding the `rep` prefix avoids this.
As to the meaning of `rep ret`, there is no mention of this instruction sequence in the Intel Instruction Set Reference, and the documentation of `rep` is not very helpful:
The behavior of the REP prefix is undefined when used with non-string instructions.
This means, at least, that the `rep` doesn't have to behave in a repeating manner.
Now, from the AMD instruction set reference (1.2.6 Repeat Prefixes):
The prefixes should only be used with such string instructions.
In general, the repeat prefixes should only be used in the string instructions listed in tables 1-6, 1-7, and 1-8 above [which do not contain ret].
So it really seems like undefined behavior, but one can assume that, in practice, processors just ignore `rep` prefixes on `ret` instructions.
Solution 3
As Trillian's answer points out, AMD K8 and K10 have a problem with branch prediction when `ret` is a branch target or follows a conditional branch (as the fall-through target). That's because `ret` is only 1 byte long.
"repz ret: why all the hassle?" has some extra details about the specific micro-architectural reasons why that gives K8 and Barcelona a hard time.
Avoiding 1-byte `ret` as a possible branch target:
AMD's optimization guide for K10 (Barcelona) recommends the 3-byte `ret 0` in those cases, which pops zero bytes from the stack as well as returning. That version is significantly worse than `rep ret` on Intel. Ironically, it's also worse than `rep ret` on later AMD processors (Bulldozer and onwards). So it's a good thing nobody changed to using `ret 0` based on AMD's Family 10 optimization guide update.
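The return encodings discussed here differ only in their leading bytes, so they are easy to tell apart mechanically. As a minimal sketch (the names `ret_kind` and `classify_ret` are made up for illustration and come from no library), the following C function classifies the bytes at a code address as a plain 1-byte `ret` (`c3`), the 2-byte `rep ret` (`f3 c3`, the byte sequence this answer mentions further down), or the 3-byte `ret imm16` form (`c2` plus a 16-bit immediate, which is how `ret 0` is encoded). It only looks at near returns and treats everything else as "not a return".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Near-return encodings discussed in this answer (illustrative names). */
typedef enum {
    RET_NONE,   /* no near return at this position          */
    RET_PLAIN,  /* c3     : 1-byte ret                      */
    RET_REP,    /* f3 c3  : 2-byte rep ret (a.k.a. repz ret) */
    RET_IMM16   /* c2 iw  : 3-byte ret imm16, e.g. ret 0    */
} ret_kind;

/* Classify the bytes at code[0..len) and store the encoding length
 * in *out_len (0 when nothing matched). */
ret_kind classify_ret(const uint8_t *code, size_t len, size_t *out_len)
{
    assert(code != NULL && out_len != NULL);
    *out_len = 0;
    if (len >= 1 && code[0] == 0xC3) {
        *out_len = 1;
        return RET_PLAIN;
    }
    if (len >= 2 && code[0] == 0xF3 && code[1] == 0xC3) {
        *out_len = 2;
        return RET_REP;
    }
    if (len >= 3 && code[0] == 0xC2) {
        *out_len = 3;   /* opcode byte plus a 16-bit immediate */
        return RET_IMM16;
    }
    return RET_NONE;
}
```

Calling it on the bytes `f3 c3` reports `RET_REP` with a length of 2, which is the one-byte size difference the rest of this answer tries to price.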
The processor manuals warn that future processors could interpret differently a combination of a prefix and an instruction that the prefix doesn't modify. That's true in theory, but nobody's going to make a CPU that can't run a lot of existing binaries.
gcc still uses `rep ret` by default (without `-mtune=intel`, `-march=haswell`, or something similar). So most Linux binaries have a `repz ret` in them somewhere.
gcc will probably stop using `rep ret` in a few years, once K10 is thoroughly obsolete. After another 5 or 10 years, almost all binaries will be built with a gcc newer than that. Another 15 years after that, a CPU manufacturer might think about repurposing the `f3 c3` byte sequence as (part of) a different instruction.
There will still be legacy closed-source binaries using `rep ret` that don't have more recent builds available and that someone needs to keep running, though. So whatever new feature `f3 c3 != rep ret` is part of would need to be disable-able (e.g. with a BIOS setting), and that setting would have to actually change the instruction-decoder behaviour to recognize `f3 c3` as `rep ret`. If that backwards compatibility for legacy binaries isn't possible (because it can't be done efficiently in terms of power and transistors), IDK what kind of time-frame you'd be looking at. Much longer than 15 years, unless this was a CPU for only part of the market.
So it's safe to use `rep ret`, because everyone else is already doing it. Using `ret 0` is a bad idea. In new code, it may still be a good idea to use `rep ret` for another couple of years. There probably aren't too many AMD Phenom II CPUs still around, but they're slow enough without extra return-address mispredicts or whatever the problem is.
The cost is pretty small. It doesn't end up taking any extra space in most cases, because it's usually followed by `nop` padding anyway. However, in the cases where it does result in extra padding, it'll be the worst case, where 15B of padding is needed to reach the next 16B boundary. gcc may only align by 8B in that case (with `.p2align 4,,10;` to align to 16B if it will take 10 or fewer nop bytes, then a `.p2align 3` to always align to 8B. Use `gcc -S -o-` to produce asm output to stdout to see when it does this).
So if we guesstimate that one in 16 `rep ret` instances ends up creating extra padding where a `ret` would have just hit the desired alignment, and that the extra padding goes to an 8B boundary, this means each `rep` has an average cost of 8 * 1/16 = half a byte.
`rep ret` isn't used often enough to add up to much of anything. For example, firefox with all the libraries it has mapped only has ~9k instances of `rep ret`. So that's about 4k bytes, across many files. (And less RAM than that, since many of those functions in dynamic libraries are never called.)
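The back-of-the-envelope estimate above is just a probability times a penalty, scaled by the number of instances. A minimal sketch of that arithmetic in C follows; the function names are hypothetical, and the 1-in-16 probability and 8-byte penalty are this answer's guesses, not measurements.

```c
#include <assert.h>

/* Expected extra bytes per rep ret: the chance that the extra prefix
 * byte spills past an alignment boundary, times the bytes lost when
 * that happens. */
double expected_extra_bytes(double spill_probability, double bytes_per_spill)
{
    assert(spill_probability >= 0.0 && spill_probability <= 1.0);
    return spill_probability * bytes_per_spill;
}

/* Estimated total cost across `count` instances. */
double total_extra_bytes(unsigned long count, double per_instance)
{
    return (double)count * per_instance;
}
```

With the guessed inputs, `expected_extra_bytes(1.0 / 16.0, 8.0)` gives 0.5 bytes per instance, and `total_extra_bytes(9649, 0.5)` gives 4824.5 bytes for the instance count measured below, the same few-kilobyte ballpark as the estimate in the text.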
# disassemble every shared object mapped by a process.
# pgrep -o picks the oldest matching process, in case several firefox processes are running.
ffproc=/proc/$(pgrep -o firefox)
objdump -d "$ffproc/exe" $(sudo ls -l "$ffproc"/map_files/ |
    awk '/\.so/ {print $NF}' | sort -u) |
    grep 'repz ret' -c
# output:
objdump: '(deleted)': No such file  # I forgot to restart firefox after the libexpat security update
9649
That counts `rep ret` in all the functions in all the libraries firefox has mapped, not just the functions it ever calls. This is somewhat relevant, because lower code density across functions means your calls are spread out over more memory pages. The ITLB and L2-TLB only have a limited number of entries. Local density matters for L1I$ (and Intel's uop-cache). Anyway, `rep ret` has a very tiny impact.
It took me a minute to think of a reason that `/proc/<pid>/map_files/` isn't accessible to the owner of the process, but `/proc/<pid>/maps` is. If a UID=root process (e.g. from a suid-root binary) `mmap(2)`s a 0666 file that's in a 0700 directory, then does `setuid(nobody)`, anyone running that binary could bypass the access restriction imposed by the lack of `x for other` permission on the directory.
Devolus
Updated on April 29, 2020
Comments
-
Devolus about 4 years
I was testing some code on Visual Studio 2008 and noticed `security_cookie`. I can understand the point of it, but I don't understand what the purpose of this instruction is.
rep ret /* REP to avoid AMD branch prediction penalty */
Of course I can understand the comment :) but what is this prefix exactly doing in context with the `ret`, and what happens if `ecx` is != 0? Apparently the loop count from `ecx` is ignored when I debug it, which is to be expected.
The code where I found this was here (injected by the compiler for security):
void __declspec(naked) __fastcall __security_check_cookie(UINT_PTR cookie)
{
    /* x86 version written in asm to preserve all regs */
    __asm {
        cmp ecx, __security_cookie
        jne failure
        rep ret /* REP to avoid AMD branch prediction penalty */
failure:
        jmp __report_gsfailure
    }
}
-
Devolus over 10 years
Yes, I looked into the Intel manual as well before asking, but I figured from the comment that I would not find anything useful there (and indeed I didn't), as the comment already said it was about AMD anyway.
-
Trillian over 10 years
@Devolus Right, and AMD's documentation says the same thing. I guess that if Microsoft uses this in the CRT, they must have a reason to think that it's a `nop` and that it's going to stay that way.
-
Devolus over 10 years
As it is Visual Studio 2008, it may already have been changed in a newer version.
-
Kerrek SB over 10 years
Yeah, it's undefined according to the architecture... And if you like `rep ret`, you will probably love `rep nop` :-)
-
Peter Cordes over 8 years
AFAICT, the issue is present in AMD K8 and K10 (Barcelona) CPUs. It's definitely not present in Bulldozer and later. The last K10 desktop CPUs were Phenom II. gcc will probably stop defaulting to `rep ret` at some point in the next few years.
-
Peter Cordes over 8 years
It's not undefined behaviour. IIRC, Intel's manual says prefixes that don't apply to an instruction are ignored. The issue is that it's potentially not future-proof: the prefix byte could get a new meaning for that instruction in a future instruction-set extension, or the whole prefix+opcode sequence could mean something else. This won't happen for `rep ret`, because gcc uses it by default.
-
Blindy over 6 years
@PeterCordes, 2018 and it's still there.
-
Acorn almost 6 years
@Blindy: Starting with gcc 8.1 (released May 2018), by default, it outputs `ret`.
-
Peter Cordes about 4 years
Correction to my prev comment: Intel's manual actually says that there's no guarantee what will happen with prefixes which don't apply to an instruction. Current CPUs ignore prefixes they don't understand, but vendors don't document that fact until after they use them as part of a backwards-compatible new instruction like `rep nop` = `pause` that older CPUs are guaranteed to execute safely. Other examples: "Are Intel TSX prefixes executed (safely) on AMD as NOP?" and "Does x64 support imply BMI1 support?"
-
Peter Cordes about 4 years
So you can't just randomly tack on a `rep` prefix to make an instruction longer; a future CPU might decode that as a different instruction. Like `rep bsr` is `lzcnt` on newer CPUs. Use other prefixes like segment overrides that can apply to make instructions longer: "What methods can be used to efficiently extend instruction length on modern x86?"
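The point of these last comments, that an `f3` prefix can turn one instruction into another, can be illustrated with a toy opcode lookup. This is a hedged sketch, not a real decoder: `insn_kind`, `decode_opcode`, and the `cpu_has_lzcnt` flag are hypothetical names, only the leading opcode bytes are examined (`0f bd` would normally be followed by a ModRM byte), and all other prefixes and operands are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    INSN_UNKNOWN,
    INSN_RET,      /* c3                                   */
    INSN_REP_RET,  /* f3 c3 : prefix ignored in practice   */
    INSN_NOP,      /* 90                                   */
    INSN_PAUSE,    /* f3 90 : "rep nop"                    */
    INSN_BSR,      /* 0f bd                                */
    INSN_LZCNT     /* f3 0f bd on CPUs that support LZCNT  */
} insn_kind;

/* Identify a handful of opcodes where an f3 prefix matters.
 * cpu_has_lzcnt models whether the CPU implements LZCNT; without it
 * the prefix is ignored and the bytes run as bsr. */
insn_kind decode_opcode(const uint8_t *code, size_t len, int cpu_has_lzcnt)
{
    assert(code != NULL);
    if (len >= 3 && code[0] == 0xF3 && code[1] == 0x0F && code[2] == 0xBD)
        return cpu_has_lzcnt ? INSN_LZCNT : INSN_BSR;
    if (len >= 2 && code[0] == 0xF3 && code[1] == 0x90)
        return INSN_PAUSE;
    if (len >= 2 && code[0] == 0xF3 && code[1] == 0xC3)
        return INSN_REP_RET;
    if (len >= 2 && code[0] == 0x0F && code[1] == 0xBD)
        return INSN_BSR;
    if (len >= 1 && code[0] == 0x90)
        return INSN_NOP;
    if (len >= 1 && code[0] == 0xC3)
        return INSN_RET;
    return INSN_UNKNOWN;
}
```

The same three bytes `f3 0f bd` come back as `INSN_BSR` or `INSN_LZCNT` depending on the flag, which is the kind of silent change in meaning that makes a stray `rep` prefix a poor way to lengthen an instruction.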