What does `rep ret` mean?

assembly x86 micro-optimization branch-prediction

35,113

Solution 1

There's a whole blog named after this instruction. And the first post describes the reason behind it: http://repzret.org/p/repzret/

Basically, there was an issue in AMD's branch predictor when a single-byte ret immediately followed a conditional jump, as in the code you quoted (and in a few other situations). The workaround was to add the rep prefix, which the CPU ignores but which avoids the predictor penalty.

Solution 2

Apparently, some AMD processors' branch predictors behave badly when a branch's target or fallthrough is a ret instruction, and adding the rep prefix avoids this.

As to the meaning of rep ret, there is no mention of this instruction sequence in the Intel Instruction Set Reference, and the documentation of rep is not very helpful:

The behavior of the REP prefix is undefined when used with non-string instructions.

This means at least that the rep doesn't have to behave in a repeating manner.

Now, from the AMD instruction set reference (1.2.6 Repeat Prefixes):

The prefixes should only be used with such string instructions.

In general, the repeat prefixes should only be used in the string instructions listed in tables 1-6, 1-7, and 1-8 above [which do not contain ret].

So it really seems like undefined behavior, but one can assume that, in practice, processors just ignore rep prefixes on ret instructions.

Solution 3

As Trillian's answer points out, AMD K8 and K10 have a problem with branch prediction when ret is a branch target, or follows a conditional branch (as the fall-through target). That's because ret is only 1 byte long.

"repz ret: why all the hassle?" has some extra details about the specific micro-architectural reasons why that gives K8 and Barcelona a hard time.

Avoiding 1-byte ret as a possible branch target:

AMD's optimization guide for K10 (Barcelona) recommends 3-byte ret 0 in those cases, which pops zero bytes from the stack as well as returning. That version is significantly worse than rep ret on Intel. Ironically, it's also worse than rep ret on later AMD processors (Bulldozer and onwards.) So it's a good thing nobody changed to using ret 0 based on AMD's Family 10 optimization guide update.
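To make the size difference concrete, here is a minimal sketch (Python used purely as a calculator) listing the machine-code bytes of the three return forms discussed above. The names `RET_ENCODINGS` and `encoded_length` are made up for illustration; the byte values are the standard x86 encodings (C3 for ret, F3 for the rep/repz prefix, C2 followed by a 16-bit immediate for ret imm16).

```python
# Machine-code bytes for the three ways to return discussed above.
# RET_ENCODINGS is an illustrative name, not something from a real tool.
RET_ENCODINGS = {
    "ret":     bytes([0xC3]),              # plain near return
    "rep ret": bytes([0xF3, 0xC3]),        # rep/repz prefix + ret
    "ret 0":   bytes([0xC2, 0x00, 0x00]),  # ret imm16 with imm16 = 0
}


def encoded_length(mnemonic: str) -> int:
    """Return the instruction length in bytes for one of the forms above."""
    return len(RET_ENCODINGS[mnemonic])


if __name__ == "__main__":
    for name, code in RET_ENCODINGS.items():
        hex_bytes = " ".join(f"{b:02x}" for b in code)
        print(f"{name:8s} {hex_bytes:10s} {len(code)} byte(s)")
```

Note that rep ret is just the 1-byte ret with one prefix byte in front, which is why disassemblers such as objdump show it as repz ret.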

The processor manuals warn that future processors could interpret a prefix differently when it is combined with an instruction that the prefix doesn't modify. That's true in theory, but nobody's going to make a CPU that can't run a lot of existing binaries.

gcc still uses rep ret by default (without -mtune=intel, or -march=haswell or something). So most Linux binaries have a repz ret in them somewhere.

gcc will probably stop using rep ret in a few years, once K10 is thoroughly obsolete. After another 5 or 10 years, almost all binaries will be built with a gcc newer than that. Another 15 years after that, a CPU manufacturer might think about repurposing the f3 c3 byte sequence as (part of) a different instruction.

There will still be legacy closed-source binaries using rep ret that don't have more recent builds available, and that someone needs to keep running, though. So whatever new feature f3 c3 != rep ret is part of would need to be disable-able (e.g. with a BIOS setting), and have that setting actually change the instruction-decoder behaviour to recognize f3 c3 as rep ret. If that backwards-compatibility for legacy binaries isn't possible (because it can't be done efficiently in terms of power and transistors), IDK what kind of time-frame you'd be looking at. Much longer than 15 years, unless this was a CPU for only part of the market.

So it's safe to use rep ret, because everyone else is already doing it. Using ret 0 is a bad idea. In new code, it may still be a good idea to use rep ret for another couple of years. There probably aren't too many AMD Phenom II CPUs still around, but they're slow enough without extra return-address mispredicts or w/e the problem is.

The cost is pretty small. It doesn't end up taking any extra space in most cases, because it's usually followed by nop padding anyway. However, in the cases where it does result in extra padding, it'll be the worst-case where 15B of padding is needed to reach the next 16B boundary. gcc may only align by 8B in that case. (with .p2align 4,,10; to align to 16B if it will take 10 or fewer nop bytes, then a .p2align 3 to always align to 8B. Use gcc -S -o- to produce asm output to stdout to see when it does this.)
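The alignment rule described above can be sketched as a small model. This is a rough illustration in Python, not gcc's or the assembler's code: the function names `p2align` and `next_function_start` are hypothetical, and the only behaviour assumed is the documented GAS rule that `.p2align N,,MAX` does nothing when more than MAX padding bytes would be needed.

```python
from typing import Optional


def p2align(offset: int, power: int, max_skip: Optional[int] = None) -> int:
    """Model of GAS `.p2align power,,max_skip`: offset after the directive.

    If reaching the next 2**power boundary needs more than max_skip
    padding bytes, the directive is skipped and the offset is unchanged.
    """
    boundary = 1 << power
    padding = -offset % boundary  # bytes needed to reach the boundary
    if max_skip is not None and padding > max_skip:
        return offset
    return offset + padding


def next_function_start(end_offset: int) -> int:
    """Apply `.p2align 4,,10` and then `.p2align 3` to the end offset."""
    return p2align(p2align(end_offset, 4, 10), 3)


if __name__ == "__main__":
    for end in (0, 1, 5, 6, 15):
        start = next_function_start(end)
        print(f"code ends at {end:2d} -> next start {start:2d} "
              f"({start - end} padding bytes)")
```

In this model, code that ends 1 byte past a 16B boundary would need 15 bytes to reach the next one, so the first directive is skipped and the second pads only 7 bytes, to the 8B boundary.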

So if we guesstimate that one in 16 rep ret instructions ends up creating extra padding where a ret would have just hit the desired alignment, and that the extra padding goes to an 8B boundary, this means each rep has an average cost of 8 * 1/16 = half a byte.

rep ret isn't used often enough to add up to much of anything. For example, firefox with all the libraries it has mapped only has ~9k instances of rep ret. So that's about 4k bytes, across many files. (And less RAM than that, since many of those functions in dynamic libraries are never called.)
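The guesstimate above is just an expected-value calculation, which can be written out as a short sketch. The function name `expected_extra_bytes` and its parameter names are made up for illustration; the default values (a 1-in-16 chance, 8 bytes of extra padding) are the assumptions stated in the text, not measurements.

```python
def expected_extra_bytes(instances: int,
                         p_extra_padding: float = 1 / 16,
                         padding_bytes: int = 8) -> float:
    """Expected total size cost of `instances` rep ret instructions.

    Assumes each instance independently has probability p_extra_padding
    of pushing the following code to the next padding_bytes boundary.
    """
    return instances * p_extra_padding * padding_bytes


if __name__ == "__main__":
    # Average cost of a single rep ret under the stated assumptions.
    print(expected_extra_bytes(1))
    # Total for a hypothetical binary containing 1000 instances.
    print(expected_extra_bytes(1000))
```

Changing either assumption scales the result linearly, so even a several-times-larger probability still leaves the total cost small.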

```
# disassemble every shared object mapped by a process.
ffproc=/proc/$(pgrep firefox)/
objdump -d "$ffproc/exe" $(sudo ls -l "$ffproc"/map_files/ |
       awk  '/\.so/ {print $NF}' | sort -u) |
       grep 'repz ret' -c
objdump: '(deleted)': No such file  # I forgot to restart firefox after the libexpat security update
9649
```

That counts rep ret in all the functions in all the libraries firefox has mapped, not just the functions it ever calls. This is somewhat relevant, because lower code density across functions means your calls are spread out over more memory pages. ITLB and L2-TLB only have a limited number of entries. Local density matters for L1I$ (and Intel's uop-cache). Anyway, rep ret has a very tiny impact.

It took me a minute to think of a reason that /proc/<pid>/map_files/ isn't accessible to the owner of the process, but /proc/<pid>/maps is. If a UID=root process (e.g. from a suid-root binary) mmap(2)s a 0666 file that's in a 0700 directory, then does setuid(nobody), anyone running that binary could bypass the access restriction imposed by the lack of x for other permission on the directory.

Author by

Devolus

Assembly, C/C++, SQL and Java developer. Been working in the industry for over 20 years now, and still love programming for a hobby.

Updated on April 29, 2020

Comments

Devolus about 4 years
I was testing some code on Visual Studio 2008 and noticed security_cookie. I can understand the point of it, but I don't understand what the purpose of this instruction is.
```
    rep ret /* REP to avoid AMD branch prediction penalty */
```
Of course I can understand the comment :) but what exactly is this prefix doing in the context of the ret, and what happens if ecx != 0? Apparently the loop count from ecx is ignored when I debug it, which is to be expected.

The code where I found this was here (injected by the compiler for security):
```
void __declspec(naked) __fastcall __security_check_cookie(UINT_PTR cookie)
{
    /* x86 version written in asm to preserve all regs */
    __asm {
        cmp ecx, __security_cookie
        jne failure
        rep ret /* REP to avoid AMD branch prediction penalty */
failure:
        jmp __report_gsfailure
    }
}
```
Devolus over 10 years

Yes, I looked into the Intel manual as well before asking, but I figured from the comment that I would not find anything useful there (and indeed I didn't), as the comment already said it was about AMD anyway.
Trillian over 10 years

@Devolus Right, and AMD's documentation says the same thing. I guess that if Microsoft uses this in the CRT, they must have a reason to think that it's a nop and that it's going to stay that way.
Devolus over 10 years

As it is Visual Studio 2008, it may be already changed in a newer version.
Kerrek SB over 10 years

Yeah, it's undefined according to the architecture... And if you like rep ret, you will probably love rep nop :-)
Peter Cordes over 8 years

AFAICT, the issue is present in AMD K8 and K10 (Barcelona) CPUs. It's definitely not present in Bulldozer and later. The last K10 desktop CPUs were Phenom II. gcc will probably stop defaulting to rep ret at some point in the next few years.
Peter Cordes over 8 years

It's not undefined behaviour. IIRC, Intel's manual says prefixes that don't apply to an instruction are ignored. The issue is that it's potentially not future-proof: The prefix byte could get a new meaning for that instruction in a future instruction-set extension, or the whole prefix+opcode sequence could mean something else. This won't happen for rep ret, because gcc uses it by default.
Blindy over 6 years

@PeterCordes, 2018 and it's still there.
Acorn almost 6 years

@Blindy: Starting with gcc 8.1 (released May 2018), by default, it outputs ret.
Peter Cordes about 4 years

Correction to my prev comment: Intel's manual actually says that there's no guarantee what will happen with prefixes which don't apply to an instruction. Current CPUs ignore prefixes they don't understand, but the vendors don't document that fact until after they use such a prefix as part of a backwards-compatible new instruction like rep nop = pause, which older CPUs are guaranteed to execute safely. Other examples: "Are Intel TSX prefixes executed (safely) on AMD as NOP?" and "Does x64 support imply BMI1 support?"
Peter Cordes about 4 years

So you can't just randomly tack on a rep prefix to make an instruction longer; a future CPU might decode that as a different instruction. Like rep bsr is lzcnt on newer CPUs. Use other prefixes like segment overrides that can apply to make instructions longer: "What methods can be used to efficiently extend instruction length on modern x86?"