Why does Windows64 use a different calling convention from all other OSes on x86-64?

windows assembly x86-64 calling-convention abi

27,426

Solution 1

Choosing four argument registers on x64 - common to UN*X / Win64

One of the things to keep in mind about x86 is that the register name to "reg number" encoding is not obvious; in terms of instruction encoding (the MOD R/M byte, see http://www.c-jump.com/CIS77/CPU/x86/X77_0060_mod_reg_r_m_byte.htm), register numbers 0...7 are - in that order - ?AX, ?CX, ?DX, ?BX, ?SP, ?BP, ?SI, ?DI.

Hence choosing A/C/D (regs 0..2) for return value and the first two arguments (which is the "classical" 32bit __fastcall convention) is a logical choice. As far as going to 64bit is concerned, the "higher" regs are ordered, and both Microsoft and UN*X/Linux went for R8 / R9 as the first ones.

Keeping that in mind, Microsoft's choice of RAX (return value) and RCX, RDX, R8, R9 (arg[0..3]) is an understandable selection if you choose four registers for arguments.
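
To make the numbering argument concrete, here is a minimal Python sketch of the register-number table described above and of where the Win64 choices fall in it. The names REGS_BY_NUMBER and reg_number are made up for illustration; only the ordering comes from the encoding discussed here.

```python
# Register numbers in ModR/M encoding order, using the 64-bit names.
# Numbers 0..7 are the original eight registers; 8..15 were added by AMD64.
REGS_BY_NUMBER = [
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
]

def reg_number(name):
    """Return the encoding number of a 64-bit general-purpose register."""
    return REGS_BY_NUMBER.index(name)

# Win64: return value in rax, integer arguments 0..3 in rcx, rdx, r8, r9.
WIN64_RETURN = "rax"
WIN64_INT_ARGS = ["rcx", "rdx", "r8", "r9"]

# Print each register next to its encoding number.
for reg in [WIN64_RETURN] + WIN64_INT_ARGS:
    print(reg, reg_number(reg))
```

Seen this way, the Win64 selection is simply the lowest-numbered registers without an implicit stack/frame role, followed by the first two of the new ones.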

I don't know why the AMD64 UN*X ABI chose RDX before RCX.

Choosing six argument registers on x64 - UN*X specific

UN*X, on RISC architectures, has traditionally done argument passing in registers - specifically, for the first six arguments (that's so on PPC, SPARC, MIPS at least). Which might be one of the major reasons why the AMD64 (UN*X) ABI designers chose to use six registers on that architecture as well.

So if you want six registers to pass arguments in, and it's logical to choose RCX, RDX, R8 and R9 for four of them, which other two should you pick ?

The "higher" regs require an additional instruction prefix byte to select them and therefore have a bigger instruction size footprint, so you wouldn't want to choose any of those if you have options. Of the classical registers, due to the implicit meaning of RBP and RSP these aren't available, and RBX traditionally has a special use on UN*X (global offset table) which seemingly the AMD64 ABI designers didn't want to needlessly become incompatible with.
Ergo, the only choices left were RSI / RDI.

So if you have to take RSI / RDI as argument registers, which arguments should they be ?

Making them arg[0] and arg[1] has some advantages. See cHao's comment.
?SI and ?DI are string instruction source / destination operands, and as cHao mentioned, their use as argument registers means that with the AMD64 UN*X calling conventions, the simplest possible strcpy() function, for example, only consists of the two CPU instructions repz movsb; ret because the source/target addresses have been put into the correct registers by the caller. There is, particularly in low-level and compiler-generated "glue" code (think, for example, some C++ heap allocators zero-filling objects on construction, or the kernel zero-filling heap pages on sbrk(), or copy-on-write pagefaults) an enormous amount of block copy/fill, hence it'll be useful for code so frequently used to save the two or three CPU instructions that'd otherwise load such source/target address arguments into the "correct" registers.

So in a way, UN*X and Win64 are only different in that UN*X "prepends" two additional arguments, in purposefully chosen RSI/RDI registers, to the natural choice of four arguments in RCX, RDX, R8 and R9.
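
As a rough illustration of that "prepending" view, the following Python sketch models only integer/pointer arguments and ignores floating-point and aggregate arguments, which both ABIs treat separately. The helper names int_arg_location and shared_regs are hypothetical; the register sequences are the ones quoted in the question.

```python
# Integer/pointer argument registers, in argument order.
SYSV_INT_ARGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
WIN64_INT_ARGS = ["rcx", "rdx", "r8", "r9"]

def int_arg_location(arg_regs, index):
    """Where integer argument `index` (0-based) is passed: a register or the stack."""
    return arg_regs[index] if index < len(arg_regs) else "stack"

def shared_regs():
    """Argument registers used by both conventions, in SysV order."""
    return [r for r in SYSV_INT_ARGS if r in WIN64_INT_ARGS]

# Side-by-side view of the first seven integer arguments.
for i in range(7):
    print(i, int_arg_location(SYSV_INT_ARGS, i), int_arg_location(WIN64_INT_ARGS, i))
```

Note that the shared registers do not appear in the same order: the UN*X sequence has rdx before rcx, which is the swap discussed above.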

Beyond that ...

There are more differences between the UN*X and Windows x64 ABIs than just the mapping of arguments to specific registers. For the overview on Win64, check:

http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx

Win64 and AMD64 UN*X also differ strikingly in how stack space is used. On Win64, for example, the caller must allocate stack space for function arguments even though args 0...3 are passed in registers. On UN*X, on the other hand, a leaf function (i.e. one that doesn't call other functions) is not required to allocate stack space at all if it needs no more than 128 bytes of it (yes, you own and can use a certain amount of stack without allocating it ... well, unless you're kernel code, which is a source of nifty bugs). All of these are deliberate optimization choices; most of the rationale for them is explained in the full ABI references that the original poster's wikipedia reference points to.
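
The stack-space difference can be put into numbers with a small Python sketch. It is a simplification under stated assumptions: only 8-byte integer/pointer arguments are counted, alignment padding is ignored, and the function names are invented for this example.

```python
SLOT_BYTES = 8          # one stack slot per integer/pointer argument
RED_ZONE_BYTES = 128    # scratch area below RSP on AMD64 UN*X (user space only)

def win64_outgoing_arg_bytes(n_args):
    """Stack the caller reserves for arguments on Win64.

    Space for the four register arguments is always reserved, even if
    fewer than four arguments are passed.
    """
    return SLOT_BYTES * max(n_args, 4)

def sysv_outgoing_arg_bytes(n_args):
    """Stack used for arguments on AMD64 UN*X: only arguments beyond the sixth."""
    return SLOT_BYTES * max(n_args - 6, 0)

def sysv_leaf_fits_in_red_zone(local_bytes):
    """True if a leaf function's locals fit without adjusting RSP."""
    return local_bytes <= RED_ZONE_BYTES

# Compare the two conventions for a few argument counts.
for n in (0, 2, 6, 8):
    print(n, win64_outgoing_arg_bytes(n), sysv_outgoing_arg_bytes(n))
```

So a two-argument call costs 32 bytes of reserved stack on Win64 and none on UN*X, while a small UN*X leaf function can use up to 128 bytes of locals without touching RSP.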

Solution 2

IDK why Windows did what they did. See the end of this answer for a guess. I was curious about how the SysV calling convention was decided on, so I dug into the mailing list archive and found some neat stuff.

It's interesting reading some of those old threads on the AMD64 mailing list, since AMD architects were active on it. e.g. Choosing register names was one of the hard parts: AMD considered renaming the original 8 registers r0-r7, or calling the new registers UAX etc.

Also, feedback from kernel devs identified things that made the original design of syscall and swapgs unusable. That's how AMD came to update the instructions to sort this out before releasing any actual chips. It's also interesting that in late 2000, the assumption was that Intel probably wouldn't adopt AMD64.

The SysV (Linux) calling convention, and the decision on how many registers should be callee-preserved vs. caller-save, was made initially in Nov 2000, by Jan Hubicka (a gcc developer). He compiled SPEC2000 and looked at code size and number of instructions. That discussion thread bounces around some of the same ideas as answers and comments on this SO question. In a 2nd thread, he proposed the current sequence as optimal and hopefully final, generating smaller code than some alternatives.

He's using the term "global" to mean call-preserved registers, that have to be push/popped if used.

The choice of rdi, rsi, rdx as the first three args was motivated by:

minor code-size saving in functions that call memset or other C string function on their args (where gcc inlines a rep string operation?)
rbx is call-preserved because having two call-preserved regs accessible without REX prefixes (rbx and rbp) is a win. Presumably chosen because they're the only "legacy" registers that aren't implicitly used by any common instruction. (rep string, shift count, and mul/div outputs/inputs touch everything else).
None of the registers that common instructions force you to use are call-preserved (see prev point), so a function that wants to use a variable-count shift or division might have to move function args somewhere else, but doesn't have to save/restore the caller's value. cmpxchg16b and cpuid need RBX, but are rarely used so not a big factor. (cmpxchg16b wasn't part of original AMD64, but RBX would still have been the obvious choice. cmpxchg8b exists but was obsoleted by qword cmpxchg)
We are trying to avoid RCX early in the sequence, since it is register used commonly for special purposes, like EAX, so it has same purpose to be missing in the sequence. Also it can't be used for syscalls and we would like to make syscall sequence to match function call sequence as much as possible.

(background: syscall / sysret unavoidably destroy rcx (with rip) and r11 (with RFLAGS), so the kernel can't see what was originally in rcx when syscall ran.)

The kernel system-call ABI was chosen to match the function call ABI, except for r10 instead of rcx, so libc wrapper functions like mmap(2) can just mov %rcx, %r10 / mov $0x9, %eax / syscall.
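
The relationship between the two register sequences can be modelled in a few lines of Python. This is only a sketch of the mapping described above, not of any real libc; the names syscall_arg_regs and wrapper_moves are hypothetical.

```python
# Function-call argument registers in the SysV x86-64 convention.
FUNCTION_CALL_ARGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def syscall_arg_regs():
    """Syscall argument registers: same as function calls, but r10 replaces rcx,
    because the syscall instruction overwrites rcx with the return address."""
    return ["r10" if reg == "rcx" else reg for reg in FUNCTION_CALL_ARGS]

def wrapper_moves():
    """Register-to-register moves a wrapper needs before executing syscall,
    as (source, destination) pairs."""
    return [
        (src, dst)
        for src, dst in zip(FUNCTION_CALL_ARGS, syscall_arg_regs())
        if src != dst
    ]

print(syscall_arg_regs())
print(wrapper_moves())
```

The only move that falls out is rcx to r10, which matches the single mov %rcx, %r10 in the wrapper sequence quoted above; the remaining work is loading the syscall number into eax.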

Note that the SysV calling convention used by i386 Linux sucks compared to Windows' 32bit __vectorcall. It passes everything on the stack, and only returns in edx:eax for int64, not for small structs. It's no surprise little effort was made to maintain compatibility with it. When there's no reason not to, they did things like keeping rbx call-preserved, since they decided that having another in the original 8 (that don't need a REX prefix) was good.

Making the ABI optimal is much more important long-term than any other consideration. I think they did a pretty good job. I'm not totally sure about returning structs packed into registers, instead of different fields in different regs. I guess code that passes them around by value without actually operating on the fields wins this way, but the extra work of unpacking seems silly. They could have had more integer return registers, more than just rdx:rax, so returning a struct with 4 members could return them in rdi, rsi, rdx, rax or something.

They considered passing integers in vector regs, because SSE2 can operate on integers. Fortunately they didn't do that. Integers are used as pointer offsets very often, and a round-trip to stack memory is pretty cheap. Also SSE2 instructions take more code bytes than integer instructions.

I suspect Windows ABI designers might have been aiming to minimize differences between 32 and 64bit for the benefit of people that have to port asm from one to the other, or that can use a couple #ifdefs in some ASM so the same source can more easily build a 32 or 64bit version of a function.

Minimizing changes in the toolchain seems unlikely. An x86-64 compiler needs a separate table of which register is used for what, and what the calling convention is. Having a small overlap with 32bit is unlikely to produce significant savings in toolchain code size / complexity.

Solution 3

Remember that Microsoft was initially "officially noncommittal toward the early AMD64 effort" (from "A History of Modern 64-bit Computing" by Matthew Kerner and Neil Padgett) because they were strong partners with Intel on the IA64 architecture. I think that this meant that even if they would have otherwise been open to working with GCC engineers on an ABI to use both on Unix and Windows, they wouldn't have done so as it would mean publicly supporting the AMD64 effort when they hadn't yet officially done so (and would have probably upset Intel).

On top of that, back in those days Microsoft had absolutely no leanings toward being friendly with open source projects. Certainly not Linux or GCC.

So why would they have cooperated on an ABI? I'd guess that the ABIs are different simply because they were designed at more or less the same time and in isolation.

Another quote from "A History of Modern 64-bit Computing":

In parallel with the Microsoft collaboration, AMD also engaged the open source community to prepare for the chip. AMD contracted with both Code Sorcery and SuSE for tool chain work (Red Hat was already engaged by Intel on the IA64 tool chain port). Russell explained that SuSE produced C and FORTRAN compilers, and Code Sorcery produced a Pascal compiler. Weber explained that the company also engaged with the Linux community to prepare a Linux port. This effort was very important: it acted as an incentive for Microsoft to continue to invest in the AMD64 Windows effort, and also ensured that Linux, which was becoming an important OS at the time, would be available once the chips were released.

Weber goes so far as to say that the Linux work was absolutely crucial to AMD64’s success, because it enabled AMD to produce an end-to-end system without the help of any other companies if necessary. This possibility ensured that AMD had a worst-case survival strategy even if other partners backed out, which in turn kept the other partners engaged for fear of being left behind themselves.

This indicates that even AMD didn't feel that cooperation was necessarily the most important thing between MS and Unix, but that having Unix/Linux support was very important. Maybe even trying to convince one or both sides to compromise or cooperate wasn't worth the effort or risk(?) of irritating either of them? Perhaps AMD thought that even suggesting a common ABI might delay or derail the more important objective of simply having software support ready when the chip was ready.

Speculation on my part, but I think the major reason the ABIs are different was the political reason that MS and the Unix/Linux sides just didn't work together on it, and AMD didn't see that as a problem.

Solution 4

Win32 has its own uses for ESI and EDI, and requires that they not be modified (or at least that they be restored before calling into the API). I'd imagine 64-bit code does the same with RSI and RDI, which would explain why they're not used to pass function arguments around.

I couldn't tell you why RCX and RDX are switched, though.


Author by

JanKanis

Updated on July 10, 2021

Comments

JanKanis almost 3 years

AMD has an ABI specification that describes the calling convention to use on x86-64. All OSes follow it, except for Windows, which has its own x86-64 calling convention. Why?

Does anyone know the technical, historical, or political reasons for this difference, or is it purely a matter of NIH syndrome?

I understand that different OSes may have different needs for higher level things, but that doesn't explain why for example the register parameter passing order on Windows is rcx - rdx - r8 - r9 - rest on stack while everyone else uses rdi - rsi - rdx - rcx - r8 - r9 - rest on stack.

P.S. I am aware of how these calling conventions differ generally and I know where to find details if I need to. What I want to know is why.

Edit: for the how, see e.g. the wikipedia entry and links from there.
JanKanis over 13 years

All calling conventions have some registers designated as scratch and some as preserved like ESI/EDI and RSI/RDI on Win64. But those are general purpose registers, Microsoft could have chosen without a problem to use them differently.
cHao over 13 years

@Somejan: Sure, if they wanted to rewrite the whole API and have two different OSes. I wouldn't call that "without a problem", though. For dozens of years now, MS has made certain promises about what it will and won't do with x86 registers, and they've been more or less consistent and compatible all that time. They're not gonna toss all that out the window just because of some edict from AMD, especially one so arbitrary and outside the realm of "building a processor".
cHao over 13 years

BTW, ?SI and ?DI are semi general-purpose registers. Just like most of the registers that have been around the whole time, they have their built-in register-specific uses; some instructions (the string instructions: MOVS?, INS?, OUTS?, etc) are hard-coded to use those registers for moving data around. Unless x86-64 provides some other way of doing so, it'd take considerably more than just using different registers.
JanKanis over 13 years

@cHao: You're talking about x86 registers. To run code in 64bits mode, you need to recompile anyway, 32bits mode code doesn't run in 64 bits (long mode). The only useful kinds of compatibility between 64 and 32 bits are source-level compatibility and the possibility of running 32bits user programs on a 64bits kernel. Linux et al also have specific uses for some registers in 32 bit mode, but for them the AMD ABI apparently isn't a problem.
JanKanis over 13 years

@cHao: ?SI and ?DI indeed also have special purposes (though I don't know exactly if that works the same in 64 bits), but apparently that is not a problem for all other OSes other than Windows. These special purposes are also not related to function calls.
JanKanis over 13 years

About register names: That prefix byte may be a factor. But then it would be more logical for MS to choose rcx - rdx - rdi - rsi as argument registers. But the numerical value of the first eight could guide you if you're designing an ABI from scratch, but there's no reason to change them if a perfectly fine ABI already exists, that only leads to more confusion.
JanKanis over 13 years

On RSI/RDI: These instructions will usually be inlined, in which case calling convention doesn't matter. Otherwise, there's only one copy (or maybe a few) of that function systemwide, so it only saves a handful of bytes in total. Not worth it. On other differences / call stack: The usefulness of specific choices is explained in the ABI references, but they don't make a comparison. They don't tell why other optimizations were not chosen - e.g. why doesn't Windows have the 128 byte red zone, and why doesn't the AMD ABI have the extra stack slots for arguments?
cHao over 13 years

@Somejan: In case it wasn't obvious from the name of "x86-64", it is x86, extended to 64 bits. Lots of code that doesn't work directly with 32-bit values (ie: pushing and popping registers rather than loading values) will work in both modes. Unless, of course, you go and redefine what all those registers are for and say "you're not allowed to use these anymore". Which, sure, MS could do, if they want to sacrifice compatibility and consistency. Which they don't. Which is not at all unusual, considering Win32 has the same rules as Win16. To them, consistency beats kowtowing to AMD.
cHao over 13 years

@Somejan: no, they're not related to function calls. In fact, those special purposes may be part of what got them a spot in AMD's calling convention -- passing a string around in RSI could be really useful. However, Windows has its own uses for those registers, which may or may not be related to their built-in functionality. Why is frankly irrelevant -- MS has said registers are used a certain way, which is its prerogative as the designer of the OS. AMD has no standing to dictate how registers will be used; that's the OS maker's domain.
cHao over 13 years

@Somejan: A perfectly fine ABI existed well before x86-64. Should it be changed now at AMD's whim? I think not.
JanKanis over 13 years

@cHao: x86-64 is a straightforward extension of x86, but it is still a different architecture. Binary code compiled for 32bits mode does not run unmodified in 64bits mode. It can run in compatibility mode, but the 64bits ABI has nothing to do with that. And anyway the Win64 calling convention is different from the Win32 one. Sure MS can define their own ABI and they have done so, but that brings me back to my original question: is there any reason other than "we don't care about what others do" to not follow the ABI that everyone else uses. (compatibility with 32bits mode isn't one)
JanKanis over 13 years

@cHao: no. But they changed it anyway. The Win64 ABI is different from the Win32 one (and not compatible), and also different from AMDs ABI.
cHao over 13 years

@Somejan: Not as different as you'd think. Mostly same operations, same registers (plus additional ones), and binary compatibility to the point where the same bytes can represent equivalent and valid code in 32- or 64-bit modes (with some caveats). Even if you ignore most of that, the fact still remains: the language, OS, and/or compiler define calling conventions. AMD is overstepping if it defines anything as the "one true way" (cause there isn't one). MS has a way that's worked well for them for 20+ years. To declare it invalid cause it's not blessed by AMD is the height of ignorance.
cHao over 13 years

@Somejan: Its major difference is that it uses additional registers that, til now, have not even existed. The registers that have belonged to Windows all along (?BX, ?SI, ?DI) still do.
FrankH. over 13 years

@Somejan: The AMD64 UN*X ABI was always exactly that - a UNIX-specific piece. The document, x86-64.org/documentation/abi.pdf, is titled System V Application Binary Interface, AMD64 Architecture Processor Supplement for a reason. The (common) UNIX ABIs (a multi-volume collection, sco.com/developers/devspecs) leave a section for processor-specific chapter 3 - the Supplement - which are the function calling conventions and data layout rules for a specific processor.
FrankH. over 13 years

@Somejan: Microsoft Windows has never attempted to be particularly close to UN*X, and when it came to porting Windows to x64/AMD64 they simply chose to extend their own __fastcall calling convention. You claim Win32/Win64 aren't compatible, but then, look closely: For a function that takes two 32bit args and returns 32bit, Win64 and Win32 __fastcall actually are 100% compatible (same regs for passing two 32bit args, same return value). Even some binary(!) code may work in both operating modes. The UNIX side completely broke with "old ways". For good reasons, but a break is a break.
FrankH. over 13 years

@Somejan: Win64 and Win32 __fastcall are 100% identical for the case of having no more than two arguments no larger than 32bit and returning a value no larger than 32bit. That's not a small class of functions. No such backward compatibility at all is possible between the UN*X ABIs for i386 / amd64.
JanKanis over 13 years

@FrankH: You are making a good point, please consider expanding your comment into a full answer. I wasn't aware the AMD ABI document was meant to 'plug in to' this common unix ABI that apparently tries to define the ABI for all unix supported architectures. I thought it was more kind of meant to provide a common standard in order to avoid the x86-32 mess. I'm still wondering about the technical reasons for choosing one over the other, but your comment does shed light on the nontechnical background.
Olof Forshell over 13 years

I programmed for Win 3.1 using MSC 6.0 (I think, or was it 7.0?). Registers si and di (16-bit era, not esi/edi) were preserved in function calls but you could use them if you saved and restored them. My understanding is that this has nothing to do with what Windows requires but had to do with what registers were reserved in the grand scheme of the compiler. I do think that the purpose may have been to distinguish them from intel compilers where si and di were not reserved.
cHao over 13 years

@Olof: It's more than just a compiler thing. I had issues with ESI and EDI when i did standalone stuff in NASM. Windows definitely cares about those registers. But yes, you can use them if you save them before you do and restore them before Windows needs them.
Olof Forshell over 13 years

@cHao: I had a large application written in intel PL/M-86 and ASM-86 with parts in MS MASM which ran fine under DOS. I ported it to Win16 by eliminating MASM and using MSC 6/7 for accessing the user interface routines. Win16 had no problem whatsoever with this supposed mis-match of si&di strategies.
szx over 10 years

Why is RDX passed before RCX in the System V ABI? strcpy is not 2 instructions then but 3 (plus a mov rcx, rdx)?
FrankH. over 10 years

@szx: I said above, wrt. the ordering of rdx before rcx, "I don't know why". I can only speculate that the idea was to allow a three-argument function to use rcx/ecx as a loop counter without destroying its 3rd argument. Agreed, the consequence is that funcs which want the 3rd arg to be the loop counter need ... an extra instruction.
Peter Cordes about 8 years

strcpy doesn't actually work as an example. repz doesn't work with movsb. There is no rep-string instruction that can copy implicit-length (null-terminated) strings. I always remember which r?i register comes first in the calling convention by remembering that it lines up with memcpy. The rdx instead of rcx choice has always puzzled me, too. Writing tiny functions to look at asm output, I often see an extra mov to get a shift count into ecx. I like the theory about not forcing functions to destroy the arg in rcx, though.
Peter Cordes about 8 years

@szx: I just found the relevant mailing list thread from Nov 2000, and posted an answer summarizing the reasoning. Note that it's memcpy that could be implemented that way, not strcpy.
Peter Cordes about 8 years

Nice perspective on the politics. I agree that it's not AMD's fault or responsibility. I blame Microsoft for choosing a worse calling convention. If their calling convention had turned out to be better, I'd have some sympathy, but they had to change from their initial ABI to __vectorcall because passing __m128 on the stack sucked. Having call-preserved semantics for the low 128b of some of the vector regs is also weird (partly Intel's fault for not designing an extensible save/restore mechanism with SSE originally, and still not with AVX.)
Peter Cordes about 8 years

Also, maybe 5 or 6 call-preserved xmm registers would probably be more sensible, rather than 10, but I don't have any data. Just a guess based on not much. Functions that use full vectors don't tend to make calls in the middle of hot loops, but functions with some scalar doubles certainly do. Anyway, Microsoft's ABI is clearly not optimal for many cases.
Michael Burr about 8 years

I don't really have any expertise or knowledge of how good the ABIs are. I just occasionally need to know what they are so I can understand/debug at the assembly level.
Peter Cordes about 8 years

A good ABI minimizes code size and number of instructions, and keeps dependency chains low-latency by avoiding extra round-trips through memory. (for args, or for locals that need to be spilled/reloaded). There are tradeoffs. SysV's red-zone takes a couple extra instructions in one place (the kernel's signal-handler dispatcher), for a relatively large benefit for leaf functions of not having to adjust the stack pointer to get some scratch space. So that's a clear win with near-zero downside. It was adopted with pretty much no discussion after it was proposed for SysV.
phuclv about 6 years

I think I have read somewhere on Raymond Chen's blog about the rationale for choosing those registers after benchmarking from MS side, but I can't find it anymore. However, some reasons regarding the home zone were explained here: blogs.msdn.microsoft.com/oldnewthing/20160623-00/?p=93735 blogs.msdn.microsoft.com/freik/2006/03/06/…
dgnuff over 5 years

@PeterCordes "... not having to adjust the stack pointer to get some scratch space. ..." suggests to me that after the call, space below RSP can be used safely. I feel like I'm missing something here, because won't the first interrupt that comes along damage the bytes at and just below RSP?
Peter Cordes over 5 years

@dgnuff: Right, that's the answer to Why can't kernel code use a Red Zone. Interrupts use the kernel stack, not the user-space stack, even if they arrive when the CPU is running user-space code. The kernel doesn't trust user-space stacks because another thread in the same user-space process could modify it, thus taking over control of the kernel!
phuclv about 5 years

another blog post from Raymond Chen: Why do we even need to define a red zone? Can’t I just use my stack for anything?
Peter Cordes about 5 years

@phuclv: See also Is it valid to write below ESP?. Raymond's comments on my answer there pointed out some SEH details I didn't know which explain why x86 32/64 Windows doesn't currently have a de-facto red zone. His blog post has some plausible cases for the same code page-in handler possibility I mentioned in that answer :) So yeah, Raymond did a better job of explaining it than I did (unsurprisingly because I started from knowing very little about Windows), and the table of red-zone sizes for non-x86 is really neat.
David A. Gray over 4 years

I came here after a day spent spelunking the 64-bit build of a time display library that I created about 6 months ago for another project. Among the things I discovered that aren't mentioned in the ABI, but that I observed as I stepped through the code in a Disassembly view, are that the base pointer (RBP) is neither set nor used, that all offsets are relative to RSP, and that a call with 4 or fewer arguments is not accompanied by any stack cleanup.
Peter Cordes over 4 years

@DavidA.Gray: yeah, the ABI doesn't say you have to use RBP as a frame pointer so optimized code usually doesn't (except in functions that use alloca or a few other cases). This is normal if you're used to gcc -fomit-frame-pointer being the default on Linux. The ABI defines stack-unwind metadata that allows exception handling to still work. (I assume it works something like GNU/Linux x86-64 System V's CFI stuff in .eh_frame). gcc -fomit-frame-pointer has been the default (with optimization enabled) since forever on x86-64, and other compilers (like MSVC) do the same thing.
René Nyffenegger about 3 years

"on Win64, the caller must allocate stackspace for function arguments even though args 0...3 are passed in registers." What is the reason for this? This seems like unnecessary overhead to me (but I am not a compiler developer, so I possibly just don't see the obvious).
FrankH. about 3 years

@RenéNyffenegger I'm not sure about the "reason" other than convenience (for the compiler / code generators, in a way). That this is done is mentioned in docs.microsoft.com/en-us/cpp/build/stack-usage, as: Note that space is always allocated for the register parameters, even if the parameters themselves are never homed to the stack - I take it as "decree by elder beings".
Dan Lenski about 3 years

An excellent and thoughtful answer! Regarding similarity of the Win64 convention to the Win32 __fastcall mentioned by @FrankH. … I don't really see the value in it, beyond simple familiarity to the designers. A partially, or even mostly, compatible ABI doesn't do much good; in order to be able to link or run unmodified binaries, you need 100% compatibility for at least some large identifiable subset of code.
Sourav Kannantha B almost 3 years

@PeterCordes 'Presumably chosen because it's the only other reg that isn't implicitly used by any instruction' Which are the registers that are not implicitly used by any instructions in r0-r7? I thought none, that's why they have special names like rax, rcx etc.
Peter Cordes almost 3 years

@SouravKannanthaB: yes, all the legacy registers have some implicit uses. (Why are rbp and rsp called general purpose registers?) What I really meant to say is that there are no common instructions you'd want to use for other reasons (like shl rax, cl or mul) that require you to use RBX or RBP. Only cmpxchg16b and cpuid need RBX, and RBP is only used implicitly by leave (and the unusably-slow enter instruction). So for RBP, the only implicit uses are just manipulating RBP, and not something you'd want if not using it as a frame pointer.
Mikko Rantalainen over 2 years

This answer has a great explanation of why the calling convention that everybody but Microsoft uses was selected. However, nobody seems to really know why Microsoft decided to use a different calling convention when all evidence points to the MS variant having worse performance and being invented later.
phuclv about 2 years

"Choosing six argument registers on x64 - UN*X specific": this is wrong. Windows calling conventions on RISC architectures are usually the same as *nix, and neither passes 6 arguments in registers. Windows on Alpha AXP uses 6 and on PowerPC uses 8. *nix also uses 8 registers for PPC. MIPS has O64 and N64 calling conventions which pass parameters in 4 and 8 registers respectively. Sparc has 8 in and 8 out registers, so obviously all platforms pass 8 arguments in registers.
FrankH. about 2 years

@phuclv "wrong" is a strong word here. I give you an example where your comment is wrong. Refer to the SPARC Platform UNIX ABI supplement document, gaisler.com/doc/sparc-abi.pdf section 3.8, where it clearly states that, while the CPU has eight input/output registers, only six of them are used for argument passing. I agree with your general statement and sentiment that "intricate details apply". I would not blanket use the word "right" or "wrong" in this context. My answer here is only an attempt to explain, with a bit of history (and baggage), why things are as-is.