8086- why can't we move an immediate data into segment register?

assembly x86 x86-16 cpu-registers instruction-set

27,220

Solution 1

Remember that the syntax of assembly language (any assembly) is just a human-readable way to write machine code. The rules of what you can do in machine code depend on how the processor's electronics were designed, not on what the assembler syntax could easily support.

So even though mov DS, 5000h looks like something you should be able to write, and there is no obvious conceptual reason to forbid it, the real question is: "is there a mechanism by which the processor can load a segment register directly from an immediate value?"

In the case of 8086 assembly, I figure that the reason is simply that the engineers just didn't create an electric path that could feed a signal from the memory I/O data lines to the lines that write to the segment registers.

Why? I have several theories, but no authoritative knowledge.

The most likely reason is simply one of simplifying the design: it takes extra wiring and gates to do that, and it's an uncommon enough operation (this is the 70's) that it's not worth the real estate in the chip. This is not surprising; the 8086 already went overboard allowing any of the normal registers to be connected to the ALU (arithmetic logic unit) which allows any register to be used as an accumulator. I'm sure that wasn't cheap to do. Most processors at the time only allowed one register (the accumulator) to be used for that purpose.

As for the brackets, you are correct. Let's say memory position 5000h contains the number 4321h. mov ax, 5000h puts the value 5000h into ax, while mov ax, [5000h] loads 4321h from memory into ax. Essentially, the brackets act like the * pointer dereference operator in C.
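The difference between the two forms can be tried with a short sketch. It assumes NASM, 16-bit real mode, and a DOS .COM program; the file name, the choice of offset 5000h as scratch memory, and the DOS exit call are assumptions made for illustration, not part of the question.

```nasm
; Sketch only: assumes NASM and a DOS .COM program (16-bit real mode).
; Build (assumed file name): nasm -f bin brackets.asm -o brackets.com
bits 16
org 100h

start:
    mov word [5000h], 4321h ; store 4321h at DS:5000h so the load below is predictable
    mov ax, 5000h           ; immediate operand: ax = 5000h
    mov ax, [5000h]         ; memory operand:    ax = 4321h (the word at DS:5000h)

    mov ax, 4C00h           ; DOS "terminate program" function, exit code 0
    int 21h
```

Stepping through this in a debugger should show ax holding 5000h after the second instruction and 4321h after the third.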

Just to highlight the fact that assembly is an idealized abstraction of what machine code can do, you should note that the two variations are not the same instruction with different parameters, but completely different opcodes. They could have used – say – MOV for the first and MVD (MoVe Direct addressed memory) for the second opcode, but they must have decided that the bracket syntax was easier for programmers to remember.

Solution 2

x86 machine code only has one opcode for move-to-Sreg. That opcode is 8E /r mov Sreg, r/m16, and it allows a register or memory source (but not an immediate).

Contrary to some claims in other answers, mov ds, [5000h] runs just fine, assuming the 2 bytes at address 5000h hold a useful segment value for the mode you're in. (In real mode the value is used directly as a number; in protected mode, Sreg values are selectors that index the GDT or LDT.)

x86 always uses a different opcode for the immediate form of an instruction (with a constant encoded as part of the machine code) vs. the register/memory source version. e.g. add eax, 123 assembles to a different opcode from add eax, ecx. But add eax, [esi] is the same add r, r/m32 opcode as add eax, ecx, just a different ModR/M byte.

Here is a NASM listing, produced by nasm sreg.asm -l/dev/stdout, which assembles a flat binary in 16-bit mode and prints a listing.

I edited by hand to separate the bytes into opcode modrm extra. These are all one-byte opcodes (with no extra opcode bits borrowing space in the /r field of the ModRM byte), so just look at the first byte to see what opcode it is, and notice when two instructions share the same opcode.

address   machine code    source            ; comments
00000000  BE 0050         mov si, 5000h     ; mov si, imm16
00000003  A1 0050         mov ax, [5000h]   ; special encoding for AX, no modrm
00000006  8B 36 0050      mov si, [5000h]   ; mov r16, r/m16 disp16
0000000A  89 C6           mov si, ax        ; mov r/m16, r16

0000000C  8E 1E 0050      mov ds, [5000h]   ; mov Sreg, r/m16
00000010  8E D8           mov ds, ax        ; mov Sreg, r/m16

                          mov ds, 5000h
          ******************       error: invalid combination of opcode and operands

Supporting a mov Sreg, imm16 encoding would need a separate opcode. This would take extra transistors for 8086 to decode, and it would use up more opcode coding space leaving less room for future extensions. I'm not sure which of these was considered more important by the architect(s) of the 8086 ISA.

Notice that 8086 has special mov AL/AX, moffs opcodes which save 1 byte when loading the accumulator from an absolute address. But it couldn't spare an opcode for mov-immediate to Sreg? This design decision makes good sense. How often do you need to reload a segment register? Very infrequently, and in real large programs it often wouldn't be with a constant (I think). But in code using static data, you might be loading / storing the accumulator to a fixed address inside a loop. (8086 had very weak code-fetch, so code-size = speed most of the time).

Also keep in mind that you can use mov Sreg, r/m16 for assemble-time constants with just one extra instruction (like mov ax, 4321h). But if we'd only had mov Sreg, imm16, runtime variable segment values would have required self-modifying code. (So obviously you wouldn't leave out the r/m16 source version.) My point is if you're only going to have one, it's definitely going to be the register/memory source version.
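Both ways of using the single 8E /r opcode can be sketched as follows. This assumes NASM and a DOS .COM program where DS = CS at entry; the label seg_value, the value 5000h, and the DOS exit call are hypothetical choices for illustration.

```nasm
; Sketch only: assumes NASM and a DOS .COM program (DS = CS at entry).
bits 16
org 100h

start:
    ; Run-time value: load ES straight from a word in memory.
    ; Done first, while DS still points at this program's data.
    mov es, [seg_value]     ; mov Sreg, r/m16 (memory source)

    ; Assemble-time constant: one mov-immediate, then the same opcode.
    mov ax, 5000h           ; mov r16, imm16
    mov ds, ax              ; mov Sreg, r/m16 (register source)

    mov ax, 4C00h           ; DOS "terminate program" function, exit code 0
    int 21h

seg_value:
    dw 5000h                ; hypothetical segment value
```

The constant case costs one extra instruction and clobbers a general-purpose register, which is the trade-off described above.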

Solution 3

About segment registers

The segment registers are not the same (at the hardware level) as the general-purpose registers. Of course, as Mike W said in the comments, the exact reason why you can't move an immediate value directly into a segment register is known only to the Intel developers. But I suppose it is because the design is simpler this way. Note that this choice does not affect processor performance, because segment register operations are very rare. So one instruction more or less is not important at all.

About syntax

In all reasonable implementations of x86 assembler syntax, mov reg, something moves the immediate number something to the register reg. For example:

NamedConst = 1234h
SomeLabel:
    mov  edx, 1234h      ; moves the number 1234h to the register edx
    mov  eax, SomeLabel  ; moves the value (address) of SomeLabel to eax
    mov  ecx, NamedConst ; moves the value (1234h in this case) to ecx

Enclosing the number in square brackets means that the content of memory at this address is moved to the register:

SomeLabel dd 1234h, 5678h, 9abch

    mov  eax, [SomeLabel+4]  ; moves 5678h to eax
    mov  ebx, dword [100h]   ; moves double word memory content from the 
                             ; address 100h in the data segment (DS) to ebx.


Rijo Joseph

Updated on February 25, 2021

Comments

Rijo Joseph about 3 years

In 8086 assembly programming, we can only load data into a segment register by first loading it into a general-purpose register and then moving it from that register to the segment register.

Why can't we load it directly? Is there any special reason for not being allowed?

What is the difference between mov ax,5000H and mov ax,[5000H]? Does [5000h] mean content in memory location 5000h?
- Admin over 10 years
  
  To get an answer to your question you'd need to ask one of the design engineers on the original 8086 project. Be pragmatic - do what it takes.
- Admin over 10 years
  
  This question appears to be off-topic because it is about the design philosophy of a 30 year old processor.
- Jamal over 10 years
  
  I believe the [] denotes the value at that memory address.
- Ed S. over 10 years
  
  @MikeW: How is that off topic? If this guy is programming an 8086 why wouldn't SO be a place to get help with that? He's asking a practical question. If I asked "why can't I write to an arbitrary memory location in C?" would you vote to close that for the same reason? Pretty much any question can be summed up as "design decision". That doesn't mean it's not worth asking and knowing the answer to. Yeesh, people around here have become so ridiculously heavy handed with their close buttons.
- Jamal over 10 years
  
  @EdS.: Plus, it is still taught in school. I've learned how to code this in a class two years ago, and it was referenced (alongside MIPS) in another class this semester. It wouldn't be much of a CS education if everything taught was modern.
- Admin over 10 years
  
  @EdS. The OP is asking why a certain operation is not allowed - because the engineers designed it that way. The instruction set is what it is. Debating whether it should be something else won't change it, nor help program it.
- Ed S. over 10 years
  
  @MikeW: Right, and you can probably "answer" about 90% of the questions here with that same response. Of course, you wouldn't actually be helping anyone, and you wouldn't be making the site any better. Every design choice has a reason behind it (hopefully!) and those reasons are worth knowing. I would much prefer letting a few questionable questions slip through (not that I think this one fits into that category) than to nix useful questions that may help others down the road.
- Ed S. over 10 years
  
  OP, is this protected (virtual) 8086 mode (introduced with the 386, used by many emulation programs)? If the CPU is not running in real mode then those registers are protected and some instructions are forbidden
- Nathan Fellman over 10 years
  
  @EdS.: writing segment registers is allowed in all modes.
- Ed S. over 10 years
  
  @NathanFellman: I don't believe that it is allowed in virtual mode. Do you have a source for that? That said, I'm just going out on a limb here and the OP does say he can manipulate the segment registers by loading data into a general purpose register first. I haven't touched an 8086 emulator in years (never had the pleasure of working with the actual chip)
- Nathan Fellman over 10 years
  
  @EdS.: The whole purpose of VM86 is to enable real-mode programs to run in protected mode. It wouldn't make much sense to exclude all the programs that access segment registers. Considering the limited size of a real-mode segment (64k in 16-bit addressing mode), that would be pretty much any segment register.
- Nathan Fellman over 10 years
  
  @EdS.: I just peeked at the IA32 Software Developer's Manual, and the only faults for segment accesses are on attempts to write CS.
- Ed S. over 10 years
  
  @NathanFellman: Ahh, fair enough. See, this question was useful to at least one other person :)
- Nils Pipenbrinck over 8 years
  
  In a pinch, if you want to load a segment register without touching any of the general purpose registers you can always do the 'push immediate / pop segment register' sequence.
Euro Micelli over 10 years

"This is the 70's"... Heck, today it's even more so: Windows code NEVER write to a segment register (except for the very select parts of the kernel)
Peter Cordes over 8 years

If they were going to use different mnemonics for load/store vs. reg-reg data movement, they could have been like everyone else and used LDsomething and STsomething as the mnemonics. But do note that mov r16, r/m16 is the same opcode whether the source is memory or another register. So that's actually not a great example. Making mov-immediate a separate mnemonic from the rest would make more sense, because the opcodes for mov r16, imm16 are always the same instruction. Great answer, but unfortunate choice of example :P
Peter Cordes over 8 years

Besides having to wire a data path for an immediate or memory operand to go straight into segment registers, the decoders would have to recognize that opcode and activate the handling for that special case of data routing. More opcodes means more transistors, even beyond the other consideration.
Ruslan almost 7 years

"the segment register to be written might be used to address the source operand" — this doesn't sound plausible. You could say the same about the existing instructions like mov bp,[bp].
Ruslan almost 7 years

The remark about syntax implies that MASM syntax is unreasonable. I don't disagree, but it's important to have in mind that mov ax, myvar \n... myvar dw 1234 will load 1234 into ax in MASM (and TASM in default mode). OTOH, FASM and NASM have done it right (more consistently), getting rid of the offset keyword.
Peter Cordes over 6 years

@Ruslan: more importantly, this answer is totally wrong. mov ds, [5000h] is encodeable: the opcode is mov Sreg, r/m16. An immediate isn't encodeable (mov ds, 4321h) because that would need a different opcode, but the one opcode we do have for move-to-Sreg (8E /r) takes a register or memory source. It's all a question of opcode coding space / decoder complexity, not the segment reg being used during the instruction, because that is the case for mov ds, [5000h].
Peter Cordes over 6 years

Definitely wrong, mov ds,[5000H] is encodeable, but mov ds, 5000H isn't. If there was a mov Sreg, imm16 opcode (which there isn't), it couldn't execute until all its bytes were fetched into the decode buffer. So the only instruction you're proposing a problem with is one of the forms that is encodeable.
Ruslan over 6 years

For your illustration it'd be more useful to have 8B F0 as mov si, ax. Not sure though how to convince NASM to emit this variant.
Ruslan over 6 years

Also, I don't quite get what extra instruction you mean in the first sentence of the last paragraph. Do you mean the one clobbering a general-purpose register?
Peter Cordes over 6 years

@Ruslan: Agreed. GAS AT&T syntax could use mov.s %ax, %si instead of mov %ax, %si to select the opposite encoding.
Peter Cordes over 6 years

@Ruslan: I meant a mov-immediate. updated. Thanks for reading it through to point out stuff I left unclear.
Ruslan over 6 years

Actually, even with GAS Intel syntax you can use mov.s, just checked.
Peter Cordes over 6 years

@Ruslan: I added a db 0x8B, 0xF0 and tried disassembling with ndisasm and objdump. Not even objdump printed mov.s, just mov for both encodings :/ IDK if there's an objdump option to use operand-ordering suffixes; I didn't see one in the man page. (And BTW, I'm not surprised that .s works in .intel_syntax noprefix, I was kind of being overly specific to be clear for future readers that aren't experts on different assembler syntaxes).
Peter Cordes over 6 years

@Ruslan: I want to edit this answer so I can change my upvote to a downvote now that I know it's based on a false premise. But I don't think I should edit in a "this is wrong" banner at the top, and I don't see anything else to edit. It attracted another upvote after posting my answer bumped the question... Fortunately it's an ISA-design question, not a programming question directly with bad code people would copy, and x86-16 is mostly dead and buried, so very few people will be negatively affected by getting a wrong answer here. OP is still active, maybe they'll accept my answer :)
Ruslan over 6 years

@PeterCordes yeah, I think "This is wrong" banner could be viewed as vandalism.
Peter Cordes over 6 years

@Ruslan: I guess that makes sense; an answer based on a false premise is just plain wrong and can only be handled with downvotes, not edit.
Peter Cordes over 6 years

@Ruslan: I grabbed your long-dash changes, but I'm not convinced that code-formatting so many things in a paragraph is actually an improvement. I think it's better if the whole instructions stand out more.
Peter Cordes over 6 years

And BTW, hi @Euro. Sorry for the comment noise in your answer. It turns out you were mis-remembering how x86 worked when you wrote this, xD. You might want to edit out a lot of the stuff based on the false premise about mov ds, [5000h] not being encodeable, because in fact it is.
Euro Micelli over 6 years

This stream about edits is not suited for "comments" so please continue on chat or just delete them now that your edits are done. I will delete this comment as well in a few hours
Euro Micelli over 6 years

I knew that not everything you can write in assembler is actually encodeable and assumed without checking the OP assertion that this was such a case; it's been a while since I've written assembler. I think the basic explanation was good, even if it turns out it didn't apply to the exact instruction picked. I will re-investigate, and edit/fix the answer when I have a chance, hopefully after work tonight. Or delete it if I find it's not fixable.
Peter Cordes about 2 years

I guess I didn't see your reply years ago since you didn't tag me with @peter. The assertion in the question is correct: mov ds, 5000h (mov immediate constant to Sreg) is not encodeable. Your wrong assumption was about mov ds, [5000h] which is a load from an absolute direct addressing mode, which is encodable. Since then, @ilkkachu has edited this answer to say mov ds, 5000h, but the next paragraphs still propose reasoning that only makes sense for a memory source, not an immediate embedded in the machine code of the instruction. So this answer is still unfortunately a mess.