What is the meaning of align at the start of a section?

59,714

Solution 1

I always liked the comprehensive explanation by Samael in the following thread:
Explanation of the ALIGN MASM directive, How is this directive interpreted by the compiler?

Quote:

1. USAGE

ALIGN X

The ALIGN directive is accompanied by a number (X).
This number (X) must be a power of 2. That is 2, 4, 8, 16, and so on...

The directive allows you to enforce alignment of the instruction or data immediately after the directive, on a memory address that is a multiple of the value X.

The extra space between the previous instruction/data and the one after the ALIGN directive is padded with NOP instructions (or equivalents, such as MOV EAX,EAX) in the case of code segments, and with NULLs in the case of data segments.

The number X cannot be greater than the default alignment of the segment in which the ALIGN directive is used; it must be less than or equal to it. More on this to follow...
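The padding arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the rule "next address that is a multiple of X", not how any particular assembler implements it; the function names and the sample addresses are made up for the example.

```python
def align_up(address: int, x: int) -> int:
    """Return the smallest multiple of x that is >= address (x must be a power of 2)."""
    if x <= 0 or x & (x - 1) != 0:
        raise ValueError("alignment must be a power of 2")
    return (address + x - 1) & ~(x - 1)


def padding_needed(address: int, x: int) -> int:
    """Number of filler bytes an ALIGN x would emit at the given address."""
    return align_up(address, x) - address


# Hypothetical addresses, chosen only for illustration.
print(hex(align_up(0x402001, 4)))   # 0x402004
print(padding_needed(0x402001, 4))  # 3
print(padding_needed(0x402004, 4))  # 0 (already aligned, nothing is emitted)
```

Note that when the current address is already a multiple of X, no padding is generated at all.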

2. PURPOSE

A. Working with code

If the directive precedes code, the reason is optimization (with reference to execution speed). Some instructions execute faster if they are aligned on a 4-byte (32-bit) boundary. This kind of optimization is usually applied in time-critical functions, such as loops that constantly manipulate large amounts of data. Besides the execution speed improvement, there is no "necessity" to use the directive with code, though.

B. Working with data

The same holds true for data: we mainly use the directive to improve execution speed. There are situations where data misalignment can have a huge performance impact on our application.

But with data, there are situations where correct alignment is a necessity, not a luxury. This holds especially true on the Itanium platform and with the SSE/SSE2 instruction set, where misalignment on a 128-bit boundary (X=16) may raise a general-protection exception.

An interesting and most informative article on data alignment, though orientated on the MS C/C++ compiler, is the following:

Windows Data Alignment on IPF, x86, and x64, by Kang Su Gatlin, MSDN

3. What is the default alignment of a segment?

A. If you use the .386 processor directive and you haven't explicitly declared the default alignment value for a segment, the default segment alignment is DWORD (4 bytes). In this case, X = 4. You can then use the following values with the ALIGN directive: (X = 2, X = 4). Remember, X must be less than or equal to the segment alignment.

B. If you use the .486 processor directive or above and you haven't explicitly declared the default alignment value for a segment, the default segment alignment is PARAGRAPH (16 bytes). In this case, X = 16. You can then use the following values with the ALIGN directive: (X = 2, X = 4, X = 8, X = 16).

C. You can declare a segment with non-default alignment in the following way:

;Here, we create a code segment named "JUNK", which starts aligned on a 256 bytes boundary 
JUNK SEGMENT PAGE PUBLIC FLAT 'CODE'

;Your code starts aligned on a PAGE boundary (X=256)
; Possible values that can be used with the ALIGN directive 
; within this segment, are all the powers of 2, up to 256. 

JUNK ENDS

Here are the aliases for segment alignment values...

Align Type   Starting Address

BYTE         Next available byte address.
WORD         Next available word address (2 bytes per word).
DWORD        Next available double word address (4 bytes per double word).
PARA         Next available paragraph address (16 bytes per paragraph).
PAGE         Next available page address (256 bytes per page).
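The rule that X must be a power of 2 no larger than the segment alignment can be expressed as a small lookup. The following Python sketch just restates the table and the rule from the quoted post; the dictionary and function names are invented for this example and are not part of MASM.

```python
# Segment alignment aliases and their sizes in bytes, as listed in the table above.
SEGMENT_ALIGN = {"BYTE": 1, "WORD": 2, "DWORD": 4, "PARA": 16, "PAGE": 256}


def allowed_align_values(segment_align_type: str) -> list[int]:
    """Powers of 2 (from 2 upward) that do not exceed the segment alignment."""
    limit = SEGMENT_ALIGN[segment_align_type]
    values = []
    x = 2
    while x <= limit:
        values.append(x)
        x *= 2
    return values


print(allowed_align_values("DWORD"))  # [2, 4]
print(allowed_align_values("PARA"))   # [2, 4, 8, 16]
```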

4. Example

Consider the following example (read the comments on the usage of the ALIGN directive).

.486 
.MODEL FLAT,STDCALL 
OPTION CASEMAP:NONE 

INCLUDE \MASM32\INCLUDE\WINDOWS.INC 

.DATA

var1 BYTE  01; This variable is of 1 byte size. 
ALIGN 4

; We force the next variable to be aligned on the next memory
; address that is a multiple of 4.
; This means that the extra space between the first variable
; and this one will be padded with nulls (3 bytes in total).

var2 BYTE  02; This variable is of 1 byte size. 

ALIGN 2
; We force the next variable to be aligned on the next memory
; address that is a multiple of 2.
; This means that the extra space between the second variable
; and this one will be padded with nulls (1 byte in total).

var3 BYTE  03; This variable is of 1 byte size. 

.CODE
; Enforce the first instruction to be aligned on a memory address multiple of 4
ALIGN 4

EntryPoint:
; The following 3 instructions have 7 byte - opcodes 
; of the form 0F B6 05 XX XX XX XX
; In the following block, we do not enforce opcode
; alignment in memory...

MOVZX EAX, var1 
MOVZX EAX, var2 
MOVZX EAX, var3 

; The following 3 instructions have 7 byte - opcodes 
; of the form 0F B6 05 XX XX XX XX
; In the following block, we  enforce opcode alignment 
; for the third instruction, on a memory address multiple of 4.
; Since the second instruction's opcodes end on a memory address
; that is not a multiple of 4, NOPs are inserted before
; the first opcode of the next instruction, so that it
; starts on a memory address that is a multiple of 4.


MOVZX EAX, var1 
MOVZX EAX, var2 
ALIGN 4 
MOVZX EAX, var3 

; The following 3 instructions have 7 byte - opcodes 
; of the form 0F B6 05 XX XX XX XX
; In the following block, we  enforce opcode alignment 
; for all instructions, on a memory address multiple of 4.
;The extra space between each instruction will be padded with NOPs

ALIGN 4
MOVZX EAX, var1
ALIGN 4
MOVZX EAX, var2
ALIGN 4
MOVZX EAX, var3


ALIGN 2
; The following  instruction has 1 byte - opcode (CC).
; In the following block, we  enforce opcode alignment 
; for the instruction, on a memory address multiple of 2.   
;The extra space between this instruction , 
;and the previous one,  will be padded with NOPs

INT 3
END EntryPoint

If we assemble and link the program, here is what was generated:

.DATA
;------------SNIP-SNIP------------------------------
.data:00402000 var1            db 1
.data:00402001                 db    0; This NULL was generated to enforce the alignment of the next variable on an address that is a multiple of 4
.data:00402002                 db    0; This NULL was generated to enforce the alignment of the next variable on an address that is a multiple of 4
.data:00402003                 db    0; This NULL was generated to enforce the alignment of the next variable on an address that is a multiple of 4

.data:00402004 var2            db 2 
.data:00402005                 db    0; This NULL was generated to enforce the alignment of the next variable on an address that is a multiple of 2

.data:00402006 var3            db 3

.data:00402007                 db    0; The rest of the NULLs are to fill the memory page in which the segment will be loaded
;------------SNIP-SNIP------------------------------

.CODE
;------------SNIP-SNIP------------------------------

.text:00401000 start:
.text:00401000                 movzx   eax, var1
.text:00401007                 movzx   eax, var2
.text:0040100E                 movzx   eax, var3
.text:00401015                 movzx   eax, var1
.text:0040101C                 movzx   eax, var2
.text:00401023                 nop; This NOP was generated to enforce the alignment...
.text:00401024                 movzx   eax, var3
.text:0040102B                 nop; This NOP was generated to enforce the alignment...
.text:0040102C                 movzx   eax, var1
.text:00401033                 nop; This NOP was generated to enforce the alignment...
.text:00401034                 movzx   eax, var2
.text:0040103B                 nop; This NOP was generated to enforce the alignment...
.text:0040103C                 movzx   eax, var3
.text:00401043                 nop; This NOP was generated to enforce the alignment...
.text:00401044                 int     3              ; Trap to Debugger
.text:00401044; ---------------------------------------------------------------------------
.text:00401045                 db    0
.text:00401046                 db    0
.text:00401047                 db    0
.text:00401048                 db    0

;------------SNIP-SNIP------------------------------

As you can see, more bytes follow after the code / data of our application ends. This is because PE sections are padded in the file to the file alignment (typically 512 bytes) and are aligned to the page size when loaded in memory.

So the linker fills the extra space up to the next 512-byte boundary with junk bytes (usually INT 3 instructions, NOPs or NULLs for code sections, and 0FFh or NULLs for data sections) in order to ensure that the alignment of the PE image is correct...
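The addresses in the code listing above can be reproduced with a short simulation. This Python sketch assumes every MOVZX in the example is 7 bytes, INT 3 is 1 byte, and that padding is emitted as single-byte NOPs (real assemblers may pick other filler encodings); `layout` and `program` are hypothetical names used only here.

```python
def layout(start, items):
    """Assign addresses to (name, size) items; ("ALIGN", x) entries insert padding.

    Returns a list of (address, name) pairs, with padding bytes reported as "nop".
    """
    address = start
    listing = []
    for name, value in items:
        if name == "ALIGN":
            while address % value != 0:
                listing.append((address, "nop"))
                address += 1
        else:
            listing.append((address, name))
            address += value
    return listing


MOVZX = 7  # size in bytes of each MOVZX EAX, varN in the example
program = [
    ("ALIGN", 4),
    ("movzx var1", MOVZX), ("movzx var2", MOVZX), ("movzx var3", MOVZX),
    ("movzx var1", MOVZX), ("movzx var2", MOVZX), ("ALIGN", 4), ("movzx var3", MOVZX),
    ("ALIGN", 4), ("movzx var1", MOVZX),
    ("ALIGN", 4), ("movzx var2", MOVZX),
    ("ALIGN", 4), ("movzx var3", MOVZX),
    ("ALIGN", 2), ("int 3", 1),
]

for address, name in layout(0x00401000, program):
    print(f"{address:08X}  {name}")
```

Running it prints the same sequence of addresses as the disassembly, including the five single NOPs and INT 3 at 00401044.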

Solution 2

Memories are a fixed width, today either 32 bits or, typically, 64 bits wide (even on a 32-bit system). Let's assume a 32-bit data bus for now. Every time you do a read, be it 8, 16, or 32 bits, the bus is 32 bits wide, so those data lines will have something on them; it makes sense to just put the 32 bits related to the aligned address on them.

Say that at address 0x100 you had the 32-bit value 0x12345678. If you were to perform a 32-bit read, all of those bits would be on the bus. If you were to perform an 8-bit read at address 0x101, the memory controller would do a read of address 0x100 and get 0x12345678. From those 32 bits it would isolate the proper "byte lane", the 8 bits related to address 0x101. On some processors the memory controller may never see anything but 32-bit reads; the processor handles isolating the byte lane.

What about processors that allow unaligned accesses, like the x86? Say you had 0x12345678 at address 0x100 and 0xAABBCCDD at address 0x104. If you were to do a 32-bit read at address 0x102 on this system with a 32-bit data bus, two memory cycles are required: one at address 0x100, where 16 bits of the desired value live, and another at 0x104, where the other two bytes are found. After those two reads happen, the 32 bits can be pieced together and passed deeper into the processor where they were requested. The same thing happens if you want to do a 16-bit read at, say, address 0x103: it costs you twice as many memory cycles and takes twice as long.
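The cycle counts in the examples above follow from counting how many aligned bus-width words an access touches. Here is a minimal Python sketch of that count, under the same simplified model (a 32-bit bus, no cache); the function name is made up for this illustration.

```python
def bus_cycles(address: int, size: int, bus_bytes: int = 4) -> int:
    """Count the aligned bus-width words that the byte range [address, address+size) touches."""
    first_word = address // bus_bytes
    last_word = (address + size - 1) // bus_bytes
    return last_word - first_word + 1


print(bus_cycles(0x100, 4))  # 1: aligned 32-bit read
print(bus_cycles(0x101, 1))  # 1: single byte lane within one word
print(bus_cycles(0x102, 4))  # 2: straddles 0x100 and 0x104
print(bus_cycles(0x103, 2))  # 2: straddles 0x100 and 0x104
```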

What the .align directive normally does in assembly language (of course you have to specify the exact assembler and processor as this is a directive and each assembler can define whatever it wants to define for directives) is pad the output such that the thing that immediately follows the .align is, well, aligned on that boundary. If I had this code:

b: .db 0
c: .dw 0

And it turns out that, when I assemble and link, the address of c is 0x102, but I know I will be accessing it very often as a 32-bit value, then I can align it by doing something like this:

b: .db 0
.align 4
c: .dw 0

Assuming nothing else before this changes as a result, b will still be at address 0x101, but the assembler will put two more bytes in the binary between b and c so that c moves to address 0x104, aligned on a 4-byte boundary.

"Aligned on a 4-byte boundary" simply means that the address modulo 4 is zero: basically 0x0, 0x4, 0x8, 0xC, 0x10, 0x14, 0x18, 0x1C and so on (the lower two bits of the address are zero). Aligned on 8 means 0x0, 0x8, 0x10, 0x18 and so on, or the lower 3 bits of the address are zero. And so on.

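The "address modulo N is zero" definition and the "lower bits are zero" test are equivalent for powers of 2. A small Python sketch (with an invented helper name) makes the equivalence concrete:

```python
def is_aligned(address: int, boundary: int) -> bool:
    """True if address is a multiple of boundary (boundary must be a power of 2)."""
    return (address & (boundary - 1)) == 0


# The mask test agrees with the modulo definition for these sample addresses.
for address in (0x0, 0x4, 0x8, 0xC, 0x10, 0x102, 0x104):
    assert is_aligned(address, 4) == (address % 4 == 0)

print(is_aligned(0x104, 4))  # True
print(is_aligned(0x102, 4))  # False
print(is_aligned(0x18, 8))   # True
print(is_aligned(0x1C, 8))   # False
```
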
Writes are worse than reads, as you have to do read-modify-writes for data smaller than the bus. If we wanted to change the byte at address 0x101, we would read the 32-bit value at address 0x100, change the one byte, then write that 32-bit value back to 0x100. So when you are writing a program and you think you are making things faster by using smaller values, you are not. A write that is not both aligned and the full width of the memory costs you the read-modify-write. An unaligned write costs you twice as much, just as it did with reads: an unaligned write would be two read-modify-writes.

Writes do have a performance advantage over reads, though. When a program needs to read something from memory and use that value right away, the next instruction has to wait for the memory cycle to complete (which these days can be hundreds of clock cycles; DRAM has been stuck at 133MHz for about a decade, your 1333MHz DDR3 memory is not 1333MHz, the bus is 1333MHz/2, and you can put requests in at that speed but the answer doesn't come back for a long while). Basically, with a read you have an address but you have to wait for the data for as long as it takes. For a write you have both items, the address and the data, and you can "fire and forget": you give the memory controller the address and data and your program can keep running. Granted, if the next instruction or set of instructions needs to access memory, read or write, then everyone has to wait for the first write to finish before moving on to the next access.
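The read-modify-write sequence for a single byte can be sketched as follows. This is a toy model in Python, not a description of any real memory controller: memory is a dictionary of word-aligned 32-bit values, and little-endian byte lanes are an assumption made for the example.

```python
def write_byte_rmw(memory: dict, address: int, value: int) -> None:
    """Store one byte through a 32-bit-wide memory using read-modify-write.

    `memory` maps word-aligned addresses to 32-bit values.
    Little-endian byte lanes are an assumption of this sketch.
    """
    word_address = address & ~0x3          # aligned word containing the byte
    shift = 8 * (address & 0x3)            # bit position of the byte lane
    word = memory[word_address]            # read the full 32-bit word
    word &= ~(0xFF << shift) & 0xFFFFFFFF  # clear the target lane
    word |= (value & 0xFF) << shift        # merge in the new byte
    memory[word_address] = word            # write the full word back


memory = {0x100: 0x12345678}
write_byte_rmw(memory, 0x101, 0xAB)
print(hex(memory[0x100]))  # 0x1234ab78
```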

All of the above is very simplistic, yet it is what you would see between the processor and the cache. On the other side of the cache, the fixed-width memory (the fixed width of the SRAM in the cache and the fixed width of the DRAM on the far side do not have to match) is accessed in "cache lines", which are generally multiples of the bus width. This both helps and hurts with alignment. Say, for example, 0x100 is a cache line boundary: the word at 0xFE is the tail end of one cache line and 0x100 is the beginning of the next. If you were to perform a 32-bit read at address 0xFE, not only do two 32-bit memory cycles have to happen, but also two cache line fetches. The worst case would be having to evict two cache lines to memory to make room for the two new cache lines you are fetching. Had you used an aligned address, it would still be bad, but only half as bad.

Your question did not specify the processor, but the nature of your question implies x86, which is well known for this problem. Other processor families do not allow unaligned accesses, or you have to specifically disable the exception fault. And sometimes the unaligned access isn't x86-like. For example, on at least one processor, if you had 0x12345678 at address 0x100 and 0xAABBCCDD at address 0x104, disabled the fault, and performed a 32-bit read at address 0x102, you would get 0x56781234: a single 32-bit read with the byte lanes rotated to put the lower byte in the right place. No, I am not talking about an x86 system but some other processor.
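The rotated result in that last example can be modeled as "read the aligned word, then rotate right by 8 bits per byte of misalignment". The Python sketch below is only a model of the behavior described above, with a hypothetical function name; it is not tied to a specific processor.

```python
def rotated_read32(memory: dict, address: int) -> int:
    """Model a 32-bit read that fetches the aligned word and rotates it right
    by 8 bits for each byte of misalignment."""
    word = memory[address & ~0x3]
    shift = 8 * (address & 0x3)
    if shift == 0:
        return word
    return ((word >> shift) | (word << (32 - shift))) & 0xFFFFFFFF


memory = {0x100: 0x12345678, 0x104: 0xAABBCCDD}
print(hex(rotated_read32(memory, 0x100)))  # 0x12345678
print(hex(rotated_read32(memory, 0x102)))  # 0x56781234
```

Note that the word at 0x104 is never touched in this model, which is why the result differs from what an x86 would return.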

Solution 3

align pads with NOPs (0x90, in NASM) until the current address is aligned to the operand (address modulo operand is zero).

For instance:

db 12h
align 4
db 32h

When assembled outputs:

0000  12 90 90 90 
0004  32

This is faster for memory access and necessary to load some tables in x86 CPUs (and probably other architectures as well). I can't name any specific cases, but you can find several answers on SO and search engines.
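The byte output shown above can be mimicked with a toy "assembler" that understands only two directives. This Python sketch is an illustration, not NASM's actual implementation; the `assemble` function and its tuple format are invented here, and the 0x90 fill byte is taken from the answer above.

```python
def assemble(directives, fill=0x90):
    """Emit bytes for a list of ("db", value) and ("align", n) directives."""
    output = bytearray()
    for name, operand in directives:
        if name == "db":
            output.append(operand)
        elif name == "align":
            while len(output) % operand != 0:
                output.append(fill)
        else:
            raise ValueError(f"unknown directive: {name}")
    return bytes(output)


image = assemble([("db", 0x12), ("align", 4), ("db", 0x32)])
print(image.hex(" "))  # 12 90 90 90 32
```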


Author by

user1462787

Updated on December 23, 2021

Comments

user1462787 over 2 years
What is the meaning of align at the start of a section?

For example:
```
 align 4
 a: dw 0
```
How does it save memory access?
user1462787 almost 12 years

Thanks! Does it affect only the next data/instruction, or the whole section?
copy almost 12 years

@user1462787 It doesn't depend on or modify the next instruction; it just writes NOPs depending on the current offset from the start of the file.
Ravid Goldenberg over 9 years

Most accurate, comprehensive and educational explanation I found online, thank you!
Peter Cordes over 6 years

On most fixed-length ISAs like MIPS, instructions must be 4-byte aligned or the CPU will fault. Also, on x86, instruction alignment matters (sometimes) for jump targets, not really depending on which instruction it is. Your claim that some instructions are executed faster if they are aligned on a 4 byte (32 bits) boundary isn't very sensible on any modern x86 CPUs (even in 2012 when you wrote this). The boundaries that matter are cache-line (64-byte) or fetch-block (usually 16-byte) boundaries, or uop-cache block boundaries (32-byte on Intel). See agner.org/optimize.
Peter Cordes over 6 years

Related: Why is integer assignment on a naturally aligned variable atomic on x86?.
Constantino Tsarouhas over 2 years

@PeterCordes Good point. Is explicit alignment with .align on MIPS/RISC-V necessary on assemblers such as LLVM’s or are instructions in handwritten S files implicitly aligned on 4-byte boundaries when the assembler outputs a binary?
Peter Cordes over 2 years

@ConstantinoTsarouhas: The .text section starts out aligned at least by 4, maybe by 16. So you don't need .p2align 2 before instructions unless you do something like .byte 1. LLVM's built-in assembler doesn't implicitly align as part of instruction mnemonics, so if you do misalign on purpose before some instructions, e.g. .byte 1 ; or $t1, $t2, $t3 and then assemble it with clang -target mips -c foo.s, llvm-objdump -d foo.o shows that it did not leave a 3-byte gap, it just assembled 4 bytes where you told it to. Put const data in .rodata instead so this is a non-issue.