A few notes on assembly
19 Nov 2011Quick navigation within this page
- instruction sets: [ x86-16 | x86-32 | x86-64 | x87 | sse | avx ]
- cpu history: [cpu's | fpu's | architecture ]
Introduction
Assembly language is a very low level programming language. Many higher programming language compilers translate their code into assembly and then into machine code. Although almost no-one develops software exclusively in assembly it is still very useful to be able to understand the language. For example with the gnu c compilergcc
one can inspect the assembly code using the -S
flag and
the result will be stored in a file with the extention .s
.
$ gcc -Wall -O2 -S -masm=intel prog.c # generates assembly prog.s $ gcc prog.s -o prog # produces executableFor the windows compiler the option is
/FA
,
i.e. cl /O2 /FA prog.cpp
writes the assembly code
into prog.asm
and generates the executable.
By default gcc
generates assembly code using the
AT&T syntax whereas cl
generates
Intel syntax. Since the Intel syntax is described more widely
and also somewhat easier to work with we use the flag
-masm=intel
to get gcc
producing Intel syntax
as well.
The main concepts in assembly are
registers
and
instructions,
where registers are very small (e.g. 64 bit on a 64-bit cpu,
sse registers are 128 bit) but very fast memory locations integrated into
the cpu. Each register normally only stores one variable.
Let's have a look at a simple (and rather pointless) loop in c++:
$ cat loop.cpp double loop(double a, int n){ double sum=0.0; for(int i=1; i<n; i++) { sum+=a; } return sum; }This loop translates into the assembly code below. We use sse2 floating point instructions (default on 64-bit systems), where on older systems (32-bit) x87 instructions would be generated by default (even if sse2 was available).
$ g++ -Wall -O2 -msse2 -mfpmath=sse -S -masm=intel loop.cpp $ cat loop.s ... .L4: add eax, 1 # i+=1; addsd xmm0, xmm1 # sum+=a; cmp eax, edi # compare i, n jne .L4 # conditional jump to .L4 ...A non-optimised version would not necessary load all variables into registers and so require lots of memory access which is slow:
$ g++ -Wall -msse2 -mfpmath=sse -S -masm=intel loop.cpp $ cat loop.s ... .L3: movsd xmm0, QWORD PTR [rbp-16] addsd xmm0, QWORD PTR [rbp-24] movsd QWORD PTR [rbp-16], xmm0 add DWORD PTR [rbp-4], 1 .L2: mov eax, DWORD PTR [rbp-4] cmp eax, DWORD PTR [rbp-28] setl al test al, al jne .L3 ...For larger programmes it will be harder to find the assembler code corresponding to particular parts of the C++ source, so it can be a good idea to add markers to the assembly, for example like this:
$ cat test.cpp ... #define ASM_COMMENT(X) asm volatile ("#" X) int main() { ... ASM_COMMENT("start loop"); for(int i=0; i<n; i++) { ...
References:
8080 8-bit instruction set
registers | ||
---|---|---|
general purpose | 8-bit | A, B, C, D, E, H, L |
16-bit | BC, DE, HL : limited 16-bit operations |
|
program registers | 16-bit | SP : stack pointer, points to stack memoryPC : program counter, memory address of next instruction
|
8086 16-bit instruction set (x86-16)
The term x86 refers to the instruction set introduced by the Intel 8086 and is used up to the Intel 80286 cpu. For backwards compatibility reasons even most modern Intel and AMD processors contain that 16-bit instruction set as a subset. General purpose registers as well as index registers (pointing to memory locations) are 16-bit wide. As a consequence only 216 = 64KB of memory can be directly addressed. However, the 8086 and 80286 have a 20-bit and 24-bit external address bus and can address 220 = 1MB and 224 = 16MB of main memory, respectively. This is achieved using a cumbersome mechanism of memory segmentation and is also the reason why segment registers are needed.registers | ||
---|---|---|
general purpose | 16-bit | AX : accumulatorBX : base index, e.g. for arraysCX : counterDX : data, generalthese are just a recommendations, registers can be used for anything |
8-bit | AX=(AH, AL), BX=(BH, BL), ... |
|
index registers | 16-bit | SP : stack pointer, contains main memory address of the top of the stackBP : base pointer, used to point to some other place in stack,
typically above the local variablesSI, DI : source, destination index, used to point to arrays (e.g. strings)
|
segment register | 16-bit | CS : code segmentDS : data segmentES : extra segmentSS : stack segment
|
status register | 16-bit | one register which is implicitly set and read by instructions |
instruction pointer | 16-bit | IP : points to the next instruction in main memory |
instruction | pseudo code | description |
---|---|---|
mov ax, bx mov ax, 12 mov ax, [sp] |
ax = bx; ax = 12; ax = *sp; |
copies contents |
lea ax, [sp] lea ax, [sp+si] |
ax = &(*sp) = sp; ax = &(*(sp+si))=sp+si;
|
load effective address, returns the address of memory location, note, lea ax, [dx+8] is short for mov ax, dx and
add ax, 8
|
add ax, bx |
ax += bx; |
integer addition |
sub ax, bx |
ax -= bx; |
integer subtraction |
inc ax ... dec ax |
ax++; ax--; |
short for add x, 1 and sub x, 1 |
imul ax, bx |
ax *= bx; |
signed int multiply |
idiv ax, bx |
ax /= bx; |
signed int divide |
mul |
unsigned int multiply | |
div bx |
ax=[dx,ax]/bx, dx=[dx,ax]%bx |
unsigned int divide, ax contains the quotient and
dx the raminder |
and ax, bx |
ax = ax & bx; |
bitwise logical and |
or ax, bx |
ax = ax | bx; |
bitwise logical or |
xor ax, bx |
ax = ax ^ bx; |
bitwise logical xor, e.g. xor ax, ax means
ax = 0 |
not ax |
ax = ~ax; |
bitwise logical not |
sar ax, 3 |
ax = ax >> 3; |
signed int bitwise shift right |
sal ax, 3 |
ax = ax << 3; |
signed int bitwise shift left |
shl , shr |
unsigned int version of shift operators | |
cmp ax, bx |
tests ax and bx of equality, internally using sub withoutoverwriting ax but only setting the status/flag register |
|
jmp .label |
jump to location .label in assembly code |
|
je .label jg .label jl .label jge .label jle .label jne .label |
conditional jump to .label based on the values in the status/flag register;
normally follows the cmp instruction, and the jump is executed
if the comparison resulted in equal ( e ), greater
(g ), less (l ), not equal ( ne ), or combinations of it, respectively
|
|
pop ax |
ax = stack.rmtop(); | moves the top value of the memory stack into ax |
push ax |
stack.add() = ax; |
saves the value of ax on top of the memory stack |
test |
||
BYTE : 1 byte (8 bit)WORD : 2 byte (16 bit)DWORD : 4 byte (32 bit)QWORD : 8 byte (64 bit)TBYTE : 10 byte (80 bit) |
these are type specifiers; since for example [si] only pointsto the beginning of a memory address and in the absence of type specifiers we don't know the bit-length of [si] ; if registers are involved we implicitly know the bit-length: mov ax, [si] : use 16-bit mov al, [si] : use 8-bit if registers are not involved we need to specify the data type: mov WORD [si], 0 : use 16-bit |
80386 32-bit instruction set (x86-32)
Also known as IA-32 and i386, the x86-32 extends all registers to 32-bit and adds a few new instructions but keeps all the old x86-16 instructions.registers | ||
---|---|---|
general purpose | 32-bit | EAX, EBX, ECX, EDX |
index registers | 32-bit | ESP : stack pointer, contains main memory address of the top of the stackEBP : base pointer, used to point to some other place in stack,
typically above the local variablesESI, EDI : source, destination index, used to point to arrays
|
status register | 32-bit | one register which is implicitly set and read by instructions |
instruction pointer | 32-bit | EIP : points to the next instruction in main memory |
- X86 instruction listings
- Intel 80386 reference programmer's manual hosted at mit.edu, logix.cz
AMD64 64-bit instruction set (x86-64)
The x86-64 has been developed by AMD and is also known as AMD64 and Intel 64. Note, the Intel Itanium (IA-64) has a different 64-bit instruction set.registers | ||
---|---|---|
general purpose | 64-bit | RAX, RBX, RCX, RDX R8, R9, ..., R15
|
index registers | 64-bit | RSP : stack pointer, contains main memory address of the top of the stackRBP : base pointer, used to point to some other place in stack,
typically above the local variablesRSI, RDI : source, destination index, used to point to arrays
|
status register | 64-bit | RFLAGS one register which is implicitly set and read by instructions |
instruction pointer | 64-bit | RIP : points to the next instruction in main memory |
x87 fpu instruction set
registers | ||
---|---|---|
general purpose | 80-bit | ST(0), ST(1), ..., ST(7) : arranged in a direct access
stack structure, ST(0) is the top of the register stack
|
control/status | 16-bit | control register: for rounding/precision control and exception handling,
access via fldcw and fstcw status register: status of last fpu operation, incl. top of stack pointer, fp exceptions, read via fstsw tag register: status of each register (valid, zero, special, empty) |
program | 48-bit | instruction pointer: data pointer: |
instruction | pseudo code | description |
---|---|---|
fld QWORD [sp] fld st(2) fst QWORD [dp] fstp QWORD [dp]
|
st(0) = *(sp); st(0) = st(2); *(dp) = st(0); *(dp) = st(0); |
loads floats (here QWORD=64-bit ) values from memory to
the fpu register stack (add to the stack, no overwrite)store registers in mem, and store with pop, i.e. removed from fpu stack after store Note:
|
fild DWORD [sp] fist DWORD [dp] fistp DWORD [dp] |
st(0) = *(sp); *(dp) = st(0); *(dp) = st(0); |
(signed) integer version of load/store operations,
here DWORD=32bit to load an unsigned int, a conversion to signed int (with higher precision, e.g. 64bit) is necessary |
fldz fld1 fldpi ... |
st(0) = 0.0; st(0) = 1.0; st(0) = M_PI; ... |
loads special constants into register stack, all 80-bit accurate |
fxch st(2) |
swap(st(0),st(2)); |
swaps values in registers |
fadd st(2) fadd QWORD [sp] fadd st(4), st
|
st(0) += st(2); st(0) += *(sp); st(4) += st(0);
|
floating point add |
fiadd DWORD [sp] |
st(0) += *(sp); |
add a signed integer from memory to stack register (80-bit float) |
fsub st(1) fsubr st(1) fmul st(1) fdiv st(1) fdivr st(1)
|
st(0) -= st(1); st(0) = st(1)-st(0); st(0) *= st(1); st(0) /= st(1); st(0) = st(1)/st(0);
|
other basic operations, there are many more versions, including integer versions, pop versions, and two-operant versions |
fsqrt |
st(0)=sqrt(st(0)) |
square root |
f2xm1 fyl2xp1 fyl2x
|
st(0) = 2^st(0) - 1; st(0) = log_2(st(0)+1)*st(1); st(0) = log_2(st(0))*st(1);
|
only works for small values -1 < st(0) < 1 ,
used for exp() an additional pop operation is performed an additional pop operation is performed, used for log()
|
fsin fcos fptan fpatan |
st(0) = sin(st(0)); st(0) = cos(st(0)); st(0) = tan(st(0)); st(0) = atan(st(1)/st(0)); |
only available since the 387 |
fcom st(1) |
compares st(0) with st(1) and stores the
result in the fpu status register |
|
fxam |
examines st(0)
(unsupported, nan, normal, inf, zero, empty, denormal) and stores
result in the fpu flag register |
|
fldcw [sp] fstcw [dp]
|
loads and stores the control register | |
fstsw [dp] |
stores the status register in main memory at location
dp |
The gcc compiler emits x87-fpu instruction when compiling x86-32 bit code and sse-fpu instructions when compiling x86-64 bit code. This behaviour can be changed using different compiler flags:
$ gcc -m32 # x86-32 bit mode (x87-fpu is default) $ gcc -mfpmath=387 # use x87-fpu instructions (sse default on x86-64) $ gcc -ffast-math # emits more x87-fpu instructions, because # -funsafe-math-optimizations is included $ gcc -mpc32 / -mpc64 / -mpc80 # set internal x87-fpu precision to 32/64/80 bit $ gcc -mfpmath=387 -msoft-float # uses software emulation of fpu instructions
C function | gcc -S -m32 |
gcc -S -m32 -funsafe-math-optimizations |
---|---|---|
sqrt() |
fsqrt |
fsqrt |
exp() |
libmath.so |
using f2xm1 , fldl2e , frndint ,
fscale , ... |
log() |
libmath.so |
using fyl2x , fldln2 |
pow() |
libmath.so |
libmath.so |
sin() , cos() |
libmath.so |
fsin , fcos |
tan() |
libmath.so |
fptan , fld1 |
atan() |
libmath.so |
fpatan |
acos() |
libmath.so |
using fpatan , fsqrt , fmul ,
fsubr
|
cosh() |
libmath.so |
libmath.so |
acosh() |
libmath.so |
libmath.so |
erf() |
libmath.so |
libmath.so |
- speed optimised code with e.g.
gcc -O2
is also generally more accurate (if x87-fpu instructions are used), this is because values will be kept in the 80-bit accurate x87-fpu register as long as possible, whereas without optimisation values will be written back to memory (and converted to 64-bitdouble
or 32-bitfloat
) after each calculation; this is true even without-funsafe-math-optimizations
- to make sure results are always exactly the same either use sse-fpu
instructions (
gcc -mfpmath=sse
) or reduce accuracy of the x87-fpu usinggcc -mpc64
(might slightly speed upfdiv
andfsqrt
as well) - no direct copy function for registers like
st(2)=st(4)
, onlyfld st(4)
available which adds a new value to the stack and copiesst(4)
into it
- x87
- x87 instruction set
- x87 fpu reference by Raymond Filiatreault, 2003
- instruction timings, 2002?
MMX fpu instruction set
The MMX instruction set was introduced by Intel in 1996 on top of the Pentium processor to facilitate SIMD (single instruction, multiple data) instructions. It introduces newMM?
64-bit wide registers which can be packed with
multiples of lower-bit integer data types and instruction are then
executed on all of them at once.
registers | ||
---|---|---|
general purpose | 64-bit | MM0, MM1, ..., MM7 : using physical registers of the x87 fpu,
e.g. SP(?)=[1,...,1,MM0]
|
SSE fpu instruction set
The SSE instruction set is a further development of MMX to improve SIMD capabilities. It introduces dedicated 128-bit wide registers and extends operations to floating point numbers (onlyfloat
's with the initial SSE version).
The most important version is SSE2 as it implements
working with double
's and int
's, so it makes
MMX redundant and can also replace x87 operations.
registers | ||
---|---|---|
general purpose | 128-bit | XMM0, XMM1, ..., XMM7 : dedicated registersXMM8, XMM9, ..., XMM15 : only on x86-64 bit systems
|
status register | 32-bit | MXCSR : control/status register |
- introduced with Pentium III in 1999
XMM?
can contain up to four 32-bit single-precision floats, no 64-bit double and no integer capabilities- Pentium III shares execution resources between sse and the fpu
- no time penalty when switching between sse and fpu instructions
- introduced with Pentium 4 in 2001
XMM?
can contain:- 2 64-bit single-precision floats, 4 32-bit double-precision floats
- 4
int32_t
, 8int16_t
, 16int8_t
- makes MMX instruction set redundant
- can fully replace x87 instruction set
- data in memory needs to be aligned on 16-byte boundaries, otherwise it will incur performance penalties
- introduced with the Prescott revision of the Pentium 4 in 2004
- adds 13 new instructions, e.g.
HADDPD
(horizontal-add-packed-double), input:(x0,x1)
,(y0,y1)
, output:(x0+x1,y0+y1)
LDDQU
, a misaligned integer vector load operation
- introduced with the Core 2 in 2006
- adds 16 new instructions
double
's, for example)
as well as scalar operations (only part of the register is filled with
one double
for example), and so can be used as a general
purpose floating point unit. Most names have a suffix of the form
[p/s][d/s]
, standing for packed/scalar and double/single
precision, e.g.:
addpd | add packed double |
addps | add packed single |
addsd | add scalar double |
addss | add scalar single |
instruction | pseudo code | C++ intrinsics | description |
---|---|---|---|
movapd xmm0, xmm1 movupd xmm0, xmm1
|
xmm0=xmm1 xmm0=xmm1 |
__m128d _mm_load_pd(double* p) |
aligned packed double move unaligned packed double move |
addpd xmm1, xmm2 subpd xmm1, xmm2 mulpd xmm1, xmm2 divpd xmm1, xmm2 sqrtpd xmm1 |
xmm1 += xmm2 xmm1 -= xmm2 xmm1 *= xmm2 xmm1 /= xmm2 xmm1 = sqrt(xmm1) |
|
packed double arithmetic operations |
cmppd xmm1, xmm2 |
|
packed double compare, result stored where? | |
unpcklpd xmm1, xmm2 |
|
pack lower parts of xmm2 into upper part of xmm1 | |
ldmxcsr DWORD [rsi] stmxcsr DWORD [rdi] |
mxcsr=*rsi *rdi=mxcsr |
unsigned int _mm_getcsr(); void _mm_setcsr(unsigned int i);
|
access the status register by moving it to/from a general 32-bit memory location |
XMMWORD : 16 byte (128 bit) |
type specifier for memory locations |
- SSE instruction listings
- SSE2 instruction set by Christopher Wright
- MSDN docs of SSE compiler intrinsics
- instruction timing
AVX fpu instruction set
The advanced vector extensions (AVX) instruction set is a successor of SSE and is also backwards compatible. It increases the bit-width of the SSE XMM0-XMM15 registers from 128-bit to 256-bit, now called YMM0-YMM15. The XMM registers are still available but share the physical space of the lower 128-bits of the YMM registers, i.e.YMM0=[... XMM0]
.
registers | ||
---|---|---|
general purpose | 256-bit | YMM0, YMM1, ..., YMM15 |
instruction | pseudo code | C++ intrinsics | description |
---|---|---|---|
vmovapd ymm0, ymm1 vmovapd ymm0, YMMWORD PTR [rsp] vmovapd xmm0, xmm1
|
ymm0=ymm1 ymm0=*rsp xmm0=xmm1
|
__m256d _mm256_set_pd() __m256d _mm256_load_pd() |
aligned packed double move |
vaddpd ymm1, ymm2, ymm3 vsubpd ymm1, xmm2, ymm3 vmulpd ymm1, ymm2, ymm3 vdivpd ymm1, ymm2, ymm3 vsqrtpd ymm1, ymm2 |
ymm1=ymm2+ymm3 ymm1=ymm2-ymm3 ymm1=ymm2*ymm3 ymm1=ymm2/ymm3 ymm1=sqrt(ymm2) |
|
packed double arithmetic operations |
YMMWORD : 32 byte (256 bit) |
type specifier for memory locations |
CPU architecture
external bus | cache | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
cpu | year | instruction | data | address | clock speed | L1 | L2 | circuit size | transistors | power | notes |
Intel 4004 | 1971 | 4-bit | 4-bit | 92.5 kHz | 10 µm | 2300 | 0.5W | ||||
Intel 8080 | 1974 | 8-bit | 8-bit | 16-bit | 2 MHz | 6 µm | 6000 | 0.8W | |||
Zilog Z80 | 1976 | 8-bit | 8-bit | 16-bit | 4 MHz | 8500 | 0.8W | ||||
Intel 8086 | 1978 | x86-16 | 16-bit | 20-bit | 5-10 MHz | 3 µm | 29k | 1.7-1.8W | |||
Intel 80286 | 1982 | x86-16 | 16-bit | 24-bit | 6-25 MHz | 1.5 µm | 134k | 3.3W | |||
Intel 80386 | 1985 | x86-32 | 32-bit | 32-bit | 12-40 MHz | 1-1.5 µm | 275k | 1.3-2.0W | SX: 16-bit data bus, 24-bit address bus | ||
Intel 80486 | 1989 | x86-32, x87 | 32-bit | 32-bit | 16-100 MHz | 8KB | 0.6-1.0 µm | 1.2m | 3-4W | SX has no x87 fpu | |
Pentium | 1993 | x86-32, x87 | 64-bit | 32-bit | 60-200 MHz | 8+8KB | 0.35-0.8 µm | 3.1m | 4-15W | ||
Pentium MMX | 1996 | x86-32, x87 | 64-bit | 32-bit | 166-266 MHz | 16+16KB | 0.35 µm | 4.5m | 13-17W | ||
Pentium II | 1997 | x86-32, x87 | 64-bit | 36-bit | 233-450 MHz | 16+16KB | 512KB | 0.25-0.35 µm | 7.5m | 17-43W | |
Pentium III | 1999 | x86-32, x87, sse | 64-bit | 36-bit | 400-1400 MHz | 16+16KB | 512KB | 0.13-0.25 µm | 9.5-28.1m | 16-42W | |
Pentium 4 | 2000 | x86-32, x87, sse2 | 64-bit | 36-bit | 1.3-3.8 GHz | 16+16KB | 512KB | 65-180 nm | 42-55m | 50-115W | later versions contained x86-64 |
Core 2 | 2006 | x86-64, x87, ssse3 | 1.8-3.5 GHz | 2-6MB | 45-65 nm | 65-130W | duo:
Conroe
65nm,
Allendale
65nm,
Wolfdale
45nm quad: Kentsfield 65nm, Yorkfield 45nm |
||||
Core i5 / i7 Nehalem | 2008 | x86-64, x87, sse4 | 1.73-3.46 GHz | 32+32KB | 256KB | 32-45 nm | 82-130W | dual:
Clarkdale quad: Lynnfield, Bloomfield |
|||
Core i5 / i7 Sandy Bridge | 2011 | x86-64, x87, sse4, avx | 1.73-3.46 GHz | 32+32KB | 256KB | 32-45 nm | |||||
Haswell | 2013 | x86-64, x87, sse4, avx2 | 14-22 nm | ||||||||
Skylake | 10-14 nm |
fpu | year | data | address | clock speed | L1 cache | circuit size | transistors | power |
---|---|---|---|---|---|---|---|---|
Intel 8087 | 1980 | 16-bit | 16-bit | 5-10 MHz | 3 µm | 45k | 2.4 W | |
Intel 80287 | 1983 | 16-bit | 16-bit | 6-20 MHz | ||||
Intel 80387 | 1986 | 32-bit | 32-bit | 16-40 MHz | 8KB | 1.5 µm | 120k |
- Stanford University cpu db
- cpu-info.com
- cpu-collection.de
- list of Intel cpu microarchitectures
- list of cpu power dissipation
- saved Intel data sheets
CPU architecture drawings
Intel 4004 | Intel 8080 | Intel 8086 | Intel 80286 | Intel 80386 |
Intel 8087 | Intel 80287 | Intel 80387 | ||
Intel 80486 | Pentium | Pentium MMX | Pentium 2 | Pentium 3 |
Pentium 4 | Core | Nehalem | Sandy Bridge | |