notes on assembly

A few notes on assembly

19 Nov 2011

Quick navigation within this page

instruction sets: [ x86-16 | x86-32 | x86-64 | x87 | sse | avx ]
cpu history: [cpu's | fpu's | architecture ]

Introduction

Assembly language is a very low level programming language. Many higher programming language compilers translate their code into assembly and then into machine code. Although almost no-one develops software exclusively in assembly it is still very useful to be able to understand the language. For example with the gnu c compiler gcc one can inspect the assembly code using the -S flag and the result will be stored in a file with the extention .s.

$ gcc -Wall -O2 -S -masm=intel prog.c	# generates assembly prog.s
$ gcc prog.s -o prog			# produces executable

For the windows compiler the option is /FA, i.e. cl /O2 /FA prog.cpp writes the assembly code into prog.asm and generates the executable. By default gcc generates assembly code using the AT&T syntax whereas cl generates Intel syntax. Since the Intel syntax is described more widely and also somewhat easier to work with we use the flag -masm=intel to get gcc producing Intel syntax as well.

The main concepts in assembly are registers and instructions, where registers are very small (e.g. 64 bit on a 64-bit cpu, sse registers are 128 bit) but very fast memory locations integrated into the cpu. Each register normally only stores one variable. Let's have a look at a simple (and rather pointless) loop in c++:

$ cat loop.cpp
double loop(double a, int n){
   double sum=0.0;
   for(int i=1; i<n; i++) {
      sum+=a;
   }
   return sum;
}

This loop translates into the assembly code below. We use sse2 floating point instructions (default on 64-bit systems), where on older systems (32-bit) x87 instructions would be generated by default (even if sse2 was available).

$ g++ -Wall -O2 -msse2 -mfpmath=sse -S -masm=intel loop.cpp
$ cat loop.s
...
.L4:
	add	eax, 1		# i+=1;
	addsd	xmm0, xmm1	# sum+=a;
	cmp	eax, edi	# compare i, n
	jne	.L4		# conditional jump to .L4
...

A non-optimised version would not necessary load all variables into registers and so require lots of memory access which is slow:

$ g++ -Wall -msse2 -mfpmath=sse -S -masm=intel loop.cpp
$ cat loop.s
...
.L3:
	movsd	xmm0, QWORD PTR [rbp-16]
	addsd	xmm0, QWORD PTR [rbp-24]
	movsd	QWORD PTR [rbp-16], xmm0
	add	DWORD PTR [rbp-4], 1
.L2:
	mov	eax, DWORD PTR [rbp-4]
	cmp	eax, DWORD PTR [rbp-28]
	setl	al
	test	al, al
	jne	.L3
...

For larger programmes it will be harder to find the assembler code corresponding to particular parts of the C++ source, so it can be a good idea to add markers to the assembly, for example like this:

$ cat test.cpp
...
#define ASM_COMMENT(X)  asm volatile ("#" X)
int main() {
...
   ASM_COMMENT("start loop");
   for(int i=0; i<n; i++) {
...

References:

8080 8-bit instruction set

registers
general purpose	8-bit	`A, B, C, D, E, H, L`
general purpose	16-bit	`BC, DE, HL`: limited 16-bit operations
program registers	16-bit	`SP`: stack pointer, points to stack memory `PC`: program counter, memory address of next instruction

8086 16-bit instruction set (x86-16)

The term x86 refers to the instruction set introduced by the Intel 8086 and is used up to the Intel 80286 cpu. For backwards compatibility reasons even most modern Intel and AMD processors contain that 16-bit instruction set as a subset.

General purpose registers as well as index registers (pointing to memory locations) are 16-bit wide. As a consequence only 2¹⁶ = 64KB of memory can be directly addressed. However, the 8086 and 80286 have a 20-bit and 24-bit external address bus and can address 2²⁰ = 1MB and 2²⁴ = 16MB of main memory, respectively. This is achieved using a cumbersome mechanism of memory segmentation and is also the reason why segment registers are needed.

registers
general purpose	16-bit	`AX`: accumulator `BX`: base index, e.g. for arrays `CX`: counter `DX`: data, general these are just a recommendations, registers can be used for anything
general purpose	8-bit	`AX=(AH, AL), BX=(BH, BL), ...`
index registers	16-bit	`SP`: stack pointer, contains main memory address of the top of the stack `BP`: base pointer, used to point to some other place in stack, typically above the local variables `SI, DI`: source, destination index, used to point to arrays (e.g. strings)
segment register	16-bit	`CS`: code segment `DS`: data segment `ES`: extra segment `SS`: stack segment
status register	16-bit	one register which is implicitly set and read by instructions
instruction pointer	16-bit	`IP`: points to the next instruction in main memory

instruction	pseudo code	description
`mov ax, bx` `mov ax, 12` `mov ax, [sp]`	`ax = bx;` `ax = 12;` `ax = *sp;`	copies contents
`lea ax, [sp]` `lea ax, [sp+si]`	`ax = &(sp) = sp;` `ax = &((sp+si))=sp+si;`	load effective address, returns the address of memory location, note, `lea ax, [dx+8]` is short for `mov ax, dx` and `add ax, 8`
`add ax, bx`	`ax += bx;`	integer addition
`sub ax, bx`	`ax -= bx;`	integer subtraction
`inc ax` ... `dec ax`	`ax++;` `ax--;`	short for `add x, 1` and `sub x, 1`
`imul ax, bx`	`ax *= bx;`	signed int multiply
`idiv ax, bx`	`ax /= bx;`	signed int divide
`mul`		unsigned int multiply
`div bx`	`ax=[dx,ax]/bx, dx=[dx,ax]%bx`	unsigned int divide, `ax` contains the quotient and `dx` the raminder
`and ax, bx`	`ax = ax & bx;`	bitwise logical and
`or ax, bx`	`ax = ax \| bx;`	bitwise logical or
`xor ax, bx`	`ax = ax ^ bx;`	bitwise logical xor, e.g. `xor ax, ax` means `ax = 0`
`not ax`	`ax = ~ax;`	bitwise logical not
`sar ax, 3`	`ax = ax >> 3;`	signed int bitwise shift right
`sal ax, 3`	`ax = ax << 3;`	signed int bitwise shift left
`shl`, `shr`		unsigned int version of shift operators
`cmp ax, bx`		tests ax and bx of equality, internally using `sub` without overwriting `ax` but only setting the status/flag register
`jmp .label`		jump to location `.label` in assembly code
`je .label` `jg .label` `jl .label` `jge .label` `jle .label` `jne .label`		conditional jump to `.label` based on the values in the status/flag register; normally follows the `cmp` instruction, and the jump is executed if the comparison resulted in equal (`e`), greater (`g`), less (`l`), not equal (`ne`), or combinations of it, respectively
`pop ax`	ax = stack.rmtop();	moves the top value of the memory stack into ax
`push ax`	`stack.add() = ax;`	saves the value of `ax` on top of the memory stack
`test`
`BYTE` : 1 byte (8 bit) `WORD` : 2 byte (16 bit) `DWORD` : 4 byte (32 bit) `QWORD` : 8 byte (64 bit) `TBYTE` : 10 byte (80 bit)		these are type specifiers; since for example `[si]` only points to the beginning of a memory address and in the absence of type specifiers we don't know the bit-length of `[si]`; if registers are involved we implicitly know the bit-length: `mov ax, [si]` : use 16-bit `mov al, [si]` : use 8-bit if registers are not involved we need to specify the data type: `mov WORD [si], 0` : use 16-bit

References:

80386 32-bit instruction set (x86-32)

Also known as IA-32 and i386, the x86-32 extends all registers to 32-bit and adds a few new instructions but keeps all the old x86-16 instructions.

registers
general purpose	32-bit	`EAX, EBX, ECX, EDX`
index registers	32-bit	`ESP`: stack pointer, contains main memory address of the top of the stack `EBP`: base pointer, used to point to some other place in stack, typically above the local variables `ESI, EDI`: source, destination index, used to point to arrays
status register	32-bit	one register which is implicitly set and read by instructions
instruction pointer	32-bit	`EIP`: points to the next instruction in main memory

References:

X86 instruction listings
Intel 80386 reference programmer's manual hosted at mit.edu, logix.cz

AMD64 64-bit instruction set (x86-64)

The x86-64 has been developed by AMD and is also known as AMD64 and Intel 64. Note, the Intel Itanium (IA-64) has a different 64-bit instruction set.

registers
general purpose	64-bit	`RAX, RBX, RCX, RDX` `R8, R9, ..., R15`
index registers	64-bit	`RSP`: stack pointer, contains main memory address of the top of the stack `RBP`: base pointer, used to point to some other place in stack, typically above the local variables `RSI, RDI`: source, destination index, used to point to arrays
status register	64-bit	`RFLAGS` one register which is implicitly set and read by instructions
instruction pointer	64-bit	`RIP`: points to the next instruction in main memory

References:

x87 fpu instruction set

registers
general purpose	80-bit	`ST(0), ST(1), ..., ST(7)`: arranged in a direct access stack structure, `ST(0)` is the top of the register stack
control/status	16-bit	control register: for rounding/precision control and exception handling, access via `fldcw` and `fstcw` status register: status of last fpu operation, incl. top of stack pointer, fp exceptions, read via `fstsw` tag register: status of each register (valid, zero, special, empty)
program	48-bit	instruction pointer: data pointer:

instruction	pseudo code	description
`fld QWORD [sp]` `fld st(2)` `fst QWORD [dp]` `fstp QWORD [dp]`	`st(0) = (sp);` `st(0) = st(2);` `(dp) = st(0);` `*(dp) = st(0);`	loads floats (here `QWORD=64-bit`) values from memory to the fpu register stack (add to the stack, no overwrite) store registers in mem, and store with pop, i.e. removed from fpu stack after store Note: conversion with rounding to and from internal 80-bit format are performed no direct x86 to x87 register transfer
`fild DWORD [sp]` `fist DWORD [dp]` `fistp DWORD [dp]`	`st(0) = (sp);` `(dp) = st(0);` `*(dp) = st(0);`	(signed) integer version of load/store operations, here `DWORD=32bit` to load an unsigned int, a conversion to signed int (with higher precision, e.g. 64bit) is necessary
`fldz` `fld1` `fldpi` ...	`st(0) = 0.0;` `st(0) = 1.0;` `st(0) = M_PI;` ...	loads special constants into register stack, all 80-bit accurate
`fxch st(2)`	`swap(st(0),st(2));`	swaps values in registers
`fadd st(2)` `fadd QWORD [sp]` `fadd st(4), st`	`st(0) += st(2);` `st(0) += *(sp);` `st(4) += st(0);`	floating point add
`fiadd DWORD [sp]`	`st(0) += *(sp);`	add a signed integer from memory to stack register (80-bit float)
`fsub st(1)` `fsubr st(1)` `fmul st(1)` `fdiv st(1)` `fdivr st(1)`	`st(0) -= st(1);` `st(0) = st(1)-st(0);` `st(0) *= st(1);` `st(0) /= st(1);` `st(0) = st(1)/st(0);`	other basic operations, there are many more versions, including integer versions, pop versions, and two-operant versions
`fsqrt`	`st(0)=sqrt(st(0))`	square root
`f2xm1` `fyl2xp1` `fyl2x`	`st(0) = 2^st(0) - 1;` `st(0) = log_2(st(0)+1)st(1);` `st(0) = log_2(st(0))st(1);`	only works for small values `-1 < st(0) < 1`, used for `exp()` an additional pop operation is performed an additional pop operation is performed, used for `log()`
`fsin` `fcos` `fptan` `fpatan`	`st(0) = sin(st(0));` `st(0) = cos(st(0));` `st(0) = tan(st(0));` `st(0) = atan(st(1)/st(0));`	only available since the 387
`fcom st(1)`		compares `st(0)` with `st(1)` and stores the result in the fpu status register
`fxam`		examines `st(0)` (unsupported, nan, normal, inf, zero, empty, denormal) and stores result in the fpu flag register
`fldcw [sp]` `fstcw [dp]`		loads and stores the control register
`fstsw [dp]`		stores the status register in main memory at location `dp`

Usage by gcc:
The gcc compiler emits x87-fpu instruction when compiling x86-32 bit code and sse-fpu instructions when compiling x86-64 bit code. This behaviour can be changed using different compiler flags:

$ gcc -m32			# x86-32 bit mode (x87-fpu is default)
$ gcc -mfpmath=387		# use x87-fpu instructions (sse default on x86-64)
$ gcc -ffast-math		# emits more x87-fpu instructions, because
				# -funsafe-math-optimizations is included
$ gcc -mpc32 / -mpc64 / -mpc80	# set internal x87-fpu precision to 32/64/80 bit
$ gcc -mfpmath=387 -msoft-float	# uses software emulation of fpu instructions

C function	`gcc -S -m32`	`gcc -S -m32 -funsafe-math-optimizations`
`sqrt()`	`fsqrt`	`fsqrt`
`exp()`	`libmath.so`	using `f2xm1`, `fldl2e`, `frndint`, `fscale`, ...
`log()`	`libmath.so`	using `fyl2x`, `fldln2`
`pow()`	`libmath.so`	`libmath.so`
`sin()`, `cos()`	`libmath.so`	`fsin`, `fcos`
`tan()`	`libmath.so`	`fptan`, `fld1`
`atan()`	`libmath.so`	`fpatan`
`acos()`	`libmath.so`	using `fpatan`, `fsqrt`, `fmul`, `fsubr`
`cosh()`	`libmath.so`	`libmath.so`
`acosh()`	`libmath.so`	`libmath.so`
`erf()`	`libmath.so`	`libmath.so`

Notes:

speed optimised code with e.g. gcc -O2 is also generally more accurate (if x87-fpu instructions are used), this is because values will be kept in the 80-bit accurate x87-fpu register as long as possible, whereas without optimisation values will be written back to memory (and converted to 64-bit double or 32-bit float) after each calculation; this is true even without -funsafe-math-optimizations
to make sure results are always exactly the same either use sse-fpu instructions (gcc -mfpmath=sse) or reduce accuracy of the x87-fpu using gcc -mpc64 (might slightly speed up fdiv and fsqrt as well)
no direct copy function for registers like st(2)=st(4), only fld st(4) available which adds a new value to the stack and copies st(4) into it

References:

x87
x87 instruction set
x87 fpu reference by Raymond Filiatreault, 2003
instruction timings, 2002?

MMX fpu instruction set

The MMX instruction set was introduced by Intel in 1996 on top of the Pentium processor to facilitate SIMD (single instruction, multiple data) instructions. It introduces new MM? 64-bit wide registers which can be packed with multiples of lower-bit integer data types and instruction are then executed on all of them at once.

registers
general purpose	64-bit	`MM0, MM1, ..., MM7`: using physical registers of the x87 fpu, e.g. `SP(?)=[1,...,1,MM0]`

One register could either contain a 64-bit integer, two 32-bit integer, or four 16-bit integer and operations execute at the same speed independently of the contents, and therefore a speedup for 32-bit and lower integer operations can be expected if MMX registers are used. There are big performance penalties for switching between x87 fpu and mmx operations.

SSE fpu instruction set

The SSE instruction set is a further development of MMX to improve SIMD capabilities. It introduces dedicated 128-bit wide registers and extends operations to floating point numbers (only float's with the initial SSE version). The most important version is SSE2 as it implements working with double's and int's, so it makes MMX redundant and can also replace x87 operations.

registers
general purpose	128-bit	`XMM0, XMM1, ..., XMM7`: dedicated registers `XMM8, XMM9, ..., XMM15`: only on x86-64 bit systems
status register	32-bit	`MXCSR`: control/status register

SSE

introduced with Pentium III in 1999
XMM? can contain up to four 32-bit single-precision floats, no 64-bit double and no integer capabilities
Pentium III shares execution resources between sse and the fpu
no time penalty when switching between sse and fpu instructions

SSE2

introduced with Pentium 4 in 2001
XMM? can contain:
- 2 64-bit single-precision floats, 4 32-bit double-precision floats
- 4 int32_t, 8 int16_t, 16 int8_t
makes MMX instruction set redundant
can fully replace x87 instruction set
data in memory needs to be aligned on 16-byte boundaries, otherwise it will incur performance penalties

SSE3

introduced with the Prescott revision of the Pentium 4 in 2004
adds 13 new instructions, e.g.
- HADDPD (horizontal-add-packed-double), input: (x0,x1), (y0,y1), output: (x0+x1,y0+y1)
- LDDQU, a misaligned integer vector load operation

SSSE3

introduced with the Core 2 in 2006
adds 16 new instructions

SSE4.1

introduced with the Penryn revision of the Core 2 in 2007
adds 47 instructions

SSE4.2

introduced with the Nehalem revision of the Core i7 in 2009
adds 7 instructions

The SSE unit can be used for vector operations (packed, i.e. when the register is filled with multiple of double's, for example) as well as scalar operations (only part of the register is filled with one double for example), and so can be used as a general purpose floating point unit. Most names have a suffix of the form [p/s][d/s], standing for packed/scalar and double/single precision, e.g.:

`addpd`	add packed double
`addps`	add packed single
`addsd`	add scalar double
`addss`	add scalar single

The table below only shows packed double precision versions and is also very incomplete.

instruction	pseudo code	C++ intrinsics	description
`movapd xmm0, xmm1` `movupd xmm0, xmm1`	`xmm0=xmm1` `xmm0=xmm1`	`__m128d _mm_load_pd(double* p)`	aligned packed double move unaligned packed double move
`addpd xmm1, xmm2` `subpd xmm1, xmm2` `mulpd xmm1, xmm2` `divpd xmm1, xmm2` `sqrtpd xmm1`	`xmm1 += xmm2` `xmm1 -= xmm2` `xmm1 *= xmm2` `xmm1 /= xmm2` `xmm1 = sqrt(xmm1)`		packed double arithmetic operations
`cmppd xmm1, xmm2`			packed double compare, result stored where?
`unpcklpd xmm1, xmm2`			pack lower parts of xmm2 into upper part of xmm1
`ldmxcsr DWORD [rsi]` `stmxcsr DWORD [rdi]`	`mxcsr=rsi` `rdi=mxcsr`	`unsigned int _mm_getcsr();` `void _mm_setcsr(unsigned int i);`	access the status register by moving it to/from a general 32-bit memory location
`XMMWORD`: 16 byte (128 bit)			type specifier for memory locations

References:

SSE instruction listings
SSE2 instruction set by Christopher Wright
MSDN docs of SSE compiler intrinsics
instruction timing

AVX fpu instruction set

The advanced vector extensions (AVX) instruction set is a successor of SSE and is also backwards compatible. It increases the bit-width of the SSE XMM0-XMM15 registers from 128-bit to 256-bit, now called YMM0-YMM15. The XMM registers are still available but share the physical space of the lower 128-bits of the YMM registers, i.e. YMM0=[... XMM0].

registers
general purpose	256-bit	`YMM0, YMM1, ..., YMM15`

instruction	pseudo code	C++ intrinsics	description
`vmovapd ymm0, ymm1` `vmovapd ymm0, YMMWORD PTR [rsp]` `vmovapd xmm0, xmm1`	`ymm0=ymm1` `ymm0=*rsp` `xmm0=xmm1`	`__m256d _mm256_set_pd()` `__m256d _mm256_load_pd()`	aligned packed double move
`vaddpd ymm1, ymm2, ymm3` `vsubpd ymm1, xmm2, ymm3` `vmulpd ymm1, ymm2, ymm3` `vdivpd ymm1, ymm2, ymm3` `vsqrtpd ymm1, ymm2`	`ymm1=ymm2+ymm3` `ymm1=ymm2-ymm3` `ymm1=ymm2*ymm3` `ymm1=ymm2/ymm3` `ymm1=sqrt(ymm2)`		packed double arithmetic operations
`YMMWORD`: 32 byte (256 bit)			type specifier for memory locations

References:

Intel 64 and IA-32 Architectures Software Developer Manuals

CPU architecture

			external bus			cache
cpu	year	instruction	data	address	clock speed	L1	L2	circuit size	transistors	power	notes
Intel 4004	1971		4-bit	4-bit	92.5 kHz			10 µm	2300	0.5W
Intel 8080	1974	8-bit	8-bit	16-bit	2 MHz			6 µm	6000	0.8W
Zilog Z80	1976	8-bit	8-bit	16-bit	4 MHz				8500	0.8W
Intel 8086	1978	x86-16	16-bit	20-bit	5-10 MHz			3 µm	29k	1.7-1.8W
Intel 80286	1982	x86-16	16-bit	24-bit	6-25 MHz			1.5 µm	134k	3.3W
Intel 80386	1985	x86-32	32-bit	32-bit	12-40 MHz			1-1.5 µm	275k	1.3-2.0W	SX: 16-bit data bus, 24-bit address bus
Intel 80486	1989	x86-32, x87	32-bit	32-bit	16-100 MHz	8KB		0.6-1.0 µm	1.2m	3-4W	SX has no x87 fpu
Pentium	1993	x86-32, x87	64-bit	32-bit	60-200 MHz	8+8KB		0.35-0.8 µm	3.1m	4-15W
Pentium MMX	1996	x86-32, x87	64-bit	32-bit	166-266 MHz	16+16KB		0.35 µm	4.5m	13-17W
Pentium II	1997	x86-32, x87	64-bit	36-bit	233-450 MHz	16+16KB	512KB	0.25-0.35 µm	7.5m	17-43W
Pentium III	1999	x86-32, x87, sse	64-bit	36-bit	400-1400 MHz	16+16KB	512KB	0.13-0.25 µm	9.5-28.1m	16-42W
Pentium 4	2000	x86-32, x87, sse2	64-bit	36-bit	1.3-3.8 GHz	16+16KB	512KB	65-180 nm	42-55m	50-115W	later versions contained x86-64
Core 2	2006	x86-64, x87, ssse3			1.8-3.5 GHz		2-6MB	45-65 nm		65-130W	duo: Conroe 65nm, Allendale 65nm, Wolfdale 45nm quad: Kentsfield 65nm, Yorkfield 45nm
Core i5 / i7 Nehalem	2008	x86-64, x87, sse4			1.73-3.46 GHz	32+32KB	256KB	32-45 nm		82-130W	dual: Clarkdale quad: Lynnfield, Bloomfield
Core i5 / i7 Sandy Bridge	2011	x86-64, x87, sse4, avx			1.73-3.46 GHz	32+32KB	256KB	32-45 nm
Haswell	2013	x86-64, x87, sse4, avx2						14-22 nm
Skylake								10-14 nm

fpu	year	data	address	clock speed	L1 cache	circuit size	transistors	power
Intel 8087	1980	16-bit	16-bit	5-10 MHz		3 µm	45k	2.4 W
Intel 80287	1983	16-bit	16-bit	6-20 MHz
Intel 80387	1986	32-bit	32-bit	16-40 MHz	8KB	1.5 µm	120k

References:

CPU architecture drawings

Intel 4004	Intel 8080	Intel 8086	Intel 80286	Intel 80386

		Intel 8087	Intel 80287	Intel 80387

Intel 80486	Pentium	Pentium MMX	Pentium 2	Pentium 3

Pentium 4	Core	Nehalem	Sandy Bridge